Anyone who is or was a football fan can recall yelling at a "stupid" coach to be aggressive and "go for it" on 4th down. For those unfamiliar with why 4th down matters: a team in possession of the ball has 4 downs (attempts) to advance 10 yards and earn a fresh set of 4 downs. On 4th down a team has three options. It can punt the ball to the defending team, which usually leaves the opponent with poor field position. It can kick a field goal for 3 points if it is close enough to the end zone. Or it can go for it and try to gain the remaining yards needed for another 4 downs. A team that converts earns another 4 downs to gain 10 yards. A team that fails gives up the ball at that spot on the field, which usually hands the opponent favorable field position and can also mean passing up 3 points from a field goal. This process repeats until the team scores a touchdown, punts, turns the ball over, or kicks a field goal.
The age-old question for any coach, player, or fan (who has no impact on the game whatsoever) has always been whether to go for it on 4th down. In this project we analyze every play from every regular season game over the 10 seasons from 2009 to 2018 to help answer that question.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import warnings
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.feature_selection import chi2
import statsmodels.api as sm
from scipy import stats
import tkinter as tk
warnings.filterwarnings('ignore')
#sns.set(rc={'figure.figsize':(11.7,8.27)})
The data used for this study was collected by a group of Carnegie Mellon University statistical researchers including Maksim Horowitz, Ron Yurko, and Sam Ventura. It is publicly available on Kaggle and GitHub. The data contains all regular season plays from the 2009 to 2018 NFL seasons, with over 449,000 rows and 255 columns.
The dataset was trimmed to include only information relevant to this study. Information that was removed includes stats on individual players, tackles, extra point attempts, etc. From the relevant data we created three Dataframes covering the 10 season span. The first includes every play from every game, the second includes only 4th down plays, and the last, which is displayed below, includes only the 4th down plays on which a team attempted to go for it.
#load the csv containing every regular season play from 2009 until 2018
all_data = pd.read_csv('NFL_Data(2009-2018).csv',sep = ',')
#convert game dates to datetime objects and record the calendar year of each game
all_data['game_date'] = pd.to_datetime(all_data['game_date'])
all_data['year'] = pd.DatetimeIndex(all_data['game_date']).year
#create column for successfully converted fourth downs, mapping 1.0/0.0 to True/False
#(plays with no value in fourth_down_converted stay missing)
all_data['succ_4th'] = all_data['fourth_down_converted'].map({0.0: False, 1.0: True})
#create a column for fourth down attempts, counting only legitimate attempts (pass and run plays)
all_data['att_fourth'] = ((all_data.fourth_down_converted == 1.0) | \
(all_data.fourth_down_failed == 1.0)) & \
all_data.play_type.isin(['pass', 'run'])
#dataframe that has all fourth down plays
all_fourth = all_data[all_data.down == 4]
#dataframe containing only the plays on which a fourth down attempt occurred
att_fourth = all_fourth[all_fourth.att_fourth == True]
pd.set_option('display.max_columns', None)
all_fourth = all_fourth.reset_index()
att_fourth = att_fourth.reset_index()
att_fourth
 | index | play_id | game_id | home_team | away_team | posteam | posteam_type | defteam | side_of_field | yardline_100 | game_date | quarter_seconds_remaining | half_seconds_remaining | game_seconds_remaining | game_half | quarter_end | drive | qtr | down | goal_to_go | time | yrdln | ydstogo | ydsnet | play_type | yards_gained | qb_spike | pass_length | field_goal_result | td_team | total_home_score | total_away_score | posteam_score | defteam_score | score_differential | posteam_score_post | defteam_score_post | score_differential_post | no_score_prob | opp_fg_prob | opp_td_prob | fg_prob | td_prob | wp | home_wp | away_wp | wpa | punt_blocked | fourth_down_converted | fourth_down_failed | incomplete_pass | interception | punt_inside_twenty | punt_in_endzone | punt_out_of_bounds | punt_downed | punt_fair_catch | penalty | tackled_for_loss | fumble_lost | rush_attempt | pass_attempt | sack | touchdown | field_goal_attempt | punt_attempt | fumble | complete_pass | return_yards | penalty_team | penalty_yards | penalty_type | year | succ_4th | att_fourth |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 329 | 3611 | 2009091304 | CLE | MIN | CLE | home | MIN | CLE | 64.0 | 2009-09-13 | 292.0 | 292.0 | 292.0 | Half2 | 0 | 22 | 4 | 4.0 | 0.0 | 4:52 | CLE 36 | 10 | 17 | pass | 0.0 | 0 | short | NaN | NaN | 12 | 34 | 12.0 | 34.0 | -22.0 | 12.0 | 34.0 | -22.0 | 0.379171 | 0.152689 | 0.215235 | 0.112044 | 0.134403 | 0.003740 | 0.003740 | 0.996260 | -0.002612 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | 2009 | False | True |
1 | 395 | 1037 | 2009091307 | NO | DET | DET | away | NO | NO | 4.0 | 2009-09-13 | 851.0 | 851.0 | 2651.0 | Half1 | 0 | 8 | 2 | 4.0 | 0.0 | 14:11 | NO 4 | 1 | 13 | run | 4.0 | 0 | NaN | NaN | DET | 14 | 9 | 3.0 | 14.0 | -11.0 | 9.0 | 14.0 | -5.0 | 0.011761 | 0.021944 | 0.035875 | 0.805946 | 0.121512 | 0.208384 | 0.791616 | 0.208384 | 0.122426 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | 2009 | True | True |
2 | 516 | 3909 | 2009091307 | NO | DET | DET | away | NO | DET | 64.0 | 2009-09-13 | 378.0 | 378.0 | 378.0 | Half2 | 0 | 25 | 4 | 4.0 | 0.0 | 6:18 | DET 36 | 1 | 33 | pass | 14.0 | 0 | short | NaN | NaN | 45 | 26 | 26.0 | 45.0 | -19.0 | 26.0 | 45.0 | -19.0 | 0.293383 | 0.153133 | 0.228940 | 0.086751 | 0.231910 | 0.017130 | 0.982870 | 0.017130 | 0.012977 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | NaN | NaN | NaN | 2009 | True | True |
3 | 676 | 3307 | 2009091308 | TB | DAL | TB | home | DAL | DAL | 26.0 | 2009-09-13 | 464.0 | 464.0 | 464.0 | Half2 | 0 | 19 | 4 | 4.0 | 0.0 | 7:44 | DAL 26 | 7 | 43 | pass | 0.0 | 0 | short | NaN | NaN | 14 | 27 | 14.0 | 27.0 | -13.0 | 14.0 | 27.0 | -13.0 | 0.113277 | 0.060039 | 0.089903 | 0.599755 | 0.132362 | 0.095709 | 0.095709 | 0.904291 | -0.030071 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | 2009 | False | True |
4 | 697 | 3774 | 2009091308 | TB | DAL | TB | home | DAL | DAL | 2.0 | 2009-09-13 | 91.0 | 91.0 | 91.0 | Half2 | 0 | 21 | 4 | 4.0 | 1.0 | 1:31 | DAL 2 | 2 | 72 | pass | 2.0 | 0 | short | NaN | TB | 20 | 34 | 14.0 | 34.0 | -20.0 | 20.0 | 34.0 | -14.0 | 0.190697 | 0.012643 | 0.008474 | 0.656673 | 0.128562 | 0.003992 | 0.003992 | 0.996008 | 0.004568 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | NaN | NaN | NaN | 2009 | True | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4714 | 449063 | 1300 | 2018121611 | LA | PHI | PHI | away | LA | LA | 49.0 | 2018-12-16 | 578.0 | 578.0 | 2378.0 | Half1 | 0 | 6 | 2 | 4.0 | 0.0 | 9:38:00 | LA 49 | 1 | 9 | run | 0.0 | 0 | NaN | NaN | NaN | 7 | 6 | 6.0 | 7.0 | -1.0 | 6.0 | 7.0 | -1.0 | 0.137745 | 0.137788 | 0.211307 | 0.220345 | 0.286395 | 0.475908 | 0.524092 | 0.475908 | -0.081295 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | 2018 | False | True |
4715 | 449139 | 3142 | 2018121611 | LA | PHI | LA | home | PHI | LA | 70.0 | 2018-12-16 | 43.0 | 943.0 | 943.0 | Half2 | 0 | 17 | 3 | 4.0 | 0.0 | 0:43:00 | LA 30 | 5 | 5 | pass | 0.0 | 0 | short | NaN | NaN | 13 | 30 | 13.0 | 30.0 | -17.0 | 13.0 | 30.0 | -17.0 | 0.042998 | 0.252950 | 0.368980 | 0.108395 | 0.216498 | 0.041334 | 0.041334 | 0.958666 | -0.017997 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | 2018 | False | True |
4716 | 449226 | 610 | 2018121700 | CAR | NO | CAR | home | NO | MID | 50.0 | 2018-12-17 | 252.0 | 1152.0 | 2952.0 | Half1 | 0 | 3 | 1 | 4.0 | 0.0 | 4:12:00 | MID 50 | 2 | 90 | pass | 50.0 | 0 | short | NaN | CAR | 6 | 0 | 0.0 | 0.0 | 0.0 | 6.0 | 0.0 | 6.0 | 0.015949 | 0.170502 | 0.262477 | 0.264513 | 0.278567 | 0.502523 | 0.502523 | 0.497477 | 0.214786 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | NaN | NaN | NaN | 2018 | True | True |
4717 | 449355 | 3787 | 2018121700 | CAR | NO | NO | away | CAR | CAR | 14.0 | 2018-12-17 | 150.0 | 150.0 | 150.0 | Half2 | 0 | 19 | 4 | 4.0 | 0.0 | 2:30:00 | CAR 14 | 1 | 64 | run | 3.0 | 0 | NaN | NaN | NaN | 7 | 12 | 12.0 | 7.0 | 5.0 | 12.0 | 7.0 | 5.0 | 0.238610 | 0.028770 | 0.045261 | 0.548503 | 0.135899 | 0.906124 | 0.093876 | 0.906124 | 0.051376 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | 2018 | True | True |
4718 | 449368 | 4106 | 2018121700 | CAR | NO | CAR | home | NO | CAR | 61.0 | 2018-12-17 | 38.0 | 38.0 | 38.0 | Half2 | 0 | 20 | 4 | 4.0 | 0.0 | 0:38:00 | CAR 39 | 5 | 19 | pass | 0.0 | 0 | short | NaN | NaN | 7 | 12 | 7.0 | 12.0 | -5.0 | 7.0 | 12.0 | -5.0 | 0.722469 | 0.063287 | 0.044687 | 0.093956 | 0.072837 | 0.032217 | 0.032217 | 0.967783 | -0.005457 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | 2018 | False | True |
4719 rows × 75 columns
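To make the filtering rule concrete, here is a minimal sketch of the same attempt flag applied to a handful of made-up plays. Only the column names come from the dataset; the rows and the `toy` names are hypothetical and exist purely for illustration.

```python
import pandas as pd

# Hypothetical toy plays; only the column names match the real dataset.
toy = pd.DataFrame({
    "down": [4, 4, 4, 4, 3],
    "play_type": ["punt", "run", "pass", "field_goal", "pass"],
    "fourth_down_converted": [0.0, 1.0, 0.0, 0.0, 0.0],
    "fourth_down_failed": [0.0, 0.0, 1.0, 0.0, 0.0],
})

# A play counts as an attempt when it is a run or pass that was
# recorded as either a converted or a failed 4th down.
resolved = (toy["fourth_down_converted"] == 1.0) | (toy["fourth_down_failed"] == 1.0)
scrimmage = toy["play_type"].isin(["run", "pass"])
toy["att_fourth"] = resolved & scrimmage

toy_all_fourth = toy[toy["down"] == 4]                      # every 4th down play
toy_att_fourth = toy_all_fourth[toy_all_fourth["att_fourth"]]  # only go-for-it plays

print(len(toy_all_fourth), "fourth downs,", len(toy_att_fourth), "attempts")
```

Punts and field goals are 4th down plays but never count as attempts, which is why the attempts Dataframe is so much smaller than the 4th down Dataframe.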
Using the Dataframe of all 4th down plays from 2009 to 2018, we computed the share of 4th downs on which a team went for it and, of those attempts, the share that were converted. The results are shown for the entire 10 season span and for each individual season.
The first pie chart, “Percentage of 4th Downs Attempted (2009-2018)”, compares 4th downs on which a team went for it (pass or run) with all other 4th down plays (punts, field goals, etc.). Teams go for it only about 12 percent of the time. Likely factors include the distance to the first down marker (line to gain), the location on the field, and whether the team decided to kick a field goal.
The second pie chart, “Percentage of Successful 4th Down Attempts (2009-2018)”, shows converted 4th downs as a share of attempted 4th downs. A 4th down attempt succeeds about 49 to 50 percent of the time. This rate likely depends on the location on the field and the distance to the first down marker (line to gain).
Both results are displayed below as pie charts.
num_succ = att_fourth.succ_4th.sum()
pct_attempted = len(att_fourth.index) / len(all_fourth.index)
pct_succ = num_succ / len(att_fourth.index)
#pie chart for the share of 4th downs that teams attempt to go for
Tasks = [len(all_fourth.index) - len(att_fourth.index),len(att_fourth.index)]
my_labels = 'Did Not Go For It','4th Down Attempted'
my_colors = ['lightblue','yellow']
my_explode = (0,0.1)
#shadow is an on/off switch; '%%' prints a literal percent sign after each value
plt.pie(Tasks,labels = my_labels,autopct='%.2f%%', colors = my_colors, startangle = 15, shadow = True , explode = my_explode)
plt.title('Percentage of 4th Downs Attempted (2009-2018)',weight = 'bold')
plt.axis('equal')
plt.show()
#out of the 4th down attempts, what percentage is successful
Tasks2 = [len(att_fourth.index) - num_succ,num_succ]
my_labels2 = 'Unsuccessful Conversion','Successful 4th Down Conversion'
plt.pie(Tasks2,labels = my_labels2,autopct='%.2f%%', colors = my_colors,startangle = 15, shadow = True , explode = my_explode)
plt.title('Percentage of Successful 4th Down Attempts (2009-2018)', weight = 'bold')
plt.axis('equal')
plt.show()
The bar graph below, “Percentage of Attempted 4th Downs by Year (2009-2018)”, shows attempted 4th downs as a percentage of all 4th down plays in each season. There is no clear trend over time in how often teams go for it. The 2009 and 2018 seasons stand out with about 14 percent of 4th downs attempted, but most seasons hover around the 11 to 12 percent average we saw earlier in part 2A.
The bar graph below, “Percentage of Successful 4th Down Attempts by Year (2009-2018)”, shows the success rate of 4th down conversions out of attempted 4th downs for each year. Again there is no clear trend between year and the success rate of converting on 4th down. 2018 stands out with a particularly high success rate of about 56 percent, and 2011 with a very low success rate of about 42 percent. For the most part, the year to year success rate stays close to the 49 percent found in part 2A.
#looking at 4th down attempts from 2009-2018
#share of all 4th downs that were attempted, by calendar year of the game
#(only the needed column is averaged, so text columns cannot break mean())
att_fourth_years = all_fourth.groupby('year', as_index = False)['att_fourth'].mean()
#share of attempted 4th downs that were converted, by year
succ_4th_years = att_fourth.groupby('year', as_index = False)['fourth_down_converted'].mean()
#multiply by 100 to express as percentages
att_fourth_years['att_fourth'] = att_fourth_years['att_fourth'] * 100
succ_4th_years['fourth_down_converted'] = succ_4th_years['fourth_down_converted'] * 100
#plot percentages of attempted 4th downs based on year
#to see if 4th down attempts have increased/decreased over time
sns.barplot(x = 'year', y = 'att_fourth', data = att_fourth_years)\
.set(title = "Percentage of Attempted 4th Downs by Year (2009-2018)", ylabel = "Percentage", xlabel = "Years")
plt.show()
#plot percentages of successful 4th downs on 4th down attempts by year
sns.barplot(x = 'year', y = 'fourth_down_converted', data = succ_4th_years)\
.set(title = "Percentage of Successful 4th Down Attempts by Year (2009-2018)", ylabel = "Percentage", xlabel = "Years")
plt.show()
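The "no trend over time" reading can also be checked numerically instead of by eye. The sketch below computes a plain Pearson correlation between year and rate. The `pearson_r` helper is not part of the notebook, and the yearly values are placeholders chosen only so the code runs; they are NOT the measured rates.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equally long sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Placeholder values for illustration only, not the real yearly percentages.
years = [2009, 2010, 2011, 2012]
rates = [12.0, 11.0, 12.0, 11.0]

print(round(pearson_r(years, rates), 3))
```

In the notebook itself the inputs would be the `year` column and the percentage column of the grouped Dataframes. With only ten seasons, a value near zero supports the "no trend" reading, but a moderate value in either direction would not be strong evidence of a trend.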
With the Dataframe restricted to 4th down attempts, we can look at the number of attempts and the success rate of converting on 4th down based on the distance to the first down marker (line to gain).
The graph below, "Count of 4th Down Attempts Based on Yards To Go", shows the number of attempts at each distance to the first down marker (line to gain). 4th and 1 (1 yard away from the line to gain) is the most commonly attempted distance, with over 1,750 attempts. Distances of 1 to 10 yards are the most commonly attempted group, and anything greater than about 20 yards to go has minimal attempts.
#plot to show number of attempted 4th downs based on distance to first down
sorted_df = att_fourth.sort_values(by = ['ydstogo'],ascending = False)
sns.countplot(x = 'ydstogo', data = sorted_df)\
.set(title = "Count of 4th Down Attempts Based on Yards To Go", ylabel = "Count", xlabel = "Yards To Go (Distance to First Down)")
plt.show()
We now know which distances are attempted most often. Next we look at the success rate at these distances to see where a conversion is most likely. Because distances of 1 to 10 yards from the first down marker are the 10 most attempted distances, we look at each individual yard in that range. The line plot below, “Percentage of Successful 4th Down Based on Yards To Go”, shows that the success rate of converting decreases as you move further from the first down marker. This is most likely because a longer distance is simply harder to convert than a shorter one. At short distances a team can also choose to run or pass, which makes the offense multidimensional and harder for the defense to stop. Teams rarely run the ball at longer distances because it is harder to gain large chunks of yards on the ground than through the air. The plot also shows that the success rate on 4th and 1 is about 15 percentage points higher than the average success rate we saw earlier in part 2A.
#shows percentages of successful 4th downs based on distance to first down
#limits distances to 10 yards or fewer
sorted_df = sorted_df[sorted_df.ydstogo <= 10]
#average only the conversion column so text columns cannot break mean()
ydstogo_succ = sorted_df.groupby('ydstogo', as_index = False)['fourth_down_converted'].mean()
ydstogo_succ['fourth_down_converted'] = ydstogo_succ['fourth_down_converted'] * 100
#plots line plot
plot2 = sns.lineplot(x = 'ydstogo', y = 'fourth_down_converted', data = ydstogo_succ, marker = 'o')
plot2.set(title = "Percentage of Successful 4th Down Based on Yards To Go", xlabel = "Yards To Go (Distance to First Down)", ylabel = "Success Rate of Converting (Percentage)")
plt.show()
The graph below shows the same information as the graph in part 3B, but now covers distances of 1 to 20 yards from the first down marker, grouped into 5 yard bins. The result matches Part 3B: the further a team is from the first down marker, the lower the success rate of converting on 4th down.
#plots conversion rate for distances of 20 yards or fewer, grouped by 5 yards
#sort and keep only attempts with 20 or fewer yards to go
sorted_df20 = att_fourth.sort_values(by = ['ydstogo'])
sorted_df20 = sorted_df20[sorted_df20.ydstogo <= 20].copy()
sorted_df20['fourth_down_converted'] = sorted_df20['fourth_down_converted'] * 100
#bin by 5 yards
sorted_df20['ydstogo'] = pd.cut(sorted_df20['ydstogo'], [1,5,10,15,20],\
labels = ['1-5', '6-10','11-15','16-20'], include_lowest= True)
#average only the conversion column within each bin
ydstogo_succ20 = sorted_df20.groupby('ydstogo', as_index = False, observed = False)['fourth_down_converted'].mean()
plot = sns.barplot(x = 'ydstogo', y = 'fourth_down_converted', data = ydstogo_succ20)
plot.set(title = "Percentage of Successful 4th Down Based on Yards To Go", ylabel = "Success Rate of Converting (Percentage)", xlabel = "Yards To Go (Distance to First Down)")
plt.show()
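The binning step relies on how `pd.cut` treats the edges of each bin. As a small illustration on made-up yardages (not rows from the dataset), the sketch below shows that bins are closed on the right, and that a value equal to the lowest edge is dropped unless `include_lowest=True` is passed.

```python
import pandas as pd

# Made-up distances chosen to sit on the bin edges.
yards = pd.Series([1, 5, 6, 10, 11, 20])
edges = [1, 5, 10, 15, 20]
labels = ["1-5", "6-10", "11-15", "16-20"]

# Intervals are (1, 5], (5, 10], (10, 15], (15, 20] by default.
with_lowest = pd.cut(yards, edges, labels=labels, include_lowest=True)
without_lowest = pd.cut(yards, edges, labels=labels)

print(with_lowest.tolist())     # the value 1 lands in the "1-5" bin
print(without_lowest.tolist())  # the value 1 falls outside every bin
```

This matters here because 1 yard to go is the single most common attempt distance, so omitting `include_lowest=True` would silently discard the largest group.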
Here we take a deeper look at field position and how it affects 4th down attempts and the conversion rate. For this part we use the ‘yardline_100’ variable, which gives the distance in yards between the team with the ball and the end zone it is trying to score in. A team in possession at the 1 yard line is as close to scoring as it can be. As the variable increases, the team in possession is farther and farther from the end zone it is trying to reach.
The graphs below show the number of 4th down attempts in each 5 yard bin of field position and the number of converted 4th downs in that bin. A few numbers are worth pointing out. The most attempts over our 10 years of data come from yardlines 31 to 45. This makes sense because at this position a field goal (worth 3 points) is not automatic. Past the 70 yardline the number of attempts drops drastically, most likely because a failed attempt hands the ball to the opposing team at that yardline, leaving them very close to scoring as soon as they take possession. Most attempts beyond this point come when a team is in a "score or lose" situation. The number of attempts in the 1 to 5 yardline bin is quite peculiar, as the field goal probability at that point is very high. In that situation, is it better to settle for the 3 points, or to risk going for it on 4th down in the hope of scoring a touchdown (6 points, usually 7 with the extra point)? Hopefully the conversion rates will shed more light on this.
#analyze field position
yard_line_att = att_fourth.copy()
#5 yard bins of distance to the endzone; include_lowest keeps plays at the 1 yardline in the first bin
yard_line_att['yardln_bin'] = pd.cut(x=yard_line_att['yardline_100'], \
bins=[1,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100], \
labels=['1-5', '6-10', '11-15', '16-20', '21-25', '26-30', '31-35', '36-40',\
'41-45', '46-50', '51-55', '56-60','61-65', '66-70', '71-75', '76-80', \
'81-85', '86-90', '91-95', '96-100'], include_lowest=True)
yard_line_succ = yard_line_att[yard_line_att.succ_4th == True]
yard_line_succ = yard_line_succ.reset_index(drop=True)
fig, ax =plt.subplots(1,2)
#left panel: converted 4th downs, right panel: attempted 4th downs
yard_line_plot2 = sns.countplot(x='yardln_bin', data=yard_line_succ, ax=ax[0])
yard_line_plot2.tick_params(axis='x', labelrotation=45)
yard_line_plot1 = sns.countplot(x='yardln_bin', data=yard_line_att, ax=ax[1])
yard_line_plot1.tick_params(axis='x', labelrotation=45)
#set the y-axis range of the conversions panel
ax[0].set(ylim=(0, 550))
#the bins are on the x-axis and the counts on the y-axis
yard_line_plot1.set(title="Attempted 4th Downs v Field Position", \
xlabel="Yardline on Football Field", \
ylabel="Number of Attempted 4th Downs")
yard_line_plot2.set(title="4th Downs Conversions v Field Position", \
xlabel="Yardline on Football Field", \
ylabel="Number of Converted 4th Downs")
plt.show()
The next graph gives a better picture of the conversion percentages for the bins in the graphs above, and some of the numbers are surprising. Although most attempts come between the 31 and 45 yardlines, the highest conversion rate is in the 61 to 65 bin. The next two highest conversion rates are in the 26 to 30 and 51 to 55 bins. This raises the question of why there are so many attempts in the 1 to 5 and 36 to 40 bins, where the conversion rate hovers right around 50%. Past the 75 yardline the conversion rate gets much lower, but the previous graphs show fewer than 100 attempts in each of those bins, so those rates rest on little data and are less reliable.
#average only the conversion column within each yardline bin
yardln_group = yard_line_att.groupby('yardln_bin', as_index = False, observed = False)['fourth_down_converted'].mean()
yardln_group['fourth_down_converted'] = yardln_group['fourth_down_converted'] * 100
sns.lineplot(x = 'yardln_bin', y = 'fourth_down_converted', data = yardln_group,marker = 'o')\
.set(title = "Percentage of Successful 4th Down Based on Distance to Endzone (2009-2018)", ylabel = "Success Rate of Converting (Percentage)", xlabel = "Distance to Endzone (Yards)")
plt.xticks(rotation=45)
plt.show()
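The caveat about bins with few attempts can be made concrete with a confidence interval for each conversion rate. The sketch below uses the Wilson score interval, which is a swapped-in standard technique and not something the notebook computes. The counts are hypothetical, chosen only to contrast a small bin with a large one.

```python
from math import sqrt

def wilson_interval(successes, attempts, z=1.96):
    """Approximate 95% Wilson score interval for a success proportion."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    p = successes / attempts
    denom = 1 + z**2 / attempts
    center = (p + z**2 / (2 * attempts)) / denom
    half = z * sqrt(p * (1 - p) / attempts + z**2 / (4 * attempts**2)) / denom
    return center - half, center + half

# Hypothetical counts: both bins convert 50% of the time.
low_small, high_small = wilson_interval(20, 40)    # a sparsely attempted bin
low_big, high_big = wilson_interval(250, 500)      # a heavily attempted bin

print(round(high_small - low_small, 3), round(high_big - low_big, 3))
```

The smaller bin produces a much wider interval for the same observed rate, which is why conversion rates from bins with under 100 attempts should be read with caution.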
After looking at the distance to the first down and the yardline (distance to the endzone), we still do not have a great predictor of whether a team should go for it on 4th down. Hopefully by combining the two variables we can come up with an equation that predicts the success rate for a given situation.
The first graph below shows a linear regression of the 4th down conversion rate on yardline (distance from the endzone). The relationship is weak: the slope is -0.00196745 and the coefficient of determination is only 0.12. Because the conversion rate is on a 0 to 1 scale, the slope means that each additional yard from the endzone lowers the predicted conversion rate by about 0.2 percentage points. The graph also suggests outliers at the larger yardlines. To get a better sense of the error we will look at the residual plot.
#average conversion rate at each yardline (only the needed column is averaged)
group = att_fourth.groupby('yardline_100', as_index = False)['fourth_down_converted'].mean()
sns.regplot(x=group["yardline_100"], y=group["fourth_down_converted"])
plt.title("Conversion Percentage for 4th Down Attempts by Yardline")
plt.ylabel("Conversion Percentage")
plt.xlabel("Yardline")
regr = linear_model.LinearRegression()
#sklearn expects a 2D array of predictors
x = group['yardline_100'].values.reshape(-1,1)
y = group['fourth_down_converted'].values
regr = regr.fit(x, y)
y_pred = regr.predict(x)
# The coefficients
print('Coefficients: ', regr.coef_)
# The mean squared error
print('Mean squared error: %.2f'% mean_squared_error(y, y_pred))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'% r2_score(y, y_pred))
Coefficients: [-0.00196745]
Mean squared error: 0.02
Coefficient of determination: 0.12
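Since the conversion rate is stored as a proportion between 0 and 1, the printed slope has to be multiplied by 100 to read it in percentage points. A short worked example of that arithmetic, using the coefficient printed above (the helper name `change_in_points` is made up for this sketch):

```python
# Slope copied from the printed output above (proportion per yard).
slope_per_yard = -0.00196745

def change_in_points(slope, yards):
    """Predicted change in conversion rate, in percentage points, over `yards`."""
    return slope * yards * 100

print(round(change_in_points(slope_per_yard, 1), 3))   # one extra yard from the endzone
print(round(change_in_points(slope_per_yard, 50), 2))  # half the field
```

One yard costs roughly 0.2 percentage points, and moving 50 yards further from the endzone costs a little under 10 percentage points according to this fit.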
The residual plot shows that yardline on its own is not a great predictor of conversion success. Most of the points fall within 20 percentage points of the prediction, but an error that large makes the model unreliable. Towards the far end we see a few points where the prediction was off by nearly 60 percentage points.
group['predicted'] = y_pred
group['residual'] = group['fourth_down_converted']- group['predicted']
resid_plot = sns.residplot(x = "yardline_100",y = "residual",data = group)
#residuals are on the 0 to 1 proportion scale, so 0.2 means 20 percentage points
resid_plot.set(title="Residual Plot of Conversion Percentage by Yardline",xlabel="Yardline", ylabel="Residual (Proportion)")
plt.show()
The next graph shows a linear regression of the 4th down conversion rate on yards to go to the first down. The fit is much better than for yardline. The slope is -0.01445535 and the coefficient of determination is 0.71, so each additional yard to go lowers the predicted conversion rate by about 1.4 percentage points. A conversion rate can never be negative, which is why our data stalls at 0%. For our purposes, anything beyond 33 yards to go is treated as a 0% chance, because the fitted line predicts a negative rate there. To get a better sense of the error we will look at the residual plot.
#average conversion rate at each distance to go (only the needed column is averaged)
group = att_fourth.groupby('ydstogo', as_index = False)['fourth_down_converted'].mean()
sns.regplot(x=group["ydstogo"], y=group["fourth_down_converted"])
plt.title("Conversion Percentage for 4th Down Attempts by Yards to Go")
plt.ylabel("Conversion Percentage")
plt.xlabel("Yards To Go")
regr = linear_model.LinearRegression()
#sklearn expects a 2D array of predictors
x = group['ydstogo'].values.reshape(-1,1)
y = group['fourth_down_converted'].values
regr = regr.fit(x, y)
y_pred = regr.predict(x)
# The coefficients
print('Coefficients: ', regr.coef_)
# The mean squared error
print('Mean squared error: %.2f'% mean_squared_error(y, y_pred))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'% r2_score(y, y_pred))
Coefficients: [-0.01445535]
Mean squared error: 0.01
Coefficient of determination: 0.71
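A straight line will eventually predict a negative conversion rate, so predictions need to be clipped to the 0 to 1 range. The sketch below shows that clipping and where the line crosses zero. The slope is the printed coefficient, but the intercept is NOT printed in the notebook: the value used here is an assumption, back-calculated so that the line reaches zero near the 33 yards mentioned in the text.

```python
slope = -0.01445535        # printed coefficient (proportion per yard to go)
assumed_intercept = 0.477  # ASSUMPTION: not printed; chosen to match the ~33 yard cutoff

def zero_crossing(intercept, slope):
    """Yards to go at which the fitted line reaches a 0% conversion rate."""
    return -intercept / slope

def predicted_rate(yards_to_go, intercept, slope):
    """Linear prediction clipped to the valid range of a proportion."""
    raw = intercept + slope * yards_to_go
    return min(1.0, max(0.0, raw))

print(round(zero_crossing(assumed_intercept, slope), 1))
print(predicted_rate(40, assumed_intercept, slope))  # clipped at 0.0 beyond the crossing
```

With the real intercept available from `regr.intercept_`, the same two helpers would give the exact cutoff for this fit.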
This residual plot is much more condensed, with a smaller margin of error. Remember that we treat anything beyond 33 yards to go as inaccurate, because the predictor gives a negative percentage there. For the most part our data is within 10 percentage points of the prediction, which makes yards to the first down a much better predictor than yardline.
group['predicted'] = y_pred
group['residual'] = group['fourth_down_converted']- group['predicted']
resid_plot = sns.residplot(x="ydstogo", y="residual", data=group)
#residuals are on the 0 to 1 proportion scale, so 0.1 means 10 percentage points
resid_plot.set(title="Residual Plot of Conversion Percentage by Yards To Go",xlabel="Yards To Go", ylabel="Residual (Proportion)")
plt.show()
After looking at yardline and distance to the first down separately, we now create a linear equation that uses both variables to predict 4th down conversion success. To do this I use scikit-learn to fit a linear regression model with multiple variables, and statsmodels to produce the OLS summary. The coefficient of each variable and the intercept are printed below.
The equation derived is:
4th down conversion rate = 0.5539 + 0.0008*x1 - 0.0238*x2
where x1 = yardline (distance to the endzone) and x2 = yards to go (distance to the first down).
Based on the OLS regression results, yardline has a p-value of 0.076, which is slightly higher than the usual 0.05 cutoff, and the R-squared is only 0.163, so predictions for individual situations will be rough. I still believe this equation is a reasonable first predictor for any given situation on the football field.
#average conversion rate for each (yardline, yards to go) combination
group = att_fourth.groupby(['yardline_100','ydstogo'], as_index = False)['fourth_down_converted'].mean()
#two predictors for the multiple regression: yardline and yards to go
X = group[['yardline_100','ydstogo']]
Y = group['fourth_down_converted']
# with sklearn
regr = linear_model.LinearRegression()
regr.fit(X, Y)
print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)
# with statsmodels, to get p-values and the full summary
X = sm.add_constant(X) # adding a constant (the intercept term)
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
print_model = model.summary()
print(print_model)
Intercept:
 0.5538705517152108
Coefficients:
 [ 0.00082249 -0.02381385]
                              OLS Regression Results
=================================================================================
Dep. Variable:     fourth_down_converted   R-squared:                       0.163
Model:                               OLS   Adj. R-squared:                  0.161
Method:                    Least Squares   F-statistic:                     102.5
Date:                   Sun, 16 May 2021   Prob (F-statistic):           2.12e-41
Time:                           23:07:25   Log-Likelihood:                -361.04
No. Observations:                   1056   AIC:                             728.1
Df Residuals:                       1053   BIC:                             743.0
Df Model:                              2
Covariance Type:               nonrobust
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const            0.5539      0.026     21.533      0.000       0.503       0.604
yardline_100     0.0008      0.000      1.775      0.076   -8.66e-05       0.002
ydstogo         -0.0238      0.002    -14.259      0.000      -0.027      -0.021
==============================================================================
Omnibus:                       71.069   Durbin-Watson:                   2.056
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               82.586
Skew:                           0.672   Prob(JB):                     1.17e-18
Kurtosis:                       2.731   Cond. No.                         129.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
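To show how the fitted equation would be used for a single situation, here is a minimal sketch that plugs the printed intercept and coefficients into a function. The function name and the clipping to the 0 to 1 range are additions for illustration and are not part of the notebook.

```python
# Values copied from the printed sklearn output above.
INTERCEPT = 0.5538705517152108
COEF_YARDLINE = 0.00082249   # per yard of distance to the endzone
COEF_YDSTOGO = -0.02381385   # per yard to go to the first down

def predict_conversion_rate(yardline_100, ydstogo):
    """Predicted 4th down conversion rate (0 to 1) from the fitted linear equation."""
    raw = INTERCEPT + COEF_YARDLINE * yardline_100 + COEF_YDSTOGO * ydstogo
    # A linear model can leave the valid range, so clip to [0, 1].
    return min(1.0, max(0.0, raw))

# Two situations 40 yards from the endzone: 4th and 1 versus 4th and 10.
print(round(predict_conversion_rate(40, 1), 3))
print(round(predict_conversion_rate(40, 10), 3))
```

At the same field position, each extra yard to go removes about 2.4 percentage points from the prediction, while the field position itself barely moves it. Given the R-squared of 0.163, these numbers are best read as rough tendencies.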
Having looked at yards to go (distance to first down) and yardline (distance to endzone), we now look at the combined success rate of converting on 4th down based on both a team's location on the field and its distance to the first down marker. We then compare this to the field goal probability at that location (retrieved from the dataframe and grouped) to see whether a team should go for it or kick the field goal. Because we are comparing field goal probability to the success rate of converting on 4th down, we do not look at distances further than 50 yards from the endzone. The longest field goal kicked during the seasons covered by the data, which was the NFL record at the time, came from the 46 yardline (a 64 yard field goal: 46 yardline + 10 yards for the length of the endzone + 8 yards for the snap distance).
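The field goal distance arithmetic in the paragraph above can be written as a one-line helper. This is only a sketch of that calculation; the function and constant names are made up here, and the 8 yard snap distance is the figure used in the text (7 yards is also commonly quoted).

```python
ENDZONE_DEPTH = 10  # yards from the goal line to the goal posts
SNAP_DISTANCE = 8   # yards behind the line of scrimmage, as used in the text

def field_goal_distance(yardline_100, snap_distance=SNAP_DISTANCE):
    """Length of a field goal attempted from a given distance to the endzone."""
    return yardline_100 + ENDZONE_DEPTH + snap_distance

print(field_goal_distance(46))  # the 64 yard kick described above
print(field_goal_distance(50))  # the 50 yard cutoff used in this analysis
```

A kick from the 50 yard cutoff would already be longer than the 64 yard kick cited above, which is why positions beyond it are left out of the comparison.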
#create 2 data frames from the 4th down attempts:
# combined_df: 20 or fewer yards to go, binned below in 5 yard ranges (used 11-50 yards from the endzone)
# combined_df2: 10 or fewer yards to go, binned below by single yard (used 1-10 yards from the endzone)
combined_df = att_fourth.sort_values(by = ['ydstogo'])
combined_df = combined_df[combined_df.ydstogo <= 20].copy()
#copy so the column assignments below act on an independent frame (avoids SettingWithCopyWarning)
combined_df2 = combined_df[combined_df.ydstogo <= 10].copy()
#multiply df percentages of 4th down converted and fg prob by 100
combined_df['fourth_down_converted'] = combined_df['fourth_down_converted'] * 100
combined_df['fg_prob'] = combined_df['fg_prob'] * 100
combined_df2['fourth_down_converted'] = combined_df2['fourth_down_converted'] * 100
combined_df2['fg_prob'] = combined_df2['fg_prob'] * 100
#bin distances from first down and distances from endzone for first data frame
combined_df['ydstogo'] = pd.cut(combined_df['ydstogo'], [1,5,10,15,20],\
labels = ['1-5', '6-10','11-15','16-20'], include_lowest= True)
combined_df['yardln_bin'] = pd.cut(x=combined_df['yardline_100'], \
bins=[1,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100], \
labels=['1-5', '6-10', '11-15', '16-20', '21-25', '26-30', '31-35', '36-40',\
'41-45', '46-50', '51-55', '56-60','61-65', '66-70', '71-75', '76-80', \
'81-85', '86-90', '91-95', '96-100'],include_lowest = True)
#bin distances from first down and distances from endzone for second data frame
combined_df2['yardln_bin'] = pd.cut(x=combined_df2['yardline_100'], \
bins=[1,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100], \
labels=['1-5', '6-10', '11-15', '16-20', '21-25', '26-30', '31-35', '36-40',\
'41-45', '46-50', '51-55', '56-60','61-65', '66-70', '71-75', '76-80', \
'81-85', '86-90', '91-95', '96-100'],include_lowest = True)
combined_df2['ydstogo'] = pd.cut(combined_df2['ydstogo'], [0,1,2,3,4,5,6,7,8,9,10],\
labels = ['1', '2','3','4','5','6','7','8','9','10'], include_lowest= True)
#drop the penalty_yards col before dropna() so that groups with no penalties are not removed
#creates grouped data frame to be used for distances away from endzone of 10 yards or less
success2 = combined_df2.groupby(['yardln_bin','ydstogo'], as_index = False).mean()
success2 = success2.drop(columns = 'penalty_yards')
success2 = success2.dropna()
#creates grouped data frame to be used for distances away from endzone greater than 10 yards
success = combined_df.groupby(['yardln_bin','ydstogo'], as_index = False).mean()
success = success.drop(columns = 'penalty_yards')
success = success.dropna()
#group each data frame by yardln_bin so that the ydstogo col can be used on the x axis
#bins for distances from endzone greater than 10 yards
#bins2 for distances from endzone of 10 yards or less
bins = success.groupby('yardln_bin')
bins2 = success2.groupby('yardln_bin')
#create individualized binned data frames for specific distances to be used for plotting
bin1 = bins2.get_group('1-5').reset_index()
bin2 = bins2.get_group('6-10').reset_index()
bin3 = bins.get_group('11-15').reset_index()
bin4 = bins.get_group('16-20').reset_index()
bin5 = bins.get_group('21-25').reset_index()
bin6 = bins.get_group('26-30').reset_index()
bin7 = bins.get_group('31-35').reset_index()
bin8 = bins.get_group('36-40').reset_index()
bin9 = bins.get_group('41-45').reset_index()
bin10 = bins.get_group('46-50').reset_index()
#plotting and labeling occurs for each distance away from endzone
#the frames are in order, so frame i covers 5*i+1 to 5*i+5 yards from the endzone
binned_frames = [bin1, bin2, bin3, bin4, bin5, bin6, bin7, bin8, bin9, bin10]
for i, frame in enumerate(binned_frames):
    low, high = 5*i + 1, 5*i + 5
    frame.plot(x = 'ydstogo', y = ['fourth_down_converted','fg_prob'], marker = 'o',
               label = ['4th Down Converted','Field Goal Probability'])
    plt.title(f"Kicking Field Goal vs. Going For It ({low} to {high} Yards Away From Endzone)", weight = 'bold')
    plt.xlabel("Distance From 1st Down (Yards)", weight = 'bold')
    plt.ylabel("Probability of Converting (Percentage)", weight = 'bold')
Looking at the line plots above, which display the success rate of converting on 4th down and the probability of successfully kicking a field goal by yards to go (distance from the first down marker) and yardline (distance from the endzone), we see mixed results depending on where a team is on the field. From 1 to 15 yards away from the endzone, shown in the first three graphs, there is no yards-to-go value at which the 4th down conversion rate exceeds the probability of making the field goal. However, in the first two graphs, which cover 1 to 10 yards from the endzone, the success rate of converting on 4th down with only 1 yard to go is about 55 to 65 percent. That is much higher than the overall average 4th down conversion rate and about the same as the rate we saw in part 3B, which calculates the success rate of converting on 4th and 1 from any location on the field. From the 11 to 15 yardline with 1 to 5 yards to go there is about a 60 percent chance of converting, again higher than the overall 4th down average. Although these results are promising, we have concluded that from the 1 to 15 yardline it is a safer bet for teams to kick the field goal than to go for it on 4th down.
Each of the 3 graphs that span the 16 to 30 yardlines shows a higher success rate of converting on 4th down than of kicking a field goal when the yards to go is from 1 to 5. In all three graphs the conversion rate at this point is about 60 to 65 percent, and the gap between the 4th down conversion rate and the field goal probability widens as the distance from the endzone increases. This is due to the increasing difficulty of kicking a field goal the further away you are from the endzone. These results are promising because the 16 to 30 yardlines are considered to be in “field goal range”, yet there is a greater chance of converting on 4th down than of making the field goal when the yards to go is 1 to 5. We conclude that at 16 to 30 yards from the endzone with 1 to 5 yards to go to the first down, teams should go for it rather than kick a field goal.
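The comparison made in this paragraph amounts to subtracting the field goal probability from the conversion rate in each grouped row. The sketch below shows that step on a toy frame. The helper add_go_margin and the columns go_margin and go_for_it are hypothetical names, and the percentages are placeholders for illustration only, not values from the dataset.

```python
import pandas as pd


def add_go_margin(grouped):
    """Return a copy with the conversion-minus-field-goal margin for each row."""
    out = grouped.copy()
    out["go_margin"] = out["fourth_down_converted"] - out["fg_prob"]
    out["go_for_it"] = out["go_margin"] > 0
    return out


# Placeholder percentages for illustration only; NOT values from the dataset.
toy = pd.DataFrame({
    "yardln_bin": ["16-20", "21-25", "26-30"],
    "ydstogo": ["1-5", "1-5", "1-5"],
    "fourth_down_converted": [62.0, 62.0, 62.0],
    "fg_prob": [60.0, 55.0, 50.0],
})
result = add_go_margin(toy)
print(result[["yardln_bin", "go_margin", "go_for_it"]])
```

The same function could be applied to the grouped frames built earlier (success and success2), since they carry the fourth_down_converted and fg_prob columns on the same percentage scale.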
Each of the last 4 graphs, which span the 31 to 50 yardlines, shows a higher success rate of converting on 4th down with 1 to 5 yards to go than the probability of successfully kicking a field goal from those distances. All 4 graphs show around a 55 to 60 percent success rate of converting on 4th down, which is similar to the results found in part 3C for a yards to go range of 1 to 5. From 41 to 50 yards away from the endzone there is also a higher chance of converting a 4th down than of making a field goal when the yards to go is 6 to 10. Field goals from these distances are extremely difficult and attempts from here may not be as common, but it is still promising to see that the success rate of converting a 4th down is around 40 to 45 percent. From these graphs we can conclude that when a team is at the 31 to 50 yardlines and 1 to 5 yards away from the first down marker, it should go for it and not kick a field goal. Similarly, at the 41 to 50 yardlines with 6 to 10 yards to go, a team should again go for it over kicking a field goal.
When comparing the probability of successfully kicking a field goal against the success rate of converting a 4th down attempt, it is interesting to see how the location on the field and the yards to go can impact a coach’s or player’s decision. From the analysis above we can determine that at no point should a coach decide to go for it over kicking a field goal when the team is 1 to 15 yards away from the endzone. From 16 to 50 yards away from the endzone with 1 to 5 yards to go, the 4th down should always be attempted over kicking a field goal, and from 41 to 50 yards away from the endzone it should be attempted whenever the yards to go is between 1 and 10. In 9 of the 10 graphs above the field goal probability is higher when the yards to go is over 10, so we conclude that with more than 10 yards to go no coach or player should go for it on 4th down over kicking a field goal. There is no single answer to whether a team should always go for it on 4th down, but we have shown the scenarios in which going for it is the correct option.
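The conclusions above can be collected into a single lookup. The function below is a sketch that encodes only the rules stated in this section and returns None outside the 1 to 50 yard range that was analyzed. The name recommend is hypothetical, and the rule compares success rates only, exactly as the graphs do.

```python
def recommend(yardline_100, ydstogo):
    """Return 'go for it', 'kick', or None when outside the analyzed range."""
    if not 1 <= yardline_100 <= 50 or ydstogo < 1:
        return None  # the analysis only covers 1 to 50 yards from the endzone
    if yardline_100 <= 15 or ydstogo > 10:
        return "kick"
    if ydstogo <= 5:
        return "go for it"
    # 6 to 10 yards to go: going for it only came out ahead from 41 to 50 yards away
    return "go for it" if yardline_100 >= 41 else "kick"


for spot in [(10, 1), (20, 3), (35, 8), (45, 8), (45, 12)]:
    print(spot, recommend(*spot))
```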