I love watching the NFL, but when the season ends it gets boring for a few months. Probably the biggest event of the offseason is the draft, which I find interesting but can’t get excited about. I don’t watch college football, so I can’t evaluate or project players.
Most of the articles you read about the NFL draft are complete garbage: speculation about player quality or draft tactics based, at best, on someone having casually watched a player in a few games. So I decided that this year I’m going to do my own “mock draft,” but it’s going to be based on the best data science I can muster.
If you’re not interested in how I did this, skip to the results.
Fortunately, Pro Football Reference has great data on historical drafts and combine results from 2000-2016. They also link to college statistics for a large number of the players who were drafted or appeared at the combine.
I won’t bore you with the scraping code, but you can see how I did it or just directly use the files I created. This was probably the bulk of the work!
From those pages I was able to gather draft outcomes, combine measurements, and college statistics for the players in those years.
The goal of this exercise is to build a model that answers the following question: what is the probability that a player will be picked in the first round? We’ll assume that players with higher first-round probabilities are more likely to be drafted higher. Obviously we could do something fancier, e.g. learning to rank, or regression to predict where they will be picked. My experience was that these models performed much worse than a logistic loss on the first-round outcome.
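The snippets below also refer to a first.round label and a train.set/test.set split. Their construction lives in the data-prep code rather than in this post; a minimal sketch of what they amount to (assuming pick is NA for undrafted players, and using an illustrative 80/20 random split) looks like this:

# Rough sketch of the label and split used below; the real version is in
# the data-prep source files, so treat the details here as illustrative.
first.round <- !is.na(training$pick) & training$pick <= 32
set.seed(2016)
train.set <- runif(nrow(training)) < 0.8
test.set <- !train.set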
Not every player performs every test at the NFL combine, so I used the mice package to impute the missing combine scores. This lets me ignore missingness in these variables (which may be informative!) while doing machine learning. You can see how I prepare the data in the source files.
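The imputation call itself is in the data-prep code rather than this post, but it amounts to something like the sketch below, assuming the combine columns are named as in the model formulas later on.

library(mice)
# Impute only the combine measurements; everything else is left alone.
combine.cols <- c('forty', 'bench', 'vertical', 'threecone', 'broad', 'shuttle')
imputed <- mice(training[, combine.cols], m = 5, seed = 1)
training[, combine.cols] <- complete(imputed, 1)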
For the college statistics, I only use count statistics (e.g. number of tackles, number of interceptions), so a zero can be read as “the player did not do this in college.” It’s not perfect, since we are missing college data for a number of players, and they will look the same as players who simply didn’t accumulate any statistics.
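In code, that zero-fill amounts to something like this (again a sketch; the actual prep is in the source files):

# Treat a missing count statistic as zero: the player never recorded it.
count.cols <- c('games', 'seasons', 'completions', 'attempts', 'pass_yards',
                'pass_ints', 'pass_tds', 'receptions', 'rec_yards', 'rec_td',
                'rush_att', 'rush_yds', 'rush_td', 'tackles', 'sacks', 'int')
# ... plus the rest of the count columns in the model formula below.
training[count.cols][is.na(training[count.cols])] <- 0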
I first tried my favorite ML tool: sparse regularized regression. Glmnet is my favorite implementation. The trick to getting good results here is producing a lot of interactions: by interacting everything with position, we effectively learn a different model for each position, including which colleges teams prefer as well as which statistics matter. Because the position dummy variables are sparse, the linear model has sparse features. The matrix has 2125 features and only 6264 rows, so we’ll be regularizing a lot.
library(glmnet)
sparseX <- sparse.model.matrix(~ + (1 + factor(pos)) * (1 +
factor(short_college) +
age + height + weight +
forty + bench + vertical +
threecone + broad + shuttle +
games + seasons +
completions + attempts +
pass_yards + pass_ints + pass_tds +
rec_yards + rec_td + receptions +
rush_att + rush_yds + rush_td +
solo_tackes + tackles + loss_tackles + ast_tackles +
fum_forced + fum_rec + fum_tds + fum_yds +
sacks + int + int_td + int_yards + pd +
punt_returns + punt_return_td + punt_return_yards +
kick_returns + kick_return_td + kick_return_yards)
,training)
# Elastic-net logistic regression (alpha = 0.5); lambda chosen by cross-validation.
m1 <- cv.glmnet(sparseX[train.set,],
first.round[train.set],
alpha = 0.5,
family = 'binomial')
training$sparse.fr.hat <- predict(m1, newx = sparseX, type = 'response')[,1]
The first thing we probably want to do is look at an ROC curve to see how well we do out of sample. The AUC of the model on the test set is 0.76.
library(ROCR)
preds <- prediction(training$sparse.fr.hat[test.set], first.round[test.set])
perf <- performance(preds, 'tpr', 'fpr')
plot(perf)
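The 0.76 figure comes out of the same ROCR objects:

# AUC on the test set (reported above as 0.76).
performance(preds, 'auc')@y.values[[1]]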
The results for the sparse model were kind of underwhelming, so we’re going to try a more complex model. My favorite technique these days is gradient boosting, and there’s no better implementation than the XGBoost package.
Notice that I include the in-sample predictions from the sparse model here as features. The sparse model doesn’t perform great, but it can pick up on things the tree cannot efficiently learn, such as the college and position effects. This is essentially a cheap hack to do ensembling.
fitX <- model.matrix(~ 0 +
factor(pos) +
# Ensemble the sparse model: its predicted first-round probability enters as a feature.
sparse.fr.hat +
age + height + weight +
forty + bench + vertical +
threecone + broad + shuttle +
games + seasons +
completions + attempts +
pass_yards + pass_ints + pass_tds +
rec_yards + rec_td + receptions +
rush_att + rush_yds + rush_td +
solo_tackes + tackles + loss_tackles + ast_tackles +
fum_forced + fum_rec + fum_tds + fum_yds +
sacks + int + int_td + int_yards + pd +
punt_returns + punt_return_td + punt_return_yards +
kick_returns + kick_return_td + kick_return_yards
,training)
library(dplyr)    # for %>%, group_by(), do(), data_frame()
library(xgboost)
library(knitr)    # for kable() further down

b1.tuning <- expand.grid(depth = c(3, 4, 5, 6),
rounds = c(50, 100, 150, 200, 250)) %>%
group_by(depth, rounds) %>%
do({
m <- xgboost(data = fitX[train.set,],
label = first.round[train.set],
max.depth = .$depth,
nround = .$rounds,
print.every.n = 50,
objective = 'binary:logistic')
yhat <- predict(m, newdata = fitX)
data_frame(test.set = test.set, yhat = yhat, label = first.round)
})
We’ll compute the AUC for each point on the grid and see which combination predicts best on the test set. Remember, we’d normally do proper cross-validation here, but I’m lazy.
aucs <- b1.tuning %>%
ungroup %>%
filter(test.set) %>%
group_by(depth, rounds) %>%
do({
auc <- performance(prediction(.$yhat, .$label), "auc")@y.values[[1]]
data_frame(auc = auc)
}) %>%
ungroup %>%
arrange(-auc)
best <- aucs %>% head(1)
best
## Source: local data frame [1 x 3]
##
## depth rounds auc
## 1 4 50 0.8363486
That’s a pretty good AUC! To get another perspective, we can train on pre-2015 and look at how many of the 2015 first rounders we could predict.
pre2015 <- with(training, year < 2015)
b1.train <- xgboost(data = fitX[pre2015,],
label = first.round[pre2015],
max.depth = best$depth,
nround = best$rounds,
verbose = FALSE,
objective = "binary:logistic")
training$fr.hat2015 <- predict(b1.train, newdata = fitX)
preds2015 <- training %>%
filter(year == 2015) %>%
arrange(-fr.hat2015) %>%
mutate(predicted.pick = row_number()) %>%
select(predicted.pick, pick, player, college, pos, fr.hat2015) %>%
head(32)
kable(preds2015, digits = 2)
predicted.pick | pick | player | college | pos | fr.hat2015 |
---|---|---|---|---|---|
1 | 2 | Marcus Mariota | Oregon | QB | 0.97 |
2 | 147 | Brett Hundley | UCLA | QB | 0.95 |
3 | 1 | Jameis Winston | Florida St. | QB | 0.95 |
4 | 4 | Amari Cooper | Alabama | WR | 0.90 |
5 | 13 | Andrus Peat | Stanford | T | 0.90 |
6 | 9 | Ereck Flowers | Miami (FL) | T | 0.85 |
7 | 5 | Brandon Scherff | Iowa | T | 0.77 |
8 | 185 | Tyrus Thompson | Oklahoma | T | 0.76 |
9 | 31 | Stephone Anthony | Clemson | ILB | 0.70 |
10 | 25 | Shaq Thompson | Washington | OLB | 0.69 |
11 | 6 | Leonard Williams | USC | DE | 0.65 |
12 | 21 | Cedric Ogbuehi | Texas A&M | T | 0.58 |
13 | 35 | Mario Edwards | Florida St. | DT | 0.55 |
14 | 12 | Danny Shelton | Washington | NT | 0.54 |
15 | 137 | Grady Jarrett | Clemson | NT | 0.54 |
16 | 41 | Devin Funchess | Michigan | TE | 0.51 |
17 | 72 | Jamon Brown | Louisville | T | 0.45 |
18 | 226 | Bobby Hart | Florida St. | G | 0.44 |
19 | 73 | Tevin Coleman | Indiana | RB | 0.42 |
20 | 24 | D.J. Humphries | Florida | T | 0.42 |
21 | 34 | Donovan Smith | Penn St. | T | 0.41 |
22 | 17 | Arik Armstead | Oregon | DT | 0.38 |
23 | 87 | Sammie Coates | Auburn | WR | 0.38 |
24 | 20 | Nelson Agholor | USC | WR | 0.38 |
25 | 68 | Clive Walford | Miami (FL) | TE | 0.37 |
26 | 91 | Chaz Green | Florida | T | 0.35 |
27 | 88 | Danielle Hunter | LSU | DE | 0.34 |
28 | 51 | Nate Orchard | Utah | DE | 0.34 |
29 | 67 | A.J. Cann | South Carolina | G | 0.33 |
30 | 53 | Jake Fisher | Oregon | T | 0.31 |
31 | 192 | Darius Philon | Arkansas | DT | 0.30 |
32 | 114 | Jamil Douglas | Arizona St. | G | 0.29 |
Not bad. We’re able to identify 14 of the 32 first-round picks (44%) just using machine learning on combine and college data. I did not watch a single college football game in 2014 and I could have done almost as well as the experts ;)
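That 44% is just the hit rate within the model’s predicted top 32:

# Of the model's top 32 for 2015, how many were actually first-round picks?
sum(preds2015$pick <= 32)   # 14
mean(preds2015$pick <= 32)  # ~0.44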
We can also look to see how these predictions correlate across the whole draft:
library(ggplot2)
training %>%
filter(year == 2015) %>%
ggplot(aes(x = pick, y = fr.hat2015)) +
geom_smooth() +
geom_point(size = 0.5) +
theme_bw() +
xlab('Pick') + ylab('P(first round)')
Let’s predict the first round of the 2016 NFL draft! We’ll train one final model on all the pre-2016 data with the hyperparameters we chose.
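That final fit isn’t shown below, but it mirrors the 2015 backtest above; a sketch, assuming the fitted probabilities get stored in a fr.hat column:

# Sketch of the final model: train on pre-2016 drafts with the tuned
# hyperparameters, then score everyone (including the 2016 prospects).
pre2016 <- with(training, year < 2016)
b1.final <- xgboost(data = fitX[pre2016,],
                    label = first.round[pre2016],
                    max.depth = best$depth,
                    nround = best$rounds,
                    verbose = FALSE,
                    objective = 'binary:logistic')
training$fr.hat <- predict(b1.final, newdata = fitX)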
training %>%
filter(year == 2016) %>%
arrange(-fr.hat) %>%
mutate(predicted.pick = row_number()) %>%
select(predicted.pick, player, college, pos, fr.hat) %>%
head(32) %>%
kable(digits = 2)
predicted.pick | player | college | pos | fr.hat |
---|---|---|---|---|
1 | Jared Goff | California | QB | 0.82 |
2 | Myles Jack | UCLA | OLB | 0.81 |
3 | Derrick Henry | Alabama | RB | 0.65 |
4 | Corey Coleman | Baylor | WR | 0.63 |
5 | Paxton Lynch | Memphis | QB | 0.60 |
6 | Drew Ott | Iowa | DE | 0.57 |
7 | Will Fuller | Notre Dame | WR | 0.45 |
8 | Ryan Kelly | Alabama | C | 0.41 |
9 | Yannick Ngakoue | Maryland | DE | 0.29 |
10 | Laquon Treadwell | Mississippi | WR | 0.27 |
11 | Dak Prescott | Mississippi State | QB | 0.23 |
12 | Ezekiel Elliott | Ohio State | RB | 0.22 |
13 | Jordan Howard | Indiana | RB | 0.22 |
14 | Joey Bosa | Ohio State | DE | 0.21 |
15 | Willie Henry | Michigan | DT | 0.21 |
16 | Javon Hargrave | South Carolina St | DT | 0.20 |
17 | Christian Hackenberg | Penn State | QB | 0.20 |
18 | Sterling Shepard | Oklahoma | WR | 0.20 |
19 | Emmanuel Ogbah | Oklahoma State | DE | 0.18 |
20 | Robert Nkemdiche | Mississippi | DT | 0.18 |
21 | Eric Murray | Minnesota | CB | 0.16 |
22 | Charles Tapper | Oklahoma | DE | 0.16 |
23 | Bronson Kaufusi | Brigham Young | DE | 0.15 |
24 | Sheldon Rankins | Louisville | DT | 0.15 |
25 | Shaq Lawson | Clemson | DE | 0.15 |
26 | Trevor Davis | California | WR | 0.15 |
27 | T.J. Green | Clemson | FS | 0.14 |
28 | Kenny Clark | UCLA | DT | 0.14 |
29 | Maliek Collins | Nebraska | DT | 0.13 |
30 | Leonard Floyd | Georgia | OLB | 0.13 |
31 | A’Shawn Robinson | Alabama | DT | 0.13 |
32 | Dadi Nicolas | Virginia Tech | DE | 0.12 |
A few simple observations: