NFL Elo Ranking

Posted by Kendra Frederick on Mon 08 October 2018

Metis Project 2 ("Luther"): Regression & Web-Scraping

Project Background

Metis Project 2, aka "Project Luther", required us to build a regression model from data we scraped from the web.

Project Design

An Elo rating system is a measure of relative strength in a zero-sum game such as chess or soccer. It is named after Arpad Elo, who created the system to improve chess rankings in the 1960s. A difference in competitors’ Elo scores serves as a predictor of who will win and by how much.
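As a rough illustration of how a rating difference turns into a prediction, here is a minimal sketch of the standard (chess-style) Elo formulas. The K-factor of 20 and the example ratings are illustrative assumptions, and this is the textbook update rule, not FiveThirtyEight's NFL variant.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 20.0):
    """Return (delta_a, delta_b) for one game.

    score_a is 1 for an A win, 0.5 for a tie, 0 for an A loss.
    k is an illustrative K-factor, not a value taken from this post.
    """
    delta_a = k * (score_a - expected_score(rating_a, rating_b))
    # Zero-sum: whatever A gains, B loses.
    return delta_a, -delta_a


# Hypothetical ratings for two teams.
p = expected_score(1600, 1500)
delta_a, delta_b = elo_update(1600, 1500, score_a=1)
print(f"P(A wins) = {p:.3f}")
print(f"dElo A = {delta_a:+.2f}, dElo B = {delta_b:+.2f}")
```

Because the favorite was already expected to win, it gains fewer points for winning than it would lose for an upset.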

FiveThirtyEight, the popular website that applies statistical analyses to politics, economics, and sports, has implemented an Elo ranking system for NFL teams. They claim their ranking algorithm is based only on the final score and location (home vs. away) of each game.

I used FiveThirtyEight’s Elo rankings and game statistics to predict how a team’s Elo ranking will change after a given week’s game. Insights from this model could reveal statistics that indicate a team’s success as well as inform more favorable bets.


  • Web Scraping: Selenium, BeautifulSoup
  • Data Handling: Python, Pandas, NumPy, SciPy, Pickling
  • Data Fitting: statsmodels, scikit-learn
  • Visualization: Matplotlib, Seaborn


FiveThirtyEight provides their Elo rankings as a .csv file on their ‘nfl-elo-game’ GitHub repository. (See Footnote 1.) I scraped Pro Football Reference for game and boxscore statistics for the 267 games in each of the 2007-2017 seasons. This initially provided me with 5858 rows and 44 features.

My target was the change in a team's Elo score before and after a given week’s game, referred to hereafter as $\Delta$Elo. When Team A plays Team B, $\Delta$Elo$_A$ = $-\Delta$Elo$_B$, so analyzing the data by game rather than by team was most appropriate. This halved my number of rows and nearly doubled my number of features.
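The reshaping from one row per team to one row per game can be sketched with a pandas self-join. The column names and values here are hypothetical, and choosing "Team 1" alphabetically is just a convenient tie-breaker for the sketch, not necessarily how I ordered the teams in the project.

```python
import pandas as pd

# Hypothetical team-level rows: one row per team per game.
team_rows = pd.DataFrame({
    "game_id": [1, 1, 2, 2],
    "team": ["GNB", "CHI", "MIN", "DET"],
    "delta_elo": [12.5, -12.5, -8.0, 8.0],
    "first_downs": [22, 17, 15, 20],
})

# Self-join on game_id: every team row is paired with every row of the same game.
pairs = team_rows.merge(team_rows, on="game_id", suffixes=("_1", "_2"))

# Keep each matchup once (drops self-pairs and mirrored duplicates).
games = pairs[pairs["team_1"] < pairs["team_2"]].reset_index(drop=True)

print(games)
```

Four team rows with four columns become two game rows with seven columns, and the two $\Delta$Elo columns in each row sum to zero.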

I calculated trailing season averages (SA) of game statistics. After cleaning and exclusion, I had 2540 rows and 53 features to build a model upon. The features are listed in my GitHub repo.
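One way to compute a trailing season average is an expanding mean over a team's earlier games in the same season, shifted by one game so the current game's result does not leak into its own feature. This is a sketch under that assumption; the column names and numbers are made up.

```python
import pandas as pd

# Hypothetical per-team, per-week game statistics.
stats = pd.DataFrame({
    "team": ["GNB"] * 4 + ["CHI"] * 4,
    "season": [2017] * 8,
    "week": [1, 2, 3, 4] * 2,
    "points": [17, 23, 27, 35, 17, 7, 23, 14],
})

stats = stats.sort_values(["team", "season", "week"])

# Average of all *previous* games in the season; shift(1) excludes the current game.
stats["points_sa"] = (
    stats.groupby(["team", "season"])["points"]
    .transform(lambda s: s.shift(1).expanding().mean())
)

print(stats)
```

A team's first game of the season has no history, so its trailing average is NaN and that row needs separate handling.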


I performed a 75:25 training:test split. Using the training set, I examined the following linear regression models and pre-processing tools from scikit-learn and statsmodels:

  • Ordinary Least Squares (OLS) / Linear Regression
  • Polynomial Features
  • Standard Scaling
  • Regularization:
    • Lasso Regularization
    • Ridge Regularization
    • ElasticNet Regularization

I explored their performance with different combinations of features. For model evaluation and feature selection, I used 5-fold cross-validation. I ultimately chose Elastic Net regularization with 2nd-order polynomial features and standard scaling, with a lambda of 0.173 and an L1-ratio of 0.714. The 15 columns that gave the best $R^2$ and MSE while offering the simplest model are listed below in Footnote 2 and were used in the final model.
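A pipeline of this shape can be sketched in scikit-learn as follows. The data is synthetic, the order of the polynomial and scaling steps is an assumption, and I am assuming that the lambda above corresponds to scikit-learn's `alpha` parameter.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the game features and the delta-Elo target.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = 10 * X[:, 0] - 6 * X[:, 1] + 3 * X[:, 0] * X[:, 2] + rng.normal(scale=5, size=400)

# 75:25 training:test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2nd-order polynomial features -> standard scaling -> Elastic Net.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    ElasticNet(alpha=0.173, l1_ratio=0.714),
)

# 5-fold cross-validation on the training set only.
cv_r2 = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")

# Fit on the full training set, then score once on the hold-out set.
model.fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)

print("CV R^2 per fold:", np.round(cv_r2, 3))
print("Test R^2:", round(test_r2, 3))
```

Putting the scaler inside the pipeline means it is refit on each cross-validation fold, so no information from the validation fold leaks into the preprocessing.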


The results of applying my model to the test (hold-out) set are shown below.

Predicted-vs-Actual scatter plot

Metric   Train   Test
$R^2$    0.389   0.383
MSE      340     331
RMSE     18.4    18.1

While the $R^2$ is not as high as I had hoped, both it and the MSE are similar between the training and test sets. This suggests that my model is not over-fitting and should perform similarly on unseen data. Inspection of the residuals shows them to be approximately normally distributed, although the tails fit a normal distribution less well. There is also a curious void of data along the Actual $\Delta$Elo = 0 line, which makes sense: ties are rare in the NFL, so nearly all games result in a transfer of Elo points.

Lessons Learned

In evaluating my model against the Elo ratings of FiveThirtyEight, I realized that $\Delta$Elo was a complicated and fraught target to predict. The FiveThirtyEight algorithm computes the number of Elo rating points that are exchanged between teams after a game (i.e. $\Delta$Elo) from the difference in Elo ratings between the teams, adjusted for home-field advantage and the margin of victory. I was trying to predict something ($\Delta$Elo) that is calculated from post-game knowledge (the point difference) and that depends on the algorithm's own pre-game Elo ratings. A better, more direct target might have been the point difference, using Elo ratings as variables. In any case, when I tested my model against FiveThirtyEight’s algorithm, I did rather poorly. Using a Brier system for scoring probabilistic models, my forecasts would have earned 424 points per season; FiveThirtyEight averages 892 points. (See the Predictions folder in my GitHub repo.)
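For readers unfamiliar with Brier scoring, here is a minimal sketch. The `brier_score` function is the standard definition (mean squared error between forecast probabilities and 0/1 outcomes). The `game_points` rescaling, 25 minus 100 times the squared error, is my recollection of the points-style variant and should be treated as an assumption; the probabilities and outcomes are made up.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better; always guessing 50% scores 0.25.
    """
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def game_points(prob, outcome):
    """Points-style rescaling of one forecast (assumed form, higher is better).

    A 50% forecast earns 0, a confident correct pick earns up to +25,
    and a confident wrong pick loses up to 75.
    """
    return 25.0 - 100.0 * (prob - outcome) ** 2


# Hypothetical win probabilities for the home team and what actually happened.
probs = [0.70, 0.60, 0.90]
outcomes = [1, 0, 1]

print("Brier score:", round(brier_score(probs, outcomes), 4))
print("Total points:", round(sum(game_points(p, o) for p, o in zip(probs, outcomes)), 1))
```

Under a points scheme like this, overconfident misses are punished much more heavily than cautious forecasts are rewarded, so a poorly calibrated model falls behind quickly over a season.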

I initially made an error in how I calculated $\Delta$Elo. See the accompanying blog post, Panda Tricks: self-joins.


1. I initially found the .js source URL behind their interactive historical Elo graphics. I extracted data for one team (Green Bay, obviously), but given that FiveThirtyEight offered a neat and tidy .csv, I opted to use that rather than extract more data from the JavaScript file.

2. Features in final model:

  • Team 1 Elo (before game)
  • Team 2 Elo (before game)
  • First Downs
  • Fourth-Down Attempts
  • Number of Penalties
  • Penalty Yards
  • Points Against
  • Points Scored
  • Rush Attempts
  • Time of Possession
  • Win Percentage