Panda Tricks: Self-Joins

Posted by Kendra Frederick on Tue 09 October 2018

Self-Join to extract information in two columns

The problem

In timeseries data, information for an entity (here, a team) is sometimes in one column, sometimes in another. We want to compute the difference in the entity's value from one timepoint to the next. But this is not straightforward.

A solution

Our dataset is FiveThirtyEight's NFL Elo Rankings. Note, I have cropped this dataset to seasons 2007 & after.

In [2]:
elo_df = pd.read_csv('Data/nfl_games.csv')

Here is what our dataframe looks like:

In [4]:
elo_df.head()
Out[4]:
date season neutral playoff team1 team2 elo1 elo2 elo_prob1 score1 score2 result1
0 2007-09-06 2007 0 0 IND NO 1653.923 1515.973 0.762833 41 10 1.0
1 2007-09-09 2007 0 0 WSH MIA 1448.841 1489.941 0.534341 16 13 1.0
2 2007-09-09 2007 0 0 BUF DEN 1516.684 1558.574 0.533209 14 15 0.0
3 2007-09-09 2007 0 0 SEA TB 1511.936 1417.385 0.714726 20 6 1.0
4 2007-09-09 2007 0 0 CLE PIT 1396.563 1568.569 0.350697 7 34 0.0

The entries are by game. We want to calculate the change in a team's Elo ranking score from one week to the next. We will ultimately want to execute a 'shift' to get the before- and after-Elo scores in the same row. But first, we must address the fact that sometimes a team is 'team1' and other times it is 'team2':

In [5]:
elo_df[(elo_df['team1'] == 'NO') | (elo_df['team2'] == 'NO')][['date','team1','team2','elo1','elo2']].head()
Out[5]:
date team1 team2 elo1 elo2
0 2007-09-06 IND NO 1653.923 1515.973
28 2007-09-16 TB NO 1402.979 1500.922
47 2007-09-24 NO TEN 1468.805 1506.098
74 2007-10-07 NO CAR 1437.203 1484.207
85 2007-10-14 SEA NO 1531.254 1422.503

To remedy this, we'll effectively perform two "self-joins" of the table. One on team1 and its Elo score and game score, and one on team2

In [6]:
elo_append_cols_h = ['date','season','neutral','playoff','team1','elo1','score1']
elo_append_cols_a = ['date','season','neutral','playoff','team2','elo2','score2']
In [7]:
elo_df_h = elo_df[elo_append_cols_h]
elo_df_a = elo_df[elo_append_cols_a]
In [8]:
elo_df_h = elo_df_h.rename(columns={'elo1':'elo', 'team1':'team','score1':'score'})
elo_df_a = elo_df_a.rename(columns={'elo2':'elo', 'team2':'team','score2':'score'})

Here is what one half of the self-joined dataframes looks like. It contains only the data for team 1. The other dataframe, as you can guess, contains only the data for team2.

In [9]:
elo_df_h.head()
Out[9]:
date season neutral playoff team elo score
0 2007-09-06 2007 0 0 IND 1653.923 41
1 2007-09-09 2007 0 0 WSH 1448.841 16
2 2007-09-09 2007 0 0 BUF 1516.684 14
3 2007-09-09 2007 0 0 SEA 1511.936 20
4 2007-09-09 2007 0 0 CLE 1396.563 7

We append one df to the other, sort by date, and then apply a shift:

In [10]:
elo_df_flat = elo_df_h.append(elo_df_a, ignore_index=True)

elo_df_flat.sort_values('date', inplace=True)
elo_df_flat.reset_index(inplace=True, drop=True)

elo_df_flat['elo_after'] = elo_df_flat.groupby('team')['elo'].shift(-1)
elo_df_flat['elo_delta'] = elo_df_flat['elo_after'] - elo_df_flat['elo']

Recall from our glimpses above that IND played NO in the first game. Their elo_delta's should sum to 0, and they do:

In [13]:
elo_df_flat.head(2)
Out[13]:
date season neutral playoff team elo score elo_after elo_delta
0 2007-09-06 2007 0 0 IND 1653.923 41 1668.974 15.051
1 2007-09-06 2007 0 0 NO 1515.973 10 1500.922 -15.051

Now, we merge back with the original dataframe on date & team = team1, merging only 'elo_delta'.

In [14]:
elo_df_flat_for_merge = elo_df_flat[['date','team','elo_delta']]

elo_df = pd.merge(elo_df, elo_df_flat_for_merge, left_on=['date','team1'], right_on=['date','team'])

elo_df.drop(columns='team', inplace=True)
elo_df.rename(columns={'elo_delta':'elo_delta1'}, inplace=True)
In [15]:
elo_df.head()
Out[15]:
date season neutral playoff team1 team2 elo1 elo2 elo_prob1 score1 score2 result1 elo_delta1
0 2007-09-06 2007 0 0 IND NO 1653.923 1515.973 0.762833 41 10 1.0 15.051
1 2007-09-09 2007 0 0 WSH MIA 1448.841 1489.941 0.534341 16 13 1.0 12.772
2 2007-09-09 2007 0 0 BUF DEN 1516.684 1558.574 0.533209 14 15 0.0 -7.471
3 2007-09-09 2007 0 0 SEA TB 1511.936 1417.385 0.714726 20 6 1.0 14.406
4 2007-09-09 2007 0 0 CLE PIT 1396.563 1568.569 0.350697 7 34 0.0 -22.287

Voila! $\Delta$Elo's for a given game, computed from the next week's Elo Ranking for the teams involved.