Self-Join to extract information in two columns¶
The problem¶
In timeseries data, information for an entity (here, a team) is sometimes in one column, sometimes in another. We want to compute the difference in the entity's value from one timepoint to the next. But this is not straightforward.
A solution¶
Our dataset is FiveThirtyEight's NFL Elo Rankings. Note, I have cropped this dataset to seasons 2007 & after.
elo_df = pd.read_csv('Data/nfl_games.csv')
Here is what our dataframe looks like:
elo_df.head()
The entries are by game. We want to calculate the change in a team's Elo ranking score from one week to the next. We will ultimately want to execute a 'shift' to get the before- and after-Elo scores in the same row. But first, we must address the fact that sometimes a team is 'team1' and other times it is 'team2':
elo_df[(elo_df['team1'] == 'NO') | (elo_df['team2'] == 'NO')][['date','team1','team2','elo1','elo2']].head()
To remedy this, we'll effectively perform two "self-joins" of the table. One on team1 and its Elo score and game score, and one on team2
elo_append_cols_h = ['date','season','neutral','playoff','team1','elo1','score1']
elo_append_cols_a = ['date','season','neutral','playoff','team2','elo2','score2']
elo_df_h = elo_df[elo_append_cols_h]
elo_df_a = elo_df[elo_append_cols_a]
elo_df_h = elo_df_h.rename(columns={'elo1':'elo', 'team1':'team','score1':'score'})
elo_df_a = elo_df_a.rename(columns={'elo2':'elo', 'team2':'team','score2':'score'})
Here is what one half of the self-joined dataframes looks like. It contains only the data for team 1. The other dataframe, as you can guess, contains only the data for team2.
elo_df_h.head()
We append one df to the other, sort by date, and then apply a shift:
elo_df_flat = elo_df_h.append(elo_df_a, ignore_index=True)
elo_df_flat.sort_values('date', inplace=True)
elo_df_flat.reset_index(inplace=True, drop=True)
elo_df_flat['elo_after'] = elo_df_flat.groupby('team')['elo'].shift(-1)
elo_df_flat['elo_delta'] = elo_df_flat['elo_after'] - elo_df_flat['elo']
Recall from our glimpses above that IND played NO in the first game. Their elo_delta's should sum to 0, and they do:
elo_df_flat.head(2)
Now, we merge back with the original dataframe on date & team = team1, merging only 'elo_delta'.
elo_df_flat_for_merge = elo_df_flat[['date','team','elo_delta']]
elo_df = pd.merge(elo_df, elo_df_flat_for_merge, left_on=['date','team1'], right_on=['date','team'])
elo_df.drop(columns='team', inplace=True)
elo_df.rename(columns={'elo_delta':'elo_delta1'}, inplace=True)
elo_df.head()
Voila! $\Delta$Elo's for a given game, computed from the next week's Elo Ranking for the teams involved.