I am trying to take a data frame of historical game statistics (df1 below) and build a second data frame that shows what each column's averages were going into each game (as I show in df2). How can I use groupby or something else to find the averages for each team, but only over games dated prior to the date in that specific row? Example of the historical games data:
Df1 = Date Team Opponent Points Points Against 1st Downs Win?
4/16/20 Eagles Ravens 10 20 10 0
2/10/20 Eagles Falcons 30 40 8 0
12/15/19 Eagles Cardinals 40 10 7 1
11/15/19 Eagles Giants 20 15 5 1
10/12/19 Jets Giants 10 18 2 1
Below is the dataframe I'm trying to create. As you can see, it shows the averages for each column, but only over the games that happened prior to each game. Note: this is a simplified example of a much larger data set I'm working with. In case the context helps, I'm creating this dataframe so I can analyze the correlation between the averages and whether the team won.
Df2 = Date Team Opponent Avg Pts Avg Pts Against Avg 1st Downs Win %
4/16/20 Eagles Ravens 25.0 21.3 7.5 75%
2/10/20 Eagles Falcons 30.0 12.0 6.0 100%
12/15/19 Eagles Cardinals 20.0 15.0 5.0 100%
11/15/19 Eagles Giants NaN NaN NaN NaN
10/12/19 Jets Giants NaN NaN NaN NaN
Let me know if anything above isn't clear, appreciate the help.
The easiest way is to turn your dataframe into a time series by making the Date column the index.
For a CSV file, for example, run:
data = pd.read_csv(r'C:\Users\...csv', index_col='Date', parse_dates=True)
You can then slice by date; this keeps every row up to (and including) the cutoff:
data[:'the cutoff date']
If you want to build a Series with a time index:
index=pd.DatetimeIndex(['2014-07-04',...,'2015-08-04'])
data=pd.Series([0, 1, 2, 3], index=index)
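As a concrete sketch (with made-up dates), slicing a time-indexed Series keeps everything up to and including the cutoff:
import pandas as pd
index = pd.DatetimeIndex(['2014-07-04', '2014-08-04', '2015-07-04', '2015-08-04'])
data = pd.Series([0, 1, 2, 3], index=index)
# partial-string slicing on a DatetimeIndex: rows up to and including 2015-01-01
print(data[:'2015-01-01'])
# 2014-07-04    0
# 2014-08-04    1
# dtype: int64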
Alternatively, define your own function:
import numpy as np

def aggs_under_date(df, date):
    first_team = df.Team.iloc[0]
    first_opponent = df.Opponent.iloc[0]
    if df.Date.iloc[0] <= date:
        avg_points = df.Points.mean()
        avg_against = df['Points Against'].mean()
        avg_downs = df['1st Downs'].mean()
        wins = df['Win?']
        win_perc = f'{wins.sum() / wins.count() * 100} %'
        return [first_team, first_opponent, avg_points, avg_against, avg_downs, win_perc]
    else:
        return [first_team, first_opponent, np.nan, np.nan, np.nan, np.nan]
And do the groupby, applying the function you just defined:
date_max = pd.to_datetime('11/15/19')
Df1.groupby(['Date']).apply(aggs_under_date, date_max)
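For the per-team "averages over strictly earlier games" the question asks for, a vectorized sketch using an expanding mean may be closer to what you want (this assumes Date is already a datetime column and each team plays at most one game per date):
import pandas as pd
stat_cols = ['Points', 'Points Against', '1st Downs', 'Win?']
# sort chronologically so the expanding window only looks backwards,
# then shift(1) so each row's averages exclude the current game;
# each team's first game comes out as NaN, as in the desired Df2
prior = (Df1.sort_values('Date')
            .groupby('Team')[stat_cols]
            .transform(lambda s: s.expanding().mean().shift()))
# 'Win?' averages to a fraction (0.75 rather than 75%)
Df2 = Df1[['Date', 'Team', 'Opponent']].join(prior).rename(
    columns={'Points': 'Avg Pts', 'Points Against': 'Avg Pts Against',
             '1st Downs': 'Avg 1st Downs', 'Win?': 'Win %'})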
I have this scenario: I'm in the process of learning and I'm cleaning a dataset, and I have a problem.
A lot of rows have this issue: either I have the key but not the product name, or I have the product name but not the key.
prod_key product
0 21.0 NaN
1 21.0 NaN
2 0.0 metal
3 35.0 NaN
4 22.0 NaN
5 0.0 wood
I know that the key for metal is 24 and the key for wood is 25.
The product name that belongs to key 21 is plastic, and the product name that belongs to key 22 is paper.
There are hundreds of rows in the same situation, so renaming each and every one of them by hand would take a lot of time.
I created a dictionary and then used the .map() method, but I'm still unable to 'merge' (or you could say 'mix') the missing values in both columns without overwriting the other column's value.
Thank you
You can create an extra lookup dataframe and merge twice:
lst = [
['metal', 24],
['wood', 25],
['plastic', 21],
['paper', 22]
]
df2 = pd.DataFrame(lst, columns=['name', 'key'])
# fill missing product names by looking up prod_key, then missing keys by looking up product
df1['product'].update(df1.merge(df2, left_on='prod_key', right_on='key', how='left')['name'])
df1['prod_key'].update(df1.merge(df2, left_on='product', right_on='name', how='left')['key'])
print(df2)
name key
0 metal 24
1 wood 25
2 plastic 21
3 paper 22
print(df1)
prod_key product
0 21.0 plastic
1 21.0 plastic
2 24.0 metal
3 35.0 NaN
4 22.0 paper
5 25.0 wood
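Alternatively, since you already built a dictionary, a map()-based sketch works too (assuming 0.0 is how a missing key is encoded, as in your sample):
import numpy as np
key_to_name = {24: 'metal', 25: 'wood', 21: 'plastic', 22: 'paper'}
name_to_key = {name: key for key, name in key_to_name.items()}
# fill missing product names from the keys ...
df1['product'] = df1['product'].fillna(df1['prod_key'].map(key_to_name))
# ... then fill missing keys (encoded as 0.0) from the names
df1['prod_key'] = df1['prod_key'].replace(0.0, np.nan).fillna(df1['product'].map(name_to_key))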
I have two dataframes: one at a lower (detail) level and one that summarizes the data at a higher level. I'm trying to add a new column to the summary table that sums the total spending of all people who are fans of a particular sport. I.e., in the summary row for soccer I do NOT want to sum the total soccer spending, but the total sports spending of anyone who spends anything on soccer.
df = pd.DataFrame({'Person': [1,2,3,3,3],
'Sport': ['Soccer','Tennis','Tennis','Football','Soccer'],
'Ticket_Cost': [10,20,10,10,20]})
df2 = pd.DataFrame({'Sport': ['Soccer','Tennis','Football']})
I can currently do this in many steps, but I'm sure there is a more efficient/quicker way. Here is how I currently do it.
#Calculate the total spend for each person in an temporary dataframe
df_intermediate = df.groupby(['Person'])['Ticket_Cost'].sum()
df_intermediate = df_intermediate.rename("Total_Sports_Spend")
Person Total_Sports_Spend
1 10
2 20
3 40
#place this total in the detailed table
df = pd.merge(df,df_intermediate,how='left',on='Person')
#Create a second temporary dataframe
df_intermediate2 = df.groupby(['Sport'])['Total_Sports_Spend'].sum()
Sport Total_Sports_Spend
Football 40
Soccer 50
Tennis 60
#Merge this table with the summary table
df2 = pd.merge(df2,df_intermediate2,how='left',on='Sport')
Sport Total_Sports_Spend
0 Soccer 50
1 Tennis 60
2 Football 40
Finally, I clean up the temporary dataframes and remove the extra column from the detailed table. I'm sure there is a better way.
You might want to pivot your DataFrame into a 2D table:
df2 = df.pivot_table(index='Person', columns='Sport', values='Ticket_Cost')
You get
Sport Football Soccer Tennis
Person
1 NaN 10.0 NaN
2 NaN NaN 20.0
3 10.0 20.0 10.0
Now you can compute the total spending per person:
total = df2.sum(axis=1)
which is
Person
1 10.0
2 20.0
3 40.0
dtype: float64
Next, you place each person's total spending in the cells of df2 where that person has a positive value:
df3 = (df2>0).mul(total, axis=0)
which is here:
Sport Football Soccer Tennis
Person
1 0.0 10.0 0.0
2 0.0 0.0 20.0
3 40.0 40.0 40.0
Finally, you just have to sum each column to get the per-sport totals:
spending = df3.sum(axis=0)
and will get what you expect.
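A more compact sketch of the same computation, using the original df and the summary df2 from the question (and assuming each (Person, Sport) pair appears at most once, as in the merge-based version above):
# total spend per person, broadcast back onto every detail row
df['Total_Sports_Spend'] = df.groupby('Person')['Ticket_Cost'].transform('sum')
# for each sport, sum the totals of everyone who spent on it
sport_totals = df.groupby('Sport')['Total_Sports_Spend'].sum().reset_index()
df2 = df2.merge(sport_totals, on='Sport', how='left')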
I have 4 Excel files that I have to merge into one Excel file.
Demography file containing ID, Initials, Age, and Sex.
Laboratory file containing ID, Initials Test name, Test date, and Test Value.
Medical History containing ID, Initials, Medical condition, Start and Stop Dates.
Medication given containing ID, Initials, Drug name, dose, frequency, start and stop dates.
There are 50 patients. The demography file contains exactly 50 rows, one per patient. The other files cover the same 50 patients but have between 100 and 400 rows, because each patient has multiple lab tests or multiple drugs.
When I merge in pandas, I get duplicates or entities assigned to the wrong patients. The challenge is to do this in a way such that, when a patient has more medication rows than lab tests, the missing lab values are left blank instead of duplicated.
This is a shortened representation:
import pandas as pd
lab = pd.read_excel('data/data.xlsx', sheet_name='lab')
drugs = pd.read_excel('data/data.xlsx', sheet_name='drugs')
merged_data = pd.merge(drugs, lab, on='ID', how='left')
merged_data.to_excel('merged_data.xls')
You get this result: [screenshot: pandas merge result]
I would prefer this result: [screenshot: preferred output]
Consider using cumcount() on a groupby() and then join on both that field and ID:
# number each patient's rows 0, 1, 2, ... within their ID group
drugs['GrpCount'] = drugs.groupby('ID').cumcount()
lab['GrpCount'] = lab.groupby('ID').cumcount()
# joining on ID plus the within-patient row number pairs rows up one-to-one;
# surplus drug rows get NaN for the lab columns
merged_data = pd.merge(drugs, lab, on=['ID', 'GrpCount'], how='left').drop(['GrpCount'], axis=1)
# ID Initials_x Drug Name Frequency Route Start Date End Date Initials_y Name Result Date Result
# 0 1 AB AMPICLOX NaN Oral 21-Jun-2016 21-Jun-2016 AB Rapid Diagnostic Test 30-May-16 Abnormal
# 1 1 AB CIPROFLOXACIN Daily Oral 30-May-2016 03-Jun-2016 AB Microscopy 30-May-16 Normal
# 2 1 AB Ibuprofen Tablet 400 mg Two Times a Day Oral 06-Oct-2016 10-Oct-2016 NaN NaN NaN NaN
# 3 1 AB COARTEM NaN Oral 17-Jun-2016 17-Jun-2016 NaN NaN NaN NaN
# 4 1 AB INJECTABLE ARTESUNATE 12 Hourly Intravenous 01-Jun-2016 02-Jun-2016 NaN NaN NaN NaN
# 5 1 AB COTRIMOXAZOLE Daily Oral 30-May-2016 12-Jun-2016 NaN NaN NaN NaN
# 6 1 AB METRONIDAZOLE Two Times a Day Oral 30-May-2016 03-Jun-2016 NaN NaN NaN NaN
# 7 2 SS GENTAMICIN Daily Intravenous 04-Jun-2016 04-Jun-2016 SS Microscopy 6-Jun-16 Abnormal
# 8 2 SS METRONIDAZOLE 8 Hourly Intravenous 04-Jun-2016 06-Jun-2016 SS Complete Blood Count 6-Oct-16 Recorded
# 9 2 SS Oral Rehydration Salts Powder PRN Oral 06-Jun-2016 06-Jun-2016 NaN NaN NaN NaN
# 10 2 SS ZINC 8 Hourly Oral 06-Jun-2016 06-Jun-2016 NaN NaN NaN NaN
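If some patient could instead have more lab rows than drug rows, a hedged variation is an outer join, which keeps the surplus rows from either side:
merged_data = pd.merge(drugs, lab, on=['ID', 'GrpCount'], how='outer').drop(['GrpCount'], axis=1)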
Suppose I have the following DataFrame:
>>> cols = ['model', 'parameter', 'condition', 'value']
>>> df = pd.DataFrame([['BMW', '0-60', 'rain', '7'], ['BMW', '0-60', 'sun', '7'],
['BMW','mpg', 'rain','25'],
['BMW', 'stars', 'rain','5'],
['Toyota', '0-60', 'rain','9'],
['Toyota','mpg', 'rain','40'],
['Toyota', 'stars', 'rain','4']], columns=cols)
>>> df
model parameter condition value
0 BMW 0-60 rain 7
1 BMW 0-60 sun 7
2 BMW mpg rain 25
3 BMW stars rain 5
4 Toyota 0-60 rain 9
5 Toyota mpg rain 40
6 Toyota stars rain 4
This is a list of performance metrics for various cars under different conditions. It's a made-up data set, of course, but it's representative of my problem.
What I ultimately want is each observation for a given condition on its own row, and each metric in its own column. That would look something like this:
parameter condition 0-60 mpg stars
model
0 BMW rain 7 25 5
1 BMW sun 7 NaN NaN
2 Toyota rain 9 40 4
Note that I just made up the format above. I don't know if Pandas would generate something exactly like that, but that's the general idea. I would also of course transform the "condition" into a Boolean array and fill in the NaNs.
My problem is that when I try to use the pivot method I get an error. I think this is because my "columns" key is repeated (I have BMW 0-60 stats for both the rain and the sun conditions).
df.pivot(index='model',columns='parameter')
ValueError: Index contains duplicate entries, cannot reshape
Does anyone know of a slick way to do this? I'm finding a lot of these Pandas reshaping methods to be quite obtuse.
You can just change the index and unstack it...
df.set_index(['model', 'condition', 'parameter']).unstack()
returns
value
parameter 0-60 mpg stars
model condition
BMW rain 7 25 5
sun 7 NaN NaN
Toyota rain 9 40 4
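If you'd rather end up with ordinary flat columns, a small follow-up sketch: select the 'value' column before unstacking (to avoid the extra column level), then reset the index:
out = (df.set_index(['model', 'condition', 'parameter'])['value']
         .unstack()
         .reset_index())
# columns are now: model, condition, 0-60, mpg, stars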
You can get the result you want using pivot_table and passing the following parameters:
>>> df.pivot_table(index=['model', 'condition'], values='value', columns='parameter')
parameter 0-60 mpg stars
model condition
BMW rain 7 25 5
sun 7 NaN NaN
Toyota rain 9 40 4
(You may need to ensure the "value" column has a numeric type first, or else pass aggfunc=lambda x: x to pivot_table to get around this requirement.)
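For instance, a one-line conversion sketch before pivoting:
# make the 'value' column numeric so pivot_table's default mean works
df['value'] = pd.to_numeric(df['value'])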
Here is a sample of my data:
In[177]:df_data[['Date', 'TeamName', 'Opponent', 'ScoreOff']].head()
Out[177]:
Date TeamName Opponent ScoreOff
4128 2005-09-08 00:00:00 New England Patriots Oakland Raiders 30
4129 2005-09-08 00:00:00 Oakland Raiders New England Patriots 20
4130 2005-09-11 00:00:00 Arizona Cardinals New York Giants 19
4131 2005-09-11 00:00:00 Baltimore Ravens Indianapolis Colts 7
4132 2005-09-11 00:00:00 Buffalo Bills Houston Texans 22
For each row, I need to set a new column ['OpponentScoreOff'] equal to that team's opponent's ScoreOff on that day.
I have done it by basically doing the following, but it's slow and I feel like there is a more pythonic/vectorized way to do it.
g1 = df_data.groupby('Date')
for date, teams in g1:
    g2 = teams.groupby('TeamName')
    for teamname, game in g2:
        mask = (df_data['TeamName'] == teamname) & (df_data['Date'] == date)
        opp_mask = (df_data['Opponent'] == teamname) & (df_data['Date'] == date)
        df_data.loc[mask, 'OppScoreOff'] = df_data.loc[opp_mask, 'ScoreOff'].values
It worked, but it's slow. Any better way to do this?
You could use sort_values to take advantage of the bijection between TeamName and Opponent on any given date. Consider the following:
import pandas as pd
import numpy as np
df_data = df_data.sort_values(['Date', 'TeamName'])
opp_score = np.array(df_data.sort_values(['Date', 'Opponent'])['ScoreOff'])
df_data['OpponentScoreOff'] = opp_score
The np.array call is necessary to strip off the DataFrame index; that way the values aren't realigned back to their original order when they're put back into df_data.
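Alternatively, a self-merge sketch that looks up each opponent's row on the same date (assuming exactly one row per team per date):
# rename the lookup copy so its team column lines up with 'Opponent'
opp = df_data[['Date', 'TeamName', 'ScoreOff']].rename(
    columns={'TeamName': 'Opponent', 'ScoreOff': 'OpponentScoreOff'})
df_data = df_data.merge(opp, on=['Date', 'Opponent'], how='left')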