Applying function to Pandas Groupby - python
I'm currently working with panel data in Python and I'm trying to compute the rolling average for each time series observation within a given group (ID).
Given the size of my data set (thousands of groups with multiple time periods), the .groupby() and .apply() combination is taking far too long to compute: it has been running for over an hour with no result, even though the entire data set only contains around 300k observations.
Ultimately, I want to iterate over multiple columns, doing the following:
Compute a rolling average for each time step in a given column, per group ID
Create a new column containing the difference between the original value and the moving average, x_t - (x_{t-1} + x_t)/2 (a worked check follows the sample data below)
Store the column in a new DataFrame, identical to the original data set except that it holds the residual from #2 instead of the original value.
Repeat and append new residuals to df_resid (as seen below)
df_resid
date id rev_resid exp_resid
2005-09-01 1 NaN NaN
2005-12-01 1 -10000 -5500
2006-03-01 1 -352584 -262058.5
2006-06-01 1 240000 190049.5
2006-09-01 1 82648.75 37724.25
2005-09-01 2 NaN NaN
2005-12-01 2 4206.5 24353
2006-03-01 2 -302574 -331951
2006-06-01 2 103179 117405.5
2006-09-01 2 -52650 -72296.5
Here's a small sample of the original data.
df
date id rev exp
2005-09-01 1 745168.0 545168.0
2005-12-01 1 725168.0 534168.0
2006-03-01 1 20000.0 10051.0
2006-06-01 1 500000.0 390150.0
2006-09-01 1 665297.5 465598.5
2005-09-01 2 956884.0 736987.0
2005-12-01 2 965297.0 785693.0
2006-03-01 2 360149.0 121791.0
2006-06-01 2 566507.0 356602.0
2006-09-01 2 461207.0 212009.0
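As a worked check of the formula in step #2: the -10000 rev_resid for id 1 on 2005-12-01 is 725168.0 - (745168.0 + 725168.0)/2 = 725168.0 - 735168.0 = -10000.0.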
And the (very slow) code:
df['rev_resid'] = df.groupby('id')['rev'].apply(lambda x: x.rolling(window=2, center=False).mean())
I'm hoping there is a much more computationally efficient way to do this (primarily with respect to #1), one that could be extended to multiple columns.
Any help would be truly appreciated.
To speed up the calculation: if the dataframe is already sorted on 'id', you don't have to do the rolling within a groupby (if it isn't sorted, sort it first). Then, since your window is only length 2, we can mask the result by checking where id == id.shift(). This works because the frame is sorted.
d1 = df[['rev', 'exp']]
df.join(
d1.rolling(2).mean().rsub(d1).add_suffix('_resid')[df.id.eq(df.id.shift())]
)
date id rev exp rev_resid exp_resid
0 2005-09-01 1 745168.0 545168.0 NaN NaN
1 2005-12-01 1 725168.0 534168.0 -10000.00 -5500.00
2 2006-03-01 1 20000.0 10051.0 -352584.00 -262058.50
3 2006-06-01 1 500000.0 390150.0 240000.00 190049.50
4 2006-09-01 1 665297.5 465598.5 82648.75 37724.25
5 2005-09-01 2 956884.0 736987.0 NaN NaN
6 2005-12-01 2 965297.0 785693.0 4206.50 24353.00
7 2006-03-01 2 360149.0 121791.0 -302574.00 -331951.00
8 2006-06-01 2 566507.0 356602.0 103179.00 117405.50
9 2006-09-01 2 461207.0 212009.0 -52650.00 -72296.50
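An aside, not part of the original answer: because the window is exactly 2, the residual x_t - (x_{t-1} + x_t)/2 simplifies algebraically to (x_t - x_{t-1})/2, so a plain diff() reproduces the same result with the same mask:

d1 = df[['rev', 'exp']]
# diff()/2 equals the window-2 residual; mask the first row of each id as before
df.join(d1.diff().div(2).add_suffix('_resid')[df.id.eq(df.id.shift())])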
Related
Python: Comparing row values in a time period, conditionally
This is a sample of a pandas dataframe that I'm working on.

          ID      DATE      HOUR   TYPE  CODE       CITY
0  222304678  27/09/22  15:19:00  50201     3     Manila
1  222304694  18/09/22  10:46:00  30202     2  Innsbruck
2  222081537  18/09/22  10:47:00  30202     1  Innsbruck
3  221848197  17/09/22  21:54:00  30202     2     Austin
4  221455590  13/09/22   4:50:00  30409     2     Panama
5  220540157  06/09/22  12:29:00  30603     3     Sydney
6  220367113  06/09/22  12:32:00  30202     2     Sydney
7  221380583  06/09/22  12:56:00  30204     4     Sydney
8  221381826  06/09/22  12:58:00  30202     1     Sydney
9  221365584  22/08/22  12:35:00  50202     1      Tokyo

When a row has CODE = 1, I need to compare it with the rows that occurred up to 30 minutes before it, under the following conditions: the same city, the same date, and codes other than 1. I then need to create another dataframe with the rows that met the condition (or at least highlight them). I have tried with df.loc, but I don't know how to express the time range.
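No answer is shown for this one, so here is a minimal sketch of one possible approach (my own, with column names taken from the sample and assuming the DATE/HOUR strings parse with dayfirst=True):

import pandas as pd

# build a full timestamp from the separate date and hour strings
df['ts'] = pd.to_datetime(df['DATE'] + ' ' + df['HOUR'], dayfirst=True)

ones = df[df['CODE'] == 1]
others = df[df['CODE'] != 1]

# pair each CODE == 1 row with every other-code row in the same city, then
# keep pairs where the other event happened 0-30 minutes earlier (a window
# that short also satisfies the same-date rule, except across midnight)
pairs = ones.merge(others, on='CITY', suffixes=('', '_other'))
delta = pairs['ts'] - pairs['ts_other']
matched = pairs[(delta > pd.Timedelta(0)) & (delta <= pd.Timedelta('30min'))]

This cross-join-style merge grows quickly with city size; for large frames, pd.merge_asof on the timestamp would be worth considering instead.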
Sum two columns only if the values of one column are greater than 0
I've got the following dataframe:

lst = [['01012021', '', 100], ['01012021', '', '50'], ['01022021', 140, 5],
       ['01022021', 160, 12], ['01032021', '', 20], ['01032021', 200, 25]]
df1 = pd.DataFrame(lst, columns=['Date', 'AuM', 'NNA'])

I am looking for code which sums the columns AuM and NNA only if the column AuM contains a value. The result is shown below:

lst = [['01012021', '', 100, ''], ['01012021', '', '50', ''], ['01022021', 140, 5, 145],
       ['01022021', 160, 12, 172], ['01032021', '', 20, '']]
df2 = pd.DataFrame(lst, columns=['Date', 'AuM', 'NNA', 'Sum'])
It is not a good practice to use '' in place of NaN when you have numeric data. That said, a generic solution to your issue would be to use sum with the skipna=False option:

df1['Sum'] = (df1[['AuM', 'NNA']]                    # you can use as many columns as you want
              .apply(pd.to_numeric, errors='coerce') # convert to numeric
              .sum(1, skipna=False)                  # sum if all are non-NaN
              .fillna('')                            # fill NaN with empty string (bad practice)
              )

output:

       Date  AuM NNA    Sum
0  01012021      100
1  01012021       50
2  01022021  140   5  145.0
3  01022021  160  12  172.0
4  01032021       20
5  01032021  200  25  225.0
I assume you mean to include the last row too:

df2 = (df1.assign(Sum=df1.loc[df1.AuM.ne(""), ["AuM", "NNA"]].sum(axis=1))
          .fillna(""))
print(df2)

Result:

       Date  AuM NNA    Sum
0  01012021      100
1  01012021       50
2  01022021  140   5  145.0
3  01022021  160  12  172.0
4  01032021       20
5  01032021  200  25  225.0
Comparing one date column to another in a different row
I have a dataframe df_corp:

ID  arrival_date  leaving_date
1   01/02/20      05/02/20
2   01/03/20      07/03/20
1   12/02/20      20/02/20
1   07/03/20      10/03/20
2   10/03/20      15/03/20

I would like to find the difference between the leaving_date of a row and the arrival_date of the next entry with the same ID. Basically, I want to know how long before they book again. So it'll look something like this:

ID  arrival_date  leaving_date  time_between
1   01/02/20      05/02/20      NaN
2   01/03/20      07/03/20      NaN
1   12/02/20      20/02/20      7
1   07/03/20      10/03/20      15
2   10/03/20      15/03/20      3

I've tried grouping by ID, but I'm seriously lost on how to get the value from the next row and a different column in one step.
You need to convert with to_datetime and perform a GroupBy.shift to get the previous departure date:

# arrival
a = pd.to_datetime(df_corp['arrival_date'], dayfirst=True)
# previous departure per ID
l = pd.to_datetime(df_corp['leaving_date'], dayfirst=True).groupby(df_corp['ID']).shift()
# difference in days
df_corp['time_between'] = (a - l).dt.days

output:

   ID arrival_date leaving_date  time_between
0   1     01/02/20     05/02/20           NaN
1   2     01/03/20     07/03/20           NaN
2   1     12/02/20     20/02/20           7.0
3   1     07/03/20     10/03/20          16.0
4   2     10/03/20     15/03/20           3.0
Output raw value difference from one period to the next using Python
I have a dataset, df, with a new value for each day. I would like to output the percent difference of these values from row to row, as well as the raw value difference:

Date        Value
10/01/2020  1
10/02/2020  2
10/03/2020  5
10/04/2020  8

Desired output:

Date        Value  PercentDifference  ValueDifference
10/01/2020  1
10/02/2020  2      100                2
10/03/2020  5      150                3
10/04/2020  8      60                 3

This is what I am doing:

import pandas as pd
df = pd.read_csv('df.csv')
df = (df.merge(df.assign(Date=df['Date'] - pd.to_timedelta('1D')), on='Date')
        .assign(Value=lambda x: x['Value_y'] - x['Value_x'])
        [['Date', 'Value']]
      )
df['PercentDifference'] = [f'{x:.2%}' for x in (df['Value'].div(df['Value'].shift(1)) - 1).fillna(0)]

A member has helped me with the code above; I am also trying to incorporate the value difference, as shown in my desired output.

Note: is there a way to incorporate a 'period', say, checking the percent difference and value difference over a 7-day period, a 30-day period, and so on? Any suggestion is appreciated.
Use Series.pct_change and Series.diff:

df['PercentageDiff'] = df['Value'].pct_change().mul(100)
df['ValueDiff'] = df['Value'].diff()

         Date  Value  PercentageDiff  ValueDiff
0  10/01/2020      1             NaN        NaN
1  10/02/2020      2           100.0        1.0
2  10/03/2020      5           150.0        3.0
3  10/04/2020      8            60.0        3.0

Or use df.assign:

df.assign(
    PercentageDiff=df["Value"].pct_change().mul(100),
    ValueDiff=df["Value"].diff()
)
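The question's note about longer periods is not covered above; both methods accept a periods argument, so a sketch (assuming one row per calendar day):

# compare each value to the one 7 rows (here: 7 days) earlier
df['PercentDiff_7d'] = df['Value'].pct_change(periods=7).mul(100)
df['ValueDiff_7d'] = df['Value'].diff(periods=7)

With a DatetimeIndex and possibly missing days, shift(freq='7D') or resampling would be more robust.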
Merge/concat 2 dataframes with different holiday dates
I would like to merge/concat with how="outer" two dataframes with different sets of holiday dates. The Date column is a string. Both dataframes' prices exclude non-pricing days, e.g. public holidays and weekends.

Assuming Dataframe 1 follows US holidays:

df1_US_holiday

Date       Price_A
5/6/2020   2
5/5/2020   3
5/4/2020   4
5/1/2020   5
4/30/2020  6
4/29/2020  1
4/28/2020  3
4/27/2020  1

Assuming Dataframe 2 follows China holidays (note: 1-5 May is a China holiday):

df2_China_holiday

Date       Price_B
5/6/2020   4
4/30/2020  3
4/29/2020  2
4/28/2020  2
4/27/2020  5

Expected merge/concat result:

Date       Price_A  Price_B
5/6/2020   2        4
5/5/2020   3        NaN
5/4/2020   4        NaN
5/1/2020   5        NaN
4/30/2020  6        3
4/29/2020  1        2
4/28/2020  3        2
4/27/2020  1        5

Ultimately, I would like to fill the NaN values with fillna(method='bfill'). Should I include a holiday library for this merge/concat?
Pandas provides various facilities for easily combining Series or DataFrames, with various kinds of set logic for the indexes and relational-algebra functionality in the case of join/merge-type operations. Take a look at the merging section of the pandas documentation; it covers what you want to achieve.
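To make that concrete, here is a minimal sketch (mine, not the answerer's) of an outer merge followed by the back-fill the question mentions. No holiday library is needed for the merge itself, because the outer join keeps every date that appears in either frame:

import pandas as pd

# outer merge keeps dates that exist in only one of the two frames
merged = df1_US_holiday.merge(df2_China_holiday, on='Date', how='outer')

# parse the string dates so the frame can be restored to reverse-chronological order
merged['Date'] = pd.to_datetime(merged['Date'], format='%m/%d/%Y')
merged = merged.sort_values('Date', ascending=False).reset_index(drop=True)

# back-fill the holiday gaps, as the question intends
merged[['Price_A', 'Price_B']] = merged[['Price_A', 'Price_B']].bfill()

Note that DataFrame.bfill() is the current spelling; fillna(method='bfill') is deprecated in recent pandas versions.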