Conditional expanding group aggregation pandas - python

For some data preprocessing I have a huge dataframe where I need historical performance within groups. However, since it is for a predictive model that runs a week before the target, I cannot use any data from that week in between. There is a variable number of rows per day per group, which means I cannot always discard the last 7 values by shifting the expanding functions; I have to somehow condition on the datetime of the rows before. I can write my own function to apply on the groups, but in my experience that is usually very slow (albeit flexible). This is how I did it without conditioning on date, just looking at previous records:
df.loc[:, 'new_col'] = df_gr['old_col'].apply(lambda x: x.expanding(5).mean().shift(1))
The 5 means I want a sample size of at least 5, otherwise the result should be NaN.
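For reference, the 5 is the min_periods argument of expanding; a minimal sketch of that behaviour on a toy Series:
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6])
# expanding(5) is shorthand for expanding(min_periods=5):
# the first four results are NaN, after that each value is the running mean so far
print(s.expanding(5).mean())  # NaN, NaN, NaN, NaN, 3.0, 3.5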
Small example with aggr_mean looking at the mean of all samples within group A at least a week earlier:
group |    dt    | value | aggr_mean
  A   | 01-01-16 |   5   | NaN
  A   | 03-01-16 |   4   | NaN
  A   | 08-01-16 |  12   | 5    (only looks at the first row)
  A   | 17-01-16 |  11   | 7    (looks at the first three rows, since all are at least a week earlier)

new answer
using @JulienMarrec's better example:
dt group value
2016-01-01 A 5
2016-01-03 A 4
2016-01-08 A 12
2016-01-17 A 11
2016-01-04 B 10
2016-01-05 B 5
2016-01-08 B 12
2016-01-17 B 11
Reshape df into a more useful form:
d1 = df.drop(columns='group')
d1.index = [df.group, df.groupby('group').cumcount().rename('gidx')]
d1
Create a custom function that does what the old answer did, then apply it within each group:
def lag_merge_asof(df, lag):
    # expanding mean indexed by date, then pushed `lag` days into the future
    d = df.set_index('dt').value.expanding().mean()
    d.index = d.index + pd.offsets.Day(lag)
    d = d.reset_index(name='aggr_mean')
    # backward as-of join: each row gets the latest mean that is at least `lag` days old
    return pd.merge_asof(df, d, on='dt')
d1.groupby(level='group').apply(lag_merge_asof, lag=7)
We can tidy up the output with this:
d1.groupby(level='group').apply(lag_merge_asof, lag=7) \
    .reset_index('group').reset_index(drop=True)
old answer
Create a lookback dataframe by offsetting the dates by 7 days, then use it with pd.merge_asof:
lookback = df.set_index('dt').value.expanding().mean()
lookback.index += pd.offsets.Day(7)
lookback = lookback.reset_index(name='aggr_mean')
lookback
pd.merge_asof(df, lookback, left_on='dt', right_on='dt')
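A note on why this works: pd.merge_asof does a backward as-of join by default, so each row of df picks up the last lookback row whose (already offset) date is at or before its own dt, which is what enforces the "at least 7 days earlier" condition. A tiny illustration with made-up values:
import pandas as pd

left = pd.DataFrame({'dt': pd.to_datetime(['2016-01-08', '2016-01-10'])})
right = pd.DataFrame({'dt': pd.to_datetime(['2016-01-08', '2016-01-12']),
                      'aggr_mean': [5.0, 4.5]})
# both frames must be sorted on the key; direction='backward' is the default
print(pd.merge_asof(left, right, on='dt'))
#           dt  aggr_mean
# 0 2016-01-08        5.0
# 1 2016-01-10        5.0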

Given this dataframe where I added another group in order to more clearly see what's happening:
dt group value
2016-01-01 A 5
2016-01-03 A 4
2016-01-08 A 12
2016-01-17 A 11
2016-01-04 B 10
2016-01-05 B 5
2016-01-08 B 12
2016-01-17 B 11
Let's load it:
df = pd.read_clipboard(index_col=0, sep='\s+', parse_dates=True)
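If you prefer not to rely on the clipboard, a hedged equivalent is to build the same frame directly from the table above:
import pandas as pd

df = pd.DataFrame({
    'dt': pd.to_datetime(['2016-01-01', '2016-01-03', '2016-01-08', '2016-01-17',
                          '2016-01-04', '2016-01-05', '2016-01-08', '2016-01-17']),
    'group': ['A'] * 4 + ['B'] * 4,
    'value': [5, 4, 12, 11, 10, 5, 12, 11],
}).set_index('dt')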
Now we can group by group, resample daily, shift by 7 days, and take the expanding mean:
x = df.groupby('group')['value'].apply(lambda gp: gp.resample('1D').mean().shift(7).expanding().mean())
Now you can left-join that back onto your df:
merged = df.reset_index().set_index(['group','dt']).join(x, rsuffix='_aggr_mean', how='left')
merged

Related

Output raw value difference from one period to the next using Python

I have a dataset, df, where I have a new value for each day. I would like to output the percent difference of these values from row to row as well as the raw value difference:
Date Value
10/01/2020 1
10/02/2020 2
10/03/2020 5
10/04/2020 8
Desired output:
Date Value PercentDifference ValueDifference
10/01/2020 1
10/02/2020 2 100 2
10/03/2020 5 150 3
10/04/2020 8 60 3
This is what I am doing:
import pandas as pd
df = pd.read_csv('df.csv')
df = (df.merge(df.assign(Date=df['Date'] - pd.to_timedelta('1D')),
               on='Date')
        .assign(Value=lambda x: x['Value_y'] - x['Value_x'])
        [['Date', 'Value']]
     )
df['PercentDifference'] = [f'{x:.2%}' for x in
                           (df['Value'].div(df['Value'].shift(1)) - 1).fillna(0)]
A member has helped me with the code above; I am also trying to incorporate the value difference as shown in my desired output.
Note - Is there a way to incorporate a 'period' - say, checking the percent difference and value difference over a 7 day period and 30 day period and so on?
Any suggestion is appreciated
Use Series.pct_change and Series.diff
df['PercentageDiff'] = df['Value'].pct_change().mul(100)
df['ValueDiff'] = df['Value'].diff()
Date Value PercentageDiff ValueDiff
0 10/01/2020 1 NaN NaN
1 10/02/2020 2 100.0 1.0
2 10/03/2020 5 150.0 3.0
3 10/04/2020 8 60.0 3.0
Or you can use df.assign:
df.assign(
    percentageDiff=df["Value"].pct_change().mul(100),
    ValueDiff=df["Value"].diff()
)
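For the note about checking differences over a 7-day or 30-day period: both methods take a periods argument, so with one row per day you can compare each value against the one 7 or 30 rows back. A hedged sketch (the new column names are just suggestions):
# assumes exactly one row per calendar day, so periods=7 means "vs. 7 days ago"
df['PctDiff_7d'] = df['Value'].pct_change(periods=7).mul(100)
df['ValueDiff_7d'] = df['Value'].diff(periods=7)
df['PctDiff_30d'] = df['Value'].pct_change(periods=30).mul(100)
df['ValueDiff_30d'] = df['Value'].diff(periods=30)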

Pandas - Sum Previous Rows if Value In Column Meets Condition

I have a dataframe that is of the following type. I have all the columns except the final column, "Total Previous Points P1", which I am hoping to create:
The data is sorted by the "Date" column.
Date     | Points_P1 | P1_id | P2_id | Total_Previous_Points_P1
---------+-----------+-------+-------+-------------------------
10/08/15 |     5     |  100  |   90  |   500
11/09/16 |     5     |  100  |   90  |   500
20/09/19 |    10     | 10000 |  360  | 4,200
...      |           |  ...  |  ...  |   ...
n        |           |       |       |
Now the column I want to create, is the "Total_Previous_Points_P1" column shown above.
The way to create it:
For each row, check the date (call this DATE_VAL) and P1_id (call this ID_VAL)
Now, for all rows before DATE_VAL AND where P1 id == ID_VAL, sum up the previous points.
Put this sum in the final column, in the current row
Is there a fast pandas pythonic way to do this? My data set is very large.
Thank you!
The solution by SIA computes the sum of Points_P1 including the current value of Points_P1, whereas the requirement is to sum previous points (for all rows before...).
Assuming that dates in each group are unique (in your sample they are),
the proper, pandasonic solution should include the following steps:
Sort by Date.
Group by P1_id, then for each group:
Take Points_P1 column.
Compute cumulative sum.
Subtract the current value of Points_P1.
So the whole code should be:
df['Total_Previous_Points_P1'] = df.sort_values('Date')\
    .groupby(['P1_id']).Points_P1.cumsum() - df.Points_P1
Edit
If Date is not unique (within the group of rows with a given P1_id), the case is more complicated, as can be shown on the following source DataFrame:
Date Points_P1 P1_id
0 2016-11-09 5 100
1 2016-11-09 3 100
2 2015-10-08 5 100
3 2019-09-20 10 10000
4 2019-09-21 7 100
5 2019-07-10 12 10000
6 2019-12-10 12 10000
Note that for P1_id == 100 there are two rows with Date == 2016-11-09.
In this case, start by computing "group" sums of previous points,
for each P1_id and Date:
sumPrev = df.groupby(['P1_id', 'Date']).Points_P1.sum()\
    .groupby(level=0).apply(lambda gr: gr.shift(fill_value=0).cumsum())\
    .rename('Total_Previous_Points_P1')
The result is:
P1_id Date
100 2015-10-08 0
2016-11-09 5
2019-09-21 13
10000 2019-07-10 0
2019-09-20 12
2019-12-10 22
Name: Total_Previous_Points_P1, dtype: int64
Then merge df with sumPrev on P1_id and Date (which in sumPrev are on the index):
df = pd.merge(df, sumPrev, left_on=['P1_id', 'Date'], right_index=True)
To show the result, it is more instructive to sort df also on ['P1_id', 'Date']:
Date Points_P1 P1_id Total_Previous_Points_P1
2 2015-10-08 5 100 0
0 2016-11-09 5 100 5
1 2016-11-09 3 100 5
4 2019-09-21 7 100 13
5 2019-07-10 12 10000 0
3 2019-09-20 10 10000 12
6 2019-12-10 12 10000 22
As you can see:
The first sum for each P1_id is 0 (no points from previous dates).
E.g. for both rows with Date == 2016-11-09 the sum of previous points is 5 (which is in the row for Date == 2015-10-08).
Try:
df['Total_Previous_Points_P1'] = df.groupby(['P1_id'])['Points_P1'].cumsum()
How It Works
First, it groups the data using the P1_id feature.
Then it accesses the Points_P1 values on the grouped dataframe and apply the cumulative sum function cumsum(), which returns the sum of points up to and including the current row for each group.
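Note that, as the other answer points out, cumsum() includes the current row. If you need to exclude it and sum only strictly previous points, one hedged adjustment (assuming unique dates per P1_id, as in the simple case above) is to shift within each group before accumulating:
ordered = df.sort_values('Date')
# shift pushes each group's values down one row (first row gets 0), so the
# cumulative sum only covers earlier rows; transform keeps the original index
df['Total_Previous_Points_P1'] = (
    ordered.groupby('P1_id')['Points_P1']
           .transform(lambda s: s.shift(fill_value=0).cumsum())
)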

Weird behavior with pandas Grouper method with datetime objects

I am trying to make groups of x days within groups of another column. For some reason the grouping behavior is changed when I add another level of grouping.
See toy example below:
Create a random dataframe with 40 consecutive dates, an ID column and random values:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'dates': pd.date_range('2018-1-1', periods=40, freq='D'),
     'id': np.concatenate((np.repeat(1, 10), np.repeat(2, 30))),
     'amount': np.random.random(40)
     }
)
I want to group by id first and then make groups of let's say 7 consecutive days within these groups. I do:
(df
 .groupby(['id', pd.Grouper(key='dates', freq='7D')])
 .amount
 .agg(['mean', 'count'])
)
And the output is:
mean count
id dates
1 2018-01-01 0.591755 7
2018-01-08 0.701657 3
2 2018-01-08 0.235837 4
2018-01-15 0.650085 7
2018-01-22 0.463854 7
2018-01-29 0.643556 7
2018-02-05 0.459864 5
There is something weird going on in the second group! I would expect to see 4 groups of 7 and then a last group of 2. When I run the same code on a dataframe with just the id=2 I do get what I actually expect:
df2 = df[df.id == 2]
(df2
 .groupby(['id', pd.Grouper(key='dates', freq='7D')])
 .amount
 .agg(['mean', 'count'])
)
Output
mean count
id dates
2 2018-01-11 0.389343 7
2018-01-18 0.672550 7
2018-01-25 0.486620 7
2018-02-01 0.520816 7
2018-02-08 0.529915 2
What is going on here? Is it first creating a group of 4 in the id=2 group because the last group in id=1 group was only 3 rows? This is not what I want to do!
When you group by both id and the weekly Grouper, you get spillover from the first group into the second, because there are not enough days in the last week of group #1 to complete a full 7 days. This is obvious when you look at the first date of id 2's bins: 2018-01-08 in the first case vs. 2018-01-11 when id 2 is grouped on its own.
The workaround is to perform a groupby on id and then apply a resampling operation:
df.groupby('id').apply(
    lambda x: x.set_index('dates').amount.resample('7D').count()
)
id dates
1 2018-01-01 7
2018-01-08 3
2 2018-01-11 7
2018-01-18 7
2018-01-25 7
2018-02-01 7
2018-02-08 2
Name: amount, dtype: int64
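If you also want the mean alongside the count, as in the original output, the same workaround can aggregate both at once (a sketch along the same lines, not verified on the exact frame):
df.groupby('id').apply(
    lambda x: x.set_index('dates').amount.resample('7D').agg(['mean', 'count'])
)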

Calculate new column in pandas dataframe based only on grouped records

I have a dataframe with various events (id) and the following structure; the df is grouped by id and sorted on timestamp:
id | timestamp | A | B
1 | 02-05-2016|bla|bla
1 | 04-05-2016|bla|bla
1 | 05-05-2016|bla|bla
2 | 11-02-2015|bla|bla
2 | 14-02-2015|bla|bla
2 | 18-02-2015|bla|bla
2 | 31-03-2015|bla|bla
3 | 02-08-2016|bla|bla
3 | 07-08-2016|bla|bla
3 | 27-09-2016|bla|bla
Each timestamp-id combo indicates a different stage in the process of the event with that particular id. Each new record for a specific id indicates the start of a new stage for that event-id.
I would like to add a new column Duration that calculates the duration of each stage for each event (see the desired df below). This is easy, as I can simply calculate the difference between the timestamp of the next stage for the same event id and the timestamp of the current stage as follows:
df['Start'] = pd.to_datetime(df['timestamp'])
df['End'] = pd.to_datetime(df['timestamp'].shift(-1))
df['Duration'] = df['End'] - df['Start']
My problem appears on the last stage of each event id, as I want to simply display NaNs or dashes since the stage has not finished yet and the end time is unknown. My solution simply takes the timestamp of the next row, which is not always correct, as it might belong to a completely different event.
Desired output:
id | timestamp | A | B | Duration
1 | 02-05-2016|bla|bla| 2 days
1 | 04-05-2016|bla|bla| 1 days
1 | 05-05-2016|bla|bla| ------
2 | 11-02-2015|bla|bla| 3 days
2 | 14-02-2015|bla|bla| 4 days
2 | 18-02-2015|bla|bla| 41 days
2 | 31-03-2015|bla|bla| -------
3 | 02-08-2016|bla|bla| 5 days
3 | 07-08-2016|bla|bla| 50 days
3 | 27-09-2016|bla|bla| -------
I think this does what you want:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['Duration'] = df.groupby('id')['timestamp'].diff().shift(-1)
If I understand correctly: groupby('id') tells pandas to apply .diff().shift(-1) to each group as if it were a miniature DataFrame independent of the other rows. I tested it on this fake data:
import pandas as pd
import numpy as np
# Generate some fake data
df = pd.DataFrame()
df['id'] = [1]*5 + [2]*3 + [3]*4
df['timestamp'] = pd.to_datetime('2017-01-1')
duration = sorted(np.random.randint(30, size=len(df)))
df['timestamp'] += pd.to_timedelta(duration, unit='D')  # interpret the random ints as days
df['A'] = 'spam'
df['B'] = 'eggs'
but double-check just to be sure I didn't make a mistake!
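One caveat: the timestamps in the question are day-first strings (e.g. 31-03-2015), so when parsing them it is safer to say so explicitly; a hedged tweak to the first line:
# dayfirst=True keeps dates like 31-03-2015 from being misparsed or rejected
df['timestamp'] = pd.to_datetime(df['timestamp'], dayfirst=True)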
Here is one approach using apply
def timediff(row):
    row['timestamp'] = pd.to_datetime(row['timestamp'], format='%d-%m-%Y')
    return pd.DataFrame(row['timestamp'].diff().shift(-1))

res = df.assign(duration=df.groupby('id').apply(timediff))
Output:
id timestamp duration
0 1 02-05-2016 2 days
1 1 04-05-2016 1 days
2 1 05-05-2016 NaT
3 2 11-02-2015 3 days
4 2 14-02-2015 4 days
5 2 18-02-2015 41 days
6 2 31-03-2015 NaT
7 3 02-08-2016 5 days
8 3 07-08-2016 51 days
9 3 27-09-2016 NaT
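In either answer, if you want the dashes from the desired output instead of NaT, one hedged option is to format the column as strings for display:
# display only: Duration becomes a string column like '2 days' / '------'
df['Duration'] = df['Duration'].astype(str).replace('NaT', '------')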

Python function to add values in a Pandas Dataframe using values from another Dataframe

I am a newbie in Python and I am struggling to code things that seem simple in PHP/SQL, so I hope you can help me.
I have 2 Pandas Dataframes that I have simplified for a better understanding.
In the first DataFrame, df2015, I have the Sales for 2015.
! Notice that unfortunately, we do not have ALL the values for each store !
>>> df2015
Store Date Sales
0 1 2015-01-15 6553
1 3 2015-01-15 7016
2 6 2015-01-15 8840
3 8 2015-01-15 10441
4 9 2015-01-15 7952
And another DataFrame named df2016 for the Sales Forecast in 2016, which lists ALL the stores. (As you can guess, the SalesForecast column is the one to fill.)
>>> df2016
Store Date SalesForecast
0 1 2016-01-15
1 2 2016-01-15
2 3 2016-01-15
3 4 2016-01-15
4 5 2016-01-15
I want to create a function that, for each row in df2016, retrieves the Sales value from df2015, increases it by 5% (for example), and puts the new value in the SalesForecast column of df2016.
Let's say forecast is the function I have created and want to apply:
def forecast(store_id, date):
    sales2015 = df2015['Sales'].loc[(df2015['Store'].values == store_id) & (df2015['Date'].values == date)].values
    forecast2016 = sales2015 * 1.05
    return forecast2016
I have tested this function with hard-coded values as below and it works:
>>> forecast(1,'2015-01-15')
array([ 6880.65])
But here is where my problem lies... how can I apply this function to the dataframes?
It would be very easy in PHP: loop over each row in df2016 and retrieve the values (if they exist) from df2015 with a SELECT ... WHERE Store = store_id AND Date = date... but it seems the logic is not the same with pandas DataFrames and Python.
I have tried the apply function as follows:
df2016['SalesForecast'] = df2016.apply(df2016['Store'],df2016['Date'])
but I am unable to pass the arguments correctly, or there is something I am doing wrong.
I think I do not have the right method, or maybe my method is not suitable for pandas and Python at all?
I believe you are almost there! What's missing is the function itself; you've only passed in the args.
The apply function takes in a function and its args; see the pandas.DataFrame.apply documentation.
Without having tried this on my own system, I would suggest doing:
df2016['SalesForecast'] = df2016.apply(func=forecast, args=(df2016['Store'],df2016['Date']))
One of the nice things about Pandas is that it handles missing data well. The trick is to use a common index on both dataframes. For instance, if we set the index of both dataframes to be the 'Store' column:
df2015.set_index('Store', inplace=True)
df2016.set_index('Store', inplace=True)
Then doing what you'd like is as simple as:
df2016['SalesForecast'] = df2015['Sales'] * 1.05
resulting in:
Date SalesForecast
Store
1 2016-01-15 6880.65
2 2016-01-15 NaN
3 2016-01-15 7366.80
4 2016-01-15 NaN
5 2016-01-15 NaN
That the SalesForecast for store 2 is NaN reflects the fact that store 2 doesn't exist in the df2015 dataframe.
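If you would rather have a placeholder than NaN for stores with no 2015 sales, a hedged option (zero is just an example choice):
# hypothetical: treat missing history as a zero forecast
df2016['SalesForecast'] = df2016['SalesForecast'].fillna(0)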
Notice that if you just need to multiply the Sales column from df2015 by 1.05, you can just do so, all in df2015:
In [18]: df2015['Forecast'] = df2015['Sales'] * 1.05
In [19]: df2015
Out[19]:
Store Date Sales Forecast
0 1 2015-01-15 6553 6880.65
1 3 2015-01-15 7016 7366.80
2 6 2015-01-15 8840 9282.00
3 8 2015-01-15 10441 10963.05
4 9 2015-01-15 7952 8349.60
At this point, you can join that result onto df2016 if you need this to appear in the df2016 data set:
In [20]: pandas.merge(df2016,                       # left side of join
                      df2015,                       # right side of join
                      on='Store',                   # similar to SQL 'on' for 'join'
                      how='outer',                  # same as SQL, outer join
                      suffixes=('_2016', '_2015'))  # rename same-named columns w/suffix
Out[20]:
Store Date_2016 Date_2015 Sales Forecast
0 1 2016-01-15 2015-01-15 6553 6880.65
1 2 2016-01-15 NaN NaN NaN
2 3 2016-01-15 2015-01-15 7016 7366.80
3 4 2016-01-15 NaN NaN NaN
4 5 2016-01-15 NaN NaN NaN
5 6 2016-01-15 2015-01-15 8840 9282.00
6 7 2016-01-15 NaN NaN NaN
7 8 2016-01-15 2015-01-15 10441 10963.05
8 9 2016-01-15 2015-01-15 7952 8349.60
If the two DataFrames happen to have compatible indexes already, you can simply write the result column into df2016 directly, even if it's a computation on another DataFrame like df2015. In general, though, you need to be careful about this, and it can be more general to perform the join explicitly (as I did above by using the merge function). Which way is best will depend on your application and your knowledge of the index columns.
For more general function application to a column, a whole DataFrame, or groups of sub-frames, refer to the pandas documentation on function application. It also has cookbook examples and a comparison with the way you might express similar operations in SQL.
Note that I created data to replicate your example data with these commands:
import datetime
import pandas
from itertools import zip_longest  # izip_longest on Python 2

df2015 = pandas.DataFrame([[1, datetime.date(2015, 1, 15), 6553],
                           [3, datetime.date(2015, 1, 15), 7016],
                           [6, datetime.date(2015, 1, 15), 8840],
                           [8, datetime.date(2015, 1, 15), 10441],
                           [9, datetime.date(2015, 1, 15), 7952]],
                          columns=['Store', 'Date', 'Sales'])

df2016 = pandas.DataFrame(
    list(zip_longest(range(1, 10),
                     [datetime.date(2016, 1, 15)],
                     fillvalue=datetime.date(2016, 1, 15))),
    columns=['Store', 'Date']
)
