Speeding up past-60-day mean in pandas - python

I am using data from a past Kaggle challenge: panel data across a number of stores over a period spanning 2.5 years. Each observation is the number of customers for a given store-date. For each store-date, my objective is to compute the average number of customers that visited the store during the previous 60 days.
Below is code that does exactly what I need. However, it takes forever - it would need a whole night to process the ~800k rows. I am looking for a clever way to achieve the same objective faster.
I have included 5 observations of the initial dataset with the relevant variables: store id (Store), Date and number of customers (Customers).
Note:
For each row in the iteration, I end up writing the results using .loc instead of e.g. row["Lagged No of customers"], because assigning into "row" does not write anything to the underlying cells. I wonder why that is.
I normally populate new columns using apply with axis=1, so I would really appreciate a solution based on that. I have found that apply works fine when, for each row, the computation uses values from that same row across columns. However, I don't know how an apply function can involve different rows, which is what this problem requires. The only exception I have seen so far is diff, which is not useful here.
Thanks.
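(On the first note: iterrows hands each row back as a freshly built Series, i.e. a copy, which is why assigning into row has no effect on the frame. A minimal demonstration with a throwaway frame, for reference:)
import pandas as pd

# iterrows yields each row as a new Series (a copy), so writing into `row`
# never touches the original frame; hence the .loc writes in the code below.
df = pd.DataFrame({'a': [1, 2]})
for _, row in df.iterrows():
    row['a'] = 99   # modifies only the temporary copy
print(df)           # column 'a' is still [1, 2]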
Sample data:
pd.DataFrame({
    'Store': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
    'Customers': {0: 668, 1: 578, 2: 619, 3: 635, 4: 785},
    'Date': {
        0: pd.Timestamp('2013-01-02 00:00:00'),
        1: pd.Timestamp('2013-01-03 00:00:00'),
        2: pd.Timestamp('2013-01-04 00:00:00'),
        3: pd.Timestamp('2013-01-05 00:00:00'),
        4: pd.Timestamp('2013-01-07 00:00:00')
    }
})
Code that works but is incredibly slow:
import pandas as pd
import numpy as np
data = pd.read_csv("Rossman - no of cust/dataset.csv")
data.Date = pd.to_datetime(data.Date)
data.Customers = data.Customers.astype(int)
for index, row in data.iterrows():
    d = row["Date"]
    store = row["Store"]
    time_condition = (d - data["Date"] < np.timedelta64(60, 'D')) & (d > data["Date"])
    sub_df = data.loc[time_condition & (data["Store"] == store), :]
    data.loc[(data["Date"] == d) & (data["Store"] == store), "Lagged No customers"] = sub_df["Customers"].sum()
    data.loc[(data["Date"] == d) & (data["Store"] == store), "No of days"] = len(sub_df["Customers"])
    if len(sub_df["Customers"]) > 0:
        data.loc[(data["Date"] == d) & (data["Store"] == store), "Av No of customers"] = int(sub_df["Customers"].sum() / len(sub_df["Customers"]))

Given your small sample data, I used a two day rolling average instead of 60 days.
>>> (pd.rolling_mean(data.pivot(columns='Store', index='Date', values='Customers'), window=2)
.stack('Store'))
Date Store
2013-01-03 1 623.0
2013-01-04 1 598.5
2013-01-05 1 627.0
2013-01-07 1 710.0
dtype: float64
By taking a pivot of the data with dates as your index and stores as your columns, you can simply take a rolling average. You then need to stack the stores to get the data back into the correct shape.
Here is some sample output of the original data prior to the final stack:
Store 1 2 3
Date
2015-07-29 541.5 686.5 767.0
2015-07-30 534.5 664.0 769.5
2015-07-31 550.5 613.0 822.0
After .stack('Store'), this becomes:
Date Store
2015-07-29 1 541.5
2 686.5
3 767.0
2015-07-30 1 534.5
2 664.0
3 769.5
2015-07-31 1 550.5
2 613.0
3 822.0
dtype: float64
Assuming the above is named df, you can then merge it back into your original data as follows:
data.merge(df.reset_index(),
           how='left',
           on=['Date', 'Store'])
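(Side note: pd.rolling_mean was removed in later pandas versions. Below is a sketch of an equivalent in current pandas, written against the question's column names; it also handles the 60-day requirement directly with a time-based window that excludes the current day, matching the strict inequalities in the question's loop.)
import pandas as pd

data = pd.DataFrame({
    'Store': [1, 1, 1, 1, 1],
    'Customers': [668, 578, 619, 635, 785],
    'Date': pd.to_datetime(['2013-01-02', '2013-01-03', '2013-01-04',
                            '2013-01-05', '2013-01-07']),
})
data = data.sort_values(['Store', 'Date'])

# Mean over the strictly-previous 60 calendar days per store.
# closed='neither' drops both the current day and the day exactly 60 days back.
avg = (data.set_index('Date')
           .groupby('Store')['Customers']
           .rolling('60D', closed='neither')
           .mean()
           .rename('Av No of customers')
           .reset_index())
data = data.merge(avg, on=['Store', 'Date'], how='left')
print(data)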
EDIT:
There is a clear seasonal pattern in the data for which you may want to make adjustments. In any case, you probably want your rolling average to be in multiples of seven to represent even weeks. I've used a time window of 63 days in the example below (9 weeks).
In order to avoid losing data on stores that just opened (and those at the start of the time period), you can specify min_periods=1 in the rolling mean function. This will give you the average value over all available observations within your given time window.
df = data.loc[data.Customers > 0, ['Date', 'Store', 'Customers']]
result = (pd.rolling_mean(df.pivot(columns='Store', index='Date', values='Customers'),
                          window=63, min_periods=1)
            .stack('Store'))
result.name = 'Customers_63d_mvg_avg'
df = df.merge(result.reset_index(), on=['Store', 'Date'], how='left')
>>> df.sort_values(['Store', 'Date']).head(8)
Date Store Customers Customers_63d_mvg_avg
843212 2013-01-02 1 668 668.000000
842103 2013-01-03 1 578 623.000000
840995 2013-01-04 1 619 621.666667
839888 2013-01-05 1 635 625.000000
838763 2013-01-07 1 785 657.000000
837658 2013-01-08 1 654 656.500000
836553 2013-01-09 1 626 652.142857
835448 2013-01-10 1 615 647.500000
To more clearly see what is going on, here is a toy example:
s = pd.Series([1,2,3,4,5] + [np.NaN] * 2 + [6])
>>> pd.concat([s, pd.rolling_mean(s, window=4, min_periods=1)], axis=1)
0 1
0 1 1.0
1 2 1.5
2 3 2.0
3 4 2.5
4 5 3.5
5 NaN 4.0
6 NaN 4.5
7 6 5.5
The window is four observations, but note that the final value of 5.5 equals (5 + 6) / 2. The 4.0 and 4.5 values are (3 + 4 + 5) / 3 and (4 + 5) / 2, respectively.
In our example, the NaN rows of the pivot table do not get merged back into df because we did a left join and all the rows in df have one or more Customers.
You can view a chart of the rolling data as follows:
df.set_index(['Date', 'Store']).unstack('Store').plot(legend=False)

Related

Faster alternative for implementing this pandas solution

I have a dataframe which contains sales information for products. What I need to do is create a function which, given a product id, product type and date, calculates the average sales over all dates up to the given date.
This is how I have implemented it, but this approach takes a lot of time, and I was wondering if there is a faster way to do this.
Dataframe:
import pandas as pd
import numpy as np

product_type = ['A', 'B']
df = pd.DataFrame({
    'prod_id': np.repeat(np.arange(start=2, stop=5, step=1), 235),
    'prod_type': np.random.choice(np.array(product_type), 705),
    'sales_time': pd.date_range(start='1-1-2018', end='3-30-2018', freq='3H'),
    'sale_amt': np.random.randint(4, 100, size=705)
})
Current code:
def cal_avg(product, ptype, pdate):
    temp_df = df[(df['prod_id'] == product) & (df['prod_type'] == ptype) & (df['sales_time'] <= pdate)]
    return temp_df['sale_amt'].mean()
Calling the function:
cal_avg(2,'A','2018-02-12 15:00:00')
53.983
If you are only calling the cal_avg function rarely, then I suggest ignoring my answer. Otherwise, it might be beneficial to simply calculate the expanding-window average once for each product/product type. It might be slow depending on your dataset size (in which case maybe just run it on specific product types?), but you'll only need to run it once. First sort by the column you want to perform the expanding operation on (expanding is missing the 'on' parameter) to ensure the proper row order. Then groupby and transform each group (to keep the indices of the original dataframe) with your expanding-window aggregation of choice (in this case 'mean').
df = df.sort_values('sales_time')
df['exp_mean_sales'] = df.groupby(['prod_id', 'prod_type'])['sale_amt'].transform(lambda gr: gr.expanding().mean())
With the result being:
df.head()
prod_id prod_type sales_time sale_amt exp_mean_sales
0 2 B 2018-01-01 00:00:00 8 8.000000
1 2 B 2018-01-01 03:00:00 72 40.000000
2 2 B 2018-01-01 06:00:00 33 37.666667
3 2 A 2018-01-01 09:00:00 81 81.000000
4 2 B 2018-01-01 12:00:00 83 49.000000
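To answer the original query for an arbitrary product, type and date, one could then look up the last precomputed value at or before that date. A sketch, assuming the df and exp_mean_sales column built above (the helper name here is made up):
def cal_avg_precomputed(product, ptype, pdate):
    # latest expanding mean for this product/type at or before pdate
    sub = df[(df['prod_id'] == product)
             & (df['prod_type'] == ptype)
             & (df['sales_time'] <= pdate)]
    return sub['exp_mean_sales'].iloc[-1] if len(sub) else float('nan')

cal_avg_precomputed(2, 'A', '2018-02-12 15:00:00')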
Check the code below, with a %%timeit comparison (run on Google Colab).
import pandas as pd
import numpy as np

product_type = ['A', 'B']
df = pd.DataFrame({
    'prod_id': np.repeat(np.arange(start=2, stop=5, step=1), 235),
    'prod_type': np.random.choice(np.array(product_type), 705),
    'sales_time': pd.date_range(start='1-1-2018', end='3-30-2018', freq='3H'),
    'sale_amt': np.random.randint(4, 100, size=705)
})

## OP's function
def cal_avg(product, ptype, pdate):
    temp_df = df[(df['prod_id'] == product) & (df['prod_type'] == ptype) & (df['sales_time'] <= pdate)]
    return temp_df['sale_amt'].mean()
## Numpy data prep
prod_id_array = np.array(df.values[:,:1])
prod_type_array = np.array(df.values[:,1:2])
sales_time_array = np.array(df.values[:,2:3], dtype=np.datetime64)
values = np.array(df.values[:,3:])
OP's function -
%%timeit
cal_avg(2,'A','2018-02-12 15:00:00')
Output:
Numpy version
%%timeit -n 1000
cal_vals = [2, 'A', '2018-02-12 15:00:00']
# chain the two logical_and calls; a third positional argument would be treated as `out`
mask = np.logical_and(np.logical_and(prod_id_array == cal_vals[0], prod_type_array == cal_vals[1]),
                      sales_time_array <= np.datetime64(cal_vals[2]))
np.mean(values[mask])
Output:

Pandas: Annualized Returns

I have a dataframe with quarterly returns of financial entities and I want to calculate 1-, 3-, 5- and 10-year annualized returns. The formula for calculating annualized returns is:
R = product(1+r)^(4/N) - 1
where r are the quarterly returns of an entity and N is the number of quarters.
For example, the 3-year annualized return is:
R_3yr = product(1+r)^(4/12) - 1 = ((1+r1)*(1+r2)*(1+r3)*...*(1+r12))^(1/3) - 1
where r1, r2, r3, ..., r12 are the quarterly returns of the previous 11 quarters plus the current quarter.
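(For reference, a direct Python translation of this formula; a sketch, not part of the original post's code:)
import numpy as np

def annualized_return(quarterly_returns):
    # R = prod(1 + r) ** (4 / N) - 1, where N is the number of quarters
    r = np.asarray(quarterly_returns, dtype=float)
    return np.prod(1 + r) ** (4 / len(r)) - 1

annualized_return([0.02] * 12)  # 12 quarters of 2% -> about 0.0824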
I wrote code that produces the right results, but it is very slow because it loops through each row of the dataframe. The code below is an extract covering the 1-year and 3-year annualized returns (I applied the same approach for 5-, 7-, 10-, 15- and 20-year returns). r_qrt is the field with the quarterly returns.
import pandas as pd
import numpy as np

#create dataframe where I append the results
df_final = pd.DataFrame()
columns = ['Date', 'Entity', 'r_qrt', 'R_1yr', 'R_3yr']

#loop through the dataframe
for row in df.itertuples():
    R_1yr = np.nan  #1-year annualized return
    R_3yr = np.nan  #3-year annualized return
    #Calculate 1 YR Annualized Return
    date_previous_period = row.Date + pd.DateOffset(years=-1)
    temp_table = df.loc[(df['Date'] > date_previous_period) &
                        (df['Date'] <= row.Date) &
                        (df['Entity'] == row.Entity)]
    if temp_table['r_qrt'].count() >= 4:
        b = (1 + (temp_table.r_qrt))[-4:].product()
        R_1yr = (b - 1)
    #Calculate 3 YR Annualized Return
    date_previous_period = row.Date + pd.DateOffset(years=-3)
    temp_table = df.loc[(df['Date'] > date_previous_period) &
                        (df['Date'] <= row.Date) &
                        (df['Entity'] == row.Entity)]
    if temp_table['r_qrt'].count() >= 12:
        b = (1 + (temp_table.r_qrt))[-12:].product()
        R_3yr = ((b ** (1 / 3)) - 1)
    d = [row.Date, row.Entity, row.r_qrt, R_1yr, R_3yr]
    df_final = df_final.append(pd.Series(d, index=columns), ignore_index=True)
df_final looks as below (only reporting 1-year return results for space limitations)
Date        Entity  r_qrt     R_1yr
2015-03-31  A       0.035719  NaN
2015-06-30  A       0.031417  NaN
2015-09-30  A       0.030872  NaN
2015-12-31  A       0.029147  0.133335
2016-03-31  A       0.022100  0.118432
2016-06-30  A       0.020329  0.106408
2016-09-30  A       0.017676  0.092245
2016-12-31  A       0.017304  0.079676
2015-03-31  B       0.034705  NaN
2015-06-30  B       0.037772  NaN
2015-09-30  B       0.036726  NaN
2015-12-31  B       0.031889  0.148724
2016-03-31  B       0.029567  0.143020
2016-06-30  B       0.028958  0.133312
2016-09-30  B       0.028890  0.124746
2016-12-31  B       0.030389  0.123110
I am sure there is a more efficient way to run the same calculations but I have not been able to find it. My code is not efficient and takes more than 2 hours for large dataframes with long time series and many entities.
Thanks
See https://www.investopedia.com/terms/a/annualized-total-return.asp for the definition of annualized return.
data = [3, 7, 5, 12, 1]

def annualize_rate(data):
    retVal = 0
    accum = 1
    for item in data:
        print(1 + (item / 100))
        accum *= 1 + (item / 100)
    retVal = pow(accum, 1 / len(data)) - 1
    return retVal

print(annualize_rate(data))
output
0.05533402290765199
2015 (a and b)
data=[0.133335,0.148724]
print(annualize_rate(data))
output:
0.001410292043902306
2016 (a&b)
data=[0.079676,0.123110]
print(annualize_rate(data))
output
0.0010139064424810051
You can store each year's annualized value and then use pct_change to get a 3-year result:
data = [0.05, 0.06, 0.07]
df = pd.DataFrame({'Annualized': data})
df['Percent_Change'] = df['Annualized'].pct_change().fillna(0)
amount = 1
returns_plus_one = df['Percent_Change'] + 1
cumulative_return = returns_plus_one.cumprod()
df['Cumulative'] = cumulative_return.mul(amount)
df['2item'] = df['Cumulative'].rolling(window=2).mean()
df['2item'].plot()
print(df)
For future reference of other users, this is the new version of the code that I implemented following Golden Lion's suggestion:
def compoundfunct(arr):
    return np.product(1 + arr) ** (4 / len(arr)) - 1

# 1-yr annualized return
df["R_1Yr"] = df.groupby('Entity')['r_qrt'].rolling(4).apply(compoundfunct).groupby('Entity').shift(0).reset_index().set_index('level_1').drop('Entity', axis=1)

# 3-yr annualized return
df["R_3Yr"] = df.groupby('Entity')['r_qrt'].rolling(12).apply(compoundfunct).groupby('Entity').shift(0).reset_index().set_index('level_1').drop('Entity', axis=1)
The previous code took 36.4 sec for a dataframe of 5,640 rows; the new code is more than 10x faster, at 2.8 sec.
One of the issues with this new code is that one has to make sure the rows are sorted by group (Entity in my case) and by date before running the calculations, otherwise the results could be wrong.
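A minimal guard for that caveat, as a sketch (assuming the Entity and Date columns shown earlier), is to sort once before computing the rolling columns:
df = df.sort_values(['Entity', 'Date']).reset_index(drop=True)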
Thanks,
S.

Pandas dataframe vectorized bucketing/aggregation?

The Task
I have a dataframe that looks like this:
date                 money_spent ($)  meals_eaten  weight
2021-01-01 10:00:00              350            5     140
2021-01-02 18:00:00              250            2     170
2021-01-03 12:10:00              200            3     160
2021-01-04 19:40:00              100            1     150
I want to discretize this so that it "cuts" the rows every $X spent. I want to know some statistics on how much is being done for every $X I spend.
So if I were to use $500 as a threshold, the first two rows would fall in the first cut, and I could aggregate the remaining columns as follows:
first date of the cut
average meals_eaten
minimum weight
maximum weight
So the final table would be two rows like this:
date                 cumulative_spent ($)  meals_eaten  min_weight  max_weight
2021-01-01 10:00:00                   600          3.5         140         170
2021-01-03 12:10:00                   300          2.0         150         160
My Approach:
My first instinct is to calculate the cumsum() of the money_spent (assume the data is sorted by date), then I use pd.cut() to basically make a new column, we call it spent_bin, that determines each row's bin.
Note: In this toy example, spent_bin would basically be: [0,500] for the first two rows and (500-1000] for the last two.
Then it's fairly simple, I do a groupby spent_bin then aggregate as follows:
.agg({
    'date': 'first',
    'meals_eaten': 'mean',
    'weight': ['min', 'max']
})
What I've Tried
import pandas as pd

rows = [
    {"date": "2021-01-01 10:00:00", "money_spent": 350, "meals_eaten": 5, "weight": 140},
    {"date": "2021-01-02 18:00:00", "money_spent": 250, "meals_eaten": 2, "weight": 170},
    {"date": "2021-01-03 12:10:00", "money_spent": 200, "meals_eaten": 3, "weight": 160},
    {"date": "2021-01-05 22:07:00", "money_spent": 100, "meals_eaten": 1, "weight": 150}]
df = pd.DataFrame.from_dict(rows)
df['date'] = pd.to_datetime(df.date)
df['cum_spent'] = df.money_spent.cumsum()
print(df)
print(pd.cut(df.cum_spent, 500))
For some reason, I can't get the cut step to work. Here is my toy code from above. The labels are not cleanly [0, 500], (500, 1000] for some reason. Honestly, I'd settle for [350, 500], (500, 800] (these are the actual cumulative-sum values at the edges of the cuts), but I can't even get that to work, even though I'm doing exactly what the documentation example does. Any help with this?
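For what it's worth, passing an integer as the second argument to pd.cut asks for that many equal-width bins, which is why the labels look odd; passing explicit bin edges cuts every $500 instead. A sketch against the toy code above (it does not yet handle the boundary-row caveat described below):
import numpy as np

step = 500
edges = np.arange(0, df.cum_spent.max() + step, step)
print(pd.cut(df.cum_spent, bins=edges))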
Caveats and Difficulties:
It's pretty easy to write this in a for loop, of course: just do a while cum_spent < 500:. The problem is that I have millions of rows in my actual dataset, and it currently takes me 20 minutes to process a single df this way.
There's also a minor issue that sometimes rows will break the interval. When that happens, I want that last row included. This shows up in the toy example, where row #2 actually ends at $600, not $500; but it is the first row that ends at or surpasses $500, so I'm including it in the first bin.
The customized function below (cumli) achieves the cumulative sum with a reset at the limit: it flags the row on which the running total reaches the limit, and the reversed cumulative sum of those flags then provides the group labels for the groupby.
df['new'] = cumli(df['money_spent ($)'].values, 500)
out = df.groupby(df.new.iloc[::-1].cumsum()).agg(
    date=('date', 'first'),
    meals_eaten=('meals_eaten', 'mean'),
    min_weight=('weight', 'min'),
    max_weight=('weight', 'max')).sort_index(ascending=False)
Out[81]:
date meals_eaten min_weight max_weight
new
1 2021-01-01 3.5 140 170
0 2021-01-03 2.0 150 160
from numba import njit

@njit
def cumli(x, lim):
    total = 0
    result = []
    for i, y in enumerate(x):
        check = 0
        total += y
        if total >= lim:
            total = 0
            check = 1
        result.append(check)
    return result

How to sum over a Pandas dataframe conditionally

I'm looking for an efficient way (without looping) to add a column to a dataframe, containing a sum over a column of that same dataframe, filtered by some values in the row. Example:
Dataframe:
ClientID Date Orders
123 2020-03-01 23
123 2020-03-05 10
123 2020-03-10 7
456 2020-02-22 3
456 2020-02-25 15
456 2020-02-28 5
...
I want to add a column "orders_last_week" containing the total number of orders for that specific client in the 7 days before the given date.
The Excel equivalent would be something like:
SUMIFS([orders],[ClientID],ClientID,[Date]>=Date-7,[Date]<Date)
So this would be the result:
ClientID Date Orders Orders_Last_Week
123 2020-03-01 23 0
123 2020-03-05 10 23
123 2020-03-10 7 10
456 2020-02-22 3 0
456 2020-02-25 15 3
456 2020-02-28 5 18
...
I can solve this with a loop, but since my dataframe contains >20M records, this is not a feasible solution. Can anyone please help me out?
Much appreciated!
I'll assume your dataframe is named df. I'll also assume that dates aren't repeated for a given ClientID, and are in ascending order (If this isn't the case, do a groupby sum and sort the result so that it is).
The gist of my solution is, for a given ClientID and Date:
Use groupby.transform to split this problem up by ClientID.
Use rolling to check the preceding 7 rows for dates that are within the 1-week timespan.
In those 7 rows, dates within the timespan are labelled True (=1). Dates that are not are labelled False (=0).
In those 7 rows, multiply the Orders column by the True/False labelling of dates.
Sum the result.
Actually, we use 8 rows, because, e.g., SuMoTuWeThFrSaSu has 8 days.
What makes this hard is that rolling aggregates columns one at a time, and so doesn't obviously allow you to work with multiple columns when aggregating. If it did, you could make a filter using the date column, and use that to sum the orders.
There is a loophole, though: you can use multiple columns if you're happy to smuggle them in via the index!
I use some helper functions below. Note that a is understood to be a pandas Series of up to 8 rows holding the "Orders" values, with "Date" in the index.
Curious to know what performance is like on your real data.
import pandas as pd

data = {
    'ClientID': {0: 123, 1: 123, 2: 123, 3: 456, 4: 456, 5: 456},
    'Date': {0: '2020-03-01', 1: '2020-03-05', 2: '2020-03-10',
             3: '2020-02-22', 4: '2020-02-25', 5: '2020-02-28'},
    'Orders': {0: 23, 1: 10, 2: 7, 3: 3, 4: 15, 5: 5}
}
df = pd.DataFrame(data)

# Make sure the dates are datetimes
df['Date'] = pd.to_datetime(df['Date'])

# Put into index so we can smuggle them through "rolling"
df = df.set_index(['ClientID', 'Date'])

def date(a):
    # get the "Date" index-column from the dataframe
    return a.index.get_level_values('Date')

def previous_week(a):
    # get a column of 0s and 1s identifying the previous week
    # (compared to the date in the last row in a).
    return (date(a) >= date(a)[-1] - pd.DateOffset(days=7)) * (date(a) < date(a)[-1])

def previous_week_order_total(a):
    # compute the order total for the previous week
    return sum(previous_week(a) * a)

def total_last_week(group):
    # for a "ClientID" compute all the "previous week order totals"
    return group.rolling(8, min_periods=1).apply(previous_week_order_total, raw=False)

# Ok, actually compute this
df['Orders_Last_Week'] = df.groupby(['ClientID']).transform(total_last_week)

# Reset the index back so you can have the ClientID and Date columns back
df = df.reset_index()
The above code relies upon the fact that the past week encompasses at most 7 rows of data, i.e. the 7 days in a week (although in your example it is actually fewer than 7).
If your time window is something other than a week, you'll need to replace all the references to the length of a week with the corresponding count in terms of the finest division of your timestamps.
For example, if your date timestamps are spaced no closer than 1 second apart and you are interested in a time window of 1 minute (e.g. "Orders_last_minute"), replace pd.DateOffset(days=7) with pd.DateOffset(seconds=60), and group.rolling(8, ...) with group.rolling(61, ...).
Obviously, this code is a bit pessimistic: for each row it always looks at 61 rows in that case. Unfortunately, rolling does not offer a suitable variable-window-size function. I suspect that in some cases a Python loop that takes advantage of the fact that the dataframe is sorted by date might run faster than this partly-vectorized solution.
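For reference, newer pandas also supports time-offset rolling windows, which sidestep the fixed 8-row assumption. This is not the answer above, just an alternative sketch on the same toy data (it assumes dates are sorted within each ClientID):
import pandas as pd

df = pd.DataFrame({
    'ClientID': [123, 123, 123, 456, 456, 456],
    'Date': pd.to_datetime(['2020-03-01', '2020-03-05', '2020-03-10',
                            '2020-02-22', '2020-02-25', '2020-02-28']),
    'Orders': [23, 10, 7, 3, 15, 5],
})
df = df.sort_values(['ClientID', 'Date'])

# Sum of Orders over [Date - 7 days, Date) per client; empty windows become 0.
last_week = (df.set_index('Date')
               .groupby('ClientID')['Orders']
               .rolling('7D', closed='left')
               .sum()
               .fillna(0)
               .rename('Orders_Last_Week')
               .reset_index())
df = df.merge(last_week, on=['ClientID', 'Date'], how='left')
print(df)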

How to convert this to a for-loop with an output to CSV

I'm trying to put together a generic piece of code that would:
1. Take a time series for some price data and divide it into deciles, e.g. take the past 18m of gold prices and divide it into deciles [DONE, see below]
date        4. close  decile
2017-01-03    1158.2       0
2017-01-04    1166.5       1
2017-01-05    1181.4       2
2017-01-06    1175.7       1
...              ...     ...
2018-04-23    1326.0       7
2018-04-24    1333.2       8
2018-04-25    1327.2       7

[374 rows x 2 columns]
2. Pull out the dates for a particular decile, then create a secondary datelist with an added 30 days
#So far only for a single decile at a time
firstdecile = gold.loc[gold['decile'] == 1]
datelist = list(pd.to_datetime(firstdecile.index))
datelist2 = list(pd.to_datetime(firstdecile.index) + pd.DateOffset(months=1))
3. Take an average of those 30-day price returns for each decile
level1 = gold.ix[datelist]
level2 = gold.ix[datelist2]
level2.index = level2.index - pd.DateOffset(months=1)
result = pd.merge(level1, level2, how='inner', left_index=True, right_index=True)

def ret(one, two):
    return (two - one) / one

pricereturns = result.apply(lambda x: ret(x['4. close_x'], x['4. close_y']), axis=1)
mean = pricereturns.mean()
4. Return the list of all 10 averages in a single CSV file
So far I've been able to put together something functional that does steps 1-3, but only for a single decile, and I'm struggling to expand this into looped code covering all 10 deciles at once with a clean CSV output.
First append the close price at t + 1 month as a new column on the whole dataframe.
gold2_close = gold.loc[gold.index + pd.DateOffset(months=1), 'close']
gold2_close.index = gold.index
gold['close+1m'] = gold2_close
In practice, however, what matters is the number of trading days, i.e. you won't have prices for weekends or holidays. So I'd suggest you shift by a number of rows rather than by a date range, i.e. the next 20 trading days:
gold['close+20'] = gold['close'].shift(periods=-20)
Now calculate the expected return for each row
gold['ret'] = (gold['close+20'] - gold['close']) / gold['close']
You can also combine steps 1 and 2 directly, so you don't need the additional column (this only works if you shift by a number of rows, not by a fixed date range, because of the reindexing):
gold['ret'] = (gold['close'].shift(periods=-20) - gold['close']) / gold['close']
Since you already have your deciles, you just need to groupby the deciles and aggregate the returns with mean()
gold_grouped = gold.groupby(by="decile").mean()
Putting in some random data, you get something like the dataframe below. close and ret are the averages for each decile. You can create a CSV from a dataframe via pandas.DataFrame.to_csv.
close ret
decile
0 1238.343597 -0.018290
1 1245.663315 0.023657
2 1254.073343 -0.025934
3 1195.941312 0.009938
4 1212.394511 0.002616
5 1245.961831 -0.047414
6 1200.676333 0.049512
7 1181.179956 0.059099
8 1214.438133 0.039242
9 1203.060985 0.029938
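For reference, a compact end-to-end sketch of the above on made-up data; the column names close and decile are assumptions standing in for the question's 4. close column, and decile_returns.csv is just an example filename:
import numpy as np
import pandas as pd

# Made-up price series standing in for the gold data (assumed column names).
rng = np.random.default_rng(0)
dates = pd.bdate_range('2017-01-03', periods=374)
gold = pd.DataFrame({'close': 1200 + rng.normal(0, 20, len(dates)).cumsum()}, index=dates)
gold['decile'] = pd.qcut(gold['close'], 10, labels=False)

# 20-trading-day forward return, then average per decile, then write to CSV.
gold['ret'] = (gold['close'].shift(periods=-20) - gold['close']) / gold['close']
gold_grouped = gold.groupby(by='decile').mean()
gold_grouped.to_csv('decile_returns.csv')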
