I'm looking for an efficient way (without looping) to add a column to a dataframe, containing a sum over a column of that same dataframe, filtered by some values in the row. Example:
Dataframe:
ClientID  Date        Orders
123       2020-03-01  23
123       2020-03-05  10
123       2020-03-10  7
456       2020-02-22  3
456       2020-02-25  15
456       2020-02-28  5
...
I want to add a column "orders_last_week" containing the total number of orders for that specific client in the 7 days before the given date.
The Excel equivalent would be something like:
SUMIFS([orders],[ClientID],ClientID,[Date]>=Date-7,[Date]<Date)
So this would be the result:
ClientID  Date        Orders  Orders_Last_Week
123       2020-03-01  23      0
123       2020-03-05  10      23
123       2020-03-10  7       10
456       2020-02-22  3       0
456       2020-02-25  15      3
456       2020-02-28  5       18
...
I can solve this with a loop, but since my dataframe contains >20M records, this is not a feasible solution. Can anyone please help me out?
Much appreciated!
I'll assume your dataframe is named df. I'll also assume that dates aren't repeated for a given ClientID, and are in ascending order (If this isn't the case, do a groupby sum and sort the result so that it is).
The gist of my solution is, for a given ClientID and Date:
Use groupby.transform to split this problem up by ClientID.
Use rolling to check the next 7 rows for dates that are within the 1-week timespan.
In those 7 rows, dates within the timespan are labelled True (=1). Dates that are not are labelled False (=0).
In those 7 rows, multiply the Orders column by the True/False labelling of dates.
Sum the result.
Actually, we use 8 rows, because, e.g., SuMoTuWeThFrSaSu has 8 days.
What makes this hard is that rolling aggregates columns one at a time, and so doesn't obviously allow you to work with multiple columns when aggregating. If it did, you could make a filter using the date column, and use that to sum the orders.
There is a loophole, though: you can use multiple columns if you're happy to smuggle them in via the index!
I use some helper functions. Note that a is understood to be a pandas Series of (up to) 8 rows holding "Orders" values, with "Date" in the index.
Curious to know what performance is like on your real data.
import pandas as pd

data = {
    'ClientID': {0: 123, 1: 123, 2: 123, 3: 456, 4: 456, 5: 456},
    'Date': {0: '2020-03-01', 1: '2020-03-05', 2: '2020-03-10',
             3: '2020-02-22', 4: '2020-02-25', 5: '2020-02-28'},
    'Orders': {0: 23, 1: 10, 2: 7, 3: 3, 4: 15, 5: 5}
}
df = pd.DataFrame(data)

# Make sure the dates are datetimes
df['Date'] = pd.to_datetime(df['Date'])

# Put ClientID and Date into the index so we can smuggle them through "rolling"
df = df.set_index(['ClientID', 'Date'])

def date(a):
    # get the "Date" index-column from the dataframe
    return a.index.get_level_values('Date')

def previous_week(a):
    # get a column of 0s and 1s identifying the previous week
    # (compared to the date in the last row of a)
    return (date(a) >= date(a)[-1] - pd.DateOffset(days=7)) * (date(a) < date(a)[-1])

def previous_week_order_total(a):
    # compute the order total for the previous week
    return sum(previous_week(a) * a)

def total_last_week(group):
    # for a "ClientID", compute all the "previous week order totals"
    return group.rolling(8, min_periods=1).apply(previous_week_order_total, raw=False)

# OK, actually compute this
df['Orders_Last_Week'] = df.groupby(['ClientID']).transform(total_last_week)

# Reset the index so you get the ClientID and Date columns back
df = df.reset_index()
The above code relies on the fact that the past week encompasses at most 7 rows of data, i.e., the 7 days in a week (although in your example it is actually fewer than 7).
If your time window is something other than a week, you'll need to replace all the references to the length of a week, expressed in terms of the finest division of your timestamps.
For example, if your date timestamps are spaced no closer than 1 second, and you are interested in a time window of 1 minute (e.g., "Orders_last_minute"), replace pd.DateOffset(days=7) with pd.DateOffset(seconds=60), and group.rolling(8,... with group.rolling(61,....
Obviously, this code is a bit pessimistic: for each row it always looks at 61 rows in that case. Unfortunately rolling does not offer a suitable variable window size function. I suspect that in some cases a Python loop that takes advantage of the fact that the dataframe is sorted by date might run faster than this partly vectorized solution.
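For reference, more recent pandas versions support time-based (offset) rolling windows, which can express the "previous 7 days" directly. Below is a sketch of that alternative, not the approach above; it assumes a pandas version where rolling accepts an offset window with closed='left', and that rows are sorted by Date within each ClientID:
import pandas as pd

df = pd.DataFrame({
    'ClientID': [123, 123, 123, 456, 456, 456],
    'Date': pd.to_datetime(['2020-03-01', '2020-03-05', '2020-03-10',
                            '2020-02-22', '2020-02-25', '2020-02-28']),
    'Orders': [23, 10, 7, 3, 15, 5],
})

# Time-based rolling needs a sorted DatetimeIndex within each group.
df = df.sort_values(['ClientID', 'Date'])

# closed='left' makes each window [Date - 7 days, Date), i.e. it excludes the
# current row, matching the SUMIFS conditions in the question.
rolled = (df.set_index('Date')
            .groupby('ClientID')['Orders']
            .rolling('7D', closed='left')
            .sum())

# rolled is ordered by (ClientID, Date), the same order as the sorted df,
# so it can be assigned back by position; empty windows give NaN.
df['Orders_Last_Week'] = rolled.fillna(0).to_numpy()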
Consider this sample data created by this code:
import numpy as np
import pandas as pd

np.random.seed(0)
rng = pd.date_range('2017-09-19', periods=1000, freq='D')
randomlist = np.random.choice(1000, 10000, replace=True)
print(f'randomlist length is {len(randomlist)}')
test = pd.DataFrame({'id': randomlist[:len(rng)], 'Date': rng, 'Val': np.random.randn(len(rng))})
The desired output is a groupby id, summing all values, but only within a particular date range of the Date column. Even more complicated than that, I want to see the total Val by id for dates that are the following:
Use the date which is one month later than the earliest date for each id as the start, and one year after that starting date as the end.
So, for example, if my data appeared this way:
id Date Val
0 684 2017-09-19 0.640472
1 684 2017-10-20 -0.732568
2 501 2017-08-21 -1.141365
3 501 2017-09-22 -0.283020
4 501 2017-09-23 0.725941
5 684 2017-09-24 0.56789
I would want the groupby to only consider the dates for id 684 between 2017-10-19 (i.e. one month later than the earliest date) and 2018-10-19 (i.e. one year after the earliest date plus one month).
I have tried straight groupby and Grouper to no avail. None seem to have this ability to limit the consideration by date. Perhaps I am missing something easy? Thanks for taking a look
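One way this could be expressed (a sketch, not a verified answer): compute each id's window bounds with groupby.transform, filter the rows, then group and sum. It reuses the sample-data code from the question and assumes the window is inclusive at both ends:
import numpy as np
import pandas as pd

np.random.seed(0)
rng = pd.date_range('2017-09-19', periods=1000, freq='D')
randomlist = np.random.choice(1000, 10000, replace=True)
test = pd.DataFrame({'id': randomlist[:len(rng)], 'Date': rng,
                     'Val': np.random.randn(len(rng))})

# Per-id window: start one month after that id's earliest date,
# end one year after that start.
start = test.groupby('id')['Date'].transform('min') + pd.DateOffset(months=1)
end = start + pd.DateOffset(years=1)

in_window = (test['Date'] >= start) & (test['Date'] <= end)
result = test.loc[in_window].groupby('id')['Val'].sum()
print(result.head())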
Sample data:
Dataframe 1:
cusip_id   trd_exctn_dt  time_to_maturity
00077AA2   2015-05-09    1.20 years
00077TBO   2015-05-06    3.08 years
Dataframe 2:
Index        SVENY01  SVENY02  SVENY03  SVENY04
2015-05-09   1.35467  1.23367  1.52467  1.89467
2015-05-08   1.65467  1.87967  1.43251  1.98765
2015-05-07   1.35467  1.76567  1.90271  1.43521
2015-05-06   1.34467  1.35417  1.67737  1.11167
Desired output:
I want to match the 'trd_exctn_dt' in df1 exactly with the date in the index of df2, while at the same time matching the 'time_to_maturity' in df1 with the nearest SVENYXX column in df2 (rounded up, e.g. 1.20 years corresponds to SVENY02). For example, for cusip_id 00077AA2, the trd_exctn_dt is 2015-05-09 and the time_to_maturity is 1.20 years, so I want to obtain the value in df2 at the date 2015-05-09 in the column SVENY02.
I want to repeat this for several cusip_ids, how would I achieve this?
Any help would be appreciated!
Here is my solution code:
import pandas as pd

SVENYXX = []
for i in range(df1['cusip_id'].shape[0]):
    cusip_id = df1['cusip_id'][i]
    trd_exctn_date = df1['trd_exctn_dt'][i]
    maturity_time = df1['time_to_maturity'][i]
    svenyVals = df2.loc[trd_exctn_date]
    closestSvenyVal = svenyVals.iloc[(svenyVals - maturity_time).abs().argsort()[0]]
    SVENYXX.append(closestSvenyVal)
where df1 is Dataframe 1, df2 is Dataframe 2, and SVENYXX is the list with all the closest SVENYXX values to the given cusip_id.
I loop through all the cusip_id's and obtain the corresponding trd_exctn_dt and time_to_maturity values. Then, with the extracted data, I find the corresponding row in Dataframe 2, and by finding the lowest difference in svenyVals compared to time_to_maturity, I append that value to the SVENYXX list.
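For reference, here is a vectorized sketch of the "round the maturity up to pick the SVENYXX column" reading described in the question. It is an illustration under assumptions, not a drop-in replacement: it assumes time_to_maturity is a string like '1.20 years', that df2's index and df1['trd_exctn_dt'] hold dates of the same type, and the output column name sveny_yield is made up:
import numpy as np
import pandas as pd

# Round the maturity up to pick a column: 1.20 years -> SVENY02, 3.08 years -> SVENY04.
years = df1['time_to_maturity'].str.replace(' years', '').astype(float)
cols = 'SVENY' + np.ceil(years).astype(int).astype(str).str.zfill(2)

# df2.stack() gives a Series keyed by (date, column name); look up each
# (trd_exctn_dt, SVENYXX) pair in it.
pairs = pd.MultiIndex.from_arrays([df1['trd_exctn_dt'], cols])
df1['sveny_yield'] = df2.stack().reindex(pairs).to_numpy()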
I am attempting to learn some Pandas that I otherwise would be doing in SQL window functions.
Assume I have the following dataframe, which shows different players' previous matches and how many kills they got in each match.
date player kills
2019-01-01 a 15
2019-01-02 b 20
2019-01-03 a 10
2019-03-04 a 20
With the code below I managed to create a groupby where I only show previously summed kill values (the sum of the player's kills excluding the kills he got in the game of the current row).
df['sum_kills'] = df.groupby('player')['kills'].transform(lambda x: x.cumsum().shift())
This creates the following values:
date player kills sum_kills
2019-01-01 a 15 NaN
2019-01-02 b 20 NaN
2019-01-03 a 10 15
2019-03-04 a 20 25
However what I ideally want is the option to include a filter/where clause in the grouped values. So let's say I only wanted to get the summed values from the previous 30 days (1 month). Then my new dataframe should instead look like this:
date player kills sum_kills
2019-01-01 a 15 NaN
2019-01-02 b 20 NaN
2019-01-03 a 10 15
2019-03-04 a 20 NaN
The last row would provide zero summed_kills because no games from player a had been played over the last month. Is this possible somehow?
I think you are a bit in a pinch using groupby and transform. As explained here, transform operates on a single series, so you can't access data of other columns.
groupby and apply does not seem to be the correct way either, because the custom function is expected to return an aggregated result for the group passed by groupby, but you want a different result for each row.
So the best solution I can propose is to use apply without groupby, and perform all the selection yourself inside the custom function:
def killcount(x, data, timewin):
    """count the player's kills in a time window before the time of current row.
    x: dataframe row
    data: full dataframe
    timewin: a pandas.Timedelta
    """
    return data.loc[(data['date'] < x['date'])               # select dates preceding current row
                    & (data['date'] >= x['date'] - timewin)  # select dates in the timewin
                    & (data['player'] == x['player'])        # select rows with same player
                    ]['kills'].sum()

df['sum_kills'] = df.apply(lambda r: killcount(r, df, pd.Timedelta(30, 'D')), axis=1)
This returns:
date player kills sum_kills
0 2019-01-01 a 15 0
1 2019-01-02 b 20 0
2 2019-01-03 a 10 15
3 2019-03-04 a 20 0
In case you haven't done so yet, remember to parse the 'date' column to datetime type using pandas.to_datetime, otherwise you cannot perform the date comparisons.
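For example, assuming the dates arrive as ISO-formatted strings:
import pandas as pd

df['date'] = pd.to_datetime(df['date'])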
I have a DataFrame like this
df = pd.DataFrame( data = numpy_data, columns=['value','date'])
value date
0 64.885 2018-01-11
1 74.839 2018-01-15
2 41.481 2018-01-17
3 22.027 2018-01-17
4 53.747 2018-01-18
... ... ...
514 61.017 2018-12-22
515 68.376 2018-12-21
516 79.079 2018-12-26
517 73.975 2018-12-26
518 76.923 2018-12-26
519 rows × 2 columns
And I want to plot this value vs date and I am using this
df.plot( x='date',y='value')
And I get this
The point here is that this plot has too many fluctuations, and I want to smooth it. My idea is to group the values by date intervals and take the mean, for example every 10 days: the mean between July 1 and July 10, plotted as a point at July 5.
A long way would be to get the date range, split it into N ranges with start and end dates, filter the data by date, calculate the mean, and put the results in another DataFrame.
Is there a short way to do that?
PS: Ignore the peaks
One thing you could do for instance is to take the rolling mean of the dataframe, using DataFrame.rolling along with mean:
df = df.set_index(df.date).drop('date', axis=1)
df.rolling(3).mean().plot()
For the example dataframe you have, directly plotting the dataframe would result in:
And having taken the rolling mean, you would have:
Here I chose a window of 3, but this will depend on how smooth you want it to be.
Based on yatu's answer
The problem with that answer is that the rolling function treats the window as a number of rows, not as a span of dates. With some transformations, rolling can read a Timestamp index and use time as the window [pandas.rolling]:
df = pd.DataFrame( data = numpy_data, columns=['value','date'])
df['date'] = df.apply(lambda row: pd.Timestamp(row.date), axis=1 )
df = df.set_index(df.date).drop('date', axis=1)
df.sort_index(inplace=True)
df.rolling('10d').mean().plot( ylim=(30,100) , figsize=(16,5),grid='true')
Final results
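As a side note, if what is wanted is literally the mean over fixed 10-day bins (one point for July 1 to July 10) rather than a moving average, resample can express that directly. A sketch, assuming df is indexed and sorted by date as in the snippet above:
# Non-overlapping 10-day bins; each bin's mean becomes a single plotted point.
df['value'].resample('10D').mean().plot(figsize=(16, 5), grid=True)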
I use data from a past kaggle challenge based on panel data across a number of stores and a period spanning 2.5 years. Each observation includes the number of customers for a given store-date. For each store-date, my objective is to compute the average number of customers that visited this store during the past 60 days.
Below is code that does exactly what I need. However, it takes forever: it would take a whole night to process the c. 800k rows. I am looking for a clever way to achieve the same objective faster.
I have included 5 observations of the initial dataset with the relevant variables: store id (Store), Date and number of customers ("Customers").
Note:
For each row in the iteration, I end up writing the results using .loc instead of e.g. row["Lagged No of customers"], because assigning to "row" does not write anything into the cells. I wonder why that's the case.
I normally populate new columns using "apply, axis=1", so I would really appreciate any solution based on that. I found that "apply" works fine when, for each row, the computation is done across columns using values at the same row level. However, I don't know how an "apply" function can involve different rows, which is what this problem requires. The only exception I have seen so far is "diff", which is not useful here.
Thanks.
Sample data:
pd.DataFrame({
    'Store': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
    'Customers': {0: 668, 1: 578, 2: 619, 3: 635, 4: 785},
    'Date': {
        0: pd.Timestamp('2013-01-02 00:00:00'),
        1: pd.Timestamp('2013-01-03 00:00:00'),
        2: pd.Timestamp('2013-01-04 00:00:00'),
        3: pd.Timestamp('2013-01-05 00:00:00'),
        4: pd.Timestamp('2013-01-07 00:00:00')
    }
})
Code that works but is incredibly slow:
import pandas as pd
import numpy as np

data = pd.read_csv("Rossman - no of cust/dataset.csv")
data.Date = pd.to_datetime(data.Date)
data.Customers = data.Customers.astype(int)

for index, row in data.iterrows():
    d = row["Date"]
    store = row["Store"]
    time_condition = (d - data["Date"] < np.timedelta64(60, 'D')) & (d > data["Date"])
    sub_df = data.loc[time_condition & (data["Store"] == store), :]
    data.loc[(data["Date"] == d) & (data["Store"] == store), "Lagged No customers"] = sub_df["Customers"].sum()
    data.loc[(data["Date"] == d) & (data["Store"] == store), "No of days"] = len(sub_df["Customers"])
    if len(sub_df["Customers"]) > 0:
        data.loc[(data["Date"] == d) & (data["Store"] == store), "Av No of customers"] = int(sub_df["Customers"].sum() / len(sub_df["Customers"]))
Given your small sample data, I used a two day rolling average instead of 60 days.
>>> (data.pivot(columns='Store', index='Date', values='Customers')
...      .rolling(window=2).mean()
...      .stack('Store'))
Date Store
2013-01-03 1 623.0
2013-01-04 1 598.5
2013-01-05 1 627.0
2013-01-07 1 710.0
dtype: float64
By taking a pivot of the data with dates as your index and stores as your columns, you can simply take a rolling average. You then need to stack the stores to get the data back into the correct shape.
Here is some sample output of the original data prior to the final stack:
Store 1 2 3
Date
2015-07-29 541.5 686.5 767.0
2015-07-30 534.5 664.0 769.5
2015-07-31 550.5 613.0 822.0
After .stack('Store'), this becomes:
Date Store
2015-07-29 1 541.5
2 686.5
3 767.0
2015-07-30 1 534.5
2 664.0
3 769.5
2015-07-31 1 550.5
2 613.0
3 822.0
dtype: float64
Assuming the above is named df, you can then merge it back into your original data as follows:
data.merge(df.reset_index(),
how='left',
on=['Date', 'Store'])
EDIT:
There is a clear seasonal pattern in the data for which you may want to make adjustments. In any case, you probably want your rolling average to be in multiples of seven to represent even weeks. I've used a time window of 63 days in the example below (9 weeks).
In order to avoid losing data on stores that just open (and those at the start of the time period), you can specify min_periods=1 in the rolling mean function. This will give you the average value over all available observations for your given time window
df = data.loc[data.Customers > 0, ['Date', 'Store', 'Customers']]
result = (df.pivot(columns='Store', index='Date', values='Customers')
            .rolling(window=63, min_periods=1).mean()
            .stack('Store'))
result.name = 'Customers_63d_mvg_avg'
df = df.merge(result.reset_index(), on=['Store', 'Date'], how='left')
>>> df.sort_values(['Store', 'Date']).head(8)
Date Store Customers Customers_63d_mvg_avg
843212 2013-01-02 1 668 668.000000
842103 2013-01-03 1 578 623.000000
840995 2013-01-04 1 619 621.666667
839888 2013-01-05 1 635 625.000000
838763 2013-01-07 1 785 657.000000
837658 2013-01-08 1 654 656.500000
836553 2013-01-09 1 626 652.142857
835448 2013-01-10 1 615 647.500000
To more clearly see what is going on, here is a toy example:
s = pd.Series([1, 2, 3, 4, 5] + [np.nan] * 2 + [6])
>>> pd.concat([s, s.rolling(window=4, min_periods=1).mean()], axis=1)
0 1
0 1 1.0
1 2 1.5
2 3 2.0
3 4 2.5
4 5 3.5
5 NaN 4.0
6 NaN 4.5
7 6 5.5
The window is four observations, but note that the final value of 5.5 equals (5 + 6) / 2. The 4.0 and 4.5 values are (3 + 4 + 5) / 3 and (4 + 5) / 2, respectively.
In our example, the NaN rows of the pivot table do not get merged back into df because we did a left join and all the rows in df have one or more Customers.
You can view a chart of the rolling data as follows:
df.set_index(['Date', 'Store']).unstack('Store').plot(legend=False)
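For completeness, newer pandas versions also allow a time-based rolling window per store, without pivoting to a Date-by-Store table. A sketch, not tested on the full dataset; it assumes at most one row per (Store, Date) and that Date is already parsed as datetime:
data = data.sort_values(['Store', 'Date'])

# closed='left' excludes the current day, similar to the (d > data["Date"]) filter
# in the original loop; the exact 60-day boundary handling may differ slightly.
avg = (data.set_index('Date')
           .groupby('Store')['Customers']
           .rolling('60D', closed='left')
           .mean())

# avg is ordered by (Store, Date), the same order as the sorted data,
# so it can be assigned back by position.
data['Av No of customers'] = avg.to_numpy()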