Computing 7-day retention numbers in pandas - python

I have a dataframe with two columns--date and id. I'd like to calculate for each date the number of id's on that date which reappear on a later date within 7 days. If I were doing this in postgres, it would look something like:
SELECT df1.date, COUNT(DISTINCT df1.id)
FROM df df1 INNER JOIN df df2
ON df1.id = df2.id AND
df2.date BETWEEN df1.date + 1 AND df1.date + 7
GROUP BY df1.date;
What is problematic for me is how to translate this statement into pandas in a way that is fast and idiomatic.
I've already tried for one-day retention by simply creating a lagged column and merging the original with the lagged dataframe. This certainly works. However, for seven-day retention I would need to create 7 dataframes and merge them together. That's not reasonable, as far as I'm concerned. (Especially because I'd also like to know 30-day numbers.)
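For reference, the lagged-merge approach described above might look roughly like the following sketch (with assumed column names, not the original code): shift every date back one day and merge on (id, date), so a match means the id reappears the next day.
import pandas as pd

# Sketch of the one-day lagged merge: a match means the id shows up again the next day
lagged = df.assign(date=df['date'] - pd.Timedelta(days=1))
one_day_retention = (
    df.merge(lagged[['id', 'date']], on=['id', 'date'])
      .groupby('date')['id'].nunique()
)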
(I should also point out that my research led me to https://github.com/pydata/pandas/issues/2996, which indicates a merge behavior that does not work on my install (pandas 0.14.0); it fails with the error message TypeError: Argument 'values' has incorrect type (expected numpy.ndarray, got Series). So there appears to be some sort of advanced merge/join behavior which I clearly don't know how to activate.)

If I understand you correctly, I think you can do it with a groupby/apply. It's a bit tricky. So I think you have data like the following:
>>> df
date id y
0 2012-01-01 1 0.1
1 2012-01-03 1 0.3
2 2012-01-09 1 0.4
3 2012-01-12 1 0.0
4 2012-01-14 1 0.2
5 2012-01-16 1 0.4
6 2012-01-01 2 0.2
7 2012-01-02 2 0.1
8 2012-01-03 2 0.4
9 2012-01-04 2 0.6
10 2012-01-09 2 0.7
11 2012-01-10 2 0.4
I'm going to create a forward-looking rolling count, within each 'id' group, of the number of times that id shows up in the next 7 days, including the current day:
def count_forward7(g):
    # Add a column to the dataframe so I can set date as the index
    g['foo'] = 1
    # New dataframe with daily frequency, so 7 rows = 7 days
    # If there are no gaps in the dates you don't need to do this
    x = g.set_index('date').resample('D')
    # Andy Hayden's method for a forward-looking rolling window:
    # reverse the series, take the rolling sum, then reverse the answer back
    fsum = pd.rolling_sum(x[::-1], window=7, min_periods=0)[::-1]
    return pd.DataFrame(fsum[fsum.index.isin(g.date)].values, index=g.index)
>>> df['f7'] = df.groupby('id')[['date']].apply(count_forward7)
>>> df
date id y f7
0 2012-01-01 1 0.1 2
1 2012-01-03 1 0.3 2
2 2012-01-09 1 0.4 3
3 2012-01-12 1 0.0 3
4 2012-01-14 1 0.2 2
5 2012-01-16 1 0.4 1
6 2012-01-01 2 0.2 4
7 2012-01-02 2 0.1 3
8 2012-01-03 2 0.4 3
9 2012-01-04 2 0.6 3
10 2012-01-09 2 0.7 2
11 2012-01-10 2 0.4 1
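As a side note, pd.rolling_sum was removed in later pandas releases. A rough modern equivalent of count_forward7 using the .rolling() method might look like the sketch below (assuming pandas >= 0.18, a datetime 'date' column, and rows sorted by date within each id):
import pandas as pd

def count_forward7(g):
    # Daily-frequency appearance counts for this id (missing days become 0)
    daily = g.assign(foo=1).set_index('date')[['foo']].resample('D').sum()
    # Forward-looking 7-day window: reverse, rolling sum, reverse back
    fsum = daily[::-1].rolling(window=7, min_periods=1).sum()[::-1]
    # Keep only the dates that actually occur in this group, aligned to g's index
    return pd.Series(fsum.loc[fsum.index.isin(g['date']), 'foo'].values, index=g.index)

# group_keys=False keeps the original row index so the assignment aligns
df['f7'] = df.groupby('id', group_keys=False)[['date']].apply(count_forward7)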
Now, if you want to "calculate for each date the number of id's on that date which reappear on a later date within 7 days", just count, for each date, the rows where f7 > 1:
>>> df['bool_f77'] = df['f7'] > 1
>>> df.groupby('date')['bool_f77'].sum()
2012-01-01 2
2012-01-02 1
2012-01-03 2
2012-01-04 1
2012-01-09 2
2012-01-10 0
2012-01-12 1
2012-01-14 1
2012-01-16 0
Or something like the following:
>>> df.query('f7 > 1').groupby('date')['date'].count()
date
2012-01-01 2
2012-01-02 1
2012-01-03 2
2012-01-04 1
2012-01-09 2
2012-01-12 1
2012-01-14 1
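For comparison, the original SQL self-join can also be translated fairly literally with a merge (a sketch, not from the answer above; note that the pairwise self-join can use a lot of memory on large frames):
import pandas as pd

# Self-join on id, then keep pairs where the second visit falls within the next 7 days
pairs = df.merge(df[['id', 'date']], on='id', suffixes=('', '_later'))
within_7 = pairs[(pairs['date_later'] > pairs['date']) &
                 (pairs['date_later'] <= pairs['date'] + pd.Timedelta(days=7))]
retention_7d = within_7.groupby('date')['id'].nunique()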

Related

Pandas DataFrame Change Values Based on Values in Different Rows

I have a DataFrame of store sales for 1115 stores with dates over about 2.5 years. The StateHoliday column is a categorical variable indicating the type of holiday it is. See the piece of the df below. As can be seen, b is the code for Easter. There are other codes for other holidays.
[Image: piece of the DataFrame]
My objective is to analyze sales before and during a holiday. The way I seek to do this is to change the value of the StateHoliday column to something unique for the few days before a particular holiday. For example, b is the code for Easter, so I could change the value to b- indicating that the day is shortly before Easter. The only way I can think to do this is to go through and manually change these values for certain dates. There aren't THAT many holidays, so it wouldn't be that hard to do. But still very annoying!
Tom, see if this works for you; if not, please provide additional information.
In the file I have the following data:
Store,Sales,Date,StateHoliday
1,6729,2013-03-25,0
1,6686,2013-03-26,0
1,6660,2013-03-27,0
1,7285,2013-03-28,0
1,6729,2013-03-29,b
1115,10712,2015-07-01,0
1115,11110,2015-07-02,0
1115,10500,2015-07-03,0
1115,12000,2015-07-04,c
import pandas as pd

fname = r"D:\workspace\projects\misc\data\holiday_sales.csv"
df = pd.read_csv(fname)
df["Date"] = pd.to_datetime(df["Date"])
holidays = df[df["StateHoliday"] != "0"].copy(deep=True)  # taking only holidays
dictDate2Holiday = dict(zip(holidays["Date"].tolist(), holidays["StateHoliday"].tolist()))
look_back = 2  # how many days back you want to go
holiday_look_back = []
# building a list of pairs (prev days, holiday code)
for dt, h in dictDate2Holiday.items():
    prev = dt
    holiday_look_back.append((prev, h))
    for i in range(1, look_back + 1):
        prev = prev - pd.Timedelta(days=1)
        holiday_look_back.append((prev, h))

dfHolidayLookBack = pd.DataFrame(holiday_look_back, columns=["Date", "StateHolidayNew"])
df = df.merge(dfHolidayLookBack, how="left", on="Date")
df["StateHolidayNew"].fillna("0", inplace=True)
print(df)
The StateHolidayNew column should have the info you need to start analyzing your data.
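A possible shift-based alternative (a sketch, not part of the answer above: it assumes the frame is sorted by Store and Date with one row per store per day, and it appends "-" to the code for the look-back days, in the spirit of the "b-" idea from the question):
import pandas as pd

df = df.sort_values(["Store", "Date"]).reset_index(drop=True)
codes = df["StateHoliday"].where(df["StateHoliday"] != "0")  # NaN on non-holiday rows

look_back = 2
df["StateHolidayNew"] = codes
for i in range(1, look_back + 1):
    # label rows that fall i days before a holiday within the same store
    df["StateHolidayNew"] = df["StateHolidayNew"].fillna(
        codes.groupby(df["Store"]).shift(-i) + "-"
    )
df["StateHolidayNew"] = df["StateHolidayNew"].fillna("0")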
Assuming you have a dataframe like this:
Store Sales Date StateHoliday
0 2 4205 2016-11-15 0
1 1 684 2016-07-13 0
2 2 8946 2017-04-15 0
3 1 6929 2017-02-02 0
4 2 8296 2017-10-30 b
5 1 8261 2015-10-05 0
6 2 3904 2016-08-22 0
7 1 2613 2017-12-30 0
8 2 1324 2016-08-23 0
9 1 6961 2015-11-11 0
10 2 15 2016-12-06 a
11 1 9107 2016-07-05 0
12 2 1138 2015-03-29 0
13 1 7590 2015-06-24 0
14 2 5172 2017-04-29 0
15 1 660 2016-06-21 0
16 2 2539 2017-04-25 0
What you can do is group the values that fall between the different letter codes representing the holidays, and then use groupby to find the sales for each group. An improvement would be to relabel each numeric group with the holiday that follows it, e.g. groups=0.0 would become b_0, which would make it easier to tell which holiday each group precedes, but I am not sure how to do that (one possible approach is sketched after the output below).
import numpy as np

df['StateHolidayBool'] = df['StateHoliday'].str.isalpha().fillna(False).replace({False: 0, True: 1})
df = df.assign(group=(df[~df['StateHolidayBool'].between(1, 1)].index.to_series().diff() > 1).cumsum())
df = df.assign(groups=np.where(df.group.notna(), df.group, df.StateHoliday)).drop(['StateHolidayBool', 'group'], axis=1)
df[~df['groups'].str.isalpha().fillna(False)].groupby('groups').sum()
Output:
Store Sales
groups
0.0 6 20764
1.0 7 23063
2.0 9 26206
Final DataFrame:
Store Sales Date StateHoliday groups
0 2 4205 2016-11-15 0 0.0
1 1 684 2016-07-13 0 0.0
2 2 8946 2017-04-15 0 0.0
3 1 6929 2017-02-02 0 0.0
4 2 8296 2017-10-30 b b
5 1 8261 2015-10-05 0 1.0
6 2 3904 2016-08-22 0 1.0
7 1 2613 2017-12-30 0 1.0
8 2 1324 2016-08-23 0 1.0
9 1 6961 2015-11-11 0 1.0
10 2 15 2016-12-06 a a
11 1 9107 2016-07-05 0 2.0
12 2 1138 2015-03-29 0 2.0
13 1 7590 2015-06-24 0 2.0
14 2 5172 2017-04-29 0 2.0
15 1 660 2016-06-21 0 2.0
16 2 2539 2017-04-25 0 2.0
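For the relabelling mentioned above (turning group 0.0 into something like b_0), one possible sketch is to backfill the holiday letters over the numeric groups; this is an assumption on my part, and the trailing group, which has no following holiday, stays unlabelled:
import numpy as np

letters = df['StateHoliday'].where(df['StateHoliday'].str.isalpha())
df['groups_labeled'] = np.where(
    df['groups'].astype(str).str.isalpha(),
    df['groups'],                                       # keep the holiday rows as-is
    letters.bfill() + '_' + df['groups'].astype(str)    # e.g. 0.0 -> 'b_0.0'
)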

Subtract pandas series with different indices

I have a dataframe with a collection of the following entries: (Date, Volume).
I would like to create a new dataframe column where the mean of the monthly volume is subtracted from the Volume column. I would like to know how to achieve that in pandas.
Below you can find the setup for the above:
import pandas as pd
import io
import numpy as np
df = pd.read_csv(io.StringIO(csv_file_content))
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
# Get the means for each year/month pair
month_means = df.groupby([df.index.year, df.index.month])['Volume'].mean().round(2)
# I would like to subtract the Volume with the mean of its month.
df['spread_monthly'] = df['Volume'] - month_means[zip(df.index.year, df.index.month)]
It seems to be complaining about the index mismatch between the grouped month_means and the original (datetime) index of df['Volume']. To avoid problems with the indexing, you can drop the differing indices by using .values on each series.
Do df['spread_monthly'] = df['Volume'].values - month_means[zip(df.index.year, df.index.month)].values instead.
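One small caveat (an assumption about the asker's environment, not part of the answer above): under Python 3, zip returns an iterator, so it may need to be materialized as a list, and .loc can be used for the MultiIndex lookup:
keys = list(zip(df.index.year, df.index.month))
df['spread_monthly'] = df['Volume'].values - month_means.loc[keys].values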
Just to add another option with a vectorized solution.
We can use groupby with pd.Grouper(freq='M') to get the volume spread per month.
Setting the Date column as the index is optional.
Toy Example
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Date': pd.date_range('2020.10.10', periods=12, freq='15D'),
    'Volume': np.arange(1, 13)
})
df
Date Volume
0 2020-10-10 1
1 2020-10-25 2
2 2020-11-09 3
3 2020-11-24 4
4 2020-12-09 5
5 2020-12-24 6
6 2021-01-08 7
7 2021-01-23 8
8 2021-02-07 9
9 2021-02-22 10
10 2021-03-09 11
11 2021-03-24 12
Code
df['spread_monthly'] = df.groupby([pd.Grouper(key='Date', freq='M')]).transform('mean')
df['spread_monthly'] = df.spread_monthly - df.Volume
df
Output
Date Volume spread_monthly
0 2020-10-10 1 0.5
1 2020-10-25 2 -0.5
2 2020-11-09 3 0.5
3 2020-11-24 4 -0.5
4 2020-12-09 5 0.5
5 2020-12-24 6 -0.5
6 2021-01-08 7 0.5
7 2021-01-23 8 -0.5
8 2021-02-07 9 0.5
9 2021-02-22 10 -0.5
10 2021-03-09 11 0.5
11 2021-03-24 12 -0.5
You can also do this with the groupby() and transform() methods:
df['spread_monthly']=df['Volume']-df.groupby([df.index.month,df.index.year])['Volume'].transform('mean').values

python and dataframe: group by week and calculate the sum and difference

I have a dataframe with the following columns:
DATE ALFA BETA
2016-04-26 1 3
2016-04-27 3 0
2016-04-28 0 8
2016-04-29 4 2
2016-04-30 3 1
2016-05-01 -2 -5
2016-05-02 3 0
2016-05-03 3 3
2016-05-08 1 7
2016-05-11 3 1
2016-05-12 10 1
2016-05-13 4 2
I would like to group the data into weekly ranges but treat the ALFA and BETA columns differently. For each week I would like to calculate the sum of the ALFA column, while for the BETA column I would like to calculate the difference between the values at the beginning and at the end of the week. Here is an example of the expected result.
DATE sum_ALFA diff_BETA
2016-04-26 12 3
2016-05-03 4 4
2016-05-11 17 1
I have tried this code, but it calculates the sum for each column:
df = df.resample('W', on='DATE').sum().reset_index().sort_values(by='DATE')
this is my dataset https://drive.google.com/uc?export=download&id=1fEqjINx9R5io7t_YxA9qShvNDxWRCUke
I'd guess I have a different locale here (hence my weeks start on a different day), but you can do:
df.resample("W", on="DATE", closed="left", label="left") \
  .agg({"ALFA": "sum", "BETA": lambda g: g.iloc[0] - g.iloc[-1]})
ALFA BETA
DATE
2016-04-24 11 2
2016-05-01 4 -8
2016-05-08 18 5
I think there is a solution for your data with my approach. Define
def get_series_first_minus_last(s):
    try:
        return s.iloc[0] - s.iloc[-1]
    except IndexError:
        return 0
and replace the lambda call just by the function call, i.e.
df.resample("W", on="DATE",closed="left", label="left"
).agg({"ALFA":"sum", "BETA": get_series_first_minus_last})
Note that in the newly defined function, you could also return nan if you'd prefer that.
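If the goal is to reproduce the week labels from the expected output (weeks anchored at the first date in the data, 2016-04-26), one option is a fixed 7-day rule instead of the calendar-week rule. This is a sketch under the assumption of a recent pandas version, where a '7D' rule anchors its bins at the start of the first day by default; note that the last label comes out as the bin start (2016-05-10) rather than 2016-05-11:
out = (
    df.resample("7D", on="DATE")
      .agg({"ALFA": "sum", "BETA": get_series_first_minus_last})
      .rename(columns={"ALFA": "sum_ALFA", "BETA": "diff_BETA"})
      .reset_index()
)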

Cumulative sum that resets when the condition is no longer met

I have a dataframe with a column of datetime values, one of speed values, and one of timedelta values between the rows.
I want to get the cumulative sum of timedeltas whenever the speed is below 2 knots. When the speed rises above 2 knots, I would like this cumulative sum to reset to 0 and then start summing again at the next run of speed observations below 2 knots.
I have started by flagging all observations with speed < 2. I only manage to get the cumulative sum over all observations with speed < 2, not a separate cumulative sum for each run.
The dataframe looks like this, and cum_sum is the desired output:
datetime speed timedelta cum_sum flag
1-1-2019 19:30:00 0.5 0 0 1
1-1-2019 19:32:00 0.7 2 2 1
1-1-2019 19:34:00 0.1 2 4 1
1-1-2019 19:36:00 5.0 2 0 0
1-1-2019 19:38:00 25.0 2 0 0
1-1-2019 19:42:00 0.1 4 4 1
1-1-2019 19:49:00 0.1 7 11 1
You can use the method from "How to groupby consecutive values in pandas DataFrame" to get the groups where flag is either 1 or 0, and then you will just need to apply the cumsum on the timedelta column, and set those values where flag == 0 to 0:
gb = df.groupby((df['flag'] != df['flag'].shift()).cumsum())
df['cum_sum'] = gb['timedelta'].cumsum()
df.loc[df['flag'] == 0, 'cum_sum'] = 0
print(df)
will give
datetime speed timedelta flag cum_sum
0 1-1-2019 19:30:00 0.5 0 1 0
1 1-1-2019 19:32:00 0.7 2 1 2
2 1-1-2019 19:34:00 0.1 2 1 4
3 1-1-2019 19:36:00 5.0 2 0 0
4 1-1-2019 19:38:00 25.0 2 0 0
5 1-1-2019 19:42:00 0.1 4 1 4
6 1-1-2019 19:49:00 0.1 7 1 11
Note: this approach uses a global variable.
import pandas as pd

c = 0

def fun(x):
    global c
    if x['speed'] > 2.0:
        c = 0
    else:
        c = x['timedelta'] + c
    return c

df = pd.DataFrame({'datetime': ['1-1-2019 19:30:00'] * 7,
                   'speed': [0.5, 0.7, 0.1, 5.0, 25.0, 0.1, 0.1],
                   'timedelta': [0, 2, 2, 2, 2, 4, 7]})
df['cum_sum'] = df.apply(fun, axis=1)
datetime speed timedelta cum_sum
0 1-1-2019 19:30:00 0.5 0 0
1 1-1-2019 19:30:00 0.7 2 2
2 1-1-2019 19:30:00 0.1 2 4
3 1-1-2019 19:30:00 5.0 2 0
4 1-1-2019 19:30:00 25.0 2 0
5 1-1-2019 19:30:00 0.1 4 4
6 1-1-2019 19:30:00 0.1 7 11
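A vectorized alternative without the flag column or a global variable (a sketch, not from either answer above): start a new group at every row with speed >= 2 and zero out that row's timedelta before the grouped cumulative sum.
td = df['timedelta'].where(df['speed'] < 2, 0)   # don't accumulate the fast rows
grp = (df['speed'] >= 2).cumsum()                # a new group begins after each fast row
df['cum_sum'] = td.groupby(grp).cumsum()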

Expand Pandas date range

I have data that looks like this. Each row represents a value of that ID at some date.
ID Date Value
A 2012-01-05 50
A 2012-01-08 100
A 2012-01-10 200
B 2012-07-01 10
B 2012-07-03 20
I need to expand this so that I have rows for all days. The value of each day should be the value of the day before (i.e., think of the data above as updates of values, and the data below as a timeseries of values).
ID Date Value
A 2012-01-05 50
A 2012-01-06 50
A 2012-01-07 50
A 2012-01-08 100
A 2012-01-09 100
A 2012-01-10 200
B 2012-07-01 10
B 2012-07-02 10
B 2012-07-03 20
Currently, I have a solution that amounts to the following:
1. Group by ID
2. For each group, figure out the min and max date
3. Create a pd.date_range
4. Iterate simultaneously through the rows and through the date range, filling in the values in the date range and incrementing the index pointer into the rows when necessary
5. Append all these date ranges to a final dataframe
It works, but seems like a pretty bad bruteforce solution. I wonder if there's a better approach supported by Pandas?
Using resample on a Date-indexed dataframe with ID groups, and ffill on Value:
In [1725]: df.set_index('Date').groupby('ID').resample('1D')['Value'].ffill().reset_index()
Out[1725]:
ID Date Value
0 A 2012-01-05 50
1 A 2012-01-06 50
2 A 2012-01-07 50
3 A 2012-01-08 100
4 A 2012-01-09 100
5 A 2012-01-10 200
6 B 2012-07-01 10
7 B 2012-07-02 10
8 B 2012-07-03 20
Or you can try this one (notice: this can be used to expand a numeric column too).
df.Date = pd.to_datetime(df.Date)
df = df.set_index(df.Date)
df.set_index(df.Date).groupby('ID')\
  .apply(lambda x: x.reindex(pd.date_range(min(x.index), max(x.index), freq='D')))\
  .ffill().reset_index(drop=True)
Out[519]:
ID Date Value
0 A 2012-01-05 50.0
1 A 2012-01-05 50.0
2 A 2012-01-05 50.0
3 A 2012-01-08 100.0
4 A 2012-01-08 100.0
5 A 2012-01-10 200.0
6 B 2012-07-01 10.0
7 B 2012-07-01 10.0
8 B 2012-07-03 20.0
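Note that in this output the Date column itself has been forward-filled, so it still shows the old dates. A hedged tweak of the same reindex idea that rebuilds Date from the new daily index (an assumption on my part; it expects Date to already be a datetime column):
out = (
    df.set_index('Date')
      .groupby('ID')['Value']
      .apply(lambda s: s.reindex(pd.date_range(s.index.min(), s.index.max(), freq='D')))
      .ffill()                       # safe here: each group's first day exists in the data
      .rename_axis(['ID', 'Date'])
      .reset_index()
)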
