Pandas - groupby continuous datetime periods - python

I have a pandas dataframe that looks like this:
KEY START END VALUE
0 A 2017-01-01 2017-01-16 2.1
1 B 2017-01-01 2017-01-23 4.3
2 B 2017-01-23 2017-02-10 1.7
3 A 2017-01-28 2017-02-02 4.2
4 A 2017-02-02 2017-03-01 0.8
I would like to groupby on KEY and sum on VALUE but only on continuous periods of time. For instance in the above example I would like to get:
KEY START END VALUE
0 A 2017-01-01 2017-01-16 2.1
1 A 2017-01-28 2017-03-01 5.0
2 B 2017-01-01 2017-02-10 6.0
There are two groups for A since there is a gap in the time periods.
I would like to avoid for loops since the dataframe has tens of millions of rows.

Create a helper Series by comparing the shifted START column per group with END, and use it for the groupby:
s = df.loc[df.groupby('KEY')['START'].shift(-1) == df['END'], 'END']
s = s.combine_first(df['START'])
print (s)
0 2017-01-01
1 2017-01-23
2 2017-01-23
3 2017-02-02
4 2017-02-02
Name: END, dtype: datetime64[ns]
df = df.groupby(['KEY', s], as_index=False).agg({'START':'first','END':'last','VALUE':'sum'})
print (df)
KEY VALUE START END
0 A 2.1 2017-01-01 2017-01-16
1 A 5.0 2017-01-28 2017-03-01
2 B 6.0 2017-01-01 2017-02-10

The answer from jezrael works like a charm if there are only two consecutive rows to aggregate. In the new example, it would not aggregate the last three rows for KEY = A.
KEY START END VALUE
0 A 2017-01-01 2017-01-16 2.1
1 B 2017-01-01 2017-01-23 4.3
2 B 2017-01-23 2017-02-10 1.7
3 A 2017-01-28 2017-02-02 4.2
4 A 2017-02-02 2017-03-01 0.8
5 A 2017-03-01 2017-03-23 1.0
The following solution (a slight modification of jezrael's) aggregates all rows that should be aggregated:
df = df.sort_values(by='START')
idx = df.groupby('KEY')['START'].shift(-1) != df['END']
df['DATE'] = df.loc[idx, 'START']
df['DATE'] = df.groupby('KEY').DATE.fillna(method='backfill')
df = (df.groupby(['KEY', 'DATE'], as_index=False)
        .agg({'START': 'first', 'END': 'last', 'VALUE': 'sum'})
        .drop(['DATE'], axis=1))
Which gives:
KEY START END VALUE
0 A 2017-01-01 2017-01-16 2.1
1 A 2017-01-28 2017-03-23 6.0
2 B 2017-01-01 2017-02-10 6.0
Thanks @jezrael for the elegant approach!
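A note on the backfill step: an equivalent (and quite common) way to build the grouper is to flag every row that starts a new continuous block and take a cumulative sum. This is only a sketch of that idiom, not part of the answers above, and it assumes df is sorted by KEY and START with START/END already parsed as datetimes:
import pandas as pd

# assumes START and END are datetime64 columns, as in the question
df = df.sort_values(['KEY', 'START'])

# a row opens a new block when its START differs from the previous END within the same KEY
df['BLOCK'] = (df['START'] != df.groupby('KEY')['END'].shift()).cumsum()

df = (df.groupby(['KEY', 'BLOCK'], as_index=False)
        .agg({'START': 'first', 'END': 'last', 'VALUE': 'sum'})
        .drop(columns='BLOCK'))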

Related

Slice data by ID and datetime index

I have a dataframe, x_train, with three variables, a datetime index that takes a reading every 5 minutes, and an ID column:
x_train
Time ID var_1 var_2 var_3
2020-01-01 00:00:00 1 9.3 4.2 2.4
2020-01-02 00:00:05 1 3.5 4.5 7.6
2020-01-01 00:00:00 2 2.1 7.6 4.5
2020-01-02 00:00:05 2 3.9 7.5 7.0
and a second dataframe, y_train, with labels for each mode the IDs are in:
y_train
Time ID mode label
2020-01-01 00:00:00 1 1 B
2020-01-02 00:00:05 1 1 B
2020-01-01 00:00:00 2 0 A
2020-01-02 00:00:05 2 0 A
I want to slice the data by ID and time with a step size of 1 day, or 288 rows, as this data is time-series dependent. So far I've managed to split the data by ID using groupby; however, I'm not sure how to apply the time slicing.
Here's what I've tried:
FEATURE_COLUMNS = X_train.columns.to_list()
sequences = []
for Id, group in X_train.groupby("ID"):
    sequence_features = group[FEATURE_COLUMNS]
    label = y_train[y_train.ID == Id].iloc[0].label
    sequences.append((sequence_features, label))
Which gives me the data split by ID, but not sliced by time:
( ID var_1 var_2 var_3
Time
2016-01-09 01:55:00 2 0.402679 0.588398 0.560771
2016-03-22 11:40:00 2 0.382457 0.507188 0.450901
2016-02-29 09:40:00 2 0.344540 0.652963 0.607460
2016-01-06 01:00:00 2 0.384479 0.825977 0.499619
2016-01-19 18:10:00 2 0.437563 0.631526 0.479827
... ... ... ... ...
2016-01-10 23:30:00 2 0.366026 0.829760 0.636387
2016-01-22 18:25:00 2 0.976997 0.350567 0.674448
2016-01-28 06:30:00 2 0.975986 0.719546 0.727988
2016-02-27 04:15:00 2 0.451972 0.674149 0.470185
2016-03-10 19:15:00 2 0.354146 0.423203 0.487947
[17673 rows x 4 columns],
'b')
I feel I need to add a line that tells the loop to only look at 288 rows per ID at a time, but I'm not sure how to execute it.
Edit: my sliced output also reorders the datetime index in a strange way; is there a way to fix this?
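No answer is attached to this question here, so the following is only a sketch of one possible approach. X_train, y_train and the 288-row day window come from the question; the loop-variable names and the chunking by integer position are my own assumptions. Sort each ID group chronologically, then split it into consecutive 288-row windows:
import numpy as np

FEATURE_COLUMNS = X_train.columns.to_list()
sequences = []

for id_, group in X_train.groupby("ID"):
    group = group.sort_index()  # restore chronological order of the datetime index
    label = y_train[y_train.ID == id_].iloc[0].label
    # consecutive 288-row (one-day) windows within this ID
    for _, window in group.groupby(np.arange(len(group)) // 288):
        sequences.append((window[FEATURE_COLUMNS], label))
Sorting the index inside the loop should also address the odd ordering mentioned in the edit.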

How to get aggregate of data from multiple dates in pandas?

I have the following data
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'name': ['a', 'b', 'c', 'd', 'e', 'f'],
                        'vaccine_1': ['2021-01-20', '2021-01-20', '2021-02-20', np.nan, '2021-02-22', '2021-02-23'],
                        'vaccine_2': ['2021-02-22', '2021-02-22', '2021-02-25', np.nan, '2021-03-22', np.nan]})
df['vaccine_1'] = pd.to_datetime(df['vaccine_1']).dt.date
df['vaccine_2'] = pd.to_datetime(df['vaccine_2']).dt.date
df
I want to convert the table into something like this
date        vaccine_1_total  vaccine_2_total
2021-01-20  2                0
2021-02-20  1                0
2021-02-22  1                3
2021-02-25  0                0
2021-03-22  0                1
Basically I want to aggregate by date to see how many people got each vaccine on each date, but since there are two date columns, I am lost.
A simple groupby doesn't give me the result.
df.groupby(['vaccine_1'])['name'].count()
The code only gives me the number of people vaccinated for the first time; I can't obtain the second one. How do I solve this? Thanks.
We can first melt the dataframe using DataFrame.melt, then use pd.crosstab:
out = df.filter(like='vaccine').melt(var_name='vaccine', value_name='date')
print(pd.crosstab(out['date'], out['vaccine']))
vaccine vaccine_1 vaccine_2
date
2021-01-20 2 0
2021-02-20 1 0
2021-02-22 1 2
2021-02-23 1 0
2021-02-25 0 1
2021-03-22 0 1
Count each vaccine column separately:
(df.filter(like='vaccine')
   .apply(pd.Series.value_counts)
   .fillna(0)
   .add_suffix('_total')
   .rename_axis('date')
   .reset_index())
date vaccine_1_total vaccine_2_total
0 2021-01-20 2.0 0.0
1 2021-02-20 1.0 0.0
2 2021-02-22 1.0 2.0
3 2021-02-23 1.0 0.0
4 2021-02-25 0.0 1.0
5 2021-03-22 0.0 1.0
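If you want plain integer counts like the table in the question, casting after the fill is enough; this is a small optional tweak on the snippet above, not part of the original answer:
(df.filter(like='vaccine')
   .apply(pd.Series.value_counts)
   .fillna(0)
   .astype(int)        # integer counts instead of 2.0 / 0.0
   .add_suffix('_total')
   .rename_axis('date')
   .reset_index())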
You could do a melt, get the value counts, then unstack to put the vaccines as headers:
(df.melt('name', value_name='Date')
   .drop(columns='name')
   .value_counts()
   .unstack('variable', fill_value=0)
   .add_suffix('_total')
   # last two not necessary
   # indexes are a good thing
   .rename_axis(columns=None)
   .reset_index()
)
vaccine_1_total vaccine_2_total
Date
2021-01-20 2 0
2021-02-20 1 0
2021-02-22 1 2
2021-02-23 1 0
2021-02-25 0 1
2021-03-22 0 1

Calculate delta between two columns and two following rows for different group

Are there any vector operations for improving runtime?
I found no other way besides for loops.
Sample DataFrame:
df = pd.DataFrame({'ID': ['1', '1', '1', '2', '2', '2'],
                   'start_date': ['01-Jan', '05-Jan', '08-Jan', '05-Jan', '06-Jan', '10-Jan'],
                   'start_value': [12, 15, 1, 3, 2, 6],
                   'end_value': [20, 17, 6, 19, 13.5, 9]})
ID start_date start_value end_value
0 1 01-Jan 12 20.0
1 1 05-Jan 15 17.0
2 1 08-Jan 1 6.0
3 2 05-Jan 3 19.0
4 2 06-Jan 2 13.5
5 2 10-Jan 6 9.0
I've tried:
import pandas as pd

df_original  # contains the data
data_frame_diff = pd.DataFrame()
for ID in df_original['ID'].unique():
    tmp_frame = df_original.loc[df_original['ID'] == ID]
    tmp_start_value = 0
    for label, row in tmp_frame.iterrows():
        last_delta = tmp_start_value - row['start_value']
        tmp_start_value = row['end_value']
        row['last_delta'] = last_delta
        data_frame_diff = data_frame_diff.append(row, True)
Expected Result:
df = pd.DataFrame({'ID': ['1', '1', '1', '2', '2', '2'],
                   'start_date': ['01-Jan', '05-Jan', '08-Jan', '05-Jan', '06-Jan', '10-Jan'],
                   'last_delta': [0, 5, 16, 0, 17, 7.5]})
ID start_date last_delta
0 1 01-Jan 0.0
1 1 05-Jan 5.0
2 1 08-Jan 16.0
3 2 05-Jan 0.0
4 2 06-Jan 17.0
5 2 10-Jan 7.5
I want to calculate, for each user ID, the delta between the end_value of one row and the start_value of the following row.
Is there a way to improve runtime of this code?
Use DataFrame.groupby on ID and shift the end_value column, then use Series.sub to subtract start_value from it; finally use Series.fillna and assign this new column s to the dataframe using DataFrame.assign:
s = df.groupby('ID')['end_value'].shift().sub(df['start_value']).fillna(0)
df1 = df[['ID', 'start_date']].assign(last_delta=s)
Result:
print(df1)
ID start_date last_delta
0 1 01-Jan 0.0
1 1 05-Jan 5.0
2 1 08-Jan 16.0
3 2 05-Jan 0.0
4 2 06-Jan 17.0
5 2 10-Jan 7.5
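One caveat, which is an assumption on my part rather than something stated in the answer: the shift-based delta relies on the rows being in chronological order within each ID, so sort first if that is not already guaranteed:
# convert start_date to a real datetime first if it is stored as text,
# then sort so that shift() picks the previous row in time
df = df.sort_values(['ID', 'start_date'])

s = df.groupby('ID')['end_value'].shift().sub(df['start_value']).fillna(0)
df1 = df[['ID', 'start_date']].assign(last_delta=s)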
It's a bit difficult to follow from your description what you need, but you might find this helpful:
import pandas as pd
df = (pd.DataFrame({'t1': pd.date_range(start="2020-01-01", end="2020-01-02", freq="H")})
        .reset_index()
        .rename(columns={'index': 'ID'}))
df['t2'] = df['t1']+pd.Timedelta(value=10, unit="H")
df['delta_t1_t2'] = df['t2']-df['t1']
df['delta_to_previous_t1'] = df['t1'] - df['t1'].shift()
print(df)
It results in
ID t1 t2 delta_t1_t2 delta_to_previous_t1
0 0 2020-01-01 00:00:00 2020-01-01 10:00:00 10:00:00 NaT
1 1 2020-01-01 01:00:00 2020-01-01 11:00:00 10:00:00 01:00:00
2 2 2020-01-01 02:00:00 2020-01-01 12:00:00 10:00:00 01:00:00
3 3 2020-01-01 03:00:00 2020-01-01 13:00:00 10:00:00 01:00:00

Counting Number of Occurrences Between Dates (Given an ID value) From Another Dataframe

Pandas: select DF rows based on another DF is the closest answer I can find to my question, but I don't believe it quite solves it.
Anyway, I am working with two very large pandas dataframes (so speed is a consideration), df_emails and df_trips, both of which are already sorted by CustID and then by date.
df_emails includes the date we sent a customer an email and it looks like this:
CustID DateSent
0 2 2018-01-20
1 2 2018-02-19
2 2 2018-03-31
3 4 2018-01-10
4 4 2018-02-26
5 5 2018-02-01
6 5 2018-02-07
df_trips includes the dates a customer came to the store and how much they spent, and it looks like this:
CustID TripDate TotalSpend
0 2 2018-02-04 25
1 2 2018-02-16 100
2 2 2018-02-22 250
3 4 2018-01-03 50
4 4 2018-02-28 100
5 4 2018-03-21 100
6 8 2018-01-07 200
Basically, what I need to do is find the number of trips and total spend for each customer in between each email sent. If it is the last time an email is sent for a given customer, I need to find the total number of trips and total spend after the email, but before the end of the data (2018-04-01). So the final dataframe would look like this:
CustID DateSent NextDateSentOrEndOfData TripsBetween TotalSpendBetween
0 2 2018-01-20 2018-02-19 2.0 125.0
1 2 2018-02-19 2018-03-31 1.0 250.0
2 2 2018-03-31 2018-04-01 0.0 0.0
3 4 2018-01-10 2018-02-26 0.0 0.0
4 4 2018-02-26 2018-04-01 2.0 200.0
5 5 2018-02-01 2018-02-07 0.0 0.0
6 5 2018-02-07 2018-04-01 0.0 0.0
Though I have tried my best to do this in a Python/Pandas friendly way, the only accurate solution I have been able to implement is through an np.where, shifting, and looping. The solution looks like this:
df_emails["CustNthVisit"] = df_emails.groupby("CustID").cumcount()+1
df_emails["CustTotalVisit"] = df_emails.groupby("CustID")["CustID"].transform('count')
df_emails["NextDateSentOrEndOfData"] = pd.to_datetime(df_emails["DateSent"].shift(-1)).where(df_emails["CustNthVisit"] != df_emails["CustTotalVisit"], pd.to_datetime('04-01-2018'))
for i in df_emails.index:
df_emails.at[i, "TripsBetween"] = len(df_trips[(df_trips["CustID"] == df_emails.at[i, "CustID"]) & (df_trips["TripDate"] > df_emails.at[i,"DateSent"]) & (df_trips["TripDate"] < df_emails.at[i,"NextDateSentOrEndOfData"])])
for i in df_emails.index:
df_emails.at[i, "TotalSpendBetween"] = df_trips[(df_trips["CustID"] == df_emails.at[i, "CustID"]) & (df_trips["TripDate"] > df_emails.at[i,"DateSent"]) & (df_trips["TripDate"] < df_emails.at[i,"NextDateSentOrEndOfData"])].TotalSpend.sum()
df_emails.drop(['CustNthVisit',"CustTotalVisit"], axis=1, inplace=True)
However, a %%timeit has revealed that this takes 10.6ms on just the seven rows shown above, which makes this solution pretty much infeasible on my actual datasets of about 1,000,000 rows. Does anyone know a solution here that is faster and thus feasible?
Add the next date column to emails
df_emails["NextDateSent"] = df_emails.groupby("CustID").shift(-1)
Sort for merge_asof and then merge to nearest to create a trip lookup table
df_emails = df_emails.sort_values("DateSent")
df_trips = df_trips.sort_values("TripDate")
df_lookup = pd.merge_asof(df_trips, df_emails, by="CustID",
                          left_on="TripDate", right_on="DateSent",
                          direction="backward")
Aggregate the lookup table for the data you want.
df_lookup = df_lookup.loc[:, ["CustID", "DateSent", "TotalSpend"]].groupby(["CustID", "DateSent"]).agg(["count","sum"])
Left join it back to the email table.
df_merge = df_emails.join(df_lookup, on=["CustID", "DateSent"]).sort_values("CustID")
I chose to leave NaNs as NaNs because I don't like filling in default values (you can always do that later if you prefer, but you can't easily distinguish between things that existed and things that didn't if you put defaults in early).
CustID DateSent NextDateSent (TotalSpend, count) (TotalSpend, sum)
0 2 2018-01-20 2018-02-19 2.0 125.0
1 2 2018-02-19 2018-03-31 1.0 250.0
2 2 2018-03-31 NaT NaN NaN
3 4 2018-01-10 2018-02-26 NaN NaN
4 4 2018-02-26 NaT 2.0 200.0
5 5 2018-02-01 2018-02-07 NaN NaN
6 5 2018-02-07 NaT NaN NaN
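If you do want the exact shape from the question (flat column names, zeros instead of NaN, and 2018-04-01 as the end-of-data date), a small follow-up of my own on top of the answer above could look like this; it flattens the lookup table's columns and redoes the join:
# flatten the aggregated column names before joining, then fill the gaps
df_lookup.columns = ["TripsBetween", "TotalSpendBetween"]
df_merge = df_emails.join(df_lookup, on=["CustID", "DateSent"]).sort_values("CustID")
df_merge[["TripsBetween", "TotalSpendBetween"]] = df_merge[["TripsBetween", "TotalSpendBetween"]].fillna(0)
# 2018-04-01 is the end-of-data date given in the question
df_merge["NextDateSent"] = df_merge["NextDateSent"].fillna(pd.Timestamp("2018-04-01"))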
This would be an easy case of merge_asof had I been able to handle the max_date, so I go a long way:
max_date = pd.to_datetime('2018-04-01')

# set_index for easy extraction by id
df_emails.set_index('CustID', inplace=True)

# we want this later in the final output
df_emails['NextDateSentOrEndOfData'] = df_emails.groupby('CustID').shift(-1).fillna(max_date)

# cuts function for groupby
def cuts(df):
    custID = df.CustID.iloc[0]
    bins = list(df_emails.loc[[custID], 'DateSent']) + [max_date]
    return pd.cut(df.TripDate, bins=bins, right=False)

# bin the dates:
s = df_trips.groupby('CustID', as_index=False, group_keys=False).apply(cuts)

# aggregate the info:
new_df = (df_trips.groupby([df_trips.CustID, s])
                  .TotalSpend.agg(['sum', 'size'])
                  .reset_index())

# get the right limit:
new_df['NextDateSentOrEndOfData'] = new_df.TripDate.apply(lambda x: x.right)

# drop the unnecessary info
new_df.drop('TripDate', axis=1, inplace=True)

# merge:
df_emails.reset_index().merge(new_df,
                              on=['CustID', 'NextDateSentOrEndOfData'],
                              how='left')
Output:
CustID DateSent NextDateSentOrEndOfData sum size
0 2 2018-01-20 2018-02-19 125.0 2.0
1 2 2018-02-19 2018-03-31 250.0 1.0
2 2 2018-03-31 2018-04-01 NaN NaN
3 4 2018-01-10 2018-02-26 NaN NaN
4 4 2018-02-26 2018-04-01 200.0 2.0
5 5 2018-02-01 2018-02-07 NaN NaN
6 5 2018-02-07 2018-04-01 NaN NaN
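To match the asker's expected frame exactly, the intervals without trips can be filled with zeros and the sum/size columns renamed; this cleanup is my addition on top of the answer above, not part of it:
result = (df_emails.reset_index()
                   .merge(new_df, on=['CustID', 'NextDateSentOrEndOfData'], how='left')
                   .rename(columns={'sum': 'TotalSpendBetween', 'size': 'TripsBetween'})
                   .fillna({'TotalSpendBetween': 0, 'TripsBetween': 0}))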

Pandas fill in missing date within each group with information in the previous row

Similar question to this one, but with some modifications:
Instead of filling in missing dates for each group between the min and max date of the entire column, we should only fill in the dates between the min and max of that group, and output a dataframe that keeps the last row for each date within each group.
Reproducible example:
x = pd.DataFrame({'dt': ['2016-01-01', '2016-01-03', '2016-01-04', '2016-01-01', '2016-01-01', '2016-01-04'],
                  'amount': [10.0, 30.0, 40.0, 78.0, 80.0, 82.0],
                  'sub_id': [1, 1, 1, 2, 2, 2]})
Visually:
dt sub_id amount
0 2016-01-01 1 10.0
1 2016-01-03 1 30.0
2 2016-01-04 1 40.0
3 2017-01-01 2 78.0
4 2017-01-01 2 80.0
5 2017-01-04 2 82.0
Output I need:
dt sub_id amount
0 2016-01-01 1 10.0
1 2016-01-02 1 10.0
2 2016-01-03 1 30.0
3 2016-01-04 1 40.0
4 2017-01-01 2 80.0
5 2017-01-02 2 80.0
6 2017-01-03 2 80.0
7 2017-01-04 2 82.0
We are grouping by dt and sub_id. As you can see, for sub_id=1, a row was added for 2016-01-02 and amount was imputed at 10.0, since the previous row was 10.0 (assume the data is sorted beforehand to enable this). For sub_id=2, rows were added for 2017-01-02 and 2017-01-03, and amount is 80.0, as that was the last row before those dates. The first row for 2017-01-01 was also deleted because we just want to keep the last row for each date and sub_id.
Looking for the most efficient way to do this as the real data has millions of rows. I have a current method using lambda functions and applying them across groups of sub_id but I feel like we could do better.
Thanks!
Getting the date right of course:
x.dt = pd.to_datetime(x.dt)
Then this:
cols = ['dt', 'sub_id']
pd.concat([
    d.asfreq('D').ffill(downcast='infer')
    for _, d in x.drop_duplicates(cols, keep='last')
                 .set_index('dt')
                 .groupby('sub_id')
]).reset_index()
dt amount sub_id
0 2016-01-01 10 1
1 2016-01-02 10 1
2 2016-01-03 30 1
3 2016-01-04 40 1
4 2016-01-01 80 2
5 2016-01-02 80 2
6 2016-01-03 80 2
7 2016-01-04 82 2
By using resample with groupby
x.dt=pd.to_datetime(x.dt)
x.set_index('dt').groupby('sub_id').apply(lambda x : x.resample('D').max().ffill()).reset_index(level=1)
Out[265]:
dt amount sub_id
sub_id
1 2016-01-01 10.0 1.0
1 2016-01-02 10.0 1.0
1 2016-01-03 30.0 1.0
1 2016-01-04 40.0 1.0
2 2016-01-01 80.0 2.0
2 2016-01-02 80.0 2.0
2 2016-01-03 80.0 2.0
2 2016-01-04 82.0 2.0
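Note that the frame above keeps sub_id both as an index level and as a float column; a small optional cleanup (my addition, not part of the answer) drops the duplicate column and moves both index levels back into regular columns:
out = (x.set_index('dt')
        .groupby('sub_id')
        .apply(lambda g: g.resample('D').max().ffill())
        .drop(columns='sub_id')   # remove the duplicated (float) sub_id column
        .reset_index())           # bring sub_id and dt back as regular columns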
Use asfreq & groupby:
First convert dt to datetime and get rid of the duplicates; then, for each group of sub_id, use asfreq('D', method='ffill') to generate the missing dates and impute the amounts; finally, reset_index on the amount column, since there's a duplicate sub_id column as well as the index.
x.dt = pd.to_datetime(x.dt)

x.drop_duplicates(
    ['dt', 'sub_id'], 'last'
).groupby('sub_id').apply(
    lambda x: x.set_index('dt').asfreq('D', method='ffill')
).amount.reset_index()
# output:
sub_id dt amount
0 1 2016-01-01 10.0
1 1 2016-01-02 10.0
2 1 2016-01-03 30.0
3 1 2016-01-04 40.0
4 2 2016-01-01 80.0
5 2 2016-01-02 80.0
6 2 2016-01-03 80.0
7 2 2016-01-04 82.0
The following works for me and seems pretty efficient, but I can't say if it's efficient enough. It does avoid lambdas, though.
I called your data df.
Create a base_df with the entire date / sub_id grid:
import pandas as pd
from itertools import product
base_grid = product(pd.date_range(df['dt'].min(), df['dt'].max(), freq='D'),
                    list(range(df['sub_id'].min(), df['sub_id'].max() + 1, 1)))
base_df = pd.DataFrame(list(base_grid), columns=['dt', 'sub_id'])
Get the max value per dt / sub_id from df:
max_value_df = df.loc[df.groupby(['dt', 'sub_id'])['amount'].idxmax()]
max_value_df['dt'] = max_value_df['dt'].apply(pd.Timestamp)
Merge base_df on the max values:
merged_df = base_df.merge(max_value_df, how='left', on=['dt', 'sub_id'])
Sort and forward fill the maximal value:
merged_df = merged_df.sort_values(by=['sub_id', 'dt', 'amount'], ascending=True)
merged_df['amount'] = merged_df.groupby(['sub_id'])['amount'].fillna(method='ffill')
Result:
dt sub_id amount
0 2016-01-01 1 10.0
2 2016-01-02 1 10.0
4 2016-01-03 1 30.0
6 2016-01-04 1 40.0
1 2016-01-01 2 80.0
3 2016-01-02 2 80.0
5 2016-01-03 2 80.0
7 2016-01-04 2 82.0
