I have a dataframe with a huge number of rows, and I want to apply a conditional groupby sum to it.
This is an example of my dataframe and code:
import pandas as pd
data = {'Case': [1, 1, 1, 1, 1, 1],
'Id': [1, 1, 1, 1, 2, 2],
'Date1': ['2020-01-01', '2020-01-01', '2020-02-01', '2020-02-01', '2020-01-01', '2020-01-01'],
'Date2': ['2020-01-01', '2020-02-01', '2020-01-01', '2020-02-01', '2020-01-01', '2020-02-01'],
'Quantity': [50,100,150,20,30,35]
}
df = pd.DataFrame(data)
df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])
sum_list = []
for d in df['Date1'].unique():
    temp = df.groupby(['Case', 'Id']).apply(lambda x: x[(x['Date2'] == d) & (x['Date1'] < d)]['Quantity'].sum()).rename('sum').to_frame()
    temp['Date'] = d
    sum_list.append(temp)
output = pd.concat(sum_list, axis=0).reset_index()
When I apply this for loop to the real dataframe, it is extremely slow. I want to find a better way to do this conditional groupby sum. Here are my questions:
Is a for loop a good method to do what I need here?
Are there better ways to replace line 1 inside the for loop?
I feel line 2 inside the for loop is also time-consuming; how should I improve it?
Thanks for your help.
One option is a double merge and a groupby:
# the dates we want a sum for: every distinct Date1
date = pd.Series(df.Date1.unique(), name='Date')
# align each row with the evaluation date equal to its Date2 (outer keeps unmatched rows and dates)
step1 = df.merge(date, left_on='Date2', right_on='Date', how='outer')
# keep only rows strictly before the evaluation date, then sum per group
step2 = step1.loc[step1.Date1 < step1.Date]
step2 = step2.groupby(['Case', 'Id', 'Date']).agg(sum=('Quantity', 'sum'))
# attach the sums back to every observed (Case, Id, Date2) combination, filling gaps with 0
(df
 .loc[:, ['Case', 'Id', 'Date2']]
 .drop_duplicates()
 .rename(columns={'Date2': 'Date'})
 .merge(step2, how='left', on=['Case', 'Id', 'Date'])
 .fillna({'sum': 0}, downcast='infer')
)
Case Id Date sum
0 1 1 2020-01-01 0
1 1 1 2020-02-01 100
2 1 2 2020-01-01 0
3 1 2 2020-02-01 35
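A small compatibility note: on pandas 2.2+ the downcast argument of fillna is deprecated, so if that warning matters the last step can be written with an explicit cast instead. This is just a sketch of that variant, same result:
(df
 .loc[:, ['Case', 'Id', 'Date2']]
 .drop_duplicates()
 .rename(columns={'Date2': 'Date'})
 .merge(step2, how='left', on=['Case', 'Id', 'Date'])
 .fillna({'sum': 0})           # fill the missing sums with 0 ...
 .astype({'sum': 'int64'})     # ... and cast back to integer explicitly
)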
apply is the slow one. Avoid it as much as you can.
I tested this with your small snippet and it gives the correct answer. You need to test more thoroughly with your real data:
case = df["Case"].unique()
id_= df["Id"].unique()
d = df["Date1"].unique()
index = pd.MultiIndex.from_product([case, id_, d], names=["Case", "Id", "Date"])
# Sum only rows whose Date2 belong to a specific list of dates
# This is equivalent to `x['Date2'] == d` in your original code
cond = df["Date2"].isin(d)
tmp = df[cond].groupby(["Case", "Id", "Date1", "Date2"], as_index=False).sum()
# Select only those sums where Date1 < Date2 and sum again
# This takes care of the `x['Date1'] < d` condition
cond = tmp["Date1"] < tmp["Date2"]
output = tmp[cond].groupby(["Case", "Id", "Date2"]).sum().reindex(index, fill_value=0).reset_index()
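Since the point of both vectorised versions is avoiding apply, it is worth timing them against the original loop on a slice of the real data before switching. A minimal timing sketch; loop_version and merge_version are hypothetical wrappers around the original for loop and either answer above:
from timeit import timeit

# both hypothetical wrappers take the dataframe and return the aggregated result
t_loop = timeit(lambda: loop_version(df), number=3)
t_vec = timeit(lambda: merge_version(df), number=3)
print(f"loop: {t_loop:.3f}s, vectorised: {t_vec:.3f}s")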
Another solution:
x = df.groupby(["Case", "Id", "Date1"], as_index=False).apply(
lambda x: x.loc[x["Date1"] < x["Date2"], "Quantity"].sum()
)
print(
x.pivot(index=["Case", "Id"], columns="Date1", values=None)
.fillna(0)
.melt(ignore_index=False)
.drop(columns=[None])
.reset_index()
.rename(columns={"Date1": "Date", "value":"sum"})
)
Prints:
Case Id Date sum
0 1 1 2020-01-01 100.0
1 1 2 2020-01-01 35.0
2 1 1 2020-02-01 0.0
3 1 2 2020-02-01 0.0
Related
I am currently working with session logs and am interested in counting the number of occurrences of specific events in custom time frames (the first 1, 5, and 10 minutes, and after 10 minutes). To simplify: the start of a session is defined as the time of the first occurrence of a relevant event.
I have already filtered the sessions down to only the relevant events, and the dataframe looks similar to this.
Input
import pandas as pd
data_in = {'SessionId': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
'Timestamp': ['2020-08-24 12:46:30.726000+00:00', '2020-08-24 12:46:38.726000+00:00', '2020-08-24 12:49:30.726000+00:00', '2020-08-24 12:50:49.726000+00:00', '2020-08-24 12:58:30.726000+00:00', '2021-02-12 16:12:12.726000+00:00', '2021-02-12 16:15:24.726000+00:00', '2021-02-12 16:31:07.726000+00:00', '2020-12-03 23:58:17.726000+00:00', '2020-12-04 00:03:44.726000+00:00'],
'event': ['match', 'match', 'match', 'match', 'match', 'match', 'match', 'match', 'match', 'match']
}
df_in = pd.DataFrame(data_in)
df_in
Desired Output:
data_out = {'SessionId': ['A', 'B', 'C'],
'#events_first_1_minute': [2, 1, 1],
'#events_first_5_minute': [4, 2, 1],
'#events_first_10_minute': [4, 2, 2],
'#events_after_10_minute': [5, 3, 2]
}
df_out = pd.DataFrame(data_out)
df_out
I already played around with groupby and pd.Grouper. I get the total number of relevant events per session, but I don't see any option for custom time bins. Another idea was to get rid of the date part and focus only on the time, but of course there are also sessions that start on one day and end on the next (SessionId: C).
Any help is appreciated!
Using pandas.cut:
df_in['Timestamp'] = pd.to_datetime(df_in['Timestamp'])
bins = ['1min', '5min', '10min']
# add 0 as the left edge and a huge timedelta as the right edge
bins2 = pd.to_timedelta(['0'] + bins + ['10000days'])
# elapsed time since the start of each session (transform keeps the original index,
# so the binned series aligns with df_in), cut into the intervals above;
# the first event of each session has an offset of 0, which falls outside the
# left-open first bin, hence the fillna with the first label
group = pd.cut(df_in.groupby('SessionId')['Timestamp'].transform(lambda x: x - x.min()),
               bins=bins2, labels=bins + ['>' + bins[-1]]).fillna(bins[0])
(df_in
 .groupby(['SessionId', group]).size()
 .unstack(level=1)
 .cumsum(axis=1)
)
output:
Timestamp 1min 5min 10min >10min
SessionId
A 2 4 4 5
B 1 2 2 3
C 1 1 2 2
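If you need the exact column names from the desired output, the result of the chain above can simply be relabelled afterwards. A small follow-up sketch, assuming out holds that result:
out = (df_in
       .groupby(['SessionId', group]).size()
       .unstack(level=1)
       .cumsum(axis=1))
# relabel the bins with the names from the desired output, then make SessionId a column again
out.columns = ['#events_first_1_minute', '#events_first_5_minute',
               '#events_first_10_minute', '#events_after_10_minute']
out = out.reset_index()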
First convert your Timestamp column to datetime64 dtype, then group by SessionId and aggregate the data:
# Not needed if the column is already datetime
df_in['Timestamp'] = pd.to_datetime(df_in['Timestamp'])
out = df_in.groupby('SessionId')['Timestamp'] \
           .agg(**{'<1m': lambda x: sum(x < x.iloc[0] + pd.DateOffset(minutes=1)),
                   '<5m': lambda x: sum(x < x.iloc[0] + pd.DateOffset(minutes=5)),
                   '<10m': lambda x: sum(x < x.iloc[0] + pd.DateOffset(minutes=10)),
                   '>10m': 'size'}).reset_index()
Output:
>>> out
SessionId <1m <5m <10m >10m
0 A 2 4 4 5
1 B 1 2 2 3
2 C 1 1 2 2
Use:
import numpy as np

#convert values to datetimes
df_in['Timestamp'] = pd.to_datetime(df_in['Timestamp'])
#get minutes by subtracting the minimal datetime per group
df_in['g'] = (df_in['Timestamp'].sub(df_in.groupby('SessionId')['Timestamp'].transform('min'))
                                .dt.total_seconds().div(60))
#binning to intervals
lab = ['events_first_1_minute', 'events_first_5_minute', 'events_first_10_minute',
       'events_after_10_minute']
df_in['g'] = pd.cut(df_in['g'],
                    bins=[0, 1, 5, 10, np.inf],
                    labels=lab, include_lowest=True)
#count values with cumulative sums
df = (pd.crosstab(df_in['SessionId'], df_in['g'])
        .cumsum(axis=1)
        .rename(columns=str)
        .reset_index()
        .rename_axis(None, axis=1))
print(df)
SessionId events_first_1_minute events_first_5_minute \
0 A 2 4
1 B 1 2
2 C 1 1
events_first_10_minute events_after_10_minute
0 4 5
1 2 3
2 2 2
I'm attempting to count, for each user in a table, the number of events that occurred in the past. I actually have two dataframes: one with each user at a specific point in time 'T', and one with each event, which also occurs at a point in time.
This is an example of the user table:
ID_CLIENT START_DATE
0 A 2015-12-31
1 A 2016-12-31
2 A 2017-12-31
3 B 2016-12-31
This is an example of the event table:
ID_CLIENT DATE_EVENT
0 A 2017-01-01
1 A 2017-05-01
2 A 2018-02-01
3 A 2016-05-02
4 B 2015-01-01
The idea is that, for each line in the "user" table, I want the count of events that occur before the date registered in "START_DATE".
Example of the final result:
ID_CLIENT START_DATE nb_event_tot
0 A 2015-12-31 0
1 A 2016-12-31 1
2 A 2017-12-31 3
3 B 2016-12-31 1
I have created a function which leverages the .apply method of pandas, but it's too slow... If anyone has an idea on how to speed it up, it would be gladly appreciated. I have 800K lines of users and 200K lines of events, which takes up to 3 hours with the apply method.
Here is my code to reproduce it:
import pandas as pd

def check_below_df(row, df_events, col_event):
    # Select the ids
    id_c = row['ID_CLIENT']
    date = row['START_DATE']
    # Select subset of events df
    sub_df_events = df_events.loc[df_events['ID_CLIENT'] == id_c, :]
    sub_df_events = sub_df_events.loc[sub_df_events[col_event] <= date, :]
    count = len(sub_df_events)
    return count

def count_events(df_clients: pd.DataFrame, df_event: pd.DataFrame, col_event_date: str = 'DATE_EVENEMENT',
                 col_start_date: str = 'START_DATE', col_end_date: str = 'END_DATE',
                 col_event: str = 'nb_sin', events=['compensation']):
    df_clients_cp = df_clients[["ID_CLIENT", col_start_date]].copy()
    df_event_cp = df_event.copy()
    df_event_cp[col_event] = 1
    # TOTAL
    df_clients_cp[f'{col_event}_tot'] = df_clients_cp.apply(
        lambda row: check_below_df(row, df_event_cp, col_event_date), axis=1)
    return df_clients_cp

# ------------------------------------------------------------------
# ------------------------------------------------------------------
df_users = pd.DataFrame(data={
    'ID_CLIENT': ['A', 'A', 'A', 'B'],
    'START_DATE': ['2015-12-31', '2016-12-31', '2017-12-31', '2016-12-31'],
})
df_users["START_DATE"] = pd.to_datetime(df_users["START_DATE"])
df_events = pd.DataFrame(data={
    'ID_CLIENT': ['A', 'A', 'A', 'A', 'B'],
    'DATE_EVENT': ['2017-01-01', '2017-05-01', '2018-02-01', '2016-05-02', '2015-01-01']
})
df_events["DATE_EVENT"] = pd.to_datetime(df_events["DATE_EVENT"])
tmp = count_events(df_users, df_events, col_event_date='DATE_EVENT', col_event='nb_event')
tmp
Thanks for your help.
I guess the slow execution is caused by pd.apply(axis=1), which is explained here.
I expect you can improve the execution time by using functions that are not applied row-wise, for instance merge and groupby.
First we merge the frames:
df_merged = pd.merge(df_users, df_events, on='ID_CLIENT', how='left')
Then we check where DATE_EVENT <= START_DATE for the entire frame:
df_merged.loc[:, 'before'] = df_merged['DATE_EVENT'] <= df_merged['START_DATE']
Then we group by ID_CLIENT and START_DATE, and sum the 'before' column:
df_grouped = df_merged.groupby(by=['ID_CLIENT', 'START_DATE'])
df_out = df_grouped['before'].sum() # returns a series
Finally we convert df_out (a series) back to a dataframe, renaming the new column to 'nb_event_tot', and subsequently reset the index to get your desired output:
df_out = df_out.to_frame('nb_event_tot')
df_out = df_out.reset_index()
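Putting those steps together on the example frames from the question, a compact end-to-end version of the same merge/groupby idea would look like this (a sketch, no new logic):
merged = pd.merge(df_users, df_events, on='ID_CLIENT', how='left')
# True where the event happened on or before the user's START_DATE
merged['before'] = merged['DATE_EVENT'] <= merged['START_DATE']
df_out = (merged
          .groupby(['ID_CLIENT', 'START_DATE'], as_index=False)['before']
          .sum()
          .rename(columns={'before': 'nb_event_tot'}))
print(df_out)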
I have two data frames like this:
df1:
col1 col2 time
0 A A_1 05:02:03
1 A A_2 15:36:14
2 A A_1 28:21:47
3 A A_1 47:21:17
4 A A_1 32:28:01
5 A A_2 37:27:14
I want to check whether the time in the "time" column is <24h, >24h but <48h, >48h but <72h, or >72h, and put these results into another dataframe like this:
df2:
col1 col2 time <24 24<time<48 48<time<72 time>72
0 A A_1 1 3 NaN NaN
1 A A_2 1 1 NaN NaN
So basically what I want in df2 is the count of files that meet each comparison. For example, there are three files in the "time" column that belong to A and A_1 where 24<time<48, so we just put 3 in the "24<time<48" column.
I tried this code from @Andreas, but it fails if there is no value in the "time" column where 48<time<72 or time>72:
df['day'] = (df['time'].str.split(':').str[0].astype(int)/24).astype(int)
df = df.pivot_table(index=['col1', 'col2'], columns=['day'], values=['time'], aggfunc='count').reset_index()
d = {'time0':'time <24', 'time1':'24<time<48', 'time2':'48<time<72', 'time3':'time>72'}
df.columns = [d.get(''.join(map(str, x)), ''.join(map(str, x))) for x in df.columns]
P.S. I'm asking this as a new question because the other one got edited too many times.
Let's try:
Converting the time values to Timedelta to get the days
clip to make sure the values don't go beyond 3 days
Use a pivot_table, then clean up the columns
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'A', 5: 'A'},
                   'col2': {0: 'A_1', 1: 'A_2', 2: 'A_1', 3: 'A_1', 4: 'A_1',
                            5: 'A_2'},
                   'time': {0: '05:02:03', 1: '15:36:14', 2: '28:21:47',
                            3: '47:21:17', 4: '32:28:01', 5: '37:27:14'}})

df['days'] = (
    pd.to_timedelta(df['time']).dt.days  # Get days from the Timedelta
    .clip(lower=0, upper=3)              # Clip at 3 days
)

time_cols = ['time < 24', '24 <= time < 48',
             '48 <= time < 72', 'time >= 72']

df = (
    df.pivot_table(index=['col1', 'col2'],
                   columns='days',
                   aggfunc='count',
                   fill_value=np.nan)
    .droplevel(0, 1)            # Remove column MultiIndex
    .reset_index()              # Reset index
    .rename_axis(None, axis=1)  # Remove axis name
    .rename(columns={i: v for i, v in enumerate(time_cols)})
)

# Add missing columns
df[list(set(time_cols).difference(df.columns))] = np.nan
# Reorder columns
df = df[['col1', 'col2', *time_cols]]

print(df)
df:
col1 col2 time < 24 24 <= time < 48 48 <= time < 72 time >= 72
0 A A_1 1 3 NaN NaN
1 A A_2 1 1 NaN NaN
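An alternative that avoids adding the missing columns by hand is to bin the hours with pd.cut and group with observed=False, so unused bins still appear in the result. This is only a sketch (not the answer above); empty bins come out as 0 rather than NaN, and df here means the original long-format frame from the question:
import numpy as np
import pandas as pd

# hours elapsed, parsed from the 'HH:MM:SS' strings (hours may exceed 24)
hours = pd.to_timedelta(df['time']).dt.total_seconds() / 3600
labels = ['time <24', '24<time<48', '48<time<72', 'time>72']
df['bin'] = pd.cut(hours, bins=[0, 24, 48, 72, np.inf],
                   labels=labels, include_lowest=True)
out = (df
       .groupby(['col1', 'col2', 'bin'], observed=False)  # observed=False keeps empty bins
       .size()
       .unstack('bin', fill_value=0)
       .reset_index()
       .rename_axis(None, axis=1))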
I would like to combine two columns, Column 1 + Column 2, row by row. Unfortunately it didn't work for me. How do I solve this?
import pandas as pd
import numpy as np
d = {'Nameid': [1, 2, 3, 1], 'Name': ['Michael', 'Max', 'Susan', 'Michael'], 'Project': ['S455', 'G874', 'B7445', 'Z874']}
df = pd.DataFrame(data=d)
display(df.head(10))
df['Dataframe']='df'
d2 = {'Nameid': [4, 2, 5, 1], 'Name': ['Petrova', 'Michael', 'Mike', 'Gandalf'], 'Project': ['Z845', 'Q985', 'P512', 'Y541']}
df2 = pd.DataFrame(data=d2)
display(df2.head(10))
df2['Dataframe']='df2'
What I tried
df_merged = pd.concat([df,df2])
df_merged.head(10)
df3 = pd.concat([df,df2])
df3['unique_string'] = df['Nameid'].astype(str) + df['Dataframe'].astype(str)
df3.head(10)
As you can see, it didn't combine every row; it seems to have combined only the first dataframe's values with all of the rows. How can I combine the two columns row by row?
What I want
You can simply concatenate the strings like this. You don't need df['Dataframe'].astype(str), since that column is already a string:
In [363]: df_merged['unique_string'] = df_merged.Nameid.astype(str) + df_merged.Dataframe
In [365]: df_merged
Out[365]:
Nameid Name Project Dataframe unique_string
0 1 Michael S455 df 1df
1 2 Max G874 df 2df
2 3 Susan B7445 df 3df
3 1 Michael Z874 df 1df
0 4 Petrova Z845 df2 4df2
1 2 Michael Q985 df2 2df2
2 5 Mike P512 df2 5df2
3 1 Gandalf Y541 df2 1df2
Please make sure you are using df3 and assigning back to df3; also do a reset_index first:
df3 = df3.reset_index()
df3['unique_string'] = df3['Nameid'].astype(str) + df3['Dataframe'].astype(str)
Use df3 instead of df; also pass ignore_index=True so that a default index is added:
df3 = pd.concat([df,df2], ignore_index=True)
df3['unique_string'] = df3['Nameid'].astype(str) + df3['Dataframe']
print (df3)
Nameid Name Project Dataframe unique_string
0 1 Michael S455 df 1df
1 2 Max G874 df 2df
2 3 Susan B7445 df 3df
3 1 Michael Z874 df 1df
4 4 Petrova Z845 df2 4df2
5 2 Michael Q985 df2 2df2
6 5 Mike P512 df2 5df2
7 1 Gandalf Y541 df2 1df2
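For what it's worth, the reason the original attempt only filled some rows is index alignment: pd.concat([df, df2]) without ignore_index keeps the duplicate labels 0-3 from both frames, and assigning a series built from df (which only carries labels 0-3 from the first frame) aligns on those labels rather than on row position. A quick way to see the duplicated labels:
df3 = pd.concat([df, df2])
print(df3.index.tolist())   # [0, 1, 2, 3, 0, 1, 2, 3] -> duplicate labels
Both reset_index and ignore_index=True fix exactly this.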
I read this excellent guide to pivoting but I can't work out how to apply it to my case. I have tidy data like this:
>>> import pandas as pd
>>> df = pd.DataFrame({
... 'case': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', ],
... 'perf_var': ['num', 'time', 'num', 'time', 'num', 'time', 'num', 'time'],
... 'perf_value': [1, 10, 2, 20, 1, 30, 2, 40]
... }
... )
>>>
>>> df
case perf_var perf_value
0 a num 1
1 a time 10
2 a num 2
3 a time 20
4 b num 1
5 b time 30
6 b num 2
7 b time 40
What I want is:
To use "case" as the columns
To use the "num" values as the index
To use the "time" values as the values,
to give:
case a b
1.0 10 30
2.0 20 40
All the pivot examples I can see have the index and values in separate columns, but the above seems like a valid/common "tidy" data case to me (I think?). Is it possible to pivot from this?
You need a bit of preprocessing to get your final result:
import numpy as np

(df.assign(num=np.where(df.perf_var == "num",
                        df.perf_value,
                        np.nan),
           time=np.where(df.perf_var == "time",
                         df.perf_value,
                         np.nan))
   .assign(num=lambda x: x.num.ffill(),
           time=lambda x: x.time.bfill())
   .loc[:, ["case", "num", "time"]]
   .drop_duplicates()
   .pivot(index="num", columns="case", values="time"))
case a b
num
1.0 10.0 30.0
2.0 20.0 40.0
An alternative route to the same end point:
(
    df.set_index(["case", "perf_var"], append=True)
    .unstack()
    .droplevel(0, 1)
    .assign(num=lambda x: x.num.ffill(),
            time=lambda x: x.time.bfill())
    .drop_duplicates()
    .droplevel(0)
    .set_index("num", append=True)
    .unstack(0)
    .rename_axis(index=None)
)
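For comparison, a shorter route to the same result is to pair each num with the time that follows it using a per-group counter, and then pivot twice. This is only a sketch, assuming pandas >= 1.1 so that pivot accepts a list of index columns:
wide = (df
        .assign(obs=df.groupby(['case', 'perf_var']).cumcount())  # pairs the k-th num with the k-th time per case
        .pivot(index=['case', 'obs'], columns='perf_var', values='perf_value')
        .reset_index())
result = wide.pivot(index='num', columns='case', values='time')
print(result)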