I'm attempting to count, for each user in a table, the number of events that occurred in the past. I have two dataframes: one with a row per user at a specific point 'T' in time, and one with a row per event, which also occurs at a point in time.
This is an example of the user table:
ID_CLIENT START_DATE
0 A 2015-12-31
1 A 2016-12-31
2 A 2017-12-31
3 B 2016-12-31
This is an example of the event table:
ID_CLIENT DATE_EVENT
0 A 2017-01-01
1 A 2017-05-01
2 A 2018-02-01
3 A 2016-05-02
4 B 2015-01-01
The idea is that, for each line of the "user" table, I want the count of events that occurred on or before the date registered in "START_DATE".
Example of the final result:
ID_CLIENT START_DATE nb_event_tot
0 A 2015-12-31 0
1 A 2016-12-31 1
2 A 2017-12-31 3
3 B 2016-12-31 1
I have created a function which leverages pandas' ".apply" method, but it is too slow. If anyone has an idea of how to speed it up, it would be gladly appreciated. I have 800K lines of users and 200K lines of events, which takes up to 3 hours with the apply method.
Here is my code to reproduce the problem:
import pandas as pd
def check_below_df(row, df_events, col_event):
    # Select the ids
    id_c = row['ID_CLIENT']
    date = row['START_DATE']
    # Select subset of events df
    sub_df_events = df_events.loc[df_events['ID_CLIENT'] == id_c, :]
    sub_df_events = sub_df_events.loc[sub_df_events[col_event] <= date, :]
    count = len(sub_df_events)
    return count
def count_events(df_clients: pd.DataFrame, df_event: pd.DataFrame, col_event_date: str = 'DATE_EVENEMENT',
                 col_start_date: str = 'START_DATE', col_end_date: str = 'END_DATE',
                 col_event: str = 'nb_sin', events=['compensation']):
    df_clients_cp = df_clients[["ID_CLIENT", col_start_date]].copy()
    df_event_cp = df_event.copy()
    df_event_cp[col_event] = 1
    # TOTAL
    df_clients_cp[f'{col_event}_tot'] = df_clients_cp.apply(
        lambda row: check_below_df(row, df_event_cp, col_event_date), axis=1)
    return df_clients_cp
# ------------------------------------------------------------------
# ------------------------------------------------------------------
df_users = pd.DataFrame(data={
    'ID_CLIENT': ['A', 'A', 'A', 'B'],
    'START_DATE': ['2015-12-31', '2016-12-31', '2017-12-31', '2016-12-31'],
})
df_users["START_DATE"] = pd.to_datetime(df_users["START_DATE"])
df_events = pd.DataFrame(data={
    'ID_CLIENT': ['A', 'A', 'A', 'A', 'B'],
    'DATE_EVENT': ['2017-01-01', '2017-05-01', '2018-02-01', '2016-05-02', '2015-01-01']
})
df_events["DATE_EVENT"] = pd.to_datetime(df_events["DATE_EVENT"])
tmp = count_events(df_users, df_events, col_event_date='DATE_EVENT', col_event='nb_event')
tmp
Thanks for your help.
I guess the slow execution is caused by pd.apply(axis=1), which is explained here.
You can improve the execution time considerably by using operations that are not applied row-wise, for instance merge and groupby.
First we merge the frames:
df_merged = pd.merge(df_users, df_events, on='ID_CLIENT', how='left')
Then we check where DATE_EVENT <= START_DATE for the entire frame:
df_merged.loc[:, 'before'] = df_merged['DATE_EVENT'] <= df_merged['START_DATE']
Then we group by CLIENT_ID and START_DATE, and sum the 'before' column:
df_grouped = df_merged.groupby(by=['ID_CLIENT', 'START_DATE'])
df_out = df_grouped['before'].sum() # returns a series
Finally we convert df_out (a series) back to a dataframe, renaming the new column to 'nb_event_tot', and subsequently reset the index to get your desired output:
df_out = df_out.to_frame('nb_event_tot')
df_out = df_out.reset_index()
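Putting those steps together, here is a minimal sketch of a vectorized replacement for count_events (the name count_events_fast and the default column names are only illustrative; it reuses the df_users and df_events defined above):
def count_events_fast(df_clients, df_events, col_event_date='DATE_EVENT',
                      col_start_date='START_DATE', col_event='nb_event'):
    # One row per (client row, event) pair of the same client
    merged = df_clients[['ID_CLIENT', col_start_date]].merge(
        df_events[['ID_CLIENT', col_event_date]], on='ID_CLIENT', how='left')
    # Flag events that happened on or before the client's start date
    merged['before'] = merged[col_event_date] <= merged[col_start_date]
    # Count the flagged events per (client, start date)
    return (merged.groupby(['ID_CLIENT', col_start_date])['before']
                  .sum()
                  .rename(f'{col_event}_tot')
                  .reset_index())

tmp = count_events_fast(df_users, df_events)
As with the step-by-step version above, rows sharing the same (ID_CLIENT, START_DATE) pair collapse into a single output row, and the intermediate merge holds one row per client row and event of that client, which is the main memory cost to watch with 800K client rows.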
Related
I have a dataframe with a huge number of rows, and I want to apply a conditional groupby sum to this dataframe.
This is an example of my dataframe and code:
import pandas as pd
data = {'Case': [1, 1, 1, 1, 1, 1],
        'Id': [1, 1, 1, 1, 2, 2],
        'Date1': ['2020-01-01', '2020-01-01', '2020-02-01', '2020-02-01', '2020-01-01', '2020-01-01'],
        'Date2': ['2020-01-01', '2020-02-01', '2020-01-01', '2020-02-01', '2020-01-01', '2020-02-01'],
        'Quantity': [50,100,150,20,30,35]
        }
df = pd.DataFrame(data)
df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])
sum_list = []
for d in df['Date1'].unique():
    temp = df.groupby(['Case','Id']).apply(lambda x: x[(x['Date2'] == d) & (x['Date1']<d)]['Quantity'].sum()).rename('sum').to_frame()
    temp['Date'] = d
    sum_list.append(temp)
output = pd.concat(sum_list, axis=0).reset_index()
When I apply this for loop to the real dataframe, it is extremely slow. I want to find a better way to do this conditional groupby sum operation. Here are my questions:
Is a for loop a good method to do what I need here?
Are there better ways to replace the first line inside the for loop?
I feel the second line inside the for loop is also time-consuming; how should I improve it?
Thanks for your help.
One option is a double merge and a groupby:
date = pd.Series(df.Date1.unique(), name='Date')
step1 = df.merge(date, left_on = 'Date2', right_on = 'Date', how = 'outer')
step2 = step1.loc[step1.Date1 < step1.Date]
step2 = step2.groupby(['Case', 'Id', 'Date']).agg(sum=('Quantity','sum'))
(df
 .loc[:, ['Case', 'Id', 'Date2']]
 .drop_duplicates()
 .rename(columns={'Date2':'Date'})
 .merge(step2, how = 'left', on = ['Case', 'Id', 'Date'])
 .fillna({'sum': 0}, downcast='infer')
)
Case Id Date sum
0 1 1 2020-01-01 0
1 1 1 2020-02-01 100
2 1 2 2020-01-01 0
3 1 2 2020-02-01 35
apply is the slow one. Avoid it as much as you can.
I tested this with your small snippet and it gives the correct answer. You need to test more thoroughly with your real data:
case = df["Case"].unique()
id_ = df["Id"].unique()
d = df["Date1"].unique()
index = pd.MultiIndex.from_product([case, id_, d], names=["Case", "Id", "Date"])
# Sum only rows whose Date2 belong to a specific list of dates
# This is equivalent to `x['Date2'] == d` in your original code
cond = df["Date2"].isin(d)
tmp = df[cond].groupby(["Case", "Id", "Date1", "Date2"], as_index=False).sum()
# Select only those sums where Date1 < Date2 and sum again
# This takes care of the `x['Date1'] < d` condition
cond = tmp["Date1"] < tmp["Date2"]
output = tmp[cond].groupby(["Case", "Id", "Date2"]).sum().reindex(index, fill_value=0).reset_index()
Another solution:
x = df.groupby(["Case", "Id", "Date1"], as_index=False).apply(
    lambda x: x.loc[x["Date1"] < x["Date2"], "Quantity"].sum()
)
print(
    x.pivot(index=["Case", "Id"], columns="Date1", values=None)
    .fillna(0)
    .melt(ignore_index=False)
    .drop(columns=[None])
    .reset_index()
    .rename(columns={"Date1": "Date", "value":"sum"})
)
Prints:
Case Id Date sum
0 1 1 2020-01-01 100.0
1 1 2 2020-01-01 35.0
2 1 1 2020-02-01 0.0
3 1 2 2020-02-01 0.0
I am looking to join two data frames using the pd.merge_asof function. This function allows me to match data on a unique id and/or a nearest key. In this example, I am matching on the id as well as the nearest date that is less than or equal to the date in df1.
Is there a way to prevent the data from df2 being recycled when joining?
This is the code that I currently have that recycles the values in df2.
import pandas as pd
import datetime as dt
df1 = pd.DataFrame({'date': [dt.datetime(2020, 1, 2), dt.datetime(2020, 2, 2), dt.datetime(2020, 3, 2)],
                    'id': ['a', 'a', 'a']})
df2 = pd.DataFrame({'date': [dt.datetime(2020, 1, 1)],
                    'id': ['a'],
                    'value': ['1']})
pd.merge_asof(df1,
              df2,
              on='date',
              by='id',
              direction='backward',
              allow_exact_matches=True)
This is the output that I would like to see instead, where only the first match is successful.
Since your merge direction is backward, you can mask the value on duplicated id and df2 date after merge_asof:
out = pd.merge_asof(df1,
                    df2.rename(columns={'date':'date1'}), # rename df2's date
                    left_on='date',
                    right_on='date1', # so we can work on it later
                    by='id',
                    direction='backward',
                    allow_exact_matches=True)
# mask the value
out['value'] = out['value'].mask(out.duplicated(['id','date1']))
# equivalently
# out.loc[out.duplicated(['id', 'date1']), 'value'] = np.nan
Output:
date id date1 value
0 2020-01-02 a 2020-01-01 1
1 2020-02-02 a 2020-01-01 NaN
2 2020-03-02 a 2020-01-01 NaN
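If the goal is to end up with only the original df1 columns plus value, a small follow-up can drop the helper column afterwards, for example:
out = out.drop(columns='date1')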
I have two pandas dataframes. I would like to keep all rows in df2 where Type is equal to a Type in df1 AND Date is within one day (plus or minus) of the corresponding Date in df1. How can I do this?
df1
IBSN Type Date
0 1 X 2014-08-17
1 1 Y 2019-09-22
df2
IBSN Type Date
0 2 X 2014-08-16
1 2 D 2019-09-22
2 9 X 2014-08-18
3 3 H 2019-09-22
4 3 Y 2019-09-23
5 5 G 2019-09-22
res
IBSN Type Date
0 2 X 2014-08-16 <-- keep because Type = df1[0]['Type'] AND Date = df1[0]['Date'] - 1
1 9 X 2014-08-18 <-- keep because Type = df1[0]['Type'] AND Date = df1[0]['Date'] + 1
2 3 Y 2019-09-23 <-- keep because Type = df1[1]['Type'] AND Date = df1[1]['Date'] + 1
This should do it:
import pandas as pd
from datetime import timedelta
# create dummy data
df1 = pd.DataFrame([[1, 'X', '2014-08-17'], [1, 'Y', '2019-09-22']], columns=['IBSN', 'Type', 'Date'])
df1['Date'] = pd.to_datetime(df1['Date']) # might not be necessary if your Date column already contains datetime objects
df2 = pd.DataFrame([[2, 'X', '2014-08-16'], [2, 'D', '2019-09-22'], [9, 'X', '2014-08-18'], [3, 'H', '2019-09-22'], [3, 'Y', '2014-09-23'], [5, 'G', '2019-09-22']], columns=['IBSN', 'Type', 'Date'])
df2['Date'] = pd.to_datetime(df2['Date']) # might not be necessary if your Date column already contains datetime objects
# add date boundaries to the first dataframe
df1['Date_from'] = df1['Date'].apply(lambda x: x - timedelta(days=1))
df1['Date_to'] = df1['Date'].apply(lambda x: x + timedelta(days=1))
# merge the date boundaries to df2 on 'Type'. Filter rows where date is between
# data_from and date_to (inclusive). Drop 'date_from' and 'date_to' columns
df2 = df2.merge(df1.loc[:, ['Type', 'Date_from', 'Date_to']], on='Type', how='left')
df2[(df2['Date'] >= df2['Date_from']) & (df2['Date'] <= df2['Date_to'])].\
drop(['Date_from', 'Date_to'], axis=1)
Note that according to your logic, row 4 in df2 (3 Y 2014-09-23) should not remain as its date (2014) is not in between the given dates in df1 (year 2019).
Assume the Date columns in both dataframes are already of datetime dtype. I would construct an IntervalIndex and assign it to the index of df1, map the Type column of df1 to df2 via its Date column, and finally check equality to create a mask for slicing:
iix = pd.IntervalIndex.from_arrays(df1.Date + pd.Timedelta(days=-1),
                                   df1.Date + pd.Timedelta(days=1), closed='both')
df1 = df1.set_index(iix)
s = df2['Date'].map(df1.Type)
df_final = df2[df2.Type == s]
Out[1131]:
IBSN Type Date
0 2 X 2014-08-16
2 9 X 2014-08-18
4 3 Y 2019-09-23
I have a multi-index dataframe that look like this:
In[13]: df
Out[13]:
Last Trade
Date Ticker
1983-03-30 CLM83 1983-05-18
CLN83 1983-06-17
CLQ83 1983-07-18
CLU83 1983-08-19
CLV83 1983-09-16
CLX83 1983-10-18
CLZ83 1983-11-18
1983-04-04 CLM83 1983-05-18
CLN83 1983-06-17
CLQ83 1983-07-18
CLU83 1983-08-19
CLV83 1983-09-16
CLX83 1983-10-18
CLZ83 1983-11-18
It has two index levels (namely 'Date' and 'Ticker'). I would like to apply a function to the column 'Last Trade' that tells me how many months separate each 'Last Trade' date from the 'Date' index level.
I found a function that does the calculation:
import datetime
from calendar import monthrange

def monthdelta(d1, d2):
    delta = 0
    while True:
        mdays = monthrange(d1.year, d1.month)[1]
        d1 += datetime.timedelta(days=mdays)
        if d1 <= d2:
            delta += 1
        else:
            break
    return delta
I tried to apply the following function h, but it returns an AttributeError: 'Timestamp' object has no attribute 'index':
In[14]: h = lambda x: monthdelta(x.index.get_level_values(0),x)
In[15]: df['Last Trade'] = df['Last Trade'].apply(h)
How can I apply a function that would use both a column and an index value?
Thank you for your tips,
Use df.index.to_series().str.get(0) to get at the first level of the index.
(df['Last Trade'].dt.month - df.index.to_series().str.get(0).dt.month) + \
(df['Last Trade'].dt.year - df.index.to_series().str.get(0).dt.year) * 12
Date Ticker
1983-03-30 CLM83 2
CLN83 3
CLQ83 4
CLU83 5
CLV83 6
CLX83 7
CLZ83 8
1983-04-04 CLM83 1
CLN83 2
CLQ83 3
CLU83 4
CLV83 5
CLX83 6
CLZ83 7
dtype: int64
Timing
To compare timings on something bigger than the toy example, enlarge df:
df = pd.concat([df for _ in range(10000)])
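A minimal sketch of how such a measurement could be set up (the frame is rebuilt from the question's example so the snippet is self-contained; get_level_values is used here simply as a version-stable way to reach the 'Date' level):
import timeit
import pandas as pd

# Rebuild a frame shaped like the question's example
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(['1983-03-30', '1983-04-04']),
     ['CLM83', 'CLN83', 'CLQ83', 'CLU83', 'CLV83', 'CLX83', 'CLZ83']],
    names=['Date', 'Ticker'])
last_trade = pd.to_datetime(['1983-05-18', '1983-06-17', '1983-07-18', '1983-08-19',
                             '1983-09-16', '1983-10-18', '1983-11-18'] * 2)
df = pd.DataFrame({'Last Trade': last_trade}, index=idx)
big = pd.concat([df for _ in range(10000)])  # 140,000 rows

def vectorized():
    dates = big.index.get_level_values('Date')
    return ((big['Last Trade'].dt.month.values - dates.month.values)
            + (big['Last Trade'].dt.year.values - dates.year.values) * 12)

print(timeit.timeit(vectorized, number=10))  # total seconds for 10 runs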
Try this instead of your function:
Option 1
You get an integer number of months.
def monthdelta(row):
    trade = row['Last Trade'].year*12 + row['Last Trade'].month
    date = row['Date'].year*12 + row['Date'].month
    return trade - date

df.reset_index().apply(monthdelta, axis=1)
Inspired by PiRsquared:
df = df.reset_index()
(df['Last Trade'].dt.year*12 + df['Last Trade'].dt.month) -\
(df['Date'].dt.year*12 + df['Date'].dt.month)
Option 2
You get a numpy.timedelta64, which can be used directly for other date computations. However, it will be expressed in days, not months, because the number of days in a month is not constant.
def monthdelta(row):
    return row['Last Trade'] - row['Date']

df.reset_index().apply(monthdelta, axis=1)
Inspired by PiRsquared:
df = df.reset_index()
df['Last Trade'] - df['Date']
Option 2 will of course be faster, because it involves fewer computations. Pick what you like!
To get your index back: df = df.set_index(['Date', 'Ticker'])
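For completeness, a small sketch that wires the vectorized variant of Option 1 back onto the original frame as a new column (the column name month_delta is only illustrative):
flat = df.reset_index()
flat['month_delta'] = (flat['Last Trade'].dt.year*12 + flat['Last Trade'].dt.month) - \
                      (flat['Date'].dt.year*12 + flat['Date'].dt.month)
df = flat.set_index(['Date', 'Ticker'])  # restore the original MultiIndex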
How can I efficiently find overlapping dates between many date ranges?
I have a pandas dataframe containing information on the daily warehouse stock of many products. There are only records for those dates where stock actually changed.
import pandas as pd
df = pd.DataFrame({'product': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'stock': [10, 0, 10, 5, 0, 5],
                   'date': ['2016-01-01', '2016-01-05', '2016-01-15',
                            '2016-01-01', '2016-01-10', '2016-01-20']})
df['date'] = pd.to_datetime(df['date'])
Out[4]:
date product stock
0 2016-01-01 a 10
1 2016-01-05 a 0
2 2016-01-15 a 10
3 2016-01-01 b 5
4 2016-01-10 b 0
5 2016-01-20 b 5
From this data I want to identify the number of days where stock of all products was 0. In the example this would be 5 days (from 2016-01-10 to 2016-01-14).
I initially tried resampling the dates to create one record for every day and then comparing day by day. This works, but it creates a very large dataframe that I can hardly keep in memory, because my data contains many dates where stock does not change.
Is there a more memory-efficient way to calculate overlaps other than creating a record for every date and comparing day by day?
Maybe I can somehow create a period representation for the time range implicit in every record and then compare all periods for all products?
Another option could be to first subset only those time periods where a product has zero stock (relatively few) and then apply the resampling only on that subset of the data.
What other, more efficient ways are there?
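A rough sketch of those last two ideas combined (materialising the zero-stock days per product only for the few zero periods, then intersecting them across products) might look like this; how to treat a trailing zero record with no later stock change is an assumption here, and it is simply ignored:
zero_days_per_product = {}
for product, grp in df.sort_values('date').groupby('product'):
    grp = grp.reset_index(drop=True)
    days = set()
    for i, row in grp.iterrows():
        if row['stock'] == 0:
            # zero from this change date up to (but not including) the next change date;
            # a trailing zero record contributes nothing (assumption)
            end = grp['date'].iloc[i + 1] if i + 1 < len(grp) else row['date']
            days |= set(pd.date_range(row['date'], end - pd.Timedelta(days=1)))
    zero_days_per_product[product] = days
all_zero = set.intersection(*zero_days_per_product.values())
print(len(all_zero))  # 5 for the example data
Because only zero periods are materialised as day sets, memory stays proportional to the total length of the zero-stock periods rather than to the full date range.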
You can pivot the table using the dates as index and the products as columns, then fill NaNs with the previous values, convert to daily frequency, and look for rows with 0s in all columns.
ptable = (df.pivot(index='date', columns='product', values='stock')
            .fillna(method='ffill').asfreq('D', method='ffill'))
cond = ptable.apply(lambda x: (x == 0).all(), axis='columns')
print(ptable.index[cond])
DatetimeIndex(['2016-01-10', '2016-01-11', '2016-01-12', '2016-01-13',
'2016-01-14'],
dtype='datetime64[ns]', name=u'date', freq='D')
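If only the number of zero-stock days is needed (5 in this example), a short follow-up on cond is enough:
n_days = int(cond.sum())  # cond is True on every day where all products are at 0
print(n_days)             # 5 for the example data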
Here, try this. I know it's not the prettiest code, but according to the data provided here it should work:
from datetime import timedelta
import pandas as pd
df = pd.DataFrame({'product': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'stock': [10, 0, 10, 5, 0, 5],
                   'date': ['2016-01-01', '2016-01-05', '2016-01-15',
                            '2016-01-01', '2016-01-10', '2016-01-20']})
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date', ascending=True)
no_stock_dates = []
product_stock = {}
in_flag = False
begin = df['date'].iloc[0]
for index, row in df.iterrows():
    current = row['date']
    product_stock[row['product']] = row['stock']
    if current > begin:
        if sum(product_stock.values()) == 0 and not in_flag:
            in_flag = True
            begin = row['date']
        if sum(product_stock.values()) != 0 and in_flag:
            in_flag = False
            no_stock_dates.append((begin, current - timedelta(days=1)))
print(no_stock_dates)
This code should run in O(n*k) time, where n is the number of rows and k is the number of product categories.