Efficiently finding overlap between many date ranges - python

How can I efficiently find overlapping dates between many date ranges?
I have a pandas dataframe containing information on the daily warehouse stock of many products. There are only records for those dates where stock actually changed.
import pandas as pd

df = pd.DataFrame({'product': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'stock': [10, 0, 10, 5, 0, 5],
                   'date': ['2016-01-01', '2016-01-05', '2016-01-15',
                            '2016-01-01', '2016-01-10', '2016-01-20']})
df['date'] = pd.to_datetime(df['date'])
Out[4]:
        date product  stock
0 2016-01-01       a     10
1 2016-01-05       a      0
2 2016-01-15       a     10
3 2016-01-01       b      5
4 2016-01-10       b      0
5 2016-01-20       b      5
From this data I want to identify the number of days where stock of all products was 0. In the example this would be 5 days (from 2016-01-10 to 2016-01-14).
I initially tried resampling the data to create one record for every day and then comparing day by day. This works, but it creates a very large dataframe that I can hardly keep in memory, because my data contains many dates where stock does not change.
Is there a more memory-efficient way to calculate overlaps other than creating a record for every date and comparing day by day?
Maybe I can somehow create a period representation for the time range implicit in every record and then compare all periods for all products?
Another option could be to first subset only those time periods where a product has zero stock (relatively few) and then apply the resampling only on that subset of the data (a rough sketch of this idea follows below).
What other, more efficient ways are there?
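For illustration, here is a rough, untested sketch of that last idea: keep only the zero-stock records, pair each one with the date of the product's next record, expand just those (few) intervals to daily dates, and intersect them across products. It assumes every zero-stock period is closed by a later record for the same product.
zero_days_per_product = []
for product, grp in df.sort_values('date').groupby('product'):
    grp = grp.assign(next_date=grp['date'].shift(-1))
    # keep only the zero-stock records; each one spans from its own date
    # up to (but not including) the product's next record
    zero = grp[grp['stock'] == 0].dropna(subset=['next_date'])
    days = set()
    for _, row in zero.iterrows():
        days.update(pd.date_range(row['date'], row['next_date'] - pd.Timedelta(days=1)))
    zero_days_per_product.append(days)

all_zero_days = set.intersection(*zero_days_per_product)
print(len(all_zero_days))   # 5 for the sample data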

You can pivot the table using the dates as the index and the products as columns, forward-fill the NaNs with previous values, convert to daily frequency, and look for rows with zeros in all columns.
ptable = (df.pivot(index='date', columns='product', values='stock')
            .fillna(method='ffill')
            .asfreq('D', method='ffill'))
cond = ptable.apply(lambda x: (x == 0).all(), axis='columns')
print(ptable.index[cond])

DatetimeIndex(['2016-01-10', '2016-01-11', '2016-01-12', '2016-01-13',
               '2016-01-14'],
              dtype='datetime64[ns]', name='date', freq='D')
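If only the number of zero-stock days is needed, the same boolean Series can simply be summed:
print(cond.sum())   # 5 zero-stock days in the example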

Here, try this. I know it's not the prettiest code, but given the data provided here it should work:
from datetime import timedelta
import pandas as pd

df = pd.DataFrame({'product': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'stock': [10, 0, 10, 5, 0, 5],
                   'date': ['2016-01-01', '2016-01-05', '2016-01-15',
                            '2016-01-01', '2016-01-10', '2016-01-20']})
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date', ascending=True)

no_stock_dates = []      # (begin, end) tuples of zero-stock periods
product_stock = {}       # last known stock per product
in_flag = False          # True while inside a zero-stock period
begin = df['date'].iloc[0]

for index, row in df.iterrows():
    current = row['date']
    product_stock[row['product']] = row['stock']
    if current > begin:
        if sum(product_stock.values()) == 0 and not in_flag:
            in_flag = True
            begin = row['date']
        if sum(product_stock.values()) != 0 and in_flag:
            in_flag = False
            no_stock_dates.append((begin, current - timedelta(days=1)))

print(no_stock_dates)
This code should run in O(n*k) time, where n is the number of rows and k is the number of distinct products.
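To turn the collected (begin, end) tuples into a total number of zero-stock days, something like this should work:
total_days = sum((end - begin).days + 1 for begin, end in no_stock_dates)
print(total_days)   # 5 for the sample data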

Related

Dataframe: Calculate moving average for the Last 2 rows per group

The code below calculates the moving average for every row within a group.
However, I am only interested in the moving average of the last 2 rows for each group of id.
Since my data is quite large, this code takes too much time to run.
The desired output is a column avg that is NaN for all rows except for time = 4 and 5.
Thank you so much for your help. HC
import pandas as pd

df = {'id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
      'time': [1, 2, 3, 4, 5, 5, 1, 2, 3, 4],
      'value': [1, 2, 3, 4, 2, 16, 26, 50, 10, 30],
      }
df = pd.DataFrame(data=df)
df.sort_values(by=['id', 'time'], ascending=[True, True], inplace=True)
df['avg'] = df['value'].groupby(df['id']).apply(lambda g: g.rolling(3).mean())
df
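One possible speed-up, as an untested sketch: only the last 4 rows of each group can contribute to the rolling(3) mean of the last 2 rows, so the rolling computation can be restricted to those rows (variable names follow the example above).
# restrict the rolling mean to the tail of each group; everything else stays NaN
tail4 = df.groupby('id').tail(4)
tail_avg = (tail4.groupby('id')['value']
                 .rolling(3).mean()
                 .reset_index(level=0, drop=True))
df['avg'] = tail_avg.reindex(df.index)   # NaN for every row outside the tail
df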

How to prevent data from being recycled when using pd.merge_asof in Python

I am looking to join two data frames using the pd.merge_asof function. This function allows me to match data on a unique id and/or a nearest key. In this example, I am matching on the id as well as the nearest date that is less than or equal to the date in df1.
Is there a way to prevent the data from df2 being recycled when joining?
This is the code that I currently have that recycles the values in df2.
import pandas as pd
import datetime as dt

df1 = pd.DataFrame({'date': [dt.datetime(2020, 1, 2), dt.datetime(2020, 2, 2), dt.datetime(2020, 3, 2)],
                    'id': ['a', 'a', 'a']})
df2 = pd.DataFrame({'date': [dt.datetime(2020, 1, 1)],
                    'id': ['a'],
                    'value': ['1']})

pd.merge_asof(df1,
              df2,
              on='date',
              by='id',
              direction='backward',
              allow_exact_matches=True)
This is the output that I would like to see instead, where only the first match is successful.
Given your merge direction is backward, you can do a mask on duplicated id and df2's date after merge_asof:
out = pd.merge_asof(df1,
                    df2.rename(columns={'date': 'date1'}),  # rename df2's date
                    left_on='date',
                    right_on='date1',                       # so we can work on it later
                    by='id',
                    direction='backward',
                    allow_exact_matches=True)

# mask the value
out['value'] = out['value'].mask(out.duplicated(['id', 'date1']))

# equivalently
# out.loc[out.duplicated(['id', 'date1']), 'value'] = np.nan
Output:
        date id      date1 value
0 2020-01-02  a 2020-01-01     1
1 2020-02-02  a 2020-01-01   NaN
2 2020-03-02  a 2020-01-01   NaN
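If the helper date1 column is not wanted in the final output, it can simply be dropped afterwards:
out = out.drop(columns='date1')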

Counting number of events on each user in two dataframe

I'm attempting to count the number of events that occurred in the past for each user in a table. I actually have two dataframes: one with each user at a specific point 'T' in time, and one with each event, which also occurs at a point in time.
This is an example of the user table:
  ID_CLIENT START_DATE
0         A 2015-12-31
1         A 2016-12-31
2         A 2017-12-31
3         B 2016-12-31
This is an example of the event table:
  ID_CLIENT DATE_EVENT
0         A 2017-01-01
1         A 2017-05-01
2         A 2018-02-01
3         A 2016-05-02
4         B 2015-01-01
The idea is that, for each line in the "user" table, I want the count of events that occur before the date registered in "START_DATE".
Example of the final result:
  ID_CLIENT START_DATE  nb_event_tot
0         A 2015-12-31             0
1         A 2016-12-31             1
2         A 2017-12-31             3
3         B 2016-12-31             1
I have created a function which leverages the ".apply" function of pandas, but it's too slow... If anyone has an idea on how to speed it up, it would be gladly appreciated. I have 800K lines of users and 200K lines of events, which takes up to 3 hours with the apply method.
Here is my code to reproduce:
import pandas as pd


def check_below_df(row, df_events, col_event):
    # Select the ids
    id_c = row['ID_CLIENT']
    date = row['START_DATE']

    # Select subset of events df
    sub_df_events = df_events.loc[df_events['ID_CLIENT'] == id_c, :]
    sub_df_events = sub_df_events.loc[sub_df_events[col_event] <= date, :]
    count = len(sub_df_events)
    return count


def count_events(df_clients: pd.DataFrame, df_event: pd.DataFrame, col_event_date: str = 'DATE_EVENEMENT',
                 col_start_date: str = 'START_DATE', col_end_date: str = 'END_DATE',
                 col_event: str = 'nb_sin', events=['compensation']):
    df_clients_cp = df_clients[["ID_CLIENT", col_start_date]].copy()
    df_event_cp = df_event.copy()
    df_event_cp[col_event] = 1

    # TOTAL
    df_clients_cp[f'{col_event}_tot'] = df_clients_cp.apply(
        lambda row: check_below_df(row, df_event_cp, col_event_date), axis=1)
    return df_clients_cp

# ------------------------------------------------------------------
# ------------------------------------------------------------------
df_users = pd.DataFrame(data={
    'ID_CLIENT': ['A', 'A', 'A', 'B'],
    'START_DATE': ['2015-12-31', '2016-12-31', '2017-12-31', '2016-12-31'],
})
df_users["START_DATE"] = pd.to_datetime(df_users["START_DATE"])

df_events = pd.DataFrame(data={
    'ID_CLIENT': ['A', 'A', 'A', 'A', 'B'],
    'DATE_EVENT': ['2017-01-01', '2017-05-01', '2018-02-01', '2016-05-02', '2015-01-01']
})
df_events["DATE_EVENT"] = pd.to_datetime(df_events["DATE_EVENT"])

tmp = count_events(df_users, df_events, col_event_date='DATE_EVENT', col_event='nb_event')
tmp
Thanks for your help.
I guess the slow execution is caused by pd.apply(axis=1), which is explained here.
I estimate that you can improve the execution time by using functions that are not applied rowwise, for instance by using merge and groupby.
First we merge the frames:
df_merged = pd.merge(df_users, df_events, on='ID_CLIENT', how='left')
Then we check where DATE_EVENT <= START_DATE for the entire frame:
df_merged.loc[:, 'before'] = df_merged['DATE_EVENT'] <= df_merged['START_DATE']
Then we group by CLIENT_ID and START_DATE, and sum the 'before' column:
df_grouped = df_merged.groupby(by=['ID_CLIENT', 'START_DATE'])
df_out = df_grouped['before'].sum() # returns a series
Finally we convert df_out (a series) back to a dataframe, renaming the new column to 'nb_event_tot', and subsequently reset the index to get your desired output:
df_out = df_out.to_frame('nb_event_tot')
df_out = df_out.reset_index()
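For reference, the same idea can be written as a single chain; this is a minimal sketch reusing the df_users and df_events frames from the question:
df_out = (df_users.merge(df_events, on='ID_CLIENT', how='left')
                  .assign(before=lambda d: d['DATE_EVENT'] <= d['START_DATE'])
                  .groupby(['ID_CLIENT', 'START_DATE'], as_index=False)['before'].sum()
                  .rename(columns={'before': 'nb_event_tot'}))
print(df_out)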

Calculate percent return over certain time frame for each group

I have a pandas dataframe that contains daily price data for thousands of stocks. I'd like to calculate the percent change in the stock price for each stock over different time frames. Right now, I am putting all the stock symbols in a list and looping through my data frame with a standard "for loop" to calculate the different fields. This takes forever and there must be a faster way to achieve the same thing.
Here is what I am currently doing. I am looking for a faster and more efficient way of writing this:
from datetime import date
import pandas as pd

date1 = date(2020, 1, 1)
date2 = date(2020, 1, 2)
date3 = date(2020, 1, 3)
date4 = date(2020, 1, 4)

my_dict = {'ticker': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
           'close': [1, 2, 3, 4, 1, 1, 2, 5],
           'date': [date1, date2, date3, date4, date1, date2, date3, date4]}
df = pd.DataFrame(my_dict)

print('')
print(df)
print('')

ticker_list = list(sorted(set(df['ticker'].tolist())))

frames = []
for ticker in ticker_list:
    x = df[df['ticker'] == ticker].copy()
    x.set_index('date', inplace=True)
    # return from the close on/after date2 to the last close
    some_return = x.iloc[-1]['close'] / x.loc[date2:].iloc[0]['close'] - 1
    frames.append(pd.DataFrame({"ticker": ticker, "some_return": some_return}, index=[ticker]))

# DataFrame.append was removed in pandas 2.0, so collect frames and concat instead
final = pd.concat(frames)
print(final)
Not sure if you ever found a faster way since March, but here's an option assuming all the symbols have the same amount of data.
Build a dataframe with 1,000 symbols, each with 7,500 dates and 7,500 values:
symbols = ['A' + str(i) for i in range(1, 1001)]
df_hold_list = []
for s in symbols:
    my_dict = {'ticker': [s] * 7500,
               'close': [i for i in range(1, 7501)],
               'date': [d.date() for d in pd.date_range(start='1/1/2000', periods=7500)]}
    dft = pd.DataFrame(my_dict)
    df_hold_list.append(dft)

df = pd.concat(df_hold_list, axis=0).reset_index()
Create your analysis based on your period (selected by date):
# percent change from the last date to the selected date; this filters all dates
# for all symbols first, then uses groupby
ddate = date(2005, 1, 4)

# calculate rows for use in pct_change later; assumes all groups have the same number of data points
rows = df.loc[df['date'] >= ddate].groupby('ticker').get_group(list(df.groupby('ticker').groups)[0]).shape[0]

# calculate percent change
df['pctchnge'] = df.loc[df['date'] >= ddate].groupby('ticker')['close'].pct_change(periods=rows - 1)

# only the last value is needed; all others are NaN
df_final = df.groupby('ticker').tail(1)[['ticker', 'pctchnge']]
On my PC it takes ~10 seconds to build the df and less than 5 seconds to create the final df.
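If the groups do not all have the same number of data points, an alternative, untested sketch is to take the first close on or after the chosen date and the last close per ticker directly, assuming each ticker's rows are sorted by date:
sub = df[df['date'] >= ddate]
first_close = sub.groupby('ticker')['close'].first()   # close on/after ddate
last_close = sub.groupby('ticker')['close'].last()     # most recent close
pct = last_close / first_close - 1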

Pandas - multiple condition lookup speed

I'm working with some historical baseball data and trying to get matchup information (batter/pitcher) for previous games.
Example data:
import pandas as pd

data = {'ID': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
        'Year': ['2017-05-01', '2017-06-03', '2017-08-02', '2018-05-30', '2018-07-23', '2018-09-14',
                 '2017-06-01', '2017-08-03', '2018-05-15', '2018-07-23', '2017-05-01'],
        'ID2': [1, 2, 3, 2, 2, 1, 2, 2, 2, 1, 1],
        'Score 2': [1, 4, 5, 7, 5, 5, 6, 1, 4, 5, 6],
        'Score 3': [1, 4, 5, 7, 5, 5, 6, 1, 4, 5, 6],
        'Score 4': [1, 4, 5, 7, 5, 5, 6, 1, 4, 5, 6]}
df = pd.DataFrame(data)

lookup_data = {"First_Person": ['A', 'B'],
               "Second_Person": ['1', '2'],
               "Year": ['2018', '2018']}
lookup_df = pd.DataFrame(lookup_data)
Lookup df has the current matchups, df has the historical data and current matchups.
I want to find, for example, for Person A against Person 2, what were the results of any of their matchups on any previous date?
I can do this with:
history_list = []

def get_history(row, df, hist_list):
    # filter the df to matchups containing both players before the given date
    # and sum all events in their history
    history = df[(df['ID'] == row['First_Person']) &
                 (df['ID2'] == row['Second_Person']) &
                 (df['Year'] < row['Year'])].sum().iloc[3:]
    # add to a list to keep track of results
    hist_list.append(list(history.values) + [row['Year'] + row['First_Person'] + row['Second_Person']])
and then execute with apply like so:
lookup_df.apply(get_history, df=df, hist_list = history_list, axis=1)
Expected results would be something like:
1st P   Matchup date   2nd p   Historical scores
A       2018-07-23     2       11  11  11
B       2018-05-15     2        7   7   7
But this is pretty slow - the filtering operation takes around 50ms per lookup.
Is there a better way I can approach this problem? This currently would take over 3 hours to run across 250k historical matchups.
You can merge (or map) and groupby:
lookup_df['Second_Person'] = lookup_df['Second_Person'].astype(int)

merged = (df.merge(lookup_df,
                   left_on=['ID', 'ID2'],
                   right_on=['First_Person', 'Second_Person'],
                   how='left')
            .query('Year_x < Year_y')
            .drop(['Year_x', 'First_Person', 'Second_Person', 'Year_y'], axis=1))

merged.groupby('ID', as_index=False).sum()
  ID  ID2  Score 2  Score 3  Score 4
0  A    1        1        1        1
1  B    4        7        7        7
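If the opponent id should be treated as a grouping key rather than summed, one possible (untested) variation is to group by both columns:
merged.groupby(['ID', 'ID2'], as_index=False).sum()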
