I have a dataframe similar to this one:
user   start_date            end_date              activity
1      2022-05-21 07:23:58   2022-05-21 14:23:48   Sleep
2      2022-05-21 12:59:16   2022-05-21 14:59:16   Eat
1      2022-05-21 18:59:16   2022-05-21 21:20:16   Work
3      2022-05-21 18:50:00   2022-05-21 21:20:16   Work
I want a dataframe in which the columns are the activities and each row contains the sum of the durations of each activity, across all users, for that hour. Sorry, I find it hard to even put my thoughts into words. The expected result should be similar to this one:
hour  sleep[s]  eat[s]  work[s]
00    0         0       0
01    0         0       0
...   ...       ...     ...
07    2162      0       0
08    3600      0       0
...   ...       ...     ...
18    0         0       644
...   ...       ...     ...
The dataframe has more than 10 million rows, so I'm looking for something fast. I was trying to do something with crosstab to get the expected columns and resample to get the rows, but I'm not even close to finding a solution. I would be really grateful for any ideas on how to do that. This is my first question, so I apologize in advance for all mistakes :)
I am assuming the activities' start times can span more than one day and that you actually want the summary for each day. If not, the answer could be adapted by using datetime.hour instead of binning the start time.
The answer:
Calculate the duration that falls within each top-of-the-hour interval
Expand each row into a list of 1-hour intervals
Pivot the dataframe by activity
Code:
import pandas as pd
import numpy as np
import math
from datetime import datetime, timedelta
data={'users':[1,2,1,3],
'start':['2022-05-21 07:23:58', '2022-05-21 12:59:16', '2022-05-21 18:59:16', '2022-05-21 18:50:00'],
'end':[ '2022-05-21 14:23:48', '2022-05-21 14:59:16', '2022-05-21 21:20:16', '2022-05-21 21:20:16'],
'activity':[ 'Sleep', 'Eat', 'Work', 'Work']
}
df=pd.DataFrame(data)
# convert to datetime
df.start=pd.to_datetime(df.start)
df.end=pd.to_datetime(df.end)
#Here is where the answer starts
# function to expand the duration into a list of one-hour intervals
def genhourlist(e):
    start = e.start
    end = e.end
    nexthour = datetime(start.year, start.month, start.day, start.hour) + timedelta(hours=1)
    lst = []
    while end > nexthour:
        inter = (nexthour - start) / pd.Timedelta(seconds=1)
        lst.append((datetime(start.year, start.month, start.day, start.hour), inter))
        start = nexthour
        nexthour = nexthour + timedelta(hours=1)
    inter = (end - start) / pd.Timedelta(seconds=1)
    lst.append((datetime(end.year, end.month, end.day, end.hour), inter))
    return lst
# expand the duration
df['duration']=df.apply(genhourlist, axis=1)
df=df.explode('duration')
# update the duration and start
df['start']=df.duration.apply(lambda x: x[0])
df['duration']=df.duration.apply(lambda x: x[1])
pd.pivot_table(df,index=['start'],columns=['activity'], values=['duration'],aggfunc='sum')
Result:
duration
activity Eat Sleep Work
start
2022-05-21 07:00:00 NaN 2162.0 NaN
2022-05-21 08:00:00 NaN 3600.0 NaN
2022-05-21 09:00:00 NaN 3600.0 NaN
2022-05-21 10:00:00 NaN 3600.0 NaN
2022-05-21 11:00:00 NaN 3600.0 NaN
2022-05-21 12:00:00 44.0 3600.0 NaN
2022-05-21 13:00:00 3600.0 3600.0 NaN
2022-05-21 14:00:00 3556.0 1428.0 NaN
2022-05-21 18:00:00 NaN NaN 644.0
2022-05-21 19:00:00 NaN NaN 7200.0
2022-05-21 20:00:00 NaN NaN 7200.0
2022-05-21 21:00:00 NaN NaN 2432.0
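As mentioned at the top, the same expanded dataframe can be adapted to hour-of-day totals (the format of the question's expected output) by grouping on the hour number instead of the binned timestamp. A rough sketch, reusing df after the explode step above (not benchmarked on the 10M-row data):
# Group by hour of day (0-23) instead of by the full timestamp bin
df['hour'] = pd.to_datetime(df['start']).dt.hour
hourly = (pd.pivot_table(df, index='hour', columns='activity',
                         values='duration', aggfunc='sum')
          .fillna(0)
          .reindex(range(24), fill_value=0))   # show all 24 hours, including empty ones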
Use:
#get difference in seconds between end and start
tot = df['end_date'].sub(df['start_date']).dt.total_seconds()
#repeat unique index values, if necessary create default
#df = df.reset_index(drop=True)
df = df.loc[df.index.repeat(tot)].copy()
#add timedeltas in second to start datetimes
df['start'] = df['start_date'] + pd.to_timedelta(df.groupby(level=0).cumcount(), unit='s')
#create DatetimeIndex in hours with count activity - total seconds
df = df.groupby([pd.Grouper(freq='H',key='start'), 'activity']).size().unstack()
print (df)
activity Eat Sleep Work
start
2022-05-21 07:00:00 NaN 2162.0 NaN
2022-05-21 08:00:00 NaN 3600.0 NaN
2022-05-21 09:00:00 NaN 3600.0 NaN
2022-05-21 10:00:00 NaN 3600.0 NaN
2022-05-21 11:00:00 NaN 3600.0 NaN
2022-05-21 12:00:00 44.0 3600.0 NaN
2022-05-21 13:00:00 3600.0 3600.0 NaN
2022-05-21 14:00:00 3556.0 1428.0 NaN
2022-05-21 18:00:00 NaN NaN 644.0
2022-05-21 19:00:00 NaN NaN 7200.0
2022-05-21 20:00:00 NaN NaN 7200.0
2022-05-21 21:00:00 NaN NaN 2432.0
Here's what I did with the 4 rows of data you posted, along with my assumptions:
Goal = for each row,
- calculate total elapsed time,
- determine start hour sh, end hour eh,
- work out how many seconds pertain to sh & eh
then:
- create a 24-row df with results added up for all users (assuming the data starts at 00:00 and ends at 23:59 on the same day, there isn't any other day, and you're interested in the aggregate result for all users)
I'm using a For loop to build the final dataframe. It's probably not efficient enough if you have 10m rows in your original dataset, but it gets the job done on a smaller subset. Hopefully you can improve upon it!
import pandas as pd
import numpy as np
# work out total elapsed time, in seconds
df['total'] = df['end_date'] - df['start_date']
df['total'] = df['total'].dt.total_seconds()
# determine start/end hours (just the value: 07:00 AM => 7)
df['sh'] = df['start_date'].dt.hour
df['eh'] = df['end_date'].dt.hour
# determine the next/previous full-hour boundaries
df['next_hour'] = df['start_date'].dt.ceil('H')      # 07:23:58 => 08:00:00
df['previous_hour'] = df['end_date'].dt.floor('H')   # 14:23:48 => 14:00:00
df
# determine how many seconds pertain to start/end hours
df['sh_s'] = df['next_hour'] - df['start_date']
df['sh_s'] = df['sh_s'].dt.total_seconds()
df['eh_s'] = df['end_date'] - df['previous_hour']
df['eh_s'] = df['eh_s'].dt.total_seconds()
# where start hour & end hour are the same (start=07:20, end=07:30)
my_mask = df['sh'].eq(df['eh']).values
# set sh_s to be 07:30 - 07:20 = 10 minutes = 600 seconds
df['sh_s'] = df['sh_s'].where(~my_mask, other=(df['end_date'] - df['start_date']).dt.total_seconds())
# set eh_s to be 0
df['eh_s'] = df['eh_s'].where(~my_mask, other=0)
df
# add all hour columns (0-23) to df, initialised to 0 so the later sum stays numeric
hour_columns = np.arange(24)
df[hour_columns] = 0
df
# Sorry for this horrible loop below... Hopefully you can improve upon it or get rid of it completely!
for i in range(len(df)):  # for each row:
    sh = df.iloc[i, df.columns.get_loc('sh')]      # get start hour value
    sh_s = df.iloc[i, df.columns.get_loc('sh_s')]  # get seconds pertaining to start hour
    eh = df.iloc[i, df.columns.get_loc('eh')]      # get end hour value
    eh_s = df.iloc[i, df.columns.get_loc('eh_s')]  # get seconds pertaining to end hour
    # fill in hour columns
    # if start hour = end hour:
    if sh == eh:
        df.iloc[i, df.columns.get_loc(sh)] = sh_s
        # but ignore eh_s (it would cancel out the previous line)
    # if start hour != end hour, report both to the hour columns:
    else:
        df.iloc[i, df.columns.get_loc(sh)] = sh_s
        df.iloc[i, df.columns.get_loc(eh)] = eh_s
        # for each col between sh & eh, input 3600
        for j in range(sh + 1, eh):
            df.iloc[i, df.columns.get_loc(j)] = 3600
df
df.groupby('activity', as_index=False)[hour_columns].sum().transpose()
Result:
I need to split a year into enumerated 20-minute chunks and then find the sequence number of the corresponding time-range chunk for randomly distributed timestamps within that year, for further processing.
I tried to use pandas for this, but I can't find a way to look up a timestamp in the date_range index:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import pandas as pd
from datetime import timedelta
if __name__ == '__main__':
    date_start = pd.to_datetime('2018-01-01')
    date_end = date_start + timedelta(days=365)
    index = pd.date_range(start=date_start, end=date_end, freq='20min')
    data = range(len(index))
    df = pd.DataFrame(data, index=index, columns=['A'])
    print(df)
    event_ts = pd.to_datetime('2018-10-14 02:17:43')
    # How to find the corresponding df['A'] for event_ts?
    # print(df.loc[event_ts])
Output:
A
2018-01-01 00:00:00 0
2018-01-01 00:20:00 1
2018-01-01 00:40:00 2
2018-01-01 01:00:00 3
2018-01-01 01:20:00 4
... ...
2018-12-31 22:40:00 26276
2018-12-31 23:00:00 26277
2018-12-31 23:20:00 26278
2018-12-31 23:40:00 26279
2019-01-01 00:00:00 26280
[26281 rows x 1 columns]
What is the best practice to do this in python? I can imagine how to find the range "by hand" by converting the date_range to integers and comparing them, but maybe there are some elegant pandas/python-style ways to do it?
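For reference, the "by hand" approach I have in mind would be roughly this (a sketch, assuming fixed 20-minute bins counted from date_start):
# chunk number = elapsed time since date_start, floor-divided by 20 minutes
chunk_no = (event_ts - date_start) // pd.Timedelta(minutes=20)
print(chunk_no, df.iloc[chunk_no]['A'])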
First of all, I've worked with a small interval, one week:
date_end = date_start + timedelta(days=7)
Then I've followed your steps, and got a portion of your dataframe.
My event_ts is this:
event_ts = pd.to_datetime('2018-01-04 02:17:43')
And I've chosen to reset the index, and have a dataframe easy to manipulate:
df = df.reset_index()
With this code I found the last value where event_ts belongs:
run = []
for i in df['index']:
    if i <= event_ts:
        run.append(i)
print(max(run))
# 2018-01-04 02:00:00
or:
top = max(run)
Finally:
df.loc[df['index'] == top].index[0]
222
event_ts belongs to index df[222]
I have 7 columns of data, indexed by datetime (30-minute frequency), starting 2017-05-31 and ending 2018-05-25. I want to plot the mean of specific date ranges (seasons). I have been trying groupby, but I can't get it to group by a specific range; I get wrong results if I do df.groupby(df.date.dt.month).mean().
A few lines from the dataset (date range is from 2017-05-31 to 2018-05-25)
50 51 56 58
date
2017-05-31 00:00:00 200.213542 276.929198 242.879051 NaN
2017-05-31 00:30:00 200.215478 276.928229 242.879051 NaN
2017-05-31 01:00:00 200.215478 276.925324 242.878083 NaN
2017-06-01 01:00:00 200.221288 276.944691 242.827729 NaN
2017-06-01 01:30:00 200.221288 276.944691 242.827729 NaN
2017-08-31 09:00:00 206.961886 283.374453 245.041349 184.358250
2017-08-31 09:30:00 206.966727 283.377358 245.042317 184.360187
2017-12-31 09:00:00 212.925877 287.198416 247.455413 187.175144
2017-12-31 09:30:00 212.926846 287.196480 247.465097 187.179987
2018-03-31 23:00:00 213.304498 286.933093 246.469647 186.887548
2018-03-31 23:30:00 213.308369 286.938902 246.468678 186.891422
2018-04-30 23:00:00 215.496812 288.342024 247.522230 188.104749
2018-04-30 23:30:00 215.497781 288.340086 247.520294 188.103780
I have created these variables (These are the ranges I need)
increment_rates_winter = df['2017-08-30'].mean() - df['2017-06-01'].mean()
increment_rates_spring = df['2017-11-30'].mean() - df['2017-09-01'].mean()
increment_rates_summer = df['2018-02-28'].mean() - df['2017-12-01'].mean()
increment_rates_fall = df['2018-05-24'].mean() - df['2018-03-01'].mean()
Concatenated them:
df_seasons =pd.concat([increment_rates_winter,increment_rates_spring,increment_rates_summer,increment_rates_fall],axis=1)
and after plotting, I got this:
However, I've been trying to get this:
df_seasons
Out[664]:
Winter Spring Summer Fall
50 6.697123 6.948447 -1.961549 7.662622
51 6.428329 4.760650 -2.188402 5.927087
52 5.580953 6.667529 1.136889 12.939295
53 6.406259 2.506279 -2.105125 6.964549
54 4.332826 3.678492 -2.574769 6.569398
56 2.222032 3.359607 -2.694863 5.348258
58 NaN 1.388535 -0.035889 4.213046
The seasons on the x-axis and the means plotted for each column.
Winter = df['2017-06-01':'2017-08-30']
Spring = df['2017-09-01':'2017-11-30']
Summer = df['2017-12-01':'2018-02-28']
Fall = df['2018-03-01':'2018-05-30']
Thank you in advance!
You can get a specific date range in the following way; then define whichever ranges you need and take the mean:
import pandas as pd
df = pd.read_csv('test.csv')
df['date'] = pd.to_datetime(df['date'])
start_date = "2017-12-31 09:00:00"
end_date = "2018-04-30 23:00:00"
mask = (df['date'] > start_date) & (df['date'] <= end_date)
f_df = df.loc[mask]
This gives the output
date 50 ... 58
8 2017-12-31 09:30:00 212.926846 ... 187.179987 NaN
9 2018-03-31 23:00:00 213.304498 ... 186.887548 NaN
10 2018-03-31 23:30:00 213.308369 ... 186.891422 NaN
11 2018-04-30 23:00:00 215.496812 ... 188.104749 NaN
Hope this helps
How about transposing it:
df_seasons.T.plot()
Output:
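For completeness, here is a rough sketch of how df_seasons could be assembled from the seasonal ranges listed in the question and then plotted transposed (assuming df has a DatetimeIndex; the range endpoints are taken from the question):
# Mean at the end of each season minus mean at its start, per column
seasons = {'Winter': ('2017-06-01', '2017-08-30'),
           'Spring': ('2017-09-01', '2017-11-30'),
           'Summer': ('2017-12-01', '2018-02-28'),
           'Fall':   ('2018-03-01', '2018-05-24')}
df_seasons = pd.concat(
    {name: df.loc[end].mean() - df.loc[start].mean()
     for name, (start, end) in seasons.items()},
    axis=1)
df_seasons.T.plot()   # seasons on the x-axis, one line per column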
(Not duplicate / my question is entirely different)
My dataframe looks like this:
# [df2] is day based
time time2
2017-01-01, 2017-01-01 00:12:00
2017-01-02, 2017-01-02 03:15:00
2017-01-03, 2017-01-03 01:25:00
2017-01-04, 2017-01-04 04:12:00
2017-01-05, 2017-01-05 00:45:00
....
# [df] is minute based
time value
2017-01-01 00:01:00, 0.1232
2017-01-01 00:02:00, 0.1232
2017-01-01 00:03:00, 0.1232
2017-01-01 00:04:00, 0.1232
2017-01-01 00:05:00, 0.1232
....
I want to create a new column called time_val_min in [df2] that contains the minimum df['time'] value from [df] that falls within the range specified by df2['time'] and df2['time2'].
What did I do?
I did df2['time_val_min'] = df[df['time'].dt.hour.between(df2['time'], df2['time'])].min() but it does not work.
Could you please let me know how to fix it?
You can merge the two data frames on date and filter on the time:
# create the date from the time column
df['date'] = df['time'].dt.normalize()
# merge
new_df = (df.merge(df2, left_on='date', # left on date
right_on='time', # right on time, if time is purely beginning of days
how='right',
suffixes=['','_y'])
.query('time < time2')
.groupby('date')
['time'].min()
.to_frame(name='time_val_min')
.merge(df2, right_on='time', left_index=True)
)
Output:
time_val_min time time2
0 2017-01-01 00:01:00 2017-01-01 2017-01-01 00:12:00
I've got two dataframes (very long, with hundreds or thousands of rows each). One of them, called df1, contains a timeseries, in intervals of 10 minutes. For example:
date value
2016-11-24 00:00:00 1759.199951
2016-11-24 00:10:00 992.400024
2016-11-24 00:20:00 1404.800049
2016-11-24 00:30:00 45.799999
2016-11-24 00:40:00 24.299999
2016-11-24 00:50:00 159.899994
2016-11-24 01:00:00 82.499999
2016-11-24 01:10:00 37.400003
2016-11-24 01:20:00 159.899994
....
And the other one, df2, contains datetime intervals:
start_date end_date
0 2016-11-23 23:55:32 2016-11-24 00:14:03
1 2016-11-24 01:03:18 2016-11-24 01:07:12
2 2016-11-24 01:11:32 2016-11-24 02:00:00
...
I need to select all the rows in df1 that "falls" into an interval in df2.
With these examples, the result dataframe should be:
date value
2016-11-24 00:00:00 1759.199951 # Fits in row 0 of df2
2016-11-24 00:10:00 992.400024 # Fits in row 0 of df2
2016-11-24 01:00:00 82.499999 # Fits in row 1 of df2
2016-11-24 01:10:00 37.400003 # Fits on row 2 of df2
2016-11-24 01:20:00 159.899994 # Fits in row 2 of df2
....
Using np.searchsorted:
Here's a variation based on np.searchsorted that seems to be an order of magnitude faster than using intervaltree or merge, assuming my larger sample data is correct.
# Ensure the df2 is sorted (skip if it's already known to be).
df2 = df2.sort_values(by=['start_date', 'end_date'])
# Add the end of the time interval to df1.
df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
# Perform the searchsorted and get the corresponding df2 values for both endpoints of df1.
s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
# Build the conditions that indicate an overlap (any True condition indicates an overlap).
cond = [
    df1['date'].values <= s1['end_date'].values,
    df1['date_end'].values <= s2['end_date'].values,
    s1.index.values != s2.index.values
]
# Filter df1 to only the overlapping intervals, and drop the extra 'date_end' column.
df1 = df1[np.any(cond, axis=0)].drop('date_end', axis=1)
This may need to be modified if the intervals in df2 are nested or overlapping; I haven't fully thought it through in that scenario, but it may still work.
Using an Interval Tree
Not quite a pure Pandas solution, but you may want to consider building an Interval Tree from df2, and querying it against your intervals in df1 to find the ones that overlap.
The intervaltree package on PyPI seems to have good performance and easy to use syntax.
from intervaltree import IntervalTree
# Build the Interval Tree from df2.
tree = IntervalTree.from_tuples(df2.astype('int64').values + [0, 1])
# Build the 10 minutes spans from df1.
dt_pairs = pd.concat([df1['date'], df1['date'] + pd.offsets.Minute(10)], axis=1)
# Query the Interval Tree to filter df1.
df1 = df1[[tree.overlaps(*p) for p in dt_pairs.astype('int64').values]]
I converted the dates to their integer equivalents for performance reasons. I doubt the intervaltree package was built with pd.Timestamp in mind, so there are probably some intermediate conversion steps that slow things down a bit.
Also, note that intervals in the intervaltree package do not include the end point, although the start point is included. That's why I have the + [0, 1] when creating tree; I'm padding the end point by a nanosecond to make sure the real end point is actually included. It's also the reason why it's fine for me to add pd.offsets.Minute(10) to get the interval end when querying the tree, instead of adding only 9m 59s.
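A tiny illustration of that half-open behaviour (toy integer intervals, unrelated to the question's data):
from intervaltree import IntervalTree

# Intervals are [begin, end): the end point itself is excluded.
toy = IntervalTree.from_tuples([(0, 10)])
print(toy.overlaps(9, 12))    # True  -- [0, 10) and [9, 12) share [9, 10)
print(toy.overlaps(10, 12))   # False -- 10 is already outside [0, 10)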
The resulting output for either method:
date value
0 2016-11-24 00:00:00 1759.199951
1 2016-11-24 00:10:00 992.400024
6 2016-11-24 01:00:00 82.499999
7 2016-11-24 01:10:00 37.400003
8 2016-11-24 01:20:00 159.899994
Timings
Using the following setup to produce larger sample data:
# Sample df1.
n1 = 55000
df1 = pd.DataFrame({'date': pd.date_range('2016-11-24', freq='10T', periods=n1), 'value': np.random.random(n1)})
# Sample df2.
n2 = 500
df2 = pd.DataFrame({'start_date': pd.date_range('2016-11-24', freq='18H22T', periods=n2)})
# Randomly shift the start and end dates of the df2 intervals.
shift_start = pd.Series(np.random.randint(30, size=n2)).cumsum().apply(lambda s: pd.DateOffset(seconds=s))
shift_end1 = pd.Series(np.random.randint(30, size=n2)).apply(lambda s: pd.DateOffset(seconds=s))
shift_end2 = pd.Series(np.random.randint(5, 45, size=n2)).apply(lambda m: pd.DateOffset(minutes=m))
df2['start_date'] += shift_start
df2['end_date'] = df2['start_date'] + shift_end1 + shift_end2
Which yields the following for df1 and df2:
df1
date value
0 2016-11-24 00:00:00 0.444939
1 2016-11-24 00:10:00 0.407554
2 2016-11-24 00:20:00 0.460148
3 2016-11-24 00:30:00 0.465239
4 2016-11-24 00:40:00 0.462691
...
54995 2017-12-10 21:50:00 0.754123
54996 2017-12-10 22:00:00 0.401820
54997 2017-12-10 22:10:00 0.146284
54998 2017-12-10 22:20:00 0.394759
54999 2017-12-10 22:30:00 0.907233
df2
start_date end_date
0 2016-11-24 00:00:19 2016-11-24 00:41:24
1 2016-11-24 18:22:44 2016-11-24 18:36:44
2 2016-11-25 12:44:44 2016-11-25 13:03:13
3 2016-11-26 07:07:05 2016-11-26 07:49:29
4 2016-11-27 01:29:31 2016-11-27 01:34:32
...
495 2017-12-07 21:36:04 2017-12-07 22:14:29
496 2017-12-08 15:58:14 2017-12-08 16:10:35
497 2017-12-09 10:20:21 2017-12-09 10:26:40
498 2017-12-10 04:42:41 2017-12-10 05:22:47
499 2017-12-10 23:04:42 2017-12-10 23:44:53
And using the following functions for timing purposes:
def root_searchsorted(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
    # Build the conditions that indicate an overlap (any True condition indicates an overlap).
    cond = [
        df1['date'].values <= s1['end_date'].values,
        df1['date_end'].values <= s2['end_date'].values,
        s1.index.values != s2.index.values
    ]
    # Filter df1 to only the overlapping intervals, and drop the extra 'date_end' column.
    return df1[np.any(cond, axis=0)].drop('date_end', axis=1)
def root_intervaltree(df1, df2):
    # Build the Interval Tree.
    tree = IntervalTree.from_tuples(df2.astype('int64').values + [0, 1])
    # Build the 10 minutes spans from df1.
    dt_pairs = pd.concat([df1['date'], df1['date'] + pd.offsets.Minute(10)], axis=1)
    # Query the Interval Tree to filter the DataFrame.
    return df1[[tree.overlaps(*p) for p in dt_pairs.astype('int64').values]]
def ptrj(df1, df2):
    # The smallest amount of time - handy when using open intervals:
    epsilon = pd.Timedelta(1, 'ns')
    # Lookup series (`asof` works best with series) for `start_date` and `end_date` from `df2`:
    sdate = pd.Series(data=range(df2.shape[0]), index=df2.start_date)
    edate = pd.Series(data=range(df2.shape[0]), index=df2.end_date + epsilon)
    # (filling NaN's with -1)
    l = edate.asof(df1.date).fillna(-1)
    r = sdate.asof(df1.date + (pd.Timedelta(10, 'm') - epsilon)).fillna(-1)
    # (taking `values` here to skip indexes, which are different)
    mask = l.values < r.values
    return df1[mask]
def parfait(df1, df2):
    df1['key'] = 1
    df2['key'] = 1
    df2['row'] = df2.index.values
    # CROSS JOIN
    df3 = pd.merge(df1, df2, on=['key'])
    # DF FILTERING
    return df3[df3['start_date'].between(df3['date'], df3['date'] + dt.timedelta(minutes=9, seconds=59), inclusive=True) | df3['date'].between(df3['start_date'], df3['end_date'], inclusive=True)].set_index('date')[['value', 'row']]
def root_searchsorted_modified(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
    # ---- further is the MODIFIED code ----
    # Filter df1 to only overlapping intervals.
    df1.query('(date <= @s1.end_date.values) |\
               (date_end <= @s1.end_date.values) |\
               (@s1.index.values != @s2.index.values)', inplace=True)
    # Drop the extra 'date_end' column.
    return df1.drop('date_end', axis=1)
I get the following timings:
%timeit root_searchsorted(df1.copy(), df2.copy())
100 loops best of 3: 9.55 ms per loop
%timeit root_searchsorted_modified(df1.copy(), df2.copy())
100 loops best of 3: 13.5 ms per loop
%timeit ptrj(df1.copy(), df2.copy())
100 loops best of 3: 18.5 ms per loop
%timeit root_intervaltree(df1.copy(), df2.copy())
1 loop best of 3: 4.02 s per loop
%timeit parfait(df1.copy(), df2.copy())
1 loop best of 3: 8.96 s per loop
This solution (I believe it works) uses pandas.Series.asof. Under the hood, it's some version of searchsorted - and it's comparable in speed with @root's function.
I assume that all date columns are in the pandas datetime format, sorted, and that df2 intervals are non-overlapping.
The code is pretty short but somewhat intricate (explanation below).
# The smallest amount of time - handy when using open intervals:
epsilon = pd.Timedelta(1, 'ns')
# Lookup series (`asof` works best with series) for `start_date` and `end_date` from `df2`:
sdate = pd.Series(data=range(df2.shape[0]), index=df2.start_date)
edate = pd.Series(data=range(df2.shape[0]), index=df2.end_date + epsilon)
# The main function (see explanation below):
def get_it(df1):
    # (filling NaN's with -1)
    l = edate.asof(df1.date).fillna(-1)
    r = sdate.asof(df1.date + (pd.Timedelta(10, 'm') - epsilon)).fillna(-1)
    # (taking `values` here to skip indexes, which are different)
    mask = l.values < r.values
    return df1[mask]
The advantage of this approach is twofold: sdate and edate are evaluated only once and the main function can take chunks of df1 if df1 is very large.
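For example, chunked use might look roughly like this (a sketch; the chunk size is arbitrary):
import numpy as np

# Apply get_it chunk by chunk and stitch the pieces back together (hypothetical chunk size)
chunk_size = 100_000
parts = [get_it(chunk) for _, chunk in df1.groupby(np.arange(len(df1)) // chunk_size)]
result = pd.concat(parts)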
Explanation
pandas.Series.asof returns the last valid row for a given index. It can take an array as an input and is quite fast.
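A toy illustration of that behaviour (made-up timestamps, not the question's data):
s = pd.Series([0, 1, 2],
              index=pd.to_datetime(['2016-11-24 00:00', '2016-11-24 01:00', '2016-11-24 02:00']))
s.asof(pd.Timestamp('2016-11-24 01:30'))   # -> 1, the last row whose index is <= the lookup value
s.asof(pd.Timestamp('2016-11-23 12:00'))   # -> nan, nothing on or before this timestamp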
For the sake of this explanation, let s[j] = sdate.index[j] be the jth date in sdate and x be some arbitrary date (timestamp).
There is always s[sdate.asof(x)] <= x (this is exactly how asof works) and it's not difficult to show that:
j <= sdate.asof(x) if and only if s[j] <= x
sdate.asof(x) < j if and only if x < s[j]
Similarly for edate. Unfortunately, we can't have the same inequalities (either weak or strict) in both 1. and 2.
Two intervals [a, b) and [x, y] intersect iff x < b and a <= y.
(We may think of a, b as coming from sdate.index and edate.index - the interval [a, b) is chosen to be closed-open because of properties 1. and 2.)
In our case x is a date from df1, y = x + 10min - epsilon,
a = s[j], b = e[j] (note that epsilon has been added to edate), where j is some number.
So, finally, the condition equivalent to "[a, b) and [x, y] intersect" is
"sdate.asof(x) < j and j <= edate.asof(y) for some number j". And it roughly boils down to l < r inside the function get_it (modulo some technicalities).
This is not exactly straightforward but you can do the following:
First get the relevant date columns from the two dataframes and concatenate them together so that one column is all the dates and the other two are columns representing the indexes from df2. (Note that df2 gets a multiindex after stacking)
dfm = pd.concat((df1['date'],df2.stack().reset_index())).sort_values(0)
print(dfm)
0 level_0 level_1
0 2016-11-23 23:55:32 0.0 start_date
0 2016-11-24 00:00:00 NaN NaN
1 2016-11-24 00:10:00 NaN NaN
1 2016-11-24 00:14:03 0.0 end_date
2 2016-11-24 00:20:00 NaN NaN
3 2016-11-24 00:30:00 NaN NaN
4 2016-11-24 00:40:00 NaN NaN
5 2016-11-24 00:50:00 NaN NaN
6 2016-11-24 01:00:00 NaN NaN
2 2016-11-24 01:03:18 1.0 start_date
3 2016-11-24 01:07:12 1.0 end_date
7 2016-11-24 01:10:00 NaN NaN
4 2016-11-24 01:11:32 2.0 start_date
8 2016-11-24 01:20:00 NaN NaN
5 2016-11-24 02:00:00 2.0 end_date
You can see that the values from df1 have NaN in the right two columns and since we have sorted the dates, these rows fall in between the start_date and end_date rows (from df2).
In order to indicate that the rows from df1 fall between the rows from df2 we can interpolate the level_0 column which gives us:
dfm['level_0'] = dfm['level_0'].interpolate()
0 level_0 level_1
0 2016-11-23 23:55:32 0.000000 start_date
0 2016-11-24 00:00:00 0.000000 NaN
1 2016-11-24 00:10:00 0.000000 NaN
1 2016-11-24 00:14:03 0.000000 end_date
2 2016-11-24 00:20:00 0.166667 NaN
3 2016-11-24 00:30:00 0.333333 NaN
4 2016-11-24 00:40:00 0.500000 NaN
5 2016-11-24 00:50:00 0.666667 NaN
6 2016-11-24 01:00:00 0.833333 NaN
2 2016-11-24 01:03:18 1.000000 start_date
3 2016-11-24 01:07:12 1.000000 end_date
7 2016-11-24 01:10:00 1.500000 NaN
4 2016-11-24 01:11:32 2.000000 start_date
8 2016-11-24 01:20:00 2.000000 NaN
5 2016-11-24 02:00:00 2.000000 end_date
Notice that the level_0 column now contains integers (mathematically, not the data type) for the rows that fall between a start date and an end date (this assumes that an end date will not overlap the following start date).
Now we can just filter out the rows originally in df1:
df_falls = dfm[(dfm['level_0'] == dfm['level_0'].astype(int)) & (dfm['level_1'].isnull())][[0,'level_0']]
df_falls.columns = ['date', 'falls_index']
And merge back with the original dataframe
df_final = pd.merge(df1, right=df_falls, on='date', how='outer')
which gives:
print(df_final)
date value falls_index
0 2016-11-24 00:00:00 1759.199951 0.0
1 2016-11-24 00:10:00 992.400024 0.0
2 2016-11-24 00:20:00 1404.800049 NaN
3 2016-11-24 00:30:00 45.799999 NaN
4 2016-11-24 00:40:00 24.299999 NaN
5 2016-11-24 00:50:00 159.899994 NaN
6 2016-11-24 01:00:00 82.499999 NaN
7 2016-11-24 01:10:00 37.400003 NaN
8 2016-11-24 01:20:00 159.899994 2.0
This is the same as the original dataframe, with the extra column falls_index indicating the index of the row in df2 that each row falls into.
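If you only want the rows of df1 that actually fall inside some interval, a small follow-up step would be to drop the unmatched rows:
# Keep only rows that were matched to an interval (falls_index is NaN otherwise)
df_selected = df_final.dropna(subset=['falls_index'])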
Consider a cross join merge that returns the Cartesian product of both sets (all possible row pairings, M x N). You can cross join using an all-1's key column in merge's on argument. Then, run a filter on the large returned set using pd.Series.between(). Specifically, between() keeps rows where start_date falls within the 9:59 range of date, or date falls between the start and end times.
However, prior to the merge, create a df1['date'] column equal to the date index so it is retained after the merge and can be used for date filtering. Additionally, create a df2['row'] column to be used as a row indicator at the end. For the demo, the code below recreates the posted df1 and df2 dataframes:
from io import StringIO
import pandas as pd
import datetime as dt
data1 = '''
date value
"2016-11-24 00:00:00" 1759.199951
"2016-11-24 00:10:00" 992.400024
"2016-11-24 00:20:00" 1404.800049
"2016-11-24 00:30:00" 45.799999
"2016-11-24 00:40:00" 24.299999
"2016-11-24 00:50:00" 159.899994
"2016-11-24 01:00:00" 82.499999
"2016-11-24 01:10:00" 37.400003
"2016-11-24 01:20:00" 159.899994
'''
df1 = pd.read_table(StringIO(data1), sep='\s+', parse_dates=[0], index_col=0)
df1['key'] = 1
df1['date'] = df1.index.values
data2 = '''
start_date end_date
"2016-11-23 23:55:32" "2016-11-24 00:14:03"
"2016-11-24 01:03:18" "2016-11-24 01:07:12"
"2016-11-24 01:11:32" "2016-11-24 02:00:00"
'''
df2 = pd.read_table(StringIO(data2), sep='\s+', parse_dates=[0,1])
df2['key'] = 1
df2['row'] = df2.index.values
# CROSS JOIN
df3 = pd.merge(df1, df2, on=['key'])
# DF FILTERING
df3 = df3[df3['start_date'].between(df3['date'], df3['date'] + dt.timedelta(minutes=9, seconds=59), inclusive=True) |
          df3['date'].between(df3['start_date'], df3['end_date'], inclusive=True)].set_index('date')[['value', 'row']]
print(df3)
# value row
# date
# 2016-11-24 00:00:00 1759.199951 0
# 2016-11-24 00:10:00 992.400024 0
# 2016-11-24 01:00:00 82.499999 1
# 2016-11-24 01:10:00 37.400003 2
# 2016-11-24 01:20:00 159.899994 2
I tried to modify @root's code with the experimental pandas query method.
It should be faster than the original implementation for very large DataFrames. For small DataFrames it will definitely be slower.
def root_searchsorted_modified(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
    # ---- further is the MODIFIED code ----
    # Filter df1 to only overlapping intervals.
    df1.query('(date <= @s1.end_date.values) |\
               (date_end <= @s1.end_date.values) |\
               (@s1.index.values != @s2.index.values)', inplace=True)
    # Drop the extra 'date_end' column.
    return df1.drop('date_end', axis=1)