Comparing timestamps in dataframe columns with pandas - python

Let's say I have a dataframe like this:
df1:
datetime1 datetime2
0 2021-05-09 19:52:14 2021-05-09 20:52:14
1 2021-05-09 19:52:14 2021-05-09 21:52:14
2 NaN NaN
3 2021-05-09 16:30:14 NaN
4 NaN NaN
5 2021-05-09 12:30:14 2021-05-09 14:30:14
I want to compare the timestamps in datetime1 and datetime2 and create a new column with the difference between them.
In some scenarios I have no values in datetime1 and datetime2, or I have a value in datetime1 but not in datetime2. Is there a way to get NaN in the "Difference" column when there is no timestamp in either column, and, when there is a timestamp only in datetime1, to compute the difference against datetime.now() and put it in another column?
Desirable df output:
datetime1 datetime2 Difference in H:m:s Compared with datetime.now()
0 2021-05-09 19:52:14 2021-05-09 20:52:14 01:00:00 NaN
1 2021-05-09 19:52:14 2021-05-09 21:52:14 02:00:00 NaN
2 NaN NaN NaN NaN
3 2021-05-09 16:30:14 NaN NaN e.g(04:00:00)
4 NaN NaN NaN NaN
5 2021-05-09 12:30:14 2021-05-09 14:30:14 02:00:00 NaN
I tried a solution from @AndrejKesely, but it fails when there is no timestamp in datetime1 or datetime2:
def strfdelta(tdelta, fmt):
    d = {"days": tdelta.days}
    d["hours"], rem = divmod(tdelta.seconds, 3600)
    d["minutes"], d["seconds"] = divmod(rem, 60)
    return fmt.format(**d)

# if datetime1/datetime2 aren't already datetime, apply `pd.to_datetime()`:
df["datetime1"] = pd.to_datetime(df["datetime1"])
df["datetime2"] = pd.to_datetime(df["datetime2"])

df["Difference in H:m:s"] = df.apply(
    lambda x: strfdelta(
        x["datetime2"] - x["datetime1"],
        "{hours:02d}:{minutes:02d}:{seconds:02d}",
    ),
    axis=1,
)
print(df)

Select only the rows that match each condition using boolean indexing (masks), and let pandas fill the remaining values with NaN:
from datetime import datetime

def strfdelta(td: pd.Timedelta):
    seconds = td.total_seconds()
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    seconds = int(seconds % 60)
    return f"{hours:02}:{minutes:02}:{seconds:02}"

bm1 = df["datetime1"].notna() & df["datetime2"].notna()
bm2 = df["datetime1"].notna() & df["datetime2"].isna()

df["Difference in H:m:s"] = (df.loc[bm1, "datetime2"] - df.loc[bm1, "datetime1"]).apply(strfdelta)
df["Compared with datetime.now()"] = (datetime.now() - df.loc[bm2, "datetime1"]).apply(strfdelta)
>>> df
datetime1 datetime2 Diff... Comp...
0 2021-05-09 19:52:14 2021-05-09 20:52:14 01:00:00 NaN
1 2021-05-09 19:52:14 2021-05-09 21:52:14 02:00:00 NaN
2 NaT NaT NaN NaN
3 2021-05-09 16:30:14 NaT NaN 103:09:19
4 NaT NaT NaN NaN
5 2021-05-09 12:30:14 2021-05-09 14:30:14 02:00:00 NaN

You could start by replacing all NaN values in the datetime2 column with the current datetime. That makes it easier to compare datetime1 to now when datetime2 is missing.
You can do it with :
df["datetime2"] = df["datetime2"].fillna(value=pandas.to_datetime('today').normalize(),axis=1)
Then only two conditions remain:
If datetime1 column is empty, the result is NaN.
Otherwise, the result is the difference between datetime1 and datetime2 column (as there is no NaN remaining in datetime2 column).
You can perform this with:
import numpy as np

df["Difference in H:m:s"] = np.where(
    df["datetime1"].isnull(),
    pd.NA,
    df["datetime2"] - df["datetime1"]
)
You can finally format Difference in H:m:s in the required format with the function you provided (guarding the NA rows so strfdelta is only called on real timedeltas):
def strfdelta(tdelta, fmt):
    d = {"days": tdelta.days}
    d["hours"], rem = divmod(tdelta.seconds, 3600)
    d["minutes"], d["seconds"] = divmod(rem, 60)
    return fmt.format(**d)

df["Difference in H:m:s"] = df.apply(
    lambda x: pd.NA if pd.isna(x["Difference in H:m:s"]) else strfdelta(
        x["Difference in H:m:s"],
        "{hours:02d}:{minutes:02d}:{seconds:02d}",
    ),
    axis=1,
)
The complete code is:
import numpy as np
import pandas as pd

# if datetime1/datetime2 aren't already datetime, apply `pd.to_datetime()`:
df["datetime1"] = pd.to_datetime(df["datetime1"])
df["datetime2"] = pd.to_datetime(df["datetime2"])
df["datetime2"] = df["datetime2"].fillna(value=pd.Timestamp.now())

df["Difference in H:m:s"] = np.where(
    df["datetime1"].isnull(),
    pd.NA,
    df["datetime2"] - df["datetime1"]
)

def strfdelta(tdelta, fmt):
    d = {"days": tdelta.days}
    d["hours"], rem = divmod(tdelta.seconds, 3600)
    d["minutes"], d["seconds"] = divmod(rem, 60)
    return fmt.format(**d)

df["Difference in H:m:s"] = df.apply(
    lambda x: pd.NA if pd.isna(x["Difference in H:m:s"]) else strfdelta(
        x["Difference in H:m:s"],
        "{hours:02d}:{minutes:02d}:{seconds:02d}",
    ),
    axis=1,
)
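For reference, here is a minimal driver for the pipeline above; it simply rebuilds the sample frame from the question (a sketch, not part of the original answer):
import pandas as pd

df = pd.DataFrame({
    "datetime1": ["2021-05-09 19:52:14", "2021-05-09 19:52:14", None,
                  "2021-05-09 16:30:14", None, "2021-05-09 12:30:14"],
    "datetime2": ["2021-05-09 20:52:14", "2021-05-09 21:52:14", None,
                  None, None, "2021-05-09 14:30:14"],
})
# ...run the complete code above on df, then inspect:
print(df)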

Related

Time intervals to evenly-spaced time series

I need to prepare data with time intervals for machine learning in such a way that I get equal spacing between timestamps. For example, for 3-hour spacing, I would like the following timestamps: 00:00, 03:00, 06:00, 09:00, 12:00, 15:00... For example:
df = pd.DataFrame({'Start': ['2022-07-01 11:30', '2022-07-01 22:30'], 'End': ['2022-07-01 18:30', '2022-07-02 3:30'], 'Val': ['a', 'b']})
for col in ['Start', 'End']:
    df[col] = df[col].apply(pd.to_datetime)
print(df)
Output:
Start End Val
0 2022-07-01 11:30:00 2022-07-01 18:30:00 a
1 2022-07-01 22:30:00 2022-07-02 03:30:00 b
I try to get timestamps:
df['Datetime'] = df.apply(lambda x: pd.date_range(x['Start'], x['End'], freq='3H'), axis=1)
df = df.explode('Datetime').drop(['Start', 'End'], axis=1)
df['Datetime'] = df['Datetime'].dt.round('H')
print(df[['Datetime', 'Val']])
Output:
Datetime Val
0 2022-07-01 12:00:00 a
0 2022-07-01 14:00:00 a
0 2022-07-01 18:00:00 a
1 2022-07-01 22:00:00 b
1 2022-07-02 02:00:00 b
As you can see, those timestamps are not equally spaced. My expected result:
Datetime Val
4 2022-07-01 12:00:00 a
5 2022-07-01 15:00:00 a
6 2022-07-01 18:00:00 a
7 2022-07-01 21:00:00 NaN
8 2022-07-02 00:00:00 b
9 2022-07-02 03:00:00 b
We can use the function merge_asof:
df['Datetime'] = df.apply(lambda x: pd.date_range(x['Start'], x['End'], freq='3H'), axis=1)
df = df.explode('Datetime').drop(['Start', 'End'], axis=1)
date_min, date_max = df['Datetime'].dt.date.min(), df['Datetime'].dt.date.max() + pd.Timedelta('1D')
time_range = pd.date_range(date_min, date_max, freq='3H').to_series(name='Datetime')
df = pd.merge_asof(time_range, df, tolerance=pd.Timedelta('3H'))
df.truncate(df['Val'].first_valid_index(), df['Val'].last_valid_index())
Output:
Datetime Val
4 2022-07-01 12:00:00 a
5 2022-07-01 15:00:00 a
6 2022-07-01 18:00:00 a
7 2022-07-01 21:00:00 NaN
8 2022-07-02 00:00:00 b
9 2022-07-02 03:00:00 b
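As a side note, pd.merge_asof matches each key in the left frame with the last row of the right frame whose key is less than or equal to it, within the given tolerance, and both inputs must be sorted by the key. A toy illustration (values made up for this sketch):
left = pd.DataFrame({'Datetime': pd.to_datetime(['2022-07-01 00:00', '2022-07-01 03:00', '2022-07-01 06:00'])})
right = pd.DataFrame({'Datetime': pd.to_datetime(['2022-07-01 02:50']), 'Val': ['a']})
print(pd.merge_asof(left, right, on='Datetime', tolerance=pd.Timedelta('3H')))
# only the 03:00 row picks up 'a'; 00:00 has no earlier match and 06:00 is outside the tolerance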
Annotated code
# Find min and max date of interval
s, e = df['Start'].min(), df['End'].max()
# Create a date range with freq=3H
# Create a output dataframe by assigning daterange to datetime column
df_out = pd.DataFrame({'datetime': pd.date_range(s.ceil('H'), e, freq='3H')})
# Create interval index from start and end date
idx = pd.IntervalIndex.from_arrays(df['Start'], df['End'], closed='both')
# Set the index of df to interval index and select Val column to create mapping series
# Then use this mapping series to substitute values in output dataframe
df_out['Val'] = df_out['datetime'].map(df.set_index(idx)['Val'])
Result
datetime Val
0 2022-07-01 12:00:00 a
1 2022-07-01 15:00:00 a
2 2022-07-01 18:00:00 a
3 2022-07-01 21:00:00 NaN
4 2022-07-02 00:00:00 b
5 2022-07-02 03:00:00 b
For this problem I really like to use pd.DataFrame.reindex. In particular, you can specify method='nearest' and tolerance='90m' to ensure you leave gaps where you need them.
You can create your regularly spaced time series using pd.date_range, with the start and end arguments derived from the .floor('3H') and .ceil('3H') methods, respectively.
import pandas as pd
df = pd.DataFrame({'Start': ['2022-07-01 11:30', '2022-07-01 22:30'], 'End': ['2022-07-01 18:30', '2022-07-02 3:30'], 'Val': ['a', 'b']})
for col in ['Start', 'End']:
    df[col] = df[col].apply(pd.to_datetime)
df['Datetime'] = df.apply(lambda x: pd.date_range(x['Start'], x['End'], freq='3H'), axis=1)
df = df.explode('Datetime').drop(['Start', 'End'], axis=1)
result = pd.DataFrame()
for name, group in df.groupby('Val'):
    group = group.set_index('Datetime')
    group.index = group.index.ceil('1H')
    idx = pd.date_range(group.index.min().floor('3H'), group.index.max().ceil('3H'), freq='3H')
    group = group.reindex(idx, tolerance='90m', method='nearest')
    result = pd.concat([result, group])
result = result.sort_index()
which returns:
Val
2022-07-01 12:00:00 a
2022-07-01 15:00:00 a
2022-07-01 18:00:00 a
2022-07-01 21:00:00
2022-07-02 00:00:00 b
2022-07-02 03:00:00 b
Another method would be to simply add the hour step to the start time in a loop.
import pandas as pd
from datetime import datetime, timedelta

# taking start time as current time just for example
start = datetime.now()
# taking end time as current time + 15 hours just for example
end = datetime.now() + timedelta(hours=15)
times = []
while end > start:
    start = start + timedelta(hours=3)
    print(start)
    times.append(start)
df = pd.DataFrame(columns=['Times'])
df['Times'] = times
Output
Times
0 2022-07-15 01:28:56.912013
1 2022-07-15 04:28:56.912013
2 2022-07-15 07:28:56.912013
3 2022-07-15 10:28:56.912013
4 2022-07-15 13:28:56.912013
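If exact 3-hour steps from the start are all that is needed, pd.date_range can build the same grid without an explicit loop. A sketch of an equivalent call for this example:
df = pd.DataFrame({'Times': pd.date_range(start + timedelta(hours=3), end, freq='3H')})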

Filling in missing hourly data in Pandas

I have a dataframe containing time series with hourly measurements with the following structure: name, time, output. For each name the measurements come from more or less the same time period. I am trying to fill in the missing values, such that for each day all 24h appear in the time column.
So I'm expecting a table like this:
name time output
x 2018-02-22 00:00:00 100
...
x 2018-02-22 23:00:00 200
x 2018-02-24 00:00:00 300
...
x 2018-02-24 23:00:00 300
y 2018-02-22 00:00:00 100
...
y 2018-02-22 23:00:00 200
y 2018-02-25 00:00:00 300
...
y 2018-02-25 23:00:00 300
For this I groupby name and then try to apply a custom function that adds the missing timestamps in the corresponding dataframe.
def add_missing_hours(df):
    start_date = df.time.iloc[0].date()
    end_date = df.time.iloc[-1].date()
    dates_range = pd.date_range(start_date, end_date, freq='1H')
    new_dates = set(dates_range) - set(df.time)
    name = df["name"].iloc[0]
    df = df.append(pd.DataFrame({'GSRN': [name]*len(new_dates), 'time': new_dates}))
    return df
For some reason the name column is dropped when I create the DataFrame, but I can't understand why. Does anyone know why or have a better idea how to fill in the missing timestamps?
Edit 1:
This is different than the [question here][1] because they didn't need all 24 values/day -- resampling between 2pm and 10pm will only give the values in between.
Edit 2:
I found a (not great) solution by creating a multi index with all name-timestamps pairs and combining with the table. Code below for anyone interested, but still interested in a better solution:
start_date = datetime.datetime.combine(df.time.min().date(),datetime.time(0, 0))
end_date = datetime.datetime.combine(df.time.max().date(),datetime.time(23, 0))
new_idx = pd.date_range(start_date, end_date, freq = '1H')
mux = pd.MultiIndex.from_product([df['name'].unique(),new_idx], names=('name','time'))
df_complete = pd.DataFrame(index=mux).reset_index().combine_first(df)
df_complete = df_complete.groupby(["name", df_complete.time.dt.date]).filter(lambda g: (g["output"].count() != 0))
The last line removes any days that were completely missing for the specific name in the initial dataframe.
Try this: first create a dataframe spanning the min date to the max date with an hourly interval, then join the two together.
df.time = pd.to_datetime(df.time)
min_date = df.time.min()
max_date = df.time.max()
dates_range = pd.date_range(min_date, max_date, freq='1H')
df.set_index('time', inplace=True)
df3 = pd.DataFrame(dates_range).set_index(0)
df4 = df3.join(df)
df4:
name output
2018-02-22 00:00:00 x 100.0
2018-02-22 00:00:00 y 100.0
2018-02-22 01:00:00 NaN NaN
2018-02-22 02:00:00 NaN NaN
2018-02-22 03:00:00 NaN NaN
... ... ...
2018-02-25 19:00:00 NaN NaN
2018-02-25 20:00:00 NaN NaN
2018-02-25 21:00:00 NaN NaN
2018-02-25 22:00:00 NaN NaN
2018-02-25 23:00:00 y 300.0
98 rows × 2 columns
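Note that the join above uses one global hourly grid for all names. A group-wise variant (a hedged sketch, assuming df still has its original name/time/output columns) reindexes each name onto its own grid instead:
def fill_hours(g):
    # hourly grid from this name's first day 00:00 through its last day 23:00
    idx = pd.date_range(g['time'].min().normalize(),
                        g['time'].max().normalize() + pd.Timedelta(hours=23),
                        freq='1H')
    out = g.set_index('time').reindex(idx)
    out['name'] = g['name'].iloc[0]  # restore the group label on the filled rows
    return out

df_full = df.groupby('name', group_keys=False).apply(fill_hours)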

Python adding row into dataframe using while loop

I have a dataset like this:
user_id lapsed_date start_date end_date
0 A123 2020-01-02 2019-01-02 2019-02-02
1 A123 2020-01-02 2019-02-02 2019-03-02
2 B456 2019-10-01 2019-08-01 2019-09-01
3 B456 2019-10-01 2019-09-01 2019-10-01
generated by this code:
import pandas as pd
sample = {'user_id': ['A123','A123','B456','B456'],
          'lapsed_date': ['2020-01-02', '2020-01-02', '2019-10-01', '2019-10-01'],
          'start_date': ['2019-01-02', '2019-02-02', '2019-08-01', '2019-09-01'],
          'end_date': ['2019-02-02', '2019-03-02', '2019-09-01', '2019-10-01']
          }
df = pd.DataFrame(sample,columns= ['user_id', 'lapsed_date', 'start_date', 'end_date'])
df['lapsed_date'] = pd.to_datetime(df['lapsed_date'])
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
I'm trying to write a function to achieve this:
user_id lapsed_date start_date end_date
0 A123 2020-01-02 2019-01-02 2019-02-02
1 A123 2020-01-02 2019-02-02 2019-03-02
2 A123 2020-01-02 2019-03-02 2019-04-02
3 A123 2020-01-02 2019-04-02 2019-05-02
4 A123 2020-01-02 2019-05-02 2019-06-02
5 A123 2020-01-02 2019-06-02 2019-07-02
6 A123 2020-01-02 2019-07-02 2019-08-02
7 A123 2020-01-02 2019-08-02 2019-09-02
8 A123 2020-01-02 2019-09-02 2019-10-02
9 A123 2020-01-02 2019-10-02 2019-11-02
10 A123 2020-01-02 2019-11-02 2019-12-02
11 A123 2020-01-02 2019-12-02 2020-01-02
12 B456 2019-10-01 2019-08-01 2019-09-01
13 B456 2019-10-01 2019-09-01 2019-10-01
Essentially, for each user_id, the function should keep adding rows while max(end_date) is less than lapsed_date. Each newly added row takes the previous row's end_date as its start_date, and the previous row's end_date + 1 month as its end_date.
I have generated this function below.
def add_row(x):
    while x['end_date'].max() < x['lapsed_date'].max():
        next_month = x['end_date'].max() + pd.DateOffset(months=1)
        last_row = x.iloc[-1]
        last_row['start_date'] = x['end_date'].max()
        last_row['end_date'] = next_month
        return x.append(last_row)
    return x
All the logic above works except the while loop, so I have to apply this function manually about 10 times with this apply command:
df = df.groupby('user_id').apply(add_row).reset_index(drop = True)
I'm not really sure what I did wrong with the while loop there. Any advice would be highly appreciated!
So there are a few reasons your loop did not work, I will explain them as we go!
def add_row(x):
    while x['end_date'].max() < x['lapsed_date'].max():
        next_month = x['end_date'].max() + pd.DateOffset(months=1)
        last_row = x.iloc[-1]
        last_row['start_date'] = x['end_date'].max()
        last_row['end_date'] = next_month
        return x.append(last_row)
    return x
In the above, you call return which returns the result to the code that called the function. This essentially stops your loop from iterating multiple times and returns the result of the first append.
Another caveat with return x.append(last_row) is that DataFrame.append() does not actually modify the dataframe in place; you need to assign the result back, i.e. x = x.append(last_row).
Pandas Append
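A quick demonstration of that point (toy frame; this applies to pandas < 2.0, where append still exists):
import pandas as pd
x = pd.DataFrame({'a': [1]})
x.append({'a': 2}, ignore_index=True)     # result is discarded
print(len(x))                             # still 1 - x was not modified
x = x.append({'a': 2}, ignore_index=True)
print(len(x))                             # now 2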
Secondly, I noted that it may be required to do this over multiple, unique user_id rows. Due to this, in the code below, I have split the dataframe into multiple frames, dictated by the total unique user_id's stored in the frame.
Here is how you can get this to work;
import pandas as pd
from pandas import DataFrame

def add_row(df):
    while df['end_date'].max() < df['lapsed_date'].max():
        new_row = {'user_id': df['user_id'].iloc[0],  # .iloc so it works for any group index
                   'lapsed_date': df['lapsed_date'].max(),
                   'start_date': df['end_date'].max(),
                   'end_date': df['end_date'].max() + pd.DateOffset(months=1),
                   }
        df = df.append(new_row, ignore_index=True)
    return df  # Note the return is called OUTSIDE of the while loop, ensuring only the final result is returned.
sample = {'user_id': ['A123','A123','B456','B456'],
          'lapsed_date': ['2020-01-02', '2020-01-02', '2019-10-01', '2019-10-01'],
          'start_date': ['2019-01-02', '2019-02-02', '2019-08-01', '2019-09-01'],
          'end_date': ['2019-02-02', '2019-03-02', '2019-09-01', '2019-10-01']
          }
df = pd.DataFrame(sample,columns= ['user_id', 'lapsed_date', 'start_date', 'end_date'])
df['lapsed_date'] = pd.to_datetime(df['lapsed_date'])
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
ids = df['user_id'].unique()
g = df.groupby(['user_id'])
result = pd.DataFrame(columns=['user_id', 'lapsed_date', 'start_date', 'end_date'])
for i in ids:
    group = g.get_group(i)
    result = result.append(add_row(group), ignore_index=True)
print(result)
Split the frames based on unique user id's
Create an empty data frame to store the result in under result
Iterate over all user_id's
Run the same while loop, ensuring that df is updated with the appended rows
Return the result and print it
Hope this helps!
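One caveat for readers on newer pandas: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the loop body needs pd.concat there. A sketch of the replacement line:
# instead of: df = df.append(new_row, ignore_index=True)
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)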

Difference between two days excluding weekends in hours

I have code that calculates the date difference excluding weekends using np.busday_count, but I need the result in hours, which I haven't been able to get.
import datetime
import numpy as np
import pandas as pd

df.Inflow_date_time = [pd.Timestamp('2019-07-22 21:11:26')]
df.End_date_time = [pd.Timestamp('2019-08-02 11:44:47')]
df['Day'] = [np.busday_count(b, a) for a, b in zip(df['End_date_time'].values.astype('datetime64[D]'), df['Inflow_date_time'].values.astype('datetime64[D]'))]
Day
0 9
I need the output in hours, excluding the weekend, like:
Hours
0 254
Problems
Inflow_date_time=2019-08-01 23:22:46
End_date_time = 2019-08-05 17:43:51
Hours expected 42 hours
(1+24+17)
Inflow_date_time=2019-08-03 23:22:46
End_date_time = 2019-08-05 17:43:51
Hours expected 17 hours
(0+0+17)
Inflow_date_time=2019-08-01 23:22:46
End_date_time = 2019-08-05 17:43:51
Hours expected 17 hours
(0+0+17)
Inflow_date_time=2019-07-26 23:22:46
End_date_time = 2019-08-05 17:43:51
Hours expected 138 hours
(1+120+17)
Inflow_date_time=2019-08-05 11:22:46
End_date_time = 2019-08-05 17:43:51
Hours expected 6 hours
(0+0+6)
Please suggest.
The idea is to floor the datetimes to remove the time component, get the number of whole business days between the day after the start and the end day into an hours3 column with numpy.busday_count, and then create hours1 and hours2 columns holding the partial start and end hours (floored to the hour) whenever those days are not weekend days. Finally, sum all the hours columns together:
df = pd.DataFrame(columns=['Inflow_date_time', 'End_date_time', 'need'])
df.Inflow_date_time = [pd.Timestamp('2019-08-01 23:22:46'),
                       pd.Timestamp('2019-08-03 23:22:46'),
                       pd.Timestamp('2019-08-01 23:22:46'),
                       pd.Timestamp('2019-07-26 23:22:46'),
                       pd.Timestamp('2019-08-05 11:22:46')]
df.End_date_time = [pd.Timestamp('2019-08-05 17:43:51')] * 5
df.need = [42, 17, 41, 138, 6]
#print (df)

df["hours1"] = df["Inflow_date_time"].dt.ceil('d')
df["hours2"] = df["End_date_time"].dt.floor('d')
one_day_mask = df["Inflow_date_time"].dt.floor('d') == df["hours2"]
df['hours3'] = [np.busday_count(b, a) * 24 for a, b in zip(df['hours2'].dt.strftime('%Y-%m-%d'),
                                                           df['hours1'].dt.strftime('%Y-%m-%d'))]
mask1 = df['hours1'].dt.dayofweek < 5
hours1 = df['hours1'] - df['Inflow_date_time'].dt.floor('H')
df['hours1'] = np.where(mask1, hours1, np.nan) / np.timedelta64(1, 'h')
mask2 = df['hours2'].dt.dayofweek < 5
df['hours2'] = (np.where(mask2, df['End_date_time'].dt.floor('H') - df['hours2'], np.nan) /
                np.timedelta64(1, 'h'))
df['date_diff'] = df['hours1'].fillna(0) + df['hours2'].fillna(0) + df['hours3']
one_day = (df['End_date_time'].dt.floor('H') - df['Inflow_date_time'].dt.floor('H')) / np.timedelta64(1, 'h')
df["date_diff"] = df["date_diff"].mask(one_day_mask, one_day)
print(df)
Inflow_date_time End_date_time need hours1 hours2 hours3 \
0 2019-08-01 23:22:46 2019-08-05 17:43:51 42 1.0 17.0 24
1 2019-08-03 23:22:46 2019-08-05 17:43:51 17 NaN 17.0 0
2 2019-08-01 23:22:46 2019-08-05 17:43:51 41 1.0 17.0 24
3 2019-07-26 23:22:46 2019-08-05 17:43:51 138 NaN 17.0 120
4 2019-08-05 11:22:46 2019-08-05 17:43:51 6 13.0 17.0 -24
date_diff
0 42.0
1 17.0
2 42.0
3 137.0
4 6.0
If I am not completely wrong, you can also use a shorter workaround. First save your day difference in an array:
res = np.busday_count(df['Inflow_date_time'].values.astype('datetime64[D]'), df['End_date_time'].values.astype('datetime64[D]'))
Then we need an extra hour column for every row:
df['starth'] = df['Inflow_date_time'].dt.hour
df['endh'] = df['End_date_time'].dt.hour
Then we will get the day difference to your dataframe:
my_list = res.tolist()
dfhelp = pd.DataFrame(my_list, columns=['col1'])
df2 = pd.concat((df, dfhelp), axis=1)
Then we need a helper column, as the hour of End_date_time can be before the hour of Inflow_date_time:
df2['h'] = df2['endh']-df2['starth']
And then we can calculate the hour difference (one day has 24 hours; the branch depends on whether the end hour is before the start hour or not):
df2['differenceh'] = np.where(df2['h'] >= 0, df2['col1']*24+df2['h'], df2['col1']*24-24+(24+df2['h']))
I updated jezrael's answer to work with version 1.x.x of pandas. I also edited the code and the logic a bit to calculate the difference in hours and minutes.
Function
def datetimes_hours_difference(df_end: pd.Series, df_start: pd.Series) -> pd.Series:
    """
    Calculate the total hours difference between two Pandas Series
    containing datetime values (df_end - df_start)
    Args:
        df_end (pd.Series): Contains datetime values
        df_start (pd.Series): Contains datetime values
    Returns:
        df_date_diff (pd.Series): Difference between df_end and df_start
    """
    df_start_hours = df_start.dt.ceil('d')
    df_end_hours = df_end.dt.floor('d')
    one_day_mask = df_start.dt.floor('d') == df_end_hours
    df_days_hours = [np.busday_count(b, a, weekmask='1111011') * 24
                     for a, b in zip(df_end_hours.dt.strftime('%Y-%m-%d'),
                                     df_start_hours.dt.strftime('%Y-%m-%d'))]
    mask1 = df_start.dt.dayofweek != 4
    hours1 = df_start_hours - df_start.dt.floor('min')
    hours1.loc[~mask1] = pd.NaT
    df_start_hours = hours1 / pd.to_timedelta(1, unit='H')
    df_start_hours = df_start_hours.fillna(0)
    mask2 = df_end.dt.dayofweek != 4
    hours2 = df_end.dt.floor('min') - df_end_hours
    hours2.loc[~mask2] = pd.NaT
    df_end_hours = hours2 / pd.to_timedelta(1, unit='H')
    df_end_hours = df_end_hours.fillna(0)
    df_date_diff = df_start_hours + df_end_hours + df_days_hours
    one_day = (df_end.dt.floor('min') - df_start.dt.floor('min'))
    one_day = one_day / pd.to_timedelta(1, unit='H')
    df_date_diff = df_date_diff.mask(one_day_mask, one_day)
    return df_date_diff
Example
df = pd.DataFrame({
'datetime1': ["2022-06-15 16:06:00", "2022-06-15 03:45:00", "2022-06-10 12:13:00", "2022-06-11 12:13:00", "2022-06-10 12:13:00", "2022-05-31 17:20:00"],
'datetime2': ["2022-06-22 22:36:00", "2022-06-15 22:36:00", "2022-06-22 10:10:00", "2022-06-22 10:10:00", "2022-06-24 10:10:00", "2022-06-02 05:29:00"],
'hours_diff': [150.5, 18.9, 250.9, 237.9, 288.0, 36.2]
})
df['datetime1'] = pd.to_datetime(df['datetime1'])
df['datetime2'] = pd.to_datetime(df['datetime2'])
df['hours_diff_fun'] = datetimes_hours_difference(df['datetime2'], df['datetime1'])
print(df)
datetime1 datetime2 hours_diff hours_diff_fun
0 2022-06-15 16:06:00 2022-06-22 22:36:00 150.5 150.500000
1 2022-06-15 03:45:00 2022-06-15 22:36:00 18.9 18.850000
2 2022-06-10 12:13:00 2022-06-22 10:10:00 250.9 250.166667
3 2022-06-11 12:13:00 2022-06-22 10:10:00 237.9 237.950000
4 2022-06-10 12:13:00 2022-06-24 10:10:00 288.0 288.000000
5 2022-05-31 17:20:00 2022-06-02 05:29:00 36.2 36.150000
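For readers unfamiliar with np.busday_count: it counts valid days in the half-open range [begin, end), and weekmask is a string of 7 flags running Monday through Sunday. A small illustration (dates picked arbitrarily for this sketch):
import numpy as np
# default weekmask excludes Sat/Sun
print(np.busday_count('2019-08-01', '2019-08-06'))                      # Thu, Fri, Mon -> 3
# the mask used above drops Friday but keeps the weekend
print(np.busday_count('2019-08-01', '2019-08-06', weekmask='1111011'))  # Thu, Sat, Sun, Mon -> 4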

Pandas: select DF rows based on another DF

I've got two dataframes (very long, with hundreds or thousands of rows each). One of them, called df1, contains a timeseries, in intervals of 10 minutes. For example:
date value
2016-11-24 00:00:00 1759.199951
2016-11-24 00:10:00 992.400024
2016-11-24 00:20:00 1404.800049
2016-11-24 00:30:00 45.799999
2016-11-24 00:40:00 24.299999
2016-11-24 00:50:00 159.899994
2016-11-24 01:00:00 82.499999
2016-11-24 01:10:00 37.400003
2016-11-24 01:20:00 159.899994
....
And the other one, df2, contains datetime intervals:
start_date end_date
0 2016-11-23 23:55:32 2016-11-24 00:14:03
1 2016-11-24 01:03:18 2016-11-24 01:07:12
2 2016-11-24 01:11:32 2016-11-24 02:00:00
...
I need to select all the rows in df1 that "falls" into an interval in df2.
With these examples, the result dataframe should be:
date value
2016-11-24 00:00:00 1759.199951 # Fits in row 0 of df2
2016-11-24 00:10:00 992.400024 # Fits in row 0 of df2
2016-11-24 01:00:00 82.499999 # Fits in row 1 of df2
2016-11-24 01:10:00 37.400003 # Fits in row 2 of df2
2016-11-24 01:20:00 159.899994 # Fits in row 2 of df2
....
Using np.searchsorted:
Here's a variation based on np.searchsorted that seems to be an order of magnitude faster than using intervaltree or merge, assuming my larger sample data is correct.
# Ensure the df2 is sorted (skip if it's already known to be).
df2 = df2.sort_values(by=['start_date', 'end_date'])
# Add the end of the time interval to df1.
df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
# Perform the searchsorted and get the corresponding df2 values for both endpoints of df1.
s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
# Build the conditions that indicate an overlap (any True condition indicates an overlap).
cond = [
    df1['date'].values <= s1['end_date'].values,
    df1['date_end'].values <= s2['end_date'].values,
    s1.index.values != s2.index.values
]
# Filter df1 to only the overlapping intervals, and drop the extra 'date_end' column.
df1 = df1[np.any(cond, axis=0)].drop('date_end', axis=1)
This may need to be modified if the intervals in df2 are nested or overlapping; I haven't fully thought it through in that scenario, but it may still work.
Using an Interval Tree
Not quite a pure Pandas solution, but you may want to consider building an Interval Tree from df2, and querying it against your intervals in df1 to find the ones that overlap.
The intervaltree package on PyPI seems to have good performance and easy to use syntax.
from intervaltree import IntervalTree
# Build the Interval Tree from df2.
tree = IntervalTree.from_tuples(df2.astype('int64').values + [0, 1])
# Build the 10 minutes spans from df1.
dt_pairs = pd.concat([df1['date'], df1['date'] + pd.offsets.Minute(10)], axis=1)
# Query the Interval Tree to filter df1.
df1 = df1[[tree.overlaps(*p) for p in dt_pairs.astype('int64').values]]
I converted the dates to their integer equivalents for performance reasons. I doubt the intervaltree package was built with pd.Timestamp in mind, so there probably some intermediate conversion steps that slow things down a bit.
Also, note that intervals in the intervaltree package do not include the end point, although the start point is included. That's why I have the + [0, 1] when creating tree; I'm padding the end point by a nanosecond to make sure the real end point is actually included. It's also the reason why it's fine for me to add pd.offsets.Minute(10) to get the interval end when querying the tree, instead of adding only 9m 59s.
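To make the + [0, 1] padding concrete: it relies on NumPy broadcasting adding 0 to the start column and 1 to the end column (toy integers for this sketch):
import numpy as np
bounds = np.array([[100, 200],
                   [300, 400]])
print(bounds + [0, 1])   # [[100 201] [300 401]] - only the end column is shifted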
The resulting output for either method:
date value
0 2016-11-24 00:00:00 1759.199951
1 2016-11-24 00:10:00 992.400024
6 2016-11-24 01:00:00 82.499999
7 2016-11-24 01:10:00 37.400003
8 2016-11-24 01:20:00 159.899994
Timings
Using the following setup to produce larger sample data:
# Sample df1.
n1 = 55000
df1 = pd.DataFrame({'date': pd.date_range('2016-11-24', freq='10T', periods=n1), 'value': np.random.random(n1)})
# Sample df2.
n2 = 500
df2 = pd.DataFrame({'start_date': pd.date_range('2016-11-24', freq='18H22T', periods=n2)})
# Randomly shift the start and end dates of the df2 intervals.
shift_start = pd.Series(np.random.randint(30, size=n2)).cumsum().apply(lambda s: pd.DateOffset(seconds=s))
shift_end1 = pd.Series(np.random.randint(30, size=n2)).apply(lambda s: pd.DateOffset(seconds=s))
shift_end2 = pd.Series(np.random.randint(5, 45, size=n2)).apply(lambda m: pd.DateOffset(minutes=m))
df2['start_date'] += shift_start
df2['end_date'] = df2['start_date'] + shift_end1 + shift_end2
Which yields the following for df1 and df2:
df1
date value
0 2016-11-24 00:00:00 0.444939
1 2016-11-24 00:10:00 0.407554
2 2016-11-24 00:20:00 0.460148
3 2016-11-24 00:30:00 0.465239
4 2016-11-24 00:40:00 0.462691
...
54995 2017-12-10 21:50:00 0.754123
54996 2017-12-10 22:00:00 0.401820
54997 2017-12-10 22:10:00 0.146284
54998 2017-12-10 22:20:00 0.394759
54999 2017-12-10 22:30:00 0.907233
df2
start_date end_date
0 2016-11-24 00:00:19 2016-11-24 00:41:24
1 2016-11-24 18:22:44 2016-11-24 18:36:44
2 2016-11-25 12:44:44 2016-11-25 13:03:13
3 2016-11-26 07:07:05 2016-11-26 07:49:29
4 2016-11-27 01:29:31 2016-11-27 01:34:32
...
495 2017-12-07 21:36:04 2017-12-07 22:14:29
496 2017-12-08 15:58:14 2017-12-08 16:10:35
497 2017-12-09 10:20:21 2017-12-09 10:26:40
498 2017-12-10 04:42:41 2017-12-10 05:22:47
499 2017-12-10 23:04:42 2017-12-10 23:44:53
And using the following functions for timing purposes:
def root_searchsorted(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
    # Build the conditions that indicate an overlap (any True condition indicates an overlap).
    cond = [
        df1['date'].values <= s1['end_date'].values,
        df1['date_end'].values <= s2['end_date'].values,
        s1.index.values != s2.index.values
    ]
    # Filter df1 to only the overlapping intervals, and drop the extra 'date_end' column.
    return df1[np.any(cond, axis=0)].drop('date_end', axis=1)
def root_intervaltree(df1, df2):
    # Build the Interval Tree.
    tree = IntervalTree.from_tuples(df2.astype('int64').values + [0, 1])
    # Build the 10 minutes spans from df1.
    dt_pairs = pd.concat([df1['date'], df1['date'] + pd.offsets.Minute(10)], axis=1)
    # Query the Interval Tree to filter the DataFrame.
    return df1[[tree.overlaps(*p) for p in dt_pairs.astype('int64').values]]
def ptrj(df1, df2):
    # The smallest amount of time - handy when using open intervals:
    epsilon = pd.Timedelta(1, 'ns')
    # Lookup series (`asof` works best with series) for `start_date` and `end_date` from `df2`:
    sdate = pd.Series(data=range(df2.shape[0]), index=df2.start_date)
    edate = pd.Series(data=range(df2.shape[0]), index=df2.end_date + epsilon)
    # (filling NaN's with -1)
    l = edate.asof(df1.date).fillna(-1)
    r = sdate.asof(df1.date + (pd.Timedelta(10, 'm') - epsilon)).fillna(-1)
    # (taking `values` here to skip indexes, which are different)
    mask = l.values < r.values
    return df1[mask]
def parfait(df1, df2):
    df1['key'] = 1
    df2['key'] = 1
    df2['row'] = df2.index.values
    # CROSS JOIN
    df3 = pd.merge(df1, df2, on=['key'])
    # DF FILTERING
    return df3[df3['start_date'].between(df3['date'], df3['date'] + dt.timedelta(minutes=9, seconds=59), inclusive=True) | df3['date'].between(df3['start_date'], df3['end_date'], inclusive=True)].set_index('date')[['value', 'row']]
def root_searchsorted_modified(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
    # ---- further is the MODIFIED code ----
    # Filter df1 to only overlapping intervals.
    df1.query('(date <= @s1.end_date.values) |\
               (date_end <= @s2.end_date.values) |\
               (@s1.index.values != @s2.index.values)', inplace=True)
    # Drop the extra 'date_end' column.
    return df1.drop('date_end', axis=1)
I get the following timings:
%timeit root_searchsorted(df1.copy(), df2.copy())
100 loops best of 3: 9.55 ms per loop
%timeit root_searchsorted_modified(df1.copy(), df2.copy())
100 loops best of 3: 13.5 ms per loop
%timeit ptrj(df1.copy(), df2.copy())
100 loops best of 3: 18.5 ms per loop
%timeit root_intervaltree(df1.copy(), df2.copy())
1 loop best of 3: 4.02 s per loop
%timeit parfait(df1.copy(), df2.copy())
1 loop best of 3: 8.96 s per loop
This solution (I believe it works) uses pandas.Series.asof. Under the hood, it's some version of searchsorted - and it's comparable in speed with @root's function.
I assume that all date columns are in the pandas datetime format, sorted, and that df2 intervals are non-overlapping.
The code is pretty short but somewhat intricate (explanation below).
# The smallest amount of time - handy when using open intervals:
epsilon = pd.Timedelta(1, 'ns')
# Lookup series (`asof` works best with series) for `start_date` and `end_date` from `df2`:
sdate = pd.Series(data=range(df2.shape[0]), index=df2.start_date)
edate = pd.Series(data=range(df2.shape[0]), index=df2.end_date + epsilon)
# The main function (see explanation below):
def get_it(df1):
    # (filling NaN's with -1)
    l = edate.asof(df1.date).fillna(-1)
    r = sdate.asof(df1.date + (pd.Timedelta(10, 'm') - epsilon)).fillna(-1)
    # (taking `values` here to skip indexes, which are different)
    mask = l.values < r.values
    return df1[mask]
The advantage of this approach is twofold: sdate and edate are evaluated only once and the main function can take chunks of df1 if df1 is very large.
Explanation
pandas.Series.asof returns the last valid row for a given index. It can take an array as an input and is quite fast.
For the sake of this explanation, let s[j] = sdate.index[j] be the jth date in sdate and x be some arbitrary date (timestamp).
There is always s[sdate.asof(x)] <= x (this is exactly how asof works) and it's not difficult to show that:
j <= sdate.asof(x) if and only if s[j] <= x
sdate.asof(x) < j if and only if x < s[j]
Similarly for edate. Unfortunately, we can't have the same kind of inequality (either weak or strict) in both 1. and 2.
Two intervals [a, b) and [x, y] intersect iff x < b and a <= y.
(We may think of a, b as coming from sdate.index and edate.index - the interval [a, b) is chosen to be closed-open because of properties 1. and 2.)
In our case x is a date from df1, y = x + 10min - epsilon,
a = s[j], b = e[j] (note that epsilon has been added to edate), where j is some number.
So, finally, the condition equivalent to "[a, b) and [x, y] intersect" is
"sdate.asof(x) < j and j <= edate.asof(y) for some number j". And it roughly boils down to l < r inside the function get_it (modulo some technicalities).
This is not exactly straightforward but you can do the following:
First get the relevant date columns from the two dataframes and concatenate them together so that one column is all the dates and the other two are columns representing the indexes from df2. (Note that df2 gets a multiindex after stacking)
dfm = pd.concat((df1['date'],df2.stack().reset_index())).sort_values(0)
print(dfm)
0 level_0 level_1
0 2016-11-23 23:55:32 0.0 start_date
0 2016-11-24 00:00:00 NaN NaN
1 2016-11-24 00:10:00 NaN NaN
1 2016-11-24 00:14:03 0.0 end_date
2 2016-11-24 00:20:00 NaN NaN
3 2016-11-24 00:30:00 NaN NaN
4 2016-11-24 00:40:00 NaN NaN
5 2016-11-24 00:50:00 NaN NaN
6 2016-11-24 01:00:00 NaN NaN
2 2016-11-24 01:03:18 1.0 start_date
3 2016-11-24 01:07:12 1.0 end_date
7 2016-11-24 01:10:00 NaN NaN
4 2016-11-24 01:11:32 2.0 start_date
8 2016-11-24 01:20:00 NaN NaN
5 2016-11-24 02:00:00 2.0 end_date
You can see that the values from df1 have NaN in the right two columns and since we have sorted the dates, these rows fall in between the start_date and end_date rows (from df2).
In order to indicate that the rows from df1 fall between the rows from df2 we can interpolate the level_0 column which gives us:
dfm['level_0'] = dfm['level_0'].interpolate()
0 level_0 level_1
0 2016-11-23 23:55:32 0.000000 start_date
0 2016-11-24 00:00:00 0.000000 NaN
1 2016-11-24 00:10:00 0.000000 NaN
1 2016-11-24 00:14:03 0.000000 end_date
2 2016-11-24 00:20:00 0.166667 NaN
3 2016-11-24 00:30:00 0.333333 NaN
4 2016-11-24 00:40:00 0.500000 NaN
5 2016-11-24 00:50:00 0.666667 NaN
6 2016-11-24 01:00:00 0.833333 NaN
2 2016-11-24 01:03:18 1.000000 start_date
3 2016-11-24 01:07:12 1.000000 end_date
7 2016-11-24 01:10:00 1.500000 NaN
4 2016-11-24 01:11:32 2.000000 start_date
8 2016-11-24 01:20:00 2.000000 NaN
5 2016-11-24 02:00:00 2.000000 end_date
Notice that the level_0 column now contains integers (mathematically, not the data type) for the rows that fall between a start date and an end date (this assumes that an end date will not overlap the following start date).
Now we can just filter out the rows originally in df1:
df_falls = dfm[(dfm['level_0'] == dfm['level_0'].astype(int)) & (dfm['level_1'].isnull())][[0,'level_0']]
df_falls.columns = ['date', 'falls_index']
And merge back with the original dataframe
df_final = pd.merge(df1, right=df_falls, on='date', how='outer')
which gives:
print(df_final)
date value falls_index
0 2016-11-24 00:00:00 1759.199951 0.0
1 2016-11-24 00:10:00 992.400024 0.0
2 2016-11-24 00:20:00 1404.800049 NaN
3 2016-11-24 00:30:00 45.799999 NaN
4 2016-11-24 00:40:00 24.299999 NaN
5 2016-11-24 00:50:00 159.899994 NaN
6 2016-11-24 01:00:00 82.499999 NaN
7 2016-11-24 01:10:00 37.400003 NaN
8 2016-11-24 01:20:00 159.899994 2.0
This is the same as the original dataframe, with the extra column falls_index indicating the index of the row in df2 that each row falls into.
Consider a cross join merge that returns the cartesian product of both sets (all possible row pairings, M x N). You can cross join using an all-1's key column in merge's on argument. Then, filter the large returned set using pd.Series.between(). Specifically, between() keeps rows where the start date falls within the 9:59 range of date, or date falls within the start and end times.
However, prior to the merge, create a df1['date'] column equal to the date index so it can be a retained column after merge and used for date filtering. Additionally, create a df2['row'] column to be used as row indicator at the end. For demo, below recreates posted df1 and df2 dataframes:
from io import StringIO
import pandas as pd
import datetime as dt
data1 = '''
date value
"2016-11-24 00:00:00" 1759.199951
"2016-11-24 00:10:00" 992.400024
"2016-11-24 00:20:00" 1404.800049
"2016-11-24 00:30:00" 45.799999
"2016-11-24 00:40:00" 24.299999
"2016-11-24 00:50:00" 159.899994
"2016-11-24 01:00:00" 82.499999
"2016-11-24 01:10:00" 37.400003
"2016-11-24 01:20:00" 159.899994
'''
df1 = pd.read_table(StringIO(data1), sep='\s+', parse_dates=[0], index_col=0)
df1['key'] = 1
df1['date'] = df1.index.values
data2 = '''
start_date end_date
"2016-11-23 23:55:32" "2016-11-24 00:14:03"
"2016-11-24 01:03:18" "2016-11-24 01:07:12"
"2016-11-24 01:11:32" "2016-11-24 02:00:00"
'''
df2 = pd.read_table(StringIO(data2), sep='\s+', parse_dates=[0,1])
df2['key'] = 1
df2['row'] = df2.index.values
# CROSS JOIN
df3 = pd.merge(df1, df2, on=['key'])
# DF FILTERING
df3 = df3[(df3['start_date'].between(df3['date'], df3['date'] + dt.timedelta(minutes=9, seconds=59), inclusive=True)) |
          (df3['date'].between(df3['start_date'], df3['end_date'], inclusive=True))].set_index('date')[['value', 'row']]
print(df3)
# value row
# date
# 2016-11-24 00:00:00 1759.199951 0
# 2016-11-24 00:10:00 992.400024 0
# 2016-11-24 01:00:00 82.499999 1
# 2016-11-24 01:10:00 37.400003 2
# 2016-11-24 01:20:00 159.899994 2
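As an aside, on pandas 1.2+ the all-1's key column is unnecessary, since merge supports a cross join directly (an equivalent sketch; the memory cost is still M x N rows):
df3 = df1.merge(df2, how='cross')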
I tried to modify @root's code to use the experimental pandas query method. It should be faster than the original implementation for very large dataframes, but for small dataframes it will definitely be slower.
def root_searchsorted_modified(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
    # ---- further is the MODIFIED code ----
    # Filter df1 to only overlapping intervals.
    df1.query('(date <= @s1.end_date.values) |\
               (date_end <= @s2.end_date.values) |\
               (@s1.index.values != @s2.index.values)', inplace=True)
    # Drop the extra 'date_end' column.
    return df1.drop('date_end', axis=1)
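Note the @ prefix: inside query strings it makes the expression reference Python variables from the enclosing scope rather than columns. A minimal sketch:
threshold = pd.Timestamp('2016-11-24 01:00')
df1.query('date <= @threshold')   # compares the `date` column against the local variable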
