I have a large data set which contains, for each property_id, a cost for every month in which a cost was incurred, as in the dataset below.
property_id period amount
1 2016-07-01 105908.20
1 2016-08-01 0.00
2 2016-08-01 114759.40
3 2014-05-01 -934.00
3 2014-06-01 -845.95
3 2017-12-01 92175.77
4 2015-09-01 -1859.75
4 2015-12-01 1859.75
4 2017-12-01 130105.00
5 2014-07-01 -6929.58
I would like to create a cumulative sum, grouped by property_id, and carry it forward each month, from the first month of that property_id through to the most recent full month.
I've tried the below, where I resample by property_id and try to forward fill, but it gives an error:
cost = cost.groupby['property_id'].apply(lambda x: x.set_index('period').resample('M').fillna(method='pad'))
TypeError: 'method' object is not subscriptable
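As an aside, the TypeError comes from groupby['property_id']: groupby is a method, so it has to be called with parentheses before a column can be selected. A minimal sketch on hypothetical miniature data:

```python
import pandas as pd

# hypothetical miniature of the cost data
cost = pd.DataFrame({
    'property_id': [1, 1, 2],
    'period': pd.to_datetime(['2016-07-01', '2016-08-01', '2016-08-01']),
    'amount': [105908.20, 0.00, 114759.40],
})

# cost.groupby['property_id'] raises "TypeError: 'method' object is not subscriptable";
# calling the method works:
totals = cost.groupby('property_id')['amount'].sum()
```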
Example output below:
> property_id period amount
> 1 2016-07-01 105908.20
> 1 2016-08-01 105908.20
> 1 2016-09-01 105908.20
> 1 2016-10-01 105908.20
> ...
> 1 2019-07-01 105908.20
> 2 2016-08-01 114759.40
> 2 2016-09-01 114759.40
> 2 2016-10-01 114759.40
> ...
> 2 2019-07-01 114759.40
> 3 2014-05-01 -934.00
> 3 2014-06-01 -1779.95
> 3 2014-07-01 -1779.95
> 3 2014-08-01 -1779.95
> ...
> 3 2017-12-01 90395.82
> 3 2018-01-01 90395.82
> 3 2018-02-01 90395.82
> 3 2018-03-01 90395.82
> ...
> 3 2019-07-01 90395.82
> 4 2015-09-01 -1859.75
> 4 2015-10-01 -1859.75
> 4 2015-11-01 -1859.75
> 4 2015-12-01 0
> 4 2016-01-01 0
> ...
> 4 2017-11-01 0
> 4 2017-12-01 130105.00
> 4 2018-01-01 130105.00
> ...
> 4 2019-07-01 130105.00
> 5 2014-07-01 -6929.58
> 5 2014-08-01 -6929.58
> 5 2014-09-01 -6929.58
> ...
> 5 2019-07-01 -6929.58
Any help would be great.
Thanks!
Create a DatetimeIndex first and then use groupby with resample:
df['period'] = pd.to_datetime(df['period'])
df1 = df.set_index('period').groupby('property_id').resample('M').ffill()
#alternative - pad() is an older alias of ffill(), removed in pandas 2.0
#df1 = df.set_index('period').groupby('property_id').resample('M').pad()
print (df1)
property_id amount
property_id period
1 2016-07-31 1 105908.20
2016-08-31 1 0.00
2 2016-08-31 2 114759.40
3 2014-05-31 3 -934.00
2014-06-30 3 -845.95
... ...
4 2017-09-30 4 1859.75
2017-10-31 4 1859.75
2017-11-30 4 1859.75
2017-12-31 4 130105.00
5 2014-07-31 5 -6929.58
[76 rows x 2 columns]
EDIT: The idea is to create a new DataFrame from the last row of each property_id, assign the month by condition, then append it to the original and use the solution above:
df['period'] = pd.to_datetime(df['period'])
df = df.sort_values(['property_id','period'])
last = pd.to_datetime('now').floor('d')
nextday = (last + pd.Timedelta(1, 'd')).day
orig_month = last.to_period('m').to_timestamp()
before_month = (last.to_period('m') - 1).to_timestamp()
last = orig_month if nextday == 1 else before_month
print (last)
2019-07-01 00:00:00
df1 = df.drop_duplicates('property_id', keep='last').assign(period=last, amount=0)
print (df1)
property_id period amount
1 1 2019-07-01 0
2 2 2019-07-01 0
5 3 2019-07-01 0
8 4 2019-07-01 0
9 5 2019-07-01 0
df = pd.concat([df, df1])
df1 = (df.set_index('period')
         .groupby('property_id')['amount']
         .resample('MS')
         .asfreq(fill_value=0)
         .groupby(level=0)
         .cumsum())
print (df1)
property_id period
1 2016-07-01 105908.20
2016-08-01 105908.20
2016-09-01 105908.20
2016-10-01 105908.20
2016-11-01 105908.20
...
5 2019-03-01 -6929.58
2019-04-01 -6929.58
2019-05-01 -6929.58
2019-06-01 -6929.58
2019-07-01 -6929.58
Name: amount, Length: 244, dtype: float64
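Putting the pieces together on a hypothetical three-row sample (the data here is made up for illustration): months with no cost get amount 0, so the cumulative sum simply carries forward:

```python
import pandas as pd

# hypothetical sample: property 4 has gap months between its costs
df = pd.DataFrame({
    'property_id': [4, 4, 4],
    'period': pd.to_datetime(['2015-09-01', '2015-12-01', '2016-02-01']),
    'amount': [-1859.75, 1859.75, 100.00],
})

out = (df.set_index('period')
         .groupby('property_id')['amount']
         .resample('MS')              # month-start labels, matching the input dates
         .asfreq(fill_value=0)        # silent months contribute 0 to the running total
         .groupby(level=0)
         .cumsum())
```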
I got the following dataframe with two groups:

start_time          end_time            ID
10/10/2021 13:38    10/10/2021 14:30    A
31/10/2021 14:00    31/10/2021 15:00    A
21/10/2021 14:47    21/10/2021 15:30    B
23/10/2021 14:00    23/10/2021 15:30    B
I want to ignore the date and preserve only the time for counting.
I would like to first create 30-minute intervals as rows for each group and then count, which should be similar to this:

start_interval    end_interval    count    ID
13:00             13:30           0        A
13:30             14:00           1        A
14:00             14:30           2        A
14:30             15:00           1        A
13:00             13:30           0        B
13:30             14:00           0        B
14:00             14:30           1        B
14:30             15:00           2        B
15:00             15:30           2        B
Use:
#floor all datetimes to 30-minute marks
f = lambda x: pd.to_datetime(x).dt.floor('30Min')
df[["start_time", "end_time"]] = df[["start_time", "end_time"]].apply(f)
#get difference of 30 minutes
df['diff'] = df['end_time'].sub(df['start_time']).dt.total_seconds().div(1800).astype(int)
df['start_time'] = df['start_time'].sub(df['start_time'].dt.floor('d'))
#repeat by 30 minutes
df = df.loc[df.index.repeat(df['diff'])]
df['start_time'] += pd.to_timedelta(df.groupby(level=0).cumcount().mul(30), unit='Min')
print (df)
start_time end_time ID diff
0 0 days 13:30:00 2021-10-10 14:30:00 A 2
0 0 days 14:00:00 2021-10-10 14:30:00 A 2
1 0 days 14:00:00 2021-10-31 15:00:00 A 2
1 0 days 14:30:00 2021-10-31 15:00:00 A 2
2 0 days 14:30:00 2021-10-21 15:30:00 B 2
2 0 days 15:00:00 2021-10-21 15:30:00 B 2
3 0 days 14:00:00 2021-10-23 15:30:00 B 3
3 0 days 14:30:00 2021-10-23 15:30:00 B 3
3 0 days 15:00:00 2021-10-23 15:30:00 B 3
#add starting dates - here 12:00
df1 = pd.DataFrame({'ID':df['ID'].unique(), 'start_time': pd.Timedelta(12, unit='H')})
print (df1)
ID start_time
0 A 0 days 12:00:00
1 B 0 days 12:00:00
df = pd.concat([df, df1])
#count per 30 minutes
df = df.set_index('start_time').groupby('ID').resample('30Min')['end_time'].count().reset_index(name='count')
#add end column
df['end_interval'] = df['start_time'] + pd.Timedelta(30, unit='Min')
df = df.rename(columns={'start_time':'start_interval'})[['start_interval','end_interval','count','ID']]
print (df)
start_interval end_interval count ID
0 0 days 12:00:00 0 days 12:30:00 0 A
1 0 days 12:30:00 0 days 13:00:00 0 A
2 0 days 13:00:00 0 days 13:30:00 0 A
3 0 days 13:30:00 0 days 14:00:00 1 A
4 0 days 14:00:00 0 days 14:30:00 2 A
5 0 days 14:30:00 0 days 15:00:00 1 A
6 0 days 12:00:00 0 days 12:30:00 0 B
7 0 days 12:30:00 0 days 13:00:00 0 B
8 0 days 13:00:00 0 days 13:30:00 0 B
9 0 days 13:30:00 0 days 14:00:00 0 B
10 0 days 14:00:00 0 days 14:30:00 1 B
11 0 days 14:30:00 0 days 15:00:00 2 B
12 0 days 15:00:00 0 days 15:30:00 2 B
EDIT:
def f(x):
    ts = x.total_seconds()
    hours, remainder = divmod(ts, 3600)
    minutes, seconds = divmod(remainder, 60)
    return '{:02d}:{:02d}:{:02d}'.format(int(hours), int(minutes), int(seconds))
df[['start_interval','end_interval']] = df[['start_interval','end_interval']].applymap(f)
print (df)
start_interval end_interval count ID
0 12:00:00 12:30:00 0 A
1 12:30:00 13:00:00 0 A
2 13:00:00 13:30:00 0 A
3 13:30:00 14:00:00 1 A
4 14:00:00 14:30:00 2 A
5 14:30:00 15:00:00 1 A
6 12:00:00 12:30:00 0 B
7 12:30:00 13:00:00 0 B
8 13:00:00 13:30:00 0 B
9 13:30:00 14:00:00 0 B
10 14:00:00 14:30:00 1 B
11 14:30:00 15:00:00 2 B
12 15:00:00 15:30:00 2 B
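A lighter alternative to the formatting helper above (my suggestion, not part of the original answer): the string form of a within-day Timedelta is "0 days HH:MM:SS", so slicing off the last eight characters yields the clock time:

```python
import pandas as pd

# hypothetical within-day offsets, like the start_interval column above
s = pd.Series(pd.to_timedelta(['12:00:00', '12:30:00', '15:00:00']))

clock = s.astype(str).str[-8:]  # "0 days 12:00:00" -> "12:00:00"
```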
The input dataframe has start and end times. The resultant dataframe is a series of timestamps with a 30-minute interval between them.
Here it is
# Import libs
import pandas as pd
from datetime import timedelta
# Sample Dataframe
df = pd.DataFrame(
    [
        ["10/10/2021 13:40", "10/10/2021 14:30", "A"],
        ["31/10/2021 14:00", "31/10/2021 15:00", "A"],
        ["21/10/2021 14:40", "21/10/2021 15:30", "B"],
        ["23/10/2021 14:00", "23/10/2021 15:30", "B"],
    ],
    columns=["start_time", "end_time", "ID"],
)
# convert to timedelta
df[["start_time", "end_time"]] = df[["start_time", "end_time"]].apply(
lambda x: pd.to_datetime(x) - pd.to_datetime(x).dt.normalize()
)
# Extract seconds elapsed
df[["start_secs", "end_secs"]] = df[["start_time", "end_time"]].applymap(
lambda x: x.seconds
)
# OUTPUT
# start_time end_time ID start_secs end_secs
# 0 0 days 13:40:00 0 days 14:30:00 A 49200 52200
# 1 0 days 14:00:00 0 days 15:00:00 A 50400 54000
# 2 0 days 14:40:00 0 days 15:30:00 B 52800 55800
# 3 0 days 14:00:00 0 days 15:30:00 B 50400 55800
# Get rounded Min and Max time in secs of the dataframe
min_t = (df["start_secs"].min() // 3600) * 3600
max_t = (df["end_secs"].max() // 3600) * 3600 + 3600
# Create Interval dataframe with 30min bins
interval_df = pd.DataFrame(
    map(lambda x: [x, x + 30 * 60], range(min_t, max_t, 30 * 60)),
    columns=["start_interval", "end_interval"],
)
# OUTPUT
# start_interval end_interval
# 0 46800 48600
# 1 48600 50400
# 2 50400 52200
# 3 52200 54000
# 4 54000 55800
# 5 55800 57600
# Find whether each bin interval overlaps an actual timeline, then count overlapping timelines per ID.
interval_df[["A", "B"]] = (
df.groupby(["ID"])
.apply(
lambda x: x.apply(
lambda y: ~(
((interval_df["end_interval"] - y["start_secs"]) <= 0)
| ((interval_df["start_interval"] - y["end_secs"]) >= 0)
),
axis=1,
).sum(axis=0)
)
.T
)
# OUTPUT
# start_interval end_interval A B
# 0 46800 48600 0 0
# 1 48600 50400 1 0
# 2 50400 52200 2 1
# 3 52200 54000 1 2
# 4 54000 55800 0 2
# 5 55800 57600 0 0
# Convert seconds to time
interval_df[["start_interval", "end_interval"]] = interval_df[
["start_interval", "end_interval"]
].applymap(lambda x: str(timedelta(seconds=x)))
# Stack counts of A and B into one single column
interval_df.melt(["start_interval", "end_interval"])
# OUTPUT
# start_interval end_interval variable value
# 0 13:00:00 13:30:00 A 0
# 1 13:30:00 14:00:00 A 1
# 2 14:00:00 14:30:00 A 2
# 3 14:30:00 15:00:00 A 1
# 4 15:00:00 15:30:00 A 0
# 5 15:30:00 16:00:00 A 0
# 6 13:00:00 13:30:00 B 0
# 7 13:30:00 14:00:00 B 0
# 8 14:00:00 14:30:00 B 1
# 9 14:30:00 15:00:00 B 2
# 10 15:00:00 15:30:00 B 2
# 11 15:30:00 16:00:00 B 0
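The final melt step in isolation, on a hypothetical one-interval frame, to show how the per-ID count columns become rows:

```python
import pandas as pd

wide = pd.DataFrame({'start_interval': ['13:00:00'],
                     'end_interval': ['13:30:00'],
                     'A': [0], 'B': [1]})

# id_vars stay as columns; the remaining columns (A, B) are stacked
long = wide.melt(['start_interval', 'end_interval'],
                 var_name='ID', value_name='count')
```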
Here's some made up time series data on 1 minute intervals:
import pandas as pd
import numpy as np
import random
random.seed(5)
rows,cols = 8760,3
data = np.random.rand(rows,cols)
tidx = pd.date_range('2019-01-01', periods=rows, freq='1T')
df = pd.DataFrame(data, columns=['condition1','condition2','condition3'], index=tidx)
This is just some code to create some Boolean columns
df['condition1_bool'] = df['condition1'].lt(.1)
df['condition2_bool'] = df['condition2'].lt(df['condition1']) & df['condition2'].gt(df['condition3'])
df['condition3_bool'] = df['condition3'].gt(.9)
df = df[['condition1_bool','condition2_bool','condition3_bool']]
df = df.astype(int)
On my screen this prints:
condition1_bool condition2_bool condition3_bool
2019-01-01 00:00:00 0 0 0
2019-01-01 00:01:00 0 0 1 <---- Count as same event!
2019-01-01 00:02:00 0 0 1 <---- Count as same event!
2019-01-01 00:03:00 1 0 0
2019-01-01 00:04:00 0 0 0
What I am trying to figure out is how to roll up cumulative events (True or 1) per hour, but if there is no 0 between events, it is the same event! Hopefully that makes sense with the <---- Count as same event! markers above.
If I do:
df = df.resample('H').sum()
This will just resample and count all events, regardless of the same-event rule I was trying to highlight with the <---- Count as same event! markers.
Thanks for any tips!!
Check if the current row ("2019-01-01 00:02:00") equals 1 and the previous row ("2019-01-01 00:01:00") does not equal 1. This removes consecutive 1s from the sum.
>>> df.resample('H').apply(lambda x: (x.eq(1) & x.shift().ne(1)).sum())
condition1_bool condition2_bool condition3_bool
2019-01-01 00:00:00 4 8 4
2019-01-01 01:00:00 9 7 6
2019-01-01 02:00:00 7 14 4
2019-01-01 03:00:00 2 8 7
2019-01-01 04:00:00 4 9 5
... ... ... ...
2019-01-06 21:00:00 4 8 2
2019-01-06 22:00:00 3 11 4
2019-01-06 23:00:00 6 11 4
2019-01-07 00:00:00 8 7 8
2019-01-07 01:00:00 4 9 6
[146 rows x 3 columns]
Using your code:
>>> df.resample('H').sum()
condition1_bool condition2_bool condition3_bool
2019-01-01 00:00:00 5 8 5
2019-01-01 01:00:00 9 8 6
2019-01-01 02:00:00 7 14 5
2019-01-01 03:00:00 2 9 7
2019-01-01 04:00:00 4 11 5
... ... ... ...
2019-01-06 21:00:00 5 11 3
2019-01-06 22:00:00 3 15 4
2019-01-06 23:00:00 6 12 4
2019-01-07 00:00:00 8 7 10
2019-01-07 01:00:00 4 9 7
[146 rows x 3 columns]
Check:
dti = pd.date_range('2021-11-15 21:00:00', '2021-11-15 22:00:00',
                    closed='left', freq='T')
df1 = pd.DataFrame({'c1': 1}, index=dti)
>>> df1.resample('H').apply(lambda x: (x.eq(1) & x.shift().ne(1)).sum())
c1
2021-11-15 21:00:00 1
>>> df1.resample('H').sum()
c1
2021-11-15 21:00:00 60
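The same run-counting trick on a tiny hand-checkable series (hypothetical data): two separate runs of 1s inside one hour count as two events, while a plain sum would report three:

```python
import pandas as pd

idx = pd.date_range('2019-01-01 00:00', periods=6, freq='min')
s = pd.Series([0, 1, 1, 0, 1, 0], index=idx)

# an event starts where the value is 1 and the previous value was not 1
events = s.resample('h').apply(lambda x: (x.eq(1) & x.shift().ne(1)).sum())
```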
Below is a sample of the dataframe (df):

   alpha  value
0  a      5
1  a      8
2  a      4
3  b      2
4  b      1
I know how to make the sequence (numbers) as per the group:
df["serial"] = df.groupby("alpha").cumcount()+1
   alpha  value  serial
0  a      5      1
1  a      8      2
2  a      4      3
3  b      2      1
4  b      1      2
But instead of numbers I need a date-time sequence with 30-minute intervals:
Expected result:

   alpha  value  serial
0  a      5      2021-01-01 23:30:00
1  a      8      2021-01-02 00:00:00
2  a      4      2021-01-02 00:30:00
3  b      2      2021-01-01 23:30:00
4  b      1      2021-01-02 00:00:00
You can simply multiply your result by a pd.Timedelta:
print ((df.groupby("alpha").cumcount()+1)*pd.Timedelta(minutes=30)+pd.Timestamp("2021-01-01 23:00:00"))
0 2021-01-01 23:30:00
1 2021-01-02 00:00:00
2 2021-01-02 00:30:00
3 2021-01-01 23:30:00
4 2021-01-02 00:00:00
dtype: datetime64[ns]
Try to_datetime and groupby with cumcount, then multiply by a 30-minute pd.Timedelta:
>>> df['serial'] = pd.to_datetime('2021-01-01 23:30:00') + df.groupby('alpha').cumcount() * pd.Timedelta(minutes=30)
>>> df
alpha value serial
0 a 5 2021-01-01 23:30:00
1 a 8 2021-01-02 00:00:00
2 a 4 2021-01-02 00:30:00
3 b 2 2021-01-01 23:30:00
4 b 1 2021-01-02 00:00:00
>>>
I have two dataframes (df and df1) like as shown below
import numpy as np
import pandas as pd
from datetime import timedelta

df = pd.DataFrame({'person_id': [101,101,101,101,202,202,202],
'start_date':['5/7/2013 09:27:00 AM','09/08/2013 11:21:00 AM','06/06/2014 08:00:00 AM', '06/06/2014 05:00:00 AM','12/11/2011 10:00:00 AM','13/10/2012 12:00:00 AM','13/12/2012 11:45:00 AM']})
df.start_date = pd.to_datetime(df.start_date)
df['end_date'] = df.start_date + timedelta(days=5)
df['enc_id'] = ['ABC1','ABC2','ABC3','ABC4','DEF1','DEF2','DEF3']
df1 = pd.DataFrame({'person_id': [101,101,101,101,101,101,101,202,202,202,202,202,202,202,202],'date_1':['07/07/2013 11:20:00 AM','05/07/2013 02:30:00 PM','06/07/2013 02:40:00 PM','08/06/2014 12:00:00 AM','11/06/2014 12:00:00 AM','02/03/2013 12:30:00 PM','13/06/2014 12:00:00 AM','12/11/2011 12:00:00 AM','13/10/2012 07:00:00 AM','13/12/2015 12:00:00 AM','13/12/2012 12:00:00 AM','13/12/2012 06:30:00 PM','13/07/2011 10:00:00 AM','18/12/2012 10:00:00 AM', '19/12/2013 11:00:00 AM']})
df1['date_1'] = pd.to_datetime(df1['date_1'])
df1['within_id'] = ['ABC','ABC','ABC','ABC','ABC','ABC','ABC','DEF','DEF','DEF','DEF','DEF','DEF','DEF',np.nan]
What I would like to do is:
a) Pick each person from df1 who doesn't have NA in the 'within_id' column and check whether their date_1 falls between (df.start_date - 1) and (df.end_date + 1) for the same person in df and for the same within_id or enc_id.
ex: for subject = 101 and within_id = ABC, date_1 is 7/7/2013; check whether it falls between 4/7/2013 (df.start_date - 1) and 11/7/2013 (df.end_date + 1).
As the first-row comparison itself gave us the result, we don't have to compare date_1 with the rest of the records in df for subject 101. If not, we need to keep scanning until we find the interval within which date_1 falls.
b) If a date interval is found, assign the corresponding enc_id from df to the within_id in df1.
c) If not, assign "Out of Range".
I tried the below
t1 = df.groupby('person_id').apply(pd.DataFrame.sort_values, 'start_date')
t2 = df1.groupby('person_id').apply(pd.DataFrame.sort_values, 'date_1')
t3= pd.concat([t1, t2], axis=1)
t3['within_id'] = np.where((t3['date_1'] >= t3['start_date'] && t3['person_id'] == t3['person_id_x'] && t3['date_2'] >= t3['end_date']),enc_id]
I expect my output to be as shown below (also see the 14th row at the bottom of my screenshot). As I intend to apply the solution to big data (4-5 million records, with perhaps 5000-6000 unique person_ids), an efficient and elegant solution would be appreciated.
14 202 2012-12-13 11:00:00 NA
Let's do:
d = df1.merge(df.assign(within_id=df['enc_id'].str[:3]),
              on=['person_id', 'within_id'], how='left', indicator=True)
m = d['date_1'].between(d['start_date'] - pd.Timedelta(days=1),
                        d['end_date'] + pd.Timedelta(days=1))
d = df1.merge(d[m | d['_merge'].ne('both')], on=['person_id', 'date_1'], how='left')
d['within_id'] = d['enc_id'].fillna('out of range').mask(d['_merge'].eq('left_only'))
d = d[df1.columns]
Details:
Left merge the dataframe df1 with df on person_id and within_id:
print(d)
person_id date_1 within_id start_date end_date enc_id _merge
0 101 2013-07-07 11:20:00 ABC 2013-05-07 09:27:00 2013-05-12 09:27:00 ABC1 both
1 101 2013-07-07 11:20:00 ABC 2013-09-08 11:21:00 2013-09-13 11:21:00 ABC2 both
2 101 2013-07-07 11:20:00 ABC 2014-06-06 08:00:00 2014-06-11 08:00:00 ABC3 both
3 101 2013-07-07 11:20:00 ABC 2014-06-06 05:00:00 2014-06-11 10:00:00 DEF1 both
....
47 202 2012-12-18 10:00:00 DEF 2012-10-13 00:00:00 2012-10-18 00:00:00 DEF2 both
48 202 2012-12-18 10:00:00 DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
49 202 2013-12-19 11:00:00 NaN NaT NaT NaN left_only
Create a boolean mask m to represent the condition where date_1 is between df.start_date - 1 days and df.end_date + 1 days:
print(m)
0 False
1 False
2 False
3 False
...
47 False
48 True
49 False
dtype: bool
Again left merge the dataframe df1 with the dataframe filtered using mask m on columns person_id and date_1:
print(d)
person_id date_1 within_id_x within_id_y start_date end_date enc_id _merge
0 101 2013-07-07 11:20:00 ABC NaN NaT NaT NaN NaN
1 101 2013-05-07 14:30:00 ABC ABC 2013-05-07 09:27:00 2013-05-12 09:27:00 ABC1 both
2 101 2013-06-07 14:40:00 ABC NaN NaT NaT NaN NaN
3 101 2014-08-06 00:00:00 ABC NaN NaT NaT NaN NaN
4 101 2014-11-06 00:00:00 ABC NaN NaT NaT NaN NaN
5 101 2013-02-03 12:30:00 ABC NaN NaT NaT NaN NaN
6 101 2014-06-13 00:00:00 ABC NaN NaT NaT NaN NaN
7 202 2011-12-11 00:00:00 DEF DEF 2011-12-11 10:00:00 2011-12-16 10:00:00 DEF1 both
8 202 2012-10-13 07:00:00 DEF DEF 2012-10-13 00:00:00 2012-10-18 00:00:00 DEF2 both
9 202 2015-12-13 00:00:00 DEF NaN NaT NaT NaN NaN
10 202 2012-12-13 00:00:00 DEF DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
11 202 2012-12-13 18:30:00 DEF DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
12 202 2011-07-13 10:00:00 DEF NaN NaT NaT NaN NaN
13 202 2012-12-18 10:00:00 DEF DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
14 202 2013-12-19 11:00:00 NaN NaN NaT NaT NaN left_only
Populate the within_id column from enc_id, use Series.fillna to fill the NaN values with "out of range" (excluding the rows that never matched in df), and finally filter the columns to get the result:
print(d)
person_id date_1 within_id
0 101 2013-07-07 11:20:00 out of range
1 101 2013-05-07 14:30:00 ABC1
2 101 2013-06-07 14:40:00 out of range
3 101 2014-08-06 00:00:00 out of range
4 101 2014-11-06 00:00:00 out of range
5 101 2013-02-03 12:30:00 out of range
6 101 2014-06-13 00:00:00 out of range
7 202 2011-12-11 00:00:00 DEF1
8 202 2012-10-13 07:00:00 DEF2
9 202 2015-12-13 00:00:00 out of range
10 202 2012-12-13 00:00:00 DEF3
11 202 2012-12-13 18:30:00 DEF3
12 202 2011-07-13 10:00:00 out of range
13 202 2012-12-18 10:00:00 DEF3
14 202 2013-12-19 11:00:00 NaN
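The fillna/mask step on its own, with hypothetical values, to show why matched-but-out-of-window rows become "out of range" while rows that never merged stay NaN:

```python
import numpy as np
import pandas as pd

enc_id = pd.Series(['ABC1', np.nan, np.nan])
merge_flag = pd.Series(['both', 'both', 'left_only'])

# fillna labels unmatched windows; mask re-blanks rows that had no merge partner
within_id = enc_id.fillna('out of range').mask(merge_flag.eq('left_only'))
```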
I used df and df1 as provided above.
The basic approach is to iterate over df1 and extract the matching values of enc_id.
I added a 'rule' column, to show how each value got populated.
Unfortunately, I was not able to reproduce the expected results. Perhaps the general approach will be useful.
df1['rule'] = 0
for t in df1.itertuples():
    person = (t.person_id == df.person_id)
    b = (t.date_1 >= df.start_date) & (t.date_2 <= df.end_date)
    c = (t.date_1 >= df.start_date) & (t.date_2 >= df.end_date)
    d = (t.date_1 <= df.start_date) & (t.date_2 <= df.end_date)
    e = (t.date_1 <= df.start_date) & (t.date_2 <= df.start_date)  # start_date at BOTH ends
    if (m := person & b).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 1
    elif (m := person & c).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 10
    elif (m := person & d).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 100
    elif (m := person & e).any():
        df1.at[t.Index, 'within_id'] = 'out of range'
        df1.at[t.Index, 'rule'] += 1_000
    else:
        df1.at[t.Index, 'within_id'] = 'impossible!'
        df1.at[t.Index, 'rule'] += 10_000
df1['within_id'] = df1['within_id'].astype('Int64')
The results are:
print(df1)
person_id date_1 date_2 within_id rule
0 11 1961-12-30 00:00:00 1962-01-01 00:00:00 11345678901 1
1 11 1962-01-30 00:00:00 1962-02-01 00:00:00 11345678902 1
2 12 1962-02-28 00:00:00 1962-03-02 00:00:00 34567892101 100
3 12 1989-07-29 00:00:00 1989-07-31 00:00:00 34567892101 1
4 12 1989-09-03 00:00:00 1989-09-05 00:00:00 34567892101 10
5 12 1989-10-02 00:00:00 1989-10-04 00:00:00 34567892103 1
6 12 1989-10-01 00:00:00 1989-10-03 00:00:00 34567892103 1
7 13 1999-03-29 00:00:00 1999-03-31 00:00:00 56432718901 1
8 13 1999-04-20 00:00:00 1999-04-22 00:00:00 56432718901 10
9 13 1999-06-02 00:00:00 1999-06-04 00:00:00 56432718904 1
10 13 1999-06-03 00:00:00 1999-06-05 00:00:00 56432718904 1
11 13 1999-07-29 00:00:00 1999-07-31 00:00:00 56432718905 1
12 14 2002-02-03 10:00:00 2002-02-05 10:00:00 24680135791 1
13 14 2002-02-03 10:00:00 2002-02-05 10:00:00 24680135791 1
I want to obtain the timedelta interval between several timestamp columns in a dataframe. Also, several entries are NaN.
Original DF:
0 1 2 3 4 5
0 date1 date2 NaN NaN NaN NaN
1 date3 date4 date5 date6 date7 date8
Desired Output:
0 1 2 3 4
0 date2-date1 NaN NaN NaN NaN
1 date4-date3 date5-date4 date6-date5 date7-date6 date8-date7
I think you can use this, provided the NaNs in each row run consecutively to the end of the row:
import numpy as np
import pandas as pd

df = pd.DataFrame([['2015-01-02','2015-01-03', np.nan, np.nan],
                   ['2015-01-02','2015-01-05','2015-01-07','2015-01-12']])
print (df)
0 1 2 3
0 2015-01-02 2015-01-03 NaN NaN
1 2015-01-02 2015-01-05 2015-01-07 2015-01-12
df = df.apply(pd.to_datetime).ffill(axis=1).diff(axis=1)
print (df)
0 1 2 3
0 NaT 1 days 0 days 0 days
1 NaT 3 days 2 days 5 days
Details:
First convert all columns to datetimes:
print (df.apply(pd.to_datetime))
0 1 2 3
0 2015-01-02 2015-01-03 NaT NaT
1 2015-01-02 2015-01-05 2015-01-07 2015-01-12
Replace NaT by forward filling the last value per row:
print (df.apply(pd.to_datetime).ffill(axis=1))
0 1 2 3
0 2015-01-02 2015-01-03 2015-01-03 2015-01-03
1 2015-01-02 2015-01-05 2015-01-07 2015-01-12
Get difference by diff:
print (df.apply(pd.to_datetime).ffill(axis=1).diff(axis=1))
0 1 2 3
0 NaT 1 days 0 days 0 days
1 NaT 3 days 2 days 5 days