How to find occurrences of consecutive events in a Python time series DataFrame? - python

I have got a time series of meteorological observations with date and value columns:
import numpy as np
import pandas as pd

df = pd.DataFrame({'date':['11/10/2017 00:00','11/10/2017 03:00','11/10/2017 06:00','11/10/2017 09:00','11/10/2017 12:00',
'11/11/2017 00:00','11/11/2017 03:00','11/11/2017 06:00','11/11/2017 09:00','11/11/2017 12:00',
'11/12/2017 00:00','11/12/2017 03:00','11/12/2017 06:00','11/12/2017 09:00','11/12/2017 12:00'],
'value':[850,np.nan,np.nan,np.nan,np.nan,500,650,780,np.nan,800,350,690,780,np.nan,np.nan],
'consecutive_hour': [3,0,0,0,0,3,6,9,0,3,3,6,9,0,0]})
With this DataFrame, I want a third column, consecutive_hour, such that whenever the value at a timestamp is below 1000 we record 3 hours for that row, and each consecutive qualifying timestamp accumulates a further 3 hours (6, 9, ... as above).
Lastly, I want to summarize the table, counting how many times each consecutive-hours run length occurs, so that the summary table looks like:
df_summary = pd.DataFrame({'consecutive_hours':[3,6,9,12],
'number_of_day':[2,0,2,0]})
I tried several online solutions with methods like shift(), diff(), etc., as mentioned in: How to groupby consecutive values in pandas DataFrame
and more; I spent several days on it but have had no luck yet.
I would highly appreciate help on this issue.
Thanks!

Input data:
>>> df
date value
0 2017-11-10 00:00:00 850.0
1 2017-11-10 03:00:00 NaN
2 2017-11-10 06:00:00 NaN
3 2017-11-10 09:00:00 NaN
4 2017-11-10 12:00:00 NaN
5 2017-11-11 00:00:00 500.0
6 2017-11-11 03:00:00 650.0
7 2017-11-11 06:00:00 780.0
8 2017-11-11 09:00:00 NaN
9 2017-11-11 12:00:00 800.0
10 2017-11-12 00:00:00 350.0
11 2017-11-12 03:00:00 690.0
12 2017-11-12 06:00:00 780.0
13 2017-11-12 09:00:00 NaN
14 2017-11-12 12:00:00 NaN
The cumcount_reset function is adapted from this answer by @jezrael:
Python pandas cumsum with reset everytime there is a 0
# Ensure the date column is datetime
df["date"] = pd.to_datetime(df["date"])

# Count consecutive True values, resetting the count at each False
cumcount_reset = \
    lambda b: b.cumsum().sub(b.cumsum().where(~b).ffill().fillna(0)).astype(int)

# Flag values below 1000, count consecutive flags within each day, scale to hours
df["consecutive_hour"] = (df.set_index("date")["value"] < 1000) \
    .groupby(pd.Grouper(freq="D")) \
    .apply(cumcount_reset).mul(3) \
    .reset_index(drop=True)
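For intuition, cumcount_reset counts consecutive True values and resets at each False; a quick check on a toy Series (assuming the lambda above is already defined):
s = pd.Series([True, True, False, True, True, True])
print(cumcount_reset(s).tolist())  # [1, 2, 0, 1, 2, 3]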
Output result:
>>> df
date value consecutive_hour
0 2017-11-10 00:00:00 850.0 3
1 2017-11-10 03:00:00 NaN 0
2 2017-11-10 06:00:00 NaN 0
3 2017-11-10 09:00:00 NaN 0
4 2017-11-10 12:00:00 NaN 0
5 2017-11-11 00:00:00 500.0 3
6 2017-11-11 03:00:00 650.0 6
7 2017-11-11 06:00:00 780.0 9
8 2017-11-11 09:00:00 NaN 0
9 2017-11-11 12:00:00 800.0 3
10 2017-11-12 00:00:00 350.0 3
11 2017-11-12 03:00:00 690.0 6
12 2017-11-12 06:00:00 780.0 9
13 2017-11-12 09:00:00 NaN 0
14 2017-11-12 12:00:00 NaN 0
Summary table:
# Keep only the last row of each run (where the count drops afterwards),
# then count how often each run length occurs
df_summary = df.loc[df.groupby(pd.Grouper(key="date", freq="D"))["consecutive_hour"] \
    .apply(lambda h: (h - h.shift(-1).fillna(0)) > 0),
    "consecutive_hour"] \
    .value_counts().reindex([3, 6, 9, 12], fill_value=0) \
    .rename("number_of_day") \
    .rename_axis("consecutive_hour") \
    .reset_index()
>>> df_summary
consecutive_hour number_of_day
0 3 2
1 6 0
2 9 2
3 12 0
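For intuition, the mask (h - h.shift(-1).fillna(0)) > 0 keeps only the final, largest count of each run, so every run is counted exactly once; a toy check of that idea:
h = pd.Series([3, 6, 9, 0, 3])
print(h[(h - h.shift(-1).fillna(0)) > 0].tolist())  # [9, 3]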

Related

Subtract one column by itself based on a condition set by another column

I have the following data frame, where time_stamp is already sorted in ascending order:
time_stamp indicator
0 2021-01-01 00:00:00 1
1 2021-01-01 00:02:00 1
2 2021-01-01 00:03:00 NaN
3 2021-01-01 00:04:00 NaN
4 2021-01-01 00:09:00 NaN
5 2021-01-01 00:14:00 NaN
6 2021-01-01 00:19:00 NaN
7 2021-01-01 00:24:00 NaN
8 2021-01-01 00:27:00 1
9 2021-01-01 00:29:00 NaN
10 2021-01-01 00:32:00 2
11 2021-01-01 00:34:00 NaN
12 2021-01-01 00:37:00 2
13 2021-01-01 00:38:00 NaN
14 2021-01-01 00:39:00 NaN
I want to create a new column in the above data frame that shows the time difference between each row's time_stamp value and the time_stamp of the nearest row above it where indicator is not NaN.
Below is how the output should look (time_diff is a timedelta value, but I'll illustrate with index subtraction instead; for example, (2 - 1) = df['time_stamp'][2] - df['time_stamp'][1]):
time_stamp indicator time_diff
0 2021-01-01 00:00:00 1 NaT # (or undefined)
1 2021-01-01 00:02:00 1 1 - 0
2 2021-01-01 00:03:00 NaN 2 - 1
3 2021-01-01 00:04:00 NaN 3 - 1
4 2021-01-01 00:09:00 NaN 4 - 1
5 2021-01-01 00:14:00 NaN 5 - 1
6 2021-01-01 00:19:00 NaN 6 - 1
7 2021-01-01 00:24:00 NaN 7 - 1
8 2021-01-01 00:27:00 1 8 - 1
9 2021-01-01 00:29:00 NaN 9 - 8
10 2021-01-01 00:32:00 1 10 - 8
11 2021-01-01 00:34:00 NaN 11 - 10
12 2021-01-01 00:37:00 1 12 - 10
13 2021-01-01 00:38:00 NaN 13 - 12
14 2021-01-01 00:39:00 NaN 14 - 12
We can use a for loop that keeps track of the last row where indicator is not NaN, but I'm looking for a solution that does not use a for loop.
I've ended up doing this:
# Create an intermediate column holding the last timestamp at which
# `indicator` was not NaN
df['tracking'] = df['time_stamp'].where(df['indicator'].notna())
df['tracking'] = df['tracking'].ffill()
# Use it to compute the difference from `time_stamp`
df['time_diff'] = df['time_stamp'] - df['tracking']
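Note that the forward fill includes the current row, so rows whose own indicator is non-NaN get a zero difference rather than the distance to the previous marker. Shifting before the fill matches the desired output exactly; a minimal sketch of that variant:
# Timestamp of the nearest *preceding* row with a non-NaN indicator
prev_marker = df['time_stamp'].where(df['indicator'].notna()).shift().ffill()
df['time_diff'] = df['time_stamp'] - prev_marker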

Filtering out another dataframe based on selected hours

I'm trying to filter my dataframe down to a 3-hourly frequency, keeping only timestamps at 0000hr, 0300hr, 0600hr, 0900hr, 1200hr, 1500hr, 1800hr, 2100hr, and so on.
A sample of my dataframe looks like this:
Time A
2019-05-25 03:54:00 1
2019-05-25 03:57:00 2
2019-05-25 04:00:00 3
...
2020-05-25 03:54:00 4
2020-05-25 03:57:00 5
2020-05-25 04:00:00 6
Desired output:
Time A
2019-05-25 06:00:00 1
2019-05-25 09:00:00 2
2019-05-25 12:00:00 3
...
2020-05-25 00:00:00 4
2020-05-25 03:00:00 5
2020-05-25 06:00:00 6
2020-05-25 09:00:00 6
2020-05-25 12:00:00 6
2020-05-25 15:00:00 6
2020-05-25 18:00:00 6
2020-05-25 21:00:00 6
2020-05-26 00:00:00 6
...
You can define a date range with a 3-hour interval using pd.date_range() and then filter your dataframe with .loc and isin(), as follows:
# 3-hourly grid spanning the data's full date range (end extended by a day)
date_rng_3H = pd.date_range(start=df['Time'].dt.date.min(), end=df['Time'].dt.date.max() + pd.DateOffset(days=1), freq='3H')
df_out = df.loc[df['Time'].isin(date_rng_3H)]
Input data:
date_rng = pd.date_range(start='2019-05-25 03:54:00', end='2020-05-25 04:00:00', freq='3T')
np.random.seed(123)
df = pd.DataFrame({'Time': date_rng, 'A': np.random.randint(1, 6, len(date_rng))})
Time A
0 2019-05-25 03:54:00 3
1 2019-05-25 03:57:00 5
2 2019-05-25 04:00:00 3
3 2019-05-25 04:03:00 2
4 2019-05-25 04:06:00 4
... ... ...
175678 2020-05-25 03:48:00 2
175679 2020-05-25 03:51:00 1
175680 2020-05-25 03:54:00 2
175681 2020-05-25 03:57:00 2
175682 2020-05-25 04:00:00 1
175683 rows × 2 columns
Output:
print(df_out)
Time A
42 2019-05-25 06:00:00 4
102 2019-05-25 09:00:00 2
162 2019-05-25 12:00:00 1
222 2019-05-25 15:00:00 3
282 2019-05-25 18:00:00 5
... ... ...
175422 2020-05-24 15:00:00 1
175482 2020-05-24 18:00:00 5
175542 2020-05-24 21:00:00 2
175602 2020-05-25 00:00:00 3
175662 2020-05-25 03:00:00 3
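Alternatively, since the wanted rows are exactly those on a whole 3-hour mark, the same filter can be written with the .dt accessor and no explicit date range; a sketch of that variant:
mask = (df['Time'].dt.hour % 3 == 0) & (df['Time'].dt.minute == 0) & (df['Time'].dt.second == 0)
df_out = df.loc[mask]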

find specific value that meets conditions - python

I'm trying to create a new column with values that meet specific conditions. Below I have set out code which goes some way toward explaining the logic but does not produce the correct output:
import pandas as pd
import numpy as np
df = pd.DataFrame({'date': ['2019-08-06 09:00:00', '2019-08-06 12:00:00', '2019-08-06 18:00:00', '2019-08-06 21:00:00', '2019-08-07 09:00:00', '2019-08-07 16:00:00', '2019-08-08 17:00:00' ,'2019-08-09 16:00:00'],
'type': [0, 1, np.nan, 1, np.nan, np.nan, 0 ,0],
'colour': ['blue', 'red', np.nan, 'blue', np.nan, np.nan, 'blue', 'red'],
'maxPixel': [255, 7346, 32, 5184, 600, 322, 72, 6000],
'minPixel': [86, 96, 14, 3540, 528, 300, 12, 4009],
'colourDate': ['2019-08-06 12:00:00', '2019-08-08 16:00:00', '2019-08-06 23:00:00', '2019-08-06 22:00:00', '2019-08-08 09:00:00', '2019-08-09 16:00:00', '2019-08-08 23:00:00' ,'2019-08-11 16:00:00'] })
max_conditions = [(df['type'] == 1) & (df['colour'] == 'blue'),
(df['type'] == 1) & (df['colour'] == 'red')]
max_choices = [np.where(df['date'] <= df['colourDate'], max(df['maxPixel']), np.nan),
np.where(df['date'] <= df['colourDate'], min(df['minPixel']), np.nan)]
df['pixelLimit'] = np.select(max_conditions, max_choices, default=np.nan)
Incorrect output:
date type colour maxPixel minPixel colourDate pixelLimit
0 2019-08-06 09:00:00 0.0 blue 255 86 2019-08-06 12:00:00 NaN
1 2019-08-06 12:00:00 1.0 red 7346 96 2019-08-08 16:00:00 12.0
2 2019-08-06 18:00:00 NaN NaN 32 14 2019-08-06 23:00:00 NaN
3 2019-08-06 21:00:00 1.0 blue 5184 3540 2019-08-06 22:00:00 6000.0
4 2019-08-07 09:00:00 NaN NaN 600 528 2019-08-08 09:00:00 NaN
5 2019-08-07 16:00:00 NaN NaN 322 300 2019-08-09 16:00:00 NaN
6 2019-08-08 17:00:00 0.0 blue 72 12 2019-08-08 23:00:00 NaN
7 2019-08-09 16:00:00 0.0 red 6000 4009 2019-08-11 16:00:00 NaN
Explanation why output is incorrect:
Value 12.0 in index row 1 for column df['pixelLimit'] is incorrect because this value comes from df['minPixel'] index row 6, which has a df['date'] datetime of 2019-08-08 17:00:00, greater than the 2019-08-08 16:00:00 df['colourDate'] datetime contained in index row 1.
Value 6000.0 in index row 3 for column df['pixelLimit'] is incorrect because this value comes from df['maxPixel'] index row 7, which has a df['date'] datetime of 2019-08-09 16:00:00, greater than the 2019-08-06 22:00:00 df['colourDate'] datetime contained in index row 3.
Correct output:
date type colour maxPixel minPixel colourDate pixelLimit
0 2019-08-06 09:00:00 0.0 blue 255 86 2019-08-06 12:00:00 NaN
1 2019-08-06 12:00:00 1.0 red 7346 96 2019-08-08 16:00:00 14.0
2 2019-08-06 18:00:00 NaN NaN 32 14 2019-08-06 23:00:00 NaN
3 2019-08-06 21:00:00 1.0 blue 5184 3540 2019-08-06 22:00:00 5184.0
4 2019-08-07 09:00:00 NaN NaN 600 528 2019-08-08 09:00:00 NaN
5 2019-08-07 16:00:00 NaN NaN 322 300 2019-08-09 16:00:00 NaN
6 2019-08-08 17:00:00 0.0 blue 72 12 2019-08-08 23:00:00 NaN
7 2019-08-09 16:00:00 0.0 red 6000 4009 2019-08-11 16:00:00 NaN
Explanation why output is correct:
Value 14.0 in index row 1 for column df['pixelLimit'] is correct because we are looking for the minimum value in column df['minPixel'] whose df['date'] datetime is less than the df['colourDate'] datetime in index row 1 and greater than or equal to the df['date'] datetime in index row 1.
Value 5184.0 in index row 3 for column df['pixelLimit'] is correct because we are looking for the maximum value in column df['maxPixel'] whose df['date'] datetime is less than the df['colourDate'] datetime in index row 3 and greater than or equal to the df['date'] datetime in index row 3.
Considerations:
Maybe np.select is not best suited for this task and some sort of function might serve the task better?
Also, maybe I need to create some sort of dynamic len to use as a starting point for each row?
Request
Please can anyone help me amend my code to achieve the correct output?
For matching problems like this, one possibility is to do the complete merge, then subset, using a Boolean Series, to the rows that satisfy your condition (for that row) and find the max or min among all the possible matches. Since this requires slightly different columns and different functions, I split the operations into two very similar pieces of code: one for 1/blue and the other for 1/red.
First, some housekeeping: make the date columns datetime.
import pandas as pd
df['date'] = pd.to_datetime(df['date'])
df['colourDate'] = pd.to_datetime(df['colourDate'])
Calculate the min pixel for 1/red between the times for each row
# Subset of rows we need to do this for
dfmin = df[df.type.eq(1) & df.colour.eq('red')].reset_index()
# To each row merge all rows from the original DataFrame
dfmin = dfmin.merge(df[['date', 'minPixel']], how='cross')
# For pandas versions < 1.2, use instead:
#dfmin = dfmin.assign(t=1).merge(df[['date', 'minPixel']].assign(t=1), on='t')
# Only keep rows between the dates, then among those find the min minPixel
smin = (dfmin[dfmin.date_y.between(dfmin.date_x, dfmin.colourDate)]
.groupby('index')['minPixel_y'].min()
.rename('pixel_limit'))
#index
#1 14
#Name: pixel_limit, dtype: int64
# Max is basically a mirror
dfmax = df[df.type.eq(1) & df.colour.eq('blue')].reset_index()
dfmax = dfmax.merge(df[['date', 'maxPixel']], how='cross')
#dfmax = dfmax.assign(t=1).merge(df[['date', 'maxPixel']].assign(t=1), on='t')
smax = (dfmax[dfmax.date_y.between(dfmax.date_x, dfmax.colourDate)]
.groupby('index')['maxPixel_y'].max()
.rename('pixel_limit'))
Finally, because the above groups over the original index (i.e. 'index'), we can simply assign back to align with the original DataFrame.
df['pixel_limit'] = pd.concat([smin, smax])
date type colour maxPixel minPixel colourDate pixel_limit
0 2019-08-06 09:00:00 0.0 blue 255 86 2019-08-06 12:00:00 NaN
1 2019-08-06 12:00:00 1.0 red 7346 96 2019-08-08 16:00:00 14.0
2 2019-08-06 18:00:00 NaN NaN 32 14 2019-08-06 23:00:00 NaN
3 2019-08-06 21:00:00 1.0 blue 5184 3540 2019-08-06 22:00:00 5184.0
4 2019-08-07 09:00:00 NaN NaN 600 528 2019-08-08 09:00:00 NaN
5 2019-08-07 16:00:00 NaN NaN 322 300 2019-08-09 16:00:00 NaN
6 2019-08-08 17:00:00 0.0 blue 72 12 2019-08-08 23:00:00 NaN
7 2019-08-09 16:00:00 0.0 red 6000 4009 2019-08-11 16:00:00 NaN
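Since the red/min and blue/max blocks differ only in the row mask, the pixel column, and the aggregate, they could be folded into one small helper; a sketch, where limit_for is just an illustrative name (pandas >= 1.2 for how='cross'):
def limit_for(row_mask, col, agg):
    # Cross-merge the candidate rows with every (date, pixel) pair, keep the
    # pairs whose date falls inside the row's [date, colourDate] window,
    # then aggregate per original row
    sub = df[row_mask].reset_index()
    merged = sub.merge(df[['date', col]], how='cross')
    keep = merged['date_y'].between(merged['date_x'], merged['colourDate'])
    return merged[keep].groupby('index')[col + '_y'].agg(agg).rename('pixel_limit')

df['pixel_limit'] = pd.concat([
    limit_for(df.type.eq(1) & df.colour.eq('red'), 'minPixel', 'min'),
    limit_for(df.type.eq(1) & df.colour.eq('blue'), 'maxPixel', 'max'),
])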
If you need to bring along a lot of other information for the row holding the min/max pixel, then instead of a groupby min/max we can sort_values and then groupby + head or tail to get the row with the min or max pixel. For the min this looks like (with a slight renaming of suffixes):
# Subset of rows we need to do this for
dfmin = df[df.type.eq(1) & df.colour.eq('red')].reset_index()
# To each row merge all rows from the original DataFrame
dfmin = dfmin.merge(df[['date', 'minPixel']].reset_index(), how='cross',
suffixes=['', '_match'])
# For older pandas < 1.2
#dfmin = (dfmin.assign(t=1)
# .merge(df[['date', 'minPixel']].reset_index().assign(t=1),
# on='t', suffixes=['', '_match']))
# Only keep rows between the dates, then among those find the min minPixel row.
# A bunch of renaming.
smin = (dfmin[dfmin.date_match.between(dfmin.date, dfmin.colourDate)]
.sort_values('minPixel_match', ascending=True)
.groupby('index').head(1)
.set_index('index')
.filter(like='_match')
.rename(columns={'minPixel_match': 'pixel_limit'}))
The max is then similar, using .tail:
dfmax = df[df.type.eq(1) & df.colour.eq('blue')].reset_index()
dfmax = dfmax.merge(df[['date', 'maxPixel']].reset_index(), how='cross',
suffixes=['', '_match'])
smax = (dfmax[dfmax.date_match.between(dfmax.date, dfmax.colourDate)]
.sort_values('maxPixel_match', ascending=True)
.groupby('index').tail(1)
.set_index('index')
.filter(like='_match')
.rename(columns={'maxPixel_match': 'pixel_limit'}))
And finally we concat along axis=1 now that we need to join multiple columns to the original:
result = pd.concat([df, pd.concat([smin, smax])], axis=1)
date type colour maxPixel minPixel colourDate index_match date_match pixel_limit
0 2019-08-06 09:00:00 0.0 blue 255 86 2019-08-06 12:00:00 NaN NaN NaN
1 2019-08-06 12:00:00 1.0 red 7346 96 2019-08-08 16:00:00 2.0 2019-08-06 18:00:00 14.0
2 2019-08-06 18:00:00 NaN NaN 32 14 2019-08-06 23:00:00 NaN NaN NaN
3 2019-08-06 21:00:00 1.0 blue 5184 3540 2019-08-06 22:00:00 3.0 2019-08-06 21:00:00 5184.0
4 2019-08-07 09:00:00 NaN NaN 600 528 2019-08-08 09:00:00 NaN NaN NaN
5 2019-08-07 16:00:00 NaN NaN 322 300 2019-08-09 16:00:00 NaN NaN NaN
6 2019-08-08 17:00:00 0.0 blue 72 12 2019-08-08 23:00:00 NaN NaN NaN
7 2019-08-09 16:00:00 0.0 red 6000 4009 2019-08-11 16:00:00 NaN NaN NaN

count contiguous NaN values by unique values

I have contiguous periods of NaN values by code. I want to count the NaN values in each contiguous period per code, and I also want the start and end date of each contiguous period.
df:
CODE TMIN
1998-01-01 00:00:00 12 2.5
1999-01-01 00:00:00 12 NaN
2000-01-01 00:00:00 12 NaN
2001-01-01 00:00:00 12 2.2
2002-01-01 00:00:00 12 NaN
1998-01-01 00:00:00 41 NaN
1999-01-01 00:00:00 41 NaN
2000-01-01 00:00:00 41 5.0
2001-01-01 00:00:00 41 9.0
2002-01-01 00:00:00 41 8.0
1998-01-01 00:00:00 52 2.0
1999-01-01 00:00:00 52 NaN
2000-01-01 00:00:00 52 NaN
2001-01-01 00:00:00 52 NaN
2002-01-01 00:00:00 52 1.0
1998-01-01 00:00:00 91 NaN
Expected results:
Start_Date End_Date CODE number of contiguous missing values
1999-01-01 00:00:00 2000-01-01 00:00:00 12 2
2002-01-01 00:00:00 2002-01-01 00:00:00 12 1
1998-01-01 00:00:00 1999-01-01 00:00:00 41 2
1999-01-01 00:00:00 2001-01-01 00:00:00 52 3
1998-01-01 00:00:00 1998-01-01 00:00:00 91 1
How can I solve this? Thanks!
You can try grouping by the cumsum of the non-null mask:
# Each run of NaNs shares the cumsum value of the last non-null row before it
df['group'] = df.TMIN.notna().cumsum()
(df[df.TMIN.isna()]
.groupby(['group','CODE'])
.agg(Start_Date=('group', lambda x: x.index.min()),
End_Date=('group', lambda x: x.index.max()),
cont_missing=('TMIN', 'size')
)
)
Output:
Start_Date End_Date cont_missing
group CODE
1 12 1999-01-01 00:00:00 2000-01-01 00:00:00 2
2 12 2002-01-01 00:00:00 2002-01-01 00:00:00 1
41 1998-01-01 00:00:00 1999-01-01 00:00:00 2
6 52 1999-01-01 00:00:00 2001-01-01 00:00:00 3
7 91 1998-01-01 00:00:00 1998-01-01 00:00:00 1
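To match the expected layout exactly, you can drop the helper group level after aggregating; a sketch building on the code above:
result = (df[df.TMIN.isna()]
          .groupby(['group', 'CODE'])
          .agg(Start_Date=('group', lambda x: x.index.min()),
               End_Date=('group', lambda x: x.index.max()),
               cont_missing=('TMIN', 'size'))
          .reset_index()
          .drop(columns='group'))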

compare dates within a dataframe and assign a value to another variable

I have two dataframes (df and df1) like as shown below
import numpy as np
import pandas as pd
from datetime import timedelta

df = pd.DataFrame({'person_id': [101,101,101,101,202,202,202],
'start_date':['5/7/2013 09:27:00 AM','09/08/2013 11:21:00 AM','06/06/2014 08:00:00 AM', '06/06/2014 05:00:00 AM','12/11/2011 10:00:00 AM','13/10/2012 12:00:00 AM','13/12/2012 11:45:00 AM']})
df.start_date = pd.to_datetime(df.start_date)
df['end_date'] = df.start_date + timedelta(days=5)
df['enc_id'] = ['ABC1','ABC2','ABC3','ABC4','DEF1','DEF2','DEF3']
df1 = pd.DataFrame({'person_id': [101,101,101,101,101,101,101,202,202,202,202,202,202,202,202],'date_1':['07/07/2013 11:20:00 AM','05/07/2013 02:30:00 PM','06/07/2013 02:40:00 PM','08/06/2014 12:00:00 AM','11/06/2014 12:00:00 AM','02/03/2013 12:30:00 PM','13/06/2014 12:00:00 AM','12/11/2011 12:00:00 AM','13/10/2012 07:00:00 AM','13/12/2015 12:00:00 AM','13/12/2012 12:00:00 AM','13/12/2012 06:30:00 PM','13/07/2011 10:00:00 AM','18/12/2012 10:00:00 AM', '19/12/2013 11:00:00 AM']})
df1['date_1'] = pd.to_datetime(df1['date_1'])
df1['within_id'] = ['ABC','ABC','ABC','ABC','ABC','ABC','ABC','DEF','DEF','DEF','DEF','DEF','DEF','DEF',np.nan]
What I would like to do is:
a) Pick each person from df1 who doesn't have NA in the 'within_id' column and check whether their date_1 falls between (df.start_date - 1 day) and (df.end_date + 1 day) for the same person in df and the same within_id or enc_id.
For example, for person_id = 101 and within_id = ABC, date_1 is 7/7/2013; we check whether it falls between 4/7/2013 (df.start_date - 1) and 11/7/2013 (df.end_date + 1).
As the first-row comparison itself gives us a match, we don't have to compare this date_1 with the rest of the records in df for person 101. If there is no match, we keep scanning until we find an interval within which date_1 falls.
b) If a matching interval is found, assign the corresponding enc_id from df to within_id in df1.
c) If not, assign "Out of Range".
I tried the below
t1 = df.groupby('person_id').apply(pd.DataFrame.sort_values, 'start_date')
t2 = df1.groupby('person_id').apply(pd.DataFrame.sort_values, 'date_1')
t3= pd.concat([t1, t2], axis=1)
t3['within_id'] = np.where((t3['date_1'] >= t3['start_date'] && t3['person_id'] == t3['person_id_x'] && t3['date_2'] >= t3['end_date']),enc_id]
I expect my output (see also the 14th row at the bottom below) to be as shown below. As I intend to apply the solution to big data (4-5 million records, with perhaps 5000-6000 unique person_ids), any efficient and elegant solution would be helpful:
14 202 2012-12-13 11:00:00 NA
Let's do:
d = df1.merge(df.assign(within_id=df['enc_id'].str[:3]),
on=['person_id', 'within_id'], how='left', indicator=True)
m = d['date_1'].between(d['start_date'] - pd.Timedelta(days=1),
d['end_date'] + pd.Timedelta(days=1))
d = df1.merge(d[m | d['_merge'].ne('both')], on=['person_id', 'date_1'], how='left')
d['within_id'] = d['enc_id'].fillna('out of range').mask(d['_merge'].eq('left_only'))
d = d[df1.columns]
Details:
Left merge the dataframe df1 with df on person_id and within_id:
print(d)
person_id date_1 within_id start_date end_date enc_id _merge
0 101 2013-07-07 11:20:00 ABC 2013-05-07 09:27:00 2013-05-12 09:27:00 ABC1 both
1 101 2013-07-07 11:20:00 ABC 2013-09-08 11:21:00 2013-09-13 11:21:00 ABC2 both
2 101 2013-07-07 11:20:00 ABC 2014-06-06 08:00:00 2014-06-11 08:00:00 ABC3 both
3 101 2013-07-07 11:20:00 ABC 2014-06-06 05:00:00 2014-06-11 10:00:00 DEF1 both
....
47 202 2012-12-18 10:00:00 DEF 2012-10-13 00:00:00 2012-10-18 00:00:00 DEF2 both
48 202 2012-12-18 10:00:00 DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
49 202 2013-12-19 11:00:00 NaN NaT NaT NaN left_only
Create a boolean mask m representing the condition that date_1 is between df.start_date - 1 day and df.end_date + 1 day:
print(m)
0 False
1 False
2 False
3 False
...
47 False
48 True
49 False
dtype: bool
Again left merge the dataframe df1 with the rows kept by mask m (plus the unmatched left_only rows), this time on columns person_id and date_1:
print(d)
person_id date_1 within_id_x within_id_y start_date end_date enc_id _merge
0 101 2013-07-07 11:20:00 ABC NaN NaT NaT NaN NaN
1 101 2013-05-07 14:30:00 ABC ABC 2013-05-07 09:27:00 2013-05-12 09:27:00 ABC1 both
2 101 2013-06-07 14:40:00 ABC NaN NaT NaT NaN NaN
3 101 2014-08-06 00:00:00 ABC NaN NaT NaT NaN NaN
4 101 2014-11-06 00:00:00 ABC NaN NaT NaT NaN NaN
5 101 2013-02-03 12:30:00 ABC NaN NaT NaT NaN NaN
6 101 2014-06-13 00:00:00 ABC NaN NaT NaT NaN NaN
7 202 2011-12-11 00:00:00 DEF DEF 2011-12-11 10:00:00 2011-12-16 10:00:00 DEF1 both
8 202 2012-10-13 07:00:00 DEF DEF 2012-10-13 00:00:00 2012-10-18 00:00:00 DEF2 both
9 202 2015-12-13 00:00:00 DEF NaN NaT NaT NaN NaN
10 202 2012-12-13 00:00:00 DEF DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
11 202 2012-12-13 18:30:00 DEF DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
12 202 2011-07-13 10:00:00 DEF NaN NaT NaT NaN NaN
13 202 2012-12-18 10:00:00 DEF DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
14 202 2013-12-19 11:00:00 NaN NaN NaT NaT NaN left_only
Populate the within_id column from enc_id; use Series.fillna to fill the NaNs with "out of range", then use mask to blank out the rows that had no within_id to match on at all (the left_only rows). Finally, select the original columns to get the result:
print(d)
person_id date_1 within_id
0 101 2013-07-07 11:20:00 out of range
1 101 2013-05-07 14:30:00 ABC1
2 101 2013-06-07 14:40:00 out of range
3 101 2014-08-06 00:00:00 out of range
4 101 2014-11-06 00:00:00 out of range
5 101 2013-02-03 12:30:00 out of range
6 101 2014-06-13 00:00:00 out of range
7 202 2011-12-11 00:00:00 DEF1
8 202 2012-10-13 07:00:00 DEF2
9 202 2015-12-13 00:00:00 out of range
10 202 2012-12-13 00:00:00 DEF3
11 202 2012-12-13 18:30:00 DEF3
12 202 2011-07-13 10:00:00 out of range
13 202 2012-12-18 10:00:00 DEF3
14 202 2013-12-19 11:00:00 NaN
I used df and df1 as provided above.
The basic approach is to iterate over df1 and extract the matching values of enc_id.
I added a 'rule' column, to show how each value got populated.
Unfortunately, I was not able to reproduce the expected results. Perhaps the general approach will be useful.
df1['rule'] = 0
for t in df1.itertuples():
person = (t.person_id == df.person_id)
b = (t.date_1 >= df.start_date) & (t.date_2 <= df.end_date)
c = (t.date_1 >= df.start_date) & (t.date_2 >= df.end_date)
d = (t.date_1 <= df.start_date) & (t.date_2 <= df.end_date)
e = (t.date_1 <= df.start_date) & (t.date_2 <= df.start_date) # start_date at BOTH ends
if (m := person & b).any():
df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
df1.at[t.Index, 'rule'] += 1
elif (m := person & c).any():
df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
df1.at[t.Index, 'rule'] += 10
elif (m := person & d).any():
df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
df1.at[t.Index, 'rule'] += 100
elif (m := person & e).any():
df1.at[t.Index, 'within_id'] = 'out of range'
df1.at[t.Index, 'rule'] += 1_000
else:
df1.at[t.Index, 'within_id'] = 'impossible!'
df1.at[t.Index, 'rule'] += 10_000
df1['within_id'] = df1['within_id'].astype('Int64')
The results are:
print(df1)
person_id date_1 date_2 within_id rule
0 11 1961-12-30 00:00:00 1962-01-01 00:00:00 11345678901 1
1 11 1962-01-30 00:00:00 1962-02-01 00:00:00 11345678902 1
2 12 1962-02-28 00:00:00 1962-03-02 00:00:00 34567892101 100
3 12 1989-07-29 00:00:00 1989-07-31 00:00:00 34567892101 1
4 12 1989-09-03 00:00:00 1989-09-05 00:00:00 34567892101 10
5 12 1989-10-02 00:00:00 1989-10-04 00:00:00 34567892103 1
6 12 1989-10-01 00:00:00 1989-10-03 00:00:00 34567892103 1
7 13 1999-03-29 00:00:00 1999-03-31 00:00:00 56432718901 1
8 13 1999-04-20 00:00:00 1999-04-22 00:00:00 56432718901 10
9 13 1999-06-02 00:00:00 1999-06-04 00:00:00 56432718904 1
10 13 1999-06-03 00:00:00 1999-06-05 00:00:00 56432718904 1
11 13 1999-07-29 00:00:00 1999-07-31 00:00:00 56432718905 1
12 14 2002-02-03 10:00:00 2002-02-05 10:00:00 24680135791 1
13 14 2002-02-03 10:00:00 2002-02-05 10:00:00 24680135791 1
