Python - filtering lines from data frame

I have a simple data frame:
ID Stime Etime
1 13:00:00 13:15:00
1 14:00:00 14:15:00
2 15:00:00 15:42:00
3 13:00:00 13:25:00
4 15:00:00 15:15:00
4 15:05:00 15:15:00
What I would like to do is unite the last 2 lines, because they belong to the same ID (ID=4) and the time span of the last line is contained in the time span of the penultimate line.
What I want the output to be is:
ID Stime Etime
1 13:00:00 13:15:00
1 14:00:00 14:15:00
2 15:00:00 15:42:00
3 13:00:00 13:25:00
4 15:00:00 15:15:00
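For reference, a minimal sketch of how the sample frame could be built (pd.to_datetime parses the time-only strings onto the current date, which is why the solution output below shows full timestamps):
import pandas as pd

# Minimal sketch of the sample data (times are parsed onto the current date).
df = pd.DataFrame({
    'ID': [1, 1, 2, 3, 4, 4],
    'Stime': pd.to_datetime(['13:00:00', '14:00:00', '15:00:00',
                             '13:00:00', '15:00:00', '15:05:00']),
    'Etime': pd.to_datetime(['13:15:00', '14:15:00', '15:42:00',
                             '13:25:00', '15:15:00', '15:15:00']),
})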

Solution
def setup(df):
    # Flag rows whose Stime starts more than a second after the previous row's
    # Etime, then number the resulting blocks of contiguous/overlapping rows.
    td = df.Stime - df.Etime.shift()
    td = td.apply(lambda x: x.total_seconds() > 1)
    td.iloc[0] = True
    return td.cumsum()

def collapse(df):
    # Collapse each block to a single row spanning min(Stime) to max(Etime).
    df_ = df.iloc[0, :]
    df_.loc['Stime'] = df.Stime.min()
    df_.loc['Etime'] = df.Etime.max()
    return df_

df['group id'] = df.groupby('ID').apply(setup).values

gbcols = ['ID', 'group id']
fcols = ['ID', 'Stime', 'Etime']
print(df.groupby(gbcols)[fcols].apply(collapse).reset_index(drop=True))
ID Stime Etime
0 1 2016-05-30 13:00:00 2016-05-30 13:15:00
1 1 2016-05-30 14:00:00 2016-05-30 14:15:00
2 2 2016-05-30 15:00:00 2016-05-30 15:42:00
3 3 2016-05-30 13:00:00 2016-05-30 13:25:00
4 4 2016-05-30 15:00:00 2016-05-30 15:15:00
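For intuition, here is a small sketch of what setup does for the two ID=4 rows: the second row's Stime is 10 minutes before the first row's Etime, so no new group is opened and both rows share a group id.
import pandas as pd

# Sketch of the grouping trick on the ID=4 rows: a new group starts only when
# a row's Stime comes more than a second after the previous row's Etime.
g = pd.DataFrame({'Stime': pd.to_datetime(['2016-05-30 15:00:00', '2016-05-30 15:05:00']),
                  'Etime': pd.to_datetime(['2016-05-30 15:15:00', '2016-05-30 15:15:00'])})
gap = g.Stime - g.Etime.shift()                        # NaT, then -10 minutes
new_group = gap.apply(lambda x: x.total_seconds() > 1)
new_group.iloc[0] = True                               # the first row always opens a group
print(new_group.cumsum())                              # both rows end up in group 1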

Related

Count time in 30-minute intervals in pandas

I got the following dataframe with two groups:
start_time          end_time            ID
10/10/2021 13:38    10/10/2021 14:30    A
31/10/2021 14:00    31/10/2021 15:00    A
21/10/2021 14:47    21/10/2021 15:30    B
23/10/2021 14:00    23/10/2021 15:30    B
I will ignore the date and only keep the time for counting.
I would like to first create 30-minute intervals as rows for each group and then count, which should look similar to this:
start_interval  end_interval  count  ID
13:00           13:30         0      A
13:30           14:00         1      A
14:00           14:30         2      A
14:30           15:00         1      A
13:00           13:30         0      B
13:30           14:00         0      B
14:00           14:30         1      B
14:30           15:00         2      B
15:00           15:30         2      B
Use:
#normalize all datetimes for 30 minutes
f = lambda x: pd.to_datetime(x).dt.floor('30Min')
df[["start_time", "end_time"]] = df[["start_time", "end_time"]].apply(f)
#get difference of 30 minutes
df['diff'] = df['end_time'].sub(df['start_time']).dt.total_seconds().div(1800).astype(int)
df['start_time'] = df['start_time'].sub(df['start_time'].dt.floor('d'))
#repeat by 30 minutes
df = df.loc[df.index.repeat(df['diff'])]
df['start_time'] += pd.to_timedelta(df.groupby(level=0).cumcount().mul(30), unit='Min')
print (df)
start_time end_time ID diff
0 0 days 13:30:00 2021-10-10 14:30:00 A 2
0 0 days 14:00:00 2021-10-10 14:30:00 A 2
1 0 days 14:00:00 2021-10-31 15:00:00 A 2
1 0 days 14:30:00 2021-10-31 15:00:00 A 2
2 0 days 14:30:00 2021-10-21 15:30:00 B 2
2 0 days 15:00:00 2021-10-21 15:30:00 B 2
3 0 days 14:00:00 2021-10-23 15:30:00 B 3
3 0 days 14:30:00 2021-10-23 15:30:00 B 3
3 0 days 15:00:00 2021-10-23 15:30:00 B 3
#add starting dates - here 12:00
df1 = pd.DataFrame({'ID':df['ID'].unique(), 'start_time': pd.Timedelta(12, unit='H')})
print (df1)
ID start_time
0 A 0 days 12:00:00
1 B 0 days 12:00:00
df = pd.concat([df, df1])
#count per 30 minutes
df = df.set_index('start_time').groupby('ID').resample('30Min')['end_time'].count().reset_index(name='count')
#add end column
df['end_interval'] = df['start_time'] + pd.Timedelta(30, unit='Min')
df = df.rename(columns={'start_time':'start_interval'})[['start_interval','end_interval','count','ID']]
print (df)
start_interval end_interval count ID
0 0 days 12:00:00 0 days 12:30:00 0 A
1 0 days 12:30:00 0 days 13:00:00 0 A
2 0 days 13:00:00 0 days 13:30:00 0 A
3 0 days 13:30:00 0 days 14:00:00 1 A
4 0 days 14:00:00 0 days 14:30:00 2 A
5 0 days 14:30:00 0 days 15:00:00 1 A
6 0 days 12:00:00 0 days 12:30:00 0 B
7 0 days 12:30:00 0 days 13:00:00 0 B
8 0 days 13:00:00 0 days 13:30:00 0 B
9 0 days 13:30:00 0 days 14:00:00 0 B
10 0 days 14:00:00 0 days 14:30:00 1 B
11 0 days 14:30:00 0 days 15:00:00 2 B
12 0 days 15:00:00 0 days 15:30:00 2 B
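The key step above is df.index.repeat combined with cumcount: each row is duplicated once per 30-minute slot it covers, and every duplicate is pushed forward by another 30 minutes. A small standalone sketch with made-up values:
import pandas as pd

# Made-up illustration of the repeat + cumcount offset trick used above.
t = pd.DataFrame({'start': pd.to_timedelta(['13:30:00', '14:00:00']),
                  'slots': [2, 3]})
t = t.loc[t.index.repeat(t['slots'])]
t['start'] += pd.to_timedelta(t.groupby(level=0).cumcount().mul(30), unit='min')
print(t)
# row 0 becomes 13:30 and 14:00; row 1 becomes 14:00, 14:30 and 15:00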
EDIT:
def f(x):
    ts = x.total_seconds()
    hours, remainder = divmod(ts, 3600)
    minutes, seconds = divmod(remainder, 60)
    return '{:02d}:{:02d}:{:02d}'.format(int(hours), int(minutes), int(seconds))
df[['start_interval','end_interval']] = df[['start_interval','end_interval']].applymap(f)
print (df)
start_interval end_interval count ID
0 12:00:00 12:30:00 0 A
1 12:30:00 13:00:00 0 A
2 13:00:00 13:30:00 0 A
3 13:30:00 14:00:00 1 A
4 14:00:00 14:30:00 2 A
5 14:30:00 15:00:00 1 A
6 12:00:00 12:30:00 0 B
7 12:30:00 13:00:00 0 B
8 13:00:00 13:30:00 0 B
9 13:30:00 14:00:00 0 B
10 14:00:00 14:30:00 1 B
11 14:30:00 15:00:00 2 B
12 15:00:00 15:30:00 2 B
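An alternative sketch for the same formatting step, assuming every interval stays below 24 hours: the tail of each Timedelta's string form (e.g. '0 days 12:30:00') is already HH:MM:SS.
# Alternative sketch (assumes all values are below 24 hours).
cols = ['start_interval', 'end_interval']
df[cols] = df[cols].astype(str).apply(lambda s: s.str[-8:])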
The input dataframe has start and end times. The resulting dataframe is a series of timestamps with a 30-minute interval between them.
Here it is:
# Import libs
import pandas as pd
from datetime import timedelta
# Sample Dataframe
df = pd.DataFrame(
    [
        ["10/10/2021 13:40", "10/10/2021 14:30", "A"],
        ["31/10/2021 14:00", "31/10/2021 15:00", "A"],
        ["21/10/2021 14:40", "21/10/2021 15:30", "B"],
        ["23/10/2021 14:00", "23/10/2021 15:30", "B"],
    ],
    columns=["start_time", "end_time", "ID"],
)
# convert to timedelta
df[["start_time", "end_time"]] = df[["start_time", "end_time"]].apply(
    lambda x: pd.to_datetime(x) - pd.to_datetime(x).dt.normalize()
)
# Extract seconds elapsed
df[["start_secs", "end_secs"]] = df[["start_time", "end_time"]].applymap(
    lambda x: x.seconds
)
# OUTPUT
# start_time end_time ID start_secs end_secs
# 0 0 days 13:40:00 0 days 14:30:00 A 49200 52200
# 1 0 days 14:00:00 0 days 15:00:00 A 50400 54000
# 2 0 days 14:40:00 0 days 15:30:00 B 52800 55800
# 3 0 days 14:00:00 0 days 15:30:00 B 50400 55800
# Get rounded Min and Max time in secs of the dataframe
min_t, max_t = (df["start_secs"].min() // 3600) * 3600, (
    df["end_secs"].max() // 3600
) * 3600 + 3600
# Create Interval dataframe with 30min bins
interval_df = pd.DataFrame(
    map(lambda x: [x, x + 30 * 60], range(min_t, max_t, 30 * 60)),
    columns=["start_interval", "end_interval"],
)
# OUTPUT
# start_interval end_interval
# 0 46800 48600
# 1 48600 50400
# 2 50400 52200
# 3 52200 54000
# 4 54000 55800
# 5 55800 57600
# Check whether each bin interval overlaps an actual timeline, then count the
# overlapping timelines per ID.
interval_df[["A", "B"]] = (
    df.groupby(["ID"])
    .apply(
        lambda x: x.apply(
            lambda y: ~(
                ((interval_df["end_interval"] - y["start_secs"]) <= 0)
                | ((interval_df["start_interval"] - y["end_secs"]) >= 0)
            ),
            axis=1,
        ).sum(axis=0)
    )
    .T
)
# OUTPUT
# start_interval end_interval A B
# 0 46800 48600 0 0
# 1 48600 50400 1 0
# 2 50400 52200 2 1
# 3 52200 54000 1 2
# 4 54000 55800 0 2
# 5 55800 57600 0 0
# Convert seconds to time
interval_df[["start_interval", "end_interval"]] = interval_df[
["start_interval", "end_interval"]
].applymap(lambda x: str(timedelta(seconds=x)))
# Stack counts of A and B into one single column
interval_df.melt(["start_interval", "end_interval"])
# OUTPUT
# start_interval end_interval variable value
# 0 13:00:00 13:30:00 A 0
# 1 13:30:00 14:00:00 A 1
# 2 14:00:00 14:30:00 A 2
# 3 14:30:00 15:00:00 A 1
# 4 15:00:00 15:30:00 A 0
# 5 15:30:00 16:00:00 A 0
# 6 13:00:00 13:30:00 B 0
# 7 13:30:00 14:00:00 B 0
# 8 14:00:00 14:30:00 B 1
# 9 14:30:00 15:00:00 B 2
# 10 15:00:00 15:30:00 B 2
# 11 15:30:00 16:00:00 B 0
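The overlap test in the groupby step relies on a standard rule: two ranges [a, b) and [c, d) overlap exactly when neither one ends before the other starts. A tiny standalone check using numbers from the output above:
# Two ranges [a, b) and [c, d) overlap unless one ends before the other starts.
def overlaps(a, b, c, d):
    return not (b <= c or d <= a)

print(overlaps(46800, 48600, 49200, 52200))  # False: the bin ends before A's first event starts
print(overlaps(48600, 50400, 49200, 52200))  # True: the ranges share 49200-50400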

Filtering out another dataframe based on selected hours

I'm trying to filter my dataframe down to a 3-hourly frequency, i.e. keeping only rows at 0000hr, 0300hr, 0600hr, 0900hr, 1200hr, 1500hr, 1800hr, 2100hr, and so on.
A sample of my dataframe would look like this
Time A
2019-05-25 03:54:00 1
2019-05-25 03:57:00 2
2019-05-25 04:00:00 3
...
2020-05-25 03:54:00 4
2020-05-25 03:57:00 5
2020-05-25 04:00:00 6
Desired output:
Time A
2019-05-25 06:00:00 1
2019-05-25 09:00:00 2
2019-05-25 12:00:00 3
...
2020-05-25 00:00:00 4
2020-05-25 03:00:00 5
2020-05-25 06:00:00 6
2020-05-25 09:00:00 6
2020-05-25 12:00:00 6
2020-05-25 15:00:00 6
2020-05-25 18:00:00 6
2020-05-25 21:00:00 6
2020-05-26 00:00:00 6
...
You can define a date range with a 3-hour interval using pd.date_range() and then filter your dataframe with .loc and isin(), as follows:
date_rng_3H = pd.date_range(start=df['Time'].dt.date.min(), end=df['Time'].dt.date.max() + pd.DateOffset(days=1), freq='3H')
df_out = df.loc[df['Time'].isin(date_rng_3H)]
Input data:
date_rng = pd.date_range(start='2019-05-25 03:54:00', end='2020-05-25 04:00:00', freq='3T')
np.random.seed(123)
df = pd.DataFrame({'Time': date_rng, 'A': np.random.randint(1, 6, len(date_rng))})
Time A
0 2019-05-25 03:54:00 3
1 2019-05-25 03:57:00 5
2 2019-05-25 04:00:00 3
3 2019-05-25 04:03:00 2
4 2019-05-25 04:06:00 4
... ... ...
175678 2020-05-25 03:48:00 2
175679 2020-05-25 03:51:00 1
175680 2020-05-25 03:54:00 2
175681 2020-05-25 03:57:00 2
175682 2020-05-25 04:00:00 1
175683 rows × 2 columns
Output:
print(df_out)
Time A
42 2019-05-25 06:00:00 4
102 2019-05-25 09:00:00 2
162 2019-05-25 12:00:00 1
222 2019-05-25 15:00:00 3
282 2019-05-25 18:00:00 5
... ... ...
175422 2020-05-24 15:00:00 1
175482 2020-05-24 18:00:00 5
175542 2020-05-24 21:00:00 2
175602 2020-05-25 00:00:00 3
175662 2020-05-25 03:00:00 3
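An equivalent filter, sketched without building an explicit date range, keeps only timestamps that land exactly on a 3-hour boundary (assuming Time is already a datetime column):
# Keep only rows whose timestamp falls exactly on a 3-hour boundary.
mask = (df['Time'].dt.hour % 3 == 0) & (df['Time'].dt.minute == 0) & (df['Time'].dt.second == 0)
df_out = df.loc[mask]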

compare dates within a dataframe and assign a value to another variable

I have two dataframes (df and df1) like as shown below
df = pd.DataFrame({'person_id': [101,101,101,101,202,202,202],
'start_date':['5/7/2013 09:27:00 AM','09/08/2013 11:21:00 AM','06/06/2014 08:00:00 AM', '06/06/2014 05:00:00 AM','12/11/2011 10:00:00 AM','13/10/2012 12:00:00 AM','13/12/2012 11:45:00 AM']})
df.start_date = pd.to_datetime(df.start_date)
df['end_date'] = df.start_date + timedelta(days=5)
df['enc_id'] = ['ABC1','ABC2','ABC3','ABC4','DEF1','DEF2','DEF3']
df1 = pd.DataFrame({'person_id': [101,101,101,101,101,101,101,202,202,202,202,202,202,202,202],'date_1':['07/07/2013 11:20:00 AM','05/07/2013 02:30:00 PM','06/07/2013 02:40:00 PM','08/06/2014 12:00:00 AM','11/06/2014 12:00:00 AM','02/03/2013 12:30:00 PM','13/06/2014 12:00:00 AM','12/11/2011 12:00:00 AM','13/10/2012 07:00:00 AM','13/12/2015 12:00:00 AM','13/12/2012 12:00:00 AM','13/12/2012 06:30:00 PM','13/07/2011 10:00:00 AM','18/12/2012 10:00:00 AM', '19/12/2013 11:00:00 AM']})
df1['date_1'] = pd.to_datetime(df1['date_1'])
df1['within_id'] = ['ABC','ABC','ABC','ABC','ABC','ABC','ABC','DEF','DEF','DEF','DEF','DEF','DEF','DEF',np.nan]
What I would like to do is
a) Pick each person from df1 who doesn't have NA in the 'within_id' column and check whether their date_1 is between (df.start_date - 1) and (df.end_date + 1) of the same person in df, for the same within_id or enc_id.
ex: for subject = 101 and within_id = ABC, date_1 is 7/7/2013; you check whether it is between 4/7/2013 (df.start_date - 1) and 11/7/2013 (df.end_date + 1).
As the first-row comparison itself gave us the result, we don't have to compare date_1 with the rest of the records in df for subject 101. If not, we need to keep scanning until we find the interval within which date_1 falls.
b) If date interval found, then assign the corresponding enc_id from df to the within_id in df1
c) If not then assign, "Out of Range"
I tried the below
t1 = df.groupby('person_id').apply(pd.DataFrame.sort_values, 'start_date')
t2 = df1.groupby('person_id').apply(pd.DataFrame.sort_values, 'date_1')
t3= pd.concat([t1, t2], axis=1)
t3['within_id'] = np.where((t3['date_1'] >= t3['start_date'] && t3['person_id'] == t3['person_id_x'] && t3['date_2'] >= t3['end_date']),enc_id]
I expect my output (also see the 14th row at the bottom of my screenshot) to be as shown below. As I intend to apply the solution to big data (4-5 million records, with perhaps 5000-6000 unique person_ids), any efficient and elegant solution is welcome.
14 202 2012-12-13 11:00:00 NA
Let's do:
d = df1.merge(df.assign(within_id=df['enc_id'].str[:3]),
              on=['person_id', 'within_id'], how='left', indicator=True)
m = d['date_1'].between(d['start_date'] - pd.Timedelta(days=1),
                        d['end_date'] + pd.Timedelta(days=1))
d = df1.merge(d[m | d['_merge'].ne('both')], on=['person_id', 'date_1'], how='left')
d['within_id'] = d['enc_id'].fillna('out of range').mask(d['_merge'].eq('left_only'))
d = d[df1.columns]
Details:
Left merge the dataframe df1 with df on person_id and within_id:
print(d)
person_id date_1 within_id start_date end_date enc_id _merge
0 101 2013-07-07 11:20:00 ABC 2013-05-07 09:27:00 2013-05-12 09:27:00 ABC1 both
1 101 2013-07-07 11:20:00 ABC 2013-09-08 11:21:00 2013-09-13 11:21:00 ABC2 both
2 101 2013-07-07 11:20:00 ABC 2014-06-06 08:00:00 2014-06-11 08:00:00 ABC3 both
3 101 2013-07-07 11:20:00 ABC 2014-06-06 05:00:00 2014-06-11 10:00:00 DEF1 both
....
47 202 2012-12-18 10:00:00 DEF 2012-10-13 00:00:00 2012-10-18 00:00:00 DEF2 both
48 202 2012-12-18 10:00:00 DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
49 202 2013-12-19 11:00:00 NaN NaT NaT NaN left_only
Create a boolean mask m to represent the condition where date_1 is between df.start_date - 1 days and df.end_date + 1 days:
print(m)
0 False
1 False
2 False
3 False
...
47 False
48 True
49 False
dtype: bool
Again left merge the dataframe df1 with the dataframe filtered using mask m on columns person_id and date_1:
print(d)
person_id date_1 within_id_x within_id_y start_date end_date enc_id _merge
0 101 2013-07-07 11:20:00 ABC NaN NaT NaT NaN NaN
1 101 2013-05-07 14:30:00 ABC ABC 2013-05-07 09:27:00 2013-05-12 09:27:00 ABC1 both
2 101 2013-06-07 14:40:00 ABC NaN NaT NaT NaN NaN
3 101 2014-08-06 00:00:00 ABC NaN NaT NaT NaN NaN
4 101 2014-11-06 00:00:00 ABC NaN NaT NaT NaN NaN
5 101 2013-02-03 12:30:00 ABC NaN NaT NaT NaN NaN
6 101 2014-06-13 00:00:00 ABC NaN NaT NaT NaN NaN
7 202 2011-12-11 00:00:00 DEF DEF 2011-12-11 10:00:00 2011-12-16 10:00:00 DEF1 both
8 202 2012-10-13 07:00:00 DEF DEF 2012-10-13 00:00:00 2012-10-18 00:00:00 DEF2 both
9 202 2015-12-13 00:00:00 DEF NaN NaT NaT NaN NaN
10 202 2012-12-13 00:00:00 DEF DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
11 202 2012-12-13 18:30:00 DEF DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
12 202 2011-07-13 10:00:00 DEF NaN NaT NaT NaN NaN
13 202 2012-12-18 10:00:00 DEF DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
14 202 2013-12-19 11:00:00 NaN NaN NaT NaT NaN left_only
Populate the within_id column from enc_id: use Series.fillna to fill the NaN values with 'out of range' (excluding the rows that have no match in df at all), and finally filter the columns to get the result:
print(d)
person_id date_1 within_id
0 101 2013-07-07 11:20:00 out of range
1 101 2013-05-07 14:30:00 ABC1
2 101 2013-06-07 14:40:00 out of range
3 101 2014-08-06 00:00:00 out of range
4 101 2014-11-06 00:00:00 out of range
5 101 2013-02-03 12:30:00 out of range
6 101 2014-06-13 00:00:00 out of range
7 202 2011-12-11 00:00:00 DEF1
8 202 2012-10-13 07:00:00 DEF2
9 202 2015-12-13 00:00:00 out of range
10 202 2012-12-13 00:00:00 DEF3
11 202 2012-12-13 18:30:00 DEF3
12 202 2011-07-13 10:00:00 out of range
13 202 2012-12-18 10:00:00 DEF3
14 202 2013-12-19 11:00:00 NaN
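To see how the final fillna/mask line behaves, here is a toy Series with made-up values run through the same combination:
import numpy as np
import pandas as pd

# Made-up toy example: missing enc_id values become 'out of range', except rows
# that never matched anything in df ('left_only'), which stay NaN.
enc_id = pd.Series(['ABC1', np.nan, np.nan])
merge_flag = pd.Series(['both', 'both', 'left_only'])
print(enc_id.fillna('out of range').mask(merge_flag.eq('left_only')).tolist())
# ['ABC1', 'out of range', nan]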
I used df and df1 as provided above.
The basic approach is to iterate over df1 and extract the matching values of enc_id.
I added a 'rule' column, to show how each value got populated.
Unfortunately, I was not able to reproduce the expected results. Perhaps the general approach will be useful.
df1['rule'] = 0
for t in df1.itertuples():
    person = (t.person_id == df.person_id)
    b = (t.date_1 >= df.start_date) & (t.date_2 <= df.end_date)
    c = (t.date_1 >= df.start_date) & (t.date_2 >= df.end_date)
    d = (t.date_1 <= df.start_date) & (t.date_2 <= df.end_date)
    e = (t.date_1 <= df.start_date) & (t.date_2 <= df.start_date)  # start_date at BOTH ends
    if (m := person & b).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 1
    elif (m := person & c).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 10
    elif (m := person & d).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 100
    elif (m := person & e).any():
        df1.at[t.Index, 'within_id'] = 'out of range'
        df1.at[t.Index, 'rule'] += 1_000
    else:
        df1.at[t.Index, 'within_id'] = 'impossible!'
        df1.at[t.Index, 'rule'] += 10_000
df1['within_id'] = df1['within_id'].astype('Int64')
The results are:
print(df1)
person_id date_1 date_2 within_id rule
0 11 1961-12-30 00:00:00 1962-01-01 00:00:00 11345678901 1
1 11 1962-01-30 00:00:00 1962-02-01 00:00:00 11345678902 1
2 12 1962-02-28 00:00:00 1962-03-02 00:00:00 34567892101 100
3 12 1989-07-29 00:00:00 1989-07-31 00:00:00 34567892101 1
4 12 1989-09-03 00:00:00 1989-09-05 00:00:00 34567892101 10
5 12 1989-10-02 00:00:00 1989-10-04 00:00:00 34567892103 1
6 12 1989-10-01 00:00:00 1989-10-03 00:00:00 34567892103 1
7 13 1999-03-29 00:00:00 1999-03-31 00:00:00 56432718901 1
8 13 1999-04-20 00:00:00 1999-04-22 00:00:00 56432718901 10
9 13 1999-06-02 00:00:00 1999-06-04 00:00:00 56432718904 1
10 13 1999-06-03 00:00:00 1999-06-05 00:00:00 56432718904 1
11 13 1999-07-29 00:00:00 1999-07-31 00:00:00 56432718905 1
12 14 2002-02-03 10:00:00 2002-02-05 10:00:00 24680135791 1
13 14 2002-02-03 10:00:00 2002-02-05 10:00:00 24680135791 1
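Note that the assignment expressions (:=) used inside the loop require Python 3.8 or later.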

Python Pandas - Get Attributes Associated With Consecutive datetime

I have a data frame that has a list of datetime by minutes (generally in hour increments), for example 2018-01-14 03:00, 2018-01-14 04:00, etc.
What I want to do is capture the number of consecutive records by the minute increment (some could be 60 others 15, etc.) that I define. Then, I want to associate the first and last reading time in the block.
Take the following data for instance:
id reading_time type
1 1/6/2018 00:00 Interval
1 1/6/2018 01:00 Interval
1 1/6/2018 02:00 Interval
1 1/6/2018 03:00 Interval
1 1/6/2018 06:00 Interval
1 1/6/2018 07:00 Interval
1 1/6/2018 09:00 Interval
1 1/6/2018 10:00 Interval
1 1/6/2018 14:00 Interval
1 1/6/2018 15:00 Interval
I would like the output to look like the following:
id first_reading_time last_reading_time number_of_records type
1 1/6/2018 00:00 1/6/2018 03:00 4 Received
1 1/6/2018 04:00 1/6/2018 05:00 2 Missed
1 1/6/2018 06:00 1/6/2018 07:00 2 Received
1 1/6/2018 08:00 1/6/2018 08:00 1 Missed
1 1/6/2018 09:00 1/6/2018 10:00 2 Received
1 1/6/2018 11:00 1/6/2018 13:00 3 Missed
1 1/6/2018 14:00 1/6/2018 15:00 2 Received
Now, in this example there is only 1 day and I can write the code for one day. Many of the rows extend across multiple days.
Now, what I've been able to do is capture this aggregation up to the point where the first block of consecutive records ends, but not the next set, using this code:
df = pd.DataFrame(data=d)
df.reading_time = pd.to_datetime(df.reading_time)
d = pd.Timedelta(60, 'm')
df = df.sort_values('reading_time', ascending=True)
consecutive = df.reading_time.diff().fillna(0).abs().le(d)
df['consecutive'] = consecutive
idx_loc = df.index.get_loc(consecutive.idxmin())
first_reading_time = df['reading_time'][0]
last_reading_time = df['reading_time'][idx_loc - 1]
df.iloc[:idx_loc]
where d holds the more granular data shown up top. The line that sets the variable 'consecutive' tags each record as True or False based on the minutes of difference between the current row and the previous one. The variable idx_loc captures the number of rows that were consecutive, but it only captures the first set (in this case 1/6/2018 00:00 through 1/6/2018 03:00).
Any help is appreciated.
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'reading_time': ['1/6/2018 00:00', '1/6/2018 01:00', '1/6/2018 02:00', '1/6/2018 03:00', '1/6/2018 06:00', '1/6/2018 07:00', '1/6/2018 09:00', '1/6/2018 10:00', '1/6/2018 14:00', '1/6/2018 15:00'], 'type': ['Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval']} )
df['reading_time'] = pd.to_datetime(df['reading_time'])
df = df.set_index('reading_time')
df = df.asfreq('1H')
df = df.reset_index()
df['group'] = (pd.isnull(df['id']).astype(int).diff() != 0).cumsum()
result = df.groupby('group')['reading_time'].agg(['first','last','count'])
types = pd.Categorical(['Missed', 'Received'])
result['type'] = types[result.index % 2]
yields
first last count type
group
1 2018-01-06 00:00:00 2018-01-06 03:00:00 4 Received
2 2018-01-06 04:00:00 2018-01-06 05:00:00 2 Missed
3 2018-01-06 06:00:00 2018-01-06 07:00:00 2 Received
4 2018-01-06 08:00:00 2018-01-06 08:00:00 1 Missed
5 2018-01-06 09:00:00 2018-01-06 10:00:00 2 Received
6 2018-01-06 11:00:00 2018-01-06 13:00:00 3 Missed
7 2018-01-06 14:00:00 2018-01-06 15:00:00 2 Received
You could use asfreq to expand the DataFrame to include missing rows:
df = df.set_index('reading_time')
df = df.asfreq('1H')
df = df.reset_index()
# reading_time id type
# 0 2018-01-06 00:00:00 1.0 Interval
# 1 2018-01-06 01:00:00 1.0 Interval
# 2 2018-01-06 02:00:00 1.0 Interval
# 3 2018-01-06 03:00:00 1.0 Interval
# 4 2018-01-06 04:00:00 NaN NaN
# 5 2018-01-06 05:00:00 NaN NaN
# 6 2018-01-06 06:00:00 1.0 Interval
# 7 2018-01-06 07:00:00 1.0 Interval
# 8 2018-01-06 08:00:00 NaN NaN
# 9 2018-01-06 09:00:00 1.0 Interval
# 10 2018-01-06 10:00:00 1.0 Interval
# 11 2018-01-06 11:00:00 NaN NaN
# 12 2018-01-06 12:00:00 NaN NaN
# 13 2018-01-06 13:00:00 NaN NaN
# 14 2018-01-06 14:00:00 1.0 Interval
# 15 2018-01-06 15:00:00 1.0 Interval
Next, use the NaNs in, say, the id column to identify groups:
df['group'] = (pd.isnull(df['id']).astype(int).diff() != 0).cumsum()
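For intuition, a minimal sketch of this null-run numbering trick on a toy Series with made-up values: consecutive runs of present or missing values share a group number, which increments whenever the null-ness flips.
import numpy as np
import pandas as pd

s = pd.Series([1.0, 1.0, np.nan, np.nan, 1.0, np.nan])
print((pd.isnull(s).astype(int).diff() != 0).cumsum().tolist())  # [1, 1, 2, 2, 3, 4]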
then group by the group values to find first and last reading_times for each group:
result = df.groupby('group')['reading_time'].agg(['first','last','count'])
# first last count
# group
# 1 2018-01-06 00:00:00 2018-01-06 03:00:00 4
# 2 2018-01-06 04:00:00 2018-01-06 05:00:00 2
# 3 2018-01-06 06:00:00 2018-01-06 07:00:00 2
# 4 2018-01-06 08:00:00 2018-01-06 08:00:00 1
# 5 2018-01-06 09:00:00 2018-01-06 10:00:00 2
# 6 2018-01-06 11:00:00 2018-01-06 13:00:00 3
# 7 2018-01-06 14:00:00 2018-01-06 15:00:00 2
Since the Missed and Received values alternate, they can be generated from the index:
types = pd.Categorical(['Missed', 'Received'])
result['type'] = types[result.index % 2]
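This mapping works because group numbering starts at 1 and the first group always begins with observed rows, so odd group numbers are 'Received' and even ones are 'Missed'. A quick sketch with made-up group numbers:
import numpy as np
import pandas as pd

types = pd.Categorical(['Missed', 'Received'])
groups = np.array([1, 2, 3, 4, 5])
print(list(types[groups % 2]))  # ['Received', 'Missed', 'Received', 'Missed', 'Received']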
To handle multiple frequencies on a per-id basis, you could use:
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2], 'reading_time': ['1/6/2018 00:00', '1/6/2018 01:00', '1/6/2018 02:00', '1/6/2018 03:00', '1/6/2018 06:00', '1/6/2018 07:00', '1/6/2018 09:00', '1/6/2018 10:00', '1/6/2018 14:00', '1/6/2018 15:00'], 'type': ['Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval']} )
df['reading_time'] = pd.to_datetime(df['reading_time'])
df = df.sort_values(by='reading_time')
df = df.set_index('reading_time')
freqmap = {1:'1H', 2:'15T'}
df = df.groupby('id', group_keys=False).apply(
    lambda grp: grp.asfreq(freqmap[grp['id'][0]]))
df = df.reset_index(level='reading_time')
df['group'] = (pd.isnull(df['id']).astype(int).diff() != 0).cumsum()
grouped = df.groupby('group')
result = grouped['reading_time'].agg(['first','last','count'])
result['id'] = grouped['id'].agg('first')
types = pd.Categorical(['Missed', 'Received'])
result['type'] = types[result.index % 2]
which yields
first last count id type
group
1 2018-01-06 00:00:00 2018-01-06 03:00:00 4 1.0 Received
2 2018-01-06 04:00:00 2018-01-06 05:00:00 2 NaN Missed
3 2018-01-06 06:00:00 2018-01-06 07:00:00 2 1.0 Received
4 2018-01-06 07:15:00 2018-01-06 08:45:00 7 NaN Missed
5 2018-01-06 09:00:00 2018-01-06 09:00:00 1 2.0 Received
6 2018-01-06 09:15:00 2018-01-06 09:45:00 3 NaN Missed
7 2018-01-06 10:00:00 2018-01-06 10:00:00 1 2.0 Received
8 2018-01-06 10:15:00 2018-01-06 13:45:00 15 NaN Missed
9 2018-01-06 14:00:00 2018-01-06 14:00:00 1 2.0 Received
10 2018-01-06 14:15:00 2018-01-06 14:45:00 3 NaN Missed
11 2018-01-06 15:00:00 2018-01-06 15:00:00 1 2.0 Received
It seems plausible that "Missed" rows should not be associated with any id, but to bring the result a little closer to the one you posted, you could ffill to forward-fill NaN id values:
result['id'] = result['id'].ffill()
changes the result to
first last count id type
group
1 2018-01-06 00:00:00 2018-01-06 03:00:00 4 1 Received
2 2018-01-06 04:00:00 2018-01-06 05:00:00 2 1 Missed
3 2018-01-06 06:00:00 2018-01-06 07:00:00 2 1 Received
4 2018-01-06 07:15:00 2018-01-06 08:45:00 7 1 Missed
5 2018-01-06 09:00:00 2018-01-06 09:00:00 1 2 Received
6 2018-01-06 09:15:00 2018-01-06 09:45:00 3 2 Missed
7 2018-01-06 10:00:00 2018-01-06 10:00:00 1 2 Received
8 2018-01-06 10:15:00 2018-01-06 13:45:00 15 2 Missed
9 2018-01-06 14:00:00 2018-01-06 14:00:00 1 2 Received
10 2018-01-06 14:15:00 2018-01-06 14:45:00 3 2 Missed
11 2018-01-06 15:00:00 2018-01-06 15:00:00 1 2 Received

Conditional selection before certain time of day - Pandas dataframe

I have a dataframe of timestamped rows and want to create a new dataframe that is a conditional selection keeping only the rows timestamped with a time before 15:00:00.
I'm still somewhat new to Pandas / python and have been stuck on this for a while :(
You can use DataFrame.between_time:
start = pd.to_datetime('2015-02-24 11:00')
rng = pd.date_range(start, periods=10, freq='14h')
df = pd.DataFrame({'Date': rng, 'a': range(10)})
print (df)
Date a
0 2015-02-24 11:00:00 0
1 2015-02-25 01:00:00 1
2 2015-02-25 15:00:00 2
3 2015-02-26 05:00:00 3
4 2015-02-26 19:00:00 4
5 2015-02-27 09:00:00 5
6 2015-02-27 23:00:00 6
7 2015-02-28 13:00:00 7
8 2015-03-01 03:00:00 8
9 2015-03-01 17:00:00 9
df = df.set_index('Date').between_time('00:00:00', '15:00:00')
print (df)
a
Date
2015-02-24 11:00:00 0
2015-02-25 01:00:00 1
2015-02-25 15:00:00 2
2015-02-26 05:00:00 3
2015-02-27 09:00:00 5
2015-02-28 13:00:00 7
2015-03-01 03:00:00 8
If you need to exclude 15:00:00, add the parameter include_end=False:
df = df.set_index('Date').between_time('00:00:00', '15:00:00', include_end=False)
print (df)
a
Date
2015-02-24 11:00:00 0
2015-02-25 01:00:00 1
2015-02-26 05:00:00 3
2015-02-27 09:00:00 5
2015-02-28 13:00:00 7
2015-03-01 03:00:00 8
You can check the hours of the date column and use it for subsetting:
df['date'] = pd.to_datetime(df['date']) # optional if the date column is of datetime type
df[df.date.dt.hour < 15]
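A minimal self-contained sketch of this approach on made-up data; note that a row stamped exactly 15:00:00 is dropped while 14:59:00 is kept:
import pandas as pd

# Made-up sample: only rows with a time of day strictly before 15:00 survive.
df = pd.DataFrame({'date': pd.to_datetime(['2015-02-24 11:00:00',
                                           '2015-02-25 15:00:00',
                                           '2015-02-26 14:59:00']),
                   'a': [0, 1, 2]})
print(df[df.date.dt.hour < 15])   # keeps the 11:00 and 14:59 rows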
