I have the following dataframe in pandas:
code date time tank
123 01-01-2018 08:00:00 1
123 01-01-2018 11:00:00 1
123 01-01-2018 12:00:00 1
123 01-01-2018 13:00:00 1
123 01-01-2018 07:00:00 1
123 01-01-2018 09:00:00 1
124 01-01-2018 08:00:00 2
124 01-01-2018 11:00:00 2
124 01-01-2018 12:00:00 2
124 01-01-2018 13:00:00 2
124 01-01-2018 07:00:00 2
124 01-01-2018 09:00:00 2
I am grouping it by 'code', 'date', and 'tank', and sorting each group by 'time':
df= df.groupby(['code', 'date', 'tank']).apply(lambda x: x.sort_values(['time'], ascending=True)).reset_index()
When I call reset_index() I get the following error:
ValueError: cannot insert tank, already exists
The error occurs because the group keys are kept both in the result's index and as columns, so reset_index() tries to insert columns that already exist. How about sorting by every grouper key column, with "time" descending?
df.sort_values(['code', 'date', 'tank', 'time'], ascending=[True]*3 + [False])
code date time tank
3 123 01-01-2018 13:00:00 1
2 123 01-01-2018 12:00:00 1
1 123 01-01-2018 11:00:00 1
5 123 01-01-2018 09:00:00 1
0 123 01-01-2018 08:00:00 1
4 123 01-01-2018 07:00:00 1
9 124 01-01-2018 13:00:00 2
8 124 01-01-2018 12:00:00 2
7 124 01-01-2018 11:00:00 2
11 124 01-01-2018 09:00:00 2
6 124 01-01-2018 08:00:00 2
10 124 01-01-2018 07:00:00 2
This will achieve the same effect, but without the groupby.
If groupby is needed, you will need two reset_index calls (to remove the last level):
(df.groupby(['code', 'date', 'tank'])
.time.apply(lambda x: x.sort_values(ascending=False))
.reset_index(level=-1, drop=True)
.reset_index())
code date tank time
0 123 01-01-2018 1 13:00:00
1 123 01-01-2018 1 12:00:00
2 123 01-01-2018 1 11:00:00
3 123 01-01-2018 1 09:00:00
4 123 01-01-2018 1 08:00:00
5 123 01-01-2018 1 07:00:00
6 124 01-01-2018 2 13:00:00
7 124 01-01-2018 2 12:00:00
8 124 01-01-2018 2 11:00:00
9 124 01-01-2018 2 09:00:00
10 124 01-01-2018 2 08:00:00
11 124 01-01-2018 2 07:00:00
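As an aside, if you do want to keep the groupby and all of the columns, a sketch with group_keys=False sidesteps the original ValueError, because the group keys are then never added to the index:

df = (df.groupby(['code', 'date', 'tank'], group_keys=False)
        .apply(lambda g: g.sort_values('time'))
        .reset_index(drop=True))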
Related
I have a data set which contains hourly data:
Date Count
20200101 0:00:00 1352
20200101 1:00:00 1250
20200101 2:00:00 1022
20200101 3:00:00 628
20200101 4:00:00 2984
20200101 6:00:00 1694
20200101 7:00:00 2804
20200101 8:00:00 1050
20200101 9:00:00 540
20200101 13:00:00 4282
How can I fill the missing hours with count 0?
Expected results:
20200101 10:00:00 0
20200101 11:00:00 0
20200101 12:00:00 0
This is my code.
import cx_Oracle
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
df_full = pd.read_sql('SELECT * FROM 1H_FILE_COUNT order by event_date asc', conn1)
# Rename Column
df_full = df_full.rename(columns={'EVENT_DATE': 'ds','FILE_COUNT' : 'y'})
# convert the 'ds' column to datetime
df_full['ds'] = pd.DatetimeIndex(df_full['ds'])
df_full.head(20)
First convert to datetime, set index, resample and fill values:
# df['Date'] = pd.to_datetime(df['Date'])
>>> df.set_index('Date').resample('H').asfreq(fill_value=0).reset_index()
Date Count
0 2020-01-01 00:00:00 1352
1 2020-01-01 01:00:00 1250
2 2020-01-01 02:00:00 1022
3 2020-01-01 03:00:00 628
4 2020-01-01 04:00:00 2984
5 2020-01-01 05:00:00 0
6 2020-01-01 06:00:00 1694
7 2020-01-01 07:00:00 2804
8 2020-01-01 08:00:00 1050
9 2020-01-01 09:00:00 540
10 2020-01-01 10:00:00 0
11 2020-01-01 11:00:00 0
12 2020-01-01 12:00:00 0
13 2020-01-01 13:00:00 4282
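Note that fill_value only fills the rows created by the resampling; a NaN already present in Count would stay NaN. If those should become 0 as well, a sketch using fillna instead:

>>> df.set_index('Date').resample('H').asfreq().fillna({'Count': 0}).reset_index()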
If you want to preserve your date format:
>>> df['Date'].dt.strftime('%Y%m%d %-H:%M:%S')
0 20200101 0:00:00
1 20200101 1:00:00
2 20200101 2:00:00
3 20200101 3:00:00
4 20200101 4:00:00
5 20200101 5:00:00
6 20200101 6:00:00
7 20200101 7:00:00
8 20200101 8:00:00
9 20200101 9:00:00
10 20200101 10:00:00
11 20200101 11:00:00
12 20200101 12:00:00
13 20200101 13:00:00
Name: Date, dtype: object
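Note that %-H (hour without zero padding) is a platform-specific strftime extension: it works on Linux and macOS, while Windows spells it %#H. A portable sketch, with hour_fmt as a hypothetical helper:

import platform

hour_fmt = '%#H' if platform.system() == 'Windows' else '%-H'
df['Date'].dt.strftime(f'%Y%m%d {hour_fmt}:%M:%S')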
I have got a time series of meteorological observations with date and value columns:
df = pd.DataFrame({'date':['11/10/2017 0:00','11/10/2017 03:00','11/10/2017 06:00','11/10/2017 09:00','11/10/2017 12:00',
'11/11/2017 0:00','11/11/2017 03:00','11/11/2017 06:00','11/11/2017 09:00','11/11/2017 12:00',
'11/12/2017 00:00','11/12/2017 03:00','11/12/2017 06:00','11/12/2017 09:00','11/12/2017 12:00'],
'value':[850,np.nan,np.nan,np.nan,np.nan,500,650,780,np.nan,800,350,690,780,np.nan,np.nan],
'consecutive_hour': [ 3,0,0,0,0,3,6,9,0,3,3,6,9,0,0]})
With this DataFrame, I want a third column, consecutive_hour, such that whenever the value at a timestamp is less than 1000 it is assigned 3 hours, and consecutive such occurrences accumulate to 6, 9, and so on, as above.
Lastly, I want to summarize the table, counting the occurrences of each consecutive-hours run as a number of days, so that the summary table looks like:
df_summary = pd.DataFrame({'consecutive_hours':[3,6,9,12],
'number_of_day':[2,0,2,0]})
I tried several online solutions and methods like shift(), diff(), etc., as mentioned in: How to groupby consecutive values in pandas DataFrame
and more. I spent several days on it, but no luck yet.
I would highly appreciate help on this issue.
Thanks!
Input data:
>>> df
date value
0 2017-11-10 00:00:00 850.0
1 2017-11-10 03:00:00 NaN
2 2017-11-10 06:00:00 NaN
3 2017-11-10 09:00:00 NaN
4 2017-11-10 12:00:00 NaN
5 2017-11-11 00:00:00 500.0
6 2017-11-11 03:00:00 650.0
7 2017-11-11 06:00:00 780.0
8 2017-11-11 09:00:00 NaN
9 2017-11-11 12:00:00 800.0
10 2017-11-12 00:00:00 350.0
11 2017-11-12 03:00:00 690.0
12 2017-11-12 06:00:00 780.0
13 2017-11-12 09:00:00 NaN
14 2017-11-12 12:00:00 NaN
The cumcount_reset function is adapted from this answer by @jezrael:
Python pandas cumsum with reset everytime there is a 0
# count consecutive True values, restarting at every False: cumsum() grows on
# each True, and subtracting the last cumsum value seen at a False
# (forward-filled) leaves the position within the current run
cumcount_reset = \
    lambda b: b.cumsum().sub(b.cumsum().where(~b).ffill().fillna(0)).astype(int)
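To see what it does in isolation, a toy run on a hand-made boolean series:

b = pd.Series([True, True, False, True, True, True])
print(cumcount_reset(b).tolist())  # [1, 2, 0, 1, 2, 3]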
df["consecutive_hour"] = (df.set_index("date")["value"] < 1000) \
.groupby(pd.Grouper(freq="D")) \
.apply(lambda b: cumcount_reset(b)).mul(3) \
.reset_index(drop=True)
Output result:
>>> df
date value consecutive_hour
0 2017-11-10 00:00:00 850.0 3
1 2017-11-10 03:00:00 NaN 0
2 2017-11-10 06:00:00 NaN 0
3 2017-11-10 09:00:00 NaN 0
4 2017-11-10 12:00:00 NaN 0
5 2017-11-11 00:00:00 500.0 3
6 2017-11-11 03:00:00 650.0 6
7 2017-11-11 06:00:00 780.0 9
8 2017-11-11 09:00:00 NaN 0
9 2017-11-11 12:00:00 800.0 3
10 2017-11-12 00:00:00 350.0 3
11 2017-11-12 03:00:00 690.0 6
12 2017-11-12 06:00:00 780.0 9
13 2017-11-12 09:00:00 NaN 0
14 2017-11-12 12:00:00 NaN 0
Summary table:

# keep only the last value of each run (where consecutive_hour drops at the
# next step), then count how often each run length occurs
df_summary = (df.loc[df.groupby(pd.Grouper(key="date", freq="D"))["consecutive_hour"]
                       .apply(lambda h: (h - h.shift(-1).fillna(0)) > 0),
                     "consecutive_hour"]
                .value_counts()
                .reindex([3, 6, 9, 12], fill_value=0)
                .rename("number_of_day")
                .rename_axis("consecutive_hour")
                .reset_index())
>>> df_summary
consecutive_hour number_of_day
0 3 2
1 6 0
2 9 2
3 12 0
I have two dataframes (df and df1) as shown below:

import numpy as np
import pandas as pd
from datetime import timedelta

df = pd.DataFrame({'person_id': [101, 101, 101, 101, 202, 202, 202],
                   'start_date': ['5/7/2013 09:27:00 AM', '09/08/2013 11:21:00 AM',
                                  '06/06/2014 08:00:00 AM', '06/06/2014 05:00:00 AM',
                                  '12/11/2011 10:00:00 AM', '13/10/2012 12:00:00 AM',
                                  '13/12/2012 11:45:00 AM']})
df.start_date = pd.to_datetime(df.start_date)
df['end_date'] = df.start_date + timedelta(days=5)
df['enc_id'] = ['ABC1', 'ABC2', 'ABC3', 'ABC4', 'DEF1', 'DEF2', 'DEF3']

df1 = pd.DataFrame({'person_id': [101, 101, 101, 101, 101, 101, 101,
                                  202, 202, 202, 202, 202, 202, 202, 202],
                    'date_1': ['07/07/2013 11:20:00 AM', '05/07/2013 02:30:00 PM',
                               '06/07/2013 02:40:00 PM', '08/06/2014 12:00:00 AM',
                               '11/06/2014 12:00:00 AM', '02/03/2013 12:30:00 PM',
                               '13/06/2014 12:00:00 AM', '12/11/2011 12:00:00 AM',
                               '13/10/2012 07:00:00 AM', '13/12/2015 12:00:00 AM',
                               '13/12/2012 12:00:00 AM', '13/12/2012 06:30:00 PM',
                               '13/07/2011 10:00:00 AM', '18/12/2012 10:00:00 AM',
                               '19/12/2013 11:00:00 AM']})
df1['date_1'] = pd.to_datetime(df1['date_1'])
df1['within_id'] = ['ABC', 'ABC', 'ABC', 'ABC', 'ABC', 'ABC', 'ABC',
                    'DEF', 'DEF', 'DEF', 'DEF', 'DEF', 'DEF', 'DEF', np.nan]
What I would like to do is
a) Pick each person from df1 who doesn't have NA in the 'within_id' column and check whether their date_1 falls between (df.start_date - 1 day) and (df.end_date + 1 day) for the same person in df and the same within_id or enc_id.
ex: for subject = 101 and within_id = ABC, date_1 is 7/7/2013; you check whether it is between 4/7/2013 (df.start_date - 1) and 11/7/2013 (df.end_date + 1).
As the first-row comparison itself gives us the result, we don't have to compare date_1 with the rest of the records in df for subject 101. If not, we keep scanning until we find the interval within which date_1 falls.
b) If a date interval is found, assign the corresponding enc_id from df to the within_id in df1.
c) If not, assign "Out of Range".
I tried the below
t1 = df.groupby('person_id').apply(pd.DataFrame.sort_values, 'start_date')
t2 = df1.groupby('person_id').apply(pd.DataFrame.sort_values, 'date_1')
t3= pd.concat([t1, t2], axis=1)
t3['within_id'] = np.where((t3['date_1'] >= t3['start_date'] && t3['person_id'] == t3['person_id_x'] && t3['date_2'] >= t3['end_date']),enc_id]
I expect my output to be as shown below (row 14 of the result). As I intend to apply the solution to big data (4-5 million records, possibly 5000-6000 unique person_ids), an efficient and elegant solution would be helpful:
14 202 2012-12-13 11:00:00 NA
Let's do:
d = df1.merge(df.assign(within_id=df['enc_id'].str[:3]),
on=['person_id', 'within_id'], how='left', indicator=True)
m = d['date_1'].between(d['start_date'] - pd.Timedelta(days=1),
d['end_date'] + pd.Timedelta(days=1))
d = df1.merge(d[m | d['_merge'].ne('both')], on=['person_id', 'date_1'], how='left')
d['within_id'] = d['enc_id'].fillna('out of range').mask(d['_merge'].eq('left_only'))
d = d[df1.columns]
Details:
Left merge the dataframe df1 with df on person_id and within_id:
print(d)
person_id date_1 within_id start_date end_date enc_id _merge
0 101 2013-07-07 11:20:00 ABC 2013-05-07 09:27:00 2013-05-12 09:27:00 ABC1 both
1 101 2013-07-07 11:20:00 ABC 2013-09-08 11:21:00 2013-09-13 11:21:00 ABC2 both
2 101 2013-07-07 11:20:00 ABC 2014-06-06 08:00:00 2014-06-11 08:00:00 ABC3 both
3 101 2013-07-07 11:20:00 ABC 2014-06-06 05:00:00 2014-06-11 10:00:00 DEF1 both
....
47 202 2012-12-18 10:00:00 DEF 2012-10-13 00:00:00 2012-10-18 00:00:00 DEF2 both
48 202 2012-12-18 10:00:00 DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
49 202 2013-12-19 11:00:00 NaN NaT NaT NaN left_only
Create a boolean mask m for the condition that date_1 is between df.start_date - 1 day and df.end_date + 1 day:
print(m)
0 False
1 False
2 False
3 False
...
47 False
48 True
49 False
dtype: bool
Again left merge the dataframe df1 with the dataframe filtered using mask m on columns person_id and date_1:
print(d)
person_id date_1 within_id_x within_id_y start_date end_date enc_id _merge
0 101 2013-07-07 11:20:00 ABC NaN NaT NaT NaN NaN
1 101 2013-05-07 14:30:00 ABC ABC 2013-05-07 09:27:00 2013-05-12 09:27:00 ABC1 both
2 101 2013-06-07 14:40:00 ABC NaN NaT NaT NaN NaN
3 101 2014-08-06 00:00:00 ABC NaN NaT NaT NaN NaN
4 101 2014-11-06 00:00:00 ABC NaN NaT NaT NaN NaN
5 101 2013-02-03 12:30:00 ABC NaN NaT NaT NaN NaN
6 101 2014-06-13 00:00:00 ABC NaN NaT NaT NaN NaN
7 202 2011-12-11 00:00:00 DEF DEF 2011-12-11 10:00:00 2011-12-16 10:00:00 DEF1 both
8 202 2012-10-13 07:00:00 DEF DEF 2012-10-13 00:00:00 2012-10-18 00:00:00 DEF2 both
9 202 2015-12-13 00:00:00 DEF NaN NaT NaT NaN NaN
10 202 2012-12-13 00:00:00 DEF DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
11 202 2012-12-13 18:30:00 DEF DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
12 202 2011-07-13 10:00:00 DEF NaN NaT NaT NaN NaN
13 202 2012-12-18 10:00:00 DEF DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
14 202 2013-12-19 11:00:00 NaN NaN NaT NaT NaN left_only
Populate the values in the within_id column from enc_id, use Series.fillna to fill the NaN values with "out of range" (excluding, via mask, the ones that never matched in df), and finally select the original columns to get the result:
print(d)
person_id date_1 within_id
0 101 2013-07-07 11:20:00 out of range
1 101 2013-05-07 14:30:00 ABC1
2 101 2013-06-07 14:40:00 out of range
3 101 2014-08-06 00:00:00 out of range
4 101 2014-11-06 00:00:00 out of range
5 101 2013-02-03 12:30:00 out of range
6 101 2014-06-13 00:00:00 out of range
7 202 2011-12-11 00:00:00 DEF1
8 202 2012-10-13 07:00:00 DEF2
9 202 2015-12-13 00:00:00 out of range
10 202 2012-12-13 00:00:00 DEF3
11 202 2012-12-13 18:30:00 DEF3
12 202 2011-07-13 10:00:00 out of range
13 202 2012-12-18 10:00:00 DEF3
14 202 2013-12-19 11:00:00 NaN
I used df and df1 as provided above.
The basic approach is to iterate over df1 and extract the matching values of enc_id.
I added a 'rule' column to show how each value got populated.
Unfortunately, I was not able to reproduce the expected results (the run below used different sample data, with a date_2 column in df1); perhaps the general approach will still be useful.
df1['rule'] = 0
for t in df1.itertuples():
    person = (t.person_id == df.person_id)
    b = (t.date_1 >= df.start_date) & (t.date_2 <= df.end_date)
    c = (t.date_1 >= df.start_date) & (t.date_2 >= df.end_date)
    d = (t.date_1 <= df.start_date) & (t.date_2 <= df.end_date)
    e = (t.date_1 <= df.start_date) & (t.date_2 <= df.start_date)  # start_date at BOTH ends
    if (m := person & b).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 1
    elif (m := person & c).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 10
    elif (m := person & d).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 100
    elif (m := person & e).any():
        df1.at[t.Index, 'within_id'] = 'out of range'
        df1.at[t.Index, 'rule'] += 1_000
    else:
        df1.at[t.Index, 'within_id'] = 'impossible!'
        df1.at[t.Index, 'rule'] += 10_000
df1['within_id'] = df1['within_id'].astype('Int64')
The results are:
print(df1)
person_id date_1 date_2 within_id rule
0 11 1961-12-30 00:00:00 1962-01-01 00:00:00 11345678901 1
1 11 1962-01-30 00:00:00 1962-02-01 00:00:00 11345678902 1
2 12 1962-02-28 00:00:00 1962-03-02 00:00:00 34567892101 100
3 12 1989-07-29 00:00:00 1989-07-31 00:00:00 34567892101 1
4 12 1989-09-03 00:00:00 1989-09-05 00:00:00 34567892101 10
5 12 1989-10-02 00:00:00 1989-10-04 00:00:00 34567892103 1
6 12 1989-10-01 00:00:00 1989-10-03 00:00:00 34567892103 1
7 13 1999-03-29 00:00:00 1999-03-31 00:00:00 56432718901 1
8 13 1999-04-20 00:00:00 1999-04-22 00:00:00 56432718901 10
9 13 1999-06-02 00:00:00 1999-06-04 00:00:00 56432718904 1
10 13 1999-06-03 00:00:00 1999-06-05 00:00:00 56432718904 1
11 13 1999-07-29 00:00:00 1999-07-31 00:00:00 56432718905 1
12 14 2002-02-03 10:00:00 2002-02-05 10:00:00 24680135791 1
13 14 2002-02-03 10:00:00 2002-02-05 10:00:00 24680135791 1
I have the following dataframe in pandas:
code tank date time no_operation_flag
123 1 01-01-2019 00:00:00 1
123 1 01-01-2019 00:30:00 1
123 1 01-01-2019 01:00:00 0
123 1 01-01-2019 01:30:00 1
123 1 01-01-2019 02:00:00 1
123 1 01-01-2019 02:30:00 1
123 1 01-01-2019 03:00:00 1
123 1 01-01-2019 03:30:00 1
123 1 01-01-2019 04:00:00 1
123 1 01-01-2019 05:00:00 1
123 1 01-01-2019 14:00:00 1
123 1 01-01-2019 14:30:00 1
123 1 01-01-2019 15:00:00 1
123 1 01-01-2019 15:30:00 1
123 1 01-01-2019 16:00:00 1
123 1 01-01-2019 16:30:00 1
123 2 02-01-2019 00:00:00 1
123 2 02-01-2019 00:30:00 0
123 2 02-01-2019 01:00:00 0
123 2 02-01-2019 01:30:00 0
123 2 02-01-2019 02:00:00 1
123 2 02-01-2019 02:30:00 1
123 2 02-01-2019 03:00:00 1
123 2 03-01-2019 03:30:00 1
123 2 03-01-2019 04:00:00 1
123 1 03-01-2019 14:00:00 1
123 2 03-01-2019 15:00:00 1
123 2 03-01-2019 00:30:00 1
123 2 04-01-2019 11:00:00 1
123 2 04-01-2019 11:30:00 0
123 2 04-01-2019 12:00:00 1
123 2 04-01-2019 13:30:00 1
123 2 05-01-2019 03:00:00 1
123 2 05-01-2019 03:30:00 1
123 2 05-01-2019 04:00:00 1
What I want to do is flag runs of more than 5 consecutive 1's in no_operation_flag, at the tank level and day level, where the times are consecutive (the data is at the half-hour level). The dataframe is already sorted at the tank, date and time level.
My desired dataframe would be
code tank date time no_operation_flag final_flag
123 1 01-01-2019 00:00:00 1 0
123 1 01-01-2019 00:30:00 1 0
123 1 01-01-2019 01:00:00 0 0
123 1 01-01-2019 01:30:00 1 1
123 1 01-01-2019 02:00:00 1 1
123 1 01-01-2019 02:30:00 1 1
123 1 01-01-2019 03:00:00 1 1
123 1 01-01-2019 03:30:00 1 1
123 1 01-01-2019 04:00:00 1 1
123 1 01-01-2019 05:00:00 1 0
123 1 01-01-2019 14:00:00 1 1
123 1 01-01-2019 14:30:00 1 1
123 1 01-01-2019 15:00:00 1 1
123 1 01-01-2019 15:30:00 1 1
123 1 01-01-2019 16:00:00 1 1
123 1 01-01-2019 16:30:00 1 1
123 2 02-01-2019 00:00:00 1 0
123 2 02-01-2019 00:30:00 0 0
123 2 02-01-2019 01:00:00 0 0
123 2 02-01-2019 01:30:00 0 0
123 2 02-01-2019 02:00:00 1 0
123 2 02-01-2019 02:30:00 1 0
123 2 02-01-2019 03:00:00 1 0
123 2 03-01-2019 03:30:00 1 0
123 2 03-01-2019 04:00:00 1 0
123 1 03-01-2019 14:00:00 1 0
123 2 03-01-2019 15:00:00 1 0
123 2 03-01-2019 00:30:00 1 0
123 2 04-01-2019 11:00:00 1 0
123 2 04-01-2019 11:30:00 0 0
123 2 04-01-2019 12:00:00 1 0
123 2 04-01-2019 13:30:00 1 0
123 2 05-01-2019 03:00:00 1 0
123 2 05-01-2019 03:30:00 1 0
123 2 05-01-2019 04:00:00 1 0
How can I do this in pandas?
You can use a solution like this: build a helper DataFrame with all missing datetimes added (so that only consecutive half-hours count as a run), flag the long runs there, and finally merge back to add the new column:
df['datetimes'] = pd.to_datetime(df['date'].astype(str) + ' ' + df['time'].astype(str))

# helper DataFrame on a regular 30-minute grid, so gaps in the raw times
# become explicit missing rows and cannot join two runs together
df1 = (df.set_index('datetimes')
         .groupby(['code', 'tank', 'date'])['no_operation_flag']
         .resample('30T')
         .first()
         .reset_index())

# run ids: a new id whenever the flag changes within a group
shifted1 = df1.groupby(['code', 'tank', 'date'])['no_operation_flag'].shift()
g1 = df1['no_operation_flag'].ne(shifted1).cumsum()
# flag runs of 1s longer than 5
mask1 = g1.map(g1.value_counts()).gt(5) & df1['no_operation_flag'].eq(1)
df1['final_flag'] = mask1.astype(int)

# merge back so only the original rows keep the new column
df = df.merge(df1[['code', 'tank', 'datetimes', 'final_flag']]).drop('datetimes', axis=1)
print(df)
code tank date time no_operation_flag final_flag
0 123 1 01-01-2019 00:00:00 1 0
1 123 1 01-01-2019 00:30:00 1 0
2 123 1 01-01-2019 01:00:00 0 0
3 123 1 01-01-2019 01:30:00 1 1
4 123 1 01-01-2019 02:00:00 1 1
5 123 1 01-01-2019 02:30:00 1 1
6 123 1 01-01-2019 03:00:00 1 1
7 123 1 01-01-2019 03:30:00 1 1
8 123 1 01-01-2019 04:00:00 1 1
9 123 1 01-01-2019 05:00:00 1 0
10 123 1 01-01-2019 14:00:00 1 1
11 123 1 01-01-2019 14:30:00 1 1
12 123 1 01-01-2019 15:00:00 1 1
13 123 1 01-01-2019 15:30:00 1 1
14 123 1 01-01-2019 16:00:00 1 1
15 123 1 01-01-2019 16:30:00 1 1
16 123 2 02-01-2019 00:00:00 1 0
17 123 2 02-01-2019 00:30:00 0 0
18 123 2 02-01-2019 01:00:00 0 0
19 123 2 02-01-2019 01:30:00 0 0
20 123 2 02-01-2019 02:00:00 1 0
21 123 2 02-01-2019 02:30:00 1 0
22 123 2 02-01-2019 03:00:00 1 0
23 123 2 03-01-2019 03:30:00 1 0
24 123 2 03-01-2019 04:00:00 1 0
25 123 1 03-01-2019 14:00:00 1 0
26 123 2 03-01-2019 15:00:00 1 0
27 123 2 03-01-2019 00:30:00 1 0
28 123 2 04-01-2019 11:00:00 1 0
29 123 2 04-01-2019 11:30:00 0 0
30 123 2 04-01-2019 12:00:00 1 0
31 123 2 04-01-2019 13:30:00 1 0
32 123 2 05-01-2019 03:00:00 1 0
33 123 2 05-01-2019 03:30:00 1 0
34 123 2 05-01-2019 04:00:00 1 0
Use:
df['final_flag'] = (df.groupby([df['no_operation_flag'].ne(1).cumsum(),  # new run after every 0
                                'tank',
                                'date',
                                pd.to_datetime(df['time'].astype(str))
                                  .diff()
                                  .ne(pd.Timedelta(minutes=30))
                                  .cumsum(),                             # new run on any time gap
                                'no_operation_flag'])['no_operation_flag']
                      .transform('size')  # length of each consecutive run
                      .gt(5)              # more than 5 consecutive half-hours
                      .astype('uint8'))
print(df)
Output
code tank date time no_operation_flag final_flag
0 123 1 01-01-2019 00:00:00 1 0
1 123 1 01-01-2019 00:30:00 1 0
2 123 1 01-01-2019 01:00:00 0 0
3 123 1 01-01-2019 01:30:00 1 1
4 123 1 01-01-2019 02:00:00 1 1
5 123 1 01-01-2019 02:30:00 1 1
6 123 1 01-01-2019 03:00:00 1 1
7 123 1 01-01-2019 03:30:00 1 1
8 123 1 01-01-2019 04:00:00 1 1
9 123 1 01-01-2019 05:00:00 1 0
10 123 1 01-01-2019 14:00:00 1 1
11 123 1 01-01-2019 14:30:00 1 1
12 123 1 01-01-2019 15:00:00 1 1
13 123 1 01-01-2019 15:30:00 1 1
14 123 1 01-01-2019 16:00:00 1 1
15 123 1 01-01-2019 16:30:00 1 1
16 123 2 02-01-2019 00:00:00 1 0
17 123 2 02-01-2019 00:30:00 0 0
18 123 2 02-01-2019 01:00:00 0 0
19 123 2 02-01-2019 01:30:00 0 0
20 123 2 02-01-2019 02:00:00 1 0
21 123 2 02-01-2019 02:30:00 1 0
22 123 2 02-01-2019 03:00:00 1 0
23 123 2 03-01-2019 03:30:00 1 0
24 123 2 03-01-2019 04:00:00 1 0
25 123 1 03-01-2019 14:00:00 1 0
26 123 2 03-01-2019 15:00:00 1 0
27 123 2 03-01-2019 00:30:00 1 0
28 123 2 04-01-2019 11:00:00 1 0
29 123 2 04-01-2019 11:30:00 0 0
30 123 2 04-01-2019 12:00:00 1 0
31 123 2 04-01-2019 13:30:00 1 0
32 123 2 05-01-2019 03:00:00 1 0
33 123 2 05-01-2019 03:30:00 1 0
34 123 2 05-01-2019 04:00:00 1 0
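The run-id trick generalizes: s.ne(s.shift()).cumsum() (or diff().ne(...).cumsum() for timestamps) assigns one integer id per unbroken run, after which transform('size') gives every row the length of its run. A toy illustration:

import pandas as pd

s = pd.Series([1, 1, 0, 1, 1, 1])
run_id = s.ne(s.shift()).cumsum()              # 1, 1, 2, 3, 3, 3
run_len = s.groupby(run_id).transform('size')  # 2, 2, 1, 3, 3, 3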
There might be a way to do it in one go, but the two-step approach is simpler: first you select tanks one by one, and then you look for the sequence of five 1s.
This other question already solves searching for the pattern in a column.
If you want to go the other way you might take a look at rolling: you can either sum the 1s or use an "all values are True" condition to find the sequence of n elements, as in the sketch below.
You could also just mask a column, but that would give you just the values in the mask. This solves the other problem, "which tanks were non-operative at a given time".
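A minimal sketch of the rolling idea on a toy flag series (not the full answer: it marks the last row of each window of five consecutive 1s, then spreads the mark back over the window):

import pandas as pd

s = pd.Series([1, 1, 0, 1, 1, 1, 1, 1, 1])   # toy no_operation_flag
at_run_end = s.rolling(5).sum().eq(5)        # True on the 5th consecutive 1
# a forward-looking max spreads the mark over all five rows of the window
flag = (at_run_end[::-1].rolling(5, min_periods=1).max()[::-1].astype(int) * s)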
This is a very primitive and somewhat dirty way, but easy to understand, I think.
1. Loop over the rows and check whether the time 4 rows ahead is 2 hours later.
2. (If 1 is true) Check that all five corresponding values of df['no_operation_flag'] are 1.
3. (If 2 is true) Set the corresponding five values of df['final_flag'] to 1.
import datetime

import pandas as pd

# make a column of zeros
df['final_flag'] = 0
flag_col = df.columns.get_loc('final_flag')

for i in range(len(df) - 4):
    j = i + 4
    ts1 = pd.to_datetime(df['date'].iloc[i] + ' ' + df['time'].iloc[i])
    ts2 = pd.to_datetime(df['date'].iloc[j] + ' ' + df['time'].iloc[j])
    # timedelta is 2 hours?
    if ts2 - ts1 == datetime.timedelta(hours=2):
        # all five no_operation_flag values == 1?
        if (df['no_operation_flag'].iloc[i:j+1] == 1).all():
            # positional write avoids chained-assignment problems
            df.iloc[i:j+1, flag_col] = 1
I am working on a dataframe in pandas with four columns: user_id, time_stamp1, time_stamp2, and interval. time_stamp1 and time_stamp2 are of type datetime64[ns] and interval is of type timedelta64[ns].
I want to sum up the interval values for each user_id in the dataframe, and I tried to calculate it in several ways:
1) df["duration"] = df.groupby('user_id')['interval'].apply(lambda x: x.sum())
2) df["duration"] = df.groupby('user_id').aggregate(np.sum)
3) df["duration"] = df.groupby('user_id').agg(np.sum)
but none of them works; the value of duration is NaT after running the code.
UPDATE: you can use the transform() method:
In [291]: df['duration'] = df.groupby('user_id')['interval'].transform('sum')
In [292]: df
Out[292]:
a user_id b interval duration
0 2016-01-01 00:00:00 0.01 2015-11-11 00:00:00 51 days 00:00:00 838 days 08:00:00
1 2016-03-10 10:39:00 0.01 2015-12-08 18:39:00 NaT 838 days 08:00:00
2 2016-05-18 21:18:00 0.01 2016-01-05 13:18:00 134 days 08:00:00 838 days 08:00:00
3 2016-07-27 07:57:00 0.01 2016-02-02 07:57:00 176 days 00:00:00 838 days 08:00:00
4 2016-10-04 18:36:00 0.01 2016-03-01 02:36:00 217 days 16:00:00 838 days 08:00:00
5 2016-12-13 05:15:00 0.01 2016-03-28 21:15:00 259 days 08:00:00 838 days 08:00:00
6 2017-02-20 15:54:00 0.02 2016-04-25 15:54:00 301 days 00:00:00 1454 days 00:00:00
7 2017-05-01 02:33:00 0.02 2016-05-23 10:33:00 342 days 16:00:00 1454 days 00:00:00
8 2017-07-09 13:12:00 0.02 2016-06-20 05:12:00 384 days 08:00:00 1454 days 00:00:00
9 2017-09-16 23:51:00 0.02 2016-07-17 23:51:00 426 days 00:00:00 1454 days 00:00:00
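The reason the original attempts produce NaT: groupby('user_id')['interval'].sum() returns a Series indexed by user_id, so assigning it to a column of df (which has a 0..9 RangeIndex) aligns on index and fills NaT everywhere, while transform('sum') broadcasts each group's sum back to the original row index. A minimal sketch of the difference:

per_user = df.groupby('user_id')['interval'].sum()  # index: user_id (1 and 2)
df['duration'] = per_user                           # aligns on 0..9 -> all NaT
df['duration'] = df.groupby('user_id')['interval'].transform('sum')  # correct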
OLD answer:
Demo:
In [260]: df
Out[260]:
a b interval user_id
0 2016-01-01 00:00:00 2015-11-11 00:00:00 51 days 00:00:00 1
1 2016-03-10 10:39:00 2015-12-08 18:39:00 NaT 1
2 2016-05-18 21:18:00 2016-01-05 13:18:00 134 days 08:00:00 1
3 2016-07-27 07:57:00 2016-02-02 07:57:00 176 days 00:00:00 1
4 2016-10-04 18:36:00 2016-03-01 02:36:00 217 days 16:00:00 1
5 2016-12-13 05:15:00 2016-03-28 21:15:00 259 days 08:00:00 1
6 2017-02-20 15:54:00 2016-04-25 15:54:00 301 days 00:00:00 2
7 2017-05-01 02:33:00 2016-05-23 10:33:00 342 days 16:00:00 2
8 2017-07-09 13:12:00 2016-06-20 05:12:00 384 days 08:00:00 2
9 2017-09-16 23:51:00 2016-07-17 23:51:00 426 days 00:00:00 2
In [261]: df.dtypes
Out[261]:
a datetime64[ns]
b datetime64[ns]
interval timedelta64[ns]
user_id int64
dtype: object
In [262]: df.groupby('user_id')['interval'].sum()
Out[262]:
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
Name: interval, dtype: timedelta64[ns]
In [263]: df.groupby('user_id')['interval'].apply(lambda x: x.sum())
Out[263]:
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
Name: interval, dtype: timedelta64[ns]
In [264]: df.groupby('user_id').agg(np.sum)
Out[264]:
interval
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
So check your data...