Pandas create new classification column from timestamp - python

I am trying to create a new classification column 'Stages_So' and add it to my original dataframe.
Event_Code Timestamp
2053 13/08/2016 11:30
1029 10/09/2016 14:00
2053 02/10/2016 13:15
2053 06/11/2016 16:30
2053 19/11/2016 15:00
2053 03/12/2016 17:30
1029 02/01/2017 15:00
1029 05/02/2017 16:00
2053 11/02/2017 15:00
1029 04/03/2017 15:00
2053 01/04/2017 14:00
1029 21/05/2017 14:00
I have tried the following function.
def label_stage(row):
    if row['Timestamp'] > '2016-08-12' and row['Timestamp'] < '2016-11-07':
        return 0
    if row['Timestamp'] > '2016-11-18' and row['Timestamp'] < '2017-02-06':
        return 1
    if row['Timestamp'] > '2017-02-10' and row['Timestamp'] < '2017-05-22':
        return 2

df['Stages_So'] = df.apply(lambda row: label_stage(row), axis=1)
But it gives an error.
TypeError: ("Cannot compare type 'Timestamp' with type 'str'", 'occurred at index 957').

You need to convert the column to datetimes first with to_datetime and then compare against datetimes:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

def label_stage(row):
    if (row['Timestamp'] > pd.Timestamp('2016-08-12') and
            row['Timestamp'] < pd.Timestamp('2016-11-07')):
        return 0
    if (row['Timestamp'] > pd.Timestamp('2016-11-18') and
            row['Timestamp'] < pd.Timestamp('2017-02-06')):
        return 1
    if (row['Timestamp'] > pd.Timestamp('2017-02-10') and
            row['Timestamp'] < pd.Timestamp('2017-05-22')):
        return 2

df['Stages_So'] = df.apply(lambda row: label_stage(row), axis=1)
print (df)
Event_Code Timestamp Stages_So
0 2053 2016-08-13 11:30:00 0.0
1 1029 2016-10-09 14:00:00 0.0
2 2053 2016-02-10 13:15:00 NaN
3 2053 2016-06-11 16:30:00 NaN
4 2053 2016-11-19 15:00:00 1.0
5 2053 2016-03-12 17:30:00 NaN
6 1029 2017-02-01 15:00:00 1.0
7 1029 2017-05-02 16:00:00 2.0
8 2053 2017-11-02 15:00:00 NaN
9 1029 2017-04-03 15:00:00 2.0
10 2053 2017-01-04 14:00:00 1.0
11 1029 2017-05-21 14:00:00 2.0
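Note: the sample timestamps look day-first (e.g. 13/08/2016), while pd.to_datetime guesses month-first for ambiguous values such as 10/09/2016, which is why some of the parsed dates above are shuffled. A one-line sketch, assuming every input date is day/month/year, keeps the parsing consistent:
df['Timestamp'] = pd.to_datetime(df['Timestamp'], dayfirst=True)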
Another, faster solution with boolean masks and numpy.select:
import numpy as np

df['Timestamp'] = pd.to_datetime(df['Timestamp'])
m1 = (df['Timestamp'] > '2016-08-12') & (df['Timestamp'] < '2016-11-07')
m2 = (df['Timestamp'] > '2016-11-18') & (df['Timestamp'] < '2017-02-06')
m3 = (df['Timestamp'] > '2017-02-10') & (df['Timestamp'] < '2017-05-22')
df['Stages_So'] = np.select([m1, m2, m3], [0, 1, 2], default=np.nan)
print (df)
Event_Code Timestamp Stages_So
0 2053 2016-08-13 11:30:00 0.0
1 1029 2016-10-09 14:00:00 0.0
2 2053 2016-02-10 13:15:00 NaN
3 2053 2016-06-11 16:30:00 NaN
4 2053 2016-11-19 15:00:00 1.0
5 2053 2016-03-12 17:30:00 NaN
6 1029 2017-02-01 15:00:00 1.0
7 1029 2017-05-02 16:00:00 2.0
8 2053 2017-11-02 15:00:00 NaN
9 1029 2017-04-03 15:00:00 2.0
10 2053 2017-01-04 14:00:00 1.0
11 1029 2017-05-21 14:00:00 2.0
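If the stage boundaries are kept as non-overlapping intervals, pd.IntervalIndex offers one more option. This is only a sketch under that assumption, not part of the original answers:
import pandas as pd

#intervals open on both sides, mirroring the strict < and > above
bins = pd.IntervalIndex.from_tuples(
    [(pd.Timestamp('2016-08-12'), pd.Timestamp('2016-11-07')),
     (pd.Timestamp('2016-11-18'), pd.Timestamp('2017-02-06')),
     (pd.Timestamp('2017-02-10'), pd.Timestamp('2017-05-22'))],
    closed='neither')
#get_indexer returns the matching interval position, or -1 for no match
pos = bins.get_indexer(df['Timestamp'])
df['Stages_So'] = pd.Series(pos, index=df.index).where(pos != -1)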

Related

compare dates within a dataframe and assign a value to another variable

I have two dataframes (df and df1) as shown below
import numpy as np
import pandas as pd
from datetime import timedelta

df = pd.DataFrame({'person_id': [101,101,101,101,202,202,202],
'start_date':['5/7/2013 09:27:00 AM','09/08/2013 11:21:00 AM','06/06/2014 08:00:00 AM', '06/06/2014 05:00:00 AM','12/11/2011 10:00:00 AM','13/10/2012 12:00:00 AM','13/12/2012 11:45:00 AM']})
df.start_date = pd.to_datetime(df.start_date)
df['end_date'] = df.start_date + timedelta(days=5)
df['enc_id'] = ['ABC1','ABC2','ABC3','ABC4','DEF1','DEF2','DEF3']
df1 = pd.DataFrame({'person_id': [101,101,101,101,101,101,101,202,202,202,202,202,202,202,202],'date_1':['07/07/2013 11:20:00 AM','05/07/2013 02:30:00 PM','06/07/2013 02:40:00 PM','08/06/2014 12:00:00 AM','11/06/2014 12:00:00 AM','02/03/2013 12:30:00 PM','13/06/2014 12:00:00 AM','12/11/2011 12:00:00 AM','13/10/2012 07:00:00 AM','13/12/2015 12:00:00 AM','13/12/2012 12:00:00 AM','13/12/2012 06:30:00 PM','13/07/2011 10:00:00 AM','18/12/2012 10:00:00 AM', '19/12/2013 11:00:00 AM']})
df1['date_1'] = pd.to_datetime(df1['date_1'])
df1['within_id'] = ['ABC','ABC','ABC','ABC','ABC','ABC','ABC','DEF','DEF','DEF','DEF','DEF','DEF','DEF',np.nan]
What I would like to do is:
a) Pick each person from df1 who doesn't have NA in the 'within_id' column and check whether their date_1 is between (df.start_date - 1) and (df.end_date + 1) of the same person in df and for the same within_id or enc_id.
For example, for subject 101 and within_id = ABC, date_1 is 7/7/2013; you check whether it is between 4/7/2013 (df.start_date - 1) and 11/7/2013 (df.end_date + 1).
As the first-row comparison itself gives us the result, we don't have to compare date_1 with the rest of the records in df for subject 101. If not, we keep scanning until we find the interval within which date_1 falls.
b) If a date interval is found, assign the corresponding enc_id from df to within_id in df1.
c) If not, assign "Out of Range".
I tried the below
t1 = df.groupby('person_id').apply(pd.DataFrame.sort_values, 'start_date')
t2 = df1.groupby('person_id').apply(pd.DataFrame.sort_values, 'date_1')
t3= pd.concat([t1, t2], axis=1)
t3['within_id'] = np.where((t3['date_1'] >= t3['start_date'] && t3['person_id'] == t3['person_id_x'] && t3['date_2'] >= t3['end_date']),enc_id]
I expect my output (also see the 14th row at the bottom of my screenshot) to be as shown below. As I intend to apply the solution to big data (4-5 million records, with perhaps 5000-6000 unique person_ids), any efficient and elegant solution is helpful.
14 202 2012-12-13 11:00:00 NA
Let's do:
d = df1.merge(df.assign(within_id=df['enc_id'].str[:3]),
              on=['person_id', 'within_id'], how='left', indicator=True)
m = d['date_1'].between(d['start_date'] - pd.Timedelta(days=1),
                        d['end_date'] + pd.Timedelta(days=1))
d = df1.merge(d[m | d['_merge'].ne('both')], on=['person_id', 'date_1'], how='left')
d['within_id'] = d['enc_id'].fillna('out of range').mask(d['_merge'].eq('left_only'))
d = d[df1.columns]
Details:
Left merge the dataframe df1 with df on person_id and within_id:
print(d)
person_id date_1 within_id start_date end_date enc_id _merge
0 101 2013-07-07 11:20:00 ABC 2013-05-07 09:27:00 2013-05-12 09:27:00 ABC1 both
1 101 2013-07-07 11:20:00 ABC 2013-09-08 11:21:00 2013-09-13 11:21:00 ABC2 both
2 101 2013-07-07 11:20:00 ABC 2014-06-06 08:00:00 2014-06-11 08:00:00 ABC3 both
3 101 2013-07-07 11:20:00 ABC 2014-06-06 05:00:00 2014-06-11 10:00:00 DEF1 both
....
47 202 2012-12-18 10:00:00 DEF 2012-10-13 00:00:00 2012-10-18 00:00:00 DEF2 both
48 202 2012-12-18 10:00:00 DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
49 202 2013-12-19 11:00:00 NaN NaT NaT NaN left_only
Create a boolean mask m to represent the condition where date_1 is between df.start_date - 1 days and df.end_date + 1 days:
print(m)
0 False
1 False
2 False
3 False
...
47 False
48 True
49 False
dtype: bool
Again left merge the dataframe df1 with the dataframe filtered using mask m on columns person_id and date_1:
print(d)
person_id date_1 within_id_x within_id_y start_date end_date enc_id _merge
0 101 2013-07-07 11:20:00 ABC NaN NaT NaT NaN NaN
1 101 2013-05-07 14:30:00 ABC ABC 2013-05-07 09:27:00 2013-05-12 09:27:00 ABC1 both
2 101 2013-06-07 14:40:00 ABC NaN NaT NaT NaN NaN
3 101 2014-08-06 00:00:00 ABC NaN NaT NaT NaN NaN
4 101 2014-11-06 00:00:00 ABC NaN NaT NaT NaN NaN
5 101 2013-02-03 12:30:00 ABC NaN NaT NaT NaN NaN
6 101 2014-06-13 00:00:00 ABC NaN NaT NaT NaN NaN
7 202 2011-12-11 00:00:00 DEF DEF 2011-12-11 10:00:00 2011-12-16 10:00:00 DEF1 both
8 202 2012-10-13 07:00:00 DEF DEF 2012-10-13 00:00:00 2012-10-18 00:00:00 DEF2 both
9 202 2015-12-13 00:00:00 DEF NaN NaT NaT NaN NaN
10 202 2012-12-13 00:00:00 DEF DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
11 202 2012-12-13 18:30:00 DEF DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
12 202 2011-07-13 10:00:00 DEF NaN NaT NaT NaN NaN
13 202 2012-12-18 10:00:00 DEF DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
14 202 2013-12-19 11:00:00 NaN NaN NaT NaT NaN left_only
Populate the values in the within_id column from enc_id, fill the NaN values with 'out of range' using Series.fillna, mask out the rows that had no match in df at all (they stay NaN), and finally filter the columns to get the result:
print(d)
person_id date_1 within_id
0 101 2013-07-07 11:20:00 out of range
1 101 2013-05-07 14:30:00 ABC1
2 101 2013-06-07 14:40:00 out of range
3 101 2014-08-06 00:00:00 out of range
4 101 2014-11-06 00:00:00 out of range
5 101 2013-02-03 12:30:00 out of range
6 101 2014-06-13 00:00:00 out of range
7 202 2011-12-11 00:00:00 DEF1
8 202 2012-10-13 07:00:00 DEF2
9 202 2015-12-13 00:00:00 out of range
10 202 2012-12-13 00:00:00 DEF3
11 202 2012-12-13 18:30:00 DEF3
12 202 2011-07-13 10:00:00 out of range
13 202 2012-12-18 10:00:00 DEF3
14 202 2013-12-19 11:00:00 NaN
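Since the question mentions 4-5 million records, a merge_asof based variant may scale better than the full merge above. The following is only a sketch, assuming one person's windows never overlap (it also skips the NaN within_id handling):
import pandas as pd

#match each date_1 to the latest window start (start_date - 1 day)
#at or before it, then validate against the window end (end_date + 1 day)
df = df.assign(win_start=df['start_date'] - pd.Timedelta(days=1))
out = pd.merge_asof(df1.sort_values('date_1'),
                    df.sort_values('win_start'),
                    left_on='date_1', right_on='win_start',
                    by='person_id', direction='backward')
inside = out['date_1'] <= out['end_date'] + pd.Timedelta(days=1)
out['within_id'] = out['enc_id'].where(inside, 'out of range')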
I used df and df1 as provided above.
The basic approach is to iterate over df1 and extract the matching values of enc_id.
I added a 'rule' column, to show how each value got populated.
Unfortunately, I was not able to reproduce the expected results. Perhaps the general approach will be useful.
df1['rule'] = 0
for t in df1.itertuples():
    person = (t.person_id == df.person_id)
    b = (t.date_1 >= df.start_date) & (t.date_2 <= df.end_date)
    c = (t.date_1 >= df.start_date) & (t.date_2 >= df.end_date)
    d = (t.date_1 <= df.start_date) & (t.date_2 <= df.end_date)
    e = (t.date_1 <= df.start_date) & (t.date_2 <= df.start_date)  # start_date at BOTH ends
    if (m := person & b).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 1
    elif (m := person & c).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 10
    elif (m := person & d).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 100
    elif (m := person & e).any():
        df1.at[t.Index, 'within_id'] = 'out of range'
        df1.at[t.Index, 'rule'] += 1_000
    else:
        df1.at[t.Index, 'within_id'] = 'impossible!'
        df1.at[t.Index, 'rule'] += 10_000
df1['within_id'] = df1['within_id'].astype('Int64')
The results are:
print(df1)
person_id date_1 date_2 within_id rule
0 11 1961-12-30 00:00:00 1962-01-01 00:00:00 11345678901 1
1 11 1962-01-30 00:00:00 1962-02-01 00:00:00 11345678902 1
2 12 1962-02-28 00:00:00 1962-03-02 00:00:00 34567892101 100
3 12 1989-07-29 00:00:00 1989-07-31 00:00:00 34567892101 1
4 12 1989-09-03 00:00:00 1989-09-05 00:00:00 34567892101 10
5 12 1989-10-02 00:00:00 1989-10-04 00:00:00 34567892103 1
6 12 1989-10-01 00:00:00 1989-10-03 00:00:00 34567892103 1
7 13 1999-03-29 00:00:00 1999-03-31 00:00:00 56432718901 1
8 13 1999-04-20 00:00:00 1999-04-22 00:00:00 56432718901 10
9 13 1999-06-02 00:00:00 1999-06-04 00:00:00 56432718904 1
10 13 1999-06-03 00:00:00 1999-06-05 00:00:00 56432718904 1
11 13 1999-07-29 00:00:00 1999-07-31 00:00:00 56432718905 1
12 14 2002-02-03 10:00:00 2002-02-05 10:00:00 24680135791 1
13 14 2002-02-03 10:00:00 2002-02-05 10:00:00 24680135791 1

How to check if any string is missing in pandas

I have the following dataframe in pandas
Date half_hourly_bucket Value
2018-01-01 00:00:01 - 00:30:00 123
2018-01-01 00:30:01 - 01:00:00 12
2018-01-01 01:00:01 - 01:30:00 122
2018-01-01 02:00:01 - 02:30:00 111
2018-01-01 03:00:01 - 03:30:00 122
2018-01-01 04:00:01 - 04:30:00 111
My desired dataframe would be
Date half_hourly_bucket Value
2018-01-01 00:00:01 - 00:30:00 123
2018-01-01 00:30:01 - 01:00:00 12
2018-01-01 01:00:01 - 01:30:00 122
2018-01-01 01:30:01 - 02:00:00 0
2018-01-01 02:00:01 - 02:30:00 122
2018-01-01 02:30:01 - 03:00:00 0
2018-01-01 03:00:01 - 03:30:00 111
2018-01-01 03:30:01 - 04:00:00 0
2018-01-01 04:00:01 - 04:30:00 111
2018-01-01 04:30:01 - 05:00:00 0
2018-01-01 05:00:01 - 05:30:00 0
2018-01-01 05:30:01 - 06:00:00 0
2018-01-01 06:00:01 - 06:30:00 0
2018-01-01 06:30:01 - 07:00:00 0
2018-01-01 07:00:01 - 07:30:00 0
2018-01-01 07:30:01 - 08:00:00 0
2018-01-01 08:00:01 - 08:30:00 0
2018-01-01 09:00:01 - 09:30:00 0
2018-01-01 10:00:01 - 10:30:00 0
2018-01-01 10:30:01 - 11:00:00 0
2018-01-01 11:00:01 - 11:30:00 0
2018-01-01 11:30:01 - 12:00:00 0
2018-01-01 12:00:01 - 12:30:00 0
2018-01-01 12:30:01 - 13:00:00 0
2018-01-01 13:00:01 - 13:30:00 0
2018-01-01 13:30:01 - 14:00:00 0
2018-01-01 14:00:01 - 14:30:00 0
2018-01-01 14:30:01 - 15:00:00 0
2018-01-01 15:00:01 - 15:30:00 0
2018-01-01 15:30:01 - 16:00:00 0
2018-01-01 16:00:01 - 16:30:00 0
2018-01-01 16:30:01 - 17:00:00 0
2018-01-01 17:00:01 - 17:30:00 0
2018-01-01 18:00:01 - 18:30:00 0
2018-01-01 18:30:01 - 19:00:00 0
2018-01-01 19:00:01 - 19:30:00 0
2018-01-01 19:30:01 - 20:00:00 0
2018-01-01 20:00:01 - 20:30:00 0
2018-01-01 20:30:01 - 21:00:00 0
2018-01-01 21:00:01 - 21:30:00 0
2018-01-01 21:30:01 - 22:00:00 0
2018-01-01 22:00:01 - 22:30:00 0
2018-01-01 22:30:01 - 23:00:00 0
2018-01-01 23:00:01 - 23:30:00 0
2018-01-01 23:30:01 - 00:00:00 0
What I want to check on the Date column is whether any half-hourly bucket (48 buckets in total per day) has missing data; if a bucket is missing, it has to be added in order with a value of 0.
How can I do it in pandas?
Solution: break half_hourly_bucket into 2 new columns, process them, and join back:
#create DatetimeIndex
df = df.set_index('Date')
#split to new columns
df[['one','two']] = df['half_hourly_bucket'].str.split(' - ', expand=True)
#add first column to DatetimeIndex
df.index += pd.to_timedelta(df['one'])
#add missing values to DatetimeIndex
one_sec = pd.Timedelta(1, unit='s')
one_day = pd.Timedelta(1, unit='d')
df = df.reindex(pd.date_range(df.index.min().floor('D') + one_sec,
                              df.index.max().floor('D') + one_day - one_sec,
                              freq='30T'))
#recreate column two
df['two'] = df.index + pd.Timedelta(30*60 - 1, unit='s')
#join together
df['half_hourly_bucket'] = (df.index.strftime('%H:%M:%S') + ' - ' +
                            df['two'].dt.strftime('%H:%M:%S'))
#replace missing values
df['Value'] = df['Value'].fillna(0)
df = df.rename_axis('Date').reset_index()
#filter only necessary columns
df = df[['Date','half_hourly_bucket','Value']]
print (df)
Date half_hourly_bucket Value
0 2018-01-01 00:00:01 00:00:01 - 00:30:00 123.0
1 2018-01-01 00:30:01 00:30:01 - 01:00:00 12.0
2 2018-01-01 01:00:01 01:00:01 - 01:30:00 122.0
3 2018-01-01 01:30:01 01:30:01 - 02:00:00 0.0
4 2018-01-01 02:00:01 02:00:01 - 02:30:00 111.0
5 2018-01-01 02:30:01 02:30:01 - 03:00:00 0.0
6 2018-01-01 03:00:01 03:00:01 - 03:30:00 122.0
7 2018-01-01 03:30:01 03:30:01 - 04:00:00 0.0
8 2018-01-01 04:00:01 04:00:01 - 04:30:00 111.0
9 2018-01-01 04:30:01 04:30:01 - 05:00:00 0.0
10 2018-01-01 05:00:01 05:00:01 - 05:30:00 0.0
11 2018-01-01 05:30:01 05:30:01 - 06:00:00 0.0
12 2018-01-01 06:00:01 06:00:01 - 06:30:00 0.0
13 2018-01-01 06:30:01 06:30:01 - 07:00:00 0.0
14 2018-01-01 07:00:01 07:00:01 - 07:30:00 0.0
15 2018-01-01 07:30:01 07:30:01 - 08:00:00 0.0
16 2018-01-01 08:00:01 08:00:01 - 08:30:00 0.0
17 2018-01-01 08:30:01 08:30:01 - 09:00:00 0.0
18 2018-01-01 09:00:01 09:00:01 - 09:30:00 0.0
19 2018-01-01 09:30:01 09:30:01 - 10:00:00 0.0
20 2018-01-01 10:00:01 10:00:01 - 10:30:00 0.0
21 2018-01-01 10:30:01 10:30:01 - 11:00:00 0.0
22 2018-01-01 11:00:01 11:00:01 - 11:30:00 0.0
23 2018-01-01 11:30:01 11:30:01 - 12:00:00 0.0
24 2018-01-01 12:00:01 12:00:01 - 12:30:00 0.0
25 2018-01-01 12:30:01 12:30:01 - 13:00:00 0.0
26 2018-01-01 13:00:01 13:00:01 - 13:30:00 0.0
27 2018-01-01 13:30:01 13:30:01 - 14:00:00 0.0
28 2018-01-01 14:00:01 14:00:01 - 14:30:00 0.0
29 2018-01-01 14:30:01 14:30:01 - 15:00:00 0.0
30 2018-01-01 15:00:01 15:00:01 - 15:30:00 0.0
31 2018-01-01 15:30:01 15:30:01 - 16:00:00 0.0
32 2018-01-01 16:00:01 16:00:01 - 16:30:00 0.0
33 2018-01-01 16:30:01 16:30:01 - 17:00:00 0.0
34 2018-01-01 17:00:01 17:00:01 - 17:30:00 0.0
35 2018-01-01 17:30:01 17:30:01 - 18:00:00 0.0
36 2018-01-01 18:00:01 18:00:01 - 18:30:00 0.0
37 2018-01-01 18:30:01 18:30:01 - 19:00:00 0.0
38 2018-01-01 19:00:01 19:00:01 - 19:30:00 0.0
39 2018-01-01 19:30:01 19:30:01 - 20:00:00 0.0
40 2018-01-01 20:00:01 20:00:01 - 20:30:00 0.0
41 2018-01-01 20:30:01 20:30:01 - 21:00:00 0.0
42 2018-01-01 21:00:01 21:00:01 - 21:30:00 0.0
43 2018-01-01 21:30:01 21:30:01 - 22:00:00 0.0
44 2018-01-01 22:00:01 22:00:01 - 22:30:00 0.0
45 2018-01-01 22:30:01 22:30:01 - 23:00:00 0.0
46 2018-01-01 23:00:01 23:00:01 - 23:30:00 0.0
47 2018-01-01 23:30:01 23:30:01 - 00:00:00 0.0
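An alternative sketch (the fixed date and column handling are assumptions, not from the answer): generate all 48 bucket labels for the day up front and left-merge, assuming Date has already been parsed with pd.to_datetime:
import pandas as pd

#build the 48 half-hour bucket labels for one day
s = pd.Series(pd.date_range('2018-01-01 00:00:01', periods=48, freq='30min'))
full = pd.DataFrame({'Date': s.dt.normalize(),
                     'half_hourly_bucket': (s.dt.strftime('%H:%M:%S') + ' - ' +
                         (s + pd.Timedelta(seconds=1799)).dt.strftime('%H:%M:%S'))})
#left merge keeps every bucket; buckets missing from df get Value 0
out = full.merge(df, on=['Date', 'half_hourly_bucket'], how='left')
out['Value'] = out['Value'].fillna(0)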

How to convert datetime series to actual duration in hours?

I have a dataframe like this:
index = ['2018-02-17 00:30:00', '2018-02-17 07:00:00',
         '2018-02-17 13:00:00', '2018-02-17 19:00:00',
         '2018-02-18 00:00:00', '2018-02-18 07:00:00',
         '2018-02-18 10:30:00', '2018-02-18 13:00:00']
df = pd.DataFrame({'col': list(range(len(index)))})
df.index = pd.to_datetime(index)
col
2018-02-17 00:30:00 0
2018-02-17 07:00:00 1
2018-02-17 13:00:00 2
2018-02-17 19:00:00 3
2018-02-18 00:00:00 4
2018-02-18 07:00:00 5
2018-02-18 10:30:00 6
2018-02-18 13:00:00 7
and would like to add a column that reflects the actual duration in hours, so my desired outcome looks like this:
col time_range
2018-02-17 00:30:00 0 0.0
2018-02-17 07:00:00 1 6.5
2018-02-17 13:00:00 2 12.5
2018-02-17 19:00:00 3 18.5
2018-02-18 00:00:00 4 23.5
2018-02-18 07:00:00 5 30.5
2018-02-18 10:30:00 6 34.0
2018-02-18 13:00:00 7 36.5
I currently do this as follows:
df['time_range'] = [(ti - df.index[0]).delta / (10 ** 9 * 60 * 60) for ti in df.index]
Is there a smarter (i.e. vectorized/built-in) way of doing this?
Use:
df['new'] = (df.index - df.index[0]).total_seconds() / 3600
Or:
df['new'] = (df.index - df.index[0]) / np.timedelta64(1, 'h')
print (df)
col new
2018-02-17 00:30:00 0 0.0
2018-02-17 07:00:00 1 6.5
2018-02-17 13:00:00 2 12.5
2018-02-17 19:00:00 3 18.5
2018-02-18 00:00:00 4 23.5
2018-02-18 07:00:00 5 30.5
2018-02-18 10:30:00 6 34.0
2018-02-18 13:00:00 7 36.5
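The same idea works when the timestamps live in a column rather than in the index; a small sketch assuming a hypothetical column named 'ts':
#subtracting a scalar Timestamp gives a timedelta64 column,
#so the .dt accessor is needed to call total_seconds element-wise
df['new'] = (df['ts'] - df['ts'].iloc[0]).dt.total_seconds() / 3600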

add timedelta data within a group in pandas dataframe

I am working on a dataframe in pandas with four columns of user_id, time_stamp1, time_stamp2, and interval. Time_stamp1 and time_stamp2 are of type datetime64[ns] and interval is of type timedelta64[ns].
I want to sum up the interval values for each user_id in the dataframe, and I tried to calculate it in several ways:
1) df["duration"] = df.groupby('user_id')['interval'].apply(lambda x: x.sum())
2) df["duration"] = df.groupby('user_id').aggregate(np.sum)
3) df["duration"] = df.groupby('user_id').agg(np.sum)
but none of them works; the value of duration is NaT after running the code.
UPDATE: you can use the transform() method:
In [291]: df['duration'] = df.groupby('user_id')['interval'].transform('sum')
In [292]: df
Out[292]:
a user_id b interval duration
0 2016-01-01 00:00:00 0.01 2015-11-11 00:00:00 51 days 00:00:00 838 days 08:00:00
1 2016-03-10 10:39:00 0.01 2015-12-08 18:39:00 NaT 838 days 08:00:00
2 2016-05-18 21:18:00 0.01 2016-01-05 13:18:00 134 days 08:00:00 838 days 08:00:00
3 2016-07-27 07:57:00 0.01 2016-02-02 07:57:00 176 days 00:00:00 838 days 08:00:00
4 2016-10-04 18:36:00 0.01 2016-03-01 02:36:00 217 days 16:00:00 838 days 08:00:00
5 2016-12-13 05:15:00 0.01 2016-03-28 21:15:00 259 days 08:00:00 838 days 08:00:00
6 2017-02-20 15:54:00 0.02 2016-04-25 15:54:00 301 days 00:00:00 1454 days 00:00:00
7 2017-05-01 02:33:00 0.02 2016-05-23 10:33:00 342 days 16:00:00 1454 days 00:00:00
8 2017-07-09 13:12:00 0.02 2016-06-20 05:12:00 384 days 08:00:00 1454 days 00:00:00
9 2017-09-16 23:51:00 0.02 2016-07-17 23:51:00 426 days 00:00:00 1454 days 00:00:00
OLD answer:
Demo:
In [260]: df
Out[260]:
a b interval user_id
0 2016-01-01 00:00:00 2015-11-11 00:00:00 51 days 00:00:00 1
1 2016-03-10 10:39:00 2015-12-08 18:39:00 NaT 1
2 2016-05-18 21:18:00 2016-01-05 13:18:00 134 days 08:00:00 1
3 2016-07-27 07:57:00 2016-02-02 07:57:00 176 days 00:00:00 1
4 2016-10-04 18:36:00 2016-03-01 02:36:00 217 days 16:00:00 1
5 2016-12-13 05:15:00 2016-03-28 21:15:00 259 days 08:00:00 1
6 2017-02-20 15:54:00 2016-04-25 15:54:00 301 days 00:00:00 2
7 2017-05-01 02:33:00 2016-05-23 10:33:00 342 days 16:00:00 2
8 2017-07-09 13:12:00 2016-06-20 05:12:00 384 days 08:00:00 2
9 2017-09-16 23:51:00 2016-07-17 23:51:00 426 days 00:00:00 2
In [261]: df.dtypes
Out[261]:
a datetime64[ns]
b datetime64[ns]
interval timedelta64[ns]
user_id int64
dtype: object
In [262]: df.groupby('user_id')['interval'].sum()
Out[262]:
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
Name: interval, dtype: timedelta64[ns]
In [263]: df.groupby('user_id')['interval'].apply(lambda x: x.sum())
Out[263]:
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
Name: interval, dtype: timedelta64[ns]
In [264]: df.groupby('user_id').agg(np.sum)
Out[264]:
interval
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
So check your data...
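One likely reason the direct assignments above produce NaT is index alignment, illustrated by this sketch (not part of the original answer): groupby(...).sum() returns a Series indexed by user_id, which does not align with the original row index, while transform keeps the row index.
#sum() is indexed by user_id (1, 2), not by row number, so plain
#assignment aligns on the row index and yields NaT everywhere
summed = df.groupby('user_id')['interval'].sum()
df['duration'] = summed
#aligning explicitly via user_id works, just like transform('sum')
df['duration'] = df['user_id'].map(summed)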

Python - filtering lines from data frame

I have a simple data frame:
ID Stime Etime
1 13:00:00 13:15:00
1 14:00:00 14:15:00
2 15:00:00 15:42:00
3 13:00:00 13:25:00
4 15:00:00 15:15:00
4 15:05:00 15:15:00
What I would like to do is to unite the last 2 lines, because they belong to the same ID (ID=4) and the time of the last line is contained within the time of the penultimate line.
What I want the output to be is:
ID Stime Etime
1 13:00:00 13:15:00
1 14:00:00 14:15:00
2 15:00:00 15:42:00
3 13:00:00 13:25:00
4 15:00:00 15:15:00
Solution
def setup(df):
    td = df.Stime - df.Etime.shift()
    td = td.apply(lambda x: x.total_seconds() > 1)
    td.iloc[0] = True
    return td.cumsum()

def collapse(df):
    df_ = df.iloc[0, :]
    df_.loc['Stime'] = df.Stime.min()
    df_.loc['Etime'] = df.Etime.max()
    return df_

df['group id'] = df.groupby('ID').apply(setup).values
gbcols = ['ID', 'group id']
fcols = ['ID', 'Stime', 'Etime']
print(df.groupby(gbcols)[fcols].apply(collapse).reset_index(drop=True))
ID Stime Etime
0 1 2016-05-30 13:00:00 2016-05-30 13:15:00
1 1 2016-05-30 14:00:00 2016-05-30 14:15:00
2 2 2016-05-30 15:00:00 2016-05-30 15:42:00
3 3 2016-05-30 13:00:00 2016-05-30 13:25:00
4 4 2016-05-30 15:00:00 2016-05-30 15:15:00
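A sketch of the same grouping without apply, as an assumed alternative rather than the original answer: a new group starts whenever Stime exceeds the running maximum Etime seen so far within an ID:
df = df.sort_values(['ID', 'Stime'])
#running maximum end time of the previous rows within each ID
prev_end = df.groupby('ID')['Etime'].transform(lambda s: s.cummax().shift())
#True starts a new group; the first row per ID compares against NaT -> False
df['group id'] = (df['Stime'] > prev_end).groupby(df['ID']).cumsum()
out = (df.groupby(['ID', 'group id'], as_index=False)
         .agg({'Stime': 'min', 'Etime': 'max'}))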
