add timedelta data within a group in pandas dataframe - python

I am working with a pandas DataFrame with four columns: user_id, time_stamp1, time_stamp2, and interval. time_stamp1 and time_stamp2 are of type datetime64[ns] and interval is of type timedelta64[ns].
I want to sum the interval values for each user_id in the DataFrame, and I have tried to calculate it in several ways:
1) df["duration"] = df.groupby('user_id')['interval'].apply(lambda x: x.sum())
2) df["duration"] = df.groupby('user_id').aggregate(np.sum)
3) df["duration"] = df.groupby('user_id').agg(np.sum)
but none of them works: duration ends up NaT after running each of them.

UPDATE: you can use the transform() method:
In [291]: df['duration'] = df.groupby('user_id')['interval'].transform('sum')
In [292]: df
Out[292]:
a user_id b interval duration
0 2016-01-01 00:00:00 0.01 2015-11-11 00:00:00 51 days 00:00:00 838 days 08:00:00
1 2016-03-10 10:39:00 0.01 2015-12-08 18:39:00 NaT 838 days 08:00:00
2 2016-05-18 21:18:00 0.01 2016-01-05 13:18:00 134 days 08:00:00 838 days 08:00:00
3 2016-07-27 07:57:00 0.01 2016-02-02 07:57:00 176 days 00:00:00 838 days 08:00:00
4 2016-10-04 18:36:00 0.01 2016-03-01 02:36:00 217 days 16:00:00 838 days 08:00:00
5 2016-12-13 05:15:00 0.01 2016-03-28 21:15:00 259 days 08:00:00 838 days 08:00:00
6 2017-02-20 15:54:00 0.02 2016-04-25 15:54:00 301 days 00:00:00 1454 days 00:00:00
7 2017-05-01 02:33:00 0.02 2016-05-23 10:33:00 342 days 16:00:00 1454 days 00:00:00
8 2017-07-09 13:12:00 0.02 2016-06-20 05:12:00 384 days 08:00:00 1454 days 00:00:00
9 2017-09-16 23:51:00 0.02 2016-07-17 23:51:00 426 days 00:00:00 1454 days 00:00:00
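The reason the original attempts produced NaT: a plain groupby sum returns a Series indexed by user_id, not by the original row labels, so the assignment aligns on the wrong index. A minimal sketch of the misalignment (hypothetical user_ids that don't overlap the row labels):
import pandas as pd
df = pd.DataFrame({'user_id': [100, 100, 200],
                   'interval': pd.to_timedelta(['1 days', '2 days', '3 days'])})
# sum() is indexed by user_id (100, 200); assigning it to a column
# aligns on the row labels (0, 1, 2), finds no match, and fills NaT:
df['duration'] = df.groupby('user_id')['interval'].sum()            # all NaT
# transform('sum') returns a result aligned to the original index:
df['duration'] = df.groupby('user_id')['interval'].transform('sum') # correct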
OLD answer:
Demo:
In [260]: df
Out[260]:
a b interval user_id
0 2016-01-01 00:00:00 2015-11-11 00:00:00 51 days 00:00:00 1
1 2016-03-10 10:39:00 2015-12-08 18:39:00 NaT 1
2 2016-05-18 21:18:00 2016-01-05 13:18:00 134 days 08:00:00 1
3 2016-07-27 07:57:00 2016-02-02 07:57:00 176 days 00:00:00 1
4 2016-10-04 18:36:00 2016-03-01 02:36:00 217 days 16:00:00 1
5 2016-12-13 05:15:00 2016-03-28 21:15:00 259 days 08:00:00 1
6 2017-02-20 15:54:00 2016-04-25 15:54:00 301 days 00:00:00 2
7 2017-05-01 02:33:00 2016-05-23 10:33:00 342 days 16:00:00 2
8 2017-07-09 13:12:00 2016-06-20 05:12:00 384 days 08:00:00 2
9 2017-09-16 23:51:00 2016-07-17 23:51:00 426 days 00:00:00 2
In [261]: df.dtypes
Out[261]:
a datetime64[ns]
b datetime64[ns]
interval timedelta64[ns]
user_id int64
dtype: object
In [262]: df.groupby('user_id')['interval'].sum()
Out[262]:
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
Name: interval, dtype: timedelta64[ns]
In [263]: df.groupby('user_id')['interval'].apply(lambda x: x.sum())
Out[263]:
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
Name: interval, dtype: timedelta64[ns]
In [264]: df.groupby('user_id').agg(np.sum)
Out[264]:
interval
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
So check your data...
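If the sum is still NaT, every value in the group is probably NaT, often because the column never parsed into a real timedelta. A quick audit (a sketch):
df['interval'].dtype          # should be timedelta64[ns], not object
df['interval'].isna().sum()   # total count of NaT values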

Related

Counting each day in a dataframe (Not resetting on new year)

I have two years' worth of data in a DataFrame called df, with an additional column called dayNo which labels what day it is in the year. See below:
Code which handles dayNo:
df['dayNo'] = pd.to_datetime(df['TradeDate'], dayfirst=True).dt.day_of_year
I would like to amend dayNo so that when 2023 begins, dayNo doesn't reset to 1, but continues with 366, 367 and so on. Expected output below:
Maybe a completely different approach will have to be taken to what I've done above. Any help greatly appreciated, thanks!
You could define a start day to count days from, and use the number of days from that point forward as your column. An example using self-generated data to illustrate the point:
df = pd.DataFrame({"dates": pd.date_range("2022-12-29", "2023-01-03", freq="8H")})
start = pd.Timestamp("2021-12-31")
df["dayNo"] = df["dates"].sub(start).dt.days
dates dayNo
0 2022-12-29 00:00:00 363
1 2022-12-29 08:00:00 363
2 2022-12-29 16:00:00 363
3 2022-12-30 00:00:00 364
4 2022-12-30 08:00:00 364
5 2022-12-30 16:00:00 364
6 2022-12-31 00:00:00 365
7 2022-12-31 08:00:00 365
8 2022-12-31 16:00:00 365
9 2023-01-01 00:00:00 366
10 2023-01-01 08:00:00 366
11 2023-01-01 16:00:00 366
12 2023-01-02 00:00:00 367
13 2023-01-02 08:00:00 367
14 2023-01-02 16:00:00 367
15 2023-01-03 00:00:00 368
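If you'd rather not hard-code the anchor, it can be derived from the data (a sketch; it assumes day 1 should be January 1 of the earliest year in the data):
start = pd.Timestamp(year=df["dates"].dt.year.min() - 1, month=12, day=31)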
You are nearly there with your solution; just apply() a final adjustment:
df['dayNo'] = df['dayNo'].apply(lambda x: x if x >= df.loc[0].dayNo else x + df.loc[0].dayNo)
df
Out[108]:
dates TradeDate dayNo
0 2022-12-31 00:00:00 2022-12-31 365
1 2022-12-31 01:00:00 2022-12-31 365
2 2022-12-31 02:00:00 2022-12-31 365
3 2022-12-31 03:00:00 2022-12-31 365
4 2022-12-31 04:00:00 2022-12-31 365
.. ... ... ...
68 2023-01-02 20:00:00 2023-01-02 367
69 2023-01-02 21:00:00 2023-01-02 367
70 2023-01-02 22:00:00 2023-01-02 367
71 2023-01-02 23:00:00 2023-01-02 367
72 2023-01-03 00:00:00 2023-01-03 368
Let's suppose we have a pandas DataFrame built by the following script (inspired by Chrysophylaxs' DataFrame):
import pandas as pd
df = pd.DataFrame({'TradeDate': pd.date_range("2022-12-29", "2030-01-03", freq="8H")})
The DataFrame then has dates from 2022 to 2030:
TradeDate
0 2022-12-29 00:00:00
1 2022-12-29 08:00:00
2 2022-12-29 16:00:00
3 2022-12-30 00:00:00
4 2022-12-30 08:00:00
... ...
7682 2030-01-01 16:00:00
7683 2030-01-02 00:00:00
7684 2030-01-02 08:00:00
7685 2030-01-02 16:00:00
7686 2030-01-03 00:00:00
[7687 rows x 1 columns]
I propose the following commented code to reach our target:
import pandas as pd

df = pd.DataFrame({'TradeDate': pd.date_range("2022-12-29", "2030-01-03", freq="8H")})

# Initialize the days counter
dyc = df['TradeDate'].iloc[0].dayofyear
# Initialize the previous day of year
prv_dof = dyc

def func(row):
    global dyc, prv_dof
    # Get the day of the year
    dof = row.iloc[0].dayofyear
    # If a new day has started, increment the days counter
    if dof != prv_dof:
        dyc += 1
        prv_dof = dof
    return dyc

df['dayNo'] = df.apply(func, axis=1)
Resulting DataFrame:
TradeDate dayNo
0 2022-12-29 00:00:00 363
1 2022-12-29 08:00:00 363
2 2022-12-29 16:00:00 363
3 2022-12-30 00:00:00 364
4 2022-12-30 08:00:00 364
... ... ...
7682 2030-01-01 16:00:00 2923
7683 2030-01-02 00:00:00 2924
7684 2030-01-02 08:00:00 2924
7685 2030-01-02 16:00:00 2924
7686 2030-01-03 00:00:00 2925
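For long frames, a vectorized variant along the same lines (a sketch, not part of the original answer) avoids the row-wise apply:
days = df['TradeDate'].dt.normalize()
# every change of calendar day bumps the counter by one
df['dayNo'] = (days != days.shift()).cumsum() + df['TradeDate'].iloc[0].dayofyear - 1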

Add a column with the hourly difference of the Datetime Index [duplicate]

I have a DataFrame with a DatetimeIndex, and I need to create a column that contains the time difference between consecutive rows of the index, expressed in hours. This is what I have:
Datetime Numbers
2020-11-27 08:30:00 1
2020-11-27 13:00:00 2
2020-11-27 15:15:00 3
2020-11-27 20:45:00 4
2020-11-28 08:45:00 5
2020-11-28 10:45:00 6
2020-12-01 04:00:00 7
2020-12-01 08:15:00 8
2020-12-01 12:45:00 9
2020-12-01 14:45:00 10
2020-12-01 17:15:00 11
...
This is what I need:
Datetime Numbers Delta
2020-11-27 08:30:00 1 NaN
2020-11-27 13:00:00 2 4.5
2020-11-27 15:15:00 3 2.25
2020-11-27 20:45:00 4 5.5
2020-11-28 08:45:00 5 12
2020-11-28 10:45:00 6 2
2020-12-01 04:00:00 7 65.25
2020-12-01 08:15:00 8 4.25
2020-12-01 12:45:00 9 4.5
2020-12-01 14:45:00 10 2
2020-12-01 17:15:00 11 2.5
...
The Dataframe has thousands of rows so I can't use a "for" loop. Thanks in advance!
EDIT: I found a solution:
import numpy as np
df = df.reset_index()
df['Time'] = df['Datetime'].astype(np.int64) // 10**9
df['Delta'] = df['Time'].diff() / 3600
df.drop(columns=['Time'], inplace=True)
df.set_index('Datetime', inplace=True)
I assume that Datetime is set as the index:
df.reset_index(inplace=True)
df['Delta'] = df['Datetime'].diff().dt.total_seconds()/3600
df.set_index('Datetime', inplace=True)
OUTPUT:
Numbers Delta
Datetime
2020-11-27 08:30:00 1 NaN
2020-11-27 13:00:00 2 4.50
2020-11-27 15:15:00 3 2.25
2020-11-27 20:45:00 4 5.50
2020-11-28 08:45:00 5 12.00
2020-11-28 10:45:00 6 2.00
2020-12-01 04:00:00 7 65.25
2020-12-01 08:15:00 8 4.25
2020-12-01 12:45:00 9 4.50
2020-12-01 14:45:00 10 2.00
2020-12-01 17:15:00 11 2.50
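Since Datetime is already the index, the reset_index/set_index round trip can also be skipped (a sketch):
df['Delta'] = df.index.to_series().diff().dt.total_seconds() / 3600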

Replacing NaNs with date and time format

I'm working with the following dataframes.
Date Light (umols) Time_difference
0 2018-01-12 07:16:52 2.5 NaT
1 2018-01-12 07:19:52 4.9 0 days 00:03:00
2 2018-01-12 07:22:52 4.9 0 days 00:03:00
3 2018-01-12 07:25:52 7.4 0 days 00:03:00
4 2018-01-12 07:28:50 9.9 0 days 00:02:58
... ... ... ...
6252 2018-12-18 17:54:24 12.2 0 days 00:03:00
6253 2018-12-18 17:57:24 7.6 0 days 00:03:00
6254 2018-12-18 18:00:24 4.9 0 days 00:03:00
6255 2018-12-18 18:03:24 2.5 0 days 00:03:00
6256 2018-12-18 18:06:24 0.2 0 days 00:03:00
Date Light (umols) Time_difference
0 2019-01-10 00:00:00 500.4 NaT
1 2019-01-10 00:00:01 451.2 0 days 00:00:01
2 2019-01-10 00:00:02 343.7 0 days 00:00:01
3 2019-01-10 00:00:03 354.5 0 days 00:00:01
4 2019-01-10 00:00:04 176.4 0 days 00:00:00
... ... ... ...
81264 2021-02-22 23:59:55 937.7 0 days 00:00:00
81265 2021-02-22 23:59:56 634.4 0 days 00:00:00
81266 2021-02-22 23:59:57 574.3 0 days 00:00:00
81267 2021-02-22 23:59:58 598.9 0 days 00:00:00
81268 2021-02-22 23:59:59 676.9 0 days 00:00:00
I want to work out where there are gaps, how long they are, and how many there are. The idea is to have a consistent timeline of at most one reading every 3 minutes; anything above that needs to be flagged up, and the two dataframes would be merged together afterwards. There are some pesky NaTs in both first rows, and I want to replace each one with something like '0 days 00:00:00'. I tried writing the following code with little success:
better = clean['Date'] == '2018-01-12 07:16:52'
clean.loc[better, 'Time_difference'] = clean.loc[clean, 'Time_difference'].replace('NaT', '0 days 00:00:00')
Any suggestions?
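A minimal sketch, assuming Time_difference is a timedelta64[ns] column: fill the NaT with a zero timedelta via fillna; replacing the string 'NaT' won't match because the column holds timedelta values, not text.
clean['Time_difference'] = clean['Time_difference'].fillna(pd.Timedelta(0))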

Reorder day of week in pandas groupby plot bar

I have sorted df data like below:
day_name Day_id
time
2019-05-20 19:00:00 Monday 0
2018-12-31 15:00:00 Monday 0
2019-02-25 17:00:00 Monday 0
2019-05-06 20:00:00 Monday 0
2019-03-12 12:00:00 Tuesday 1
2019-04-16 15:00:00 Tuesday 1
2019-04-02 18:00:00 Tuesday 1
2019-02-05 09:00:00 Tuesday 1
2019-05-28 21:00:00 Tuesday 1
2019-01-15 12:00:00 Tuesday 1
2019-06-04 20:00:00 Tuesday 1
2018-12-04 07:00:00 Tuesday 1
2019-01-22 11:00:00 Tuesday 1
2019-01-09 07:00:00 Wednesday 2
2019-03-06 16:00:00 Wednesday 2
2019-06-19 17:00:00 Wednesday 2
2019-04-10 20:00:00 Wednesday 2
2019-04-24 15:00:00 Wednesday 2
2019-01-31 08:00:00 Thursday 3
2019-01-03 08:00:00 Thursday 3
2019-02-28 19:00:00 Thursday 3
2019-05-23 20:00:00 Thursday 3
2018-12-20 07:00:00 Thursday 3
2019-05-09 19:00:00 Thursday 3
2019-06-28 15:00:00 Friday 4
2019-03-22 12:00:00 Friday 4
2019-03-29 14:00:00 Friday 4
2018-12-15 08:00:00 Saturday 5
2019-02-17 11:00:00 Sunday 6
2019-06-16 19:00:00 Sunday 6
2018-12-02 08:00:00 Sunday 6
Currently, with help from this post:
df = df.groupby(df.day_name).count().plot(kind="bar")
plt.show()
my output is:
How do I plot the histogram with the days of the week in the proper order: Monday, Tuesday, ...?
I have found several approaches (1, 2, 3) to solve this, but can't find a way to apply them in my case.
Thank you all for your hard work.
You need sort=False in the groupby:
m = df.groupby(df.day_name,sort=False).count().plot(kind="bar")
plt.show()
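sort=False only works here because the rows already arrive ordered by Day_id. A sketch that does not rely on pre-sorting reindexes the counts explicitly:
order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df.groupby(df.day_name).count().reindex(order).plot(kind='bar')
plt.show()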

Separate by threshold

I am trying to take the c_med value from input 1 as a threshold and separate input 2 into values above and below it, writing above.csv and below.csv with reference to column c_total.
Then I read above.csv as input and categorize its rows by percentage, as done in the pure-Python code shown below.
Input: 1
date_count,all_hours,c_min,c_max,c_med,c_med_med,u_min,u_max,u_med,u_med_med
2,12,2309,19072,12515,13131,254,785,686,751
Input: 2 ['date','startTime','endTime','day','c_total','u_total']
2004-01-05,22:00:00,23:00:00,Mon,18944,790
2004-01-05,23:00:00,00:00:00,Mon,17534,750
2004-01-06,00:00:00,01:00:00,Tue,17262,747
2004-01-06,01:00:00,02:00:00,Tue,19072,777
2004-01-06,02:00:00,03:00:00,Tue,18275,785
2004-01-06,03:00:00,04:00:00,Tue,13589,757
2004-01-06,04:00:00,05:00:00,Tue,16053,735
2004-01-06,05:00:00,06:00:00,Tue,11440,636
2004-01-06,06:00:00,07:00:00,Tue,5972,513
2004-01-06,07:00:00,08:00:00,Tue,3424,382
2004-01-06,08:00:00,09:00:00,Tue,2696,303
2004-01-06,09:00:00,10:00:00,Tue,2350,262
2004-01-06,10:00:00,11:00:00,Tue,2309,254
I am reading the threshold value c_med from another input CSV, and I am getting the following error:
Traceback (most recent call last):
  File "class_med.py", line 10, in <module>
    above_median = df_data['c_total'] > df_med['c_med']
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 735, in wrapper
    raise ValueError('Series lengths must match to compare')
ValueError: Series lengths must match to compare
Then I filter the separated data on column c_total by percentage. A pure-Python solution is given below, but I am looking for a pandas solution, as in the reference:
for row in csv.reader(inp):
    if int(row[1]) < (0.20 * max_value):
        val = 'viewers'
    elif int(row[1]) >= (0.20 * max_value) and int(row[1]) < (0.40 * max_value):
        val = 'event based'
    elif int(row[1]) >= (0.40 * max_value) and int(row[1]) < (0.60 * max_value):
        val = 'situational'
    elif int(row[1]) >= (0.60 * max_value) and int(row[1]) < (0.80 * max_value):
        val = 'active'
    else:
        val = 'highly active'
    writer.writerow([row[0], row[1], val])
Code:
import pandas as pd
import numpy as np
df_med = pd.read_csv('stat_result.csv')
df_med.columns = ['date_count', 'all_hours', 'c_min', 'c_max', 'c_med', 'c_med_med', 'u_min', 'u_max', 'u_med', 'u_med_med']
df_data = pd.read_csv('mini_out.csv')
df_data.columns = ['date', 'startTime', 'endTime', 'day', 'c_total', 'u_total']
above = df_data['c_total'] > df_med['c_med']
#print above_median
above.to_csv('above.csv', index=None, header=None)
df_above = pd.read_csv('above_median.csv')
df_above.columns = ['date', 'startTime', 'endTime', 'day', 'c_total', 'u_total']
#Percentage block should come here
Edit: for a single column value, qcut is the simplest solution. But when it comes to using two values from two different columns, how can that be achieved in pandas?
for row in csv.reader(inp):
    if int(row[1]) > (0.80 * max_user) and int(row[2]) > (0.80 * max_key):
        val = 'highly active'
    elif int(row[1]) >= (0.60 * max_user) and int(row[2]) <= (0.60 * max_key):
        val = 'active'
    elif int(row[1]) <= (0.40 * max_user) and int(row[2]) >= (0.40 * max_key):
        val = 'event based'
    elif int(row[1]) < (0.20 * max_user) and int(row[2]) < (0.20 * max_key):
        val = 'situational'
    else:
        val = 'viewers'
assuming you have the following DFs:
In [7]: df1
Out[7]:
date_count all_hours c_min c_max c_med c_med_med u_min u_max u_med u_med_med
0 2 12 2309 19072 12515 13131 254 785 686 751
In [8]: df2
Out[8]:
date startTime endTime day c_total u_total
0 2004-01-05 22:00:00 23:00:00 Mon 18944 790
1 2004-01-05 23:00:00 00:00:00 Mon 17534 750
2 2004-01-06 00:00:00 01:00:00 Tue 17262 747
3 2004-01-06 01:00:00 02:00:00 Tue 19072 777
4 2004-01-06 02:00:00 03:00:00 Tue 18275 785
5 2004-01-06 03:00:00 04:00:00 Tue 13589 757
6 2004-01-06 04:00:00 05:00:00 Tue 16053 735
7 2004-01-06 05:00:00 06:00:00 Tue 11440 636
8 2004-01-06 06:00:00 07:00:00 Tue 5972 513
9 2004-01-06 07:00:00 08:00:00 Tue 3424 382
10 2004-01-06 08:00:00 09:00:00 Tue 2696 303
11 2004-01-06 09:00:00 10:00:00 Tue 2350 262
12 2004-01-06 10:00:00 11:00:00 Tue 2309 254
Separate by threshold: you can compare two series of the same length, or a series against a scalar. I assume you want to separate your second data set by comparing it to the scalar value (the c_med column) from the first row of your first data set:
In [22]: above = df2[df2.c_total > df1.loc[0, 'c_med']]
In [23]: above
Out[23]:
date startTime endTime day c_total u_total
0 2004-01-05 22:00:00 23:00:00 Mon 18944 790
1 2004-01-05 23:00:00 00:00:00 Mon 17534 750
2 2004-01-06 00:00:00 01:00:00 Tue 17262 747
3 2004-01-06 01:00:00 02:00:00 Tue 19072 777
4 2004-01-06 02:00:00 03:00:00 Tue 18275 785
5 2004-01-06 03:00:00 04:00:00 Tue 13589 757
6 2004-01-06 04:00:00 05:00:00 Tue 16053 735
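To write the two files the question asks for (a sketch; 'below' taken as the complement of 'above'):
threshold = df1.loc[0, 'c_med']
df2[df2.c_total > threshold].to_csv('above.csv', index=False)
df2[df2.c_total <= threshold].to_csv('below.csv', index=False)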
you can use the qcut() method to categorize your data:
In [29]: df2['cat'] = pd.qcut(df2.c_total,
....: q=[0, .2, .4, .6, .8, 1.],
....: labels=['viewers','event based','situational','active','highly active'])
In [30]: df2
Out[30]:
date startTime endTime day c_total u_total cat
0 2004-01-05 22:00:00 23:00:00 Mon 18944 790 highly active
1 2004-01-05 23:00:00 00:00:00 Mon 17534 750 active
2 2004-01-06 00:00:00 01:00:00 Tue 17262 747 active
3 2004-01-06 01:00:00 02:00:00 Tue 19072 777 highly active
4 2004-01-06 02:00:00 03:00:00 Tue 18275 785 highly active
5 2004-01-06 03:00:00 04:00:00 Tue 13589 757 situational
6 2004-01-06 04:00:00 05:00:00 Tue 16053 735 situational
7 2004-01-06 05:00:00 06:00:00 Tue 11440 636 situational
8 2004-01-06 06:00:00 07:00:00 Tue 5972 513 event based
9 2004-01-06 07:00:00 08:00:00 Tue 3424 382 event based
10 2004-01-06 08:00:00 09:00:00 Tue 2696 303 viewers
11 2004-01-06 09:00:00 10:00:00 Tue 2350 262 viewers
12 2004-01-06 10:00:00 11:00:00 Tue 2309 254 viewers
check:
In [32]: df2.assign(pct=df2.c_total/df2.c_total.max())[['c_total','pct','cat']]
Out[32]:
c_total pct cat
0 18944 0.993289 highly active
1 17534 0.919358 active
2 17262 0.905096 active
3 19072 1.000000 highly active
4 18275 0.958211 highly active
5 13589 0.712510 situational
6 16053 0.841705 situational
7 11440 0.599832 situational
8 5972 0.313129 event based
9 3424 0.179530 event based
10 2696 0.141359 viewers
11 2350 0.123217 viewers
12 2309 0.121068 viewers
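For the two-column case in the edit, a pandas sketch could use numpy.select; mapping row[1] to u_total and row[2] to c_total is an assumption based on the pure-Python version above:
import numpy as np
max_user, max_key = df2['u_total'].max(), df2['c_total'].max()
conditions = [
    (df2['u_total'] > 0.80 * max_user) & (df2['c_total'] > 0.80 * max_key),
    (df2['u_total'] >= 0.60 * max_user) & (df2['c_total'] <= 0.60 * max_key),
    (df2['u_total'] <= 0.40 * max_user) & (df2['c_total'] >= 0.40 * max_key),
    (df2['u_total'] < 0.20 * max_user) & (df2['c_total'] < 0.20 * max_key),
]
choices = ['highly active', 'active', 'event based', 'situational']
# np.select evaluates the conditions in order, like the if/elif chain above
df2['cat'] = np.select(conditions, choices, default='viewers')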
