Count total work hours of the employee per date in pandas - python

I have a pandas DataFrame like this:

   Employee_id            timestamp
0            1  2017-06-21 04:47:45
1            1  2017-06-21 04:48:45
2            1  2017-06-21 04:49:45
For each employee, I get a ping every minute while he/she is in the office. I have pings for around 2000 employees, and I need output like:
   Employee_id        date  Total_work_hour
0            1  2018-06-21                8
1            1  2018-06-22                7
2            2  2018-06-21                6
3            2  2018-06-22                8

for all 2000 employees.
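For reference, a minimal frame matching the sample can be built like this (a sketch; the values come from the question, and parsing timestamp to datetime is assumed):

import pandas as pd

# Sample data from the question; timestamps parsed to datetime64
df = pd.DataFrame({'Employee_id': [1, 1, 1],
                   'timestamp': pd.to_datetime(['2017-06-21 04:47:45',
                                                '2017-06-21 04:48:45',
                                                '2017-06-21 04:49:45'])})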

Use groupby with a lambda function that calls diff and sums all the differences, then convert to seconds with total_seconds and divide by 3600 to get hours:
df1 = (df.groupby(['Employee_id', df['timestamp'].dt.date])['timestamp']
         .apply(lambda x: x.diff().sum())
         .dt.total_seconds()
         .div(3600)
         .reset_index(name='Total_work_hour'))
print (df1)
   Employee_id   timestamp  Total_work_hour
0            1  2017-06-21         0.033333
But if some consecutive minutes may be missing, you can use a custom function that masks out gaps longer than one minute:
print (df)
   Employee_id           timestamp
0            1 2017-06-21 04:47:45
1            1 2017-06-21 04:48:45
2            1 2017-06-21 04:49:45
3            1 2017-06-21 04:55:45
def f(x):
    # Differences between consecutive pings; gaps longer than 60s
    # are treated as time away and excluded from the sum
    vals = x.diff()
    return vals.mask(vals > pd.Timedelta(60, unit='s')).sum()
df1 = (df.groupby(['Employee_id', df['timestamp'].dt.date])['timestamp']
         .apply(f)
         .dt.total_seconds()
         .div(3600)
         .reset_index(name='Total_work_hour'))
print (df1)
   Employee_id   timestamp  Total_work_hour
0            1  2017-06-21         0.033333
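The result is unchanged here because the 6-minute gap before 04:55:45 exceeds the one-minute threshold, so f masks it out and only the two 1-minute differences are summed. A quick check against the frame above (a sketch, not part of the original answer):

# Differences between consecutive pings
print (df['timestamp'].diff())
# 0               NaT
# 1   0 days 00:01:00
# 2   0 days 00:01:00
# 3   0 days 00:06:00   <- gap longer than 60s, masked by f

# Only the two 1-minute gaps survive: 2 minutes = 0.033333 hours
print (f(df['timestamp']))
# 0 days 00:02:00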

Create New DataFrame, assigning a count for each instance in a time frame

Below is a script for a simplified version of the df in question:

plan_dates = pd.DataFrame({'start_date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05'],
                           'end_date':   ['2021-01-03', '2021-01-04', '2021-02-03', '2021-03-04', '2021-03-05']})
plan_dates
   start_date    end_date
0  2021-01-01  2021-01-03
1  2021-01-02  2021-01-04
2  2021-01-03  2021-02-03
3  2021-01-04  2021-03-04
4  2021-01-05  2021-03-05
I would like to create a new DataFrame which has 2 columns:
- date
- count of active plans (the count of rows where the date falls within start_date and end_date in the plan_dates df)
INTENDED DF:

         date  count_active_plans
0  2021-01-01                   1
1  2021-01-02                   2
2  2021-01-03                   3
3  2021-01-04                   3
4  2021-01-05                   3
Any help would be greatly appreciated.
First convert both columns to datetimes and add one day to end_date. Then repeat the index with Index.repeat using the difference in days, add counter values from GroupBy.cumcount converted by to_timedelta, and finally count with Series.value_counts, with some data cleaning and conversion back to a DataFrame:
plan_dates['start_date'] = pd.to_datetime(plan_dates['start_date'])
plan_dates['end_date'] = pd.to_datetime(plan_dates['end_date']) + pd.Timedelta(1, unit='d')

s = plan_dates['end_date'].sub(plan_dates['start_date']).dt.days
df = plan_dates.loc[plan_dates.index.repeat(s)].copy()
counter = df.groupby(level=0).cumcount()

df1 = (df['start_date'].add(pd.to_timedelta(counter, unit='d'))
         .value_counts()
         .sort_index()
         .rename_axis('date')
         .reset_index(name='count_active_plans'))
print (df1)
         date  count_active_plans
0  2021-01-01                   1
1  2021-01-02                   2
2  2021-01-03                   3
3  2021-01-04                   3
4  2021-01-05                   3
..        ...                 ...
59 2021-03-01                   2
60 2021-03-02                   2
61 2021-03-03                   2
62 2021-03-04                   2
63 2021-03-05                   1

[64 rows x 2 columns]
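To see what the repeat step produces, it can help to inspect the intermediate objects (a sketch against the plan_dates frame above; note end_date already includes the extra day):

# Number of covered days per plan, endpoints included: [3, 3, 32, 60, 60]
print (s.tolist())

# Each plan row is repeated once per covered day, and the cumulative
# counter enumerates 0, 1, 2, ... within each original row, so
# start_date + counter days walks through every active date of the plan.
print (df.head())
print (counter.head())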

pandas dataframe aggregate by ID and date

I'm trying to aggregate a dataframe by both ID and date. Suppose I had a dataframe:
  Publish date  ID  Price
0   2000-01-02   0     10
1   2000-01-03   0     20
2   2000-02-17   0     30
3   2000-01-04   1     40
I would like to aggregate the value by ID and date (frequency = 1W) and get a dataframe like:
  Publish date  ID  Price
0   2000-01-02   0     30
1   2000-02-17   0     30
2   2000-01-04   1     40
I understand it can be achieved by iterating over the IDs and using Grouper to aggregate the price. Is there any more efficient way without iterating the IDs? Many thanks.
Use Grouper with an aggregate sum, but the right frequency for the Grouper is unclear, because each choice below bins the dates slightly differently from the expected output in the question:
df['Publish date'] = pd.to_datetime(df['Publish date'])
df = (df.groupby([pd.Grouper(freq='W', key='Publish date'), 'ID'], sort=False)['Price']
        .sum()
        .reset_index())
print (df)
  Publish date  ID  Price
0   2000-01-02   0     10
1   2000-01-09   0     20
2   2000-02-20   0     30
3   2000-01-09   1     40
Or with weeks anchored to Monday (freq='W-Mon'):

df['Publish date'] = pd.to_datetime(df['Publish date'])
df = (df.groupby([pd.Grouper(freq='W-Mon', key='Publish date'), 'ID'], sort=False)['Price']
        .sum()
        .reset_index())
print (df)
  Publish date  ID  Price
0   2000-01-03   0     30
1   2000-02-21   0     30
2   2000-01-10   1     40
Or with 7-day bins counted from the first date (freq='7D'):

df['Publish date'] = pd.to_datetime(df['Publish date'])
df = (df.groupby([pd.Grouper(freq='7D', key='Publish date'), 'ID'], sort=False)['Price']
        .sum()
        .reset_index())
print (df)
  Publish date  ID  Price
0   2000-01-02   0     30
1   2000-02-13   0     30
2   2000-01-02   1     40
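The variants differ only in week anchoring. A quick way to see why 'W' and 'W-Mon' bin the sample differently is to check the weekdays involved (a sketch, not part of the original answer):

import pandas as pd

# 'W' is an alias for 'W-SUN' (weeks ending Sunday), so the Sunday
# 2000-01-02 and the Monday 2000-01-03 fall into different bins;
# 'W-Mon' ends weeks on Monday, which puts them into the same bin.
print (pd.Timestamp('2000-01-02').day_name())  # Sunday
print (pd.Timestamp('2000-01-03').day_name())  # Monday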

Time difference in hours of a column of pandas dataframe

   id           time_taken
0   1  2017-06-21 07:36:53
1   2  2017-06-21 07:32:28
2   3  2017-06-22 08:55:09
3   4  2017-06-22 08:04:31
4   5  2017-06-21 03:38:46

current_time = 2017-06-22 10:08:16
I want to create df2 containing the rows where the time difference between current_time and the time_taken column is greater than 24 hours, i.e.:

   id           time_taken
0   1  2017-06-21 07:36:53
1   2  2017-06-21 07:32:28
4   5  2017-06-21 03:38:46
You can convert the Timedelta to total seconds with total_seconds and compare, or compare against a Timedelta directly; then filter by boolean indexing:
current_time = '2017-06-22 10:08:16'

df['time_taken'] = pd.to_datetime(df['time_taken'])
df = df[(pd.to_datetime(current_time) - df['time_taken']).dt.total_seconds() > 60 * 60 * 24]
print (df)
   id           time_taken
0   1  2017-06-21 07:36:53
1   2  2017-06-21 07:32:28
4   5  2017-06-21 03:38:46
Or:

df = df[(pd.to_datetime(current_time) - df['time_taken']) > pd.Timedelta(24, unit='h')]
print (df)
   id           time_taken
0   1  2017-06-21 07:36:53
1   2  2017-06-21 07:32:28
4   5  2017-06-21 03:38:46
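Both conditions are equivalent, since 24 hours is 86400 seconds:

# The Timedelta used above expressed in seconds matches 60 * 60 * 24
print (pd.Timedelta(24, unit='h').total_seconds())  # 86400.0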

Pandas AVG() function between two date columns

Have a df like this:

   Client Status  Dat_Start    Dat_End
0       1      A 2015-01-01 2015-01-19
1       1      B 2016-01-01 2016-02-02
2       1      A 2015-02-12 2015-02-20
3       1      B 2016-01-30 2016-03-01
I'd like to get the average difference between the two dates (Dat_End and Dat_Start) for Status = 'A', grouping by the Client column, using pandas syntax.
So it will be something SQL-like:

Select Client, AVG(Dat_End - Dat_Start) as Date_Diff
from Table
where Status = 'A'
Group by Client

Thanks!
Calculate the timedeltas:

df['duration'] = df.Dat_End - df.Dat_Start
df
Out[92]:
   Client Status  Dat_Start    Dat_End duration
0       1      A 2015-01-01 2015-01-19  18 days
1       1      B 2016-01-01 2016-02-02  32 days
2       1      A 2015-02-12 2015-02-20   8 days
3       1      B 2016-01-30 2016-03-01  31 days
For pandas < 0.20, filter and aggregate with sum and count:

df[df.Status=='A'].groupby('Client').duration.agg(['sum', 'count'])
Out[98]:
            sum  count
Client
1       26 days      2
For pandas 0.20+, mean is supported on groupby for timedeltas, so this will work:

df[df.Status=='A'].groupby('Client').duration.mean()

In [10]: df.loc[df.Status == 'A'].groupby('Client') \
             .apply(lambda x: (x.Dat_End - x.Dat_Start).mean()).reset_index()
Out[10]:
   Client       0
0       1 13 days
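The mean is consistent with the sum/count output above: 26 days over 2 rows gives 13 days.

# (18 days + 8 days) / 2 = 13 days, matching the groupby mean
print ((pd.Timedelta('18 days') + pd.Timedelta('8 days')) / 2)  # 13 days 00:00:00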

Pandas: count some values in a column

I have a dataframe; here is part of it:
ID,"url","app_name","used_at","active_seconds","device_connection","device_os","device_type","device_usage"
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-05-01 09:29:11,13,3g,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-05-01 09:33:00,3,unknown,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-06-01 09:33:07,1,unknown,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-06-01 09:34:30,5,unknown,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Messaging,2015-06-01 09:36:22,133,3g,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Messaging,2015-05-02 09:38:40,5,3g,android,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Yandex.Navigator,2015-05-01 11:04:48,70,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",VK Client,2015-6-01 12:02:27,248,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Viber,2015-07-01 12:06:35,7,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",VK Client,2015-08-01 12:23:26,86,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Talking Angela,2015-08-02 12:24:52,0,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",My Talking Angela,2015-08-03 12:24:52,167,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Talking Angela,2015-08-04 12:27:39,34,3g,ios,smartphone,home
I need to count the number of days in every month for every ID.
If I try df.groupby('ID')['used_at'].count() I get the number of visits; how can I get and count the days per month?
I think you need to groupby by ID, month and day, and aggregate with size (this assumes used_at is already a datetime, e.g. converted with pd.to_datetime):
df1 = df.used_at.groupby([df['ID'], df.used_at.dt.month, df.used_at.dt.day]).size()
print (df1)
ID                                used_at  used_at
574c4969b017ae6481db9a7c77328bc3  5        1          1
                                  6        1          1
                                  7        1          1
                                  8        1          1
                                           2          1
                                           3          1
                                           4          1
e990fae0f48b7daf52619b5ccbec61bc  5        1          2
                                           2          1
                                  6        1          3
dtype: int64
Or group by date, which is the same as grouping by year, month and day:
df1 = df.used_at.groupby([df['ID'], df.used_at.dt.date]).size()
print (df1)
ID                                used_at
574c4969b017ae6481db9a7c77328bc3  2015-05-01    1
                                  2015-06-01    1
                                  2015-07-01    1
                                  2015-08-01    1
                                  2015-08-02    1
                                  2015-08-03    1
                                  2015-08-04    1
e990fae0f48b7daf52619b5ccbec61bc  2015-05-01    2
                                  2015-05-02    1
                                  2015-06-01    3
dtype: int64
Differences between count and size:
size counts NaN values, count does not.
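A small sketch of that difference, with a hypothetical NaN (not from the question's data):

import numpy as np
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1.0, np.nan, 3.0]})
print (df.groupby('g')['v'].size())   # a: 2, b: 1 -- NaN included
print (df.groupby('g')['v'].count())  # a: 1, b: 1 -- NaN excluded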
