Time difference in hours of a column of pandas dataframe - python

id time_taken
1 2017-06-21 07:36:53
2 2017-06-21 07:32:28
3 2017-06-22 08:55:09
4 2017-06-22 08:04:31
5 2017-06-21 03:38:46
current_time = 2017-06-22 10:08:16
I want to create df2 containing the rows where the difference between current_time and the time_taken column is greater than 24 hours, i.e.
1 2017-06-21 07:36:53
2 2017-06-21 07:32:28
5 2017-06-21 03:38:46

You can convert the Timedelta to total_seconds and compare, or compare directly with a Timedelta, then filter by boolean indexing:
current_time = '2017-06-22 10:08:16'
df['time_taken'] = pd.to_datetime(df['time_taken'])
df = df[(pd.to_datetime(current_time) - df['time_taken']).dt.total_seconds() > 60 * 60 * 24]
print (df)
id time_taken
0 1 2017-06-21 07:36:53
1 2 2017-06-21 07:32:28
4 5 2017-06-21 03:38:46
Or:
df = df[(pd.to_datetime(current_time) - df['time_taken']) > pd.Timedelta(24, unit='h')]
print (df)
id time_taken
0 1 2017-06-21 07:36:53
1 2 2017-06-21 07:32:28
4 5 2017-06-21 03:38:46
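A self-contained sketch of the approach above, rebuilding the sample frame from the question (column layout assumed from the question's data):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'time_taken': ['2017-06-21 07:36:53', '2017-06-21 07:32:28',
                   '2017-06-22 08:55:09', '2017-06-22 08:04:31',
                   '2017-06-21 03:38:46'],
})
df['time_taken'] = pd.to_datetime(df['time_taken'])

current_time = pd.Timestamp('2017-06-22 10:08:16')

# Keep rows whose timestamp is more than 24 hours before current_time
df2 = df[(current_time - df['time_taken']) > pd.Timedelta(hours=24)]
print(df2)
```

This keeps ids 1, 2 and 5, matching the expected output.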

Related

Identify Dates in DataFrame - Pandas

I have a dataframe:
Datetime
0 2022-06-01 00:00:00 0
1 2022-06-01 00:01:00 0
2 2022-06-01 00:02:00 0
3 2022-06-01 00:03:00 0
4 2022-06-01 00:04:00 0
How can I identify whether the hour is "00", and likewise for the minutes and seconds? I would later like to put this in a function.
You can use:
s = pd.to_datetime(df['Datetime'], format='%Y-%m-%d %H:%M:%S 0')  # the literal " 0" in the format consumes the trailing "0" in the data
df['hour_0'] = s.dt.hour.eq(0)
df['min_0'] = s.dt.minute.eq(0)
df['sec_0'] = s.dt.second.eq(0)
Output:
Datetime hour_0 min_0 sec_0
0 2022-06-01 00:00:00 0 True True True
1 2022-06-01 00:01:00 0 True False True
2 2022-06-01 00:02:00 0 True False True
3 2022-06-01 00:03:00 0 True False True
4 2022-06-01 00:04:00 0 True False True
Your question is a bit unclear to me, but if I understand correctly you just need to extract the hours from your DataFrame? If so, the easiest way is to use Pandas' built-in datetime functionality. For example:
import pandas as pd
df = pd.DataFrame([["2022-12-12 01:59:00"], ["2022-12-13 01:59:00"]])
print(df)
This will yield:
0
0 2022-12-12 01:59:00
1 2022-12-13 01:59:00
Now you can do:
df['timestamp'] = pd.to_datetime(df[0])
df['hour'] = df['timestamp'].dt.hour
You can do this for minutes and seconds etc. Hope that helps.
You can extract hours, minutes, and seconds directly from the datetime string. (What is the extra 0?) If the string has extra parts, simply filter them out first, then parse the remaining components.
df['new'] = pd.to_datetime(df['Datetime'].str.split(' ').str[1],format='%H:%M:%S')
df['hour'] = df['new'].dt.hour
df['minute'] = df['new'].dt.minute
df['second'] = df['new'].dt.second
del df['new']
Gives:
Datetime hour minute second
0 2022-06-01 00:00:00 0 0 0 0
1 2022-06-01 00:01:00 0 0 1 0
2 2022-06-01 00:02:00 0 0 2 0
3 2022-06-01 00:03:00 0 0 3 0
4 2022-06-01 00:04:00 0 0 4 0
Explanation:
your date string looks likes this
2022-06-01 00:02:00 0
analysis
2022 - Year - %Y
06 - Month - %m
01 - Day - %d
00 - Hours - %H
02 - Minutes - %M
00 - Seconds - %S
You have an extra 0 in the date string; to filter it out, I've split the string on the space.
df['Datetime'].str.split(' ').str[1],format='%H:%M:%S'
Logically this implies
'2022-06-01 00:02:00 0'.split(' ')
which splits the string into a list of elements separated by spaces:
['2022-06-01', '00:02:00', '0']
Analysis
0th element in list = 2022-06-01
1st element in list = 00:02:00
2nd element in list = 0
We are interested in the time, which is the 1st element: 00:02:00
pd.to_datetime(df['Datetime'].str.split(' ').str[1],format='%H:%M:%S')
Pandas has built-in time series accessors, e.g. pandas.Series.dt.minute.
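A minimal sketch of those dt accessors on a single parsed timestamp (sample value taken from the question's data):

```python
import pandas as pd

# Parse one of the question's timestamps (without the trailing "0")
s = pd.to_datetime(pd.Series(['2022-06-01 00:02:00']))

# dt exposes the individual components
print(s.dt.hour[0], s.dt.minute[0], s.dt.second[0])  # 0 2 0
```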

Create a date counter variable starting with a particular date

I have a variable as:
start_dt = 201901 which is basically Jan 2019
I have an initial data frame as:
month
0
1
2
3
4
I want to add a new column (date) to the dataframe where for month 0, the date is the start_dt - 1 month, and for subsequent months, the date is a month + 1 increment.
I want the resulting dataframe as:
month date
0 12/1/2018
1 1/1/2019
2 2/1/2019
3 3/1/2019
4 4/1/2019
You can subtract 1, add the start datetime converted to a month period with Timestamp.to_period, and then convert the result back to timestamps with to_timestamp:
start_dt = 201801
start_dt = pd.to_datetime(start_dt, format='%Y%m')
s = df['month'].sub(1).add(start_dt.to_period('m')).dt.to_timestamp()
print (s)
0 2017-12-01
1 2018-01-01
2 2018-02-01
3 2018-03-01
4 2018-04-01
Name: month, dtype: datetime64[ns]
Or you can convert the column to month offsets (subtracting 1) and add the start datetime:
s = df['month'].apply(lambda x: pd.DateOffset(months=x-1)).add(start_dt)
print (s)
0 2017-12-01
1 2018-01-01
2 2018-02-01
3 2018-03-01
4 2018-04-01
Name: month, dtype: datetime64[ns]
Here is how you can use the third-party library dateutil to increment a datetime by one month:
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta
start_dt = '201801'
number_of_rows = 10
start_dt = datetime.strptime(start_dt, '%Y%m')
df = pd.DataFrame({'date': [start_dt+relativedelta(months=+n)
for n in range(-1, number_of_rows-1)]})
print(df)
Output:
date
0 2017-12-01
1 2018-01-01
2 2018-02-01
3 2018-03-01
4 2018-04-01
5 2018-05-01
6 2018-06-01
7 2018-07-01
8 2018-08-01
9 2018-09-01
As you can see, in each iteration of the loop the initial datetime is incremented by the iteration's month count (starting at -1).
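The same sequence can also be sketched with pandas' own date_range using a month-start frequency, avoiding the explicit loop (start value assumed from the answer above):

```python
import pandas as pd

start_dt = pd.to_datetime('201801', format='%Y%m')
number_of_rows = 10

# Begin one month before start_dt, then step by month starts ('MS')
df = pd.DataFrame({'date': pd.date_range(start_dt - pd.DateOffset(months=1),
                                         periods=number_of_rows, freq='MS')})
print(df)
```

This yields the same 2017-12-01 through 2018-09-01 sequence as the loop-based version.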

Count total work hours of the employee per date in pandas

I have the pandas dataframe like this:
Employee_id timestamp
1 2017-06-21 04:47:45
1 2017-06-21 04:48:45
1 2017-06-21 04:49:45
For each employee, I get a ping every minute while they are in the office.
I have pings for around 2000 employees, and I need output like:
Employee_id date Total_work_hour
1 2018-06-21 8
1 2018-06-22 7
2 2018-06-21 6
2 2018-06-22 8
for all 2000 employees.
Use groupby with a lambda that takes diff and sums all the differences, then convert to seconds with total_seconds and divide by 3600 for hours:
df1 = (df.groupby(['Employee_id', df['timestamp'].dt.date])['timestamp']
         .apply(lambda x: x.diff().sum())
         .dt.total_seconds()
         .div(3600)
         .reset_index(name='Total_work_hour'))
print (df1)
Employee_id timestamp Total_work_hour
0 1 2017-06-21 0.033333
But if some consecutive minutes may be missing, you can use a custom function:
print (df)
Employee_id timestamp
0 1 2017-06-21 04:47:45
1 1 2017-06-21 04:48:45
2 1 2017-06-21 04:49:45
3 1 2017-06-21 04:55:45
def f(x):
    vals = x.diff()
    return vals.mask(vals > pd.Timedelta(60, unit='s')).sum()

df1 = (df.groupby(['Employee_id', df['timestamp'].dt.date])['timestamp']
         .apply(f)
         .dt.total_seconds()
         .div(3600)
         .reset_index(name='Total_work_hour'))
print (df1)
Employee_id timestamp Total_work_hour
0 1 2017-06-21 0.033333
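Putting the gap-masked version together as a runnable sketch, with the sample pings from the answer (including the 6-minute gap that gets discarded):

```python
import pandas as pd

# Sample pings: three consecutive minutes, then a 6-minute gap
df = pd.DataFrame({
    'Employee_id': [1, 1, 1, 1],
    'timestamp': pd.to_datetime(['2017-06-21 04:47:45', '2017-06-21 04:48:45',
                                 '2017-06-21 04:49:45', '2017-06-21 04:55:45']),
})

def f(x):
    # Mask gaps longer than one minute (employee assumed away), then sum
    vals = x.diff()
    return vals.mask(vals > pd.Timedelta(seconds=60)).sum()

df1 = (df.groupby(['Employee_id', df['timestamp'].dt.date])['timestamp']
         .apply(f)
         .dt.total_seconds()
         .div(3600)
         .reset_index(name='Total_work_hour'))
print(df1)
```

Only the two one-minute intervals count, giving 120 seconds = 0.0333 hours.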

Python pandas datetime difference

I have this:
x[l[0]] = pd.to_datetime(x[l[0]], format="%Y-%m-%d %H:%M:%S")
Where l=list(x)
How can I get the difference between these objects in seconds? If I do this
x[l[0]][1]-x[l[0]][2]
it returns a timedelta object.
print (x[:5])
LogDate Query_BoxID_ID Query_Function_ID SC_Win32_Status
0 2017-06-15 09:50:14 12 24 0
1 2017-06-15 09:50:14 12 26 0
2 2017-06-15 09:50:14 12 26 0
3 2017-06-15 09:50:14 12 30 0
4 2017-06-15 09:50:32 12 19 0
Use diff to get timedeltas, which can be converted with total_seconds:
#convert column to datetime
x['LogDate'] = pd.to_datetime(x['LogDate'], format="%Y-%m-%d %H:%M:%S")
#first value is NaN always, so replaced to 0 by fillna and cast to int
a = x['LogDate'].diff().dt.total_seconds().fillna(0).astype(int)
print (a)
0 0
1 0
2 0
3 0
4 18
Name: LogDate, dtype: int32
b = int((x.loc[1, 'LogDate'] - x.loc[0, 'LogDate']).total_seconds())
print (b)
0
I can just do
(x[l[0]][1]-x[l[0]][2]).total_seconds()

Pandas AVG () function between two date columns

Have a df like that:
Client Status Dat_Start Dat_End
1 A 2015-01-01 2015-01-19
1 B 2016-01-01 2016-02-02
1 A 2015-02-12 2015-02-20
1 B 2016-01-30 2016-03-01
I'd like to get the average of the difference between the two dates (Dat_End and Dat_Start) for Status='A', grouping by the Client column, using Pandas syntax.
So it will be something SQL-like:
Select Client, AVG (Dat_end-Dat_Start) as Date_Diff
from Table
where Status='A'
Group by Client
Thanks!
Calculate the timedeltas:
df['duration'] = df.Dat_End-df.Dat_Start
df
Out[92]:
Client Status Dat_Start Dat_End duration
0 1 A 2015-01-01 2015-01-19 18 days
1 1 B 2016-01-01 2016-02-02 32 days
2 1 A 2015-02-12 2015-02-20 8 days
3 1 B 2016-01-30 2016-03-01 31 days
Filter, then ask for sum and count (for pandas < 0.20):
df[df.Status=='A'].groupby('Client').duration.agg(['sum', 'count'])
Out[98]:
sum count
Client
1 26 days 2
In pandas 0.20, mean was added to groupby for timedeltas, so this will work:
df[df.Status=='A'].groupby('Client').duration.mean()
In [10]: df.loc[df.Status == 'A'].groupby('Client') \
.apply(lambda x: (x.Dat_End-x.Dat_Start).mean()).reset_index()
Out[10]:
Client 0
0 1 13 days
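On any recent pandas, the whole thing can be sketched end to end; the groupby mean on a timedelta column works directly (frame rebuilt from the question's data):

```python
import pandas as pd

df = pd.DataFrame({
    'Client': [1, 1, 1, 1],
    'Status': ['A', 'B', 'A', 'B'],
    'Dat_Start': pd.to_datetime(['2015-01-01', '2016-01-01',
                                 '2015-02-12', '2016-01-30']),
    'Dat_End': pd.to_datetime(['2015-01-19', '2016-02-02',
                               '2015-02-20', '2016-03-01']),
})
df['duration'] = df['Dat_End'] - df['Dat_Start']

# Filter to Status 'A', then average the timedeltas per client
out = df[df.Status == 'A'].groupby('Client')['duration'].mean()
print(out)
```

The two 'A' durations are 18 and 8 days, so the mean for client 1 is 13 days.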
