I'm calculating time differences, in seconds, between buses' expected and actual stop times.
My problem looks like this:
import pandas as pd

# creating data
d = {
'time_A': ['2022-08-30 06:21:00', '2022-08-30 16:41:00'],
'time_B': ['2022-08-30 06:21:09', '2022-08-30 16:40:16'],
}
# creating DataFrame
my_df = pd.DataFrame(d)
my_df['time_A'] = pd.to_datetime(my_df['time_A'])
my_df['time_B'] = pd.to_datetime(my_df['time_B'])
# subtracting times
my_df['difference'] = my_df['time_B'] - my_df['time_A']
my_df
result:
time_A time_B difference
0 2022-08-30 06:21:00 2022-08-30 06:21:09 0 days 00:00:09
1 2022-08-30 16:41:00 2022-08-30 16:40:16 -1 days +23:59:16
I don't understand why the difference between today 16:40:16 and today 16:41:00 is -1 days +23:59:16.
If I do this:
my_df['difference'] = (my_df['time_B'] - my_df['time_A']).dt.seconds
Then I get
time_A time_B difference
0 2022-08-30 06:21:00 2022-08-30 06:21:09 9
1 2022-08-30 16:41:00 2022-08-30 16:40:16 86356
I would like the "difference" cell on row 0 to display something like "+9", and the one below to display "-44".
How do I do this? Thanks!
Subtracting datetimes gives timedeltas, and pandas normalizes a negative timedelta to a negative number of days plus a non-negative time-of-day, which is why -44 seconds is displayed as -1 days +23:59:16. For the same reason, .dt.seconds returns only the non-negative seconds component (86400 - 44 = 86356) rather than the signed value. Use .total_seconds() to get the signed number of seconds; consider the following simple example:
import datetime
import pandas as pd
df = pd.DataFrame({"schedule":pd.to_datetime(["2000-01-01 12:00:00"]),"actual":pd.to_datetime(["2000-01-01 12:00:05"])})
df['difference_sec'] = (df['schedule'] - df['actual']).apply(datetime.timedelta.total_seconds)
print(df)
output
schedule actual difference_sec
0 2000-01-01 12:00:00 2000-01-01 12:00:05 -5.0
Note that this is a feature of datetime.timedelta; it is not specific to pandas.
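Applied to the original my_df, a minimal sketch (the integer cast and the explicit "+" formatting are my assumptions about the desired display):
my_df['difference'] = (my_df['time_B'] - my_df['time_A']).dt.total_seconds().astype(int)
# format with an explicit sign, e.g. "+9" and "-44"
my_df['difference'] = my_df['difference'].map('{:+d}'.format)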
Related
I'm trying to subtract two columns on a dataset which have string times in order to get a time value for statistical analysis.
Basically, TOC is start time and IA is end time.
Something is slightly wrong:
dfc = pd.DataFrame(zip(*[TOC,IA]),columns=['TOC','IA'])
print (dfc)
dfc.['TOC']= dfc.['TOC'].astype(dt.datetime)
dfc['TOC'] = pd.to_datetime(dfc['TOC'])
dfc['TOC'] = [time.time() for time in dfc['TOC']]
Convert the columns to datetime before subtracting:
>>> pd.to_datetime(dfc["IA"], format="%H:%M:%S")-pd.to_datetime(dfc["TOC"], format="%H:%M:%S")
0 0 days 00:08:07
1 0 days 00:15:29
2 0 days 00:11:14
3 0 days 00:27:50
dtype: timedelta64[ns]
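A self-contained sketch of the same idea (the TOC and IA values below are made up, since the original lists were not shown):
import pandas as pd
# hypothetical sample times; replace with your real TOC and IA lists
TOC = ['10:15:00', '11:00:00']
IA = ['10:23:07', '11:15:29']
dfc = pd.DataFrame(list(zip(TOC, IA)), columns=['TOC', 'IA'])
dfc['duration'] = pd.to_datetime(dfc['IA'], format='%H:%M:%S') - pd.to_datetime(dfc['TOC'], format='%H:%M:%S')
print(dfc)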
I have a pandas DataFrame:
I want to calculate the difference between confirm and cancel in the following way:
For date 13.01.2020 and desk_id 1.0: 10:35:00 - 8:00:00 + 12:36:00 - 11:36:00 + 20:00:00 - 13:36:00
I was able to perform these actions only for a desk with one confirm/cancel pair per date. By one pair I mean that for a given date and desk_id I have only one row with a confirm and a cancel time. I get the difference I'm interested in when I subtract 8:00:00 from confirm, subtract cancel from 20:00:00, and add the two together.
For many pairs I can't put it together. By many pairs I mean that a desk_id on one date has several rows with cancel and confirm times. I would like to choose the date and desk_id and calculate the desk occupancy time - the difference between confirm and cancel for each desk.
Output should look like:
I would like to find periods of time when a desk is free.
In my data can be many confirms and cancels for desk in one date.
I did it for one confirm/cancel pair:
import datetime
df_1['confirm'] = pd.to_timedelta(df_1['confirm'].astype(str))
df_1['diff_confirm'] = df_1['confirm'].apply(lambda x: x - datetime.timedelta(days=0, hours=8, minutes=0))
df_1['cancel'] = pd.to_timedelta(df_1['cancel'].astype(str))
df_1['diff_cancel'] = df_1['cancel'].apply(lambda x: datetime.timedelta(days=0, hours=20, minutes=0)-x)
and this works.
Any tips?
You did not make it entirely clear what format you need your results in, but I assume it is okay to put them in a separate dataframe. So this solution operates on each group of rows defined by values of date and desk_id and computes the total time for each group, with output placed in a new dataframe:
Code to create your input dataframe:
from datetime import timedelta
import pandas as pd
df = pd.DataFrame(
{
'date': [pd.Timestamp('2020-1-13'), pd.Timestamp('2020-1-13'),
pd.Timestamp('2020-1-13'), pd.Timestamp('2020-1-14'),
pd.Timestamp('2020-1-14'), pd.Timestamp('2020-1-14')],
'desk_id': [1.0, 1.0, 2.0, 1.0, 2.0, 2.0],
'confirm': ['10:36:00', '12:36:00', '09:36:00', '10:36:00', '12:36:00',
'15:36:00'],
'cancel': ['11:36:00', '13:36:00', '11:36:00', '11:36:00', '14:36:00',
'16:36:00']
}
)
Solution:
df['confirm'] = pd.to_timedelta(df['confirm'])
df['cancel'] = pd.to_timedelta(df['cancel'])
# function to compute total time each desk is free
def total_time(df):
    return (
        (df.iloc[0]['confirm'] - timedelta(days=0, hours=8, minutes=0)) +
        (df['confirm'] - df['cancel'].shift()).sum() +
        (timedelta(days=0, hours=20, minutes=0) - df.iloc[-1]['cancel'])
    )
# apply function to each combination of 'desk_id' and 'date', producing
# a new dataframe
df.groupby(['desk_id', 'date']).apply(total_time).reset_index(name='total_time')
# desk_id date total_time
# 0 1.0 2020-01-13 0 days 10:00:00
# 1 1.0 2020-01-14 0 days 11:00:00
# 2 2.0 2020-01-13 0 days 10:00:00
# 3 2.0 2020-01-14 0 days 09:00:00
The function takes the difference between the first value of confirm and 8:00:00, takes the differences between each confirm and the preceding cancel value, and then takes the difference between 20:00:00 and the last value of cancel. Those differences are added together to produce the final value.
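For desk 1 on 2020-01-13, for example: (10:36 - 8:00) + (12:36 - 11:36) + (20:00 - 13:36) = 2:36 + 1:00 + 6:24 = 10:00, which matches the first output row.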
One guess at what you're trying to do (I still can't fully understand, but here's an attempt):
import pandas as pd
from datetime import timedelta as td
#create the dataframe
a = pd.DataFrame({'data':['2020-01-13','2020-01-13','2020-01-14'],'desk_id':[1.0,1.0,1.0],'confirm':['10:36:00','12:36:00','13:14:00'],'cancel':['11:36:00','13:36:00','13:44:00']})
def get_avail_times(df, start_end_delta=td(hours=12)):
    df['confirm'] = pd.to_timedelta(df['confirm'])
    df['cancel'] = pd.to_timedelta(df['cancel'])
    # group by the two keys so that we can perform calculations on the specific groups
    df_g = df.groupby(['data','desk_id'], as_index=False).sum()
    df_g['total_time'] = start_end_delta - df_g['cancel'] + df_g['confirm']
    return df_g.drop(columns=['confirm', 'cancel'])
output = get_avail_times(a)
Which gives the output:
data desk_id total_time
0 2020-01-13 1.0 0 days 10:00:00
1 2020-01-14 1.0 0 days 11:30:00
The key here is to use .groupby() and sum the confirm and cancel times within each group, which essentially performs the equation:
total_time = 20:00 + sum_confirm_times - sum_cancel_times - 08:00
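For 2020-01-13, for example: 20:00 - 08:00 = 12:00, plus (10:36 + 12:36) - (11:36 + 13:36) = -2:00, giving 10:00, which matches the output above.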
I am trying to group hospital staff working hours bi-monthly. I have raw data on a daily basis which looks like below.
date hours_spent emp_id
9/11/2016 8 1
15/11/2016 8 1
22/11/2016 8 2
23/11/2016 8 1
How I want to group it is:
cycle hours_spent emp_id
1/11/2016-15/11/2016 16 1
16/11/2016-30/11/2016 8 2
16/11/2016-30/11/2016 8 1
I am trying to do the same with Grouper and frequency in pandas, something as below.
data.set_index('date',inplace=True)
print(data.head())
dt = data.groupby(['emp_id', pd.Grouper(key='date', freq='MS')])['hours_spent'].sum().reset_index().sort_values('date')
#df.resample('10d').mean().interpolate(method='linear',axis=0)
print(dt.resample('SMS').sum())
I also tried resampling
df1 = dt.resample('MS', loffset=pd.Timedelta(15, 'd')).sum()
data.set_index('date',inplace=True)
df1 = data.resample('MS', loffset=pd.Timedelta(15, 'd')).sum()
But this is giving intervals of 15 days, not 1 to 15 and 16 to the end of the month.
Please let me know what I am doing wrong here.
You were almost there. This will do it -
dt = df.groupby(['emp_id', pd.Grouper(key='date', freq='SM')])['hours_spent'].sum().reset_index().sort_values('date')
emp_id date hours_spent
1 2016-10-31 8
1 2016-11-15 16
2 2016-11-15 8
freq='SM' is the semi-month frequency, which is anchored on the 15th and the last day of every month.
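A quick check of what freq='SM' generates (the start date here is just for illustration):
import pandas as pd
print(pd.date_range('2016-11-01', freq='SM', periods=4))
# DatetimeIndex(['2016-11-15', '2016-11-30', '2016-12-15', '2016-12-31'], dtype='datetime64[ns]', freq='SM-15')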
Put DateTime-Values into Bins
If I got you right, you basically want to put the values in your date column into bins. For this, pandas provides the pd.cut() function, which does exactly what you want.
Here's an approach which might help you:
import pandas as pd
df = pd.DataFrame({
'hours' : 8,
'emp_id' : [1,1,2,1],
'date' : [pd.Timestamp(2016,11,9),
          pd.Timestamp(2016,11,15),
          pd.Timestamp(2016,11,22),
          pd.Timestamp(2016,11,23)]
})
bins_dt = pd.date_range('2016-10-16', freq='SM', periods=3)
cycle = pd.cut(df.date, bins_dt)
df.groupby([cycle, 'emp_id']).sum()
Which gets you:
cycle emp_id hours
------------------------ ------ ------
(2016-10-31, 2016-11-15] 1 16
2 NaN
(2016-11-15, 2016-11-30] 1 8
2 8
Had a similar question, here was my solution:
df1['BiMonth'] = df1['Date'] + pd.DateOffset(days=-1) + pd.offsets.SemiMonthEnd()
df1['BiMonth'] = df1['BiMonth'].dt.to_period('D')
The expression df1['Date'] + pd.DateOffset(days=-1) takes whatever is in the date column and subtracts one day.
Adding pd.offsets.SemiMonthEnd() rolls the result forward into a bi-monthly bucket (the 15th or the month end), but it's off by a day unless you reduce the reference date by 1.
The .dt.to_period('D') call cleans out the time component so you just have days.
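A minimal sketch, assuming df1 has a datetime Date column (the sample dates are made up):
import pandas as pd
df1 = pd.DataFrame({'Date': pd.to_datetime(['2016-11-09', '2016-11-15', '2016-11-23'])})
df1['BiMonth'] = df1['Date'] + pd.DateOffset(days=-1) + pd.offsets.SemiMonthEnd()
df1['BiMonth'] = df1['BiMonth'].dt.to_period('D')
print(df1)
# dates on or before the 15th land on the 15th; later dates land on the month end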
I have a set of IDs and Timestamps, and want to calculate the "total time elapsed per ID" by getting the difference of the oldest / earliest timestamps, grouped by ID.
Data
id timestamp
1 2018-02-01 03:00:00
1 2018-02-01 03:01:00
2 2018-02-02 10:03:00
2 2018-02-02 10:04:00
2 2018-02-02 11:05:00
Expected Result
(I want the delta converted to minutes)
id delta
1 1
2 62
I have a for loop, but it's very slow (10+ min for 1M+ rows). I was wondering if this was achievable via pandas functions?
# gb is a DataFrameGroupBy object, grouped by ID
gb = df.groupby(['id'])
# Create the resulting df
cycletime = pd.DataFrame(columns=['id','timeDeltaMin'])
def calculate_delta():
    for id, groupdf in gb:
        time = groupdf.timestamp
        # returns timestamp rows for the current id
        time_delta = time.max() - time.min()
        # convert Timedelta object to minutes
        time_delta = time_delta / pd.Timedelta(minutes=1)
        # insert result into the cycletime df
        cycletime.loc[-1] = [id, time_delta]
        cycletime.index += 1
Thinking of trying next:
- Multiprocessing
First ensure datetimes are OK:
df.timestamp = pd.to_datetime(df.timestamp)
Now find the number of minutes in the difference between the maximum and minimum for each id:
import numpy as np
>>> (df.timestamp.groupby(df.id).max() - df.timestamp.groupby(df.id).min()) / np.timedelta64(1, 'm')
id
1 1.0
2 62.0
Name: timestamp, dtype: float64
You can sort by id and timestamp, then group by id and find the difference between the min and max timestamp per group.
import numpy as np
df['timestamp'] = pd.to_datetime(df['timestamp'])
result = df.sort_values(['id']).groupby('id')['timestamp'].agg(['min', 'max'])
result['diff'] = (result['max']-result['min']) / np.timedelta64(1, 'm')
result.reset_index()[['id', 'diff']]
Output:
id diff
0 1 1.0
1 2 62.0
Another one:
import pandas as pd
import numpy as np
import datetime
ids = [1,1,2,2,2]
times = ['2018-02-01 03:00:00', '2018-02-01 03:01:00', '2018-02-02 10:03:00',
         '2018-02-02 10:04:00', '2018-02-02 11:05:00']
df = pd.DataFrame({'id':ids,'timestamp':pd.to_datetime(pd.Series(times))})
df.set_index('id', inplace=True)
# diff consecutive timestamps within each id, then sum the gaps per id
print(df.groupby(level=0).diff().groupby(level=0).sum()['timestamp'].dt.total_seconds()/60)
I currently have a process for windowing time series data, but I am wondering if there is a vectorized, in-place approach for performance/resource reasons.
I have two lists that have the start and end dates of 30 day windows:
start_dts = ['2014-01-01', ...]
end_dts = ['2014-01-30', ...]
I have a dataframe with a field called 'transaction_dt'.
What I am trying to accomplish is a method to add two new columns ('start_dt' and 'end_dt') to each row where the transaction_dt falls between a pair of 'start_dt' and 'end_dt' values. Ideally, this would be vectorized and in-place if possible.
EDIT:
As requested, here is some sample data in my format:
'customer_id','transaction_dt','product','price','units'
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25
IIUC, by using IntervalIndex:
df2.index=pd.IntervalIndex.from_arrays(df2['Start'],df2['End'],closed='both')
df[['End','Start']]=df2.loc[df['transaction_dt']].values
df
Out[457]:
transaction_dt End Start
0 2017-01-02 2017-01-31 2017-01-01
1 2017-03-02 2017-03-31 2017-03-01
2 2017-04-02 2017-04-30 2017-04-01
3 2017-05-02 2017-05-31 2017-05-01
Data input:
df=pd.DataFrame({'transaction_dt':['2017-01-02','2017-03-02','2017-04-02','2017-05-02']})
df['transaction_dt']=pd.to_datetime(df['transaction_dt'])
list1=['2017-01-01','2017-02-01','2017-03-01','2017-04-01','2017-05-01']
list2=['2017-01-31','2017-02-28','2017-03-31','2017-04-30','2017-05-31']
df2=pd.DataFrame({'Start':list1,'End':list2})
df2.Start=pd.to_datetime(df2.Start)
df2.End=pd.to_datetime(df2.End)
If you want start and end, we can use this (see "Extracting the first day of month of a datetime type column in pandas"):
import io
import pandas as pd
import datetime
string = """customer_id,transaction_dt,product,price,units
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25"""
df = pd.read_csv(io.StringIO(string))
df["transaction_dt"] = pd.to_datetime(df["transaction_dt"])
df["start"] = df['transaction_dt'].dt.floor('d') - pd.offsets.MonthBegin(1)
df["end"] = df['transaction_dt'].dt.floor('d') + pd.offsets.MonthEnd(1)
df
Returns
customer_id transaction_dt product price units start end
0 1 2004-01-02 thing1 25 47 2004-01-01 2004-01-31
1 1 2004-01-17 thing2 150 8 2004-01-01 2004-01-31
2 2 2004-01-29 thing2 150 25 2004-01-01 2004-01-31
New approach:
import io
import pandas as pd
import datetime
string = """customer_id,transaction_dt,product,price,units
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-06-29,thing2,150,25"""
df = pd.read_csv(io.StringIO(string))
df["transaction_dt"] = pd.to_datetime(df["transaction_dt"])
# Get all timestamps that are necessary
# This assumes dates are sorted
# if not we should change [0] -> min_dt and [-1] --> max_dt
timestamps = [df.iloc[0]["transaction_dt"].floor('d') - pd.offsets.MonthBegin(1)]
while df.iloc[-1]["transaction_dt"].floor('d') > timestamps[-1]:
    timestamps.append(timestamps[-1] + datetime.timedelta(days=30))
# We store all ranges here
ranges = list(zip(timestamps, timestamps[1:]))
# Loop through all values and add to columns start and end
for ind, value in enumerate(df["transaction_dt"]):
    for i, (start, end) in enumerate(ranges):
        if start <= value <= end:
            df.loc[ind, "start"] = start
            df.loc[ind, "end"] = end
            # When a match is found, also remove the ranges
            # that have already been passed.
            # This can be removed if dates are not sorted,
            # but it should speed things up for large datasets.
            for _ in range(i):
                ranges.pop(0)
            break  # stop scanning once the window is found
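If the double loop becomes a bottleneck, here is one possible vectorized sketch (reusing df, pd, and datetime from the snippet above; pd.cut is my substitution for the manual range scan, and the edges must cover every transaction_dt):
# build the same 30-day edges, then let pd.cut assign each row to its window
edges = pd.date_range(df["transaction_dt"].min().floor('d') - pd.offsets.MonthBegin(1),
                      df["transaction_dt"].max() + datetime.timedelta(days=30),
                      freq='30D')
windows = pd.cut(df["transaction_dt"], bins=edges, include_lowest=True)
df["start"] = windows.map(lambda iv: iv.left)
df["end"] = windows.map(lambda iv: iv.right)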