Let's suppose we have a pandas dataframe with work shifts:
df_aux = pd.DataFrame({'Worker' : ['Alice','Alice','Alice','Alice','Alice', 'Bob','Bob','Bob'],
'Shift_start' : ['2022-01-01 10:00:00', '2022-01-01 10:30:00', '2022-01-01 11:45:00', '2022-01-01 12:45:00', '2022-01-01 13:15:00', '2022-01-01 10:30:00', '2022-01-01 12:00:00', '2022-01-01 13:15:00'],
'Shift_end' : ['2022-01-01 10:15:00', '2022-01-01 11:45:00', '2022-01-01 12:30:00', '2022-01-01 13:15:00', '2022-01-01 14:00:00', '2022-01-01 11:30:00', '2022-01-01 13:10:00', '2022-01-01 14:30:00'],
'Position' : [1, 1, 2, 2, 2, 1, 2, 3],
'Role' : ['A', 'B', 'B', 'A', 'B', 'A', 'B', 'A']})
Worker  Shift_start          Shift_end            Position  Role
Alice   2022-01-01 10:00:00  2022-01-01 10:15:00  1         A
Alice   2022-01-01 10:30:00  2022-01-01 11:45:00  1         B
Alice   2022-01-01 11:45:00  2022-01-01 12:30:00  2         B
Alice   2022-01-01 12:45:00  2022-01-01 13:15:00  2         A
Alice   2022-01-01 13:15:00  2022-01-01 14:00:00  2         B
Bob     2022-01-01 10:30:00  2022-01-01 11:30:00  1         A
Bob     2022-01-01 12:00:00  2022-01-01 13:10:00  2         B
Bob     2022-01-01 13:15:00  2022-01-01 14:30:00  3         A
The Position column refers to the place where the workers are, and there are two roles, A and B (say, main and auxiliary). I need to compute, at the time of certain events, how long each worker has been at the current position regardless of role, and how long they have been in the same position AND role. These events are given in df_main, which records the time and position:
df_main = pd.DataFrame({'Event_time' : ['2022-01-01 11:05:00', '2022-01-01 12:35:00', '2022-01-01 13:25:00'] ,
'Position' : [1, 2, 2]})
Event_time           Position
2022-01-01 11:05:00  1
2022-01-01 12:35:00  2
2022-01-01 13:25:00  2
The idea would be to perform a merge between df_main and df_aux to have the following info:
Event_time           Worker  Shift_start          Shift_end            Position  Role  Time_in_position    Time_in_position_role
2022-01-01 11:05:00  Alice   2022-01-01 10:30:00  2022-01-01 11:45:00  1         B     1 hours 05 minutes  0 hours 35 minutes
2022-01-01 11:05:00  Bob     2022-01-01 10:30:00  2022-01-01 11:30:00  1         A     0 hours 35 minutes  0 hours 35 minutes
2022-01-01 12:35:00  Bob     2022-01-01 12:00:00  2022-01-01 13:10:00  2         B     0 hours 35 minutes  0 hours 35 minutes
2022-01-01 13:25:00  Alice   2022-01-01 13:15:00  2022-01-01 14:00:00  2         B     1 hours 40 minutes  0 hours 10 minutes
The first row is duplicated, because both Alice and Bob were in that position at the time of the event, but with different roles. I managed to compute the Time_in_position_role column:
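# Parse the timestamp columns so they can be compared and subtracted
df_main['Event_time'] = pd.to_datetime(df_main['Event_time'])
df_aux[['Shift_start', 'Shift_end']] = df_aux[['Shift_start', 'Shift_end']].apply(pd.to_datetime)
df_full = df_main.merge(df_aux, on='Position')
df_full = df_full[(df_full['Event_time']>df_full['Shift_start']) & (df_full['Event_time']<df_full['Shift_end'])]
df_full['Time_in_position_role'] = df_full['Event_time'] - df_full['Shift_start']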
But I am unable to do the same for the Time_in_position one. Any ideas?
The logic is:
1. For each "Worker", find the continuous period they spent in a particular position. If that period spans multiple rows, merge them.
2. Join this with your result df and apply the same filter logic to get "Time_in_position".
# For each "Worker", find the time period for which he was in particular position. If there are multiple rows, then merge them.
def sort_n_rank(g):
    df_g = g.apply(pd.Series)
    df_g = df_g.sort_values(0)
    return (df_g[1] != df_g[1].shift(1)).cumsum()
df_aux["start_position"] = df_aux[["Shift_start", "Position"]].apply(tuple, axis=1)
df_aux["rank"] = df_aux.groupby("Worker")[["start_position"]].transform(sort_n_rank)
df_worker_position = (
    df_aux.groupby(["Worker", "rank"])
    .agg(
        Shift_start_min=("Shift_start", "min"),
        Shift_end_max=("Shift_end", "max"),
        Position=("Position", "first"),
    )
    .reset_index()
)
df_full = df_full.merge(df_worker_position, on=["Worker", "Position"])
df_full = df_full[(df_full["Event_time"] > df_full["Shift_start_min"]) & (df_full["Event_time"] < df_full["Shift_end_max"])]
df_full["Time_in_position"] = df_full["Event_time"] - df_full["Shift_start_min"]
Output:
Event_time Worker Shift_start Shift_end Position Role Time_in_position Time_in_position_role
0 2022-01-01 11:05:00 Alice 2022-01-01 10:30:00 2022-01-01 11:45:00 1 B 0 days 01:05:00 0 days 00:35:00
1 2022-01-01 11:05:00 Bob 2022-01-01 10:30:00 2022-01-01 11:30:00 1 A 0 days 00:35:00 0 days 00:35:00
2 2022-01-01 12:35:00 Bob 2022-01-01 12:00:00 2022-01-01 13:10:00 2 B 0 days 00:35:00 0 days 00:35:00
3 2022-01-01 13:25:00 Alice 2022-01-01 13:15:00 2022-01-01 14:00:00 2 B 0 days 01:40:00 0 days 00:10:00
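For reference, here is a self-contained sketch of the same idea that labels consecutive runs of the same position with shift and cumsum instead of building tuples. The names stay_id and Position_start are made up for this illustration, and it assumes (as the expected output does) that short gaps between shifts in the same position do not reset the clock.

```python
import pandas as pd

df_aux = pd.DataFrame({
    "Worker": ["Alice"] * 5 + ["Bob"] * 3,
    "Shift_start": pd.to_datetime([
        "2022-01-01 10:00", "2022-01-01 10:30", "2022-01-01 11:45",
        "2022-01-01 12:45", "2022-01-01 13:15",
        "2022-01-01 10:30", "2022-01-01 12:00", "2022-01-01 13:15"]),
    "Shift_end": pd.to_datetime([
        "2022-01-01 10:15", "2022-01-01 11:45", "2022-01-01 12:30",
        "2022-01-01 13:15", "2022-01-01 14:00",
        "2022-01-01 11:30", "2022-01-01 13:10", "2022-01-01 14:30"]),
    "Position": [1, 1, 2, 2, 2, 1, 2, 3],
    "Role": list("ABBABABA"),
})
df_main = pd.DataFrame({
    "Event_time": pd.to_datetime(
        ["2022-01-01 11:05", "2022-01-01 12:35", "2022-01-01 13:25"]),
    "Position": [1, 2, 2],
})

# Put each worker's shifts in chronological order.
df_aux = df_aux.sort_values(["Worker", "Shift_start"]).reset_index(drop=True)

# A new "stay" begins whenever the worker or the position differs from the previous row.
new_stay = (
    (df_aux["Worker"] != df_aux["Worker"].shift())
    | (df_aux["Position"] != df_aux["Position"].shift())
)
df_aux["stay_id"] = new_stay.cumsum()  # hypothetical helper column

# The stay starts at the earliest shift start inside it.
df_aux["Position_start"] = df_aux.groupby("stay_id")["Shift_start"].transform("min")

# Keep only the shifts that contain each event.
df_full = df_main.merge(df_aux, on="Position")
inside = (
    (df_full["Event_time"] > df_full["Shift_start"])
    & (df_full["Event_time"] < df_full["Shift_end"])
)
df_full = df_full[inside].reset_index(drop=True)

df_full["Time_in_position"] = df_full["Event_time"] - df_full["Position_start"]
df_full["Time_in_position_role"] = df_full["Event_time"] - df_full["Shift_start"]
print(df_full[["Event_time", "Worker", "Time_in_position", "Time_in_position_role"]])
```

Because the stay start is attached to every shift row before the merge, no second merge or second filter is needed.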
I have a dataframe:
df = T1 C1
01/01/2022 11:20 2
01/01/2022 15:40 8
01/01/2022 17:50 3
I want to expand it so that:
- it has a row, carrying the current value, at each of a given list of times
- it has a row for each round hour
So if the times are given in
l=[ 01/01/2022 15:46 , 01/01/2022 11:28]
I will have:
df_new = T1 C1
01/01/2022 11:20 2
01/01/2022 11:28 2
01/01/2022 12:00 2
01/01/2022 13:00 2
01/01/2022 14:00 2
01/01/2022 15:00 2
01/01/2022 15:40 8
01/01/2022 15:46 8
01/01/2022 16:00 8
01/01/2022 17:00 8
01/01/2022 17:50 3
You can add the extra dates and ffill:
df['T1'] = pd.to_datetime(df['T1'])
extra = pd.date_range(df['T1'].min().ceil('H'), df['T1'].max().floor('H'), freq='1h')
(pd.concat([df, pd.DataFrame({'T1': extra})])
.sort_values(by='T1', ignore_index=True)
.ffill()
)
Output:
T1 C1
0 2022-01-01 11:20:00 2.0
1 2022-01-01 12:00:00 2.0
2 2022-01-01 13:00:00 2.0
3 2022-01-01 14:00:00 2.0
4 2022-01-01 15:00:00 2.0
5 2022-01-01 15:40:00 8.0
6 2022-01-01 16:00:00 8.0
7 2022-01-01 17:00:00 8.0
8 2022-01-01 17:50:00 3.0
Here is a way to do what your question asks that will ensure:
- there are no duplicate times in T1 in the output, even if any of the times in the original are round hours
- the results will be of the same type as the values in the C1 column of the input (in this case, integers not floats).
hours = pd.date_range(df.T1.min().ceil("H"), df.T1.max().floor("H"), freq="60min")
idx_new = df.set_index('T1').join(pd.DataFrame(index=hours), how='outer', sort=True).index
df_new = df.set_index('T1').reindex(index = idx_new, method='ffill').reset_index().rename(columns={'index':'T1'})
Output:
T1 C1
0 2022-01-01 11:20:00 2
1 2022-01-01 12:00:00 2
2 2022-01-01 13:00:00 2
3 2022-01-01 14:00:00 2
4 2022-01-01 15:00:00 2
5 2022-01-01 15:40:00 8
6 2022-01-01 16:00:00 8
7 2022-01-01 17:00:00 8
8 2022-01-01 17:50:00 3
Example of how round dates in the input are handled:
df = pd.DataFrame({
#'T1':pd.to_datetime(['01/01/2022 11:20','01/01/2022 15:40','01/01/2022 17:50']),
'T1':pd.to_datetime(['01/01/2022 11:00','01/01/2022 15:40','01/01/2022 17:00']),
'C1':[2,8,3]})
Input:
T1 C1
0 2022-01-01 11:00:00 2
1 2022-01-01 15:40:00 8
2 2022-01-01 17:00:00 3
Output (no duplicates):
T1 C1
0 2022-01-01 11:00:00 2
1 2022-01-01 12:00:00 2
2 2022-01-01 13:00:00 2
3 2022-01-01 14:00:00 2
4 2022-01-01 15:00:00 2
5 2022-01-01 15:40:00 8
6 2022-01-01 16:00:00 8
7 2022-01-01 17:00:00 3
Another possible solution, based on pandas.DataFrame.resample:
df['T1'] = pd.to_datetime(df['T1'])
(pd.concat([df, df.set_index('T1').resample('1H').asfreq().reset_index()])
.sort_values('T1').ffill().dropna().reset_index(drop=True))
Output:
T1 C1
0 2022-01-01 11:20:00 2.0
1 2022-01-01 12:00:00 2.0
2 2022-01-01 13:00:00 2.0
3 2022-01-01 14:00:00 2.0
4 2022-01-01 15:00:00 2.0
5 2022-01-01 15:40:00 8.0
6 2022-01-01 16:00:00 8.0
7 2022-01-01 17:00:00 8.0
8 2022-01-01 17:50:00 3.0
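The question also asks for rows at the specific times in l (11:28 and 15:46). A minimal sketch that adds them, assuming the times have been parsed into a DatetimeIndex (called extra_times here, a name chosen for this example) and that T1 is sorted:

```python
import pandas as pd

df = pd.DataFrame({
    "T1": pd.to_datetime(["2022-01-01 11:20", "2022-01-01 15:40", "2022-01-01 17:50"]),
    "C1": [2, 8, 3],
})
# The list l from the question, parsed to timestamps.
extra_times = pd.to_datetime(["2022-01-01 15:46", "2022-01-01 11:28"])

# Round hours strictly inside the observed range.
hours = pd.date_range(df["T1"].min().ceil("60min"),
                      df["T1"].max().floor("60min"), freq="60min")

# union() sorts and removes duplicates, so round hours already present are not repeated.
new_index = pd.DatetimeIndex(df["T1"]).union(hours).union(extra_times)

df_new = (
    df.set_index("T1")
      .reindex(new_index, method="ffill")  # carry the last known value forward
      .rename_axis("T1")
      .reset_index()
)
print(df_new)
```

Reindexing with method="ffill" fills every new row directly, so C1 should keep its integer dtype here.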
I have a dataframe in pandas which reflects work shifts for employees (the time they are actually working). A snippet of it is the following:
df = pd.DataFrame({'Worker' : ['Alice','Alice','Alice', 'Bob','Bob','Bob'],
'Shift_start' : ['2022-01-01 10:00:00', '2022-01-01 13:10:00', '2022-01-01 15:45:00', '2022-01-01 11:30:00', '2022-01-01 13:40:00', '2022-01-01 15:20:00'],
'Shift_end' : ['2022-01-01 12:30:00', '2022-01-01 15:30:00', '2022-01-01 17:30:00', '2022-01-01 13:30:00', '2022-01-01 15:10:00', '2022-01-01 18:10:00']})
Worker  Shift_start          Shift_end
Alice   2022-01-01 10:00:00  2022-01-01 12:30:00
Alice   2022-01-01 13:10:00  2022-01-01 15:30:00
Alice   2022-01-01 15:45:00  2022-01-01 17:30:00
Bob     2022-01-01 11:30:00  2022-01-01 13:30:00
Bob     2022-01-01 13:40:00  2022-01-01 15:10:00
Bob     2022-01-01 15:20:00  2022-01-01 18:10:00
Now, I need to compute in every row the time since the last partial break, defined as a pause >20 minutes, and computed with respect to the start time of each shift. That is, if there is a pause of 15 minutes, it should be treated as if the pause had not happened, and the time would be computed since the last >20 min pause. If no such pause exists, the time should be taken as the time since the start of the day.
So I would need something like:
Worker  Shift_start          Shift_end            Hours_since_break
Alice   2022-01-01 10:00:00  2022-01-01 12:30:00  0
Alice   2022-01-01 13:10:00  2022-01-01 15:30:00  0
Alice   2022-01-01 15:45:00  2022-01-01 17:30:00  2.58
Bob     2022-01-01 11:30:00  2022-01-01 13:30:00  0
Bob     2022-01-01 13:40:00  2022-01-01 15:10:00  2.17
Bob     2022-01-01 15:20:00  2022-01-01 18:10:00  3.83
For Alice, the first row is 0, as there is no previous break, so it is taken as the value since the start of the day. As it is her first shift, 0 hours is the result. In the second row, she has just taken a 40-minute pause, so again, 0 hours since the break. In the third row, she has just taken 15 minutes, but as the minimum break is 20 minutes, it is as if she hadn't taken any break. Therefore, the time since her last break is counted from 13:10:00, when her last break finished, so the result is 2 hours and 35 minutes, i.e., 2.58 hours.
In the case of Bob the same logic applies. The first row is 0 (it is the first shift of the day). In the second row he has taken just a 10-minute break, which doesn't count, so the time since the last break is counted from the start of his day, i.e., 2h10m (2.17 hours). In the third row, he has taken a 10-minute break again, so the time is again counted from the start of the day, i.e., 3h50m (3.83 hours).
To compute the breaks with the 20-minute constraint I did the following:
shifted_end = df.groupby("Worker")["Shift_end"].shift()
df["Partial_break"] = (df["Shift_start"] - shifted_end)
df['Partial_break_hours'] = df["Partial_break"].dt.total_seconds() / 3600
df.loc[(df['Partial_break_hours']<0.33), 'Partial_break_hours'] = 0
But I can't think of a way to implement the search logic to give the desired output. Any help is much appreciated!
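You can try (assuming the DataFrame is sorted and the Shift_start and Shift_end columns have already been converted with pd.to_datetime):
def fn(x):
    rv = []
    last_zero = 0
    for a, c in zip(
        x["Shift_start"],
        (x["Shift_start"] - x["Shift_end"].shift()) < "20 minutes",
    ):
        if c:
            # Short pause: keep counting from the last real break
            rv.append(round((a - last_zero) / pd.to_timedelta(1, unit="hour"), 2))
        else:
            # First shift of the day or a real break: reset the reference time
            last_zero = a
            rv.append(0)
    return pd.Series(rv, index=x.index)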
df["Hours_since_break"] = df.groupby("Worker").apply(fn).droplevel(0)
print(df)
Prints:
Worker Shift_start Shift_end Hours_since_break
0 Alice 2022-01-01 10:00:00 2022-01-01 12:30:00 0.00
1 Alice 2022-01-01 13:10:00 2022-01-01 15:30:00 0.00
2 Alice 2022-01-01 15:45:00 2022-01-01 17:30:00 2.58
3 Bob 2022-01-01 11:30:00 2022-01-01 13:30:00 0.00
4 Bob 2022-01-01 13:40:00 2022-01-01 15:10:00 2.17
5 Bob 2022-01-01 15:20:00 2022-01-01 18:10:00 3.83
You could calculate a "fullBreakAtStart" flag and, based on it, set a "lastShiftStart". Where there is no "fullBreakAtStart", leave the value missing and then forward-fill it. Here is the code:
import numpy as np
df["Shift_end_prev"] = df.groupby("Worker")["Shift_end"].shift(1)
df["timeDiff"] = pd.to_datetime(df["Shift_start"]) - pd.to_datetime(df["Shift_end_prev"])
df["fullBreakAtStart"] = (df["timeDiff"]> "20 minutes") | (df["timeDiff"].isna())
# Keep the shift start only on rows that open the day or follow a full break; other rows become missing
df["lastShiftStart"] = df["Shift_start"].where(df["fullBreakAtStart"])
df["lastShiftStart"] = df["lastShiftStart"].ffill()
df["Hours_since_break"] = pd.to_datetime(df["Shift_start"]) - pd.to_datetime(df["lastShiftStart"])
df["Hours_since_break"] = df["Hours_since_break"]/np.timedelta64(1, 'h')
df["Hours_since_break"] = np.where(df["fullBreakAtStart"],0,df["Hours_since_break"])
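As a check, here is a compact, self-contained sketch of the same flag-and-forward-fill idea run on the sample data from the question. The names min_break, new_block and block_start are made up for this example, and the forward fill is done per worker so it cannot leak between workers.

```python
import pandas as pd

df = pd.DataFrame({
    "Worker": ["Alice"] * 3 + ["Bob"] * 3,
    "Shift_start": pd.to_datetime([
        "2022-01-01 10:00", "2022-01-01 13:10", "2022-01-01 15:45",
        "2022-01-01 11:30", "2022-01-01 13:40", "2022-01-01 15:20"]),
    "Shift_end": pd.to_datetime([
        "2022-01-01 12:30", "2022-01-01 15:30", "2022-01-01 17:30",
        "2022-01-01 13:30", "2022-01-01 15:10", "2022-01-01 18:10"]),
})
df = df.sort_values(["Worker", "Shift_start"]).reset_index(drop=True)

min_break = pd.Timedelta(minutes=20)

# Pause before each shift; NaT for a worker's first shift of the day.
pause = df["Shift_start"] - df.groupby("Worker")["Shift_end"].shift()

# A shift opens a new block if it is the first of the day or follows a long enough pause.
new_block = pause.isna() | (pause > min_break)

# Keep the start time only where a block opens, then carry it forward within each worker.
block_start = df["Shift_start"].where(new_block).groupby(df["Worker"]).ffill()

df["Hours_since_break"] = (
    (df["Shift_start"] - block_start) / pd.Timedelta(hours=1)
).round(2)
print(df)
```

On this data the column comes out as 0, 0, 2.58 for Alice and 0, 2.17, 3.83 for Bob, matching the table in the question.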
I have a dataset of New York taxi rides. There are some wrong values in pickup_datetime and dropoff_datetime, because the dropoff is before the pickup. How can I compare these two values and drop the affected rows?
You can do it with:
import pandas as pd
import numpy as np
df=pd.read_excel("Taxi.xlsx")
#convert to datetime
df["Pickup"]=pd.to_datetime(df["Pickup"])
df["Drop"]=pd.to_datetime(df["Drop"])
#Identify line to remove
df["ToRemove"]=np.where(df["Pickup"]>df["Drop"],1,0)
print(df)
#filter the dataframe if pickup time is higher than drop time
dfClean=df[df["ToRemove"]==0]
print(dfClean)
Result:
df:
Pickup Drop Price
0 2022-01-01 10:00:00 2022-01-01 10:05:00 5
1 2022-01-01 10:20:00 2022-01-01 10:25:00 8
2 2022-01-01 10:40:00 2022-01-01 10:45:00 3
3 2022-01-01 11:00:00 2022-01-01 10:05:00 10
4 2022-01-01 11:20:00 2022-01-01 11:25:00 5
5 2022-01-01 11:40:00 2022-01-01 08:45:00 8
6 2022-01-01 12:00:00 2022-01-01 12:05:00 3
7 2022-01-01 12:20:00 2022-01-01 12:25:00 10
8 2022-01-01 12:40:00 2022-01-01 12:45:00 5
9 2022-01-01 13:00:00 2022-01-01 13:05:00 8
10 2022-01-01 13:20:00 2022-01-01 13:25:00 3
dfClean:
Pickup Drop Price ToRemove
0 2022-01-01 10:00:00 2022-01-01 10:05:00 5 0
1 2022-01-01 10:20:00 2022-01-01 10:25:00 8 0
2 2022-01-01 10:40:00 2022-01-01 10:45:00 3 0
4 2022-01-01 11:20:00 2022-01-01 11:25:00 5 0
6 2022-01-01 12:00:00 2022-01-01 12:05:00 3 0
7 2022-01-01 12:20:00 2022-01-01 12:25:00 10 0
8 2022-01-01 12:40:00 2022-01-01 12:45:00 5 0
9 2022-01-01 13:00:00 2022-01-01 13:05:00 8 0
10 2022-01-01 13:20:00 2022-01-01 13:25:00 3 0
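The helper column is not strictly needed: comparing the two datetime columns gives a boolean mask that can filter the rows directly. A minimal sketch with a three-row sample, using the same Pickup and Drop column names as above (the sample rows are taken from the printed data):

```python
import pandas as pd

df = pd.DataFrame({
    "Pickup": pd.to_datetime(["2022-01-01 10:00", "2022-01-01 11:00", "2022-01-01 11:20"]),
    "Drop": pd.to_datetime(["2022-01-01 10:05", "2022-01-01 10:05", "2022-01-01 11:25"]),
    "Price": [5, 10, 5],
})

# True where the ride is consistent, i.e. the drop-off is not before the pickup.
valid = df["Drop"] >= df["Pickup"]

# Keep only the consistent rides; the original index labels are preserved.
df_clean = df[valid]
print(df_clean)
```

Use `>` instead of `>=` if zero-length rides should also be dropped.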
I have the below issue and I feel I'm just a few steps away from solving it, but I'm not experienced enough just yet. I've used business-duration for this.
I've looked through other similar answers to this and tried many methods, but this is the closest I have gotten (Using this answer). I'm using Anaconda and Spyder, which is the only method I have on my work laptop at the moment. I can't install some of the custom Business days functions into anaconda.
I have a large dataset (~200k rows) which I need to solve this for:
import pandas as pd
import business_duration as bd
import datetime as dt
import holidays as pyholidays
#Specify Business Working hours (8am - 5pm)
Bus_start_time = dt.time(8,00,0)
Bus_end_time = dt.time(17,0,0)
holidaylist = pyholidays.ZA()
unit='min'
# Named 'data' rather than 'list' so the built-in list() used further down still works
data = [[10, '2022-01-01 07:00:00', '2022-01-08 15:00:00'], [11, '2022-01-02 18:00:00', '2022-01-10 15:30:00'],
        [12, '2022-01-01 09:15:00', '2022-01-08 12:00:00'], [13, '2022-01-07 13:00:00', '2022-01-23 17:00:00']]
df = pd.DataFrame(data, columns =['ID', 'Start', 'End'])
print(df)
Which gives:
ID Start End
0 10 2022-01-01 07:00:00 2022-01-08 15:00:00
1 11 2022-01-02 18:00:00 2022-01-10 15:30:00
2 12 2022-01-01 09:15:00 2022-01-08 12:00:00
3 13 2022-01-07 13:00:00 2022-01-23 17:00:00
The next step works in testing single dates:
startdate = pd.to_datetime('2022-01-01 00:00:00')
enddate = pd.to_datetime('2022-01-14 23:00:00')
df['TimeAdj'] = bd.businessDuration(startdate,enddate,Bus_start_time,Bus_end_time,holidaylist=holidaylist,unit=unit)
print(df)
Which results in:
ID Start End TimeAdj
0 10 2022-01-01 07:00:00 2022-01-08 15:00:00 5400.0
1 11 2022-01-02 18:00:00 2022-01-10 15:30:00 5400.0
2 12 2022-01-01 09:15:00 2022-01-08 12:00:00 5400.0
3 13 2022-01-07 13:00:00 2022-01-23 17:00:00 5400.0
For some reason I have float values showing up, but I can fix that later.
Next, I need to have this calculation run per row in the dataframe.
I tried replacing the df columns in start date and end date, but got an error:
startdate = df['Start']
enddate = df['End']
print(bd.businessDuration(startdate,enddate,Bus_start_time,Bus_end_time,holidaylist=holidaylist,unit=unit))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I then checked the documentation for business-duration, and adjusted to the below:
from itertools import repeat
df['TimeAdj'] = list(map(bd.businessDuration,startdate,enddate,repeat(Bus_start_time),repeat(Bus_end_time),repeat(holidaylist),repeat(unit)))
AttributeError: 'str' object has no attribute 'date'
I'm hoping to end with the correct values in each row of the TimeAdj column (example figures added).
ID Start End TimeAdj
0 10 2022-01-01 07:00:00 2022-01-08 15:00:00 2300
1 11 2022-01-02 18:00:00 2022-01-10 15:30:00 2830
2 12 2022-01-01 09:15:00 2022-01-08 12:00:00 2115
3 13 2022-01-07 13:00:00 2022-01-23 17:00:00 4800
What do I need to adjust on this?
Use:
from functools import partial
# Convert strings to datetime
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
# Get holidays list
years = range(df['Start'].min().year, df['End'].max().year+1)
holidaylist = pyholidays.ZA(years=years).keys()
# Create a partial function as a shortcut
bduration = partial(bd.businessDuration,
starttime=Bus_start_time, endtime=Bus_end_time,
holidaylist=holidaylist, unit=unit)
# Compute business duration
df['TimeAdj'] = df.apply(lambda x: bduration(x['Start'], x['End']), axis=1)
Output:
>>> df
ID Start End TimeAdj
0 10 2022-01-01 07:00:00 2022-01-08 15:00:00 2700.0
1 11 2022-01-02 18:00:00 2022-01-10 15:30:00 3150.0
2 12 2022-01-01 09:15:00 2022-01-08 12:00:00 2700.0
3 13 2022-01-07 13:00:00 2022-01-23 17:00:00 5640.0
I have a dataframe with a DatetimeIndex in the format shown. The raw data is supposed to contain one record per hour for a year (24 records per day), but some hours and days are missing from the data.
How can I get a list of all the missing hours in the index?
Example: if hour 01 is missing, how can I find and print 2012-10-02 01:00:00?
I'm currently able to get the missing days, but not the missing hours.
missing_day = pd.date_range(start = mdf.index[0], end = mdf.index[-1]).difference(mdf.index)
missing_day = missing_day.strftime('%Y%m%d')
missing = pd.Series(missing_day).array
for i in missing:
    print(i)
    for x in range(24):
        x = str(x)
        m = i + x
        m = datetime.strptime(m,'%Y%m%d%H')
        print(m)
Output (prints all 24 hours for each missing day)
What would be the best way to list all of the missing datetimes?
Use a set difference to find the missing index values:
out = pd.date_range(df.index.min(), df.index.max(), freq='H').difference(df.index)
print(out)
# Output
DatetimeIndex(['2022-01-01 06:00:00', '2022-01-01 12:00:00',
'2022-01-01 14:00:00', '2022-01-01 16:00:00'],
dtype='datetime64[ns]', freq=None)
Setup:
df = pd.DataFrame({'A':[0]}, index=pd.date_range('2022-01-01', freq='H', periods=24))
df = df.sample(n=20).sort_index()
print(df)
# Output
A
2022-01-01 00:00:00 0
2022-01-01 01:00:00 0
2022-01-01 02:00:00 0
2022-01-01 03:00:00 0
2022-01-01 04:00:00 0
2022-01-01 05:00:00 0
2022-01-01 07:00:00 0
2022-01-01 08:00:00 0
2022-01-01 09:00:00 0
2022-01-01 10:00:00 0
2022-01-01 11:00:00 0
2022-01-01 13:00:00 0
2022-01-01 15:00:00 0
2022-01-01 17:00:00 0
2022-01-01 18:00:00 0
2022-01-01 19:00:00 0
2022-01-01 20:00:00 0
2022-01-01 21:00:00 0
2022-01-01 22:00:00 0
2022-01-01 23:00:00 0
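Because the setup above samples rows at random, the missing hours change on every run. Here is a small deterministic sketch of the same difference-based approach, using the 2012-10-02 01:00:00 example from the question. The frame name mdf is taken from the question; the value column is made up for this example.

```python
import pandas as pd

# Six consecutive hours, with 01:00 deliberately removed.
full = pd.date_range("2012-10-02 00:00", periods=6, freq="60min")
mdf = pd.DataFrame({"value": range(5)}, index=full.delete(1))

# Every hour that should exist between the first and last record.
expected = pd.date_range(mdf.index.min(), mdf.index.max(), freq="60min")

# Hours that are expected but absent from the index.
missing = expected.difference(mdf.index)

for ts in missing:
    print(ts)
```

The same call also catches fully missing days, since every hour of such a day is absent from the index. Gaps before the first record or after the last one are not detected unless the expected range is given explicitly.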