Pandas groupby, melt and drop in one go - python

I want to add a column to the dataframe with values (comments) based on the Timestamp, grouped per day.
I made it work as in the example below, but... is there any other, more "pandonic" way? Maybe a one-liner, or at least close to one?
Example dataframe (the actual one has many more dates and more distinct values):
import pandas as pd
data = {"Values": ["absd","abse", "dara", "absd","abse", "dara"],
"Date": ["2022-05-25","2022-05-25","2022-05-25", "2022-05-26","2022-05-26","2022-05-26"],
"Timestamp": ["2022-05-25 08:00:00", "2022-05-25 11:30:00", "2022-05-25 20:25:00",
"2022-05-26 09:00:00", "2022-05-26 13:40:00", "2022-05-26 19:15:00"]}
df = pd.DataFrame(data)
df.Timestamp = pd.to_datetime(df.Timestamp, format='%Y-%m-%d %H:%M:%S')
df.Date = pd.to_datetime(df.Date, format='%Y-%m-%d')
df out:
Values Date Timestamp
0 absd 2022-05-25 2022-05-25 08:00:00
1 abse 2022-05-25 2022-05-25 11:30:00
2 dara 2022-05-25 2022-05-25 20:25:00
3 absd 2022-05-26 2022-05-26 09:00:00
4 abse 2022-05-26 2022-05-26 13:40:00
5 dara 2022-05-26 2022-05-26 19:15:00
the end result I want is:
Values Date Period Datetime
0 absd 2022-05-25 Start 2022-05-25 08:00:00
1 abse 2022-05-25 Start 2022-05-25 08:00:00
2 dara 2022-05-25 Start 2022-05-25 08:00:00
3 dara 2022-05-25 Mid 2022-05-25 11:30:00
4 abse 2022-05-25 Mid 2022-05-25 11:30:00
5 absd 2022-05-25 Mid 2022-05-25 11:30:00
6 dara 2022-05-25 End 2022-05-25 20:25:00
7 abse 2022-05-25 End 2022-05-25 20:25:00
8 absd 2022-05-25 End 2022-05-25 20:25:00
9 dara 2022-05-26 Start 2022-05-26 09:00:00
10 abse 2022-05-26 Start 2022-05-26 09:00:00
11 absd 2022-05-26 Start 2022-05-26 09:00:00
12 absd 2022-05-26 Mid 2022-05-26 13:40:00
13 abse 2022-05-26 Mid 2022-05-26 13:40:00
14 dara 2022-05-26 Mid 2022-05-26 13:40:00
15 absd 2022-05-26 End 2022-05-26 19:15:00
16 abse 2022-05-26 End 2022-05-26 19:15:00
17 dara 2022-05-26 End 2022-05-26 19:15:00
my working approach is below:
df["Start"] = df["Timestamp"].groupby(df["Date"]).transform("min")
df["End"] = df["Timestamp"].groupby(df["Date"]).transform("max")
df["Mid"] = df["Timestamp"].groupby(df["Date"]).transform("median")
df1 = df.melt(id_vars = ["Values","Date"],
var_name="Period",value_name="Datetime").sort_values("Datetime")
df1 = df1[df1.Period != "Timestamp"].reset_index(drop=True)
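The three transform calls above can also be driven by a single mapping from period name to aggregation, which keeps the same approach but trims the repetition (a sketch of the same idea on the sample data, not a new algorithm):

```python
import pandas as pd

df = pd.DataFrame({
    "Values": ["absd", "abse", "dara"] * 2,
    "Date": pd.to_datetime(["2022-05-25"] * 3 + ["2022-05-26"] * 3),
    "Timestamp": pd.to_datetime(["2022-05-25 08:00:00", "2022-05-25 11:30:00", "2022-05-25 20:25:00",
                                 "2022-05-26 09:00:00", "2022-05-26 13:40:00", "2022-05-26 19:15:00"]),
})

# one transform per period, driven by a dict instead of three near-identical lines
for period, func in {"Start": "min", "Mid": "median", "End": "max"}.items():
    df[period] = df.groupby("Date")["Timestamp"].transform(func)

df1 = (df.melt(id_vars=["Values", "Date"], var_name="Period", value_name="Datetime")
         .query('Period != "Timestamp"')
         .sort_values("Datetime", ignore_index=True))
```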

From the end result dataframe, it looks like you need a combination of all the columns (well, a combination of the Values column and the ('Date', 'Timestamp') columns).
One option is with complete from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
(df
 .assign(Period=['Start', 'Mid', 'End'] * 2)
 .complete(('Date', 'Timestamp', 'Period'), 'Values')
)
Values Date Timestamp Period
0 absd 2022-05-25 2022-05-25 08:00:00 Start
1 abse 2022-05-25 2022-05-25 08:00:00 Start
2 dara 2022-05-25 2022-05-25 08:00:00 Start
3 absd 2022-05-25 2022-05-25 11:30:00 Mid
4 abse 2022-05-25 2022-05-25 11:30:00 Mid
5 dara 2022-05-25 2022-05-25 11:30:00 Mid
6 absd 2022-05-25 2022-05-25 20:25:00 End
7 abse 2022-05-25 2022-05-25 20:25:00 End
8 dara 2022-05-25 2022-05-25 20:25:00 End
9 absd 2022-05-26 2022-05-26 09:00:00 Start
10 abse 2022-05-26 2022-05-26 09:00:00 Start
11 dara 2022-05-26 2022-05-26 09:00:00 Start
12 absd 2022-05-26 2022-05-26 13:40:00 Mid
13 abse 2022-05-26 2022-05-26 13:40:00 Mid
14 dara 2022-05-26 2022-05-26 13:40:00 Mid
15 absd 2022-05-26 2022-05-26 19:15:00 End
16 abse 2022-05-26 2022-05-26 19:15:00 End
17 dara 2022-05-26 2022-05-26 19:15:00 End

Using only pandas:
(
    df['Timestamp'].groupby(df['Date']).agg(['min', 'median', 'max'])
    .merge(df, on='Date')
    .melt(id_vars=['Values', 'Date'], var_name='Period', value_name='Datetime')
    .query('Period != "Timestamp"')
    .sort_values('Datetime')
)
Output:
Values Date Period Datetime
0 absd 2022-05-25 min 2022-05-25 08:00:00
1 abse 2022-05-25 min 2022-05-25 08:00:00
2 dara 2022-05-25 min 2022-05-25 08:00:00
7 abse 2022-05-25 median 2022-05-25 11:30:00
6 absd 2022-05-25 median 2022-05-25 11:30:00
8 dara 2022-05-25 median 2022-05-25 11:30:00
12 absd 2022-05-25 max 2022-05-25 20:25:00
13 abse 2022-05-25 max 2022-05-25 20:25:00
14 dara 2022-05-25 max 2022-05-25 20:25:00
4 abse 2022-05-26 min 2022-05-26 09:00:00
3 absd 2022-05-26 min 2022-05-26 09:00:00
5 dara 2022-05-26 min 2022-05-26 09:00:00
9 absd 2022-05-26 median 2022-05-26 13:40:00
10 abse 2022-05-26 median 2022-05-26 13:40:00
11 dara 2022-05-26 median 2022-05-26 13:40:00
16 abse 2022-05-26 max 2022-05-26 19:15:00
15 absd 2022-05-26 max 2022-05-26 19:15:00
17 dara 2022-05-26 max 2022-05-26 19:15:00
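One detail: this output labels the periods min/median/max rather than Start/Mid/End from the question. If those labels are needed, renaming the aggregated columns before melting restores them (a hedged variation on the same chain, with the sample frame rebuilt inline):

```python
import pandas as pd

df = pd.DataFrame({
    "Values": ["absd", "abse", "dara"] * 2,
    "Date": pd.to_datetime(["2022-05-25"] * 3 + ["2022-05-26"] * 3),
    "Timestamp": pd.to_datetime(["2022-05-25 08:00:00", "2022-05-25 11:30:00", "2022-05-25 20:25:00",
                                 "2022-05-26 09:00:00", "2022-05-26 13:40:00", "2022-05-26 19:15:00"]),
})

out = (df.groupby('Date')['Timestamp'].agg(['min', 'median', 'max'])
         .rename(columns={'min': 'Start', 'median': 'Mid', 'max': 'End'})  # label before melting
         .reset_index()
         .merge(df[['Values', 'Date']], on='Date')
         .melt(id_vars=['Values', 'Date'], var_name='Period', value_name='Datetime')
         .sort_values('Datetime', ignore_index=True))
```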

Another pandas only method:
out = (df.groupby('Date')
         .agg({'Timestamp': ['min', 'median', 'max'], 'Values': list})
         .explode(('Values', 'list'))
         .droplevel(0, axis=1)
         .rename(columns={'list': 'Values'})
         .reset_index()
         .melt(['Values', 'Date'], var_name='Period', value_name='Datetime')
         .sort_values('Datetime', ignore_index=True))
print(out)
Output:
Values Date Period Datetime
0 absd 2022-05-25 min 2022-05-25 08:00:00
1 abse 2022-05-25 min 2022-05-25 08:00:00
2 dara 2022-05-25 min 2022-05-25 08:00:00
3 abse 2022-05-25 median 2022-05-25 11:30:00
4 absd 2022-05-25 median 2022-05-25 11:30:00
5 dara 2022-05-25 median 2022-05-25 11:30:00
6 absd 2022-05-25 max 2022-05-25 20:25:00
7 abse 2022-05-25 max 2022-05-25 20:25:00
8 dara 2022-05-25 max 2022-05-25 20:25:00
9 abse 2022-05-26 min 2022-05-26 09:00:00
10 absd 2022-05-26 min 2022-05-26 09:00:00
11 dara 2022-05-26 min 2022-05-26 09:00:00
12 absd 2022-05-26 median 2022-05-26 13:40:00
13 abse 2022-05-26 median 2022-05-26 13:40:00
14 dara 2022-05-26 median 2022-05-26 13:40:00
15 abse 2022-05-26 max 2022-05-26 19:15:00
16 absd 2022-05-26 max 2022-05-26 19:15:00
17 dara 2022-05-26 max 2022-05-26 19:15:00

Related

Create a new DataFrame using pandas date_range

I have the following DataFrame:
date_start date_end
0 2023-01-01 16:00:00 2023-01-01 17:00:00
1 2023-01-02 16:00:00 2023-01-02 17:00:00
2 2023-01-03 16:00:00 2023-01-03 17:00:00
3 2023-01-04 17:00:00 2023-01-04 19:00:00
4 NaN NaN
and I want to create a new DataFrame which will contain values starting from the date_start and ending at the date_end of each row.
So for the first row by using the code below:
new_df = pd.Series(pd.date_range(start=df['date_start'][0], end=df['date_end'][0], freq= '15min'))
I get the following:
0 2023-01-01 16:00:00
1 2023-01-01 16:15:00
2 2023-01-01 16:30:00
3 2023-01-01 16:45:00
4 2023-01-01 17:00:00
How can I get the same result for all the rows of the df combined in a new df?
You can use a list comprehension and concat:
out = pd.concat([pd.DataFrame({'date': pd.date_range(start=start, end=end,
                                                     freq='15min')})
                 for start, end in zip(df['date_start'], df['date_end'])],
                ignore_index=True)
Output:
date
0 2023-01-01 16:00:00
1 2023-01-01 16:15:00
2 2023-01-01 16:30:00
3 2023-01-01 16:45:00
4 2023-01-01 17:00:00
5 2023-01-02 16:00:00
6 2023-01-02 16:15:00
7 2023-01-02 16:30:00
8 2023-01-02 16:45:00
9 2023-01-02 17:00:00
10 2023-01-03 16:00:00
11 2023-01-03 16:15:00
12 2023-01-03 16:30:00
13 2023-01-03 16:45:00
14 2023-01-03 17:00:00
15 2023-01-04 17:00:00
16 2023-01-04 17:15:00
17 2023-01-04 17:30:00
18 2023-01-04 17:45:00
19 2023-01-04 18:00:00
20 2023-01-04 18:15:00
21 2023-01-04 18:30:00
22 2023-01-04 18:45:00
23 2023-01-04 19:00:00
handling NAs:
out = pd.concat([pd.DataFrame({'date': pd.date_range(start=start, end=end,
                                                     freq='15min')})
                 for start, end in zip(df['date_start'], df['date_end'])
                 if pd.notna(start) and pd.notna(end)],
                ignore_index=True)
Adding to the previous answer: date_range has a to_series() method, so you could also proceed like this:
pd.concat(
    [
        pd.date_range(start=row['date_start'], end=row['date_end'], freq='15min').to_series()
        for _, row in df.iterrows()
    ], ignore_index=True
)

Pandas dataframe with hourly data: Calculating sums for specific times

There is a dataframe with hourly data, e.g.:
DATE TIME Amount
2022-11-07 21:00:00 10
2022-11-07 22:00:00 11
2022-11-08 07:00:00 10
2022-11-08 08:00:00 13
2022-11-08 09:00:00 12
2022-11-08 10:00:00 11
2022-11-08 11:00:00 13
2022-11-08 12:00:00 12
2022-11-08 13:00:00 10
2022-11-08 14:00:00 9
...
I would like to add a new column sum_morning where I calculate the sum of "Amount" for the morning hours only (07:00 - 12:00):
DATE TIME Amount sum_morning
2022-11-07 21:00:00 10 NaN
2022-11-07 22:00:00 11 NaN
2022-11-08 07:00:00 10 NaN
2022-11-08 08:00:00 13 NaN
2022-11-08 09:00:00 12 NaN
2022-11-08 10:00:00 11 NaN
2022-11-08 11:00:00 13 NaN
2022-11-08 12:00:00 12 71
2022-11-08 13:00:00 10 NaN
2022-11-08 14:00:00 9 NaN
...
There can be gaps in the dataframe (e.g. from 22:00 - 07:00), so shift is probably not working here.
I thought about
creating a new dataframe where I filter all time slices from 07:00 - 12:00 for all dates
do a group by and calculate the sum for each day
and then merge this back to the original df.
But maybe there is a more effective solution?
I really enjoy working with Python / pandas, but hourly data still makes my head spin.
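For reference, the filter / groupby / merge plan sketched in the question can be written down directly (a plain transcription of those three steps on the sample data; the answers below are more compact):

```python
import pandas as pd

df = pd.DataFrame({
    'DATE': ['2022-11-07'] * 2 + ['2022-11-08'] * 8,
    'TIME': ['21:00:00', '22:00:00', '07:00:00', '08:00:00', '09:00:00',
             '10:00:00', '11:00:00', '12:00:00', '13:00:00', '14:00:00'],
    'Amount': [10, 11, 10, 13, 12, 11, 13, 12, 10, 9],
})

# step 1: keep only the 07:00-12:00 slice
morning = df[pd.to_timedelta(df['TIME']).between('7h', '12h')]

# step 2: sum the morning amounts per day
sums = morning.groupby('DATE')['Amount'].sum()

# step 3: attach each day's sum to the last morning row of that day
last_rows = morning.groupby('DATE').tail(1).index
df['sum_morning'] = pd.Series(sums.loc[df.loc[last_rows, 'DATE']].values, index=last_rows)
```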
First set a DatetimeIndex in order to use DataFrame.between_time, then groupby DATE and aggregate by sum. Finally, get the last value of datetimes per day, in order to match the index of the original DataFrame:
df.index = pd.to_datetime(df['DATE'] + ' ' + df['TIME'])
s = (df.between_time('7:00', '12:00')
       .reset_index()
       .groupby('DATE')
       .agg({'Amount': 'sum', 'index': 'last'})
       .set_index('index')['Amount'])
df['sum_morning'] = s
print(df)
DATE TIME Amount sum_morning
2022-11-07 21:00:00 2022-11-07 21:00:00 10 NaN
2022-11-07 22:00:00 2022-11-07 22:00:00 11 NaN
2022-11-08 07:00:00 2022-11-08 07:00:00 10 NaN
2022-11-08 08:00:00 2022-11-08 08:00:00 13 NaN
2022-11-08 09:00:00 2022-11-08 09:00:00 12 NaN
2022-11-08 10:00:00 2022-11-08 10:00:00 11 NaN
2022-11-08 11:00:00 2022-11-08 11:00:00 13 NaN
2022-11-08 12:00:00 2022-11-08 12:00:00 12 71.0
2022-11-08 13:00:00 2022-11-08 13:00:00 10 NaN
2022-11-08 14:00:00 2022-11-08 14:00:00 9 NaN
Lastly, if you need to remove DatetimeIndex you can use:
df = df.reset_index(drop=True)
You can use:
# get values between 7h and 12h
m = pd.to_timedelta(df['TIME']).between('7h', '12h')
# find the last True per day
idx = m & m.groupby(df['DATE']).shift(-1).ne(True)
# assign the sum of the 7-12h values on the last True per day
df.loc[idx, 'sum_morning'] = df['Amount'].where(m).groupby(df['DATE']).transform('sum')
Output:
DATE TIME Amount sum_morning
0 2022-11-07 21:00:00 10 NaN
1 2022-11-07 22:00:00 11 NaN
2 2022-11-08 07:00:00 10 NaN
3 2022-11-08 08:00:00 13 NaN
4 2022-11-08 09:00:00 12 NaN
5 2022-11-08 10:00:00 11 NaN
6 2022-11-08 11:00:00 13 NaN
7 2022-11-08 12:00:00 12 71.0
8 2022-11-08 13:00:00 10 NaN
9 2022-11-08 14:00:00 9 NaN

Python Datetime calculation

I would like to create a function in Python for this calculation: if the end time of a shift falls after 20:00 (and before 06:00), it has to add an extra 25% in minutes for each hour worked after 20:00.
Any suggestions?
UPDATED:
Here is a way to do what I believe your question asks:
from datetime import datetime, timedelta
def getHours(startTime, endTime, extraFraction):
    if endTime < startTime:
        raise ValueError(f'endTime {endTime} is before startTime {startTime}')
    startDateStr = startTime.strftime("%Y-%m-%d")
    bonusStartTime = datetime.strptime(startDateStr + " " + "20:00:00", "%Y-%m-%d %H:%M:%S")
    prevBonusEndTime = datetime.strptime(startTime.strftime("%Y-%m-%d") + " " + "06:00:00", "%Y-%m-%d %H:%M:%S")
    bonusEndTime = prevBonusEndTime + timedelta(days=1)
    bonusPeriod = timedelta(days=0)
    duration = endTime - startTime
    hours = duration.total_seconds() // 3600
    if hours > 24:
        fullDays = hours // 24
        bonusPeriod += fullDays * (bonusEndTime - bonusStartTime)
        endTime -= timedelta(days=fullDays)
    if startTime < prevBonusEndTime:
        bonusPeriod += prevBonusEndTime - startTime
    if endTime < prevBonusEndTime:
        bonusPeriod -= prevBonusEndTime - endTime
    if startTime > bonusStartTime:
        bonusPeriod -= startTime - bonusStartTime
    if endTime > bonusStartTime:
        bonusPeriod += min(endTime, bonusEndTime) - bonusStartTime
    delta = duration + bonusPeriod * extraFraction
    return delta
Explanation:
confirm startTime is before endTime, otherwise raise an exception
set the following:
prevBonusEndTime as 06:00 on the day of startTime (the end of the previous night's bonus window)
bonusStartTime as 20:00 on the day of startTime
bonusEndTime as 06:00 on the day after startTime
if endTime is more than 24 hours after startTime, add one full bonus window per full day to bonusPeriod and rewind endTime by the number of full days (24-hour periods) by which it exceeds startTime
add to or subtract from bonusPeriod the hours (in addition to any counted above) of overlap between [startTime, endTime] and the intervals [00:00, prevBonusEndTime] and [bonusStartTime, bonusEndTime].
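The add/subtract adjustments in that last step amount to clamping the shift against each bonus window; a generic interval-overlap helper (a hypothetical simplification, not the code above) makes that explicit:

```python
from datetime import datetime, timedelta

def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of [a_start, a_end] and [b_start, b_end]."""
    return max(timedelta(0), min(a_end, b_end) - max(a_start, b_start))

# an 18:00-23:00 shift overlaps the 20:00-06:00 bonus window by 3 hours
bonus = overlap(datetime(2022, 5, 26, 18, 0), datetime(2022, 5, 26, 23, 0),
                datetime(2022, 5, 26, 20, 0), datetime(2022, 5, 27, 6, 0))
```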
Test code:
def testing(start, end):
    print(f'start {start}, end {end}, actual hours {getHours(start, end, 0)}, effective hours {getHours(start, end, 0.25)}')

startTime = datetime.strptime("2022-05-26 06:00:00", "%Y-%m-%d %H:%M:%S")
endTime = startTime
for h in range(0, 48, 3):
    testing(startTime, endTime + timedelta(hours=h))
endTime += timedelta(hours=48)
for h in range(0, 48, 3):
    testing(startTime + timedelta(hours=h), endTime)
Output:
start 2022-05-26 06:00:00, end 2022-05-26 06:00:00, actual hours 0:00:00, effective hours 0:00:00
start 2022-05-26 06:00:00, end 2022-05-26 09:00:00, actual hours 3:00:00, effective hours 3:00:00
start 2022-05-26 06:00:00, end 2022-05-26 12:00:00, actual hours 6:00:00, effective hours 6:00:00
start 2022-05-26 06:00:00, end 2022-05-26 15:00:00, actual hours 9:00:00, effective hours 9:00:00
start 2022-05-26 06:00:00, end 2022-05-26 18:00:00, actual hours 12:00:00, effective hours 12:00:00
start 2022-05-26 06:00:00, end 2022-05-26 21:00:00, actual hours 15:00:00, effective hours 15:15:00
start 2022-05-26 06:00:00, end 2022-05-27 00:00:00, actual hours 18:00:00, effective hours 19:00:00
start 2022-05-26 06:00:00, end 2022-05-27 03:00:00, actual hours 21:00:00, effective hours 22:45:00
start 2022-05-26 06:00:00, end 2022-05-27 06:00:00, actual hours 1 day, 0:00:00, effective hours 1 day, 2:30:00
start 2022-05-26 06:00:00, end 2022-05-27 09:00:00, actual hours 1 day, 3:00:00, effective hours 1 day, 5:30:00
start 2022-05-26 06:00:00, end 2022-05-27 12:00:00, actual hours 1 day, 6:00:00, effective hours 1 day, 8:30:00
start 2022-05-26 06:00:00, end 2022-05-27 15:00:00, actual hours 1 day, 9:00:00, effective hours 1 day, 11:30:00
start 2022-05-26 06:00:00, end 2022-05-27 18:00:00, actual hours 1 day, 12:00:00, effective hours 1 day, 14:30:00
start 2022-05-26 06:00:00, end 2022-05-27 21:00:00, actual hours 1 day, 15:00:00, effective hours 1 day, 17:45:00
start 2022-05-26 06:00:00, end 2022-05-28 00:00:00, actual hours 1 day, 18:00:00, effective hours 1 day, 21:30:00
start 2022-05-26 06:00:00, end 2022-05-28 03:00:00, actual hours 1 day, 21:00:00, effective hours 2 days, 1:15:00
start 2022-05-26 06:00:00, end 2022-05-28 06:00:00, actual hours 2 days, 0:00:00, effective hours 2 days, 5:00:00
start 2022-05-26 09:00:00, end 2022-05-28 06:00:00, actual hours 1 day, 21:00:00, effective hours 2 days, 2:00:00
start 2022-05-26 12:00:00, end 2022-05-28 06:00:00, actual hours 1 day, 18:00:00, effective hours 1 day, 23:00:00
start 2022-05-26 15:00:00, end 2022-05-28 06:00:00, actual hours 1 day, 15:00:00, effective hours 1 day, 20:00:00
start 2022-05-26 18:00:00, end 2022-05-28 06:00:00, actual hours 1 day, 12:00:00, effective hours 1 day, 17:00:00
start 2022-05-26 21:00:00, end 2022-05-28 06:00:00, actual hours 1 day, 9:00:00, effective hours 1 day, 13:45:00
start 2022-05-27 00:00:00, end 2022-05-28 06:00:00, actual hours 1 day, 6:00:00, effective hours 1 day, 10:00:00
start 2022-05-27 03:00:00, end 2022-05-28 06:00:00, actual hours 1 day, 3:00:00, effective hours 1 day, 6:15:00
start 2022-05-27 06:00:00, end 2022-05-28 06:00:00, actual hours 1 day, 0:00:00, effective hours 1 day, 2:30:00
start 2022-05-27 09:00:00, end 2022-05-28 06:00:00, actual hours 21:00:00, effective hours 23:30:00
start 2022-05-27 12:00:00, end 2022-05-28 06:00:00, actual hours 18:00:00, effective hours 20:30:00
start 2022-05-27 15:00:00, end 2022-05-28 06:00:00, actual hours 15:00:00, effective hours 17:30:00
start 2022-05-27 18:00:00, end 2022-05-28 06:00:00, actual hours 12:00:00, effective hours 14:30:00
start 2022-05-27 21:00:00, end 2022-05-28 06:00:00, actual hours 9:00:00, effective hours 11:15:00
start 2022-05-28 00:00:00, end 2022-05-28 06:00:00, actual hours 6:00:00, effective hours 7:30:00
start 2022-05-28 03:00:00, end 2022-05-28 06:00:00, actual hours 3:00:00, effective hours 3:45:00
UPDATE #2:
Here is slightly modified code that outputs regular hours, bonus hours (i.e., hours in the bonus window from 20:00 to 06:00) and extra hours (25% * bonus hours):
from datetime import datetime, timedelta
def getRegularAndBonusHours(startTime, endTime):
    if endTime < startTime:
        raise ValueError(f'endTime {endTime} is before startTime {startTime}')
    startDateStr = startTime.strftime("%Y-%m-%d")
    bonusStartTime = datetime.strptime(startDateStr + " " + "20:00:00", "%Y-%m-%d %H:%M:%S")
    prevBonusEndTime = datetime.strptime(startTime.strftime("%Y-%m-%d") + " " + "06:00:00", "%Y-%m-%d %H:%M:%S")
    bonusEndTime = prevBonusEndTime + timedelta(days=1)
    bonusPeriod = timedelta(days=0)
    duration = endTime - startTime
    hours = duration.total_seconds() // 3600
    if hours > 24:
        fullDays = hours // 24
        bonusPeriod += fullDays * (bonusEndTime - bonusStartTime)
        endTime -= timedelta(days=fullDays)
    if startTime < prevBonusEndTime:
        bonusPeriod += prevBonusEndTime - startTime
    if endTime < prevBonusEndTime:
        bonusPeriod -= prevBonusEndTime - endTime
    if startTime > bonusStartTime:
        bonusPeriod -= startTime - bonusStartTime
    if endTime > bonusStartTime:
        bonusPeriod += min(endTime, bonusEndTime) - bonusStartTime
    return duration, bonusPeriod

def getHours(startTime, endTime, extraFraction):
    duration, bonusPeriod = getRegularAndBonusHours(startTime, endTime)
    delta = duration + bonusPeriod * extraFraction
    return delta

def testing(start, end):
    duration, bonusPeriod = getRegularAndBonusHours(start, end)
    def getHoursRoundedUp(delta):
        return delta.days * 24 + delta.seconds // 3600 + (1 if delta.seconds % 3600 else 0)
    regularHours, bonusHours = getHoursRoundedUp(duration), getHoursRoundedUp(bonusPeriod)
    print(f'start {start}, end {end}, regular {regularHours}, bonus {bonusHours}, extra {0.25 * bonusHours}')

startTime = datetime.strptime("2022-05-26 06:00:00", "%Y-%m-%d %H:%M:%S")
endTime = startTime
for h in range(0, 48, 3):
    testing(startTime, endTime + timedelta(hours=h))
endTime += timedelta(hours=48)
for h in range(0, 48, 3):
    testing(startTime + timedelta(hours=h), endTime)
Output:
start 2022-05-26 06:00:00, end 2022-05-26 06:00:00, regular 0, bonus 0, extra 0.0
start 2022-05-26 06:00:00, end 2022-05-26 09:00:00, regular 3, bonus 0, extra 0.0
start 2022-05-26 06:00:00, end 2022-05-26 12:00:00, regular 6, bonus 0, extra 0.0
start 2022-05-26 06:00:00, end 2022-05-26 15:00:00, regular 9, bonus 0, extra 0.0
start 2022-05-26 06:00:00, end 2022-05-26 18:00:00, regular 12, bonus 0, extra 0.0
start 2022-05-26 06:00:00, end 2022-05-26 21:00:00, regular 15, bonus 1, extra 0.25
start 2022-05-26 06:00:00, end 2022-05-27 00:00:00, regular 18, bonus 4, extra 1.0
start 2022-05-26 06:00:00, end 2022-05-27 03:00:00, regular 21, bonus 7, extra 1.75
start 2022-05-26 06:00:00, end 2022-05-27 06:00:00, regular 24, bonus 10, extra 2.5
start 2022-05-26 06:00:00, end 2022-05-27 09:00:00, regular 27, bonus 10, extra 2.5
start 2022-05-26 06:00:00, end 2022-05-27 12:00:00, regular 30, bonus 10, extra 2.5
start 2022-05-26 06:00:00, end 2022-05-27 15:00:00, regular 33, bonus 10, extra 2.5
start 2022-05-26 06:00:00, end 2022-05-27 18:00:00, regular 36, bonus 10, extra 2.5
start 2022-05-26 06:00:00, end 2022-05-27 21:00:00, regular 39, bonus 11, extra 2.75
start 2022-05-26 06:00:00, end 2022-05-28 00:00:00, regular 42, bonus 14, extra 3.5
start 2022-05-26 06:00:00, end 2022-05-28 03:00:00, regular 45, bonus 17, extra 4.25
start 2022-05-26 06:00:00, end 2022-05-28 06:00:00, regular 48, bonus 20, extra 5.0
start 2022-05-26 09:00:00, end 2022-05-28 06:00:00, regular 45, bonus 20, extra 5.0
start 2022-05-26 12:00:00, end 2022-05-28 06:00:00, regular 42, bonus 20, extra 5.0
start 2022-05-26 15:00:00, end 2022-05-28 06:00:00, regular 39, bonus 20, extra 5.0
start 2022-05-26 18:00:00, end 2022-05-28 06:00:00, regular 36, bonus 20, extra 5.0
start 2022-05-26 21:00:00, end 2022-05-28 06:00:00, regular 33, bonus 19, extra 4.75
start 2022-05-27 00:00:00, end 2022-05-28 06:00:00, regular 30, bonus 16, extra 4.0
start 2022-05-27 03:00:00, end 2022-05-28 06:00:00, regular 27, bonus 13, extra 3.25
start 2022-05-27 06:00:00, end 2022-05-28 06:00:00, regular 24, bonus 10, extra 2.5
start 2022-05-27 09:00:00, end 2022-05-28 06:00:00, regular 21, bonus 10, extra 2.5
start 2022-05-27 12:00:00, end 2022-05-28 06:00:00, regular 18, bonus 10, extra 2.5
start 2022-05-27 15:00:00, end 2022-05-28 06:00:00, regular 15, bonus 10, extra 2.5
start 2022-05-27 18:00:00, end 2022-05-28 06:00:00, regular 12, bonus 10, extra 2.5
start 2022-05-27 21:00:00, end 2022-05-28 06:00:00, regular 9, bonus 9, extra 2.25
start 2022-05-28 00:00:00, end 2022-05-28 06:00:00, regular 6, bonus 6, extra 1.5
start 2022-05-28 03:00:00, end 2022-05-28 06:00:00, regular 3, bonus 3, extra 0.75
UPDATE #3
Latest clarification from the OP in a comment indicates:
a need to record in Excel the allowances received for night work
the goal is an Excel sheet that separately lists, for each shift, the start time, end time, working time (without supplement), and the night-work supplement (25% from 20:00 to 06:00, for each hour started of night work).
Here is updated code to create the required data result, and optionally to use a pandas dataframe to put this into an Excel file. Test inputs are used to explore a range of start and end times, including partial hours:
from datetime import datetime, timedelta
def getRegularAndBonusHours(startTime, endTime):
    if endTime < startTime:
        raise ValueError(f'endTime {endTime} is before startTime {startTime}')
    startDateStr = startTime.strftime("%Y-%m-%d")
    bonusStartTime = datetime.strptime(startDateStr + " " + "20:00:00", "%Y-%m-%d %H:%M:%S")
    prevBonusEndTime = datetime.strptime(startTime.strftime("%Y-%m-%d") + " " + "06:00:00", "%Y-%m-%d %H:%M:%S")
    bonusEndTime = prevBonusEndTime + timedelta(days=1)
    bonusPeriod = timedelta(days=0)
    duration = endTime - startTime
    hours = duration.total_seconds() // 3600
    if hours > 24:
        fullDays = hours // 24
        bonusPeriod += fullDays * (bonusEndTime - bonusStartTime)
        endTime -= timedelta(days=fullDays)
    if startTime < prevBonusEndTime:
        bonusPeriod += prevBonusEndTime - startTime
    if endTime < prevBonusEndTime:
        bonusPeriod -= prevBonusEndTime - endTime
    if startTime > bonusStartTime:
        bonusPeriod -= startTime - bonusStartTime
    if endTime > bonusStartTime:
        bonusPeriod += min(endTime, bonusEndTime) - bonusStartTime
    return duration, bonusPeriod

def testing(start, end):
    duration, bonusPeriod = getRegularAndBonusHours(start, end)
    def getHoursFromDelta(delta, roundUp=False):
        return delta.days * 24 + (delta.seconds // 3600 + (1 if delta.seconds % 3600 else 0)) if roundUp else (delta.seconds / 3600)
    fullHours, bonusHours = getHoursFromDelta(duration + bonusPeriod), getHoursFromDelta(bonusPeriod, True)
    return start, end, fullHours, bonusHours * 0.25

# calculate test results
results = []
startTime = datetime.strptime("2022-05-26 06:00:00", "%Y-%m-%d %H:%M:%S")
endTime = startTime
for halfHours in range(0, 2 * 48, 5):
    results.append(testing(startTime, endTime + timedelta(hours=halfHours / 2)))
endTime += timedelta(hours=48)
for halfHours in range(0, 2 * 48, 5):
    results.append(testing(startTime + timedelta(hours=halfHours / 2), endTime))

# print results
headings = ['Start Time', 'End Time', 'Working Hours', '25% of Supplemental Hours Started']
[print(f'{x:30}', end='') for x in headings]
[[print(f'{f"{x}":30}', end='') for x in row] for row in results if print() or True]
print()

# OPTIONAL: save results in pandas dataframe and save as Excel file
import pandas as pd
df = pd.DataFrame(results, columns=headings)
print(df)
with pd.ExcelWriter('TestTimesheet.xlsx') as writer:
    df.to_excel(writer, index=None, sheet_name='Timesheet')
    ws = writer.sheets['Timesheet']
    for column in df:
        column_length = max(df[column].astype(str).map(len).max(), len(column))
        col_idx = df.columns.get_loc(column)
        ws.column_dimensions[chr(ord('A') + col_idx)].width = column_length
Output:
Start Time End Time Working Hours 25% of Supplemental Hours Started
0 2022-05-26 06:00:00 2022-05-26 06:00:00 0.0 0.00
1 2022-05-26 06:00:00 2022-05-26 08:30:00 2.5 0.00
2 2022-05-26 06:00:00 2022-05-26 11:00:00 5.0 0.00
3 2022-05-26 06:00:00 2022-05-26 13:30:00 7.5 0.00
4 2022-05-26 06:00:00 2022-05-26 16:00:00 10.0 0.00
5 2022-05-26 06:00:00 2022-05-26 18:30:00 12.5 0.00
6 2022-05-26 06:00:00 2022-05-26 21:00:00 16.0 0.25
7 2022-05-26 06:00:00 2022-05-26 23:30:00 21.0 1.00
8 2022-05-26 06:00:00 2022-05-27 02:00:00 2.0 1.50
9 2022-05-26 06:00:00 2022-05-27 04:30:00 7.0 2.25
10 2022-05-26 06:00:00 2022-05-27 07:00:00 11.0 2.50
11 2022-05-26 06:00:00 2022-05-27 09:30:00 13.5 2.50
12 2022-05-26 06:00:00 2022-05-27 12:00:00 16.0 2.50
13 2022-05-26 06:00:00 2022-05-27 14:30:00 18.5 2.50
14 2022-05-26 06:00:00 2022-05-27 17:00:00 21.0 2.50
15 2022-05-26 06:00:00 2022-05-27 19:30:00 23.5 2.50
16 2022-05-26 06:00:00 2022-05-27 22:00:00 4.0 3.00
17 2022-05-26 06:00:00 2022-05-28 00:30:00 9.0 3.75
18 2022-05-26 06:00:00 2022-05-28 03:00:00 14.0 4.25
19 2022-05-26 06:00:00 2022-05-28 05:30:00 19.0 5.00
20 2022-05-26 06:00:00 2022-05-28 06:00:00 20.0 5.00
21 2022-05-26 08:30:00 2022-05-28 06:00:00 17.5 5.00
22 2022-05-26 11:00:00 2022-05-28 06:00:00 15.0 5.00
23 2022-05-26 13:30:00 2022-05-28 06:00:00 12.5 5.00
24 2022-05-26 16:00:00 2022-05-28 06:00:00 10.0 5.00
25 2022-05-26 18:30:00 2022-05-28 06:00:00 7.5 5.00
26 2022-05-26 21:00:00 2022-05-28 06:00:00 4.0 4.75
27 2022-05-26 23:30:00 2022-05-28 06:00:00 23.0 4.25
28 2022-05-27 02:00:00 2022-05-28 06:00:00 18.0 3.50
29 2022-05-27 04:30:00 2022-05-28 06:00:00 13.0 3.00
30 2022-05-27 07:00:00 2022-05-28 06:00:00 9.0 2.50
31 2022-05-27 09:30:00 2022-05-28 06:00:00 6.5 2.50
32 2022-05-27 12:00:00 2022-05-28 06:00:00 4.0 2.50
33 2022-05-27 14:30:00 2022-05-28 06:00:00 1.5 2.50
34 2022-05-27 17:00:00 2022-05-28 06:00:00 23.0 2.50
35 2022-05-27 19:30:00 2022-05-28 06:00:00 20.5 2.50
36 2022-05-27 22:00:00 2022-05-28 06:00:00 16.0 2.00
37 2022-05-28 00:30:00 2022-05-28 06:00:00 11.0 1.50
38 2022-05-28 03:00:00 2022-05-28 06:00:00 6.0 0.75
39 2022-05-28 05:30:00 2022-05-28 06:00:00 1.0 0.25
from datetime import datetime, timedelta
soup_dienstbegin = '20:00'
soup_dienstende = '23:00'
startinfo = f'{soup_datum} {soup_dienstbegin}'
IcsStartData = datetime.strptime(startinfo, "%d.%m.%Y %H:%M")
endTime = endingtime
startTime = IcsStartData

def zeit_zuschlag_function(startTime, endTime):
    if endTime < startTime:
        raise ValueError(f'endTime {endTime} is before startTime {startTime}')
    startDateStr = startTime.strftime("%d.%m.%Y")
    zeit_zuschlag_start = datetime.strptime(startDateStr + " " + "22:00", "%d.%m.%Y %H:%M")
    zeit_zuschlag_end = datetime.strptime(startTime.strftime("%d.%m.%Y") + " " + "00:00", "%d.%m.%Y %H:%M")
    zeit_zuschlag_time = zeit_zuschlag_end + timedelta(days=1)
    zeit_zuschlag_zeit = timedelta(days=0)
    dienstdauer = endTime - startTime
    hours = dienstdauer.total_seconds() // 3600
    if hours > 24:
        fullDays = hours // 24
        zeit_zuschlag_zeit += fullDays * (zeit_zuschlag_time - zeit_zuschlag_start)
        endTime -= timedelta(days=fullDays)
    if startTime < zeit_zuschlag_end:
        zeit_zuschlag_zeit += zeit_zuschlag_end - startTime
    if endTime < zeit_zuschlag_end:
        zeit_zuschlag_zeit -= zeit_zuschlag_end - endTime
    if startTime > zeit_zuschlag_start:
        zeit_zuschlag_zeit -= startTime - zeit_zuschlag_start
    if endTime > zeit_zuschlag_start:
        zeit_zuschlag_zeit += min(endTime, zeit_zuschlag_time) - zeit_zuschlag_start
    return dienstdauer, zeit_zuschlag_zeit

def nacht_zulagen_stunden(startTime, endTime, extraFraction):
    dienstdauer, zeit_zuschlag_zeit = zeit_zuschlag_function(startTime, endTime)
    delta = dienstdauer + zeit_zuschlag_zeit * extraFraction
    return delta

def nacht_zulagen_executing(start, end):
    dienstdauer, zeit_zuschlag_zeit = zeit_zuschlag_function(start, end)
    def nacht_zulagen_hours_roundeup(delta):
        return delta.days * 24 + delta.seconds // 3600 + (1 if delta.seconds % 3600 else 0)
    regularHours, nachtszulage = nacht_zulagen_hours_roundeup(dienstdauer), nacht_zulagen_hours_roundeup(zeit_zuschlag_zeit)
    # nachtarbeitzeitzuschlag 10 % pro Stunde (night-work supplement of 10% per hour)
    nachtarbeit_zeit_zuschlag_10 = zeit_zuschlag_zeit / 100 * 110 - zeit_zuschlag_zeit
    print(nachtarbeit_zeit_zuschlag_10)

for h in range(1):
    nacht_zulagen_executing(startTime, endTime + timedelta(hours=h))

Find whether an event (with a start & end time) goes beyond a certain time (e.g. 6pm) in dataframe using Python (pandas, datetime)

I'm creating a Python program using pandas and datetime libraries that will calculate the pay from my casual job each week, so I can cross reference my bank statement instead of looking through payslips.
The data that I am analysing is from the Google Calendar API that is synced with my work schedule. It prints the events in that particular calendar to a csv file in this format:
   Start             End               Title  Hours
0  02.12.2020 07:00  02.12.2020 16:00  Shift  9.0
1  04.12.2020 18:00  04.12.2020 21:00  Shift  3.0
2  05.12.2020 07:00  05.12.2020 12:00  Shift  5.0
3  06.12.2020 09:00  06.12.2020 18:00  Shift  9.0
4  07.12.2020 19:00  07.12.2020 23:00  Shift  4.0
5  08.12.2020 19:00  08.12.2020 23:00  Shift  4.0
6  09.12.2020 10:00  09.12.2020 15:00  Shift  5.0
As a casual at this job I have to take a few things into consideration, like penalty rates (base rate; after 6pm Monday-Friday, Saturday, and Sunday each have different rates). I'm wondering if I can analyse this csv using datetime and calculate how many hours fall before 6pm and how many after. Using this as an example, the output would be like:
   Start             End               Title  Hours
1  04.12.2020 15:00  04.12.2020 21:00  Shift  6.0

   Start             End               Title  Total Hours  Hours before 3pm  Hours after 3pm
1  04.12.2020 15:00  04.12.2020 21:00  Shift  6.0          3.0               3.0
I can use this to get the day of the week but I'm just not sure how to analyse certain bits of time for penalty rates:
df['day_of_week'] = df['Start'].dt.day_name()
I appreciate any help in Python or even other coding languages/techniques this can be applied to:)
Edit:
This is how my dataframe is looking at the moment
   Start                End                  Title  Hours  day_of_week  Pay     week_of_year
0  2020-12-02 07:00:00  2020-12-02 16:00:00  Shift  9.0    Wednesday    337.30  49
EDIT
In response to David Erickson's comment.
    value                variable  bool
0   2020-12-02 07:00:00  Start     False
1   2020-12-02 08:00:00  Start     False
2   2020-12-02 09:00:00  Start     False
3   2020-12-02 10:00:00  Start     False
4   2020-12-02 11:00:00  Start     False
5   2020-12-02 12:00:00  Start     False
6   2020-12-02 13:00:00  Start     False
7   2020-12-02 14:00:00  Start     False
8   2020-12-02 15:00:00  Start     False
9   2020-12-02 16:00:00  End       False
10  2020-12-04 18:00:00  Start     False
11  2020-12-04 19:00:00  Start     True
12  2020-12-04 20:00:00  Start     True
13  2020-12-04 21:00:00  End       True
14  2020-12-05 07:00:00  Start     False
15  2020-12-05 08:00:00  Start     False
16  2020-12-05 09:00:00  Start     False
17  2020-12-05 10:00:00  Start     False
18  2020-12-05 11:00:00  Start     False
19  2020-12-05 12:00:00  End       False
20  2020-12-06 09:00:00  Start     False
21  2020-12-06 10:00:00  Start     False
22  2020-12-06 11:00:00  Start     False
23  2020-12-06 12:00:00  Start     False
24  2020-12-06 13:00:00  Start     False
25  2020-12-06 14:00:00  Start     False
26  2020-12-06 15:00:00  Start     False
27  2020-12-06 16:00:00  Start     False
28  2020-12-06 17:00:00  Start     False
29  2020-12-06 18:00:00  End       False
30  2020-12-07 19:00:00  Start     False
31  2020-12-07 20:00:00  Start     True
32  2020-12-07 21:00:00  Start     True
33  2020-12-07 22:00:00  Start     True
34  2020-12-07 23:00:00  End       True
35  2020-12-08 19:00:00  Start     False
36  2020-12-08 20:00:00  Start     True
37  2020-12-08 21:00:00  Start     True
38  2020-12-08 22:00:00  Start     True
39  2020-12-08 23:00:00  End       True
40  2020-12-09 10:00:00  Start     False
41  2020-12-09 11:00:00  Start     False
42  2020-12-09 12:00:00  Start     False
43  2020-12-09 13:00:00  Start     False
44  2020-12-09 14:00:00  Start     False
45  2020-12-09 15:00:00  End       False
46  2020-12-11 19:00:00  Start     False
47  2020-12-11 20:00:00  Start     True
48  2020-12-11 21:00:00  Start     True
49  2020-12-11 22:00:00  Start     True
UPDATE: (2020-12-19)
I have simply filtered out the Start rows, as you were correct that an extra row was being calculated. Also, I passed dayfirst=True to pd.to_datetime() to convert the dates correctly. I have also cleaned up the output with some extra columns.
import pandas as pd
import numpy as np

higher_pay = 40
lower_pay = 30
df['Start'], df['End'] = pd.to_datetime(df['Start'], dayfirst=True), pd.to_datetime(df['End'], dayfirst=True)
start = df['Start']
df1 = df[['Start', 'End']].melt(value_name='Date').set_index('Date')
s = df1.groupby('variable').cumcount()
df1 = df1.groupby(s, group_keys=False).resample('1H').asfreq().join(s.rename('Shift').to_frame()).ffill().reset_index()
df1 = df1[~df1['Date'].isin(start)]
df1['Day'] = df1['Date'].dt.day_name()
df1['Week'] = df1['Date'].dt.isocalendar().week
m = (df1['Date'].dt.hour > 18) | (df1['Day'].isin(['Saturday', 'Sunday']))
df1['Higher Pay Hours'] = np.where(m, 1, 0)
df1['Lower Pay Hours'] = np.where(m, 0, 1)
df1['Pay'] = np.where(m, higher_pay, lower_pay)
df1 = df1.groupby(['Shift', 'Day', 'Week']).sum().reset_index()
df2 = df.merge(df1, how='left', left_index=True, right_on='Shift').drop('Shift', axis=1)
df2
Out[1]:
Start End Title Hours Day Week \
0 2020-12-02 07:00:00 2020-12-02 16:00:00 Shift 9.0 Wednesday 49
1 2020-12-04 18:00:00 2020-12-04 21:00:00 Shift 3.0 Friday 49
2 2020-12-05 07:00:00 2020-12-05 12:00:00 Shift 5.0 Saturday 49
3 2020-12-06 09:00:00 2020-12-06 18:00:00 Shift 9.0 Sunday 49
4 2020-12-07 19:00:00 2020-12-07 23:00:00 Shift 4.0 Monday 50
5 2020-12-08 19:00:00 2020-12-08 23:00:00 Shift 4.0 Tuesday 50
6 2020-12-09 10:00:00 2020-12-09 15:00:00 Shift 5.0 Wednesday 50
Higher Pay Hours Lower Pay Hours Pay
0 0 9 270
1 3 0 120
2 5 0 200
3 9 0 360
4 4 0 160
5 4 0 160
6 0 5 150
There are probably more concise ways to do this, but I thought resampling the dataframe and then counting the hours would be a clean approach. You can melt the dataframe so that 'Start' and 'End' are in the same column, and fill in the gap hours with resample, making sure to group by the 'Start' and 'End' values that were initially on the same row. The easiest way to figure out which rows were initially together is to take the cumulative count (cumcount) of the values in the new dataframe grouped by 'Start' and 'End'. I'll show how this works later in the answer.
Full Code:
import pandas as pd

df['Start'], df['End'] = pd.to_datetime(df['Start']), pd.to_datetime(df['End'])
df = df[['Start', 'End']].melt().set_index('value')
df = df.groupby(df.groupby('variable').cumcount(), group_keys=False).resample('1H').asfreq().ffill().reset_index()
m = (df['value'].dt.hour > 18) | (df['value'].dt.day_name().isin(['Saturday', 'Sunday']))
print('Higher Rate No. of Hours', df[m].shape[0])
print('Normal Rate No. of Hours', df[~m].shape[0])
Higher Rate No. of Hours 20
Normal Rate No. of Hours 26
Adding some more details...
Step 1: Melt the dataframe: You only need two columns 'Start' and 'End' to get your desired output
df = df[['Start', 'End']].melt().set_index('value')
df
Out[1]:
variable
value
2020-02-12 07:00:00 Start
2020-04-12 18:00:00 Start
2020-05-12 07:00:00 Start
2020-06-12 09:00:00 Start
2020-07-12 19:00:00 Start
2020-08-12 19:00:00 Start
2020-09-12 10:00:00 Start
2020-02-12 16:00:00 End
2020-04-12 21:00:00 End
2020-05-12 12:00:00 End
2020-06-12 18:00:00 End
2020-07-12 23:00:00 End
2020-08-12 23:00:00 End
2020-09-12 15:00:00 End
Step 2: Create the groups in preparation for resample: as you can see, groups 0-6 line up with each other, pairing each 'Start' with the 'End' it originally shared a row with.
df.groupby('variable').cumcount()
Out[2]:
value
2020-02-12 07:00:00 0
2020-04-12 18:00:00 1
2020-05-12 07:00:00 2
2020-06-12 09:00:00 3
2020-07-12 19:00:00 4
2020-08-12 19:00:00 5
2020-09-12 10:00:00 6
2020-02-12 16:00:00 0
2020-04-12 21:00:00 1
2020-05-12 12:00:00 2
2020-06-12 18:00:00 3
2020-07-12 23:00:00 4
2020-08-12 23:00:00 5
2020-09-12 15:00:00 6
Step 3: Resample the data per group by hour to fill in the gaps for each group:
df.groupby(df.groupby('variable').cumcount(), group_keys=False).resample('1H').asfreq().ffill().reset_index()
Out[3]:
value variable
0 2020-02-12 07:00:00 Start
1 2020-02-12 08:00:00 Start
2 2020-02-12 09:00:00 Start
3 2020-02-12 10:00:00 Start
4 2020-02-12 11:00:00 Start
5 2020-02-12 12:00:00 Start
6 2020-02-12 13:00:00 Start
7 2020-02-12 14:00:00 Start
8 2020-02-12 15:00:00 Start
9 2020-02-12 16:00:00 End
10 2020-04-12 18:00:00 Start
11 2020-04-12 19:00:00 Start
12 2020-04-12 20:00:00 Start
13 2020-04-12 21:00:00 End
14 2020-05-12 07:00:00 Start
15 2020-05-12 08:00:00 Start
16 2020-05-12 09:00:00 Start
17 2020-05-12 10:00:00 Start
18 2020-05-12 11:00:00 Start
19 2020-05-12 12:00:00 End
20 2020-06-12 09:00:00 Start
21 2020-06-12 10:00:00 Start
22 2020-06-12 11:00:00 Start
23 2020-06-12 12:00:00 Start
24 2020-06-12 13:00:00 Start
25 2020-06-12 14:00:00 Start
26 2020-06-12 15:00:00 Start
27 2020-06-12 16:00:00 Start
28 2020-06-12 17:00:00 Start
29 2020-06-12 18:00:00 End
30 2020-07-12 19:00:00 Start
31 2020-07-12 20:00:00 Start
32 2020-07-12 21:00:00 Start
33 2020-07-12 22:00:00 Start
34 2020-07-12 23:00:00 End
35 2020-08-12 19:00:00 Start
36 2020-08-12 20:00:00 Start
37 2020-08-12 21:00:00 Start
38 2020-08-12 22:00:00 Start
39 2020-08-12 23:00:00 End
40 2020-09-12 10:00:00 Start
41 2020-09-12 11:00:00 Start
42 2020-09-12 12:00:00 Start
43 2020-09-12 13:00:00 Start
44 2020-09-12 14:00:00 Start
45 2020-09-12 15:00:00 End
Step 4: From there, you can calculate the boolean series I have called m. True values represent conditions met for the higher rate.
m = (df['value'].dt.hour > 18) | (df['value'].dt.day_name().isin(['Saturday', 'Sunday']))
m
Out[4]:
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 True
11 True
12 True
13 True
14 False
15 False
16 False
17 False
18 False
19 False
20 False
21 False
22 False
23 False
24 False
25 False
26 False
27 False
28 False
29 False
30 True
31 True
32 True
33 True
34 True
35 True
36 True
37 True
38 True
39 True
40 True
41 True
42 True
43 True
44 True
45 True
Step 5: Filter the dataframe by True or False to count total hours for the Normal Rate and Higher Rate and print values.
print('Higher Rate No. of Hours', df[m].shape[0])
print('Normal Rate No. of Hours', df[~m].shape[0])
Higher Rate No. of Hours 20
Normal Rate No. of Hours 26
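As an aside, if you only need per-shift totals rather than an hourly expansion, the before/after-6pm split can also be computed arithmetically by clipping each shift against the 6pm cutoff. This is a minimal sketch with made-up example shifts, and it assumes no shift crosses midnight; the weekend rate would still need a separate day-of-week mask as in the answer above:

```python
import pandas as pd

# Hypothetical shifts in the question's dd.mm.yyyy format (not the real data).
df = pd.DataFrame({
    "Start": ["04.12.2020 15:00", "02.12.2020 07:00"],
    "End":   ["04.12.2020 21:00", "02.12.2020 16:00"],
})
df["Start"] = pd.to_datetime(df["Start"], dayfirst=True)
df["End"] = pd.to_datetime(df["End"], dayfirst=True)

# 6pm cutoff on the day each shift starts.
cutoff = df["Start"].dt.normalize() + pd.Timedelta(hours=18)

# Overlap of [Start, End] with [cutoff, ...) is the after-6pm portion;
# clipping both endpoints at the cutoff clamps it to zero for day shifts.
after = df["End"].clip(lower=cutoff) - df["Start"].clip(lower=cutoff)
total = df["End"] - df["Start"]
df["Hours after 6pm"] = after.dt.total_seconds() / 3600
df["Hours before 6pm"] = (total - after).dt.total_seconds() / 3600
```

For the 15:00-21:00 shift this yields 3 hours on each side of 6pm, matching the question's expected output.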

Flagging list of datetimes within date ranges in pandas dataframe

I've looked around (e.g. Python - Locating the closest timestamp) but can't find anything on this.
I have a list of datetimes, and a dataframe containing 10k + rows, of start and end times (formatted as datetimes).
The dataframe is effectively listing parameters for runs of an instrument.
The list describes times from an alarm event.
The datetime list items are all within a row (i.e. between a start and end time) in the dataframe. Is there an easy way to locate the rows which would contain the timeframe within which the alarm time would be? (sorry for poor wording there!)
eg.
for i in alarms:
    df.loc[(df.start_time < i) & (df.end_time > i), 'Flag'] = 'Alarm'
(this didn't work but shows my approach)
Example datasets
# making list of datetimes for the alarms
df = pd.DataFrame({'Alarms':["18/07/19 14:56:21", "19/07/19 15:05:15", "20/07/19 15:46:00"]})
df['Alarms'] = pd.to_datetime(df['Alarms'])
alarms = list(df.Alarms.unique())
# dataframe of runs containing start and end times
n=33
rng1 = pd.date_range('2019-07-18', '2019-07-22', periods=n)
rng2 = pd.date_range('2019-07-18 03:00:00', '2019-07-22 03:00:00', periods=n)
df = pd.DataFrame({ 'start_date': rng1, 'end_Date': rng2})
Herein a flag would go against line (well, index) 4, 13 and 21.
You can use pandas.IntervalIndex here:
# Create and set IntervalIndex
intervals = pd.IntervalIndex.from_arrays(df.start_date, df.end_Date)
df = df.set_index(intervals)
# Update using loc
df.loc[alarms, 'flag'] = 'alarm'
# Finally, reset_index
df = df.reset_index(drop=True)
[out]
start_date end_Date flag
0 2019-07-18 00:00:00 2019-07-18 03:00:00 NaN
1 2019-07-18 03:00:00 2019-07-18 06:00:00 NaN
2 2019-07-18 06:00:00 2019-07-18 09:00:00 NaN
3 2019-07-18 09:00:00 2019-07-18 12:00:00 NaN
4 2019-07-18 12:00:00 2019-07-18 15:00:00 alarm
5 2019-07-18 15:00:00 2019-07-18 18:00:00 NaN
6 2019-07-18 18:00:00 2019-07-18 21:00:00 NaN
7 2019-07-18 21:00:00 2019-07-19 00:00:00 NaN
8 2019-07-19 00:00:00 2019-07-19 03:00:00 NaN
9 2019-07-19 03:00:00 2019-07-19 06:00:00 NaN
10 2019-07-19 06:00:00 2019-07-19 09:00:00 NaN
11 2019-07-19 09:00:00 2019-07-19 12:00:00 NaN
12 2019-07-19 12:00:00 2019-07-19 15:00:00 NaN
13 2019-07-19 15:00:00 2019-07-19 18:00:00 alarm
14 2019-07-19 18:00:00 2019-07-19 21:00:00 NaN
15 2019-07-19 21:00:00 2019-07-20 00:00:00 NaN
16 2019-07-20 00:00:00 2019-07-20 03:00:00 NaN
17 2019-07-20 03:00:00 2019-07-20 06:00:00 NaN
18 2019-07-20 06:00:00 2019-07-20 09:00:00 NaN
19 2019-07-20 09:00:00 2019-07-20 12:00:00 NaN
20 2019-07-20 12:00:00 2019-07-20 15:00:00 NaN
21 2019-07-20 15:00:00 2019-07-20 18:00:00 alarm
22 2019-07-20 18:00:00 2019-07-20 21:00:00 NaN
23 2019-07-20 21:00:00 2019-07-21 00:00:00 NaN
24 2019-07-21 00:00:00 2019-07-21 03:00:00 NaN
25 2019-07-21 03:00:00 2019-07-21 06:00:00 NaN
26 2019-07-21 06:00:00 2019-07-21 09:00:00 NaN
27 2019-07-21 09:00:00 2019-07-21 12:00:00 NaN
28 2019-07-21 12:00:00 2019-07-21 15:00:00 NaN
29 2019-07-21 15:00:00 2019-07-21 18:00:00 NaN
30 2019-07-21 18:00:00 2019-07-21 21:00:00 NaN
31 2019-07-21 21:00:00 2019-07-22 00:00:00 NaN
32 2019-07-22 00:00:00 2019-07-22 03:00:00 NaN
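One caveat with the `.loc` approach above: if an alarm falls outside every interval, indexing with it raises a KeyError. A sketch that tolerates misses, using the same synthetic runs as the question plus one hypothetical alarm that matches no run, is to use IntervalIndex.get_indexer, which returns -1 for unmatched values:

```python
import pandas as pd

# Synthetic runs as in the question.
n = 33
rng1 = pd.date_range('2019-07-18', '2019-07-22', periods=n)
rng2 = pd.date_range('2019-07-18 03:00:00', '2019-07-22 03:00:00', periods=n)
df = pd.DataFrame({'start_date': rng1, 'end_Date': rng2})

# One real alarm plus one (made-up) alarm outside every run.
alarms = pd.to_datetime(['2019-07-18 14:56:21', '2019-07-23 00:00:00'])

intervals = pd.IntervalIndex.from_arrays(df.start_date, df.end_Date)
pos = intervals.get_indexer(alarms)      # -1 where no interval contains the alarm
df.loc[pos[pos >= 0], 'flag'] = 'alarm'  # flag only the matched rows
```

This relies on the intervals being non-overlapping (they are adjacent and closed on the right by default), which is what allows get_indexer to return a unique position per alarm.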
You named your columns start_date and end_Date, but in your for loop you used start_time and end_time.
Try this:
import pandas as pd
df = pd.DataFrame({'Alarms': ["18/07/19 14:56:21", "19/07/19 15:05:15", "20/07/19 15:46:00"]})
df['Alarms'] = pd.to_datetime(df['Alarms'])
alarms = list(df.Alarms.unique())
# dataframe of runs containing start and end times
n = 33
rng1 = pd.date_range('2019-07-18', '2019-07-22', periods=n)
rng2 = pd.date_range('2019-07-18 03:00:00', '2019-07-22 03:00:00', periods=n)
df = pd.DataFrame({'start_date': rng1, 'end_Date': rng2})
for i in alarms:
    df.loc[(df.start_date < i) & (df.end_Date > i), 'Flag'] = 'Alarm'
print(df[df['Flag']=='Alarm']['Flag'])
Output:
4 Alarm
13 Alarm
21 Alarm
Name: Flag, dtype: object
