I'm a little new to Python and I'm not sure where to start.
I have a pandas DataFrame that contains shift information like the one below:
EmployeeID ShiftType BeginDate EndDate
10 Holiday 2020-01-01 21:00:00 2020-01-02 07:00:00
10 Regular 2020-01-02 21:00:00 2020-01-03 07:00:00
10 Regular 2020-01-03 21:00:00 2020-01-04 07:00:00
10 Regular 2020-01-04 21:00:00 2020-01-05 07:00:00
20 Regular 2020-02-01 09:00:00 2020-02-01 17:00:00
20 Regular 2020-02-02 09:00:00 2020-02-02 17:00:00
20 Regular 2020-02-03 09:00:00 2020-02-03 17:00:00
20 Regular 2020-02-04 09:00:00 2020-02-04 17:00:00
I'd like to break each shift down and summarize the hours worked on each day.
The desired output is below:
EmployeeID ShiftType Date HoursWorked
10 Holiday 2020-01-01 3
10 Regular 2020-01-02 10
10 Regular 2020-01-03 10
10 Regular 2020-01-04 10
10 Regular 2020-01-05 7
20 Regular 2020-02-01 8
20 Regular 2020-02-02 8
20 Regular 2020-02-03 8
20 Regular 2020-02-04 8
I know how to get the work hours as below. This gets me the hours for each shift, but I would like to break out each calendar day and the hours worked on that day. Retaining 'ShiftType' is not that important here.
df['HoursWorked'] = (pd.to_datetime(df['EndDate']) - pd.to_datetime(df['BeginDate'])).dt.total_seconds() / 3600
Any suggestion would be appreciated.
You could calculate the working hours for each date (BeginDate and EndDate) separately. That gives you three pd.Series: two for the parts of the night shifts that fall on different dates, and one for the day shifts that fit within a single date. Use the corresponding date as the index for each Series.
# make sure dtype is correct
df["BeginDate"] = pd.to_datetime(df["BeginDate"])
df["EndDate"] = pd.to_datetime(df["EndDate"])
# get a mask where there is a night shift; end date = start date +1
m = df["BeginDate"].dt.date != df["EndDate"].dt.date
# extract the night shifts
s0 = pd.Series((df["BeginDate"][m].dt.ceil('d') - df["BeginDate"][m]).values,
               index=df["BeginDate"][m].dt.floor('d'))
s1 = pd.Series((df["EndDate"][m] - df["EndDate"][m].dt.floor('d')).values,
               index=df["EndDate"][m].dt.floor('d'))
# ...and the day shifts
s2 = pd.Series((df["EndDate"][~m] - df["BeginDate"][~m]).values,
               index=df["BeginDate"][~m].dt.floor('d'))
Now concat and sum them, giving you a pd.Series with the working hours for each date:
working_hours = pd.concat([s0, s1, s2], axis=1).sum(axis=1)
# working_hours
# 2020-01-01 0 days 03:00:00
# 2020-01-02 0 days 10:00:00
# 2020-01-03 0 days 10:00:00
# ...
# Freq: D, dtype: timedelta64[ns]
To join with your original df, you can reindex that and add the working_hours:
new_index = pd.concat([df["BeginDate"].dt.floor('d'),
                       df["EndDate"].dt.floor('d')]).drop_duplicates()
df_out = df.set_index(df["BeginDate"].dt.floor('d')).reindex(new_index, method='nearest')
df_out.index = df_out.index.set_names('date')
df_out = df_out.drop(['BeginDate', 'EndDate'], axis=1).sort_index()
df_out['HoursWorked'] = working_hours.dt.total_seconds()/3600
# df_out
# EmployeeID ShiftType HoursWorked
# date
# 2020-01-01 10 Holiday 3.0
# 2020-01-02 10 Regular 10.0
# 2020-01-03 10 Regular 10.0
# 2020-01-04 10 Regular 10.0
# 2020-01-05 10 Regular 7.0
# 2020-02-01 20 Regular 8.0
# 2020-02-02 20 Regular 8.0
# 2020-02-03 20 Regular 8.0
# 2020-02-04 20 Regular 8.0
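An alternative sketch of the same result: cut each shift at midnight into per-day pieces, then sum hours per employee and day. The split_by_day helper below is mine, not part of the answer above, and the loop over rows trades speed for readability:

```python
import pandas as pd

df = pd.DataFrame({
    "EmployeeID": [10, 10],
    "ShiftType": ["Holiday", "Regular"],
    "BeginDate": pd.to_datetime(["2020-01-01 21:00:00", "2020-01-02 21:00:00"]),
    "EndDate": pd.to_datetime(["2020-01-02 07:00:00", "2020-01-03 07:00:00"]),
})

def split_by_day(row):
    """Return (EmployeeID, ShiftType, date, hours) pieces of one shift, cut at midnight."""
    start, end = row["BeginDate"], row["EndDate"]
    pieces = []
    while start < end:
        # next midnight after `start`, capped at the shift's end
        day_end = min(start.floor("D") + pd.Timedelta(days=1), end)
        pieces.append((row["EmployeeID"], row["ShiftType"], start.date(),
                       (day_end - start).total_seconds() / 3600))
        start = day_end
    return pieces

out = pd.DataFrame(
    [p for _, row in df.iterrows() for p in split_by_day(row)],
    columns=["EmployeeID", "ShiftType", "Date", "HoursWorked"],
)
# sum the pieces per employee and calendar day (drops ShiftType)
daily = out.groupby(["EmployeeID", "Date"], as_index=False)["HoursWorked"].sum()
```

For the two shifts above this yields 3 h on 2020-01-01, 10 h on 2020-01-02 (7 h ending the first shift plus 3 h starting the second), and 7 h on 2020-01-03.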
I have the following dataframe:
ID ..... Quantity Time
54 100 2020-01-01 00:00:04
55 100 2020-01-01 00:00:04
54 88 2020-01-01 00:00:05
54 66 2020-01-01 00:00:06
55 100 2020-01-01 00:00:07
55 88 2020-01-01 00:00:07
I would like to group the dataframe (sorted by time!) by ID and then take the quantity of the last row and divide it by the first row per ID.
The result should look like this:
ID ..... Quantity Time Result
54 100 2020-01-01 00:00:04
54 88 2020-01-01 00:00:05
54 66 2020-01-01 00:00:06 0.66
55 100 2020-01-01 00:00:04
55 100 2020-01-01 00:00:07
55 88 2020-01-01 00:00:07 0.88
So far I used the following code to get the first and the last row for every ID.
g = df.sort_values(by=['Time']).groupby('ID')
df_new = (pd.concat([g.head(1), g.tail(1)])
            .sort_values(by='ID')
            .reset_index(drop=True))
and then I used the following code to get the Result of the division:
df_new['Result'] = df_new['Quantity'].iloc[1::2].div(df_new['Quantity'].shift())
The problem is that the dataframe does not stay sorted by time. It is really important that I take the (timewise) last quantity per ID and divide it by the first quantity (in time) per ID.
Thanks for any hints where I need to change the code!
The ID values come in triples here, not pairs. First convert the column to datetime if necessary with to_datetime, then sort by both columns with DataFrame.sort_values, and finally divide each group's last Quantity by its first:
df['Time'] = pd.to_datetime(df['Time'])
df = df.sort_values(['ID','Time'])
first = df.groupby('ID')['Quantity'].transform('first')
df['Result'] = df.drop_duplicates('ID', keep='last')['Quantity'].div(first)
print(df)
ID Quantity Time Result
0 54 100 2020-01-01 00:00:04 NaN
2 54 88 2020-01-01 00:00:05 NaN
3 54 66 2020-01-01 00:00:06 0.66
1 55 100 2020-01-01 00:00:04 NaN
4 55 100 2020-01-01 00:00:07 NaN
5 55 88 2020-01-01 00:00:07 0.88
Convert Time to datetime:
df["Time"] = pd.to_datetime(df["Time"])
Sort on ID and Time:
df = df.sort_values(["ID", "Time"])
Group on ID:
grouping = df.groupby("ID").Quantity
Get results for the division of last by first:
result = grouping.last().div(grouping.first()).array
Now, you can assign the results back to the original dataframe:
df.loc[df.Quantity.eq(grouping.transform("last")), "Result"] = result
df
ID Quantity Time Result
0 54 100 2020-01-01 00:00:04 NaN
2 54 88 2020-01-01 00:00:05 NaN
3 54 66 2020-01-01 00:00:06 0.66
1 55 100 2020-01-01 00:00:04 NaN
4 55 100 2020-01-01 00:00:07 NaN
5 55 88 2020-01-01 00:00:07 0.88
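For completeness, a more compact variant of the same idea (a sketch: compute the last/first ratio for every row with transform, then keep it only on each ID's timewise last row):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [54, 55, 54, 54, 55, 55],
    "Quantity": [100, 100, 88, 66, 100, 88],
    "Time": pd.to_datetime([
        "2020-01-01 00:00:04", "2020-01-01 00:00:04",
        "2020-01-01 00:00:05", "2020-01-01 00:00:06",
        "2020-01-01 00:00:07", "2020-01-01 00:00:07",
    ]),
})

# sort so "first" and "last" are defined by time within each ID
df = df.sort_values(["ID", "Time"])
g = df.groupby("ID")["Quantity"]
ratio = g.transform("last") / g.transform("first")

# keep the ratio only on the (timewise) last row of each ID
is_last = ~df.duplicated("ID", keep="last")
df["Result"] = ratio.where(is_last)
```

Because transform broadcasts the group aggregate to every row, no concat of heads and tails is needed, and the original row order within each ID is preserved by the stable sort.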
I have a large csv file in which I want to replace values with zero in a particular range of time. For example, between 20:00:00 and 05:00:00 I want to replace all values greater than zero with 0. How do I do it?
df = pd.read_csv('108e.csv')  # read the data set
df['timeStamp'] = pd.to_datetime(df['timeStamp'])
for i in df.set_index('timeStamp').between_time('20:00:00', '05:00:00')['luminosity']:
    if i > 0:
        df[['luminosity']] = df[['luminosity']].replace({i: 0})
You can use the function select from numpy. Note two pitfalls: the range wraps past midnight, so the two time conditions must be combined with OR, not AND; and you cannot compare full datetimes against time-of-day strings, so compare the .dt.time component instead:
import numpy as np
t = df['timeStamp'].dt.time
night = (t >= pd.Timestamp('20:00').time()) | (t <= pd.Timestamp('05:00').time())
df['luminosity'] = np.select([night & (df['luminosity'] > 0)], [0], default=df['luminosity'])
See the official numpy.select docs for more examples.
Assume that your DataFrame contains:
timeStamp luminosity
0 2020-01-02 18:00:00 10
1 2020-01-02 20:00:00 11
2 2020-01-02 22:00:00 12
3 2020-01-03 02:00:00 13
4 2020-01-03 05:00:00 14
5 2020-01-03 07:00:00 15
6 2020-01-03 18:00:00 16
7 2020-01-03 20:10:00 17
8 2020-01-03 22:10:00 18
9 2020-01-04 02:10:00 19
10 2020-01-04 05:00:00 20
11 2020-01-04 05:10:00 21
12 2020-01-04 07:00:00 22
To only retrieve rows in the time range of interest you could run:
df.set_index('timeStamp').between_time('20:00', '05:00')
But if you attempted to modify these data, e.g.
df = df.set_index('timeStamp')
df.between_time('20:00', '05:00')['luminosity'] = 0
you would get SettingWithCopyWarning. The reason is that between_time
returns a new object, so the assignment targets an intermediate result
and would not reliably propagate back to the original data.
To circumvent this limitation, you can use indexer_between_time
on the index of a DataFrame, which returns a NumPy array of the integer
locations of the rows meeting your time-range criterion.
To update the underlying data, using set_index only to obtain row
positions, you can run:
df.iloc[df.set_index('timeStamp').index \
        .indexer_between_time('20:00', '05:00'), 1] = 0
Note that to keep the code short, I passed the int location of the column
of interest.
Access by iloc should be quite fast.
When you print the df again, the result is:
timeStamp luminosity
0 2020-01-02 18:00:00 10
1 2020-01-02 20:00:00 0
2 2020-01-02 22:00:00 0
3 2020-01-03 02:00:00 0
4 2020-01-03 05:00:00 0
5 2020-01-03 07:00:00 15
6 2020-01-03 18:00:00 16
7 2020-01-03 20:10:00 0
8 2020-01-03 22:10:00 0
9 2020-01-04 02:10:00 0
10 2020-01-04 05:00:00 0
11 2020-01-04 05:10:00 21
12 2020-01-04 07:00:00 22
I have a dataframe that has four different columns and looks like the table below:
index_example | column_a | column_b | column_c | datetime_column
1 A 1,000 1 2020-01-01 11:00:00
2 A 2,000 2 2019-11-01 10:00:00
3 A 5,000 3 2019-12-01 08:00:00
4 B 1,000 4 2020-01-01 05:00:00
5 B 6,000 5 2019-01-01 01:00:00
6 B 7,000 6 2019-04-01 11:00:00
7 A 8,000 7 2019-11-30 07:00:00
8 B 500 8 2020-01-01 05:00:00
9 B 1,000 9 2020-01-01 03:00:00
10 B 2,000 10 2020-01-01 02:00:00
11 A 1,000 11 2019-05-02 01:00:00
Purpose:
For each row, get the different rolling statistics for column_b based on a window of time in the datetime_column, defined as the last N months. The window of time to look at, however, is also filtered by the value in column_a.
Code example using a for loop which is not feasible given the size:
from datetime import timedelta

mean_dict = {}
for index, value in enumerate(df.datetime_column):
    test_date = value
    test_column_a = df.column_a[index]
    subset_df = df[(df.datetime_column < test_date) &
                   (df.datetime_column >= test_date - timedelta(days=180)) &
                   (df.column_a == test_column_a)]
    mean_dict[index] = subset_df.column_b.mean()
For example for row #1:
Target date = 2020-01-01 11:00:00
Target value in column_a = A
Date Range: from 2019-07-01 11:00:00 to 2020-01-01 11:00:00
Average would be the mean of rows 2,3,7
If I wanted average for row #2 then it would be:
Target date = 2019-11-01 10:00:00
Target value in column_a = A
Date Range: from 2019-05-01 10:00 to 2019-11-01 10:00:00
Average would be the mean of row 11
and so on...
I cannot use the grouper since in reality I do not have dates but datetimes.
Has anyone encountered this before?
Thanks!
EDIT
The dataframe is big ~2M rows which means that looping is not an option. I already tried looping and creating a subset based on conditional values but it takes too long.
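A vectorized sketch of this, assuming a time-based rolling window per group is acceptable: closed="left" mirrors the loop's strict `datetime_column < test_date` bound, and "180D" mirrors timedelta(days=180). Column names match the question; the toy data and its index values are illustrative only:

```python
import pandas as pd

# toy data shaped like the question's dataframe (values are illustrative)
df = pd.DataFrame({
    "column_a": ["A", "A", "A", "B", "A"],
    "column_b": [1000, 2000, 5000, 1000, 8000],
    "datetime_column": pd.to_datetime([
        "2020-01-01 11:00:00", "2019-11-01 10:00:00",
        "2019-12-01 08:00:00", "2020-01-01 05:00:00",
        "2019-11-30 07:00:00",
    ]),
}, index=[1, 2, 3, 4, 7])

# sort so each group is monotonic in time, then roll a 180-day window;
# closed="left" excludes the current row from its own window
out = (df.sort_values("datetime_column")
         .groupby("column_a")
         .rolling("180D", on="datetime_column", closed="left")
         .mean())

# out is indexed by (column_a, original index); drop the group level to align
df["rolling_mean"] = out["column_b"].reset_index(level=0, drop=True)
```

Other statistics (std, min, max, count) are a matter of swapping the final aggregation, and the rolling machinery avoids building a subset per row, which should scale far better than the loop on ~2M rows.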
Considering the following dataframe:
df = pd.read_json("""{"week":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"6":2,"7":2,"8":2,"9":2,"10":2,"11":2,"12":3,"13":3,"14":3,"15":3,"16":3,"17":3},"extra_hours":{"0":"01:00:00","1":"00:00:00","2":"01:00:00","3":"01:00:00","4":"00:00:00","5":"01:00:00","6":"01:00:00","7":"01:00:00","8":"01:00:00","9":"01:00:00","10":"00:00:00","11":"01:00:00","12":"01:00:00","13":"02:00:00","14":"01:00:00","15":"02:00:00","16":"00:00:00","17":"00:00:00"},"extra_hours_over":{"0":null,"1":null,"2":null,"3":null,"4":null,"5":null,"6":null,"7":null,"8":null,"9":null,"10":null,"11":null,"12":null,"13":null,"14":null,"15":null,"16":null,"17":null}}""")
df.tail(6)
week extra_hours extra_hours_over
12 3 01:00:00 NaN
13 3 02:00:00 NaN
14 3 01:00:00 NaN
15 3 02:00:00 NaN
16 3 00:00:00 NaN
17 3 00:00:00 NaN
Now, in every week the maximum amount of extra_hours is 4h, meaning I have to subtract 30-minute blocks from the extra_hours column and fill the extra_hours_over column, so that in every week the total sum of extra_hours is at most 4h.
So, given the example dataframe, a possible solution (for week 3) would be like this:
week extra_hours extra_hours_over
12 3 01:00:00 00:00:00
13 3 01:30:00 00:30:00
14 3 00:30:00 00:30:00
15 3 01:00:00 01:00:00
16 3 00:00:00 00:00:00
17 3 00:00:00 00:00:00
I would need to aggregate the total extra_hours per week, check on which days it passes 4h, and then randomly subtract half-hour chunks.
What would be the easiest/most direct way to achieve this?
Here goes one attempt at what you seem to be asking. The idea is simple, although the code is fairly verbose:
1) Create some helper variables (minutes, extra_minutes, total for the week).
2) Loop while any week's total is above 240 minutes, operating on a temporary subset of such rows.
3) In the loop, use np.random.choice to select a row to remove 30 min from.
4) Apply the changes to minutes and extra_minutes.
The code:
df = pd.read_json("""{"week":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"6":2,"7":2,"8":2,"9":2,"10":2,"11":2,"12":3,"13":3,"14":3,"15":3,"16":3,"17":3},"extra_hours":{"0":"01:00:00","1":"00:00:00","2":"01:00:00","3":"01:00:00","4":"00:00:00","5":"01:00:00","6":"01:00:00","7":"01:00:00","8":"01:00:00","9":"01:00:00","10":"00:00:00","11":"01:00:00","12":"01:00:00","13":"02:00:00","14":"01:00:00","15":"02:00:00","16":"00:00:00","17":"00:00:00"},"extra_hours_over":{"0":null,"1":null,"2":null,"3":null,"4":null,"5":null,"6":null,"7":null,"8":null,"9":null,"10":null,"11":null,"12":null,"13":null,"14":null,"15":null,"16":null,"17":null}}""")
import numpy as np

df['minutes'] = pd.DatetimeIndex(df['extra_hours']).hour * 60 + pd.DatetimeIndex(df['extra_hours']).minute
df['extra_minutes'] = 0
df['tot_time'] = df.groupby('week')['minutes'].transform('sum')
while not df[df['tot_time'] > 240].empty:
    mask = df[(df['minutes'] >= 30) & (df['tot_time'] > 240)].groupby('week').apply(lambda x: np.random.choice(x.index)).values
    df.loc[mask, 'minutes'] -= 30
    df.loc[mask, 'extra_minutes'] += 30
    df['tot_time'] = df.groupby('week')['minutes'].transform('sum')
df['extra_hours_over'] = df['extra_minutes'].apply(lambda x: pd.Timedelta(minutes=x))
df['extra_hours'] = df['minutes'].apply(lambda x: pd.Timedelta(minutes=x))
df.drop(['minutes','extra_minutes'], axis=1).tail(6)
Out[1]:
week extra_hours extra_hours_over tot_time
12 3 00:30:00 00:30:00 240
13 3 01:30:00 00:30:00 240
14 3 00:30:00 00:30:00 240
15 3 01:30:00 00:30:00 240
16 3 00:00:00 00:00:00 240
17 3 00:00:00 00:00:00 240
Note: because I am using np.random.choice, the same row can be picked more than once, in which case it changes by more than one 30-minute chunk.
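If the random selection is not essential, a deterministic sketch that instead trims the overflow from the end of each week, by capping the cumulative minutes at 240 (same helper-column names as above, applied to a toy week):

```python
import pandas as pd

# one week's worth of extra time, already converted to minutes
df = pd.DataFrame({"week": [3, 3, 3, 3],
                   "minutes": [60, 120, 60, 120]})

# minutes already used in the week before each row is reached
used_before = df.groupby("week")["minutes"].cumsum() - df["minutes"]
# budget still available for this row (never negative)
allowed = (240 - used_before).clip(lower=0)

kept = df["minutes"].clip(upper=allowed)
df["extra_minutes"] = df["minutes"] - kept
df["minutes"] = kept
```

This needs no loop, but note the trade-off: the overflow always lands on the later days of the week rather than being spread randomly.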