I have the following df
import pandas as pd
import datetime as dt
start='2020-01-01'
end='2021-12-31'
df = pd.DataFrame({"Date": pd.date_range(start, end)})
df['Day'] = df['Date'].dt.day
df['Day_name'] = df[['Date']].apply(lambda x: dt.datetime.strftime(x['Date'], '%A'), axis=1)
I want to add another column, df['wk'], that walks through the dates and assigns a custom week number starting from a specific date.
For example, wk 1 starts on 2020-01-03 and runs 7 days through 2020-01-09, wk 2 runs from 2020-01-10 through 2020-01-16, and so on, always moving 7 days at a time.
How can I do this in Python?
I am thinking it should be something like this:
for i, row in df.iterrows():
    df.loc[i, 'wk'] = row['Date'] + dt.timedelta(days=7)
But this just adds 7 days to the current date; it does not store the week number. I need a little guidance on how to do this.
You can try something like this:
#You can create the Day_name column without apply
df['Day_name'] = df['Date'].dt.day_name()
#Solution starts here
week_start = '2020-01-03' #input date
i = df.loc[df['Date'] == week_start,'Date'].dt.day_name().iloc[0]
df['wk'] = df['Day_name'].eq(i).cumsum()
print(df.head(12))
Date Day Day_name wk
0 2020-01-01 1 Wednesday 0
1 2020-01-02 2 Thursday 0
2 2020-01-03 3 Friday 1
3 2020-01-04 4 Saturday 1
4 2020-01-05 5 Sunday 1
5 2020-01-06 6 Monday 1
6 2020-01-07 7 Tuesday 1
7 2020-01-08 8 Wednesday 1
8 2020-01-09 9 Thursday 1
9 2020-01-10 10 Friday 2
10 2020-01-11 11 Saturday 2
11 2020-01-12 12 Sunday 2
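Counting weekday names works as long as the dates are consecutive. As an alternative, here is a minimal sketch that derives the week number arithmetically from the start date, so it does not depend on the Day_name column or on there being no gaps. It assumes that dates before the start date should be labelled week 0, as in the output above; the helper name days_since_start is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Date": pd.date_range("2020-01-01", "2021-12-31")})

week_start = pd.Timestamp("2020-01-03")  # first day of custom week 1

# Whole days elapsed since the start date (negative for earlier dates)
days_since_start = (df["Date"] - week_start).dt.days

# Floor-divide into 7-day blocks; offset 0..6 becomes week 1, 7..13 week 2, ...
df["wk"] = days_since_start // 7 + 1

# Assumption: everything before the start date is lumped into week 0
df["wk"] = df["wk"].clip(lower=0)

print(df.head(12))
```

Because the result only depends on the distance from week_start, it stays correct if some dates are missing from the frame.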
I am working on converting a list of online classes into a heat map using Python & Pandas and I've come to a dead end. Right now, I have a data frame 'data' with some events containing a day of the week listed as 'DAY' and the time of the event in hours listed as 'TIME'. The dataset is displayed as follows:
ID TIME DAY
108 15 Saturday
110 15 Sunday
112 16 Wednesday
114 16 Friday
116 15 Monday
.. ... ...
639 12 Wednesday
640 12 Saturday
641 18 Saturday
642 16 Thursday
643 15 Friday
I'm looking for a way to sum repetitions of every 'TIME' value for every 'DAY' and then present these sums in a new table 'event_count'. I need to turn the linear data in my 'data' table into a more timetable-like form that can later be converted into a visual heatmap.
Sounds like a difficult transformation, but I feel like I'm missing something very obvious.
TIME Monday Tuesday Wednesday Thursday Friday Saturday Sunday
10 5 2 4 6 1 0 2
11 4 2 4 6 1 0 2
12 6 2 4 6 1 0 2
13 3 2 4 6 1 0 2
14 7 2 4 6 1 0 2
I tried to achieve this through pivot_table and stack; however, the best I got was a list of all days of the week with the mean of the time values. Could you advise which direction I should look in and how I can approach solving this?
IIUC you can do something like this:
df is from your given example data.
import pandas as pd
df = pd.DataFrame({
'ID': [108, 110, 112, 114, 116, 639, 640, 641, 642, 643],
'TIME': [15, 15, 16, 16, 15, 12, 12, 18, 16, 15],
'DAY': ['Saturday','Sunday','Wednesday','Friday','Monday','Wednesday','Saturday','Saturday','Thursday','Friday']
})
weekdays = [ 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
out = (pd.crosstab(index=df['TIME'], columns=df['DAY'], values=df['TIME'],aggfunc='count')
.sort_index(axis=0) #sort by the index 'TIME'
.reindex(weekdays, axis=1) # sort columns in order of the weekdays
.rename_axis(None, axis=1) # remove the name 'DAY' from the columns axis
.reset_index() # 'TIME' from index to column
)
print(out)
TIME Monday Tuesday Wednesday Thursday Friday Saturday Sunday
0 12 NaN NaN 1.0 NaN NaN 1.0 NaN
1 15 1.0 NaN NaN NaN 1.0 1.0 1.0
2 16 NaN NaN 1.0 1.0 1.0 NaN NaN
3 18 NaN NaN NaN NaN NaN 1.0 NaN
You were also on the right track with pivot_table. I'm not sure what was missing to get you the right result, but here is one approach with it. I added margins=True, since the totals for each index/column may also be interesting for you.
weekdays = [ 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday', 'Total']
out2 = (pd.pivot_table(data=df, index='TIME', columns='DAY', aggfunc='count', margins=True, margins_name='Total')
.droplevel(0,axis=1)
.reindex(weekdays, axis=1)
)
print(out2)
DAY Monday Tuesday Wednesday Thursday Friday Saturday Sunday Total
TIME
12 NaN NaN 1.0 NaN NaN 1.0 NaN 2
15 1.0 NaN NaN NaN 1.0 1.0 1.0 4
16 NaN NaN 1.0 1.0 1.0 NaN NaN 3
18 NaN NaN NaN NaN NaN 1.0 NaN 1
Total 1.0 NaN 2.0 1.0 2.0 3.0 1.0 10
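Both outputs above show NaN where a day/time combination never occurs, whereas the desired table in the question uses plain counts. A minimal sketch of one way to get integer zeros instead: pd.crosstab counts occurrences by default when no values/aggfunc are passed, and reindex can fill the missing weekday columns with 0. The name event_count is taken from the question; the rest reuses the sample data from this answer.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [108, 110, 112, 114, 116, 639, 640, 641, 642, 643],
    'TIME': [15, 15, 16, 16, 15, 12, 12, 18, 16, 15],
    'DAY': ['Saturday', 'Sunday', 'Wednesday', 'Friday', 'Monday',
            'Wednesday', 'Saturday', 'Saturday', 'Thursday', 'Friday']
})
weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

event_count = (
    pd.crosstab(df['TIME'], df['DAY'])           # frequency table, 0 where absent
    .reindex(columns=weekdays, fill_value=0)     # weekday order; add missing days as 0
    .rename_axis(columns=None)                   # drop the 'DAY' label on the columns
    .reset_index()                               # 'TIME' from index to column
)
print(event_count)
```

A frame in this shape (or the same frame with 'TIME' kept as the index) can then be handed to a heatmap plotting function.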
I'm a little new to Python and I'm not sure where to start.
I have a python dataframe that contains shift information like the below:
EmployeeID ShiftType BeginDate EndDate
10 Holiday 2020-01-01 21:00:00 2020-01-02 07:00:00
10 Regular 2020-01-02 21:00:00 2020-01-03 07:00:00
10 Regular 2020-01-03 21:00:00 2020-01-04 07:00:00
10 Regular 2020-01-04 21:00:00 2020-01-05 07:00:00
20 Regular 2020-02-01 09:00:00 2020-02-01 17:00:00
20 Regular 2020-02-02 09:00:00 2020-02-02 17:00:00
20 Regular 2020-02-03 09:00:00 2020-02-03 17:00:00
20 Regular 2020-02-04 09:00:00 2020-02-04 17:00:00
I'd like to be able to break each shift down and summarize hours worked each day.
The desired output is below:
EmployeeID ShiftType Date HoursWorked
10 Holiday 2020-01-01 3
10 Regular 2020-01-02 10
10 Regular 2020-01-03 10
10 Regular 2020-01-04 10
10 Regular 2020-01-05 7
20 Regular 2020-02-01 8
20 Regular 2020-02-02 8
20 Regular 2020-02-03 8
20 Regular 2020-02-04 8
I know how to get the work hours like below. This does get me hours for each shift, but I would like to be able to break out each calendar day and hours for that day. Retaining 'ShiftType' is not that important here.
df['HoursWorked'] = ((pd.to_datetime(schedule['EndDate']) - pd.to_datetime(schedule['BeginDate'])).dt.total_seconds() / 3600)
Any suggestion would be appreciated.
You could calculate the working hours for each date (BeginDate and EndDate) separately, which gives you three pd.Series: two for the parts of the night shifts that fall on different dates and one for the day shifts that fall on a single date. Use the corresponding date as the index for each Series.
# make sure dtype is correct
df["BeginDate"] = pd.to_datetime(df["BeginDate"])
df["EndDate"] = pd.to_datetime(df["EndDate"])
# get a mask where there is a night shift; end date = start date +1
m = df["BeginDate"].dt.date != df["EndDate"].dt.date
# extract the night shifts
s0 = pd.Series((df["BeginDate"][m].dt.ceil('d')-df["BeginDate"][m]).values,
index=df["BeginDate"][m].dt.floor('d'))
s1 = pd.Series((df["EndDate"][m]-df["EndDate"][m].dt.floor('d')).values,
index=df["EndDate"][m].dt.floor('d'))
# ...and the day shifts
s2 = pd.Series((df["EndDate"][~m]-df["BeginDate"][~m]).values, index=df["BeginDate"][~m].dt.floor('d'))
Now concat and sum them, giving you a pd.Series with the working hours for each date:
working_hours = pd.concat([s0, s1, s2], axis=1).sum(axis=1)
# working_hours
# 2020-01-01 0 days 03:00:00
# 2020-01-02 0 days 10:00:00
# 2020-01-03 0 days 10:00:00
# ...
# Freq: D, dtype: timedelta64[ns]
To join with your original df, you can reindex that and add the working_hours:
new_index = pd.concat([df["BeginDate"].dt.floor('d'),
df["EndDate"].dt.floor('d')]).drop_duplicates()
df_out = df.set_index(df["BeginDate"].dt.floor('d')).reindex(new_index, method='nearest')
df_out.index = df_out.index.set_names('date')
df_out = df_out.drop(['BeginDate', 'EndDate'], axis=1).sort_index()
df_out['HoursWorked'] = working_hours.dt.total_seconds()/3600
# df_out
# EmployeeID ShiftType HoursWorked
# date
# 2020-01-01 10 Holiday 3.0
# 2020-01-02 10 Regular 10.0
# 2020-01-03 10 Regular 10.0
# 2020-01-04 10 Regular 10.0
# 2020-01-05 10 Regular 7.0
# 2020-02-01 20 Regular 8.0
# 2020-02-02 20 Regular 8.0
# 2020-02-03 20 Regular 8.0
# 2020-02-04 20 Regular 8.0
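The approach above indexes hours by date only, so it relies on no two employees working on the same date. As a hedged alternative, here is a minimal sketch that cuts each shift at midnight and sums hours per employee and date. The helper split_by_day and the output name daily are hypothetical, and the sample frame is a shortened version of the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    "EmployeeID": [10, 10, 20],
    "ShiftType": ["Holiday", "Regular", "Regular"],
    "BeginDate": pd.to_datetime(["2020-01-01 21:00", "2020-01-02 21:00", "2020-02-01 09:00"]),
    "EndDate": pd.to_datetime(["2020-01-02 07:00", "2020-01-03 07:00", "2020-02-01 17:00"]),
})

def split_by_day(row):
    """Yield (EmployeeID, Date, hours) for every calendar day the shift touches."""
    start, end = row["BeginDate"], row["EndDate"]
    while start < end:
        # The piece ends at the shift end or at the next midnight, whichever is first
        next_midnight = start.normalize() + pd.Timedelta(days=1)
        stop = min(end, next_midnight)
        yield row["EmployeeID"], start.normalize(), (stop - start).total_seconds() / 3600
        start = stop

pieces = [piece for _, row in df.iterrows() for piece in split_by_day(row)]

daily = (
    pd.DataFrame(pieces, columns=["EmployeeID", "Date", "HoursWorked"])
    .groupby(["EmployeeID", "Date"], as_index=False)["HoursWorked"]
    .sum()
)
print(daily)
```

The loop also handles shifts that span more than one midnight, at the cost of iterating row by row, which may be slow on very large schedules.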
I have data that looks like this.
VendorID lpep_pickup_datetime lpep_dropoff_datetime store_and_fwd_flag
2 1/1/2018 0:18:50 1/1/2018 12:24:39 AM N
2 1/1/2018 0:30:26 1/1/2018 12:46:42 AM N
2 1/1/2018 0:07:25 1/1/2018 12:19:45 AM N
2 1/1/2018 0:32:40 1/1/2018 12:33:41 AM N
2 1/1/2018 0:32:40 1/1/2018 12:33:41 AM N
2 1/1/2018 0:38:35 1/1/2018 1:08:50 AM N
2 1/1/2018 0:18:41 1/1/2018 12:28:22 AM N
2 1/1/2018 0:38:02 1/1/2018 12:55:02 AM N
2 1/1/2018 0:05:02 1/1/2018 12:18:35 AM N
2 1/1/2018 0:35:23 1/1/2018 12:42:07 AM N
So, I converted df.lpep_pickup_datetime to datetime, but originally it comes in as a string. I'm not sure which one is easier to work with. I want to append 5 fields onto my current dataframe: year, month, day, weekday, and hour.
I tried this:
df['Year']=[d.split('-')[0] for d in df.lpep_pickup_datetime]
df['Month']=[d.split('-')[1] for d in df.lpep_pickup_datetime]
df['Day']=[d.split('-')[2] for d in df.lpep_pickup_datetime]
That gives me this error: AttributeError: 'Timestamp' object has no attribute 'split'
I tried this:
df2 = pd.DataFrame(df.lpep_pickup_datetime.dt.strftime('%m-%d-%Y-%H').str.split('/').tolist(),
columns=['Month', 'Day', 'Year', 'Hour'],dtype=int)
df = pd.concat((df,df2),axis=1)
That gives me this error: AssertionError: 4 columns passed, passed data had 1 columns
Basically, I want to parse df.lpep_pickup_datetime into year, month, day, weekday, and hour, appending each to the same dataframe. How can I do that?
Thanks!!
Here you go. First I'm creating a sample dataset and then copying the date column to the name you want, so you can just copy the code. Pandas has a large set of tools for time-series manipulation, so you don't actually need to import datetime. Here you can find a lot more information about it:
import pandas as pd
date_rng = pd.date_range(start='1/1/2018', end='4/01/2018', freq='H')
df = pd.DataFrame(date_rng, columns=['date'])
df['lpep_pickup_datetime'] = df['date']
df['year'] = df['lpep_pickup_datetime'].dt.year
df['month'] = df['lpep_pickup_datetime'].dt.month
df['weekday'] = df['lpep_pickup_datetime'].dt.weekday
df['day'] = df['lpep_pickup_datetime'].dt.day
df['hour'] = df['lpep_pickup_datetime'].dt.hour
print(df)
Output:
date lpep_pickup_datetime year month weekday day hour
0 2018-01-01 00:00:00 2018-01-01 00:00:00 2018 1 0 1 0
1 2018-01-01 01:00:00 2018-01-01 01:00:00 2018 1 0 1 1
2 2018-01-01 02:00:00 2018-01-01 02:00:00 2018 1 0 1 2
3 2018-01-01 03:00:00 2018-01-01 03:00:00 2018 1 0 1 3
4 2018-01-01 04:00:00 2018-01-01 04:00:00 2018 1 0 1 4
... ... ... ... ... ... ... ...
2156 2018-03-31 20:00:00 2018-03-31 20:00:00 2018 3 5 31 20
2157 2018-03-31 21:00:00 2018-03-31 21:00:00 2018 3 5 31 21
2158 2018-03-31 22:00:00 2018-03-31 22:00:00 2018 3 5 31 22
2159 2018-03-31 23:00:00 2018-03-31 23:00:00 2018 3 5 31 23
2160 2018-04-01 00:00:00 2018-04-01 00:00:00 2018 4 6 1 0
EDIT: Since this is not working (as stated in the comments on this answer), I believe your data is formatted incorrectly. Try this before applying anything (the format assumes month/day/four-digit year, as in your sample):
df['lpep_pickup_datetime'] = pd.to_datetime(df['lpep_pickup_datetime'], format='%m/%d/%Y %H:%M:%S')
If this format is recognized properly, then you should have no trouble using dt.year,dt.month,dt.hour,dt.day,dt.weekday.
Give this a go. Since your dates are in the datetime dtype already, just use the datetime properties to extract each part.
import pandas as pd
from datetime import datetime as dt
# Creating a fake dataset of dates.
dates = [dt.now().strftime('%d/%m/%Y %H:%M:%S') for i in range(10)]
df = pd.DataFrame({'lpep_pickup_datetime': dates})
df['lpep_pickup_datetime'] = pd.to_datetime(df['lpep_pickup_datetime'], format='%d/%m/%Y %H:%M:%S')  # day-first strings, so state the format explicitly
# Parse each date into its parts and store as a new column.
df['month'] = df['lpep_pickup_datetime'].dt.month
df['day'] = df['lpep_pickup_datetime'].dt.day
df['year'] = df['lpep_pickup_datetime'].dt.year
# ... and so on ...
Output:
lpep_pickup_datetime month day year
0 2019-09-24 16:46:10 9 24 2019
1 2019-09-24 16:46:10 9 24 2019
2 2019-09-24 16:46:10 9 24 2019
3 2019-09-24 16:46:10 9 24 2019
4 2019-09-24 16:46:10 9 24 2019
5 2019-09-24 16:46:10 9 24 2019
6 2019-09-24 16:46:10 9 24 2019
7 2019-09-24 16:46:10 9 24 2019
8 2019-09-24 16:46:10 9 24 2019
9 2019-09-24 16:46:10 9 24 2019
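To tie the answers together, here is a minimal sketch that starts from strings shaped like the pickup column in the question and appends all five requested fields. The month/day/year order is an assumption (every sample row is 1/1/2018, so the order cannot be read off the data), and the two sample values are made up for illustration.

```python
import pandas as pd

# Illustrative strings in the same shape as the question's pickup column
df = pd.DataFrame({"lpep_pickup_datetime": ["1/1/2018 0:18:50", "1/2/2018 13:30:26"]})

# Assumption: month/day/year order; swap %m and %d if the data is day-first
pickup = pd.to_datetime(df["lpep_pickup_datetime"], format="%m/%d/%Y %H:%M:%S")

df["year"] = pickup.dt.year
df["month"] = pickup.dt.month
df["day"] = pickup.dt.day
df["weekday"] = pickup.dt.weekday   # Monday=0 ... Sunday=6
df["hour"] = pickup.dt.hour

print(df)
```

If weekday names are preferred over numbers, pickup.dt.day_name() can be used in place of pickup.dt.weekday.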
I tried to ask this question previously, but it was too ambiguous so here goes again. I am new to programming, so I am still learning how to ask questions in a useful way.
In summary, I have a pandas dataframe that resembles "INPUT DATA" that I would like to convert to "DESIRED OUTPUT", as shown below.
Each row contains an ID, a DateTime, and a Value. For each unique ID, the first row corresponds to timepoint 'zero', and each subsequent row contains a value 5 minutes following the previous row and so on.
I would like to calculate the mean of all the IDs for every 'time elapsed' timepoint. For example, in "DESIRED OUTPUT" Time Elapsed=0.0 would have the value 128.3 ((100+105+180)/3); Time Elapsed=5.0 would have the value 150.0 ((150+110+190)/3); Time Elapsed=10.0 would have the value 133.3 ((125+90+185)/3) and so on for Time Elapsed=15, 20, 25 etc.
I'm not sure how to create a new column which has the value for the time elapsed for each ID (e.g. 0.0, 5.0, 10.0 etc). I think that once I know how to do that, then I can use the groupby function to calculate the means for each time elapsed.
INPUT DATA
ID DateTime Value
1 2018-01-01 15:00:00 100
1 2018-01-01 15:05:00 150
1 2018-01-01 15:10:00 125
2 2018-02-02 13:15:00 105
2 2018-02-02 13:20:00 110
2 2018-02-02 13:25:00 90
3 2019-03-03 05:05:00 180
3 2019-03-03 05:10:00 190
3 2019-03-03 05:15:00 185
DESIRED OUTPUT
Time Elapsed Mean Value
0.0 128.3
5.0 150.0
10.0 133.3
Here is one way: use transform with groupby to build the group key 'Time Elapsed' (each timestamp minus the first timestamp of its ID), then group by that key and take the mean.
df['Time Elapsed']=df.DateTime-df.groupby('ID').DateTime.transform('first')
df.groupby('Time Elapsed').Value.mean()
Out[998]:
Time Elapsed
00:00:00 128.333333
00:05:00 150.000000
00:10:00 133.333333
Name: Value, dtype: float64
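The result above is keyed by Timedelta values, while the desired output in the question lists elapsed time as plain minutes (0.0, 5.0, 10.0). A minimal sketch of that conversion, using the question's sample data; it uses transform('min') rather than 'first' so that it does not rely on the rows being sorted, which is a small deviation from the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "DateTime": pd.to_datetime([
        "2018-01-01 15:00:00", "2018-01-01 15:05:00", "2018-01-01 15:10:00",
        "2018-02-02 13:15:00", "2018-02-02 13:20:00", "2018-02-02 13:25:00",
        "2019-03-03 05:05:00", "2019-03-03 05:10:00", "2019-03-03 05:15:00",
    ]),
    "Value": [100, 150, 125, 105, 110, 90, 180, 190, 185],
})

# Time since each ID's earliest timestamp, as a Timedelta
elapsed = df["DateTime"] - df.groupby("ID")["DateTime"].transform("min")

# Convert the Timedelta to minutes as a float
df["Time Elapsed"] = elapsed.dt.total_seconds() / 60

result = (
    df.groupby("Time Elapsed")["Value"]
    .mean()
    .round(1)
    .rename("Mean Value")
    .reset_index()
)
print(result)
```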
You can do this explicitly by taking advantage of the datetime attributes of the DateTime column in your DataFrame.
First get the year, month and day for each DateTime, since they all change in your data:
df['month'] = df['DateTime'].dt.month
df['day'] = df['DateTime'].dt.day
df['year'] = df['DateTime'].dt.year
print(df)
ID DateTime Value month day year
1 1 2018-01-01 15:00:00 100 1 1 2018
1 1 2018-01-01 15:05:00 150 1 1 2018
1 1 2018-01-01 15:10:00 125 1 1 2018
2 2 2018-02-02 13:15:00 105 2 2 2018
2 2 2018-02-02 13:20:00 110 2 2 2018
2 2 2018-02-02 13:25:00 90 2 2 2018
3 3 2019-03-03 05:05:00 180 3 3 2019
3 3 2019-03-03 05:10:00 190 3 3 2019
3 3 2019-03-03 05:15:00 185 3 3 2019
Then append a sequential counter column (per this SO post).
The counter is computed within (1) each year, then (2) each month and then (3) each day.
Since the data are in multiples of 5 minutes, use this to scale the counter values (i.e. the counter will be in multiples of 5 minutes, rather than a sequence of increasing integers):
df['Time Elapsed'] = df.groupby(['year', 'month', 'day']).cumcount() + 1
df['Time Elapsed'] *= 5
print(df)
ID DateTime Value month day year Time Elapsed
1 1 2018-01-01 15:00:00 100 1 1 2018 5
1 1 2018-01-01 15:05:00 150 1 1 2018 10
1 1 2018-01-01 15:10:00 125 1 1 2018 15
2 2 2018-02-02 13:15:00 105 2 2 2018 5
2 2 2018-02-02 13:20:00 110 2 2 2018 10
2 2 2018-02-02 13:25:00 90 2 2 2018 15
3 3 2019-03-03 05:05:00 180 3 3 2019 5
3 3 2019-03-03 05:10:00 190 3 3 2019 10
3 3 2019-03-03 05:15:00 185 3 3 2019 15
Perform the groupby over the newly appended counter column
dfg = df.groupby('Time Elapsed')['Value'].mean()
print(dfg)
Time Elapsed
5 128.333333
10 150.000000
15 133.333333
Name: Value, dtype: float64
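The counter above is grouped by calendar date and starts at 5, so it only lines up per ID because each ID in the sample falls on its own date. A variant sketch, under the same assumption that rows are exactly 5 minutes apart, groups by ID instead and starts at 0 to match the desired output in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "DateTime": pd.to_datetime([
        "2018-01-01 15:00:00", "2018-01-01 15:05:00", "2018-01-01 15:10:00",
        "2018-02-02 13:15:00", "2018-02-02 13:20:00", "2018-02-02 13:25:00",
        "2019-03-03 05:05:00", "2019-03-03 05:10:00", "2019-03-03 05:15:00",
    ]),
    "Value": [100, 150, 125, 105, 110, 90, 180, 190, 185],
})

# Make sure each ID's rows are in time order before counting
df = df.sort_values(["ID", "DateTime"])

# cumcount() numbers rows 0, 1, 2, ... within each ID; scale to minutes
# (assumes a fixed 5-minute spacing with no missing readings)
df["Time Elapsed"] = df.groupby("ID").cumcount() * 5

dfg = df.groupby("Time Elapsed")["Value"].mean()
print(dfg)
```

If readings can be missing or irregular, subtracting each ID's first timestamp is the safer route, since a row counter silently shifts later rows.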
I'm struggling to figure out how to achieve this. I'm trying to get the average price for each combination of day and hour. So a DataFrame like
day hour price booked
0 monday 7 12.0 True
1 monday 8 12.0 False
2 tuesday 7 13.0 True
3 tuesday 8 13.0 False
4 monday 7 15.0 True
5 monday 8 13.0 False
6 tuesday 7 13.0 True
7 tuesday 8 15.0 False
should give something like:
day hour avg. price
0 monday 7 13.5
1 monday 8 12.5
2 tuesday 7 13.0
3 tuesday 8 14.0
I would like this to generalize to larger data sets.
You can groupby the day and hour column and then call mean on the price column:
In [46]:
df.groupby(['day','hour'])['price'].mean()
Out[46]:
day hour
monday 7 13.5
8 12.5
tuesday 7 13.0
8 14.0
Name: price, dtype: float64
To restore the day and hour back as columns you can call reset_index:
In [47]:
df.groupby(['day','hour'])['price'].mean().reset_index()
Out[47]:
day hour price
0 monday 7 13.5
1 monday 8 12.5
2 tuesday 7 13.0
3 tuesday 8 14.0
You can also rename the column if you desire:
In [48]:
avg = df.groupby(['day','hour'])['price'].mean().reset_index()
avg.rename(columns={'price':'avg_price'},inplace=True)
avg
Out[48]:
day hour avg_price
0 monday 7 13.5
1 monday 8 12.5
2 tuesday 7 13.0
3 tuesday 8 14.0
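The group, reset and rename steps can also be combined. A minimal sketch using named aggregation (available in pandas 0.25 and later) with as_index=False, on the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "day": ["monday", "monday", "tuesday", "tuesday",
            "monday", "monday", "tuesday", "tuesday"],
    "hour": [7, 8, 7, 8, 7, 8, 7, 8],
    "price": [12.0, 12.0, 13.0, 13.0, 15.0, 13.0, 13.0, 15.0],
    "booked": [True, False, True, False, True, False, True, False],
})

# as_index=False keeps 'day' and 'hour' as ordinary columns;
# avg_price=("price", "mean") names the output column directly
avg = df.groupby(["day", "hour"], as_index=False).agg(avg_price=("price", "mean"))
print(avg)
```

The same call scales to larger data sets unchanged, and further statistics can be added as extra keyword arguments to agg.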