Dataframe from Series grouped by weekday and hour of day - python

I have a Series with a DatetimeIndex, as such:
time my_values
2017-12-20 09:00:00 0.005611
2017-12-20 10:00:00 -0.004704
2017-12-20 11:00:00 0.002980
2017-12-20 12:00:00 0.001497
...
2021-08-20 13:00:00 -0.001084
2021-08-20 14:00:00 -0.001608
2021-08-20 15:00:00 -0.002182
2021-08-20 16:00:00 -0.012891
2021-08-20 17:00:00 0.002711
I would like to create a dataframe of average values, with the weekday names as columns and the hour of the day as index, resulting in this:
hour Monday Tuesday ... Sunday
0 0.005611 -0.001083 -0.003467
1 -0.004704 0.003362 -0.002357
2 0.002980 0.019443 0.009814
3 0.001497 -0.002967 -0.003466
...
19 -0.001084 0.009822 0.003362
20 -0.001608 -0.002967 -0.003567
21 -0.002182 0.035600 -0.003865
22 -0.012891 0.002945 -0.002345
23 0.002711 -0.002458 0.006467
How can I do this in Python?

You can do this as follows:
import pandas as pd
import numpy as np

# Coerce time to datetime
df['time'] = pd.to_datetime(df['time'])
# Extract day name and hour
df = df.assign(day=df['time'].dt.strftime('%A'), hour=df['time'].dt.hour)
# Pivot: hours as index, weekday names as columns, mean of my_values as values
pd.pivot_table(df, values='my_values', index=['hour'],
               columns=['day'], aggfunc=np.mean)
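Note that strftime('%A') produces day names that pivot_table will order alphabetically (Friday, Monday, ...). A small follow-up sketch, assuming the df above, that reindexes the columns into calendar order:
import calendar

# weekday columns come out alphabetical; reindex into Monday..Sunday order
result = pd.pivot_table(df, values='my_values', index='hour',
                        columns='day', aggfunc='mean')
result = result.reindex(columns=list(calendar.day_name))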

Since you asked for a solution that returns the average values, I propose this groupby solution:
df["weekday"] = df.time.dt.strftime('%A')
df["hour"] = df.time.dt.strftime('%H')   # zero-padded string, so it still sorts correctly
df = df.drop(["time"], axis=1)
# calculate averages by weekday and hour
df2 = df.groupby(["hour", "weekday"]).mean()
# put it in the right format
df2.unstack()
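Note that unstacking this one-column frame leaves a ('my_values', weekday) MultiIndex on the columns, and the weekdays come out alphabetical. A follow-up sketch, assuming df2 from above, that flattens and reorders them:
import calendar

# drop the outer 'my_values' level and put the weekdays in calendar order
df3 = df2.unstack().droplevel(0, axis=1)
df3 = df3.reindex(columns=list(calendar.day_name))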

Related

Reshape dataframe into several columns based on date column

I want to rearrange my example dataframe (df.csv) below based on the date column. Each row represents an hour's data; for instance, for both dates 2002-01-01 and 2002-01-02 there are 5 rows each, one per hour.
date,symbol
2002-01-01,A
2002-01-01,A
2002-01-01,A
2002-01-01,B
2002-01-01,A
2002-01-02,B
2002-01-02,B
2002-01-02,A
2002-01-02,A
2002-01-02,A
My expected output is as below.
date,hour1, hour2, hour3, hour4, hour5
2002-01-01,A,A,A,B,A
2002-01-02,B,B,A,A,A
I have tried the below, as explained here: https://pandas.pydata.org/docs/user_guide/reshaping.html, but it doesn't work in my case because the symbol column contains duplicates.
import pandas as pd
import numpy as np
df = pd.read_csv('df.csv')
pivoted = df.pivot(index="date", columns="symbol")
print(pivoted)
The data does not have timestamps, only the date. However, each row for the same date represents an hourly interval; for instance, the output could also be represented as below:
date,01:00, 02:00, 03:00, 04:00, 05:00
2002-01-01,A,A,A,B,A
2002-01-02,B,B,A,A,A
where hour1 represents 01:00, hour2 represents 02:00, etc.
You had the correct pivot approach, but you were missing a column 'time', so let's split the datetime into date and time:
s = pd.to_datetime(df['date'])
df['date'] = s.dt.date
df['time'] = s.dt.time
df2 = df.pivot(index='date', columns='time', values='symbol')
output:
time 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00
date
2002-01-01 A A A B A
2002-01-02 B B A A A
Alternatively, to get HH:MM times, use df['time'] = s.dt.strftime('%H:%M')
used input:
date,symbol
2002-01-01 01:00,A
2002-01-01 02:00,A
2002-01-01 03:00,A
2002-01-01 04:00,B
2002-01-01 05:00,A
2002-01-02 01:00,B
2002-01-02 02:00,B
2002-01-02 03:00,A
2002-01-02 04:00,A
2002-01-02 05:00,A
no time as input!
If you really have no time in the input dates and need to 'invent' increasing ones, you could use groupby.cumcount (adding 1 so the invented hours start at 01:00, matching the expected output):
df['time'] = pd.to_datetime(df.groupby('date').cumcount().add(1), format='%H').dt.strftime('%H:%M')
df2 = df.pivot(index='date', columns='time', values='symbol')
output:
time 01:00 02:00 03:00 04:00 05:00
date
2002-01-01 A A A B A
2002-01-02 B B A A A
For each entry as an hour:
k = df.groupby("date").cumcount().add(1).astype(str).radd("hour")
out = df.pivot_table(values='symbol', index='date', columns=k, aggfunc='min')
print(out)
hour1 hour2 hour3 hour4 hour5
date
2002-01-01 A A A B A
2002-01-02 B B A A A
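One caveat worth noting: with ten or more rows per date, the string columns would sort lexicographically ('hour10' before 'hour2'). A small sketch that reorders them by their numeric suffix:
# reorder hourN columns numerically rather than lexicographically
out = out[sorted(out.columns, key=lambda c: int(c[4:]))]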
I have an approach for you. I guess it's not the most elegant way, since I have to rename both index and columns, but it does the job.
new_cols = ['01:00', '02:00', '03:00', '04:00', '05:00']
df1 = df.loc[df['date']=='2002-01-01', :].T.drop('date').set_axis(new_cols, axis=1).set_axis(['2002-01-01'])
df2 = df.loc[df['date']=='2002-01-02', :].T.drop('date').set_axis(new_cols, axis=1).set_axis(['2002-01-02'])
result = pd.concat([df1,df2])
print(result)
Output:
01:00 02:00 03:00 04:00 05:00
2002-01-01 A A A B A
2002-01-02 B B A A A
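The same idea works without hardcoding each date, by looping over the unique dates instead. A sketch under the same assumption of exactly five rows per date:
new_cols = ['01:00', '02:00', '03:00', '04:00', '05:00']
parts = []
for d in df['date'].unique():
    # transpose one date's block, drop the date row, relabel columns and index
    block = df.loc[df['date'] == d, :].T.drop('date')
    parts.append(block.set_axis(new_cols, axis=1).set_axis([d]))
result = pd.concat(parts)
print(result)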

how to get the part of the day from a given 24 hour time format in pandas and python

Hi, I have a dataset in which a column value looks like 08:25:00, and I want the resultant value to be 'Morning'.
10:36:00 - Morning
16:00:00 - Afternoon
17:00:00 - Afternoon
19:00:00 - Evening
I tried the steps below, but for a few rows I am getting NaN values and an incorrect result:
df['PNR_CREATE_TM_1']=pd.DataFrame({'PNR_CREATE_TM':range(1,25)})
bns=[0,4,8,12,16,20,24]
part_days=['Late Night','Early Morning','Morning','Noon','Evening','Night']
df['PNR_CREATE_SESSION'] = pd.cut(df['PNR_CREATE_TM_1'],bins=bns,labels=part_days,include_lowest=True)
Assuming the initial column 'time' is of string type, you could split out the hours and use pandas.cut:
df = pd.DataFrame({'time': ['10:36:00', '16:00:00', '17:00:00', '19:00:00']})
bns=[0,4,8,12,16,20,24]
part_days=['Late Night','Early Morning','Morning','Noon','Evening','Night']
s = df['time'].str.split(':').str[0].astype(int)
df['part'] = pd.cut(s, bins=bns, labels=part_days, include_lowest=True)
output:
time part
0 10:36:00 Morning
1 16:00:00 Noon
2 17:00:00 Evening
3 19:00:00 Evening
Convert the values to datetimes with to_datetime and get the hours with Series.dt.hour:
df['PNR_CREATE_SESSION'] = pd.cut(pd.to_datetime(df['PNR_CREATE_TM_1']).dt.hour,
                                  bins=bns,
                                  labels=part_days,
                                  include_lowest=True)
Or, if the values are Python time objects:
df['PNR_CREATE_SESSION'] = pd.cut(pd.to_datetime(df['PNR_CREATE_TM_1'].astype(str)).dt.hour,
                                  bins=bns,
                                  labels=part_days,
                                  include_lowest=True)
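For reference, a minimal self-contained check, assuming PNR_CREATE_TM_1 holds 'HH:MM:SS' strings (the sample values here are illustrative):
import pandas as pd

df = pd.DataFrame({'PNR_CREATE_TM_1': ['10:36:00', '16:00:00', '17:00:00', '19:00:00']})
bns = [0, 4, 8, 12, 16, 20, 24]
part_days = ['Late Night', 'Early Morning', 'Morning', 'Noon', 'Evening', 'Night']
# parse the strings, keep only the hour, then bin it
hours = pd.to_datetime(df['PNR_CREATE_TM_1']).dt.hour
df['PNR_CREATE_SESSION'] = pd.cut(hours, bins=bns, labels=part_days, include_lowest=True)
print(df)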

Is there a Pandas function to highlight a week's 10 lowest values in a time series?

Rookie here so please excuse my question format:
I got an event time series dataset for two months (columns for "date/time" and "# of events", each row representing an hour).
I would like to highlight the 10 hours with the lowest numbers of events for each week. Is there a specific Pandas function for that? Thanks!
Let's say you have a dataframe df with a column col as well as a datetime column.
You can simply sort the column with:
import pandas as pd
df = pd.DataFrame({'col': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
                   'datetime': ['2019-01-01 00:00:00','2015-02-01 00:00:00','2015-03-01 00:00:00','2015-04-01 00:00:00',
                                '2018-05-01 00:00:00','2016-06-01 00:00:00','2017-07-01 00:00:00','2013-08-01 00:00:00',
                                '2015-09-01 00:00:00','2015-10-01 00:00:00','2015-11-01 00:00:00','2015-12-01 00:00:00',
                                '2014-01-01 00:00:00','2020-01-01 00:00:00','2014-01-01 00:00:00']})
df = df.sort_values('col')
df = df.iloc[0:10,:]
df
Output:
col datetime
0 1 2019-01-01 00:00:00
1 2 2015-02-01 00:00:00
2 3 2015-03-01 00:00:00
3 4 2015-04-01 00:00:00
4 5 2018-05-01 00:00:00
5 6 2016-06-01 00:00:00
6 7 2017-07-01 00:00:00
7 8 2013-08-01 00:00:00
8 9 2015-09-01 00:00:00
9 10 2015-10-01 00:00:00
I know there's a function called nlargest, and there is indeed an nsmallest counterpart: pandas.DataFrame.nsmallest
df.nsmallest(n=10, columns=['col'])
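Since the question asks for the lowest values per week rather than overall, a sketch that combines nsmallest with a weekly Grouper (assuming columns named 'datetime' and 'col' as above):
# 10 smallest values within each calendar week
df['datetime'] = pd.to_datetime(df['datetime'])
weekly_lows = (df.set_index('datetime')
                 .groupby(pd.Grouper(freq='W'))['col']
                 .nsmallest(10))
print(weekly_lows)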
My bad, so your DatetimeIndex is an hourly sampling, and you need the hour(s) with the fewest events per week.
...
Date n_events
2020-06-06 08:00:00 3
2020-06-06 09:00:00 3
2020-06-06 10:00:00 2
...
Well, I'd start by converting each hour into columns.
1. Create an hour column that holds the hour of the day:
df['hour'] = df['date'].dt.hour
2. Pivot the hour values into columns, with n_events as the values.
You'll then have one datetime index and 24 hour columns, with the values denoting the number of events: pandas.DataFrame.pivot_table
...
Date hour0 ... hour8 hour9 hour10 ... hour23
2020-06-06 0 3 3 2 0
...
Then you can resample it to a weekly level, aggregating with sum:
df.resample('W').sum()
The last part is a bit tricky to do on the dataframe, but fairly simple if you just need the output:
for row in df.itertuples():
    print(sorted(row[1:]))
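If "highlight" is meant literally (e.g. rendering in a notebook), pandas' Styler can shade the 10 lowest cells of each weekly row. A sketch, assuming the weekly hour-column frame built in the steps above:
def mark_lowest(row, n=10):
    # style string for cells holding one of the row's n smallest values
    low = row.isin(row.nsmallest(n))
    return ['background-color: yellow' if v else '' for v in low]

weekly = df.resample('W').sum()   # one row per week, 24 hour columns
styled = weekly.style.apply(mark_lowest, axis=1)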

Python: Working with columns inside a pandas Dataframe

Good evening,
is it possible to calculate with - let's say - two columns inside a dataframe and add a third column with the fitting result?
Dataframe (original):
name time_a time_b
name_a 08:00:00 09:00:00
name_b 07:45:00 08:15:00
name_c 07:00:00 08:10:00
name_d 06:00:00 10:00:00
Or to be specific...is it possible to obtain the difference of two times (time_b - time_a) and create a
new column (time_c) at the end of the dataframe?
Dataframe (new):
name time_a time_b time_c
name_a 08:00:00 09:00:00 01:00:00
name_b 07:45:00 08:15:00 00:30:00
name_c 07:00:00 08:10:00 01:10:00
name_d 06:00:00 10:00:00 04:00:00
Thanks and a good night!
If your columns are in datetime or timedelta format:
# New column is a timedelta object
df["time_c"] = (df["time_b"] - df["time_a"])
If your columns are in datetime.time format (which it appears they are):
import datetime

def time_diff(time_1, time_2):
    """returns the difference between time_1 and time_2 (time_2 - time_1)"""
    now = datetime.datetime.now()
    time_1 = datetime.datetime.combine(now, time_1)
    time_2 = datetime.datetime.combine(now, time_2)
    return time_2 - time_1

# Apply the function row-wise
df["time_c"] = df[["time_a", "time_b"]].apply(lambda arr: time_diff(*arr), axis=1)
Alternatively, you can convert to a timedelta by first converting to a string:
df["time_a"]=pd.to_timedelta(df["time_a"].astype(str))
df["time_b"]=pd.to_timedelta(df["time_b"].astype(str))
df["time_c"] = df["time_b"] - df["time_a"]

upsample between two date columns

I have the following df
lst = [[1548828606206000000, 1548840373139000000],
[1548841285708000000, 1548841458405000000],
[1548842198276000000, 1548843109519000000],
[1548844022821000000, 1548844934207000000],
[1548845431090000000, 1548845539219000000],
[1548845555332000000, 1548845846621000000],
[1548847176147000000, 1548851020030000000],
[1548851704053000000, 1548852256143000000],
[1548852436514000000, 1548855900767000000],
[1548856817770000000, 1548857162183000000],
[1548858736931000000, 1548858979032000000]]
df = pd.DataFrame(lst,columns =['start','end'])
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
and I would like to get, per hour, the total duration covered by these start/end intervals. E.g. in my dummy df, the 6th hour should be 60 min (the maximum per hour) - 00:10:06 = 00:49:54. The 7th and 8th hours should be 1:00:00 each, as the first interval ends at 09:26:13. The 9th hour should be 00:26:13, plus all the intervals in the following rows that overlap with the 9th hour: 09:44 - 09:41 = 3 min, and 60 min - 00:56 = 4 min. So the total for the 9th hour should be 26 + 3 + 4 ~= 00:32:28.
My initial approach was to merge start and end, add dummy points every 3rd row, upsample to 1S, take the difference between rows, and sum up only the actual rows. There must be a more pythonic way of doing this. Any hint would be great.
IIUC, something like this:
df.apply(lambda x: pd.to_timedelta(pd.Series(1, index=pd.date_range(x.start, x.end, freq='S'))
                                     .groupby(pd.Grouper(freq='H')).count(), unit='S'),
         axis=1).sum()
Output:
2019-01-30 06:00:00 00:49:54
2019-01-30 07:00:00 01:00:00
2019-01-30 08:00:00 01:00:00
2019-01-30 09:00:00 00:32:28
2019-01-30 10:00:00 00:33:43
2019-01-30 11:00:00 00:40:24
2019-01-30 12:00:00 00:45:37
2019-01-30 13:00:00 00:45:01
2019-01-30 14:00:00 00:09:48
Freq: H, dtype: timedelta64[ns]
Or to get it down to hours, try:
df.apply(lambda r: pd.to_timedelta(pd.Series(1, index=pd.date_range(r.start, r.end, freq='S'))
                                     .pipe(lambda x: x.groupby(x.index.hour).count()), unit='S'),
         axis=1).sum()
Output:
6 00:49:54
7 01:00:00
8 01:00:00
9 00:32:28
10 00:33:43
11 00:40:24
12 00:45:37
13 00:45:01
14 00:09:48
dtype: timedelta64[ns]
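Materialising one row per second gets expensive for long intervals. A sketch of an arithmetic alternative (hourly_durations is a hypothetical helper, not a pandas function) that clips each interval against every hour bucket it touches; its totals can differ from the per-second counts above by a second at interval endpoints, since date_range includes both ends:
import pandas as pd

def hourly_durations(df):
    totals = {}
    for start, end in zip(df['start'], df['end']):
        edge = start.floor('H')            # first hour bucket touched
        while edge < end:
            nxt = edge + pd.Timedelta(hours=1)
            # overlap of [start, end] with the bucket [edge, nxt)
            overlap = min(end, nxt) - max(start, edge)
            totals[edge] = totals.get(edge, pd.Timedelta(0)) + overlap
            edge = nxt
    return pd.Series(totals).sort_index()

print(hourly_durations(df))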
