How can I control the hourly GROUPBY setting in Pandas? - python

I have the following data set:
time value
2019-01-01 8:00:00 10
2019-01-01 8:30:00 20
2019-01-01 9:00:00 30
2019-01-01 9:30:00 100
2019-01-01 10:00:00 400
Using df.groupby(pd.Grouper(key='time', freq='1h')).sum().reset_index() returns:
time value
2019-01-01 8:00:00 30
2019-01-01 9:00:00 130
2019-01-01 10:00:00 400
So it groups rows by the hour they fall in. But how can I control the bin edges? I would like any time > 8:00 and <= 9:00 to land in the 9:00 group. For example:
time value
2019-01-01 8:00:00 10
2019-01-01 9:00:00 50
2019-01-01 10:00:00 500

IIUC, use ceil (assuming time is the index):
Yourdf = df.groupby(df.index.ceil('H')).sum()
value
time
2019-01-01 08:00:00 10
2019-01-01 09:00:00 50
2019-01-01 10:00:00 500
Or resample:
df.resample('H', closed='right').sum()
value
time
2019-01-01 07:00:00 10
2019-01-01 08:00:00 50
2019-01-01 09:00:00 500
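Note that with closed='right' alone the bins are labelled by their left edge (07:00, 08:00, 09:00 above); passing label='right' as well labels them by the right edge, matching the desired output:
df.resample('H', closed='right', label='right').sum()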

Use closed='right' (and label='right' so the bin labels match the desired output), i.e.
pd.Grouper(key='time', freq='1h', closed='right', label='right')
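Put together, a minimal self-contained sketch on the question's data (column names as in the question):
import pandas as pd

df = pd.DataFrame({'time': pd.to_datetime(['2019-01-01 08:00', '2019-01-01 08:30',
                                           '2019-01-01 09:00', '2019-01-01 09:30',
                                           '2019-01-01 10:00']),
                   'value': [10, 20, 30, 100, 400]})
# bins are (7:00, 8:00], (8:00, 9:00], (9:00, 10:00], each labelled by its right edge
out = df.groupby(pd.Grouper(key='time', freq='1h', closed='right', label='right')).sum().reset_index()
print(out)  # 8:00 -> 10, 9:00 -> 50, 10:00 -> 500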

Related

Timestamp range to hours Python Pandas

Is there a way to split a timestamp range into hours?
FROM:
Person   start                 stop
Tom      2019-01-01 12:15:00   2019-01-01 14:25:00
TO:
Person   start                 stop
Tom      2019-01-01 12:15:00   2019-01-01 13:00:00
Tom      2019-01-01 13:00:00   2019-01-01 14:00:00
Tom      2019-01-01 14:00:00   2019-01-01 14:25:00
First get all the hourly boundaries between start.floor('h') and stop.ceil('h') using pd.date_range. Build a list from start, the boundaries from the second to the second-to-last, and stop, then explode it. Assign stop by shifting that column by -1, rename the column to start, and drop the NaN rows (the NaN appears because of the shift and is not needed).
def getRange(row):
    # hourly boundaries spanning the whole interval
    rang = pd.date_range(row['start'].floor('h'), row['stop'].ceil('h'), freq='h')
    # keep the real start and stop, with the interior boundaries in between
    return [row['start']] + rang.to_list()[1:-1] + [row['stop']]

df = (df.assign(range=df.apply(getRange, axis=1))
        .drop(columns=['start', 'stop'])
        .explode('range'))
df = df.assign(stop=df['range'].shift(-1)).dropna().rename(columns={'range': 'start'})
OUTPUT:
Person start stop
0 Tom 2019-01-01 12:15:00 2019-01-01 13:00:00
0 Tom 2019-01-01 13:00:00 2019-01-01 14:00:00
0 Tom 2019-01-01 14:00:00 2019-01-01 14:25:00
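One hedged caveat: shift(-1) runs over the whole frame, so with more than one input row the last slice of one person would pick up the next person's first boundary before dropna removes it. Since explode keeps the original index, shifting within each original row avoids that; a sketch of the same last step:
df = (df.assign(stop=df.groupby(level=0)['range'].shift(-1))
        .dropna()
        .rename(columns={'range': 'start'}))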

Transform the Random time intervals to 30 mins Structured interval

I have this DataFrame, where each row records the time period in which a task happened:
Date Start Time End Time
0 2016-01-01 0:00:00 2016-01-01 0:10:00 2016-01-01 0:25:00
1 2016-01-01 0:00:00 2016-01-01 1:17:00 2016-01-01 1:31:00
2 2016-01-02 0:00:00 2016-01-02 0:30:00 2016-01-02 0:32:00
... ... ... ...
I want to convert this df to 30-minute intervals.
Expected outcome:
Date Hours
1 2016-01-01 0:30:00 0:15
2 2016-01-01 1:00:00 0:00
3 2016-01-01 1:30:00 0:13
4 2016-01-01 2:00:00 0:01
5 2016-01-01 2:30:00 0:00
6 2016-01-01 3:00:00 0:00
... ...
47 2016-01-01 23:30:00 0:00
48 2016-01-02 0:00:00 0:00
49 2016-01-02 00:30:00 0:00
50 2016-01-02 01:00:00 0:02
... ...
I was trying to do this with a for loop, which was getting tedious. Is there a simple way to do it in pandas?
IIUC you can discard the Date column, get the time difference between start and end, group by 30 minutes, and aggregate with first (assuming you always have only one entry per 30-minute slot):
print(df.assign(Diff=df["End Time"] - df["Start Time"])
        .groupby(pd.Grouper(key="Start Time", freq="30T"))
        .agg({"Diff": "first"})
        .fillna(pd.Timedelta(seconds=0)))
Diff
Start Time
2016-01-01 00:00:00 0 days 00:15:00
2016-01-01 00:30:00 0 days 00:00:00
2016-01-01 01:00:00 0 days 00:14:00
2016-01-01 01:30:00 0 days 00:00:00
2016-01-01 02:00:00 0 days 00:00:00
2016-01-01 02:30:00 0 days 00:00:00
...
2016-01-02 00:30:00 0 days 00:02:00
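If a 30-minute slot can contain more than one task, the same groupby with "sum" instead of "first" would total the durations; an untested variant of the code above:
print(df.assign(Diff=df["End Time"] - df["Start Time"])
        .groupby(pd.Grouper(key="Start Time", freq="30T"))
        .agg({"Diff": "sum"})
        .fillna(pd.Timedelta(seconds=0)))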
The idea is to create a series of 0 with a per-minute DatetimeIndex running from the min start time to the max end time. Add 1 at each Start Time and subtract 1 at each End Time; cumsum then marks the minutes between Start and End, resample.sum counts them per 30 minutes, and reset_index tidies up. The last line of code just gets the proper format in the Hours column.
# create a series of 0 with a per-minute datetime index
res = pd.Series(data=0,
                index=pd.DatetimeIndex(pd.date_range(df['Start Time'].min(),
                                                     df['End Time'].max(),
                                                     freq='T'),
                                       name='Dates'),
                name='Hours')
# add 1 at each start time and subtract 1 at each end time
res[df['Start Time']] += 1
res[df['End Time']] -= 1
# cumsum to get the right value for each minute, then resample per 30 minutes
res = (res.cumsum()
          .resample('30T', label='right').sum()
          .reset_index('Dates'))
# change the format of the Hours column, honestly not necessary
res['Hours'] = pd.to_datetime(res['Hours'], format='%M').dt.strftime('%H:%M')  # or .dt.time
print(res)
Dates Hours
0 2016-01-01 00:30:00 00:15
1 2016-01-01 01:00:00 00:00
2 2016-01-01 01:30:00 00:13
3 2016-01-01 02:00:00 00:01
4 2016-01-01 02:30:00 00:00
5 2016-01-01 03:00:00 00:00
...
48 2016-01-02 00:30:00 00:00
49 2016-01-02 01:00:00 00:02

combine columns with different data types to make a single dateTime column in pandas data frames

I have imported data from some source where the date column has dtype object and the hour is an integer; it looks something like:
Date Hour Val
2019-01-01 1 0
2019-01-01 2 0
2019-01-01 3 0
2019-01-01 4 0
2019-01-01 5 0
2019-01-01 6 0
2019-01-01 7 0
2019-01-01 8 0
I need a single column that has the date-time in a column that looks like this:
DATETIME
2019-01-01 01:00:00
2019-01-01 02:00:00
2019-01-01 03:00:00
2019-01-01 04:00:00
2019-01-01 05:00:00
2019-01-01 06:00:00
2019-01-01 07:00:00
2019-01-01 08:00:00
I tried to convert the date column to datetime format using
pd.datetime(df.Date)
and then using
df.Date.dt.hour = df.Hour
I get the error
ValueError: modifications to a property of a datetimelike object are not supported. Change values on the original.
Is there an easy way to do this?
Use pandas.to_timedelta and pandas.to_datetime:
# if needed
df['Date'] = pd.to_datetime(df['Date'])
df['Datetime'] = df['Date'] + pd.to_timedelta(df['Hour'], unit='H')
[out]
Date Hour Val Datetime
0 2019-01-01 1 0 2019-01-01 01:00:00
1 2019-01-01 2 0 2019-01-01 02:00:00
2 2019-01-01 3 0 2019-01-01 03:00:00
3 2019-01-01 4 0 2019-01-01 04:00:00
4 2019-01-01 5 0 2019-01-01 05:00:00
5 2019-01-01 6 0 2019-01-01 06:00:00
6 2019-01-01 7 0 2019-01-01 07:00:00
7 2019-01-01 8 0 2019-01-01 08:00:00
Since you asked for a method combining the columns with a single pd.to_datetime call, you could do:
df['Datetime'] = pd.to_datetime(df['Date'].astype(str) + ' ' + df['Hour'].astype(str),
                                format='%Y-%m-%d %H')  # %H parses hours 0-23; %I would only accept 1-12

grouping multiple time values into a start and finish time

I have a dataframe as follows
import pandas as pd
import numpy as np

IDs = ['A', 'A', 'A', 'B', 'B']
times = pd.date_range(start='01/01/2019', end='01/02/2019', freq='h')
times_2 = pd.date_range(start='01/01/2019', end='01/02/2019', freq='h') + pd.Timedelta('15min')
Vals = [np.random.randint(15, 250) for _ in times]
df = pd.DataFrame({'id': IDs * 5,
                   'Start': times,
                   'End': times_2,
                   'Value': Vals}, columns=['id', 'Start', 'End', 'Value'])
This gives me a df as follows.
print(df.head(5))
id Start End Value
0 A 2019-01-01 00:00:00 2019-01-01 00:15:00 52
1 A 2019-01-01 01:00:00 2019-01-01 01:15:00 69
2 A 2019-01-01 02:00:00 2019-01-01 02:15:00 209
3 B 2019-01-01 03:00:00 2019-01-01 03:15:00 163
4 B 2019-01-01 04:00:00 2019-01-01 04:15:00 70
Now what I'm trying to do is apply a groupby to my data frame to get the sum of the Value column; however, whilst doing this I would like to retain the min Start and max End time for each id.
So my example output would be as follows:
id Start End Value
0 A 2019-01-01 00:00:00 2019-01-01 22:15:00 2007
1 B 2019-01-01 03:00:00 2019-01-02 00:15:00 1385
The only way I've sort of made this work is to take the min and max of each unique id's start and end times, pass these to a list, and then manually create the start and end columns, but it was slow, messy, and prone to error. Hoping someone here can guide me as to what I'm missing.
Using groupby with agg:
df.groupby('id').agg({'Start': 'min', 'End': 'max', 'Value': 'sum'})  # add .reset_index() to get id back as a column
Out[92]:
Start End Value
id
A 2019-01-01 00:00:00 2019-01-01 22:15:00 2152
B 2019-01-01 03:00:00 2019-01-02 00:15:00 972
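On pandas 0.25 or later, named aggregation gives the same result with id kept as a regular column; a minimal sketch of the same groupby:
out = df.groupby('id', as_index=False).agg(Start=('Start', 'min'),
                                           End=('End', 'max'),
                                           Value=('Value', 'sum'))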

Create 5-minute interval between two timestamp

I have a bunch of data points; for each there are two columns, start_dt and end_dt. How can I split the time gap between start_dt and end_dt into 5-minute intervals?
For instance:
id   start_tm           end_dt
1    2019-01-01 10:00   2019-01-01 11:00
What I am looking for is:
id   start_tm           end_dt
1    2019-01-01 10:00   2019-01-01 10:05
1    2019-01-01 10:05   2019-01-01 10:10
1    2019-01-01 10:10   2019-01-01 10:15
1    2019-01-01 10:15   2019-01-01 10:20
and so forth.
Is there any function out of the box to do so? If not, any help creating this function would be wonderful.
If you have two Python datetime objects representing a timespan, and you just want to break that timespan up into 5 minute intervals represented by datetime objects, you could just do this:
import datetime

d1 = datetime.datetime(2019, 1, 1, 10, 0)
d2 = datetime.datetime(2019, 1, 1, 11, 0)
delta = datetime.timedelta(minutes=5)

times = []
while d1 < d2:
    times.append(d1)
    d1 += delta
times.append(d2)

for i in range(len(times) - 1):
    print("{} - {}".format(times[i], times[i + 1]))
Output:
2019-01-01 10:00:00 - 2019-01-01 10:05:00
2019-01-01 10:05:00 - 2019-01-01 10:10:00
2019-01-01 10:10:00 - 2019-01-01 10:15:00
2019-01-01 10:15:00 - 2019-01-01 10:20:00
2019-01-01 10:20:00 - 2019-01-01 10:25:00
2019-01-01 10:25:00 - 2019-01-01 10:30:00
2019-01-01 10:30:00 - 2019-01-01 10:35:00
2019-01-01 10:35:00 - 2019-01-01 10:40:00
2019-01-01 10:40:00 - 2019-01-01 10:45:00
2019-01-01 10:45:00 - 2019-01-01 10:50:00
2019-01-01 10:50:00 - 2019-01-01 10:55:00
2019-01-01 10:55:00 - 2019-01-01 11:00:00
This should handle a period that isn't an even multiple of the delta, giving you a shorter interval at the end.
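If the data is already in pandas, a similar list of boundaries can be built with pd.date_range; a hedged sketch (date_range stops at or before the end, so a trailing partial interval has to be appended by hand):
import pandas as pd

start, end = pd.Timestamp('2019-01-01 10:00'), pd.Timestamp('2019-01-01 11:00')
boundaries = pd.date_range(start, end, freq='5min').to_list()
if boundaries[-1] != end:  # span is not an even multiple of 5 minutes
    boundaries.append(end)
intervals = list(zip(boundaries[:-1], boundaries[1:]))  # (start, stop) pairs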
I don't know PySpark, but if you are using pandas this works (and PySpark may be similar):
1. Create data:
import pandas as pd
import numpy as np

data = pd.DataFrame({
    'id': [1, 2],
    'start_tm': pd.date_range('2019-01-01 00:00', periods=2, freq='D'),
    'end_dt': pd.date_range('2019-01-01 00:30', periods=2, freq='D')})
# a pandas DataFrame, similar to the data in pyspark
Output:
id start_tm end_dt
1 2019-01-01 2019-01-01 00:30:00
2 2019-01-02 2019-01-02 00:30:00
2. Split the intervals:
period = np.timedelta64(5, 'm')  # 5 minutes
idx = (data['end_dt'] - data['start_tm']) > period
while idx.any():
    # for each row longer than one period: copy it starting one period later,
    # and truncate the original row's end to one period after its start
    new_data = data[idx].copy()
    new_data['start_tm'] = new_data['start_tm'] + period
    data.loc[idx, 'end_dt'] = (data[idx]['start_tm'] + period).values
    data = pd.concat([data, new_data], axis=0)
    idx = (data['end_dt'] - data['start_tm']) > period
Output:
id start_tm end_dt
1 2019-01-01 00:00:00 2019-01-01 00:05:00
2 2019-01-02 00:00:00 2019-01-02 00:05:00
1 2019-01-01 00:05:00 2019-01-01 00:10:00
2 2019-01-02 00:05:00 2019-01-02 00:10:00
1 2019-01-01 00:10:00 2019-01-01 00:15:00
2 2019-01-02 00:10:00 2019-01-02 00:15:00
1 2019-01-01 00:15:00 2019-01-01 00:20:00
2 2019-01-02 00:15:00 2019-01-02 00:20:00
1 2019-01-01 00:20:00 2019-01-01 00:25:00
2 2019-01-02 00:20:00 2019-01-02 00:25:00
1 2019-01-01 00:25:00 2019-01-01 00:30:00
2 2019-01-02 00:25:00 2019-01-02 00:30:00
