Transform random time intervals into structured 30-minute intervals - python

I have this DataFrame recording the time periods during which some tasks happened:
Date Start Time End Time
0 2016-01-01 0:00:00 2016-01-01 0:10:00 2016-01-01 0:25:00
1 2016-01-01 0:00:00 2016-01-01 1:17:00 2016-01-01 1:31:00
2 2016-01-02 0:00:00 2016-01-02 0:30:00 2016-01-02 0:32:00
... ... ... ...
I want to convert this df to 30-minute intervals.
Expected outcome
Date Hours
1 2016-01-01 0:30:00 0:15
2 2016-01-01 1:00:00 0:00
3 2016-01-01 1:30:00 0:13
4 2016-01-01 2:00:00 0:01
5 2016-01-01 2:30:00 0:00
6 2016-01-01 3:00:00 0:00
... ...
47 2016-01-01 23:30:00 0:00
48 2016-01-02 0:00:00 0:00
49 2016-01-02 00:30:00 0:00
50 2016-01-02 01:00:00 0:02
... ...
I was trying to do this with a for loop, which was getting tedious. Is there a simple way to do it in pandas?

IIUC, you can discard the Date column, get the time difference between start and end, group by 30 minutes, and aggregate with first (assuming you always have at most one entry per 30-minute slot):
print(df.assign(Diff=df["End Time"] - df["Start Time"])
        .groupby(pd.Grouper(key="Start Time", freq="30T"))
        .agg({"Diff": "first"})
        .fillna(pd.Timedelta(seconds=0)))
Diff
Start Time
2016-01-01 00:00:00 0 days 00:15:00
2016-01-01 00:30:00 0 days 00:00:00
2016-01-01 01:00:00 0 days 00:14:00
2016-01-01 01:30:00 0 days 00:00:00
2016-01-01 02:00:00 0 days 00:00:00
2016-01-01 02:30:00 0 days 00:00:00
...
2016-01-02 00:30:00 0 days 00:02:00
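Note that aggregating with first assigns each task's whole duration to the slot containing its start time, so an interval that crosses a slot boundary (such as 1:17-1:31) is not split; that is why the 01:00 slot shows 14 minutes where the expected output has 13 and 1. The minute-resolution approach below does split such intervals.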

The idea is to create a Series of zeros with a per-minute DatetimeIndex running from the min Start Time to the max End Time. Then add 1 at each Start Time and subtract 1 at each End Time. cumsum then marks every minute that falls between a Start and an End, resample(...).sum() totals those minutes per 30-minute bin, and reset_index turns the index back into a column. The last line of code is only there to get the proper format in the Hours column.
# create a series of 0 with a datetime index
res = pd.Series(data=0,
                index=pd.DatetimeIndex(pd.date_range(df['Start Time'].min(),
                                                     df['End Time'].max(),
                                                     freq='T'),
                                       name='Dates'),
                name='Hours')
# add 1 at each start time and subtract 1 at each end time
res[df['Start Time']] += 1
res[df['End Time']] -= 1
# cumsum to get the right value for each minute then resample per 30 minutes
res = (res.cumsum()
       .resample('30T', label='right').sum()
       .reset_index('Dates'))
# change the format of the Hours column, honestly not necessary
res['Hours'] = pd.to_datetime(res['Hours'], format='%M').dt.strftime('%H:%M') # or .dt.time
print(res)
Dates Hours
0 2016-01-01 00:30:00 00:15
1 2016-01-01 01:00:00 00:00
2 2016-01-01 01:30:00 00:13
3 2016-01-01 02:00:00 00:01
4 2016-01-01 02:30:00 00:00
5 2016-01-01 03:00:00 00:00
...
48 2016-01-02 00:30:00 00:00
49 2016-01-02 01:00:00 00:02
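As an aside, recent pandas versions deprecate the 'T' frequency alias in favour of 'min', so freq='30min' is the forward-compatible spelling of '30T'.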

Related

Split time series in intervals of non-uniform length

I have a time series with breaks (times w/o recordings) in between. A simplified example would be:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.rand(13), columns=["values"],
    index=pd.date_range(start='1/1/2020 11:00:00', end='1/1/2020 23:00:00', freq='H'))
df.iloc[4:7] = np.nan
df.dropna(inplace=True)
df
values
2020-01-01 11:00:00 0.100339
2020-01-01 12:00:00 0.054668
2020-01-01 13:00:00 0.209965
2020-01-01 14:00:00 0.551023
2020-01-01 18:00:00 0.495879
2020-01-01 19:00:00 0.479905
2020-01-01 20:00:00 0.250568
2020-01-01 21:00:00 0.904743
2020-01-01 22:00:00 0.686085
2020-01-01 23:00:00 0.188166
Now I would like to split it into intervals that are separated by gaps longer than a certain time span (e.g. 2h). In the example above this would be:
( values
2020-01-01 11:00:00 0.100339
2020-01-01 12:00:00 0.054668
2020-01-01 13:00:00 0.209965
2020-01-01 14:00:00 0.551023,
values
2020-01-01 18:00:00 0.495879
2020-01-01 19:00:00 0.479905
2020-01-01 20:00:00 0.250568
2020-01-01 21:00:00 0.904743
2020-01-01 22:00:00 0.686085
2020-01-01 23:00:00 0.188166)
I was a bit surprised that I didn't find anything on this, since I thought it was a common problem. My current solution to get the start and end index of each interval is:
from datetime import timedelta

def intervals(data: pd.DataFrame, delta_t: timedelta = timedelta(hours=2)):
    data = data.sort_values(by=['event_timestamp'], ignore_index=True)
    breaks = (data['event_timestamp'].diff() > delta_t).astype(bool).values
    ranges = []
    start = 0
    end = start
    for i, e in enumerate(breaks):
        if not e:
            end = i
            if i == len(breaks) - 1:
                ranges.append((start, end))
                start = i
                end = start
        elif i != 0:
            ranges.append((start, end))
            start = i
            end = start
    return ranges
Any suggestions on how I could do this in a smarter way? I suspect this should be possible using groupby.
Yes, you can use the very convenient np.split:
dt = pd.Timedelta('2H')
parts = np.split(df, np.where(np.diff(df.index) > dt)[0] + 1)
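Here np.diff(df.index) gives the gap between consecutive timestamps, np.where(...)[0] + 1 marks each position where a gap larger than dt starts a new interval, and np.split cuts the frame at those positions.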
Which gives, for your example:
>>> parts
[ values
2020-01-01 11:00:00 0.557374
2020-01-01 12:00:00 0.942296
2020-01-01 13:00:00 0.181189
2020-01-01 14:00:00 0.758822,
values
2020-01-01 18:00:00 0.682125
2020-01-01 19:00:00 0.818187
2020-01-01 20:00:00 0.053515
2020-01-01 21:00:00 0.572342
2020-01-01 22:00:00 0.423129
2020-01-01 23:00:00 0.882215]
@Pierre thanks for your input. I now got to a solution which is convenient for me:
df['diff'] = df.index.to_series().diff()
max_gap = timedelta(hours=2)
df['gapId'] = 0
df.loc[df['diff'] >= max_gap, ['gapId']] = 1
df['gapId'] = df['gapId'].cumsum()
list(df.groupby('gapId'))
gives:
[(0,
values date diff gapId
0 1.0 2020-01-01 11:00:00 NaT 0
1 1.0 2020-01-01 12:00:00 0 days 01:00:00 0
2 1.0 2020-01-01 13:00:00 0 days 01:00:00 0
3 1.0 2020-01-01 14:00:00 0 days 01:00:00 0),
(1,
values date diff gapId
7 1.0 2020-01-01 18:00:00 0 days 04:00:00 1
8 1.0 2020-01-01 19:00:00 0 days 01:00:00 1
9 1.0 2020-01-01 20:00:00 0 days 01:00:00 1
10 1.0 2020-01-01 21:00:00 0 days 01:00:00 1
11 1.0 2020-01-01 22:00:00 0 days 01:00:00 1
12 1.0 2020-01-01 23:00:00 0 days 01:00:00 1)]
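The same grouping can be written more compactly; a minimal sketch, assuming max_gap as defined above:
# a gap larger than max_gap starts a new group; cumsum turns the gap markers into group ids
gap_id = df.index.to_series().diff().gt(max_gap).cumsum()
parts = list(df.groupby(gap_id))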

How can I control the hourly GROUPBY setting in Pandas?

I have the following data set:
time value
2019-01-01 8:00:00 10
2019-01-01 8:30:00 20
2019-01-01 9:00:00 30
2019-01-01 9:30:00 100
2019-01-01 10:00:00 400
Using df.groupby(pd.Grouper(key='time', freq='1h')).sum().reset_index() returns:
time value
2019-01-01 8:00:00 30
2019-01-01 9:00:00 130
2019-01-01 10:00:00 400
It groups on each hour boundary. But how can I control where the group boundaries fall? I would like everything after 8:00 up to and including 9:00 to go into the 9:00 group. For example:
time value
2019-01-01 8:00:00 10
2019-01-01 9:00:00 50
2019-01-01 10:00:00 500
IIUC, use ceil:
Yourdf = df.groupby(df.index.ceil('H')).sum()
value
time
2019-01-01 08:00:00 10
2019-01-01 09:00:00 50
2019-01-01 10:00:00 500
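If time is still a regular column rather than the index, the same idea would presumably go through the .dt accessor (column names taken from the example):
df.groupby(df['time'].dt.ceil('H'))['value'].sum()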
Or resample:
df.resample('H', closed='right').sum()
value
time
2019-01-01 07:00:00 10
2019-01-01 08:00:00 50
2019-01-01 09:00:00 500
Use closed='right', i.e.
pd.Grouper(key = 'time', freq = '1h', closed='right')
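Note that closed='right' alone still labels each bin by its left edge (07:00, 08:00, ...); to also label the bins by their right edge, as in the desired output, you would additionally pass label='right':
df.groupby(pd.Grouper(key='time', freq='1h', closed='right', label='right')).sum()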

Python - take the time difference from the first date in a column

Given the date column, I want to create another column diff that counts how many days each row is from the first date.
date diff
2011-01-01 00:00:10 0
2011-01-01 00:00:11 0.000011 days
2011-02-01 00:00:11 30.000011 days
2013-02-01 00:00:11 395.000011 days
2014-02-01 00:00:11 760.000011 days
Dates are in datetime. What I tried so far:
df = df.sort_values(['date'], ascending=True)
df.set_index('date', inplace = True)
first = df.index[0]
df['diff'] = (first - df.index.shift()).fillna(0)
You can try:
df['diff'] = df.date - df.date.min()
df
date diff
0 2011-01-01 00:00:10 0 days 00:00:00
1 2011-01-01 00:00:11 0 days 00:00:01
2 2011-02-01 00:00:11 31 days 00:00:01
3 2013-02-01 00:00:11 762 days 00:00:01
4 2014-02-01 00:00:11 1127 days 00:00:01
This is what you can try:
>>> df
date
0 2011-01-01 00:00:10
1 2011-01-01 00:00:11
2 2011-02-01 00:00:11
3 2013-02-01 00:00:11
4 2014-02-01 00:00:11
First convert them to timestamps so the data is typed correctly. Once converted, simply difference the DataFrame:
>>> df2 = df.apply(lambda x: [pd.Timestamp(ts) for ts in x])
>>> df['diff'] = (df2 - df2.shift()).fillna(0)
>>> df
date diff
0 2011-01-01 00:00:10 0 days 00:00:00
1 2011-01-01 00:00:11 0 days 00:00:01
2 2011-02-01 00:00:11 31 days 00:00:00
3 2013-02-01 00:00:11 731 days 00:00:00
4 2014-02-01 00:00:11 365 days 00:00:00
Here's what I'd do to get days as float number values:
dates = pd.to_datetime(df.date) # make sure we are working with dates and not strings
df["diff"] = (dates - dates[0]).apply(lambda x: x.total_seconds() / 86400))
The resulting df:
date diff
0 2011-01-01 00:00:10 0.000000
1 2011-01-01 00:00:11 0.000012
2 2011-02-01 00:00:11 31.000012
3 2013-02-01 00:00:11 762.000012
4 2014-02-01 00:00:11 1127.000012
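A vectorized sketch of the same computation that avoids the Python-level apply (assuming the dates series from above):
# Timedelta series expose .dt.total_seconds() directly
df["diff"] = (dates - dates.iloc[0]).dt.total_seconds() / 86400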
You can use this approach without setting a new index
Raw dataframe
df
date diff
0 2011-01-01 00:00:10 0.000000
1 2011-01-01 00:00:11 0.000011
2 2011-02-01 00:00:11 30.000011
3 2013-02-01 00:00:11 395.000011
4 2014-02-01 00:00:11 760.000011
Possible answer
df['diff_new'] = df['date'] - df.loc[0,'date']
date diff diff_new
0 2011-01-01 00:00:10 0.000000 0 days 00:00:00
1 2011-01-01 00:00:11 0.000011 0 days 00:00:01
2 2011-02-01 00:00:11 30.000011 31 days 00:00:01
3 2013-02-01 00:00:11 395.000011 762 days 00:00:01
4 2014-02-01 00:00:11 760.000011 1127 days 00:00:01
BTW, I get date differences different from the ones you show in the raw data, from the 3rd row on. You can compare manually with this online tool to calculate date differences in days.

Merge daily values into intraday DataFrame

Suppose I have two DataFrames: intraday which has one row per minute, and daily which has one row per day.
How can I add a column intraday['some_val'] where some_val is taken from the daily['some_val'] row where the intraday.index value (date component) equals the daily.index value (date component)?
Given the following setup,
intraday = pd.DataFrame(index=pd.date_range('2016-01-01', '2016-01-07', freq='T'))
daily = pd.DataFrame(index=pd.date_range('2016-01-01', '2016-01-07', freq='D'))
daily['some_val'] = np.arange(daily.shape[0])
you can create a column from the date component of both indices, and merge on that column
daily['date'] = daily.index.date
intraday['date'] = intraday.index.date
daily.merge(intraday)
date some_val
0 2016-01-01 0
1 2016-01-01 0
2 2016-01-01 0
3 2016-01-01 0
4 2016-01-01 0
... ... ...
8636 2016-01-06 5
8637 2016-01-06 5
8638 2016-01-06 5
8639 2016-01-06 5
8640 2016-01-07 6
Alternatively, you can take advantage of automatic index alignment, and use fillna.
intraday['some_val'] = daily['some_val']
intraday.fillna(method='ffill', downcast='infer')
some_val
2016-01-01 00:00:00 0
2016-01-01 00:01:00 0
2016-01-01 00:02:00 0
2016-01-01 00:03:00 0
2016-01-01 00:04:00 0
... ...
2016-01-06 23:56:00 5
2016-01-06 23:57:00 5
2016-01-06 23:58:00 5
2016-01-06 23:59:00 5
2016-01-07 00:00:00 6
Note that this only works if the time component of your daily index is 00:00.
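A sketch that makes the forward-fill explicit, under the same midnight assumption, is to reindex the daily series onto the intraday index:
# ffill propagates each day's value across that day's minutes
intraday['some_val'] = daily['some_val'].reindex(intraday.index, method='ffill')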

rolling_sum on business day and return new dataframe with date as index

I have such a DataFrame:
A
2016-01-01 00:00:00 0
2016-01-01 12:00:00 1
2016-01-02 00:00:00 2
2016-01-02 12:00:00 3
2016-01-03 00:00:00 4
2016-01-03 12:00:00 5
2016-01-04 00:00:00 6
2016-01-04 12:00:00 7
2016-01-05 00:00:00 8
2016-01-05 12:00:00 9
The reason I call out the span from 2016-01-02 00:00:00 to 2016-01-03 12:00:00 is that those two days are a weekend.
So here is what I wish to do:
I wish to rolling_sum with window = 2 business days.
For example, I wish to sum
A
2016-01-04 00:00:00 6
2016-01-04 12:00:00 7
2016-01-05 00:00:00 8
2016-01-05 12:00:00 9
and then sum (we skip any non-business days)
A
2016-01-01 00:00:00 0
2016-01-01 12:00:00 1
2016-01-04 00:00:00 6
2016-01-04 12:00:00 7
And the result is
A
2016-01-01 Nan
2016-01-04 14
2016-01-05 30
How can I achieve that?
I tried rolling_sum(df, window=2, freq=BDay(1)), but it seems to just pick one row per day rather than summing the two rows (00:00 and 12:00) within the same day.
You could first select only business days, resample to (business) daily frequency for the remaining data points and sum, and then apply rolling_sum:
Starting with some sample data:
import numpy as np
import pandas as pd
from datetime import datetime

df = pd.DataFrame(data={'A': np.random.randint(0, 10, 500)},
                  index=pd.date_range(datetime(2016, 1, 1), freq='6H', periods=500))
A
2016-01-01 00:00:00 6
2016-01-01 06:00:00 9
2016-01-01 12:00:00 3
2016-01-01 18:00:00 9
2016-01-02 00:00:00 7
2016-01-02 06:00:00 5
2016-01-02 12:00:00 8
2016-01-02 18:00:00 6
2016-01-03 00:00:00 2
2016-01-03 06:00:00 0
2016-01-03 12:00:00 0
2016-01-03 18:00:00 0
2016-01-04 00:00:00 5
2016-01-04 06:00:00 4
2016-01-04 12:00:00 1
2016-01-04 18:00:00 4
2016-01-05 00:00:00 6
2016-01-05 06:00:00 9
2016-01-05 12:00:00 7
2016-01-05 18:00:00 2
....
First select the values on business days:
tsdays = df.index.values.astype('<M8[D]')
bdays = pd.bdate_range(tsdays[0], tsdays[-1]).values.astype('<M8[D]')
df = df[np.in1d(tsdays, bdays)]
Then apply rolling_sum() to the resampled data, where each value represents the sum for an individual business day:
pd.rolling_sum(df.resample('B', how='sum'), window=2)
to get:
A
2016-01-01 NaN
2016-01-04 41
2016-01-05 38
2016-01-06 56
2016-01-07 52
2016-01-08 37
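pd.rolling_sum and the how= argument of resample were removed in later pandas releases; a sketch of the equivalent in the current API would be:
# resample to business-daily sums, then a 2-day rolling sum
df.resample('B').sum().rolling(window=2).sum()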
See also [here] for the type conversion and [this question] for the business day extraction.
