Split time series in intervals of non-uniform length

I have a time series with breaks (periods without recordings) in between. A simplified example would be:
import numpy as np
import pandas as pd
df = pd.DataFrame(
    np.random.rand(13), columns=["values"],
    index=pd.date_range(start='1/1/2020 11:00:00', end='1/1/2020 23:00:00', freq='H'))
df.iloc[4:7] = np.nan
df.dropna(inplace=True)
df
values
2020-01-01 11:00:00 0.100339
2020-01-01 12:00:00 0.054668
2020-01-01 13:00:00 0.209965
2020-01-01 14:00:00 0.551023
2020-01-01 18:00:00 0.495879
2020-01-01 19:00:00 0.479905
2020-01-01 20:00:00 0.250568
2020-01-01 21:00:00 0.904743
2020-01-01 22:00:00 0.686085
2020-01-01 23:00:00 0.188166
Now I would like to split it into intervals wherever consecutive timestamps are more than a certain time span apart (e.g. 2 h). In the example above this would be:
( values
2020-01-01 11:00:00 0.100339
2020-01-01 12:00:00 0.054668
2020-01-01 13:00:00 0.209965
2020-01-01 14:00:00 0.551023,
values
2020-01-01 18:00:00 0.495879
2020-01-01 19:00:00 0.479905
2020-01-01 20:00:00 0.250568
2020-01-01 21:00:00 0.904743
2020-01-01 22:00:00 0.686085
2020-01-01 23:00:00 0.188166)
I was a bit surprised that I didn't find anything on this, since I thought it was a common problem. My current solution, which returns the start and end index of each interval, is:
from datetime import timedelta
def intervals(data: pd.DataFrame, delta_t: timedelta = timedelta(hours=2)):
    # Here the timestamps live in an 'event_timestamp' column rather than in the index
    data = data.sort_values(by=['event_timestamp'], ignore_index=True)
    # True wherever the gap to the previous row exceeds delta_t
    breaks = (data['event_timestamp'].diff() > delta_t).astype(bool).values
    ranges = []
    start = 0
    end = start
    for i, e in enumerate(breaks):
        if not e:
            end = i
            if i == len(breaks) - 1:
                ranges.append((start, end))
                start = i
                end = start
        elif i != 0:
            ranges.append((start, end))
            start = i
            end = start
    return ranges
Any suggestions how I could do this in a smarter way? I suspect this should be somehow possible using groupby.

Yes, you can use the very convenient np.split:
dt = pd.Timedelta('2H')
parts = np.split(df, np.where(np.diff(df.index) > dt)[0] + 1)
Which gives, for your example:
>>> parts
[ values
2020-01-01 11:00:00 0.557374
2020-01-01 12:00:00 0.942296
2020-01-01 13:00:00 0.181189
2020-01-01 14:00:00 0.758822,
values
2020-01-01 18:00:00 0.682125
2020-01-01 19:00:00 0.818187
2020-01-01 20:00:00 0.053515
2020-01-01 21:00:00 0.572342
2020-01-01 22:00:00 0.423129
2020-01-01 23:00:00 0.882215]
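
I believe newer pandas versions may emit a deprecation warning when np.split is called directly on a DataFrame, so slicing by position can be a safer variant of the same idea. The following is only a sketch under the assumption that the index is a sorted DatetimeIndex; the helper name split_on_gaps and the deterministic sample values are my own and not part of the answer above.

```python
import numpy as np
import pandas as pd


def split_on_gaps(frame, max_gap):
    """Split `frame` wherever consecutive index values are more than `max_gap` apart."""
    # The first diff is NaT, which compares as False, so row 0 never starts a new part.
    is_gap = frame.index.to_series().diff() > max_gap
    # Positions of the rows that open a new part.
    cut_positions = np.flatnonzero(is_gap.to_numpy())
    bounds = [0, *cut_positions, len(frame)]
    return [frame.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:])]


# Rebuild the shape of the example: hourly data with 15:00-17:00 missing.
idx = pd.date_range("2020-01-01 11:00", "2020-01-01 23:00", freq="h")
df = pd.DataFrame({"values": np.arange(13.0)}, index=idx)
df = df.drop(df.index[4:7])

parts = split_on_gaps(df, pd.Timedelta(hours=2))
for part in parts:
    print(part.index[0], "->", part.index[-1], f"({len(part)} rows)")
```

Each part is a positional slice of the original frame, so no data is copied until a part is modified.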

@Pierre thanks for your input. I have now arrived at a solution that works well for me:
df['diff'] = df.index.to_series().diff()
max_gap = timedelta(hours=2)
df['gapId'] = 0
df.loc[df['diff'] >= max_gap, ['gapId']] = 1
df['gapId'] = df['gapId'].cumsum()
list(df.groupby('gapId'))
gives:
[(0,
values date diff gapId
0 1.0 2020-01-01 11:00:00 NaT 0
1 1.0 2020-01-01 12:00:00 0 days 01:00:00 0
2 1.0 2020-01-01 13:00:00 0 days 01:00:00 0
3 1.0 2020-01-01 14:00:00 0 days 01:00:00 0),
(1,
values date diff gapId
7 1.0 2020-01-01 18:00:00 0 days 04:00:00 1
8 1.0 2020-01-01 19:00:00 0 days 01:00:00 1
9 1.0 2020-01-01 20:00:00 0 days 01:00:00 1
10 1.0 2020-01-01 21:00:00 0 days 01:00:00 1
11 1.0 2020-01-01 22:00:00 0 days 01:00:00 1
12 1.0 2020-01-01 23:00:00 0 days 01:00:00 1)]
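
The same grouping can also be done without adding helper columns to the frame, by passing the cumulative gap counter directly to groupby. This is a minimal sketch that assumes the timestamps are in a sorted DatetimeIndex (as in the question's example); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hourly example data with the rows for 15:00-17:00 removed.
idx = pd.date_range("2020-01-01 11:00", "2020-01-01 23:00", freq="h")
df = pd.DataFrame({"values": np.random.rand(13)}, index=idx)
df = df.drop(df.index[4:7])

max_gap = pd.Timedelta(hours=2)

# True on every row that follows a gap of at least max_gap; the running sum
# of these flags gives one integer label per uninterrupted stretch.
gap_id = (df.index.to_series().diff() >= max_gap).cumsum()

# groupby aligns the label Series with df on the shared index.
parts = [part for _, part in df.groupby(gap_id)]
```

Note that this uses >= like the snippet above, whereas the np.split answer uses >, so a gap of exactly two hours is treated differently by the two approaches.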

Related

How to number timestamps that comes under particular duration of time in dataframe

If we divide the time of day from 00:00:00 to 23:59:00 into 15-minute blocks, we get 96 blocks, which we can number from 0 to 95.
I want to add a "timeblock" column to the dataframe in which each row is labeled with the number of the block its timestamp falls into, as shown below.
tagdatetime tagvalue timeblock
2020-01-01 00:00:00 47.874423 0
2020-01-01 00:01:00 14.913561 0
2020-01-01 00:02:00 56.368034 0
2020-01-01 00:03:00 16.555687 0
2020-01-01 00:04:00 42.138176 0
... ... ...
2020-01-01 00:13:00 47.874423 0
2020-01-01 00:14:00 14.913561 0
2020-01-01 00:15:00 56.368034 0
2020-01-01 00:16:00 16.555687 1
2020-01-01 00:17:00 42.138176 1
... ... ...
2020-01-01 23:55:00 18.550685 95
2020-01-01 23:56:00 51.219147 95
2020-01-01 23:57:00 15.098951 95
2020-01-01 23:58:00 37.863191 95
2020-01-01 23:59:00 51.380950 95

There is probably a better way to do it, but the following works.
import pandas as pd
import numpy as np
tindex = pd.date_range('2020-01-01 00:00:00', '2020-01-01 23:59:00', freq='min')
tvalue = np.random.randint(1,50, (1440,))
df = pd.DataFrame({'tagdatetime':tindex, 'tagvalue':tvalue})
min15 = pd.date_range('2020-01-01 00:00:00', '2020-01-01 23:59:00', freq='15min')
tblock = np.arange(96)
df2 = pd.DataFrame({'min15':min15, 'timeblock':tblock})
df3 = pd.merge(df, df2, left_on='tagdatetime', right_on='min15', how='outer')
df3.ffill(axis=0, inplace=True)
df3 = df3.drop('min15', axis=1)
df3.iloc[10:20,]
tagdatetime tagvalue timeblock
10 2020-01-01 00:10:00 20 0.0
11 2020-01-01 00:11:00 25 0.0
12 2020-01-01 00:12:00 42 0.0
13 2020-01-01 00:13:00 45 0.0
14 2020-01-01 00:14:00 11 0.0
15 2020-01-01 00:15:00 15 1.0
16 2020-01-01 00:16:00 38 1.0
17 2020-01-01 00:17:00 23 1.0
18 2020-01-01 00:18:00 5 1.0
19 2020-01-01 00:19:00 32 1.0
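
A possibly simpler route is to compute the block number arithmetically from the minutes elapsed since midnight, which avoids the merge and keeps the column as integers. This is a sketch using the same column names as above; the sample values are placeholders.

```python
import numpy as np
import pandas as pd

tindex = pd.date_range("2020-01-01 00:00", "2020-01-01 23:59", freq="min")
df = pd.DataFrame({"tagdatetime": tindex, "tagvalue": np.arange(len(tindex))})

# Minutes elapsed since midnight, then integer division by the block length.
minutes_since_midnight = df["tagdatetime"].dt.hour * 60 + df["tagdatetime"].dt.minute
df["timeblock"] = minutes_since_midnight // 15

print(df.iloc[13:17])
```

With this definition 00:14 is the last minute of block 0 and 00:15 is the first minute of block 1, which matches the output of the merge-based answer.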

How to add values occurring between 2 consecutive hours?

I have a df as follows:
dates values
2020-01-01 00:15:00 87.321
2020-01-01 00:30:00 87.818
2020-01-01 00:45:00 88.514
2020-01-01 01:00:00 89.608
2020-01-01 01:15:00 90.802
2020-01-01 01:30:00 91.896
2020-01-01 01:45:00 92.393
2020-01-01 02:00:00 91.995
2020-01-01 02:15:00 90.504
2020-01-01 02:30:00 88.216
2020-01-01 02:45:00 85.929
2020-01-01 03:00:00 84.238
I want to keep only the hourly rows (where the minute is 00), and each of those rows should hold the sum of the values recorded since the previous full hour.
Example: For the value at 2020-01-01 01:00:00, the values from 2020-01-01 00:15:00 to 2020-01-01 01:00:00 should be added (87.321+87.818+88.514+89.608 = 353.261). Similarly, for the value at 2020-01-01 02:00:00, the values from 2020-01-01 01:15:00 to 2020-01-01 02:00:00 should be added (90.802+91.896+92.393+91.995 = 367.086).
Desired output
dates values
2020-01-01 01:00:00 353.261
2020-01-01 02:00:00 367.086
2020-01-01 03:00:00 348.887
I used df['dates'].dt.minute.eq(0) to obtain the boolean masking, but I am unable to find a way to add them.
Thanks in advance

hourly = (
    df.set_index('dates')  # Set the dates as index
    .resample('1H', closed='right', label='right')  # One bin per hour, labeled by its right edge
    .sum()  # Sum the values in each bin
)
hourly = hourly.reset_index()  # If you want to have the dates as a column again
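
As a quick check of the resampling approach, here is a minimal sketch that rebuilds the sample data from the question and compares each hourly bin against a manual sum of the four values it should contain. The frequency string "60min" and the variable names are my own choices, not part of the answer above.

```python
import pandas as pd

values = [87.321, 87.818, 88.514, 89.608,
          90.802, 91.896, 92.393, 91.995,
          90.504, 88.216, 85.929, 84.238]
df = pd.DataFrame({
    "dates": pd.date_range("2020-01-01 00:15", periods=12, freq="15min"),
    "values": values,
})

# closed='right' puts 01:00 into the bin (00:00, 01:00];
# label='right' names that bin by its right edge, 01:00.
hourly = (
    df.set_index("dates")
    .resample("60min", closed="right", label="right")
    .sum()
    .reset_index()
)

# Manual sums over consecutive groups of four quarter-hour values.
manual = [sum(values[i:i + 4]) for i in range(0, 12, 4)]
print(hourly)
print([round(m, 3) for m in manual])
```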

How to fill the first date in the column?

I have a df:
dates values
2020-01-01 00:15:00 38.61487
2020-01-01 00:30:00 36.905204
2020-01-01 00:45:00 35.136584
2020-01-01 01:00:00 33.60378
2020-01-01 01:15:00 32.306791999999994
2020-01-01 01:30:00 31.304574
I am creating a new column named start as follows:
df = df.rename(columns={'dates': 'end'})
df['start']= df['end'].shift(1)
When I do this, I get the following:
end values start
2020-01-01 00:15:00 38.61487 NaT
2020-01-01 00:30:00 36.905204 2020-01-01 00:15:00
2020-01-01 00:45:00 35.136584 2020-01-01 00:30:00
2020-01-01 01:00:00 33.60378 2020-01-01 00:45:00
2020-01-01 01:15:00 32.306791999999994 2020-01-01 01:00:00
2020-01-01 01:30:00 31.304574 2020-01-01 01:15:00
I want to fill that NaT value with
2020-01-01 00:00:00
How can this be done?

Use Series.fillna with a datetime value, e.g. a Timestamp:
df['start']= df['end'].shift().fillna(pd.Timestamp('2020-01-01'))
Or, in pandas 0.24+, use the fill_value parameter of shift:
df['start']= df['end'].shift(fill_value=pd.Timestamp('2020-01-01'))
If the datetimes are regularly spaced (always 15 minutes apart), you can instead subtract an offsets.DateOffset:
df['start']= df['end'] - pd.offsets.DateOffset(minutes=15)
print (df)
end values start
0 2020-01-01 00:15:00 38.614870 2020-01-01 00:00:00
1 2020-01-01 00:30:00 36.905204 2020-01-01 00:15:00
2 2020-01-01 00:45:00 35.136584 2020-01-01 00:30:00
3 2020-01-01 01:00:00 33.603780 2020-01-01 00:45:00
4 2020-01-01 01:15:00 32.306792 2020-01-01 01:00:00
5 2020-01-01 01:30:00 31.304574 2020-01-01 01:15:00

How about that?
df = pd.DataFrame(columns = ['end'])
df.loc[:, 'end'] = pd.date_range(start=pd.Timestamp(2019,1,1,0,15), end=pd.Timestamp(2019,1,2), freq='15min')
df.loc[:, 'start'] = df.loc[:, 'end'].shift(1)
delta = df.loc[df.index[3], 'end'] - df.loc[df.index[2], 'end']
df.loc[df.index[0], 'start'] = df.loc[df.index[1], 'start'] - delta
df
end start
0 2019-01-01 00:15:00 2019-01-01 00:00:00
1 2019-01-01 00:30:00 2019-01-01 00:15:00
2 2019-01-01 00:45:00 2019-01-01 00:30:00
3 2019-01-01 01:00:00 2019-01-01 00:45:00
4 2019-01-01 01:15:00 2019-01-01 01:00:00
... ... ...
91 2019-01-01 23:00:00 2019-01-01 22:45:00
92 2019-01-01 23:15:00 2019-01-01 23:00:00
93 2019-01-01 23:30:00 2019-01-01 23:15:00
94 2019-01-01 23:45:00 2019-01-01 23:30:00
95 2019-01-02 00:00:00 2019-01-01 23:45:00

Replace "flatline" repeated data in Pandas series with nan

I have hourly time series data stored in a pandas series. Similar to this example:
import pandas as pd
import numpy as np
date_rng = pd.date_range(start='1/1/2019', end='1/2/2019', freq='H')
data = np.random.uniform(180,182,size=(len(date_rng)))
timeseries = pd.Series(data, index=date_rng)
timeseries.iloc[4:12] = 181.911
At three decimal places, it is highly unlikely the data will be exactly the same for more than, say, 3 hours in a row. When this flatlining occurs, it indicates an issue with the sensor. So I want to detect repeated data and replace it with nan values (i.e., detect the repeated values 181.911 in the above and replace with nan)
I assume I can iterate over the time series and detect/replace that way, but is there a more efficient way to do this?

You can do it with diff, but the first occurrence of each repeated run is retained in the series.
timeseries.where(timeseries.diff(1)!=0.0,np.nan)
2019-01-01 00:00:00 180.539278
2019-01-01 01:00:00 181.509729
2019-01-01 02:00:00 180.740326
2019-01-01 03:00:00 181.736425
2019-01-01 04:00:00 181.911000
2019-01-01 05:00:00 NaN
2019-01-01 06:00:00 NaN
2019-01-01 07:00:00 NaN
2019-01-01 08:00:00 NaN
2019-01-01 09:00:00 NaN
2019-01-01 10:00:00 NaN
2019-01-01 11:00:00 NaN
2019-01-01 12:00:00 180.093216
2019-01-01 13:00:00 180.623440
The first occurrence can also be removed by combining diff(-1) and diff(1):
np.c_[timeseries.where(timeseries.diff(-1)!=0.0,np.nan), timeseries.where(timeseries.diff(1)!=0.0,np.nan)].mean(axis=1)
This works when the repeated values are consecutive in the series. Note that the result is a NumPy array rather than a Series.

With the following reasonably efficient function, you can choose the minimum number of repeated values to treat as a flatline:
import numpy as np
def remove_flatlines(ts, threshold):
    # get start and end indices of each flatline as an n x 2 array
    isflat = np.concatenate(([False], np.isclose(ts.diff(), 0), [False]))
    isedge = isflat[1:] != isflat[:-1]
    flatrange = np.where(isedge)[0].reshape(-1, 2)
    # include also first value of each flatline
    flatrange[:, 0] -= 1
    # remove flatlines with at least threshold number of equal values
    ts = ts.copy()
    for j in range(len(flatrange)):
        if flatrange[j][1] - flatrange[j][0] >= threshold:
            ts.iloc[flatrange[j][0]:flatrange[j][1]] = np.nan
    return ts
Applied to the example:
remove_flatlines(timeseries, threshold=3)
2019-01-01 00:00:00 181.447940
2019-01-01 01:00:00 180.142692
2019-01-01 02:00:00 180.994674
2019-01-01 03:00:00 180.116489
2019-01-01 04:00:00 NaN
2019-01-01 05:00:00 NaN
2019-01-01 06:00:00 NaN
2019-01-01 07:00:00 NaN
2019-01-01 08:00:00 NaN
2019-01-01 09:00:00 NaN
2019-01-01 10:00:00 NaN
2019-01-01 11:00:00 NaN
2019-01-01 12:00:00 180.972644
2019-01-01 13:00:00 181.969759
2019-01-01 14:00:00 181.008693
2019-01-01 15:00:00 180.769328
2019-01-01 16:00:00 180.576061
2019-01-01 17:00:00 181.562315
2019-01-01 18:00:00 181.978567
2019-01-01 19:00:00 181.928330
2019-01-01 20:00:00 180.773995
2019-01-01 21:00:00 180.475290
2019-01-01 22:00:00 181.460028
2019-01-01 23:00:00 180.220693
2019-01-02 00:00:00 181.630176
Freq: H, dtype: float64
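
For comparison, the same thresholded masking can be sketched without an explicit loop by labeling runs of identical values and measuring their lengths. This is only an illustrative alternative: the name mask_flatlines is hypothetical, and unlike the function above it tests for exact equality rather than using np.isclose.

```python
import numpy as np
import pandas as pd


def mask_flatlines(ts, threshold):
    """Replace runs of at least `threshold` identical consecutive values with NaN."""
    # A new run starts wherever a value differs from its predecessor.
    run_id = (ts != ts.shift()).cumsum()
    # Length of the run each element belongs to, aligned with ts.
    run_length = ts.groupby(run_id).transform("size")
    return ts.mask(run_length >= threshold)


# Deterministic stand-in for the example: distinct values plus one flat stretch.
date_rng = pd.date_range(start="2019-01-01", end="2019-01-02", freq="h")
timeseries = pd.Series(180.0 + np.arange(len(date_rng)), index=date_rng)
timeseries.iloc[4:12] = 181.911

cleaned = mask_flatlines(timeseries, threshold=3)
print(cleaned.iloc[2:14])
```

Because the whole run is masked, the first value of each flatline is removed along with its repeats.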

Using pandas.resample().agg() with 'interpolate'

I need to resample a DataFrame, applying a different function to each column.
import pandas as pd
import numpy as np
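# pd.date_range replaces the DatetimeIndex(start=..., end=..., freq=...) constructor, which newer pandas no longer accepts
df = pd.DataFrame(index=pd.date_range(start='2020-01-01 00:00:00', end='2020-01-02 00:00:00', freq='3H'),
                  data=np.random.rand(9, 3),
                  columns=['A', 'B', 'C'])
df = df.resample('1H').agg({'A': 'ffill',
                            'B': 'interpolate',
                            'C': 'max'})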
Functions like 'mean', 'max', 'sum' work.
But it seems that 'interpolate' can't be used like that.
Any workarounds?

One work-around would be to concat the different aggregated frames:
df_out = pd.concat([df[['A']].resample('1H').ffill(),
df[['B']].resample('1H').interpolate(),
df[['C']].resample('1H').max().ffill()],
axis=1)
[out]
A B C
2020-01-01 00:00:00 0.836547 0.436186 0.520913
2020-01-01 01:00:00 0.836547 0.315646 0.520913
2020-01-01 02:00:00 0.836547 0.195106 0.520913
2020-01-01 03:00:00 0.577291 0.074566 0.754697
2020-01-01 04:00:00 0.577291 0.346092 0.754697
2020-01-01 05:00:00 0.577291 0.617617 0.754697
2020-01-01 06:00:00 0.490666 0.889143 0.685191
2020-01-01 07:00:00 0.490666 0.677584 0.685191
2020-01-01 08:00:00 0.490666 0.466025 0.685191
2020-01-01 09:00:00 0.603678 0.254466 0.605424
2020-01-01 10:00:00 0.603678 0.358240 0.605424
2020-01-01 11:00:00 0.603678 0.462014 0.605424
2020-01-01 12:00:00 0.179458 0.565788 0.596706
2020-01-01 13:00:00 0.179458 0.477367 0.596706
2020-01-01 14:00:00 0.179458 0.388946 0.596706
2020-01-01 15:00:00 0.702992 0.300526 0.476644
2020-01-01 16:00:00 0.702992 0.516952 0.476644
2020-01-01 17:00:00 0.702992 0.733378 0.476644
2020-01-01 18:00:00 0.884276 0.949804 0.793237
2020-01-01 19:00:00 0.884276 0.907233 0.793237
2020-01-01 20:00:00 0.884276 0.864661 0.793237
2020-01-01 21:00:00 0.283859 0.822090 0.186542
2020-01-01 22:00:00 0.283859 0.834956 0.186542
2020-01-01 23:00:00 0.283859 0.847822 0.186542
2020-01-02 00:00:00 0.410897 0.860688 0.894249
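
Since this is a pure upsampling (every hourly bin holds at most one original row), another option is to upsample once with asfreq and then fill each column separately. The sketch below rests on that assumption; with more than one row per bin, 'max' would need a real aggregation as in the concat approach above. Variable names are illustrative.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2020-01-01", "2020-01-02", freq="3h")
df = pd.DataFrame(np.random.rand(9, 3), index=index, columns=["A", "B", "C"])

# Upsample to hourly; rows that did not exist before are filled with NaN.
hourly = df.resample("1h").asfreq()

# Fill each column with its own strategy.
out = pd.DataFrame({
    "A": hourly["A"].ffill(),        # carry the last observation forward
    "B": hourly["B"].interpolate(),  # linear interpolation between observations
    "C": hourly["C"].ffill(),        # max of a one-row bin is that row, then carried forward
})
print(out.head(7))
```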
