I would like to aggregate some data by hour using pandas and display the date instead of an index.
The code I have right now is the following:
import pandas as pd
import numpy as np
dates = pd.date_range('1/1/2011', periods=20, freq='25min')
data = pd.Series(np.random.randint(100, size=20), index=dates)
result = data.groupby(data.index.hour).sum().reset_index(name='Sum')
print(result)
Which displays something along the lines of:
index Sum
0 0 131
1 1 116
2 2 180
3 3 62
4 4 95
5 5 107
6 6 89
7 7 169
The problem is that instead of index I want to display the date associated with that hour.
The result I'm trying to achieve is the following:
index Sum
0 2011-01-01 01:00:00 131
1 2011-01-01 02:00:00 116
2 2011-01-01 03:00:00 180
3 2011-01-01 04:00:00 62
4 2011-01-01 05:00:00 95
5 2011-01-01 06:00:00 107
6 2011-01-01 07:00:00 89
7 2011-01-01 08:00:00 169
Is there any way I can do that easily using pandas?
One option is to group by the formatted timestamp directly:
data.groupby(data.index.strftime('%Y-%m-%d %H:00:00')).sum().reset_index(name='Sum')
You could use resample.
data.resample('H').sum()
Output:
2011-01-01 00:00:00 84
2011-01-01 01:00:00 121
2011-01-01 02:00:00 160
2011-01-01 03:00:00 70
2011-01-01 04:00:00 88
2011-01-01 05:00:00 131
2011-01-01 06:00:00 56
2011-01-01 07:00:00 109
Freq: H, dtype: int32
Option #2
data.groupby(data.index.floor('H')).sum()
Output:
2011-01-01 00:00:00 84
2011-01-01 01:00:00 121
2011-01-01 02:00:00 160
2011-01-01 03:00:00 70
2011-01-01 04:00:00 88
2011-01-01 05:00:00 131
2011-01-01 06:00:00 56
2011-01-01 07:00:00 109
dtype: int32
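Either way the timestamps end up in the index; if you want the two-column layout from the question, a minimal sketch (using the data Series defined above) is:
result = data.resample('H').sum().reset_index(name='Sum')  # columns: 'index' and 'Sum'
print(result)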
I want to compute the hourly mean for a time series of wind speed and direction, but I want to set the time at the half hour. So, the average of the values from 14:00 to 15:00 should be labelled 14:30. Right now, I can only seem to get the label on the left or right edge of the interval. Here is what I currently have:
ts_g=[item.replace(second=0, microsecond=0) for item in dates_g]
dg = {'ws': data_g.ws, 'wdir': data_g.wdir}
df_g = pandas.DataFrame(data=dg, index=ts_g, columns=['ws','wdir'])
grouped_g = df_g.groupby(pandas.TimeGrouper('H'))
hourly_ws_g = grouped_g['ws'].mean()
hourly_wdir_g = grouped_g['wdir'].mean()
the output for this looks like:
2016-04-08 06:00:00+00:00 46.980000
2016-04-08 07:00:00+00:00 64.313333
2016-04-08 08:00:00+00:00 75.678333
2016-04-08 09:00:00+00:00 127.383333
2016-04-08 10:00:00+00:00 145.950000
2016-04-08 11:00:00+00:00 184.166667
....
but I would like it to be like:
2016-04-08 06:30:00+00:00 54.556
2016-04-08 07:30:00+00:00 78.001
....
Thanks for your help!
So the easiest way is to resample and then use linear interpolation:
In [21]: rng = pd.date_range('1/1/2011', periods=72, freq='H')
In [22]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [23]: ts.head()
Out[23]:
2011-01-01 00:00:00 0.796704
2011-01-01 01:00:00 -1.153179
2011-01-01 02:00:00 -1.919475
2011-01-01 03:00:00 0.082413
2011-01-01 04:00:00 -0.397434
Freq: H, dtype: float64
In [24]: ts2 = ts.resample('30T').interpolate()
In [25]: ts2.head()
Out[25]:
2011-01-01 00:00:00 0.796704
2011-01-01 00:30:00 -0.178237
2011-01-01 01:00:00 -1.153179
2011-01-01 01:30:00 -1.536327
2011-01-01 02:00:00 -1.919475
Freq: 30T, dtype: float64
I believe this is what you need.
Edit to add a clarifying example
Perhaps it's easier to see what's going on without random data:
In [29]: ts.head()
Out[29]:
2011-01-01 00:00:00 0
2011-01-01 01:00:00 1
2011-01-01 02:00:00 2
2011-01-01 03:00:00 3
2011-01-01 04:00:00 4
Freq: H, dtype: int64
In [30]: ts2 = ts.resample('30T').interpolate()
In [31]: ts2.head()
Out[31]:
2011-01-01 00:00:00 0.0
2011-01-01 00:30:00 0.5
2011-01-01 01:00:00 1.0
2011-01-01 01:30:00 1.5
2011-01-01 02:00:00 2.0
Freq: 30T, dtype: float64
This post is already several years old and uses an API that has long been deprecated. Modern pandas provides the resample method, which is easier to use than pandas.TimeGrouper, but it only supports left- and right-labelled intervals; labelling each interval at its centre is not readily available. Still, it is not hard to do.
First we create the data that we want to resample:
import datetime
import pandas as pd
ts_g = [datetime.datetime.fromisoformat('2019-11-20') +
        datetime.timedelta(minutes=10*x) for x in range(0, 100)]
dg = {'ws': range(0, 100), 'wdir': range(0, 100)}
df_g = pd.DataFrame(data=dg, index=ts_g, columns=['ws', 'wdir'])
df_g.head()
df_g.head()
The output would be:
ws wdir
2019-11-20 00:00:00 0 0
2019-11-20 00:10:00 1 1
2019-11-20 00:20:00 2 2
2019-11-20 00:30:00 3 3
2019-11-20 00:40:00 4 4
Now we first resample to 30-minute intervals:
grouped_g = df_g.resample('30min')
halfhourly_ws_g = grouped_g['ws'].mean()
halfhourly_ws_g.head()
The output would be:
2019-11-20 00:00:00     1.0
2019-11-20 00:30:00     4.0
2019-11-20 01:00:00     7.0
2019-11-20 01:30:00    10.0
2019-11-20 02:00:00    13.0
Freq: 30T, Name: ws, dtype: float64
Finally the trick to get the centered intervals:
hourly_ws_g = halfhourly_ws_g.add(halfhourly_ws_g.shift(1)).div(2)\
.loc[halfhourly_ws_g.index.minute % 60 == 30]
hourly_ws_g.head()
This would produce the expected output:
2019-11-20 00:30:00 2.5
2019-11-20 01:30:00 8.5
2019-11-20 02:30:00 14.5
2019-11-20 03:30:00 20.5
2019-11-20 04:30:00 26.5
Freq: 60T, Name: ws, dtype: float64
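An alternative sketch (assuming the same df_g as above): take the plain hourly mean and shift the index labels by 30 minutes; for evenly spaced data this reproduces the centred output above.
hourly_ws_g = df_g['ws'].resample('60min').mean()
hourly_ws_g.index = hourly_ws_g.index + pd.Timedelta(minutes=30)
hourly_ws_g.head()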
I have a DataFrame with measurements, containing the values of the measurement and the times.
time = [datetime.datetime(2011, 1, 1, np.random.randint(0, 23), np.random.randint(1, 59)) for _ in range(10)]
df_meas = pandas.DataFrame({'time': time, 'value': np.random.random(10)})
for example:
time value
0 2011-01-01 21:56:00 0.115025
1 2011-01-01 04:40:00 0.678882
2 2011-01-01 02:18:00 0.507168
3 2011-01-01 22:40:00 0.938408
4 2011-01-01 12:53:00 0.193573
5 2011-01-01 19:37:00 0.464744
6 2011-01-01 16:06:00 0.794495
7 2011-01-01 18:32:00 0.482684
8 2011-01-01 13:26:00 0.381747
9 2011-01-01 01:50:00 0.035798
the data-taking is organized in periods and I have another DataFrame for it:
start = pandas.date_range('1/1/2011', periods=5, freq='H')
stop = start + np.timedelta64(50, 'm')
df_runs = pandas.DataFrame({'start': start, 'stop': stop}, index=np.random.randint(0, 1000000, 5))
df_runs.index.name = 'run'
for example:
start stop
run
721158 2011-01-01 00:00:00 2011-01-01 00:50:00
340902 2011-01-01 01:00:00 2011-01-01 01:50:00
211578 2011-01-01 02:00:00 2011-01-01 02:50:00
120232 2011-01-01 03:00:00 2011-01-01 03:50:00
122199 2011-01-01 04:00:00 2011-01-01 04:50:00
Now I want to merge the two tables, obtaining:
time value run
0 2011-01-01 21:56:00 0.115025 NaN
1 2011-01-01 04:40:00 0.678882 122199
2 2011-01-01 02:18:00 0.507168 211578
3 2011-01-01 22:40:00 0.938408 NaN
...
Time periods (runs) have a start and a stop, with stop >= start. Different runs never overlap. Even though it is not true in my example, you can assume that the runs are ordered (by run) and that if run1 < run2 then start1 < start2 (or you can simply sort the table by start). You can also assume that df_meas is sorted by time.
How can I do that? Is there something built in? What is the most efficient way?
You can first reshape df_runs with stack, so that start and stop end up in a single time column. Then group by run, resample by minute and ffill to fill the NaN values. Finally, merge with df_meas.
Note: this code works in the latest pandas version, 0.18.1 (see the docs).
import pandas as pd
import numpy as np
import datetime
#for testing
np.random.seed(1)
time = [datetime.datetime(2011, 1, 1, np.random.randint(0,23), np.random.randint(1, 59)) for _ in range(10)]
df_meas = pd.DataFrame({'time': time, 'value': np.random.random(10)})
start = pd.date_range('1/1/2011', periods=5, freq='H')
stop = start + np.timedelta64(50, 'm')
df_runs = pd.DataFrame({'start': start, 'stop': stop}, index=np.random.randint(0, 1000000, 5))
df_runs.index.name = 'run'
df = (df_runs.stack().reset_index(level=1, drop=True).reset_index(name='time'))
print (df)
run time
0 99335 2011-01-01 00:00:00
1 99335 2011-01-01 00:50:00
2 823615 2011-01-01 01:00:00
3 823615 2011-01-01 01:50:00
4 117565 2011-01-01 02:00:00
5 117565 2011-01-01 02:50:00
6 790038 2011-01-01 03:00:00
7 790038 2011-01-01 03:50:00
8 369977 2011-01-01 04:00:00
9 369977 2011-01-01 04:50:00
df1 = (df.set_index('time')
.groupby('run')
.resample('Min')
.ffill()
.reset_index(level=0, drop=True)
.reset_index())
print (df1)
time run
0 2011-01-01 00:00:00 99335
1 2011-01-01 00:01:00 99335
2 2011-01-01 00:02:00 99335
3 2011-01-01 00:03:00 99335
4 2011-01-01 00:04:00 99335
5 2011-01-01 00:05:00 99335
6 2011-01-01 00:06:00 99335
7 2011-01-01 00:07:00 99335
8 2011-01-01 00:08:00 99335
9 2011-01-01 00:09:00 99335
...
...
print (pd.merge(df_meas, df1, on='time', how='left'))
time value run
0 2011-01-01 05:44:00 0.524548 NaN
1 2011-01-01 12:09:00 0.443453 NaN
2 2011-01-01 09:12:00 0.229577 NaN
3 2011-01-01 05:16:00 0.534414 NaN
4 2011-01-01 00:17:00 0.913962 99335.0
5 2011-01-01 01:13:00 0.457205 823615.0
6 2011-01-01 07:46:00 0.430699 NaN
7 2011-01-01 06:26:00 0.939128 NaN
8 2011-01-01 18:21:00 0.778389 NaN
9 2011-01-01 05:19:00 0.715971 NaN
IanS's solution is very nice, and I tried to improve on it with pd.lreshape:
df_runs['run1'] = -1
df_runs = df_runs.reset_index()
run_times = (pd.lreshape(df_runs, {'Run':['run', 'run1'],
'Time':['start', 'stop']})
.sort_values('Time')
.set_index('Time'))
print (run_times['Run'].asof(df_meas['time']))
time
2011-01-01 05:44:00 -1
2011-01-01 12:09:00 -1
2011-01-01 09:12:00 -1
2011-01-01 05:16:00 -1
2011-01-01 00:17:00 99335
2011-01-01 01:13:00 823615
2011-01-01 07:46:00 -1
2011-01-01 06:26:00 -1
2011-01-01 18:21:00 -1
2011-01-01 05:19:00 -1
Name: Run, dtype: int64
Edit: As suggested in a comment, there is no need to sort the times. Rather, use stack instead of unstack.
First step: transform the times dataframe
Since the start and end times are nicely ordered, I set them as index. I also add a column with the run id for starts, and NaN for stops. I do this in many lines (hopefully each one self-explanatory), but you could certainly condense the code:
run_times = df_runs.stack().to_frame(name='times')
run_times.reset_index(inplace=True)
run_times['actual_run'] = np.where(run_times['level_1'] == 'start', run_times['run'], np.nan)
run_times.drop(['level_1', 'run'], axis=1, inplace=True)
run_times.set_index('times', drop=True, inplace=True)
Result:
In[101] : run_times
Out[101]:
actual_run
times
2011-01-01 00:00:00 110343
2011-01-01 00:50:00 NaN
2011-01-01 01:00:00 839451
2011-01-01 01:50:00 NaN
2011-01-01 02:00:00 742879
2011-01-01 02:50:00 NaN
2011-01-01 03:00:00 275509
2011-01-01 03:50:00 NaN
2011-01-01 04:00:00 788777
2011-01-01 04:50:00 NaN
Second step: look up the values
You can now look this up in the original dataframe with the asof method:
In[131] : run_times['actual_run'].fillna(-1).asof(df_meas['time'])
Out[131]:
2011-01-01 21:56:00 -1
2011-01-01 04:40:00 122199
2011-01-01 02:18:00 211578
2011-01-01 22:40:00 -1
2011-01-01 12:53:00 -1
Note that I had to use -1 instead of NaN because asof returns the last valid value.
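If you then want the run column attached to df_meas, a small follow-up sketch (the variable names here are mine, not from the question):
runs = run_times['actual_run'].fillna(-1).asof(df_meas['time'])
df_meas['run'] = runs.values                          # align by position; runs is indexed by time
df_meas['run'] = df_meas['run'].replace(-1, np.nan)   # restore NaN for times outside any run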
Edited
If you want to benefit from the tables being sorted, it is often better to leave it to pandas (or numpy). For example, when merging two sorted arrays there is not much you can do by hand, as this answer suggests, and pandas uses low-level functions to do it automatically.
I measured the time used by asof (as in A.asof(I)) and it did not seem to benefit from I being sorted. But I don't see an easy way to beat it, if it is possible at all.
In my tests, asof was even faster than .loc when the index (A.index) already contained I. The only object I know of that could take advantage of the indices being sorted is pd.Index. And indeed, A.reindex(idx) for idx = pd.Index(I) was much faster (to use it, A.index has to be unique). Unfortunately, the time needed to build the right data frame or series outweighed the benefits.
The answers by @IanS and @jezrael are very fast. In fact, most of the time (almost 40%) in jezrael's second function is spent in lreshape; sort_values and asof take up to 15%.
It could certainly be optimized further, but the results are already quite good, so I am posting it here.
I use the following setup to generate sorted data frames for testing:
def setup(intervals, periods):
    time = [datetime.datetime(2011, 1, 1, np.random.randint(0, 23), np.random.randint(1, 59)) for _ in range(intervals)]
    df_meas = pd.DataFrame({'time': time, 'value': np.random.random(intervals)})
    df_meas = df_meas.sort_values(by='time')
    df_meas.index = range(df_meas.shape[0])
    start = pd.date_range('1/1/2011', periods=periods, freq='H')
    stop = start + np.timedelta64(50, 'm')
    df_runs = pd.DataFrame({'start': start, 'stop': stop}, index=np.unique(np.random.randint(0, 1000000, periods)))
    df_runs.index.name = 'run'
    return df_meas, df_runs
The function benefits from the use of asof and some tricks to cut down unnecessary formatting.
def run(df_meas, df_runs):
    # run ids paired with start times, -1 sentinels paired with stop times
    run_times = pd.Series(np.concatenate([df_runs.index, [-1] * df_runs.shape[0]]),
                          index=df_runs.values.flatten(order='F'))
    run_times.sort_index(inplace=True)
    return run_times.asof(df_meas['time'])
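For example, a quick (hypothetical) call of these two helpers:
df_meas, df_runs = setup(intervals=100, periods=20)
print(run(df_meas, df_runs).head())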
I tested it with intervals=100 and periods=20. The results measured with timeit:
# @jezrael's second function:
100 loops, best of 3: 3.43 ms per loop
# @IanS's function:
100 loops, best of 3: 3.92 ms per loop
# my function:
1000 loops, best of 3: 752 µs per loop
The merge() function could be used to merge two dataframes horizontally:
pd.merge(x, y, on="name")  # merge dataframes x and y using the "name" column
So you may have to rename the "start" column of the first dataframe to "time" and give it a try...
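In pandas that idea would look roughly like the sketch below (assuming df_meas and df_runs from the question). Note that it only matches measurements whose time equals a run's start exactly, so it does not handle times that merely fall inside a run:
merged = pd.merge(df_meas,
                  df_runs.reset_index().rename(columns={'start': 'time'}),
                  on='time', how='left')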
Let's generate 10 rows of a time series with non-constant time step :
import pandas as pd
import numpy as np
x = pd.DataFrame(np.random.random(10),pd.date_range('1/1/2011', periods=5, freq='1min') \
.union(pd.date_range('1/2/2011', periods=5, freq='1min')))
Example of data:
2011-01-01 00:00:00 0.144852
2011-01-01 00:01:00 0.510248
2011-01-01 00:02:00 0.911903
2011-01-01 00:03:00 0.392504
2011-01-01 00:04:00 0.054307
2011-01-02 00:00:00 0.918862
2011-01-02 00:01:00 0.988054
2011-01-02 00:02:00 0.780668
2011-01-02 00:03:00 0.831947
2011-01-02 00:04:00 0.707357
Now let's define r as the so-called "returns" (difference between consecutive rows):
r = x[1:] - x[:-1].values
How to clean the data by removing the r[i] for which the time difference was not 1 minute? (here there is exactly one such row in r to clean)
IIUC I think you want the following:
In [26]:
x[(x.index.to_series().diff() == pd.Timedelta(1, 'm')) | (x.index.to_series().diff().isnull())]
Out[26]:
0
2011-01-01 00:00:00 0.367675
2011-01-01 00:01:00 0.128325
2011-01-01 00:02:00 0.772191
2011-01-01 00:03:00 0.638847
2011-01-01 00:04:00 0.476668
2011-01-02 00:01:00 0.992888
2011-01-02 00:02:00 0.944810
2011-01-02 00:03:00 0.171831
2011-01-02 00:04:00 0.316064
This converts the index to a series using to_series so we can call diff, and we can then compare the result with a timedelta of 1 minute. We also handle the first-row case, where diff returns NaT.
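A slightly tidier variant (just a sketch on the same x) computes the differences once:
gaps = x.index.to_series().diff()
cleaned = x[(gaps == pd.Timedelta(1, 'm')) | gaps.isnull()]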
I have a time series of events, and I would like to count the previous non-consecutive occurrences of each type of event in the time series. I want to do this with pandas. I could do it by iterating through the items, but I wonder if there is a clever way of doing it without loops.
To make it clearer. Consider the following time series:
dates = pd.date_range('1/1/2011', periods=4, freq='H')
data = ['a', 'a', 'b', 'a']
df = pd.DataFrame(data,index=dates,columns=["event"])
event
2011-01-01 00:00:00 a
2011-01-01 01:00:00 a
2011-01-01 02:00:00 b
2011-01-01 03:00:00 a
I would like to add a new column that tells, for each element in the "event" column, how many non-consecutive times that element has previously appeared. That is, something like this:
event #prev-occurr
2011-01-01 00:00:00 a 0
2011-01-01 01:00:00 a 0
2011-01-01 02:00:00 b 0
2011-01-01 03:00:00 a 1
We don't really have good groupby support for contiguous groups yet, but we can use the shift-compare-cumsum pattern and then a dense rank to get what you need, IIUC:
>>> egroup = (df["event"] != df["event"].shift()).cumsum()
>>> df["prev_occur"] = egroup.groupby(df["event"]).rank(method="dense") - 1
>>> df
event prev_occur
2011-01-01 00:00:00 a 0
2011-01-01 01:00:00 a 0
2011-01-01 02:00:00 b 0
2011-01-01 03:00:00 a 1
2011-01-01 04:00:00 a 1
2011-01-01 05:00:00 b 1
2011-01-01 06:00:00 a 2
This works because we get a contiguous event group count:
>>> egroup
2011-01-01 00:00:00 1
2011-01-01 01:00:00 1
2011-01-01 02:00:00 2
2011-01-01 03:00:00 3
2011-01-01 04:00:00 3
2011-01-01 05:00:00 4
2011-01-01 06:00:00 5
Freq: H, Name: event, dtype: int64
and then we can group this by the event types, giving us the non-ranked version:
>>> for k,g in egroup.groupby(df["event"]):
... print(g)
...
2011-01-01 00:00:00 1
2011-01-01 01:00:00 1
2011-01-01 03:00:00 3
2011-01-01 04:00:00 3
2011-01-01 06:00:00 5
Name: event, dtype: int64
2011-01-01 02:00:00 2
2011-01-01 05:00:00 4
Name: event, dtype: int64
which we can finally do a dense rank on.
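As a quick check, the same two lines applied to the 4-row frame from the question reproduce the desired column (a sketch; rank returns floats, so I cast to int for display):
dates = pd.date_range('1/1/2011', periods=4, freq='H')
df = pd.DataFrame(['a', 'a', 'b', 'a'], index=dates, columns=['event'])
egroup = (df['event'] != df['event'].shift()).cumsum()
df['prev_occur'] = (egroup.groupby(df['event']).rank(method='dense') - 1).astype(int)
print(df)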