I would like to aggregate some data by hour using pandas and display the date instead of an index.
The code I have right now is the following:
import pandas as pd
import numpy as np
dates = pd.date_range('1/1/2011', periods=20, freq='25min')
data = pd.Series(np.random.randint(100, size=20), index=dates)
result = data.groupby(data.index.hour).sum().reset_index(name='Sum')
print(result)
Which displays something along the lines of:
index Sum
0 0 131
1 1 116
2 2 180
3 3 62
4 4 95
5 5 107
6 6 89
7 7 169
The problem is that instead of index I want to display the date associated with that hour.
The result I'm trying to achieve is the following:
index Sum
0 2011-01-01 01:00:00 131
1 2011-01-01 02:00:00 116
2 2011-01-01 03:00:00 180
3 2011-01-01 04:00:00 62
4 2011-01-01 05:00:00 95
5 2011-01-01 06:00:00 107
6 2011-01-01 07:00:00 89
7 2011-01-01 08:00:00 169
Is there any way I can do that easily using pandas?
One option is to group by the formatted timestamp directly:
data.groupby(data.index.strftime('%Y-%m-%d %H:00:00')).sum().reset_index(name='Sum')
You could use resample.
data.resample('H').sum()
Output:
2011-01-01 00:00:00 84
2011-01-01 01:00:00 121
2011-01-01 02:00:00 160
2011-01-01 03:00:00 70
2011-01-01 04:00:00 88
2011-01-01 05:00:00 131
2011-01-01 06:00:00 56
2011-01-01 07:00:00 109
Freq: H, dtype: int32
Option #2
data.groupby(data.index.floor('H')).sum()
Output:
2011-01-01 00:00:00 84
2011-01-01 01:00:00 121
2011-01-01 02:00:00 160
2011-01-01 03:00:00 70
2011-01-01 04:00:00 88
2011-01-01 05:00:00 131
2011-01-01 06:00:00 56
2011-01-01 07:00:00 109
dtype: int32
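Either way the timestamps end up in the index; if you want the two-column layout from the question, a minimal sketch (using the data Series defined above) is:
result = data.resample('H').sum().reset_index(name='Sum')  # columns: 'index' and 'Sum'
print(result)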
I want to compute the hourly mean for a time series of wind speed and direction, but I want to set the time at the half hour. So, the average of the values from 14:00 to 15:00 should be labelled 14:30. Right now, I can only seem to get the label on the left or right edge of the interval. Here is what I currently have:
ts_g=[item.replace(second=0, microsecond=0) for item in dates_g]
dg = {'ws': data_g.ws, 'wdir': data_g.wdir}
df_g = pandas.DataFrame(data=dg, index=ts_g, columns=['ws','wdir'])
grouped_g = df_g.groupby(pandas.TimeGrouper('H'))
hourly_ws_g = grouped_g['ws'].mean()
hourly_wdir_g = grouped_g['wdir'].mean()
the output for this looks like:
2016-04-08 06:00:00+00:00 46.980000
2016-04-08 07:00:00+00:00 64.313333
2016-04-08 08:00:00+00:00 75.678333
2016-04-08 09:00:00+00:00 127.383333
2016-04-08 10:00:00+00:00 145.950000
2016-04-08 11:00:00+00:00 184.166667
....
but I would like it to be like:
2016-04-08 06:30:00+00:00 54.556
2016-04-08 07:30:00+00:00 78.001
....
Thanks for your help!
So the easiest way is to resample and then use linear interpolation:
In [21]: rng = pd.date_range('1/1/2011', periods=72, freq='H')
In [22]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [23]: ts.head()
Out[23]:
2011-01-01 00:00:00 0.796704
2011-01-01 01:00:00 -1.153179
2011-01-01 02:00:00 -1.919475
2011-01-01 03:00:00 0.082413
2011-01-01 04:00:00 -0.397434
Freq: H, dtype: float64
In [24]: ts2 = ts.resample('30T').interpolate()
In [25]: ts2.head()
Out[25]:
2011-01-01 00:00:00 0.796704
2011-01-01 00:30:00 -0.178237
2011-01-01 01:00:00 -1.153179
2011-01-01 01:30:00 -1.536327
2011-01-01 02:00:00 -1.919475
Freq: 30T, dtype: float64
I believe this is what you need.
Edit to add a clarifying example
Perhaps it's easier to see what's going on without random data:
In [29]: ts.head()
Out[29]:
2011-01-01 00:00:00 0
2011-01-01 01:00:00 1
2011-01-01 02:00:00 2
2011-01-01 03:00:00 3
2011-01-01 04:00:00 4
Freq: H, dtype: int64
In [30]: ts2 = ts.resample('30T').interpolate()
In [31]: ts2.head()
Out[31]:
2011-01-01 00:00:00 0.0
2011-01-01 00:30:00 0.5
2011-01-01 01:00:00 1.0
2011-01-01 01:30:00 1.5
2011-01-01 02:00:00 2.0
Freq: 30T, dtype: float64
This post is already several years old and uses an API that has long been deprecated. Modern pandas provides the resample method, which is easier to use than pandas.TimeGrouper, but it only supports left- and right-labelled intervals; labelling each interval at its centre is not readily available. Still, it is not hard to do.
First we create the data that we want to resample:
import datetime
import pandas as pd
ts_g = [datetime.datetime.fromisoformat('2019-11-20') +
        datetime.timedelta(minutes=10*x) for x in range(0, 100)]
dg = {'ws': range(0, 100), 'wdir': range(0, 100)}
df_g = pd.DataFrame(data=dg, index=ts_g, columns=['ws', 'wdir'])
df_g.head()
df_g.head()
The output would be:
ws wdir
2019-11-20 00:00:00 0 0
2019-11-20 00:10:00 1 1
2019-11-20 00:20:00 2 2
2019-11-20 00:30:00 3 3
2019-11-20 00:40:00 4 4
Now we first resample to 30-minute intervals:
grouped_g = df_g.resample('30min')
halfhourly_ws_g = grouped_g['ws'].mean()
halfhourly_ws_g.head()
The output would be:
2019-11-20 00:00:00     1.0
2019-11-20 00:30:00     4.0
2019-11-20 01:00:00     7.0
2019-11-20 01:30:00    10.0
2019-11-20 02:00:00    13.0
Freq: 30T, Name: ws, dtype: float64
Finally the trick to get the centered intervals:
hourly_ws_g = halfhourly_ws_g.add(halfhourly_ws_g.shift(1)).div(2)\
.loc[halfhourly_ws_g.index.minute % 60 == 30]
hourly_ws_g.head()
This would produce the expected output:
2019-11-20 00:30:00 2.5
2019-11-20 01:30:00 8.5
2019-11-20 02:30:00 14.5
2019-11-20 03:30:00 20.5
2019-11-20 04:30:00 26.5
Freq: 60T, Name: ws, dtype: float64
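An alternative sketch (assuming the same df_g as above): take the plain hourly mean and shift the index labels by 30 minutes; for evenly spaced data this reproduces the centred output above.
hourly_ws_g = df_g['ws'].resample('60min').mean()
hourly_ws_g.index = hourly_ws_g.index + pd.Timedelta(minutes=30)
hourly_ws_g.head()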
I have a DataFrame with measurements, containing the values of the measurement and the times.
time = [datetime.datetime(2011, 1, 1, np.random.randint(0, 23), np.random.randint(1, 59)) for _ in range(10)]
df_meas = pandas.DataFrame({'time': time, 'value': np.random.random(10)})
for example:
time value
0 2011-01-01 21:56:00 0.115025
1 2011-01-01 04:40:00 0.678882
2 2011-01-01 02:18:00 0.507168
3 2011-01-01 22:40:00 0.938408
4 2011-01-01 12:53:00 0.193573
5 2011-01-01 19:37:00 0.464744
6 2011-01-01 16:06:00 0.794495
7 2011-01-01 18:32:00 0.482684
8 2011-01-01 13:26:00 0.381747
9 2011-01-01 01:50:00 0.035798
the data-taking is organized in periods and I have another DataFrame for it:
start = pandas.date_range('1/1/2011', periods=5, freq='H')
stop = start + np.timedelta64(50, 'm')
df_runs = pandas.DataFrame({'start': start, 'stop': stop}, index=np.random.randint(0, 1000000, 5))
df_runs.index.name = 'run'
for example:
start stop
run
721158 2011-01-01 00:00:00 2011-01-01 00:50:00
340902 2011-01-01 01:00:00 2011-01-01 01:50:00
211578 2011-01-01 02:00:00 2011-01-01 02:50:00
120232 2011-01-01 03:00:00 2011-01-01 03:50:00
122199 2011-01-01 04:00:00 2011-01-01 04:50:00
Now I want to merge the two tables, obtaining:
time value run
0 2011-01-01 21:56:00 0.115025 NaN
1 2011-01-01 04:40:00 0.678882 122199
2 2011-01-01 02:18:00 0.507168 211578
3 2011-01-01 22:40:00 0.938408 NaN
...
Time periods (runs) have a start and a stop, with stop >= start. Different runs never overlap. Even though it is not true in my example, you can assume that the runs are ordered (by run) and that if run1 < run2 then start1 < start2 (or you can simply sort the table by start). You can also assume that df_meas is sorted by time.
How can I do that? Is there something built in? What is the most efficient way?
You can first reshape df_runs with stack, so that start and stop end up in a single time column. Then group by run, resample by minute and ffill to fill the NaN values. Finally, merge with df_meas.
Note: this code works in the latest pandas version, 0.18.1 (see the docs).
import pandas as pd
import numpy as np
import datetime
#for testing
np.random.seed(1)
time = [datetime.datetime(2011, 1, 1, np.random.randint(0,23), np.random.randint(1, 59)) for _ in range(10)]
df_meas = pd.DataFrame({'time': time, 'value': np.random.random(10)})
start = pd.date_range('1/1/2011', periods=5, freq='H')
stop = start + np.timedelta64(50, 'm')
df_runs = pd.DataFrame({'start': start, 'stop': stop}, index=np.random.randint(0, 1000000, 5))
df_runs.index.name = 'run'
df = (df_runs.stack().reset_index(level=1, drop=True).reset_index(name='time'))
print (df)
run time
0 99335 2011-01-01 00:00:00
1 99335 2011-01-01 00:50:00
2 823615 2011-01-01 01:00:00
3 823615 2011-01-01 01:50:00
4 117565 2011-01-01 02:00:00
5 117565 2011-01-01 02:50:00
6 790038 2011-01-01 03:00:00
7 790038 2011-01-01 03:50:00
8 369977 2011-01-01 04:00:00
9 369977 2011-01-01 04:50:00
df1 = (df.set_index('time')
.groupby('run')
.resample('Min')
.ffill()
.reset_index(level=0, drop=True)
.reset_index())
print (df1)
time run
0 2011-01-01 00:00:00 99335
1 2011-01-01 00:01:00 99335
2 2011-01-01 00:02:00 99335
3 2011-01-01 00:03:00 99335
4 2011-01-01 00:04:00 99335
5 2011-01-01 00:05:00 99335
6 2011-01-01 00:06:00 99335
7 2011-01-01 00:07:00 99335
8 2011-01-01 00:08:00 99335
9 2011-01-01 00:09:00 99335
...
...
print (pd.merge(df_meas, df1, on='time', how='left'))
time value run
0 2011-01-01 05:44:00 0.524548 NaN
1 2011-01-01 12:09:00 0.443453 NaN
2 2011-01-01 09:12:00 0.229577 NaN
3 2011-01-01 05:16:00 0.534414 NaN
4 2011-01-01 00:17:00 0.913962 99335.0
5 2011-01-01 01:13:00 0.457205 823615.0
6 2011-01-01 07:46:00 0.430699 NaN
7 2011-01-01 06:26:00 0.939128 NaN
8 2011-01-01 18:21:00 0.778389 NaN
9 2011-01-01 05:19:00 0.715971 NaN
IanS's solution is very nice, and I tried to improve on it with pd.lreshape:
df_runs['run1'] = -1
df_runs = df_runs.reset_index()
run_times = (pd.lreshape(df_runs, {'Run':['run', 'run1'],
'Time':['start', 'stop']})
.sort_values('Time')
.set_index('Time'))
print (run_times['Run'].asof(df_meas['time']))
time
2011-01-01 05:44:00 -1
2011-01-01 12:09:00 -1
2011-01-01 09:12:00 -1
2011-01-01 05:16:00 -1
2011-01-01 00:17:00 99335
2011-01-01 01:13:00 823615
2011-01-01 07:46:00 -1
2011-01-01 06:26:00 -1
2011-01-01 18:21:00 -1
2011-01-01 05:19:00 -1
Name: Run, dtype: int64
Edit: As suggested in a comment, there is no need to sort the times. Rather, use stack instead of unstack.
First step: transform the times dataframe
Since the start and end times are nicely ordered, I set them as index. I also add a column with the run id for starts, and NaN for stops. I do this in many lines (hopefully each one self-explanatory), but you could certainly condense the code:
run_times = df_runs.stack().to_frame(name='times')
run_times.reset_index(inplace=True)
run_times['actual_run'] = np.where(run_times['level_1'] == 'start', run_times['run'], np.nan)
run_times.drop(['level_1', 'run'], axis=1, inplace=True)
run_times.set_index('times', drop=True, inplace=True)
Result:
In[101] : run_times
Out[101]:
actual_run
times
2011-01-01 00:00:00 110343
2011-01-01 00:50:00 NaN
2011-01-01 01:00:00 839451
2011-01-01 01:50:00 NaN
2011-01-01 02:00:00 742879
2011-01-01 02:50:00 NaN
2011-01-01 03:00:00 275509
2011-01-01 03:50:00 NaN
2011-01-01 04:00:00 788777
2011-01-01 04:50:00 NaN
Second step: look up the values
You can now look this up in the original dataframe with the asof method:
In[131] : run_times['actual_run'].fillna(-1).asof(df_meas['time'])
Out[131]:
2011-01-01 21:56:00 -1
2011-01-01 04:40:00 122199
2011-01-01 02:18:00 211578
2011-01-01 22:40:00 -1
2011-01-01 12:53:00 -1
Note that I had to use -1 instead of NaN because asof returns the last valid value.
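If you then want the run column attached to df_meas, a small follow-up sketch (the variable names here are mine, not from the question):
runs = run_times['actual_run'].fillna(-1).asof(df_meas['time'])
df_meas['run'] = runs.values                          # align by position; runs is indexed by time
df_meas['run'] = df_meas['run'].replace(-1, np.nan)   # restore NaN for times outside any run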
Edited
If you want to benefit from the tables being sorted, it is often better to leave it to pandas (or numpy). For example, when merging two sorted arrays there is not much you can do by hand, as this answer suggests, and pandas uses low-level functions to do it automatically.
I measured the time used by asof (as in A.asof(I)) and it did not seem to benefit from I being sorted. But I don't see an easy way to beat it, if it is possible at all.
In my tests, asof was even faster than .loc when the index (A.index) already contained I. The only object I know of that could take advantage of the indices being sorted is pd.Index. And indeed, A.reindex(idx) for idx = pd.Index(I) was much faster (to use it, A.index has to be unique). Unfortunately, the time needed to build the right data frame or series outweighed the benefits.
The answers by @IanS and @jezrael are very fast. In fact, most of the time (almost 40%) in jezrael's second function is spent in lreshape; sort_values and asof take up to 15%.
It could certainly be optimized further, but the results are already quite good, so I am posting it here.
I use the following setup to generate sorted data frames for testing:
def setup(intervals, periods):
    time = [datetime.datetime(2011, 1, 1, np.random.randint(0, 23), np.random.randint(1, 59)) for _ in range(intervals)]
    df_meas = pd.DataFrame({'time': time, 'value': np.random.random(intervals)})
    df_meas = df_meas.sort_values(by='time')
    df_meas.index = range(df_meas.shape[0])
    start = pd.date_range('1/1/2011', periods=periods, freq='H')
    stop = start + np.timedelta64(50, 'm')
    df_runs = pd.DataFrame({'start': start, 'stop': stop}, index=np.unique(np.random.randint(0, 1000000, periods)))
    df_runs.index.name = 'run'
    return df_meas, df_runs
The function benefits from the use of asof and some tricks to cut down unnecessary formatting.
def run(df_meas, df_runs):
    # run ids paired with start times, -1 sentinels paired with stop times
    run_times = pd.Series(np.concatenate([df_runs.index, [-1] * df_runs.shape[0]]),
                          index=df_runs.values.flatten(order='F'))
    run_times.sort_index(inplace=True)
    return run_times.asof(df_meas['time'])
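For example, a quick (hypothetical) call of these two helpers:
df_meas, df_runs = setup(intervals=100, periods=20)
print(run(df_meas, df_runs).head())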
I tested it with intervals=100 and periods=20. The results measured with timeit:
# @jezrael's second function:
100 loops, best of 3: 3.43 ms per loop
# @IanS's function:
100 loops, best of 3: 3.92 ms per loop
# my function:
1000 loops, best of 3: 752 µs per loop
The merge() function could be used to merge two dataframes horizontally:
pd.merge(x, y, on="name")  # merge dataframes x and y using the "name" column
So you may have to rename the "start" column of the first dataframe to "time" and give it a try...
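In pandas that idea would look roughly like the sketch below (assuming df_meas and df_runs from the question). Note that it only matches measurements whose time equals a run's start exactly, so it does not handle times that merely fall inside a run:
merged = pd.merge(df_meas,
                  df_runs.reset_index().rename(columns={'start': 'time'}),
                  on='time', how='left')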
Let's generate 10 rows of a time series with non-constant time step :
import pandas as pd
import numpy as np
x = pd.DataFrame(np.random.random(10),pd.date_range('1/1/2011', periods=5, freq='1min') \
.union(pd.date_range('1/2/2011', periods=5, freq='1min')))
Example of data:
2011-01-01 00:00:00 0.144852
2011-01-01 00:01:00 0.510248
2011-01-01 00:02:00 0.911903
2011-01-01 00:03:00 0.392504
2011-01-01 00:04:00 0.054307
2011-01-02 00:00:00 0.918862
2011-01-02 00:01:00 0.988054
2011-01-02 00:02:00 0.780668
2011-01-02 00:03:00 0.831947
2011-01-02 00:04:00 0.707357
Now let's define r as the so-called "returns" (difference between consecutive rows):
r = x[1:] - x[:-1].values
How to clean the data by removing the r[i] for which the time difference was not 1 minute? (here there is exactly one such row in r to clean)
IIUC I think you want the following:
In [26]:
x[(x.index.to_series().diff() == pd.Timedelta(1, 'm')) | (x.index.to_series().diff().isnull())]
Out[26]:
0
2011-01-01 00:00:00 0.367675
2011-01-01 00:01:00 0.128325
2011-01-01 00:02:00 0.772191
2011-01-01 00:03:00 0.638847
2011-01-01 00:04:00 0.476668
2011-01-02 00:01:00 0.992888
2011-01-02 00:02:00 0.944810
2011-01-02 00:03:00 0.171831
2011-01-02 00:04:00 0.316064
This converts the index to a series using to_series so we can call diff, and we can then compare the result with a timedelta of 1 minute. We also handle the first-row case, where diff returns NaT.
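A slightly tidier variant (just a sketch on the same x) computes the differences once:
gaps = x.index.to_series().diff()
cleaned = x[(gaps == pd.Timedelta(1, 'm')) | gaps.isnull()]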
I have a time series of events, and I would like to count the previous non-consecutive occurrences of each type of event in the time series. I want to do this with pandas. I could do it by iterating through the items, but I wonder if there is a clever way of doing it without loops.
To make it clearer. Consider the following time series:
dates = pd.date_range('1/1/2011', periods=4, freq='H')
data = ['a', 'a', 'b', 'a']
df = pd.DataFrame(data,index=dates,columns=["event"])
event
2011-01-01 00:00:00 a
2011-01-01 01:00:00 a
2011-01-01 02:00:00 b
2011-01-01 03:00:00 a
I would like to add a new column that tells, for each element in the "event" column, how many non-consecutive times that element has previously appeared. That is, something like this:
event #prev-occurr
2011-01-01 00:00:00 a 0
2011-01-01 01:00:00 a 0
2011-01-01 02:00:00 b 0
2011-01-01 03:00:00 a 1
We don't really have good groupby support for contiguous groups yet, but we can use the shift-compare-cumsum pattern and then a dense rank to get what you need, IIUC:
>>> egroup = (df["event"] != df["event"].shift()).cumsum()
>>> df["prev_occur"] = egroup.groupby(df["event"]).rank(method="dense") - 1
>>> df
event prev_occur
2011-01-01 00:00:00 a 0
2011-01-01 01:00:00 a 0
2011-01-01 02:00:00 b 0
2011-01-01 03:00:00 a 1
2011-01-01 04:00:00 a 1
2011-01-01 05:00:00 b 1
2011-01-01 06:00:00 a 2
This works because we get a contiguous event group count:
>>> egroup
2011-01-01 00:00:00 1
2011-01-01 01:00:00 1
2011-01-01 02:00:00 2
2011-01-01 03:00:00 3
2011-01-01 04:00:00 3
2011-01-01 05:00:00 4
2011-01-01 06:00:00 5
Freq: H, Name: event, dtype: int64
and then we can group this by the event types, giving us the non-ranked version:
>>> for k,g in egroup.groupby(df["event"]):
... print(g)
...
2011-01-01 00:00:00 1
2011-01-01 01:00:00 1
2011-01-01 03:00:00 3
2011-01-01 04:00:00 3
2011-01-01 06:00:00 5
Name: event, dtype: int64
2011-01-01 02:00:00 2
2011-01-01 05:00:00 4
Name: event, dtype: int64
which we can finally do a dense rank on.
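As a quick check, the same two lines applied to the 4-row frame from the question reproduce the desired column (a sketch; rank returns floats, so I cast to int for display):
dates = pd.date_range('1/1/2011', periods=4, freq='H')
df = pd.DataFrame(['a', 'a', 'b', 'a'], index=dates, columns=['event'])
egroup = (df['event'] != df['event'].shift()).cumsum()
df['prev_occur'] = (egroup.groupby(df['event']).rank(method='dense') - 1).astype(int)
print(df)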