Question:
I have a timeseries dataset with irregular intervals, and I want to compute the averages per regular time interval.
What is the best way to do this in Python?
Example:
Below is a simplified dataset as a pandas Series:
import pandas as pd
from datetime import timedelta

base = pd.to_datetime('2021-01-01 12:00')
mydict = {
base: 5,
base + timedelta(minutes=5): 10,
base + timedelta(minutes=7): 12,
base + timedelta(minutes=12): 6,
base + timedelta(minutes=25): 8
}
series = pd.Series(mydict)
Returns:
2021-01-01 12:00:00 5
2021-01-01 12:05:00 10
2021-01-01 12:07:00 12
2021-01-01 12:12:00 6
2021-01-01 12:25:00 8
My solution:
I want to resample this to a regular 15 minute interval and take the mean. I can do this by first resampling to a very small interval (seconds) and then resampling to 15 minutes:
series.resample('S').ffill().resample('15T').mean()
Returns:
2021-01-01 12:00:00 8.200000
2021-01-01 12:15:00 6.003328
It does not feel Pythonic to first resample to a small interval before resampling to the desired one, and I expect it will also get quite slow with large datasets that require high accuracy. Is there a better way to do this?
P.S. In case you are wondering: If you resample to 15 minutes right away you do not get the desired result:
series.resample('15T').mean()
Returns:
2021-01-01 12:00:00 8.25
2021-01-01 12:15:00 8.00
If the timestamps in your data represent breakpoints between intervals, then your data describes a step function. You can use a package called staircase which is built upon pandas and numpy for analysis with step functions.
Using the setup code you provided, create a staircase.Stairs object from series. These objects represent step functions and are to staircase what Series are to pandas.
import staircase as sc
sf = sc.Stairs.from_values(initial_value=0, values=series)
There are lots of things you can do with Stairs objects, including plotting
sf.plot(style="hlines")
Next, create your 15-minute bins, e.g.
bins = pd.date_range(base, periods=5, freq="15min")
bins looks like this
DatetimeIndex(['2021-01-01 12:00:00', '2021-01-01 12:15:00',
'2021-01-01 12:30:00', '2021-01-01 12:45:00',
'2021-01-01 13:00:00'],
dtype='datetime64[ns]', freq='15T')
Next we slice the step function into pieces with the bins and take the mean. This is analogous to groupby-apply with dataframes in pandas.
means = sf.slice(bins).mean()
means is a pandas.Series indexed by the bins (a pandas.IntervalIndex) with the mean values
[2021-01-01 12:00:00, 2021-01-01 12:15:00) 8.200000
[2021-01-01 12:15:00, 2021-01-01 12:30:00) 6.666667
[2021-01-01 12:30:00, 2021-01-01 12:45:00) 8.000000
[2021-01-01 12:45:00, 2021-01-01 13:00:00) 8.000000
dtype: float64
If you just want the start points of the intervals as the index, you can do this
means.index = means.index.left
Or, similarly, use the right endpoints. If you're feeding this data into an ML algorithm, use the right endpoints to avoid data leakage, since the mean over an interval is only fully known once the interval has ended.
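For example, since means.index is a pandas IntervalIndex, the right endpoints are just the mirror of the left-endpoint line above:
means.index = means.index.right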
Summary
import staircase as sc
sf = sc.Stairs.from_values(initial_value=0, values=series)
bins = pd.date_range(base, periods=5, freq="15min")
means = sf.slice(bins).mean()
Note:
I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.
Related
I have one year's worth of data at four minute time series intervals. I need to always load 24 hours of data and run a function on this dataframe at intervals of eight hours. I need to repeat this process for all the data in the ranges of 2021's start and end dates.
For example:
Load year_df containing ranges between 2021-01-01 00:00:00 and 2021-01-01 23:56:00 and run a function on this.
Load year_df containing ranges between 2021-01-01 08:00:00 and 2021-01-02 07:56:00 and run a function on this.
Load year_df containing ranges between 2021-01-01 16:00:00 and 2021-01-02 15:56:00 and run a function on this.
# Proxy DataFrame
start = pd.to_datetime('2021-01-01 00:00:00')
end = pd.to_datetime('2021-12-31 23:56:00')
myIndex = pd.date_range(start, end, freq='4T')
year_df = pd.DataFrame({'Timestamp': myIndex})
year_df.head()
Timestamp
0 2021-01-01 00:00:00
1 2021-01-01 00:04:00
2 2021-01-01 00:08:00
3 2021-01-01 00:12:00
4 2021-01-01 00:16:00
This approach avoids explicit for loops, but the apply method is essentially a for loop under the hood, so it's not that efficient. Until more functionality based on rolling datetime windows is introduced to pandas, though, this might be the only option.
The example uses the mean of the timestamps. Knowing exactly what function you want to apply may help with a better answer.
s = pd.Series(myIndex, index=myIndex)

def myfunc(e):
    # grab the 24 hours of timestamps starting at e and apply the function (here: mean)
    temp = s[s.between(e, e + pd.Timedelta("24h"))]
    return temp.mean()

s.apply(myfunc)
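If you only need a window every eight hours (as described in the question) rather than one per 4-minute timestamp, a rough sketch under the same setup is to apply myfunc only to 8-hourly start times:
starts = pd.date_range(start, end, freq='8H')
pd.Series(starts, index=starts).apply(myfunc)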
I have a Pandas DataFrame where the index is datetimes for every 12 minutes in a day (120 rows total). I went ahead and resampled the data to every 30 minutes.
Time Rain_Rate
1 2014-04-02 00:00:00 0.50
2 2014-04-02 00:30:00 1.10
3 2014-04-02 01:00:00 0.48
4 2014-04-02 01:30:00 2.30
5 2014-04-02 02:00:00 4.10
6 2014-04-02 02:30:00 5.00
7 2014-04-02 03:00:00 3.20
I want to take 3 hour means centered on hours 00, 03, 06, 09, 12, 15 ,18, and 21. I want the mean to consist of 1.5 hours before 03:00:00 (so 01:30:00) and 1.5 hours after 03:00:00 (04:30:00). The 06:00:00 time would overlap with the 03:00:00 average (they would both use 04:30:00).
Is there a way to do this using pandas? I've tried a few things but they haven't worked.
Method 1
I'm going to suggest just changing your resample from the get-go to get the chunks you want. Here's some fake data resembling yours, before resampling at all:
import numpy as np
import pandas as pd

dr = pd.date_range('04-02-2014 00:00:00', '04-03-2014 00:00:00', freq='12T', closed='left')
data = np.random.rand(120)
df = pd.DataFrame(data, index=dr, columns=['Rain_Rate'])
df.index.name = 'Time'
#df.head()
Rain_Rate
Time
2014-04-02 00:00:00 0.616588
2014-04-02 00:12:00 0.201390
2014-04-02 00:24:00 0.802754
2014-04-02 00:36:00 0.712743
2014-04-02 00:48:00 0.711766
Averaging by 3-hour chunks initially will be the same as doing 30-minute chunks and then 3-hour chunks. You just have to tweak a couple of things to get the bins you want. First, add a row at the point the first bin should start from (i.e. 10:30 pm on the previous day, even if there's no data there; the first bin then runs from 10:30 pm to 1:30 am), then resample starting from this point:
before = df.index[0] - pd.Timedelta(minutes=90) #only if the first index is at midnight!!!
df.loc[before] = np.nan
df = df.sort_index()
output = df.resample('3H', base=22.5, loffset='90min').mean()
The base parameter here means start at the 22.5th hour (10:30), and loffset means push the bin names back by 90 minutes. You get the following output:
Rain_Rate
Time
2014-04-02 00:00:00 0.555515
2014-04-02 03:00:00 0.546571
2014-04-02 06:00:00 0.439953
2014-04-02 09:00:00 0.460898
2014-04-02 12:00:00 0.506690
2014-04-02 15:00:00 0.605775
2014-04-02 18:00:00 0.448838
2014-04-02 21:00:00 0.387380
2014-04-03 00:00:00 0.604204 #this is the bin at midnight on the following day
You could also start with the data binned at 30 minutes and use this method, and should get the same answer.*
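Note that base and loffset are deprecated in newer pandas (1.1+). A rough equivalent on recent versions (my adaptation, not part of the original answer) is to use offset and shift the labels afterwards:
output = df.resample('3H', offset='90min').mean()
output.index = output.index + pd.Timedelta('90min')  # relabel each bin by its center, as loffset did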
Method 2
Another approach would be to find the locations of the indexes you want to create averages for, and then calculate the averages for entries in the 3 hours surrounding:
resampled = df.resample('30T').mean()  # like your data in the post (not actually used below; the raw df works just as well)
centers = [0,3,6,9,12,15,18,21]
mask = np.where(df.index.hour.isin(centers) & (df.index.minute==0), True, False)
df_centers = df.index[mask]
output = []
for center in df_centers:
    cond1 = (df.index >= (center - pd.Timedelta(hours=1.5)))
    cond2 = (df.index <= (center + pd.Timedelta(hours=1.5)))
    output.append(df[cond1 & cond2].values.mean())
Output here is the same, but the answers are in a list (and the last point of "24 hours" is not included):
[0.5555146139562004,
0.5465709237162698,
0.43995277270996735,
0.46089800625663596,
0.5066902552121085,
0.6057747262752732,
0.44883794039466535,
0.3873795731806939]
*You mentioned you wanted some points on the edge of bins to be included in both bins. resample doesn't do this (and generally I don't think most people want to), but the second method is explicit about doing so (by using >= and <= in cond1 and cond2). However, these two methods achieve the same result here, presumably because using resample at different stages causes data points to land in different bins. It's hard for me to wrap my head around that, but one could do a little manual binning to verify what is going on. The point is, I would recommend spot-checking the output of these methods (or any resample-based method) against your raw data to make sure things look correct. For these examples, I did so using Excel.
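For example, a quick spot-check of the first bin in pandas rather than Excel (a sketch, assuming the df from Method 1 with the extra NaN row at 22:30 on the previous day):
# raw points falling in the first resample bin, [22:30, 01:30)
first_bin = df.loc['2014-04-01 22:30:00':'2014-04-02 01:29:59', 'Rain_Rate']
print(first_bin.mean())  # should match the 2014-04-02 00:00:00 row of the Method 1 output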
I have a dataset with measurements acquired roughly every 2 hours over a week. I would like to calculate a mean of measurements taken at the same time on different days. For example, I want to calculate the mean of every measurement taken between 12:00 and 13:59.
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
#generating test dataframe
date_today = datetime.now()
time_of_taken_measurment = pd.date_range(date_today, date_today + timedelta(72), freq='2H20MIN')
np.random.seed(seed=1111)
data = np.random.randint(1, high=100, size=len(time_of_taken_measurment))
df = pd.DataFrame({'measurementTimestamp': time_of_taken_measurment, 'measurment': data})
df = df.set_index('measurementTimestamp')
#Calculating the mean for measurments taken in the same hour
hourly_average = df.groupby([df.index.hour]).mean()
hourly_average
The code above gives me this output:
0 47.967742
1 43.354839
2 46.935484
.....
22 42.833333
23 52.741935
I would like to have a result like this:
0 mean0
2 mean1
4 mean2
.....
20 mean10
22 mean11
I was trying to solve my problem using the rolling_mean function, but I could not find a way to apply it to my static case.
Use the built-in floor functionality of DatetimeIndex, which allows you to easily create 2-hour time bins.
df.groupby(df.index.floor('2H').time).mean()
Output:
measurment
00:00:00 51.516129
02:00:00 54.868852
04:00:00 52.935484
06:00:00 43.177419
08:00:00 43.903226
10:00:00 55.048387
12:00:00 50.639344
14:00:00 48.870968
16:00:00 43.967742
18:00:00 49.225806
20:00:00 43.774194
22:00:00 50.590164
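If you'd rather have the integer hour (0, 2, 4, ...) as the index, as in your desired output, a small variation on the same idea is to group on the hour of the floored timestamps:
df.groupby(df.index.floor('2H').hour).mean()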
Say I have a dataframe with several timestamps and values. I would like to measure Δ values / Δt every 2.5 seconds. Does Pandas provide any utilities for time differentiation?
time_stamp values
19492 2014-10-06 17:59:40.016000-04:00 1832128
167106 2014-10-06 17:59:41.771000-04:00 2671048
202511 2014-10-06 17:59:43.001000-04:00 2019434
161457 2014-10-06 17:59:44.792000-04:00 1294051
203944 2014-10-06 17:59:48.741000-04:00 867856
It most certainly does. First, you'll need to convert your timestamps into a pandas DatetimeIndex and then use the resampling functionality available to Series/DataFrames indexed with that class. Helpful documentation here. Read more here about offset aliases.
This code should resample your data to 2.5s intervals
# df is your dataframe
index = pd.to_datetime(df['time_stamp'])
values = pd.Series(df['values'].values, index=index)
# read the offset-aliases link above; S = seconds
resampled_values = values.resample('2.5S').mean()
resampled_values.diff()  # compute the difference between each point
That should do it.
If you really want the time derivative, then you also need to divide by the time difference (delta time, dt) since last sample
An example:
dti = pd.DatetimeIndex([
'2018-01-01 00:00:00',
'2018-01-01 00:00:02',
'2018-01-01 00:00:03'])
X = pd.DataFrame({'data': [1,3,4]}, index=dti)
X.head()
data
2018-01-01 00:00:00 1
2018-01-01 00:00:02 3
2018-01-01 00:00:03 4
You can find the time deltas by using diff() on the DatetimeIndex. This gives you a Series of timedeltas; you only need the values in seconds, though:
dt = pd.Series(X.index).diff().dt.seconds.values
dXdt = X.diff().div(dt, axis=0)
dXdt.head()
data
2018-01-01 00:00:00 NaN
2018-01-01 00:00:02 1.0
2018-01-01 00:00:03 1.0
As you can see, this approach takes into account that there are two seconds between the first two values, and only one between the two last values. :)
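One caveat (my addition, not part of the original answer): .dt.seconds only returns the seconds component of each timedelta and ignores whole days, so if your samples can be more than a day apart, .dt.total_seconds() is safer:
dt = pd.Series(X.index).diff().dt.total_seconds().values  # robust to gaps longer than a day
dXdt = X.diff().div(dt, axis=0)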
I have a rather straightforward problem I'd like to solve with more efficiency than I'm currently getting.
I have a bunch of data coming in as a set of monitoring metrics. Input data is structured as an array of tuples. Each tuple is (timestamp, value). Timestamps are integer epoch seconds, and values are normal floating point numbers. Example:
inArr = [ (1388435242, 12.3), (1388435262, 11.1), (1388435281, 12.8), ... ]
The timestamps are not always the same number of seconds apart, but it's usually close. Sometimes we get duplicate numbers submitted, sometimes we miss datapoints, etc.
My current solution takes the timestamps and:
finds the number of seconds between each successive pair of timestamps;
finds the median of these delays;
creates an array of the correct size;
presumes the first time period starts at half the median value before the first timestamp (putting the measurement in the middle of the time period);
averages values that happen to go into the same time bucket;
adds data to this array according to the correct (timestamp - starttime)/median element.
if there's no value for a time range, I obviously output a None value.
Output data has to be in the format:
outArr = [ (startTime, timeStep, numVals), [ val1, val2, val3, val4, ... ] ]
I suspect this is a solved problem with Python Pandas http://pandas.pydata.org/ (or Numpy / SciPy).
Yes, my solution works, but when I'm operating on 60K datapoints it can take a tenth of a second (or more) to run. This is troublesome when I'm trying to work on large numbers of sets of data.
So, I'm looking for a solution that might run faster than my pure-Python version. I guess I'm presuming (based on a couple of previous conversations with an Argonne National Labs guy) that SciPy and Numpy are (clearing-throat) "somewhat faster" at array operations. I've looked briefly (an hour or so) at the Pandas code but it looks cumbersome to do this set of operations. Am I incorrect?
-- Edit to show expected output --
The median time between datapoints is 20 seconds, half that is 10 seconds. To make sure we put the lines well between the timestamps, we make the start time 10 seconds before the first datapoint. If we just make the start time the first timestamp, it's a lot more likely that we'll get 2 timestamps in one interval.
So, 1388435242 - 10 = 1388435232. The timestep is the median, 20 seconds. The numvals here is 3.
outArr = [ (1388435232, 20, 3), [ 12.3, 11.1, 12.8 ] ]
This is the format that Graphite expects when we're graphing the output; it's not my invention. It seems common, though, to have timeseries data be in this format - a starttime, interval, and then an array of values.
Here's a sketch
Create your input series
In [24]: x = list(zip(pd.date_range('20130101',periods=1000000,freq='s').asi8 // 1000000000, np.random.randn(1000000)))
In [49]: x[0]
Out[49]: (1356998400, 1.2809949462375376)
Create the frame
In [25]: df = pd.DataFrame(x, columns=['time','value'])
Make the dates a bit random (to simulate some data)
In [26]: df['time1'] = df['time'] + np.random.randint(0,10,size=1000000)
Convert the epoch seconds to datetime64[ns] dtype
In [29]: df['time2'] = pd.to_datetime(df['time1'],unit='s')
Difference the series (to create timedeltas)
In [32]: df['diff'] = df['time2'].diff()
Looks like this
In [50]: df
Out[50]:
time value time1 time2 diff
0 1356998400 -0.269644 1356998405 2013-01-01 00:00:05 NaT
1 1356998401 -0.924337 1356998401 2013-01-01 00:00:01 -00:00:04
2 1356998402 0.952466 1356998410 2013-01-01 00:00:10 00:00:09
3 1356998403 0.604783 1356998411 2013-01-01 00:00:11 00:00:01
4 1356998404 0.140927 1356998407 2013-01-01 00:00:07 -00:00:04
5 1356998405 -0.083861 1356998414 2013-01-01 00:00:14 00:00:07
6 1356998406 1.287110 1356998412 2013-01-01 00:00:12 -00:00:02
7 1356998407 0.539957 1356998414 2013-01-01 00:00:14 00:00:02
8 1356998408 0.337780 1356998412 2013-01-01 00:00:12 -00:00:02
9 1356998409 -0.368456 1356998410 2013-01-01 00:00:10 -00:00:02
10 1356998410 -0.355176 1356998414 2013-01-01 00:00:14 00:00:04
11 1356998411 -2.912447 1356998417 2013-01-01 00:00:17 00:00:03
12 1356998412 -0.003209 1356998418 2013-01-01 00:00:18 00:00:01
13 1356998413 0.122424 1356998414 2013-01-01 00:00:14 -00:00:04
14 1356998414 0.121545 1356998421 2013-01-01 00:00:21 00:00:07
15 1356998415 -0.838947 1356998417 2013-01-01 00:00:17 -00:00:04
16 1356998416 0.329681 1356998419 2013-01-01 00:00:19 00:00:02
17 1356998417 -1.071963 1356998418 2013-01-01 00:00:18 -00:00:01
18 1356998418 1.090762 1356998424 2013-01-01 00:00:24 00:00:06
19 1356998419 1.740093 1356998428 2013-01-01 00:00:28 00:00:04
20 1356998420 1.480837 1356998428 2013-01-01 00:00:28 00:00:00
21 1356998421 0.118806 1356998427 2013-01-01 00:00:27 -00:00:01
22 1356998422 -0.935749 1356998427 2013-01-01 00:00:27 00:00:00
Calc median
In [34]: df['diff'].median()
Out[34]:
0 00:00:01
dtype: timedelta64[ns]
Calc mean
In [35]: df['diff'].mean()
Out[35]:
0 00:00:00.999996
dtype: timedelta64[ns]
Should get you started
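To go further and actually bin into median-width buckets (a rough sketch for a recent pandas, where .median() on the diff column returns a Timedelta, and starting half a step before the first point as the question describes):
step = df['diff'].median()                    # bucket width
start = df['time2'].iloc[0] - step / 2        # start half a step before the first point
binned = df.set_index('time2')['value'].sort_index().resample(step, origin=start).mean()
Duplicate timestamps that land in the same bucket are averaged automatically, and empty buckets come out as NaN, which matches the None handling described in the question.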
You can pass your inArr to a pandas DataFrame:
df = pd.DataFrame(inArr, columns=['time', 'value'])
num seconds between each successive pair of timestamps: df['time'].diff()
median delay: df['time'].diff().median()
creates an array of the correct size (I think that's taken care of?)
presumes the first time period starts at half the median value before the first timestamp (putting the measurement in the middle of the time period); I don't know what you mean here
averages values that happen to go into the same time bucket
For several of these problems it may make sense to convert your seconds to datetimes and set them as the index:
In [39]: df['time'] = pd.to_datetime(df['time'], unit='s')
In [41]: df = df.set_index('time')
In [42]: df
Out[42]:
value
time
2013-12-30 20:27:22 12.3
2013-12-30 20:27:42 11.1
2013-12-30 20:28:01 12.8
Then, to handle multiple values at the same time, use groupby.
In [49]: df.groupby(level='time').mean()
Out[49]:
value
time
2013-12-30 20:27:22 12.3
2013-12-30 20:27:42 11.1
2013-12-30 20:28:01 12.8
It's the same since there aren't any dupes.
Not sure what you mean about the last two.
And your desired output seems to contradict what you wanted earlier. Your values with the same timestamp should be averaged, and now you want them all? Maybe clear that up a bit.