Time differentiation in Pandas - python

Say I have a dataframe with several timestamps and values. I would like to measure Δ values / Δt every 2.5 seconds. Does Pandas provide any utilities for time differentiation?
time_stamp values
19492 2014-10-06 17:59:40.016000-04:00 1832128
167106 2014-10-06 17:59:41.771000-04:00 2671048
202511 2014-10-06 17:59:43.001000-04:00 2019434
161457 2014-10-06 17:59:44.792000-04:00 1294051
203944 2014-10-06 17:59:48.741000-04:00 867856

It most certainly does. First, you'll need to convert your timestamps into a DatetimeIndex and then use the resampling functions available to Series/DataFrames indexed with that class. The pandas documentation on resampling, and on offset aliases, is helpful reading here.
This code should resample your data to 2.5-second intervals:
#df is your dataframe
index = pd.DatetimeIndex(pd.to_datetime(df['time_stamp']))
values = pd.Series(df['values'].values, index=index)
#Read the link above about the different offset aliases; 'ms' = milliseconds
resampled_values = values.resample('2500ms').mean()  # resample() needs an aggregation
resampled_values.diff()  # compute the difference between successive bins!
That should do it.
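Putting it together, here is a minimal self-contained sketch using the sample data from the question (the column names time_stamp and values are taken from there; the 2.5-second bin width is spelled '2500ms'):
import pandas as pd

df = pd.DataFrame({
    'time_stamp': pd.to_datetime([
        '2014-10-06 17:59:40.016000-04:00',
        '2014-10-06 17:59:41.771000-04:00',
        '2014-10-06 17:59:43.001000-04:00',
        '2014-10-06 17:59:44.792000-04:00',
        '2014-10-06 17:59:48.741000-04:00',
    ]),
    'values': [1832128, 2671048, 2019434, 1294051, 867856],
})

series = df.set_index('time_stamp')['values']
# mean of the points falling in each 2.5s bin, then the bin-to-bin difference
delta = series.resample('2500ms').mean().diff()
print(delta)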

If you really want the time derivative, then you also need to divide by the time difference (delta time, dt) since the last sample.
An example:
dti = pd.DatetimeIndex([
'2018-01-01 00:00:00',
'2018-01-01 00:00:02',
'2018-01-01 00:00:03'])
X = pd.DataFrame({'data': [1,3,4]}, index=dti)
X.head()
data
2018-01-01 00:00:00 1
2018-01-01 00:00:02 3
2018-01-01 00:00:03 4
You can find the time delta by using diff() on the DatetimeIndex. This gives you a Series of Timedeltas; you only need the values in seconds, though (total_seconds(), so that gaps longer than a day are handled correctly too):
dt = pd.Series(X.index).diff().dt.total_seconds().values
dXdt = X.diff().div(dt, axis=0)
dXdt.head()
data
2018-01-01 00:00:00 NaN
2018-01-01 00:00:02 1.0
2018-01-01 00:00:03 1.0
As you can see, this approach takes into account that there are two seconds between the first two values, and only one between the two last values. :)
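For what it's worth, the same derivative can be written in one pass (a sketch using the X defined above):
# value difference divided by elapsed seconds since the previous sample
dXdt = X['data'].diff() / X.index.to_series().diff().dt.total_seconds().values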

Related

Best way to compute regular period averages based on irregular timeseries data

Question:
I have a timeseries dataset with irregular intervals, and I want to compute the averages per regular time interval.
What is the best way to do this in Python?
Example:
Below a simplified dataset as a Pandas series:
import pandas as pd
from datetime import timedelta

base = pd.to_datetime('2021-01-01 12:00')
mydict = {
    base: 5,
    base + timedelta(minutes=5): 10,
    base + timedelta(minutes=7): 12,
    base + timedelta(minutes=12): 6,
    base + timedelta(minutes=25): 8,
}
series = pd.Series(mydict)
Returns:
2021-01-01 12:00:00 5
2021-01-01 12:05:00 10
2021-01-01 12:07:00 12
2021-01-01 12:12:00 6
2021-01-01 12:25:00 8
My solution:
I want to resample this to a regular 15 minute interval and take the mean. I can do this by first resampling to a very small interval (seconds) and then resampling to 15 minutes:
series.resample('S').ffill().resample('15T').mean()
Returns:
2021-01-01 12:00:00 8.200000
2021-01-01 12:15:00 6.003328
It does not feel Pythonic to first resample to a small interval before resampling to the desired interval, and I expect it will also get quite slow with large datasets that require high accuracy. Is there a better way to do this?
P.S. In case you are wondering: If you resample to 15 minutes right away you do not get the desired result:
series.resample('15T').mean()
Returns:
2021-01-01 12:00:00 8.25
2021-01-01 12:15:00 8.00
If the timestamps in your data represent breakpoints between intervals, then your data describes a step function. You can use a package called staircase which is built upon pandas and numpy for analysis with step functions.
Using the setup code you provided, create a staircase.Stairs object from series. These objects represent step functions; they are to staircase what Series are to pandas.
import staircase as sc
sf = sc.Stairs.from_values(initial_value=0, values=series)
There are lots of things you can do with Stairs objects, including plotting
sf.plot(style="hlines")
Next create your 15-minute bins, e.g.
bins = pd.date_range(base, periods=5, freq="15min")
bins looks like this
DatetimeIndex(['2021-01-01 12:00:00', '2021-01-01 12:15:00',
'2021-01-01 12:30:00', '2021-01-01 12:45:00',
'2021-01-01 13:00:00'],
dtype='datetime64[ns]', freq='15T')
Next we slice the step function into pieces with the bins and take the mean. This is analogous to groupby-apply with dataframes in pandas.
means = sf.slice(bins).mean()
means is a pandas.Series indexed by the bins (a pandas.IntervalIndex) with the mean values
[2021-01-01 12:00:00, 2021-01-01 12:15:00) 8.200000
[2021-01-01 12:15:00, 2021-01-01 12:30:00) 6.666667
[2021-01-01 12:30:00, 2021-01-01 12:45:00) 8.000000
[2021-01-01 12:45:00, 2021-01-01 13:00:00) 8.000000
dtype: float64
If you just wanted to have the start points of the interval as the index then you can do this
means.index = means.index.left
Or similarly, use endpoints. If you're feeding this data into a ML algorithm then use endpoints to avoid data leakage.
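For example, to keep the right endpoints instead:
means.index = means.index.right  # label each bin by its end point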
Summary
import staircase as sc
sf = sc.Stairs.from_values(initial_value=0, values=series)
bins = pd.date_range(base, periods=5, freq="15min")
means = sf.slice(bins).mean()
Note:
I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.
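If you would rather not add a dependency, the same time-weighted mean can be computed with plain pandas/numpy. This is only a sketch (the helper name is mine), and it assumes, as above, that each value holds until the next timestamp:
import numpy as np
import pandas as pd

def time_weighted_mean(series, bins):
    # Time-weighted mean per bin of the step function described by `series`.
    out = {}
    for start, end in zip(bins[:-1], bins[1:]):
        # breakpoints falling strictly inside this bin, plus the bin edges
        inner = series.index[(series.index > start) & (series.index < end)]
        points = pd.DatetimeIndex([start]).append(inner).append(pd.DatetimeIndex([end]))
        # value in effect at the start of each sub-interval
        vals = series.reindex(points[:-1], method='ffill').to_numpy()
        widths = np.diff(points.to_numpy()).astype('timedelta64[s]').astype(float)
        out[start] = (vals * widths).sum() / widths.sum()
    return pd.Series(out)

bins = pd.date_range(base, periods=5, freq='15min')
time_weighted_mean(series, bins)  # 12:00 -> 8.2, 12:15 -> 6.666..., then 8.0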

Python Datetime resample results suddenly in NaN Values

I have tried to resample my values to hourly frequency. However, after I changed the date format in the csv file (to stop months and days with low numbers being swapped automatically, e.g. 2003-04-01 becoming 2003-01-04), the resampled values come out as NaN. The date format itself now looks fine when showing the csv file in Python, but resampling produces NaN values.
df = pd.read_csv(r'C:\Users\water_level.csv',parse_dates=[0],index_col=0,decimal=",", delimiter=';')
hour_avg = df_2.resample('H').mean()
Sample of my data:
[screenshot: raw data with time as index]
Afterwards, even though the time column is datetime, 99% of the resampled values are NaN (only about one value per day survives):
[screenshot: data with NaN values after resampling per hour]
When I used resample for day values, all values are back. So it seems there is a problem with the Time.
When I use the format at the beginning, the error "The format doesn't fit" comes up.
I tried a different way before (not sure what was different) but resample worked per hour.
What do I need to change to be able to use resample for hour again?
Can you share a sample of your data? Assuming that your data consists of a DateTime feature (i.e. yyyy-mm-dd hh:mm:ss) and some other features that you are trying to resample by hour, NaN values can occur for two reasons: incorrect parsing by Pandas, or missing hour values in the data.
(1) It is possible that pandas is not reading your dates correctly. Once you read the file, make sure the date column is in the right format (i.e. yyyy-mm-dd).
df = pd.read_csv(r'C:\Users\water_level.csv',parse_dates=[0],index_col=0,decimal=",", delimiter=';')
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S')
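Since the original problem was months and days being swapped, it can also help to tell the parser the order explicitly. This assumes your raw file is day-first:
# if the raw file is day-first (e.g. 04-01-2003 meaning 4 January), say so:
df.index = pd.to_datetime(df.index, dayfirst=True)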
(2) If there are any gaps in your data, NaN values will pop up. For instance, assume the data is of this form:
2000-01-01 00:00:00 1
2000-01-01 00:01:00 1
2000-01-01 00:03:00 1
2000-01-01 00:04:00 1
2000-01-01 00:06:00 1
If you try df_2.resample('T').mean() (minute frequency here, to keep the example short; hourly gaps behave the same way with 'H'), your output will look like:
2000-01-01 00:00:00 1
2000-01-01 00:01:00 1
2000-01-01 00:02:00 NaN
2000-01-01 00:03:00 1
2000-01-01 00:04:00 1
2000-01-01 00:05:00 NaN
2000-01-01 00:06:00 1
I suspect the problem is the latter. If it is, you can simply remove the NaN rows afterwards with .dropna(). Otherwise, if you do need the regular bins regardless of missing data, you can forward-fill the empty bins after taking the mean:
hour_avg = df_2.resample('H').mean().ffill()
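As a variant, if a smooth fill is acceptable, you can also interpolate across the empty bins instead of forward-filling (a sketch, same df_2 as above):
hour_avg = df_2.resample('H').mean().interpolate(method='time')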

How to calculate a mean of measurements taken at the same time (n-hours window) on different days in pandas dataframe?

I have a dataset with measurements acquired almost every 2-hours over a week. I would like to calculate a mean of measurements taken at the same time on different days. For example, I want to calculate the mean of every measurement taken between 12:00 and 13:59.
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
#generating test dataframe
date_today = datetime.now()
time_of_taken_measurment = pd.date_range(date_today, date_today + timedelta(72),
                                         freq='2H20min')
np.random.seed(seed=1111)
data = np.random.randint(1, high=100, size=len(time_of_taken_measurment))
df = pd.DataFrame({'measurementTimestamp': time_of_taken_measurment, 'measurment': data})
df = df.set_index('measurementTimestamp')
#Calculating the mean for measurments taken in the same hour
hourly_average = df.groupby([df.index.hour]).mean()
hourly_average
The code above gives me this output:
0 47.967742
1 43.354839
2 46.935484
.....
22 42.833333
23 52.741935
I would like to have a result like this:
0 mean0
2 mean1
4 mean2
.....
20 mean10
22 mean11
I was trying to solve my problem using the rolling_mean function, but I could not find a way to apply it to my static case.
Use the built-in floor functionality of DatetimeIndex, which lets you easily create 2-hour time bins.
df.groupby(df.index.floor('2H').time).mean()
Output:
measurment
00:00:00 51.516129
02:00:00 54.868852
04:00:00 52.935484
06:00:00 43.177419
08:00:00 43.903226
10:00:00 55.048387
12:00:00 50.639344
14:00:00 48.870968
16:00:00 43.967742
18:00:00 49.225806
20:00:00 43.774194
22:00:00 50.590164
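Note that floor('2H').time pools measurements taken at the same time of day across all days, which is what the question asks for. If you ever want one mean per consecutive 2-hour window instead (not pooled across days), pd.Grouper does that (a brief sketch on the same df):
# one mean per contiguous 2-hour window; equivalent to df.resample('2H').mean()
df.groupby(pd.Grouper(freq='2H')).mean()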

shifting timezone for reshaped pandas dataframe

I am using Pandas dataframes with DatetimeIndex to manipulate timeseries data. The data is stored at UTC time and I usually keep it that way (with naive DatetimeIndex), and only use timezones for output. I like it that way because nothing in the world confuses me more than trying to manipulate timezones.
e.g.
In: ts = pd.date_range('2017-01-01 00:00','2017-12-31 23:30',freq='30Min')
data = np.random.rand(17520,1)
df= pd.DataFrame(data,index=ts,columns = ['data'])
df.head()
Out[15]:
data
2017-01-01 00:00:00 0.697478
2017-01-01 00:30:00 0.506914
2017-01-01 01:00:00 0.792484
2017-01-01 01:30:00 0.043271
2017-01-01 02:00:00 0.558461
I want to plot a chart of data versus time for each day of the year so I reshape the dataframe to have time along the index and dates for columns
df.index = [df.index.time,df.index.date]
df_new = df['data'].unstack()
In: df_new.head()
Out :
2017-01-01 2017-01-02 2017-01-03 2017-01-04 2017-01-05 \
00:00:00 0.697478 0.143626 0.189567 0.061872 0.748223
00:30:00 0.506914 0.470634 0.430101 0.551144 0.081071
01:00:00 0.792484 0.045259 0.748604 0.305681 0.333207
01:30:00 0.043271 0.276888 0.034643 0.413243 0.921668
02:00:00 0.558461 0.723032 0.293308 0.597601 0.120549
If I'm not worried about timezones, I can plot like this:
fig, ax = plt.subplots()
ax.plot(df_new.index,df_new)
but I want to plot the data in the local timezone (tz = pytz.timezone('Australia/Sydney')), making allowance for daylight saving time. But the times and dates are no longer Timestamp objects, so I can't use Pandas timezone handling. Or can I?
Assuming I can't, I'm trying to do the shift manually (given DST starts 1/10 at 2am and finishes 1/4 at 2am), so I've got this far:
df_new[[c for c in df_new.columns if c >= dt.datetime(2017,4,1) and c <dt.datetime(2017,10,1)]].shift_by(+10)
df_new[[c for c in df_new.columns if c < dt.datetime(2017,4,1) or c >= dt.datetime(2017,10,1)]].shift_by(+11)
but am not sure how to write the function shift_by.
(This doesn't handle midnight to 2am on the changeover days correctly, which is not ideal, but I could live with that.)
Use tz_localize + tz_convert on the index to convert the dataframe's dates to a particular timezone.
df.index = df.index.tz_localize('UTC').tz_convert('Australia/Sydney')
df.index = [df.index.time, df.index.date]
Be a little careful when creating the MultiIndex - as you observed, the DST fold creates rows with duplicate timestamps, so if that's the case, get rid of them with duplicated:
df = df[~df.index.duplicated()]
df = df['data'].unstack()
You can also create subplots with df.plot:
df.plot(subplots=True)
plt.show()
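Putting the pieces together, a minimal end-to-end sketch (using the df built in the question):
import matplotlib.pyplot as plt

# localize the naive UTC index, convert to Sydney time (DST handled for you),
# then pivot to time-of-day rows vs. date columns
df.index = df.index.tz_localize('UTC').tz_convert('Australia/Sydney')
df.index = [df.index.time, df.index.date]
df = df[~df.index.duplicated()]  # drop the duplicate created by the DST fold
df_new = df['data'].unstack()

fig, ax = plt.subplots()
ax.plot(df_new.index, df_new)
plt.show()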

How to Convert (timestamp, value) array to timeseries [closed]

I have a rather straightforward problem I'd like to solve with more efficiency than I'm currently getting.
I have a bunch of data coming in as a set of monitoring metrics. Input data is structured as an array of tuples. Each tuple is (timestamp, value). Timestamps are integer epoch seconds, and values are normal floating point numbers. Example:
inArr = [ (1388435242, 12.3), (1388435262, 11.1), (1388435281, 12.8), ... ]
The timestamps are not always the same number of seconds apart, but it's usually close. Sometimes we get duplicate numbers submitted, sometimes we miss datapoints, etc.
My current solution takes the timestamps and:
finds the num seconds between each successive pair of timestamps;
finds the median of these delays;
creates an array of the correct size;
presumes the first time period starts at half the median value before the first timestamp (putting the measurement in the middle of the time period);
averages values that happen to go into the same time bucket;
adds data to this array according to the correct (timestamp - starttime)/median element.
if there's no value for a time range, I obviously output a None value.
Output data has to be in the format:
outArr = [ (startTime, timeStep, numVals), [ val1, val2, val3, val4, ... ] ]
I suspect this is a solved problem with Python Pandas http://pandas.pydata.org/ (or Numpy / SciPy).
Yes, my solution works, but when I'm operating on 60K datapoints it can take a tenth of a second (or more) to run. This is troublesome when I'm trying to work on large numbers of sets of data.
So, I'm looking for a solution that might run faster than my pure-Python version. I guess I'm presuming (based on a couple of previous conversations with an Argonne National Labs guy) that SciPy and Numpy are (clearing-throat) "somewhat faster" at array operations. I've looked briefly (an hour or so) at the Pandas code but it looks cumbersome to do this set of operations. Am I incorrect?
-- Edit to show expected output --
The median time between datapoints is 20 seconds, half that is 10 seconds. To make sure we put the lines well between the timestamps, we make the start time 10 seconds before the first datapoint. If we just make the start time the first timestamp, it's a lot more likely that we'll get 2 timestamps in one interval.
So, 1388435242 - 10 = 1388435232. The timestep is the median, 20 seconds. The numvals here is 3.
outArr = [ (1388435232, 20, 3), [ 12.3, 11.1, 12.8 ] ]
This is the format that Graphite expects when we're graphing the output; it's not my invention. It seems common, though, to have timeseries data be in this format - a starttime, interval, and then an array of values.
Here's a sketch
Create your input series
In [24]: x = list(zip(pd.date_range('20130101',periods=1000000,freq='s').asi8 // 10**9, np.random.randn(1000000)))
In [49]: x[0]
Out[49]: (1356998400, 1.2809949462375376)
Create the frame
In [25]: df = pd.DataFrame(x, columns=['time','value'])
Make the dates a bit random (to simulate some data)
In [26]: df['time1'] = df['time'] + np.random.randint(0,10,size=1000000)
Convert the epoch seconds to datetime64[ns] dtype
In [29]: df['time2'] = pd.to_datetime(df['time1'],unit='s')
Difference the series (to create timedeltas)
In [32]: df['diff'] = df['time2'].diff()
Looks like this
In [50]: df
Out[50]:
time value time1 time2 diff
0 1356998400 -0.269644 1356998405 2013-01-01 00:00:05 NaT
1 1356998401 -0.924337 1356998401 2013-01-01 00:00:01 -00:00:04
2 1356998402 0.952466 1356998410 2013-01-01 00:00:10 00:00:09
3 1356998403 0.604783 1356998411 2013-01-01 00:00:11 00:00:01
4 1356998404 0.140927 1356998407 2013-01-01 00:00:07 -00:00:04
5 1356998405 -0.083861 1356998414 2013-01-01 00:00:14 00:00:07
6 1356998406 1.287110 1356998412 2013-01-01 00:00:12 -00:00:02
7 1356998407 0.539957 1356998414 2013-01-01 00:00:14 00:00:02
8 1356998408 0.337780 1356998412 2013-01-01 00:00:12 -00:00:02
9 1356998409 -0.368456 1356998410 2013-01-01 00:00:10 -00:00:02
10 1356998410 -0.355176 1356998414 2013-01-01 00:00:14 00:00:04
11 1356998411 -2.912447 1356998417 2013-01-01 00:00:17 00:00:03
12 1356998412 -0.003209 1356998418 2013-01-01 00:00:18 00:00:01
13 1356998413 0.122424 1356998414 2013-01-01 00:00:14 -00:00:04
14 1356998414 0.121545 1356998421 2013-01-01 00:00:21 00:00:07
15 1356998415 -0.838947 1356998417 2013-01-01 00:00:17 -00:00:04
16 1356998416 0.329681 1356998419 2013-01-01 00:00:19 00:00:02
17 1356998417 -1.071963 1356998418 2013-01-01 00:00:18 -00:00:01
18 1356998418 1.090762 1356998424 2013-01-01 00:00:24 00:00:06
19 1356998419 1.740093 1356998428 2013-01-01 00:00:28 00:00:04
20 1356998420 1.480837 1356998428 2013-01-01 00:00:28 00:00:00
21 1356998421 0.118806 1356998427 2013-01-01 00:00:27 -00:00:01
22 1356998422 -0.935749 1356998427 2013-01-01 00:00:27 00:00:00
Calc median
In [34]: df['diff'].median()
Out[34]:
0 00:00:01
dtype: timedelta64[ns]
Calc mean
In [35]: df['diff'].mean()
Out[35]:
0 00:00:00.999996
dtype: timedelta64[ns]
Should get you started
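To get from there to the requested (startTime, timeStep, numVals) format, one possible continuation, sketched on the small inArr from the question (the round('s') is my addition - with only three sample points the raw median spacing is 19.5s, while the question's worked example assumed 20s):
import pandas as pd

inArr = [(1388435242, 12.3), (1388435262, 11.1), (1388435281, 12.8)]
df = pd.DataFrame(inArr, columns=['time', 'value'])
df['time'] = pd.to_datetime(df['time'], unit='s')
df = df.set_index('time')

step = df.index.to_series().diff().median().round('s')    # median spacing, 20s here
start = df.index[0] - step / 2                            # start half a step early
binned = df['value'].resample(step, origin=start).mean()  # dupes averaged, gaps -> NaN
vals = [None if pd.isna(v) else v for v in binned]
outArr = [(int(start.timestamp()), int(step.total_seconds()), len(vals)), vals]
# [(1388435232, 20, 3), [12.3, 11.1, 12.8]]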
You can pass your inArr to a pandas Dataframe:
df = pd.DataFrame(inArr, columns=['time', 'value'])
num seconds between each successive pair of timestamps: df['time'].diff()
median delay: df['time'].diff().median()
creates an array of the correct size (I think that's taken care of?)
presumes the first time period starts at half the median value before the first timestamp (putting the measurement in the middle of the time period); I don't know what you mean here
averages values that happen to go into the same time bucket
For several of these problems it may make sense to convert your seconds to datetime and set it as the index:
In [39]: df['time'] = pd.to_datetime(df['time'], unit='s')
In [41]: df = df.set_index('time')
In [42]: df
Out[42]:
value
time
2013-12-30 20:27:22 12.3
2013-12-30 20:27:42 11.1
2013-12-30 20:28:01 12.8
Then to handle multiple values at the same time, use groupby.
In [49]: df.groupby(level='time').mean()
Out[49]:
value
time
2013-12-30 20:27:22 12.3
2013-12-30 20:27:42 11.1
2013-12-30 20:28:01 12.8
It's the same since there aren't any dupes.
Not sure what you mean about the last two.
And your desired output seems to contradict what you wanted earlier: values with the same timestamp should be averaged, yet now you want them all? Maybe clear that up a bit.
