Custom resample function: only sample similar values hourly - Irregular time series - python

I am quite new to this and can't seem to find an answer to my problem online.
I have a somewhat irregular time series in Python (mostly I use Pandas to work with it), which has a datetime index (roughly every 15 minutes) and multiple columns with values. I know that those values change approximately every hour, but they don't quite line up with the index I have. It looks something like this:
Values
2019-08-27 02:15:00 91.45
2019-08-27 02:30:00 91.44
2019-08-27 02:45:00 91.44
2019-08-27 03:00:00 91.43
2019-08-27 03:15:00 91.43
2019-08-27 03:30:00 91.43
2019-08-27 03:45:00 91.42
This is just an example, but one can see that the values change at random times (:15, :45, :00), and even though they should change every hour, sometimes there are only two 15-minute intervals with a given value, so I can't just say: take a group of 4 values and resample them to one hour.
So my idea was to use an if/else construct to create something like this:
if a value is the same as the next one: resample those to an hour
else: add one hour to the resampled index.
How could I accomplish that in Python, and does my idea even make sense?
Thanks in advance for any kind of help!

You can use pandas' resample method.
Ex:
import pandas as pd
index = pd.date_range('2019-08-27 02:15:00', periods=30, freq='15min')
series = pd.Series(range(30), index=index)
series.resample('H').mean()
2019-08-27 02:00:00 1.0
2019-08-27 03:00:00 4.5
2019-08-27 04:00:00 8.5
2019-08-27 05:00:00 12.5
2019-08-27 06:00:00 16.5
2019-08-27 07:00:00 20.5
2019-08-27 08:00:00 24.5
2019-08-27 09:00:00 28.0
Freq: H, dtype: float64

Pandas is not Python.
When you use plain Python, you have a simple and nice procedural language and you iterate over the values in containers. When you use Pandas, you should try hard to avoid any explicit loop at the Python level. The rationale is that Pandas (and NumPy for the underlying containers) uses C-optimized code, so you get a large speedup by using pandas and numpy tools (this is called vectorization).
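As a toy illustration of that point (a sketch, not part of the original answer, reusing the question's df and its Values column):
# Explicit Python-level loop: every element passes through the interpreter.
total = 0.0
for v in df['Values']:
    total += v

# Vectorized equivalent: the summation runs in optimized C code inside pandas/NumPy.
total = df['Values'].sum()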
Here, what you want already exists in Pandas and is called resample.
In your example, and provided the index is a true DatetimeIndex (*), you just do:
df2 = df.resample('1H').mean()
It gives:
Values
2019-08-27 02:00:00 91.443333
2019-08-27 03:00:00 91.427500
(*) If not, convert it first with: df.index = pd.to_datetime(df.index)
From your edit, I think that you want to get one value for each period. A possible way would be to take the most frequent value in each hourly window running from 15 minutes before the full hour to 45 minutes after it.
You could then use:
pd.DataFrame(df['Values']
             .resample('60T', base=45, loffset=pd.Timedelta(minutes=15))
             .agg(lambda x: x.value_counts().index[0]))
This gives:
Values
2019-08-27 02:00:00 91.45
2019-08-27 03:00:00 91.43
2019-08-27 04:00:00 91.42
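On recent pandas (>= 2.0), base and loffset no longer exist; a hedged sketch of the same idea uses the offset argument for the bin edges and shifts the labels afterwards (most_frequent is an illustrative name, not from the original answer):
# offset='45min' moves the bin edges to :45; the +15 minute label shift is
# applied to the index after resampling. Assumes every bin contains at least
# one value, as in the example data.
most_frequent = (
    df['Values']
    .resample('60min', offset='45min')
    .agg(lambda x: x.value_counts().index[0])
)
most_frequent.index = most_frequent.index + pd.Timedelta(minutes=15)
result = most_frequent.to_frame()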

Related

selecting rows in dataframe using datetime.datetime

Python is new for me.
I want to select a range of rows by using the datetime which is also the index.
I am not sure if having the datetime as the index is a problem or not.
my dataframe looks like this:
gradient
date
2022-04-15 10:00:00 0.013714
2022-04-15 10:20:00 0.140792
2022-04-15 10:40:00 0.148240
2022-04-15 11:00:00 0.016510
2022-04-15 11:20:00 0.018219
...
2022-05-02 15:40:00 0.191208
2022-05-02 16:00:00 0.016198
2022-05-02 16:20:00 0.043312
2022-05-02 16:40:00 0.500573
2022-05-02 17:00:00 0.955833
And I have made variables that contain the start and end dates of the rows I want to select. They look like this:
A_start_646 = datetime.datetime(2022,4,27, 11,0,0)
S_start_646 = datetime.datetime(2022,4,28, 3,0,0)
D_start_646 = datetime.datetime(2022,5,2, 15,25,0)
D_end_646 = datetime.datetime(2022,5, 2, 15,50,0)
So I would like to make a new dataframe. I saw some examples on the internet, but they use another way of expressing the date.
Does someone know a solution?
I feel kind of stupid and smart at the same time now, because I have already answered my own question; my apologies.
So this is the answer:
new_df = data_646_mean[A_start_646 : S_start_646]
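For reference, the same slice can be written with .loc, and string timestamps work as well when the index is a DatetimeIndex (a small sketch, not part of the original answer):
# Equivalent label-based slices on a DatetimeIndex (both endpoints are included)
new_df = data_646_mean.loc[A_start_646:S_start_646]
new_df = data_646_mean.loc['2022-04-27 11:00:00':'2022-04-28 03:00:00']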

Pandas: how to convert an incomplete date-only index to hourly index

I have an ideal hourly time series that looks like this:
from pandas import Series, date_range, Timestamp
from numpy import random
index = date_range(
    "2020-01-26 14:00:00",
    "2020-11-28 02:00:00",
    freq="H",
    tz="Europe/Madrid",
)
ideal = Series(index=index, data=random.rand(len(index)))
ideal
2020-01-26 14:00:00+01:00 0.186026
2020-01-26 15:00:00+01:00 0.142096
2020-01-26 16:00:00+01:00 0.432625
2020-01-26 17:00:00+01:00 0.373805
2020-01-26 18:00:00+01:00 0.377718
...
2020-11-27 22:00:00+01:00 0.961327
2020-11-27 23:00:00+01:00 0.440274
2020-11-28 00:00:00+01:00 0.996126
2020-11-28 01:00:00+01:00 0.607873
2020-11-28 02:00:00+01:00 0.122993
Freq: H, Length: 7357, dtype: float64
The actual, non-ideal time series is far from perfect:
It is not complete (i.e.: some hourly values are missing)
Only the date is stored, not the hour
Something like this:
actual = ideal.drop([
    Timestamp("2020-01-28 01:00:00", tz="Europe/Madrid"),
    Timestamp("2020-08-02 15:00:00", tz="Europe/Madrid"),
    Timestamp("2020-08-02 16:00:00", tz="Europe/Madrid"),
])
actual.index = actual.index.date
actual
2020-01-26 0.186026
2020-01-26 0.142096
2020-01-26 0.432625
2020-01-26 0.373805
2020-01-26 0.377718
...
2020-11-27 0.961327
2020-11-27 0.440274
2020-11-28 0.996126
2020-11-28 0.607873
2020-11-28 0.122993
Length: 7355, dtype: float64
Now I would like to convert the actual time series into something as close as possible to the ideal. That means:
The resulting series has a full hourly index (i.e.: no missing values)
NaNs are allowed if there is no way to fill an hour (i.e.: it is missing in the actual time series)
Misalignment within a day with respect to the ideal time series is expected only in those days with missing data
Is there an efficient way to do this? I would like to avoid having to iterate, since I am guessing that would be very inefficient.
By efficient I mean a fast (CPU wall time) implementation that relies only on Python/Pandas/NumPy (no Cython or Numba allowed).
You can use groupby().cumcount() to number the rows within each day, use that count as the hour of the day, and then reindex to a full hourly range:
s = pd.to_datetime(actual.index).tz_localize("Europe/Madrid").to_series()
actual.index = s + s.groupby(level=0).cumcount() * pd.Timedelta('1H')
new_idx = pd.date_range(actual.index.min(),actual.index.max(), freq='H')
actual = actual.reindex(new_idx)

how to get consistent behavior when aggregating multiple dtypes in pandas?

I am using pandas 0.20.2.
I am getting inconsistent results when aggregating a mixed dtype dataframe.
Here are some example data:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=pd.date_range('20100201', periods=10, freq='5h3min'),
                  columns=['Start'])
df.loc[:, 'End'] = df.loc[:, 'Start'] + pd.Timedelta(4, 'h')
df.loc[:, 'Value'] = 42.0
df.loc[:, 'Dur'] = df.loc[:, 'End'] - df.loc[:, 'Start']
I want to apply some functions to both Dur (np.timedelta64) and Value (float).
In particular, combining np.nansum and np.nanmedian I get the following:
df.resample('1D', on='Start')[['Dur', 'Value']].agg([np.nansum, np.nanmedian])
Out[16]:
Value
nansum nanmedian
Start
2010-02-01 210.0 42.0
2010-02-02 210.0 42.0
The column 'Dur' is silently ignored and dropped, whereas if I apply only np.nansum I obtain the expected result, including both columns:
df.resample('1D', on='Start')[['Dur', 'Value']].agg([np.nansum])
Out[17]:
Dur Value
nansum nansum
Start
2010-02-01 20:00:00 210.0
2010-02-02 20:00:00 210.0
How can I get the same result when applying nanmedian? Or how can I get all the expected columns in the multi-level dataframe returned by the combined call above?
User Yakym Pirozhenko is correct: the error is due to np.nanmedian applying np.isnan to the timedelta column.
To avoid this, you can define your own nanmedian that applies np.median to the non-null values only:
def mynanmedian(x):
    return np.median(x[pd.notnull(x)])

df.resample('1D', on='Start')[['Dur', 'Value']].agg([np.nansum, mynanmedian])
# out:
Dur Value
nansum mynanmedian nansum mynanmedian
Start
2010-02-01 20:00:00 04:00:00 210.0 42.0
2010-02-02 20:00:00 04:00:00 210.0 42.0
np.nanmedian calls np.isnan, which is not defined on datetime64/timedelta64 data (one should use np.isnat instead). So pandas defaults to ignoring the column, since the function cannot be applied to it.
If you want an explicit error you could use
df.groupby(...).agg({c: [np.nansum, np.nanmedian] for c in cols})
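Applied to the example above, a hedged sketch of that dict-based call (filling in the elided pieces purely for illustration, with the column names from the question) might look like this:
# Hypothetical completion of the dict-based aggregation. The intent is that
# explicitly requesting each column surfaces the TypeError from np.isnan on
# the timedelta column instead of the column being dropped silently; the
# exact behaviour varies across pandas versions.
cols = ['Dur', 'Value']
df.resample('1D', on='Start')[cols].agg({c: [np.nansum, np.nanmedian] for c in cols})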

STL decomposition Python - graph is plotted, values are N/A

I have a time series with 1hr time interval, which I'm trying to decompose - with seasonality of a week.
Time Total_request
2018-04-09 22:00:00 1019656
2018-04-09 23:00:00 961867
2018-04-10 00:00:00 881291
2018-04-10 01:00:00 892974
import pandas as pd
import statsmodels.api as sm

d.reset_index(inplace=True)
d['env_time'] = pd.to_datetime(d['env_time'])
d = d.set_index('env_time')
s = sm.tsa.seasonal_decompose(d.total_request, freq=24*7)
This gives me a resulting graphs of Seasonal, Trend, Residue - https://imgur.com/a/CjhWphO
But on trying to extract the residual values using s.resid I get this -
env_time
2018-04-09 20:00:00 NaN
2018-04-09 21:00:00 NaN
2018-04-09 22:00:00 NaN
I get values when I use a lower frequency. What's strange is that I can't extract the values even though they are being plotted. I have found similar questions, but none of the answers were relevant to this case.
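A likely explanation (hedged, not from the original thread): seasonal_decompose estimates the trend with a centered moving average, so roughly the first and last freq//2 observations of the trend and residual are NaN (84 hours for a weekly period of 168); the plot simply omits those edges. Newer statsmodels can extrapolate the trend to the edges, for example:
# Sketch assuming statsmodels >= 0.11, where freq= was renamed to period=.
# extrapolate_trend='freq' fills the NaN edges of the trend (and hence the residual).
s = sm.tsa.seasonal_decompose(d.total_request, period=24*7, extrapolate_trend='freq')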

How to sum field across two DataFrames when the indexes don't line up?

I am brand new to complex data analysis in general, and to pandas in particular. I have a feeling that pandas should be able to handle this task easily, but my newbieness prevents me from seeing the path to a solution. I want to sum one column across two files at a given time each day, 3pm in this case. If a file doesn't have a record at 3pm that day, I want to use the previous record.
Let me give a concrete example. I have data in two CSV files. Here are a couple small examples:
datetime value
2013-02-28 09:30:00 0.565019720442
2013-03-01 09:30:00 0.549536266504
2013-03-04 09:30:00 0.5023031467
2013-03-05 09:30:00 0.698370467751
2013-03-06 09:30:00 0.75834927162
2013-03-07 09:30:00 0.783620442226
2013-03-11 09:30:00 0.777265379462
2013-03-12 09:30:00 0.785787872851
2013-03-13 09:30:00 0.784873183044
2013-03-14 10:15:00 0.802959366653
2013-03-15 10:15:00 0.802959366653
2013-03-18 10:15:00 0.805413095911
2013-03-19 09:30:00 0.80816233134
2013-03-20 10:15:00 0.878912249996
2013-03-21 10:15:00 0.986393922571
and the other:
datetime value
2013-02-28 05:00:00 0.0373634672097
2013-03-01 05:00:00 -0.24700085273
2013-03-04 05:00:00 -0.452964976056
2013-03-05 05:00:00 -0.2479288295
2013-03-06 05:00:00 -0.0326855588777
2013-03-07 05:00:00 0.0780461766619
2013-03-08 05:00:00 0.306247682656
2013-03-11 06:00:00 0.0194146154407
2013-03-12 05:30:00 0.0103653153719
2013-03-13 05:30:00 0.0350377752558
2013-03-14 05:30:00 0.0110884755383
2013-03-15 05:30:00 -0.173216846788
2013-03-19 05:30:00 -0.211785013352
2013-03-20 05:30:00 -0.891054563968
2013-03-21 05:30:00 -1.27207563599
2013-03-22 05:30:00 -1.28648629004
2013-03-25 05:30:00 -1.5459897419
Note that a) neither file actually has a 3pm record, and b) the two files don't always have records for any given day. (2013-03-08 is missing from the first file, while 2013-03-18 is missing from the second, and the first file ends before the second.) As output, I envision a dataframe like this (perhaps just the date without the time):
datetime value
2013-Feb-28 15:00:00 0.6023831876517
2013-Mar-01 15:00:00 0.302535413774
2013-Mar-04 15:00:00 0.049338170644
2013-Mar-05 15:00:00 0.450441638251
2013-Mar-06 15:00:00 0.7256637127423
2013-Mar-07 15:00:00 0.8616666188879
2013-Mar-08 15:00:00 0.306247682656
2013-Mar-11 15:00:00 0.7966799949027
2013-Mar-12 15:00:00 0.7961531882229
2013-Mar-13 15:00:00 0.8199109582998
2013-Mar-14 15:00:00 0.8140478421913
2013-Mar-15 15:00:00 0.629742519865
2013-Mar-18 15:00:00 0.805413095911
2013-Mar-19 15:00:00 0.596377317988
2013-Mar-20 15:00:00 -0.012142313972
2013-Mar-21 15:00:00 -0.285681713419
2013-Mar-22 15:00:00 -1.28648629004
2013-Mar-25 15:00:00 -1.5459897419
I have a feeling this is perhaps a three-liner in pandas, but it's not at all clear to me how to do this. Further complicating my thinking about this problem, more complex CSV files might have multiple records for a single day (same date, different times). It seems that I need to somehow either generate a new pair of input dataframes with times at 15:00 and then sum across their values columns keying on just the date, or during the sum operation select the record with the greatest time on any given day with the time <= 15:00:00. Given that datetime.time objects can't be compared for magnitude, I suspect I might have to group rows together having the same date, then within each group, select only the row nearest to (but not greater than) 3pm. Kind of at that point my brain explodes.
I got nowhere looking at the documentation, as I don't really understand all the database-like operations pandas supports. Pointers to relevant documentation (especially tutorials) would be much appreciated.
First combine your DataFrames:
df3 = pd.concat([df1, df2])
so that everything is in one table. Next, use groupby to sum across timestamps:
df4 = df3.groupby('datetime').aggregate(sum)
Now df4 has a value column that is the sum of the values of all rows sharing the same datetime.
Assuming you have the timestamps as datetime objects, you can do whatever filtering you like at any stage:
filtered = df[df['datetime'] < datetime.datetime(year, month, day, hour, minute, second)]
I'm not sure exactly what you are trying to do, you may need to parse your timestamp columns before filtering.
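To get closer to the 3pm-as-of behaviour described in the question, a hedged sketch (assuming the two files are parsed into df1 and df2 with 'datetime' as a datetime64 column and 'value' as in the sample CSVs; daily_asof_3pm is an illustrative helper name) could take the last record at or before 15:00 from each file per day and then sum the two daily series:
import pandas as pd

def daily_asof_3pm(df):
    # keep only rows at or before 15:00, then take the last one for each day
    d = df.set_index('datetime').between_time('00:00', '15:00')
    return d['value'].resample('D').last()

s1 = daily_asof_3pm(df1)
s2 = daily_asof_3pm(df2)
# days missing from one file fall back to the other file's value alone;
# days missing from both (e.g. weekends) are dropped
total = s1.add(s2, fill_value=0).dropna()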
