I have an ideal hourly time series that looks like this:
from pandas import Series, date_range, Timestamp
from numpy import random
index = date_range(
    "2020-01-26 14:00:00",
    "2020-11-28 02:00:00",
    freq="H",
    tz="Europe/Madrid",
)
ideal = Series(index=index, data=random.rand(len(index)))
ideal
2020-01-26 14:00:00+01:00 0.186026
2020-01-26 15:00:00+01:00 0.142096
2020-01-26 16:00:00+01:00 0.432625
2020-01-26 17:00:00+01:00 0.373805
2020-01-26 18:00:00+01:00 0.377718
...
2020-11-27 22:00:00+01:00 0.961327
2020-11-27 23:00:00+01:00 0.440274
2020-11-28 00:00:00+01:00 0.996126
2020-11-28 01:00:00+01:00 0.607873
2020-11-28 02:00:00+01:00 0.122993
Freq: H, Length: 7357, dtype: float64
The actual, non-ideal time series is far from perfect:
It is not complete (i.e.: some hourly values are missing)
Only the date is stored, not the hour
Something like this:
actual = ideal.drop([
    Timestamp("2020-01-28 01:00:00", tz="Europe/Madrid"),
    Timestamp("2020-08-02 15:00:00", tz="Europe/Madrid"),
    Timestamp("2020-08-02 16:00:00", tz="Europe/Madrid"),
])
actual.index = actual.index.date
actual
2020-01-26 0.186026
2020-01-26 0.142096
2020-01-26 0.432625
2020-01-26 0.373805
2020-01-26 0.377718
...
2020-11-27 0.961327
2020-11-27 0.440274
2020-11-28 0.996126
2020-11-28 0.607873
2020-11-28 0.122993
Length: 7355, dtype: float64
Now I would like to convert the actual time series into something as close as possible to the ideal. That means:
The resulting series has a full hourly index (i.e.: no missing values)
NaNs are allowed if there is no way to fill an hour (i.e.: it is missing in the actual time series)
Misalignment within a day with respect to the ideal time series is expected only in those days with missing data
Is there an efficient way to do this? I would like to avoid iterating, since I am guessing that would be very inefficient.
By efficient I mean a fast (CPU wall time) implementation that relies only on Python/pandas/NumPy (no Cython or Numba allowed).
You can use groupby().cumcount() to reconstruct the hour within each day, then reindex:
import pandas as pd

# Rebuild hourly timestamps: the cumulative count within each date gives the hour offset
s = pd.to_datetime(actual.index).tz_localize("Europe/Madrid").to_series()
actual.index = s + s.groupby(level=0).cumcount() * pd.Timedelta("1H")

# Reindex against a complete hourly range; hours that are still missing become NaN
new_idx = pd.date_range(actual.index.min(), actual.index.max(), freq="H")
actual = actual.reindex(new_idx)
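As a quick check (not part of the original answer), the hours that could not be reconstructed show up as NaN; note that within days that had missing rows the values end up shifted earlier, which the question explicitly allows:

# Rows that stayed empty after the reindex
print(actual[actual.isna()])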
I'd like to get a time series with a fixed set of dates in the index. I thought that resample with freq and origin="epoch" would do the trick, but it seems that I'm using this method the wrong way. Here's an example showing that origin="epoch" does not seem to work.
import pandas as pd
dates = pd.date_range('2022-01-01', '2022-02-01', freq="1D")
freq = '2W-MON'
vals = range(len(dates))
print(
    pd.Series(vals, index=dates)
    .resample(freq,
              origin="epoch",
              convention='end')
    .sum()
    .to_markdown()
)
|                     |   0 |
|:--------------------|----:|
| 2022-01-03 00:00:00 |   3 |
| 2022-01-17 00:00:00 | 133 |
| 2022-01-31 00:00:00 | 329 |
| 2022-02-14 00:00:00 |  31 |
If I change the first date in the series to anything after the "2022-01-03", I get a different result.
dates = pd.date_range('2022-01-04', '2022-02-01', freq="1D")
freq = '2W-MON'
vals = range(len(dates))
print(
    pd.Series(vals, index=dates)
    .resample(freq,
              origin="epoch",
              convention='end')
    .sum()
    .to_markdown()
)
|                     |   0 |
|:--------------------|----:|
| 2022-01-10 00:00:00 |  21 |
| 2022-01-24 00:00:00 | 189 |
| 2022-02-07 00:00:00 | 196 |
I'd expect that with freq='2W-MON' and origin="epoch", both examples would end up with the same dates (so both should have either 2022-01-10 or 2022-01-03).
Is there an elegant way of forcing pandas to actually use origin="epoch"?
I have one year's worth of data at a four-minute time series interval. I need to load 24 hours of data at a time and run a function on that window, stepping forward in eight-hour intervals. I need to repeat this process for all the data between 2021's start and end dates.
For example:
Load year_df containing ranges between 2021-01-01 00:00:00 and 2021-01-01 23:56:00 and run a function on this.
Load year_df containing ranges between 2021-01-01 08:00:00 and 2021-01-02 07:56:00 and run a function on this.
Load year_df containing ranges between 2021-01-01 16:00:00 and 2021-01-02 15:56:00 and run a function on this.
# Proxy DataFrame
import pandas as pd

start = pd.to_datetime('2021-01-01 00:00:00')
end = pd.to_datetime('2021-12-31 23:56:00')
myIndex = pd.date_range(start, end, freq='4T')
year_df = pd.DataFrame({'Timestamp': myIndex})
year_df.head()
Timestamp
0 2021-01-01 00:00:00
1 2021-01-01 00:04:00
2 2021-01-01 00:08:00
3 2021-01-01 00:12:00
4 2021-01-01 00:16:00
This approach avoids explicit for loops, but the apply method is essentially a for loop under the hood, so it's not that efficient. Until more functionality based on rolling datetime windows is introduced to pandas, though, this might be the only option.
The example uses the mean of the timestamps. Knowing exactly what function you want to apply may help with a better answer.
s = pd.Series(myIndex, index=myIndex)

def myfunc(e):
    # Select the 24 hours starting at e and aggregate (here: the mean of the timestamps)
    temp = s[s.between(e, e + pd.Timedelta("24h"))]
    return temp.mean()

s.apply(myfunc)
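If you only need a window every eight hours rather than at every timestamp, you can restrict the starting points before applying the function. A minimal sketch, assuming the index is complete at a regular 4-minute frequency (so 120 rows span 8 hours); window_starts is a name introduced here:

# Take every 120th timestamp as a window start (8 h / 4 min = 120 rows),
# then reuse myfunc from above on just those starts
window_starts = pd.Series(s.index[::120], index=s.index[::120])
results = window_starts.apply(myfunc)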
I have a large time-series dataframe at one-second intervals, and I need to perform some analysis on every grouping of 0-59 seconds.
The result is then used as a feature in a different time-series dataframe, keyed by the minute floor of the sub-section.
I feel like I'm missing something basic, but unsure of the wording to get it right.
ex:
timestamp,close
2021-06-01 00:00:00,37282.0
2021-06-01 00:00:01,37282.0
2021-06-01 00:00:02,37285.0
2021-06-01 00:00:03,37283.0
2021-06-01 00:00:04,37281.0
2021-06-01 00:00:05,37278.0
2021-06-01 00:00:06,37275.0
2021-06-01 00:00:07,37263.0
2021-06-01 00:00:08,37264.0
2021-06-01 00:00:09,37259.0
...
2021-06-01 00:00:59,37260.0
2021-06-01 00:01:00,37261.0 --> new analysis starts here
2021-06-01 00:01:01,37262.0
# and repeat
My current implementation works, but I have a feeling it's a really bad way of doing it.
df['last_update_hxro'] = df.apply(lambda x: 1 if x.timestamp.second == 59 else 0, axis=1)
df['hxro_close'] = df[df['last_update_hxro'] == 1].close
df['next_hxro_close'] = df['hxro_close'].shift(-60)
df['hxro_result'] = df[df['last_update_hxro'] == 1].apply(lambda x: 1 if x.next_hxro_close > x.hxro_close else 0, axis=1)
df['trade_number'] = df.last_update_hxro.cumsum() - df.last_update_hxro
unique_trades = df.trade_number.unique()

for x in unique_trades:
    temp_df = btc_df[btc_df['trade_number'] == x]
    new_df = generate_sub_min_features(temp_df)
    feature_df = feature_df.append(new_df)

def generate_sub_min_features(full_df):
    # do stuff here and return a series of length 1, with the minute floor of the subsection as the key
First set timestamp as the index:
df['timestamp'] = pd.to_datetime(df['timestamp'], format="%Y-%m-%d %H:%M:%S")
df = df.set_index('timestamp')
Then group by minute and save each sub-dataframe in a list:
groups = [g for n, g in df.groupby(pd.Grouper(freq='T'))]
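From there, a possible way to build the feature frame without the manual append loop (a sketch; it assumes generate_sub_min_features returns a one-row Series or DataFrame as described in the question):

# Apply the per-minute feature function to every minute group and stack the results
feature_df = pd.concat(generate_sub_min_features(g) for g in groups)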
IMO, any pandas solution which includes a loop could be done better.
I feel I may be a little lost on your ask without a basic example, but it sounds like you're looking for a simple resample or groupby?
Example
Here's a sample using a df loaded with the distinct seconds from 1/1/21 - 1/2/21, paired with random integers from 0 - 10. We'll take the average of every minute.
import pandas as pd
import numpy as np
df_time = pd.DataFrame(
    {'val': np.random.randint(0, 10, size=(60 * 60 * 24) + 1)},
    index=pd.date_range('1/1/2021', '1/2/2021', freq='S')
)
df_mean = df_time.resample('T').apply(lambda x: x.mean())
print(df_mean)
Returns...
val
2021-01-01 00:00:00 4.566667
2021-01-01 00:01:00 5.000000
2021-01-01 00:02:00 4.316667
2021-01-01 00:03:00 4.800000
2021-01-01 00:04:00 4.533333
... ...
2021-01-01 23:56:00 4.916667
2021-01-01 23:57:00 4.450000
2021-01-01 23:58:00 4.883333
2021-01-01 23:59:00 4.316667
2021-01-02 00:00:00 2.000000
Notes
Note the use of T here as the offset alias for the minute portion of the datetimes in the index. Read more about Offset Aliases. Also note the use of the resample() method, since our time series also acts as the index. groupby() would also have been valid here, with a slightly different approach, in case our datetime information was not the index.
Customization
In practice, you would replace the body of the lambda with whatever function you'd like to apply to each distinct group of rows that share a truncated datetime-minute. A groupby-based variant is sketched below.
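For completeness, here is a sketch of the groupby variant mentioned above, for the case where the timestamps live in an ordinary column instead of the index (the column name ts is made up for the example):

# Same minute-level average, but via groupby on a floored timestamp column
df_col = df_time.reset_index().rename(columns={'index': 'ts'})
df_mean_gb = df_col.groupby(df_col['ts'].dt.floor('T'))['val'].mean()
print(df_mean_gb.head())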
I am quite new to the game and can't seem to find an answer to my problem online.
I have a somewhat irregular time series in Python (mostly I use pandas to work with it), which has a datetime index (roughly every 15 minutes) and multiple columns with values. I know that those values change approximately every hour, but they don't quite match up with the index I have. It looks something like this:
Values
2019-08-27 02:15:00 91.45
2019-08-27 02:30:00 91.44
2019-08-27 02:45:00 91.44
2019-08-27 03:00:00 91.43
2019-08-27 03:15:00 91.43
2019-08-27 03:30:00 91.43
2019-08-27 03:45:00 91.42
This is just an example, but one can see that the values change at random times (:15, :45, :00), and even though they should change every hour, sometimes there are only two 15-minute intervals with values, so I can't just take a group of 4 values and resample them to one hour.
So my idea was to use if/else logic to create something like this:
if a value is the same as the next one: resample those to an hour
else: add one hour to the resampled index.
How could I accomplish that in Python, and does my idea even make sense?
Thanks in advance for any kind of help!
You can use pandas.resample.
Ex:
import pandas as pd
index = pd.date_range('2019-08-27 02:15:00', periods=30, freq='15min')
series = pd.Series(range(30), index=index)
series.resample('H').mean()
2019-08-27 02:00:00 1.0
2019-08-27 03:00:00 4.5
2019-08-27 04:00:00 8.5
2019-08-27 05:00:00 12.5
2019-08-27 06:00:00 16.5
2019-08-27 07:00:00 20.5
2019-08-27 08:00:00 24.5
2019-08-27 09:00:00 28.0
Freq: H, dtype: float64
Pandas is not Python.
When you use plain Python, you have a simple and pleasant procedural language, and you iterate over the values in containers. When you use pandas, you should try hard to avoid any explicit loop at the Python level. The rationale is that pandas (and NumPy for the underlying containers) uses C-optimized code, so you get a large gain by using pandas and NumPy tools (this is called vectorization).
Here what you want already exists in Pandas and is called resample.
In your example, and provided the index is a true DatetimeIndex (*), you just do:
df2 = df.resample('1H').mean()
It gives:
Values
2019-08-27 02:00:00 91.443333
2019-08-27 03:00:00 91.427500
(*) If not, convert it first with: df.index = pd.to_datetime(df.index)
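A self-contained sketch reproducing the example above (the values are the ones shown in the question):

import pandas as pd

df = pd.DataFrame(
    {"Values": [91.45, 91.44, 91.44, 91.43, 91.43, 91.43, 91.42]},
    index=pd.to_datetime([
        "2019-08-27 02:15:00", "2019-08-27 02:30:00", "2019-08-27 02:45:00",
        "2019-08-27 03:00:00", "2019-08-27 03:15:00", "2019-08-27 03:30:00",
        "2019-08-27 03:45:00",
    ]),
)
df2 = df.resample("1H").mean()
# 02:00 -> mean of 91.45, 91.44, 91.44 = 91.443333
# 03:00 -> mean of 91.43, 91.43, 91.43, 91.42 = 91.4275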
From your edit, I think that you want to get one value from each period. A possible way would be to take the most frequent value in the interval from H-15min to H+30min.
You could then use:
pd.DataFrame(
    df.resample('60T', base=45, loffset=pd.Timedelta(minutes=15))['Values'].agg(
        lambda x: x.value_counts().index[0]
    )
)
This gives:
Values
2019-08-27 02:00:00 91.45
2019-08-27 03:00:00 91.43
2019-08-27 04:00:00 91.42
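Note that base and loffset have been deprecated and later removed in recent pandas versions; an equivalent sketch (assuming pandas >= 1.1) uses offset= and shifts the labels afterwards:

# Bins still run from H-15min to H+45min (offset anchors them at :45);
# shifting the labels by 15 minutes reproduces the old loffset behaviour
most_frequent = df['Values'].resample('60T', offset='45min').agg(
    lambda x: x.value_counts().index[0]
)
most_frequent.index = most_frequent.index + pd.Timedelta(minutes=15)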
I have some time series data (pandas.DataFrame) and I resample it in '600S' bars:
import numpy as np
data.resample('600S', level='time').aggregate({'abc':np.sum})
I get something like this:
abc
time
09:30:01.446000 19836
09:40:01.446000 8577
09:50:01.446000 29746
10:00:01.446000 29340
10:10:01.446000 5197
...
How can I force the time bars to start at 09:30:00.000000 instead of at the time of the 1st row in the data? I.e. output should be something like this:
abc
time
09:30:00.000000 *****
09:40:00.000000 ****
09:50:00.000000 *****
10:00:00.000000 *****
10:10:00.000000 ****
...
Thank you for your help!
You can add Series.dt.floor to your code:
df.time = df.time.dt.floor('10 min')
time abc
0 2018-12-05 09:30:00 19836
1 2018-12-05 09:40:00 8577
2 2018-12-05 09:50:00 29746
3 2018-12-05 10:00:00 29340
4 2018-12-05 10:10:00 5197
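Once the timestamps are floored, you can aggregate on them directly. A small sketch, assuming the same time and abc columns as in the question:

# Sum per 10-minute bucket; the buckets now start at round times like 09:30:00
out = df.groupby('time')['abc'].sum()
print(out)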
.resample is a bit of a wildcard. It behaves rather differently with datetime64[ns] and timedelta64[ns] so I personally find it more reliable to work with groupby, when just doing things like .sum or .first.
Sample Data
import pandas as pd
import numpy as np
n = 1000
np.random.seed(123)
df = pd.DataFrame({'time': pd.date_range('2018-01-01 01:13:43', '2018-01-01 23:59:59', periods=n),
                   'abc': np.random.randint(1, 1000, n)})
When the dtype is datetime64[ns] it will resample to "round" bins:
df.dtypes
#time datetime64[ns]
#abc int32
#dtype: object
df.set_index('time').resample('600S').sum()
abc
time
2018-01-01 01:10:00 2572
2018-01-01 01:20:00 2257
2018-01-01 01:30:00 2470
2018-01-01 01:40:00 3131
2018-01-01 01:50:00 3402
With timedelta64[ns] it instead begins the bins based on your first observation:
df['time'] = pd.to_timedelta(df.time.dt.time.astype('str'))
df.dtypes
#time timedelta64[ns]
#abc int32
#dtype: object
df.set_index('time').resample('600S').sum()
abc
time
01:13:43 3432
01:23:43 2447
01:33:43 2588
01:43:43 3202
01:53:43 2547
So in the case of a timedelta64[ns] column, I'd advise you to go with groupby, creating bins with .dt.floor so that your 10-minute bins go from [XX:00:00 - XX:10:00):
df.groupby(df.time.dt.floor('600S')).sum()
# abc
#time
#01:10:00 2572
#01:20:00 2257
#01:30:00 2470
#01:40:00 3131
#01:50:00 3402
This is the same result we got in the first case with the datetime64[ns] dtype, which binned to the "round" bins.
If your use case is robust to it and you want to extend the time range before the actual starting time, a solution is to add an empty row at the starting time you want.
E.g. truncating the first time (df.index[0] if the index is sorted, else df.index.min()) to its hour (.floor("h")):
df.loc[df.index.min().floor("h")] = None
df.sort_index(inplace=True) # cleaner, but not even needed
Then resample() will use this time as the starting point (9:00 in OP’s case).
This can also be applied to extend the time range after the actual end of the dataset.
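Putting it together as a minimal sketch, assuming data is indexed by datetime and has an abc column as in the question:

# Insert an empty row at the floored start time, then resample from there
data.loc[data.index.min().floor("h")] = None
data = data.sort_index()
out = data.resample("600S").sum()   # bins now begin on the hour (09:00, 09:10, ..., 09:30, ...)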