How can I loop through pandas's loc[] such that, given a long Series, I can break it into multiple smaller ones? Something I imagine would look like
for i in range(1, 10):
    df.loc[f'2002-{i}-01':f'2002-{i+1}-01']
where i represents the month number.
Consider the dataframe df
df = pd.DataFrame(dict(A=range(100)), pd.date_range('2010-03-31', periods=100))
Observe that you are asking to slice from the beginning of one month to the beginning of the next. Typical Python slicing does not include the end point (though loc does). I'll assume you meant to exclude it, as that makes this answer convenient.
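For reference, a quick illustration of that endpoint behaviour, using the df defined above (a minimal sketch; the slice bounds are arbitrary):
df.loc['2010-04-01':'2010-05-01']  # label slicing includes *both* endpoints: 31 rows, not 30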
Use resample with a frequency 'M'
df.resample('M').sum()
A
2010-03-31 0
2010-04-30 465
2010-05-31 1426
2010-06-30 2295
2010-07-31 764
You can iterate through each month
for m, grp in df.groupby(pd.Grouper(freq='M')):  # pd.TimeGrouper('M') in older pandas
# do stuff
print(m)
2010-03-31 00:00:00
2010-04-30 00:00:00
2010-05-31 00:00:00
2010-06-30 00:00:00
2010-07-31 00:00:00
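If you literally want a collection of smaller pieces (as in the question), here is a minimal sketch that stores one slice per calendar month in a dict (the name monthly is just illustrative):
# One sub-DataFrame per calendar month, keyed by the month's period label.
monthly = {m.to_period('M'): grp for m, grp in df.groupby(pd.Grouper(freq='M'))}
monthly[pd.Period('2010-04')]  # the April slice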
I have an ideal hourly time series that looks like this:
from pandas import Series, date_range, Timestamp
from numpy import random
index = date_range(
"2020-01-26 14:00:00",
"2020-11-28 02:00:00",
freq="H",
tz="Europe/Madrid",
)
ideal = Series(index=index, data=random.rand(len(index)))
ideal
2020-01-26 14:00:00+01:00 0.186026
2020-01-26 15:00:00+01:00 0.142096
2020-01-26 16:00:00+01:00 0.432625
2020-01-26 17:00:00+01:00 0.373805
2020-01-26 18:00:00+01:00 0.377718
...
2020-11-27 22:00:00+01:00 0.961327
2020-11-27 23:00:00+01:00 0.440274
2020-11-28 00:00:00+01:00 0.996126
2020-11-28 01:00:00+01:00 0.607873
2020-11-28 02:00:00+01:00 0.122993
Freq: H, Length: 7357, dtype: float64
The actual, non-ideal time series is far from perfect:
It is not complete (i.e.: some hourly values are missing)
Only the date is stored, not the hour
Something like this:
actual = ideal.drop([
Timestamp("2020-01-28 01:00:00", tz="Europe/Madrid"),
Timestamp("2020-08-02 15:00:00", tz="Europe/Madrid"),
Timestamp("2020-08-02 16:00:00", tz="Europe/Madrid"),
])
actual.index = actual.index.date
actual
2020-01-26 0.186026
2020-01-26 0.142096
2020-01-26 0.432625
2020-01-26 0.373805
2020-01-26 0.377718
...
2020-11-27 0.961327
2020-11-27 0.440274
2020-11-28 0.996126
2020-11-28 0.607873
2020-11-28 0.122993
Length: 7355, dtype: float64
Now I would like to convert the actual time series into something as close as possible to the ideal. That means:
The resulting series has a full hourly index (i.e.: no missing values)
NaNs are allowed if there is no way to fill an hour (i.e.: it is missing in the actual time series)
Misalignment within a day with respect to the ideal time series is expected only in those days with missing data
Is there an efficient way to do this? I would like to avoid having to iterate since I am guessing that would be very inefficient.
By "efficient" I mean a fast (CPU wall time) implementation that relies only on Python/pandas/NumPy (no Cython or Numba allowed).
You can use groupby().cumcount() to represent the hour within a day, then reindex:
# Midnight timestamp for each row's date, kept as a Series so we can group on it.
s = pd.to_datetime(actual.index).tz_localize("Europe/Madrid").to_series()
# cumcount() numbers the rows within each day 0, 1, 2, ..., i.e. the hour offset.
actual.index = s + s.groupby(level=0).cumcount() * pd.Timedelta('1H')
# Reindex onto a complete hourly range; hours that cannot be filled stay NaN.
new_idx = pd.date_range(actual.index.min(), actual.index.max(), freq='H')
actual = actual.reindex(new_idx)
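If you want the restored series to span the full ideal range even when hours are missing at the very start or end, you can build the target index explicitly instead of deriving it from min/max; a minimal sketch reusing the start, end and timezone from the question:
full_idx = pd.date_range(
    "2020-01-26 14:00:00",
    "2020-11-28 02:00:00",
    freq="H",
    tz="Europe/Madrid",
)
actual = actual.reindex(full_idx)  # unrecoverable hours remain NaN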
Problem
I need to reduce the length of a DataFrame to some externally defined integer (could be two rows, 10,000 rows, etc., but will always be a reduction in overall length), but I also want to keep the resulting DataFrame representative of the original. The original DataFrame (we'll call it df) has a datetime column (utc_time) and a data value column (data_value). The datetimes are always sequential, non-repeating, though not evenly spaced (i.e., data might be "missing"). For the DataFrame in this example, the timestamps are at ten minute intervals (when data is present).
Attempts
To accomplish this, my mind immediately went to resampling with the following logic: find the difference in seconds between the first and last timestamps, divide that by the desired final length, and that's the resampling factor. I set this up here:
import numpy as np
import pandas as pd

# Define the desired final length.
final_length = 2
# Define the first timestamp.
first_timestamp = df['utc_time'].min().timestamp()
# Define the last timestamp.
last_timestamp = df['utc_time'].max().timestamp()
# Define the difference in seconds between the first and last timestamps.
delta_t = last_timestamp - first_timestamp
# Define the resampling factor.
resampling_factor = np.ceil(delta_t / final_length)
# Set the index from the `utc_time` column so that we can resample nicely.
df.set_index('utc_time', drop=True, inplace=True)
# Do the resampling.
resamp = df.resample(f'{resampling_factor}S')
To look at resamp, I simply looped and printed:
for i in resamp:
print(i)
This yielded (with some cleanup on my part) the following:
utc_time data_value
2016-09-28 21:10:00 140.0
2016-09-28 21:20:00 250.0
2016-09-28 21:30:00 250.0
2016-09-28 21:40:00 240.0
2016-09-28 21:50:00 240.0
... ...
2018-08-06 13:00:00 240.0
2018-08-06 13:10:00 240.0
2018-08-06 13:20:00 240.0
2018-08-06 13:30:00 240.0
2018-08-06 13:40:00 230.0
[69889 rows x 1 columns])
utc_time data_value
2018-08-06 13:50:00 230.0
2018-08-06 14:00:00 230.0
2018-08-06 14:10:00 230.0
2018-08-06 14:20:00 230.0
2018-08-06 14:30:00 230.0
... ...
2020-06-14 02:50:00 280.0
2020-06-14 03:00:00 280.0
2020-06-14 03:10:00 280.0
2020-06-14 03:20:00 280.0
2020-06-14 03:30:00 280.0
[97571 rows x 1 columns])
utc_time data_value
2020-06-14 03:40:00 280.0
2020-06-14 03:50:00 280.0
2020-06-14 04:00:00 280.0
2020-06-14 04:10:00 280.0
2020-06-14 04:20:00 280.0
... ...
2020-06-15 00:10:00 280.0
2020-06-15 00:20:00 270.0
2020-06-15 00:30:00 270.0
2020-06-15 00:40:00 270.0
2020-06-15 00:50:00 280.0
[128 rows x 1 columns])
As one can see, this produced three bins rather than the two I expected.
I could do something different, like changing the way I choose the resampling factor (e.g., finding the average time between timestamps and multiplying that by (length of DataFrame / final_length) should yield a more conservative resampling factor), but that would, to my mind, just mask the underlying issue. Mainly, I'd love to understand why this is happening. Which leads to...
Question
Does anyone know why this is happening, and what steps I might take to ensure we get the desired number of bins? I wonder if it's an offsetting issue - that is, although we see the first timestamp in the first bin as the first timestamp from the DataFrame, perhaps pandas is actually starting the bin before then?
For anyone who'd like to play along at home, the test DataFrame can be found here as a .csv. To get it in as a DataFrame:
df = pd.read_csv('test.csv', parse_dates=[0])
Summary
Problem 1 & fix: The way you form the bins creates one extra bin, since bins made with df.resample() are closed only at one end (left or right). Fix this with one of the options listed in "1.".
Problem 2 & fix: The left edge of the first bin is at the start of that day ('2016-09-28 00:00:00') (see "2."). You can fix it by passing kind='period' to resample() (see "3.").
1. Having a glance at the input data (& what kind of bins we need)
The input data is from 2016-09-28 21:10:00 to 2020-06-15 00:50:00, and using the resampling_factor you have, we get:
In [63]: df.index.min()
Out[63]: Timestamp('2016-09-28 21:10:00')
In [64]: df.index.min() + pd.Timedelta(f'{resampling_factor}S')
Out[64]: Timestamp('2018-08-07 11:00:00')
In [65]: _ + pd.Timedelta(f'{resampling_factor}S')
Out[65]: Timestamp('2020-06-15 00:50:00')
To partition data into two pieces with these timestamps, we would need bins to be
['2016-09-28 21:10:00', '2018-08-07 11:00:00')
['2018-08-07 11:00:00', '2020-06-15 00:50:00']
(The [ means closed end and ( means open end)
Here is one problem: you cannot form bins that are closed at both ends. You will have to decide whether to close the bins on the left or the right (the closed='left'|'right' argument). With closed='left' you would have
['2016-09-28 21:10:00', '2018-08-07 11:00:00')
['2018-08-07 11:00:00', '2020-06-15 00:50:00')
['2020-06-15 00:50:00', '2022-04-23 14:40:00') (only one entry here)
Possible fixes:
Adjust your last timestamp by adding some time into it:
last_timestamp = (df['utc_time'].max() +
pd.Timedelta('10 minutes')).timestamp()
Make the resampling_factor a bit larger than you first calculated.
Just use the first two dataframes from df.resample and disregard the third, which has only one or a few entries (see the sketch after this list)
Choose which makes most sense in your application.
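A minimal sketch of that third option (illustrative names; resamp and final_length come from the code in the question):
# Keep only the first final_length groups and drop the trailing, nearly empty bin.
groups = [grp for _, grp in resamp]
parts = groups[:final_length]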
2. Looking at what we have now
From the df.resample docs, we know that the labels returned are the left bin edges.
If we look at the data, we can see what the labels are now.
In [67]: resamp = df.resample(f'{resampling_factor}S')
In [68]: itr = iter(resamp)
In [69]: next(itr)
Out[69]:
(Timestamp('2016-09-28 00:00:00', freq='58542600S'),
data_value
utc_time
2016-09-28 21:10:00 140.0
... ...
2018-08-06 13:40:00 230.0
[69889 rows x 1 columns])
In [70]: next(itr)
Out[70]:
(Timestamp('2018-08-06 13:50:00', freq='58542600S'),
data_value
utc_time
2018-08-06 13:50:00 230.0
... ...
2020-06-14 03:30:00 280.0
[97571 rows x 1 columns])
In [71]: next(itr)
Out[71]:
(Timestamp('2020-06-14 03:40:00', freq='58542600S'),
data_value
utc_time
2020-06-14 03:40:00 280.0
... ...
2020-06-15 00:50:00 280.0
[128 rows x 1 columns])
The bins are therefore
['2016-09-28 00:00:00', '2018-08-06 13:50:00')
['2018-08-06 13:50:00', '2020-06-14 03:40:00')
['2020-06-14 03:40:00', '2022-04-22 17:30:00')
(Endpoint calculated by adding resampling_factor to the beginning of the bin.)
We see that the first bin does not start at df['utc_time'].min() (2016-09-28 21:10:00); it starts at the beginning of that day (as you guessed).
Since the first bin starts earlier than intended, some data falls outside the first two bins and ends up in a third bin.
3. Fixing the starting bin left edge
The kind argument can be either 'timestamp' or 'period'. If you change it to 'period', you will have the following bins (with closed='left'):
['2016-09-28 21:10:00', '2018-08-07 11:00:00') <-- fixed
['2018-08-07 11:00:00', '2020-06-15 00:50:00')
['2020-06-15 00:50:00', '2022-04-23 14:40:00') (Remove with options given in "1.")
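Putting the two fixes together, a hedged end-to-end sketch (padding the last timestamp as in option 1 of "1." and passing kind='period'; it assumes df still has utc_time as its index, as set in the question). On the test data this should yield exactly two bins:
# Pad the end so the last row falls strictly inside the second bin.
first_timestamp = df.index.min().timestamp()
last_timestamp = (df.index.max() + pd.Timedelta('10 minutes')).timestamp()
resampling_factor = int(np.ceil((last_timestamp - first_timestamp) / final_length))

resamp = df.resample(f'{resampling_factor}S', kind='period')
print(sum(1 for _ in resamp))  # expected: 2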
My problem might sound trivial, but I haven't found any solution for it:
When I resample a DataFrame with a DatetimeIndex, e.g. into three-monthly values, I want the resampled data to remain within the same date range as the original data.
Minimal example:
import numpy as np
import pandas as pd
# data from 2014 to 2016
dim = 8760 * 3 + 24
idx = pd.date_range('1/1/2014 00:00:00', freq='h', periods=dim)
df = pd.DataFrame(np.random.randn(dim, 2), index=idx)
# resample to three-month bins
df = df.resample('3M').sum()
print(df)
yielding
0 1
2014-01-31 24.546928 -16.082389
2014-04-30 -52.966507 -40.255773
2014-07-31 -32.580114 47.096810
2014-10-31 -9.501333 12.872683
2015-01-31 -106.504047 45.082733
2015-04-30 -34.230358 70.508420
2015-07-31 -35.916497 104.930101
2015-10-31 -16.780425 17.411410
2016-01-31 68.512994 -43.772082
2016-04-30 -0.349917 27.794895
2016-07-31 -30.408862 -18.182486
2016-10-31 -97.355730 -105.961101
2017-01-31 -7.221361 40.037358
Why does the resampling exceed the date range, e.g. create an entry for 2017-01-31, and how can I prevent this and instead stay within the original range, e.g. between 2014-01-01 and 2016-12-31? And shouldn't binning into January-March, April-June, ..., October-December be the expected standard behaviour?
Thanks in advance!
There are 36 months in your DataFrame.
When you resample every 3 months, the first row will contain everything up to the end of your first month, the second row will contain the following 3 months, and so on. Your last row will contain everything after 2016-10-31 up to 3 months later, which is 2017-01-31.
If you want, you could change it to
df.resample('3M', closed='left', label='left').sum()
, giving you
2013-10-31 3.705955 25.394287
2014-01-31 38.778872 -12.655323
2014-04-30 10.382832 -64.649173
2014-07-31 66.939190 31.966008
2014-10-31 -39.453572 27.431183
2015-01-31 66.436348 29.585436
2015-04-30 78.731608 -25.150526
2015-07-31 14.493226 -5.842421
2015-10-31 -2.394419 58.017105
2016-01-31 -36.295499 -14.542251
2016-04-30 69.794101 62.572736
2016-07-31 76.600558 -17.706111
2016-10-31 -68.842328 -32.723581
, but then the first row would be 'outside your range'.
If you resample every 3 months, then either your first row is going to be outside your range, or your last one is.
EDIT
If you want the bins to be 'first three months', 'next three months', and so on, you could write
df.resample('3MS').sum()
, as this will take the beginning of each month rather than its end (see https://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-offset-aliases)
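With the month-start anchor the labels stay inside the original range. A hedged check, applied to the original hourly df (before it is overwritten by the '3M' resample); the values are omitted because the data is random:
df.resample('3MS').sum().index
# DatetimeIndex(['2014-01-01', '2014-04-01', '2014-07-01', '2014-10-01',
#                '2015-01-01', '2015-04-01', '2015-07-01', '2015-10-01',
#                '2016-01-01', '2016-04-01', '2016-07-01', '2016-10-01'],
#               dtype='datetime64[ns]', freq='3MS')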
Is it somehow possible to use resample on irregularly spaced data? (I know that the documentation says it's for "resampling of regular time-series data", but I wanted to try if it works on irregular data, too. Maybe it doesn't, or maybe I am doing something wrong.)
In my real data, I have generally 2 samples per hour, the time difference between them ranging usually from 20 to 40 minutes. So I was hoping to resample them to a regular hourly series.
To test whether I am using it right, I used some random list of dates that I already had, so it may not be the best example, but at least a solution that works for it will be very robust. Here it is:
fraction number time
0 0.729797 0 2014-10-23 15:44:00
1 0.141084 1 2014-10-30 19:10:00
2 0.226900 2 2014-11-05 21:30:00
3 0.960937 3 2014-11-07 05:50:00
4 0.452835 4 2014-11-12 12:20:00
5 0.578495 5 2014-11-13 13:57:00
6 0.352142 6 2014-11-15 05:00:00
7 0.104814 7 2014-11-18 07:50:00
8 0.345633 8 2014-11-19 13:37:00
9 0.498004 9 2014-11-19 22:47:00
10 0.131665 10 2014-11-24 15:28:00
11 0.654018 11 2014-11-26 10:00:00
12 0.886092 12 2014-12-04 06:37:00
13 0.839767 13 2014-12-09 00:50:00
14 0.257997 14 2014-12-09 02:00:00
15 0.526350 15 2014-12-09 02:33:00
Now I want to resample these for example monthly:
df_new = df.set_index(pd.DatetimeIndex(df['time']))
df_new['fraction'] = df.fraction.resample('M',how='mean')
df_new['number'] = df.number.resample('M',how='mean')
But I get TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex' - unless I did something wrong with assigning the datetime index, it must be due to the irregularity?
So my questions are:
Am I using it correctly?
If 1==True, is there no straightforward way to resample the data?
(The only solution I see is to first reindex the data to finer intervals, interpolate the values in between, and then reindex to an hourly interval. If so, a question regarding the correct implementation of reindex will follow shortly.)
You don't need to explicitly use DatetimeIndex, just set 'time' as the index and pandas will take care of the rest, so long as your 'time' column has been converted to datetime using pd.to_datetime or some other method. Additionally, you don't need to resample each column individually if you're using the same method; just do it on the entire DataFrame.
# Convert to datetime, if necessary.
df['time'] = pd.to_datetime(df['time'])
# Set the index and resample (using month start freq for compact output).
df = df.set_index('time')
df = df.resample('MS').mean()
The resulting output:
fraction number
time
2014-10-01 0.435441 0.5
2014-11-01 0.430544 6.5
2014-12-01 0.627552 13.5
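For the real data described at the top of the question (roughly two samples per hour), the same pattern should give a regular hourly series. A hedged sketch, applied to the DataFrame right after set_index('time') (i.e. before the monthly resample above); the interpolation step is optional:
hourly = df.resample('H').mean()            # hours with no observation become NaN
hourly = hourly.interpolate(method='time')  # optionally fill those gaps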
Let's say I have financial data in a pandas.Series, called fin_series.
Here's a peek at fin_series.
In [565]: fin_series
Out[565]:
Date
2008-05-16 1000.000000
2008-05-19 1001.651747
2008-05-20 1004.137434
...
2014-12-22 1158.085200
2014-12-23 1150.139126
2014-12-24 1148.934665
Name: Close, Length: 1665
I'm interested in looking at the quarterly endpoints of the data. However, not all financial trading days fall exactly on the 'end of the quarter.'
For example:
In [566]: fin_series.asfreq('q')
Out[566]:
2008-06-30 976.169624
2008-09-30 819.518923
2008-12-31 760.429261
...
2009-06-30 795.768956
2009-09-30 870.467121
2009-12-31 886.329978
...
2011-09-30 963.304679
2011-12-31 NaN
2012-03-31 NaN
....
2012-09-30 NaN
2012-12-31 1095.757137
2013-03-31 NaN
2013-06-30 NaN
...
2014-03-31 1138.548881
2014-06-30 1168.248194
2014-09-30 1147.000073
Freq: Q-DEC, Name: Close, dtype: float64
Here's a little function that accomplishes what I'd like, along with the desired end result.
import numpy

def bmg_qt_asfreq(series):
    # True wherever the quarter changes between consecutive dates, plus the final row.
    ind = series[1:].index.quarter != series[:-1].index.quarter
    ind = numpy.append(ind, True)
    return series[ind]
which gives me:
In [15]: bmg_qt_asfreq(tmp)
Out[15]:
Date
2008-06-30 976.169425
2008-09-30 819.517607
2008-12-31 760.428770
...
2011-09-30 963.252831
2011-12-30 999.742132
2012-03-30 1049.848583
...
2012-09-28 1086.689824
2012-12-31 1093.943357
2013-03-28 1117.111859
Name: Close, dtype: float64
Note that I'm preserving the dates of the "closest prior price," instead of simply using pandas.asfreq(freq='q', method='ffill'), as preserving the dates that exist within the original Series.index is crucial.
This seems like a silly problem that many people have had and must be addressed by all of the pandas time manipulation functionality, but I can't figure out how to do it with resample or asfreq.
Anyone who could show me the builtin pandas functionality to accomplish this would be greatly appreciated.
Assuming the input is a pandas Series, first do
import pandas as pd

fin_series.resample("Q").apply(pd.Series.last_valid_index)
to get a Series with the last non-NA index (i.e. the last trading date) for each quarter. Then
fin_series.resample("Q").last()
for the last non-NA value. You can then join these together (a sketch follows below). As you suggested in your comment:
fin_series.loc[fin_series.resample("Q").apply(pd.Series.last_valid_index)]
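A hedged sketch of that join, pairing each quarter's last in-sample date with its closing value (the quarter_ends name and column labels are illustrative):
import pandas as pd

q = fin_series.resample("Q")
quarter_ends = pd.DataFrame({
    "date": q.apply(pd.Series.last_valid_index),  # last trading day seen in the quarter
    "close": q.last(),                            # closing value on that day
})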
An alternative one-liner: upsample to daily, interpolate, then take the quarter ends:
fin_series.asfreq('d').interpolate().asfreq('q')