Unexpected number of bins in Pandas DataFrame resample - python

Problem
I need to reduce the length of a DataFrame to some externally defined integer (could be two rows, 10,000 rows, etc., but will always be a reduction in overall length), but I also want to keep the resulting DataFrame representative of the original. The original DataFrame (we'll call it df) has a datetime column (utc_time) and a data value column (data_value). The datetimes are always sequential, non-repeating, though not evenly spaced (i.e., data might be "missing"). For the DataFrame in this example, the timestamps are at ten minute intervals (when data is present).
Attempts
To accomplish this, my mind immediately went to resampling with the following logic: find the difference in seconds between the first and last timestamps, divide that by the desired final length, and that's the resampling factor. I set this up here:
import numpy as np
import pandas as pd

# Define the desired final length.
final_length = 2
# Define the first timestamp.
first_timestamp = df['utc_time'].min().timestamp()
# Define the last timestamp.
last_timestamp = df['utc_time'].max().timestamp()
# Define the difference in seconds between the first and last timestamps.
delta_t = last_timestamp - first_timestamp
# Define the resampling factor.
resampling_factor = np.ceil(delta_t / final_length)
# Set the index from the `utc_time` column so that we can resample nicely.
df.set_index('utc_time', drop=True, inplace=True)
# Do the resampling.
resamp = df.resample(f'{resampling_factor}S')
To look at resamp, I simply looped and printed:
for i in resamp:
    print(i)
This yielded (with some cleanup on my part) the following:
utc_time data_value
2016-09-28 21:10:00 140.0
2016-09-28 21:20:00 250.0
2016-09-28 21:30:00 250.0
2016-09-28 21:40:00 240.0
2016-09-28 21:50:00 240.0
... ...
2018-08-06 13:00:00 240.0
2018-08-06 13:10:00 240.0
2018-08-06 13:20:00 240.0
2018-08-06 13:30:00 240.0
2018-08-06 13:40:00 230.0
[69889 rows x 1 columns])
utc_time data_value
2018-08-06 13:50:00 230.0
2018-08-06 14:00:00 230.0
2018-08-06 14:10:00 230.0
2018-08-06 14:20:00 230.0
2018-08-06 14:30:00 230.0
... ...
2020-06-14 02:50:00 280.0
2020-06-14 03:00:00 280.0
2020-06-14 03:10:00 280.0
2020-06-14 03:20:00 280.0
2020-06-14 03:30:00 280.0
[97571 rows x 1 columns])
utc_time data_value
2020-06-14 03:40:00 280.0
2020-06-14 03:50:00 280.0
2020-06-14 04:00:00 280.0
2020-06-14 04:10:00 280.0
2020-06-14 04:20:00 280.0
... ...
2020-06-15 00:10:00 280.0
2020-06-15 00:20:00 270.0
2020-06-15 00:30:00 270.0
2020-06-15 00:40:00 270.0
2020-06-15 00:50:00 280.0
[128 rows x 1 columns])
As one can see, this produced three bins rather than the two I expected.
I could do something different, like changing the way I choose the resampling factor (e.g., finding the average time between timestamps and multiplying that by (length of DataFrame / final_length) should yield a more conservative resampling factor), but to my mind that would just mask the underlying issue. Mostly, I'd love to understand why this is happening. Which leads to...
Question
Does anyone know why this is happening, and what steps I might take to ensure we get the desired number of bins? I wonder if it's an offsetting issue - that is, although we see the first timestamp in the first bin as the first timestamp from the DataFrame, perhaps pandas is actually starting the bin before then?
For anyone who'd like to play along at home, the test DataFrame can be found here as a .csv. To get it in as a DataFrame:
df = pd.read_csv('test.csv', parse_dates=[0])

Summary
Problem 1 & fix: The way you form the bins produces one extra bin, since the bins created with df.resample() can only be closed on one end (left or right). Fix this with one of the options listed in "1.".
Problem 2 & fix: The first bin's left edge is at the start of that day ('2016-09-28 00:00:00') (see "2."). You can fix it by passing kind='period' to resample() (see "3.").
1. Having a glance at the input data (& what kind of bins we need)
The input data is from 2016-09-28 21:10:00 to 2020-06-15 00:50:00, and using the resampling_factor you have, we get:
In [63]: df.index.min()
Out[63]: Timestamp('2016-09-28 21:10:00')
In [64]: df.index.min() + pd.Timedelta(f'{resampling_factor}S')
Out[64]: Timestamp('2018-08-07 11:00:00')
In [65]: _ + pd.Timedelta(f'{resampling_factor}S')
Out[65]: Timestamp('2020-06-15 00:50:00')
To partition data into two pieces with these timestamps, we would need bins to be
['2016-09-28 21:10:00', '2018-08-07 11:00:00')
['2018-08-07 11:00:00', '2020-06-15 00:50:00']
(The [ means closed end and ( means open end)
Here is one problem: you cannot form bins that are closed at both ends. You will have to decide whether to close the bins on the left or the right (argument closed='left'|'right'). With closed='left' you would have
['2016-09-28 21:10:00', '2018-08-07 11:00:00')
['2018-08-07 11:00:00', '2020-06-15 00:50:00')
['2020-06-15 00:50:00', '2022-04-23 14:40:00') (only one entry here)
Possible fixes:
Adjust your last timestamp by adding some time to it:
last_timestamp = (df['utc_time'].max() +
                  pd.Timedelta('10 minutes')).timestamp()
Make the resampling_factor a bit larger than you first calculated.
Just use the first two dataframes from the df.resample and disregard the third, which has only one or a few entries (a sketch follows this list).
Choose whichever makes the most sense in your application.
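If you pick the last option, a minimal sketch (reusing the resampling_factor and final_length from the question, with df already indexed by utc_time) could be:

# Keep only the first `final_length` groups from the resampler and disregard
# the trailing, nearly empty bin.
groups = list(df.resample(f'{resampling_factor}S'))  # list of (left_edge, sub-DataFrame) pairs
kept = groups[:final_length]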
2. Looking at what we have now
From the df.resample docs, we know that the labels returned are the left bin edges.
If we look at the data, we can see what kind of labels we have now.
In [67]: resamp = df.resample(f'{resampling_factor}S')
In [68]: itr = iter(resamp)
In [69]: next(itr)
Out[69]:
(Timestamp('2016-09-28 00:00:00', freq='58542600S'),
data_value
utc_time
2016-09-28 21:10:00 140.0
... ...
2018-08-06 13:40:00 230.0
[69889 rows x 1 columns])
In [70]: next(itr)
Out[70]:
(Timestamp('2018-08-06 13:50:00', freq='58542600S'),
data_value
utc_time
2018-08-06 13:50:00 230.0
... ...
2020-06-14 03:30:00 280.0
[97571 rows x 1 columns])
In [71]: next(itr)
Out[71]:
(Timestamp('2020-06-14 03:40:00', freq='58542600S'),
data_value
utc_time
2020-06-14 03:40:00 280.0
... ...
2020-06-15 00:50:00 280.0
[128 rows x 1 columns])
The bins are therefore
['2016-09-28 00:00:00', '2018-08-06 13:50:00')
['2018-08-06 13:50:00', '2020-06-14 03:40:00')
['2020-06-14 03:40:00', '2022-04-22 17:30:00')
(Endpoint calculated by adding resampling_factor to the beginning of the bin.)
We see that the first bin does not start at df['utc_time'].min() (2016-09-28 21:10:00), but at the beginning of that day (as you guessed).
Since the first bin starts before intended, we have data outside two bins, in a third bin.
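To see this directly, here is a small sketch: iterating the resampler yields (label, sub-DataFrame) pairs, and the labels are the left bin edges, so we can list them.

# The group labels are the left bin edges, so printing the first one shows
# that the first bin starts at midnight rather than at the first observation.
left_edges = [label for label, _ in resamp]
print(left_edges[0])   # Timestamp('2016-09-28 00:00:00')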
3. Fixing the starting bin left edge
The kind argument can be either 'timestamp' or 'period'. If you change it to 'period', you will have the following bins (with closed='left'):
['2016-09-28 21:10:00', '2018-08-07 11:00:00') <-- fixed
['2018-08-07 11:00:00', '2020-06-15 00:50:00')
['2020-06-15 00:50:00', '2022-04-23 14:40:00') (Remove with options given in "1.")
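Putting both fixes together, a minimal sketch (with df already indexed by utc_time, as above) would be the following; with the padded last timestamp and kind='period' it should yield exactly two bins:

import numpy as np
import pandas as pd

final_length = 2

# Fix from "1.": pad the last timestamp by one sampling interval so the final
# observation falls strictly inside the second bin rather than on its open edge.
first_timestamp = df.index.min().timestamp()
last_timestamp = (df.index.max() + pd.Timedelta('10 minutes')).timestamp()
resampling_factor = int(np.ceil((last_timestamp - first_timestamp) / final_length))

# Fix from "3.": kind='period' makes the first bin start at the first
# observation instead of at midnight of that day.
resamp = df.resample(f'{resampling_factor}S', kind='period', closed='left')

print(len(list(resamp)))   # expected: 2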

Related

How do I take the mean on either side of a value in a pandas DataFrame?

I have a Pandas DataFrame where the index is datetimes for every 12 minutes in a day (120 rows total). I went ahead and resampled the data to every 30 minutes.
Time Rain_Rate
1 2014-04-02 00:00:00 0.50
2 2014-04-02 00:30:00 1.10
3 2014-04-02 01:00:00 0.48
4 2014-04-02 01:30:00 2.30
5 2014-04-02 02:00:00 4.10
6 2014-04-02 02:30:00 5.00
7 2014-04-02 03:00:00 3.20
I want to take 3 hour means centered on hours 00, 03, 06, 09, 12, 15 ,18, and 21. I want the mean to consist of 1.5 hours before 03:00:00 (so 01:30:00) and 1.5 hours after 03:00:00 (04:30:00). The 06:00:00 time would overlap with the 03:00:00 average (they would both use 04:30:00).
Is there a way to do this using pandas? I've tried a few things but they haven't worked.
Method 1
I'm going to suggest just changing your resample from the get-go to get the chunks you want. Here's some fake data resembling yours, before any resampling:
import numpy as np
import pandas as pd

dr = pd.date_range('04-02-2014 00:00:00', '04-03-2014 00:00:00', freq='12T', closed='left')
data = np.random.rand(120)
df = pd.DataFrame(data, index=dr, columns=['Rain_Rate'])
df.index.name = 'Time'
#df.head()
Rain_Rate
Time
2014-04-02 00:00:00 0.616588
2014-04-02 00:12:00 0.201390
2014-04-02 00:24:00 0.802754
2014-04-02 00:36:00 0.712743
2014-04-02 00:48:00 0.711766
Averaging in 3-hour chunks from the start will give the same result as doing 30-minute chunks first and then 3-hour chunks. You just have to tweak a couple of things to get the bins you want. First add the point you will start binning from (i.e. 10:30 pm on the previous day, even if there's no data there; the first bin runs from 10:30 pm to 1:30 am), then resample starting from this point:
before = df.index[0] - pd.Timedelta(minutes=90) #only if the first index is at midnight!!!
df.loc[before] = np.nan
df = df.sort_index()
output = df.resample('3H', base=22.5, loffset='90min').mean()
The base parameter here means start at the 22.5th hour (10:30 pm), and loffset shifts the bin labels forward by 90 minutes (so the bin that starts at 22:30 is labeled 00:00). You get the following output:
Rain_Rate
Time
2014-04-02 00:00:00 0.555515
2014-04-02 03:00:00 0.546571
2014-04-02 06:00:00 0.439953
2014-04-02 09:00:00 0.460898
2014-04-02 12:00:00 0.506690
2014-04-02 15:00:00 0.605775
2014-04-02 18:00:00 0.448838
2014-04-02 21:00:00 0.387380
2014-04-03 00:00:00 0.604204 #this is the bin at midnight on the following day
You could also start with the data binned at 30 minutes and use this method, and should get the same answer.*
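(Side note for newer pandas: base and loffset were deprecated in pandas 1.1. A rough equivalent, sketched under my assumption that offset='90min' reproduces base=22.5 here, uses the offset argument and then shifts the labels manually:)

# Rough modern-pandas equivalent of base=22.5, loffset='90min' (sketch only):
out = df.resample('3H', offset='90min').mean()
out.index = out.index + pd.Timedelta('90min')   # stands in for loffset='90min'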
Method 2
Another approach would be to find the locations of the indexes you want to create averages for, and then calculate the averages for entries in the 3 hours surrounding:
resampled = df.resample('30T').mean() #like your data in the post
centers = [0,3,6,9,12,15,18,21]
mask = np.where(df.index.hour.isin(centers) & (df.index.minute==0), True, False)
df_centers = df.index[mask]
output = []
for center in df_centers:
    cond1 = (df.index >= (center - pd.Timedelta(hours=1.5)))
    cond2 = (df.index <= (center + pd.Timedelta(hours=1.5)))
    output.append(df[cond1 & cond2].values.mean())
Output here is the same, but the answers are in a list (and the last point of "24 hours" is not included):
[0.5555146139562004,
0.5465709237162698,
0.43995277270996735,
0.46089800625663596,
0.5066902552121085,
0.6057747262752732,
0.44883794039466535,
0.3873795731806939]
*You mentioned you wanted some points on the edge of bins to be included in both bins. resample doesn't do this (and generally I don't think most people want to), but the second method is explicit about doing so (by using >= and <= in cond1 and cond2). However, these two methods achieve the same result here, presumably because using resample at different stages causes data points to fall into different bins. It's hard for me to wrap my head around that, but one could do a little manual binning to verify what is going on. The point is, I would recommend spot-checking the output of these methods (or any resample-based method) against your raw data to make sure things look correct. For these examples, I did so using Excel.

Resample DataFrame with DatetimeIndex and keep date range

My problem might sound trivial but I haven't found any solution for it:
I want the resampled data to remain in the same date range as the original data when I resample a DataFrame with a DatetimeIndex e.g. into three-monthly values.
Minimal example:
import numpy as np
import pandas as pd
# data from 2014 to 2016
dim = 8760 * 3 + 24
idx = pd.date_range('1/1/2014 00:00:00', freq='h', periods=dim)
df = pd.DataFrame(np.random.randn(dim, 2), index=idx)
# resample to three months
df = df.resample('3M').sum()
print(df)
yielding
0 1
2014-01-31 24.546928 -16.082389
2014-04-30 -52.966507 -40.255773
2014-07-31 -32.580114 47.096810
2014-10-31 -9.501333 12.872683
2015-01-31 -106.504047 45.082733
2015-04-30 -34.230358 70.508420
2015-07-31 -35.916497 104.930101
2015-10-31 -16.780425 17.411410
2016-01-31 68.512994 -43.772082
2016-04-30 -0.349917 27.794895
2016-07-31 -30.408862 -18.182486
2016-10-31 -97.355730 -105.961101
2017-01-31 -7.221361 40.037358
Why does the resampling exceed the date range, e.g. create an entry for 2017-01-31, and how can I prevent this and instead stay within the original range, e.g. between 2014-01-01 and 2016-12-31? And shouldn't going January-March, April-June, ..., October-December be the expected standard behaviour?
Thanks in advance!
There are 36 months in your DataFrame.
When you resample every 3 months, the first row will contain everything up to the end of your first month, the second row will contain everything between your second month and 3 months after that, and so on. Your last row will contain everything from 2016-10-31 until 3 months after that, which is 2017-01-31.
If you want, you could change it to
df.resample('3M', closed='left', label='left').sum()
, giving you
2013-10-31 3.705955 25.394287
2014-01-31 38.778872 -12.655323
2014-04-30 10.382832 -64.649173
2014-07-31 66.939190 31.966008
2014-10-31 -39.453572 27.431183
2015-01-31 66.436348 29.585436
2015-04-30 78.731608 -25.150526
2015-07-31 14.493226 -5.842421
2015-10-31 -2.394419 58.017105
2016-01-31 -36.295499 -14.542251
2016-04-30 69.794101 62.572736
2016-07-31 76.600558 -17.706111
2016-10-31 -68.842328 -32.723581
, but then the first row would be 'outside your range'.
If you resample every 3 months, then either your first row is going to be outside your range, or your last one is.
EDIT
If you want the bins to be 'first three months', 'next three months', and so on, you could write
df.resample('3MS').sum()
, as this will take the beginning of each month rather than its end (see https://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-offset-aliases)
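A quick sketch of that check, using the original hourly frame (before the '3M' resample overwrites df); the sums depend on the random data, but the labels should be month starts and stay inside 2014-2016:

# With '3MS' the bin labels are the starts of every third month, so the last
# label should be 2016-10-01 rather than spilling over into 2017.
hourly = pd.DataFrame(np.random.randn(dim, 2), index=idx)
quarters = hourly.resample('3MS').sum()
print(quarters.index.min(), quarters.index.max())
# expected labels: 2014-01-01 ... 2016-10-01, i.e. nothing past the original range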

How to properly set start/end params of statsmodels.tsa.ar_model.AR.predict function

I have a dataframe of project costs from an irregularly spaced time series that I would like to try applying the statsmodels AR model to.
This is a sample of the data in its dataframe:
cost
date
2015-07-16 35.98
2015-08-11 25.00
2015-08-11 43.94
2015-08-13 26.25
2015-08-18 15.38
2015-08-24 77.72
2015-09-09 40.00
2015-09-09 20.00
2015-09-09 65.00
2015-09-23 70.50
2015-09-29 59.00
2015-11-03 19.25
2015-11-04 19.97
2015-11-10 26.25
2015-11-12 19.97
2015-11-12 23.97
2015-11-12 21.88
2015-11-23 23.50
2015-11-23 33.75
2015-11-23 22.70
2015-11-23 33.75
2015-11-24 27.95
2015-11-24 27.95
2015-11-24 27.95
...
2017-03-31 21.93
2017-04-06 22.45
2017-04-06 26.85
2017-04-12 60.40
2017-04-12 37.00
2017-04-12 20.00
2017-04-12 66.00
2017-04-12 60.00
2017-04-13 41.95
2017-04-13 25.97
2017-04-13 29.48
2017-04-19 41.00
2017-04-19 58.00
2017-04-19 78.00
2017-04-19 12.00
2017-04-24 51.05
2017-04-26 21.88
2017-04-26 50.05
2017-04-28 21.00
2017-04-28 30.00
I am having a hard time understanding how to use start and end in the predict function.
According to the docs:
start : int, str, or datetime
    Zero-indexed observation number at which to start forecasting, i.e., the first forecast is start. Can also be a date string to parse or a datetime type.
end : int, str, or datetime
    Zero-indexed observation number at which to end forecasting, i.e., the first forecast is start. Can also be a date string to parse or a datetime type.
I create a dataframe that has an empty daily time series, add my irregularly spaced time series data to it, and then try to apply the model.
from datetime import datetime

import pandas as pd
import statsmodels.api as sm

data = pd.read_csv('data.csv', index_col=1, parse_dates=True)
df = pd.DataFrame(index=pd.date_range(start=datetime(2015, 1, 1), end=datetime(2017, 12, 31), freq='d'))
df = df.join(data)
df.cost.interpolate(inplace=True)
ar_model = sm.tsa.AR(df, missing='drop', freq='D')
ar_res = ar_model.fit(maxlag=9, method='mle', disp=-1)
pred = ar_res.predict(start='2016', end='2016')
The predict function results in an error of pandas.tslib.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 605-12-31 00:00:00
If I try to use a more specific date, I get the same type of error:
pred = ar_res.predict(start='2016-01-01', end='2016-06-01')
If I try to use integers, I get a different error:
pred = ar_res.predict(start=0, end=len(data))
Wrong number of items passed 202, placement implies 197
If I actually use a datetime, I get an error that reads no rule for interpreting end.
I am hitting a wall so hard here I am thinking there must be something I am missing.
Ultimately, I would like to use the model to get out-of-sample predictions (such as a prediction for next quarter).
This works if you pass a datetime (rather than a date):
from datetime import datetime
...
pred = ar_res.predict(start=datetime(2015, 1, 1), end=datetime(2017,12,31))
In [21]: pred.head(2) # my dummy numbers from data
Out[21]:
2015-01-01 35
2015-01-02 23
Freq: D, dtype: float64
In [22]: pred.tail(2)
Out[22]:
2017-12-30 44
2017-12-31 44
Freq: D, dtype: float64
So I was creating a daily index to account for the equally spaced time series requirement, but it still remained non-unique (comment by #user333700).
I added a groupby function to sum duplicate dates together, and could then run the predict function using datetime objects (answer by #andy-hayden).
df = df.groupby(pd.TimeGrouper(freq='D')).sum()
...
ar_res.predict(start=min(df.index), end=datetime(2018,12,31))
With the predict function providing a result, I am now able to analyze the results and tweak the params to get something useful.

Using loc[] for i in range loops with pandas dataframes.

How can I loop through pandas's loc[] function such that, given a long Series, I can break it into multiple little ones? Something I imagine would look like
for i in range(1, 10):
    df.loc['2002-i-01':'2002-(i+1)-01']
where i represents the number of months.
Consider the dataframe df
df = pd.DataFrame(dict(A=range(100)), pd.date_range('2010-03-31', periods=100))
Observe that you are asking to slice from the beginning of one month to the beginning of the next. Typical python slicing does not include the end point (though loc does). I'll assume you meant to exclude it as that makes this answer convenient.
Use resample with a frequency 'M'
df.resample('M').sum()
A
2010-03-31 0
2010-04-30 465
2010-05-31 1426
2010-06-30 2295
2010-07-31 764
You can iterate through each month
for m, grp in df.groupby(pd.TimeGrouper('M')):
    # do stuff
    print(m)
2010-03-31 00:00:00
2010-04-30 00:00:00
2010-05-31 00:00:00
2010-06-30 00:00:00
2010-07-31 00:00:00
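If the goal is literally to end up with a collection of small DataFrames, one per month, a short sketch (pd.Grouper is the newer replacement for the deprecated pd.TimeGrouper) might be:

# Collect each month's rows into a dict keyed by the month-end timestamp.
monthly_chunks = {month: chunk for month, chunk in df.groupby(pd.Grouper(freq='M'))}
print(len(monthly_chunks))   # 5 months for the example df above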

Python -Pandas Downsampling with first returns NaN

I am trying to use pandas to resample vessel tracking data from seconds to minutes using how='first'. The dataframe is called hg1s. The unique ID is called MMSI. The datetime index is TX_DTTM. Here is a data sample:
TX_DTTM MMSI LAT LON NS
2013-10-01 00:00:02 367542760 29.660550 -94.974195 15
2013-10-01 00:00:04 367542760 29.660550 -94.974195 15
2013-10-01 00:00:07 367451120 29.614161 -94.954459 0
2013-10-01 00:00:15 367542760 29.660210 -94.974069 15
2013-10-01 00:00:13 367542760 29.660210 -94.974069 15
The code to resample:
hg1s1min = hg1s.groupby('MMSI').resample('1Min', how='first')
And a data sample of the output:
hg1s1min[20000:20004]
MMSI TX_DTTM NS LAT LON
367448060 2013-10-21 00:42:00 NaN NaN NaN
2013-10-21 00:43:00 NaN NaN NaN
2013-10-21 00:44:00 NaN NaN NaN
2013-10-21 00:45:00 NaN NaN NaN
It's safe to assume that there are several data points within each minute, so I don't understand why this isn't picking up the first record for that method. I looked at this link: Pandas Downsampling Issue because it seemed similar to my problem. I tried passing label='left' and label='right', neither worked.
How do I return the first record in every minute for each MMSI?
As it turns out, the problem isn't with the method, but with my assumption about the data. The large data set is a month, or 44640 minutes. While every record in my dataset has the relevant values, there isn't 100% overlap in time. In this case MMSI = 367448060 is present at 2013-10-17 23:24:31 and again at 2013-10-29 20:57:32. Between those two data points there is no data to sample, resulting in NaN, which is correct.
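(For what it's worth, newer pandas drops the how= keyword; a sketch of the same operation, with the all-NaN minutes removed afterwards, would be:)

# Modern syntax: call .first() on the resampler instead of how='first',
# then drop the minutes where a vessel has no data at all.
hg1s1min = hg1s.groupby('MMSI').resample('1Min').first().dropna(how='all')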