Resample DataFrame with DatetimeIndex and keep date range - python

My problem might sound trivial but I haven't found any solution for it:
When I resample a DataFrame with a DatetimeIndex, e.g. into three-monthly values, I want the resampled data to stay within the date range of the original data.
Minimal example:
import numpy as np
import pandas as pd
# data from 2014 to 2016
dim = 8760 * 3 + 24
idx = pd.date_range('1/1/2014 00:00:00', freq='h', periods=dim)
df = pd.DataFrame(np.random.randn(dim, 2), index=idx)
# resample to three months
df = df.resample('3M').sum()
print(df)
yielding
0 1
2014-01-31 24.546928 -16.082389
2014-04-30 -52.966507 -40.255773
2014-07-31 -32.580114 47.096810
2014-10-31 -9.501333 12.872683
2015-01-31 -106.504047 45.082733
2015-04-30 -34.230358 70.508420
2015-07-31 -35.916497 104.930101
2015-10-31 -16.780425 17.411410
2016-01-31 68.512994 -43.772082
2016-04-30 -0.349917 27.794895
2016-07-31 -30.408862 -18.182486
2016-10-31 -97.355730 -105.961101
2017-01-31 -7.221361 40.037358
Why does the resampling exceed the date range, e.g. create an entry for 2017-01-31, and how can I prevent this and stay within the original range, e.g. between 2014-01-01 and 2016-12-31? And shouldn't binning into January-March, April-June, ..., October-December be the expected standard behaviour?
Thanks in advance!

There are 36 months in your DataFrame.
When you resample every 3 months with the month-end alias M, the bins are closed and labeled on the right by default: the first row contains everything up to the end of your first month, the second row contains the following three months, and so on. Your last row contains everything after 2016-10-31 up to and including 2017-01-31, which is why its label falls outside your data.
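You can list the month-end labels pandas assigns here directly; they match the index of your output:
pd.date_range('2014-01-31', '2017-01-31', freq='3M')
# DatetimeIndex(['2014-01-31', '2014-04-30', ..., '2017-01-31'], dtype='datetime64[ns]', freq='3M')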
If you want, you could change it to
df.resample('3M', closed='left', label='left').sum()
, giving you
2013-10-31 3.705955 25.394287
2014-01-31 38.778872 -12.655323
2014-04-30 10.382832 -64.649173
2014-07-31 66.939190 31.966008
2014-10-31 -39.453572 27.431183
2015-01-31 66.436348 29.585436
2015-04-30 78.731608 -25.150526
2015-07-31 14.493226 -5.842421
2015-10-31 -2.394419 58.017105
2016-01-31 -36.295499 -14.542251
2016-04-30 69.794101 62.572736
2016-07-31 76.600558 -17.706111
2016-10-31 -68.842328 -32.723581
, but then the first row would be 'outside your range'.
If you resample every 3 months, then either your first row is going to be outside your range, or your last one is.
EDIT
If you want the bins to be 'first three months', 'next three months', and so on, you could write
df.resample('3MS').sum()
, as this will take the beginning of each month rather than its end (see https://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-offset-aliases)
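As a quick check, the month-start bin labels now all stay inside your data (showing only the index here, since the sums depend on the random data):
pd.date_range('2014-01-01', '2016-12-31', freq='3MS')
# DatetimeIndex(['2014-01-01', '2014-04-01', ..., '2016-10-01'], dtype='datetime64[ns]', freq='3MS')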

Related

Python Datetime resample results suddenly in NaN Values

I have tried to resample my values to hourly. I had changed the date format in the csv file because months and days with low numbers were being swapped automatically (2003-04-01 suddenly became 2003-01-04). The date format is fine now (when viewing the csv file in Python), but after resampling, the values appear as NaN.
df = pd.read_csv(r'C:\Users\water_level.csv',parse_dates=[0],index_col=0,decimal=",", delimiter=';')
hour_avg = df_2.resample('H').mean()
Sample of my data:
Raw data with time as index
Afterwards: even when time is datetime it shows 99% of the data as NaN values (one value per day is shown)
Data with NaN values after resample per hours
When I resample to daily values, all the values come back, so the problem seems to be with the time component.
When I use the format from the beginning, I get the error "The format doesn't fit".
I tried a different way before (I'm not sure what was different) and resampling per hour worked.
What do I need to change to be able to resample per hour again?
Can you share a sample of your data? Assuming your data consists of a datetime feature (i.e. yyyy-mm-dd hh:mm:ss) and some other features you are trying to resample by hour, NaN values can occur for two reasons: dates parsed incorrectly by pandas, or hours missing from the data.
(1) It is possible that pandas is not reading your dates correctly. Once you read the file, make sure the dates are in the right format (i.e. yyyy-mm-dd). Since your dates are the index here (index_col=0), convert the index rather than a column:
df = pd.read_csv(r'C:\Users\water_level.csv', parse_dates=[0], index_col=0, decimal=",", delimiter=';')
df.index = pd.to_datetime(df.index, format='%Y-%m-%d %H:%M:%S')
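A quick way to confirm the parse worked:
print(df.index.dtype)  # datetime64[ns] means parsed dates; 'object' means they are still strings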
(2) If there are any gaps in your data, NaN values will pop up. For instance, assume the data is of this form, with no observations during the 02:00 and 05:00 hours:
2000-01-01 00:00:00 1
2000-01-01 01:00:00 1
2000-01-01 03:00:00 1
2000-01-01 04:00:00 1
2000-01-01 06:00:00 1
If you try hour_avg = df_2.resample('H').mean(), your output will look like:
2000-01-01 00:00:00 1.0
2000-01-01 01:00:00 1.0
2000-01-01 02:00:00 NaN
2000-01-01 03:00:00 1.0
2000-01-01 04:00:00 1.0
2000-01-01 05:00:00 NaN
2000-01-01 06:00:00 1.0
I suspect the problem is the latter. If it is, you can simply remove the NaN values using df_2.dropna(). Otherwise, if you do need the hourly bins regardless of missing data, you can avoid the NaN values by padding the missing values first and then taking the mean:
hour_pad = df_2.resample('H').ffill()  # .pad() is an older alias for .ffill()
hour_avg = hour_pad.resample('H').mean()
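For reference, here is a small runnable version of the gap scenario above (the column name is hypothetical):
import pandas as pd
idx = pd.to_datetime(['2000-01-01 00:00', '2000-01-01 01:00',
                      '2000-01-01 03:00', '2000-01-01 04:00', '2000-01-01 06:00'])
df_2 = pd.DataFrame({'level': 1.0}, index=idx)
# hours 02:00 and 05:00 have no observations, so their bins come out NaN
print(df_2.resample('H').mean())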

Is there a Pandas function to highlight a week's 10 lowest values in a time series?

Rookie here so please excuse my question format:
I got an event time series dataset for two months (columns for "date/time" and "# of events", each row representing an hour).
I would like to highlight the 10 hours with the lowest numbers of events for each week. Is there a specific Pandas function for that? Thanks!
Let's say you have a dataframe df with column col as well as a datetime column.
You can simply sort the column with
import pandas as pd
df = pd.DataFrame({
    'col': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
    'datetime': ['2019-01-01 00:00:00', '2015-02-01 00:00:00', '2015-03-01 00:00:00', '2015-04-01 00:00:00',
                 '2018-05-01 00:00:00', '2016-06-01 00:00:00', '2017-07-01 00:00:00', '2013-08-01 00:00:00',
                 '2015-09-01 00:00:00', '2015-10-01 00:00:00', '2015-11-01 00:00:00', '2015-12-01 00:00:00',
                 '2014-01-01 00:00:00', '2020-01-01 00:00:00', '2014-01-01 00:00:00']
})
df = df.sort_values('col')
df = df.iloc[0:10, :]
df
Output:
col datetime
0 1 2019-01-01 00:00:00
1 2 2015-02-01 00:00:00
2 3 2015-03-01 00:00:00
3 4 2015-04-01 00:00:00
4 5 2018-05-01 00:00:00
5 6 2016-06-01 00:00:00
6 7 2017-07-01 00:00:00
7 8 2013-08-01 00:00:00
8 9 2015-09-01 00:00:00
9 10 2015-10-01 00:00:00
I know there's a function called nlargest, and it has an nsmallest counterpart: pandas.DataFrame.nsmallest
df.nsmallest(n=10, columns=['col'])
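To get the 10 lowest hours per week rather than overall, nsmallest also works per group. A sketch, assuming the hourly timestamps are the index and the counts live in a (hypothetical) column n_events:
import numpy as np
import pandas as pd
idx = pd.date_range('2020-06-01', periods=24 * 61, freq='H')  # two months of hours
events = pd.DataFrame({'n_events': np.random.randint(0, 50, size=len(idx))}, index=idx)
# the 10 lowest-count hours within each calendar week
lowest = events.groupby(pd.Grouper(freq='W'))['n_events'].nsmallest(10)
print(lowest)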
My bad, so your DatetimeIndex is sampled hourly, and you need the hour(s) with the fewest events per week.
...
Date n_events
2020-06-06 08:00:00 3
2020-06-06 09:00:00 3
2020-06-06 10:00:00 2
...
Well, I'd start by converting the hours into columns.
1. Create an hour column that holds the hour of the day:
df['hour'] = df['date'].dt.hour
2. Pivot the hour values into columns, with n_events as the values.
You'll then have one datetime index and 24 hour columns, with values denoting the number of events. pandas.DataFrame.pivot_table
...
Date hour0 ... hour8 hour9 hour10 ... hour23
2020-06-06 0 3 3 2 0
...
Then you can resample it to weekly level and aggregate using sum:
df.resample('W').sum()
The last part is a bit tricky to do on the dataframe, but fairly simple if you just need the output:
for row in df.itertuples():
    print(sorted(row[1:])[:10])  # the week's 10 lowest hourly counts
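Putting those steps together, a sketch (the column names 'date' and 'n_events' are assumptions):
import numpy as np
import pandas as pd
df = pd.DataFrame({'date': pd.date_range('2020-06-01', periods=24 * 61, freq='H')})
df['n_events'] = np.random.randint(0, 50, size=len(df))
df['day'] = df['date'].dt.normalize()  # the calendar day
df['hour'] = df['date'].dt.hour        # the hour of day
# pivot: one row per day, one column per hour
wide = df.pivot_table(index='day', columns='hour', values='n_events', aggfunc='sum')
weekly = wide.resample('W').sum()      # aggregate days into weeks
for row in weekly.itertuples():
    print(row[0], sorted(row[1:])[:10])  # each week's 10 lowest hourly totals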

Pandas time series resample, binning seems off

I was answering another question here about something I thought I knew in pandas, time series resampling, when I noticed this odd binning.
Let's say I have a dataframe with a daily date range index and a column I want to resample and sum on.
import numpy as np
import pandas as pd
index = pd.date_range(start="1/1/2018", end="31/12/2018")
df = pd.DataFrame(np.random.randint(100, size=len(index)),
                  columns=["sales"], index=index)
>>> df.head()
sales
2018-01-01 66
2018-01-02 18
2018-01-03 45
2018-01-04 92
2018-01-05 76
Now I resample by one month, and everything looks fine:
>>> df.resample("1M").sum()
sales
2018-01-31 1507
2018-02-28 1186
2018-03-31 1382
[...]
2018-11-30 1342
2018-12-31 1337
If I try to resample by more months, though, the binning starts to look off. This is particularly evident with 6M:
df.resample("6M").sum()
sales
2018-01-31 1507
2018-07-31 8393
2019-01-31 7283
The first bin spans just over one month, and the last bin extends one month into the future. Maybe I have to set closed="left" to get the proper limits:
df.resample("6M", closed="left").sum()
sales
2018-06-30 8090
2018-12-31 9054
2019-06-30 39
Now I have an extra bin in 2019 with data from 2018-12-31...
Is this working properly? Am I missing an option I should set?
EDIT: here's the output I would expect when resampling one year in six-month intervals, the first interval spanning Jan 1st to Jun 30 and the second spanning Jul 1st to Dec 31.
df.resample("6M", closed="left").sum()
sales
2018-06-30 8090
2018-12-31 9093 # 9054 + 39
Note that there's also some doubt here about what is happening with the June 30 data: does it go in the first bin, as I would expect, or in the second? With the last bin it's evident, but the same thing is probably happening in all the bins.
The M time offset alias implies month end frequency.
What you need is 6MS which is an alias for month start frequency:
df.resample('6MS').sum()
resulting in
sales
2018-01-01 8130
2018-07-01 9563
2019-01-01 0
Also df.groupby(pd.Grouper(freq='6MS')).sum() can be used interchangeably.
For extra clarity you can compare ranges directly:
>>> pd.date_range('2018-01-01', '2018-12-31', freq='6M')
DatetimeIndex(['2018-01-31', '2018-07-31'], dtype='datetime64[ns]', freq='6M')
>>> pd.date_range('2018-01-01', '2018-12-31', freq='6MS')
DatetimeIndex(['2018-01-01', '2018-07-01'], dtype='datetime64[ns]', freq='6MS')
Adding np.random.seed(365) so we can both check our outputs:
print(df.resample("6M", kind='period').sum())
sales
2018-01 8794
2018-07 9033
would this work for you?

pandas get data for the end day of month?

The data is given as follows:
return
2010-01-04 0.016676
2010-01-05 0.003839
...
2010-01-05 0.003839
2010-01-29 0.001248
2010-02-01 0.000134
...
What I want is to extract all values that fall on the last day of each month that appears in the data:
2010-01-29 0.001248
2010-02-28 ......
If I directly use pandas.resample, i.e. df.resample('M').last(), I select the correct rows but get the wrong index (it automatically uses the last calendar day of the month as the index):
2010-01-31 0.001248
2010-02-28 ......
How can I get the correct answer in a Pythonic way?
An assumption made here is that your date data is part of the index. If not, I recommend setting it first.
Single Year
I don't think the resampling or grouper functions would do. Let's group on the month number instead and call DataFrameGroupBy.tail.
df.groupby(df.index.month).tail(1)
Multiple Years
If your data spans multiple years, you'll need to group on the year and month. Using a single grouper built with strftime:
df.groupby(df.index.strftime('%Y-%m')).tail(1)
Or, using multiple groupers:
df.groupby([df.index.year, df.index.month]).tail(1)
Note: if your index is not a DatetimeIndex as assumed here, you'll need to replace df.index with pd.to_datetime(df.index, errors='coerce') above.
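A runnable version of the multi-year case, with hypothetical data:
import numpy as np
import pandas as pd
idx = pd.date_range('2010-01-04', '2011-12-31', freq='B')  # business days over two years
df = pd.DataFrame({'return': np.random.randn(len(idx)) / 100}, index=idx)
# last observed row per (year, month); the original dates stay as the index
print(df.groupby([df.index.year, df.index.month]).tail(1))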
Although this doesn't answer the question properly, I'll leave it here in case someone is interested.
An approach that only works if you are certain you have all days (important!) is to add one day with pd.Timedelta and check whether the resulting day == 1. In a small timing test this was about 6x faster than the groupby solution.
df[(df['dates'] + pd.Timedelta(days=1)).dt.day == 1]
Or if index:
df[(df.index + pd.Timedelta(days=1)).day == 1]
Full example:
import pandas as pd
df = pd.DataFrame({
    'dates': pd.date_range(start='2016-01-01', end='2017-12-31'),
    'i': 1
}).set_index('dates')
dfout = df[(df.index + pd.Timedelta(days=1)).day == 1]
print(dfout)
Returns:
i
dates
2016-01-31 1
2016-02-29 1
2016-03-31 1
2016-04-30 1
2016-05-31 1
2016-06-30 1
2016-07-31 1
2016-08-31 1
2016-09-30 1
2016-10-31 1
2016-11-30 1
2016-12-31 1
2017-01-31 1
2017-02-28 1
2017-03-31 1
2017-04-30 1
2017-05-31 1
2017-06-30 1
2017-07-31 1
2017-08-31 1
2017-09-30 1
2017-10-31 1
2017-11-30 1
2017-12-31 1

Pandas DataFrame.resample monthly offset from particular day of month

I have a DataFrame df with sporadic business-day rows (i.e., there is not always a row for every business day).
For each row in df I want to create a historical resampled mean dfm going back one month at a time. For example, if I have a row for 2018-02-22 then I want rolling means for rows in the following date ranges:
2018-01-23 : 2018-02-22
2017-12-23 : 2018-01-22
2017-11-23 : 2017-12-22
etc.
But I can't see a way to keep this pegged to the particular day of the month using conventional offsets. For example, if I do:
dfm = df.resample('30D').mean()
Then we see two problems:
It references the beginning of the DataFrame. In fact, I can't find a way to force .resample() to peg itself to the end of the DataFrame, even if I have it operate on df_reversed = df.loc[:'2018-02-22'].iloc[::-1]. Is there a way to "peg" the resampling to something other than the earliest date in the DataFrame? (And ideally pegged to each particular row, as I run some lambda on the historical resampling associated with each row's date?)
It will drift over time, because not every month is 30 days long. So as I go back in time I will find that the interval 12 "months" prior ends 2017-02-27, not 2017-02-22 like I want.
Knowing that I want to resample by non-overlapping "months," the second problem can be well-defined for month days 29-31: For example, if I ask to resample for '2018-03-31' then the date ranges would end at the end of each preceding month:
2018-03-01 : 2018-03-31
2018-02-01 : 2018-02-28
2018-01-01 : 2018-01-31
etc.
Though again, I don't know: is there a good or easy way to do this in pandas?
tl;dr:
Given something like the following:
someperiods = 20 # this can be a number of days covering many years
somefrequency = '8D' # this can vary from 1D to maybe 10D
rng = pd.date_range('2017-01-03', periods=someperiods, freq=somefrequency)
df = pd.DataFrame({'x': rng.day}, index=rng) # x in practice is exogenous data
from pandas.tseries.offsets import DateOffset
df['MonthPrior'] = df.index - DateOffset(months=1)
Now:
For each row in df: calculate df['PreviousMonthMean'] = rolling average of all df.x in range [df.MonthPrior, df.index). In this example the resulting DataFrame would be:
Index x MonthPrior PreviousMonthMean
2017-01-03 3 2016-12-03 NaN
2017-01-11 11 2016-12-11 3
2017-01-19 19 2016-12-19 7
2017-01-27 27 2016-12-27 11
2017-02-04 4 2017-01-04 19
2017-02-12 12 2017-01-12 16.66666667
2017-02-20 20 2017-01-20 14.33333333
2017-02-28 28 2017-01-28 12
2017-03-08 8 2017-02-08 20
2017-03-16 16 2017-02-16 18.66666667
2017-03-24 24 2017-02-24 17.33333333
2017-04-01 1 2017-03-01 16
2017-04-09 9 2017-03-09 13.66666667
2017-04-17 17 2017-03-17 11.33333333
2017-04-25 25 2017-03-25 9
2017-05-03 3 2017-04-03 17
2017-05-11 11 2017-04-11 15
2017-05-19 19 2017-04-19 13
2017-05-27 27 2017-04-27 11
2017-06-04 4 2017-05-04 19
If we can get that far, then I need an efficient way to iterate this so that, for each row in df, I can aggregate consecutive but non-overlapping df['PreviousMonthMean'] values, going back one calendar month at a time from the given DatetimeIndex...
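For the first part, a brute-force sketch that reproduces the PreviousMonthMean column above; it builds a half-open window [MonthPrior, index) per row instead of using resample, so it stays pegged to each row's date (slow on large frames, but well-defined):
import pandas as pd
rng = pd.date_range('2017-01-03', periods=20, freq='8D')
df = pd.DataFrame({'x': rng.day}, index=rng)
df['MonthPrior'] = df.index - pd.DateOffset(months=1)
# mean of x over [MonthPrior, index) for each row; empty windows give NaN
df['PreviousMonthMean'] = [
    df.loc[(df.index >= lo) & (df.index < hi), 'x'].mean()
    for lo, hi in zip(df['MonthPrior'], df.index)
]
print(df)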
