How to handle end of time series in pandas resample when upsampling? - python

I want to resample from hours to half-hours. I use .ffill() in the example, but I've tested .asfreq() as an intermediate step too.
The goal is to get half-hour intervals with the hourly values spread across them, and I'm looking for a general solution for any range with the same problem.
import pandas as pd
index = pd.date_range('2018-10-10 00:00', '2018-10-10 02:00', freq='H')
hourly = pd.Series(range(10, len(index)+10), index=index)
half_hourly = hourly.resample('30min').ffill() / 2
The hourly series looks like:
2018-10-10 00:00:00 10
2018-10-10 01:00:00 11
2018-10-10 02:00:00 12
Freq: H, dtype: int64
And the half_hourly:
2018-10-10 00:00:00 5.0
2018-10-10 00:30:00 5.0
2018-10-10 01:00:00 5.5
2018-10-10 01:30:00 5.5
2018-10-10 02:00:00 6.0
Freq: 30T, dtype: float64
The problem with the last one is that there is no row representing 02:30:00.
I want to achieve something that is:
2018-10-10 00:00:00 5.0
2018-10-10 00:30:00 5.0
2018-10-10 01:00:00 5.5
2018-10-10 01:30:00 5.5
2018-10-10 02:00:00 6.0
2018-10-10 02:30:00 6.0
Freq: 30T, dtype: float64
I understand that the hourly series ends at 02:00, so there is no reason to expect pandas to insert the last half hour by default. However, after reading a lot of deprecated/old posts, some newer ones, the documentation, and the cookbook, I still wasn't able to find a straightforward solution.
Lastly, I've also tested .mean(), but that didn't fill the NaNs, and .interpolate() didn't average by hour the way I wanted.
My .ffill() / 2 almost works as a way to spread each hour across its half hours in this case, but it feels like a hack for a problem that I expect pandas already has a better solution to.
Thanks in advance.

Your precise issue can be solved like this:
>>> import pandas as pd
>>> index = pd.date_range('2018-10-10 00:00', '2018-10-10 02:00', freq='H')
>>> hourly = pd.Series(range(10, len(index)+10), index=index)
>>> hourly.reindex(index.union(index.shift(freq='30min'))).ffill() / 2
2018-10-10 00:00:00 5.0
2018-10-10 00:30:00 5.0
2018-10-10 01:00:00 5.5
2018-10-10 01:30:00 5.5
2018-10-10 02:00:00 6.0
2018-10-10 02:30:00 6.0
Freq: 30T, dtype: float64
The reindex above adds the stamps shifted by the target frequency (giving ffill a trailing 02:30 to fill), so the series now covers the last hour completely. I suspect that this is a minimal example, so I will try to solve the general case as well. Let's say you have multiple points to fill in on each day:
>>> import pandas as pd
>>> x = pd.Series([1.5, 2.5], pd.DatetimeIndex(['2018-09-21', '2018-09-22']))
>>> x.resample('6h').ffill()
2018-09-21 00:00:00 1.5
2018-09-21 06:00:00 1.5
2018-09-21 12:00:00 1.5
2018-09-21 18:00:00 1.5
2018-09-22 00:00:00 2.5
Freq: 6H, dtype: float64
Employ a similar trick to include 6am, 12pm, and 6pm on 2018-09-22 as well: reindex with a shift equal to the span you want covered past the last point. In this case the shift is one extra day:
>>> import pandas as pd
>>> x = pd.Series([1.5, 2.5], pd.DatetimeIndex(['2018-09-21', '2018-09-22']))
>>> res = x.reindex(x.index.union(x.index.shift(freq='1D'))).resample('6h').ffill()
>>> res[:res.last_valid_index()] # drop the start of next day
2018-09-21 00:00:00 1.5
2018-09-21 06:00:00 1.5
2018-09-21 12:00:00 1.5
2018-09-21 18:00:00 1.5
2018-09-22 00:00:00 2.5
2018-09-22 06:00:00 2.5
2018-09-22 12:00:00 2.5
2018-09-22 18:00:00 2.5
Freq: 6H, dtype: float64
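If this comes up for several series, the same trick can be wrapped in a small helper. This is only a sketch (the function name and signature are mine, and it assumes the shift period matches the spacing of the index, so that only one sentinel stamp is added at the end):
import pandas as pd

def spread_to_finer(s, fine_freq, parts, period):
    # add a sentinel stamp one period past the end; reindex leaves it NaN
    extended = s.reindex(s.index.union(s.index.shift(freq=period)))
    # upsample, forward-fill, and spread each value over its sub-intervals
    res = extended.resample(fine_freq).ffill() / parts
    # the sentinel stays NaN, so cut everything from it onwards
    return res[:res.last_valid_index()]
For the first example, spread_to_finer(hourly, '30min', 2, '1H') should reproduce the desired output ending at 02:30.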

Related

Remove data timestamp and get data only every hour - python

I have a bunch of timestamp data in a csv file like this:
2012-01-01 00:00:00, data
2012-01-01 00:01:00, data
2012-01-01 00:02:00, data
...
2012-01-01 00:59:00, data
2012-01-01 01:00:00, data
2012-01-01 01:01:00, data
I want to delete the per-minute data and only display every hour in Python, like the following:
2012-01-01 00:00:00, data
2012-01-01 01:00:00, data
2012-01-01 02:00:00, data
Could anyone help me? Thank you.
I believe you need pandas resample; here's an example of how to use it to achieve the output you want. Keep in mind that since this is a resampling operation during frequency conversion, you must pass a function saying how the values should be aggregated into the new timeframe (summing all values in each bin, calculating an average, calculating the difference, etc.); otherwise you get back a DatetimeIndexResampler object instead of data. Here is an example:
import pandas as pd
index = pd.date_range('1/1/2000', periods=9, freq='40T')
series = pd.Series(range(9),index=index)
print(series)
Output:
2000-01-01 00:00:00 0
2000-01-01 00:40:00 1
2000-01-01 01:20:00 2
2000-01-01 02:00:00 3
2000-01-01 02:40:00 4
2000-01-01 03:20:00 5
2000-01-01 04:00:00 6
2000-01-01 04:40:00 7
2000-01-01 05:20:00 8
Applying resample hourly without passing the aggregation function:
print(series.resample('H'))
Output:
DatetimeIndexResampler [freq=<Hour>, axis=0, closed=left, label=left, convention=start, base=0]
After passing .sum():
print(series.resample('H').sum())
Output:
2000-01-01 00:00:00 1
2000-01-01 01:00:00 2
2000-01-01 02:00:00 7
2000-01-01 03:00:00 5
2000-01-01 04:00:00 13
2000-01-01 05:00:00 8
Freq: H, dtype: int64
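Note that .sum() changes the values by aggregating everything inside each hour. If, as in the question, the goal is simply to keep the reading taken at each whole hour and discard the minutely rows in between, a selection without aggregation may be closer to what's wanted (a sketch against a regular minutely index; series stands for your data):
# keep only the rows stamped exactly on the hour
hourly_only = series[series.index.minute == 0]
# or, for a regular index, sample at hourly stamps without aggregating
hourly_only = series.asfreq('H')
Both leave the data values untouched; series.resample('H').first() would behave the same way on minutely data.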

Python-pandas - Datetimeindex: What is the most pythonic strategy to analyse rolling with steps? (e.g. certain hours for each day)

I am working on a data frame with DateTimeIndex of hourly temperature data spanning a couple of years. I want to add a column with the minimum temperature between 20:00 of a day and 8:00 of the following day. Daytime temperatures - from 8:00 to 20:00 - are not of interest. The result can either be at the same hourly resolution of the original data or be resampled to days.
I have researched a number of strategies to solve this, but am unsure which is the most efficient (primarily in coding effort, secondarily in computing time) and most pythonic way to do it. Some of the possibilities I have come up with:
Attach a column with labels 'day'/'night' depending on df.index.hour and use groupby or df.loc to find the minimum
Resample to 12h and drop every second value. Not sure how I can make the resampling period start at 20:00.
Add a multi-index - I guess this is similar to approach 1, but feels a bit over the top for what I'm trying to achieve.
Use df.between_time (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.between_time.html#pandas.DataFrame.between_time) though I'm not sure if the date change over midnight will make this a bit messy.
Lastly there is some discussion about combining rolling with a stepping parameter as new pandas feature: https://github.com/pandas-dev/pandas/issues/15354
Original df looks like this:
datetime temp
2009-07-01 01:00:00 17.16
2009-07-01 02:00:00 16.64
2009-07-01 03:00:00 16.21 #<-- minimum for the night of 2009-06-30 (previous date, since the period starts 2009-06-30 20:00)
... ...
2019-06-24 22:00:00 14.03 #<-- minimum for the night 2019-06-24
2019-06-24 23:00:00 18.87
2019-06-25 00:00:00 17.85
2019-06-25 01:00:00 17.25
I want to get something like this (min temp from day 20:00 to day+1 8:00):
datetime temp
2009-06-30 23:00:00 16.21
2009-07-01 00:00:00 16.21
2009-07-01 01:00:00 16.21
2009-07-01 02:00:00 16.21
2009-07-01 03:00:00 16.21
... ...
2019-06-24 22:00:00 14.03
2019-06-24 23:00:00 14.03
2019-06-25 00:00:00 14.03
2019-06-25 01:00:00 14.03
or a bit more succinct:
datetime temp
2009-06-30 16.21
... ...
2019-06-24 14.03
Use the base option to resample:
rs = df.resample('12h', base=8).min()
Then keep only the rows for 20:00:
rs[rs.index.hour == 20]
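As a side note, in pandas 1.1+ the base argument is deprecated in favour of offset, which expresses the same shift of the bin edges (an equivalent sketch):
# same 20:00 -> 08:00 bins, spelled with the newer API
rs = df.resample('12h', offset='8h').min()
rs[rs.index.hour == 20]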
You can use pd.Grouper with freq='12h' and base=8 to chunk the dataframe into 20:00 → (+1 day) 08:00 windows, then just take .min().
Try this:
import pandas as pd
from io import StringIO
s = """
datetime temp
2009-07-01 01:00:00 17.16
2009-07-01 02:00:00 16.64
2009-07-01 03:00:00 16.21
2019-06-24 22:00:00 14.03
2019-06-24 23:00:00 18.87
2019-06-25 00:00:00 17.85
2019-06-25 01:00:00 17.25"""
df = pd.read_csv(StringIO(s), sep=r"\s\s+", engine="python")
df['datetime'] = pd.to_datetime(df['datetime'])
result = df.sort_values('datetime').groupby(pd.Grouper(freq='12h', base=8, key='datetime')).min()['temp'].dropna()
print(result)
Output:
datetime
2009-06-30 20:00:00 16.21
2019-06-24 20:00:00 14.03
Name: temp, dtype: float64
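If you also need the result at the original hourly resolution (the first output format asked for), groupby(...).transform can broadcast each chunk's minimum back onto its rows. A sketch (the chunk_min column name is mine; note that daytime chunks get their own minima too, so you would still filter down to the night rows, e.g. with between_time):
# broadcast each 12h-chunk minimum back onto its member rows
df['chunk_min'] = (df.sort_values('datetime')
                     .groupby(pd.Grouper(freq='12h', base=8, key='datetime'))['temp']
                     .transform('min'))
# keep only the night-time rows; adjust bounds if 08:00 itself should be excluded
night = df.set_index('datetime').between_time('20:00', '08:00')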

Change the frequency of a Pandas datetimeindex from daily to hourly, to select hourly data based on a condition on daily resampled data

I am working on hourly and sub-hourly time series. However, one of the conditions I need to test is on daily averages. I need to find the days that meet the condition, and then select all hours (or other time steps) from those days to change their values. But right now, the only value that is actually changed is the first hour on the selected day. How can I select and modify every hour?
This is an example of my dataset:
In[]: print(hourly_dataset.head())
Out[]:
GHI DNI DHI
2016-01-01 00:00:00 0.0 0.0 0.0
2016-01-01 01:00:00 0.0 0.0 0.0
2016-01-01 02:00:00 0.0 0.0 0.0
2016-01-01 03:00:00 0.0 0.0 0.0
2016-01-01 04:00:00 0.0 0.0 0.0
And this is the condition I need to check. I saved the indexes that satisfy the condition on the daily standard deviation as ix.
ix = hourly_dataset['GHI'].resample('D').std()[hourly_dataset['GHI'].resample('D').std() > 300].index
In[]: print(ix)
Out[]: DatetimeIndex(['2016-05-31', '2016-07-17', '2016-07-18'], dtype='datetime64[ns]', freq=None)
But when I assign NaN to those days, only the first hour is actually set to NaN.
hourly_dataset.loc[ix,'GHI'] = np.nan
In[]: print(hourly_dataset.loc['2016-05-31','GHI'].head())
Out[]:
2016-05-31 00:00:00 NaN
2016-05-31 01:00:00 0.0
2016-05-31 02:00:00 0.0
2016-05-31 03:00:00 0.0
2016-05-31 04:00:00 7.4
Freq: H, Name: GHI, dtype: float64
I would like all values in that day to be assigned nan.
Thanks for the help!
Possible workaround:
for i in ix:
    hourly_dataset.loc[i.strftime('%Y-%m-%d'), 'GHI'] = np.nan
Explanation
I had a quick look: the issue is that selecting with a DatetimeIndex of Timestamps matches only the exact (midnight) stamps, not the whole days. I was able to reproduce your error.
Consider this example:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'date': pd.date_range(start='2018-01-01', freq='2H', periods=24),
    'GHI': 0
}).set_index('date')
ix = pd.date_range(start='2018-01-01', end='2018-01-02')
df.loc[ix, 'GHI'] = np.nan
print(df.head())
Returns:
GHI
date
2018-01-01 00:00:00 NaN
2018-01-01 02:00:00 0.0
2018-01-01 04:00:00 0.0
2018-01-01 06:00:00 0.0
2018-01-01 08:00:00 0.0
Maybe not the best, but one workaround is to loop through ix and use loc with each date formatted as a 'YYYY-mm-dd' string, so that partial string indexing selects the whole day.
# df.loc[ix.strftime('%Y-%m-%d'), 'GHI'] = np.nan --> does not work
for i in ix:
    df.loc[i.strftime('%Y-%m-%d'), 'GHI'] = np.nan
print(df.head())
GHI
date
2018-01-01 00:00:00 NaN
2018-01-01 02:00:00 NaN
2018-01-01 04:00:00 NaN
2018-01-01 06:00:00 NaN
2018-01-01 08:00:00 NaN
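An alternative that avoids the Python loop is to compare the normalized index (timestamps floored to midnight) against ix directly; a sketch of the same assignment done with a boolean mask:
import numpy as np

# mark every row whose calendar date appears in ix
mask = df.index.normalize().isin(ix)
df.loc[mask, 'GHI'] = np.nan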

Pandas compute hourly average and set it at the middle of the interval

I want to compute the hourly mean for a time series of wind speed and direction, but I want to stamp the result at the half hour. So the average of the values from 14:00 to 15:00 will be at 14:30. Right now, I can only seem to get it at the left or right edge of the interval. Here is what I currently have:
ts_g=[item.replace(second=0, microsecond=0) for item in dates_g]
dg = {'ws': data_g.ws, 'wdir': data_g.wdir}
df_g = pandas.DataFrame(data=dg, index=ts_g, columns=['ws','wdir'])
grouped_g = df_g.groupby(pandas.TimeGrouper('H'))
hourly_ws_g = grouped_g['ws'].mean()
hourly_wdir_g = grouped_g['wdir'].mean()
the output for this looks like:
2016-04-08 06:00:00+00:00 46.980000
2016-04-08 07:00:00+00:00 64.313333
2016-04-08 08:00:00+00:00 75.678333
2016-04-08 09:00:00+00:00 127.383333
2016-04-08 10:00:00+00:00 145.950000
2016-04-08 11:00:00+00:00 184.166667
....
but I would like it to be like:
2016-04-08 06:30:00+00:00 54.556
2016-04-08 07:30:00+00:00 78.001
....
Thanks for your help!
So the easiest way is to resample and then use linear interpolation:
In [21]: rng = pd.date_range('1/1/2011', periods=72, freq='H')
In [22]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [23]: ts.head()
Out[23]:
2011-01-01 00:00:00 0.796704
2011-01-01 01:00:00 -1.153179
2011-01-01 02:00:00 -1.919475
2011-01-01 03:00:00 0.082413
2011-01-01 04:00:00 -0.397434
Freq: H, dtype: float64
In [24]: ts2 = ts.resample('30T').interpolate()
In [25]: ts2.head()
Out[25]:
2011-01-01 00:00:00 0.796704
2011-01-01 00:30:00 -0.178237
2011-01-01 01:00:00 -1.153179
2011-01-01 01:30:00 -1.536327
2011-01-01 02:00:00 -1.919475
Freq: 30T, dtype: float64
I believe this is what you need.
Edit to add clarifying example
Perhaps it's easier to see what's going on without random data:
In [29]: ts.head()
Out[29]:
2011-01-01 00:00:00 0
2011-01-01 01:00:00 1
2011-01-01 02:00:00 2
2011-01-01 03:00:00 3
2011-01-01 04:00:00 4
Freq: H, dtype: int64
In [30]: ts2 = ts.resample('30T').interpolate()
In [31]: ts2.head()
Out[31]:
2011-01-01 00:00:00 0.0
2011-01-01 00:30:00 0.5
2011-01-01 01:00:00 1.0
2011-01-01 01:30:00 1.5
2011-01-01 02:00:00 2.0
Freq: 30T, dtype: float64
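If only the half-hour stamps are wanted, they can be picked out afterwards. With hourly inputs, the linearly interpolated value at :30 is exactly the mean of the two neighbouring hourly readings:
# keep only the :30 stamps, i.e. the midpoints of each hourly pair
centered = ts2[ts2.index.minute == 30]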
This post is already several years old and uses an API that has long been deprecated. Modern pandas provides the resample method, which is easier to use than pandas.TimeGrouper, but it only offers left- and right-labelled intervals; labels centered on the middle of the interval are not readily available.
Yet this is not hard to do.
First we fill in the data that we want to resample:
import datetime
import pandas as pd

ts_g = [datetime.datetime.fromisoformat('2019-11-20') +
        datetime.timedelta(minutes=10 * x) for x in range(0, 100)]
dg = {'ws': range(0, 100), 'wdir': range(0, 100)}
df_g = pd.DataFrame(data=dg, index=ts_g, columns=['ws', 'wdir'])
df_g.head()
The output would be:
ws wdir
2019-11-20 00:00:00 0 0
2019-11-20 00:10:00 1 1
2019-11-20 00:20:00 2 2
2019-11-20 00:30:00 3 3
2019-11-20 00:40:00 4 4
Now we first resample to 30-minute intervals:
grouped_g = df_g.resample('30min')
halfhourly_ws_g = grouped_g['ws'].mean()
halfhourly_ws_g.head()
The output would be:
2019-11-20 00:00:00 1
2019-11-20 00:30:00 4
2019-11-20 01:00:00 7
2019-11-20 01:30:00 10
2019-11-20 02:00:00 13
Freq: 30T, Name: ws, dtype: int64
Finally the trick to get the centered intervals:
hourly_ws_g = halfhourly_ws_g.add(halfhourly_ws_g.shift(1)).div(2)\
                             .loc[halfhourly_ws_g.index.minute % 60 == 30]
hourly_ws_g.head()
This would produce the expected output:
2019-11-20 00:30:00 2.5
2019-11-20 01:30:00 8.5
2019-11-20 02:30:00 14.5
2019-11-20 03:30:00 20.5
2019-11-20 04:30:00 26.5
Freq: 60T, Name: ws, dtype: float64
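If all that's needed is the plain hourly mean stamped at the half hour, a simpler route (a sketch, reusing df_g from above) is to resample hourly and shift the resulting labels by 30 minutes; with evenly spaced data this matches the centered result above:
hourly_mean = df_g['ws'].resample('H').mean()
# relabel each hourly bin at its midpoint
hourly_mean.index = hourly_mean.index + pd.Timedelta(minutes=30)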

Python pandas resampling instantaneous hourly data to daily timestep including 00:00 of next day

Let's say I have a series of instantaneous temperature measurements (i.e. they capture the temperature at an exact moment in time).
index = pd.date_range('1/1/2000', periods=9, freq='6H')
series = pd.Series(range(9), index=index)
series
Out[130]:
2000-01-01 00:00:00 0
2000-01-01 06:00:00 1
2000-01-01 12:00:00 2
2000-01-01 18:00:00 3
2000-01-02 00:00:00 4
2000-01-02 06:00:00 5
2000-01-02 12:00:00 6
2000-01-02 18:00:00 7
2000-01-03 00:00:00 8
Freq: 6H, dtype: int64
I want to get an average of daily temperature. The problem is that I want to include 00:00:00 from both the current day and the next day in the average for the current day. For example, I want to average 2000-01-01 00:00:00 to 2000-01-02 00:00:00 inclusive. The pandas resample function will not include 2000-01-02 00:00:00 in the first day's bin because it's a different day.
I would imagine this situation comes up often when dealing with instantaneous measurements that need to be resampled. What's the solution?
setup
index = pd.date_range('1/1/2000', periods=9, freq='6H')
series = pd.Series(range(9), index=index)
series
2000-01-01 00:00:00 0
2000-01-01 06:00:00 1
2000-01-01 12:00:00 2
2000-01-01 18:00:00 3
2000-01-02 00:00:00 4
2000-01-02 06:00:00 5
2000-01-02 12:00:00 6
2000-01-02 18:00:00 7
2000-01-03 00:00:00 8
Freq: 6H, dtype: int64
solution
Take a rolling mean over five consecutive readings (at 6-hourly spacing, that's a full 24 hours including both midnights), then pick the value that lands on each midnight:
series.rolling(5).mean().resample('D').first()
2000-01-01 NaN
2000-01-02 2.0
2000-01-03 6.0
Freq: D, dtype: float64
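Note that each rolling window is labelled by its right edge, so the mean over 2000-01-01 (both midnights included) shows up under 2000-01-02; shift the index back one day if you want it stamped on the day it describes. A variant that does not depend on the sample count is a time-based window (a sketch; with closed='both' the window spans a full 24 hours inclusive of both endpoints, though the first day then shows a partial-window value instead of NaN):
# 24-hour window, inclusive of both endpoints, labelled at the right edge
series.rolling('24h', closed='both').mean().resample('D').first()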
