Truncating hourly Pandas Series to be full days only - python

I'm using Pandas 0.17.1 and I oftentimes encounter hourly Series data that contains partial days. It does not seem that there is any functionality built into pandas that permits you to discard values that correspond to incomplete segments of a coarser date offset on the boundaries of the Series data (I would only like to discard partial data that exist at the beginning and/or the end of the Series).
My intuition, given the above, is that I would have to code something up to abstract the criterion (e.g. groupby with a count aggregation, discard hours in days with < 24 hours):
>> hist_data.groupby(lambda x: x.date()).agg('count')
2007-01-01 23
2007-01-02 24
...
An example of desired behavior:
>> hourly_data
2016-01-01 04:00:00 0.603820
2016-01-01 05:00:00 0.806696
2016-01-01 06:00:00 0.938521
2016-01-01 07:00:00 0.781834
2016-01-01 08:00:00 0.154952
...
2016-01-03 22:00:00 0.082177
2016-01-03 23:00:00 0.753210
2016-01-04 00:00:00 0.458402
2016-01-04 01:00:00 0.649496
2016-01-04 02:00:00 0.525321
2016-01-04 03:00:00 0.242605
Freq: H, dtype: float64
>> remove_partial_boundary_data(hourly_data)
2016-01-02 00:00:00 0.833063
2016-01-02 01:00:00 0.131586
2016-01-02 02:00:00 0.876609
2016-01-02 03:00:00 0.319436
2016-01-02 04:00:00 0.056246
...
2016-01-03 20:00:00 0.405725
2016-01-03 21:00:00 0.541096
2016-01-03 22:00:00 0.082177
2016-01-03 23:00:00 0.753210
Freq: H, dtype: float64
However, if my timezone is anything other than UTC (timezone-aware), the approach suggested above seems fraught with peril because counts of hours on DST transition days would be either 23 or 25.
Does anyone know of a clever or built-in way to handle this?

You can do this with a groupby and filter on groups that are not complete. To check for completeness, I first reindexed the data and then checked if there are NaN values:
In [10]: hourly_data = pd.Series(np.random.randn(72), index=pd.date_range('2016-01-01 04:00', periods=72, freq='H'))
In [11]: new_idx = pd.date_range(hourly_data.index[0].date(), hourly_data.index[-1].date() + pd.Timedelta('1 day'), freq='H')
In [12]: hourly_data.reindex(new_idx)
Out[12]:
2016-01-01 00:00:00 NaN
2016-01-01 01:00:00 NaN
2016-01-01 02:00:00 NaN
2016-01-01 03:00:00 NaN
2016-01-01 04:00:00 -0.941332
2016-01-01 05:00:00 1.802739
2016-01-01 06:00:00 0.798968
2016-01-01 07:00:00 -0.444979
...
2016-01-04 17:00:00 NaN
2016-01-04 18:00:00 NaN
2016-01-04 19:00:00 NaN
2016-01-04 20:00:00 NaN
2016-01-04 21:00:00 NaN
2016-01-04 22:00:00 NaN
2016-01-04 23:00:00 NaN
2016-01-05 00:00:00 NaN
Freq: H, dtype: float64
This resulted in a timeseries that includes all hours of the dates that are present in the timeseries. This way, we can check if a date was complete by checking if there are NaN values for that date (this method should work for DST transitions), and filter with this criterion:
In [13]: hourly_data.reindex(new_idx).groupby(lambda x: x.date()).filter(lambda x: x.isnull().sum() == 0)
Out[13]:
2016-01-02 00:00:00 -1.231445
2016-01-02 01:00:00 2.371690
2016-01-02 02:00:00 -0.695448
2016-01-02 03:00:00 0.745308
2016-01-02 04:00:00 0.814579
2016-01-02 05:00:00 1.345674
2016-01-02 06:00:00 -1.491470
2016-01-02 07:00:00 0.407182
...
2016-01-03 16:00:00 -0.742151
2016-01-03 17:00:00 0.677229
2016-01-03 18:00:00 0.832271
2016-01-03 19:00:00 -0.183729
2016-01-03 20:00:00 1.938594
2016-01-03 21:00:00 -0.816370
2016-01-03 22:00:00 1.745757
2016-01-03 23:00:00 0.223487
Freq: H, dtype: float64
ORIGINAL ANSWER
You can do this with resample by providing a custom function, and in that function you can then specify that NaN values should not be skipped.
Short answer:
hist_data.resample('D', how=lambda x: x.mean(skipna=False))
if the missing hours are already present as NaNs. Otherwise, you can first resample it to a regular hourly series:
hist_data.resample('H').resample('D', how=lambda x: x.mean(skipna=False))
Long answer with an example. With some dummy data (and I insert a NaN in one of the days):
In [77]: hist_data = pd.Series(np.random.randn(72), index=pd.date_range('2016-01-01', periods=72, freq='H'))
In [78]: hist_data
Out[78]:
2016-01-01 00:00:00 -0.717624
2016-01-01 01:00:00 0.029151
2016-01-01 02:00:00 0.535843
...
2016-01-03 21:00:00 0.659923
2016-01-03 22:00:00 -1.085640
2016-01-03 23:00:00 0.571347
Freq: H, dtype: float64
In [80]: hist_data.iloc[30] = np.nan
With count you can see that there is indeed one missing value for the second day:
In [81]: hist_data.resample('D', how='count')
Out[81]:
2016-01-01 24
2016-01-02 23
2016-01-03 24
Freq: D, dtype: int64
By default, 'mean' will ignore this NaN value:
In [83]: hist_data.resample('D', how='mean')
Out[83]:
2016-01-01 0.106537
2016-01-02 -0.112774
2016-01-03 -0.292248
Freq: D, dtype: float64
But you can change this behaviour with the skipna keyword argument:
In [82]: hist_data.resample('D', how=lambda x: x.mean(skipna=False))
Out[82]:
2016-01-01 0.106537
2016-01-02 NaN
2016-01-03 -0.292248
Freq: D, dtype: float64

Related

how to calculate percent change rate on a multi-date dataframe elegantly?

I have a dataframe, which index is datetime. it contains a columns - price
In [9]: df = pd.DataFrame({'price':[3,5,6,10,11]}, index=pd.to_datetime(['2016-01-01 14:58:00',
'2016-01-01 14:58:00', '2016-01-01 14:58:00', '2016-01-02 09:30:00', '2016-01-02 09:31:00']))
...:
In [10]: df
Out[10]:
price
2016-01-01 14:58:00 3
2016-01-01 14:58:00 5
2016-01-01 14:58:00 6
2016-01-02 09:30:00 10
2016-01-02 09:31:00 11
I want to calculate the next return(price percent change rate for some time intervals).
dataframe have a pct_change() function can calculate the change rate.
In [12]: df['price'].pct_change().shift(-1)
Out[12]:
2016-01-01 14:58:00 0.666667
2016-01-01 14:58:00 0.200000
2016-01-01 14:58:00 0.666667
2016-01-02 09:30:00 0.100000
2016-01-02 09:31:00 NaN
Name: price, dtype: float64
but, i want the cross date element to be nan
which means, i want df['pct_change'].loc['2016-01-01 14:58:00'] to be nan, because it calculate the pct_change using tomw's data(2016-01-02 09:30:00)
the expected output:
2016-01-01 14:58:00 0.666667
2016-01-01 14:58:00 0.200000
2016-01-01 14:58:00 NaN
2016-01-02 09:30:00 0.100000
2016-01-02 09:31:00 NaN
Name: price, dtype: float64
i can make a mask to filter those out. but i think this solution is not elegant enough, is there any suggestions?
You can use GroupBy.apply by DatetimeIndex.date:
s1 = df.groupby(df.index.date)['price'].apply(lambda x: x.pct_change().shift(-1))
print (s1)
2016-01-01 14:58:00 0.666667
2016-01-01 14:58:00 0.200000
2016-01-01 14:58:00 NaN
2016-01-02 09:30:00 0.100000
2016-01-02 09:31:00 NaN
Name: price, dtype: float64

Replace "flatline" repeated data in Pandas series with nan

I have hourly time series data stored in a pandas series. Similar to this example:
import pandas as pd
import numpy as np
date_rng = pd.date_range(start='1/1/2019', end='1/2/2019', freq='H')
data = np.random.uniform(180,182,size=(len(date_rng)))
timeseries = pd.Series(data, index=date_rng)
timeseries.iloc[4:12] = 181.911
At three decimal places, it is highly unlikely the data will be exactly the same for more than, say, 3 hours in a row. When this flatlining occurs, it indicates an issue with the sensor. So I want to detect repeated data and replace it with nan values (i.e., detect the repeated values 181.911 in the above and replace with nan)
I assume I can iterate over the time series and detect/replace that way, but is there a more efficient way to do this?
You can do it with diff, but the first occurrence retain in the series.
timeseries.where(timeseries.diff(1)!=0.0,np.nan)
2019-01-01 00:00:00 180.539278
2019-01-01 01:00:00 181.509729
2019-01-01 02:00:00 180.740326
2019-01-01 03:00:00 181.736425
2019-01-01 04:00:00 181.911000
2019-01-01 05:00:00 NaN
2019-01-01 06:00:00 NaN
2019-01-01 07:00:00 NaN
2019-01-01 08:00:00 NaN
2019-01-01 09:00:00 NaN
2019-01-01 10:00:00 NaN
2019-01-01 11:00:00 NaN
2019-01-01 12:00:00 180.093216
2019-01-01 13:00:00 180.623440
First occurrence also can be removed using diff(-1) and diff(1):
np.c_[timeseries.where(timeseries.diff(-1)!=0.0,np.nan), timeseries.where(timeseries.diff(1)!=0.0,np.nan)].mean(axis=1)
It works when repetitions are sequential in series.
With following reasonably efficient function one can choose the minimum number of repeated values to consider as flatline:
import numpy as np
def remove_flatlines(ts, threshold):
# get start and end indices of each flatline as an n x 2 array
isflat = np.concatenate(([False], np.isclose(ts.diff(), 0), [False]))
isedge = isflat[1:] != isflat[:-1]
flatrange = np.where(isedge)[0].reshape(-1, 2)
# include also first value of each flatline
flatrange[:, 0] -= 1
# remove flatlines with at least threshold number of equal values
ts = ts.copy()
for j in range(len(flatrange)):
if flatrange[j][1] - flatrange[j][0] >= threshold:
ts.iloc[flatrange[j][0]:flatrange[j][1]] = np.nan
return ts
Applied to example:
remove_flatlines(timeseries, threshold=3)
2019-01-01 00:00:00 181.447940
2019-01-01 01:00:00 180.142692
2019-01-01 02:00:00 180.994674
2019-01-01 03:00:00 180.116489
2019-01-01 04:00:00 NaN
2019-01-01 05:00:00 NaN
2019-01-01 06:00:00 NaN
2019-01-01 07:00:00 NaN
2019-01-01 08:00:00 NaN
2019-01-01 09:00:00 NaN
2019-01-01 10:00:00 NaN
2019-01-01 11:00:00 NaN
2019-01-01 12:00:00 180.972644
2019-01-01 13:00:00 181.969759
2019-01-01 14:00:00 181.008693
2019-01-01 15:00:00 180.769328
2019-01-01 16:00:00 180.576061
2019-01-01 17:00:00 181.562315
2019-01-01 18:00:00 181.978567
2019-01-01 19:00:00 181.928330
2019-01-01 20:00:00 180.773995
2019-01-01 21:00:00 180.475290
2019-01-01 22:00:00 181.460028
2019-01-01 23:00:00 180.220693
2019-01-02 00:00:00 181.630176
Freq: H, dtype: float64

Pandas computer hourly average and set at middle of interval

I want to compute the hourly mean for a time series of wind speed and direction, but I want to set the time at the half hour. So, the average for values from 14:00 to 15:00 will be at 14:30. Right now, I can only seem to get it on left or right of the interval. Here is what I currently have:
ts_g=[item.replace(second=0, microsecond=0) for item in dates_g]
dg = {'ws': data_g.ws, 'wdir': data_g.wdir}
df_g = pandas.DataFrame(data=dg, index=ts_g, columns=['ws','wdir'])
grouped_g = df_g.groupby(pandas.TimeGrouper('H'))
hourly_ws_g = grouped_g['ws'].mean()
hourly_wdir_g = grouped_g['wdir'].mean()
the output for this looks like:
2016-04-08 06:00:00+00:00 46.980000
2016-04-08 07:00:00+00:00 64.313333
2016-04-08 08:00:00+00:00 75.678333
2016-04-08 09:00:00+00:00 127.383333
2016-04-08 10:00:00+00:00 145.950000
2016-04-08 11:00:00+00:00 184.166667
....
but I would like it to be like:
2016-04-08 06:30:00+00:00 54.556
2016-04-08 07:30:00+00:00 78.001
....
Thanks for your help!
So the easiest way is to resample and then use linear interpolation:
In [21]: rng = pd.date_range('1/1/2011', periods=72, freq='H')
In [22]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
...:
In [23]: ts.head()
Out[23]:
2011-01-01 00:00:00 0.796704
2011-01-01 01:00:00 -1.153179
2011-01-01 02:00:00 -1.919475
2011-01-01 03:00:00 0.082413
2011-01-01 04:00:00 -0.397434
Freq: H, dtype: float64
In [24]: ts2 = ts.resample('30T').interpolate()
In [25]: ts2.head()
Out[25]:
2011-01-01 00:00:00 0.796704
2011-01-01 00:30:00 -0.178237
2011-01-01 01:00:00 -1.153179
2011-01-01 01:30:00 -1.536327
2011-01-01 02:00:00 -1.919475
Freq: 30T, dtype: float64
In [26]:
I believe this is what you need.
Edit to add clarifying example
Perhaps it's easier to see what's going on without random Data:
In [29]: ts.head()
Out[29]:
2011-01-01 00:00:00 0
2011-01-01 01:00:00 1
2011-01-01 02:00:00 2
2011-01-01 03:00:00 3
2011-01-01 04:00:00 4
Freq: H, dtype: int64
In [30]: ts2 = ts.resample('30T').interpolate()
In [31]: ts2.head()
Out[31]:
2011-01-01 00:00:00 0.0
2011-01-01 00:30:00 0.5
2011-01-01 01:00:00 1.0
2011-01-01 01:30:00 1.5
2011-01-01 02:00:00 2.0
Freq: 30T, dtype: float64
This post is already several years old and uses the API that has long been deprecated. Modern Pandas already provides the resample method that is easier to use than pandas.TimeGrouper. Yet it allows only left and right labelled intervals but getting the intervals centered at the middle of the interval is not readily available.
Yet this is not hard to do.
First we fill in the data that we want to resample:
ts_g=[datetime.datetime.fromisoformat('2019-11-20') +
datetime.timedelta(minutes=10*x) for x in range(0,100)]
dg = {'ws': range(0,100), 'wdir': range(0,100)}
df_g = pd.DataFrame(data=dg, index=ts_g, columns=['ws','wdir'])
df_g.head()
The output would be:
ws wdir
2019-11-20 00:00:00 0 0
2019-11-20 00:10:00 1 1
2019-11-20 00:20:00 2 2
2019-11-20 00:30:00 3 3
2019-11-20 00:40:00 4 4
Now we first resample to 30 minute intervals
grouped_g = df_g.resample('30min')
halfhourly_ws_g = grouped_g['ws'].mean()
halfhourly_ws_g.head()
The output would be:
2019-11-20 00:00:00 1
2019-11-20 00:30:00 4
2019-11-20 01:00:00 7
2019-11-20 01:30:00 10
2019-11-20 02:00:00 13
Freq: 30T, Name: ws, dtype: int64
Finally the trick to get the centered intervals:
hourly_ws_g = halfhourly_ws_g.add(halfhourly_ws_g.shift(1)).div(2)\
.loc[halfhourly_ws_g.index.minute % 60 == 30]
hourly_ws_g.head()
This would produce the expected output:
2019-11-20 00:30:00 2.5
2019-11-20 01:30:00 8.5
2019-11-20 02:30:00 14.5
2019-11-20 03:30:00 20.5
2019-11-20 04:30:00 26.5
Freq: 60T, Name: ws, dtype: float64

rolling_sum on business day and return new dataframe with date as index

I have such a DataFrame:
A
2016-01-01 00:00:00 0
2016-01-01 12:00:00 1
2016-01-02 00:00:00 2
2016-01-02 12:00:00 3
2016-01-03 00:00:00 4
2016-01-03 12:00:00 5
2016-01-04 00:00:00 6
2016-01-04 12:00:00 7
2016-01-05 00:00:00 8
2016-01-05 12:00:00 9
The reason I separate 2016-01-02 00:00:00 to 2016-01-03 12:00:00 is that, those two days are weekends.
So here is what I wish to do:
I wish to rolling_sum with window = 2 business days.
For example, I wish to sum
A
2016-01-04 00:00:00 6
2016-01-04 12:00:00 7
2016-01-05 00:00:00 8
2016-01-05 12:00:00 9
and then sum (we skip any non-business days)
A
2016-01-01 00:00:00 0
2016-01-01 12:00:00 1
2016-01-04 00:00:00 6
2016-01-04 12:00:00 7
And the result is
A
2016-01-01 Nan
2016-01-04 14
2016-01-05 30
How can I achieve that?
I tried rolling_sum(df, window=2, freq=BDay(1)), it seems it just pick one row of the same day, but not sum the two rows (00:00 and 12:00) within the same day.
You could first select only business days, resample to (business) daily frequency for the remaining data points and sum, and then apply rolling_sum:
Starting with some sample data:
df = pd.DataFrame(data={'A': np.random.randint(0, 10, 500)}, index=pd.date_range(datetime(2016,1,1), freq='6H', periods=500))
A
2016-01-01 00:00:00 6
2016-01-01 06:00:00 9
2016-01-01 12:00:00 3
2016-01-01 18:00:00 9
2016-01-02 00:00:00 7
2016-01-02 06:00:00 5
2016-01-02 12:00:00 8
2016-01-02 18:00:00 6
2016-01-03 00:00:00 2
2016-01-03 06:00:00 0
2016-01-03 12:00:00 0
2016-01-03 18:00:00 0
2016-01-04 00:00:00 5
2016-01-04 06:00:00 4
2016-01-04 12:00:00 1
2016-01-04 18:00:00 4
2016-01-05 00:00:00 6
2016-01-05 06:00:00 9
2016-01-05 12:00:00 7
2016-01-05 18:00:00 2
....
First select the values on business days:
tsdays = df.index.values.astype('<M8[D]')
bdays = pd.bdate_range(tsdays[0], tsdays[-1]).values.astype('<M8[D]')
df = df[np.in1d(tsdays, bdays)]
Then apply rolling_sum() to the resampled data, where each value represents the sum for an individual business day:
pd.rolling_sum(df.resample('B', how='sum'), window=2)
to get:
A
2016-01-01 NaN
2016-01-04 41
2016-01-05 38
2016-01-06 56
2016-01-07 52
2016-01-08 37
See also [here] for the type conversion and 1[this question]2 for the business day extraction.

python pandas date_range when importing with read_csv

I have a .csv file with some data in the following format:
1.69511909, 0.57561167, 0.31437427, 0.35458831, 0.15841189, 0.28239582, -0.18180907, 1.34761404, -1.5059083, 1.29246638
-1.66764664, 0.1488095, 1.03832221, -0.35229205, 1.35705861, -1.56747104, -0.36783851, -0.57636948, 0.9854391, 1.63031066
0.87763775, 0.60757153, 0.64908314, -0.68357724, 0.33499838, -0.08557089, 1.71855596, -0.61235066, -0.32520105, 1.54162629
Every line corresponds to a specific day, and every record in a line corresponds to a specific hour in that day.
Is there is a convenient way of importing the data with read_csv such that everything would be correctly indexed, i.e. the importing function would discriminate different days (lines), and hours within days (separate records in lines).
Something like this. Note that I couldn't copy your string for some reason, so my dataset is cutoff....
Read in the string (it reads as a dataframe because mine had newlines in it)....but need to coerce to a Series.
In [23]: s = pd.read_csv(StringIO(data)).values
In [24]: s
Out[24]:
array([[-1.66764664, 0.1488095 , 1.03832221, -0.35229205, 1.35705861,
-1.56747104, -0.36783851, -0.57636948, 0.9854391 , 1.63031066],
[ 0.87763775, 0.60757153, 0.64908314, -0.68357724, 0.33499838,
-0.08557089, 1.71855596, nan, nan, nan]])
In [25]: s = Series(pd.read_csv(StringIO(data)).values.ravel())
In [26]: s
Out[26]:
0 -1.667647
1 0.148810
2 1.038322
3 -0.352292
4 1.357059
5 -1.567471
6 -0.367839
7 -0.576369
8 0.985439
9 1.630311
10 0.877638
11 0.607572
12 0.649083
13 -0.683577
14 0.334998
15 -0.085571
16 1.718556
17 NaN
18 NaN
19 NaN
dtype: float64
Just set the index directly....Note that you are solely responsible for alignment, this is VERY
easy to be say off by 1
In [27]: s.index = pd.date_range('20130101',freq='H',periods=len(s))
In [28]: s
Out[28]:
2013-01-01 00:00:00 -1.667647
2013-01-01 01:00:00 0.148810
2013-01-01 02:00:00 1.038322
2013-01-01 03:00:00 -0.352292
2013-01-01 04:00:00 1.357059
2013-01-01 05:00:00 -1.567471
2013-01-01 06:00:00 -0.367839
2013-01-01 07:00:00 -0.576369
2013-01-01 08:00:00 0.985439
2013-01-01 09:00:00 1.630311
2013-01-01 10:00:00 0.877638
2013-01-01 11:00:00 0.607572
2013-01-01 12:00:00 0.649083
2013-01-01 13:00:00 -0.683577
2013-01-01 14:00:00 0.334998
2013-01-01 15:00:00 -0.085571
2013-01-01 16:00:00 1.718556
2013-01-01 17:00:00 NaN
2013-01-01 18:00:00 NaN
2013-01-01 19:00:00 NaN
Freq: H, dtype: float64
First just read in the DataFrame:
df = pd.read_csv(file_name, sep=',\s+', header=None)
Then set the index to be the dates and the columns to be the hours*
df.index = pd.date_range('2012-01-01', freq='D', periods=len(df))
from pandas.tseries.offsets import Hour
df.columns = [Hour(7+t) for t in df.columns]
In [5]: df
Out[5]:
<7 Hours> <8 Hours> <9 Hours> <10 Hours> <11 Hours> <12 Hours> <13 Hours> <14 Hours> <15 Hours> <16 Hours>
2012-01-01 1.695119 0.575612 0.314374 0.354588 0.158412 0.282396 -0.181809 1.347614 -1.505908 1.292466
2012-01-02 -1.667647 0.148810 1.038322 -0.352292 1.357059 -1.567471 -0.367839 -0.576369 0.985439 1.630311
2012-01-03 0.877638 0.607572 0.649083 -0.683577 0.334998 -0.085571 1.718556 -0.612351 -0.325201 1.541626
Then stack it and add the Date and the Hour levels of the MultiIndex:
s = df.stack()
s.index = [x[0]+x[1] for x in s.index]
In [8]: s
Out[8]:
2012-01-01 07:00:00 1.695119
2012-01-01 08:00:00 0.575612
2012-01-01 09:00:00 0.314374
2012-01-01 10:00:00 0.354588
2012-01-01 11:00:00 0.158412
2012-01-01 12:00:00 0.282396
2012-01-01 13:00:00 -0.181809
2012-01-01 14:00:00 1.347614
2012-01-01 15:00:00 -1.505908
2012-01-01 16:00:00 1.292466
2012-01-02 07:00:00 -1.667647
2012-01-02 08:00:00 0.148810
...
* You can use different offsets, see here, e.g. Minute, Second.

Categories

Resources