Resampling with pandas.MultiIndex: Resampler.aggregate() & Resampler[column]

I am trying to resample a data frame. First, I want to keep several aggregations in the result. Second, there is an additional aggregation of interest for one specific column. Since this aggregation is only relevant to a single column, I want to limit the resampler to that column so the aggregation is not applied unnecessarily to the other columns.
This scenario is working for a simple one-dimensional column index:
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.random.rand(50,4), index=pd.to_datetime(np.arange(0, 50), unit="s"), columns=["a", "b", "c", "d"])
r = df.resample("10s")
result = r.aggregate(["mean", "std"])
result[("d", "ffill")] = r["d"].ffill()
print(result)
However, as soon as I start to use multi-indexed columns, problems appear. First, I cannot keep several aggregations at once:
df.columns = pd.MultiIndex.from_product([("a", "b"), ("alpha", "beta")])
r = df.resample("10s") # can be omitted
result = r.aggregate(["mean", "std"])
---> AttributeError: 'Series' object has no attribute 'columns'
Second, the resampler can no longer be limited to the relevant column:
r[("b", "beta")].ffill()
---> KeyError: "Columns not found: 'b', 'beta'"
How can I carry this approach over from simple indices to multi-indices?

You can use pd.Grouper in a groupby instead of resample, for example:
result = df.groupby(pd.Grouper(freq='10s',level=0)).aggregate(["mean", "std"])
print(result)
a b \
alpha beta alpha
mean std mean std mean
1970-01-01 00:00:00 0.460569 0.312508 0.476511 0.260534 0.479577
1970-01-01 00:00:10 0.441498 0.315277 0.487855 0.306068 0.535842
1970-01-01 00:00:20 0.569884 0.248503 0.320552 0.288479 0.507755
1970-01-01 00:00:30 0.478037 0.262654 0.552214 0.251581 0.505132
1970-01-01 00:00:40 0.611227 0.328916 0.473773 0.241604 0.358298
beta
std mean std
1970-01-01 00:00:00 0.357493 0.448487 0.294432
1970-01-01 00:00:10 0.259145 0.472250 0.320954
1970-01-01 00:00:20 0.369490 0.432944 0.150473
1970-01-01 00:00:30 0.298759 0.381614 0.248785
1970-01-01 00:00:40 0.203831 0.381412 0.374965
And for the second part, I'm not sure exactly what you mean, but going by the result in the single-level case, this gives a result:
result[("b", "beta", "ffill")] = df.groupby(pd.Grouper(freq='10s', level=0))[("b", "beta")].first()

It must be a bug in aggregate. A workaround would be stack:
(df.stack()
   .groupby(level=-1)
   .apply(lambda x: x.resample('10s', level=0).aggregate(["mean", "std"]))
   .unstack(level=0)
)
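If that bug is in play, another workaround (a sketch under the same setup as the question) is to run each aggregation separately, since single-function aggregation also works on MultiIndex columns, and then stitch the pieces together:

r = df.resample("10s")
# Aggregate one function at a time; the dict keys become a new outermost
# column level holding the aggregation name:
result = pd.concat({agg: r.aggregate(agg) for agg in ["mean", "std"]}, axis=1)
# Move the aggregation name to the innermost level to mirror the
# single-level layout:
result = result.reorder_levels([1, 2, 0], axis=1).sort_index(axis=1)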

Related

Groupby number of hours in a month in pandas

Could someone please explain how to group an hourly index by month to find how many hours of null values there are in each specific month? I am therefore thinking of ending up with a dataframe with a monthly index.
Below is the dataframe, which has a timestamp index and another column that occasionally has null values.
timestamp            rel_humidity
1999-09-27 05:00:00  82.875
1999-09-27 06:00:00  83.5
1999-09-27 07:00:00  83.0
1999-09-27 08:00:00  80.6
1999-09-27 09:00:00  nan
1999-09-27 10:00:00  nan
1999-09-27 11:00:00  nan
1999-09-27 12:00:00  nan
I tried this but the resulting dataframe is not what I expected.
gap_in_month = OG_1998_2022_gaps.groupby(OG_1998_2022_gaps.index.month, OG_1998_2022_gaps.index.year).count()
I always struggle with groupby functions. I would highly appreciate any help. Thanks in advance!
If you need 0 for months with no missing values, create a mask with Series.isna, convert the DatetimeIndex to month periods with DatetimeIndex.to_period, and aggregate with sum (the mask's True values are counted as 1), or alternatively use pd.Grouper:
gap_in_month = (OG_1998_2022_gaps['rel_humidity'].isna()
                .groupby(OG_1998_2022_gaps.index.to_period('m')).sum())
gap_in_month = (OG_1998_2022_gaps['rel_humidity'].isna()
                .groupby(pd.Grouper(freq='m')).sum())
If you need only the matched rows, the solution is similar, but first filter with boolean indexing and then aggregate counts with GroupBy.size:
gap_in_month = (OG_1998_2022_gaps[OG_1998_2022_gaps['rel_humidity'].isna()]
                .groupby(OG_1998_2022_gaps.index.to_period('m')).size())
gap_in_month = (OG_1998_2022_gaps[OG_1998_2022_gaps['rel_humidity'].isna()]
                .groupby(pd.Grouper(freq='m')).size())
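A runnable sketch of the first variant, rebuilding the sample rows from the question (the frame below is only an illustration):

import numpy as np
import pandas as pd

OG_1998_2022_gaps = pd.DataFrame(
    {"rel_humidity": [82.875, 83.5, 83.0, 80.6,
                      np.nan, np.nan, np.nan, np.nan]},
    index=pd.date_range("1999-09-27 05:00", periods=8, freq="H"),
)
# Four NaN hours, all in September 1999:
gap_in_month = (OG_1998_2022_gaps['rel_humidity'].isna()
                .groupby(OG_1998_2022_gaps.index.to_period('m')).sum())
print(gap_in_month)
# 1999-09    4
# Freq: M, Name: rel_humidity, dtype: int64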
An alternative to groupby, but (in my opinion) much nicer, is to use pd.Series.resample:
import numpy as np
import pandas as pd

# Some sample data with a DatetimeIndex:
series = pd.Series(
    np.random.choice([1.0, 2.0, 3.0, np.nan], size=2185),
    index=pd.date_range(start="1999-09-26", end="1999-12-26", freq="H")
)
# Solution:
series.isna().resample("M").sum()
# Note that GroupBy.count and Resampler.count count the number of non-null values,
# whereas you seem to be looking for the opposite :)
In your case:
OG_1998_2022_gaps['rel_humidity'].isna().resample("M").sum()
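To make the count/sum distinction above concrete, here is a small side-by-side (a sketch reusing the sample series from this answer):

# count() = non-null values, size() = all rows, isna().sum() = nulls,
# so non_null + nulls == rows for every month:
monthly = pd.DataFrame({
    "non_null": series.resample("M").count(),
    "rows": series.resample("M").size(),
    "nulls": series.isna().resample("M").sum(),
})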

How can I set the frequency of a pandas index?

So this is my code basically:
df = pd.read_csv('XBT_60.csv', index_col = 'date', parse_dates = True)
df.index.freq = 'H'
I load a csv, set the index to the date column and want to set the frequency to 'H'. But this raises this error:
ValueError: Inferred frequency None from passed values does not conform to passed frequency H
The format of the dates column is: 2017-01-01 00:00:00
I already tried loading the csv without setting the index column, and used pd.to_datetime on the dates column before setting it as the index, but I am still unable to set the frequency. How can I solve this?
BTW: my aim is to use the seasonal_decompose() method from statsmodels, so I need the frequency there.
You can't set frequency if you have missing index values:
>>> df
val
2019-09-15 0
2019-09-16 1
2019-09-18 3
>>> df.index.freq = 'D'
...
ValueError: Inferred frequency None from passed values does not conform to passed frequency D
To fix it, fill in the missing dates first, for example with resample:
>>> df = df.resample('D').first()
>>> df
            val
2019-09-15  0.0
2019-09-16  1.0
2019-09-17  NaN
2019-09-18  3.0
>>> df.index.freq
<Day>
To debug, you can list the missing dates explicitly:
>>> pd.date_range(df.index.min(), df.index.max(), freq='D').difference(df.index)
DatetimeIndex(['2019-09-17'], dtype='datetime64[ns]', freq=None)
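A related note (a sketch using the df from the question): DataFrame.asfreq does both steps at once, inserting any missing stamps as NaN and setting index.freq, so for the hourly data in the question this may be all that is needed before seasonal_decompose:

df = df.asfreq('H')   # inserts missing hourly stamps as NaN and sets the freq
print(df.index.freq)  # <Hour>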

Split hourly time-series in pandas DataFrame into specific dates and all other dates

I have a time-series in a pandas DataFrame at hourly frequency:
import pandas as pd
import numpy as np
idx = pd.date_range(freq="h", start="2018-01-01", periods=365*24)
df = pd.DataFrame({'value': np.random.rand(365*24)}, index=idx)
I have a list of dates:
dates = ['2018-03-20', '2018-04-08', '2018-07-14']
I want to end up with two DataFrames: one containing just the data for these dates, and one containing all of the data from the original DataFrame excluding all the data for these dates. In this case, I would have a DataFrame containing three days worth of data (for the days listed in dates), and a DataFrame containing 362 days data (all the data excluding those three days).
What is the best way to do this in pandas?
I can take advantage of nice string-based datetime indexing in pandas to extract each date separately, for example:
df[dates[0]]
and I can use this to put together a DataFrame containing just the specified dates:
to_concat = [df[date] for date in dates]
just_dates = pd.concat(to_concat)
This isn't as 'nice' as it could be, but does the job.
However, I can't work out how to remove those dates from the DataFrame to get the other output that I want. Doing:
df[~dates[0]]
gives a TypeError: bad operand type for unary ~: 'str', and I can't seem to get df.drop to work in this context.
What do you suggest as a nice, Pythonic and 'pandas-like' way to go about this?
Create a boolean mask with numpy.in1d on the dates converted to strings, or with Index.isin for a membership test:
m = np.in1d(df.index.date.astype(str), dates)
m = df.index.to_series().dt.date.astype(str).isin(dates)
Or use DatetimeIndex.strftime for strings:
m = df.index.strftime('%Y-%m-%d').isin(dates)
Another idea is to remove the times with DatetimeIndex.normalize, which yields a DatetimeIndex:
m = df.index.normalize().isin(dates)
#alternative
#m = df.index.floor('d').isin(dates)
Last, filter by boolean indexing:
df1 = df[m]
And for the second DataFrame, invert the mask with ~:
df2 = df[~m]
print(df1)
value
2018-03-20 00:00:00 0.348010
2018-03-20 01:00:00 0.406394
2018-03-20 02:00:00 0.944569
2018-03-20 03:00:00 0.425583
2018-03-20 04:00:00 0.586190
...
2018-07-14 19:00:00 0.710710
2018-07-14 20:00:00 0.403660
2018-07-14 21:00:00 0.949572
2018-07-14 22:00:00 0.629871
2018-07-14 23:00:00 0.363081
[72 rows x 1 columns]
One way to solve this:
df = df.reset_index()
with_date = df[df['index'].dt.date.astype(str).isin(dates)].set_index('index')
##use del with_date.index.name to remove the index name, if required
without_date = df[~df['index'].dt.date.astype(str).isin(dates)].set_index('index')
##with_date
value
index
2018-03-20 00:00:00 0.059623
2018-03-20 01:00:00 0.343513
...
##without_date
value
index
2018-01-01 00:00:00 0.087846
2018-01-01 01:00:00 0.481971
...
Another way to solve this:
Keep your dates in datetime format, for example through a pd.Timestamp:
dates_in_dt_format = [pd.Timestamp(date).date() for date in dates]
Then, keep only the rows where the index's date is not in that group, for example with:
df_without_dates = df.loc[[idx for idx in df.index if idx.date() not in dates_in_dt_format]]
df_with_dates = df.loc[[idx for idx in df.index if idx.date() in dates_in_dt_format]]
Or using pandas apply instead of a list comprehension:
df_with_dates = df[df.index.to_series().apply(lambda x: pd.Timestamp(x).date()).isin(dates_in_dt_format)]
df_without_dates = df[~df.index.to_series().apply(lambda x: pd.Timestamp(x).date()).isin(dates_in_dt_format)]
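For larger frames, the same membership test can also be done without a Python-level loop (a sketch reusing dates_in_dt_format from above):

mask = pd.Index(df.index.date).isin(dates_in_dt_format)
df_with_dates = df[mask]
df_without_dates = df[~mask]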

Down sampling problem: difference between resample and asfreq

I am playing with a time series dataframe defined as df using pandas.
I've already changed the row index as datetime index using set_index.
I want to downsample a dataframe sampled at 5-second intervals using resample or asfreq.
Let's say I downsample to 1 hour.
df_inst = df.asfreq('1H')
df_inst2 = df.resample('1H')
When I execute the code written above, asfreq gives me the right data frame downsampled to a 1 h interval, which is exactly what I expected to see.
However, resample doesn't generate a dataframe, and there is no error message either.
When I inspect it using print, I get the following message.
print(df_inst2)
DatetimeIndexResampler [freq=<Hour>, axis=0, closed=left, label=left, convention=start, base=0]
What am I missing?
More specifically, how can I get the same results using resample as I got with asfreq?
Thank you in advance.
DataFrame.resample returns a Resampler object while DataFrame.asfreq returns the converted data.
If you want to use resample correctly, call it together with a specific method, for instance: df.resample('1H').asfreq().
Example from the docs:
>>> index = pd.date_range('1/1/2000', periods=9, freq='T')
>>> series = pd.Series(range(9), index=index)
>>> series.resample('30S').asfreq().head(5)
2000-01-01 00:00:00 0.0
2000-01-01 00:00:30 NaN
2000-01-01 00:01:00 1.0
2000-01-01 00:01:30 NaN
2000-01-01 00:02:00 2.0
Freq: 30S, dtype: float64
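Applied to the 5-second data from the question, the difference is what happens to the samples between the hour marks (a sketch):

df_inst = df.asfreq('1H')           # picks the value sitting exactly on each hour mark
df_mean = df.resample('1H').mean()  # aggregates all 5-second samples within each hour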

Pandas reindex and interpolate time series efficiently (reindex drops data)

Suppose I wish to re-index, with linear interpolation, a time series to a pre-defined index, where none of the index values are shared between old and new index. For example
# index is all precise timestamps e.g. 2018-10-08 05:23:07
series = pandas.Series(data,index)
# I want rounded date-times
desired_index = pandas.date_range("2018-10-08", periods=10, freq="30min")
Tutorials/API suggest the way to do this is to reindex then fill NaN values using interpolate. But, as there is no overlap of datetimes between the old and new index, reindex outputs all NaN:
# The following outputs all NaN as no date times match old to new index
series.reindex(desired_index)
I do not want to fill nearest values during reindex as that will lose precision, so I came up with the following: concatenate the reindexed series with the original before interpolating:
pandas.concat([series,series.reindex(desired_index)]).sort_index().interpolate(method="linear")
This seems very inefficient, concatenating and then sorting the two series. Is there a better way?
A simple way of doing this is to reindex with the union of the existing and desired indices, interpolate based on time, and then reindex to the desired index.
Get an example DataFrame:
import numpy as np
import pandas as pd
np.random.seed(2)
df = (pd.DataFrame()
      .assign(SampleTime=pd.date_range(start='2018-10-01', end='2018-10-08', freq='30T')
                         + pd.to_timedelta(np.random.randint(-5, 5, size=337), unit='s'),
              Value=np.random.randn(337))
      .set_index(['SampleTime'])
)
Let's see what the data looks like:
df.head()
Value
SampleTime
2018-10-01 00:00:03 0.033171
2018-10-01 00:30:03 0.481966
2018-10-01 01:00:01 -0.495496
Get the desired index:
desired_index = pd.date_range('2018-10-01', periods=10, freq='30T')
Now, reindex the data with the union of the desired and existing indices, interpolate based on the time, and reindex again using only the desired index:
(df
 .reindex(df.index.union(desired_index))
 .interpolate(method='time')
 .reindex(desired_index)
)
Value
2018-10-01 00:00:00 NaN
2018-10-01 00:30:00 0.481218
2018-10-01 01:00:00 -0.494952
2018-10-01 01:30:00 -0.103270
As you can see, you still have an issue with the first timestamp because it's outside the range of the original index; there are a number of ways to deal with this (padding, for example).
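For instance, a minimal way to handle that leading NaN (an assumption about what suits the data, backfilling the edge rather than padding forward):

result = (df
          .reindex(df.index.union(desired_index))
          .interpolate(method='time')
          .reindex(desired_index)
          .bfill())  # fill the leading edge from the first available value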
My method:
frequency = nyse_trading_dates.rename_axis([None]).index
df = prices.rename_axis([None]).reindex(frequency)
for d in prices.rename_axis([None]).index:
    df.loc[d] = prices.loc[d]
df = df.interpolate(method='linear')
Method 2:
prices = data.loc[~data.index.duplicated(keep='last')]
#prices = data.reset_index()
idx1 = prices.index
idx1 = pd.to_datetime(idx1, errors='coerce')
merged = idx1.union(idx2)  # idx2 is the desired/target index
s = prices.reindex(merged)
df = s.interpolate(method='linear').dropna(axis=0, how='any')
data = df
