Downsampling problem: difference between resample and asfreq - python

I am playing with a time series dataframe defined as df using pandas.
I've already set the row index to a datetime index using set_index.
I want to downsample the dataframe, which is sampled at a 5 second interval, using resample or asfreq.
Let's say downsample to 1 hour.
df_inst = df.asfreq('1H')
df_inst2 = df.resample('1H')
When I execute the code written above, asfreq gives me the right dataframe downsampled to a 1 hour interval, which is exactly what I expected to see.
However, resample didn't generate any dataframe; moreover, there is no error message.
When I inspect it using print, I get the following output.
print(df_inst2)
DatetimeIndexResampler [freq=<Hour>, axis=0, closed=left, label=left, convention=start, base=0]
What am I missing?
More specifically, how can I get the same results using resample as I got with asfreq?
Thank you in advance.

DataFrame.resample returns a Resampler object, while DataFrame.asfreq returns the converted data.
If you want to use resample correctly, chain it with a specific method, for instance: df.resample('1H').asfreq().
Example from the docs:
>>> index = pd.date_range('1/1/2000', periods=9, freq='T')
>>> series = pd.Series(range(9), index=index)
>>> series.resample('30S').asfreq().head(5)
2000-01-01 00:00:00 0.0
2000-01-01 00:00:30 NaN
2000-01-01 00:01:00 1.0
2000-01-01 00:01:30 NaN
2000-01-01 00:02:00 2.0
Freq: 30S, dtype: float64
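Applied back to the question, a minimal sketch (assuming df has the DatetimeIndex described above):
# Pick the value at each hour mark, equivalent to asfreq:
df_inst2 = df.resample('1H').asfreq()
# Or aggregate all 5-second samples within each hour, e.g. by taking the mean:
df_inst3 = df.resample('1H').mean()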

Related

Groupby number of hours in a month in pandas

Could someone please explain how to group an hourly-based index to find how many hours of null values there are in a specific month? I am therefore thinking of ending up with a dataframe with a monthly-based index.
Given below is the dataframe, which has the timestamp as index and another column that occasionally has null values.
timestamp            rel_humidity
1999-09-27 05:00:00  82.875
1999-09-27 06:00:00  83.5
1999-09-27 07:00:00  83.0
1999-09-27 08:00:00  80.6
1999-09-27 09:00:00  nan
1999-09-27 10:00:00  nan
1999-09-27 11:00:00  nan
1999-09-27 12:00:00  nan
I tried this but the resulting dataframe is not what I expected.
gap_in_month = OG_1998_2022_gaps.groupby([OG_1998_2022_gaps.index.month, OG_1998_2022_gaps.index.year]).count()
I always struggle with the groupby function, so I'd highly appreciate any help. Thanks in advance!
If you need 0 for months with no missing values, create a mask with Series.isna, convert the DatetimeIndex to month periods with DatetimeIndex.to_period and aggregate with sum (the mask's True values count as 1), or use the alternative with Grouper:
gap_in_month = (OG_1998_2022_gaps['rel_humidity'].isna()
                .groupby(OG_1998_2022_gaps.index.to_period('m')).sum())

gap_in_month = (OG_1998_2022_gaps['rel_humidity'].isna()
                .groupby(pd.Grouper(freq='m')).sum())
If you need only the matched rows, the solution is similar, but first filter with boolean indexing and then aggregate counts with GroupBy.size:
gap_in_month = (OG_1998_2022_gaps[OG_1998_2022_gaps['rel_humidity'].isna()]
                .groupby(OG_1998_2022_gaps.index.to_period('m')).size())

gap_in_month = (OG_1998_2022_gaps[OG_1998_2022_gaps['rel_humidity'].isna()]
                .groupby(pd.Grouper(freq='m')).size())
An alternative to groupby, which is (in my opinion) much nicer, is to use pd.Series.resample:
import numpy as np
import pandas as pd

# Some sample data with a DatetimeIndex:
series = pd.Series(
    np.random.choice([1.0, 2.0, 3.0, np.nan], size=2185),
    index=pd.date_range(start="1999-09-26", end="1999-12-26", freq="H")
)
# Solution:
series.isna().resample("M").sum()
# Note that GroupBy.count and Resampler.count count the number of non-null values,
# whereas you seem to be looking for the opposite :)
In your case:
OG_1998_2022_gaps['rel_humidity'].isna().resample("M").sum()

How to fill missing dates with corresponding NaN in other columns

I have a CSV that initially creates the following dataframe:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-05 52304.0
Using the following script, I would like to fill in the missing dates and have a corresponding NaN value in the Portfoliovalue column. So the result would be this:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
I first tried the method here: Fill the missing date values in a Pandas Dataframe column
However, the bfill replaces all my NaNs, and removing it only returns an error.
So far I have tried this:
df = pd.read_csv("Tickers_test5.csv")
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
portfolio_value = portfolio_value + cash
date = datetime.date(datetime.now())
df2.loc[len(df2)] = [date, portfolio_value]
print(df2.asfreq('D'))
However, this only returns this:
Date Portfoliovalue
1970-01-01 NaN NaN
Thanks for your help. I am really impressed at how helpful this community is.
Quick update:
I have added the code so that it fills my missing dates. However, it is part of a programme which tries to update the missing dates every time it launches. So when I execute the code and no dates are missing, I get the following error:
ValueError: cannot reindex from a duplicate axis
The code is as follows:
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
date = datetime.date(datetime.now())
df2.loc[date, 'Portfoliovalue'] = portfolio_value
#Solution provided by Uts after asking on Stackoverflow
df2.Date = pd.to_datetime(df2.Date)
df2 = df2.set_index('Date').asfreq('D').reset_index()
So by the looks of it the code adds a duplicate date, which then causes the .reindex() function to raise the ValueError. However, I am not sure how to proceed. Is there an alternative to .reindex() or maybe the assignment of today's date needs changing?
Pandas has the asfreq function for a DatetimeIndex; it is basically just a thin but convenient wrapper around reindex() which generates a date_range and calls reindex.
Code
df.Date = pd.to_datetime(df.Date)
df = df.set_index('Date').asfreq('D').reset_index()
Output
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
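Regarding the follow-up ValueError ("cannot reindex from a duplicate axis"): one possible guard, as a hedged sketch (an assumption, not part of the original answer), is to drop duplicate dates before calling asfreq so that the index being reindexed is unique:
# Keep only the last row for each date so asfreq('D') never sees a duplicate index:
df2.Date = pd.to_datetime(df2.Date)
df2 = df2.drop_duplicates(subset='Date', keep='last')
df2 = df2.set_index('Date').asfreq('D').reset_index()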
Pandas has the reindex method: given a list of indices, it keeps only the indices from that list.
In your case, you can create all the dates you want, with date_range for example, and then pass them to reindex. You might need a simple set_index and reset_index, but I assume you don't care much about the original index.
Example:
df.set_index('Date').reindex(pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D')).reset_index()
First we set the 'Date' column as the index. Then we use reindex, which takes the full list of dates (given by date_range from the minimal to the maximal date in the 'Date' column, with daily frequency) as the new index. This results in NaNs in places without a former value.
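A minimal, self-contained sketch of the same approach (assuming a two-row frame like the one in the question):
import pandas as pd

# Hypothetical reconstruction of the question's frame:
df = pd.DataFrame({'Date': ['2021-05-01', '2021-05-05'],
                   'Portfoliovalue': [50000.0, 52304.0]})
df['Date'] = pd.to_datetime(df['Date'])

# Reindex onto the full daily range; missing dates get NaN in Portfoliovalue:
full_range = pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D')
df = df.set_index('Date').reindex(full_range).rename_axis('Date').reset_index()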

Resampling with pandas.MultiIndex: Resampler.aggregate() & Resampler[column]

I am trying to resample a data frame. First, I want to keep several aggregations in the result. Second, there is an additional aggregation of interest for a specific column. Since this aggregation is only relevant to a single column, the resampler can be limited to this column so that the aggregation is not applied unnecessarily to the other columns.
This scenario is working for a simple one-dimensional column index:
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.random.rand(50,4), index=pd.to_datetime(np.arange(0, 50), unit="s"), columns=["a", "b", "c", "d"])
r = df.resample("10s")
result = r.aggregate(["mean", "std"])
result[("d", "ffill")] = r["d"].ffill()
print(result)
However, as soon as I start to use multi-indexed columns, problems appear. First, I cannot keep several aggregations at once:
df.columns = pd.MultiIndex.from_product([("a", "b"), ("alpha", "beta")])
r = df.resample("10s") # can be omitted
result = r.aggregate(["mean", "std"])
---> AttributeError: 'Series' object has no attribute 'columns'
Second, the resampler cannot be limited to the relevant column anymore:
r[("b", "beta")].ffill()
--> KeyError: "Columns not found: 'b', 'beta'"
How can I transform my concern from simple indices to multi-indices?
You can use pd.Grouper in a groupby instead of resample, such as:
result = df.groupby(pd.Grouper(freq='10s',level=0)).aggregate(["mean", "std"])
print (result)
a b \
alpha beta alpha
mean std mean std mean
1970-01-01 00:00:00 0.460569 0.312508 0.476511 0.260534 0.479577
1970-01-01 00:00:10 0.441498 0.315277 0.487855 0.306068 0.535842
1970-01-01 00:00:20 0.569884 0.248503 0.320552 0.288479 0.507755
1970-01-01 00:00:30 0.478037 0.262654 0.552214 0.251581 0.505132
1970-01-01 00:00:40 0.611227 0.328916 0.473773 0.241604 0.358298
beta
std mean std
1970-01-01 00:00:00 0.357493 0.448487 0.294432
1970-01-01 00:00:10 0.259145 0.472250 0.320954
1970-01-01 00:00:20 0.369490 0.432944 0.150473
1970-01-01 00:00:30 0.298759 0.381614 0.248785
1970-01-01 00:00:40 0.203831 0.381412 0.374965
And for the second part, I'm not sure what you mean, but based on the result given in the single column level case, try this; it gives a result:
result[("b", "beta",'ffill')] = df.groupby(pd.Grouper(freq='10s',level=0))[[("b", "beta")]].first()
It must be a bug in aggregate. A workaround would be stack:
(df.stack().groupby(level=-1)
 .apply(lambda x: x.resample('10s', level=0).aggregate(["mean", "std"]))
 .unstack(level=0)
)

Pandas reindex and interpolate time series efficiently (reindex drops data)

Suppose I wish to re-index, with linear interpolation, a time series to a pre-defined index, where none of the index values are shared between old and new index. For example
# index is all precise timestamps e.g. 2018-10-08 05:23:07
series = pandas.Series(data,index)
# I want rounded date-times
desired_index = pandas.date_range("2018-10-08", periods=10, freq="30min")
Tutorials/API suggest the way to do this is to reindex then fill NaN values using interpolate. But, as there is no overlap of datetimes between the old and new index, reindex outputs all NaN:
# The following outputs all NaN as no date times match old to new index
series.reindex(desired_index)
I do not want to fill nearest values during reindex as that will lose precision, so I came up with the following: concatenate the reindexed series with the original before interpolating:
pandas.concat([series,series.reindex(desired_index)]).sort_index().interpolate(method="linear")
This seems very inefficient, concatenating and then sorting the two series. Is there a better way?
The only (simple) way I can see of doing this is to use resample to upsample to your time resolution (say 1 second), then reindex.
Get an example DataFrame:
import numpy as np
import pandas as pd
np.random.seed(2)
df = (pd.DataFrame()
      .assign(SampleTime=pd.date_range(start='2018-10-01', end='2018-10-08', freq='30T')
                         + pd.to_timedelta(np.random.randint(-5, 5, size=337), unit='s'),
              Value=np.random.randn(337))
      .set_index(['SampleTime'])
      )
Let's see what the data looks like:
df.head()
Value
SampleTime
2018-10-01 00:00:03 0.033171
2018-10-01 00:30:03 0.481966
2018-10-01 01:00:01 -0.495496
Get the desired index:
desired_index = pd.date_range('2018-10-01', periods=10, freq='30T')
Now, reindex the data with the union of the desired and existing indices, interpolate based on the time, and reindex again using only the desired index:
(df
 .reindex(df.index.union(desired_index))
 .interpolate(method='time')
 .reindex(desired_index)
)
Value
2018-10-01 00:00:00 NaN
2018-10-01 00:30:00 0.481218
2018-10-01 01:00:00 -0.494952
2018-10-01 01:30:00 -0.103270
As you can see, you still have an issue with the first timestamp because it's outside the range of the original index; there are a number of ways to deal with this (pad, for example).
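For example, one hedged option (using a backfill rather than a pad, since the leading NaN has nothing before it to pad from; result is just an illustrative name) would be:
result = (df
          .reindex(df.index.union(desired_index))
          .interpolate(method='time')
          .reindex(desired_index)
          .bfill())  # fill the leading NaN with the first available value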
my methods
frequency = nyse_trading_dates.rename_axis([None]).index
df = prices.rename_axis([None]).reindex(frequency)
for d in prices.rename_axis([None]).index:
    df.loc[d] = prices.loc[d]
df.interpolate(method='linear')
method 2
prices = data.loc[~data.index.duplicated(keep='last')]
#prices = data.reset_index()
idx1 = prices.index
idx1 = pd.to_datetime(idx1, errors='coerce')
# idx2 is assumed to be the desired target index, defined elsewhere
merged = idx1.union(idx2)
s = prices.reindex(merged)
df = s.interpolate(method='linear').dropna(axis=0, how='any')
data = df

Pandas caculating rolling functions efficiently

I need to calculate moving average using pandas.
import numpy as np
import pandas as pd

ser = pd.Series(np.random.randn(100),
                index=pd.date_range('1/1/2000', periods=100, freq='1min'))
ser.rolling(window=20).mean().tail(5)
[Out]
2000-01-01 01:35:00 0.390383
2000-01-01 01:36:00 0.279308
2000-01-01 01:37:00 0.173532
2000-01-01 01:38:00 0.194097
2000-01-01 01:39:00 0.194743
Freq: T, dtype: float64
But after appending a new row like this,
new_row = pd.Series([1.0], index=[pd.to_datetime("2000-01-01 01:40:00")])
ser = ser.append(new_row)
I have to recalculate all the moving averages, like this,
ser.rolling(window=20).mean().tail(5)
[Out]
2000-01-01 01:36:00 0.279308
2000-01-01 01:37:00 0.173532
2000-01-01 01:38:00 0.194097
2000-01-01 01:39:00 0.194743
2000-01-01 01:40:00 0.201918
dtype: float64
I think I only need to calculate the last value, 2000-01-01 01:40:00 0.201918, but I can't find a pandas API that calculates only the last appended row's value. Pandas rolling().mean() always recalculates the whole series.
This is a simple example, but in my real project the series has more than 1,000,000 rows, and each rolling calculation consumes a lot of time.
Is there a way to solve this problem in pandas?
As Anton vBR wrote in his comment, after you append the row, you can calculate the last value with
ser.tail(20).mean()
which takes time independent of the series length (1000000 in your example).
If you do this operation often, you can calculate it a bit more efficiently. The mean after appending the row is:
20 times the previous window's mean,
plus the latest appended value,
minus the value that drops out of the window (the 21st from the end),
all divided by 20.
This is more complicated to implement, though.
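A minimal sketch of that incremental update (an illustration with hypothetical variable names, not an existing pandas API):
# Assumes prev_mean is the last rolling mean of the old series and
# ser is the series after the new row has been appended.
window = 20
new_value = ser.iloc[-1]               # the value just appended
dropped_value = ser.iloc[-window - 1]  # the value that leaves the window
new_mean = prev_mean + (new_value - dropped_value) / window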
