I need to calculate a moving average using pandas.
import numpy as np
import pandas as pd

ser = pd.Series(np.random.randn(100),
                index=pd.date_range('1/1/2000', periods=100, freq='1min'))
ser.rolling(window=20).mean().tail(5)
[Out]
2000-01-01 01:35:00 0.390383
2000-01-01 01:36:00 0.279308
2000-01-01 01:37:00 0.173532
2000-01-01 01:38:00 0.194097
2000-01-01 01:39:00 0.194743
Freq: T, dtype: float64
But after appending a new row like this,
new_row = pd.Series([1.0], index=[pd.to_datetime("2000-01-01 01:40:00")])
ser = pd.concat([ser, new_row])
I have to recalculate all the moving averages, like this:
ser.rolling(window=20).mean().tail(5)
[Out]
2000-01-01 01:36:00 0.279308
2000-01-01 01:37:00 0.173532
2000-01-01 01:38:00 0.194097
2000-01-01 01:39:00 0.194743
2000-01-01 01:40:00 0.201918
dtype: float64
I think I just need to calculate the last value, 2000-01-01 01:40:00 0.201918, but I can't find a pandas API that calculates only the newly appended row. rolling().mean() always recalculates the whole series.
This is a simple example, but in my real project the series has more than 1,000,000 rows, and each rolling calculation takes a lot of time.
Is there a way to solve this problem in pandas?
As Anton vBR wrote in his comment, after you append the row you can calculate the last value with
ser.tail(20).mean()
which takes time independent of the series length (1,000,000 in your example).
If you do this operation often, you can compute the new value a bit more efficiently. The mean after appending the row is:
20 times the previous last mean
plus the newly appended value
minus the value that drops out of the window (21 positions from the end after the append)
divided by 20
This is more complicated to implement, though.
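A minimal sketch of that O(1) update, assuming a window of 20; the variable names are illustrative, not a pandas API:
window = 20
prev_mean = ser.iloc[-window:].mean()   # last rolling mean before the append
new_value = 1.0                         # value about to be appended
leaving_value = ser.iloc[-window]       # oldest value in the current window
new_mean = (window * prev_mean + new_value - leaving_value) / window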
Could someone please guide me on how to group by hours from an hourly based index to find how many hours of null values there are in a specific month? I am therefore thinking of having a dataframe with a monthly based index.
Given below is the dataframe, which has a timestamp index and another column that occasionally has null values.
timestamp            rel_humidity
1999-09-27 05:00:00  82.875
1999-09-27 06:00:00  83.5
1999-09-27 07:00:00  83.0
1999-09-27 08:00:00  80.6
1999-09-27 09:00:00  nan
1999-09-27 10:00:00  nan
1999-09-27 11:00:00  nan
1999-09-27 12:00:00  nan
I tried this but the resulting dataframe is not what I expected.
gap_in_month = OG_1998_2022_gaps.groupby(OG_1998_2022_gaps.index.month, OG_1998_2022_gaps.index.year).count()
I always struggle with groupby functions, so I'd highly appreciate any help. Thanks in advance!
If you need 0 for months with no missing values, create a mask with Series.isna, convert the DatetimeIndex to month periods with DatetimeIndex.to_period, and aggregate with sum (True values in the mask are counted as 1):
gap_in_month = (OG_1998_2022_gaps['rel_humidity'].isna()
                   .groupby(OG_1998_2022_gaps.index.to_period('m')).sum())
Alternatively, with pd.Grouper:
gap_in_month = (OG_1998_2022_gaps['rel_humidity'].isna()
                   .groupby(pd.Grouper(freq='m')).sum())
If you need only the months that actually contain missing values, the solution is similar, but first filter with boolean indexing and then aggregate counts with GroupBy.size:
filtered = OG_1998_2022_gaps[OG_1998_2022_gaps['rel_humidity'].isna()]
gap_in_month = filtered.groupby(filtered.index.to_period('m')).size()
Alternatively, with pd.Grouper:
gap_in_month = (OG_1998_2022_gaps[OG_1998_2022_gaps['rel_humidity'].isna()]
                   .groupby(pd.Grouper(freq='m')).size())
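A tiny fabricated frame (the data below is made up to match the question's sample) shows the shape of the result:
import numpy as np
import pandas as pd

idx = pd.date_range('1999-09-27 05:00', periods=8, freq='H')
OG_1998_2022_gaps = pd.DataFrame(
    {'rel_humidity': [82.875, 83.5, 83.0, 80.6, np.nan, np.nan, np.nan, np.nan]},
    index=idx)

gap_in_month = (OG_1998_2022_gaps['rel_humidity'].isna()
                   .groupby(OG_1998_2022_gaps.index.to_period('m')).sum())
print(gap_in_month)
# 1999-09    4
# Freq: M, Name: rel_humidity, dtype: int64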
An alternative to groupby, and (in my opinion) much nicer, is to use pd.Series.resample:
import numpy as np
import pandas as pd
# Some sample data with a DatetimeIndex:
series = pd.Series(
np.random.choice([1.0, 2.0, 3.0, np.nan], size=2185),
index=pd.date_range(start="1999-09-26", end="1999-12-26", freq="H")
)
# Solution:
series.isna().resample("M").sum()
# Note that GroupBy.count and Resampler.count count the number of non-null values,
# whereas you seem to be looking for the opposite :)
In your case:
OG_1998_2022_gaps['rel_humidity'].isna().resample("M").sum()
I have a DataFrame that looks like such:
closingDate Time Last
0 1997-09-09 2018-12-13 00:00:00 1000
1 1997-09-09 2018-12-13 00:01:00 1002
2 1997-09-09 2018-12-13 00:02:00 1001
3 1997-09-09 2018-12-13 00:03:00 1005
I want to create a DataFrame with roughly 1440 columns labeled as timestamps, where the respective daily value is the return over the prior minute:
closingDate 00:00:00 00:01:00 00:02:00
0 1997-09-09 2018-12-13 -0.08 0.02 -0.001 ...
1 1997-09-10 2018-12-13 ...
My issue is that this is a very large DataFrame (several GB), and I need to do this operation multiple times. Time and memory efficiency are key, but time is more important. Is there some vectorized, built-in method to do this in pandas?
You can do this by grouping and shifting your time series, which should result in more efficient calculations.
First aggregate your data by closingDate.
g = df.groupby("closingDate")
Next you can shift your data by one row (one minute) within each day.
shifted = g.shift(periods=1)
This will create a new dataframe where the Last value will be from the previous minute. Now you can join to your original dataframe based on the index.
df = df.merge(shifted, left_index=True, right_index=True)
This adds the shifted columns to the new dataframe that you can use to do your difference calculation.
df["Diff"] = (df["Last_x"] - df["Last_y"]) / df["Last_y"]
You now have all the data you're looking for. If you need each minute to be its own column, you can pivot the results, as sketched below. Because the shift is applied per closingDate group, values are never shifted across days: the first observation of each day gets a NaN instead of a value carried over from the previous day.
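A minimal sketch of the pivot step; the Minute helper column is an assumption, and Time_x/Diff are the column names produced by the merge and difference steps above:
df['Minute'] = pd.to_datetime(df['Time_x']).dt.strftime('%H:%M:%S')
wide = df.pivot(index='closingDate', columns='Minute', values='Diff')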
I am playing with a time series dataframe defined as df using pandas.
I've already changed the row index to a datetime index using set_index.
I want to downsample a dataframe sampled at 5-second intervals using resample or asfreq.
Let's say, downsample to 1 hour.
df_inst = df.asfreq('1H')
df_inst2 = df.resample('1H')
When I execute the code above, asfreq gives me the right dataframe downsampled to a 1-hour interval, which is exactly what I expected to see.
However, resample doesn't generate any dataframe; moreover, there is no error message.
When I inspect it using print, I get the following message.
print(df_inst2)
DatetimeIndexResampler [freq=<Hour>, axis=0, closed=left, label=left, convention=start, base=0]
What am I missing?
More specifically, how can I get the same results using resample as I got with asfreq?
Thank you in advance.
DataFrame.resample returns a Resampler object, while DataFrame.asfreq returns the converted data.
If you want to use resample correctly, use it with a specific method, for instance: df.resample('1H').asfreq().
Example from the docs:
>>> index = pd.date_range('1/1/2000', periods=9, freq='T')
>>> series = pd.Series(range(9), index=index)
>>> series.resample('30S').asfreq().head(5)
2000-01-01 00:00:00 0.0
2000-01-01 00:00:30 NaN
2000-01-01 00:01:00 1.0
2000-01-01 00:01:30 NaN
2000-01-01 00:02:00 2.0
Freq: 30S, dtype: float64
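As a follow-up: if you want each hour to be an aggregate of the underlying 5-second samples, rather than the single point-in-time value that asfreq picks, chain an aggregation method instead (df as in the question):
df_inst2 = df.resample('1H').mean()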
I have a large dataset that includes categorical data, which are my labels (non-uniform timestamps). I have another dataset which is an aggregate of the measurements.
When I want to assemble these two datasets, they have two different timestamps (aggregated vs non-aggregated).
Categorical dataframe (df_Label)
count 1185
unique 10
top ABCD
freq 1165
To assemble the label dataframe with the aggregated measurement dataframe (MeasureAgg), I use
df_Label = df_Label.reindex(MeasureAgg.index, method='nearest')
The issue is that this reindexing eliminates many of my labels, so df_Label.describe() becomes:
count 4
unique 2
top ABCD
freq 3
I looked at several lines where the labels get replaced by NaN but couldn't find any indication of where this comes from.
I suspected that this issue might be due to clustering of the labels between two timestamps, which would eliminate many of them, but this is not the case.
I tried this for a fabricated dataset and it worked as expected, but I am not sure why it is not working in my case: df_Label = df_Label.reindex(MeasureAgg.index, method='nearest')
My apologies for the vague nature of my question; I couldn't replicate the issue with a fabricated dataset (for the fabricated dataset it worked fine). I would greatly appreciate it if anyone could suggest an alternative or modified way to assemble these two dataframes.
Thanks in advance
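For reference, a minimal fabricated example of the reindex call (all names and timestamps below are made up). Note that each target timestamp keeps only the single nearest label, so a sparse target index drops most labels:
import pandas as pd

labels = pd.Series(
    ['A', 'B', 'C', 'D'],
    index=pd.to_datetime(['2000-01-01 00:00:10.870',
                          '2000-01-01 00:00:10.940',
                          '2000-01-01 00:00:11.160',
                          '2000-01-01 00:00:11.640']))

# Sparse aggregated index: only the nearest label survives per timestamp
agg_index = pd.to_datetime(['2000-01-01 00:00:11.000'])
print(labels.reindex(agg_index, method='nearest'))
# 2000-01-01 00:00:11    B
# dtype: object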
Update:
Only the timestamps are shown below, since the data is mostly missing:
df_Label.head(5)
Time
2000-01-01 00:00:10.870 NaN
2000-01-01 00:00:10.940 NaN
2000-01-01 00:00:11.160 NaN
2000-01-01 00:00:11.640 NaN
2000-01-01 00:00:12.460 NaN
Name: SUM, dtype: object
df_Label.describe()
count 1185
unique 10
top 9_33_2_0_0_0
freq 1165
Name: SUM, dtype: object
MeasureAgg.head(5)
Time mean std skew kurt
2000-01-01 00:00:00 0.0 0.0
2010-01-01 00:00:00 0.0
2015-01-01 00:00:00
2015-12-01 00:00:00
2015-12-01 12:40:00 0.0
MeasureAgg.describe()
mean std skew kurt
count 407.0 383.0 382.0 382.0
mean 487.3552791234544 35.67631749396375 -0.7545081710390299 2.52171909979003
std 158.53524231679074 43.66050329988979 1.3831195437535115 6.72280956322486
min 0.0 0.0 -7.526780108501018 -1.3377292623812096
25% 474.33696969696973 11.5126181533734 -1.1790982769904146 -0.4005545816076801
50% 489.03428571428566 13.49696931937243 -0.2372819584684056 -0.017202890096714274
75% 532.3371929824561 51.40084557371704 0.12755009341999793 1.421205718986767
max 699.295652173913 307.8822231525122 1.2280152015331378 66.9243304128838
I'm running into problems when taking lower-frequency time series in pandas, such as monthly or quarterly data, and upsampling them to a weekly frequency. For example,
import numpy as np
import pandas as pd

data = np.arange(3, dtype=np.float64)
s = pd.Series(data, index=pd.date_range('2012-01-01', periods=len(data), freq='M'))
s.resample('W-SUN').asfreq()
results in a series filled with NaN everywhere. Basically the same thing happens if I do:
s.reindex(pd.date_range(start=s.index[0].replace(day=1), end=s.index[-1], freq='W-SUN'))
If s were indexed with a PeriodIndex instead I would get an error: ValueError: Frequency M cannot be resampled to <1 Week: kwds={'weekday': 6}, weekday=6>
I can understand why this might happen, as the weekly dates don't exactly align with the monthly dates, and weeks can overlap months. However, I would like to implement some simple rules to handle this anyway. In particular, (1) set the last week ending in the month to the monthly value, (2) set the first week ending in the month to the monthly value, or (3) set all the weeks ending in the month to the monthly value. What might be an approach to accomplish that? I can imagine wanting to extend this to bi-weekly data as well.
EDIT: An example of what I would ideally like the output of case (1) to be would be:
2012-01-01 NaN
2012-01-08 NaN
2012-01-15 NaN
2012-01-22 NaN
2012-01-29 0
2012-02-05 NaN
2012-02-12 NaN
2012-02-19 NaN
2012-02-26 1
2012-03-04 NaN
2012-03-11 NaN
2012-03-18 NaN
2012-03-25 2
I made a GitHub issue regarding your question; the relevant feature needs to be added to pandas.
Case 3 is achievable directly via ffill:
In [25]: s
Out[25]:
2012-01-31 0
2012-02-29 1
2012-03-31 2
Freq: M
In [26]: s.resample('W').ffill()
Out[26]:
2012-02-05 0
2012-02-12 0
2012-02-19 0
2012-02-26 0
2012-03-04 1
2012-03-11 1
2012-03-18 1
2012-03-25 1
2012-04-01 2
Freq: W-SUN
But for the others you'll have to do some contorting right now, which will hopefully be remedied by the GitHub issue before the next release; a hand-rolled sketch for case (1) follows.
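A minimal sketch for case (1), assuming the series s above; it assigns each monthly value to the last Sunday falling inside its month and leaves every other week NaN (this is hand-rolled, not a pandas built-in):
import numpy as np
import pandas as pd

weekly = pd.date_range(start=s.index[0].replace(day=1), end=s.index[-1], freq='W-SUN')
out = pd.Series(np.nan, index=weekly)

for ts, value in s.items():
    # Sundays that fall inside this observation's month
    in_month = weekly[(weekly.year == ts.year) & (weekly.month == ts.month)]
    if len(in_month):
        out[in_month[-1]] = value   # case (2) would use in_month[0] instead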
Also, it looks like you want the upcoming 'span' resampling convention as well, which will upsample from the start of the first period to the end of the last period. I'm not sure there is an easy way to anchor the start/end points for a DatetimeIndex, but it should at least be there for PeriodIndex.