Related to: daily data, resample every 3 days, calculate over trailing 5 days efficiently, but here the summing is over strided, non-consecutive data.
I have an hourly time series. For every hour I would like the average of the same hour of the day over the preceding 10-day window. E.g. at 2019-08-14 23:00, I would like the average of all 23:00 data from 2019-08-04 till 2019-08-13.
Is there an efficient way to do so in pandas/numpy? Or should I roll my sleeves and write my own loops and data structures?
Extra points: if today is a workday (Mon-Fri), the average should be over the previous 10 workdays; if it's a weekend day (Sat-Sun), over the previous 10 weekend days (a span of about 2.5 months).
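A sketch of one way to do this without explicit loops, assuming a Series s with a regular hourly DatetimeIndex and no missing hours (the index and values below are made up for illustration):

import numpy as np
import pandas as pd

idx = pd.date_range('2019-06-01 00:00', '2019-08-14 23:00', freq='h')
s = pd.Series(np.random.rand(len(idx)), index=idx)

# Same hour of day, averaged over the 10 preceding days. Grouping by hour leaves
# one observation per day in each group, so a 10-row rolling mean is a 10-day
# window; shift(1) excludes the current day.
avg_10d = s.groupby(s.index.hour).transform(
    lambda x: x.shift(1).rolling(10, min_periods=1).mean()
)

# Extra-points variant: split workdays and weekend days before grouping, so a
# Monday averages over the previous 10 workdays and a Saturday over the
# previous 10 weekend days.
is_weekend = s.index.dayofweek >= 5
avg_10d_split = s.groupby([s.index.hour, is_weekend]).transform(
    lambda x: x.shift(1).rolling(10, min_periods=1).mean()
)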
I am looking for a way to check the frequency of dates in a column. I have dates that are roughly weekly, but sometimes there is a gap of 2 or 3 weeks, and the pd.infer_freq method returns None.
My data:
2022-01-01
2022-01-08
2022-01-23
2022-01-30
Your sample data is too small for pd.infer_freq to be able to infer the frequency. You could find the most common time difference between consecutive dates and use that to infer the frequency:
s = pd.Series(dates)
print((s - s.shift(1)).mode())
Output
0 7 days
dtype: timedelta64[ns]
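As a self-contained check with the sample dates above (s.diff() is equivalent to s - s.shift(1)):

import pandas as pd

# The four sample dates from the question; the gaps are 7, 15 and 7 days.
dates = pd.to_datetime(['2022-01-01', '2022-01-08', '2022-01-23', '2022-01-30'])
s = pd.Series(dates)

print(s.diff().mode())   # 0   7 days  -> the series is (mostly) weekly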
I have to calculate the number of days when the temperature was more than 32 degrees C, within the last 30 days.
I tried using a rolling window. The issue is that the number of days in a month varies.
weather_2['highTemp_days'] = weather_2.groupby(['date','station'])['over32'].apply(lambda x: x.rolling(len('month')).sum())
weather_2 has 66 stations
date varies from 1950 to 2020
over32 is a 0/1 flag: if the temperature on that date is > 32 it is 1, otherwise 0.
month is taken from the date data which is weather_2['month'] = weather_2['date'].dt.month
I used this
weather_2['highTemp_days'] = weather_2.groupby(['year','station'])['over32'].apply(lambda x: x.rolling(30).sum())
The issue was that I was grouping by date; that is why the answer was wrong.
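A sketch of an alternative that sidesteps both the varying month length and the window resetting at year boundaries, using a time-based window; it assumes date is a datetime column and over32 is numeric 0/1, as described above:

import pandas as pd

weather_2 = weather_2.sort_values(['station', 'date'])

# A '30D' window counts calendar days rather than rows, so gaps in the record
# and the varying number of days per month are handled automatically.
weather_2['highTemp_days'] = (
    weather_2.groupby('station')
             .apply(lambda g: g.rolling('30D', on='date')['over32'].sum())
             .reset_index(level=0, drop=True)
)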
From the dataframe below, I want to count the number of hours of data in each single day. Each record represents 5 minutes. The DatetimeIndex is TZ-aware.
Year data
Timestamp
2008-11-13 16:50:00+09:30 177.83
2008-11-13 16:55:00+09:30 165.73
2008-11-15 17:00:00+09:30 160.34
2008-11-15 17:15:00+09:30 148.90
2008-11-15 17:40:00+09:30 113.66
2008-11-20 17:15:00+09:30 121.12
2008-11-20 17:20:00+09:30 109.55
2008-11-20 17:35:00+09:30 100.86
2008-11-20 17:50:00+09:30 90.72
2008-11-20 07:55:00+09:30 86.27
The expected result is
Year hrs/day
Timestamp
2008-11-13 00:00:00+09:30 0.16666666666666666 # <-- 10 min / 60
2008-11-15 00:00:00+09:30 0.25 # <-- 15 min / 60
2008-11-20 00:00:00+09:30 0.4166666666666667 # <-- 25 min / 60
This is what I did.
df['Hour'] = df.index.hour.astype(int)
days = df.resample('D').apply({'Hour':'count'})
which gives me a column 'Hour' whose values are the number of records per day.
Next...
days['Hr/dy'] = (days['Hour'] * 5.0)/60.0
where '5.0' is the timestamp interval. With this way, I can get the expected result.
But I have to switch between many data frames with different timestamp intervals. Providing the interval by hand every time I switch to a new dataset is not convenient. I need to get the timestamp interval automatically from the timestamp index.
freqdays = pd.infer_freq(df.index[0:10])
gives a non-numeric frequency string ('5T'), which is not usable for the arithmetic needed to get the hours.
What I need is either:
- a method to get the frequency (interval) from the timestamp index in integer or float, or
- to calculate the length of hours per day directly from the timestamp index.
Edit:
The original data has a 5-minute interval with many missing records. The start and end hours differ from day to day.
You can get the minimum difference in seconds in your index with:
print(df.index.to_series().sort_values().diff().min().total_seconds())
300.0
So, to get your result, group by day, count the records, multiply by the minimum index difference, and divide by 3600 to get hours:
interval_s = df.index.to_series().sort_values().diff().min().total_seconds()
df_agg = df.groupby(df.index.date).count() * interval_s / 3600
print(df_agg)
date
2008-11-13 0.166667
2008-11-15 0.250000
2008-11-20 0.416667
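If you prefer to derive the interval from pd.infer_freq instead of the minimum difference, the frequency string can be converted to seconds with to_offset; a sketch, assuming the index is regular enough for infer_freq to succeed on the first few timestamps:

import pandas as pd
from pandas.tseries.frequencies import to_offset

freq = pd.infer_freq(df.index[0:10])        # e.g. '5T' / '5min'
interval_s = to_offset(freq).nanos / 1e9    # 300.0 seconds for a 5-minute interval

hours_per_day = df.groupby(df.index.date).size() * interval_s / 3600
print(hours_per_day)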
I have a dataset that looks as follows:
[screenshot: the dataset]
I have a big list of orders, with their traveled distance, for the entire year of 2018. To predict future orders, I want to calculate the average number of orders per hour across all the Mondays of the year: the average number of orders between 00:00:00 - 01:00:00, between 01:00:00 - 02:00:00, and so on up to 23:00:00 - 24:00:00, on Mondays only. Orders on other weekdays should not be included.
What I have so far is:
import pandas as pd
from pprint import pprint

df_data = pd.read_csv('Finalorders.csv', parse_dates=['datetime'])
week_dfsum = df_data.groupby(df_data['datetime'].dt.day_name()).sum()
week_dfmean = df_data.groupby(df_data['datetime'].dt.day_name()).mean()
pprint(week_dfsum)
pprint(week_dfmean)
But I don't know how to only include the orders on Monday.
You're close. After you produce a day-of-week column, say df['Day_of_Week'] = df['datetime'].dt.dayofweek, filter it to Mondays using:
df[df['Day_of_Week'] == 0]  # Monday is 0 with dt.dayofweek
This will return only the rows from Mondays.
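Putting it together, a sketch of the whole pipeline; it assumes each row of Finalorders.csv is one order, which may not match the actual file:

import pandas as pd

df_data = pd.read_csv('Finalorders.csv', parse_dates=['datetime'])
mondays = df_data[df_data['datetime'].dt.dayofweek == 0]          # Monday is 0

# Count orders per Monday and per hour, then average those counts across Mondays.
counts = mondays.groupby([mondays['datetime'].dt.date,
                          mondays['datetime'].dt.hour]).size()
avg_per_hour = counts.groupby(level=1).mean()                     # hours 0..23
print(avg_per_hour)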
I am converting low-frequency data to a higher frequency with pandas (for instance monthly to daily). When making this conversion, I would like the resulting higher-frequency index to span the entire low-frequency window. For example, suppose I have a monthly series, like so:
import numpy as np
import pandas as pd

data = np.random.randn(2)
s = pd.Series(data, index=pd.date_range('2012-01-01', periods=len(data), freq='M'))
s
2012-01-31 0
2012-02-29 1
Now, I convert it to daily frequency:
s.resample('D').asfreq()
2012-01-31 0
2012-02-01 NaN
2012-02-02 NaN
2012-02-03 NaN
...
2012-02-27 NaN
2012-02-28 NaN
2012-02-29 1
Notice how the resulting output goes from 2012-01-31 to 2012-02-29. But what I really want is days from 2012-01-01 to 2012-02-29, so that the daily index "fills" the entire January month, even if 2012-01-31 is still the only non-NaN observation in that month.
I'm also curious whether there are built-in methods that give more control over how the higher-frequency period is filled with the lower-frequency values. In the monthly-to-daily example, the default is to fill in just the last day of each month; if I use a PeriodIndex to index my series, I can also use s.resample('D', convention='start') to have only the first observation filled in. However, I would also like options to fill every day in the month with the monthly value, and to fill every day with the daily average (the monthly value divided by the number of days in the month).
Note that basic backfill and forward fill would not be sufficient to fill every daily observation in the month with the monthly value. For example, if the monthly series runs from January to March but the February value is NaN, then a forward fill would carry the January values into February, which is not desired.
How about this?
s.reindex(pd.date_range(start=s.index[0].replace(day=1), end=s.index[-1], freq='D'))
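For the other fill behaviours asked about (every day gets the monthly value, or the monthly value spread evenly over the days of the month), one way, as a sketch assuming s is the month-end-indexed Series built above:

import pandas as pd

daily_index = pd.date_range(s.index[0].replace(day=1), s.index[-1], freq='D')

# Fill every day of a month with that month's value. Mapping by calendar month
# (rather than forward-filling) means a NaN month stays NaN instead of
# inheriting the previous month's value.
monthly = s.copy()
monthly.index = monthly.index.to_period('M')
filled = pd.Series(daily_index.to_period('M').map(monthly), index=daily_index)

# Daily average: the monthly value divided by the number of days in that month.
spread = filled / daily_index.days_in_month.to_numpy()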