Hourly average for each week/month in dataframe (moving average) - python

I have a dataframe with full year data of values on each second:
YYYY-MO-DD HH-MI-SS_SSS TEMPERATURE (C)
2016-09-30 23:59:55.923 28.63
2016-09-30 23:59:56.924 28.61
2016-09-30 23:59:57.923 28.63
... ...
2017-05-30 23:59:57.923 30.02
I want to create a new dataframe that takes each week or month of values and averages them over the same hour of each day (a kind of moving average, but per hour).
So the result for the month case will be like this:
Date TEMPERATURE (C)
2016-09 00:00:00 28.63
2016-09 01:00:00 27.53
2016-09 02:00:00 27.44
...
2016-10 00:00:00 28.61
... ...
I'm aware that I could split the df into 12 dfs, one per month, and use:
hour = pd.to_timedelta(df['YYYY-MO-DD HH-MI-SS_SSS'].dt.hour, unit='H')
df2 = df.groupby(hour).mean()
But I'm searching for a better and faster way.
Thanks !!

Here's an alternate method of converting your date and time columns:
df['datetime'] = pd.to_datetime(df['YYYY-MO-DD'] + ' ' + df['HH-MI-SS_SSS'])
Additionally, you could group by both week and hour to form a MultiIndex dataframe (instead of creating and managing 12 dfs):
df.groupby([df.datetime.dt.weekofyear, df.datetime.dt.hour]).mean()
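For the monthly case asked about in the question, the same idea works with a monthly period as the first grouping key. Here is a minimal sketch, assuming the combined datetime column created above and the TEMPERATURE (C) column from the sample data; note that on recent pandas versions Series.dt.weekofyear is deprecated in favour of isocalendar().week:
# Mean temperature for each (month, hour-of-day) combination;
# the result has a two-level index of month period and hour.
monthly_hourly = df.groupby(
    [df.datetime.dt.to_period('M'), df.datetime.dt.hour]
)['TEMPERATURE (C)'].mean()

# Weekly equivalent on recent pandas, avoiding the deprecated weekofyear:
weekly_hourly = df.groupby(
    [df.datetime.dt.isocalendar().week, df.datetime.dt.hour]
)['TEMPERATURE (C)'].mean()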

How to subtract dataframe with month index from dataframe with datetime index

I have two dataframes, one is called Clim and one is called O3_mda8_3135. Clim is a dataframe including monthly average meteorological parameters for one year of data; here is a sample of the dataframe:
Clim.head(12)
Out[7]:
avgT_2551 avgT_5330 ... avgNOx_3135(ppb) avgCO_3135(ppm)
Month ...
1 14.924181 13.545691 ... 48.216128 0.778939
2 16.352172 15.415385 ... 36.110385 0.605629
3 20.530879 19.684720 ... 20.974544 0.460571
4 23.738576 22.919158 ... 14.270995 0.432855
5 26.961927 25.779007 ... 11.087005 0.334505
6 32.208322 31.225072 ... 12.801409 0.384325
7 35.280124 34.265880 ... 10.732970 0.321284
8 35.428857 34.433351 ... 11.916420 0.326389
9 32.008317 30.856782 ... 15.236616 0.343405
10 25.691444 24.139874 ... 24.829518 0.467317
11 19.310550 17.827946 ... 36.339847 0.621938
12 14.186050 12.860077 ... 49.173287 0.720708
[12 rows x 20 columns]
I also have the dataframe O3_mda8_3135, which was created by first calculating the rolling 8-hour average of each component and then finding the maximum daily value of ozone, which is why all of the timestamps and indices are different. There is one value for each meteorological parameter for every day of the year. Here's a sample of this dataframe:
O3_mda8_3135
Out[9]:
date Temp_C_2551 ... CO_3135(ppm) O3_mda8_3135
12 2018-01-01 12:00:00 24.1 ... 0.294 10.4000
36 2018-01-02 12:00:00 26.3 ... 0.202 9.4375
60 2018-01-03 12:00:00 22.8 ... 0.184 7.1625
84 2018-01-04 12:00:00 25.6 ... 0.078 8.2500
109 2018-01-05 13:00:00 27.3 ... NaN 9.4500
... ... ... ... ...
8653 2018-12-27 13:00:00 19.6 ... 0.115 35.1125
8676 2018-12-28 12:00:00 14.9 ... 0.097 39.4500
8700 2018-12-29 12:00:00 13.9 ... 0.092 38.1250
8724 2018-12-30 12:00:00 17.4 ... 0.186 35.1375
8753 2018-12-31 17:00:00 8.3 ... 0.110 30.8875
[365 rows x 24 columns]
I am wondering how to subtract the average values in Clim from the corresponding columns and rows in O3_mda8_3135. For example, I would like to subtract the average value for temperature at site 2551 in January (avgT_2551 Month 1 in the Clim dataframe) from every day in January in the other dataframe O3_mda8_3135, column name Temp_C_2551.
avgT_2551 corresponds to Temp_C_2551 in the other dataframe
Is there a simple way to do this? Should I extract the month from the datetime and put it into another column for the O3_mda8_3135 dataframe? I am still a beginner and would appreciate any advice or tips.
I saw this post How to subtract the mean of a month from each day in that month? but there was not enough information given for me to understand what actions were being performed.
I figured it out on my own, thanks to Stack Overflow posts :)
I created new columns in both dataframes corresponding to the month. I had originally set the index in Clim to the month using Clim = Clim.set_index('Month'), so I removed that line. Then I created a Month column in the O3_mda8_3135 dataframe. After that, I merged the two dataframes on the 'Month' column and used the pandas sub method to subtract the columns I wanted.
Here's some example code, sorry the variables are so long but this dataframe is huge.
O3_mda8_3135['Month'] = O3_mda8_3135['date'].dt.month
O3_mda8_3135_anom = pd.merge(O3_mda8_3135, Clim, how='left', on=('Month'))
O3_mda8_3135_anom['O3_mda8_3135_anom'] = O3_mda8_3135_anom['O3_mda8_3135'].sub(O3_mda8_3135_anom['MDA8_3135'])
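For the temperature example described in the question (avgT_2551 vs. Temp_C_2551), the same pattern applies. A minimal sketch, assuming both columns survive the merge above:
# Daily temperature minus the monthly climatological mean
# (column names taken from the two frames shown above)
O3_mda8_3135_anom['Temp_C_2551_anom'] = (
    O3_mda8_3135_anom['Temp_C_2551'].sub(O3_mda8_3135_anom['avgT_2551'])
)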
These posts helped me answer my question:
python pandas extract year from datetime: df['year'] = df['date'].year is not working
How to calculate monthly mean of a time seies data and substract the monthly mean with the values of that month of each year?
Find difference between 2 columns with Nulls using pandas

Is there a Pandas function to highlight a week's 10 lowest values in a time series?

Rookie here so please excuse my question format:
I have an event time series dataset covering two months (columns for "date/time" and "# of events", with each row representing an hour).
I would like to highlight the 10 hours with the lowest numbers of events for each week. Is there a specific Pandas function for that? Thanks!
Let's say you have a dataframe df with column col as well as a datetime column.
You can simply sort the column with
import pandas as pd
df = pd.DataFrame({
    'col': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
    'datetime': ['2019-01-01 00:00:00', '2015-02-01 00:00:00', '2015-03-01 00:00:00',
                 '2015-04-01 00:00:00', '2018-05-01 00:00:00', '2016-06-01 00:00:00',
                 '2017-07-01 00:00:00', '2013-08-01 00:00:00', '2015-09-01 00:00:00',
                 '2015-10-01 00:00:00', '2015-11-01 00:00:00', '2015-12-01 00:00:00',
                 '2014-01-01 00:00:00', '2020-01-01 00:00:00', '2014-01-01 00:00:00']})
df = df.sort_values('col')
df = df.iloc[0:10,:]
df
Output:
col datetime
0 1 2019-01-01 00:00:00
1 2 2015-02-01 00:00:00
2 3 2015-03-01 00:00:00
3 4 2015-04-01 00:00:00
4 5 2018-05-01 00:00:00
5 6 2016-06-01 00:00:00
6 7 2017-07-01 00:00:00
7 8 2013-08-01 00:00:00
8 9 2015-09-01 00:00:00
9 10 2015-10-01 00:00:00
I know there's a method called nlargest, so there should be an nsmallest counterpart: pandas.DataFrame.nsmallest
df.nsmallest(n=10, columns=['col'])
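Since the question asks for the 10 lowest hours in each week rather than overall, one way would be to combine nsmallest with a weekly Grouper. A minimal sketch using the df above:
# Make sure the datetime column is an actual datetime dtype
df['datetime'] = pd.to_datetime(df['datetime'])
# The 10 smallest values of 'col' within each calendar week
lowest_per_week = df.groupby(pd.Grouper(key='datetime', freq='W'))['col'].nsmallest(10)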
My bad: so your DatetimeIndex is sampled hourly, and you need the hour(s) with the fewest events each week.
...
Date n_events
2020-06-06 08:00:00 3
2020-06-06 09:00:00 3
2020-06-06 10:00:00 2
...
Well, I'd start by converting each hour into columns.
1. Create an hour column that holds the hour of the day.
df['hour'] = df['date'].dt.hour
2. Pivot the hour values into columns, using n_events as the values.
You'll then have one datetime index and 24 hour columns, with values denoting the number of events. pandas.DataFrame.pivot_table
...
Date hour0 ... hour8 hour9 hour10 ... hour23
2020-06-06 0 3 3 2 0
...
3. Then you can resample it to a weekly level and aggregate using sum.
df.resample('W').sum()
The last part is a bit tricky to do on the dataframe, but it's fairly simple if you just need the output.
for row in df.itertuples():
    print(sorted(row[1:]))
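Putting those steps together, a rough end-to-end sketch (assuming columns named date and n_events as in the sample above, and keeping only the 10 lowest hourly totals per week) might look like this:
df['date'] = pd.to_datetime(df['date'])
df['hour'] = df['date'].dt.hour

# One row per day, one column per hour of day, values are event counts
hourly = df.pivot_table(index=df['date'].dt.date,
                        columns='hour',
                        values='n_events',
                        aggfunc='sum')
hourly.index = pd.to_datetime(hourly.index)

# Weekly totals per hour of day
weekly = hourly.resample('W').sum()

# Print the 10 smallest hourly totals for each week
for row in weekly.itertuples():
    print(row[0], sorted(row[1:])[:10])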

Pandas time series resample, binning seems off

I was answering another question here about something in pandas I thought I knew, time series resampling, when I noticed this odd binning.
Let's say I have a dataframe with a daily date range index and a column I want to resample and sum on.
import numpy as np
import pandas as pd
index = pd.date_range(start="1/1/2018", end="31/12/2018")
df = pd.DataFrame(np.random.randint(100, size=len(index)),
                  columns=["sales"], index=index)
>>> df.head()
sales
2018-01-01 66
2018-01-02 18
2018-01-03 45
2018-01-04 92
2018-01-05 76
Now I resample by one month, everything looks fine:
>>> df.resample("1M").sum()
sales
2018-01-31 1507
2018-02-28 1186
2018-03-31 1382
[...]
2018-11-30 1342
2018-12-31 1337
If I try to resample over more months, though, the binning starts to look off. This is particularly evident with 6M:
df.resample("6M").sum()
sales
2018-01-31 1507
2018-07-31 8393
2019-01-31 7283
The first bin spans just over one month, and the last bin extends one month into the future. Maybe I have to set closed="left" to get the proper limits:
df.resample("6M", closed="left").sum()
sales
2018-06-30 8090
2018-12-31 9054
2019-06-30 39
Now I have an extra bin in 2019 with data from 2018-12-31...
Is this working properly? Am I missing an option I should set?
EDIT: here's the output I would expect when resampling one year in six-month intervals, with the first interval spanning Jan 1st to Jun 30 and the second spanning Jul 1st to Dec 31.
df.resample("6M", closed="left").sum()
sales
2018-06-30 8090
2018-12-31 9093 # 9054 + 39
Note that there's also some doubt here about what is happening with the June 30 data: does it go in the first bin, as I would expect, or in the second? With the last bin it's evident, but the same is probably happening in all the bins.
The M time offset alias implies month-end frequency.
What you need is 6MS, which is based on the MS alias for month-start frequency:
df.resample('6MS').sum()
resulting in
sales
2018-01-01 8130
2018-07-01 9563
2019-01-01 0
Also df.groupby(pd.Grouper(freq='6MS')).sum() can be used interchangeably.
For extra clarity you can compare ranges directly:
>>> pd.date_range('2018-01-01', '2018-12-31', freq='6M')
DatetimeIndex(['2018-01-31', '2018-07-31'], dtype='datetime64[ns]', freq='6M')
>>> pd.date_range('2018-01-01', '2018-12-31', freq='6MS')
DatetimeIndex(['2018-01-01', '2018-07-01'], dtype='datetime64[ns]', freq='6MS')
Adding np.random.seed(365) to the setup so both our outputs can be checked against each other:
print(df.resample("6M", kind='period').sum())
sales
2018-01 8794
2018-07 9033
Would this work for you?

Pandas DataReader: normalizing dates

I use the pandas-datareader package to pull economic time series from websites like FRED and Yahoo Finance. I have pulled the US recession (USREC) series from FRED and the historical S&P 500 (^GSPC) from Yahoo Finance.
Historical US recession:
web.DataReader("USREC", "fred", start, end)
Output:
2017-08-01 0
2017-09-01 0
2017-10-01 0
2017-11-01 0
S&P500 returns
web.DataReader("^GSPC",'yahoo',start,end)['Close'].to_frame().resample('M').mean().round()
Output:
2017-08-31 2456.0
2017-09-30 2493.0
2017-10-31 2557.0
2017-11-30 2594.0
I want to merge the two data frames, but one has beginning-of-month dates and the other has end-of-month dates. How do I either a) make the date column yyyy-mm, or b) make the date columns of both frames month beginning or month end?
Thanks for the help!
You can use MS to resample by the start of each month:
web.DataReader("^GSPC",'yahoo',start,end)['Close'].to_frame().resample('MS').mean().round()
Or it is possible to use to_period for a monthly PeriodIndex:
df1 = df1.to_period('M')
df2 = df2.to_period('M')
print (df1)
Close
2017-08 0
2017-09 0
2017-10 0
2017-11 0
print (df2)
Close
2017-08 2456.0
2017-09 2493.0
2017-10 2557.0
2017-11 2594.0
print (df1.index)
PeriodIndex(['2017-08', '2017-09', '2017-10', '2017-11'], dtype='period[M]', freq='M')
print (df2.index)
PeriodIndex(['2017-08', '2017-09', '2017-10', '2017-11'], dtype='period[M]', freq='M')
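Once both indexes are monthly periods, combining the frames is a plain join. A minimal sketch, assuming df1 is the USREC frame and df2 the resampled S&P 500 frame from above:
# Align the two frames on their monthly PeriodIndex; suffixes are only
# needed because both sample frames print a column called Close.
merged = df1.join(df2, lsuffix='_usrec', rsuffix='_sp500')
print(merged)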

Count unique dates in pandas dataframe

I have a dataframe of surface weather observations (fzraHrObs) organized by a station identifier code and date. fzraHrObs has several columns of weather data. The station code and date (datetime objects) look like:
usaf dat
716270 2014-11-23 12:00:00
2015-12-20 08:00:00
2015-12-20 09:00:00
2015-12-21 04:00:00
2015-12-28 03:00:00
716280 2015-12-19 08:00:00
2015-12-19 08:00:00
I would like to get a count of the number of unique dates (days) per year for each station - i.e. the number of days of obs per year at each station. In my example above this would give me:
usaf Year Count
716270 2014 1
2015 3
716280 2014 0
2015 1
I've tried using groupby and grouping by station, year, and date:
grouped = fzraHrObs['dat'].groupby([fzraHrObs['usaf'], fzraHrObs.dat.dt.year, fzraHrObs.dat.dt.date])
Count, size, nunique, etc. on this just gives me the number of obs on each date, not the number of dates themselves per year. Any suggestions on getting what I want here?
It could be something like this: group the date by usaf and year, and then count the number of unique values:
import pandas as pd
df.dat.apply(lambda dt: dt.date()).groupby([df.usaf, df.dat.apply(lambda dt: dt.year)]).nunique()
# usaf dat
# 716270 2014 1
# 2015 3
# 716280 2015 1
# Name: dat, dtype: int64
The following should work:
df.groupby(['usaf', df.dat.dt.year])['dat'].apply(lambda s: s.dt.date.nunique())
What I did differently is group by two levels only, then use the nunique method of pandas series to count the number of unique dates in each group.
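For completeness, here is a small self-contained sketch of that second approach, built on made-up rows shaped like the sample in the question:
import pandas as pd

fzraHrObs = pd.DataFrame({
    'usaf': [716270, 716270, 716270, 716270, 716270, 716280, 716280],
    'dat': pd.to_datetime(['2014-11-23 12:00:00', '2015-12-20 08:00:00',
                           '2015-12-20 09:00:00', '2015-12-21 04:00:00',
                           '2015-12-28 03:00:00', '2015-12-19 08:00:00',
                           '2015-12-19 08:00:00'])})

# Group by station and observation year, then count distinct calendar dates
counts = (fzraHrObs.groupby(['usaf', fzraHrObs.dat.dt.year])['dat']
          .apply(lambda s: s.dt.date.nunique()))
print(counts)
# usaf    dat
# 716270  2014    1
#         2015    3
# 716280  2015    1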
