I have a sequence of datetime objects and a series of data which spans several years. I can create a Series object and resample it to group it by month:
df = pd.Series(varv, index=dates)
multiMmean = df.resample("M").mean()
print(multiMmean)
This, however, outputs
2005-10-31 172.4
2005-11-30 69.3
2005-12-31 187.6
2006-01-31 126.4
2006-02-28 187.0
2006-03-31 108.3
...
2014-01-31 94.6
2014-02-28 82.3
2014-03-31 130.1
2014-04-30 59.2
2014-05-31 55.6
2014-06-30 1.2
which is a list of mean values for each month of the series. This is not what I want. I want 12 values, one for each month of the year, each being the mean of that month across all the years. How do I get that from multiMmean?
I have tried using resample("M").mean() on multiMmean and list comprehensions, but I cannot get it to work. What am I missing?
Thank you.
The following worked for me:
# create some random data with a datetime index spanning 17 months
import datetime as dt
import numpy as np
import pandas as pd

s = pd.Series(index=pd.date_range(start=dt.datetime(2014, 1, 1), end=dt.datetime(2015, 6, 1)),
              data=np.random.randn(517))
In [25]:
# now calc the mean for each month
s.groupby(s.index.month).mean()
Out[25]:
1 0.021974
2 -0.192685
3 0.095229
4 -0.353050
5 0.239336
6 -0.079959
7 0.022612
8 -0.254383
9 0.212334
10 0.063525
11 -0.043072
12 -0.172243
dtype: float64
So we can group by the month attribute of the DatetimeIndex and call mean(); this calculates the mean for each month across all the years.
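To make the contrast with the question's resample approach concrete, here is a small self-contained sketch with toy values (not the asker's data):

```python
import numpy as np
import pandas as pd

# Two years of toy monthly data: values 0..23 on month-end dates.
idx = pd.date_range('2013-01-31', periods=24, freq='M')
s = pd.Series(np.arange(24, dtype=float), index=idx)

# resample('M') keeps one mean per calendar month -> 24 values here.
per_month = s.resample('M').mean()

# Grouping by the month attribute collapses the years -> exactly 12 values;
# e.g. the January entry averages the 2013 and 2014 Januaries.
monthly_clim = s.groupby(s.index.month).mean()
```

With this toy data the January climatology is (0 + 12) / 2 = 6.0, one value per calendar month rather than one per month of the series.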
Related
As the title implies:
IN[1] :
dates = pd.date_range('10/10/2018', periods=11, freq='D')
close_prices = np.arange(len(dates))
close = pd.Series(close_prices, dates)
close
OUT[1]:
2018-10-10 0
2018-10-11 1
2018-10-12 2
2018-10-13 3
2018-10-14 4
2018-10-15 5
2018-10-16 6
2018-10-17 7
2018-10-18 8
2018-10-19 9
2018-10-20 10
IN[2] : close.resample('W').first()
OUT[2] :
2018-10-14 0
2018-10-21 5
Freq: W-SUN, dtype: int64
First, what do resample and first do?
And why do we have the date 2018-10-21, which did not exist in the series? And based on what do we get the 0 and 5?
Thanks
You have resampled your data by week. '2018-10-14' and '2018-10-21' are the last dates (Sundays) of each resampled week. So by resampling, you have aggregated your data into weekly samples, labelled by the Sundays 10-14 and 10-21. 0 and 5 are the values at the beginning of each respective week: the values on 10-10 and 10-15, the first dates in the series that fall within the weeks ending on those Sundays.
resample('W') groups the dates into weekly bins, each ending on a Sunday by default.
first() then selects the first value within each bin.
I have two dataframes, one is called Clim and one is called O3_mda8_3135. Clim is a dataframe including monthly average meteorological parameters for one year of data; here is a sample of the dataframe:
Clim.head(12)
Out[7]:
avgT_2551 avgT_5330 ... avgNOx_3135(ppb) avgCO_3135(ppm)
Month ...
1 14.924181 13.545691 ... 48.216128 0.778939
2 16.352172 15.415385 ... 36.110385 0.605629
3 20.530879 19.684720 ... 20.974544 0.460571
4 23.738576 22.919158 ... 14.270995 0.432855
5 26.961927 25.779007 ... 11.087005 0.334505
6 32.208322 31.225072 ... 12.801409 0.384325
7 35.280124 34.265880 ... 10.732970 0.321284
8 35.428857 34.433351 ... 11.916420 0.326389
9 32.008317 30.856782 ... 15.236616 0.343405
10 25.691444 24.139874 ... 24.829518 0.467317
11 19.310550 17.827946 ... 36.339847 0.621938
12 14.186050 12.860077 ... 49.173287 0.720708
[12 rows x 20 columns]
I also have the dataframe O3_mda8_3135, which was created by first calculating the rolling 8 hour average of each component, then finding the maximum daily value of ozone, which is why all of the timestamps and indices are different. There is one value for each meteorological parameter every day of the year. Here's a sample of this dataframe:
O3_mda8_3135
Out[9]:
date Temp_C_2551 ... CO_3135(ppm) O3_mda8_3135
12 2018-01-01 12:00:00 24.1 ... 0.294 10.4000
36 2018-01-02 12:00:00 26.3 ... 0.202 9.4375
60 2018-01-03 12:00:00 22.8 ... 0.184 7.1625
84 2018-01-04 12:00:00 25.6 ... 0.078 8.2500
109 2018-01-05 13:00:00 27.3 ... NaN 9.4500
... ... ... ... ...
8653 2018-12-27 13:00:00 19.6 ... 0.115 35.1125
8676 2018-12-28 12:00:00 14.9 ... 0.097 39.4500
8700 2018-12-29 12:00:00 13.9 ... 0.092 38.1250
8724 2018-12-30 12:00:00 17.4 ... 0.186 35.1375
8753 2018-12-31 17:00:00 8.3 ... 0.110 30.8875
[365 rows x 24 columns]
I am wondering how to subtract the average values in Clim from the corresponding columns and rows in O3_mda8_3135. For example, I would like to subtract the average value for temperature at site 2551 in January (avgT_2551 Month 1 in the Clim dataframe) from every day in January in the other dataframe O3_mda8_3135, column name Temp_C_2551.
avgT_2551 corresponds to Temp_C_2551 in the other dataframe
Is there a simple way to do this? Should I extract the month from the datetime and put it into another column for the O3_mda8_3135 dataframe? I am still a beginner and would appreciate any advice or tips.
I saw this post How to subtract the mean of a month from each day in that month? but there was not enough information given for me to understand what actions were being performed.
I figured it out on my own, thanks to Stack Overflow posts :)
I created new columns in both dataframes corresponding to the month. I had originally set the index in Clim to the Month using Clim = Clim.set_index('Month'), so I removed that line. Then I created a 'Month' column in the O3_mda8_3135 dataframe. After that, I merged the two dataframes on the 'Month' column, then used the Series.sub method to subtract the columns I desired.
Here's some example code, sorry the variables are so long but this dataframe is huge.
O3_mda8_3135['Month'] = O3_mda8_3135['date'].dt.month
O3_mda8_3135_anom = pd.merge(O3_mda8_3135, Clim, how='left', on=('Month'))
O3_mda8_3135_anom['O3_mda8_3135_anom'] = O3_mda8_3135_anom['O3_mda8_3135'].sub(O3_mda8_3135_anom['MDA8_3135'])
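For anyone reading along, here is the same merge-and-subtract pattern as a self-contained sketch. The frames daily and clim are hypothetical stand-ins with only one data column each (the real frames have many more):

```python
import pandas as pd

# Stand-in for O3_mda8_3135: daily values with a 'date' column.
daily = pd.DataFrame({
    'date': pd.to_datetime(['2018-01-05', '2018-01-20', '2018-02-10']),
    'Temp_C_2551': [14.0, 16.0, 17.0],
})

# Stand-in for Clim: one row per month, with Month as a column (not the index).
clim = pd.DataFrame({'Month': [1, 2], 'avgT_2551': [15.0, 16.0]})

# Tag each daily row with its month, left-merge the climatology in,
# and subtract the monthly mean to get the anomaly.
daily['Month'] = daily['date'].dt.month
merged = pd.merge(daily, clim, how='left', on='Month')
merged['T_anom_2551'] = merged['Temp_C_2551'].sub(merged['avgT_2551'])
```

Each January row gets January's average subtracted, each February row gets February's, and so on, which is exactly the per-month anomaly described above.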
These posts helped me answer my question:
python pandas extract year from datetime: df['year'] = df['date'].year is not working
How to calculate monthly mean of a time seies data and substract the monthly mean with the values of that month of each year?
Find difference between 2 columns with Nulls using pandas
Is there a (more) convenient/efficient method to calculate the number of business days between two dates using pandas?
I could do
len(pd.bdate_range(start='2018-12-03',end='2018-12-14'))-1 # minus one only if end date is a business day
but for longer distances between the start and end day this seems rather inefficient.
There are a couple of suggestion how to use the BDay offset object, but they all seem to refer to the creation of dateranges or something similar.
I am thinking more in terms of a Timedelta object that is represented in business-days.
Say I have two series, s1 and s2, containing datetimes. If pandas had something along the lines of
s1.dt.subtract(s2,freq='B')
# giving a new series containing timedeltas where the number of days calculated
# use business days only
that would be nice.
(numpy has a busday_count() method. But I would not want to convert my pandas Timestamps to numpy, as this can get messy.)
I think np.busday_count is a good idea here; also, converting to NumPy arrays is not necessary:
s1 = pd.Series(pd.date_range(start='05/01/2019',end='05/10/2019'))
s2 = pd.Series(pd.date_range(start='05/04/2019',periods=10, freq='5d'))
s = pd.Series([np.busday_count(a, b) for a, b in zip(s1, s2)])
print(s)
0 3
1 5
2 7
3 10
4 14
5 17
6 19
7 23
8 25
9 27
dtype: int64
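Since np.busday_count accepts array arguments, the Python-level loop can also be dropped entirely; a sketch of the same computation, vectorized:

```python
import numpy as np
import pandas as pd

s1 = pd.Series(pd.date_range(start='05/01/2019', end='05/10/2019'))
s2 = pd.Series(pd.date_range(start='05/04/2019', periods=10, freq='5d'))

# Cast both sides to datetime64[D] so busday_count can operate elementwise.
s = pd.Series(np.busday_count(s1.values.astype('datetime64[D]'),
                              s2.values.astype('datetime64[D]')))
```

This produces the same Series as the list comprehension above, without a per-element Python call.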
import pandas as pd
from xone import calendar

def business_dates(start, end):
    us_cal = calendar.USTradingCalendar()
    kw = dict(start=start, end=end)
    return pd.bdate_range(**kw).drop(us_cal.holidays(**kw))
In [1]: business_dates(start='2018-12-20', end='2018-12-31')
Out[1]: DatetimeIndex(['2018-12-20', '2018-12-21', '2018-12-24', '2018-12-26',
'2018-12-27', '2018-12-28', '2018-12-31'],
dtype='datetime64[ns]', freq=None)
Source: Get business days between start and end date using pandas
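If you would rather not add the xone dependency, pandas ships a built-in US federal holiday calendar that can play the same role; a sketch (note that federal holidays differ slightly from trading holidays):

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

def business_dates(start, end):
    # Drop US federal holidays (here: Christmas, 2018-12-25)
    # from the plain business-day range.
    holidays = USFederalHolidayCalendar().holidays(start=start, end=end)
    return pd.bdate_range(start=start, end=end).difference(holidays)

idx = business_dates('2018-12-20', '2018-12-31')
```

For this date range the result matches the USTradingCalendar output above: seven business days, with 2018-12-25 excluded.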
# create a dataframe with the dates
df = pd.DataFrame({'dates': pd.date_range(start='05/01/2019', end='05/31/2019')})
# keep only the dates that fall on business days
df[df['dates'].isin(pd.bdate_range(df['dates'].iloc[0], df['dates'].iloc[-1]))]
out[]:
0 2019-05-01
1 2019-05-02
2 2019-05-03
5 2019-05-06
6 2019-05-07
7 2019-05-08
8 2019-05-09
9 2019-05-10
12 2019-05-13
13 2019-05-14
14 2019-05-15
15 2019-05-16
16 2019-05-17
19 2019-05-20
20 2019-05-21
21 2019-05-22
22 2019-05-23
23 2019-05-24
26 2019-05-27
27 2019-05-28
28 2019-05-29
29 2019-05-30
30 2019-05-31
I have a DataFrame df with sporadic daily business day rows (i.e., there is not always a row for every business day.)
For each row in df I want to create a historical resampled mean dfm going back one month at a time. For example, if I have a row for 2018-02-22 then I want rolling means for rows in the following date ranges:
2018-01-23 : 2018-02-22
2017-12-23 : 2018-01-22
2017-11-23 : 2017-12-22
etc.
But I can't see a way to keep this pegged to the particular day of the month using conventional offsets. For example, if I do:
dfm = df.resample('30D').mean()
Then we see two problems:
It references the beginning of the DataFrame. In fact, I can't find a way to force .resample() to peg itself to the end of the DataFrame, even if I have it operate on df_reversed = df.loc[:'2018-02-22'].iloc[::-1]. Is there a way to "peg" the resampling to something other than the earliest date in the DataFrame? (And ideally to peg it to each particular row, as I run some lambda over the historical resampling associated with each row's date?)
It will drift over time, because not every month is 30 days long. So as I go back in time I will find that the interval 12 "months" prior ends 2017-02-27, not 2017-02-22 like I want.
Knowing that I want to resample by non-overlapping "months," the second problem can be well-defined for month days 29-31: For example, if I ask to resample for '2018-03-31' then the date ranges would end at the end of each preceding month:
2018-03-01 : 2018-03-31
2018-02-01 : 2018-02-28
2018-01-01 : 2018-01-31
etc.
Though again, I don't know: is there a good or easy way to do this in pandas?
tl;dr:
Given something like the following:
someperiods = 20 # this can be a number of days covering many years
somefrequency = '8D' # this can vary from 1D to maybe 10D
rng = pd.date_range('2017-01-03', periods=someperiods, freq=somefrequency)
df = pd.DataFrame({'x': rng.day}, index=rng) # x in practice is exogenous data
df['MonthPrior'] = df.index - pd.DateOffset(months=1)
Now:
For each row in df: calculate df['PreviousMonthMean'] = rolling average of all df.x in range [df.MonthPrior, df.index). In this example the resulting DataFrame would be:
Index x MonthPrior PreviousMonthMean
2017-01-03 3 2016-12-03 NaN
2017-01-11 11 2016-12-11 3
2017-01-19 19 2016-12-19 7
2017-01-27 27 2016-12-27 11
2017-02-04 4 2017-01-04 19
2017-02-12 12 2017-01-12 16.66666667
2017-02-20 20 2017-01-20 14.33333333
2017-02-28 28 2017-01-28 12
2017-03-08 8 2017-02-08 20
2017-03-16 16 2017-02-16 18.66666667
2017-03-24 24 2017-02-24 17.33333333
2017-04-01 1 2017-03-01 16
2017-04-09 9 2017-03-09 13.66666667
2017-04-17 17 2017-03-17 11.33333333
2017-04-25 25 2017-03-25 9
2017-05-03 3 2017-04-03 17
2017-05-11 11 2017-04-11 15
2017-05-19 19 2017-04-19 13
2017-05-27 27 2017-04-27 11
2017-06-04 4 2017-05-04 19
If we can get that far, then I need to find an efficient way to iterate that so that for each row in df I can aggregate consecutive but non-overlapping df['PreviousMonthMean'] values going back one calendar month at a time from the given DateTimeIndex....
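For reference, the PreviousMonthMean column in the table above can be reproduced with a plain per-row loop. This is only a brute-force sketch of the desired output, not the efficient pegged aggregation being asked for:

```python
import pandas as pd

rng = pd.date_range('2017-01-03', periods=20, freq='8D')
df = pd.DataFrame({'x': rng.day}, index=rng)
df['MonthPrior'] = df.index - pd.DateOffset(months=1)

# For each row, average x over the half-open window [MonthPrior, row date).
means = []
for ts, prior in zip(df.index, df['MonthPrior']):
    window = df.loc[(df.index >= prior) & (df.index < ts), 'x']
    means.append(window.mean())
df['PreviousMonthMean'] = means
```

The first row has an empty window and therefore NaN, matching the table; the 2017-02-12 row averages the 2017-01-19, 2017-01-27, and 2017-02-04 values, giving 16.666....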
Is it somehow possible to use resample on irregularly spaced data? (I know that the documentation says it's for "resampling of regular time-series data", but I wanted to try if it works on irregular data, too. Maybe it doesn't, or maybe I am doing something wrong.)
In my real data, I have generally 2 samples per hour, the time difference between them ranging usually from 20 to 40 minutes. So I was hoping to resample them to a regular hourly series.
To test if I am using it right, I used some random list of dates that I already had, so it may not be the best example, but at least a solution that works for it will be very robust. Here it is:
fraction number time
0 0.729797 0 2014-10-23 15:44:00
1 0.141084 1 2014-10-30 19:10:00
2 0.226900 2 2014-11-05 21:30:00
3 0.960937 3 2014-11-07 05:50:00
4 0.452835 4 2014-11-12 12:20:00
5 0.578495 5 2014-11-13 13:57:00
6 0.352142 6 2014-11-15 05:00:00
7 0.104814 7 2014-11-18 07:50:00
8 0.345633 8 2014-11-19 13:37:00
9 0.498004 9 2014-11-19 22:47:00
10 0.131665 10 2014-11-24 15:28:00
11 0.654018 11 2014-11-26 10:00:00
12 0.886092 12 2014-12-04 06:37:00
13 0.839767 13 2014-12-09 00:50:00
14 0.257997 14 2014-12-09 02:00:00
15 0.526350 15 2014-12-09 02:33:00
Now I want to resample these for example monthly:
df_new = df.set_index(pd.DatetimeIndex(df['time']))
df_new['fraction'] = df.fraction.resample('M',how='mean')
df_new['number'] = df.number.resample('M',how='mean')
But I get TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex' - unless I did something wrong with assigning the datetime index, it must be due to the irregularity?
So my questions are:
Am I using it correctly?
If 1==True, is there no straightforward way to resample the data?
(I only see a solution in first reindexing the data to get finer intervals, interpolate the values in between and then reindexing it to hourly interval. If it is so, then a question regarding the correct implementation of reindex will follow shortly.)
You don't need to explicitly use DatetimeIndex, just set 'time' as the index and pandas will take care of the rest, so long as your 'time' column has been converted to datetime using pd.to_datetime or some other method. Additionally, you don't need to resample each column individually if you're using the same method; just do it on the entire DataFrame.
# Convert to datetime, if necessary.
df['time'] = pd.to_datetime(df['time'])
# Set the index and resample (using month start freq for compact output).
df = df.set_index('time')
df = df.resample('MS').mean()
The resulting output:
fraction number
time
2014-10-01 0.435441 0.5
2014-11-01 0.430544 6.5
2014-12-01 0.627552 13.5
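The same pattern covers the question's real use case of pushing roughly two samples per hour onto a regular hourly grid; a sketch with made-up timestamps:

```python
import pandas as pd

times = pd.to_datetime(['2014-10-23 15:05', '2014-10-23 15:44',
                        '2014-10-23 16:20', '2014-10-23 16:55'])
df = pd.DataFrame({'fraction': [0.2, 0.4, 0.6, 0.8]}, index=times)

# Each hourly bin averages the samples that fell inside that hour.
hourly = df.resample('h').mean()
```

Irregular spacing within each hour is no problem: resample just averages whatever samples land in each bin, so the 15:00 row here is the mean of the 15:05 and 15:44 values.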