I have two dataframes, one called Clim and one called O3_mda8_3135. Clim contains monthly average meteorological parameters for one year of data; here is a sample of the dataframe:
Clim.head(12)
Out[7]:
avgT_2551 avgT_5330 ... avgNOx_3135(ppb) avgCO_3135(ppm)
Month ...
1 14.924181 13.545691 ... 48.216128 0.778939
2 16.352172 15.415385 ... 36.110385 0.605629
3 20.530879 19.684720 ... 20.974544 0.460571
4 23.738576 22.919158 ... 14.270995 0.432855
5 26.961927 25.779007 ... 11.087005 0.334505
6 32.208322 31.225072 ... 12.801409 0.384325
7 35.280124 34.265880 ... 10.732970 0.321284
8 35.428857 34.433351 ... 11.916420 0.326389
9 32.008317 30.856782 ... 15.236616 0.343405
10 25.691444 24.139874 ... 24.829518 0.467317
11 19.310550 17.827946 ... 36.339847 0.621938
12 14.186050 12.860077 ... 49.173287 0.720708
[12 rows x 20 columns]
I also have the dataframe O3_mda8_3135, which was created by first calculating the rolling 8-hour average of each component and then finding the daily maximum of ozone, which is why all of the timestamps and indices are different. There is one value for each meteorological parameter for every day of the year. Here's a sample of this dataframe:
O3_mda8_3135
Out[9]:
date Temp_C_2551 ... CO_3135(ppm) O3_mda8_3135
12 2018-01-01 12:00:00 24.1 ... 0.294 10.4000
36 2018-01-02 12:00:00 26.3 ... 0.202 9.4375
60 2018-01-03 12:00:00 22.8 ... 0.184 7.1625
84 2018-01-04 12:00:00 25.6 ... 0.078 8.2500
109 2018-01-05 13:00:00 27.3 ... NaN 9.4500
... ... ... ... ...
8653 2018-12-27 13:00:00 19.6 ... 0.115 35.1125
8676 2018-12-28 12:00:00 14.9 ... 0.097 39.4500
8700 2018-12-29 12:00:00 13.9 ... 0.092 38.1250
8724 2018-12-30 12:00:00 17.4 ... 0.186 35.1375
8753 2018-12-31 17:00:00 8.3 ... 0.110 30.8875
[365 rows x 24 columns]
I am wondering how to subtract the average values in Clim from the corresponding columns and rows in O3_mda8_3135. For example, I would like to subtract the average value for temperature at site 2551 in January (avgT_2551 Month 1 in the Clim dataframe) from every day in January in the other dataframe O3_mda8_3135, column name Temp_C_2551.
avgT_2551 corresponds to Temp_C_2551 in the other dataframe
Is there a simple way to do this? Should I extract the month from the datetime and put it into another column for the O3_mda8_3135 dataframe? I am still a beginner and would appreciate any advice or tips.
I saw this post How to subtract the mean of a month from each day in that month? but there was not enough information given for me to understand what actions were being performed.
I figured it out on my own, thanks to Stack Overflow posts :)
I created new columns in both dataframes corresponding to the month. I had originally set the index in Clim to Month using Clim = Clim.set_index('Month'), so I removed that line. Then I created a Month column in the O3_mda8_3135 dataframe. After that, I merged the two dataframes on the 'Month' column, then used the .sub method to subtract the columns I wanted.
Here's some example code; sorry the variable names are so long, but this dataframe is huge.
# add a Month column so each daily row can be matched to its climatology
O3_mda8_3135['Month'] = O3_mda8_3135['date'].dt.month
# left-join the monthly climatology onto every daily row by month
O3_mda8_3135_anom = pd.merge(O3_mda8_3135, Clim, how='left', on='Month')
# daily MDA8 minus the monthly mean MDA8 from Clim gives the anomaly
O3_mda8_3135_anom['O3_mda8_3135_anom'] = O3_mda8_3135_anom['O3_mda8_3135'].sub(O3_mda8_3135_anom['MDA8_3135'])
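The same subtraction generalizes to the other column pairs. Here's a sketch; the pairs mapping is hypothetical and would need to be filled in with the real climatology-to-daily column names:
# hypothetical mapping from climatology columns to their daily counterparts
pairs = {'avgT_2551': 'Temp_C_2551', 'avgCO_3135(ppm)': 'CO_3135(ppm)'}
for clim_col, daily_col in pairs.items():
    # daily value minus that month's climatological mean
    O3_mda8_3135_anom[daily_col + '_anom'] = O3_mda8_3135_anom[daily_col].sub(O3_mda8_3135_anom[clim_col])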
These posts helped me answer my question:
python pandas extract year from datetime: df['year'] = df['date'].year is not working
How to calculate monthly mean of a time seies data and substract the monthly mean with the values of that month of each year?
Find difference between 2 columns with Nulls using pandas
Related
I have the pandas dataframe below, with a datetime index. The dataframe shows data for the months of April and May. (The original dataframe has many more columns.)
I want to remove all the rows for the month of May, i.e., starting from index 2022-05-01 00:00:00 and ending at 2022-05-31 23:45:00. Currently I am doing it by explicitly mentioning the index labels, but I am sure there must be a more sophisticated way to do it without hard-coding them, so that if the data changes and I want to remove the next month, I don't have to change the code. I would appreciate help with this.
Current Code:
start_remove = pd.to_datetime('2022-05-01 00:00:00')
end_remove = pd.to_datetime('2022-05-31 23:45:00')
# keep only rows outside the May range
df = df.loc[(df.index < start_remove) | (df.index > end_remove)]
Sample Dataset:
date Open Close High Low
...
2022-04-30 23:30:00 10 11.4 10.2 10.7
2022-04-30 23:45:00 18 17.2 17.2 15.8
2022-05-01 00:00:00 24 24 24.8 24.8
2022-05-01 00:15:00 59 58 60 60.3
2022-05-01 00:30:00 43.7 43.9 48 48
...
...
2022-05-31 23:45:00 41.7 53.9 51 50
You may want to include the year when selecting the month, to avoid deleting the same month from other years.
# assumption: date field is an index
# and is already converted to datetime using pd.to_datetime
df.drop(df.loc[df.index.strftime('%Y%m') == '202205'].index)
Converting the index to datetime:
df.index = pd.to_datetime(df.index)
df
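An equivalent approach (a sketch, assuming the same datetime index) uses a boolean mask on the year and month instead of string formatting:
# keep every row that is not in May 2022
df = df[~((df.index.year == 2022) & (df.index.month == 5))]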
Rookie here so please excuse my question format:
I got an event time series dataset for two months (columns for "date/time" and "# of events", each row representing an hour).
I would like to highlight the 10 hours with the lowest numbers of events for each week. Is there a specific Pandas function for that? Thanks!
Let's say you have a dataframe df with column col as well as a datetime column.
You can simply sort the column with
import pandas as pd
df = pd.DataFrame({'col' : [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
'datetime' : ['2019-01-01 00:00:00','2015-02-01 00:00:00','2015-03-01 00:00:00','2015-04-01 00:00:00',
'2018-05-01 00:00:00','2016-06-01 00:00:00','2017-07-01 00:00:00','2013-08-01 00:00:00',
'2015-09-01 00:00:00','2015-10-01 00:00:00','2015-11-01 00:00:00','2015-12-01 00:00:00',
'2014-01-01 00:00:00','2020-01-01 00:00:00','2014-01-01 00:00:00']})
df = df.sort_values('col')
df = df.iloc[0:10,:]
df
Output:
col datetime
0 1 2019-01-01 00:00:00
1 2 2015-02-01 00:00:00
2 3 2015-03-01 00:00:00
3 4 2015-04-01 00:00:00
4 5 2018-05-01 00:00:00
5 6 2016-06-01 00:00:00
6 7 2017-07-01 00:00:00
7 8 2013-08-01 00:00:00
8 9 2015-09-01 00:00:00
9 10 2015-10-01 00:00:00
I know there's a function called nlargest. I guess there should be an nsmallest counterpart. pandas.DataFrame.nsmallest
df.nsmallest(n=10, columns=['col'])
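Since the question asks for the lowest values per week, here's a hedged sketch (assuming the datetime column from the example above) that combines a weekly Grouper with nsmallest:
import pandas as pd

# 10 smallest 'col' values within each calendar week
df['datetime'] = pd.to_datetime(df['datetime'])
weekly_lowest = df.groupby(pd.Grouper(key='datetime', freq='W'))['col'].nsmallest(10)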
My bad, so your DatetimeIndex is hourly sampled, and you need the hours with the fewest events each week.
...
Date n_events
2020-06-06 08:00:00 3
2020-06-06 09:00:00 3
2020-06-06 10:00:00 2
...
Well, I'd start by converting each hour into columns.
1. Create an hour column that holds the hour of the day.
df['hour'] = df['date'].dt.hour
2. Pivot the hour values into columns, with n_events as the values.
You'll then have one date index and 24 hour columns, with values denoting the number of events. pandas.DataFrame.pivot_table
...
Date hour0 ... hour8 hour9 hour10 ... hour23
2020-06-06 0 3 3 2 0
...
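A hedged sketch of that pivot, assuming the date and n_events columns from the sample (flooring to the day keeps a datetime index, so the weekly resample in the next step still works):
# one row per day, one column per hour, values are event counts
wide = df.pivot_table(index=df['date'].dt.floor('D'), columns='hour', values='n_events', aggfunc='sum')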
3. Then you can resample to weekly level, aggregating with sum.
df.resample('W').sum()
The last part is a bit tricky to do on the dataframe, but fairly simple if you just need the output.
for row in df.itertuples():
    print(sorted(row[1:])[:10])  # the 10 smallest hourly totals for that week
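To keep it on the dataframe, here's a sketch (assuming the hypothetical wide table from the pivot sketch above) that sorts each weekly row's hourly totals with numpy; note it returns the counts, not which hours produced them:
import numpy as np
import pandas as pd

weekly = wide.resample('W').sum()
# sort each week's 24 hourly totals ascending and keep the 10 smallest
lowest10 = pd.DataFrame(np.sort(weekly.to_numpy(), axis=1)[:, :10], index=weekly.index)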
I have a DataFrame df with sporadic daily business day rows (i.e., there is not always a row for every business day.)
For each row in df I want to create a historical resampled mean dfm going back one month at a time. For example, if I have a row for 2018-02-22 then I want rolling means for rows in the following date ranges:
2018-01-23 : 2018-02-22
2017-12-23 : 2018-01-22
2017-11-23 : 2017-12-22
etc.
But I can't see a way to keep this pegged to the particular day of the month using conventional offsets. For example, if I do:
dfm = df.resample('30D').mean()
Then we see two problems:
It references the beginning of the DataFrame. In fact, I can't find a way to force .resample() to peg itself to the end of the DataFrame – even if I have it operate on df_reversed = df.loc[:'2018-02-22'].iloc[::-1]. Is there a way to "peg" the resampling to something other than the earliest date in the DataFrame? (And ideally pegged to each particular row as I run some lambda on the associated historical resampling from each row's date?)
It will drift over time, because not every month is 30 days long. So as I go back in time I will find that the interval 12 "months" prior ends 2017-02-27, not 2017-02-22 like I want.
Knowing that I want to resample by non-overlapping "months," the second problem can be well-defined for month days 29-31: For example, if I ask to resample for '2018-03-31' then the date ranges would end at the end of each preceding month:
2018-03-01 : 2018-03-31
2018-02-01 : 2018-02-28
2018-01-01 : 2018-01-31
etc.
Though again, I don't know: is there a good or easy way to do this in pandas?
tl;dr:
Given something like the following:
someperiods = 20 # this can be a number of days covering many years
somefrequency = '8D' # this can vary from 1D to maybe 10D
rng = pd.date_range('2017-01-03', periods=someperiods, freq=somefrequency)
df = pd.DataFrame({'x': rng.day}, index=rng) # x in practice is exogenous data
from pandas.tseries.offsets import DateOffset
df['MonthPrior'] = df.index - DateOffset(months=1)
Now:
For each row in df: calculate df['PreviousMonthMean'] = rolling average of all df.x in range [df.MonthPrior, df.index). In this example the resulting DataFrame would be:
Index x MonthPrior PreviousMonthMean
2017-01-03 3 2016-12-03 NaN
2017-01-11 11 2016-12-11 3
2017-01-19 19 2016-12-19 7
2017-01-27 27 2016-12-27 11
2017-02-04 4 2017-01-04 19
2017-02-12 12 2017-01-12 16.66666667
2017-02-20 20 2017-01-20 14.33333333
2017-02-28 28 2017-01-28 12
2017-03-08 8 2017-02-08 20
2017-03-16 16 2017-02-16 18.66666667
2017-03-24 24 2017-02-24 17.33333333
2017-04-01 1 2017-03-01 16
2017-04-09 9 2017-03-09 13.66666667
2017-04-17 17 2017-03-17 11.33333333
2017-04-25 25 2017-03-25 9
2017-05-03 3 2017-04-03 17
2017-05-11 11 2017-04-11 15
2017-05-19 19 2017-04-19 13
2017-05-27 27 2017-04-27 11
2017-06-04 4 2017-05-04 19
If we can get that far, then I need an efficient way to iterate this so that, for each row in df, I can aggregate consecutive but non-overlapping df['PreviousMonthMean'] values going back one calendar month at a time from the given DatetimeIndex.
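For the first step, here's a minimal sketch (assuming the df and MonthPrior column constructed in the tl;dr) that computes the per-row window mean directly; it is O(n^2) in the number of rows, but it reproduces the PreviousMonthMean column shown above:
# mean of x over [MonthPrior, index) for each row; NaN when the window is empty
means = []
for ts, mp in zip(df.index, df['MonthPrior']):
    window = df.loc[(df.index >= mp) & (df.index < ts), 'x']
    means.append(window.mean())
df['PreviousMonthMean'] = means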
I have a dataframe with a full year of data, with one value per second:
YYYY-MO-DD HH-MI-SS_SSS TEMPERATURE (C)
2016-09-30 23:59:55.923 28.63
2016-09-30 23:59:56.924 28.61
2016-09-30 23:59:57.923 28.63
... ...
2017-05-30 23:59:57.923 30.02
I want to create a new dataframe which takes each week or month of values and averages them over the same hour of each day (a kind of moving average, but per hour).
So the result for the month case will be like this:
Date TEMPERATURE (C)
2016-09 00:00:00 28.63
2016-09 01:00:00 27.53
2016-09 02:00:00 27.44
...
2016-10 00:00:00 28.61
... ...
I'm aware that I could split the df into 12 dataframes, one for each month, and use:
hour = pd.to_timedelta(df['YYYY-MO-DD HH-MI-SS_SSS'].dt.hour, unit='H')
df2 = df.groupby(hour).mean()
But I'm searching for a better and faster way.
Thanks !!
Here's an alternate method of converting your date and time columns:
df['datetime'] = pd.to_datetime(df['YYYY-MO-DD'] + ' ' + df['HH-MI-SS_SSS'])
Additionally you could groupby both week and hour to form a MultiIndex dataframe (instead of creating and managing 12 dfs):
# weekofyear was removed in pandas 2.0; isocalendar().week is the current equivalent
df.groupby([df['datetime'].dt.isocalendar().week, df['datetime'].dt.hour]).mean()
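For the month case asked about specifically, a similar sketch (column names assumed from the sample data) groups by month period and hour in one pass:
# mean temperature for each (month, hour-of-day) pair
monthly_hourly = df.groupby([df['datetime'].dt.to_period('M'), df['datetime'].dt.hour])['TEMPERATURE (C)'].mean()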
I have a sequence of datetime objects and a series of data spanning several years. I can create a Series object and resample it to group it by months:
df = pd.Series(varv, index=dates)
multiMmean = df.resample('M').mean()
print(multiMmean)
This, however, outputs
2005-10-31 172.4
2005-11-30 69.3
2005-12-31 187.6
2006-01-31 126.4
2006-02-28 187.0
2006-03-31 108.3
...
2014-01-31 94.6
2014-02-28 82.3
2014-03-31 130.1
2014-04-30 59.2
2014-05-31 55.6
2014-06-30 1.2
which is a list of the mean value for each month of the series. This is not what I want. I want 12 values, one for every month of the year, with a mean for that month across all the years. How do I get that for multiMmean?
I have tried using .resample('M').mean() on multiMmean, and list comprehensions, but I cannot get it to work. What am I missing?
Thank you.
the following worked for me:
import datetime as dt
import numpy as np
import pandas as pd

# create some random data with a datetime index spanning 17 months
s = pd.Series(index=pd.date_range(start=dt.datetime(2014, 1, 1), end=dt.datetime(2015, 6, 1)),
              data=np.random.randn(517))
In [25]:
# now calc the mean for each month
s.groupby(s.index.month).mean()
Out[25]:
1 0.021974
2 -0.192685
3 0.095229
4 -0.353050
5 0.239336
6 -0.079959
7 0.022612
8 -0.254383
9 0.212334
10 0.063525
11 -0.043072
12 -0.172243
dtype: float64
So we can group by the month attribute of the DatetimeIndex and call mean; this calculates the mean for each calendar month across all the years.
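Applied to the monthly series from the question, the same pattern collapses multiMmean into 12 values, one per calendar month:
# average each calendar month's monthly means across the years
monthly_clim = multiMmean.groupby(multiMmean.index.month).mean()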