Get mean of last N weekdays for pandas dataframe - python

Assume my data is daily counts and has a DatetimeIndex as its index. Is there a way to get the average of the past n weekdays? For instance, if the date is Sunday, August 15th, I'd like to get the mean of the counts on (Sunday August 8th, Sunday August 1st, ...).
I started using pandas yesterday, so here's what I've brute forced.
# df is a DataFrame with a DatetimeIndex
# brute force for counts over the last n weekdays, where lnwd = last n weekdays
def lnwd(n=1):
    lnwd, tmp = df.shift(7), df.shift(7)  # count on the last same weekday
    for i in range(n - 1):
        tmp = tmp.shift(7)
        lnwd += tmp
    lnwd = lnwd / n  # average
    return lnwd
There has to be a one-liner? Is there a way to use apply() (without passing a function that has a for loop, since n is variable) or some form of groupby? For instance, the way to find the mean of all data on each weekday is:
df.groupby(lambda x: x.dayofweek).mean() # mean of each MTWHFSS

I think you are looking for a rolling apply (a rolling mean in this case)? See the docs: http://pandas.pydata.org/pandas-docs/stable/computation.html#moving-rolling-statistics-moments. To apply it to each weekday separately, you can combine rolling_mean with a grouping on the weekday using groupby.
This should give something like (with a series s):
s.groupby(s.index.weekday).transform(lambda x: pd.rolling_mean(x, window=n))

Using pandas version 1.4.1, the solution provided by joris seems outdated ("module 'pandas' has no attribute 'rolling_mean'"). The same can be achieved using
s.groupby(s.index.weekday).transform(lambda x: x.rolling(window=n).mean())
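For illustration, a minimal runnable sketch (the daily index, the count values, and n=4 here are made up):
import numpy as np
import pandas as pd

# hypothetical daily counts over ten weeks
idx = pd.date_range('2021-08-01', periods=70, freq='D')
s = pd.Series(np.arange(70.0), index=idx)

n = 4  # average over the last 4 occurrences of the same weekday
same_weekday_mean = s.groupby(s.index.weekday).transform(lambda x: x.rolling(window=n).mean())
Note that the window includes the current day; to average only the past n same weekdays, as in the question's shift-based version, use lambda x: x.shift().rolling(window=n).mean() instead.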

Related

Trouble resampling pandas timeseries from 1min to 5min data

I have 1-minute interval intraday stock data which looks like this:
import yfinance as yf
import pandas as pd
n = yf.download('^nsei', period='5d', interval='1m')
I am trying to resample it to '5m' data like this:
n = n.resample('5T').agg(dict(zip(n.columns, ['first', 'max', 'min', 'last', 'last', 'sum'])))
But it also tries to resample datetimes that are not in my data. The market data is only available till 03:30 PM, but when I look at the resampled dataframe I find it has tried to resample the entire 24 hours.
How do I stop the resampling at 03:30 PM and move on to the succeeding date?
Right now the dataframe has mostly NaN values due to this. Any suggestions will be welcome.
I am not sure what you are trying to achieve with that agg() function. Assuming 'first' refers to the first quantile and 'last' to the last quantile, and that you want to calculate some statistics per column, I suggest you do the following:
Get your data:
import yfinance as yf
import pandas as pd
n = yf.download('^nsei', period='5d', interval='1m')
Resample your data:
Note: your result is the same as when you resample with n.resample('5T').first(), but this means every value in the dataframe equals the first value from the 5-minute interval consisting of 5 values. A more logical resampling method is to use the mean() or sum() function, as shown below.
If this is stock price data, it makes more sense to use mean():
resampled_df = n.resample('5T').mean()
To remove resampled rows that fall outside trading hours, you have two options.
Option 1: drop NaN values:
filtered_df = resampled_df.dropna()
Note: this will not work if you use sum() since the result won't contain missing values but zeros.
Option 2: filter based on start and end times.
Get the minimum and maximum time of day at which data is available, as datetime.time objects:
start = n.index.min().time() # 09:15 as datetime.time object
end = n.index.max().time() # 15:29 as datetime.time object
Filter dataframe based on start and end times:
filtered_df = resampled_df.between_time(start, end)
Get the statistics:
statistics = filtered_df.describe()
statistics
Note that describe() will not contain the sum, so in order to add it you could do:
statistics = pd.concat([statistics, filtered_df.agg(['sum'])])
statistics
The agg() applies an individual aggregation method to each column; I used it so that I can see the 'candlestick' formation, as it is called in stock technical analysis.
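In that case, a sketch combining the candlestick-style agg with the filtering above (assuming the standard yfinance column names, and the start/end bounds computed earlier):
# candlestick-style 5-minute bars
bars = n.resample('5T').agg({'Open': 'first', 'High': 'max', 'Low': 'min',
                             'Close': 'last', 'Adj Close': 'last', 'Volume': 'sum'})
# keep only trading hours and drop empty bars
bars = bars.between_time(start, end).dropna()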
I was able to fix the issue by dropping the NaN values.

Remove last n days from dataframe

I have a pandas dataframe with a datetime index (30 min frequency), and I want to remove the last "n" days from it. My dataframe does not include weekends, so if its last day is a Monday, I want to remove Monday, Friday and Thursday (counting from the end). So I mean observed days, not calendar days. What is the most pythonic way to do it?
Thanks.
Pandas knows about Monday to Friday as business days.
So if you want to remove the last n business days from your dataframe, you can just do:
df.drop(df[df.index >= df.index.max().date()-pd.offsets.BDay(n-1)].index, inplace=True)
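A quick sketch to see it in action (synthetic 30-minute business-day data; n=2 is arbitrary):
import pandas as pd

# hypothetical 30-minute data restricted to business days
idx = pd.date_range('2021-03-01', '2021-03-10 23:30', freq='30T')
df = pd.DataFrame({'x': range(len(idx))}, index=idx)
df = df[df.index.dayofweek < 5]  # drop weekends

n = 2  # drops March 9th and 10th, the last two business days
df.drop(df[df.index >= df.index.max().date() - pd.offsets.BDay(n - 1)].index, inplace=True)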
If you really need to remove observed days in the dataframe, it will be slightly more complex because you will have to count the days. The code could be (using a companion dataframe called df_days):
# create a dataframe with the same index and only one row per day:
df_days = pd.DataFrame(index=df.index).assign(day=df.index.date).drop_duplicates('day')
# now count the observed days in the companion dataframe
df_days['new_day'] = 1
df_days['days'] = df_days['new_day'].cumsum()
# compute the first index to remove in order to drop the last n observed days
ix = df_days.loc[df_days['days'] == df_days['days'].max() + 1 - n].index[0]
# drop the last n observed days from the initial dataframe and delete the companion one
df = df.drop(df.loc[df.index >= ix].index)
del df_days
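An equivalent, more compact sketch of the same idea, using the unique observed dates (assuming the index is sorted):
# one entry per observed day, in index order
observed = pd.Index(df.index.date).unique()
cutoff = observed[-n]  # first observed day to remove
df = df[df.index.date < cutoff]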

Pandas - Cumsum over time periods + groupby time

I currently have a dataframe with a datetime index and a volume column (the example table was posted as an image).
I am trying to figure out how to do the following, and just don't know how to start...
1. For each DAY, cumsum the volume.
2. After this, group the data by time of day (i.e., 10-minute intervals). If a day doesn't have that interval (sometimes gaps), then it should just be treated as 0.
Any help would really be appreciated!
For number 1:
Let's use resample with D:
df.resample('D')['volume'].cumsum()
For number 2:
Let's use resample('10T') with asfreq and replace:
import numpy as np
df.resample('10T').asfreq().replace(np.nan, 0)
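A sketch tying both steps together on made-up data (with newer pandas, the per-day cumsum is more reliably written as a groupby on a daily pd.Grouper):
import numpy as np
import pandas as pd

# hypothetical intraday volume data with gaps
idx = pd.to_datetime(['2021-01-04 09:30', '2021-01-04 09:40', '2021-01-04 10:10',
                      '2021-01-05 09:30', '2021-01-05 09:50'])
df = pd.DataFrame({'volume': [10, 20, 30, 5, 15]}, index=idx)

# 1. cumulative volume within each day
df['cum_volume'] = df.groupby(pd.Grouper(freq='D'))['volume'].cumsum()

# 2. regular 10-minute grid, with gaps treated as 0
filled = df.resample('10T').asfreq().replace(np.nan, 0)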

Use dictionary on Pandas column?

I want to do something like the following:
df['Day'] = df['Day'].apply(lambda x: x + myDict[df['Month']]),
where
myDict={2:3,4:1,6:1,9:1,11:1,1:0,3:0,5:0,7:0,8:0,10:0,12:0}.
What I'm doing is adding a number of days onto the day of the month if it's a certain month. Ex: If it's February and the day of the month is 28, I add 3 to get 31.
But this does not work, because I really want to apply myDict to each row's value of df['Month'], not to the Month column directly.
Can I do iterrows inline for my command? I think this would perform faster through pandas than a big for loop iterating through the whole dataframe.
Try:
df.Day += df.Month.map(myDict)
Or (because I don't really get what you are doing):
df.Day += df.index.to_series().map(myDict)
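A tiny sketch of the first version on made-up columns:
import pandas as pd

myDict = {2: 3, 4: 1, 6: 1, 9: 1, 11: 1, 1: 0, 3: 0, 5: 0, 7: 0, 8: 0, 10: 0, 12: 0}
df = pd.DataFrame({'Month': [2, 4, 7], 'Day': [28, 30, 31]})

df.Day += df.Month.map(myDict)  # vectorized lookup, no iterrows needed
# df.Day is now [31, 31, 31]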

Get last date in each month of a time series pandas

Currently I'm generating a DateTimeIndex using a certain function, zipline.utils.tradingcalendar.get_trading_days. The time series is roughly daily but with some gaps.
My goal is to get the last date in the DateTimeIndex for each month.
.to_period('M') and .to_timestamp('M') don't work, since they give the last calendar day of the month rather than the last value of the variable in each month.
As an example, if this is my time series, I would want to select '2015-05-29', while the last day of the month is '2015-05-31'.
['2015-05-18', '2015-05-19', '2015-05-20', '2015-05-21',
'2015-05-22', '2015-05-26', '2015-05-27', '2015-05-28',
'2015-05-29', '2015-06-01']
Condla's answer came closest to what I needed, except that since my time index stretched over more than a year I needed to group by both month and year and then select the maximum date. Below is the code I ended up with.
# tempTradeDays is the initial DatetimeIndex
dateRange = []
tempYear = None
dictYears = tempTradeDays.groupby(tempTradeDays.year)
for yr in dictYears.keys():
    tempYear = pd.DatetimeIndex(dictYears[yr]).groupby(pd.DatetimeIndex(dictYears[yr]).month)
    for m in tempYear.keys():
        dateRange.append(max(tempYear[m]))
dateRange = pd.DatetimeIndex(dateRange).sort_values()
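For reference, a more compact sketch of the same year-and-month grouping (assuming tempTradeDays is a sorted DatetimeIndex):
s = tempTradeDays.to_series()
last_per_month = s.groupby([s.index.year, s.index.month]).max()
dateRange = pd.DatetimeIndex(last_per_month.sort_values())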
Suppose your data frame looks like this: [image of the original dataframe]
Then the following code will give you the last day of each month:
df_monthly = df.reset_index().groupby([df.index.year, df.index.month], as_index=False).last().set_index('index')
[image of the transformed dataframe]
This one-line code does its job :)
My strategy would be to group by month and then select the "maximum" of each group:
If "dt" is your DatetimeIndex object:
last_dates_of_the_month = []
dt_month_group_dict = dt.groupby(dt.month)
for month in dt_month_group_dict:
    last_date = max(dt_month_group_dict[month])
    last_dates_of_the_month.append(last_date)
The list "last_dates_of_the_month" contains all occurring last dates of each month in your dataset. You can use this list to create a DatetimeIndex in pandas again (or whatever you want to do with it).
This is an old question, but none of the existing answers here is perfect. This is the solution I came up with (assuming that date is a sorted index), which can even be written in one line, but I split it for readability:
month1 = pd.Series(apple.index.month)
month2 = pd.Series(apple.index.month).shift(-1)
mask = (month1 != month2)
apple[mask.values].head(10)
A few notes here:
Shifting a datetime series requires another pd.Series instance (hence the pd.Series wrapper above).
Boolean mask indexing requires .values, since the mask's integer index does not align with apple's date index.
By the way, when the dates are business days, it'd be easier to use resampling: apple.resample('BM').last()
Maybe the answer is not needed anymore, but while searching for an answer to the same question I found what may be a simpler solution:
import pandas as pd
sample_dates = pd.date_range(start='2010-01-01', periods=100, freq='B')
month_end_dates = sample_dates[sample_dates.is_month_end]
Try this to create a new diff column, in which the value 1 marks the first row of each new month (shift the mask by -1 if you want the last row of each month instead):
import numpy as np
df['diff'] = np.where(df['Date'].dt.month.diff() != 0, 1, 0)
