Plot certain days only pandas dataframe - python

Trying to get better at using Python and pandas...
I have some stock market data. I have added a day_of_week column (Monday, Tuesday, Wednesday, Thursday, Friday) based on the "Date" column. There are Open, High, Low and Close columns for each day, and I've also added a Pct_change column. What I want to do is print out, then plot, the Pct_change for all the Mondays (or whichever day I like). Then I want to plot the average Pct_change for each weekday in my data as a bar chart, with Monday, Tuesday etc. along the X axis and their average Pct_change on the Y axis.
So somehow I need to express "if the day_of_week column == 'Monday', return Pct_change". I'm assuming there's a better way to do this with pandas, with .loc or something, but I can't figure it out :(
The data is indexed by just an integer, then a Date, O,H,L,C, Volume, Range, Pct_change, day_of_week columns etc, below is my output from data.head()
Date Open High Low Settle Volume \
0 2017-07-20 12493.0 12567.0 12381.0 12422.0 94966.0
1 2017-07-19 12446.5 12481.5 12408.0 12432.5 68435.0
2 2017-07-18 12554.0 12569.0 12373.5 12425.5 96933.0
3 2017-07-17 12646.5 12668.0 12531.0 12587.0 65648.0
4 2017-07-14 12642.5 12658.5 12567.0 12611.0 59074.0
Prev. Day Open Interest day_of_week Range < MR > MR Pct_change
0 151217.0 Thursday 186.0 NaN 186.0 -0.568318
1 148249.0 Wednesday 73.5 73.5 NaN -0.112481
2 154485.0 Tuesday 195.5 NaN 195.5 -1.023578
3 145445.0 Monday 137.0 137.0 NaN -0.470486
4 144704.0 Friday 91.5 91.5 NaN -0.249160
Hope someone can help/point me in the right direction, thanks!!

I think you need groupby with an aggregate mean, and then Series.plot.bar on the result:
df.groupby('day_of_week')['Pct_change'].mean().plot.bar()
To control the order of the bars, you can convert the column to an ordered categorical:
cats = ['Monday','Tuesday','Wednesday','Thursday','Friday']
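# astype('category', categories=..., ordered=...) was removed from pandas; pd.Categorical does the same job
df['day_of_week'] = pd.Categorical(df['day_of_week'], categories=cats, ordered=True)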
df.groupby('day_of_week')['Pct_change'].mean().plot.bar()
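For the first part of the question (just the Mondays), a boolean mask with .loc should be enough. Below is a minimal sketch, assuming the column names from the question. The sample frame is made up for illustration (the 2017-07-24 row in particular is not from the original data), and the plotting calls are left as comments because they need matplotlib.

```python
import pandas as pd

# Small illustrative sample using the question's column names.
# The last row (2017-07-24) is invented so that there are two Mondays.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2017-07-14", "2017-07-17", "2017-07-18",
                            "2017-07-19", "2017-07-20", "2017-07-24"]),
    "Pct_change": [-0.249160, -0.470486, -1.023578,
                   -0.112481, -0.568318, 0.25],
})
df["day_of_week"] = df["Date"].dt.day_name()

# Boolean mask with .loc: keep only the Monday rows and two columns.
mondays = df.loc[df["day_of_week"] == "Monday", ["Date", "Pct_change"]]
print(mondays)

# Mean per weekday, reordered Monday..Friday so the bars come out in order.
order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
mean_by_day = df.groupby("day_of_week")["Pct_change"].mean().reindex(order)
print(mean_by_day)

# With matplotlib installed, the two plots would be:
# mondays.set_index("Date")["Pct_change"].plot()
# mean_by_day.plot.bar()
```

Swapping "Monday" for any other day name in the mask gives the other weekdays.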

Related

Pandas weekly resampling

I have a dataframe with daily market data (OHLCV) and am resampling it to weekly.
My specific requirement is that the weekly dataframe's index labels must be the index labels of the first day of that week, whose data is present in the daily dataframe.
For example, in July 2022, the trading week beginning 4th July (for US stocks) should be labelled 5th July, since 4th July was a holiday and not found in the daily dataframe, and the first date in that week found in the daily dataframe is 5th July.
The usual weekly resampling offset aliases and anchored offsets do not seem to have such an option.
I can achieve my requirement specifically for US stocks by importing USFederalHolidayCalendar from pandas.tseries.holiday and then using
bday_us = pd.offsets.CustomBusinessDay(calendar=USFederalHolidayCalendar())
dfw.index = dfw.index.map(lambda idx: bday_us.rollforward(idx))
where dfw is the already resampled weekly dataframe with W-MON as option.
However, this would mean that I'd have to use different trading calendars for each different exchange/market, which I'd very much like to avoid.
Any pointers on how to do this simply so that the index label in the weekly dataframe is the index label of the first day of that week available in the daily dataframe would be much appreciated.
You want to group all days by calendar week (Mon-Sun), then aggregate the data, and use the first observed date as the index, correct?
If so, W-MON is not applicable because you will group dates from Tuesday through Monday. Using W-SUN instead, you group by the calendar week where the index is the Sunday. However, you can use method first on the date column to obtain the first observed date in this week and replace the index with this result.
This is possible with either groupby or resample:
import numpy as np
import pandas as pd
# simulate daily data, drop a monday
date_range = pd.bdate_range(start='2022-06-06',end='2022-07-31')
date_range = date_range[~(date_range=='2022-07-04')]
# simulate data
df = pd.DataFrame(data = {
'date': date_range,
'return': np.random.random(size=len(date_range))
})
# resample with groupby
g = df.groupby([pd.Grouper(key='date', freq='W-SUN')])
result_groupby = g[['return']].mean() # example aggregation method
result_groupby['date_first_observed'] = g['date'].first()
result_groupby['date_last_observed'] = g['date'].last()
result_groupby.set_index('date_first_observed', inplace=True)
# resample with resample
df.index = df['date']
g = df.resample('W-SUN')
result_resample = g[['return']].mean() # example aggregation method
result_resample['date_first_observed'] = g['date'].first()
result_resample['date_last_observed'] = g['date'].last()
result_resample.set_index('date_first_observed', inplace=True)
This gives
>>> result_groupby
return date_last_observed
date_first_observed
2022-06-06 0.704949 2022-06-10
2022-06-13 0.460946 2022-06-17
2022-06-20 0.578682 2022-06-24
2022-06-27 0.361004 2022-07-01
2022-07-05 0.692309 2022-07-08
2022-07-11 0.569810 2022-07-15
2022-07-18 0.435222 2022-07-22
2022-07-25 0.454765 2022-07-29
>>> result_resample
return date_last_observed
date_first_observed
2022-06-06 0.704949 2022-06-10
2022-06-13 0.460946 2022-06-17
2022-06-20 0.578682 2022-06-24
2022-06-27 0.361004 2022-07-01
2022-07-05 0.692309 2022-07-08
2022-07-11 0.569810 2022-07-15
2022-07-18 0.435222 2022-07-22
2022-07-25 0.454765 2022-07-29
One row shows 2022-07-05 (Tuesday) instead of 2022-07-04 (Monday).
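Since the question is about OHLCV data rather than a single return column, here is a sketch of the same idea with per-column aggregations. It assumes lower-case column names (open, high, low, close, volume) and a DatetimeIndex; the prices are simulated and the helper column first_date is only introduced for this example.

```python
import numpy as np
import pandas as pd

# Simulated daily OHLCV data; 2022-07-04 (a holiday) is dropped on purpose.
idx = pd.bdate_range("2022-06-27", "2022-07-15")
idx = idx[idx != pd.Timestamp("2022-07-04")]
rng = np.random.default_rng(0)
close = 100 + rng.normal(size=len(idx)).cumsum()
daily = pd.DataFrame(
    {
        "open": close - 0.5,
        "high": close + 1.0,
        "low": close - 1.0,
        "close": close,
        "volume": rng.integers(1_000, 5_000, size=len(idx)),
    },
    index=idx,
)

# Carry the date along as a column so it can be aggregated like any other field.
weekly = (
    daily.assign(first_date=daily.index)
    .resample("W-SUN")
    .agg({
        "first_date": "first",  # first date actually present in that week
        "open": "first",
        "high": "max",
        "low": "min",
        "close": "last",
        "volume": "sum",
    })
    .dropna(subset=["first_date"])  # drop weeks with no trading days at all
    .set_index("first_date")
)
print(weekly)
```

Because the label comes from the data itself, no exchange-specific holiday calendar is needed.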

Adjusting Monthly Time Series Data in Pandas

I have a pandas DataFrame like this.
As you can see, the data corresponds to end-of-month data. The problem is that the end-of-month date is not the same for all the columns. (The underlying reason is that the last trading day of the month does not always coincide with the end of the month.)
Currently, the end of January 2016 has two rows, "2016-01-29" and "2016-01-31". It should be just one row. For example, the end of January 2016 should just be 451.1473 1951.218 1401.093 for Index A, Index B and Index C.
Another point is that even though each row almost always corresponds to end-of-month data, the data might not be that clean and could conceivably include mid-month data for some columns. In that case, I don't want to make any adjustment, so that any prior data collection error would be caught.
What is the most efficient way to achieve this goal?
EDIT:
Index A Index B Index C
DATE
2015-03-31 2067.89 1535.07 229.1
2015-04-30 2085.51 1543 229.4
2015-05-29 2107.39 NaN NaN
2015-05-31 NaN 1550.39 229.1
2015-06-30 2063.11 1534.96 229
2015-07-31 2103.84 NaN 228.8
2015-08-31 1972.18 1464.32 NaN
2015-09-30 1920.03 1416.84 227.5
2015-10-30 2079.36 NaN NaN
2015-10-31 NaN 1448.39 227.7
2015-11-30 2080.41 1421.6 227.6
2015-12-31 2043.94 1408.33 227.5
2016-01-29 1940.24 NaN NaN
2016-01-31 NaN 1354.66 227.5
2016-02-29 1932.23 1355.42 227.3
So, in this case, I need to combine the rows at the end of 2015-05, 2015-10 and 2016-01. However, the rows at 2015-07 and 2015-08 simply do not have data for some columns, so I would like to leave those values as NaN while merging the end-of-month rows at 2015-05, 2015-10 and 2016-01. Hopefully this provides more insight into what I am trying to do.
You can use:
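# pd.TimeGrouper and the how= argument were removed from pandas; this is the same logic with the current API
# (pandas 2.2+ spells the month-end alias 'ME' instead of 'M')
df = df.groupby(pd.Grouper(freq='M')).ffill()
df = df.resample('M').last()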
to create a new DatetimeIndex ending on the last day of each month and sample the last available data point for each month. The forward-fill step ensures that, for columns with missing data on the last available date, the prior available value is used.
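As a quick check of the idea, here is a sketch on a few rows copied from the question. It swaps in a different grouping technique, grouping by `df.index.to_period("M")` instead of resampling, so the result is indexed by month period rather than by a month-end timestamp; that choice is mine, made to avoid depending on a particular frequency alias. GroupBy `last()` takes the last non-null value per column, so the two end-of-month rows collapse into one.

```python
import numpy as np
import pandas as pd

# A subset of the rows shown in the question.
df = pd.DataFrame(
    {
        "Index A": [2107.39, np.nan, 2063.11, 2103.84, 1940.24, np.nan],
        "Index B": [np.nan, 1550.39, 1534.96, np.nan, np.nan, 1354.66],
        "Index C": [np.nan, 229.1, 229.0, 228.8, np.nan, 227.5],
    },
    index=pd.to_datetime(["2015-05-29", "2015-05-31", "2015-06-30",
                          "2015-07-31", "2016-01-29", "2016-01-31"]),
)

# One row per calendar month; each column keeps its last non-null value.
monthly = df.groupby(df.index.to_period("M")).last()
print(monthly)
```

Note that this merges any rows that fall in the same calendar month, including mid-month ones, so on its own it does not flag the data collection errors the question mentions.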

Filling date gaps in pandas dataframe

I have a pandas DataFrame (loaded from a .csv) with a date-time index, where there should be one entry per day.
The problem is that I have gaps, i.e. there are days for which I have no data at all.
What is the easiest way to insert rows (days) in the gaps? Also, is there a way to control what is inserted in the columns as data? Say 0, OR a copy of the previous day's values, OR values that slide up or down from the previous date's values toward the next date's values.
Thanks.
Here is an example where 01-03 and 01-04 are missing:
In [60]: df['2015-01-06':'2015-01-01']
Out[60]:
Rate High (est) Low (est)
Date
2015-01-06 1.19643 0.0000 0.0000
2015-01-05 1.20368 1.2186 1.1889
2015-01-02 1.21163 1.2254 1.1980
2015-01-01 1.21469 1.2282 1.2014
Still experimenting but this seems to solve the problem :
df.set_index(pd.DatetimeIndex(df.Date),inplace=True)
and then resample... The reason is that importing the .csv with the header column name Date does not actually create a DatetimeIndex, but a FrozenList, whatever that means.
resample() is expecting : if isinstance(ax, DatetimeIndex): .....
Here is my final solution :
#make dates the index
self.df.set_index(pd.DatetimeIndex(self.df.Date), inplace=True)
#fill the gaps
# (fill_method= was removed from resample(); chain .ffill() instead)
self.df = self.df.resample('D').ffill()
#fix the Date column
self.df.Date = self.df.index.values
I had to fix the Date column, because resample() only lets you pad it.
It fixes the index correctly though, so I could use the index to fix the Date column.
Here is a snippet of the data after correction:
2015-01-29 2015-01-29 1.13262 0.0000 0.0000
2015-01-30 2015-01-30 1.13161 1.1450 1.1184
2015-01-31 2015-01-31 1.13161 1.1450 1.1184
2015-02-01 2015-02-01 1.13161 1.1450 1.1184
01-31 and 02-01 are the newly generated rows (padded copies of 01-30).
You could resample by day, e.g. using mean if there are multiple entries per day:
df.resample('D').mean()
You can then ffill to replace NaNs with the previous day's result.
See up and down sampling in the docs.
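The three fill options the question asks about (zeros, copy of the previous day, values sliding toward the next known day) can be sketched as follows. This assumes the frame already has a DatetimeIndex sorted in ascending order (the question's frame is descending, so it would need `sort_index()` first). The sample values are the Rate column from the question.

```python
import pandas as pd

# Rate values from the question; 2015-01-03 and 2015-01-04 are missing.
df = pd.DataFrame(
    {"Rate": [1.21469, 1.21163, 1.20368, 1.19643]},
    index=pd.to_datetime(["2015-01-01", "2015-01-02",
                          "2015-01-05", "2015-01-06"]),
)

# Insert the missing days as rows of NaN.
daily = df.asfreq("D")

zeros = daily.fillna(0)                     # option 1: fill the gaps with 0
padded = daily.ffill()                      # option 2: copy the previous day
interp = daily.interpolate(method="time")   # option 3: linear in time between neighbours

print(pd.concat({"zeros": zeros["Rate"],
                 "padded": padded["Rate"],
                 "interp": interp["Rate"]}, axis=1))
```

`asfreq` only inserts rows; the fill policy is then a separate, explicit step, which makes it easy to use a different policy per column.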

Filling higher-frequency windows when upsampling with pandas

I am converting low-frequency data to a higher frequency with pandas (for instance monthly to daily). When making this conversion, I would like the resulting higher-frequency index to span the entire low-frequency window. For example, suppose I have a monthly series, like so:
import numpy as np
from pandas import *
data = np.random.randn(2)
s = Series(data, index=date_range('2012-01-01', periods=len(data), freq='M'))
s
2012-01-31 0
2012-02-29 1
Now, I convert it to daily frequency:
s.resample('D').asfreq()  # in current pandas, resample() alone returns a Resampler object
2012-01-31 0
2012-02-01 NaN
2012-02-02 NaN
2012-02-03 NaN
...
2012-02-27 NaN
2012-02-28 NaN
2012-02-29 1
Notice how the resulting output goes from 2012-01-31 to 2012-02-29. But what I really want is days from 2012-01-01 to 2012-02-29, so that the daily index "fills" the entire month of January, even if 2012-01-31 is still the only non-NaN observation in that month.
I'm also curious if there are built-in methods that give more control over how the higher-frequency period is filled with the lower frequency values. In the monthly to daily example, the default is to fill in just the last day of each month; if I use a PeriodIndex to index my series I can also s.resample('D', convention='start') to have only the first observation filled in. However, I also would like options to fill every day in the month with the monthly value, and to fill every day with the daily average (the monthly value divided by the number of days in the month).
Note that basic backfill and forward fill would not be sufficient to fill every daily observation in the month with the monthly value. For example, if the monthly series runs from January to March but the February value is NaN, then a forward fill would carry the January values into February, which is not desired.
How about this?
# DatetimeIndex(start=..., end=...) was removed from pandas; date_range builds the same index
s.reindex(date_range(start=s.index[0].replace(day=1), end=s.index[-1], freq='D'))
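The reindex above only extends the index; it still leaves every day except the month end as NaN. For the other two options in the question (every day gets the monthly value, or the monthly value divided by the number of days), one possible sketch is to look each day up by its month period. The sample values are made up so that the daily averages come out round, and February is NaN on purpose to show that nothing leaks across months.

```python
import pandas as pd

# Monthly series stamped at month end; February is deliberately missing.
s = pd.Series(
    [31.0, float("nan"), 62.0],
    index=pd.to_datetime(["2012-01-31", "2012-02-29", "2012-03-31"]),
)

# Daily index spanning the whole first month through the last observation.
days = pd.date_range(s.index[0].replace(day=1), s.index[-1], freq="D")

# Re-key the monthly values by month period, then look up each day's month.
monthly = s.copy()
monthly.index = s.index.to_period("M")
filled = pd.Series(monthly.reindex(days.to_period("M")).to_numpy(), index=days)

# Monthly value spread evenly over the days of that month.
daily_avg = filled / days.days_in_month.to_numpy()

print(filled["2012-01-29":"2012-02-02"])
print(daily_avg["2012-03-01":"2012-03-03"])
```

Unlike a forward or backward fill, a NaN month stays NaN for all of its days.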

Upsampling to weekly data in pandas

I'm running into problems when taking lower-frequency time-series in pandas, such as monthly or quarterly data, and upsampling it to a weekly frequency. For example,
data = np.arange(3, dtype=np.float64)
s = Series(data, index=date_range('2012-01-01', periods=len(data), freq='M'))
s.resample('W-SUN').asfreq()  # in current pandas, resample() alone returns a Resampler object
results in a series filled with NaN everywhere. Basically the same thing happens if I do:
s.reindex(date_range(start=s.index[0].replace(day=1), end=s.index[-1], freq='W-SUN'))  # date_range replaces the removed DatetimeIndex(start=...) constructor
If s were indexed with a PeriodIndex instead I would get an error: ValueError: Frequency M cannot be resampled to <1 Week: kwds={'weekday': 6}, weekday=6>
I can understand why this might happen, as the weekly dates don't exactly align with the monthly dates, and weeks can overlap months. However, I would like to implement some simple rules to handle this anyway. In particular, (1) set the last week ending in the month to the monthly value, (2) set the first week ending in the month to the monthly value, or (3) set all the weeks ending in the month to the monthly value. What might be an approach to accomplish that? I can imagine wanting to extend this to bi-weekly data as well.
EDIT: An example of what I would ideally like the output of case (1) to be would be:
2012-01-01 NaN
2012-01-08 NaN
2012-01-15 NaN
2012-01-22 NaN
2012-01-29 0
2012-02-05 NaN
2012-02-12 NaN
2012-02-19 NaN
2012-02-26 1
2012-03-04 NaN
2012-03-11 NaN
2012-03-18 NaN
2012-03-25 2
I made a github issue regarding your question. Need to add the relevant feature to pandas.
Case 3 is achievable directly via fill_method:
In [25]: s
Out[25]:
2012-01-31 0
2012-02-29 1
2012-03-31 2
Freq: M
In [26]: s.resample('W').ffill()  # originally s.resample('W', fill_method='ffill'); fill_method= has since been removed
Out[26]:
2012-02-05 0
2012-02-12 0
2012-02-19 0
2012-02-26 0
2012-03-04 1
2012-03-11 1
2012-03-18 1
2012-03-25 1
2012-04-01 2
Freq: W-SUN
But for others you'll have to do some contorting right now that will hopefully be remedied by the github issue before the next release.
Also it looks like you want the upcoming 'span' resampling convention as well that will upsample from the start of the first period to the end of the last period. I'm not sure there is an easy way to anchor the start/end points for a DatetimeIndex but it should at least be there for PeriodIndex.
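In the meantime, the three rules from the question can be worked out by hand. The following is only a sketch of one way to do the "contorting", not a built-in pandas feature: build the weekly index explicitly, tag each week with the month it ends in, and then keep the monthly value on the last week, the first week, or all weeks of each month.

```python
import numpy as np
import pandas as pd

# Monthly series from the question, stamped at month end.
s = pd.Series(
    [0.0, 1.0, 2.0],
    index=pd.to_datetime(["2012-01-31", "2012-02-29", "2012-03-31"]),
)

# Weeks ending on Sunday, from the start of the first month to the last date.
weeks = pd.date_range(s.index[0].replace(day=1), s.index[-1], freq="W-SUN")
week_month = weeks.to_period("M")          # month in which each week ends

# Monthly values looked up for every week: this is case (3).
monthly = s.copy()
monthly.index = s.index.to_period("M")
all_weeks = monthly.reindex(week_month).to_numpy()
case3 = pd.Series(all_weeks, index=weeks)

# Flag the last and first week ending in each month.
week_dates = pd.Series(weeks, index=week_month)
by_month = week_dates.groupby(level=0)
is_last = by_month.transform("max").to_numpy() == week_dates.to_numpy()
is_first = by_month.transform("min").to_numpy() == week_dates.to_numpy()

case1 = pd.Series(np.where(is_last, all_weeks, np.nan), index=weeks)   # last week only
case2 = pd.Series(np.where(is_first, all_weeks, np.nan), index=weeks)  # first week only

print(case1)
```

For the sample series, case1 has the same dates and values as the desired output shown in the question's edit. The same approach should carry over to bi-weekly data by changing the frequency passed to date_range.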
