I have a pandas DataFrame like this.
As you can see, the data corresponds to end-of-month values. The problem is that the end-of-month date is not the same for all the columns. (The underlying reason is that the last trading day of the month does not always coincide with the calendar end of the month.)
Currently, the end of 2016 January has two rows, "2016-01-29" and "2016-01-31". It should be just one row. For example, the end of 2016 January should just be 451.1473 1951.218 1401.093 for Index A, Index B and Index C.
Another point is that even though each row almost always corresponds to end-of-month data, the data might not be clean and could conceivably include mid-month data in a random column. In that case, I don't want to make any adjustment, so that any prior data-collection error would be caught.
What is the most efficient way to achieve this goal?
EDIT:
Index A Index B Index C
DATE
2015-03-31 2067.89 1535.07 229.1
2015-04-30 2085.51 1543 229.4
2015-05-29 2107.39 NaN NaN
2015-05-31 NaN 1550.39 229.1
2015-06-30 2063.11 1534.96 229
2015-07-31 2103.84 NaN 228.8
2015-08-31 1972.18 1464.32 NaN
2015-09-30 1920.03 1416.84 227.5
2015-10-30 2079.36 NaN NaN
2015-10-31 NaN 1448.39 227.7
2015-11-30 2080.41 1421.6 227.6
2015-12-31 2043.94 1408.33 227.5
2016-01-29 1940.24 NaN NaN
2016-01-31 NaN 1354.66 227.5
2016-02-29 1932.23 1355.42 227.3
So, in this case, I need to combine the rows at the end of 2015-05, 2015-10 and 2016-01. However, the rows at 2015-07 and 2015-08 simply do not have the data. So I would like to leave 2015-07 and 2015-08 as NaN, while merging the end-of-month rows at 2015-05, 2015-10 and 2016-01. Hopefully, this provides more insight into what I am trying to do.
You can use:
df = df.groupby(pd.Grouper(freq='M')).ffill()
df = df.resample('M').last()
to create a new DatetimeIndex ending on the last day of each month and take the last available data point for each month. The ffill() step ensures that, for columns with missing data on the last available date, the prior available value is used. (In older pandas this was spelled pd.TimeGrouper('M') and resample(rule='M', how='last'); both are gone now.)
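The idea can be sketched on a slice of the example data. Note that groupby(...).last() already takes the last non-null value per column within each month, so the rows can even be merged in one step:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Index A": [2107.39, np.nan, 1940.24, np.nan],
     "Index B": [np.nan, 1550.39, np.nan, 1354.66],
     "Index C": [np.nan, 229.1, np.nan, 227.5]},
    index=pd.to_datetime(["2015-05-29", "2015-05-31",
                          "2016-01-29", "2016-01-31"]),
)

# last() skips NaN, so each month's rows merge into one;
# months whose column is entirely NaN stay NaN, as desired
merged = df.groupby(pd.Grouper(freq="M")).last()
```

Mid-month rows would land in the same monthly bin, so a prior data-collection error would still have to be screened for separately, e.g. by checking that each input date is near its month end.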
(using Python 3.5.6)
Hi,
I am trying to prepare a dataframe to be used with zipline with a customized trading_calendar. Let's say that I imported the relevant timeseries data from 02:15 until 21:58, business days only; that info will be treated by zipline later with frequency="minute"... so the data needs to be accurate.
Something like this:
from collections import OrderedDict
from datetime import datetime
import pandas as pd

de = OrderedDict()
de['DE30'] = pd.read_csv("DE.30.1m.csv", header=None)
de['DE30']['date'] = pd.to_datetime(de['DE30']['time'], utc=True)
de['DE30'].set_index('date', inplace=True)
de['DE30'] = de['DE30'].resample(rule="min").mean()
de['DE30'].rename(columns={2: 'open', 3: 'high', 4: 'low', 5: 'close', 6: 'volume'}, inplace=True)
de['DE30'].loc[datetime(2020, 3, 30, 21, 57):].head()
date open high low close volume
2020-03-30 21:57:00+00:00 9846.4 9848.9 9839.4 9843.4 62.0
2020-03-30 21:58:00+00:00 9842.9 9842.9 9840.9 9840.9 2.0
2020-03-30 21:59:00+00:00 NaN NaN NaN NaN NaN
2020-03-30 22:00:00+00:00 NaN NaN NaN NaN NaN
2020-03-30 22:01:00+00:00 NaN NaN NaN NaN NaN
The tricky thing here is that, if I resample the full timeseries as above, I won't miss any "minute" of my original information (i.e. in case my .csv is missing some important rows by mistake), but I will then have the tedious task of removing all rows between 21:59 and 02:15 (non-trading hours) as well as weekend and holiday rows. How can I do this in a 'simple' way?
If I do not resample it, I will pass to zipline a panel that contains only the correct trading hours and business days (zipline will work with a pre-defined customized trading_calendar that I registered for that) however I can obtain errors like this:
KeyError: 'the label [2020-03-31 05:37:00+00:00] is not in the [index]'
because the row for minute 05:37 on that day is not in the original .csv.
Option 1) Is it possible to check whether there are 'missing' minutes in a specific timeframe without resampling? Or...
Option 2) in case I decide to resample, is there an easy way to clean rows with undesired times, weekends, etc.?
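Both options can be sketched on a toy frame. The 'close' column name and the 02:15–21:58 window come from the question; the timestamps and prices below are made up for illustration:

```python
import pandas as pd

# toy 1-minute data with the 09:02 row deliberately missing
idx = pd.to_datetime(["2020-03-30 09:00", "2020-03-30 09:01",
                      "2020-03-30 09:03", "2020-03-30 09:04"], utc=True)
prices = pd.DataFrame({"close": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# Option 1: find missing minutes without keeping a resampled frame
expected = pd.date_range(idx[0], idx[-1], freq="min")
missing = expected.difference(prices.index)

# Option 2: resample, then keep only trading hours and business days
filled = prices.resample("min").mean()
trading = filled.between_time("02:15", "21:58")
trading = trading[trading.index.dayofweek < 5]  # Mon=0 ... Fri=4
```

The weekday test does not handle exchange holidays; for that, the sessions of the registered trading_calendar could be used as a filter instead.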
I have the following dataframe after I appended the data from different sources of files:
Owed Due Date
Input NaN 51.83 08012019
Net NaN 35.91 08012019
Output NaN -49.02 08012019
Total -1.26 38.72 08012019
Input NaN 58.43 09012019
Net NaN 9.15 09012019
Output NaN -57.08 09012019
Total -3.48 10.50 09012019
Input NaN 66.50 10012019
Net NaN 9.64 10012019
Output NaN -64.70 10012019
Total -5.16 11.44 10012019
I have been trying to figure out how to reorganize this dataframe to become multi index like this:
I have tried to use melt and pivot, but with limited success in reshaping anything at all. I would appreciate some guidance!
P.S.: When using print(df), the date shows a two-digit day (e.g. 08). However, if I write this to a csv file, it becomes 8 instead of 08 for single-digit days. Hope someone can guide me on this too, thanks.
Here you go:
df.set_index('Date', append=True).unstack(0).dropna(axis=1)
set_index() moves Date to become an additional index level. Then unstack(0) pivots the original index labels into column names. Finally, drop the NaN columns and you have your desired result.
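A minimal sketch of that chain on one day's worth of the data above (column and index labels as shown in the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Owed": [np.nan, np.nan, np.nan, -1.26],
     "Due": [51.83, 35.91, -49.02, 38.72],
     "Date": ["08012019"] * 4},
    index=["Input", "Net", "Output", "Total"],
)

out = df.set_index("Date", append=True).unstack(0).dropna(axis=1)
# columns are now a MultiIndex: ('Owed', 'Total'), ('Due', 'Input'), ...
```

For the P.S.: the leading zero disappears when Date is parsed as an integer. Reading with dtype={'Date': str} keeps the "08012019" form, or, if Date is parsed via pd.to_datetime(df['Date'], format='%d%m%Y'), writing with df.to_csv(..., date_format='%d%m%Y') restores the zero-padded day.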
Trying to get better at using Python and pandas...
I have some stock market data. I have added a day_of_week column (Monday, Tuesday, Wednesday, Thursday, Friday) based on the "Date" column; obviously there are Open, High, Low and Close columns as well for each day, and I've also added a Pct_change column. What I want to do is print out, then plot, the Pct_change for all the Mondays, or whichever day I like. Then I want to plot the average Pct_change for each weekday in my data in a histogram-style chart, with Monday, Tuesday etc. along the X axis and their average Pct_change on the Y.
So somehow I need something like: if the day_of_week column == "Monday", return Pct_change? But I'm assuming there's a better way with pandas, with .loc or something, but I can't figure it out :(
The data is indexed by just an integer; then there are Date, O, H, L, C, Volume, Range, Pct_change, day_of_week columns etc. Below is my output from data.head():
Date Open High Low Settle Volume \
0 2017-07-20 12493.0 12567.0 12381.0 12422.0 94966.0
1 2017-07-19 12446.5 12481.5 12408.0 12432.5 68435.0
2 2017-07-18 12554.0 12569.0 12373.5 12425.5 96933.0
3 2017-07-17 12646.5 12668.0 12531.0 12587.0 65648.0
4 2017-07-14 12642.5 12658.5 12567.0 12611.0 59074.0
Prev. Day Open Interest day_of_week Range < MR > MR Pct_change
0 151217.0 Thursday 186.0 NaN 186.0 -0.568318
1 148249.0 Wednesday 73.5 73.5 NaN -0.112481
2 154485.0 Tuesday 195.5 NaN 195.5 -1.023578
3 145445.0 Monday 137.0 137.0 NaN -0.470486
4 144704.0 Friday 91.5 91.5 NaN -0.249160
Hope someone can help/point me in the right direction, thanks!!
I think you need groupby with aggregate mean and then DataFrame.plot.bar:
df.groupby('day_of_week')['Pct_change'].mean().plot.bar()
For ordering, it is possible to use an ordered categorical:
cats = ['Monday','Tuesday','Wednesday','Thursday','Friday']
df['day_of_week'] = df['day_of_week'].astype(pd.CategoricalDtype(categories=cats, ordered=True))
df.groupby('day_of_week')['Pct_change'].mean().plot.bar()
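Putting the pieces together on a few made-up rows (the plot call is commented out so the snippet runs headless):

```python
import pandas as pd

cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
df = pd.DataFrame({
    'day_of_week': ['Monday', 'Friday', 'Monday', 'Tuesday'],
    'Pct_change': [-0.47, -0.25, -0.51, -1.02],
})
df['day_of_week'] = df['day_of_week'].astype(
    pd.CategoricalDtype(categories=cats, ordered=True))

# .loc-style filtering for a single day
mondays = df.loc[df['day_of_week'] == 'Monday', 'Pct_change']

# average per weekday, in calendar order thanks to the categorical
means = df.groupby('day_of_week', observed=False)['Pct_change'].mean()
# means.plot.bar()
```

observed=False keeps weekdays with no rows (here Wednesday and Thursday) in the result as NaN bars rather than dropping them.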
I have a pandas DataFrame (loaded from .csv) with a date-time index, where there is (has to be) one entry per day.
The problem is that I have gaps, i.e. there are days for which I have no data at all.
What is the easiest way to insert rows (days) into the gaps? Also, is there a way to control what is inserted in the columns as data? Say 0, OR a copy of the previous day's info, OR sliding increasing/decreasing values in the range from the prev-date to the next-date data-values.
thanks
Here is an example; 01-03 and 01-04 are missing:
In [60]: df['2015-01-06':'2015-01-01']
Out[60]:
Rate High (est) Low (est)
Date
2015-01-06 1.19643 0.0000 0.0000
2015-01-05 1.20368 1.2186 1.1889
2015-01-02 1.21163 1.2254 1.1980
2015-01-01 1.21469 1.2282 1.2014
Still experimenting, but this seems to solve the problem:
df.set_index(pd.DatetimeIndex(df.Date), inplace=True)
and then resample... the reason being that importing the .csv with the header-col-name Date does not actually create a DatetimeIndex, but a FrozenList, whatever that means.
resample() is expecting: if isinstance(ax, DatetimeIndex): .....
Here is my final solution :
#make dates the index
self.df.set_index(pd.DatetimeIndex(self.df.Date), inplace=True)
#fill the gaps
self.df = self.df.resample('D').ffill()
#fix the Date column
self.df.Date = self.df.index.values
I had to fix the Date column, because resample() just pads it like any other column.
It fixes the index correctly, though, so I could use the index to fix the Date column.
Here is snipped of the data after correction :
2015-01-29 2015-01-29 1.13262 0.0000 0.0000
2015-01-30 2015-01-30 1.13161 1.1450 1.1184
2015-01-31 2015-01-31 1.13161 1.1450 1.1184
2015-02-01 2015-02-01 1.13161 1.1450 1.1184
01-30, 01-31 are the new generated data.
You could resample by day, e.g. using mean if there are multiple entries per day:
df.resample('D').mean()
You can then ffill() to replace NaNs with the previous day's result.
See up and down sampling in the docs.
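All three fill choices from the question (zeros, a copy of the previous day, sliding values between the neighbours) can be sketched like this, using the Rate column from the example above:

```python
import pandas as pd

df = pd.DataFrame(
    {"Rate": [1.21469, 1.21163, 1.20368, 1.19643]},
    index=pd.to_datetime(["2015-01-01", "2015-01-02",
                          "2015-01-05", "2015-01-06"]),
)

daily = df.resample("D").asfreq()    # insert the missing days as NaN rows
zeros = daily.fillna(0)              # fill the gaps with 0
padded = daily.ffill()               # copy the previous day's values
sliding = daily.interpolate("time")  # values sliding from prev toward next
```

interpolate("time") requires the datetime index, which is exactly what the set_index(pd.DatetimeIndex(...)) step above provides.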
I'm running into problems when taking lower-frequency time-series in pandas, such as monthly or quarterly data, and upsampling it to a weekly frequency. For example,
import numpy as np
from pandas import Series, date_range

data = np.arange(3, dtype=np.float64)
s = Series(data, index=date_range('2012-01-01', periods=len(data), freq='M'))
s.resample('W-SUN').asfreq()
results in a series filled with NaN everywhere. Basically the same thing happens if I do:
s.reindex(date_range(start=s.index[0].replace(day=1), end=s.index[-1], freq='W-SUN'))
If s were indexed with a PeriodIndex instead I would get an error: ValueError: Frequency M cannot be resampled to <1 Week: kwds={'weekday': 6}, weekday=6>
I can understand why this might happen, as the weekly dates don't exactly align with the monthly dates, and weeks can overlap months. However, I would like to implement some simple rules to handle this anyway. In particular, (1) set the last week ending in the month to the monthly value, (2) set the first week ending in the month to the monthly value, or (3) set all the weeks ending in the month to the monthly value. What might be an approach to accomplish that? I can imagine wanting to extend this to bi-weekly data as well.
EDIT: An example of what I would ideally like the output of case (1) to be would be:
2012-01-01 NaN
2012-01-08 NaN
2012-01-15 NaN
2012-01-22 NaN
2012-01-29 0
2012-02-05 NaN
2012-02-12 NaN
2012-02-19 NaN
2012-02-26 1
2012-03-04 NaN
2012-03-11 NaN
2012-03-18 NaN
2012-03-25 2
I made a github issue regarding your question; the relevant feature needs to be added to pandas.
Case 3 is achievable directly via ffill:
In [25]: s
Out[25]:
2012-01-31 0
2012-02-29 1
2012-03-31 2
Freq: M
In [26]: s.resample('W').ffill()
Out[26]:
2012-02-05 0
2012-02-12 0
2012-02-19 0
2012-02-26 0
2012-03-04 1
2012-03-11 1
2012-03-18 1
2012-03-25 1
2012-04-01 2
Freq: W-SUN
But for the others you'll have to do some contorting right now, which will hopefully be remedied by the github issue before the next release.
Also it looks like you want the upcoming 'span' resampling convention as well that will upsample from the start of the first period to the end of the last period. I'm not sure there is an easy way to anchor the start/end points for a DatetimeIndex but it should at least be there for PeriodIndex.
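Until then, case (1) can be contorted by hand: build the weekly index, then use Week(weekday=6).rollback to find, for each month-end, the last Sunday falling inside that month (the endpoints of the weekly range below are chosen to match the example output):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(3, dtype=np.float64),
              index=pd.date_range("2012-01-01", periods=3, freq="M"))

weekly = pd.date_range("2012-01-01", "2012-03-25", freq="W-SUN")
out = pd.Series(np.nan, index=weekly)

# roll each month-end back to the previous (or same) Sunday
sunday = pd.offsets.Week(weekday=6)
out.loc[[sunday.rollback(ts) for ts in s.index]] = s.values
```

Case (2) is the mirror image, using rollforward on the first day of each month instead.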