I'm looking for an equivalent specification to W-MON (weekly, ending Monday) for monthly data.
Specifically, I have a pandas DataFrame of daily data, and I want to take only monthly observations, starting from the most recent date and going back month by month.
So if today is 17/06/2016, my date index would be 17/06/2016, 17/05/2016, 17/04/2016... etc.
Right now I can only find month-start and month-end as specifications for df.asfreq().
Thanks.
You can create the relevant dates using relativedelta and select using .loc[]:
from datetime import date, datetime
from dateutil.relativedelta import relativedelta
from pandas_datareader.data import DataReader
Using daily sample data:
stock_data = DataReader('FB', 'yahoo', datetime(2013, 1, 1), datetime.today()).resample('D').ffill()['Open']
and a month end date to show how relativedelta treats this case:
today = date(2016, 1, 31)
Create the sequence of dates:
n_months = 30
dates = [today - relativedelta(years=m // 12, months=m % 12) for m in range(n_months)]
to get:
stock_data.loc[dates]
Date
2016-01-31 108.989998
2015-12-31 106.000000
2015-11-30 105.839996
2015-10-31 104.510002
2015-09-30 88.440002
2015-08-31 90.599998
2015-07-31 94.949997
2015-06-30 86.599998
2015-05-31 79.949997
2015-04-30 80.010002
2015-03-31 82.900002
2015-02-28 80.680000
2015-01-31 78.000000
2014-12-31 79.540001
2014-11-30 77.669998
2014-10-31 74.930000
2014-09-30 79.349998
2014-08-31 74.300003
2014-07-31 74.000000
2014-06-30 67.459999
2014-05-31 63.950001
2014-04-30 57.580002
2014-03-31 60.779999
2014-02-28 69.470001
2014-01-31 60.470001
2013-12-31 54.119999
2013-11-30 46.750000
2013-10-31 47.160000
2013-09-30 50.139999
2013-08-31 42.020000
Name: Open, dtype: float64
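Since the DataReader call above needs a network connection, here is a minimal self-contained sketch of just the date generation, showing how relativedelta clamps a month-end anchor to the last valid day of shorter months:

```python
from datetime import date
from dateutil.relativedelta import relativedelta

today = date(2016, 1, 31)
# step back one calendar month at a time; relativedelta clamps the day
# to the last valid day when the target month is shorter
dates = [today - relativedelta(months=m) for m in range(4)]
print(dates)
# [2016-01-31, 2015-12-31, 2015-11-30 (clamped from 31), 2015-10-31]
```

These dates can then be passed straight to `stock_data.loc[dates]` as shown above.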
I'm trying to convert daily prices into weekly, monthly, quarterly, semesterly, and yearly prices, but the code only works when I run it for one stock. When I add another stock to the list, the code crashes with two errors: 'ValueError: Length of names must match number of levels in MultiIndex.' and 'TypeError: other must be a MultiIndex or a list of tuples.' I'm not experienced with MultiIndexing and have searched everywhere without success.
This is the code:
import pandas as pd
import yfinance as yf
from pandas_datareader import data as pdr
symbols = ['AMZN', 'AAPL']
yf.pdr_override()
df = pdr.get_data_yahoo(symbols, start = '2014-12-01', end = '2021-01-01')
df = df.reset_index()
df.Date = pd.to_datetime(df.Date)
df.set_index('Date', inplace = True)
res = {'Open': 'first', 'Adj Close': 'last'}
dfw = df.resample('W').agg(res)
dfw_ret = (dfw['Adj Close'] / dfw['Open'] - 1)
dfm = df.resample('BM').agg(res)
dfm_ret = (dfm['Adj Close'] / dfm['Open'] - 1)
dfq = df.resample('Q').agg(res)
dfq_ret = (dfq['Adj Close'] / dfq['Open'] - 1)
dfs = df.resample('6M').agg(res)
dfs_ret = (dfs['Adj Close'] / dfs['Open'] - 1)
dfy = df.resample('Y').agg(res)
dfy_ret = (dfy['Adj Close'] / dfy['Open'] - 1)
print(dfw_ret)
print(dfm_ret)
print(dfq_ret)
print(dfs_ret)
print(dfy_ret)
This is what the original df prints:
Adj Close Open
AAPL AMZN AAPL AMZN
Date
2014-12-01 26.122288 326.000000 29.702499 338.119995
2014-12-02 26.022408 326.309998 28.375000 327.500000
2014-12-03 26.317518 316.500000 28.937500 325.730011
2014-12-04 26.217640 316.929993 28.942499 315.529999
2014-12-05 26.106400 312.630005 28.997499 316.799988
... ... ... ... ...
2020-12-24 131.549637 3172.689941 131.320007 3193.899902
2020-12-28 136.254608 3283.959961 133.990005 3194.000000
2020-12-29 134.440399 3322.000000 138.050003 3309.939941
2020-12-30 133.294067 3285.850098 135.580002 3341.000000
2020-12-31 132.267349 3256.929932 134.080002 3275.000000
And this is what the different df_ret frames print when I go from daily to weekly/monthly/etc. It can only do this for one stock, and the idea is to be able to do it for multiple stocks:
Date
2014-12-07 -0.075387
2014-12-14 -0.013641
2014-12-21 -0.029041
2014-12-28 0.023680
2015-01-04 0.002176
...
2020-12-06 -0.014306
2020-12-13 -0.012691
2020-12-20 0.018660
2020-12-27 -0.008537
2021-01-03 0.019703
Freq: W-SUN, Length: 318, dtype: float64
Date
2014-12-31 -0.082131
2015-01-30 0.134206
2015-02-27 0.086016
2015-03-31 -0.022975
2015-04-30 0.133512
...
2020-08-31 0.085034
2020-09-30 -0.097677
2020-10-30 -0.053569
2020-11-30 0.034719
2020-12-31 0.021461
Freq: BM, Length: 73, dtype: float64
Date
2014-12-31 -0.082131
2015-03-31 0.190415
2015-06-30 0.166595
2015-09-30 0.165108
2015-12-31 0.322681
2016-03-31 -0.095461
2016-06-30 0.211909
2016-09-30 0.167275
2016-12-31 -0.103026
2017-03-31 0.169701
2017-06-30 0.090090
2017-09-30 -0.011760
2017-12-31 0.213143
2018-03-31 0.234932
2018-06-30 0.199052
2018-09-30 0.190349
2018-12-31 -0.257182
2019-03-31 0.215363
2019-06-30 0.051952
2019-09-30 -0.097281
2019-12-31 0.058328
2020-03-31 0.039851
2020-06-30 0.427244
2020-09-30 0.141676
2020-12-31 0.015252
Freq: Q-DEC, dtype: float64
Date
2014-12-31 -0.082131
2015-06-30 0.388733
2015-12-31 0.538386
2016-06-30 0.090402
2016-12-31 0.045377
2017-06-30 0.277180
2017-12-31 0.202181
2018-06-30 0.450341
2018-12-31 -0.107405
2019-06-30 0.292404
2019-12-31 -0.039075
2020-06-30 0.471371
2020-12-31 0.180907
Freq: 6M, dtype: float64
Date
2014-12-31 -0.082131
2015-12-31 1.162295
2016-12-31 0.142589
2017-12-31 0.542999
2018-12-31 0.281544
2019-12-31 0.261152
2020-12-31 0.737029
Freq: A-DEC, dtype: float64
Without knowing what your df DataFrame looks like, I am assuming it is an issue with correctly handling the resampling on a MultiIndex, similar to the one discussed in this question.
The solution listed there is to use pd.Grouper with the freq and level parameters filled out correctly.
# This is just from the listed solution, so I am not sure if this is the correct level to choose
df.groupby(pd.Grouper(freq='W', level=-1))
If this doesn't work, I think you would need to provide some more detail or a dummy data set to reproduce the issue.
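If the crash comes from passing the dict of aggregations to frames with MultiIndex columns, one way to sidestep it is to aggregate each top-level field separately and then recombine. This is a minimal sketch with made-up prices (tickers and values are placeholders, not your data):

```python
import numpy as np
import pandas as pd

# daily sample data with MultiIndex columns like yfinance returns:
# level 0 = field, level 1 = ticker
idx = pd.date_range('2021-01-04', periods=10, freq='D')
cols = pd.MultiIndex.from_product([['Open', 'Adj Close'], ['AMZN', 'AAPL']])
df = pd.DataFrame(np.arange(1.0, 41.0).reshape(10, 4), index=idx, columns=cols)

# resample each field on its own, then combine into per-ticker returns
opens = df['Open'].resample('W').first()
closes = df['Adj Close'].resample('W').last()
dfw_ret = closes / opens - 1   # one weekly-return column per ticker
```

The same `opens`/`closes` pattern works unchanged for 'BM', 'Q', '6M', and 'Y' frequencies.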
I have a csv file like this, and this is the code I wrote to filter the dates:
example['date_1'] = pd.to_datetime(example['date_1'])
example['date_2'] = pd.to_datetime(example['date_2'])
example
date_1 ID date_2
2015-01-12 111 2016-01-20 08:34:00
2016-01-11 222 2016-12-15 08:34:00
2016-01-11 7770 2016-12-15 08:34:00
2016-01-10 7881 2016-11-17 08:32:00
2016-01-03 90243 2016-04-14 08:35:00
2016-01-03 90354 2016-04-14 08:35:00
2015-01-11 1140303 2015-12-15 08:43:00
2015-01-11 1140414 2015-12-15 08:43:00
example[(example['date_1'] <= '2016-11-01')
& (example['date_1'] >= '2015-11-01')
& (example['date_2'] <= '2016-12-16')
& (example['date_2'] >= '2015-12-15')]
Output:
2016-01-11 222 2016-12-15 08:34:00
2016-01-11 7770 2016-12-15 08:34:00
2016-01-10 7881 2016-11-17 08:32:00
2016-01-03 90243 2016-04-14 08:35:00
2016-01-03 90354 2016-04-14 08:35:00
I don't understand why it changes the format of the dates, and it seems to mix up the month and day. With the conditional filter, the expected result should be the same as the original dataset, but it erased several lines. Can someone help me with this? Many thanks.
Some locales format the date as dd/mm/YYYY, while others use mm/dd/YYYY. By default, pandas uses the American format of mm/dd/YYYY unless it can infer the alternate format from the values (for example, when a day number is greater than 12).
So if you know that you input date format is dd/mm/YYYY, you must say it to pandas:
example['date_1'] = pd.to_datetime(example['date_1'], dayfirst=True)
example['date_2'] = pd.to_datetime(example['date_2'], dayfirst=True)
Once pandas has a Timestamp column, it internally stores the number of nanoseconds since 1970-01-01 00:00, and by default displays it according to ISO 8601, stripping parts that are zero for the whole column (the time of day, fractions of a second, or nanoseconds).
You should not care about the display format if you just want to process the Timestamps. If at the end you want to force a specific format, explicitly convert the column to its string representation:
df['date_1'] = df['date_1'].dt.strftime('%d/%m/%Y %H:%M')
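A quick sketch of the difference, using hypothetical date strings:

```python
import pandas as pd

s = pd.Series(['12/01/2015', '11/01/2016'])

# default: month first -> December 1st, November 1st
print(pd.to_datetime(s).dt.month.tolist())                 # [12, 11]
# dayfirst=True: day first -> January 12th, January 11th
print(pd.to_datetime(s, dayfirst=True).dt.month.tolist())  # [1, 1]
```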
I have 7 columns of data, indexed by datetime (30-minute frequency), starting from 2017-05-31 and ending on 2018-05-25. I want to plot the mean over specific date ranges (seasons). I have been trying groupby, but I can't group by a specific range; I get wrong results if I do df.groupby(df.date.dt.month).mean().
A few lines from the dataset (date range is from 2017-05-31 to 2018-05-25)
50 51 56 58
date
2017-05-31 00:00:00 200.213542 276.929198 242.879051 NaN
2017-05-31 00:30:00 200.215478 276.928229 242.879051 NaN
2017-05-31 01:00:00 200.215478 276.925324 242.878083 NaN
2017-06-01 01:00:00 200.221288 276.944691 242.827729 NaN
2017-06-01 01:30:00 200.221288 276.944691 242.827729 NaN
2017-08-31 09:00:00 206.961886 283.374453 245.041349 184.358250
2017-08-31 09:30:00 206.966727 283.377358 245.042317 184.360187
2017-12-31 09:00:00 212.925877 287.198416 247.455413 187.175144
2017-12-31 09:30:00 212.926846 287.196480 247.465097 187.179987
2018-03-31 23:00:00 213.304498 286.933093 246.469647 186.887548
2018-03-31 23:30:00 213.308369 286.938902 246.468678 186.891422
2018-04-30 23:00:00 215.496812 288.342024 247.522230 188.104749
2018-04-30 23:30:00 215.497781 288.340086 247.520294 188.103780
I have created these variables (These are the ranges I need)
increment_rates_winter = df['2017-08-30'].mean() - df['2017-06-01'].mean()
increment_rates_spring = df['2017-11-30'].mean() - df['2017-09-01'].mean()
increment_rates_summer = df['2018-02-28'].mean() - df['2017-12-01'].mean()
increment_rates_fall = df['2018-05-24'].mean() - df['2018-03-01'].mean()
Concatenated them:
df_seasons =pd.concat([increment_rates_winter,increment_rates_spring,increment_rates_summer,increment_rates_fall],axis=1)
and after plotting, I got this:
However, I've been trying to get this:
df_seasons
Out[664]:
Winter Spring Summer Fall
50 6.697123 6.948447 -1.961549 7.662622
51 6.428329 4.760650 -2.188402 5.927087
52 5.580953 6.667529 1.136889 12.939295
53 6.406259 2.506279 -2.105125 6.964549
54 4.332826 3.678492 -2.574769 6.569398
56 2.222032 3.359607 -2.694863 5.348258
58 NaN 1.388535 -0.035889 4.213046
The seasons in x and the means plotted for each column.
Winter = df['2017-06-01':'2017-08-30']
Spring = df['2017-09-01':'2017-11-30']
Summer = df['2017-12-01':'2018-02-28']
Fall = df['2018-03-01':'2018-05-30']
Thank you in advance!
You can get a specific date range with a boolean mask in the following way, and then define the range however you want and take the mean:
import pandas as pd
df = pd.read_csv('test.csv')
df['date'] = pd.to_datetime(df['date'])
start_date = "2017-12-31 09:00:00"
end_date = "2018-04-30 23:00:00"
mask = (df['date'] > start_date) & (df['date'] <= end_date)
f_df = df.loc[mask]
This gives the output
date 50 ... 58
8 2017-12-31 09:30:00 212.926846 ... 187.179987 NaN
9 2018-03-31 23:00:00 213.304498 ... 186.887548 NaN
10 2018-03-31 23:30:00 213.308369 ... 186.891422 NaN
11 2018-04-30 23:00:00 215.496812 ... 188.104749 NaN
Hope this helps
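The same masking on a tiny made-up frame (values are placeholders), showing that the `>` start bound excludes the boundary row while `<=` includes the end:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2017-12-31 09:00:00',
                            '2017-12-31 09:30:00',
                            '2018-05-01 00:00:00']),
    '50': [212.925877, 212.926846, 215.496812],
})
mask = (df['date'] > '2017-12-31 09:00:00') & (df['date'] <= '2018-04-30 23:00:00')
f_df = df.loc[mask]   # keeps only the 09:30 row
```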
How about transposing it:
df_seasons.T.plot()
Output:
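Putting both ideas together: a self-contained sketch (random placeholder data, hypothetical column names) that slices each season with .loc, takes the column means, and puts seasons on the x-axis via the transpose:

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2017-05-31', '2018-05-25', freq='D')
df = pd.DataFrame(np.random.rand(len(rng), 2), index=rng, columns=['50', '51'])

names = ['Winter', 'Spring', 'Summer', 'Fall']
ranges = [('2017-06-01', '2017-08-30'), ('2017-09-01', '2017-11-30'),
          ('2017-12-01', '2018-02-28'), ('2018-03-01', '2018-05-24')]
# one column of means per season, one row per data column
df_seasons = pd.concat([df.loc[s:e].mean() for s, e in ranges], axis=1, keys=names)
# df_seasons.T.plot()  # seasons on x, one line per data column
```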
I have a simple time series of daily observations over 2 years. I basically want to plot the daily data for each month of the series (looking for the daily seasonality that occurs within each month). For example:
I'd expect a series on the chart for each month. Is there a way to split the dataframe easily to do this?
I'm trying to avoid doing this for each month/year...
df['JUN-2016'] = df[(df['date'].dt.month == 6) & (df['date'].dt.year == 2016)]
A sample of the dataframe:
DATE
2015-01-05 2.7483
2015-01-06 2.7400
2015-01-07 2.7250
2015-01-08 2.7350
2015-01-09 2.7350
2015-01-12 2.7350
2015-01-13 2.7450
2015-01-14 2.7450
2015-01-15 2.7350
2015-01-16 2.7183
2015-01-19 2.7300
2015-01-20 2.7150
2015-01-21 2.7150
2015-01-22 2.6550
2015-01-23 2.6500
2015-01-27 2.6450
2015-01-28 2.6350
2015-01-29 2.6100
2015-01-30 2.5600
2015-02-02 2.4783
2015-02-03 2.4700
First you need to convert the date column in your dataframe (let's say it is called df["date"]) into datetime format:
df["date"]=pd.to_datetime(df["date"])
also you need to import datetime library:
from datetime import datetime
Then you can just do:
startDateOfInterval = "2016-05-31"
endDateOfInterval = "2016-07-01"
dfOfDesiredMonth = df[df["date"].apply(lambda x: x > datetime.strptime(startDateOfInterval, "%Y-%m-%d") and x < datetime.strptime(endDateOfInterval, "%Y-%m-%d"))]
The df you will get will then only contain the rows with date within this interval.
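To avoid hand-writing a filter per month, you can also split the series into one piece per calendar month with a single groupby. A sketch on made-up daily data:

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2015-01-01', '2015-03-31', freq='D')
s = pd.Series(np.random.rand(len(rng)), index=rng)

# one sub-series per (year, month), re-indexed by day of month
# so each month can be plotted as its own line on a shared x-axis
by_month = {str(p): pd.Series(g.values, index=g.index.day)
            for p, g in s.groupby(s.index.to_period('M'))}
```

Each value in `by_month` can then be plotted as one series per month.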
I have a data-frame like this.
                     value  estimated
dttm_timezone
2011-12-31 20:10:00 10.7891 0
2011-12-31 20:15:00 11.2060 0
2011-12-31 20:20:00 19.9975 0
2011-12-31 20:25:00 15.9975 0
2011-12-31 20:30:00 10.9975 0
2011-12-31 20:35:00 13.9975 0
2011-12-31 20:40:00 15.9975 0
2011-12-31 20:45:00 11.7891 0
2011-12-31 20:50:00 10.9975 0
2011-12-31 20:55:00 10.3933 0
Using the dttm_timezone index, I would like to extract all the rows that fall within a given day, week, or month.
I have one year of data, so if I select day as the duration I should get 365 separate daily extracts, and if I select month I should get 12 separate monthly extracts.
How can I achieve this?
Let's use some sample data:
import pandas as pd
import numpy as np
tidx = pd.date_range('2010-01-01', '2014-12-31', freq='H', name='dtime')
np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(len(tidx)), tidx, ['value'])
You can limit to '2010' like this:
df.loc['2010']
Or
df[df.index.year == 2010]
You can limit to a specific month by:
df.loc['2010-04']
or all Aprils:
df[df.index.month == 4]
You can limit to a specific day:
df.loc['2010-04-28']
all 1:00 pm's:
df[df.index.hour == 13]
range of dates:
df['2011':'2013']
or
df['2011-01-01':'2013-06-30']
There are tons of ways to do this:
df.loc[(df.index.month == 11) & (df.index.hour == 22)]
The list can go on and on; please read the docs on time series / date functionality.
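And to actually split the frame into one piece per day or per month, as the question asks, a single groupby over the index works. A small sketch with hourly sample data:

```python
import numpy as np
import pandas as pd

tidx = pd.date_range('2011-01-01', periods=3 * 24, freq='H', name='dtime')
df = pd.DataFrame({'value': np.arange(len(tidx))}, index=tidx)

# one sub-frame per calendar day
daily = {day: g for day, g in df.groupby(df.index.date)}
# one sub-frame per calendar month
monthly = {m: g for m, g in df.groupby(df.index.to_period('M'))}
```

With a full year of data, `daily` would hold 365 frames and `monthly` would hold 12.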