Turning daily stock prices into weekly/monthly/quarterly/semi-annual/yearly? - python

I'm trying to convert daily prices into weekly, monthly, quarterly, semi-annual, and yearly returns, but the code only works when I run it for one stock. When I add another stock to the list, the code crashes with two errors: 'ValueError: Length of names must match number of levels in MultiIndex.' and 'TypeError: other must be a MultiIndex or a list of tuples.' I'm not experienced with MultiIndexing and have searched everywhere with no success.
This is the code:
import pandas as pd
import yfinance as yf
from pandas_datareader import data as pdr
symbols = ['AMZN', 'AAPL']
yf.pdr_override()
df = pdr.get_data_yahoo(symbols, start = '2014-12-01', end = '2021-01-01')
df = df.reset_index()
df.Date = pd.to_datetime(df.Date)
df.set_index('Date', inplace = True)
res = {'Open': 'first', 'Adj Close': 'last'}
dfw = df.resample('W').agg(res)
dfw_ret = (dfw['Adj Close'] / dfw['Open'] - 1)
dfm = df.resample('BM').agg(res)
dfm_ret = (dfm['Adj Close'] / dfm['Open'] - 1)
dfq = df.resample('Q').agg(res)
dfq_ret = (dfq['Adj Close'] / dfq['Open'] - 1)
dfs = df.resample('6M').agg(res)
dfs_ret = (dfs['Adj Close'] / dfs['Open'] - 1)
dfy = df.resample('Y').agg(res)
dfy_ret = (dfy['Adj Close'] / dfy['Open'] - 1)
print(dfw_ret)
print(dfm_ret)
print(dfq_ret)
print(dfs_ret)
print(dfy_ret)
This is what the original df prints:
Adj Close Open
AAPL AMZN AAPL AMZN
Date
2014-12-01 26.122288 326.000000 29.702499 338.119995
2014-12-02 26.022408 326.309998 28.375000 327.500000
2014-12-03 26.317518 316.500000 28.937500 325.730011
2014-12-04 26.217640 316.929993 28.942499 315.529999
2014-12-05 26.106400 312.630005 28.997499 316.799988
... ... ... ... ...
2020-12-24 131.549637 3172.689941 131.320007 3193.899902
2020-12-28 136.254608 3283.959961 133.990005 3194.000000
2020-12-29 134.440399 3322.000000 138.050003 3309.939941
2020-12-30 133.294067 3285.850098 135.580002 3341.000000
2020-12-31 132.267349 3256.929932 134.080002 3275.000000
And this is what the different df_ret frames print when I go from daily to weekly/monthly/etc. It can only do it for one stock, and the idea is to be able to do it for multiple stocks:
Date
2014-12-07 -0.075387
2014-12-14 -0.013641
2014-12-21 -0.029041
2014-12-28 0.023680
2015-01-04 0.002176
...
2020-12-06 -0.014306
2020-12-13 -0.012691
2020-12-20 0.018660
2020-12-27 -0.008537
2021-01-03 0.019703
Freq: W-SUN, Length: 318, dtype: float64
Date
2014-12-31 -0.082131
2015-01-30 0.134206
2015-02-27 0.086016
2015-03-31 -0.022975
2015-04-30 0.133512
...
2020-08-31 0.085034
2020-09-30 -0.097677
2020-10-30 -0.053569
2020-11-30 0.034719
2020-12-31 0.021461
Freq: BM, Length: 73, dtype: float64
Date
2014-12-31 -0.082131
2015-03-31 0.190415
2015-06-30 0.166595
2015-09-30 0.165108
2015-12-31 0.322681
2016-03-31 -0.095461
2016-06-30 0.211909
2016-09-30 0.167275
2016-12-31 -0.103026
2017-03-31 0.169701
2017-06-30 0.090090
2017-09-30 -0.011760
2017-12-31 0.213143
2018-03-31 0.234932
2018-06-30 0.199052
2018-09-30 0.190349
2018-12-31 -0.257182
2019-03-31 0.215363
2019-06-30 0.051952
2019-09-30 -0.097281
2019-12-31 0.058328
2020-03-31 0.039851
2020-06-30 0.427244
2020-09-30 0.141676
2020-12-31 0.015252
Freq: Q-DEC, dtype: float64
Date
2014-12-31 -0.082131
2015-06-30 0.388733
2015-12-31 0.538386
2016-06-30 0.090402
2016-12-31 0.045377
2017-06-30 0.277180
2017-12-31 0.202181
2018-06-30 0.450341
2018-12-31 -0.107405
2019-06-30 0.292404
2019-12-31 -0.039075
2020-06-30 0.471371
2020-12-31 0.180907
Freq: 6M, dtype: float64
Date
2014-12-31 -0.082131
2015-12-31 1.162295
2016-12-31 0.142589
2017-12-31 0.542999
2018-12-31 0.281544
2019-12-31 0.261152
2020-12-31 0.737029
Freq: A-DEC, dtype: float64

Without knowing what your df DataFrame looks like, I am assuming it is an issue with correctly handling resampling on a MultiIndex, similar to the one talked about in this question.
The solution listed there is to use pd.Grouper with the freq and level parameters filled out correctly.
# This is just from the listed solution, so I am not sure if this is the correct level to choose
df.groupby(pd.Grouper(freq='W', level=-1))
If this doesn't work, I think you would need to provide some more detail or a dummy data set to reproduce the issue.
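As a hedged sketch of one way this could look for your frame (assuming the ticker names sit on the second column level, as the printout in your question suggests), you can stack the ticker level into the index and resample each ticker separately:
import pandas as pd

# Move the ticker level out of the columns: the result has plain
# 'Open' / 'Adj Close' columns and a (Date, Ticker) row index.
long_df = df.stack(level=1)
long_df.index.names = ['Date', 'Ticker']

res = {'Open': 'first', 'Adj Close': 'last'}

# Resample each ticker's rows independently, then compute period returns.
dfw = (long_df.reset_index('Ticker')
              .groupby('Ticker')
              .resample('W')          # swap 'W' for 'BM', 'Q', '6M', or 'Y'
              .agg(res))
dfw_ret = dfw['Adj Close'] / dfw['Open'] - 1  # one return per (Ticker, week)
This keeps one return series per ticker instead of mixing both stocks into a single column, which is what trips the MultiIndex errors.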

Related

How to change DataFrame column names

Let's say I have a list of stock symbols: all = ['AAPL', 'TSLA', 'MSFT'].
I created a function to scrape their historical prices from yfinance.
def Scrapping(symbol):
    aapl = yf.Ticker(symbol)
    ainfo = aapl.history(period='1y')
    global castex_date
    castex_date = ainfo.index
    return ainfo.Close
all_Assets = list(map(Scrapping, all))
print(all_Assets)
This is an output sample:
[Date
2018-12-12 00:00:00-05:00 53.183998
2018-12-13 00:00:00-05:00 53.095001
Name: Close, Length: 1007, dtype: float64, Date
2018-12-12 00:00:00-05:00 24.440001
2018-12-13 00:00:00-05:00 25.119333
2022-12-08 00:00:00-05:00 247.399994
2022-12-09 00:00:00-05:00 245.419998
Name: Close, Length: 1007, dtype: float64]
The issue is that all these symbols' historical series have the same name, 'Close'. When putting all of these in a DataFrame, we get:
df = pd.DataFrame(all_Assets)
df2 = df.transpose()
print(df2)
Close Close Close
Date
2018-12-12 00:00:00-05:00 53.183998 24.440001 104.500511
2018-12-13 00:00:00-05:00 53.095001 25.119333 104.854965
2018-12-14 00:00:00-05:00 52.105000 24.380667 101.578560
2018-12-17 00:00:00-05:00 50.826500 23.228001 98.570389
2018-12-18 00:00:00-05:00 51.435501 22.468666 99.605042
This creates an issue when plotting the DF.
I need these column names to be equal to the 'symbol' parameter of the function, so that the column names would automatically be AAPL, TSLA, MSFT.
You can use:
import pandas as pd
import yfinance as yf
all_symbols = ['AAPL', 'TSLA', 'MSFT']
def scrapping(symbol):
    ticker = yf.Ticker(symbol)
    data = ticker.history(period='1y')
    return data['Close'].rename(symbol)

all_assets = map(scrapping, all_symbols)
df = pd.concat(all_assets, axis=1)
One-liner version:
df = pd.concat({symbol: yf.Ticker(symbol).history(period='1y')['Close']
                for symbol in all_symbols}, axis=1)
Output:
>>> df
AAPL TSLA MSFT
Date
2022-02-14 167.863144 291.920013 292.261475
2022-02-15 171.749573 307.476654 297.680695
2022-02-16 171.511032 307.796661 297.333221
2022-02-17 167.863144 292.116669 288.626678
2022-02-18 166.292648 285.660004 285.846893
... ... ... ...
2023-02-08 151.688400 201.289993 266.730011
2023-02-09 150.639999 207.320007 263.619995
2023-02-10 151.009995 196.889999 263.100006
2023-02-13 153.850006 194.639999 271.320007
2023-02-14 152.949997 203.729996 272.790009
[252 rows x 3 columns]

Plot each column mean grouped by specific date range

I have 7 columns of data, indexed by datetime (30-minute frequency), starting from 2017-05-31 and ending on 2018-05-25. I want to plot the mean of specific date ranges (seasons). I have been trying groupby, but I can't group by a specific range; I get wrong results if I do df.groupby(df.date.dt.month).mean().
A few lines from the dataset (date range is from 2017-05-31 to 2018-05-25)
50 51 56 58
date
2017-05-31 00:00:00 200.213542 276.929198 242.879051 NaN
2017-05-31 00:30:00 200.215478 276.928229 242.879051 NaN
2017-05-31 01:00:00 200.215478 276.925324 242.878083 NaN
2017-06-01 01:00:00 200.221288 276.944691 242.827729 NaN
2017-06-01 01:30:00 200.221288 276.944691 242.827729 NaN
2017-08-31 09:00:00 206.961886 283.374453 245.041349 184.358250
2017-08-31 09:30:00 206.966727 283.377358 245.042317 184.360187
2017-12-31 09:00:00 212.925877 287.198416 247.455413 187.175144
2017-12-31 09:30:00 212.926846 287.196480 247.465097 187.179987
2018-03-31 23:00:00 213.304498 286.933093 246.469647 186.887548
2018-03-31 23:30:00 213.308369 286.938902 246.468678 186.891422
2018-04-30 23:00:00 215.496812 288.342024 247.522230 188.104749
2018-04-30 23:30:00 215.497781 288.340086 247.520294 188.103780
I have created these variables (these are the ranges I need):
increment_rates_winter = df['2017-08-30'].mean() - df['2017-06-01'].mean()
increment_rates_spring = df['2017-11-30'].mean() - df['2017-09-01'].mean()
increment_rates_summer = df['2018-02-28'].mean() - df['2017-12-01'].mean()
increment_rates_fall = df['2018-05-24'].mean() - df['2018-03-01'].mean()
Concatenated them:
df_seasons = pd.concat([increment_rates_winter, increment_rates_spring, increment_rates_summer, increment_rates_fall], axis=1)
and after plotting, I got this:
However, I've been trying to get this:
df_seasons
Out[664]:
Winter Spring Summer Fall
50 6.697123 6.948447 -1.961549 7.662622
51 6.428329 4.760650 -2.188402 5.927087
52 5.580953 6.667529 1.136889 12.939295
53 6.406259 2.506279 -2.105125 6.964549
54 4.332826 3.678492 -2.574769 6.569398
56 2.222032 3.359607 -2.694863 5.348258
58 NaN 1.388535 -0.035889 4.213046
The seasons on the x-axis and the means plotted for each column.
Winter = df['2017-06-01':'2017-08-30']
Spring = df['2017-09-01':'2017-11-30']
Summer = df['2017-12-01':'2018-02-28']
Fall = df['2018-03-01':'2018-05-30']
Thank you in advance!
We can select a specific date range in the following way; you can then define the ranges however you want and take the mean:
import pandas as pd
df = pd.read_csv('test.csv')
df['date'] = pd.to_datetime(df['date'])
start_date = "2017-12-31 09:00:00"
end_date = "2018-04-30 23:00:00"
mask = (df['date'] > start_date) & (df['date'] <= end_date)
f_df = df.loc[mask]
This gives the output
date 50 ... 58
8 2017-12-31 09:30:00 212.926846 ... 187.179987 NaN
9 2018-03-31 23:00:00 213.304498 ... 186.887548 NaN
10 2018-03-31 23:30:00 213.308369 ... 186.891422 NaN
11 2018-04-30 23:00:00 215.496812 ... 188.104749 NaN
Hope this helps
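Building on that, a minimal sketch (assuming df has a DatetimeIndex, as in the question) that assembles the df_seasons frame shown above from the per-season increments:
import pandas as pd

# Season boundaries taken from the question.
seasons = {
    'Winter': ('2017-06-01', '2017-08-30'),
    'Spring': ('2017-09-01', '2017-11-30'),
    'Summer': ('2017-12-01', '2018-02-28'),
    'Fall':   ('2018-03-01', '2018-05-24'),
}

# Increment per season: column means on the last day minus the first day,
# matching the increment_rates_* variables above.
df_seasons = pd.concat(
    {name: df.loc[end].mean() - df.loc[start].mean()
     for name, (start, end) in seasons.items()},
    axis=1,
)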
How about transposing it:
df_seasons.T.plot()
Output:

Pandas Merging two dataframes with joining on date between dates

I have quite an interesting case.
There is df_1 with a time column based on fine-grained (2s) data, like this:
2018-08-31 22:59:47.980000+00:00 41.77
2018-08-31 22:59:49.979000+00:00 42.76
2018-08-31 22:59:51.979000+00:00 40.86
2018-08-31 22:59:53.979000+00:00 41.83
2018-08-31 22:59:55.979000+00:00 41.73
2018-08-31 22:59:57.979000+00:00 42.71
Also there is df_2 with labels for this data and time column on hour basis:
2018-08-31 22:00:00 0.0
2018-08-31 23:00:00 1.0
2018-09-01 00:00:00 0.0
2018-09-01 01:00:00 1.0
2018-09-01 02:00:00 0.0
I would like to merge df_1 with df_2 so that each time from df_1 falls between two consecutive time rows in df_2 (within the hour, to assign the label). If I had two time columns in df_2 (like startTime and endTime), I would use pandasql and its capabilities:
import pandasql as ps
sqlcode = '''
select *
from df_1
inner join df_2 on df_1.time >= df_2.startTime and df_1.time <= df_2.endTime
'''
newdf = ps.sqldf(sqlcode,locals())
But in this case I only have one column. Is there any way to solve this problem in Pandas?
This is a pd.merge_asof problem. I create a keydate copy of the dates in df2 to show which date we merge from df2 (direction='forward' takes the first df2 date at or after each df1 timestamp):
#df1.Date = pd.to_datetime(df1.Date)
#df2.Date = pd.to_datetime(df2.Date)
yourdf = pd.merge_asof(df1, df2.assign(keydate=df2.Date), on='Date', direction='forward')
yourdf
Date ... keydate
0 2018-08-31 22:59:47.980 ... 2018-08-31 23:00:00
1 2018-08-31 22:59:49.979 ... 2018-08-31 23:00:00
2 2018-08-31 22:59:51.979 ... 2018-08-31 23:00:00
3 2018-08-31 22:59:53.979 ... 2018-08-31 23:00:00
4 2018-08-31 22:59:55.979 ... 2018-08-31 23:00:00
5 2018-08-31 22:59:57.979 ... 2018-08-31 23:00:00
[6 rows x 4 columns]
I solved the problem using a workaround, splitting time into date and hour columns. Maybe not too fancy, but it does the job and is pretty straightforward:
import pandasql as ps
df_1['date'] = [d.date() for d in df_1['time']]
df_1['time'] = df_1['time'].dt.round('H').dt.hour
df_2['date'] = [d.date() for d in df_2['time']]
df_2['time'] = df_2['time'].dt.round('H').dt.hour
sqlcode = '''
select *
from df_1
inner join df_2 on df_1.time=df_2.time and df_1.date=df_2.date
'''
newdf = ps.sqldf(sqlcode,locals())
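A hedged, pandas-only sketch of the same rounding idea, without pandasql, starting from the original frames (assuming both keep their datetime values in a column named time, as above):
import pandas as pd

# Round each 2-second timestamp to its nearest hour, then join that
# hourly key against df_2's hourly label rows in one merge.
df_1['hour'] = df_1['time'].dt.round('H')
newdf = df_1.merge(df_2, left_on='hour', right_on='time', suffixes=('', '_label'))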

How to get the mean and sum of the random numbers for a date time series?

I created a series of all the business days for the year 2016 and then assigned a random number to each date:
Created a datetime index for the year 2016:
df = pd.bdate_range('2016-01-01', '2016-12-31')
output
DatetimeIndex(['2016-01-01', '2016-01-04', '2016-01-05', '2016-01-06',
'2016-01-07', '2016-01-08', '2016-01-11', '2016-01-12',
'2016-01-13', '2016-01-14',
...
'2016-12-19', '2016-12-20', '2016-12-21', '2016-12-22',
'2016-12-23', '2016-12-26', '2016-12-27', '2016-12-28',
'2016-12-29', '2016-12-30'],
dtype='datetime64[ns]', length=261, freq='B')
Assigned a random value to each date:
s = pd.Series(np.random.randn(len(df)), index=df)
output
2016-01-01 0.430445
2016-01-04 -0.378483
2016-01-05 0.410059
2016-01-06 2.276409
2016-01-07 1.102603
2016-01-08 -0.339722
2016-01-11 0.542110
2016-01-12 -0.898154
......
2016-12-28 -0.952172
2016-12-29 -1.522073
2016-12-30 -1.065957
I would like to get the sum of the values that fall on a Tuesday, and I would also like to get the mean of the values for each month.
Problem 1: Sum of Tuesday values
Use dayofweek, and index where dayofweek == 1 (which represents Tuesdays):
s[s.index.dayofweek == 1].sum()
# Output:
2.1416224135016124
Problem 2: Mean by month
Use groupby with pd.Grouper(freq='m'):
s.groupby(pd.Grouper(freq='m')).mean()
# Output:
2016-01-31 0.072559
2016-02-29 0.009706
2016-03-31 0.118553
2016-04-30 -0.228017
2016-05-31 0.132211
2016-06-30 -0.188015
2016-07-31 0.008239
2016-08-31 -0.181972
2016-09-30 0.554330
2016-10-31 -0.293271
2016-11-30 -0.092587
2016-12-31 -0.268706
Freq: M, dtype: float64
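For reference, here is a self-contained sketch combining both answers (nothing beyond the snippets above, plus the imports they assume):
import numpy as np
import pandas as pd

# Business days of 2016, with one random value per day.
idx = pd.bdate_range('2016-01-01', '2016-12-31')
s = pd.Series(np.random.randn(len(idx)), index=idx)

tuesday_sum = s[s.index.dayofweek == 1].sum()  # Monday == 0, so 1 is Tuesday
monthly_mean = s.resample('M').mean()          # same result as pd.Grouper(freq='m')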

Resample with custom month-end frequency

I'm looking for an equivalent specification to W-MON (weekly, ending Monday) for monthly data.
Specifically, I have a pandas data frame of daily data, and I want to only take monthly observations, starting with the most recent date and going back monthly.
So if today is 17/06/2016, my date index would be 17/06/2016, 17/05/2016, 17/04/2016... etc.
Right now I can only find month-start and month-end as specifications for df.asfreq().
Thanks.
You can create the relevant dates using relativedelta and select using .loc[]:
from datetime import date, datetime
from dateutil.relativedelta import relativedelta
from pandas_datareader.data import DataReader
Using daily sample data:
stock_data = DataReader('FB', 'yahoo', datetime(2013, 1, 1), datetime.today()).resample('D').fillna(method='ffill')['Open']
and a month end date to show how relativedelta treats this case:
today = date(2016, 1, 31)
Create the sequence of dates:
n_months = 30
dates = [today - relativedelta(years=m // 12, months=m % 12) for m in range(n_months)]
to get:
stock_data.loc[dates]
Date
2016-01-31 108.989998
2015-12-31 106.000000
2015-11-30 105.839996
2015-10-31 104.510002
2015-09-30 88.440002
2015-08-31 90.599998
2015-07-31 94.949997
2015-06-30 86.599998
2015-05-31 79.949997
2015-04-30 80.010002
2015-03-31 82.900002
2015-02-28 80.680000
2015-01-31 78.000000
2014-12-31 79.540001
2014-11-30 77.669998
2014-10-31 74.930000
2014-09-30 79.349998
2014-08-31 74.300003
2014-07-31 74.000000
2014-06-30 67.459999
2014-05-31 63.950001
2014-04-30 57.580002
2014-03-31 60.779999
2014-02-28 69.470001
2014-01-31 60.470001
2013-12-31 54.119999
2013-11-30 46.750000
2013-10-31 47.160000
2013-09-30 50.139999
2013-08-31 42.020000
Name: Open, dtype: float64
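Since the Yahoo endpoint for pandas_datareader has been unreliable for years, here is the same technique as a hedged, self-contained sketch on synthetic daily data (the series and dates below are made up for illustration):
import pandas as pd
from datetime import date
from dateutil.relativedelta import relativedelta

# Synthetic daily series covering 2013-01-01 through mid-2016.
daily = pd.Series(range(1300), index=pd.date_range('2013-01-01', periods=1300, freq='D'))

today = date(2016, 1, 31)
n_months = 30
# Step back one calendar month at a time; relativedelta clamps to the
# last day of shorter months (e.g. 2016-01-31 minus 4 months -> 2015-09-30).
dates = [today - relativedelta(years=m // 12, months=m % 12) for m in range(n_months)]

monthly = daily.loc[pd.to_datetime(dates)]  # one observation per month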
