I am using the yfinance library to import data for a given stock. See code below:
import yfinance as yf
from datetime import datetime as dt
import pandas as pd
# Naming Constants
stock = "AAPL"
start_date = "2014-01-01"
end_date = "2018-01-01"
# Importing all the data into a DataFrame
stock_data = yf.download(stock, start=start_date, end=end_date)
When I call print(stock_data.index) I have the following:
DatetimeIndex(['2014-01-02', '2014-01-03', '2014-01-06', '2014-01-07', '2014-01-08', '2014-01-09', '2014-01-10', '2014-01-13', '2014-01-14', '2014-01-15',
...
'2017-12-15', '2017-12-18', '2017-12-19', '2017-12-20', '2017-12-21', '2017-12-22', '2017-12-26', '2017-12-27', '2017-12-28', '2017-12-29'],
dtype='datetime64[ns]', name='Date', length=1007, freq=None)
I wish to switch the frequency argument from None to daily since every Date refers to a trading day.
When I say stock_data.index.freq = 'B' I get the following error:
ValueError: Inferred frequency None from passed values does not conform to passed frequency B
And if I put stock_data = stock_data.asfreq('B'), it will change the frequency but it will add certain lines that were not there originally and fills them with NA values.
In other words, what is the offset ALIAS used for trading days?
You can find the list of aliases in the pandas documentation here: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
The error with stock_data.index.freq = 'B' indicates that your timeseries frequency is not 'business-day', but undefined or 'None'.
With
stock_data = stock_data.asfreq('B')
you are re-indexing your time series to business-day frequency: the missing timestamps are added, and the corresponding stock values are set to NaN. Now you need to decide how to replace them, so have a look at pandas.DataFrame.asfreq. You could replace all NaNs with a fixed value like -999, but with stock data what you generally want is the last valid value at a given point in time, i.e. forward-filling the gaps:
stock_data = stock_data.asfreq('B', method='ffill')
It's always worth reading the docs.
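A minimal sketch of the fix, using made-up prices on a trading-day index that skips a day (so pandas cannot infer a 'B' frequency from the values):

```python
import pandas as pd

# Made-up prices on a trading-day index that skips 2017-01-02,
# so pandas cannot infer a 'B' frequency from the values.
idx = pd.DatetimeIndex(["2016-12-29", "2016-12-30", "2017-01-03", "2017-01-04"])
stock_data = pd.DataFrame({"Close": [100.0, 101.0, 99.5, 102.0]}, index=idx)
print(stock_data.index.freq)  # None

# Re-index to business-day frequency and forward-fill the inserted gaps
filled = stock_data.asfreq("B", method="ffill")
print(filled)  # 2017-01-02 now exists and carries 101.0 forward
```

The re-indexed frame has one row per business day, and the inserted 2017-01-02 row holds the last valid close from 2016-12-30.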
Related
I want to create a date vector with a given fixed spacing depending on the frequency I choose. So far, this is what I got:
import pandas as pd
import datetime as dt
from datetime import date, timedelta
def getDates(sD, eD, f):
    # creating the datetime objects to later on make the date vector
    sD = dt.datetime.strptime(sD, '%m/%d/%Y')
    eD = dt.datetime.strptime(eD, '%m/%d/%Y')
    sd_t = date(sD.year, sD.month, sD.day)  # start date
    ed_t = date(eD.year, eD.month, eD.day)  # end date
    # we hardcode a frequency dictionary for the frequencies and spacing
    # that the date vectors are going to have
    freqDict = {'1h': '40D', '4h': '162D', '1d': '1000D'}
    dateVector = pd.date_range(sd_t, ed_t, freq=freqDict[f])
    return dateVector
As you can see, there are only 3 frequencies I'm interested in, and the spacing between dates works well. I have to work around API limitations: the request limit I set is 1000, which is why I chose these custom spacings, to allow as many data points as possible for each frequency given the API request limits these dates are meant for.
Unfortunately, I can't get the final date in the dateVector in some cases. If I run this function with these inputs:
getDates('01/01/2020', '01/01/2021', '4h')
I get this outcome, which is missing the final date on the array ('01/01/2021'):
0 2020-01-01
1 2020-06-11
2 2020-11-20
I thought of using the closed = parameter, but it didn't get me where I wanted.
A workaround I thought of consists of using periods instead of freq, and dynamically computing the periods according to the distance (in terms of days) between the start date and the end date. But I would like to know if I can make date_range work in my favor without having to write such a routine.
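One hedged sketch of a simpler route, assuming it is acceptable to append the end date manually whenever date_range drops it (get_dates_inclusive is a hypothetical helper name, not part of pandas):

```python
import pandas as pd

# Hypothetical helper: like pd.date_range, but always keeps the end date.
def get_dates_inclusive(start, end, freq):
    dates = pd.date_range(start, end, freq=freq)
    # date_range only includes `end` when it lands exactly on the frequency
    # grid, so append it manually whenever it is missing.
    if dates[-1] != pd.Timestamp(end):
        dates = dates.append(pd.DatetimeIndex([end]))
    return dates

dates = get_dates_inclusive("2020-01-01", "2021-01-01", "162D")
print(dates)  # ..., 2020-11-20, 2021-01-01
```

For the '4h' inputs above this yields the three original dates plus '2021-01-01', without computing periods by hand.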
I want to calculate the number of business days between two dates and create a new pandas dataframe column with those days. I also have a holiday calendar and I want to exclude dates in the holiday calendar while making my calculation.
I looked around and I saw the numpy busday_count function as a useful tool for it. The function counts the number of business days between two dates and also allows you to include a holiday calendar.
I also looked around and saw the holidays package, which gives the holiday dates for different countries. I thought it would be great to add this holiday calendar to the numpy function.
Then I proceeded as follows;
import pandas as pd
import numpy as np
import holidays
from datetime import datetime, timedelta, date
df = {'start': ['2019-01-02', '2019-02-01'],
      'end': ['2020-01-04', '2020-03-05']}
df = pd.DataFrame(df)
holidays_country = holidays.CountryHoliday('UnitedKingdom')
start_date = [d.date for d in df['start']]
end_date = [d.date for d in df['end']]
holidays_numpy = holidays_country[start_date:end_date]
df['business_days'] = np.busday_count(begindates = start_date,
enddates = end_date,
holidays=holidays_numpy)
When I run this code, it throws this error: TypeError: Cannot convert type '<class 'list'>' to date
When I looked further, I noticed that start_date and end_date are lists, and that might be why the error was occurring.
I then changed the holidays_numpy variable to holidays_numpy = holidays_country['2019-01-01':'2019-12-31'] and it worked.
However, since my dates are different for each row in my dataframe, is there a way to set the two arguments in my holidays_numpy variable to select corresponding values (just like the zip function) from start_date and end_date?
I'm also open to alternative ways of solving this problem.
This should work:
import pandas as pd
import numpy as np
import holidays
df = {'start' : ['2019-01-02', '2019-02-01'],
'end' : ['2020-01-04', '2020-03-05']}
df = pd.DataFrame(df)
holidays_country = holidays.CountryHoliday('UK')
def f(x):
    return np.busday_count(x[0], x[1], holidays=holidays_country[x[0]:x[1]])
df['business_days'] = df[['start','end']].apply(f,axis=1)
df.head()
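If you'd rather see the per-row idea without the holidays-package dependency, here is a hedged sketch using zip (as hinted at in the question) with a made-up holiday list and shorter date ranges:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"start": ["2019-01-02", "2019-02-01"],
                   "end":   ["2019-01-10", "2019-02-08"]})

# Hypothetical holiday list; in practice it could come from the
# holidays package as shown above.
my_holidays = [np.datetime64("2019-01-07")]

# busday_count counts business days in [start, end), excluding holidays
df["business_days"] = [
    np.busday_count(s, e, holidays=my_holidays)
    for s, e in zip(df["start"], df["end"])
]
print(df)
```

The first row spans six weekdays minus one holiday, the second five weekdays, so both come out as 5.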
date['Maturity_date'] = data.apply(lambda data: relativedelta(months=int(data['TRM_LNTH_MO'])) + data['POL_EFF_DT'], axis=1)
Tried this also:
date['Maturity_date'] = date['POL_EFF_DT'] + date['TRM_LNTH_MO'].values.astype("timedelta64[M]")
TypeError: 'type' object does not support item assignment
import pandas as pd
import datetime
#Convert the date column to date format
date['date_format'] = pd.to_datetime(date['Maturity_date'])
#Add a month column
date['Month'] = date['date_format'].apply(lambda x: x.strftime('%b'))
If you are using pandas, you can use a feature called "frequency aliases". Something very much out of the box:
# With periods=2, element 0 is the existing date and element 1 is the result:
# that date advanced by one step of the 'M' (month-end) frequency.
import pandas as pd
_new_period = pd.date_range(_existing_date, periods=2, freq='M')
Now you can get exactly the period you want as the second element returned:
# The result you want is at index 1; index 0 is the existing date.
_new_period.strftime('%Y-%m-%d')[1]
# You can format it in different ways: year, month, or day only, as needed.
Consult this link for further information
The following short script uses findatapy to collect data from the Dukascopy website. Note that this package uses pandas, so it doesn't need to be imported separately.
from findatapy.market import Market, MarketDataRequest, MarketDataGenerator
market = Market(market_data_generator=MarketDataGenerator())
md_request = MarketDataRequest(start_date='08 Feb 2017', finish_date='09 Feb 2017', category='fx', fields=['bid', 'ask'], freq='tick', data_source='dukascopy', tickers=['EURUSD'])
df = market.fetch_market(md_request)
#Group everything by an hourly frequency.
df=df.groupby(pd.TimeGrouper('1H')).head(1)
#Deleting the milliseconds from the DataFrame
df.index =df.index.map(lambda t: t.strftime('%Y-%m-%d %H:%M:%S'))
#Computing Average between columns 1 and 2, and storing it in a new one.
df['Avg'] = (df['EURUSD.bid'] + df['EURUSD.ask'])/2
The outcome is an hourly table with the EURUSD.bid, EURUSD.ask and Avg columns (screenshot omitted).
Until this point everything runs properly, but I need to extract a specific hour from this dataframe. I'd like to pick all the values (bid, ask, avg... or just one of them) at a certain hour, say 10:00:00 AM.
By seeing other posts, I thought I could do something like this:
match_timestamp = "10:00:00"
df.loc[(df.index.strftime("%H:%M:%S") == match_timestamp)]
But the outcome is an error message saying:
AttributeError: 'Index' object has no attribute 'strftime'
I can't even use df.index.hour: it worked before the line where I remove the milliseconds (the dtype is datetime64[ns] up to that point), but after it the dtype is 'object'. It looks like I need to reverse that formatting in order to use strftime.
Can you help me out?
You should take a look at resample:
df = df.resample('H').first() # resample for each hour and use first value of hour
then:
df.loc[df.index.hour == 10] # index is still a date object, play with it
If you don't like that, you can just set your index back to a datetime object like so:
df.index = pd.to_datetime(df.index)
Then your code should work as-is.
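A self-contained sketch of this answer's approach, with synthetic per-minute bid/ask data standing in for the Dukascopy ticks (fetching the real data needs a network call):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the tick data: one row per minute over two days.
idx = pd.date_range("2017-02-08", periods=2 * 24 * 60, freq="min")
df = pd.DataFrame({"EURUSD.bid": np.linspace(1.06, 1.07, len(idx)),
                   "EURUSD.ask": np.linspace(1.0605, 1.0705, len(idx))},
                  index=idx)
df["Avg"] = (df["EURUSD.bid"] + df["EURUSD.ask"]) / 2

# First value of each hour, then keep only the 10:00 rows.
hourly = df.resample("h").first()
at_ten = hourly.loc[hourly.index.hour == 10]
print(at_ten)  # one row per day, both at 10:00:00
```

Because the index stays a DatetimeIndex throughout (no strftime round-trip), the `.hour` filter works directly.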
Try resetting the index:
match_timestamp = "10:00:00"
df = df.reset_index()
df = df.assign(Date=pd.to_datetime(df.Date))
df.loc[(df.Date.strftime("%H:%M:%S") == match_timestamp)]
I want to download adjusted close prices and their corresponding dates from yahoo, but I can't seem to figure out how to get dates from pandas DataFrame.
I was reading an answer to this question
from pandas.io.data import DataReader
from datetime import datetime
goog = DataReader("GOOG", "yahoo", datetime(2000,1,1), datetime(2012,1,1))
print goog["Adj Close"]
and this part works fine; however, I need to extract the dates that correspond to the prices.
For example:
adj_close = np.array(goog["Adj Close"])
This gives me a 1-D array of adjusted closing prices; I am looking for a 1-D array of dates, such that:
date = # what do I do?
adj_close[0] corresponds to date[0]
When I do:
>>> goog.keys()
Index([Open, High, Low, Close, Volume, Adj Close], dtype=object)
I see that none of the keys will give me anything similar to the date, but I think there has to be a way to create an array of dates. What am I missing?
You can get it via goog.index, which is stored as a DatetimeIndex.
To get a Series of dates, you can do
goog.reset_index()['Date']
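A small sketch of this, with made-up prices standing in for the DataReader result (which needs a network call):

```python
import numpy as np
import pandas as pd

# Made-up prices with a DatetimeIndex, mimicking the shape of `goog`.
goog = pd.DataFrame(
    {"Adj Close": [100.0, 101.5, 99.8]},
    index=pd.DatetimeIndex(["2000-01-03", "2000-01-04", "2000-01-05"], name="Date"),
)

adj_close = np.array(goog["Adj Close"])
dates = np.array(goog.index)  # 1-D array of datetime64 values

# adj_close[0] corresponds to dates[0], and so on
print(dates[0], adj_close[0])
```

Both arrays come from the same frame, so they stay aligned position by position.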
import numpy as np
import pandas as pd
from pandas.io.data import DataReader
symbols_list = ['GOOG','IBM']
d = {}
for ticker in symbols_list:
    d[ticker] = DataReader(ticker, "yahoo", '2014-01-01')
pan = pd.Panel(d)
df_adj_close = pan.minor_xs('Adj Close') #also use 'Open','High','Low','Adj Close' and 'Volume'
#the dates of the adjusted closes from the dataframe containing adjusted closes on multiple stocks
df_adj_close.index
# create a dataframe that has data on only one stock symbol
df_individual = pan.get('GOOG')
# the dates from the dataframe of just 'GOOG' data
df_individual.index
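Note that pd.Panel has since been removed from pandas; a hedged sketch of the same idea with pd.concat, using made-up per-ticker frames in place of the DataReader results:

```python
import pandas as pd

# Made-up per-ticker frames standing in for the DataReader results.
idx = pd.DatetimeIndex(["2014-01-02", "2014-01-03"], name="Date")
d = {
    "GOOG": pd.DataFrame({"Adj Close": [556.0, 551.9]}, index=idx),
    "IBM":  pd.DataFrame({"Adj Close": [183.0, 184.1]}, index=idx),
}

# One column per ticker -- the same role pan.minor_xs('Adj Close') played
df_adj_close = pd.concat({k: v["Adj Close"] for k, v in d.items()}, axis=1)
print(df_adj_close.index)   # the shared dates
print(d["GOOG"].index)      # the dates for a single symbol
```

The dict of frames keeps per-symbol access (`d['GOOG']`) while concat gives the cross-sectional view that minor_xs used to provide.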