I'm trying to create in Python what a macro does in SAS. I have a list of over 1K tickers that I'm trying to download information for but doing all of them in one step made python crash so I split up the data into 11 portions. Below is the code we're working with:
t0=t.time()
printcounter=0
for ticker in tickers1:
printcounter+=1
print(printcounter)
try:
selected = yf.Ticker(ticker)
shares = selected.get_shares()
shares_wide = shares.transpose()
info=selected.info
market_cap=info['marketCap']
sector=info['sector']
name=info['shortName']
comb = shares_wide.assign(market_cap_oct22=market_cap,sector=sector,symbol=ticker,name=name)
company_info_1 = company_info_1.append(comb)
except:
comb = pd.DataFrame()
comb = comb.append({'symbol':ticker,'ERRORFLAG':'ERROR'},ignore_index=True)
company_info_1 = company_info_1.append(comb)
print("total run time:", round(t.time()-t0,3),"s")
What I'd like to do is instead of re-writing and running this code for all 11 portions of data and manually changing "tickers1" and "company_info_1" to "tickers2" "company_info_2" "tickers3" "company_info_3" (and so on)... I'd like to see if there is a way to make a python version of a SAS macro/call so that I can get this data more dynamically. Is there a way to do this in python?
You need to generalize your existing code and wrap it in a function.
def comany_info(tickers):
for ticker in tickers:
try:
selected = yf.Ticker(ticker) # you may also have to pass the yf object
shares = selected.get_shares()
shares_wide = shares.transpose()
info=selected.info
market_cap=info['marketCap']
sector=info['sector']
name=info['shortName']
comb = shares_wide.assign(market_cap_oct22=market_cap,sector=sector,symbol=ticker,name=name)
company_info = company_info.append(comb)
except:
comb = pd.DataFrame()
comb = comb.append({'symbol':ticker,'ERRORFLAG':'ERROR'},ignore_index=True)
company_info = company_info.append(comb)
return company_info # return the dataframe
Create a master dataframe to collect your results from the function call. Loop over the 11 groups of tickers passing each group into your function. Append the results to your master.
# master df to collect results
master = pd.DataFrame()
# assuming you have your tickers in a list of lists
# loop over each of the 11 groups of tickers
for tickers in groups_of_tickers:
df = company_info(tickers) # fetch data from Yahoo Finance
master = master.append(df))
Please note I typed this on the fly. I have no way of testing this. I'm quite sure there are syntactical issues to work through. Hopefully it provides a framework for how to think about the solution.
Related
I am trying to pull out multiple ticker data from the yfinance API and save it to a csv file (in total I have 1000 tickers I need to get the data for, that data being the entire table of date, open, high, low, close, volume, etc etc), so far I am able to successfully get data for 1 ticker by using the following Python code:
import yfinance as yf
def yfinance(ticker_symbol):
ticker_data = yf.Ticker(ticker_symbol)
tickerDF = ticker_data.history(period='1d', start='2020-09-30', end='2020-10-31')
print(tickerDF)
yfinance('000001.SS')
However if I try on multiple tickers this doesn't work. Following the yfinance docs which say for multiple tickers use:
tickers = yf.Tickers('msft aapl goog')
# ^ returns a named tuple of Ticker objects
# access each ticker using (example)
tickers.tickers.MSFT.info
tickers.tickers.AAPL.history(period="1mo")
tickers.tickers.GOOG.actions
I have a couple of issue here, the docs use a string such as 'aapl' my tickers are all of digit format like '000001.SS', the ".SS" part is proving to be an issue when passing it into the code:
tickers.tickers.000001.SS.history(period="1mo")
# Clearly this wont for for a start
The next issue I am having is, even if I pass in for example 3 tickers to my function like so:
yfinance('000001.SS 000050.KS 00006.KS')
# similar to yfinance docs of tickers = yf.Tickers('msft aapl goog')
I get errors like:
AttributeError: 'Tickers' object has no attribute '000001.SS'
(I have also tried to run these into a for loop and pass each on to the Tickers object but get the same error.)
Im stuck now, I dont know how to pass in multiple tickers to yfinance and get back data that I want and the docs aren't very helpful.
Is anyone able to help me with this?
Could you not just store them in an array specifying the type as dtype object then use that pull the data from.
import yfinance as yf
import numpy as np
tickers = ['msft', 'aapl', 'goog']
totalPortfolio = np.empty([len(tickers)], dtype=object)
num = 0
for ticker in tickers:
totalPortfolio[num] = yf.download(ticker, start='2020-09-30', end='2020-10-31', interval="1d")
num = num + 1
Take a look at the code below:
test = yf.Tickers("A B C")
# creates test as a yf.tickers object
test_dict = test.tickers
# creates a dict object containing the individual tickers. Can be checked with type()
You are trying to use "tickers.tickers.MSFT.info" to retrieve the ticker data from your dictionary "tickers.tickers" but like your error message says, a dict object has no attributes named after your specific ticker names. This is in general not how you access elements in a dictionary.
Instead you should use the code as below (like with all dict objects):
#old code from above
test = yf.Tickers("A B C")
test_dict = test.tickers
#new code accessing the dict correctly
a_data = test_dict["A"]
a_data = test.tickers["A"] #does the same as the line above
b_data = test.tickers["B"] #and so on for the other tickers
In a loop this could look something like this:
ticker_list = ["A", "B", "C"] #add tickers as needed
tickers_data = {}
tickers_history = {}
for ticker in ticker_list:
tickers_data[ticker] = yf.Ticker(ticker)
tickers_history = tickers_data[ticker].history(period='1d', start='2020-09-30', end='2020-10-31')
#access the dicts as needed using tickers_data[" your ticker name "]
alternatively you can also use the "yf.Tickers" function to retrieve multiple tickers at once, but because you save the history seperately I don't think this will necessarily improve your code much.
You should pay attention however, that "yf.Ticker()" and "yf.Tickers()" are different functions from each other with differing syntax and are not interchangeable.
You did mix that up when you tried accessing multiple tickers with your custom "yfinance()" function, that has been previously defined with the "yf.Ticker()" function and thus only accepts one symbol at a time.
I am pulling data from an API that allows batch requests, and then storing the data to a Dataframe. When there is an exception with one of the items being looked up via the API, I want to either skip that item entirely, (or write zeroes to the Dataframe) and then go on to the next item.
But my issue is that because the API data is being accessed in bulk (i.e., not looping through each item in the list), an exception for any item in the list breaks the program. So how can I elegantly handle exceptions without looping through each individual item in the tickers list?
Note that removing ERROR from the tickers list will enable the program to run successfully:
import os
from iexfinance.stocks import Stock
import iexfinance
# Set IEX Finance API Token (Sandbox)
os.environ['IEX_API_VERSION'] = 'iexcloud-sandbox'
os.environ['IEX_TOKEN'] = 'Tpk_a4bc3e95d4c94810a3b2d4138dc81c5d'
# List of companies to get data for
tickers = ['MSFT', 'ERROR', 'AMZN']
batch = Stock(tickers, output_format='pandas')
income_ttm = 0
try:
# Get income from last 4 quarters, sum it, and store to temp Dataframe
df_income = batch.get_income_statement(period="year")
print(df_income)
except (iexfinance.utils.exceptions.IEXQueryError, iexfinance.utils.exceptions.IEXSymbolError) as e:
pass
This should do the work
import os
from copy import deepcopy
from iexfinance.stocks import Stock
import iexfinance
def find_wrong_symbol(tickers, err):
wrong_ticker = []
for one_ticker in tickers:
if one_ticker.upper() in err:
wrong_ticker.append(one_ticker)
return wrong_ticker
# Set IEX Finance API Token (Sandbox)
os.environ['IEX_API_VERSION'] = 'iexcloud-sandbox'
os.environ['IEX_TOKEN'] = 'Tpk_a4bc3e95d4c94810a3b2d4138dc81c5d'
# List of companies to get data for
tickers = ['MSFT', 'AMZN', 'failing']
batch = Stock(tickers, output_format='pandas')
income_ttm = 0
try:
# Get income from last 4 quarters, sum it, and store to temp Dataframe
df_income = batch.get_income_statement(period="year")
print(df_income)
except (iexfinance.utils.exceptions.IEXQueryError, iexfinance.utils.exceptions.IEXSymbolError) as e:
wrong_tickers = find_wrong_symbol(tickers, str(e))
tickers_to_get = deepcopy(tickers)
assigning_dict = {}
for wrong_ticker in wrong_tickers:
tickers_to_get.pop(tickers_to_get.index(wrong_ticker))
assigning_dict.update({wrong_ticker: lambda x: 0})
new_batch = Stock(tickers_to_get, output_format='pandas')
df_income = new_batch.get_income_statement(period="year").assign(**assigning_dict)
I create a small function in order to find tickers that are not handle by the API. After deleting the wrong tickers, I recall the API without it and with the assign function add the missing columns with the 0 values (it could be anything, a NaN or another default value).
No matter what I do I don't seem to be able to add all the base volumes and quote volumes together easily! I want to end up with a total base volume and a total quote volume of all the data in the data frame. Can someone help me on how you can do this easily?
I have tried summing and saving the data in a dictionary first and then adding it but I just don't seem to be able to make this work!
import urllib
import pandas as pd
import json
def call_data(): # Call data from Poloniex
global df
datalink = 'https://poloniex.com/public?command=returnTicker'
df = urllib.request.urlopen(datalink)
df = df.read().decode('utf-8')
df = json.loads(df)
global current_eth_price
for k, v in df.items():
if 'ETH' in k:
if 'USDT_ETH' in k:
current_eth_price = round(float(v['last']),2)
print("Current ETH Price $:",current_eth_price)
def calc_volumes(): # Calculate the base & quote volumes
global volume_totals
for k, v in df.items():
if 'ETH' in k:
basevolume = float(v['baseVolume'])*current_eth_price
quotevolume = float(v['quoteVolume'])*float(v['last'])*current_eth_price
if quotevolume > 0:
percentages = (quotevolume - basevolume) / basevolume * 100
volume_totals = {'key':[k],
'basevolume':[basevolume],
'quotevolume':[quotevolume],
'percentages':[percentages]}
print("volume totals:",volume_totals)
print("#"*8)
call_data()
calc_volumes()
A few notes:
For the next 2 years don't use the keyword globals for anything.
put function documentation under the function in quotes
using the requests library will be much easier than urllib. However ...
pandas can fetch the JSON and parse it all in one step
ok it doesn't have to be as split up as this, I'm just showing you how to properly pass variables around instead of globals.
I could not find "ETH" by itself. In the data they sent they have these 3 ['BTC_ETH', 'USDT_ETH', 'USDC_ETH']. So I used "USDT_ETH" I hope the substitution is ok.
calc_volumes is seeming to do the calculation and being some sort of filter (it's picky as to what it prints). This function needs to be broken up in to it's two separate jobs. printing and calculating. (maybe there was a filter step but I leave that for homework)
.
import pandas as pd
eth_price_url = 'https://poloniex.com/public?command=returnTicker'
def get_data(url=''):
""" Call data from Poloniex and put it in a dataframe"""
data = pd.read_json(url)
return data
def get_current_eth_price(data = None):
""" grab the price out of the dataframe """
current_eth_price = data['USDT_ETH']['last'].round(2)
return current_eth_price
def calc_volumes(data=None, current_eth_price=None):
""" Calculate the base & quote volumes """
data = df[df.columns[df.columns.str.contains('ETH')]].loc[['baseVolume', 'quoteVolume', 'last']]
data = data.transpose()
data[['baseVolume','quoteVolume']]*= current_eth_price
data['quoteVolume']*=data['last']
data['percentages']=(data['quoteVolume'] - data['baseVolume']) / data['quoteVolume'] * 100
return data
df = get_data(url = eth_price_url)
the_price = get_current_eth_price(data = df)
print(f'the current eth price is: {the_price}')
volumes = calc_volumes(data=df, current_eth_price=the_price)
print(volumes)
This code seems kind of odd and inconsistent... for example, you're importing pandas and calling your variable df but you're not actually using dataframes. If you used df = pd.read_json('https://poloniex.com/public?command=returnTicker', 'index')* to get a dataframe, most of your data manipulation here would become much easier, and wouldn't require any loops either.
For example, the first function's code would become as simple as current_eth_price = df.loc['USDT_ETH','last'].
The second function's code would basically be
eth_rows = df[df.index.str.contains('ETH')]
total_base_volume = (eth_rows.baseVolume * current_eth_price).sum()
total_quote_volume = (eth_rows.quoteVolume * eth_rows['last'] * current_eth_price).sum()
(*The 'index' argument tells pandas to read the JSON dictionary indexed by rows, then columns, rather than columns, then rows.)
I have a Python script that pulls in all fields from JIRA. I have multiple projects within JIRA and the set of columns within each project is different. Hence I have the below script that works just fine.
## Login to Jira
jira = JIRA(basic_auth=('login#login.com', 'password'), options={'server': 'https://company.atlassian.net'})
# Pulling Proj_A tickets
issues = jira.search_issues('project= Proj_A',maxResults=False) ## Get Proj_A tickets
## Create the full df by normalizing the output
issue_list = []
for i in range(len(issues)):
result = json_normalize(issues[i].raw['fields'])
result['issue_id'] = issues[i]
result['issue_link'] = 'https://company.atlassian.net/browse/' + str(issues[i])
issue_list.append(result)
final_issue_df_a = pd.concat(issue_list, axis=0, sort=True).reset_index()
# Pulling Proj_B tickets
issues = jira.search_issues('project= Proj_B',maxResults=False) ## Get Proj_B tickets
## Create the full df by normalizing the output
issue_list = []
for i in range(len(issues)):
result = json_normalize(issues[i].raw['fields'])
result['issue_id'] = issues[i]
result['issue_link'] = 'https://company.atlassian.net/browse/' + str(issues[i])
issue_list.append(result)
final_issue_df_b = pd.concat(issue_list, axis=0, sort=True).reset_index()
# Concatenating fields from the 2 projects into one single dataframe
Final_DF = pd.concat([final_issue_df_a,final_issue_df_b], axis=0, ignore_index=True)
This above code works just fine. Now I am trying to optimize the script wherein I am trying to pass a loop that has list of all projects as below:
project = [Proj_A,Proj_B,Proj_C..]
This gets passed to the first line
(issues = jira.search_issues('project=',maxResults=False))
and iterates each project and pulls in relevant fields that gets stored in the final Dataframe.
Could anyone assist. Thanks
Use the project api to get a list of projects in JIRA. Then extract Project Name from each Project returned.
r = requests.get('https://<jira>/rest/api/2/project', auth=('username', 'password'), verify=False)
projects = r.json()
#Get name of each project
for i in projects:
print i['name']
You could do something like this:
projects = ['Proj_A', 'Proj_B', 'Proj_C']
final_df_list = []
for project in projects:
issues = jira.search_issues('project= '+project, maxResults=False)
# Rest of the code processing the issues obtained above
final_issue_df_x = pd.concat(issue_list, axis=0, sort=True).reset_index()
final_df_list.append(final_issue_df_x)
Final_DF = pd.concat(final_df_list, axis=0, ignore_index=True)
Is there a way to check the HTTP Status Code in the code below, as I have not used the request or urllib libraries which would allow for this.
from pandas.io.excel import read_excel
url = 'http://www.bankofengland.co.uk/statistics/Documents/yieldcurve/uknom05_mdaily.xls'
#check the sheet number, spot: 9/9, short end 7/9
spot_curve = read_excel(url, sheetname=8) #Creates the dataframes
short_end_spot_curve = read_excel(url, sheetname=6)
# do some cleaning, keep NaN for now, as forward fill NaN is not recommended for yield curve
spot_curve.columns = spot_curve.loc['years:']
valid_index = spot_curve.index[4:]
spot_curve = spot_curve.loc[valid_index]
# remove all maturities within 5 years as those are duplicated in short-end file
col_mask = spot_curve.columns.values > 5
spot_curve = spot_curve.iloc[:, col_mask]
#Providing correct names
short_end_spot_curve.columns = short_end_spot_curve.loc['years:']
valid_index = short_end_spot_curve.index[4:]
short_end_spot_curve = short_end_spot_curve.loc[valid_index]
# merge these two, time index are identical
# ==============================================
combined_data = pd.concat([short_end_spot_curve, spot_curve], axis=1, join='outer')
# sort the maturity from short end to long end
combined_data.sort_index(axis=1, inplace=True)
def filter_func(group):
return group.isnull().sum(axis=1) <= 50
combined_data = combined_data.groupby(level=0).filter(filter_func)
In pandas:
read_excel try to use urllib2.urlopen(urllib.request.urlopen instead in py3x) to open the url and get .read() of response immediately without store the http request like:
data = urlopen(url).read()
Though you need only part of the excel, pandas will download the whole excel each time. So, I voted #jonnybazookatone.
It's better to store the excel to your local, then you can check the status code and md5 of file first to verify data integrity or others.