Difficulty extracting stock tickers - pd.read_html not preserving whitespace - python

I'm trying to get stock tickers for a few hundred ETFs from etfdailynews.com. I started by getting a list of category names from https://etfdailynews.com/etfs/, then concatenating each category to that URL to open a page with the ETF names and symbols. For example, https://etfdailynews.com/etfs/technology-equities-etfs/
On the page, the column titled "Fund Symbol/Name" has the symbol with the name beneath it. The plan was to read the table and, assuming there was some \n between the symbol and name, split to get just the symbol. For example, getting the first 10:
sector_table = pd.read_html("https://etfdailynews.com/etfs/Large-Cap-Blend-ETFs")
etf_list = list(sector_table[0]["Fund Symbol/Name"].iloc[0:10])
The problem is that it returns the names and the symbols without any whitespace between them. Since symbols are sometimes 3 and sometimes 4 characters long, I can't perform a simple slice. An example of the list returned above:
['SPYSPDR S&P 500', 'IVViShares Core S&P 500 ETF', 'VTIVanguard Total Stock Market ETF',
 'VOOVanguard S&P 500 ETF', 'VIGVanguard Div Appreciation ETF - DNQ', 'IWBiShares Russell 1000 ETF',
 'RSPGuggenheim S&P 500 Equal Weight ETF', 'USMViShares Edge MSCI Min Vol USA ETF',
 'ITOTiShares Core S&P Total U.S. Stock Market ETF', 'SCHXSchwab U.S. Large-Cap ETF']
Perhaps there is a way to do what I want with BeautifulSoup, but I am not proficient with that module, and from what I know pd.read_html is better at working with tables, though I could be entirely mistaken.
EDIT: I should clarify that I plan to open each ETF's URL to extract the tickers; I had planned to concatenate the ETF symbols to the URL. An alternative that lets me simply extract the URLs of the ETFs would work perfectly as well.

The function below splits cells like the following on the line break, by setting the <br/> tag's string to a semicolon and then splitting the cell text on it.
(HTML as of 3/18/18, https://etfdailynews.com/etfs/Large-Cap-Blend-ETFs/)
<td class="bold"><a class="show" href="/etf/SPY/">SPY<br/>
<span class="thirteen unbold">SPDR S&P 500</span></a></td>
After you have opened the url with urllib or requests, pass the html table to the function below and it will return a DataFrame.
import pandas as pd


def parse_etf_html(data_table, debug=False):
    header = [th.text for th in data_table.findAll('th')]

    # header modifications
    compound_field = header.pop(0)
    header.insert(0, compound_field.split('/')[0])
    header.insert(1, 'Fund ' + compound_field.split('/')[1])

    compound_field = header.pop(8)
    header.insert(8, compound_field + '_cur')
    header.insert(9, compound_field + '_per')

    # Row 0 is the table header
    extracted_data = list()

    # Starting at row 1, loop each table row
    for tr in data_table.findAll('tr')[1:]:
        extracted_row = list()
        if debug:
            # simple test to verify if number of items matches expectations.
            row_parsing_log = dict()
        for td in tr.findAll('td'):
            # <td class="bold"><a class="show" href="/etf/SPY/">SPY<br/><span class="thirteen unbold">SPDR S&P 500</span></a></td>
            if td.find('a') and td.find('br') and td.find('span'):
                td.br.string = ";"
                extracted_row.extend(td.text.split(";"))
                if debug:
                    row_parsing_log['symbol_fund_as_expected'] = len(td.text.split(";")) == 2
            # <td class="grade-4">+0.25<br/>(0.09%)</td>
            elif td.find('br') and [td.find('strong'), td.find('small'), td.find('a')] == [None, None, None]:
                td.br.string = ";"
                # percent change is enclosed with (). remove to avoid confusion
                extracted_row.append(td.text.split(";")[0])
                extracted_row.append(td.text.split(";")[1].replace("(", "").replace(")", ""))
                if debug:
                    row_parsing_log['day_chg_as_expected'] = len(td.text.split(";")) == 2
            # <td class="text-center grade-1"> <strong>A</strong><br/> <small>Strong Buy</small> </td>
            elif td.find('br') and td.find('strong') and td.find('small'):
                # Appears to be parsed correctly by pandas read_html
                extracted_row.append(td.text.replace('\n', ' ').strip())
            else:
                extracted_row.append(td.text)

        record = dict(zip(header, extracted_row))
        if debug:
            record.update(row_parsing_log)
        # append each row
        extracted_data.append(record)

    if debug:
        header.extend(['symbol_fund_as_expected', 'day_chg_as_expected'])

    outputDF = pd.DataFrame(extracted_data)[header]
    # data types
    return outputDF
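For example, a minimal usage sketch (the requests/BeautifulSoup calls and the soup.find("table") lookup are assumptions; adjust the selector if the page contains more than one table):

import requests
from bs4 import BeautifulSoup

# Fetch the category page and hand the ETF table to parse_etf_html.
resp = requests.get("https://etfdailynews.com/etfs/Large-Cap-Blend-ETFs/")
soup = BeautifulSoup(resp.text, "html.parser")
data_table = soup.find("table")  # assumed: the ETF listing is the first table on the page

etf_df = parse_etf_html(data_table)
print(etf_df["Fund Symbol"].head(10))  # "Fund Symbol" comes from the header split above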
Link to static notebook:
https://github.com/emican86/49350586/blob/master/read_etf_html_tables.ipynb
Link to Azure notebook ( you can clone and use as a live demo):
https://notebooks.azure.com/emican86/libraries/read-etf-html-tables
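Since the EDIT mentions that simply extracting the ETF URLs would also work, here is a minimal sketch based on the same cell markup (the CSS selector is an assumption; verify it against the live page):

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://etfdailynews.com/etfs/Large-Cap-Blend-ETFs/")
soup = BeautifulSoup(resp.text, "html.parser")

# Each symbol cell looks like <td class="bold"><a class="show" href="/etf/SPY/">SPY<br/>...</a></td>,
# so the relative ETF links (and the tickers themselves) can be read straight from the anchors.
anchors = soup.select('td.bold a.show[href^="/etf/"]')
etf_urls = ["https://etfdailynews.com" + a["href"] for a in anchors]
tickers = [a["href"].strip("/").split("/")[-1] for a in anchors]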

Related

Using ek.get_data in Python: only the first RIC from my CSV shows results, the other lines disappear

I want to retrieve financial data from Refinitiv using the code below. The file 122.csv contains the RICs of many companies, but when I run the code the result only shows the first company's information, and the RIC of the first company is incomplete:
with open(r'C:\Users\zhang\Desktop\122.csv', encoding='utf-8') as date:
    for i, line in enumerate(read_csv(date)):
        RIC = line[0]
        SDate = '2013-06-10'
        EDate = '2015-06-10'
        df1, das = ek.get_data(RIC,
                               fields=['TR.F.TotAssets(Scale=6)',
                                       'TR.F.DebtTot(Scale=6)'],
                               parameters={'SDate': '{}'.format(SDate), 'EDate': '{}'.format(EDate)},
                               )
df1
The result:
  Instrument  Total Assets  Debt - Total
0          A         10686          2699
1          A         10815          1663
How can I generate results for the whole list of RICs in my CSV file?
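A minimal sketch of one way to collect every company's rows, assuming the Eikon module is imported as ek and already configured, that each CSV row starts with a RIC, and reusing the field names from the code above (the csv.reader call is an assumption about how the file should be read):

import csv
import pandas as pd
import eikon as ek  # assumed to be set up with an app key already

SDate = '2013-06-10'
EDate = '2015-06-10'
frames = []

with open(r'C:\Users\zhang\Desktop\122.csv', encoding='utf-8') as f:
    for row in csv.reader(f):
        ric = row[0]
        df_i, err = ek.get_data(ric,
                                fields=['TR.F.TotAssets(Scale=6)',
                                        'TR.F.DebtTot(Scale=6)'],
                                parameters={'SDate': SDate, 'EDate': EDate})
        frames.append(df_i)  # keep each company's rows instead of overwriting df1

result = pd.concat(frames, ignore_index=True)
print(result)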

Need a 'for loop' to get dividend data for a stock portfolio, from their respective api urls

I am trying to automate parsing of dividend data for a stock portfolio, and getting the stock wise dividend values into a single dataframe table.
The data for each stock in a portfolio is stored in a separate api url
The portfolio ids (for stocks - ITC, Britannia, Sanofi) are [500875, 500825, 500674].
I would first like to run a for loop to generate/concatenate each specific URL (which goes like this: https://api.bseindia.com/BseIndiaAPI/api/CorporateAction/w?scripcode=500674), the last 6-digit number of each URL being the respective company id.
Then I would like to use each URL to get the first line of the respective dividend table into a single dataframe. The code I used to get the individual dividend data, and the final dataframe that I need, are shown in the attached image.
Basically I would like to run a for loop to get the first line of 'Table2' for each stock id and store it in a single data frame as a final result.
PS - The code which I used to get individual dividend data is shown below:
url = 'https://api.bseindia.com/BseIndiaAPI/api/CorporateAction/w?scripcode=500674'
jsondata = requests.get(url, headers= {'User-Agent': 'Mozilla/5.0'}).json()
df = pd.DataFrame(jsondata['Table2'])
If you need a for loop then you should use one, and show the code with the for loop and the problem it gives you.
You could use a single for loop for all of the work.
You can use string formatting to create the URL with each code and read the data from the server. Next you can get the first row (even without creating a DataFrame) and append it to a list of all rows. After the loop you can convert this list to a DataFrame.
import requests
import pandas as pd

# --- before loop ---

headers = {'User-Agent': 'Mozilla/5.0'}
all_rows = []

# --- loop ---

for code in [500875, 500825, 500674]:
    # use `f-string` or string `.format()` to create url
    #url = f'https://api.bseindia.com/BseIndiaAPI/api/CorporateAction/w?scripcode={code}'
    url = 'https://api.bseindia.com/BseIndiaAPI/api/CorporateAction/w?scripcode={}'.format(code)

    r = requests.get(url, headers=headers)
    #print(r.text)         # to check error message
    #print(r.status_code)

    data = r.json()

    first_row = data['Table2'][0]  # no need to use DataFrame
    #df = pd.DataFrame(data['Table2'])
    #first_row = df.iloc[0]
    #print(first_row)

    all_rows.append(first_row)

# --- after loop ---

df_result = pd.DataFrame(all_rows)
print(df_result)
Result:
  scrip_code                  sLongName  ...   Details         PAYMENT_DATE
0     500875                   ITC LTD.  ...   10.1500  2020-09-08T00:00:00
1     500825  BRITANNIA INDUSTRIES LTD.  ...   83.0000  2020-09-16T00:00:00
2     500674           Sanofi India Ltd  ...  106.0000  2020-08-06T00:00:00

[3 rows x 9 columns]

How to Use Beautiful Soup to Scrape SEC's Edgar Database and Receive Desire Data

Apologies in advance for the long question - I am new to Python and I'm trying to be as explicit as I can about a fairly specific situation.
I am trying to identify specific data points from SEC filings on a routine basis; however, I want to automate this instead of having to manually search for a company's CIK ID and form filing. So far, I have been able to get to a point where I am downloading metadata about all filings received by the SEC in a given time period. It looks like this:
index cik conm type date path
0 0 1000045 NICHOLAS FINANCIAL INC 10-Q 2019-02-14 edgar/data/1000045/0001193125-19-039489.txt
1 1 1000045 NICHOLAS FINANCIAL INC 4 2019-01-15 edgar/data/1000045/0001357521-19-000001.txt
2 2 1000045 NICHOLAS FINANCIAL INC 4 2019-02-19 edgar/data/1000045/0001357521-19-000002.txt
3 3 1000045 NICHOLAS FINANCIAL INC 4 2019-03-15 edgar/data/1000045/0001357521-19-000003.txt
4 4 1000045 NICHOLAS FINANCIAL INC 8-K 2019-02-01 edgar/data/1000045/0001193125-19-024617.txt
Despite having all this information, and being able to download these text files and see the underlying data, I am unable to parse this data as it is in XBRL format, which is a bit out of my wheelhouse. Instead I came across this script (kindly provided by this site: https://www.codeproject.com/Articles/1227765/Parsing-XBRL-with-Python):
from bs4 import BeautifulSoup
import requests
import sys

# Access page
cik = '0000051143'
type = '10-K'
dateb = '20160101'

# Obtain HTML for search page
base_url = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={}&type={}&dateb={}"
edgar_resp = requests.get(base_url.format(cik, type, dateb))
edgar_str = edgar_resp.text

# Find the document link
doc_link = ''
soup = BeautifulSoup(edgar_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile2')
rows = table_tag.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 3:
        if '2015' in cells[3].text:
            doc_link = 'https://www.sec.gov' + cells[1].a['href']

# Exit if document link couldn't be found
if doc_link == '':
    print("Couldn't find the document link")
    sys.exit()

# Obtain HTML for document page
doc_resp = requests.get(doc_link)
doc_str = doc_resp.text

# Find the XBRL link
xbrl_link = ''
soup = BeautifulSoup(doc_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile', summary='Data Files')
rows = table_tag.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 3:
        if 'INS' in cells[3].text:
            xbrl_link = 'https://www.sec.gov' + cells[2].a['href']

# Obtain XBRL text from document
xbrl_resp = requests.get(xbrl_link)
xbrl_str = xbrl_resp.text

# Find and print stockholder's equity
soup = BeautifulSoup(xbrl_str, 'lxml')
tag_list = soup.find_all()
for tag in tag_list:
    if tag.name == 'us-gaap:stockholdersequity':
        print("Stockholder's equity: " + tag.text)
Just running this script works exactly how I'd like it to. It returns the stockholders equity for a given company (IBM in this case) and I can then take that value and write it to an excel file.
My two-part question is this:
I took the three relevant columns (CIK, type, and date) from my original metadata table above and wrote them to a list of tuples - I think that's what it's called - which looks like this: [('1009759', 'D', '20190215'), ('1009891', 'D', '20190206'), ...]. How do I take this data, replace the hard-coded values at the start of the script I found, and loop through it efficiently so I end up with a list of the desired values for each company, filing, and date?
Is there generally a better way to do this? I would think there would be some sort of API or Python package for querying the data I'm interested in. I know there is some high-level information out there for Form 10-Ks and Form 10-Qs, however I am working with Form Ds, which are somewhat obscure. I just want to make sure I am spending my time effectively on the best possible solution.
Thank you for the help!
You need to define a function, which can essentially be most of the code you have posted, and that function should take 3 keyword arguments (your 3 values). Then, rather than defining the three in your code, you just pass in those values and return a result.
Then you take the list which you created and write a simple for loop around it to call the function you defined with those three values, and then do something with the result.
def get_data(value1, value2, value3):
    # your main code here but replace with your arguments above.
    return content

for company in companies:
    content = get_data(value1, value2, value3)
    # do something with content
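A minimal sketch of that loop, assuming the question's list of (cik, type, date) tuples is named filings and that get_data wraps the posted script as described (both names are placeholders from this answer, not a fixed API):

filings = [('1009759', 'D', '20190215'), ('1009891', 'D', '20190206')]  # from the question

results = []
for cik, form_type, dateb in filings:               # tuple unpacking gives the three values
    results.append(get_data(cik, form_type, dateb))
# results now holds one value (e.g. stockholder's equity) per company/filing/date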
Assuming you have a dataframe sec with correctly named columns for your list of filings, above, you first need to extract from the dataframe the relevant information into three lists:
cik = list(sec['cik'].values)
dat = list(sec['date'].values)
typ = list(sec['type'].values)
Then you create your base_url, with the items inserted and get your data:
for c, t, d in zip(cik, typ, dat):
    base_url = f"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={c}&type={t}&dateb={d}"
    edgar_resp = requests.get(base_url)
And go from there.

Cannot scrape dataid from Morningstar - How can I access the Network inspection tool from Python?

I'm trying to scrape Morningstar.com to get financial data and prices of each fund available on the website. Fortunately I have no problem at scraping financial data (holdings, asset allocation, portfolio, risk, etc.), but when it comes to find the URL that hosts the daily prices in JSON format for each fund, there is a "dataid" value that is not available in the HTML code and without it there is no way to know the exact URL that hosts all the prices.
I have tried to print the whole page as text for many funds, and none of them show in the HTML code the "dataid" value that I need in order to get the prices. The URL that hosts the prices also includes the "secid", which is scrapeable very easily but has no relationship at all with the "dataid" that I need to scrape.
import requests
from lxml import html
import re
import json
quote_page = "https://www.morningstar.com/etfs/arcx/aadr/quote.html"
prices1 = "https://mschart.morningstar.com/chartweb/defaultChart?type=getcc&secids="
prices2 = "&dataid="
prices3 = "&startdate="
prices4 = "&enddate="
starting_date = "2018-01-01"
ending_date = "2018-12-28"
quote_html = requests.get(quote_page, timeout=10)
quote_tree = html.fromstring(quote_html.text)
security_id = re.findall('''meta name=['"]secId['"]\s*content=['"](.*?)['"]''', quote_html.text)[0]
security_type = re.findall('''meta name=['"]securityType['"]\s*content=['"](.*?)['"]''', quote_html.text)[0]
data_id = "8225"
daily_prices_url = prices1 + security_id + ";" + security_type + prices2 + data_id + prices3 + starting_date + prices4 + ending_date
daily_prices_html = requests.get(daily_prices_url, timeout=10)
json_prices = daily_prices_html.json()
for json_price in json_prices["data"]["r"]:
    j_prices = json_price["t"]
    for j_price in j_prices:
        daily_prices = j_price["d"]
        for daily_price in daily_prices:
            print(daily_price["i"] + " || " + daily_price["v"])
The code above works for the "AADR" ETF only because I copied and pasted the "dataid" value manually in the "data_id" variable, and without this piece of information there is no way to access the daily prices. I would not like to use Selenium as alternative to find the "dataid" because it is a very slow tool and my intention is to scrape data for more than 28k funds, so I have tried only robot web-scraping methods.
Do you have any suggestion on how to access the Network inspection tool, which is the only source I have found so far that shows the "dataid"?
Thanks in advance
The data id may not be that important. I varied the code F00000412E that is associated with AADR whilst keeping the data id constant.
I got a list of all those codes from here:
https://www.firstrade.com/scripts/free_etfs/io.php
Then add the code of choice into your url e.g.
[
    "AIA",
    "iShares Asia 50 ETF",
    "FOUSA06MPQ"
]
Use FOUSA06MPQ
https://mschart.morningstar.com/chartweb/defaultChart?type=getcc&secids=FOUSA06MPQ;FE&dataid=8225&startdate=2017-01-01&enddate=2018-12-30
You can verify the values by adding the other fund as a benchmark to your chart e.g. XNAS:AIA
28th December has a value of 55.32; compare this with the JSON retrieved.
I repeated this with
[
    "ALD",
    "WisdomTree Asia Local Debt ETF",
    "F00000M8TW"
]
https://mschart.morningstar.com/chartweb/defaultChart?type=getcc&secids=F00000M8TW;FE&dataid=8225&startdate=2017-01-01&enddate=2018-12-30
dataId 8217 works well for me, irrespective of the security.
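Putting this together, a sketch of how the price URL could be built for each fund from that code list (it assumes the io.php endpoint returns JSON arrays of [ticker, name, code] as shown above, and reuses the JSON structure from the question's code; adjust the parsing if the raw format differs):

import requests

# List of [ticker, name, Morningstar code] entries (assumed JSON layout, see above).
codes = requests.get("https://www.firstrade.com/scripts/free_etfs/io.php", timeout=10).json()

chart_url = ("https://mschart.morningstar.com/chartweb/defaultChart"
             "?type=getcc&secids={};FE&dataid=8225"
             "&startdate=2018-01-01&enddate=2018-12-28")

for ticker, name, code in codes[:3]:  # first few funds as a demo
    json_prices = requests.get(chart_url.format(code), timeout=10).json()
    print(ticker, name, len(json_prices["data"]["r"]))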

Extract using Beautiful Soup

I want to fetch the stock price from web site: http://www.bseindia.com/
For example, the stock price appears as "S&P BSE :25,489.57". I want to fetch the numeric part of it as "25489.57".
This is the code I have written as of now. It is fetching the entire div in which this amount appears, but not the amount.
Below is the code:
from bs4 import BeautifulSoup
from urllib.request import urlopen
page = "http://www.bseindia.com"
html_page = urlopen(page)
html_text = html_page.read()
soup = BeautifulSoup(html_text,"html.parser")
divtag = soup.find_all("div",{"class":"sensexquotearea"})
for oye in divtag:
    tdidTags = oye.find_all("div", {"class": "sensexvalue2"})
    for tag in tdidTags:
        tdTags = tag.find_all("div", {"class": "newsensexvaluearea"})
        for newtag in tdTags:
            tdnewtags = newtag.find_all("div", {"class": "sensextext"})
            for rakesh in tdnewtags:
                tdtdsp1 = rakesh.find_all("div", {"id": "tdsp"})
                for texts in tdtdsp1:
                    print(texts)
I had a look at what is going on when that page loads the information, and I was able to simulate what the JavaScript is doing in Python.
I found out it is referencing a page called IndexMovers.aspx?ln=en (the full URL appears in the code below).
It looks like this page is a comma-separated list of things. First comes the name, next comes the price, and then a couple of other things you don't care about.
To simulate this in Python, we request the page, split it by the commas, then read through every 6th value in the list, adding that value and the value after it to a new list called stockInformation.
Now we can just loop through stockInformation and get the name using item[0] and the price with item[1].
import requests
newUrl = "http://www.bseindia.com/Msource/IndexMovers.aspx?ln=en"
response = requests.get(newUrl).text
commaItems = response.split(",")
#create list of stocks, each one containing information
#index 0 is the name, index 1 is the price
#the last item is not included because for some reason it has no price info on indexMovers page
stockInformation = []
for i, item in enumerate(commaItems[:-1]):
    if i % 6 == 0:
        newList = [item, commaItems[i+1]]
        stockInformation.append(newList)

#print each item and its price from your list
for item in stockInformation:
    print(item[0], "has a price of", item[1])
This prints out:
S&P BSE SENSEX has a price of 25489.57
SENSEX#S&P BSE 100 has a price of 7944.50
BSE-100#S&P BSE 200 has a price of 3315.87
BSE-200#S&P BSE MidCap has a price of 11156.07
MIDCAP#S&P BSE SmallCap has a price of 11113.30
SMLCAP#S&P BSE 500 has a price of 10399.54
BSE-500#S&P BSE GREENEX has a price of 2234.30
GREENX#S&P BSE CARBONEX has a price of 1283.85
CARBON#S&P BSE India Infrastructure Index has a price of 152.35
INFRA#S&P BSE CPSE has a price of 1190.25
CPSE#S&P BSE IPO has a price of 3038.32
#and many more... (total of 40 items)
Which is clearly equivalent to the values shown on the page.
So there you have it: you can simulate exactly what the JavaScript on that page is doing to load the information. In fact, you now have even more information than was shown to you on the page, and the request is going to be faster because we are downloading just the data, not all that extraneous HTML.
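Since the question only needs the numeric SENSEX value, a small follow-up to the snippet above (it assumes the headline entry's name starts with "S&P BSE SENSEX", as in the printed output):

# Pick out just the headline SENSEX value from the stockInformation list built above.
for name, price in stockInformation:
    if name.startswith("S&P BSE SENSEX"):
        print(price)  # e.g. 25489.57
        break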
If you look into the source code of your page (e.g. by storing it into a file and opening it with an editor), you will see that the actual stock price 25,489.57 does not show up directly. The price is not in the stored html code but loaded in a different way.
You could use the linked page where the numbers show up:
http://www.bseindia.com/sensexview/indexview_new.aspx?index_Code=16&iname=BSE30
