Extract using Beautiful Soup - python

I want to fetch the stock price from the website http://www.bseindia.com/
For example, the stock price appears as "S&P BSE :25,489.57". I want to fetch the numeric part of it as "25489.57".
This is the code I have written so far. It fetches the entire div in which this amount appears, but not the amount itself.
Below is the code:
from bs4 import BeautifulSoup
from urllib.request import urlopen

page = "http://www.bseindia.com"
html_page = urlopen(page)
html_text = html_page.read()
soup = BeautifulSoup(html_text, "html.parser")

divtag = soup.find_all("div", {"class": "sensexquotearea"})
for oye in divtag:
    tdidTags = oye.find_all("div", {"class": "sensexvalue2"})
    for tag in tdidTags:
        tdTags = tag.find_all("div", {"class": "newsensexvaluearea"})
        for newtag in tdTags:
            tdnewtags = newtag.find_all("div", {"class": "sensextext"})
            for rakesh in tdnewtags:
                tdtdsp1 = rakesh.find_all("div", {"id": "tdsp"})
                for texts in tdtdsp1:
                    print(texts)

I had a look at what happens when that page loads its information, and I was able to simulate what the JavaScript is doing in Python.
I found that it references a page called IndexMovers.aspx?ln=en.
That page is essentially a comma-separated list: first comes the name, then the price, and then a couple of other fields you don't care about.
To simulate this in Python, we request the page, split it on commas, then step through every 6th value in the list, adding that value and the one after it to a new list called stockInformation.
Now we can just loop through stockInformation and get the name with item[0] and the price with item[1]:
import requests

newUrl = "http://www.bseindia.com/Msource/IndexMovers.aspx?ln=en"
response = requests.get(newUrl).text
commaItems = response.split(",")

# create a list of stocks, each one containing its information:
# index 0 is the name, index 1 is the price
# the last item is not included because for some reason it has no price info on the IndexMovers page
stockInformation = []
for i, item in enumerate(commaItems[:-1]):
    if i % 6 == 0:
        newList = [item, commaItems[i+1]]
        stockInformation.append(newList)

# print each item and its price from your list
for item in stockInformation:
    print(item[0], "has a price of", item[1])
This prints out:
S&P BSE SENSEX has a price of 25489.57
SENSEX#S&P BSE 100 has a price of 7944.50
BSE-100#S&P BSE 200 has a price of 3315.87
BSE-200#S&P BSE MidCap has a price of 11156.07
MIDCAP#S&P BSE SmallCap has a price of 11113.30
SMLCAP#S&P BSE 500 has a price of 10399.54
BSE-500#S&P BSE GREENEX has a price of 2234.30
GREENX#S&P BSE CARBONEX has a price of 1283.85
CARBON#S&P BSE India Infrastructure Index has a price of 152.35
INFRA#S&P BSE CPSE has a price of 1190.25
CPSE#S&P BSE IPO has a price of 3038.32
#and many more... (total of 40 items)
Which is clearly equivalent to the values shown on the page.
So there you have it: you can simulate exactly what the JavaScript on that page is doing to load the information. In fact, you now have even more information than was shown on the page, and the request is faster because we are downloading just the data, not all that extraneous HTML.
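If you only need the headline SENSEX figure as a number, a small follow-up sketch (assuming stockInformation was built as above; the exact label is taken from the sample output and may change) is:
# pick out the SENSEX entry from stockInformation built above
for name, price in stockInformation:
    if "SENSEX" in name:
        sensex_value = float(price.replace(",", ""))  # e.g. "25489.57" -> 25489.57
        print(sensex_value)
        break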

If you look into the source code of your page (e.g. by storing it in a file and opening it with an editor), you will see that the actual stock price 25,489.57 does not show up directly. The price is not in the stored HTML code but is loaded in a different way.
You could use the linked page where the numbers show up:
http://www.bseindia.com/sensexview/indexview_new.aspx?index_Code=16&iname=BSE30
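As a rough sketch (under the assumption that the index value appears somewhere in that page's text as a formatted number like "25,489.57", which may change), you could fetch it and pull the number out with a regular expression, stripping the thousands separator to get the plain numeric form asked for:
import re
import requests

# assumption: the index value appears as a formatted number in the page text
url = "http://www.bseindia.com/sensexview/indexview_new.aspx?index_Code=16&iname=BSE30"
text = requests.get(url, timeout=10).text
match = re.search(r"\d{1,3}(?:,\d{3})*\.\d{2}", text)
if match:
    print(match.group(0).replace(",", ""))  # e.g. 25489.57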

Related

Want to find discount price and original price while web scraping puma website?

I was web scraping the Puma website and wanted to find the original price and the discount price separately. Whenever there is no discount price, I just want to record the discount price as 0 instead of not adding any value. Is there any way I could do that?
Link to the website: https://in.puma.com/in/en/mens/mens-shoes
Whenever there is a discount price plus an original price, the div classes are
'product-tile-price-new product-tile__price--new' for the discount price
'product-tile-price-old product-tile__price--old' for the original price
When there is no discount, the div class is
'product-tile-price-standard product-tile__price--standard'
I could get both the discount price and the original price by accessing their parent tag:
a_price = soup.findAll('div', class_='product-tile-info-price product-tile__price')
a_price_list = []
for head in a_price:
    a_price_list.append(head.text.strip())
a_price_list
I will get
['₹2,149\n₹4,299',
'₹2,149\n₹4,299',
'₹3,369\n₹7,499',
'₹2,449\n₹3,499',
'₹9,999',
'₹6,999',
'₹2,449\n₹3,499',
'₹3,499\n₹6,999',
'₹8,999',
'₹8,249\n₹10,999',
'₹3,999\n₹7,999',
'₹5,999\n₹7,999',
'₹5,399\n₹8,999',
'₹3,249\n₹6,499',
'₹5,949\n₹6,999',
'₹4,249\n₹4,999',
'₹4,199\n₹6,999',
'₹2,399\n₹3,999',
'₹5,999',
'₹5,999',
'₹9,999',
'₹3,999\n₹7,999',
'₹3,499\n₹6,999',
'₹5,999\n₹11,999',
'₹5,499',
'₹2,469\n₹3,799',
'₹7,999',
'₹9,999',
'₹3,999',
'₹4,249\n₹4,999',
'₹3,249\n₹6,499',
'₹10,999',
'₹9,999',
'₹2,579\n₹4,299',
'₹2,999',
'₹3,499\n₹6,999']
So in the first index there is the discount price and in the second index there is the original price. Check out the 5th and 6th rows: there is no discount price there, so the original price appears in the first index. Instead, I want to print 0 as the discount price.
You can use a try/except block: when the tag for the discount price is not there, it will go to the except block and append 0 to the list, and you can then find the old price by its tag.
from bs4 import BeautifulSoup
import requests

res = requests.get("https://in.puma.com/in/en/mens/mens-shoes")
soup = BeautifulSoup(res.text, "html.parser")
prices = soup.find_all("div", attrs={"class": "product-tile-info-price"})
main_list = []
for price in prices:
    try:
        discount_price = price.find("div", class_='product-tile-price-new').get_text(strip=True)
        main_list.append(discount_price)
        original_price = price.find("div", class_='product-tile-price-old').get_text(strip=True)
        main_list.append(original_price)
    except AttributeError:
        main_list.append(0)
        original_price = price.find("div", class_='product-tile-price-standard').get_text(strip=True)
        main_list.append(original_price)
Output:
['₹2,149',
'₹4,299',
'₹2,149',
'₹4,299',
'₹3,369',
'₹7,499',
'₹2,449',
'₹3,499',
0,
'₹9,999',
....
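If you prefer the results as (discount, original) pairs rather than a flat alternating list, a small follow-up sketch (assuming main_list was built exactly as above) is:
# pair up alternating discount/original entries from main_list
pairs = list(zip(main_list[::2], main_list[1::2]))
for discount, original in pairs:
    print("discount:", discount, "| original:", original)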

How to Use Beautiful Soup to Scrape SEC's Edgar Database and Receive Desired Data

Apologies in advance for the long question; I am new to Python and I'm trying to be as explicit as I can with a fairly specific situation.
I am trying to identify specific data points from SEC filings on a routine basis; however, I want to automate this instead of having to manually go search a company's CIK ID and form filing. So far, I have been able to get to a point where I am downloading metadata about all filings received by the SEC in a given time period. It looks like this:
index cik conm type date path
0 0 1000045 NICHOLAS FINANCIAL INC 10-Q 2019-02-14 edgar/data/1000045/0001193125-19-039489.txt
1 1 1000045 NICHOLAS FINANCIAL INC 4 2019-01-15 edgar/data/1000045/0001357521-19-000001.txt
2 2 1000045 NICHOLAS FINANCIAL INC 4 2019-02-19 edgar/data/1000045/0001357521-19-000002.txt
3 3 1000045 NICHOLAS FINANCIAL INC 4 2019-03-15 edgar/data/1000045/0001357521-19-000003.txt
4 4 1000045 NICHOLAS FINANCIAL INC 8-K 2019-02-01 edgar/data/1000045/0001193125-19-024617.txt
Despite having all this information, as well as being able to download these text files and see the underlying data, I am unable to parse it because it is in XBRL format, which is a bit out of my wheelhouse. Instead I came across this script (kindly provided on this site: https://www.codeproject.com/Articles/1227765/Parsing-XBRL-with-Python):
from bs4 import BeautifulSoup
import requests
import sys

# Access page
cik = '0000051143'
type = '10-K'
dateb = '20160101'

# Obtain HTML for search page
base_url = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={}&type={}&dateb={}"
edgar_resp = requests.get(base_url.format(cik, type, dateb))
edgar_str = edgar_resp.text

# Find the document link
doc_link = ''
soup = BeautifulSoup(edgar_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile2')
rows = table_tag.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 3:
        if '2015' in cells[3].text:
            doc_link = 'https://www.sec.gov' + cells[1].a['href']

# Exit if document link couldn't be found
if doc_link == '':
    print("Couldn't find the document link")
    sys.exit()

# Obtain HTML for document page
doc_resp = requests.get(doc_link)
doc_str = doc_resp.text

# Find the XBRL link
xbrl_link = ''
soup = BeautifulSoup(doc_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile', summary='Data Files')
rows = table_tag.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 3:
        if 'INS' in cells[3].text:
            xbrl_link = 'https://www.sec.gov' + cells[2].a['href']

# Obtain XBRL text from document
xbrl_resp = requests.get(xbrl_link)
xbrl_str = xbrl_resp.text

# Find and print stockholder's equity
soup = BeautifulSoup(xbrl_str, 'lxml')
tag_list = soup.find_all()
for tag in tag_list:
    if tag.name == 'us-gaap:stockholdersequity':
        print("Stockholder's equity: " + tag.text)
Just running this script works exactly how I'd like it to. It returns the stockholders' equity for a given company (IBM in this case), and I can then take that value and write it to an Excel file.
My two-part question is this:
I took the three relevant columns (CIK, type, and date) from my original metadata table above and wrote them to a list of tuples - I think that's what it's called - it looks like this: [('1009759', 'D', '20190215'), ('1009891', 'D', '20190206'), ...]. How do I take this data, replace the initial part of the script I found, and loop through it efficiently so I end up with a list of desired values for each company, filing, and date?
Is there generally a better way to do this? I would think there would be some sort of API or Python package to query the data I'm interested in. I know there is some high-level information out there for Form 10-Ks and Form 10-Qs; however, I am after Form Ds, which are somewhat obscure. I just want to make sure I am spending my time effectively on the best possible solution.
Thank you for the help!
You need to define a function, which can be essentially most of the code you have posted, and that function should take three keyword arguments (your three values). Then, rather than defining the three in your code, you just pass in those values and return a result.
Then you take the list you created and write a simple for loop around it to call the function you defined with those three values, and do something with the result.
def get_data(value1, value2, value3):
    # your main code here, but replace the hard-coded values with the arguments above
    return content

for company in companies:
    content = get_data(value1, value2, value3)
    # do something with content
Assuming you have a dataframe sec with correctly named columns for your list of filings, above, you first need to extract from the dataframe the relevant information into three lists:
cik = list(sec['cik'].values)
dat = list(sec['date'].values)
typ = list(sec['type'].values)
Then you create your base_url with the items inserted, and get your data:
for c, t, d in zip(cik, typ, dat):
    base_url = f"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={c}&type={t}&dateb={d}"
    edgar_resp = requests.get(base_url)
And go from there.
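Putting the two parts together, a minimal sketch might look like the following. The helper get_data and the try/except handling are illustrative, not part of the original script; get_data is assumed to wrap the posted code with the hard-coded cik, type and dateb replaced by its arguments.
import pandas as pd

def get_data(cik_code, form_type, dateb):
    # wrap the script posted above here, using the arguments in place of the
    # hard-coded cik, type and dateb, and return the value you extract
    raise NotImplementedError  # placeholder for the posted script

results = []
for c, t, d in zip(cik, typ, dat):
    try:
        value = get_data(c, t, d)
    except Exception:
        value = None  # some filings will not contain the tag you are after
    results.append({'cik': c, 'type': t, 'date': d, 'value': value})

results_df = pd.DataFrame(results)  # easy to write out to Excel afterwards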

How can I find the next sibling of a 'th' tag when its text is inside an 'a' tag in a table?

I am scraping company data from Wikipedia infobox tables, where I need to scrape some values that are inside td tags, like Type, Traded as, Services, etc.
My code is:
response = requests.get(url,headers=headers)
html_soup = BeautifulSoup(response.text, 'lxml')
table_container = html_soup.find('table', class_='infobox')
hq_name=table_container.find("th", text=['Headquarters']).find_next_sibling("td")
It gives the headquarters value and works perfectly.
But when I try to fetch 'Traded as' or any th element that contains a hyperlink, the above code does not work; it returns None.
So how do I get the next sibling of 'Traded as' or 'Type'?
From your comment:
https://en.wikipedia.org/wiki/IBM This is the URL, and the expected output will be: Traded as - NYSE: IBM, DJIA Component, S&P 100 Component, S&P 500 Component
Use the a tags to separate, and select the required row from the table by nth-of-type. You can then join the first two items in the output list if required:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://en.wikipedia.org/wiki/IBM')
soup = bs(r.content, 'lxml')
items = [item.text.replace('\xa0',' ') for item in soup.select('.vcard tr:nth-of-type(4) a')]
print(items)
To have it as shown (if indeed the first and second items are joined):
final = items[2:]
final.insert(0, '-'.join([items[0] , items[1]]))
final
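Alternatively, if you want to keep the find_next_sibling approach from the question, a sketch (assuming the infobox structure currently on the IBM page) is to match the th by its rendered text rather than with the text= argument, which only matches when the label is a bare string and therefore fails once it is wrapped in an a tag:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://en.wikipedia.org/wiki/IBM')
soup = BeautifulSoup(r.text, 'lxml')
infobox = soup.find('table', class_='infobox')

# match the th by its visible text, which works even when the label sits inside an <a>
th = infobox.find(lambda tag: tag.name == 'th' and tag.get_text(strip=True) == 'Traded as')
if th is not None:
    print(th.find_next_sibling('td').get_text(' ', strip=True))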

Cannot scrape dataid from Morningstar - How can I access the Network inspection tool from Python?

I'm trying to scrape Morningstar.com to get financial data and prices for each fund available on the website. Fortunately I have no problem scraping the financial data (holdings, asset allocation, portfolio, risk, etc.), but when it comes to finding the URL that hosts the daily prices in JSON format for each fund, there is a "dataid" value that is not available in the HTML code, and without it there is no way to know the exact URL that hosts all the prices.
I have tried to print the whole page as text for many funds, and none of them show the "dataid" value that I need in the HTML code in order to get the prices. The URL that hosts the prices also includes the "secid", which is very easy to scrape but has no relationship at all with the "dataid" that I need.
import requests
from lxml import html
import re
import json
quote_page = "https://www.morningstar.com/etfs/arcx/aadr/quote.html"
prices1 = "https://mschart.morningstar.com/chartweb/defaultChart?type=getcc&secids="
prices2 = "&dataid="
prices3 = "&startdate="
prices4 = "&enddate="
starting_date = "2018-01-01"
ending_date = "2018-12-28"
quote_html = requests.get(quote_page, timeout=10)
quote_tree = html.fromstring(quote_html.text)
security_id = re.findall('''meta name=['"]secId['"]\s*content=['"](.*?)['"]''', quote_html.text)[0]
security_type = re.findall('''meta name=['"]securityType['"]\s*content=['"](.*?)['"]''', quote_html.text)[0]
data_id = "8225"
daily_prices_url = prices1 + security_id + ";" + security_type + prices2 + data_id + prices3 + starting_date + prices4 + ending_date
daily_prices_html = requests.get(daily_prices_url, timeout=10)
json_prices = daily_prices_html.json()
for json_price in json_prices["data"]["r"]:
    j_prices = json_price["t"]
    for j_price in j_prices:
        daily_prices = j_price["d"]
        for daily_price in daily_prices:
            print(daily_price["i"] + " || " + daily_price["v"])
The code above works for the "AADR" ETF only because I copied and pasted the "dataid" value manually into the "data_id" variable; without this piece of information there is no way to access the daily prices. I would not like to use Selenium as an alternative for finding the "dataid" because it is a very slow tool and my intention is to scrape data for more than 28k funds, so I have only tried non-browser web-scraping methods.
Do you have any suggestion on how to access the Network inspection tool, which is the only source I have found so far that shows the "dataid"?
Thanks in advance.
The data id may not be that important. I varied the code F00000412E that is associated with AADR whilst keeping the data id constant.
I got a list of all those codes from here:
https://www.firstrade.com/scripts/free_etfs/io.php
Then add the code of your choice into your URL, e.g.:
[
"AIA",
"iShares Asia 50 ETF",
"FOUSA06MPQ"
]
Use FOUSA06MPQ
https://mschart.morningstar.com/chartweb/defaultChart?type=getcc&secids=FOUSA06MPQ;FE&dataid=8225&startdate=2017-01-01&enddate=2018-12-30
You can verify the values by adding the other fund as a benchmark to your chart, e.g. XNAS:AIA.
28th December has a value of 55.32; compare this with the JSON retrieved.
I repeated this with
[
"ALD",
"WisdomTree Asia Local Debt ETF",
"F00000M8TW"
]
https://mschart.morningstar.com/chartweb/defaultChart?type=getcc&secids=F00000M8TW;FE&dataid=8225&startdate=2017-01-01&enddate=2018-12-30
dataId 8217 works well for me, irrespective of the security.
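As a rough sketch (assuming the firstrade endpoint still returns a JSON array of [ticker, name, code] triples as shown above, and that the chart endpoint accepts the same query parameters), you could build the price URL for every fund like this:
import requests

# assumption: the endpoint returns JSON like [["AIA", "iShares Asia 50 ETF", "FOUSA06MPQ"], ...]
etfs = requests.get("https://www.firstrade.com/scripts/free_etfs/io.php", timeout=10).json()

chart_url = ("https://mschart.morningstar.com/chartweb/defaultChart"
             "?type=getcc&secids={code};FE&dataid=8225"
             "&startdate=2017-01-01&enddate=2018-12-30")

for ticker, name, code in etfs[:5]:  # first few funds as a demo
    resp = requests.get(chart_url.format(code=code), timeout=10)
    print(ticker, name, resp.status_code)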

Difficulty extracting stock tickers - pd.read_html not preserving whitespace

I'm trying to get stock tickers for a few hundred ETFs from etfdailynews.com. I started by getting a list of category names from https://etfdailynews.com/etfs/, then concatenating the category to that URL to open a page with the ETF names and symbols. For example, https://etfdailynews.com/etfs/technology-equities-etfs/
On the page, the column "Fund Symbol/Name" has the symbol with the name beneath it. The plan was to read the table, then, assuming there was some \n between the symbol and name, split to get just the symbol. For example, getting the first 10:
import pandas as pd

sector_table = pd.read_html("https://etfdailynews.com/etfs/Large-Cap-Blend-ETFs")
etf_list = list(sector_table[0]["Fund Symbol/Name"].iloc[0:10])
The problem is that it returns the names and the symbols without any whitespace between them. Since some symbols are 3 characters long and others are 4, I can't perform a simple slice. An example of the list returned above:
['SPYSPDR S&P 500', 'IVViShares Core S&P 500 ETF', 'VTIVanguard Total Stock Market ETF',
 'VOOVanguard S&P 500 ETF', 'VIGVanguard Div Appreciation ETF - DNQ',
 'IWBiShares Russell 1000 ETF', 'RSPGuggenheim S&P 500 Equal Weight ETF',
 'USMViShares Edge MSCI Min Vol USA ETF', 'ITOTiShares Core S&P Total U.S. Stock Market ETF',
 'SCHXSchwab U.S. Large-Cap ETF']
Perhaps there is a way to do what I want with BeautifulSoup, but I am not proficient with that module, and from what I know pd.read_html is better at working with tables, though I could be entirely mistaken.
EDIT: I should clarify that I plan to open each ETF's URL to extract the tickers, and had planned to concatenate the ETF symbols to the URL. An alternative that simply lets me extract the URLs of the ETFs works perfectly as well.
The function below parses the cell shown here on the line break, by setting the br tag's string to a semicolon and splitting the text.
(HTML as of 3/18/18, https://etfdailynews.com/etfs/Large-Cap-Blend-ETFs/)
HTML:
<td class="bold"><a class="show" href="/etf/SPY/">SPY<br/>
<span class="thirteen unbold">SPDR S&P 500</span></a></td>
After you have opened the URL with urllib or requests, pass the HTML table to the function below and it will return a DataFrame.
def parse_etf_html(data_table, debug=False):
    header = [th.text for th in data_table.findAll('th')]

    # header modifications
    compound_field = header.pop(0)
    header.insert(0, compound_field.split('/')[0])
    header.insert(1, 'Fund ' + compound_field.split('/')[1])

    compound_field = header.pop(8)
    header.insert(8, compound_field + '_cur')
    header.insert(9, compound_field + '_per')

    # Row 0 is the table header
    extracted_data = list()

    # Starting at row 1, loop each table row
    for tr in data_table.findAll('tr')[1:]:
        extracted_row = list()
        if debug:
            # simple test to verify if number of items matches expectations.
            row_parsing_log = dict()
        for td in tr.findAll('td'):
            # <td class="bold"><a class="show" href="/etf/SPY/">SPY<br/><span class="thirteen unbold">SPDR S&P 500</span></a></td>
            if td.find('a') and td.find('br') and td.find('span'):
                td.br.string = ";"
                extracted_row.extend(td.text.split(";"))
                if debug:
                    row_parsing_log['symbol_fund_as_expected'] = len(td.text.split(";")) == 2
            # <td class="grade-4">+0.25<br/>(0.09%)</td>
            elif td.find('br') and [td.find('strong'), td.find('small'), td.find('a')] == [None, None, None]:
                td.br.string = ";"
                # percent change is enclosed with (). remove to avoid confusion
                extracted_row.append(td.text.split(";")[0])
                extracted_row.append(td.text.split(";")[1].replace("(", "").replace(")", ""))
                if debug:
                    row_parsing_log['day_chg_as_expected'] = len(td.text.split(";")) == 2
            # <td class="text-center grade-1"> <strong>A</strong><br/> <small>Strong Buy</small> </td>
            elif td.find('br') and td.find('strong') and td.find('small'):
                # Appears to be parsed correctly by pandas read_html
                extracted_row.append(td.text.replace('\n', ' ').strip())
            else:
                extracted_row.append(td.text)

        record = dict(zip(header, extracted_row))
        if debug:
            record.update(row_parsing_log)
        # append each row
        extracted_data.append(record)

    if debug:
        header.extend(['symbol_fund_as_expected', 'day_chg_as_expected'])

    outputDF = pd.DataFrame(extracted_data)[header]

    # data types
    return outputDF
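A minimal usage sketch to tie this together (it assumes the fund listing is the first table on the page and that the compound header is still "Fund Symbol/Name"; adjust the table lookup if the layout has changed):
import requests
import pandas as pd
from bs4 import BeautifulSoup

# assumption: the fund listing is the first <table> on the page
html = requests.get("https://etfdailynews.com/etfs/Large-Cap-Blend-ETFs/", timeout=10).text
soup = BeautifulSoup(html, "html.parser")
data_table = soup.find("table")

df = parse_etf_html(data_table)
print(df["Fund Symbol"].head(10))  # "Fund Symbol" comes from splitting the compound header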
Link to static notebook:
https://github.com/emican86/49350586/blob/master/read_etf_html_tables.ipynb
Link to Azure notebook ( you can clone and use as a live demo):
https://notebooks.azure.com/emican86/libraries/read-etf-html-tables
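Alternatively, if you only need the fund symbols or their page URLs (per the EDIT in the question), a lighter sketch (assuming each fund cell still contains an anchor like <a class="show" href="/etf/SPY/"> as shown above) reads them straight from the hrefs:
import requests
from bs4 import BeautifulSoup

url = "https://etfdailynews.com/etfs/Large-Cap-Blend-ETFs/"
soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

symbols, links = [], []
# assumption: symbol cells look like <td class="bold"><a class="show" href="/etf/SPY/">...
for a in soup.select('td.bold a.show[href^="/etf/"]'):
    symbols.append(a["href"].strip("/").split("/")[-1])   # e.g. "SPY"
    links.append("https://etfdailynews.com" + a["href"])  # full URL to the fund page

print(symbols[:10])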
