BeautifulSoup unable to get data from mds-data-table on Morningstar - Python

I'm trying to get the dividend information from morningstar.
The following code works for scraping info from finviz, but the dividend information there doesn't match my broker platform, so I'm trying Morningstar instead.
import urllib3
from bs4 import BeautifulSoup

symbol = 'bxs'
morningstar_url = 'https://www.morningstar.com/stocks/xnys/' + symbol + '/dividends'
http = urllib3.PoolManager()
response = http.request('GET', morningstar_url)
soup = BeautifulSoup(response.data, 'lxml')

html = list(soup.children)[1]
[type(item) for item in list(soup.children)]

def display_elements(L, show=0):
    # Print (optionally) and return the direct children of a BeautifulSoup node.
    test = list(L.children)
    if show:
        for i in range(len(test)):
            print(i)
            print(test[i])
            print()
    return test

test = display_elements(html, 1)
I have no issue printing out the elements but cannot find the element that houses the information such as "Total Yield %" of 2.8%. How do I get inside the mds-data-table to extract the information?

Great question! I've actually worked on this specifically, but years ago. Morningstar only loads the tables after running a script, precisely to prevent this type of scraping. If you view the page source immediately on load, you won't see the table HTML at all.
What you're going to want to do is find the JavaScript code that loads the elements and point bs4 at that instead. You'll have to poke around the files, but somewhere deep in those js files you'll find a dynamic URL. It'll be hidden, but it'll be in there somewhere. I'll go look at some of my old code and see if I can find something that helps.
So here's an edited sample of what used to work for me:
import logging
import time
from urllib.request import urlopen

exchange = 'NYSE'
ticker = 'V'

if exchange == 'NYSE':
    exchange_code = "XNYS"
elif exchange in ["NasdaqNM", "NASDAQ"]:
    exchange_code = "XNAS"
else:
    # In the original this lived inside a method and bailed out with `return`
    logging.info("Unknown Exchange Code for {}".format(ticker))
    raise SystemExit

time_now = int(time.time())
time_delay = int(time.time() + 150)

morningstar_raw = urlopen(f'http://financials.morningstar.com/ajax/ReportProcess4HtmlAjax.html?&t={exchange_code}:{ticker}&region=usa&culture=en-US&cur=USD&reportType=is&period=12&dataType=A&order=asc&columnYear=5&rounding=3&view=raw&r=354589&callback=jsonp{time_now}&_={time_delay}')
print(morningstar_raw)
Granted, this solution is from a file last edited sometime in 2018, and they may have changed up their scripting since, but you can find this and much more in my GitHub project wxStocks.
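If hunting for the hidden endpoint proves too brittle, another option (not part of the original answer) is to let a headless browser run the page's JavaScript and then hand the rendered HTML to BeautifulSoup. A rough sketch with Selenium, assuming chromedriver is installed and that the rendered table still uses an mds-data-table class:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

driver.get('https://www.morningstar.com/stocks/xnys/bxs/dividends')
time.sleep(5)  # crude wait for the scripts to finish building the tables
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

# The selector below is an assumption based on the question; inspect the live page to confirm it.
for table in soup.select('table.mds-data-table'):
    for row in table.select('tr'):
        print([cell.get_text(strip=True) for cell in row.select('th, td')])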

Related

Sorting audiobooks on Audible.com by release date when using Python requests library

I am trying to reproduce the result of "Scraping and Exploring the Entire English Audible Catalog" by Toby Manders to add results for the books released after this article was published. The idea is to take Manders' dataset and add equivalent fields for all the new audiobooks in the past year or so, and to do that with as few http requests to Audible as possible. I'm using a different Python library than Manders, and Audible has also changed a bit since that piece was published.
The approach used by Manders of getting paged results of each category views is working so far, but my http request is not sorting the result by release date. Here is my code:
import requests
from bs4 import BeautifulSoup
base_url = 'https://www.audible.com/search?pf_rd_p=7fe4387b-4762-42a8-8d9a-a63254c74bb2&pf_rd_r=C7ENYKDADHMCH4KY12D4&ref=a_search_l1_feature_five_browse-bin_6&feature_six_browse-bin=9178177011&pageSize=50'
r = requests.get(base_url)
html = BeautifulSoup(r.text)
# get category list, and links
cat_tuples = []
for cat in html.find('div', {'class':'categories'}).find_all('li', {'class':'bc-list-item'}):
    a = cat.find('a')
    mytuple = (a.text, 'https://audible.com' + a['href'] + '&sort=pubdate-desc-rank')
    cat_tuples.append(mytuple)
# each tuple has a format like this ... ('Arts & Entertainment',
# 'https://audible.com/search?feature_six_browse-bin=9178177011&node=2226646011&pageSize=50&pf_rd_p=7fe4387b-4762-42a8-8d9a-a63254c74bb2&pf_rd_r=C7ENYKDADHMCH4KY12D4&ref=a_search_l1_feature_five_browse-bin_6&sort=pubdate-desc-rank')
#request first page of first category
r_page = requests.get(cat_tuples[0][1])
html_page = BeautifulSoup(r.text)
# results should start with '2Pac in the Studio' but instead it's 'Can't Hurt Me: Master Your Mind and Defy the Odds'
Adding sort=pubdate-desc-rank to the request URL appears to work in Chrome, but not with Python. I have tried changing the User Agent in my code as well, but that didn't work.
Note: I would describe Audible.com as generally unfriendly to scraping, but I don't see a clear prohibition against it. My interest is purely informational, and I do not seek to profit from gathering these results.
I took a fresh look at my code this morning and discovered that the solution to this one is a silly coding error on my part. I'm leaving it up in case anyone else has a similar issue. These lines of code:
#request first page of first category
r_page = requests.get(cat_tuples[0][1])
html_page = BeautifulSoup(r.text)
Should be as follows:
#request first page of first category
r_page = requests.get(cat_tuples[0][1])
html_page = BeautifulSoup(r_page.text)
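For what it's worth, a variant (not from the original post) that lets requests assemble the query string makes it easier to confirm the sort parameter is really being sent, and a browser-like User-Agent doesn't hurt either. The tracking parameters (pf_rd_p, pf_rd_r, ref) from the original URL are omitted here on the assumption that they are not required:
import requests
from bs4 import BeautifulSoup

params = {
    'feature_six_browse-bin': '9178177011',
    'pageSize': 50,
    'sort': 'pubdate-desc-rank',
}
headers = {'User-Agent': 'Mozilla/5.0'}
r_page = requests.get('https://www.audible.com/search', params=params, headers=headers)
html_page = BeautifulSoup(r_page.text, 'html.parser')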

How does one scrape all the products from a random website?

I tried to get all the products from this website, but somehow I don't think I chose the best method, because some of them are missing and I can't figure out why. It's not the first time I've gotten stuck on something like this.
The way I'm doing it now is like this:
go to the index page of the website
get all the categories from there (A-Z 0-9)
access each of the above categories and recursively go through all the subcategories from there until I reach the products page
when I reach the products page, check if the product has more SKUs. If it has, get the links. Otherwise, that's the only SKU.
Now, the below code works but it just doesn't get all the products and I don't see any reasons for why it'd skip some. Maybe the way I approached everything is wrong.
from lxml import html
from random import randint
from string import ascii_uppercase
from time import sleep
from requests import Session

INDEX_PAGE = 'https://www.richelieu.com/us/en/index'
session_ = Session()


def retry(link):
    wait = randint(0, 10)
    try:
        return session_.get(link).text
    except Exception as e:
        print('Retrying product page in {} seconds because: {}'.format(wait, e))
        sleep(wait)
        return retry(link)


def get_category_sections():
    # index sections A-Z (minus Q and Y) plus 0-9
    au = list(ascii_uppercase)
    au.remove('Q')
    au.remove('Y')
    au.append('0-9')
    return au


def get_categories():
    html_ = retry(INDEX_PAGE)
    page = html.fromstring(html_)
    sections = get_category_sections()
    for section in sections:
        for link in page.xpath("//div[@id='index-{}']//li/a/@href".format(section)):
            yield '{}?imgMode=m&sort=&nbPerPage=200'.format(link)


def dig_up_products(url):
    html_ = retry(url)
    page = html.fromstring(html_)

    # sub-categories
    for link in page.xpath(
        '//h2[contains(., "CATEGORIES")]/following-sibling::*[@id="carouselSegment2b"]//li//a/@href'
    ):
        yield from dig_up_products(link)

    # product links
    for link in page.xpath('//ul[@id="prodResult"]/li//div[@class="imgWrapper"]/a/@href'):
        yield link

    # pagination
    for link in page.xpath('//*[@id="ts_resultList"]/div/nav/ul/li[last()]/a/@href'):
        if link != '#':
            yield from dig_up_products(link)


def check_if_more_products(tree):
    more_prods = [
        all_prod
        for all_prod in tree.xpath("//div[@id='pm2_prodTableForm']//tbody/tr/td[1]//a/@href")
    ]
    if not more_prods:
        return False
    return more_prods


def main():
    for category_link in get_categories():
        for product_link in dig_up_products(category_link):
            product_page = retry(product_link)
            product_tree = html.fromstring(product_page)
            more_products = check_if_more_products(product_tree)
            if not more_products:
                print(product_link)
            else:
                for sku_product_link in more_products:
                    print(sku_product_link)


if __name__ == '__main__':
    main()
Now, the question might be too generic, but I wonder if there's a rule of thumb to follow when someone wants to get all the data (products, in this case) from a website. Could someone please walk me through the process of working out the best way to approach a scenario like this?
If your ultimate goal is to scrape the entire product listing for each category, it may make sense to target the full product listings for each category on the index page. This program uses BeautifulSoup to find each category on the index page and then iterates over each product page under each category. The final output is a list of namedtuples, each storing a category name together with the current page link and the full product titles for that link:
# note: this is Python 2 code (urllib.urlopen and print statements)
url = "https://www.richelieu.com/us/en/index"

import urllib
import re
from bs4 import BeautifulSoup as soup
from collections import namedtuple
import itertools

s = soup(str(urllib.urlopen(url).read()), 'lxml')
blocks = s.find_all('div', {'id': re.compile('index\-[A-Z]')})
results_data = {[c.text for c in i.find_all('h2', {'class':'h1'})][0]:[b['href'] for b in i.find_all('a', href=True)] for i in blocks}
final_data = []
category = namedtuple('category', 'abbr, link, products')

for category1, links in results_data.items():
    for link in links:
        page_data = str(urllib.urlopen(link).read())
        print "link: ", link
        page_links = re.findall(';page\=(.*?)#results">(.*?)</a>', page_data)
        if not page_links:
            final_page_data = soup(page_data, 'lxml')
            final_titles = [i.text for i in final_page_data.find_all('h3', {'class':'itemHeading'})]
            new_category = category(category1, link, final_titles)
            final_data.append(new_category)
        else:
            page_numbers = set(itertools.chain(*list(map(list, page_links))))
            full_page_links = ["{}?imgMode=m&sort=&nbPerPage=48&page={}#results".format(link, num) for num in page_numbers]
            for page_result in full_page_links:
                new_page_data = soup(str(urllib.urlopen(page_result).read()), 'lxml')
                final_titles = [i.text for i in new_page_data.find_all('h3', {'class':'itemHeading'})]
                new_category = category(category1, link, final_titles)
                final_data.append(new_category)

print final_data
The output will garner results in the format:
[category(abbr=u'A', link='https://www.richelieu.com/us/en/category/tools-and-shop-supplies/workshop-accessories/tool-accessories/sander-accessories/1058847', products=[u'Replacement Plate for MKT9924DB Belt Sander', u'Non-Grip Vacuum Pads', u'Sandpaper Belt 2\xbd " x 14" for Compact Belt Sander PC371 or PC371K', u'Stick-on Non-Vacuum Pads', u'5" Non-Vacuum Disc Pad Hook-Face', u'Sanding Filter Bag', u'Grip-on Vacuum Pads', u'Plates for Non-Vacuum (Grip-On) Dynabug II Disc Pads - 7.62 cm x 10.79 cm (3" x 4-1/4")', u'4" Abrasive for Finishing Tool', u'Sander Backing Pad for RO 150 Sander', u'StickFix Sander Pad for ETS 125 Sander', u'Sub-Base Pad for Stocked Sanders', u'(5") Non-Vacuum Disc Pad Vinyl-Face', u'Replacement Sub-Base Pads for Stocked Sanders', u"5'' Multi-Hole Non-Vaccum Pad", u'Sander Backing Pad for RO 90 DX Sander', u'Converting Sanding Pad', u'Stick-On Vacuum Pads', u'Replacement "Stik It" Sub Base', u'Drum Sander/Planer Sandpaper'])....
To access each attribute, call like so:
categories = [i.abbr for i in final_data]
links = [i.link for i in final_data]
products = [i.products for i in final_data]
I believe the benefit of using BeautifulSoup in this instance is that it provides a high level of control over the scraping and is easily modified. For instance, should the OP change his mind about which facets of the product/index he would like to scrape, only simple changes to the find_all parameters should be needed, since the general structure of the code above centers on each product category from the index page.
First of all, there is no definite answer to your generic question of how one would know whether the data already scraped is all the available data. That is, at the very least, website-specific and rarely revealed. Plus, the data itself might be highly dynamic. On this website, though, you can more or less use the product counters to verify the number of results found.
Your best bet here would be to debug: use the logging module to record information while scraping, then analyze the logs and look for why a product went missing and what caused that.
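For example, a minimal logging setup (illustrative only, not from the answer) that records every category page visited and every product link yielded makes it much easier to see where items are being dropped:
import logging

logging.basicConfig(
    filename='scrape.log',
    level=logging.DEBUG,
    format='%(asctime)s %(levelname)s %(message)s',
)

# then, e.g. inside get_categories(), dig_up_products() and main():
# logging.debug('Visiting category page: %s', url)
# logging.debug('Yielding product link: %s', link)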
Some of the ideas I currently have:
could retry() be the problematic part - could it be that session_.get(link).text does not raise an error, yet also does not contain the actual data in the response?
I think the way you extract category links is correct and I don't see you missing categories on the index page
the dig_up_products() is questionable: when you extract links to the subcategories, you have this carouselSegment2b id used in the XPath expression, but I see that on at least some of the pages (like this one) the id value is carouselSegment1b. In any case, I would probably do just //h2[contains(., "CATEGORIES")]/following-sibling::div//li//a/@href here
I also don't like that imgWrapper class used to find a product link (could it be that products missing images are missed?). Why not just //ul[@id="prodResult"]/li//a/@href - this would bring in some duplicates, which you can address separately (see the sketch below). But you can also look for the link in the "info" section of the product container: //ul[@id="prodResult"]/li//div[contains(@class, "infoBox")]//a/@href.
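As an illustration of that broader XPath with the duplicates filtered out (my sketch, not code from the answer), the product-link part of dig_up_products() could look like this, where page is the lxml tree already built in that function:
def yield_product_links(page):
    # broader product-link XPath, with duplicates filtered out
    seen = set()
    for link in page.xpath('//ul[@id="prodResult"]/li//a/@href'):
        if link not in seen:
            seen.add(link)
            yield link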
There can also be an anti-bot, anti-web-scraping strategy deployed that may temporarily ban your IP and/or User-Agent, or even obfuscate the response. Check for that too.
As pointed out by @mzjn and @alecxe, some websites employ anti-scraping measures. To hide their intentions, scrapers should try to mimic a human visitor.
One particular way for websites to detect a scraper is to measure the time between subsequent page requests, which is why scrapers typically keep a (random) delay between requests.
Besides, hammering a web server that is not yours without giving it some slack is not considered good netiquette.
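A minimal sketch of that idea with plain requests (my own illustration; the URL list, User-Agent string and delay range are placeholders):
import random
import time
import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'}  # browser-like User-Agent

urls = ['https://example.com/page1', 'https://example.com/page2']  # placeholder link list

for url in urls:
    response = requests.get(url, headers=headers)
    # ... process the response here ...
    time.sleep(random.uniform(1.0, 3.0))  # randomized pause between requests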
From Scrapy's documentation:
RANDOMIZE_DOWNLOAD_DELAY
Default: True
If enabled, Scrapy will wait a random amount of time (between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY) while fetching requests from the same website.
This randomization decreases the chance of the crawler being detected (and subsequently blocked) by sites which analyze requests looking for statistically significant similarities in the time between their requests.
The randomization policy is the same used by wget --random-wait option.
If DOWNLOAD_DELAY is zero (default) this option has no effect.
Oh, and make sure the User-Agent string in your HTTP request resembles that of an ordinary web browser.
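If you end up using Scrapy, both the delay and the User-Agent are configured in the project's settings.py; a minimal example with illustrative values:
# settings.py (illustrative values)
DOWNLOAD_DELAY = 2                   # base delay between requests, in seconds
RANDOMIZE_DOWNLOAD_DELAY = True      # the default; varies the delay between 0.5x and 1.5x
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64)'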
Further reading:
https://exposingtheinvisible.org/guides/scraping/
Sites not accepting wget user agent header

Web scraping every forum post (Python, Beautifulsoup)

Hello once again, fellow stack'ers. Short description: I am web scraping some data from an automotive forum using Python and saving all the data into CSV files. With some help from other Stack Overflow members I managed to get as far as mining through all the pages for a certain topic, gathering the dates, title and link for each post.
I also have a separate script I am now struggling to integrate (for every link found, Python creates a new soup for it, scrapes through all the posts and then goes back to the previous link).
I would really appreciate any other tips or advice on how to make this better, as it's my first time working with Python. I think it might be my nested loop logic that's messed up, but after checking it multiple times it still seems right to me.
Here's the code snippet:
link += (div.get('href'))
savedData += "\n" + title + ", " + link
tempSoup = make_soup('http://www.automotiveforums.com/vbulletin/' + link)

while tempNumber < 3:
    for tempRow in tempSoup.find_all(id=re.compile("^td_post_")):
        for tempNext in tempSoup.find_all(title=re.compile("^Next Page -")):
            tempNextPage = ""
            tempNextPage += (tempNext.get('href'))
        post = ""
        post += tempRow.get_text(strip=True)
        postData += post + "\n"
    tempNumber += 1
    tempNewUrl = "http://www.automotiveforums.com/vbulletin/" + tempNextPage
    tempSoup = make_soup(tempNewUrl)
    print(tempNewUrl)

tempNumber = 1
number += 1
print(number)
newUrl = "http://www.automotiveforums.com/vbulletin/" + nextPage
soup = make_soup(newUrl)
My main issue with it so far is that tempSoup = make_soup('http://www.automotiveforums.com/vbulletin/' + link) does not seem to create a new soup after it has finished scraping all the posts for a forum thread.
This is the output I'm getting :
http://www.automotiveforums.com/vbulletin/showthread.php?s=6a2caa2b46531be10e8b1c4acb848776&t=1139532&page=2
http://www.automotiveforums.com/vbulletin/showthread.php?s=6a2caa2b46531be10e8b1c4acb848776&t=1139532&page=3
1
So it does seem to find the correct links for new pages and scrape them; however, for the next iteration it prints the new dates AND the same exact pages. There's also a really weird 10-12 second delay after the last link is printed, and only then does it hop down to print the number 1 and then churn out all the new dates.
But after moving on to the next forum thread's link, it scrapes the same exact data every time.
Sorry if it looks really messy; it is sort of a side project and my first attempt at doing something useful, so I am very new at this. Any advice or tips would be much appreciated. I'm not asking you to solve the code for me; even some pointers toward my possibly flawed logic would be greatly appreciated!
So after spending a little bit more time, I have managed to ALMOST crack it. It's now at the point where Python finds every thread and its link on the forum, then goes to each link, reads all the pages and continues on to the next link.
Here is the fixed code, in case anyone can make use of it.
link += (div.get('href'))
savedData += "\n" + title + ", " + link
soup3 = make_soup('http://www.automotiveforums.com/vbulletin/' + link)

while tempNumber < 4:
    for postScrape in soup3.find_all(id=re.compile("^td_post_")):
        post = ""
        post += postScrape.get_text(strip=True)
        postData += post + "\n"
        print(post)

    for tempNext in soup3.find_all(title=re.compile("^Next Page -")):
        tempNextPage = ""
        tempNextPage += (tempNext.get('href'))
        print(tempNextPage)

    soup3 = ""
    soup3 = make_soup('http://www.automotiveforums.com/vbulletin/' + tempNextPage)
    tempNumber += 1

tempNumber = 1
number += 1
print(number)
newUrl = "http://www.automotiveforums.com/vbulletin/" + nextPage
soup = make_soup(newUrl)
All I had to do was separate the 2 for loops that were nested within each other into their own loops. Still not a perfect solution, but hey, it ALMOST works.
The non-working bit: the first 2 threads of the provided link have multiple pages of posts; the following 10+ threads do not. I cannot figure out a way to check the result of for tempNext in soup3.find_all(title=re.compile("^Next Page -")): outside of the loop to see whether it's empty or not, because if it does not find a next-page element / href, it just reuses the last one. But if I reset the value after each run, it no longer mines each page =l A solution that just created another problem :D.
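One way around that (my sketch, not from the original post) is to drop the second for loop and look the "Next Page" link up once with find(), inside the while loop, so you can bail out when it isn't there:
next_link = soup3.find(title=re.compile("^Next Page -"))
if next_link is None:
    break  # no further pages in this thread; leave the while loop
tempNextPage = next_link.get('href')
soup3 = make_soup('http://www.automotiveforums.com/vbulletin/' + tempNextPage)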
Many thanks, dear Norbis, for sharing your ideas, insights and concepts.
Since you offer only a snippet, I'll just try to provide an approach that shows how to log in to a phpBB board using a payload:
import requests

forum = "the forum name"  # base URL of the phpBB board, e.g. 'https://forum.example.com/'
headers = {'User-Agent': 'Mozilla/5.0'}
payload = {'username': 'username', 'password': 'password', 'redirect': 'index.php', 'sid': '', 'login': 'Login'}

session = requests.Session()
r = session.post(forum + "ucp.php?mode=login", headers=headers, data=payload)
print(r.text)
But wait: instead of manipulating the website with requests, we can also make use of browser automation, such as mechanize offers.
This way we don't have to manage the session ourselves and only need a few lines of code to craft each request.
An interesting example is on GitHub: https://github.com/winny-/sirsi/blob/317928f23847f4fe85e2428598fbe44c4dae2352/sirsi/sirsi.py#L74-L211
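For completeness, a rough sketch of what the mechanize variant might look like, reusing the forum base URL from the snippet above; the form index and field names are assumptions based on a stock phpBB login page:
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-Agent', 'Mozilla/5.0')]

br.open(forum + 'ucp.php?mode=login')
br.select_form(nr=0)  # assumes the login form is the first form on the page
br['username'] = 'username'
br['password'] = 'password'
br.submit()
print(br.response().read())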

Bioinformatics : Programmatic Access to the BacDive Database

The resource at BacDive (http://bacdive.dsmz.de/) is a highly useful database for accessing bacterial knowledge, such as strain information, species information and parameters like growth temperature optima.
I have a scenario in which I have a set of organism names in a plain text file, and I would like to programmatically search them one by one against the BacDive database (which doesn't allow a flat file to be downloaded), retrieve the relevant information and populate my text file accordingly.
What are the main modules (such as BeautifulSoup) that I would need to accomplish this? Is it straightforward? Is it allowed to programmatically access webpages? Do I need permission?
An example bacterium name would be "Pseudomonas putida". Searching for it gives 60 hits on BacDive. Clicking one of the hits takes us to the specific page, where the line "Growth temperature: [Ref.: #27] Recommended growth temperature : 26 °C" is the most important.
The script would have to access BacDive (which I have tried accessing using requests, but I feel they do not allow programmatic access; I asked a moderator about this, and they said I should register for their API first).
I now have API access. This is the page: http://www.bacdive.dsmz.de/api/bacdive/. This may seem quite simple to people who do HTML scraping, but I am not sure what to do now that I have access to the API.
Here is the solution...
import re
import urllib
from bs4 import BeautifulSoup


def get_growth_temp(url):
    soup = BeautifulSoup(urllib.urlopen(url).read())
    no_hits = int(map(float, re.findall(r'[+-]?[0-9]+', str(soup.find_all("span", class_="searchresultlayerhits"))))[0])
    if no_hits > 1:
        letters = soup.find_all("li", class_="searchresultrow1") + soup.find_all("li", class_="searchresultrow2")
        all_urls = []
        for i in letters:
            all_urls.append('http://bacdive.dsmz.de/index.php' + i.a["href"])
        max_temp = []
        for ind_url in all_urls:
            soup = BeautifulSoup(urllib.urlopen(ind_url).read())
            a = soup.body.findAll(text=re.compile('Recommended growth temperature :'))
            if a:
                max_temp.append(int(map(float, re.findall(r'[+-]?[0-9]+', str(a)))[0]))
        print "Recommended growth temperature : %d °C:\t" % max(max_temp)


url = 'http://bacdive.dsmz.de/index.php?search=Pseudomonas+putida'

if __name__ == "__main__":
    # To open the file and then iterate through the urls/bacteria:
    # with open('file.txt', 'rU') as f:
    #     for url in f:
    #         get_growth_temp(url)
    get_growth_temp(url)
Edit:
Here I am passing a single URL. If you want to pass multiple URLs and get their growth temperatures, call get_growth_temp(url) while iterating over the file; that code is commented out above.
Hope it helped you.
Thanks
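Since the question mentions that API access has now been granted, a leaner route may be to query the API directly with requests instead of scraping the search pages. A rough, untested sketch; the endpoint path, credentials and response fields are assumptions, so take the real ones from the documentation at http://www.bacdive.dsmz.de/api/bacdive/:
import requests

API_BASE = 'http://www.bacdive.dsmz.de/api/bacdive/'

session = requests.Session()
session.auth = ('your_username', 'your_password')  # credentials from the API registration
session.headers.update({'Accept': 'application/json'})

# Hypothetical endpoint path; consult the API docs for the real URL scheme.
response = session.get(API_BASE + 'taxon/Pseudomonas/putida/')
response.raise_for_status()
print(response.json())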

How to retrieve google URL from search query

So I'm trying to create a Python script that will take a search term or query, then search Google for that term. It should then return 5 URLs from the results of the search.
I spent many hours trying to get PyGoogle to work. But later found out Google no longer supports the SOAP API for search, nor do they provide new license keys. In a nutshell, PyGoogle is pretty much dead at this point.
So my question here is... What would be the most compact/simple way of doing this?
I would like to do this entirely in Python.
Thanks for any help
Use BeautifulSoup and requests to get the links from the google search results
import requests
from bs4 import BeautifulSoup
keyword = "Facebook" #enter your keyword here
search = "https://www.google.co.uk/search?sclient=psy-ab&client=ubuntu&hs=k5b&channel=fs&biw=1366&bih=648&noj=1&q=" + keyword
r = requests.get(search)
soup = BeautifulSoup(r.text, "html.parser")
container = soup.find('div',{'id':'search'})
url = container.find("cite").text
print(url)
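If you need the five URLs the question asks for rather than just the first one, a small extension of the snippet above (my sketch; Google's result markup changes frequently, so the cite selector is an assumption) could be:
links = []
for cite in container.find_all('cite'):
    links.append(cite.text)
    if len(links) == 5:
        break
print(links)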
What issues are you having with pygoogle? I know it is no longer supported, but I've utilized that project on many occasions and it would work fine for the menial task you have described.
Your question did make me curious though--so I went to Google and typed "python google search". Bam, found this repository. Installed with pip and within 5 minutes of browsing their documentation got what you asked:
import google
for url in google.search("red sox", num=5, stop=1):
    print(url)
Maybe try a little harder next time, ok?
Here is the link to the xgoogle library, which does the same.
I tried something similar to get the top 10 links, which also counts how often the target word occurs in each linked page. I have added the code snippet for your reference:
import operator
import urllib
# This line will import the GoogleSearch and SearchError classes from xgoogle/search.py
from xgoogle.search import GoogleSearch, SearchError

my_dict = {}
print "Enter the word to be searched : "
# read user input
yourword = raw_input()

try:
    # This will perform a google search on our keyword
    gs = GoogleSearch(yourword)
    gs.results_per_page = 80
    # get google search results
    results = gs.get_results()
    source = ''
    # loop through all results to get each link and its content
    for res in results:
        # print res.url.encode('utf8')
        # this will give the url
        parsedurl = res.url.encode("utf8")
        myurl = urllib.urlopen(parsedurl)
        # the line above opens the url; in the line below we read the content of that web page
        source = myurl.read()
        # This line will count occurrences of the entered keyword in our webpage
        count = source.count(yourword)
        # We store our result in a dictionary data structure: for each url, we store its word count
        my_dict[parsedurl] = count
except SearchError, e:
    print "Search failed: %s" % e

print my_dict
# sorted_x = sorted(my_dict, key=lambda x: x[1])
for key in sorted(my_dict, key=my_dict.get, reverse=True):
    print(key, my_dict[key])
