BeautifulSoup returning 'None' when element definitely exists

First, I apologize if I'm missing something simple. I've looked at many questions and cannot figure this out for the life of me.
Basically, the website I'm trying to gather text from is here:
https://www.otcmarkets.com/stock/MNGG/overview
I want to pull the label on the side that says 'Dark or Defunct'. My current code is as follows:
import requests
from bs4 import BeautifulSoup
url = 'https://www.otcmarkets.com/stock/MNGG/overview'
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
ticker = soup.find('href', 'Dark or Defunct')
But as the title says, it just returns None. Where am I going wrong? I'm quite inexperienced, so I'd love an explanation if possible.

It's returning None because the text is not in the HTML page source. Everything on that website is loaded dynamically by JavaScript. BeautifulSoup is designed to pull data out of HTML and XML files, and the HTML that requests receives contains no mention of "Dark or Defunct" (so BeautifulSoup correctly finds nothing). You'll need to use a scraping method that supports JavaScript. See Web-scraping JavaScript page with Python.
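
A minimal sketch of why the search comes back empty, assuming two made-up HTML strings: one standing in for what requests receives from a script-driven page, and one for the DOM a browser shows after the scripts have run. The markup, the href value, and the helper name find_label are illustrative assumptions, not the real structure of the OTC Markets page.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical: the HTML a plain HTTP request receives (an empty shell plus a script).
STATIC_SHELL = (
    '<html><body><div id="root"></div>'
    '<script src="/app.js"></script></body></html>'
)

# Hypothetical: the DOM after a browser has executed the scripts.
RENDERED_DOM = (
    '<html><body><div id="root">'
    '<a href="/example-path">Dark or Defunct</a>'
    "</div></body></html>"
)


def find_label(html, label="Dark or Defunct"):
    """Return the first text node containing `label`, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find(string=re.compile(re.escape(label)))
    return None if node is None else str(node).strip()


print(find_label(STATIC_SHELL))  # None
print(find_label(RENDERED_DOM))  # Dark or Defunct
```

Separately, soup.find('href', 'Dark or Defunct') would not match even on rendered HTML: the first argument is a tag name, and a string passed as the second argument is treated as a CSS class, while href is an attribute of an <a> tag rather than a tag of its own.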

Related

How can I fix the find_all error while web scraping?

I'm kind of a newbie in the data world, so I tried to use bs4 and requests to scrape data from trending YouTube videos. I tried the soup.find_all() method and printed the result to see whether it works, but it gives me an empty list. Can you help me fix it? Click here to see the specific part of the HTML code.
from bs4 import BeautifulSoup
import requests
r = requests.get("https://www.youtube.com/feed/explore")
soup = BeautifulSoup(r.content, "lxml")
soup.prettify()
trendings = soup.find_all("ytd-video-renderer", attrs={"class": "style-scope ytd-expanded-shelf-contents-renderer"})
print(trendings)

This webpage is dynamic and relies on scripts to load its data. When you make a request using requests.get("https://www.youtube.com/feed/explore"), you get the initial source file, which only contains things like head, meta, and script tags. In a real browser, you would have to wait until the scripts load the data from the server. BeautifulSoup does not see changes that JavaScript makes to the DOM. That's why soup.find_all("ytd-video-renderer",attrs = {"class":"style-scope ytd-expanded-shelf-contents-renderer"}) gives you an empty list: the downloaded HTML has no ytd-video-renderer tag and no style-scope ytd-expanded-shelf-contents-renderer class.
For dynamic webpages, I think you should use Selenium (or maybe Scrapy, which needs a JavaScript rendering add-on because it does not run scripts by itself).
For YouTube, you can use its API as well.
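
A rough sketch of the Selenium route, assuming Selenium 4 with a local Chrome install. The helper names render_page and count_tags are made up for illustration, and whether ytd-video-renderer elements actually appear depends on what YouTube serves (a consent or login page could be shown instead).

```python
from bs4 import BeautifulSoup


def render_page(url, wait_css, timeout=15):
    """Load `url` in headless Chrome and return the HTML once `wait_css` matches."""
    # Imported lazily so count_tags() below can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the scripts have inserted at least one matching element.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        driver.quit()


def count_tags(html, tag_name):
    """Count how many `tag_name` elements BeautifulSoup finds in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all(tag_name))


# Usage (opens a real browser, so it is left commented out):
# html = render_page("https://www.youtube.com/feed/explore", "ytd-video-renderer")
# print(count_tags(html, "ytd-video-renderer"))
```

The point of the split is that BeautifulSoup still does the parsing; Selenium only supplies HTML that already contains the script-generated elements.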

Nested tags are not accessible with beautifulsoup

I have been trying for a while to extract specific data from different websites using BeautifulSoup, and it doesn't seem to work for deeply nested tags.
I have tried it like this:
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
data = soup.find(["g"], class_="highcharts-series-group")
I also tried navigating from a "high parent" tag, down to the nested tag which I want to access, but at some point the tags are not found anymore. This problem occurred on several websites.
I don't understand why I get an empty list or None when I search for a tag with a specific ID or class that definitely exists in the HTML. My only guess is that the website is blocking this information.
Thanks for any advice :)

Webscraping table with multiple pages using BeautifulSoup

I'm trying to scrape this webpage https://www.whoscored.com/Statistics using BeautifulSoup in order to obtain all the information in the player statistics table. I'm having a lot of difficulty and was wondering if anyone would be able to help me.
import requests
from bs4 import BeautifulSoup
url = 'https://www.whoscored.com/Statistics'
html = requests.get(url).content
soup = BeautifulSoup(html, "lxml")
text = [element.text for element in soup.find_all('div', {'id': "statistics-table-summary"})]
My problem lies in the fact that I don't know what the correct tag is to obtain that table. Also the table has several pages and I would like to scrape every single one. The only indication I've seen of a change of page in the table is the number in the code below:
<div id="statistics-table-summary" class="" data-fwsc="11">

It looks to me like that site loads its data using JavaScript. In order to grab the data, you'll have to mimic how a browser loads a page; the requests library isn't enough. I'd recommend taking a look at a tool like Selenium, which uses a "robotic browser" to load the page. After the page is loaded, you can then use BeautifulSoup to retrieve the data you need.
Here's a link to a helpful tutorial from RealPython.
Good luck!
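
As a sketch of that last step, assuming the rendered HTML has already been obtained from a browser-driven tool: the sample markup below is made up (only the statistics-table-summary id comes from the question), and table_rows is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the HTML a browser would hold after the scripts have run.
RENDERED = """
<div id="statistics-table-summary">
  <table>
    <thead><tr><th>Player</th><th>Rating</th></tr></thead>
    <tbody>
      <tr><td>Player A</td><td>7.9</td></tr>
      <tr><td>Player B</td><td>7.5</td></tr>
    </tbody>
  </table>
</div>
"""


def table_rows(html, container_id):
    """Return every table row inside the div with `container_id` as a list of cell texts."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id=container_id)
    if container is None:
        return []  # the container is missing, e.g. the HTML was not rendered
    rows = []
    for tr in container.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


for row in table_rows(RENDERED, "statistics-table-summary"):
    print(row)
```

For a table with several pages, one possible approach is to have the browser tool click the table's next-page control and run the same parsing on the updated page source each time; the control's selector would have to be read from the live page.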

How to read context from hyperlink on website

I'm looking for a method to read context from a hyperlink on a website. Is it possible?
For example:
website = "WEBSITE"
openwebsite = urllib2.urlopen(website)
hyperlink = _some_method_to_find_hyperlink(openwebsite)
get_context_from_hyper(hyperlink)
I was searching in Beautiful Soup, but I couldn't find anything useful.
I think I could loop to find the relevant hyperlink and use urllib2 again, but the website is quite large, and it would take ages.

You could try the Beautiful Soup package, which lets you parse HTML and extract any tag you might be looking for.
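
For instance, a small sketch that collects every hyperlink from a page, written for Python 3 (where urllib2 no longer exists). The sample HTML, the example.com URLs, and the helper name extract_links are assumptions for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return (link text, absolute URL) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), urljoin(base_url, a["href"]))
        for a in soup.find_all("a", href=True)
    ]


# Made-up page content; in practice this would be the downloaded HTML.
SAMPLE = (
    '<p><a href="/docs">Docs</a> '
    "<a>no target</a> "
    '<a href="https://example.org/x">External</a></p>'
)

for text, url in extract_links(SAMPLE, "https://example.com/index.html"):
    print(text, url)
```

Each returned URL can then be fetched in turn, and filtering the pairs by link text first avoids downloading pages that are not relevant.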

BeautifulSoup4: Missing Parsed Table Data

I'm trying to extract the Earnings Per Share data through BeautifulSoup 4 from this page.
When I parse the data, the table information is missing with the default, lxml, and html5lib parsers. I believe this has something to do with JavaScript, and I have been trying to implement PyV8 to transform the script into readable HTML for BS4. The problem is I don't know where to go from here.
Do you know if this is in fact my issue? I have been reading many posts, and it's been a very big headache for me today. Below is a quick example. The financeWrap div should include the table information, but BeautifulSoup shows that it is empty.
import requests
from bs4 import BeautifulSoup
url = "http://financials.morningstar.com/ratios/r.html?t=AAPL&region=usa&culture=en-US"
response = requests.get(url)
soup_key_ratios = BeautifulSoup(response.content, 'html5lib')
financial_tables = soup_key_ratios.find("div", {"id": "financeWrap"})
print(financial_tables)
# Output: <div id="financeWrap">
# </div>

The issue is that you're trying to get data that comes in through Ajax on the website. If you go to the link you provided and look at the page source in the browser, you'll see that the data is not there.
However, if you use a developer console such as Firebug, you will see that Ajax requests are made to the following URL, which is something you can parse via BeautifulSoup (perhaps; I haven't tried it or looked at the structure of the data).
Keep in mind that this is quite possibly against the website's ToS.
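
A hedged sketch of that approach: fetch the endpoint seen in the browser's network tab and check whether its response carries the table. The endpoint URL is deliberately left as a parameter, the fragment below is made up, and the helper names are hypothetical; the real response might be JSON rather than HTML, in which case it would need different parsing.

```python
import requests
from bs4 import BeautifulSoup


def fetch_fragment(ajax_url, timeout=10):
    """Download the body of an Ajax endpoint found in the browser's network tab."""
    response = requests.get(ajax_url, timeout=timeout)
    response.raise_for_status()
    return response.text


def count_cells(html):
    """Count table cells, as a quick check for whether the data is present."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all(["td", "th"]))


# What the request in the question returned: an empty wrapper.
STATIC_PAGE = '<div id="financeWrap">\n</div>'
# Made-up fragment standing in for an Ajax response that carries the table.
AJAX_FRAGMENT = "<table><tr><th>Metric</th><td>value</td></tr></table>"

print(count_cells(STATIC_PAGE))    # 0
print(count_cells(AJAX_FRAGMENT))  # 2
```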
