Beautiful Soup not picking up some data form the website - python

I have been trying to scrape some data using beautiful soup from https://www.eia.gov/coal/markets/. However when I parse the contents some of the data does not show up at all. Those data fields are visible in chrome inspector but not in the soup. The thing is they do not seem to be text elements. I think they are fed using an external database. I have attached the screenshots below. Is there any other way to scrape that data?
Thanks in advance.
Google inspector:
Beautiful soup parsed content:

#DMart is correct. The data you are looking for is being populated by Javascript, have a look at line 1629 in the page source. Beautiful soup doesn't act as a client browser so there is nowhere for the JS to execute. So it looks like selenium is your best bet.
See This thread for more information.

Not enough detail in your question but this information is probably dynamically loaded and you're not fetching the entire page source.
Without your code it's tough to see if you're using selenium to do it (you tagged this questions as such) which may indicate you're using page_source which does not guarantee you the entire completed source of the page you're looking at.
If you're using requests its even more unlikely you're capturing the entire page's completed source code.

The data is loaded via ajax, so it is not available in the initial document. If you go to the networking tab in chrome dev tools you will see that the site reaches out to https://www.eia.gov/coal/markets/coal_markets_json.php. I searched for some of the numbers in the response and it looks like the data you are looking for is there.
This is a direct json response from the backend. Its better than selenium if you can get it to work.

Thanks you all!
Opening the page using selenium using a webdriver and then parsing the page source using beautiful soup worked.
webdriver.get('https://www.eia.gov/coal/markets/')
html=webdriver.page_source
soup=BS(html)
table=soup.find("table",{'id':'snl_dpst'})
rows=table.find_all("tr")

Related

Beautiful Soup object's HTML, does not match HTML from the web browser

I am scraping the links from this website https://www.firstmallorca.com/en/search, for each of the properties that appear on it, so I can further scrape them and collect more detailed data.
My problem is that the parsed HTML(I am using html5lib parser) from which I scrape the data seems to be different in some areas with respect to the HTML which I see on the browser's DevTool. To demonstrate this:
1.This is the last link I select. On the browser, its href="/en/sales/penthouse-in-santa-ponsa/102512"
1.Image
2.I print the parsed HTML from the Beautiful Soup Object from the webpage with bs4Object.prettfy() and I copy the whole output into notepad++.
3.Then, in the notepad I look for the same element as in point 1. I find it and the href="/en/sales/finca-in-portocolom/159515", which is different from what I see on the actual webpage.3.Image
I do not understand the nature of what's happening. On point 3, I was expecting to see href="/en/sales/penthouse-in-santa-ponsa/102512" instead of href="/en/sales/finca-in-portocolom/159515".
It seems to me like I am doing the scraping on other similar webpage, though not the one I see through the browser.
The website loads content via javascript, which your parser does not execute.
This is a task for selenium.
The selenium package is used to automate the interaction with the web browser from Python.

Url request does not parse every information in HTML using Python

I am trying to extract information from an exchange website (chiliz.net) using Python (requests module) and the following code:
data = requests.get(url,time.sleep(15)).text
I used time.sleep since the website is not directly connecting to the exchange main page, but I am not sure it is necessary.
The things is that, I cannot find anything written under <body style> in the HTML text (which is the data variable in this case). How can I reach the full HTML code and then start to extract the price information from this website?
I know Python, but not familiar with websites/HTML that much. So I would appreciate if you explain the website related info like you are talking to a beginner. Thanks!
There could be a few reasons for this.
The website runs behind a proxy server from what I can tell, so this does interfere with your request loading time. This is why it's not directly connecting to the main page.
It might also be the case that the elements are rendered using javascript AFTER the page has loaded. So, you only get the page and not the javascript rendered parts. You can try to increase your sleep() time but I don't think that will help.
You can also use a library called Selenium. It simply automates browsers and you can use the page_source property to obtain the HTML source code.
Code (taken from here)
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("http://example.com")
html_source = browser.page_source
With selenium, you can also set the XPATH to obtain the data of -' extract the price information from this website'; you can see a tutorial on that here. Alternatively,
once you extract the HTML code, you can also use a parser such as bs4 to extract the required data.

Python Requests only pulling half of intented tags

I'm attempting to scrape a website, and pull each sheriff's name and county. I'm using devtools in chrome to identify the HTML tag needed to locate that information.
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
URL = 'https://oregonsheriffs.org/about-ossa/meet-your-sheriffs'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
sheriff_names = soup.find_all('a', class_ = 'eg-sheriff-list-skin-element-1')
sheriff_counties = soup.find_all(class_ = 'eg-sheriff-list-skin-element-2')
However, I'm finding that Requests is not pulling the entire page's html, even though the tag is at the end. If I scan page.content, I find that Sheriff Harrold is the last sheriff included, and that every sheriff from curtis landers onwards is not included (I tried pasting the full output of page.contents but it's too long).
My best guess from reading this answer is that the website has javascripts that load the remaining part of the page upon interacting with it, which would imply that I need to use something like Selenium to interact with the page to get the rest of it to first load.
However, if you look at the website, it's very simple, so as a novice part of me is thinking that there has to be a way to scrape this basic website without using a more complex tool like Selenium. That said, I recognize that the website is wordpress generated and wordpress can set up delayed javascripts on even simple web sites.
My questions are:
1) do I really need to use Selenium to scrape a simple, word-press generated website like this? Or is there a way to get the full page to load with just Requests? Is there anyway to tell when web pages will require a web driver and when Requests will not be enough?
2) I'm thinking one step ahead here - if I want to scale up this project, how would I be able to tell that Requests has not returned the full website, without manually inspecting the results of every website?
Thanks!
Unfortunately, your initial instinct is almost certainly correct. If you look at the page source it seems that they have some sort of lazy loading going on, pulling content from an external source.
A quick look at the page source indicates that they're probably using the "Essential Grid" WordPress theme to do this. I think this supports preloading. If you look at the requests that are made you might be able to ascertain how it's loading this and pull directly from that source (perhaps a REST call, AJAX, etc).
In a generalized sense, I'm afraid that there really isn't any automated way to programmatically determine if a page has 'fully' loaded, as that behavior is defined in code and can be triggered by anything.
If you want to capture information from pages that load content as you scroll, though, I believe Selenium is the tool you'll have to use.

Scraping data from complex website (hidden content)

I am just starting with web scraping and unfortunately, I am facing a showstopper: I would like pull some financial data but it seems that the website is quite complex (dynamic content etc.).
Data I would like pull
Website:
https://www.de.vanguard/web/cf/professionell/de/produktart/detailansicht/etf/9527/EQUITY/performance
So far, I've used Beautiful Soup to get this done. However, I cannot even find the table. Any ideas?
Look into using selenium to launch an automated web browser. This loads the web page and it's associated dynamic content, as well as allow you the option to 'click' on certain web elements to load content that may be generated on_click. You can use this in tandem with BeautifulSoup by passing driver.page_source to BeautifulSoup and parsing through it that way.
This SO answer provides a basic example that would serve as a good starting point: Python WebDriver how to print whole page source (html)

How to access the subtags within a tag using beautifulsoup in python?

I am attempting to retrieve player statistics from MLB.com for the 2016 season. I am using Beautiful Soup in Python, and I need to extract the information in the table seen here:
http://mlb.mlb.com/stats/sortable.jsp#elem=%5Bobject+Object%5D&tab_level=child&click_text=Sortable+Player+hitting&game_type='R'&season=2016&season_type=ANY&league_code='MLB'&sectionType=sp&statType=hitting&page=1&ts=1493672037085&playerType=ALL&sportCode='mlb'&split=&team_id=&active_sw=&position=&page_type=SortablePlayer&sortOrder='desc'&sortColumn=ab&results=&perPage=442&timeframe=&last_x_days=&extended=0
Here is what I have attempted:
r=requests.get(url)
soup=BeautifulSoup(r.content,'html.parser')
gdata=soup.find_all('div',{'id':'datagrid'})
print(gdata)
This should return all of the subtags within the tag, but it does not. This results in the following:
[<div id="datagrid"></div>]
Can anyone explain why this is not producing the contents of the table? Furthermore, what can I do to access the contents of the table?
Thanks
If you look at the source for the webpage, it looks like the datagrid div is actually empty & the stats are inserted dynamically as json from this URL. Maybe you can use that instead. To figure this out I looked at the page source to see that the div had no children and then used Chrome developer tools Network tab to find the request where it pulled the data:
Open the web page
Open the chrome developer tools, Command+Option+I (Mac) or Control+Shift+I (Windows, Linux).
Refresh the web page with the tools open so it processes the network requests then wait for the page to load
(optional) Type xml in the search bar on the web to narrow your search results to requests that are likely to have data
Click on each request and look at the preview of the response. At this point I just manually examined the responses to see which had your data. I got lucky and got yours on the first try since it has stats in the name.

Categories

Resources