I have a large number of HTML files which I want to process using BeautifulSoup to generate some statistics. However, I came across the problem that the HTML files contain scripts that may generate more HTML code, which is not being processed. Therefore, I need to render all JavaScript into static HTML before proceeding.
I have seen some options such as using Selenium, but it doesn't seem to fit since I don't want to launch a browser (it should be done in the background).
Can someone please suggest an appropriate approach to this?
Thanks in advance!
Since you need a Javascript engine, using a headless browser is the way to go.
Using Selenium web driver with the PhantomJS headless browser is probably your best option:
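from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.PhantomJS()  # requires Selenium 3; PhantomJS support was removed in Selenium 4
driver.get("...")
bs = BeautifulSoup(driver.page_source, "html.parser")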
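PhantomJS is no longer maintained and Selenium 4 dropped its driver, so the same idea is now usually expressed with a regular browser running in headless mode (no window is shown). A minimal sketch, assuming Selenium 4, an installed Chrome, and BeautifulSoup; `render_to_soup` and `local_file_url` are hypothetical helper names used only for illustration:

```python
from pathlib import Path


def render_to_soup(url, timeout=30):
    """Load `url` in headless Chrome and return the rendered page as soup."""
    # Third-party imports are kept inside the function so this module can be
    # imported even where Selenium or bs4 is not installed.
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # run Chrome without opening a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.set_page_load_timeout(timeout)
        driver.get(url)
        # page_source is the HTML as the browser currently sees it
        return BeautifulSoup(driver.page_source, "html.parser")
    finally:
        driver.quit()  # always release the browser process


def local_file_url(path):
    """Turn a local HTML file path into a file:// URL the browser can open."""
    return Path(path).resolve().as_uri()
```

For HTML files stored on disk, something like `render_to_soup(local_file_url("page.html"))` should work. Starting one browser per file is slow, so for a large batch it is probably worth creating the driver once and reusing it.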
I have been trying to scrape some data using beautiful soup from https://www.eia.gov/coal/markets/. However when I parse the contents some of the data does not show up at all. Those data fields are visible in chrome inspector but not in the soup. The thing is they do not seem to be text elements. I think they are fed using an external database. I have attached the screenshots below. Is there any other way to scrape that data?
Thanks in advance.
Google inspector:
Beautiful soup parsed content:
@DMart is correct. The data you are looking for is populated by JavaScript; have a look at line 1629 in the page source. Beautiful Soup doesn't act as a client browser, so there is nowhere for the JS to execute. So it looks like Selenium is your best bet.
See This thread for more information.
Not enough detail in your question but this information is probably dynamically loaded and you're not fetching the entire page source.
Without your code it's tough to see if you're using Selenium to do it (you tagged this question as such), which may indicate you're using page_source, which does not guarantee you the entire completed source of the page you're looking at.
If you're using requests, it's even more unlikely that you're capturing the entire page's completed source code.
The data is loaded via AJAX, so it is not available in the initial document. If you go to the Network tab in Chrome dev tools you will see that the site reaches out to https://www.eia.gov/coal/markets/coal_markets_json.php. I searched for some of the numbers in the response and it looks like the data you are looking for is there.
This is a direct JSON response from the backend. It's better than Selenium if you can get it to work.
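A rough sketch of that approach using only the standard library. The URL is the one quoted above; the `User-Agent` header, the UTF-8 decoding, and the helper names `fetch_json` and `outline` are assumptions made for illustration, and the actual layout of the response is not known here, which is why the sketch only prints an outline of whatever comes back:

```python
import json
from urllib.request import Request, urlopen

# URL taken from the answer above; the response layout is not documented here.
COAL_JSON_URL = "https://www.eia.gov/coal/markets/coal_markets_json.php"


def fetch_json(url, timeout=30):
    """GET `url` and decode the body as JSON (assumes a UTF-8 JSON response)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def outline(payload, depth=0, max_depth=2):
    """Return indented lines describing the nesting of a decoded JSON value."""
    indent = "  " * depth
    if isinstance(payload, dict):
        lines = []
        for key, value in payload.items():
            lines.append(f"{indent}{key}: {type(value).__name__}")
            if depth < max_depth:
                lines.extend(outline(value, depth + 1, max_depth))
        return lines
    if isinstance(payload, list) and payload:
        # Describe only the first item; the others usually share its shape.
        lines = [f"{indent}[{len(payload)} items]"]
        if depth < max_depth:
            lines.extend(outline(payload[0], depth + 1, max_depth))
        return lines
    return []


# Usage (needs network access):
#   print("\n".join(outline(fetch_json(COAL_JSON_URL))))
```

Once the outline shows where the numbers live, the relevant keys can be read directly from the decoded dictionary, with no HTML parsing involved.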
Thank you all!
Opening the page with a Selenium webdriver and then parsing the page source with Beautiful Soup worked.
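from bs4 import BeautifulSoup as BS
from selenium import webdriver

driver = webdriver.Firefox()  # assumption: any installed browser driver should work
driver.get('https://www.eia.gov/coal/markets/')
html = driver.page_source
soup = BS(html, 'html.parser')
table = soup.find("table", {'id': 'snl_dpst'})
rows = table.find_all("tr")
driver.quit()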
I am trying to extract information from an exchange website (chiliz.net) using Python (requests module) and the following code:
data = requests.get(url,time.sleep(15)).text
I used time.sleep since the website is not directly connecting to the exchange main page, but I am not sure it is necessary.
The thing is, I cannot find anything written under <body style> in the HTML text (which is the data variable in this case). How can I get the full HTML code and then start to extract the price information from this website?
I know Python, but not familiar with websites/HTML that much. So I would appreciate if you explain the website related info like you are talking to a beginner. Thanks!
There could be a few reasons for this.
The website runs behind a proxy server from what I can tell, so this does interfere with your request loading time. This is why it's not directly connecting to the main page.
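It might also be the case that the elements are rendered using JavaScript AFTER the page has loaded, so you only get the page and not the JavaScript-rendered parts. You can try to increase your sleep() time, but I don't think that will help: time.sleep(15) only pauses before the request is sent, and its return value (None) is simply passed to requests.get as the params argument.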
You can also use a library called Selenium. It simply automates browsers and you can use the page_source property to obtain the HTML source code.
Code (taken from here)
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("http://example.com")
html_source = browser.page_source
With Selenium, you can also use an XPath expression to locate the data you want (the price information, in your case); you can see a tutorial on that here. Alternatively, once you have extracted the HTML code, you can use a parser such as bs4 to extract the required data.
In terms of performance, web scraping with BeautifulSoup is obviously much faster than using a webdriver with Selenium. However, I don't know any other way to get content from a dynamic web page. I thought the difference came from the time the browser needs to load elements, but it is definitely more than that. Once the browser had loaded the page (5 seconds), all I had to do was extract some <tr> tags from a table. It took about 3-4 minutes to extract 1016 records, which is extremely slow in my opinion. I came to the conclusion that webdriver methods for finding elements, such as find_elements_by_name, are slow. Is find_elements_by.. from webdriver much slower than the find method in BeautifulSoup? And would it be faster if I got the whole HTML from the webdriver browser and then parsed it with lxml and BeautifulSoup?
Yes, it would be much faster to use Selenium only to get the HTML after waiting for the page to be ready, and then use BeautifulSoup or lxml to parse that HTML.
Another option could be to use Puppeteer either only to get the HTML or to get the info that you want directly. It should also be faster than Selenium. There are some unofficial python bindings for it: pyppeteer
Look into 2 options:
1) Sometimes these dynamic pages do actually have the data within <script> tags in a valid JSON format. You can use requests to get the HTML, use BeautifulSoup to get the <script> tag, and then use json.loads() to parse it.
2) Go directly to the source. Look at the dev tools and search the XHR requests to see if you can go directly to the URL/API that generates the data and get the data that way (most likely again in JSON format). In my opinion, this is by far the better/faster option if available.
If you can provide the url, I can check to see if either of these options apply to your situation.
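Option 1 can be sketched as follows. To keep the example self-contained it uses the standard library's `html.parser` in place of BeautifulSoup and takes the HTML as a string (for example `requests.get(url).text`); the names `ScriptCollector` and `json_from_scripts` are hypothetical, and the sketch assumes the `<script>` body is pure JSON:

```python
import json
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the last entry.
        if self._in_script:
            self.scripts[-1] += data


def json_from_scripts(html):
    """Return the decoded value of each <script> whose body is valid JSON."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    found = []
    for body in collector.scripts:
        try:
            found.append(json.loads(body))
        except ValueError:
            continue  # ordinary JavaScript (or empty), not a pure JSON payload
    return found
```

Many sites instead embed the data as an assignment such as `var data = {...};`. In that case the surrounding JavaScript has to be stripped (for instance with a regular expression) before `json.loads()` will accept it.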
Web scraping with Python, using either Selenium or BeautifulSoup, should be a part of the testing strategy. To put it plainly: if your intent is to scrape static content, BeautifulSoup is unmatched. But in case the website content is dynamically rendered, Selenium is the way to go.
Having said that, BeautifulSoup won't wait for dynamic content that isn't yet present in the DOM tree once page loading completes, whereas with Selenium you have Implicit Wait and Explicit Wait at your disposal to locate the desired dynamic elements.
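As an illustration of the explicit wait mentioned above, here is a minimal sketch assuming Selenium 4. The function name `wait_for_rows`, the `"table tr"` selector, and the 15-second timeout are placeholders chosen for the example, not values taken from any particular page:

```python
def wait_for_rows(driver, url, css="table tr", timeout=15):
    """Open `url` and block until at least one element matches `css`.

    `driver` is an already-created Selenium WebDriver; the selector and
    timeout defaults are placeholders, not values from a real page.
    """
    # Imported here so the module still loads where Selenium is missing.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(url)
    # Explicit wait: poll until the condition holds or `timeout` seconds pass
    # (a TimeoutException is raised in the latter case).
    WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, css))
    )
    # Hand the rendered HTML to a parser instead of querying elements one by one.
    return driver.page_source
```

The returned HTML can then be passed to BeautifulSoup or lxml, which avoids one webdriver round trip per element lookup.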
Finally, find_elements_by_name() may be slightly more expensive in terms of performance, as Selenium translates it into its equivalent find_element_by_css_selector(). You can find some more details in this discussion
Outro
Official locator strategies for the webdriver
You could also try evaluating in javascript. For example this:
item = driver.execute_script("""return {
div: document.querySelector('div').innerText,
h2: document.querySelector('h2').innerText
}""")
will be at least 10x faster than this:
from selenium.webdriver.common.by import By  # Selenium 4 locator API

item = {
"div": driver.find_element(By.CSS_SELECTOR, 'div').text,
"h2": driver.find_element(By.CSS_SELECTOR, 'h2').text
}
I wouldn't be surprised if it was faster than BS a lot of the time too.
I am trying to write a Python script that will periodically check a website to see if an item is available. I have used requests.get, lxml.html, and xpath successfully in the past to automate website searches. In the case of this particular URL (http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/) and others on the same website, my code was not working.
import requests
from lxml import html
page = requests.get("http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/")
tree = html.fromstring(page.text)
html_element = tree.xpath(".//div[@class='product-soldout ng-scope']")
At this point, html_element should be a list of elements (I think in this case only 1), but instead it is empty. I think this is because the website is not loading all at once, so when requests.get() goes out and grabs it, it's only grabbing the first part. So my questions are
1: Am I correct in my assessment of the problem?
and
2: If so, is there a way to make requests.get() wait before returning the html, or perhaps another route entirely to get the whole page.
Thanks
Edit: Thanks to both responses. I used Selenium and got my script working.
You are not correct in your assessment of the problem.
You can check the results and see that there's a </html> right near the end. That means you've got the whole page.
And the response's .text always holds the whole page; if you want to stream it a bit at a time, you have to do so explicitly.
Your problem is that the table doesn't actually exist in the HTML; it's built dynamically by client-side JavaScript. You can see that by actually reading the HTML that's returned. So, unless you run that JavaScript, you don't have the information.
There are a number of general solutions to that. For example:
Use selenium or similar to drive an actual browser to download the page.
Manually work out what the JavaScript code does and do equivalent work in Python.
Run a headless JavaScript interpreter against a DOM that you've built up.
The page uses JavaScript to load the table, and that JavaScript is not run when requests gets the HTML. So you are getting all of the HTML, just not what JavaScript generates. You could use Selenium combined with PhantomJS for headless browsing to get the HTML:
from selenium import webdriver
browser = webdriver.PhantomJS()
browser.get("http://www.anthropologie.eu/anthro/index.jsp#/")
html = browser.page_source
print(html)
Basically exactly as the question says, I'm trying to get the background colour out from a website.
At the moment I'm using BeautifulSoup to get the HTML, but it's proving a difficult way of getting the CSS. Any help would be great!
This is not something you can reliably solve with BeautifulSoup. You need a real browser.
The simplest option would be to use the Selenium browser automation tool:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('url')
element = driver.find_element(By.ID, 'myid')  # Selenium 4 locator API
print(element.value_of_css_property('background-color'))
value_of_css_property() documentation.