I work with Selenium in Python 2.7. I get that loading a page and similar actions take far longer than raw requests because it simulates everything, including JS etc.
The thing I don't understand is why parsing an already loaded page takes so long.
Every time a page is loaded, I find all tags meeting some condition (about 30 div tags) and then pass each tag as an argument to a parsing function. For parsing I'm using css_selectors and similar methods like: on.find_element_by_css_selector("div.carrier p").text
As far as I understand, once the page is loaded, the source code of this page is saved in my RAM or somewhere else, so parsing should be done in milliseconds.
EDIT: I bet that parsing the same source code using BeautifulSoup would be more than 10 times faster but I don't understand why.
Do you have any explanation? Thanks
These are different tools for different purposes. Selenium is a browser automation tool that has a rich set of techniques to locate elements. BeautifulSoup is an HTML parser. When you find an element with Selenium - this is not an HTML parsing. In other words, driver.find_element_by_id("myid") and soup.find(id="myid") are very different things.
When you ask selenium to find an element, say, using find_element_by_css_selector(), there is an HTTP request being sent to /session/$sessionId/element endpoint by the JSON wire protocol. Then, your selenium python client would receive a response and return you a WebElement instance if everything went without errors. You can think of it as a real-time/dynamic thing, you are getting a real Web Element that is "living" in a browser, you can control and interact with it.
With BeautifulSoup, once you download the page source, there is no network component anymore, no real-time interaction with a page and the element, there is only HTML parsing involved.
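To make the difference concrete, here is a minimal sketch contrasting the two calls mentioned above (the URL is a placeholder and "myid" is just the illustrative id from the example, so this won't match anything on a real page as-is):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get("http://example.com")  # placeholder URL

# Selenium: a command is sent to the live browser, which returns a WebElement handle
element = driver.find_element_by_id("myid")
print(element.text)

# BeautifulSoup: purely in-memory parsing of a static HTML string
soup = BeautifulSoup(driver.page_source, "html.parser")
tag = soup.find(id="myid")
print(tag.get_text() if tag else None)

driver.quit()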
In practice, if you are doing web-scraping and you need a real browser to execute javascript and handle AJAX, and you are doing a complex HTML parsing afterwards, it would make sense to get the desired .page_source and feed it to BeautifulSoup, or, even better in terms of speed - lxml.html.
Note that, in cases like this, usually there is no need for the complete HTML source of the page. To make the HTML parsing faster, you can feed an "inner" or "outer" HTML of the page block containing the desired data to the HTML parser of your choice. For example:
container = driver.find_element_by_id("container").get_attribute("outerHTML")
driver.close()
soup = BeautifulSoup(container, "lxml")
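If you prefer lxml.html (mentioned above as the faster option), a similar sketch would be, reusing the hypothetical "container" block and the "div.carrier p" structure from the question:

import lxml.html

root = lxml.html.fromstring(container)
for p in root.xpath(".//div[contains(@class, 'carrier')]//p"):
    print(p.text_content())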
Related
I am scraping the links for each of the properties that appear on this website, https://www.firstmallorca.com/en/search, so I can scrape them further and collect more detailed data.
My problem is that the parsed HTML (I am using the html5lib parser) from which I scrape the data seems to differ in some areas from the HTML I see in the browser's DevTools. To demonstrate this:
1. This is the last link I select. In the browser, its href="/en/sales/penthouse-in-santa-ponsa/102512".
2. I print the parsed HTML from the BeautifulSoup object with bs4Object.prettify() and copy the whole output into Notepad++.
3. Then, in Notepad++, I look for the same element as in point 1. I find it, but its href="/en/sales/finca-in-portocolom/159515", which is different from what I see on the actual webpage.
I do not understand the nature of what's happening. On point 3, I was expecting to see href="/en/sales/penthouse-in-santa-ponsa/102512" instead of href="/en/sales/finca-in-portocolom/159515".
It seems to me like I am doing the scraping on some other, similar webpage, not the one I see through the browser.
The website loads content via javascript, which your parser does not execute.
This is a task for selenium.
The selenium package is used to automate the interaction with the web browser from Python.
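As a rough sketch of how that could look for this page (the CSS selector is a guess based on the hrefs quoted in the question, not verified against the live site, and the crude sleep is illustrative):

import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get("https://www.firstmallorca.com/en/search")
time.sleep(5)  # crude wait for the JS to populate the listings; an explicit WebDriverWait would be cleaner

# page_source now contains the JavaScript-rendered markup
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

for a in soup.select("a[href*='/en/sales/']"):  # hypothetical selector
    print(a["href"])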
I am trying to extract information from an exchange website (chiliz.net) using Python (requests module) and the following code:
data = requests.get(url,time.sleep(15)).text
I used time.sleep since the website is not directly connecting to the exchange main page, but I am not sure it is necessary.
The thing is that I cannot find anything written under <body style> in the HTML text (which is the data variable in this case). How can I reach the full HTML code and then start to extract the price information from this website?
I know Python, but I'm not that familiar with websites/HTML. So I would appreciate it if you explain the website-related info like you are talking to a beginner. Thanks!
There could be a few reasons for this.
The website runs behind a proxy server from what I can tell, so this does interfere with your request loading time. This is why it's not directly connecting to the main page.
It might also be the case that the elements are rendered using javascript AFTER the page has loaded. So, you only get the page and not the javascript rendered parts. You can try to increase your sleep() time but I don't think that will help.
You can also use a library called Selenium. It simply automates browsers and you can use the page_source property to obtain the HTML source code.
Code (taken from here)
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("http://example.com")
html_source = browser.page_source
With Selenium, you can also use XPath to obtain the data you need ('extract the price information from this website'); you can see a tutorial on that here. Alternatively,
once you extract the HTML code, you can also use a parser such as bs4 to extract the required data.
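For example, a hedged sketch that feeds the html_source from the snippet above into bs4 (the "span.price" selector is purely hypothetical; you would need to inspect the rendered page to find the real one):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_source, "html.parser")
price = soup.select_one("span.price")  # hypothetical selector
if price:
    print(price.get_text(strip=True))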
I'm attempting to scrape a website, and pull each sheriff's name and county. I'm using devtools in chrome to identify the HTML tag needed to locate that information.
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

URL = 'https://oregonsheriffs.org/about-ossa/meet-your-sheriffs'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
sheriff_names = soup.find_all('a', class_='eg-sheriff-list-skin-element-1')
sheriff_counties = soup.find_all(class_='eg-sheriff-list-skin-element-2')
However, I'm finding that Requests is not pulling the entire page's HTML, even though the tag is at the end. If I scan page.content, I find that Sheriff Harrold is the last sheriff included, and that every sheriff from Curtis Landers onwards is missing (I tried pasting the full output of page.content but it's too long).
My best guess from reading this answer is that the website has javascripts that load the remaining part of the page upon interacting with it, which would imply that I need to use something like Selenium to interact with the page to get the rest of it to first load.
However, if you look at the website, it's very simple, so as a novice part of me thinks that there has to be a way to scrape this basic website without using a more complex tool like Selenium. That said, I recognize that the website is WordPress-generated, and WordPress can set up delayed JavaScript even on simple websites.
My questions are:
1) Do I really need to use Selenium to scrape a simple, WordPress-generated website like this? Or is there a way to get the full page to load with just Requests? Is there any way to tell when web pages will require a web driver and when Requests will not be enough?
2) I'm thinking one step ahead here - if I want to scale up this project, how would I be able to tell that Requests has not returned the full website, without manually inspecting the results of every website?
Thanks!
Unfortunately, your initial instinct is almost certainly correct. If you look at the page source it seems that they have some sort of lazy loading going on, pulling content from an external source.
A quick look at the page source indicates that they're probably using the "Essential Grid" WordPress theme to do this. I think this supports preloading. If you look at the requests that are made you might be able to ascertain how it's loading this and pull directly from that source (perhaps a REST call, AJAX, etc).
In a generalized sense, I'm afraid that there really isn't any automated way to programmatically determine if a page has 'fully' loaded, as that behavior is defined in code and can be triggered by anything.
If you want to capture information from pages that load content as you scroll, though, I believe Selenium is the tool you'll have to use.
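For completeness, a minimal Selenium sketch of the scroll-until-nothing-new-loads pattern (the URL is from the question; the loop and sleep values are illustrative):

import time
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://oregonsheriffs.org/about-ossa/meet-your-sheriffs")

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the lazy loader time to fetch the next batch
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

html = driver.page_source  # should now include the sheriffs loaded on scroll
driver.quit()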
I need to write an automated scraper that can handle websites that are rendered by JavaScript (like YouTube) or that simply use some JavaScript somewhere in their HTML to generate content (like generating the copyright year), so downloading their HTML source makes no sense, as it won't be the final code (what users will actually see).
I use Python with Selenium and WebDriver, so that I can execute JavaScript on a given website. My code for that purpose is:
def execute_javascript_on_website(self, js_command):
    driver = webdriver.Firefox(firefox_options=self.webdriver_options, executable_path=os.path.dirname(os.path.abspath(__file__)) + '/executables/geckodriver')
    driver.get(self.url)
    try:
        return driver.execute_script(js_command)
    except Exception as exception_message:
        pass
    finally:
        driver.close()
Where js_command = "return document.documentElement.outerHTML;".
With this code I'm able to get the source code, but not the rendered one. I can do js_command = "return document;" (as I would in the console), but then I will get a <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="5a784804-f623-3041-9840-03f13ce83f53", element="585b43a1-f3b2-1e4a-b348-4ddaf2944550")> object that holds the HTML, but it's not possible to get it out of it.
Does anyone know about the way how to get HTML rendered by JavaScript (ideally in form of string), using Selenium? Or some other technique that would do it?
PS.: I also tried WebDriverWait, but it didn't help; I still got HTML with unrendered JavaScript.
PPS.: I need to get whole HTML code (whole html tag) with JavaScript rendered in it (as it is for example when inspecting in browsers inspector). Or at least to get DOM of the website in which JavaScript is already rendered.
driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
I've looked into it and I have to admit that the JavaScript in @Rumpelstiltskin Koriat's answer works. The current year is present in the returned HTML string; it's placed after the script tag (which, as @pguardiario mentioned, has to be there, since it's just an HTML tag). I've also found out that in this case of simple JavaScript code in script tags, WebDriverWait is not even needed to obtain the HTML string with the rendered JavaScript. Apparently I had somehow managed to overlook the rendered string I was so eagerly looking for.
What I've also found (as @Corey Goldberg suggested) is that Selenium's own methods work just as well, and read better than the pure JavaScript line: driver.find_element_by_tag_name('html').get_attribute('innerHTML'). This returns a string rather than a WebElement.
On the other hand, when there is a need to scrape the whole HTML of an Angular-powered website, it's necessary (at least in the case of the YouTube website) to locate its tag with id="content" (and then prepend this locator to all XPaths used later in the code, simulating that we have the whole HTML), or some tag inside this one. WebDriverWait was not needed here either.
But when locating just the html tag, the yt-app tag, or any other tag outside of the one with id="content", HTML with unrendered JavaScript is returned. The HTML in Angular-generated websites is mixed with Angular's own tags (which browsers apparently ignore).
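Putting that together, a sketch of the id="content" approach described above (WebDriverWait is included here for robustness, even though, as noted, it turned out not to be strictly necessary):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://www.youtube.com")

content = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "content"))
)
rendered_html = content.get_attribute("outerHTML")  # string containing the JS-rendered markup
driver.quit()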
I am trying to write a Python script that will periodically check a website to see if an item is available. I have used requests.get, lxml.html, and xpath successfully in the past to automate website searches. In the case of this particular URL (http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/) and others on the same website, my code was not working.
import requests
from lxml import html
page = requests.get("http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/")
tree = html.fromstring(page.text)
html_element = tree.xpath(".//div[@class='product-soldout ng-scope']")
At this point, html_element should be a list of elements (I think in this case only one), but instead it is empty. I think this is because the website is not loading all at once, so when requests.get() goes out and grabs it, it's only grabbing the first part. So my questions are:
1: Am I correct in my assessment of the problem?
and
2: If so, is there a way to make requests.get() wait before returning the html, or perhaps another route entirely to get the whole page.
Thanks
Edit: Thanks to both responses. I used Selenium and got my script working.
You are not correct in your assessment of the problem.
You can check the results and see that there's a </html> right near the end. That means you've got the whole page.
And requests' .text always gives you the whole page; if you want to stream it a bit at a time, you have to do so explicitly.
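For reference, explicit streaming with requests looks roughly like this (the chunk size is arbitrary):

import requests

resp = requests.get("http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS", stream=True)
for chunk in resp.iter_content(chunk_size=8192):
    pass  # process the page a piece at a time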
Your problem is that the table doesn't actually exist in the HTML; it's built dynamically by client-side JavaScript. You can see that by actually reading the HTML that's returned. So, unless you run that JavaScript, you don't have the information.
There are a number of general solutions to that. For example:
Use selenium or similar to drive an actual browser to download the page.
Manually work out what the JavaScript code does and do equivalent work in Python.
Run a headless JavaScript interpreter against a DOM that you've built up.
The page uses JavaScript to load the table, which has not been loaded when requests gets the HTML, so you are getting all of the HTML, just not what is generated using JavaScript. You could use Selenium combined with PhantomJS for headless browsing to get the rendered HTML:
from selenium import webdriver
browser = webdriver.PhantomJS()
browser.get("http://www.anthropologie.eu/anthro/index.jsp#/")
html = browser.page_source
print(html)