How to set up BeautifulSoup to avoid false results? - python

When using BeautifulSoup I see many cases where the information sought is definitely in the HTML input, yet BeautifulSoup fails to find it. This is a problem because there are also cases where the information genuinely isn't there, so it is impossible to tell whether an empty search result means BeautifulSoup failed or the information is simply absent.
Here's a simple example:
url_obj = urllib2.urlopen(url)
html = url_obj.read()
url_obj.close()
parsed_html = BeautifulSoup(html)
result = parsed_html.find(id="SalesRank")  # None when the id is not found
I've run tests with dozens of URLs for pages that do have this id and, to my dismay, get seemingly random results: sometimes some of the URLs produce a search hit, and other times none do.
In sharp contrast to this, if I run a simple string search I get the correct result every time:
url_obj = urllib2.urlopen(url)
html = url_obj.read()
url_obj.close()
index = html.find("SalesRank")
# Slice off a chunk-o-html from there
# Then use regex to grab what we are after
This works every time. The prior BeautifulSoup example fails in a seemingly random fashion, on the same URLs. What's alarming is that I can run the BeautifulSoup code twice in a row on the same set of URLs and get different responses. The simple string-search code is 100% consistent and accurate in its results.
Is there a trick to setting up BeautifulSoup in order to ensure it is as consistent and reliable as a simple string search?
If not, is there an alternative library that is rock solid reliable and repeatable?

Nowadays, page loads have become more complex and often involve a series of asynchronous calls, a lot of client-side JavaScript logic, DOM manipulation, etc. The page you see in the browser usually is not the page you get via requests or urllib2. Additionally, the site can have defensive mechanisms in place: for example, it can check the User-Agent header, ban your IP after many consecutive requests, etc. This is really site-specific and there is no "silver bullet" here.
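For the User-Agent point specifically, sending a browser-like header is often enough to get served the same page a browser sees. A minimal sketch (shown with Python 3's urllib.request; the urllib2 API used above is analogous, and both the URL and the UA string below are just example values):

```python
import urllib.request

# Hypothetical target URL; the UA string is only an example of a
# browser-like value, not something any particular site requires.
req = urllib.request.Request(
    "http://example.com/",
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) "
                           "Gecko/20100101 Firefox/115.0"},
)
# html = urllib.request.urlopen(req).read()  # actual network call, omitted here
print(req.get_header("User-agent")[:11])  # -> Mozilla/5.0
```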
Besides, the way BeautifulSoup parses the page depends on the underlying parser. See: Differences between parsers.
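One concrete consequence: BeautifulSoup(html) with no second argument silently picks the "best" parser installed on that machine, so the same script can behave differently across environments. Pinning the parser removes that variable. A minimal sketch with stand-in markup:

```python
from bs4 import BeautifulSoup

html = '<ul><li id="SalesRank">#1 in Books</li></ul>'  # stand-in markup

# Passing the parser name explicitly ("html.parser" ships with Python)
# makes the parse deterministic across machines.
soup = BeautifulSoup(html, "html.parser")
node = soup.find(id="SalesRank")
print(node.get_text())  # -> #1 in Books
```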
The most reliable way to achieve "What you see in the browser is what you get in the code" is to utilize a real browser, headless or not. For example, selenium package would be useful here.

Related

Python Requests only pulling half of intended tags

I'm attempting to scrape a website, and pull each sheriff's name and county. I'm using devtools in chrome to identify the HTML tag needed to locate that information.
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
URL = 'https://oregonsheriffs.org/about-ossa/meet-your-sheriffs'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
sheriff_names = soup.find_all('a', class_ = 'eg-sheriff-list-skin-element-1')
sheriff_counties = soup.find_all(class_ = 'eg-sheriff-list-skin-element-2')
However, I'm finding that Requests is not pulling the entire page's HTML, even though the tag is at the end. If I scan page.content, I find that Sheriff Harrold is the last sheriff included, and that every sheriff from Curtis Landers onwards is missing (I tried pasting the full output of page.content, but it's too long).
My best guess from reading this answer is that the website uses JavaScript to load the remaining part of the page upon interaction, which would imply that I need something like Selenium to interact with the page and get the rest of it to load first.
However, if you look at the website, it's very simple, so as a novice, part of me thinks there has to be a way to scrape this basic site without a more complex tool like Selenium. That said, I recognize that the site is WordPress-generated, and WordPress can set up delayed JavaScript even on simple sites.
My questions are:
1) Do I really need to use Selenium to scrape a simple, WordPress-generated website like this, or is there a way to get the full page to load with just Requests? Is there any way to tell when a page will require a web driver and when Requests will be enough?
2) Thinking one step ahead: if I want to scale up this project, how can I tell that Requests has not returned the full website, without manually inspecting the results of every site?
Thanks!
Unfortunately, your initial instinct is almost certainly correct. If you look at the page source it seems that they have some sort of lazy loading going on, pulling content from an external source.
A quick look at the page source indicates that they're probably using the "Essential Grid" WordPress theme to do this. I think this supports preloading. If you look at the requests that are made you might be able to ascertain how it's loading this and pull directly from that source (perhaps a REST call, AJAX, etc).
In a generalized sense, I'm afraid that there really isn't any automated way to programmatically determine if a page has 'fully' loaded, as that behavior is defined in code and can be triggered by anything.
If you want to capture information from pages that load content as you scroll, though, I believe Selenium is the tool you'll have to use.
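For the asker's second question, a cheap sanity check scales reasonably well: you usually know roughly how many items a complete page should contain (Oregon has 36 counties), so count the matched elements and flag anything short instead of eyeballing every result. A sketch using the class name from the question; the threshold is an assumption you'd tune per site:

```python
from bs4 import BeautifulSoup

def check_scrape(html, expected_min=36):
    """Return sheriff name links, raising if the page looks truncated."""
    soup = BeautifulSoup(html, "html.parser")
    names = soup.find_all("a", class_="eg-sheriff-list-skin-element-1")
    if len(names) < expected_min:
        raise ValueError(
            "only %d entries found; the rest may be lazy-loaded" % len(names))
    return names
```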

Efficient web page scraping with Python/Requests/BeautifulSoup

I am trying to grab information from the Chicago Transit Authority bustracker website. In particular, I would like to quickly output the arrival ETAs for the top two buses. I can do this rather easily with Splinter; however I am running this script on a headless Raspberry Pi model B and Splinter plus pyvirtualdisplay results in a significant amount of overhead.
Something along the lines of
from bs4 import BeautifulSoup
import requests
url = 'http://www.ctabustracker.com/bustime/eta/eta.jsp?id=15475'
r = requests.get(url)
s = BeautifulSoup(r.text,'html.parser')
does not do the trick. All of the data fields are empty (well, they contain &nbsp;). For example, when the page looks like this:
the snippet s.find(id='time1').text gives me u'\xa0' instead of the "12 MINUTES" I get when I perform the analogous search with Splinter.
I'm not wedded to BeautifulSoup/requests; I just want something that doesn't require the overhead of Splinter/pyvirtualdisplay since the project requires that I obtain a short list of strings (e.g. for the image above, [['9','104th/Vincennes','1158','12 MINUTES'],['9','95th','1300','13 MINUTES']]) and then exits.
The bad news
So the bad news is that the page you are trying to scrape is rendered via JavaScript. While tools like Splinter, Selenium, and PhantomJS can render this for you and give you output that is easy to scrape, Python + Requests + BeautifulSoup don't give you this out of the box.
The good news
The data pulled in from the Javascript has to come from somewhere, and usually this will come in an easier to parse format (as it's designed to be read by machines).
In this case your example loads this XML.
Now, an XML response isn't as nice as JSON, so I'd recommend reading this answer about integrating XML with the requests library. But it will be a lot more lightweight than Splinter.
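Once the XML is in hand (e.g. via requests.get(...).text), the standard library can parse it with no extra dependencies. The element names below are invented for illustration; inspect the real response for the actual tags:

```python
import xml.etree.ElementTree as ET

# Hypothetical payload shaped like the desired output; the real
# bustracker XML uses its own element names.
xml_payload = """<stop>
  <eta><route>9</route><dest>104th/Vincennes</dest><bus>1158</bus><prd>12 MINUTES</prd></eta>
  <eta><route>9</route><dest>95th</dest><bus>1300</bus><prd>13 MINUTES</prd></eta>
</stop>"""

root = ET.fromstring(xml_payload)
etas = [[e.findtext("route"), e.findtext("dest"), e.findtext("bus"), e.findtext("prd")]
        for e in root.findall("eta")]
print(etas)  # -> [['9', '104th/Vincennes', '1158', '12 MINUTES'], ['9', '95th', '1300', '13 MINUTES']]
```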

Selenium - parsing a page takes too long

I work with Selenium in Python 2.7. I understand that loading a page and similar actions take far longer than raw requests, because Selenium simulates everything, including JS.
What I don't understand is why parsing an already loaded page takes so long.
Every time a page is loaded, I find all tags meeting some condition (about 30 div tags) and then pass each tag to a parsing function. For parsing I'm using CSS selectors and similar methods, like: driver.find_element_by_css_selector("div.carrier p").text
As far as I understand, once the page is loaded, its source code is saved in my RAM (or somewhere else), so parsing should take milliseconds.
EDIT: I bet that parsing the same source code using BeautifulSoup would be more than 10 times faster, but I don't understand why.
Do you have any explanation? Thanks.
These are different tools for different purposes. Selenium is a browser automation tool that has a rich set of techniques to locate elements. BeautifulSoup is an HTML parser. When you find an element with Selenium - this is not an HTML parsing. In other words, driver.find_element_by_id("myid") and soup.find(id="myid") are very different things.
When you ask Selenium to find an element, say using find_element_by_css_selector(), an HTTP request is sent to the /session/$sessionId/element endpoint via the JSON wire protocol. Your Selenium Python client then receives the response and returns a WebElement instance if everything went without errors. You can think of it as a real-time, dynamic thing: you are getting a real web element that is "living" in the browser, and you can control and interact with it.
With BeautifulSoup, once you download the page source, there is no network component anymore, no real-time interaction with a page and the element, there is only HTML parsing involved.
In practice, if you are doing web scraping, need a real browser to execute JavaScript and handle AJAX, and are doing complex HTML parsing afterwards, it makes sense to grab the desired .page_source and feed it to BeautifulSoup or, even better in terms of speed, lxml.html.
Note that in cases like this you usually don't need the complete HTML source of the page. To make the HTML parsing faster, you can feed the "inner" or "outer" HTML of just the page block containing the desired data to the HTML parser of your choice. For example:
container = driver.find_element_by_id("container").get_attribute("outerHTML")
driver.close()
soup = BeautifulSoup(container, "lxml")

What is the best practice for writing maintainable web scrapers?

I need to implement a few scrapers to crawl some web pages (because the site doesn't have an open API), extract information, and save it to a database. I am currently using BeautifulSoup to write code like this:
discount_price_text = soup.select("#detail-main del.originPrice")[0].string
discount_price = float(re.findall('[\d\.]+', discount_price_text)[0])
I suspect code like this can very easily become invalid when the web page changes, even slightly.
How should I write scrapers less susceptible to these changes, other than writing regression tests to run regularly to catch failures?
In particular, is there any existing 'smart scraper' that can make 'best effort guess' even when the original xpath/css selector is no longer valid?
Pages have the potential to change so drastically that building a very "smart" scraper might be pretty difficult; and even if it were possible, the scraper would be somewhat unpredictable, even with fancy techniques like machine learning. It's hard to make a scraper that has both trustworthiness and automated flexibility.
Maintainability is somewhat of an art-form centered around how selectors are defined and used.
In the past I have rolled my own "two stage" selectors:
(find) The first stage is highly inflexible and checks the structure of the page toward a desired element. If the first stage fails, then it throws some kind of "page structure changed" error.
(retrieve) The second stage then is somewhat flexible and extracts the data from the desired element on the page.
This allows the scraper to isolate itself from drastic page changes with some level of auto-detection, while still maintaining a level of trustworthy flexibility.
I have frequently used XPath selectors, and it is really quite surprising, with a little practice, how flexible you can be with a good selector while still being very accurate. I'm sure CSS selectors are similar. This gets easier the more semantic and "flat" the page design is.
A few important questions to answer are:
What do you expect to change on the page?
What do you expect to stay the same on the page?
When answering these questions, the more accurate you can be the better your selectors can become.
In the end, it's your choice how much risk you want to take and how trustworthy your selectors will be; when both finding and retrieving data on a page, how you craft them makes a big difference. Ideally, it's best to get data from a web API, which hopefully more sources will begin providing.
EDIT: Small example
Using your scenario, where the element you want is at .content > .deal > .tag > .price, the general .content .price selector is very "flexible" regarding page changes; but if, say, a false-positive element arises, we may want to avoid extracting from that new element.
Using two-stage selectors we can specify a less general, more inflexible first stage like .content > .deal, and then a second, more general stage like .price to retrieve the final element using a query relative to the results of the first.
So why not just use a selector like .content > .deal .price?
For my use, I wanted to be able to detect large page changes without running extra regression tests separately. I realized that rather than one big selector, I could write the first stage to include important page-structure elements. This first stage would fail (or report) if the structural elements no longer exist. Then I could write a second stage to more gracefully retrieve data relative to the results of the first stage.
I shouldn't say that it's a "best" practice, but it has worked well.
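The two-stage idea can be sketched with BeautifulSoup's CSS selectors (the markup below is a stand-in for the .content > .deal > .tag > .price scenario from the question):

```python
from bs4 import BeautifulSoup

html = ('<div class="content"><div class="deal">'
        '<span class="tag"><span class="price">$19.99</span></span>'
        '</div></div>')
soup = BeautifulSoup(html, "html.parser")

# Stage 1 (find): strict structural check; fail loudly if it breaks.
deal = soup.select_one(".content > .deal")
if deal is None:
    raise RuntimeError("page structure changed: .content > .deal not found")

# Stage 2 (retrieve): flexible query relative to the stage-1 element.
price = deal.select_one(".price")
print(price.get_text())  # -> $19.99
```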
Completely unrelated to Python and not auto-flexible, but I think the templates of my Xidel scraper have the best maintainability.
You would write it like:
<div id="detail-main">
  <del class="originPrice">
    {extract(., "[0-9.]+")}
  </del>
</div>
Each element of the template is matched against the elements on the webpage, and if they are the same, the expressions inside {} are evaluated.
Additional elements on the page are ignored, so if you find the right balance of included elements and removed elements, the template will be unaffected by all minor changes.
Major changes, on the other hand, will trigger a matching failure, which is much better than XPath/CSS simply returning an empty set. Then you only need to change the affected elements in the template; in the ideal case, you could directly apply the diff between the old and changed page to the template. In any case, you do not need to search for which selector is affected or update multiple selectors for a single change, since the template contains all the queries for a single page together.
EDIT: Oops, I now see you're already using CSS selectors. I think they provide the best answer to your question. So no, I don't think there is a better way.
However, sometimes you may find that it's easier to identify the data without the structure. For example, if you want to scrape prices, you can do a regular-expression search matching the price (\$\s*[0-9.]+) instead of relying on the structure.
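For instance (using \s* rather than \s+ so prices with no space after the $ also match):

```python
import re

text = "Was <del class='originPrice'>$129.99</del> now $99.95!"
prices = re.findall(r"\$\s*[0-9.]+", text)  # structure-free extraction
print(prices)  # -> ['$129.99', '$99.95']
```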
Personally, the out-of-the-box webscraping libraries that I've tried all kind of leave something to desire (mechanize, Scrapy, and others).
I usually roll my own, using:
urllib2 (standard library),
lxml and
cssselect
cssselect allows you to use CSS selectors (just like jQuery) to find specific divs, tables, etcetera. This proves invaluable.
Example code to fetch the first question from SO homepage:
import urllib2
import urlparse
import cookielib
from lxml import etree
from lxml.cssselect import CSSSelector
post_data = None
url = 'http://www.stackoverflow.com'
cookie_jar = cookielib.CookieJar()
http_opener = urllib2.build_opener(
urllib2.HTTPCookieProcessor(cookie_jar),
urllib2.HTTPSHandler(debuglevel=0),
)
http_opener.addheaders = [
('User-Agent', 'Mozilla/5.0 (X11; Linux i686; rv:25.0) Gecko/20100101 Firefox/25.0'),
('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
]
fp = http_opener.open(url, post_data)
parser = etree.HTMLParser()
doc = etree.parse(fp, parser)
elem = CSSSelector('#question-mini-list > div:first-child > div.summary h3 a')(doc)
print elem[0].text
Of course you don't need the cookie jar, nor the user agent to emulate Firefox, but I find that I regularly need these when scraping sites.

Crawler with Python?

I'd like to write a crawler using Python. This means: I've got the URL of a website's home page, and I'd like my program to crawl through the whole website, following links that stay within the site. How can I do this easily and fast? I tried BeautifulSoup already, but it is really CPU-consuming and quite slow on my PC.
I'd recommend using mechanize in combination with lxml.html. As robert king suggested, mechanize is probably best for navigating through the site. For extracting elements I'd use lxml; it is much faster than BeautifulSoup and probably the fastest parser available for Python. This link shows a performance test of different HTML parsers for Python. Personally, I'd refrain from using the Scrapy wrapper.
I haven't tested it, but this is probably what you're looking for; the first part is taken straight from the mechanize documentation. The lxml documentation is also quite helpful; especially take a look at this and this section.
import mechanize
import lxml.html
br = mechanize.Browser()
response = br.open("somewebsite")
for link in br.links():
    print link
    br.follow_link(link)  # takes EITHER a Link instance OR keyword args
    print br
    br.back()
# you can also display the links with lxml
html = response.read()
root = lxml.html.fromstring(html)
for link in root.iterlinks():
    print link
you can also get elements via root.xpath(). A simple wget might even be the easiest solution.
Hope I could be helpful.
I like using mechanize. It's fairly simple: you download it and create a browser object. With this object you can open a URL, and you can use "back" and "forward" functions as in a normal browser. You can iterate through the forms on the page and fill them out if need be.
You can iterate through all the links on the page too; each link object has the URL etc., and you can click on it.
Here is an example:
Download all the links(related documents) on a webpage using Python
Here's an example of a very fast (concurrent) recursive web scraper using eventlet. It only prints the URLs it finds, but you can modify it to do whatever you want. You could parse the HTML with lxml (fast), pyquery (slower but still fast), or BeautifulSoup (slow) to get the data you want.
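If eventlet isn't available, the same fan-out idea can be sketched with the standard library's thread pool; fetch() below is a stub standing in for a real HTTP request that returns the links found on a page:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stub standing in for an HTTP GET that returns the links on a page.
    fake_links = {"/a": ["/b", "/c"], "/b": ["/a"], "/c": []}
    return fake_links.get(url, [])

def crawl(start, max_workers=10):
    seen = {start}
    frontier = [start]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            # Fetch a whole level of the link graph concurrently.
            results = pool.map(fetch, frontier)
            frontier = list({u for links in results for u in links} - seen)
            seen.update(frontier)
    return seen

print(sorted(crawl("/a")))  # -> ['/a', '/b', '/c']
```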
Have a look at scrapy (and related questions). As for performance... very difficult to make any useful suggestions without seeing the code.
