I am trying to grab information from the Chicago Transit Authority bustracker website. In particular, I would like to quickly output the arrival ETAs for the top two buses. I can do this rather easily with Splinter; however, I am running this script on a headless Raspberry Pi Model B, and Splinter plus pyvirtualdisplay adds a significant amount of overhead.
Something along the lines of
from bs4 import BeautifulSoup
import requests
url = 'http://www.ctabustracker.com/bustime/eta/eta.jsp?id=15475'
r = requests.get(url)
s = BeautifulSoup(r.text,'html.parser')
does not do the trick. All of the data fields are empty (well, they contain a non-breaking space, &nbsp;). For example, when the page is displaying "12 MINUTES" for the first bus, s.find(id='time1').text gives me u'\xa0', whereas the analogous search with Splinter returns "12 MINUTES".
I'm not wedded to BeautifulSoup/requests; I just want something without the overhead of Splinter/pyvirtualdisplay, since all the script needs to do is obtain a short list of strings (e.g. for the page above, [['9','104th/Vincennes','1158','12 MINUTES'],['9','95th','1300','13 MINUTES']]) and then exit.
The bad news
So the bad news is that the page you are trying to scrape is rendered via JavaScript. While tools like Splinter, Selenium, and PhantomJS can render this for you and give you output that is easy to scrape, Python + requests + BeautifulSoup don't give you this out of the box.
The good news
The data pulled in by the JavaScript has to come from somewhere, and it usually arrives in an easier-to-parse format (as it's designed to be read by machines).
In this case, your example page loads its arrival data from an XML feed; you can see the request in your browser's network tab.
Now, an XML response isn't as nice to work with as JSON, so I'd recommend reading up on integrating XML parsing with the requests library. But it will be a lot more lightweight than Splinter.
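For example, a minimal sketch of hitting such an XML feed with requests plus the standard-library ElementTree might look like the following. The endpoint URL, query parameters, and tag names below are assumptions; copy the actual request your browser makes from the network tab of its developer tools.
import requests
import xml.etree.ElementTree as ET

# Assumed endpoint and parameters -- verify against the request your
# browser actually makes (developer tools, Network tab).
url = 'http://www.ctabustracker.com/bustime/map/getStopPredictions.jsp'
r = requests.get(url, params={'stop': '15475', 'route': 'all'})
root = ET.fromstring(r.content)

etas = []
for pre in root.iter('pre'):                    # assumed element name for one prediction
    route = pre.findtext('rn', default='')      # assumed tag names below
    destination = pre.findtext('fd', default='')
    minutes = pre.findtext('pt', default='')
    etas.append([route, destination, minutes])

print(etas[:2])   # the top two arrivals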
Related
I'm attempting to scrape a website, and pull each sheriff's name and county. I'm using devtools in chrome to identify the HTML tag needed to locate that information.
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

URL = 'https://oregonsheriffs.org/about-ossa/meet-your-sheriffs'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

# Pull the name link and the county element for each sheriff card
sheriff_names = soup.find_all('a', class_='eg-sheriff-list-skin-element-1')
sheriff_counties = soup.find_all(class_='eg-sheriff-list-skin-element-2')
However, I'm finding that requests is not pulling the entire page's HTML, even though the tag is near the end. If I scan page.content, I find that Sheriff Harrold is the last sheriff included, and that every sheriff from Curtis Landers onwards is missing (I tried pasting the full output of page.content, but it's too long).
My best guess from reading this answer is that the website has JavaScript that loads the remaining part of the page when you interact with it, which would imply that I need to use something like Selenium to interact with the page and get the rest of it to load first.
However, if you look at the website, it's very simple, so as a novice, part of me thinks there has to be a way to scrape this basic website without a heavier tool like Selenium. That said, I recognize that the website is WordPress-generated, and WordPress can set up deferred JavaScript even on simple sites.
My questions are:
1) Do I really need to use Selenium to scrape a simple, WordPress-generated website like this? Or is there a way to get the full page to load with just requests? Is there any way to tell in advance which pages will require a web driver and which ones requests can handle?
2) Thinking one step ahead: if I want to scale up this project, how could I tell that requests has not returned the full page, without manually inspecting the results for every website?
Thanks!
Unfortunately, your initial instinct is almost certainly correct. If you look at the page source it seems that they have some sort of lazy loading going on, pulling content from an external source.
A quick look at the page source indicates that they're probably using the "Essential Grid" WordPress theme to do this. I think this supports preloading. If you look at the requests that are made you might be able to ascertain how it's loading this and pull directly from that source (perhaps a REST call, AJAX, etc).
In a generalized sense, I'm afraid that there really isn't any automated way to programmatically determine if a page has 'fully' loaded, as that behavior is defined in code and can be triggered by anything.
If you want to capture information from pages that load content as you scroll, though, I believe Selenium is the tool you'll have to use.
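If you do end up going the Selenium route, a rough sketch (assuming chromedriver is installed; the class names are the ones from the question) is to scroll until the grid stops growing and then hand the rendered source to BeautifulSoup:
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://oregonsheriffs.org/about-ossa/meet-your-sheriffs')

# Keep scrolling until the page height stops changing, i.e. the lazy
# loader has nothing left to fetch.
last_height = 0
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height

soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

names = [a.get_text(strip=True)
         for a in soup.find_all('a', class_='eg-sheriff-list-skin-element-1')]
counties = [el.get_text(strip=True)
            for el in soup.find_all(class_='eg-sheriff-list-skin-element-2')]
print(list(zip(names, counties)))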
When I crawl, I usually look at the script/API calls a page makes before parsing with Python, since that often gives me JSON, which is easy to structure and parse.
>>> import requests
>>> r = requests.get('~.json')
>>> r.json()
However, I've run into this page: https://www.eiganetflix.jp/%E3%82%BF%E3%82%A4%E3%83%97/tv-%E3%82%B7%E3%83%AA%E3%83%BC%E3%82%BA
It seems there is no XHR call returning JSON for the material shown on the page.
And it is hard to find the pagination JavaScript functions. (Actually, there are some, but they seem hard to execute directly.)
In this case, how can I use my existing requests + JSON approach?
Or is there an easy way to crawl this?
If I understand correctly, you want to scrape a webpage that does not have a JSON response. First check whether the website has an API that returns JSON, or any other structured data such as XML. If there is no such source, you will have to screen scrape, which is not the easiest thing to do. Check out Scrapy, which is a framework for this, or use a library like BeautifulSoup for a custom solution.
If the page relies on JavaScript, you would need some way to run it in order to get the content and move through pages. You can use spynner or Selenium to do that.
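If screen scraping turns out to be enough, a minimal sketch with requests + BeautifulSoup might look like this. The CSS selector is a placeholder I made up; inspect the page in your browser's developer tools and substitute the real container and class names.
import requests
from bs4 import BeautifulSoup

url = ('https://www.eiganetflix.jp/'
       '%E3%82%BF%E3%82%A4%E3%83%97/tv-%E3%82%B7%E3%83%AA%E3%83%BC%E3%82%BA')
headers = {'User-Agent': 'Mozilla/5.0'}   # some sites reject the default requests UA

r = requests.get(url, headers=headers)
r.raise_for_status()
soup = BeautifulSoup(r.text, 'html.parser')

# '.item-title' is a hypothetical class name for one listed title --
# replace it with whatever the page actually uses.
titles = [el.get_text(strip=True) for el in soup.select('.item-title')]
print(titles)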
I work with Selenium in Python 2.7. I get that loading a page and similar actions take far longer than raw requests, because Selenium simulates everything including JavaScript.
What I don't understand is why parsing an already loaded page takes so long.
Every time a page is loaded, I find all tags meeting some condition (about 30 div tags) and then pass each tag as an argument to a parsing function. For parsing I'm using CSS selectors and similar methods, like: on.find_element_by_css_selector("div.carrier p").text
As far as I understand, once the page is loaded, its source code is held in RAM (or somewhere local), so parsing should take milliseconds.
EDIT: I bet that parsing the same source code using BeautifulSoup would be more than 10 times faster, but I don't understand why.
Do you have any explanation? Thanks
These are different tools for different purposes. Selenium is a browser automation tool with a rich set of techniques for locating elements; BeautifulSoup is an HTML parser. When you find an element with Selenium, that is not HTML parsing. In other words, driver.find_element_by_id("myid") and soup.find(id="myid") are very different things.
When you ask Selenium to find an element, say with find_element_by_css_selector(), an HTTP request is sent to the /session/$sessionId/element endpoint via the JSON wire protocol. Your Selenium Python client then receives the response and returns a WebElement instance if everything went without errors. You can think of it as a real-time, dynamic thing: you are getting a real web element that is "living" in a browser, and you can control and interact with it.
With BeautifulSoup, once you download the page source, there is no network component anymore, no real-time interaction with a page and the element, there is only HTML parsing involved.
In practice, if you are doing web scraping and need a real browser to execute JavaScript and handle AJAX, but are doing complex HTML parsing afterwards, it makes sense to get the desired .page_source and feed it to BeautifulSoup or, even better in terms of speed, lxml.html.
Note that in cases like this there is usually no need for the complete HTML source of the page. To make the HTML parsing faster, you can feed the "inner" or "outer" HTML of just the page block containing the desired data to the HTML parser of your choice. For example:
# Grab only the block that contains the data, then parse that smaller chunk
container = driver.find_element_by_id("container").get_attribute("outerHTML")
driver.close()
soup = BeautifulSoup(container, "lxml")
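If parsing speed is the main concern, the same outer-HTML string can be handed straight to lxml instead of BeautifulSoup (this assumes the lxml package is installed):
import lxml.html

tree = lxml.html.fromstring(container)
# XPath equivalent of the "div.carrier p" selector used in the question
texts = [p.text_content() for p in tree.xpath('//div[contains(@class, "carrier")]//p')]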
Using BeautifulSoup, I am seeing many cases where the information sought is definitely in the HTML input, yet BeautifulSoup fails to find it. This is a problem because there are also cases where the information genuinely isn't there, so it is impossible to know whether an empty result means BeautifulSoup failed or the information simply isn't present.
Here's a simple example:
import urllib2
from bs4 import BeautifulSoup

url_obj = urllib2.urlopen(url)
html = url_obj.read()
url_obj.close()
parsed_html = BeautifulSoup(html)
html = parsed_html.find(id="SalesRank")
I've run tests with dozens of URLs for pages that do have this id and, to my dismay, got seemingly random results. Sometimes some of the URLs produce a hit, and other times none do.
In sharp contrast, if I run a simple string search, I get the correct result every time:
url_obj = urllib2.urlopen(url)
html = url_obj.read()
url_obj.close()
index = html.find("SalesRank")
# Slice off a chunk-o-html from there
# Then use regex to grab what we are after
This works every time. The prior BeautifulSoup example fails in a seemingly random fashion, on the same URLs. What's alarming is that I can run the BeautifulSoup code twice in a row on the same set of URLs and get different responses. The simple string search is 100% consistent and accurate in its results.
Is there a trick to setting up BeautifulSoup in order to ensure it is as consistent and reliable as a simple string search?
If not, is there an alternative library that is rock solid reliable and repeatable?
Nowadays, page loads are more complex and often involve a series of asynchronous calls, a lot of client-side JavaScript logic, DOM manipulation, etc. The page you see in the browser is usually not the page you get via requests or urllib2. Additionally, the site can have defensive mechanisms in place: for example, it can check the User-Agent header, ban your IP after multiple consecutive requests, etc. This is really site-specific, and there is no "silver bullet" here.
Besides, the way BeautifulSoup parses the page depends on the underlying parser. See the "Differences between parsers" section of the BeautifulSoup documentation.
The most reliable way to achieve "What you see in the browser is what you get in the code" is to utilize a real browser, headless or not. For example, selenium package would be useful here.
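Applied to the SalesRank example above, a sketch of that approach (assuming chromedriver is installed, and naming the parser explicitly so results are repeatable) would be:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get(url)                   # the same product URL used above
html = driver.page_source         # the DOM after JavaScript has run
driver.quit()

soup = BeautifulSoup(html, 'lxml')      # explicit parser choice
sales_rank = soup.find(id='SalesRank')
print(sales_rank.get_text(strip=True) if sales_rank else 'not found')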
I am new to Python. I am trying to scrape data from a website, and the data I want cannot be seen via View > Source in the browser; it comes from another file. Is it possible to scrape the data that actually appears on the screen with BeautifulSoup and Python?
example site www[dot]catleylakeman[dot]co(dot)uk/cds_banks.php
If not, is this possible using another route?
Thanks
The "other file" is http://www.catleylakeman.co.uk/bankCDS.php?ignoreMe=1369145707664 - you can find this out (and I suspect you already have) by using chrome's developer tools, network tab (or the equivalent in your browser).
This format is easier to parse than the final html would be; generally HTML scrapers should be used as a last resort if the website does not publish raw data like the above.
My guess is that the URL you are actually looking for is:
http://www.catleylakeman.co.uk/bankCDS.php?ignoreMe=1369146012122
I found it using the developer tools and looking at the network traffic (built into Chrome and Firefox; Firebug also works). It gets called in with Ajax. You do not even need Beautiful Soup to parse that one, as it seems to be a long string separated by *| and sometimes **|. The following should get you initial access to that data:
import urllib2

f = urllib2.urlopen('http://www.catleylakeman.co.uk/bankCDS.php?ignoreMe=1369146012122')
try:
    # '*|' appears to separate the individual fields in the raw response
    data = f.read().split('*|')
finally:
    f.close()
print data
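For what it's worth, the same fetch with the requests library used elsewhere in this thread would look like the sketch below; splitting first on '**|' and then on '*|' as record and field separators is a guess based on eyeballing the response.
import requests

resp = requests.get('http://www.catleylakeman.co.uk/bankCDS.php?ignoreMe=1369146012122')
# Assumed structure: '**|' between records, '*|' between fields.
records = [rec.split('*|') for rec in resp.text.split('**|')]
print(records[:3])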