Scrapy not always finding the object - python

I'm trying to scrape a hidden field on a webpage with Python's Scrapy framework:
<input class="currentTime" value="4888599" />
The strange thing is, on about 40% of all pages it cannot find the value of the input field. I tried loading the failing pages with JavaScript disabled in my browser (I thought maybe that was the problem), but the value is filled in even on the failing pages. So the value is not added with JavaScript.
Has anyone had this problem before, or does anyone have a solution? I don't know why it cannot find the value. I'm using the following syntax to scrape:
sel.css('.currentTime::attr(value)').extract()
The class appears just once on a page, and I'm searching from the body tag, so in my opinion it cannot be the path that is wrong. It's only this object that cannot be found most of the time; all other objects are no problem.

Instead of the CSS attribute selector, you should prefer XPath - it's much more powerful and allows you to do things like traverse the tree backwards (to parents, which you can then descend from again).
Not that you'd need to do this in the given example, but XPath is much more reliable in general.
A fairly generic XPath query to do what you want would be something like this (in case the node may have more than one class name):
//input[contains(concat(' ', normalize-space(@class), ' '), ' currentTime ')]/@value
A more targeted example would be:
//input[@class="currentTime"]/@value
(Note that value is an attribute of the input element, so it is selected with @value, not /value/text().)
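For instance, a quick check in the Scrapy shell (a minimal sketch, assuming sel is the selector from the question) could compare both approaches:
# Both lines should return the attribute value, e.g. ['4888599'].
# If the XPath also comes back empty on the failing pages, the element
# is probably missing from the raw HTML that Scrapy actually receives.
sel.css('.currentTime::attr(value)').extract()
sel.xpath('//input[@class="currentTime"]/@value').extract()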

Related

How to identify a change in a website's structure programmatically

Within the implementation of a Python Scrapy crawler I would like to add a robust mechanism for monitoring/detecting potential layout changes within a website.
These changes do not necessarily affect existing spider selectors - for example, a site adds a new HTML element to represent the number of visitors an item has received - an element I might now be interested in parsing.
Having said that, detecting selector issues (XPath/CSS) would also be beneficial in cases where elements are removed or relocated.
Please note this is not about selector content change or a website refresh (if-modified-since or last-modified), but rather a modification in the structure / nodes / layout of a site.
Therefore, how would one implement logic to monitor such circumstances?
This is actually a research topic, as you can see in this paper, but there are of course some implemented tools that you can check out:
https://github.com/matiskay/html-similarity
https://github.com/matiskay/html-cluster
https://github.com/TeamHG-Memex/page-compare
Basically, the basis for comparison in the approaches above is the tree edit distance of the HTML layout.
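As a rough illustration of the underlying idea (a simplified sketch, not the API of the tools above), you could flatten two snapshots of a page into tag sequences and compare them; the 0.9 threshold here is an arbitrary assumption:
import difflib
from lxml import html

def tag_sequence(page_source):
    # Flatten the DOM into a sequence of tag names, ignoring text,
    # so that only the layout/structure is compared.
    tree = html.fromstring(page_source)
    return [el.tag for el in tree.iter() if isinstance(el.tag, str)]

def layout_changed(old_html, new_html, threshold=0.9):
    # Crude structural similarity between 0 and 1; the tools above use
    # tree edit distance, which also respects nesting, not just order.
    ratio = difflib.SequenceMatcher(
        None, tag_sequence(old_html), tag_sequence(new_html)).ratio()
    return ratio < threshold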

HTML Selector using Python's bs4

I'm fairly new at this, and I'm trying to work through Automate the Boring Stuff and make some of my own programs along the way. I'm trying to use Beautiful Soup's select method to pull the value 33 out of this code:
<span class="wu-value wu-value-to" _ngcontent-c19="">33</span>
I know that the span element is inside a div, and I've tried a few selectors, including:
high_temp = w_u_soup.select('div > span .wu-value wu-value-to')
But I haven't been able to get 33 out. Any help would be appreciated. I've tried to look up what _ngcontent-c19 is, but I'm having trouble understanding what I've found thus far (I'm trying to learn Python, and it seems I'll be learning a bit of HTML as a consequence).
I think you have a couple of different issues here.
First, your selector is wrong -- the selector you have is trying to select an element called wu-value-to (which isn't a valid HTML element) inside something with class wu-value, inside a span which is a direct child of a div. To select an element with particular classes, you need no space between the element name and the class descriptors.
So your selector should probably be div > span.wu-value.wu-value-to. If your entire HTML is the part you showed, just 'span' would be enough, but I'm guessing you are being specific by specifying the parent and those classes for a reason.
Second, you are selecting the element, not its text content. You need your_node.text to get the text content.
Putting it together, you should be able to get what you want with this:
w_u_soup.select('div > span.wu-value.wu-value-to')[0].text
(select() returns a list of matches, so you need to index into it before reading .text.)
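As a minimal, self-contained sketch (the HTML snippet is the one from the question):
from bs4 import BeautifulSoup

snippet = '<div><span class="wu-value wu-value-to" _ngcontent-c19="">33</span></div>'
w_u_soup = BeautifulSoup(snippet, 'html.parser')

# select() returns a list; take the first match and read its text.
high_temp = w_u_soup.select('div > span.wu-value.wu-value-to')[0].text
print(high_temp)  # prints: 33
Beautiful Soup also offers select_one(), which returns the first match directly instead of a list.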

Scrapy XPath not extracting data [duplicate]

This is a problem I always have when getting a specific XPath with my browser.
Assume that I want to extract all the images from some websites, like Google Image Search or Pinterest. When I use Inspect Element and then Copy XPath to get the XPath for an image, it gives me something like the following:
//*[@id="rg_s"]/div[13]/a/img
I got this from an image in Google Search. When I wanted to use it in my spider, I used Selector and HtmlXPathSelector with the following XPaths, but none of them work:
//*[@id="rg_s"]/div/a/img
//div[@id="rg_s"]/div[13]/a/img
//[@class="rg_di rg_el"]/a/img  # I changed this based on the raw HTML of the page
# hxs.select(xpath).extract()
# Selector(response).xpath('xpath')
I've read many questions, but I couldn't find a general answer to how I can use XPaths obtained from a web browser in Scrapy.
Usually it is not safe or reliable to blindly follow the browser's suggestion about how to locate an element.
First of all, the XPath expressions that developer tools generate are usually absolute - starting from the parent of all parents, the html tag - which makes them more dependent on the page structure (well, Firebug can also build expressions based on id attributes).
Also, the HTML you see in the browser can be quite different from what Scrapy receives, due to the asynchronous nature of page loading and JavaScript being executed dynamically in the browser. Scrapy is not a browser and "sees" only the initial HTML code of a page, before the "dynamic" part.
Instead, inspect what Scrapy really has in the response: open up the Scrapy Shell, inspect the response and debug your XPath expressions and CSS selectors:
$ scrapy shell https://google.com
>>> response.xpath('//div[@id="myid"]')
...
Here is what I've got for the Google image search:
$ scrapy shell "https://www.google.com/search?q=test&tbm=isch&qscrl=1"
In [1]: response.xpath('//*[@id="ires"]//img/@src').extract()
Out[1]:
[u'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRO9ZkSuDqt0-CRhLrWhHAyeyt41Z5I8WhOhTkGCvjiHmRiTSvDBfHKYjx_',
u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQpwyzbW_qsRenDw3d4wwpwwm8n99ukMtLCVaPiTJxyviyQVBQeRCglVaY',
u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSrxtoY3-3QHwhjc5Ofx8090uDYI8VOUbi3gUrd9USxZ-Vb1D5pAbOzJLMS',
u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcTQO1A3dDJ07tIaFMHlXNOsOnpiY_srvHKJE1xOpsMZscjL3aKGxaGLOgru',
u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQ71ukeTGCPLuClWd6MetTtQ0-0mwzo3rn1ug0MUnbpXmKnwNuuBnSWXHU',
u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcRZmWrYR9A4W97jpjhtIbyUM5Lj3vRL0vgCKG_xfylc5wKFAk6UB8jiiKA',
...
u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcRj08jK8sBjX90Tu1RO4BfZkKe5A59U0g1TpMWPFZlNnA70SQ5i5DMJkvV0']
The XPath generated from an insertion point in a browser is bound to be brittle because there are many different possible XPath expressions to reach any given node, JavaScript can modify the HTML, and the browser doesn't know your intentions.
For the example you gave,
//*[@id="rg_s"]/div[13]/a/img
the 13th div is particularly prone to breakage.
Try instead to find a uniquely identifying characteristic closer to your target. A unique @id attribute would be ideal, or a @class that uniquely identifies your target or a close ancestor of your target can work well too.
For example, for Google Image Search, something like the following XPath
//div[@id='rg_s']//img[@class='rg_i']
will select all images of class rg_i within the div containing the search results.
If you're willing to abandon the copy-and-paste approach and learn enough XPath to generalize your selections, you'll get much better results. Of course, standard disclaimers apply about changes to presentation necessitating updating of scraping techniques too. Using a direct API call would be much more robust (and proper as well).
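As a minimal sketch of how such a robust XPath might be used in a spider (the spider name and start URL here are just placeholders):
import scrapy

class ImageSpider(scrapy.Spider):
    name = 'google_images'  # hypothetical name, for illustration only
    start_urls = ['https://www.google.com/search?q=test&tbm=isch']

    def parse(self, response):
        # Anchor on the stable container id instead of positional steps
        # like div[13], which break as soon as results are reordered.
        for src in response.xpath(
                "//div[@id='rg_s']//img[@class='rg_i']/@src").extract():
            yield {'image_src': src}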


What is the best practice for writing maintainable web scrapers?

I need to implement a few scrapers to crawl some web pages (because the site doesn't have an open API), extract information, and save it to a database. I am currently using Beautiful Soup to write code like this:
discount_price_text = soup.select("#detail-main del.originPrice")[0].string
discount_price = float(re.findall(r'[\d.]+', discount_price_text)[0])
I guess code like this can very easily become invalid when the web page is changed, even slightly.
How should I write scrapers less susceptible to these changes, other than writing regression tests to run regularly to catch failures?
In particular, is there any existing 'smart scraper' that can make 'best effort guess' even when the original xpath/css selector is no longer valid?
Pages have the potential to change so drastically that building a very "smart" scraper might be pretty difficult; and if it were possible, the scraper would be somewhat unpredictable, even with fancy techniques like machine learning. It's hard to make a scraper that has both trustworthiness and automated flexibility.
Maintainability is somewhat of an art-form centered around how selectors are defined and used.
In the past I have rolled my own "two stage" selectors:
(find) The first stage is highly inflexible and checks the structure of the page leading to a desired element. If the first stage fails, it throws some kind of "page structure changed" error.
(retrieve) The second stage then is somewhat flexible and extracts the data from the desired element on the page.
This allows the scraper to isolate itself from drastic page changes with some level of auto-detection, while still maintaining a level of trustworthy flexibility.
I have frequently used XPath selectors, and it is really quite surprising, with a little practice, how flexible you can be with a good selector while still being very accurate. I'm sure CSS selectors are similar. This gets easier the more semantic and "flat" the page design is.
A few important questions to answer are:
What do you expect to change on the page?
What do you expect to stay the same on the page?
When answering these questions, the more accurate you can be the better your selectors can become.
In the end, it's your choice how much risk you want to take and how trustworthy your selectors will be; when both finding and retrieving data on a page, how you craft them makes a big difference. Ideally, it's best to get data from a web API, which hopefully more sources will begin providing.
EDIT: Small example
Using your scenario, where the element you want is at .content > .deal > .tag > .price, the general .content .price selector is very "flexible" regarding page changes; but if, say, a false-positive element arises, we may want to avoid extracting from that new element.
Using two-stage selectors we can specify a less general, more inflexible first stage like .content > .deal, and then a second, more general stage like .price to retrieve the final element using a query relative to the results of the first.
So why not just use a selector like .content > .deal .price?
For my use, I wanted to be able to detect large page changes without running extra regression tests separately. I realized that rather than one big selector, I could write the first stage to include important page-structure elements. This first stage would fail (or report) if the structural elements no longer exist. Then I could write a second stage to more gracefully retrieve data relative to the results of the first stage.
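A minimal sketch of this two-stage idea in Python, using lxml with CSS selectors (the class names are the hypothetical ones from the scenario above):
from lxml import html

class PageStructureError(Exception):
    # Raised when the strict first-stage selector no longer matches.
    pass

def scrape_price(page_source):
    tree = html.fromstring(page_source)
    # Stage 1 (find): inflexible structural check; fail loudly on change.
    deals = tree.cssselect('.content > .deal')
    if not deals:
        raise PageStructureError('.content > .deal no longer matches')
    # Stage 2 (retrieve): flexible query relative to the stage-1 result.
    prices = deals[0].cssselect('.price')
    return prices[0].text_content().strip() if prices else None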
I shouldn't say that it's a "best" practice, but it has worked well.
Completely unrelated to Python and not auto-flexible, but I think the templates of my Xidel scraper have the best maintainability.
You would write it like:
<div id="detail-main">
<del class="originPrice">
{extract(., "[0-9.]+")}
</del>
</div>
Each element of the template is matched against the elements on the webpage, and if they are the same, the expressions inside {} are evaluated.
Additional elements on the page are ignored, so if you find the right balance of included elements and removed elements, the template will be unaffected by all minor changes.
Major changes, on the other hand, will trigger a matching failure, which is much better than XPath/CSS, which will just return an empty set. Then you only need to change the affected elements in the template; in the ideal case you could directly apply the diff between the old and changed page to the template. In any case, you do not need to search for which selector is affected or update multiple selectors for a single change, since the template can contain all the queries for a single page together.
EDIT: Oops, I now see you're already using CSS selectors. I think they provide the best answer to your question. So no, I don't think there is a better way.
However, sometimes you may find that it's easier to identify the data without the structure. For example, if you want to scrape prices, you can do a regular expression search matching the price (\$\s*[0-9.]+) instead of relying on the structure.
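For example (a minimal sketch; the HTML fragment is made up):
import re

fragment = '<span class="price">$ 19.99</span>'
match = re.search(r'\$\s*[0-9.]+', fragment)
if match:
    print(match.group())  # prints: $ 19.99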
Personally, the out-of-the-box web-scraping libraries that I've tried all leave something to be desired (mechanize, Scrapy, and others).
I usually roll my own, using:
urllib2 (standard library),
lxml and
cssselect
cssselect allows you to use CSS selectors (just like jQuery) to find specific divs, tables, etc. This proves to be really invaluable.
Example code to fetch the first question from the SO homepage:
import urllib2
import urlparse
import cookielib
from lxml import etree
from lxml.cssselect import CSSSelector

post_data = None
url = 'http://www.stackoverflow.com'

cookie_jar = cookielib.CookieJar()
http_opener = urllib2.build_opener(
    urllib2.HTTPCookieProcessor(cookie_jar),
    urllib2.HTTPSHandler(debuglevel=0),
)
http_opener.addheaders = [
    ('User-Agent', 'Mozilla/5.0 (X11; Linux i686; rv:25.0) Gecko/20100101 Firefox/25.0'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
]
fp = http_opener.open(url, post_data)

parser = etree.HTMLParser()
doc = etree.parse(fp, parser)

elem = CSSSelector('#question-mini-list > div:first-child > div.summary h3 a')(doc)
print elem[0].text
Of course you don't need the cookie jar or the user agent to emulate Firefox, yet I find that I regularly need these when scraping sites. (Note that this example is Python 2 code; in Python 3, urllib2 and cookielib became urllib.request and http.cookiejar.)
