I need to implement a few scrapers to crawl some web pages (the site doesn't have an open API), extract information, and save it to a database. I am currently using Beautiful Soup to write code like this:
discount_price_text = soup.select("#detail-main del.originPrice")[0].string
discount_price = float(re.findall(r'[\d.]+', discount_price_text)[0])
I guess code like this can very easily become invalid when the web page is changed, even slightly.
How should I write scrapers less susceptible to these changes, other than writing regression tests to run regularly to catch failures?
In particular, is there any existing 'smart scraper' that can make 'best effort guess' even when the original xpath/css selector is no longer valid?
Pages can change so drastically that building a very "smart" scraper is difficult, and even where it is possible, such a scraper would be somewhat unpredictable, even with fancy techniques like machine learning. It's hard to make a scraper that is both trustworthy and automatically flexible.
Maintainability is somewhat of an art-form centered around how selectors are defined and used.
In the past I have rolled my own "two stage" selectors:
(find) The first stage is highly inflexible and checks the structure of the page toward a desired element. If the first stage fails, then it throws some kind of "page structure changed" error.
(retrieve) The second stage then is somewhat flexible and extracts the data from the desired element on the page.
This allows the scraper to isolate itself from drastic page changes with some level of auto-detection, while still maintaining a level of trustworthy flexibility.
I have frequently used XPath selectors, and it is really quite surprising, with a little practice, how flexible you can be with a good selector while still being very accurate. I'm sure CSS selectors are similar. This gets easier the more semantic and "flat" the page design is.
A few important questions to answer are:
What do you expect to change on the page?
What do you expect to stay the same on the page?
When answering these questions, the more accurate you can be the better your selectors can become.
In the end, it's your choice how much risk you want to take and how trustworthy your selectors need to be. How you craft them makes a big difference, both when finding and when retrieving data on a page. Ideally, it's best to get data from a web API, which hopefully more sources will begin providing.
EDIT: Small example
Using your scenario, where the element you want is at .content > .deal > .tag > .price, the general .content .price selector is very "flexible" regarding page changes; but if, say, a false positive element arises, we may desire to avoid extracting from this new element.
Using two-stage selectors we can specify a less general, more inflexible first stage like .content > .deal, and then a second, more general stage like .price to retrieve the final element using a query relative to the results of the first.
So why not just use a selector like .content > .deal .price?
For my use, I wanted to be able to detect large page changes without running extra regression tests separately. I realized that rather than one big selector, I could write the first stage to include important page-structure elements. This first stage would fail (or report) if the structural elements no longer exist. Then I could write a second stage to more gracefully retrieve data relative to the results of the first stage.
I shouldn't say that it's a "best" practice, but it has worked well.
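The two-stage idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree on a small, well-formed sample (a real page would need an HTML parser such as lxml or Beautiful Soup), and the names PageStructureChanged, find_then_retrieve and SAMPLE are hypothetical, not from any library.

```python
import xml.etree.ElementTree as ET


class PageStructureChanged(Exception):
    """Raised when the strict first-stage selector no longer matches."""


def find_then_retrieve(root, strict_path, loose_path):
    # Stage 1 (find): a strict path that encodes the structure we rely on.
    anchors = root.findall(strict_path)
    if not anchors:
        raise PageStructureChanged("no match for %r" % strict_path)
    # Stage 2 (retrieve): a looser search relative to each stage-1 match.
    results = []
    for anchor in anchors:
        for node in anchor.findall(loose_path):
            results.append((node.text or "").strip())
    return results


# Hypothetical page: one real price inside .content > .deal,
# plus a false-positive .price in a sidebar.
SAMPLE = """
<html><body>
  <div class="content">
    <div class="deal"><div class="tag"><span class="price">$12.50</span></div></div>
  </div>
  <div class="sidebar"><span class="price">$99.00</span></div>
</body></html>
"""

root = ET.fromstring(SAMPLE)
prices = find_then_retrieve(
    root,
    ".//div[@class='content']/div[@class='deal']",  # strict first stage
    ".//*[@class='price']",                         # flexible second stage
)
print(prices)  # prints ['$12.50']
```

If the site renames or moves the .deal container, the first stage raises instead of silently returning nothing, which is the "page structure changed" signal described above.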
Completely unrelated to Python and not auto-flexible, but I think the templates of my Xidel scraper have the best maintainability.
You would write it like:
<div id="detail-main">
<del class="originPrice">
{extract(., "[0-9.]+")}
</del>
</div>
Each element of the template is matched against the elements on the webpage, and if they are the same, the expressions inside {} are evaluated.
Additional elements on the page are ignored, so if you find the right balance of included elements and removed elements, the template will be unaffected by all minor changes.
Major changes, on the other hand, will trigger a matching failure, which is much better than XPath/CSS, which will just return an empty set. Then you can change just the changed elements in the template; in the ideal case you could directly apply the diff between the old and changed page to the template. In any case, you do not need to search for which selector is affected or update multiple selectors for a single change, since the template can contain all queries for a single page together.
EDIT: Oops, I now see you're already using CSS selectors. I think they provide the best answer to your question. So no, I don't think there is a better way.
However, sometimes you may find that it's easier to identify the data without the structure. For example, if you want to scrape prices, you can do a regular expression search matching the price (\$\s*[0-9.]+), instead of relying on the structure.
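A minimal sketch of that structure-free approach, assuming prices are written with a dollar sign followed by optional whitespace and a decimal number. The pattern, the find_prices helper and the sample markup are illustrative assumptions, not taken from any particular site.

```python
import re

# A dollar sign, optional whitespace, then digits with an optional decimal part.
PRICE_RE = re.compile(r"\$\s*([0-9]+(?:\.[0-9]+)?)")


def find_prices(html_text):
    """Return every price-looking number in the text, ignoring markup."""
    return [float(match) for match in PRICE_RE.findall(html_text)]


sample = '<del class="originPrice">$ 19.99</del> <span>Now $14.5</span>'
print(find_prices(sample))  # prints [19.99, 14.5]
```

The trade-off is that this finds every price on the page, so it works best when the page (or the fragment you search) contains only the prices you care about.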
Personally, the out-of-the-box web scraping libraries that I've tried (mechanize, Scrapy, and others) all leave something to be desired.
I usually roll my own, using:
urllib2 (standard library),
lxml and
cssselect
cssselect allows you to use CSS selectors (just like jQuery) to find specific div's, tables etcetera. This proves to be really invaluable.
Example code to fetch the first question from SO homepage:
import urllib2
import urlparse
import cookielib
from lxml import etree
from lxml.cssselect import CSSSelector
post_data = None
url = 'http://www.stackoverflow.com'
cookie_jar = cookielib.CookieJar()
http_opener = urllib2.build_opener(
    urllib2.HTTPCookieProcessor(cookie_jar),
    urllib2.HTTPSHandler(debuglevel=0),
)
http_opener.addheaders = [
    ('User-Agent', 'Mozilla/5.0 (X11; Linux i686; rv:25.0) Gecko/20100101 Firefox/25.0'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
]
fp = http_opener.open(url, post_data)
parser = etree.HTMLParser()
doc = etree.parse(fp, parser)
elem = CSSSelector('#question-mini-list > div:first-child > div.summary h3 a')(doc)
print elem[0].text
Of course you don't need the cookie jar, nor the User-Agent header to emulate Firefox; however, I find that I regularly need them when scraping sites.
I'm making a system, mostly in Python with Scrapy, in which I can find information about a specific product. The problem is that the request URL is huge. I was told that I should replace some parts of it with variables to reach the specific product I want to search for, but the URL has so many fields that I don't know how to do it.
e.g: "https://www.amazon.com.br/s?k=demi+lovato+365+dias+do+ano&adgrpid=86887777368&hvadid=392971063429&hvdev=c&hvlocphy=9047761&hvnetw=g&hvpos=1t1&hvqmt=e&hvrand=11390662277799676774&hvtargid=kwd-597187395757&hydadcr=5658_10696978&tag=hydrbrgk-20&ref=pd_sl_21pelgocuh_e%2Frobot.txt"
"demi+lovato+365+dias+do+ano" is the book title, but I can see a lot of information in the URL that I simply can't supply, and of course it changes from title to title. One solution I thought of was to POST the title I'm looking for to the search bar and find it on the results page, but I don't know if that's the best approach, since this is the first time I'm working with web scraping.
Does anyone have a tip on how I can do that? All I could find was how to scrape all products for price comparison, how to scrape specific information about all those products, and things like that, but nothing about searching for a specific product.
Thanks for any contributions. This is very important to me, and sorry for any mistakes; I'm not a very active user and I'm not a native English speaker.
Feel free to give me any advice about user behavior; getting better is always something I aim for.
You should use the rules available in the Scrapy framework. They let you define how to navigate the site and its sub-pages. Additionally, you can configure tags other than anchor tags, such as span or div, to look for the URL of a link. This way, the additional query parameters in the link are populated by the Scrapy session, since it emulates clicking on the hyperlinks. If you skip the additional query parameters in the URL, there is a high chance that you will be blocked.
How does scrapy use rules?
You don't need to follow that long link at all, often the different parameters are associated with your current session or settings/filters and you can keep only what you need.
Here is what I meant:
You can generate same result using these 2 urls:
https://www.amazon.com.br/s?k=demi+lovato+365+dias+do+ano
https://www.amazon.com.br/s?k=demi+lovato+365+dias+do+ano&adgrpid=86887777368&hvadid=392971063429&hvdev=c&hvlocphy=9047761&hvnetw=g&hvpos=1t1&hvqmt=e&hvrand=11390662277799676774&hvtargid=kwd-597187395757&hydadcr=5658_10696978&tag=hydrbrgk-20&ref=pd_sl_21pelgocuh_e%2Frobot.txt
If both links generate the same results, then that's it. Otherwise you will have to experiment with the different parameters, since you can't predict the website's behavior without actually testing it. If it turns out that many parameters are required, try something like:
from urllib.parse import quote_plus
base_url = "https://www.amazon.com.br"
title = "demi lovato 365 dias do ano"  # the book title to search for
# The search path is /s and the query string starts with "?".
link = base_url + "/s?k=%s&adgrpid=%s&hvadid=%s" % (quote_plus(title), '86887777368', '392971063429')
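To try out which parameters actually matter, one option is to strip the long URL down to a chosen set of query parameters and compare the results. The following is a rough sketch using only urllib.parse from the standard library; the helper name keep_params is hypothetical, and whether the site returns the same page for the shortened URL still has to be checked by hand.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def keep_params(url, wanted):
    """Rebuild url, keeping only the query parameters named in wanted."""
    parts = urlsplit(url)
    kept = [(key, value) for key, value in parse_qsl(parts.query) if key in wanted]
    # urlencode re-encodes spaces as "+", matching the original search URL.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


long_url = (
    "https://www.amazon.com.br/s?k=demi+lovato+365+dias+do+ano"
    "&adgrpid=86887777368&hvadid=392971063429&tag=hydrbrgk-20"
)
print(keep_params(long_url, {"k"}))
# prints https://www.amazon.com.br/s?k=demi+lovato+365+dias+do+ano
```

Adding parameter names back to the wanted set one at a time is a simple way to find the smallest URL that still produces the expected results.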
This is a problem that I always have getting a specific XPath with my browser.
Assume that I want to extract all the images from some websites like Google Image Search or Pinterest. When I use Inspect Element and then Copy XPath to get the XPath for an image, it gives me something like the following:
//*[@id="rg_s"]/div[13]/a/img
I got this from an image in Google Search. When I tried to use it in my spider, I used Selector and HtmlXPathSelector with the following XPaths, but none of them work!
//*[@id="rg_s"]/div/a/img
//div[@id="rg_s"]/div[13]/a/img
//[@class="rg_di rg_el"]/a/img #i change this based on the raw html of page
#hxs.select(xpath).extract()
#Selector(response).xpath('xpath')
.
.
I've read many questions, but I couldn't find a general answer to how I can use XPaths obtained from a web browser in Scrapy.
Usually it is not safe and reliable to blindly follow browser's suggestion about how to locate an element.
First of all, the XPath expressions that developer tools generate are usually absolute, starting from the parent of all parents (the html tag), which makes them more dependent on the page structure (well, Firebug can also make expressions based on id attributes).
Also, the HTML code you see in the browser can be pretty much different from what Scrapy receives due to asynchronous nature of the website page load and javascript being dynamically executed in the browser. Scrapy is not a browser and "sees" only the initial HTML code of a page, before the "dynamic" part.
Instead, inspect what Scrapy really has in the response: open up the Scrapy Shell, inspect the response and debug your XPath expressions and CSS selectors:
$ scrapy shell https://google.com
>>> response.xpath('//div[@id="myid"]')
...
Here is what I've got for the google image search:
$ scrapy shell "https://www.google.com/search?q=test&tbm=isch&qscrl=1"
In [1]: response.xpath('//*[@id="ires"]//img/@src').extract()
Out[1]:
[u'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRO9ZkSuDqt0-CRhLrWhHAyeyt41Z5I8WhOhTkGCvjiHmRiTSvDBfHKYjx_',
u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQpwyzbW_qsRenDw3d4wwpwwm8n99ukMtLCVaPiTJxyviyQVBQeRCglVaY',
u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSrxtoY3-3QHwhjc5Ofx8090uDYI8VOUbi3gUrd9USxZ-Vb1D5pAbOzJLMS',
u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcTQO1A3dDJ07tIaFMHlXNOsOnpiY_srvHKJE1xOpsMZscjL3aKGxaGLOgru',
u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQ71ukeTGCPLuClWd6MetTtQ0-0mwzo3rn1ug0MUnbpXmKnwNuuBnSWXHU',
u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcRZmWrYR9A4W97jpjhtIbyUM5Lj3vRL0vgCKG_xfylc5wKFAk6UB8jiiKA',
...
u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcRj08jK8sBjX90Tu1RO4BfZkKe5A59U0g1TpMWPFZlNnA70SQ5i5DMJkvV0']
The XPath generated from an insertion point in a browser is bound to be brittle because there are many different possible XPath expressions to reach any given node, JavaScript can modify the HTML, and the browser doesn't know your intentions.
For the example you gave,
//*[@id="rg_s"]/div[13]/a/img
the 13th div is particularly prone to breakage.
Try instead to find a uniquely identifying characteristic closer to your target. A unique @id attribute would be ideal, or a @class that uniquely identifies your target or a close ancestor of your target can work well too.
For example, for Google Image Search, something like the following XPath
//div[@id='rg_s']//img[@class='rg_i']
will select all images of class rg_i within the div containing the search results.
If you're willing to abandon the copy-and-paste approach and learn enough XPath to generalize your selections, you'll get much better results. Of course, standard disclaimers apply about changes to presentation necessitating updating of scraping techniques too. Using a direct API call would be much more robust (and proper as well).
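To make the brittleness of positional paths concrete, here is a small sketch comparing a positional expression with an attribute-based one. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, on two hypothetical well-formed snippets (BEFORE and AFTER); the markup is invented for illustration and is not Google's actual HTML.

```python
import xml.etree.ElementTree as ET

BEFORE = """<html><body><div id="rg_s">
  <div><a><img class="rg_i" src="a.jpg"/></a></div>
  <div><a><img class="rg_i" src="b.jpg"/></a></div>
</div></body></html>"""

# The same page after an unrelated div is inserted at the top.
AFTER = """<html><body><div id="rg_s">
  <div class="banner">Sponsored</div>
  <div><a><img class="rg_i" src="a.jpg"/></a></div>
  <div><a><img class="rg_i" src="b.jpg"/></a></div>
</div></body></html>"""

POSITIONAL = ".//*[@id='rg_s']/div[1]/a/img"        # browser-style, index based
BY_CLASS = ".//div[@id='rg_s']//img[@class='rg_i']"  # anchored on attributes


def srcs(markup, path):
    """Return the src of every img matched by path in the given markup."""
    root = ET.fromstring(markup)
    return [img.get("src") for img in root.findall(path)]


print(srcs(BEFORE, POSITIONAL))  # prints ['a.jpg']
print(srcs(AFTER, POSITIONAL))   # prints []
print(srcs(AFTER, BY_CLASS))     # prints ['a.jpg', 'b.jpg']
```

The positional path silently stops matching once a sibling is inserted, while the attribute-based path keeps returning the images.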
In using BeautifulSoup I am seeing many cases where the information sought is definitely in the HTML input yet BeautifulSoup fails to find it. This is a problem because there are cases where the information isn't there and so it is impossible to know if BeautifulSoup's search result is a case of it failing or a true case of the information simply not being there.
Here's a simple example:
url_obj = urllib2.urlopen(url)
html = url_obj.read()
url_obj.close()
parsed_html = BeautifulSoup(html)
html = parsed_html.find(id="SalesRank")
I've run tests with dozens of URLs of pages that do have this id and, to my dismay, get seemingly random results. Sometimes some of the URLs will produce a search hit and other times none.
In sharp contrast to this, if I run a simple string search I get the correct result every time:
url_obj = urllib2.urlopen(url)
html = url_obj.read()
url_obj.close()
index = html.find("SalesRank")
# Slice off a chunk-o-html from there
# Then use regex to grab what we are after
This works every time. The prior BeautifulSoup example fails in a seemingly random fashion, with the same URLs. What's alarming is that I can run the BeautifulSoup code twice in a row on the same set of URLs and get different responses. The simple string search code is 100% consistent and accurate in its results.
Is there a trick to setting up BeautifulSoup in order to ensure it is as consistent and reliable as a simple string search?
If not, is there an alternative library that is rock solid reliable and repeatable?
Nowadays, the page load gets more complex and often involves a series of asynchronous calls, a lot of client-side javascript logic, DOM manipulation etc. The page you see in the browser usually is not the page you get via requests or urllib2. Additionally, the site can have defensive mechanisms working, like, for example, it can check for the User-Agent header, ban your IP after multiple continuous requests etc. This is really web-site specific and there is no "silver bullet" here.
Besides, the way BeautifulSoup parses the page depends on the underlying parser. See: Differences between parsers.
The most reliable way to achieve "what you see in the browser is what you get in the code" is to use a real browser, headless or not. For example, the selenium package would be useful here.
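Before switching tools, it can help to tell apart the three cases raised in the question: the element was parsed, the marker is in the raw source but not as a parsed element id, or it is absent from the response altogether. Below is a minimal diagnostic sketch. It uses the standard library's html.parser in place of BeautifulSoup, and the names IdFinder and diagnose are hypothetical.

```python
from html.parser import HTMLParser


class IdFinder(HTMLParser):
    """Records whether any start tag carries the wanted id attribute."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == self.wanted_id:
            self.found = True


def diagnose(html, wanted_id):
    finder = IdFinder(wanted_id)
    finder.feed(html)
    if finder.found:
        return "parsed"          # the parser saw an element with this id
    if wanted_id in html:
        return "in-source-only"  # text is present, but not as a parsed id
    return "absent"              # the server did not send it at all


print(diagnose('<div id="SalesRank">#5</div>', "SalesRank"))         # prints parsed
print(diagnose('<script>var x="SalesRank"</script>', "SalesRank"))   # prints in-source-only
print(diagnose("<p>robot check</p>", "SalesRank"))                   # prints absent
```

Logging this result per URL, together with the raw response, makes it easier to see whether seemingly random misses come from the parser or from the server returning a different page.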
I am using BeautifulSoup and urllib2 for downloading HTML pages and parsing them. The problem is with malformed HTML pages. Though BeautifulSoup is good at handling malformed HTML, it's still not as good as Firefox.
Considering that Firefox and WebKit are more up to date and resilient at handling HTML, I think it would be ideal to use them to construct and normalize the DOM tree of a page and then manipulate it through Python.
However, I can't find any Python binding for this. Can anyone suggest a way?
I ran into some solutions that run a headless Firefox process and manipulate it through Python, but is there a more Pythonic solution available?
Perhaps pywebkitgtk would do what you need.
see http://wiki.python.org/moin/WebBrowserProgramming
there are quite a lot of options - i'm maintaining the page above so that i don't keep repeating myself.
you should look at pyjamas-desktop: see the examples/uitest example because we use exactly this trick to get copies of the HTML page "out", so that the python-to-javascript compiler can be tested by comparing the page results after each unit test.
each of the runtimes supported and used by pyjamas-desktop is capable of allowing access to the "innerHTML" property of the document's body element (and a hell of a lot more).
bottom line: it is trivial to do what you want to do, but you have to know where to look to find out how to do it.
l.
You might like PyWebkitDFB from http://www.gnu.org/software/pythonwebkit/