I tried to learn response.xpath and response.css using the site: http://quotes.toscrape.com/
scrapy shell 'http://quotes.toscrape.com'
for quote in response.css("div.quote"):
    title = quote.css("span.text::text").extract()
This gets only one value per iteration: the text of the current quote.
but if I use xpath:
scrapy shell 'http://quotes.toscrape.com'
for quote in response.css("div.quote"):
    title = quote.xpath('//*[@class="text"]/text()').extract()
This gets a list of all the titles on the whole page.
Can someone explain the difference between the two tools? For some elements I prefer response.xpath; specific table content, for example, is easy to get with following-sibling, which response.css cannot do.
For a general explanation of the difference between XPath and CSS see the Scrapy docs:
Scrapy comes with its own mechanism for extracting data. They’re
called selectors because they “select” certain parts of the HTML
document specified either by XPath or CSS expressions.
XPath is a language for selecting nodes in XML documents, which can
also be used with HTML. CSS is a language for applying styles to HTML
documents. It defines selectors to associate those styles with
specific HTML elements.
XPath offers more features than pure CSS selection (the Wikipedia article gives a nice overview), at the cost of being harder to learn. Scrapy converts CSS selectors to XPath internally, so the .css() function is basically syntactic sugar for .xpath() and you can use whichever one you feel more comfortable with.
Regarding your specific examples, I think the problem is that your XPath query is not actually relative to the previous selector (the quote div), but absolute to the whole document. See this quote from "Working with relative XPaths" in the Scrapy docs:
Keep in mind that if you are nesting selectors and use an XPath that
starts with /, that XPath will be absolute to the document and not
relative to the Selector you’re calling it from.
To get the same result as with your CSS selector you could use something like this, where the XPath query is relative to the quote div:
for quote in response.css('div.quote'):
    print(quote.xpath('span[@class="text"]/text()').extract())
Note that XPath also has the . expression to make any query relative to the current node, but I'm not sure how Scrapy implements this (using './/*[@class="text"]/text()' also gives the result you want).
I'm working on some automated actions for Instagram using Python and Selenium and sometimes my code crashes because of a NoSuchElementException.
For example, when I first wrote a function for unfollowing a user, I used something like:
following_xpath = "//*[@id='react-root']/section/main/div/header/section/div[1]/div[2]/div/span/span[1]/button"
After running it a few times, my code crashed because it couldn't find the element, so upon inspecting the page I found out that the XPath is now:
following_xpath = "//*[@id='react-root']/section/main/div/header/section/div[2]/div/div/div[2]/div/span/span[1]/button"
There's a small difference: div[1]/div[2]/div became div[2]/div/div/div[2]. So I have two questions:
Why does this happen?
Is there a bulletproof method that guarantees I will always be getting the right XPath (or element)?
It's high time we bust the myth that XPaths change on their own.
Locator strategies, e.g. XPath and CSS selectors, are derived by the user, and the more canonically the locators are constructed, the more durable they are.
XML Path Language (XPath)
XPath 3.1 is an expression language that allows the processing of values conforming to the data model defined in XQuery and XPath Data Model (XDM) 3.1. The name of the language derives from its most distinctive feature, the path expression, which provides a means of hierarchic addressing of the nodes in an XML tree. As well as modeling the tree structure of XML, the data model also includes atomic values, function items, and sequences. This version of XPath supports JSON as well as XML, adding maps and arrays to the data model and supporting them with new expressions in the language and new functions in XQuery and XPath Functions and Operators 3.1.
Selectors
CSS (Cascading Style Sheets) is a language for describing the rendering of HTML and XML documents on screen, on paper, in speech, etc. CSS uses Selectors for binding style properties to elements in the document. These expressions can also be used, for instance, to select a set of elements, or a single element from a set of elements, by evaluating the expression across all the elements in a subtree.
This use case
As per your code trials:
following_xpath = "//*[@id='react-root']/section/main/div/header/section/div[1]/div[2]/div/span/span[1]/button"
and
following_xpath = "//*[@id='react-root']/section/main/div/header/section/div[2]/div/div/div[2]/div/span/span[1]/button"
Here are a couple of takeaways:
The DOM Tree contains React elements. So it is quite clear that the app uses ReactJS. React is a declarative, efficient, and flexible JavaScript library for building user interfaces. It lets you compose complex UIs from small and isolated pieces of code called components.
The XPaths are absolute XPaths.
The XPaths contain indexes.
So the application is dynamic in nature, and elements are liable to be added and moved within the HTML DOM whenever a DOM event fires.
Solution
In such cases, when the application is based on any of:
JavaScript
Angular
ReactJS
jQuery
AJAX
Vue.js
Ember.js
GWT
The canonical approach is to construct relative and/or dynamic locators combined with WebDriverWait. Some examples:
To interact with the username field on the instagram login page (the imports are shown once here; the examples below use the same ones):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='username']"))).send_keys("anon")
You can find a detailed discussion in Filling in login forms in Instagram using selenium and webdriver (chrome) python OSX
To locate the first line of the address just below the text as FIND US on facebook:
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//span[normalize-space()='FIND US']//following::span[2]")))
You can find a detailed discussion in Decoding Class names on facebook through Selenium
Interacting with GWT elements:
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@title='Viewers']//preceding::span[1]//label"))).click()
You can find a detailed discussion in How to click on GWT enabled elements using Selenium and Python
The answer to (1) is simple: the page content has changed.
Firstly, the notion that there is "an XPath" for every element in a document is wrong: there are many (an infinite number) of XPath expressions that will select a given element. You've probably generated these XPaths using a tool that tries to give you what it considers the most useful XPath expression, but it's not the only one possible.
The best XPath expression to use is one that isn't going to change when the content of the page changes: but it's very hard for any tool to give you that, because it has no idea what's likely to change in the page content.
Using an @id attribute value (which these paths do) is more likely to be stable than using numeric indexing (which these paths also do), but that's based on guesses about what's likely to change, and those guesses can always be wrong. The only way of writing an XPath expression that continues to do "the right thing" when the page changes is to correctly guess which aspects of the page structure are going to vary and which parts are going to remain stable. So the only "bulletproof" answer to (2) is to understand not just the current page structure, but its invariants over time.
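To make concrete the point that there are many XPath expressions for one element, here is a small lxml sketch (the fragment is invented, loosely shaped like the follow-button markup above):

```python
from lxml import etree

# A toy fragment: one button nested a few levels deep.
doc = etree.fromstring(
    "<section><div id='follow-area'><span>"
    "<button>Following</button>"
    "</span></div></section>"
)

expressions = [
    "//button",                        # shortest possible
    "//*[@id='follow-area']//button",  # anchored on a (hopefully stable) id
    "/section/div/span/button",        # fully absolute, structure-dependent
]

# All three resolve to the same node; getpath() gives its canonical location.
paths = {doc.getroottree().getpath(doc.xpath(e)[0]) for e in expressions}
print(paths)  # one canonical path, reached by three different routes
```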
I've been trying to get a full text hosted inside a <div> element from the web page https://www.list-org.com/company/11665809.
The element should contain a sub-string "Арбитраж".
And it does, because my code
for div in tree.xpath('.//div[contains(text(), "Арбитраж")]'):
    print(div)
returns response
<Element div at 0x15480d93ac8>
But when I try to get the full text itself using div.text, it returns None, which seems strange to me.
What should I do?
Any help would be greatly appreciated, as would advice on a source for learning HTML basics (I'm not a savvy programmer) so I can avoid such easy questions in the future.
This is one of these strange things that happens when xpath is handled by a host language and library.
When you use the xpath expression
.//div[contains(text(), "Арбитраж")]
the search is performed according to xpath rules, which considers the target text as contained within the target div.
When you go on to the next line:
print(div.text)
you are using lxml.html, which apparently doesn't regard the target text as part of the div text, because it's preceded by the <i> tag. To get to it, with lxml.html, you have to use:
print(div.text_content())
or with xpath only:
print(tree.xpath('.//div[contains(text(), "Арбитраж")]/text()')[0])
It seems lxml.etree and beautifulsoup use different approaches. See this interesting discussion here.
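A self-contained lxml sketch of the same effect (the fragment is invented, mirroring the structure on that page, where the target text follows an <i> tag):

```python
from lxml import html

# The target text comes after an <i> child, so lxml stores it as the
# <i> element's .tail, not as the <div>'s .text.
tree = html.fromstring(
    '<html><body><div><i>(icon)</i>Арбитраж: нет дел</div></body></html>'
)
div = tree.xpath('.//div[contains(text(), "Арбитраж")]')[0]

print(div.text)            # None - there is no text before the first child
print(div.text_content())  # the full text of the div, children included
print(div.xpath('text()')) # the div's text nodes as XPath sees them
```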
Is there a way to scrape CSS values while scraping with the Python Scrapy framework, or with PHP scraping?
Any help will be appreciated.
scrapy.Selector allows you to use xpath to extract properties of HTML elements including CSS.
e.g. https://github.com/okfde/odm-datenerfassung/blob/master/crawl/dirbot/spiders/data.py#L83
(look around that code for how it fits into an entire scrapy spider)
If you don't need web crawling, just HTML parsing, you can use XPath directly from lxml in Python. Another example:
https://github.com/codeformunich/feinstaubbot/blob/master/feinstaubbot.py
Finally, to get at the CSS from XPath, I only know how to do it via css = element.attrib['style'] - this gives you everything inside the style attribute, which you can then split by e.g. css.split(';'), and each of those by ':'.
It wouldn't surprise me if someone has a better suggestion. A little knowledge is enough to do a lot of scraping and that's how I would approach it based on previous projects.
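A rough lxml sketch of that style-attribute split (the markup is made up; note that real style values can themselves contain ';' or ':', e.g. inside url(...) values, so this is only a naive parser):

```python
from lxml import html

doc = html.fromstring('<p style="color: red; font-size: 12px">hello</p>')

# Pull the raw style attribute off the element...
style = doc.xpath('//p')[0].attrib.get('style', '')

# ...and split it into a property -> value mapping.
css = {
    prop.strip(): value.strip()
    for prop, value in (rule.split(':', 1) for rule in style.split(';') if rule.strip())
}

print(css)  # {'color': 'red', 'font-size': '12px'}
```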
Yes, please check the documentation for selectors; basically you have two methods: response.xpath() for XPath and response.css() for CSS selectors. For example, to get a title's text you could do either of the following:
response.xpath('//title/text()').extract_first()
response.css('title::text').extract_first()
This is a problem that I always have getting a specific XPath with my browser.
Assume that I want to extract all the images from some websites like Google Image Search or Pinterest. When I use Inspect element and then copy XPath to get the XPath for an image, it gives me something like the following:
//*[@id="rg_s"]/div[13]/a/img
I got this from an image in Google Search. When I wanted to use it in my spider, I tried Selector and HtmlXPathSelector with the following XPaths, but none of them work!
//*[@id="rg_s"]/div/a/img
//div[@id="rg_s"]/div[13]/a/img
//*[@class="rg_di rg_el"]/a/img  # I change this based on the raw HTML of the page
#hxs.select(xpath).extract()
#Selector(response).xpath('xpath')
I've read many questions, but I couldn't find a general answer to how I can use XPaths obtained from a web browser in Scrapy.
Usually it is not safe or reliable to blindly follow a browser's suggestion about how to locate an element.
First of all, the XPath expressions that developer tools generate are usually absolute - starting from the parent of all parents, the html tag - which makes them more dependent on the page structure (though Firebug can also make expressions based on id attributes).
Also, the HTML code you see in the browser can be quite different from what Scrapy receives, due to the asynchronous nature of page loading and JavaScript being executed dynamically in the browser. Scrapy is not a browser and "sees" only the initial HTML code of a page, before the "dynamic" part.
Instead, inspect what Scrapy really has in the response: open up the Scrapy Shell, inspect the response and debug your XPath expressions and CSS selectors:
$ scrapy shell https://google.com
>>> response.xpath('//div[@id="myid"]')
...
Here is what I've got for the google image search:
$ scrapy shell "https://www.google.com/search?q=test&tbm=isch&qscrl=1"
In [1]: response.xpath('//*[@id="ires"]//img/@src').extract()
Out[1]:
[u'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRO9ZkSuDqt0-CRhLrWhHAyeyt41Z5I8WhOhTkGCvjiHmRiTSvDBfHKYjx_',
u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQpwyzbW_qsRenDw3d4wwpwwm8n99ukMtLCVaPiTJxyviyQVBQeRCglVaY',
u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSrxtoY3-3QHwhjc5Ofx8090uDYI8VOUbi3gUrd9USxZ-Vb1D5pAbOzJLMS',
u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcTQO1A3dDJ07tIaFMHlXNOsOnpiY_srvHKJE1xOpsMZscjL3aKGxaGLOgru',
u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQ71ukeTGCPLuClWd6MetTtQ0-0mwzo3rn1ug0MUnbpXmKnwNuuBnSWXHU',
u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcRZmWrYR9A4W97jpjhtIbyUM5Lj3vRL0vgCKG_xfylc5wKFAk6UB8jiiKA',
...
u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcRj08jK8sBjX90Tu1RO4BfZkKe5A59U0g1TpMWPFZlNnA70SQ5i5DMJkvV0']
The XPath generated from an insertion point in a browser is bound to be brittle because there are many different possible XPath expressions to reach any given node, JavaScript can modify the HTML, and the browser doesn't know your intentions.
For the example you gave,
//*[@id="rg_s"]/div[13]/a/img
the 13th div is particularly prone to breakage.
Try instead to find a uniquely identifying characteristic closer to your target. A unique @id attribute would be ideal; a @class that uniquely identifies your target, or a close ancestor of your target, can work well too.
For example, for Google Image Search, something like the following XPath
//div[@id='rg_s']//img[@class='rg_i']
will select all images of class rg_i within the div containing the search results.
If you're willing to abandon the copy-and-paste approach and learn enough XPath to generalize your selections, you'll get much better results. Of course, standard disclaimers apply about changes to presentation necessitating updating of scraping techniques too. Using a direct API call would be much more robust (and proper as well).
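The difference in fragility can be sketched with lxml on two invented versions of a result grid, where the second version has lost a leading div (e.g. an injected ad that was removed):

```python
from lxml import etree

# Version 1: an extra div precedes the result; version 2: it was removed.
v1 = etree.fromstring(
    '<div id="rg_s"><div class="ad"/>'
    '<div class="rg_di"><a><img class="rg_i" src="a.png"/></a></div></div>'
)
v2 = etree.fromstring(
    '<div id="rg_s">'
    '<div class="rg_di"><a><img class="rg_i" src="a.png"/></a></div></div>'
)

brittle = '//*[@id="rg_s"]/div[2]/a/img'          # index-based, like the copied XPath
robust = '//div[@id="rg_s"]//img[@class="rg_i"]'  # class-based

# The index-based path silently stops matching when the layout shifts.
print(len(v1.xpath(brittle)), len(v2.xpath(brittle)))  # 1 0
print(len(v1.xpath(robust)), len(v2.xpath(robust)))    # 1 1
```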