Scrapy SgmlLinkExtractor how to define XPath - python

I want to retreive the cityname and citycode and store it in one string variable. The image shows the precise location:
Google Chrome gave me the following XPath:
//*[#id="page"]/main/div[4]/div[2]/div[1]/div/div/div[1]/div[2]/div/div[1]/div/a[1]/span
So I defined the following statement in scrapy to get the desired information:
plz = response.xpath('//*[#id="page"]/main/div[4]/div[2]/div[1]/div/div/div[1]/div[2]/div/div[1]/div/a[1]/span/text()').extract()
However I was not successful, the string remains empty. What XPath definition should I use instead?

Most of the time this occurs, this is because browsers correct invalid HTML. How do you fix this? Inspect the (raw) HTML source and write your own XPath that navigate the DOM with the shortest/simplest query.
I scrape a lot of data off of the web and I've never used an XPath as specific as the one you got from the browser. This is for a few reasons:
It will fail quickly on invalid HTML or the most basic of hierarchy changes.
It contains no identifying data for debugging an issue when the website changes.
It's way longer than it should be.
Here's an example (there are a lot of different XPath queries you could write to find this data, I'd suggest you learning and re-writing this query so there are common themes for XPath queries throughout your project) query for grabbing that element:
//div[contains(#class, "detail-address")]//h2/following-sibling::span
The other main source of this problem is sites that extensively rely on JS to modify what is shown on the screen. Conveniently, though, this would be debugged the same was as above. As soon as you glance at the HTML returned on page load, you would notice that the data you are querying doesn't exist until JS executes. At that point, you would need to do some sort of headless browsing.
Since my answer was essentially "write your own XPath" (rather than relying on the browser), I'll leave some sources:
basic XPath introduction
list of XPath functions
XPath Chrome extension

The DOM is manipulated by javascript, so what chrome shows is the xpath after
the all the stuff has happened.
If all you want is to get the cities, you can get it this way (using scrapy):
city_text = response.css('.detail-address span::text').extract_first()
city_code, city_name = city_text.split(maxsplit=1)
Or you can manipulate the JSON in CDATA to get all the data you need:
cdata_text = response.xpath('//*[#id="tdakv"]/text()').extract_first()
json_str = cdata_text.splitlines()[2]
json_str = json_str[json_str.find('{'):]
data = json.loads(json_str) # import json
city_code = data['kvzip']
city_name = data['kvplace']

Related

XPath returns no result for some elements with scrapy shell

I am using scrapy shell to extract data of the following web page:
https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html
Most data works, but there is a table in the lower part which content (the PZN e.g.) I seem not to be able to extract.
scrapy shell
fetch('https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html')
>>> response.xpath('//*[#id="accordionContent5e95408f73b10"]/div/table/tbody/tr[1]/td/text()').extract()
It returns: []
I also downloaded the page to view as scrapy sees it:
scrapy fetch --nolog https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html > test.html
Although it looks OK in HTML and although I can grab it in chrome, it does not work in scrapy shell.
How can I retrieve this data?
Problem You have encountered is that id 'accordionContent5e95408f73b10' is dynamically generated. So, id in Your browser and scrapy's response are different ones.
In common cases there is good workaround to write xpath with "substring search" (//*[contains(#id, 'accordionContent')]), but in this case there are a lot of such ids.
I can advise to write more complicated xpath.
//div[#id='accordion']/div[contains(#class, 'panel')][1]/div[contains(#id, 'accordionContent')]/div[#class='panel-body']/table/tbody/tr[1]/td
What this xpath do:
Find all "subpanels" with descriptions //div[#id='accordion']/div[contains(#class, 'panel')];
We get first "subpanel" (where PZN is located) and navigate into table with data: //div[#id='accordion']/div[contains(#class, 'panel')][1]/div[contains(#id, 'accordionContent')]/div[#class='panel-body']/table;
And last part is retrieving first tr's td.
By the way, xpath can be simplified to //div[#id='accordion']/div[contains(#class, 'panel')][1]//table/tbody/tr[1]/td . But i've written full xpath for more accurate understanding what we're navigating.

Selenium Python: clicking links produced by JSON application

[ Ed: Maybe I'm just asking this? Not sure -- Capture JSON response through Selenium ]
I'm trying to use Selenium (Python) to navigate via hyperlinks to pages in a web database. One page returns a table with hyperlinks that I want Selenium to follow. But the links do not appear in the page's source. The only html that corresponds to the table of interest is a tag indicating that the site is pulling results from a facet search. Within the div is a <script type="application/json"> tag and a handful of search options. Nothing else.
Again, I can view the hyperlinks in Firefox, but not using "View Page Source" or Selenium's selenium.webdriver.Firefox().page_source call. Instead, that call outputs not the <script> tag but a series of <div> tags that appear to define the results' format.
Is Selenium unable to navigate output from JSON applications? Or is there another way to capture the output of such applications? Thanks, and apologies for the lack of code/reproducibility.
Try using execute_script() and get the links by running JavaScript, something like:
driver.execute_script("document.querySelector('div#your-link-to-follow').click();")
Note: if the div are generated by scripts dynamically, you may want to implicitly wait a few seconds before executing the script.
I've confronted a similar situation on a website with JavaScript (http://ledextract.ces.census.gov to be specific). I had pretty good luck just using Selenium's get_element() methods. The key is that even if not everything about the hyperlinks appears in the page's source, Selenium will usually be able to find it by navigating to the website since doing that will engage the JavaScript that produces the additional links.
Thus, for example, you could try mousing over the links, finding their titles, and then using:
driver.find_element_by_xpath("//*[#title='Link Title']").click()
Based on whatever title appears by the link when you mouse over it.
Or, you may be able to find the links based on the text that appears on them:
driver.find_element_by_partial_link_text('Link Text').click()
Or, if you have a sense of the id for the links, you could use:
driver.find_element_by_id('Link_ID').click()
If you are at a loss for what the text, title, ID, etc. would be for the links you want, a somewhat blunt response is to try to pull the id, text, and title for every element off the website and then save that to a file that you can look for to identify likely candidates for the links you're wanting. That should show you a lot more (in some respects) than just the source code for the site would:
AllElements = driver.find_elements_by_xpath('//*')
for Element in AllElements:
print 'ID = %s TEXT = %s Title =%s' %(Element.get_attribute("id"), Element.get_attribute("text"), Element.get_attribute("title"))
Note: if you have or suspect you have a situation where you'll have multiple links with the same title/text, etc. then you may want to use the find_elements (plural) methods to get lists of all those satisfying your criteria, specify the xpath more explicitly, etc.

Scrapy Xpath not extraction data [duplicate]

This is a problem that I always have getting a specific XPath with my browser.
Assume that I want to extract all the images from some websites like Google Image Search or Pinterest. When I use Inspect element then use copy XPath to get the XPath for an image, it gives me some thing like following :
//*[#id="rg_s"]/div[13]/a/img
I got this from an image from Google Search. When I want to use it in my spider, I used Selector and HtmlXPathSelector with the following XPaths, but they all don't work!
//*[#id="rg_s"]/div/a/img
//div[#id="rg_s"]/div[13]/a/img
//[#class="rg_di rg_el"]/a/img #i change this based on the raw html of page
#hxs.select(xpath).extract()
#Selector(response).xpath('xpath')
.
.
I've read many questions, but I couldn't find a general answer to how I can use XPaths obtained from a web browser in Scrapy.
Usually it is not safe and reliable to blindly follow browser's suggestion about how to locate an element.
First of all, XPath expression that developer tools generate are usually absolute - starting from the the parent of all parents - html tag, which makes it being more dependant on the page structure (well, firebug can also make expressions based on id attributes).
Also, the HTML code you see in the browser can be pretty much different from what Scrapy receives due to asynchronous nature of the website page load and javascript being dynamically executed in the browser. Scrapy is not a browser and "sees" only the initial HTML code of a page, before the "dynamic" part.
Instead, inspect what Scrapy really has in the response: open up the Scrapy Shell, inspect the response and debug your XPath expressions and CSS selectors:
$ scrapy shell https://google.com
>>> response.xpath('//div[#id="myid"]')
...
Here is what I've got for the google image search:
$ scrapy shell "https://www.google.com/search?q=test&tbm=isch&qscrl=1"
In [1]: response.xpath('//*[#id="ires"]//img/#src').extract()
Out[1]:
[u'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRO9ZkSuDqt0-CRhLrWhHAyeyt41Z5I8WhOhTkGCvjiHmRiTSvDBfHKYjx_',
u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQpwyzbW_qsRenDw3d4wwpwwm8n99ukMtLCVaPiTJxyviyQVBQeRCglVaY',
u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSrxtoY3-3QHwhjc5Ofx8090uDYI8VOUbi3gUrd9USxZ-Vb1D5pAbOzJLMS',
u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcTQO1A3dDJ07tIaFMHlXNOsOnpiY_srvHKJE1xOpsMZscjL3aKGxaGLOgru',
u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQ71ukeTGCPLuClWd6MetTtQ0-0mwzo3rn1ug0MUnbpXmKnwNuuBnSWXHU',
u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcRZmWrYR9A4W97jpjhtIbyUM5Lj3vRL0vgCKG_xfylc5wKFAk6UB8jiiKA',
...
u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcRj08jK8sBjX90Tu1RO4BfZkKe5A59U0g1TpMWPFZlNnA70SQ5i5DMJkvV0']
The XPath generated from an insertion point in a browser is bound to be brittle because there are many different possible XPath expressions to reach any given node, JavaScript can modify the HTML, and the browser doesn't know your intentions.
For the example you gave,
//*[#id="rg_s"]/div[13]/a/img
the 13th div is particularly prone to breakage.
Try instead to find a uniquely identifying characteristic closer to your target. A unique #id attribute would be ideal, or a #class that uniquely identifies your target or a close ancestor of your target can work well too.
For example, for Google Image Search, something like the following XPath
//div[#id='rg_s']//img[#class='rg_i']"
will select all images of class rg_i within the div containing the search results.
If you're willing to abandon the copy-and-paste approach and learn enough XPath to generalize your selections, you'll get much better results. Of course, standard disclaimers apply about changes to presentation necessitating updating of scraping techniques too. Using a direct API call would be much more robust (and proper as well).

Extracting data from webpage using lxml XPath in Python

I am having some unknown trouble when using xpath to retrieve text from an HTML page from lxml library.
The page url is www.mangapanda.com/one-piece/1/1
I want to extract the selected chapter name text from the drop down select tag. Now I just want the first option so the XPath to find that is pretty easy. That is :-
.//*[#id='chapterMenu']/option[1]/text()
I verified the above using Firepath and it gives correct data. but when I am trying to use lxml for the purpose I get not data at all.
from lxml import html
import requests
r = requests.get("http://www.mangapanda.com/one-piece/1/1")
page = html.fromstring(r.text)
name = page.xpath(".//*[#id='chapterMenu']/option[1]/text()")
But in name nothing is stored. I even tried other XPath's like :-
//div/select[#id='chapterMenu']/option[1]/text()
//select[#id='chapterMenu']/option[1]/text()
The above were also verified using FirePath. I am unable to figure out what could be the problem. I would request some assistance regarding this problem.
But it is not that all aren't working. An xpath that working with lxml xpath here is :-
.//img[#id='img']/#src
Thank you.
I've had a look at the html source of that page and the content of the element with the id chapterMenu is empty.
I think your problem is that it is filled using javascript and javascript will not be automatically evaluated just by reading the html with lxml.html
You might want to have a look at this:
Evaluate javascript on a local html file (without browser)
Maybe you're able to trick it though... In the end, also javascript needs to fetch the information using a get request. In this case it requests: http://www.mangapanda.com/actions/selector/?id=103&which=191919
Which is json and can be easily turned into a python dict/array using the json library.
But you have to find out how to get the id and the which parameter if you want to automate this.
The id is part of the html, look for document['mangaid'] within one of the script tags and which can maybe stay 191919 has to be 0... although I couldn't find it in any source I found it, when it is 0 you will be redirected to the proper url.
So there you go ;)
The source document of the page you are requesting is in a default namespace:
<html xmlns="http://www.w3.org/1999/xhtml">
even if Firepath does not tell you about this. The proper way to deal with namespaces is to redeclare them in your code, which means associating them with a prefix and then prefixing element names in XPath expressions.
name = page.xpath('//*[#id='chapterMenu']/xhtml:option[1]/text()',
namespaces={'xhtml': 'http://www.w3.org/1999/xhtml'})
Then, the piece of the document the path expression above is concerned with is:
<select id="chapterMenu" name="chapterMenu"></select>
As you can see, there is no option element inside it. Please tell us what exactly you'd like to find.

Convert the XPath gotten from browser to usable XPath for Scrapy

This is a problem that I always have getting a specific XPath with my browser.
Assume that I want to extract all the images from some websites like Google Image Search or Pinterest. When I use Inspect element then use copy XPath to get the XPath for an image, it gives me some thing like following :
//*[#id="rg_s"]/div[13]/a/img
I got this from an image from Google Search. When I want to use it in my spider, I used Selector and HtmlXPathSelector with the following XPaths, but they all don't work!
//*[#id="rg_s"]/div/a/img
//div[#id="rg_s"]/div[13]/a/img
//[#class="rg_di rg_el"]/a/img #i change this based on the raw html of page
#hxs.select(xpath).extract()
#Selector(response).xpath('xpath')
.
.
I've read many questions, but I couldn't find a general answer to how I can use XPaths obtained from a web browser in Scrapy.
Usually it is not safe and reliable to blindly follow browser's suggestion about how to locate an element.
First of all, XPath expression that developer tools generate are usually absolute - starting from the the parent of all parents - html tag, which makes it being more dependant on the page structure (well, firebug can also make expressions based on id attributes).
Also, the HTML code you see in the browser can be pretty much different from what Scrapy receives due to asynchronous nature of the website page load and javascript being dynamically executed in the browser. Scrapy is not a browser and "sees" only the initial HTML code of a page, before the "dynamic" part.
Instead, inspect what Scrapy really has in the response: open up the Scrapy Shell, inspect the response and debug your XPath expressions and CSS selectors:
$ scrapy shell https://google.com
>>> response.xpath('//div[#id="myid"]')
...
Here is what I've got for the google image search:
$ scrapy shell "https://www.google.com/search?q=test&tbm=isch&qscrl=1"
In [1]: response.xpath('//*[#id="ires"]//img/#src').extract()
Out[1]:
[u'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRO9ZkSuDqt0-CRhLrWhHAyeyt41Z5I8WhOhTkGCvjiHmRiTSvDBfHKYjx_',
u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQpwyzbW_qsRenDw3d4wwpwwm8n99ukMtLCVaPiTJxyviyQVBQeRCglVaY',
u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSrxtoY3-3QHwhjc5Ofx8090uDYI8VOUbi3gUrd9USxZ-Vb1D5pAbOzJLMS',
u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcTQO1A3dDJ07tIaFMHlXNOsOnpiY_srvHKJE1xOpsMZscjL3aKGxaGLOgru',
u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQ71ukeTGCPLuClWd6MetTtQ0-0mwzo3rn1ug0MUnbpXmKnwNuuBnSWXHU',
u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcRZmWrYR9A4W97jpjhtIbyUM5Lj3vRL0vgCKG_xfylc5wKFAk6UB8jiiKA',
...
u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcRj08jK8sBjX90Tu1RO4BfZkKe5A59U0g1TpMWPFZlNnA70SQ5i5DMJkvV0']
The XPath generated from an insertion point in a browser is bound to be brittle because there are many different possible XPath expressions to reach any given node, JavaScript can modify the HTML, and the browser doesn't know your intentions.
For the example you gave,
//*[#id="rg_s"]/div[13]/a/img
the 13th div is particularly prone to breakage.
Try instead to find a uniquely identifying characteristic closer to your target. A unique #id attribute would be ideal, or a #class that uniquely identifies your target or a close ancestor of your target can work well too.
For example, for Google Image Search, something like the following XPath
//div[#id='rg_s']//img[#class='rg_i']"
will select all images of class rg_i within the div containing the search results.
If you're willing to abandon the copy-and-paste approach and learn enough XPath to generalize your selections, you'll get much better results. Of course, standard disclaimers apply about changes to presentation necessitating updating of scraping techniques too. Using a direct API call would be much more robust (and proper as well).

Categories

Resources