I want to scrape http://www.spyfu.com/overview/url?query=http%3A%2F%2Fwww.veldemangroup.com%2Fen for the text elements under "organic keywords"; the first one would be "warehouse structure".
Working in Python with Scrapy and the command-line shell. Trying:
response.xpath("//a[@data-reactid='.0.0.0.0.0.1.0.1.0']")
just returns [] - why is that, and how do I get the correct ("warehouse structure") text?
The site you mention generates its content dynamically with JavaScript. You can check by running scrapy shell "http://www.spyfu.com/overview/url?query=http%3A%2F%2Fwww.veldemangroup.com%2Fen" and then inspecting response.body: it is mostly JavaScript, and the selector you are trying (like most others on the rendered page) is not there, so Scrapy cannot find it by itself.
Please try Selenium instead: it does not issue a plain request the way Scrapy does; e.g. the Firefox webdriver can read the site the way a browser sees it.
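A minimal sketch of that approach (the wait condition and the a[data-reactid] selector are assumptions; inspect the rendered DOM and adjust them):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
try:
    driver.get("http://www.spyfu.com/overview/url"
               "?query=http%3A%2F%2Fwww.veldemangroup.com%2Fen")
    # wait until the JavaScript has rendered the keyword links (selector is a guess)
    links = WebDriverWait(driver, 15).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "a[data-reactid]")))
    print(links[0].text)  # e.g. "warehouse structure"
finally:
    driver.quit()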
I am using scrapy shell to extract data from the following web page:
https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html
Most data works, but there is a table in the lower part whose content (the PZN, e.g.) I do not seem to be able to extract.
scrapy shell
fetch('https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html')
>>> response.xpath('//*[@id="accordionContent5e95408f73b10"]/div/table/tbody/tr[1]/td/text()').extract()
It returns: []
I also downloaded the page to view it as scrapy sees it:
scrapy fetch --nolog https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html > test.html
Although the table looks OK in the saved HTML, and although I can grab it in Chrome, it does not work in scrapy shell.
How can I retrieve this data?
The problem you have encountered is that the id 'accordionContent5e95408f73b10' is dynamically generated, so the id in your browser and the one in Scrapy's response are different.
In common cases a good workaround is to write the XPath with a substring match (//*[contains(@id, 'accordionContent')]), but in this case there are a lot of such ids.
I would advise writing a more specific XPath:
//div[@id='accordion']/div[contains(@class, 'panel')][1]/div[contains(@id, 'accordionContent')]/div[@class='panel-body']/table/tbody/tr[1]/td
What this XPath does:
Finds all "subpanels" with descriptions: //div[@id='accordion']/div[contains(@class, 'panel')];
Takes the first "subpanel" (where the PZN is located) and navigates into the table with the data: //div[@id='accordion']/div[contains(@class, 'panel')][1]/div[contains(@id, 'accordionContent')]/div[@class='panel-body']/table;
Finally retrieves the first tr's td.
By the way, the XPath can be simplified to //div[@id='accordion']/div[contains(@class, 'panel')][1]//table/tbody/tr[1]/td, but I've written the full XPath for a more accurate understanding of what we're navigating.
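For example, you could verify it in scrapy shell like this (.get() returns the first match, or None):

fetch('https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html')
response.xpath("//div[@id='accordion']/div[contains(@class, 'panel')][1]//table/tbody/tr[1]/td/text()").get()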
I'm using Scrapy to crawl a webpage. I'm interested in recovering a "complex" URL from this source code:
<a href="/searchresults.ja.html?label=gen173nr-1DCAQoggJCEHNlYXJjaF_lpKfpmKrluIJIFVgEaE2IAQGYARW4ARfIAQzYAQPoAQH4AQKIAgGoAgO4AqXw-usFwAIB&sid=99d1716767a9d25ee820122238489b00&tmpl=searchresults&checkin_year_month_monthday=2019-10-15&checkout_year_month_monthday=2019-10-16&city=-240905&class_interval=1&dest_id=-240905&dest_type=city&dtdisc=0&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&no_rooms=1&postcard=0&room1=A%2CA&sb_price_type=total&shw_aparth=1&slp_r_match=0&srpvid=87de9c92c893006c&ss=%E5%A4%A7%E9%98%AA%E5%B8%82&ss_all=0&ssb=empty&sshis=0&top_ufis=1&rows=20&offset=20" data-page-next class="bui-pagination__link paging-next ga_sr_gotopage_2_85" title="次のページ">
The XPath command I use is:
next_page = response.xpath('//a[starts-with(@class,"bui-pagination__link paging-next")]/@href').extract()
However, I get only "/searchresults.ja.html" - everything after ".html" is dropped. I'm not interested in recovering the domain name, but in the complex part after the ".html?".
What I would like to have is
/searchresults.ja.html?label=gen173nr-1DCAQoggJCEHNlYXJjaF_lpKfpmKrluIJIFVgEaE2IAQGYARW4ARfIAQzYAQPoAQH4AQKIAgGoAgO4AqXw-usFwAIB&sid=99d1716767a9d25ee820122238489b00&tmpl=searchresults&checkin_year_month_monthday=2019-10-15&checkout_year_month_monthday=2019-10-16&city=-240905&class_interval=1&dest_id=-240905&dest_type=city&dtdisc=0&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&no_rooms=1&postcard=0&room1=A%2CA&sb_price_type=total&shw_aparth=1&slp_r_match=0&srpvid=87de9c92c893006c&ss=%E5%A4%A7%E9%98%AA%E5%B8%82&ss_all=0&ssb=empty&sshis=0&top_ufis=1&rows=20&offset=20
Do you know what I should do?
By the way, the page is this one, and I'm trying to get the "next page" of results, at the bottom.
The website is using JavaScript to render the next URL. The easiest way to check whether you can scrape anything directly, without JavaScript rendering, is to run scrapy shell 'website' in your terminal (navigate to the directory of your Scrapy spider and execute the command there).
This opens the response of the website in your terminal. You can then type commands and inspect what the response contains. In your case, the command will be:
response.css(".bui-pagination__item.sr_pagination_item a").getall()
Or
response.css(".bui-pagination__item.bui-pagination__next-arrow a::attr(href)").getall()
As you can see, the links are not complete compared to the one in your question. This proves that the link you're trying to extract cannot be extracted the straightforward way. You can use Splash (for JS rendering), or manually inspect the request the page makes and duplicate it with Scrapy's Request class.
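For the Splash route, here is a minimal sketch, assuming a Splash instance is running locally and the scrapy-splash package's middlewares are enabled in settings.py (the spider name and start URL below are placeholders):

import scrapy
from scrapy_splash import SplashRequest

class NextPageSpider(scrapy.Spider):
    name = "next_page"  # hypothetical spider name

    def start_requests(self):
        # placeholder URL - use the actual search-results page you are crawling
        yield SplashRequest(
            "https://www.booking.com/searchresults.ja.html",
            callback=self.parse,
            args={"wait": 2},  # give the page's JavaScript time to build the hrefs
        )

    def parse(self, response):
        # same selector as above, but now run against the rendered DOM
        next_href = response.css(
            "a.bui-pagination__link.paging-next::attr(href)").get()
        self.logger.info("next page: %s", next_href)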
I want to get all divs that have category class.
Take a look at this page: www.postkhmer.com/ព័ត៌មានជាតិ
In scrapy shell: scrapy shell 'www.postkhmer.com/ព័ត៌មានជាតិ'
Running response.xpath('//div[@class="category"]') there returns only 2 elements.
scrapy fetch --nolog http://www.postkhmer.com/ព័ត៌មានជាតិ > page.html
scrapy shell ./page.html
response.xpath('//div[@class="category"]')
Still only 2 elements back. But when I open page.html in Sublime, I get 15 matches.
The most interesting part: when I remove the anchor link from the 2nd category element
and run response.xpath('//div[@class="category"]') in the scrapy shell again, I get 3 elements.
What is going on?! Can someone help me solve this problem, please?
I've uploaded the file here in case you want to test locally.
Only 2 things can be happening here: either the HTML is malformed and Scrapy can't parse it, or there's some trouble with Scrapy and encoding. I think the first is more likely; http://www.freeformatter.com/html-validator.html kind of gives it away.
Since the page works in Chrome, what I would suggest is using Selenium to let the browser fix the markup and scrape the elements from there. I didn't test it, but scrapy-splash may have the same effect.
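A small sketch of that idea: let the browser parse (and repair) the markup, then run the same XPath over its serialized DOM with a Scrapy Selector. Assumes Selenium and a Firefox driver are installed.

from selenium import webdriver
from scrapy.selector import Selector

driver = webdriver.Firefox()
try:
    driver.get("http://www.postkhmer.com/ព័ត៌មានជាតិ")
    # the browser's parser fixes the malformed HTML before we read it back
    sel = Selector(text=driver.page_source)
    print(len(sel.xpath('//div[@class="category"]')))  # should now be all 15
finally:
    driver.quit()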
When you save the page into a local file page.html, you skip the HTTP header that contains the encoding information.
Later on, when you open this file, whether with Scrapy or Sublime, it has no clue what the original encoding of the document was.
Recommendation: never use documents saved to a file for parsing.
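If you do need a local copy for debugging, a sketch that stores the encoding alongside the bytes (file names are arbitrary):

# in scrapy shell, after fetch(...): keep the raw bytes plus the encoding
# Scrapy derived from the HTTP headers
with open("page.html", "wb") as f:
    f.write(response.body)
with open("page.encoding.txt", "w") as f:
    f.write(response.encoding)

# later: rebuild a Selector that decodes the file the same way
from scrapy.selector import Selector
encoding = open("page.encoding.txt").read().strip()
sel = Selector(text=open("page.html", "rb").read().decode(encoding))
sel.xpath('//div[@class="category"]')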
This is a problem that I always have getting a specific XPath with my browser.
Assume that I want to extract all the images from some website like Google Image Search or Pinterest. When I use Inspect Element and then Copy XPath to get the XPath for an image, it gives me something like the following:
//*[@id="rg_s"]/div[13]/a/img
I got this from an image in a Google search. When I wanted to use it in my spider, I tried Selector and HtmlXPathSelector with the following XPaths, but none of them work:
//*[@id="rg_s"]/div/a/img
//div[@id="rg_s"]/div[13]/a/img
//[@class="rg_di rg_el"]/a/img  # I change this based on the raw HTML of the page
#hxs.select(xpath).extract()
#Selector(response).xpath('xpath')
I've read many questions, but I couldn't find a general answer to how I can use XPaths obtained from a web browser in Scrapy.
Usually it is not safe or reliable to blindly follow the browser's suggestion about how to locate an element.
First of all, the XPath expressions that developer tools generate are usually absolute - starting from the parent of all parents, the html tag - which makes them highly dependent on the page structure (Firebug can also build expressions based on id attributes, though).
Also, the HTML code you see in the browser can be quite different from what Scrapy receives, due to the asynchronous nature of page loading and JavaScript being executed dynamically in the browser. Scrapy is not a browser and "sees" only the initial HTML code of a page, before the "dynamic" part.
Instead, inspect what Scrapy really has in the response: open up the Scrapy Shell, inspect the response and debug your XPath expressions and CSS selectors:
$ scrapy shell https://google.com
>>> response.xpath('//div[@id="myid"]')
...
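If a selector comes back empty, the shell's built-in view() helper opens the exact HTML Scrapy received in your local browser, which makes JavaScript-injected content easy to spot:

>>> view(response)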
Here is what I've got for the google image search:
$ scrapy shell "https://www.google.com/search?q=test&tbm=isch&qscrl=1"
In [1]: response.xpath('//*[@id="ires"]//img/@src').extract()
Out[1]:
[u'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRO9ZkSuDqt0-CRhLrWhHAyeyt41Z5I8WhOhTkGCvjiHmRiTSvDBfHKYjx_',
u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQpwyzbW_qsRenDw3d4wwpwwm8n99ukMtLCVaPiTJxyviyQVBQeRCglVaY',
u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSrxtoY3-3QHwhjc5Ofx8090uDYI8VOUbi3gUrd9USxZ-Vb1D5pAbOzJLMS',
u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcTQO1A3dDJ07tIaFMHlXNOsOnpiY_srvHKJE1xOpsMZscjL3aKGxaGLOgru',
u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQ71ukeTGCPLuClWd6MetTtQ0-0mwzo3rn1ug0MUnbpXmKnwNuuBnSWXHU',
u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcRZmWrYR9A4W97jpjhtIbyUM5Lj3vRL0vgCKG_xfylc5wKFAk6UB8jiiKA',
...
u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcRj08jK8sBjX90Tu1RO4BfZkKe5A59U0g1TpMWPFZlNnA70SQ5i5DMJkvV0']
The XPath generated from an insertion point in a browser is bound to be brittle because there are many different possible XPath expressions to reach any given node, JavaScript can modify the HTML, and the browser doesn't know your intentions.
For the example you gave,
//*[@id="rg_s"]/div[13]/a/img
the 13th div is particularly prone to breakage.
Try instead to find a uniquely identifying characteristic closer to your target. A unique @id attribute would be ideal; a @class that uniquely identifies your target, or a close ancestor of it, can work well too.
For example, for Google Image Search, something like the following XPath
//div[@id='rg_s']//img[@class='rg_i']
will select all images of class rg_i within the div containing the search results.
If you're willing to abandon the copy-and-paste approach and learn enough XPath to generalize your selections, you'll get much better results. Of course, standard disclaimers apply about changes to presentation necessitating updating of scraping techniques too. Using a direct API call would be much more robust (and proper as well).
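To close, a minimal spider built on that robust selector (the spider name is arbitrary, and the rg_s/rg_i markup is assumed to still be current):

import scrapy

class GoogleImagesSpider(scrapy.Spider):
    name = "google_images"  # arbitrary name
    start_urls = ["https://www.google.com/search?q=test&tbm=isch&qscrl=1"]

    def parse(self, response):
        # anchor on stable, meaningful attributes instead of positional steps
        for src in response.xpath(
                "//div[@id='rg_s']//img[@class='rg_i']/@src").getall():
            yield {"image_src": src}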