Retrieve full url using Scrapy and Xpath - python

I'm using Scrapy to crawl a webpage. I'm interested in recovering a "complex" URL in this source code :
<a href="/searchresults.ja.html?label=gen173nr-1DCAQoggJCEHNlYXJjaF_lpKfpmKrluIJIFVgEaE2IAQGYARW4ARfIAQzYAQPoAQH4AQKIAgGoAgO4AqXw-usFwAIB&sid=99d1716767a9d25ee820122238489b00&tmpl=searchresults&checkin_year_month_monthday=2019-10-15&checkout_year_month_monthday=2019-10-16&city=-240905&class_interval=1&dest_id=-240905&dest_type=city&dtdisc=0&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&no_rooms=1&postcard=0&room1=A%2CA&sb_price_type=total&shw_aparth=1&slp_r_match=0&srpvid=87de9c92c893006c&ss=%E5%A4%A7%E9%98%AA%E5%B8%82&ss_all=0&ssb=empty&sshis=0&top_ufis=1&rows=20&offset=20" data-page-next class="bui-pagination__link paging-next ga_sr_gotopage_2_85" title="次のページ">
The xpath command I use is :
next_page = response.xpath('//a[starts-with(@class,"bui-pagination__link paging-next")]/@href').extract()
However, I get only "/searchresults.ja.html": everything after the ".html" is dropped. I'm not interested in recovering the domain name, but I do want the complex part after the ".html?"
What I would like to have is
/searchresults.ja.html?label=gen173nr-1DCAQoggJCEHNlYXJjaF_lpKfpmKrluIJIFVgEaE2IAQGYARW4ARfIAQzYAQPoAQH4AQKIAgGoAgO4AqXw-usFwAIB&sid=99d1716767a9d25ee820122238489b00&tmpl=searchresults&checkin_year_month_monthday=2019-10-15&checkout_year_month_monthday=2019-10-16&city=-240905&class_interval=1&dest_id=-240905&dest_type=city&dtdisc=0&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&no_rooms=1&postcard=0&room1=A%2CA&sb_price_type=total&shw_aparth=1&slp_r_match=0&srpvid=87de9c92c893006c&ss=%E5%A4%A7%E9%98%AA%E5%B8%82&ss_all=0&ssb=empty&sshis=0&top_ufis=1&rows=20&offset=20
Do you know what I should do ?
By the way the page is this one, and I'm trying to get the "next page" of results, at the bottom

The website is using JavaScript to render the next URL. The easiest way to check whether you can scrape anything directly, without JavaScript, is to run scrapy shell 'website' in your terminal (navigate to the directory containing your Scrapy spider, then execute the command).
This will open the response of the website in your terminal. Then you can type commands and check what the response is. In your case, the command will be:
response.css(".bui-pagination__item.sr_pagination_item a").getall()
Or
response.css(".bui-pagination__item.bui-pagination__next-arrow a::attr(href)").getall()
As you can see, the links are not complete compared to your reference in the question. This shows that the link you're trying to extract cannot be extracted with the straightforward method. You can use Splash (for JS rendering), or manually inspect the request in the browser's network tab and then duplicate it with scrapy.Request.
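Note that whichever way you obtain it, the extracted href is relative, so it still has to be joined with the page's base URL before you can follow it; in a spider, response.urljoin(href) or response.follow(href) does this for you. A minimal standard-library sketch of the same join (the booking.com domain and the shortened href are assumptions here, stand-ins for the question's long URL):

```python
from urllib.parse import urljoin

# assumed page URL; the question's href suggests Booking.com search results
base = "https://www.booking.com/searchresults.ja.html"
# a shortened stand-in for the long relative href from the question
href = "/searchresults.ja.html?rows=20&offset=20"

# same join that Scrapy's response.urljoin(href) performs
next_url = urljoin(base, href)
print(next_url)  # https://www.booking.com/searchresults.ja.html?rows=20&offset=20
```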

Extract image source from lazy loading content with Scrapy

I'm trying to extract the value of the src attribute of an img tag using Scrapy.
For example:
<img src="https://media.rightmove.co.uk/map/_generate?width=768&height=347&zoomLevel=15&latitude=53.803485&longitude=-1.561766&signature=rq2YsiaRQTXqZ2ilgvbFF3fdWfU=" alt="Property location on the map" loading="lazy">
I want to extract the URL:
https://media.rightmove.co.uk/map/_generate?width=768&height=347&zoomLevel=15&latitude=53.803485&longitude=-1.561766&signature=rq2YsiaRQTXqZ2ilgvbFF3fdWfU=
When I view the response in Chrome returned from the scrapy shell I can see the data I want (via developer tools) to extract, but when I try to extract it with XPath it returns nothing.
e.g.
response.xpath("""//*[@id="root"]/div/div[3]/main/div[15]/div/a/img""").get()
I'm guessing loading="lazy" has something to do with it, however, the returned response from scrapy shows the data I want when viewed in a browser (with javascript disabled).
Steps to reproduce:
$ scrapy shell https://www.rightmove.co.uk/properties/91448747#/
>>> view(response)
Anyone know how I can extract the URL from the map? I'm interested in doing this in order to extract the lat-long of the property.
This HTML tag is generated by some JS when you open the page in the browser. When inspecting with view(response), I suggest setting the tab to Offline in the devtools Network tab and reloading the page.
This prevents the tab from downloading other content, the same way scrapy shell does. Indeed, after doing this we can see that the tag does not exist at that point.
But this data does seem to be available in one of the script tags. You can check by executing the following commands:
$ scrapy shell https://www.rightmove.co.uk/properties/91448747#/
>>> import json
>>> jdata = json.loads(response.xpath('//script').re_first(r'window\.PAGE_MODEL = (.*)'))
>>> from pprint import pprint as pp
>>> pp(jdata)
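The pattern above (regex out the JSON assigned to window.PAGE_MODEL, then json.loads it) can be exercised offline. A self-contained sketch with a tiny stand-in for the script contents; the key names inside PAGE_MODEL here are assumptions for illustration, the real ones will differ:

```python
import json
import re

# a tiny stand-in for the real <script> text; the actual key names
# inside window.PAGE_MODEL are assumptions here
script_text = ('window.PAGE_MODEL = {"propertyData": '
               '{"location": {"latitude": 53.803485, "longitude": -1.561766}}}')

# same extraction that response.xpath('//script').re_first(...) performs
match = re.search(r'window\.PAGE_MODEL = (.*)', script_text)
jdata = json.loads(match.group(1))

location = jdata["propertyData"]["location"]
print(location["latitude"], location["longitude"])  # 53.803485 -1.561766
```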

No content returns from Scrapy and Instagram

I am trying to get the tags text from an instagram image, for example: https://www.instagram.com/p/CHPoTitFdEz/, but Scrapy returns no content.
In the Scrapy shell I have written:
response.xpath('//span[@class=""]/a[@class="xil3i"]/text()').get()
or
response.xpath('//span[@class=""]/a[@class="xil3i"]/text()').extract()
Either of which should get me the content of the first tag. However, the Scrapy shell returns no content or an empty array. I use Scrapy for other, simpler websites and everything has always gone fine. I also tried including more divs, but the content is always empty.
The first rule of thumb when using Scrapy is to open "view page source" in the browser: that is mostly the same response we get in Scrapy. (If we are getting blocked, that is a totally different question.)
Upon viewing the source we can see that this website loads all of its content dynamically with AJAX requests; nothing is present in the page source. You can try searching for your desired content in the network tab and then replicating the request that carries it. Alternatively, you can use Splash; see its documentation for details.
If you look at the response body, you will see that the Instagram page is not fully loaded, but part of the data is available in other tags. For example:
<meta property="instapp:hashtags" content="cyberpunk" />
So you may want to change the selector path and extract the information from meta tags.
In this case it would be:
response.xpath('//meta[@property="instapp:hashtags"]')
and you will get your content back. If you need other information, first inspect response.body and see whether it is there.
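To pull the attribute value itself, you would typically append /@content to that XPath. A self-contained sketch of the same extraction using only the standard library, run against the meta tag quoted above:

```python
import xml.etree.ElementTree as ET

# the meta tag quoted above, wrapped so it parses as a document
fragment = '<head><meta property="instapp:hashtags" content="cyberpunk" /></head>'
root = ET.fromstring(fragment)

# same idea as response.xpath('//meta[@property="instapp:hashtags"]/@content').getall()
tags = [m.get("content") for m in root.findall('.//meta[@property="instapp:hashtags"]')]
print(tags)  # ['cyberpunk']
```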

XPath returns no result for some elements with scrapy shell

I am using scrapy shell to extract data of the following web page:
https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html
Most data works, but there is a table in the lower part whose content (the PZN, e.g.) I don't seem to be able to extract.
scrapy shell
>>> fetch('https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html')
>>> response.xpath('//*[@id="accordionContent5e95408f73b10"]/div/table/tbody/tr[1]/td/text()').extract()
It returns: []
I also downloaded the page to view as scrapy sees it:
scrapy fetch --nolog https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html > test.html
Although it looks OK in the HTML, and although I can grab it in Chrome, it does not work in the scrapy shell.
How can I retrieve this data?
The problem you have encountered is that the id 'accordionContent5e95408f73b10' is dynamically generated, so the id in your browser and the one in Scrapy's response are different.
In common cases a good workaround is to write the XPath with a "substring search" (//*[contains(@id, 'accordionContent')]), but on this page there are a lot of such ids.
I'd advise writing a more specific XPath:
//div[@id='accordion']/div[contains(@class, 'panel')][1]/div[contains(@id, 'accordionContent')]/div[@class='panel-body']/table/tbody/tr[1]/td
What this XPath does:
Find all "subpanels" with descriptions: //div[@id='accordion']/div[contains(@class, 'panel')];
Take the first "subpanel" (where the PZN is located) and navigate into the table with the data: //div[@id='accordion']/div[contains(@class, 'panel')][1]/div[contains(@id, 'accordionContent')]/div[@class='panel-body']/table;
Finally, retrieve the first tr's td.
By the way, the XPath can be simplified to //div[@id='accordion']/div[contains(@class, 'panel')][1]//table/tbody/tr[1]/td, but I've written the full XPath for a more accurate understanding of what we're navigating.
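The key trick above, matching a dynamically generated id by its stable prefix, can be shown offline. A sketch against a stripped-down stand-in for the accordion markup (the id suffix and the PZN value here are made up); note it uses the standard library's ElementTree, whose XPath subset lacks contains(), so the prefix match is done in Python instead:

```python
import xml.etree.ElementTree as ET

# stripped-down stand-in for the page's accordion markup;
# the id suffix and the PZN value are made up for illustration
html = (
    '<div id="accordion">'
    '<div class="panel">'
    '<div id="accordionContent5e95408f73b10">'
    '<div class="panel-body"><table><tbody>'
    '<tr><td>PZN 00520917</td></tr>'
    '</tbody></table></div>'
    '</div></div></div>'
)
root = ET.fromstring(html)

# ElementTree's XPath subset has no contains(), so the dynamic id
# is matched with startswith() in Python instead
panel = next(d for d in root.iter('div') if d.get('id', '').startswith('accordionContent'))
pzn = panel.find('.//td').text
print(pzn)  # PZN 00520917
```

In Scrapy itself the contains(@id, ...) predicate does the equivalent filtering inside the XPath engine.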

Writing instagram crawler with Scrapy. How can I go to the next page?

As an exercise, I decided to write a python script that would get all the images of the specified user. I'm somewhat familiar with Scrapy, this is why I chose it as scraping tool. Currently the script is capable of downloading the images only from the first page (12 max).
From what I can tell, instagram pages are generated by javascript. Scrapy's response.body (which is like source code viewed from Chrome) does not show html structure like Chrome's Inspector does. In Chrome, after 12 images, at the bottom, there's a button with link to the next page.
For example, instagram.com/instagram. Link to page 2 is instagram.com/instagram/?max_id=1292385931151632610. On page 2 there's a link to page 3 with max_id=1287301939457754444.
How can I grab that number in Scrapy so I can send my spider there? response.body doesn't even contain that number. Is there another way to reach the next page?
I know Instagram API would provide some benefits but I thought it can be done without all those tokens.
You can also add the parameter __a=1 (like in https://www.instagram.com/instagram/?__a=1) to only include the JSON in the window._sharedData object.
I used a shell script like this to do something similar:
username=instagram
max=
while :;do
c=$(curl -s "https://www.instagram.com/$username/?__a=1&max_id=$max")
jq -r '.user|.id as$user|.media.nodes[]?|$user+" "+.id+" "+.display_src'<<<"$c"
max=$(jq -r .user.media.page_info.end_cursor<<<"$c")
jq -e .user.media.page_info.has_next_page<<<"$c">/dev/null||break
done
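For a Scrapy spider, the part of that loop worth reproducing is how the paging URL is built from the username and the end_cursor. A hedged Python sketch of just that construction (the ?__a=1 endpoint was an undocumented Instagram feature and may no longer behave this way):

```python
from urllib.parse import urlencode

def profile_json_url(username, max_id=""):
    """Build the ?__a=1 profile URL used by the shell loop above for paging."""
    query = urlencode({"__a": 1, "max_id": max_id})
    return f"https://www.instagram.com/{username}/?{query}"

print(profile_json_url("instagram", "1292385931151632610"))
# https://www.instagram.com/instagram/?__a=1&max_id=1292385931151632610
```

In a spider you would feed each response's end_cursor back into this builder and yield a new Request until has_next_page is false.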
According to the robots.txt policy you should avoid crawling the /api/, /publicapi/ and /query/ paths, so crawl carefully (and responsibly) on the user pagination.
Also, from what I see, pagination starts with a "Load more" request, which is in fact a https://www.instagram.com/query/ request (that you need to check) with only two necessary values, owner and end_cursor, sent as a POST request.
Those values can be found in the body of the original page, inside '//script[contains(., "sharedData")]/text()'.

Xpath for data-reactid element

I want to scrape http://www.spyfu.com/overview/url?query=http%3A%2F%2Fwww.veldemangroup.com%2Fen finding the text elements under "organic keywords", so the first one would be "warehouse structure".
Working in python using scrapy and the command line tool. Trying:
response.xpath("//a[@data-reactid='.0.0.0.0.0.1.0.1.0']")
just returns "[]" - why is that, and how do I get the correct "warehouse structure" text?
The site you mention is generated dynamically, only after fetching http://www.veldemangroup.com/en via JavaScript. You can check by typing scrapy shell "http://www.spyfu.com/overview/url?query=http%3A%2F%2Fwww.veldemangroup.com%2Fen" and then response.body: there is plenty of JavaScript, and the selector you're looking for (like most others) is not there, so Scrapy cannot find it by itself.
Try Selenium instead: it does not issue a plain request the way Scrapy does; e.g. the Firefox webdriver can read the site the way a browser sees it.
