Extract image source from lazy loading content with Scrapy - python

I'm trying to extract the value of the src attribute of an img tag using Scrapy.
For example:
<img src="https://media.rightmove.co.uk/map/_generate?width=768&height=347&zoomLevel=15&latitude=53.803485&longitude=-1.561766&signature=rq2YsiaRQTXqZ2ilgvbFF3fdWfU=" alt="Property location on the map" loading="lazy">
I want to extract the URL:
https://media.rightmove.co.uk/map/_generate?width=768&height=347&zoomLevel=15&latitude=53.803485&longitude=-1.561766&signature=rq2YsiaRQTXqZ2ilgvbFF3fdWfU=
When I view the response returned by the scrapy shell in Chrome, I can see the data I want to extract (via developer tools), but when I try to extract it with XPath it returns nothing.
e.g.
response.xpath("""//*[#id="root"]/div/div[3]/main/div[15]/div/a/img""").get()
I'm guessing loading="lazy" has something to do with it; however, the response returned by scrapy shows the data I want when viewed in a browser (with JavaScript disabled).
Steps to reproduce:
$ scrapy shell https://www.rightmove.co.uk/properties/91448747#/
>>> view(response)
Anyone know how I can extract the URL from the map? I'm interested in doing this in order to extract the lat-long of the property.

This HTML tag is generated by some JS when you open the page in the browser. When inspecting with view(response), I suggest setting the tab to Offline in the devtools Network tab and reloading the page.
This prevents the tab from downloading other content, the same way scrapy shell does. Indeed, after doing this we can see that the tag does not exist at that point.
But this data seems to be available in one of the script tags. You can check it by executing the following commands.
$ scrapy shell https://www.rightmove.co.uk/properties/91448747#/
>>> import json
>>> jdata = json.loads(response.xpath('//script').re_first('window.PAGE_MODEL = (.*)'))
>>> from pprint import pprint as pp
>>> pp(jdata)
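From there, the lat-long the question asks about should be a couple of key lookups away. The exact key path below is a guess (check the pp(jdata) output for the real structure), but it would look something like this:
>>> # hypothetical key path -- confirm it against the pp(jdata) output
>>> location = jdata['propertyData']['location']
>>> print(location['latitude'], location['longitude'])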

Related

No content returns from Scrapy and Instagram

I am trying to get the tag text from an Instagram image, for example: https://www.instagram.com/p/CHPoTitFdEz/, but Scrapy returns no content.
In the Scrapy shell I have written:
response.xpath('//span[@class=""]/a[@class="xil3i"]/text()').get()
or
response.xpath('//span[@class=""]/a[@class="xil3i"]/text()').extract()
This should get me the content of the first tag. However, the Scrapy shell returns no content or an empty array. I use Scrapy for other, simpler websites and everything has always worked fine. I also tried including more divs, but the content is always empty.
The first rule of thumb when using Scrapy is to open "View page source" in the browser: that is mostly the same response we get in Scrapy. (If we are getting blocked, that is a totally different question.)
Upon viewing the source we can see that this website loads all of its content dynamically with AJAX requests, as nothing is present in the page source. You can try searching for your desired content in the Network tab and then replicating the request that contains it. Alternatively, you can use Splash to render the JavaScript; see the scrapy-splash documentation.
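For illustration, a minimal Splash-based spider could look like the sketch below. It assumes a Splash instance running locally with the scrapy-splash middlewares enabled in settings.py, as described in that project's documentation; the spider name is made up, and the selector is the one from the question.
import scrapy
from scrapy_splash import SplashRequest  # needs the scrapy-splash package

class InstaTagsSpider(scrapy.Spider):
    name = 'insta_tags'

    def start_requests(self):
        # render the page through Splash so the JS-built DOM is available
        yield SplashRequest(
            'https://www.instagram.com/p/CHPoTitFdEz/',
            callback=self.parse,
            args={'wait': 2},  # give the page a moment to render
        )

    def parse(self, response):
        # the selector from the question, run against the rendered HTML
        yield {'tags': response.xpath('//a[@class="xil3i"]/text()').getall()}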
If you try to look at the response body, you will see that the Instagram page is not fully loaded. The post's data is saved in other tags. For example:
<meta property="instapp:hashtags" content="cyberpunk" />
So, you may want to change the selector path and extract the information from the meta tags.
In this case it would be:
response.xpath('//meta[@property="instapp:hashtags"]')
and you will get your content back. If you need other information, first try inspecting response.body and see if it is there.
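To get just the hashtag string rather than the whole node, select the content attribute (a one-liner for the scrapy shell):
>>> response.xpath('//meta[@property="instapp:hashtags"]/@content').get()
'cyberpunk'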

Retrieve full url using Scrapy and Xpath

I'm using Scrapy to crawl a webpage. I'm interested in recovering a "complex" URL from this source code:
<a href="/searchresults.ja.html?label=gen173nr-1DCAQoggJCEHNlYXJjaF_lpKfpmKrluIJIFVgEaE2IAQGYARW4ARfIAQzYAQPoAQH4AQKIAgGoAgO4AqXw-usFwAIB&sid=99d1716767a9d25ee820122238489b00&tmpl=searchresults&checkin_year_month_monthday=2019-10-15&checkout_year_month_monthday=2019-10-16&city=-240905&class_interval=1&dest_id=-240905&dest_type=city&dtdisc=0&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&no_rooms=1&postcard=0&room1=A%2CA&sb_price_type=total&shw_aparth=1&slp_r_match=0&srpvid=87de9c92c893006c&ss=%E5%A4%A7%E9%98%AA%E5%B8%82&ss_all=0&ssb=empty&sshis=0&top_ufis=1&rows=20&offset=20" data-page-next class="bui-pagination__link paging-next ga_sr_gotopage_2_85" title="次のページ">
The XPath command I use is:
next_page = response.xpath('//a[starts-with(@class,"bui-pagination__link paging-next")]/@href').extract()
However, I only get "/searchresults.ja.html": everything after the ".html" is dropped. I'm not interested in recovering the domain name, but in the complex part after the ".html?".
What I would like to have is
/searchresults.ja.html?label=gen173nr-1DCAQoggJCEHNlYXJjaF_lpKfpmKrluIJIFVgEaE2IAQGYARW4ARfIAQzYAQPoAQH4AQKIAgGoAgO4AqXw-usFwAIB&sid=99d1716767a9d25ee820122238489b00&tmpl=searchresults&checkin_year_month_monthday=2019-10-15&checkout_year_month_monthday=2019-10-16&city=-240905&class_interval=1&dest_id=-240905&dest_type=city&dtdisc=0&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&no_rooms=1&postcard=0&room1=A%2CA&sb_price_type=total&shw_aparth=1&slp_r_match=0&srpvid=87de9c92c893006c&ss=%E5%A4%A7%E9%98%AA%E5%B8%82&ss_all=0&ssb=empty&sshis=0&top_ufis=1&rows=20&offset=20
Do you know what I should do?
By the way, I'm trying to get the "next page" of results, linked at the bottom of the page.
The website is using JavaScript to render the next URL. The easiest way to check whether you can scrape anything directly, without JavaScript, is to run scrapy shell 'website' in your terminal (navigate to the directory of your scrapy spider and execute the command there).
This opens the response of the website in your terminal. Then you can type commands and check what the response is. In your case, the command will be:
response.css(".bui-pagination__item.sr_pagination_item a").getall()
Or
response.css(".bui-pagination__item.bui-pagination__next-arrow a::attr(href)").getall()
As you can see, the links are not complete, as per your reference in the question. Hence, this proves that the link you're trying to extract cannot be extracted using the straightforward method. You can use Splash (for JS rendering) or manually inspect the request in the developer tools and then duplicate it with scrapy.Request.
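If the full href does turn up (after JS rendering, or in a replicated request), one way to follow it is response.follow(), which resolves relative URLs such as /searchresults.ja.html?... against the page URL. A minimal sketch, with a made-up spider name and a placeholder start URL:
import scrapy

class BookingPagesSpider(scrapy.Spider):
    name = 'booking_pages'
    # placeholder -- use your actual search-results URL here
    start_urls = ['https://www.booking.com/searchresults.ja.html']

    def parse(self, response):
        # ... extract the listings on this page here ...
        # the next-page anchor carries the data-page-next attribute
        next_href = response.xpath('//a[@data-page-next]/@href').get()
        if next_href:
            # response.follow() joins relative hrefs against the page URL
            yield response.follow(next_href, callback=self.parse)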

Scrapy Extract Script Value

Using the scrapy shell on a specific URL, I am trying to work out how I can extract the author value or contributor value from this script within a page's source code. I have tried
response.xpath('//script').re(r'author":"([0-9.]+)"')
This is the script in the source code of the site:
<script charSet="UTF-8">...
"author":"3810161","contributor":{"id":"3810161"}},
</script>
Did you try printing all the <script> contents from Scrapy itself?
I guess you will not see the same content as in your browser inspector, since these nodes appear to be JavaScript-rendered and Scrapy doesn't handle JavaScript.
If you just want to extract some content from these search results, you could just use the API (the same search parameters you posted, but it gives you a JSON response, which is much easier to parse).
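For completeness: if the script does turn out to be present in the raw response (a quick check in the shell is '"author"' in response.text), the extraction itself is a small regex tweak on the attempt above, for example:
>>> response.xpath('//script[contains(., "author")]/text()').re_first(r'"author":"(\d+)"')
>>> response.xpath('//script[contains(., "contributor")]/text()').re_first(r'"contributor":\{"id":"(\d+)"\}')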

How to access the subtags within a tag using beautifulsoup in python?

I am attempting to retrieve player statistics from MLB.com for the 2016 season. I am using Beautiful Soup in Python, and I need to extract the information in the table seen here:
http://mlb.mlb.com/stats/sortable.jsp#elem=%5Bobject+Object%5D&tab_level=child&click_text=Sortable+Player+hitting&game_type='R'&season=2016&season_type=ANY&league_code='MLB'&sectionType=sp&statType=hitting&page=1&ts=1493672037085&playerType=ALL&sportCode='mlb'&split=&team_id=&active_sw=&position=&page_type=SortablePlayer&sortOrder='desc'&sortColumn=ab&results=&perPage=442&timeframe=&last_x_days=&extended=0
Here is what I have attempted:
import requests
from bs4 import BeautifulSoup

# url is the sortable.jsp address quoted above
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
gdata = soup.find_all('div', {'id': 'datagrid'})
print(gdata)
This should return all of the subtags within the tag, but it does not. This results in the following:
[<div id="datagrid"></div>]
Can anyone explain why this is not producing the contents of the table? Furthermore, what can I do to access the contents of the table?
Thanks
If you look at the source for the webpage, the datagrid div is actually empty, and the stats are inserted dynamically as JSON fetched from a separate URL. Maybe you can use that instead. To figure this out I looked at the page source to see that the div had no children, then used the Chrome developer tools Network tab to find the request where it pulled the data:
Open the web page
Open the Chrome developer tools: Command+Option+I (Mac) or Control+Shift+I (Windows, Linux).
Refresh the web page with the tools open so they record the network requests, then wait for the page to load.
(optional) Type xml in the Network tab's filter box to narrow the results to requests that are likely to carry data.
Click on each request and look at the preview of the response. At this point I just manually examined the responses to see which had your data. I got lucky and found yours on the first try, since it has stats in the name.
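Once you have found the right request, the fetch-and-parse side is short. Here is a sketch, where the URL is an obvious placeholder for the JSON request you found, and the key path is a guess you should verify by inspecting the decoded data:
import requests

# placeholder -- substitute the JSON request URL from the Network tab
stats_url = 'http://mlb.mlb.com/REPLACE-WITH-THE-JSON-REQUEST-URL'

resp = requests.get(stats_url)
data = resp.json()  # the payload is JSON, so no HTML parsing is needed

# hypothetical key path -- inspect `data` to find where the rows really live
for row in data['stats_sortable_player']['queryResults']['row']:
    print(row.get('name_display_first_last'), row.get('avg'))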

My first scrapy xpath selector

I am very new at this and have been trying to get my head around my first selector. Can somebody help me? I am trying to extract data from this page:
http://groceries.asda.com/asda-webstore/landing/home.shtml?cmpid=ahc--ghs-d1--asdacom-dsk-_-hp#/shelf/1215337195041/1/so_false
all the info under the div class="listing clearfix shelfListing", but I can't seem to figure out how to format response.xpath().
I have managed to launch the scrapy console but no matter what I type in response.xpath() I can't seem to select the right node. I know it works because when I type
>>>response.xpath('//div[@class="container"]')
I get a response. Yet, I don't know how to navigate to the listing clearfix shelfListing div. I am hoping that once I get this bit I can continue working my way through the spider.
PS: I wonder if it is even possible to scan this site. Is it possible for the owners to block spiders?
The content inside the div with the listings class (and id) is loaded via an XHR request asynchronously. In other words, the HTML code that Scrapy gets doesn't contain it:
$ scrapy shell http://groceries.asda.com/asda-webstore/landing/home.shtml?cmpid=ahc--ghs-d1--asdacom-dsk-_-hp#/shelf/1215337195041/1/so_false
>>> response.xpath('//div[@id="listings"]')
[]
Using the browser developer tools, you can see a request going to the http://groceries.asda.com/api/items/viewitemlist URL with a bunch of GET parameters.
One option would be to simulate that request and parse the resulting JSON. Working out the exact parameters is really part of a different question, but the shape of the approach is sketched below.
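A sketch of that approach; the query parameters and the JSON keys below are illustrative placeholders, so copy the real ones from the request you see in the developer tools:
import json
import scrapy

class AsdaShelfSpider(scrapy.Spider):
    name = 'asda_shelf'

    def start_requests(self):
        # illustrative query string -- copy the actual GET parameters
        # from the XHR request visible in the Network tab
        url = 'http://groceries.asda.com/api/items/viewitemlist?items=1215337195041'
        yield scrapy.Request(url, callback=self.parse_items)

    def parse_items(self, response):
        data = json.loads(response.text)
        # hypothetical keys -- inspect the JSON to find the real structure
        for item in data.get('items', []):
            yield {'name': item.get('itemName')}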
Here's another possible solution, using the selenium package:
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://groceries.asda.com/asda-webstore/landing/home.shtml?cmpid=ahc--ghs-d1--asdacom-dsk-_-hp#/shelf/1215337195041/1/so_false')

# raises NoSuchElementException if the listings container never appears
driver.find_element_by_id('listings')

for item in driver.find_elements_by_xpath('//div[@id="listings"]//a[@title]'):
    print(item.text.strip())

driver.close()
Prints:
Kellogg's Coco Pops
Kelloggs Rice Krispies
Kellogg's Coco Pops Croco Copters
...
