I'm trying to crawl the webpage "https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas" to extract the product names, but I can't find the right selector, not even for the price, the h1, or the title! I tried:
response.css(".shelfProductTile-descriptionLink") #for the name product
response.css(".price-cents") # for the price
response.css(".tileList-title") # for the title
How can I proceed?
The content is loaded dynamically from a POST XHR returning JSON, which you can find in the Network tab of your browser's dev tools.
Request goes to:
https://www.woolworths.com.au/apis/ui/browse/category
Payload:
{"categoryId":"1_9573995","pageNumber":1,"pageSize":24,"sortType":"TraderRelevance","url":"/shop/browse/drinks/cordials-juices-iced-teas/iced-teas","location":"/shop/browse/drinks/cordials-juices-iced-teas/iced-teas","formatObject":"{\"name\":\"Iced Teas\"}","isSpecial":False,"isBundle":False,"isMobile":False,"filters":"null"}
With the response in Scrapy, parse the JSON using:
json.loads(response.text)  # response.body_as_unicode() is deprecated; response.text is its replacement
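A minimal sketch of a spider that replays that XHR (the payload is copied from above; the Bundles/Products/DisplayName keys are assumptions based on the JSON visible in the network tab, so adjust them to the real structure):

import json
import scrapy

class IcedTeasSpider(scrapy.Spider):
    name = "iced_teas"
    api_url = "https://www.woolworths.com.au/apis/ui/browse/category"

    def start_requests(self):
        payload = {
            "categoryId": "1_9573995",
            "pageNumber": 1,
            "pageSize": 24,
            "sortType": "TraderRelevance",
            "url": "/shop/browse/drinks/cordials-juices-iced-teas/iced-teas",
            "location": "/shop/browse/drinks/cordials-juices-iced-teas/iced-teas",
            "formatObject": '{"name":"Iced Teas"}',
            "isSpecial": False,
            "isBundle": False,
            "isMobile": False,
            "filters": "null",
        }
        yield scrapy.Request(
            self.api_url,
            method="POST",
            body=json.dumps(payload),
            headers={"Content-Type": "application/json"},
            callback=self.parse,
        )

    def parse(self, response):
        data = json.loads(response.text)
        # assumed structure: a list of bundles, each holding product dicts
        for bundle in data.get("Bundles", []):
            for product in bundle.get("Products", []):
                yield {"name": product.get("DisplayName"), "price": product.get("Price")}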
I'm trying to scrape the descriptions of eBay listings, and was approaching it with this:
def parse_description(self, response):
    description = response.css('div#ds_div*::text').get()
    yield {
        "description": description
    }
The idea was to grab the text of all the tags under .css('div#ds_div').
However, I'm getting this error:
"Expected selector, got %s" % (peek,))
File "<string>", line None
cssselect.parser.SelectorSyntaxError: Expected selector, got <DELIM '*' at 10>
Example URL I am trying to scrape: https://www.ebay.co.uk/itm/Vintage-Toastmaster-Chrome-Toaster-Model-D182-4-Slice-Wide-Slot-Nos/114677725765?hash=item1ab3533a45:g:ui8AAOSw-jpgBbFS
Where am I going wrong?
The error means that the selector is not valid:
div#ds_div*::text
If you put a space between div#ds_div and *, it becomes a valid descendant selector, as you also mentioned in the comments.
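For example, with the space added the selector parses fine and grabs the text of every descendant node:

response.css('div#ds_div *::text').getall()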
Looking at the link, another problem is that the text you're trying to retrieve is inside an iframe with id desc_ifr.
If you want to scrape the content inside this iframe, look at the src attribute of the iframe and scrape that URL instead of the one in your question. Then you can do this:
response.css('div#ds_div p::text').get()
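A minimal sketch of that two-step approach (the callback names are just illustrative):

def parse(self, response):
    # the description is rendered inside an iframe; follow its src instead
    iframe_url = response.css('iframe#desc_ifr::attr(src)').get()
    if iframe_url:
        yield response.follow(iframe_url, callback=self.parse_description)

def parse_description(self, response):
    yield {"description": response.css('div#ds_div p::text').get()}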
I tried many ways to scrape the IKEA page and figured out that on the last page IKEA actually shows all the items. But when I try to scrape the last page of IKEA's products, it only returns the first 24 items (which correspond to the items displayed on the first page).
This is the URL of the page:
https://www.ikea.com/fr/fr/cat/lits-bm003/?page=12
and this is the spider:
import scrapy
import pprint

class SpiderSpider(scrapy.Spider):
    name = 'Ikea'
    pages = 9
    start_urls = ['https://www.ikea.com/fr/fr/cat/canapes-fu003/?page=12']

    def parse(self, response):
        data = {}
        products = response.css('div.plp-product-list')
        for product in products:
            for p in product.css('div.range-revamp-product-compact'):
                yield {
                    'Title': p.css('div.range-revamp-header-section__title--small::text').getall()[0],
                    'Price': p.css('span.range-revamp-price__integer::text').getall()[0],
                    'Desc': p.css('span.range-revamp-header-section__description-text::text').getall()[0],
                    'Img': p.css('img.range-revamp-aspect-ratio-image__image::attr(src)').getall()[0]
                }
Scrapy spiders don't run JavaScript (that's a browser's job); they only load the same response content that curl would.
To do exactly what you suggest, you need a browser-based solution such as Selenium (Python) or Cypress (JavaScript), i.e. a headless browser. Either that, or go through each page separately.
There are probably better ways of doing this, but to address your exact question, this is the intended answer.
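A minimal headless-browser sketch with Selenium (it assumes Chrome and chromedriver are installed; the CSS classes are the ones from your spider and may have changed since):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("https://www.ikea.com/fr/fr/cat/lits-bm003/?page=12")

# the last page should have every card in the DOM once JavaScript has run
for card in driver.find_elements(By.CSS_SELECTOR, "div.range-revamp-product-compact"):
    title = card.find_element(By.CSS_SELECTOR, "div.range-revamp-header-section__title--small").text
    price = card.find_element(By.CSS_SELECTOR, "span.range-revamp-price__integer").text
    print(title, price)

driver.quit()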
I am writing a scrapy-splash program and I need to click the Display button on the webpage, as seen in the image below, in order to show the data for the 10th edition so I can scrape it. The code I tried is below, but it does not work. The information I need is only accessible if I click the Display button. UPDATE: Still struggling with this, and I have to believe there is a way to do it. I do not want to scrape the JSON because that could be a red flag to the site owners.
import scrapy
from ..items import NameItem

class LoginSpider(scrapy.Spider):
    name = "LoginSpider"
    start_urls = ["http://www.starcitygames.com/buylist/"]

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formcss='#existing_users form',
            formdata={'ex_usr_email': 'email123@example.com', 'ex_usr_pass': 'password123'},
            callback=self.after_login
        )

    def after_login(self, response):
        item = NameItem()
        display_button = response.xpath('//a[contains(., "- Display>>")]/@href').get()
        response.follow(display_button, self.parse)
        item["Name"] = response.css("div.bl-result-title::text").get()
        return item
Your code can't work because there is no anchor element and no href attribute. Clicking the button will send an XMLHttpRequest to http://www.starcitygames.com/buylist/search?search-type=category&id=5061 and the data you want is found in the JSON response.
To check the request URL and response, open Dev Tools -> Network -> XHR and click Display.
In the Headers tab you will find the request URL, and in the Preview or Response tabs you can inspect the JSON.
As you can see, you'll need a category id to build the request URL. You can find it by parsing the script element located with this XPath: //script[contains(., "categories")]
Then you can send your request from the spider to http://www.starcitygames.com/buylist/search?search-type=category&id=5061 and get the data you want.
$ curl 'http://www.starcitygames.com/buylist/search?search-type=category&id=5061'
{"ok":true,"search":"10th Edition","results":[[{"id":"46269","name":"Abundance","subtitle":null,"condition":"NM\/M","foil":true,"is_parent":false,"language":"English","price":"20.000","rarity":"Rare","image":"cardscans\/MTG\/10E\/en\/foil\/Abundance.jpg"},{"id":"176986","name":"Abundance","subtitle":null,"condition":"PL","foil":true,"is_parent":false,"language":"English","price":"12.000","rarity":"Rare","image":"cardscans\/MTG\/10E\/en\/foil\/Abundance.jpg"}....
As you can see, you don't even need to log in into the website or Splash.
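Something like this minimal spider sketch would do it (the category id 5061 is hard-coded from the example above; in practice you would first parse it out of the script element):

import json
import scrapy

class BuylistSpider(scrapy.Spider):
    name = "buylist"
    start_urls = ["http://www.starcitygames.com/buylist/search?search-type=category&id=5061"]

    def parse(self, response):
        data = json.loads(response.text)
        # "results" is a list of card groups, each a list of card dicts,
        # as in the curl output above
        for group in data["results"]:
            for card in group:
                yield {"name": card["name"], "price": card["price"], "condition": card["condition"]}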
I am trying to build a spider to scrape some data from the website Techcrunch - Heartbleed search.
My thought was to pass a tag when executing the spider from the command line (example: Heartbleed). The spider should then search through all the associated search results, open each link, and get the data contained within.
import scrapy

class TechcrunchSpider(scrapy.Spider):
    name = "tech_search"

    def start_requests(self):
        url = 'https://techcrunch.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + '?s=' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        pass
This code can be executed with: scrapy crawl tech_search -s DOWNLOAD_DELAY=1.5 -o tech_search.jl -a tag=EXAMPLEINPUT
Getting the data from the individual pages is not the problem, but actually getting the URLs to them (from the search page linked above) is.
The thing is, when looking at the HTML source of the search page (Ctrl+U), I can't find anything about the searched elements (example: "What Is Heartbleed? The Video"). Any suggestions on how to obtain these elements?
I suggest that you define your Scrapy class along the lines shown in this answer, but using a Selenium headless browser such as PhantomJS. The essential problem is that when Scrapy downloads those pages it doesn't run the JavaScript code that builds the HTML (DOM) you see, so you cannot access that content via the route you have chosen.
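A minimal sketch of that idea, rendering the search page in a headless browser and handing the result to Scrapy's Selector (headless Chrome is shown since PhantomJS is no longer maintained; the result-link selector is a guess and needs checking against the rendered DOM):

from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("https://techcrunch.com/?s=Heartbleed")

# parse the browser-rendered HTML with Scrapy's selectors
rendered = Selector(text=driver.page_source)
driver.quit()

# hypothetical selector for result links; inspect the rendered page to confirm
for href in rendered.css("a.post-block__title__link::attr(href)").getall():
    print(href)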
I'm new to scrapy and tried to crawl from a couple of sites, but wasn't able to get more than a few images from there.
For example, for http://shop.nordstrom.com/c/womens-dresses-new with the following code -
def parse(self, response):
    for dress in response.css('article.npr-product-module'):
        yield {
            'src': dress.css('img.product-photo').xpath('@src').extract_first(),
            'url': dress.css('a.product-photo-href').xpath('@href').extract_first()
        }
I got 6 products. I expect 66.
For URL https://www.renttherunway.com/products/dress with the following code -
def parse(self, response):
    for dress in response.css('div.cycle-image-0'):
        yield {
            'image-url': dress.xpath('.//img/@src').extract_first(),
        }
I got 12. I expect roughly 100.
Even when I changed it to crawl every 'next' page, I got the same number per page but it went through all pages successfully.
I have tried a different USER_AGENT, disabled COOKIES, and DOWNLOAD_DELAY of 5.
I imagine I will run into the same problem on any site, so folks must have seen this before, but I can't find a reference to it.
What am I missing?
It's one of those websites that store product data as JSON in the HTML source and unpack it with JavaScript on page load.
To figure this out, what you usually want to do is:
1. Disable JavaScript and do scrapy view <url>.
2. Investigate the results.
3. Find the id in the product URL and search for that id in the page source to check whether it exists and, if so, where it is hidden. If it doesn't exist, it's being populated by some AJAX request: re-enable JavaScript, go to the page, and dig through the browser inspector's Network tab to find it.
If you do a regex-based search:
re.findall(r"ProductResults, (\{.+\})\)", response.text)
you'll get a huge JSON object that contains all the products and their information.
import json
import re

data = re.findall(r"ProductResults, (\{.+\})\)", response.text)
data = json.loads(data[0])['data']
print(len(data['ProductResult']['Products']))
>> 66
That gets the correct number of products!
So in your parse you can do this:

def parse(self, response):
    # extract the embedded ProductResults json, then walk the product list
    data = re.findall(r"ProductResults, (\{.+\})\)", response.text)
    data = json.loads(data[0])['data']
    for product in data['ProductResult']['Products']:
        # find main image
        image_url = [m['Url'] for m in product['Media'] if m['Type'] == 'MainImage']
        yield {'image_url': image_url}