Can't crawl more than a few items per page - python

I'm new to Scrapy and have tried to crawl a couple of sites, but I wasn't able to get more than a few images from each.
For example, for http://shop.nordstrom.com/c/womens-dresses-new with the following code -
def parse(self, response):
    for dress in response.css('article.npr-product-module'):
        yield {
            'src': dress.css('img.product-photo').xpath('@src').extract_first(),
            'url': dress.css('a.product-photo-href').xpath('@href').extract_first()
        }
I got 6 products. I expect 66.
For URL https://www.renttherunway.com/products/dress with the following code -
def parse(self, response):
    for dress in response.css('div.cycle-image-0'):
        yield {
            'image-url': dress.xpath('.//img/@src').extract_first(),
        }
I got 12. I expect roughly 100.
Even when I changed it to crawl every 'next' page, I got the same number per page but it went through all pages successfully.
I have tried a different USER_AGENT, disabled COOKIES, and DOWNLOAD_DELAY of 5.
I imagine I will run into the same problem on any site, so others must have seen this before, but I can't find a reference to it.
What am I missing?

It's one of those weird websites that store product data as JSON in the HTML source and unpack it with JavaScript on page load.
To figure this out, what you usually want to do is:
1) Disable JavaScript and do scrapy view <url>
2) Investigate the results
3) Find the id in the product URL and search for that id in the page source to check whether it exists and, if so, where it is hidden. If it doesn't exist, that means it's being populated by some AJAX request -> re-enable JavaScript, go to the page, and dig through the browser inspector's network tab to find it.
If you do a regex-based search:
re.findall(r"ProductResults, (\{.+\})\)", response.body_as_unicode())
You'll get a huge json that contains all products and their information.
import json
import re
data = re.findall(r"ProductResults, (\{.+\})\)", response.body_as_unicode())
data = json.loads(data[0])['data']
print(len(data['ProductResult']['Products']))
>> 66
That gets a correct amount of products!
So in your parse you can do this:

def parse(self, response):
    data = re.findall(r"ProductResults, (\{.+\})\)", response.body_as_unicode())
    data = json.loads(data[0])['data']
    for product in data['ProductResult']['Products']:
        # find the main image
        image_url = [m['Url'] for m in product['Media'] if m['Type'] == 'MainImage']
        yield {'image_url': image_url}
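To see that extraction in isolation, here is the same regex + json.loads pattern run against a toy string that imitates the page's structure (the real blob is of course much larger):

```python
import json
import re

# Toy HTML imitating the pattern: product JSON embedded in a script call.
html = 'init(ProductResults, {"data": {"ProductResult": {"Products": [{"Id": 1}, {"Id": 2}]}}})'

# Greedy match grabs the whole JSON object passed as the second argument.
blob = re.findall(r"ProductResults, (\{.+\})\)", html)
data = json.loads(blob[0])['data']
print(len(data['ProductResult']['Products']))  # prints: 2
```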


Scrapy gets only the first 24 items of the page

I tried many ways to scrape the IKEA page and figured out that on the last page IKEA actually shows all the items. But when I try to scrape the last page of IKEA's products, it only returns the first 24 items (which correspond to the items displayed on the first page).
this is the URL of the page:
https://www.ikea.com/fr/fr/cat/lits-bm003/?page=12
and this is the spider:
import scrapy
import pprint

class SpiderSpider(scrapy.Spider):
    name = 'Ikea'
    pages = 9
    start_urls = ['https://www.ikea.com/fr/fr/cat/canapes-fu003/?page=12']

    def parse(self, response):
        data = {}
        products = response.css('div.plp-product-list')
        for product in products:
            for p in product.css('div.range-revamp-product-compact'):
                yield {
                    'Title': p.css('div.range-revamp-header-section__title--small::text').getall()[0],
                    'Price': p.css('span.range-revamp-price__integer::text').getall()[0],
                    'Desc': p.css('span.range-revamp-header-section__description-text::text').getall()[0],
                    'Img': p.css('img.range-revamp-aspect-ratio-image__image::attr(src)').getall()[0]
                }
Scrapy's spider doesn't run JavaScript (that's the job of a browser); it only loads the same response content that cURL would.
To do exactly what you suggest, you need a browser-based solution, like Selenium (Python) or Cypress (JavaScript), i.e. a 'headless browser'. Either that, or request each page separately.
There are probably better ways of doing this, but to address your exact question, this is the intended answer.
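Requesting each page separately can be as simple as generating one URL per page and yielding a normal Request for each. A minimal sketch (the page count of 12 matches the URL above but is an assumption; read the real count off the page in practice):

```python
base = "https://www.ikea.com/fr/fr/cat/lits-bm003/"
pages = 12  # assumed page count

# One URL per page; in the spider, yield scrapy.Request(url) for each.
urls = [f"{base}?page={n}" for n in range(1, pages + 1)]
print(urls[0])   # https://www.ikea.com/fr/fr/cat/lits-bm003/?page=1
print(urls[-1])  # https://www.ikea.com/fr/fr/cat/lits-bm003/?page=12
```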

How do I obtain results from 'yield' in python?

Perhaps yield in Python is remedial for some, but not for me... at least not yet.
I understand yield creates a 'generator'.
I stumbled upon yield when I decided to learn scrapy.
I wrote some code for a spider which works as follows:
1) Go to the start hyperlink and extract all hyperlinks - which are not full hyperlinks, just sub-directories concatenated onto the starting hyperlink
2) Examine the hyperlinks and append those meeting specific criteria to the base hyperlink
3) Use Request to navigate to the new hyperlink and parse it to find the unique id in an element with 'onclick'
import scrapy
from scrapy import Request

class newSpider(scrapy.Spider):
    name = 'new'
    allowed_domains = ['www.alloweddomain.com']
    start_urls = ['https://www.alloweddomain.com']

    def parse(self, response):
        links = response.xpath('//a/@href').extract()
        for link in links:
            if link == 'SpecificCriteria':
                next_link = response.urljoin(link)
                yield Request(next_link, callback=self.parse_new)
EDIT 1:
for uid_dict in self.parse_new(response):
    print(uid_dict['uid'])
    break
End EDIT 1
Running the code here evaluates response as the HTTP response to start_urls and not to next_link.
def parse_new(self, response):
    trs = response.xpath("//*[@class='unit-directory-row']").getall()
    for tr in trs:
        if 'SpecificText' in tr:
            elements = tr.split()
            for element in elements:
                if 'onclick' in element:
                    subelement = element.split('(')[1]
                    uid = subelement.split(')')[0]
                    print(uid)
                    yield {
                        'uid': uid
                    }
                    break
It works: Scrapy crawls the first page, creates the new hyperlink, and navigates to the next page. parse_new parses the HTML for the uid and 'yields' it. Scrapy's engine shows that the correct uid is 'yielded'.
What I don't understand is how I can 'use' that uid obtained by parse_new to create and navigate to a new hyperlink the way I would with a variable; I can't seem to return a variable with Request.
I'd check out What does the "yield" keyword do? for a good explanation of how exactly yield works.
In the meantime, spider.parse_new(response) is an iterable object. That is, you can acquire its yielded results via a for loop. E.g.,
for uid_dict in spider.parse_new(response):
    print(uid_dict['uid'])
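To see the mechanics without any Scrapy machinery, here is a self-contained toy version of the same pattern (the HTML snippets and the helper signature are made up for illustration):

```python
def parse_new(rows):
    # Simplified stand-in for the spider method: yields one dict per matching row.
    for row in rows:
        if 'onclick' in row:
            uid = row.split('(')[1].split(')')[0]
            yield {'uid': uid}

# Calling the function just returns a generator; nothing runs until you iterate.
gen = parse_new(["<td onclick=show(123)>", "<td>plain</td>", "<td onclick=show(456)>"])
uids = [d['uid'] for d in gen]
print(uids)  # prints: ['123', '456']
```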
After much reading and learning I discovered the reason Scrapy does not perform the callback in the first parse, and it has nothing to do with yield! It has a lot to do with two issues:
1) robots.txt. This can be 'resolved' with ROBOTSTXT_OBEY = False in settings.py.
2) The logger shows "Filtered offsite request to ...". Passing dont_filter=True to the Request may resolve this.
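As a sketch, the two fixes side by side (ROBOTSTXT_OBEY is a real Scrapy setting; the commented line shows where dont_filter goes in the spider above):

```python
# settings.py
ROBOTSTXT_OBEY = False

# and in the spider's parse():
# yield Request(next_link, callback=self.parse_new, dont_filter=True)
```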

Ignoring requests while scraping two pages

I am now scraping this website on a daily basis, and am using DeltaFetch to ignore pages which have already been visited (a lot of them).
The issue I am facing is that for this website, I need to first scrape page A, and then scrape page B to retrieve additional information about the item. DeltaFetch works well in ignoring requests to page B, but that also means that every time the scraping runs, it runs requests to page A regardless of whether it has visited it or not.
This is how my code is structured right now:
# Gathering links from a page, creating an item, and passing it to parse_A
def parse(self, response):
    for href in response.xpath(u'//a[text()="詳細を見る"]/@href').extract():
        item = ItemLoader(item=ItemClass(), response=response)
        yield scrapy.Request(response.urljoin(href),
                             callback=self.parse_A,
                             meta={'item': item.load_item()})

# Parsing elements in page A, and passing the item to parse_B
def parse_A(self, response):
    item = ItemLoader(item=response.meta['item'], response=response)
    item.replace_xpath('age', u"//td[contains(@class,\"age\")]/text()")
    page_B = response.xpath(u'//a/img[@alt="周辺環境"]/../@href').extract_first()
    yield scrapy.Request(response.urljoin(page_B),
                         callback=self.parse_B,
                         meta={'item': item.load_item()})

# Parsing elements in page B, and yielding the item
def parse_B(self, response):
    item = ItemLoader(item=response.meta['item'])
    item.add_value('url_B', response.url)
    yield item.load_item()
Any help would be appreciated to ignore the first request to page A when this page has already been visited, using DeltaFetch.
DeltaFetch only keeps record of the requests that yield items in its database, which means only those will be skipped by default.
However, you are able to customize the key used to store a record by using the deltafetch_key meta key. If you make this key the same for the requests that call parse_A() as for those created inside parse_A(), you should be able to achieve the effect you want.
Something like this should work (untested):
from scrapy.utils.request import request_fingerprint

# (...)

def parse_A(self, response):
    # (...)
    yield scrapy.Request(
        response.urljoin(page_B),
        callback=self.parse_B,
        meta={
            'item': item.load_item(),
            'deltafetch_key': request_fingerprint(response.request)
        }
    )
Note: the example above effectively replaces the filtering of requests to parse_B() urls with the filtering of requests to parse_A() urls. You might need to use a different key depending on your needs.
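The other half of the idea is that the requests yielded from parse() must carry the same deltafetch_key as the ones yielded from parse_A(). Inside parse_A(), request_fingerprint(response.request) gives you a key tied to the page-A request; from parse() you can derive an equivalent stable key from the page-A URL yourself. A dependency-free sketch of such a key (the helper name is made up; request_fingerprint hashes the whole request, not just the URL):

```python
from hashlib import sha1

def page_a_key(url):
    # Hypothetical helper: one stable key per page-A URL,
    # attached as meta['deltafetch_key'] to both requests.
    return sha1(url.encode()).hexdigest()

key = page_a_key("http://example.com/items/42")
print(key == page_a_key("http://example.com/items/42"))  # prints: True
```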

Scrapy - Javascript website

I'm familiar with scraping websites with Scrapy; however, I can't seem to scrape this one (JavaScript, perhaps?).
I'm trying to download historical data for commodities for some personal research from this website:
http://www.mcxindia.com/SitePages/BhavCopyDateWiseArchive.aspx
On this website you will have to select the date and then click go. Once the data is loaded, you can click 'View in Excel' to download a CSV file with commodity prices for that day. I'm trying to build a scraper to download these CSV files for a few months. However, this website seems like a hard nut to crack. Any help will be appreciated.
Things I've tried:
1) Looked at the page source to see if data is being loaded but not shown (hidden)
2) Used Firebug to see if there are any AJAX requests
3) Modified POST headers to see if I can get data for different days. The POST headers seem very complicated.
ASP.NET websites are notoriously hard to crawl because they rely on view states, are extremely strict with requests, and involve loads of other nonsense.
Luckily your case seems to be pretty straightforward. Your Scrapy approach should look something like:
import scrapy
from scrapy import FormRequest
from scrapy.shell import inspect_response

class MxindiaSpider(scrapy.Spider):
    name = "mxindia"
    allowed_domains = ["mcxindia.com"]
    start_urls = ('http://www.mcxindia.com/SitePages/BhavCopyDateWiseArchive.aspx',)

    def parse(self, response):
        yield FormRequest.from_response(
            response,
            formdata={
                'mTbdate': '02/13/2015',  # your date here
                'ScriptManager1': 'MupdPnl|mImgBtnGo',
                '__EVENTARGUMENT': '',
                '__EVENTTARGET': '',
                'mImgBtnGo.x': '12',
                'mImgBtnGo.y': '9'
            },
            callback=self.parse_cal)

    def parse_cal(self, response):
        inspect_response(response, self)  # everything is there!
What we do here is create a FormRequest from the response object we already have. It's smart enough to find the <input> and <form> fields and generate the formdata.
However, some input fields have no defaults, or we need to override the defaults; those are the ones we supply through the formdata argument.
So we provide the formdata argument with updated form values. When you inspect the request in the browser, you can see all of the form values you need to make a successful request:
So just copy all of them over to your formdata. ASP.NET is really picky about the formdata, so it takes some experimenting to find out what is required and what is not.
I'll leave you to figure out how to get to the next page yourself; usually it just adds an additional key to formdata, like 'page': '2'.
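If paging really is just an extra key, the per-page payloads could be generated like this (the 'page' key and the page count are assumptions to verify against the actual request in the inspector):

```python
base_formdata = {
    'mTbdate': '02/13/2015',
    'ScriptManager1': 'MupdPnl|mImgBtnGo',
}

# Hypothetical: one payload per page, adding the assumed 'page' key each time.
payloads = [dict(base_formdata, page=str(n)) for n in range(1, 4)]
print(payloads[1]['page'])  # prints: 2
```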

Python Scrapy - Ajax Pagination Tripadvisor

I'm using Python-Scrapy to scrape the reviews on TripAdvisor members' pages.
Here is the URL I'm using: http://www.tripadvisor.com/members/scottca075
I'm able to get the first page using Scrapy. I haven't been able to get the other pages. I observed the XHR requests in the Network tab of the browser on clicking the Next button.
One GET and one POST request are sent:
On checking the parameters for the GET request, I see this:
action : undefined_Other_ClickNext_REVIEWS_ALL
gaa : Other_ClickNext_REVIEWS_ALL
gal : 50
gams : 0
gapu : Vq85qQoQKjYAABktcRMAAAAh
gass : members
The request URL is
http://www.tripadvisor.com/ActionRecord?action=undefined_Other_ClickNext_REVIEWS_ALL&gaa=Other_ClickNext_REVIEWS_ALL&gal=0&gass=members&gapu=Vq8xPAoQLnMAAUutB9gAAAAJ&gams=1
The parameter gal represents the offset. Each page has 50 reviews. On moving to the second page by clicking the next button, the parameter gal is set to 50. Then, 100,150,200..and so on.
The data that I want is in the POST request, in JSON format. The request URL for the POST request is http://www.tripadvisor.com/ModuleAjax?
I'm confused as to how to make the request in scrapy to get the data.
I tried using FormRequest as follows:
pagination_url = "http://www.tripadvisor.com/ActionRecord"
formdata = {'action': 'undefined_Other_ClickNext_REVIEWS_ALL', 'gaa': 'Other_ClickNext_REVIEWS_ALL', 'gal': '0', 'gams': '0', 'gapu': 'Vq8EngoQL3EAAJKgcx4AAAAN', 'gass': 'members'}
FormRequest(url=self.pagination_url, formdata=formdata, callback=self.parseItem)
I also tried setting headers options in the FormRequest
headers = {'Host':'www.tripadvisor.com','Referer':'http://www.tripadvisor.com/members/prizm','X-Requested-With': 'XMLHttpRequest'}
If someone could explain what I'm missing and point me in the right direction that would be great. I have run out of ideas.
And also, I'm aware that I can use selenium. But I want to know if there is a faster way to do this.
Use ScrapyJS - Scrapy+JavaScript integration
To use ScrapyJS in your project, you first need to enable the middleware:
DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}
For example, if we wanted to retrieve the rendered HTML for a page, we could do something like this:
import scrapy

class MySpider(scrapy.Spider):
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    def parse(self, response):
        # response.body is a result of the render.html call; it
        # contains HTML processed by a browser.
        # …
A common scenario is that the user needs to click a button before the page is displayed. We can handle this using jQuery with Splash:
function main(splash)
    splash:autoload("https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js")
    splash:go("http://example.com")
    splash:runjs("$('#some-button').click()")
    return splash:html()
end
For more details, check the ScrapyJS documentation.
So what you are doing is correct.
First, add yield in front of the FormRequest:
yield FormRequest(...)
Second, focus on the value of gal, because it is the only parameter that changes here, and don't keep gal = "0".
Find the total number of reviews and go from 50 up to the total, adding 50 with each request:
formdata = {'action': 'undefined_Other_ClickNext_REVIEWS_ALL', 'gaa': 'Other_ClickNext_REVIEWS_ALL', 'gal': reviews_till_this_page, 'gams': '0', 'gapu': 'Vq8EngoQL3EAAJKgcx4AAAAN', 'gass': 'members'}
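A minimal sketch of generating those gal offsets (total_reviews is a made-up number here; in the spider you would scrape it from the member's profile page first):

```python
# Hypothetical total; read the real count off the profile page.
total_reviews = 271
page_size = 50

# First page loads normally; subsequent requests use gal = 50, 100, ...
offsets = list(range(page_size, total_reviews, page_size))
print(offsets)  # prints: [50, 100, 150, 200, 250]
```

Each offset then goes into the formdata as 'gal': str(offset) in its own FormRequest.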
