Scrapy gets only the first 24 items of the page - python

I tried many ways to scrape the IKEA page and figured out that on the last page IKEA actually shows all the items. But when I try to scrape that last page, it only returns the first 24 items (which correspond to the items displayed on the first page).
this is the URL of the page:
https://www.ikea.com/fr/fr/cat/lits-bm003/?page=12
and this is the spider :
import scrapy
import pprint

class SpiderSpider(scrapy.Spider):
    name = 'Ikea'
    pages = 9
    start_urls = ['https://www.ikea.com/fr/fr/cat/canapes-fu003/?page=12']

    def parse(self, response):
        data = {}
        products = response.css('div.plp-product-list')
        for product in products:
            for p in product.css('div.range-revamp-product-compact'):
                yield {
                    'Title': p.css('div.range-revamp-header-section__title--small::text').getall()[0],
                    'Price': p.css('span.range-revamp-price__integer::text').getall()[0],
                    'Desc': p.css('span.range-revamp-header-section__description-text::text').getall()[0],
                    'Img': p.css('img.range-revamp-aspect-ratio-image__image::attr(src)').getall()[0]
                }

Scrapy's spider doesn't run JavaScript (that's a browser's job); it only loads the same response content that a cURL request would.
To do exactly what you suggest, you need a browser-based solution such as Selenium (Python) or Cypress (JavaScript), i.e. a 'headless browser'. Either that, or go through each page separately.
There are probably better ways of doing this, but to address your exact question, that is the intended answer.
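For the "go through each page separately" option, here is a minimal sketch built on the selectors from the question; the page count and the assumption that each ?page=N response contains that page's items in its static HTML are mine, so it is worth verifying with scrapy shell first:
import scrapy

class IkeaPagedSpider(scrapy.Spider):
    name = 'IkeaPaged'  # hypothetical spider name
    base_url = 'https://www.ikea.com/fr/fr/cat/lits-bm003/?page={}'
    pages = 12  # assumed number of listing pages

    def start_requests(self):
        # request every listing page explicitly instead of only the last one
        for page in range(1, self.pages + 1):
            yield scrapy.Request(self.base_url.format(page), callback=self.parse)

    def parse(self, response):
        for p in response.css('div.range-revamp-product-compact'):
            yield {
                'Title': p.css('div.range-revamp-header-section__title--small::text').get(),
                'Price': p.css('span.range-revamp-price__integer::text').get(),
                'Desc': p.css('span.range-revamp-header-section__description-text::text').get(),
                'Img': p.css('img.range-revamp-aspect-ratio-image__image::attr(src)').get(),
            }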

Related

How do I obtain results from 'yield' in python?

Perhaps yield in Python is remedial for some, but not for me... at least not yet.
I understand yield creates a 'generator'.
I stumbled upon yield when I decided to learn scrapy.
I wrote some code for a Spider which works as follows:
Go to the start hyperlink and extract all hyperlinks - which are not full hyperlinks, just sub-directories to be concatenated onto the starting hyperlink
Examine the hyperlinks and append those meeting specific criteria to the base hyperlink
Use Request to navigate to the new hyperlink and parse it to find a unique id in an element with 'onclick'
import scrapy
from scrapy import Request

class newSpider(scrapy.Spider):
    name = 'new'
    allowed_domains = ['www.alloweddomain.com']
    start_urls = ['https://www.alloweddomain.com']

    def parse(self, response):
        links = response.xpath('//a/@href').extract()
        for link in links:
            if link == 'SpecificCriteria':
                next_link = response.urljoin(link)
                yield Request(next_link, callback=self.parse_new)
EDIT 1:
for uid_dict in self.parse_new(response):
    print(uid_dict['uid'])
    break
End EDIT 1
Running the code here evaluates response as the HTTP response to start_urls and not to next_link.
def parse_new(self, response):
    trs = response.xpath("//*[@class='unit-directory-row']").getall()
    for tr in trs:
        if 'SpecificText' in tr:
            elements = tr.split()
            for element in elements:
                if 'onclick' in element:
                    subelement = element.split('(')[1]
                    uid = subelement.split(')')[0]
                    print(uid)
                    yield {
                        'uid': uid
                    }
                    break
It works: scrapy crawls the first page, creates the new hyperlink and navigates to the next page. parse_new parses the HTML for the uid and 'yields' it. Scrapy's engine shows that the correct uid is 'yielded'.
What I don't understand is how I can 'use' that uid obtained by parse_new to create and navigate to a new hyperlink the way I would with a variable; I cannot seem to return a variable with Request.
I'd check out What does the "yield" keyword do? for a good explanation of how exactly yield works.
In the meantime, spider.parse_new(response) is an iterable object. That is, you can acquire its yielded results via a for loop. E.g.,
for uid_dict in spider.parse_new(response):
    print(uid_dict['uid'])
After much reading and learning I discovered the reason scrapy does not perform the callback in the first parse, and it has nothing to do with yield! It has a lot to do with two issues:
1) robots.txt. This can be 'resolved' with ROBOTSTXT_OBEY = False in settings.py.
2) The logger shows 'Filtered offsite request to ...'. Passing dont_filter=True to the Request may resolve this.
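To tie this back to the original question: rather than trying to "return" the uid out of parse_new, the usual pattern is to yield the next Request from inside parse_new itself and carry the uid along. A minimal sketch of that, where the detail URL and the uid extraction are placeholders rather than the asker's real code:
import scrapy
from scrapy import Request

class ChainedSpider(scrapy.Spider):
    name = 'chained'
    allowed_domains = ['www.alloweddomain.com']
    start_urls = ['https://www.alloweddomain.com']
    custom_settings = {'ROBOTSTXT_OBEY': False}  # issue 1) above

    def parse(self, response):
        for link in response.xpath('//a/@href').extract():
            # dont_filter=True avoids the offsite filtering from issue 2) above
            yield Request(response.urljoin(link), callback=self.parse_new, dont_filter=True)

    def parse_new(self, response):
        uid = 'extracted-from-onclick'  # placeholder for the onclick parsing shown above
        # yield the follow-up Request here, building the new hyperlink from the uid
        # and passing the uid on to the next callback via cb_kwargs
        yield Request(response.urljoin('/detail/' + uid),
                      callback=self.parse_detail,
                      cb_kwargs={'uid': uid},
                      dont_filter=True)

    def parse_detail(self, response, uid):
        # the uid produced by parse_new arrives here as an ordinary argument
        yield {'uid': uid, 'url': response.url}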

Passing Selenium opened URL to Scrapy and scraping the data

I've been trying to scrape Bioshock games from the steam store and save their name, price and link in a CSV file. I know how to do it just by using Scrapy, but I really want to know if there's a way to do it combining both Scrapy and Selenium. I want to use Selenium just to get rid of the age check gate that pops up on certain game store sites.
Example of an age gate
Example of another age gate
So I've managed to scrape games that don't have the age gate by using Scrapy and I've managed to bypass the age gates using Selenium.
The problem I'm having is passing the game store site that Selenium opened by bypassing the age gate to Scrapy so it can crawl it. Since everything works fine on its own I came to the conclusion that the problem is that I don't know how to connect them.
def parse_product(self, response):
    product = ScrapesteamItem()
    sel = self.driver
    # Passing first age gate
    if '/agecheck/app/' in response.url:
        sel.get(response.url)
        select = Select(sel.find_element_by_xpath("""//*[@id="ageYear"]"""))
        select.select_by_visible_text("1900")
        sel.find_element_by_xpath("""//*[@id="agecheck_form"]/a""").click()
        # Pass Selenium newly opened site to Scrapy
    # Passing second age gate
    elif '/agecheck' in response.url:
        sel.get(response.url)
        sel.find_element_by_xpath("""//*[@id="app_agegate"]/div[3]/a[1]""").click()
        # Pass Selenium newly opened site to Scrapy
    # Scraping the data with scrapy
    else:
        name = response.css('.apphub_AppName ::text').extract()
        price = response.css('div.game_purchase_price ::text, div.discount_final_price ::text').extract()
        link = response.css('head > link:nth-child(40) ::attr(href)').extract()
        for product in zip(name, price, link):
            scrapedInfo = {
                'NAME': product[0],
                'PRICE': product[1].strip(),
                'LINK': product[2]
            }
            yield scrapedInfo
I hope someone will know how to do it (if it's even possible).
P.S. I know there are much better ways to scrape the Steam store, and there is probably an API, but before I go and learn that I would like to know if there's a way to do it like this, even if it's sub-optimal.
The straightforward answer is: apply the same scraping code you used under "Scraping the data with scrapy", i.e. something like this:
from scrapy.spiders import SitemapSpider
from scrapy.http import HtmlResponse

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.xxxxxxxSpider.com']
    sitemap_rules = [
        ('/product/', 'parse_product'),
    ]

    def my_custom_parse_product(self, response):
        name = response.css('.apphub_AppName ::text').extract()
        price = response.css('div.game_purchase_price ::text, div.discount_final_price ::text').extract()
        link = response.css('head > link:nth-child(40) ::attr(href)').extract()
        for product in zip(name, price, link):
            scrapedInfo = {
                'NAME': product[0],
                'PRICE': product[1].strip(),
                'LINK': product[2]
            }
            yield scrapedInfo

    def parse_product(self, response):
        product = ScrapesteamItem()
        sel = self.driver
        # Passing first age gate
        if '/agecheck/app/' in response.url:
            sel.get(response.url)
            select = Select(sel.find_element_by_xpath("""//*[@id="ageYear"]"""))
            select.select_by_visible_text("1900")
            sel.find_element_by_xpath("""//*[@id="agecheck_form"]/a""").click()
            # Pass the page Selenium just opened back to Scrapy
            response = HtmlResponse(url=response.url, body=sel.page_source, encoding='utf-8')
            yield from self.my_custom_parse_product(response)
        # Passing second age gate
        elif '/agecheck' in response.url:
            sel.get(response.url)
            sel.find_element_by_xpath("""//*[@id="app_agegate"]/div[3]/a[1]""").click()
            # Pass the page Selenium just opened back to Scrapy
            response = HtmlResponse(url=response.url, body=sel.page_source, encoding='utf-8')
            yield from self.my_custom_parse_product(response)
        # Scraping the data with scrapy
        else:
            yield from self.my_custom_parse_product(response)  # will actually scrape the data
But it may turn out that the age-protected pages contain the same data in different elements (not in response.css('.apphub_AppName ::text'), for instance); in that case you will need to implement separate scraping code for each page type.
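For reference, the handoff itself is just a matter of wrapping whatever Selenium rendered in an HtmlResponse so the same .css()/.xpath() selectors can be reused. A standalone sketch of that step, where the Chrome driver and the example product URL are assumptions rather than part of the original code:
from scrapy.http import HtmlResponse
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://store.steampowered.com/app/7670/')  # placeholder product URL

# Wrap the browser-rendered page so Scrapy selectors work on it unchanged
rendered = HtmlResponse(url=driver.current_url,
                        body=driver.page_source,
                        encoding='utf-8')
print(rendered.css('.apphub_AppName ::text').extract())
driver.quit()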

Python/Scrapy scraping from Techcrunch

I am trying to build a spider to scrape some data from the website Techcrunch - Heartbleed search.
My thought was to pass a tag when executing the spider from the command line (example: Heartbleed). The spider should then search through all the associated search results, open each link and get the data contained within.
import scrapy

class TechcrunchSpider(scrapy.Spider):
    name = "tech_search"

    def start_requests(self):
        url = 'https://techcrunch.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + '?s=' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        pass
This code can be executed with: scrapy crawl tech_search -s DOWNLOAD_DELAY=1.5 -o tech_search.jl -a tag=EXAMPLEINPUT
Getting the data from the individual pages is not the problem, but actually getting the URLs to them is (from the search page linked above).
The thing is, when looking at the source HTML file (Ctrl + U) of the search site (link above), I can't find anything about the searched elements (example: "What Is Heartbleed? The Video"). Any suggestions on how to obtain these elements?
I suggest that you define your scrapy class along the lines shown in this answer, but using the PhantomJS selenium headless browser. The essential problem is that those pages use JavaScript to build the HTML (DOM) you see in the browser, so the search results are not present in the raw page Scrapy downloads via the route you have chosen.
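A minimal sketch of that approach applied to the spider from the question, using headless Chrome through Selenium instead of PhantomJS (which is no longer maintained); the result-link selector is a guess about TechCrunch's rendered markup, not something taken from the site:
import scrapy
from scrapy.http import HtmlResponse
from selenium import webdriver

class TechcrunchSpider(scrapy.Spider):
    name = "tech_search"

    def start_requests(self):
        url = 'https://techcrunch.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + '?s=' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        # render the search page in a real browser so the JS-built results exist in the DOM
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        driver = webdriver.Chrome(options=options)
        driver.get(response.url)
        rendered = HtmlResponse(url=driver.current_url,
                                body=driver.page_source,
                                encoding='utf-8')
        driver.quit()
        # 'a.post-block__title__link' is an assumed selector for the rendered result links
        for href in rendered.css('a.post-block__title__link::attr(href)').getall():
            yield scrapy.Request(href, callback=self.parse_article)

    def parse_article(self, response):
        # placeholder: extract whatever data is needed from each article page
        yield {'url': response.url, 'title': response.css('title::text').get()}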

Can't crawl more than a few items per page

I'm new to scrapy and tried to crawl from a couple of sites, but wasn't able to get more than a few images from there.
For example, for http://shop.nordstrom.com/c/womens-dresses-new with the following code -
def parse(self, response):
    for dress in response.css('article.npr-product-module'):
        yield {
            'src': dress.css('img.product-photo').xpath('@src').extract_first(),
            'url': dress.css('a.product-photo-href').xpath('@href').extract_first()
        }
I got 6 products. I expect 66.
For URL https://www.renttherunway.com/products/dress with the following code -
def parse(self, response):
    for dress in response.css('div.cycle-image-0'):
        yield {
            'image-url': dress.xpath('.//img/@src').extract_first(),
        }
I got 12. I expect roughly 100.
Even when I changed it to crawl every 'next' page, I got the same number per page but it went through all pages successfully.
I have tried a different USER_AGENT, disabled COOKIES, and DOWNLOAD_DELAY of 5.
I imagine I will run into the same problem on other sites, so folks must have seen this before, but I can't find a reference to it.
What am I missing?
It's one of those weird websites where they store the product data as JSON in the HTML source and unpack it with JavaScript on page load.
To figure this out, what you usually want to do is:
disable JavaScript and do scrapy view <url>
investigate the results
find the id in the product URL and search for that id in the page source to check whether it exists and, if so, where it is hidden. If it doesn't exist, it is being populated by some AJAX request -> re-enable JavaScript, go to the page and dig through the browser inspector's network tab to find it.
If you do a regex-based search:
re.findall(r"ProductResults, (\{.+\})\)", response.body_as_unicode())
you'll get a huge JSON blob that contains all products and their information.
import json
import re

data = re.findall(r"ProductResults, (\{.+\})\)", response.body_as_unicode())
data = json.loads(data[0])['data']
print(len(data['ProductResult']['Products']))
# >> 66
That gets the correct number of products!
So in your parse you can do this:
def parse(self, response):
    data = re.findall(r"ProductResults, (\{.+\})\)", response.body_as_unicode())
    data = json.loads(data[0])['data']
    for product in data['ProductResult']['Products']:
        # find main image
        image_url = [m['Url'] for m in product['Media'] if m['Type'] == 'MainImage']
        yield {'image_url': image_url}

Python Scrapy - Ajax Pagination Tripadvisor

I'm using Python-Scrapy to scrape the reviews on TripAdvisor members' pages.
Here is the URL I'm using: http://www.tripadvisor.com/members/scottca075
I'm able to get the first page using scrapy, but I haven't been able to get the other pages. I observed the XHR requests in the Network tab of the browser on clicking the Next button.
One GET and one POST request are sent:
On checking the parameters for the GET request, I see this:
action : undefined_Other_ClickNext_REVIEWS_ALL
gaa : Other_ClickNext_REVIEWS_ALL
gal : 50
gams : 0
gapu : Vq85qQoQKjYAABktcRMAAAAh
gass : members
The request URL is:
http://www.tripadvisor.com/ActionRecord?action=undefined_Other_ClickNext_REVIEWS_ALL&gaa=Other_ClickNext_REVIEWS_ALL&gal=0&gass=members&gapu=Vq8xPAoQLnMAAUutB9gAAAAJ&gams=1
The parameter gal represents the offset. Each page has 50 reviews. On moving to the second page by clicking the next button, the parameter gal is set to 50. Then, 100,150,200..and so on.
The data I want is in the POST request, in JSON format (image of the JSON data in the POST request). The request URL of the POST request is http://www.tripadvisor.com/ModuleAjax?
I'm confused as to how to make the request in scrapy to get the data.
I tried using FormRequest as follows:
pagination_url = "http://www.tripadvisor.com/ActionRecord"
form_date = {'action': 'undefined_Other_ClickNext_REVIEWS_ALL', 'gaa': 'Other_ClickNext_REVIEWS_ALL', 'gal': '0', 'gams': '0', 'gapu': 'Vq8EngoQL3EAAJKgcx4AAAAN', 'gass': 'members'}
FormRequest(url=self.pagination_url, formdata=form_date, callback=self.parseItem)
I also tried setting the headers option in the FormRequest:
headers = {'Host':'www.tripadvisor.com','Referer':'http://www.tripadvisor.com/members/prizm','X-Requested-With': 'XMLHttpRequest'}
If someone could explain what I'm missing and point me in the right direction that would be great. I have run out of ideas.
And also, I'm aware that I can use selenium. But I want to know if there is a faster way to do this.
Use ScrapyJS - Scrapy+JavaScript integration
To use ScrapyJS in your project, you first need to enable the middleware:
DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}
For example, if we wanted to retrieve the rendered HTML for a page, we could do something like this:
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"  # spider name added so the example runs as-is
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    def parse(self, response):
        # response.body is a result of the render.html call; it
        # contains HTML processed by a browser.
        # …
A common scenario is that the user needs to click a button before the page is displayed. We can handle this using jQuery with Splash:
function main(splash)
    splash:autoload("https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js")
    splash:go("http://example.com")
    splash:runjs("$('#some-button').click()")
    return splash:html()
end
For more details check the ScrapyJS/Splash documentation.
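A minimal sketch of driving such a Lua script from a spider, assuming a Splash instance is running and the middleware above is enabled; the 'execute' endpoint and 'lua_source' argument are standard Splash usage rather than something from the question, and the button selector is a placeholder:
import scrapy

LUA_SCRIPT = """
function main(splash)
    splash:autoload("https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js")
    splash:go(splash.args.url)
    splash:runjs("$('#some-button').click()")
    splash:wait(1.0)
    return splash:html()
end
"""

class ClickSpider(scrapy.Spider):
    name = "click"
    start_urls = ["http://example.com"]

    def start_requests(self):
        for url in self.start_urls:
            # run the script through Splash's 'execute' endpoint instead of render.html
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'execute',
                    'args': {'lua_source': LUA_SCRIPT},
                }
            })

    def parse(self, response):
        # response.body now contains the HTML after the simulated click
        yield {'url': response.url, 'html_length': len(response.body)}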
So, what you are doing is correct;
add the yield in front of the FormRequest:
yield FormRequest(...)
Secondly, focus on the value of gal, because it is the only parameter changing here; don't keep gal = "0".
Find the total number of reviews and go from 50 up to the total, adding 50 with each request.
form_date = {'action': 'undefined_Other_ClickNext_REVIEWS_ALL', 'gaa': 'Other_ClickNext_REVIEWS_ALL', 'gal': reviews_till_this_page, 'gams': '0', 'gapu': 'Vq8EngoQL3EAAJKgcx4AAAAN', 'gass': 'members'}
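A minimal sketch of that pagination loop; total_reviews is a placeholder for the count scraped from the member's profile page, parseItem is the callback named in the question, and the other parameter values are copied from the question rather than from a live session:
from scrapy.http import FormRequest

def parse_member(self, response):
    pagination_url = "http://www.tripadvisor.com/ActionRecord"
    total_reviews = 250  # placeholder: extract the real count from the profile page
    # one FormRequest per page of 50 reviews: gal = 50, 100, 150, ...
    for offset in range(50, total_reviews, 50):
        form_date = {
            'action': 'undefined_Other_ClickNext_REVIEWS_ALL',
            'gaa': 'Other_ClickNext_REVIEWS_ALL',
            'gal': str(offset),  # the only value that changes between pages
            'gams': '0',
            'gapu': 'Vq8EngoQL3EAAJKgcx4AAAAN',
            'gass': 'members',
        }
        yield FormRequest(url=pagination_url, formdata=form_date,
                          callback=self.parseItem, dont_filter=True)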
