Can Scrapy skip the error of empty data and keep scraping? - python

I want to scrape product pages from a sitemap. The product pages are similar, but not all of them are the same. For example:
Product A
https://www.vitalsource.com/products/environment-the-science-behind-the-stories-jay-h-withgott-matthew-v9780134446400
Product B
https://www.vitalsource.com/products/abnormal-psychology-susan-nolen-hoeksema-v9781259765667
We can see that product A has a subtitle but product B doesn't.
So I get errors when I try to scrape all the product pages.
My question is, is there a way to let the spider skip the error when no data is returned?
There is a simple way to bypass it: not using strip().
But I am wondering if there is a better way to do the job.
import scrapy
import re
from VitalSource.items import VitalsourceItem
from scrapy.selector import Selector
from scrapy.spiders import SitemapSpider

class VsSpider(SitemapSpider):
    name = 'VS'
    allowed_domains = ['vitalsource.com']
    sitemap_urls = ['https://storage.googleapis.com/vst-stargate-production/sitemap/sitemap1.xml.gz']
    sitemap_rules = [
        ('/products/', 'parse_product'),
    ]

    def parse_product(self, response):
        selector = Selector(response=response)
        item = VitalsourceItem()
        item['Ebook_Title'] = response.css('.product-overview__title-header::text').extract()[1].strip
        item['Ebook_SubTitle'] = response.css("div.subtitle.subtitle-pdp::text").extract().strip
        print(item)
        return item
Error message:
item['Ebook_SubTitle'] = response.css("div.subtitle.subtitle-pdp::text").extract().strip
AttributeError: 'list' object has no attribute 'strip'

Since you need only one subtitle, you can use get() with the default value set to an empty string. This will save you from errors about applying the strip() function to an empty element.
item['Ebook_SubTitle'] = response.css("div.subtitle.subtitle-pdp::text").get('').strip()
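To see why get() with a default is safer than indexing into extract(), here is a plain-Python sketch of the two behaviors (no Scrapy needed; first_or_default is a hypothetical helper that mimics what SelectorList.get(default) does):

```python
def first_or_default(values, default=''):
    """Mimic SelectorList.get(default): return the first match, or the default."""
    return values[0] if values else default

# A page with no subtitle produces an empty match list:
matches = []
subtitle = first_or_default(matches).strip()  # no error; '' supports .strip()
print(repr(subtitle))  # ''

# Indexing with extract()[0] raises on an empty list instead:
try:
    matches[0].strip()
except IndexError:
    print('indexing an empty list fails')
```

The default value is returned whenever the selector matches nothing, so the chained strip() always operates on a string.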

In general Scrapy will not stop crawling if a callback raises an exception, e.g.:
def start_requests(self):
    for i in range(10):
        yield Request(
            f'http://example.org/page/{i}',
            callback=self.parse,
            errback=self.errback,
        )

def parse(self, response):
    # first page
    if 'page/1' in response.request.url:
        raise ValueError()
    yield {'url': response.url}

def errback(self, failure):
    print(f"oh no, failed to parse {failure.request}")
In this example 10 requests will be made; 9 items will be scraped, but 1 will fail and go to the errback.
In your case you have nothing to fear: any request that does not raise an exception will scrape as it should; for the ones that do, you'll just see an exception traceback in your terminal/logs.

You could check if a value is returned before extracting:
if response.css("div.subtitle.subtitle-pdp::text"):
    item['Ebook_SubTitle'] = response.css("div.subtitle.subtitle-pdp::text").get().strip()
That way the subtitle line would only run if a value was returned.

Related

Scrapy one item with multiple parsing functions

I am using Scrapy with python to scrape a website and I face some difficulties with filling the item that I have created.
The products are properly scraped and everything is working well as long as the info is located within the response.xpath mentioned in the for loop.
'trend' and 'number' are properly added to the Item using ItemLoader.
However, the date of the product is not located within the response.xpath cited below but in the page title: response.css('title').
import scrapy
import datetime
from trends.items import Trend_item
from scrapy.loader import ItemLoader

# Initiate the spider
class trendspiders(scrapy.Spider):
    name = 'milk'
    start_urls = ['https://thewebsiteforthebestmilk/ireland/2022-03-16/7/']

    def parse(self, response):
        for milk_unique in response.xpath('/html/body/main/div/div[2]/div[1]/section[1]/div/div[3]/table/tbody/tr'):
            l = ItemLoader(item=Milk_item(), selector=milk_unique, response=response)
            l.add_css('milk', 'a::text')
            l.add_css('number', 'span.small.text-muted::text')
            return l.load_item()
How can I add the 'date' to my item, please (found in response.css('title'))?
I have tried to add l.add_css('date', "response.css('title')") in the for loop but it returns an error.
Should I create a new parsing function? If yes then how to send the info to the same Item?
I hope I’ve made myself clear.
Thank you very much for your help,
Since the date is outside of the selector you are using for each row, what you should do is extract that first before your for loop, since it doesn't need to be updated on each iteration.
Then with your item loader you can just use l.add_value to load it with the rest of the fields.
For example:
class trendspiders(scrapy.Spider):
    name = 'trends'
    start_urls = ['https://getdaytrends.com/ireland/2022-03-16/7/']

    def parse(self, response):
        date_str = response.xpath("//title/text()").get()
        for trend_unique in response.xpath('/html/body/main/div/div[2]/div[1]/section[1]/div/div[3]/table/tbody/tr'):
            l = ItemLoader(item=Trend_item(), selector=trend_unique, response=response)
            l.add_css('trend', 'a::text')
            l.add_css('number', 'span.small.text-muted::text')
            l.add_value('date', date_str)
            yield l.load_item()
If response.css('title').get() gives you the answer you need, why not use the same CSS with add_css:
l.add_css('date', 'title')
Also, .add_css('date', "response.css('title')") is invalid because the second argument must be a valid CSS selector.
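The title text usually needs post-processing into an actual date. A stdlib-only sketch of that step (the title string below is invented for illustration; the real page's title will differ):

```python
import re
from datetime import datetime

# Hypothetical <title> text; adjust the pattern to the real page.
title = "Twitter trends in Ireland on 16-03-2022 | getdaytrends"

# Pull out a dd-mm-yyyy date and parse it.
match = re.search(r'\d{2}-\d{2}-\d{4}', title)
date = datetime.strptime(match.group(), '%d-%m-%Y').date() if match else None
print(date)  # 2022-03-16
```

A processor like this can also be attached to the item field via the ItemLoader's input/output processors instead of being done inline.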

Scrapy/Python yield and continue processing possible?

I am trying this sample code
from scrapy.spiders import Spider, Request
import scrapy

class MySpider(Spider):
    name = 'toscrapecom'
    start_urls = ['http://books.toscrape.com/catalogue/page-1.html']
    urls = (
        'http://books.toscrape.com/catalogue/page-{}.html'.format(i + 1) for i in range(50)
    )

    def parse(self, response):
        for url in self.urls:
            return Request(url)
It crawls all the pages fine. However if I yield an item before the for loop then it crawls only the first page. (as shown below)
from scrapy.spiders import Spider, Request
import scrapy

class MySpider(Spider):
    name = 'toscrapecom'
    start_urls = ['http://books.toscrape.com/catalogue/page-1.html']
    urls = (
        'http://books.toscrape.com/catalogue/page-{}.html'.format(i + 1) for i in range(50)
    )

    def parse(self, response):
        yield scrapy.item.Item()
        for url in self.urls:
            return Request(url)
But I can use yield Request(url) instead of return... and it scrapes the pages backwards from last page to first.
I would like to understand why return does not work anymore once an item is yielded? Can somebody explain this in a simple way?
You ask why the second code does not work, but I don’t think you fully understand why the first code works :)
The for loop of your first code only loops once.
What is happening is:
self.parse() is called for the URL in self.start_urls.
self.parse() gets the first (and only the first!) URL from self.urls, returns a request for it, and exits self.parse().
When a response for that first URL arrives, self.parse() gets called again, and this time it returns a request (only 1 request!) for the second URL from self.urls, because the previous call to self.parse() already consumed the first URL from it (self.urls is an iterator).
The last step repeats in a loop, but it is not the for loop that does it.
You can change your original code to this and it will work the same way:
def parse(self, response):
    try:
        return Request(next(self.urls))
    except StopIteration:
        pass
Because to yield items/requests the callback should be a generator function.
In Python 2 you cannot even use yield and return a value in the same function; it raises SyntaxError: 'return' with argument inside generator. (In Python 3 this is allowed: a return inside a generator simply stops it.)
The return is (almost) equivalent to raising StopIteration. In the topic Return and yield in the same function you can find a very detailed explanation, with links to the specification.
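The generator behavior is easy to verify in plain Python: once a function body contains yield, calling it returns a generator, and a return inside it just ends the iteration instead of handing a value back:

```python
def gen():
    yield 'first'
    return  # stops the generator; nothing after this line runs
    yield 'never reached'

assert list(gen()) == ['first']

# A callback-like function with yield is a generator even if it also returns:
def parse_like():
    yield {'item': 1}
    return  # equivalent to raising StopIteration

print(list(parse_like()))  # [{'item': 1}]
```

This is exactly why adding a yield before the for loop changes how the return behaves: the whole function becomes a generator, and return no longer delivers the Request to Scrapy as a return value.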

Scrapy callback doesn't work at all in my script

I've written a script in Python Scrapy to parse names and prices of different items available on a webpage. I tried to implement logic in my script the way I've learnt so far. However, when I execute it, I get the following error. I suppose I can't make the callback method work properly. Here is the script I've tried:
The spider named "sth.py" contains:
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http.request import Request

class SephoraSpider(CrawlSpider):
    name = "sephorasp"

    def start_requests(self):
        yield Request(url="https://www.sephora.ae/en/stores/", callback=self.parse_pages)

    def parse_pages(self, response):
        for link in response.xpath('//ul[@class="nav-primary"]//a[contains(@class,"level0")]/@href').extract():
            yield Request(url=link, callback=self.parse_inner_pages)

    def parse_inner_pages(self, response):
        for links in response.xpath('//li[contains(@class,"amshopby-cat")]/a/@href').extract():
            yield Request(url=links, callback=self.target_page)

    def target_page(self, response):
        for titles in response.xpath('//div[@class="product-info"]'):
            product = titles.xpath('.//div[contains(@class,"product-name")]/a/text()').extract_first()
            rate = titles.xpath('.//span[@class="price"]/text()').extract_first()
            yield {'Name': product, 'Price': rate}
"items.py" contains:
import scrapy

class SephoraItem(scrapy.Item):
    Name = scrapy.Field()
    Price = scrapy.Field()
Partial error looks like:
if cookie.secure and request.type != "https":
AttributeError: 'WrappedRequest' object has no attribute 'type'
Here is the total error log:
"https://www.dropbox.com/s/kguw8174ye6p3q9/output.log?dl=0"
Looks like you are running Scrapy v1.1 while the current release is v1.4. As far as I remember there was a bug in some early 1.x version regarding the WrappedRequest object used for handling cookies.
Try upgrading to v1.4:
pip install scrapy --upgrade

Scrapy Start_request parse

I am writing a scrapy script to search and scrape results from a website. I need to search items on the website and parse each URL from the search results. I started with Scrapy's start_requests, where I pass the search query and redirect to another function, parse, which retrieves the URLs from the search results. Finally I call another function, parse_item, to parse the results. I'm able to extract all the search result URLs but I'm not able to parse the results (parse_item is not working). Here is the code:
# -*- coding: utf-8 -*-
from scrapy.http.request import Request
from scrapy.spider import BaseSpider

class xyzspider(BaseSpider):
    name = 'dspider'
    allowed_domains = ["www.example.com"]
    mylist = ['Search item 1', 'Search item 2']
    url = 'https://example.com/search?q='

    def start_requests(self):
        for i in self.mylist:
            i = i.replace(' ', '+')
            starturl = self.url + i
            yield Request(starturl, self.parse)

    def parse(self, response):
        itemurl = response.xpath(".//section[contains(@class, 'search-results')]/a/@href").extract()
        for j in itemurl:
            print j
            yield Request(j, self.parse_item)

    def parse_item(self, response):
        print "hello"
        '''rating = response.xpath(".//ul[@class='ratings']/li[1]/span[1]/text()").extract()
        print rating'''
Could anyone please help me. Thank you.
I was getting a Filtered offsite request error. I changed the allowed domain from allowed_domains = ['www.xyz.com'] to ['xyz.com'] and it worked perfectly.
Your code looks good. So you might need to use the Request attribute dont_filter set to True:
yield Request(j, self.parse_item, dont_filter=True)
From the docs:
dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Default to False.
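As a toy model of what the duplicates filter does (a deliberate simplification; the real scheduler fingerprints whole requests, not bare URLs):

```python
seen = set()

def should_schedule(url, dont_filter=False):
    """Return True if a request for this URL would be scheduled."""
    if dont_filter:
        return True   # bypasses the duplicates filter entirely
    if url in seen:
        return False  # duplicate: silently dropped by the scheduler
    seen.add(url)
    return True

assert should_schedule('https://example.com/item/1') is True
assert should_schedule('https://example.com/item/1') is False  # filtered
assert should_schedule('https://example.com/item/1', dont_filter=True) is True
```

This is why a second request to an already-visited URL appears to do nothing unless dont_filter=True is set.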
Anyway, I recommend you have a look at Item Pipelines. They are used to post-process everything the spider yields (e.g. yield my_object).
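A pipeline is just a class with a process_item method. A minimal sketch (the class and field names are made up for illustration):

```python
class PriceStripPipeline:
    """Sketch: normalize whitespace in a scraped 'Price' field."""

    def process_item(self, item, spider):
        # Pipelines receive each item the spider yields and must return it
        # (or raise DropItem in real Scrapy to discard it).
        if item.get('Price'):
            item['Price'] = item['Price'].strip()
        return item

pipeline = PriceStripPipeline()
result = pipeline.process_item({'Name': 'sample', 'Price': ' 9.99 '}, spider=None)
print(result)  # {'Name': 'sample', 'Price': '9.99'}
```

In a real project you would enable it in settings.py via ITEM_PIPELINES = {'myproject.pipelines.PriceStripPipeline': 300} (the module path here is hypothetical).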

How to avoid duplication in a crawler

I wrote a crawler using the scrapy framework in Python to select some links and meta tags. It then crawls the start URLs and writes the data in JSON-encoded format to a file. The problem is that when the crawler is run two or three times with the same start URLs, the data in the file gets duplicated. To avoid this I used a middleware snippet from scrapy: http://snippets.scrapy.org/snippets/1/
What I did was copy and paste the above code in a file inside my scrapy project and I enabled it in the settings.py file by adding the following line:
SPIDER_MIDDLEWARES = {'a11ypi.removeDuplicates.IgnoreVisitedItems':560}
where "a11ypi.removeDuplicates.IgnoreVisitedItems" is the class path name and finally I went in and modified my items.py file and included the following fields
visit_id = Field()
visit_status = Field()
But this doesn't work: the crawler still appends the same results to the file when run twice.
I did the writing to the file in my pipelines.py file as follows:
import json

class AYpiPipeline(object):
    def __init__(self):
        self.file = open("a11ypi_dict.json", "ab+")

    # this method is called to process an item after it has been scraped.
    def process_item(self, item, spider):
        d = {}
        i = 0
        # Here we are iterating over the scraped items and creating a dictionary of dictionaries.
        try:
            while i < len(item["foruri"]):
                d.setdefault(item["foruri"][i], {}).setdefault(item["rec"][i], {})[item["foruri_id"][i]] = item['thisurl'] + ":" + item["thisid"][i]
                i += 1
        except IndexError:
            print "Index out of range"
        json.dump(d, self.file)
        return item
And my spider code is as follows:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from a11ypi.items import AYpiItem

class AYpiSpider(CrawlSpider):
    name = "a11y.in"
    allowed_domains = ["a11y.in"]

    # This is the list of seed URLs to begin crawling with.
    start_urls = ["http://www.a11y.in/a11ypi/idea/fire-hi.html"]

    # This is the callback method, which is used for scraping specific data
    def parse(self, response):
        temp = []
        hxs = HtmlXPathSelector(response)
        item = AYpiItem()
        wholeforuri = hxs.select("//@foruri").extract()  # XPath to extract the foruri, which contains both the URL and id
        for i in wholeforuri:
            temp.append(i.rpartition(":"))
        item["foruri"] = [i[0] for i in temp]  # This contains the URL in foruri
        item["foruri_id"] = [i.split(":")[-1] for i in wholeforuri]  # This contains the id in foruri
        item['thisurl'] = response.url
        item["thisid"] = hxs.select("//@foruri/../@id").extract()
        item["rec"] = hxs.select("//@foruri/../@rec").extract()
        return item
Kindly suggest what to do.
Try to understand why the snippet is written as it is:
if isinstance(x, Request):
    if self.FILTER_VISITED in x.meta:
        visit_id = self._visited_id(x)
        if visit_id in visited_ids:
            log.msg("Ignoring already visited: %s" % x.url,
                    level=log.INFO, spider=spider)
            visited = True
Notice in line 2: you actually require a key in Request.meta called FILTER_VISITED for the middleware to drop the request. The reason is well-intended, because otherwise every single URL you have ever visited would be skipped and you would have no URLs to traverse at all. So FILTER_VISITED actually lets you choose which URL patterns you want to skip. If you want links extracted by a particular rule to be skipped, just do:
Rule(SgmlLinkExtractor(allow=('url_regex1', 'url_regex2')), callback='my_callback', process_request=setVisitFilter)

def setVisitFilter(request):
    request.meta['filter_visited'] = True
    return request
P.S. I do not know if it works for 0.14 and above, as some of the code has changed for storing spider context in the sqlite db.
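If the goal is only to avoid re-writing items across runs, an alternative to the middleware is to persist the set of visited URLs yourself. A stdlib-only sketch (the file name and JSON layout are my own choices, not part of the snippet above):

```python
import json
import os

def load_seen(path):
    """Load the set of URLs recorded by a previous run (empty set if none)."""
    if os.path.exists(path):
        with open(path) as f:
            return set(json.load(f))
    return set()

def save_seen(seen, path):
    """Persist the seen set so the next run can skip those URLs."""
    with open(path, 'w') as f:
        json.dump(sorted(seen), f)

# Usage: check before writing an item, record afterwards.
seen = load_seen('seen_urls.json')
url = 'http://www.a11y.in/a11ypi/idea/fire-hi.html'
if url not in seen:
    # ... write the item to the output file ...
    seen.add(url)
save_seen(seen, 'seen_urls.json')
```

The check-before-write would live in the pipeline's process_item, with load_seen in __init__ and save_seen when the spider closes.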
