Is there any way to set allowed_domains per start_url? For each URL in start_urls I want to restrict crawling to the domain of that URL. Once a site has been crawled, I would need that domain to be removed from allowed_domains. I guess one way would be to dynamically add/remove domains in allowed_domains?
Related question: Crawl multiple domains with Scrapy without criss-cross
You can try something like this, checking that the Requests the spider outputs for each response are for the same domain as that response (warning: not tested):
from scrapy.http import Request
from scrapy.utils.httpobj import urlparse_cached

class CrissCrossOffsiteMiddleware(object):

    def process_spider_output(self, response, result, spider):
        # domain of the page that produced these requests
        # (urlparse_cached expects a Request/Response object, not a URL string)
        domainr = urlparse_cached(response).hostname
        for x in result:
            if isinstance(x, Request):
                if x.dont_filter:
                    yield x
                else:
                    domaino = urlparse_cached(x).hostname
                    if domaino == domainr:
                        yield x
            else:
                yield x
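If you go this route, the middleware also needs to be enabled as a spider middleware in the project settings. A minimal sketch, assuming the class lives in a hypothetical myproject/middlewares.py:

    # settings.py
    SPIDER_MIDDLEWARES = {
        # 'myproject.middlewares' is a placeholder path; point it at wherever the class is defined
        'myproject.middlewares.CrissCrossOffsiteMiddleware': 550,
    }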
My goal is to print something from the parse method when I iterate through the for loop in the get_membership_no method.
I am using Python 3.8.5 and Scrapy 1.7.3. When I run the code below I get "Filtered offsite request".
Here is the console output.
And here is my code.
import scrapy
import json

class BasisMembersSpider(scrapy.Spider):
    name = 'basis'
    allowed_domains = ['www.basis.org.bd']

    def start_requests(self):
        yield scrapy.Request(url="https://basis.org.bd/get-member-list?page=1&team=", callback=self.get_membership_no)

    def get_membership_no(self, response):
        data_array = json.loads(response.body)['data']
        for data in data_array:
            yield scrapy.Request(url='https://basis.org.bd/get-company-profile/{0}'.format(data['membership_no']), callback=self.parse)

    def parse(self, response):
        print("I want to get this line on console. thank you.")
The reason for this behavior is that you set allowed_domains = ['www.basis.org.bd'], which blocks requests to basis.org.bd.
You can either leave allowed_domains out completely or extend your list of allowed domains like this:
allowed_domains = ['www.basis.org.bd', 'basis.org.bd']
See the documentation for allowed_domains here for more information.
Removing "www." from allowed_domains worked for me.
I found this article really helpful.
I have the following code:
#FirstSpider.py
class FirstSpider(scrapy.Spider):
    name = 'first'
    start_urls = ['https://www.basesite.com']
    next_urls = []

    def parse(self, response):
        for url in response.css('bunch > of > css > here'):
            self.next_urls.append(url.css('more > css > here'))
            l = Loader(item=Item(), selector=url.css('more > css'))
            l.add_css('add', 'more > css')
            ...
            ...
            yield l.load_item()

        for url in self.next_urls:
            new_urls = self.start_urls[0] + url
            yield scrapy.Request(new_urls, callback=SecondSpider.parse_url)

#SecondSpider.py
class SecondSpider(scrapy.Spider):
    name = 'second'
    start_urls = ['https://www.basesite.com']

    def parse_url(self):
        """Parse team data."""
        return self
        # self is a HtmlResponse not a 'response' object

    def parse(self, response):
        """Parse all."""
        summary = self.parse_url(response)
        return summary

#ThirdSpider.py
class ThirdSpider(scrapy.Spider):
    # take links from second spider, continue:
I want to be able to pass the URL scraped in Spider 1 to Spider 2 (in a different script). I'm curious as to why, when I do, the 'response' is an HtmlResponse and not a 'response' object (when doing something similar with a method in the same class as Spider 1, I don't have this issue).
What am I missing here? How do I just pass the original response(s) to the second spider (and from the second onto the third, etc.)?
You could use Redis as a shared resource between all spiders: https://github.com/rmax/scrapy-redis
Run all N spiders (don't close on idle state), so each of them connects to the same Redis instance and waits for tasks (URL, request headers) from there;
As a side effect, push the task data to Redis from X_spider under a specific key (the Y_spider name). A sketch of this setup follows.
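A minimal sketch of that idea with scrapy-redis (not from the original answer; the key name and Redis address are assumptions):

    # settings.py
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    REDIS_URL = 'redis://localhost:6379'

    # second_spider.py
    from scrapy_redis.spiders import RedisSpider

    class SecondSpider(RedisSpider):
        name = 'second'
        redis_key = 'second:start_urls'  # the spider idles and waits for URLs pushed to this Redis list

        def parse(self, response):
            # parse team data here
            ...

The producing spider (or redis-cli) then pushes URLs into that list, e.g. lpush second:start_urls https://www.basesite.com/some-page.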
What about using inheritance? The "parse" function names should be different.
If your first spider inherits from the second, it will be able to set the callback to self.parse_function_spider2.
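A minimal sketch of the inheritance approach (not from the original answer; the selector and URLs are placeholders taken from the question):

    import scrapy

    class SecondSpider(scrapy.Spider):
        name = 'second'

        def parse_url(self, response):
            """Parse team data."""
            # work with the response here
            ...

    class FirstSpider(SecondSpider):
        name = 'first'
        start_urls = ['https://www.basesite.com']

        def parse(self, response):
            for href in response.css('bunch > of > css > here::attr(href)').extract():
                # the callback is the parse_url method inherited from SecondSpider
                yield scrapy.Request(response.urljoin(href), callback=self.parse_url)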
I'm interested in using Scrapy-Redis to store scraped items in Redis. In particular, the Redis-based request duplicates filter seems like a useful feature.
To start off, I adapted the spider at https://doc.scrapy.org/en/latest/intro/tutorial.html#extracting-data-in-our-spider as follows:
import scrapy
from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]
    custom_settings = {'SCHEDULER': 'scrapy_redis.scheduler.Scheduler',
                       'DUPEFILTER_CLASS': 'scrapy_redis.dupefilter.RFPDupeFilter',
                       'ITEM_PIPELINES': {'scrapy_redis.pipelines.RedisPipeline': 300}}

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').extract_first()
            item['author'] = quote.css('small.author::text').extract_first()
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item
where I generated the project using scrapy startproject tutorial at the command line and defined QuoteItem in items.py as
import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
Basically, I've applied the settings from the "Usage" section of the README as per-spider settings and made the spider yield an Item object instead of a regular Python dictionary. (I figured this would be necessary to trigger the Item Pipeline.)
Now, if I crawl the spider using scrapy crawl quotes from the command line and then do redis-cli, I see a quotes:items key:
127.0.0.1:6379> keys *
1) "quotes:items"
which is a list of length 20:
127.0.0.1:6379> llen quotes:items
(integer) 20
If I run scrapy crawl quotes again, the length of the list doubles to 40:
127.0.0.1:6379> llen quotes:items
(integer) 40
However, I would expect the length of quotes:items to still be 20, since I have simply re-scraped the same pages. Am I doing something wrong here?
Scrapy-redis doesn't filter duplicate items automatically.
The (request) dupefilter is about duplicate requests within a crawl. What you want seems to be something similar to the deltafetch middleware: https://github.com/scrapy-plugins/scrapy-deltafetch
You would need to adapt deltafetch to work with a distributed storage; perhaps Redis' bitmap feature will fit this case.
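For reference, the stock (file-based, non-distributed) deltafetch middleware is typically enabled through settings along these lines (a sketch, assuming the scrapy-deltafetch plugin is installed); adapting its storage to Redis would be the remaining work:

    # settings.py -- assumes: pip install scrapy-deltafetch
    SPIDER_MIDDLEWARES = {
        'scrapy_deltafetch.DeltaFetch': 100,
    }
    DELTAFETCH_ENABLED = True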
Here is how I fixed the problem in the end. First of all, as pointed out to me in a separate question, How to implement a custom dupefilter in Scrapy?, using the start_urls class variable results in an implementation of start_requests in which the yielded Request objects have dont_filter=True. To disable this and use the default dont_filter=False instead, I implemented start_requests directly:
import scrapy
from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        'SCHEDULER': 'scrapy_redis.scheduler.Scheduler',
        'DUPEFILTER_CLASS': 'tutorial.dupefilter.RedisDupeFilter',
        'ITEM_PIPELINES': {'scrapy_redis.pipelines.RedisPipeline': 300}
    }

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').extract_first()
            item['author'] = quote.css('small.author::text').extract_first()
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item
Secondly, as pointed out by Rolando, the fingerprints aren't by default persisted across different crawls. To implement this, I subclassed Scrapy-Redis' RFPDupeFilter class:
import scrapy_redis.dupefilter
from scrapy_redis.connection import get_redis_from_settings

class RedisDupeFilter(scrapy_redis.dupefilter.RFPDupeFilter):
    @classmethod
    def from_settings(cls, settings):
        server = get_redis_from_settings(settings)
        key = "URLs_seen"  # Use a fixed key instead of one containing a timestamp
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(server=server, key=key, debug=debug)

    def request_seen(self, request):
        added = self.server.sadd(self.key, request.url)
        return added == 0

    def clear(self):
        pass  # Don't delete the key from Redis
The main differences are (1) the key is set to a fixed value (not one containing a time stamp) and (2) the clear method, which in Scrapy-Redis' implementation deletes the key from Redis, is effectively disabled.
Now, when I run scrapy crawl quotes the second time, I see the expected log output
2017-05-05 15:13:46 [scrapy_redis.dupefilter] DEBUG: Filtered duplicate request <GET http://quotes.toscrape.com/page/1/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
and no items are scraped.
I was trying to crawl Amazon grocery UK, and to get the grocery categories I was using the Associate Product Advertising API. My requests get enqueued; however, since the signed requests expire after 15 minutes, some requests are only crawled more than 15 minutes after being enqueued, which means they have expired by the time they are crawled and yield a 400 error. I was thinking of enqueueing requests in batches, but even that will fail if the implementation only controls processing them in batches, as the problem is preparing the requests in batches as opposed to processing them in batches. Unfortunately, Scrapy has little documentation for this use case, so how can requests be prepared in batches?
from scrapy.spiders import XMLFeedSpider
from scrapy.utils.misc import arg_to_iter
from scrapy.loader.processors import TakeFirst

from crawlers.http import AmazonApiRequest
from crawlers.items import (AmazonCategoryItemLoader)
from crawlers.spiders import MySpider

class AmazonCategorySpider(XMLFeedSpider, MySpider):
    name = 'amazon_categories'
    allowed_domains = ['amazon.co.uk', 'ecs.amazonaws.co.uk']
    marketplace_domain_name = 'amazon.co.uk'
    download_delay = 1
    rotate_user_agent = 1

    grocery_node_id = 344155031

    # XMLSpider attributes
    iterator = 'xml'
    itertag = 'BrowseNodes/BrowseNode/Children/BrowseNode'

    def start_requests(self):
        return arg_to_iter(
            AmazonApiRequest(
                qargs=dict(Operation='BrowseNodeLookup',
                           BrowseNodeId=self.grocery_node_id),
                meta=dict(ancestor_node_id=self.grocery_node_id)
            ))

    def parse(self, response):
        response.selector.remove_namespaces()
        has_children = bool(response.xpath('//BrowseNodes/BrowseNode/Children'))
        if not has_children:
            return response.meta['category']
        # here the request should be configurable to allow batching
        return super(AmazonCategorySpider, self).parse(response)

    def parse_node(self, response, node):
        category = response.meta.get('category')
        l = AmazonCategoryItemLoader(selector=node)
        l.add_xpath('name', 'Name/text()')
        l.add_value('parent', category)
        node_id = l.get_xpath('BrowseNodeId/text()', TakeFirst(), lambda x: int(x))
        l.add_value('node_id', node_id)
        category_item = l.load_item()
        return AmazonApiRequest(
            qargs=dict(Operation='BrowseNodeLookup',
                       BrowseNodeId=node_id),
            meta=dict(ancestor_node_id=node_id,
                      category=category_item)
        )
One way of doing this:
Since there are two places where you yield requests, you can leverage the priority attribute to prioritise requests coming from the parse method:
from scrapy import Spider, Request

class MySpider(Spider):
    name = 'myspider'

    def start_requests(self):
        for url in very_long_list:  # placeholder for your full URL list
            yield Request(url)

    def parse(self, response):
        for url in short_list:  # placeholder for the expiring API URLs
            yield Request(url, self.parse_item, priority=1000)

    def parse_item(self, response):
        # parse item
        ...
In this example Scrapy will prioritize requests coming from parse, which will allow you to avoid the time limit.
See more on Request.priority:
priority (int) – the priority of this request (defaults to 0). The priority is used by the scheduler to define the order used to process requests. Requests with a higher priority value will execute earlier. Negative values are allowed in order to indicate relatively low-priority.
on scrapy docs
I am writing a Scrapy script to search and scrape results from a website. I need to search for items on the website and parse each URL from the search results. I started with Scrapy's start_requests, where I pass the search query and redirect to another function, parse, which retrieves the URLs from the search results. Finally, I call another function, parse_item, to parse the results. I'm able to extract all the search result URLs, but I'm not able to parse the results (parse_item is not working). Here is the code:
# -*- coding: utf-8 -*-
from scrapy.http.request import Request
from scrapy.spider import BaseSpider

class xyzspider(BaseSpider):
    name = 'dspider'
    allowed_domains = ["www.example.com"]
    mylist = ['Search item 1','Search item 2']
    url = 'https://example.com/search?q='

    def start_requests(self):
        for i in self.mylist:
            i = i.replace(' ','+')
            starturl = self.url + i
            yield Request(starturl, self.parse)

    def parse(self, response):
        itemurl = response.xpath(".//section[contains(@class, 'search-results')]/a/@href").extract()
        for j in itemurl:
            print j
            yield Request(j, self.parse_item)

    def parse_item(self, response):
        print "hello"
        '''rating = response.xpath(".//ul[@class = 'ratings']/li[1]/span[1]/text()").extract()
        print rating'''
Could anyone please help me. Thank you.
I was getting a Filtered offsite request error. I changed the allowed domain from allowed_domains = ['www.xyz.com'] to ['xyz.com'] and it worked perfectly.
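Concretely (a sketch; xyz.com stands in for the real site):

    allowed_domains = ['xyz.com']   # instead of ['www.xyz.com']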
Your code looks good. So you might need to use the Request attribute dont_filter set to True:
yield Request(j,self.parse_item, dont_filter=True)
From the docs:
dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Default to False.
Anyway, I recommend you have a look at Item Pipelines.
Those process the scraped items that your spider hands off with:
yield my_object
Item pipelines are used to post-process everything yielded by the spider.
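A minimal sketch of such a pipeline (the project and class names are placeholders):

    # pipelines.py
    class MyPipeline(object):
        def process_item(self, item, spider):
            # clean, validate or store the scraped item here
            return item

It is then enabled via the ITEM_PIPELINES setting, e.g. ITEM_PIPELINES = {'myproject.pipelines.MyPipeline': 300}.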