I want a middleware that will take a single Request and transform it into a generator of two different requests. As far as I can tell, the downloader middleware process_request() method can only return a single Request, not a generator of them. Is there a nice way to split an arbitrary request into multiple requests?
It seems that spider middleware process_start_requests actually happens after the start_requests Requests are sent through the downloader. For example, if I set start_urls = ['https://localhost/'] and
def process_start_requests(self, start_requests, spider):
    yield Request('https://stackoverflow.com')
it will fail with ConnectionRefusedError, having tried and failed the localhost request.
I don't know what the logic would be behind transforming a request (before it is sent) into multiple requests, but you can still generate several requests (or even items) from a middleware, like this:
def process_request(self, request, spider):
    for a in range(10):
        spider.crawler.engine.crawl(
            Request(url='myurl', callback=callback_method),
            spider)
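For context, here is a rough sketch of what that can look like as a complete downloader middleware. This is an illustration with assumptions, not the original answer's code: the URL transformation, the split_done meta flag and the class name are placeholders, and on newer Scrapy versions engine.crawl() takes only the request (the spider argument is deprecated). It schedules the generated requests through the engine and then drops the original one by raising IgnoreRequest:

# Sketch only: split one request into several and drop the original.
from scrapy import Request
from scrapy.exceptions import IgnoreRequest

class SplitRequestMiddleware:
    def process_request(self, request, spider):
        # Let through the requests we generated ourselves, to avoid looping forever.
        if request.meta.get('split_done'):
            return None
        for part in range(2):
            new_req = request.replace(
                url='{}?part={}'.format(request.url, part),  # placeholder transformation
                meta={**request.meta, 'split_done': True},
                dont_filter=True,
            )
            spider.crawler.engine.crawl(new_req, spider)  # newer Scrapy: engine.crawl(new_req)
        # Drop the original request so only the generated ones are downloaded.
        raise IgnoreRequest('original request replaced by split requests')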
I've got a certain spider which inherits from SitemapSpider. As expected, the first request on startup is to sitemap.xml of my website. However, for it to work correctly I need to add a header to all the requests, including the initial ones which fetch the sitemap. I do so with DownloaderMiddleware, like this:
def process_request(self, request: scrapy.http.Request, spider):
    if "Host" in request.headers:
        return None
    host = request.url.removeprefix("https://").removeprefix("http://").split("/")[0]
    request.headers["Host"] = host
    spider.logger.info(f"Got {request}")
    return request
However, looks like Scrapy's request deduplicator is stopping this request from going through. In my logs I see something like this:
2021-10-16 21:21:08 [ficbook-spider] INFO: Got <GET https://mywebsite.com/sitemap.xml>
2021-10-16 21:21:08 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://mywebsite.com/sitemap.xml>
Since spider.logger.info in process_request is triggered only once, I presume that it is the first request and that, after processing, it gets deduplicated. I thought that maybe deduplication is triggered before DownloaderMiddleware (which would explain why the request is deduplicated without a second "Got ..." in the logs); however, I don't think that's true, for two reasons:
I looked through the code of SitemapSpider, and it appears to fetch the sitemap.xml only once
If it did, in fact, fetch it before, I'd expect it to do something - instead it just stops the spider, since no pages were enqueued for processing
Why does this happen? Did I make some mistake in process_request?
It won't do anything with a first response, nor fetch a second one, because you are returning a request from your custom DownloaderMiddleware's process_request method, and that rescheduled request is being filtered out as a duplicate. From the docs:
If it returns a Request object, Scrapy will stop calling process_request methods and reschedule the returned request. Once the newly returned request is performed, the appropriate middleware chain will be called on the downloaded response.
It should work if you explicitly tell Scrapy not to filter your second request.
def process_request(self, request: scrapy.http.Request, spider):
    if "Host" in request.headers:
        return None
    host = request.url.removeprefix("https://").removeprefix("http://").split("/")[0]
    new_req = request.replace(dont_filter=True)
    new_req.headers["Host"] = host
    spider.logger.info(f"Got {new_req}")
    return new_req
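As a side note, since all the middleware does here is add a header, a simpler variant (a sketch of mine, not part of the original answer) is to modify the request in place and return None; per the docs, returning None continues processing the request through the remaining middlewares, so nothing is rescheduled and the dupefilter never sees the URL a second time:

def process_request(self, request: scrapy.http.Request, spider):
    # Mutate the request in place instead of rescheduling a copy.
    if "Host" not in request.headers:
        host = request.url.removeprefix("https://").removeprefix("http://").split("/")[0]
        request.headers["Host"] = host
    return None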
In Scrapy, how can I use different callback functions for allowed domains and denied domains?
I'm using the following rules:
rules = [
    Rule(LinkExtractor(allow=(), deny_domains=allowed_domains), callback='parse_denied_item', follow=True),
    Rule(LinkExtractor(allow_domains=allowed_domains), callback='parse_item', follow=True),
]
Basically I want parse_item to be called whenever there is a request from an allowed_domain (or sub-domain of one of those domains). Then I want parse_denied_item to be called for all requests that are not whitelisted by allowed_domains.
How can I do this?
I believe the best approach is not to use allowed_domains on LinkExtractor, and instead to parse the domain out of response.url in your parse_* method and apply different logic depending on the domain.
You can keep separate parse_* methods and a triaging method that, depending on the domain, delegates to the corresponding parse_* method with yield from (Python 3):
rules = [Rule(LinkExtractor(), callback='parse_all', follow=True)]

def parse_all(self, response):
    # Get the domain out of response.url (requires: from urllib.parse import urlparse)
    domain = urlparse(response.url).netloc
    if domain in self.allowed_domains:
        yield from self.parse_item(response)
    else:
        yield from self.parse_denied_item(response)
Based on Gallaecio's answer. An alternative option is to use the process_request argument of Rule; process_request captures the request before it is sent.
From my understanding (which could be wrong), Scrapy will only crawl domains listed in self.allowed_domains (assuming it is used). However, if an offsite link is encountered on a scraped page, Scrapy will in some cases send a single request to that offsite link [1]. I'm not sure why this happens. I think it possibly occurs because the target site performs a 301 or 302 redirect and the crawler automatically follows that URL. Otherwise, it's probably a bug.
process_request can be used to perform processing on a request before it is executed. In my case, I wanted to log all links that aren't being crawled, so I verify that an allowed domain is in request.url before proceeding, and log any that aren't.
Here is an example:
rules = [Rule(LinkExtractor(), callback='parse_item', process_request='process_item', follow=True)]

def process_item(self, request):
    found = False
    for url in self.allowed_domains:
        if url in request.url:
            # an allowed domain is in the request.url, proceed
            found = True
    if not found:  # otherwise log it
        self.logDeniedDomain(urlparse(request.url).netloc)
        # According to https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Rule,
        # returning None should prevent this request from being executed
        # (which is not the case for all of them); the downloader middleware
        # below is used to catch those few requests.
        request = None
    return request
[1]: If you're encountering this problem, using process_request in Downloader middleware appears to solve it though.
My Downloader middleware:
def process_request(self, request, spider):
    # catch any requests that should be filtered, and ignore them
    found = False
    for url in spider.allowed_domains:
        if url in request.url:
            # an allowed domain is in the request.url, proceed
            found = True
    if not found:
        print("[ignored] " + request.url)
        raise IgnoreRequest('Offsite link, ignore')
    return None
Make sure you import IgnoreRequest as well:
from scrapy.exceptions import IgnoreRequest
and enable the Downloader middleware in settings.py.
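For example (a sketch; the dotted path, class name and priority are placeholders to adjust to your own project):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.OffsiteFilterDownloaderMiddleware': 543,  # placeholder path
}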
To verify this, you can add some verification code in process_item of your crawler to ensure no requests to out of scope sites have been made.
In the spider I'm building, I'm required to log in to the website to start performing requests (which is quite simple), and then I go through a loop to perform some thousand requests.
However, on this website in particular, if I do not log out, I get a 10 minute penalty before I can log in again. So I've tried to log out after the loop is done, with a lower priority, like this:
def parse_after_login(self, response):
    for item in [long_list]:
        yield scrapy.Request(..., callback=self.parse_result, priority=100)
    # After all requests have been made, perform logout:
    yield scrapy.Request('/logout/', callback=self.parse_logout, priority=0)
However, there is no guarantee that the logout request won't be ready before the other requests are done processing, so a premature logout will invalidate the other requests.
I have found no way of performing a new request with the spider_closed signal.
How can I perform a new request after all other requests are completed?
You can use the spider_idle signal, which is fired when the spider has stopped processing everything, to send one final request.
So, once you connect a method to the spider_idle signal with:
self.crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)
you can use the self.spider_idle method to run final tasks once the spider has stopped processing everything:
class MySpider(Spider):
    ...
    logged_out = False
    ...
    def spider_idle(self, spider):
        if not self.logged_out:
            self.logged_out = True
            req = Request('someurl', callback=self.parse_logout)
            self.crawler.engine.crawl(req, spider)
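As for where to put the signals.connect call from above: a common place is an overridden from_crawler classmethod, since the crawler is available there before the spider starts. A rough sketch of mine, assuming the MySpider class shown above:

# Sketch: connect the spider_idle handler when the spider is created.
# This would live inside the MySpider class above; requires: from scrapy import signals
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
    return spider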
Is there a way to set a new proxy ip (e.g.: from a pool) according to the HTTP response status code?
For example, start with an IP from an IP list until it gets a 503 response (or another HTTP error code), then use the next one until that gets blocked, and so on, something like:
if http_status_code in [403, 503, ..., n]:
    proxy_ip = 'new ip'
    # Then keep using it till it gets another error code
Any ideas?
Scrapy has a downloader middleware for handling proxies which is enabled by default. It's called HttpProxyMiddleware, and what it does is allow you to supply a proxy meta key on your Request; that proxy is then used for that request.
There are a few ways of doing this.
The first, straightforward way is to use it directly in your spider code:
def parse(self, response):
    if response.status in range(400, 600):
        return Request(response.url,
                       meta={'proxy': 'http://myproxy:8010'},
                       dont_filter=True)  # ignore the dupefilter, since this URL was already requested once
Another, more elegant way is to use a custom downloader middleware, which handles this for multiple callbacks and keeps your spider code cleaner:
import logging

from scrapy import Request

from project.settings import PROXY_URL


class MyDM(object):
    def process_response(self, request, response, spider):
        if response.status in range(400, 600):
            logging.debug('retrying [{}]{} with proxy: {}'.format(response.status, response.url, PROXY_URL))
            return Request(response.url,
                           meta={'proxy': PROXY_URL},
                           dont_filter=True)
        return response
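To get closer to the pool behaviour asked about, here is a rough sketch of mine (not part of the original answer; PROXY_POOL and the status codes are placeholders) that cycles to the next proxy whenever an error status comes back:

# Sketch: rotate to the next proxy from a pool on error responses.
import logging
from itertools import cycle

PROXY_POOL = ['http://proxy1:8010', 'http://proxy2:8010']  # placeholder list


class RotatingProxyDM:
    def __init__(self):
        self._proxies = cycle(PROXY_POOL)
        self.current_proxy = next(self._proxies)

    def process_request(self, request, spider):
        # Tag every outgoing request with the proxy currently in use.
        request.meta['proxy'] = self.current_proxy

    def process_response(self, request, response, spider):
        if response.status in [403, 503]:
            self.current_proxy = next(self._proxies)
            logging.debug('got %s for %s, switching to proxy %s',
                          response.status, response.url, self.current_proxy)
            # Reschedule the same URL; process_request will attach the new proxy.
            return request.replace(dont_filter=True)
        return response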
Note that by default Scrapy doesn't let response codes other than 2xx through to your spider callbacks. It handles 3xx redirect codes automatically with the RedirectMiddleware and filters out 4xx and 5xx responses with the HttpErrorMiddleware. To handle responses other than 200 in your callbacks you need to either:
Specify that in Request Meta:
Request(url, meta={'handle_httpstatus_list': [404,505]})
# or for all
Request(url, meta={'handle_httpstatus_all': True})
Set project-wide or spider-wide settings:
HTTPERROR_ALLOW_ALL = True # for all
HTTPERROR_ALLOWED_CODES = [404, 505] # for specific
as per http://doc.scrapy.org/en/latest/topics/spider-middleware.html#httperror-allowed-codes
I believe using the "callback" method is asynchronous; please correct me if I'm wrong. I'm still new to Python, so please bear with me.
Anyway, I'm trying to make a method to check if a file exists and here is my code:
def file_exists(self, url):
    res = False
    response = Request(url, method='HEAD', dont_filter=True)
    if response.status == 200:
        res = True
    return res
I thought the Request() method would return a Response object, but it still returns a Request object; to capture the Response, I have to create a different method for the callback.
Is there a way to get the Response object within the code block where you call the Request() method?
If anyone is still interested in a possible solution: I managed it by making a request with the "requests" library, sort of "inside" a Scrapy function, like this:
import requests
import scrapy

request_object = requests.get(the_url_you_like_to_get)
response_object = scrapy.Selector(text=request_object.text)
item['attribute'] = response_object.xpath('//path/you/like/to/get/text()').extract_first()
and then proceed.
Request objects don't generate anything.
Scrapy uses an asynchronous downloader engine which takes these Request objects and generates Response objects.
If any method in your spider returns a Request object, it is automatically scheduled in the downloader, and the resulting Response object is passed to the specified callback (i.e. Request(url, callback=self.my_callback)).
Check out more in Scrapy's architecture overview.
Now, depending on when and where you are doing it, you can schedule additional requests by telling the downloader to schedule them:
self.crawler.engine.schedule(Request(url, callback=self.my_callback), spider)
If you run this from a spider, spider here can most likely be self, and self.crawler is an attribute available on every scrapy.Spider.
Alternatively, you can always block the asynchronous stack by using something like the requests library:
def parse(self, response):
    image_url = response.xpath('//img/@src').extract_first()
    if image_url:
        image_head = requests.head(image_url)
        if 'image' in image_head.headers.get('Content-Type', ''):
            item['image'] = image_url
It will slow your spider down but it's significantly easier to implement and manage.
Scrapy uses Request and Response objects for crawling web sites.
Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.
Unless you are manually using a Downloader, it seems like the way you're using the framework is incorrect. I'd read a bit more about how you can create proper spiders in the Scrapy documentation.
As for checking whether a file exists: your spider can store the relevant information in a database or another data structure when parsing the scraped data in its parse*() method, and you can query it later in your own code.
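For completeness, here is a rough sketch of mine (the spider name, URL and yielded fields are placeholders) showing how the original file_exists idea looks when expressed with a callback: the HEAD request is yielded, and the existence check happens in the callback once the Response arrives.

# Sketch: callback-based version of the file_exists check.
import scrapy


class FileCheckSpider(scrapy.Spider):
    name = 'file_check'  # placeholder
    start_urls = ['https://example.com/some/file.pdf']  # placeholder

    def start_requests(self):
        for url in self.start_urls:
            # handle_httpstatus_list lets a 404 reach the callback instead of being filtered out.
            yield scrapy.Request(url, method='HEAD', dont_filter=True,
                                 callback=self.check_exists,
                                 meta={'handle_httpstatus_list': [404]})

    def check_exists(self, response):
        # The Response is only available here, in the callback.
        yield {'url': response.url, 'exists': response.status == 200}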