Errback in Scrapy for spider does not fire - python

The code below never calls the errback error_handler. What could be the problem? It does reach parse_listings, and the exception raised there is caught by Scrapy and logged.
import scrapy


class ListingsSpider(scrapy.Spider):
    name = 'listings'

    def start_requests(self):
        yield scrapy.Request(
            url="https://www.google.com/",
            callback=self.parse_listings,
            errback=self.error_handler,
        )

    def parse_listings(self, response, **request_kwargs):
        raise TimeoutError

    def error_handler(self, failure):
        self.logger.error("DOES NOT REACH HERE")

This is by design. See https://github.com/scrapy/scrapy/issues/5438. The errback is called for errors raised while handling the request itself (such as connection errors), NOT for exceptions raised while processing the response in a callback.
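If you do need to react to exceptions raised inside callbacks, one option (a sketch, not taken from the linked issue) is the process_spider_exception hook of a spider middleware, which Scrapy invokes when a callback raises:

class CallbackErrorMiddleware:
    # Minimal sketch of a spider middleware that handles callback exceptions.
    def process_spider_exception(self, response, exception, spider):
        spider.logger.error("Callback failed for %s: %r", response.url, exception)
        # Returning an iterable (even an empty one) marks the exception as handled;
        # returning None lets Scrapy keep propagating it as usual.
        return []

Enable it through SPIDER_MIDDLEWARES, e.g. {'myproject.middlewares.CallbackErrorMiddleware': 550}; the module path and priority are placeholders.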

Related

Unable to get rid of some error raised by process_exception

I'm trying to suppress an error that Scrapy throws from my custom RetryMiddleware. The script runs into this error when the max retry limit is crossed. I use proxies within the middleware. The weird thing is that the exception the script throws is already in the EXCEPTIONS_TO_RETRY list. It is completely fine that the script sometimes exhausts its retries without success; I just don't want to see that error when it happens, i.e. I want to suppress or bypass it.
The error is like:
Traceback (most recent call last):
  File "middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
This is how process_exception within the RetryMiddleware looks:
class RetryMiddleware(object):
    cus_retry = 3
    EXCEPTIONS_TO_RETRY = (defer.TimeoutError, TimeoutError, DNSLookupError,
                           ConnectionRefusedError, ConnectionDone, ConnectError,
                           ConnectionLost, TCPTimedOutError, TunnelError, ResponseFailed)

    def process_exception(self, request, exception, spider):
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
                and not request.meta.get('dont_retry', False):
            return self._retry(request, exception, spider)

    def _retry(self, request, reason, spider):
        retries = request.meta.get('cus_retry', 0) + 1
        if retries <= self.cus_retry:
            r = request.copy()
            r.meta['cus_retry'] = retries
            r.meta['proxy'] = f'https://{ip:port}'  # placeholder for a real proxy address
            r.dont_filter = True
            return r
        else:
            print("done retrying")
How can I get rid of the errors in EXCEPTIONS_TO_RETRY?
PS: The script hits this error when the max retry limit is reached, no matter which site I choose.
Maybe the problem is not on your side; there might be something wrong with the third-party site. Perhaps there is a connection error on their server, or it is secured so that no one can access it.
The error message itself suggests the remote party is shut down or not responding properly, so first check whether the third-party site responds when requested, and try contacting them if you can.
In other words, the error is not on your end but on the other party's end, as the message says.
This question is similar to Scrapy - Set TCP Connect Timeout
When the max retry limit is reached, a method like parse_error() within your spider should handle any error, if there is one:
def start_requests(self):
    for start_url in self.start_urls:
        yield scrapy.Request(start_url, errback=self.parse_error, callback=self.parse, dont_filter=True)

def parse_error(self, failure):
    # print(repr(failure))
    pass
However, I would suggest a completely different approach here. If you go the following route, you don't need any custom middleware at all; everything, including the retry logic, lives inside the spider.
class mySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "some url",
    ]

    proxies = []  # list of proxies here
    max_retries = 5
    retry_urls = {}

    def parse_error(self, failure):
        proxy = f'https://{ip:port}'  # placeholder for a real proxy address
        retry_url = failure.request.url
        if retry_url not in self.retry_urls:
            self.retry_urls[retry_url] = 1
        else:
            self.retry_urls[retry_url] += 1

        if self.retry_urls[retry_url] <= self.max_retries:
            yield scrapy.Request(retry_url, callback=self.parse, meta={"proxy": proxy, "download_timeout": 10}, errback=self.parse_error, dont_filter=True)
        else:
            print("gave up retrying")

    def start_requests(self):
        for start_url in self.start_urls:
            proxy = f'https://{ip:port}'  # placeholder for a real proxy address
            yield scrapy.Request(start_url, callback=self.parse, meta={"proxy": proxy, "download_timeout": 10}, errback=self.parse_error, dont_filter=True)

    def parse(self, response):
        for item in response.css().getall():  # add a real CSS selector here
            print(item)
Don't forget to add the following setting to the spider to get the result described above:
custom_settings = {
    'DOWNLOADER_MIDDLEWARES': {
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    }
}
I'm using scrapy 2.3.0 by the way.
Try fixing the code in the scraper itself. Sometimes a bad parse function can lead to an error of the kind you're describing. Once I fixed the code, it went away for me.

Create new request for Scrapy schedule

Via pika I take a URL from RabbitMQ and try to create a new request for the Scrapy spider.
When I start my spider with scrapy crawl spider, the spider doesn't close (because of raise DontCloseSpider()), but it also doesn't create a request.
My custom extension:
import pika
from scrapy import signals
from scrapy.http import Request
from scrapy.exceptions import DontCloseSpider


class AddRequestExample:
    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        s = cls(crawler)
        crawler.signals.connect(s.spider_idle, signal=signals.spider_idle)
        return s

    def spider_idle(self, spider):
        connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
        channel = connection.channel()
        try:
            url = channel.basic_get(queue='hello')[2]
            url = url.decode()
            crawler.engine.crawl(Request(url), self)
        except Exception:
            pass
        raise DontCloseSpider()
my spider:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "spider"

    def parse(self, response):
        yield {
            'url': response.url
        }
It looks like you are trying to reproduce the approach from this answer.
In that case you need to define a callback for the request. Since you process the spider_idle signal from an extension (not from the spider), the callback should be the spider.parse method.
def spider_idle(self, spider):
    ....
    try:
        url = channel.basic_get(queue='hello')[2]
        url = url.decode()
        spider.crawler.engine.crawl(Request(url=url, callback=spider.parse), self)
    except Exception:
        ....

Capturing HTTP Errors using scrapy

I'm trying to scrape a website for broken links. So far I have this code, which successfully logs in and crawls the site, but it only records HTTP status 200 codes:
class HttpStatusSpider(scrapy.Spider):
    name = 'httpstatus'
    handle_httpstatus_all = True

    link_extractor = LinkExtractor()

    def start_requests(self):
        """This method ensures we login before we begin spidering"""
        # Little bit of magic to handle the CSRF protection on the login form
        resp = requests.get('http://localhost:8000/login/')
        tree = html.fromstring(resp.content)
        csrf_token = tree.cssselect('input[name=csrfmiddlewaretoken]')[0].value

        return [FormRequest('http://localhost:8000/login/', callback=self.parse,
                            formdata={'username': 'mischa_cs',
                                      'password': 'letmein',
                                      'csrfmiddlewaretoken': csrf_token},
                            cookies={'csrftoken': resp.cookies['csrftoken']})]

    def parse(self, response):
        item = HttpResponseItem()
        item['url'] = response.url
        item['status'] = response.status
        item['referer'] = response.request.headers.get('Referer', '')
        yield item

        for link in self.link_extractor.extract_links(response):
            r = Request(link.url, self.parse)
            r.meta.update(link_text=link.text)
            yield r
The docs and these answers lead me to believe that handle_httpstatus_all = True should cause scrapy to pass errored requests to my parse method, but so far I've not been able to capture any.
I've also experimented with handle_httpstatus_list and a custom errback handler in a different iteration of the code.
What do I need to change to capture the HTTP error codes scrapy is encountering?
handle_httpstatus_list can be defined at the spider level, but handle_httpstatus_all can only be set at the Request level, by including it in the request's meta argument.
I would still recommend using an errback for these cases, but if everything is controlled, it shouldn't create new problems.
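For illustration, a minimal sketch of such a request (the URL and the on_error errback are placeholders, not from the question):

def start_requests(self):
    yield scrapy.Request(
        'http://localhost:8000/login/',
        callback=self.parse,
        errback=self.on_error,  # optional: handles network-level failures (DNS, timeouts, ...)
        meta={'handle_httpstatus_all': True},  # let non-2xx responses reach the callback
    )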
So, I don't know if this is the proper scrapy way, but it does allow me to handle all HTTP status codes (including 5xx).
I disabled the HttpErrorMiddleware by adding this snippet to my scrapy project's settings.py:
SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': None
}

Scrapy do something before resume

I'm using scrapy to crawl a web-site with authentication.
I want to be able to save the state of the crawler and I use
scrapy crawl myspider -s JOBDIR=mydir
After I resume with the same command I want to be able to login to the website before it reschedules all saved requests.
Basically, I want to be sure that my functions login() and after_login() will be called before any other request is scheduled and executed. And I don't want to use cookies, because they don't allow me to pause the crawling for a long time.
I can call login() in start_requests(), but this works only when I run the crawler for the first time.
class MyCrawlSpider(CrawlSpider):
    # ...
    START_URLS = ['someurl.com', 'someurl2.com']

    LOGIN_PAGE = u'https://login_page.php'

    def login(self):
        return Request(url=self.LOGIN_PAGE, callback=self.login_form,
                       dont_filter=True, priority=9999)

    def login_form(self, response):
        return FormRequest.from_response(response,
                                         formdata={'Email': 'myemail',
                                                   'Passwd': 'mypasswd'},
                                         callback=self.after_login,
                                         dont_filter=True,
                                         priority=9999)

    def after_login(self, response):
        if "authentication failed" in response.body:
            self.logger.error("Login failed")
            return
        else:
            print("Login Successful!!")
            self.is_logged_in = True
            for url in self.START_URLS:
                yield Request(url, callback=self.parse_artists_json, dont_filter=True)
Bottom line: is there any callback that will always be called when I resume crawling with the -s JOBDIR=... option, before previous requests are rescheduled? I would use it to call the login() method.
You can use the spider_opened signal (more here).
This handler is intended for resource allocation and other initialization, so it doesn't expect you to yield a Request object from there.
You can work around this by keeping a list of pending requests. This is needed because Scrapy doesn't let you manually schedule requests.
Then, after resuming the spider, you can queue the login as the first request in the list:
def spider_opened(self, spider):
    self.spider.requests.insert(0, self.spider.login())
You also need to add a next_request method to your spider:
def next_request(self):
    if self.requests:
        yield self.requests.pop(0)
And queue all your requests by adding them to the requests list and calling next_request at the end of each method:
def after_login(self, response):
    if "authentication failed" in response.body:
        self.logger.error("Login failed")
        return
    else:
        print("Login Successful!!")
        self.is_logged_in = True
        if not self.requests:
            for url in self.START_URLS:
                self.requests.append(Request(url, callback=self.parse_artists_json, dont_filter=True))
    yield from self.next_request()  # next_request is a generator, so delegate to it
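The answer doesn't show how the spider_opened handler gets connected. One way to wire it up (a sketch, assuming the handler lives on the spider itself and that the requests list and login() method from the answer exist) is via from_crawler:

from scrapy import signals
from scrapy.spiders import CrawlSpider


class MyCrawlSpider(CrawlSpider):
    requests = []  # pending-requests queue used by next_request()

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Fires every time the spider is opened, including after a resume with -s JOBDIR=...
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        return spider

    def spider_opened(self, spider):
        # Put the login request at the front of the pending queue before anything else runs.
        spider.requests.insert(0, spider.login())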

Scrapy: Is it possible to pause Scrapy and resume after x minutes?

I'm trying to crawl a large site. They have a rate-limiting system in place. Is it possible to pause Scrapy for 10 minutes when it encounters a 403 page? I know I can set a DOWNLOAD_DELAY, but I noticed that I can scrape faster by setting a small DOWNLOAD_DELAY and then pausing Scrapy for a few minutes whenever it gets a 403. This way the rate limiting gets triggered only once every hour or so.
You can write your own retry middleware and put it in middleware.py:
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message
from time import sleep


class SleepRetryMiddleware(RetryMiddleware):
    def __init__(self, settings):
        RetryMiddleware.__init__(self, settings)

    def process_response(self, request, response, spider):
        if response.status in [403]:
            sleep(120)  # few minutes
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response

        return super(SleepRetryMiddleware, self).process_response(request, response, spider)
and don't forget to change settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'your_project.middlewares.SleepRetryMiddleware': 100,
}
Scrapy is a Twisted-based Python framework. So, never use time.sleep or pause.until inside it!
Instead, try using Deferred() from Twisted.
class ScrapySpider(Spider):
    name = 'live_function'

    def start_requests(self):
        yield Request('some url', callback=self.non_stop_function)

    def non_stop_function(self, response):
        parse_and_pause = Deferred()  # changed
        parse_and_pause.addCallback(self.second_parse_function)  # changed
        parse_and_pause.addCallback(pause, seconds=10)  # changed

        for url in ['url1', 'url2', 'url3', 'more urls']:
            yield Request(url, callback=parse_and_pause)  # changed

        yield Request('some url', callback=self.non_stop_function)  # Call itself

    def second_parse_function(self, response):
        pass
More info here: Scrapy: non-blocking pause
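The snippet above relies on a pause helper that isn't defined here. Based on the linked answer, a minimal sketch of such a helper (a plain Twisted Deferred plus reactor.callLater, which delays without blocking the reactor the way time.sleep does) could look like:

from twisted.internet import reactor
from twisted.internet.defer import Deferred


def pause(result, seconds=0):
    # Return a Deferred that fires with the previous callback's result after
    # `seconds`, letting the Twisted reactor keep running in the meantime.
    d = Deferred()
    reactor.callLater(seconds, d.callback, result)
    return d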
