I'm trying to suppress an error thrown by Scrapy within RetryMiddleware. The script encounters the error when the max retry limit is crossed. I used proxies within the middleware. The weird thing is that the exception the script throws is already within the EXCEPTIONS_TO_RETRY list. It is completely okay that the script may sometimes cross the number of max retries without any success. However, I just do not wish to see that error even when it is there, meaning I want to suppress or bypass it.
The error looks like this:
Traceback (most recent call last):
File "middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
This is what process_exception within RetryMiddleware looks like:
from twisted.internet import defer
from twisted.internet.error import (
    ConnectError, ConnectionDone, ConnectionLost, ConnectionRefusedError,
    DNSLookupError, TCPTimedOutError, TimeoutError,
)
from twisted.web.client import ResponseFailed
from scrapy.core.downloader.handlers.http11 import TunnelError

class RetryMiddleware(object):
    cus_retry = 3
    EXCEPTIONS_TO_RETRY = (defer.TimeoutError, TimeoutError, DNSLookupError,
                           ConnectionRefusedError, ConnectionDone, ConnectError,
                           ConnectionLost, TCPTimedOutError, TunnelError,
                           ResponseFailed)

    def process_exception(self, request, exception, spider):
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
                and not request.meta.get('dont_retry', False):
            return self._retry(request, exception, spider)

    def _retry(self, request, reason, spider):
        retries = request.meta.get('cus_retry', 0) + 1
        if retries <= self.cus_retry:
            r = request.copy()
            r.meta['cus_retry'] = retries
            r.meta['proxy'] = f'https://{ip:port}'  # placeholder proxy address
            r.dont_filter = True
            return r
        else:
            print("done retrying")
How can I get rid of the errors in EXCEPTIONS_TO_RETRY?
PS: The script encounters the error when the max retry limit is reached, no matter which site I choose.
Maybe the problem is not on your side; there might be something wrong with the third-party site. Maybe there is a connection error on their server, or maybe it is secured so that no one can access it.
The error even says that the problem is on the connected party's end: it is shut down or not responding properly. So first check whether the third-party site is working when requested, and try contacting them if you can.
The error is not on your end; it's on the party's end, as the message says.
This question is similar to Scrapy - Set TCP Connect Timeout
When the max retry limit is reached, an errback like parse_error() should handle any error that is left within your spider:
def start_requests(self):
    for start_url in self.start_urls:
        yield scrapy.Request(start_url, errback=self.parse_error,
                             callback=self.parse, dont_filter=True)

def parse_error(self, failure):
    # print(repr(failure))
    pass
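If you would rather keep your middleware and only silence the traceback, there is another option; this is a minimal sketch, not tested against your exact setup. Scrapy stops processing an exception as soon as a downloader middleware's process_exception returns a Response object, so nothing gets logged. The 200 status and empty body below are assumptions; use whatever your callbacks can tolerate:
from scrapy.http import HtmlResponse

def process_exception(self, request, exception, spider):
    if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
            and not request.meta.get('dont_retry', False):
        retry_request = self._retry(request, exception, spider)
        if retry_request is not None:
            return retry_request
        # Retries are exhausted: returning a Response instead of None
        # halts exception processing, so the traceback is never shown.
        return HtmlResponse(url=request.url, status=200, body=b'',
                            encoding='utf-8', request=request)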
However, I thought of suggesting a completely different approach here. If you go the following route, you don't need any custom middleware at all. Everything, including the retry logic, is already within the spider.
class mySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "some url",
    ]

    proxies = []  # list of proxies here
    max_retries = 5
    retry_urls = {}

    def parse_error(self, failure):
        proxy = f'https://{ip:port}'  # placeholder proxy address
        retry_url = failure.request.url
        if retry_url not in self.retry_urls:
            self.retry_urls[retry_url] = 1
        else:
            self.retry_urls[retry_url] += 1

        if self.retry_urls[retry_url] <= self.max_retries:
            yield scrapy.Request(retry_url, callback=self.parse,
                                 meta={"proxy": proxy, "download_timeout": 10},
                                 errback=self.parse_error, dont_filter=True)
        else:
            print("gave up retrying")

    def start_requests(self):
        for start_url in self.start_urls:
            proxy = f'https://{ip:port}'  # placeholder proxy address
            yield scrapy.Request(start_url, callback=self.parse,
                                 meta={"proxy": proxy, "download_timeout": 10},
                                 errback=self.parse_error, dont_filter=True)

    def parse(self, response):
        for item in response.css('some selector').getall():
            print(item)
Don't forget to add the following lines inside the spider class to get the aforesaid result from the above suggestion; they disable the built-in retry middleware:
custom_settings = {
    'DOWNLOADER_MIDDLEWARES': {
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    }
}
I'm using scrapy 2.3.0 by the way.
Try fixing the code in the scraper itself. Sometimes a bad parse function can lead to an error of the kind you're describing. Once I fixed the code, it went away for me.
Related
I have this custom scrapy proxy rotation middleware in my spider:
packetstream_proxies = [
    settings.get("PS_PROXY_USA"),
    settings.get("PS_PROXY_CA"),
    settings.get("PS_PROXY_IT"),
    settings.get("PS_PROXY_GLOBAL"),
]

unlimited_proxies = [
    settings.get("UNLIMITED_PROXY_1"),
    settings.get("UNLIMITED_PROXY_2"),
    settings.get("UNLIMITED_PROXY_3"),
    settings.get("UNLIMITED_PROXY_4"),
    settings.get("UNLIMITED_PROXY_5"),
    settings.get("UNLIMITED_PROXY_6"),
]
class SdtProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(packetstream_proxies)
        if request.meta.get("retry_times") == 1:
            request.meta["proxy"] = random.choice(unlimited_proxies)
        return None
My goal was for every request to be tried first with packetstream_proxies, and after that retries should use unlimited_proxies. But the above middleware is not working as expected: it is retrying with packetstream_proxies more than one time, since I have set RETRY_TIMES = 25.
How can I customize the proxy retries in order to achieve my expected goal?
If I understand correctly, you want to make all requests with packetstream_proxies, and switch to unlimited_proxies once one or more retries are needed.
So you just need to fix your code to avoid errors with retry_times, because on the first request that meta key does not exist yet. You need something like this:
class ProxyRotationMiddleware(object):
    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(packetstream_proxies)
        # retry_times does not exist on the first request, so default
        # to 0; anything greater than 0 means this request is a retry
        if request.meta.get("retry_times", 0) > 0:
            request.meta["proxy"] = random.choice(unlimited_proxies)
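For completeness: a downloader middleware only runs once it is enabled in the project settings. A minimal sketch, where the module path myproject.middlewares is a hypothetical placeholder for wherever the class actually lives:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    # 350 is an arbitrary but typical slot for a proxy middleware
    "myproject.middlewares.ProxyRotationMiddleware": 350,
}
RETRY_TIMES = 25  # as in the question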
Hope I answered your question; I work a lot with middlewares and proxies in my job.
The code below does not call the errback error_handler. What could be the problem? It does, however, reach parse_listings and raise an exception that is caught by Scrapy and logged.
import scrapy

class ListingsSpider(scrapy.Spider):
    name = 'listings'

    def start_requests(self):
        yield scrapy.Request(
            url="https://www.google.com/",
            callback=self.parse_listings,
            errback=self.error_handler,
        )

    def parse_listings(self, response, **request_kwargs):
        raise TimeoutError

    def error_handler(self, failure):
        self.logger.error("DOES NOT REACH HERE")
This is by design. See https://github.com/scrapy/scrapy/issues/5438. errback is used for errors during request handling, such as connection errors, NOT during processing of the response.
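If you do need to react to exceptions raised inside a callback, the hook for that is a spider middleware's process_spider_exception rather than errback. A rough sketch; the class name and settings entry are illustrative, not from the linked issue:
class CallbackErrorMiddleware:
    def process_spider_exception(self, response, exception, spider):
        # Runs when a callback such as parse_listings raises.
        spider.logger.error("callback failed for %s: %r",
                            response.url, exception)
        return []  # returning an iterable stops further exception processing

# and in settings.py:
# SPIDER_MIDDLEWARES = {"myproject.middlewares.CallbackErrorMiddleware": 10}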
Author note: You might think that this post is lacking context or information, that is only because I don't know where to start. I'll gladly edit with additional information at your request.
Running Scrapy I see the following error among all the links I am scraping:
ERROR: Error downloading <GET http://www.fifa.com/fifa-tournaments/players-coaches/people=44630/index.html>
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/Library/Python/2.7/site-packages/scrapy/core/downloader/__init__.py", line 75, in _deactivate
self.active.remove(request)
KeyError: <GET http://www.fifa.com/fifa-tournaments/players-coaches/people=44630/index.html>
2016-01-19 15:57:20 [scrapy] INFO: Error while removing request from slot
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/Library/Python/2.7/site-packages/scrapy/core/engine.py", line 140, in <lambda>
d.addBoth(lambda _: slot.remove_request(request))
File "/Library/Python/2.7/site-packages/scrapy/core/engine.py", line 38, in remove_request
self.inprogress.remove(request)
KeyError: <GET http://www.fifa.com/fifa-tournaments/players-coaches/people=44630/index.html>
When I run the Scrapy shell on that single URL using:
scrapy shell http://www.fifa.com/fifa-tournaments/players-coaches/people=44630/index.html
no errors occur. I am scraping thousands of similar links without problems, but I see this issue on ~10 links. I am using Scrapy's default 180-second download timeout.
I don't see anything wrong with these links in my web browser either.
The parsing is initiated by the request:
request = Request(url_nrd, meta={'item': item}, callback=self.parse_player, dont_filter=True)
Which is handled in the functions:
def parse_player(self, response):
    if response.status == 404:
        # doing stuff here
        yield item
    else:
        # doing stuff there
        request = Request(url_new, meta={'item': item},
                          callback=self.parse_more, dont_filter=True)
        yield request

def parse_more(self, response):
    # parsing more stuff here
    return item
Also: I didn't change the default settings for download retries in Scrapy (but I don't see any retries in my log files either).
Additional notes:
After my scraping completed, and since dont_filter=True, I can see that links that failed to download with the previous error at some point didn't fail in earlier or subsequent requests.
Possible answer:
I see that I am getting a KeyError and that the de-allocation of the request failed (remove_request). Is it possible that this is because I am setting dont_filter=True and doing several requests on the same URL, and the key of the request seems to be that URL? That the request was de-allocated by a previous, concurrent request on the same URL?
In that case, how can I get a unique key per request that is not indexed on the URL?
EDIT
I think my code in parse_player was the problem. I don't know for sure because I have edited my code since, but I recall seeing a bad indent on yield request.
def parse_player(self, response):
    if response.status == 404:
        # doing stuff here
        yield item
    else:
        paths = sel.xpath('some path extractor here')
        for path in paths:
            if (some_condition):
                # doing stuff there
                request = Request(url_new, meta={'item': item},
                                  callback=self.parse_more, dont_filter=True)
        # Bad indent of yield request here!
        yield request
Let me know if you think that might have caused the issue.
And what if you simply ignore the errors?
def parse_player(self, response):
    if response.status == 200:
        paths = sel.xpath('some path extractor here')
        for path in paths:
            if (some_condition):
                # doing stuff there
                request = Request(url_new, meta={'item': item},
                                  callback=self.parse_more, dont_filter=True)
                yield request  # yield inside the loop, with the indent fixed
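One caveat that applies to both versions above: by default Scrapy's HttpErrorMiddleware filters out non-2xx responses before they reach callbacks, so the response.status == 404 branch would never run unless 404s are explicitly allowed through. A sketch of the usual way to do that (the spider name is hypothetical):
class PlayerSpider(scrapy.Spider):
    name = 'players'  # hypothetical
    # let 404 responses reach parse_player instead of being dropped
    # by the built-in HttpErrorMiddleware
    handle_httpstatus_list = [404]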
I made a simple script for amusement that takes the latest comment from http://www.reddit.com/r/random/comments.json?limit=1 and speaks it through espeak. I ran into a problem, however. If Reddit fails to give me the JSON data, which it commonly does, the script stops and gives a traceback. This is a problem, as it stops the script. Is there any way to retry getting the JSON if it fails to load? I am using requests, if that means anything.
If you need it, here is the part of the code that gets the JSON data:
import json
import requests

url = 'http://www.reddit.com/r/random/comments.json?limit=1'
r = requests.get(url)
quote = r.text
body = json.loads(quote)['data']['children'][0]['data']['body']
subreddit = json.loads(quote)['data']['children'][0]['data']['subreddit']
For the vocabulary: the actual error you're having is an exception that has been thrown at some point in the program because of a detected runtime error, and the traceback is the listing that tells you where the exception was thrown.
Basically, what you want is an exception handler:
try:
    url = 'http://www.reddit.com/r/random/comments.json?limit=1'
    r = requests.get(url)
    quote = r.text
    body = json.loads(quote)['data']['children'][0]['data']['body']
    subreddit = json.loads(quote)['data']['children'][0]['data']['subreddit']
except Exception as err:
    print(err)
so that you skip over the part that depends on the thing that failed. Have a look at this doc as well: HandlingExceptions - Python Wiki
As pss suggests, if you want to retry after the URL failed to load:
done = False
while not done:
    try:
        url = 'http://www.reddit.com/r/random/comments.json?limit=1'
        r = requests.get(url)
        done = True  # the request succeeded, so leave the loop
    except Exception as err:
        print(err)  # the request failed, so loop and try again

quote = r.text
body = json.loads(quote)['data']['children'][0]['data']['body']
subreddit = json.loads(quote)['data']['children'][0]['data']['subreddit']
N.B.: That solution may not be optimal, since if you're offline or the URL is always failing, it'll loop forever. If you retry too fast and too often, Reddit may also ban you.
N.B. 2: I'm using the newer Python 3-style syntax for exception handling (except ... as err), which may not work with Python versions older than 2.7.
N.B. 3: You may also want to choose a class other than Exception for the exception handling, to be able to select what kind of error you want to handle. It mostly depends on your app design, and given what you say, you might want to handle requests.exceptions.ConnectionError, but have a look at the requests docs to choose the right one.
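To illustrate N.B. 3, here is a sketch of how the requests exception classes split up; the 5-second timeout is an arbitrary choice:
try:
    r = requests.get(url, timeout=5)
    r.raise_for_status()  # turn 4xx/5xx responses into an HTTPError
except requests.exceptions.ConnectionError as err:
    print("network problem (DNS failure, refused connection, ...):", err)
except requests.exceptions.Timeout as err:
    print("the server took more than 5 seconds to answer:", err)
except requests.exceptions.HTTPError as err:
    print("the server answered, but with an error status:", err)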
Here's what you may want, but please think this through and adapt it to your use case:
import json
import sys
import time

import requests

def get_reddit_comments():
    retries = 5
    while retries != 0:
        try:
            url = 'http://www.reddit.com/r/random/comments.json?limit=1'
            r = requests.get(url)
            break  # if the request succeeded we get out of the loop
        except requests.exceptions.ConnectionError as err:
            print("Warning: couldn't get the URL: {}".format(err))
            time.sleep(1)  # wait 1 second between two requests
            retries -= 1
    if retries == 0:  # if we've done 5 attempts, we fail loudly
        return None
    return r.text

def use_data(quote):
    if not quote:
        print("could not get URL, despite multiple attempts!")
        return False
    data = json.loads(quote)
    if 'error' in data.keys():
        print("could not get data from reddit: error code #{}".format(data['error']))
        return False
    body = data['data']['children'][0]['data']['body']
    subreddit = data['data']['children'][0]['data']['subreddit']
    # … do stuff with your data here
    return True  # signal success to the caller

if __name__ == "__main__":
    quote = get_reddit_comments()
    if not use_data(quote):
        print("Fatal error: Couldn't handle data receipt from reddit.")
        sys.exit(1)
I hope this snippet will help you correctly design your program. And now that you've discovered exceptions, please always remember that exceptions are for handling things that shall stay exceptional. If you throw an exception at some point in one of your programs, always ask yourself whether it marks something unexpected (like a webpage not loading) or an expected error (like a page loading but giving you output that is not expected).
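As a final aside, requests can also retry transparently through urllib3's Retry helper, which spares you the manual loop; a sketch, assuming that retrying on connection errors and on common 5xx statuses suits your use case:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry up to 5 times, sleeping backoff_factor * 2**(attempt - 1) seconds
# between attempts, and also retry on these HTTP status codes.
retry = Retry(total=5, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
session.mount('http://', HTTPAdapter(max_retries=retry))
session.mount('https://', HTTPAdapter(max_retries=retry))

r = session.get('http://www.reddit.com/r/random/comments.json?limit=1')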
I use CherryPy to run a very simple web server. It is intended to process the GET parameters and, if they are correct, do something with them.
import cherrypy

class MainServer(object):
    def index(self, **params):
        # do things with correct parameters
        if 'a' in params:
            print params['a']
    index.exposed = True

cherrypy.quickstart(MainServer())
For example,
http://127.0.0.1:8080/abcde:
404 Not Found
The path '/abcde' was not found.
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\cherrypy\_cprequest.py", line 656, in respond
response.body = self.handler()
File "C:\Python27\lib\site-packages\cherrypy\lib\encoding.py", line 188, in __call__
self.body = self.oldhandler(*args, **kwargs)
File "C:\Python27\lib\site-packages\cherrypy\_cperror.py", line 386, in __call__
raise self
NotFound: (404, "The path '/abcde' was not found.")
Powered by CherryPy 3.2.4
I am trying to catch this exception and show a blank page because the clients do not care about it. Specifically, the result would be an empty body, no matter the url or query string that resulted in an exception.
I had a look at documentation on error handling cherrypy._cperror, but I did not find a way to actually use it.
Note: I gave up using CherryPy and found a simple solution using BaseHTTPServer (see my answer below)
The docs somehow seem to miss this section. This is what I found in the source code while looking for a detailed explanation of custom error handling.
Custom Error Handling
Anticipated HTTP responses
The 'error_page' config namespace can be used to provide custom HTML output for expected responses (like 404 Not Found). Supply a filename from which the output will be read. The contents will be interpolated with the values %(status)s, %(message)s, %(traceback)s, and %(version)s using plain old Python string formatting.
_cp_config = {
    'error_page.404': os.path.join(localDir, "static/index.html")
}
Beginning in version 3.1, you may also provide a function or other callable as an error_page entry. It will be passed the same status, message, traceback and version arguments that are interpolated into templates:
def error_page_402(status, message, traceback, version):
    return "Error %s - Well, I'm very sorry but you haven't paid!" % status

cherrypy.config.update({'error_page.402': error_page_402})
Also in 3.1, in addition to the numbered error codes, you may also supply error_page.default to handle all codes which do not have their own error_page entry.
Unanticipated errors
CherryPy also has a generic error handling mechanism: whenever an unanticipated error occurs in your code, it will call Request.error_response to set the response status, headers, and body. By default, this is the same output as HTTPError(500). If you want to provide some other behavior, you generally replace "request.error_response".
Here is some sample code that shows how to display a custom error message and send an e-mail containing the error:
from cherrypy import _cperror

def handle_error():
    cherrypy.response.status = 500
    cherrypy.response.body = [
        "<html><body>Sorry, an error occurred</body></html>"
    ]
    sendMail('error@domain.com',
             'Error in your web app',
             _cperror.format_exc())
@cherrypy.config(**{'request.error_response': handle_error})
class Root:
    pass
Note that you have to explicitly set response.body and not simply return an error message as a result.
Choose what's most suitable for you: Default Methods, Custom Error Handling.
I don't think you should use BaseHTTPServer. If your app is that simple, just get a lightweight framework (e.g., Flask), even though it might be a bit overkill, OR stay low-level but still within the WSGI standard and use a WSGI-compliant server.
CherryPy IS catching your exception. That's how it returns a valid page to the browser with the caught exception.
I suggest you read through all the documentation. I realize it isn't the best documentation or organized well, but if you at least skim through it the framework will make more sense. It is a small framework, but does almost everything you'd expect from an application server.
import cherrypy

def show_blank_page_on_error():
    """Instead of showing something useful to developers but
    disturbing to clients we will show a blank page.
    """
    cherrypy.response.status = 500
    cherrypy.response.body = ''

class Root():
    """Root of the application"""
    _cp_config = {'request.error_response': show_blank_page_on_error}

    @cherrypy.expose
    def index(self):
        """Root url handler"""
        raise Exception
See the example in the documentation on the page mentioned above for further reference.
You can simply use a try/except clause:
try:
    cherrypy.quickstart(MainServer())
except:  # catches all errors, including basic Python errors
    print("Error!")
This will catch every single error. But if you want to catch only cherrypy._cperror:
from cherrypy import _cperror

try:
    cherrypy.quickstart(MainServer())
except _cperror.CherryPyException:  # catches only CherryPy errors
    print("CherryPy error!")
Hope this helps!
import cherrypy
from cherrypy import HTTPError

def handle_an_exception():
    cherrypy.response.status = 500
    cherrypy.response.headers['content-type'] = 'text/plain;charset=UTF-8'
    cherrypy.response.body = b'Internal Server Error'

def handle_a_404(status=None, message=None, version=None, traceback=None):
    cherrypy.response.headers['content-type'] = 'text/plain;charset=UTF-8'
    return 'Error page for 404'.encode('UTF-8')

def handle_default(status=None, message=None, version=None, traceback=None):
    cherrypy.response.headers['content-type'] = 'text/plain;charset=UTF-8'
    return f'Default error page: {status}'.encode('UTF-8')

class Root:
    """Root of the application"""
    _cp_config = {
        # handler for an unhandled exception
        'request.error_response': handle_an_exception,
        # specific handler for the HTTP 404 error
        'error_page.404': handle_a_404,
        # default handler for any other HTTP error
        'error_page.default': handle_default,
    }

    @cherrypy.expose
    def index(self):
        """Root url handler"""
        raise Exception("an exception")

    @cherrypy.expose
    def simulate400(self):
        raise HTTPError(status=400, message="Bad Things Happened")

cherrypy.quickstart(Root())
Test with:
http://127.0.0.1:8080/
http://127.0.0.1:8080/simulate400
http://127.0.0.1:8080/missing
Though this was one of the top results when I searched for CherryPy exception handling, the accepted answer did not fully answer the question. The following is working code against CherryPy 14.0.0:
# Implement the handler method
def exception_handler(status, message, traceback, version):
    # Your logic goes here; the returned string becomes the response body
    return "Error: {}".format(status)

class MyClass():
    # Update the configuration
    _cp_config = {"error_page.default": exception_handler}
Note the method signature. Without this signature your method will not get invoked. The method parameters contain the following:
status : HTTP status and a description
message : Message attached to the exception
traceback : Formatted stack trace
version : Cherrypy version
Maybe you could use a 'before_error_response' handler from cherrypy.tools:
@cherrypy.tools.register('before_error_response', priority=90)
def handleexception():
    cherrypy.response.status = 500
    cherrypy.response.body = ''
And don't forget to enable it:
tools.handleexception.on = True
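That line belongs in the CherryPy config you start the app with; a minimal sketch of one way to pass it, assuming a Root class like in the other answers:
cherrypy.quickstart(Root(), '/', {'/': {'tools.handleexception.on': True}})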
I gave up using CherryPy and ended up using the following code, which solves the issue in a few lines with the standard BaseHTTPServer:
from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer
from urlparse import urlparse, parse_qs

class GetHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse(self.path)
        d = parse_qs(url[4])
        if 'c' in d:
            print d['c'][0]
        self.send_response(200)
        self.end_headers()
        return

server = HTTPServer(('localhost', 8080), GetHandler)
server.serve_forever()
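For what it's worth, the same approach still works on Python 3, where the modules were renamed; a sketch of the equivalent code:
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

class GetHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # query-string parameters, e.g. /?c=value
        query = parse_qs(urlparse(self.path).query)
        if 'c' in query:
            print(query['c'][0])
        self.send_response(200)
        self.end_headers()

HTTPServer(('localhost', 8080), GetHandler).serve_forever()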