Scrapy - How to Retry certain proxies for all requests just once? - python

I have this custom scrapy proxy rotation middleware in my spider:
packetstream_proxies = [
settings.get("PS_PROXY_USA"),
settings.get("PS_PROXY_CA"),
settings.get("PS_PROXY_IT"),
settings.get("PS_PROXY_GLOBAL"),
]
unlimited_proxies = [
settings.get("UNLIMITED_PROXY_1"),
settings.get("UNLIMITED_PROXY_2"),
settings.get("UNLIMITED_PROXY_3"),
settings.get("UNLIMITED_PROXY_4"),
settings.get("UNLIMITED_PROXY_5"),
settings.get("UNLIMITED_PROXY_6"),
]
class SdtProxyMiddleware(object):
def process_request(self, request, spider):
request.meta["proxy"] = random.choice(packetstream_proxies)
if request.meta.get("retry_times") == 1:
request.meta["proxy"] = random.choice(unlimited_proxies)
return None
My goal was to retry packetstream_proxies just one time for all requests after that it should retry with unlimited_proxies but above middleware is not working as expected it is retrying packetstream_proxies more than one time as I have set the RETRY_TIMES = 25.
How can I customize the proxy retries in order to achieve my expected goal?

If I understand what you want, you want to do all the requests with packetstream_proxies and once you need to do one or many retries choose an unlimited_proxies.
So you just need to fix your code in order to avoid errors with retry_times because in the first request it won't exist, so you need something like this:
class ProxyRotationMiddleware(object):
def process_request(self, request, spider):
request.meta["proxy"] = random.choice(packetstream_proxies)
# For avoid have error when retry_times does not exist, just add default 0
if request.meta.get("retry_times", 0) > 0:
request.meta["proxy"] = random.choice(unlimited_proxies)
Hope I answered your question, because I work a lot with middlewares and proxies in my job

Related

Unable to get rid of some error raised by process_exception

I'm trying not to show/get some error thrown by scrapy within process_response in RetryMiddleware. The error the script encounters when max retry limit is crossed. I used proxies within middleware. The weird thing is that the exception the script throws is already within the EXCEPTIONS_TO_RETRY list. It is completely okay that the script may sometimes cross the number of max retries without any success. However, I just do not wish to see that error even when it is there, meaning suppress or bypass it.
The error is like:
Traceback (most recent call last):
File "middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
This is how process_response within RetryMiddleware looks like:
class RetryMiddleware(object):
cus_retry = 3
EXCEPTIONS_TO_RETRY = (defer.TimeoutError, TimeoutError, DNSLookupError, \
ConnectionRefusedError, ConnectionDone, ConnectError, \
ConnectionLost, TCPTimedOutError, TunnelError, ResponseFailed)
def process_exception(self, request, exception, spider):
if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
and not request.meta.get('dont_retry', False):
return self._retry(request, exception, spider)
def _retry(self, request, reason, spider):
retries = request.meta.get('cus_retry',0) + 1
if retries<=self.cus_retry:
r = request.copy()
r.meta['cus_retry'] = retries
r.meta['proxy'] = f'https://{ip:port}'
r.dont_filter = True
return r
else:
print("done retrying")
How can I get rid of the errors in EXCEPTIONS_TO_RETRY?
PS: The error the script encounters when max retry limit is reached no matter whatever site I choose.
Maybe the problem is not on your side but there might be something wrong with the third party site. Maybe there is a connection error on their server or maybe it is secure so like no one can access it.
Cause the error even says that the error is with the party able it is shut down or not working properly maybe first check if the third party site is working when requested. Try contacting them if you can.
Cause the error is not in your end it's on the party's end as the error says.
This question is similar to Scrapy - Set TCP Connect Timeout
When max retry is reached, method like parse_error() should handle any error if it is there within your spider:
def start_requests(self):
for start_url in self.start_urls:
yield scrapy.Request(start_url,errback=self.parse_error,callback=self.parse,dont_filter=True)
def parse_error(self, failure):
# print(repr(failure))
pass
However, I thought of suggesting a completely different approach here. If you go the following route, you don't need any custom middleware at all. Everything including retrying logic is already there within the spider.
class mySpider(scrapy.Spider):
name = "myspider"
start_urls = [
"some url",
]
proxies = [] #list of proxies here
max_retries = 5
retry_urls = {}
def parse_error(self, failure):
proxy = f'https://{ip:port}'
retry_url = failure.request.url
if retry_url not in self.retry_urls:
self.retry_urls[retry_url] = 1
else:
self.retry_urls[retry_url] += 1
if self.retry_urls[retry_url] <= self.max_retries:
yield scrapy.Request(retry_url,callback=self.parse,meta={"proxy":proxy,"download_timeout":10}, errback=self.parse_error,dont_filter=True)
else:
print("gave up retrying")
def start_requests(self):
for start_url in self.start_urls:
proxy = f'https://{ip:port}'
yield scrapy.Request(start_url,callback=self.parse,meta={"proxy":proxy,"download_timeout":10},errback=self.parse_error,dont_filter=True)
def parse(self,response):
for item in response.css().getall():
print(item)
Don't forget to add the following line to get the aforesaid result from the above suggestion:
custom_settings = {
'DOWNLOADER_MIDDLEWARES': {
'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}
}
I'm using scrapy 2.3.0 by the way.
Try fixing the code in the scraper itself. Sometimes have a bad parse function can lead to a error of the kind you're describing. Once I fixed the code, it went away for me.

Capturing HTTP Errors using scrapy

I'm trying to scrape a website for broken links, so far I have this code which is successfully logging in and crawling the site, but it's only recording HTTP status 200 codes:
class HttpStatusSpider(scrapy.Spider):
name = 'httpstatus'
handle_httpstatus_all = True
link_extractor = LinkExtractor()
def start_requests(self):
"""This method ensures we login before we begin spidering"""
# Little bit of magic to handle the CSRF protection on the login form
resp = requests.get('http://localhost:8000/login/')
tree = html.fromstring(resp.content)
csrf_token = tree.cssselect('input[name=csrfmiddlewaretoken]')[0].value
return [FormRequest('http://localhost:8000/login/', callback=self.parse,
formdata={'username': 'mischa_cs',
'password': 'letmein',
'csrfmiddlewaretoken': csrf_token},
cookies={'csrftoken': resp.cookies['csrftoken']})]
def parse(self, response):
item = HttpResponseItem()
item['url'] = response.url
item['status'] = response.status
item['referer'] = response.request.headers.get('Referer', '')
yield item
for link in self.link_extractor.extract_links(response):
r = Request(link.url, self.parse)
r.meta.update(link_text=link.text)
yield r
The docs and these answers lead me to believe that handle_httpstatus_all = True should cause scrapy to pass errored requests to my parse method, but so far I've not been able to capture any.
I've also experimented with handle_httpstatus_list and a custom errback handler in a different iteration of the code.
What do I need to change to capture the HTTP error codes scrapy is encountering?
handle_httpstatus_list can be defined on the spider level, but handle_httpstatus_all can only be defined on the Request level, including it on the meta argument.
I would still recommend using an errback for these cases, but if everything is controlled, it shouldn't create new problems.
So, I don't know if this is the proper scrapy way, but it does allow me to handle all HTTP status codes (including 5xx).
I disabled the HttpErrorMiddleware by adding this snippet to my scrapy project's settings.py:
SPIDER_MIDDLEWARES = {
'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': None
}

How to cache Only http status 200 in scrapy?

I am using scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware to cache scrapy requests. I'd like it to only cache if status is 200. Is that the default behavior? Or do I need to specify HTTPCACHE_IGNORE_HTTP_CODES to be everything except 200?
The answer is no, you do not need to do that.
You should write a CachePolicy and update settings.py to enable your policy
I put the CachePolicy class in the middlewares.py
from scrapy.extensions.httpcache import DummyPolicy
class CachePolicy(DummyPolicy):
def should_cache_response(self, response, request):
return response.status == 200
and then update the settings.py, append the following line
HTTPCACHE_POLICY = 'yourproject.middlewares.CachePolicy'
Yes, by default HttpCacheMiddleware run a DummyPolicy for the requests. It pretty much does nothing special on it's own so you need to set HTTPCACHE_IGNORE_HTTP_CODES to everything except 200.
Here's the source for the DummyPolicy
And these are the lines that actually matter:
class DummyPolicy(object):
def __init__(self, settings):
self.ignore_http_codes = [int(x) for x in settings.getlist('HTTPCACHE_IGNORE_HTTP_CODES')]
def should_cache_response(self, response, request):
return response.status not in self.ignore_http_codes
So in reality you can also extend this and override should_cache_response() to something that would check for 200 explicitly, i.e. return response.status == 200 and then set it as your cache policy via HTTPCACHE_POLICY setting.

Scrapy - Correct way to change User Agent in Request

I have created a custom Middleware in Scrapy by overriding the RetryMiddleware which changes both Proxy and User-Agent before retrying. It looks like this
class CustomRetryMiddleware(RetryMiddleware):
def _retry(self, request, reason, spider):
retries = request.meta.get('retry_times', 0) + 1
if retries <= self.max_retry_times:
Proxy_UA_Middleware.switch_proxy()
Proxy_UA_Middleware.switch_ua()
logger.debug("Retrying %(request)s (failed %(retries)d times): %(reason)s",
{'request': request, 'retries': retries, 'reason': reason},
extra={'spider': spider})
retryreq = request.copy()
retryreq.meta['retry_times'] = retries
retryreq.dont_filter = True
retryreq.priority = request.priority + self.priority_adjust
return retryreq
else:
logger.debug("Gave up retrying %(request)s (failed %(retries)d times): %(reason)s",
{'request': request, 'retries': retries, 'reason': reason},
extra={'spider': spider})
The Proxy_UA_Middlware class is quite long. Basically it contains methods that change proxy and user agent. I have both these middlewares configured properly in my settings.py file. The proxy part works okay but the User Agent doesn't change. The code I've used to changed User Agent looks like this
request.headers.setdefault('User-Agent', self.user_agent)
where self.user_agent is a random value taken from an array of user agents. This doesn't work. However, if I do this
request.headers['User-Agent'] = self.user_agent
then it works just fine and the user agent changes successfully for each retry. But I haven't seen anyone use this method to change the User Agent. My question is if changing the User Agent this way is okay and if not what am I doing wrong?
If you always want to control which user-agent to use on that middleware, then it is ok, what setdefault does is to check if there is no User-Agent assigned before, which is possible because other middlewares could be doing it, or even assigning it from the spider.
Also I think you should also disable the default UserAgentMiddleware or even set a higher priority to your middleware, check that UserAgentMiddleware priority is 400, so set yours to be before (some number before 400).
First, you are overriding a function with _ (an underscore) in the front which should be a "private" function in Python. The function might change in the different version of Scrapy and your overriding will hinder the upgrade/downgrade. It's risky for you to override it. It's better to change the user agent in another function wrapping _retry.
I've made a function for that but using Scrapy fake user agent module. I found two functions calling _retry. So, retry happens on exception and on retry statuses. We need to change the user agent on both functions in the request before it is retried. So this is the code:
from scrapy.downloadermiddlewares.retry import *
from scrapy.spidermiddlewares.httperror import *
from fake_useragent import UserAgent
class Retry500Middleware(RetryMiddleware):
def __init__(self, settings):
super(Retry500Middleware, self).__init__(settings)
fallback = settings.get('FAKEUSERAGENT_FALLBACK', None)
self.ua = UserAgent(fallback=fallback)
self.ua_type = settings.get('RANDOM_UA_TYPE', 'random')
def get_ua(self):
'''Gets random UA based on the type setting (random, firefox…)'''
return getattr(self.ua, self.ua_type)
def process_response(self, request, response, spider):
if request.meta.get('dont_retry', False):
return response
if response.status in self.retry_http_codes:
reason = response_status_message(response.status)
request.headers['User-Agent'] = self.get_ua()
return self._retry(request, reason, spider) or response
return response
def process_exception(self, request, exception, spider):
if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
and not request.meta.get('dont_retry', False):
request.headers['User-Agent'] = self.get_ua()
return self._retry(request, exception, spider)
Don't forget to enable the middleware via settings.py and disable the standard retry and user agent middleware.
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
'my_project.middlewares.Retry500Middleware': 401,
'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}
FAKEUSERAGENT_FALLBACK = "<your favorite user agent>"

Scrapy: Is it possible to pause Scrapy and resume after x minutes?

I'm trying to crawl a large site. They have a rate limiting system in place. Is it possible to pause scrapy for 10 minutes when it encounter a 403 page? I know I can set a DOWNLOAD_DELAY but I noticed that I can scrape faster by setting a small DOWNLOAD_DELAY and then pause scrapy for a few minutes when it gets 403. This way the rate limiting gets triggered only once every hour or so.
You can write your own retry middleware and put it to middleware.py
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message
from time import sleep
class SleepRetryMiddleware(RetryMiddleware):
def __init__(self, settings):
RetryMiddleware.__init__(self, settings)
def process_response(self, request, response, spider):
if response.status in [403]:
sleep(120) # few minutes
reason = response_status_message(response.status)
return self._retry(request, reason, spider) or response
return super(SleepRetryMiddleware, self).process_response(request, response, spider)
and don't forget change settings.py
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
'your_project.middlewares.SleepRetryMiddleware': 100,
}
Scrapy is a Twisted-based Python framework. So, never use time.sleep or pause.until inside it!
Instead, try using Deferred() from Twisted.
class ScrapySpider(Spider):
name = 'live_function'
def start_requests(self):
yield Request('some url', callback=self.non_stop_function)
def non_stop_function(self, response):
parse_and_pause = Deferred() # changed
parse_and_pause.addCallback(self.second_parse_function) # changed
parse_and_pause.addCallback(pause, seconds=10) # changed
for url in ['url1', 'url2', 'url3', 'more urls']:
yield Request(url, callback=parse_and_pause) # changed
yield Request('some url', callback=self.non_stop_function) # Call itself
def second_parse_function(self, response):
pass
More info here: Scrapy: non-blocking pause

Categories

Resources