I am using scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware to cache scrapy requests. I'd like it to only cache if status is 200. Is that the default behavior? Or do I need to specify HTTPCACHE_IGNORE_HTTP_CODES to be everything except 200?
No, you do not need to do that. Instead, write a custom cache policy and update settings.py to enable it.
I put the CachePolicy class in middlewares.py:
from scrapy.extensions.httpcache import DummyPolicy

class CachePolicy(DummyPolicy):
    def should_cache_response(self, response, request):
        return response.status == 200
and then updated settings.py, appending the following line:
HTTPCACHE_POLICY = 'yourproject.middlewares.CachePolicy'
Yes, by default HttpCacheMiddleware runs a DummyPolicy for the requests. It pretty much does nothing special on its own, so you would need to set HTTPCACHE_IGNORE_HTTP_CODES to everything except 200.
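If you stick with the default policy, that setting has to list every status code you do not want cached. A hedged sketch of what that could look like in settings.py (the 100-599 range is just an illustration of "everything except 200", not an official list):

HTTPCACHE_ENABLED = True
# Only meaningful with the default DummyPolicy: skip caching for every code but 200.
HTTPCACHE_IGNORE_HTTP_CODES = [code for code in range(100, 600) if code != 200]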
Here's the source for the DummyPolicy
And these are the lines that actually matter:
class DummyPolicy(object):

    def __init__(self, settings):
        self.ignore_http_codes = [int(x) for x in settings.getlist('HTTPCACHE_IGNORE_HTTP_CODES')]

    def should_cache_response(self, response, request):
        return response.status not in self.ignore_http_codes
So in reality you can also extend this and override should_cache_response() with something that checks for 200 explicitly, i.e. return response.status == 200, and then set it as your cache policy via the HTTPCACHE_POLICY setting.
Related
I have this custom scrapy proxy rotation middleware in my spider:
import random

packetstream_proxies = [
    settings.get("PS_PROXY_USA"),
    settings.get("PS_PROXY_CA"),
    settings.get("PS_PROXY_IT"),
    settings.get("PS_PROXY_GLOBAL"),
]

unlimited_proxies = [
    settings.get("UNLIMITED_PROXY_1"),
    settings.get("UNLIMITED_PROXY_2"),
    settings.get("UNLIMITED_PROXY_3"),
    settings.get("UNLIMITED_PROXY_4"),
    settings.get("UNLIMITED_PROXY_5"),
    settings.get("UNLIMITED_PROXY_6"),
]

class SdtProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(packetstream_proxies)
        if request.meta.get("retry_times") == 1:
            request.meta["proxy"] = random.choice(unlimited_proxies)
        return None
My goal was to retry with packetstream_proxies only once per request; after that it should retry with unlimited_proxies. But the above middleware is not working as expected: it keeps retrying with packetstream_proxies more than once, since I have set RETRY_TIMES = 25.
How can I customize the proxy retries in order to achieve my expected goal?
If I understand correctly, you want to make all requests with packetstream_proxies and, whenever one or more retries are needed, switch to unlimited_proxies.
So you just need to fix your code to handle retry_times correctly, because on the first request that meta key does not exist. You need something like this:
class ProxyRotationMiddleware(object):
    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(packetstream_proxies)
        # retry_times does not exist on the first request, so default to 0
        if request.meta.get("retry_times", 0) > 0:
            request.meta["proxy"] = random.choice(unlimited_proxies)
Hope I answered your question; I work a lot with middlewares and proxies in my job.
I thought I had found a solution using the RFC2616 policy, but when testing the scraper's execution time it seemed to stay the same, so I went back to the default policy.
I'm directing my image_urls to
'production.pipelines.MyImagesPipeline'
Now I only need to cache the URLs I send to the item's image_urls field.
From my understanding you can override the policy by specifying something like:
class DummyPolicy(object):
    def should_cache_response(self, response, request):
        if image_url in item['image_urls']:
            return True
        else:
            return False

    def is_cached_response_valid(self, cachedresponse, response, request):
        return True
Any code suggestions for getting this working?
I created a solution by adding the dont_cache meta key to certain yielded requests:
yield scrapy.Request(url, self.parse, meta={'dont_cache': True})
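To keep only the image downloads in the cache, one way to apply that flag is to set dont_cache on every request you do not care about and leave the image requests untouched. A sketch, with the spider name, selectors and callbacks as placeholders:

import scrapy

class MyImagesSpider(scrapy.Spider):
    name = "my_images"  # hypothetical spider for illustration

    def parse(self, response):
        # Page/pagination requests: skip the HTTP cache entirely.
        for href in response.css("a.next::attr(href)").getall():
            yield scrapy.Request(response.urljoin(href), self.parse,
                                 meta={'dont_cache': True})

        # The image requests issued for item['image_urls'] carry no
        # dont_cache flag, so the cache policy stores them as usual.
        yield {'image_urls': response.css("img::attr(src)").getall()}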
I have a very unreliable API that I request using Python. I have been thinking about using requests_cache and setting expire_after to be 999999999999 like I have seen other people do.
The only problem is, I do not know whether the data gets updated once the API works again, i.e. whether requests_cache will automatically update the cache and delete the old entry.
I have tried reading the docs but I cannot really see this anywhere.
requests_cache will not update an entry until the expire_after time has passed, so it will not detect that your API is back in a working state before then.
I note that the project has since added an option that I implemented in the past; you can now set the old_data_on_error option when configuring the cache; see the CachedSession documentation:
old_data_on_error – If True it will return expired cached response if update fails.
It would reuse existing cache data in case a backend update fails, rather than delete that data.
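A short usage sketch, assuming a requests_cache version where CachedSession accepts that argument (check your installed version):

import requests_cache

# Fall back to an expired cache entry if refreshing it fails.
session = requests_cache.CachedSession(
    'cache_name', backend='sqlite', expire_after=180,
    old_data_on_error=True)

response = session.get('https://example.com/unreliable-endpoint')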
In the past, I created my own requests_cache session setup (plus a small patch) that would reuse cached values beyond expire_after if the backend gave a 500 error or timed out (using short timeouts), to deal with a problematic API layer rather than relying on expire_after alone:
import logging
from datetime import (
    datetime,
    timedelta
)

from requests.exceptions import (
    ConnectionError,
    Timeout,
)
from requests_cache.core import (
    dispatch_hook,
    CachedSession,
)

log = logging.getLogger(__name__)
# Stop logging from complaining if no logging has been configured.
log.addHandler(logging.NullHandler())


class FallbackCachedSession(CachedSession):
    """Cached session that'll reuse expired cache data on timeouts.

    This allows survival in case the backend is down, living off stale
    data until it comes back.

    """
    def send(self, request, **kwargs):
        # this *bypasses* CachedSession.send; we want to call the method
        # CachedSession.send() would have delegated to!
        session_send = super(CachedSession, self).send

        if (self._is_cache_disabled or
                request.method not in self._cache_allowable_methods):
            response = session_send(request, **kwargs)
            response.from_cache = False
            return response

        cache_key = self.cache.create_key(request)

        def send_request_and_cache_response(stale=None):
            try:
                response = session_send(request, **kwargs)
            except (Timeout, ConnectionError):
                if stale is None:
                    raise
                log.warning('No response received, reusing stale response for '
                            '%s', request.url)
                return stale

            if stale is not None and response.status_code == 500:
                log.warning('Response gave 500 error, reusing stale response '
                            'for %s', request.url)
                return stale

            if response.status_code in self._cache_allowable_codes:
                self.cache.save_response(cache_key, response)
            response.from_cache = False
            return response

        response, timestamp = self.cache.get_response_and_time(cache_key)
        if response is None:
            return send_request_and_cache_response()

        if self._cache_expire_after is not None:
            is_expired = datetime.utcnow() - timestamp > self._cache_expire_after
            if is_expired:
                self.cache.delete(cache_key)
                # try and get a fresh response, but if that fails reuse the
                # stale one
                return send_request_and_cache_response(stale=response)

        # dispatch hook here, because we've removed it before pickling
        response.from_cache = True
        response = dispatch_hook('response', request.hooks, response, **kwargs)
        return response


def basecache_delete(self, key):
    # We don't really delete; we instead set the timestamp to
    # datetime.min. This way we can re-use stale values if the backend
    # fails
    try:
        if key not in self.responses:
            key = self.keys_map[key]
        self.responses[key] = self.responses[key][0], datetime.min
    except KeyError:
        return

from requests_cache.backends.base import BaseCache
BaseCache.delete = basecache_delete
The above subclass of CachedSession bypasses the original send() method to instead go directly to the original requests.Session.send() method, returning an existing cached value even if the expiry time has passed but the backend has failed. Deletion is disabled in favour of resetting the stored timestamp to datetime.min, so the stale value can still be reused if a new request fails.
Use the FallbackCachedSession instead of a regular CachedSession object.
If you want to use requests_cache.install_cache(), make sure to pass FallbackCachedSession to that function via the session_factory keyword argument:
import requests_cache

requests_cache.install_cache(
    'cache_name', backend='some_backend', expire_after=180,
    session_factory=FallbackCachedSession)
My approach is a little more comprehensive than what requests_cache implemented some time after I hacked together the above; my version will fall back to a stale response even if you explicitly marked it as deleted before.
Try something like this:
class UnreliableAPIClient:
    def __init__(self):
        self.some_api_method_cached = {}  # we will store results here

    def some_api_method(self, param1, param2):
        params_hash = "{0}-{1}".format(param1, param2)  # need to identify the input
        try:
            result = do_call_some_api_method_with_fail_probability(param1, param2)
            self.some_api_method_cached[params_hash] = result  # save result
        except Exception:
            result = self.some_api_method_cached.get(params_hash)  # fall back to the cached result
            if result is None:
                raise  # re-raise the exception if nothing is cached
        return result
Of course you can make a simple decorator out of that; up to you - http://www.artima.com/weblogs/viewpost.jsp?thread=240808
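As a rough sketch of such a decorator (the name fallback_to_cache and the key scheme are just illustrative):

import functools

def fallback_to_cache(func):
    """Reuse the last successful result for the same arguments if the call fails."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        key = (args, tuple(sorted(kwargs.items())))
        try:
            result = func(*args, **kwargs)
            cache[key] = result  # remember the latest good result
            return result
        except Exception:
            if key in cache:
                return cache[key]
            raise  # nothing cached for these arguments

    return wrapper

@fallback_to_cache
def some_api_method(param1, param2):
    return do_call_some_api_method_with_fail_probability(param1, param2)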
I adapted this sample code in order to get webapp2 sessions to work on Google App Engine.
What do I need to do to be able to return webapp2.Response objects from a handler that's inheriting from a BaseHandler that overrides the dispatch method?
Here's a demonstration of the kind of handler I want to write:
import webapp2
import logging

from webapp2_extras import sessions

class BaseHandler(webapp2.RequestHandler):
    def dispatch(self):
        # Get a session store for this request.
        self.session_store = sessions.get_store(request=self.request)

        try:
            # Dispatch the request.
            webapp2.RequestHandler.dispatch(self)
        finally:
            # Save all sessions.
            self.session_store.save_sessions(self.response)

class HomeHandler(BaseHandler):
    def get(self):
        logging.debug('In homehandler')
        response = webapp2.Response()
        response.write('Foo')
        return response

config = {}
config['webapp2_extras.sessions'] = {
    'secret_key': 'some-secret-key',
}

app = webapp2.WSGIApplication([
    ('/test', HomeHandler),
], debug=True, config=config)
This code is obviously not working, since BaseHandler always calls dispatch with self. I've looked through the code of webapp2.RequestHandler, but it seriously eludes me how to modify my BaseHandler (or perhaps set a custom dispatcher) such that I can simply return response objects from inheriting handlers.
Curiously, the shortcut of assigning self.response = copy.deepcopy(response) does not work either.
You're mixing the two responses in one method. Use either
return webapp2.Response('Foo')
or
self.response.write('Foo')
...not both.
I took a look at webapp2.RequestHandler and noticed that returned values are just passed up the stack.
A solution which works for me is to use the returned Response when one is returned from the handler, or self.response when nothing is returned.
class BaseHandler(webapp2.RequestHandler):
    def dispatch(self):
        # Get a session store for this request.
        self.session_store = sessions.get_store(request=self.request)

        response = None
        try:
            # Dispatch the request.
            response = webapp2.RequestHandler.dispatch(self)
            return response
        finally:
            # Save all sessions.
            if response is None:
                response = self.response
            self.session_store.save_sessions(response)
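With that dispatch in place, a handler like the one from the question can simply return its own Response object (or keep writing to self.response); for example:

class HomeHandler(BaseHandler):
    def get(self):
        # Returning the Response works because dispatch() passes it up the
        # stack and saves the sessions onto it.
        return webapp2.Response('Foo')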
While playing around with this, I noticed that my session, stored as a secure cookie, was not getting updated when exceptions were raised in the handler.
Maybe it is a stupid question, but I cannot figure out how to set an HTTP status code in web.py.
In the documentation I can see a list of types for the main status codes, but is there a generic function to set the status code?
I'm trying to implement an unAPI server and it's required to reply with a 300 Multiple Choices to a request with only an identifier. More info here
Thanks!
EDIT: I just discovered that I can set it through web.ctx by doing
web.ctx.status = '300 Multiple Choices'
Is this the best solution?
The way web.py does this for 301 and other redirect types is by subclassing web.HTTPError (which in turn sets web.ctx.status). For example:
class MultipleChoices(web.HTTPError):
    def __init__(self, choices):
        status = '300 Multiple Choices'
        headers = {'Content-Type': 'text/html'}
        data = '<h1>Multiple Choices</h1>\n<ul>\n'
        data += ''.join('<li>{0}</li>\n'.format(c)
                        for c in choices)
        data += '</ul>'
        web.HTTPError.__init__(self, status, headers, data)
Then to output this status code you raise MultipleChoices in your handler:
class MyHandler:
    def GET(self):
        raise MultipleChoices(['http://example.com/', 'http://www.google.com/'])
It'll need tuning for your particular unAPI application, of course.
See also the source for web.HTTPError in webapi.py.