Python: HTTP caching using 'CacheControl' not working

I'm using Python 3.6 with the requests module for API consumption and the CacheControl module for caching the API response. I'm using the following code, but the cache does not seem to be working:
import requests
from cachecontrol import CacheControl
sess = requests.session()
cached_sess = CacheControl(sess)
response = cached_sess.get('https://jsonplaceholder.typicode.com/users')
Every request to this URL returns a 200 status code (instead of 304), and the same resource is requested each time even though the ETag header is the same and max-age is still valid. The API returns the following cache-related headers:
'Cache-Control': 'public, max-age=14400'
'Expires': 'Sat, 04 Feb 2017 22:23:28 GMT' (time of original request)
'Etag': 'W/"160d-MxiAGkI3ZBrjm0xiEDfwqw"'
What could be the issue here?
UPDATE: I'm not sending an If-None-Match header with any API call. Do I have to do that manually, or should the CacheControl module take care of it automatically?

Use a cache implementation to persist the cache between runs of the program.
import requests
from cachecontrol import CacheControl
from cachecontrol.caches import FileCache
sess = requests.session()
cached_sess = CacheControl(sess, cache=FileCache('.web_cache'))
Also, ensure you're using a recent CacheControl release. CacheControl has only cached resources served as Transfer-Encoding: chunked since 0.11.7:
$ curl -si https://jsonplaceholder.typicode.com/users | fgrep -i transfer-encoding
Transfer-Encoding: chunked
"Every request to this URL returns the 200 status code"
This is what you'll see when CacheControl is working correctly. The return of a cached response, or use of a 304, is hidden from you as a client of the code. If you believe that a fresh request is being made to the upstream server, consider something like:
import logging
logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
to see what cachecontrol.controller and requests.packages.urllib3.connectionpool are doing.
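To illustrate why a 200 is expected here, below is a minimal sketch of the freshness check a cache performs before deciding whether to contact the server at all. It is not CacheControl's actual implementation: the names max_age_seconds and is_fresh and the Date value are assumptions made for this example, and it ignores other directives such as no-cache as well as the Age header.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def max_age_seconds(cache_control):
    """Return the max-age value from a Cache-Control header, or None."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return int(value)
    return None


def is_fresh(headers, now):
    """Return True if a stored response may be reused without a request."""
    max_age = max_age_seconds(headers.get("Cache-Control", ""))
    if max_age is None or "Date" not in headers:
        return False
    stored_at = parsedate_to_datetime(headers["Date"])
    return now - stored_at < timedelta(seconds=max_age)


# Headers of a stored response; the Date value is hypothetical.
stored = {
    "Cache-Control": "public, max-age=14400",
    "Date": "Sat, 04 Feb 2017 18:23:28 GMT",
}
soon_after = datetime(2017, 2, 4, 19, 0, 0, tzinfo=timezone.utc)
much_later = datetime(2017, 2, 4, 23, 0, 0, tzinfo=timezone.utc)
print(is_fresh(stored, soon_after), is_fresh(stored, much_later))
```

While the stored response is fresh, a cache can hand back the stored body, status code included, without touching the network. Revalidation with If-None-Match only comes into play once the entry has gone stale.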

Related

Redirect a download file from API server in Python

I have two servers: one backend for the API and another one as the web frontend.
The backend is working and provides files:
# API class for the route in Flask
class DownloadDocument(Resource):
    def get(self):
        # Accept the name of the file to download from an HTTP request header.
        # That file is surely present in the directory.
        filename = request.headers.get('x-filename')
        return send_from_directory(app.config['UPLOAD_FOLDER'], filename, as_attachment=True)
The web-frontend is currently getting the file in this way:
@app.route('/get_doc/<filename>')
@login_required
def get_doc(filename):
    sending = {'x-filename': filename}
    response = requests.get('http://<<api_server>>', headers=sending)
    return response.content
The result is that when I click download on the webpage (http://<<frontend>>/get_doc/20201116003_895083.jpg), the file is displayed as text instead of being downloaded.
For example, I see a big page full of this:
����]ExifII*����
I don't understand if the problem is in the frontend or in the backend.
In the frontend, I tried with urllib.request.Request or requests.request.
Any idea how to manage this kind of download? It is probably something related to MIME interpretation, downloading bytes, or buffering the file locally.
Of course, I don't want to download the file in the web-frontend storage. I want to redirect it to the visitor.
Here are the headers from GET:
{'Content-Disposition': 'attachment; filename=20201116003_895083.jpg', 'Content-Length': '574424', 'Content-Type': 'image/jpeg', 'Last-Modified': 'Tue, 01 Dec 2020 14:04:30 GMT', 'Cache-Control': 'public, max-age=43200', 'Expires': 'Thu, 03 Dec 2020 02:34:51 GMT', 'ETag': '"1606831470.89299-574424-736697678"', 'Date': 'Wed, 02 Dec 2020 14:34:51 GMT', 'Server': 'Werkzeug/1.0.1 Python/3.8.3'}

I'm not sure if this is a good way to architect your application. I think the RESTful backend is intended to work with a JavaScript frontend, rather than a separate Flask app which contacts the 'backend' with the requests library. I'm not sure how this would behave in a larger app. The requests documentation raises some production considerations about timeouts, for example. You may see some unforeseen issues down the line when deploying this with a WSGI server. (imo)
However, with that considered, a quick fix for the actual issue would be to use the flask.send_file function to return the file. This accepts a file pointer as the first argument, so you'll need to use io.BytesIO to convert the bytes object:
from flask import send_file
from io import BytesIO
@app.route('/get_doc/<filename>')
@login_required
def get_doc(filename):
    sending = {'x-filename': filename}
    response = requests.get('http://<<api_server>>', headers=sending)
    return send_file(
        BytesIO(response.content),
        mimetype='image/jpeg',
        # as_attachment=True,
    )
You also need to provide the mimetype argument, as usually send_file guesses the mimetype based on the extension when a string like 'file.jpg' is passed as the first arg. Obviously that can't be done in this case.
You can also pass as_attachment=True if you want the user to receive a download prompt, rather than viewing the image in-browser. This is all mentioned in the send_file docs.
Again, this feels like a hack. Something seems off with using the requests library in this way. Perhaps other SO users will be able to comment further on this.
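Since the snippet above hardcodes 'image/jpeg', here is a small sketch of one way to pick the mimetype per file instead. It is only an illustration: the helper name pick_mimetype and the 'application/octet-stream' fallback are assumptions, not part of Flask or requests. It prefers the Content-Type the backend already sent (visible in the headers posted in the question) and otherwise guesses from the filename extension.

```python
import mimetypes


def pick_mimetype(upstream_headers, filename, default="application/octet-stream"):
    """Choose a mimetype for a proxied file (hypothetical helper)."""
    content_type = upstream_headers.get("Content-Type")
    if content_type:
        # Drop parameters such as "; charset=UTF-8"
        return content_type.split(";")[0].strip()
    # Fall back to guessing from the file extension
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or default


# Header taken from the response shown in the question
upstream = {"Content-Type": "image/jpeg"}
print(pick_mimetype(upstream, "20201116003_895083.jpg"))
```

In the view, the result could then be passed as the mimetype argument of send_file, with response.headers from the requests call as the first argument.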

Content-type is blank in the headers of some requests

I've run this query millions (yes, millions) of times before with other URLs. However, I'm getting a KeyError when checking the content-type of the following webpage.
Code snippet:
r = requests.get("http://health.usnews.com/health-news/articles/2014/10/15/limiting-malpractice-claims-may-not-curb-costly-medical-tests", timeout=10, headers=headers)
if "text/html" in r.headers["content-type"]:
Error:
KeyError: 'content-type'
I checked the content of r.headers and it's:
CaseInsensitiveDict({'date': 'Fri, 20 May 2016 06:44:19 GMT', 'content-length': '0', 'connection': 'keep-alive', 'server': 'BigIP'})
What could be causing this?

Not all servers set a Content-Type header. Use .get() to retrieve a default if it is missing:
if "text/html" in r.headers.get("content-type", ''):
For the URL you gave I can't reproduce this:
$ curl -s -D - -o /dev/null "http://health.usnews.com/health-news/articles/2014/10/15/limiting-malpractice-claims-may-not-curb-costly-medical-tests"
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
X-Powered-By: Brightspot
Content-Type: text/html;charset=UTF-8
Date: Fri, 20 May 2016 06:45:12 GMT
Set-Cookie: JSESSIONID=A0C35776067AABCF9E029150C64D8D91; Path=/; HttpOnly
Transfer-Encoding: chunked
but if the header is missing from your response then it usually isn't Python's fault, and certainly not your code's fault.
It could be you encountered a buggy server or temporary glitch, or the server you contacted doesn't like you for one reason or another. Your sample response headers have the content-length set to 0 as well, for example, indicating there was no content to serve at all.
The server that gave you that response is BigIP, a load balancer / network router product from a company called F5. Hard to say exactly what kind (they have global routing servers as well as per-datacenter or cluster load balancers). It could be that the load balancer ran out of back-end servers to serve the request, doesn't have servers in your region, or the load balancer decided that you are sending too many requests and refuses to give you more than just this response, or it is the wrong phase of the moon and Jupiter is in retrograde and it threw a tantrum. We can't know!
But, just in case this happens again, do also look at the response status code. It may well be a 4xx or 5xx status code indicating that something was wrong with your request or with the server. For example, a 429 status code response would indicate you made too many requests in a short amount of time and should slow down. Test for it by checking r.status_code.
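Putting both suggestions together, a defensive check could look like the sketch below. The helper name is_html_response is hypothetical, and it takes the status code and headers as plain arguments so it can be tried without a network call; with requests you would pass r.status_code and r.headers.

```python
def is_html_response(status_code, headers):
    """Return True only for a successful response that declares HTML content."""
    if status_code != 200:
        # 4xx/5xx (for example 429) means there is no page to inspect
        return False
    # .get() avoids the KeyError when the server omits Content-Type.
    # Keys are lower-case here: r.headers from requests is case-insensitive,
    # but a plain dict is not.
    content_type = headers.get("content-type", "")
    return "text/html" in content_type.lower()


# Headers from the failing response in the question (no content-type at all)
empty_reply = {"content-length": "0", "connection": "keep-alive", "server": "BigIP"}
print(is_html_response(200, empty_reply))
```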

How to view Boto3 HTTPS request string

I have been able to view the attributes of the PreparedRequest that botocore sends, but I'm wondering how I can view the exact request string that is sent to AWS. I need the exact request string to be able to compare it to another application I'm testing AWS calls with.

You could also enable debug logging in boto3. That will log all requests and responses as well as lots of other things. It's a bit obscure to enable:
import boto3
boto3.set_stream_logger(name='botocore')
The reason you have to specify botocore as the name to log is that all of the actual requests and responses happen at the botocore layer.
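If you already have the attributes of the PreparedRequest, as the question mentions, another option is to lay them out as HTTP/1.1 request text yourself for side-by-side comparison. The sketch below is only an approximation under stated assumptions: format_http_request is a hypothetical helper, not a botocore or boto3 function, and the HTTP client may add or reorder headers, so the result is not guaranteed to match the bytes on the wire.

```python
from urllib.parse import urlsplit


def format_http_request(method, url, headers, body=None):
    """Render method, URL, headers and body as HTTP/1.1 request text."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    lines = ["%s %s HTTP/1.1" % (method, target)]
    # Add a Host line unless the caller already supplied one
    if not any(name.lower() == "host" for name in headers):
        lines.append("Host: " + parts.netloc)
    lines.extend("%s: %s" % (name, value) for name, value in headers.items())
    text = "\r\n".join(lines) + "\r\n\r\n"
    if body:
        text += body if isinstance(body, str) else body.decode("utf-8", "replace")
    return text


# Example values are made up for illustration
example = format_http_request(
    "POST",
    "https://elasticache.us-east-1.amazonaws.com/",
    {"Content-Type": "application/x-www-form-urlencoded"},
    "Action=DescribeCacheEngineVersions",
)
print(example)
```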

So what you probably want to do is to send your request through the proxy (mitmproxy, squid). Then check the proxy for what was sent.
Since HTTPS data is encrypted, the proxy must first decrypt it, then log it, then encrypt it again and send it on to AWS. One option is mitmproxy (it's really easy to install).
Run mitmproxy
Open up another terminal and point the proxy settings to mitmproxy's port:
export http_proxy=127.0.0.1:8080
export https_proxy=$http_proxy
Then set verify=False when creating session/client
In [1]: import botocore.session
In [2]: client = botocore.session.Session().create_client('elasticache', verify=False)
Send request and look at the output of mitmproxy
In [3]: client.describe_cache_engine_versions()
The result should be similar to this:
Host: elasticache.us-east-1.amazonaws.com
Accept-Encoding: identity
Content-Length: 53
Content-Type: application/x-www-form-urlencoded
Authorization: AWS4-HMAC-SHA256 Credential=FOOOOOO/20150428/us-east-1/elasticache/aws4_request, SignedHeaders=host;user-agent;x-amz-date, Signature=BAAAAAAR
X-Amz-Date: 20150428T213004Z
User-Agent: Botocore/0.103.0 Python/2.7.6 Linux/3.13.0-49-generic
<?xml version='1.0' encoding='UTF-8'?>
<DescribeCacheEngineVersionsResponse
xmlns="http://elasticache.amazonaws.com/doc/2015-02-02/">
<DescribeCacheEngineVersionsResult>
<CacheEngineVersions>
<CacheEngineVersion>
<CacheParameterGroupFamily>memcached1.4</CacheParameterGroupFamily>
<Engine>memcached</Engine>
<CacheEngineVersionDescription>memcached version 1.4.14</CacheEngineVersionDescription>
<CacheEngineDescription>memcached</CacheEngineDescription>
<EngineVersion>1.4.14</EngineVersion>

HTTPConnection to make DELETE request: 505 response

Frustratingly, I need to develop something on Python 2.6.4 and need to send a DELETE request to a server that seems to support only HTTP/1.1. Here is my code:
import httplib
httpConnection = httplib.HTTPConnection("localhost:9080")
httpConnection.request('DELETE', remainderURL)
httpResponse = httpConnection.getresponse()
The response code I then get is: 505 (HTTP version not supported)
I've tested sending a delete request via Firefox's RESTClient to the same URL and that works.
I can't use urllib2 because it doesn't support the DELETE request. Is the HTTPConnection object http 1.0 only? Or am I doing something wrong?

The HTTPConnection class uses HTTP/1.1 throughout, and the 505 seems to indicate it's the server that cannot handle HTTP/1.1 requests.
However, if you need to make DELETE requests, why not use the Requests package instead? A DELETE is as simple as:
import requests
requests.delete(url)
That won't magically solve your HTTP version mismatch, but you can enable verbose logging to figure out what is going on (the config argument shown here exists only in Requests releases before 1.0):
import sys
requests.delete(url, config=dict(verbose=sys.stderr))

You can use urllib2:
import urllib2
req = urllib2.Request(query_url)
req.get_method = lambda: 'DELETE'  # override the HTTP method used for this request
response = urllib2.urlopen(req)
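For readers on Python 3 (the question itself is about Python 2.6), the same idea needs no lambda, because urllib.request.Request accepts a method argument. A minimal sketch follows; the helper name build_delete_request and the URL path are placeholders chosen for this example.

```python
import urllib.request


def build_delete_request(url):
    """Create a request object that will be sent with the DELETE method."""
    return urllib.request.Request(url, method="DELETE")


# Placeholder URL; nothing is sent until urlopen() is called
request = build_delete_request("http://localhost:9080/some/resource")
print(request.get_method())
# To actually send it: urllib.request.urlopen(request)
```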

httplib uses HTTP/1.1 (see the HTTPConnection.putrequest method documentation).
Check httpResponse.version to see what version the server is using.

Caching of (fake) static content which is actually dynamic on GAE for Python

In my GAE app I have the following handler in app.yaml:
- url: /lang/strings.js
  script: js_lang.py
So a call to /lang/strings.js will actually map to the js_lang.py request handler, which populates the response as application/javascript. I want this response to be cached in the browser so that the request handler only gets called once in a while (for example, when I "invalidate" the cache by importing /lang/strings.js?v=xxxx after deploying a new version of the app).
For normal static content, there is the default_expiration element, which is very handy and results in HTTP response headers like this:
Expires: Fri, 01 Apr 2011 09:54:56 GMT
Cache-Control: public, max-age=600
Ok, the question: is there an easy way for me to return headers such as this, without having to explicitly set them? Alternatively, is there a code snippet out there that accepts a few basic parameters such as "days" and produces the expected http-headers?
Edit 12 April 2011
I solved this by simply setting the two headers Expires and Cache-Control like this:
import datetime
thirty_days_in_seconds = 30 * 24 * 60 * 60  # 2592000
# Use UTC, because the header format below labels the time as GMT
expires_date = datetime.datetime.utcnow() + datetime.timedelta(days=30)
HTTP_HEADER_FORMAT = "%a, %d %b %Y %H:%M:00 GMT"
self.response.headers["Expires"] = expires_date.strftime(HTTP_HEADER_FORMAT)
self.response.headers["Cache-Control"] = "public, max-age=%s" % thirty_days_in_seconds
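For the "accepts a few basic parameters such as days" part of the question, a reusable helper could look roughly like the sketch below. It is written for Python 3 and the standard library only, so it is an illustration rather than drop-in code for the GAE runtime above; the name cache_headers and its now parameter are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def cache_headers(days, now=None):
    """Build Expires and Cache-Control headers for a lifetime given in days."""
    if now is None:
        now = datetime.now(timezone.utc)
    lifetime = timedelta(days=days)
    return {
        # usegmt=True renders the HTTP date format, ending in "GMT"
        "Expires": format_datetime(now + lifetime, usegmt=True),
        "Cache-Control": "public, max-age=%d" % int(lifetime.total_seconds()),
    }


# A fixed start time makes the result reproducible
headers = cache_headers(30, now=datetime(2011, 4, 12, 9, 0, tzinfo=timezone.utc))
print(headers)
```

Deriving both values from the same timedelta keeps Expires and max-age consistent with each other.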

Have a look at the Static serving blog post by Nick.
It covers everything you need to know about conditional requests and how to properly get and set the correct HTTP headers:
HTTP request header handling (If-Modified-Since, If-None-Match)
HTTP response header handling (Last-Modified, ETag)
