Is it possible to use gzip compression with Server-Sent Events (SSE)? - python

I would like to know if it is possible to enable gzip compression
for Server-Sent Events (SSE; Content-Type: text/event-stream).
It seems it is possible, according to this book:
http://chimera.labs.oreilly.com/books/1230000000545/ch16.html
But I can't find any example of SSE with gzip compression. I tried to
send gzipped messages with the response header field
Content-Encoding set to "gzip" without success.
To experiment with SSE, I am testing a small web application
made in Python with the bottle framework + gevent; I am just running
the bottle WSGI server:
@bottle.get('/data_stream')
def stream_data():
    bottle.response.content_type = "text/event-stream"
    bottle.response.add_header("Connection", "keep-alive")
    bottle.response.add_header("Cache-Control", "no-cache")
    bottle.response.add_header("Content-Encoding", "gzip")
    while True:
        # new_data is a gevent AsyncResult object;
        # .get() just returns a data string when new
        # data is available
        data = new_data.get()
        yield zlib.compress("data: %s\n\n" % data)
        #yield "data: %s\n\n" % data
The code without compression (last line, commented) and without the
gzip Content-Encoding header field works like a charm.
EDIT: thanks to the reply and to this other question: Python: Creating a streaming gzip'd file-like?, I managed to solve the problem:
@bottle.route("/stream")
def stream_data():
    compressed_stream = zlib.compressobj()
    bottle.response.content_type = "text/event-stream"
    bottle.response.add_header("Connection", "keep-alive")
    bottle.response.add_header("Cache-Control", "no-cache, must-revalidate")
    bottle.response.add_header("Content-Encoding", "deflate")
    bottle.response.add_header("Transfer-Encoding", "chunked")
    while True:
        data = new_data.get()
        yield compressed_stream.compress("data: %s\n\n" % data)
        yield compressed_stream.flush(zlib.Z_SYNC_FLUSH)

TL;DR: If the responses are not cached, you likely want to use zlib and declare Content-Encoding to be 'deflate'. That change alone should make your code work.
If you declare Content-Encoding to be gzip, you need to actually use gzip. gzip and deflate are based on the same compression algorithm, but gzip adds some extra framing. This works, for example:
import gzip
import StringIO
from bottle import response, route

@route('/')
def get_data():
    response.add_header("Content-Encoding", "gzip")
    s = StringIO.StringIO()
    with gzip.GzipFile(fileobj=s, mode='w') as f:
        f.write('Hello World')
    return s.getvalue()
That only really makes sense if you use an actual file as a cache, though.
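If you do want real gzip framing on a live stream instead (without buffering the whole body), zlib can produce that too: passing a wbits value of 16 + zlib.MAX_WBITS to zlib.compressobj makes it emit a gzip header and trailer around the deflate data. Here is a minimal sketch of the question's generator adapted that way; new_data is assumed to be the same gevent AsyncResult object as in the question:
import zlib
import bottle

@bottle.route("/gzip_stream")
def stream_data_gzip():
    # 16 + MAX_WBITS makes zlib write gzip framing
    # instead of a raw deflate stream
    compressor = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
    bottle.response.content_type = "text/event-stream"
    bottle.response.add_header("Cache-Control", "no-cache")
    bottle.response.add_header("Content-Encoding", "gzip")
    while True:
        data = new_data.get()
        yield compressor.compress("data: %s\n\n" % data)
        # Z_SYNC_FLUSH pushes the compressed bytes out immediately,
        # which keeps the event stream live
        yield compressor.flush(zlib.Z_SYNC_FLUSH)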

There's also middleware you can use so you don't need to worry about gzipping responses for each of your methods. Here's one I used recently.
https://code.google.com/p/ibkon-wsgi-gzip-middleware/
This is how I used it (I'm using bottle.py with the gevent server)
from gzip_middleware import Gzipper
import bottle

app = Gzipper(bottle.app())
bottle.run(app=app, host='0.0.0.0', port=8080, server='gevent')
For this particular library, you can set which content types get compressed by modifying the DEFAULT_COMPRESSABLES variable, for example:
DEFAULT_COMPRESSABLES = set(['text/plain', 'text/html', 'text/css',
                             'application/json', 'application/x-javascript',
                             'text/xml', 'application/xml',
                             'application/xml+rss', 'text/javascript',
                             'image/gif'])
All responses go through the middleware and get gzipped without modifying your existing code. By default, it compresses responses whose content-type belongs to DEFAULT_COMPRESSABLES and whose content-length is greater than 200 characters.

Related

Decompressing (Gzip) chunks of response from http.client call

I have the following code that I am using to try to chunk the response from an http.client.HTTPSConnection GET request to an API (please note that the response is gzip encoded):
connection = http.client.HTTPSConnection(api, context=ssl._create_unverified_context())
connection.request('GET', api_url, headers=auth)
response = connection.getresponse()

while chunk := response.read(20):
    data = gzip.decompress(chunk)
    data = json.loads(chunk)
    print(data)
This always gives an error that it is not a gzipped file (b'\xe5\x9d'). I am not sure how I can chunk the data and still achieve what I am trying to do here. Basically, I am chunking so that I don't have to load the entire response in memory.
Please note I can't use any other libraries like requests, urllib etc.
The most probable reason is that the response you received is indeed not a gzipped file.
I notice that in your code, you pass a variable called auth. Typically, a server won't send you a compressed response if you don't specify in the request headers that you can accept one. If there are only auth-related keys in your headers, as your variable name suggests, you won't receive a gzipped response. First, make sure you have 'Accept-Encoding': 'gzip' in your headers.
Going forward, you will face another problem:
Basically, I am chunking so that I don't have to load the entire response in memory.
gzip.decompress expects a complete file, so you would need to reconstruct the file and load it entirely in memory before decompressing, which would undermine the whole point of chunking the response. Trying to decompress part of a gzip stream with gzip.decompress will most likely give you an EOFError saying something like Compressed file ended before the end-of-stream marker was reached.
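You can reproduce that failure in isolation; a quick sketch, just truncating a valid gzip blob before decompressing:
import gzip

blob = gzip.compress(b"some example payload")
gzip.decompress(blob)       # fine: the stream is complete
gzip.decompress(blob[:10])  # EOFError: Compressed file ended before the
                            # end-of-stream marker was reached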
I don't know if you can manage that directly with the gzip library, but I know how to do it with zlib. You will also need to convert your chunk to a file-like object; you can do that with io.BytesIO. I see you have very strong constraints on libraries, but zlib and io are part of the Python standard library, so hopefully you have them available. Here is a rework of your code that should get you going:
import http.client
import ssl
import zlib
from io import BytesIO

# your variables here
api = 'your_api_host'
api_url = 'your_api_endpoint'
auth = {'AuthKeys': 'auth_values'}

# advertise that we accept a gzipped response
auth['Accept-Encoding'] = 'gzip'

# prepare the decompressor; 16 + MAX_WBITS tells zlib to expect gzip framing
decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)

connection = http.client.HTTPSConnection(api, context=ssl._create_unverified_context())
connection.request('GET', api_url, headers=auth)
response = connection.getresponse()

while chunk := response.read(20):
    data = decompressor.decompress(BytesIO(chunk).read())
    print(data)
The problem is that gzip.decompress expects a complete file; you can't just feed it a chunk, because the deflate algorithm relies on previously decompressed data during decompression. The whole point of the algorithm is that it can repeat something it has already seen before, therefore all the data is required.
However, deflate only cares about the last 32 KiB or so, so it is possible to stream-decompress such a file without needing much memory. This is not something you need to implement yourself, though: Python provides the gzip.GzipFile class, which can wrap the file handle and behaves like a normal file:
import io
import gzip

# Create a file for testing.
# In your case you can just use the response object you get.
file_uncompressed = ""
for line_index in range(10000):
    file_uncompressed += f"This is line {line_index}.\n"
file_compressed = gzip.compress(file_uncompressed.encode())
file_handle = io.BytesIO(file_compressed)

# This class does all the heavy lifting
gzip_file = gzip.GzipFile(fileobj=file_handle)

while chunk := gzip_file.read(1024):
    print(chunk)
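In your case the same idea should apply directly, since the response object returned by http.client is already file-like (it has a read() method); a sketch, assuming the connection is set up exactly as in your question with 'Accept-Encoding': 'gzip' in the request headers:
# response is the http.client response object from your code
gzip_file = gzip.GzipFile(fileobj=response)

while chunk := gzip_file.read(1024):
    print(chunk)  # decompressed bytes, at most 1024 at a time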

Why is it necessary to use a stream to download an image through HTTP GET?

Here is a body of code that works, taken from: https://stackoverflow.com/a/18043472
It uses the requests module in python to download an image.
import requests, shutil

url = 'http://example.com/img.png'
response = requests.get(url, stream=True)
with open('img.png', 'wb') as out_file:
    shutil.copyfileobj(response.raw, out_file)
del response
Two questions I've been thinking about:
1) Why is it necessary to set stream=True? (I've tested it without that parameter and the image comes out blank.) Conceptually, I don't understand what a streaming GET request is.
2) What's the difference between a raw response and a response? (Why is shutil.copyfileobj necessary; why can't I just write directly to file?)
Thanks!
Quote from documentation:
If you set stream to True when making a request, Requests cannot
release the connection back to the pool unless you consume all the
data or call Response.close.
More info here.
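To make the two points concrete: stream=True tells requests to read only the headers up front and leave the body on the socket until you consume it, and response.raw is the undecoded underlying file object, whereas response.content buffers the whole body in memory. Without stream=True, requests has already consumed the body into response.content by the time you touch response.raw, so there is nothing left to copy, which is why the image came out blank. A sketch of the same download using the documented iter_content API instead of raw:
import requests

url = 'http://example.com/img.png'
response = requests.get(url, stream=True)
with open('img.png', 'wb') as out_file:
    # iter_content pulls the body off the wire in pieces instead of
    # loading it all into memory at once
    for chunk in response.iter_content(chunk_size=8192):
        out_file.write(chunk)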

Compressing request body with python-requests?

(This question is not about transparent decompression of gzip-encoded responses from a web server; I know that requests handles that automatically.)
Problem
I'm trying to POST a file to a RESTful web service. Obviously, requests makes this pretty easy to do:
files = dict(data=(fn, file))
response = session.post(endpoint_url, files=files)
In this case, my file is in a really highly-compressible format (yep, XML) so I'd like to make sure that the request body is compressed.
The server claims to accept gzip encoding (Accept-Encoding: gzip in the response headers), so I should be able to gzip the whole request body, right?
Attempted solution
Here's my attempt to make this work: I first construct the request and prepare it, then I go into the PreparedRequest object, yank out the body, run it through gzip, and put it back. (Oh, and don't forget to update the Content-Length and Content-Encoding headers.)
files = dict(data=(fn, file))
request = requests.Request('POST', endpoint_url, files=files)
prepped = session.prepare_request(request)
with NamedTemporaryFile(delete=True) as gzfile:
    gzip.GzipFile(fileobj=gzfile, mode="wb").write(prepped.body)
    prepped.headers['Content-Length'] = gzfile.tell()
    prepped.headers['Content-Encoding'] = 'gzip'
    gzfile.seek(0, 0)
    prepped.body = gzfile.read()
response = session.send(prepped)
Unfortunately, the server is not cooperating and returns 500 Internal Server Error. Perhaps it doesn't really accept gzip-encoded requests?
Or perhaps there is a mistake in my approach? It seems rather convoluted. Is there an easier way to do request body compression with python-requests?
EDIT: Fixed (3) and (5) from @sigmavirus24's answer (these were basically just artifacts I'd overlooked in simplifying the code to post it here).
Or perhaps there is a mistake in my approach?
I'm unsure how you arrived at your approach, frankly, but there's certainly a simpler way of doing this.
First, a few things:
1. The files parameter constructs a multipart/form-data body. So you're compressing something that the server potentially has no clue about.
2. Content-Encoding and Transfer-Encoding are two very different things. You want Transfer-Encoding here.
3. You don't need to set a suffix on your NamedTemporaryFile.
4. Since you didn't explicitly mention that you're trying to compress a multipart/form-data request, I'm going to assume that you don't actually want to do that.
5. Your call to session.Request (which I assume should be requests.Request) is missing a method, i.e., it should be: requests.Request('POST', endpoint_url, ...)
With those out of the way, here's how I would do this:
# Assuming `file` is a file-like obj
with NamedTemporaryFile(delete=True) as gzfile:
    gzip.GzipFile(fileobj=gzfile, mode="wb").write(file.read())
    headers = {'Content-Length': str(gzfile.tell()),
               'Transfer-Encoding': 'gzip'}
    gzfile.seek(0, 0)
    response = session.post(endpoint_url, data=gzfile,
                            headers=headers)
Assuming that file has the xml content in it and all you meant was to compress it, this should work for you. You probably want to set a Content-Type header though, for example, you'd just do
headers = {'Content-Length': str(gzfile.tell()),
           'Content-Type': 'application/xml',  # or 'text/xml'
           'Transfer-Encoding': 'gzip'}
The Transfer-Encoding tells the server that the request is being compressed only in transit and it should uncompress it. The Content-Type tells the server how to handle the content once the Transfer-Encoding has been handled. 
I had a question that was marked as an exact duplicate of this one; I was concerned with both ends of the transaction.
The code from sigmavirus24 wasn't a direct cut-and-paste fix, but it was the inspiration for this version.
Here's how my solution ended up looking:
sending from the python end
import json
import requests
import StringIO
import gzip

url = "http://localhost:3000"
headers = {"Content-Type": "application/octet-stream"}
data = [{"key": 1, "otherKey": "2"},
        {"key": 3, "otherKey": "4"}]

out = StringIO.StringIO()
with gzip.GzipFile(fileobj=out, mode="w") as f:
    f.write(json.dumps(data))

r = requests.post(url + "/zipped", data=out.getvalue(), headers=headers)
receiving at the express end
var zlib = require("zlib");
var rawParser = bodyParser.raw({type: '*/*'});

app.post('/zipped', rawParser, function(req, res) {
    zlib.gunzip(req.body, function(err, buf) {
        if (err) {
            console.log("err:", err);
        } else {
            console.log("in the inflate callback:",
                        buf,
                        "to string:", buf.toString("utf8"));
        }
    });
    res.status(200).send("I'm in ur zipped route");
});
There's a gist here with more verbose logging included. This version doesn't have any safety or checking built in either.

How do I batch send a multipart html post with multiple urls?

I am speaking to the Gmail API and would like to batch the requests. They have a friendly guide for this here, https://developers.google.com/gmail/api/guides/batch, which suggests that I should be able to use multipart/mixed and include a different URL for each part.
I am using Python and the Requests library, but am unsure how to issue requests to different URLs within one batch. Answers like How to send a "multipart/form-data" with requests in python? don't mention an option for changing that part.
How do I do this?
Unfortunately, requests does not support multipart/mixed in their API. This has been suggested in several GitHub issues (#935 and #1081), but there are no updates on this for now. This also becomes quite clear if you search for "mixed" in the requests sources and get zero results :(
Now you have several options, depending on how much you want to make use of Python and 3rd-party libraries.
Google API Client
Now, the most obvious answer to this problem is to use the official Python API that Google is providing here. It comes with a BatchHttpRequest class that can handle the batch requests that you need. This is documented in detail in this guide.
Essentially, you create an HttpBatchRequest object and add all your requests to it. The library will then put everything together (taken from the guide above):
batch = BatchHttpRequest()
batch.add(service.animals().list(), callback=list_animals)
batch.add(service.farmers().list(), callback=list_farmers)
batch.execute(http=http)
Now, if for whatever reason you cannot or will not use the official Google libraries you will have to build parts of the request body yourself.
requests + email.mime
As I already mentioned, requests does not officially support multipart/mixed. But that does not mean that we cannot "force" it. When creating a Request object, we can use the files parameter to provide multipart data.
files is a dictionary that accepts 4-tuple values of this format: (filename, file_object, content_type, headers). The filename can be empty. Now we need to convert a Request object into a file(-like) object. I wrote a small method that covers the basic examples from the Google example. It is partly inspired by the internal methods that Google uses in their Python library:
import requests
from email.mime.multipart import MIMEMultipart
from email.mime.nonmultipart import MIMENonMultipart

BASE_URL = 'http://www.googleapis.com/batch'

def serialize_request(request):
    '''Returns the string representation of the request'''
    mime_body = ''
    prepared = request.prepare()
    # write first line (method + uri)
    if request.url.startswith(BASE_URL):
        mime_body = '%s %s\r\n' % (request.method, request.url[len(BASE_URL):])
    else:
        mime_body = '%s %s\r\n' % (request.method, request.url)
    # write headers (if possible)
    for key, value in prepared.headers.iteritems():
        mime_body += '%s: %s\r\n' % (key, value)
    if getattr(prepared, 'body', None) is not None:
        mime_body += '\r\n' + prepared.body + '\r\n'
    return mime_body.encode('utf-8').lstrip()
This method will transform a requests.Request object into a UTF-8 encoded string that can later be used as the payload for a MIMENonMultipart object, i.e. one of the individual multiparts.
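As a quick illustration of the shape of that output (hypothetical URL, reusing the farm example from Google's guide):
req = requests.Request('GET', BASE_URL + '/farm/v1/animals/pony')
print serialize_request(req)
# prints something like: GET /farm/v1/animals/pony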
Now in order to generate the actual batch request, we first need to squeeze a list of (Google API) requests into a files dictionary for the requests lib. The following method will take a list of requests.Request objects, transform each into a MIMENonMultipart and then return a dictionary that complies to the structure of the files dictionary:
import uuid

def prepare_requests(request_list):
    message = MIMEMultipart('mixed')
    output = {}
    # thanks, Google. (Prevents the writing of MIME headers we don't need)
    setattr(message, '_write_headers', lambda self: None)
    for request in request_list:
        message_id = new_id()
        sub_message = MIMENonMultipart('application', 'http')
        sub_message['Content-ID'] = message_id
        del sub_message['MIME-Version']
        sub_message.set_payload(serialize_request(request))
        # remove first line (from ...)
        sub_message = str(sub_message)
        sub_message = sub_message[sub_message.find('\n'):]
        output[message_id] = ('', str(sub_message), 'application/http', {})
    return output

def new_id():
    # I am not sure how these work exactly, so you will have to adapt this code
    return '<item%s:12930812#barnyard.example.com>' % str(uuid.uuid4())[-4:]
Finally, we need to change the Content-Type from multipart/form-data to multipart/mixed and also remove the Content-Disposition and Content-Type headers from each request part. These were generated by requests and cannot be overwritten via the files dictionary.
import re

def finalize_request(prepared):
    # change to multipart/mixed
    old = prepared.headers['Content-Type']
    prepared.headers['Content-Type'] = old.replace('multipart/form-data', 'multipart/mixed')
    # remove the headers at the start of each boundary
    prepared.body = re.sub(r'\r\nContent-Disposition: form-data; name=.+\r\nContent-Type: application/http\r\n', '', prepared.body)
I have tried my best to test this with the Google Example from the Batching guide:
sheep = {
    "animalName": "sheep",
    "animalAge": "5",
    "peltColor": "green"
}

commands = []
commands.append(requests.Request('GET', 'http://www.googleapis.com/batch/farm/v1/animals/pony'))
commands.append(requests.Request('PUT', 'http://www.googleapis.com/batch/farm/v1/animals/sheep', json=sheep, headers={'If-Match': '"etag/sheep"'}))
commands.append(requests.Request('GET', 'http://www.googleapis.com/batch/farm/v1/animals', headers={'If-None-Match': '"etag/animals"'}))

files = prepare_requests(commands)
r = requests.Request('POST', 'http://www.googleapis.com/batch', files=files)
prepared = r.prepare()
finalize_request(prepared)

s = requests.Session()
s.send(prepared)
And the resulting request should be close enough to what Google is providing in their Batching guide:
POST http://www.googleapis.com/batch
Content-Length: 1006
Content-Type: multipart/mixed; boundary=a21beebd15b74be89539b137bbbc7293
--a21beebd15b74be89539b137bbbc7293
Content-Type: application/http
Content-ID: <item8065:12930812#barnyard.example.com>
GET /farm/v1/animals
If-None-Match: "etag/animals"
--a21beebd15b74be89539b137bbbc7293
Content-Type: application/http
Content-ID: <item5158:12930812#barnyard.example.com>
GET /farm/v1/animals/pony
--a21beebd15b74be89539b137bbbc7293
Content-Type: application/http
Content-ID: <item0ec9:12930812#barnyard.example.com>
PUT /farm/v1/animals/sheep
Content-Length: 63
Content-Type: application/json
If-Match: "etag/sheep"
{"animalAge": "5", "animalName": "sheep", "peltColor": "green"}
--a21beebd15b74be89539b137bbbc7293--
In the end, I highly recommend the official Google library but if you cannot use it, you will have to improvise a bit :)
Disclaimer: I haven't actually tried to send this request to the Google API endpoints because the authentication procedure is too much of a hassle. I was just trying to get as close as possible to the HTTP request that is described in the Batching guide. There might be some problems with \r and \n line endings, depending on how strict the Google endpoints are.
Sources:
requests github (especially issues #935 and #1081)
requests API documentation
Google APIs for Python

Python Flask + nginx fcgi - output large response?

I'm using Python Flask + nginx with FCGI.
On some requests, I have to output large responses. Usually those responses are fetched from a socket. Currently I'm doing the response like this:
response = []
while True:
    recv = s.recv(1024)
    if not recv:
        break
    response.append(recv)
s.close()

response = ''.join(response)
return flask.make_response(response, 200, {
    'Content-type': 'binary/octet-stream',
    'Content-length': len(response),
    'Content-transfer-encoding': 'binary',
})
The problem is that I do not actually need to hold the data in memory. I also have a way to determine the exact response length to be fetched from the socket. So I need a good way to send the HTTP headers, then start outputting directly from the socket, instead of collecting everything in memory and then supplying it to nginx (probably by some sort of stream).
I was unable to find the solution to this seemingly common issue. How would that be achieved?
Thank you!
If response in flask.make_response is an iterable, it will be iterated over to produce the response, and each string is written to the output stream on its own.
What this means is that you can also return a generator which will yield the output when iterated over. If you know the content length, then you can (and should) pass it as a header.
A simple example:
import sys
import time

import flask
from flask import Flask

app = Flask(__name__)

@app.route('/')
def generated_response_example():
    n = 20
    def response_generator():
        for i in range(n):
            print >>sys.stderr, i
            yield "%03d\n" % i
            time.sleep(.2)
    print >>sys.stderr, "returning generator..."
    gen = response_generator()
    # the call to flask.make_response is not really needed as it happens
    # implicitly if you return a tuple.
    return flask.make_response(gen, "200 OK", {'Content-length': 4 * n})

if __name__ == '__main__':
    app.run()
If you run this and try it in a browser, you should see a nice incremental count...
(The content type is not set because it seems that if I do set it, my browser waits until the whole content has been streamed before rendering the page; wget -qO - localhost:5000 doesn't have this problem.)
