Decompressing (Gzip) chunks of response from http.client call - python

I have the following code that I am using to try to chunk the response from an http.client.HTTPSConnection GET request to an API (please note that the response is gzip encoded):
connection = http.client.HTTPSConnection(api, context=ssl._create_unverified_context())
connection.request('GET', api_url, headers=auth)
response = connection.getresponse()
while chunk := response.read(20):
    data = gzip.decompress(chunk)
    data = json.loads(data)
    print(data)
This always gives an error that it is not a gzipped file (b'\xe5\x9d'). Not sure how I can chunk the data and still achieve what I am trying to do here. Basically, I am chunking so that I don't have to load the entire response into memory.
Please note I can't use any other libraries like requests, urllib etc.

The most probable reason is that the response you received is indeed not a gzipped file.
I notice that in your code, you pass a variable called auth. Typically, a server won't send you a compressed response if you don't specify in the request headers that you can accept one. If there are only auth-related keys in your headers, as your variable name suggests, you won't receive a gzipped response. First, make sure you have 'Accept-Encoding': 'gzip' in your headers.
Going forward, you will face another problem:
Basically, I am chunking so that I don't have to load the entire response in memory.
gzip.decompress expects a complete file, so you would need to reconstruct it and load it entirely in memory before decompressing, which would defeat the whole point of chunking the response. Trying to decompress part of a gzip stream with gzip.decompress will most likely give you an EOFError saying something like Compressed file ended before the end-of-stream marker was reached.
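You can reproduce that failure mode in a few lines (a minimal sketch, not from the original question): compress something valid, then hand only a prefix of it to gzip.decompress.

import gzip

blob = gzip.compress(b"some payload " * 1000)

# The gzip header parses fine, but the deflate stream is cut off before
# its end-of-stream marker, so decompression fails part-way through.
try:
    gzip.decompress(blob[:20])
except EOFError as exc:
    print(exc)  # Compressed file ended before the end-of-stream marker was reached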
I don't know if you can manage that directly with the gzip library, but I know how to do it with zlib: a zlib.decompressobj keeps the decompression state between calls, so you can feed it the raw chunks as they arrive. I see you have very strong constraints on libraries, but zlib is part of the Python standard library, so hopefully you have it available. Here is a rework of your code that should help you going on:
import http.client
import ssl
import zlib

# your variables here
api = 'your_api_host'
api_url = 'your_api_endpoint'
auth = {'AuthKeys': 'auth_values'}

# add the gzip header so the server knows compressed responses are accepted
auth['Accept-Encoding'] = 'gzip'

# prepare the decompressing object; 16 + zlib.MAX_WBITS tells zlib to
# expect a gzip header and trailer rather than a raw deflate stream
decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)

connection = http.client.HTTPSConnection(api, context=ssl._create_unverified_context())
connection.request('GET', api_url, headers=auth)
response = connection.getresponse()

while chunk := response.read(20):
    data = decompressor.decompress(chunk)
    print(data)

The problem is that gzip.decompress expects a complete file; you can't just hand it a chunk, because the deflate algorithm relies on previously seen data during decompression. The whole point of the algorithm is that it can repeat something it has already seen before, so all prior data is required.
However, deflate only cares about the last 32 KiB or so. Therefore, it is possible to stream-decompress such a file without needing much memory. This is not something you need to implement yourself, though: Python provides the gzip.GzipFile class, which can wrap the file handle and behave like a normal file:
import io
import gzip

# Create a file for testing.
# In your case you can just use the response object you get.
file_uncompressed = ""
for line_index in range(10000):
    file_uncompressed += f"This is line {line_index}.\n"
file_compressed = gzip.compress(file_uncompressed.encode())
file_handle = io.BytesIO(file_compressed)

# This library does all the heavy lifting
gzip_file = gzip.GzipFile(fileobj=file_handle)
while chunk := gzip_file.read(1024):
    print(chunk)
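Applied to the question, there is no need for the BytesIO test scaffolding: http.client's response object is already file-like, so GzipFile can wrap it directly. A minimal sketch, reusing the question's placeholder names (api, api_url, auth):

import gzip
import http.client
import ssl

# Placeholders from the question; fill in your own host, path and headers.
connection = http.client.HTTPSConnection(api, context=ssl._create_unverified_context())
connection.request('GET', api_url, headers=auth)
response = connection.getresponse()

# GzipFile decompresses lazily as you read, so only one chunk at a time
# is held in memory.
gzip_file = gzip.GzipFile(fileobj=response)
while chunk := gzip_file.read(1024):
    print(chunk)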

Related

Why is it necessary to use a stream to download an image through HTTP GET?

Here is a body of code that works, taken from: https://stackoverflow.com/a/18043472
It uses the requests module in python to download an image.
import requests, shutil

url = 'http://example.com/img.png'
response = requests.get(url, stream=True)
with open('img.png', 'wb') as out_file:
    shutil.copyfileobj(response.raw, out_file)
del response
Two questions I've been thinking about:
1) Why is it necessary to set stream=True? (I've tested it without that parameter and the image is blank) Conceptually, I don't understand what a streaming GET request is.
2) What's the difference between a raw response and a response? (Why is shutil.copyfileobj necessary, why can't I just directly write to file?)
Thanks!
Quote from the documentation:
If you set stream to True when making a request, Requests cannot release the connection back to the pool unless you consume all the data or call Response.close.
More info here.
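To address the two questions more directly: without stream=True, requests downloads the whole body immediately and exhausts the underlying response.raw stream, which is why the copied file comes out blank; response.raw is the undecoded socket-level file object, while response.content is the fully downloaded body. A streaming sketch using iter_content, as an alternative to copyfileobj (not from the original answer):

import requests

url = 'http://example.com/img.png'  # placeholder from the question

# stream=True defers the body download; iter_content then yields it in
# chunks, so large files never have to fit in memory all at once.
response = requests.get(url, stream=True)
with open('img.png', 'wb') as out_file:
    for chunk in response.iter_content(chunk_size=8192):
        out_file.write(chunk)
# Closing releases the connection back to the pool, per the quote above.
response.close()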

Compressing request body with python-requests?

(This question is not about transparent decompression of gzip-encoded responses from a web server; I know that requests handles that automatically.)
Problem
I'm trying to POST a file to a RESTful web service. Obviously, requests makes this pretty easy to do:
files = dict(data=(fn, file))
response = session.post(endpoint_url, files=files)
In this case, my file is in a really highly-compressible format (yep, XML) so I'd like to make sure that the request body is compressed.
The server claims to accept gzip encoding (Accept-Encoding: gzip in response headers), so I should be able to gzip the whole request body, right?
Attempted solution
Here's my attempt to make this work: I first construct the request and prepare it, then I go into the PreparedRequest object, yank out the body, run it through gzip, and put it back. (Oh, and don't forget to update the Content-Length and Content-Encoding headers.)
files = dict(data=(fn, file))
request = requests.Request('POST', endpoint_url, files=files)
prepped = session.prepare_request(request)
with NamedTemporaryFile(delete=True) as gzfile:
    gzip.GzipFile(fileobj=gzfile, mode="wb").write(prepped.body)
    prepped.headers['Content-Length'] = gzfile.tell()
    prepped.headers['Content-Encoding'] = 'gzip'
    gzfile.seek(0, 0)
    prepped.body = gzfile.read()
    response = session.send(prepped)
Unfortunately, the server is not cooperating and returns 500 Internal Server Error. Perhaps it doesn't really accept gzip-encoded requests?
Or perhaps there is a mistake in my approach? It seems rather convoluted. Is there an easier way to do request body compression with python-requests?
EDIT: Fixed (3) and (5) from @sigmavirus24's answer (these were basically just artifacts I'd overlooked in simplifying the code to post it here).
Or perhaps there is a mistake in my approach?
I'm unsure how you arrived at your approach, frankly, but there's certainly a simpler way of doing this.
First, a few things:
1) The files parameter constructs a multipart/form-data body. So you're compressing something that the server potentially has no clue about.
2) Content-Encoding and Transfer-Encoding are two very different things. You want Transfer-Encoding here.
3) You don't need to set a suffix on your NamedTemporaryFile.
4) Since you didn't explicitly mention that you're trying to compress a multipart/form-data request, I'm going to assume that you don't actually want to do that.
5) Your call to session.Request (which I assume should be requests.Request) is missing a method, i.e., it should be: requests.Request('POST', endpoint_url, ...)
With those out of the way, here's how I would do this:
# Assuming `file` is a file-like obj
with NamedTemporaryFile(delete=True) as gzfile:
    gzip.GzipFile(fileobj=gzfile, mode="wb").write(file.read())
    headers = {'Content-Length': str(gzfile.tell()),
               'Transfer-Encoding': 'gzip'}
    gzfile.seek(0, 0)
    response = session.post(endpoint_url, data=gzfile,
                            headers=headers)
Assuming that file has the XML content in it and all you meant was to compress it, this should work for you. You probably want to set a Content-Type header, though; for example, you'd just do

headers = {'Content-Length': str(gzfile.tell()),
           'Content-Type': 'application/xml',  # or 'text/xml'
           'Transfer-Encoding': 'gzip'}
The Transfer-Encoding tells the server that the request is being compressed only in transit and it should uncompress it. The Content-Type tells the server how to handle the content once the Transfer-Encoding has been handled. 
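If the server turns out not to handle Transfer-Encoding: gzip, a simpler in-memory variant using Content-Encoding is worth trying (a sketch, assuming the whole file fits in RAM and the server understands compressed request bodies; endpoint_url and the file name are placeholders):

import gzip
import requests

session = requests.Session()
endpoint_url = 'https://example.com/upload'  # placeholder

# Compress the entire body up front; fine for bodies that fit in memory.
with open('payload.xml', 'rb') as f:
    body = gzip.compress(f.read())

headers = {
    'Content-Type': 'application/xml',
    'Content-Encoding': 'gzip',  # some servers expect this instead of Transfer-Encoding
}
response = session.post(endpoint_url, data=body, headers=headers)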
I had a question that was marked as an exact duplicate. I was concerned with both ends of the transaction. The code from sigmavirus24 wasn't a direct cut-and-paste fix, but it was the inspiration for this version.
Here's how my solution ended up looking:
sending from the python end
import json
import requests
import StringIO
import gzip

url = "http://localhost:3000"
headers = {"Content-Type": "application/octet-stream"}
data = [{"key": 1, "otherKey": "2"},
        {"key": 3, "otherKey": "4"}]
payload = json.dumps(data)

out = StringIO.StringIO()
with gzip.GzipFile(fileobj=out, mode="w") as f:
    f.write(payload)

r = requests.post(url + "/zipped", data=out.getvalue(), headers=headers)
receiving at the express end
var zlib = require("zlib");

var rawParser = bodyParser.raw({type: '*/*'});

app.post('/zipped', rawParser, function(req, res) {
    zlib.gunzip(req.body, function(err, buf) {
        if (err) {
            console.log("err:", err);
        } else {
            console.log("in the inflate callback:", buf,
                        "to string:", buf.toString("utf8"));
        }
    });
    res.status(200).send("I'm in ur zipped route");
});
There's a gist here with more verbose logging included. This version doesn't have any safety or checking built in either.

Make an http POST request to upload a file using Python urllib/urllib2

I would like to make a POST request to upload a file to a web service (and get response) using Python. For example, I can do the following POST request with curl:
curl -F "file=@style.css" -F output=json http://jigsaw.w3.org/css-validator/validator
How can I make the same request with python urllib/urllib2? The closest I got so far is the following:
with open("style.css", 'r') as f:
    content = f.read()
post_data = {"file": content, "output": "json"}
request = urllib2.Request("http://jigsaw.w3.org/css-validator/validator",
                          data=urllib.urlencode(post_data))
response = urllib2.urlopen(request)
I got an HTTP Error 500 from the code above. But since my curl command succeeds, there must be something wrong with my Python request?
I am quite new to this topic and my question may have very simple answers or mistakes.
Personally I think you should consider the requests library to post files.
import requests

url = 'http://jigsaw.w3.org/css-validator/validator'
files = {'file': open('style.css')}
response = requests.post(url, files=files)
Uploading files using urllib2 is not impossible but quite a complicated task: http://pymotw.com/2/urllib2/#uploading-files
After some digging around, it seems this post solved my problem. It turns out I need to have the multipart encoder setup properly.
from poster.encode import multipart_encode
from poster.streaminghttp import register_openers
import urllib2

register_openers()

with open("style.css", 'r') as f:
    datagen, headers = multipart_encode({"file": f})
    request = urllib2.Request("http://jigsaw.w3.org/css-validator/validator",
                              datagen, headers)
    response = urllib2.urlopen(request)
Well, there are multiple ways to do it. As mentioned above, you can send the file in "multipart/form-data". However, the target service may not be expecting this type, in which case you may try some more approaches.
Pass the file object
urllib2 can accept a file object as data. When you pass this type, the library reads the file as a binary stream and sends it out. However, it will not set the proper Content-Type header. Moreover, if the Content-Length header is missing, it will try to call len() on the object, which file objects don't support. This means you must provide both the Content-Type and the Content-Length headers yourself to make this method work:
import os
import urllib2

filename = '/var/tmp/myfile.zip'
headers = {
    'Content-Type': 'application/zip',
    'Content-Length': os.stat(filename).st_size,
}
request = urllib2.Request('http://localhost', open(filename, 'rb'),
                          headers=headers)
response = urllib2.urlopen(request)
Wrap the file object
To avoid dealing with the length yourself, you can create a simple wrapper object that supports len(). With just a little change you can adapt it to serve the content from a string if you have the file loaded in memory.
import os

class BinaryFileObject:
    """Simple wrapper for a binary file for urllib2."""

    def __init__(self, filename):
        self.__size = int(os.stat(filename).st_size)
        self.__f = open(filename, 'rb')

    def read(self, blocksize):
        return self.__f.read(blocksize)

    def __len__(self):
        return self.__size
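A hypothetical usage of the wrapper (not part of the original answer); urllib2 pulls the body through read() and takes the Content-Length from len():

import urllib2

# BinaryFileObject as defined above; the path and URL are placeholders.
body = BinaryFileObject('/var/tmp/myfile.zip')
headers = {'Content-Type': 'application/zip'}

request = urllib2.Request('http://localhost', body, headers=headers)
response = urllib2.urlopen(request)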
Encode the content as base64
Another way is encoding the data via base64.b64encode and providing a Content-Transfer-Encoding: base64 header. However, this method requires support on the server side. Depending on the implementation, the service can either accept the file and store it incorrectly, or return HTTP 400. E.g. the GitHub API won't throw an error, but the uploaded file will be corrupted.
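A sketch of that approach under the same placeholder path and URL as above (hypothetical; as noted, the server must actually decode base64 for this to round-trip):

import base64
import urllib2

filename = '/var/tmp/myfile.zip'
with open(filename, 'rb') as f:
    encoded = base64.b64encode(f.read())

headers = {
    'Content-Type': 'application/zip',
    'Content-Transfer-Encoding': 'base64',
}
request = urllib2.Request('http://localhost', encoded, headers=headers)
response = urllib2.urlopen(request)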

Is it possible to use gzip compression with Server-Sent Events (SSE)?

I would like to know if it is possible to enable gzip compression for Server-Sent Events (SSE; Content-Type: text/event-stream).
It seems it is possible, according to this book:
http://chimera.labs.oreilly.com/books/1230000000545/ch16.html
But I can't find any example of SSE with gzip compression. I tried to send gzipped messages with the response header field Content-Encoding set to "gzip", without success.
For experimenting around SSE, I am testing a small web application made in Python with the bottle framework + gevent; I am just running the bottle WSGI server:
@bottle.get('/data_stream')
def stream_data():
    bottle.response.content_type = "text/event-stream"
    bottle.response.add_header("Connection", "keep-alive")
    bottle.response.add_header("Cache-Control", "no-cache")
    bottle.response.add_header("Content-Encoding", "gzip")
    while True:
        # new_data is a gevent AsyncResult object,
        # .get() just returns a data string when new
        # data is available
        data = new_data.get()
        yield zlib.compress("data: %s\n\n" % data)
        #yield "data: %s\n\n" % data
The code without compression (last line, commented out) and without the gzip Content-Encoding header field works like a charm.
EDIT: thanks to the reply and to this other question: Python: Creating a streaming gzip'd file-like?, I managed to solve the problem:
@bottle.route("/stream")
def stream_data():
    compressed_stream = zlib.compressobj()
    bottle.response.content_type = "text/event-stream"
    bottle.response.add_header("Connection", "keep-alive")
    bottle.response.add_header("Cache-Control", "no-cache, must-revalidate")
    bottle.response.add_header("Content-Encoding", "deflate")
    bottle.response.add_header("Transfer-Encoding", "chunked")
    while True:
        data = new_data.get()
        yield compressed_stream.compress("data: %s\n\n" % data)
        yield compressed_stream.flush(zlib.Z_SYNC_FLUSH)
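The flush(zlib.Z_SYNC_FLUSH) call is what makes this work for streaming: it forces the compressor to emit all pending output after each message, so events reach the client immediately instead of sitting in the compression buffer.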
TL;DR: If the requests are not cached, you likely want to use zlib and declare Content-Encoding to be 'deflate'. That change alone should make your code work.
If you declare Content-Encoding to be gzip, you need to actually use gzip. They are based on the same compression algorithm, but gzip has some extra framing. This works, for example:
import gzip
import StringIO
from bottle import response, route

@route('/')
def get_data():
    response.add_header("Content-Encoding", "gzip")
    s = StringIO.StringIO()
    with gzip.GzipFile(fileobj=s, mode='w') as f:
        f.write('Hello World')
    return s.getvalue()
That only really makes sense if you use an actual file as a cache, though.
There's also middleware you can use so you don't need to worry about gzipping responses for each of your methods. Here's one I used recently.
https://code.google.com/p/ibkon-wsgi-gzip-middleware/
This is how I used it (I'm using bottle.py with the gevent server)
from gzip_middleware import Gzipper
import bottle
app = Gzipper(bottle.app())
run(app = app, host='0.0.0.0', port=8080, server='gevent')
For this particular library, you can set which types of responses you want to compress by modifying the DEFAULT_COMPRESSABLES variable, for example:
DEFAULT_COMPRESSABLES = set(['text/plain', 'text/html', 'text/css',
                             'application/json', 'application/x-javascript',
                             'text/xml', 'application/xml',
                             'application/xml+rss', 'text/javascript',
                             'image/gif'])
All responses go through the middleware and get gzipped without modifying your existing code. By default, it compresses responses whose content-type belongs to DEFAULT_COMPRESSABLES and whose content-length is greater than 200 characters.

requests: disable auto decoding

Can you disable the auto decoding feature in requests version 1.2.3?
I've looked through the documentation and couldn't find anything. I'm currently experiencing a gzip decode error and want to manually debug the data coming through the request.
You can access the raw response like this:
resp = requests.get(url, stream=True)
resp.raw.read()
In order to use raw you need to set stream=True for the original request. Also, raw is a file-like object, and reading from response.content will affect the seek cursor. In other words: If you already (tried to) read response.content, response.raw.read() will return an empty string.
See FAQ: Encoded Data and Quickstart: raw response content in the requests documentation.
import requests

r = requests.get(url, stream=True)
with open(local_filename, 'wb') as f:
    for chunk in r.raw.stream(1024, decode_content=False):
        if chunk:
            f.write(chunk)
This way, you avoid automatic decompression of the gzip-encoded response and still write it to the file chunk by chunk (useful for downloading big files).
