requests: disable auto decoding - python

Can you disable the auto decoding feature in requests version 1.2.3?
I've looked through the documentation and couldn't find anything. I'm currently experiencing a gzip decode error and want to manually debug the data coming through the request.

You can access the raw response like this:
resp = requests.get(url, stream=True)
resp.raw.read()
In order to use raw you need to set stream=True on the original request. Also, raw is a file-like object, and accessing response.content consumes the underlying stream. In other words: if you have already (tried to) read response.content, response.raw.read() will return an empty string.
See FAQ: Encoded Data and Quickstart: raw response content in the requests documentation.
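
If the goal is to debug the gzip error, one way (a minimal sketch; the URL is a placeholder) is to grab the undecoded bytes, check the Content-Encoding header, and attempt the decompression yourself:

import gzip
import zlib
import requests

resp = requests.get("https://example.com/data", stream=True)  # placeholder URL
print(resp.headers.get("Content-Encoding"))                   # e.g. 'gzip', or None

raw_bytes = resp.raw.read(decode_content=False)               # bytes exactly as sent on the wire
print(raw_bytes[:10])                                         # a real gzip stream starts with b'\x1f\x8b'

try:
    body = gzip.decompress(raw_bytes)
except (OSError, EOFError, zlib.error) as exc:
    print("decompression failed:", exc)                       # inspect raw_bytes manually from here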

import requests

r = requests.get(url, stream=True)
with open(local_filename, 'wb') as f:
    for chunk in r.raw.stream(1024, decode_content=False):
        if chunk:
            f.write(chunk)
This way you avoid the automatic decompression of a gzip-encoded response and still write it to a file chunk by chunk (useful for downloading big files).
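
As a follow-up (a small sketch, assuming the file was written to local_filename as above and the server actually sent gzip; process is a placeholder for your own handling), the saved file can later be decompressed lazily with gzip.open rather than loading it all at once:

import gzip

with gzip.open(local_filename, 'rb') as gz:   # the compressed file written above
    while chunk := gz.read(64 * 1024):
        process(chunk)                        # placeholder for your own handling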

Related

Decompressing (Gzip) chunks of response from http.client call

I have the following code that I am using to try to chunk the response from an http.client.HTTPSConnection GET request to an API (please note that the response is gzip-encoded):
connection = http.client.HTTPSConnection(api, context=ssl._create_unverified_context())
connection.request('GET', api_url, headers=auth)
response = connection.getresponse()

while chunk := response.read(20):
    data = gzip.decompress(chunk)
    data = json.loads(data)
    print(data)
This always gives out an error that it is not a gzipped file (b'\xe5\x9d'). Not sure how I can chunk data and still achieve what I am trying to do here. Basically, I am chunking so that I don't have to load the entire response in memory.
Please note I can't use any other libraries like requests, urllib etc.
The most probable reason is that the response you received is indeed not a gzipped file.
I notice that in your code you pass a variable called auth. Typically, a server won't send you a compressed response if you don't specify in the request headers that you can accept one. If there are only auth-related keys in your headers, as your variable name suggests, you won't receive a gzipped response. First, make sure you have 'Accept-Encoding': 'gzip' in your headers.
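For example, you can confirm what the server actually sent by checking the Content-Encoding response header before trying to decompress anything (a minimal sketch with placeholder host, endpoint and header values):

import http.client
import ssl

auth = {'Authorization': 'Bearer <token>',     # placeholder auth header
        'Accept-Encoding': 'gzip'}             # tell the server compression is OK

connection = http.client.HTTPSConnection('api.example.com',
                                          context=ssl._create_unverified_context())
connection.request('GET', '/endpoint', headers=auth)
response = connection.getresponse()

# If this prints anything other than 'gzip', gzip decompression is the wrong move
print(response.getheader('Content-Encoding'))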
Going forward, you will face another problem:
Basically, I am chunking so that I don't have to load the entire response in memory.
gzip.decompress expects a complete file, so you would need to reconstruct it and load it entirely in memory first, which would undermine the whole point of chunking the response. Trying to decompress part of a gzip stream with gzip.decompress will most likely give you an EOFError saying something like Compressed file ended before the end-of-stream marker was reached.
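To make that concrete, here is a small self-contained sketch (not tied to the API above) showing what happens when only part of a gzip stream is available:

import gzip

payload = b'{"key": "value"}' * 1000
compressed = gzip.compress(payload)

gzip.decompress(compressed)              # works: the complete stream is available
try:
    gzip.decompress(compressed[:20])     # only the first 20 bytes, like one chunk
except EOFError as exc:
    print(exc)                           # "Compressed file ended before the end-of-stream marker was reached"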
I don't know if you can manage that directly with the gzip library, but I know how to do it with zlib. You will also need to convert your chunk to a file-like object, which you can do with io.BytesIO. I see you have very strong constraints on libraries, but zlib and io are part of the Python standard library, so hopefully you have them available. Here is a rework of your code that should help you move forward:
import http.client
import ssl
import zlib
from io import BytesIO

# your variables here
api = 'your_api_host'
api_url = 'your_api_endpoint'
auth = {'AuthKeys': 'auth_values'}

# add the gzip header
auth['Accept-Encoding'] = 'gzip'

# prepare the decompressing object
decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)

connection = http.client.HTTPSConnection(api, context=ssl._create_unverified_context())
connection.request('GET', api_url, headers=auth)
response = connection.getresponse()

while chunk := response.read(20):
    data = decompressor.decompress(BytesIO(chunk).read())
    print(data)
The problem is that gzip.decompress expects a complete file; you can't just give it a chunk, because the deflate algorithm relies on previously seen data during decompression. The whole point of the algorithm is that it can repeat something it has already seen, so all of the preceding data is required.
However, deflate only cares about the last 32 KiB or so, which makes it possible to stream-decompress such a file without needing much memory. This is not something you need to implement yourself, though: Python provides the gzip.GzipFile class, which can wrap the file handle and behaves like a normal file:
import io
import gzip

# Create a file for testing.
# In your case you can just use the response object you get.
file_uncompressed = ""
for line_index in range(10000):
    file_uncompressed += f"This is line {line_index}.\n"
file_compressed = gzip.compress(file_uncompressed.encode())
file_handle = io.BytesIO(file_compressed)

# This library does all the heavy lifting
gzip_file = gzip.GzipFile(fileobj=file_handle)
while chunk := gzip_file.read(1024):
    print(chunk)
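
Applied to the question's setup, the response object itself can serve as the file handle (a sketch assuming connection, api_url and auth are defined as in the earlier answer and that the server really returns gzip):

import gzip
import json

connection.request('GET', api_url, headers=auth)
response = connection.getresponse()

# http.client's response object is already file-like, so GzipFile can wrap it directly
gzip_file = gzip.GzipFile(fileobj=response)

decompressed = bytearray()
while chunk := gzip_file.read(1024):
    decompressed.extend(chunk)             # or process each chunk as it arrives

data = json.loads(bytes(decompressed))     # JSON still needs the whole document to parse
print(data)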

Download a file directly to filesystem in Python

Is there a way to stream a file directly to the filesystem? Even if the connection is lost, I want to see everything downloaded so far in that specific file on the filesystem, just like with wget or curl.
However, using requests the straightforward way means the response content is downloaded first and only then written to the filesystem:
with open(file_name, "wb") as file:
response = get(url) # may take some time
file.write(response.content)
Problem: while the file is "downloading", it is stored elsewhere (I guess in memory or in a temporary place in the filesystem). That means I have a 0-byte file for as long as the request is not (successfully) finished.
Can I solve this problem without using a third-party lib?
Streaming directly to a file can be achieved with requests and stream=True; see the requests documentation for more useful examples:
import shutil
import requests

with open(file_name, 'wb') as f:
    with requests.get(url, stream=True) as r:
        shutil.copyfileobj(r.raw, f)
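
An alternative sketch (assuming the same url and file_name) uses iter_content, which writes chunk by chunk while letting requests handle any gzip/deflate content encoding for you:

import requests

with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open(file_name, 'wb') as f:
        for chunk in r.iter_content(chunk_size=64 * 1024):
            f.write(chunk)   # each chunk lands on disk as soon as it arrives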

Why is it necessary to use a stream to download an image through HTTP GET?

Here is a body of code that works, taken from: https://stackoverflow.com/a/18043472
It uses the requests module in python to download an image.
import requests, shutil

url = 'http://example.com/img.png'
response = requests.get(url, stream=True)
with open('img.png', 'wb') as out_file:
    shutil.copyfileobj(response.raw, out_file)
del response
Two questions I've been thinking about:
1) Why is it necessary to set stream=True? (I've tested it without that parameter and the image is blank) Conceptually, I don't understand what a streaming GET request is.
2) What's the difference between a raw response and a response? (Why is shutil.copyfileobj necessary, why can't I just directly write to file?)
Thanks!
Quote from documentation:
If you set stream to True when making a request, Requests cannot release the connection back to the pool unless you consume all the data or call Response.close.
More info here.
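
To tie the quote back to the two questions, here is a sketch (same placeholder img.png URL): without stream=True the body is read into response.content up front, so response.raw is already exhausted and copying it yields a blank file; with stream=True you consume the body yourself, ideally inside a context manager so the connection is released:

import requests

url = 'http://example.com/img.png'

# Without stream=True the decoded body is already in memory as response.content
response = requests.get(url)
with open('img1.png', 'wb') as out_file:
    out_file.write(response.content)

# With stream=True you pull the body yourself; the context manager guarantees
# the connection goes back to the pool even if you stop early
with requests.get(url, stream=True) as response:
    with open('img2.png', 'wb') as out_file:
        for chunk in response.iter_content(chunk_size=8192):
            out_file.write(chunk)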

How to download a file with urllib3?

This is based on another question on this site: What's the best way to download file using urllib3
However, I cannot comment there so I ask another question:
How to download a (larger) file with urllib3?
I tried to use the same code that works with urllib2 (Download file from web in Python 3), but it fails with urllib3:
http = urllib3.PoolManager()

with http.request('GET', url) as r, open(path, 'wb') as out_file:
    # shutil.copyfileobj(r.data, out_file)  # this writes a zero file
    shutil.copyfileobj(r.data, out_file)
This says that 'bytes' object has no attribute 'read'
I then tried to use the code in that question but it gets stuck in an infinite loop because data is always '0':
http = urllib3.PoolManager()
r = http.request('GET', url)
with open(path, 'wb') as out:
    while True:
        data = r.read(4096)
        if data is None:
            break
        out.write(data)
r.release_conn()
However, if I read everything in memory, the file gets downloaded correctly:
http = urllib3.PoolManager()
r = http.request('GET', url)
with open(path, 'wb') as out:
    out.write(r.data)
I do not want to do this, as I might potentially download very large files.
It is unfortunate that the urllib documentation does not cover the best practice in this topic.
(Also, please do not suggest requests or urllib2, because they are not flexible enough when it comes to self-signed certificates.)
You were very close; the piece that was missing is setting preload_content=False (this will be the default in an upcoming version). Also, you can treat the response as a file-like object rather than using the .data attribute (which is a magic property that will hopefully be deprecated someday).
- with http.request('GET', url) ...
+ with http.request('GET', url, preload_content=False) ...
This code should work:
http = urllib3.PoolManager()

with http.request('GET', url, preload_content=False) as r, open(path, 'wb') as out_file:
    shutil.copyfileobj(r, out_file)
urllib3's response object also respects the io interface, so you can also do things like...
import io
response = http.request(..., preload_content=False)
buffered_response = io.BufferedReader(response, 2048)
As long as you add preload_content=False to any of your three attempts and treat the response as a file-like object, they should all work.
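For instance, the chunked-read attempt from the question works once preload_content=False is added and the loop stops on an empty bytes object instead of None (a sketch assuming the same url and path):

import urllib3

http = urllib3.PoolManager()
r = http.request('GET', url, preload_content=False)
with open(path, 'wb') as out:
    while True:
        data = r.read(4096)
        if not data:          # read() returns b'' at the end of the stream, not None
            break
        out.write(data)
r.release_conn()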
It is unfortunate that the urllib documentation does not cover the best practice in this topic.
You're totally right, I hope you'll consider helping us document this use case by sending a pull request here: https://github.com/shazow/urllib3

Make an http POST request to upload a file using Python urllib/urllib2

I would like to make a POST request to upload a file to a web service (and get response) using Python. For example, I can do the following POST request with curl:
curl -F "file=@style.css" -F output=json http://jigsaw.w3.org/css-validator/validator
How can I make the same request with python urllib/urllib2? The closest I got so far is the following:
with open("style.css", 'r') as f:
content = f.read()
post_data = {"file": content, "output": "json"}
request = urllib2.Request("http://jigsaw.w3.org/css-validator/validator", \
data=urllib.urlencode(post_data))
response = urllib2.urlopen(request)
I got an HTTP Error 500 from the code above. But since my curl command succeeds, something must be wrong with my Python request?
I am quite new to this topic and my question may have very simple answers or mistakes.
Personally, I think you should consider using the requests library to post files.

import requests

url = 'http://jigsaw.w3.org/css-validator/validator'
files = {'file': open('style.css', 'rb')}
response = requests.post(url, files=files)
Uploading files using urllib2 is not impossible but quite a complicated task: http://pymotw.com/2/urllib2/#uploading-files
After some digging around, it seems this post solved my problem. It turns out I need to set up the multipart encoder properly.
from poster.encode import multipart_encode
from poster.streaminghttp import register_openers
import urllib2

register_openers()

with open("style.css", 'r') as f:
    datagen, headers = multipart_encode({"file": f})
    request = urllib2.Request("http://jigsaw.w3.org/css-validator/validator",
                              datagen, headers)
    response = urllib2.urlopen(request)
Well, there are multiple ways to do it. As mentioned above, you can send the file as "multipart/form-data". However, the target service may not be expecting that type, in which case you can try a few more approaches.
Pass the file object
urllib2 can accept a file object as data. When you pass this type, the library reads the file as a binary stream and sends it out. However, it will not set the proper Content-Type header. Moreover, if the Content-Length header is missing, it will try to take the length of the object, which is not defined for file objects. So you must provide both the Content-Type and the Content-Length headers yourself to make this method work:
import os
import urllib2

filename = '/var/tmp/myfile.zip'

headers = {
    'Content-Type': 'application/zip',
    'Content-Length': os.stat(filename).st_size,
}
request = urllib2.Request('http://localhost', open(filename, 'rb'),
                          headers=headers)
response = urllib2.urlopen(request)
Wrap the file object
To avoid dealing with the length, you can create a simple wrapper object. With just a little change, you can adapt it to serve the content from a string if you have the file loaded in memory.
class BinaryFileObject:
    """Simple wrapper around a binary file for urllib2."""

    def __init__(self, filename):
        self.__size = int(os.stat(filename).st_size)
        self.__f = open(filename, 'rb')

    def read(self, blocksize):
        return self.__f.read(blocksize)

    def __len__(self):
        return self.__size
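
A usage sketch under the same Python 2 urllib2 assumptions as above (the wrapper supplies the length via __len__, so only Content-Type has to be set by hand):

import urllib2

body = BinaryFileObject('/var/tmp/myfile.zip')   # same path as in the first example
request = urllib2.Request('http://localhost', body,
                          headers={'Content-Type': 'application/zip'})
# Content-Length is derived from len(body), courtesy of the wrapper's __len__
response = urllib2.urlopen(request)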
Encode the content as base64
Another way is to encode the data via base64.b64encode and provide a Content-Transfer-Encoding: base64 header. However, this method requires support on the server side. Depending on the implementation, the service can either accept the file and store it incorrectly, or return HTTP 400. E.g. the GitHub API won't throw an error, but the uploaded file will be corrupted.
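A sketch of that approach, again with the placeholder localhost endpoint (whether the server decodes the base64 body is entirely up to the service):

import base64
import urllib2

with open('/var/tmp/myfile.zip', 'rb') as f:
    encoded = base64.b64encode(f.read())   # note: the whole file is read into memory here

request = urllib2.Request('http://localhost', encoded, headers={
    'Content-Type': 'application/zip',
    'Content-Transfer-Encoding': 'base64',
})
response = urllib2.urlopen(request)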
