How to download a file with urllib3?

This is based on another question on this site: What's the best way to download file using urllib3
However, I cannot comment there so I ask another question:
How to download a (larger) file with urllib3?
I tried to use the same code that works with urllib2 (Download file from web in Python 3), but it fails with urllib3:
http = urllib3.PoolManager()
with http.request('GET', url) as r, open(path, 'wb') as out_file:
    #shutil.copyfileobj(r.data, out_file) # this writes a zero file
    shutil.copyfileobj(r.data, out_file)
This says that 'bytes' object has no attribute 'read'
I then tried to use the code from that question, but it gets stuck in an infinite loop because data is always empty:
http = urllib3.PoolManager()
r = http.request('GET', url)
with open(path, 'wb') as out:
    while True:
        data = r.read(4096)
        if data is None:
            break
        out.write(data)
r.release_conn()
However, if I read everything into memory, the file gets downloaded correctly:
http = urllib3.PoolManager()
r = http.request('GET', url)
with open(path, 'wb') as out:
    out.write(r.data)
I do not want to do this, as I might potentially download very large files.
It is unfortunate that the urllib3 documentation does not cover the best practice on this topic.
(Also, please do not suggest requests or urllib2, because they are not flexible enough when it comes to self-signed certificates.)

You were very close; the piece that was missing is setting preload_content=False (this will be the default in an upcoming version). You can also treat the response as a file-like object, rather than using the .data attribute (which is a magic property that will hopefully be deprecated someday).
- with http.request('GET', url) ...
+ with http.request('GET', url, preload_content=False) ...
This code should work:
http = urllib3.PoolManager()
with http.request('GET', url, preload_content=False) as r, open(path, 'wb') as out_file:
    shutil.copyfileobj(r, out_file)
urllib3's response object also respects the io interface, so you can also do things like...
import io
response = http.request(..., preload_content=False)
buffered_response = io.BufferedReader(response, 2048)
As long as you add preload_content=False to any of your three attempts and treat the response as a file-like object, they should all work.
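For example, a minimal sketch of the chunked-read attempt with preload_content=False (note the end-of-stream check: read() returns an empty byte string, not None, once the stream is exhausted):
http = urllib3.PoolManager()
r = http.request('GET', url, preload_content=False)
with open(path, 'wb') as out:
    while True:
        data = r.read(4096)
        if not data:  # end of stream: read() returns b'', not None
            break
        out.write(data)
r.release_conn()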
It is unfortunate that the urllib3 documentation does not cover the best practice on this topic.
You're totally right, I hope you'll consider helping us document this use case by sending a pull request here: https://github.com/shazow/urllib3

Related

Python: open a video file from an url the same way open("filepath", "rb") does for local files

I want to upload short videos through an API connection (which one is not relevant for the question). The videos that will be uploaded are already on a server (publicly accessible, so that is not the issue) with a direct link (e.g. 'https://nameofcompany.com/uploads/videoname.mp4').
I am using the Requests library, so the post request looks like this:
requests.post(url, files={'file': OBJECT_GOES_HERE}, headers=headers)
The object should be a 'bytes-like object', so with a local file we can do:
requests.post(url, files={'file': open('localfile.mp4', 'rb')}, headers=headers)
I tested this with a local file and it works. However, as mentioned, I need to upload it from the link, so how do I do that? Is there some method (or some library with a method) that returns the same kind of object that open() does for local files? If not, how could I create one myself?
import requests
from io import BytesIO
url = 'https://nameofcompany.com/uploads/videoname.mp4'
r = requests.get(url)
video = r.content
# This is probably enough:
requests.post(url2, files={'file': video}, headers=headers)
# But if not, here's an example of using BytesIO to treat the bytes as a file:
requests.post(url2, files={'file': BytesIO(video)}, headers=headers)
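If the endpoint also expects a filename or content type, requests accepts a (filename, file_object, content_type) tuple for each entry in files. A hedged sketch reusing video, url2 and headers from above (the 'video/mp4' content type is an assumption):
requests.post(url2,
              files={'file': ('videoname.mp4', BytesIO(video), 'video/mp4')},
              headers=headers)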

Download a file directly to filesystem in Python

Is there a way to stream a file directly to the filesystem? Even if the connection is lost, I want whatever has been received so far to be in that file on disk, just like wget or curl do.
However, using requests leads to the issue of first downloading the whole response into memory and only then writing it to the filesystem.
from requests import get
with open(file_name, "wb") as file:
    response = get(url)  # may take some time
    file.write(response.content)
Problem: while the file is "downloading", it is stored elsewhere (I guess in memory or in a temporary place in the filesystem). That means I have a 0 byte file as long as the request has not (successfully) finished.
Can I solve this problem without using a third party lib?
Streaming directly to a file can be achieved with requests and stream=True; see the requests documentation for more examples:
import requests, shutil
with open(file_name, 'wb') as f:
    with requests.get(url, stream=True) as r:
        shutil.copyfileobj(r.raw, f)
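If you would rather not use shutil, a roughly equivalent sketch writes the body chunk by chunk with requests' own iter_content (the 8192-byte chunk size is an arbitrary choice):
import requests
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open(file_name, 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)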

Why is it necessary to use a stream to download an image through HTTP GET?

Here is a body of code that works, taken from: https://stackoverflow.com/a/18043472
It uses the requests module in python to download an image.
import requests, shutil
url = 'http://example.com/img.png'
response = requests.get(url, stream=True)
with open('img.png', 'wb') as out_file:
    shutil.copyfileobj(response.raw, out_file)
del response
Two questions I've been thinking about:
1) Why is it necessary to set stream=True? (I've tested it without that parameter and the image is blank) Conceptually, I don't understand what a streaming GET request is.
2) What's the difference between a raw response and a response? (Why is shutil.copyfileobj necessary, why can't I just directly write to file?)
Thanks!
Quote from documentation:
If you set stream to True when making a request, Requests cannot release the connection back to the pool unless you consume all the data or call Response.close.
More info here.
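Building on that quote, one way to make sure the connection is always released is to use the streamed response as a context manager (supported by recent versions of requests); a minimal sketch of the same image download:
import requests, shutil
url = 'http://example.com/img.png'
with requests.get(url, stream=True) as response:
    with open('img.png', 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)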

Make an http POST request to upload a file using Python urllib/urllib2

I would like to make a POST request to upload a file to a web service (and get response) using Python. For example, I can do the following POST request with curl:
curl -F "file=@style.css" -F output=json http://jigsaw.w3.org/css-validator/validator
How can I make the same request with python urllib/urllib2? The closest I got so far is the following:
with open("style.css", 'r') as f:
content = f.read()
post_data = {"file": content, "output": "json"}
request = urllib2.Request("http://jigsaw.w3.org/css-validator/validator", \
data=urllib.urlencode(post_data))
response = urllib2.urlopen(request)
I get an HTTP Error 500 from the code above. But since my curl command succeeds, something must be wrong with my Python request?
I am quite new to this topic and my question may have very simple answers or mistakes.
Personally, I think you should consider using the requests library to post files.
import requests
url = 'http://jigsaw.w3.org/css-validator/validator'
files = {'file': open('style.css', 'rb')}
response = requests.post(url, files=files)
Uploading files using urllib2 is not impossible but quite a complicated task: http://pymotw.com/2/urllib2/#uploading-files
After some digging around, it seems this post solved my problem. It turns out I need to have the multipart encoder setup properly.
from poster.encode import multipart_encode
from poster.streaminghttp import register_openers
import urllib2
register_openers()
with open("style.css", 'r') as f:
    datagen, headers = multipart_encode({"file": f})
    request = urllib2.Request("http://jigsaw.w3.org/css-validator/validator",
                              datagen, headers)
    response = urllib2.urlopen(request)
Well, there are multiple ways to do it. As mentioned above, you can send the file in "multipart/form-data". However, the target service may not be expecting this type, in which case you may try some more approaches.
Pass the file object
urllib2 can accept a file object as data. When you pass this type, the library reads the file as a binary stream and sends it out. However, it will not set the proper Content-Type header, and if the Content-Length header is missing, it will try to call len() on the object, which file objects do not support. So you must provide both the Content-Type and Content-Length headers for this approach to work:
import os
import urllib2
filename = '/var/tmp/myfile.zip'
headers = {
    'Content-Type': 'application/zip',
    'Content-Length': os.stat(filename).st_size,
}
request = urllib2.Request('http://localhost', open(filename, 'rb'),
                          headers=headers)
response = urllib2.urlopen(request)
Wrap the file object
To avoid dealing with the length yourself, you can create a simple wrapper object. With just a little change, you can adapt it to read the content from a string if you have the file loaded in memory.
import os
class BinaryFileObject:
    """Simple wrapper for a binary file for urllib2."""
    def __init__(self, filename):
        self.__size = int(os.stat(filename).st_size)
        self.__f = open(filename, 'rb')
    def read(self, blocksize):
        return self.__f.read(blocksize)
    def __len__(self):
        return self.__size
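A hedged usage sketch of this wrapper (Content-Type still has to be set by hand, while the length now comes from __len__, so Content-Length can be omitted):
import urllib2
body = BinaryFileObject('/var/tmp/myfile.zip')
request = urllib2.Request('http://localhost', body,
                          headers={'Content-Type': 'application/zip'})
response = urllib2.urlopen(request)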
Encode the content as base64
Another way is encoding the data via base64.b64encode and providing a Content-Transfer-Encoding: base64 header. However, this method requires support on the server side. Depending on the implementation, the service can either accept the file and store it incorrectly, or return HTTP 400. E.g. the GitHub API won't throw an error, but the uploaded file will be corrupted.
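A hedged sketch of the base64 variant (the URL and content type are placeholders; whether the server decodes the body correctly is entirely up to the service, as noted above):
import base64
import urllib2
with open('/var/tmp/myfile.zip', 'rb') as f:
    encoded = base64.b64encode(f.read())
headers = {
    'Content-Type': 'application/zip',
    'Content-Transfer-Encoding': 'base64',
}
request = urllib2.Request('http://localhost', encoded, headers=headers)
response = urllib2.urlopen(request)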

requests: disable auto decoding

Can you disable the auto decoding feature in requests version 1.2.3?
I've looked through the documentation and couldn't find anything. I'm currently experiencing a gzip decode error and want to manually debug the data coming back from the request.
You can access the raw response like this:
resp = requests.get(url, stream=True)
resp.raw.read()
In order to use raw you need to set stream=True for the original request. Also, raw is a file-like object, and reading from response.content will affect its seek cursor. In other words: if you already (tried to) read response.content, response.raw.read() will return an empty string.
See FAQ: Encoded Data and Quickstart: raw response content in the requests documentation.
import requests
r = requests.get(url, stream=True)
with open(local_filename, 'wb') as f:
    for chunk in r.raw.stream(1024, decode_content=False):
        if chunk:
            f.write(chunk)
This way, you avoid automatic decompression of a gzip-encoded response and still write it to the file chunk by chunk (useful for downloading big files).
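For the manual debugging mentioned in the question, a small sketch that peeks at the first undecoded bytes can help (a genuine gzip body starts with the magic bytes \x1f\x8b); url is assumed to be the problematic URL:
import requests
resp = requests.get(url, stream=True)
head = resp.raw.read(2, decode_content=False)
print(head == b'\x1f\x8b')  # True if the body really is gzip-compressed
resp.close()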
