How to download chunked data with Python's urllib2

I'm trying to download a large file from a server with Python 2:
req = urllib2.Request("https://myserver/mylargefile.gz")
rsp = urllib2.urlopen(req)
data = rsp.read()
The server sends the data with "Transfer-Encoding: chunked", and I only get some binary data that gunzip cannot unpack.
Do I have to iterate over multiple read() calls? Or make multiple requests? If so, what would they have to look like?
Note: I'm trying to solve the problem with only the Python 2 standard library, without additional libraries such as urllib3 or requests. Is this even possible?

From the Python documentation on urllib2.urlopen:
One caveat: the read() method, if the size argument is omitted or
negative, may not read until the end of the data stream; there is no
good way to determine that the entire stream from a socket has been
read in the general case.
So, read the data in a loop:
import urllib2
req = urllib2.Request("https://myserver/mylargefile.gz")
rsp = urllib2.urlopen(req)
data = rsp.read(8192)
while data:
    # .. Do Something ..
    data = rsp.read(8192)
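A minimal sketch of the loop above, wrapped in a helper that writes each block to a destination instead of holding the whole download in memory. The name copy_in_chunks and the 8192-byte default are illustrative choices, not part of urllib2. Any object with a read(size) method, such as the response returned by urlopen, should work as src; an in-memory buffer stands in for it here so the example runs without a server.

```python
import io


def copy_in_chunks(src, dst, chunk_size=8192):
    """Copy src to dst in fixed-size blocks; return the number of bytes copied."""
    total = 0
    while True:
        block = src.read(chunk_size)
        if not block:  # an empty result signals end of stream
            break
        dst.write(block)
        total += len(block)
    return total


# Stand-in for a response object, so the sketch runs without a network.
source = io.BytesIO(b"x" * 20000)
target = io.BytesIO()
copied = copy_in_chunks(source, target)
```

For a real download, dst would typically be a file opened with mode 'wb'.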

If I'm not mistaken, the following worked for me a while back:
data = ''
chunk = rsp.read()
while chunk:
    data += chunk
    chunk = rsp.read()
Each read reads one chunk, so keep on reading until nothing more is coming.
I don't have documentation ready to support this... yet.

I have the same problem.
I found that "Transfer-Encoding: chunked" often appears together with "Content-Encoding: gzip".
So maybe we can fetch the compressed content and unzip it ourselves.
It works for me.
import urllib2
from StringIO import StringIO
import gzip
req = urllib2.Request(url)
req.add_header('Accept-encoding', 'gzip, deflate')
rsp = urllib2.urlopen(req)
if rsp.info().get('Content-Encoding') == 'gzip':
    buf = StringIO(rsp.read())
    f = gzip.GzipFile(fileobj=buf)
    data = f.read()
else:
    # the server did not compress the body, so use it as-is
    data = rsp.read()
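For a large file it may be preferable not to buffer the whole compressed body before unzipping it. The following is only a sketch of that idea, assuming the body really is gzip-encoded: it feeds fixed-size blocks to a zlib decompression object. The helper name gunzip_stream is made up for illustration, and an in-memory gzip payload stands in for the HTTP response.

```python
import gzip
import io
import zlib


def gunzip_stream(src, chunk_size=8192):
    """Yield decompressed blocks from a gzip-compressed file-like object."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    while True:
        block = src.read(chunk_size)
        if not block:
            break
        out = decompressor.decompress(block)
        if out:
            yield out
    tail = decompressor.flush()
    if tail:
        yield tail


# Build a gzip payload in memory to stand in for the HTTP response body.
raw = io.BytesIO()
with gzip.GzipFile(fileobj=raw, mode="wb") as gz:
    gz.write(b"hello world\n" * 1000)
raw.seek(0)
restored = b"".join(gunzip_stream(raw, chunk_size=64))
```

In practice each yielded block would be written straight to the output file rather than joined in memory.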

Related

Content-length header not the same as when manually calculating it?

An answer here (Size of raw response in bytes) says :
Just take the len() of the content of the response:
>>> response = requests.get('https://github.com/')
>>> len(response.content)
51671
However doing that does not get the accurate content length. For example check out this python code:
import sys
import requests
def proccessUrl(url):
    try:
        r = requests.get(url)
        print("Correct Content Length: "+r.headers['Content-Length'])
        print("bytes of r.text : "+str(sys.getsizeof(r.text)))
        print("bytes of r.content : "+str(sys.getsizeof(r.content)))
        print("len r.text : "+str(len(r.text)))
        print("len r.content : "+str(len(r.content)))
    except Exception as e:
        print(str(e))
# this URL returns a Content-Length header; we use it to check whether the length we calculate matches
proccessUrl("https://stackoverflow.com")
If we try to calculate the content length manually and compare it to the value in the header, we get a much larger number:
Correct Content Length: 51504
bytes of r.text : 515142
bytes of r.content : 257623
len r.text : 257552
len r.content : 257606
Why does len(r.content) not return the correct content length? And how can we manually calculate it accurately if the header is missing?

The Content-Length header reflects the body of the response. That's not the same thing as the length of the text or content attributes, because the response could be compressed. requests decompresses the response for you.
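To make the difference concrete without a network call, here is a small sketch that compresses a made-up body with gzip and compares the two sizes. The payload and variable names are invented for illustration; the point is only that the number of bytes on the wire (what Content-Length describes) can be far smaller than the decoded body.

```python
import gzip
import io

body = b"<p>hello</p>\n" * 2000      # what the application sees after decoding

buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(body)
wire_bytes = buf.getvalue()          # what would travel over the network

content_length = len(wire_bytes)     # the value a server would report
decoded_length = len(body)           # what len(r.content) corresponds to
```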
You'd have to bypass a lot of internal plumbing to get the original, compressed, raw content, and then you have to access some more internals if you want the response object to still work correctly. The 'easiest' method is to enable streaming, then read from the raw socket:
from io import BytesIO
r = requests.get(url, stream=True)
# read directly from the raw urllib3 connection
raw_content = r.raw.read()
content_length = len(raw_content)
# replace the internal file-object to serve the data again
r.raw._fp = BytesIO(raw_content)
Demo:
>>> import requests
>>> from io import BytesIO
>>> url = "https://stackoverflow.com"
>>> r = requests.get(url, stream=True)
>>> r.headers['Content-Encoding'] # a compressed response
'gzip'
>>> r.headers['Content-Length'] # the raw response contains 52055 bytes of compressed data
'52055'
>>> r.headers['Content-Type'] # we are served UTF-8 HTML data
'text/html; charset=utf-8'
>>> raw_content = r.raw.read()
>>> len(raw_content) # the raw content body length
52055
>>> r.raw._fp = BytesIO(raw_content)
>>> len(r.content) # the decompressed binary content, byte count
258719
>>> len(r.text) # the Unicode content decoded from UTF-8, character count
258658
This reads the full response into memory, so don't use this if you expect large responses! In that case, you could instead use shutil.copyfileobj() to copy the data from the r.raw file to a spooled temporary file (which will switch to an on-disk file once a certain size is reached), get the file size of that file, then stuff that file onto r.raw._fp.
A function that adds a Content-Length header to any response that is missing that header would look like this:
import requests
import shutil
import tempfile
def ensure_content_length(
    url, *args, method='GET', session=None, max_size=2**20,  # 1 MiB
    **kwargs
):
    kwargs['stream'] = True
    session = session or requests.Session()
    r = session.request(method, url, *args, **kwargs)
    if 'Content-Length' not in r.headers:
        # stream content into a temporary file so we can get the real size
        spool = tempfile.SpooledTemporaryFile(max_size)
        shutil.copyfileobj(r.raw, spool)
        r.headers['Content-Length'] = str(spool.tell())
        spool.seek(0)
        # replace the original socket with our temporary file
        r.raw._fp.close()
        r.raw._fp = spool
    return r
This accepts an existing session, and lets you specify the request method too. Adjust max_size as needed for your memory constraints. Demo on https://github.com, which lacks a Content-Length header:
>>> r = ensure_content_length('https://github.com/')
>>> r
<Response [200]>
>>> r.headers['Content-Length']
'14490'
>>> len(r.content)
54814
Note that if there is no Content-Encoding header present, or the value for that header is set to identity, and the Content-Length is available, then you can simply rely on Content-Length being the full size of the response. That's because no compression has been applied in that case.
As a side note: you should not use sys.getsizeof() if what you are after is the length of a bytes or str object (the number of bytes or characters in that object). sys.getsizeof() gives you the internal memory footprint of a Python object, which covers more than just the number of bytes or characters in that object. See What is the difference between len() and sys.getsizeof() methods in python?
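A short illustration of that side note, using made-up values: len() counts the bytes or characters, while sys.getsizeof() also includes the object's bookkeeping overhead, so the exact figure depends on the interpreter build and is not shown here.

```python
import sys

payload = b"abc"
text = u"abc"

byte_count = len(payload)            # number of bytes in the bytes object
char_count = len(text)               # number of characters in the text
footprint = sys.getsizeof(payload)   # object overhead plus the bytes themselves
```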

Streaming binary file from ssh in python

I've recently come across a feature of the Python requests package (http://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow) that lets you defer downloading the response body until you access Response.content, as described here:
https://stackoverflow.com/a/16696317/8376187
def download_file(url):
    local_filename = url.split('/')[-1]
    # NOTE the stream=True parameter
    r = requests.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
    return local_filename
I use this to stream videos, and since the file's headers are present, my video player reads the video smoothly.
I would like to do the same with an SSH/SFTP file transfer. I have tried to use paramiko for that, but my code reads the file without getting its indexes and headers, which makes my video player fail, and it is also very slow.
The code (assuming a connected paramiko SSHClient()):
sftp_client = client.open_sftp()
remote_file = sftp_client.open('remotefile')
with open('localfile', 'wb') as f:
    try:
        data = remote_file.read(1024)
        while data:
            f.write(data)
            data = remote_file.read(1024)
    finally:
        remote_file.close()
Is there a way to reproduce the behavior of requests' "stream=True" option with an SSH/SFTP transfer in Python?
Thanks :)

python requests post file using multipart form parameters [duplicate]

I'm performing a simple task of uploading a file using Python requests library. I searched Stack Overflow and no one seemed to have the same problem, namely, that the file is not received by the server:
import requests
url='http://nesssi.cacr.caltech.edu/cgi-bin/getmulticonedb_release2.cgi/post'
files={'files': open('file.txt','rb')}
values={'upload_file' : 'file.txt' , 'DB':'photcat' , 'OUT':'csv' , 'SHORT':'short'}
r=requests.post(url,files=files,data=values)
I'm filling the value of 'upload_file' keyword with my filename, because if I leave it blank, it says
Error - You must select a file to upload!
And now I get
File file.txt of size bytes is uploaded successfully!
Query service results: There were 0 lines.
Which comes up only if the file is empty. So I'm stuck as to how to send my file successfully. I know that the file works because if I go to this website and manually fill in the form it returns a nice list of matched objects, which is what I'm after. I'd really appreciate all hints.
Some other threads related (but not answering my problem):
Send file using POST from a Python script
http://docs.python-requests.org/en/latest/user/quickstart/#response-content
Uploading files using requests and send extra data
http://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow

If upload_file is meant to be the file, use:
files = {'upload_file': open('file.txt','rb')}
values = {'DB': 'photcat', 'OUT': 'csv', 'SHORT': 'short'}
r = requests.post(url, files=files, data=values)
and requests will send a multi-part form POST body with the upload_file field set to the contents of the file.txt file.
The filename will be included in the mime header for the specific field:
>>> import requests
>>> open('file.txt', 'wb') # create an empty demo file
<_io.BufferedWriter name='file.txt'>
>>> files = {'upload_file': open('file.txt', 'rb')}
>>> print(requests.Request('POST', 'http://example.com', files=files).prepare().body.decode('ascii'))
--c226ce13d09842658ffbd31e0563c6bd
Content-Disposition: form-data; name="upload_file"; filename="file.txt"
--c226ce13d09842658ffbd31e0563c6bd--
Note the filename="file.txt" parameter.
You can use a tuple for the files mapping value, with between 2 and 4 elements, if you need more control. The first element is the filename, followed by the contents, and an optional content-type header value and an optional mapping of additional headers:
files = {'upload_file': ('foobar.txt', open('file.txt','rb'), 'text/x-spam')}
This sets an alternative filename and content type, leaving out the optional headers.
If you are meaning the whole POST body to be taken from a file (with no other fields specified), then don't use the files parameter, just post the file directly as data. You then may want to set a Content-Type header too, as none will be set otherwise. See Python requests - POST data from a file.
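To show where the filename and content type from such a tuple end up, here is a hand-rolled sketch of a single-file multipart/form-data body. This is not how requests builds the body internally; the function name encode_multipart and the fixed boundary are assumptions made purely for illustration (requests generates a random boundary for you).

```python
def encode_multipart(field, filename, content, content_type, boundary):
    """Build a one-file multipart/form-data body as bytes."""
    lines = [
        b"--" + boundary,
        b'Content-Disposition: form-data; name="' + field
        + b'"; filename="' + filename + b'"',
        b"Content-Type: " + content_type,
        b"",                      # blank line separates part headers from content
        content,
        b"--" + boundary + b"--",
        b"",
    ]
    return b"\r\n".join(lines)


body = encode_multipart(
    field=b"upload_file",
    filename=b"foobar.txt",
    content=b"spam and eggs",
    content_type=b"text/x-spam",
    boundary=b"hypothetical-boundary",
)
```

The server reads the field name, filename and content type from the part headers, which is why the dictionary key passed to files= has to match the field name the form expects.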

(2018) The current Python requests library has simplified this process: we can use the files parameter to signal that we want to upload a multipart-encoded file.
url = 'http://httpbin.org/post'
files = {'file': open('report.xls', 'rb')}
r = requests.post(url, files=files)
r.text

Client Upload
If you want to upload a single file with the Python requests library, note that requests supports streaming uploads, which let you send large files or streams without reading them into memory.
with open('massive-body', 'rb') as f:
    requests.post('http://some.url/streamed', data=f)
Server Side
Then store the file on the server side by saving the stream to a file without loading it into memory. The following is an example using Flask file uploads.
@app.route("/upload", methods=['POST'])
def upload_file():
    from werkzeug.datastructures import FileStorage
    FileStorage(request.stream).save(os.path.join(app.config['UPLOAD_FOLDER'], filename))
    return 'OK', 200
Or use werkzeug Form Data Parsing, as mentioned in a fix for the issue of "large file uploads eating up memory", to avoid using memory inefficiently on large file uploads (e.g. a 22 GiB file in ~60 seconds, with memory usage constant at about 13 MiB).
@app.route("/upload", methods=['POST'])
def upload_file():
    def custom_stream_factory(total_content_length, filename, content_type, content_length=None):
        import tempfile
        tmpfile = tempfile.NamedTemporaryFile('wb+', prefix='flaskapp', suffix='.nc')
        app.logger.info("start receiving file ... filename => " + str(tmpfile.name))
        return tmpfile
    import werkzeug, flask
    stream, form, files = werkzeug.formparser.parse_form_data(flask.request.environ, stream_factory=custom_stream_factory)
    for fil in files.values():
        app.logger.info(" ".join(["saved form name", fil.name, "submitted as", fil.filename, "to temporary file", fil.stream.name]))
        # Do whatever with stored file at `fil.stream.name`
    return 'OK', 200

You can send any file through a POST API; when calling the API you just need to pass files={'any_key': fobj}:
import requests
url = "https://request-url.com"
filepath = "file.txt"  # placeholder: path of the file to upload
# Do not set Content-Type yourself: requests has to generate the
# multipart/form-data header, including the boundary, when files= is used.
with open(filepath, 'rb') as fobj:
    response = requests.post(url, files={'file': fobj})
print("Status Code", response.status_code)
print("JSON Response ", response.json())

@martijn-pieters' answer is correct; however, I wanted to add a bit of context about data= and about the other side, the Flask server, for the case where you are trying to upload files and JSON.
From the request side, this works as Martijn describes:
files = {'upload_file': open('file.txt','rb')}
values = {'DB': 'photcat', 'OUT': 'csv', 'SHORT': 'short'}
r = requests.post(url, files=files, data=values)
However, on the Flask side (the receiving web server on the other side of this POST), I had to use form:
@app.route("/sftp-upload", methods=["POST"])
def upload_file():
    if request.method == "POST":
        # the mimetype here isn't application/json
        # see here: https://stackoverflow.com/questions/20001229/how-to-get-posted-json-in-flask
        body = request.form
        print(body)  # <- immutable dict
body = request.get_json() will return nothing. body = request.get_data() will return a blob containing lots of things like the filename etc.
Here's the bad part: on the client side, changing data={} to json={} results in this server not being able to read the key-value pairs! As in, this will result in a {} body above:
r = requests.post(url, files=files, json=values)  # No!
This is bad because the server does not have control over how the user formats the request, and json= is going to be the habit of requests users.

Upload:
with open('file.txt', 'rb') as f:
    files = {'upload_file': f.read()}
values = {'DB': 'photcat', 'OUT': 'csv', 'SHORT': 'short'}
r = requests.post(url, files=files, data=values)
Download (Django):
with open('file.txt', 'wb') as f:
    f.write(request.FILES['upload_file'].file.read())

Regarding the answers given so far, there was always something missing that prevented them from working on my side. So let me show you what worked for me:
import json
import os
import requests
API_ENDPOINT = "http://localhost:80"
access_token = "sdfJHKsdfjJKHKJsdfJKHJKysdfJKHsdfJKHs" # TODO: get fresh Token here
def upload_engagement_file(filepath):
    url = API_ENDPOINT + "/api/files"  # add any URL parameters if needed
    hdr = {"Authorization": "Bearer %s" % access_token}
    with open(filepath, "rb") as fobj:
        file_obj = fobj.read()  # the with block closes the file afterwards
    file_basename = os.path.basename(filepath)
    file_to_upload = {"file": (str(file_basename), file_obj)}
    finfo = {"fullPath": filepath}
    upload_response = requests.post(url, headers=hdr, files=file_to_upload, data=finfo)
    # print("Status Code ", upload_response.status_code)
    # print("JSON Response ", upload_response.json())
    return upload_response
Note that requests.post(...) needs:
- a url parameter containing the full URL of the API endpoint you're calling, built from API_ENDPOINT (assuming we have an http://localhost:80/api/files endpoint to POST a file to)
- a headers parameter containing at least the authorization (bearer token)
- a files parameter taking the name of the file plus the entire file content
- a data parameter taking just the path and file name
Installation required (console):
pip install requests
What you get back from the function call is a response object containing a status code and also the full error message in JSON format. The commented print statements at the end of upload_engagement_file are showing you how you can access them.
Note: Some useful additional information about the requests library can be found here

Some may need to upload via a PUT request, and this is slightly different than posting data. It is important to understand how the server expects the data in order to form a valid request. A frequent source of confusion is sending multipart form data when it isn't accepted. This example uses basic auth and updates an image via a PUT request.
url = 'https://foobar.com/api/image-1'  # requests needs the scheme in the URL
basic = requests.auth.HTTPBasicAuth('someuser', 'password123')
# Setting the appropriate header is important and will vary based
# on what you upload
headers = {'Content-Type': 'image/png'}
with open('image-1.png', 'rb') as img_1:
    r = requests.put(url, auth=basic, data=img_1, headers=headers)
While the requests library makes working with http requests a lot easier, some of its magic and convenience obscures just how to craft more nuanced requests.

On Ubuntu you can do it this way: save the file to a (temporary) location, then open it and send it to the API.
path = default_storage.save('static/tmp/' + f1.name, ContentFile(f1.read()))
path12 = os.path.join(os.getcwd(), "static/tmp/" + f1.name)
data = {}  # can be anything you want to pass along with the file
header = {"Content-Disposition": "attachment; filename=" + f1.name, "Authorization": "JWT " + token}
with open(path12, 'rb') as file1:
    # pass headers by keyword (the third positional argument of requests.post is json);
    # the form field name 'file' is an assumption, use whatever the API expects
    res = requests.post(url, data=data, files={'file': file1}, headers=header)

Garbled text returned while opening a URL in Python 2.7

I would like to open a StackExchange API (search endpoint) URL and parse the result [0]. The documentation says that all results are in JSON format [1]. When I open this URL in my web browser, the results are absolutely fine [2]. However, when I try opening it with a Python program, it returns encoded text which I am unable to parse. Here's a snippet:
á¬ôŸ?ÍøäÅ€ˆËç?bçÞIË
¡ëf)j´ñ‚TF8¯KÚpr®´Ö©iUizEÚD +¦¯÷tgNÈÃ‘.G¾LPUç?Ñ‘Ù~]ŒäÖÂ9Ÿð1£µ$JNóa?Z&Ÿtž'³Ðà#Í°¬õÅj5ŸE÷*æJî”Ï>íÓé’çÔqQI’†ksS™¾þEíqÝýly
My program to open a URL is as follows. What am I doing particularly wrong?
''' Opens a URL and returns the result '''
def open_url(query):
    request = urllib2.Request(query)
    response = urllib2.urlopen(request)
    text = response.read()
    #results = json.loads(text)
    print text
title = "openRawResource, AssetManager.AssetInputStream throws IOException on read of larger files"
page1_query = stackoverflow_search_endpoint % (1,urllib.quote_plus(title),access_token,key)
[0] https://api.stackexchange.com/2.1/search/advanced?page=1&pagesize=100&order=desc&sort=relevance&q=openRawResource%2C+AssetManager.AssetInputStream+throws+IOException+on+read+of+larger+files&site=stackoverflow&access_token=******&key=******
[1] https://api.stackexchange.com/docs
[2] http://hastebin.com/qoxaxahaxa.sm
Solution
I found the solution. Here's how you would do it.
request = urllib2.Request(query)
request.add_header('Accept-encoding', 'gzip')
response = urllib2.urlopen(request)
if response.info().get('Content-Encoding') == 'gzip':
    buf = StringIO(response.read())
    f = gzip.GzipFile(fileobj=buf)
    data = f.read()
    result = json.loads(data)
I cannot post the complete output as it is too large. Many thanks to Evert and Kristaps for pointing out decompression and setting headers on the request. In addition, there is another similar question worth looking into [3].
[3] Does python urllib2 automatically uncompress gzip data fetched from webpage?

The next paragraph of the documentation says:
Additionally, all API responses are compressed. The Content-Encoding
header is always set, but some proxies will strip this out. The proper way to decode API responses can be found here.
Your output does look like it may be compressed. Browsers automatically decompress data (depending on the Content-Encoding), so you would need to look at the header and do the same: results = json.loads(zlib.decompress(text)) or something similar.
Do check the linked page as well.
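One detail worth spelling out: zlib.decompress() with its default settings expects zlib framing, so a gzip-encoded body generally needs the wbits argument set to accept a gzip header. Below is a rough sketch under that assumption. The function name decode_gzip_json and the sample payload are invented for illustration, and the compressed "response" is built locally rather than fetched from the API.

```python
import gzip
import io
import json
import zlib


def decode_gzip_json(raw_bytes):
    """Decompress a gzip-encoded body and parse it as JSON."""
    # 16 + MAX_WBITS selects the gzip container; the default expects zlib framing.
    decompressed = zlib.decompress(raw_bytes, 16 + zlib.MAX_WBITS)
    return json.loads(decompressed.decode("utf-8"))


# Simulate a compressed API response without touching the network.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(json.dumps({"items": [], "has_more": False}).encode("utf-8"))
result = decode_gzip_json(buf.getvalue())
```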

Is it possible to peek at the data in a urllib2 response?

I need to detect character encoding in HTTP responses. To do this I look at the headers, then if it's not set in the content-type header I have to peek at the response and look for a "<meta http-equiv='content-type'>" header. I'd like to be able to write a function that looks and works something like this:
response = urllib2.urlopen("http://www.example.com/")
encoding = detect_html_encoding(response)
...
page_text = response.read()
However, if I do response.read() in my "detect_html_encoding" method, then the subsequent response.read() after the call to my function will fail.
Is there an easy way to peek at the response and/or rewind after a read?

def detectit(response):
    # try headers &c, then, worst case...:
    content = response.read()
    response.read = lambda: content
    # now detect based on content
The trick of course is ensuring that response.read() WILL return the same thing again if needed... that's why we assign that lambda to it if necessary, i.e., if we already needed to extract the content -- that ensures the same content can be extracted again (and again, and again, ...;-).
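If reading the entire body up front is undesirable, a variation on the same idea is to buffer only the bytes that were peeked at and replay them on the next read. This is just a sketch: PeekableResponse is a hypothetical wrapper, not something urllib2 provides, and an in-memory buffer stands in for the real response.

```python
import io


class PeekableResponse(object):
    """Wrap a file-like object so upcoming bytes can be inspected and re-read."""

    def __init__(self, response):
        self._response = response
        self._buffer = b""

    def peek(self, size):
        """Return up to size upcoming bytes without consuming them."""
        missing = size - len(self._buffer)
        if missing > 0:
            self._buffer += self._response.read(missing)
        return self._buffer[:size]

    def read(self, size=-1):
        """Serve buffered bytes first, then fall through to the wrapped object."""
        if size is None or size < 0:
            data = self._buffer + self._response.read()
            self._buffer = b""
            return data
        data = self._buffer[:size]
        self._buffer = self._buffer[size:]
        if len(data) < size:
            data += self._response.read(size - len(data))
        return data


wrapped = PeekableResponse(io.BytesIO(b"<html><meta charset='utf-8'>body</html>"))
head = wrapped.peek(6)
everything = wrapped.read()
```

The detection function would call peek() with however many bytes it wants to scan for the meta tag, and the caller's later read() still returns the complete body.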

If it's in the HTTP headers (not the document itself), you could use response.info() to detect the encoding.
If you want to parse the HTML, save the response data:
page_text = response.read()
encoding = detect_html_encoding(response, page_text)
