I am attempting to use the eBay Large Merchant Services API to upload calls in bulk. This API and its documentation are notoriously bad (I found a blog post that explains, and rails against, it here). The closest I can find to a working Python solution is based on this Question/Answer and the related GitHub version, which is no longer maintained.
I have followed these examples very closely, and as best I can tell they never fully worked in the first place. I emailed the original author, and he says it was never put into production. However, it should be very close to working.
All of my attempts result in an error 11 Please specify a File with Valid Format message.
I have tried various ways to construct this call: the GitHub method, a method via the email.mime package, the Requests library, and the http.client library. The output below is what I believe is the closest I have to what it should be. The attached file is the GitHub repo's sample XML file, read and gzipped using the same methods as in the repo.
Any help on this would be appreciated. I feel like I have exhausted my possibilities.
POST https://storage.sandbox.ebay.com/FileTransferService
X-EBAY-SOA-SERVICE-VERSION: 1.1.0
Content-Type: multipart/related; boundary=MIME_boundary; type="application/xop+xml"; start="<0.urn:uuid:86f4bbc4-cfde-4bf5-a884-cbdf3c230bf2>"; start-info="text/xml"
User-Agent: python-requests/2.5.1 CPython/3.3.4 Darwin/14.1.0
Accept: */*
Accept-Encoding: gzip, deflate
X-EBAY-SOA-SECURITY-TOKEN: **MYSANDBOXTOKEN**
X-EBAY-SOA-SERVICE-NAME: FileTransferService
Connection: keep-alive
X-EBAY-SOA-OPERATION-NAME: uploadFile
Content-Length: 4863
--MIME_boundary
Content-Type: application/xop+xml; charset=UTF-8; type="text/xml; charset=UTF-8"
Content-Transfer-Encoding: binary
Content-ID: <0.urn:uuid:86f4bbc4-cfde-4bf5-a884-cbdf3c230bf2>
<uploadFileRequest xmlns:sct="http://www.ebay.com/soaframework/common/types" xmlns="http://www.ebay.com/marketplace/services">
<taskReferenceId>50009042491</taskReferenceId>
<fileReferenceId>50009194541</fileReferenceId>
<fileFormat>gzip</fileFormat>
<fileAttachment>
<Size>1399</Size>
<Data><xop:Include xmlns:xop="http://www.w3.org/2004/08/xop/include" href="cid:urn:uuid:1b449099-a434-4466-b8f6-3cc2d59797da"/></Data>
</fileAttachment>
</uploadFileRequest>
--MIME_boundary
Content-Type: application/octet-stream
Content-Transfer-Encoding: binary
Content-ID: <urn:uuid:1b449099-a434-4466-b8f6-3cc2d59797da>
b'\x1f\x8b\x08\x08\x1a\x98\xdaT\x02\xffuploadcompression.xml\x00\xcdW[W\xdb8\x10~\x0e\xbfB\x9b}\x858\xd0\xb4\x14\x8e\xebnn\x94l\xe3\x96M\xccv\xfb\xd4#\xec!\xd1V\x96\\I\x06\xfc\xefw$\xdbI\x9cK\x0f\xec\xbe,\xe5P{f\xbe\xb9i\xe6\xb3\xed\xbf\x7fJ9y\x00\xa5\x99\x14\xef\xda\xa7\x9dn\x9b\x80\x88e\xc2\xc4\xe2]\xfb6\xba:y\xdb~\x1f\x1c\xf9\x83\x9c\x7f\x1fQC\xc7O\xf1\x92\x8a\x05\xcc\xe0G\x0e\xdah\x82p\xa1\xdf\xb5s%.\xe1\x8e\x16\x974c\xfa\x12\x06\xd3\x01\xd50\x94i&\x05\x08\xa3\xdb\xc1Q\xcb\xbf\x06\x9a\x80\xc2\xab\x96\xffg\x1908\xef\x9d\xfb^}c\x15sf`2\nN\xbb]\xdf\xab\xae\x11\xe9\xad\xa1-\xbf\x9f$\x13\x03i\x95\xc1\x0b\x12 \x04\xd1c\xa5\xa4\x9ab\t9]#\x00\xe2\xdb\xed\xdc\xf7\x9a\xc2\xca\xf2\x0bU\x02\xbb0\x85\x07\xe0\xc15[,}\xaf!\xaa\xcc\x0e\x95\xd2\xf2C\xd0\x1a\xfda\t\x11&\x8a8\xf2\t\x1e\xc9\xdc({y%UJ\x8d\xef\xad\x8d\x10\x83\x0e}[\x9b\xc3\xb7\xfc\x88\x19\x0e\xc1\xf5\xefC2\xffz\x12\xf6\xff\xc2\xff\xec\xdf\xc9\x84\x9c\x91\xeb\xf14\x1cG\xa4\xff)\xba\x9e\xf5\x87\x93hLB/\x1c\x8f&\xb7\xa1\xef\x95\xb8\xd2\xc7\x08t\xacXflV\xd1\x92i\x82\xbf\x94\xc4Rr\xb2\x04\x9e\x829&\xccX\xe1\x9d\xa2"!\x023\xbc+\x88Y\x02y\xa4\xc5/\xbe\xb7\x89/=\xde(\x96RU\x0c\xa9\x81\x85TE)m\xf9\xf5=V\xf2\xe6\xbcw\xe1{\x1b\x82\x12\xe8\xedE\xfasC\x95AU\x0c\xc1\xc5E\xe7\x02\x91\x1b\x92\xd2d(E\xc2l\n\xe5h\xe0llJ\x8e\x1a\xf1C\x9ae\xd8\xe0>\xe7\xf2\x11\x92 R9\xacs\xd9R\xd6\xdesa0\x1d;\t\xf5u\xa5\xc9\x95\xc2m\xb0\xaa\x11\xea\xea\xbb\xaa\xb3Lg\xd4\xc4\xcb\x88\xa5\x10\xd2\xa7\xe0\x156kKT\x1aN\x99;\xfdQ\xae\xa8k\xe3\x88\x16\xfa\xdb)\x16\xb1\xadh\x98GE\x06\xc1\x15{\x82\xc4u\xc2\x8e\xc5\n\xe1t\xd5i\xd0"\xc5\x01\x0f\xc1,e\xa2\x03\xbc\xbd\xa1\x1c[\xdd\x14\xaflQ9N)\xe3\xb8D\n\'/\xb0+\xf3\x9bb\xb8\\:a:\xb6\xd5wb\x99:\x07\xdb\xb6\x95\x13\x16\x9b\\\xc1\x08\x0c\xaat}\xfa\x95\xf4v6\r\x96\xc6d\x97\x9e\x07\x9d]\xb7\x1e\x9e\xff\x02\xb4\x87#\xedu\xcf\xce\xce\xbbo\xbd\xd7\xe7oN^\xbf9\xfd\xa6\xd3\xce\xdf\xd9\x02\xe3\xae\x1d\xd5S\xb3\'\xa0\x7f#\xb5\xa1|(\x13\x08z\x17\xbd\xb3\x1e\x9a\xad%\xa5\xc9\x1f9\x15\x86\x99"\xb0\xad^\xdd\x94\xba\x19\xa0Kq#9\x8bW\x03<\x83\xfb\\$\x9f\xcbQ\xafy\xce\xf7\x1a\xe2\x86\xe9\x8e\xd1Zm\xbd\xeb/\xcc,\x99\xa8\x90\xe5\xa1\xf7\xac\xe9\xaer\x1f.8\xed\x11\x0b\xdaBl\xd9\xf6\xe3\x182\x03u~[\xd2\x15v\xcbl\xbf\x8f\x1aM\x0e\xc2k\xe0\x0e)\xb4\xeaV \xb9( 
\x19\xa8\x94\x19\x04\x1c\x13x\xb2P"\x05\x01\x0e\xb1QR\xb0\x18\x19\x07R}L,\xe1\xa49r\xf8\x1d\x10&p\x9dqK\x13\xf2\xe8\xea$X~\x82\xe5\x13yO\x14\xc4\x80\xd1:$\x04e\xc3\xe0H\xc1\x06\xd0\x91V\\\x13\x82\xc3\x13\xca9\xc9h\xfc\x9d.p]\x8eIJENy\x152\xa3\x98\xdf\xa3T\xdf\x11khl\x9c0\x17\x94\x1bP\x90tH$m\xd6\xae\x9cc\x92q\xc0\x07\x89u\xefLc\x8c*SPD\x83z\xc0\xb5$F\x96\xe9=\x00\xd2\xea,\xec\xff\xea\xbc\xc9\\\xad|\x90{\xa4\xfa\x0e\x19O\xc7\xc3h\xf6\xf9\xd3dH\x90\xad\xc3\xf91Ir\x07G\xaee\xe8/\x83\x98QN\x04\xb5\xc3Nb*\x84t\xf5\xd5n\x12\xeb\x07\x9d\x17\x18\x8fj\xac\xd3\xc6\xb1\xcd\xd6\x92\x03/0\xc3\x07\x9b>I\x18\xe6c\xb8\xe5p%\xf3\xc5\xb2\xf2\x8f\x1b\x8c\x11\x8c\xcd\xd36\xe3\x9e\xba\xa5R\xbaS\x1d\xe9.\xd1#3/\x99\xa3\xcb!\xae\xd6\re\xc9\xa0\xa8\xe6g\x90\x17\xa0\x90\xa7\x0f\xe9\x0f\xe2\x0f#\xebm\xdf\xdd\xcc\x95\x9b\r\x06X\xc9\xe6\xe51\x94qKr\xd8\xd6\x05\xf5}I\x86\xf8p\x11\tU\xc9:\x89\xda\xae\x19\xad\x92\xda\x0c\x83n\xa7\xbbc\xee\x14{!\xc8\x97n\x12-\x1b\x1d\x00o\x99\xecu\x83\xb4/\x95\xe3\xaf\x1d\xf8JU\x02\xc7O\x19\xa0?H\xeaJ\xeeq\xd6\x91\x95v\xe4\xae=W\n\xa0\xf6\x17\x18\xf7xl\x88l{\xbd\xc3\xfd\xf5\'\x02\xf7D\xd02\xfd\xbdv\x07\x8e\xa1j|\x03\xbf\xff\x14\xf6\x1d\x02\xae^\xf9\xf8\x9d\x8c\xf0\xc5t>j\x07g\xf8\xb6\xf0\xf6\xf0\xb9\x1c\xec\xe7\xd9\xcf\xfb\xe9p\x91\x9c\xca\xb8|*\x0f\xfb\xa5\xfd\x86\xc8\xb5\xe8y}\xf8\xff\xb4\xeb\xd5\xbfl\xd7\xab\x97\xb5\xab\x8f\xec\xc8b\xaa\xf75m\xc7x\x9c+\x99\xc1\xb3L\xfb\x9a\xd1\xe7\x19\xde\xfe\xa7\xf3\xaa5\xe5\xfb\x17\xb7\xef\xe8\r\xd1\x91[\xb8\x98\xe7\tlE\x99~\xb4+\xb7Os\x183\x19\xbd\x1c\x13~}9\xe6\xc3\xe0\xe5\x98\xf9\x87\x9fa\xe6\xc09\xa8\xbdz}\xa3\xe0\x1e\xec\xf4AE0\xcf4>\xda`\x9e\xe6\xfb\x9e\xfd\x16\x0c`#\x8bP\x1a\xa9t\xf9qX\xeb>\xde\xba\xaf\x02\xfc9\xb1\xbb\x8d\xb7n.\xbc\xfaS\xca\xf7\x9a\xdf\x8cVv\xe4{\x87\xbeiQ]\xff\xfb\x07\xe0\xf2E>\x1f\x0f\x00\x00'
--MIME_boundary--
EDIT: Showing the methods I have tried to actually generate the binary data:
with open('testpayload.xml', mode='rt') as myfile:
    full_xml = myfile.read()  # replacing my full_xml with testpayload.xml, from the GitHub repo
# METHOD 1: Simple gzip.compress()
payload = gzip.compress(bytes(full_xml, 'UTF-8'))
# METHOD 2: Implementing https://stackoverflow.com/a/8507012/1281743 which should produce a gzip-compatible string with header
import io

out = io.BytesIO()
with gzip.GzipFile(fileobj=out, mode="w") as f:
    f.write(bytes(full_xml, 'UTF-8'))
payload = out.getvalue()
# METHOD 3: Simply reading in a pre-compressed version of the same testpayload.xml file, compressed on the command line with 'gzip testpayload.xml'
with open('testpayload.xml.gz', mode='rb') as myfile:
    payload = myfile.read()
# METHOD 4: Adopting the _generate_date() function from the GitHub eBay-LMS-API repo http://goo.gl/YgFyBi
mybuffer = io.BytesIO()
fp = open('testpayload.xml', 'rb')  # This file is from https://github.com/wrhansen/eBay-LMS-API/
# Create a gzip object that writes the compressed data to the BytesIO buffer
gzipbuffer = gzip.GzipFile('uploadcompression.xml.gz', 'wb', 9, mybuffer)
gzipbuffer.writelines(fp)
gzipbuffer.close()
fp.close()
mybuffer.seek(0)
payload = mybuffer.read()
mybuffer.close()
# DONE: send payload to API call
r = ebay.fileTransfer_upload_file(file_id='50009194541', task_id='50009042491', gzip_data=payload)
EDIT2, THE QUASI-SOLUTION: Python was not actually sending binary data, but rather a string representation of binary, which did not work. I ended up having to base64-encode the file first. I had tried that before, but had been waved off of it by a suggestion on the eBay forums. I just changed one of the last headers to Content-Transfer-Encoding: base64 and got a success response.
EDIT3, THE REAL SOLUTION: Base64 encoding only solved my problems with the example XML file. I found that my own XML payload was NOT sending successfully (same error as before), but that the script on GitHub could send that same file just fine. The difference was that the GitHub script is Python 2, which can mix string and binary data without differentiation. To send binary data in Python 3, I just had to make a few simple changes so that all the strings in request_part and binary_part are bytes; that way the bytes of the gzip payload concatenate cleanly. It looks like this:
binary_part = b'\r\n'
binary_part += b'--MIME_boundary\r\n'
binary_part += b'Content-Type: application/octet-stream\r\n'
binary_part += b'Content-Transfer-Encoding: binary\r\n'
binary_part += b'Content-ID: <' + URN_UUID_ATTACHMENT.encode('utf-8') + b'>\r\n\r\n'
binary_part += gzip_data + b'\r\n'
binary_part += b'--MIME_boundary--'
So, no base64 encoding needed; I just had to figure out how to make Python send real binary data. And the original GitHub repo does work as advertised, in Python 2.
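For anyone following along, here's roughly how the whole Python 3 request comes together (a sketch based on the dump above; the requests usage, variable names, and hard-coded IDs from my sandbox task are illustrative, not canonical):

import gzip
import uuid

import requests

URN_UUID_REQUEST = '0.urn:uuid:' + str(uuid.uuid4())
URN_UUID_ATTACHMENT = 'urn:uuid:' + str(uuid.uuid4())

xml_request = (
    '<uploadFileRequest xmlns="http://www.ebay.com/marketplace/services">'
    '<taskReferenceId>50009042491</taskReferenceId>'
    '<fileReferenceId>50009194541</fileReferenceId>'
    '<fileFormat>gzip</fileFormat>'
    '<fileAttachment><Data>'
    '<xop:Include xmlns:xop="http://www.w3.org/2004/08/xop/include" '
    'href="cid:' + URN_UUID_ATTACHMENT + '"/>'
    '</Data></fileAttachment>'
    '</uploadFileRequest>'
)

with open('testpayload.xml', 'rb') as f:
    gzip_data = gzip.compress(f.read())

# Build both MIME parts as bytes so the gzip payload concatenates cleanly.
request_part = b'--MIME_boundary\r\n'
request_part += b'Content-Type: application/xop+xml; charset=UTF-8; type="text/xml"\r\n'
request_part += b'Content-Transfer-Encoding: binary\r\n'
request_part += b'Content-ID: <' + URN_UUID_REQUEST.encode('utf-8') + b'>\r\n\r\n'
request_part += xml_request.encode('utf-8') + b'\r\n'

binary_part = b'--MIME_boundary\r\n'
binary_part += b'Content-Type: application/octet-stream\r\n'
binary_part += b'Content-Transfer-Encoding: binary\r\n'
binary_part += b'Content-ID: <' + URN_UUID_ATTACHMENT.encode('utf-8') + b'>\r\n\r\n'
binary_part += gzip_data + b'\r\n'
binary_part += b'--MIME_boundary--'

headers = {
    'Content-Type': 'multipart/related; boundary=MIME_boundary; '
                    'type="application/xop+xml"; '
                    'start="<' + URN_UUID_REQUEST + '>"; start-info="text/xml"',
    'X-EBAY-SOA-SERVICE-NAME': 'FileTransferService',
    'X-EBAY-SOA-OPERATION-NAME': 'uploadFile',
    'X-EBAY-SOA-SERVICE-VERSION': '1.1.0',
    'X-EBAY-SOA-SECURITY-TOKEN': 'MYSANDBOXTOKEN',  # your sandbox token here
}

r = requests.post('https://storage.sandbox.ebay.com/FileTransferService',
                  data=request_part + binary_part, headers=headers)
print(r.status_code, r.text)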
So, after scanning all my code around this, it looks like it's set up right. So if your error is a good error, it means your binary data is wrong. Can you show how you're gzipping and then reading that data back in? Here's what I'm doing:
$dir = '/.../ebayUpload.gz';
if(is_file($dir)) unlink($dir);
$gz = gzopen($dir,'w9');
gzwrite($gz, $xmlFile);
chmod($dir, 0777);
gzclose($gz);
// open that file as data;
$handle = fopen($dir, 'r');
$fileData = fread($handle, filesize($dir));
fclose($handle);
And in your case, $xmlFile should be the string found in https://github.com/wrhansen/eBay-LMS-API/blob/master/examples/AddItem.xml
I should add that this is how I'm using $fileData:
$binaryPart = '';
$binaryPart .= "--" . $boundry . $CRLF;
$binaryPart .= 'Content-Type: application/octet-stream' . $CRLF;
$binaryPart .= 'Content-Transfer-Encoding: binary' . $CRLF;
$binaryPart .= 'Content-ID: <urn:uuid:'.$uuid_attachment.'>' . $CRLF . $CRLF;
$binaryPart .= $fileData . $CRLF;
$binaryPart .= "--" . $boundry . "--";
I'm trying to export a PDF > DOCX using Adobe's REST API:
https://documentcloud.adobe.com/document-services/index.html#post-exportPDF
The issue I am facing is not being able to save it correctly locally (the file comes out corrupted). I found another thread with a similar goal, but the solution there isn't working for me. Here are the relevant parts of my script:
url = "https://cpf-ue1.adobe.io/ops/:create?respondWith=%7B%22reltype%22%3A%20%22http%3A%2F%2Fns.adobe.com%2Frel%2Fprimary%22%7D"
payload = {}
payload['contentAnalyzerRequests'] = json.dumps(
{
"cpf:engine": {
"repo:assetId": "urn:aaid:cpf:Service-26c7fda2890b44ad9a82714682e35888"
},
"cpf:inputs": {
"params": {
"cpf:inline": {
"targetFormat": "docx"
}
},
"documentIn": {
"dc:format": "application/pdf",
"cpf:location": "InputFile"
}
},
"cpf:outputs": {
"documentOut": {
"dc:format": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"cpf:location": docx_filename,
}
}
}
)
myfile = {'InputFile': open(filename,'rb')}
response = requests.request("POST", url, headers=headers, data=payload, files=myfile)
location = response.headers['location']
...
polling here to make sure export is complete
...
if response.status_code == 200:
    print('Export complete, saving file locally.')
    write_to_file(docx_filename, response)

def write_to_file(filename, response):
    with open(filename, 'wb') as f:
        for chunk in response.iter_content(1024 * 1024):
            f.write(chunk)
What I think is the issue (or at least a clue toward the solution) is the following text at the beginning of response.content:
--Boundary_357737_1222103332_1635257304781
Content-Type: application/json
Content-Disposition: form-data; name="contentAnalyzerResponse"
{"cpf:inputs":{"params":{"cpf:inline":{"targetFormat":"docx"}},"documentIn":{"dc:format":"application/pdf","cpf:location":"InputFile"}},"cpf:engine":{"repo:assetId":"urn:aaid:cpf:Service-26c7fda2890b44ad9a82714682e35888"},"cpf:status":{"completed":true,"type":"","status":200},"cpf:outputs":{"documentOut":{"cpf:location":"output/pdf_test.docx","dc:format":"application/vnd.openxmlformats-officedocument.wordprocessingml.document"}}}
--Boundary_357737_1222103332_1635257304781
Content-Type: application/octet-stream
Content-Disposition: form-data; name="output/pdf_test.docx"
... actual byte content starts here...
Why is this being sent? Am I writing the content to the file incorrectly? (I've tried f.write(response.content) as well, with the same results.) Should I be sending a different request to Adobe?
That extra text is actually there so that the server can send multiple files at once; see https://stackoverflow.com/a/20321259. Basically, the response you're getting is two files: a JSON file called contentAnalyzerResponse, and the Word doc, called output/pdf_test.docx.
You can probably parse the files using parse_form_data from werkzeug.formparser, as demonstrated here, which I've done successfully before, but I'm not sure how to get it working with multiple files.
About your question regarding stripping the content: in light of what I said above, yes, it's perfectly fine to strip it like you're doing.
Note: I'd recommend opening the file in a text editor and checking at the very end of the file to make sure that there isn't any additional --Boundary... stuff that you'll also want to strip out.
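If you'd rather not hand-strip boundaries, the requests-toolbelt package has a MultipartDecoder built for responses like this; here's a sketch (the octet-stream matching is my guess at picking out the Word part, untested against Adobe's actual output):

from requests_toolbelt.multipart import decoder

# Split the multipart response into parts; each part exposes its own
# headers and raw content.
multipart_data = decoder.MultipartDecoder.from_response(response)
for part in multipart_data.parts:
    if part.headers.get(b'Content-Type') == b'application/octet-stream':
        # This should be the .docx payload; write its bytes straight to disk.
        with open(docx_filename, 'wb') as f:
            f.write(part.content)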
I am working with webhooks from Bold Commerce, which are validated using an HMAC-SHA256 hash of the timestamp and the body of the webhook, with a secret key as the signing key. The headers from the webhook look like this:
X-Bold-Signature: 06cc9aab9fd856bdc326f21d54a23e62441adb5966182e784db47ab4f2568231
timestamp: 1556410547
Content-Type: application/json
charset: utf-8
According to their documentation, the hash is built like so (in PHP):
$now = time(); // current unix timestamp
$json = json_encode($payload, JSON_FORCE_OBJECT);
$signature = hash_hmac('sha256', $now.'.'.$json, $signingKey);
I am trying to recreate the same hash using Python, and I am always getting the wrong value. I've tried several combinations, with and without base64 encoding. In Python 3, a bytes object is expected for the HMAC, so I need to encode everything before I can compare it. At this point my code looks like this:
json_loaded = json.loads(request.body)
json_dumped = json.dumps(json_loaded)
# if I don't load and then dump the JSON, the message has escaped \n characters in it
message = timestamp + '.' + json_dumped
# => 1556410547.{"event_type" : "order.created", "date": "2020-06-08".....}
hash = hmac.new(secret.encode(), message.encode(), hashlib.sha256)
hex_digest = hash.hexdigest()
# => 3e4520c869552a282ed29b6523eecbd591fc19d1a3f9e4933e03ae8ce3f77bd4
# hmac_to_verify = 06cc9aab9fd856bdc326f21d54a23e62441adb5966182e784db47ab4f2568231
return hmac.compare_digest(hex_digest, hmac_to_verify)
I'm not sure what I am doing wrong. For the other webhooks I am validating, I used base64 encoding, but it seems that hasn't been used on the PHP side here. I am not so familiar with PHP, so maybe there is something I've missed in how they built the original hash. There could also be complications coming from the fact that I have to convert back and forth between byte arrays and strings; maybe I am using the wrong encoding for that? Please, someone help; I feel like I've tried every combination and am at a loss.
EDIT: Tried this solution, leaving the body as-is instead of round-tripping it through JSON, and it still fails:
print(type(timestamp))
# => <class 'str'>
print(type(body))
# => <class 'bytes'>
# can't concatenate bytes to a string...
message = timestamp.encode('utf-8') + b'.' + body
# => b'1556410547.{\n "event_type": "order.created",\n "event_time": "2020-06-08 11:16:04",\n ...
hash = hmac.new(secret.encode(), message, hashlib.sha256)
hex_digest = hash.hexdigest()
# ...etc
EDIT EDIT: Actually, it is working in production, thanks to the solution described above (concatenating everything as bytes)! My Postman request with the faked webhook was still failing, but that's probably because of how I copied and pasted the webhook data from my Heroku logs :).
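For anyone finding this later, here's the working check condensed into one helper (a sketch; the function name is mine, and it assumes you keep the raw, unmodified request body bytes):

import hashlib
import hmac

def verify_bold_webhook(secret, timestamp, raw_body, received_sig):
    # Sign the raw body bytes as received; round-tripping through json.loads/
    # json.dumps changes whitespace and key order, which breaks the digest.
    message = timestamp.encode('utf-8') + b'.' + raw_body
    digest = hmac.new(secret.encode('utf-8'), message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(digest, received_sig)

# e.g. verify_bold_webhook(secret, headers['timestamp'], request.body,
#                          headers['X-Bold-Signature'])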
So, I have been writing a simple web server in Python, and right now I'm trying to handle multipart/form-data POST requests. I can already handle application/x-www-form-urlencoded POST requests, but the same code won't work for multipart. If it looks like I am misunderstanding anything, please call me out, even if it's something minor. Also, if you have any advice on making my code better, please let me know as well :) Thanks!
When the request comes in, I first parse it and split it into a dictionary of headers and a string for the body of the request. I use those to construct a FieldStorage form, which I can then treat like a dictionary to pull the data out:
requestInfo = ''
while requestInfo[-4:] != '\r\n\r\n':
    requestInfo += conn.recv(1)

requestSplit = requestInfo.split('\r\n')[0].split(' ')
requestType = requestSplit[0]
url = urlparse.urlparse(requestSplit[1])
path = url[2]  # Grab Path

if requestType == "POST":
    headers, body = parse_post(conn, requestInfo)
    print "!!!Request!!! " + requestInfo
    print "!!!Body!!! " + body
    form = cgi.FieldStorage(headers=headers, fp=StringIO(body), environ={'REQUEST_METHOD': 'POST'}, keep_blank_values=1)
Here's my parse_post method:
def parse_post(conn, headers_string):
    headers = {}
    headers_list = headers_string.split('\r\n')
    for i in range(1, len(headers_list) - 2):
        header = headers_list[i].split(': ', 1)
        headers[header[0]] = header[1]
    content_length = int(headers['Content-Length'])
    content = conn.recv(content_length)
    # Parse Content differently if it's a multipart request??
    return headers, content
So for an x-www-form-urlencoded POST request, I can treat the FieldStorage form like a dictionary and call, for example:
firstname = args['firstname'].value
print firstname
It will work. However, if I instead send a multipart POST request, it ends up printing nothing.
This is the body of the x-www-form-urlencoded request:
firstname=TEST&lastname=rwar
This is the body of the multipart request:
--070f6a3146974d399d97c85dcf93ed44
Content-Disposition: form-data; name="lastname"; filename="lastname"
rwar
--070f6a3146974d399d97c85dcf93ed44
Content-Disposition: form-data; name="firstname"; filename="firstname"
TEST
--070f6a3146974d399d97c85dcf93ed44--
So here's the question, should I manually parse the body for the data in parse_post if it's a multipart request?
Or is there a method that I need/can use to parse the multipart body?
Or am I doing this wrong completely?
Thanks again, I know it's a long read but I wanted to make sure my question was comprehensive
So I solved my problem, but in a totally hacky way.
Ended up manually parsing the body of the request, here's the code I wrote:
if("multipart/form-data" in headers["Content-Type"]):
data_list = []
content_list = content.split("\r\n\r\n")
for i in range(len(content_list) - 1):
data_list.append("")
data_list[0] += content_list[0].split("name=")[1].split(";")[0].replace('"','') + "="
for i,c in enumerate(content_list[1:-1]):
key = c.split("name=")[1].split(";")[0].replace('"','')
data_list[i+1] += key + "="
value = c.split("\r\n")
data_list[i] += value[0]
data_list[-1] += content_list[-1].split("\r\n")[0]
content = "&".join(data_list)
If anybody can still solve my problem without having to manually parse the body, please let me know!
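A less hacky alternative (a sketch; untested against my server) would be to prepend the Content-Type header to the raw body and let the stdlib email parser split the parts:

from email.parser import Parser

def parse_multipart(headers, content):
    # Rebuild a minimal MIME document so the email parser can see the boundary.
    msg = Parser().parsestr(
        'Content-Type: %s\r\n\r\n%s' % (headers['Content-Type'], content))
    fields = {}
    for part in msg.get_payload():  # one Message per form field
        name = part.get_param('name', header='content-disposition')
        fields[name] = part.get_payload().strip()
    return fields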
There's the streaming-form-data project, which provides a Python parser for multipart/form-data encoded data. It's intended to allow parsing data in chunks, but since there's no enforced chunk size, you can just pass your entire input at once and it should do the job. It's installable via pip install streaming_form_data.
Here's the source code - https://github.com/siddhantgoel/streaming-form-data
Documentation - https://streaming-form-data.readthedocs.io/en/latest/
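A minimal usage sketch matched to the field names in your question (assuming headers includes the multipart Content-Type and body holds the raw request body bytes):

from streaming_form_data import StreamingFormDataParser
from streaming_form_data.targets import ValueTarget

# The parser reads the boundary out of the Content-Type header.
parser = StreamingFormDataParser(headers=headers)

firstname = ValueTarget()
lastname = ValueTarget()
parser.register('firstname', firstname)
parser.register('lastname', lastname)

# Feed the raw body; one call is fine, chunks also work.
parser.data_received(body)

print(firstname.value)  # b'TEST'
print(lastname.value)   # b'rwar'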
Disclaimer: I'm the author. Of course, please create an issue in case you run into a bug. :)
I am having a little problem with the answer given at Python progress bar and downloads.
If the downloaded data is gzip-encoded, the Content-Length header and the total length of the data after joining it in for data in response.iter_content(): differ: the joined data is bigger, because requests automatically decompresses gzip-encoded responses. So the bar gets longer and longer, and once it becomes too long for a single line it starts flooding the terminal.
A working example of the problem (the site is the first one I found on Google that has both a Content-Length header and gzip encoding):
import requests, sys

def test(link):
    print("starting")
    response = requests.get(link, stream=True)
    total_length = response.headers.get('content-length')
    if total_length is None:  # no content length header
        data = response.content
    else:
        dl = 0
        data = b""
        total_length = int(total_length)
        for byte in response.iter_content():
            dl += len(byte)
            data += byte
            done = int(50 * dl / total_length)
            sys.stdout.write("\r[%s%s]" % ('=' * done, ' ' * (50 - done)))
            sys.stdout.flush()
    print("total data size: %s, content length: %s" % (len(data), total_length))

test("http://www.pontikis.net/")
PS: I am on Linux, but it should affect other OSes too (except Windows, because \r doesn't work there, IIRC). And I am using requests.Session for cookie (and gzip) handling, so a solution using urllib or another module isn't what I am looking for.
Perhaps you should try disabling gzip compression or otherwise accounting for it.
The way to turn it off for requests (when using a session as you say you are):
import requests
s = requests.Session()
del s.headers['Accept-Encoding']
The header sent will now be Accept-Encoding: Identity, and the server should not attempt to use gzip compression. If instead you're trying to download a gzip-encoded file, you should not run into this problem: you'll receive a Content-Type of application/x-gzip-compressed. If the website itself is gzip-compressed, you'll receive a Content-Type of text/html (for example) and a Content-Encoding of gzip.
If the server always serves compressed content then you're out of luck, but no server should do that.
If you want to do something with the functional API of requests:
import requests
r = requests.get('url', headers={'Accept-Encoding': None})
Setting the header value to None via the functional API (or even in a call to session.get) removes that header from the request.
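A quick way to confirm what the server actually did, using the site from the question:

import requests

r = requests.get('http://www.pontikis.net/')
# If the page was compressed in transit, you'll see both of these headers:
print(r.headers.get('Content-Type'))      # e.g. 'text/html; charset=utf-8'
print(r.headers.get('Content-Encoding'))  # 'gzip'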
You could replace...
dl += len(byte)
...with:
dl = response.raw.tell()
From the documentation:
tell(): Obtain the number of bytes pulled over the wire so far. May differ from the amount of content returned by :meth:HTTPResponse.read if bytes are encoded on the wire (e.g., compressed).
Here is a simple progress bar implemented with tqdm:
def _reader_generator(reader):
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024 * 1024)

def raw_newline_count_gzip(fname):
    f = gzip.open(fname, 'rb')
    f_gen = _reader_generator(f.read)
    return sum(buf.count(b'\n') for buf in f_gen)

num = raw_newline_count_gzip(fname)
Looping over a gzip file:
with tqdm(total=num) as pbar:
    # do whatever you want
    pbar.update(1)
The bar looks like:
35%|███▌ | 26288/74418 [00:05<00:09, 5089.45it/s]
I'm back. :) Again trying to get the gzipped contents of a URL and gunzip them, this time in Python. The #SERVER section of code is the script I'm using to generate the gzipped data. The data is known good, as it works with Java. The #CLIENT section is the bit of code I'm using client-side to try to read that data (for eventual JSON parsing). However, somewhere in this transfer, the gzip module forgets how to read the data it created.
#SERVER
outbuf = StringIO.StringIO()
outfile = gzip.GzipFile(fileobj = outbuf, mode = 'wb')
outfile.write(data)
outfile.close()
print "Content-Encoding: gzip\n"
print outbuf.getvalue()
#CLIENT
urlReq = urllib2.Request(url)
urlReq.add_header('Accept-Encoding', '*')
urlConn = urllib2.build_opener().open(urlReq)
urlConnObj = StringIO.StringIO(urlConn.read())
gzin = gzip.GzipFile(fileobj = urlConnObj)
return gzin.read() #IOError: Not a gzipped file.
Other Notes:
outbuf.getvalue() is the same as urlConnObj.getvalue() is the same as urlConn.read()
This StackOverflow question seemed to help me out.
Apparently, it was just wise to bypass the gzip module entirely, opting for zlib instead. Also, changing '*' to 'gzip' in the Accept-Encoding header may have helped.
#CLIENT
urlReq = urllib2.Request(url)
urlReq.add_header('Accept-Encoding', 'gzip')
urlConn = urllib2.urlopen(urlReq)
return zlib.decompress(urlConn.read(), 16+zlib.MAX_WBITS)
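For what it's worth, the 16 + zlib.MAX_WBITS argument is what tells zlib to expect a gzip wrapper (header and CRC trailer) rather than a bare zlib stream; a tiny self-contained demonstration:

import gzip
import zlib

data = gzip.compress(b'hello')
# A bare zlib.decompress(data) would raise "incorrect header check" here,
# because the stream has a gzip header, not a zlib one.
print(zlib.decompress(data, 16 + zlib.MAX_WBITS))  # b'hello'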