I'm trying to decompress the response from a web request using Python requests and zlib, but I can't decompress the response content properly. Here's my code:
import requests
import zlib
URL = "http://" #omitted real url
r = requests.get(URL)
print r.content
data = zlib.decompress(r.content, zlib.MAX_WBITS)
print data
However, I keep getting various errors when changing the wbits parameter.
zlib.error: Error -3 while decompressing data: incorrect header check
zlib.error: Error -3 while decompressing data: invalid stored block lengths
I tried the wbits parameters for deflate, zlib, and gzip as noted in zlib.error: Error -3 while decompressing: incorrect header check, but I still can't get past these errors. I'm trying to do this in Python. I was given this piece of Objective-C code that does it, but I don't know Objective-C:
#import "GTMNSData+zlib.h"

+ (NSData*) uncompress: (NSData*) data
{
    Byte *bytes = (Byte*)[data bytes];
    NSInteger length = [data length];
    NSMutableData* retdata = [[NSMutableData alloc] initWithCapacity:length * 3.5];
    NSInteger bSize = 0;
    NSInteger offSet = 0;
    while (true) {
        offSet += bSize;
        if (offSet >= length) {
            break;
        }
        // Each block is prefixed with its size as a 32-bit little-endian integer.
        bSize = bytes[offSet];
        bSize += (bytes[offSet+1] << 8);
        bSize += (bytes[offSet+2] << 16);
        bSize += (bytes[offSet+3] << 24);
        offSet += 4;
        if ((bSize == 0) || (bSize + offSet > length)) {
            LogError(@"Invalid");
            return data;
        }
        [retdata appendData:[NSData gtm_dataByInflatingBytes:bytes+offSet length:bSize]];
    }
    return retdata;
}
According to Python requests documentation at:
http://docs.python-requests.org/en/master/user/quickstart/#binary-response-content
it says:
You can also access the response body as bytes, for non-text requests:
>>> r.content
b'[{"repository":{"open_issues":0,"url":"https://github.com/...
The gzip and deflate transfer-encodings are automatically decoded for you.
If requests understands the encoding, the content should therefore already be uncompressed.
Use r.raw if you need to get access to the original data to handle a different decompression mechanism.
http://docs.python-requests.org/en/master/user/quickstart/#raw-response-content
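For example, a minimal sketch of getting at the undecoded bytes and trying the standard framings (assuming the server really does send zlib-, deflate-, or gzip-compressed data; stream=True is required so the raw stream is still readable):

import requests
import zlib

r = requests.get(URL, stream=True)  # URL as in the question
raw = r.raw.read()                  # undecoded bytes, exactly as sent by the server

# One of these wbits values should apply, depending on the framing:
# zlib.decompress(raw, zlib.MAX_WBITS)        # zlib header (RFC 1950)
# zlib.decompress(raw, -zlib.MAX_WBITS)       # raw deflate (RFC 1951)
# zlib.decompress(raw, 16 + zlib.MAX_WBITS)   # gzip header (RFC 1952)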
The following is an untested translation of the Objective-C code:
import zlib
import struct

def uncompress(data):
    length = len(data)
    ret = []
    bSize = 0
    offSet = 0
    while True:
        offSet += bSize
        if offSet >= length:
            break
        # "<I" reads the unsigned 32-bit little-endian block-size prefix;
        # struct.unpack returns a tuple, so take its first element.
        bSize = struct.unpack("<I", data[offSet:offSet+4])[0]
        offSet += 4
        if bSize == 0 or bSize + offSet > length:
            print "Invalid"
            return ''.join(ret)
        ret.append(zlib.decompress(data[offSet:offSet+bSize]))
    return ''.join(ret)
I get the following error:
TypeError: POST data should be bytes, an iterable of bytes, or a file object. It cannot be of type str.
when making the call below:
import urllib.request, urllib.parse, urllib.error
import json
chemcalcURL = 'http://www.chemcalc.org/chemcalc/em'
# Define a molecular formula string
mfRange = 'C0-100H0-100N0-10O0-10'
# target mass
mass = 300
# Define the parameters and send them to Chemcalc
# other options (mass tolerance, unsaturation, etc.)
params = {'mfRange': mfRange,'monoisotopicMass': mass}
response = urllib.request.urlopen(chemcalcURL, urllib.parse.urlencode(params))
# Read the output and convert it from JSON into a Python dictionary
jsondata = response.read()
data = json.loads(jsondata)
print(data)
You have to convert your request data to bytes, which is what the bytes() call does here:
response = urllib.request.urlopen(chemcalcURL, bytes(urllib.parse.urlencode(params), encoding="utf-8"))
bytes() must take an encoding, which for websites is almost always UTF-8.
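Equivalently, str.encode() performs the same str-to-bytes conversion:

data = urllib.parse.urlencode(params).encode("utf-8")  # str -> bytes
response = urllib.request.urlopen(chemcalcURL, data)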
I am able to generate and stream text on the fly, but unable to generate and stream a compressed file on the fly.
from flask import Flask, request, Response, stream_with_context
import zlib
import gzip

app = Flask(__name__)

def generate_text():
    for x in range(10000):
        yield f"this is my line: {x}\n".encode()

@app.route('/stream_text')
def stream_text():
    response = Response(stream_with_context(generate_text()))
    return response

def generate_zip():
    for x in range(10000):
        yield zlib.compress(f"this is my line: {x}\n".encode())

@app.route('/stream_zip')
def stream_zip():
    response = Response(stream_with_context(generate_zip()), mimetype='application/zip')
    response.headers['Content-Disposition'] = 'attachment; filename=data.gz'
    return response

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000, debug=True)
Then, using curl and gunzip:
curl http://127.0.0.1:8000/stream_zip > data.gz
gunzip data.gz
gunzip: data.gz: not in gzip format
I don't care if it is zip, gzip, or any other type of compression.
generate_text in my real code generates over 4 GB of data, so I would like to compress on the fly.
Saving text to a file, zipping it, returning the zip file, and then deleting it is not the solution I'm after.
I need to be in a loop generating some text -> compressing that text -> streaming the compressed data until I'm done.
zip/gzip ... anything is fine as long as it works.
You are yielding a series of compressed documents, not a single compressed stream. Don't use zlib.compress(); it includes the header and forms a single document.
You need to create a zlib.compressobj() object instead, and use the Compress.compress() method on that object to produce a stream of data (followed by a final call to Compress.flush()):
def generate_zip():
    compressor = zlib.compressobj()
    for x in range(10000):
        chunk = compressor.compress(f"this is my line: {x}\n".encode())
        if chunk:
            yield chunk
    yield compressor.flush()
The compressor can produce empty blocks when there is not enough data yet to produce a full compressed-data chunk; the above only yields if there is actually anything to send. Because your input data is so highly repetitive, and the data can thus be compressed efficiently, this yields only 3 times (once with the 2-byte header, once with about 21 kB of compressed data covering the first 8288 iterations over range(), and finally with the remaining 4 kB for the rest of the loop).
In aggregate, this produces the same data as a single zlib.compress() call with all inputs concatenated. The correct MIME type for this data format is application/zlib, not application/zip.
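As a quick sanity check (a sketch reusing the generate_zip() above), the concatenated chunks decompress as a single zlib document:

import zlib

compressed = b"".join(generate_zip())
text = zlib.decompress(compressed)  # default wbits handles the zlib framing
assert text.startswith(b"this is my line: 0\n")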
However, this format is not readily decompressible with gzip, not without some trickery. That's because the above doesn't yet produce a GZIP file; it just produces a raw zlib-compressed stream. To make it GZIP-compatible, you need to configure the compression correctly, send a header first, and add a CRC checksum and a data-length value at the end:
import zlib
import struct
import time

def generate_gzip():
    # Yield a gzip file header first.
    yield bytes([
        0x1F, 0x8B, 0x08, 0x00,  # Gzip file, deflate, no filename
        *struct.pack('<L', int(time.time())),  # compression start time
        0x02, 0xFF,  # maximum compression, no OS specified
    ])

    # bookkeeping: the compression state, running CRC and total length
    compressor = zlib.compressobj(
        9, zlib.DEFLATED, -zlib.MAX_WBITS, zlib.DEF_MEM_LEVEL, 0)
    crc = zlib.crc32(b"")
    length = 0

    for x in range(10000):
        data = f"this is my line: {x}\n".encode()
        chunk = compressor.compress(data)
        if chunk:
            yield chunk
        crc = zlib.crc32(data, crc) & 0xFFFFFFFF
        length += len(data)

    # Finishing off, send remainder of the compressed data, and CRC and length
    yield compressor.flush()
    yield struct.pack("<2L", crc, length & 0xFFFFFFFF)
Serve this as application/gzip:
@app.route('/stream_gzip')
def stream_gzip():
    response = Response(stream_with_context(generate_gzip()), mimetype='application/gzip')
    response.headers['Content-Disposition'] = 'attachment; filename=data.gz'
    return response
and the result can be decompressed on the fly:
curl http://127.0.0.1:8000/stream_gzip | gunzip -c | less
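On the Python side, the same stream can be consumed and decompressed incrementally; a sketch, assuming the Flask server above is running locally:

import zlib
import requests

r = requests.get("http://127.0.0.1:8000/stream_gzip", stream=True)
d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS expects the gzip wrapper
for chunk in r.iter_content(chunk_size=4096):
    print(d.decompress(chunk).decode(), end="")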
While I was extremely impressed by Martijn's solution, I decided to roll my own version that uses pigz for better performance:
def yield_pigz(results, compresslevel=1):
    cmd = ['pigz', '-%d' % compresslevel]
    pigz_proc = subprocess.Popen(cmd, bufsize=0,
        stdin=subprocess.PIPE, stdout=subprocess.PIPE)

    def f():
        # Feed the input chunks to pigz on a separate thread so the pipe
        # can be written and read concurrently without deadlocking.
        for result in results:
            pigz_proc.stdin.write(result)
            pigz_proc.stdin.flush()
        pigz_proc.stdin.close()

    try:
        t = threading.Thread(target=f)
        t.start()
        while True:
            buf = pigz_proc.stdout.read(4096)
            if len(buf) == 0:
                break
            yield buf
    finally:
        t.join()
        pigz_proc.wait()
Keep in mind that you'll need to import subprocess and threading for this to work. You will also need to install the pigz program (already in the repositories of most Linux distributions; on Ubuntu, just use sudo apt install pigz -y).
Example usage:
from flask import Flask, Response
import subprocess
import threading
import random

app = Flask(__name__)

def yield_something_random():
    for i in range(10000):
        seq = [chr(random.randint(ord('A'), ord('Z'))) for c in range(1000)]
        # pigz's stdin pipe expects bytes, so encode the generated text
        yield ''.join(seq).encode()

@app.route('/')
def index():
    return Response(yield_pigz(yield_something_random()))
I think that currently you're just sending the generator instead of the data!
You may want to do something like this (I haven't tested it, so it may need some changes):
def generate_zip():
    import io
    buff = io.BytesIO()
    with gzip.GzipFile(fileobj=buff, mode='w') as gfile:
        for x in xrange(10000):
            gfile.write("this is my line: {}\n".format(x))
    # read the compressed result back from the underlying buffer
    return buff.getvalue()
Working generate_zip() with low memory consumption :) :
import io
import gzip

def generate_zip():
    buff = io.BytesIO()
    gz = gzip.GzipFile(mode='w', fileobj=buff)
    for x in xrange(10000):
        gz.write("this is my line: {}\n".format(x))
        # hand over whatever compressed bytes were written so far,
        # then empty the buffer so memory consumption stays low
        yield buff.getvalue()
        buff.seek(0)
        buff.truncate()
    gz.close()
    yield buff.getvalue()  # remaining compressed data plus the gzip trailer
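Served from the same /stream_zip route as in the question, this produces a valid gzip stream, which can be verified with:
curl http://127.0.0.1:8000/stream_zip | gunzip -c | less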
A bit of a weird question perhaps, but I'm trying to replicate a Python example where they create an HMAC-SHA256 hash from a series of parameters.
I've run into a problem where I'm supposed to translate an API key from hex to ASCII and use it as the secret, but I just can't get the output to be the same as Python's.
>>> import hmac
>>> import hashlib
>>> apiKey = "76b02c0c4543a85e45552466694cf677937833c9cce87f0a628248af2d2c495b"
>>> apiKey.decode('hex')
'v\xb0,\x0cEC\xa8^EU$fiL\xf6w\x93x3\xc9\xcc\xe8\x7f\nb\x82H\xaf-,I['
If I've understood the material online, this is supposed to represent the hex string as ASCII characters.
Now to the PowerShell script:
$apikey = '76b02c0c4543a85e45552466694cf677937833c9cce87f0a628248af2d2c495b';
$hexstring = ""
for($i=0; $i -lt $apikey.Length; $i=$i+2){
    $hexelement = [string]$apikey[$i] + [string]$apikey[$i+1]
    $hexstring += [CHAR][BYTE][CONVERT]::toint16($hexelement,16)
}
That outputs the following:
v°,♀EC¨^EU$fiLöw?x3ÉÌè⌂
b?H¯-,I[
They are almost the same, but not quite and using them as secret in the HMAC generates different results. Any ideas?
Stating the obvious: the key in this example is not the real key.
Update:
They look more or less the same, but the encoding of the output is different. I also verified the hex-to-ASCII conversion with multiple online tools, and the PowerShell version seems right.
Does anyone have an idea how to compare the two different outputs?
Update 2:
I converted each character to an integer, and both Python and PowerShell generate the same numbers, so the content should be the same.
Attaching the scripts
Powershell:
Function generateToken {
    Param($apikey, $url, $httpMethod, $queryparameters=$false, $postData=$false)
    #$timestamp = [int]((Get-Date -UFormat %s).Replace(",", "."))
    $timestamp = "1446128942"
    $datastring = $httpMethod + $url
    if($queryparameters){ $datastring += $queryparameters }
    $datastring += $timestamp
    if($postData){ $datastring += $postData }
    $hmacsha = New-Object System.Security.Cryptography.HMACSHA256
    $apiAscii = HexToString -hexstring $apiKey
    $hmacsha.key = [Text.Encoding]::ASCII.GetBytes($apiAscii)
    $signature = $hmacsha.ComputeHash([Text.Encoding]::ASCII.GetBytes($datastring))
    $signature
}

Function HexToString {
    Param($hexstring)
    $asciistring = ""
    for($i=0; $i -lt $hexstring.Length; $i=$i+2){
        $hexelement = [string]$hexstring[$i] + [string]$hexstring[$i+1]
        $asciistring += [CHAR][BYTE][CONVERT]::toint16($hexelement,16)
    }
    $asciistring
}

Function TokenToHex {
    Param([array]$Token)
    $hexhash = ""
    Foreach($element in $Token){
        $hexhash += '{0:x}' -f $element
    }
    $hexhash
}

$apiEndpoint = "http://test.control.llnw.com/traffic-reporting-api/v1"
#what you see in Control on Edit My Profile page#
$apikey = '76b02c0c4543a85e45552466694cf677937833c9cce87f0a628248af2d2c495b';
$queryParameters = "shortname=bulkget&service=http&reportDuration=day&startDate=2012-01-01"
$postData = "{param1: 123, param2: 456}"
$token = generateToken -uri $apiEndpoint -httpMethod "GET" -queryparameters $queryParameters, postData=postData, -apiKey $apiKey
TokenToHex -Token $token
Python:
import hashlib
import hmac
import time
try: import simplejson as json
except ImportError: import json

class HMACSample:
    def generateSecurityToken(self, url, httpMethod, apiKey, queryParameters=None, postData=None):
        #timestamp = str(int(round(time.time()*1000)))
        timestamp = "1446128942"
        datastring = httpMethod + url
        if queryParameters != None : datastring += queryParameters
        datastring += timestamp
        if postData != None : datastring += postData
        token = hmac.new(apiKey.decode('hex'), msg=datastring, digestmod=hashlib.sha256).hexdigest()
        return token

if __name__ == '__main__':
    apiEndpoint = "http://test.control.llnw.com/traffic-reporting-api/v1"
    #what you see in Control on Edit My Profile page#
    apiKey = "76b02c0c4543a85e45552466694cf677937833c9cce87f0a628248af2d2c495b";
    queryParameters = "shortname=bulkget&service=http&reportDuration=day&startDate=2012-01-01"
    postData = "{param1: 123, param2: 456}"
    tool = HMACSample()
    hmac = tool.generateSecurityToken(url=apiEndpoint, httpMethod="GET", queryParameters=queryParameters, postData=postData, apiKey=apiKey)
    print json.dumps(hmac, indent=4)
Using "test" as the apiKey instead of the converted hex-to-ASCII string outputs the same value in both, which made me suspect that the conversion was the problem. Now I'm not sure what to believe anymore.
/Patrik
ASCII encoding supports characters in the code point range 0–127. Any character outside this range is encoded as byte 63, which corresponds to ?, in case you decode the byte array back to a string. So, with your code, you ruin your key by applying ASCII encoding to it. But if what you want is a byte array, then why do you do Hex String -> ASCII String -> Byte Array instead of just Hex String -> Byte Array?
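To see the damage in Python terms, here is a small sketch (bytes.fromhex is the Python 3 counterpart of the .decode('hex') used above):

key = bytes.fromhex(
    "76b02c0c4543a85e45552466694cf677937833c9cce87f0a628248af2d2c495b")

# Forcing the key through a text round-trip, as the ASCII.GetBytes call does,
# replaces every byte >= 0x80 with byte 63 (b'?') and corrupts the key.
ruined = key.decode("ascii", errors="replace").encode("ascii", errors="replace")
print(key == ruined)  # False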
Here is PowerShell code which generates the same results as your Python code:
function GenerateToken {
    param($apikey, $url, $httpMethod, $queryparameters, $postData)

    $datastring = -join @(
        $httpMethod
        $url
        $queryparameters
        #[DateTimeOffset]::Now.ToUnixTimeSeconds()
        1446128942
        $postData
    )

    $hmacsha = New-Object System.Security.Cryptography.HMACSHA256
    $hmacsha.Key = @($apikey -split '(?<=\G..)(?=.)' | ForEach-Object {[byte]::Parse($_,'HexNumber')})
    [BitConverter]::ToString($hmacsha.ComputeHash([Text.Encoding]::UTF8.GetBytes($datastring))).Replace('-','').ToLower()
}

$apiEndpoint = "http://test.control.llnw.com/traffic-reporting-api/v1"
#what you see in Control on Edit My Profile page#
$apikey = '76b02c0c4543a85e45552466694cf677937833c9cce87f0a628248af2d2c495b';
$queryParameters = "shortname=bulkget&service=http&reportDuration=day&startDate=2012-01-01"
$postData = "{param1: 123, param2: 456}"
GenerateToken -url $apiEndpoint -httpMethod "GET" -queryparameters $queryParameters -postData $postData -apiKey $apiKey
I also fixed some other errors in your PowerShell code, in particular the arguments to the GenerateToken function call. And I changed ASCII to UTF8 for the $datastring encoding. UTF8 yields exactly the same bytes if all characters are in the ASCII range, so it does not matter in your case. But if you want to use characters outside the ASCII range in $datastring, then you should choose the same encoding as you use in Python, or you will not get the same results.
Thanks for reading.
Background:
I am trying to read a streaming API feed that returns data in JSON format and then store this data in a pymongo collection. The streaming API requires an "Accept-Encoding": "gzip" header.
What's happening:
The code fails on json.loads and outputs: Extra data: line 2 column 1 - line 4 column 1 (char 1891 - 5597) (refer to the Error Log below).
This does NOT happen while parsing every JSON object - it happens at random.
My guess is I am encountering some weird JSON object after every "x" proper JSON objects.
I did reference how to use pycurl if requested data is sometimes gzipped, sometimes not? and Encoding error while deserializing a json object from Google, but so far I have been unsuccessful at resolving this error.
Could someone please help me out here?
Error Log:
Note: The raw dump of the JSON object below basically uses the repr() method, which prints the raw representation of the string without resolving CRLF/LF(s).
'{"id":"tag:search.twitter.com,2005:207958320747782146","objectType":"activity","actor":{"objectType":"person","id":"id:twitter.com:493653150","link":"http://www.twitter.com/Deathnews_7_24","displayName":"Death News 7/24","postedTime":"2012-02-16T01:30:12.000Z","image":"http://a0.twimg.com/profile_images/1834408513/deathnewstwittersquare_normal.jpg","summary":"Crashes, Murders, Suicides, Accidents, Crime and Naturals Death News From All Around World","links":[{"href":"http://www.facebook.com/DeathNews724","rel":"me"}],"friendsCount":56,"followersCount":14,"listedCount":1,"statusesCount":1029,"twitterTimeZone":null,"utcOffset":null,"preferredUsername":"Deathnews_7_24","languages":["tr"]},"verb":"post","postedTime":"2012-05-30T22:15:02.000Z","generator":{"displayName":"web","link":"http://twitter.com"},"provider":{"objectType":"service","displayName":"Twitter","link":"http://www.twitter.com"},"link":"http://twitter.com/Deathnews_7_24/statuses/207958320747782146","body":"Kathi Kamen Goldmark, Writers\xe2\x80\x99 Catalyst, Dies at 63 http://t.co/WBsNlNtA","object":{"objectType":"note","id":"object:search.twitter.com,2005:207958320747782146","summary":"Kathi Kamen Goldmark, Writers\xe2\x80\x99 Catalyst, Dies at 63 http://t.co/WBsNlNtA","link":"http://twitter.com/Deathnews_7_24/statuses/207958320747782146","postedTime":"2012-05-30T22:15:02.000Z"},"twitter_entities":{"urls":[{"display_url":"nytimes.com/2012/05/30/boo\xe2\x80\xa6","indices":[52,72],"expanded_url":"http://www.nytimes.com/2012/05/30/books/kathi-kamen-goldmark-writers-catalyst-dies-at-63.html","url":"http://t.co/WBsNlNtA"}],"hashtags":[],"user_mentions":[]},"gnip":{"language":{"value":"en"},"matching_rules":[{"value":"url_contains: nytimes.com","tag":null}],"klout_score":11,"urls":[{"url":"http://t.co/WBsNlNtA","expanded_url":"http://www.nytimes.com/2012/05/30/books/kathi-kamen-goldmark-writers-catalyst-dies-at-63.html?_r=1"}]}}\r\n{"id":"tag:search.twitter.com,2005:207958321003638785","objectType":"activity","actor":{"objectType":"person","id":"id:twitter.com:178760897","link":"http://www.twitter.com/Mobanu","displayName":"Donald Ochs","postedTime":"2010-08-15T16:33:56.000Z","image":"http://a0.twimg.com/profile_images/1493224811/small_mobany_Logo_normal.jpg","summary":"","links":[{"href":"http://www.mobanuweightloss.com","rel":"me"}],"friendsCount":10272,"followersCount":9698,"listedCount":30,"statusesCount":725,"twitterTimeZone":"Mountain Time (US & Canada)","utcOffset":"-25200","preferredUsername":"Mobanu","languages":["en"],"location":{"objectType":"place","displayName":"Crested Butte, Colorado"}},"verb":"post","postedTime":"2012-05-30T22:15:02.000Z","generator":{"displayName":"twitterfeed","link":"http://twitterfeed.com"},"provider":{"objectType":"service","displayName":"Twitter","link":"http://www.twitter.com"},"link":"http://twitter.com/Mobanu/statuses/207958321003638785","body":"Mobanu: Can Exercise Be Bad for You?: Researchers have found evidence that some people who exercise do worse on ... http://t.co/mTsQlNQO","object":{"objectType":"note","id":"object:search.twitter.com,2005:207958321003638785","summary":"Mobanu: Can Exercise Be Bad for You?: Researchers have found evidence that some people who exercise do worse on ... 
http://t.co/mTsQlNQO","link":"http://twitter.com/Mobanu/statuses/207958321003638785","postedTime":"2012-05-30T22:15:02.000Z"},"twitter_entities":{"urls":[{"display_url":"nyti.ms/KUmmMa","indices":[116,136],"expanded_url":"http://nyti.ms/KUmmMa","url":"http://t.co/mTsQlNQO"}],"hashtags":[],"user_mentions":[]},"gnip":{"language":{"value":"en"},"matching_rules":[{"value":"url_contains: nytimes.com","tag":null}],"klout_score":12,"urls":[{"url":"http://t.co/mTsQlNQO","expanded_url":"http://well.blogs.nytimes.com/2012/05/30/can-exercise-be-bad-for-you/?utm_medium=twitter&utm_source=twitterfeed"}]}}\r\n'
json exception: Extra data: line 2 column 1 - line 4 column 1 (char 1891 - 5597)
Header Output:
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Vary: Accept-Encoding
Date: Wed, 30 May 2012 22:14:48 UTC
Connection: close
Transfer-Encoding: chunked
Content-Encoding: gzip
get_stream.py:
#!/usr/bin/env python
import sys
import pycurl
import json
import pymongo

STREAM_URL = "https://stream.test.com:443/accounts/publishers/twitter/streams/track/Dev.json"
AUTH = "userid:passwd"

DB_HOST = "127.0.0.1"
DB_NAME = "stream_test"

class StreamReader:
    def __init__(self):
        try:
            self.count = 0
            self.buff = ""
            self.mongo = pymongo.Connection(DB_HOST)
            self.db = self.mongo[DB_NAME]
            self.raw_tweets = self.db["raw_tweets_gnip"]
            self.conn = pycurl.Curl()
            self.conn.setopt(pycurl.ENCODING, 'gzip')
            self.conn.setopt(pycurl.URL, STREAM_URL)
            self.conn.setopt(pycurl.USERPWD, AUTH)
            self.conn.setopt(pycurl.WRITEFUNCTION, self.on_receive)
            self.conn.setopt(pycurl.HEADERFUNCTION, self.header_rcvd)
            while True:
                self.conn.perform()
        except Exception as ex:
            print "error ocurred : %s" % str(ex)

    def header_rcvd(self, header_data):
        print header_data

    def on_receive(self, data):
        temp_data = data
        self.buff += data
        if data.endswith("\r\n") and self.buff.strip():
            try:
                tweet = json.loads(self.buff, encoding='UTF-8')
                self.buff = ""
                if tweet:
                    try:
                        self.raw_tweets.insert(tweet)
                    except Exception as insert_ex:
                        print "Error inserting tweet: %s" % str(insert_ex)
                    self.count += 1
                    if self.count % 10 == 0:
                        print "inserted " + str(self.count) + " tweets"
            except Exception as json_ex:
                print "json exception: %s" % str(json_ex)
                print repr(temp_data)

stream = StreamReader()
Fixed Code:
def on_receive(self, data):
    self.buff += data
    if data.endswith("\r\n") and self.buff.strip():
        # NEW: Split the buff at \r\n to get a list of JSON objects and iterate over them
        json_obj = self.buff.split("\r\n")
        for obj in json_obj:
            if len(obj.strip()) > 0:
                try:
                    tweet = json.loads(obj, encoding='UTF-8')
                except Exception as json_ex:
                    print "JSON Exception occurred: %s" % str(json_ex)
                    continue
Try pasting your dumped string into jsbeautifier.
You'll see that it's actually two JSON objects, not one, which json.loads can't deal with.
They are separated by \r\n, so it should be easy to split them.
The problem is that the data argument passed to on_receive doesn't necessarily end with \r\n even if it contains one. As the dump above shows, a \r\n can also sit in the middle of a chunk, so only looking at the end of the data chunk won't be enough.
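A sketch of a buffer-based fix that copes with \r\n anywhere in the chunk (assuming the same on_receive callback and collection as above): split the buffer on every \r\n, process the complete objects, and keep the possibly-partial tail buffered for the next call.

def on_receive(self, data):
    self.buff += data
    parts = self.buff.split("\r\n")
    self.buff = parts.pop()  # the last element is incomplete (or empty); keep it buffered
    for obj in parts:
        if obj.strip():
            tweet = json.loads(obj)
            self.raw_tweets.insert(tweet)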
In C, I return -1 when I want to cancel the download in either the header or the write function. In pycurl, I get this error:
pycurl.error: invalid return value for write callback -1 17
I don't know what the 17 means, but what am I not doing correctly?
from pycurl.c:
else if (PyInt_Check(result)) {
    long obj_size = PyInt_AsLong(result);
    if (obj_size < 0 || obj_size > total_size) {
        PyErr_Format(ErrorObject, "invalid return value for write callback %ld %ld", (long)obj_size, (long)total_size);
        goto verbose_error;
    }
This would mean 17 is the total_size (the size of the chunk passed to your callback), and -1 (result) is what your callback is returning, which pycurl rejects.
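Given that check, returning a non-negative number smaller than total_size should make libcurl treat the write as short and abort the transfer with a write error. A sketch (should_abort is a hypothetical predicate; I haven't verified this across pycurl versions):

import sys
import pycurl

def should_abort():
    # hypothetical condition deciding when to cancel
    return True

def write_cb(data):
    if should_abort():
        return 0  # "consumed" fewer bytes than passed in -> libcurl aborts
    sys.stdout.write(data)
    # returning None signals that all bytes were consumed

c = pycurl.Curl()
c.setopt(pycurl.URL, "http://example.com/")  # placeholder URL
c.setopt(pycurl.WRITEFUNCTION, write_cb)
try:
    c.perform()
except pycurl.error as e:
    print(e)  # write-callback error when the callback cancels the transfer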
import pycurl
import StringIO

c = pycurl.Curl()
s = StringIO.StringIO()
c.setopt(pycurl.URL, url)  # url: the address to fetch, defined elsewhere
c.setopt(pycurl.HEADER, True)
c.setopt(pycurl.NOBODY, True)
c.setopt(pycurl.WRITEFUNCTION, s.write)
c.perform()
print(s.getvalue())
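Note that pycurl.NOBODY makes libcurl issue a HEAD-style request, so only the headers are transferred and there is no body to cancel in the first place.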