INFLATE with Python 3.x [duplicate]

I have a gzip file and I am trying to read it via Python as below:
import zlib
do = zlib.decompressobj(16+zlib.MAX_WBITS)
fh = open('abc.gz', 'rb')
cdata = fh.read()
fh.close()
data = do.decompress(cdata)
it throws this error:
zlib.error: Error -3 while decompressing: incorrect header check
How can I overcome it?

You have this error:
zlib.error: Error -3 while decompressing: incorrect header check
This is most likely because you are trying to check headers that are not there, e.g. your data follows RFC 1951 (raw deflate format) rather than RFC 1950 (zlib format) or RFC 1952 (gzip format).
choosing windowBits
But zlib can decompress all those formats:
to (de-)compress deflate format, use wbits = -zlib.MAX_WBITS
to (de-)compress zlib format, use wbits = zlib.MAX_WBITS
to (de-)compress gzip format, use wbits = zlib.MAX_WBITS | 16
See documentation in http://www.zlib.net/manual.html#Advanced (section inflateInit2)
examples
test data:
>>> import zlib
>>> deflate_compress = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
>>> zlib_compress = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS)
>>> gzip_compress = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS | 16)
>>>
>>> text = b'test'  # zlib operates on bytes, not str, in Python 3
>>> deflate_data = deflate_compress.compress(text) + deflate_compress.flush()
>>> zlib_data = zlib_compress.compress(text) + zlib_compress.flush()
>>> gzip_data = gzip_compress.compress(text) + gzip_compress.flush()
>>>
obvious test for zlib:
>>> zlib.decompress(zlib_data)
b'test'
test for deflate:
>>> zlib.decompress(deflate_data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
zlib.error: Error -3 while decompressing data: incorrect header check
>>> zlib.decompress(deflate_data, -zlib.MAX_WBITS)
b'test'
test for gzip:
>>> zlib.decompress(gzip_data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
zlib.error: Error -3 while decompressing data: incorrect header check
>>> zlib.decompress(gzip_data, zlib.MAX_WBITS | 16)
b'test'
the data is also compatible with the gzip module:
>>> import gzip
>>> import io
>>> fio = io.BytesIO(gzip_data)  # StringIO.StringIO on Python 2
>>> f = gzip.GzipFile(fileobj=fio)
>>> f.read()
b'test'
>>> f.close()
automatic header detection (zlib or gzip)
adding 32 to windowBits will trigger header detection:
>>> zlib.decompress(gzip_data, zlib.MAX_WBITS | 32)
b'test'
>>> zlib.decompress(zlib_data, zlib.MAX_WBITS | 32)
b'test'
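Applied to the original question, this suggests a minimal format-agnostic helper (decompress_any is a hypothetical name, not a zlib API): try automatic zlib/gzip header detection first, then fall back to raw deflate if no header is found.

```python
import zlib

def decompress_any(blob):
    """Decompress zlib, gzip, or raw deflate data (hypothetical helper)."""
    try:
        # wbits = MAX_WBITS | 32 auto-detects zlib and gzip headers
        return zlib.decompress(blob, zlib.MAX_WBITS | 32)
    except zlib.error:
        # no recognizable header: assume raw deflate (RFC 1951)
        return zlib.decompress(blob, -zlib.MAX_WBITS)
```

Note the fallback is heuristic: a raw deflate stream whose first bytes happen to look like a valid header would be misinterpreted.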
using gzip instead
or you can ignore zlib and use the gzip module directly; but remember that under the hood, gzip uses zlib.
fh = gzip.open('abc.gz', 'rb')
data = fh.read()
fh.close()

Update: dnozay's answer explains the problem and should be the accepted answer.
Try the gzip module; the code below is straight from the Python docs.
import gzip
f = gzip.open('/home/joe/file.txt.gz', 'rb')
file_content = f.read()
f.close()

I just solved the "incorrect header check" problem when uncompressing gzipped data.
You need to set -WindowBits => WANT_GZIP in your call to inflateInit2 (use the 2 version)
Yes, this can be very frustrating. A typical shallow reading of the documentation presents zlib as an API for gzip compression, but by default (i.e. without the gz* functions) it does not create or decompress the gzip format. You have to pass this not-very-prominently documented flag.

This does not answer the original question, but it may help someone else that ends up here.
The zlib.error: Error -3 while decompressing: incorrect header check also occurs in the example below:
b64_encoded_bytes = base64.b64encode(zlib.compress(b'abcde'))
encoded_bytes_representation = str(b64_encoded_bytes)  # this is the cause
zlib.decompress(base64.b64decode(encoded_bytes_representation))
The example is a minimal reproduction of something I encountered in some legacy Django code, where Base64 encoded bytes (from an HTTP POST) were being stored in a Django CharField (instead of a BinaryField).
When reading a CharField value from the database, str() is called on the value, without an explicit encoding, as can be seen in the Django source.
The str() documentation says:
If neither encoding nor errors is given, str(object) returns object.__str__(), which is the “informal” or nicely printable string representation of object. For string objects, this is the string itself. If object does not have a __str__() method, then str() falls back to returning repr(object).
So, in the example, we are inadvertently base64-decoding
"b'eJxLTEpOSQUABcgB8A=='"
instead of
b'eJxLTEpOSQUABcgB8A=='.
The zlib decompression in the example would succeed if an explicit encoding were used, e.g. str(b64_encoded_bytes, 'utf-8').
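A minimal sketch of the failure and the fix (the values here are illustrative):

```python
import base64
import zlib

b64_encoded_bytes = base64.b64encode(zlib.compress(b'abcde'))

# bare str() yields the repr, e.g. "b'eJx...'", which corrupts the round trip
bad = str(b64_encoded_bytes)
# an explicit encoding yields the actual Base64 text
good = str(b64_encoded_bytes, 'utf-8')

assert bad != good
assert zlib.decompress(base64.b64decode(good)) == b'abcde'
```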
NOTE specific to Django:
What's especially tricky: this issue only arises when retrieving a value from the database. See for example the test below, which passes (in Django 3.0.3):
class MyModelTests(TestCase):
    def test_bytes(self):
        my_model = MyModel.objects.create(data=b'abcde')
        self.assertIsInstance(my_model.data, bytes)  # issue does not arise
        my_model.refresh_from_db()
        self.assertIsInstance(my_model.data, str)  # issue does arise
where MyModel is
class MyModel(models.Model):
    data = models.CharField(max_length=100)

To decompress incomplete gzipped bytes that are in memory, the answer by dnozay is useful but it misses the zlib.decompressobj call which I found to be necessary:
incomplete_decompressed_content = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16).decompress(incomplete_gzipped_content)
Note that zlib.MAX_WBITS | 16 is 15 | 16 which is 31. For some background about wbits, see zlib.decompress.
Credit: the answer by Yann Vernier, which notes the zlib.decompressobj call.
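A minimal sketch of why the decompressobj call matters (the truncation below simply drops the 8-byte gzip trailer; any truncation point would do for illustration):

```python
import zlib

# build a complete gzip stream, then cut off its trailing CRC32/length bytes
gz = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS | 16)
complete = gz.compress(b'hello world ' * 100) + gz.flush()
incomplete = complete[:-8]  # the gzip trailer is 8 bytes

# zlib.decompress() insists on a complete stream and raises on truncated input;
# a decompressobj returns whatever it managed to recover
recovered = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16).decompress(incomplete)
assert recovered.startswith(b'hello world ')
```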

Funnily enough, I had that error when trying to work with the Stack Overflow API using Python.
I managed to get it working with the GzipFile object from the gzip module, roughly like this:
import gzip
gzip_file = gzip.GzipFile(fileobj=open('abc.gz', 'rb'))
file_contents = gzip_file.read()

My case was decompressing email messages stored in a Bullhorn database. The snippet is the following:
import pyodbc
import zlib
cn = pyodbc.connect('connection string')
cursor = cn.cursor()
cursor.execute('SELECT TOP(1) userMessageID, commentsCompressed FROM BULLHORN1.BH_UserMessage WHERE DATALENGTH(commentsCompressed) > 0 ')
for msg in cursor.fetchall():
    # magic in the second parameter: use a negative value for raw deflate format
    decompressedMessageBody = zlib.decompress(bytes(msg.commentsCompressed), -zlib.MAX_WBITS)
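As a self-contained illustration of the negative-wbits trick used above (the message body here is made up; real Bullhorn data would come from the query):

```python
import zlib

# raw deflate (RFC 1951) has no zlib/gzip header, hence the negative wbits
c = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
blob = c.compress(b'email message body') + c.flush()

assert zlib.decompress(blob, -zlib.MAX_WBITS) == b'email message body'
```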

Just add the header 'Accept-Encoding': 'identity':
import requests
requests.get('http://gett.bike/', headers={'Accept-Encoding': 'identity'})
https://github.com/requests/requests/issues/3849

Related

Loading a YAML-file throws an error in Python

I am reading in some YAML-files like this:
data = yaml.safe_load(pathtoyamlfile)
When doing so I get the following error:
yaml.constructor.ConstructorError: could not determine a constructor for the tag 'tag:yaml.org,2002:value'
When checking the line of the YAML file given in the error message, I noticed that there is always this key-value pair: simple: =.
Since the YAML files are autogenerated, I am not sure I can change the files themselves. Is there a way to read the data of the YAML files nonetheless?
It looks like you have hit this bug. There is a workaround suggested in the comments.
Given this content in example.yaml:
example: =
This code fails as you've described in your question:
import yaml

with open('example.yaml') as fd:
    data = yaml.safe_load(fd)
print(data)
But this works:
import yaml

yaml.SafeLoader.yaml_implicit_resolvers.pop('=')
with open('example.yaml') as fd:
    data = yaml.safe_load(fd)
print(data)
And outputs:
{'example': '='}
If you cannot change the input, you might be able to upgrade the library that you use:
import ruamel.yaml

yaml_str = """\
example: =
"""
yaml = ruamel.yaml.YAML()
data = yaml.load(yaml_str)
for key, value in data.items():
    print(f'{key}: {value}')
which gives:
example: =
Please be aware that ruamel.yaml still has this bug in its safe-mode loader (YAML(typ='safe')).

google.protobuf.message.DecodeError: Wrong wire type in tag Error in Protocol Buffer

I am trying to decrypt my data using Google protocol buffers in Python.
simple.proto file:
syntax = "proto3";
message SimpleMessage {
    string deviceID = 1;
    string timeStamp = 2;
    string data = 3;
}
After that, I generated the Python bindings using the protoc command:
protoc --proto_path=./ --python_out=./ simple.proto
My Python code is below:
import json
import simple_pb2
import base64
encryptedData = 'iOjEuMCwic2VxIjoxODEsInRtcyI6IjIwMjEtMDEtMjJUMTQ6MDY6MzJaIiwiZGlkIjoiUlFI'
t2 = bytes(encryptedData, encoding='utf8')
print(encryptedData)
data = base64.b64decode(encryptedData)
test = simple_pb2.SimpleMessage()
v1 = test.ParseFromString(data)
While executing the above code I get the error google.protobuf.message.DecodeError: Wrong wire type in tag.
What am I doing wrong? Can anyone help?
Your data is not "encrypted", it's just base64-encoded. If you use your example code and inspect your data variable, then you get:
import base64
data = base64.b64decode(b'eyJ2ZXIiOjEuMCwic2VxIjoxODEsInRtcyI6IjIwMjEtMDEtMjJUMTQ6MDY6MzJaIiwiZGlkIjoiUlFIVlRKRjAwMDExNzY2IiwiZG9wIjoxLjEwMDAwMDAyMzg0MTg1NzksImVyciI6MCwiZXZ0IjoiVE5UIiwiaWdzIjpmYWxzZSwibGF0IjoyMi45OTI0OTc5OSwibG5nIjo3Mi41Mzg3NDgyOTk5OTk5OTUsInNwZCI6MC4wfQo=')
print(data)
> b'{"ver":1.0,"seq":181,"tms":"2021-01-22T14:06:32Z","did":"RQHVTJF00011766","dop":1.1000000238418579,"err":0,"evt":"TNT","igs":false,"lat":22.99249799,"lng":72.538748299999995,"spd":0.0}\n'
Which is evidently a piece of JSON data, not a binary-serialized protocol buffer, which is what ParseFromString expects. Also, looking at the names and types of the fields, this payload just doesn't match the proto definition you've shown.
There are certainly ways to parse JSON into a proto, and even to control the field names in that transformation, but here not even the number of fields matches directly. So you first need to define what you want: which proto message would you expect this JSON object to represent?
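To convince yourself the payload is JSON rather than a serialized proto, decode it and parse it with the json module; the payload below is a shortened, hypothetical version of the decoded data shown above:

```python
import base64
import json

# hypothetical payload mirroring a few fields of the decoded data above
raw = base64.b64encode(b'{"ver": 1.0, "seq": 181, "evt": "TNT"}')

obj = json.loads(base64.b64decode(raw))
print(obj['seq'])  # -> 181
```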

Convert base64 encoded google service account key to JSON file using Python

Hello, I'm trying to convert a Google service account JSON key (contained in a base64-encoded field named privateKeyData in the file foo.json; more context here) into the actual JSON file (I need that format as Ansible only accepts that).
The foo.json file is obtained using this Google Python API method.
What I'm trying to do (though I am using Python) is also described in this thread, which by the way does not work for me (tried on OS X and Linux).
#!/usr/bin/env python3
import json
import base64

with open('/tmp/foo.json', 'r') as f:
    ymldict = json.load(f)

b64encodedCreds = ymldict['privateKeyData']
b64decodedBytes = base64.b64decode(b64encodedCreds, validate=True)
outputStr = b64decodedBytes
print(outputStr)

# issue
outputStr = b64decodedBytes.decode('UTF-8')
print(outputStr)
yields
./test.py
b'0\x82\t\xab\x02\x01\x030\x82\td\x06\t*\x86H\x86\xf7\r\x01\x07\x01\xa0\x82\tU\x04\x82\tQ0\x82\tM0\x82\x05q\x06\t*\x86H\x86\xf7\r\x01\x07\x01\xa0\x82\x05b\x04\x82\x05^0\x82\x05Z0\x82\x05V\x06\x0b*\x86H\x86\xf7\r\x01\x0c\n\x01\x02\xa0\x82\x#TRUNCATING HERE
Traceback (most recent call last):
  File "./test.py", line 17, in <module>
    outputStr = b64decodedBytes.decode('UTF-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 1: invalid start byte
I think I have run out of ideas and spent now more than a day on this :(
what am I doing wrong?
Your base64 decoding logic looks fine to me. The problem you are facing is probably due to a character encoding mismatch. The response body you received after calling create (your foo.json file) is probably not encoded with UTF-8. Check out the response header's Content-Type field. It should look something like this:
Content-Type: text/javascript; charset=Shift_JIS
Try decoding your base64-decoded bytes with the encoding given in the content type:
b64decodedBytes.decode('Shift_JIS')

StringIO() argument 1 must be string or buffer, not cStringIO.StringIO

I have a function that's reading a content object into a pandas dataframe.
import pandas as pd
from cStringIO import StringIO, InputType

def create_df(content):
    assert content, "No content was provided, can't create dataframe"
    if not isinstance(content, InputType):
        content = StringIO(content)
    content.seek(0)
    return pd.read_csv(content)
However I keep getting the error TypeError: StringIO() argument 1 must be string or buffer, not cStringIO.StringIO
I checked the incoming type of content prior to the StringIO() conversion inside the function and it's of type str. Without the conversion I get an error that the str object does not have a seek method. Any idea what's wrong here?
You only tested for InputType, the type of a cStringIO.StringIO() instance that supports reading. You appear to have the other type, OutputType, the type of an instance that supports writing:
>>> import cStringIO
>>> finput = cStringIO.StringIO('Hello world!') # the input type, it has data ready to read
>>> finput
<cStringIO.StringI object at 0x1034397a0>
>>> isinstance(finput, cStringIO.InputType)
True
>>> foutput = cStringIO.StringIO() # the output type, it is ready to receive data
>>> foutput
<cStringIO.StringO object at 0x102fb99d0>
>>> isinstance(foutput, cStringIO.OutputType)
True
You'd need to test for both types, just use a tuple of the two types as the second argument to isinstance():
from cStringIO import StringIO, InputType, OutputType

if not isinstance(content, (InputType, OutputType)):
    content = StringIO(content)
or, and this is the better option, test for read and seek attributes, so you can also support regular files:
if not (hasattr(content, 'read') and hasattr(content, 'seek')):
    # if not a file object, assume it is a string and wrap it in an in-memory file.
    content = StringIO(content)
or you could just test for strings and buffers (https://docs.python.org/2/library/functions.html#buffer), since those are the only two types that StringIO() can support:
if isinstance(content, (str, buffer)):
    # wrap strings in an in-memory file
    content = StringIO(content)
This has the added bonus that any other file object in the Python library, including compressed files and tempfile.SpooledTemporaryFile() and io.BytesIO() will also be accepted and work.
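On Python 3, where cStringIO no longer exists, the same duck-typing idea can be sketched with io.StringIO (as_readable is a hypothetical helper name, not from the original code):

```python
import io

def as_readable(content):
    # duck-type on read/seek: real files, io.StringIO and io.BytesIO pass through
    if not (hasattr(content, 'read') and hasattr(content, 'seek')):
        # assume it is a string and wrap it in an in-memory file
        content = io.StringIO(content)
    content.seek(0)
    return content
```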

