I am trying to decompress content of unknown size with python-lz4, using the following code:
import lz4.block

with open("compressed.msgpk", "rb") as f:
    content = f.read()
if content[0] == 1:
    uncompressed = lz4.block.decompress(content[1:])
but it always fails with
LZ4BlockError: Decompression failed: corrupt input or insufficient space in destination buffer. Error code: 58
I even tried specifying different, bigger uncompressed_size values as shown here https://python-lz4.readthedocs.io/en/stable/lz4.block.html but nothing worked.
And if it helps, the content I am trying to decompress was compressed using the lz4net C# library, via the method LZ4Codec.WrapHC(content): https://github.com/MiloszKrajewski/lz4net/blob/201ed085fed299523616bfd08776694cb61ae6b3/src/LZ4/LZ4Codec.cs#L562
The Unwrap method decodes a wrapped block, but the lz4.block.decompress method appears not to take that wrapping into account.
I'm not 100% familiar with the Python and C# libraries you are using, but from the docs I wonder if lz4.frame.decompress might be the method you are looking for?
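If so, a minimal sketch (this assumes the payload is actually in the standard LZ4 frame format, which stores its own content size in the frame header):

import lz4.frame

with open("compressed.msgpk", "rb") as f:
    data = f.read()

# lz4.frame.decompress parses the frame header itself, so no
# uncompressed_size hint is needed.
uncompressed = lz4.frame.decompress(data)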
Related
I have a workstation running Python 2 with a ROS environment which obtains a camera image from a robot and sends it over the network to a Python 3 machine using the standard socket library. I can't seem to correctly unpickle the OpenCV ndarray image.
I am able to successfully transfer simple data such as lists, but I encounter an error when trying to transfer an image.
On the Python2 system, I obtain my image in this way:
img = CvBridge().imgmsg_to_cv2(img_data, desired_encoding='bgr8') # Convert from ROS image to OpenCV image
Obtaining an ndarray.
I serialize it with:
data = pickle.dumps(img, protocol=0)
And I send it.
Back on the Python3 machine, I try to unpickle it using:
response = pickle.loads(data_in, encoding='latin1') # To read a Python2 dump
At this point, I obtain the following error:
_pickle.UnpicklingError: the STRING opcode argument must be quoted
The only other solutions that I have found address cases in which data had been transferred between Unix and Windows machines, which is not my case.
In VS Code (on Windows 10), open the file that is producing this error, change the line-ending option from CRLF to LF, and save the file; the UnpicklingError: the STRING opcode argument must be quoted error will then disappear.
An update on this topic, for possible future reference to whoever will experience the same problem.
The problem was not caused, as I initially thought, by a misconversion between the Python 2 and Python 3 pickled byte streams. It was caused, instead, by incomplete transmission of the data: I was interrupting the connection before all the packets had arrived.
This is the block of code that solved my problem:
data_in = b''
while True:
    block = self.socket.recv(4096)
    if block:
        data_in += block
    else:
        # recv() returns b'' once the sender closes the connection
        break
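Since recv() only returns an empty bytes object once the sender has closed its side of the connection, leaving the loop at that point guarantees that the entire pickled payload has arrived before pickle.loads() is called.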
In my django application, I provide a form that allows users to upload a file. The file can be in a variety of formats (Excel, CSV), come from a variety of platforms (Mac, Linux, Windows), and be encoded in a variety of encodings (ASCII, UTF-8).
For the purpose of this question, let's assume that I have a view which is receiving request.FILES['file'], which is an instance of InMemoryUploadedFile, called file. My problem is that InMemoryUploadedFile objects (like file):
Do not support UTF-8 encoding (I see a \xef\xbb\xbf at the beginning of the file, which as I understand it is the UTF-8 byte-order mark, i.e. a flag meaning 'this file is UTF-8').
Do not support universal newlines (which probably the majority of the files uploaded to this system will need).
Complicating the issue is that I wish to pass the file in to the python csv module, which does not natively support Unicode. I will happily accept answers that avoid this issue - once I get django playing nice with UTF-8 I'm sure I can bludgeon csv into doing the same. (Similarly, please ignore the requirement to support Excel - I am waiting until CSV works before I tackle parsing Excel files.)
I have tried using StringIO, mmap, codecs, and a wide variety of other ways of accessing the data in an InMemoryUploadedFile object. Each approach has yielded different errors, and none so far has been perfect. This shows some of the code that I feel came the closest:
import csv
import codecs

class CSVParser:
    def __init__(self, file):
        # 'file' is assumed to be an InMemoryUploadedFile object.
        dialect = csv.Sniffer().sniff(codecs.EncodedFile(file, "utf-8").read(1024))
        file.open()  # seek to 0
        self.reader = csv.reader(codecs.EncodedFile(file, "utf-8"),
                                 dialect=dialect)
        try:
            self.field_names = self.reader.next()
        except StopIteration:
            # The file was empty - this is not allowed.
            raise ValueError('Unrecognized format (empty file)')
        if len(self.field_names) <= 1:
            # This probably isn't a CSV file at all.
            # Note that the csv module will (incorrectly) parse ALL files, even
            # binary data. This will catch most such files.
            raise ValueError('Unrecognized format (too few columns)')
    # Additional methods snipped, unrelated to issue
Please note that I haven't spent too much time on the actual parsing algorithm so it may be wildly inefficient, right now I'm more concerned with getting encoding to work as expected.
The problem is that the results still do not come out properly encoded, despite being wrapped in the codecs.EncodedFile Unicode file wrapper.
EDIT: It turns out, the above code does in fact work. codecs.EncodedFile(file,"utf-8") is the ticket. It turns out the reason I thought it didn't work was that the terminal I was using does not support UTF-8. Live and learn!
As mentioned above, the code snippet I provided was in fact working as intended - the problem was with my terminal, and not with python encoding.
If your view needs to access a UTF-8 UploadedFile, you can just use utf8_file = codecs.EncodedFile(request.FILES['file_field'],"utf-8") to open a file object in the correct encoding.
I also noticed that, at least for InMemoryUploadedFiles, opening the file through the codecs.EncodedFile wrapper does NOT reset the seek() position of the file descriptor. To return to the beginning of the file (again, this may be InMemoryUploadedFile specific) I just used request.FILES['file_field'].open() to send the seek() position back to 0.
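Putting those two observations together, a minimal sketch of the pattern (the field name 'file_field' is just a placeholder):

import codecs

def my_view(request):
    uploaded = request.FILES['file_field']  # an InMemoryUploadedFile
    utf8_file = codecs.EncodedFile(uploaded, "utf-8")
    sample = utf8_file.read(1024)  # e.g. to feed csv.Sniffer
    uploaded.open()  # EncodedFile did not reset the position; seek back to 0
    # ... read the rest of utf8_file here ...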
I use the csv.DictReader and it appears to be working well. I attached my code snippet, but it is basically the same as another answer here.
import csv as csv_mod
import codecs
file = request.FILES['file']
dialect = csv_mod.Sniffer().sniff(codecs.EncodedFile(file, "utf-8").read(1024))
file.open()  # seek back to 0 after sniffing
csv = csv_mod.DictReader(codecs.EncodedFile(file, "utf-8"), dialect=dialect)
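Each row then comes back as a dict keyed by the sniffed header fields:

for row in csv:
    print(row)  # e.g. {'name': 'Alice', 'email': 'alice@example.com'} (hypothetical sample)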
For CSV and Excel upload to django, this site may help.
I downloaded a webpage in my python script.
In most cases, this works fine.
However, this one had a response header indicating GZIP encoding, and when I tried to print the source code of this web page, it came out as all symbols in my PuTTY terminal.
How do I decode this to regular text?
I use zlib to decompress gzipped content from web.
import zlib
import urllib.request

f = urllib.request.urlopen(url)  # 'url' is the address of the gzipped page
decompressed_data = zlib.decompress(f.read(), 16 + zlib.MAX_WBITS)
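(The 16 + zlib.MAX_WBITS argument tells zlib to expect a gzip header rather than a zlib one; see the wbits discussion in the answer below.)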
Decompress your byte stream using the built-in gzip module.
If you have any problems, do show the exact minimal code that you used, the exact error message and traceback, together with the result of print repr(your_byte_stream[:100])
Further information
1. For an explanation of the gzip/zlib/deflate confusion, read the "Other uses" section of this Wikipedia article.
2. It can be easier to use the zlib module than the gzip module if you have a string rather than a file. Unfortunately the Python docs are incomplete/wrong:
zlib.decompress(string[, wbits[, bufsize]])
...The absolute value of wbits is the base two logarithm of the size of the history buffer (the "window size") used when compressing data. Its absolute value should be between 8 and 15 for the most recent versions of the zlib library, larger values resulting in better compression at the expense of greater memory usage. The default value is 15. When wbits is negative, the standard gzip header is suppressed; this is an undocumented feature of the zlib library, used for compatibility with unzip's compression file format.
Firstly, 8 <= log2_window_size <= 15, with the meaning given above. Then what should be a separate arg is kludged on top:
arg == log2_window_size means assume string is in zlib format (RFC 1950; what the HTTP 1.1 RFC 2616 confusingly calls "deflate").
arg == -log2_window_size means assume string is in deflate format (RFC 1951; what people who didn't read the HTTP 1.1 RFC carefully actually implemented)
arg == 16 + log2_window_size means assume string is in gzip format (RFC 1952). So you can use 31.
The above information is documented in the zlib C library manual ... Ctrl-F search for windowBits.
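A minimal runnable sketch of the three cases:

import zlib

data = b'example payload' * 10

# zlib format (RFC 1950): positive wbits
zlib_blob = zlib.compress(data)
assert zlib.decompress(zlib_blob, 15) == data

# raw deflate (RFC 1951): negative wbits
c = zlib.compressobj(9, zlib.DEFLATED, -15)
raw_blob = c.compress(data) + c.flush()
assert zlib.decompress(raw_blob, -15) == data

# gzip format (RFC 1952): 16 + window size, i.e. wbits=31
c = zlib.compressobj(9, zlib.DEFLATED, 16 + 15)
gzip_blob = c.compress(data) + c.flush()
assert zlib.decompress(gzip_blob, 16 + 15) == data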
For Python 3
Try out this:
import gzip
from urllib.request import build_opener, Request

opener = build_opener()
request = Request('http://example.com')  # build the request for the gzipped page
fetch = opener.open(request)  # basically get a response object
data = gzip.decompress(fetch.read())
data = str(data, 'utf-8')
I use something like this:

import urllib2
from cStringIO import StringIO
from gzip import GzipFile

def fetch(request):
    f = urllib2.urlopen(request)
    data = f.read()
    try:
        # if the body is gzipped, unwrap it; otherwise keep the raw bytes
        data = GzipFile('', 'r', 0, StringIO(data)).read()
    except IOError:
        pass
    return data
If you use the Requests module, then you don't need to use any other modules because the gzip and deflate transfer-encodings are automatically decoded for you.
Example:
>>> import requests
>>> custom_header = {'Accept-Encoding': 'gzip'}
>>> response = requests.get('https://api.github.com/events', headers=custom_header)
>>> response.headers
{'Content-Encoding': 'gzip',...}
>>> response.text
'[{"id":"9134429130","type":"IssuesEvent","actor":{"id":3287933,...
The .text property of the response is for reading the content as decoded text.
The .content property of the response is for reading the content as raw bytes.
See the Binary Response Content section on docs.python-requests.org
Similar to Shatu's answer for python3, but arranged a little differently:
import gzip
import json
from urllib.request import Request, urlopen

headers = {'Accept-Encoding': 'gzip'}  # placeholder request headers
s = Request("https://someplace.com", None, headers)
r = urlopen(s, None, 180).read()
try:
    r = gzip.decompress(r)
except OSError:
    pass
result = json.loads(r.decode())
This method allows for wrapping the gzip.decompress() in a try/except to capture and pass the OSError that results in situations where you may get mixed compressed and uncompressed data. Some small strings actually get bigger if they are compressed, so the plain data is sent instead.
This version is simple and avoids reading the whole file up front by not calling the read() method. Instead it provides a file-stream-like object that behaves just like a normal file stream.
import gzip
from urllib.request import urlopen
my_gzip_url = 'http://my_url.gz'
my_gzip_stream = urlopen(my_gzip_url)
my_stream = gzip.open(my_gzip_stream, 'r')
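A short usage sketch (process is a hypothetical per-line callback; each line comes back as bytes because gzip's 'r' mode is binary):

for line in my_stream:
    process(line)  # 'process' is a hypothetical callback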
None of these answers worked out of the box using Python 3. Here is what worked for me to fetch a page and decode the gzipped response:
import requests
import gzip
response = requests.get('your-url-here')
data = str(gzip.decompress(response.content), 'utf-8')
print(data) # decoded contents of page
You can use urllib3 to easily decode gzip.
urllib3.response.decode_gzip(response.data)
I've a memory- and disk-limited environment where I need to decompress the contents of a gzip file sent to me in string-based chunks (over xmlrpc binary transfer). However, both zlib.decompress() and zlib.decompressobj()/decompress() barf over the gzip header. I've tried offsetting past the gzip header (documented here), but still haven't managed to avoid the barf. The gzip library itself only seems to support decompressing from files.
The following snippet gives a simplified illustration of what I would like to do (except in real life the buffer will be filled from xmlrpc, rather than reading from a local file):
#! /usr/bin/env python
import zlib

CHUNKSIZE = 1000
d = zlib.decompressobj()
f = open('23046-8.txt.gz', 'rb')
buffer = f.read(CHUNKSIZE)
while buffer:
    outstr = d.decompress(buffer)
    print(outstr)
    buffer = f.read(CHUNKSIZE)
outstr = d.flush()
print(outstr)
f.close()
Unfortunately, as I said, this barfs with:
Traceback (most recent call last):
File "./test.py", line 13, in <module>
outstr = d.decompress(buffer)
zlib.error: Error -3 while decompressing: incorrect header check
Theoretically, I could feed my xmlrpc-sourced data into a StringIO and then use that as a fileobj for gzip.GzipFile(), however, in real life, I don't have memory available to hold the entire file contents in memory as well as the decompressed data. I really do need to process it chunk-by-chunk.
The fall-back would be to change the compression of my xmlrpc-sourced data from gzip to plain zlib, but since that impacts other sub-systems I'd prefer to avoid it if possible.
Any ideas?
gzip and zlib use slightly different headers.
See How can I decompress a gzip stream with zlib?
Try d = zlib.decompressobj(16+zlib.MAX_WBITS).
And you might try changing your chunk size to a power of 2 (say CHUNKSIZE=1024) for possible performance reasons.
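Applied to the snippet from the question, a sketch of the fixed loop (same file name as in the question):

import zlib

CHUNKSIZE = 1024
d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # accept a gzip header
with open('23046-8.txt.gz', 'rb') as f:
    buffer = f.read(CHUNKSIZE)
    while buffer:
        print(d.decompress(buffer))
        buffer = f.read(CHUNKSIZE)
print(d.flush())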
I've got a more detailed answer here: https://stackoverflow.com/a/22310760/1733117
d = zlib.decompressobj(zlib.MAX_WBITS|32)
Per the documentation, this automatically detects the header (zlib or gzip).
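A quick runnable check of the automatic detection (here with a zlib-format blob; a gzip blob would work the same way):

import zlib

blob = zlib.compress(b'hello world')
d = zlib.decompressobj(zlib.MAX_WBITS | 32)  # auto-detect zlib or gzip header
print(d.decompress(blob) + d.flush())  # b'hello world'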
Okay so I have some data streams compressed by python's (2.6) zlib.compress() function. When I try to decompress them, some of them won't decompress (zlib error -5, which seems to be a "buffer error", no idea what to make of that). At first, I thought I was done, but I realized that all the ones I couldn't decompress started with 0x78DA (the working ones were 0x789C), and I looked around and it seems to be a different kind of zlib compression -- the magic number changes depending on the compression used. What can I use to decompress the files? Am I hosed?
According to RFC 1950, the difference between the "OK" 0x789C and the "bad" 0x78DA is in the FLEVEL bit-field:
FLEVEL (Compression level)
These flags are available for use by specific compression
methods. The "deflate" method (CM = 8) sets these flags as
follows:
0 - compressor used fastest algorithm
1 - compressor used fast algorithm
2 - compressor used default algorithm
3 - compressor used maximum compression, slowest algorithm
The information in FLEVEL is not needed for decompression; it
is there to indicate if recompression might be worthwhile.
"OK" uses 2, "bad" uses 3. So that difference in itself is not a problem.
To get any further, you might consider supplying the following information for each of compressing and (attempted) decompressing: what platform, what version of Python, what version of the zlib library, what was the actual code used to call the zlib module. Also supply the full traceback and error message from the failing decompression attempts. Have you tried to decompress the failing files with any other zlib-reading software? With what results? Please clarify what you have to work with: Does "Am I hosed?" mean that you don't have access to the original data? How did it get from a stream to a file? What guarantee do you have that the data was not mangled in transmission?
UPDATE Some observations based on partial clarifications published in your self-answer:
You are using Windows. Windows distinguishes between binary mode and text mode when reading and writing files. When reading in text mode, Python 2.x changes '\r\n' to '\n', and changes '\n' to '\r\n' when writing. This is not a good idea when dealing with non-text data. Worse, when reading in text mode, '\x1a' aka Ctrl-Z is treated as end-of-file.
To compress a file:
# imports and other superstructure left as an exercise
str_object1 = open('my_log_file', 'rb').read()
str_object2 = zlib.compress(str_object1, 9)
f = open('compressed_file', 'wb')
f.write(str_object2)
f.close()
To decompress a file:
str_object1 = open('compressed_file', 'rb').read()
str_object2 = zlib.decompress(str_object1)
f = open('my_recovered_log_file', 'wb')
f.write(str_object2)
f.close()
Aside: Better to use the gzip module, which saves you having to think about nasties like text mode, at the cost of a few bytes for the extra header info.
If you have been using 'rb' and 'wb' in your compression code but not in your decompression code [unlikely?], you are not hosed, you just need to flesh out the above decompression code and go for it.
Note carefully the use of "may", "should", etc in the following untested ideas.
If you have not been using 'rb' and 'wb' in your compression code, the probability that you have hosed yourself is rather high.
If there were any instances of '\x1a' in your original file, any data after the first such is lost -- but in that case it shouldn't fail on decompression (IOW this scenario doesn't match your symptoms).
If a Ctrl-Z was generated by zlib itself, this should cause an early EOF upon attempted decompression, which should of course cause an exception. In this case you may be able to gingerly reverse the process by reading the compressed file in binary mode and then replacing '\r\n' with '\n' [i.e. simulating text mode without the Ctrl-Z -> EOF gimmick]. Decompress the result. Edit: Write the result out in TEXT mode. End edit.
UPDATE 2 I can reproduce your symptoms -- with ANY level 1 to 9 -- with the following script:
import zlib, sys
fn = sys.argv[1]
level = int(sys.argv[2])
s1 = open(fn).read() # TEXT mode
s2 = zlib.compress(s1, level)
f = open(fn + '-ct', 'w') # TEXT mode
f.write(s2)
f.close()
# try to decompress in text mode
s1 = open(fn + '-ct').read() # TEXT mode
s2 = zlib.decompress(s1) # error -5
f = open(fn + '-dtt', 'w')
f.write(s2)
f.close()
Note: you will need to use a reasonably large text file (I used an 80kb source file) to ensure that the compression result will contain a '\x1a'.
I can recover with this script:
import zlib, sys
fn = sys.argv[1]
# (1) reverse the text-mode write
# can't use text-mode read as it will stop at Ctrl-Z
s1 = open(fn, 'rb').read() # BINARY mode
s1 = s1.replace('\r\n', '\n')
# (2) reverse the compression
s2 = zlib.decompress(s1)
# (3) reverse the text mode read
f = open(fn + '-fixed', 'w') # TEXT mode
f.write(s2)
f.close()
NOTE: If there is a '\x1a' aka Ctrl-Z byte in the original file, and the file is read in text mode, that byte and all following bytes will NOT be included in the compressed file, and thus can NOT be recovered. For a text file (e.g. source code), this is no loss at all. For a binary file, you are most likely hosed.
Update 3 [following late revelation that there's an encryption/decryption layer involved in the problem]:
The "Error -5" message indicates that the data that you are trying to decompress has been mangled since it was compressed. If it's not caused by using text mode on the files, suspicion obviously(?) falls on your decryption and encryption wrappers. If you want help, you need to divulge the source of those wrappers. In fact what you should try to do is (like I did) put together a small script that reproduces the problem on more than one input file. Secondly (like I did) see whether you can reverse the process under what conditions. If you want help with the second stage, you need to divulge the problem-reproduction script.
I was looking for
python -c 'import sys,zlib;sys.stdout.write(zlib.decompress(sys.stdin.read()))'
I wrote it myself, based on the answers about zlib decompression in Python.
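On Python 3 the same one-liner needs the binary stdio buffers (a sketch):

python -c 'import sys,zlib;sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))'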
Okay, sorry I wasn't clear enough. This is win32, Python 2.6.2. I'm afraid I can't find the zlib file, but it's whatever is included in the win32 binary release. And I don't have access to the original data -- I've been compressing my log files, and I'd like to get them back. As far as other software goes, I've naively tried 7zip, but of course it failed, because it's zlib, not gzip (I couldn't find any software to decompress zlib streams directly). I can't give a carbon copy of the traceback now, but it was (traced back to zlib.decompress(data)) zlib.error: Error: -3. Also, to be clear, these are static files, not streams as I made it sound earlier (so no transmission errors). And I'm afraid again I don't have the code, but I know I used zlib.compress(data, 9) (i.e. at the highest compression level -- although, interestingly, it seems that not all the zlib output is 78DA as you might expect since I put it on the highest level) and just zlib.decompress().
Ok sorry about my last post, I didn't have everything. And I can't edit my post because I didn't use OpenID. Anyways, here's some data:
1) Decompression traceback:
Traceback (most recent call last):
File "<my file>", line 5, in <module>
zlib.decompress(data)
zlib.error: Error -5 while decompressing data
2) Compression code:
# here, assume 'data' holds the bytes to be compressed and stored
data = encrypt(zlib.compress(data, 9))  # 'encrypt' is a short wrapper around PyCrypto AES encryption
f = open("somefile", 'wb')
f.write(data)
f.close()
3) Decompression code:
f = open("somefile", 'rb')
data = f.read()
f.close()
zlib.decompress(decrypt(data))  # this yields the error in (1)