Okay, so I have some data streams compressed by Python's (2.6) zlib.compress() function. When I try to decompress them, some of them won't decompress (zlib error -5, which seems to be a "buffer error"; no idea what to make of that). At first I thought I was done for, but I realized that all the ones I couldn't decompress started with 0x78DA (the working ones started with 0x789C). I looked around, and it seems to be a different kind of zlib compression -- the magic number changes depending on the compression level used. What can I use to decompress the files? Am I hosed?
According to RFC 1950, the difference between the "OK" 0x789C and the "bad" 0x78DA is in the FLEVEL bit field:
FLEVEL (Compression level)
These flags are available for use by specific compression
methods. The "deflate" method (CM = 8) sets these flags as
follows:
0 - compressor used fastest algorithm
1 - compressor used fast algorithm
2 - compressor used default algorithm
3 - compressor used maximum compression, slowest algorithm
The information in FLEVEL is not needed for decompression; it
is there to indicate if recompression might be worthwhile.
"OK" uses 2, "bad" uses 3. So that difference in itself is not a problem.
To get any further, you might consider supplying the following information for both the compression side and the (attempted) decompression side: what platform, what version of Python, what version of the zlib library, and the actual code used to call the zlib module. Also supply the full traceback and error message from the failing decompression attempts. Have you tried to decompress the failing files with any other zlib-reading software? With what results? Please clarify what you have to work with: does "Am I hosed?" mean that you don't have access to the original data? How did it get from a stream to a file? What guarantee do you have that the data was not mangled in transmission?
UPDATE: Some observations, based on partial clarifications published in your self-answer:
You are using Windows. Windows distinguishes between binary mode and text mode when reading and writing files. When reading in text mode, Python 2.x changes '\r\n' to '\n', and changes '\n' to '\r\n' when writing. This is not a good idea when dealing with non-text data. Worse, when reading in text mode, '\x1a' aka Ctrl-Z is treated as end-of-file.
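A tiny demonstration of both hazards (an illustrative sketch, Windows and Python 2.x assumed; the file name is made up):
data = 'a\x1ab\r\nc'
open('demo.dat', 'w').write(data)             # text mode: '\n' written as '\r\n'
print(repr(open('demo.dat').read()))          # text mode read stops at Ctrl-Z: 'a'
print(repr(open('demo.dat', 'rb').read()))    # binary mode sees it all: 'a\x1ab\r\r\nc'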
To compress a file:
# imports and other superstructure left as an exercise
str_object1 = open('my_log_file', 'rb').read()
str_object2 = zlib.compress(str_object1, 9)
f = open('compressed_file', 'wb')
f.write(str_object2)
f.close()
To decompress a file:
str_object1 = open('compressed_file', 'rb').read()
str_object2 = zlib.decompress(str_object1)
f = open('my_recovered_log_file', 'wb')
f.write(str_object2)
f.close()
Aside: it is better to use the gzip module, which saves you from having to think about nasties like text mode, at the cost of a few bytes for the extra header info.
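For instance (a minimal sketch of the gzip alternative; file names are made up):
import gzip

# gzip handles the compressed file in binary form for you, so text-mode
# mangling can't happen.
g = gzip.open('compressed_file.gz', 'wb')
g.write(open('my_log_file', 'rb').read())
g.close()

recovered = gzip.open('compressed_file.gz', 'rb').read()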
If you have been using 'rb' and 'wb' in your compression code but not in your decompression code [unlikely?], you are not hosed, you just need to flesh out the above decompression code and go for it.
Note carefully the use of "may", "should", etc in the following untested ideas.
If you have not been using 'rb' and 'wb' in your compression code, the probability that you have hosed yourself is rather high.
If there were any instances of '\x1a' in your original file, any data after the first such is lost -- but in that case it shouldn't fail on decompression (IOW this scenario doesn't match your symptoms).
If a Ctrl-Z was generated by zlib itself, this should cause an early EOF upon attempted decompression, which should of course raise an exception. In this case you may be able to gingerly reverse the process: read the compressed file in binary mode, then replace '\r\n' with '\n' [i.e. simulate text mode without the Ctrl-Z -> EOF gimmick], and decompress the result. Edit: Write the result out in TEXT mode. End edit
UPDATE 2: I can reproduce your symptoms -- with ANY level from 1 to 9 -- with the following script:
import zlib, sys
fn = sys.argv[1]
level = int(sys.argv[2])
s1 = open(fn).read() # TEXT mode
s2 = zlib.compress(s1, level)
f = open(fn + '-ct', 'w') # TEXT mode
f.write(s2)
f.close()
# try to decompress in text mode
s1 = open(fn + '-ct').read() # TEXT mode
s2 = zlib.decompress(s1) # error -5
f = open(fn + '-dtt', 'w')
f.write(s2)
f.close()
Note: you will need to use a reasonably large text file (I used an 80 KB source file) to ensure that the compressed result will contain a '\x1a'.
I can recover with this script:
import zlib, sys
fn = sys.argv[1]
# (1) reverse the text-mode write
# can't use text-mode read as it will stop at Ctrl-Z
s1 = open(fn, 'rb').read() # BINARY mode
s1 = s1.replace('\r\n', '\n')
# (2) reverse the compression
s2 = zlib.decompress(s1)
# (3) reverse the text mode read
f = open(fn + '-fixed', 'w') # TEXT mode
f.write(s2)
f.close()
NOTE: If there is a '\x1a' aka Ctrl-Z byte in the original file, and the file is read in text mode, that byte and all following bytes will NOT be included in the compressed file, and thus can NOT be recovered. For a text file (e.g. source code), this is no loss at all. For a binary file, you are most likely hosed.
Update 3 [following the late revelation that there's an encryption/decryption layer involved in the problem]:
The "Error -5" message indicates that the data that you are trying to decompress has been mangled since it was compressed. If it's not caused by using text mode on the files, suspicion obviously(?) falls on your decryption and encryption wrappers. If you want help, you need to divulge the source of those wrappers. In fact what you should try to do is (like I did) put together a small script that reproduces the problem on more than one input file. Secondly (like I did) see whether you can reverse the process under what conditions. If you want help with the second stage, you need to divulge the problem-reproduction script.
I was looking for a one-liner:
python -c 'import sys,zlib;sys.stdout.write(zlib.decompress(sys.stdin.read()))'
I wrote it myself, based on the answers here about zlib decompression in Python.
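The same thing as a small script, using binary mode throughout so Windows can't mangle anything (input and output file names are arguments):
import sys, zlib

data = open(sys.argv[1], 'rb').read()
out = open(sys.argv[2], 'wb')
out.write(zlib.decompress(data))
out.close()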
Okay, sorry I wasn't clear enough. This is Win32, Python 2.6.2. I'm afraid I can't find the zlib file, but it's whatever is included in the Win32 binary release. And I don't have access to the original data -- I've been compressing my log files, and I'd like to get them back. As for other software, I naively tried 7-Zip, but of course it failed, because it's zlib, not gzip (I couldn't find any software to decompress zlib streams directly). I can't give a carbon copy of the traceback now, but it was (traced back to zlib.decompress(data)) zlib.error: Error: -3. Also, to be clear, these are static files, not streams as I made it sound earlier (so no transmission errors). And I'm afraid again that I don't have the code, but I know I used zlib.compress(data, 9) (i.e. at the highest compression level -- although, interestingly, it seems that not all of the zlib output is 0x78DA as you might expect, since I used the highest level) and just zlib.decompress().
OK, sorry about my last post -- I didn't have everything. And I can't edit my post because I didn't use OpenID. Anyway, here's some data:
1) Decompression traceback:
Traceback (most recent call last):
File "<my file>", line 5, in <module>
zlib.decompress(data)
zlib.error: Error -5 while decompressing data
2) Compression code:
# here you can assume 'data' is the data to be compressed/stored
data = encrypt(zlib.compress(data, 9))  # a short wrapper around PyCrypto AES encryption
f = open("somefile", 'wb')
f.write(data)
f.close()
3) Decompression code:
f = open("somefile", 'rb')
data = f.read()
f.close()
zlib.decompress(decrypt(data))  # this yields the error in (1)
I am trying to decompress content of unknown size using python-lz4, with the following code:
with open("compressed.msgpk", "rb") as f:
content = f.read()
if content[0] == 1:
uncompressed = lz4.block.decompress(content[1:])
but it always fails with
LZ4BlockError: Decompression failed: corrupt input or insufficient space in destination buffer. Error code: 58
I even tried specifying different/bigger sizes, as shown here: https://python-lz4.readthedocs.io/en/stable/lz4.block.html, but nothing worked.
And if it helps, the content I am trying to decompress was compressed using the lz4net C# library, using the method LZ4Codec.WrapHC(content): https://github.com/MiloszKrajewski/lz4net/blob/201ed085fed299523616bfd08776694cb61ae6b3/src/LZ4/LZ4Codec.cs#L562
The unwrap method decodes a wrapped block, but the lz4.block.decompress method appears not to take wrapping into account.
I'm not 100% familiar with the Python and C# libraries you are using, but I wonder (from the docs) whether lz4.frame.decompress might be the method you are looking for.
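If the C# side really does emit the standard LZ4 frame format, a minimal attempt would be (untested against your data; lz4.frame ships with python-lz4):
import lz4.frame

with open("compressed.msgpk", "rb") as f:
    content = f.read()

# Only works if the data carries a standard LZ4 frame header; lz4net's
# Wrap/WrapHC format may not match it.
uncompressed = lz4.frame.decompress(content)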
So I have a Python script that uses the pyserial library to send a file over serial to another computer. I wrote some code to calculate the MD5 checksum of the file before and after it is sent over serial, and I have encountered some problems.
Example:
I sent a simple file named third.txt containing a list of the numbers 1 through 10. A simple file, nothing fancy or large. The checksum of the file before transmitting is completely different from the checksum of the file after transmitting on the other computer, even though the files are clearly the same.
I checked to see if there was something wrong with my code by simply moving the file over on a USB drive and doing the checksum calculations that way. This time it worked.
Any ideas why this is happening and how I might possibly fix it?
Here is my checksum code before sending. This is not the exact code, but basically what I did.
<<Code that waits for command from client>>
with open(file_loc) as file_to_read:
    data = file_to_read.read()
    md5a = hashlib.md5(data).hexdigest()
ser.write('\n' + md5a + '\n')
Here is my checksum code after sending.
with open(file_loc) as file_to_read:
    data = file_to_read.read()
    md5b = hashlib.md5(data).hexdigest()
print('Sending Checksum Command')
ser.write("\n<<SENDCHECKSUM>>\n")
md5a = ser.readline()
print(md5a)
print(md5b)
if md5a == md5b:
    print("Correct File Transmission")
else:
    print("The checksum indicated incorrect file transmission, please check.")
ser.flush()
Yes, opening a file in text mode can potentially result in different data being read, as newlines are translated for you from the platform-native format to \n. Thus, files containing \r\n will give you a different checksum when read on Windows vs. a POSIX platform.
Open files in binary mode instead:
with open(file_loc, 'rb') as file_to_read:
Note that the same applies when writing a file. If you receive data from a POSIX system using \n line endings, and you write this to a file opened for writing in text mode on Windows, you'll end up with \r\n line endings in the written file.
If you are using Python 3, you are complicating matters some more. When you open files in text mode, you are translating the data from encoded bytes to decoded Unicode values. Which codec is used for that can also differ from OS to OS, and even from machine to machine. The default is locale-defined (using locale.getpreferredencoding(False)), so even when the data is decodable by the default locale's codec, you can get very different results from reading the file with a different codec. You really want to ensure you use the same codec by setting it explicitly, or better still, open files in binary mode.
Since hashlib requires you to feed it byte strings, this is less of a problem when calculating the digest (you'd have run into that problem and at least have had to think about codecs there), but it applies to file transfers too: writing to a text-mode file will encode the data with the default codec.
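A sketch of a hashing helper that behaves the same on both ends (chunked, so large files don't need to fit in memory):
import hashlib

def md5_of_file(path, chunk_size=8192):
    # Binary mode: no newline translation, no Ctrl-Z surprises, no codecs.
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()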
It appears that a write() immediately following a read() on a file opened in r+ (or r+b) mode in Windows doesn't update the file.
Assume there is a file testfile.txt in the current directory with the following contents:
This is a test file.
I execute the following code:
with open("testfile.txt", "r+b") as fd:
print fd.read(4)
fd.write("----")
I would expect the code to print This and update the file contents to this:
This----a test file.
This works fine on at least Linux. However, when I run it on Windows then the message is displayed correctly, but the file isn't altered - it's like the write() is being ignored. If I call tell() on the filehandle it shows that the position has been updated (it's 4 before the write() and 8 afterwards), but no change to the file.
However, if I put an explicit fd.seek(4) just before the write() line then everything works as I'd expect.
Does anybody know the reason for this behaviour under Windows?
For reference I'm using Python 2.7.3 on Windows 7 with an NTFS partition.
EDIT
In response to comments, I tried both r+b and rb+ - the official Python docs seem to imply the former is canonical.
I put calls to fd.flush() in various places, and placing one between the read() and the write() like this:
with open("testfile.txt", "r+b") as fd:
print fd.read(4)
fd.flush()
fd.write("----")
... yields the following interesting error:
IOError: [Errno 0] Error
EDIT 2
Indirectly, that addition of a flush() helped, because it led me to this post describing a similar problem. If one of the commenters on it is correct, it's a bug in the underlying Windows C library.
Python's file operations follow the libc conventions, as internally they are implemented using C file I/O functions.
Quoting from the fopen man page (or the fopen page on cplusplus.com):
For files open for appending (those which include a "+" sign), on
which both input and output operations are allowed, the stream should
be flushed (fflush) or repositioned (fseek, fsetpos, rewind) between
either a writing operation followed by a reading operation or a
reading operation which did not reach the end-of-file followed by a
writing operation.
So to summarize: if you need to read a file after writing, you need to flush the buffer; and a write operation after a read should be preceded by a seek, e.g. fd.seek(0, os.SEEK_CUR).
So just change your code snippet to
with open("test1.txt", "r+b") as fd:
print fd.read(4)
fd.seek(0, os.SEEK_CUR)
fd.write("----")
The behavior is consistent with how a similar C program would behave
#include <cstdio>

int main()
{
    char buffer[5] = {0};
    FILE *fp = fopen("D:\\Temp\\test1.txt", "rb+");
    fread(buffer, sizeof(char), 4, fp);
    printf("%s\n", buffer);
    /* without fseek, the file would not be updated */
    fseek(fp, 0, SEEK_CUR);
    fwrite("----", sizeof(char), 4, fp);
    fclose(fp);
    return 0;
}
It appears that this is due to the behaviour of the underlying Windows libraries (which I personally regard as erroneous) and nothing wrong with Python. On adding a flush() call between reading and writing (which is apparently good practice), I got an IOError with a zero errno, which is the same issue as discussed in this blog post.
From that post I found this Python issue which mentions the problem and says that the seek() call is actually the best workaround, along with a flush() every time you change from reading to writing.
All that taken into account, it seems the best way to write the code above such that it successfully runs on Windows is:
with open("testfile.txt", "r+b") as fd:
print fd.read(4)
fd.flush()
fd.seek(4)
fd.write("----")
Might be something to bear in mind for anybody attempting to write portable code.
Have you tried flushing?
fd.flush()
It is OS-dependent, as write() uses the filesystem's caching mechanism.
Is it possible that the implementation misinterprets "r+b"? AFAIK, "rb+" is for reading and writing in binary mode.
I've got a memory- and disk-limited environment where I need to decompress the contents of a gzip file sent to me in string-based chunks (over an xmlrpc binary transfer). However, both zlib.decompress() and zlib.decompressobj()/decompress() barf over the gzip header. I've tried offsetting past the gzip header (documented here), but still haven't managed to avoid the barf. The gzip library itself only seems to support decompressing from files.
The following snippet gives a simplified illustration of what I would like to do (except in real life the buffer will be filled from xmlrpc, rather than reading from a local file):
#! /usr/bin/env python
import zlib

CHUNKSIZE = 1000
d = zlib.decompressobj()

f = open('23046-8.txt.gz', 'rb')
buffer = f.read(CHUNKSIZE)
while buffer:
    outstr = d.decompress(buffer)
    print(outstr)
    buffer = f.read(CHUNKSIZE)
outstr = d.flush()
print(outstr)
f.close()
Unfortunately, as I said, this barfs with:
Traceback (most recent call last):
File "./test.py", line 13, in <module>
outstr = d.decompress(buffer)
zlib.error: Error -3 while decompressing: incorrect header check
Theoretically, I could feed my xmlrpc-sourced data into a StringIO and then use that as a fileobj for gzip.GzipFile(). However, in real life, I don't have enough memory available to hold the entire file contents as well as the decompressed data. I really do need to process it chunk by chunk.
The fall-back would be to change the compression of my xmlrpc-sourced data from gzip to plain zlib, but since that impacts other sub-systems I'd prefer to avoid it if possible.
Any ideas?
gzip and zlib use slightly different headers.
See How can I decompress a gzip stream with zlib?
Try d = zlib.decompressobj(16+zlib.MAX_WBITS).
And you might try changing your chunk size to a power of 2 (say CHUNKSIZE=1024) for possible performance reasons.
I've got a more detailed answer here: https://stackoverflow.com/a/22310760/1733117
d = zlib.decompressobj(zlib.MAX_WBITS|32)
Per the documentation, this automatically detects the header (zlib or gzip).
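Putting that together with the chunked loop from the question (a sketch; the header auto-detection needs zlib 1.2.x or later underneath):
import zlib

CHUNKSIZE = 1024
d = zlib.decompressobj(32 + zlib.MAX_WBITS)  # auto-detect zlib or gzip header

f = open('23046-8.txt.gz', 'rb')
buffer = f.read(CHUNKSIZE)
while buffer:
    print(d.decompress(buffer))
    buffer = f.read(CHUNKSIZE)
print(d.flush())
f.close()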
I'm trying to automate downloading of some text files from a z/OS PDS, using Python and ftplib.
Since the host files are EBCDIC, I can't simply use FTP.retrbinary().
FTP.retrlines(), when used with open(file, 'w').writelines as its callback, doesn't, of course, provide EOLs.
So, for starters, I've come up with this piece of code which "looks OK to me", but as I'm a relative Python noob, can anyone suggest a better approach? Obviously, to keep this question simple, this isn't the final, bells-and-whistles thing.
Many thanks.
#!python.exe
from ftplib import FTP
class xfile(file):
    def writelineswitheol(self, sequence):
        for s in sequence:
            self.write(s + "\r\n")

sess = FTP("zos.server.to.be", "myid", "mypassword")
sess.sendcmd("site sbd=(IBM-1047,ISO8859-1)")
sess.cwd("'FOO.BAR.PDS'")
a = sess.nlst("RTB*")
for i in a:
    sess.retrlines("RETR " + i, xfile(i, 'w').writelineswitheol)
sess.quit()
Update: Python 3.0, platform is MinGW under Windows XP.
z/OS PDSs have a fixed record structure, rather than relying on line endings as record separators. However, the z/OS FTP server, when transmitting in text mode, provides the record endings, which retrlines() strips off.
Closing update:
Here's my revised solution, which will be the basis for ongoing development (removing built-in passwords, for example):
import ftplib
import os
from sys import exc_info
sess = ftplib.FTP("undisclosed.server.com", "userid", "password")
sess.sendcmd("site sbd=(IBM-1047,ISO8859-1)")
for dir in ["ASM", "ASML", "ASMM", "C", "CPP", "DLLA", "DLLC", "DLMC", "GEN", "HDR", "MAC"]:
sess.cwd("'ZLTALM.PREP.%s'" % dir)
try:
filelist = sess.nlst()
except ftplib.error_perm as x:
if (x.args[0][:3] != '550'):
raise
else:
try:
os.mkdir(dir)
except:
continue
for hostfile in filelist:
lines = []
sess.retrlines("RETR "+hostfile, lines.append)
pcfile = open("%s/%s"% (dir,hostfile), 'w')
for line in lines:
pcfile.write(line+"\n")
pcfile.close()
print ("Done: " + dir)
sess.quit()
My thanks to both John and Vinay
Just came across this question as I was trying to figure out how to recursively download datasets from z/OS. I've been using a simple Python script for years now to download EBCDIC files from the mainframe. It effectively just does this:
def writeline(line):
    file.write(line + "\n")

file = open(filename, "w")
ftp.retrlines("retr " + filename, writeline)
You should be able to download the file as a binary (using retrbinary) and use the codecs module to convert from EBCDIC to whatever output encoding you want. You should know the specific EBCDIC code page being used on the z/OS system (e.g. cp500). If the files are small, you could even do something like (for a conversion to UTF-8):
file = open(ebcdic_filename, "rb")
data = file.read()
converted = data.decode("cp500").encode("utf8")
file = open(utf8_filename, "wb")
file.write(converted)
file.close()
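If the files are not small, a streaming variant of the same idea might look like this (a sketch; cp500 is assumed, as above):
import codecs

decoder = codecs.getincrementaldecoder("cp500")()
src = open(ebcdic_filename, "rb")
dst = open(utf8_filename, "wb")
while True:
    chunk = src.read(8192)
    if not chunk:
        break
    dst.write(decoder.decode(chunk).encode("utf8"))
dst.write(decoder.decode(b"", final=True).encode("utf8"))
src.close()
dst.close()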
Update: If you need to use retrlines to get the lines and your lines are coming back in the correct encoding, your approach will not work, because the callback is called once for each line. So in the callback, sequence will be the line, and your for loop will write individual characters in the line to the output, each on its own line. So you probably want to do self.write(sequence + "\r\n") rather than the for loop. It still doesn't feel especially right to subclass file just to add this utility method, though -- it probably needs to be in a different class in your bells-and-whistles version.
Your writelineswitheol method appends '\r\n' instead of '\n' and then writes the result to a file opened in text mode. The effect, no matter what platform you are running on, will be an unwanted '\r'. Just append '\n' and you will get the appropriate line ending.
Proper error handling should not be relegated to a "bells and whistles" version. You should set up your callback so that your file open() is in a try/except and retains a reference to the output file handle, your write call is in a try/except, and you have a callback_obj.close() method which you use when retrlines() returns, to explicitly file_handle.close() (in a try/except). That way you get explicit error handling, e.g. messages like "can't (open|write to|close) file X because Y", AND you save having to think about when your files are going to be implicitly closed and whether you risk running out of file handles.
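A bare-bones sketch of that callback object (names are illustrative, error reporting abbreviated):
class RetrCallback(object):
    def __init__(self, path):
        self.path = path
        try:
            self.f = open(path, 'w')
        except IOError as e:
            raise RuntimeError("can't open file %s because %s" % (path, e))
    def __call__(self, line):
        try:
            self.f.write(line + "\n")
        except IOError as e:
            raise RuntimeError("can't write to file %s because %s" % (self.path, e))
    def close(self):
        try:
            self.f.close()
        except IOError as e:
            raise RuntimeError("can't close file %s because %s" % (self.path, e))

cb = RetrCallback(hostfile)
sess.retrlines("RETR " + hostfile, cb)
cb.close()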
Python 3.x ftplib.FTP.retrlines() should give you str objects which are in effect Unicode strings, and you will need to encode them before you write them -- unless the default encoding is latin1 which would be rather unusual for a Windows box. You should have test files with (1) all possible 256 bytes (2) all bytes that are valid in the expected EBCDIC codepage.
[a few "sanitation" remarks]
You should consider upgrading your Python from 3.0 (a "proof of concept" release) to 3.1.
To facilitate better understanding of your code, use "i" as an identifier only as a sequence index, and only if you irredeemably acquired the habit from FORTRAN three or more decades ago :-)
Two of the problems discovered so far (appending a line terminator to each character, and the wrong line terminator) would have shown up the first time you tested the code.
Using ftplib's retrlines to download a file from z/OS, each line arrives with no '\n'.
This is different from the Windows ftp command 'get xxx'.
We can rewrite the function 'retrlines' as 'retrlines_zos' in ftplib.py:
just copy the whole body of retrlines and change the 'callback' line to:
...
callback(line + "\n")
...
I tested it and it worked.
You want a lambda function and a callback, like so:
def writeLineCallback(line, file):
    file.write(line + "\n")

ftpcommand = "RETR {}{}{}".format("'", zOsFile, "'")
filename = "newfilename"
with open(filename, 'w') as file:
    callback_lambda = lambda x: writeLineCallback(x, file)
    ftp.retrlines(ftpcommand, callback_lambda)
This will download the file 'zOsFile' and write it to 'newfilename'.