I've been supplied with a file containing multiple individual streams of compressed XML. The compressed file is 833 MB.
If I try to decompress it as a single object, I only get the first stream (about 19 KB).
I've modified the following code, supplied as an answer to an older question, to decompress each stream and write it to a file:
import zlib

outfile = open('output.xml', 'w')

def zipstreams(filename):
    """Return all zip streams and their positions in file."""
    with open(filename, 'rb') as fh:
        data = fh.read()
    i = 0
    print "got it"
    while i < len(data):
        try:
            zo = zlib.decompressobj()
            dat = zo.decompress(data[i:])
            outfile.write(dat)
            zo.flush()
            i += len(data[i:]) - len(zo.unused_data)
        except zlib.error:
            i += 1
    outfile.close()

zipstreams('payload')
This code runs and produces the desired result (all the XML data decompressed to a single file). The problem is that it takes several days to run!
Even though there are tens of thousands of streams in the compressed file, it still seems like this should be a much faster process. Roughly 8 days to decompress 833 MB (an estimated 3 GB raw) suggests that I'm doing something very wrong.
Is there another way to do this more efficiently, or is the slow speed the result of a read-decompress-write-repeat bottleneck that I'm stuck with?
Thanks for any pointers or suggestions you have!
It's hard to say very much without more specific knowledge of the file format you're actually dealing with, but it's clear that your algorithm's handling of substrings is quadratic, which is not a good thing when you've got tens of thousands of them. So let's see what we know:
You say that the vendor states that they are
using the standard zlib compression library. These are the same compression routines on which the gzip utilities are built.
From this we can conclude that the component streams are in raw zlib format, and are not encapsulated in a gzip wrapper (or a PKZIP archive, or whatever). The authoritative documentation on the ZLIB format is here: https://www.rfc-editor.org/rfc/rfc1950
So let's assume that your file is exactly as you describe: A 32-byte header, followed by raw ZLIB streams concatenated together, without any other stuff in between. (Edit: That's not the case, after all).
Python's zlib documentation provides a Decompress class that is actually pretty well suited to churning through your file. It includes an attribute unused_data whose documentation states clearly that:
The only way to determine where a string of compressed data ends is by actually decompressing it. This means that when compressed data is contained in part of a larger file, you can only find the end of it by reading data and feeding it followed by some non-empty string into a decompression object's decompress() method until the unused_data attribute is no longer the empty string.
So, this is what you can do: Write a loop that reads through data, say, one block at a time (no need to even read the entire 800 MB file into memory). Push each block to the Decompress object, and check the unused_data attribute. When it becomes non-empty, you've got a complete object. Write it to disk, create a new decompress object and initialize it with the unused_data from the last one. This just might work (untested, so check for correctness).
Edit: Since you do have other data in your data stream, I've added a routine that aligns to the next ZLIB start. You'll need to find and fill in the two-byte sequence that identifies a ZLIB stream in your data. (Feel free to use your old code to discover it.) While there's no fixed ZLIB header in general, it should be the same for each stream since it consists of protocol options and flags, which are presumably the same for the entire run.
import zlib

# FILL IN: ZHEAD is two bytes with the actual ZLIB settings in the input
ZHEAD = CMF + FLG

def findstart(header, buf, source):
    """Find `header` in str `buf`, reading more from `source` if necessary"""
    while buf.find(header) == -1:
        more = source.read(2**12)
        if len(more) == 0:  # EOF without finding the header
            return ''
        buf += more
    offset = buf.find(header)
    return buf[offset:]
You can then advance to the start of the next stream. I've added a try/except pair since the same byte sequence might occur outside a stream:
source = open(datafile, 'rb')
outfile = open('output.xml', 'w')
skip_ = source.read(32)  # Skip non-zlib header

buf = ''
while True:
    decomp = zlib.decompressobj()
    # Find the start of the next stream
    buf = findstart(ZHEAD, buf, source)
    try:
        stream = decomp.decompress(buf)
    except zlib.error:
        print "Spurious match(?) at output offset %d." % outfile.tell(),
        print "Skipping 2 bytes"
        buf = buf[2:]
        continue

    # Read until zlib decides it's seen a complete file
    while decomp.unused_data == '':
        block = source.read(2**12)
        if len(block) > 0:
            stream += decomp.decompress(block)
        else:
            break  # We've reached EOF

    outfile.write(stream)
    buf = decomp.unused_data  # Save for the next stream
    if len(block) == 0:
        break  # EOF

outfile.close()
PS 1. If I were you I'd write each XML stream into a separate file.
PS 2. You can test whatever you do on the first MB of your file, till you get adequate performance.
Decompressing 833 MB should take about 30 seconds on a modern processor (e.g. a 2 GHz i7). So yes, you are doing something very wrong. Attempting to decompress at every byte offset to see if you get an error is part of the problem, though not all of it. There are better ways to find the compressed data. Ideally you should find or figure out the format. Alternatively, you can search for valid zlib headers using the RFC 1950 specification, though you may get false positives.
More significant may be that you are reading the entire 833 MB into memory at once, and decompressing the 3 GB to memory, possibly in large pieces each time. How much memory does your machine have? You may be thrashing to virtual memory.
If the code you show works, then the data is not zipped. zip is a specific file format, normally with the .zip extension, that encapsulates raw deflate data within a structure of local and central directory information intended to reconstruct a directory in a file system. You must have something rather different, since your code is looking for and apparently finding zlib streams. What is the format you have? Where did you get it? How is it documented? Can you provide a dump of, say, the first 100 bytes?
The way this should be done is not to read the whole thing into memory and decompress entire streams at once, also into memory. Instead, make use of the zlib.decompressobj interface, which allows you to provide a piece at a time and get the resulting available decompressed data. You can read the input file in much smaller pieces, find the compressed data streams by using the documented format or by looking for zlib (RFC 1950) headers, and then run those a chunk at a time through the decompression object, writing out the decompressed data where you want it. decomp.unused_data can be used to detect the end of the compressed stream (as in the example you found).
From what you've described in the comments, it sounds like they're concatenating together the individual files they would have sent you separately. Which means each one has a 32-byte header you need to skip.
If you don't skip those headers, it would probably have exactly the behavior you described: If you get lucky, you'll get 32 invalid-header errors and then successfully parse the next stream. If you get unlucky, the 32 bytes of garbage will look like the start of a real stream, and you'll waste a whole lot of time parsing some arbitrary number of bytes until you finally get a decoding error. (If you get really unlucky, it'll actually decode successfully, giving you a giant hunk of garbage and eating up one or more subsequent streams.)
So, try just skipping 32 bytes after each stream finishes.
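A sketch of that (the 32-byte header length is an assumption from the question, so adjust it to the real format; unlike a chunked reader, this holds the whole input in memory):

```python
import zlib

def split_with_headers(data, header_len=32):
    """Split concatenated [header][zlib stream] records found in `data`.
    The 32-byte header length is an assumption taken from the question."""
    streams = []
    i = 0
    while i < len(data):
        i += header_len                    # skip the fixed-size header
        decomp = zlib.decompressobj()
        streams.append(decomp.decompress(data[i:]))
        # everything after the end of this stream is left in unused_data
        i = len(data) - len(decomp.unused_data)
    return streams
```

Because each stream's end is found by zlib itself, there's no byte-by-byte probing.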
Or, if you have a more reliable way of detecting the start of the next stream (this is why I told you to print out the offsets and look at the data in a hex editor, and why alexis told you to look at the zlib spec), do that instead.
Related
I need to modify a gzipped tab-delimited file. I can read from input and write modified reads to an output file as:
output = tempfile.NamedTemporaryFile(mode="w", delete=False)
with gzip.open(input, "rb") as in_file,\
        gzip.open(output, "wb") as out_file:
    for l in in_file:
        split_line = l.split("\t")
        if split_line[0] == "hello":
            split_line[0] = "hi"
        out_file.write("\t".join(split_line))
The gzipped files I work with are in 100s of GB scale, hence rewriting the entire file to a different file only for modifying a subset is not ideal. Therefore, I am interested in a solution that modifies the file in-place (i.e., modifying the original file as you traverse through it).
For normal gzip files, certainly not. Your only option would be to read the gzip file up to where you want the modification, make the modification, and recompress the rest. Some attention is required where you make the cut: you need to remove the deflate block that includes the cut, recompress from there, and append the remaining deflate blocks at the correct bit position.
You could, in theory, prepare a large gzip file so that such modifications could be done in place. You would need to break up the gzip file into independent blocks, where the history at the start of each block is discarded. (pigz does this with the --independent option.) You would also need to insert several empty blocks or other filler space at the end of each independent block to allow for variations in the length of the independent block so that the modified result can fit back into the exact same number of bytes. There are five-byte and two-byte empty blocks you can insert, that in combination should be able to accommodate any small number of byte count difference, if you have enough of them.
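To illustrate the independence idea with Python's zlib (a sketch of the principle, not of pigz itself): a full flush both byte-aligns the output and discards the compressor's history, so the data after the flush point decompresses on its own.

```python
import zlib

# Raw deflate (wbits=-15), two chunks separated by a full flush.
comp = zlib.compressobj(9, zlib.DEFLATED, -15)
part1 = comp.compress(b'a' * 1000) + comp.flush(zlib.Z_FULL_FLUSH)
part2 = comp.compress(b'b' * 1000) + comp.flush(zlib.Z_FINISH)

# The whole stream decompresses normally...
whole = zlib.decompressobj(-15).decompress(part1 + part2)

# ...but part2 also decompresses independently, since the full
# flush reset the history at that boundary.
alone = zlib.decompressobj(-15).decompress(part2)
```

An independent block in the scheme above is exactly a span between two such flush points.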
You would need a separate index of the locations of these independent blocks, otherwise you would be spending time searching for them, again making the time dependent on the length of the file.
In order to not significantly impact the overall compression ratio of the gzip file, you would want the independent blocks to be on the order of 128K bytes uncompressed or larger. Any modification would require recompression of an entire independent block.
You would also need to update the CRC and length at the end of the gzip file. I think that there's a way to update the CRC without recomputing it for the whole file, but I'd have to think about it. It is certainly possible if the length of the file doesn't change, but if you are inserting or deleting bytes, it gets trickier.
This would all be a large amount of work to try to put a square gzip peg into a round random modification hole. It suggests that you are simply using the wrong format for the application. Find a different format for what you want to do.
We are writing a class that calls read on an io object, likely (but not necessarily) a file or a pipe. That stream contains a large number of small binary messages of format (length, blob), not fixed length, but in the order of 100 bytes.
The two obvious options are along the lines of:

while True:
    length = unpack(f.read(4))
    blob = f.read(length)

which works fine, but is slow, or:

while True:
    buffer = f.read(8192)
    for blob in unpack_buffer(buffer):
        ...
    # handle the remainder

which is fast, but when reading from a pipe in live streaming, it will not return until there's 8K of data there, so it's not good for latency on sporadic input.
We have considered non-blocking reads, but that path is problematic as there isn't always an fd, and we're also reluctant to mess with the parameters of a file object passed in by the user.
Is there some way to make read() return a partial buffer < 8192 bytes as soon as some data is present? I believe the underlying Unix read() syscall does this, but not fread().
I would like to read a fixed number of bytes from stdin of a python script and output it to one temporary file batch by batch for further processing. Therefore, when the first N number of bytes are passed to the temp file, I want it to execute the subsequent scripts and then read the next N bytes from stdin. I am not sure what to iterate over in the top loop before While true. This is an example of what I tried.
import sys

While True:
    data = sys.stdin.read(2330049)  # Number of bytes I would like to read in one iteration
    if data == "":
        break
    file1 = open('temp.fil', 'wb')  # temp file
    file1.write(data)
    file1.close()
    # further_processing on temp.fil (I think this can only be done after file1 is closed)
Two quick suggestions:
You should pretty much never do While True
Python3
Are you trying to read from a file? or from actual standard in? (Like the output of a script piped to this?)
Here is an answer I think will work for you, if you are reading from a file, that I pieced together from some other answers listed at the bottom:
with open("in-file", "rb") as in_file, open("out-file", "wb") as out_file:
    data = in_file.read(2330049)
    while data != "":
        out_file.write(data)
        data = in_file.read(2330049)
If you want to read from actual standard in, I would read all of it in, then split it up by bytes. The only way this won't work is if you are trying to deal with constant streaming data...which I would most definitely not use standard in for.
The .encode('UTF-8') and .decode('hex') methods might be of use to you also.
Sources: https://stackoverflow.com/a/1035360/957648 & Python, how to read bytes from file and save it?
I currently have a script that requests a file via a requests.post(). The server sends me two files in the same stream. The way I am processing this right now is to save it all as one file, open it again, split the file based on a regex string, save it as a new file, and delete the old one. The file is large enough that I have to stream=True in my requests.post() statement and write it in chunks.
I was hoping that maybe someone knows a better way to issue the post or work with the data coming back so that the files are stored correctly the first time? Or is this the best way to do it?
----Adding current code----
if not os.path.exists(output_path):
    os.makedirs(output_path)

memFile = requests.post(url, data=etree.tostring(etXML), headers=headers, stream=True)

outFile = open('output/tempfile', 'wb')
for chunk in memFile.iter_content(chunk_size=512):
    if chunk:
        outFile.write(chunk)

f = open('output/tempfile', 'rb').read().split('\r\n\r\n')

arf = open('output/recording.arf', 'wb')
arf.write(f[3])
os.remove('output/tempfile')
Okay, I was bored and wanted to figure out the best way to do this. Turns out that my initial way in the comments above was overly complicated (unless considering some scenario where time is absolutely critical, or memory is severely constrained). A buffer is a much simpler way to achieve this, so long as you take two or more blocks at a time. This code emulates the question's scenario for demonstration.
Note: depending on the regex engine implementation, this is more efficient and requires significantly fewer str/byte conversions, as using a regex requires casting each block of bytes to a string. The approach below requires no string conversions, instead operating solely on the bytes returned from request.post(), and in turn writing those same bytes to file, without conversions.
from pprint import pprint
someString = '''I currently have a script that requests a file via a requests.post(). The server sends me two files in the same stream. The way I am processing this right now is to save it all as one file, open it again, split the file based on a regex string, save it as a new file, and delete the old one. The file is large enough that I have to stream=True in my requests.post() statement and write it in chunks.
I was hoping that maybe someone knows a better way to issue the post or work with the data coming back so that the files are stored correctly the first time? Or is this the best way to do it?'''
n = 16

# emulate a stream by creating 37 blocks of 16 bytes
byteBlocks = [bytearray(someString[i:i+n]) for i in range(0, len(someString), n)]
pprint(byteBlocks)

# this string is present twice, but both times it is split across two bytearrays
matchBytes = bytearray('requests.post()')

# our buffer
buff = bytearray()

count = 0
for bb in byteBlocks:
    buff += bb
    count += 1

    # every two blocks
    if (count % 2) == 0:
        if count == 2:
            start = 0
        else:
            start = len(matchBytes)

        # check the bytes starting from block (n - 2 - len(matchBytes)) to (len(buff) - len(matchBytes))
        # this will check all the bytes only once...
        if matchBytes in buff[((count-2)*n)-start : len(buff)-len(matchBytes)]:
            print('Match starting at index:', buff.index(matchBytes), 'ending at:', buff.index(matchBytes)+len(matchBytes))
Update:
So, given the updated question, this code may remove the need to create a temporary file. I haven't been able to test it exactly, as I don't have a similar response, but you should be able to figure out any bugs yourself.
Since you aren't actually working with a stream directly, i.e. you're given the finished response object from requests.post(), you don't have to worry about using chunks in the networking sense. The "chunks" that requests refers to are really its way of dishing out the bytes, all of which it already has. You can access the bytes directly using r.raw.read(n), but as far as I can tell the response object doesn't allow you to see how many bytes there are in r.raw, so you're more or less forced to use the iter_content method.
Anyway, this code should copy all the bytes from the request object into a string, then you can search and split that string as before.
memFile = requests.post(url, data=etree.tostring(etXML), headers=headers, stream=True)

match = '\r\n\r\n'
data = ''

for chunk in memFile.iter_content(chunk_size=512):
    if chunk:
        data += chunk

f = data.split(match)

arf = open('output/recording.arf', 'wb')
arf.write(f[3])
arf.close()
Using gzip, tell() returns the offset in the uncompressed file.
In order to show a progress bar, I want to know the original (uncompressed) size of the file.
Is there an easy way to find out?
Uncompressed size is stored in the last 4 bytes of the gzip file. We can read the binary data and convert it to an int. (This will only work for files under 4GB)
import struct

def getuncompressedsize(filename):
    with open(filename, 'rb') as f:
        f.seek(-4, 2)
        return struct.unpack('<I', f.read(4))[0]  # ISIZE is little-endian
The gzip format specifies a field called ISIZE that:
This contains the size of the original (uncompressed) input data modulo 2^32.
In gzip.py, which I assume is what you're using for gzip support, there is a method called _read_eof defined as such:
def _read_eof(self):
    # We've read to the end of the file, so we have to rewind in order
    # to reread the 8 bytes containing the CRC and the file size.
    # We check that the computed CRC and size of the
    # uncompressed data matches the stored values.  Note that the size
    # stored is the true file size mod 2**32.
    self.fileobj.seek(-8, 1)
    crc32 = read32(self.fileobj)
    isize = U32(read32(self.fileobj))   # may exceed 2GB
    if U32(crc32) != U32(self.crc):
        raise IOError, "CRC check failed"
    elif isize != LOWU32(self.size):
        raise IOError, "Incorrect length of data produced"
There you can see that the ISIZE field is being read, but only to compare it to self.size for error detection. This should mean that GzipFile.size stores the actual uncompressed size. However, I don't think it's exposed publicly, so you might have to hack it to expose it. Not so sure, sorry.
I just looked all of this up right now, and I haven't tried it so I could be wrong. I hope this is of some use to you. Sorry if I misunderstood your question.
Despite what the other answers say, the last four bytes are not a reliable way to get the uncompressed length of a gzip file. First, there may be multiple members in the gzip file, so that would only be the length of the last member. Second, the length may be more than 4 GB, in which case the last four bytes represent the length modulo 2^32, not the length.
However, for what you want, there is no need to get the uncompressed length. You can instead base your progress bar on the amount of input consumed, as compared to the length of the gzip file, which is readily obtained. For typical homogeneous data, that progress bar would show exactly the same thing as a progress bar based on the uncompressed data.
Unix way: use "gunzip -l file.gz" via subprocess.call / os.popen, capture and parse its output.
The last 4 bytes of the .gz hold the original size of the file
I am not sure about performance, but this could be achieved without knowing gzip magic by using:
import gzip
import io

with gzip.open(filepath, 'rb') as file_obj:
    file_size = file_obj.seek(0, io.SEEK_END)
This should also work for other (compressed) stream readers like bz2 or the plain open.
EDIT:
as suggested in the comments, 2 in second line was replaced by io.SEEK_END, which is definitely more readable and probably more future-proof.
EDIT:
Works only in Python 3.
f = gzip.open(filename)

# kludge - report uncompressed file position so progress bars
# don't go to 400%
f.tell = f.fileobj.tell
Looking at the source for the gzip module, I see that the underlying file object for GzipFile seems to be fileobj. So:
mygzipfile = gzip.GzipFile()
...
mygzipfile.fileobj.tell()
?
Maybe it would be good to do some sanity checking before doing that, like checking that the attribute exists with hasattr.
Not exactly a public API, but...
GzipFile.size stores the uncompressed size, but it's only incremented when you read the file, so you should prefer len(fd.read()) instead of the non-public GzipFile.size.
Here is a Python 2 version of #norok's solution:
import gzip, io

with gzip.open("yourfile.gz", "rb") as f:
    prev, cur = 0, f.seek(1000000, io.SEEK_CUR)
    while prev < cur:
        prev, cur = cur, f.seek(1000000, io.SEEK_CUR)
    filesize = cur
Note that just like f.seek(0, io.SEEK_END) this is slow for large files, but it will overcome the 4GB size limitation of the faster solutions suggested here
import gzip

File = gzip.open("input.gz", "r")
File.fileobj.seek(-4, 2)
Size = gzip.read32(File.fileobj)  # read32 is a private Python 2 helper