I have a file with multiple zlib-compressed binary data, and the offsets and lengths are unknown. Below, I have a script that gets the offset of the byte after the final zlib compressed data, which is what I need. The script works; however, in order to get the length of the original zlib compressed data, I have to decompress it and re-compress it. Is there a better way to get the length without having to re-compress it? Here's my code:
import zlib

def inflate(infile):
    data = infile.read()
    offset = 0
    while offset < len(data):
        window = data[offset : offset + 2]
        for key, value in zlib_headers.items():
            if window == key:
                decomp_obj = zlib.decompressobj()
                yield key, offset, decomp_obj.decompress(data[offset:])
        if offset == len(data):
            break
        offset += 1

if __name__ == "__main__":
    zlib_headers = {b"\x78\x01": 3, b"\x78\x9c": 6, b"\x78\xda": 9}
    with open("input_file", "rb") as infile:
        *_, last = inflate(infile)
        key, offset, data = last
        start_offset = offset + len(zlib.compress(data, zlib_headers[key]))
        print(start_offset)
Recompressing it won't even work. The recompression could be a different length. There is no assurance that the result will be the same, unless you control the compression process that made the compressed data in the first place, and you can guarantee that it uses the same compression code, same version of that code, and exactly the same settings. There is not even enough information in the zlib header to determine what the compression level was. By the way, your list of possible zlib headers is incomplete. There are 29 others it could be. The easiest and most reliable way to determine whether or not a zlib stream starts at the current byte is to begin decompressing until you either get an error or it completes. The first thing the decompressor will do is check the zlib header for validity.
To find the length of the compressed data, feed decomp_obj.decompress() a fixed number of bytes at a time, e.g. 65536 bytes. Keep track of how many bytes you have fed it. Stop when decomp_obj.eof is true, which indicates that the end of the zlib stream has been reached. At that point decomp_obj.unused_data holds the bytes you fed it that came after the zlib stream. Subtract the length of that leftover from the total amount fed, and you have the length of the zlib stream.
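The chunk-feeding procedure described above can be sketched roughly as follows. This is only an illustration: the function name zlib_stream_length and its parameters are made up for the example, and it assumes the whole file is already in memory as a bytes object, as in the question.

```python
import zlib

def zlib_stream_length(data, start=0, chunk_size=65536):
    """Return the compressed length of the zlib stream beginning at data[start].

    Hypothetical helper: raises zlib.error if no valid stream starts there
    and ValueError if the data ends before the stream does.
    """
    decomp = zlib.decompressobj()
    fed = 0          # total number of bytes handed to the decompressor
    pos = start
    while not decomp.eof and pos < len(data):
        chunk = data[pos:pos + chunk_size]
        decomp.decompress(chunk)   # the decompressed output is not needed here
        fed += len(chunk)
        pos += len(chunk)
    if not decomp.eof:
        raise ValueError("data ended before the zlib stream was complete")
    # unused_data holds whatever was fed past the end of the stream
    return fed - len(decomp.unused_data)
```

The offset of the byte after the stream is then start plus the returned length, with no recompression involved.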
I am encountering issues ungzipping chunks of bytes that I am reading from S3 using the iter_chunks() method from boto3. The strategy of ungzipping the file chunk-by-chunk originates from this issue.
The code is as follows:
dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
for chunk in app.s3_client.get_object(Bucket=bucket, Key=key)["Body"].iter_chunks(2 ** 19):
    data = dec.decompress(chunk)
    print(len(chunk), len(data))
# 524288 65505
# 524288 0
# 524288 0
# ...
This code initially prints out the value of 65505 followed thereafter by 0 for every subsequent iteration. My understanding is that this code should ungzip each compressed chunk, and then print the length of the uncompressed version.
Is there something I'm missing?
It seems like your input file is block gzip (bgzip http://www.htslib.org/doc/bgzip.html ) because you have a 65k block of data decoded.
GZip files can be concatenated together ( see https://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage) and Block GZip uses this to concatenate blocks of the same file, so that by using an associated index only the specific block containing information of interest has to be decoded.
So to stream decode a block gzip file, you need to use the leftover data from one block to start a new one. E.g.
# source is a block gzip file see http://www.htslib.org/doc/bgzip.html
dec = zlib.decompressobj(32+zlib.MAX_WBITS)
for chunk in raw:
    # decompress this chunk of data
    data = dec.decompress(chunk)
    # bgzip is a concatenation of gzip files
    # if there is stuff in this chunk beyond the current block
    # it needs to be processed
    while len(dec.unused_data):
        # end of one block
        leftovers = dec.unused_data
        # create a new decompressor
        dec = zlib.decompressobj(32+zlib.MAX_WBITS)
        # decompress the leftovers
        data = data+dec.decompress(leftovers)
    # TODO handle data
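For reference, here is a self-contained variant of the same idea written as a generator. It is a sketch, not the code from the answer: the name iter_decompressed is invented for the example, and it assumes the input consists only of complete gzip members placed back to back with nothing else after them.

```python
import zlib

def iter_decompressed(chunks):
    """Yield decompressed pieces from an iterable of byte chunks that together
    form one or more concatenated gzip members (hypothetical helper)."""
    dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
    for chunk in chunks:
        buf = chunk
        while buf:
            if dec.eof:
                # the previous member is finished, so start a fresh decompressor
                dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
            out = dec.decompress(buf)
            if out:
                yield out
            # empty unless the member ended inside buf
            buf = dec.unused_data
```

Checking dec.eof at the top of the inner loop also covers the case where a member ends exactly on a chunk boundary, so the next chunk is never fed to a finished decompressor.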
I currently have a script that requests a file via a requests.post(). The server sends me two files in the same stream. The way I am processing this right now is to save it all as one file, open it again, split the file based on a regex string, save it as a new file, and delete the old one. The file is large enough that I have to stream=True in my requests.post() statement and write it in chunks.
I was hoping that maybe someone knows a better way to issue the post or work with the data coming back so that the files are stored correctly the first time? Or is this the best way to do it?
----Adding current code----
if not os.path.exists(output_path):
    os.makedirs(output_path)
memFile = requests.post(url, data=etree.tostring(etXML), headers=headers, stream=True)
outFile = open('output/tempfile', 'wb')
for chunk in memFile.iter_content(chunk_size=512):
    if chunk:
        outFile.write(chunk)
f = open('output/tempfile', 'rb').read().split('\r\n\r\n')
arf = open('output/recording.arf', 'wb')
arf.write(f[3])
os.remove('output/tempfile')
Okay, I was bored and wanted to figure out the best way to do this. It turns out that my initial approach in the comments above was overly complicated (unless time is absolutely critical or memory is severely constrained). A buffer is a much simpler way to achieve this, so long as you take two or more blocks at a time. This code emulates the question's scenario for demonstration.
Note: depending on the regex engine implementation, this is more efficient and requires significantly fewer str/byte conversions, since using a regex requires casting each block of bytes to a string. The approach below requires no string conversions: it operates solely on the bytes returned from requests.post() and writes those same bytes to file.
from pprint import pprint
someString = '''I currently have a script that requests a file via a requests.post(). The server sends me two files in the same stream. The way I am processing this right now is to save it all as one file, open it again, split the file based on a regex string, save it as a new file, and delete the old one. The file is large enough that I have to stream=True in my requests.post() statement and write it in chunks.
I was hoping that maybe someone knows a better way to issue the post or work with the data coming back so that the files are stored correctly the first time? Or is this the best way to do it?'''
n = 16
# emulate a stream by creating 37 blocks of 16 bytes
byteBlocks = [bytearray(someString[i:i+n].encode()) for i in range(0, len(someString), n)]
pprint(byteBlocks)
# this string is present twice, but both times it is split across two bytearrays
matchBytes = bytearray(b'requests.post()')
# our buffer
buff = bytearray()
count = 0
for bb in byteBlocks:
    buff += bb
    count += 1
    # every two blocks
    if (count % 2) == 0:
        if count == 2:
            start = 0
        else:
            start = len(matchBytes)
        # check the bytes starting from block (n -2 -len(matchBytes)) to (len(buff) -len(matchBytes))
        # this will check all the bytes only once...
        if matchBytes in buff[ ((count-2)*n)-start : len(buff)-len(matchBytes) ]:
            print('Match starting at index:', buff.index(matchBytes), 'ending at:', buff.index(matchBytes)+len(matchBytes))
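A more compact way to express the buffer idea is to carry over only the last len(needle) - 1 bytes between chunks, which is enough to catch a match that straddles a chunk boundary. The sketch below is an alternative to the code above, not a transcription of it; find_in_chunks is a name invented for this example.

```python
def find_in_chunks(chunks, needle):
    """Return the absolute offset of every occurrence of `needle` in a byte
    stream delivered as an iterable of chunks (hypothetical helper)."""
    if not needle:
        raise ValueError("needle must not be empty")
    offsets = []
    keep = len(needle) - 1   # bytes that must be carried into the next round
    window = b""
    base = 0                 # absolute stream offset of window[0]
    for chunk in chunks:
        window += chunk
        pos = window.find(needle)
        while pos != -1:
            offsets.append(base + pos)
            pos = window.find(needle, pos + 1)
        # drop everything except the tail that could start a straddling match
        drop = max(len(window) - keep, 0)
        window = window[drop:]
        base += drop
    return offsets
```

Because the carried tail is shorter than the needle, no match can be reported twice, and memory use stays bounded by one chunk plus the tail.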
Update:
So, given the updated question, this code may remove the need to create a temporary file. I haven't been able to test it exactly, as I don't have a similar response, but you should be able to figure out any bugs yourself.
Since you aren't actually working with a stream directly, i.e. you're given the finished response object from requests.post(), you don't have to worry about using chunks in the networking sense. The "chunks" that requests refers to are really its way of dishing out the bytes, all of which it already has. You can access the bytes directly using r.raw.read(n), but as far as I can tell the response object doesn't let you see how many bytes there are in "r.raw", so you're more or less forced to use the "iter_content" method.
Anyway, this code should copy all the bytes from the request object into a string, then you can search and split that string as before.
memFile = requests.post(url, data=etree.tostring(etXML), headers=headers, stream=True)
# iter_content() yields bytes, so the delimiter and the buffer must be bytes too
match = b'\r\n\r\n'
data = b''
for chunk in memFile.iter_content(chunk_size=512):
    if chunk:
        data += chunk
f = data.split(match)
# no temporary file is created any more, so there is nothing to remove
with open('output/recording.arf', 'wb') as arf:
    arf.write(f[3])
I'm making a python "script" that sends a string to a webservice (in C#). I NEED to compress or compact this string, because the bandwidth and MBs data is LIMITED (yeah, in capitals because it's very limited).
I was thinking of converting it into a file and then compressing the file. But I'm looking for a method to directly compress the string.
How can I compress or compact the string?
How about zlib?
import zlib
a = "this string needs compressing"
a = zlib.compress(a.encode())
print(zlib.decompress(a).decode()) # outputs original contents of a
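You can also use sys.getsizeof(obj) to see how much memory an object takes up before and after compression. Note that this figure includes Python's per-object overhead; len(obj) gives the number of payload bytes that would actually be sent.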
import sys
import zlib
text=b"""This function is the primary interface to this module along with
decompress() function. This function returns byte object by compressing the data
given to it as parameter. The function has another parameter called level which
controls the extent of compression. It an integer between 0 to 9. Lowest value 0
stands for no compression and 9 stands for best compression. Higher the level of
compression, greater the length of compressed byte object."""
# Checking size of text
text_size=sys.getsizeof(text)
print("\nsize of original text",text_size)
# Compressing text
compressed = zlib.compress(text)
# Checking size of text after compression
csize=sys.getsizeof(compressed)
print("\nsize of compressed text",csize)
# Decompressing text
decompressed=zlib.decompress(compressed)
#Checking size of text after decompression
dsize=sys.getsizeof(decompressed)
print("\nsize of decompressed text",dsize)
print("\nDifference of size= ", text_size-csize)
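If the goal is to measure what would actually go over the wire, comparing len() of the encoded and compressed bytes is more direct than sys.getsizeof. A minimal sketch, assuming UTF-8 text; the helper names compress_text and decompress_text are invented for the example:

```python
import zlib

def compress_text(text, level=9):
    """Encode a str as UTF-8 and zlib-compress it (level 9 favors size over speed)."""
    return zlib.compress(text.encode("utf-8"), level)

def decompress_text(blob):
    """Inverse of compress_text."""
    return zlib.decompress(blob).decode("utf-8")

message = "this string needs compressing " * 20
payload = compress_text(message)
# number of bytes before and after compression
print(len(message.encode("utf-8")), "->", len(payload), "bytes")
```

Very short or high-entropy strings may not shrink at all, so it is worth checking the lengths on representative data before committing to compression.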
I've been supplied with a zipped file containing multiple individual streams of compressed XML. The compressed file is 833 mb.
If I try to decompress it as a single object, I only get the first stream (about 19 kb).
I've modified the following code, supplied as an answer to an older question, to decompress each stream and write it to a file:
import zlib

outfile = open('output.xml', 'w')

def zipstreams(filename):
    """Return all zip streams and their positions in file."""
    with open(filename, 'rb') as fh:
        data = fh.read()
    i = 0
    print "got it"
    while i < len(data):
        try:
            zo = zlib.decompressobj()
            dat =zo.decompress(data[i:])
            outfile.write(dat)
            zo.flush()
            i += len(data[i:]) - len(zo.unused_data)
        except zlib.error:
            i += 1
    outfile.close()

zipstreams('payload')
infile.close()
This code runs and produces the desired result (all the XML data decompressed to a single file). The problem is that it takes several days to work!
Even though there are tens of thousands of streams in the compressed file, it still seems like this should be a much faster process. Roughly 8 days to decompress 833mb (estimated 3gb raw) suggests that I'm doing something very wrong.
Is there another way to do this more efficiently, or is the slow speed the result of a read-decompress-write---repeat bottleneck that I'm stuck with?
Thanks for any pointers or suggestions you have!
It's hard to say very much without more specific knowledge of the file format you're actually dealing with, but it's clear that your algorithm's handling of substrings is quadratic-- not a good thing when you've got tens of thousands of them. So let's see what we know:
You say that the vendor states that they are
using the standard zlib compression library. These are the same compression routines on which the gzip utilities are built.
From this we can conclude that the component streams are in raw zlib format, and are not encapsulated in a gzip wrapper (or a PKZIP archive, or whatever). The authoritative documentation on the ZLIB format is here: https://www.rfc-editor.org/rfc/rfc1950
So let's assume that your file is exactly as you describe: A 32-byte header, followed by raw ZLIB streams concatenated together, without any other stuff in between. (Edit: That's not the case, after all).
Python's zlib documentation provides a Decompress class that is actually pretty well suited to churning through your file. It includes an attribute unused_data whose documentation states clearly that:
The only way to determine where a string of compressed data ends is by actually decompressing it. This means that when compressed data is contained part of a larger file, you can only find the end of it by reading data and feeding it followed by some non-empty string into a decompression object’s decompress() method until the unused_data attribute is no longer the empty string.
So, this is what you can do: Write a loop that reads through data, say, one block at a time (no need to even read the entire 800MB file into memory). Push each block to the Decompress object, and check the unused_data attribute. When it becomes non-empty, you've got a complete object. Write it to disk, create a new decompress object and initialize it with the unused_data from the last one. This just might work (untested, so check for correctness).
Edit: Since you do have other data in your data stream, I've added a routine that aligns to the next ZLIB start. You'll need to find and fill in the two-byte sequence that identifies a ZLIB stream in your data. (Feel free to use your old code to discover it.) While there's no fixed ZLIB header in general, it should be the same for each stream since it consists of protocol options and flags, which are presumably the same for the entire run.
import zlib

# FILL IN: ZHEAD is two bytes with the actual ZLIB settings in the input
ZHEAD = CMF+FLG

def findstart(header, buf, source):
    """Find `header` in str `buf`, reading more from `source` if necessary"""
    while buf.find(header) == -1:
        more = source.read(2**12)
        if len(more) == 0: # EOF without finding the header
            return ''
        buf += more
    offset = buf.find(header)
    return buf[offset:]
You can then advance to the start of the next stream. I've added a try/except pair since the same byte sequence might occur outside a stream:
source = open(datafile, 'rb')
skip_ = source.read(32) # Skip non-zlib header
buf = ''
while True:
    decomp = zlib.decompressobj()
    # Find the start of the next stream
    buf = findstart(ZHEAD, buf, source)
    try:
        stream = decomp.decompress(buf)
    except zlib.error:
        print "Spurious match(?) at output offset %d." % outfile.tell(),
        print "Skipping 2 bytes"
        buf = buf[2:]
        continue
    # Read until zlib decides it's seen a complete file
    while decomp.unused_data == '':
        block = source.read(2**12)
        if len(block) > 0:
            stream += decomp.decompress(block)
        else:
            break # We've reached EOF
    outfile.write(stream)
    buf = decomp.unused_data # Save for the next stream
    if len(block) == 0:
        break # EOF
outfile.close()
PS 1. If I were you I'd write each XML stream into a separate file.
PS 2. You can test whatever you do on the first MB of your file, till you get adequate performance.
Decompressing 833 MB should take about 30 seconds on a modern processor (e.g. a 2 GHz i7). So yes, you are doing something very wrong. Attempting to decompress at every byte offset to see if you get an error is part of the problem, though not all of it. There are better ways to find the compressed data. Ideally you should find or figure out the format. Alternatively, you can search for valid zlib headers using the RFC 1950 specification, though you may get false positives.
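As a rough illustration of what a header search based on RFC 1950 could look like: the compression method in the low four bits of the first byte must be 8 (deflate), the window-size field in the high four bits must be at most 7, and the two bytes read as a big-endian 16-bit number must be a multiple of 31. The function name below is made up, and a positive result is only a hint, since arbitrary data passes this test fairly often.

```python
def looks_like_zlib_header(two_bytes):
    """Cheap pre-filter for a possible zlib (RFC 1950) stream start.

    Hypothetical helper: True means "worth trying to decompress here",
    not "a stream definitely starts here".
    """
    if len(two_bytes) < 2:
        return False
    cmf, flg = two_bytes[0], two_bytes[1]
    if cmf & 0x0F != 8:        # CM: compression method must be deflate
        return False
    if cmf >> 4 > 7:           # CINFO: window size above 32K is not allowed
        return False
    return (cmf * 256 + flg) % 31 == 0   # FCHECK makes the pair a multiple of 31
```

Only actually decompressing from that offset confirms that a real stream is present.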
More significant may be that you are reading the entire 833 MB into memory at once, and decompressing the 3 GB to memory, possibly in large pieces each time. How much memory does your machine have? You may be thrashing to virtual memory.
If the code you show works, then the data is not zipped. zip is a specific file format, normally with the .zip extension, that encapsulates raw deflate data within a structure of local and central directory information intended to reconstruct a directory in a file system. You must have something rather different, since your code is looking for and apparently finding zlib streams. What is the format you have? Where did you get it? How is it documented? Can you provide a dump of, say, the first 100 bytes?
The way this should be done is not to read the whole thing into memory and decompress entire streams at once, also into memory. Instead, make use of the zlib.decompressobj interface, which allows you to provide a piece at a time and get the resulting available decompressed data. You can read the input file in much smaller pieces, find the compressed data streams by using the documented format or by looking for zlib (RFC 1950) headers, and then run those a chunk at a time through the decompression object, writing out the decompressed data where you want it. decomp.unused_data can be used to detect the end of the compressed stream (as in the example you found).
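A Python 3 sketch of that chunk-at-a-time approach is shown below. It rests on assumptions that may not hold for the actual file: a single fixed-size header (the skip parameter) at the start, followed by zlib streams placed back to back with nothing in between. The name iter_zlib_streams is invented for the example.

```python
import io
import zlib

def iter_zlib_streams(fileobj, skip=0, chunk_size=65536):
    """Yield the decompressed bytes of each zlib stream in `fileobj`.

    Assumes `skip` bytes of non-zlib header, then back-to-back streams.
    """
    fileobj.read(skip)              # discard the leading non-zlib header
    decomp = zlib.decompressobj()
    pieces = []                     # decompressed output of the current stream
    started = False                 # True while a stream is partially consumed
    buf = b""
    while True:
        if not buf:
            buf = fileobj.read(chunk_size)
            if not buf:
                break               # end of input
        pieces.append(decomp.decompress(buf))
        started = True
        if decomp.eof:
            yield b"".join(pieces)
            buf = decomp.unused_data    # bytes belonging to the next stream
            decomp = zlib.decompressobj()
            pieces = []
            started = False
        else:
            buf = b""
    if started:
        raise ValueError("input ended in the middle of a zlib stream")

if __name__ == "__main__":
    sample = b"H" * 32 + zlib.compress(b"<a/>" * 100) + zlib.compress(b"<b/>" * 50)
    for number, xml in enumerate(iter_zlib_streams(io.BytesIO(sample), skip=32)):
        print(number, len(xml))
```

Only one input chunk and one decompressed stream are held in memory at a time, and every input byte is decompressed once rather than once per candidate offset.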
From what you've described in the comments, it sounds like they're concatenating together the individual files they would have sent you separately. Which means each one has a 32-byte header you need to skip.
If you don't skip those headers, it would probably have exactly the behavior you described: If you get lucky, you'll get 32 invalid-header errors and then successfully parse the next stream. If you get unlucky, the 32 bytes of garbage will look like the start of a real stream, and you'll waste a whole lot of time parsing some arbitrary number of bytes until you finally get a decoding error. (If you get really unlucky, it'll actually decode successfully, giving you a giant hunk of garbage and eating up one or more subsequent streams.)
So, try just skipping 32 bytes after each stream finishes.
Or, if you have a more reliable way of detecting the start of the next stream (this is why I told you to print out the offsets and look at the data in a hex editor, and why alexis told you to look at the zlib spec), do that instead.
Using gzip, tell() returns the offset in the uncompressed file.
In order to show a progress bar, I want to know the original (uncompressed) size of the file.
Is there an easy way to find out?
Uncompressed size is stored in the last 4 bytes of the gzip file. We can read the binary data and convert it to an int. (This will only work for files under 4GB)
import struct

def getuncompressedsize(filename):
    with open(filename, 'rb') as f:
        f.seek(-4, 2)
        # ISIZE is stored little-endian, so use '<I' rather than native byte order
        return struct.unpack('<I', f.read(4))[0]
The gzip format specifies a field called ISIZE that:
This contains the size of the original (uncompressed) input data modulo 2^32.
In gzip.py, which I assume is what you're using for gzip support, there is a method called _read_eof defined as such:
def _read_eof(self):
    # We've read to the end of the file, so we have to rewind in order
    # to reread the 8 bytes containing the CRC and the file size.
    # We check the that the computed CRC and size of the
    # uncompressed data matches the stored values. Note that the size
    # stored is the true file size mod 2**32.
    self.fileobj.seek(-8, 1)
    crc32 = read32(self.fileobj)
    isize = U32(read32(self.fileobj)) # may exceed 2GB
    if U32(crc32) != U32(self.crc):
        raise IOError, "CRC check failed"
    elif isize != LOWU32(self.size):
        raise IOError, "Incorrect length of data produced"
There you can see that the ISIZE field is being read, but only to compare it to self.size for error detection. This should mean that GzipFile.size stores the actual uncompressed size. However, I think it's not exposed publicly, so you might have to hack it in to expose it. Not so sure, sorry.
I just looked all of this up right now, and I haven't tried it so I could be wrong. I hope this is of some use to you. Sorry if I misunderstood your question.
Despite what the other answers say, the last four bytes are not a reliable way to get the uncompressed length of a gzip file. First, there may be multiple members in the gzip file, so that would only be the length of the last member. Second, the length may be more than 4 GB, in which case the last four bytes represent the length modulo 2^32, not the length.
However, for what you want, there is no need to get the uncompressed length. You can instead base your progress bar on the amount of input consumed, as compared to the length of the gzip file, which is readily obtained. For typical homogeneous data, that progress bar would show essentially the same thing as a progress bar based on the uncompressed data.
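One possible way to do this in Python 3 is to open the compressed file yourself, hand it to gzip.GzipFile, and ask the underlying file object for its position. This is a sketch under those assumptions; read_with_progress is a made-up name, and because the gzip module reads ahead in buffered blocks the reported fraction is approximate.

```python
import gzip
import os

def read_with_progress(path, chunk_size=65536):
    """Yield (decompressed_chunk, fraction_of_compressed_input_consumed).

    Hypothetical helper: progress is measured on the compressed side, so the
    uncompressed size never has to be known.
    """
    total = os.path.getsize(path)
    with open(path, "rb") as raw, gzip.GzipFile(fileobj=raw, mode="rb") as gz:
        while True:
            data = gz.read(chunk_size)
            if not data:
                break
            # raw.tell() is the position in the compressed file on disk
            yield data, raw.tell() / total
```

A caller would loop over the generator and update the progress bar with the second element of each pair.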
Unix way: use "gunzip -l file.gz" via subprocess.call / os.popen, capture and parse its output.
The last 4 bytes of the .gz hold the original size of the file
I am not sure about performance, but this could be achieved without knowing gzip magic by using:
import gzip, io

with gzip.open(filepath, 'rb') as file_obj:
    file_size = file_obj.seek(0, io.SEEK_END)
This should also work for other (compressed) stream readers like bz2 or the plain open.
EDIT:
as suggested in the comments, 2 in second line was replaced by io.SEEK_END, which is definitely more readable and probably more future-proof.
EDIT:
Works only in Python 3.
f = gzip.open(filename)
# kludge - report compressed file position so progress bars
# don't go to 400%
f.tell = f.fileobj.tell
Looking at the source for the gzip module, I see that the underlying file object for GzipFile seems to be fileobj. So:
mygzipfile = gzip.GzipFile()
...
mygzipfile.fileobj.tell()
?
Maybe it would be good to do some sanity checking before doing that, like checking that the attribute exists with hasattr.
Not exactly a public API, but...
GzipFile.size stores the uncompressed size, but it's only incremented when you read the file, so you should prefer len(fd.read()) instead of the non-public GzipFile.size.
Here is a Python2 version for #norok's solution
import gzip, io
with gzip.open("yourfile.gz", "rb") as f:
    prev, cur = 0, f.seek(1000000, io.SEEK_CUR)
    while prev < cur:
        prev, cur = cur, f.seek(1000000, io.SEEK_CUR)
    filesize = cur
Note that just like f.seek(0, io.SEEK_END) this is slow for large files, but it will overcome the 4GB size limitation of the faster solutions suggested here
import gzip
File = gzip.open("input.gz", "r")
Size = gzip.read32(File)