I am encountering issues ungzipping chunks of bytes that I am reading from S3 using the iter_chunks() method from boto3. The strategy of ungzipping the file chunk-by-chunk originates from this issue.
The code is as follows:
dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
for chunk in app.s3_client.get_object(Bucket=bucket, Key=key)["Body"].iter_chunks(2 ** 19):
    data = dec.decompress(chunk)
    print(len(chunk), len(data))
    # 524288 65505
    # 524288 0
    # 524288 0
    # ...
This code prints 65505 on the first iteration and then 0 for every subsequent iteration. My understanding is that it should ungzip each compressed chunk and print the length of the uncompressed version.
Is there something I'm missing?
It seems like your input file is block gzip (bgzip, http://www.htslib.org/doc/bgzip.html), because a single ~65k block of data decodes and then nothing more.
Gzip files can be concatenated together (see https://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage), and block gzip uses this to concatenate blocks of the same file, so that, with an associated index, only the specific block containing the information of interest has to be decoded.
So to stream-decode a block gzip file, you need to use the leftover data from one block to start a new one. E.g.
# source is a block gzip file see http://www.htslib.org/doc/bgzip.html
dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
for chunk in raw:
    # decompress this chunk of data
    data = dec.decompress(chunk)
    # bgzip is a concatenation of gzip files
    # if there is stuff in this chunk beyond the current block
    # it needs to be processed
    while len(dec.unused_data):
        # end of one block
        leftovers = dec.unused_data
        # create a new decompressor
        dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
        # decompress the leftovers
        data = data + dec.decompress(leftovers)
    # TODO handle data
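Here raw stands for any iterable of compressed byte chunks. To tie it back to the question, it could simply be the same iter_chunks() iterator (a sketch reusing the names from the question's snippet):

raw = app.s3_client.get_object(Bucket=bucket, Key=key)["Body"].iter_chunks(2 ** 19)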
Related
I have a file containing multiple blocks of zlib-compressed binary data, and the offsets and lengths are unknown. Below is a script that gets the offset of the byte after the final block of zlib-compressed data, which is what I need. The script works; however, in order to get the length of the original zlib-compressed data, I have to decompress it and re-compress it. Is there a better way to get the length without having to re-compress it? Here's my code:
import zlib

def inflate(infile):
    data = infile.read()
    offset = 0
    while offset < len(data):
        window = data[offset : offset + 2]
        for key, value in zlib_headers.items():
            if window == key:
                decomp_obj = zlib.decompressobj()
                yield key, offset, decomp_obj.decompress(data[offset:])
        if offset == len(data):
            break
        offset += 1

if __name__ == "__main__":
    zlib_headers = {b"\x78\x01": 3, b"\x78\x9c": 6, b"\x78\xda": 9}
    with open("input_file", "rb") as infile:
        *_, last = inflate(infile)
    key, offset, data = last
    start_offset = offset + len(zlib.compress(data, zlib_headers[key]))
    print(start_offset)
Recompressing it won't even work. The recompression could be a different length. There is no assurance that the result will be the same, unless you control the compression process that made the compressed data in the first place, and you can guarantee that it uses the same compression code, same version of that code, and exactly the same settings. There is not even enough information in the zlib header to determine what the compression level was. By the way, your list of possible zlib headers is incomplete. There are 29 others it could be. The easiest and most reliable way to determine whether or not a zlib stream starts at the current byte is to begin decompressing until you either get an error or it completes. The first thing the decompressor will do is check the zlib header for validity.
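For reference, a minimal sketch of the header check the decompressor performs first (per RFC 1950: compression method 8, a window size of at most 32K, and the two header bytes, read as a big-endian integer, divisible by 31; the function name is just for illustration):

def looks_like_zlib_header(two_bytes):
    # CMF: low nibble is the method (8 = deflate), high nibble is log2(window size) - 8
    # FLG: chosen so that CMF * 256 + FLG is a multiple of 31
    if len(two_bytes) < 2:
        return False
    cmf, flg = two_bytes[0], two_bytes[1]
    return (cmf & 0x0F) == 8 and (cmf >> 4) <= 7 and (cmf * 256 + flg) % 31 == 0

Note that random bytes can pass this check, which is why actually decompressing is the more reliable test.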
To find the length of the compressed data, feed decomp_obj.decompress() a fixed number of bytes at a time, e.g. 65536 bytes. Keep track of how many bytes you have fed it. Stop when decomp_obj.eof is true. That indicates that the end of the zlib stream has been reached. Then decomp_obj.unused_data will be the bytes you fed it that were after the zlib stream. Subtract the length of the leftover from your total amount fed, and you have the length of the zlib stream.
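A minimal sketch of that bookkeeping (an illustrative helper, not from the question; the 64K chunk size is arbitrary):

import zlib

CHUNK = 65536

def zlib_stream_lengths(f, offset):
    """Return (compressed_length, decompressed_length) of the zlib stream starting at offset in f."""
    f.seek(offset)
    d = zlib.decompressobj()
    fed = 0  # total bytes handed to the decompressor
    out = 0  # total bytes it produced
    while not d.eof:
        chunk = f.read(CHUNK)
        if not chunk:
            raise ValueError("truncated zlib stream")
        fed += len(chunk)
        out += len(d.decompress(chunk))
    # whatever was fed beyond the end of the stream is left in unused_data
    return fed - len(d.unused_data), out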
I have made a simple app that encrypts and decrypts files, but when I load a large file, like a 2 GB one, my program uses 100% of the memory. I use multiprocessing and multithreading.
poolSize = min(cpu_count(), len(fileList))
process_pool = Pool(poolSize)
thread_pool = ThreadPool(len(fileList))
lock = Lock()
worker = partial(encfile, process_pool, lock)
thread_pool.map(worker, fileList)
def encfile(process_pool, lock, file):
    with open(file, 'rb') as original_file:
        original = original_file.read()
    encrypted = process_pool.apply(encryptfn, args=(key, original,))
    with open(file, 'wb') as encrypted_file:
        encrypted_file.write(encrypted)
This is my general idea:
Since memory is a problem, you have to read the files in smaller chunks, say 64K pieces, and encrypt each 64K block and write those out. Of course, the encrypted block will have a length other than 64K, so the problem becomes how to decrypt. So each encrypted block must be prefixed with a fixed-length header that is nothing more than the length of the following encrypted block encoded as a 4-byte unsigned integer (which should be way more than adequate). The decryption loop first reads the next 4-byte length and from that knows how many bytes long the encrypted block that follows is.
By the way, there is no need to pass to encfile a lock if you are not using it to, for example, count files processed.
from tempfile import mkstemp
from os import fdopen, replace

BLOCKSIZE = 64 * 1024
ENCRYPTED_HEADER_LENGTH = 4

def encfile(process_pool, lock, file):
    """
    Encrypt file in place.
    """
    fd, path = mkstemp()  # make a temporary file
    with open(file, 'rb') as original_file, \
         fdopen(fd, 'wb') as encrypted_file:
        while True:
            original = original_file.read(BLOCKSIZE)
            if not original:
                break
            encrypted = process_pool.apply(encryptfn, args=(key, original))
            l = len(encrypted)
            l_bytes = l.to_bytes(ENCRYPTED_HEADER_LENGTH, 'big')
            encrypted_file.write(l_bytes)
            encrypted_file.write(encrypted)
    replace(path, file)

def decfile(file):
    """
    Decrypt files in place.
    """
    fd, path = mkstemp()  # make a temporary file
    with open(file, 'rb') as encrypted_file, \
         fdopen(fd, 'wb') as original_file:
        while True:
            l_bytes = encrypted_file.read(ENCRYPTED_HEADER_LENGTH)
            if not l_bytes:
                break
            l = int.from_bytes(l_bytes, 'big')
            encrypted = encrypted_file.read(l)
            decrypted = decryptfn(key, encrypted)
            original_file.write(decrypted)
    replace(path, file)
Explanation
The larger the block size, the more memory is required (your original program read the entire file; this program will only read 64K at a time). But I am assuming that too small a block size results in too many calls to the encryption function, which is done via multiprocessing, and that would mean more CPU overhead -- so it's a tradeoff. 64K was arbitrary. Increase it by a lot if you have the memory. You can even try 1024 * 1024 (1M).
I attempted to explain this before, but let me elaborate:
So let's say that when you encrypt a 64K block, the encrypted size for that one particular 64K block ends up being 67,986 bytes long (a different 64K block will in general encrypt to a different length, unless its unencrypted value happened to be the same). If I just wrote out the data with no other information, I would need some way to know that, to decrypt the file, it is first necessary to read back exactly 67,986 bytes of data and pass that to the decrypt method (with the correct key, of course), because you have to decrypt the precise result of what was encrypted, no fewer and no more bytes. In other words, you can't just read back the encrypted file in arbitrary chunks and pass those chunks to the decrypt method. But how would you know how many bytes belong to each chunk? So the only way to know how big each encrypted chunk is, is to prefix each chunk with a header that gives the length of the chunk that follows.
l_bytes = l.to_bytes(ENCRYPTED_HEADER_LENGTH, 'big') takes the integer length stored in variable l and encodes it as a byte array of size ENCRYPTED_HEADER_LENGTH in "big endian" order, meaning that the bytes are arranged from high-order bytes to low-order bytes:
>>> ENCRYPTED_HEADER_LENGTH = 4
>>> l = 67986
>>> l_bytes = l.to_bytes(ENCRYPTED_HEADER_LENGTH, 'big')
>>> l_bytes
b'\x00\x01\t\x92'
>>> l_bytes = l.to_bytes(ENCRYPTED_HEADER_LENGTH, 'little')
>>> l_bytes
b'\x92\t\x01\x00'
>>>
\t is the tab character with the value \x09, so we would be writing out the bytes 00 01 09 92, which is the 4-byte big-endian hexadecimal representation of 67986.
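Reading the header back is just the mirror image; int.from_bytes() recovers the original length:
>>> int.from_bytes(b'\x00\x01\t\x92', 'big')
67986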
Is it possible to gzip data via some amount of streaming, i.e. without loading all of the compressed data in memory at once?
For example, can I gzip a file that will be 10gb gzipped, on a machine with 2gb of memory?
At https://docs.python.org/3/library/gzip.html#gzip.compress, the gzip.compress function returns the bytes of the gzip, so must be all loaded in memory. But... it's not clear how gzip.open works internally: whether the zipped bytes will all be in memory at once. Does the gzip format itself make it particularly tricky to achieve a streaming gzip?
[This question is tagged with Python, but non-Python answers welcome as well]
You don't have to compress all 10gb at once. You can read the input data in chunks, and compress each chunk separately, so it doesn't have to all fit in memory at once.
import gzip

chunksize = 100 * 1024 * 1024  # 100 MB chunks

with open("bigfile.txt", "rb") as infile:
    while True:
        chunk = infile.read(chunksize)
        if not chunk:
            break
        compressed = gzip.compress(chunk)
        # do something with compressed
If you're creating a compressed file, you can write the chunks directly to the gzip file.
with open("bigfile.txt") as infile, gzip.open("bigfile.txt.gz", "w") as gzipfile:
while True:
chunk = infile.read(chunksize)
if not chunk:
break
gzipfile.write(chunk)
[This is based on @Barmar's answer and comments]
You can achieve streaming gzip compression: the gzip module uses zlib, which is documented to support streaming compression, and peeking into the gzip module's source, it doesn't appear to hold all of the output bytes in memory.
You can also do this directly with the zlib module, for example with a small pipeline of generators:
import zlib

def yield_uncompressed_bytes():
    # In a real case, would yield bytes pulled from the filesystem or the network
    chunk = b'*' * 65000
    for _ in range(0, 10000):
        print('In: ', len(chunk))
        yield chunk

def yield_compressed_bytes(_uncompressed_bytes):
    compress_obj = zlib.compressobj(wbits=zlib.MAX_WBITS + 16)
    for chunk in _uncompressed_bytes:
        if compressed_bytes := compress_obj.compress(chunk):
            yield compressed_bytes
    if compressed_bytes := compress_obj.flush():
        yield compressed_bytes

uncompressed_bytes = yield_uncompressed_bytes()
compressed_bytes = yield_compressed_bytes(uncompressed_bytes)

for chunk in compressed_bytes:
    # In a real case, could save to the filesystem, or send over the network
    print('Out:', len(chunk))
You can see that the In: are interspersed with the Out:, suggesting that the zlib compressobj is indeed not storing the whole output in memory.
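The decompression side can be streamed the same way. A sketch of the symmetric generator (not part of the original answer), using zlib.decompressobj with the same wbits so that it expects the gzip wrapper:

def yield_decompressed_bytes(_compressed_bytes):
    decompress_obj = zlib.decompressobj(wbits=zlib.MAX_WBITS + 16)
    for chunk in _compressed_bytes:
        if decompressed_bytes := decompress_obj.decompress(chunk):
            yield decompressed_bytes
    if decompressed_bytes := decompress_obj.flush():
        yield decompressed_bytes

It can be appended to the same pipeline, e.g. yield_decompressed_bytes(yield_compressed_bytes(yield_uncompressed_bytes())), without any stage holding the whole payload in memory.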
I've been supplied with a zipped file containing multiple individual streams of compressed XML. The compressed file is 833 mb.
If I try to decompress it as a single object, I only get the first stream (about 19 kb).
I've modified the following code, supplied as an answer to an older question, to decompress each stream and write it to a file:
import zlib

outfile = open('output.xml', 'w')

def zipstreams(filename):
    """Return all zip streams and their positions in file."""
    with open(filename, 'rb') as fh:
        data = fh.read()
    i = 0
    print "got it"
    while i < len(data):
        try:
            zo = zlib.decompressobj()
            dat = zo.decompress(data[i:])
            outfile.write(dat)
            zo.flush()
            i += len(data[i:]) - len(zo.unused_data)
        except zlib.error:
            i += 1
    outfile.close()

zipstreams('payload')
This code runs and produces the desired result (all the XML data decompressed to a single file). The problem is that it takes several days to work!
Even though there are tens of thousands of streams in the compressed file, it still seems like this should be a much faster process. Roughly 8 days to decompress 833mb (estimated 3gb raw) suggests that I'm doing something very wrong.
Is there another way to do this more efficiently, or is the slow speed the result of a read-decompress-write---repeat bottleneck that I'm stuck with?
Thanks for any pointers or suggestions you have!
It's hard to say very much without more specific knowledge of the file format you're actually dealing with, but it's clear that your algorithm's handling of substrings is quadratic-- not a good thing when you've got tens of thousands of them. So let's see what we know:
You say that the vendor states that they are
using the standard zlib compression library. These are the same compression routines on which the gzip utilities are built.
From this we can conclude that the component streams are in raw zlib format, and are not encapsulated in a gzip wrapper (or a PKZIP archive, or whatever). The authoritative documentation on the ZLIB format is here: https://www.rfc-editor.org/rfc/rfc1950
So let's assume that your file is exactly as you describe: A 32-byte header, followed by raw ZLIB streams concatenated together, without any other stuff in between. (Edit: That's not the case, after all).
Python's zlib documentation provides a Decompress class that is actually pretty well suited to churning through your file. It includes an attribute unused_data whose documentation states clearly that:
The only way to determine where a string of compressed data ends is by actually decompressing it. This means that when compressed data is contained part of a larger file, you can only find the end of it by reading data and feeding it followed by some non-empty string into a decompression object’s decompress() method until the unused_data attribute is no longer the empty string.
So, this is what you can do: Write a loop that reads through data, say, one block at a time (no need to even read the entire 800MB file into memory). Push each block to the Decompress object, and check the unused_data attribute. When it becomes non-empty, you've got a complete object. Write it to disk, create a new decompress object and initialize it with the unused_data from the last one. This just might work (untested, so check for correctness).
Edit: Since you do have other data in your data stream, I've added a routine that aligns to the next ZLIB start. You'll need to find and fill in the two-byte sequence that identifies a ZLIB stream in your data. (Feel free to use your old code to discover it.) While there's no fixed ZLIB header in general, it should be the same for each stream since it consists of protocol options and flags, which are presumably the same for the entire run.
import zlib

# FILL IN: ZHEAD is two bytes with the actual ZLIB settings in the input
ZHEAD = CMF + FLG

def findstart(header, buf, source):
    """Find `header` in str `buf`, reading more from `source` if necessary"""
    while buf.find(header) == -1:
        more = source.read(2**12)
        if len(more) == 0:  # EOF without finding the header
            return ''
        buf += more
    offset = buf.find(header)
    return buf[offset:]
You can then advance to the start of the next stream. I've added a try/except pair since the same byte sequence might occur outside a stream:
source = open(datafile, 'rb')
skip_ = source.read(32)  # Skip non-zlib header

buf = ''
while True:
    decomp = zlib.decompressobj()
    # Find the start of the next stream
    buf = findstart(ZHEAD, buf, source)
    try:
        stream = decomp.decompress(buf)
    except zlib.error:
        print "Spurious match(?) at output offset %d." % outfile.tell(),
        print "Skipping 2 bytes"
        buf = buf[2:]
        continue

    # Read until zlib decides it's seen a complete file
    while decomp.unused_data == '':
        block = source.read(2**12)
        if len(block) > 0:
            stream += decomp.decompress(block)
        else:
            break  # We've reached EOF

    outfile.write(stream)
    buf = decomp.unused_data  # Save for the next stream
    if len(block) == 0:
        break  # EOF

outfile.close()
PS 1. If I were you I'd write each XML stream into a separate file.
PS 2. You can test whatever you do on the first MB of your file, till you get adequate performance.
Decompressing 833 MB should take about 30 seconds on a modern processor (e.g. a 2 GHz i7). So yes, you are doing something very wrong. Attempting to decompress at every byte offset to see if you get an error is part of the problem, though not all of it. There are better ways to find the compressed data. Ideally you should find or figure out the format. Alternatively, you can search for valid zlib headers using the RFC 1950 specification, though you may get false positives.
More significant may be that you are reading the entire 833 MB into memory at once, and decompressing the 3 GB to memory, possibly in large pieces each time. How much memory does your machine have? You may be thrashing to virtual memory.
If the code you show works, then the data is not zipped. zip is a specific file format, normally with the .zip extension, that encapsulates raw deflate data within a structure of local and central directory information intended to reconstruct a directory in a file system. You must have something rather different, since your code is looking for and apparently finding zlib streams. What is the format you have? Where did you get it? How is it documented? Can you provide a dump of, say, the first 100 bytes?
The way this should be done is not to read the whole thing into memory and decompress entire streams at once, also into memory. Instead, make use of the zlib.decompressobj interface, which allows you to provide a piece at a time and get the resulting available decompressed data. You can read the input file in much smaller pieces, find the compressed data streams by using the documented format or by looking for zlib (RFC 1950) headers, and then run those a chunk at a time through the decompress object, writing out the decompressed data where you want it. decomp.unused_data can be used to detect the end of the compressed stream (as in the example you found).
From what you've described in the comments, it sounds like they're concatenating together the individual files they would have sent you separately. Which means each one has a 32-byte header you need to skip.
If you don't skip those headers, it would probably have exactly the behavior you described: If you get lucky, you'll get 32 invalid-header errors and then successfully parse the next stream. If you get unlucky, the 32 bytes of garbage will look like the start of a real stream, and you'll waste a whole lot of time parsing some arbitrary number of bytes until you finally get a decoding error. (If you get really unlucky, it'll actually decode successfully, giving you a giant hunk of garbage and eating up one or more subsequent streams.)
So, try just skipping 32 bytes after each stream finishes.
Or, if you have a more reliable way of detecting the start of the next stream (this is why I told you to print out the offsets and look at the data in a hex editor, and why alexis told you to look at the zlib spec), do that instead.
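Putting those pieces together, a minimal sketch of the piece-at-a-time approach (untested against the actual data; it assumes each stream really is preceded by a 32-byte header, and the chunk size and file names are placeholders):

import zlib

CHUNK = 1 << 16    # read 64K of compressed input at a time
HEADER_SIZE = 32   # assumed per-stream header, as described above

def streams(src):
    """Yield decompressed pieces from src, assumed to be a repetition of
    a HEADER_SIZE-byte header followed by one zlib stream."""
    buf = b''
    while True:
        # accumulate at least the header so it can be discarded
        while len(buf) < HEADER_SIZE:
            chunk = src.read(CHUNK)
            if not chunk:
                return                    # clean EOF
            buf += chunk
        buf = buf[HEADER_SIZE:]
        decomp = zlib.decompressobj()
        while not decomp.eof:
            if not buf:
                buf = src.read(CHUNK)
                if not buf:
                    return                # EOF in the middle of a stream
            yield decomp.decompress(buf)
            buf = b''
        buf = decomp.unused_data          # leftovers belong to the next header/stream

with open('payload', 'rb') as src, open('output.xml', 'wb') as out:
    for piece in streams(src):
        out.write(piece)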
I wrote a program that uses bitarray 0.8.0 to write bits to a binary file. I would like to add a header to this binary file to describe what's inside the file.
My problem is that I think the method "fromfile" of bitarray necessarily starts reading the file from the beginning. I could make a workaround so that the reading program gets the header and then rewrites a temporary file containing only the binary portion (bitarray tofile), but that doesn't sound like a very efficient idea.
Is there any way to do this properly?
My file could look something like the following where clear text is the header and binary data is the bitarray information:
...{(0, 0): '0'}{(0, 0): '0'}{(0, 0): '0'}������������������������������...
Edit:
I tried the following after reading the response:
bits = ""
b = bitarray()
with open(Filename, 'rb') as file:
#Get header
byte = file.read(1)
while byte != "":
# read header
byte = file.read(1)
b.fromfile(file)
print b.to01()
print "len(b.to01())", len(b.to01())
The length is 0 and the print of "to01()" is empty.
However, the print of the header is fine.
My problem is that I think the method "fromfile" of bitarray necessarily starts reading the file from the beginning.
This is likely false; it, like most other file read routines, probably starts at the current position within the file, and stops at EOF.
EDIT:
From the documentation:
fromfile(f, [n])
Read n bytes from the file object f and append them to the bitarray interpreted as machine values. When n is omitted, as many bytes are read until EOF is reached.
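So the fix is simply to read (and parse) the header with ordinary file reads and then call fromfile() on the same file object, which picks up at the current position. A minimal sketch (with a made-up fixed-length header just for illustration):

from bitarray import bitarray

HEADER = b'MYHEADER'  # hypothetical fixed-length header

# write: header first, then the raw bitarray bytes
b = bitarray('110010110011111011')
with open('output.bin', 'wb') as f:
    f.write(HEADER)
    b.tofile(f)

# read: consume the header, then fromfile() continues from the current position
b2 = bitarray()
with open('output.bin', 'rb') as f:
    header = f.read(len(HEADER))
    b2.fromfile(f)

print(header)
print(b2.to01())  # tofile() pads the last byte with zero bits, so b2 can be up to 7 bits longer than b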