How to stream from ZipFile? How to zip "on the fly"? - python

I want to zip a stream and stream out the result. I'm doing it using AWS Lambda, which matters in terms of available disk space and other restrictions.
I'm going to use the zipped stream to write an AWS S3 object using upload_fileobj() or put(), if that matters.
I can create an archive as a file as long as the objects are small:
import zipfile
zf = zipfile.ZipFile("/tmp/byte.zip", "w")
zf.writestr(filename, my_stream.read())
zf.close()
For larger amounts of data I can create an in-memory object instead of a file:
from io import BytesIO
...
byte = BytesIO()
zf = zipfile.ZipFile(byte, "w")
....
But how can I pass the zipped stream to the output? If I call zf.close(), the stream will be closed; if I don't, the archive will be incomplete.

Instead of using Python's built-in zipfile, you can use stream-zip (full disclosure: written by me).
If you have an iterable of bytes, my_data_iter say, you can get an iterable of a zip file using its stream_zip function:
from datetime import datetime
from stream_zip import stream_zip, ZIP_64

def files():
    modified_at = datetime.now()
    perms = 0o600
    yield 'my-file-1.txt', modified_at, perms, ZIP_64, my_data_iter

my_zip_iter = stream_zip(files())
If you need a file-like object, say to pass to boto3's upload_fileobj, you can convert from the iterable with a transformation function:
def to_file_like_obj(iterable):
    chunk = b''
    offset = 0
    it = iter(iterable)

    def up_to_iter(size):
        nonlocal chunk, offset

        while size:
            if offset == len(chunk):
                try:
                    chunk = next(it)
                except StopIteration:
                    break
                else:
                    offset = 0
            to_yield = min(size, len(chunk) - offset)
            offset = offset + to_yield
            size -= to_yield
            yield chunk[offset - to_yield:offset]

    class FileLikeObj:
        def read(self, size=-1):
            return b''.join(
                up_to_iter(float('inf') if size is None or size < 0 else size)
            )

    return FileLikeObj()

my_file_like_obj = to_file_like_obj(my_zip_iter)
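Since the question mentions upload_fileobj, the resulting file-like object can then be streamed straight to S3. A minimal sketch, assuming placeholder bucket and key names:
import boto3

# Placeholder bucket/key names; upload_fileobj reads from the file-like
# object in chunks, so the whole zip never has to exist in memory or on disk.
s3 = boto3.client('s3')
s3.upload_fileobj(my_file_like_obj, 'my-bucket', 'my-archive.zip')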

You might like to try the zipstream version of zipfile. For example, to compress stdin to stdout as a zip file holding the data as a file named TheLogFile using iterators:
#!/usr/bin/python3
import sys, zipstream
with zipstream.ZipFile(mode='w', compression=zipstream.ZIP_DEFLATED) as z:
    z.write_iter('TheLogFile', sys.stdin.buffer)
    for chunk in z:
        sys.stdout.buffer.write(chunk)

Related

boto get md5 s3 file

I have a use case where I upload hundreds of files to my S3 bucket using multipart upload. After each upload I need to make sure that the uploaded file is not corrupt (basically check for data integrity). Currently, after uploading the file, I re-download it, compute the MD5 of the content string and compare it with the MD5 of the local file. So something like:
conn = S3Connection('access key', 'secretkey')
bucket = conn.get_bucket('bucket_name')
source_path = 'file_to_upload'
source_size = os.stat(source_path).st_size
mp = bucket.initiate_multipart_upload(os.path.basename(source_path))
chunk_size = 52428800
chunk_count = int(math.ceil(source_size / chunk_size))
for i in range(chunk_count + 1):
    offset = chunk_size * i
    bytes = min(chunk_size, source_size - offset)
    with FileChunkIO(source_path, 'r', offset=offset, bytes=bytes) as fp:
        mp.upload_part_from_file(fp, part_num=i + 1, md5=k.compute_md5(fp, bytes))
mp.complete_upload()
obj_key = bucket.get_key('file_name')
print(obj_key.md5) #prints None
print(obj_key.base64md5) #prints None
content = bucket.get_key('file_name').get_contents_as_string()
# compute the md5 on content
This approach is wasteful as it doubles the bandwidth usage. I tried
bucket.get_key('file_name').md5
bucket.get_key('file_name').base64md5
but both return None.
Is there any other way to get the MD5 without downloading the whole thing?
Yes:
use bucket.get_key('file_name').etag[1:-1]
This way you get the key's MD5 without downloading its contents.
With boto3, I use head_object to retrieve the ETag.
import boto3
import botocore
def s3_md5sum(bucket_name, resource_name):
    try:
        md5sum = boto3.client('s3').head_object(
            Bucket=bucket_name,
            Key=resource_name
        )['ETag'][1:-1]
    except botocore.exceptions.ClientError:
        md5sum = None
    return md5sum
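A hedged usage sketch, with placeholder names: for an object uploaded in a single part, the ETag is the plain MD5 hex digest, so it can be compared directly against a locally computed MD5.
import hashlib

# Placeholder bucket, key and local path.
with open('/tmp/local_file', 'rb') as f:
    local_md5 = hashlib.md5(f.read()).hexdigest()
print(local_md5 == s3_md5sum('my-bucket', 'path/to/object'))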
You can recover the MD5 without downloading the file, from the e_tag attribute, like this:
boto3.resource('s3').Object(<BUCKET_NAME>, file_path).e_tag[1:-1]
Then use this function to compare it with classic (single-part) S3 files:
import hashlib

def md5_checksum(file_path):
    m = hashlib.md5()
    with open(file_path, 'rb') as f:
        for data in iter(lambda: f.read(1024 * 1024), b''):
            m.update(data)
    return m.hexdigest()
Or this function for multi-part files:
def etag_checksum(file_path, chunk_size=8 * 1024 * 1024):
    md5s = []
    with open(file_path, 'rb') as f:
        for data in iter(lambda: f.read(chunk_size), b''):
            md5s.append(hashlib.md5(data).digest())
    m = hashlib.md5(b"".join(md5s))
    return '{}-{}'.format(m.hexdigest(), len(md5s))
Finally use this function to choose between the two:
def md5_compare(file_path, s3_file_md5):
    if '-' in s3_file_md5 and s3_file_md5 == etag_checksum(file_path):
        return True
    if '-' not in s3_file_md5 and s3_file_md5 == md5_checksum(file_path):
        return True
    print("MD5 not equal for file " + file_path)
    return False
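A usage sketch for the functions above, with placeholder bucket, key and local path:
import boto3

# e_tag includes surrounding quotes, hence the [1:-1] slice.
s3_etag = boto3.resource('s3').Object('my-bucket', 'path/to/object').e_tag[1:-1]
print(md5_compare('/tmp/local_copy', s3_etag))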
Credit to: https://zihao.me/post/calculating-etag-for-aws-s3-objects/
Since 2016, the best way to do this without any additional object retrievals is by presenting the --content-md5 argument during a PutObject request. AWS will then verify that the provided MD5 matches their calculated MD5. This also works for multipart uploads and objects >5GB.
An example call from the knowledge center:
aws s3api put-object --bucket awsexamplebucket --key awsexampleobject.txt --body awsexampleobjectpath --content-md5 examplemd5value1234567== --metadata md5checksum=examplemd5value1234567==
https://aws.amazon.com/premiumsupport/knowledge-center/data-integrity-s3/
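For reference, a sketch of the same idea with boto3's put_object (bucket, key and local path are taken from the CLI example above as placeholders); S3 rejects the upload if the body's MD5 does not match the supplied Content-MD5 value:
import base64
import hashlib

import boto3

with open('awsexampleobjectpath', 'rb') as f:
    body = f.read()

# Content-MD5 must be the base64-encoded binary MD5 digest of the body.
content_md5 = base64.b64encode(hashlib.md5(body).digest()).decode('ascii')

boto3.client('s3').put_object(
    Bucket='awsexamplebucket',
    Key='awsexampleobject.txt',
    Body=body,
    ContentMD5=content_md5,
)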

How to use `numpy.savez` in a loop to save more than one array?

From a loop I'm getting an array. I want to save these arrays in a tempfile.
The problem is that np.savez only saves the last array from the loop. I think I understand why this happens, but I don't know how to do it better.
To solve my problem I had the idea of opening the tempfile in mode=a+b with the goal of appending the new arrays from the loop. But this doesn't work.
My code so far:
tmp = TemporaryFile(mode="a+b")
for i in range(10):
    array = getarray[i]  # demo purpose
    np.savez(tmp, array)
tmp.seek(0)
Then using the tempfile to read the arrays:
tmp_read = np.load(tmp)
print tmp_read.files
[OUTPUT]: ['arr_0']
But I want 10 arrays in the tempfile. Any ideas?
thanks
You can pass multiple arrays as positional arguments (*args) to save them all in one temp file.
np.savez(tmp, *getarray[:10])
or:
np.savez(tmp, *[getarray[0], getarray[1], getarray[8]])
It is also possible to use custom keys by using the ** operator:
import numpy as np

a1 = [1, 2, 3]
a2 = [10, 20, 30]

savez_dict = dict()
for name, arr in [('a1', a1), ('a2', a2)]:
    savez_dict['key_' + name] = arr  # store the array itself, not its name
np.savez("t.npz", **savez_dict)
Sorry for my English in advance.
Because the savez function opens the file, writes all variables and then closes the file, the data are overwritten each time it is called.
savez is simple. You can find the code at https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py
How about implementing your own savez, then using the following code:
tmp = TemporaryFile()
f = my_savez(tmp)
for i in range(10):
    array = getarray[i]  # demo purpose
    f.savez(array)
f.close()
tmp.seek(0)

tmp_read = np.load(tmp)
print tmp_read.files
Here is my quick and dirty code.
import numpy as np
import tempfile

class my_savez(object):
    def __init__(self, file):
        # Import is postponed to here since zipfile depends on gzip, an optional
        # component of the so-called standard library.
        import zipfile
        # Import deferred for startup time improvement
        import tempfile
        import os

        if isinstance(file, basestring):
            if not file.endswith('.npz'):
                file = file + '.npz'

        compression = zipfile.ZIP_STORED

        zip = self.zipfile_factory(file, mode="w", compression=compression)

        # Stage arrays in a temporary file on disk, before writing to zip.
        fd, tmpfile = tempfile.mkstemp(suffix='-numpy.npy')
        os.close(fd)

        self.tmpfile = tmpfile
        self.zip = zip
        self.i = 0

    def zipfile_factory(self, *args, **kwargs):
        import zipfile
        import sys
        if sys.version_info >= (2, 5):
            kwargs['allowZip64'] = True
        return zipfile.ZipFile(*args, **kwargs)

    def savez(self, *args, **kwds):
        import os
        import numpy.lib.format as format

        namedict = kwds
        for val in args:
            key = 'arr_%d' % self.i
            if key in namedict.keys():
                raise ValueError("Cannot use un-named variables and keyword %s" % key)
            namedict[key] = val
            self.i += 1

        try:
            for key, val in namedict.iteritems():
                fname = key + '.npy'
                fid = open(self.tmpfile, 'wb')
                try:
                    format.write_array(fid, np.asanyarray(val))
                    fid.close()
                    fid = None
                    self.zip.write(self.tmpfile, arcname=fname)
                finally:
                    if fid:
                        fid.close()
        finally:
            os.remove(self.tmpfile)

    def close(self):
        self.zip.close()

tmp = tempfile.TemporaryFile()
f = my_savez(tmp)
for i in range(10):
    array = np.zeros(10)
    f.savez(array)
f.close()
tmp.seek(0)

tmp_read = np.load(tmp)
print tmp_read.files
for k, v in tmp_read.iteritems():
    print k, v
I am not an experienced programmer, but this is the way I did it (just in case it may help someone in the future). In addition, it is the first time that I am posting here, so I apologize if I am not following some kind of standard ;)
Creating the npz file:
import numpy as np

tmp = file("C:\\Windows\\Temp\\temp_npz.npz", 'wb')

# some variables
a = [23, 4, 67, 7]
b = ['w', 'ww', 'wwww']
c = np.ones((2, 6))

# a list containing the names of your variables
var_list = ['a', 'b', 'c']

# save the npz file with the variables you selected
str_exec_save = "np.savez(tmp,"
for i in range(len(var_list)):
    str_exec_save += "%s = %s," % (var_list[i], var_list[i])
str_exec_save += ")"
exec(str_exec_save)
tmp.close()
Loading the variables with their original names:
import numpy as np
import tempfile

tmp = open("C:\\Windows\\Temp\\temp_npz.npz", 'rb')

# loading of the saved variables
var_load = np.load(tmp)

# getting the names of the variables
files = var_load.files

# loading them with their original names
for i in range(len(files)):
    exec("%s = var_load['%s']" % (files[i], files[i]))
The only difference is that the variables will come back as numpy arrays.

Python ungzipping stream of bytes?

Here is the situation:
I get gzipped xml documents from Amazon S3
import boto
from boto.s3.connection import S3Connection
from boto.s3.key import Key

conn = S3Connection('access Id', 'secret access key')
b = conn.get_bucket('mydev.myorg')
k = Key(b)
k.key = 'documents/document.xml.gz'
I read them into a file as:
import gzip

f = open('/tmp/p', 'wb')
k.get_file(f)
f.close()

r = gzip.open('/tmp/p', 'rb')
file_content = r.read()
r.close()
Question
How can I ungzip the streams directly and read the contents?
I do not want to create temp files, they don't look good.
Yes, you can use the zlib module to decompress byte streams:
import zlib

def stream_gzip_decompress(stream):
    dec = zlib.decompressobj(32 + zlib.MAX_WBITS)  # offset 32 to skip the header
    for chunk in stream:
        rv = dec.decompress(chunk)
        if rv:
            yield rv
The offset of 32 signals to zlib that a gzip header is expected and should be skipped.
The S3 key object is an iterator, so you can do:
for data in stream_gzip_decompress(k):
    # do something with the decompressed data
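If you are on boto3 instead, the same generator works with the streaming body returned by get_object. A sketch with the bucket/key from the question and a hypothetical process() handler:
import boto3

body = boto3.client('s3').get_object(
    Bucket='mydev.myorg',
    Key='documents/document.xml.gz'
)['Body']

# iter_chunks() yields raw bytes as they arrive, without buffering the whole object.
for data in stream_gzip_decompress(body.iter_chunks(chunk_size=64 * 1024)):
    process(data)  # hypothetical handler for the decompressed data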
I had to do the same thing and this is how I did it:
import gzip
import StringIO

f = StringIO.StringIO()
k.get_file(f)
f.seek(0)  # This is crucial
gzf = gzip.GzipFile(fileobj=f)
file_content = gzf.read()
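On Python 3 the same approach would use io.BytesIO instead of StringIO; a minimal sketch, still using the boto key object k from the question:
import gzip
import io

f = io.BytesIO()
k.get_file(f)
f.seek(0)  # This is crucial
with gzip.GzipFile(fileobj=f) as gzf:
    file_content = gzf.read()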
For Python 3.x and boto3:
I used BytesIO to read the compressed file into a buffer object, then used zipfile to open the decompressed stream as uncompressed data, and I was able to get the data line by line.
import io
import zipfile
import boto3
import sys

s3 = boto3.resource('s3', 'us-east-1')

def stream_zip_file():
    count = 0
    obj = s3.Object(
        bucket_name='MonkeyBusiness',
        key='/Daily/Business/Banana/{current-date}/banana.zip'
    )
    buffer = io.BytesIO(obj.get()["Body"].read())
    print(buffer)
    z = zipfile.ZipFile(buffer)
    foo2 = z.open(z.infolist()[0])
    print(sys.getsizeof(foo2))
    line_counter = 0
    for _ in foo2:
        line_counter += 1
    print(line_counter)
    z.close()

if __name__ == '__main__':
    stream_zip_file()
You can also use a PIPE and read the contents without writing a decompressed file:
import subprocess

c = subprocess.Popen(['zcat', '-c', '<gzip file name>'],
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
for row in c.stdout:
    print(row)
In addition "/dev/fd/" + str(c.stdout.fileno()) will provide you FIFO file name (Named pipe) which can be passed to other program.

Unzip part of a file using python gzip module

I am trying to unzip a gzipped file in Python using the gzip module. The pre-condition is that I get 160 bytes of data at a time, and I need to unzip it before I request the next 160 bytes. Partial unzipping is OK before requesting the next 160 bytes. The code I have is
import gzip
import time
import StringIO
file = open('input_cp.gz', 'rb')
buf = file.read(160)
sio = StringIO.StringIO(buf)
f = gzip.GzipFile(fileobj=sio)
data = f.read()
print data
The error I am getting is IOError: CRC check failed. I am assuming this is because it expects the entire gzipped content to be present in buf, whereas I am reading in only 160 bytes at a time. Is there a workaround for this?
Thanks
Create your own class with a read() method (and whatever else GzipFile needs from fileobj, like close and seek) and pass it to GzipFile. Something like:
class MyBuffer(object):
    def __init__(self, input_file):
        self.input_file = input_file

    def read(self, size=-1):
        if size < 0:
            size = 160
        return self.input_file.read(min(160, size))
Then use it like:
file = open('input_cp.gz', 'rb')
mybuf = MyBuffer(file)
f = gzip.GzipFile(fileobj=mybuf)
data = f.read()
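If the gzip module keeps tripping the CRC check on partial data, an alternative sketch (building on the zlib approach shown earlier on this page) decompresses each 160-byte read as it arrives and only verifies the CRC once the stream actually ends:
import zlib

dec = zlib.decompressobj(32 + zlib.MAX_WBITS)  # accept the gzip header
with open('input_cp.gz', 'rb') as fh:
    while True:
        buf = fh.read(160)
        if not buf:
            break
        data = dec.decompress(buf)  # partial input is fine; output may be empty
        if data:
            print(data)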

Create a zip file from a generator in Python?

I've got a large amount of data (a couple gigs) I need to write to a zip file in Python. I can't load it all into memory at once to pass to the .writestr method of ZipFile, and I really don't want to feed it all out to disk using temporary files and then read it back.
Is there a way to feed a generator or a file-like object to the ZipFile library? Or is there some reason this capability doesn't seem to be supported?
By zip file, I mean zip file. As supported in the Python zipfile package.
The only solution is to rewrite the method it uses for zipping files to read from a buffer. It would be trivial to add this to the standard libraries; I'm kind of amazed it hasn't been done yet. I gather there's a lot of agreement the entire interface needs to be overhauled, and that seems to be blocking any incremental improvements.
import zipfile, zlib, binascii, struct

class BufferedZipFile(zipfile.ZipFile):
    def writebuffered(self, zipinfo, buffer):
        zinfo = zipinfo

        zinfo.file_size = file_size = 0
        zinfo.flag_bits = 0x00
        zinfo.header_offset = self.fp.tell()

        self._writecheck(zinfo)
        self._didModify = True

        zinfo.CRC = CRC = 0
        zinfo.compress_size = compress_size = 0
        self.fp.write(zinfo.FileHeader())

        if zinfo.compress_type == zipfile.ZIP_DEFLATED:
            cmpr = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -15)
        else:
            cmpr = None

        while True:
            buf = buffer.read(1024 * 8)
            if not buf:
                break

            file_size = file_size + len(buf)
            CRC = binascii.crc32(buf, CRC) & 0xffffffff
            if cmpr:
                buf = cmpr.compress(buf)
                compress_size = compress_size + len(buf)

            self.fp.write(buf)

        if cmpr:
            buf = cmpr.flush()
            compress_size = compress_size + len(buf)
            self.fp.write(buf)
            zinfo.compress_size = compress_size
        else:
            zinfo.compress_size = file_size

        zinfo.CRC = CRC
        zinfo.file_size = file_size

        position = self.fp.tell()
        self.fp.seek(zinfo.header_offset + 14, 0)
        self.fp.write(struct.pack("<LLL", zinfo.CRC, zinfo.compress_size, zinfo.file_size))
        self.fp.seek(position, 0)
        self.filelist.append(zinfo)
        self.NameToInfo[zinfo.filename] = zinfo
Changed in Python 3.5 (from official docs): Added support for writing to unseekable streams.
This means that zipfile.ZipFile can now write to streams that do not keep the entire file in memory and do not support seeking over the data already written.
So here is a simple generator:
from zipfile import ZipFile, ZipInfo

def zipfile_generator(path, stream):
    with ZipFile(stream, mode='w') as zf:
        z_info = ZipInfo.from_file(path)
        with open(path, 'rb') as entry, zf.open(z_info, mode='w') as dest:
            for chunk in iter(lambda: entry.read(16384), b''):
                dest.write(chunk)
                # Yield chunk of the zip file stream in bytes.
                yield stream.get()
    # ZipFile was closed.
    yield stream.get()
path is a string path (or path-like object) of the large file or directory.
stream is an instance of an unseekable stream class like this one (designed according to the official docs):
from io import RawIOBase

class UnseekableStream(RawIOBase):
    def __init__(self):
        self._buffer = b''

    def writable(self):
        return True

    def write(self, b):
        if self.closed:
            raise ValueError('Stream was closed!')
        self._buffer += b
        return len(b)

    def get(self):
        chunk = self._buffer
        self._buffer = b''
        return chunk
You can try this code online: https://repl.it/#IvanErgunov/zipfilegenerator
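A minimal sketch of driving the generator (the input path is a placeholder); each yielded chunk could equally be sent to a socket or an S3 multipart upload instead of a local file:
stream = UnseekableStream()
with open('my.zip', 'wb') as out:
    for chunk in zipfile_generator('big_file.bin', stream):
        out.write(chunk)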
There is also another way to create a generator without ZipInfo, by manually reading and dividing your large file. You can pass a queue.Queue() object to your UnseekableStream() object and write to this queue in another thread. Then, in the current thread, you can simply read chunks from this queue in an iterable way. See the docs.
P.S.
Python Zipstream by allanlei is an outdated and unreliable approach. It was an attempt to add support for unseekable streams before this was done officially.
I took Chris B.'s answer and created a complete solution. Here it is in case anyone else is interested:
import os
import threading
from zipfile import *
import zlib, binascii, struct

class ZipEntryWriter(threading.Thread):
    def __init__(self, zf, zinfo, fileobj):
        self.zf = zf
        self.zinfo = zinfo
        self.fileobj = fileobj

        zinfo.file_size = 0
        zinfo.flag_bits = 0x00
        zinfo.header_offset = zf.fp.tell()
        zf._writecheck(zinfo)
        zf._didModify = True
        zinfo.CRC = 0
        zinfo.compress_size = compress_size = 0
        zf.fp.write(zinfo.FileHeader())

        super(ZipEntryWriter, self).__init__()

    def run(self):
        zinfo = self.zinfo
        zf = self.zf
        file_size = 0
        CRC = 0

        if zinfo.compress_type == ZIP_DEFLATED:
            cmpr = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -15)
        else:
            cmpr = None
        while True:
            buf = self.fileobj.read(1024 * 8)
            if not buf:
                self.fileobj.close()
                break

            file_size = file_size + len(buf)
            CRC = binascii.crc32(buf, CRC)
            if cmpr:
                buf = cmpr.compress(buf)
                compress_size = compress_size + len(buf)

            zf.fp.write(buf)

        if cmpr:
            buf = cmpr.flush()
            compress_size = compress_size + len(buf)
            zf.fp.write(buf)
            zinfo.compress_size = compress_size
        else:
            zinfo.compress_size = file_size

        zinfo.CRC = CRC
        zinfo.file_size = file_size

        position = zf.fp.tell()
        zf.fp.seek(zinfo.header_offset + 14, 0)
        zf.fp.write(struct.pack("<lLL", zinfo.CRC, zinfo.compress_size, zinfo.file_size))
        zf.fp.seek(position, 0)
        zf.filelist.append(zinfo)
        zf.NameToInfo[zinfo.filename] = zinfo

class EnhZipFile(ZipFile, object):

    def _current_writer(self):
        return hasattr(self, 'cur_writer') and self.cur_writer or None

    def assert_no_current_writer(self):
        cur_writer = self._current_writer()
        if cur_writer and cur_writer.isAlive():
            raise ValueError('An entry is already started for name: %s' % cur_writer.zinfo.filename)

    def write(self, filename, arcname=None, compress_type=None):
        self.assert_no_current_writer()
        super(EnhZipFile, self).write(filename, arcname, compress_type)

    def writestr(self, zinfo_or_arcname, bytes):
        self.assert_no_current_writer()
        super(EnhZipFile, self).writestr(zinfo_or_arcname, bytes)

    def close(self):
        self.finish_entry()
        super(EnhZipFile, self).close()

    def start_entry(self, zipinfo):
        """
        Start writing a new entry with the specified ZipInfo and return a
        file like object. Any data written to the file like object is
        read by a background thread and written directly to the zip file.
        Make sure to close the returned file object, before closing the
        zipfile, or the close() would end up hanging indefinitely.

        Only one entry can be open at any time. If multiple entries need to
        be written, make sure to call finish_entry() before calling any of
        these methods:
        - start_entry
        - write
        - writestr

        It is not necessary to explicitly call finish_entry() before closing
        zipfile.

        Example:
        zf = EnhZipFile('tmp.zip', 'w')
        w = zf.start_entry(ZipInfo('t.txt'))
        w.write("some text")
        w.close()
        zf.close()
        """
        self.assert_no_current_writer()
        r, w = os.pipe()
        self.cur_writer = ZipEntryWriter(self, zipinfo, os.fdopen(r, 'r'))
        self.cur_writer.start()
        return os.fdopen(w, 'w')

    def finish_entry(self, timeout=None):
        """
        Ensure that the ZipEntry that is currently being written is finished.
        Joins on any background thread to exit. It is safe to call this method
        multiple times.
        """
        cur_writer = self._current_writer()
        if not cur_writer or not cur_writer.isAlive():
            return
        cur_writer.join(timeout)

if __name__ == "__main__":
    zf = EnhZipFile('c:/tmp/t.zip', 'w')
    import time
    w = zf.start_entry(ZipInfo('t.txt', time.localtime()[:6]))
    w.write("Line1\n")
    w.write("Line2\n")
    w.close()
    zf.finish_entry()
    w = zf.start_entry(ZipInfo('p.txt', time.localtime()[:6]))
    w.write("Some text\n")
    w.close()
    zf.close()
gzip.GzipFile writes the data in gzipped chunks, and you can set the size of your chunks according to the number of lines read from the files.
An example:
import gzip

file = gzip.GzipFile('blah.gz', 'wb')
sourcefile = open('source', 'rb')
chunks = []
for line in sourcefile:
    chunks.append(line)
    if len(chunks) >= X:
        file.write("".join(chunks))
        file.flush()
        chunks = []
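Reading the result back can be done line by line as well, without loading the whole archive; a small sketch with a hypothetical process() step:
import gzip

with gzip.open('blah.gz', 'rb') as f:
    for line in f:
        process(line)  # hypothetical handler for each decompressed line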
The essential compression is done by zlib.compressobj. ZipFile (under Python 2.5 on Mac OS X) appears to be compiled; the Python 2.3 version is as follows.
You can see that it builds the compressed file in 8k chunks. Taking out the source file information is complex because a lot of source file attributes (like the uncompressed size) are recorded in the zip file header.
def write(self, filename, arcname=None, compress_type=None):
    """Put the bytes from filename into the archive under the name
    arcname."""
    st = os.stat(filename)
    mtime = time.localtime(st.st_mtime)
    date_time = mtime[0:6]
    # Create ZipInfo instance to store file information
    if arcname is None:
        zinfo = ZipInfo(filename, date_time)
    else:
        zinfo = ZipInfo(arcname, date_time)
    zinfo.external_attr = st[0] << 16L      # Unix attributes
    if compress_type is None:
        zinfo.compress_type = self.compression
    else:
        zinfo.compress_type = compress_type
    self._writecheck(zinfo)
    fp = open(filename, "rb")

    zinfo.flag_bits = 0x00
    zinfo.header_offset = self.fp.tell()    # Start of header bytes
    # Must overwrite CRC and sizes with correct data later
    zinfo.CRC = CRC = 0
    zinfo.compress_size = compress_size = 0
    zinfo.file_size = file_size = 0
    self.fp.write(zinfo.FileHeader())
    zinfo.file_offset = self.fp.tell()      # Start of file bytes
    if zinfo.compress_type == ZIP_DEFLATED:
        cmpr = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION,
                                zlib.DEFLATED, -15)
    else:
        cmpr = None
    while 1:
        buf = fp.read(1024 * 8)
        if not buf:
            break
        file_size = file_size + len(buf)
        CRC = binascii.crc32(buf, CRC)
        if cmpr:
            buf = cmpr.compress(buf)
            compress_size = compress_size + len(buf)
        self.fp.write(buf)
    fp.close()
    if cmpr:
        buf = cmpr.flush()
        compress_size = compress_size + len(buf)
        self.fp.write(buf)
        zinfo.compress_size = compress_size
    else:
        zinfo.compress_size = file_size
    zinfo.CRC = CRC
    zinfo.file_size = file_size
    # Seek backwards and write CRC and file sizes
    position = self.fp.tell()       # Preserve current position in file
    self.fp.seek(zinfo.header_offset + 14, 0)
    self.fp.write(struct.pack("<lLL", zinfo.CRC, zinfo.compress_size,
                              zinfo.file_size))
    self.fp.seek(position, 0)
    self.filelist.append(zinfo)
    self.NameToInfo[zinfo.filename] = zinfo
Some (many? most?) compression algorithms are based on looking at redundancies across the entire file.
Some compression libraries will choose between several compression algorithms based on which works best on the file.
I believe the ZipFile module does this, so it wants to see the entire file, not just pieces at a time.
Hence, it won't work with generators or files too big to load into memory. That would explain the limitation of the ZipFile library.
In case anyone stumbles upon this question, which is still relevant in 2017 for Python 2.7, here's a working solution for a true streaming zip file, with no requirement for the output to be seekable as in the other cases. The secret is to set bit 3 of the general purpose bit flag (see https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT section 4.3.9.1).
Note that this implementation will always create a ZIP64-style file, allowing the streaming to work for arbitrarily large files. It includes an ugly hack to force the zip64 end of central directory record, so be aware it will cause all zipfiles written by your process to become ZIP64-style.
import io
import time
import zipfile
import zlib
import binascii
import struct

class ByteStreamer(io.BytesIO):
    '''
    Variant on BytesIO which lets you write and consume data while
    keeping track of the total filesize written. When data is consumed
    it is removed from memory, keeping the memory requirements low.
    '''
    def __init__(self):
        super(ByteStreamer, self).__init__()
        self._tellall = 0

    def tell(self):
        return self._tellall

    def write(self, b):
        orig_size = super(ByteStreamer, self).tell()
        super(ByteStreamer, self).write(b)
        new_size = super(ByteStreamer, self).tell()
        self._tellall += (new_size - orig_size)

    def consume(self):
        bytes = self.getvalue()
        self.seek(0)
        self.truncate(0)
        return bytes

class BufferedZipFileWriter(zipfile.ZipFile):
    '''
    ZipFile writer with true streaming (input and output).
    Created zip files are always ZIP64-style because it is the only safe way to stream
    potentially large zip files without knowing the full size ahead of time.

    Example usage:
    >>> def stream():
    >>>     bzfw = BufferedZipFileWriter()
    >>>     for arc_path, buffer in inputs:  # buffer is a file-like object which supports read(size)
    >>>         for chunk in bzfw.streambuffer(arc_path, buffer):
    >>>             yield chunk
    >>>     yield bzfw.close()
    '''
    def __init__(self, compression=zipfile.ZIP_DEFLATED):
        self._buffer = ByteStreamer()
        super(BufferedZipFileWriter, self).__init__(self._buffer, mode='w', compression=compression, allowZip64=True)

    def streambuffer(self, zinfo_or_arcname, buffer, chunksize=2**16):
        if not isinstance(zinfo_or_arcname, zipfile.ZipInfo):
            zinfo = zipfile.ZipInfo(filename=zinfo_or_arcname,
                                    date_time=time.localtime(time.time())[:6])
            zinfo.compress_type = self.compression
            zinfo.external_attr = 0o600 << 16  # ?rw-------
        else:
            zinfo = zinfo_or_arcname

        zinfo.file_size = file_size = 0
        zinfo.flag_bits = 0x08  # Streaming mode: crc and size come after the data
        zinfo.header_offset = self.fp.tell()

        self._writecheck(zinfo)
        self._didModify = True

        zinfo.CRC = CRC = 0
        zinfo.compress_size = compress_size = 0
        self.fp.write(zinfo.FileHeader())
        if zinfo.compress_type == zipfile.ZIP_DEFLATED:
            cmpr = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -15)
        else:
            cmpr = None

        while True:
            buf = buffer.read(chunksize)
            if not buf:
                break

            file_size += len(buf)
            CRC = binascii.crc32(buf, CRC) & 0xffffffff
            if cmpr:
                buf = cmpr.compress(buf)
                compress_size += len(buf)

            self.fp.write(buf)
            compressed_bytes = self._buffer.consume()
            if compressed_bytes:
                yield compressed_bytes

        if cmpr:
            buf = cmpr.flush()
            compress_size += len(buf)
            self.fp.write(buf)
            zinfo.compress_size = compress_size
            compressed_bytes = self._buffer.consume()
            if compressed_bytes:
                yield compressed_bytes
        else:
            zinfo.compress_size = file_size

        zinfo.CRC = CRC
        zinfo.file_size = file_size

        # Write CRC and file sizes after the file data
        # Always write as zip64 -- only safe way to stream what might become a large zipfile
        fmt = '<LQQ'
        self.fp.write(struct.pack(fmt, zinfo.CRC, zinfo.compress_size, zinfo.file_size))
        self.fp.flush()

        self.filelist.append(zinfo)
        self.NameToInfo[zinfo.filename] = zinfo

        yield self._buffer.consume()

    # The close method needs to be patched to force writing a ZIP64 file
    # We'll hack ZIP_FILECOUNT_LIMIT to do the forcing
    def close(self):
        tmp = zipfile.ZIP_FILECOUNT_LIMIT
        zipfile.ZIP_FILECOUNT_LIMIT = 0
        super(BufferedZipFileWriter, self).close()
        zipfile.ZIP_FILECOUNT_LIMIT = tmp
        return self._buffer.consume()
The gzip library will take a file-like object for compression.
class GzipFile([filename [,mode [,compresslevel [,fileobj]]]])
You still need to provide a nominal filename for inclusion in the gzip header, but you can pass your data source to fileobj.
(This answer differs from that of Damnsweet, in that the focus should be on the data-source being incrementally read, not the compressed file being incrementally written.)
And I see now the original questioner won't accept Gzip :-(
Now with Python 2.7 you can add data to the zipfile directly, instead of adding a file:
http://docs.python.org/2/library/zipfile#zipfile.ZipFile.writestr
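A minimal writestr sketch (archive and member names are placeholders): the data is added to the archive directly from memory, without an intermediate file on disk.
import zipfile

with zipfile.ZipFile('out.zip', 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.writestr('data.txt', b'contents generated in memory')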
This is 2017. If you are still looking to do this elegantly, use Python Zipstream by allanlei.
So far, it is probably the only well written library to accomplish that.
You can use stream-zip for this (full disclosure: written mostly by me).
Say you have generators of bytes you want to zip:
def file_data_1():
    yield b'Some bytes a'
    yield b'Some bytes b'

def file_data_2():
    yield b'Some bytes c'
    yield b'Some bytes d'
You can create a single iterable of the zipped bytes of these generators:
from datetime import datetime
from stream_zip import ZIP_64, stream_zip

def zip_member_files():
    modified_at = datetime.now()
    perms = 0o600
    yield 'my-file-1.txt', modified_at, perms, ZIP_64, file_data_1()
    yield 'my-file-2.txt', modified_at, perms, ZIP_64, file_data_2()

zipped_chunks = stream_zip(zip_member_files())
And then, for example, save this iterable to disk by:
with open('my.zip', 'wb') as f:
    for chunk in zipped_chunks:
        f.write(chunk)
