Faster alternative to Python's zipfile module?

Is there a noticeably faster alternative to Python 2.7.4's zipfile module (with ZIP_DEFLATED) for zipping a large number of files into a single zip file? I had a look at czipfile (https://pypi.python.org/pypi/czipfile/1.0.0), but it appears to be focused on faster decryption, not faster compression.
I routinely have to compress a large number of image files (~12,000 files, a mix of .exr and .tiff, each ~1-6 MB, ~9 GB in total) into a single zip file for shipment. The zipping takes ~90 minutes (running on Windows 7 64-bit).
If anyone can recommend a different Python module (or alternatively a C/C++ library, or even a standalone tool) that can compress a large number of files into a single .zip file in less time than the zipfile module, that would be greatly appreciated; anything around ~5-10% faster (or more) would be very helpful.

As Patashu mentions, outsourcing to 7-zip might be the best idea.
Here's some sample code to get you started:
import os
import subprocess

# Path to the 7-Zip executable and the directory containing the files to compress
path_7zip = r"C:\Program Files\7-Zip\7z.exe"
path_working = r"C:\temp"
outfile_name = "compressed.zip"

os.chdir(path_working)
# "a" adds files to an archive, "-tzip" selects the zip format,
# and "-pSECRET" sets an archive password (drop it if you don't need one)
ret = subprocess.check_output([path_7zip, "a", "-tzip", outfile_name, "*.txt", "*.py", "-pSECRET"])
As martineau mentioned, you might experiment with compression methods. This page gives some examples of how to change the command-line parameters.
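If speed matters more than compression ratio, 7-Zip's -mx switch sets the compression level and -mmt enables multithreading. A minimal sketch along those lines (the install path, level, and file patterns below are assumptions, not tested settings):
import subprocess

path_7zip = r"C:\Program Files\7-Zip\7z.exe"  # assumed install location
# -mx=1 trades compression ratio for speed; -mmt=on uses multiple CPU cores
subprocess.check_output([path_7zip, "a", "-tzip", "-mx=1", "-mmt=on",
                         "compressed.zip", "*.exr", "*.tiff"])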

Related

Compressing Excel file does not reduce the size

I am using Python to automate the conversion of an Excel file to a zip archive. The file is zipped using the code below.
import os
from zipfile import ZipFile
path = r'C:/Users/kj/Desktop/Results'
os.chdir(path)
new_path = "C:/Users/kj/Desktop/New folder/"
ZipFile(new_path+"Week6.zip", 'w').write("Week6.csv")
The problem is that the code above produces a zip file but does not reduce the size of the file; it stays the same. When I zip the file manually, the size is reduced considerably. Please suggest what I can do to get the same result from the code.
That is because the default compression method is ZIP_STORED, which stores the files uncompressed. You have to specify a compression method explicitly, for example LZMA:
from zipfile import ZipFile, ZIP_LZMA
ZipFile(new_path + "Week6.zip", mode='w', compression=ZIP_LZMA).write("Week6.csv")
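If LZMA turns out to be too slow for your data, ZIP_DEFLATED is the more common choice, and the compresslevel argument (available since Python 3.7) lets you trade speed against ratio. A minimal sketch reusing the same paths from the question:
from zipfile import ZipFile, ZIP_DEFLATED

# compresslevel ranges from 0 (no compression) to 9 (smallest, slowest)
with ZipFile(new_path + "Week6.zip", mode='w',
             compression=ZIP_DEFLATED, compresslevel=6) as zf:
    zf.write("Week6.csv")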

Python shutil.make_archive resulting in larger file than original

I want to tar a 49 GB folder so it's compressed, but not persisted afterwards.
I'm using the code below to create a temporary directory (I chose its parent path) and then calling shutil.make_archive:
import tempfile
import shutil

with tempfile.TemporaryDirectory(dir=temp_dir_path) as temp_output_dir:
    output_path = shutil.make_archive(temp_output_dir, "tar", dir_to_tar)
However, the size of the generated tar file is larger than what du -sh reports for the original folder. I stopped the code when it reached 51 GB.
Am I using this wrong?
The tar (tape archive) format doesn't compress, it just "archives" files together (it's basically just a directory structure and the files it contains shoved into a single file). Definitionally, it's a little larger than the original data.
If you want the tarball to be compressed so it ends up smaller than the original data, use one of the compressed format arguments to make_archive, e.g. 'gztar' (faster, worse compression, most portable) or 'xztar' (slower, best compression, somewhat less portable) instead of just 'tar'.
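For example, switching the format argument is the only change needed; the directory names here are placeholders:
import shutil

# 'gztar' -> .tar.gz (gzip-compressed), 'xztar' -> .tar.xz (LZMA-compressed)
output_path = shutil.make_archive("/tmp/backup", "gztar", "/path/to/dir_to_tar")
print(output_path)  # e.g. /tmp/backup.tar.gz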

Unzip folder by chunks in python

I have a big zip file containing many files that I'd like to unzip in chunks to avoid consuming too much memory.
I tried the zipfile module, but I didn't find a way to load the archive in chunks and extract it to disk.
Is there a simple way to do that in Python?
EDIT
@steven-rumbalski correctly pointed out that zipfile already handles big files by unzipping the files one by one, without loading the full archive into memory.
My problem here is that my zip file is on AWS S3 and my EC2 instance cannot load such a big file into RAM, so I download it in chunks and would like to unzip it in chunks as well.
You don't need a special way to extract a large archive to disk. The source Lib/zipfile.py shows that zipfile is already memory efficient. Creating a zipfile.ZipFile object does not read the whole file into memory. Rather it just reads in the table of contents for the ZIP file. ZipFile.extractall() extracts files one at a time using shutil.copyfileobj() copying from a subclass of io.BufferedIOBase.
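To make that concrete, here is a minimal sketch that streams each member to disk one at a time, which is essentially what extractall() does internally (the archive and target names are placeholders):
import os
import shutil
import zipfile

target = "target-dir"                       # assumed output directory
with zipfile.ZipFile("archive.zip") as zf:  # assumed archive name
    for info in zf.infolist():
        if info.is_dir():
            continue
        dest = os.path.join(target, info.filename)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        # zf.open() returns a file-like object; copyfileobj() streams it to
        # disk in small buffers instead of reading the whole member into memory
        with zf.open(info) as src, open(dest, "wb") as dst:
            shutil.copyfileobj(src, dst)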
If all you want to do is a one-time extraction, Python provides a shortcut from the command line:
python -m zipfile -e archive.zip target-dir/
You can use zipfile (or possibly tarfile) as follows:
import zipfile

def extract_chunk(fn, directory, ix_begin, ix_end):
    with zipfile.ZipFile(fn, 'r') as zf:
        infos = zf.infolist()
        for ix in range(max(0, ix_begin), min(ix_end, len(infos))):
            zf.extract(infos[ix], directory)

directory = "path"
extract_chunk("{}/file.zip".format(directory), directory, 0, 50)

Zipping files with maximum size limit

I have a few directories that contain a varying number of files, and I want to create zip files for each directory containing all of the files in the directory. This is fine for most of them, but one directory has significantly more files and zipping the entire thing would result in a 20GB+ file. I'd rather limit the maximum size of the zip file and split it into, say, 5GB parts. Is there an easy way to do this with Python? I'm using the zipfile module right now, but I'm not seeing a way to tell it to automatically split into multiple zips at a certain filesize.
If you can use RAR instead of ZIP, you can try the rar-package:
from librar import archive

myRAR = archive.Archive("resultPath", base)
myRAR.add_file("FilePath")
myRAR.set_volume_size("5000000K")  # split the archive into volumes of about 5 GB each
Update: that rar-package is outdated and does not work with Python 3, but there is a better solution now, using the rar command-line tool directly:
rar a -m5 -v10m myarchive movie.avi
This compresses movie.avi and splits it into 10 MB chunks (-v10m), using the best compression ratio (-m5).
More info:
https://ubuntuincident.wordpress.com/2011/05/27/compress-with-rar-and-split-into-multiple-files/
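If you want to stay with zip and the standard library, one workaround is to group files into separate archives so each archive stays under a size budget. This is only a rough sketch: it budgets on uncompressed sizes (so the compressed parts will usually come out smaller than the limit), and the names and the 5 GB default are assumptions:
import os
import zipfile

def zip_in_parts(file_paths, out_prefix, max_bytes=5 * 1024**3):
    """Write files into numbered zip archives, starting a new archive once
    the running (uncompressed) total would exceed max_bytes."""
    part, current = 1, 0
    zf = zipfile.ZipFile("{}.part{}.zip".format(out_prefix, part), "w", zipfile.ZIP_DEFLATED)
    for path in file_paths:
        size = os.path.getsize(path)
        if current and current + size > max_bytes:
            zf.close()
            part += 1
            current = 0
            zf = zipfile.ZipFile("{}.part{}.zip".format(out_prefix, part), "w", zipfile.ZIP_DEFLATED)
        zf.write(path, os.path.basename(path))
        current += size
    zf.close()

# Example usage (paths are hypothetical):
# files = [os.path.join("some_dir", f) for f in os.listdir("some_dir")]
# zip_in_parts(files, "backup")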

Is there a python module for regex matching in zip files

I have over a million text files compressed into 40 zip files. I also have a list of about 500 phone model names. I want to find out how many times each model is mentioned in the text files.
Is there a Python module that can do a regex match on the files without first unzipping everything to disk? Is there a simple way to solve this problem?
There's nothing that will automatically do what you want.
However, the Python zipfile module makes this easy to do. Here's how to iterate over the lines in each file:
#!/usr/bin/python
import zipfile

f = zipfile.ZipFile('myfile.zip')
for subfile in f.namelist():
    print subfile
    data = f.read(subfile)
    for line in data.split('\n'):
        print line
You could loop through the zip files, reading individual members with the zipfile module and running your regex on those, which eliminates the need to unzip everything at once.
I'm fairly certain that you can't run a regex over the zipped data, at least not meaningfully.
To access the contents of a zip file you have to unzip it, although the zipfile package makes this fairly easy, as you can unzip each file within an archive individually.
Python zipfile module
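Putting that together, a minimal sketch of the counting loop (the archive names, model list, and UTF-8 assumption are all placeholders):
import re
import zipfile

models = ["model-a", "model-b"]          # hypothetical model names
patterns = {m: re.compile(re.escape(m), re.IGNORECASE) for m in models}
counts = {m: 0 for m in models}

for zip_path in ["texts01.zip", "texts02.zip"]:   # hypothetical archives
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            text = zf.read(name).decode("utf-8", errors="ignore")
            for model, pat in patterns.items():
                counts[model] += len(pat.findall(text))

print(counts)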
Isn't it (at least theoretically) possible to read in the ZIP's Huffman coding and then translate the regexp into the Huffman code? Might this be more efficient than first decompressing the data and then running the regexp?
(Note: I know it wouldn't be quite that simple: you'd also have to deal with other aspects of the ZIP format, such as file layout, block structures and back-references, but one imagines this could be fairly lightweight.)
EDIT: Also note that it's probably much more sensible to just use the zipfile solution.
