Python shutil.make_archive resulting in larger file than original

I want to tar a 49 GB folder so it's compressed, but not persisted afterwards.
I'm using the code below to create a temporary directory (I chose the parent path) and then call shutil.make_archive:
import tempfile
import shutil

with tempfile.TemporaryDirectory(dir=temp_dir_path) as temp_output_dir:
    output_path = shutil.make_archive(temp_output_dir, "tar", dir_to_tar)
However, the tar file being generated is larger than the size du -sh reports for the original folder. I stopped the code when the archive reached 51 GB.
Am I using this wrong?

The tar (tape archive) format doesn't compress, it just "archives" files together (it's basically just a directory structure and the files it contains shoved into a single file). By definition it's a little larger than the original data, since each member gets its own header block plus padding.
If you want the tarball to be compressed so it ends up smaller than the original data, use one of the compressed format arguments to make_archive, e.g. 'gztar' (faster, worse compression, most portable) or 'xztar' (slower, best compression, somewhat less portable) instead of just 'tar'.
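For the snippet in the question, only the format argument needs to change. A minimal sketch, reusing the question's temp_output_dir and dir_to_tar names:

import shutil

# "gztar" appends ".tar.gz" to the base name; "xztar" would give ".tar.xz"
output_path = shutil.make_archive(temp_output_dir, "gztar", dir_to_tar)
print(output_path)  # e.g. .../tmpxyz.tar.gz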

Related

Compressing Excel file does not reduce the size

I am using Python to automate the conversion of an Excel file to a zip folder. The file is converted to zip using the code below.
import os
from zipfile import ZipFile
path = r'C:/Users/kj/Desktop/Results'
os.chdir(path)
new_path = "C:/Users/kj/Desktop/New folder/"
ZipFile(new_path+"Week6.zip", 'w').write("Week6.csv")
The problem is that the code above converts it to a zip file but does not reduce the file's size; it stays the same. But when I zip the file manually, the size is reduced considerably. Please suggest what I can do to get the same reduction from the code.
The default compression method is ZIP_STORED, which stores the exact files in uncompressed form, so you have to specify a compression algorithm explicitly, e.g. LZMA:
from zipfile import ZipFile, ZIP_LZMA
ZipFile(new_path + "Week6.zip", mode='w', compression=ZIP_LZMA).write("Week6.csv")
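If whoever receives the archive needs it to open with older zip tools, ZIP_DEFLATED (ordinary zip compression) is the more portable choice. A minimal sketch reusing the question's paths, assuming Python 3.7+ for the compresslevel argument:

from zipfile import ZipFile, ZIP_DEFLATED

# DEFLATE is what most zip utilities produce by default; compresslevel ranges 0-9
with ZipFile(new_path + "Week6.zip", mode='w',
             compression=ZIP_DEFLATED, compresslevel=9) as zf:
    zf.write("Week6.csv")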

zip archives created inside mpi4py cannot be opened

I ran some code in python that created a large number of .csv files. I'm doing this on a supercomputer and I used mpi4py to manage the parallel processing. Several processes ran on each node, with a number of operations done by each process. Each process created a .csv file. I didn't want to drop all the .csv files onto the main hard disk because there were so many. So instead, I wrote them to local SSDs, each SSD connected to a node. Then, I wanted to zip all the .csvs from each SSD into one zip archive per node and put the archive on the main hard drive. I did it like this:
from mpi4py import MPI
import os
import platform
import zipfile

# MPI setup
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

def get_localrank(comm):
    comm_localrank = MPI.Comm.Split_type(comm, MPI.COMM_TYPE_SHARED, 0)
    return comm_localrank.Get_rank()

def ls_csv(path, ext=".csv"):
    all_files = os.listdir(path)
    return [path + "/" + f for f in all_files if f.endswith(ext)]

def ssd_to_drive(comm, drive_path):
    # only execute on one MPI process per node
    if get_localrank(comm) == 0:
        # every node's file needs a different name
        archive_name = drive_path + "/data_" + platform.node() + ".zip"
        csvs = ls_csv("ssd_path")
        zf = zipfile.ZipFile(archive_name, 'w', zipfile.ZIP_DEFLATED)
        for csvf in csvs:
            zf.write(csvf)
        zf.close()

## copy archived csvs from the ssds back to project
ssd_to_drive(comm=comm, drive_path="mypath")
I get the zip files back in the directory I want them, but I can't open them. Either they are corrupt or unzip thinks they are part of a multi-part archive, and if that's the case, I don't see how to reconstruct them.
When I do less data_nodename.zip, I get the following (where the alphanumeric string before each error statement represents "nodename"):
e33n08
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
f09n09
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
f09n10
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
e33n06
Zipfile is disk 37605 of a multi-disk archive, and this is not the disk on
which the central zipfile directory begins (disk 25182).
f09n11
Zipfile is disk 40457 of a multi-disk archive, and this is not the disk on
which the central zipfile directory begins (disk 740).
a06n10
end-of-central-directory record claims this
is disk 11604 but that the central directory starts on disk 48929; this is a
contradiction. Attempting to process anyway.
f10n14
end-of-central-directory record claims this
is disk 15085 but that the central directory starts on disk 52010; this is a
contradiction.
Of note, I only have 3 zip files that seem to claim to be the beginning of the multi-disk archive (e33n08, f09n09, f09n10), but at least 4 references to a "beginning of central directory" disk (25182, 740, 48929, 52010).
So now I can't figure out if these are corrupt or if zipfile really thought it was creating a multi-disk archive, why this happened, or how to reconstruct the archive if it really is multi-disk.
Finally, when I did this procedure with multiple mpi tasks but only one node, the single zip archive created was fine and readable with less and unzip.
Can't believe I missed this, but the files were indeed corrupt. The job had exited before writing the zip archives was complete.
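One way to guard against that (a sketch using the question's own helper names, not the poster's actual fix): let a context manager finalise each archive so the end-of-central-directory record is always written, and put an MPI barrier before any rank is allowed to exit.

def ssd_to_drive(comm, drive_path):
    if get_localrank(comm) == 0:
        archive_name = drive_path + "/data_" + platform.node() + ".zip"
        # the with-block closes the archive and flushes the central directory
        with zipfile.ZipFile(archive_name, 'w', zipfile.ZIP_DEFLATED) as zf:
            for csvf in ls_csv("ssd_path"):
                zf.write(csvf)

ssd_to_drive(comm=comm, drive_path="mypath")
comm.Barrier()  # no rank exits until every node has finished writing its archive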

Unzip folder by chunks in python

I have a big zip file containing many files that I'd like to unzip in chunks to avoid consuming too much memory.
I tried to use the Python module zipfile, but I didn't find a way to load the archive in chunks and extract it to disk.
Is there a simple way to do that in Python?
EDIT
@steven-rumbalski correctly pointed out that zipfile handles big files correctly by unzipping the files one by one without loading the full archive.
My problem here is that my zip file is on AWS S3 and my EC2 instance cannot load such a big file into RAM, so I download it in chunks and would like to unzip it in chunks as well.
You don't need a special way to extract a large archive to disk. The source Lib/zipfile.py shows that zipfile is already memory efficient. Creating a zipfile.ZipFile object does not read the whole file into memory. Rather it just reads in the table of contents for the ZIP file. ZipFile.extractall() extracts files one at a time using shutil.copyfileobj() copying from a subclass of io.BufferedIOBase.
If all you want to do is a one-time extraction Python provides a shortcut from the command line:
python -m zipfile -e archive.zip target-dir/
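The same thing in code (nothing here is specific to large archives; the point is that members are streamed one at a time rather than read into RAM):

import zipfile

with zipfile.ZipFile("archive.zip") as zf:
    zf.extractall("target-dir/")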
You can use zipfile (or possibly tarfile) as follows:
import zipfile

def extract_chunk(fn, directory, ix_begin, ix_end):
    # extract only the members with indices [ix_begin, ix_end) into directory
    with zipfile.ZipFile(fn, 'r') as zf:
        infos = zf.infolist()
        for ix in range(max(0, ix_begin), min(ix_end, len(infos))):
            zf.extract(infos[ix], directory)

directory = "path"
extract_chunk("{}/file.zip".format(directory), directory, 0, 50)

Zipping files with maximum size limit

I have a few directories that contain a varying number of files, and I want to create zip files for each directory containing all of the files in the directory. This is fine for most of them, but one directory has significantly more files and zipping the entire thing would result in a 20GB+ file. I'd rather limit the maximum size of the zip file and split it into, say, 5GB parts. Is there an easy way to do this with Python? I'm using the zipfile module right now, but I'm not seeing a way to tell it to automatically split into multiple zips at a certain filesize.
If you can use RAR instead of ZIP, you can try this rar-package:
from librar import archive
myRAR = archive.Archive("resultPath",base)
myRAR.add_file("FilePath")
myRAR.set_volume_size("5000000K") # split archive based on volume size 5 Gigabytes
Update:
That rar-package is outdated and does not work with Python 3, but there is a better option now, calling the rar command-line tool directly:
rar a -m5 -v10m myarchive movie.avi
It will compress movie.avi and split it into 10 MB chunks (-v10m), using the best compression ratio (-m5).
more info:
https://ubuntuincident.wordpress.com/2011/05/27/compress-with-rar-and-split-into-multiple-files/
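If you need to stay with plain zip files and the standard library, one option is to group the inputs yourself and start a new archive whenever the running total of input sizes would pass the limit. A rough sketch (hypothetical helper; sizes are measured before compression, so the parts usually come out smaller than the threshold):

import os
import zipfile

def zip_in_parts(src_dir, out_prefix, max_bytes=5 * 1024**3):
    """Write the directory's files into out_prefix_001.zip, _002.zip, ..."""
    part, current, zf = 0, 0, None
    try:
        for name in sorted(os.listdir(src_dir)):
            path = os.path.join(src_dir, name)
            if not os.path.isfile(path):
                continue
            size = os.path.getsize(path)
            # open a new part before the first file or when the cap would be exceeded
            if zf is None or current + size > max_bytes:
                if zf is not None:
                    zf.close()
                part += 1
                zf = zipfile.ZipFile("%s_%03d.zip" % (out_prefix, part), "w",
                                     zipfile.ZIP_DEFLATED)
                current = 0
            zf.write(path, arcname=name)
            current += size
    finally:
        if zf is not None:
            zf.close()

zip_in_parts("big_directory", "big_directory_part", max_bytes=5 * 1024**3)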

Faster alternative to Python's zipfile module?

Is there a noticeably faster alternative to the Python 2.7.4 zipfile module (with ZIP_DEFLATED) for zipping a large number of files into a single zip file? I had a look at czipfile (https://pypi.python.org/pypi/czipfile/1.0.0), but that appears to be focused on faster decryption (not compression).
I am routinely having to process a large number of image files (~12,000 files of a combination of .exr and .tiff files) with each file between ~1MB - 6MB in size (and ~9 GB for all the files) into a single zip file for shipment. This zipping takes ~90 minutes to process (running on Windows 7 64bit).
If anyone can recommend a different python module (or alternatively a C/C++ library or even a standalone tool) that would be able to compress a large number of files into a single .zip file in less time than the zipfile module, that would be greatly appreciated (anything close to ~5-10% faster (or more) would be very helpful).
As Patashu mentions, outsourcing to 7-zip might be the best idea.
Here's some sample code to get you started:
import os
import subprocess
path_7zip = r"C:\Program Files\7-Zip\7z.exe"
path_working = r"C:\temp"
outfile_name = "compressed.zip"
os.chdir(path_working)
ret = subprocess.check_output([path_7zip, "a", "-tzip", outfile_name, "*.txt", "*.py", "-pSECRET"])
As martineau mentioned, you might experiment with compression methods. This page gives some examples of how to change the command-line parameters.
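For example, 7-Zip's -mx switch trades compression ratio for speed; if the images are already internally compressed, a low level may be much faster for little size penalty. A sketch reusing the variables above (the file patterns are just placeholders for your .exr/.tiff sets):

import subprocess

# -mx=1 is the fastest setting, -mx=9 gives the smallest output
ret = subprocess.check_output(
    [path_7zip, "a", "-tzip", "-mx=1", outfile_name, "*.exr", "*.tiff"]
)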
