Zipping files with maximum size limit - python

I have a few directories that contain a varying number of files, and I want to create zip files for each directory containing all of the files in the directory. This is fine for most of them, but one directory has significantly more files and zipping the entire thing would result in a 20GB+ file. I'd rather limit the maximum size of the zip file and split it into, say, 5GB parts. Is there an easy way to do this with Python? I'm using the zipfile module right now, but I'm not seeing a way to tell it to automatically split into multiple zips at a certain filesize.

If you can use RAR instead of ZIP, you can try the rar-package:
from librar import archive
myRAR = archive.Archive("resultPath", base)
myRAR.add_file("FilePath")
myRAR.set_volume_size("5000000K")  # split the archive into 5 GB volumes

Update: that rar-package is outdated and does not work with Python 3, but there is a better option now, calling the rar command-line tool directly:
rar a -m5 -v10m myarchive movie.avi
It will compress movie.avi and split the result into 10 MB chunks (-v10m), using the best compression ratio (-m5).
more info:
https://ubuntuincident.wordpress.com/2011/05/27/compress-with-rar-and-split-into-multiple-files/
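If you would rather stay with the standard-library zipfile module from the question, here is a rough sketch of one approach: pack files greedily and start a new archive whenever the running total of uncompressed sizes would pass the limit. The function name and the part-naming scheme are just illustrative, the limit applies to input size rather than final compressed size, and a single file bigger than the limit still ends up in its own oversized part.
import os
import zipfile

def zip_in_parts(src_dir, dest_prefix, max_bytes=5 * 1024**3):
    # Pack the files in src_dir into dest_prefix_001.zip, _002.zip, ...
    # keeping the uncompressed total of each part under max_bytes.
    part, current, total = 1, None, 0
    for name in sorted(os.listdir(src_dir)):
        path = os.path.join(src_dir, name)
        size = os.path.getsize(path)
        if current is None or total + size > max_bytes:
            if current is not None:
                current.close()
            current = zipfile.ZipFile("%s_%03d.zip" % (dest_prefix, part), "w",
                                      zipfile.ZIP_DEFLATED)
            part, total = part + 1, 0
        current.write(path, arcname=name)
        total += size
    if current is not None:
        current.close()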

Related

Python shutil.make_archive resulting in larger file than original

I want to tar a 49 GB folder so it's compressed, but not persisted afterwards.
I'm using the below code to create a temporary directory (I chose the parent path) and then shutil.make_archive:
import tempfile
import shutil

with tempfile.TemporaryDirectory(dir=temp_dir_path) as temp_output_dir:
    output_path = shutil.make_archive(temp_output_dir, "tar", dir_to_tar)
However, the tar file generated is larger than the size du -sh reports for the original folder; I stopped the code when it reached 51 GB.
Am I using this wrong?
The tar (tape archive) format doesn't compress, it just "archives" files together (it's basically just a directory structure and the files it contains shoved into a single file). Definitionally, it's a little larger than the original data.
If you want the tarball to be compressed so it ends up smaller than the original data, use one of the compressed format arguments to make_archive, e.g. 'gztar' (faster, worse compression, most portable) or 'xztar' (slower, best compression, somewhat less portable) instead of just 'tar'.
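For example, switching the question's snippet to gzip compression only means changing the format argument (temp_dir_path and dir_to_tar are the placeholders from the question):
import shutil
import tempfile

with tempfile.TemporaryDirectory(dir=temp_dir_path) as temp_output_dir:
    # 'gztar' produces a .tar.gz; 'xztar' would produce a smaller .tar.xz
    output_path = shutil.make_archive(temp_output_dir, "gztar", dir_to_tar)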

How to remove duplicate pdf files from a list in python

I have a list containing PDF file names:
l=['ab.pdf', 'cd.pdf', 'ef.pdf', 'gh.pdf']
Some of these four files are duplicates of one another; only the names differ. How do I delete the duplicates from the list?
For example, ab.pdf and cd.pdf are the same, so the final output should be
l=['ab.pdf', 'ef.pdf', 'gh.pdf']
I have tried the filecmp library, but it only tells me whether two given files are duplicates.
What is the most efficient, Pythonic way to do this?
You need to give the program the actual paths of the files and compare their contents one by one, even though that does not sound very efficient.
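One way to make that comparison cheap is to hash each file's contents and keep only the first name seen for each digest. This is only a sketch: it assumes the names in the list are valid paths and that "duplicate" means byte-for-byte identical content.
import hashlib

def deduplicate(filenames):
    # Keep the first file seen for each distinct content hash.
    seen = set()
    unique = []
    for name in filenames:
        with open(name, "rb") as fh:
            digest = hashlib.sha256(fh.read()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(name)
    return unique

l = deduplicate(['ab.pdf', 'cd.pdf', 'ef.pdf', 'gh.pdf'])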

Simple way to concatenate files with maximum size of source files

I have several million small text files. I would like to concatenate them into bigger files of around 10 MB each in order to process them faster. Before I start on a Python script, I wonder if there is a way to do this from the shell, like a maximum-file-size parameter for cat or something similar?
Maybe try cat on multiple files and push the standard output to a file? Like this:
cat * > one_big_file
If you don't want to put all files into one big file, but into several smaller ones, maybe group their file names with a regex? (The exact solution depends on what your filenames look like.)
cat `ls | grep [regex]` > one_big_file
You can also try creating one big file and then splitting it into several parts with:
split -b10m one_big_file part
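If you do end up writing the Python script instead, a minimal sketch (assuming all the small files sit in src_dir and their order does not matter) is to roll over to a new output file once roughly 10 MB has been written; the names here are only placeholders.
import os

def concat_in_chunks(src_dir, dest_prefix, max_bytes=10 * 1024**2):
    # Concatenate files from src_dir into dest_prefix_000, _001, ...
    # starting a new output file once max_bytes have been written.
    part, written = 0, 0
    out = open("%s_%03d" % (dest_prefix, part), "wb")
    for name in sorted(os.listdir(src_dir)):
        with open(os.path.join(src_dir, name), "rb") as fh:
            data = fh.read()
        if written and written + len(data) > max_bytes:
            out.close()
            part, written = part + 1, 0
            out = open("%s_%03d" % (dest_prefix, part), "wb")
        out.write(data)
        written += len(data)
    out.close()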

Most efficient/fastest way to get a single file from a directory

What is the most efficient and fastest way to get a single file from a directory using Python?
More details on my specific problem:
I have a directory containing a lot of pregenerated files, and I just want to pick a random one. Since I know there's no really efficient way of picking a random file from a directory other than listing all the files first, my files are generated with already-random names, so they are effectively in random order, and I just need to pick the first file from the folder.
So my question is: how can I pick the first file from my folder without having to load the whole list of files from the directory (nor having the OS do that; ideally I would force the OS to return just a single file and then stop)?
Note: I have a lot of files in my directory, hence why I would like to avoid listing all the files to just pick one.
Note2: each file is only picked once, then deleted to ensure that only new files are picked the next time (thus ensuring some kind of randomness).
SOLUTION
I finally chose to use an index file that stores:
the index of the current file to be picked (e.g. 1 for file1.ext, 2 for file2.ext, etc.)
the index of the last file generated (e.g. 1999 for file1999.ext)
Of course, this means my files are no longer generated with random names, but with a deterministic, incrementable pattern (e.g. "file%s.ext" % ID).
Thus I have a near constant time for my two main operations:
Accessing the next file in the folder
Counting the number of files that are left (so that I can generate new files in a background thread when needed).
This is a specific solution for my problem, for more generic solutions, please read the accepted answer.
Also, you might be interested in these two other solutions I've found to optimize file access and directory walking in Python:
os.walk optimized
Python FAM (File Alteration Monitor)
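A rough sketch of that index-file scheme (index.txt and the file%s.ext pattern are just the illustration used above; a real version would also need locking if the generator thread and the picker run concurrently):
import os

INDEX_FILE = "index.txt"  # holds two integers: next index to pick, last index generated

def pick_next(folder):
    # Return the path of the next file to process and advance the index.
    with open(os.path.join(folder, INDEX_FILE)) as fh:
        next_id, last_id = map(int, fh.read().split())
    if next_id > last_id:
        return None  # nothing left to pick
    with open(os.path.join(folder, INDEX_FILE), "w") as fh:
        fh.write("%d %d" % (next_id + 1, last_id))
    return os.path.join(folder, "file%s.ext" % next_id)

def files_left(folder):
    # Near constant-time count of files not yet picked.
    with open(os.path.join(folder, INDEX_FILE)) as fh:
        next_id, last_id = map(int, fh.read().split())
    return last_id - next_id + 1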
Don't keep a lot of pregenerated files in one directory; divide them over subdirectories once a directory holds more than 'n' files.
When creating the files, add the name of the newest file to a list stored in a text file. When you want to read/process/delete a file:
Open the text file
Set filename to the name on the top of the list.
Delete the name from the top of the list
Close the text file
Process filename.
Just use random.choice() on the os.listdir() result:
import random
import os
randomfilename = random.choice(os.listdir(path_to_directory))
os.listdir() returns results in the ordering given by the OS. Using random filenames does not change that ordering; only adding items to or removing items from the directory can influence it.
If you fear that you'll have too many files, do not use a single directory. Instead, set up a tree of directories with pre-generated names, pick one of those at random, and then pick a file from there.

Is there a python module for regex matching in zip files

I have over a million text files compressed into 40 zip files. I also have a list of about 500 model names of phones. I want to find out the number of times a particular model was mentioned in the text files.
Is there any Python module that can do a regex match on the files without unzipping them? Is there a simple way to solve this problem without unzipping?
There's nothing that will automatically do what you want.
However, Python's zipfile module makes this easy to do. Here's how to iterate over the lines in each file of an archive:
#!/usr/bin/python
import zipfile

f = zipfile.ZipFile('myfile.zip')
for subfile in f.namelist():
    print(subfile)
    data = f.read(subfile).decode('utf-8')
    for line in data.split('\n'):
        print(line)
You could loop through the zip files, reading individual members with the zipfile module and running your regex on them, which eliminates the need to unzip all the files at once.
I'm fairly certain that you can't run a regex over the zipped data, at least not meaningfully.
To access the contents of a zip file you have to unzip it, although the zipfile package makes this fairly easy, as you can unzip each file within an archive individually.
Python zipfile module
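Putting that loop together for the original counting problem, a rough sketch might look like the following (the *.zip glob, the UTF-8 assumption, and the two model names are placeholders for the real data):
import re
import zipfile
from collections import Counter
from glob import glob

models = ["modelA", "modelB"]  # stand-ins for the ~500 real model names
pattern = re.compile("|".join(re.escape(m) for m in models))

counts = Counter()
for zip_path in glob("*.zip"):
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.namelist():
            text = zf.read(member).decode("utf-8", errors="ignore")
            counts.update(pattern.findall(text))

print(counts.most_common(10))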
Isn't it (at least theoretically) possible to read in the ZIP's Huffman coding and then translate the regexp into that Huffman code? Might this be more efficient than first decompressing the data and then running the regexp?
(Note: I know it wouldn't be quite that simple: you'd also have to deal with other aspects of the ZIP coding—file layout, block structures, back-references—but one imagines this could be fairly lightweight.)
EDIT: Also note that it's probably much more sensible to just use the zipfile solution.
