Through much trial and error, I've figured out how to make LZMA-compressed files with PyLZMA, but I'd like to replicate the seemingly simple task of recursively compressing a folder and all of its files/directories into a 7z file. I would just do it through 7z.exe, but I can't seem to catch the stdout of the process until it's finished, and I would like some per-file progress since I will be compressing folders that range from hundreds of gigabytes to over a terabyte in size. Unfortunately I can't provide any code I have tried, simply because the only examples I've seen of py7zlib extract files from pre-existing archives. Has anybody had any luck combining these two, or could you provide some help?
For what it's worth, this would be on Windows using Python 2.7. Bonus points if there's some magic multi-threading that could happen here, especially given how slow LZMA compression seems to be (time, however, is not an issue here). Thanks in advance!
A pure Python alternative is to create .tar.xz files with a combination of the standard library tarfile module and the liblzma wrapper module pyliblzma. This will create files comparable in size to .7z archives:
import tarfile
import lzma  # provided by the pyliblzma package on Python 2.7

TAR_XZ_FILENAME = 'archive.tar.xz'
DIRECTORY_NAME = '/usr/share/doc/'

# Write the tar stream through an LZMA-compressing file object
xz_file = lzma.LZMAFile(TAR_XZ_FILENAME, mode='w')
with tarfile.open(mode='w', fileobj=xz_file) as tar_xz_file:
    tar_xz_file.add(DIRECTORY_NAME)
xz_file.close()
The tricky part is the progress report. The example above uses the recursive mode of the tarfile.TarFile add method for directories, so add will not return until the whole directory tree has been added.
The following questions discuss possible strategies for monitoring the progress of the tar file creation; a small sketch follows the links:
Python tarfile progress output?
Python tarfile progress
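One possible strategy, sketched below under the assumption that a simple per-file count is enough (the progress print and counters are illustrative, not a complete solution): walk the directory yourself and add each file with recursive=False, so control returns after every file.

import os
import tarfile
import lzma  # pyliblzma on Python 2.7

TAR_XZ_FILENAME = 'archive.tar.xz'
DIRECTORY_NAME = '/usr/share/doc/'

# Collect the file list up front so the total is known for progress reporting.
all_paths = []
for root, dirs, files in os.walk(DIRECTORY_NAME):
    for name in files:
        all_paths.append(os.path.join(root, name))

xz_file = lzma.LZMAFile(TAR_XZ_FILENAME, mode='w')
with tarfile.open(mode='w', fileobj=xz_file) as tar_xz_file:
    for index, path in enumerate(all_paths, 1):
        # recursive=False: only this entry is added, so we regain control
        # (and can print progress) after every file.
        tar_xz_file.add(path, recursive=False)
        print('%d/%d %s' % (index, len(all_paths), path))
xz_file.close()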
I have a program which processes a zip file using zipfile. It works with an iterator, since the uncompressed data is bigger than 2 GB and could become a memory problem.
import zipfile
from io import BytesIO

with zipfile.ZipFile(BytesIO(my_file)) as myzip:
    for file_inside in myzip.namelist():
        with myzip.open(file_inside) as file:
            # Process here
            # for loop ....
Then I noticed that this process was extremely slow when processing my file. I can understand that it may take some time, but at least it should use my machine's resources: let's say the Python process should be using 100% of the core it runs on.
Since it doesn't, I started researching possible root causes. I'm not an expert in compression matters, so I first considered the basic things:
Resources don't seem to be the problem; there's plenty of RAM available, even if my coding approach wouldn't use it.
CPU usage is not high, not even for a single core.
The file being opened is only about 80 MB compressed, so disk reads should not be the slowing factor either.
This made me think that the bottleneck might be one of the less visible parameters: RAM bandwidth. However, I have no idea how I could measure it.
Then on the software side, I found on the zipfile docs:
Decryption is extremely slow as it is implemented in native Python rather than C.
I guess that if it's implemented in pure Python, it isn't using any hardware acceleration either, so that's another point for slowness. I'm also curious about how this method works, again because of the low CPU usage.
So my question is, of course: how could I work in a similar way (not having the fully uncompressed file in RAM), but decompress faster in Python? Is there another library, or maybe another approach, to overcome this slowness?
There is a library for Python that handles zipping files without the memory hassle.
Quoted from the docs:
Buzon - ZipFly
ZipFly is a zip archive generator based on zipfile.py. It was created by Buzon.io to generate very large ZIP archives for immediate sending out to clients, or for writing large ZIP archives without memory inflation.
I've never used it, but it might help.
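A minimal sketch, assuming the interface shown in the ZipFly README (a paths list of {'fs': ...} entries and a generator() method); treat the exact API as an assumption and check the project's documentation before relying on it:

import zipfly

# 'fs' is the on-disk path of each file to stream into the archive (per the README).
paths = [{'fs': '/path/to/large_file.bin'}]

zfly = zipfly.ZipFly(paths=paths)

# The generator yields the zip stream chunk by chunk, so the whole archive
# never has to sit in memory at once.
with open('large.zip', 'wb') as out:
    for chunk in zfly.generator():
        out.write(chunk)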
I've done some research and found the following:
You could "pip install czipfile", more information at https://pypi.org/project/czipfile/
Another solution is to use "Cython", a variant of python -https://www.reddit.com/r/Python/comments/cksvp/whats_a_python_zip_library_with_fast_decryption/
Or you could outsource to 7-Zip, as explained here: Faster alternative to Python's zipfile module?
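A minimal sketch of the outsourcing option, driving the 7z command-line tool with subprocess (the archive path, output directory, and password below are placeholders, and 7z must be on PATH):

import subprocess

# 'x' extracts with full paths; -o sets the output directory (no space after -o),
# -p supplies the password, -y assumes Yes on all prompts.
subprocess.check_call([
    '7z', 'x', 'my_archive.zip',
    '-o/tmp/extracted',
    '-pMyPassword',
    '-y',
])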
It's quite stupid that Python doesn't implement zip decryption in pure C.
So I made it in Cython, which is 17 times faster.
Just get the dezip.pyx and setup.py from this gist.
https://gist.github.com/zylo117/cb2794c84b459eba301df7b82ddbc1ec
And install Cython and build the Cython extension:
pip3 install cython
python3 setup.py build_ext --inplace
Then run the original script with two more lines.
import zipfile
# add these two lines
from dezip import _ZipDecrypter_C
setattr(zipfile, '_ZipDecrypter', _ZipDecrypter_C)
z = zipfile.ZipFile('./test.zip', 'r')
z.extractall('/tmp/123', None, b'password')
I'm currently working with Python 2.7's ZipFile. I have been running into memory issues when zipping multiple files to send to the user.
When I have code:
import zipfile

fp = open('myzip.zip', 'w')
archive = zipfile.ZipFile(fp, 'w', zipfile.ZIP_DEFLATED)
for filepath in filepaths:
    archive.write(filepath)
archive.close()
Does Python load all those files into memory at some point? I would have expected Python to stream the contents of the files into the zip, but I'm not sure that this is the case.
This question/answer (both by the same user) suggests that it's all done in memory. They provide a link to a modified library, EnhancedZipFile, that sounds like it works as you desire; however, the project doesn't seem to have had much activity.
If you're not dependent on zip specifically, then this answer implies that the bzip library can handle large files.
Python will hold all the files you iterate over in memory; there is no way I can think of to bypass that, except for calling an executable that will do the job for you at the operating-system level.
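If you go the external-executable route, a minimal sketch might look like this (assuming the Info-ZIP zip command is installed and filepaths holds the files to add, as in the question):

import subprocess

# Let the external zip tool do the compression; Python only supervises it.
subprocess.check_call(['zip', 'myzip.zip'] + list(filepaths))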
I've been packing up a script and some resources into an executable zipfile, using the technique in (for example) this blog post.
The process suddenly stopped working, and I think it has to do with the zip64 extension. When I try to run the executable zipfile I get:
/usr/bin/python: can't find '__main__' module in '/path/to/my_app.zip'
I believe that the only change is that one of the resource files (a disk image) has gotten larger. I've verified that __main__.py is still in the root of the archive. The size of the zipfile used to be 600MB, and is now 2.5GB. I noticed in the zipimport docs the following statement:
ZIP archives with an archive comment are currently not supported.
Reading through the wikipedia article on the zipfile format I see that:
The .ZIP file format allows for a comment containing up to 65,535 bytes of data to occur at the end of the file after the central directory.[25]
And later, regarding zip64:
In essence, it uses a "normal" central directory entry for a file, followed by an optional "zip64" directory entry, which has the larger fields.[29]
Inferring a bit, it sounds like this might be what's happening: my zipfile has grown to require the zip64 extension. The zip64 extension data is stored in the comment section so now there is an active comment section, and python's zipimport is refusing to read my zipfile.
Can anyone provide guidance on:
verifying that this is indeed why Python can't find __main__.py in my zipfile
providing a workaround
Note that the image file has always been 16 GB in size; however, it used to occupy only 600 MB on disk (it resides on an ext4 filesystem, if that matters). It now occupies more than 7 GB on disk. From the Wikipedia page:
The original .ZIP format had a 4 GiB limit on various things (uncompressed size of a file, compressed size of a file and total size of the archive)
I build the zipfile using a Python script, so to try to work around this issue I add the Python code to the zipfile before adding the image file, the thought being that Python might simply ignore the comment section and see a valid zipfile containing the Python code but not the large image file. This doesn't appear to be the case.
Digging into the Python source code, in zipimport.c, we can see that it indeed looks for the end-of-central-directory record in the last 22 bytes of the file. It does not search backwards from there if it doesn't find a valid end-of-central-directory record (and I guess that makes it a non-compliant zip parser). What it does do is look at the offset and size of the central directory reported in the end-of-central-directory record: offset + size should be the location of the end-of-central-directory record itself. If it is not, it computes the difference between the actual and expected location and adds this offset to all offsets in the central directory. This means that Python supports loading modules from a zipfile that has been catted onto the end of another file.
It appears, however, that the zipimport implementation in 2.7.6 (distributed with Ubuntu 14.04) is broken for large zipfiles (greater than 2 GB? I suspect the maximum of a signed 32-bit long). It works, however, in Python 3.4.3 (also distributed with Ubuntu 14.04). zipimport.c changed somewhere between 2.7.6 and 2.7.12, so it may work in newer Python 2 releases.
My solution is to pack a resource zip and a code zip, then cat them together, and run the app with python3 instead of python2. I write the offset and size of the resource zip to a metadata file in the code zip. The code uses this information and filechunkio to get a zipfile.ZipFile for the resource-zip segment of the packed file.
Not a great solution, but it works.
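For reference, a minimal sketch of that last step, assuming the offset and size of the resource zip were recorded at build time; the metadata format and file names here are illustrative, not the author's exact code:

import json
import zipfile
from filechunkio import FileChunkIO

# Offset/size of the resource zip inside the combined pack file,
# written out when the two zips were catted together.
with open('resource_meta.json') as f:
    meta = json.load(f)

# Expose just that byte range as a seekable file-like object...
chunk = FileChunkIO('my_app.pack', 'r', offset=meta['offset'], bytes=meta['size'])

# ...and let zipfile treat it as a standalone archive.
resources = zipfile.ZipFile(chunk)
print(resources.namelist())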
In a django project, I need to generate some pdf files for objects in db. Since each file takes a few seconds to generate, I use celery to run tasks asynchronously.
Problem is, I need to add each file to a zip archive. I was planning to use the python zipfile module, but different tasks can be run in different threads, and I wonder what will happen if two tasks try to add a file to the archive at the same time.
Is the following code thread-safe or not? I cannot find any useful information in Python's official docs.
import os
from zipfile import ZipFile

try:
    zippath = os.path.join(pdf_directory, 'archive.zip')
    zipfile = ZipFile(zippath, 'a')
    zipfile.write(pdf_fullname)
finally:
    zipfile.close()
Note: this is running under python 2.6
No, it is not thread-safe in that sense.
If you're appending to the same zip file, you'd need a lock there, or the file contents could get scrambled.
If you're appending to different zip files, using separate ZipFile() objects, then you're fine.
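A minimal sketch of the lock-based approach, assuming all tasks run as threads in the same process (with prefork workers you would need a cross-process lock instead); the names are illustrative:

import os
import threading
from zipfile import ZipFile

archive_lock = threading.Lock()

def add_pdf_to_archive(pdf_directory, pdf_fullname):
    zippath = os.path.join(pdf_directory, 'archive.zip')
    # Only one thread may append to the shared archive at a time.
    with archive_lock:
        archive = ZipFile(zippath, 'a')
        try:
            archive.write(pdf_fullname)
        finally:
            archive.close()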
Python 3.5.5 makes writing to ZipFile and reading multiple ZipExtFiles threadsafe: https://docs.python.org/3.5/whatsnew/changelog.html#id93
As far as I can tell, the change has not been backported to Python 2.7.
Update: after studying the code and doing some testing, it becomes apparent that the locking is still not thoroughly implemented. It works correctly only for writestr and doesn't work for open and write.
While this question is old, it's still high up in Google results, so I just want to chime in and say that I noticed that on Python 3.4 64-bit on Windows, the LZMA zipfile is thread-safe; all the others fail.
with zipfile.ZipFile("test.zip", "w", zipfile.ZIP_LZMA) as zip:
#do stuff in threads
Note that you can't bind the same file to multiple zipfile.ZipFile instances; instead, you have to use the same one in all threads, here the variable named zip.
In my case I get about 80-90% CPU use on 8 cores and SSD, which is nice.
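For reference, the pattern described above might look like the following sketch: one shared ZipFile instance handed to every worker thread. Whether this is actually safe depends on the interpreter build, as noted above, and the file names and payloads are illustrative.

import threading
import zipfile

def worker(zf, name, data):
    # Every thread writes its own member into the one shared archive.
    zf.writestr(name, data)

with zipfile.ZipFile("test.zip", "w", zipfile.ZIP_LZMA) as zf:
    threads = [
        threading.Thread(target=worker, args=(zf, "file%d.txt" % i, "payload %d" % i))
        for i in range(8)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()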
I have about 200,000 text files that are packed into a bz2 file. The issue I have is that when I scan the bz2 file to extract the data I need, it goes extremely slow: it has to look through the entire bz2 file to find the single file I am looking for. Is there any way to speed this up?
Also, I thought about organizing the files in the tar.bz2 so that it knows where to look instead. Is there any way to organize files that are put into a bz2?
More Info/Edit:
I need to query the compressed file for each text file. Is there a better compression method that supports such a large number of files and compresses roughly as well?
Do you have to use bzip2? Reading its documentation, it's quite clear it's not designed to support random access. Perhaps you should use a compression format that more closely matches your requirements. The good old zip format supports random access, but might compress worse, of course.
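For example, with the zip format only the requested member gets decompressed (the archive and member names below are placeholders):

import zipfile

# Only the named member is read and decompressed, not the whole archive.
with zipfile.ZipFile('archive.zip') as zf:
    data = zf.read('some_file.txt')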
Bzip2 compresses in large blocks (900 KiB by default, I believe). One method that would speed up scanning of the tar file dramatically, at some cost in compression ratio, would be to compress each file individually and then tar the results together. This is essentially what zip-format files are (though they use zlib compression rather than bzip2). You could then easily read the tar index and decompress only the specific file(s) you are looking for.
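A minimal sketch of that per-file approach, assuming the text files sit in a directory called textfiles and each fits in memory on its own (the names here are placeholders):

import bz2
import io
import os
import tarfile

# Build: compress each file on its own, then store the .bz2 members in a
# plain (uncompressed) tar so single members can be pulled out later.
with tarfile.open('archive.tar', 'w') as tar:
    for name in os.listdir('textfiles'):
        with open(os.path.join('textfiles', name), 'rb') as f:
            compressed = bz2.compress(f.read())
        info = tarfile.TarInfo(name + '.bz2')
        info.size = len(compressed)
        tar.addfile(info, io.BytesIO(compressed))

# Query: seek straight to one member and decompress just that file.
with tarfile.open('archive.tar', 'r') as tar:
    member = tar.extractfile('wanted_file.txt.bz2')
    text = bz2.decompress(member.read())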
I don't think most tar programs offer much ability to organize files in any meaningful way, though you could write a program to do this for your special case (I know Python has tar-writing libraries though I've only used them once or twice). However, you'd still have the problem of having to decompress most of the data before you found what you were looking for.