In a Django project, I need to generate some PDF files for objects in the database. Since each file takes a few seconds to generate, I use Celery to run the tasks asynchronously.
The problem is, I need to add each file to a zip archive. I was planning to use the Python zipfile module, but different tasks can run in different threads, and I wonder what will happen if two tasks try to add a file to the archive at the same time.
Is the following code thread-safe or not? I cannot find any useful information in Python's official docs.
import os
from zipfile import ZipFile

zippath = os.path.join(pdf_directory, 'archive.zip')
zipfile = ZipFile(zippath, 'a')
try:
    zipfile.write(pdf_fullname)
finally:
    zipfile.close()
Note: this is running under Python 2.6.
No, it is not thread-safe in that sense.
If you're appending to the same zip file, you'd need a lock there, or the file contents could get scrambled.
If you're appending to different zip files, using separate ZipFile() objects, then you're fine.
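For the single-archive case, here is a minimal sketch of the lock approach, assuming the tasks run as threads inside one worker process (with several worker processes you would need a cross-process lock, e.g. a file lock, instead); archive_lock and add_pdf_to_archive are just hypothetical names:

import os
import threading
from zipfile import ZipFile

# One lock shared by every task in this worker process (hypothetical name).
archive_lock = threading.Lock()

def add_pdf_to_archive(pdf_fullname, pdf_directory):
    zippath = os.path.join(pdf_directory, 'archive.zip')
    with archive_lock:  # only one thread may append to the archive at a time
        archive = ZipFile(zippath, 'a')
        try:
            archive.write(pdf_fullname)
        finally:
            archive.close()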
Python 3.5.5 makes writing to a ZipFile and reading multiple ZipExtFiles thread-safe: https://docs.python.org/3.5/whatsnew/changelog.html#id93
As far as I can tell, the change has not been backported to Python 2.7.
Update: after studying the code and some testing, it became apparent that the locking is still not thoroughly implemented. It only works correctly for writestr and does not work for open and write.
While this question is old, it's still high up in Google results, so I just want to chime in to say that on Python 3.4 64-bit on Windows I noticed the LZMA zipfile is thread-safe; all the other compression methods fail.
with zipfile.ZipFile("test.zip", "w", zipfile.ZIP_LZMA) as zip:
    # do stuff in threads
Note that you can't bind the same file to multiple zipfile.ZipFile instances; instead, you have to use the same one in all threads; here that is the variable named zip.
In my case I get about 80-90% CPU use across 8 cores with an SSD, which is nice.
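For reference, a hedged sketch of that sharing pattern might look like the following (Python 3; the file names and payloads are placeholders, and whether this is actually safe on your interpreter version is exactly what the notes above discuss, so test it yourself):

import threading
import zipfile

def worker(zf, name, data):
    # Every thread writes through the same shared ZipFile instance.
    zf.writestr(name, data)

with zipfile.ZipFile("test.zip", "w", zipfile.ZIP_LZMA) as zf:
    threads = [
        threading.Thread(target=worker, args=(zf, "file%d.txt" % i, b"payload"))
        for i in range(8)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()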
I have a program which processes a zip file using zipfile. It works with an iterator, since the uncompressed file is bigger than 2 GB and loading it whole could become a memory problem.
import zipfile
from io import BytesIO

with zipfile.ZipFile(BytesIO(my_file)) as myzip:
    for file_inside in myzip.namelist():
        with myzip.open(file_inside) as file:
            # Process here
            # for loop ....
Then I noticed that this process was extremely slow on my file. I can understand that it may take some time, but at least it should use my machine's resources: let's say the Python process should use the core it runs on at 100%.
Since it doesn't, I started researching the possible root causes. I'm not an expert in compression matters, so I first considered the basic things:
Resources don't seem to be the problem: there's plenty of RAM available, even though my approach wouldn't use much of it anyway.
CPU usage is not high, not even on a single core.
The file being opened is only about 80 MB when compressed, so disk reads should not be the bottleneck either.
This made me think that the bottleneck could be one of the less visible factors: RAM bandwidth. However, I have no idea how I could measure this.
Then, on the software side, I found this in the zipfile docs:
Decryption is extremely slow as it is implemented in native Python rather than C.
I guess that if it's using native Python, it's not even using OpenGL acceleration, so that's another point for slowness. I'm also curious about how this method works, again because of the low CPU usage.
So my question is, of course: how could I work in a similar way (not having the fully uncompressed file in RAM), but uncompress it faster in Python? Is there another library, or maybe another approach, to overcome this slowness?
There is a library for Python that handles zipping files without the memory hassle.
Quoted from the docs:
Buzon - ZipFly
ZipFly is a zip archive generator based on zipfile.py. It was created by Buzon.io to generate very large ZIP archives for immediate sending out to clients, or for writing large ZIP archives without memory inflation.
I've never used it, but it may help.
I've done some research and found the following:
You could "pip install czipfile", more information at https://pypi.org/project/czipfile/
Another solution is to use "Cython", a variant of python -https://www.reddit.com/r/Python/comments/cksvp/whats_a_python_zip_library_with_fast_decryption/
Or you could outsource to 7-Zip, as explained here: Faster alternative to Python's zipfile module?
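If you go the 7-Zip route, a minimal sketch of shelling out might look like this (it assumes the 7z binary is on your PATH; the archive name, output directory and password are placeholders):

import subprocess

# Extract test.zip into /tmp/out using the external 7z binary.
# '-o' sets the output directory and '-p' supplies the archive password.
subprocess.check_call(["7z", "x", "test.zip", "-o/tmp/out", "-ppassword"])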
It's quite stupid that Python doesn't implement zip decryption in pure C.
So I implemented it in Cython, which is 17 times faster.
Just get the dezip.pyx and setup.py from this gist.
https://gist.github.com/zylo117/cb2794c84b459eba301df7b82ddbc1ec
Then install Cython and build the extension:
pip3 install cython
python3 setup.py build_ext --inplace
Then run the original script with two more lines.
import zipfile
# add these two lines
from dezip import _ZipDecrypter_C
setattr(zipfile, '_ZipDecrypter', _ZipDecrypter_C)
z = zipfile.ZipFile('./test.zip', 'r')
z.extractall('/tmp/123', None, b'password')
I'm using a Python package, cdo, which relies heavily on tempfile for storing intermediate results. The temporary files it creates are quite large, and when running bigger calculations I've run into the problem that the /tmp directory fills up and the script fails with a disk-full error (we are talking about tens to hundreds of GB). I've found a workaround by creating a local folder, say $HOME/tmp, and then doing
import os
import tempfile

# Note: Python does not expand '$HOME' in a plain string, so expand it explicitly.
tempfile.tempdir = os.path.expanduser('~/tmp')
before importing the cdo module. While this works for me, it is somewhat cumbersome if I also want others to use my scripts. Therefore I was wondering whether there is a more elegant way to solve the problem, e.g. by telling tempfile to periodically clear out all temporary files (usually this is only done once the script finishes). From my side this would be possible, because I am running a long loop which produces one named file per iteration, and all the temporary files created during that iteration are discardable afterwards.
As the examples in the docs show, you can use tempfile with a context manager:
import tempfile

with tempfile.TemporaryFile() as fp:
    fp.write(b'Hello world!')
    fp.seek(0)
    fp.read()
That way, the file is removed as soon as the context exits.
...do you have that much control over how cdo uses tempfiles?
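If cdo only picks up tempfile.tempdir rather than letting you pass a directory explicitly, a hedged per-iteration variant (Python 3; work_items, run_cdo_step and the base path are placeholders for your own loop) could look like this:

import os
import tempfile

def run_cdo_step(item):
    """Placeholder for the cdo calls that produce one named output file."""

work_items = range(3)  # placeholder for your long loop

base = os.path.expanduser('~/tmp')  # a filesystem with enough space
os.makedirs(base, exist_ok=True)

for item in work_items:
    with tempfile.TemporaryDirectory(dir=base) as tmp:
        tempfile.tempdir = tmp  # intermediate files for this iteration land here
        run_cdo_step(item)
    # the directory and everything in it is deleted when the 'with' block exits
tempfile.tempdir = None  # restore the default behaviour afterwards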
I'm currently working with Python 2.7's ZipFile. I have been running into memory issues when zipping multiple files to send to the user.
When I have code like this:
import zipfile

fp = open('myzip.zip', 'wb')
archive = zipfile.ZipFile(fp, 'w', zipfile.ZIP_DEFLATED)
for filepath in filepaths:
    archive.write(filepath)
archive.close()
Does Python load all those files into memory at some point? I would have expected Python to stream the contents of the files into the zip, but I'm not sure that that is the case.
This question/answer (both by the same user) suggests that it's all done in memory. It links to a modified library, EnhancedZipFile, that sounds like it works the way you want; however, the project doesn't seem to have had much activity.
If you're not dependent on zip specifically, then this answer implies that the bzip library can handle large files.
Python will hold all the files you iterate over in memory; there is no way I can think of to bypass that, except for calling an executable that does the job for you at the operating-system level.
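If you do fall back to an external tool, a rough sketch might look like this (it assumes the zip command-line utility is installed; the file names are placeholders):

import subprocess

filepaths = ["report1.pdf", "report2.pdf"]  # placeholder file list

# The external 'zip' utility reads each file from disk itself, so the Python
# process never has to hold the file contents in memory.
subprocess.check_call(["zip", "myzip.zip"] + filepaths)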
I have some pickled data stored on disk, and it is about 100 MB in size.
When my Python program is executed, the pickled data is loaded using the cPickle module, and all of that works fine.
If I execute the program multiple times, e.g. with python main.py, each Python process will load the same data again, which is the expected behaviour.
How can I make it so that all new Python processes share this data, so it is only loaded into memory a single time?
If you're on Unix, one possibility is to load the data into memory, and then have the script use os.fork() to create a bunch of sub-processes. As long as the sub-processes don't attempt to modify the data, they would automatically share the parent's copy of it, without using any additional memory.
Unfortunately, this won't work on Windows.
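A minimal sketch of that pattern might look like the following (data.pkl is a placeholder for your pickle file, and the children's work is just a print):

import os
import pickle  # cPickle on Python 2

# Load the data once, in the parent, before forking.
with open('data.pkl', 'rb') as f:
    data = pickle.load(f)

children = []
for i in range(4):
    pid = os.fork()
    if pid == 0:
        # Child process: read-only access keeps the pages shared with the parent.
        print("child %d sees %d objects" % (i, len(data)))
        os._exit(0)
    children.append(pid)

for pid in children:
    os.waitpid(pid, 0)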
P.S. I once asked about placing Python objects into shared memory, but that didn't produce any easy solutions.
Depending on how seriously you need to solve this problem, you may want to look at memcached, if that is not overkill.
I have a program that creates files in a specific directory.
When those files are ready, I run LaTeX to produce a .pdf file.
So, my question is: how can I use this directory change as a trigger to call LaTeX, using a shell script or a Python script?
inotify replaces dnotify.
Why?
...dnotify requires opening one file descriptor for each directory that you intend to watch for changes...
Additionally, the file descriptor pins the directory, disallowing the backing device to be unmounted, which causes problems in scenarios involving removable media. When using inotify, if you are watching a file on a file system that is unmounted, the watch is automatically removed and you receive an unmount event.
...and more.
More Why?
Unlike its ancestor dnotify, inotify doesn't complicate your work with various limitations. For example, if you watch files on removable media, these files aren't locked. In comparison, dnotify requires the files themselves to be open and thus really "locks" them (hampers unmounting the media).
Reference
Is dnotify what you need?
Make on Unix systems is usually used to track, by date, what needs rebuilding when files have changed. I normally use a rather good makefile for this job. There seems to be another alternative around on Google Code, too.
You not only need to check for changes, you also need to know that all changes are complete before running LaTeX. For example, if you start LaTeX after the first file has been modified while more changes are still pending, you'll be using partial data and will have to re-run later.
Wait for your first program to complete:
#!/bin/bash
first-program &&
run-after-changes-complete
Using && means the second command is only executed if the first completes successfully (a zero exit code). Because this simple script will always run the second command even if the first doesn't change any files, you can incorporate this into whatever build system you are already familiar with, such as make.
Python FAM is a Python interface for FAM (File Alteration Monitor)
You can also have a look at Pyinotify, which is a module for monitoring file system changes.
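A minimal Pyinotify sketch might look like this (untested; the watched directory and the LaTeX command are placeholders):

import subprocess

import pyinotify

WATCH_DIR = "/path/to/output"  # directory your program writes into

class Handler(pyinotify.ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        # A file was written and closed; rebuild the PDF.
        print("changed: %s" % event.pathname)
        subprocess.call(["pdflatex", "document.tex"])  # placeholder command

wm = pyinotify.WatchManager()
notifier = pyinotify.Notifier(wm, Handler())
wm.add_watch(WATCH_DIR, pyinotify.IN_CLOSE_WRITE)
notifier.loop()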
I'm not much of a Python man myself, but in a pinch, assuming you're on Linux, you could periodically shell out to "ls -lrt /path/to/directory" (list the directory contents sorted by last modification time) and compare the results of the last two calls. If they differ, something changed. Not very detailed, but it gets the job done.
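A rough polling sketch of that idea (the directory path and the poll interval are placeholders):

import subprocess
import time

WATCH_DIR = "/path/to/directory"  # placeholder

previous = None
while True:
    # Directory listing sorted by modification time, oldest first.
    current = subprocess.check_output(["ls", "-lrt", WATCH_DIR])
    if previous is not None and current != previous:
        print("directory changed")  # trigger LaTeX here
    previous = current
    time.sleep(5)  # poll every few seconds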
You can use the built-in Python module hashlib, which provides the MD5 algorithm:
>>> import hashlib
>>> import os
>>> m = hashlib.md5()
>>> for root, dirs, files in os.walk(path):
...     for file_read in files:
...         full_path = os.path.join(root, file_read)
...         for line in open(full_path).readlines():
...             m.update(line)
...
>>> m.digest()
'pQ\x1b\xb9oC\x9bl\xea\xbf\x1d\xda\x16\xfe8\xcf'
You can save this result in a file or a variable and compare it to the result of the next run. This will detect changes in any file, in any sub-directory.
This does not take into account file permission changes; if you need to monitor those as well, you could append a string representing the permissions (available via os.stat, for instance; the exact attributes depend on your system) to the m variable.
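For example, one hedged way to fold the permissions into the digest (path plays the same role as in the snippet above and is a placeholder here):

import hashlib
import os

path = "/path/to/watch"  # placeholder directory to fingerprint

m = hashlib.md5()
for root, dirs, files in os.walk(path):
    for file_read in sorted(files):
        full_path = os.path.join(root, file_read)
        with open(full_path, 'rb') as f:
            m.update(f.read())
        # Mix the permission bits into the digest as well.
        m.update(oct(os.stat(full_path).st_mode).encode())
print(m.hexdigest())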