Fast zip decryption in Python

I have a program that processes a zip file using zipfile. It works with an iterator, since the uncompressed file is bigger than 2 GB and could otherwise become a memory problem.
import zipfile
from io import BytesIO

with zipfile.ZipFile(BytesIO(my_file)) as myzip:
    for file_inside in myzip.namelist():
        with myzip.open(file_inside) as file:
            # Process here
            # for loop ....
            ...
Then I noticed that this process was extremely slow. I can understand that it may take some time, but it should at least use my machine's resources: say, the Python process should max out the core it runs on.
Since it doesn't, I started researching possible root causes. I'm not an expert in compression matters, so I first considered the basics:
Resources don't seem to be the problem: there's plenty of RAM available, even though my streaming approach wouldn't use much of it anyway.
CPU usage is not high, not even on a single core.
The file being opened is only about 80 MB compressed, so disk reads shouldn't be the limiting factor either.
This made me think that the bottleneck could be one of the less visible factors, such as RAM bandwidth. However, I have no idea how I could measure that.
Then, on the software side, I found this in the zipfile docs:
Decryption is extremely slow as it is implemented in native Python rather than C.
I guess that if it's implemented in native Python, it isn't using any kind of compiled or hardware acceleration either, so that's one more reason for the slowness. I'm also curious about how this method works, again because of the low CPU usage.
So my question is, of course: how can I work in a similar way (without keeping the full uncompressed file in RAM) but decompress faster in Python? Is there another library, or another approach, to overcome this slowness?
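A quick way to confirm that decryption is involved at all is to check bit 0 of ZipInfo.flag_bits, which marks an encrypted entry. A minimal sketch, reusing my_file from the snippet above:

import zipfile
from io import BytesIO

with zipfile.ZipFile(BytesIO(my_file)) as myzip:
    for info in myzip.infolist():
        # Bit 0 of the general-purpose flags means the member is encrypted
        print(info.filename, 'encrypted' if info.flag_bits & 0x1 else 'not encrypted')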

There is a library for Python that handles zipping files without the memory hassle.
Quoted from the docs:
Buzon - ZipFly
ZipFly is a zip archive generator based on zipfile.py. It was created by Buzon.io to generate very large ZIP archives for immediate sending out to clients, or for writing large ZIP archives without memory inflation.
I've never used it, but it might help.
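For what it's worth, a minimal sketch of the kind of usage its README describes; the paths format and the generator() call here are assumptions drawn from that README, so check the current project docs before relying on them:

import zipfly

# Describe the source files; ZipFly streams them into a zip without inflating memory.
paths = [{'fs': '/path/to/a/very/large/file'}]

zfly = zipfly.ZipFly(paths=paths)

with open('/tmp/archive.zip', 'wb') as out:
    for chunk in zfly.generator():
        out.write(chunk)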

I've done some research and found the following:
You could "pip install czipfile", more information at https://pypi.org/project/czipfile/
Another solution is to use "Cython", a variant of python -https://www.reddit.com/r/Python/comments/cksvp/whats_a_python_zip_library_with_fast_decryption/
Or you could outsource to 7-Zip, as explained here: Faster alternative to Python's zipfile module?
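A hedged sketch of the 7-Zip route, shelling out from Python (the archive name, output directory, and password are placeholders; the 7z binary must be installed and on PATH):

import subprocess

# 'x' extracts with full paths; -o and -p take their values with no space after the flag.
subprocess.run(
    ['7z', 'x', 'my_archive.zip', '-o/tmp/extracted', '-pMY_PASSWORD'],
    check=True,
)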

It's quite stupid that Python doesn't implement zip decryption in pure C.
So I implemented it in Cython, which is 17 times faster.
Just get the dezip.pyx and setup.py from this gist.
https://gist.github.com/zylo117/cb2794c84b459eba301df7b82ddbc1ec
Then install Cython and build the extension in place:
pip3 install cython
python3 setup.py build_ext --inplace
Then run the original script with two more lines:
import zipfile
# add these two lines
from dezip import _ZipDecrypter_C
setattr(zipfile, '_ZipDecrypter', _ZipDecrypter_C)
z = zipfile.ZipFile('./test.zip', 'r')
z.extractall('/tmp/123', None, b'password')
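If you prefer the member-by-member streaming from the question over extractall, the same patch should apply, since ZipFile.open goes through the same internal decrypter. A sketch under that assumption (the monkey-patched _ZipDecrypter name has to match your Python version's internals; the file name and password are the answer's placeholders):

import zipfile
from dezip import _ZipDecrypter_C

# Swap the pure-Python decrypter for the Cython one, as above.
setattr(zipfile, '_ZipDecrypter', _ZipDecrypter_C)

with zipfile.ZipFile('./test.zip', 'r') as z:
    for name in z.namelist():
        with z.open(name, pwd=b'password') as member:
            # Read in chunks so the uncompressed data never sits fully in RAM.
            for chunk in iter(lambda: member.read(1024 * 1024), b''):
                pass  # process chunk here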

Related

How do I speed up repeated calls to a Ruby program (GitHub's linguist) from Python?

I'm using GitHub's linguist to identify unknown source code files. Running this from the command line after a gem install github-linguist is insanely slow. I'm using Python's subprocess module to make the command-line call on a stock Ubuntu 14 installation.
Running against an empty file, linguist __init__.py takes about 2 seconds (similar results for other files). I assume this is almost entirely Ruby's startup time. As @MartinKonecny points out, it seems to be the linguist program itself.
Is there some way to speed this process up -- or a way to bundle the calls together?
One possibility is to just adapt the linguist program (https://github.com/github/linguist/blob/master/bin/linguist) to take multiple paths on the command-line. It requires mucking with a bit of Ruby, sure, but it would make it possible to pass multiple files without the startup overhead of Linguist each time.
A script this simple could suffice:
require 'linguist/file_blob'

ARGV.each do |path|
  blob = Linguist::FileBlob.new(path, Dir.pwd)
  # print out blob.name, blob.language, blob.sloc, etc.
end
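On the Python side, a batched call then amortizes the Ruby startup cost across many files. A sketch, assuming the snippet above has been saved as batch_linguist.rb (a hypothetical name):

import subprocess

# One Ruby startup for the whole batch instead of one per file.
files = ['__init__.py', 'app.js', 'main.c']
output = subprocess.check_output(['ruby', 'batch_linguist.rb'] + files)
print(output)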

Huge memory usage of Python's json module?

When I load the file with json, Python's memory usage spikes to about 1.8 GB and I can't seem to get that memory released. I put together a very simple test case:
with open("test_file.json", 'r') as f:
j = json.load(f)
I'm sorry that I can't provide a sample JSON file; my test file has a lot of sensitive information, but for context, I'm dealing with a file on the order of 240 MB. After running the above two lines, I have the previously mentioned 1.8 GB of memory in use. If I then do del j, memory usage doesn't drop at all. If I follow that with gc.collect(), it still doesn't drop. I even tried unloading the json module and running another gc.collect().
I'm trying to run some memory profiling, but heapy has been churning at 100% CPU for about an hour now and has yet to produce any output.
Does anyone have any ideas? I've also tried the above using cjson rather than the packaged json module. cjson used about 30% less memory but otherwise displayed exactly the same issues.
I'm running Python 2.7.2 on Ubuntu Server 11.10.
I'm happy to load up any memory profiler and see if it does better than heapy, and to provide any diagnostics you might think are necessary. I'm hunting around for a large test JSON file that I can provide for anyone else to give it a go.
I think these two links address some interesting points about this not necessarily being a json issue, but rather just a "large object" issue, and about how memory works in Python versus the operating system.
See Why doesn't Python release the memory when I delete a large object? for why memory released by Python is not necessarily reflected by the operating system:
If you create a large object and delete it again, Python has probably released the memory, but the memory allocators involved don’t necessarily return the memory to the operating system, so it may look as if the Python process uses a lot more virtual memory than it actually uses.
About running large object processes in a subprocess to let the OS deal with cleaning up:
The only really reliable way to ensure that a large but temporary use of memory DOES return all resources to the system when it's done, is to have that use happen in a subprocess, which does the memory-hungry work then terminates. Under such conditions, the operating system WILL do its job, and gladly recycle all the resources the subprocess may have gobbled up. Fortunately, the multiprocessing module makes this kind of operation (which used to be rather a pain) not too bad in modern versions of Python.
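A minimal sketch of that subprocess approach (names here are illustrative): parse the file in a worker process and return only the small summary the parent needs, so the bloated heap disappears when the worker exits.

import json
from multiprocessing import Pool

def load_and_summarize(path):
    # The ~1.8 GB of parsed objects lives only inside this worker process.
    with open(path, 'r') as f:
        data = json.load(f)
    return len(data)  # return just the small result the parent needs

if __name__ == '__main__':
    pool = Pool(processes=1)
    try:
        summary = pool.apply(load_and_summarize, ('test_file.json',))
        print(summary)
    finally:
        pool.close()
        pool.join()
    # The worker has exited, so the OS reclaims everything it allocated.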

Preload files into memory to be passed to another utility as an argument

When I launch a .jar from Python, I pass a filename as an argument to the .jar, such as: java -jar xx.jar -file xx.file
I just noticed that when the Java process reads xx.file, it shows about 30% CPU usage in Task Manager.
So I'm wondering: can I pre-read the files into memory, and can mmap do that?
Any suggestions to improve this? If I have more than 50 java.exe processes, CPU usage and I/O will become a big issue for me.
This does not really seem like a Python question at all. But...
If you have 50 processes reading the same file, your OS will most likely cache this file for you already (if there's enough space to cache it, of course).
That should remove the problems with I/O (disk) cost.
But you say that reading results in 30% CPU usage in Task Manager. Do you know what this CPU time is really used for? Is it for reading the file? JIT-compiling your Java code? Just starting up the JVMs?
You should make sure you know exactly what the problem is before trying to solve it.
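If you do want to warm the cache explicitly before the Java processes start, a crude sketch is simply to read the file once and discard the data (mmap would work too, but as noted above the OS page cache usually makes this unnecessary):

# Read the file once so it lands in the OS file cache; later reads by the JVMs
# should then be served from memory instead of disk.
with open('xx.file', 'rb') as f:
    while f.read(1024 * 1024):
        pass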

Is python zipfile thread-safe?

In a Django project, I need to generate some PDF files for objects in the database. Since each file takes a few seconds to generate, I use Celery to run the tasks asynchronously.
The problem is, I need to add each file to a zip archive. I was planning to use the Python zipfile module, but different tasks can run in different threads, and I wonder what will happen if two tasks try to add a file to the archive at the same time.
Is the following code thread-safe or not? I cannot find any valuable information in Python's official docs.
try:
    zippath = os.path.join(pdf_directory, 'archive.zip')
    zipfile = ZipFile(zippath, 'a')
    zipfile.write(pdf_fullname)
finally:
    zipfile.close()
Note: this is running under Python 2.6
No, it is not thread-safe in that sense.
If you're appending to the same zip file, you'd need a lock there, or the file contents could get scrambled.
If you're appending to different zip files, using separate ZipFile() objects, then you're fine.
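A sketch of that locking approach, reusing pdf_directory and pdf_fullname from the question: one shared ZipFile, with a threading.Lock serializing every write.

import os
import threading
import zipfile

zip_lock = threading.Lock()
zippath = os.path.join(pdf_directory, 'archive.zip')
archive = zipfile.ZipFile(zippath, 'a')

def add_to_archive(pdf_fullname):
    # Only one thread at a time may append to the shared archive.
    with zip_lock:
        archive.write(pdf_fullname)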
Python 3.5.5 makes writing to a ZipFile and reading multiple ZipExtFiles thread-safe: https://docs.python.org/3.5/whatsnew/changelog.html#id93
As far as I can tell, the change has not been backported to Python 2.7.
Update: after studying the code and doing some testing, it becomes apparent that the locking is still not thoroughly implemented. It works correctly only for writestr, and doesn't work for open and write.
While this question is old, it's still high up in Google results, so I just want to chime in and say that I noticed that on Python 3.4, 64-bit, on Windows, the LZMA zipfile is thread-safe; all the others fail.
with zipfile.ZipFile("test.zip", "w", zipfile.ZIP_LZMA) as zip:
#do stuff in threads
Note that you can't bind the same file to multiple zipfile.ZipFile instances; instead, you have to use the same one in all threads (here, that's the variable named zip).
In my case I get about 80-90% CPU use on 8 cores and SSD, which is nice.

Fast file/directory scan method for Windows?

I'm looking for a high-performance method or library for scanning all files on disk, or in a given directory, and grabbing their basic stats: filename, size, and modification date.
I've written a Python program that uses os.walk along with os.path.getsize to get the file list, and it works fine, but is not particularly fast. I noticed one of the freeware programs I had downloaded accomplished the same scan much faster than my program.
Any ideas for speeding up the file scan? Here's my Python code, but keep in mind that I'm not at all married to os.walk and am perfectly willing to use other APIs (including Windows-native APIs) if there are better alternatives.
for root, dirs, files in os.walk(top, topdown=False):
    for name in files:
        ...
I should also note that I realize the Python code probably can't be sped up that much; I'm particularly interested in any native APIs that provide better speed.
Well, I would expect this to be a heavily I/O-bound task.
As such, optimizations on the Python side would be quite ineffective; the only optimization I can think of is some different way of accessing/listing files, in order to reduce the actual reads from the file system.
This of course requires deep knowledge of the file system, which I do not have, and which I do not expect Python's developers to have had while implementing os.walk.
What about spawning a command prompt, issuing 'dir', and parsing the results?
It could be a bit of an overkill, but with any luck, 'dir' already makes some effort at such optimizations.
It seems as if os.walk was considerably improved in Python 2.5, so you might check whether you're running that version.
Other than that, someone has already compared the speed of os.walk to ls and noticed a clear advantage for the latter, but not by a margin that would actually justify using it.
You might want to look at the code for some Python version control systems like Mercurial or Bazaar. They have devoted a lot of time to coming up with ways to quickly traverse a directory tree and detect changes (or "finding basic stats about the files").
Use the scandir Python module (formerly betterwalk) by Ben Hoyt, on GitHub:
http://github.com/benhoyt/scandir
It is much faster than os.walk, but uses the same syntax. Just import scandir and change os.walk() to scandir.walk(). That's it. It is the fastest way to traverse directories and files in Python.
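A minimal sketch of that drop-in swap, assuming the scandir package has been installed from PyPI:

import scandir

top = '.'  # directory to scan

# Same shape as the os.walk loop from the question, just a different walker.
for root, dirs, files in scandir.walk(top):
    for name in files:
        ...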
When you look at the code for os.walk, you'll see that there's not much fat to be trimmed.
For example, the following is only a hair faster than os.walk.
import os
import stat

listdir = os.listdir
pathjoin = os.path.join
fstat = os.stat
is_dir = stat.S_ISDIR
is_reg = stat.S_ISREG

def yieldFiles(path):
    for f in listdir(path):
        nm = pathjoin(path, f)
        s = fstat(nm).st_mode
        if is_dir(s):
            for sub in yieldFiles(nm):
                yield sub
        elif is_reg(s):
            yield f
        else:
            pass  # ignore these
Consequently, the overhead must be in the os module itself. You'll have to resort to making direct Windows API calls.
Look at the Python for Windows Extensions.
I'm wondering if you might want to group your I/O operations.
For instance, if you're walking a dense directory tree with thousands of files, you might try experimenting with walking the entire tree and storing all the file locations, and then looping through the (in-memory) locations and getting file statistics.
If your OS stores these two data in different locations (directory structure in one place, file stats in another), then this might be a significant optimization.
Anyway, that's something I'd try before digging further.
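A rough sketch of that two-pass idea (directory structure first, stats afterwards), reusing top from the question's snippet:

import os

# First pass: walk the tree and remember every file path, nothing else.
paths = []
for root, dirs, files in os.walk(top):
    for name in files:
        paths.append(os.path.join(root, name))

# Second pass: pull size and modification time from the in-memory list.
stats = []
for p in paths:
    st = os.stat(p)
    stats.append((p, st.st_size, st.st_mtime))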
Python 3.5 just introduced os.scandir (see PEP-0471) which avoids a number of non-required system calls such as stat() and GetFileAttributes() to provide a significantly quicker file-system iterator.
os.walk() will now be implemented using os.scandir() as its iterator, and so you should see potentially large performance improvements whilst continuing to use os.walk().
Example usage:
for entry in os.scandir(path):
    if not entry.name.startswith('.') and entry.is_file():
        print(entry.name)
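Since the question asks for filename, size, and modification date, it's worth noting that DirEntry.stat() can reuse information the directory scan already fetched (notably on Windows), so those stats typically come out without an extra system call per file. A small sketch:

import os

path = '.'  # directory to scan

for entry in os.scandir(path):
    if entry.is_file():
        info = entry.stat()  # on Windows this usually needs no additional system call
        print(entry.name, info.st_size, info.st_mtime)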
The os.path module has a directory tree walking function as well. I've never run any sort of benchmarks on it, but you could give it a try. I'm not sure there's a faster way than os.walk/os.path.walk in Python, however.
This is only partial help, more like pointers; however:
I believe you need to do the following:
fp = open("C:/$MFT", "rb")
using an account that includes SYSTEM permissions, because even as an admin, you can't open the "Master File Table" (kind of an inode table) of an NTFS filesystem. After you succeed in that, then you'll just have to locate information on the web that explains the structure of each file record (I believe it's commonly 1024 bytes per on-disk file, which includes the file's primary pathname) and off you go for super-high speeds of disk structure reading.
I would suggest using folderstats for creating statistics from a folder structure; I have tested it on folder/file structures of up to 400k files and folders.
It is as simple as:
import folderstats
import pandas as pd
df = folderstats.folderstats(path5, ignore_hidden=True)
df.head()
df.shape
The output will be a dataframe; see the example below:
| path | name | extension | size | atime | mtime | ctime | folder | num_files | depth | uid | md5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ./folder_structure.png | folder_structure | png | 525239 | 2022-01-10 16:08:32 | 2020-11-22 19:38:03 | 2020-11-22 19:38:03 | False | | 0 | 1000 | a3cac43de8dd5fc33d7bede1bb1849de |
| ./requirements-dev.txt | requirements-dev | txt | 33 | 2022-01-10 14:14:50 | 2022-01-08 17:54:50 | 2022-01-08 17:54:50 | False | | 0 | 1000 | 42c7e7d9bc4620c2c7a12e6bbf8120bb |
