Python can't load modules from a large zipfile

I've been packing up a script and some resources into an executable zipfile using the technique in (for example) this blog post.
The process suddenly stopped working, and I think it has to do with the zip64 extension. When I try to run the executable zipfile I get:
/usr/bin/python: can't find 'main' module in '/path/to/my_app.zip'
I believe that the only change is that one of the resource files (a disk image) has gotten larger. I've verified that __main__.py is still in the root of the archive. The size of the zipfile used to be 600MB, and is now 2.5GB. I noticed the following statement in the zipimport docs:
ZIP archives with an archive comment are currently not supported.
Reading through the wikipedia article on the zipfile format I see that:
The .ZIP file format allows for a comment containing up to 65,535 bytes of data to occur at the end of the file after the central directory.[25]
And later, regarding zip64:
In essence, it uses a "normal" central directory entry for a file, followed by an optional "zip64" directory entry, which has the larger fields.[29]
Inferring a bit, it sounds like this might be what's happening: my zipfile has grown to require the zip64 extension. The zip64 extension data is stored in the comment section, so now there is an active comment section, and python's zipimport is refusing to read my zipfile.
Can anyone provide guidance on:
verifying why python can't find __main__.py in my zipfile
providing any workaround
Note that the image file has always been 16GB in size; however, it used to occupy only 600MB on disk (it resides on an ext4 filesystem, if that matters). It now occupies > 7GB on disk. From the wikipedia page:
The original .ZIP format had a 4 GiB limit on various things (uncompressed size of a file, compressed size of a file and total size of the archive)
I build the zipfile with a python script, so to try to work around this issue I add the python code to the zipfile before adding the image file. The thought was that python might simply ignore the comment section and see a valid zipfile that contains the python code but not the large image file. This doesn't appear to be the case.
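As a sanity check, the standard zipfile module does understand ZIP64, so a short script along these lines (the path is illustrative) can confirm that __main__.py really is in the archive, whether an archive comment is present, and which entries push past the classic 4 GiB fields:
import zipfile

ARCHIVE = '/path/to/my_app.zip'   # illustrative path

zf = zipfile.ZipFile(ARCHIVE)
print('archive comment length: %d' % len(zf.comment))
print('__main__.py present: %s' % ('__main__.py' in zf.namelist()))
for info in zf.infolist():
    # Entries whose size or offset no longer fits in 32 bits force ZIP64.
    if info.file_size >= 0xFFFFFFFF or info.header_offset >= 0xFFFFFFFF:
        print('needs ZIP64: %s' % info.filename)
zf.close()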

Digging into the python source code, in zipimport.c, we can see that it does indeed look for the end-of-central-directory record in the last 22 bytes of the file. It does not search backwards from there if it doesn't find a valid end-of-central-directory record (which, I suppose, makes it a non-compliant zip parser). In any case, what it does do is look at the offset and size of the central directory reported in the end-of-central-directory record. offset + size should be the location of the end-of-central-directory record itself. If it is not, it computes the difference between the actual and expected locations and adds that difference to every offset in the central directory. This means that python supports loading modules from a zipfile that has been catted onto the end of another file.
It appears, however, that the zipimport implementation in 2.7.6 (distributed with ubuntu 14.04) is broken for large zipfiles (greater than 2GB? I suspect the limit is the maximum signed 32-bit value). It does work for python 3.4.3 (also distributed with ubuntu 14.04). zipimport.c has changed sometime between 2.7.6 and 2.7.12, so it may work in newer python 2 releases.
My solution is to pack a resource zip and a code zip, cat them together, and run the app with python3 instead of python2. I write the offset and size of the resource zip to a metadata file in the code zip. The code uses this information and filechunkio to get a zipfile.ZipFile for the resource-zip segment of the packed file.
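Roughly, the resource-loading side looks like the sketch below; the metadata filename, JSON layout, and paths are illustrative assumptions rather than my exact code:
import json
import zipfile
from filechunkio import FileChunkIO

PACK_PATH = '/path/to/my_app.zip'   # the concatenated resource+code file (illustrative)

# The code zip's central directory sits at the end of the packed file, so
# zipfile can open the packed file directly and read the metadata from it.
with zipfile.ZipFile(PACK_PATH) as code_zip:
    meta = json.loads(code_zip.read('resource_meta.json').decode('utf-8'))

# Expose just the resource-zip byte range as a file-like object and let
# zipfile parse that slice as an ordinary archive.
chunk = FileChunkIO(PACK_PATH, 'r', offset=meta['offset'], bytes=meta['size'])
resources = zipfile.ZipFile(chunk)
print(resources.namelist())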
Not a great solution, but it works.

Related

How to create a memory-mapped file in Python that is accessible from a called application?

Ok, I realize the title probably wasn't that clear. I'll clarify here and hope someone can help with a better title.
I'm opening a compressed file (tarball or similar) in python and reading some of the contents. One of the enclosed files is quite large (about 200GB, mostly zeros). Since the python tarfile module gives me file-handle-like objects, I can generally use them as if I had opened the file in the archive, without ever fully decompressing the enclosed file.
Unfortunately, I have to do some processing on this enclosed file using a 3rd party tool that I can't modify. This 3rd party tool only operates on files that are on disk. It won't take input from stdin.
What I do now is extract the entire 200 GB (mostly zeros) file to the disk for further processing. Obviously, this takes a while.
What I'd like to do is (using python if possible) make a "file" on disk that maps back to the "file handle" from the tarfile module in python. I could then pass this "file" to my 3rd party tool and go from there.
My target OS is linux (though a solution that also works on OSX would be nice). I don't care about working on Windows.
Edits
The external tool takes a filename (or full path) as a parameter. It prints out data to stdout (which python reads).
I've gotten a setup using sparse files working. While not as fast as I had hoped, it is significantly faster than it was before.
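The sparse-file setup boils down to something like this sketch (archive name, member name, paths, and chunk size are illustrative): read the member through the tarfile handle and seek() over all-zero blocks instead of writing them, so the filesystem leaves holes:
import tarfile

CHUNK = 1024 * 1024
ZERO_CHUNK = b'\0' * CHUNK

with tarfile.open('archive.tar.gz') as tar:
    src = tar.extractfile(tar.getmember('big_disk_image.img'))
    with open('/tmp/big_disk_image.img', 'wb') as dst:
        while True:
            block = src.read(CHUNK)
            if not block:
                break
            if block == ZERO_CHUNK[:len(block)]:
                dst.seek(len(block), 1)   # leave a hole instead of writing zeros
            else:
                dst.write(block)
        dst.truncate()   # fix up the final size if the file ends in a hole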

Interruption of a os.rename in python

I made a script in python that renames all files and folders (it does not recurse) in the "." directory: the directory in which the script is kept. It happened that I ran the script in a directory which contained no files and only one directory, let's say imp, with path .\imp. While the program was renaming it, the electricity went off and the job was interrupted (sorry, I didn't have a UPS).
Now, as the name suggests, assume imp contains important data. The renaming process also took quite a long time (compared to others) before the electricity went off, even though all it was renaming was one folder. After this, is any data corrupted or lost?
Just to make this more useful: what happens if os.rename is forced to stop while it is doing its job? How does the effect differ for files and folders?
Details
Python Version - 2.7.10
Operating System - Windows 10 Pro
You are using Windows, which means you are (probably) on NTFS. NTFS is a modern, journaling file system. It should not corrupt or lose any data, though it's possible that only some of the changes that constitute a rename have been applied (for instance, the filename might change without updating the modification time, or vice-versa). It is also possible that none of those changes have been applied.
Note the word "should" is not the same as "will." NTFS should not lose data in this fashion, and if it does, it's a bug. But because all software has bugs, it is important to keep backups of files you care about.

Compress a folder recursively as 7z with PyLZMA and py7zlib

Through much trial and error, I've figured out how to make lzma-compressed files through PyLZMA, but I'd like to replicate the seemingly simple task of compressing a folder and all of its files/directories recursively into a 7z file. I would just do it through 7z.exe, but I can't seem to catch the stdout of the process until it's finished, and I would like some per-7z-file progress since I will be compressing folders that range from hundreds of gigs to over a terabyte in size. Unfortunately I can't provide any code that I have tried, simply because the only examples I've seen of py7zlib are for extracting files from pre-existing archives. Has anybody had any luck with the combination of these two, or could anyone provide some help?
For what it's worth, this would be on Windows using python 2.7. Bonus points if there's some magic multi-threading that could occur here, especially given how slow lzma compression seems to be (time, however, is not an issue here). Thanks in advance!
A pure Python alternative is to create .tar.xz files with a combination of the standard library tarfile module and the liblzma wrapper module pyliblzma. This will create files comparable in size to .7z archives:
import tarfile
import lzma

TAR_XZ_FILENAME = 'archive.tar.xz'
DIRECTORY_NAME = '/usr/share/doc/'

# Write the tar stream through an LZMA-compressed file object.
xz_file = lzma.LZMAFile(TAR_XZ_FILENAME, mode='w')
with tarfile.open(mode='w', fileobj=xz_file) as tar_xz_file:
    tar_xz_file.add(DIRECTORY_NAME)
xz_file.close()
The tricky part is the progress report. The example above uses the recursive directory mode of tarfile.TarFile.add, so the add call does not return until the whole directory has been added.
The following questions discuss possible strategies for monitoring the progress of the tar file creation:
Python tarfile progress output?
Python tarfile progress
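One simple variant of those strategies is to skip the recursive mode entirely, walk the tree yourself, and add files one at a time so you can report per-file progress. A rough sketch (paths are illustrative, and empty directories are skipped for brevity):
import os
import tarfile
import lzma   # provided by pyliblzma on Python 2

TAR_XZ_FILENAME = 'archive.tar.xz'
DIRECTORY_NAME = '/usr/share/doc/'

# Collect the file list up front so progress can be reported as a fraction.
all_files = []
for root, dirs, files in os.walk(DIRECTORY_NAME):
    for name in files:
        all_files.append(os.path.join(root, name))

xz_file = lzma.LZMAFile(TAR_XZ_FILENAME, mode='w')
with tarfile.open(mode='w', fileobj=xz_file) as tar_xz_file:
    for i, path in enumerate(all_files, 1):
        tar_xz_file.add(path, recursive=False)
        print('[%d/%d] %s' % (i, len(all_files), path))
xz_file.close()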

Check if the directory content has changed with shell script or python

I have a program that creates files in a specific directory.
When those files are ready, I run LaTeX to produce a .pdf file.
So, my question is: how can I use this directory change as a trigger
to call LaTeX, using a shell script or a python script?
Best Regards
inotify replaces dnotify.
Why?
...dnotify requires opening one file descriptor for each directory that you intend to watch for changes...
Additionally, the file descriptor pins the directory, disallowing the backing device to be unmounted, which causes problems in scenarios involving removable media. When using inotify, if you are watching a file on a file system that is unmounted, the watch is automatically removed and you receive an unmount event.
...and more.
More Why?
Unlike its ancestor dnotify, inotify doesn't complicate your work with various limitations. For example, if you watch files on removable media, those files aren't locked. In comparison, dnotify requires the files themselves to be open and thus really "locks" them (hampers unmounting the media).
Reference
Is dnotify what you need?
Make on unix systems is usually used to track, by date, what needs rebuilding when files have changed. I normally use a rather good makefile for this job. There seems to be another alternative around on google code too.
You not only need to check for changes, but need to know that all changes are complete before running LaTeX. For example, if you start LaTeX after the first file has been modified and while more changes are still pending, you'll be using partial data and have to re-run later.
Wait for your first program to complete:
#!/bin/bash
first-program &&
run-after-changes-complete
Using && means the second command is only executed if the first completes successfully (a zero exit code). Because this simple script will always run the second command even if the first doesn't change any files, you can incorporate this into whatever build system you are already familiar with, such as make.
Python FAM is a Python interface for FAM (File Alteration Monitor)
You can also have a look at Pyinotify, which is a module for monitoring file system changes.
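A minimal Pyinotify watcher looks roughly like this; the watched directory and the LaTeX command are illustrative, and IN_CLOSE_WRITE is used so the trigger fires only after a file has been written and closed:
import subprocess
import pyinotify

WATCH_DIR = '/path/to/output'   # illustrative

class Handler(pyinotify.ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        print('changed: %s' % event.pathname)
        subprocess.call(['pdflatex', 'document.tex'])   # illustrative command

wm = pyinotify.WatchManager()
wm.add_watch(WATCH_DIR, pyinotify.IN_CLOSE_WRITE)
notifier = pyinotify.Notifier(wm, Handler())
notifier.loop()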
Not much of a python man myself. But in a pinch, assuming you're on linux, you could periodically shell out and run "ls -lrt /path/to/directory" (get the directory contents and sort by last modified), and compare the results of the last two calls for a difference. If they differ, something changed. Not very detailed, but it gets the job done.
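A rough Python version of that polling idea (path and interval are illustrative) compares snapshots of names and modification times instead of parsing ls output:
import os
import time

WATCH_DIR = '/path/to/directory'   # illustrative

def snapshot(path):
    # Map each entry name to its last-modified time.
    return dict((name, os.stat(os.path.join(path, name)).st_mtime)
                for name in os.listdir(path))

previous = snapshot(WATCH_DIR)
while True:
    time.sleep(5)
    current = snapshot(WATCH_DIR)
    if current != previous:
        print('directory changed')
        previous = current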
You can use the native python module hashlib, which implements the MD5 algorithm:
>>> import hashlib
>>> import os
>>> m = hashlib.md5()
>>> for root, dirs, files in os.walk(path):
...     for file_read in files:
...         full_path = os.path.join(root, file_read)
...         for line in open(full_path, 'rb').readlines():
...             m.update(line)
...
>>> m.digest()
'pQ\x1b\xb9oC\x9bl\xea\xbf\x1d\xda\x16\xfe8\xcf'
You can save this result in a file or a variable, and compare it to the result of the next run. This will detect changes in any files, in any sub-directory.
This does not take into account file permission changes; if you need to monitor those changes as well, it can be addressed by appending a string representing the permissions (accessible via os.stat, for instance; the attributes depend on your system) to the m variable.
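A sketch of that extension (the path is illustrative): hash each file's contents in binary mode and fold in the mode bits from os.stat, so a chmod shows up as a change too:
import hashlib
import os

WATCH_DIR = '/path/to/watch'   # illustrative

m = hashlib.md5()
for root, dirs, files in os.walk(WATCH_DIR):
    for file_read in files:
        full_path = os.path.join(root, file_read)
        with open(full_path, 'rb') as f:
            m.update(f.read())
        # Fold the mode/permission bits in as well.
        m.update(str(os.stat(full_path).st_mode).encode('ascii'))
print(m.hexdigest())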

python on xp: errno 13 permission denied - limits to number of files in folder?

I'm running Python 2.6.2 on XP. I have a large number of text files (100k+) spread across several folders that I would like to consolidate in a single folder on an external drive.
I've tried using shutil.copy() and shutil.copytree() and distutils.file_util.copy_file() to copy files from source to destination. None of these methods has successfully copied all files from a source folder; each attempt has ended with IOError Errno 13 Permission Denied, at which point I am unable to create a new destination file.
I have noticed that all the destination folders I've used, regardless of the source folders used, have ended up with exactly 13,106 files. I cannot open any new files for writing in folders that have this many (or more) files, which may be why I'm getting Errno 13.
I'd be grateful for suggestions on whether and why this problem is occurring.
many thanks,
nick
Are you using FAT32? The maximum number of directory entries in a FAT32 folder is 65,534. If a filename is longer than 8.3, it will take more than one directory entry. If you are conking out at 13,106, this indicates that each filename is long enough to require five directory entries.
Solution: Use an NTFS volume; it does not have per-folder limits and supports long filenames natively (that is, instead of using multiple 8.3 entries). The total number of files on an NTFS volume is limited to around 4.3 billion, but they can be put in folders in any combination.
I wouldn't put that many files in a single folder; it is a maintenance nightmare. BUT if you need to, don't do it on FAT: you can have at most 64k files in a FAT folder.
Read the error message
Your specific problem could also be that, as the error message suggests, you are hitting a file which you can't access. And there's no reason to believe that the count of files copied before this happens should change: it is a computer, after all, and you are repeating the same operation.
I predict that your external drive is formatted FAT32 and that the filenames you're writing to it are somewhere around 45 characters long.
FAT32 can only have 65,536 directory entries in a directory. Long file names use multiple directory entries each, and "." always takes up one entry. That you are able to write 65536/5 - 1 = 13106 entries strongly suggests that your filenames take up 5 entries each and that you have a FAT32 filesystem. (The limit exists because there is code that uses 16-bit numbers as directory entry offsets.)
Additionally, you do not want to search through multi-thousand-entry directories in FAT -- the search is linear. That is, fopen(some_file) will make the OS march linearly through the list of files, from the beginning every time, until it finds some_file or marches off the end of the list.
Short answer: Directories are a good thing.
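A rough sketch of that advice (paths are illustrative, Python 2 style to match the question): bucket the files into subfolders keyed by the first two hex digits of an MD5 of the filename, so no single folder collects tens of thousands of entries:
import hashlib
import os
import shutil

SOURCE = r'C:\consolidated'   # illustrative paths
DEST = r'E:\archive'

for name in os.listdir(SOURCE):
    bucket = hashlib.md5(name).hexdigest()[:2]   # 256 buckets: '00'..'ff'
    target_dir = os.path.join(DEST, bucket)
    if not os.path.isdir(target_dir):
        os.makedirs(target_dir)
    shutil.copy(os.path.join(SOURCE, name), target_dir)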
