I'm currently working with python 2.7's ZipFile. I have been running into memory issues zipping multiple files to send to the user.
When I have code:
import zipfile

fp = open('myzip.zip', 'w')
archive = zipfile.ZipFile(fp, 'w', zipfile.ZIP_DEFLATED)
for filepath in filepaths:
    archive.write(filepath)
archive.close()
Does Python load all those files into memory at some point? I would have expected Python to stream the contents of the files into the zip, but I'm not sure that that is the case.
This question/answer (both by the same user) suggests that it's all done in memory. They provide a link to a modified library, EnhancedZipFile, which sounds like it does what you want, but the project doesn't seem to have had much activity.
If you're not dependent on zip specifically, then this answer implies that the bz2 library can handle large files.
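For what it's worth, here is a rough sketch of that chunked approach with the standard bz2 module (the file names and chunk size are placeholders, not taken from the linked answer); nothing larger than one chunk is ever held in memory:
import bz2

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary choice

with open('big_input.csv', 'rb') as src, bz2.BZ2File('big_input.csv.bz2', 'wb') as dst:
    for chunk in iter(lambda: src.read(CHUNK_SIZE), b''):
        dst.write(chunk)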
Python will hold all the files you iterate over in memory; there is no way I can think of to bypass that except for calling an executable that will do the job for you on the operating system.
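If you do go the external-executable route, a minimal sketch would look something like the following, assuming an Info-ZIP style zip command is available on the PATH (the file names are placeholders):
import subprocess

filepaths = ['report1.csv', 'report2.csv']  # placeholder list of files to add
# Let the zip tool stream the files itself instead of doing it in Python.
subprocess.check_call(['zip', 'myzip.zip'] + filepaths)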
I currently have the following csv writer class:
import csv

class csvwriter():
    writer = None
    writehandler = None

    @classmethod
    def open(cls, file):
        cls.writehandler = open(file, 'wb')
        cls.writer = csv.writer(cls.writehandler, delimiter=',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC)

    @classmethod
    def write(cls, arr):
        cls.writer.writerow(arr)

    @classmethod
    def close(cls):
        cls.writehandler.close()
which can generate proper csv files without ever having to store the full array in memory at any one time.
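For context, usage looks roughly like this (the rows list here is placeholder data standing in for whatever iterable produces my output):
rows = [['id', 'value'], [1, 2.5]]  # placeholder data
csvwriter.open('output.csv')
for row in rows:
    csvwriter.write(row)
csvwriter.close()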
However, the files created through use of this code can be quite large, so I'm looking to compress them rather than writing them uncompressed (in order to save on disk usage). I can't effectively store the file in memory either, as I'm expecting files of well over 20gb to be a regular occurrence.
The recipients of the resulting files are generally not sysadmins of their PCs, nor do they all use linux, so I'm constrained in the types of algorithms I'm able to use for this task. Preferably, the solution would use a compression scheme that's natively readable (no executables required) in Windows, OSX and any linux distribution.
I've found gzip provides a very handy interface in Python, but reading gzipped files on Windows seems like quite a hassle. Ideally I'd put them in a zip archive, but zip archives don't allow you to append data to a file already present in the archive, which then forces me to store the whole file in memory, or to write the data away to several smaller files that I would be able to fit in memory.
My question: Is there a solution that would benefit from the best of both worlds? Widespread availability of tools to read the target format on the end-user's machine, and also the ability to append, rather than write the whole file in one go?
Thanks in advance for your consideration!
gzlog may provide the functionality you're looking for. It efficiently appends short strings to a gzip file, intended for applications where short messages are appended to a long log.
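gzlog itself is a C example that ships with zlib, but the underlying trick also works from plain Python: gzip readers treat concatenated gzip members as one stream, so opening the file in append mode adds a new compressed member each time. A rough sketch (the function and file handling here are mine, not gzlog's API):
import csv
import gzip

def append_rows(path, rows):
    # 'ab' starts a new gzip member at the end of the file; gunzip, 7-Zip
    # and Python's own gzip module read the members back as one stream.
    gz = gzip.open(path, 'ab')
    try:
        writer = csv.writer(gz, quoting=csv.QUOTE_NONNUMERIC)
        writer.writerows(rows)
    finally:
        gz.close()
One caveat: appending many tiny batches hurts the compression ratio, so it pays to buffer a reasonable number of rows per append.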
I've been trying to use the built-in Python zipfile module to manipulate some .zip files on Windows; I wish to use them to store a number of files related to the current project in a program. The problem comes when I load the files from the zip and then wish to re-save them into a new, different zip file:
import zipfile
zp = zipfile.ZipFile(r"first.zip",mode='r')
myfile = zp.open(r"stored_file.txt",mode='r')
### Do something, then want to save again ###
zp2 = zipfile.ZipFile(r"second.zip",mode='w')
#Doesn't work, as myfile isn't a real file:
zp2.write(myfile)
#Doesn't work, as the path can't be resolved:
zp2.write(os.path.join(zp.filename,myfile.name))
#The following works... as long as you haven't called read()
#since .seek(0) doesn't work for ZipExtFile
zp2.writestr(myfile.name,myfile.read())
I could, of course, extract the files to somewhere and then re-add them to the new zip that way, but it would be clunky and require a lot of cleanup (and creating a lot of temporary files).
Equally I could keep track of the original zip file and use the writestr method by re-opening the file, but I was hoping to avoid it. I just wondered if there was a better way around this problem; it means I'll have to have code that determines whether the file originally came from a zip or not as well and handle it differently if it did.
Edit: If anyone else has the final problem with seek(0) not working on ZipExtFile, it is possible to use an io.StringIO class to hold the result of str(myfile.read()), which is then seekable. It means I have to keep the files loaded in memory, though, so I'm going to go with keeping track of the zipfile and transferring them only when I need them.
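For anyone taking the "keep track of the original zip" route, here is a small sketch of copying members across without extracting to disk; each member still passes through memory once via read(), so it suits reasonably sized files:
import zipfile

zp = zipfile.ZipFile('first.zip', 'r')
zp2 = zipfile.ZipFile('second.zip', 'w', zipfile.ZIP_DEFLATED)
for info in zp.infolist():
    # Passing the ZipInfo keeps the original name and date stamp.
    zp2.writestr(info, zp.read(info.filename))
zp2.close()
zp.close()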
Through much trial and error, I've figured out how to make LZMA-compressed files through PyLZMA, but I'd like to replicate the seemingly simple task of compressing a folder and all of its files/directories recursively into a 7z file. I would just do it through 7z.exe, but I can't seem to catch the stdout of the process until it's finished, and I would like some per-7z-file progress as I will be compressing folders that range from hundreds of gigs to over a terabyte in size. Unfortunately I can't provide any code that I have tried, simply because the only examples I've seen of py7zlib are for extracting files from pre-existing archives. Has anybody had any luck with the combination of these two, or could you provide some help?
For what it's worth, this would be on Windows using Python 2.7. Bonus points if there's some magic multi-threading that could occur here, especially given how slow LZMA compression seems to be (time, however, is not an issue here). Thanks in advance!
A pure Python alternative is to create .tar.xz files with a combination of the standard library tarfile module and the liblzma wrapper module pyliblzma. This will create files comparable in size to .7z archives:
import tarfile
import lzma

TAR_XZ_FILENAME = 'archive.tar.xz'
DIRECTORY_NAME = '/usr/share/doc/'

xz_file = lzma.LZMAFile(TAR_XZ_FILENAME, mode='w')
with tarfile.open(mode='w', fileobj=xz_file) as tar_xz_file:
    tar_xz_file.add(DIRECTORY_NAME)
xz_file.close()
The tricky part is the progress report. The example above uses the recursive mode of the tarfile.TarFile class for directories, so the add method does not return until the whole directory has been added.
The following questions discuss possible strategies for monitoring the progress of the tar file creation:
Python tarfile progress output?
Python tarfile progress
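One straightforward strategy, sketched below, is to drive the walk yourself and add one file at a time with recursive=False, reporting progress between add() calls (paths are the ones from the example above; empty directories are skipped in this form):
import os
import tarfile
import lzma  # pyliblzma on Python 2.7

xz_file = lzma.LZMAFile('archive.tar.xz', mode='w')
tar = tarfile.open(mode='w', fileobj=xz_file)
try:
    for root, dirs, files in os.walk('/usr/share/doc/'):
        for name in files:
            fullpath = os.path.join(root, name)
            tar.add(fullpath, recursive=False)
            print('added %s' % fullpath)
finally:
    tar.close()
    xz_file.close()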
In a django project, I need to generate some pdf files for objects in db. Since each file takes a few seconds to generate, I use celery to run tasks asynchronously.
Problem is, I need to add each file to a zip archive. I was planning to use the python zipfile module, but different tasks can be run in different threads, and I wonder what will happen if two tasks try to add a file to the archive at the same time.
Is the following code thread-safe or not? I cannot find any valuable information in Python's official docs.
try:
    zippath = os.path.join(pdf_directory, 'archive.zip')
    zipfile = ZipFile(zippath, 'a')
    zipfile.write(pdf_fullname)
finally:
    zipfile.close()
Note: this is running under python 2.6
No, it is not thread-safe in that sense.
If you're appending to the same zip file, you'd need a lock there, or the file contents could get scrambled.
If you're appending to different zip files, using separate ZipFile() objects, then you're fine.
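A minimal sketch of the locking approach, assuming the tasks share one ZipFile instance and one lock (the names and archive path here are illustrative):
import threading
import zipfile

zip_lock = threading.Lock()
archive = zipfile.ZipFile('archive.zip', 'a')

def add_pdf(pdf_fullname):
    # Serialize all writes through the lock so concurrent tasks
    # cannot interleave their bytes inside the archive.
    with zip_lock:
        archive.write(pdf_fullname)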
Python 3.5.5 makes writing to ZipFile and reading multiple ZipExtFiles threadsafe: https://docs.python.org/3.5/whatsnew/changelog.html#id93
As far as I can tell, the change has not been backported to Python 2.7.
Update: after studying the code and some testing, it becomes apparent that the locking is still not thoroughly implemented. It correctly works only for writestr and doesn't work for open and write.
While this question is old, it's still high up in Google results, so I just want to chime in to say that I noticed, on Python 3.4 64-bit on Windows, the LZMA zipfile is thread-safe; all the others fail.
with zipfile.ZipFile("test.zip", "w", zipfile.ZIP_LZMA) as zip:
    # do stuff in threads
Note that you can't bind the same file to multiple zipfile.ZipFile instances; instead, you have to use the same one in all threads (here, that is the variable named zip).
In my case I get about 80-90% CPU use on 8 cores and SSD, which is nice.
I'm looking for a high performance method or library for scanning all files on disk or in a given directory and grabbing their basic stats - filename, size, and modification date.
I've written a python program that uses os.walk along with os.path.getsize to get the file list, and it works fine, but is not particularly fast. I noticed one of the freeware programs I had downloaded accomplished the same scan much faster than my program.
Any ideas for speeding up the file scan? Here's my Python code, but keep in mind that I'm not at all married to os.walk and am perfectly willing to use other APIs (including Windows-native APIs) if there are better alternatives.
for root, dirs, files in os.walk(top, topdown=False):
    for name in files:
        ...
I should also note I realize the python code probably can't be sped up that much; I'm particularly interested in any native APIs that provide better speed.
Well, I would expect this to be a heavily I/O-bound task.
As such, optimizations on the Python side would be quite ineffective; the only optimization I can think of is some different way of accessing/listing files, in order to reduce the actual reads from the file system.
That of course requires deep knowledge of the file system, which I do not have, and which I would not expect Python's developers to have had while implementing os.walk.
What about spawning a command prompt, and then issuing 'dir' and parsing the results?
It could be a bit of an overkill, but with any luck, 'dir' makes some effort at such optimizations.
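A rough sketch of that idea on Windows is below; the bare /b output only gives paths, so you would still stat each one (or parse the chattier non-/b format) to get sizes and dates, and the directory name is a placeholder:
import subprocess

# /s recurses, /b prints bare full paths, /a-d skips directories.
output = subprocess.check_output('dir /s /b /a-d "C:\\some\\folder"', shell=True)
paths = output.splitlines()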
It seems as if os.walk has been considerably improved in python 2.5, so you might check if you're running that version.
Other than that, someone has already compared the speed of os.walk to ls and noticed a clear advantage for the latter, but not by a margin that would actually justify using it.
You might want to look at the code for some Python version control systems like Mercurial or Bazaar. They have devoted a lot of time to coming up with ways to quickly traverse a directory tree and detect changes (or "finding basic stats about the files").
Use the scandir Python module (formerly betterwalk) by Ben Hoyt, available on GitHub:
http://github.com/benhoyt/scandir
It is much faster than os.walk, but uses the same syntax. Just import scandir and change os.walk() to scandir.walk(). That's it. It is the fastest way to traverse directories and files in Python.
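In other words, the change is mechanical; something like this (assuming scandir is installed from PyPI):
import os
import scandir

for root, dirs, files in scandir.walk('.'):
    for name in files:
        print(os.path.join(root, name))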
When you look at the code for os.walk, you'll see that there's not much fat to be trimmed.
For example, the following is only a hair faster than os.walk.
import os
import stat

# Bind the lookups to module-level names once to avoid repeated attribute access.
listdir = os.listdir
pathjoin = os.path.join
fstat = os.stat
is_dir = stat.S_ISDIR
is_reg = stat.S_ISREG

def yieldFiles(path):
    for f in listdir(path):
        nm = pathjoin(path, f)
        s = fstat(nm).st_mode
        if is_dir(s):
            # Recurse into subdirectories.
            for sub in yieldFiles(nm):
                yield sub
        elif is_reg(s):
            yield f
        else:
            pass  # ignore anything that is neither a directory nor a regular file
Consequently, the overheads must be in the os module itself. You'll have to resort to making direct Windows API calls.
Look at the Python for Windows Extensions.
I'm wondering if you might want to group your I/O operations.
For instance, if you're walking a dense directory tree with thousands of files, you might try experimenting with walking the entire tree and storing all the file locations, and then looping through the (in-memory) locations and getting file statistics.
If your OS stores these two data in different locations (directory structure in one place, file stats in another), then this might be a significant optimization.
Anyway, that's something I'd try before digging further.
Python 3.5 just introduced os.scandir (see PEP-0471) which avoids a number of non-required system calls such as stat() and GetFileAttributes() to provide a significantly quicker file-system iterator.
os.walk() will now be implemented using os.scandir() as its iterator, and so you should see potentially large performance improvements whilst continuing to use os.walk().
Example usage:
for entry in os.scandir(path):
    if not entry.name.startswith('.') and entry.is_file():
        print(entry.name)
The os.path module has a directory tree walking function as well. I've never run any sort of benchmarks on it, but you could give it a try. I'm not sure there's a faster way than os.walk/os.path.walk in Python, however.
This is only partial help, more like pointers; however:
I believe you need to do the following:
fp = open("C:/$MFT", "rb")
using an account that includes SYSTEM permissions, because even as an admin, you can't open the "Master File Table" (kind of an inode table) of an NTFS filesystem. After you succeed in that, then you'll just have to locate information on the web that explains the structure of each file record (I believe it's commonly 1024 bytes per on-disk file, which includes the file's primary pathname) and off you go for super-high speeds of disk structure reading.
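Very roughly, and with all the caveats above about permissions, that first step would look something like the sketch below; the 1024-byte record size and the "FILE" signature are the commonly documented NTFS layout, and everything past checking the magic is left as an exercise:
RECORD_SIZE = 1024  # typical size of an NTFS file record

fp = open(r"C:/$MFT", "rb")
try:
    while True:
        record = fp.read(RECORD_SIZE)
        if len(record) < RECORD_SIZE:
            break
        if record[:4] == "FILE":
            # Parse this record's attribute list here to pull out the
            # file name, size and timestamps.
            pass
finally:
    fp.close()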
I would suggest using folderstats for creating statistics from a folder structure; I have tested it on folder/file structures of up to 400k files and folders.
It is as simple as:
import folderstats
import pandas as pd

path5 = '.'  # path of the folder to scan
df = folderstats.folderstats(path5, ignore_hidden=True)
df.head()
df.shape
The output will be a dataframe; see the example below:
path                   | name             | extension | size   | atime               | mtime               | ctime               | folder | num_files | depth | uid  | md5
./folder_structure.png | folder_structure | png       | 525239 | 2022-01-10 16:08:32 | 2020-11-22 19:38:03 | 2020-11-22 19:38:03 | False  | 0         |       | 1000 | a3cac43de8dd5fc33d7bede1bb1849de
./requirements-dev.txt | requirements-dev | txt       | 33     | 2022-01-10 14:14:50 | 2022-01-08 17:54:50 | 2022-01-08 17:54:50 | False  | 0         |       | 1000 | 42c7e7d9bc4620c2c7a12e6bbf8120bb
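From there it's ordinary pandas; for example, a quick follow-up that keeps only the file rows and totals their sizes per extension (column names as in the table above):
# Keep only files, then sum sizes per extension.
files_only = df[df['folder'] == False]
print(files_only.groupby('extension')['size'].sum())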