I plan on getting a huge folder of data. The total size of the folder will be approximately 2TB and it will comprise about 2 million files. I will need to perform some processing on those files (mainly removing 99% of them).
I anticipate some issues due to the size of the data. In particular, I would like to know if Python is able to list these files correctly using os.listdir() in a reasonable time.
For instance, I know from experience that in some cases, deleting huge folders like this one on Ubuntu can be painful.
os.scandir was created largely because of issues with using os.listdir on huge directories, so I would expect os.listdir to suffer in the scenario you describe. os.scandir should perform better, both because it can process the folder with lower memory consumption and because you typically save at least a little by avoiding per-entry stat calls (e.g. to distinguish files from directories).
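For example, a minimal sketch of streaming a huge directory with os.scandir (the helper name and the path in the comment are just placeholders):

import os

def iter_regular_files(path):
    # os.scandir yields entries lazily instead of building one huge list,
    # and entry.is_file() can usually answer from data returned by the
    # directory scan itself, avoiding a separate stat() call per entry.
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_file(follow_symlinks=False):
                yield entry.name

# e.g. count the files without holding 2 million names in memory:
# total = sum(1 for _ in iter_regular_files('/data/huge_folder'))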
Unless you're handed those millions of files already sitting in one huge folder, you can easily separate them while copying, for example by using the first few characters of the filename as the folder name:
abcoweowiejr.jpg goes to abc/ folder
012574034539.jpg goes to 012/ folder
and so on... This way you never have to read a folder that has millions of files.
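Here is a rough sketch of that copy-time bucketing (the helper name and the example paths are made up):

import os
import shutil

def bucketed_copy(src_dir, dst_root, prefix_len=3):
    # Copy each file into dst_root/<first prefix_len characters of its name>/
    # so no single destination directory accumulates millions of entries.
    with os.scandir(src_dir) as it:
        for entry in it:
            if not entry.is_file():
                continue
            bucket_dir = os.path.join(dst_root, entry.name[:prefix_len])
            os.makedirs(bucket_dir, exist_ok=True)
            shutil.copy2(entry.path, os.path.join(bucket_dir, entry.name))

# e.g. bucketed_copy('/mnt/incoming', '/data/bucketed')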
I have run into performance trouble with my scripts while generating and using a large quantity of small files.
I have two directories on my disk (same behavior on HDD and SSD): the first holds ~10_000 input files and the second receives ~1_300_000 output files. I wrote a script to process the files and generate the output using Python's multiprocessing library.
The first 400_000-600_000 output files (I'm not sure exactly when I hit the threshold) are generated at a constant pace with all 8 CPU cores at 100% usage. Then it gets much worse: throughput drops about 20-fold and core usage falls to 1-3% by the time the directory holds 1_000_000 files.
I worked around this by creating a second output directory and writing the second half of the output files there (I needed a quick hotfix).
Now, I have two questions:
1) How are creating a new file and writing to it executed in Python on Windows? What is the bottleneck here? (My guess is that Windows looks up whether the file already exists in the directory before writing to it.)
2) What is a more elegant way (than splitting into dirs) to handle this issue correctly?
In case anyone has the same problem, the bottleneck turned out to be lookup time for files in crowded directories.
I resolved the issue by splitting the files into separate directories, grouped by one parameter that is evenly distributed over 20 different values. Though now I would do it in a different way.
I recommend solving a similar issue with shelve Python built-in module. A shelve is one file in the filesystem and you can access it like a dictionary and put pickles inside. Just like in real life :) Example here.
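A minimal sketch of the shelve idea (the filename results.db is made up, and depending on the dbm backend shelve may create more than one file on disk):

import shelve

# Store many small results in one shelve instead of a million tiny files.
with shelve.open('results.db') as db:
    db['item_000001'] = {'score': 0.93, 'tags': ['a', 'b']}
    db['item_000002'] = [1, 2, 3]

# Later, read it back like a dictionary.
with shelve.open('results.db') as db:
    print(db['item_000001'])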
I want to know how many files are in a folder (specifically a shared network folder on windows if that makes a difference here).
I am using this code right now:
import os

def countFiles(path):
    return len([f for f in os.listdir(path)
                if os.path.isfile(os.path.join(path, f))])
It works fine when there are a few files in the folder, but it takes a noticeably long time in a directory with many files (say 4000). I am running this frequently (files are being added every ~15 seconds) so the slowdown is painful.
In my particular case, I know there aren't any subfolders, so I could skip the os.path.isfile check, but I'd like to keep my solution general. Frankly, I am surprised that there isn't a built-in count-of-files function in os.path.
In order to know how many files there are in a folder, the system must enumerate each entry and then check whether that entry is a file or not. There's no faster way unless the system provides you with a filesystem change notification (e.g. FSEvents or inotify) to tell you when things change.
These operations are slow for a disk-based filesystem (tens to hundreds of microseconds), and even slower on a network drive; you'll notice they are pretty slow even in a normal file browser. Modern OSes deal with the slowness through aggressive caching, but this has its limits (especially for network filesystems, where the overhead of keeping the cache fresh can exceed the cost of doing the operations in the first place).
To speed it up, you could cache the isfile result for names you've already checked, under the assumption that they won't transmute into directories. This would save you many isfile checks, at the expense of a bit of safety (if e.g. someone deletes a file and replaces it with an identically-named folder).
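A rough sketch of that caching idea (assuming, as noted, that a name confirmed to be a file stays a file):

import os

_known_files = set()  # names already confirmed to be regular files

def count_files_cached(path):
    count = 0
    for name in os.listdir(path):
        if name in _known_files:
            count += 1
        elif os.path.isfile(os.path.join(path, name)):
            _known_files.add(name)
            count += 1
    return count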
I have some bash code which moves files and directories to /tmp/rmf rather than deleting them, for safety purposes.
I am migrating the code to Python to add some functionality. One of the added features is checking the available size on /tmp and asserting that the moved directory can fit in /tmp.
Checking for available space is done using os.statvfs, but how can I measure the disk usage of the moved directory?
I could either call du using subprocess, or recursively iterate over the directory tree and sum the sizes of each file. Which approach would be better?
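For reference, a minimal sketch of the recursive approach mentioned above (the helper name is made up; it sums apparent file sizes, which can differ from what du reports because of sparse files and block rounding):

import os

def dir_size_bytes(root):
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # skip entries that vanish or can't be stat'ed mid-walk
    return total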
I think you might want to reconsider your strategy. Two reasons:
Checking if you can move a file, asserting you can move a file, and then moving the file builds a race condition into the operation: a big file can be created in /tmp/ after you've asserted but before you've moved your file. Doh.
Moving the file across filesystems will result in a huge amount of overhead. This is why on OSX each volume has its own 'Trash' directory. Instead of moving the blocks that compose the file, you just create a new directory entry that points to the existing data.
I'd consider how long the file needs to be available and its visibility to consumers of the files. If it's all automated stuff happening on the backend, renaming a file to 'hide' it from computer and human consumers is easy enough in most cases and has the added benefit of being an atomic operation.
Occasionally scan the filesystem for 'old' files to cull and rm them after some grace period. No drama. Also makes restoring files a lot easier since it's just a rename to restore.
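A hedged sketch of that rename-and-cull scheme (the prefix and the grace period are arbitrary choices):

import os
import time

HIDDEN_PREFIX = '.trash.'
GRACE_SECONDS = 7 * 24 * 3600  # keep hidden files for a week

def hide(path):
    # A rename within the same filesystem is atomic and copies no data.
    directory, name = os.path.split(path)
    os.rename(path, os.path.join(directory, HIDDEN_PREFIX + name))

def cull(directory, now=None):
    now = now or time.time()
    for name in os.listdir(directory):
        if not name.startswith(HIDDEN_PREFIX):
            continue
        full = os.path.join(directory, name)
        if now - os.path.getmtime(full) > GRACE_SECONDS:
            os.remove(full)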
This should do the trick:
import os

path = 'THE PATH OF THE DIRECTORY YOU WANT TO FETCH'
stats = os.statvfs(path)
print(stats.f_bavail * stats.f_frsize)  # bytes available on that filesystem
I'm running Python 2.6.2 on XP. I have a large number of text files (100k+) spread across several folders that I would like to consolidate in a single folder on an external drive.
I've tried using shutil.copy() and shutil.copytree() and distutils.file_util.copy_file() to copy files from source to destination. None of these methods has successfully copied all files from a source folder; each attempt has ended with IOError [Errno 13] Permission denied when trying to create a new destination file.
I have noticed that all the destination folders I've used, regardless of the source folders, have ended up with exactly 13,106 files. I cannot open any new files for writing in folders that have this many (or more) files, which may be why I'm getting Errno 13.
I'd be grateful for any suggestions as to why this problem is occurring.
many thanks,
nick
Are you using FAT32? The maximum number of directory entries in a FAT32 folder is 65,534. If a filename doesn't fit in the 8.3 format, it will take more than one directory entry. If you are conking out at 13,106, this indicates that each filename is long enough to require five directory entries.
Solution: Use an NTFS volume; it does not have per-folder limits and supports long filenames natively (that is, instead of using multiple 8.3 entries). The total number of files on an NTFS volume is limited to around 4.3 billion, but they can be put in folders in any combination.
I wouldn't have that many files in a single folder; it is a maintenance nightmare. But if you need to, don't do this on FAT: you can have at most 64k files in a FAT folder.
Read the error message
Your specific problem could also be, as the error message suggests, that you are hitting a file which you can't access. And there's no reason to believe the count of files copied before this happens should change; it is a computer after all, and you are repeating the same operation.
I predict that your external drive is formatted FAT32 and that the filenames you're writing to it are somewhere around 45 characters long.
FAT32 can only have 65,536 directory entries in a directory. Long file names use multiple directory entries each, and the '.' and '..' entries take up two of them. That you max out at (65,536 - 2) / 5 = 13,106 files (rounding down) strongly suggests that your filenames take up 5 entries each and that you have a FAT32 filesystem. (The 65,536 limit exists because there is code that uses 16-bit numbers as directory entry offsets.)
Additionally, you do not want to search through multi-1000 entry directories in FAT -- the search is linear. I.e. fopen(some_file) will induce the OS to march linearly through the list of files, from the beginning every time, until it finds some_file or marches off the end of the list.
Short answer: Directories are a good thing.
Is it bad to output many files to the same directory in Unix/Linux? I run thousands of jobs on a cluster and each one writes an output file to a single directory; the upper bound here is around ~50,000 files. Can I/O slow down because of this? If so, does the problem go away with a nested directory structure?
Thanks.
See:
How many files can I put in a directory?
I believe that many filesystems store the names of contained files in a list (or some other linear-access data structure), so storing large numbers of files in a single directory can cause slowness for simple operations like listing. Having a nested structure can ameliorate this problem by creating a tree (or even a trie, if it makes sense) of names, which reduces the time it takes to retrieve file stats.
My suggestion is to use a nested directory structure (i.e. categorization). You can name the subdirectories using timestamps, special prefixes for each application, etc. This gives you a sense of order when you need to search for specific files and makes managing them easier.
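For instance, a small sketch of timestamp-based categorization (the helper name and the example paths are made up):

import os
import time

def output_path(root, filename):
    # Put each output file under a dated subdirectory (e.g. root/2024-05-17/)
    # so no single directory accumulates tens of thousands of entries.
    target_dir = os.path.join(root, time.strftime('%Y-%m-%d'))
    os.makedirs(target_dir, exist_ok=True)
    return os.path.join(target_dir, filename)

# e.g. open(output_path('/cluster/results', 'job_12345.out'), 'w')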