We have a significant (~50 kloc) tree of packages/modules (approx. 2,200 files) that we ship around to our cluster with each job. The jobs run for ~12 hours, so the overhead of untarring/bootstrapping (i.e. resolving PYTHONPATH for each module) usually isn't a big deal. However, as the number of cores in our worker nodes has increased, we've increasingly hit the case where the scheduler lands 12 jobs simultaneously, which grinds the poor scratch drive to a halt servicing all the requests (worse, for reasons beyond our control, each job requires a separate loopback filesystem, so there are two layers of indirection on the drive).
Is there a way to hint to the interpreter the proper location of each file without decorating the code with paths strewn throughout (maybe by overriding import?), or to bundle up all the associated .pyc files into some sort of binary blob that can just be read once?
Thanks!
We had problems like this on our cluster. (The Lustre filesystem was slow for metadata operations.) Our solution was to use the "zip import" facilities in Python.
In our case we made a single zip of the stdlib (placed at the name already present on sys.path, like "/usr/lib/python26.zip") and another zip of our project, with the latter added to the PYTHONPATH.
This is much faster because it's a single filesystem metadata read, followed by a quick read of the zip file's table of contents to figure out what's inside, which is then cached for later lookups.
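For illustration, a minimal sketch of building such a project zip and importing from it (the paths and package name below are hypothetical):

import os
import sys
import zipfile

project_root = '/path/to/project'   # hypothetical source tree
archive = '/path/to/project.zip'    # hypothetical archive location

# Bundle the project tree into a single zip file.
with zipfile.ZipFile(archive, 'w', zipfile.ZIP_DEFLATED) as zf:
    for dirpath, dirnames, filenames in os.walk(project_root):
        for name in filenames:
            if name.endswith(('.py', '.pyc')):
                full = os.path.join(dirpath, name)
                # Store paths relative to the project root so imports resolve.
                zf.write(full, os.path.relpath(full, project_root))

# At job startup, prepend the archive to sys.path (or put it on PYTHONPATH);
# Python's zipimport machinery handles the rest.
sys.path.insert(0, archive)
import mypackage  # hypothetical package living inside the zip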
Related
I have run into performance trouble with my scripts while generating and using a large quantity of small files.
I have two directories on my disk (same behavior on HDD and SSD): the first with ~10_000 input files and the second for ~1_300_000 output files. I wrote a script to process the files and generate the output using the multiprocessing library in Python.
The first 400_000-600_000 output files (I'm not sure exactly when I hit the threshold) are generated at a constant pace and all 8 CPU cores are used at 100%. Then it gets much worse: performance decreases 20 times and core usage drops to 1-3% as the directory approaches 1_000_000 files.
I worked around this issue by creating a second output directory and writing the second half of the output files there (I needed a quick hotfix).
Now, I have two questions:
1) How is creating a new file and writing to it handled in Python on Windows? What is the bottleneck here? (My guess is that Windows checks whether the file already exists in the directory before writing to it.)
2) What is a more elegant way (than splitting into directories) to handle this issue correctly?
In case anyone has the same problem, the bottleneck turned out to be lookup time for files in crowded directories.
I resolved the issue by splitting the files into separate directories, grouped by one parameter with an even distribution over 20 different values. Though now I would do it in a different way.
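A minimal sketch of that kind of sharding (the bucket count and naming scheme here are just illustrative):

import hashlib
import os

def shard_path(output_root, filename, buckets=20):
    # Hash the filename to pick one of `buckets` subdirectories,
    # so files spread evenly instead of piling into a single directory.
    digest = hashlib.md5(filename.encode('utf-8')).hexdigest()
    bucket = int(digest, 16) % buckets
    subdir = os.path.join(output_root, 'bucket_%02d' % bucket)
    os.makedirs(subdir, exist_ok=True)
    return os.path.join(subdir, filename)

# Hypothetical usage:
# out_path = shard_path('output', 'result_0000123.txt')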
For a similar issue, I recommend the shelve module built into Python. A shelve is a single file in the filesystem that you can access like a dictionary, putting pickles inside. Just like in real life :) Example here.
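Something along these lines (the file and key names are made up):

import shelve

# One file on disk, dictionary-like access, values are pickled automatically.
db = shelve.open('results')  # creates 'results' (or 'results.db', depending on backend)
db['input_0001'] = {'status': 'ok', 'values': [1, 2, 3]}
print(db['input_0001'])
db.close()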
I would like to know if there is any benefit to using python2.7's multiprocessing module to asynchronously copy files from one folder to another.
Is disk I/O always forced to be serial? Does this change if you are copying from one hard disk to a different hard disk? Does this change depending on the operating system (Windows / Linux)?
Perhaps it is possible to read in parallel, but not possible to write?
This is all assuming that the files being moved/copied are different files going to different locations.
I/O goes to the system cache in RAM before hitting a hard drive. For writes, you may find the copies are fast until you exhaust RAM and then slow down, and that multiple reads of the same data are fast. If you copy the same file to several places, there is an advantage to doing all the copies of that file before moving on to the next.
I/O to a single hard drive (or group of hard drives joined with a RAID or volume manager) is mostly serial except that the operating system and drive may reorder operations to read / write nearby tracks before seeking for tracks that are further away. There is some advantage to doing parallel copies because there are more opportunities to reorder, but since you are really writing from the system RAM cache sometime after your application writes, the benefits may be hard to measure.
There is a greater benefit when moving between drives. Those go mostly in parallel, although there is some contention for the buses (e.g., PCIe, SATA) that run the drives.
If you have a lot of files to copy, multiprocessing is a reasonable way to go, but you may find that subprocess to the native copy utilities is faster.
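For instance, a minimal sketch using a worker pool (the file list is hypothetical, and the pool is shut down explicitly so it also runs on 2.7, where Pool is not a context manager):

import shutil
from multiprocessing import Pool

def copy_one(pair):
    src, dst = pair
    shutil.copy2(src, dst)  # copy file contents and metadata
    return dst

if __name__ == '__main__':
    # Hypothetical list of (source, destination) pairs, all distinct files.
    jobs = [('/data/in/a.bin', '/backup/a.bin'),
            ('/data/in/b.bin', '/backup/b.bin')]
    pool = Pool(processes=4)
    try:
        for done in pool.imap_unordered(copy_one, jobs):
            print(done)
    finally:
        pool.close()
        pool.join()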
It's a common question, not specific to any language or platform: who is responsible for a file created in the system's $TEMP folder?
If it's my duty, why should I care where to put this file? I could place it anywhere with the same result.
If it's the OS's responsibility, can I forget about this file right after use?
Thanks and sorry for my basic English.
As a general rule, you should remove the temporary files that you create.
Recall that the $TEMP directory is a shared resource that other programs can use. Failure to remove the temporary files will have an impact on the other programs that use $TEMP.
What kind of impact? That will depend upon the other programs. If those other programs create a lot of temporary files, then their execution will be slower: it will take longer to create a new temporary file because the directory has to be scanned on each creation to ensure that the file name is unique.
Consider the following (based on real events) ...
In years past, my group at work had to use the Intel C Compiler. We found that over time it appeared to be slowing down; that is, our sanity tests took longer and longer to run. This also applied to building/compiling a single C file. We tracked the problem down.
ICC was opening, stat'ing and reading every file under $TEMP. For what purpose, I know not. Although the argument can be made that the problem lay with ICC, the existence of the files under $TEMP was slowing it and our development team down. Deleting those temporary files resulted in the sanity checks running in less than half an hour instead of over two hours, a significant time saver.
Hope this helps.
There are no standards and no common rules. In most OSs, the files in the temporary folder will pile up. Some systems try to prevent this by deleting files in there automatically after some time, but that sometimes causes grief, for example with long-running processes or crash backups.
The reason for $TEMP to exist is that many programs (especially in early times when RAM was scarce) needed a place to store temporary data, since "supercomputers" in the 1970s had only a few KB of RAM (yes, N*1024 bytes where N is << 100 - you couldn't even fit the image of your mouse cursor into that). Around 1980, 64 KB was a lot.
The solution was a folder where anyone could write. Security wasn't an issue at the time, memory was.
Over time, OSs started to get better systems to create temporary files and to clean them up but backwards compatibility prevented a clean, "work for all" solution.
So even though you know where the data ends up, you are responsible for cleaning up the files yourself. To make error analysis easier, I tend to write my code in such a way that files are only deleted when everything is fine - that way, I can look at intermediate results to figure out what went wrong. But logging is often a better and safer solution.
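A minimal sketch of that "delete only when everything is fine" pattern, using the tempfile module (the names and contents are illustrative):

import os
import tempfile

success = False
# Create the temp file ourselves so it survives until we decide to delete it.
fd, tmp_path = tempfile.mkstemp(prefix='myjob_', suffix='.dat')
try:
    with os.fdopen(fd, 'w') as f:
        f.write('intermediate results\n')
    # ... further processing that might fail ...
    success = True
finally:
    if success:
        os.remove(tmp_path)  # everything went fine, clean up after ourselves
    else:
        print('keeping %s for inspection' % tmp_path)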
Related: Memory prices 1957-2014. 12 KB of RAM cost US $4,680 in 1973.
With Python (I am using v3.4), I am creating a program that will run on up to 20 machines concurrently connected to a same network.
Those machines will need to access the network from time to time and perform some operations there (create/delete/modify files)…
What I had currently planned was:
network will have a folder "db"
I will use the shelve module to keep my shared data (a simple dictionary of fewer than 3,000 entries containing some path names and some logging information) in this folder. Since there is not much data, it is a convenient module that will be fast enough
to avoid operations by multiple machines on the network at the same time (and avoid conflicts), machines will have to lock the shelve file on the network before doing anything (anyway they will also need to update this file) and will unlock only at the end of their operation (updating its contents if necessary)
The problem I have is that shelve does not provide a convenient mechanism for concurrent access or locking. I understand that my two possibilities are:
either I don't use shelve and use another module to manage my simple database (e.g. sqlite3, except it is a bit more complex than the simple shelve module),
or I create a locking mechanism that will work over the network for multiple machines (so far I haven't found a module that seems totally reliable).
Additional requirements (if possible) are:
It will mainly be used with Windows but I would like the solution to be cross-platform so that I can re-use it with Linux
the network file system will be anything accessible through a standard file explorer (on Linux/Windows) via an address like "\\Machine\Folder". It could be either a file server or just a shared folder present on one of the machines.
Any recommendation?
Thanks
Multiplatform, locking: your best option is portalocker (https://pypi.python.org/pypi/portalocker/0.3). We've used it extensively, and it's a winner!
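A minimal sketch of how it can guard the shared shelve (the UNC paths and key are hypothetical; the lock/unlock calls are the interface portalocker documents, though details may vary between versions, and whether the lock is actually honored depends on the network filesystem):

import shelve
import portalocker  # third-party: pip install portalocker

lock_path = r'\\Machine\Folder\db\shelve.lock'  # hypothetical sidecar lock file

# Take an exclusive lock before touching the shelve; every machine must do the same.
with open(lock_path, 'a') as lock_file:
    portalocker.lock(lock_file, portalocker.LOCK_EX)
    try:
        db = shelve.open(r'\\Machine\Folder\db\shared_db')  # hypothetical shelve path
        db['last_run'] = 'machine-07'                       # hypothetical entry
        db.close()
    finally:
        portalocker.unlock(lock_file)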
I have some bash code which moves files and directory to /tmp/rmf rather than deleting them, for safety purposes.
I am migrating the code to Python to add some functionality. One of the added features is checking the available size on /tmp and asserting that the moved directory can fit in /tmp.
Checking for available space is done using os.statvfs, but how can I measure the disk usage of the moved directory?
I could either call du using subprocess, or recursively iterate over the directory tree and sum the sizes of each file. Which approach would be better?
I think you might want to reconsider your strategy. Two reasons:
Checking whether you can move a file, asserting that you can move it, and then moving it builds a race condition into the operation: a big file can get created in /tmp/ after you've asserted but before you've moved your file. Doh.
Moving the file across filesystems will result in a huge amount of overhead. This is why on OS X each volume has its own 'Trash' directory. Instead of moving the blocks that compose the file, the directory entry is simply updated to point at the existing data.
I'd consider how long the file needs to be available and its visibility to consumers of the files. If it's all automated stuff happening on the backend, renaming a file to 'hide' it from computer and human consumers is easy enough in most cases and has the added benefit of being an atomic operation.
Occasionally scan the filesystem for 'old' files to cull and rm them after some grace period. No drama. Also makes restoring files a lot easier since it's just a rename to restore.
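That said, if you still need the directory's size before moving it, the recursive approach from the question is short in pure Python (note this sums apparent file sizes, not block usage as du reports):

import os

def tree_size(root):
    # Sum the sizes of all regular files under root, without following symlinks.
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                total += os.lstat(full).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total

print(tree_size('/tmp/rmf'))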
This should do the trick:
import os
path = 'THE PATH OF THE DIRECTORY YOU WANT TO FETCH'
stats = os.statvfs(path)
# Available space (in bytes) on the filesystem containing `path`.
free_bytes = stats.f_bavail * stats.f_frsize
print(free_bytes)