I would like to know if there is any benefit to using Python 2.7's multiprocessing module to asynchronously copy files from one folder to another.
Is disk I/O always forced to be serial? Does this change if you are copying from one hard disk to a different hard disk? Does it change depending on the operating system (Windows / Linux)?
Perhaps it is possible to read in parallel, but not to write?
This all assumes that the files being moved/copied are different files going to different locations.
I/O goes to the system cache in RAM before hitting a hard drive. For writes, you may find that copies are fast until you exhaust RAM and then slow down, and that multiple reads of the same data are fast. If you copy the same file to several places, there is an advantage to doing all the copies of that file before moving on to the next.
I/O to a single hard drive (or a group of hard drives joined with a RAID or volume manager) is mostly serial, except that the operating system and the drive may reorder operations to read/write nearby tracks before seeking to tracks that are further away. There is some advantage to doing parallel copies because there are more opportunities to reorder, but since you are really writing from the system RAM cache sometime after your application writes, the benefit may be hard to measure.
There is a greater benefit when moving between drives. Those copies go mostly in parallel, although there is some contention for the buses (e.g. PCIe, SATA) that serve the drives.
If you have a lot of files to copy, multiprocessing is a reasonable way to go, but you may find that shelling out to the native copy utilities with subprocess is faster.
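If you do go the multiprocessing route, a minimal sketch might look like the following. The source/destination pairs and the pool size are made-up placeholders; the same pattern works on Python 2.7, since both multiprocessing.Pool and shutil.copy2 exist there.

```python
# Hedged sketch: copy many (src, dst) pairs with a small process pool.
# The file list and pool size below are placeholders, not real paths.
import shutil
from multiprocessing import Pool

def copy_one(pair):
    src, dst = pair
    shutil.copy2(src, dst)   # copy data and metadata
    return dst

if __name__ == '__main__':
    jobs = [('/data/in/a.bin', '/mnt/other/a.bin'),
            ('/data/in/b.bin', '/mnt/other/b.bin')]
    pool = Pool(processes=4)  # extra workers mainly help when crossing drives
    for done in pool.imap_unordered(copy_one, jobs):
        print(done)
    pool.close()
    pool.join()
```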
Related
Is a file stored to disc, when only present for a fraction of a second?
I'm running Python 3.7 on Ubuntu 18.04.
I use a Python script that extracts JSON files from a zip package. The resulting files are processed and then deleted.
Since I'm running on an SSD, I want to spare it unnecessary write cycles.
Does Linux buffer such writes in RAM, or do I need to assume that I'm forcing my poor SSD through several thousand write cycles per second?
Linux may cache file operations under some circumstances, but you're looking for it to optimize by avoiding ever committing a whole sequence of operations to storage at all, based on there being no net effect. I do not think you can expect that.
It sounds like you might be better served by using a different filesystem in the first place. Linux has memory-backed file systems (served by the tmpfs filesystem driver, for example), so perhaps you want to set up such a filesystem for your application to use for these scratch files.1 Do note, however, that these are backed by virtual memory, so, although this approach should reduce the number of write cycles on your SSD, it might not eliminate all writes.
1 For example, see https://unix.stackexchange.com/a/66331/289373
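For example, if a tmpfs mount such as /dev/shm is available (an assumption; the mount point varies by distribution), the script could do its extraction and processing there. A minimal sketch, where package.zip is a placeholder name:

```python
# Hedged sketch: unpack a zip into a RAM-backed scratch directory.
# /dev/shm is assumed to be a tmpfs mount; adjust to your own mount point.
import tempfile
import zipfile

with tempfile.TemporaryDirectory(dir='/dev/shm') as scratch:
    with zipfile.ZipFile('package.zip') as zf:
        zf.extractall(scratch)
    # ... process the extracted JSON files under `scratch` here ...
# the directory and its contents are removed automatically on exit
```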
I have written some code that does some processing. I want to reduce the program's execution time, and I think this can be done if I run it from my RAM, of which I have 1 GB.
So will running my program from RAM make any difference to the execution time, and if yes, how can it be done?
Believe it or not, when you use a modernish computer system, most of your computation is done from RAM. (Well, technically, it's "done" from processor registers, but those are filled from RAM so let's brush that aside for the purposes of this answer)
This is thanks to the magic we call caches and buffers. A disk "cache" in RAM is filled by the operating system whenever something is read from permanent storage. Any further reads of that same data (until and unless it is "evicted" from the cache) only read memory instead of the permanent storage medium.
A "buffer" works similarly for write output, with data first being written to RAM and then eventually flushed out to the underlying medium.
So, in the course of normal operation, any runs of your program after the first (unless you've done a lot of work in between), will already be from RAM. Ditto the program's input file: if it's been read recently, it's already cached in memory! So you're unlikely to be able to speed things up by putting it in memory yourself.
Now, if you want to force things for some reason, you can create a "ramdisk", which is a filesystem backed by RAM. In Linux the easy way to do this is to mount "tmpfs" or put files in the /dev/shm directory. Files on a tmpfs filesystem go away when the computer loses power and are entirely stored in RAM, but otherwise behave like normal disk-backed files. From the way your question is phrased, I don't think this is what you want. I think your real answer is "whatever performance problems you think you have, this is not the cause, sorry".
This is a common question, not specific to any language or platform: who is responsible for a file created in the system's $TEMP folder?
If it's my duty, why should I care where I put the file? I could place it anywhere with the same result.
If it's the OS's responsibility, can I forget about the file right after use?
Thanks and sorry for my basic English.
As a general rule, you should remove the temporary files that you create.
Recall that the $TEMP directory is a shared resource that other programs can use. Failure to remove the temporary files will have an impact on the other programs that use $TEMP.
What kind of impact? That depends on the other programs. If they create a lot of temporary files, their execution will be slower, because creating a new temporary file takes longer when the directory has to be scanned on every creation to ensure that the file name is unique.
Consider the following (based on real events) ...
In years past, my group at work had to use the Intel C Compiler. We found that over time, it appeared to be slowing down. That is, the time it took to run our sanity tests using it took longer and longer. This also applied to building/compiling a single C file. We tracked the problem down.
ICC was opening, stat'ing and reading every file under $TEMP. For what purpose, I know not. Although the argument can be made that the problem lay with ICC, the existence of the files under $TEMP was slowing it and our development team down. Deleting those temporary files resulted in the sanity checks running in less than half an hour instead of over two hours, a significant time saver.
Hope this helps.
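In Python, for instance, letting the tempfile module manage the lifetime avoids leaving anything behind. A minimal sketch; the work done inside the block is a placeholder:

```python
# Hedged sketch: temporary files that clean themselves up.
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:   # created under $TEMP / $TMPDIR
    scratch = os.path.join(tmpdir, 'scratch.dat')
    with open(scratch, 'wb') as f:
        f.write(b'intermediate results')
# tmpdir and everything in it is gone here, even if an exception was raised
```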
There is no standard and no common rules. In most OSs, the files in the temporary folder will pile up. Some systems try to prevent this by deleting files in there automatically after some time but that sometimes causes grief, for example with long running processes or crash backups.
The reason for $TEMP to exist is that many programs (especially in early times when RAM was scarce) needed a place to store temporary data since "super computers" in the 1970s had only a few KB of RAM (yes, N*1024 bytes where N is << 100 - you couldn't even fit the image of your mouse cursor into that). Around 1980, 64KB was a lot.
The solution was a folder where anyone could write. Security wasn't an issue at the time, memory was.
Over time, OSs started to get better systems to create temporary files and to clean them up but backwards compatibility prevented a clean, "work for all" solution.
So even though you know where the data ends up, you are responsible to clean up the files after yourself. To make error analysis easier, I tend to write my code in such a way that files are only deleted when everything is fine - that way, I can look at intermediate results to figure out what is wrong. But logging is often a better and safer solution.
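One way to get that "delete only when everything is fine" behaviour, sketched in Python; run_job is a hypothetical processing step and the prefix is illustrative:

```python
# Hedged sketch: remove the scratch directory only on success,
# keep it around for post-mortem analysis otherwise.
import shutil
import tempfile

workdir = tempfile.mkdtemp(prefix='myjob-')   # not auto-deleted
try:
    run_job(workdir)                          # hypothetical processing step
except Exception:
    print('job failed, intermediate files kept in', workdir)
    raise
else:
    shutil.rmtree(workdir)                    # success: clean up after ourselves
```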
Related: Memory prices 1957-2014: 12 KB of RAM cost US $4,680 in 1973.
I have to archive a large amount of data off of CDs and DVDs, and I thought it was an interesting problem that people might have useful input on. Here's the setup:
The script will be running on multiple boxes on multiple platforms, so I thought python would be the best language to use. If the logic creates a bottleneck, any other language works.
We need to archive ~1000 CDs and ~500 DVDs, so speed is a critical issue
The data is very valuable, so verification would be useful
The discs are pretty old, so a lot of them will be hard or impossible to read
Right now, I was planning on using shutil.copytree to dump the files into a directory, and compare file trees and sizes. Maybe throw in a quick hash, although that will probably slow things down too much.
So my specific questions are:
What is the fastest way to copy files off a slow medium like CD/DVDs? (or does the method even matter)
Any suggestions of how to deal with potentially failing discs? How do you detect discs that have issues?
When you read file by file, you're seeking randomly around the disc, which is a lot slower than a bulk transfer of contiguous data. And, since the fastest CD drives are several dozen times slower than the slowest hard drives (and that's not even counting the speed hit for doing multiple reads on each bad sector for error correction), you want to get the data off the CD as soon as possible.
Also, of course, having an archive as a .iso file or similar means that, if you improve your software later, you can re-scan the filesystem without needing to dig out the CD again (which may have further degraded in storage).
Meanwhile, trying to recover damaged CDs, and damaged filesystems, is a lot more complicated than you'd expect.
So, here's what I'd do:
Block-copy the discs directly to .iso files (whether in Python, or with dd), and log all the ones that fail; a minimal Python sketch of this step appears after this list.
Hash the .iso files, not the filesystems. If you really need to hash the filesystems, keep in mind that the common optimization of compressing the data before hashing (that is, tar czf - | shasum instead of just tar cf - | shasum) usually slows things down, even for easily compressible data, but you might as well test it both ways on a couple of discs. If you need your verification to be legally useful, you may have to use a timestamped signature provided by an online service instead, in which case compressing probably will be worthwhile.
For each successful .iso file, mount it and use basic file copy operations (whether in Python, or with standard Unix tools), and again log all the ones that fail.
Get a free or commercial CD recovery tool like IsoBuster (not an endorsement, just the first one that came up in a search, although I have used it successfully before) and use it to manually recover all of the damaged discs.
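As referenced above, here is a minimal Python sketch of the block-copy step, hashing the image on the fly. The device path, block size, and output name are placeholders; for damaged media you would likely prefer dd or ddrescue for the actual reads:

```python
# Hedged sketch: block-copy an optical drive to an .iso while hashing it,
# logging any disc that cannot be read end to end.
import hashlib

def dump_disc(device='/dev/sr0', out_path='disc0001.iso', block=1024 * 1024):
    sha = hashlib.sha256()
    try:
        with open(device, 'rb') as src, open(out_path, 'wb') as dst:
            while True:
                chunk = src.read(block)
                if not chunk:
                    break
                dst.write(chunk)
                sha.update(chunk)
    except OSError as exc:
        print('FAILED', device, exc)   # log it; recover this disc by hand later
        return None
    return sha.hexdigest()

if __name__ == '__main__':
    print(dump_disc())
```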
You can do a lot of this work in parallel—when each block copy finishes, kick off the filesystem dump in the background while you're block-copying the next drive.
Finally, if you've got 1500 discs to recover, you might want to invest in a DVD jukebox or auto-loader. I'm guessing new ones are still pretty expensive, but there must be people out there selling older ones for a lot cheaper. (From a quick search online, the first thing that came up was $2500 new and $240 used…)
Writing your own backup system is not fun. Have you considered looking at ready-to-use backup solutions? There are plenty, many free ones...
If you are still bound to write your own... Answering your specific questions:
With CD/DVD you first typically have to master the image (using a tool like mkisofs), then write image to the medium. There are tools that wrap both operations for you (genisofs I believe) but this is typically the process.
To verify the backup quality, you'll have to read back all written files (by mounting a newly written CD) and compare their checksums against those of the original files. In order to do incremental backups, you'll have to keep archives of checksums for each file you save (with backup date etc).
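A hedged sketch of that verification step, comparing per-file SHA-256 checksums between the original tree and the mounted copy; the two paths are placeholders:

```python
# Hedged sketch: compare SHA-256 checksums of every file in two trees.
import hashlib
import os

def file_sha256(path, block=1 << 20):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(block), b''):
            h.update(chunk)
    return h.hexdigest()

def verify(src_root, copy_root):
    mismatches = []
    for dirpath, _dirs, files in os.walk(src_root):
        for name in files:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(copy_root, rel)
            if not os.path.exists(dst) or file_sha256(src) != file_sha256(dst):
                mismatches.append(rel)
    return mismatches

print(verify('/data/original', '/mnt/burned_cd'))
```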
We have a significant (~50 kloc) tree of packages/modules (approx. 2200 files) that we ship around to our cluster with each job. The jobs run for ~12 hours, so the overhead of untarring/bootstrapping (i.e. resolving PYTHONPATH for each module) usually isn't a big deal. However, as the number of cores in our worker nodes has increased, we've increasingly hit the case where the scheduler will have 12 jobs land simultaneously, which grinds the poor scratch drive to a halt servicing all the requests (worse, for reasons beyond our control, each job requires a separate loopback filesystem, so there are 2 layers of indirection on the drive).
Is there a way to hint to the interpreter the proper location of each file (without decorating the code with paths strewn throughout (maybe overriding import?)) or bundle up all the associated .pyc files into some sort of binary blob that can just be read once?
Thanks!
We had problems like this on our cluster. (The Lustre filesystem was slow for metadata operations.) Our solution was to use the "zip import" facilities in Python.
In our case we made a single zip of the stdlib (placed in the name given already in sys.path, like "/usr/lib/python26.zip") and another zip of our project, with the latter added to the PYTHONPATH.
This is much faster because it's a single filesystem metadata read, followed by a quick read of the zip file's table of contents to figure out what's inside, which is then cached for later lookups.
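A minimal sketch of that setup, assuming a hypothetical package named mypackage and an archive named project.zip:

```python
# Hedged sketch: bundle a package tree into one zip and import from it.
# (archive path and package name are made-up placeholders)
import os
import sys
import zipfile

# one-time build step: pack the package (including .pyc files if you like)
with zipfile.ZipFile('project.zip', 'w', zipfile.ZIP_DEFLATED) as zf:
    for dirpath, _dirs, files in os.walk('mypackage'):
        for name in files:
            if name.endswith(('.py', '.pyc')):
                path = os.path.join(dirpath, name)
                zf.write(path, path)

# at job start-up: put the archive ahead of the normal path entries
sys.path.insert(0, 'project.zip')   # or add it to PYTHONPATH
import mypackage                    # served by zipimport: one open + one TOC read
```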