It's a common question not specifically about some language or platform. Who is responsible for a file created in systems $TEMP folder?
If it's my duty, why should I care where to put this file? I can place it anywhere with same result.
If it's OS responsibility, can I forgot about this file right after use?
Thanks and sorry for my basic English.
As a general rule, you should remove the temporary files that you create.
Recall that the $TEMP directory is a shared resource that other programs can use. Failure to remove the temporary files will have an impact on the other programs that use $TEMP.
What kind of impacts? That will depend upon the other programs. If those other programs create a lot of temporary files, then their execution will be slower as it will take longer to create a new temporary file as the directory will have to be scanned on each temporary file creation to ensure that the file name is unique.
Consider the following (based on real events) ...
In years past, my group at work had to use the Intel C Compiler. We found that over time, it appeared to be slowing down. That is, the time it took to run our sanity tests using it took longer and longer. This also applied to building/compiling a single C file. We tracked the problem down.
ICC was opening, stat'ing and reading every file under $TEMP. For what purpose, I know not. Although the argument can be made that the problem lay with the ICC, the existence of the files under $TEMP was slowing it and our development team down. Deleting those temporary files resulted in the sanity checks running in less than a half hour instead of over two--a significant time saver.
Hope this helps.
There is no standard and no common rules. In most OSs, the files in the temporary folder will pile up. Some systems try to prevent this by deleting files in there automatically after some time but that sometimes causes grief, for example with long running processes or crash backups.
The reason for $TEMP to exist is that many programs (especially in early times when RAM was scarce) needed a place to store temporary data since "super computers" in the 1970s had only a few KB of RAM (yes, N*1024 bytes where N is << 100 - you couldn't even fit the image of your mouse cursor into that). Around 1980, 64KB was a lot.
The solution was a folder where anyone could write. Security wasn't an issue at the time, memory was.
Over time, OSs started to get better systems to create temporary files and to clean them up but backwards compatibility prevented a clean, "work for all" solution.
So even though you know where the data ends up, you are responsible to clean up the files after yourself. To make error analysis easier, I tend to write my code in such a way that files are only deleted when everything is fine - that way, I can look at intermediate results to figure out what is wrong. But logging is often a better and safer solution.
Related: Memory prices 1957-2014 12KB of Ram did cost US $4'680,- in 1973.
Related
I routinely use pickle.dump() to save large files in Python 2.7. In my code, I have one .pickle file that I continually update with each iteration of my code, overwriting the same file each time.
However, I occasionally encounter crashes (e.g. from server issues). This may happen in the middle of the pickle dump, rendering the pickle incomplete and the pickle file unreadable, and I lose all my data from the past iterations.
I guess one way I could do it is to save one .pickle file for each iteration, and combine all of them later. Are there any other recommended methods, or best practices in writing to disk that is robust to crashing?
You're effectively doing backups, as your goal is the same: disaster recovery, lose as little work as possible.
In backups, there are these standard practices, so choose whatever fits you best:
backing up
full backup (save everything each time)
incremental backup (save only what changed since the last backup)
differential backup (save only what changed since the last full backup)
dealing with old backups
circular buffer/rotating copies (delete or overwrite backups older than X days/iterations, optionally change indices in others' names)
consolidatating old incremental/differential copies into the preceding full backup (as a failsafe, consolidate into a new file and only then delete the old ones)
Are there any other recommended methods, or best practices ?
Let me mention a cool Mike McKearn's dill package
import dill as pickle
pickle.dump_session( aRotatingIndexFileNAME ) # saves python session-state
# # session is worth
# # many computing [CPU-core * hrs]
if needed, just
import dill as pickle; pickle.load_session( aLastSessionStateFileNAME )
Using rotating filenames are a common best-practice for archiving up to some depth of roll-back capabilities, so worth not repeating these here.
dill has literally saved me on this, including having knowingly the same call-signatures for easy substitutions into python projects.
I am running Python with MPI on a supercomputing cluster. I am getting strange nondeterministic behavior that I think is a result of I/O complications that are not present on the single machines I'm used to working with.
One of the things my code does is to create directories using os.makedirs somewhat frequently. I know also that I generally should not write small amounts of data to the filesystem-- this can end up with the data getting stuck in some buffer and not written for a long time. I suspect this may be happening with my directory creation calls, and then later code tries to write to files inside the directory before it exists. Two questions:
is creating a new directory effectively the same thing as writing a small amount of data?
When forcing data to be written, I use flush and os.fsync. These require a file object. Is there an equivalent to make sure the directory has been created?
Creating a new directory is effectively the same as writing small amount of data. It adds an inode.
The only way mkdir (or os.mkdirs) should fail is if the directory exists - otherwise the directory will always be created. In terms of the data being buffered - it's unlikely that this would happen - even journaled filesystems will sync out pretty regularly.
If you're having non-deterministic behavior, just wrap your directory creation / writing a file into that directory inside a try / except / finally that makes a few efforts? But really - the need for such code hints at something much more sinister and is likely a bigger issue.
I have to archive a large amount of data off of CDs and DVDs, and I thought it was an interesting problem that people might have useful input on. Here's the setup:
The script will be running on multiple boxes on multiple platforms, so I thought python would be the best language to use. If the logic creates a bottleneck, any other language works.
We need to archive ~1000 CDs and ~500 DVDs, so speed is a critical issue
The data is very valuable, so verification would be useful
The discs are pretty old, so a lot of them will be hard or impossible to read
Right now, I was planning on using shutil.copytree to dump the files into a directory, and compare file trees and sizes. Maybe throw in a quick hash, although that will probably slow things down too much.
So my specific questions are:
What is the fastest way to copy files off a slow medium like CD/DVDs? (or does the method even matter)
Any suggestions of how to deal with potentially failing discs? How do you detect discs that have issues?
When you read file by file, you're seeking randomly around the disc, which is a lot slower than a bulk transfer of contiguous data. And, since the fastest CD drives are several dozen times slower than the slowest hard drives (and that's not even counting the speed hit for doing multiple reads on each bad sector for error correction), you want to get the data off the CD as soon as possible.
Also, of course, having an archive as a .iso file or similar means that, if you improve your software later, you can re-scan the filesystem without needing to dig out the CD again (which may have further degraded in storage).
Meanwhile, trying to recovering damaged CDs, and damaged filesystems, is a lot more complicated than you'd expect.
So, here's what I'd do:
Block-copy the discs directly to .iso files (whether in Python, or with dd), and log all the ones that fail.
Hash the .iso files, not the filesystems. If you really need to hash the filesystems, keep in mind that the common optimization of compression the data before hashing (that is, tar czf - | shasum instead of just tar cf - | shasum) usually slows things down, even for easily-compressable data—but you might as well test it both ways on a couple discs. If you need your verification to be legally useful you may have to use a timestamped signature provided by an online service, instead, in which case compressing probably will be worthwhile.
For each successful .iso file, mount it and use basic file copy operations (whether in Python, or with standard Unix tools), and again log all the ones that fail.
Get a free or commercial CD recovery tool like IsoBuster (not an endorsement, just the first one that came up in a search, although I have used it successfully before) and use it to manually recover all of the damaged discs.
You can do a lot of this work in parallel—when each block copy finishes, kick off the filesystem dump in the background while you're block-copying the next drive.
Finally, if you've got 1500 discs to recover, you might want to invest in a DVD jukebox or auto-loader. I'm guessing new ones are still pretty expensive, but there must be people out there selling older ones for a lot cheaper. (From a quick search online, the first thing that came up was $2500 new and $240 used…)
Writing your own backup system is not fun. Have you considered looking at ready-to-use backup solutions? There are plenty, many free ones...
If you are still bound to write your own... Answering your specific questions:
With CD/DVD you first typically have to master the image (using a tool like mkisofs), then write image to the medium. There are tools that wrap both operations for you (genisofs I believe) but this is typically the process.
To verify the backup quality, you'll have to read back all written files (by mounting a newly written CD) and compare their checksums against those of the original files. In order to do incremental backups, you'll have to keep archives of checksums for each file you save (with backup date etc).
The title could have probably been put better, but anyway. I was wondering if there are any functions for writing to files that are like what the ACID properties are for databases. Reason is, I would like to make sure that the file writes I am doin won't mess up and corrupt the file if the power goes out.
Depending on what exactly you're doing with your files and the platform there are a couple options:
If you're serializing a blob from memory to disk repeatedly to maintain state (example: dhcp leases file),
if you're on a Posix system you can write your data to a temporary file and 'rename' the temporary file to your target. On Posix compliant systems this is guaranteed to be an atomic operation, shouldn't even matter if the filesystem is journaled or not. If you're on a Windows system, there's a native function named MoveFileTransacted that you might be able to utilize via bindings. But the key concept here is, the temporary file protects your data, if the system reboots the worst case is that your file contains the last good refresh of data. This option requires that you write the entire file out every time you want a change to be recorded. In the case of dhcp.leases file this isn't a big performance hit, larger files might prove to be more cumbersome.
If you're reading and writing bits of data constantly, sqlite3 is the way to go -- it supports atomic commits for groups of queries and has it's own internal journal. One thing to watch out for here is that atomic commits will be slower due to the overhead of locking the database, waiting for the data to flush, etc.
A couple other things to consider -- if your filesystem is mounted async, writes will appear to be complete because the write() returns, but it might not be flushed to disk yet. Rename protects you in this case, sqlite3 does as well.
If your filesystem is mounted async, it might be possible to write data and move it before the data is written. So if you're on a unix system it might be safest to mount sync. That's on the level of 'people might die if this fails' paranoia though. But if it's an embedded system and it dies 'I might lose my job if this fails' is also a good rationalization for the extra protection.
The ZODB is an ACID compliant database storage written in (mostly) python, so in a sense the answer is yes. But I can imagine this is a bit overkill :)
Either the OS has to provide this for you, or you'll need to implement your own ACID compliancy. For example, by defining 'records' in the file you write and, when opening/reading, verifying which records have been written (which may mean you need to throw away some non-fully written data). ZODB, for example, implements this by ending a record by writing the size of the record itself; if you can read this size and it matches, you know the record has been fully written.
And, of course, you always need to append records and not rewrite the entire file.
It looks to me that your main goal is to ensure the integrity of written files in case of power failures and system crashes. There a couple of things to be considered when doing this:
Ensure that your data is written to disk when you close a file. Even if you close it, some of the data may be in OS cache for several seconds waiting to be written to the disk. You can force writing to disk with f.flush(), followed with os.fsync(f.fileno()).
Don't modify existing data before you are certain that the updated data is safely on the disk. This part can be quite tricky (and OS/filesystem dependent).
Use file format that helps you to verify the integrity of data (e.g. use checksums).
Another alternative is to use sqlite3.
EDIT: Regarding my second point, I highly recommend this presentation: http://www.flamingspork.com/talks/2007/06/eat_my_data.odp. This also covers issues with "atomic rename".
Attempt #2:
People don't seem to be understanding what I'm trying to do. Let me see if I can state it more clearly:
1) Reading a list of files is much faster than walking a directory.
2) So let's have a function that walks a directory and writes the resulting list to a file. Now, in the future, if we want to get all the files in that directory we can just read this file instead of walking the dir. I call this file the index.
3) Obviously, as the filesystem changes the index file gets out of sync. To overcome this, we have a separate program that hooks into the OS in order to monitor changes to the filesystem. It writes those changes to a file called the monitor log. Immediately after we read the index file for a particular directory, we use the monitor log to apply the various changes to the index so that it reflects the current state of the directory.
Because reading files is so much cheaper than walking a directory, this should be much faster than walking for all calls after the first.
Original post:
I want a function that will recursively get all the files in any given directory and filter them according to various parameters. And I want it to be fast -- like, an order of magnitude faster than simply walking the dir. And I'd prefer to do it in Python. Cross-platform is preferable, but Windows is most important.
Here's my idea for how to go about this:
I have a function called all_files:
def all_files(dir_path, ...parms...):
...
The first time I call this function it will use os.walk to build a list of all the files, along with info about the files such as whether they are hidden, a symbolic link, etc. I'll write this data to a file called ".index" in the directory. On subsequent calls to all_files, the .index file will be detected, and I will read that file rather than walking the dir.
This leaves the problem of the index getting out of sync as files are added and removed. For that I'll have a second program that runs on startup, detects all changes to the entire filesystem, and writes them to a file called "mod_log.txt". It detects changes via Windows signals, like the method described here. This file will contain one event per line, with each event consisting of the path affected, the type of event (create, delete, etc.), and a timestamp. The .index file will have a timestamp as well for the time it was last updated. After I read the .index file in all_files I will tail mod_log.txt and find any events that happened after the timestamp in the .index file. It will take these recent events, find any that apply to the current directory, and update the .index accordingly.
Finally, I'll take the list of all files, filter it according to various parameters, and return the result.
What do you think of my approach? Is there a better way to do this?
Edit:
Check this code out. I'm seeing a drastic speedup from reading a cached list over a recursive walk.
import os
from os.path import join, exists
import cProfile, pstats
dir_name = "temp_dir"
index_path = ".index"
def create_test_files():
os.mkdir(dir_name)
index_file = open(index_path, 'w')
for i in range(10):
print "creating dir: ", i
sub_dir = join(dir_name, str(i))
os.mkdir(sub_dir)
for i in range(100):
file_path = join(sub_dir, str(i))
open(file_path, 'w').close()
index_file.write(file_path + "\n")
index_file.close()
#
# 0.238 seconds
def test_walk():
for info in os.walk("temp_dir"):
pass
# 0.001 seconds
def test_read():
open(index_path).readlines()
if not exists("temp_dir"):
create_test_files()
def profile(s):
cProfile.run(s, 'profile_results.txt')
p = pstats.Stats('profile_results.txt')
p.strip_dirs().sort_stats('cumulative').print_stats(10)
profile("test_walk()")
profile("test_read()")
Do not try to duplicate the work that the filesystem already does. You are not going to do better than it already does.
Your scheme is flawed in many ways and it will not get you an order-of-magnitude improvement.
Flaws and potential problems:
You are always going to be working with a snapshot of the file system. You will never know with any certainty that it is not significantly disjoint from reality. If that is within the working parameters of your application, no sweat.
The filesystem monitor program still has to recursively walk the file system, so the work is still being done.
In order to increase the accuracy of the cache, you have to increase the frequency with which the filesystem monitor runs. The more it runs, the less actual time that you are saving.
Your client application likely won't be able to read the index file while it is being updated by the filesystem monitor program, so you'll lose time while the client waits for the index to be readable.
I could go on.
If, in fact, you don't care about working with a snapshot of the filesystem that may be very disjoint from reality, I think that you'd be much better off with keeping the index in memory and updating from with the application itself. That will scrub any file contention issues that will otherwise arise.
The best answer came from Michał Marczyk toward the bottom of the comment list on the initial question. He pointed out that what I'm describing is very close to the UNIX locate program. I found a Windows version here: http://locate32.net/index.php. It solved my problem.
Edit: Actually the Everything search engine looks even better. Apparently Windows keeps journals of changes to the filesystem, and Everything uses that to keep the database up to date.
Doesn't Windows Desktop Search provide such an index as a byproduct? On the mac the spotlight index can be queried for filenames like this: mdfind -onlyin . -name '*'.
Of course it's much faster than walking the directory.
The short answer is "no". You will not be able to build an indexing system in Python that will outpace the file system by an order of magnitude.
"Indexing" a filesystem is an intensive/slow task, regardless of the caching implementation. The only realistic way to avoid the huge overhead of building filesystem indexes is to "index as you go" to avoid the big traversal. (After all, the filesystem itself is already a data indexer.)
There are operating system features that are capable of doing this "build as you go" filesystem indexing. It's the very foundation of services like Spotlight on OSX and Windows Desktop Search.
To have any hope of getting faster speeds than walking the directories, you'll want to leverage one of those OS or filesystem level tools.
Also, try not to mislead yourself into thinking solutions are faster just because you've "moved" the work to a different time/process. Your example code does exactly that. You traverse the directory structure of your sample files while you're building the same files and create the index, and then later just read that file.
There are two lessons, here. (a) To create a proper test it's essential to separate the "setup" from the "test". Here your performance test essentially says, "Which is faster, traversing a directory structure or reading an index that's already been created in advance?" Clearly this is not an apples to oranges comparison.
However, (b) you've stumbled on the correct answer at the same time. You can get a list of files much faster if you use an already existing index. This is where you'd need to leverage something like the Windows Desktop Search or Spotlight indexes.
Make no mistake, in order to build an index of a filesystem you must, by definition, "visit" every file. If your files are stored in a tree, then a recursive traversal is likely going to be the fastest way you can visit every file. If the question is "can I write Python code to do exactly what os.walk does but be an order of magnitude faster than os.walk" the answer is a resounding no. If the question is "can I write Python code to index every file on the system without taking the time to actually visit every file" then the answer is still no.
(Edit in response to "I don't think you understand what I'm trying to do")
Let's be clear here, virtually everyone here understands what you're trying to do. It seems that you're taking "no, this isn't going to work like you want it to work" to mean that we don't understand.
Let's look at this from another angle. File systems have been an essential component to modern computing from the very beginning. The categorization, indexing, storage, and retrieval of data is a serious part of computer science and computer engineering and many of the most brilliant minds in computer science are working on it constantly.
You want to be able to filter/select files based on attributes/metadata/data of the files. This is an extremely common task utilized constantly in computing. It's likely happening several times a second even on the computer you're working with right now.
If it were as simple to speed up this process by an order of magnitude(!) by simply keeping a text file index of the filenames and attributes, don't you think every single file system and operating system in existence would do exactly that?
That said, of course caching the results of your specific queries could net you some small performance increases. And, as expected, file system and disk caching is a fundamental part of every modern operating system and file system.
But your question, as you asked it, has a clear answer: No. In the general case, you're not going to get an order of magnitude faster reimplementing os.walk. You may be able to get a better amortized runtime by caching, but you're not going to be beat it by an order of magnitude if you properly include the work to build the cache in your profiling.
I would like to recommend you just use a combination of os.walk (to get directory trees) & os.stat (to get file information) for this. Using the std-lib will ensure it works on all platforms, and they do the job nicely. And no need to index anything.
As other have stated, I don't really think you're going to buy much by attempting to index and re-index the filesystem, especially if you're already limiting your functionality by path and parameters.
I'm new to Python, but I'm using a combination of list comprehensions, iterator and a generator should scream according to reports I've read.
class DirectoryIterator:
def __init__(self, start_dir, pattern):
self.directory = start_dir
self.pattern = pattern
def __iter__(self):
[([DirectoryIterator(dir, self.pattern) for dir in dirnames], [(yield os.path.join(dirpath, name)) for name in filenames if re.search(self.pattern, name) ]) for dirpath, dirnames, filenames in os.walk(self.directory)]
###########
for file_name in DirectoryIterator(".", "\.py$"): print file_name