I am running Python with MPI on a supercomputing cluster. I am getting strange nondeterministic behavior that I think is a result of I/O complications that are not present on the single machines I'm used to working with.
One of the things my code does is to create directories using os.makedirs somewhat frequently. I know also that I generally should not write small amounts of data to the filesystem-- this can end up with the data getting stuck in some buffer and not written for a long time. I suspect this may be happening with my directory creation calls, and then later code tries to write to files inside the directory before it exists. Two questions:
is creating a new directory effectively the same thing as writing a small amount of data?
When forcing data to be written, I use flush and os.fsync. These require a file object. Is there an equivalent to make sure the directory has been created?
Creating a new directory is effectively the same as writing small amount of data. It adds an inode.
The only way mkdir (or os.mkdirs) should fail is if the directory exists - otherwise the directory will always be created. In terms of the data being buffered - it's unlikely that this would happen - even journaled filesystems will sync out pretty regularly.
If you're having non-deterministic behavior, just wrap your directory creation / writing a file into that directory inside a try / except / finally that makes a few efforts? But really - the need for such code hints at something much more sinister and is likely a bigger issue.
Related
I'm using a python package, cdo, which heavily relies on tempfile for storing intermediate results. The created temporary files are quite large and when running bigger calculations, I've run into the problem that the /tmp directory got filled up and the script failed with a disk full error (we are talking about 10s to 100s of GB). I've found a workaround to the problem by creating a local folder, say $HOME/tmp and then doing
import tempfile
tempfile.tempdir='$HOME/tmp'
before importing the cdo module. While this works for me, it is somewhat cumbersome if I want also others to use my scripts. Therefore I was wondering, whether there would be a more elegant way to solve the problem by, e.g., telling tmpfile periodically to clear out all temporary files (usually this is only done once the script finishes). From my side this would be possible, because I am running a long loop, which produces one named file each iteration and all the temporary files created during that iteration would be discardable afterwards.
as the examples show: you could use tempfile in a context manager:
with tempfile.TemporaryFile() as fp:
fp.write(b'Hello world!')
fp.seek(0)
fp.read()
that way they are removed when the context exits.
...do you have that much control over how cdo uses tempfiles?
It's a common question not specifically about some language or platform. Who is responsible for a file created in systems $TEMP folder?
If it's my duty, why should I care where to put this file? I can place it anywhere with same result.
If it's OS responsibility, can I forgot about this file right after use?
Thanks and sorry for my basic English.
As a general rule, you should remove the temporary files that you create.
Recall that the $TEMP directory is a shared resource that other programs can use. Failure to remove the temporary files will have an impact on the other programs that use $TEMP.
What kind of impacts? That will depend upon the other programs. If those other programs create a lot of temporary files, then their execution will be slower as it will take longer to create a new temporary file as the directory will have to be scanned on each temporary file creation to ensure that the file name is unique.
Consider the following (based on real events) ...
In years past, my group at work had to use the Intel C Compiler. We found that over time, it appeared to be slowing down. That is, the time it took to run our sanity tests using it took longer and longer. This also applied to building/compiling a single C file. We tracked the problem down.
ICC was opening, stat'ing and reading every file under $TEMP. For what purpose, I know not. Although the argument can be made that the problem lay with the ICC, the existence of the files under $TEMP was slowing it and our development team down. Deleting those temporary files resulted in the sanity checks running in less than a half hour instead of over two--a significant time saver.
Hope this helps.
There is no standard and no common rules. In most OSs, the files in the temporary folder will pile up. Some systems try to prevent this by deleting files in there automatically after some time but that sometimes causes grief, for example with long running processes or crash backups.
The reason for $TEMP to exist is that many programs (especially in early times when RAM was scarce) needed a place to store temporary data since "super computers" in the 1970s had only a few KB of RAM (yes, N*1024 bytes where N is << 100 - you couldn't even fit the image of your mouse cursor into that). Around 1980, 64KB was a lot.
The solution was a folder where anyone could write. Security wasn't an issue at the time, memory was.
Over time, OSs started to get better systems to create temporary files and to clean them up but backwards compatibility prevented a clean, "work for all" solution.
So even though you know where the data ends up, you are responsible to clean up the files after yourself. To make error analysis easier, I tend to write my code in such a way that files are only deleted when everything is fine - that way, I can look at intermediate results to figure out what is wrong. But logging is often a better and safer solution.
Related: Memory prices 1957-2014 12KB of Ram did cost US $4'680,- in 1973.
I have some bash code which moves files and directory to /tmp/rmf rather than deleting them, for safety purposes.
I am migrating the code to Python to add some functionality. One of the added features is checking the available size on /tmp and asserting that the moved directory can fit in /tmp.
Checking for available space is done using os.statvfs, but how can I measure the disk usage of the moved directory?
I could either call du using subprocess, or recursively iterate over the directory tree and sum the sizes of each file. Which approach would be better?
I think you might want to reconsider your strategy. Two reasons:
Checking if you can move a file, asserting you can move a file, and then moving a file provides a built-in race-condition to the operation. A big file gets created in /tmp/ after you've asserted but before you've moved your file.. Doh.
Moving the file across filesystems will result in a huge amount of overhead. This is why on OSX each volume has their own 'Trash' directory. Instead of moving the blocks that compose the file, you just create a new inode that points to the existing data.
I'd consider how long the file needs to be available and the visibility to consumers of the files. If it's all automated stuff happening on the backend - renaming a file to 'hide' it from computer and human consumers is easy enough in most cases and has the added benefit of being an atomic operation)
Occasionally scan the filesystem for 'old' files to cull and rm them after some grace period. No drama. Also makes restoring files a lot easier since it's just a rename to restore.
This should do the trick:
import os
path = 'THE PATH OF THE DIRECTORY YOU WANT TO FETCH'
os.statvfs(path)
The title could have probably been put better, but anyway. I was wondering if there are any functions for writing to files that are like what the ACID properties are for databases. Reason is, I would like to make sure that the file writes I am doin won't mess up and corrupt the file if the power goes out.
Depending on what exactly you're doing with your files and the platform there are a couple options:
If you're serializing a blob from memory to disk repeatedly to maintain state (example: dhcp leases file),
if you're on a Posix system you can write your data to a temporary file and 'rename' the temporary file to your target. On Posix compliant systems this is guaranteed to be an atomic operation, shouldn't even matter if the filesystem is journaled or not. If you're on a Windows system, there's a native function named MoveFileTransacted that you might be able to utilize via bindings. But the key concept here is, the temporary file protects your data, if the system reboots the worst case is that your file contains the last good refresh of data. This option requires that you write the entire file out every time you want a change to be recorded. In the case of dhcp.leases file this isn't a big performance hit, larger files might prove to be more cumbersome.
If you're reading and writing bits of data constantly, sqlite3 is the way to go -- it supports atomic commits for groups of queries and has it's own internal journal. One thing to watch out for here is that atomic commits will be slower due to the overhead of locking the database, waiting for the data to flush, etc.
A couple other things to consider -- if your filesystem is mounted async, writes will appear to be complete because the write() returns, but it might not be flushed to disk yet. Rename protects you in this case, sqlite3 does as well.
If your filesystem is mounted async, it might be possible to write data and move it before the data is written. So if you're on a unix system it might be safest to mount sync. That's on the level of 'people might die if this fails' paranoia though. But if it's an embedded system and it dies 'I might lose my job if this fails' is also a good rationalization for the extra protection.
The ZODB is an ACID compliant database storage written in (mostly) python, so in a sense the answer is yes. But I can imagine this is a bit overkill :)
Either the OS has to provide this for you, or you'll need to implement your own ACID compliancy. For example, by defining 'records' in the file you write and, when opening/reading, verifying which records have been written (which may mean you need to throw away some non-fully written data). ZODB, for example, implements this by ending a record by writing the size of the record itself; if you can read this size and it matches, you know the record has been fully written.
And, of course, you always need to append records and not rewrite the entire file.
It looks to me that your main goal is to ensure the integrity of written files in case of power failures and system crashes. There a couple of things to be considered when doing this:
Ensure that your data is written to disk when you close a file. Even if you close it, some of the data may be in OS cache for several seconds waiting to be written to the disk. You can force writing to disk with f.flush(), followed with os.fsync(f.fileno()).
Don't modify existing data before you are certain that the updated data is safely on the disk. This part can be quite tricky (and OS/filesystem dependent).
Use file format that helps you to verify the integrity of data (e.g. use checksums).
Another alternative is to use sqlite3.
EDIT: Regarding my second point, I highly recommend this presentation: http://www.flamingspork.com/talks/2007/06/eat_my_data.odp. This also covers issues with "atomic rename".
Attempt #2:
People don't seem to be understanding what I'm trying to do. Let me see if I can state it more clearly:
1) Reading a list of files is much faster than walking a directory.
2) So let's have a function that walks a directory and writes the resulting list to a file. Now, in the future, if we want to get all the files in that directory we can just read this file instead of walking the dir. I call this file the index.
3) Obviously, as the filesystem changes the index file gets out of sync. To overcome this, we have a separate program that hooks into the OS in order to monitor changes to the filesystem. It writes those changes to a file called the monitor log. Immediately after we read the index file for a particular directory, we use the monitor log to apply the various changes to the index so that it reflects the current state of the directory.
Because reading files is so much cheaper than walking a directory, this should be much faster than walking for all calls after the first.
Original post:
I want a function that will recursively get all the files in any given directory and filter them according to various parameters. And I want it to be fast -- like, an order of magnitude faster than simply walking the dir. And I'd prefer to do it in Python. Cross-platform is preferable, but Windows is most important.
Here's my idea for how to go about this:
I have a function called all_files:
def all_files(dir_path, ...parms...):
...
The first time I call this function it will use os.walk to build a list of all the files, along with info about the files such as whether they are hidden, a symbolic link, etc. I'll write this data to a file called ".index" in the directory. On subsequent calls to all_files, the .index file will be detected, and I will read that file rather than walking the dir.
This leaves the problem of the index getting out of sync as files are added and removed. For that I'll have a second program that runs on startup, detects all changes to the entire filesystem, and writes them to a file called "mod_log.txt". It detects changes via Windows signals, like the method described here. This file will contain one event per line, with each event consisting of the path affected, the type of event (create, delete, etc.), and a timestamp. The .index file will have a timestamp as well for the time it was last updated. After I read the .index file in all_files I will tail mod_log.txt and find any events that happened after the timestamp in the .index file. It will take these recent events, find any that apply to the current directory, and update the .index accordingly.
Finally, I'll take the list of all files, filter it according to various parameters, and return the result.
What do you think of my approach? Is there a better way to do this?
Edit:
Check this code out. I'm seeing a drastic speedup from reading a cached list over a recursive walk.
import os
from os.path import join, exists
import cProfile, pstats
dir_name = "temp_dir"
index_path = ".index"
def create_test_files():
os.mkdir(dir_name)
index_file = open(index_path, 'w')
for i in range(10):
print "creating dir: ", i
sub_dir = join(dir_name, str(i))
os.mkdir(sub_dir)
for i in range(100):
file_path = join(sub_dir, str(i))
open(file_path, 'w').close()
index_file.write(file_path + "\n")
index_file.close()
#
# 0.238 seconds
def test_walk():
for info in os.walk("temp_dir"):
pass
# 0.001 seconds
def test_read():
open(index_path).readlines()
if not exists("temp_dir"):
create_test_files()
def profile(s):
cProfile.run(s, 'profile_results.txt')
p = pstats.Stats('profile_results.txt')
p.strip_dirs().sort_stats('cumulative').print_stats(10)
profile("test_walk()")
profile("test_read()")
Do not try to duplicate the work that the filesystem already does. You are not going to do better than it already does.
Your scheme is flawed in many ways and it will not get you an order-of-magnitude improvement.
Flaws and potential problems:
You are always going to be working with a snapshot of the file system. You will never know with any certainty that it is not significantly disjoint from reality. If that is within the working parameters of your application, no sweat.
The filesystem monitor program still has to recursively walk the file system, so the work is still being done.
In order to increase the accuracy of the cache, you have to increase the frequency with which the filesystem monitor runs. The more it runs, the less actual time that you are saving.
Your client application likely won't be able to read the index file while it is being updated by the filesystem monitor program, so you'll lose time while the client waits for the index to be readable.
I could go on.
If, in fact, you don't care about working with a snapshot of the filesystem that may be very disjoint from reality, I think that you'd be much better off with keeping the index in memory and updating from with the application itself. That will scrub any file contention issues that will otherwise arise.
The best answer came from MichaĆ Marczyk toward the bottom of the comment list on the initial question. He pointed out that what I'm describing is very close to the UNIX locate program. I found a Windows version here: http://locate32.net/index.php. It solved my problem.
Edit: Actually the Everything search engine looks even better. Apparently Windows keeps journals of changes to the filesystem, and Everything uses that to keep the database up to date.
Doesn't Windows Desktop Search provide such an index as a byproduct? On the mac the spotlight index can be queried for filenames like this: mdfind -onlyin . -name '*'.
Of course it's much faster than walking the directory.
The short answer is "no". You will not be able to build an indexing system in Python that will outpace the file system by an order of magnitude.
"Indexing" a filesystem is an intensive/slow task, regardless of the caching implementation. The only realistic way to avoid the huge overhead of building filesystem indexes is to "index as you go" to avoid the big traversal. (After all, the filesystem itself is already a data indexer.)
There are operating system features that are capable of doing this "build as you go" filesystem indexing. It's the very foundation of services like Spotlight on OSX and Windows Desktop Search.
To have any hope of getting faster speeds than walking the directories, you'll want to leverage one of those OS or filesystem level tools.
Also, try not to mislead yourself into thinking solutions are faster just because you've "moved" the work to a different time/process. Your example code does exactly that. You traverse the directory structure of your sample files while you're building the same files and create the index, and then later just read that file.
There are two lessons, here. (a) To create a proper test it's essential to separate the "setup" from the "test". Here your performance test essentially says, "Which is faster, traversing a directory structure or reading an index that's already been created in advance?" Clearly this is not an apples to oranges comparison.
However, (b) you've stumbled on the correct answer at the same time. You can get a list of files much faster if you use an already existing index. This is where you'd need to leverage something like the Windows Desktop Search or Spotlight indexes.
Make no mistake, in order to build an index of a filesystem you must, by definition, "visit" every file. If your files are stored in a tree, then a recursive traversal is likely going to be the fastest way you can visit every file. If the question is "can I write Python code to do exactly what os.walk does but be an order of magnitude faster than os.walk" the answer is a resounding no. If the question is "can I write Python code to index every file on the system without taking the time to actually visit every file" then the answer is still no.
(Edit in response to "I don't think you understand what I'm trying to do")
Let's be clear here, virtually everyone here understands what you're trying to do. It seems that you're taking "no, this isn't going to work like you want it to work" to mean that we don't understand.
Let's look at this from another angle. File systems have been an essential component to modern computing from the very beginning. The categorization, indexing, storage, and retrieval of data is a serious part of computer science and computer engineering and many of the most brilliant minds in computer science are working on it constantly.
You want to be able to filter/select files based on attributes/metadata/data of the files. This is an extremely common task utilized constantly in computing. It's likely happening several times a second even on the computer you're working with right now.
If it were as simple to speed up this process by an order of magnitude(!) by simply keeping a text file index of the filenames and attributes, don't you think every single file system and operating system in existence would do exactly that?
That said, of course caching the results of your specific queries could net you some small performance increases. And, as expected, file system and disk caching is a fundamental part of every modern operating system and file system.
But your question, as you asked it, has a clear answer: No. In the general case, you're not going to get an order of magnitude faster reimplementing os.walk. You may be able to get a better amortized runtime by caching, but you're not going to be beat it by an order of magnitude if you properly include the work to build the cache in your profiling.
I would like to recommend you just use a combination of os.walk (to get directory trees) & os.stat (to get file information) for this. Using the std-lib will ensure it works on all platforms, and they do the job nicely. And no need to index anything.
As other have stated, I don't really think you're going to buy much by attempting to index and re-index the filesystem, especially if you're already limiting your functionality by path and parameters.
I'm new to Python, but I'm using a combination of list comprehensions, iterator and a generator should scream according to reports I've read.
class DirectoryIterator:
def __init__(self, start_dir, pattern):
self.directory = start_dir
self.pattern = pattern
def __iter__(self):
[([DirectoryIterator(dir, self.pattern) for dir in dirnames], [(yield os.path.join(dirpath, name)) for name in filenames if re.search(self.pattern, name) ]) for dirpath, dirnames, filenames in os.walk(self.directory)]
###########
for file_name in DirectoryIterator(".", "\.py$"): print file_name