I am using os.walk to build a map of a data-store (this map is used later in the tool I am building).
This is the code I currently use:
import os

def find_children(tickstore):
    children = []
    dir_list = os.walk(tickstore)
    for i in dir_list:
        children.append(i[0])
    return children
I have done some analysis on it:
dir_list = os.walk(tickstore) runs instantly; if I do nothing with dir_list then this function completes instantly.
It is iterating over dir_list that takes a long time; even if I don't append anything, just iterating over it is what takes the time.
Tickstore is a big datastore with ~10,000 directories.
Currently it takes approximately 35 minutes to complete this function.
Is there any way to speed it up?
I've looked at alternatives to os.walk but none of them seemed to provide much of an advantage in terms of speed.
Yes: use Python 3.5 (which is still currently an RC, but should be out momentarily). In Python 3.5, os.walk was rewritten to be more efficient.
This work was done as part of PEP 471.
Extracted from the PEP:
Python's built-in os.walk() is significantly slower than it needs to
be, because -- in addition to calling os.listdir() on each directory
-- it executes the stat() system call or GetFileAttributes() on each file to determine whether the entry is a directory or not.
But the underlying system calls -- FindFirstFile / FindNextFile on
Windows and readdir on POSIX systems -- already tell you whether the
files returned are directories or not, so no further system calls are
needed. Further, the Windows system calls return all the information
for a stat_result object on the directory entry, such as file size and
last modification time.
In short, you can reduce the number of system calls required for a
tree function like os.walk() from approximately 2N to N, where N is
the total number of files and directories in the tree. (And because
directory trees are usually wider than they are deep, it's often much
better than this.)
In practice, removing all those extra system calls makes os.walk()
about 8-9 times as fast on Windows, and about 2-3 times as fast on
POSIX systems. So we're not talking about micro-optimizations. See
more benchmarks here.
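If you want to go a step further than the drop-in speedup, the same machinery is exposed directly as os.scandir() in 3.5. Here is a minimal sketch of the question's find_children built on it (the iterative traversal and the follow_symlinks=False choice are mine, not from the question, and unlike os.walk it does not silently skip unreadable directories):

import os

def find_children(tickstore):
    # Collect every directory path under tickstore, including tickstore itself.
    children = [tickstore]
    stack = [tickstore]
    while stack:
        current = stack.pop()
        for entry in os.scandir(current):
            # is_dir() uses the file type the OS already returned with the
            # directory listing, so no extra stat() call is needed here.
            if entry.is_dir(follow_symlinks=False):
                children.append(entry.path)
                stack.append(entry.path)
    return children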
One way to get that optimization on Python 2.7 is to use scandir.walk() instead of os.walk(); the parameters are exactly the same.
import scandir

directory = "/tmp"
res = scandir.walk(directory)
for item in res:
    print item
PS: As @recoup mentioned in a comment, scandir needs to be installed before use in Python 2.7.
os.walk is currently quite slow because it first lists the directory and then does a stat on each entry to see if it is a directory or a file.
An improvement was proposed in PEP 471 and is coming in Python 3.5. In the meantime you can use the scandir package to get the same benefits in Python 2.7.
Related
I'm looking for a way to optimize my code to run in under 7 seconds. Currently it runs in 20 seconds. Any clues?
if platform.system() == "Windows":
    app = string + "." + extension
    for root, dirs, files in os.walk("C:\\"):
        if app in files:
            path = os.path.join(root, app)
            os.startfile(path)
I found an interesting topic on the internet saying that you can speed it up by 5x-9x:
Taken from here
I've noticed that os.walk() is a lot slower than it needs to be because it
does an os.stat() call for every file/directory. It does this because it uses
listdir(), which doesn't return any file attributes.
So instead of using the file information provided by FindFirstFile() /
FindNextFile() and readdir() -- which listdir() calls -- os.walk() does a
stat() on every file to see whether it's a directory or not. Both
FindFirst/FindNext and readdir() give this information already. Using it would
basically bring the number of system calls down from O(N) to O(log N).
I've written a proof-of-concept (see [1] below) using ctypes and
FindFirst/FindNext on Windows, showing that for sizeable directory trees it
gives a 4x to 6x speedup -- so this is not a micro-optimization!
I started trying the same thing with opendir/readdir on Linux, but don't have
as much experience there, and wanted to get some feedback on the concept
first. I assume it'd be a similar speedup by using d_type & DT_DIR from
readdir().
The problem is even worse when you're calling os.walk() and then doing your
own stat() on each file, for example, to get the total size of all files in a
tree -- see [2]. It means it's calling stat() twice on every file, and I see
about a 9x speedup in this scenario using the info FindFirst/Next provide.
So there are a couple of things here:
1) The simplest thing to do would be to keep the APIs exactly the same, and
get the ~5x speedup on os.walk() -- on Windows, unsure of the exact speedup on
Linux. And on OS's where readdir() doesn't return "is directory" information,
obviously it'd fall back to using the stat on each file.
2) We could significantly improve the API by adding a listdir_stat() or
similar function, which would return a list of (filename, stat_result) tuples
instead of just the names. That might be useful in its own right, but of
course os.walk() could use it to speed itself up. Then of course it might be
good to have walk_stat() which you could use to speed up the "summing sizes"
cases.
Other related improvements to the listdir/walk APIs that could be considered
are:
* Using the wildcard/glob that FindFirst/FindNext take to do filtering -- this
would avoid fnmatch-ing and might speed up large operations, though I don't
have numbers. Obviously we'd have to simulate this with fnmatch on non-Windows
OSs, but this kind of filtering is something I've done with os.walk() many
times, so just having the API option would be useful ("glob" keyword arg?).
* Changing listdir() to yield instead of return a list (or adding yieldir?).
This fits both the FindNext/readdir APIs, and would address issues like [3].
Anyway, cutting a long story short -- do folks think 1) is a good idea? What
about some of the thoughts in 2)? In either case, what would be the best way
to go further on this?
Thanks,
Ben.
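On Python 2.7 the scandir package wraps this approach up behind the same interface as os.walk, so the loop in the question only needs its walk call swapped. A hedged sketch, reusing the question's string and extension variables and assuming scandir is installed:

import os
import scandir  # third-party backport of the faster directory iteration

app = string + "." + extension  # string and extension come from the surrounding code, as in the question
for root, dirs, files in scandir.walk("C:\\"):
    if app in files:
        path = os.path.join(root, app)
        os.startfile(path)
        break  # stop after the first match instead of scanning the rest of the drive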
I want to know how many files are in a folder (specifically a shared network folder on Windows, if that makes a difference here).
I am using this code right now:
import os

def countFiles(path):
    return len([f for f in os.listdir(path)
                if os.path.isfile(os.path.join(path, f))])
It works fine when there are a few files in the folder, but it takes a noticeably long time in a directory with many files (say 4000). I am running this frequently (files are being added every ~15 seconds), so the slowdown is painful.
In my particular case, I know there aren't any subfolders, so I could skip the os.path.isfile check, but I'd like to keep my solution general. Frankly, I am surprised there isn't a built-in number-of-files function in os.path.
In order to know how many files there are in a folder, the system must enumerate each entry, then it must check whether an entry is a file or not. There's no faster way unless the system provides you with filesystem change notifications (e.g. FSEvents or inotify) to tell you when things change.
These operations are slow for a disk-based filesystem (tens to hundreds of microseconds), and even slower on a network drive; you'll notice they are pretty slow even in a normal file browser. Modern OSes deal with the slowness through aggressive caching, but this has its limits (especially for network filesystems, where the overhead of keeping the cache fresh can exceed the cost of doing the operations in the first place).
To speed it up, you could cache the isfile result for names you've already checked, under the assumption that they won't transmute into directories. This would save you many isfile checks, at the expense of a bit of safety (if e.g. someone deletes a file and replaces it with an identically-named folder).
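A rough sketch of that caching idea (the function and set names are mine): remember which names in this folder have already been confirmed as files, and only stat the ones you haven't seen before.

import os

_known_files = set()  # names already confirmed to be regular files in this folder

def count_files(path):
    count = 0
    for name in os.listdir(path):
        if name in _known_files:
            count += 1  # assume it hasn't transmuted into a directory
            continue
        if os.path.isfile(os.path.join(path, name)):
            _known_files.add(name)
            count += 1
    return count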
Is using os.walk in the way below the least time consuming way to recursively search through a folder and return all the files that end with .tnt?
for root, dirs, files in os.walk('C:\\data'):
    print "Now in root %s" % root
    for f in files:
        if f.endswith('.tnt'):
            print os.path.join(root, f)  # placeholder body; do whatever you need with each .tnt file
Yes, using os.walk is indeed the best way to do that.
As everyone has said, os.walk is almost certainly the best way to do it.
If you actually have a performance problem, and profiling has shown that it's caused by os.walk (and/or iterating the results with .endswith), your best answer is probably to step outside Python. Replace all of the code above with:
import sys

for f in sys.argv[1:]:
    # ... process each path exactly as you did inside the os.walk loop ...
Now you need some outside tool that can gather the paths and run your script. (Ideally batching as many paths as possible into each script execution.)
If you can rely on Windows Desktop Search having indexed the drive, it should only need to do a quick database operation to find all files under a certain path with a certain extension. I have no idea how to write a batch file that runs that query and gets the results as a list of arguments to pass to a Python script (or a PowerShell file that runs the query and passes the results to IronPython without serializing it into a list of arguments), but it would be worth researching this before anything else.
If you can't rely on your platform's desktop search index, on any POSIX platform, it would almost certainly be fastest and simplest to use this one-liner shell script:
find /my/path -name '*.tnt' -exec myscript.py {} +
Unfortunately, you're not on a POSIX platform, you're on Windows, which doesn't come with the find tool, which is the thing that's doing all the heavy lifting here.
There are ports of find to native Windows, but you'll have to figure out the command-line intricacies to get everything quoted right and format the path and so on, so you can write the one-liner batch file. Alternatively, you could install cygwin and use the exact same shell script you'd use on a POSIX system. Or you could find a more Windows-y tool that does what you need.
This could conceivably be slower rather than faster—Windows isn't designed to execute lots of little processes with as little overhead as possible, and I believe it has smaller limits on command lines than platforms like linux or OS X, so you may spend more time waiting for the interpreter to start and exit than you save. You'd have to test to see. In fact, you probably want to test both native and cygwin versions (with both native and cygwin Python, in the latter case).
You don't actually have to move the find invocation into a batch/shell script; it's probably the simplest answer, but there are others, such as using subprocess to call find from within Python. This might solve performance problems caused by starting the interpreter too many times.
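A hedged sketch of that subprocess route (it assumes a POSIX-style find is on the PATH; filenames with unusual characters would need the -print0 variant):

import subprocess

def tnt_files(top):
    # Ask find(1) for every *.tnt path under top and return the lines as a list.
    out = subprocess.check_output(['find', top, '-name', '*.tnt'])
    return out.splitlines()

for path in tnt_files('/my/path'):
    print path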
Getting the right amount of parallelism may also help—spin off each invocation of your script to the background and don't wait for them to finish. (I believe on Windows, the shell isn't involved in this; instead there's a tool named something like "run" that kicks off a process detached from the shell. But I don't remember the details.)
If none of this works out, you may have to write a custom C extension that does the fastest possible Win32 or .NET thing (which also means you have to do the research to find out what that is…) so you can call that from within Python.
Attempt #2:
People don't seem to be understanding what I'm trying to do. Let me see if I can state it more clearly:
1) Reading a list of files is much faster than walking a directory.
2) So let's have a function that walks a directory and writes the resulting list to a file. Now, in the future, if we want to get all the files in that directory we can just read this file instead of walking the dir. I call this file the index.
3) Obviously, as the filesystem changes the index file gets out of sync. To overcome this, we have a separate program that hooks into the OS in order to monitor changes to the filesystem. It writes those changes to a file called the monitor log. Immediately after we read the index file for a particular directory, we use the monitor log to apply the various changes to the index so that it reflects the current state of the directory.
Because reading files is so much cheaper than walking a directory, this should be much faster than walking for all calls after the first.
Original post:
I want a function that will recursively get all the files in any given directory and filter them according to various parameters. And I want it to be fast -- like, an order of magnitude faster than simply walking the dir. And I'd prefer to do it in Python. Cross-platform is preferable, but Windows is most important.
Here's my idea for how to go about this:
I have a function called all_files:
def all_files(dir_path, ...parms...):
    ...
The first time I call this function it will use os.walk to build a list of all the files, along with info about the files such as whether they are hidden, a symbolic link, etc. I'll write this data to a file called ".index" in the directory. On subsequent calls to all_files, the .index file will be detected, and I will read that file rather than walking the dir.
This leaves the problem of the index getting out of sync as files are added and removed. For that I'll have a second program that runs on startup, detects all changes to the entire filesystem, and writes them to a file called "mod_log.txt". It detects changes via Windows signals, like the method described here. This file will contain one event per line, with each event consisting of the path affected, the type of event (create, delete, etc.), and a timestamp. The .index file will have a timestamp as well for the time it was last updated. After I read the .index file in all_files I will tail mod_log.txt and find any events that happened after the timestamp in the .index file. It will take these recent events, find any that apply to the current directory, and update the .index accordingly.
Finally, I'll take the list of all files, filter it according to various parameters, and return the result.
What do you think of my approach? Is there a better way to do this?
Edit:
Check this code out. I'm seeing a drastic speedup from reading a cached list over a recursive walk.
import os
from os.path import join, exists
import cProfile, pstats

dir_name = "temp_dir"
index_path = ".index"

def create_test_files():
    os.mkdir(dir_name)
    index_file = open(index_path, 'w')
    for i in range(10):
        print "creating dir: ", i
        sub_dir = join(dir_name, str(i))
        os.mkdir(sub_dir)
        for i in range(100):
            file_path = join(sub_dir, str(i))
            open(file_path, 'w').close()
            index_file.write(file_path + "\n")
    index_file.close()

#
# 0.238 seconds
def test_walk():
    for info in os.walk("temp_dir"):
        pass

# 0.001 seconds
def test_read():
    open(index_path).readlines()

if not exists("temp_dir"):
    create_test_files()

def profile(s):
    cProfile.run(s, 'profile_results.txt')
    p = pstats.Stats('profile_results.txt')
    p.strip_dirs().sort_stats('cumulative').print_stats(10)

profile("test_walk()")
profile("test_read()")
Do not try to duplicate the work that the filesystem already does. You are not going to do better than it already does.
Your scheme is flawed in many ways and it will not get you an order-of-magnitude improvement.
Flaws and potential problems:
You are always going to be working with a snapshot of the file system. You will never know with any certainty that it is not significantly disjoint from reality. If that is within the working parameters of your application, no sweat.
The filesystem monitor program still has to recursively walk the file system, so the work is still being done.
In order to increase the accuracy of the cache, you have to increase the frequency with which the filesystem monitor runs. The more it runs, the less actual time that you are saving.
Your client application likely won't be able to read the index file while it is being updated by the filesystem monitor program, so you'll lose time while the client waits for the index to be readable.
I could go on.
If, in fact, you don't care about working with a snapshot of the filesystem that may be very disjoint from reality, I think you'd be much better off keeping the index in memory and updating it from within the application itself. That will avoid the file-contention issues that would otherwise arise.
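A rough sketch of that in-memory variant (the class and method names are mine): walk once at startup, then have the application itself record the files it creates and deletes instead of reconciling against a monitor log.

import os

class FileIndex(object):
    def __init__(self, root):
        self.root = root
        self.files = set()
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames:
                self.files.add(os.path.join(dirpath, name))

    # Call these from the same code paths that create or delete files,
    # so the index never has to be rebuilt or synchronized with a log.
    def note_created(self, path):
        self.files.add(path)

    def note_deleted(self, path):
        self.files.discard(path)

    def matching(self, predicate):
        return [p for p in self.files if predicate(p)]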
The best answer came from Michał Marczyk toward the bottom of the comment list on the initial question. He pointed out that what I'm describing is very close to the UNIX locate program. I found a Windows version here: http://locate32.net/index.php. It solved my problem.
Edit: Actually the Everything search engine looks even better. Apparently Windows keeps journals of changes to the filesystem, and Everything uses that to keep the database up to date.
Doesn't Windows Desktop Search provide such an index as a byproduct? On the mac the spotlight index can be queried for filenames like this: mdfind -onlyin . -name '*'.
Of course it's much faster than walking the directory.
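For illustration, a small sketch of querying that index from Python (it assumes a Mac with Spotlight indexing enabled; the function name is mine):

import subprocess

def spotlight_names(directory, pattern='*'):
    # Ask Spotlight for matching names under directory, without walking it.
    out = subprocess.check_output(['mdfind', '-onlyin', directory, '-name', pattern])
    return out.splitlines()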
The short answer is "no". You will not be able to build an indexing system in Python that will outpace the file system by an order of magnitude.
"Indexing" a filesystem is an intensive/slow task, regardless of the caching implementation. The only realistic way to avoid the huge overhead of building filesystem indexes is to "index as you go" to avoid the big traversal. (After all, the filesystem itself is already a data indexer.)
There are operating system features that are capable of doing this "build as you go" filesystem indexing. It's the very foundation of services like Spotlight on OSX and Windows Desktop Search.
To have any hope of getting faster speeds than walking the directories, you'll want to leverage one of those OS or filesystem level tools.
Also, try not to mislead yourself into thinking solutions are faster just because you've "moved" the work to a different time/process. Your example code does exactly that. You traverse the directory structure of your sample files while you're building the same files and create the index, and then later just read that file.
There are two lessons, here. (a) To create a proper test it's essential to separate the "setup" from the "test". Here your performance test essentially says, "Which is faster, traversing a directory structure or reading an index that's already been created in advance?" Clearly this is not an apples-to-apples comparison.
However, (b) you've stumbled on the correct answer at the same time. You can get a list of files much faster if you use an already existing index. This is where you'd need to leverage something like the Windows Desktop Search or Spotlight indexes.
Make no mistake, in order to build an index of a filesystem you must, by definition, "visit" every file. If your files are stored in a tree, then a recursive traversal is likely going to be the fastest way you can visit every file. If the question is "can I write Python code to do exactly what os.walk does but be an order of magnitude faster than os.walk" the answer is a resounding no. If the question is "can I write Python code to index every file on the system without taking the time to actually visit every file" then the answer is still no.
(Edit in response to "I don't think you understand what I'm trying to do")
Let's be clear here, virtually everyone here understands what you're trying to do. It seems that you're taking "no, this isn't going to work like you want it to work" to mean that we don't understand.
Let's look at this from another angle. File systems have been an essential component to modern computing from the very beginning. The categorization, indexing, storage, and retrieval of data is a serious part of computer science and computer engineering and many of the most brilliant minds in computer science are working on it constantly.
You want to be able to filter/select files based on attributes/metadata/data of the files. This is an extremely common task utilized constantly in computing. It's likely happening several times a second even on the computer you're working with right now.
If it were as simple to speed up this process by an order of magnitude(!) by simply keeping a text file index of the filenames and attributes, don't you think every single file system and operating system in existence would do exactly that?
That said, of course caching the results of your specific queries could net you some small performance increases. And, as expected, file system and disk caching is a fundamental part of every modern operating system and file system.
But your question, as you asked it, has a clear answer: No. In the general case, you're not going to get an order of magnitude faster by reimplementing os.walk. You may be able to get a better amortized runtime by caching, but you're not going to beat it by an order of magnitude if you properly include the work to build the cache in your profiling.
I would like to recommend you just use a combination of os.walk (to get directory trees) & os.stat (to get file information) for this. Using the std-lib will ensure it works on all platforms, and they do the job nicely. And no need to index anything.
As others have stated, I don't really think you're going to buy much by attempting to index and re-index the filesystem, especially if you're already limiting your functionality by path and parameters.
I'm new to Python, but according to reports I've read, a combination of list comprehensions, iterators and generators should scream.
import os
import re

class DirectoryIterator:
    def __init__(self, start_dir, pattern):
        self.directory = start_dir
        self.pattern = pattern

    def __iter__(self):
        # yield the matching file paths as os.walk traverses the tree
        [([DirectoryIterator(dir, self.pattern) for dir in dirnames],
          [(yield os.path.join(dirpath, name)) for name in filenames
           if re.search(self.pattern, name)])
         for dirpath, dirnames, filenames in os.walk(self.directory)]

###########

for file_name in DirectoryIterator(".", r"\.py$"):
    print file_name
I'm looking for a high performance method or library for scanning all files on disk or in a given directory and grabbing their basic stats - filename, size, and modification date.
I've written a python program that uses os.walk along with os.path.getsize to get the file list, and it works fine, but is not particularly fast. I noticed one of the freeware programs I had downloaded accomplished the same scan much faster than my program.
Any ideas for speeding up the file scan? Here's my python code, but keep in mind that I'm not at all married to os.walk and am perfectly willing to use other APIs (including Windows native APIs) if there are better alternatives.
for root, dirs, files in os.walk(top, topdown=False):
    for name in files:
        ...
I should also note I realize the python code probably can't be sped up that much; I'm particularly interested in any native APIs that provide better speed.
Well, I would expect this to be a heavily I/O-bound task.
As such, optimizations on the Python side would be quite ineffective; the only optimization I can think of is a different way of accessing/listing files, in order to reduce the actual reads from the file system.
This of course requires deep knowledge of the file system, which I do not have, and which I would not expect Python's developers to have had while implementing os.walk.
What about spawning a command prompt, then issuing 'dir' and parsing the results?
It could be a bit of an overkill, but with any luck, 'dir' is making some effort at such optimizations.
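A hedged sketch of that 'dir' idea (dir is a cmd.exe built-in, so it has to be run through cmd /c; /s recurses into subdirectories and /b keeps the output to bare paths):

import subprocess

def dir_listing(top):
    # Return the full paths printed by 'dir /s /b', one per line.
    out = subprocess.check_output(['cmd', '/c', 'dir', '/s', '/b', top])
    return out.splitlines()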
It seems as if os.walk was considerably improved in Python 2.5, so you might check whether you're running that version.
Other than that, someone has already compared the speed of os.walk to ls and noticed a clear advantage for the latter, but not in a range that would actually justify using it.
You might want to look at the code for some Python version control systems like Mercurial or Bazaar. They have devoted a lot of time to coming up with ways to quickly traverse a directory tree and detect changes (or "finding basic stats about the files").
Use the scandir Python module (formerly betterwalk) on GitHub, by Ben Hoyt.
http://github.com/benhoyt/scandir
It is much faster than os.walk, but uses the same syntax. Just import scandir and change os.walk() to scandir.walk(). That's it. It is the fastest way to traverse directories and files in Python.
When you look at the code for os.walk, you'll see that there's not much fat to be trimmed.
For example, the following is only a hair faster than os.walk.
import os
import stat

listdir = os.listdir
pathjoin = os.path.join
fstat = os.stat
is_dir = stat.S_ISDIR
is_reg = stat.S_ISREG

def yieldFiles(path):
    for f in listdir(path):
        nm = pathjoin(path, f)
        s = fstat(nm).st_mode
        if is_dir(s):
            for sub in yieldFiles(nm):
                yield sub
        elif is_reg(s):
            yield nm  # yield the full path so files in subdirectories stay identifiable
        else:
            pass  # ignore these
Consequently, the overheads must be in the os module itself. You'll have to resort to making direct Windows API calls.
Look at the Python for Windows Extensions.
I'm wondering if you might want to group your I/O operations.
For instance, if you're walking a dense directory tree with thousands of files, you might try experimenting with walking the entire tree and storing all the file locations, and then looping through the (in-memory) locations and getting file statistics.
If your OS stores these two data in different locations (directory structure in one place, file stats in another), then this might be a significant optimization.
Anyway, that's something I'd try before digging further.
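A rough sketch of that two-pass grouping (whether it actually helps depends entirely on how the OS caches directory and file metadata):

import os

def scan(top):
    # Pass 1: collect every file path via os.walk.
    paths = []
    for root, dirs, files in os.walk(top):
        for name in files:
            paths.append(os.path.join(root, name))
    # Pass 2: stat all of the collected paths in one batch.
    stats = {}
    for path in paths:
        st = os.stat(path)
        stats[path] = (st.st_size, st.st_mtime)
    return stats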
Python 3.5 just introduced os.scandir (see PEP-0471) which avoids a number of non-required system calls such as stat() and GetFileAttributes() to provide a significantly quicker file-system iterator.
os.walk() will now be implemented using os.scandir() as its iterator, and so you should see potentially large performance improvements whilst continuing to use os.walk().
Example usage:
for entry in os.scandir(path):
    if not entry.name.startswith('.') and entry.is_file():
        print(entry.name)
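For the original question's needs (filename, size, and modification date), the DirEntry objects also carry cached stat information, so a recursive scan can usually avoid a separate stat() per file on Windows. A minimal sketch (Python 3.5+, function name mine):

import os

def file_stats(top):
    # Yield (path, size, mtime) for every regular file under top.
    for entry in os.scandir(top):
        if entry.is_dir(follow_symlinks=False):
            yield from file_stats(entry.path)
        elif entry.is_file(follow_symlinks=False):
            st = entry.stat()  # on Windows this data usually came with the directory listing
            yield entry.path, st.st_size, st.st_mtime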
The os.path module has a directory tree walking function as well. I've never run any sort of benchmarks on it, but you could give it a try. I'm not sure there's a faster way than os.walk/os.path.walk in Python, however.
This is only partial help, more like pointers; however:
I believe you need to do the following:
fp = open("C:/$MFT", "rb")
using an account that has SYSTEM permissions, because even as an admin you can't open the "Master File Table" (kind of an inode table) of an NTFS filesystem. After you succeed in that, you'll just have to locate information on the web that explains the structure of each file record (I believe it's commonly 1024 bytes per on-disk file, which includes the file's primary pathname), and off you go for super-high speeds of disk structure reading.
I would suggest using folderstats for creating statistics from a folder structure; I have tested it on folder/file structures of up to 400k files and folders.
It is as simple as:
import folderstats
import pandas as pd
df = folderstats.folderstats(path5, ignore_hidden=True)
df.head()
df.shape
The output will be a dataframe; see the example below:
path        ./folder_structure.png              ./requirements-dev.txt
name        folder_structure                    requirements-dev
extension   png                                 txt
size        525239                              33
atime       2022-01-10 16:08:32                 2022-01-10 14:14:50
mtime       2020-11-22 19:38:03                 2022-01-08 17:54:50
ctime       2020-11-22 19:38:03                 2022-01-08 17:54:50
folder      False                               False
num_files
depth       0                                   0
uid         1000                                1000
md5         a3cac43de8dd5fc33d7bede1bb1849de    42c7e7d9bc4620c2c7a12e6bbf8120bb