I'm looking for a high performance method or library for scanning all files on disk or in a given directory and grabbing their basic stats - filename, size, and modification date.
I've written a python program that uses os.walk along with os.path.getsize to get the file list, and it works fine, but is not particularly fast. I noticed one of the freeware programs I had downloaded accomplished the same scan much faster than my program.
Any ideas for speeding up the file scan? Here's my python code, but keep in mind that I'm not at all married to os.walk and perfectly willing to use other APIs (including Windows native APIs) if there are better alternatives.
for root, dirs, files in os.walk(top, topdown=False):
    for name in files:
        ...
I should also note I realize the python code probably can't be sped up that much; I'm particularly interested in any native APIs that provide better speed.
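Roughly, the full version of the loop above looks like this (a minimal sketch of the os.walk plus per-file stat approach described, using os.stat for size and modification date):

import os

def scan(top):
    # Yield (path, size in bytes, modification time) for every file under top.
    for root, dirs, files in os.walk(top, topdown=False):
        for name in files:
            path = os.path.join(root, name)
            st = os.stat(path)
            yield path, st.st_size, st.st_mtime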
Well, I would expect this to be a heavily I/O-bound task.
As such, optimizations on the Python side would be quite ineffective; the only optimization I can think of is some different way of accessing/listing files, in order to reduce the actual reads from the file system.
This of course requires deep knowledge of the file system, which I do not have, and which I would not expect Python's developers to have had while implementing os.walk.
What about spawning a command prompt, issuing 'dir', and parsing the results?
It could be a bit of an overkill, but with any luck, 'dir' is making some effort at such optimizations.
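A rough sketch of that idea (Windows only; the /s /b /a-d flags and the cp437 decoding are assumptions about the local setup, and the bare format returns paths only, so dropping /b keeps sizes and dates at the cost of messier parsing):

import subprocess

def dir_listing(top):
    # Spawn cmd.exe, run 'dir /s /b /a-d' and return the file paths it prints.
    out = subprocess.check_output(["cmd", "/c", "dir", "/s", "/b", "/a-d", top])
    # Console code page varies by locale; cp437/cp850 are common defaults.
    return out.decode("cp437", errors="replace").splitlines()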
It seems as if os.walk was considerably improved in Python 2.5, so you might check whether you're running that version.
Other than that, someone has already compared the speed of os.walk to ls and noticed a clear advantage for the latter, but not by a margin that would actually justify using it.
You might want to look at the code for some Python version control systems like Mercurial or Bazaar. They have devoted a lot of time to coming up with ways to quickly traverse a directory tree and detect changes (or "finding basic stats about the files").
Use the scandir Python module (formerly betterwalk) on GitHub by Ben Hoyt.
http://github.com/benhoyt/scandir
It is much faster than os.walk, but uses the same syntax. Just import scandir and change os.walk() to scandir.walk(). That's it. It is one of the fastest ways to traverse directories and files in Python.
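A minimal sketch of that drop-in swap (the path is just a placeholder):

import os
import scandir  # pip install scandir

for root, dirs, files in scandir.walk("/some/path"):
    for name in files:
        print(os.path.join(root, name))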
When you look at the code for os.walk, you'll see that there's not much fat to be trimmed.
For example, the following is only a hair faster than os.walk.
import os
import stat

listdir = os.listdir
pathjoin = os.path.join
fstat = os.stat
is_dir = stat.S_ISDIR
is_reg = stat.S_ISREG

def yieldFiles(path):
    for f in listdir(path):
        nm = pathjoin(path, f)
        s = fstat(nm).st_mode
        if is_dir(s):
            for sub in yieldFiles(nm):
                yield sub
        elif is_reg(s):
            yield nm  # yield the full path, not just the name, so callers can use it
        else:
            pass  # ignore sockets, FIFOs, device files, etc.
Consequently, the overheads must be in the os module itself. You'll have to resort to making direct Windows API calls.
Look at the Python for Windows Extensions.
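For instance, here is a hedged sketch using win32file.FindFilesW from pywin32; the tuple layout follows the WIN32_FIND_DATA structure, but treat the field indices and the size arithmetic as assumptions to verify against the pywin32 docs:

import os
import win32file  # from the Python for Windows Extensions (pywin32)

def list_with_stats(path):
    # One FindFirstFile/FindNextFile pass per directory; no per-file stat().
    for info in win32file.FindFilesW(os.path.join(path, "*")):
        # WIN32_FIND_DATA order: attributes, create/access/write times,
        # size high/low, reserved fields, filename, alternate filename.
        name = info[8]
        if name in (".", ".."):
            continue
        size = (info[4] << 32) + info[5]
        yield name, size, info[3]  # info[3] is the last-write time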
I'm wondering if you might want to group your I/O operations.
For instance, if you're walking a dense directory tree with thousands of files, you might try experimenting with walking the entire tree and storing all the file locations, and then looping through the (in-memory) locations and getting file statistics.
If your OS stores these two kinds of data in different locations (directory structure in one place, file stats in another), this might be a significant optimization.
Anyway, that's something I'd try before digging further.
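A small sketch of that two-pass idea (collect every path first, then stat them in a separate batch):

import os

def two_pass_scan(top):
    # Pass 1: collect every file path without touching the files themselves.
    paths = [os.path.join(root, name)
             for root, dirs, files in os.walk(top)
             for name in files]
    # Pass 2: stat the collected paths in one batch.
    results = []
    for path in paths:
        st = os.stat(path)
        results.append((path, st.st_size, st.st_mtime))
    return results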
Python 3.5 just introduced os.scandir (see PEP 471), which avoids a number of unnecessary system calls such as stat() and GetFileAttributes() to provide a significantly quicker file-system iterator.
os.walk() will now be implemented using os.scandir() as its iterator, and so you should see potentially large performance improvements whilst continuing to use os.walk().
Example usage:
for entry in os.scandir(path):
    if not entry.name.startswith('.') and entry.is_file():
        print(entry.name)
The os.path module has a directory tree walking function as well. I've never run any sort of benchmarks on it, but you could give it a try. I'm not sure there's a faster way than os.walk/os.path.walk in Python, however.
This is only partial help, more like pointers; however:
I believe you need to do the following:
fp = open("C:/$MFT", "rb")
using an account that has SYSTEM permissions, because even as an admin you can't open the "Master File Table" (kind of an inode table) of an NTFS filesystem. After you succeed in that, you'll just have to locate information on the web that explains the structure of each file record (I believe it's commonly 1024 bytes per on-disk file, which includes the file's primary pathname), and off you go for super-high speeds of disk structure reading.
I would suggest using folderstats for creating statistics from a folder structure; I have tested it on folder/file structures of up to 400k files and folders.
It is as simple as:
import folderstats
import pandas as pd
df = folderstats.folderstats(path5, ignore_hidden=True)
df.head()
df.shape
The output will be a dataframe; see the example below:
path | name | extension | size | atime | mtime | ctime | folder | num_files | depth | uid | md5
./folder_structure.png | folder_structure | png | 525239 | 2022-01-10 16:08:32 | 2020-11-22 19:38:03 | 2020-11-22 19:38:03 | False | 0 | | 1000 | a3cac43de8dd5fc33d7bede1bb1849de
./requirements-dev.txt | requirements-dev | txt | 33 | 2022-01-10 14:14:50 | 2022-01-08 17:54:50 | 2022-01-08 17:54:50 | False | 0 | | 1000 | 42c7e7d9bc4620c2c7a12e6bbf8120bb
I'm searching for a way to optimize my code so it runs in under 7 seconds. Currently it runs in 20 seconds. Any clues?
import os
import platform

if platform.system() == "Windows":
    app = string + "." + extension  # string and extension are defined elsewhere
    for root, dirs, files in os.walk("C:\\"):
        if app in files:
            path = os.path.join(root, app)
            os.startfile(path)
I found an interesting thread on the internet saying that you could speed it up by 5x-9x:
Taken from here
I've noticed that os.walk() is a lot slower than it needs to be because it
does an os.stat() call for every file/directory. It does this because it uses
listdir(), which doesn't return any file attributes.
So instead of using the file information provided by FindFirstFile() /
FindNextFile() and readdir() -- which listdir() calls -- os.walk() does a
stat() on every file to see whether it's a directory or not. Both
FindFirst/FindNext and readdir() give this information already. Using it would
basically bring the number of system calls down from O(N) to O(log N).
I've written a proof-of-concept (see [1] below) using ctypes and
FindFirst/FindNext on Windows, showing that for sizeable directory trees it
gives a 4x to 6x speedup -- so this is not a micro-optimization!
I started trying the same thing with opendir/readdir on Linux, but don't have
as much experience there, and wanted to get some feedback on the concept
first. I assume it'd be a similar speedup by using d_type & DT_DIR from
readdir().
The problem is even worse when you're calling os.walk() and then doing your
own stat() on each file, for example, to get the total size of all files in a
tree -- see [2]. It means it's calling stat() twice on every file, and I see
about a 9x speedup in this scenario using the info FindFirst/Next provide.
So there are a couple of things here:
1) The simplest thing to do would be to keep the APIs exactly the same, and
get the ~5x speedup on os.walk() -- on Windows, unsure of the exact speedup on
Linux. And on OS's where readdir() doesn't return "is directory" information,
obviously it'd fall back to using the stat on each file.
2) We could significantly improve the API by adding a listdir_stat() or
similar function, which would return a list of (filename, stat_result) tuples
instead of just the names. That might be useful in its own right, but of
course os.walk() could use it to speed itself up. Then of course it might be
good to have walk_stat() which you could use to speed up the "summing sizes"
cases.
Other related improvements to the listdir/walk APIs that could be considered
are:
* Using the wildcard/glob that FindFirst/FindNext take to do filtering -- this
would avoid fnmatch-ing and might speed up large operations, though I don't
have numbers. Obviously we'd have to simulate this with fnmatch on non-Windows
OSs, but this kind of filtering is something I've done with os.walk() many
times, so just having the API option would be useful ("glob" keyword arg?).
* Changing listdir() to yield instead of return a list (or adding yieldir?).
This fits both the FindNext/readdir APIs, and would address issues like [3].
Anyway, cutting a long story short -- do folks think 1) is a good idea? What
about some of the thoughts in 2)? In either case, what would be the best way
to go further on this?
Thanks,
Ben.
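With today's os.scandir() (which landed later, in Python 3.5), the listdir_stat() idea from point 2) above can be sketched roughly like this; the helper name is hypothetical:

import os

def listdir_stat(path='.'):
    # Hypothetical helper: yield (name, stat_result) pairs without an
    # extra stat() call per entry; scandir reuses the data the OS returned.
    for entry in os.scandir(path):
        yield entry.name, entry.stat(follow_symlinks=False)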
I connect to an SFTP server and show files based on their modification timestamp.
Currently, it is done using something like:
1. files = os.listdir(SFTP)
2. Loop over files and get the timestamp of each using os.stat.
3. Sort the final list in Python.
This looping in Step 2 is very costly when the SFTP is on a different server because it has to make a network call from the server to the SFTP for each and every file.
Is there a way to get both the file and modified time using os.listdir or a similar API?
I am using a Windows back-end, and the SFTP connection is usually made using win32wnet.WNetAddConnection2. A generic solution would be helpful; if not, a specific solution should be fine too.
If you're using Windows, you have a lot to gain by using os.scandir() (Python 3.5+) or the backported scandir module: scandir.scandir()
That's because on Windows (as opposed to Linux/Unix), os.listdir() already performs a file stat behind the scenes, but the result is discarded except for the name, which forces you to perform another stat call.
scandir returns a list of directory entries, not names. On Windows, the size/object type fields are already filled in, so when you perform a stat on the entry (as shown in the example below), it comes at zero cost:
(taken from https://www.python.org/dev/peps/pep-0471/)
import os

def get_tree_size(path):
    """Return total size of files in given path and subdirs."""
    total = 0
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            total += get_tree_size(entry.path)
        else:
            total += entry.stat(follow_symlinks=False).st_size
    return total
So just replace your first os.listdir() call with os.scandir() and you'll have all the information for the same cost as a simple os.listdir().
(This is most interesting on Windows, and a lot less so on Linux. I've used it on a slow filesystem on Windows and got an 8x performance gain compared to good old os.listdir followed by os.path.isdir in my case.)
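Applied to the question above, here is a minimal sketch that gathers names and modification times in a single directory scan and sorts them (the path argument is a placeholder):

import os

def files_by_mtime(path):
    # entry.stat() reuses the data fetched during the directory scan on
    # Windows, so this touches each file exactly once.
    pairs = [(entry.name, entry.stat().st_mtime)
             for entry in os.scandir(path) if entry.is_file()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)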
You should use special libraries for this, such as an SFTP client library (e.g. paramiko) or ftplib; they provide specific utilities that will be helpful for you.
Also, you can try running the relevant command on the remote server.
If you're able to send one-line commands to the server, you could do [os.stat(i) for i in os.listdir()].
If that doesn't work for you, I suppose you could just do os.system("ls -l").
If neither of those work, please do tell me!
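If the SFTP connection can instead go through paramiko (an assumption; the question mentions win32wnet, and the hostnames and credentials below are placeholders), listdir_attr() returns names and stats in one directory listing, so there is no per-file round trip:

import paramiko

# Hypothetical connection details; adapt to your environment.
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("sftp.example.com", username="user", password="secret")
sftp = client.open_sftp()

# listdir_attr() returns SFTPAttributes (name, size, mtime, ...) for the
# whole directory in one listing, instead of one stat round trip per file.
entries = sftp.listdir_attr("/remote/path")
for attr in sorted(entries, key=lambda a: a.st_mtime, reverse=True):
    print(attr.filename, attr.st_size, attr.st_mtime)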
I am using os.walk to build a map of a data-store (this map is used later in the tool I am building)
This is the code I currently use:
def find_children(tickstore):
    children = []
    dir_list = os.walk(tickstore)
    for i in dir_list:
        children.append(i[0])
    return children
I have done some analysis on it:
dir_list = os.walk(tickstore) runs instantly, because os.walk() only returns a generator at that point; if I do nothing with dir_list, this function completes instantly.
It is iterating over dir_list that takes a long time; even if I don't append anything, just iterating over it is what takes the time.
Tickstore is a big datastore, with ~10,000 directories.
Currently it takes approximately 35 minutes to complete this function.
Is there any way to speed it up?
I've looked at alternatives to os.walk but none of them seemed to provide much of an advantage in terms of speed.
Yes: use Python 3.5 (which is currently still an RC, but should be out momentarily). In Python 3.5, os.walk was rewritten to be more efficient.
This work was done as part of PEP 471.
Extracted from the PEP:
Python's built-in os.walk() is significantly slower than it needs to
be, because -- in addition to calling os.listdir() on each directory
-- it executes the stat() system call or GetFileAttributes() on each file to determine whether the entry is a directory or not.
But the underlying system calls -- FindFirstFile / FindNextFile on
Windows and readdir on POSIX systems -- already tell you whether the
files returned are directories or not, so no further system calls are
needed. Further, the Windows system calls return all the information
for a stat_result object on the directory entry, such as file size and
last modification time.
In short, you can reduce the number of system calls required for a
tree function like os.walk() from approximately 2N to N, where N is
the total number of files and directories in the tree. (And because
directory trees are usually wider than they are deep, it's often much
better than this.)
In practice, removing all those extra system calls makes os.walk()
about 8-9 times as fast on Windows, and about 2-3 times as fast on
POSIX systems. So we're not talking about micro-optimizations. See
more benchmarks here.
One way to optimize it in Python 2.7 is to use scandir.walk() instead of os.walk(); the parameters are exactly the same.
import scandir

directory = "/tmp"
res = scandir.walk(directory)
for item in res:
    print item
PS: As @recoup mentioned in the comments, scandir needs to be installed (pip install scandir) before use in Python 2.7.
os.walk is currently quite slow because it first lists the directory and then does a stat on each entry to see if it is a directory or a file.
An improvement is proposed in PEP 471 and should be coming soon in Python 3.5. In the meantime you could use the scandir package to get the same benefits in Python 2.7
I am creating a sort of "command line" in Python. I have already added a few functions, such as changing login/password, executing, etc. But is it possible to browse the files in the directory that the main file is in with a command/module, or will I have to write the module myself and use the import command? The same goes for changing which directory to view.
Browsing files is as easy as using the standard os module. If you want to do something with those files, that's entirely different.
import os
all_files = os.listdir('.') # gets all files in current directory
To change directories you can issue os.chdir('path/to/change/to'). In fact there are plenty of useful functions found in the os module that facilitate the things you're asking about. Making them pretty and user-friendly, however, is up to you!
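For instance, here is a tiny sketch of a "list files with their stats" command built on those os functions:

import os
import time

def ls(path="."):
    # Print name, size and modification date for every entry in path.
    for name in sorted(os.listdir(path)):
        st = os.stat(os.path.join(path, name))
        print(name, st.st_size, time.ctime(st.st_mtime))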
I'd like to see someone write a semantic file-browser, i.e. one that auto-generates tags for files according to their content and then allows views and searching accordingly.
Think about it... take an MP3, look up the lyrics, run it through Zemanta, bam! A PDF file, an OpenOffice file, etc. That'd be pretty kick-butt! Probably fairly intensive too, but it'd be pretty dang cool!
Cheers,
-C
Attempt #2:
People don't seem to be understanding what I'm trying to do. Let me see if I can state it more clearly:
1) Reading a list of files is much faster than walking a directory.
2) So let's have a function that walks a directory and writes the resulting list to a file. Now, in the future, if we want to get all the files in that directory we can just read this file instead of walking the dir. I call this file the index.
3) Obviously, as the filesystem changes the index file gets out of sync. To overcome this, we have a separate program that hooks into the OS in order to monitor changes to the filesystem. It writes those changes to a file called the monitor log. Immediately after we read the index file for a particular directory, we use the monitor log to apply the various changes to the index so that it reflects the current state of the directory.
Because reading files is so much cheaper than walking a directory, this should be much faster than walking for all calls after the first.
Original post:
I want a function that will recursively get all the files in any given directory and filter them according to various parameters. And I want it to be fast -- like, an order of magnitude faster than simply walking the dir. And I'd prefer to do it in Python. Cross-platform is preferable, but Windows is most important.
Here's my idea for how to go about this:
I have a function called all_files:
def all_files(dir_path, ...parms...):
    ...
The first time I call this function it will use os.walk to build a list of all the files, along with info about the files such as whether they are hidden, a symbolic link, etc. I'll write this data to a file called ".index" in the directory. On subsequent calls to all_files, the .index file will be detected, and I will read that file rather than walking the dir.
This leaves the problem of the index getting out of sync as files are added and removed. For that I'll have a second program that runs on startup, detects all changes to the entire filesystem, and writes them to a file called "mod_log.txt". It detects changes via Windows signals, like the method described here. This file will contain one event per line, with each event consisting of the path affected, the type of event (create, delete, etc.), and a timestamp. The .index file will have a timestamp as well for the time it was last updated. After I read the .index file in all_files I will tail mod_log.txt and find any events that happened after the timestamp in the .index file. It will take these recent events, find any that apply to the current directory, and update the .index accordingly.
Finally, I'll take the list of all files, filter it according to various parameters, and return the result.
What do you think of my approach? Is there a better way to do this?
Edit:
Check this code out. I'm seeing a drastic speedup from reading a cached list over a recursive walk.
import os
from os.path import join, exists
import cProfile, pstats

dir_name = "temp_dir"
index_path = ".index"

def create_test_files():
    os.mkdir(dir_name)
    index_file = open(index_path, 'w')
    for i in range(10):
        print "creating dir: ", i
        sub_dir = join(dir_name, str(i))
        os.mkdir(sub_dir)
        for i in range(100):
            file_path = join(sub_dir, str(i))
            open(file_path, 'w').close()
            index_file.write(file_path + "\n")
    index_file.close()

#
# 0.238 seconds
def test_walk():
    for info in os.walk("temp_dir"):
        pass

# 0.001 seconds
def test_read():
    open(index_path).readlines()

if not exists("temp_dir"):
    create_test_files()

def profile(s):
    cProfile.run(s, 'profile_results.txt')
    p = pstats.Stats('profile_results.txt')
    p.strip_dirs().sort_stats('cumulative').print_stats(10)

profile("test_walk()")
profile("test_read()")
Do not try to duplicate the work that the filesystem already does. You are not going to do better than it already does.
Your scheme is flawed in many ways and it will not get you an order-of-magnitude improvement.
Flaws and potential problems:
You are always going to be working with a snapshot of the file system. You will never know with any certainty that it is not significantly disjoint from reality. If that is within the working parameters of your application, no sweat.
The filesystem monitor program still has to recursively walk the file system, so the work is still being done.
In order to increase the accuracy of the cache, you have to increase the frequency with which the filesystem monitor runs. The more it runs, the less actual time that you are saving.
Your client application likely won't be able to read the index file while it is being updated by the filesystem monitor program, so you'll lose time while the client waits for the index to be readable.
I could go on.
If, in fact, you don't care about working with a snapshot of the filesystem that may be very disjoint from reality, I think you'd be much better off keeping the index in memory and updating it from within the application itself. That will avoid any file contention issues that would otherwise arise.
The best answer came from Michał Marczyk toward the bottom of the comment list on the initial question. He pointed out that what I'm describing is very close to the UNIX locate program. I found a Windows version here: http://locate32.net/index.php. It solved my problem.
Edit: Actually the Everything search engine looks even better. Apparently Windows keeps journals of changes to the filesystem, and Everything uses that to keep the database up to date.
Doesn't Windows Desktop Search provide such an index as a byproduct? On the Mac, the Spotlight index can be queried for filenames like this: mdfind -onlyin . -name '*'.
Of course it's much faster than walking the directory.
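If that route fits, here is a quick sketch of shelling out to mdfind from Python (macOS only; it simply mirrors the command above):

import subprocess

# Ask the Spotlight index for file names instead of walking the tree (macOS only).
output = subprocess.check_output(["mdfind", "-onlyin", ".", "-name", "*"])
file_paths = output.decode("utf-8").splitlines()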
The short answer is "no". You will not be able to build an indexing system in Python that will outpace the file system by an order of magnitude.
"Indexing" a filesystem is an intensive/slow task, regardless of the caching implementation. The only realistic way to avoid the huge overhead of building filesystem indexes is to "index as you go" to avoid the big traversal. (After all, the filesystem itself is already a data indexer.)
There are operating system features that are capable of doing this "build as you go" filesystem indexing. It's the very foundation of services like Spotlight on OSX and Windows Desktop Search.
To have any hope of getting faster speeds than walking the directories, you'll want to leverage one of those OS or filesystem level tools.
Also, try not to mislead yourself into thinking solutions are faster just because you've "moved" the work to a different time/process. Your example code does exactly that: you traverse the directory structure of your sample files while you're creating them and build the index at the same time, and then later just read that file.
There are two lessons here. (a) To create a proper test it's essential to separate the "setup" from the "test". Here your performance test essentially says, "Which is faster, traversing a directory structure or reading an index that's already been created in advance?" Clearly this is not an apples-to-apples comparison.
However, (b) you've stumbled on the correct answer at the same time. You can get a list of files much faster if you use an already existing index. This is where you'd need to leverage something like the Windows Desktop Search or Spotlight indexes.
Make no mistake, in order to build an index of a filesystem you must, by definition, "visit" every file. If your files are stored in a tree, then a recursive traversal is likely going to be the fastest way you can visit every file. If the question is "can I write Python code to do exactly what os.walk does but be an order of magnitude faster than os.walk" the answer is a resounding no. If the question is "can I write Python code to index every file on the system without taking the time to actually visit every file" then the answer is still no.
(Edit in response to "I don't think you understand what I'm trying to do")
Let's be clear here, virtually everyone here understands what you're trying to do. It seems that you're taking "no, this isn't going to work like you want it to work" to mean that we don't understand.
Let's look at this from another angle. File systems have been an essential component to modern computing from the very beginning. The categorization, indexing, storage, and retrieval of data is a serious part of computer science and computer engineering and many of the most brilliant minds in computer science are working on it constantly.
You want to be able to filter/select files based on attributes/metadata/data of the files. This is an extremely common task utilized constantly in computing. It's likely happening several times a second even on the computer you're working with right now.
If it were as simple to speed up this process by an order of magnitude(!) by simply keeping a text file index of the filenames and attributes, don't you think every single file system and operating system in existence would do exactly that?
That said, of course caching the results of your specific queries could net you some small performance increases. And, as expected, file system and disk caching is a fundamental part of every modern operating system and file system.
But your question, as you asked it, has a clear answer: no. In the general case, you're not going to get an order of magnitude faster by reimplementing os.walk. You may be able to get a better amortized runtime by caching, but you're not going to beat it by an order of magnitude if you properly include the work to build the cache in your profiling.
I would recommend just using a combination of os.walk (to get directory trees) and os.stat (to get file information) for this. Using the std-lib ensures it works on all platforms, and they do the job nicely. And there's no need to index anything.
As others have stated, I don't really think you're going to buy much by attempting to index and re-index the filesystem, especially if you're already limiting your functionality by path and parameters.
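A short sketch of that std-lib combination; the min_size filter is just an arbitrary example of a parameter, and the function name mirrors the all_files() from the question:

import os

def all_files(dir_path, min_size=0):
    # Walk dir_path and return (path, stat_result) for files of at least min_size bytes.
    results = []
    for root, dirs, files in os.walk(dir_path):
        for name in files:
            path = os.path.join(root, name)
            st = os.stat(path)
            if st.st_size >= min_size:
                results.append((path, st))
    return results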
I'm new to Python, but a combination of list comprehensions, iterators and generators should scream, according to reports I've read.
import os
import re

class DirectoryIterator:
    def __init__(self, start_dir, pattern):
        self.directory = start_dir
        self.pattern = pattern

    def __iter__(self):
        # Walk the tree and yield the full path of every file whose name matches the pattern.
        for dirpath, dirnames, filenames in os.walk(self.directory):
            for name in filenames:
                if re.search(self.pattern, name):
                    yield os.path.join(dirpath, name)

###########
for file_name in DirectoryIterator(".", r"\.py$"):
    print file_name