Why is os.scandir() as slow as os.listdir()? - python

I tried to optimize a file browsing function written in Python, on Windows, by using os.scandir() instead of os.listdir(). However, the run time remains unchanged at about two and a half minutes, and I can't tell why.
Below are the functions, original and altered:
os.listdir() version:
def browse(self, path, tree):
    # for each entry in the path
    for entry in os.listdir(path):
        entity_path = os.path.join(path, entry)
        # check whether the entry is ignored by git or not
        if self.git_ignore(entity_path) is False:
            # if it is a dir, create a new level in the tree
            if os.path.isdir(entity_path):
                tree[entry] = Folder(entry)
                self.browse(entity_path, tree[entry])
            # if it is a file, add it to the tree
            if os.path.isfile(entity_path):
                tree[entry] = File(entity_path)
os.scandir() version:
def browse(self, path, tree):
    # for each entry in the path
    for dirEntry in os.scandir(path):
        entry_path = dirEntry.name
        entity_path = dirEntry.path
        # check whether the entry is ignored by git or not
        if self.git_ignore(entity_path) is False:
            # if it is a dir, create a new level in the tree
            if dirEntry.is_dir(follow_symlinks=True):
                tree[entry_path] = Folder(entity_path)
                self.browse(entity_path, tree[entry_path])
            # if it is a file, add it to the tree
            if dirEntry.is_file(follow_symlinks=True):
                tree[entry_path] = File(entity_path)
In addition, here are the auxiliary functions used within this one:
def git_ignore(self, filepath):
    if '.git' in filepath:
        return True
    if '.ci' in filepath:
        return True
    if '.delivery' in filepath:
        return True
    child = subprocess.Popen(['git', 'check-ignore', str(filepath)],
                             stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE)
    output = child.communicate()[0]
    status = child.wait()
    return status == 0
============================================================
class Folder(dict):
    def __init__(self, path):
        self.path = path
        self.categories = {}
============================================================
class File(object):
    def __init__(self, path):
        self.path = path
        self.filename, self.extension = os.path.splitext(self.path)
Does anyone have a suggestion for making the function run faster? My assumption is that extracting the name and path at the beginning of the loop makes it run slower than it should; is that correct?

Regarding your question:
os.walk seems to call stat() more times than necessary, and that seems to be the reason it's slower than os.scandir(). In this case, I think the best way to boost your speed would be to use parallel processing, which can improve performance dramatically in some loops. There are multiple posts about this issue. Here is one:
Parallel Processing in Python – A Practical Guide with Examples.
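For example, in your code a lot of the time is likely spent in the git check-ignore subprocess, so one option is to run those checks concurrently. Here is a minimal sketch, assuming the git_ignore, Folder, and File helpers from your question; it's an illustration, not a drop-in replacement:

import concurrent.futures
import os

def browse(self, path, tree):
    entries = list(os.scandir(path))
    # the git check-ignore subprocesses mostly wait on I/O,
    # so a thread pool is enough to overlap them
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        ignored = list(pool.map(self.git_ignore, [e.path for e in entries]))
    for dirEntry, skip in zip(entries, ignored):
        if skip:
            continue
        if dirEntry.is_dir(follow_symlinks=True):
            tree[dirEntry.name] = Folder(dirEntry.path)
            self.browse(dirEntry.path, tree[dirEntry.name])
        elif dirEntry.is_file(follow_symlinks=True):
            tree[dirEntry.name] = File(dirEntry.path)

Note that each recursive call opens its own pool here; feeding the whole tree through one pool would parallelize better, but this keeps the sketch close to the original structure.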
Nevertheless, I would like to share some thoughts about it.
I have also been wondering about the best usage of these three options (scandir, listdir, walk). There is not much documentation comparing their performance, so probably the best way is to test it yourself, as you did. Here are my conclusions:
Usage of os.listdir():
It doesn't seem to have advantages over os.scandir(), except that it is easier to understand. I still use it when I only need to list the files in a directory.
PROS:
Fast & Simple
CONS:
Too simple: it only lists the names of files and dirs in a directory, so you might need to combine it with other methods to get extra metadata about the files. If so, you're better off using os.scandir(). A quick sketch of the cost follows this list.
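For example (a hypothetical snippet): every extra attribute you want on top of os.listdir() costs another system call per entry:

import os

path = '.'
for name in os.listdir(path):
    full = os.path.join(path, name)
    # each of these checks triggers its own stat() call on the entry
    print(name, os.path.isdir(full), os.path.getsize(full))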
Usage of os.walk():
This is the most used function when we need to fetch all the items in a directory (and its subdirectories).
PROS:
It's probably the easiest way to walk over all the item paths and names (a typical loop is sketched after this list).
CONS:
It seems to call stat() more times than necessary, and that seems to be the reason it's slower than os.scandir().
Although it gives you the root part of each file path, it doesn't provide the extra metadata that os.scandir() does.
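For reference, the typical os.walk() loop mentioned above looks like this:

import os

# os.walk() yields a (root, dirs, files) triple for every directory
for root, dirs, files in os.walk('.'):
    for name in files:
        print(os.path.join(root, name))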
Usage of os.scandir():
It seems to have (almost) the best of both worlds: the speed of the simple os.listdir() plus extra features that let you simplify your loops, since you can avoid exiftool or other metadata tools when you need extra information about the files.
PROS:
Fast: the same speed as os.listdir().
Very nice extra features.
CONS:
If you want to descend into subdirectories, you need to write another function to scan over each subdir (sketched below). That function is pretty simple, but maybe it would be more Pythonic (I just mean more elegant syntax) to use os.walk in this case.
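That recursive helper could be a small generator; a minimal sketch:

import os

def scantree(path):
    # recursively yield a DirEntry for every file below path
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            yield from scantree(entry.path)
        else:
            yield entry

for entry in scantree('.'):
    print(entry.path)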
So that's my view after reading a bit and using them. I'm happy to be corrected, so I can learn more about it.

Related

Get file location and names from Windows Camera

I was running into trouble using QCamera for focusing and other things, so I thought I could use the camera software shipped with Windows 10. Based on the thread about opening the Windows Camera, I did some trials to acquire the taken images and use them in my program. In the documentation and its API I didn't find usable snippets (for me), so I created the hack mentioned below. It assumes that the images are in the target folder 'C:\\Users\\*username*\\Pictures\\Camera Roll', which is mentioned in the registry (see below), but I don't know whether this is reliable or how to get the proper key name.
I don't think this is the only or the cleanest solution. So my question is: how do I get the taken images and open/close the Camera properly?
Actually, the function waits until 'WindowsCamera.exe' has left the process list, then returns the newly added images / videos in the target folder.
In the registry I found:
Entry: Computer\HKEY_CURRENT_USER\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\User Shell Folders, with the key name {3B193882-D3AD-4eab-965A-69829D1FB59F} for the target folder. I don't think that this key is usable.
Working example of my hack:
import subprocess
import pathlib
import psutil

def check_for_files(path, pattern):
    print(" check_for_files:", (path, pattern))
    files = []
    for filename in pathlib.Path(path).rglob(pattern):
        files.append(filename)
    return files

def get_Windows_Picture(picpath):
    prefiles = check_for_files(picpath, '*.jpg')
    x = subprocess.call('start microsoft.windows.camera:', shell=True)
    processlist = [proc.info['name'] for proc in psutil.process_iter(['name'])]
    while 'WindowsCamera.exe' in processlist:
        processlist = [proc.info['name'] for proc in psutil.process_iter(['name'])]
    postfiles = check_for_files(picpath, '*.jpg')
    newfiles = []
    for file in postfiles:
        if file not in prefiles:
            newfiles.append(str(file))
    return newfiles

if __name__ == "__main__":
    picpath = str(pathlib.Path("C:/Users/*user*/Pictures/Camera Roll"))
    images = get_Windows_Picture(picpath)
    print("Images:", images)
The Camera Roll is a "known Windows folder" which means some APIs can retrieve the exact path (even if it's non-default) for you:
SHGetKnownFolderPath
SHGetKnownFolderIDList
SHSetKnownFolderPath
The knownfolderid documentation will give you the constant name of the required folder (in your case FOLDERID_CameraRoll). As you can see on the linked page, the default is %USERPROFILE%\Pictures\Camera Roll (it's only the default, so it's not necessarily the same for everyone).
The problem in Python is that you'll need to use ctypes, which can be cumbersome at times (especially in your case, where you'll have to deal with GUIDs and with releasing the memory returned by the API).
This gist gives a good example of how to call SHGetKnownFolderPath from Python with ctypes. In your case you only need the CameraRoll member of the FOLDERID class, so you can greatly simplify the code.
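To give an idea, here is a stripped-down sketch of that approach. It assumes the GUID below really is FOLDERID_CameraRoll ({AB5FB87B-7CE2-4F83-915D-550846C9537B}, as listed in the knownfolderid documentation); double-check it against the Microsoft docs before relying on it:

import ctypes
from ctypes import wintypes
import uuid

class GUID(ctypes.Structure):
    # matches the Win32 GUID layout
    _fields_ = [("Data1", wintypes.DWORD),
                ("Data2", wintypes.WORD),
                ("Data3", wintypes.WORD),
                ("Data4", ctypes.c_ubyte * 8)]

    def __init__(self, guid_string):
        super().__init__()
        u = uuid.UUID(guid_string)
        self.Data1, self.Data2, self.Data3 = u.fields[0], u.fields[1], u.fields[2]
        for i, b in enumerate(u.bytes[8:]):
            self.Data4[i] = b

# assumed constant: FOLDERID_CameraRoll from the knownfolderid docs
FOLDERID_CameraRoll = GUID("{AB5FB87B-7CE2-4F83-915D-550846C9537B}")

def get_camera_roll_path():
    path_ptr = ctypes.c_wchar_p()
    hr = ctypes.windll.shell32.SHGetKnownFolderPath(
        ctypes.byref(FOLDERID_CameraRoll), 0, None, ctypes.byref(path_ptr))
    if hr != 0:
        raise ctypes.WinError(hr)
    try:
        return path_ptr.value
    finally:
        # SHGetKnownFolderPath allocates the string; the caller must free it
        ctypes.windll.ole32.CoTaskMemFree(path_ptr)

print(get_camera_roll_path())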
Side note: Don't poll for the process end, just use the wait() function on the Popen object.

How can I create a folder system inside of my python program?

Is there any way to actually create a standalone file system in a python program? I know that you can use os.mkdir() and os.chdir() but these write directly to your actual system instead of being stored in the program. I've tried several ways to do this, including:
if command == ("md"):
newDir = input("")
with open('directories.txt', 'a') as f:
f.write(newDir)
Obviously, this doesn't work, but I was wondering if anyone has some ideas on how one might do this. (This is for a basic but hopefully semi-functional MS-DOS-style OS I'm working on.)
Not sure why writing to the file is a problem, but you could basically have a Tree type data structure to represent a file system. For example:
class File:
    def __init__(self, filename, directory=False):
        self.filename = filename
        self.directory = directory
        self.files = [] if directory else None
where self.files is a list of File objects that you can iterate through as a tree. This is obviously rudimentary and I'll leave a more detailed implementation up to you.
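To make this concrete, here is a small usage sketch built on the File class above; the mkdir and ls helpers are hypothetical names added for illustration:

root = File("/", directory=True)

def mkdir(parent, name):
    # attach a new directory node under parent and return it
    child = File(name, directory=True)
    parent.files.append(child)
    return child

def ls(node):
    # list the names of the entries in a directory node
    return [f.filename for f in node.files]

home = mkdir(root, "home")
mkdir(home, "docs")
print(ls(root))  # ['home']
print(ls(home))  # ['docs']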

MPTT Algorithm Modification

I've got this algorithm to generate MPTT from my folder structure:
https://gist.github.com/unbracketed/946520
I found it on GitHub and it works perfectly for my needs. Currently I have a requirement to add functionality for skipping some folders in the tree. For example, I want to skip everything in/under /tmp/A/B1/C2, so my tree will not contain anything from C2 (including C2 itself).
I'm not completely useless in Python, so I've written this check (and passed an extra list into the function):
def is_subdir(path, directory):
    path = os.path.realpath(path)
    directory = os.path.realpath(directory)
    relative = os.path.relpath(path, directory)
    return not relative.startswith(os.pardir + os.sep)
Now we can add, somewhere:
for single in ignorelist:
    if fsprocess.is_subdir(node, single):
        ...
But my question is: where should this check go in the function? I've tried putting it at the top and returning from the if, but that exits my whole application. The function invokes itself recursively, so I'm pretty lost.
Any good advice? I've tried to contact the script's creator on GitHub, but couldn't reach the owner. Really good job with this algorithm; it saved me a lot of time and it's perfect for our project requirements.
def generate_mptt(root_dir):
    """
    Given a root directory, generate a calculated MPTT
    representation for the file hierarchy
    """
    for root, dirs, _ in os.walk(root_dir):
        # your check should go here:
        if any(is_subdir(root, path) for path in ignorelist):
            del dirs[:]  # don't descend
            continue
        dirs.sort()
        tree[root] = dirs
    preorder_tree(root_dir, tree[root_dir])
    mptt_list.sort(key=lambda x: x.left)
Sort of like that, assuming that is_subdir(root, path) returns True if root is a subdirectory of path.

Does os.walk take advantage of the file type returned by the OS for efficiency?

The os.walk function returns separate lists for directories and files. The underlying OS calls on many common operating systems such as Windows and Linux return a file type or flag specifying whether each directory entry is a file or a directory; without this flag it's necessary to query the OS again for each returned filename. Does the code for os.walk make use of this information or does it throw it away as os.listdir does?
Nope, it does not.
Under the hood, os.walk() uses os.listdir() and os.path.isdir() to list files and directories separately. See the source code of walk().
Specifically:
try:
    # Note that listdir and error are globals in this module due
    # to earlier import-*.
    names = listdir(top)
except error, err:
    if onerror is not None:
        onerror(err)
    return

dirs, nondirs = [], []
for name in names:
    if isdir(join(top, name)):
        dirs.append(name)
    else:
        nondirs.append(name)
where listdir and isdir are module globals for the os.listdir() and os.path.isdir() functions. It calls itself recursively for subdirs.
As Martijn Pieters's answer explains, os.walk just uses os.listdir and os.path.isdir.
There's been some discussion on this a few times on the mailing lists, but no concrete suggestion for the stdlib has ever come out of it. There are various edge cases that make this less trivial than it seems. Also, if Python 3.4 or later grows a new path module, there's a good chance os.walk will just be replaced/deprecated rather than improved in place.
However, there are a number of third-party modules that you can use.
The simplest is probably Ben Hoyt's betterwalk. I believe he's intending to get this on PyPI, and maybe even submit it for Python 3.4 or later, but at present you have to install it from GitHub. betterwalk provides an os.listdir replacement called iterdir_stat, and a 90%-complete os.walk replacement built on top of it. On most POSIX systems, and on Win32, it can usually avoid unnecessary stat calls. (There are some cases where it can't do as good a job as fts(3)/nftw(3)/find(1), but at worst it just makes some unnecessary calls rather than failing. The parts that may not be complete, last I checked, are symlink handling and maybe error handling.)
There's also a nice wrapper around fts for POSIX systems, which is obviously ideal as far as performance goes on modern POSIX systems—but it has a different (better, in my opinion, but still different) interface, and doesn't support Windows or other platforms (or even older POSIX systems).
There are also about 30-odd "everything under the sun to do with paths" modules on PyPI and elsewhere, some of which have new walk-like functions.
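To illustrate the idea these modules are built around, here is a minimal sketch of a walk based on the os.scandir() API (which is what betterwalk eventually became in Python 3.5): it reuses the file type the OS already returned instead of stat-ing each name again:

import os

def walk_scandir(top):
    dirs, nondirs = [], []
    for entry in os.scandir(top):
        # is_dir() uses the type the OS returned alongside the name,
        # so on most platforms no extra stat() call is made
        (dirs if entry.is_dir() else nondirs).append(entry.name)
    yield top, dirs, nondirs
    for name in dirs:
        yield from walk_scandir(os.path.join(top, name))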

PyGTK/GIO: monitor directory for changes recursively

Take the following demo code (from the GIO answer to this question), which uses a GIO FileMonitor to monitor a directory for changes:
import gio

def directory_changed(monitor, file1, file2, evt_type):
    print "Changed:", file1, file2, evt_type

gfile = gio.File(".")
monitor = gfile.monitor_directory(gio.FILE_MONITOR_NONE, None)
monitor.connect("changed", directory_changed)

import glib
ml = glib.MainLoop()
ml.run()
After running this code, I can then create and modify child nodes and be notified of the changes. However, this only works for immediate children (I am aware that the docs don't say otherwise). The last of the following shell commands will not result in a notification:
touch one
mkdir two
touch two/three
Is there an easy way to make it recursive? I'd rather not manually code something that looks for directory creation and adds a monitor, removing them on deletion, etc.
The intended use is for a VCS file browser extension, to be able to cache the statuses of files in a working copy and update them individually on changes. So there might be anywhere from tens to thousands (or more) of directories to monitor. I'd like to just find the root of the working copy and add the file monitor there.
I know about pyinotify, but I'm avoiding it so that this works under non-Linux kernels such as FreeBSD or... others. As far as I'm aware, the GIO FileMonitor uses inotify underneath where available, and I can understand not emphasising the implementation to maintain some degree of abstraction, but it suggested to me that this should be possible.
(In case it matters, I originally posted this on the PyGTK mailing list.)
"Is there an easy way to make it
recursive?"
I'm not aware of any "easy way" to achieve this. The underlying systems, such as inotify on Linux or kqueue on BSDs don't provide facilities to automatically add recursive watches. I'm also not aware of any library layering what you want atop GIO.
So you'll most likely have to build this yourself. As this can be a bit tricky in some corner cases (e.g. mkdir -p foo/bar/baz), I would suggest looking at how pyinotify implements its auto_add functionality (grep through the pyinotify source) and porting that over to GIO.
I'm not sure if GIO allows you to have more than one monitor at once, but if it does there's no* reason you can't do something like this:
import gio
import os

monitors = []  # keep references so the monitors aren't garbage collected

def directory_changed(monitor, file1, file2, evt_type):
    if os.path.isdir(file2):  # maybe this needs to be file1?
        add_monitor(file2)
    print "Changed:", file1, file2, evt_type

def add_monitor(dir):
    gfile = gio.File(dir)
    monitor = gfile.monitor_directory(gio.FILE_MONITOR_NONE, None)
    monitor.connect("changed", directory_changed)
    monitors.append(monitor)

add_monitor('.')

import glib
ml = glib.MainLoop()
ml.run()
*When I say no reason, there's the possibility that this could become a resource hog, though with nearly zero knowledge about GIO I couldn't really say. It's also entirely possible to roll your own in Python with a few calls (os.listdir among others). It might look something like this:
import time
import os

class Watcher(object):
    def __init__(self):
        self.dirs = []
        self.snapshots = {}

    def add_dir(self, dir):
        self.dirs.append(dir)

    def check_for_changes(self, dir):
        snapshot = self.snapshots.get(dir)
        curstate = os.listdir(dir)
        if not snapshot:
            self.snapshots[dir] = curstate
        else:
            if not snapshot == curstate:
                print 'Changes: ',
                for change in set(curstate).symmetric_difference(set(snapshot)):
                    # join with dir so isdir() checks the full path,
                    # not a name relative to the current working directory
                    if os.path.isdir(os.path.join(dir, change)):
                        print "isdir"
                        self.add_dir(os.path.join(dir, change))
                    print change,
                self.snapshots[dir] = curstate
                print

    def mainloop(self):
        if len(self.dirs) < 1:
            print "ERROR: Please add a directory with add_dir()"
            return
        while True:
            for dir in self.dirs:
                self.check_for_changes(dir)
            time.sleep(4)  # Don't want to be a resource hog

w = Watcher()
w.add_dir('.')
w.mainloop()
