MPTT Algorithm Modification - python

I've got this algorithm to generate MPTT from my folder structure:
https://gist.github.com/unbracketed/946520
I found it on GitHub and it works perfectly for my needs. Now I have a requirement to add the ability to skip some folders in the tree. For example, I want to skip everything in/under /tmp/A/B1/C2, so my tree will not contain anything from C2 (including C2 itself).
I'm not completely useless in Python, so I've written this check (and passed an extra list to the function):
def is_subdir(path, directory):
    path = os.path.realpath(path)
    directory = os.path.realpath(directory)
    relative = os.path.relpath(path, directory)
    return not relative.startswith(os.pardir + os.sep)
Now we can add, somewhere:

for single in ignorelist:
    if fsprocess.is_subdir(node, single):
But my question is: where do I put this check in the function? I tried placing it at the top and returning from inside the if, but that exits my whole application. The function keeps invoking itself recursively, so I'm pretty lost.
Any good advice? I've tried to contact the script's creator on GitHub, without luck. Really good job with this algorithm; it saved me a lot of time and it's perfect for our project requirements.

Your check should go inside the os.walk loop:

def generate_mptt(root_dir):
    """
    Given a root directory, generate a calculated MPTT
    representation for the file hierarchy
    """
    for root, dirs, _ in os.walk(root_dir):
        # your check goes here:
        if any(is_subdir(root, path) for path in ignorelist):
            del dirs[:]  # don't descend
            continue
        dirs.sort()
        tree[root] = dirs
    preorder_tree(root_dir, tree[root_dir])
    mptt_list.sort(key=lambda x: x.left)

Sort of like that, assuming that is_subdir(root, path) returns True if root is a subdirectory of path.
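To make that concrete, here is a minimal, self-contained sketch of how the pieces might fit together. The build_tree name, the ignorelist parameter, the os.path.commonpath-based subdirectory test, and the extra filtering of dirs are my own illustration, not part of the original gist (which also builds the MPTT via tree, preorder_tree, and mptt_list):

import os

def is_subdir(path, directory):
    # True if path is directory itself or lives anywhere under it
    path = os.path.realpath(path)
    directory = os.path.realpath(directory)
    return os.path.commonpath([path, directory]) == directory

def build_tree(root_dir, ignorelist=()):
    # hypothetical helper: walk root_dir, pruning ignored subtrees
    tree = {}
    for root, dirs, _ in os.walk(root_dir):
        if any(is_subdir(root, skip) for skip in ignorelist):
            del dirs[:]  # don't descend into a pruned subtree
            continue
        # also drop ignored children so C2 never shows up as a node
        dirs[:] = [d for d in dirs
                   if not any(is_subdir(os.path.join(root, d), skip)
                              for skip in ignorelist)]
        dirs.sort()
        tree[root] = dirs
    return tree

print(build_tree('/tmp/A', ignorelist=['/tmp/A/B1/C2']))

Mutating dirs in place matters: os.walk only skips a subtree if you modify the very list it handed you, which is why the answer uses del dirs[:] rather than rebinding the name.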


How to detect cycles in directory traversal

I am using Python on Ubuntu (Linux). I would also like this code to work on modern-ish Windows PCs. I can take or leave Macs, since I have no plan to get one, but I am hoping to make this code as portable as possible.
I have written some code that is supposed to traverse a directory and run a test on all of its subdirectories (and later, some code that will do something with each file, so I need to know how to detect links there too).
I have added a check for symlinks, but I do not know how to protect against hardlinks that could cause infinite recursion. Ideally, I'd like to also protect against duplicate detections in general (root: [A,E], A: [B,C,D], E: [D,F,G], where D is the same file or directory in both A and E).
My current thought is to check if the path from the root directory to the current folder is the same as the path being tested, and if it isn't, skip it as an instance of a cycle. However, I think that would take a lot of extra I/O or it might just retrace the (actually cyclic) path that was just created.
How do I properly detect cycles in my filesystem?
def find(self) -> bool:
    if self._remainingFoldersToSearch:
        current_folder = self._remainingFoldersToSearch.pop()
        if not current_folder.is_symlink():
            contents = current_folder.iterdir()
            try:
                for item in contents:
                    if item.is_dir():
                        if item.name == self._indicator:
                            potentialArchive = [x.name for x in item.iterdir()]
                            if self._conf in potentialArchive:
                                self._archives.append(item)
                                if self._onArchiveReadCallback:
                                    self._onArchiveReadCallback(item)
                        else:
                            self._remainingFoldersToSearch.append(item)
                            self._searched.append(item)
                            if self._onFolderReadCallback:
                                self._onFolderReadCallback(item)
            except PermissionError:
                logging.info("Invalid permissions accessing folder:", exc_info=True)
        return True
    else:
        return False
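No answer is preserved here, but the technique that usually comes up for this problem is to identify directories by their (st_dev, st_ino) pair rather than by path. A sketch, my own illustration rather than a fix to the code above (walk_no_cycles is a made-up name):

import os

def walk_no_cycles(root):
    # yield directories under root, skipping any inode already visited
    seen = set()
    stack = [root]
    while stack:
        path = stack.pop()
        try:
            st = os.stat(path)            # follows symlinks
        except OSError:
            continue
        key = (st.st_dev, st.st_ino)      # identifies the underlying inode
        if key in seen:
            continue                      # cycle or duplicate: already visited
        seen.add(key)
        yield path
        try:
            with os.scandir(path) as it:
                stack.extend(e.path for e in it if e.is_dir())
        except PermissionError:
            pass

On POSIX systems the (st_dev, st_ino) pair uniquely identifies the inode, so symlink loops and the duplicate-D case both collapse into "already seen"; on Windows, recent CPython fills st_ino from the file index, so the same check should work there too.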

Why is os.scandir() as slow as os.listdir()?

I tried to optimize a file-browsing function written in Python, on Windows, by using os.scandir() instead of os.listdir(). However, the time remains unchanged, about two and a half minutes, and I can't tell why.
Below are the functions, original and altered:
os.listdir() version:
def browse(self, path, tree):
    # for each entry in the path
    for entry in os.listdir(path):
        entity_path = os.path.join(path, entry)
        # check whether the entry is git-ignored or not
        if self.git_ignore(entity_path) is False:
            # if it is a dir, create a new level in the tree
            if os.path.isdir(entity_path):
                tree[entry] = Folder(entry)
                self.browse(entity_path, tree[entry])
            # if it is a file, add it to the tree
            if os.path.isfile(entity_path):
                tree[entry] = File(entity_path)
os.scandir() version:
def browse(self, path, tree):
    # for each entry in the path
    for dirEntry in os.scandir(path):
        entry_name = dirEntry.name
        entity_path = dirEntry.path
        # check whether the entry is git-ignored or not
        if self.git_ignore(entity_path) is False:
            # if it is a dir, create a new level in the tree
            if dirEntry.is_dir(follow_symlinks=True):
                tree[entry_name] = Folder(entity_path)
                self.browse(entity_path, tree[entry_name])
            # if it is a file, add it to the tree
            if dirEntry.is_file(follow_symlinks=True):
                tree[entry_name] = File(entity_path)
In addition, here are the auxiliary functions used within this one:
def git_ignore(self, filepath):
    if '.git' in filepath:
        return True
    if '.ci' in filepath:
        return True
    if '.delivery' in filepath:
        return True
    child = subprocess.Popen(['git', 'check-ignore', str(filepath)],
                             stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE)
    output = child.communicate()[0]
    status = child.wait()
    return status == 0

============================================================

class Folder(dict):
    def __init__(self, path):
        self.path = path
        self.categories = {}

============================================================

class File(object):
    def __init__(self, path):
        self.path = path
        self.filename, self.extension = os.path.splitext(self.path)
Does anyone have a solution for how I can make this function run faster? My assumption is that extracting the name and path at the beginning makes it slower than it should be. Is that correct?
Regarding your question: os.walk() seems to call stat() more times than necessary, which seems to be why it's slower than os.scandir(). In this case, I think the best way to boost your performance would be to use parallel processing, which can speed up some loops dramatically. There are multiple posts about this; here is one: Parallel Processing in Python – A Practical Guide with Examples.
Nevertheless, I'd like to share some thoughts. I have also wondered what the best uses of these three options (scandir, listdir, walk) are. There is not much documentation comparing their performance; probably the best way is to test it yourself, as you did. Here are my conclusions:
Usage of os.listdir():
It doesn't seem to have advantages over os.scandir(), except that it is easier to understand. I still use it when I only need to list the files in a directory.
PROS:
Fast and simple.
CONS:
Too simple: it only lists the names of the files and dirs in a directory, so you might need to combine it with other calls to get the files' metadata. If so, better to use os.scandir().
Usage of os.walk():
This is the most used function when we need to fetch all the items in a directory (and its subdirectories).
PROS:
It's probably the easiest way to iterate over all the items' paths and names.
CONS:
It seems to call stat() more times than necessary, which seems to be why it's slower than os.scandir().
Although it gives you the root part of each path, it doesn't provide the extra metadata that os.scandir() does.
Usage of os.scandir():
It seems to have (almost) the best of both worlds: the speed of the simple os.listdir() plus extra features that allow you to simplify your loops, since you can avoid exiftool or other metadata tools when you need extra information about the files.
PROS:
Fast: the same speed as os.listdir().
Very nice extra features.
CONS:
If you want to dive into subdirectories, you need to write another function to scan over each subdir. That function is pretty simple, but maybe it would be more pythonic (I just mean more elegant syntax) to use os.walk() in this case.
So that's my view after reading a bit and using them. I'm happy to be corrected, so I can learn more about it.
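To make the parallel-processing suggestion concrete, here is a minimal sketch of a threaded directory walk; it's my own illustration (scan_one and parallel_walk are made-up names), not code from the thread:

import os
from concurrent.futures import ThreadPoolExecutor

def scan_one(path):
    # list a single directory without descending; returns (path, files, subdirs)
    files, subdirs = [], []
    try:
        with os.scandir(path) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    subdirs.append(entry.path)
                else:
                    files.append(entry.path)
    except PermissionError:
        pass
    return path, files, subdirs

def parallel_walk(root, max_workers=8):
    # walk level by level, listing each frontier of directories in parallel
    tree = {}
    frontier = [root]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            next_frontier = []
            for path, files, subdirs in pool.map(scan_one, frontier):
                tree[path] = files
                next_frontier.extend(subdirs)
            frontier = next_frontier
    return tree

print(len(parallel_walk('.')))

Threads help here because os.scandir() releases the GIL while waiting on the filesystem. Separately, note that the original browse() launches one git check-ignore subprocess per entry; that almost certainly dwarfs any listdir-versus-scandir difference, and git check-ignore --stdin can check many paths with a single process.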

Check if a given directory contains any directory in python

Essentially, I'm wondering if the top answer given to this question can be implemented in Python. I am reviewing the modules os, os.path, and shutil and I haven't yet been able to find an easy equivalent, though I assume I'm just missing something simple.
More specifically, say I have a directory A, and inside directory A is any other directory. I can call os.walk('path/to/A') and check if dirnames is empty, but I don't want to make the program go through the entire tree rooted at A; i.e. what I'm looking for should stop and return true as soon as it finds a subdirectory.
For clarity, on a directory containing files but no directories an acceptable solution will return False.
Maybe you want:

def folders_in(path_to_parent):
    for fname in os.listdir(path_to_parent):
        if os.path.isdir(os.path.join(path_to_parent, fname)):
            yield os.path.join(path_to_parent, fname)

print(list(folders_in("/path/to/parent")))

This will return a list of all subdirectories; if it's empty, there are no subdirectories.
Or in one line:

set([os.path.dirname(p) for p in glob.glob("/path/to/parent/*/*")])

although for a subdirectory to be counted with this method, it must have some file in it.
Or manipulating walk:

def subfolders(path_to_parent):
    try:
        return next(os.walk(path_to_parent))[1]
    except StopIteration:
        return []
I would just do as follows:

# for example
dir_of_interest = "/tmp/a/b/c"
print(dir_of_interest in (v[0] for v in os.walk("/tmp/")))

This prints True or False, depending on whether dir_of_interest is in the generator. A generator is used here, so the directories to check are produced one by one.
You can break from the walk any time you want. For example, this breaks if the current folder being walked has no subdirectories:

for root, dirs, files in os.walk("/tmp/"):
    print(root, len(dirs))
    if not len(dirs):
        break

Maybe this is in line with what you are after.
Try this:

#!/usr/local/cpython-3.4/bin/python

import glob
import os

top_of_hierarchy = '/tmp/'
#top_of_hierarchy = '/tmp/orbit-dstromberg'
pattern = os.path.join(top_of_hierarchy, '*')

for candidate in glob.glob(pattern):
    if os.path.isdir(candidate):
        print("{0} is a directory".format(candidate))
        break
else:
    print('No directories found')

# Tested on 2.6, 2.7 and 3.4

Note the for/else: the else clause only runs if the loop finishes without hitting break, i.e. when no directory was found.
I apparently can't comment yet; however, I wanted to update part of the answer https://stackoverflow.com/users/541038/joran-beasley gave, or at least what worked for me.
Using Python 3 (3.7.3), I had to modify the first code snippet as follows:

import os

def has_folders(path_to_parent):
    for fname in os.listdir(path_to_parent):
        if os.path.isdir(os.path.join(path_to_parent, fname)):
            yield os.path.join(path_to_parent, fname)

print(list(has_folders("/repo/output")))

Further progress on narrowing this down to "does the given directory contain any directory" results in code like:

import os

def folders_in(path_to_parent):
    for fname in os.listdir(path_to_parent):
        if os.path.isdir(os.path.join(path_to_parent, fname)):
            yield os.path.join(path_to_parent, fname)

def has_folders(path_to_parent):
    folders = list(folders_in(path_to_parent))
    return len(folders) != 0

print(has_folders("the/path/to/parent"))

The result of this code is True or False.
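One more variant, my own sketch rather than an answer from the thread: os.scandir() gives you the short-circuit the question asks for almost directly, since any() stops at the first subdirectory it finds instead of listing the whole parent:

import os

def contains_dir(path):
    # True as soon as the first subdirectory turns up; False otherwise
    with os.scandir(path) as it:
        return any(entry.is_dir(follow_symlinks=False) for entry in it)

print(contains_dir("/tmp"))

Unlike the listdir-based versions, this never builds the full list of entries, and entry.is_dir() usually needs no extra stat() call because scandir caches the entry type.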

Python: Unchroot directory

I chrooted into a directory using the following call:

os.chroot("/mydir")

How do I return to the directory I was in before chrooting? Is it possible to unchroot?
SOLUTION:
Thanks to Phihag, I found a solution. Simple example:

import os

os.mkdir('/tmp/new_dir')
dir1 = os.open('.', os.O_RDONLY)
dir2 = os.open('/tmp/new_dir', os.O_RDONLY)
os.getcwd()  # we are in 'tmp'
os.chroot('/tmp/new_dir')  # chroot into 'new_dir'
os.fchdir(dir2)
os.getcwd()  # we are in the chrooted directory, but the path is '/'; that's OK
os.fchdir(dir1)
os.getcwd()  # we are back in the non-chrooted 'tmp' directory
os.close(dir1)
os.close(dir2)
If you haven't changed your current working directory, you can simply call:

os.chroot('../..')  # add '../' as needed

Of course, this requires the CAP_SYS_CHROOT capability (usually only given to root).
If you have changed your working directory, you can still escape, but it's harder:

os.mkdir('tmp')
os.chroot('tmp')
os.chdir('../../')  # add '../' as needed
os.chroot('.')

If chroot changes the current working directory, you can get around that by opening the directory and using fchdir to go back.
Of course, if you intend to leave a chroot in the course of a normal program (i.e. not a demonstration or a security exploit), you should rethink your design. First of all, do you really need to escape the chroot? Why can't you just copy the required information into it beforehand?
Also, consider using a second process that stays outside the chroot and answers the requests of the chrooted one.
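Putting the answer's two fragments together, the classic escape sequence looks roughly like this. A sketch under the same assumptions (the process holds CAP_SYS_CHROOT inside the jail; escape_chroot is a made-up name):

import os

def escape_chroot():
    # needs CAP_SYS_CHROOT; chroot() does not change the cwd by itself
    os.mkdir('jail')       # any subdirectory will do
    os.chroot('jail')      # cwd is now *outside* the new root
    for _ in range(64):    # climb to the real root; '..' at '/' is a no-op
        os.chdir('..')
    os.chroot('.')         # make the real root our root again

The loop replaces the answer's "add '../' as needed": overshooting is harmless, because once at the real root another chdir('..') stays put.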

Find a path in Windows relative to another

This problem should be a no-brainer, but I haven't yet been able to nail it.
I need a function that takes two parameters, each a file path (relative or absolute), and returns the filepath that results from resolving the first path (target) relative to the second path (start). The resolved path may be relative to the current directory or absolute (I don't care).
Here is an attempted implementation, complete with several doctests, that exercises some sample use cases (and demonstrates where it fails). A runnable script is also available in my source code repository, but it may change. The runnable script runs the doctests if no parameters are supplied, or passes one or two parameters to findpath if they are.
def findpath(target, start=os.path.curdir):
    r"""
    Find a path from start to target where target is relative to start.

    >>> orig_wd = os.getcwd()
    >>> os.chdir('c:\\windows')  # so we know what the working directory is

    >>> findpath('d:\\')
    'd:\\'
    >>> findpath('d:\\', 'c:\\windows')
    'd:\\'
    >>> findpath('\\bar', 'd:\\')
    'd:\\bar'
    >>> findpath('\\bar', 'd:\\foo')  # fails with '\\bar'
    'd:\\bar'
    >>> findpath('bar', 'd:\\foo')
    'd:\\foo\\bar'
    >>> findpath('bar\\baz', 'd:\\foo')
    'd:\\foo\\bar\\baz'
    >>> findpath('\\baz', 'd:\\foo\\bar')  # fails with '\\baz'
    'd:\\baz'

    Since we're on the C drive, findpath may be allowed to return
    relative paths for targets on the same drive. I use abspath to
    confirm that the ultimate target is what we expect.

    >>> os.path.abspath(findpath('\\bar'))
    'c:\\bar'
    >>> os.path.abspath(findpath('bar'))
    'c:\\windows\\bar'

    >>> findpath('..', 'd:\\foo\\bar')
    'd:\\foo'
    >>> findpath('..\\bar', 'd:\\foo')
    'd:\\bar'

    The parent of the root directory is the root directory.

    >>> findpath('..', 'd:\\')
    'd:\\'

    Restore the original working directory.

    >>> os.chdir(orig_wd)
    """
    return os.path.normpath(os.path.join(start, target))
As you can see from the comments in the doctests, this implementation fails when start specifies a drive letter and target is relative to the root of that drive.
This brings up a few questions:
Is this behavior a limitation of os.path.join? In other words, should os.path.join('d:\\foo', '\\bar') resolve to 'd:\\bar'? As a Windows user, I tend to think so, but I hate to think that a mature function like os.path.join would need alteration to handle this use case.
Is there an example of an existing target-path resolver such as findpath that works in all of these test cases?
If 'no' to the above questions, how would you implement the desired behavior?
I agree with you: this seems like a deficiency in os.path.join. Looks like you have to deal with the drives separately. This code passes all your tests:

def findpath(target, start=os.path.curdir):
    sdrive, start = os.path.splitdrive(start)
    tdrive, target = os.path.splitdrive(target)
    rdrive = tdrive or sdrive
    return os.path.normpath(os.path.join(rdrive, os.path.join(start, target)))

(and yes, I had to nest two os.path.join calls to get it to work...)
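A note for later readers, as an aside to the accepted approach: on recent Python 3, ntpath.join (which is what os.path.join resolves to on Windows) appears to keep the start drive when the target is rooted but driveless, so a sketch of the same resolver can lean on it directly. ntpath is importable on any OS, which also makes this testable off Windows:

import ntpath

def findpath(target, start='.'):
    # ntpath.join keeps start's drive when target is rooted but driveless,
    # e.g. join('d:\\foo', '\\bar') -> 'd:\\bar' on recent Python 3
    return ntpath.normpath(ntpath.join(start, target))

assert findpath('\\bar', 'd:\\foo') == 'd:\\bar'
assert findpath('bar\\baz', 'd:\\foo') == 'd:\\foo\\bar\\baz'
assert findpath('..', 'd:\\') == 'd:\\'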
