How to detect cycles in directory traversal - python

I am using Python on Ubuntu (Linux). I would also like this code to work on modern-ish Windows PCs. I can take or leave Macs, since I have no plan to get one, but I am hoping to make this code as portable as possible.
I have written some code that is supposed to traverse a directory and run a test on all of its subdirectories (and later, some code that will do something with each file, so I need to know how to detect links there too).
I have added a check for symlinks, but I do not know how to protect against hardlinks that could cause infinite recursion. Ideally, I'd like to also protect against duplicate detections in general (root: [A,E], A: [B,C,D], E: [D,F,G], where D is the same file or directory in both A and E).
My current thought is to check whether the path from the root directory to the current folder matches the path being tested, and if it doesn't, skip it as an instance of a cycle. However, I suspect that would require a lot of extra I/O, or it might simply retrace the (actually cyclic) path that was just created.
How do I properly detect cycles in my filesystem?
def find(self) -> bool:
    if self._remainingFoldersToSearch:
        current_folder = self._remainingFoldersToSearch.pop()
        if not current_folder.is_symlink():
            contents = current_folder.iterdir()
            try:
                for item in contents:
                    if item.is_dir():
                        if item.name == self._indicator:
                            potentialArchive = [x.name for x in item.iterdir()]
                            if self._conf in potentialArchive:
                                self._archives.append(item)
                                if self._onArchiveReadCallback:
                                    self._onArchiveReadCallback(item)
                        else:
                            self._remainingFoldersToSearch.append(item)
                            self._searched.append(item)
                            if self._onFolderReadCallback:
                                self._onFolderReadCallback(item)
            except PermissionError:
                logging.info("Invalid permissions accessing folder:", exc_info=True)
        return True
    else:
        return False
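A common, portable way to guard against cycles and duplicates (whether the alias comes from a symlink, a bind mount, or a linked directory) is to record the (st_dev, st_ino) pair of every directory already visited and skip any directory whose pair has been seen before. On recent Python versions os.stat() also fills in st_ino on Windows/NTFS, so the same check should work there. A minimal sketch, independent of the class above (walk_without_cycles and its names are illustrative, not from the original code):

from pathlib import Path

def walk_without_cycles(root: Path):
    """Yield every directory under root exactly once, skipping cycles."""
    seen = set()                      # (st_dev, st_ino) pairs of visited dirs
    stack = [root]
    while stack:
        folder = stack.pop()
        try:
            st = folder.stat()        # follows symlinks, so aliases collapse
        except OSError:
            continue                  # broken link, permission problem, etc.
        key = (st.st_dev, st.st_ino)
        if key in seen:
            continue                  # already visited: cycle or duplicate
        seen.add(key)
        yield folder
        try:
            stack.extend(p for p in folder.iterdir() if p.is_dir())
        except PermissionError:
            pass

This also handles the root: [A,E] duplicate case from the question, because D resolves to the same (st_dev, st_ino) no matter which parent it is reached through.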

Related

Can GetFileAttributesExA return stale information on an SMB3 mount?

I'm trying to figure the root cause of an issue in a single-threaded Python program that essentially goes like this (heavily simplified):
import os

# Before running
os.remove(path)

# While running
if os.path.isfile(path):
    with open(path) as fd:
        ...
I'm essentially seeing erratic behavior where isfile (which uses stat, itself using GetFileAttributesExA under the hood in Python 2.7, see here) can return True when the file doesn't exist, causing the subsequent open call to fail.
Since path is on an SMB3 network share, I suspect caching behavior of some kind. Is it possible that GetFileAttributesExA returns stale information?
Reducing SMB client caching from the default (10s) to 0s seems to make the issue disappear:
Set-SmbClientConfiguration -DirectoryCacheLifetime 0 -FileInfoCacheLifetime 0
(Note: The correct fix here is to try opening the file and catch the exception, of course, but I'm puzzled by the issue and would like to understand the root cause.)
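For reference, the EAFP fix mentioned in the note would look roughly like this (a sketch, not the original program):

try:
    # Skip the isfile() check entirely: attempt the open and handle the failure,
    # so a stale directory-cache entry cannot mislead the program.
    with open(path) as fd:
        data = fd.read()
except (IOError, OSError):
    # The file is already gone (or the cache lied); treat it as absent.
    data = None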

repeat exact same code on different directories with `try` `except` in python

I am writing a Python script that parses information from different files into pandas dataframes. First I point the script at a certain directory, then I call the commands that parse the information from multiple files. If, however, that directory does not exist, I should execute the exact same code in another directory. To illustrate, it should be something like:
import os

try:
    cwd = "/path/to/dir"
    os.chdir(cwd)
    # do a code block here
except FileNotFoundError:  # i.e. in case /path/to/dir does not exist
    cwd = "/path/to/anotherDir"
    os.chdir(cwd)
    # do the same code block
What I am doing currently is repeating the same code block in both the try and the except chunks, but I was wondering if there is a more elegant way of doing it, like assigning the whole code block to a variable or a function and then calling that variable/function in the except chunk.
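Wrapping the shared block in a function, as the question suggests, might look like this (process_directory is a hypothetical stand-in for the actual parsing code):

import os

def process_directory():
    # hypothetical placeholder for the shared parsing/dataframe code
    ...

try:
    os.chdir("/path/to/dir")
except FileNotFoundError:
    os.chdir("/path/to/anotherDir")

# Run the shared block once, whichever directory ended up current.
process_directory()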
You can simply iterate over a list of paths and try your code block for each path.
You can even add a break at the end of your try block if you want to stop iterating as soon as you find a path that works.
import os

paths = ['path/to/dir', 'path/to/other/dir']

for cwd in paths:
    try:
        os.chdir(cwd)
        # code that may fail here
        break
    except FileNotFoundError:
        # This will NOT cause the code to fail; it's the same as `pass` but
        # more informative
        print("File not found:", cwd)
Isn't it better to simply check if the directory/file exists?
import os

if os.path.isdir('some/path') and os.path.isfile('some/path/file.txt'):
    os.chdir('some/path')
else:
    os.chdir('other/path')

# code block here
Of course, this assumes that nothing else can go wrong in the '#code block'.

Debugging strategy for a bug (apparently) affected by timing

I'm fairly inexperienced with python, so I find myself at a loss as to how to approach this bug. I've inherited a python app that mainly just copies files and folders from one place to another. All the files and folders are on the local machine and the user has full admin rights, so there are no networking or security issues here.
What I've found is that a number of files fail to get copied from one directory to another unless I slow down the code somehow. If I just run the program it fails, but if I step through with a debugger or add print statements to the copy loop, it succeeds. The difference there seems to be either the timing of the loop or moving things around in memory.
I've seen this sort of bug in compiled languages before, and it usually indicates either a race condition or memory corruption. However, there is only one thread and no interaction with other processes, so a race condition seems impossible. Memory corruption remains a possibility, but I'm not sure how to investigate that possibility with python.
Here is the loop in question. As you can see, it's rather simple:
def concat_dirs(dir, subdir):
    return dir + "/" + subdir

for name in fullnames:
    install_name = concat_dirs(install_path, name)
    dirname = os.path.dirname(install_name)
    if not os.path.exists(dirname):
        os.makedirs(dirname)
    shutil.copyfile(concat_dirs(java_path, name), install_name)
That loop usually fails to copy the files unless I either step through it with a debugger or add this statement after the shutil.copyfile line.
print "copied ", concat_dirs(java_path, name), " to ", install_name
If I add that statement or step through in debug, the loop works perfectly and consistently. I'm tempted to say "good enough" with the print statement but I know that's just masking an underlying problem.
I'm not asking you to debug my code because I know you can't; I'm asking for a debug strategy. How do I approach finding this bug?
You do have a race condition of sorts: you check for the existence of dirname and then try to create it, so if something unexpected happens in between, the program will bomb, but...
In Python, we say that it's easier to ask forgiveness than permission. Go ahead and create that directory each time, then apologize if it already exists:
import errno
import os
import shutil

for name in fullnames:
    source_name = os.path.join(java_path, name)
    dest_name = os.path.join(install_path, name)
    dest_dir = os.path.dirname(dest_name)
    try:
        os.makedirs(dest_dir)
    except OSError as exc:
        if exc.errno != errno.EEXIST:
            raise
    shutil.copyfile(source_name, dest_name)
I'm not sure how I'd troubleshoot this, other than by trying it the non-racy way and seeing what happens. There may be a subtle filesystem issue that's making this run oddly.
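If the code can run under Python 3.2 or later, the same forgiveness pattern is built in: os.makedirs accepts exist_ok=True, so the try/except around it can be dropped. A sketch reusing the variable names from the answer above:

import os
import shutil

for name in fullnames:
    source_name = os.path.join(java_path, name)
    dest_name = os.path.join(install_path, name)
    # exist_ok=True: no error if the directory is already there (Python 3.2+)
    os.makedirs(os.path.dirname(dest_name), exist_ok=True)
    shutil.copyfile(source_name, dest_name)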

with open inside try - except block, too many files open?

Quite simply, I am cycling through all sub folders in a specific location, and collecting a few numbers from three different files.
def GrepData():
    import glob as glob
    import os as os
    os.chdir('RUNS')
    RUNSDir = os.getcwd()
    Directories = glob.glob('*.*')
    ObjVal = []
    ParVal = []
    AADVal = []
    for dir in Directories:
        os.chdir(dir)
        (X,Y) = dir.split(sep='+')
        AADPath = glob.glob('Aad.out')
        ObjPath = glob.glob('fobj.out')
        ParPath = glob.glob('Par.out')
        try:
            with open(os.path.join(os.getcwd(),ObjPath[0])) as ObjFile:
                for line in ObjFile:
                    ObjVal.append(list([X,Y,line.split()[0]]))
            ObjFile.close()
        except(IndexError):
            ObjFile.close()
        try:
            with open(os.path.join(os.getcwd(),ParPath[0])) as ParFile:
                for line in ParFile:
                    ParVal.append(list([X,Y,line.split()[0]]))
            ParFile.close()
        except(IndexError):
            ParFile.close()
        try:
            with open(os.path.join(os.getcwd(),AADPath[0])) as AADFile:
                for line in AADFile:
                    AADVal.append(list([X,Y,line.split()[0]]))
            AADFile.close()
        except(IndexError):
            AADFile.close()
        os.chdir(RUNSDir)
Each file open command is placed in a try - except block, as in a few cases the file that is opened will be empty, and thus appending line.split()[0] will lead to an IndexError since the list is empty.
However, when running this script I get the following error: "OSError: [Errno 24] Too many open files"
I was under the impression that the idea of the "with open..." statement was that it took care of closing the file after use? Clearly that is not happening.
So what I am asking for is two things:
The answer to: "Is my understanding of with open correct?"
How can I correct whatever error is inducing this problem?
(And yes, I know the code is not exactly elegant. The whole try - except ought to be a single object that is reused, but I will fix that after figuring out this error.)
Try moving your try-except inside the with like so:
with open(os.path.join(os.getcwd(),ObjPath[0])) as ObjFile:
    for line in ObjFile:
        try:
            ObjVal.append(list([X,Y,line.split()[0]]))
        except(IndexError):
            pass
Notes: there is no need to close your file manually, this is what with is for. Also, there is no need to use as os in your imports if you are using the same name.
"Too many open files" has nothing to do with writing semantically incorrect python code, and you are using with correctly. The key is the part of your error that says "OSError," which refers to the underlying operating system.
When you call open(), the python interpreter will execute a system call. The details of the system call vary a bit by which OS you are using, but on linux this call is open(2). The operating system kernel will handle the system call. While the file is open, it has an entry in the system file table and takes up OS resources -- this means effectively it is "taking up space" whilst it is open. As such the OS has a limit to the number of files that can be opened at any one time.
Your problem is that while you call open(), you don't call close() quickly enough. In the event that your directory structure requires you to have many thousands of files open at once and you approach this cap, it can be temporarily raised (at least on Linux; I'm less familiar with other OSes, so I don't want to go into too many details about how to do this across platforms).
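As an illustration of that per-process cap, on Linux and other Unix-likes the limit can be inspected and, up to the hard limit, raised from inside Python with the resource module; this is a sketch, not part of the original code:

import resource

# Current soft and hard limits on open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft, "hard limit:", hard)

# Raise the soft limit to the hard limit (no special privileges needed).
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))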

Traversing FTP listing

I am trying to get all directory names from an FTP server and store them hierarchically in a multidimensional list or dict.
So for example, a server that contains the following structure:
/www/
    mysite.com
        images
            png
            jpg
at the end of the script, would give me a list such as
['/www/'
    ['mysite.com'
        ['images'
            ['png'],
            ['jpg']
        ]
    ]
]
I have tried using a recursive function like so:
def traverse(dir):
    FTP.dir(dir, traverse)
FTP.dir returns lines in this format:
drwxr-xr-x 5 leavesc1 leavesc1 4096 Nov 29 20:52 mysite.com
so doing line[56:] will give me just the directory name (mysite.com). I use this in the recursive function.
But I cannot get it to work. I've tried many different approaches without success, and I get lots of FTP errors as well (either the directory can't be found, which is a logic issue on my side, or the server returns unexpected errors that leave no log, so I can't debug).
bottom line question:
How to get a hierarchical directory listing from an FTP server?
Here is a naive and slow implementation. It is slow because it tries to CWD to each directory entry to determine if it is a directory or a file, but this works. One could optimize it by parsing LIST command output, but this is strongly server-implementation dependent.
import ftplib

def traverse(ftp, depth=0):
    """
    return a recursive listing of an ftp server contents (starting
    from the current directory)

    listing is returned as a recursive dictionary, where each key
    contains a contents of the subdirectory or None if it corresponds
    to a file.

    @param ftp: ftplib.FTP object
    """
    if depth > 10:
        return ['depth > 10']
    level = {}
    for entry in (path for path in ftp.nlst() if path not in ('.', '..')):
        try:
            ftp.cwd(entry)
            level[entry] = traverse(ftp, depth+1)
            ftp.cwd('..')
        except ftplib.error_perm:
            level[entry] = None
    return level

def main():
    ftp = ftplib.FTP("localhost")
    ftp.connect()
    ftp.login()
    ftp.set_pasv(True)
    print traverse(ftp)

if __name__ == '__main__':
    main()
Here's a first draft of a Python 3 script that worked for me. It's much faster than calling cwd(). Pass in server, port, directory, username, and password as arguments. I left output as a list as an exercise for the reader.
import ftplib
import sys

def ftp_walk(ftp, dir):
    dirs = []
    nondirs = []
    for item in ftp.mlsd(dir):
        if item[1]['type'] == 'dir':
            dirs.append(item[0])
        else:
            nondirs.append(item[0])
    if nondirs:
        print()
        print('{}:'.format(dir))
        print('\n'.join(sorted(nondirs)))
    else:
        # print(dir, 'is empty')
        pass
    for subdir in sorted(dirs):
        ftp_walk(ftp, '{}/{}'.format(dir, subdir))

ftp = ftplib.FTP()
ftp.connect(sys.argv[1], int(sys.argv[2]))
ftp.login(sys.argv[4], sys.argv[5])
ftp_walk(ftp, sys.argv[3])
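Since the script above prints rather than builds the nested structure the question asks for, here is a sketch (assuming the server supports MLSD; ftp_tree is an illustrative name, not from the answers above) that returns a nested dict instead:

import ftplib

def ftp_tree(ftp, path=""):
    """Return {name: subtree} for directories and {name: None} for files under path."""
    tree = {}
    for name, facts in ftp.mlsd(path):
        if facts.get('type') == 'dir':
            tree[name] = ftp_tree(ftp, '{}/{}'.format(path, name) if path else name)
        elif facts.get('type') == 'file':
            tree[name] = None
    return tree

# Usage (connection details are placeholders):
# ftp = ftplib.FTP('ftp.example.com')
# ftp.login()
# print(ftp_tree(ftp, '/www'))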
You're not going to like this, but "it depends on the server" or, more accurately, "it depends on the output format of the server".
Different servers can be set to display different output, so your initial proposal is bound to failure in the general case.
The "naive and slow implementation" above will cause enough errors that some FTP servers will cut you off (which is probably what happened after about 7 of them...).
If the server supports the MLSD command, then use the “a directory and its descendants” code from that answer.
If we are using Python, look at http://docs.python.org/library/os.path.html (os.path.walk).
If there already is a good module for this, don't reinvent the wheel. Can't believe the post two spots above got two upvotes. Anyway, enjoy.
