I need to know if pathA is a subset of, or is contained within, pathB.
I'm making a little script that will walk some old volumes and find duplicate files. My general approach (and even if it's a bad one because of its inefficiency, it's just for me and it works, so I'm OK with the brute-forceness of it) has been:
Map all the files to a log
Create a hash for all the files in the log
Sort the hash list for duplicates
Move the duplicates somewhere for inspection prior to deletion
I want to be able to exclude certain directories, though (i.e. system files). This is what I've written:
# self.search_dir = top-level directory to be searched for duplicates
# self.mfl = master_file_list, being built by this func, a list of all files in search_dir
# self.no_crawl_list = list of files and directories to be excluded from the search
def build_master_file_list(self):
    for root, directories, files in os.walk(self.search_dir):
        files = [f for f in files if not f[0] == '.']
        directories[:] = [d for d in directories if not d[0] == '.']
        for filename in files:
            filepath = os.path.join(root, filename)
            if [root, filepath] in self.no_crawl_list:
                pass
            else:
                self.mfl.write(filepath + "\n")
    self.mfl.close()
But I'm pretty sure this isn't going to do what I'd intended. My goal is to have all subdirectories of anything in self.no_crawl_list excluded as well, such that:
if /path/to/excluded_dir is added to self.no_crawl_list, then paths like /path/to/excluded_dir/sub_dir/implicitly_excluded_file.txt will be skipped as well.
I think my code is currently being entirely literal about what to skip. Short of exploding the path parts and comparing them to every possible combination in self.no_crawl_list, however, I don't know how to do this. 'Lil help? :)
As per the assistance of Lukas Graf in the comments above, I was able to build this and it works like a charm:
def is_subpath(self, path, of_paths):
    if isinstance(of_paths, str):  # on Python 2, use basestring here
        of_paths = [of_paths]
    abs_of_paths = [os.path.abspath(of_path) for of_path in of_paths]
    return any(os.path.abspath(path).startswith(subpath) for subpath in abs_of_paths)
Also, this currently doesn't account for symlinks and assumes a UNIX filesystem; see the comments in the original question for advice on extending this.
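Note as well that a bare startswith() check can over-match prefixes: /path/to/excluded_dir would also match /path/to/excluded_dir2. A minimal sketch of a stricter variant using os.path.commonpath (Python 3.4+; the helper name is just for illustration):

import os

def is_strict_subpath(path, of_paths):
    # True if path equals, or is nested under, any entry in of_paths
    abs_path = os.path.abspath(path)
    for of_path in of_paths:
        abs_of_path = os.path.abspath(of_path)
        # commonpath compares whole path components, avoiding the
        # prefix pitfall of startswith()
        if os.path.commonpath([abs_path, abs_of_path]) == abs_of_path:
            return True
    return False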
Related
How to get Absolute file path within a specified directory and ignore dot(.) directories and dot(.)files
I have the solution below, which recursively yields the full path of every file within a directory.
Help me find the fastest way to list files with their full paths while ignoring .directories/ and .files (the directory may contain 100 to 500 million files).
import os

def absoluteFilePath(directory):
    for dirpath, _, filenames in os.walk(directory):
        for f in filenames:
            yield os.path.abspath(os.path.join(dirpath, f))

for files in absoluteFilePath("/my-huge-files"):
    # use some "starts with dot" logic? or any better solution
    pass
Example:
/my-huge-files/project1/file{1..100}   # consider all files from file1 to file100
/my-huge-files/.project1/file{1..100}  # ignore the .project1 directory and its files (no files under .(dot) directories are needed)
/my-huge-files/project1/.file1000      # ignore .file1000, since it starts with a dot
os.walk by definition visits every file in a hierarchy, but you can select which ones you actually print with a simple textual filter.
for file in absoluteFilePath("/my-huge-files"):
    if '/.' not in file:
        print(file)
When your starting path is already absolute, calling os.path.abspath on it is redundant, but in the grand scheme of things you can just leave it in.
Don't use os.walk(), as it will visit every file.
Instead, fall back to os.scandir() or os.listdir() and write your own implementation.
You can use pathlib.Path(test_path).expanduser().resolve() to fully expand a path.
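For instance, expanding a hypothetical path:

from pathlib import Path

# "~" is expanded to the home directory and ".." segments are resolved
full = Path("~/data/../data").expanduser().resolve()
print(full)  # e.g. /home/user/data

Putting the os.scandir() advice into practice: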
import os
from pathlib import Path

def walk_ignore(search_root, ignore_prefixes=(".",)):
    """Recursively walk directories, ignoring files with some prefix.

    Pass search_root as an absolute directory to get absolute results.
    """
    for dir_entry in os.scandir(Path(search_root)):
        if dir_entry.name.startswith(ignore_prefixes):
            continue
        if dir_entry.is_dir():
            yield from walk_ignore(dir_entry, ignore_prefixes=ignore_prefixes)
        else:
            yield Path(dir_entry)
You may be able to save some overhead with a closure, coercing to Path once, yielding only .name, etc., but that's really up to your needs.
Also, not part of your question but related to it: if the files are very small, you'll likely find that packing them together (several files in one) or tuning the filesystem block size yields tremendously better performance.
Finally, some filesystems come with bizarre caveats specific to them, and you can likely break this with oddities like symlink loops.
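A quick usage sketch (the directory name is just a placeholder):

# yields pathlib.Path objects for every non-dot file under the root
for path in walk_ignore("/my-huge-files"):
    print(path)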
I'm new to Python and got stuck on a problem I encountered while studying loops and folder navigation.
The task is simple: loop through a folder and count all '.txt' files.
I believe there may be some modules to tackle this task easily, and I would appreciate it if you could share them. But since this is just a random question I encountered while learning Python, it would be nice if it could be solved using the tools I've just acquired, like for/while loops.
I used for and while clauses to loop through a folder. However, I'm unable to loop through a folder entirely.
Here is the code I used:
import os

count = 0 # set count default
path = 'E:\\' # set path
while os.path.isdir(path):
    for file in os.listdir(path): # loop through the folder
        print(file) # print text to keep track of the process
        if file.endswith('.txt'):
            count += 1
            print('+1')
        elif os.path.isdir(os.path.join(path, file)): # if it is a subfolder
            print(os.path.join(path, file))
            path = os.path.join(path, file)
            print('is dir')
            break
        else:
            path = os.path.join(path, file)
Since the number of files and subfolders in a folder is unknown, I think a while loop is appropriate here. However, my code has many errors or pitfalls I don't know how to fix. For example, if multiple subfolders exist, this code will only loop through the first subfolder and ignore the rest.
Your problem is that you quickly end up trying to look at non-existent files. Imagine a directory structure where a non-directory named A (E:\A) is seen first, then a file b (E:\b).
On your first loop, you get A, detect that it does not end in .txt and is not a directory, so (via the final else) you change path to E:\A.
On your second iteration, you get b (meaning E:\b), but all your tests (aside from the .txt extension test) and operations concatenate it with the new path, so you test E:\A\b, not E:\b.
Similarly, if E:\A is a directory, you break the inner loop immediately, so even if E:\c.txt exists, if it occurs after A in the iteration order, you never even see it.
Directory tree traversal code must involve a stack of some sort, either explicitly (by appending and popping from a list of directories for eventual processing), or implicitly (via recursion, which uses the call stack to achieve the same purpose).
In any event, your specific case should really just be handled with os.walk:
count = 0
for root, dirs, files in os.walk(path):
    print(root)  # print text to keep track of the process
    count += sum(1 for f in files if f.endswith('.txt'))
    # This second line matches your existing behavior, but might not be intended;
    # remove it if directories ending in .txt should not be included in the count
    count += sum(1 for d in dirs if d.endswith('.txt'))
Just for illustration, the explicit stack approach to your code would be something like:
import os

count = 0 # set count default
paths = ['E:\\'] # make a stack of paths to process
while paths:
    # paths.pop() gets the top of the directory stack to process.
    # os.scandir is easier and more efficient than os.listdir,
    # though it must be closed (the with statement does this for us)
    with os.scandir(paths.pop()) as entries:
        for entry in entries: # loop through the folder
            print(entry.name) # print text to keep track of the process
            if entry.name.endswith('.txt'):
                count += 1
                print('+1')
            elif entry.is_dir(): # if it is a subfolder
                print(entry.path, 'is dir')
                # add to the paths stack to get to it eventually
                paths.append(entry.path)
You probably want to apply recursion to this problem. In short, you will need a function to handle directories that will call itself when it encounters a sub-directory.
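A minimal sketch of that recursive idea (the helper name count_txt is just for illustration):

import os

def count_txt(path):
    # recursively count '.txt' files under path
    count = 0
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isdir(full):
            count += count_txt(full)  # call ourselves for the sub-directory
        elif name.endswith('.txt'):
            count += 1
    return count

print(count_txt('E:\\'))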
This might be more than you need, but it will let you list all the .txt files within the directory, and you can also add criteria to search within the files themselves. Here is the function:
import os
import pandas as pd

def file_search(root, extension, search, search_type):
    col1 = []
    col2 = []
    rootdir = root
    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            if "." + extension in file.lower():
                try:
                    with open(os.path.join(subdir, file)) as f:
                        contents = f.read()
                    if search_type == 'any':
                        if any(word.lower() in contents.lower() for word in search):
                            col1.append(subdir)
                            col2.append(file)
                    elif search_type == 'all':
                        if all(word.lower() in contents.lower() for word in search):
                            col1.append(subdir)
                            col2.append(file)
                except Exception:
                    # skip files that cannot be opened or decoded
                    pass
    df = pd.DataFrame({'Folder': col1,
                       'File': col2})[['Folder', 'File']]
    return df
Here is an example of how to use the function:
search_df = file_search(root='E:\\',
                        search=['foo', 'bar'],  # words to search for
                        extension='txt',        # could change this to 'csv' or 'sql' etc.
                        search_type='all')      # use 'any' or 'all'
search_df
The analysis of your code has already been addressed quite well by @ShadowRanger's answer.
I will try to address this part of your question:
there may be some modules to tackle this task easily
For these kinds of tasks, there actually exists the glob module, which implements Unix-style pathname pattern expansion.
To count the number of .txt files in a directory and all its subdirectories, one may simply use the following:
import os
from glob import iglob, glob

dirpath = '.' # for example

# getting all matching elements in a list and computing its length
len(glob(os.path.join(dirpath, '**/*.txt'), recursive=True))
# 772

# or iterating through all matching elements and summing 1 each time a new item is found
# (this approach is more memory-efficient)
sum(1 for _ in iglob(os.path.join(dirpath, '**/*.txt'), recursive=True))
# 772
Basically glob.iglob() is the iterator version of glob.glob().
For nested directories, it's easier to use functions like os.walk.
Take this for example:
subfiles = []
for dirpath, subdirs, files in os.walk(path):
    for x in files:
        if x.endswith(".txt"):
            subfiles.append(os.path.join(dirpath, x))
and it'll return a list of all the .txt files.
Otherwise, you'll need to use recursion for tasks like this.
I'm using glob.glob to search a folder, and the sub-folders therein, for all the invoices I have. To simplify that, I'm going to add the program to the context menu and have it take the path as the first argument.
import glob

for filename in glob.glob(path + "/**/*.pdf", recursive=True):
    print(filename)
I'll have it keep the list and send those files to a Printer, in a later version, but for now just writing the name is a good enough test.
So my question is twofold:
Is there anything fundamentally wrong with the way I'm writing this?
Can anyone point me in the direction of how to actually capture folder path and provide it as path-variable?
You should have a look at this question: Python script on selected file. It shows how to set up a "Send To" command in the context menu. This command calls a Python script and provides the file name sent via sys.argv[1]. I assume that also works for a directory.
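A minimal sketch of the receiving side, under the assumption that the selected folder arrives as the first command-line argument:

import glob
import sys

# the context-menu entry passes the selected folder as the first argument
path = sys.argv[1]

for filename in glob.glob(path + "/**/*.pdf", recursive=True):
    print(filename)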
I do not have Python 3.5, so I cannot set the flag recursive=True; therefore I prefer to provide a solution you can run on any Python version (known to date).
The solution consists in calling os.walk() to explore the directories, together with the set built-in type.
It is better to use a set instead of a list, as with the latter you would need more code to check whether the directory you want to add is not already listed.
So basically you can keep two sets: one for the names of the files you want to print and the other for the directories and their sub-folders.
So you can adapt this solution to your class/method:
import os

path = '.' # any path you want
exten = '.pdf'
directories_list = set()
files_list = set()

# Loop over directories
for dirpath, dirnames, files in os.walk(path):
    for name in files:
        # Check if the extension matches
        if name.lower().endswith(exten):
            files_list.add(name)
            directories_list.add(dirpath)
You can then loop over directories_list and files_list to print them out.
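For example:

# print every directory that contains at least one matching file
for d in sorted(directories_list):
    print(d)

# print the matching file names
for f in sorted(files_list):
    print(f)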
I'd like to find the full path of any given file, but when I tried to use
os.path.abspath("file")
it would only give me the file location as being in the directory where the program is running. Does anyone know why this is or how I can get the true path of the file?
What you are looking to accomplish here is ultimately a search on your filesystem. This does not work out too well, because it is extremely likely you might have multiple files of the same name, so you aren't going to know with certainty whether the first match you get is in fact the file that you want.
I will give you an example of how you can start yourself off with something simple that will allow you traverse through directories to be able to search.
You will have to give some kind of base path to be able to initiate the search that has to be made for the path where this file resides. Keep in mind that the more broad you are, the more expensive your searching is going to be.
You can do this with the os.walk method.
Here is a simple example of using os.walk. What this does is collect all your file paths with matching filenames.
Using os.walk
from os import walk
from os.path import join

d = 'some_file.txt'
paths = []
for i in walk('/some/base_path'):
    if d in i[2]:
        paths.append(join(i[0], d))
So, for each iteration over os.walk you are going to get a tuple that holds:
(path, directories, files)
So that is why I am checking against location i[2] to look at files. Then I join with i[0], which is the path, to put together the full filepath name.
Finally, you can actually put the above code all into one line and do:
paths = [join(i[0], d) for i in walk('/some/base_path') if d in i[2]]
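As to why os.path.abspath("file") pointed at your working directory: abspath does not search the filesystem at all; it simply joins the given name onto the current working directory and normalizes the result. A quick illustration (the paths shown are hypothetical):

import os

print(os.getcwd())               # e.g. /home/user/project
print(os.path.abspath("file"))   # /home/user/project/file, whether or not that file exists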
I've seen a lot of people asking questions about searching through folders and creating a list of files, but I haven't found anything that has helped me do the opposite.
I have a csv file with a list of files and their extensions (xxx0.laz, xxx1.laz, xxx2.laz, etc). I need to read through this list and then search through a folder for those files. Then I need to move those files to another folder.
So far, I've taken the csv and created a list. At first I was having trouble with the list: each line had a "\n" at the end, so I removed those. From the only other example I've found (How do I find and move certain files based on a list in excel?), I created a set from the list. However, I'm not really sure why, or if, I need it.
So here's what I have:
import glob
import os
import shutil

id = open('file.csv', 'r')
list = list(id)
list_final = ''.join([item.rstrip('\n') for item in list])
unique_identifiers = set(list_final)

os.chdir(r'working_dir') # I set this as the folder to look through
destination_folder = 'folder_loc' # folder to move files to

for identifier in unique_identifiers:
    for filename in glob.glob('%s_*' % identifier):
        shutil.move(filename, destination_folder)
I've been wondering about this ('%s_*' % identifier) with the glob function. I haven't found any examples with this, perhaps that needs to be changed?
When I do all that, I don't get anything. No errors and no actual files moved...
Perhaps I'm going about this the wrong way, but that is the only thing I've found so far anywhere.
It's really not hard:
import shutil

for fname in open("my_file.csv").read().split(","):
    shutil.move(fname.strip(), dest_dir)
You don't need a whole lot of things.
Also, if you just want all the *.laz files in a source directory, you don't need a CSV at all:
import glob
import os
import shutil

for fname in glob.glob(os.path.join(src_dir, "*.laz")):
    shutil.move(fname, dest_dir)