I have two sets of paths, with maybe 5000 files in the first set and 10000 files in the second. The first set is contained in the second set. I need to check if any of the entries in the second set is a child of any entry in the first set (i.e. if it's a subdirectory or file in another directory from the first set). There are some additional requirements:
No operations on the file system; it should be done only on the path strings (except for dealing with symlinks if needed).
Platform independent (e.g. upper/lower case, different separators)
It should be robust with respect to different ways of expressing the same path.
It should deal with both symlinks and their targets.
Some paths will be absolute and some relative.
This should be as fast as possible!
I'm thinking along the lines of getting both os.path.abspath() and os.path.realpath() for each entry and then comparing them with os.path.commonpath([parent]) == os.path.commonpath([parent, child]). I can't come up with a good way of running this fast though. Or is it safe to just compare the strings directly? That would make it much much easier. Thanks!
EDIT: I was a bit unclear about the platform independence. It should work for all platforms, but there won't be for example Windows and Unix style paths mixed.
You can first calculate the real path of all paths using os.path.realpath and then use os.path.commonprefix to check if a path is a child of any path in the first set.
Example:
import os
first = ['a', 'b/x', '/r/c']
second = ['e', 'b/x/t', 'f']
first = set(os.path.realpath(p) for p in first)
second = set(os.path.realpath(p) for p in second)
for s in second:
    if any(os.path.commonprefix([s, f]) == f
           for f in first):
        print(s)
You get:
/full/path/to/b/x/t
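One caveat: os.path.commonprefix compares strings character by character, not component by component, so a parent of /r/c would also wrongly match /r/car. If that matters, the commonpath idiom from the question compares whole components instead (a minimal sketch; it needs Python 3.4+ and paths that are all absolute or all relative, which realpath already guarantees):

import os

def is_child(child, parent):
    # commonpath splits on separators, so /r/car is not treated as a
    # child of /r/c; note a path also counts as a child of itself here
    return os.path.commonpath([parent]) == os.path.commonpath([parent, child])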
Related
Several SO questions ask how to append a directory to a pathlib.Path object. That's not this question.
I would like to use a Path object as a prefix for a series of files in a single directory, like this:
2022-01-candidates.csv
2022-01-resumes.zip
2022-02-candidates.csv
2022-02-resumes.zip
Ideally, I would construct Path objects for the 2022-01 and 2022-02 components, and then append -candidates.csv and -resumes.zip to each.
Unfortunately, Path appears to only understand appending subdirectories, not extensions to existing path names.
The only workaround that I see is something like p.parent / (p.name + "-candidates.csv"). Although that's not so bad, it's clumsy and this pattern is common for me. I wonder whether I'm missing a more streamlined method. (For example, why isn't there a + concatenation operator?)
Path.with_suffix() requires that the suffix start with a dot, so that doesn't work.
As you mentioned, using the division operator always creates a sub-directory, and with_suffix is only for extensions. You could use with_name to edit the filename:
import pathlib
path = pathlib.Path("2022-01")
path.with_name(f"{path.name}-candidates.csv")
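On POSIX this returns PosixPath('2022-01-candidates.csv'), and the same pattern covers both endings from the question:

for ending in ("-candidates.csv", "-resumes.zip"):
    print(path.with_name(f"{path.name}{ending}"))
# 2022-01-candidates.csv
# 2022-01-resumes.zip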
I need to find whether a file with specific pattern name is available in a current directory or not. I used the following code for this purpose.
import glob

H1 = []
for record_name in my_list:
    file_name = 'RSN' + '_' + record_name[0:5] + '*' + record_name[-8:]
    H1 += glob.glob(file_name)
It should be noted that I used the above method because in some cases there are differences between the available record_name and the real name of the file in the current directory. For example, the true name of one of my files is "RSN20148_BB40204628_KRPHHZ", while I have "20148_40204628_KRPHHZ" in my_list. Please note that the second one does not have the "RSN" and "BB" terms.
The above procedure works, but the problem is that it takes a lot of time. Is there any suggestion to reduce the time?
Please note that I cannot use os.listdir() to get the names of all files, because the order of files in my_list is important to me.
Maybe implement an algorithm of your own: if the record names are unique, you could create a dictionary with all the record names set to False (an OrderedDict on Python < 3.7; from 3.7 onward plain dicts preserve insertion order).
Then use threading with os.path.exists(path) to set each key to True or False depending on whether that record exists. O(1) dictionary lookups combined with threading might give you a performance boost.
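A minimal sketch of that idea (it assumes the exact filenames can be constructed up front; with the wildcard from the question you would still need a pattern match rather than os.path.exists):

import os
from concurrent.futures import ThreadPoolExecutor

def check_records(names):
    # plain dicts preserve insertion order on Python 3.7+,
    # so the original order of my_list is kept
    found = dict.fromkeys(names, False)
    with ThreadPoolExecutor() as pool:
        for name, exists in zip(names, pool.map(os.path.exists, names)):
            found[name] = exists
    return found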
A last note - This is all theoretical and you would have to implement/optimise yourself to see if it gives you a performance boost at all or adds unnecessary overhead.
Cheers!
I have a list of files and directories. I'm trying to write a function to remove entries where there is also an entry for an ancestor directory present. What I have so far seems to work, but I think it is inefficient because it tests the full list of directories for every file.
Maybe there's a library out there to do this, but I can't find it. The purpose is to allow the user to choose a list of files and directories to upload.
As you can see from the example, directories are a subset of entries. I'd prefer to just provide the entries.
import os

def remove_redundant_entries(entries, directories):
    result = []
    for entry in entries:
        # make a copy and successively get the dirname and test it
        partial_path = entry
        found = False
        while partial_path != os.sep:
            partial_path = os.path.dirname(partial_path)
            if partial_path in directories:
                found = True
                break
        if not found:
            result.append(entry)
    return result
entries = [
    "/home/fred/work/f1.txt",
    "/home/fred/work/f2.txt",
    "/home/fred/play/f3.txt",
    "/home/fred/play",
    "/home/jane/dev/f1.txt",
    "/home/jane"]

directories = [
    "/home/fred/play",
    "/home/jane"]

print(remove_redundant_entries(entries, directories))
# result:
['/home/fred/work/f1.txt', '/home/fred/work/f2.txt', '/home/fred/play', '/home/jane']
If you know of a library or can give a clue to a better algorithm, I'd appreciate it. Meanwhile, I will try something based on sorting the entries, as ancestors should always precede their children in the list.
EDIT: - RESULTS
I ran all solutions 10,000 times through the profiler with the test set, and with one file added (/home/fred/work/f2.txt.bak) to make sure a regular filename does not cause a similarly named one to be discarded.
My original code: 1060004 function calls in 0.394 seconds
Stephen Rauch's answer - worked first time: 3250004 function calls in 2.089 seconds
carrdelling's answer - which didn't work for similar filenames: 480004 function calls in 0.146 seconds
carrdelling's edited answer - works for all cases: 680004 function calls in 0.231 seconds
Thanks to everyone who contributed!
If you sort your input list of entries, then the problem is easier:
def remove_redundant_entries(entries):
    split_entries = sorted(entries)
    valid_entries = []
    for entry in split_entries:
        if any(entry.startswith(p) for p in valid_entries):
            continue
        valid_entries.append(entry)
    return valid_entries
Note that any short-circuits as soon as one comparison is true (it would not compare against the whole list unless strictly necessary). Also, since the list comes sorted, you are guaranteed that the output will have the minimum number of (and highest-level) paths.
EDIT:
If you also need the ability to keep in the list multiple files in the same folder (even if some file names are subsets of others), you just need to modify the sorting criteria:
split_entries = sorted(entries, key=lambda x: (x.count(os.sep), -len(x)))
With that, folders that are higher in the tree will come earlier (so you'll end up with the minimum number of paths), but within a folder files with longer names will come earlier - so they won't get discarded because of files with shorter (prefix-like) names.
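To see the effect of that key on the similar-filenames case from the question (a quick illustration):

import os

entries = ["/home/fred/work/f2.txt",
           "/home/fred/work/f2.txt.bak",
           "/home/fred/play"]
print(sorted(entries, key=lambda x: (x.count(os.sep), -len(x))))
# ['/home/fred/play', '/home/fred/work/f2.txt.bak', '/home/fred/work/f2.txt']

Because f2.txt.bak now comes before f2.txt, the startswith test no longer discards it.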
You can use a set to look up the already-present paths more efficiently, like:
Code:
def remove_redundant_entries(entries):
    present = set()
    result = []
    for entry in sorted(entries):
        path = os.path.abspath(entry).split(os.sep)
        found = any(
            tuple(path[:i+1]) in present for i in range(len(path)))
        if not found:
            result.append(entry)
            present.add(tuple(path))
    return result
Test Code:
import os

entries = [
    "/home/fred/work/f1.txt",
    "/home/fred/work/f2.txt",
    "/home/fred/play/f3.txt",
    "/home/fred/play",
    "/home/jane/dev/f1.txt",
    "/home/jane"]

result = remove_redundant_entries(entries)

expected = ['/home/fred/work/f1.txt', '/home/fred/work/f2.txt',
            '/home/fred/play', '/home/jane']

assert set(result) == set(expected)
tl;dr: Looking to parse a large set of filenames that are a concatenation of two names (container + child) into the original two names, where nomenclature is inconsistent. Python library suggestions or any other guidance appreciated.
I am looking for a way to parse strings for information where the nomenclature and formatting of information within those strings will likely be inconsistent to some degree.
Background
Industry: Automation controls
Problem to be solved:
Time series data is exported from an automation system with a single data point being saved to a single .csv file. (example: If the controls system were an environmental controls system the point might be the measured temperature of a room taken at 15 minute intervals.) It is possible to have an environment where there are a few dozen points that export to CSV files or several thousand points that export to CSV files. The structure that the points are normally stored in is as follows: points are contained within a controller, controllers are integrated under a management system and occasionally management systems could be integrated into another management system. The resulting structure is a simple hierarchical tree.
The filenames associated with the CSV files are assembled from the path structure of each point as follows: Directories are created for the management systems (nested if necessary) and under those directories are the CSV files where the filename is a concatenation of the controller name and the point name.
I have written a python script that processes a monthly export of the CSV files (currently about 5500 of them [growing]) into a structured data store and another that assembles spreadsheets for others to review. Currently, I am using some really ugly regular expressions and even uglier string.find()s with a list of static string values that I have hand entered to parse out control names and point names for each file so that they can be inserted into the structured data store.
Unfortunately, as mentioned above, the nomenclature used in these environments is rarely consistent. Point names vary widely. The point referenced above might be known as ROOMTEMP, RM_T, RM-T, ROOM-T, ZN_T, ZNT, RMT or several other possibilities. This applies to almost any point contained within a controller. Controller names are also somewhat inconsistent: they may be named for the type of device they are controlling, the geographic location of the device or even an asset number associated with the device.
I would very much like to get out of the business of hand writing regular expressions to parse file names every time a new location is added. I would like to write code that reads in filenames and looks for patterns across the filenames and then makes a recommendation for parsing the controller and point name out of each filename. I already have an interface where I can assign controller name and point name to each point object by hand so if there are errors with the parse I can modify the results. Ideally, the patterns created by the existing objects would influence the suggested names of new files being parsed.
Some examples of filenames are as follows:
UNIT1254_SAT.csv, UNIT1254_RMT.csv, UNIT1254_fil.csv, AHU_5311_CLG_O.csv, QE239-01_DISCH_STPT.csv, HX_E2_CHW_Return.csv, Plant_RM221_CHW_Sys_Enable.csv, TU_E7_Actual Clg Setpoint.csv, 1725_ROOMTEMP.csv, 1725_DA_T.csv, 1725_RA_T.csv
The order will always be consistent where it is a concatenation of controller name and then point name. There will most likely be a consistent character used to separate controller name from point name (normally an underscore, but occasionally a dash or some other character.)
Does anyone have any recommendations on how to get started with parsing these file names? I’ve thought through a few ideas but keep shelving them before implementation because I keep finding potential performance issues or failure points. The rest of my code is working pretty much the way I need it to; I just haven’t figured out an efficient or useful way to pull the correct names out of the filename. Unfortunately, it is not an option to modify the names on the control system side to be consistent.
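Since the order is always controller name then point name around a separator, every separator position is a candidate split point, so a naive baseline would simply enumerate them (a hypothetical helper, not code I have yet):

def candidate_splits(name, sep='_'):
    # yield every (controller, point) split at a separator position
    parts = name.split(sep)
    for i in range(1, len(parts)):
        yield sep.join(parts[:i]), sep.join(parts[i:])

For 'AHU_5311_CLG_O' this yields ('AHU', '5311_CLG_O'), ('AHU_5311', 'CLG_O') and ('AHU_5311_CLG', 'O'); the hard part, and the crux of my question, is ranking those candidates across thousands of files.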
I don't know if the following code will help you, but I hope it'll give you at least some idea.
Considering that a filename such as "QE239-01_STPT_1725_ROOMTEMP_DA" can contain the following names
'QE239-01'
'QE239-01_STPT'
'QE239-01_STPT_1725'
'QE239-01_STPT_1725_ROOMTEMP'
'QE239-01_STPT_1725_ROOMTEMP_DA'
'STPT'
'STPT_1725'
'STPT_1725_ROOMTEMP'
'STPT_1725_ROOMTEMP_DA'
'1725'
'1725_ROOMTEMP'
'1725_ROOMTEMP_DA'
'ROOMTEMP'
'ROOMTEMP_DA'
'DA'
as being possible elements (container name or point name) of the filename,
I defined the function treat() to return this list from the name.
Then the code treats all the filenames to find all the possible elements of filenames.
The function is based on the idea that in the chosen example the element ROOMTEMP can't follow the element STPT because STPT_ROOMTEMP isn't a possible container name in this example string since there is 1725 between these two elements.
And then, with the help of a function in the difflib module, I try to discriminate elements that have some similarity, in order to detect patterns under which several name elements can be gathered.
You must play with the value passed to the cutoff parameter (a float between 0 and 1; higher values only group near-identical names) to find what gives the most interesting results for you.
It's certainly far from good, but I didn't understand all aspects of your problem.
s = \
"""UNIT1254_SAT
UNIT1254_RMT
UNIT1254_fil
AHU_5311_CLG_O
QE239-01_DISCH_STPT
HX_E2_CHW_Return
Plant_RM221_CHW_Sys_Enable
TU_E7_Actual Clg Setpoint
1725_ROOMTEMP
1725_DA_T
1725_RA_T
UNT147_ROOMTEMP
TRU_EZ_RM_T
HXX_V2_RM-T
RHXX_V2_ROOM-T
SIX8_ZN_T
Plint_RP228_ZNT
SOHO79_EZ_RMT"""

li = s.split('\n')
print(li)
print('- - - - - - - - - - - - - - - - - ')

import difflib
from pprint import pprint

def treat(name):
    # return every contiguous run of '_'-separated parts of the name
    lu = name.split('_')
    W = []
    while lu:
        W.extend('_'.join(lu[0:x]) for x in range(1, len(lu) + 1))
        lu.pop(0)
    return W

if 0:  # flip to 1 to see the elements produced for a single name
    q = "QE239-01_STPT_1725_ROOMTEMP_DA"
    pprint(treat(q))
    print('==========================================')

WALL = []
for t in li:
    WALL.extend(treat(t))
pprint(WALL)

# group elements that difflib considers similar enough
for x in WALL:
    j = set(difflib.get_close_matches(x, WALL, n=9000000, cutoff=0.7))
    if len(j) > 1:
        print(j, '\n')
I have a Python script and I want to check if a file exists, but I want to ignore case.
E.g.:
path = '/Path/To/File.log'
if os.path.isfile(path):
    return True
The directory may actually look like "/path/TO/fILe.log", but the above should still return True.
Generate, one time, a set S of all absolute paths in the filesystem using os.walk, lowercasing them as you collect them using str.lower.
Iterate through your large list of paths to check for existence, testing each with if my_path.lower() in S (see the sketch after these steps).
(Optional) Go and interrogate whoever provided you the list with inconsistent cases. It sounds like an XY problem; there may be some strange reason for this and an easier way out.
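A minimal sketch of the first two steps (the root '/' is only illustrative; walking the whole filesystem is expensive, so narrow the root to the deepest directory you know contains all your paths):

import os

def build_lowered_index(root):
    # walk the tree once, collecting every absolute path lowercased
    index = set()
    for dirpath, dirnames, filenames in os.walk(root):
        abs_dir = os.path.abspath(dirpath)
        for name in dirnames + filenames:
            index.add(os.path.join(abs_dir, name).lower())
    return index

S = build_lowered_index('/')
print('/Path/To/File.log'.lower() in S)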