I have a list of files and directories. I'm trying to write a function to remove entries where there is also an entry for an ancestor directory present. What I have so far seems to work, but I think it is inefficient because it tests the full list of directories for every file.
Maybe there's a library out there to do this, but I can't find it. The purpose is to allow the user to choose a list of files and directories to upload.
As you can see from the example, directories are a subset of entries. I'd prefer to just provide the entries.
import os

def remove_redundant_entries(entries, directories):
    result = []
    for entry in entries:
        # make a copy and successively get the dirname and test it
        partial_path = entry
        found = False
        while partial_path != os.sep:
            partial_path = os.path.dirname(partial_path)
            if partial_path in directories:
                found = True
                break
        if not found:
            result.append(entry)
    return result
entries = [
    "/home/fred/work/f1.txt",
    "/home/fred/work/f2.txt",
    "/home/fred/play/f3.txt",
    "/home/fred/play",
    "/home/jane/dev/f1.txt",
    "/home/jane"]

directories = [
    "/home/fred/play",
    "/home/jane"]
print(remove_redundant_entries(entries, directories))
# result:
['/home/fred/work/f1.txt', '/home/fred/work/f2.txt', '/home/fred/play', '/home/jane']
If you know of a library or can give a clue to a better algorithm, I'd appreciate it. Meanwhile, I will try something based on sorting the entries, as ancestors should always precede their children in the list.
EDIT: RESULTS
I ran all solutions 10,000 times through the profiler with the test set, with one file added (/home/fred/work/f2.txt.bak) to make sure a regular filename does not cause another to be discarded.
My original code: 1060004 function calls in 0.394 seconds
Stephen Rauch's answer - worked first time: 3250004 function calls in 2.089 seconds
carrdelling's answer - which didn't work for similar filenames: 480004 function calls in 0.146 seconds
carrdelling's edited answer - works for all cases: 680004 function calls in 0.231 seconds
Thanks to everyone who contributed!
If you sort your input list of entries, then the problem is easier:
def remove_redundant_entries(entries):
    split_entries = sorted(entries)
    valid_entries = []
    for entry in split_entries:
        if any(entry.startswith(p) for p in valid_entries):
            continue
        valid_entries.append(entry)
    return valid_entries
Note that any short-circuits as soon as one comparison is true (it will not compare against the whole list unless strictly necessary). Also, since the list comes sorted, you are guaranteed that the output will have the minimum number of (highest-level) paths.
EDIT:
If you also need the ability to keep multiple files from the same folder in the list (even if some file names are prefixes of others), you just need to modify the sorting criterion:
split_entries = sorted(entries, key=lambda x: (x.count(os.sep), -len(x)))
With that, folders that are higher in the tree will come earlier (so you'll end up with the minimum number of paths), but within a folder files with longer names will come earlier - so they won't get discarded because of files with shorter (prefix-like) names.
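For example, a quick sketch (my own demo, not part of the original answer) using the /home/fred/work/f2.txt.bak case from the results above, showing that the modified sort keeps both sibling files:

import os

def remove_redundant_entries(entries):
    # depth first, then longer names first within the same depth
    split_entries = sorted(entries, key=lambda x: (x.count(os.sep), -len(x)))
    valid_entries = []
    for entry in split_entries:
        if any(entry.startswith(p) for p in valid_entries):
            continue
        valid_entries.append(entry)
    return valid_entries

entries = [
    "/home/fred/work/f2.txt",
    "/home/fred/work/f2.txt.bak",
    "/home/fred/play/f3.txt",
    "/home/fred/play"]
print(remove_redundant_entries(entries))
# ['/home/fred/play', '/home/fred/work/f2.txt.bak', '/home/fred/work/f2.txt']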
You can use a set to look up the already-present paths more efficiently, like:
Code:
def remove_redundant_entries(entries):
    present = set()
    result = []
    for entry in sorted(entries):
        path = os.path.abspath(entry).split(os.sep)
        found = any(
            tuple(path[:i+1]) in present for i in range(len(path)))
        if not found:
            result.append(entry)
            present.add(tuple(path))
    return result
Test Code:
import os

entries = [
    "/home/fred/work/f1.txt",
    "/home/fred/work/f2.txt",
    "/home/fred/play/f3.txt",
    "/home/fred/play",
    "/home/jane/dev/f1.txt",
    "/home/jane"]

result = remove_redundant_entries(entries)
expected = ['/home/fred/work/f1.txt', '/home/fred/work/f2.txt',
            '/home/fred/play', '/home/jane']
assert set(result) == set(expected)
The project I am working on has folders of 3 different types in its data directory: green, yellow, and red. There are many folders of each type and their names start with the respective color, as in "RED_folder1". I came upon the following code, whose objective is to generate a list containing all folders of a specific type.
Define target directory:
directory = '/project/data'
Generate list of folders of the red type:
red_folders = [k for k in os.listdir(directory)
               if os.path.isdir(os.path.join(directory, k))
               and 'RED_' in k[:len('RED_')]]
I feel weird about this code because it is creating a list of the k entries in os.listdir(directory) that fit the given criteria, but the criteria themselves also depend on k. In which order is this line of code handled? Is it correct to assume that the whole k in os.listdir(directory) iteration is set up first, and then the if condition is evaluated and inappropriate entries discarded?
My confusion might come from not knowing exactly in which order Python handles multiple operators on the same line, so a short explanation of that would also be nice.
Equivalent code for your list comprehension:

red_folders = []
for k in os.listdir(directory):
    if os.path.isdir(os.path.join(directory, k)) and 'RED_' in k[:len('RED_')]:
        red_folders.append(k)

First, the list of names is generated by os.listdir(directory). Then k iterates over each element of that sequence, and the if condition is evaluated for it. If the condition is true, k is appended.
Now in your list comprehension the same mechanism is at work. The only difference is that instead of red_folders.append(k), a whole new list is generated and then given the name red_folders:

red_folders = new_list_created_by_the_list_comprehension
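As a side note (my suggestion, not part of the question's code): the prefix test 'RED_' in k[:len('RED_')] is equivalent to k.startswith('RED_'), which reads more clearly:

import os

directory = '/project/data'
red_folders = [k for k in os.listdir(directory)
               if os.path.isdir(os.path.join(directory, k))
               and k.startswith('RED_')]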
If you have any queries or are not satisfied with the answer, please tell me.
I need to find whether a file with a specific pattern in its name is available in the current directory or not. I used the following code for this purpose.
import glob

H1 = []
for record_name in my_list:
    file_name = 'RSN' + '_' + record_name[0:5] + '*' + record_name[-8:]
    H1 += glob.glob(file_name)  # glob the constructed pattern, not record_name
It should be noted that I used the above method because in some cases there are differences between the available record_name and the real name of the file in the current directory. For example, the true name of one of my files is "RSN20148_BB40204628_KRPHHZ", while I have "20148_40204628_KRPHHZ" in my_list. Please note that the second one does not have the "RSN" and "BB" terms.
The above procedure works, but the problem is that it takes a lot of time. Is there any suggestion to reduce the time?
Please note that I cannot use os.listdir() to get the names of all files, because the order of files in my_list is important to me.
If the record names are unique, you could create a dictionary (an OrderedDict if Python < 3.6; in later versions plain dicts preserve insertion order) with all the record names set to False.
Then use threading with os.path.exists(path), setting each key to True or False depending on whether that record exists. Dictionary lookups being O(1), combined with threading for the I/O, might give you a performance boost.
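A minimal sketch of that idea; since the record names here are glob patterns rather than exact paths, this sketch uses glob.glob in the worker threads instead of os.path.exists, and my_list is placeholder data:

import glob
from concurrent.futures import ThreadPoolExecutor

def find_record(record_name):
    # pattern construction copied from the question
    file_name = 'RSN' + '_' + record_name[0:5] + '*' + record_name[-8:]
    return record_name, glob.glob(file_name)

my_list = ["20148_40204628_KRPHHZ"]      # placeholder data for the sketch
found = {name: [] for name in my_list}   # insertion order is kept on 3.7+
with ThreadPoolExecutor(max_workers=8) as pool:
    for name, matches in pool.map(find_record, my_list):
        found[name] = matches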
A last note: this is all theoretical, and you would have to implement and optimise it yourself to see whether it gives you a performance boost at all or just adds unnecessary overhead.
Cheers!
I have two sets of paths, with maybe 5000 files in the first set and 10000 in the second. The first set is contained in the second set. I need to check if any of the entries in the second set is a child of any entry in the first set (i.e. a subdirectory or a file inside a directory from the first set). There are some additional requirements:
No operations on the file system, it should be done only on the path strings (except for dealing with symlinks if needed).
Platform independent (e.g. upper/lower case, different separators)
It should be robust with respect to different ways of expressing the same path.
It should deal with both symlinks and their targets.
Some paths will be absolute and some relative.
This should be as fast as possible!
I'm thinking along the lines of getting both os.path.abspath() and os.path.realpath() for each entry and then comparing them with os.path.commonpath([parent]) == os.path.commonpath([parent, child]). I can't come up with a good way of making this fast, though. Or is it safe to just compare the strings directly? That would make it much, much easier. Thanks!
EDIT: I was a bit unclear about the platform independence. It should work on all platforms, but there won't be, for example, Windows and Unix style paths mixed.
You can first calculate the real path of all paths using os.path.realpath, and then use os.path.commonprefix to check whether a path in the second set is a child of one of the first set of paths.
Example:
import os

first = ['a', 'b/x', '/r/c']
second = ['e', 'b/x/t', 'f']

first = set(os.path.realpath(p) for p in first)
second = set(os.path.realpath(p) for p in second)

for s in second:
    if any(os.path.commonprefix([s, f]) == f
           for f in first):
        print(s)
You get:
/full/path/to/b/x/t
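One caveat (my note): os.path.commonprefix compares character by character, so /r/car would appear to be a child of /r/c. If that matters, os.path.commonpath (Python 3.4+) compares whole path components; a sketch of the same loop using it:

import os

first = set(os.path.realpath(p) for p in ['a', 'b/x', '/r/c'])
second = set(os.path.realpath(p) for p in ['e', 'b/x/t', 'f'])

for s in second:
    # commonpath works on path components, not characters
    if any(os.path.commonpath([s, f]) == f for f in first):
        print(s)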
This is my first post, please be gentle. I'm attempting to sort some files into ascending and descending order. Once I have sorted a file, I store it in a list which is assigned to a variable. The user is then to choose a file and search for an item. I get an error message:
TypeError: unorderable types: int() < list()
whenever I try to search for an item using the variable of my sorted list; the error occurs on line 27 of my code. From research, I know that an int and a list cannot be compared, but I can't for the life of me think how else to search a large (600-item) list for an item.
At the moment I'm just playing around with binary search to get used to it.
Any suggestions would be appreciated.
year = []
with open("Year_1.txt") as file:
    for line in file:
        line = line.strip()
        year.append(line)

def selectionSort(alist):
    for fillslot in range(len(alist)-1, 0, -1):
        positionOfMax = 0
        for location in range(1, fillslot+1):
            if alist[location] > alist[positionOfMax]:
                positionOfMax = location
        temp = alist[fillslot]
        alist[fillslot] = alist[positionOfMax]
        alist[positionOfMax] = temp
def binarySearch(alist, item):
    first = 0
    last = len(alist)-1
    found = False
    while first <= last and not found:
        midpoint = (first + last)//2
        if alist[midpoint] == item:
            found = True
        else:
            if item < alist[midpoint]:
                last = midpoint-1
            else:
                first = midpoint+1
    return found
selectionSort(year)
testlist = []
testlist.append(year)
print(binarySearch(testlist, 2014))
Year_1.txt consists of 600 items, all years in the format of 2016. They are listed in descending order, running from 2017 down to 2013. Hope that makes sense.
Is there some reason you're not using the Python bisect module?
Something like:
import bisect

sorted_year = list()
for each in year:
    bisect.insort(sorted_year, each)
... is sufficient to create the sorted list. Then you can search it using functions such as those in the documentation.
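For instance, a membership test built on bisect_left (a sketch, assuming the list holds comparable items of one type):

import bisect

def contains(sorted_list, item):
    # bisect_left returns the index where item would be inserted
    i = bisect.bisect_left(sorted_list, item)
    return i < len(sorted_list) and sorted_list[i] == item

sorted_year = [2013, 2014, 2015, 2016, 2017]
print(contains(sorted_year, 2014))   # True
print(contains(sorted_year, 2020))   # False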
(Actually you could just use year.sort() to sort the list in place ... bisect.insort() might be marginally more efficient for building the list from the input stream in lieu of your call to year.append() ... but my point about using the bisect module remains).
Also note that 600 items is trivial for modern computing platforms. Even 6,000 won't take more than a few milliseconds. On my laptop, sorting 600,000 random integers takes about 180 ms, and similarly sized strings still take under 200 ms.
So you're probably not gaining anything by sorting this list in this application at that data scale.
On the other hand, Python also includes a number of modules in its standard library for managing structured data and data files. For example, you could use Python's sqlite3 module.
Using it, you'd write standard SQL DDL (data definition language) to describe your data structure and schema, SQL DML (data manipulation language: INSERT, UPDATE, and DELETE statements) to manage the contents, and SQL queries to fetch data. Your data can be returned sorted on any column, with any mixture of ascending and descending order over any number of columns, using standard SQL ORDER BY clauses, and you can add indexes to your schema to ensure the data is stored in a manner that enables efficient querying and traversal (table scans) in any order on any key(s) you choose.
Because Python includes SQLite in its standard library, and because SQLite provides SQL database semantics over simple local files, there's almost no downside to using it for structured data. It's not as if you have to install and maintain additional software or servers, or handle network connections to a remote database server.
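A minimal sketch of that approach (the table and column names here are made up for illustration):

import sqlite3

conn = sqlite3.connect(':memory:')        # or a file such as 'years.db'
conn.execute('CREATE TABLE years (value INTEGER)')
conn.executemany('INSERT INTO years (value) VALUES (?)',
                 [(2017,), (2013,), (2015,)])
conn.execute('CREATE INDEX idx_value ON years (value)')

# sorted retrieval and membership tests are plain SQL queries
rows = conn.execute('SELECT value FROM years ORDER BY value DESC').fetchall()
hit = conn.execute('SELECT 1 FROM years WHERE value = ?', (2014,)).fetchone()
print(rows)                # [(2017,), (2015,), (2013,)]
print(hit is not None)     # False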
I'm going to walk through some steps before getting to the answer.
You need to post a minimal reproducible example. Instead of telling us to read from "Year_1.txt", which we don't have, you need to put the list itself in the code. Do you NEED 600 entries to get the error in your code? No. This is sufficient:
year = ["2001", "2002", "2003"]
If you really need 600 entries, then provide them. Either post the actual data, or
year = [str(x) for x in range(2017-600, 2017)]
The code you post needs to be Cut, Paste, Boom - it reproduces the error on my computer just like that.
selectionSort is completely irrelevant to the question, so delete it from the question entirely. In fact, since you say the input was already sorted, I'm not sure what selectionSort is actually supposed to do in your code, either. :)
Next you say testlist = [] followed by testlist.append(year). USE YOUR DEBUGGER before you ask here. Simply looking at the value in your variable would have made the problem obvious.
How to append list to second list (concatenate lists)
Fixing that means you now have a list of things to search. Before, you were searching a list to see if 2014 matched the one thing in there, which was a complete list of all the years.
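A sketch of the difference, using the variables from the question:

testlist = []
testlist.append(year)    # wrong: testlist is [[...all 600 years...]]

testlist = []
testlist.extend(year)    # right: testlist is a flat copy of year's items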
Now we get into binarySearch. If you look at the variables, you see you are comparing the integer 2014 with some string, maybe "1716", and the answer to that is useless, if it even lets you do that (I have Python 2.7, so I am not sure exactly what you get there). But the point is that you can't find the integer 2014 in a list of strings, so it will always return False.
If you don't have a debugger, then you can place strategic print statements like
print ("debug info: binarySearch comparing ", item, alist[midpoint])
Now here, what VBB said in the comments worked for me, after I fixed the other problems. If you are searching for something that isn't even in the list and expecting True, that's wrong. Searching for "2014" returns True, if you provide the correct list to search. Alternatively, you could force the search item to a string before searching, or force all the years to int during the input phase. But the int 2014 is not the same as the string "2014".
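For example, converting during the input phase might look like this (a sketch, assuming every line of Year_1.txt holds one year):

year = []
with open("Year_1.txt") as f:
    for line in f:
        # store ints so comparisons against the int 2014 are valid
        year.append(int(line.strip()))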
In Python, I have a set of filenames and a given directory name (in a loop).
Both the filenames in the set and the given directory name are in the same namespace, e.g. I simply want to see if there are any member strings in the set that start with a given string / prefix.
What's the Python way to see if any filenames in a set start with a given directory prefix?
You can use the built-in any function:
any(x.startswith(prefix) for x in your_set)
This gives you a simple True or False -- depending on whether any of the items meets the criteria. If you want to know which element met your criteria, then you'll need next:
next((x for x in your_set if x.startswith(prefix)), None)
Of course, this only returns one element that meets your criteria -- if you need all of them, then see the answer by jsbueno.
You have to loop over all the set elements - you can get a list of these names by simply doing:
results = [element for element in your_set if element.startswith(your_prefix)]
But by doing that you lose the fast access a set has to its elements, since membership is based on the hashes of the strings it stores. So there is no fast way, using pure sets, to quickly check for prefixes of its elements.