Finding files with a name pattern - python

I need to find whether a file whose name matches a specific pattern exists in the current directory. I used the following code for this purpose:
import glob

H1 = []
for record_name in my_list:
    file_name = 'RSN' + '_' + record_name[0:5] + '*' + record_name[-8:]
    H1 += glob.glob(file_name)
I used the wildcard approach above because in some cases the record_name differs from the real name of the file in the current directory. For example, the true name of one of my files is "RSN20148_BB40204628_KRPHHZ", while my_list contains "20148_40204628_KRPHHZ". Please note that the second one does not have the "RSN" and "BB" parts.
The above procedure works, but the problem is that it takes a lot of time. Is there any suggestion to reduce the time?
Please note that I cannot use os.listdir() to get the names of all the files, because the order of files in my_list is important to me.

Maybe implement an algorithm of your own: if the record names are unique, you could create a dictionary (an OrderedDict on Python < 3.6; from 3.7 on, plain dicts preserve insertion order) with all the record names as keys set to False.
Then use threading with os.path.exists(path) to set each key to True or False depending on whether that record exists. Dictionary lookups being O(1), combined with threading, might give you a performance boost.
A last note: this is all theoretical, and you would have to implement and profile it yourself to see whether it actually gives you a speedup or just adds overhead.
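For what it's worth, here is a rough, untested sketch of that idea using concurrent.futures.ThreadPoolExecutor. The find_record helper and the reuse of the question's glob pattern are my assumptions, not code from the question:

import glob
from concurrent.futures import ThreadPoolExecutor

def find_record(record_name):
    # Reuses the wildcard pattern from the question; adjust as needed.
    pattern = 'RSN' + '_' + record_name[0:5] + '*' + record_name[-8:]
    return record_name, glob.glob(pattern)

# Insertion order of my_list is preserved by the dict (Python 3.7+).
results = {record_name: False for record_name in my_list}

with ThreadPoolExecutor() as pool:
    for record_name, matches in pool.map(find_record, my_list):
        results[record_name] = matches or False

Since glob hits the file system, the threads mostly wait on I/O, which is where a thread pool can help; whether it beats the plain loop depends on your directory size and OS.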
Cheers!

Related

Alternative for nested loop operation in python?

I want a fast alternative to a nested loop operation in which the second loop occurs after some operation in the first loop.
For example:
date = target_date_list = pd.date_range(start=start_date, end=end_date).strftime(f'year=%Y/month=%m/day=%d')
for date in target_date_list:
    folder = f'path_to_folder/{date}'
    for file in folder:
        # some operation
There is no meaningfully faster alternative here. The inner loop's values are dependent on the value generated by the outer loop, so the micro-optimization of using itertools.product isn't available.
If you're actually iterating a directory (not characters in a string describing a directory), I'd strongly recommend using os.scandir over os.listdir (assuming, like many folks, you were using the latter without knowing the former existed), as it's much faster when:
- You're operating on large directories
- You're filtering the contents based on stat info (in particular entry types, which come for free without a stat at all; on Windows you get even more for free, and anywhere else, if you do stat, the result is cached on the entry so you can check multiple results without triggering a re-stat)
With os.scandir, an inner loop previously implemented like:
for file in os.listdir(dir):
    path = os.path.join(dir, file)
    if file.endswith('.txt') and os.path.isfile(path) and os.path.getsize(path) > 4096:
        # do stuff with 4+KB file described by "path"
can be simplified slightly and sped up by changing to:
with os.scandir(dir) as direntries:
    for entry in direntries:
        if entry.name.endswith('.txt') and entry.is_file() and entry.stat().st_size >= 4096:
            # do stuff with 4+KB file described by "entry.path"
but fundamentally, this optimization has nothing to do with avoiding nested loops; if you want to iterate all the files, you have to iterate all the files. A nested loop will need to occur somehow even if you hide it behind utility methods, and the cost will not be meaningful relative to the cost of file system access.
As a rule of thumb, your best bet for better performance in a for loop is to use a generator expression. However, I suspect that the performance boost for your particular example will be minimal, since your outer loop is just a trivial task of assigning a variable to a string.
date = target_date_list = pd.date_range(start=start_date, end=end_date).strftime(f'year=%Y/month=%m/day=%d')
for file in (f'path_to_folder/{date}' for date in target_date_list):
    # some operation

remove redundant entries from list of paths

I have a list of files and directories. I'm trying to write a function to remove entries where there is also an entry for an ancestor directory present. What I have so far seems to work, but I think it is inefficient because it tests the full list of directories for every file.
Maybe there's a library out there to do this, but I can't find it. The purpose is to allow the user to choose a list of files and directories to upload.
As you can see from the example, directories are a subset of entries. I'd prefer to just provide the entries.
import os

def remove_redundant_entries(entries, directories):
    result = []
    for entry in entries:
        # make a copy and successively get the dirname and test it
        partial_path = entry
        found = False
        while partial_path != os.sep:
            partial_path = os.path.dirname(partial_path)
            if partial_path in directories:
                found = True
                break
        if not found:
            result.append(entry)
    return result

entries = [
    "/home/fred/work/f1.txt",
    "/home/fred/work/f2.txt",
    "/home/fred/play/f3.txt",
    "/home/fred/play",
    "/home/jane/dev/f1.txt",
    "/home/jane"]

directories = [
    "/home/fred/play",
    "/home/jane"]

print(remove_redundant_entries(entries, directories))
# result:
# ['/home/fred/work/f1.txt', '/home/fred/work/f2.txt', '/home/fred/play', '/home/jane']
If you know of a library or can give a clue to a better algorithm, I'd appreciate it. Meanwhile, I will try something based on sorting the entries, as ancestors should always precede their children in the list.
EDIT - RESULTS:
I ran all the solutions 10,000 times through the profiler with the test set above, plus one extra file, /home/fred/work/f2.txt.bak, added to make sure a regular filename does not cause another to be discarded.
My original code: 1060004 function calls in 0.394 seconds
Stephen Rauch's answer - worked first time: 3250004 function calls in 2.089 seconds
carrdelling's answer - which didn't work for similar filenames: 480004 function calls in 0.146 seconds
carrdelling's edited answer - works for all cases: 680004 function calls in 0.231 seconds
Thanks to everyone who contributed!
If you sort your input list of entries, then the problem is easier:
def remove_redundant_entries(entries):
    split_entries = sorted(entries)
    valid_entries = []
    for entry in split_entries:
        if any(entry.startswith(p) for p in valid_entries):
            continue
        valid_entries.append(entry)
    return valid_entries
Note that any short-circuits as soon as one comparison is true (it will not compare against the whole list unless strictly necessary). Also, since the list comes sorted, you are guaranteed that the output will have the minimum number of (and highest-level) paths.
EDIT:
If you also need to keep multiple files from the same folder in the list (even if some file names are prefixes of others), you just need to modify the sorting criteria:
split_entries = sorted(entries, key=lambda x: (x.count(os.sep), -len(x)))
With that, folders that are higher in the tree will come earlier (so you'll end up with the minimum number of paths), but within a folder files with longer names will come earlier - so they won't get discarded because of files with shorter (prefix-like) names.
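A quick sanity check of the edited version (this just re-assembles the function above with the modified sorted() call; the .bak file is the case from the question's EDIT):

import os

def remove_redundant_entries(entries):
    split_entries = sorted(entries, key=lambda x: (x.count(os.sep), -len(x)))
    valid_entries = []
    for entry in split_entries:
        if any(entry.startswith(p) for p in valid_entries):
            continue
        valid_entries.append(entry)
    return valid_entries

entries = ["/home/fred/work/f2.txt", "/home/fred/work/f2.txt.bak",
           "/home/jane", "/home/jane/dev/f1.txt"]
print(remove_redundant_entries(entries))
# ['/home/jane', '/home/fred/work/f2.txt.bak', '/home/fred/work/f2.txt']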
You can use a set to look up the already-present paths more efficiently, like:
Code:
def remove_redundant_entries(entries):
    present = set()
    result = []
    for entry in sorted(entries):
        path = os.path.abspath(entry).split(os.sep)
        found = any(
            tuple(path[:i+1]) in present for i in range(len(path)))
        if not found:
            result.append(entry)
            present.add(tuple(path))
    return result
Test Code:
import os

entries = [
    "/home/fred/work/f1.txt",
    "/home/fred/work/f2.txt",
    "/home/fred/play/f3.txt",
    "/home/fred/play",
    "/home/jane/dev/f1.txt",
    "/home/jane"]

result = remove_redundant_entries(entries)
expected = ['/home/fred/work/f1.txt', '/home/fred/work/f2.txt',
            '/home/fred/play', '/home/jane']
assert set(result) == set(expected)

Data structure for filesystem

I'm storing / caching the filesystem (filenames only) in memory to be able to do fast searches, à la Everything. Thus I don't want to use the OS's built-in file search GUI.
I do it with:
import os

L = []
for root, dirs, files in os.walk(PATH):
    L.append([root, files])
and the result is like this:
[['D:\\', ['a.jpg', 'b.jpg']],
...
['D:\\Temp12', ['test.txt', 'test2.txt']]]
The problem is that searching takes too much time when L contains millions of elements:
query = 'test2'  # searching for filenames containing this text
for dir in L:
    for f in dir[1]:
        if query in f:
            print('%s found: %s' % (query, os.path.join(dir[0], f)))
Indeed, this is a very naive search because it requires browsing the whole list to find items.
How to make the queries faster?
It seems that a list is not the right data structure for full-text search; is there a tree-like structure?
Lookups in a list are O(n); lookups in a dictionary are amortized O(1). If you don't need to associate values, use sets.
If you want to read more about this: https://www.ics.uci.edu/~pattis/ICS-33/lectures/complexitypython.txt
In your case, I would use sets. It will make your queries a lot faster.
EDIT:
The way you are doing it, checking every file for a match, can't be made quicker just by switching containers. Even if you use a dict, you would still check every filename for a match.
New idea:
You can create a dict with each filename as a key and its root as the value. This way you can recreate the full path later.
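A minimal sketch of that filename-to-root dict, assuming L was built with os.walk as in the question (note that duplicate filenames in different directories would overwrite each other here):

import os

index = {}
for root, files in L:
    for f in files:
        index[f] = root

# exact-name lookup is now O(1); the full path is rebuilt from the stored root
name = 'test2.txt'
if name in index:
    print('%s found: %s' % (name, os.path.join(index[name], name)))

This only speeds up exact-name lookups; a substring query like 'test2' still has to scan all the keys.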
The idea now is to create a tree where each node is a letter and where the paths between nodes spell out words (the filenames). It could be difficult to implement, and the result may not be faster, depending on how you construct the tree.
You have to remember that you want to check each and every filename, and using a list or a dict won't change that. The tree/graph is the only solution I can think of.
Could you consider using a database for this?
SQLite offers the :memory: option, which creates your database in memory only. Of course you can optimise your algorithm and data structure as pointed out in other answers and comments, but databases are generally already very good at this with their indexing, and you would not need to design something similar yourself.
Your schema could simply be one table with fields full_path and filename; if you indexed it by filename, lookups would be fast. This would store a lot of redundant information, as every file would have the full path in full_path. A better solution would be to have one table for directories and another for files, and you would just reference directories from files to get the full path of a match.
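A rough sketch of the single-table, in-memory SQLite idea (the table and column names here are just an example, not from the question):

import os
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE files (filename TEXT, full_path TEXT)')
conn.execute('CREATE INDEX idx_filename ON files (filename)')

for root, dirs, files in os.walk(PATH):  # PATH as in the question
    conn.executemany('INSERT INTO files VALUES (?, ?)',
                     ((f, os.path.join(root, f)) for f in files))
conn.commit()

# substring search; a leading-wildcard LIKE cannot use the index, so this still
# scans, but exact-name lookups (WHERE filename = ?) are served by the index
for filename, full_path in conn.execute(
        'SELECT filename, full_path FROM files WHERE filename LIKE ?', ('%test2%',)):
    print('found: %s' % full_path)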
Just a thought.
Hannu

Python: How to find duplicate folder names and rename them?

I'm running into some difficulties with python.
I have code I'm using in conjunction with ArcGIS that parses filenames into a database to return the corresponding unique ID and to rename the folder with this unique ID.
It has been working great, but I need to handle some exceptions, such as when the unique ID already exists within the directory, or when the action has already been completed on the directory. The unique ID contains only numbers, so I've been trying:
elif re.findall('[0-9]', fn):
    Roll = string.join(string, "1")
    print(Roll)
    os.rename(os.path.join(basedir, fn),
              os.path.join(basedir, Roll))
which returns all folders with a unique ID. I just can't figure out how to get a count of the number of times a specific folder name occurs in the directory.
I suspect you're making this way harder on yourself than you need to, but answering your immediate question:
folder_name_to_create = 'whatever'
if os.path.exists(folder_name_to_create):
    folder_name_to_create += '_1'
If you are getting name collisions, I suspect you need to look at your "unique" naming algorithm, but maybe I'm misunderstanding what you mean by that.
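If a single '_1' suffix might not be enough, here is a hedged sketch that keeps incrementing a counter until the name is unused (unused_name is a made-up helper, not part of the answer above):

import os

def unused_name(folder_name_to_create):
    # Try name, name_1, name_2, ... until one does not exist yet.
    candidate = folder_name_to_create
    counter = 1
    while os.path.exists(candidate):
        candidate = '%s_%d' % (folder_name_to_create, counter)
        counter += 1
    return candidate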
Add the name to a set and then check whether it's in the set.
One way to do it might be the following: create a dictionary whose keys are your folder names, and whose value for each key is an integer, the number of occurrences of that name. Each time you process a folder, update the dictionary's keys/values appropriately. After you've added all the folder names to the dictionary, check all the count values; any time a count is > 1, you know you have a duplicate.
Or, if you need to detect duplicates as you go, just check whether the key already exists. In that case you don't really need the value at all, and you can use a set or list instead of a dict.
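A minimal sketch of the counting idea above (folder_names is a hypothetical list standing in for however you collect the names):

folder_names = ['1234', '5678', '1234']  # hypothetical example data

counts = {}
for name in folder_names:
    counts[name] = counts.get(name, 0) + 1

duplicates = [name for name, n in counts.items() if n > 1]
print(duplicates)  # ['1234']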
You could use collections.Counter to help you in this. You can see an example usage in this question. It shouldn't be too difficult to adapt that example to your needs.
Hope this helps.

can this python be shorter

I tend to obsess over expressing code as compactly and succinctly as possible without sacrificing runtime efficiency.
Here's my code:
p_audio = plate.parts.filter(content__iendswith=".mp3")
p_video = not p_audio and plate.parts.filter(content__iendswith=".flv")
p_swf = not p_audio and not p_video and plate.parts.filter(content__iendswith=".swf")

extra_context.update({
    'p_audio': p_audio and p_audio[0],
    'p_video': p_video and p_video[0],
    'p_swf': p_swf and p_swf[0]
})
Are there any python/django gurus that can drastically shorten this code?
Actually, in your pursuit of compactness and efficiency, you have managed to come up with code that is terribly inefficient. This is because when you refer to p_audio or not p_audio, that causes the queryset to be evaluated - and because you haven't sliced it before then, the entire result set is brought from the database, e.g. all the parts whose content ends with mp3, and so on.
You should ensure you do the slice for each query first, before you refer to the value of that query. Since you're concerned with code compactness, you probably want to slice with [:1] first, to get a queryset of a single object:
p_audio = plate.parts.filter(content__iendswith=".mp3")[:1]
p_video = not p_audio and plate.parts.filter(content__iendswith=".flv")[:1]
p_swf = not p_audio and not p_video and plate.parts.filter(content__iendswith=".swf")[:1]
and the rest can stay the same.
Edit to add: Because you're only interested in the first element of each list, as evidenced by the fact that you only pass [0] from each into the context. But in your code, not p_audio refers to the original, unsliced queryset: to determine the truth value of the queryset, Django has to evaluate it, which fetches all matching elements from the database and converts them into Python objects. Since you don't actually want those objects, you're doing a lot more work than you need to.
Note though that it's not re-running it every time: just the first time, since after the first evaluation the queryset is cached internally. But as I say, that's already more work than you want.
Besides featuring less redundancy, the following is also much easier to extend with new content types.
kinds = (("p_audio", ".mp3"), ("p_video", ".flv"), ("p_swf", ".swf"))
extra_context.update((key, False) for key, _ in kinds)
for key, ext in kinds:
entries = plate.parts.filter(content__iendswith=ext)
if entries:
extra_context[key] = entries[0]
break
Just adding this as another answer, inspired by Pyroscope's above (as my edit there has to be peer reviewed).
The latest incarnation exploits the fact that the Django template system simply disregards nonexistent context items when they are referenced, so mp3, etc. below do not need to be initialized to False (or 0). So the following meets all the functionality of the code from the OP. The other optimization is that mp3, etc. are used as the key names (instead of "p_audio", etc.).
for key in ['mp3', 'flv', 'swf']:
    entries = plate.parts.filter(content__iendswith=key)[:1]
    extra_context[key] = entries and entries[0]
    if extra_context[key]:
        break
