Data structure for filesystem - python

I'm storing/caching the filesystem (filenames only) in memory to be able to do fast searches à la Everything; hence I don't want to use the OS's built-in file search GUI.
I do it with:
import os
L = []
for root, dirs, files in os.walk(PATH):
    L.append([root, files])
and the result is like this:
[['D:\\', ['a.jpg', 'b.jpg']],
...
['D:\\Temp12', ['test.txt', 'test2.txt']]]
The problem is that searching takes too much time once L contains millions of elements:
query = 'test2'  # searching for filenames containing this text
for dir in L:
    for f in dir[1]:
        if query in f:
            print '%s found: %s' % (query, os.path.join(dir[0], f))
Indeed, this is a very naive search, because it requires scanning the whole list to find matches.
How can I make the queries faster?
It seems that a list is not the right data structure for this kind of full-text search; is there a tree-like structure?

Lookups in a list are O(n); lookups in a dictionary are amortized O(1). If you don't need to associate values, use sets.
If you want more detail on this: https://www.ics.uci.edu/~pattis/ICS-33/lectures/complexitypython.txt
In your case, I would use sets. It will make your queries a lot faster.
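For exact-name lookups, a minimal sketch of that difference, reusing PATH from the question (the EDIT below explains why this does not help substring queries):
import os

filenames = set()
for root, dirs, files in os.walk(PATH):
    filenames.update(files)

print('test2.txt' in filenames)  # amortized O(1) exact-name membership test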
EDIT:
The way you are doing it, checking every filename for a match, cannot really be made faster just by switching containers: even with a dict, you would still check every filename against the query.
New idea:
You can create a dict with each filename as a key and its root as the value. That way you can recreate the full path later.
The next idea is to create a tree where each node is a letter and where the paths between nodes spell out words (filenames), i.e. a trie. It could be difficult to implement, and the result may not be faster depending on how you construct the tree.
You have to remember that you want to check each and every filename, and using a list or a dict won't change that. The tree/graph is the only alternative I can think of.
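As a minimal sketch of the filename-to-root dict described above, reusing PATH and the query from the question (with a list of roots per name, since the same filename can appear in several directories); note that a substring query still has to scan every key:
import os

index = {}
for root, dirs, files in os.walk(PATH):
    for name in files:
        # the same filename can occur in several directories, so keep a list of roots
        index.setdefault(name, []).append(root)

query = 'test2'
for name, roots in index.items():
    if query in name:
        for root in roots:
            print('%s found: %s' % (query, os.path.join(root, name)))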

Could you consider using a database for this?
SQLite offers a :memory: option, which creates the database in memory only. Of course you can optimise your algorithm and data structure as pointed out in other answers and comments, but databases are generally already very good at this thanks to their indexing, and you would not need to design something similar yourself.
Your schema could be as simple as one table with the fields full_path and filename; if you indexed it by filename, lookups would be fast. This would store a lot of redundant information, though, as every file would carry its full path in full_path. A better solution would be one table for directories and another for files, with each file referencing its directory, so you can rebuild the full path of a match.
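A rough sketch of that two-table design (PATH and query are placeholders taken from the question; note that a LIKE pattern with a leading wildcard cannot use the index, so substring matches are still a scan inside SQLite, while exact or prefix lookups do benefit from the index):
import os
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE dirs (id INTEGER PRIMARY KEY, path TEXT)')
conn.execute('CREATE TABLE files (name TEXT, dir_id INTEGER REFERENCES dirs(id))')
conn.execute('CREATE INDEX idx_files_name ON files (name)')

for root, dirs, files in os.walk(PATH):
    dir_id = conn.execute('INSERT INTO dirs (path) VALUES (?)', (root,)).lastrowid
    conn.executemany('INSERT INTO files (name, dir_id) VALUES (?, ?)',
                     ((name, dir_id) for name in files))

# substring search; the index helps exact/prefix lookups, not '%...%' patterns
for path, name in conn.execute(
        'SELECT dirs.path, files.name FROM files '
        'JOIN dirs ON files.dir_id = dirs.id WHERE files.name LIKE ?',
        ('%' + query + '%',)):
    print(os.path.join(path, name))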
Just a thought.
Hannu

Related

How to get elements of a collection that are not in another collection?

I am a Java programmer, but for this task Python (more efficient and better suited for the server) is the better choice, and I am not familiar with it.
About the task: I have a file containing sorted ids (~5 million of them) in this format:
00000011-1f0e-4d89-b658-af53b36c882e
0000008a-5816-4324-82f6-9242a8867094
000000be-d08c-41b9-97f3-594d2660dfb5
000000f2-ea63-48c0-98f6-1dbb25f0249e
0000014d-f6b0-4b3e-b767-14cd2495fd81
00000155-ec3b-4d1a-a3ae-28e95cfc79c7
00000231-65f9-424a-bf03-1d3cbefc6c40
00000281-cb21-4d3c-ba13-874161962567
000002be-6e9d-455d-aa16-49e2ac242868
00000375-4d9a-4dd6-8e0c-38e5c2134a3c
00000383-fc20-4154-921c-c187bb3f6628
000003fc-7a06-4525-a12a-df64732324e5
00000420-af64-4015-9bc4-6b9e18b86183
00000476-1bf9-4608-8979-d60ecd5b368b
...
I also have another file containing ~60 million sorted ids in the same format.
I need to read all ids from the first file into a variable, say l1, and all ids from the second file into l2. After that I want to find all elements of l1 that are absent from l2 and write them to a third file. There are many files of the first kind, which is why I must repeat these actions from time to time.
Please tell me what the best approach is, which object types to use for l1 and l2 (the lists of ids are sorted), and roughly what the Python script would look like.
Elements in the first set that are not in the second set:
s1 = set([1, 2, 3, 4, 5])
s2 = set([3, 4, 5, 6, 7])
s3 = s1 - s2
print(s3)  # {1, 2}
For this file-merge scenario, you can also look up a better algorithm, e.g. a streaming merge that walks both sorted files in parallel instead of loading everything into memory.
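A rough sketch of the set-based approach applied to the files (first.txt, second.txt and missing.txt are placeholder names, not from the question):
with open('first.txt') as f1:
    l1 = set(line.strip() for line in f1)
with open('second.txt') as f2:
    l2 = set(line.strip() for line in f2)

with open('missing.txt', 'w') as out:
    for uid in sorted(l1 - l2):  # ids present in the first file but absent from the second
        out.write(uid + '\n')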

Manage python structures stored in a file as if they are in memory?

I want to manage many files in such a way that each file stays on disk and my app works with only part of the data at a time.
I have to manage two types of files: text files (book-like) and CSV files (time series).
For every file I may generate multiple dimensionally reduced copies, which I want to keep and cache so I don't have to regenerate them.
I can see two ways of doing this:
1. create my own lib that uses memory mapping
2. use a tool such as Dask
Dask seems like a good choice, but I cannot find a way to iterate over a Bag object in a loop and/or access a range of it, i.e.
for i in bag_obj[2:10]: ...
bag_obj[5:10]
I can only do .take().
Second, is there a way to map a list to a file and do normal list operations on it as if it were in memory?
I came up with this; is it the best approach?
def slice(self, pfrom, pto):
    assert self.bag is not None
    return self.bag.take(pto)[pfrom:]
but it does not work because take() returns computed values ;(
This may be a solution:
from dask.bag.core import Bag
def slice(self, pfrom, pto): return self.take(pto)[pfrom:]
Bag.slice = slice
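For the second question (treating on-disk data like an in-memory list), one possible sketch is numpy.memmap, assuming the data is numeric with a fixed dtype (series.dat is a hypothetical file name); slicing works like a normal array and only the touched pages are read from disk:
import numpy as np

# create a file-backed array and write to it; changes go to the mapped file
arr = np.memmap('series.dat', dtype='float64', mode='w+', shape=(1000000,))
arr[:10] = np.arange(10)
arr.flush()

# reopen read-only later and use normal slicing / range access
ro = np.memmap('series.dat', dtype='float64', mode='r', shape=(1000000,))
print(ro[2:10])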

Alternative for nested loop operation in python?

I want a fast alternative to a nested loop operation in which the second loop occurs after some operation in the first loop.
For example:
date = target_date_list = pd.date_range(start=start_date, end=end_date).strftime(f'year=%Y/month=%m/day=%d')
for date in target_date_list:
    folder = f'path_to_folder/{date}'
    for file in folder:
        # some operation
There is no meaningfully faster alternative here. The inner loop's values are dependent on the value generated by the outer loop, so the micro-optimization of using itertools.product isn't available.
If you're actually iterating a directory (not characters in a string describing a directory), I'd strongly recommend using os.scandir over os.listdir (assuming like many folks you were using the latter without knowing the former existed), as it's much faster when:
- You're operating on large directories
- You're filtering the contents based on stat info (in particular entry types, which come for free without a stat call at all; on Windows you get even more for free, and anywhere else, if you do call stat, the result is cached on the entry so you can check multiple attributes without triggering a re-stat)
With os.scandir, an inner loop previously implemented like:
for file in os.listdir(dir):
    path = os.path.join(dir, file)
    if file.endswith('.txt') and os.path.isfile(path) and os.path.getsize(path) > 4096:
        # do stuff with 4+KB file described by "path"
can be simplified slightly and sped up by changing it to:
with os.scandir(dir) as direntries:
    for entry in direntries:
        if entry.name.endswith('.txt') and entry.is_file() and entry.stat().st_size >= 4096:
            # do stuff with 4+KB file described by "entry.path"
but fundamentally, this optimization has nothing to do with avoiding nested loops; if you want to iterate all the files, you have to iterate all the files. A nested loop will need to occur somehow even if you hide it behind utility methods, and the cost will not be meaningful relative to the cost of file system access.
As a rule of thumb, your best bet for better performance in a for loop is to use a generator expression. However, I suspect that the performance boost for your particular example will be minimal, since your outer loop is just a trivial task of assigning a variable to a string.
date = target_date_list = pd.date_range(start=start_date, end=end_date).strftime(f'year=%Y/month=%m/day=%d')
for file in (f'path_to_folder/{date}' for date in target_date_list):
    # some operation

Stepwise creation of a YAML file

I am facing the following problem: I create a big data set (several tens of GB) of Python objects. I want to create an output file in YAML format containing an entry for each object, with information about the object saved as a nested dictionary. However, I never hold all the data in memory at the same time.
The output data should be stored in a dictionary mapping an object name to the saved values. A simple version would look like this:
object_1:
  value_1: 42
  value_2: 23
object_2:
  value_1: 17
  value_2: 13
[...]
object_a_lot:
  value_1: 47
  value_2: 11
To keep a low memory footprint, I would like to write the entry for each object and immediately delete it after writing. My current approach is as follows:
from yaml import dump

[...]  # initialize huge_object_list. Here it is still small

with open("output.yaml", "w") as yaml_file:
    for my_object in huge_object_list:
        my_object.compute()  # this blows up the size of the object
        # create a single entry for the top-level dict
        object_entry = dump(
            {my_object.name: my_object.get_yaml_data()},
            default_flow_style=False,
        )
        yaml_file.write(object_entry)
        my_object.delete_big_stuff()  # delete the memory-consuming stuff in the object, keep other information which is needed later
Basically I am writing several dictionaries, but each has only one key, and since the object names are unique this does not blow up. This works, but it feels like a bit of a hack, and I would like to ask if someone knows a better/more proper way to do this.
Is there a way to write a big dictionary to a YAML file, one entry at a time?
If you want to write out a YAML file in stages, you can do it the way you describe.
If your keys are not guaranteed to be unique, then I would recommend using a sequence (i.e. a list at the top level, even with one item) instead of a mapping.
This doesn't solve the problem of re-reading the file: PyYAML will try to read the file as a whole, and that is not going to load quickly. Keep in mind that the memory overhead PyYAML requires for loading a file can easily be over 100x (a hundred times) the file size. My ruamel.yaml is somewhat better with respect to memory, but still requires several tens of times the file size in memory.
You can of course cut up a file based on "leading" spaces; a new key (or a dash for an item, in case you use sequences) is easy to find that way. You can also look at storing each key-value pair in its own document within one file; that vastly reduces the overhead during loading, if you combine the key-value pairs of the single documents yourself.
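A rough sketch of that multi-document idea with PyYAML, reusing the names from the question and assuming get_yaml_data() returns plain dicts/lists so safe_load_all can read them back; each entry becomes its own document (separated by ---), and the reader merges them one at a time:
import yaml

# writing: one single-key document per object
with open('output.yaml', 'w') as yaml_file:
    for my_object in huge_object_list:
        my_object.compute()
        yaml_file.write(yaml.dump(
            {my_object.name: my_object.get_yaml_data()},
            default_flow_style=False,
            explicit_start=True,  # emits the leading '---' for each document
        ))
        my_object.delete_big_stuff()

# reading: load the documents one at a time and merge them yourself
result = {}
with open('output.yaml') as yaml_file:
    for doc in yaml.safe_load_all(yaml_file):
        result.update(doc)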
In similar situations I stored individual YAML "objects" in different files, using the filenames as keys to the "object" values. This requires a filesystem that handles many small files efficiently (e.g. with tail packing) and depends on what is available for the OS your system runs on.

How to find a specific file in Python

I have a directory with files of the following structure
A2ML1_A8K2U0_MutationOutput.txt
A4GALT_Q9NPC4_MutationOutput.txt
A4GNT_Q9UNA3_MutationOutput.txt
...
The first few letters represent the gene, the next few the Uniprot number (a unique protein identifier), and MutationOutput is self-explanatory.
In Python, I want to execute the following line:
f_outputfile.write(mutation_directory + SOMETHING +line[1+i]+"_MutationOutput.txt\n")
Here, line[1+i] correctly identifies the Uniprot ID.
What I need to do is correctly identify the gene name. So somehow, I need to quickly search over that directory, find the file that has the line[i+1] value in its Uniprot field, and then pull out the gene name.
I know I can list all the files in the directory, then I can do str.split() on each string and find it. But is there a way I can do that smarter? Should I use a dictionary? Can I just do a quick regex search?
The entire directory is about 8,116 files -- so not that many.
Thank you for your help!
What I need to do is correctly identify the gene name. So somehow, I need to quickly search over that directory, find the file that has the line[i+1] value in its Uniprot field, and then pull out the gene name.
Think about how you'd do this in the shell:
$ ls mutation_directory/*_A8K2U0_MutationOutput.txt
mutation_directory/A2ML1_A8K2U0_MutationOutput.txt
Or, if you're on Windows:
D:\Somewhere> dir mutation_directory\*_A8K2U0_MutationOutput.txt
A2ML1_A8K2U0_MutationOutput.txt
And you can do the exact same thing in Python, with the glob module:
>>> import glob
>>> glob.glob('mutation_directory/*_A8K2U0_MutationOutput.txt')
['mutation_directory/A2ML1_A8K2U0_MutationOutput.txt']
And of course you can wrap this up in a function:
>>> def find_gene(uniprot):
... pattern = 'mutation_directory/*_{}_MutationOutput.txt'.format(uniprot)
... return glob.glob(pattern)[0]
But is there a way I can do that smarter? Should I use a dictionary?
Whether that's "smarter" depends on your use pattern.
If you're looking up thousands of files per run, it would certainly be more efficient to read the directory just once and use a dictionary instead of repeatedly searching. But if you're planning on, e.g., reading in an entire file anyway, that's going to take orders of magnitude longer than looking it up, so it probably won't matter. And you know what they say about premature optimization.
But if you want to, you can make a dictionary keyed by the Uniprot number pretty easily:
import os

d = {}
for f in os.listdir('mutation_directory'):
    gene, uniprot, suffix = f.split('_')
    # store the path including the directory so the lookup below returns a usable path
    d[uniprot] = os.path.join('mutation_directory', f)
And then:
>>> d['A8K2U0']
'mutation_directory/A2ML1_A8K2U0_MutationOutput.txt'
Can I just do a quick regex search?
For your simple case, you don't need regular expressions.*
More importantly, what are you going to search? Either you're going to loop—in which case you might as well use glob—or you're going to have to build up an artificial giant string to search—in which case you're better off just building the dictionary.
* In fact, at least on some platforms/implementations, glob is implemented by making a regular expression out of your simple wildcard pattern, but you don't have to worry about that.
You can use glob
In [4]: import glob
In [5]: files = glob.glob('*_Q9UNA3_*')
In [6]: files
Out[6]: ['A4GNT_Q9UNA3_MutationOutput.txt']
