Python search: how to do it efficiently - python

I have a class with two member variables:
class A:
    fullname = ""
    email = ""
A list of A instances is held in memory, and I need to search it by fullname or email. The search has to be fuzzy, resembling a SQL 'like' clause: for example, searching for "abc" should match "dabcd" (and it would be even better if exact matches were shown first).
Should I build an index on 'fullname' and 'email'?
Please suggest, thanks!
EDIT: If I only need exact matches, are two dictionaries keyed by 'fullname' and 'email' the best choice? Some articles say that fetching is O(1).
2nd EDIT: By 'best' I mean the fastest search. As I understand it, a Python dictionary only stores references to the objects, so memory use should not be a problem. I have thousands of records.

Look at the sqlite3 module. You can put your data into an in-memory database, index it, and query it with standard SQL.
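As a rough sketch of that suggestion (the table name people, the helper names build_db and search, and the ordering rule are my own assumptions, not part of the question), an in-memory database with a LIKE query that lists exact matches first could look like this:

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class A:
    fullname: str = ""
    email: str = ""


def build_db(records):
    """Load the records into an in-memory SQLite table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (fullname TEXT, email TEXT)")
    conn.executemany(
        "INSERT INTO people VALUES (?, ?)",
        [(r.fullname, r.email) for r in records],
    )
    # These indexes help exact-match lookups; a LIKE '%term%' pattern
    # still has to scan every row.
    conn.execute("CREATE INDEX idx_fullname ON people (fullname)")
    conn.execute("CREATE INDEX idx_email ON people (email)")
    return conn


def search(conn, term):
    """Return records containing term, with exact matches sorted first."""
    pattern = f"%{term}%"  # note: % and _ inside term are not escaped here
    rows = conn.execute(
        "SELECT fullname, email FROM people "
        "WHERE fullname LIKE ? OR email LIKE ? "
        "ORDER BY (fullname = ? OR email = ?) DESC, fullname",
        (pattern, pattern, term, term),
    )
    return [A(*row) for row in rows]
```

With thousands of records the full scan behind LIKE is usually fast enough; the indexes only pay off for the exact-match case.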

If I only need exact matches, are two dictionaries keyed by 'fullname' and 'email' the best choice?
If by "Best" you mean "best speed", then yes.
Some articles say that fetching is O(1).
That's correct: dictionary lookups are O(1) on average.
Two dictionaries would be fast.
If you want "like" clause behavior, the choice of structure doesn't matter much: a substring match has to examine every key, so most structures are equally slow. A dictionary will work and will be reasonably fast; a list will be about the same speed.
def find_using_like(some_partial_key, dictionary):
    # Return the value of the first key that contains the partial key.
    for k in dictionary:
        if some_partial_key in k:
            return dictionary[k]
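If every match is wanted rather than only the first, and exact matches should come first as the question asks, a plain scan over the list is enough. This is only a sketch: the function name fuzzy_search and the constructor on A are assumptions for illustration, and the comparison is case-sensitive.

```python
class A:
    def __init__(self, fullname="", email=""):
        self.fullname = fullname
        self.email = email


def fuzzy_search(records, term):
    """Return records whose fullname or email contains term, exact matches first."""
    exact, partial = [], []
    for record in records:
        fields = (record.fullname, record.email)
        if term in fields:
            # term equals one of the fields exactly
            exact.append(record)
        elif any(term in field for field in fields):
            # term appears as a substring of one of the fields
            partial.append(record)
    return exact + partial
```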

Related

Fast String "Startswith" Matching for Dict like object

I currently have some code which needs to be very performant, where I am essentially doing a string dictionary key lookup:
class Foo:
    def __init__(self):
        self.fast_lookup = {"a": 1, "b": 2}
    def bar(self, s):
        return self.fast_lookup[s]
self.fast_lookup has O(1) lookup time, and there is no try/if etc. code that would slow down the lookup.
Is there any way to retain this speed while doing a "startswith" lookup instead? With the code above, calling bar with s="az" results in a KeyError; with a "startswith" implementation it would return 1.
NB: I am well aware how I could do this with a regex/startswith statement; I am looking specifically for performance in a startswith dict lookup.
An efficient way to do this would be to use the pyahocorasick module to construct a trie from the possible keys, then use the longest_prefix method to determine how much of a given string matches. If no key matches, it returns 0; otherwise it returns the length of the longest prefix of the string that exists in the automaton.
After installing pyahocorasick, it would look something like:
import ahocorasick
class Foo:
    def __init__(self):
        self.fast_lookup = ahocorasick.Automaton()
        for k, v in {"a": 1, "b": 2}.items():
            self.fast_lookup.add_word(k, v)
    def bar(self, s):
        index = self.fast_lookup.longest_prefix(s)
        if not index:  # No prefix match at all
            raise KeyError(s)
        return self.fast_lookup.get(s[:index])
If it turns out the longest prefix doesn't actually map to a value (say, 'cat' is mapped, but you're looking up 'cab', and no other entry actually maps 'ca' or 'cab'), this will die with a KeyError. Tweak as needed to achieve precise behavior desired (you might need to use longest_prefix as a starting point and try to .get() for all substrings of that length or less until you get a hit for instance).
Note that this isn't the primary purpose of Aho-Corasick (it's an efficient way to search for many fixed strings in one or more long strings in a single pass), but tries as a whole are an efficient way to deal with prefix search of this form, and Aho-Corasick is implemented in terms of tries and provides most of the useful features of tries to make it more broadly useful (as in this case).
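For comparison, the same trie idea can be sketched in pure Python with nested dicts. This is an illustration only, not the pyahocorasick API: the class name PrefixDict and its sentinels are made up here, and a C-backed trie will likely be faster. It returns the value of the longest stored key that is a prefix of the query, so the 'cat'/'cab' case raises KeyError instead of needing a retry loop.

```python
class PrefixDict:
    """Look up the value of the longest stored key that is a prefix of the query."""

    _END = object()      # marks "a key ends at this node"
    _MISSING = object()  # marks "no match found yet"

    def __init__(self, items):
        self._root = {}
        for key, value in items.items():
            node = self._root
            for ch in key:
                node = node.setdefault(ch, {})
            node[self._END] = value

    def __getitem__(self, s):
        node = self._root
        # An empty-string key, if stored, matches every query.
        best = node.get(self._END, self._MISSING)
        for ch in s:
            node = node.get(ch)
            if node is None:
                break  # no stored key continues along this path
            if self._END in node:
                best = node[self._END]  # a longer key matched
        if best is self._MISSING:
            raise KeyError(s)
        return best
```

The lookup cost depends on the length of the query string, not on the number of stored keys.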
I don't fully understand the question, but I would look for ways to reduce the work the lookup has to do. If you know which prefixes the startswith searches will use, you can add those as extra dictionary keys whose values point to the same object. Your dict will get pretty big pretty fast, but I believe it will greatly reduce the lookup time. For a more dynamic method, you could add dict keys for the leading groups of letters, up to three characters, for each entry.
Without actively storing the references for each search, your code will always need to go through the dict's entries one by one until it finds one that matches. You cannot avoid that.

Match a string against a set of keywords associated with a key

Let's say I have a list of text files (song lyrics) to return based on the user input:
song_1.txt
song_2.txt
...
song_n.txt
I'm not happy with the idea of listing all of them at once for the user to choose from, so my initial thought was to create a simple function that takes user input as an argument, performs a search against a list of pre-defined keywords per song and returns the "best matching song" as a response.
I'm fairly new to Python and programming in general, and the best thing I could think of so far is something like this:
keywords = {'song_1': ['hate', 'bad'], 'song_2': ['love', 'good']}
def find_song_by_keyword(user_input):
    for song, keyword in keywords.items():
        if user_input in keyword:
            return song + '.txt'
result = find_song_by_keyword('love')
print(result)
song_2.txt
Then I'm going to read a song from file and return it to the user, but my question is:
What's the best way to match a string to keywords considering the fact I need to trace back the "key"? I have a feeling there's a better solution to match something to keywords instead of using for loop + dictionary with list as a value. Just looking for some directions on that matter in general (I'd appreciate a link to something related to "search" in general, maybe something deeper than that).
There's nothing wrong with using a for-loop, especially since you exit when you find the first match and that appears to be what you need.
However, you could get the key like this as well:
result = next((song for song, words in keywords.items() if 'love' in words), None)
Or, if you don't want to repeat yourself and you need to use this in several places, just wrap that in a function definition of course:
def find_song_by_keyword(user_input):
    return next((song for song, words in keywords.items() if user_input in words), None)
result = find_song_by_keyword('love')
Solutions like inverting the dictionary can be a good idea if you have to do this very frequently, since you're trading space for better performance, but of course the action of inverting takes some time as well, so it doesn't appear to match your use case very well.
User @Barmar makes a good point: the original dictionary could be built as 'search term -> file name' instead of the current 'file name -> search terms'. Depending on how you plan to construct it, you could likely build a dictionary keyed by search term about as easily as the current one.
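For reference, a minimal sketch of the inverted dictionary discussed above. The helper name invert is hypothetical, and the sketch assumes that when two songs share a keyword, the first one seen should win.

```python
keywords = {'song_1': ['hate', 'bad'], 'song_2': ['love', 'good']}


def invert(keywords_by_song):
    """Build a 'search term -> song' dictionary from 'song -> search terms'."""
    song_by_keyword = {}
    for song, words in keywords_by_song.items():
        for word in words:
            # setdefault keeps the first song seen if two songs share a keyword
            song_by_keyword.setdefault(word, song)
    return song_by_keyword


song_by_keyword = invert(keywords)


def find_song_by_keyword(user_input):
    """Return the matching file name, or None if no song has that keyword."""
    song = song_by_keyword.get(user_input)
    return None if song is None else song + '.txt'
```

The inversion is paid for once; every lookup after that is a single dictionary access.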

Best way to pythonically get a list from a dict, extend it and put it back

I'm relatively new to Python and programming and have code that is working; however, I'd like to know if there is a better, more condensed way to achieve the same thing.
My code creates a dictionary with some context key value pairs, then I go and get groups of questions looping a number of times. I want to gather all the questions into my data dictionary, adding the list of questions the first time, and extending it with subsequent loops.
My working code:
data = {
    'name': product_name,
    'question summary': question_summary,
}
for l in loop:
    # <my code gets a list of new questions>
    if 'questions' not in data:
        data['questions'] = new_questions['questions']
    else:
        all_questions = data.get('questions')
        all_questions.extend(new_questions['questions'])
        data['questions'] = all_questions
I've read about using a default dict to enable automatic creation of a dictionary item if it doesn't exist, however I'm not sure how I would define data in the first place as some of its key value pairs aren't lists and I want it to have the extra context key value pairs.
I also feel that the 3 lines of code appending more questions to the list of questions in data (if it exists) should/could be shorter but this doesn't work as data.get() isn't callable
data['questions'] = data.get('questions').extend(new_questions['questions'])
and this doesn't work because extend returns None:
data['questions'] = all_questions.extend(new_questions['questions'])
OK, so I figured out how to condense the 3 lines (see the answer below); however, I'd still like to know if the if/else is good form in this case.
You might be looking for the setdefault method:
data.setdefault('questions', []).extend(new_questions['questions'])
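To show how that one-liner replaces the whole if/else, here is a small self-contained sketch. The sample values and the batches list are invented stand-ins for whatever the loop in the question produces.

```python
# Context values that are not lists live alongside the growing list.
data = {
    'name': 'example product',          # placeholder value
    'question summary': 'example',      # placeholder value
}

# Stand-in for the results fetched on each pass of the loop.
batches = [
    {'questions': ['q1', 'q2']},
    {'questions': ['q3']},
]

for new_questions in batches:
    # setdefault inserts [] on the first pass and returns the stored list,
    # so extend always mutates the list held in data.
    data.setdefault('questions', []).extend(new_questions['questions'])
```

Because setdefault returns the list itself, there is nothing to assign back, which sidesteps the problem of extend returning None.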
Ok so Quack Quack I figured out how to condense the 3 lines - this works:
data['questions'].extend(new_questions['questions'])

The best way to search a file and return required fields from rows

I have the following code that works perfectly. It searches a txt file for an ID number, and if it exists, returns the first and last name.
full listing: https://repl.it/Jau3/0
import csv
#==========Search by ID number. Return Just the Name Fields for the Student
with open("studentinfo.txt","r") as f:
    studentfileReader=csv.reader(f)
    id=input("Enter Id:")
    for row in studentfileReader:
        for field in row:
            if field==id:
                currentindex=row.index(id)
                print(row[currentindex+1]+" "+row[currentindex+2])
File contents
001,Joe,Bloggs,Test1:99,Test2:100,Test3:33
002,Ash,Smith,Test1:22,Test2:63,Test3:99
For teaching and learning purposes, I would like to know if there are any other methods to achieve the same thing (elegant, simple, pythonic) or if perhaps this is indeed the best solution?
My question arises from the fact that it seems possible that there may be an inbuilt method or some function that more efficiently retrieves the current index and searches for fields.....perhaps not though.
Thanks in advance for the discussion and any explanations that I will accept as answers.
If the list keeps this format you could access the field of the row by index to condense it a bit.
for row in studentfileReader:
    if row[0]==id:
        print(row[1]+" "+row[2])
It also avoids a match when the ID appears somewhere other than at the beginning of the row, e.g. "Test1:002".
I don't really know if there is such a thing as a "pythonic" way of finding a record by a matching key, but here is an example that adds a couple of interesting things over your own example and the other answers, like the use of generators and comprehensions. Besides, what is more pythonic than a one-liner?
any is a Python built-in. It might interest you to know that it exists, since it tells you whether at least one item matches, which is the test your loop performs.
with open("studentinfo.txt","r") as f:
    sid=input("Enter Id:")
    # True if any line's first comma-separated field equals the ID
    print(any(line.split(",")[0] == sid for line in f))
You should probably consider using csv.DictReader for this usage, since you have tabular data with consistent columns.
If you only want to retrieve data once, you can simply iterate through the file until the first occurrence of the desired id, as follows:
import csv
def search_by_student_id(id):
    with open('studentinfo.txt','r') as f:
        # Columns are id, first name, surname; the remaining fields go under 'results'
        reader = csv.DictReader(f, ['id', 'first_name', 'surname'],
                                restkey='results')
        for line in reader:
            if line['id'] == id:
                return line['first_name'], line['surname']
print(search_by_student_id('001'))
# ('Joe', 'Bloggs')
If, however, you plan to look up entries from this data multiple times, it pays to build a dictionary, which is more expensive to create but significantly reduces lookup times. Then you could look up data like this:
def build_student_id_dict():
    with open('studentinfo.txt','r') as f:
        reader = csv.DictReader(f, ['id', 'first_name', 'surname'],
                                restkey='results')
        student_id_dict = {}
        for line in reader:
            student_id_dict[line['id']] = line['first_name'], line['surname']
    return student_id_dict
student_by_id_dict = build_student_id_dict()
print(student_by_id_dict['002'])
# ('Ash', 'Smith')
You could read it into a list or, even better in terms of lookup time, a dictionary, and then simply use a membership test:
if id in l or if id in d (l or d being the list / dictionary respectively)
It is an interesting discussion, however, whether this would be the simplest method or whether your existing solution is.
Dictionaries:
# retrieve the value for a particular key
value = d[key]
A note on time complexity and efficiency in the use of dictionaries:
Python mappings must be able to, given a particular key object, determine which (if any) value object is associated with a given key. One simple approach would be to store a LIST of (key, value) pairs, and then search the list sequentially every time a value was requested. Immediately you can see that this would be very slow with a large number of items - in complexity terms, this algorithm would be O(n), where n is referring to the number of items in the mapping.
Python's dictionary is the answer here, although it isn't always the best solution: the implementation reduces the average complexity of dictionary lookups to O(1) by requiring that key objects provide a "hash" function. In your case, because the data you are dealing with is structurally not very complex, it may be easiest to stick with your existing solution, although a dictionary should certainly be considered if time efficiency is what you are after.
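The two approaches described above can be put side by side in a short sketch. The sample rows come from the file contents in the question; the function name lookup_in_pairs is made up for illustration.

```python
# A mapping stored as a plain list of (key, value) pairs.
pairs = [
    ("001", ("Joe", "Bloggs")),
    ("002", ("Ash", "Smith")),
]


def lookup_in_pairs(pairs, key):
    """Sequential search: O(n) in the number of stored pairs."""
    for k, v in pairs:
        if k == key:
            return v
    raise KeyError(key)


# The same data as a dictionary: the key is hashed, so lookups are O(1) on average.
index = dict(pairs)
value = index["002"]
```

Both return the same value; the difference only becomes noticeable once the number of records grows or the lookup is repeated many times.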

Get key from a value(list) in a dic

Is there a way to retrieve a key from its value if the values are lists?
def add(self, name, number):
    if name in self.bok.values():
        print('The name already exists in the telefonbok.')
    else:
        self.bok.update({number: []})
        self.bok[number].append(name)
        print(self.bok)
This works if I only have one element in the list:
self.bok.keys()[self.bok.values().index(my value i want to get the corresponding key)]
But if I insert more elements, it gives me an error saying the value isn't in the list.
In case you are wondering, I'm creating a telephone book using a class and a dictionary, so I'm supposed to give an alias to the number and also be able to change the number for one name, and the alias should also get the new number. I would appreciate any help; sorry if I'm unclear.
If you find yourself wondering "how do I look up a key by its value?" it usually means that your dictionary is going the wrong way, or at least that you should be keeping two dictionaries. This is especially true if you notice yourself ensuring that the values are unique by hand!
At the moment, your conditional is never true (unless self.bok.values() is updated some other way), because name is (presumably) a string whereas self.bok.values() looks like it's a list of lists. If names should only appear once in the telephone book, that's a good hint that you should have a dictionary going the opposite direction.
Assuming you also need the number-to-name lookup, what I would do is add another dictionary to your class, and update them both whenever you add a new name/number pair.
import collections # defaultdict is a very nice object
# elsewhere, presumably in the __init__ method
self.name_to_number = {}
self.number_to_names = collections.defaultdict(list)
def add(self, name, number):
    if name in self.name_to_number:
        print('The name already exists in the telefonbok.')
    else:
        self.name_to_number[name] = number
        self.number_to_names[number].append(name)
If you're dead set on doing it the hard way for whatever reason, the findByValue method in Óscar López's answer is what you need.
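Putting the two-dictionary idea into one runnable class might look like the sketch below. The class name PhoneBook and the change_number method are assumptions added to cover the "alias should also get the new number" requirement from the question; they are not from the answer above.

```python
import collections


class PhoneBook:
    def __init__(self):
        self.name_to_number = {}
        self.number_to_names = collections.defaultdict(list)

    def add(self, name, number):
        """Register name under number; return False if the name is already taken."""
        if name in self.name_to_number:
            return False
        self.name_to_number[name] = number
        self.number_to_names[number].append(name)
        return True

    def change_number(self, name, new_number):
        """Move name, and every alias sharing its number, to new_number."""
        old_number = self.name_to_number[name]
        names = self.number_to_names.pop(old_number)
        for alias in names:
            self.name_to_number[alias] = new_number
        self.number_to_names[new_number].extend(names)
```

Keeping both dictionaries in sync inside the class means callers never have to search values to find a key.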
