I have a very minimal piece of code that performs autocompletion for queries typed by the user, by storing historical data of names (close to 1,000) in a list. Right now it returns the lexicographically smallest suggestion.
The names stored in a list are (fictitious):
names = ["show me 7 wonders of the world","most beautiful places","top 10 places to visit","Population > 1000","Cost greater than 100"]
The queries given by the user can be:
queries = ["10", "greater", ">", "7 w"]
Current Implementation:
class Index(object):
    def __init__(self, words):
        index = {}
        for w in sorted(words, key=str.lower, reverse=True):
            lw = w.lower()
            for i in range(1, len(lw) + 1):
                index[lw[:i]] = w
        self.index = index

    def by_prefix(self, prefix):
        """Return lexicographically smallest word that starts with a given
        prefix.
        """
        return self.index.get(prefix.lower(), 'no matches found')

def typeahead(usernames, queries):
    users = Index(usernames)
    print "\n".join(users.by_prefix(q) for q in queries)
This works fine when a query matches the beginning of one of the stored names, but it fails to provide suggestions for an arbitrary entry (a query that starts somewhere in the middle of a string). It also fails to recognize numbers for the same reason.
I was wondering whether there is a way to add these capabilities to improve my existing implementation.
Any help is greatly appreciated.
It's O(n), but it works. Your function checks whether a name starts with the prefix, but the behavior you describe is checking whether the name contains the query:
def __init__(self, words):
    self.index = sorted(words, key=str.lower, reverse=True)

def by_prefix(self, prefix):
    for item in self.index:
        if prefix in item:
            return item
This gives:
top 10 places to visit
Cost greater than 100
Population > 1000
show me 7 wonders of the world
Just for the record, this takes 0.175 seconds on my PC for 5 queries against 1,000,005 records, with the last 5 records being the matching ones (the worst-case scenario).
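For completeness, here is a sketch of how those two methods drop back into the original Index class; the case-insensitive comparison and the 'no matches found' fallback are my additions, kept to mirror the behavior of the original version:

class Index(object):
    def __init__(self, words):
        self.index = sorted(words, key=str.lower, reverse=True)

    def by_prefix(self, prefix):
        # substring match instead of prefix match
        prefix = prefix.lower()
        for item in self.index:
            if prefix in item.lower():
                return item
        return 'no matches found'

names = ["show me 7 wonders of the world", "most beautiful places",
         "top 10 places to visit", "Population > 1000", "Cost greater than 100"]
queries = ["10", "greater", ">", "7 w"]
print "\n".join(Index(names).by_prefix(q) for q in queries)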
If you are not concerned about performance, you can test if prefix in item: for every item in your list names. This expression matches if prefix is a substring of item, e.g.:
prefix   item       match
'foo'    'foobar'   True
'bar'    'foobar'   True
'ob'     'foobar'   True
...
I think that this is the simplest way to achieve this, but clearly not the fastest.
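A minimal sketch of that containment check (the function name suggest and the fallback string are mine):

def suggest(names, query):
    q = query.lower()
    # return the first stored name that contains the query anywhere
    for name in names:
        if q in name.lower():
            return name
    return 'no matches found'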
Another option is to add more entries to your index, e.g. for the item "most beautiful places":
"most beautiful places"
"beautiful places"
"places"
If you do this, you also get matches if you start typing a word that's not the first word in the sentence. You can modify your code like this to do that:
class Index(object):
    def __init__(self, words):
        index = {}
        for w in sorted(words, key=str.lower, reverse=True):
            lw = w.lower()
            tokens = lw.split(' ')
            for j in range(len(tokens)):
                w_part = ' '.join(tokens[j:])
                for i in range(1, len(w_part) + 1):
                    index[w_part[:i]] = w
        self.index = index
The downside of this approach is that the index gets very large. You could also combine this approach with the one pointed out by Keatinge: store 2-character prefixes for every word in your index dictionary, and store as the dictionary values the lists of entries that contain that prefix.
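A hedged sketch of that combination (class and method names are mine): every word-start suffix of a name contributes its first two characters as a key, the value is the list of names sharing that key, and a query is then checked for containment only against its own small bucket. Queries shorter than two characters fall back to a full scan.

class PrefixIndex(object):
    def __init__(self, words):
        self.buckets = {}
        for w in words:
            tokens = w.lower().split(' ')
            for j in range(len(tokens)):
                # first two characters of every word-start suffix
                key = ' '.join(tokens[j:])[:2]
                bucket = self.buckets.setdefault(key, [])
                if w not in bucket:
                    bucket.append(w)

    def by_query(self, query):
        q = query.lower()
        if len(q) < 2:
            # too short for the bucket index: scan everything
            candidates = [w for b in self.buckets.values() for w in b]
        else:
            candidates = self.buckets.get(q[:2], [])
        for item in candidates:
            if q in item.lower():
                return item
        return 'no matches found'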
Related
I'm trying to create a Telegram bot with Python that sends a new word every day. I already have the schedule function set up. As for the words, I created a class terms and I'm using the pop() method to choose the last word from the list of objects of that class. However, I still haven't been able to automatically update the object list when I add a new term to the class.
Here's the code I have so far:
def dailyword():
    class terms:
        def __init__(self, word, meaning, example):
            self.word = word
            self.meaning = meaning
            self.example = example

    Munchkin = terms("Munchkin", 'a word of endearment used by parents with their children', 'Munchkin, eat your vegetables to grow strong')
    Babe = terms("Babe", 'a word of endearment that couples use', 'Babe, do you want to go surfing this weekend?')
    Sweetie = terms("Sweetie", 'a word used between couples to show affection', "Sweetie, can you take out the trash?")

    objectList = [Munchkin, Babe, Sweetie]

    for x in objectList:
        popObject = objectList.pop()

    newdict = {'word': popObject.word, 'meaning': popObject.meaning, 'example': popObject.example}
    wordoftheday = "BuzzWord of the day is -{word}\nMeaning -{meaning}\nExample -{example}".format(**newdict)
    telegram_bot_sendtext(wordoftheday)
The program sends the same word every time because you iterate over the list while popping items from it, which means the final popObject is always the same element.
You do not need to loop on the list. Just use objectList.pop() once.
Something like below:
def add_word(word, meaning, example):
    new_word = terms(word, meaning, example)
    objectList.append(new_word)

def send_word(objectList):
    popObject = objectList.pop()
    newdict = {'word': popObject.word, 'meaning': popObject.meaning, 'example': popObject.example}
    wordoftheday = "BuzzWord of the day is -{word}\nMeaning -{meaning}\nExample -{example}".format(**newdict)
    telegram_bot_sendtext(wordoftheday)
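A quick usage sketch; this assumes terms, objectList and telegram_bot_sendtext from the question live at module level so both helpers can see them:

class terms:
    def __init__(self, word, meaning, example):
        self.word = word
        self.meaning = meaning
        self.example = example

objectList = [terms("Munchkin",
                    'a word of endearment used by parents with their children',
                    'Munchkin, eat your vegetables to grow strong')]

add_word("Babe", 'a word of endearment that couples use',
         'Babe, do you want to go surfing this weekend?')
send_word(objectList)  # sends "Babe", the most recently added word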
Hope you are enlightened!
I'm trying to create a "translator" of sorts, in which if the raw_input has any curses (pre-determined, I list maybe 6 test ones), the function will output a string with the curse as ****.
This is my code below:
def censor(sequence):
    curse = ('badword1', 'badword2', 'badword3', 'badword4', 'badword5', 'badword6')
    nsequence = sequence.split()
    aword = ''
    bsequence = []
    for x in range(0, len(nsequence)):
        if nsequence[x] != curse:
            bsequence.append(nsequence[x])
        else:
            bsequence.append('*' * (len(x)))
    latest = ''.join(bsequence)
    return bsequence

if __name__ == "__main__":
    print(censor(raw_input("Your sentence here: ")))
A simple approach is to use Python's native string method str.replace:
def censor(string):
    curses = ('badword1', 'badword2', 'badword3', 'badword4', 'badword5', 'badword6')
    for curse in curses:
        string = string.replace(curse, '*' * len(curse))
    return string
To improve efficiency, you could try to compile the list of curses into a regular expression and then do a single replacement operation.
Python Documentation
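A hedged sketch of that regex approach, reusing the same placeholder curse words; re.escape guards against special characters and the lambda keeps every replacement as long as the match:

import re

curses = ('badword1', 'badword2', 'badword3', 'badword4', 'badword5', 'badword6')
pattern = re.compile('|'.join(re.escape(c) for c in curses))

def censor(string):
    # one pass over the input, replacing each match with same-length asterisks
    return pattern.sub(lambda m: '*' * len(m.group(0)), string)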
First, there's no need to iterate over element indices here. Python allows you to iterate over the elements themselves, which is ideal for this case.
Second, you are checking whether each of those words in the given sentence is equal to the entire tuple of potential bad words. You want to check whether each word is in that tuple (a set would be better).
Third, you are mixing up indices and elements when you do len(x) - that assumes that x is the word itself, but it is actually the index, as you use elsewhere.
Fourth, you are joining the sequence within the loop, and on the empty string. You should join it on a space, and only after you've checked each element.
def censor(sequence):
    curse = {'badword1', 'badword2', 'badword3', 'badword4', 'badword5', 'badword6'}
    nsequence = sequence.split()
    bsequence = []
    for x in nsequence:
        if x not in curse:
            bsequence.append(x)
        else:
            bsequence.append('*' * len(x))
    return ' '.join(bsequence)
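For example, with the placeholder bad words above:

print(censor("hello badword1 world"))  # prints: hello ******** world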
I'm trying to implement a PostScript interpreter in Python. For this part of the program, I'm trying to access multiple occurrences of the same element in a list, but the function call does not do that. I can explain it better with the code.
This loop steps through a list of tokens:
for token in tokens:
    process_token(token)
tokens is defined as:
line = "/three 3 def /four 4 def"
tokens = line.strip().split(" ")
So after this is done tokens looks like ['/three', '3', 'def', '/four', '4', 'def'].
process_token will continue to push things onto a stack until it reaches an operation to be performed, in this case def. Once it gets to def it will execute:
if (t == "def"):
handle_def (tokens.index(t)-2, tokens.index(t)-1)
stack.pop()
and here is handle_def():
def handle_def(t, t1):
    name = tokens[t]
    defin = tokens[t1]
    x = name[1:]
    dict1[x] = float(defin)
The problem is that after it adds {'three': 3} to the dictionary, it should keep reading and add {'four': 4} as well. But when handle_def(tokens.index(t) - 2, tokens.index(t) - 1) is called, it passes in the indices for the first occurrence of def, so it just puts {'three': 3} into the dictionary again. I want it to skip past the first occurrence and move on to the later occurrences of def. How do I make it do that?
Sorry for the long post, but I felt it needed the explanation.
list.index gives only the first occurrence in the list. You can use the enumerate function to get the index of the item currently being processed, like this:
for index, token in enumerate(tokens):
    process_token(index, token)
...
...
def process_token(index, t):
    ...
    if t == "def":
        handle_def(index - 2, index - 1)
    ...
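Putting it together, a minimal runnable sketch (with the stack handling from the question stripped out) showing that both occurrences of def end up in the dictionary:

line = "/three 3 def /four 4 def"
tokens = line.strip().split(" ")
dict1 = {}

def handle_def(t, t1):
    name = tokens[t]
    defin = tokens[t1]
    dict1[name[1:]] = float(defin)

for index, token in enumerate(tokens):
    if token == "def":
        # the running index points at this occurrence, not the first one
        handle_def(index - 2, index - 1)

print dict1  # both entries present: 'three' -> 3.0, 'four' -> 4.0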
My problem:
Let's say I have the strings:
ali, aligator, aliance
Because they have a common prefix, I want to store them in a trie, like:
trie['ali'] = None
trie['aligator'] = None
trie['aliance'] = None
So far so good: I can use the trie implementation from the Biopython library.
But what I want to achieve is the ability to find all keys in that trie that contain a particular substring.
For example:
trie['ga'] would return 'aligator' and
trie['li'] would return ('ali','aligator','aliance').
Any suggestions?
Edit: I think you may be looking for a Suffix tree, particularly noting that "Suffix trees also provided one of the first linear-time solutions for the longest common substring problem.".
Just noticed another SO question that seems very related: Finding longest common substring using Trie
I would do something like this:
class Trie(object):
    def __init__(self, strings=None):
        # set the initial strings, if passed
        if strings:
            self.strings = strings
        else:
            self.strings = []

    def __getitem__(self, item):
        # search for the partial string in the string list
        for s in self.strings:
            if item in s:
                yield s

    def __len__(self):
        # just for fun
        return len(self.strings)

    def append(self, *args):
        # append args to existing strings
        for item in args:
            if item not in self.strings:
                self.strings.append(item)
Then:
t1 = Trie()
t1.append("ali","aligator","aliance")
print list(t1['ga'])
print list(t1['li'])
>>['aligator']
>>['ali', 'aligator', 'aliance']
Hi,
I need to filter out all rows that don't contain symbols from a huge "necessary" list. Example code:
def any_it(iterable):
    for element in iterable:
        if element: return True
    return False

regexp = re.compile(r'fruit=([A-Z]+)')
necessary = ['YELLOW', 'GREEN', 'RED', ...] # huge list of 10 000 members

f = open("huge_file", "r") ## file with > 100 000 lines
lines = f.readlines()
f.close()

## File rows like, let's say:
# 1 djhds fruit=REDSOMETHING sdkjld
# 2 sdhfkjk fruit=GREENORANGE lkjfldk
# 3 dskjldsj fruit=YELLOWDOG sldkfjsdl
# 4 gfhfg fruit=REDSOMETHINGELSE fgdgdfg

filtered = (line for line in lines if any_it(regexp.findall(line)[0].startswith(x) for x in necessary))
I have Python 2.4, so I can't use the built-in any().
I wait a long time for this filtering. Is there some way to optimize it? For example, rows 1 and 4 contain the "RED.." pattern; once we have found that "RED.." is OK, can we skip the search through the 10,000-member list for row 4 with the same pattern?
Is there some other way to optimize the filtering?
Thank you.
...edited...
UPD: See real example data in the comments to this post. I'm also interested in sorting the result by "fruits". Thanks!
...end edited...
If you organized the necessary list as a trie, then you could look in that trie to check if the fruit starts with a valid prefix. That should be faster than comparing the fruit against every prefix.
For example (only mildly tested):
import bisect
import re

class Node(object):
    def __init__(self):
        self.children = []
        self.children_values = []
        self.exists = False

    # Based on code at http://docs.python.org/library/bisect.html
    def _index_of(self, ch):
        i = bisect.bisect_left(self.children_values, ch)
        if i != len(self.children_values) and self.children_values[i] == ch:
            return (i, self.children[i])
        return (i, None)

    def add(self, value):
        if len(value) == 0:
            self.exists = True
            return
        i, child = self._index_of(value[0])
        if not child:
            child = Node()
            self.children.insert(i, child)
            self.children_values.insert(i, value[0])
        child.add(value[1:])

    def contains_prefix_of(self, value):
        if self.exists:
            return True
        i, child = self._index_of(value[0])
        if not child:
            return False
        return child.contains_prefix_of(value[1:])

necessary = ['RED', 'GREEN', 'BLUE', 'ORANGE', 'BLACK',
             'LIGHTRED', 'LIGHTGREEN', 'GRAY']

trie = Node()
for value in necessary:
    trie.add(value)

# Find lines that match values in the trie
filtered = []
regexp = re.compile(r'fruit=([A-Z]+)')
for line in open('whatever-file'):
    fruit = regexp.findall(line)[0]
    if trie.contains_prefix_of(fruit):
        filtered.append(line)
This changes your algorithm from O(N * k), where N is the number of elements of necessary and k is the length of fruit, to just O(k) (more or less). It does take more memory though, but that might be a worthwhile trade-off for your case.
I'm convinced Zach's answer is on the right track. Out of curiosity, I've implemented another version (incorporating Zach's comments about using a dict instead of bisect) and folded it into a solution that matches your example.
#!/usr/bin/env python
import re
from trieMatch import PrefixMatch # https://gist.github.com/736416

pm = PrefixMatch(['YELLOW', 'GREEN', 'RED', ]) # huge list of 10 000 members
# if the list is static, it might be worth pickling "pm" to avoid rebuilding it each time

f = open("huge_file.txt", "r") ## file with > 100 000 lines
lines = f.readlines()
f.close()

regexp = re.compile(r'^.*?fruit=([A-Z]+)')
filtered = (line for line in lines if pm.match(regexp.match(line).group(1)))
For brevity, the implementation of PrefixMatch is published here (see the gist linked above).
If your list of necessary prefixes is static or changes infrequently, you can speed up subsequent runs by pickling and reusing the PrefixMatch object instead of rebuilding it each time.
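A hedged sketch of that caching; the file name is mine, and it assumes PrefixMatch instances pickle cleanly and that necessary holds the full prefix list. Plain open/close is used to stay compatible with Python 2.4:

import pickle

try:
    f = open("prefix_match.pickle", "rb")
    pm = pickle.load(f)  # reuse the matcher built on a previous run
    f.close()
except (IOError, EOFError):
    pm = PrefixMatch(necessary)  # rebuild from the 10 000-member list
    f = open("prefix_match.pickle", "wb")
    pickle.dump(pm, f)
    f.close()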
update (on sorted results)
According to the changelog for Python 2.4:
key should be a single-parameter function that takes a list element and returns a comparison key for the element. The list is then sorted using the comparison keys.
also, in the source code, line 1792:
/* Special wrapper to support stable sorting using the decorate-sort-undecorate
pattern. Holds a key which is used for comparisons and the original record
which is returned during the undecorate phase. By exposing only the key
.... */
This means that your regex pattern is only evaluated once for each entry (not once for each comparison), hence it should not be too expensive to do:
sorted_generator = sorted(filtered, key=lambda line: regexp.match(line).group(1))
I personally like your code as it is, since you treat "fruit=COLOR" as a pattern, which the others do not. I think you are after some memoization-like solution that would let you skip the test for an already-solved case, but I guess that is not really the situation here.
from itertools import ifilter

def any_it(iterable):
    for element in iterable:
        if element: return True
    return False

necessary = ['YELLOW', 'GREEN', 'RED', ...]
predicate = lambda line: any_it("fruit=" + color in line for color in necessary)
filtered = ifilter(predicate, open("testest"))
Tested (but unbenchmarked) code:
import re
import fileinput

regexp = re.compile(r'^.*?fruit=([A-Z]+)')
necessary = ['YELLOW', 'GREEN', 'RED', ]

filtered = []
for line in fileinput.input(["test.txt"]):
    try:
        key = regexp.match(line).group(1)
    except AttributeError:
        continue # no match
    for p in necessary:
        if key.startswith(p):
            filtered.append(line)
            break

# "filtered" now holds your results
print "".join(filtered)
Diff to code in question:
We do not load the whole file into memory first (as happens when you use file.readlines()). Instead, each line is processed as the file is read in. I use the fileinput module here for brevity, but one can also use line = file.readline() and a while line: loop (a sketch of that alternative follows this list).
We stop iterating through the necessary list once a match is found.
We modified the regex pattern and use re.match instead of re.findall. That's assuming that each line would only contain one "fruit=..." entry.
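A hedged sketch of that readline/while alternative, using the same regexp and necessary as above:

filtered = []
f = open("test.txt", "r")
line = f.readline()
while line:
    m = regexp.match(line)
    if m:
        key = m.group(1)
        for p in necessary:
            if key.startswith(p):
                filtered.append(line)
                break
    line = f.readline()
f.close()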
update
If the format of the input file is consistent, you can squeeze out a little more performance by getting rid of regex altogether.
try:
    # with line = "2 asdasd fruit=SOMETHING asdasd...."
    key = line.split(" ", 3)[2].split("=")[1]
except:
    continue # no match
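Folded into the earlier loop, a hedged sketch of the regex-free version; IndexError is what the chained splits actually raise when a line doesn't fit the expected format:

import fileinput

necessary = ['YELLOW', 'GREEN', 'RED', ]
filtered = []
for line in fileinput.input(["test.txt"]):
    try:
        # with line = "2 asdasd fruit=SOMETHING asdasd...."
        key = line.split(" ", 3)[2].split("=")[1]
    except IndexError:
        continue  # line does not match the expected format
    for p in necessary:
        if key.startswith(p):
            filtered.append(line)
            break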
filtered = []
for line in open('huge_file'):
    found = regexp.findall(line)
    if found:
        fruit = found[0]
        for x in necessary:
            if fruit.startswith(x):
                filtered.append(line)
                break
or maybe:
necessary = ['fruit=%s' % x for x in necessary]

filtered = []
for line in open('huge_file'):
    for x in necessary:
        if x in line:
            filtered.append(line)
            break
I'd make a simple list like ['fruit=RED', 'fruit=GREEN', ...] with ['fruit=' + n for n in necessary], then use in rather than a regex to test them. I don't think there's any way to do it really quickly, though.
filtered = (line for line in f if any(a in line for a in necessary_simple))
(The any() function is doing the same thing as your any_it() function)
Oh, and get rid of file.readlines(), just iterate over the file.
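A quick sketch of this approach as a whole; it reuses any_it and necessary from the question, since any() isn't available on Python 2.4:

necessary_simple = ['fruit=' + n for n in necessary]

f = open("huge_file", "r")
filtered = [line for line in f if any_it(a in line for a in necessary_simple)]
f.close()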
Untested code:
filtered = []
for line in lines:
    value = line.split('=', 1)[1].split(' ', 1)[0]
    if value not in necessary:
        filtered.append(line)
That should be faster than pattern matching 10 000 patterns onto a line.
Possibly there are even faster ways. :)
It shouldn't take too long to iterate through 100,000 strings, but I see you have a 10,000-string list, which means you iterate 10,000 * 100,000 = 1,000,000,000 times over the strings, so I don't know what you expected...
As for your question: if you encounter a word from the list and you only need one or more matches (if you want exactly one, you need to iterate through the whole list), you can skip the rest, which should optimize the search operation.