My problem:
Let's say I have the strings:
ali, aligator, aliance
Because they share a common prefix, I want to store them in a trie, like:
trie['ali'] = None
trie['aligator'] = None
trie['aliance'] = None
So far so good - I can use the trie implementation from the Biopython library.
But what I want to achieve is the ability to find all keys in that trie that contain a particular substring.
For example:
trie['ga'] would return 'aligator' and
trie['li'] would return ('ali','aligator','aliance').
Any suggestions?
Edit: I think you may be looking for a Suffix tree, particularly noting that "Suffix trees also provided one of the first linear-time solutions for the longest common substring problem.".
Just noticed another SO question that seems very related: Finding longest common substring using Trie
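To illustrate what a suffix/substring index buys you, here is a minimal sketch using plain dicts (not Biopython's trie and not a real suffix tree; the function name is just for illustration) that supports the lookups described in the question:

from collections import defaultdict

def build_substring_index(words):
    # map every substring of every key back to the keys that contain it
    # (fine for short keys; a real suffix tree or trie scales better)
    index = defaultdict(set)
    for w in words:
        for i in range(len(w)):
            for j in range(i + 1, len(w) + 1):
                index[w[i:j]].add(w)
    return index

index = build_substring_index(["ali", "aligator", "aliance"])
print(sorted(index["ga"]))  # ['aligator']
print(sorted(index["li"]))  # ['ali', 'aliance', 'aligator']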
I would do something like this:
class Trie(object):
    def __init__(self, strings=None):
        # set the initial strings, if passed
        if strings:
            self.strings = strings
        else:
            self.strings = []

    def __getitem__(self, item):
        # search for the partial string in the string list
        for s in self.strings:
            if item in s:
                yield s

    def __len__(self):
        # just for fun
        return len(self.strings)

    def append(self, *args):
        # append args to the existing strings
        for item in args:
            if item not in self.strings:
                self.strings.append(item)
Then:
t1 = Trie()
t1.append("ali","aligator","aliance")
print list(t1['ga'])
print list(t1['li'])
>>['aligator']
>>['ali', 'aligator', 'aliance']
I am trying to create an if condition that checks the existence of a certain string and its multiple forms in different lists. The condition currently looks like this.
if ("example_string" in node and "example_string" in node_names) or\
("string_similar_to_example_string" in node and
"string_similar_to_example_string" in node_names):
return response
Here node is a string that is matched against the example string exactly, and node_names is a list of strings that contains strings matching the example string. This logic works right now, but I wanted to know if there is a better way to write it that is more readable and clear.
As you said, your logic is working here.
So, a small function might help with the code readability.
# assuming node and node_names are global here; otherwise you can pass them in as well
def foo(bar):
    """Return True if bar occurs in both node and node_names."""
    return bar in node and bar in node_names

if foo("example_string") or foo("string_similar_to_example_string"):
    return response
I like reducing functions such as all or any. For instance, I think you could do it with any and then just reorganise the sentence.
E.g.:
does_match = any(
    string_to_lookup in node and string_to_lookup in node_names
    for string_to_lookup in ["example_string", "string_similar"]
)
return response if does_match else None
Before I go, I want to point out something: I think your condition for matching the list of strings is wrong; at least, it is different from the match against node.
Suppose you have your "example_string", and node is "example_string_yay" and node_names is ["example_string_1", "another_string_bla", "example_what"]. If you do:
>>> res = "example_string" in node and "example_string" in node_names
>>> print(res)
False
but if you do:
>>> res = "example_string" in node and any("example_string" in n for n in node_names)
>>> print(res)
True
I have a really ugly command where I use many appended "replace()" methods to replace/substitute/scrub many different strings from an original string. For example:
newString = originalString.replace(' ', '').replace("\n", '').replace('()', '').replace('(Deployed)', '').replace('(BeingAssembled)', '').replace('ilo_', '').replace('ip_', '').replace('_ilop', '').replace('_ip', '').replace('backupnetwork', '').replace('_ilo', '').replace('prod-', '').replace('ilo-','').replace('(EndofLife)', '').replace('lctcvp0033-dup,', '').replace('newx-', '').replace('-ilo', '').replace('-prod', '').replace('na,', '')
As you can see, it's a very ugly statement and makes it very difficult to know what strings are in the long command. It also makes it hard to reuse.
What I'd like to do is define an input array of many replacement pairs, where a replacement pair looks like [<ORIGINAL_SUBSTRING>, <NEW_SUBSTRING>]; the greater array would look something like:
replacementArray = [
    [<ORIGINAL_SUBSTRING>, <NEW_SUBSTRING>],
    [<ORIGINAL_SUBSTRING>, <NEW_SUBSTRING>],
    [<ORIGINAL_SUBSTRING>, <NEW_SUBSTRING>],
    [<ORIGINAL_SUBSTRING>, <NEW_SUBSTRING>]
]
AND, I'd like to pass that replacementArray, along with the original string that needs to be scrubbed to a function that has a structure something like:
def replaceAllSubStrings(originalString, replacementArray):
    newString = ''
    for each pair in replacementArray:
        perform the substitution
    return newString
MY QUESTION IS: What is the right way to write the function's code block to apply each pair in the replacementArray? Should I be using the "replace()" method? The "sub()" method? I'm confused as to how to restructure the original code into a nice clean function.
Thanks, in advance, for any help you can offer.
You have the right idea. Use sequence unpacking to iterate each pair of values:
def replaceAllSubStrings(originalString, replacementArray):
    for in_rep, out_rep in replacementArray:
        originalString = originalString.replace(in_rep, out_rep)
    return originalString
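For example, with a few of the pairs taken from the original command (the sample input string here is made up for illustration):

replacementArray = [
    [' ', ''],
    ['(Deployed)', ''],
    ['prod-', ''],
]

print(replaceAllSubStrings('prod-server01 (Deployed)', replacementArray))
# server01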
How about using re?
import re
def make_xlat(*args, **kwds):
    adict = dict(*args, **kwds)
    rx = re.compile('|'.join(map(re.escape, adict)))

    def one_xlat(match):
        return adict[match.group(0)]

    def xlat(text):
        return rx.sub(one_xlat, text)

    return xlat

replaces = {
    "a": "b",
    "well": "hello"
}

replacer = make_xlat(replaces)
replacer("a well?")
# b hello?
You can add as many items in replaces as you want.
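And if your replacements already live in a list of pairs, as in the question, you could build the dict from the pairs first (assuming each pair is ordered [old, new]):

replacementArray = [
    ['(Deployed)', ''],
    ['ilo_', ''],
    ['prod-', ''],
]
replacer = make_xlat(dict(replacementArray))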
I have some very minimalistic code that performs autocompletion for input queries set by the user, by storing historical data of names (close to 1,000) in a list. Right now, it gives the lexicographically smallest suggestion.
The names stored in a list are (fictitious):
names = ["show me 7 wonders of the world","most beautiful places","top 10 places to visit","Population > 1000","Cost greater than 100"]
The queries given by the user can be:
queries = ["10", "greater", ">", "7 w"]
Current Implementation:
class Index(object):
    def __init__(self, words):
        index = {}
        for w in sorted(words, key=str.lower, reverse=True):
            lw = w.lower()
            for i in range(1, len(lw) + 1):
                index[lw[:i]] = w
        self.index = index

    def by_prefix(self, prefix):
        """Return lexicographically smallest word that starts with a given
        prefix.
        """
        return self.index.get(prefix.lower(), 'no matches found')

def typeahead(usernames, queries):
    users = Index(usernames)
    print "\n".join(users.by_prefix(q) for q in queries)
This works fine if the query matches the start of a pre-stored name, but it fails to provide suggestions if the query comes from somewhere in the middle of a string. It also does not recognize numbers and fails for those too.
I was wondering if there is a way to add the above functionality to my existing implementation.
Any help is greatly appreciated.
It's O(n), but it works. Your function checks whether a name starts with the prefix, but the behavior you describe is checking whether the name contains the query:
class Index(object):
    def __init__(self, words):
        self.index = sorted(words, key=str.lower, reverse=True)

    def by_prefix(self, prefix):
        # return the first stored name that contains the query anywhere
        for item in self.index:
            if prefix in item:
                return item
This gives:
top 10 places to visit
Cost greater than 100
Population > 1000
show me 7 wonders of the world
Just for the record, this takes 0.175 seconds on my PC for 5 queries over 1,000,005 records, with the last 5 records being the matching ones (worst-case scenario).
If you are not concerned about performance, you can use if prefix in item: for every item in your list names. This statement matches if prefix is part of the string item, e.g.:
prefix item match
'foo' 'foobar' True
'bar' 'foobar' True
'ob' 'foobar' True
...
I think that this is the simplest way to achieve this, but clearly not the fastest.
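A minimal sketch of that linear scan (the function name by_substring is just for illustration, and names is the list from the question):

def by_substring(names, query):
    # return every stored name that contains the query, case-insensitively
    q = query.lower()
    return [name for name in names if q in name.lower()]

print(by_substring(names, "greater"))  # ['Cost greater than 100']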
Another option is to add more entries to your index, e.g. for the item "most beautiful places":
"most beautiful places"
"beautiful places"
"places"
If you do this, you also get matches if you start typing a word that's not the first word in the sentence. You can modify your code like this to do that:
class Index(object):
    def __init__(self, words):
        index = {}
        for w in sorted(words, key=str.lower, reverse=True):
            lw = w.lower()
            tokens = lw.split(' ')
            for j in range(len(tokens)):
                w_part = ' '.join(tokens[j:])
                for i in range(1, len(w_part) + 1):
                    index[w_part[:i]] = w
        self.index = index
The downside of this approach is that the index gets very large. You could also combine it with the approach pointed out by Keatinge by storing two-character prefixes as the keys of the index dictionary and, as the values, a list of the entries that contain that prefix; see the sketch below.
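A rough sketch of that combined idea, purely illustrative (the class name and details are my assumptions, not code from the answer above):

from collections import defaultdict

class BigramIndex(object):
    def __init__(self, names):
        # bucket every stored name under each two-character substring it contains
        self.buckets = defaultdict(set)
        for name in names:
            lw = name.lower()
            for i in range(len(lw) - 1):
                self.buckets[lw[i:i + 2]].add(name)

    def search(self, query):
        # scan only the bucket for the query's first two characters
        q = query.lower()
        if len(q) < 2:
            return []
        return [name for name in self.buckets.get(q[:2], ()) if q in name.lower()]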
I'm trying to create a "translator" of sorts, in which, if the raw_input contains any curses (pre-determined; I list maybe 6 test ones), the function will output the string with each curse replaced by ****.
This is my code below:
def censor(sequence):
    curse = ('badword1', 'badword2', 'badword3', 'badword4', 'badword5', 'badword6')
    nsequence = sequence.split()
    aword = ''
    bsequence = []
    for x in range(0, len(nsequence)):
        if nsequence[x] != curse:
            bsequence.append(nsequence[x])
        else:
            bsequence.append('*' * (len(x)))
        latest = ''.join(bsequence)
    return bsequence

if __name__ == "__main__":
    print(censor(raw_input("Your sentence here: ")))
A simple approach is to use Python's native string method str.replace:
def censor(string):
    curses = ('badword1', 'badword2', 'badword3', 'badword4', 'badword5', 'badword6')
    for curse in curses:
        string = string.replace(curse, '*' * len(curse))
    return string
To improve efficiency, you could try to compile the list of curses into a regular expression and then do a single replacement operation.
Python Documentation
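A sketch of that single-pass regex variant (same fictitious word list; the function name censor_re is just for illustration):

import re

def censor_re(string):
    curses = ('badword1', 'badword2', 'badword3', 'badword4', 'badword5', 'badword6')
    # one alternation matching any curse; each hit is replaced by same-length asterisks
    pattern = re.compile('|'.join(map(re.escape, curses)))
    return pattern.sub(lambda m: '*' * len(m.group(0)), string)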
First, there's no need to iterate over element indices here. Python allows you to iterate over the elements themselves, which is ideal for this case.
Second, you are checking whether each of those words in the given sentence is equal to the entire tuple of potential bad words. You want to check whether each word is in that tuple (a set would be better).
Third, you are mixing up indices and elements when you do len(x) - that assumes that x is the word itself, but it is actually the index, as you use elsewhere.
Fourth, you are joining the sequence within the loop, and on the empty string. You should join it on a space, and only after you've checked each element.
def censor(sequence):
    curse = {'badword1', 'badword2', 'badword3', 'badword4', 'badword5', 'badword6'}
    nsequence = sequence.split()
    bsequence = []
    for x in nsequence:
        if x not in curse:
            bsequence.append(x)
        else:
            bsequence.append('*' * len(x))
    return ' '.join(bsequence)
Say that we have a multilayered iterable with some strings at the "final" level. Yes, strings are themselves iterable, but I think you get my meaning:
['something',
 ('Diff',
  ('diff', 'udiff'),
  ('*.diff', '*.patch'),
  ('text/x-diff', 'text/x-patch')),
 ('Delphi',
  ('delphi', 'pas', 'pascal', 'objectpascal'),
  ('*.pas',),
  ('text/x-pascal', ['lets', 'put one here'], )),
 ('JavaScript+Mako',
  ('js+mako', 'javascript+mako'),
  ('application/x-javascript+mako',
   'text/x-javascript+mako',
   'text/javascript+mako')),
 ...
]
Is there any convenient way I could implement a search that would give me the indices of the matching strings? I would like something that behaves like this (where the above list is data):
>>> grep('javascript', data)
and it would return [ (2,1,1), (2,2,0), (2,2,1), (2,2,2) ] perhaps. Maybe I'm missing a comparable solution that returns nothing of the sort but can help me find some strings within a multi-layered list of iterables of iterables of .... strings.
I wrote a little bit, but it was seeming juvenile and inelegant, so I thought I would ask here. I guess I could just keep nesting the exception handling the way I started here, up to however many levels the function should support, but I was hoping for something neat, abstract, and Pythonic.
import re

def rgrep(s, data):
    '''Given an iterable of strings or an iterable of iterables of strings,
    returns the index/indices of strings that contain the search string.

    Args::
        s - the string that you are searching for
        data - the iterable of strings or iterable of iterables of strings
    '''
    results = []
    expr = re.compile(s)
    for item in data:
        try:
            match = expr.search(item)
            if match is not None:
                results.append(data.index(item))
        except TypeError:
            for t in item:
                try:
                    m = expr.search(t)
                    if m is not None:
                        results.append((data.index(item), item.index(t)))
                except TypeError:
                    # you can only go 2 deep!
                    pass
    return results
I'd split recursive enumeration from grepping:
def enumerate_recursive(iter, base=()):
    for index, item in enumerate(iter):
        if isinstance(item, basestring):
            yield (base + (index,)), item
        else:
            for pair in enumerate_recursive(item, (base + (index,))):
                yield pair

def grep_index(filt, iter):
    return (index for index, text in iter if filt in text)
This way you can do both non-recursive and recursive grepping:
l = list(grep_index('opt1', enumerate(sys.argv))) # non-recursive
r = list(grep_index('diff', enumerate_recursive(your_data))) # recursive
Also note that we're using iterators here, saving RAM for longer sequences if necessary.
An even more generic solution would be to give grep_index a callable instead of a string, but that might not be necessary for you.
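For illustration, a quick sketch of that callable variant (the name grep_index_pred is mine, not part of the answer above):

def grep_index_pred(pred, indexed):
    # pred is any callable that takes the text and returns True/False
    return (index for index, text in indexed if pred(text))

hits = list(grep_index_pred(lambda t: t.endswith('mako'), enumerate_recursive(your_data)))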
Here is a grep that uses recursion to search the data structure.
Note that good data structures lead the way to elegant solutions, while bad data structures make you bend over backwards to accommodate them. This feels to me like one of those cases where a bad data structure is obstructing rather than helping you. Having a simpler, more uniform data structure (instead of using this grep) might be worth investigating.
#!/usr/bin/env python
data = ['something',
        ('Diff',
         ('diff', 'udiff'),
         ('*.diff', '*.patch'),
         ('text/x-diff', 'text/x-patch', ['find', 'java deep', 'down'])),
        ('Delphi',
         ('delphi', 'pas', 'pascal', 'objectpascal'),
         ('*.pas',),
         ('text/x-pascal', ['lets', 'put one here'], )),
        ('JavaScript+Mako',
         ('js+mako', 'javascript+mako'),
         ('application/x-javascript+mako',
          'text/x-javascript+mako',
          'text/javascript+mako')),
        ]
def grep(astr, data, prefix=[]):
    result = []
    for idx, elt in enumerate(data):
        if isinstance(elt, basestring):
            if astr in elt:
                result.append(tuple(prefix + [idx]))
        else:
            result.extend(grep(astr, elt, prefix + [idx]))
    return result

def pick(data, idx):
    if idx:
        return pick(data[idx[0]], idx[1:])
    else:
        return data

idxs = grep('java', data)
print(idxs)
for idx in idxs:
    print('data[%s] = %s' % (idx, pick(data, idx)))
To get the positions, use enumerate():
>>> data = [('foo', 'bar', 'frrr', 'baz'), ('foo/bar', 'baz/foo')]
>>>
>>> for l1, v1 in enumerate(data):
...     for l2, v2 in enumerate(v1):
...         if 'f' in v2:
...             print l1, l2, v2
...
0 0 foo
0 2 frrr
1 0 foo/bar
1 1 baz/foo
In this example I am using a simple 'foo' in bar match, but you would probably use a regex for the job.
Obviously, enumerate() can also support more than two levels of nesting, as in your edited post.