How to implement the remove function of a trie in Python?

I've read the following implementation of the trie in python:
https://stackoverflow.com/a/11016430/2225221
and tried to write a remove function for it.
Basically, I had problems right from the start: a word you want to remove from a trie can have sub-"words", or it can be a "subword" of another word.
If you remove it with "del dict[key]", you also remove both of these kinds of words.
Could anyone help me with this: how do I properly remove the chosen word (let us presume it's in the trie)?

Basically, to remove a word from the trie (as it is implemented in the answer you linked to), you'd just have to remove its _end marker, for example like this:
def remove_word(trie, word):
    current_dict = trie
    for letter in word:
        current_dict = current_dict.get(letter, None)
        if current_dict is None:
            # the trie doesn't contain this word.
            break
    else:
        del current_dict[_end]
Note, however, that this doesn't ensure that the trie has its minimal size. After deleting the word, there may be branches left in the trie that are no longer used by any words. This doesn't affect the correctness of the data structure; it just means that the trie may consume more memory than absolutely necessary. You could improve this by iterating backwards from the leaf node and deleting branches until you find a node that has more than one child.
EDIT: Here's an idea how you could implement a remove function that also culls any unnecessary branches. There's probably a more efficient way to do it, but this might get you started:
def remove_word2(trie, word):
    current_dict = trie
    path = [current_dict]
    for letter in word:
        current_dict = current_dict.get(letter, None)
        path.append(current_dict)
        if current_dict is None:
            # the trie doesn't contain this word.
            break
    else:
        if not path[-1].get(_end, None):
            # the trie doesn't contain this word (but a prefix of it).
            return
        deleted_branches = []
        for current_dict, letter in zip(reversed(path[:-1]), reversed(word)):
            if len(current_dict[letter]) <= 1:
                deleted_branches.append((current_dict, letter))
            else:
                break
        if len(deleted_branches) > 0:
            del deleted_branches[-1][0][deleted_branches[-1][1]]
        del path[-1][_end]
Essentially, it first finds the "path" to the word that is about to be deleted and then iterates through that backwards to find nodes that can be removed. It then removes the root of the path that can be deleted (which also implicitly deletes the _end node).

I think it is better to do it recursively, as in the following code:
def remove(self, word):
    self.delete(self.tries, word, 0)

def delete(self, dicts, word, i):
    if i == len(word):
        if 'end' in dicts:
            del dicts['end']
            if len(dicts) == 0:
                return True
            else:
                return False
        else:
            return False
    else:
        if word[i] in dicts and self.delete(dicts[word[i]], word, i + 1):
            if len(dicts[word[i]]) == 0:
                del dicts[word[i]]
                return True
            else:
                return False
        else:
            return False

def remove_a_word_util(self, word, idx, node):
    if len(word) == idx:
        node.is_end_of_word = False
        return bool(node.children)
    ch = word[idx]
    if ch not in node.children:
        return True
    flag = self.remove_a_word_util(word, idx+1, node.children[ch])
    if flag:
        return True
    node.children.pop(ch)
    return bool(node.children) or node.is_end_of_word

One method of handling structures like this is through recursion. The great thing about recursion in this case is that it zips to the bottom of the trie, then passes the returned values back up through the branches.
The following function does just that. It goes to the leaf and deletes the _end value, just in case the input word is a prefix of another. It then passes up a boolean (boo) which indicates that the current_dict is still in an outlying branch. Once we hit a point where the current dict has more than one child, we delete the appropriate branch and set boo to False so each remaining recursion will do nothing.
def trie_trim(term, trie=SYNONYMS, prev=0):
    # checks that we haven't hit the end of the word
    if term:
        first, rest = term[0], term[1:]
        current_length = len(trie)
        next_length, boo = trie_trim(rest, trie=trie[first], prev=current_length)
        # this statement avoids trimming excessively if the input is a prefix because
        # if the word is a prefix, the first returned value will be greater than 1
        if boo and next_length > 1:
            boo = False
        # this statement checks for the first occurrence of the current dict having more than one child
        # or it checks that we've hit the bottom without trimming anything
        elif boo and (current_length > 1 or not prev):
            del trie[first]
            boo = False
        return current_length, boo
    # when we do hit the end of the word, delete _end
    else:
        del trie[_end]
        return len(trie) + 1, True

A bit of a long one, but I hope this helps answer your question:
class Trie:
    WORD_END = "$"

    def __init__(self):
        self.trie = {}

    def insert(self, word):
        cur = self.trie
        for char in word:
            if char not in cur:
                cur[char] = {}
            cur = cur[char]
        cur[Trie.WORD_END] = word

    def delete(self, word):
        def _delete(word, cur_trie, i=0):
            if i == len(word):
                if Trie.WORD_END not in cur_trie:
                    raise ValueError("'%s' is not registered in the trie..." %word)
                cur_trie.pop(Trie.WORD_END)
                if len(cur_trie) > 0:
                    return False
                return True
            if word[i] not in cur_trie:
                raise ValueError("'%s' is not registered in the trie..." %word)
            cont = _delete(word, cur_trie[word[i]], i+1)
            if cont:
                cur_trie.pop(word[i])
                # keep pruning only if this node is now empty, i.e. it neither
                # ends a word nor has other children (e.g. 'baz' next to 'bar')
                return len(cur_trie) == 0
            return False
        _delete(word, self.trie)

t = Trie()
t.insert("bar")
t.insert("baraka")
t.insert("barakalar")
t.delete("barak") # raises error as 'barak' is not a valid WORD_END although it is a valid path.
t.delete("bareka") # raises error as 'e' does not exist in the path.
t.delete("baraka") # deletes the WORD_END of 'baraka' without deleting any letter as there is 'barakalar' afterwards.
t.delete("barakalar") # deletes until the previous word (until the first Trie.WORD_END; "$" - by going backwards with recursion) in the same path (here 'bar', because 'baraka' was deleted on the previous line).

In case you need the whole DS:
class TrieNode:
    def __init__(self):
        self.children = {}
        self.wordCounter = 0    # number of inserted words that end at this node
        self.prefixCounter = 0  # number of inserted words that pass through this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        node.prefixCounter += 1  # every word passes through the root
        for char in word:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
            node.prefixCounter += 1
        node.wordCounter += 1

    def _find(self, s: str):
        # returns the node reached by following s, or None if the path is missing
        node = self.root
        for char in s:
            node = node.children.get(char)
            if node is None:
                return None
        return node

    def countWordsEqualTo(self, word: str) -> int:
        node = self._find(word)
        return node.wordCounter if node is not None else 0

    def countWordsStartingWith(self, prefix: str) -> int:
        node = self._find(prefix)
        return node.prefixCounter if node is not None else 0

    def erase(self, word: str) -> None:
        if self.countWordsEqualTo(word) == 0:
            return  # nothing to erase
        node = self.root
        node.prefixCounter -= 1
        for char in word:
            child = node.children[char]
            child.prefixCounter -= 1
            if child.prefixCounter == 0:
                # no remaining word uses this branch, so drop it entirely
                del node.children[char]
                return
            node = child
        node.wordCounter -= 1

trie = Trie()
trie.insert("apple")  # Inserts "apple".
trie.insert("apple")  # Inserts another "apple".
print(trie.countWordsEqualTo("apple"))     # There are two instances of "apple" so return 2.
print(trie.countWordsStartingWith("app"))  # "app" is a prefix of "apple" so return 2.
trie.erase("apple")  # Erases one "apple".
print(trie.countWordsEqualTo("apple"))     # Now there is only one instance of "apple" so return 1.
print(trie.countWordsStartingWith("app"))  # return 1
trie.erase("apple")  # Erases "apple". Now the trie is empty.
print(trie.countWordsEqualTo("apple"))     # return 0
print(trie.countWordsStartingWith("app"))  # return 0

I would argue that this implementation is the most succinct and easiest to understand after a bit of staring.
def removeWord(self, word, node=None):
    if node is None:
        node = self.root
    if word == "":
        node.isEnd = False
        return
    newnode = node.children[word[0]]
    self.removeWord(word[1:], newnode)
    if not newnode.isEnd and len(newnode.children) == 0:
        del node.children[word[0]]
Although it's a little tricky to understand with the default parameter node=None at first, this is the most succinct implementation of a Trie removal that handles marking the word node.isEnd = False while also pruning extraneous nodes.
The method is first called as Trie.removeWord("ToBeDeletedWord").
In subsequent recursion calls, a node tied to the corresponding letter ("T" then "o" then "B" then "e" etc. etc.) is added to the next recursion (e.g "remove 'oBeDeletedWord' with the node at T").
Once we hit the end node that has the full string ToBeDeletedWord , the last recursion calls removeWord("", <node d>)
In this last recursion call, we mark node.isEnd = False. Later, the node is no longer marked isEnd and it has no children so we can call the delete operator.
Once that last recursion call ends, the rest of the recursions (e.g TobeDeletedWor, TobeDeletedWo, TobeDeletedW, etc. etc.) will then observe that it too is not an end node and there are no more children. These nodes will also delete.
You will have to read this a couple of times but this implementation is concise, readable, and correct. The difficulty is that the recursion happens midfunction rather than at the beginning or end.
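For reference, here is a self-contained sketch of how the method might sit inside a class. The `TrieNode`/`Trie` scaffolding, `insert`, and `contains` are assumptions added for illustration (only `removeWord` comes from the answer), and the sketch assumes the word being removed is actually in the trie.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # letter -> TrieNode
        self.isEnd = False   # True if a word ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.isEnd = True

    def contains(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.isEnd

    def removeWord(self, word, node=None):
        # assumes `word` is present; a missing letter raises KeyError
        if node is None:
            node = self.root
        if word == "":
            node.isEnd = False
            return
        newnode = node.children[word[0]]
        self.removeWord(word[1:], newnode)
        # prune on the way back up: drop children that no longer lead anywhere
        if not newnode.isEnd and len(newnode.children) == 0:
            del node.children[word[0]]


t = Trie()
for w in ("to", "tob", "tea"):
    t.insert(w)
t.removeWord("tob")
print(t.contains("tob"), t.contains("to"), t.contains("tea"))  # False True True
print(sorted(t.root.children["t"].children))  # ['e', 'o']
print(t.root.children["t"].children["o"].children)  # {}
```

Here the 'b' node is pruned, but the 'o' node survives because it still marks the end of "to".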

TL;DR
class TrieNode:
    children: dict[str, "TrieNode"]

    def __init__(self) -> None:
        self.children = {}
        self.end = False

    def __contains__(self, char: str) -> bool:
        return char in self.children

    def __getitem__(self, __name: str) -> "TrieNode":
        return self.children[__name]

    def __setitem__(self, __name: str, __value: "TrieNode") -> None:
        self.children[__name] = __value

    def __len__(self):
        return len(self.children)

    def __delitem__(self, __name: str):
        del self.children[__name]

class Trie:
    def __init__(self, words: list[str]) -> None:
        self.root = TrieNode()
        for w in words:
            self.insert(w)

    def insert(self, word: str):
        curr = self.root
        for c in word:
            curr = curr.children.setdefault(c, TrieNode())
        curr.end = True

    def remove(self, word: str):
        def _remove(node: TrieNode, index: int) -> bool:
            # returns True when `node` is no longer needed and can be deleted
            if index >= len(word):
                node.end = False
            elif word[index] in node:
                if _remove(node[word[index]], index + 1):
                    del node[word[index]]
            # reporting this at every level lets pruning continue up the path
            return not node.children and not node.end
        _remove(self.root, 0)

Related

Fastest way to store strings then find all stored strings that start with substring [duplicate]

I'm interested in tries and DAWGs (directed acyclic word graphs) and I've been reading a lot about them, but I don't understand what the output trie or DAWG file should look like.
Should a trie be an object of nested dictionaries, where each word is divided into letters and so on?
Would a lookup performed on such a dictionary be fast if there are 100k or 500k entries?
How to implement word-blocks consisting of more than one word separated with - or space?
How to link prefix or suffix of a word to another part in the structure? (for DAWG)
I want to understand the best output structure in order to figure out how to create and use one.
I would also appreciate knowing what the output of a DAWG should be, along with that of a trie.
I do not want to see graphical representations with bubbles linked to each other, I want to know the output object once a set of words are turned into tries or DAWGs.
Unwind is essentially correct that there are many different ways to implement a trie; and for a large, scalable trie, nested dictionaries might become cumbersome -- or at least space inefficient. But since you're just getting started, I think that's the easiest approach; you could code up a simple trie in just a few lines. First, a function to construct the trie:
>>> _end = '_end_'
>>>
>>> def make_trie(*words):
...     root = dict()
...     for word in words:
...         current_dict = root
...         for letter in word:
...             current_dict = current_dict.setdefault(letter, {})
...         current_dict[_end] = _end
...     return root
...
>>> make_trie('foo', 'bar', 'baz', 'barz')
{'b': {'a': {'r': {'_end_': '_end_', 'z': {'_end_': '_end_'}},
             'z': {'_end_': '_end_'}}},
 'f': {'o': {'o': {'_end_': '_end_'}}}}
If you're not familiar with setdefault, it simply looks up a key in the dictionary (here, letter or _end). If the key is present, it returns the associated value; if not, it assigns a default value to that key and returns the value ({} or _end). (It's like a version of get that also updates the dictionary.)
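A tiny illustration of that behaviour (the dictionary `d` is just a throwaway example, not part of the trie code):

```python
d = {'a': {'x': 1}}

# key present: the stored value is returned and the dict is left unchanged
existing = d.setdefault('a', {})
print(existing)  # {'x': 1}

# key absent: the default is stored under the key and that same object is returned
created = d.setdefault('b', {})
print(created)            # {}
print(d)                  # {'a': {'x': 1}, 'b': {}}
print(created is d['b'])  # True, so mutating `created` mutates `d` in place
```

That identity is what lets `make_trie` walk and extend the nested dictionaries with a single call per letter.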
Next, a function to test whether the word is in the trie:
>>> def in_trie(trie, word):
...     current_dict = trie
...     for letter in word:
...         if letter not in current_dict:
...             return False
...         current_dict = current_dict[letter]
...     return _end in current_dict
...
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'baz')
True
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'barz')
True
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'barzz')
False
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'bart')
False
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'ba')
False
I'll leave insertion and removal to you as an exercise.
Of course, Unwind's suggestion wouldn't be much harder. There might be a slight speed disadvantage in that finding the correct sub-node would require a linear search. But the search would be limited to the number of possible characters -- 27 if we include _end. Also, there's nothing to be gained by creating a massive list of nodes and accessing them by index as he suggests; you might as well just nest the lists.
Finally, I'll add that creating a directed acyclic word graph (DAWG) would be a bit more complex, because you have to detect situations in which your current word shares a suffix with another word in the structure. In fact, this can get rather complex, depending on how you want to structure the DAWG! You may have to learn some stuff about Levenshtein distance to get it right.
Here is a list of python packages that implement Trie:
marisa-trie - a C++ based implementation.
python-trie - a simple pure python implementation.
PyTrie - a more advanced pure python implementation.
pygtrie - a pure python implementation by Google.
datrie - a double array trie implementation based on libdatrie.
Have a look at this:
https://github.com/kmike/marisa-trie
Static memory-efficient Trie structures for Python (2.x and 3.x).
String data in a MARISA-trie may take up to 50x-100x less memory than
in a standard Python dict; the raw lookup speed is comparable; trie
also provides fast advanced methods like prefix search.
Based on marisa-trie C++ library.
Here's a blog post from a company using marisa trie successfully:
https://www.repustate.com/blog/sharing-large-data-structure-across-processes-python/
At Repustate, much of our data models we use in our text analysis can be represented as simple key-value pairs, or dictionaries in Python lingo. In our particular case, our dictionaries are massive, a few hundred MB each, and they need to be accessed constantly. In fact for a given HTTP request, 4 or 5 models might be accessed, each doing 20-30 lookups. So the problem we face is how do we keep things fast for the client as well as light as possible for the server.
...
I found this package, marisa tries, which is a Python wrapper around a C++ implementation of a marisa trie. “Marisa” is an acronym for Matching Algorithm with Recursively Implemented StorAge. What’s great about marisa tries is the storage mechanism really shrinks how much memory you need. The author of the Python plugin claimed 50-100X reduction in size – our experience is similar.
What’s great about the marisa trie package is that the underlying trie structure can be written to disk and then read in via a memory mapped object. With a memory mapped marisa trie, all of our requirements are now met. Our server’s memory usage went down dramatically, by about 40%, and our performance was unchanged from when we used Python’s dictionary implementation.
There are also a couple of pure-python implementations, though unless you're on a restricted platform you'd want to use the C++ backed implementation above for best performance:
https://github.com/bdimmick/python-trie
https://pypi.python.org/pypi/PyTrie
Modified from senderle's method (above). I found that Python's defaultdict is ideal for creating a trie or a prefix tree.
from collections import defaultdict

class Trie:
    """
    Implement a trie with insert, search, and startsWith methods.
    """
    def __init__(self):
        self.root = defaultdict()

    # @param {string} word
    # @return {void}
    # Inserts a word into the trie.
    def insert(self, word):
        current = self.root
        for letter in word:
            current = current.setdefault(letter, {})
        current.setdefault("_end")

    # @param {string} word
    # @return {boolean}
    # Returns if the word is in the trie.
    def search(self, word):
        current = self.root
        for letter in word:
            if letter not in current:
                return False
            current = current[letter]
        if "_end" in current:
            return True
        return False

    # @param {string} prefix
    # @return {boolean}
    # Returns if there is any word in the trie
    # that starts with the given prefix.
    def startsWith(self, prefix):
        current = self.root
        for letter in prefix:
            if letter not in current:
                return False
            current = current[letter]
        return True

# Now test the class
test = Trie()
test.insert('helloworld')
test.insert('ilikeapple')
test.insert('helloz')
print(test.search('hello'))
print(test.startsWith('hello'))
print(test.search('ilikeapple'))
There's no "should"; it's up to you. Various implementations will have different performance characteristics, take various amounts of time to implement, understand, and get right. This is typical for software development as a whole, in my opinion.
I would probably first try having a global list of all trie nodes so far created, and representing the child-pointers in each node as a list of indices into the global list. Having a dictionary just to represent the child linking feels too heavy-weight, to me.
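As a rough sketch of that idea (all names here, such as `IndexTrie`, are hypothetical and not from any library): every node lives in one flat list, and each node stores its children as a small list of (letter, index) pairs that is searched linearly.

```python
class IndexTrie:
    """Trie whose nodes live in one global list and refer to each other by index."""

    def __init__(self):
        # each node is [is_word, children]; children is a list of (letter, node_index)
        self.nodes = [[False, []]]  # index 0 is the root

    def _child(self, index, letter):
        # linear search through at most alphabet-size entries
        for ch, child_index in self.nodes[index][1]:
            if ch == letter:
                return child_index
        return None

    def insert(self, word):
        index = 0
        for letter in word:
            child = self._child(index, letter)
            if child is None:
                self.nodes.append([False, []])
                child = len(self.nodes) - 1
                self.nodes[index][1].append((letter, child))
            index = child
        self.nodes[index][0] = True

    def search(self, word):
        index = 0
        for letter in word:
            index = self._child(index, letter)
            if index is None:
                return False
        return self.nodes[index][0]


t = IndexTrie()
for w in ("foo", "bar", "baz"):
    t.insert(w)
print(t.search("bar"), t.search("ba"))  # True False
print(len(t.nodes))  # 8 nodes: root, f, o, o, b, a, r, z
```

Whether this is actually lighter than nested dictionaries depends on the data; the sketch only shows the shape of the structure.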
Using defaultdict and reduce function.
Create Trie
from functools import reduce
from collections import defaultdict
T = lambda : defaultdict(T)
trie = T()
reduce(dict.__getitem__,'how',trie)['isEnd'] = True
Trie :
defaultdict(<function __main__.<lambda>()>,
{'h': defaultdict(<function __main__.<lambda>()>,
{'o': defaultdict(<function __main__.<lambda>()>,
{'w': defaultdict(<function __main__.<lambda>()>,
{'isEnd': True})})})})
Search In Trie :
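curr = trie
for w in 'how':
    if w in curr:
        curr = curr[w]
    else:
        print("Not Found")
        break
else:
    # only reached if every letter was found;
    # .get avoids creating an 'isEnd' key as a side effect of the lookup
    if curr.get('isEnd'):
        print('Found')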
from collections import defaultdict
Define Trie:
_trie = lambda: defaultdict(_trie)
Create Trie:
trie = _trie()
for s in ["cat", "bat", "rat", "cam"]:
    curr = trie
    for c in s:
        curr = curr[c]
    curr.setdefault("_end")
Lookup:
def word_exist(trie, word):
    curr = trie
    for w in word:
        if w not in curr:
            return False
        curr = curr[w]
    return '_end' in curr
Test:
print(word_exist(trie, 'cam'))
Here is full code using a TrieNode class. Also implemented auto_complete method to return the matching words with a prefix.
Since we are using dictionary to store children, there is no need to convert char to integer and vice versa and don't need to allocate array memory in advance.
class TrieNode:
    def __init__(self):
        #Dict: Key = letter, Item = TrieNode
        self.children = {}
        self.end = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def build_trie(self,words):
        for word in words:
            self.insert(word)

    def insert(self,word):
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.end = True

    def search(self, word):
        node = self.root
        for char in word:
            if char in node.children:
                node = node.children[char]
            else:
                return False
        return node.end

    def _walk_trie(self, node, word, word_list):
        if node.children:
            for char in node.children:
                word_new = word + char
                if node.children[char].end:
                    # if node.end:
                    word_list.append( word_new)
                    # word_list.append( word)
                self._walk_trie(node.children[char], word_new , word_list)

    def auto_complete(self, partial_word):
        node = self.root
        word_list = [ ]
        #find the node for last char of word
        for char in partial_word:
            if char in node.children:
                node = node.children[char]
            else:
                # partial_word not found return
                return word_list
        if node.end:
            word_list.append(partial_word)
        # word_list will be created in this method for suggestions that start with partial_word
        self._walk_trie(node, partial_word, word_list)
        return word_list
create a Trie
t = Trie()
words = ['hi', 'hieght', 'rat', 'ram', 'rattle', 'hill']
t.build_trie(words)
Search for word
words = ['hi', 'hello']
for word in words:
print(word, t.search(word))
hi True
hello False
search for words using prefix
partial_word = 'ra'
t.auto_complete(partial_word)
['rat', 'rattle', 'ram']
If you want a TRIE implemented as a Python class, here is something I wrote after reading about them:
class Trie:
    def __init__(self):
        self.__final = False
        self.__nodes = {}

    def __repr__(self):
        return 'Trie<len={}, final={}>'.format(len(self), self.__final)

    def __getstate__(self):
        return self.__final, self.__nodes

    def __setstate__(self, state):
        self.__final, self.__nodes = state

    def __len__(self):
        return len(self.__nodes)

    def __bool__(self):
        return self.__final

    def __contains__(self, array):
        try:
            return self[array]
        except KeyError:
            return False

    def __iter__(self):
        yield self
        for node in self.__nodes.values():
            yield from node

    def __getitem__(self, array):
        return self.__get(array, False)

    def create(self, array):
        self.__get(array, True).__final = True

    def read(self):
        yield from self.__read([])

    def update(self, array):
        self[array].__final = True

    def delete(self, array):
        self[array].__final = False

    def prune(self):
        for key, value in tuple(self.__nodes.items()):
            if not value.prune():
                del self.__nodes[key]
        if not len(self):
            self.delete([])
        return self

    def __get(self, array, create):
        if array:
            head, *tail = array
            if create and head not in self.__nodes:
                self.__nodes[head] = Trie()
            return self.__nodes[head].__get(tail, create)
        return self

    def __read(self, name):
        if self.__final:
            yield name
        for key, value in self.__nodes.items():
            yield from value.__read(name + [key])
This version is using recursion
import pprint
from collections import deque
pp = pprint.PrettyPrinter(indent=4)
# input() and the `in` test replace Python 2's raw_input() and dict.has_key()
inp = input("Enter a sentence to show as trie\n")
words = inp.split(" ")
trie = {}
def trie_recursion(trie_ds, word):
    try:
        letter = word.popleft()
        out = trie_recursion(trie_ds.get(letter, {}), word)
    except IndexError:
        # End of the word
        return {}
    # Dont update if letter already present
    if letter not in trie_ds:
        trie_ds[letter] = out
    return trie_ds
for word in words:
    # Go through each word
    trie = trie_recursion(trie, deque(word))
pprint.pprint(trie)
Output:
Coool👾 <algos>🚸 python trie.py
Enter a sentence to show as trie
foo bar baz fun
{
'b': {
'a': {
'r': {},
'z': {}
}
},
'f': {
'o': {
'o': {}
},
'u': {
'n': {}
}
}
}
This is much like a previous answer but simpler to read:
def make_trie(words):
    trie = {}
    for word in words:
        head = trie
        for char in word:
            if char not in head:
                head[char] = {}
            head = head[char]
        head["_end_"] = "_end_"
    return trie
class TrieNode:
    def __init__(self):
        self.keys = {}
        self.end = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str, node=None) -> None:
        if node is None:
            node = self.root
        # insertion is a recursive operation
        # this is base case to exit the recursion
        if len(word) == 0:
            node.end = True
            return
        # if this key does not exist create a new node
        elif word[0] not in node.keys:
            node.keys[word[0]] = TrieNode()
            self.insert(word[1:], node.keys[word[0]])
        # that means key exists
        else:
            self.insert(word[1:], node.keys[word[0]])

    def search(self, word: str, node=None) -> bool:
        if node is None:
            node = self.root
        # this is positive base case to exit the recursion
        if len(word) == 0 and node.end == True:
            return True
        elif len(word) == 0:
            return False
        elif word[0] not in node.keys:
            return False
        else:
            return self.search(word[1:], node.keys[word[0]])

    def startsWith(self, prefix: str, node=None) -> bool:
        if node is None:
            node = self.root
        if len(prefix) == 0:
            return True
        elif prefix[0] not in node.keys:
            return False
        else:
            return self.startsWith(prefix[1:], node.keys[prefix[0]])
class Trie:
    def __init__(self):
        # per-instance root; a class-level dict would be shared by every Trie
        self.head = {}

    def add(self,word):
        cur = self.head
        for ch in word:
            if ch not in cur:
                cur[ch] = {}
            cur = cur[ch]
        cur['*'] = True

    def search(self,word):
        cur = self.head
        for ch in word:
            if ch not in cur:
                return False
            cur = cur[ch]
        if '*' in cur:
            return True
        else:
            return False

    def printf(self):
        print (self.head)
dictionary = Trie()
dictionary.add("hi")
#dictionary.add("hello")
#dictionary.add("eye")
#dictionary.add("hey")
print(dictionary.search("hi"))
print(dictionary.search("hello"))
print(dictionary.search("hel"))
print(dictionary.search("he"))
dictionary.printf()
Out
True
False
False
False
{'h': {'i': {'*': True}}}
Python Class for Trie
A trie stores a string in O(L) time, where L is the length of the string, so inserting N strings takes O(NL) time. A string can be searched for in O(L) time, and the same goes for deletion.
Can be cloned from https://github.com/Parikshit22/pytrie.git
class Node:
    def __init__(self):
        self.children = [None]*26
        self.isend = False

class trie:
    def __init__(self,):
        self.__root = Node()

    def __len__(self,):
        return len(self.search_byprefix(''))

    def __str__(self):
        ll = self.search_byprefix('')
        string = ''
        for i in ll:
            string+=i
            string+='\n'
        return string

    def chartoint(self,character):
        return ord(character)-ord('a')

    def remove(self,string):
        ptr = self.__root
        length = len(string)
        for idx in range(length):
            i = self.chartoint(string[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                raise ValueError("Keyword doesn't exist in trie")
        if ptr.isend is not True:
            raise ValueError("Keyword doesn't exist in trie")
        ptr.isend = False
        return

    def insert(self,string):
        ptr = self.__root
        length = len(string)
        for idx in range(length):
            i = self.chartoint(string[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                ptr.children[i] = Node()
                ptr = ptr.children[i]
        ptr.isend = True

    def search(self,string):
        ptr = self.__root
        length = len(string)
        for idx in range(length):
            i = self.chartoint(string[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                return False
        if ptr.isend is not True:
            return False
        return True

    def __getall(self,ptr,key,key_list):
        if ptr is None:
            key_list.append(key)
            return
        if ptr.isend==True:
            key_list.append(key)
        for i in range(26):
            if ptr.children[i] is not None:
                self.__getall(ptr.children[i],key+chr(ord('a')+i),key_list)

    def search_byprefix(self,key):
        ptr = self.__root
        key_list = []
        length = len(key)
        for idx in range(length):
            i = self.chartoint(key[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                return None
        self.__getall(ptr,key,key_list)
        return key_list
t = trie()
t.insert("shubham")
t.insert("shubhi")
t.insert("minhaj")
t.insert("parikshit")
t.insert("pari")
t.insert("shubh")
t.insert("minakshi")
print(t.search("minhaj"))
print(t.search("shubhk"))
print(t.search_byprefix('m'))
print(len(t))
print(t.remove("minhaj"))
print(t)
Code Output
True
False
['minakshi', 'minhaj']
7
minakshi
minhajsir
pari
parikshit
shubh
shubham
shubhi
With prefix search
Here is @senderle's answer, slightly modified to accept prefix search (and not only whole-word matching):
_end = '_end_'
def make_trie(words):
    root = dict()
    for word in words:
        current_dict = root
        for letter in word:
            current_dict = current_dict.setdefault(letter, {})
        current_dict[_end] = _end
    return root
def in_trie(trie, word):
    current_dict = trie
    for letter in word:
        if _end in current_dict:
            return True
        if letter not in current_dict:
            return False
        current_dict = current_dict[letter]
t = make_trie(['hello', 'hi', 'foo', 'bar'])
print(in_trie(t, 'hello world'))
# True
In response to @basj
The following code will capture \b (end of word) letters.
_end = '_end_'
def make_trie(words):
    root = dict()
    for word in words:
        current_dict = root
        for letter in word:
            current_dict = current_dict.setdefault(letter, {})
        current_dict[_end] = _end
    return root
def in_trie(trie, word):
    current_dict = trie
    for letter in word:
        if letter not in current_dict:      # Adjusted the
            return False                    # order of letter
        if _end in current_dict[letter]:    # checks to capture
            return True                     # the last letter.
        current_dict = current_dict[letter]
t = make_trie(['hello', 'hi', 'foo', 'bar'])
>>> print(in_trie(t, 'hi'))
True
>>> print(in_trie(t, 'hola'))
False
>>> print(in_trie(t, 'hello friend'))
True
>>> print(in_trie(t, 'hel'))
None

Leetcode Python 208. Implement Trie (Prefix Tree)

Can someone tell me what is wrong with my code? It passes all the test cases except the last one. When I downloaded that specific test case, the expected and actual output seemed to be the same. The question is https://leetcode.com/problems/implement-trie-prefix-tree/description/
Edit 1:
Here is the code:
class Trie:
    def __init__(self):
        """
        Initialize your data structure here.
        """
        self.data = None
        self.children = {}
        self.isWord = False

    def insert(self, word):
        """
        Inserts a word into the trie.
        :type word: str
        :rtype: void
        """
        if len(word) == 0:
            return
        if word[0] not in self.children:
            self.children[word[0]] = Trie()
            self.insertHelper(word[1:], self.children[word[0]])
        else:
            self.insertHelper(word[1:], self.children[word[0]])
        if len(word) == 1:
            self.isWord = True

    def insertHelper(self, word, trie):
        if len(word) == 0:
            return
        if word[0] not in trie.children:
            trie.children[word[0]] = Trie()
            trie.insertHelper(word[1:], trie.children[word[0]])
        else:
            trie.insertHelper(word[1:], trie.children[word[0]])
        if len(word) == 1:
            trie.isWord = True

    def search(self, word):
        """
        Returns if the word is in the trie.
        :type word: str
        :rtype: bool
        """
        if len(word) == 1 and word[0] in self.children and self.isWord:
            return True
        elif len(word) == 0:
            return False
        if word[0] in self.children:
            return self.searchHelper(word[1:], self.children[word[0]])
        else:
            return False

    def searchHelper(self, word, trie):
        if len(word) == 1 and word[0] in trie.children and trie.isWord:
            return True
        elif len(word) == 0:
            return False
        if word[0] in trie.children:
            return self.searchHelper(word[1:], trie.children[word[0]])
        else:
            return False

    def startsWith(self, prefix):
        """
        Returns if there is any word in the trie that starts with the given prefix.
        :type prefix: str
        :rtype: bool
        """
        if len(prefix) == 0:
            return False
        if prefix[0] in self.children:
            return self.startsWithHelper(prefix[1:], self.children[prefix[0]])
        else:
            return False

    def startsWithHelper(self, prefix, trie):
        if len(prefix) == 0:
            return True
        if prefix[0] in trie.children:
            return trie.startsWithHelper(prefix[1:], trie.children[prefix[0]])
        else:
            return False
Thanks in advance.
One quirk I noticed is passing an empty prefix into startsWith(). If this method is modeled on the Python str method startswith(), then we expect True:
>>> "apple".startswith("")
True
>>>
But your Trie returns False in this situation:
>>> t = Trie()
>>> t.insert("apple")
>>> t.startsWith("")
False
>>>
Below is my rework of your code that I did primarily to understand it but I also found you had redundancies, particularly your Helper functions. This code fixes the quirk mentioned above and is Python 3 specific:
class Trie:
    def __init__(self):
        self.children = {}
        self.isWord = False

    def insert(self, word):
        """
        Inserts a word into the trie.
        :type word: str (or list internally upon recursion)
        :rtype: None
        """
        if not word:
            return
        head, *tail = word
        if head not in self.children:
            self.children[head] = Trie()
        trie = self.children[head]
        if tail:
            trie.insert(tail)
        else:
            self.isWord = True

    def search(self, word):
        """
        Returns True if the word is in the trie.
        :type word: str (or list internally upon recursion)
        :rtype: bool
        """
        if not word:
            return False
        head, *tail = word
        if head in self.children:
            if not tail and self.isWord:
                return True
            return self.children[head].search(word[1:])
        return False

    def startsWith(self, prefix):
        """
        Returns if there is any word in the trie that starts with the given prefix.
        :type prefix: str (or list internally upon recursion)
        :rtype: bool
        """
        if not prefix:
            return True
        head, *tail = prefix
        if head in self.children:
            return self.children[head].startsWith(tail)
        return False
Here's another solution that uses defaultdict from the collections module, which lets the insert function recurse as well.
Credit: https://leetcode.com/problems/implement-trie-prefix-tree/discuss/631957/python-elegant-solution-no-nested-dictionaries
import collections

class Trie:
    def __init__(self):
        """
        Initialize your data structure here.
        """
        # Missing keys are created on first access as empty Trie nodes.
        self.nodes = collections.defaultdict(Trie)
        self.is_word = False

    def insert(self, word: str) -> None:
        """
        Inserts a word into the trie.
        """
        if not word:
            self.is_word = True
        else:
            self.nodes[word[0]].insert(word[1:])

    def search(self, word: str) -> bool:
        """
        Returns if the word is in the trie.
        """
        if not word:
            return self.is_word
        if word[0] in self.nodes:
            return self.nodes[word[0]].search(word[1:])
        return False

    def startsWith(self, prefix: str) -> bool:
        """
        Returns if there is any word in the trie that starts with the given prefix.
        """
        if not prefix:
            return True
        if prefix[0] in self.nodes:
            return self.nodes[prefix[0]].startsWith(prefix[1:])
        return False

# Your Trie object will be instantiated and called as such:
# obj = Trie()
# obj.insert(word)
# param_2 = obj.search(word)
# param_3 = obj.startsWith(prefix)
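One detail of the defaultdict version that is easy to miss: indexing a defaultdict with a missing key creates an entry, which is presumably why search and startsWith test membership with in before descending. A minimal sketch of that behaviour follows; Node is a hypothetical stand-in for the Trie class above, not part of the original answer.

```python
import collections


class Node:
    """Hypothetical stand-in for the defaultdict-based Trie."""

    def __init__(self):
        self.nodes = collections.defaultdict(Node)
        self.is_word = False


root = Node()

# A membership test does not call the default factory.
before = "a" in root.nodes
size_before = len(root.nodes)

# Indexing a missing key calls Node() and stores the result.
child = root.nodes["a"]
after = "a" in root.nodes
size_after = len(root.nodes)

print(before, size_before)
print(after, size_after)
```

If search indexed self.nodes directly instead of checking with in first, every failed lookup would leave empty nodes behind in the trie.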
class TrieNode:
    def __init__(self):
        # each value is a TrieNode, keyed by its character
        self.keys = {}
        self.end = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    # node=self.root cannot be the default value, because self is not defined in the signature
    def insert(self, word: str, node=None) -> None:
        if node is None:
            node = self.root
        # insertion is a recursive operation
        if len(word) == 0:
            node.end = True
            return
        elif word[0] not in node.keys:
            node.keys[word[0]] = TrieNode()
            self.insert(word[1:], node.keys[word[0]])
        # that means key exists
        else:
            self.insert(word[1:], node.keys[word[0]])

    def search(self, word: str, node=None) -> bool:
        if node is None:
            node = self.root
        # node.end == True means we have inserted the word before
        if len(word) == 0 and node.end:
            return True
        # if we inserted apple and then search for app we get False, because we never inserted app, so the last p of a-p-p has end == False
        # But startsWith(app) would return True
        elif len(word) == 0:
            return False
        elif word[0] not in node.keys:
            return False
        else:
            # we have to return because the API expects us to return a bool
            return self.search(word[1:], node.keys[word[0]])

    def startsWith(self, prefix: str, node=None) -> bool:
        if node is None:
            node = self.root
        if len(prefix) == 0:
            return True
        elif prefix[0] not in node.keys:
            return False
        else:
            return self.startsWith(prefix[1:], node.keys[prefix[0]])

Algorithm to remove words from a trie whose frequency < 5 and length > 15

I have a huge trie dictionary that I built from data from the web. Although it is just 5 MB when I write the trie to a file, it takes more than 100 MB once I load it into memory. So I have to compress the trie.
I am having difficulty writing a recursive function (preferably one that runs in linear time, like a DFS) to remove the words whose frequency is < 5 and length > 15. Any help is appreciated.
Here is my trie structure.
# is_valid() and to_int() are helper functions that are not shown here.
class TrieNode:
    def __init__(self, ch='|'):
        self.ch = ch
        self.score = 0
        self.childs = [None]*26
        self.isWord = False

class Trie:
    def __init__(self):
        self.root = TrieNode('$')

    @staticmethod
    def print_trie(node, level):
        if node is None:
            return
        print(node.ch, " ", level, " ", node.isWord)
        for i in range(26):
            Trie.print_trie(node.childs[i], level+1)

    def insert(self, word):
        word = word.lower()
        if not is_valid(word):
            return
        childs = self.root.childs
        i = 0
        while i < len(word):
            idx = to_int(word[i])
            if childs[idx] is not None:
                t = childs[idx]
            else:
                t = TrieNode(word[i])
                childs[idx] = t
            childs = t.childs
            if i == len(word)-1:
                t.isWord = True
                t.score += 1
            i += 1

    def search_node(self, word):
        word = word.lower()
        if not is_valid(word):
            return False, 0
        if self.root is None or word is None or len(word) == 0:
            return False, 0
        children = self.root.childs
        for i in range(len(word)):
            idx = to_int(word[i])
            if children[idx] is not None:
                t = children[idx]
                children = t.childs
            else:
                return False, 0
        if t.isWord:
            return True, t.score
        else:
            return False, t.score
The following method takes a node and its level (initially pass in root and 0) and returns True if the node should remain alive after pruning and False if the node should be removed from the trie (with its subtrie).
def prune(node, level):
    if node is None:
        return False
    canPruneNode = True
    for idx in range(len(node.childs)):
        # If any of the children remains alive, don't prune current node.
        if prune(node.childs[idx], level + 1):
            canPruneNode = False
        else:
            # Remove dead child.
            node.childs[idx] = None
    if node.isWord and level > 15 and node.score < 5:
        node.isWord = False
    # Current node should be removed if and only if all of its children
    # were removed and it doesn't represent a word itself after pruning.
    return node.isWord or not canPruneNode
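To see the pruning idea run end to end, here is a self-contained sketch. It assumes a simplified node modeled on the question's TrieNode (26 child slots, a score and an isWord flag); the insert() and contains() helpers and the max_len and min_score parameters are hypothetical additions for the demonstration, not part of the original answer.

```python
class TrieNode:
    """Minimal node modeled on the question's structure."""

    def __init__(self):
        self.score = 0
        self.childs = [None] * 26
        self.isWord = False


def insert(root, word):
    """Hypothetical helper: add a lowercase a-z word and bump its score."""
    node = root
    for ch in word:
        idx = ord(ch) - ord("a")
        if node.childs[idx] is None:
            node.childs[idx] = TrieNode()
        node = node.childs[idx]
    node.isWord = True
    node.score += 1


def contains(root, word):
    """Hypothetical helper: True if word is stored as a complete word."""
    node = root
    for ch in word:
        node = node.childs[ord(ch) - ord("a")]
        if node is None:
            return False
    return node.isWord


def prune(node, level=0, max_len=15, min_score=5):
    """Return True if node must stay, False if it can be dropped."""
    if node is None:
        return False
    keep = False
    for idx, child in enumerate(node.childs):
        if prune(child, level + 1, max_len, min_score):
            keep = True
        else:
            node.childs[idx] = None
    # A node at depth `level` spells a word of length `level`.
    if node.isWord and level > max_len and node.score < min_score:
        node.isWord = False
    return node.isWord or keep


root = TrieNode()
rare_long = "abcdefghijklmnop"       # 16 letters, inserted once
common_long = "abcdefghijklmnopq"    # 17 letters, inserted five times
insert(root, "cat")
insert(root, rare_long)
for _ in range(5):
    insert(root, common_long)

prune(root)
print(contains(root, "cat"))         # True: short words are kept
print(contains(root, rare_long))     # False: long and rare
print(contains(root, common_long))   # True: long but frequent
```

Each node is visited once, so the traversal is linear in the number of nodes, which matches the DFS-style running time the question asks for.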
I am not sure that removing words will solve the problem. The space is consumed not by the words but by the 26 child slots that every node has.
E.g., I have the word cat with frequency 30, and there's another word, cater, whose frequency is 10. So if you delete the node for t in cat, all the subsequent nodes will be deleted too (that is, cater will be reduced to cat).
So removing a word from the trie means nothing more than setting its score to 0.
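To illustrate the point about the 26 child slots: one common way to make trie nodes lighter in Python is to store children in a dict, so that a node only holds entries for the children it actually has. The sketch below is an assumption about how the question's node could be restructured, not a measured fix; SparseTrieNode, insert() and lookup() are hypothetical names, and any real saving would have to be checked against the actual data.

```python
class SparseTrieNode:
    """Hypothetical node that stores only the children that exist."""

    # __slots__ avoids allocating a per-instance __dict__.
    __slots__ = ("score", "children", "is_word")

    def __init__(self):
        self.score = 0
        self.children = {}
        self.is_word = False


def insert(root, word):
    node = root
    for ch in word:
        child = node.children.get(ch)
        if child is None:
            # Create the child only when it is missing.
            child = node.children[ch] = SparseTrieNode()
        node = child
    node.is_word = True
    node.score += 1


def lookup(root, word):
    """Return (is_word, score), mirroring the question's search_node."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False, 0
    return node.is_word, node.score


root = SparseTrieNode()
for w in ["cat", "cat", "cater"]:
    insert(root, w)

print(lookup(root, "cat"))     # (True, 2)
print(lookup(root, "cater"))   # (True, 1)
print(lookup(root, "ca"))      # (False, 0)
```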

Python: Create a Binary search Tree using a list

The objective of my code is to read each separate word from a txt file into a list, then build a binary search tree from that list to count the frequency of each word, and finally print each word in alphabetical order along with its frequency. Each word in the file can only contain letters, numbers, -, or '. The part that I am unable to do with my beginner programming knowledge is building the binary search tree from the list I have (I am only able to insert the whole list into one node instead of putting each word in its own node to make the tree). The code I have so far is this:
def read_words(filename):
    openfile = open(filename, "r")
    templist = []
    letterslist = []
    for lines in openfile:
        for i in lines:
            ii = i.lower()
            letterslist.append(ii)
    for p in letterslist:
        if p not in ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z',"'","-",' '] and p.isdigit() == False:
            letterslist.remove(p)
    wordslist = list("".join(letterslist).split())
    return wordslist

class BinaryTree:
    class _Node:
        def __init__(self, value, left=None, right=None):
            self._left = left
            self._right = right
            self._value = value
            self._count = 1

    def __init__(self):
        self.root = None

    def isEmpty(self):
        return self.root == None

    def insert(self, value):
        if self.isEmpty():
            self.root = self._Node(value)
            return
        parent = None
        pointer = self.root
        while (pointer != None):
            if value == pointer._value:
                pointer._count += 1
                return
            elif value < pointer._value:
                parent = pointer
                pointer = pointer._left
            else:
                parent = pointer
                pointer = pointer._right
        if (value <= parent._value):
            parent._left = self._Node(value)
        else:
            parent._right = self._Node(value)

    def printTree(self):
        pointer = self.root
        if pointer._left is not None:
            pointer._left.printTree()
        print(str(pointer._value) + " " + str(pointer._count))
        if pointer._right is not None:
            pointer._right.printTree()

    def createTree(self, words):
        if len(words) > 0:
            for word in words:
                BinaryTree().insert(word)
            return BinaryTree()
        else:
            return None

    def search(self, tree, word):
        node = tree
        depth = 0
        count = 0
        while True:
            print(node.value)
            depth += 1
            if node.value == word:
                count = node.count
                break
            elif word < node.value:
                node = node.left
            elif word > node.value:
                node = node.right
        return depth, count

def main():
    words = read_words('sample.txt')
    b = BinaryTree()
    b.insert(words)
    b.createTree(words)
    b.printTree()
Since you're a beginner, I'd advise implementing the tree methods with recursion instead of iteration, since this results in a simpler implementation. While recursion might seem a difficult concept at first, it is often the easiest approach.
Here's a draft implementation of a binary tree that uses recursion for insertion, searching, and printing. It should support the functionality you need.
class Node(object):
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None
        self.count = 1

    def __str__(self):
        return 'value: {0}, count: {1}'.format(self.value, self.count)

def insert(root, value):
    # Returns the (possibly new) root of the subtree after insertion.
    if not root:
        return Node(value)
    elif root.value == value:
        root.count += 1
    elif value < root.value:
        root.left = insert(root.left, value)
    else:
        root.right = insert(root.right, value)
    return root

def create(seq):
    root = None
    for word in seq:
        root = insert(root, word)
    return root

def search(root, word, depth=1):
    # Returns (depth, count), or (0, 0) if the word is not in the tree.
    if not root:
        return 0, 0
    elif root.value == word:
        return depth, root.count
    elif word < root.value:
        return search(root.left, word, depth + 1)
    else:
        return search(root.right, word, depth + 1)

def print_tree(root):
    # In-order traversal prints the values in sorted order.
    if root:
        print_tree(root.left)
        print(root)
        print_tree(root.right)

src = ['foo', 'bar', 'foobar', 'bar', 'barfoo']
tree = create(src)
print_tree(tree)
for word in src:
    print('search {0}, result: {1}'.format(word, search(tree, word)))
# Output
# value: bar, count: 2
# value: barfoo, count: 1
# value: foo, count: 1
# value: foobar, count: 1
# search foo, result: (1, 1)
# search bar, result: (2, 2)
# search foobar, result: (2, 1)
# search bar, result: (2, 2)
# search barfoo, result: (3, 1)
To answer your direct question, the reason why you are placing all of the words into a single node is because of the following statement inside of main():
b.insert(words)
The insert function creates a node and sets the value of that node to the item you pass in, so passing the whole list produces a single node that holds the list. Instead, you need to create a node for each item in the list, which is what your createTree() function does. The preceding b.insert is not necessary.
Removing that line makes your tree correctly formed, but it reveals a fundamental problem with the design of your data structure, namely the printTree() method. This method seems designed to traverse the tree and recursively call itself on any child. In your initial version this function worked because the tree was malformed, with a single node holding the whole list (and the print function simply printed that value since right and left were empty).
With a correctly formed tree, however, the printTree() function now tries to invoke itself on the left and right descendants. The descendants are of type _Node, not of type BinaryTree, and there is no printTree() method declared for _Node objects.
You can salvage your code and solve this new error in one of a few ways. First, you can implement your BinaryTree.printTree() function as _Node.printTree(). You can't do a straight copy and paste, but the logic of the function won't have to change much. Or you could leave the method where it is, but wrap each _left or _right node inside a new BinaryTree so that it has the necessary printTree() method. Doing this leaves the method where it is, but you will still have to implement some kind of helper traversal method inside _Node.
Finally, you could change all of your _Node objects to be _BinaryTree objects instead.
The semantic difference between a node and a tree is one of scope. A node should only be aware of itself, its direct children (left and right), and possibly its parent. A tree, on the other hand, can be aware of any of its descendants, no matter how far removed. This is accomplished by treating any child node as its own tree. Even a leaf, without any children at all, can be thought of as a tree with a depth of 0. This behavior is what lets a tree work recursively. Your code is mixing the two together.
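As a rough illustration of the first option, here is a sketch in which the in-order traversal lives on the node class and BinaryTree.printTree() only delegates to the root. The class and attribute names follow the question's code, but the in_order() generator and the words() method are assumptions added for this example, and insert() is rewritten in a compact form.

```python
class BinaryTree:
    class _Node:
        def __init__(self, value):
            self._left = None
            self._right = None
            self._value = value
            self._count = 1

        def in_order(self):
            """Yield (value, count) pairs from this subtree in sorted order."""
            if self._left is not None:
                yield from self._left.in_order()
            yield self._value, self._count
            if self._right is not None:
                yield from self._right.in_order()

    def __init__(self):
        self.root = None

    def insert(self, value):
        if self.root is None:
            self.root = self._Node(value)
            return
        pointer = self.root
        while True:
            if value == pointer._value:
                pointer._count += 1
                return
            side = "_left" if value < pointer._value else "_right"
            child = getattr(pointer, side)
            if child is None:
                setattr(pointer, side, self._Node(value))
                return
            pointer = child

    def words(self):
        """Hypothetical helper: all (value, count) pairs in sorted order."""
        return [] if self.root is None else list(self.root.in_order())

    def printTree(self):
        for value, count in self.words():
            print(value, count)


tree = BinaryTree()
for word in ["foo", "bar", "foobar", "bar", "barfoo"]:
    tree.insert(word)   # one insert per word, not one insert of the whole list
tree.printTree()
```

The recursion happens between _Node objects, so the tree class never needs to call printTree() on something that is not a BinaryTree.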

Is there something wrong with this while-loop?

When I execute this code, it prints 'Constructed', meaning it executed Trie Construction - then my terminal outputs nothing, it doesn't return or print any error, it's just blank, as if it's still working on the problem. Is there something wrong with the while loop? Is it that the 'trie' is an external variable?
trie is a list of nodes, a class I defined.
class node:
    def __init__(self, parent, daughters, edge):
        self.parent = parent
        self.daughters = daughters
        self.edge = edge
        trie.append(self)
        self.index = len(trie) - 1
patterns is a list of fixed strings.
def TrieConstruction(patterns, trie):
    trie.append(node(0, [], 0))
    for pattern in patterns:
        currentNode = trie[0]
        for base in pattern:
            for daughter in currentNode.daughters:
                if base == daughter.edge:
                    currentNode = daughter
                    break
            else:
                trie.append(node(currentNode, [], base))
                currentNode = trie[-1]
    print('Constructed.')
    return

def PrefixTrieMatching(text, trie):
    v = trie[0]
    for index, base in enumerate(text):
        if v.daughters == []:
            pattern_out = []
            climb(v.index)
            return ''.join(pattern_out)
        else:
            for daughter in v.daughters:
                if base == daughter.edge:
                    v = daughter
                    break
            else:
                print('No matches found.')
                return

def climb(index):
    if index == 0:
        return
    else:
        pattern_out.append(node.edge)
        climb(trie[index].parent)

def TrieMatching(text, trie):
    while text != []:
        PrefixTrieMatching(text, trie)
        text = text[0:len(text) - 2]
    print('Complete.')
    return

print('Next, we generate a trie with the patterns, and then run the text over the trie to search for matches.')
trie = []
TrieConstruction(patterns, trie)
TrieMatching(text, trie)
EDIT:
Disregard my previous answer. If you are passing a string as text, the loop should be:
while text != "":
    PrefixTrieMatching(text, trie)
    text = text[0:len(text) - 2]
because a string never compares equal to an empty list, so text != [] stays true even once the string is empty.
You are doing more work than needed. Just use while text, which is falsy only for an empty string, and slice two characters off the end of the string each time:
def TrieMatching(text, trie):
    while text:
        PrefixTrieMatching(text, trie)
        text = text[:-2]
An empty list, str, dict, etc. always evaluates to False, so you never need to check explicitly with if my_list != [] or if my_str != "". Writing if my_list or if my_str is sufficient.
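A small sketch of why the original condition never becomes false while the truthiness version terminates. The consume() function is a hypothetical stand-in for TrieMatching that only records each intermediate string instead of matching against a trie.

```python
def consume(text):
    """Strip two characters per step, recording each intermediate value."""
    seen = []
    while text:
        seen.append(text)
        text = text[:-2]
    return seen


steps = consume("ACGTA")
print(steps)

# A string is never equal to a list, so `text != []` is True
# even when the string is already empty.
print("" != [])
print(bool(""), bool([]), bool({}))
```

Slicing past the start of a string simply yields an empty string, so the loop ends cleanly for inputs of both even and odd length.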
