Adding an element into a Marisa Trie

In a spelling error detection task, I use marisa_tries data structures for my lexicon with Python 3.5.
Short question
How do I add an element to a marisa_trie?
Context
The idea is: if a word is in my lexicon, then it is correct. If it is not in my lexicon, it is probably incorrect. But I computed word frequencies over the whole document, and if a word's frequency is high enough, I want to save the word, on the assumption that a frequent word is probably correct.
In that case, how do I add this new word to my marisa_trie.Trie lexicon (without having to build a new trie every time)?
Thank you :)

marisa_trie.Trie implements an immutable trie, so the answer to your question is: it is not possible.
You might want to try a similar Python package called datrie, which supports modification and relatively fast queries (its PyPI page lists some benchmarks against the built-in dict).
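Since the trie itself cannot change, one workaround is to keep the words you learn later in a small mutable set next to it and check both. The following is only a minimal sketch under assumptions: the class name OverlayLexicon, the helper learn_frequent_words and the min_count threshold are made up for illustration, and the base lexicon is taken to be any container that supports `in` (a frozenset stands in for marisa_trie.Trie so the example runs without third-party packages).

```python
class OverlayLexicon:
    """Read-only base lexicon plus a mutable set of words learned later."""

    def __init__(self, base):
        self.base = base        # immutable container, e.g. a marisa_trie.Trie
        self.learned = set()    # words added after the base was built

    def __contains__(self, word):
        return word in self.base or word in self.learned

    def add(self, word):
        # Only store words the base lexicon does not already know.
        if word not in self.base:
            self.learned.add(word)


def learn_frequent_words(lexicon, frequencies, min_count=5):
    """Add unknown words whose frequency reaches min_count (assumed threshold)."""
    for word, count in frequencies.items():
        if word not in lexicon and count >= min_count:
            lexicon.add(word)
```

If the learned set grows large, the trie can be rebuilt once from the old keys plus the learned words, instead of on every addition.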

Related

Python: Find all anagrams of a sentence

I want to find all possible anagrams from a phrase, for example if I input "Donald Trump" I should get "Darn mud plot", "Damp old runt" and probably hundreds more.
I have a dictionary of around 100,000 words, no problems there.
But the only way I can think of is to loop through the dictionary and add all words that can be built from the input to a list. Then loop through the list and, if the word length is less than the length of the input, loop through the dictionary again and add all possible words that can be made from the remaining letters and that would bring the total length up to the length of the input or less. And keep looping until I have all combinations of valid words whose total length equals the input length.
But this is O(n!) complexity, and it would take almost forever to run. I've tried it.
Is there any way to approach this problem such that the complexity will be lower? I may have found something on the net for Perl, but I absolutely cannot read Perl code, especially not Perl golf.

I like your idea of filtering the word list down to just the words that could possibly be made with the input letters, and I like the idea of trying to string them together, but I think there are a few major optimizations you could put into place that would likely speed things up quite a bit.
For starters, rather than choosing a word and then rescanning the entire dictionary for what's left, I'd consider just doing a single filtering pass at the start to find all possible words that could be made with the letters that you have. Your dictionary is likely going to be pretty colossal (150,000+, I'd suspect), so rescanning it after each decision point is going to be completely infeasible. Once you have the set of words you can legally use in the anagram, from there you're left with the problem of finding which combinations of them can be used to form a complete anagram of the sentence.
I'd begin by finding unordered lists of words that anagram to the target rather than all possible ordered lists of words, because there are far fewer of them to find. Once you have the unordered lists, you can generate the permutations from them pretty quickly.
To do this, I'd use a backtracking recursion where at each point you maintain a histogram of the remaining letter counts. You can use that to filter out words that can't be added in any more, and this essentially saves you the cost of having to check the whole dictionary each time. I'd imagine this recursion will dead-end a lot, and that you'll probably find all your answers without too much effort.
You might consider some other heuristics along the way. For example, you might want to start with larger words first to pull out as many letters as possible and keep the branching factor low. To do that, you could sort your word list from longest to shortest and try the words in that order. You could alternatively try to use the most constrained letters up first to decrease the branching factor. These sorts of heuristics will probably work really well in practice.
Overall you're still looking at exponential work in the worst case, but it shouldn't be too bad for shorter strings.
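The approach described above (one filtering pass, then a backtracking recursion over a histogram of the remaining letters, longest words first) could be sketched roughly as follows. This is an illustration, not a tuned implementation: the function name find_anagrams is made up, and the dictionary is assumed to contain lowercase alphabetic words only.

```python
from collections import Counter


def find_anagrams(phrase, dictionary):
    """Return unordered word combinations (tuples) that use every letter of phrase."""
    target = Counter(c for c in phrase.lower() if c.isalpha())

    def fits(word_count, remaining):
        # A word fits if it needs no more of any letter than is left.
        return all(remaining[ch] >= n for ch, n in word_count.items())

    # Single filtering pass: keep only words buildable from the target letters,
    # sorted longest first to keep the branching factor low.
    candidates = []
    for word in sorted(set(dictionary), key=lambda w: (-len(w), w)):
        word_count = Counter(word)
        if word and fits(word_count, target):
            candidates.append((word, word_count))

    results = []

    def backtrack(start, remaining, chosen):
        if not remaining:  # every letter has been used
            results.append(tuple(chosen))
            return
        # Starting at `start` (not 0) yields each unordered combination once;
        # not skipping `start` itself allows a word to be used repeatedly.
        for i in range(start, len(candidates)):
            word, word_count = candidates[i]
            if fits(word_count, remaining):
                chosen.append(word)
                backtrack(i, remaining - word_count, chosen)
                chosen.pop()

    backtrack(0, target, [])
    return results
```

Each returned tuple is one unordered combination; itertools.permutations can then produce the ordered phrases from it.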

How to efficiently find all fuzzy matches between a set of terms and a list of sentences?

I have a list of sentences (e.g. "This is an example sentence") and a glossary of terms (e.g. "sentence", "example sentence") and need to find all the terms that match the sentence with a cutoff on some Levenshtein ratio.
How can I do it fast enough? Splitting sentences, using FTS to find words that appear in terms and filtering terms by ratio works, but it's quite slow. Right now I'm using sphinxsearch + python-Levenshtein; are there better tools?
Would the reverse search: FTS matching terms in sentence be faster?

If speed is a real issue, and if your glossary of terms is not going to be updated often compared to the number of searches you want to do, you could look into something like a Levenshtein automaton. I don't know of any Python libraries that support it, but if you really need it you could implement it yourself. Finding all possible paths will require some dynamic programming.
If you just need to get it done, just loop over the glossary and test each one against each word in the string. That should give you an answer in polynomial time. If you're on a multicore processor, you might get some speedup by doing it in parallel.
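The brute-force loop could look something like the sketch below. It uses only the standard library, so the edit distance is a plain dynamic-programming implementation. The similarity is normalised as 1 - distance / max(len), which is an assumption made here and not necessarily the ratio that python-Levenshtein computes. The function names and the 0.8 cutoff are likewise illustrative. Multi-word terms are compared against windows of the same number of words.

```python
def levenshtein(a, b):
    """Classic edit distance, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalisation: 1.0 for identical strings, 0.0 for nothing in common."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def fuzzy_terms(sentence, glossary, cutoff=0.8):
    """Return the glossary terms that fuzzily match some word window of sentence."""
    words = sentence.lower().split()
    matches = set()
    for term in glossary:
        n = len(term.split())
        for i in range(len(words) - n + 1):
            window = " ".join(words[i:i + n])
            if similarity(window, term.lower()) >= cutoff:
                matches.add(term)
                break
    return matches
```

Each sentence is independent of the others, so the outer loop over sentences is the natural place to parallelise.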

How best to store large sequences of text in Python?

I recently discovered that a student of mine was doing an independent project in which he was using very large strings (2-4MB) as values in a dictionary.
I've never had a reason to work with such large blocks of text and it got me wondering if there were performance issues associated with creating such large strings.
Is there a better way of doing it than to simply create a string? I realize this question is largely context dependent, but I'm looking for generalized answers that may cover more than one possible use-case.
If you were working with that much text, how would you store it in your code, and would you do anything different than if you were simply working with an ordinary string of only a few characters?

It depends a lot on what you're doing with the strings. I'm not exactly sure how Python stores strings but I've done a lot of work on XEmacs (similar to GNU Emacs) and on the underlying implementation of Emacs Lisp, which is a dynamic language like Python, and I know how strings are implemented there. Strings are going to be stored as blocks of memory similar to arrays. There's not a huge issue creating large arrays in Python, so I don't think simply storing the strings this way will cause performance issues. Some things to consider though:
How are you building up the string? If you build it up piece by piece by simply appending to ever larger strings, you have an O(N^2) algorithm that will be very slow. Java handles this with a StringBuilder class. I'm not sure if there's an exact equivalent in Python, but you can simply create a list with all the parts you want to join together, then join at the end using ''.join(parts).
Do you need to search the string? This isn't related to creating the strings but it's something to consider. Searching will in general be O(n) in the size of the string; there are speedups that make it O(n/m) where m is the size of the substring you're searching for, but that's about it. The main consideration here is whether to store one big string or a series of substrings. If you need to search all the substrings, that won't help much over searching a big string, but it's possible you might know in advance that some parts don't need to be searched.
Do you need to access substrings? Again, this isn't related to creating the strings, it's something to consider. Accessing a substring by position is just a matter of indexing to the right memory location, but if you need to take large substrings, it may be inefficient, and you might be able to speed things up by storing your string as an array of substrings, and then creating a new string as another array with some of the strings shared. However, doing it this way takes work, and shouldn't be done unless it's really necessary.
In sum, I think for simple cases it's fine to have large strings like this, but you should think about the sorts of operations you're going to perform and what their O(...) time is.
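As an illustration of the first point, here are two ways to build a large string without repeated concatenation: collecting the pieces in a list and joining once, and writing to an in-memory buffer with io.StringIO from the standard library. The function names are made up for this sketch.

```python
import io


def build_with_join(parts):
    """Collect the pieces in a list and join them once at the end."""
    chunks = []
    for part in parts:
        chunks.append(part)
    return "".join(chunks)


def build_with_stringio(parts):
    """Write the pieces to an in-memory text buffer, then read it back."""
    buffer = io.StringIO()
    for part in parts:
        buffer.write(part)
    return buffer.getvalue()
```

Both produce the same result; the list-and-join form is the more common idiom.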

I would say that potential issues depend on two things:
how many strings of this kind are held in memory at the same time, compared to the capacity of the memory (the RAM)?
what operations are done on these strings?
It seems to me I've read that operations on strings in Python are very efficient, so working on very long strings shouldn't present a problem. But in fact it depends on the algorithm behind each operation performed on a big string.
This answer is rather vague; I don't have enough experience to make a more useful estimate of the problem. But the question is also very broad.

Why is collections.deque slower than collections.defaultdict?

Forgive me for asking in such a general way, as I'm sure their performance depends on how one uses them, but in my case collections.deque was way slower than collections.defaultdict when I wanted to verify the existence of a value.
I used the spelling correction from Peter Norvig in order to verify a user's input against a small set of words. As I had no use for a dictionary with word frequencies I used a simple list instead of defaultdict at first, but replaced it with deque as soon as I noticed that a single word lookup took about 25 seconds.
Surprisingly, that wasn't faster than using a list so I returned to using defaultdict which returned results almost instantaneously.
Can someone explain this difference in performance to me?
Thanks in advance
PS: If one of you wants to reproduce what I was talking about, change the following lines in Norvig's script.
-NWORDS = train(words(file('big.txt').read()))
+NWORDS = collections.deque(words(file('big.txt').read()))
-return max(candidates, key=NWORDS.get)
+return candidates

These three data structures aren't interchangeable, they serve very different purposes and have very different characteristics:
Lists are dynamic arrays, you use them to store items sequentially for fast random access, use as stack (adding and removing at the end) or just storing something and later iterating over it in the same order.
Deques are sequences too, only for adding and removing elements at both ends instead of random access or stack-like growth.
Dictionaries (providing a default value is just a relatively simple and convenient but, for this question, irrelevant extension) are hash tables: they associate fully-featured keys (instead of an index) with values and provide very fast access to a value by a key and (necessarily) very fast checks for key existence. They require the keys to be hashable and, before Python 3.7, don't maintain insertion order, but well, you can't make an omelette without breaking eggs.
All of these properties are important; keep them in mind whenever you choose one over the other. What breaks your neck in this particular case is a combination of that fast existence check, which dictionaries have and sequences lack, and the number of possible corrections that have to be checked. Some simple combinatorics should arrive at a concrete formula for the number of edits this code generates for a given word, but everyone who has mispredicted such things often enough will know it's going to be a surprisingly large number even for average words.
For each of these edits, there is a check edit in NWORDS that weeds out edits that result in unknown words. Not a big problem in Norvig's program, since in checks (key existence checks) are, as mentioned before, very fast. But you swapped the dictionary for a sequence (a deque)! For sequences, in has to iterate over the whole sequence and compare each item with the value searched for (it can stop when it finds a match, but since very few edits are known words sitting near the beginning of the deque, it usually still searches all or most of the deque). Since there are quite a few words and the test is done for each edit generated, you end up spending 99% of your time doing a linear search in a sequence where you could just hash a string and compare it once (or at most, in case of collisions, a few times).
If you don't need weights, you can conceptually use bogus values you never look at and still get the performance boost of an O(1) in check. Practically, you should just use a set, which uses pretty much the same algorithms as the dictionaries and just cuts away the part where it stores the value (it was actually first implemented like that; I don't know how far the two have diverged since sets were re-implemented in a dedicated, separate C module).
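To see the difference for yourself, a small timing sketch along these lines can help. The word list is synthetic and the names (words, as_deque, as_set, time_lookup) are made up for the example; actual timings depend on the machine, so none are quoted here.

```python
import collections
import timeit

# Synthetic lexicon standing in for the words read from big.txt (assumption).
words = ["word%d" % i for i in range(50000)]
as_deque = collections.deque(words)
as_set = set(words)


def time_lookup(container, probe="not-in-the-lexicon", repeats=20):
    """Time `probe in container`; a missing word forces a full scan of a sequence."""
    return timeit.timeit(lambda: probe in container, number=repeats)


if __name__ == "__main__":
    print("deque:", time_lookup(as_deque))
    print("set:  ", time_lookup(as_set))
```

Both containers give the same answer to `in`; only the amount of work behind that answer differs.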

passing text through a dictionary in Python

I currently have python code that compares two texts using the cosine similarity measure. I got the code here.
What I want to do is take the two texts and pass them through a dictionary (not a python dictionary, just a dictionary of words) first before calculating the similarity measure. The dictionary will just be a list of words, although it will be a large list. I know it shouldn't be hard and I could maybe stumble my way through something, but I would like it to be efficient too. Thanks.

If the dictionary fits in memory, use a Python set:
ok_words = set(["a", "b", "c", "e"])
def filter_words(words):
    # Keep only the words that appear in the dictionary.
    return [word for word in words if word in ok_words]
If it doesn't fit in memory, you can use shelve.
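For the shelve variant, a rough sketch could look like this. The words are stored as keys of an on-disk shelf so that membership is checked without loading the whole list into memory. The function names and the choice of True as a dummy value are assumptions for illustration.

```python
import shelve


def build_word_shelf(path, words):
    """Store each dictionary word as a key in an on-disk shelf."""
    with shelve.open(path) as db:
        for word in words:
            db[word] = True  # the value is a placeholder; only the key matters


def filter_words_on_disk(path, words):
    """Keep only the words that are present in the shelf at path."""
    with shelve.open(path) as db:
        return [word for word in words if word in db]
```

Shelf keys must be strings, which suits a word list. Lookups go to disk, so expect this to be slower than the in-memory set.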

The structure you are trying to create is known as an inverted index. Here you can find some general information about it and snippets from Heaps and Mills's implementation. Unfortunately, I wasn't able to find its source, or any other efficient implementation. (Please leave a comment if you find any.)
If your goal isn't to create a library in pure Python, you can use PyLucene, a Python extension for accessing Lucene, which is in turn a very powerful search engine written in Java. Lucene implements an inverted index and can easily provide you with information on word frequency. It also supports a wide range of analyzers (parsers + stemmers) for a dozen languages.
(Also note that Lucene already has its own Similarity measure class.)
Some words about similarity and vector space models (VSM). It is a very powerful abstraction, but your implementation suffers from several disadvantages. As the number of documents in your index grows, your co-occurrence matrix will become too big to fit in memory, and searching in it will take a long time. To counter this effect, dimension reduction is used. In methods like LSA this is done by singular value decomposition. Also pay attention to techniques such as PLSA, which uses probability theory, and Random Indexing, which is the only incremental (and so the only one appropriate for large indexes) VSM method.
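To make the inverted index mentioned at the start of this answer concrete, here is a deliberately tiny in-memory sketch. It is not how Lucene implements it. It only maps each word to the documents it occurs in, together with an occurrence count. The function name and the whitespace tokenisation are assumptions.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to {doc_id: occurrences}; documents maps doc_id to text."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return index
```

Looking up a word then gives the documents that contain it and how often, which is the word-frequency information the similarity measure needs.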
