I want to find all possible anagrams of a phrase. For example, if I input "Donald Trump" I should get "Darn mud plot", "Damp old runt" and probably hundreds more.
I have a dictionary of around 100,000 words, no problems there.
But the only way I can think of is to loop through the dictionary and add every word that can be built from the input to a list. Then loop through that list and, for each word shorter than the input, loop through the dictionary again and add every word that can be made from the remaining letters while keeping the total at or below the input length. And keep looping until I have all combinations of valid words whose combined length equals the input length.
But this is O(n!) complexity, and it would take almost forever to run. I've tried it.
Is there any way to approach this problem such that the complexity will be less? I may have found something on the net for perl, but I absolutely cannot read perl code, especially not perl golf.
I like your idea of filtering the word list down to just the words that could possibly be made with the input letters, and I like the idea of trying to string them together, but I think there are a few major optimizations you could put into place that would likely speed things up quite a bit.
For starters, rather than choosing a word and then rescanning the entire dictionary for what's left, I'd consider doing a single filtering pass at the start to find all words that could possibly be made with the letters you have. Your dictionary is likely going to be pretty colossal (150,000+ words, I'd suspect), so rescanning it after each decision point is going to be prohibitively expensive. Once you have the set of words you can legally use in the anagram, you're left with the problem of finding which combinations of them form a complete anagram of the sentence.
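As a rough illustration, here is a minimal sketch of that single filtering pass, assuming the dictionary is just a list of lowercase alphabetic words (the names phrase and dictionary are placeholders):

from collections import Counter

def candidate_words(phrase, dictionary):
    """Keep only words whose letters fit inside the phrase's letter pool."""
    target = Counter(c for c in phrase.lower() if c.isalpha())
    # Counter subtraction drops non-positive counts; an empty result means the word fits
    return [w for w in dictionary if not (Counter(w) - target)]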
I'd begin by finding unordered lists of words that anagram to the target, rather than all possible ordered lists of words, because there are far fewer of them to find. Once you have the unordered lists, you can generate the permutations from them pretty quickly.
To do this, I'd use a backtracking recursion in which, at each point, you maintain a histogram of the remaining letter counts. You can use that to filter out words that no longer fit, which saves you the cost of checking the whole dictionary each time. I'd imagine this recursion will dead-end a lot, and that you'll probably find all your answers without too much effort.
You might consider some other heuristics along the way. For example, you might want to start with larger words first to pull out as many letters as possible and keep the branching factor low. To do that, you could sort your word list from longest to shortest and try the words in that order. You could alternatively try to use the most constrained letters up first to decrease the branching factor. These sorts of heuristics will probably work really well in practice.
Overall you're still looking at exponential work in the worst case, but it shouldn't be too bad for shorter strings.
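Putting those pieces together, here is a rough, unoptimized sketch of the backtracking search over unordered word lists. It reuses the hypothetical candidate_words filter from the earlier sketch, sorts candidates longest-first as suggested above, and recomputes letter counts more often than a tuned version would:

from collections import Counter

def anagram_sets(remaining, candidates, start=0, current=None, results=None):
    """Backtracking search for unordered lists of words that use up `remaining` exactly."""
    if current is None:
        current, results = [], []
    if not remaining:                       # every letter used: record one solution
        results.append(list(current))
        return results
    for i in range(start, len(candidates)):
        counts = Counter(candidates[i])
        if not (counts - remaining):        # the word still fits in the remaining letters
            current.append(candidates[i])
            anagram_sets(remaining - counts, candidates, i, current, results)
            current.pop()
    return results

# usage, reusing candidate_words from the earlier sketch:
# phrase = "Donald Trump"
# target = Counter(c for c in phrase.lower() if c.isalpha())
# cands = sorted(candidate_words(phrase, dictionary), key=len, reverse=True)
# print(anagram_sets(target, cands))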
Related
I'm given a matrix containing a blueprint of a crossword puzzle - unfilled, of course. The goal is to fill the whole puzzle - it's a task from Checkio, and I've been struggling with this for quite some time now.
From what I understand of complexity, there's no perfect algorithm for this problem. Still, there has to be a best way to do this, right? I've tried some different things, and the results got worse as the number of words in the crossword and/or the dictionary increased.
So, some of the things I've tried:
Simple brute forcing. Did not work at all, as it kept ignoring and overwriting intersections.
Brute forcing while keeping all the relevant data - worked as expected with a specific dictionary, but turned to hell with a moderately big one, even with word-length optimization. Figures.
Blind intersection filling - the idea that it would be better not to bother with the intersecting words and instead focus on the letters. Like start with As and check if you can fill the whole crossword with those restrictions. If it did not work for some word, increment one of the letters and try the whole thing again. Results were abysmal, as you might expect.
Recursive exploring - worked perfectly on simpler blueprints, but fell flat with more complex ones. There was an issue with simple loops which was resolved easily enough, but I did not find a reasonable solution for the situation where the path splits and then rejoins several splits later (so there's nothing left to solve for the second branch, but it doesn't know that).
Minimizing intersections - haven't tested this yet, but it looks promising. The idea is that I find the shortest list of words containing all intersections... that also don't intersect with each other. Then I can just use a generator for each of those words and check whether the dependent words with those intersections exist. If they don't, I just grab the next word from the generator.
And this is where I'm currently at. I decided to ask about it here because it has already taken more time than it should have, and even my latest idea may not be the proper way to do it.
So, what is the proper way to do it?
Edit:
Input is a list of strings representing the crossword and a list of strings representing the dictionary. Output is a list of strings representing the filled crossword.
An example of a crossword:
['...XXXXXX',
'.XXX.X...',
'.....X.XX',
'XXXX.X...',
'XX...X.XX',
'XX.XXX.X.',
'X......X.',
'XX.X.XXX.',
'XXXX.....']
The output would be a similar list with filled letters instead of dots.
Note that the 'dictionary' is just that, a small English dictionary and not a list of words fitted as answers for this puzzle.
So, what is the proper way to do it?
I don't know if it is optimal, but I would be using the principles of Floodfill.
Data structures:
Crossword words and their intersections. Sort them by the number of words in the dictionary for the corresponding word length. This will most likely mean that you will start with one of the longest words.
Dictionary accessible by word length.
If the dictionary is large, it would be beneficial to be able to quickly find words of a certain length with a specific letter at position n, where n corresponds to an intersection position.
Note that for each crossword word, any two dictionary words that fit and have the same letters in all of its intersections are equivalent. Thus, for each crossword word you can select a subset of the dictionary containing at most [the number of letters in the alphabet] to the power of [the number of intersections] words, one representative per equivalence class that might fit that crossword word.
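One possible shape for that lookup structure, sketched in Python; the function names and the (length, position, letter) keying are my own illustration rather than anything prescribed above:

from collections import defaultdict

def build_index(dictionary):
    """Index words by length and by (length, position, letter)."""
    by_length = defaultdict(list)
    by_slot = defaultdict(set)
    for word in dictionary:
        by_length[len(word)].append(word)
        for pos, letter in enumerate(word):
            by_slot[(len(word), pos, letter)].add(word)
    return by_length, by_slot

def words_matching(by_length, by_slot, length, constraints):
    """Words of the given length matching `constraints`, a dict {position: letter}
    derived from already-filled intersections."""
    candidates = set(by_length[length])
    for pos, letter in constraints.items():
        candidates &= by_slot[(length, pos, letter)]
    return candidates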
Algorithm:
Take the first/next unsolved crossword word. Assign it the first/next word that fits.
Take the first/next intersection. Assign the other crossword word the first word that fits.
If there are no more intersections to follow onwards, go back to the intersection you came from and continue with the next intersection.
If there is no word in the dictionary that fits, backtrack one intersection and search for the next word that fits.
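A minimal sketch of that backtracking fill, under my own assumptions about the data representation: it fills slots in a fixed order rather than literally walking from intersection to intersection, and slots, lengths, crossings and candidates_for are placeholders for whatever you extract from the blueprint (candidates_for could be built on the index sketched earlier):

def fill(slots, lengths, crossings, candidates_for, assignment=None, used=None, i=0):
    """Backtracking fill along the lines of the steps above.

    slots          - slot ids, ideally ordered by fewest fitting dictionary words first
    lengths        - dict: slot id -> slot length
    crossings      - dict: slot id -> list of (other_slot, pos_here, pos_there)
    candidates_for - callable(length, {pos: letter}) -> iterable of fitting words
    """
    if assignment is None:
        assignment, used = {}, set()
    if i == len(slots):
        return assignment                            # every slot is filled
    slot = slots[i]
    constraints = {pos_here: assignment[other][pos_there]
                   for other, pos_here, pos_there in crossings.get(slot, [])
                   if other in assignment}
    for word in candidates_for(lengths[slot], constraints):
        if word in used:                             # don't reuse a dictionary word
            continue
        assignment[slot] = word
        used.add(word)
        solution = fill(slots, lengths, crossings, candidates_for,
                        assignment, used, i + 1)
        if solution is not None:
            return solution
        del assignment[slot]                         # dead end: backtrack
        used.discard(word)
    return None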
In a spelling error detection task, I use the marisa_trie data structure for my lexicon, with Python 3.5.
Short question
How do I add an element to a marisa_trie?
Context
The idea is: if a word is in my lexicon, it is correct. If it is not in my lexicon, it is probably incorrect. But I have computed word frequencies over the whole document, and if a word's frequency is high enough I want to keep it, on the assumption that a frequent word is probably correct.
In that case, how do I add this new word to my marisa_trie.Trie lexicon (without having to build a new trie every time)?
Thank you :)
marisa_trie.Trie implements an immutable trie, so the answer to your question is: it is not possible.
You might want to try a similar Python package called datrie, which supports modification and relatively fast queries (its PyPI page lists some benchmarks against the builtin dict).
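For illustration, a minimal sketch of in-place insertion with datrie, based on its documented usage (the alphabet argument and the example words are my own); as far as I know, with marisa_trie the only option is to rebuild the trie from its existing keys plus the new words:

import string
import datrie

# datrie needs the set of allowed key characters up front
lexicon = datrie.Trie(string.ascii_lowercase)

for word in ("known", "words", "here"):
    lexicon[word] = True            # keys are unicode strings; values can be anything

# later, when a frequent unknown word turns up, insert it in place:
lexicon["newword"] = True
assert "newword" in lexicon

# marisa_trie alternative: rebuild from scratch each time
# import marisa_trie
# lexicon = marisa_trie.Trie(list(old_lexicon.keys()) + ["newword"])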
I have a list of sentences (e.g. "This is an example sentence") and a glossary of terms (e.g. "sentence", "example sentence") and need to find all the terms that match the sentence with a cutoff on some Levenshtein ratio.
How can I do it fast enough? Splitting the sentences, using FTS to find words that appear in terms, and filtering terms by ratio works, but it's quite slow. Right now I'm using sphinxsearch + python-Levenshtein; are there better tools?
Would the reverse search (FTS matching terms in the sentence) be faster?
If speed is a real issue, and if your glossary of terms is not going to be updated often compared to the number of searches you want to do, you could look into something like a Levenshtein automaton. I don't know of any Python libraries that support it, but if you really need it you could implement it yourself. Finding all possible paths will require some dynamic programming.
If you just need to get it done, just loop over the glossary and test each one against each word in the string. That should give you an answer in polynomial time. If you're on a multicore processor, you might get some speedup by doing it in parallel.
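As a sketch of that straightforward loop, using the python-Levenshtein package already mentioned in the question: since glossary terms can span several words, this compares each term against same-length word windows of the sentence (the cutoff value, helper name and lowercasing are my own choices):

import Levenshtein

def matching_terms(sentence, glossary, cutoff=0.8):
    """Glossary terms whose best Levenshtein ratio against any same-length
    window of words in the sentence reaches the cutoff."""
    tokens = sentence.lower().split()
    matches = []
    for term in glossary:
        n = len(term.split())
        windows = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        best = max((Levenshtein.ratio(term.lower(), w) for w in windows), default=0.0)
        if best >= cutoff:
            matches.append(term)
    return matches

# matching_terms("This is an example sentence", ["sentence", "example sentence"])
# -> ['sentence', 'example sentence']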
If I have a long unsorted list of 300k elements, will sorting the list first and then doing a "for" loop over it speed up my code? I need to do a "for" loop regardless; I can't use a list comprehension.
sortedL = sorted(mylist)      # note: mylist.sort() sorts in place and returns None
for i in sortedL:
    if i == somenumber:
        # do some work
How could I signal to Python that sortedL is sorted, so it doesn't have to read the whole list? Is there any benefit to sorting the list? If there is, how can I implement it?
It would appear that you're considering sorting the list so that you could then quickly look for somenumber.
Whether the sorting will be worth it depends on whether you are going to search once, or repeatedly:
If you're only searching once, sorting the list will not speed things up. Just iterate over the list looking for the element, and you're done.
If, on the other hand, you need to search for values repeatedly, by all means pre-sort the list. This will enable you to use bisect to quickly look up values.
The third option is to store elements in a dict. This might offer the fastest lookups, but will probably be less memory-efficient than using a list.
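To make the once-versus-repeatedly tradeoff concrete, here is a small sketch (the data and queries are placeholders):

import bisect

data = [7, 3, 42, 19, 3, 8]                   # placeholder list
queries = [42, 5, 8]                          # values looked up repeatedly

data.sort()                                   # one-time O(N log N) cost

def contains(sorted_list, value):
    """O(log N) membership test on an already-sorted list."""
    i = bisect.bisect_left(sorted_list, value)
    return i != len(sorted_list) and sorted_list[i] == value

hits = [q for q in queries if contains(data, q)]    # [42, 8]

# if all you need is membership, a set (or dict) gives O(1) average lookups:
lookup = set(data)
hits = [q for q in queries if q in lookup]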
The cost of a for loop in python is not dependent on whether the input data is sorted.
That being said, you might be able to break out of the for loop early or have other computation saving things at the algorithm level if you sort first.
If you want to search within a sorted list, you need an algorithm that takes advantage of the sorting.
One possibility is the built-in bisect module. This is a bit of a pain to use, but there's a recipe in the documentation for building simple sorted-list functions on top of it.
With that recipe, you can just write this:
i = index(sortedL, somenumber)
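For reference, the index helper from that recipe looks roughly like this:

from bisect import bisect_left

def index(a, x):
    """Locate the leftmost value exactly equal to x; raise ValueError if absent."""
    i = bisect_left(a, x)
    if i != len(a) and a[i] == x:
        return i
    raise ValueError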
Of course, if you're just sorting for the purposes of speeding up a single search, this is a bit silly. Sorting will take O(N log N) time, then searching will take O(log N), for a total of O(N log N); just doing a linear search takes O(N) time. So, unless you're typically doing more than about log N searches on the same list, this isn't worth doing.
If you don't actually need sorting, just fast lookups, you can use a set instead of a list. This gives you O(1) lookup for all but pathological cases.
Also, if you want to keep a list sorted while continuing to add/remove/etc., consider using something like blist.sortedlist instead of a plain list.
Forgive me for asking in such a general way, as I'm sure their performance depends on how one uses them, but in my case collections.deque was way slower than collections.defaultdict when I wanted to verify the existence of a value.
I used the spelling correction from Peter Norvig in order to verify a user's input against a small set of words. As I had no use for a dictionary with word frequencies I used a simple list instead of defaultdict at first, but replaced it with deque as soon as I noticed that a single word lookup took about 25 seconds.
Surprisingly, that wasn't faster than using a list so I returned to using defaultdict which returned results almost instantaneously.
Can someone explain this difference in performance to me?
Thanks in advance
PS: If one of you wants to reproduce what I was talking about, change the following lines in Norvig's script.
-NWORDS = train(words(file('big.txt').read()))
+NWORDS = collections.deque(words(file('big.txt').read()))
-return max(candidates, key=NWORDS.get)
+return candidates
These three data structures aren't interchangeable; they serve very different purposes and have very different characteristics:
Lists are dynamic arrays. You use them to store items sequentially for fast random access, as a stack (adding and removing at the end), or just to store something and later iterate over it in the same order.
Deques are sequences too, but optimized for adding and removing elements at both ends rather than for random access or stack-like growth.
Dictionaries (providing a default value is just a relatively simple and convenient, though for this question irrelevant, extension) are hash tables. They associate fully-featured keys (instead of an index) with values and provide very fast access to a value by its key and, necessarily, very fast checks for key existence. They don't maintain order and require the keys to be hashable, but well, you can't make an omelette without breaking eggs.
All of these properties are important; keep them in mind whenever you choose one over the other. What breaks your neck in this particular case is the combination of the last property of dictionaries and the number of possible corrections that have to be checked. Some simple combinatorics will give you a concrete formula for the number of edits this code generates for a given word, but anyone who has mispredicted such things often enough will know it's going to be a surprisingly large number even for average words.
For each of these edits, there is a check "edit in NWORDS" to weed out edits that result in unknown words. Not a big problem in Norvig's program, since membership checks with in (key-existence checks) are, as mentioned before, very fast. But you swapped the dictionary for a sequence (a deque)! For sequences, in has to iterate over the whole sequence and compare each item with the value searched for (it can stop when it finds a match, but since few of the edits are known words sitting at the beginning of the deque, it usually still searches all or most of the deque). Since there are quite a few words and the test is done for each edit generated, you end up spending 99% of your time doing a linear search in a sequence where you could just hash a string and compare it once (or at most, in case of collisions, a few times).
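If you want to see the gap for yourself, a rough and machine-dependent comparison of membership tests is enough (the sizes and repeat counts here are arbitrary):

import timeit

setup = ("import collections\n"
         "words = [str(i) for i in range(50000)]\n"
         "lst = list(words)\n"
         "dq = collections.deque(words)\n"
         "st = set(words)")

for expr in ("'49999' in lst", "'49999' in dq", "'49999' in st"):
    print(expr, timeit.timeit(expr, setup=setup, number=1000))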
If you don't need weights, you can conceptually use bogus values you never look at and still get the performance boost of an O(1) membership check. In practice, you should just use a set, which uses pretty much the same algorithms as dictionaries and simply cuts away the part where it stores the values (it was actually first implemented like that; I don't know how far the two have diverged since sets were re-implemented in a dedicated, separate C module).
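A sketch of that set-based variant in Python 3, assuming a tokenizer along the lines of Norvig's words helper; only membership is kept, so the frequency-based ranking elsewhere in his script would have to go, as in your own patch:

import re

def words(text):
    # roughly Norvig's tokenizer: runs of lowercase letters
    return re.findall('[a-z]+', text.lower())

NWORDS = set(words(open('big.txt').read()))      # membership only, no frequencies

def known(candidates):
    # O(1) average-time membership checks instead of a linear scan
    return {w for w in candidates if w in NWORDS}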