I'm given a matrix containing a blueprint of a crossword puzzle - unfilled, of course. The goal is to fill the whole puzzle - it's a task from Checkio, and I've been struggling with this for quite some time now.
From what I understand of complexity, there's no perfect algorithm for this problem. Still, there has to be a best way to do it, right? I've tried a few different approaches, and the results got worse as the number of words in the crossword and/or the dictionary grew.
So, some of the things I've tried:
simple brute forcing. Did not work at all, as it kept ignoring and overwriting intersections.
brute forcing while keeping all the relevant data - worked as expected with a specific dictionary, turned to hell with a
moderately big one even with word-length optimization. Figures.
blind intersection filling - the idea where I thought it would be better not to bother with the intersecting words, instead focusing
on the letters. Like start with As and check if you can fill the
whole crossword with these restrictions. If it did not work for some
word, increment one of the letters and try the whole thing again.
Results were abysmal, as you might expect.
recursive exploring - worked perfectly on more simple blueprints, but fell flat with more complex ones. There was an issue with simple
loops which was resolved simply enough, but I did not find a
reasonable solution for the situation where the path splits and then
rejoins several further splits later (so there's nothing left to
solve for the second branch, but it doesn't know that).
minimizing intersections - haven't tested this yet, but it looks promising. The idea is that I find the shortest list of words
containing all intersections... that also don't intersect with each
other. Then I can just use a generator for each of those words, and
then check if the depending words with those intersections exist. If
they don't, I just grab the next word from a generator.
And this is where I'm currently at. I decided to ask about this here as it's already at that point where I think it took more time than it should have, and even then my latest idea may not even be the proper way to do it.
So, what is the proper way to do it?
Edit:
Input is a list of strings representing the crossword and a list of strings representing the dictionary. Output is a list of strings representing the filled crossword.
An example of a crossword:
['...XXXXXX',
'.XXX.X...',
'.....X.XX',
'XXXX.X...',
'XX...X.XX',
'XX.XXX.X.',
'X......X.',
'XX.X.XXX.',
'XXXX.....']
The output would be a similar list with filled letters instead of dots.
Note that the 'dictionary' is just that, a small English dictionary and not a list of words fitted as answers for this puzzle.
So, what is the proper way to do it?
I don't know if it is optimal, but I would be using the principles of Floodfill.
Data structures:
Crossword words and their intersections. Sort them by the number of words in the dictionary for the corresponding word length. This will most likely mean that you will start with one of the longest words.
Dictionary accessible by word length.
If the dictionary is large, it would be beneficial to be able to quickly find words of a certain length with a specific letter at a specific position, where the position corresponds to an intersection (see the sketch after this list).
Note that for each crossword word, any two candidate words that fit and have the same letters in all of its intersections are equivalent. So for each crossword word you only need one representative per equivalence class, giving a subset of the dictionary containing at most [the number of letters in the alphabet] to the power of [the number of intersections] words that might fit that crossword word.
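A minimal sketch of such an index (the container names are mine; dictionary is the word list from the input):

from collections import defaultdict

by_len = defaultdict(set)               # words grouped by length
by_len_pos_letter = defaultdict(set)    # words grouped by (length, position, letter)
for word in dictionary:
    by_len[len(word)].add(word)
    for i, letter in enumerate(word):
        by_len_pos_letter[(len(word), i, letter)].add(word)

# e.g. all 5-letter words whose third letter (index 2) is 'a':
candidates = by_len_pos_letter[(5, 2, 'a')]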
Algorithm:
Take the first/next unsolved crossword word. Assign it the first/next
word that fits.
Take the first/next intersection. Assign the other crossword word the first word that fits.
If there are no more intersections to follow onwards, go back to the intersection you came from and continue with the next intersection.
If there is no word in the dictionary that fits, backtrack one intersection and search for the next word that fits.
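A stripped-down Python sketch of this backtracking (the slot representation and helper names are my own illustration; it fills slots in a fixed order and leaves out the intersection-following order and the equivalence-class refinement described above):

def solve(slots, grid, by_len, index=0):
    # slots: list of dicts with 'length' and 'cells' (list of (row, col) pairs)
    # grid: list of lists of characters, '.' marking an unfilled cell
    if index == len(slots):
        return True                                   # every slot is filled
    slot = slots[index]
    for word in by_len[slot['length']]:
        fits = all(grid[r][c] in ('.', ch)
                   for (r, c), ch in zip(slot['cells'], word))
        if fits:
            saved = [grid[r][c] for r, c in slot['cells']]
            for (r, c), ch in zip(slot['cells'], word):
                grid[r][c] = ch
            if solve(slots, grid, by_len, index + 1):
                return True
            for (r, c), old in zip(slot['cells'], saved):
                grid[r][c] = old                      # undo and try the next word
    return False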
Related
I am designing an experiment where participants will be prompted with a random sequence of actions, and I will be recording data throughout the experiment. My intention is to capture every possible transition from one action to another using the shortest sequence possible. Say that there are N possible actions; I am searching for an algorithm that can generate a set of random sequences with the following properties:
Sliding through each sequence, every two consecutive elements represent a transition from one action to another. Therefore, except for the start and end of the sequence, every element serves as the end of one transition and the start of the next. From what I observe using small examples, this approach appears to produce the shortest sequence while covering all transitions.
Code implementing the algorithm must return all such valid shortest sequences.
Cannot have two consecutive elements be the same (i.e. self transitions are not allowed).
Must use basic functions available in Python and MATLAB, so I cannot use modules/libraries that maybe available in Python but not in MATLAB (or vice-versa).
As an example, say I have 3 actions: {A, B, C}. One of the expected sequences this algorithm should produce is: ABCBACA. Sliding through this sequence, taking 2 elements at a time, I get {AB, BC, CB, BA, AC, CA}. As expected, this covers all 6 transitions that are possible, using a sequence of length 7. The sequence has no two consecutive elements that are the same. Another valid sequence that this algorithm might produce is: ACABCBA. Sliding through this sequence, taking 2 elements at a time, I get {AC, CA, AB, BC, CB, BA}, thus covering all transitions, with no two consecutive elements being the same.
I worked out both examples using a pen and paper, but I am having trouble seeing a pattern, particularly for N >3. How do I proceed from here?
It appears that a sequence of length N*(N-1) + 1 would be the shortest sequence in my case, which I think makes sense. I also observed that the start and end of such sequences are the same (i.e. if we start at A, we end at A). It almost appears as if this is a circular list instead of a linear list. Is this generally true?
If I'm understanding what you're asking correctly, here's basically what you need to do:
Create a directed graph with a node per possible transition (so one for AB, one for AC, etc), and add connections from each node to every node that starts with your "end" (so for AB, you'd connect it to BA and BC -- remember, these are unidirectional)
Find an arbitrary Hamiltonian cycle of the graph above.
You're done. Problem is, finding a Hamiltonian cycle is an NP-complete problem in the general case. As such, finding an efficient way of doing it for large N might prove challenging, to put it lightly. If you only need it for N of fairly small size, then you can just pick any algorithm that finds Hamiltonian cycles and stick it in.
Hell, you can probably just concatenate random transitions that 1. haven't been used yet and 2. start with whatever the previous transition ended with (in other words, traverse the graph described above at random, without ever returning to a node you've already visited), and if you run out of options before using up all transitions, just start over. It would surely find solutions for small N (say, <= 6) reasonably quickly, and clearly it has equal probability of finding any valid solution.
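A rough sketch of that restart-at-random walk, assuming the actions are single-character labels:

import itertools, random

def random_transition_sequence(actions):
    transitions = list(itertools.permutations(actions, 2))  # every ordered pair, no self-transitions
    while True:                                              # restart until a full cover is found
        remaining = set(transitions)
        first = random.choice(transitions)
        seq = list(first)
        remaining.discard(first)
        while remaining:
            options = [t for t in remaining if t[0] == seq[-1]]
            if not options:
                break                                        # dead end: give up and restart
            step = random.choice(options)
            remaining.discard(step)
            seq.append(step[1])
        if not remaining:
            return ''.join(seq)

print(random_transition_sequence('ABC'))   # e.g. ABCBACA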
As for your question on whether the solution will always be circular; yes, that is correct. It's pretty clear to see if you think about the fact that in an optimal solution, you will see every single transition exactly once, and also that any "outgoing" transition must be paired with an "incoming" transition of the same action: e.g. if you start with AB, your remaining pool will contain N-1 transitions of the form xA, but only N-2 of the form Ax, and as such you will end up being left with a single dangling transition of the form xA that therefore must come last.
It's possible there is some kind of alternative solution that leverages the structure of this specific problem to produce a more efficient solution, but if there is, I'm not seeing it. This problem is basically a slightly smaller scale version of finding the shortest superpermutation, though, which isn't currently known to have a more efficient solution.
For anyone looking at this in the future: I came across De Bruijn sequence which is almost exactly the solution I want to my problem. The Python code referenced in the article works fairly well for my problem. The only modification I needed to make was that in the output string, I needed to ensure that all permutations involving self transitions (e.g. AA, BB, CC, etc.) were collapsed into single symbols (i.e. A, B, C, etc.).
Also, as the Wikipedia page states:
... Note that these sequences are understood to "wrap around" in a cycle ...
So this confirms my observation that the sequences always end and start at the same point. Multiple sequences can be obtained by supplying permuted strings to the input (i.e. the inputs ABC, ACB, BAC, etc.) and we get the outputs we are interested in. The output produced by the Python code appears to be always ordered.
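For reference, the collapsing step is tiny (a sketch, assuming de_bruijn is the generator from the Wikipedia article, called with the action alphabet and subsequence length 2):

from itertools import groupby

def transition_sequence(alphabet):
    cycle = de_bruijn(alphabet, 2)                    # e.g. 'aabacbbcc' for 'abc'
    linear = cycle + cycle[0]                         # unwrap the cycle into a linear string
    return ''.join(ch for ch, _ in groupby(linear))   # collapse AA, BB, ... into single symbols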
I want to find all possible anagrams from a phrase, for example if I input "Donald Trump" I should get "Darn mud plot", "Damp old runt" and probably hundreds more.
I have a dictionary of around 100,000 words, no problems there.
But the only way I can think of is to loop through the dictionary and add to a list every word that can be built from the input. Then loop through that list and, for each word shorter than the input, loop through the dictionary again and add every word that can be made from the remaining letters (as long as the total length stays at or below the input length). And keep looping until I have all combinations of valid words whose combined length equals the input length.
But this is O(n!) complexity, and it would take almost forever to run. I've tried it.
Is there any way to approach this problem such that the complexity will be less? I may have found something on the net for Perl, but I absolutely cannot read Perl code, especially not Perl golf.
I like your idea of filtering the word list down to just the words that could possibly be made with the input letters, and I like the idea of trying to string them together, but I think there are a few major optimizations you could put into place that would likely speed things up quite a bit.
For starters, rather than choosing a word and then rescanning the entire dictionary for what's left, I'd consider just doing a single filtering pass at the start to find all possible words that could be made with the letters that you have. Your dictionary is likely going to be pretty colossal (150,000+, I'd suspect), so rescanning it after each decision point is going to be completely infeasible. Once you have the set of words you can legally use in the anagram, from there you're left with the problem of finding which combinations of them can be used to form a complete anagram of the sentence.
I'd begin by finding unordered lists of words that anagram to the target rather than all possible ordered lists of words, because there's many fewer of them to find. Once you have the unordered lists, you can generate the permutations from them pretty quickly.
To do this, I'd use a backtracking recursion where at each point you maintain a histogram of the remaining letter counts. You can use that to filter out words that can't be added in any more, and this essentially saves you the cost of having to check the whole dictionary each time. I'd imagine this recursion will dead-end a lot, and that you'll probably find all your answers without too much effort.
You might consider some other heuristics along the way. For example, you might want to start with larger words first to pull out as many letters as possible and keep the branching factor low. To do that, you could sort your word list from longest to shortest and try the words in that order. You could alternatively try to use the most constrained letters up first to decrease the branching factor. These sorts of heuristics will probably work really well in practice.
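A sketch of that histogram-driven backtracking (the names are mine; it assumes a lowercase word list and uses Counter for the letter bookkeeping):

from collections import Counter

def anagram_word_sets(phrase, dictionary):
    target = Counter(c for c in phrase.lower() if c.isalpha())
    # one filtering pass up front: keep only words that fit inside the target
    # histogram, longest first to keep the branching factor low
    candidates = [w for w in dictionary if not (Counter(w) - target)]
    candidates.sort(key=len, reverse=True)
    results = []

    def backtrack(start, remaining, chosen):
        if not remaining:                        # all letters used up: found one
            results.append(chosen)
            return
        for i in range(start, len(candidates)):
            counts = Counter(candidates[i])
            if not (counts - remaining):         # this word still fits in what's left
                backtrack(i, remaining - counts, chosen + [candidates[i]])

    backtrack(0, target, [])
    return results                               # unordered word lists; permute them afterwards if needed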
Overall you're still looking at exponential work in the worst case, but it shouldn't be too bad for shorter strings.
I have a list of sentences (e.g. "This is an example sentence") and a glossary of terms (e.g. "sentence", "example sentence") and need to find all the terms that match the sentence with a cutoff on some Levenshtein ratio.
How can I do it fast enough? Splitting sentences, using FTS to find words that appear in terms and filtering terms by ratio works, but it's quite slow. Right now I'm using sphinxsearch + python-Levenshtein; are there better tools?
Would the reverse search: FTS matching terms in sentence be faster?
If speed is a real issue, and if your glossary of terms is not going to be updated often, compared to the number of searches you want to do, you could look into something like a Levenshtein Automaton. I don't know of any python libraries that support it, but if you really need it you could implement it yourself. To find all possible paths will require some dynamic programming.
If you just need to get it done, just loop over the glossary and test each one against each word in the string. That should give you an answer in polynomial time. If you're on a multicore processor, you might get some speedup by doing it in parallel.
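The brute-force loop could look roughly like this (a sketch; difflib's ratio stands in for whatever similarity measure you settle on, and python-Levenshtein's ratio would be a faster drop-in):

from difflib import SequenceMatcher

def matching_terms(sentence, glossary, cutoff=0.8):
    words = sentence.lower().split()
    hits = []
    for term in glossary:
        # multi-word terms would also need comparing against word n-grams of the sentence
        if any(SequenceMatcher(None, term.lower(), w).ratio() >= cutoff for w in words):
            hits.append(term)
    return hits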
I'm writing an algorithm to solve skyscrapers puzzles:
Skyscraper puzzles combine the row and column constraints of Sudoku with external clue values that re-imagine each row or column of numbers as a road full of skyscrapers of varying height. Higher numbers represent higher buildings.
To solve a Skyscraper puzzle you must place 1 to 5, or 1 to whatever the size of the puzzle is, once each into every row and column, while also solving each of the given skyscraper clues.
To understand Skyscraper puzzles, you must imagine that each value you place into the grid represents a skyscraper of that number of floors. So a 1 is a 1-floor skyscraper, while a 4 is a 4-floor skyscraper. Now imagine that you go and stand outside the grid where one of the clue numbers is and look back into the grid. That clue number tells you how many skyscrapers you can see from that point, looking only along the row or column where the clue is, and from the point of view of the clue. Taller buildings always obscure lower buildings, so in other words higher numbers always conceal lower numbers.
All the basic techniques are implemented and working, but I've realized that with bigger puzzles (larger than 5x5) I need some sort of recursive algorithm. I found a decent working Python script, but I'm not really following what it actually does beyond solving basic clues.
Does anyone know the proper way of solving these puzzles or can anyone reveal the essentials in the code above?
Misha showed you the brute-force way. A much faster recursive algorithm can be made based on constraint propagation. Peter Norvig (head of Google Research) wrote an excellent article about how to use this technique to solve Sudoku with python. Read it and try to understand every detail, you will learn a lot, guaranteed. Since the skyscraper puzzle has a lot in common with Sudoku (without the 3X3 blocks, but with some extra constraints given by the numbers on the edge), you could probably steal a lot of his code.
You start, as with Sudoku, where each field has a list of all the possible numbers from 1..N. After that, you look at one horizontal/vertical line or edge clue at a time and remove illegal options. E.g. in a 5x5 case, an edge clue of 3 excludes 5 from the first two squares and 4 from the first square. The constraint propagation should do the rest. Keep looping over edge constraints until they are fulfilled or you get stuck after cycling through all constraints. As shown by Norvig, you then start guessing and remove numbers in case of a contradiction.
In case of Sudoku, a given clue has to be processed only once, since once you assign a single number to one square (you remove all the other possibilities), all the information of the clue has been used. With the skyscrapers, however, you might have to apply a given clue several times until it is totally satisfied (e.g. when the complete line is solved).
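As a sketch, the initial pruning for a single edge clue could look like this (the candidate-set representation is mine; the rule is the one behind the 5x5 example above):

def prune_by_edge_clue(line, clue, N):
    # line: list of candidate sets for one row/column, ordered from the clue inward.
    # A skyscraper of height v at distance i from the clue can be preceded by at
    # most i visible buildings and followed by at most N - v taller ones, so at
    # most i + 1 + (N - v) buildings can ever be seen from the clue; if that is
    # below the clue, v cannot stand there.
    for i, candidates in enumerate(line):
        for v in list(candidates):
            if i + 1 + (N - v) < clue:
                candidates.discard(v)
    return line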
If you're desperate, you can brute-force the puzzle. I usually do this as a first step to become familiar with the puzzle. Basically, you need to populate NxN squares with integers from 1 to N inclusive, subject to the following constraints:
Each integer appears in every row exactly once
Each integer appears in every column exactly once
The row "clues" are satisfied
The column "clues" are satisfied
The brute force solution would work like this. First, represent the board as a 2D array of integers. Then write a function is_valid_solution that returns True if the board satisfies the above constraints, and False otherwise. This part is relatively easy to do in O(N^2).
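As a sketch, is_valid_solution for a completely filled board could look like this (the clue layout, one (from-start, from-end) pair per row and per column, is my assumption):

def is_valid_solution(board, row_clues, col_clues, N):
    def visible(line):
        count, tallest = 0, 0
        for height in line:
            if height > tallest:            # taller than everything before it: visible
                count, tallest = count + 1, height
        return count

    rows = [list(r) for r in board]
    cols = [[board[r][c] for r in range(N)] for c in range(N)]
    if any(sorted(line) != list(range(1, N + 1)) for line in rows + cols):
        return False                        # some value repeats in a row or column
    return all(visible(line) == a and visible(line[::-1]) == b
               for line, (a, b) in zip(rows + cols, row_clues + col_clues))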
Finally, iterate over the possible board permutations, and call is_valid_solution for each permutation. When that returns True, you've found a solution. There are a total of N^(NxN) possible arrangements, so your complete solution will be O(N^(NxN)). You can do better by using the above constraints for reducing the search space.
The above method will take a relatively long while to run (O(N^(NxN)) is pretty horrible for an algorithm), but you'll (eventually) get a solution. When you've got that working, try to think of a better way to do it; if you get stuck, then come back here.
EDIT
A slightly better alternative to the above would be to perform a search (e.g. depth-first) starting with an empty board. At each iteration of the search, you'd populate one cell of the table with a number (while not violating any of the constraints). Once you happen to fill up the board, you're done.
Here's pseudo-code for a recursive brute-force depth-first search. The search will be NxN nodes deep, and the branching factor at each node is at most N. This means you will need to examine at most 1 + N + N^2 + ... + N^(NxN), i.e. (N^(NxN + 1) - 1)/(N - 1), nodes. For each of these nodes, you need to call is_valid_board, which is O(N^2) in the worst case (when the board is full).
import copy

def fill_square(board, row, col):
    for value in range(1, N + 1):
        next_board = copy.deepcopy(board)
        next_board[row][col] = value
        if is_valid_board(next_board):
            if row == col == N - 1:  # the last cell was just filled, we're done
                print(next_board)
            else:
                next_row, next_col = calculate_next_position(row, col)
                fill_square(next_board, next_row, next_col)

board = initialize_board()
fill_square(board, 0, 0)
The function calculate_next_position selects the next square to fill. The easiest way to do this is just a scanline traversal of the board. A smarter way would be to fill rows and columns alternately.
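For the scanline version, calculate_next_position reduces to a couple of lines (a sketch; N is the board size, as above):

def calculate_next_position(row, col):
    # scanline: move right along the row, then drop to the start of the next row
    if col == N - 1:
        return row + 1, 0
    return row, col + 1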
Forgive me for asking in such a general way, as I'm sure their performance depends on how one uses them, but in my case collections.deque was way slower than collections.defaultdict when I wanted to verify the existence of a value.
I used the spelling correction from Peter Norvig in order to verify a user's input against a small set of words. As I had no use for a dictionary with word frequencies I used a simple list instead of defaultdict at first, but replaced it with deque as soon as I noticed that a single word lookup took about 25 seconds.
Surprisingly, that wasn't faster than using a list so I returned to using defaultdict which returned results almost instantaneously.
Can someone explain this difference in performance to me?
Thanks in advance
PS: If one of you wants to reproduce what I was talking about, change the following lines in Norvig's script.
-NWORDS = train(words(file('big.txt').read()))
+NWORDS = collections.deque(words(file('big.txt').read()))
-return max(candidates, key=NWORDS.get)
+return candidates
These three data structures aren't interchangeable, they serve very different purposes and have very different characteristics:
Lists are dynamic arrays. You use them to store items sequentially for fast random access, to use as a stack (adding and removing at the end), or just to store something and later iterate over it in the same order.
Deques are sequences too, but made for adding and removing elements at both ends rather than random access or stack-like growth.
Dictionaries (providing a default value is just a relatively simple and convenient, but for this question irrelevant, extension) are hash tables. They associate fully-featured keys (instead of an index) with values and provide very fast access to a value by its key and (necessarily) very fast checks for key existence. They don't maintain order and require the keys to be hashable, but well, you can't make an omelette without breaking eggs.
All of these properties are important; keep them in mind whenever you choose one over the other. What breaks your neck in this particular case is a combination of the last property of dictionaries and the number of possible corrections that have to be checked. Some simple combinatorics should arrive at a concrete formula for the number of edits this code generates for a given word, but everyone who has mispredicted such things often enough will know it's going to be a surprisingly large number even for average words.
For each of these edits, there is a check (edit in NWORDS) to weed out edits that result in unknown words. Not a big problem in Norvig's program, since in checks (key existence checks) are, as mentioned before, very fast. But you swapped the dictionary for a sequence (a deque)! For sequences, in has to iterate over the whole sequence and compare each item with the value searched for (it can stop when it finds a match, but since only a few of the edits are known words sitting near the beginning of the deque, it usually still searches all or most of the deque). Since there are quite a few words and the test is done for each edit generated, you end up spending 99% of your time doing a linear search in a sequence where you could just hash a string and compare it once (or at most - in case of collisions - a few times).
If you don't need weights, you can conceptually use bogus values you never look at and still get the performance boost of an O(1) in check. Practically, you should just use a set, which uses pretty much the same algorithms as the dictionaries and just cuts away the part where it stores the value (it was actually first implemented like that; I don't know how far the two have diverged since sets were re-implemented in a dedicated, separate C module).
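In Norvig's script that boils down to swapping the frequency dictionary for a set (a sketch; words, known and big.txt are names from the original script, and the ranking by frequency in correct has to go, as in your patch):

NWORDS = set(words(file('big.txt').read()))   # hashed membership instead of a frequency dict

def known(candidates):
    # each 'in' test hashes the string once instead of scanning a whole sequence
    return set(w for w in candidates if w in NWORDS)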