Find neighbour tuples - python

I'm looking for an algorithm but lack the right keywords to get an overview. What I'm trying to write is a function that finds correlations/patterns/... in a dataset of tuples (simplified). For example:
dataset=(('a','b','c'),('1','a'), ('x','y','b','c'))
print magic(1.0, dataset)
-> ('b','c')
As you can see, the function should return pairs of elements that always appear together (1.0 = 100%) or with a specific probability.
Can anybody please tell me which group of algorithms will suit my problem? Maybe point me to a lib that does the work and is tested? :)

Have a look at Frequent Itemset Mining (FIM) and Association rule mining.
In your question, you are essentially interested in association rules of the type A -> B with confidence 100%.
In particular, look at the APRIORI algorithm if you are interested in co-occurrences larger than 3.
Note that if you only want pairs, APRIORI boils down to scanning your database twice to count all pairs; you don't gain anything by pruning. Depending on the sparsity of your data, intersecting inverted lists can be much much faster.
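The inverted-list idea for pairs can be sketched as follows. This is only an illustration under assumptions that are not in the question: the function name frequent_pairs, the min_support parameter (how many tuples must contain both items), and the rule that a pair qualifies only when both A -> B and B -> A reach the requested confidence.

```python
from itertools import combinations


def frequent_pairs(min_conf, dataset, min_support=2):
    """Return item pairs that co-occur with at least min_conf confidence."""
    # Inverted lists: item -> set of indices of the tuples that contain it.
    index = {}
    for i, row in enumerate(dataset):
        for item in set(row):
            index.setdefault(item, set()).add(i)

    pairs = []
    for a, b in combinations(sorted(index), 2):
        both = len(index[a] & index[b])  # tuples containing a and b
        if both < min_support:
            continue
        # Assumption: require the confidence of a -> b and of b -> a.
        conf = min(both / len(index[a]), both / len(index[b]))
        if conf >= min_conf:
            pairs.append((a, b))
    return pairs


dataset = (('a', 'b', 'c'), ('1', 'a'), ('x', 'y', 'b', 'c'))
result = frequent_pairs(1.0, dataset)
```

With min_support=1 the pair ('x', 'y') would qualify as well, since both items occur only once and do so together; whether that counts as a pattern depends on your data.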


Get paths of minimum MSE python

I have a list of lists of vectors.
For each of the vectors in the first list, I want to extract the path of minimum distance (MSE) across the vectors in each list.
For example, for the first element in the first list, I should obtain this path:
[1,2,3] -> [2,6,3] -> [1,7,3]
in terms of indexes:
I should obtain this path for each element in the first list. The lists are huge and the real vectors are of about 300 elements.
Is there some Pythonic method that avoids hard iterating with for loops?
My algorithms knowledge is a little limited. I don't think there is any particular Python-specific best method for this. The comment Rock LI made is accurate. There's a million dollar prize if you can find a best method for this. Implement Dijkstra's algorithm or whatever your favorite search method is for this. You can automatically calculate the weights from one list to the next. Beyond that, it's pure algorithms.
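A minimal sketch of that idea, assuming the goal is the path with the smallest total MSE that visits one vector in each subsequent list. Because edges only go from one list to the next, a shortest-path search reduces to a layer-by-layer dynamic program here. The names mse and min_mse_path and the sample vectors other than those in the question's example path are made up for illustration.

```python
def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def min_mse_path(start, layers):
    """Return (total_cost, indices), one chosen index per layer."""
    prev_vectors, cost, back = [start], [0.0], []
    for layer in layers:
        new_cost, pointers = [], []
        for v in layer:
            # Cheapest predecessor in the previous layer for vector v.
            k = min(range(len(prev_vectors)),
                    key=lambda i: cost[i] + mse(prev_vectors[i], v))
            new_cost.append(cost[k] + mse(prev_vectors[k], v))
            pointers.append(k)
        back.append(pointers)
        prev_vectors, cost = layer, new_cost

    # Walk the back pointers from the cheapest end point.
    j = min(range(len(cost)), key=cost.__getitem__)
    total, path = cost[j], []
    for pointers in reversed(back):
        path.append(j)
        j = pointers[j]
    path.reverse()
    return total, path


layers = [[[2, 6, 3], [9, 9, 9]], [[0, 0, 0], [1, 7, 3]]]
total, path = min_mse_path([1, 2, 3], layers)
```

If the intended path is instead the greedy one (always jump to the nearest vector in the next list), the inner minimum is all that is needed and the back pointers can be dropped.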

Spelling correction likelihood

As stated by most spelling corrector tutorials, the correct word W^ for an incorrectly spelled word X is:
W^ = argmax_W P(X|W) P(W)
where P(X|W) is the likelihood and P(W) is the language model.
In the tutorial from which I am learning spelling correction, the instructor says that P(X|W) can be computed by using a confusion matrix, which keeps track of how many times a letter in our corpus is mistakenly typed for another letter. I am using the World Wide Web as my corpus, and it can't be guaranteed that a letter was mistakenly typed for another letter. So is it okay if I use the Levenshtein distance between X and W instead of the confusion matrix? Does it make much of a difference?
The way I am going to compute the Levenshtein distance in Python is this:
from difflib import SequenceMatcher
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
See this
And here's the tutorial to make my question clearer: Click here
PS. I am working with Python
There are a few things to say.
The model you are using to predict the most likely correction is a simple, cascaded probability model: There is a probability for W to be entered by the user, and a conditional probability for the misspelling X to appear when W was meant. The correct terminology for P(X|W) is conditional probability, not likelihood. (A likelihood is used when estimating how well a candidate probability model matches given data. So it plays a role when you machine-learn a model, not when you apply a model to predict a correction.)
If you were to use Levenshtein distance for P(X|W), you would get integers between 0 and the length of the longer of W and X. This would not be suitable, because you are supposed to use a probability, which has to be between 0 and 1. Even worse, the value you get would grow the more the candidate differs from the input. That's the opposite of what you want.
However, fortunately, SequenceMatcher.ratio() is not actually an implementation of Levenshtein distance. It's an implementation of a similarity measure and returns values between 0 and 1. The closer to 1, the more similar the two strings are. So this makes sense.
Strictly speaking, you would have to verify that SequenceMatcher.ratio() is actually suitable as a probability measure. For this, you'd have to check if the sum of all ratios you get for all possible misspellings of W is a total of 1. This is certainly not the case with SequenceMatcher.ratio(), so it is not in fact a mathematically valid choice.
However, it will still give you reasonable results, and I'd say it can be used for a practical, prototypical implementation of a spell-checker. There is a performance concern, though: since SequenceMatcher.ratio() is applied to a pair of strings (a candidate W and the user input X), you might have to apply it to a huge number of possible candidates coming from the dictionary to select the best match. That will be very slow when your dictionary is large. To improve this, you'll need to implement your dictionary using a data structure that has approximate string search built into it. You may want to look at this existing post for inspiration (it's for Java, but the answers include suggestions of general algorithms).
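A rough sketch of such a prototype, using SequenceMatcher.ratio() as a stand-in for P(X|W) even though, as noted, it is not a true probability. The function correct and the word_probs dictionary (a toy language model with made-up values) are assumptions for illustration only.

```python
from difflib import SequenceMatcher


def similar(a, b):
    """Similarity in [0, 1]; used here in place of P(X|W)."""
    return SequenceMatcher(None, a, b).ratio()


def correct(x, word_probs):
    """Pick the candidate W that maximizes similar(x, W) * P(W)."""
    # Scans every candidate, so this is slow for a large dictionary.
    return max(word_probs, key=lambda w: similar(x, w) * word_probs[w])


# Hypothetical language model: word -> P(W).
word_probs = {"the": 0.05, "ten": 0.001, "tea": 0.002}
best = correct("teh", word_probs)
```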
Yes, it is OK to use Levenshtein distance instead of the corpus of misspellings. Unless you are Google, you will not get access to a large and reliable enough corpus of misspellings. There are many other metrics that will do the job. I have used Levenshtein distance weighted by the distance between the differing letters on a keyboard. The idea is that abc is closer to abx than to abp, because p is farther away from x on my keyboard than c. Another option involves accounting for swapped characters: swap is a more likely correction of sawp than saw, because this is how people type. They often swap the order of characters, but it takes some real talent to type saw and then randomly insert a p at the end.
The rules above are called an error model: you are trying to leverage knowledge of how real-world spelling mistakes occur to help with your decision. You can come up with really complex rules (and people have). Whether they make a difference is an empirical question; you need to try and see. Chances are some rules will work better for some kinds of misspellings and worse for others. Google "how does aspell work" for more examples.
PS: All of the example mistakes above are purely due to the use of a keyboard. Sometimes people do not know how to spell a word; this is a whole other can of worms. Google soundex.
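The swapped-character rule from this answer can be illustrated with an edit distance that also counts a transposition of two adjacent characters as a single edit (the "optimal string alignment" variant). This is a sketch, not the implementation used by the answerer; the function name and the transpositions flag are made up here.

```python
def edit_distance(a, b, transpositions=True):
    """Levenshtein distance, optionally counting adjacent swaps as one edit."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # delete all of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # swap
    return d[-1][-1]


plain = edit_distance("sawp", "swap", transpositions=False)
with_swaps = edit_distance("sawp", "swap")
to_saw = edit_distance("sawp", "saw")
```

With plain Levenshtein distance, saw (distance 1) beats swap (distance 2). Counting swaps brings both to distance 1, so to actually prefer swap you would still need weights, for example a cheaper cost for transpositions than for insertions.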

0/1 Knapsack with few variables: which algorithm?

I have to implement the solution to a 0/1 Knapsack problem with constraints.
My problem will have in most cases few variables (~ 10-20, at most 50).
I recall from university that there are a number of algorithms that in many cases perform better than brute force (I'm thinking, for example, of a branch and bound algorithm).
Since my problem is relatively small, I'm wondering if there is an appreciable advantage in terms of efficiency when using a sophisticated solution as opposed to brute force.
If it helps, I'm programming in Python.
You can use a pseudopolynomial algorithm based on dynamic programming if the sum of weights is small enough. You just calculate, for each X and Y, whether you can get weight X with the first Y items.
This runs in time O(NS), where N is the number of items and S is the sum of weights.
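A minimal sketch of the dynamic programming approach, extended from "is weight X reachable" to "what is the best value within weight X" since the question is about knapsack. It assumes non-negative integer weights; the function name and the sample items are illustrative.

```python
def knapsack(items, capacity):
    """items: list of (weight, value) pairs with integer weights."""
    # best[w] = highest value achievable with total weight <= w.
    best = [0] * (capacity + 1)
    for weight, value in items:
        # Go downwards so each item is used at most once (0/1 knapsack).
        for w in range(capacity, weight - 1, -1):
            best[w] = max(best[w], best[w - weight] + value)
    return best[capacity]


items = [(12, 4), (2, 2), (1, 1), (1, 2), (4, 10)]
best_value = knapsack(items, 15)
```

The table has one entry per unit of capacity, so the running time is O(N * capacity), which is the O(NS) bound above when the capacity is on the order of the total weight.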
Another possibility is to use a meet-in-the-middle approach.
Partition the items into two halves and:
For the first half, take every possible combination of items (there are 2^(N/2) possible combinations in each half) and store its weight in a set.
For the second half, take every possible combination of items and check whether there is a combination in the first half with a suitable weight.
This should run in O(2^(N/2)) time.
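The two steps above can be sketched for the simplest reading of "suitable weight", namely hitting an exact target weight. The function names are made up; handling values and a capacity bound as in a full knapsack would additionally require sorting one half by weight.

```python
def subset_sums(weights):
    """All total weights reachable by some subset of weights."""
    sums = {0}
    for w in weights:
        sums |= {s + w for s in sums}
    return sums


def has_subset_with_weight(weights, target):
    """True if some subset of weights adds up exactly to target."""
    half = len(weights) // 2
    left = subset_sums(weights[:half])  # stored in a hash set
    # For each right-half sum, look up the complement in the left half.
    return any(target - s in left for s in subset_sums(weights[half:]))


weights = [3, 34, 4, 12, 5, 2]
found = has_subset_with_weight(weights, 9)
missing = has_subset_with_weight(weights, 30)
```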
Brute force would work fine for 10 variables, but for, say, 40 you'd get roughly 10^12 (2^40) possible solutions, which would probably take too long to enumerate. I'd consider approximate algorithms, e.g. the polynomial time algorithm, or a search algorithm such as branch-and-bound, maybe with an additional heuristic.
Brute force algorithms will always return the best solution. The problem with them is that for problems of exponential order they quickly become infeasible.
If you are guaranteed to have up to 20 variables, you will test no more than 1 million solutions (2^20= 1M). Hence, brute force is feasible and no other algorithm will return a better solution.
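For a problem of this size, the brute force search is only a few lines. This is a sketch with made-up names and sample data; it enumerates all 2^N take/skip choices.

```python
from itertools import product


def knapsack_brute_force(items, capacity):
    """items: list of (weight, value). Returns (best_value, choice)."""
    best_value, best_choice = 0, (0,) * len(items)
    # Each choice is a tuple of 0/1 flags, one per item.
    for choice in product((0, 1), repeat=len(items)):
        weight = sum(w for (w, _), c in zip(items, choice) if c)
        value = sum(v for (_, v), c in zip(items, choice) if c)
        if weight <= capacity and value > best_value:
            best_value, best_choice = value, choice
    return best_value, best_choice


items = [(12, 4), (2, 2), (1, 1), (1, 2), (4, 10)]
best_value, best_choice = knapsack_brute_force(items, 15)
```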
Heuristics are great, but they should be used only when we have no exact solution to the problem. There is a great book that might help you: How to Solve it, by Michalewicz.

passing text through a dictionary in Python

I currently have python code that compares two texts using the cosine similarity measure. I got the code here.
What I want to do is take the two texts and pass them through a dictionary (not a python dictionary, just a dictionary of words) first before calculating the similarity measure. The dictionary will just be a list of words, although it will be a large list. I know it shouldn't be hard and I could maybe stumble my way through something, but I would like it to be efficient too. Thanks.
If the dictionary fits in memory, use a Python set:
ok_words = set(["a", "b", "c", "e"])
def filter_words(words):
    return [word for word in words if word in ok_words]
If it doesn't fit in memory, you can use shelve.
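A possible shape of the shelve variant, as a sketch only: the helper names are made up, and the temporary directory stands in for wherever the on-disk word list would really live. The stored value is irrelevant; only key membership is used.

```python
import os
import shelve
import tempfile


def build_word_store(path, words):
    """Write the dictionary words to an on-disk shelf (keys only matter)."""
    with shelve.open(path) as db:
        for word in words:
            db[word] = True


def filter_words(words, path):
    """Keep only the words that are present in the on-disk shelf."""
    with shelve.open(path, flag="r") as db:
        return [word for word in words if word in db]


with tempfile.TemporaryDirectory() as tmp:
    store = os.path.join(tmp, "ok_words")
    build_word_store(store, ["a", "b", "c", "e"])
    filtered = filter_words(["a", "d", "e", "z"], store)
```

Each membership test goes to disk, so this is slower than the set; it only makes sense when the word list really does not fit in memory.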
The structure you are trying to create is known as an inverted index. Here you can find some general information about it and snippets from Heaps and Mills's implementation. Unfortunately, I wasn't able to find its source, or any other efficient implementation. (Please leave a comment if you find any.)
If your goal is not to create a library in pure Python, you can use PyLucene, a Python extension for accessing Lucene, which in turn is a very powerful search engine written in Java. Lucene implements an inverted index and can easily provide you with information on word frequency. It also supports a wide range of analyzers (parsers + stemmers) for a dozen languages.
(Also note that Lucene already has its own Similarity measure class.)
Some words about similarity and Vector Space Models. The VSM is a very powerful abstraction, but your implementation suffers from several disadvantages. As the number of documents in your index grows, your co-occurrence matrix will become too big to fit in memory, and searching it will take a long time. To counter this effect, dimension reduction is used. In methods like LSA this is done by Singular Value Decomposition. Also pay attention to techniques such as PLSA, which uses probability theory, and Random Indexing, which is the only incremental (and so the only one appropriate for large indexes) VSM method.

Bubble Breaker Game Solver better than greedy?

As a mental exercise I decided to try to solve the bubble breaker game found on many cell phones, as well as an example here: Bubble Break Game
The random (N,M,C) board consists of N rows x M columns with C colors
The goal is to get the highest score by picking the sequence of bubble groups that ultimately leads to the highest score
A bubble group is 2 or more bubbles of the same color that are adjacent to each other in either x or y direction. Diagonals do not count
When a group is picked, the bubbles disappear; any holes are first filled with bubbles from above (i.e. shift down), then any remaining holes are filled by shifting right
A bubble group score = n * (n - 1) where n is the number of bubbles in the bubble group
The first algorithm is a simple exhaustive recursive algorithm which explores going through the board row by row and column by column picking bubble groups. Once the bubble group is picked, we create a new board and try to solve that board, recursively descending down
Some of the ideas I am using include normalized memoization. Once a board is solved we store the board and the best score in a memoization table.
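A compact sketch of such an exhaustive solver with memoization. It is not the prototype described in this question: the board representation (a tuple of non-empty columns, each listed bottom to top, so that gravity and closing empty columns come for free) and all function names are assumptions, and no board normalization is done.

```python
from functools import lru_cache


def find_groups(board):
    """Return all groups (sets of (col, row)) of 2+ adjacent same-color cells."""
    seen, groups = set(), []
    for c, column in enumerate(board):
        for r, color in enumerate(column):
            if (c, r) in seen:
                continue
            stack, group = [(c, r)], {(c, r)}
            while stack:  # flood fill over x/y neighbours
                x, y = stack.pop()
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < len(board) and 0 <= ny < len(board[nx])
                            and (nx, ny) not in group
                            and board[nx][ny] == color):
                        group.add((nx, ny))
                        stack.append((nx, ny))
            seen |= group
            if len(group) >= 2:
                groups.append(group)
    return groups


def pop_group(board, group):
    """Remove a group; cells fall down and empty columns are dropped."""
    columns = []
    for c, column in enumerate(board):
        kept = tuple(v for r, v in enumerate(column) if (c, r) not in group)
        if kept:
            columns.append(kept)
    return tuple(columns)


@lru_cache(maxsize=None)  # memoization table keyed by the board itself
def best_score(board):
    best = 0
    for group in find_groups(board):
        n = len(group)
        best = max(best, n * (n - 1) + best_score(pop_group(board, group)))
    return best


board = (("A", "B", "A"), ("A", "B"))
score = best_score(board)
```

On this tiny board, popping the B pair first lets three A bubbles merge (2 + 6 = 8 points), whereas popping the A pair first yields only 4, which is why the order of moves has to be searched.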
I created a prototype in Python, which shows that a (2,15,5) board takes 8859 boards to solve in about 3 seconds. A (3,15,5) board takes 12,384,726 boards in 50 minutes on a server. The solver rate is ~3k-4k boards/sec and gradually decreases as the memoization search takes longer. The memoization table grows to 5,692,482 boards and is hit 6,713,566 times.
What other approaches could yield high scores besides the exhaustive search?
I don't see any obvious way to divide and conquer, but trending towards larger and larger bubble groups seems to be one approach.
Thanks to David Locke for posting the paper link, which talks about a window solver that uses a constant-depth lookahead heuristic.
According to this paper, determining if you can empty the board (which is related to the problem you want to solve) is NP-Complete. That doesn't mean that you won't be able to find a good algorithm, it just means that you likely won't find an efficient one.
I'm thinking you could try a branch and bound search with the following idea:
Given a state of the game S, you branch on S by breaking it up in m sets Si where each Si is the state after taking a legal move of all m legal moves given the state S
You need two functions U(S) and L(S) that compute an upper and a lower bound, respectively, for a given state S.
For the U(S) function, I'm thinking you calculate the score that you would get if you were able to freely shuffle K bubbles on the board (each move) and arrange the blocks in the way that results in the highest score, where K is a value you choose yourself. When you're calculating U(S) for a given S, it should go quicker if you choose a higher K (the conditions are relaxed), so choosing the value of K is a trade-off between how quickly you find U(S) and its quality (how tight an upper bound U(S) is).
For the L(S) function, calculate the score that you would get if you simply kept clicking randomly until you got to a state that could not be solved any further. You can do this several times, taking the highest lower bound that you get.
Once you have these two functions you can apply a standard branch and bound search. Note that the speed of your search is going to depend greatly on how tight your upper bound and your lower bound are.
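To show the shape of such a search, here is a sketch on a deliberately simplified one-dimensional variant (a single row of bubbles given as a string, where a move removes a run of 2+ equal characters). The upper bound used is the fully relaxed one, pretending all bubbles of a color can be merged into one group; the best score found so far plays the role of the lower bound. All names are made up for illustration.

```python
from collections import Counter
from itertools import groupby


def moves(row):
    """Yield (gain, next_row) for every run of 2+ equal bubbles."""
    i = 0
    for _, run in groupby(row):
        n = len(list(run))
        if n >= 2:
            yield n * (n - 1), row[:i] + row[i + n:]
        i += n


def upper_bound(row):
    """Relaxation: every color is popped as one single big group."""
    return sum(n * (n - 1) for n in Counter(row).values())


def branch_and_bound(row):
    best = 0

    def visit(state, score):
        nonlocal best
        best = max(best, score)
        # Prune: even the optimistic bound cannot beat the best so far.
        if score + upper_bound(state) <= best:
            return
        for gain, nxt in moves(state):
            visit(nxt, score + gain)

    visit(row, 0)
    return best


score = branch_and_bound("AABBA")
```

The bound is valid because n * (n - 1) grows faster than linearly, so splitting a color into several groups can never score more than popping it as one group.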
To get a faster solution than exhaustive search, I think what you want is probably dynamic programming. In dynamic programming, you find some sort of "step" that takes you possibly closer to your solution, and keep track of the results of each step in a big matrix. Then, once you have filled in the matrix, you can find the best result, and then work backward to get a path through the matrix that leads to the best result. The matrix is effectively a form of memoization.
Dynamic programming is discussed in The Algorithm Design Manual but there is also plenty of discussion of it on the web. Here's a good intro:
I'm not sure exactly what the "step" is for this problem. Perhaps you could make a scoring metric for a board that simply sums the points for each of the bubble groups, and then record this score as you try popping balloons? Good steps would tend to cause bubble groups to coalesce, improving the score, and bad steps would break up bubble groups, making the score worse.
You can translate this problem into the problem of searching for a shortest path in a graph.
I would try it with A*, and the heuristic would include the number of islands.
In my chess program I use some ideas which could probably be adapted to this problem.
Move Ordering. First find all possible moves, store them in a list, and sort them according to some heuristic: the "better" ones first, the "bad" ones last. For example, this could be a function of the size of the group (prefer medium sized groups), or the number of adjacent colors, groups, etc.
Iterative Deepening. Instead of running a pure depth-first search, cut off the search after a certain depth and use some heuristic to assess the result. Now re-search the tree with the "better" moves first.
Pruning. Don't search moves which seem "obviously" bad according to, again, some heuristic. This involves the risk that you won't find the optimal solution anymore, but depending on your heuristics you will very likely find it much earlier.
Hash Tables. No need to store every board you come across, just remember a certain number and overwrite older
I'm almost finished writing my version of the "solver" in Java. It does both exhaustive search, which takes fricking ages for larger board sizes, and a directed search based on a "pool" of possible paths, which is pruned after every generation, and a fitness function used to prune the pool. I'm just trying to tune the fitness function now...
Update - this is now available at
This isn't my area of expertise, but I would like to recommend a book to you. Get a copy of The Algorithm Design Manual by Steven Skiena. This has a whole list of different algorithms, and once you read through it you can use it as a reference. If nothing else it will help you consider your options.

