implementation of genetic algorithm for spell checker

I want to implement a spell checker that checks the spelling in a text file and outputs the errors and corrections. I want to create this using Python.
But the main thing is that I want to implement it using a genetic algorithm. How can I implement a genetic algorithm for a spell checker?

Don't expect my idea here to be perfect or optimal, but it might be a good starting point for you if you decide to go this route. A genetic algorithm may not be the best choice for a spell checker though.
For a genetic algorithm, you need to have a starting population, a way to pass the genes to the "next generation" (crossover), a definite means of creating mutations, and a way of selecting which ones are passed on to the next generation (aka a fitness function). Along with this you'll need, of course, a corpus. You can try the dictionary.com API if it's any good (I've never used it) http://www.programmableweb.com/api/dictionary.com.
For the starting population, you have the horrible issue that it will be thousands of copies of the exact same word (i.e. ['hello']*1000). From here you can just check whether it's a word, and if it is, just return True (because grammar checking there vs their vs they're will be a pain in the ass).
To start off, you'll need to rely entirely on mutations to gain diversity, so maybe make mutations more likely if it's an earlier generation, and once the diversity grows the chance of mutation decreases. Mutations can be any of: insert a random letter somewhere, remove a letter somewhere, change a letter somewhere, do more than one of these.
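A minimal sketch of those mutation operators, assuming lowercase ASCII words. The function name mutate and the rng parameter are made up for illustration, and each call applies exactly one of the three edits:

```python
import random
import string


def mutate(word, rng=random):
    """Apply one random edit: insert, delete, or substitute a letter."""
    operation = rng.choice(("insert", "delete", "substitute"))
    letter = rng.choice(string.ascii_lowercase)

    if operation == "insert":
        position = rng.randrange(len(word) + 1)
        return word[:position] + letter + word[position:]

    if not word:
        # Nothing to delete or change in an empty word, so grow it instead.
        return letter

    position = rng.randrange(len(word))
    if operation == "delete":
        return word[:position] + word[position + 1:]
    # Substitute (may pick the same letter, leaving the word unchanged).
    return word[:position] + letter + word[position + 1:]
```

Calling it more than once on the same word covers the "do more than one of these" case. A mutation chance that shrinks with the generation number could then decide how often it is called at all.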
For your fitness function, your best bet will be to use a sequence alignment algorithm. See: http://en.wikipedia.org/wiki/Sequence_alignment. If you REALLY want to get advanced, try creating phonetic spellings for each word in your population and see if they match anything in the corpus, and increase the score based on that (e.g. tho and though would have the same pronunciation). I cannot claim to know anything about that. Bear in mind that all of this will slow down your application horribly. It might be best to limit your population to 1000-2000.
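One possible shape for such a fitness function, as a rough sketch only. It uses difflib.SequenceMatcher from the standard library as a stand-in for a proper sequence alignment, and the scoring rule (similarity to the typed word plus a flat bonus for being in the dictionary) is my assumption, not something from a reference implementation:

```python
from difflib import SequenceMatcher


def fitness(candidate, typed_word, dictionary):
    """Score a candidate: closeness to what was typed, plus a bonus for real words.

    `dictionary` is assumed to be a set of known correct words.
    """
    similarity = SequenceMatcher(None, candidate, typed_word).ratio()  # 0.0 .. 1.0
    bonus = 1.0 if candidate in dictionary else 0.0
    return similarity + bonus


words = {"hello", "help", "held"}
print(fitness("hello", "helo", words))  # real word, close to the input
print(fitness("helo", "helo", words))   # identical to the input, but not a word
```

With this rule any dictionary word outranks any non-word, and among dictionary words the one closest to the input wins.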
For your crossover, you should take a few of your samples (early on you may need to use roulette to pick which will be the most fit, but later on you can use tournament for speed purposes). Again you can use the sequence alignment between each "parent", and then decide which letter to pull from each parent (i.e. soeed vs s_eeo can come out to be soeed, seed, seeo, or soeeo).
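The alignment-based crossover could look something like this. Again difflib.SequenceMatcher stands in for a real alignment algorithm, and the name crossover and its rng parameter are invented for the example:

```python
import random
from difflib import SequenceMatcher


def crossover(parent_a, parent_b, rng=random):
    """Align two parents and, wherever they differ, take the piece from either one."""
    child = []
    matcher = SequenceMatcher(None, parent_a, parent_b)
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag == "equal":
            # Both parents agree here, so keep the shared letters.
            child.append(parent_a[a_start:a_end])
        else:
            # The parents disagree: inherit this stretch from one of them at random.
            child.append(rng.choice((parent_a[a_start:a_end], parent_b[b_start:b_end])))
    return "".join(child)
```

For the parents soeed and seeo the alignment is s, an optional o, ee, and then d or o, so the possible children are exactly soeed, seed, seeo and soeeo.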
Don't take this as an expert solution, plus I only put a few minutes of thought into this, but it could be a good start if you decide to use a genetic algorithm.

Related

What is the optimal topic-modelling workflow with MALLET?

Introduction
I'd like to know what other topic modellers consider to be an optimal topic-modelling workflow all the way from pre-processing to maintenance. While this question consists of a number of sub-questions (which I will specify below), I believe this thread would be useful for myself and others who are interested to learn about best practices of end-to-end process.
Proposed Solution Specifications
I'd like the proposed solution to preferably rely on R for text processing (but Python is fine also) and topic-modelling itself to be done in MALLET (although if you believe other solutions work better, please let us know). I tend to use the topicmodels package in R; however, I would like to switch to MALLET as it offers many benefits over topicmodels. It can handle a lot of data, it does not rely on specific text pre-processing tools and it appears to be widely used for this purpose. However, some of the issues outlined below are also relevant for topicmodels. I'd like to know how others approach topic modelling and which of the below steps could be improved. Any useful piece of advice is welcome.
Outline
Here is how it's going to work: I'm going to go through the workflow which in my opinion works reasonably well, and I'm going to outline problems at each step.
Proposed Workflow
1. Clean text
This involves removing punctuation marks, digits and stop words, stemming words, and other text-processing tasks. Many of these can be done as part of term-document matrix decomposition through functions such as TermDocumentMatrix from R's package tm.
Problem: This however may need to be performed on the text strings directly, using functions such as gsub, in order for MALLET to consume these strings. Performing it on the strings directly is not as efficient, as it involves repetition (e.g. the same word would have to be stemmed several times).
2. Construct features
In this step we construct a term-document matrix (TDM), followed by the filtering of terms based on frequency, and TF-IDF values. It is preferable to limit your bag of features to about 1000 or so. Next go through the terms and identify what requires to be (1) dropped (some stop words will make it through), (2) renamed or (3) merged with existing entries. While I'm familiar with the concept of stem-completion, I find that it rarely works well.
Problem: (1) Unfortunately MALLET does not work with TDM constructs and to make use of your TDM, you would need to find the difference between the original TDM -- with no features removed -- and the TDM that you are happy with. This difference would become stop words for MALLET. (2) On that note I'd also like to point out that feature selection does require a substantial amount of manual work and if anyone has ideas on how to minimise it, please share your thoughts.
Side note: If you decide to stick with R alone, then I can recommend the quanteda package which has a function dfm that accepts a thesaurus as one of the parameters. This thesaurus allows you to capture patterns (usually regex) as opposed to words themselves, so for example you could have a pattern \\bsign\\w*.?ups? that would match sign-up, signed up and so on.
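For anyone doing the cleaning in Python instead, here is a small sketch of the same pattern idea using only the standard re module. The doubled backslashes above are R string escaping, so in a Python raw string they become single ones. The function name and the canonical token sign_up are made up:

```python
import re

# Same pattern as in the side note, written as a Python raw string.
SIGNUP = re.compile(r"\bsign\w*.?ups?", re.IGNORECASE)


def normalise_signups(text, token="sign_up"):
    """Collapse every sign-up variant into one canonical token."""
    return SIGNUP.sub(token, text)


print(normalise_signups("Sign-up today; 40 users signed up, 3 signups failed."))
```

Doing this before tokenizing means all the variants end up as a single feature in the vocabulary.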
3. Find optimal parameters
This is a hard one. I tend to break data into test-train sets and run cross-validation fitting a model of k topics and testing the fit using held-out data. Log likelihood is recorded and compared for different resolutions of topics.
Problem: Log likelihood does help to understand how good the fit is, but (1) it often tends to suggest that I need more topics than is practically sensible and (2) given how long it generally takes to fit a model, it is virtually impossible to find or test a grid of optimal values such as iterations, alpha, burn-in and so on.
Side note: When selecting the optimal number of topics, I generally select a range of topics incrementing by 5 or so as incrementing a range by 1 generally takes too long to compute.
4. Maintenance
It is easy to classify new data into a set of existing topics. However, if you are running it over time, you would naturally expect that some of your topics may cease to be relevant, while new topics may appear. Furthermore, it might be of interest to study the lifecycle of topics. This is difficult to account for, as you are dealing with a problem that requires an unsupervised solution and yet, for it to be tracked over time, you need to approach it in a supervised way.
Problem: To overcome the above issue, you would need to (1) fit new data into an old set of topics, (2) construct a new topic model based on new data, (3) monitor log likelihood values over time and devise a threshold for when to switch from old to new; and (4) merge old and new solutions somehow so that the evolution of topics would be revealed to a lay observer.
Recap of Problems
String cleaning for MALLET to consume the data is inefficient.
Feature selection requires manual work.
Selecting the optimal number of topics based on LL does not account for what is practically sensible.
Computational complexity does not give the opportunity to find an optimal grid of parameters (other than the number of topics).
Maintenance of topics over time poses challenging issues as you have to retain history but also reflect what is currently relevant.
If you've read this far, I'd like to thank you; this is a rather long post. If you are interested in the subject, feel free to either add more questions in the comments that you think are relevant or offer your thoughts on how to overcome some of these problems.
Cheers

Thank you for this thorough summary!
As an alternative to topicmodels try the package mallet in R. It runs Mallet in a JVM directly from R and allows you to pull out results as R tables. I expect to release a new version soon, and compatibility with tm constructs is something others have requested.
To clarify, it's a good idea for documents to be at most around 1000 tokens long (not vocabulary). Any more and you start to lose useful information. The assumption of the model is that the position of a token within a given document doesn't tell you anything about that token's topic. That's rarely true for longer documents, so it helps to break them up.
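Breaking long documents up is simple to do before the text goes anywhere near the model. A minimal sketch, assuming the document is already a list of tokens (the function name is made up, and splitting on paragraph or section boundaries may well be a better choice than fixed-size chunks):

```python
def split_document(tokens, max_tokens=1000):
    """Split a token list into consecutive chunks of at most max_tokens tokens."""
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


chunks = split_document(["tok"] * 2500)
print([len(chunk) for chunk in chunks])  # chunk sizes: 1000, 1000, 500
```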
Another point I would add is that documents that are too short can also be a problem. Tweets, for example, don't seem to provide enough contextual information about word co-occurrence, so the model often devolves into a one-topic-per-doc clustering algorithm. Combining multiple related short documents can make a big difference.
Vocabulary curation is in practice the most challenging part of a topic modeling workflow. Replacing selected multi-word terms with single tokens (for example by swapping spaces for underscores) before tokenizing is a very good idea. Stemming is almost never useful, at least for English. Automated methods can help vocabulary curation, but this step has a profound impact on results (much more than the number of topics) and I am reluctant to encourage people to fully trust any system.
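The underscore trick can be sketched in a few lines. This assumes you already have a curated list of phrases (the hard part), and the helper name join_phrases is invented for the example:

```python
import re


def join_phrases(text, phrases):
    """Replace each known multi-word term with a single underscore-joined token."""
    # Longest phrases first, so "new york city" wins over "new york".
    for phrase in sorted(phrases, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        token = phrase.replace(" ", "_")
        # A function replacement avoids any backslash handling in the token.
        text = re.sub(pattern, lambda match, token=token: token, text, flags=re.IGNORECASE)
    return text


print(join_phrases("Topic modeling in New York City",
                   ["new york", "new york city", "topic modeling"]))
```

Note that the match is case-insensitive and the replacement uses the phrase as listed, so the output token is in whatever case the phrase list uses.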
Parameters: I do not believe that there is a right number of topics. I recommend using a number of topics that provides the granularity that suits your application. Likelihood can often detect when you have too few topics, but after a threshold it doesn't provide much useful information. Using hyperparameter optimization makes models much less sensitive to this setting as well, which might reduce the number of parameters that you need to search over.
Topic drift: This is not a well understood problem. More examples of real-world corpus change would be useful. Looking for changes in vocabulary (e.g. proportion of out-of-vocabulary words) is a quick proxy for how well a model will fit.
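The out-of-vocabulary proxy is cheap to compute. A sketch, assuming new documents are tokenized the same way as the training data and that the training vocabulary is available as a set (the name oov_rate is made up):

```python
def oov_rate(tokens, vocabulary):
    """Fraction of tokens in a new document that the model's vocabulary has never seen."""
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token not in vocabulary)
    return unknown / len(tokens)


vocabulary = {"topic", "model", "corpus"}
print(oov_rate(["topic", "model", "blockchain", "nft"], vocabulary))  # 0.5
```

Tracking this number per batch of new documents gives an early hint that the corpus is moving away from what the model was trained on.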

Spelling correction likelihood

As stated by most spelling correction tutorials, the correct word Ŵ for an incorrectly spelled word X is:
Ŵ = argmax_W P(X|W) P(W)
where P(X|W) is the likelihood and P(W) is the language model.
In the tutorial from which I am learning spelling correction, the instructor says that P(X|W) can be computed by using a confusion matrix which keeps track of how many times a letter in our corpus is mistakenly typed for another letter. I am using the World Wide Web as my corpus and it can't be guaranteed that a letter was mistakenly typed for another letter. So is it okay if I use the Levenshtein distance between X and W, instead of using the confusion matrix? Does it make much of a difference?
The way I am going to compute the Levenshtein distance in Python is this:
    from difflib import SequenceMatcher

    def similar(a, b):
        return SequenceMatcher(None, a, b).ratio()
See this
And here's the tutorial to make my question clearer: Click here
PS. I am working with Python

There are a few things to say.
The model you are using to predict the most likely correction is a simple, cascaded probability model: There is a probability for W to be entered by the user, and a conditional probability for the misspelling X to appear when W was meant. The correct terminology for P(X|W) is conditional probability, not likelihood. (A likelihood is used when estimating how well a candidate probability model matches given data. So it plays a role when you machine-learn a model, not when you apply a model to predict a correction.)
If you were to use Levenshtein distance for P(X|W), you would get integers between 0 and the length of the longer of W and X. This would not be suitable, because you are supposed to use a probability, which has to be between 0 and 1. Even worse, the more the candidate differs from the input, the larger the value you get. That's the opposite of what you want.
However, fortunately, SequenceMatcher.ratio() is not actually an implementation of Levenshtein distance. It's an implementation of a similarity measure and returns values between 0 and 1. The closer to 1, the more similar the two strings are. So this makes sense.
Strictly speaking, you would have to verify that SequenceMatcher.ratio() is actually suitable as a probability measure. For this, you'd have to check if the sum of all ratios you get for all possible misspellings of W is a total of 1. This is certainly not the case with SequenceMatcher.ratio(), so it is not in fact a mathematically valid choice.
However, it will still give you reasonable results, and I'd say it can be used for a practical and prototypical implementation of a spell-checker. There is a performance concern, though: Since SequenceMatcher.ratio() is applied to a pair of strings (a candidate W and the user input X), you might have to apply this to a huge number of possible candidates coming from the dictionary to select the best match. That will be very slow when your dictionary is large. To improve this, you'll need to implement your dictionary using a data structure that has approximate string search built into it. You may want to look at this existing post for inspiration (it's for Java, but the answers include suggestions of general algorithms).
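To make the difference between a distance and a similarity concrete, here is a sketch of a real Levenshtein distance (standard dynamic programming, two rows at a time) and one common way to squash it into the 0 to 1 range. The normalisation by the longer length is my choice for illustration; like SequenceMatcher.ratio(), the result is a similarity score, not a proper probability:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, or free match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Map the distance to 0..1, where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
print(similarity("kitten", "sitting"))
```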

Yes, it is OK to use Levenshtein distance instead of the corpus of misspellings. Unless you are Google, you will not get access to a large and reliable enough corpus of misspellings. There are many other metrics that will do the job. I have used Levenshtein distance weighted by distance of differing letters on a keyboard. The idea is that abc is closer to abx than to abp, because p is farther away from x on my keyboard than c. Another option involves accounting for swapped characters: swap is a more likely correction of sawp than saw, because this is how people type. They often swap the order of characters, but it takes some real talent to type saw and then randomly insert a p at the end.
The rules above are called an error model: you are trying to leverage knowledge of how real-world spelling mistakes occur to help with your decision. You can (and people have) come up with really complex rules. Whether they make a difference is an empirical question; you need to try and see. Chances are some rules will work better for some kinds of misspellings and worse for others. Google how does aspell work for more examples.
PS All of the example mistakes above have been purely due to the use of a keyboard. Sometimes people do not know how to spell a word; this is a whole other can of worms. Google soundex.
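The swapped-characters rule can be sketched as a small extension of Levenshtein distance, usually called the optimal string alignment variant of Damerau-Levenshtein: a swap of two adjacent characters counts as one edit instead of two. Every edit costs 1 here; the keyboard-distance weighting is left out to keep the example short:

```python
def osa_distance(a, b):
    """Edit distance where swapping two adjacent characters costs a single edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete everything from a[:i]
    for j in range(cols):
        d[0][j] = j  # insert everything from b[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition: "aw" typed as "wa".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("sawp", "swap"))  # 1: one swap
print(osa_distance("sawp", "saw"))   # 1: one deletion
```

With plain Levenshtein distance sawp to swap costs 2, so saw would win. Here the two candidates tie at 1, and the language model P(W) or an extra weight favouring swaps would have to break the tie.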

Efficient scheduling of university courses

I'm currently working on a website that will allow students from my university to automatically generate valid schedules based on the courses they'd like to take.
Before working on the site itself, I decided to tackle the issue of how to schedule the courses efficiently.
A few clarifications:
Each course at our university (and I assume at every other university) comprises one or more sections. So, for instance, Calculus I currently has 4 sections available. The number of sections, and whether or not the course has a lab, drastically affects the scheduling process.
Courses at our university are represented using a combination of subject abbreviation and course code. In the case of Calculus I: MATH 1110.
The CRN is a code unique to a section.
The university I study at is not mixed, meaning males and females study in (almost) separate campuses. What I mean by almost is that the campus is divided into two.
The datetimes and timeranges dicts are meant to decrease calls to datetime.datetime.strptime(), which was a real bottleneck.
My first attempt consisted of the algorithm looping continuously until 30 schedules were found. Schedules were created by randomly choosing a section from one of the inputted courses, and then trying to place sections from the remaining courses to try to construct a valid schedule. If not all of the courses fit into the schedule i.e. there were conflicts, the schedule was scrapped and the loop continued.
Clearly, the above solution is flawed. The algorithm took too long to run, and relied too much on randomness.
The second algorithm does the exact opposite of the old one. First, it generates a collection of all possible schedule combinations using itertools.product(). It then iterates through the schedules, crossing off any that are invalid. To ensure assorted sections, the schedule combinations are shuffled (random.shuffle()) before being validated. Again, there is a bit of randomness involved.
After a bit of optimization, I was able to get the scheduler to run in under 1 second for an average schedule consisting of 5 courses. That's great, but the problem begins once you start adding more courses.
To give you an idea, when I provide a certain set of inputs, the amount of combinations possible is so large that itertools.product() does not terminate in a reasonable amount of time, and eats up 1GB of RAM in the process.
Obviously, if I'm going to make this a service, I'm going to need a faster and more efficient algorithm. Two that have popped up online and in IRC: dynamic programming and genetic algorithms.
Dynamic programming cannot be applied to this problem because, if I understand the concept correctly, it involves breaking up the problem into smaller pieces, solving these pieces individually, and then bringing the solutions of these pieces together to form a complete solution. As far as I can see, this does not apply here.
As for genetic algorithms, I do not understand them much, and cannot even begin to fathom how to apply one in such a situation. I also understand that a GA would be more efficient for an extremely large problem space, and this is not that large.
What alternatives do I have? Is there a relatively understandable approach I can take to solve this problem? Or should I just stick to what I have and hope that not many people decide to take 8 courses next semester?
I'm not a great writer, so I'm sorry for any ambiguities in the question. Please feel free to ask for clarification and I'll try my best to help.
Here is the code in its entirety.
http://bpaste.net/show/ZY36uvAgcb1ujjUGKA1d/
Note: Sorry for using a misleading tag (scheduling).

Scheduling is a very famous constraint satisfaction problem that is generally NP-Complete. A lot of work has been done on the subject, even in the same context as you: Solving the University Class Scheduling Problem Using Advanced ILP Techniques. There are even textbooks on the subject.
People have taken many approaches, including:
Dynamic programming
Genetic algorithms
Neural networks
You need to reduce your problem-space and complexity. Make as many assumptions as possible (max number of classes, block-based timing, etc.). There is no silver bullet for this problem but it should be possible to find a near-optimal solution.
Some semi-recent publications:
QUICK scheduler a time-saving tool for scheduling class sections
Scheduling classes on a College Campus

Did you ever read anything about genetic programming? The idea behind it is that you let the 'thing' you want solved evolve, just by itself, until it has grown to the best solution(s) possible.
You generate a thousand schedules, of which usually zero are anywhere in the right direction of being valid. Next, you change 'some' courses, randomly. From these new schedules you select some of the best, based on ratings you give according to the 'goodness' of the schedule. Next, you let them reproduce, by combining some of the courses on both schedules. You end up with a thousand new schedules, but all of them a tiny fraction better than the ones you had. Let it repeat until you are satisfied, and select the schedule with the highest rating from the last thousand you generated.
There is randomness involved, I admit, but the schedules keep getting better, no matter how long you let the algorithm run. Just like real life and organisms there is survival of the fittest, and it is possible to view the different general 'threads' of the same kind of schedule, that is about as good as another one generated. Two very different schedules can finally 'battle' it out by cross breeding.
A project involving school schedules and genetic programming:
http://www.codeproject.com/Articles/23111/Making-a-Class-Schedule-Using-a-Genetic-Algorithm
I think they explain pretty well what you need.
My final note: I think this is a very interesting project. It is quite difficult to make, but once done it is just great to see your solution evolve, just like real life. Good luck!

The way you're currently generating combinations of sections is probably throwing up huge numbers of combinations that are excluded by conflicts between more than one course. I think you could reduce the number of combinations that you need to deal with by generating the product of the sections for only two courses first. Eliminate the conflicts from that set, then introduce the sections for a third course. Eliminate again, then introduce a fourth, and so on. This should see a more linear growth in the processing time required as the number of courses selected increases.
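A rough sketch of that incremental approach. The data layout is an assumption for the example: each section is a dict with a crn and a list of (day, start, end) meetings, with times in minutes since midnight, which may not match the structures in the original code:

```python
def conflicts(section_a, section_b):
    """True if any two meetings are on the same day and overlap in time."""
    return any(
        day_a == day_b and start_a < end_b and start_b < end_a
        for day_a, start_a, end_a in section_a["meetings"]
        for day_b, start_b, end_b in section_b["meetings"]
    )


def build_schedules(courses):
    """courses is a list of courses, each a list of sections.

    Adds one course at a time and drops conflicting partial schedules
    immediately, instead of generating the full product first.
    """
    partial = [[]]
    for sections in courses:
        partial = [
            schedule + [section]
            for schedule in partial
            for section in sections
            if not any(conflicts(section, chosen) for chosen in schedule)
        ]
    return partial


calculus = [{"crn": 1, "meetings": [("Mon", 540, 600)]},
            {"crn": 2, "meetings": [("Tue", 540, 600)]}]
physics = [{"crn": 3, "meetings": [("Mon", 570, 630)]}]
chemistry = [{"crn": 4, "meetings": [("Tue", 600, 660)]}]

for schedule in build_schedules([calculus, physics, chemistry]):
    print([section["crn"] for section in schedule])  # [2, 3, 4]
```

Putting the courses with the fewest sections first tends to keep the intermediate lists small. Shuffling each course's sections beforehand would keep the variety that random.shuffle() provided.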

This is a hard problem. If you google something like 'course scheduling problem paper' you will find a lot of references. Genetic algorithm - no, dynamic programming - yes. GAs are much harder to understand and implement than standard DP algos. Usually people who use GAs out of the box don't understand standard techniques. Do some research and you will find different algorithms. You might be able to find some implementations. Coming up with your own algorithm is way, way harder than putting some effort into understanding DP.

The problem you're describing is a Constraint Satisfaction Problem. My approach would be the following:
Check if there are any incompatibilities between courses; if yes, record them as constraints or arcs
While no solution is found:
Select the course with the fewest constraints (that is, the fewest incompatibilities with other courses)
Run the AC-3 algorithm to reduce search space
I've tried this approach with sudoku solving and it worked (solved the hardest sudoku in the world in less than 10 seconds)

python lottery suggestion

I know Python offers the random module to do some simple lottery. Let's say random.shuffle() is a good one.
However, I want to build my own simple one. What should I look into? Is there any specific mathematical philosophy behind a lottery?
Let's say, the simplest situation: 100 names, and pick 20 names randomly.
I don't want to use shuffle, since I want to learn to build one myself.
I need some advice to start. Thanks.

You can generate your own pseudo-random numbers -- there's a huge amount of theory behind that, start for example here -- and of course you won't be able to compete with Python's random "Mersenne twister" (explained halfway down the large wikipedia page I pointed you to), in either quality or speed, but for purposes of understanding, it's a good endeavor. Or, you can get physically-random numbers, for example from /dev/random or /dev/urandom on Linux machines (Windows machines have their own interfaces for that, too) -- one has more pushy physical randomness, the other one has better performance.
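As a taste of the "roll your own" route, here is a sketch of one of the oldest and simplest generators, a linear congruential generator. The multiplier and increment are the widely cited constants popularised by Numerical Recipes. The class name is made up, and this is for learning only, not for anything where the quality of the randomness matters:

```python
class LinearCongruentialGenerator:
    """state = (a * state + c) mod m, with a = 1664525, c = 1013904223, m = 2**32."""

    def __init__(self, seed):
        self.state = seed % 2**32

    def next_raw(self):
        """Advance the generator and return the new 32-bit state."""
        self.state = (1664525 * self.state + 1013904223) % 2**32
        return self.state

    def myrand(self, n):
        """Pseudorandom int between 0 included and n excluded.

        Uses the high bits (the low bits of an LCG repeat quickly) and a
        plain modulo, which is slightly biased but fine for a learning exercise.
        """
        return (self.next_raw() >> 16) % n


generator = LinearCongruentialGenerator(seed=2024)
print([generator.myrand(100) for _ in range(5)])
```

The same seed always gives the same sequence, which is exactly what "pseudo" in pseudo-random means.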
Once you do have (or borrow from random;-) a pseudo-random (or really random) number generator, picking 20 items at random from 100 is still an interesting problem. While shuffling is a more general approach, a more immediately understandable one might be, assuming your myrand(N) function returns a random or pseudorandom int between 0 included and N excluded:
    def pickfromlist(howmany, thelist):
        result = []
        listcopy = list(thelist)
        while listcopy and len(result) < howmany:
            i = myrand(len(listcopy))
            result.append(listcopy.pop(i))
        return result
Definitely not maximally efficient, but, I hope, maximally clear!-) In words: as long as required and feasible, pick one random item out of the remaining ones (the auxiliary list listcopy gives us the "remaining ones" at any step, and gets modified by .pop without altering the input parameter thelist, since it's a shallow copy).

See the Fisher-Yates Shuffle, described also in Knuth's The Art of Computer Programming.
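For picking 20 out of 100 you don't even need the whole shuffle. Here is a sketch of a partial Fisher-Yates shuffle that stops after the first k positions are settled. The function name draw and the randbelow parameter are made up; random.randrange is borrowed as the number source here, and any function returning an int between 0 included and n excluded could be swapped in:

```python
import random


def draw(names, k, randbelow=random.randrange):
    """Pick k distinct items using the first k steps of a Fisher-Yates shuffle."""
    pool = list(names)  # work on a copy so the caller's list is untouched
    k = min(k, len(pool))
    for i in range(k):
        # Choose uniformly among the items not yet fixed (positions i..end)
        # and swap the chosen one into position i.
        j = i + randbelow(len(pool) - i)
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:k]


winners = draw([f"name{i}" for i in range(100)], 20)
print(winners)
```

Each swap is constant time, so this avoids the cost of popping from the middle of a list.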

I praise your desire to do this on your own.
Back in the 1950s, random numbers were unavailable to most people without a supercomputer (of the time). The RAND Corporation published a book called A Million Random Digits with 100,000 Normal Deviates which had, literally, just that: random numbers. It was awesome because it enabled laypeople to use high-quality random numbers for research purposes.
Now, back to your question.
I recommend you read the instructions on how to use the book (yes, it comes with instructions) and try to implement that in your Python code. This will not be efficient or elegant, but you will understand the implications of the algorithm you ultimately settle for. I love the part that instructs you to
"open the book to an unselected page of the digit table and blindly choose a five-digit number; this number with the first number reduced modulo 2 determines the starting line; the two digits to the right of the initially selected five-digit number are reduced modulo 50 to determine the starting column in the starting line"
It was an art to read that table of numbers!
To be sure, I'm not encouraging you to reinvent the wheel for production code. I'm encouraging you to learn about the art of randomness by implementing a clever, if not very efficient, random number generator.
My work requires that I use high-quality random numbers. On limited occasions I have found the site www.random.org a very good source of both insight and material. From their website:
"RANDOM.ORG offers true random numbers to anyone on the Internet. The randomness comes from atmospheric noise, which for many purposes is better than the pseudo-random number algorithms typically used in computer programs. People use RANDOM.ORG for holding drawings, lotteries and sweepstakes, to drive games and gambling sites, for scientific applications and for art and music."
Now, go and implement your own lottery.

You can use: random.sample
"Return a k length list of unique elements chosen from the population sequence. Used for random sampling without replacement."
For a more low-level approach, use random.choice in a loop:
"Return a random element from the non-empty sequence seq."
The pseudo-random generator (PRNG) in Python is pretty good. If you want to go even more low-level, you can implement your own. Start with reading this article. The mathematical name for lottery is "sampling without replacement". Google that for information - here's a good link.

The main shortcoming of software-based methods of generating lottery numbers is the fact that all random numbers generated by software are pseudo-random.
This may not be a problem for your simple application, but you did ask about a 'specific mathematical philosophy'. You will have noticed that all commercial lottery systems use physical methods: balls with numbers.
And behind the scenes, the numbers generated by physical lottery systems will be carefully scrutinised for indications of non-randomness, and steps taken to eliminate it.
As I say, this may not be a consideration for your simple application, but the overriding requirement of a true lottery (the 'specific mathematical philosophy') should be mathematically demonstrable randomness.

Bubble Breaker Game Solver better than greedy?

As a mental exercise I decided to try and solve the bubble breaker game found on many cell phones, as well as an example here: Bubble Break Game
The random (N,M,C) board consists of N rows x M columns with C colors
The goal is to get the highest score by picking the sequence of bubble groups that ultimately leads to the highest score
A bubble group is 2 or more bubbles of the same color that are adjacent to each other in either x or y direction. Diagonals do not count
When a group is picked, the bubbles disappear, any holes are filled with bubbles from above first, ie shift down, then any holes are filled by shifting right
A bubble group score = n * (n - 1) where n is the number of bubbles in the bubble group
The first algorithm is a simple exhaustive recursive algorithm which explores going through the board row by row and column by column picking bubble groups. Once the bubble group is picked, we create a new board and try to solve that board, recursively descending down
Some of the ideas I am using include normalized memoization. Once a board is solved we store the board and the best score in a memoization table.
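For readers who want something to experiment with, here is a compact sketch of that kind of memoized exhaustive search. It is not my prototype: the board representation (a tuple of columns, each listed bottom to top) is an assumption chosen so that gravity comes for free, functools.lru_cache does the memoization, and no board normalization is done:

```python
from functools import lru_cache


def find_groups(board):
    """Return every group of 2+ same-colored, orthogonally adjacent bubbles.

    Each group is a set of (column, row) positions, with row 0 at the bottom.
    """
    seen, groups = set(), []
    for c, column in enumerate(board):
        for r, color in enumerate(column):
            if (c, r) in seen:
                continue
            stack, group = [(c, r)], {(c, r)}
            while stack:  # flood fill from (c, r)
                x, y = stack.pop()
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < len(board) and 0 <= ny < len(board[nx])
                            and (nx, ny) not in group and board[nx][ny] == color):
                        group.add((nx, ny))
                        stack.append((nx, ny))
            seen |= group
            if len(group) >= 2:
                groups.append(group)
    return groups


def pop_group(board, group):
    """Remove a group; bubbles above fall down and empty columns close up."""
    columns = []
    for c, column in enumerate(board):
        kept = tuple(color for r, color in enumerate(column) if (c, r) not in group)
        if kept:
            columns.append(kept)
    return tuple(columns)


@lru_cache(maxsize=None)
def best_score(board):
    """Best total score reachable from this board, trying every group."""
    best = 0
    for group in find_groups(board):
        n = len(group)
        best = max(best, n * (n - 1) + best_score(pop_group(board, group)))
    return best


print(best_score((("A", "B", "B", "A"),)))  # 4: pop BB (2), then AA (2)
```

Since only adjacency matters for scoring, closing up empty columns is equivalent to shifting the remaining columns to one side.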
I created a prototype in Python which shows a (2,15,5) board takes 8859 boards to solve in about 3 seconds. A (3,15,5) board takes 12,384,726 boards in 50 minutes on a server. The solver rate is ~3k-4k boards/sec and gradually decreases as the memoization search takes longer. The memoization table grows to 5,692,482 boards, and hits 6,713,566 times.
What other approaches could yield high scores besides the exhaustive search?
I don't see any obvious way to divide and conquer. But trending towards larger and larger bubble groups seems to be one approach.
Thanks to David Locke for posting the paper link which talks about a window solver which uses a constant-depth lookahead heuristic.

According to this paper, determining if you can empty the board (which is related to the problem you want to solve) is NP-Complete. That doesn't mean that you won't be able to find a good algorithm, it just means that you likely won't find an efficient one.

I'm thinking you could try a branch and bound search with the following idea:
Given a state of the game S, you branch on S by breaking it up in m sets Si where each Si is the state after taking a legal move of all m legal moves given the state S
You need two functions U(S) and L(S) that compute an upper and a lower bound, respectively, of a given state S.
For the U(S) function I'm thinking calculate the score that you would get if you were able to freely shuffle K bubbles in the board (each move) and arrange the blocks in such a way that would result in the highest score, where K is a value you choose yourself. When you're calculating U(S) for a given S it should go quicker if you choose higher K (the conditions are relaxed), so choosing the value of K will be a trade-off between quickness of finding U(S) and quality (how tight an upper bound U(S) is).
For the L(S) function calculate the score that you would get if you simply kept clicking randomly until you got to a state that could not be solved any further. You can do this several times, taking the highest lower bound that you get.
Once you have these two functions you can apply standard branch and bound search. Note that the speed of your search is going to greatly depend on how tight your upper bound is and how tight your lower bound is.

To get a faster solution than exhaustive search, I think what you want is probably dynamic programming. In dynamic programming, you find some sort of "step" that takes you possibly closer to your solution, and keep track of the results of each step in a big matrix. Then, once you have filled in the matrix, you can find the best result, and then work backward to get a path through the matrix that leads to the best result. The matrix is effectively a form of memoization.
Dynamic programming is discussed in The Algorithm Design Manual but there is also plenty of discussion of it on the web. Here's a good intro: http://20bits.com/articles/introduction-to-dynamic-programming/
I'm not sure exactly what the "step" is for this problem. Perhaps you could make a scoring metric for a board that simply sums the points for each of the bubble groups, and then record this score as you try popping balloons? Good steps would tend to cause bubble groups to coalesce, improving the score, and bad steps would break up bubble groups, making the score worse.

You can translate this problem into the problem of searching for the shortest path on a graph. http://en.wikipedia.org/wiki/Shortest_path_problem
I would try with A*, and the heuristics would include the number of islands.

In my chess program I use some ideas which could probably be adapted to this problem.
Move Ordering. First find all possible moves, store them in a list, and sort them according to some heuristic. The "better" ones first, the "bad" ones last. For example, this could be a function of the size of the group (prefer medium sized groups), or the number of adjacent colors, groups, etc.
Iterative Deepening. Instead of running a pure depth-first search, cut off the search after a certain depth and use some heuristic to assess the result. Now re-search the tree with "better" moves first.
Pruning. Don't search moves which seem "obviously" bad, again according to some heuristic. This involves the risk that you won't find the optimal solution anymore, but depending on your heuristics you will very likely find it much earlier.
Hash Tables. No need to store every board you come across, just remember a certain number and overwrite older ones.

I'm almost finished writing my version of the "solver" in Java. It does both exhaustive search, which takes fricking ages for larger board sizes, and a directed search based on a "pool" of possible paths, which is pruned after every generation, and a fitness function used to prune the pool. I'm just trying to tune the fitness function now...
Update - this is now available at http://bubblesolver.sourceforge.net/

This isn't my area of expertise, but I would like to recommend a book to you. Get a copy of The Algorithm Design Manual by Steven Skiena. This has a whole list of different algorithms, and once you read through it you can use it as a reference. If nothing else it will help you consider your options.
