Compare similarity of two names and identify duplicates with neural network - python

I have a dataset which contains pairs of names, it looks like this:
ID; name1; name2
1; Mike Miller; Mike Miler
2; John Doe; Pete McGillen
3; Sara Johnson; Edita Johnson
4; John Lemond-Lee Peter; John LL. Peter
5; Marta Sunz; Martha Sund
6; John Peter; Johanna Petera
7; Joanna Nemzik; Joanna Niemczik
I have some cases, which are labelled. So I check them manually and decide if these are duplicates or not. The manual judgement in these cases would be:
1: Is a duplicate
2: Is not a duplicate
3: Is not a duplicate
4: Is a duplicate
5: Is not a duplicate
6: Is not a duplicate
7: Is a duplicate
(The 7th case is a special one, because here phonetics come into play too. However, this is not the main problem; I am OK with ignoring phonetics.)
A first approach would be to calculate the Levenshtein distance for each pair and mark those pairs as duplicates where the Levenshtein distance is, for example, less than or equal to 2. This would lead to the following output:
1: Levenshtein distance: 2 => duplicate
2: Levenshtein distance: 11 => not a duplicate
3: Levenshtein distance: 4 => not a duplicate
4: Levenshtein distance: 8 => not a duplicate
5: Levenshtein distance: 2 => duplicate
6: Levenshtein distance: 4 => not a duplicate
7: Levenshtein distance: 2 => duplicate
This would be an approach which uses a "fixed" algorithm based on the Levenshtein distance.
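For reference, this fixed-threshold baseline can be sketched in plain Python (no external packages; the threshold of 2 is the one from the example above):

def levenshtein(a, b):
    # Classic dynamic-programming edit distance, computed row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                   # deletion
                            curr[j - 1] + 1,               # insertion
                            prev[j - 1] + (ca != cb)))     # substitution
        prev = curr
    return prev[-1]

pairs = [("Mike Miller", "Mike Miler"),
         ("John Lemond-Lee Peter", "John LL. Peter"),
         ("John Peter", "Johanna Petera")]
for a, b in pairs:
    d = levenshtein(a, b)
    print(a, "|", b, "->", d, "=> duplicate" if d <= 2 else "=> not a duplicate")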
Now, I would like to do this task using a neural network / machine learning:
I do not need the neural network to detect semantic similarity, like "hospital" and "clinic". However, I would like to avoid the Levenshtein distance, because I want the ML algorithm to be able to detect "John Lemond-Lee Peter" and "John LL. Peter" as a potential duplicate (not necessarily with 100% certainty). The Levenshtein distance is relatively high in this case (8), as quite a few characters have to be added. In a case like "John Peter" and "Johanna Petera" the Levenshtein distance is smaller (4), yet this is in fact not a duplicate, and here I would hope that the ML algorithm can detect that it is likely not one. So I need the ML algorithm to "learn the way I need the duplicates to be checked": with my labelling as input, I give the ML algorithm the direction of what I want.
I actually thought that this should be an easy task for an ML algorithm / neural network, but I am not sure.
How can I implement a neural network to compare the pairs of names and identify duplicates without using an explicit distance metric (like the Levenshtein distance, Euclidean distance, etc.)?
I thought that it would be possible to convert the strings to numbers so that a neural network can work with them and learn to detect duplicates according to my labelling style, without my having to specify a distance metric. I thought about a human: if I gave this task to a person, that person would judge and make a decision without having any clue about the Levenshtein distance or any other mathematical concept. So I just want to train the neural network to learn to do what that human is doing. Of course, every human is different, and it also depends on my labelling.
(Edit: The ML/neural network solutions I have seen so far (like this) use a metric like Levenshtein as a feature input. But as I said, I thought it should be possible to teach the neural network the "human judgement" without making use of such a distance measure. Regarding my specific case with pairs of names: what would the benefit of an ML approach using the Levenshtein distance as a feature be? It would just detect those pairs of names as duplicates that have a low Levenshtein distance, so I could instead use a simple algorithm that marks a pair as a duplicate if the Levenshtein distance between the two names is less than x. Why use ML instead, and what would the additional benefit be?)

In my experience, OpenAI's GPT-3 works well with such tasks (I'm using it for analyzing astrophysical texts). You should describe the task in natural language and then provide a few examples for few-shot learning. Here's a quick experiment I performed in the OpenAI Playground (in the original screenshot, the green text was generated by GPT-3).

A naive approach will be somewhat similar to using the Levenshtein distance. First, convert both names to vectors via a pretrained language model (I think FastText is the best choice, as it uses n-grams and will be more sensitive to characters). Then combine these two vectors (the first thing that comes to mind is to compute a metric, e.g. the Euclidean distance between them). Now you can see this task as a classification problem: you pass the calculated metric (or another function of the vectors) and the label (duplicate/not duplicate) to a classifier. So, in fact, you will still be computing a distance between names, but between their high-dimensional representations instead of the names themselves.
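A minimal sketch of that pipeline, assuming gensim (4.x) and scikit-learn are installed; to keep it self-contained, the FastText model is trained on the names themselves here rather than loaded as a large pretrained model, and the pairs/labels are the ones from the question:

import numpy as np
from gensim.models import FastText
from sklearn.linear_model import LogisticRegression

pairs = [("Mike Miller", "Mike Miler"),
         ("John Doe", "Pete McGillen"),
         ("Sara Johnson", "Edita Johnson"),
         ("John Lemond-Lee Peter", "John LL. Peter"),
         ("Marta Sunz", "Martha Sund"),
         ("John Peter", "Johanna Petera")]
labels = [1, 0, 0, 1, 0, 0]  # 1 = duplicate, per the manual judgement above

# Character-n-gram embeddings: every name is treated as a tiny "sentence" of tokens.
sentences = [name.lower().split() for pair in pairs for name in pair]
ft = FastText(sentences, vector_size=32, window=3, min_count=1, min_n=2, max_n=4, epochs=50)

def name_vector(name):
    # Average the FastText vectors of a name's tokens.
    return np.mean([ft.wv[tok] for tok in name.lower().split()], axis=0)

# One feature per pair: Euclidean distance between the two name vectors.
X = np.array([[np.linalg.norm(name_vector(a) - name_vector(b))] for a, b in pairs])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))  # in-sample sanity check only; use held-out pairs in practice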
This approach probably isn't the best choice, but it can be a nice baseline for your task. Your problem is a special case of so-called similarity learning, so you can do some research and choose a specific method from that field.
You can also take a look at this paper. There, the authors use character-based measures to vectorize texts and then pass them to ML models.

The task you are solving is usually called fuzzy matching. There are libraries that implement well-known algorithms that may help you, like fuzzyset, fuzzywuzzy or difflib. Consider giving some of those a try.
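For a quick feel, the standard-library difflib already gives a usable similarity score (fuzzywuzzy and fuzzyset expose similar ratio-style scores if you prefer those):

from difflib import SequenceMatcher

def similarity(a, b):
    # Returns a value between 0.0 (nothing in common) and 1.0 (identical).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(similarity("Mike Miller", "Mike Miler"))              # high  -> likely duplicate
print(similarity("John Doe", "Pete McGillen"))              # low   -> likely not
print(similarity("John Lemond-Lee Peter", "John LL. Peter"))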
If you still want to look into a machine learning approach, consider that your first requirement is a dataset of text pairs labelled as match or no match, and then implement a binary classifier.
As a general rule, classical machine learning algorithms require less data and less parameter tuning to solve the task, but you need to provide better features to the model (which often means you spend more time in the feature engineering stage). I think your problem is simple enough to be solved with plain machine learning, though.
If you want to try neural networks, you could try Siamese networks or implement a binary classifier.
That said, make sure that your implementation considers the input text at the character level instead of the word level.

I have read your whole question carefully, but I still don't know why you want a neural network for that.
Real, sad answer
Tweak the edit distance (a more general distance than Levenshtein) by adding some weights. Idea: swapping characters that are close on the keyboard is more likely than swapping those that are far away, so the distance between Asa and Ada is smaller than between Asa and Ala.
Case (4) you can cover with regex.
Happy answer
If you insist on going with an ML solution, here is a sketch of what I would do if forced (a code sketch follows the list):
Prepare a lot of labelled pairs (a lot means e.g. 50 thousand).
Pad the names to constant length (e.g. 32 characters).
Apply character level encoding (one-hot should do the job).
Train a binary classifier (e.g. in the form of a siamese network) on such inputs.
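A rough sketch of those four steps with Keras (this assumes TensorFlow 2.x; the alphabet, padding length and layer sizes are arbitrary illustrative choices, and train_pairs/train_labels stand in for your labelled data):

import numpy as np
import tensorflow as tf

MAXLEN = 32
ALPHABET = "abcdefghijklmnopqrstuvwxyz .-'"
CHAR_IDX = {c: i for i, c in enumerate(ALPHABET)}

def one_hot(name):
    # Pad/truncate to MAXLEN and one-hot encode at character level.
    x = np.zeros((MAXLEN, len(ALPHABET)), dtype="float32")
    for i, c in enumerate(name.lower()[:MAXLEN]):
        if c in CHAR_IDX:
            x[i, CHAR_IDX[c]] = 1.0
    return x

# Shared character-level encoder (the "siamese" part: both names go through it).
encoder = tf.keras.Sequential([
    tf.keras.layers.Conv1D(32, 3, activation="relu", input_shape=(MAXLEN, len(ALPHABET))),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(32, activation="relu"),
])
in_a = tf.keras.Input(shape=(MAXLEN, len(ALPHABET)))
in_b = tf.keras.Input(shape=(MAXLEN, len(ALPHABET)))
diff = tf.keras.layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([encoder(in_a), encoder(in_b)])
out = tf.keras.layers.Dense(1, activation="sigmoid")(diff)
model = tf.keras.Model([in_a, in_b], out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# With labelled pairs (the more the better):
# left = np.stack([one_hot(a) for a, _ in train_pairs])
# right = np.stack([one_hot(b) for _, b in train_pairs])
# model.fit([left, right], np.array(train_labels), epochs=20, batch_size=64)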

Related

Is there any supervised clustering algorithm or a way to apply prior knowledge to your clustering?

In my case I have a dataset of letters and symbols, detected in an image. The detected items are represented by their coordinates, type (letter, number etc), value, orientation and not the actual bounding box of the image. My goal is, using this dataset, to group them into different "words" or contextual groups in general.
So far I have achieved OK-ish results by applying classic unsupervised clustering with the DBSCAN algorithm, but this is still way too limited to the geometric distance of the samples, and so the resulting groups cannot resemble the "words" I am aiming for. So I am searching for a way to influence the results of the clustering algorithm by using the knowledge I have about the "word-like" nature of the clusters needed.
My possible approach that I thought was to create a dataset of true and false clusters and train an SVM model (or any classifier) to detect whether a proposed cluster is correct or not. But still for this, I have no solid proof that I can train a model well enough to discriminate between good and bad clusters, plus I find it difficult to efficiently and consistently represent the clusters, based on the features of their members. Moreover, since my "testing data" will be a big amount of all possible combinations of the letters and symbols I have, the whole approach seems a bit too complicated to attempt implementing it without any proof or indications that it's going to work in the end.
To conclude, my question is whether someone has any prior experience with this kind of task (in my mind it sounds like a rather simple task, but apparently it is not). Do you know of any supervised clustering algorithm, and if so, what is the proper way to represent clusters of data so that you can efficiently train a model with them?
Any idea/suggestion or even hint towards where I can research about it will be much appreciated.
There are papers on supervised clustering. A nice, clear one is Eick et al., which is available for free. Unfortunately, I do not think any off-the-shelf libraries in python support this. There is also this in the specific realm of text, but it is a much more domain-specific approach compared to Eick.
But there is a very simple solution that is effectively a type of supervised clustering. Decision Trees essentially chop feature space into regions of high-purity, or at least attempt to. So you can do this as a quick type of supervised clustering:
Create a Decision Tree using the label data.
Think of each leaf as a "cluster."
In sklearn, you can retrieve the leaves of a Decision Tree by using the apply() method.
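A minimal scikit-learn illustration of the idea (make_blobs just stands in for your own feature matrix and labels):

from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for your letter/symbol features and labels.
X, y = make_blobs(n_samples=200, centers=3, random_state=0)

tree = DecisionTreeClassifier(max_depth=4).fit(X, y)
cluster_ids = tree.apply(X)   # leaf index per sample; treat each leaf as a "cluster"
print(sorted(set(cluster_ids)))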
A standard approach would be to use the dendrogram.
Then merge branches only if they agree with your positive examples and don't violate any of your negative examples.

Matching Property on Heterogeneous Data using Deep Learning

The issue I face is that I want to match properties (houses/apartments etc.) that are similar to each other (e.g. by longitude and latitude (numerical), bedrooms (numerical), district (categorical), condition (categorical) etc.) using deep learning. The data is heterogeneous because we mix numerical and categorical data, and the problem is unsupervised because we don't use any labels.
My goal is to get a measure for how similar properties are so I can find the top matches for each target property. I could use KNN, but I want to use something that allows me to find embeddings and that uses deep learning.
I suppose I could use a mixed distance measure such as the Gower distance as the loss function, but how would I go about setting up a model that determines, say, the top 10 matches for each target property in my sample?
Any help or points to similar problem sets (Kaggle, notebooks, github) would be very appreciated.
Thanks
Given that you want an unsupervised approach, you could try using an auto-encoder. I have found Variational Auto-Encoders (VAEs) to be pretty good for other problems. The learned embedding should respect distance in the input space to some extent, but you might need to modify the loss function slightly if you want examples to be separated in a specific way.
To get the top k, you can just encode each example, compute a distance matrix and take the top k in each row (or col).
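A small sketch of that top-k step with NumPy/SciPy, using random vectors as a stand-in for the encoder output:

import numpy as np
from scipy.spatial.distance import cdist

embeddings = np.random.rand(100, 16)          # stand-in for the encoder's output
dists = cdist(embeddings, embeddings)         # full pairwise distance matrix
np.fill_diagonal(dists, np.inf)               # exclude each property itself
k = 10
top_k = np.argsort(dists, axis=1)[:, :k]      # indices of the k closest matches per row
print(top_k[0])                               # best matches for the first property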
I have an implementation of VAEs (and others) in Pytorch: here for your reference, obviously you will need a different network architecture to handle the categorical aspects etc.
Hope this helps!

Word clustering in python

How do I cluster only words in a given set of data? I have been going through a few algorithms online, like the k-means algorithm, but they seem to be related to document clustering instead of word clustering. Can anyone suggest a way to cluster only the words in a given set of data?
Please note that I am new to Python.
Based on the fact that my last answer was indeed a false answer since it was used for document clustering and not word clustering, here is the real answer.
What you are looking for is word2vec.
Indeed, word2vec is a Google tool based on deep learning that works really well. It transforms words into vector representations, and therefore allows you to do multiple things with them.
For example, one of the many things that work well is the algebraic relation of words:
vector('puppy') - vector('dog') + vector('cat') is close to vector('kitten')
vector('king') - vector('man') + vector('woman') is close to vector('queen')
What it means by that is it can sort of encompass the context of a word, and therefore it will work really well for numerous applications.
When you have vectors instead of words, you can pretty much do anything you want. You can for example do a k-means clustering with a cosine distance as the measure of dissimilarity...
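A toy sketch of that route with gensim and scikit-learn (a real use case needs a much larger corpus; the vectors are L2-normalised so that Euclidean k-means behaves like clustering by cosine similarity):

from gensim.models import Word2Vec
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

sentences = [["the", "dog", "barks"], ["the", "cat", "meows"], ["dogs", "and", "cats"]]
model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=100)

words = list(model.wv.index_to_key)
vectors = normalize(model.wv[words])          # unit-length word vectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(dict(zip(words, labels)))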
Hope this answers well to your question. You can read more about word2vec in different papers or websites if you'd like. I won't link them here since it is not the subject of the question.
Word clustering will be really disappointing because the computer does not understand language.
You could use levenshtein distance and then do hierarchical clustering.
But:
dog and fog have a distance of 1, i.e. are highly similar.
dog and cat have 3 out of 3 letters different.
So unless you can define a good measure of similarity, don't cluster words.
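If you do want to try the Levenshtein-plus-hierarchical-clustering route despite that caveat, a minimal sketch (this assumes the python-Levenshtein package is installed; any other edit-distance function can be swapped in):

import numpy as np
import Levenshtein
from scipy.cluster.hierarchy import linkage, fcluster

words = ["dog", "fog", "log", "cat", "bat", "rat"]
# Condensed pairwise distance vector, in the order expected by scipy's linkage().
condensed = [Levenshtein.distance(a, b)
             for i, a in enumerate(words) for b in words[i + 1:]]
Z = linkage(np.asarray(condensed, dtype=float), method="average")
clusters = fcluster(Z, t=1.5, criterion="distance")   # cut the tree at distance 1.5
print(dict(zip(words, clusters)))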

Utilising Genetic algorithm to overcome different size datasets in model

So I realise the question I am asking here is large and complex.
A potential solution to variances in the sizes of datasets
In all of my searching through statistical forums and posts I haven't come across a scientifically sound method of taking into account the type of data that I am encountering, but I have thought up a (novel?) potential solution to account perfectly (in my mind) for large and small datasets within the same model.
The proposed method involves using a genetic algorithm to alter two numbers defining a relationship between the size of the dataset making up an implied strike rate and the percentage of the implied strike rate to be used, with the target of the model being to maximise the homology of the number 1 in two columns of the following CSV (ultra-simplified, but hopefully it demonstrates the principle).
Example data
Date,PupilName,Unique class,Achieved rank,x,y,x/y,Average xy
12/12/2012,PupilName1,UniqueClass1,1,3000,9610,0.312174818,0.08527
12/12/2012,PupilName2,UniqueClass1,2,300,961,0.312174818,0.08527
12/12/2012,PupilName3,UniqueClass1,3,1,3,0.333333333,0.08527
13/12/2012,PupilName1,UniqueClass2,1,2,3,0.666666667,0.08527
13/12/2012,PupilName2,UniqueClass2,2,0,1,0,0.08527
13/12/2012,PupilName3,UniqueClass2,3,0,5,0,0.08527
13/12/2012,PupilName4,UniqueClass2,4,0,2,0,0.08527
13/12/2012,PupilName5,UniqueClass2,5,0,17,0,0.08527
14/12/2012,PupilName1,UniqueClass3,1,1,2,0.5,0.08527
14/12/2012,PupilName2,UniqueClass3,2,0,1,0,0.08527
14/12/2012,PupilName3,UniqueClass3,3,0,5,0,0.08527
14/12/2012,PupilName4,UniqueClass3,4,0,6,0,0.08527
14/12/2012,PupilName5,UniqueClass3,5,0,12,0,0.08527
15/12/2012,PupilName1,UniqueClass4,1,0,0,0,0.08527
15/12/2012,PupilName2,UniqueClass4,2,1,25,0.04,0.08527
15/12/2012,PupilName3,UniqueClass4,3,1,29,0.034482759,0.08527
15/12/2012,PupilName4,UniqueClass4,4,1,38,0.026315789,0.08527
16/12/2012,PupilName1,UniqueClass5,1,12,24,0.5,0.08527
16/12/2012,PupilName2,UniqueClass5,2,1,2,0.5,0.08527
16/12/2012,PupilName3,UniqueClass5,3,13,59,0.220338983,0.08527
16/12/2012,PupilName4,UniqueClass5,4,28,359,0.077994429,0.08527
16/12/2012,PupilName5,UniqueClass5,5,0,0,0,0.08527
17/12/2012,PupilName1,UniqueClass6,1,0,0,0,0.08527
17/12/2012,PupilName2,UniqueClass6,2,2,200,0.01,0.08527
17/12/2012,PupilName3,UniqueClass6,3,2,254,0.007874016,0.08527
17/12/2012,PupilName4,UniqueClass6,4,2,278,0.007194245,0.08527
17/12/2012,PupilName5,UniqueClass6,5,1,279,0.003584229,0.08527
So I have created a tiny model dataset which contains some good examples of where my current methods fall short, and of how I feel a genetic algorithm can be used to fix this. The dataset above contains 6 unique classes, and the ultimate objective of the algorithm is to create as high a correspondence as possible between the rank of an adjusted x/y and the achieved rank in column 3 (zero-based referencing). In UniqueClass1 we have two identical x/y values. These are comparatively large x/y values if you compare them with the average (note the average isn't calculated from this dataset), but it would be common sense to expect that 3000/9610 is more significant, and therefore more likely to have an achieved rank of 1, than 300/961. So what I want to do is make an adjusted x/y that overcomes these differences in dataset sizes, using a logarithmic growth relationship defined by the equation:
adjusted_xy = (1 - exp(-y*α)) * (x/y) + (1 - (1 - exp(-y*α))) * Average_xy
Where α is the only dynamic number
If I can explain my logic a little and open myself up to (hopefully) constructive criticism: there is an exponential growth relationship between the size of the dataset and the % of x/y contributing to the adjusted x/y. Essentially, what the above equation says is that as the dataset gets larger, the percentage of the original x/y used in the adjusted x/y gets larger; whatever percentage is left is made up by the average xy. Hypothetically it could be 75% x/y and 25% average xy for 300/961, and 95%/5% for 3000/9610, creating an adjusted x/y which clearly demonstrates the difference in significance.
To help with understanding: lowering α produces a relationship whereby a larger dataset is required to achieve the same "% of xy contributed".
Conversely, increasing α produces a relationship whereby a smaller dataset is required to achieve the same "% of xy contributed".
So I have explained my logic. I am also open to code snippets to help me overcome the problem. I have plans to make a multitude of genetic/evolutionary algorithms in the future and could really use a working example to pick apart and play with in order to help my understanding of how to utilise such abilities of python. If additional detail is required or further clarification about the problem or methods please do ask, I really want to be able to solve this problem and future problems of this nature.
So after much discussion about the methods available to overcome the problem presented here, I have come to the conclusion that the best method would be a genetic algorithm that iterates α in order to maximise the homology/correspondence between the rank of the adjusted x/y and the achieved rank in column 3. It would be greatly appreciated if anyone were able to help in that department.
So to clarify, this post is no longer a discussion about methodology.
I am hoping someone can help me produce a genetic algorithm to maximise the homology between the results of the equation
adjusted_xy = (1 - exp(-y*α)) * (x/y) + (1 - (1 - exp(-y*α))) * Average_xy
where adjusted_xy applies to each row of the CSV. Maximising homology could be achieved by minimising the difference between the rank of the adjusted xy (where the rank is taken within each Unique class only) and the Achieved rank.
Minimising this value would maximise the homology and essentially solve the problem presented to me of different-sized datasets. If any more information is required please ask; I check this post about 20 times a day at the moment, so I should reply rather promptly. Many thanks, SMNALLY.
The problem you are facing sounds to me like the "bias-variance dilemma" from a general point of view. In a nutshell, a more precise model favours variance (sensitivity to changes in a single training set), while a more general model favours bias (the model works for many training sets).
May I suggest not focusing on GAs but looking at instance-based learning and advanced regression techniques. The Andrew Moore page at CMU is a good entry point.
And particularly those slides.
[EDIT]
After a second reading, here is my second understanding:
You have a set of example data with two related attributes X and Y.
You do not want X/Y to dominate when Y is small (considered as less representative).
As a consequence you want to "weight" the examples with an adapted value, adjusted_xy.
You want adjusted_xy to be related to a third attribute R (rank), such that, per class, adjusted_xy is sorted like R.
To do so you suggest posing it as an optimization problem, searching for the PARAMS of a given function F(X, Y, PARAMS) = adjusted_xy,
with the constraint that D = distance(achieved rank for this class, rank of adjusted_xy for this class) is minimal.
Your question, at least for me, is in the field of attribute selection/attribute adaptation. (I guess the dataset will later be used for supervised learning.)
One problem that I see in your approach (if I have understood it well) is that, at the end, the rank will be highly related to adjusted_xy, which will therefore bring no interesting supplementary information.
Once this is said, I think you surely know how a GA works. You have to:
define the content of the chromosome: this appears to be your alpha parameter;
define an appropriate fitness function.
The fitness function for one individual can be a sum of distances over all examples of the dataset.
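A rough sketch of such a GA (this assumes pandas and NumPy, and that the example data above is saved as "pupils.csv" with the same header; the population size, mutation width and number of generations are arbitrary choices):

import numpy as np
import pandas as pd

df = pd.read_csv("pupils.csv")   # the example data above, saved with the same header

def fitness(alpha):
    # adjusted_xy = (1 - exp(-y*alpha)) * x/y + exp(-y*alpha) * Average xy
    w = 1.0 - np.exp(-df["y"] * alpha)
    adjusted = w * df["x/y"] + (1.0 - w) * df["Average xy"]
    # Rank adjusted_xy within each Unique class (1 = highest value), then compare
    # with the Achieved rank; a smaller total means better homology.
    pred_rank = adjusted.groupby(df["Unique class"]).rank(ascending=False, method="first")
    return float((pred_rank - df["Achieved rank"]).abs().sum())

rng = np.random.default_rng(0)
population = rng.uniform(0.0, 1.0, size=20)              # chromosomes: candidate alphas
for generation in range(100):
    scores = np.array([fitness(a) for a in population])
    parents = population[np.argsort(scores)][:5]         # keep the 5 fittest (elitism)
    children = np.abs(parents[rng.integers(0, 5, 15)] + rng.normal(0.0, 0.05, 15))  # mutate
    population = np.concatenate([parents, children])

best = min(population, key=fitness)
print("best alpha:", best, "fitness:", fitness(best))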
As you are dealing with real values, other metaheuristics such as evolution strategies (ES) or simulated annealing may be better adapted than a GA.
As solving optimization problems is CPU-intensive, you might eventually consider C or Java instead of Python (as the fitness function, at least, will be interpreted and thus cost a lot).
Alternatively, I would look at using Y as a weight in some supervised learning algorithm (if supervised learning is the target).
Let's start with the problem: you consider that some features lead to some of your classes, a 'strike'. You take a subset of your data and try to establish a rule for the strikes. You do establish one, but then you notice that the accuracy of your rule depends on the volume of the dataset that was used to establish the 'strike' rate anyway. You also comment on the effect of some samples in biasing your 'strike' estimate.
The immediate answer is that it looks like you have a lot of variation in your data, therefore you will in one way or another need to collect more to account for that variation. (That is, variation that is inherent to the problem).
The fact that in some cases the numbers end up in 'unusable cases' could also be down to outliers. That is, measurements that are 'out of bounds' for a number of reasons and which you would have to find a way to either exclude them or re-adjust them. But this depends a lot on the context of the problem.
'Strike rates' on their own will not help, but they are perhaps a step in the right direction. In any case, you cannot compare strike rates if they come from samples of different sizes, as you have found out too. If your problem is purely to determine the size of your sample so that your results conform to some specific accuracy, then I would recommend that you have a look at statistical power and how sample size affects it. But still, to determine the sample size you need to know a bit more about your data, which brings us back to point #1 about the inherent variation.
Therefore, my attempt at an answer is this: if I have understood your question correctly, you are dealing with a classification problem in which you seek to assign a number of items (patients) to a number of classes (types of cancer) on the evidence of some features (existence of genetic markers, or frequency of their appearance, or any other quantity) about these items. But some features might not exist for all items, or there is a core group of features with some more that do not appear all the time. The question now is which classifier you use to achieve this. Logistic regression was mentioned previously and has not helped. Therefore, what I would suggest is going for a Naive Bayes classifier. The classifier can be trained with the datasets you used to derive the 'strike rates', which will provide the a-priori probabilities. When the classifier is 'running' it will use the features of new data to construct a likelihood that the patient who provided this data should be assigned to each class.
Perhaps the more common example for such a classifier is the spam-email detectors where the likelihood that an email is spam is judged on the existence of specific words in the email (and a suitable training dataset that provides a good starting point of course).
Now, in terms of trying this out practically (and since your post is tagged with Python-related tags :) ), I would like to recommend Weka. Weka contains a lot of related functionality, including bootstrapping, that could potentially help you with those differences in the size of the datasets. Although Weka is Java, bindings exist for it in Python too. I would definitely give it a go; the Weka package, book and community are very helpful.
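If you would rather stay in pure Python, scikit-learn covers the same Naive Bayes ground; a minimal Gaussian Naive Bayes sketch on synthetic data (your own feature matrix and class labels go where make_classification is used here):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=5, n_classes=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print("held-out accuracy:", nb.score(X_test, y_test))
print("class probabilities for the first test item:", nb.predict_proba(X_test[:1]))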
No. Don't use a genetic algorithm.
The bigger the search space of models and parameters, the better your chances of finding a good fit for your data points, but the less this fit will mean. Especially since for some groups your sample sizes are small, and therefore the measurements have a high random component to them. This is why, somewhat counterintuitively, it is often actually harder to find a good model for your data after collecting it than before.
You have taken the question to the programmer's lair. This is not the place for it. We solve puzzles.
This is not a puzzle to find the best line through the dots. You are searching for a model that makes sense and brings understanding on the subject matter. A genetic algorithm is very creative at line-through-dot drawing but will bring you little understanding.
Take the problem back where it belongs and ask the statisticians instead.
For a good model should be based on the theory behind the data. It will have to match the points on the right side of the graph, where (if I understand you right) most of the samples are. It will be able to explain in hard probabilities how likely the deviations on the left are, and tell you whether they are significant or not.
If you do want to do some programming, I'd suggest you take the simplest linear model, add some random noise, and do a couple simulation runs for a population like your subjects. See if the data looks like the data you're looking at or if it generally 'looks' different, in which case there really is something nonlinear (and possibly interesting) going on on the left.
I once tackled a similar problem (as similar as problems like this ever are), in which there were many classes and high variance in features per data point. I personally used a Random Forest classifier (which I wrote in Java). Since your data is highly variant, and therefore hard to model, you could create multiple forests from different random samples of your large dataset and put a control layer on top to classify data against all the forests, then take the best score. I don't write Python, but I found this link
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
which may give you something to play with.
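A minimal scikit-learn starting point along those lines (synthetic data standing in for your own; a single forest already bags many random samples of the data, so it is a reasonable first step before hand-rolling an ensemble of forests):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, n_informative=6, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)
print("cross-validated accuracy:", cross_val_score(forest, X, y, cv=5).mean())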
Following Occam's razor, you should select a simpler model for a small dataset, and may want to switch to a more complex model as your dataset grows.
There are no [good] statistical tests that show you whether a given model, in isolation, is a good predictor of your data. Or rather, a test may tell you that a given model's fitness is N, but you can never tell what the acceptable value of N is.
Thus, build several models and pick the one with the better trade-off of predictive power and simplicity, using the Akaike information criterion (AIC). It has useful properties and is not too hard to understand. :)
There are other tests of course, but AIC should get you started.
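An illustrative AIC comparison with statsmodels (synthetic data; the model with the lower AIC wins the fit-versus-complexity trade-off):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + rng.normal(0, 1, 100)          # truly linear data plus noise

simple = sm.OLS(y, sm.add_constant(x)).fit()                                    # y ~ x
complex_ = sm.OLS(y, sm.add_constant(np.column_stack([x, x**2, x**3]))).fit()   # y ~ x + x^2 + x^3
print("simple AIC:", simple.aic, " complex AIC:", complex_.aic)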
For a simple test, check out p-value

Spelling correction likelihood

As stated by most spelling-corrector tutors, the correct word Ŵ for an incorrectly spelled word X is:
Ŵ = argmax_W P(X|W) P(W)
where P(X|W) is the likelihood and P(W) is the language model.
In the tutorial from which I am learning spelling correction, the instructor says that P(X|W) can be computed by using a confusion matrix which keeps track of how many times a letter in our corpus is mistakenly typed for another letter. I am using the World Wide Web as my corpus, and it can't be guaranteed that a letter was mistakenly typed for another letter. So is it okay if I use the Levenshtein distance between X and W instead of using the confusion matrix? Does it make much of a difference?
The way I am going to compute the Levenshtein distance in Python is this:
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
See this
And here's the tutorial to make my question clearer: Click here
PS: I am working with Python.
There are a few things to say.
The model you are using to predict the most likely correction is a simple, cascaded probability model: There is a probability for W to be entered by the user, and a conditional probability for the misspelling X to appear when W was meant. The correct terminology for P(X|W) is conditional probability, not likelihood. (A likelihood is used when estimating how well a candidate probability model matches given data. So it plays a role when you machine-learn a model, not when you apply a model to predict a correction.)
If you were to use the Levenshtein distance for P(X|W), you would get integers between 0 and the sum of the lengths of W and X. This would not be suitable, because you are supposed to use a probability, which has to be between 0 and 1. Even worse, the value you get would be larger the more different the candidate is from the input. That's the opposite of what you want.
However, fortunately, SequenceMatcher.ratio() is not actually an implementation of Levenshtein distance. It's an implementation of a similarity measure and returns values between 0 and 1. The closer to 1, the more similar the two strings are. So this makes sense.
Strictly speaking, you would have to verify that SequenceMatcher.ratio() is actually suitable as a probability measure. For this, you'd have to check if the sum of all ratios you get for all possible misspellings of W is a total of 1. This is certainly not the case with SequenceMatcher.ratio(), so it is not in fact a mathematically valid choice.
However, it will still give you reasonable results, and I'd say it can be used for a practical and prototypical implementation of a spell-checker. There is a performance concern, though: since SequenceMatcher.ratio() is applied to a pair of strings (a candidate W and the user input X), you might have to apply it to a huge number of possible candidates coming from the dictionary to select the best match. That will be very slow when your dictionary is large. To improve this, you'll need to implement your dictionary using a data structure that has approximate string search built into it. You may want to look at this existing post for inspiration (it's for Java, but the answers include suggestions of general algorithms).
Yes, it is OK to use the Levenshtein distance instead of a corpus of misspellings. Unless you are Google, you will not get access to a large and reliable enough corpus of misspellings. There are many other metrics that will do the job. I have used the Levenshtein distance weighted by the distance of the differing letters on a keyboard. The idea is that abc is closer to abx than to abp, because p is farther away from x on my keyboard than c. Another option involves accounting for swapped characters: swap is a more likely correction of sawp than saw, because this is how people type. They often swap the order of characters, but it takes some real talent to type saw and then randomly insert a p at the end.
The rules above are called an error model: you are trying to leverage knowledge of how real-world spelling mistakes occur to help with your decision. You can (and people have) come up with really complex rules. Whether they make a difference is an empirical question; you need to try and see. Chances are some rules will work better for some kinds of misspellings and worse for others. Google "how does aspell work" for more examples.
PS: All of the example mistakes above are purely due to the use of a keyboard. Sometimes people do not know how to spell a word; that is a whole other can of worms. Google soundex.
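A minimal sketch of the keyboard-weighted idea described above (the key coordinates, the substitution cost formula and the swap cost are arbitrary illustrative choices, not a standard metric):

ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {c: (r, col) for r, row in enumerate(ROWS) for col, c in enumerate(row)}

def sub_cost(a, b):
    # Substituting nearby keys is cheap, distant keys expensive; 1.0 for unknown chars.
    if a == b:
        return 0.0
    if a in KEY_POS and b in KEY_POS:
        (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
        return 0.5 + 0.25 * (abs(r1 - r2) + abs(c1 - c2))
    return 1.0

def weighted_edit_distance(s, t, swap_cost=0.6):
    s, t = s.lower(), t.lower()
    d = [[0.0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        d[i][0] = float(i)
    for j in range(1, len(t) + 1):
        d[0][j] = float(j)
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                                  # deletion
                          d[i][j - 1] + 1,                                  # insertion
                          d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))   # substitution
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + swap_cost)         # swapped chars
    return d[len(s)][len(t)]

print(weighted_edit_distance("abc", "abx"))    # c and x are neighbours -> cheap
print(weighted_edit_distance("abc", "abp"))    # c and p are far apart  -> expensive
print(weighted_edit_distance("swap", "sawp"))  # one transposition      -> cheap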
