Minimum Levenshtein distance across multiple words - python

I am trying to do some string matching on business names using the Levenshtein algorithm to find the closest words. (In Python, but the language won't make a huge difference.)
An example query would be
search = 'bna'
The lat & lon are close to the result I am looking for.
There is a pub right by that latitude and longitude called BNA Brewing Co. By searching for BNA, my hope would be that it shows up first (as bna == bna).
I have tried two different ways.
m = min([editdistance.eval(search, place_split) for place_split in place.name.split(' ')
if place_split not in string.punctuation])
This returns, ranking only by Levenshtein distance (ignoring geographical distance):
Coffee & Books In Town Center
Talk 'n' Coffee
Raggedy Ann & Andy's
and, taking geographical distance into account (secondary to Levenshtein):
Shapers Hair Salon & Spa
Amora Day Spa
Pure Esthetics and Micro-Pigmentation
And
m = editdistance.eval(search, place.name)
This one returns, ranking only by Levenshtein distance (ignoring geographical distance):
KFC
MOO
A&W
and, taking geographical distance into account (secondary to Levenshtein):
A&W
A&W
KFC
So you can see that neither way returns anything close to BNA Brewing Co.
What kind of logic do I have to use to get it to return something when the search term exactly matches one of the place names in my database?

Recall that Levenshtein distances count the number of substitutions, additions and deletions required to transform one string into another. Because of this, they are often minimized when comparing strings of similar length (because even if a lot of substitutions are required, you don't have to add or remove a bunch of characters). You can see this playing out in your second example, where your best outputs are all the same length as your search string (len("bna") == len("A&W")).
If your search string is always going to be a single word, then your idea to calculate the distance for each word in the name is a good one, since each word is more likely to be a similar length to your search string. However, you are currently doing a case-sensitive comparison, which means that editdistance.eval('bna', 'BNA') == 3, which I'm guessing you don't want.
Try:
m = min([editdistance.eval(search.lower(), place_split.lower()) for place_split in place.name.split(' ') if place_split not in string.punctuation])
which should give you a case-insensitive search.
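If you also want geography to only break ties, one possibility (just a sketch; places, search and geo_distance() below are hypothetical stand-ins for however you fetch and rank your candidates) is to sort on the case-insensitive word-level distance first and the geographical distance second:
import string
import editdistance

def word_distance(search, name):
    # Smallest edit distance between the search term and any word of the place name,
    # compared case-insensitively and skipping bare punctuation tokens such as "&".
    return min(editdistance.eval(search.lower(), word.lower())
               for word in name.split(' ')
               if word not in string.punctuation)

# places and geo_distance() are assumptions: adapt them to your own data access.
ranked = sorted(places, key=lambda place: (word_distance(search, place.name),
                                           geo_distance(place)))
With search = 'bna', the word distance for BNA Brewing Co. is 0, so it sorts ahead of everything else before geography is even considered.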

Related

Gensim: word mover distance with string as input instead of list of string

I'm trying to find out how similar 2 sentences are.
To do it I'm using gensim's Word Mover's Distance, and since what I'm trying to find is a similarity, I do it as follows:
sim = 1 - wv.wmdistance(sentence_obama, sentence_president)
What I give as input are 2 strings:
sentence_obama = 'Obama speaks to the media in Illinois'
sentence_president = 'The president greets the press in Chicago'
The model I'm using is the one that you can find on the web: word2vec-google-news-300
I load it with this code:
wv = api.load("word2vec-google-news-300")
It gives me reasonable results.
Here is where the problem starts.
From what I can read in the documentation here, it seems wmdistance takes a list of strings as input, not a string like I'm passing!
from nltk.corpus import stopwords   # needed for the stop_words used below
stop_words = stopwords.words('english')

def preprocess(sentence):
    return [w for w in sentence.lower().split() if w not in stop_words]

sentence_obama = preprocess(sentence_obama)
sentence_president = preprocess(sentence_president)
sim = 1 - wv.wmdistance(sentence_obama, sentence_president)
When I follow the documentation I get really different results:
wmd using string as input: 0.5562025871542842
wmd using list of string as input: -0.0174646259300113
I'm really confused. Why does it work with a string as input, and why does it seem to work better than when I give it what the documentation asks for?
The function needs a list-of-string-tokens to give proper results: if your results from passing full strings look good to you, it's pure luck and/or poor evaluation.
So: why do you consider 0.556 to be a better value than -0.017?
Since passing the texts as plain strings means they are interpreted as lists-of-single-characters, the value there is going to be a function of how different the letters in the two texts are - and the fact that all English sentences of about the same length have very similar letter distributions means most texts will rate as very similar under that mistaken usage.
Also, similarity or distance values mainly have meaning in comparison to other pairs of sentences, not across results from different processes (where one of them is essentially random). You shouldn't consider absolute values that exceed some set threshold, or that are close to 1.0, as definitively good. You should instead consider relative differences between two similarity/distance values, to mean one pair is more similar/distant than another pair.
Finally: converting a distance (which goes from 0.0 for closest to infinity for furthest) to a similarity (which typically goes from 1.0 for most similar to -1.0 or 0.0 for least similar) is not usefully done via the formula you're using, similarity = 1.0 - distance. Because a distance can be larger than 2.0, you could get arbitrarily negative similarities with that approach, and be fooled into thinking -0.017 (etc.) is bad because it's negative, even if it's quite good across all the possible return values.
Some more typical distance-to-similarity conversions are given in another SO question:
How do I convert between a measure of similarity and a measure of difference (distance)?
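For illustration, one common option (just a sketch, not necessarily what the linked answers recommend, and wmd_similarity is only an illustrative helper name) is to map the unbounded distance into (0, 1] instead of subtracting it from 1:
def wmd_similarity(wv, tokens_a, tokens_b):
    # WMD is 0.0 for identical bags of words and grows without bound,
    # so 1 / (1 + d) stays in (0, 1] where 1 - d can go arbitrarily negative.
    distance = wv.wmdistance(tokens_a, tokens_b)
    return 1.0 / (1.0 + distance)

sim = wmd_similarity(wv, sentence_obama, sentence_president)  # preprocessed token lists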

Extracting a section of a string in python with limitations

I have a string output that looks like this:
Distance AAAB: ,0.13634,0.13700,0.00080,0.00080,-0.00066,.00001,
Distance AAAC: ,0.12617,0.12680,0.00080,0.00080,-0.00063,,
Distance AAAD: ,0.17045,0.16990,0.00080,0.00080,0.00055,,
Distance AAAE: ,0.09330,0.09320,0.00080,0.00080,0.00010,,
Distance AAAF: ,0.21048,0.21100,0.00080,0.00080,-0.00052,,
Distance AAAG: ,0.02518,0.02540,0.00040,0.00040,-0.00022,,
Distance AAAH: ,0.11404,0.11450,0.00120,0.00110,-0.00046,,
Distance AAAI: ,0.10811,0.10860,0.00080,0.00070,-0.00049,,
Distance AAAJ: ,0.02430,0.02400,0.00200,0.00200,0.00030,,
Distance AAAK: ,0.09449,0.09400,0.00200,0.00100,0.00049,,
Distance AAAL: ,0.07689,0.07660,0.00050,0.00050,0.00029,
What I want to do is extract a specific set of data out of this block, for example only Distance AAAH like so:
Distance AAAH: ,0.11404,0.11450,0.00120,0.00110,-0.00046,,
The measurements will always begin with Distance AAA*: with the star being the only character that will change.
Complications:
This needs to be generic, because I have a lot of different data sets, so Distance AAAH might not always be followed by Distance AAAI or preceded by Distance AAAG, since the measurements for different items vary. I also can't rely on len(), because the last measurement can sometimes be blank (as it is with Distance AAAH) or filled (as with Distance AAAB). And I don't think I can use .find(), because I need all of the numbers following Distance AAAH.
I am still very new and I tried my best to find a solution to a similar problem, but have not had much luck.
You can search your text with this script:
# fullText = YOUR STRING
text = fullText.splitlines()
for line in text:
    if line.startswith('Distance AAAH:'):
        print(line)
Output: Distance AAAH: ,0.11404,0.11450,0.00120,0.00110,-0.00046,,
You could use the re module, and making a function should be convenient.
import re

def SearchDistance(pattern, text):
    pattern = pattern.replace(' ', r'\s')
    print(re.findall(r'{0}.+'.format(pattern), text))

SearchDistance('Distance AAAH', fullText)  # fullText holds the whole block of lines
Output:
['Distance AAAH: ,0.11404,0.11450,0.00120,0.00110,-0.00046,,']
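If the numbers themselves are what you ultimately need, a small extension of the same idea (my own sketch; distance_values is a hypothetical helper, and fullText is the full string from the first answer) makes the label generic and converts the fields to floats:
import re

def distance_values(label, text):
    # Find the 'Distance <label>:' line and return its numeric fields as floats.
    match = re.search(r'Distance {0}: ,(.+)'.format(re.escape(label)), text)
    if match is None:
        return None
    return [float(field) for field in match.group(1).split(',') if field]

print(distance_values('AAAH', fullText))
# [0.11404, 0.1145, 0.0012, 0.0011, -0.00046]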

Matching 2 short descriptions and returning a confidence level

I have some data that I get from the banks using Yodlee, and the corresponding transaction messages on the mobile. Both have some description in them - short descriptions.
For example -
string1 = "tatasky_TPSL MUMBA IND"
string2 = "tatasky_TPSL"
They can be matched if one is completely inside the other. However, some strings like
string1 = "T.G.I Friday's"
string2 = "TGI Friday's MUMBA MAH"
still need to be matched. Is there any algorithm which gives a confidence level when matching 2 descriptions?
You might want to use normalized edit distance, also called Levenshtein distance (Levenshtein distance wikipedia). After getting the Levenshtein distance between two strings, you can normalize it by dividing by the length of the longest string (or the average length of the two strings). This normalized score can act as confidence. You can find some 4-5 Python packages for calculating Levenshtein distance. You can try it online as well (edit distance calculator).
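A minimal sketch of that idea, reusing the editdistance package from the first question (the lowercasing and the choice of the longest string as the normalizer are assumptions on my part):
import editdistance

def match_confidence(a, b):
    # Normalized edit distance, flipped so 1.0 means identical and 0.0 means no overlap.
    a, b = a.lower(), b.lower()
    if not a and not b:
        return 1.0
    return 1.0 - editdistance.eval(a, b) / max(len(a), len(b))

print(match_confidence("T.G.I Friday's", "TGI Friday's MUMBA MAH"))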
Alternatively, one simple solution is an algorithm called longest common subsequence, which can be used here.
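A sketch of that alternative, with a plain dynamic-programming LCS; normalizing by the shorter string is my own choice, so that a string fully contained in the other scores close to 1.0:
def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence length.
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_confidence(a, b):
    a, b = a.lower(), b.lower()
    return lcs_length(a, b) / min(len(a), len(b))

print(lcs_confidence("tatasky_TPSL", "tatasky_TPSL MUMBA IND"))  # 1.0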

check if two words are related to each other

I have two lists: one, the interests of the user; and second, the keywords about a book. I want to recommend the book to the user based on his given interests list. I am using the SequenceMatcher class of Python library difflib to match similar words like "game", "games", "gaming", "gamer", etc. The ratio function gives me a number between [0,1] stating how similar the 2 strings are. But I got stuck at one example where I calculated the similarity between "looping" and "shooting". It comes out to be 0.6667.
for interest in self.interests:
    for keyword in keywords:
        s = SequenceMatcher(None, interest, keyword)
        match_freq = s.ratio()
        if match_freq >= self.limit:
            # print(interest, keyword, match_freq)
            final_score += 1
            break
Is there any other way to perform this kind of matching in Python?
Firstly, a word can have many senses, and when you try to find similar words you might need some word sense disambiguation: http://en.wikipedia.org/wiki/Word-sense_disambiguation.
Given a pair of words, if we take the most similar pair of senses as the gauge of whether two words are similar, we can try this:
from nltk.corpus import wordnet as wn
from itertools import product

wordx, wordy = "cat", "dog"
sem1, sem2 = wn.synsets(wordx), wn.synsets(wordy)

maxscore = 0
for i, j in product(sem1, sem2):
    score = i.wup_similarity(j)  # Wu-Palmer Similarity; can be None for some pairs
    if score is not None and score > maxscore:
        maxscore = score
There are other similarity functions that you can use: http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html. The only problem is when you encounter words not in WordNet. Then I suggest you fall back on difflib.
At first, I thought of using regular expressions to perform additional tests to discriminate among the matchings with low ratios. That can be a solution for a specific problem like the one happening with words ending in ing, but that's only a limited case and there can be numerous other cases, each needing its own specific treatment.
Then I thought we could try to find an additional criterion to eliminate pairs that are not semantically related but whose letter-similarity ratio is just high enough to be detected as matching, while at the same time catching genuinely related terms whose ratio is low merely because they are short.
Here's a possibility:
from difflib import SequenceMatcher

interests = ('shooting', 'gaming', 'looping')
keywords = ('loop', 'looping', 'game')

s = SequenceMatcher(None)
limit = 0.50

for interest in interests:
    s.set_seq2(interest)
    for keyword in keywords:
        s.set_seq1(keyword)
        b = s.ratio() >= limit and len(s.get_matching_blocks()) == 2
        print('%10s %-10s %f %s' % (interest, keyword,
                                    s.ratio(),
                                    '** MATCH **' if b else ''))
    print()
gives
shooting loop 0.333333
shooting looping 0.666667
shooting game 0.166667
gaming loop 0.000000
gaming looping 0.461538
gaming game 0.600000 ** MATCH **
looping loop 0.727273 ** MATCH **
looping looping 1.000000 ** MATCH **
looping game 0.181818
Note this from the doc:
SequenceMatcher computes and caches detailed information about the
second sequence, so if you want to compare one sequence against many
sequences, use set_seq2() to set the commonly used sequence once and
call set_seq1() repeatedly, once for each of the other sequences.
That's because SequenceMatcher is based on edit distance or something alike. Semantic similarity is more suitable for your case, or a hybrid of the two.
Take a look at the NLTK package (code example) as you are using Python, and maybe this paper.
People using C++ can check this open source project for reference.
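As a rough illustration of such a hybrid (a sketch under assumptions of mine: NLTK with the WordNet corpus installed, and a simple max() combination rather than anything principled):
from difflib import SequenceMatcher
from itertools import product
from nltk.corpus import wordnet as wn

def hybrid_similarity(word1, word2):
    # Surface similarity on the characters themselves.
    surface = SequenceMatcher(None, word1, word2).ratio()
    senses1, senses2 = wn.synsets(word1), wn.synsets(word2)
    if not senses1 or not senses2:
        # Unknown to WordNet: fall back on the surface score alone.
        return surface
    # Best Wu-Palmer similarity over all sense pairs (wup_similarity can return None).
    semantic = max((s1.wup_similarity(s2) or 0.0) for s1, s2 in product(senses1, senses2))
    return max(surface, semantic)  # crude combination, for illustration only

print(hybrid_similarity('looping', 'shooting'))
print(hybrid_similarity('game', 'gaming'))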

string comparison in python but not Levenshtein distance (I think)

I found a crude string comparison in a paper I am reading done as follows:
The equation they use is as follows (extracted from the paper with small word changes to make it more general and readable)
I have tried to explain a bit more in my own words since the description by the author is not very clear (using an example by the author)
For example, for the 2 sequences ABCDE and BCEFA, there are two possible graphs:
graph 1) connects B with B, C with C and E with E
graph 2) connects A with A
I cannot connect A with A when I am connecting the other three (graph 1) since that would mean crossing lines (imagine you draw lines between B-B, C-C and E-E); that is, the line linking A-A would cross the lines linking B-B, C-C and E-E.
So these two sequences result in 2 possible graphs; one has 3 connections (BB, CC and EE) and the other only one (AA). Then I calculate the score d as given by the equation below.
Consequently, to define the degree of similarity between two
penta-strings we calculate the distance d between them. Aligning the
two penta-strings, we look for all the identities between their
characters, wherever these may be located. If each identity is
represented by a link between both penta-strings, we define a graph
for this pair. We call any part of this graph a configuration.
Next, we retain all of those configurations in which there is no character
cross pairing (the meaning is explained in my example above, i.e., no crossings of links between identical characters and only those graphs are retained).
Each of these is then evaluated as a function of the
number p of characters related to the graph, the shifting Δi for the
corresponding pairs and the gap δij between connected characters of
each penta-string. The minimum value is chosen as characteristic and
is called distance d: d = Min(50 - 10p + ΣΔi + Σδij). Although very rough,
this measure is generally in good agreement with the qualitative eye
guided estimation. For instance, the distance between abcde and abcfg
is 20, whereas that between abcde and abfcg is 23 = (50 - 30 + 1 + 2).
I am confused as to how to go about doing this. Any suggestions to help me would be much appreciated.
I tried the Levenshtein distance and also simple sequence alignment as used in protein sequence comparison.
The link to the paper is:
http://peds.oxfordjournals.org/content/16/2/103.long
I could not find any information on the first author, Alain Figureau, and my emails to M.A. Soto have not been answered (as of today).
Thank you
Well, it's definitely not Levenshtein:
>>> from nltk import metrics
>>> metrics.distance.edit_distance('abcde','abcfg')
2
>>> metrics.distance.edit_distance('abcde','abfcg')
3
>>> help(metrics.distance.edit_distance)
Help on function edit_distance in module nltk.metrics.distance:
edit_distance(s1, s2)
Calculate the Levenshtein edit-distance between two strings.
The edit distance is the number of characters that need to be
substituted, inserted, or deleted, to transform s1 into s2. For
example, transforming "rain" to "shine" requires three steps,
consisting of two substitutions and one insertion:
"rain" -> "sain" -> "shin" -> "shine". These operations could have
been done in other orders, but at least three steps are needed.
#param s1, s2: The strings to be analysed
#type s1: C{string}
#type s2: C{string}
#rtype C{int}
Just after the text block you cite, there is a reference to a previous paper by the same authors: Secondary Structure of Proteins and Three-dimensional Pattern Recognition. I think it is worth looking into if there is no explanation of the distance (I'm not at work, so I don't have access to the full document).
Otherwise, you can also try to contact the authors directly: Alain Figureau seems to be an old-school French researcher with no contact information whatsoever (no webpage, no e-mail, no "social networking", ...), so I advise trying to contact M.A. Soto, whose e-mail is given at the end of the paper. I think they will give you the answer you're looking for: the experiment's procedure has to be crystal clear in order to be repeatable; it's part of the scientific method in research.
