I need some help with the code below. I need to find the closest word in a list to an entered word; to test this I set word_0 to 'pikaru', which should return 'pikachu'. The levenshtein_distance function returns the distance between the two words passed to it. When I run the code below the answer I get is 'charmander', which is way off. Any help would be appreciated.
import backend

name_to_stats, id_to_name, names, pokemon_by_type = backend.get_pokemon_stats()
words = names
word_0 = 'pikaru'
def find_closest_word(word_0, words):
    """Finds the closest word in the list to word_0 as measured by the
    Levenshtein distance

    Args:
        word_0: a str
        words: a list of str

    Returns:
        The closest word in words to word_0 as a str.
    """
    # Hint: use the levenshtein_distance() function to help you out here.
    closest_word = words[0]
    #closest_distance = levenshtein_distance(word_0, words[0])
    for i in words:
        distance = levenshtein_distance(word_0, closest_word)
        new_distance = levenshtein_distance(word_0, i)
        if distance < new_distance:
            return i
def levenshtein_distance(s1, s2):
    """Returns the Levenshtein distance between strs s1 and s2

    Args:
        s1: a str
        s2: a str
    """
    # This function has already been implemented for you.
    # Source of the implementation:
    # https://stackoverflow.com/questions/2460177/edit-distance-in-python
    # If you'd like to know more about this algorithm, you can study it in
    # CSCC73 Algorithms. It applies an advanced technique called dynamic
    # programming.
    # For more information:
    # https://en.wikipedia.org/wiki/Levenshtein_distance
    # https://en.wikipedia.org/wiki/Dynamic_programming
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2 + 1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1],
                                           distances_[-1])))
        distances = distances_
    return distances[-1]
It looks like the error is in the return statement of your find_closest_word function:
if distance < new_distance:
    return i
The function does not find the closest word; it actually returns the first word in the list that is further from word_0 than words[0] is. Instead, try looping through words and keeping track of the best word you've seen so far. Something like:
best_distance = levenshtein_distance(word_0, words[0])
best_word = words[0]
for w in words:
    d = levenshtein_distance(word_0, w)
    if d < best_distance:
        best_distance = d
        best_word = w
return best_word
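Or, more compactly, you can let min() do the bookkeeping, using your levenshtein_distance as the key (a minimal sketch of the same idea):

def find_closest_word(word_0, words):
    # min() returns the word whose Levenshtein distance to word_0 is smallest
    return min(words, key=lambda w: levenshtein_distance(word_0, w))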
I am implementing in Python3 an algorithm to find the longest substring of two strings s and t. Given s and t, I need to return (a,b,l) where l is the length of the longest common substring, a is the position in s where the longest substring starts, and b is the position in t where the longest substring starts. I have a working version of the algorithm but it is quite slow and I am not sure why; it is frustrating because I have found other implementations in python using pretty much the same logic that are many times faster. I am self-learning so any help would be greatly appreciated.
The approach is based on comparing hash values rather than directly comparing substrings and using binary search to find maximal length of common substrings. Here is the code for my hash function (m is a big prime and x is just some constant):
def polynomial_hash(my_string, m, x):
    str_len = len(my_string)
    result = 0
    for i in range(str_len):
        result = (result + ord(my_string[i]) * power_mod_p(x, i, m)) % m
    return result
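(power_mod_p is not shown here; it just computes modular exponentiation, and could be as simple as Python's built-in three-argument pow.)

def power_mod_p(base, exponent, p):
    # assumed helper: (base ** exponent) % p
    return pow(base, exponent, p)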
Given two strings s and t, I first find which string is shorter; without loss of generality, let s be the shorter string. Next I need the hash values of all substrings of a given length of a string. I use the following function, implemented as a generator:
def all_length_k_hashes(my_string, k, m, x):
    current_position = len(my_string) - k
    x_to_the_k = power_mod_p(x, k, m)
    hash_value = polynomial_hash(my_string[current_position:], m, x)
    yield (hash_value, current_position)
    while current_position > 0:
        current_position = current_position - 1
        hash_value = ((hash_value * x) + ord(my_string[current_position])
                      - x_to_the_k * ord(my_string[current_position + k])) % m
        yield (hash_value, current_position)
This function is simple: its first yield is the hash value of the final length-k substring of the string, and each subsequent iteration yields the hash value of the next length-k substring to its left (we move left by one position, e.g. for k=3 in 'abcdefghi', from 'ghi' to 'fgh', then from 'fgh' to 'efg'). This should calculate the hash values of all length-k substrings of my_string in O(|my_string|).
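As a quick sanity check (with arbitrary m and x), the rolling update agrees with recomputing each hash directly:

m, x = 1000000007, 31
for h, pos in all_length_k_hashes("abcdefghi", 3, m, x):
    assert h == polynomial_hash("abcdefghi"[pos:pos + 3], m, x)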
Now, to find out whether s and t have a length-k substring in common, I use the following function:
def common_sub_string_length_k(shorter_str, longer_str, k, m, x):
    short_str_dict = dict()
    for hash_and_index in all_length_k_hashes(shorter_str, k, m, x):
        short_str_dict.update({hash_and_index[0]: hash_and_index[1]})
    hash_generator_longer_str = all_length_k_hashes(longer_str, k, m, x)
    for hash_and_index in hash_generator_longer_str:
        if hash_and_index[0] in short_str_dict:
            return (short_str_dict[hash_and_index[0]], hash_and_index[1])
    return False
What is happening in this function is: I create an empty Python dictionary and fill it with key:value pairs such that each key is the hash value of a length-k substring of the shorter string and its value is that substring's starting index; I call this short_str_dict.
Then, using all_length_k_hashes, I create a generator of hash values of the length-k substrings of the longer string and iterate through it to check whether any hash value is in short_str_dict. If there is one, then the two strings have a substring of length k in common (assuming no hash collisions). This whole process should take O(|shorter_string| + |longer_string|) time.
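For example, on small inputs (arbitrary m and x, assuming no collisions):

m, x = 309000599, 17
print(common_sub_string_length_k("xabcz", "qqabcpp", 3, m, x))  # (1, 2): "abc" starts at index 1 and 2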
Finally, the following function repeatedly uses the previous process to find the maximal k, using a binary search technique:
from random import randint

def longest_common_substring(str_1, str_2):
    m_1 = 309000599
    m_2 = 988017827
    x = randint(1, 10 ** 6)
    len_str_1 = len(str_1)
    len_str_2 = len(str_2)
    if len_str_1 <= len_str_2:
        short_str = str_1
        long_str = str_2
        switched = False
    else:
        short_str = str_2
        long_str = str_1
        switched = True
    len_short_str = len(short_str)
    len_long_str = len(long_str)
    low = 0
    high = len_short_str
    mid = 0
    longest_so_far = 0
    longest_indices = (0, 0)
    while low <= high:
        mid = (high + low) // 2
        m1_result = common_sub_string_length_k(short_str, long_str, mid, m_1, x)
        m2_result = common_sub_string_length_k(short_str, long_str, mid, m_2, x)
        if m1_result is False or m2_result is False:
            high = mid - 1
        else:
            longest_so_far = mid
            longest_indices = m1_result
            low = mid + 1
    if switched:
        return (longest_indices[1], longest_indices[0], longest_so_far)
    else:
        return (longest_indices[0], longest_indices[1], longest_so_far)
Two different hashes are used to reduce the probability of a collision. So in total, assuming no collisions, this whole process should take O(log|shorter_string|) * O(|shorter_string| + |longer_string|).
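For reference, a small example call (assuming no hash collisions; a tie between equally long common substrings could return either one):

print(longest_common_substring("abcdxyz", "xyzabcd"))  # (0, 3, 4): "abcd" starts at 0 in str_1 and at 3 in str_2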
Have I made any error? Is it slow because of the use of Python dictionaries? I really want to understand my mistake. Any help is greatly appreciated.
Here are test cases for the code:
string - 'Tim had been saying that he had been there'
search - 'had'
expected output - 4
string - 'he got what he got and what he wanted'
search - 'he'
expected output - 2
def return_distance(input, search):
    words = input.split()
    distance = None
    indx = []
    if not input or not search:
        return None
    else:
        if words.count(search) > 1:
            indx = [index for index, word in enumerate(words) if word == search]
            distance = indx[1] - indx[0]
            for i in range(len(indx) - 1):
                distance = min(distance, indx[i+1] - indx[i]) - 1
    return distance
I am thinking about how to optimize the code. I admit it is poorly written.
How about
def min_distance_between_words(sentence, word):
    idxes = [i for i, e in enumerate(sentence.split()) if e == word]
    return min([y - x - 1 for x, y in zip(idxes, idxes[1:])])
This splits the input sentence, makes a list of every index that matches the target word, then iterates over this list to compute the differences between consecutive indices and returns the minimum difference.
Since the behavior is unspecified when the sentence doesn't contain the word, this raises an error, but you can add a check for that and return the value of your choice using min's default parameter:
def min_distance_between_words(sentence, word):
    idxes = [i for i, e in enumerate(sentence.split()) if e == word]
    return min([y - x - 1 for x, y in zip(idxes, idxes[1:])], default=None)
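A quick check against the test cases from the question:

print(min_distance_between_words('Tim had been saying that he had been there', 'had'))  # 4
print(min_distance_between_words('he got what he got and what he wanted', 'he'))        # 2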
As an aside, naming a variable input shadows a builtin, and return_distance is a rather ambiguous name for a function.
Checking the parameters for None, as done with if not input or not search:, is not typically done in Python (we assume the caller will always pass in a string and adhere to the function's contract).
If you want to generalize this further, move the split() duty to the domain of the caller which enables the function to operate on arbitrary iterables:
def min_distance_between_occurrences(it, target):
    idxes = [i for i, e in enumerate(it) if e == target]
    return min([y - x - 1 for x, y in zip(idxes, idxes[1:])], default=None)
Call with:
min_distance_between_occurrences("a b c a".split(), "a")
min_distance_between_occurrences([(1, 2), (1, 3), (1, 2)], (1, 2))
Refactoring aside, as pointed out in the comments, the original code isn't correct. Issues include:
search_str does not exist. You probably meant search.
distance and min_dist don't really work together. Pick one or the other and use it for all minimum calculations.
min(min_dist, indx[i+1] - indx[i])-1 subtracts 1 in the wrong place, throwing off the count.
Here's a potential fix for these issues:
def return_distance(input, search):
    words = input.split()
    distance = None
    if words.count(search) > 1:
        indx = [index for index, word in enumerate(words) if word == search]
        distance = indx[1] - indx[0] - 1
        #                            ^^^^
        for i in range(len(indx) - 1):
            distance = min(distance, indx[i+1] - indx[i] - 1)
            #                                           ^^^^
    return distance
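With the - 1 moved inside the min() calculation, both of the question's test cases pass:

print(return_distance('Tim had been saying that he had been there', 'had'))  # 4
print(return_distance('he got what he got and what he wanted', 'he'))        # 2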
One way is to use min with a list comprehension over indx:
min_dist = min([indx[i + 1] - indx[i] - 1 for i in range(len(indx) - 1)])
I have a huge list (containing ~250k words) of unique words (say list1).
I have another list containing 5 words which are misspelled (say list2).
I need to find the Jaccard similarity (based on varying ngrams) between the two lists and return the closest matching word from list1. Working from a few answers that I found on this site, I was able to:
Split both lists into ngrams via a function.
Calculate the Jaccard similarity between the first element of the second list and the first list.
This gives me a valid answer. However, I am unable to build on this to return the closest matching words from list1. I know I need to apply the ngram function to each element of list1, then compute the Jaccard similarity with list2 and return the maximum-valued element from that, but I am unable to implement it via a loop. This is the code I'm using:
def spell_correcter(list2=['word1', 'word2',... 'word5']):
    from sklearn.metrics import jaccard_similarity_score
    import re

    def find_ngrams(text: str, number: int=3) -> set:
        # returns a set of ngrams for the given string
        if not text:
            return set()
        str1 = ''.join(text)
        words = [f' {x} ' for x in re.split(r'\W+', str1.lower()) if x.strip()]
        ngrams = set()
        for word in words:
            for x in range(0, len(word) - number + 1):
                ngrams.add(word[x:x+number])
        return ngrams

    def similarity(text1: str, text2: str, number: int=3) -> float:
        # Finds the similarity between 2 strings using ngrams.
        ngrams1 = find_ngrams(text1, number)
        ngrams2 = find_ngrams(text2, number)
        num_unique = len(ngrams1 | ngrams2)
        num_equal = len(ngrams1 & ngrams2)
        # Tried to compute for entire list1; very slow. Didn't execute
        #for i in range(0, len(text1)):
        #    ngrams1 = find_ngrams(text1, number)
        #    num_unique = len(ngrams1 | ngrams2)
        #    num_equal = len(ngrams1 & ngrams2)
        #    jaccard = float(num_equal) / float(num_unique)
        return float(num_equal) / float(num_unique)

    b = list2[0]
    a = similarity(list1, b)
    return a
Can someone help with this code?
Shingling is the process of creating a single object by taking consecutive words and grouping them.
We tokenize list 1, create ngrams of length len(list2), and calculate the Jaccard similarity of each ngram with list 2. This gives the words in list 1 that are most similar to the words in list 2:
import nltk

# `tokenizer` is assumed to be an NLTK word tokenizer, for example:
tokenizer = nltk.tokenize.WhitespaceTokenizer()

def jaccard_similarity(list_x, list_y):
    set_x = set(list_x)
    set_y = set(list_y)
    intersection = set_x.intersection(set_y)
    union = set_x.union(set_y)
    return len(intersection) / len(union) if len(union) > 0 else 0

def shingling_jaccard_similarity(text_x, text_y, n):
    x = list(nltk.ngrams(tokenizer.tokenize(text_x), n))
    y = list(nltk.ngrams(tokenizer.tokenize(text_y), n))
    sim_score = jaccard_similarity(x, y)
    return sim_score
shingling_jaccard_similarity(list1, list2, len(list2))
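Alternatively, if the goal is simply the closest word in list1 for each misspelled word in list2, here is a minimal sketch using the similarity() function from the question (this assumes find_ngrams and similarity are defined at module level rather than nested inside spell_correcter):

def closest_matches(list1, list2, number=3):
    # for each misspelled word, keep the word in list1 with the highest ngram Jaccard score
    return [max(list1, key=lambda word: similarity(word, misspelled, number))
            for misspelled in list2]

This is a linear scan over list1 for each misspelled word, so with ~250k words and 5 misspellings it performs roughly 1.25 million similarity calls.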
I need to find the substring of s that is closest to a string p by Hamming distance and have it return a tuple of the index of the closest substring, the Hamming distance of the closest substring to p, and the closest substring itself.
I have this code so far:
def ham_dist(s1, s2):
    if len(s1) != len(s2):
        raise ValueError("Undefined")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
But I am confused about how I would figure this out:
Your function should return (1,2,'bcef') because the closest substring is 'bcef', it begins at index 1 in s, and its Hamming distance to p is 2.
In your function, you should use your ham_dist function from part (a). If there is more than one substring with the same minimum distance to p, return any of them.
You can run through the source string and compute the Hamming distance between your search string and the substring of the same length starting at the current index. You save the index, Hamming distance, and substring whenever the distance is smaller than the best you had before. This way you will get the minimal value.
source_string = "pGpEusuCSWEaPOJmamlFAnIBgAJGtcJaMPFTLfUfkQKXeymydQsdWCTyEFjFgbSmknAmKYFHopWceEyCSumTyAFwhrLqQXbWnXSn"
search_string = "tyraM"

def ham_dist(s1, s2):
    if len(s1) != len(s2):
        raise ValueError("Undefined")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

def search_min_dist(source, search):
    l = len(search)
    index = 0
    min_dist = l
    min_substring = source[:l]
    for i in range(len(source) - l + 1):
        d = ham_dist(search, source[i:i+l])
        if d < min_dist:
            min_dist = d
            index = i
            min_substring = source[i:i+l]
    return (index, min_dist, min_substring)

print(search_min_dist(source_string, search_string))
Output
(28, 2, 'tcJaM')
The answer from Hugo Delahaye is a good one and does a better job of answering your question directly, but a different way to think about problems like this is to let Python's min() function figure out the answer. Under this type of data-centric programming (see Rule 5), your goal is to organize the data to make that possible.
s = 'abcefgh'
p = 'cdef'
N = len(p)

substrings = [
    s[i : i + N]
    for i in range(0, len(s) - N + 1)
]

result = min(
    (ham_dist(p, sub), sub, i)
    for i, sub in enumerate(substrings)
)

print(substrings)  # ['abce', 'bcef', 'cefg', 'efgh']
print(result)      # (2, 'bcef', 1)
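If you need the (index, distance, substring) shape described in the question, you can reorder the resulting tuple afterwards (reordering inside min() itself would change what is being minimized unless you also pass a key):

dist, sub, i = result
print((i, dist, sub))  # (1, 2, 'bcef')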
I have two lists of synsets generated from wordnet.synsets():
import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd

# convert tag to the one used by wordnet
def convert_tag(tag):
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None

# define a function to find synset reference
def doc_to_synsets(doc):
    token = nltk.word_tokenize(doc)
    tag = nltk.pos_tag(token)
    wordnet_tag = convert_tag(tag)
    syns = [wn.synsets(token, wordnet_tag) for token in nltk.word_tokenize(doc)]
    syns_list = [token[0] for token in syns if token]
    return syns_list

# convert two example text documents
doc1 = 'This is a test function.'
doc2 = 'Use this function to check if the code in doc_to_synsets is correct!'
s1 = doc_to_synsets(doc1)
s2 = doc_to_synsets(doc2)
I am trying to write a function to find the synset in s2 with the largest 'path similarity' score for each synset in s1. Hence, for s1, which contains 4 unique synsets, the function should return 4 path similarity scores, which I will then convert into a pandas Series object for ease of computation.
I have been working on the following code so far:
def similarity_score(s1, s2):
    list = []
    for word1 in s1:
        best = max(wn.path_similarity(word1, word2) for word2 in s2)
        list.append(best)
    return list
However, it only returns an empty list without any values in it:
[]
Would anyone care to look at what's wrong with my for loop and perhaps enlighten me on this subject?
Thank you.
I removed the "Sysnet" class references since I don't have whatever that class is, and it doesn't matter for scoring purposes. The score function is abstracted out so you can define it however you like. I took a stab at a very simplistic rule: it compares each position, demarcated by the . separators, to see if they are equal; if they are, the score is incremented. For example, in s1, be.v.01 compared to a made-up be.f.02 would have a score of 1, because only the prefix matches. If instead we compared to be.v.02, the score would be 2, and so on.
s1 = [('be.v.01'),
      ('angstrom.n.01'),
      ('function.n.01'),
      ('trial.n.02')]

s2 = [('use.n.01'),
      ('function.n.01'),
      ('see.n.01'),
      ('code.n.01'),
      ('inch.n.01'),
      ('be.v.01'),
      ('correct.v.01')]
def score(s1, s2):
    score = 0
    for x, y in zip(s1.split('.'), s2.split('.')):
        if x == y:
            score += 1
    return score
closest = []  # list of [target, best_match]
for sysnet1 in s1:
    max_score = 0
    best = None
    for sysnet2 in s2:
        cur_score = score(sysnet1, sysnet2)
        if cur_score > max_score:
            max_score = cur_score
            best = sysnet2
    closest.append([sysnet1, best])

print(closest)
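With the s1 and s2 above, this should print:

[['be.v.01', 'be.v.01'], ['angstrom.n.01', 'use.n.01'], ['function.n.01', 'function.n.01'], ['trial.n.02', 'use.n.01']]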