Finding the closest sub-string by Hamming distance - python

I need to find the substring of s that is closest to a string by Hamming distance and have it return a tuple of the index of the closest substring, the Hamming distance of the closest substring to p, and the closest substring itself.
I have this code so far:
def ham_dist(s1, s2):
if len(s1) != len(s2):
raise ValueError("Undefined")
return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
But I am confused on how I would figure this out:
Your function should return (1,2,'bcef') because the closest substring is 'bcef', it begins at index 1 in s, and its Hamming distance to p is 2.
In your function, you should use your ham_dist function from part (a). If there is more than one substring with the same minimum distance to p, return any of them.

You can run through the source string and compute the Hamming distance between your search string and the substring of the same length starting at the current index. You save the index, Hamming distance and substring if it is smaller than what you had before. This way you will get the minimal value.
source_string = "pGpEusuCSWEaPOJmamlFAnIBgAJGtcJaMPFTLfUfkQKXeymydQsdWCTyEFjFgbSmknAmKYFHopWceEyCSumTyAFwhrLqQXbWnXSn"
search_string = "tyraM"
def ham_dist(s1, s2):
if len(s1) != len(s2):
raise ValueError("Undefined")
return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
def search_min_dist(source,search):
l = len(search)
index = 0
min_dist = l
min_substring = source[:l]
for i in range(len(source)-l+1):
d = ham_dist(search, source[i:i+l])
if d<min_dist:
min_dist = d
index = i
min_substring = source[i:i+l]
return (index,min_dist,min_substring)
print search_min_dist(source_string,search_string)
Output
(28, 2, 'tcJaM')

The answer from Hugo Delahaye is a good one and does a better job of answering your question directly, but a different way to think about problems like this is to let Python's min() function figure out the answer. Under this type of data-centric programming (see Rule 5), your goal is to organize the data to make that possible.
s = 'abcefgh'
p = 'cdef'
N = len(p)
substrings = [
s[i : i + N]
for i in range(0, len(s) - N + 1)
]
result = min(
(ham_dist(p, sub), sub, i)
for i, sub in enumerate(substrings)
)
print(substrings) # ['abce', 'bcef', 'cefg', 'efgh']
print(result) # (2, 'bcef', 1)

Related

Longest common substring with rolling hash

I am implementing in Python3 an algorithm to find the longest substring of two strings s and t. Given s and t, I need to return (a,b,l) where l is the length of the longest common substring, a is the position in s where the longest substring starts, and b is the position in t where the longest substring starts. I have a working version of the algorithm but it is quite slow and I am not sure why; it is frustrating because I have found other implementations in python using pretty much the same logic that are many times faster. I am self-learning so any help would be greatly appreciated.
The approach is based on comparing hash values rather than directly comparing substrings and using binary search to find maximal length of common substrings. Here is the code for my hash function (m is a big prime and x is just some constant):
def polynomial_hash(my_string, m, x):
str_len = len(my_string)
result = 0
for i in range(str_len):
result = (result + ord(my_string[i]) * power_mod_p(x, i, m)) % m
return result
Given two strings s and t, I first find which string is shorter, without loss of generality, let s be the shorter string. First I need to find the hash values of substrings of a string. I use the following function, implemented as a generator:
def all_length_k_hashes(my_string, k, m, x):
current_position = len(my_string) - k
x_to_the_k = power_mod_p(x, k, m)
hash_value = polynomial_hash(my_string[current_position:], m, x)
yield (hash_value, current_position)
while current_position > 0:
current_position = current_position - 1
hash_value = ((hash_value * x) + ord(my_string[current_position]) - x_to_the_k*ord(my_string[current_position + k])) % m
yield (hash_value, current_position)
This function is simple, its first yield is the hash value of the final length k substring of the string, after that each of its iteration is the hash value of the next length k substring to its left (we move left by one position, for example for k=3 from abcdefghi to abcdefghi then from abcdefghi to abcdefghi). This should be able to calculate all the hash values of all length k substrings of my_string in O(|my_string|).
Now I find out if s and t has a length k substring in common, I use the following function:
def common_sub_string_length_k(shorter_str, longer_str, k, m, x):
short_str_dict = dict()
for hash_and_index in all_length_k_hashes(shorter_str, k, m, x):
short_str_dict.update({hash_and_index[0]: hash_and_index[1]})
hash_generator_longer_str = all_length_k_hashes(longer_str, k, m, x)
for hash_and_index in hash_generator_longer_str:
if hash_and_index[0] in short_str_dict:
return (short_str_dict[hash_and_index[0]], hash_and_index[1])
return False
What is happening in this function is: I create a Python empty dictionary and fill it with (key:values) such that each key is the hash value of a length k substring of the shorter string and its value is that substring's starting index, I call this 'short_str_dict'
Then, using all_length_k_hashes, I create a generator of hash values of substrings of length k of the longer string, then I iterate through this generator to check if there is a hash value that's in the 'short_str_dict', if there is, then the two strings have a substring of length k in common (assuming no hash collisions). This whole process should take time O(|shorter_string| + |longer_string|)
Finally, the following function repeatedly uses the previous process to find the maximal k, using a binary search technique:
def longest_common_substring(str_1, str_2):
m_1 = 309000599
m_2 = 988017827
x = randint(1, 10 ** 6)
len_str_1 = len(str_1)
len_str_2 = len(str_2)
if len_str_1 <= len_str_2:
short_str = str_1
long_str = str_2
switched = False
else:
short_str = str_2
long_str = str_1
switched = True
len_short_str = len(short_str)
len_long_str = len(long_str)
low = 0
high = len_short_str
mid = 0
longest_so_far = 0
longest_indices = (0,0)
while low <= high:
mid = (high + low) // 2
m1_result = common_sub_string_length_k(short_str, long_str, mid, m_1, x)
m2_result = common_sub_string_length_k(short_str, long_str, mid, m_2, x)
if m1_result is False or m2_result is False:
high = mid - 1
else:
longest_so_far = mid
longest_indices = m1_result
low = mid + 1
if switched:
return (longest_indices[1], longest_indices[0], longest_so_far)
else:
return (longest_indices[0], longest_indices[1], longest_so_far)
Two different hashes are used to reduce the probability of a collision. So in total, assuming no collisions, this whole process should take
O(log|shorter_string|) * O(|shorter_string| + |longer_string|).
Have I made any error? Is it slow because of the use of Python dictionaries? I really want to understand my mistake. Any help is greatly appreciated.

find the minimum number of words(distance) between repeated occurrences of a search string in the input string

Here are test cases for the code:
string - 'Tim had been saying that he had been there'
search - 'had'
expected output - 4
string - 'he got what he got and what he wanted'
search - 'he'
expected out - 2
def return_distance(input, search):
words = input.split()
distance = None
indx = []
if not input or not search:
return None
else:
if words.count(search) >1:
indx = [ index for index, word in enumerate(words) if word == search]
distance = indx[1] - indx[0]
for i in range(len(indx)-1):
distance = min(distance, indx[i+1] - indx[i])-1
return distance
I am thinking how to optimize the code. I admit it is poorly written.
How about
def min_distance_between_words(sentence, word):
idxes = [i for i, e in enumerate(sentence.split()) if e == word]
return min([y - x - 1 for x, y in zip(idxes, idxes[1:])])
This splits the input sentence, makes a list of every index that matches the target word, then iterates over this list to compute the differences between each index and returns the minimum difference.
Since behavior is unspecified when the sentence doesn't have a word, it raises an error but you can add a check for this and return the value of your choice if desired using min's default parameter:
def min_distance_between_words(sentence, word):
idxes = [i for i, e in enumerate(sentence.split()) if e == word]
return min([y - x - 1 for x, y in zip(idxes, idxes[1:])], default=None)
As an aside, naming a variable input overwrites a builtin and return_distance is a rather ambiguous name for a function.
Adding a precondition for parameters for None as done with if not input or not search: is not typically done in Python (we assume caller will always pass in a string and adhere to the function's contract).
If you want to generalize this further, move the split() duty to the domain of the caller which enables the function to operate on arbitrary iterables:
def min_distance_between_occurrences(it, target):
idxes = [i for i, e in enumerate(it) if e == target]
return min([y - x - 1 for x, y in zip(idxes, idxes[1:])], default=None)
Call with:
min_distance_between_occurrences("a b c a".split(), "a")
min_distance_between_occurrences([(1, 2), (1, 3), (1, 2)], (1, 2))
Refactoring aside, as pointed out in the comments, the original code isn't correct. Issues include:
search_str does not exist. You probably meant search.
distance and min_dist don't really work together. Pick one or the other and use it for all minimum calculations.
min(min_dist, indx[i+1] - indx[i])-1 subtracts 1 in the wrong place, throwing off the count.
Here's a potential fix for these issues:
def return_distance(input, search):
words = input.split()
distance = None
if words.count(search) > 1:
indx = [index for index, word in enumerate(words) if word == search]
distance = indx[1] - indx[0] - 1
# ^^^^
for i in range(len(indx) - 1):
distance = min(distance, indx[i+1] - indx[i] - 1)
# ^^^^
return distance
One way is to use min with list comprehension on indx
min_dist = min([(indx[i+1] - indx[i]-1) for i in range(len(indx)-1) ])

Longest of k consecutive strings

I'm defining a Python function to determine the longest string if the original strings are combined for every k consecutive strings. The function takes two parameters, strarr and k.
Here is an example:
max_consec(["zone", "abigail", "theta", "form", "libe", "zas", "theta", "abigail"], 2) --> "abigailtheta"
Here's my code so far (instinct is that I'm not passing k correctly within the function)
def max_consec(strarr, k):
lenList = []
for value in zip(strarr, strarr[k:]):
consecStrings = ''.join(value)
lenList.append(consecStrings)
for word in lenList:
if max(word):
return word
Here is a test case not passing:
testing(longest_consec(["zone", "abigail", "theta", "form", "libe", "zas"], 2), "abigailtheta")
My output:
'zonetheta' should equal 'abigailtheta'
It's not quite clear to me what you mean by "every k consecutive strings", but if you mean taking k-length slices of the list and concatenating all the strings in each slice, for example
['a', 'bb', 'ccc', 'dddd'] # k = 3
becomes
['a', 'bb', 'ccc']
['bb', 'ccc', 'dddd']
then
'abbccc'
'bbcccddd'
then this works ...
# for every item in strarr, take the k-length slice from that point and join the resulting strings
strs = [''.join(strarr[i:i + k]) for i in range(len(strarr) - k + 1)]
# find the largest by `len`gth
max(strs, key=len)
this post gives alternatives, though some of them are hard to read/verbose
Store the string lengths in an array. Now assume a window of size k passing through this list. Keep track of the sum in this window and starting point of the window.
When window reaches the end of the array you should have maximum sum and index where the maximum occurs. Construct the result with the elements from this window.
Time complexity: O(size of array + sum of all strings sizes) ~ O(n)
Also add some corner case handling when k > array_size or k <= 0
def max_consec(strarr, k):
size = len(strarr)
# corner cases
if k > size or k <= 0:
return "None" # or None
# store lengths
lenList = [len(word) for word in strarr]
print(lenList)
max_sum = sum(lenList[:k]) # window at index 0
prev_sum = max_sum
max_index = 0
for i in range(1, size - k + 1):
length = prev_sum - lenList[i-1] + lenList[i + k - 1] # window sum at i'th index. Subract previous sum and add next sum
prev_sum = length
if length > max_sum:
max_sum = length
max_index = i
return "".join(strarr[max_index:max_index+k]) # join strings in the max sum window
word = max_consec(["zone", "abigail", "theta", "form", "libe", "zas", "theta", "abigail"], 2)
print(word)
def max_consec(strarr, k):
n = -1
result = ""
for i in range(len(strarr)):
s = ''.join(strarr[i:i+k])
if len(s) > n:
n = len(s)
result = s
return result
Iterate over the list of strings and create a new string concatenating it with next k strings
Check if the newly created string is the longest. If so memorize it
Repeat the above steps until the iterations complete
return the memorised string
If I understand your question correctly. You have to eliminate duplicate values (in this case with set), sort them by length and concatenate the k longest words.
>>> def max_consec(words, k):
... words = sorted(set(words), key=len, reverse=True)
... return ''.join(words[:k])
...
>>> max_consec(["zone", "abigail", "theta", "form", "libe", "zas", "theta", "abigail"], 2)
'abigailtheta'
Update:
If the k elements should be consecutive. You can create pairs of consecutive words (in this case with zip). And return the longest if they get joined.
>>> def max_consec(words, k):
... return max((''.join(pair) for pair in zip(*[words[i:] for i in range(k)])), key=len)
...
>>> max_consec(["zone", "abigail", "theta", "form", "libe", "zas", "theta", "abigail"], 2)
'abigailtheta'

Levenshtein function to find closest name

I need some help with the below code. I need to find the closest word to the entered word in this case to test I set word_0 as 'pikaru' which should return 'pikachu'. The levenshtein function returns us the distance between the two words that we entered. When I run the below code the answer I get is charmander which is way off, any help would be appreciated.
import backend
name_to_stats, id_to_name, names,
pokemon_by_typebackend.get_pokemon_stats()
words = names
word_0 = 'pikaru'
def find_closest_word(word_0, words):
"""Finds the closest word in the list to word_0 as measured by the
Levenshtein distance
Args:
word_0: a str
words: a list of str
Returns:
The closest word in words to word_0 as a str.
"""
# Hint: use the levenshtein_distance() function to help you out here.
closest_word = words[0]
#closest_distance = levenshtein_distance(word_0, words[0])
for i in words:
distance = levenshtein_distance(word_0, closest_word)
new_distance = levenshtein_distance(word_0, i)
if distance < new_distance:
return i
def levenshtein_distance(s1, s2):
"""Returns the Levenshtein distance between strs s1 and s2
Args:
s1: a str
s2: a str
"""
# This function has already been implemented for you.
# Source of the implementation:
# https://stackoverflow.com/questions/2460177/edit-distance-in-python
# If you'd like to know more about this algorithm, you can study it in
# CSCC73 Algorithms. It applies an advanced technique called dynamic
# programming.
# For more information:
# https://en.wikipedia.org/wiki/Levenshtein_distance
# https://en.wikipedia.org/wiki/Dynamic_programming
if len(s1) > len(s2):
s1, s2 = s2, s1
distances = range(len(s1) + 1)
for i2, c2 in enumerate(s2):
distances_ = [i2+1]
for i1, c1 in enumerate(s1):
if c1 == c2:
distances_.append(distances[i1])
else:
distances_.append(1 + min((distances[i1], distances[i1 + 1],
distances_[-1])))
distances = distances_
return distances[-1]
It looks like the error is in the return statement of your find_closest_word function:
if distance < new_distance:
return i
The function will not find the closest word, it will actually find the first word in the list that's further away than words[0]. Instead, try looping through words and keeping track of which word is the best you've seen so far. Something like:
best_distance = levenshtein_distance(word_0, words[0])
best_word = words[0]
for w in words:
d = levenshtein_distance(word_0, w)
if d < best_distance:
best_distance = d
best_word = w
return best_word

Longest common sequence between many sub-sequences

Fancy title :)
I have a file that contains the following:
>sequence_40
ABCDABDCABCDBACDBACDBACDBACDABDCDC
ACDCCDCABDCADCADBCACBDCABD
>sequence_41
DCBACDBACDADCDCDCABCDCACBDCBDACBDC
BCDBABABBABACDCDBCACDBACDBACDBACDC
BCDB
...
Then, I have a function that returns a dictionary (called dict) that returns the sequences as keys and the strings (combined on one line) as values for the keys. The sequences range from 40 to 59.
I want to take a dictionary of sequences and return the longest common sub-sequence found in ALL the sequences. Managed to find some help here on stackoverflow and made a code that only compares the LAST TWO strings in that dictionary, not all of them :).
This is the code
def longest_common_sequence(s1, s2):
m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
longest, x_longest = 0, 0
for x in range(1, 1 + len(s1)):
for y in range(1, 1 + len(s2)):
if s1[x - 1] == s2[y - 1]:
m[x][y] = m[x - 1][y - 1] + 1
if m[x][y] > longest:
longest = m[x][y]
x_longest = x
else:
m[x][y] = 0
return s1[x_longest - longest: x_longest]
for i in range(40,59):
s1=str(dictionar['sequence_'+str(i)])
s2=str(dictionar['sequence_'+str(i+1)])
longest_common_sequence(s1,s2)
How can I modify it to get the common subsequence among ALL sequences in dictionary? Thanks!
EDIT: As #lmcarreiro pointed out, there is a relevant difference between substrings (or subarrays or sublists) and subsequences. To my understanding we are all talking about substrings here, so I will use this term in my answer.
Guillaumes answer can be improved:
def eachPossibleSubstring(string):
for size in range(len(string) + 1, 0, -1):
for start in range(len(string) - size + 1):
yield string[start:start+size]
def findLongestCommonSubstring(strings):
shortestString = min(strings, key=len)
for substring in eachPossibleSubstring(shortestString):
if all(substring in string
for string in strings if string != shortestString):
return substring
print findLongestCommonSubstring([
'ABCDABDCABCDBACDBACDBACDBACDABDCDCACDCCDCABDCADCADBCACBDCABD',
'DCBACDBACDADCDCDCABCDCACBDCBDACBDCBCDBABABBABACDCDBCACDBACDBACDBACDCBCDB',
])
This prints:
ACDBACDBACDBACD
This is faster because I return the first found and search from longest to shortest.
The basic idea is this: Take each possible substring of the shortest of your strings (in the order from the longest to the shortest) and see if this substring can be found in all other strings. If so, return it, otherwise try the next substring.
You need to understand generators. Try e. g. this:
for substring in eachPossibleSubstring('abcd'):
print substring
or
print list(eachPossibleSubstring('abcd'))
I'd start by defining a function to return all possible subsequences of a given sequence:
from itertools import combinations_with_replacement
def subsequences(sequence):
"returns all possible subquences of a given sequence"
for start, stop in combinations_with_replacement(range(len(sequence)), 2):
if start < stop:
yield sequence[start:stop]
then I'd make another method to check if a given subsequence in present in all given sequences:
def is_common_subsequence(sub, sequences):
"returns True if <sub> is a common subsequence in all <sequences>"
return all(sub in sequence for sequence in sequences)
then using the 2 methods above it is pretty easy to get all common subsequences in a given set of sequences:
def common_sequences(sequences):
"return all subsequences common in sequences"
shortest_seq = min(sequences, key=len)
return set(subsequence for subsequence in subsequences(shortest_seq) \
if is_common_subsequence(subsequence, sequences))
... and extracting the longuest sequence:
def longuest_common_subsequence(sequences):
"returns the longuest subsequence in sequences"
return max(common_sequences(sequences), key=len)
Result:
sequences = {
41: 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
42: '123ABCDEFGHIJKLMNOPQRSTUVW',
43: '123456ABCDEFGHIJKLMNOPQRST'
}
sequences2 = {
0: 'ABCDEFGHIJ',
1: 'DHSABCDFKDDSA',
2: 'SGABCEIDEFJRNF'
}
print(longuest_common_subsequence(sequences.values()))
>>> ABCDEFGHIJKLMNOPQRST
print(longuest_common_subsequence(sequences2.values()))
>>> ABC
Here you have a possible approach. First let's define a function that returns the longest substring between two strings:
def longest_substring(s1, s2):
t = [[0]*(1+len(s2)) for i in range(1+len(s1))]
l, xl = 0, 0
for x in range(1,1+len(s1)):
for y in range(1,1+len(s2)):
if s1[x-1] == s2[y-1]:
t[x][y] = t[x-1][y-1] + 1
if t[x][y]>l:
l = t[x][y]
xl = x
else:
t[x][y] = 0
return s1[xl-l: xl]
Now I'll create a random dict of sequences for the example:
import random
import string
d = {i : ''.join(random.choice(string.ascii_uppercase) for _ in range(50)) for i in range(10)}
print d
{0: 'ASCUCEVJNIGWVMWMBBQQBZYBBNGQAJRYXACGFEIFWHMBCNYRGL', 1: 'HKUKZOJJUCRTSBLNZXCIBARLPNAPAABRBZEVGVILJAFCGWGQVV', 2: 'MMHCYPKECRJFEWTGYITMHZSNHAFEZVFYDAVILRYRKIDDBEFRVX', 3: 'DGBULRFJINFZEELDASRFBIRSADWMRAYMGCDAOJDKQIMXIRLTEI', 4: 'VDUFWZSXLRGOIMAHOAMZAIWDPTHDVDXUACRBASJMCUHREDORRH', 5: 'RFGAVHOWNKRZMYMSFSSNUGCKEWUNVETCDWJXSPBJHKSTPFNSJO', 6: 'HFMLMHCFSOEXBXWFAROIRGJNPRTKRWCEPLFOKGMXNUPCPWREWX', 7: 'CNPGSHGVIRLDXAADXUVWCTJCXUHQLALBUOJMXQBKXWHKGSJHEH', 8: 'UWDXXTRCFNCBUBEYGYTDWTPLNTRHYQWKTHPRVCBAWIMNGHULDC', 9: 'OOCJRXBZKJIGHZEJOOIKWKMQKIEQVPEDTFPJQAUQKJQVLOMGJB'}
Finally, we need to find the longest subsequence between all sequences:
import itertools
max([longest_substring(i,j) for i,j in itertools.combinations(d.values(), 2)], key=len)
Output:
'VIL'

Categories

Resources