How can I merge overlapping strings in Python?

I have some strings,
['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
These strings partially overlap each other. If you manually overlapped them you would get:
SGALWDVPSPV
I want a way to go from the list of overlapping strings to the final compressed string in Python. I feel like this must be a problem someone has already solved, and I am trying to avoid reinventing the wheel. The approaches I can imagine are either brute force or more complicated than I would like, involving Biopython and sequence aligners. I have some simple short strings and just want to merge them properly in a simple way.
Does anyone have any advice on a nice way to do this in python? Thanks!

Here is a quick sorting solution:
s = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
new_s = sorted(s, key=lambda x: s[0].index(x[0]))
a = new_s[0]
b = new_s[-1]
final_s = a[:a.index(b[0])] + b
Output:
'SGALWDVPSPV'
This sorts s by the index at which each element's first character occurs in s[0], in an attempt to put first the fragment that starts the merged string and last the fragment that ends it; the final string is then the prefix of the first element up to where the last element begins, followed by the last element.
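For reference, a hedged walk-through of the intermediate values for the list above (worked by hand, not part of the original answer):
# s[0] is 'SGALWDV', so each fragment's sort key is where its first character
# occurs in 'SGALWDV': S->0, G->1, A->2, L->3, W->4.
# new_s   == ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
# a       == 'SGALWDV', b == 'WDVPSPV'
# final_s == 'SGAL' + 'WDVPSPV' == 'SGALWDVPSPV'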

My proposed solution with a more challenging test list:
#strFrag = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
strFrag = ['ALWDVPS', 'SGALWDV', 'LWDVPSP', 'WDVPSPV', 'GALWDVP', 'LWDVPSP', 'ALWDVPS']

for repeat in range(0, len(strFrag)-1):
    bestMatch = [2, '', '']  # overlap score (minimum value 3), otherStr index, assembled str portion
    for otherStr in strFrag[1:]:
        for x in range(0, len(otherStr)):
            if otherStr[x:] == strFrag[0][:len(otherStr[x:])]:
                if len(otherStr)-x > bestMatch[0]:
                    bestMatch = [len(otherStr)-x, strFrag.index(otherStr), otherStr[:x]+strFrag[0]]
            if otherStr[:-x] == strFrag[0][-len(otherStr[x:]):]:
                if x > bestMatch[0]:
                    bestMatch = [x, strFrag.index(otherStr), strFrag[0]+otherStr[-x:]]
    if bestMatch[0] > 2:
        strFrag[0] = bestMatch[2]
        strFrag = strFrag[:bestMatch[1]] + strFrag[bestMatch[1]+1:]
        print(strFrag)

print(strFrag[0])
Basically the code compares every string/fragment to the first one in the list and finds the best match (most overlap). It consolidates the list progressively, merging the best matches and removing the individual strings. The code assumes that there are no unfillable gaps between strings/fragments (otherwise the answer may not be the longest possible assembly; this can be addressed by randomizing the starting string/fragment). It also assumes that the reverse complement is not present (a poor assumption for contig assembly), which would result in nonsense/unmatchable strings/fragments. I've included a way to restrict the minimum match requirement (change the initial bestMatch[0] value) to prevent false matches. The last assumption is that all matches are exact; allowing mismatches when assembling the sequence makes the problem considerably more complex. I can provide a solution for assembling with mismatches upon request.
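A minimal sketch of the suggested randomization (my addition, not from the answer): here assemble is a placeholder for a function wrapping the consolidation loop above, passed in as a callable.
import random

def assemble_with_random_start(fragments, assemble, attempts=10):
    """Run `assemble` (the consolidation loop above, wrapped as a function)
    on several shuffled copies of the fragments and keep the longest result."""
    best = ''
    for _ in range(attempts):
        frags = fragments[:]   # work on a copy
        random.shuffle(frags)  # a different fragment seeds the merge each time
        result = assemble(frags)
        if len(result) > len(best):
            best = result
    return best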

To determine the overlap of two strings a and b, you can check if any prefix of b is a suffix of a. You can then use that check in a simple loop, aggregating the result and slicing the next string in the list according to the overlap.
lst = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']

def overlap(a, b):
    return max(i for i in range(len(b)+1) if a.endswith(b[:i]))

res = lst[0]
for s in lst[1:]:
    o = overlap(res, s)
    res += s[o:]
print(res)  # SGALWDVPSPV
Or using reduce:
from functools import reduce # Python 3
print(reduce(lambda a, b: a + b[overlap(a,b):], lst))
This is probably not super-efficient: each call to overlap tests up to k prefixes with an O(k) endswith, so the whole loop is roughly O(n k^2), with n being the number of strings in the list and k the average length per string. You can make it a bit more efficient by first testing whether the last character of the presumed overlap of b equals the last character of a, thus reducing the amount of string slicing and the number of endswith calls in the generator expression:
def overlap(a, b):
    # also allow i == 0 so max() never fails when there is no overlap at all
    return max(i for i in range(len(b) + 1)
               if i == 0 or (b[i - 1] == a[-1] and a.endswith(b[:i])))
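A quick hedged check (my addition) that this tweaked version still reproduces the expected result, reusing lst and reduce from above:
print(reduce(lambda a, b: a + b[overlap(a, b):], lst))  # SGALWDVPSPV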

Here's my solution which borders on brute force from the OP's perspective. It's not bothered by order (threw in a random shuffle to confirm that) and there can be non-matching elements in the list, as well as other independent matches. Assumes overlap means not a proper subset but independent strings with elements in common at the start and end:
from collections import defaultdict
from random import choice, shuffle

def overlap(a, b):
    """ get the maximum overlap of a & b plus where the overlap starts """
    overlaps = []
    for i in range(len(b)):
        for j in range(len(a)):
            if a.endswith(b[:i + 1], j):
                overlaps.append((i, j))
    return max(overlaps) if overlaps else (0, -1)

lst = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV', 'NONSEQUITUR']
shuffle(lst)  # to verify order doesn't matter

overlaps = defaultdict(list)

while len(lst) > 1:
    overlaps.clear()
    for a in lst:
        for b in lst:
            if a == b:
                continue
            amount, start = overlap(a, b)
            overlaps[amount].append((start, a, b))
    maximum = max(overlaps)
    if maximum == 0:
        break
    start, a, b = choice(overlaps[maximum])  # pick one among equals
    lst.remove(a)
    lst.remove(b)
    lst.append(a[:start] + b)

print(*lst)
OUTPUT
% python3 test.py
NONSEQUITUR SGALWDVPSPV
%
This computes all the overlaps, combines the pair with the largest overlap into a single element (replacing the original two), and starts the process over again until we're down to a single element or there are no overlaps left.
The overlap() function is horribly inefficient and likely can be improved but that doesn't matter if this isn't the type of matching the OP desires.

Once the peptides grow to around 20 amino acids, cdlane's code chokes and spams multiple incorrect answers of various lengths.
Try adding the AA sequence 'VPSGALWDVPS' (with or without the 'D') and the code starts to fail, because the N- and C-termini grow and no longer reflect what Adam Price is asking for. The output is 'SGALWDVPSGALWDVPSPV', and thus 100% incorrect despite the effort.
To be honest, in my opinion there is only one 100% answer, and that is to use BLAST (its protein search page, or BLAST in the BioPython package), or to adapt cdlane's code to handle AA gaps, substitutions and AA additions.
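For those who go the Biopython route, a minimal hedged sketch using Bio.Align.PairwiseAligner to score the overlap between two fragments (assumes Biopython is installed; the parameters are illustrative, not tuned for real sequence data):
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "local"  # local alignment picks out the shared region
alignment = aligner.align("SGALWDV", "GALWDVP")[0]
print(alignment)        # textual view of the aligned overlap
print(alignment.score)  # 6.0 with default scoring: the shared 'GALWDV' is 6 residues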

Dredging up an old thread, but had to solve this myself today.
For this specific case, where the fragments are already in order and each successive fragment is offset by the same amount (here, one character), the following fairly simple concatenation works, though it might not be the world's most robust solution:
lst = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
reference = "SGALWDVPSPV"
string = "".join([i[0] for i in lst] + [lst[-1][1:]])
reference == string
True
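A hedged generalization of the same idea (my addition, not part of the original answer), with a hypothetical offset parameter k for the constant shift between fragments:
def join_shifted(fragments, k=1):
    # first k characters of every fragment except the last, then the whole last fragment
    return "".join(f[:k] for f in fragments[:-1]) + fragments[-1]

print(join_shifted(lst) == reference)  # True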

Related

Longest repeated substring in massive string

Given a long string, find the longest repeated sub-string.
The brute-force approach of course is to find all substrings and check the substrings of the remaining string, but the string(s) in question have millions of characters (like a DNA sequence, AGGCTAGCT etc) and I'd like something that finishes before the universe collapses in on itself.
Tried a number of approaches, and I have one solution that works quite fast on strings of up to several million, but takes literally forever (6+ hours) for larger strings, particularly when the length of the repeated sequence gets really long.
from collections import defaultdict

def find_lrs(text, cntr=2):
    sol = (0, 0, 0)
    del_list = ['01', '01', '01']
    while len(del_list) != 0:
        d = defaultdict(list)
        for i in range(len(text)):
            d[text[i:i + cntr]].append(i)
        del_list = [(item, d[item]) for item in d if len(d[item]) > 1]
        # if list is empty, we're done
        if len(del_list) == 0:
            return sol
        else:
            sol = (del_list[0][1][0], del_list[0][1][1], len(del_list[0][0]))
            cntr += 1
    return sol
I know it's ugly, but hey, I'm a beginner, and I'm just happy I got something to work. The idea is to go through the string, starting out with length-2 substrings as the keys and the indices where each substring starts as the values. If the text was, say, 'BANANA', after the first pass the dict would look like this:
{'BA': [0], 'AN': [1, 3], 'NA': [2, 4], 'A': [5]}
BA shows up only once - starting at index 0. AN and NA show up twice, showing up at index 1/3 and 2/4, respectively.
I then create a list that only includes keys that showed up at least twice. In the example above, we can remove BA, since it only showed up once: if there's no repeated substring of length 2 starting with 'BA', there won't be a repeated substring of length 3 starting with 'BA'.
So after the first pass the pruned list is:
[('AN', [1, 3]), ('NA', [2, 4])]
Since there are at least two possibilities, we save the longest substring and indices found so far and increment the substring length to 3. We continue until no substring is repeated.
As noted, this works on strings up to 10 million in about 2 minutes, which apparently is reasonable - BUT, that's with the longest repeated sequence being fairly short. On a shorter string but longer repeated sequence, it takes -hours- to run. I suspect that it has something to do with how big the dictionary gets, but not quite sure why.
What I'd like to do of course is keep the dictionary short by removing the substrings that clearly aren't repeated, but I can't delete items from the dict while iterating over it. I know there are suffix tree approaches and such that - for now - are outside my ken.
Could simply be that this is beyond my current knowledge, which of course is fine, but I can't help shaking the idea that there is a solution here.
I forgot to update this. After going over my code again, away from my PC - literally writing out little diagrams on my iPad - I realized that the code above wasn't doing what I thought it was doing.
As noted above, my plan of attack was to start out by going through the string starting out with length-2 substrings as the keys, and the index the substring is at the value, creating a list that captures only length-2 substrings that occured at least twice, and only look at those locations.
All well and good - but look closely and you'll realize that I'm never actually updating the default dictionary to only have locations with two or more repeats! //bangs head against table.
I ultimately came up with two solutions. The first solution used a slightly different approach, the 'sorted suffixes' approach. This gets all the suffixes of the word, then sorts them in alphabetical order. For example, the suffixes of "BANANA", sorted, would be:
A
ANA
ANANA
BANANA
NA
NANA
We then look at each adjacent suffix and find how many letters each pair start out having in common. A and ANA have only 'A' in common. ANA and ANANA have "ANA" in common, so we have length 3 as the longest repeated substring. ANANA and BANANA have nothing in common at the start, ditto BANANA and NA. NA and NANA have "NA" in common. So 'ANA', length 3, is the longest repeated substring.
I made a little helper function to do the actual comparing. The code looks like this:
def longest_prefix(suf1, suf2):
    min_len = min(len(suf1), len(suf2))
    for i in range(min_len):
        if suf1[i] != suf2[i]:
            return (suf1[:i], i)
    return (suf1[:min_len], min_len)

def longest_repeat(txt):
    lst = sorted(txt[i:] for i in range(len(txt)))
    mxLen = 0
    mx_string = ""
    for x in range(len(lst) - 1):
        temp = longest_prefix(lst[x], lst[x + 1])
        if temp[1] > mxLen:
            mxLen = temp[1]
            mx_string = temp[0]
    first = txt.find(mx_string)
    last = txt.rfind(mx_string)
    return (first, last, mxLen)
This works. I then went back and relooked at my original code and saw that I wasn't resetting the dictionary. The key is that after each pass through I update the dictionary to -only- look at repeat candidates.
from collections import defaultdict

def longest_repeat(text):
    # create the initial dictionary with all length-2 substrings
    cntr = 2  # size of initial substring length we look for
    d = defaultdict(list)
    for i in range(len(text)):
        d[text[i:i + cntr]].append(i)
    # keep only the index lists of substrings that occurred more than once
    del_list = [d[item] for item in d if len(d[item]) > 1]
    sol = (0, 0, 0)
    # Keep looking as long as del_list isn't empty
    while len(del_list) > 0:
        d = defaultdict(list)  # reset dictionary
        cntr += 1              # increment search length
        for item in del_list:
            for i in item:
                d[text[i:i + cntr]].append(i)
        # filter as above
        del_list = [d[item] for item in d if len(d[item]) > 1]
        # if not empty, update solution
        if len(del_list) != 0:
            sol = (del_list[0][0], del_list[0][1], cntr)
    return sol
This was quite fast, and I think it's easier to follow.
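A quick hedged check (my addition) on the 'BANANA' example from earlier; the return format (first_index, second_index, length) follows from the code above:
print(longest_repeat("BANANA"))  # (1, 3, 3) -- the repeated substring 'ANA' starts at indices 1 and 3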

What is the minimum number of characters to delete from a given string S to make it a sorted string? [duplicate]

This question already has answers here:
How to determine the longest increasing subsequence using dynamic programming?
(20 answers)
Closed 6 years ago.
I need to find the minimum number of deletions required to make string sorted.
Sample Test case:
# Given Input:
teststr = "abcb"
# Expected output:
1
# Explanation
# In this test case, if I delete the last 'b' from "abcb",
# then the remaining string "abc" is sorted.
# That is, a single deletion is required.
# Given Input:
teststr = "vwzyx"
# Expected output:
2
# Explanation
# Here, if I delete 'z' and 'x' from "vwzyx",
# then the remaining string "vwy" is a sorted string.
I tried the following but it gives a time limit exceeded error.
Any other approach to this problem?
string = input()
prev_ord = ord(string[0])
deletion = 0
for char in string[1:]:
    if ord(char) > prev_ord + 1 or ord(char) < prev_ord:
        deletion += 1
        continue
    prev_ord = ord(char)
print(deletion)
Your current algorithm will give incorrect results for many strings.
I suspect that there's a more efficient way to solve this problem, but here's a brute-force solution. It generates subsets of the input string, ordered by length, descending. The elements in the subsets retain the order from the original string. As soon as count_deletions finds an ordered subset it returns it (converted back into a string), as well as the number of deletions. Thus the solution it finds is guaranteed to be no shorter than any other sorted selection of the input string.
Please see the itertools docs for info about the various itertools functions I've used; the algorithm for generating subsets was derived from the powerset example in the Recipes section.
from itertools import chain, combinations

def count_deletions(s):
    for t in chain.from_iterable(combinations(s, r) for r in range(len(s), 0, -1)):
        t = list(t)
        if t == sorted(t):
            return ''.join(t), len(s) - len(t)

# Some test data.
data = [
    "abcdefg",
    "cba",
    "abcb",
    "vwzyx",
    "zvwzyx",
    "adabcef",
    "fantastic",
]

for s in data:
    print(s, count_deletions(s))
output
abcdefg ('abcdefg', 0)
cba ('c', 2)
abcb ('abc', 1)
vwzyx ('vwz', 2)
zvwzyx ('vwz', 3)
adabcef ('aabcef', 1)
fantastic ('fntt', 5)
That data set is not really adequate to fully test algorithms designed to solve this problem, but I guess it's an ok starting point. :)
Update
Here's a Python 3 implementation of the algorithm mentioned by Salvador Dali on the linked page. It's much faster than my previous brute-force approach, especially for longer strings.
We can find the longest sorted subsequence by sorting a copy of the string and then finding the Longest Common Subsequence (LCS) of the original string & the sorted string. Salvador's version removes duplicate elements from the sorted string because he wants the result to be strictly increasing, but we don't need that here.
This code only returns the number of deletions required, but it's easy enough to modify it to return the actual sorted string.
To make this recursive function more efficient it uses the lru_cache decorator from functools.
from functools import lru_cache

@lru_cache(maxsize=None)
def lcs_len(x, y):
    if not x or not y:
        return 0
    xhead, xtail = x[0], x[1:]
    yhead, ytail = y[0], y[1:]
    if xhead == yhead:
        return 1 + lcs_len(xtail, ytail)
    return max(lcs_len(x, ytail), lcs_len(xtail, y))

def count_deletions(s):
    lcs_len.cache_clear()
    return len(s) - lcs_len(s, ''.join(sorted(s)))

data = [
    "abcdefg",
    "cba",
    "abcb",
    "vwzyx",
    "zvwzyx",
    "adabcef",
    "fantastic",
]

for s in data:
    print(s, count_deletions(s))
output
abcdefg 0
cba 2
abcb 1
vwzyx 2
zvwzyx 3
adabcef 1
fantastic 5
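As a hedged alternative (my sketch, not from the answers above): the same count can be computed iteratively with the classic O(n^2) dynamic program for the longest non-decreasing subsequence, which avoids deep recursion on long strings. count_deletions_dp is a hypothetical name:
def count_deletions_dp(s):
    n = len(s)
    if n == 0:
        return 0
    # best[i] = length of the longest non-decreasing subsequence ending at position i
    best = [1] * n
    for i in range(n):
        for j in range(i):
            if s[j] <= s[i]:
                best[i] = max(best[i], best[j] + 1)
    return n - max(best)

for s in ["abcdefg", "cba", "abcb", "vwzyx", "zvwzyx", "adabcef", "fantastic"]:
    print(s, count_deletions_dp(s))  # matches the counts above: 0 2 1 2 3 1 5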
Hope it works for all cases :)
s = input()
s_2 = ''.join(sorted(set(s), key=s.index))
sorted_string = sorted(s_2)
str_to_list = list(s_2)
dif = 0
for i in range(len(sorted_string)):
    if sorted_string[i] != str_to_list[i]:
        dif += 1
print(dif + abs(len(s) - len(s_2)))

Sorting a string to make a new one

Here I had to repeatedly remove the most frequent letter of a string (if two letters have the same frequency, take them in alphabetical order) and append it to a new string.
Input:
abbcccdddd
Output:
dcdbcdabcd
The code I wrote is:
s = list(sorted(<the input string>))
a = []
for c in range(len(s)):
    freq = [0 for _ in range(26)]
    for x in s:
        freq[ord(x) - ord('a')] += 1
    m = max(freq)
    allindices = [p for p, q in enumerate(freq) if q == m]
    r = chr(97 + allindices[0])
    a.append(r)
    s.remove(r)
print ''.join(a)
But it exceeded the allowed runtime limit, maybe due to too many loops. (There's another for loop which separates the strings from the user input.)
I was hoping someone could suggest a more optimised version of it using less memory.
Your solution involves 26 linear scans of the string and a bunch of unnecessary conversions to count the frequencies. You can save some work by replacing all those linear scans with one linear counting step, one linear repetition-generation step, then a sort to order your letters and a final linear pass to strip the counts:
from collections import Counter  # For unsorted input
from itertools import groupby    # For already sorted input
from operator import itemgetter

def makenewstring(inp):
    # When inp not guaranteed to be sorted:
    counts = Counter(inp).iteritems()
    # Alternative if inp is guaranteed to be sorted:
    counts = ((let, len(list(g))) for let, g in groupby(inp))
    # Create appropriate number of repetitions of each letter tagged with a count
    # and sort to put each repetition of a letter in correct order.
    # Use negative n's so much more common letters appear repeatedly at start, not end.
    repeats = sorted((n, let) for let, cnt in counts for n in range(0, -cnt, -1))
    # Remove counts and join letters
    return ''.join(map(itemgetter(1), repeats))
Updated: It occurred to me that my original solution could be made much more concise, a one-liner actually (excluding required imports), that minimizes temporaries, in favor of a single sort-by-key operation that uses a trick to sort each letter by the count of that letter seen so far:
from collections import defaultdict
from itertools import count

def makenewstring(inp):
    return ''.join(sorted(inp, key=lambda c, d=defaultdict(count): (-next(d[c]), c)))
This is actually the same basic logic as the original answer, it just accomplishes it by having sorted perform the decoration and undecoration of the values implicitly instead of doing it ourselves explicitly (implicit decorate/undecorate is the whole point of sorted's key argument; it's doing the Schwartzian transform for you).
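For a concrete sense of the trick, here are the keys that lambda produces for 'abbcccdddd' (worked by hand, not from the original answer; the i-th occurrence of a letter, counting from 0, gets key (-i, letter)):
# 'a' -> (0, 'a')
# 'b' -> (0, 'b'), (-1, 'b')
# 'c' -> (0, 'c'), (-1, 'c'), (-2, 'c')
# 'd' -> (0, 'd'), (-1, 'd'), (-2, 'd'), (-3, 'd')
# sorted ascending by key:
# (-3,'d'), (-2,'c'), (-2,'d'), (-1,'b'), (-1,'c'), (-1,'d'), (0,'a'), (0,'b'), (0,'c'), (0,'d')
# which joins to 'dcdbcdabcd'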
Performance-wise, both approaches are similar; they both (in practice) scale linearly for smaller inputs (the one-liner up to inputs around 150 characters long, the longer code, using Counter, up to inputs in the len 2000 range), and while the growth is super-linear above that point, it's always below the theoretical O(n log_2 n) (likely due to the data being not entirely random thanks to the counts and limited alphabet, ensuring Python's TimSort has some existing ordering to take advantage of). The one-liner is somewhat faster for smaller strings (len 100 or less), the longer code is somewhat faster for larger strings (I'm guessing it has something to do with the longer code creating some ordering by grouping runs of counts for each letter). Really though, it hardly matters unless the input strings are expected to be huge.
Since the alphabet will always be a constant 26 characters, this will run in O(N) and use only a constant amount of extra space (26 counters):
from collections import Counter
from string import ascii_lowercase

def sorted_alphabet(text):
    freq = Counter(text)
    alphabet = filter(freq.get, ascii_lowercase)  # alphabet filtered to letters with freq >= 1
    top_freq = max(freq.values()) if text else 0  # handle empty text eg. ''
    for top_freq in range(top_freq, 0, -1):       # from top_freq down to 1
        for letter in alphabet:
            if freq[letter] >= top_freq:
                yield letter

print ''.join(sorted_alphabet('abbcccdddd'))
print ''.join(sorted_alphabet('dbdd'))
print ''.join(sorted_alphabet(''))
print ''.join(sorted_alphabet('xxxxaaax'))
dcdbcdabcd
ddbd
xxaxaxax
What about this?
I am making use of in-built python functions to eliminate loops and improve efficiency.
test_str = 'abbcccdddd'
remaining_letters = [1]  # dummy initialisation
# sort alphabetically
unique_letters = sorted(set(test_str))
frequencies = [test_str.count(letter) for letter in unique_letters]
out = []
while remaining_letters:
    # in case of ties, index takes the first occurrence, so the alphabetical order is preserved
    max_idx = frequencies.index(max(frequencies))
    out.append(unique_letters[max_idx])
    # directly update frequencies instead of calculating them again
    frequencies[max_idx] -= 1
    remaining_letters = [idx for idx, freq in enumerate(frequencies) if freq > 0]
print ''.join(out)  # dcdbcdabcd

longest common sub-string, Python complexity analysis

I built a function which finds the longest common substring of two text files by searching over increasing window sizes, based on the Rabin–Karp algorithm.
the main function is "find_longest" and the inner functions are: "make_hashtable","extend_fingerprints" and "has_match".
I'm having trouble analyzing the average case complexity of has_match.
Denote n1, n2 as the lengths of text1 and text2, and l as the size of the current "window".
fingers1 and fingers2 are the lists of fingerprints (hashes) of the substrings.
def has_match(text1, text2, fingers1, fingers2, l, r):
    h = make_hashtable(fingers2, r)
    for i in range(len(fingers1)):
        for j in h[fingers1[i]]:
            if text1[i:i+l] == text2[j:j+l]:
                return text1[i:i+l]
    return None
this is "make_hashtable", here I'm pretty sure that the complexcity is O(n2-l+1):
def make_hashtable(fingers, table_size):
    hash_table = [[] for i in range(table_size)]
    count = 0
    for f in fingers:
        hash_table[f].append(count)
        count += 1
    return hash_table
this is "find_longest", im adding this function despite the fact that i dont need it for the complexity analyzing.
def find_longest(text1, text2, basis=2**8, r=2**17-1):
    match = ''
    l = 0  # initial "window" size
    # fingerprints of "windows" of size 0 - all are 0
    fingers1 = [0] * (len(text1) + 1)
    fingers2 = [0] * (len(text2) + 1)
    while match is not None:  # there was a common substring of len l
        l += 1
        extend_fingerprints(text1, fingers1, l, basis, r)
        extend_fingerprints(text2, fingers2, l, basis, r)
        match = has_match(text1, text2, fingers1, fingers2, l, r)
        print(match)
    return l - 1
And this is "extend_fingerprints":
def extend_fingerprints(text, fingers, length, basis=2**8, r=2**17-1):
    count = 0
    for f in fingers:
        if count == len(fingers) - 1:
            fingers.pop(len(fingers) - 1)
            break
        fingers[count] = (f * basis + ord(text[length - 1 + count])) % r
        count += 1
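For context, a hedged usage sketch with made-up inputs (not from the question): the function prints the match found at each window size, then None, and returns the length of the longest common substring.
print(find_longest("abcdefgh", "xxxcdefyy"))  # prints 'c', 'cd', 'cde', 'cdef', None; then prints 4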
I'm having doubts between these two options:
1. O(n2-l+1) + O(n1-l+1)*O(l): treat r as a constant while n1, n2 are very large, so there will be many collisions in the hash table (say O(1) items in every cell, but always some "false positives").
2. O(n2-l+1) + O(n1-l+1) + O(l): treat r as optimal for a decent hash function, so there are almost no collisions, which means that if two substrings land in the same cell of the hash table we may assume they are actually the same text?
Personally I lean towards the bold statement.
Thanks.
I think the answer is O((n2-l) + l*(n1-l)).
(n2-l) represents the complexity of make_hashtable for the second text.
l*(n1-l) represents the two nested loops that go through every item in the fingerprints of the first text and perform one comparison operation on a slice of length l, assuming only some constant number m of items share the same cell of the hash table.

Optimization of python code for string duplicates search

We have a long list of strings (approx. 18k entries). The goal is to find all similar strings and to group them by maximum similarity. (a is the list of strings.)
I have wrote the following code:
import difflib

def diff(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

dupl = {}
while len(a) > 0:
    k = a.pop()
    if k not in dupl.keys():
        dupl[k] = []
    for i, j in enumerate(a):
        dif = diff(k, j)
        if dif > 0.5:
            dupl[k].append("{0}: {1}".format(dif, j))
This code takes an element from the list and searches for duplicates in the rest of the list. If the similarity is more than 0.5, the similar string is added to the dict.
Everything works, but it is very, very slow because of the length of the list a. Is there a way to optimize this code somehow? Any ideas?
A couple of small optimisations:
You could remove duplicates from the list before starting the search (e.g. a=list(set(a))). At the moment, if a contains 18k copies of the string 'hello' it will call diff 18k*18k times.
Currently you will be comparing string number i with string number j, and also string number j with string number i. I think these will return the same result, so you could compute only one of them and perhaps go twice as fast.
Of course, the basic problem is that diff is being called n*n times for a list of length n and an ideal solution would be to reduce the number of times diff is being called. The approach to use will depend on the content of your strings.
Here are a few examples of possible approaches that would be relevant to different cases:
Suppose the strings are of very different lengths. Since ratio() is at most 2*min_len/(min_len+max_len), diff can only return >0.5 when the longer string is less than 3 times the length of the shorter one. In this case you could sort the input strings by length in O(nlogn) time, and then only compare strings with similar lengths (a short sketch of this pruning appears after this list of approaches).
Suppose the strings are sequences of words and are expected to be either very different or very similar. You could construct an inverted index for the words and then only compare with strings which contain the same unusual words.
Suppose you expect the strings to fall into a small number of groups. You could try running a K-means algorithm to group them into clusters. This would take K*n*I where I is the number of iterations of the K-means algorithm you choose to use.
If n grows to be very large (many million), then these will not be appropriate and you will probably need to use more approximate techniques. One example that is used for clustering web pages is called MinHash
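Here is a minimal sketch of the length-based pruning mentioned above (my addition; similar_pairs is a hypothetical helper name, and it assumes the 0.5 threshold from the question):
import difflib

def similar_pairs(strings):
    by_len = sorted(strings, key=len)
    pairs = []
    for i, s in enumerate(by_len):
        for t in by_len[i + 1:]:
            # ratio() <= 2*len(s)/(len(s)+len(t)), so once len(t) >= 3*len(s)
            # the ratio can no longer exceed 0.5 and we can stop early
            if len(t) >= 3 * len(s):
                break
            r = difflib.SequenceMatcher(None, s, t).ratio()
            if r > 0.5:
                pairs.append((r, s, t))
    return pairs

print(similar_pairs(['first string', 'second string', 'third string', 'x']))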
When needing to iterate over many items, itertools, to the rescue!
This snippet will generate all ordered pairs of your strings (permutations) and return their ratios in the fashion your original code did. I feel like the not in check is a needlessly expensive way to do this and not as pythonic. Permutations were chosen because they give you access to both a->b and b->a for any two given strings.
import difflib
import itertools

def diff(a, b):
    return difflib.SequenceMatcher(None, a, b).quick_ratio()

def calculate_ratios(strings):
    dupl = dict()
    for s, t in itertools.permutations(strings, 2):
        try:
            dupl[s].append({t: diff(s, t)})
        except KeyError:
            dupl[s] = []
            dupl[s].append({t: diff(s, t)})
    return dupl

a = ['first string', 'second string', 'third string', 'fourth string']
print calculate_ratios(a)
Depending on your constraints (permutations are redundant computationally and space-wise), you can replace permutations with combinations, but then your accessing method will need to be adjusted, since the pair (a, b) will only be listed under dupl[a] and not under dupl[b].
In the code I use quick_ratio(), but it is just as simply changed to ratio() or real_quick_ratio(), depending on how much precision you need.
And if you only want to keep pairs above a similarity threshold, a simple if will solve that problem:
import difflib
import itertools

def diff(a, b):
    return difflib.SequenceMatcher(None, a, b).quick_ratio()

def diff2(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

def calculate_ratios(strings, threshold):
    dupl = dict()
    for s, t in itertools.permutations(strings, 2):
        if diff(s, t) > threshold:  # arbitrary threshold
            try:
                dupl[s].append({t: diff2(s, t)})
            except KeyError:
                dupl[s] = []
                dupl[s].append({t: diff2(s, t)})
    return dupl

a = ['first string', 'second string', 'third string', 'fourth string']
print calculate_ratios(a, 0.5)
