Optimization of python code for string duplicates search

We have a long list of strings (approx. 18k entries). The goal is to find all similar strings and to group them by maximum similarity. ("a" is the list of strings.)
I have written the following code:
import difflib

def diff(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

dupl = {}
while len(a) > 0:
    k = a.pop()
    if k not in dupl.keys():
        dupl[k] = []
    for i,j in enumerate(a):
        dif = diff(k, j)
        if dif > 0.5:
            dupl[k].append("{0}: {1}".format(dif, j))
This code takes an element from the list and searches for duplicates in the rest of the list. If the similarity is more than 0.5, the similar string is added to the dict.
Everything works well, but it is very, very slow because of the length of the list "a". Is there a way to optimize this code? Any ideas?

A couple of small optimisations:
You could remove duplicates from the list before starting the search (e.g. a=list(set(a))). At the moment, if a contains 18k copies of the string 'hello' it will call diff 18k*18k times.
Currently you will be comparing string number i with string number j, and also string number j with string number i. I think these will return the same result, so you could compute only one of them and perhaps go twice as fast.
Of course, the basic problem is that diff is being called n*n times for a list of length n and an ideal solution would be to reduce the number of times diff is being called. The approach to use will depend on the content of your strings.
Here are a few examples of possible approaches that would be relevant to different cases:
Suppose the strings are of very different lengths. Since ratio() is 2*M/T (M matched characters, T the combined length of both strings), diff can only return >0.5 if the longer string is less than 3 times the length of the shorter one. In this case you could sort the input strings by length in O(n log n) time, and then only compare strings with similar lengths.
Suppose the strings are sequences of words and are expected to be either very different or very similar. You could construct an inverted index for the words and then only compare with strings which contain the same unusual words.
Suppose you expect the strings to fall into a small number of groups. You could try running a K-means algorithm to group them into clusters. This would take time proportional to K*n*I, where I is the number of iterations of the K-means algorithm you choose to use.
If n grows to be very large (many millions), then these will not be appropriate and you will probably need to use more approximate techniques. One example that is used for clustering web pages is called MinHash.
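The length-based idea can be sketched as follows. This is only an illustration under stated assumptions: the function name similar_pairs and its return format are made up here, and the threshold must be greater than 0. It removes duplicates, sorts by length, stops scanning as soon as the length gap rules out a score above the threshold, and uses real_quick_ratio() and quick_ratio() (both documented upper bounds on ratio()) as cheap filters.

```python
import difflib

def similar_pairs(strings, threshold=0.5):
    """Return (score, s, t) for each pair of distinct strings scoring above threshold."""
    # ratio() is 2*M/T with M <= len(shorter string), so a score above
    # `threshold` needs len(longer) < len(shorter) * (2 - threshold) / threshold.
    factor = (2.0 - threshold) / threshold
    # Remove duplicates and sort by length (ties broken alphabetically).
    items = sorted(set(strings), key=lambda s: (len(s), s))
    pairs = []
    matcher = difflib.SequenceMatcher(None)
    for i, s in enumerate(items):
        matcher.set_seq2(s)  # SequenceMatcher caches details of its second sequence
        for j in range(i + 1, len(items)):
            t = items[j]
            if len(t) >= len(s) * factor:
                break  # every later string is at least as long, so stop here
            matcher.set_seq1(t)
            # Cheap upper bounds first, the expensive ratio() only if both pass.
            if matcher.real_quick_ratio() > threshold and matcher.quick_ratio() > threshold:
                score = matcher.ratio()
                if score > threshold:
                    pairs.append((score, s, t))
    return pairs

print(similar_pairs(['apple', 'apples', 'x', 'apple']))
```

How much this helps depends entirely on how spread out the string lengths are; if all strings have similar lengths, the number of comparisons stays close to n*n/2.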

When you need to iterate over many items, itertools comes to the rescue!
This snippet generates every ordered pair of your strings (permutations of length 2) and returns the results in the same fashion as your original code. I feel the not in check is needlessly expensive and not very Pythonic. Permutations were chosen because they give you direct access to both a->b and b->a for any two strings.
import difflib
import itertools

def diff(a, b):
    return difflib.SequenceMatcher(None, a, b).quick_ratio()

def calculate_ratios(strings):
    dupl = dict()
    for s, t in itertools.permutations(strings, 2):
        try:
            dupl[s].append({t: diff(s,t)})
        except KeyError:
            dupl[s] = []
            dupl[s].append({t: diff(s,t)})
    return dupl

a = ['first string', 'second string', 'third string', 'fourth string']
print(calculate_ratios(a))
Depending on your constraints (permutations are redundant both computationally and space-wise), you can replace permutations with combinations, but then your access method will need to be adjusted, since the pair a, b will be listed only under dupl[a] and not under dupl[b].
In the code I use quick_ratio(), but it is easily changed to ratio() or real_quick_ratio(), depending on how much precision you need.
If quick_ratio() alone is not precise enough, a simple if solves that problem: use it as a cheap filter and call ratio() only on the pairs that pass:
import difflib
import itertools

def diff(a, b):
    return difflib.SequenceMatcher(None, a, b).quick_ratio()

def diff2(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

def calculate_ratios(strings, threshold):
    dupl = dict()
    for s, t in itertools.permutations(strings, 2):
        if diff(s,t) > threshold: # arbitrary threshold
            try:
                dupl[s].append({t: diff2(s,t)})
            except KeyError:
                dupl[s] = []
                dupl[s].append({t: diff2(s,t)})
    return dupl

a = ['first string', 'second string', 'third string', 'fourth string']
print(calculate_ratios(a, 0.5))
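The combinations variant mentioned above might look like the sketch below. The name calculate_ratios_once and the (string, score) tuple format are assumptions made for illustration. Each unordered pair is scored once and the result is stored under both keys, so lookups work in either direction. Note that the difflib documentation warns that ratio() may depend on the order of its arguments, so reusing one score for both directions is an approximation.

```python
import difflib
import itertools
from collections import defaultdict

def calculate_ratios_once(strings, threshold=0.5):
    """Score every unordered pair once and record it under both strings."""
    dupl = defaultdict(list)
    # sorted(set(...)) removes exact duplicates and makes the pair order deterministic
    for s, t in itertools.combinations(sorted(set(strings)), 2):
        matcher = difflib.SequenceMatcher(None, s, t)
        # real_quick_ratio() and quick_ratio() are upper bounds on ratio(),
        # so a pair that fails either one cannot pass the real test.
        if matcher.real_quick_ratio() <= threshold:
            continue
        if matcher.quick_ratio() <= threshold:
            continue
        score = matcher.ratio()
        if score > threshold:
            dupl[s].append((t, score))
            dupl[t].append((s, score))
    return dict(dupl)

a = ['first string', 'second string', 'third string', 'fourth string']
print(calculate_ratios_once(a, 0.5))
```

This roughly halves the number of SequenceMatcher objects compared with permutations, at the cost of storing each score twice.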

Related

How can I merge overlapping strings in python?

I have some strings,
['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
These strings partially overlap each other. If you manually overlapped them you would get:
SGALWDVPSPV
I want a way to go from the list of overlapping strings to the final compressed string in python. I feel like this must be a problem that someone has solved already and am trying to avoid reinventing the wheel. The methods I can imagine now are either brute force or involve getting more complicated by using biopython and sequence aligners than I would like. I have some simple short strings and just want to properly merge them in a simple way.
Does anyone have any advice on a nice way to do this in python? Thanks!

Here is a quick sorting solution:
s = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
new_s = sorted(s, key=lambda x:s[0].index(x[0]))
a = new_s[0]
b = new_s[-1]
final_s = a[:a.index(b[0])]+b
Output:
'SGALWDVPSPV'
This program sorts s by the value of the index of the first character of each element, in an attempt to find the string that will maximize the overlap distance between the first element and the desired output.

My proposed solution with a more challenging test list:
#strFrag = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
strFrag = ['ALWDVPS', 'SGALWDV', 'LWDVPSP', 'WDVPSPV', 'GALWDVP', 'LWDVPSP', 'ALWDVPS']
for repeat in range(0, len(strFrag)-1):
    bestMatch = [2, '', ''] #overlap score (minimum value 3), otherStr index, assembled str portion
    for otherStr in strFrag[1:]:
        for x in range(0,len(otherStr)):
            if otherStr[x:] == strFrag[0][:len(otherStr[x:])]:
                if len(otherStr)-x > bestMatch[0]:
                    bestMatch = [len(otherStr)-x, strFrag.index(otherStr), otherStr[:x]+strFrag[0]]
            if otherStr[:-x] == strFrag[0][-len(otherStr[x:]):]:
                if x > bestMatch[0]:
                    bestMatch = [x, strFrag.index(otherStr), strFrag[0]+otherStr[-x:]]
    if bestMatch[0] > 2:
        strFrag[0] = bestMatch[2]
        strFrag = strFrag[:bestMatch[1]]+strFrag[bestMatch[1]+1:]
print(strFrag)
print(strFrag[0])
Basically the code compares every string/fragment to the first in the list and finds the best match (most overlap). It consolidates the list progressively, merging the best matches and removing the individual strings. The code assumes that there are no unfillable gaps between strings/fragments (otherwise the answer may not be the longest possible assembly; this can be addressed by randomizing the starting string/fragment). It also assumes that the reverse complement is not present (a poor assumption for contig assembly), which would result in nonsense/unmatchable strings/fragments. I've included a way to restrict the minimum match requirements (by changing the bestMatch[0] value) to prevent false matches. The last assumption is that all matches are exact. Permitting mismatches when assembling the sequence makes the problem considerably more complex. I can provide a solution for assembling with mismatches upon request.

To determine the overlap of two strings a and b, you can check if any prefix of b is a suffix of a. You can then use that check in a simple loop, aggregating the result and slicing the next string in the list according to the overlap.
lst = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
def overlap(a, b):
    return max(i for i in range(len(b)+1) if a.endswith(b[:i]))

res = lst[0]
for s in lst[1:]:
    o = overlap(res, s)
    res += s[o:]
print(res) # SGALWDVPSPV
Or using reduce:
from functools import reduce # Python 3
print(reduce(lambda a, b: a + b[overlap(a,b):], lst))
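This is probably not super-efficient: each overlap call slices b up to k times, so it costs roughly O(k²) in the worst case and the whole merge about O(n k²), with n being the number of strings in the list and k the average length per string. You can make it a bit more efficient by first testing whether the last character of the presumed overlap of b is the last character of a, thus reducing the amount of string slicing and function calls in the generator expression:
def overlap(a, b):
    # i runs from 1 to len(b), so b[i-1] is the last character of the candidate overlap;
    # default=0 (Python 3.4+) covers the case where nothing overlaps
    return max((i for i in range(1, len(b)+1) if b[i-1] == a[-1] and a.endswith(b[:i])), default=0)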

Here's my solution which borders on brute force from the OP's perspective. It's not bothered by order (threw in a random shuffle to confirm that) and there can be non-matching elements in the list, as well as other independent matches. Assumes overlap means not a proper subset but independent strings with elements in common at the start and end:
from collections import defaultdict
from random import choice, shuffle
def overlap(a, b):
    """ get the maximum overlap of a & b plus where the overlap starts """
    overlaps = []
    for i in range(len(b)):
        for j in range(len(a)):
            if a.endswith(b[:i + 1], j):
                overlaps.append((i, j))
    return max(overlaps) if overlaps else (0, -1)

lst = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV', 'NONSEQUITUR']
shuffle(lst) # to verify order doesn't matter
overlaps = defaultdict(list)
while len(lst) > 1:
    overlaps.clear()
    for a in lst:
        for b in lst:
            if a == b:
                continue
            amount, start = overlap(a, b)
            overlaps[amount].append((start, a, b))
    maximum = max(overlaps)
    if maximum == 0:
        break
    start, a, b = choice(overlaps[maximum]) # pick one among equals
    lst.remove(a)
    lst.remove(b)
    lst.append(a[:start] + b)
print(*lst)
OUTPUT
% python3 test.py
NONSEQUITUR SGALWDVPSPV
%
Computes all the overlaps and combines the largest overlap into a single element, replacing the original two, and starts process over again until we're down to a single element or no overlaps.
The overlap() function is horribly inefficient and likely can be improved but that doesn't matter if this isn't the type of matching the OP desires.

Once the peptides grow to around 20 amino acids, cdlane's code chokes and produces (multiple) incorrect answers of various lengths.
Try adding the amino acid sequence 'VPSGALWDVPS', with or without the 'D', and the code starts to fail at its task, because the N- and C-termini grow and no longer reflect what Adam Price is asking for. The output is 'SGALWDVPSGALWDVPSPV' and thus incorrect despite the effort.
To be honest, in my opinion the only fully reliable answer is to use BLAST and its protein search page, or BLAST in the BioPython package. Alternatively, adapt cdlane's code to account for amino acid gaps, substitutions and additions.

Dredging up an old thread, but had to solve this myself today.
For this specific case, where the fragments are already in order and each one is offset from the previous one by the same amount (in this case 1), the following fairly simple concatenation works, though it might not be the world's most robust solution:
lst = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
reference = "SGALWDVPSPV"
string = "".join([i[0] for i in lst] + [lst[-1][1:]])
reference == string
True

How many minimum numbers of characters from a given string S, should delete to make it a sorted string [duplicate]

This question already has answers here:
How to determine the longest increasing subsequence using dynamic programming?
(20 answers)
Closed 6 years ago.
I need to find the minimum number of deletions required to make string sorted.
Sample Test case:
# Given Input:
teststr = "abcb"
# Expected output:
1
# Explanation
# In this test case, if I delete last 'b' from "abcb",
# then the remaining string "abc" is sorted.
# That is, a single deletion is required.
# Given Input:
teststr = "vwzyx"
# Expected output:
2
# Explanation
# Here, if I delete 'z' and 'x' from "vwzyx",
# then the remaining string "vwy" is a sorted string.
I tried the following but it gives time limit exceeded error.
Any other approach to this problem?
string = input()
prev_ord = ord(string[0])
deletion = 0
for char in string[1:]:
    if ord(char) > prev_ord +1 or ord(char) < prev_ord:
        deletion += 1
        continue
    prev_ord = ord(char)
print(deletion)

Your current algorithm will give incorrect results for many strings.
I suspect that there's a more efficient way to solve this problem, but here's a brute-force solution. It generates subsets of the input string, ordered by length, descending. The elements in the subsets retain the order from the original string. As soon as count_deletions finds an ordered subset it returns it (converted back into a string), as well as the number of deletions. Thus the solution it finds is guaranteed to be no shorter than any other sorted selection of the input string.
Please see the itertools docs for info about the various itertools functions I've used; the algorithm for generating subsets was derived from the powerset example in the Recipes section.
from itertools import chain, combinations

def count_deletions(s):
    for t in chain.from_iterable(combinations(s, r) for r in range(len(s), 0, -1)):
        t = list(t)
        if t == sorted(t):
            return ''.join(t), len(s) - len(t)

# Some test data.
data = [
    "abcdefg",
    "cba",
    "abcb",
    "vwzyx",
    "zvwzyx",
    "adabcef",
    "fantastic",
]
for s in data:
    print(s, count_deletions(s))
output
abcdefg ('abcdefg', 0)
cba ('c', 2)
abcb ('abc', 1)
vwzyx ('vwz', 2)
zvwzyx ('vwz', 3)
adabcef ('aabcef', 1)
fantastic ('fntt', 5)
That data set is not really adequate to fully test algorithms designed to solve this problem, but I guess it's an ok starting point. :)
Update
Here's a Python 3 implementation of the algorithm mentioned by Salvador Dali on the linked page. It's much faster than my previous brute-force approach, especially for longer strings.
We can find the longest sorted subsequence by sorting a copy of the string and then finding the Longest Common Subsequence (LCS) of the original string & the sorted string. Salvador's version removes duplicate elements from the sorted string because he wants the result to be strictly increasing, but we don't need that here.
This code only returns the number of deletions required, but it's easy enough to modify it to return the actual sorted string.
To make this recursive function more efficient it uses the lru_cache decorator from functools.
from functools import lru_cache

@lru_cache(maxsize=None)
def lcs_len(x, y):
    if not x or not y:
        return 0
    xhead, xtail = x[0], x[1:]
    yhead, ytail = y[0], y[1:]
    if xhead == yhead:
        return 1 + lcs_len(xtail, ytail)
    return max(lcs_len(x, ytail), lcs_len(xtail, y))

def count_deletions(s):
    lcs_len.cache_clear()
    return len(s) - lcs_len(s, ''.join(sorted(s)))

data = [
    "abcdefg",
    "cba",
    "abcb",
    "vwzyx",
    "zvwzyx",
    "adabcef",
    "fantastic",
]
for s in data:
    print(s, count_deletions(s))
output
abcdefg 0
cba 2
abcb 1
vwzyx 2
zvwzyx 3
adabcef 1
fantastic 5
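The recursive lcs_len recurses once per character it consumes, so long inputs may run into Python's recursion limit. As a sketch of a different technique (not the LCS method described above), the length of the longest non-decreasing subsequence can be found iteratively with the bisect module in O(n log n). The function name count_deletions_bisect is made up for this example.

```python
from bisect import bisect_right

def count_deletions_bisect(s):
    """Minimum deletions so that the remaining characters are in sorted order."""
    # tails[k] holds the smallest possible last character of a
    # non-decreasing subsequence of length k + 1 seen so far.
    tails = []
    for ch in s:
        # bisect_right lets equal characters extend a run ("aab" counts as sorted)
        pos = bisect_right(tails, ch)
        if pos == len(tails):
            tails.append(ch)   # ch extends the longest subsequence found so far
        else:
            tails[pos] = ch    # ch gives a smaller tail for length pos + 1
    return len(s) - len(tails)

for s in ["abcdefg", "cba", "abcb", "vwzyx", "zvwzyx", "adabcef", "fantastic"]:
    print(s, count_deletions_bisect(s))
```

On the test data used above it gives the same counts as the two earlier versions (0, 2, 1, 2, 3, 1, 5).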

Hope it works for all cases :)
s = input()
s_2 = ''.join(sorted(set(s), key=s.index))
sorted_string = sorted(s_2)
str_to_list = list(s_2)
dif = 0
for i in range(len(sorted_string)):
    if sorted_string[i]!=str_to_list[i]:
        dif+=1
print(dif+abs(len(s)-len(s_2)))

Sorting a string to make a new one

Here I have to repeatedly remove the most frequent letter of a string (if two letters have the same frequency, the alphabetically first one) and append it to a new string.
Input:
abbcccdddd
Output:
dcdbcdabcd
The code I wrote is:
s = list(sorted(<the input string>))
a = []
for c in range(len(s)):
    freq =[0 for _ in range(26)]
    for x in s:
        freq[ord(x)-ord('a')] += 1
    m = max(freq)
    allindices = [p for p,q in enumerate(freq) if q == m]
    r = chr(97+allindices[0])
    a.append(r)
    s.remove(r)
print(''.join(a))
But it exceeded the allowed runtime limit, maybe due to too many loops. (There's another for loop which separates the strings from the user input.)
Could someone suggest a more optimised version of it that also uses less memory?

Your solution repeats a linear scan of the string for every output character, plus a bunch of unnecessary conversions to count the frequencies. You can save some work by replacing all those linear scans with a linear count step, another linear repetition generation, then a sort to order your letters and a final linear pass to strip counts:
from collections import Counter # For unsorted input
from itertools import groupby # For already sorted input
from operator import itemgetter
def makenewstring(inp):
    # When inp not guaranteed to be sorted:
    counts = Counter(inp).items()
    # Alternative if inp is guaranteed to be sorted (use instead of the line above):
    # counts = ((let, len(list(g))) for let, g in groupby(inp))
    # Create appropriate number of repetitions of each letter tagged with a count
    # and sort to put each repetition of a letter in correct order
    # Use negative n's so much more common letters appear repeatedly at start, not end
    repeats = sorted((n, let) for let, cnt in counts for n in range(0, -cnt, -1))
    # Remove counts and join letters
    return ''.join(map(itemgetter(1), repeats))
Updated: It occurred to me that my original solution could be made much more concise, a one-liner actually (excluding required imports), that minimizes temporaries, in favor of a single sort-by-key operation that uses a trick to sort each letter by the count of that letter seen so far:
from collections import defaultdict
from itertools import count
def makenewstring(inp):
    return ''.join(sorted(inp, key=lambda c, d=defaultdict(count): (-next(d[c]), c)))
This is actually the same basic logic as the original answer, it just accomplishes it by having sorted perform the decoration and undecoration of the values implicitly instead of doing it ourselves explicitly (implicit decorate/undecorate is the whole point of sorted's key argument; it's doing the Schwartzian transform for you).
Performance-wise, both approaches are similar; they both (in practice) scale linearly for smaller inputs (the one-liner up to inputs around 150 characters long, the longer code, using Counter, up to inputs in the len 2000 range), and while the growth is super-linear above that point, it's always below the theoretical O(n log_2 n) (likely due to the data being not entirely random thanks to the counts and limited alphabet, ensuring Python's TimSort has some existing ordering to take advantage of). The one-liner is somewhat faster for smaller strings (len 100 or less), the longer code is somewhat faster for larger strings (I'm guessing it has something to do with the longer code creating some ordering by grouping runs of counts for each letter). Really though, it hardly matters unless the input strings are expected to be huge.

Since the alphabet always has a constant 26 characters, this works in O(N) time and needs only a constant amount of extra space (26 counters):
from collections import Counter
from string import ascii_lowercase

def sorted_alphabet(text):
    freq = Counter(text)
    # a list rather than a lazy filter object, because it is iterated once per level
    alphabet = [c for c in ascii_lowercase if freq[c]] # letters with freq >= 1
    top_freq = max(freq.values()) if text else 0 # handle empty text eg. ''
    for level in range(top_freq, 0, -1): # from top_freq to 1
        for letter in alphabet:
            if freq[letter] >= level:
                yield letter

print(''.join(sorted_alphabet('abbcccdddd')))
print(''.join(sorted_alphabet('dbdd')))
print(''.join(sorted_alphabet('')))
print(''.join(sorted_alphabet('xxxxaaax')))
dcdbcdabcd
ddbd
xxaxaxax

What about this?
I am making use of built-in Python functions to reduce explicit loops and improve efficiency.
test_str = 'abbcccdddd'
remaining_letters = [1] # dummy initialisation
# sort alphabetically
unique_letters = sorted(set(test_str))
frequencies = [test_str.count(letter) for letter in unique_letters]
out = []
while(remaining_letters):
    # in case of ties, index takes the first occurrence, so the alphabetical order is preserved
    max_idx = frequencies.index(max(frequencies))
    out.append(unique_letters[max_idx])
    # directly update frequencies instead of calculating them again
    frequencies[max_idx] -= 1
    remaining_letters = [idx for idx, freq in enumerate(frequencies) if freq>0]
print(''.join(out)) #dcdbcdabcd

Find all minimal elements in a list or set as weighted by a function

If I want to find a minimum of a list or set x as given by some function f on that set, I can use convenient oneliners such as
min(x,key=f)
(4.91 µs)
While for a 'pure' min function, it doesn't make sense to return more than one element in most cases (since all of them are the same, and for sets there is only one), if you choose the minimum according to some function, you will often want to know all the elements for which it was minimal.
In other words, I'm looking for a short, concise and fast function that returns all minimal elements according to some weighting function, preferably works for both lists and sets, and returns the result in the data type of the input.
For lists, the fastest thing I've managed to write is
def allmin(x,f):
    vals = list(map(f, x)) # list() keeps this correct on Python 3, where map is lazy
    minval = min(vals)
    return [x[i] for i,e in enumerate(vals) if e==minval]
6.73 µs
However, this is far from optimal and doesn't work for sets. First of all, when mapping, all the function values are in memory at some point, so that's the best time to determine the minima instead of looking at it again, which is illustrated in the fact that this is already 50% slower although no additional computations (except for list construction) should have to be performed compared to the single-min example. The only comparable thing for sets I've managed to write is
def allmin(x,f):
    vals = [(f(e), e) for e in x]
    minval = min(vals)[0]
    return {e for val,e in vals if val==minval}
8.44 µs (7.29 µs with list comprehension for the list version)
Is there any way to get the performance for sets close to that of the better allmin version for lists and, best of all, somewhere near the performance of min(x,key=f)?
(To illustrate and for the timings, I assumed
f = lambda x: (x-4.5)**2
x = random.choice([[0,1,2,3,4,5,6,7,8,9,10,11,13],{0,1,2,3,4,5,6,7,8,9,10,11,13}])
)

If you don't know the number of minimal values, then a simple single-pass approach is to keep a running list of minimal values for the lowest weight seen so far:
def minimal(iterable, func):
    'Return a list of minimal values according to a weighting function'
    it = iter(iterable)
    try:
        x = next(it)
    except StopIteration:
        return []
    lowest_values = [x]
    lowest_weight = func(x)
    for x in it:
        weight = func(x)
        if weight == lowest_weight:
            lowest_values.append(x)
        elif weight < lowest_weight:
            lowest_values = [x] # restart the list with the new minimum
            lowest_weight = weight
    return lowest_values
Here it is in action:
>>> s = {'abc', 'defg', 'hij', 'kl', 'mno', 'qr', 'stuv', 'wx', 'yz'}
>>> minimal(s, len)
['qr', 'kl', 'yz', 'wx']
Alternatively, if you know in advance how many minimal values there are, the heapq.nsmallest function will solve the problem directly and efficiently. For the k smallest of n values, it makes n calls to your weighting function and uses memory proportional to k (i.e. it is very cache efficient):
>>> from heapq import nsmallest
>>> s = {'abc', 'defg', 'hij', 'kl', 'mno', 'qr', 'stuv', 'wx', 'yz'}
>>> nsmallest(4, s, key=len)
['qr', 'kl', 'yz', 'wx']
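The question also asks for the result in the data type of the input. A minimal sketch of that, built on the same single-pass idea, is shown below; the names all_minimal and all_minimal_like are hypothetical, and only set and frozenset are special-cased (any other iterable gets a list back).

```python
def all_minimal(iterable, func):
    """Single pass: collect every item whose weight equals the lowest weight seen."""
    best = []
    best_weight = None
    for item in iterable:
        weight = func(item)
        if not best or weight < best_weight:
            best, best_weight = [item], weight   # new minimum: restart the list
        elif weight == best_weight:
            best.append(item)                    # tie with the current minimum
    return best

def all_minimal_like(container, func):
    """Return the minimal elements as a set for set inputs, else as a list."""
    result = all_minimal(container, func)
    if isinstance(container, (set, frozenset)):
        return type(container)(result)
    return result

f = lambda x: (x - 4.5) ** 2
print(all_minimal_like([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13], f))
print(all_minimal_like({0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13}, f))
```

The weighting function is called exactly once per element, and no intermediate list of weights is kept.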

You are currently spending Θ(n) to apply f() over all elements, then another Θ(n) to find the minimum among them, then finally another Θ(n) to find all elements that are equal to the minimum. In short, you are spending 3 x Θ(n), where n is the size of the input list.
You can theoretically do this in 2 x Θ(n) by spending Θ(n) to find the minimum while applying f(), then spending another Θ(n) to retrieve all minimum elements. However, there seems to be a faster way, where you spend Θ(n) for applying f() and finding the minimum, but only spend O(n) when retrieving all minimum elements. (Note that in the worst case, O(n) is no different to Θ(n). For the below algorithm, this worst case occurs when all elements in the list are the same, or the list is sorted in the reverse order.)
def allmin(x,f):
    minVal = float('inf') # larger than any real weight
    mapped = []
    for a in x:
        mapVal = f(a)
        if mapVal <= minVal:
            minVal = mapVal
            mapped.append((a, mapVal))
    return [a for (a,m) in mapped if m == minVal]
My own time measurements show roughly 20% time improvement over your allmin() method for a list of integers ranging from 0 to 100.
For very large input lists, it may be worth sampling a few elements to start off with so that you can provide a better initial value for minVal (rather than the trivial initialization to a very large value).
========================================= EDIT =========================================
Here's a version that provides a further 5~10% speed-up. The speed-up comes from the observation that once a new minimum value is found, all previously stored mapped values can be discarded. Thus the final O(n) pass for retrieving the minimum elements is no longer required, and the whole algorithm takes 1 x Θ(n) to run.
def newallmin(x,f):
    minVal = f(x[-1])
    minList = []
    for a in x:
        mapVal = f(a)
        if mapVal > minVal:
            continue
        if mapVal < minVal:
            minVal = mapVal
            minList = [a]
        else: # mapVal == minVal
            minList.append(a)
    return minList
I've been performing time measurements with a list of size 10,000,000 with all elements ranging from 0 to 100.

longest common sub-string, Python complexity analysis

I built a function which finds the longest common substring of two text files, trying window sizes in ascending order, based on the Rabin–Karp algorithm.
The main function is "find_longest" and the inner functions are "make_hashtable", "extend_fingerprints" and "has_match".
I'm having trouble analyzing the average-case complexity of has_match.
Denote by n1, n2 the lengths of text1, text2 and by l the size of the current "window".
fingers are the fingerprints (hash values) of the substrings.
def has_match(text1,text2,fingers1,fingers2,l,r):
    h = make_hashtable(fingers2,r)
    for i in range(len(fingers1)):
        for j in h[fingers1[i]]:
            if text1[i:i+l] == text2[j:j+l]:
                return text1[i:i+l]
    return None
This is "make_hashtable"; here I'm pretty sure that the complexity is O(n2-l+1):
def make_hashtable(fingers, table_size):
    hash_table=[[] for i in range(table_size)]
    count=0
    for f in fingers:
        hash_table[f].append(count)
        count+=1
    return hash_table
This is "find_longest"; I'm adding this function even though I don't need it for the complexity analysis.
def find_longest(text1,text2,basis=2**8,r=2**17-1):
    match = ''
    l = 0 #initial "window" size
    #fingerprints of "windows" of size 0 - all are 0
    fingers1 = [0]*(len(text1)+1)
    fingers2 = [0]*(len(text2)+1)
    while match != None: #there was a common substring of len l
        l += 1
        extend_fingerprints(text1, fingers1, l, basis, r)
        extend_fingerprints(text2, fingers2, l, basis, r)
        match = has_match(text1,text2,fingers1,fingers2,l,r)
        print(match)
    return l-1
and this is "extend_fingerprints":
def extend_fingerprints(text, fingers, length, basis=2**8, r=2**17-1):
    count=0
    for f in fingers:
        if count==len(fingers)-1:
            fingers.pop(len(fingers)-1)
            break
        fingers[count]=(f*basis+ord(text[length-1+count]))%r
        count+=1
I'm torn between these two options:
1. O(n_2-l+1) + O(n_1-l+1)*O(l)
Treat r as a constant while n1, n2 are very large, so there are a lot of collisions in the hash table (let's say O(1) items in every cell, yet always some false positives).
2. O(n_2-l+1) + O(n_1-l+1) + O(l)
Treat r as optimal for a decent hash function, so there are almost no collisions, which means that if two windows land in the same cell of the hash table we may assume they are actually the same text?
Personally I lean towards the Bold statement.
tnx.

I think the answer is O((n_2-l) + l*(n_1-l)).
(n_2-l) represents the complexity of make_hashtable for the second text.
l*(n_1-l) represents the two nested loops, which go through every item in the fingerprints of the first text and perform one comparison (of a slice of length l) for each of some constant number 'm' of items found in the same cell of the hash table.
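To get a feel for how the table size r drives the number of O(l) slice comparisons, here is a small counting sketch. It is not the asker's code: window_hashes and count_slice_comparisons are hypothetical helpers, the hashes are computed directly rather than rolled, and a dict replaces the list-of-lists table. It only counts how often the inner text1[i:i+l] == text2[j:j+l] test would run.

```python
def window_hashes(text, l, basis=2**8, r=2**17 - 1):
    """Hash of every length-l window of text, computed directly (not rolling)."""
    hashes = []
    for i in range(len(text) - l + 1):
        h = 0
        for ch in text[i:i + l]:
            h = (h * basis + ord(ch)) % r
        hashes.append(h)
    return hashes

def count_slice_comparisons(text1, text2, l, r):
    """Count the slice comparisons a has_match-style scan performs."""
    table = {}
    for j, h in enumerate(window_hashes(text2, l, r=r)):
        table.setdefault(h, []).append(j)
    comparisons = 0
    for i, h in enumerate(window_hashes(text1, l, r=r)):
        for j in table.get(h, []):
            comparisons += 1                      # one O(l) slice comparison
            if text1[i:i + l] == text2[j:j + l]:
                return comparisons                # stop at the first real match
    return comparisons

# r = 1 forces every window into the same cell: all pairs are compared.
print(count_slice_comparisons("abcd", "wxyz", 2, r=1))
# A large r separates these windows completely: nothing is compared.
print(count_slice_comparisons("abcd", "wxyz", 2, r=2**17 - 1))
```

With every window in one cell the count is (n_1-l+1)*(n_2-l+1); with no collisions and no match it is zero, so the comparison term depends on how many entries share a cell, which is the 'm' in the estimate above.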
