Algorithm verification: Get all the combinations of possible words - python

I would like to know whether the algorithm I wrote below in Python is correct.
My goal is an algorithm that prints/finds all the possible words that can be built from the characters '!' (decimal value = 33) through '~' (decimal value = 126) in the ASCII table.
Here is the code, using recursion:
import sys
byteWord = bytearray(b'\x20')  # 0x20 is one below '!' (Hex = '\x21' & Dec = '33' & Char = '!')
cntVerif = 0  # Test
def comb_fct(bytes_arr, cnt: int):
    global cntVerif  # Test
    if len(bytes_arr) > 3:  # Test
        print(f'{cntVerif+1}:TEST END')
        sys.exit()
    if bytes_arr[cnt] == 126:
        if cnt == len(bytes_arr) or len(bytes_arr) == 1:
            bytes_arr.insert(0, 32)
        bytes_arr[cnt] = 32
        cnt += 1
        cntVerif += 1  # Test
        print(f'{cntVerif}:if bytes_arr[cnt] == 126: \n\tbytes_arr = {bytes_arr}')  # Test
        comb_fct(bytes_arr, cnt)
    if cnt == -1 or cnt == len(bytes_arr)-1:
        bytes_arr[cnt] = bytes_arr[cnt] + 1
        cntVerif += 1  # Test
        print(f'{cntVerif}:if cnt==-1: \n\tbytes_arr = {bytes_arr}')  # Test
        comb_fct(bytes_arr, cnt=-1)  # index = -1 means last index
    bytes_arr[cnt] = bytes_arr[cnt] + 1
    cntVerif += 1  # Test
    print(f'{cntVerif}:None if: \n\tbytes_arr={bytes_arr}')  # Test
    comb_fct(bytes_arr, cnt+1)
comb_fct(byteWord, -1)
Thank you for your help. Python allows only a limited recursion depth (996 on my computer), so I cannot, for example, verify that my algorithm produces all the words of length 3 that can be built from the range of characters described above.
Of course, if anyone has a better idea for writing this algorithm (a faster one, for example), I will be happy to read it.

Although you might be able to tweak this a bit, I think the code below is close to the most efficient solution to your problem, which I take to be "generate all possible sequences of maximum length N from a given set of characters". That might be a bit more general than you need, since your set of characters is fixed, but the general solution is more useful and little overhead is added.
Note that the function is written as a generator, using functions from the itertools standard library module. Itertools is described as a set of "functions creating iterators for efficient looping" (emphasis added), and it indeed is. Generators are one of Python's great features, since they allow you to easily and efficiently iterate over complex sequences. If you want to write efficient and "pythonic" code, you should familiarise yourself with these concepts (as well as other essential features, such as comprehensions). So I'm not going to explain these features further; please read the tutorial sections for details.
So here's the simple solution:
from itertools import product, chain
def genseq(maxlen, chars):
    return map(''.join,
               chain.from_iterable(product(chars, repeat=i)
                                   for i in range(maxlen+1)))
# Example usage:
chars = ''.join(chr(i) for i in range(33, 127))
for word in genseq(4, chars):
    pass  # Do something with word
There are 78,914,411 possible words (including the empty word); the above generates all of them in 7 seconds on my laptop. Much of that time is spent creating (and garbage collecting) those strings; you might well be able to do better using a bytearray and recycling it for each generated word. I didn't try that.
For the record, here's a simpler way of "unindexing" an enumeration of such strings. The enumeration starts with the empty word, followed by all 1-character words, then 2-character words, and so on. This ordering makes it unnecessary to specify the length (or even maximum length) of the resulting string.
def unindex(i, chars):
    v = []
    n = len(chars)
    while i > 0:
        i -= 1
        v.append(i % n)
        i //= n
    return ''.join(chars[j] for j in v[::-1])
# Example to generate the same words as above:
# chars as above
index_limit = (len(chars) ** 5 - 1) // (len(chars) - 1)
for i in range(0, index_limit):
    word = unindex(i, chars)
    # Do something with word
Again, you can probably speed this up a bit by using a recycled bytearray. As written above, it took about two minutes, sixteen times as long as my first version.
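To make the enumeration order concrete, here is a minimal sketch of the opposite direction, mapping a word back to its index. The name index_of is hypothetical (it is not part of the answer above); unindex is repeated so the snippet runs on its own.

```python
def unindex(i, chars):
    """Copy of the function above: map an index to its word."""
    v = []
    n = len(chars)
    while i > 0:
        i -= 1
        v.append(i % n)
        i //= n
    return ''.join(chars[j] for j in v[::-1])


def index_of(word, chars):
    """Hypothetical inverse of unindex: map a word back to its index."""
    n = len(chars)
    i = 0
    for c in word:
        # Digits effectively run from 1 to n rather than 0 to n-1, so words
        # with leading '!' characters still get an index of their own.
        i = i * n + chars.index(c) + 1
    return i


chars = ''.join(chr(c) for c in range(33, 127))
for i in (0, 1, 94, 95, 12345):
    assert index_of(unindex(i, chars), chars) == i
```

The empty word has index 0, the 94 one-character words have indices 1 to 94, and '!!' is index 95, which matches the ordering described above.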
Note that using bytearrays in the way you do in your answer does not significantly speed things up, because it creates a new bytearray each time. In order to achieve the savings, you have to use a single bytearray for the entire generation, modifying it rather than recreating it. That's more awkward in practice, because it means that if you need to keep a generated word around for later, perhaps because it passed some test, you must copy it. It's easy to forget that, and the resulting bug can be very hard to track down.

You don't need recursion here. Consider your word as an n-digit number whose digits are the ASCII symbols in the range of interest ([!..~]). Start with the smallest one (all !) and increment it by 1 until you reach the largest (all ~).
To increment the long number, add 1 to the least significant byte. If that byte is already ~, reset it to ! and carry into the next byte, and so on.
Keep in mind that the number of words is huge. There are 94 ** n n-letter words. For n == 4 there are 78074896 of them.
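The increment-with-carry idea can be sketched as follows. This is only an illustration under my own naming (increment, words, FIRST and LAST are not from the question); it reuses a single bytearray, so a word must be copied if it needs to outlive the next step.

```python
FIRST, LAST = 33, 126  # '!' and '~'


def increment(word: bytearray) -> bool:
    """Advance word in place; return False once every byte has wrapped."""
    for pos in range(len(word) - 1, -1, -1):
        if word[pos] < LAST:
            word[pos] += 1
            return True
        word[pos] = FIRST  # wrap this byte and carry into the next one
    return False


def words(n: int):
    """Yield every n-letter word, reusing one bytearray throughout."""
    word = bytearray([FIRST]) * n
    while True:
        yield word  # take bytes(word) if you need to keep it
        if not increment(word):
            return


# Example: count the two-letter words without storing them.
total = sum(1 for _ in words(2))
print(total)  # 8836, i.e. 94 ** 2
```

There is no recursion, so the recursion limit never comes into play, and memory use stays constant regardless of n.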

EXPLANATION:
To solve this problem, I think I have found a more elegant and faster way that does not use a recursive algorithm.
Complexity:
I also think it is the time- and space-optimal solution.
It runs in O(n) time, where n is the total number of possible combinations (which can be very large), and theoretically in O(1) space. In practice, because of the way Python works, my code creates a lot of bytearrays. This can be corrected with a light modification. For better code, see the solution posted by #ricci, which I marked as the accepted answer.
Mathematical principle used:
I am using the fact that there is a bijection between the numbers written in base 10 and the numbers written in base 94.
Each number in base 94 can be written as a sequence of digits drawn from 94 distinct characters, such as those in the range [33, 126] (decimal values) of the ASCII table.
Example of base conversion:
https://www.rapidtables.com/convert/number/decimal-to-hex.html
The operator '//' is the integer quotient operator and the operator '%' is the modulo operator.
I will be happy if anyone can confirm that my solution is correct. :-)
ALGORITHM
VERSION 1:
Use this if you are NOT interested in the words starting with '!'.
For example, for length 2, you are NOT interested in words of the form '!!' ... '!A' '!B' ... '!R' ... '!~' (because in our base '!' is equivalent to zero).
# Get all relevant ASCII characters in a list
asciiList = []
for c in (chr(i) for i in range(33, 127)):
    asciiList.append(c)
print(f'ascii List: \n{asciiList} \nlist length: {len(asciiList)}')
def base10_to_base94_fct(int_to_convert: int) -> str:
    sol_str = ''
    loop_condition = True
    while loop_condition is True:
        quo = int_to_convert // 94
        mod = int_to_convert % 94
        sol_str = asciiList[mod] + sol_str
        int_to_convert = quo
        if quo == 0:
            loop_condition = False
    return sol_str
# test = base10_to_base94_fct(94**2-1)
# print(f'TEST result: {test}')
def comb_fct(word_length: int) -> None:
    max_iter = 94**word_length
    cnt = 1
    while cnt < max_iter:
        str_tmp = base10_to_base94_fct(cnt)
        cnt += 1
        print(f'{cnt}: Current word check:{str_tmp}')
# Test
comb_fct(3)
VERSION 2:
Use this if you are interested in the words starting with '!'.
For example, for length 2, you are interested in words of the form '!!' ... '!A' '!B' ... '!R' ... '!~' (because in our base '!' is equivalent to zero).
# Get all relevant ASCII characters in a list
asciiList = []
for c in (chr(i) for i in range(33, 127)):
    asciiList.append(c)
print(f'The word should contain only the character in the following ascii List: \n{asciiList} \nlist length: {len(asciiList)}')
def base10_to_base94_fct(int_to_convert: int, str_length: int) -> bytearray:
    sol_str = bytearray(b'\x21') * str_length
    digit_nbr = str_length-1
    loop_condition = True
    while loop_condition is True:
        quo = int_to_convert // 94
        mod = int_to_convert % 94
        sol_str[digit_nbr] = 33 + mod
        digit_nbr -= 1
        int_to_convert = quo
        if digit_nbr == -1:
            loop_condition = False
    return sol_str
def comb_fct(max_word_length: int) -> None:
    # sum of a geometric series: 94 + 94^2 + 94^3 + ... + 94^N (integer arithmetic avoids float rounding)
    max_iter_abs = 94 * (94**max_word_length - 1) // 93
    max_iter_rel = 94
    word_length = 1
    cnt_rel = 0  # rel = relative
    cnt_abs = 0  # abs = absolute
    while cnt_rel < max_iter_rel**word_length and cnt_abs < max_iter_abs:
        str_tmp = base10_to_base94_fct(cnt_rel, word_length)
        print(f'{cnt_abs}:Current word test:{str_tmp}.')
        print(f'cnt_rel = {cnt_rel} and cnt_abs={cnt_abs}')
        cnt_abs += 1  # count every word, including the last one of each length
        if str_tmp == bytearray(b'\x7e') * word_length:
            word_length += 1
            cnt_rel = 0
            continue
        cnt_rel += 1
comb_fct(2)  # Test

Related

Python Leetcode 3: Time limit exceeded

I am solving LeetCode problem https://leetcode.com/problems/longest-substring-without-repeating-characters/:
Given a string s, find the length of the longest substring without repeating characters.
Constraints:
0 <= s.length <= 5 * 10^4
s consists of English letters, digits, symbols and spaces.
I used this sliding window algorithm:
def lengthOfLongestSubstring(str):
    # define base case
    if (len(str) < 2): return len(str)
    # define pointers and frequency counter
    left = 0
    right = 0
    freqCounter = {}  # used to store the character count
    maxLen = 0
    while (right < len(str)):
        # adds the character count into the frequency counter dictionary
        if (str[right] not in freqCounter):
            freqCounter[str[right]] = 1
        else:
            freqCounter[str[right]] += 1
        # print (freqCounter)
        # runs the while loop if we have a key-value with value greater than 1.
        # this means that there are repeated characters in the substring.
        # we want to move the left pointer by 1 until that value decreases to 1 again. E.g., {'a':2,'b':1,'c':1} to {'a':1,'b':1,'c':1}
        while (len(freqCounter) != right-left+1):
        # while (freqCounter[str[right]] > 1): ## Time Limit Exceeded Error
            print(len(freqCounter), freqCounter)
            freqCounter[str[left]] -= 1
            # remove the key-value if value is 0
            if (freqCounter[str[left]] == 0):
                del freqCounter[str[left]]
            left += 1
        maxLen = max(maxLen, right-left+1)
        # print(freqCounter, maxLen)
        right += 1
    return maxLen
print(lengthOfLongestSubstring("abcabcbb"))  # 3 'abc'
I got the error "Time Limit Exceeded" when I submitted with this while loop:
while (freqCounter[str[right]] > 1):
instead of
while (len(freqCounter) != right-left+1):
I thought the first is accessing an element in a dictionary, which has a time complexity of O(1). Not sure why this would be significantly slower than the second version. This seems to mean my approach is not optimal in either case. I thought sliding window would be the most efficient algorithm; did I implement it wrong?
Your algorithm running time is close to the timeout limit for some tests -- I even got the time-out with the version len(freqCounter). The difference between the two conditions you have tried cannot be that much different, so I would look into more drastic ways to improve the efficiency of the algorithm:
Instead of counting the frequency of letters, you could store the index of where you last found the character. This allows you to update left in one go, avoiding a second loop where you had to decrease frequencies at each unit step.
Performing a del is really not necessary.
You can also use some more pythonic looping, like with enumerate
Here is the update of your code applying those ideas (the first one is the most important one):
class Solution(object):
    def lengthOfLongestSubstring(self, s):
        lastpos = {}
        left = 0
        maxLen = 0
        for right, ch in enumerate(s):
            if lastpos.setdefault(ch, -1) >= left:
                left = lastpos[ch] + 1
            else:
                maxLen = max(maxLen, right - left + 1)
            lastpos[ch] = right
        return maxLen
Another boost can be achieved when you work with ASCII codes instead of characters, as then you can use a list instead of a dictionary. As the code challenge guarantees the characters are from a small set of basic characters, we don't need to take other character codes into consideration:
class Solution(object):
    def lengthOfLongestSubstring(self, s):
        lastpos = [-1] * 128
        left = 0
        maxLen = 0
        for right, asc in enumerate(map(ord, s)):
            if lastpos[asc] >= left:
                left = lastpos[asc] + 1
            else:
                maxLen = max(maxLen, right - left + 1)
            lastpos[asc] = right
        return maxLen
When submitting this, it scored very well in terms of running time.

Check how many characters need to be deleted to make an anagram in Python

I wrote python code to check how many characters need to be deleted from two strings for them to become anagrams of each other.
This is the problem statement: "Given two strings, a and b, that may or may not be of the same length, determine the minimum number of character deletions required to make a and b anagrams. Any characters can be deleted from either of the strings."
def makeAnagram(a, b):
    # Write your code here
    ac=0 # to count the number of occurrences of a character in a
    bc=0 # to count the number of occurrences of a character in b
    p=False # used to store whether an element is in that string
    c=0 # count of characters to be deleted to make these two strings anagrams
    t=[] # list of previously checked characters
    for x in a:
        if x in t == True:
            continue
        ac=a.count(x)
        t.insert(0,x)
        for y in b:
            p = x in b
            if p==True:
                bc=b.count(x)
                if bc!=ac:
                    d=ac-bc
                    c=c+abs(d)
            elif p==False:
                c=c+1
    return(c)
You can use collections.Counter for this:
from collections import Counter
def makeAnagram(a, b):
    return sum((Counter(a) - Counter(b) | Counter(b) - Counter(a)).values())
Counter(x) (where x is a string) returns a dictionary that maps characters to how many times they appear in the string.
Counter(a) - Counter(b) gives you a dictionary that maps each character that is overabundant in a to how many more times it appears in a than in b. Characters that are not overabundant in a are dropped.
Counter(b) - Counter(a) is like above, but for characters which are overabundant in b.
The | merges the two resulting counters. We then take the values of this, and sum them to get the total number of characters which are overrepresented in either string. This is equivalent to the minimum number of characters that need to be deleted to form an anagram.
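As a small illustration of the Counter arithmetic described above, here is a sketch with two example strings of my own choosing; the variable names extra_in_a and extra_in_b are only for this demonstration.

```python
from collections import Counter

a, b = "bcadeh", "hea"

# Subtraction keeps only positive counts, so each difference holds
# the characters that one string has too many of.
extra_in_a = Counter(a) - Counter(b)
extra_in_b = Counter(b) - Counter(a)

# The two differences never share a key, so | simply merges them.
deletions = sum((extra_in_a | extra_in_b).values())

print(dict(extra_in_a))  # {'b': 1, 'c': 1, 'd': 1}
print(dict(extra_in_b))  # {}
print(deletions)         # 3
```

Here every character of b also occurs in a, so all three deletions come from a.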
As for why your code doesn't work, I can't pin down any one problem with it. To obtain the code below, all I did was some simplification (e.g. removing unnecessary variables, looping over a and b together, removing == True and == False, replacing t with a set, giving variables descriptive names, etc.), and the code began working. Here is that simplified working code:
def makeAnagram(a, b):
    c = 0  # count of characters to be deleted to make these two strings anagrams
    seen = set()  # set of previously checked characters
    for character in a + b:
        if character not in seen:
            seen.add(character)
            c += abs(a.count(character) - b.count(character))
    return c
I recommend you make it a point to learn how to write simple/short code. It may not seem important compared to actually tackling the algorithms and getting results. It may seem like cleanup or styling work. But it pays off enormously. Bugs are harder to introduce in simple code, and easier to spot. Oftentimes simple code will be more performant than equivalent complex code too, either because the programmer was able to more easily see ways to improve it, or because the more performant approach just arose naturally from the cleaner code.
Assuming there are only lowercase letters:
The idea is to build a character count array for each string, storing the frequency of each character. Then iterate over both count arrays: for each letter index i, the difference in frequency abs(count1[i] - count2[i]) is the number of copies of that character that must be removed from one string or the other.
CHARS = 26
# function to calculate minimum
# numbers of characters
# to be removed to make two
# strings anagram
def remAnagram(str1, str2):
    count1 = [0]*CHARS
    count2 = [0]*CHARS
    i = 0
    while i < len(str1):
        count1[ord(str1[i])-ord('a')] += 1
        i += 1
    i = 0
    while i < len(str2):
        count2[ord(str2[i])-ord('a')] += 1
        i += 1
    # traverse count arrays to find
    # number of characters
    # to be removed
    result = 0
    for i in range(26):
        result += abs(count1[i] - count2[i])
    return result
Here the time complexity is O(n + m), where n and m are the lengths of the two strings.
The space complexity is O(1), as we only use arrays of size 26.
This can be further optimised by using a single array for the counts.
In this case, for string s1 we increment the counter,
and for string s2 we decrement the counter.
def makeAnagram(a, b):
    buffer = [0] * 26
    for char in a:
        buffer[ord(char) - ord('a')] += 1
    for char in b:
        buffer[ord(char) - ord('a')] -= 1
    return sum(map(abs, buffer))
if __name__ == "__main__":
    str1 = "bcadeh"
    str2 = "hea"
    print(makeAnagram(str1, str2))
Output: 3

Optimal brute force solution for finding longest palindromic substring

This is my current approach:
def isPalindrome(s):
    if (s[::-1] == s):
        return True
    return False
def solve(s):
    l = len(s)
    ans = ""
    for i in range(l):
        subStr = s[i]
        for j in range(i + 1, l):
            subStr += s[j]
            if (j - i + 1 <= len(ans)):
                continue
            if (isPalindrome(subStr)):
                ans = max(ans, subStr, key=len)
    return ans if len(ans) > 1 else s[0]
print(solve(input()))
My code exceeds the time limit according to the auto scoring system. I've already spent some time searching on Google; all of the solutions I found either follow the same idea with no optimization or use dynamic programming, but sadly I must use only brute force to solve this problem. I tried to break out of the loop earlier by skipping all the substrings that are shorter than the longest palindromic substring found so far, but it still fails to meet the time requirement. Is there any other way to break out of these loops earlier, or a more time-efficient approach than the above?
With subStr += s[j], a new string is created over the length of the previous subStr. And with s[::-1], the substring from the previous offset j is copied over and over again. Both are inefficient because strings are immutable in Python and have to be copied as a new string for any string operation. On top of that, the string comparison in s[::-1] == s is also inefficient because you've already compared all of the inner substrings in the previous iterations and need to compare only the outermost two characters at the current offset.
You can instead keep track of just the index and the offset of the longest palindrome so far, and only slice the string upon return. To account for palindromes of both odd and even lengths, you can either increase the index by 0.5 at a time, or double the length to avoid having to deal with float-to-int conversions:
def solve(s):
    length = len(s) * 2
    index_longest = offset_longest = 0
    for index in range(length):
        offset = 0
        for offset in range(1 + index % 2, min(index, length - index), 2):
            if s[(index - offset) // 2] != s[(index + offset) // 2]:
                offset -= 2
                break
        if offset > offset_longest:
            index_longest = index
            offset_longest = offset
    return s[(index_longest - offset_longest) // 2: (index_longest + offset_longest) // 2 + 1]
Solved by using the approach "Expand Around Center", thanks #Maruthi Adithya
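For reference, a minimal sketch of the "Expand Around Center" idea, written the conventional way with two kinds of centre instead of doubled indices. The function name longest_palindrome is my own; this is one possible formulation and not necessarily the code the poster ended up with.

```python
def longest_palindrome(s: str) -> str:
    """Return one longest palindromic substring of s."""
    best_start, best_len = 0, 0
    for center in range(len(s)):
        # Odd-length palindromes are centred on one character,
        # even-length ones on the gap between two characters.
        for left, right in ((center, center), (center, center + 1)):
            while left >= 0 and right < len(s) and s[left] == s[right]:
                left -= 1
                right += 1
            # The loop overshoots by one position on each side.
            length = right - left - 1
            if length > best_len:
                best_start, best_len = left + 1, length
    return s[best_start:best_start + best_len]


print(longest_palindrome("xyzabccbalmn"))  # abccba
```

Each expansion only compares the two outermost characters, so no substring is ever copied or reversed until the final slice.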
This modification of your code should improve performance. You can stop as soon as the longest substring still possible is no longer than the answer already found. Also, start the second loop at i + len(ans) instead of i + 1 to avoid useless iterations:
def solve(s):
    l = len(s)
    ans = ""
    for i in range(l):
        if (l - i <= len(ans)):
            break
        subStr = s[i:i + len(ans)]
        for j in range(i + len(ans), l):
            subStr += s[j]
            if (isPalindrome(subStr)):
                ans = subStr
    return ans if len(ans) > 1 else s[0]
This is a solution that has a time complexity greater than the solutions provided.
Note: This post is to think about the problem better and does not specifically answer the question. I have taken a mathematical approach to find a time complexity greater than 2^L (where L is size of input string)
Note: This is a post to discuss potential algorithms. You will not find the answer here. And the logic shown here has not been proven extensively.
Do let me know if there is something that I haven't considered.
Approach: Create the set of possible substrings. Compare and find the maximum pair* from this set that has the longest possible palindrome.
Example case with input string: "abc".
In this example, substring set has: "a","b","c","ab","ac","bc","abc".
7 elements.
Comparing each element with all other elements will involve: 7^2 = 49 calculations.
Hence, input size is 3 & no of calculations is 49.
Time Complexity:
First compute time complexity for generating the substring set:
$\sum_{a=1}^{L} C_{a}^{L}$
Here, we are adding all the different substring size combination from the input size L.
To make it clear: In the above example input size is 3. So we find all the pairs with size =1 (i.e: "a","b","c"). Then size =2 (i.e: "ab","ac","bc") and finally size = 3 (i.e: "abc").
So choosing 1 character from input string = combination of taking L things 1 at a time without repetition.
In our case number of combinations = 3.
This can be mathematically shown as (where a = 1):
$C_{a}^{L}$
Similarly choosing 2 char from input string = 3
Choosing 3 char from input string = 1
Finding time complexity of palindrome pair from generated set with maximum length:
Size of generated set: N
For this we have to compare each string in set with all other strings in set.
So N*N, or 2 for loops. Hence the final time complexity is:
$\sum_{a=1}^{L} \left( C_{a}^{L} \right)^{2}$
This is a diverging function that is greater than 2^L for L > 1.
However, several optimizations can be applied to this. For example, there is no need to compare "a" with "abc", as "a" will also be compared with "a". Even with this optimization applied, it will still have a time complexity > 2^L (in most cases).
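The two sums above have well-known closed forms: the sum of C(L, a) for a = 1..L is 2^L - 1, and the sum of C(L, a)^2 for a = 1..L is C(2L, L) - 1. A quick numerical check, with helper names (selection_count, squared_sum) that are my own, might look like this:

```python
from math import comb


def selection_count(L: int) -> int:
    """Sum of C(L, a) for a = 1..L."""
    return sum(comb(L, a) for a in range(1, L + 1))


def squared_sum(L: int) -> int:
    """Sum of C(L, a)^2 for a = 1..L."""
    return sum(comb(L, a) ** 2 for a in range(1, L + 1))


for L in range(1, 8):
    assert selection_count(L) == 2 ** L - 1
    assert squared_sum(L) == comb(2 * L, L) - 1
    print(L, selection_count(L), squared_sum(L), 2 ** L)
```

For L = 3 this gives 7 selections, matching the example, and a squared sum of 19, which is indeed above 2^3 = 8. For L = 1 the squared sum is 1, below 2^1, which is why the claim is stated for L > 1.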
Hope this gave you a new perspective to the problem.
PS: This is my first post.
You should not build each candidate starting from its first character. Instead, start from the middle of a candidate and expand outwards.
For example, for the string xyzabccbalmn, your solution will cost ~ 6 * 11 comparisons, but searching from the middle will cost ~ 11 * 2 + 2 operations.
But in any case, brute force will never guarantee that your solution runs fast enough for an arbitrary string.
Try this:
def solve(s):
    if len(s) == 1:
        print(0)
        return '1'
    if len(s) <= 2 and not(isPalindrome(s)):
        print(0)
        return '1'
    elif isPalindrome(s):
        print(len(s))
        return '1'
    elif isPalindrome(s[0:len(s)-1]) or isPalindrome(s[1:len(s)]):
        print(len(s)-1)
        return '1'
    elif len(s) >= 2:
        solve(s[0:len(s)-1])
        return '1'
    return 0

Space complexity of list creation

Could someone explain to me what the space complexity of the program below is, and why?
def is_pal_per(str):
    s = [i for i in str]
    nums = [0] * 129
    for i in s:
        nums[ord(i)] += 1
    count = 0
    for i in nums:
        if i != 0 and i / 2 == 0:
            count += 1
    print(count)
    if count > 1:
        return False
    else:
        return True
Actually, I'm interested in these lines of code. How do they influence the space complexity of the above program?
s = [i for i in str]
nums = [0] * 129
I'm unclear where you're having trouble with this. s is simply a list of individual characters in str. The space consumption is len(s).
nums is a constant size, dominated by the O(N) term.
Is this code you wrote, or has this been handed to you? The programming style is far from "Pythonic".
As for your code, start with this collapse:
count = 0
for char in str:
    val = ord(char) + 1
    if abs(val) == 1:
        count += 1
print(count)
return count == 0
First, I replaced your single-letter variables (s => char; i => val). Then I cut out most of the intermediate steps, leaving in a couple to help you read the code. Finally, I used a straightforward Boolean value to return, rather than the convoluted statement of the original.
I did not use Python's counting methods -- that would shorten the function even more. By the way, do you have to print the count of unity values, or do you just need the Boolean return? If it's just the return value, you can make this even shorter.
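As a sketch of what the version with Python's counting tools could look like: this assumes the function is meant to test whether the string is a permutation of a palindrome, i.e. that at most one character occurs an odd number of times. That intent is my reading of the name is_pal_per, not something the question states, and the parameter name text is my own.

```python
from collections import Counter


def is_pal_per(text: str) -> bool:
    """True if the characters of text can be rearranged into a palindrome."""
    # Assumption: a palindrome permutation has at most one odd character count.
    odd = sum(1 for n in Counter(text).values() if n % 2 == 1)
    return odd <= 1


print(is_pal_per("tacocat"))  # True
print(is_pal_per("abc"))      # False
```

Counter iterates over the string directly, so no list copy of the input is made; the extra space is proportional to the number of distinct characters rather than to the length of the string.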
You have found the only two lines that allocate memory:
s = [i for i in str]
nums = [0] * 129
The first grows linearly with len(str), the second is a constant. Therefore, the space complexity of your function is O(N), N=length of str.

Find repeats with certain length within a string using python

I am trying to use the regex module to find non-overlapping repeats (duplicated sub-strings) within a given string (30 char), with the following requirements:
I am only interested in non-overlapping repeats that are 6-15 char long.
allow 1 mis-match
return the positions for each match
One way I thought of is that for each possible repeat length, let python loop through the 30char string input. For example,
import regex
string = "ATAGATATATGGCCCGGCCCATAGATATAT" #input
#for 6char repeats, first one in loop would be for the following event:
text = "ATAGAT"
text2 = "(" + text + ")" + "{e<=1}" #this is to allow 1 mismatch later in regex
string2 = "ATATGGCCCGGCCCATAGATATAT" #string after excluding text
for x in regex.finditer(text2, string2, overlapped=True):
    print(x.span())
#then still for 6char repeats, I will move on to text = "TAGATA"...
#after 6char, loop again for 7char...
There should be two outputs for this particular string = "ATAGATATATGGCCCGGCCCATAGATATAT": 1. The two occurrences of "ATAGATATAT" plus 1 mismatch, "ATAGATATATG" & "CATAGATATAT", with position indexes returned as (0,10)&(19, 29); 2. "TGGCCC" & "GGCCCA" (one mismatch has to be added to reach at least 6 char), with indexes (9,14)&(15,20). Numbers can be in a list or table.
I'm sorry that I didn't include a real loop, but I hope the idea is clear. As you can see, this is a very inefficient method, not to mention that it would create redundancy: e.g. 10char repeats will be counted more than once, because they also match in the 9, 8, 7 and 6 char repeat loops. Moreover, I have a lot of such 30 char strings to work with, so I would appreciate your advice on some cleaner methods.
Thank you very much:)
I'd try a straightforward algorithm instead of regex (which is quite confusing in this instance):
s = "ATAGATATATGGCCCGGCCCATAGATATAT"
def fuzzy_compare(s1, s2):
    # sanity check
    if len(s1) != len(s2):
        return False
    diffs = 0
    for a, b in zip(s1, s2):
        if a != b:
            diffs += 1
            if diffs > 1:
                return False
    return True
slen = len(s)  # 30
for l in range(6, 16):
    i = 0
    while (i + l * 2) <= slen:
        sub1 = s[i:i+l]
        # + 1 so that a second copy ending exactly at the end of s is also checked
        for j in range(i+l, slen - l + 1):
            sub2 = s[j:j+l]
            if fuzzy_compare(sub1, sub2):
                # checking if this could be partial
                partial = False
                if i + l < j and j + l < slen:
                    extsub1 = s[i:i+l+1]
                    extsub2 = s[j:j+l+1]
                    # if it is partial, we'll get it later in the main loop
                    if fuzzy_compare(extsub1, extsub2):
                        partial = True
                if not partial:
                    print((i, i+l), (j, j+l))
        i += 1
It's a first draft, so feel free to experiment with it. It also seems clunky and not optimal, but try running it first; it may be sufficient.
