Find repeats with certain length within a string using python

Find repeats with certain length within a string using python - python

I am trying to use the regex module to find non-overlapping repeats (duplicated sub-strings) within a given string (30 char), with the following requirements:
I am only interested in non-overlapping repeats that are 6-15 char long.
allow 1 mis-match
return the positions for each match
One way I thought of is that for each possible repeat length, let python loop through the 30char string input. For example,
string = "ATAGATATATGGCCCGGCCCATAGATATAT" #input
#for 6char repeats, first one in loop would be for the following event:
text = "ATAGAT"
text2 ="(" + text + ")"+ "{e<=1}" #this is to allow 1 mismatch later in regex
string2="ATATGGCCCGGCCCATAGATATAT" #string after excluding text
for x in regex.finditer(text2,string2,overlapped=True):
print x.span()
#then still for 6char repeats, I will move on to text = "TAGATA"...
#after 6char, loop again for 7char...
There should be two outputs for this particular string = "ATAGATATATGGCCCGGCCCATAGATATAT". 1. The bold two "ATAGATATAT" + 1 mismatch: "ATAGATATATG" &"CATAGATATAT" with position index returned as (0,10)&(19, 29); 2. "TGGCCC" & "GGCCCA" (need add one mismatch to be at least 6 char), with index (9,14)&(15,20). Numbers can be in a list or table.
I'm sorry that I didn't include a real loop, but I hope the idea is clear...As you can see, this is a very less efficient method, not to mention it would create redundancy --- e.g. 10char repeats will be counted more than once, because it would suit for 9,8,7 and 6 char repeats loops. Moreover, I have a lot of such 30 char strings to work with, so I would appreciate your advice on some cleaner methods.
Thank you very much:)

I'd try straightforward algorithm instead of regex (which are quite confusing in this instance);
s = "ATAGATATATGGCCCGGCCCATAGATATAT"
def fuzzy_compare(s1, s2):
# sanity check
if len(s1) != len(s2):
return False
diffs = 0
for a, b in zip(s1, s2):
if a != b:
diffs += 1
if diffs > 1:
return False
return True
slen = len(s) # 30
for l in range(6, 16):
i = 0
while (i + l * 2) <= slen:
sub1 = s[i:i+l]
for j in range(i+l, slen - l):
sub2 = s[j:j+l]
if fuzzy_compare(sub1, sub2):
# checking if this could be partial
partial = False
if i + l < j and j + l < slen:
extsub1 = s[i:i+l+1]
extsub2 = s[j:j+l+1]
# if it is partial, we'll get it later in the main loop
if fuzzy_compare(extsub1, extsub2):
partial = True
if not partial:
print (i, i+l), (j, j+l)
i += 1
It's a first draft, so feel free to experiment with it. It also seems to be clunky and not optimal, but try running it first - it may be sufficient enough.

Related

Algorithm verification: Get all the combinaison of possible word

I wanted to know if the algorithm that i wrotte just below in python is correct.
My goal is to find an algorithm that print/find all the possible combinaison of words that can be done using the character from character '!' (decimal value = 33) to character '~' (decimal value = 126) in the asccii table:
Here the code using recursion:
byteWord = bytearray(b'\x20') # Hex = '\x21' & Dec = '33' & Char = '!'
cntVerif = 0 # Test-------------------------------------------------------------------------------------------------------
def comb_fct(bytes_arr, cnt: int):
global cntVerif # Test------------------------------------------------------------------------------------------------
if len(bytes_arr) > 3: # Test-----------------------------------------------------------------------------------------
print(f'{cntVerif+1}:TEST END')
sys.exit()
if bytes_arr[cnt] == 126:
if cnt == len(bytes_arr) or len(bytes_arr) == 1:
bytes_arr.insert(0, 32)
bytes_arr[cnt] = 32
cnt += 1
cntVerif += 1 # Test----------------------------------------------------------------------------------------------
print(f'{cntVerif}:if bytes_arr[cnt] == 126: \n\tbytes_arr = {bytes_arr}') # Test-------------------------------------------------------------------------------------------
comb_fct(bytes_arr, cnt)
if cnt == -1 or cnt == len(bytes_arr)-1:
bytes_arr[cnt] = bytes_arr[cnt] + 1
cntVerif += 1 # Test----------------------------------------------------------------------------------------------
print(f'{cntVerif}:if cnt==-1: \n\tbytes_arr = {bytes_arr}') # Test-------------------------------------------------------------------------------------------
comb_fct(bytes_arr, cnt=-1) # index = -1 means last index
bytes_arr[cnt] = bytes_arr[cnt] + 1
cntVerif += 1 # Test--------------------------------------------------------------------------------------------------
print(f'{cntVerif}:None if: \n\tbytes_arr={bytes_arr}') # Test-----------------------------------------------------------------------------------------------
comb_fct(bytes_arr, cnt+1)
comb_fct(byteWord, -1)
Thank your for your help because python allow just a limited number of recursion (996 on my computer) so i for exemple i can't verify if my algorithm give all the word of length 3 that can be realised with the range of character describe upper.
Of course if anyone has a better idea to writte this algorithm (a faster algorithm for exemple). I will be happy to read it.

Although you might be able to tweak this a bit, I think the code below is close to the most efficient solution to your problem, which I take to be "generate all possible sequences of maximum length N from a given set of characters". That might be a bit more general than you need, since your set of characters is fixed, but the general solution is more useful and little overhead is added.
Note that the function is written as a generator, using functions from the itertools standard library module. Itertools is described as a set of "functions creating iterators for efficient looping" (emphasis added), and it indeed is. Generators are one of Python's great features, since they allow you to easily and efficiently iterate over complex sequences. If you want to write efficient and "pythonic" code, you should familiarise yourself with these concepts (as well as other essential features, such as comprehensions). So I'm not going to explain these features further; please read the tutorial sections for details.
So here's the simple solution:
from itertools import product, chain
def genseq(maxlen, chars):
return map(''.join,
chain.from_iterable(product(chars, repeat=i)
for i in range(maxlen+1)))
# Example usage:
chars = ''.join(chr(i) for i in range(33, 127))
for word in genseq(4, chars):
# Do something with word
There are 78,914,411 possible words (including the empty word); the above generates all of them in 7 seconds on my laptop. Much of that time is spent creating (and garbage collecting) those strings; you might well be able to do better using a bytearray and recycling it for each generated word. I didn't try that.
For the record, here's a simpler way of "unindexing" an enumeration of such strings. The enumeration starts with the empty word, followed by all 1-character words, then 2-character words, and so on. This ordering makes it unnecessary to specify the length (or even maximum length) of the resulting string.
def unindex(i, chars):
v = []
n = len(chars)
while i > 0:
i -= 1
v.append(i % n)
i //= n
return ''.join(chars[j] for j in v[::-1])
# Example to generate the same words as above:
# chars as above
index_limit = (len(chars) ** 5 - 1) // (len(chars) - 1)
for i in range(0, index_limit):
word = unindex(i, chars)
# Do something with word
Again, you can probably speed this up a bit by using a recycled bytearray. As written above, it took about two minutes, sixteen times as long as my first version.
Note that using bytearrays in the way you do in your answer does not significantly speed things up, because it creates a new bytearray each time. In order to achieve the savings, you have to use a single bytearray for the entire generations, modifying it rather than recreating it. That's more awkward in practice, because it means that if you need to keep a generated word around for later, perhaps because it passed some test, you must copy it. It's easy to forget that, and the resulting bug can be very hard to track down.

You don't need a recursion here. Consider your word as a n-digit number, where the digits are ASCII symbols in the range of interest ([!..~]). Start with the smallest one (all !), and increment it by 1, until you reach the largest (all ~).
To increment the long number, add 1 to the least significant byte. If it becomes ~, make it ! and try to increment the next one, etc.
Keep in mind that the amount of words is huge. There are 94 ** n n-letter words. For n == 4 there are 78074896 of them.

EXPLANATION:
To solve this problem i think that i ve found a more elegant and faster way to do it without using recursive algorithm.
Complexity:
I think too that it is the time and space optimal solution.
As it is in time: O(n) with n the total number of possible combinaison that can be very very high. And theorically O(1) in space complexity. Concerning the space complexity because of the python language characteristics my code ,from a practical point of view, creates a lot of bytearray. This can be corrected with light modification. But for a better code check the solution posted by #ricci that i marked as the accepted answer.
Mathematical principle used:
I am using the fact that it exists a bijection between all the number in decimal basis and the number in base 94.
It is obvious that each number in base 94 can be written using a special sequance of unique character as the one in the range [30, 126] (in decimal value) in the ascii code.
Exemple of base conversion:
https://www.rapidtables.com/convert/number/decimal-to-hex.html
The operator '//' is the quotient operator and the operator '%' is the modulo operator.
I will be happy if anyone can confirm that my solution is correct. :-)
ALGORITHM
VERSION 1:
If you are NOT interested by getting all the sequence of words starting by '!'.
For exemple in lenght 2, you are NOT interested by the words of the form '!!'...'!A' '!B' ... etc ...'!R'...'!~' (as in our base '!' is equivalent to zero).
# Get all ascii relevant character in a list
asciiList = []
for c in (chr(i) for i in range(33, 127)):
asciiList.append(c)
print(f'ascii List: \n{asciiList} \nlist length: {len(asciiList)}')
def base10_to_base94_fct(int_to_convert: int) -> str:
sol_str = ''
loop_condition = True
while loop_condition is True:
quo = int_to_convert // 94
mod = int_to_convert % 94
sol_str = asciiList[mod] + sol_str
int_to_convert = quo
if quo == 0:
loop_condition = False
return sol_str
# test = base10_to_base94_fct(94**2-1)
# print(f'TEST result: {test}')
def comb_fct(word_length: int) -> None:
max_iter = 94**word_length
cnt = 1
while cnt < max_iter:
str_tmp = base10_to_base94_fct(cnt)
cnt += 1
print(f'{cnt}: Current word check:{str_tmp}')
# Test
comb_fct(3)
VERSION 2:
If you are interested by getting all the sequence of words starting by '!'.
For exemple in lenght 2, you are interested by the words of the form '!!'...'!A' '!B' ... etc ...'!R'...'!~' (as in our base '!' is equivalent to zero).
# Get all ascii relevant character in a list
asciiList = []
for c in (chr(i) for i in range(33, 127)):
asciiList.append(c)
print(f'The word should contain only the character in the following ascii List: \n{asciiList} \nlist length: {len(asciiList)}')
def base10_to_base94_fct(int_to_convert: int, str_length: int) -> bytearray:
sol_str = bytearray(b'\x21') * str_length
digit_nbr = str_length-1
loop_condition = True
while loop_condition is True:
quo = int_to_convert // 94
mod = int_to_convert % 94
sol_str[digit_nbr] = 33 + mod
digit_nbr -= 1
int_to_convert = quo
if digit_nbr == -1:
loop_condition = False
return sol_str
def comb_fct(max_word_length: int) -> None:
max_iter_abs = (94/93) * (94**max_word_length-1) # sum of a geometric series: 94 + 94^2 + 94^3 + 94^4 + ... + 94^N
max_iter_rel = 94
word_length = 1
cnt_rel = 0 # rel = relative
cnt_abs = 0 # abs = absolute
while cnt_rel < max_iter_rel**word_length and cnt_abs < max_iter_abs:
str_tmp = base10_to_base94_fct(cnt_rel, word_length)
print(f'{cnt_abs}:Current word test:{str_tmp}.')
print(f'cnt_rel = {cnt_rel} and cnt_abs={cnt_abs}')
if str_tmp == bytearray(b'\x7e') * word_length:
word_length += 1
cnt_rel = 0
continue
cnt_rel += 1
cnt_abs += 1
comb_fct(2) # Test

Reducing a string by detecting patterns

Given a string, I would like to detect the repeating substrings, and then reduce abab to (ab)2.
For instance, ababababacdecdecdeababab would reduce to (ab)4a(cde)3(ab)3.
The string does not have the same character twice in a row. So, aaab is an invalid string.
Here is the Python that I wrote:
def superscript(n):
return "".join(["⁰¹²³⁴⁵⁶⁷⁸⁹"[ord(c)-ord('0')] for c in str(n)])
signature = 'hdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfb'
d = {}
processed = []
for k in range(2, len(signature)):
i = 0
j = i + k
while j <= len(signature):
repeat_count = 0
while signature[i:i+k] == signature[j:j+k]:
repeat_count += 1
j += k
if repeat_count > 0 and i not in processed:
d[i] = [i+k, repeat_count + 1]
for j in range(i, (i+k)*repeat_count + 1):
processed.append(j)
i = j
j = i + k
else:
i += 1
j = i + k
od = collections.OrderedDict(sorted(d.items()))
output = ''
for k,v in od.items():
print(k, v)
output += '(' + signature[k:v[0]] + ')' + superscript(v[1])
Which aims to detect the repeating substrings of length 2, 3, 4, and so on. I mark the start and the end of a repeating substring by using a dict. I also mark the index of the processed characters by keeping a list to avoid replacing (ab)4 by (abab)2 (since the latter one will overwrite the beginning index in the dict).
The example string I work with is hdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfb which should output (hd)4(cg)4c(bf)4b(ae)4a(dh)4d(cg)4c(bf)4b(ae)4a(dh)4d(cg)4c(bf)4b(ae)4a(dh)4d(cg)4cbfb.
However, I get this output:
(hd)4(dcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdh)5(cg)4(ea)2(dh)4(hd)2(cg)4
I don't know whether this is a well-known problem, but I couldn't find any resources. I don't mind the time complexity of the algorithm.
Where did I make a mistake?
The algorithm I try to describe looks like this:
First, find the repeating substrings of length 2, then 3, then 4, ..., up to the length of the input string.
Then, do the same operation until there is no repetition at all.
A step-by-step example looks like this:
abcabcefefefghabcdabcdefefefghabcabcefefefghabcdabcdefefefgh
abcabc(ef)²ghabcdabcd(ef)²ghabcabc(ef)²ghabcdabcd(ef)²gh
(abc)³(ef)²ghabcdabcd(ef)²gh(abc)³(ef)²ghabcdabcd(ef)²gh
(abc)³(ef)²gh(abcd)²(ef)²gh(abc)³(ef)²gh(abcd)²(ef)²gh
((abc)³(ef)²gh(abcd)²(ef)²gh)²

You can use re.sub to match any repeating two chars and then pass a replacement function that formats the pattern you desire
import re
def superscript(n):
return "".join(["⁰¹²³⁴⁵⁶⁷⁸⁹"[ord(c)-ord('0')] for c in str(n)])
s = 'hdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfb'
max_length = 5
result = re.sub(
rf'(\w{{2,{max_length}}}?)(\1+)', # Omit the second number in the repetition to match any number of repeating chars (\w{2,}?)(\1+)
lambda m: f'({m.group(1)}){superscript(len(m.group(0))//len(m.group(1)))}',
s
)
print(result) # (hd)⁴(cg)⁴c(bf)⁴b(ae)⁴a(dh)⁴d(cg)⁴c(bf)⁴b(ae)⁴a(dh)⁴d(cg)⁴c(bf)⁴b(ae)⁴a(dh)⁴d(cg)⁴c(bf)⁴b(ae)⁴a(dh)⁴d(cg)⁴c(bf)⁴b(ae)⁴a(dh)⁴d(cg)⁴cbfb

The problem in your code happens when you put together the list of repeating patterns. When you are merging patterns of length 2 and patterns of length 3, you are using patterns that are not compatible with each other.
hdhdhdhd = (hd)4 starts at index 0 and ends at index 7 (included).
(dcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdh)5, which is a correct pattern in your string, starts at index 7 (included).
This means when you merge the two patterns, you get an incorrect end result because the letter at index 7 is shared.
This problem stems from the fact that one pattern is even in length, while the other is odd and their limits are not aligning. So, they don't even overwrite each other in d and you end up with your result.
I think you tried to solve this problem using the dictionary d with the starting index as key and with the processed list, but there is still a couple of problems.
for j in range(i, (i+k)*repeat_count + 1): should be for l in range(i, j), otherwise you are skipping the last index of the pattern (j). Also, I changed the loop index to l because j was already used. This fixed the problem I described above.
Even with that fixed, there is still an issue. You check for patterns starting from length 2 (for k in range(2, len(signature))), so single letters not belonging to any pattern, like the c in (hd)4(cg)4c(bf)4 will never make it in the dictionary and therefore you will still have overlapping patterns with different lengths at those positions.

Accessing two following indexes in a loop

I'm new to python (although it's more of a logical question rather than syntax question I belive), and I wonder what's the proper way to access two folowing objects in a loop.
I can't really provide a specific example without getting too cumbersome with my explanation but let's just say that I usually try to tackle this with either [index + 1] or [index - 1] and both are problematic when it comes to either the last (IndexError) or first (addresses the last position right at the beginning) iterations respectively.
Is there a well known way to address this? I haven't really seen any questions regarding this floating around so it made me think it's basic logic I'm missing here.
For example this peice of code that wouldn't have worked had I not wrapped everything with try/except, and also the second inner loop works only since it checks for identical characters, otherwise it could have been a mess.
(explanation for clarity - it recieves a string (my_string) and a number (k) and checks whether a sequence of identical characters the length of k exists in my_string)
# ex2 5
my_string = 'abaadddefggg'
sub_my_string = ''
k = 9
count3 = 0
try:
for index in range(len(my_string)):
i = 0
while i < k:
sub_my_string += my_string[index + i]
i += 1
for index2 in range(len(sub_my_string)):
if sub_my_string[index2] == sub_my_string[index2 - 1]:
count3 += 1
if count3 == k:
break
else:
sub_my_string = ""
count3 = 0
print(f"For length {k}, found the substring {sub_my_string}!")
except IndexError:
print(f"Didn't find a substring of length {k}")
Thanks a lot

First off, by definition you need to give special attention to the first or last element, because they really don't have a pair.
Second-off, I personally tend to use list-comprehensions of the following type for these cases -
[something_about_the_two_consecutive_elements(x, y) for x, y in zip(my_list, my_list[1:])]
And last but not least, the whole code snippet seems like major overkill. How about a simple one-liner -
my_string = 'abaadddefggg'
k = 3
existing_substrings = ([x * k for x in set(my_string) if x * k in my_string])
print(f'For length {k}, found substrings {existing_substrings}')
(To be adapted by one's needs of course)
Explanation:
For each of the unique characters in the string, we can check if a string of that character repeated k times appears in my_string.
set(my_string) gives a set of the unique characters over which we iterate (that's the for x in set(my_string) in the list comprehension).
Taking a character x and multiplying by k gives a string xx...x of length k.
So x * k in my_string tests whether my_string includes the substring xx...x.
Summing up the list-comprehension, we return only characters for which x * k in my_string is True.

If I am understanding what you are trying to achieve, I would approach this differently using string slices and a set.
my_string = "abaadddefggg"
sub_my_string = ""
k = 3
count3 = 0
found = False
for index, _ in enumerate(my_string):
if index + k > len(my_string):
continue
sub_my_string = my_string[index : index + k]
if len(set(sub_my_string)) == 1:
found = True
break
if found:
print(f"For length {k}, found the substring {sub_my_string}!")
else:
print(f"Didn't find a substring of length {k}")
Here we use:
enumerate as this usually signals that we are looking at the indices of an iterable.
Check whether the slice will be take us over the string length as there's no point in checking these.
Use the string slice to subset the string
Use the set to see if all the characters are the same.

Optimal brute force solution for finding longest palindromic substring

This is my current approach:
def isPalindrome(s):
if (s[::-1] == s):
return True
return False
def solve(s):
l = len(s)
ans = ""
for i in range(l):
subStr = s[i]
for j in range(i + 1, l):
subStr += s[j]
if (j - i + 1 <= len(ans)):
continue
if (isPalindrome(subStr)):
ans = max(ans, subStr, key=len)
return ans if len(ans) > 1 else s[0]
print(solve(input()))
My code exceeds the time limit according to the auto scoring system. I've already spend some time to look up on Google, all of the solutions i found have the same idea with no optimization or using dynamic programming, but sadly i must and only use brute force to solve this problem. I was trying to break the loop earlier by skipping all the substrings that are shorter than the last found longest palindromic string, but still end up failing to meet the time requirement. Is there any other way to break these loops earlier or more time-efficient approach than the above?

With subStr += s[j], a new string is created over the length of the previous subStr. And with s[::-1], the substring from the previous offset j is copied over and over again. Both are inefficient because strings are immutable in Python and have to be copied as a new string for any string operation. On top of that, the string comparison in s[::-1] == s is also inefficient because you've already compared all of the inner substrings in the previous iterations and need to compare only the outermost two characters at the current offset.
You can instead keep track of just the index and the offset of the longest palindrome so far, and only slice the string upon return. To account for palindromes of both odd and even lengths, you can either increase the index by 0.5 at a time, or double the length to avoid having to deal with float-to-int conversions:
def solve(s):
length = len(s) * 2
index_longest = offset_longest = 0
for index in range(length):
offset = 0
for offset in range(1 + index % 2, min(index, length - index), 2):
if s[(index - offset) // 2] != s[(index + offset) // 2]:
offset -= 2
break
if offset > offset_longest:
index_longest = index
offset_longest = offset
return s[(index_longest - offset_longest) // 2: (index_longest + offset_longest) // 2 + 1]

Solved by using the approach "Expand Around Center", thanks #Maruthi Adithya

This modification of your code should improve performance. You can stop your code when the max possible substring is smaller than your already computed answer. Also, you should start your second loop with j+ans+1 instead of j+1 to avoid useless iterations :
def solve(s):
l = len(s)
ans = ""
for i in range(l):
if (l-i+1 <= len(ans)):
break
subStr = s[i:len(ans)]
for j in range(i + len(ans) + 1, l+1):
if (isPalindrome(subStr)):
ans = subStr
subStr += s[j]
return ans if len(ans) > 1 else s[0]

This is a solution that has a time complexity greater than the solutions provided.
Note: This post is to think about the problem better and does not specifically answer the question. I have taken a mathematical approach to find a time complexity greater than 2^L (where L is size of input string)
Note: This is a post to discuss potential algorithms. You will not find the answer here. And the logic shown here has not been proven extensively.
Do let me know if there is something that I haven't considered.
Approach: Create set of possible substrings. Compare and find the maximum pair* from this set that has the highest possible pallindrome.
Example case with input string: "abc".
In this example, substring set has: "a","b","c","ab","ac","bc","abc".
7 elements.
Comparing each element with all other elements will involve: 7^2 = 49 calculations.
Hence, input size is 3 & no of calculations is 49.
Time Complexity:
First compute time complexity for generating the substring set:
<img src="https://latex.codecogs.com/gif.latex?\sum_{a=1}^{L}\left&space;(&space;C_{a}^{L}&space;\right&space;)" title="\sum_{a=1}^{L}\left ( C_{a}^{L} \right )" />
(The math equation is shown in the code snippet)
Here, we are adding all the different substring size combination from the input size L.
To make it clear: In the above example input size is 3. So we find all the pairs with size =1 (i.e: "a","b","c"). Then size =2 (i.e: "ab","ac","bc") and finally size = 3 (i.e: "abc").
So choosing 1 character from input string = combination of taking L things 1 at a time without repetition.
In our case number of combinations = 3.
This can be mathematically shown as (where a = 1):
<img src="https://latex.codecogs.com/gif.latex?C_{a}^{L}" title="C_{a}^{L}" />
Similarly choosing 2 char from input string = 3
Choosing 3 char from input string = 1
Finding time complexity of palindrome pair from generated set with maximum length:
Size of generated set: N
For this we have to compare each string in set with all other strings in set.
So N*N, or 2 for loops. Hence the final time complexity is:
<img src="https://latex.codecogs.com/gif.latex?\sum_{a=1}^{L}\left&space;(&space;C_{a}^{L}&space;\right&space;)^{2}" title="\sum_{a=1}^{L}\left ( C_{a}^{L} \right )^{2}" />
This is diverging function greater than 2^L for L > 1.
However, there can be multiple optimizations applied to this. For example: there is no need to compare "a" with "abc" as "a" will also be compared with "a". Even if this optimization is applied, it will still have a time complexity > 2^L (For the most cases).
Hope this gave you a new perspective to the problem.
PS: This is my first post.

You should not find the string start from the beginning of that string, but you should start from the middle of it & expand the current string
For example, for the string xyzabccbalmn, your solution will cost ~ 6 * 11 comparison but searching from the middle will cost ~ 11 * 2 + 2 operations
But anyhow, brute-forcing will never ensure that your solution will run fast enough for any arbitrary string.

Try this:
def solve(s):
if len(s)==1:
print(0)
return '1'
if len(s)<=2 and not(isPalindrome(s)):
print (0)
return '1'
elif isPalindrome(s):
print( len(s))
return '1'
elif isPalindrome(s[0:len(s)-1]) or isPalindrome(s[1:len(s)]):
print (len(s)-1)
return '1'
elif len(s)>=2:
solve(s[0:len(s)-1])
return '1'
return 0

Is there a better way to write the following method in python?

I am writing a small program, in python, which will find a lone missing element from an arithmetic progression (where the starting element could be both positive and negative and the series could be ascending or descending).
so for example: if the input is 1 3 5 9 11, then the function should return 7 as this is the lone missing element in the above AP series.
The input format: the input elements are separated by 1 white space and not commas as is commonly done.
Here is the code:
def find_missing_elm_ap_series(n, series):
ap = series
ap = ap.split(' ')
ap = [int(i) for i in ap]
cd = []
for i in range(n-1):
cd.append(ap[i+1]-ap[i])
common_diff = 0
if len(set(cd)) == 1:
print 'The series is complete'
return series
else:
cd = [abs(i) for i in cd]
common_diff = min(cd)
if ap[0] > ap[1]:
common_diff = (-1)*common_diff
new_ap = []
for i in range(n+1):
new_ap.append(ap[0] + i*common_diff)
missing_element = set(new_ap).difference(set(ap))
return missing_element
where n is the length of the series provided (the series with the missing element:5 in the above example).
I am sure there are other shorter and more elegant way of writing this code in python. Can anybody help ?
Thanks
BTW: i am learning python by myself and hence the question.

Based on the fact that if an element is missing it is exactly expected-sum(series) - actual-sum(series). The expected sum for a series with n elements starting at a and ending at b is (a+b)*n/2. The rest is Python:
def find_missing(series):
A = map(int, series.split(' '))
a, b, n, sumA = A[0], A[-1], len(A), sum(A)
if (a+b)*n/2 == sumA:
return None #no element missing
return (a+b)*(n+1)/2-sumA
print find_missing("1 3 5 9") #7
print find_missing("-1 1 3 5 9") #7
print find_missing("9 6 0") #3
print find_missing("1 2 3") #None
print find_missing("-3 1 3 5") #-1

Well... You can do simpler, but it would completely change your algorithm.
First, you can prove that the step for the arithmetic progression is ap[1] - ap[0], unless ap[2] - ap[1] is lower in magnitude than it, in which case the missing element is between terms 0 and 1. (This is true as there is a single missing element.)
Then you can just take ap[0] + n * step and print the first one that doesn't match.
Here is the source code (also implementing some minor shortcuts, such as grouping your first three lines into one):
def find_missing_elm_ap_series(n, series):
ap = [int(i) for i in series.split(' ')]
step = ap[1] - ap[0]
if (abs(ap[2] - ap[1]) <= abs(step)): # Check missing elt is not between 0 and 1
return ap[0] + ap[2] - ap[1]
for (i, val) in zip(range(len(ap)), ap): # And check position of missing element
if ap[0] + i * step != val:
return ap[0] + i * step
return series # missing element not found

The code appears to be working. There is perhaps a slightly easier way to get it done. This is due to the fact that you don't have to attempt to look through all of the values to get the common difference. The following code simply looks at the difference between the 1st and 2nd as well as the last and second last.
This works in the event that only a single value is missing (and the length of the list is at least 3). As the min difference between the values will provide you the common difference.
def find_missing(prog):
# First we cast them to numbers.
items = [int(x) for x in prog.split()]
#Then we compare the first and second
first_to_second = items[1] - items[0]
#then we compare the last to second last
last_to_second_last = items[-1] - items[-2]
#Now we have to care about which one is closes
# to zero
if abs(first_to_second) < abs(last_to_second_last):
change = first_to_second
else:
change = last_to_second_last
#Iterate through the list. As soon as we find a gap
#that is larger than change, we fill in and return
for i in range(1, len(items)):
comp = items[i] - items[i-1]
if comp != change:
return items[i-1] + change
#There was no gap
return None
print(find_missing("1 3 5 9")) #7
print(find_missing("-1 1 3 5 9")) #7
print(find_missing("9 6 0")) #3
print(find_missing("1 2 3")) #None
The previous code shows this example. First of all attempting to find change between each of the values of the list. Then iterating till the change is missed, and returning the value that has been expected.

Here's the way I thought about it: find the position of the maximum difference between the elements of the array; then regenerate the expected number in the sequence from the other differences (which should be all the same and the minimum number in the differences list):
def find_missing(a):
d = [a[i+1] - a[i] for i in range(len(a)-1)]
i = d.index(max(d))
x = min(d)
return a[0] + (i+1)*x
print find_missing([1,3,5,9,11])
7
print find_missing([1,5,7,9,11])
3

Here are some ideas:
Passing the length of the series seems like a bad idea. The function can more easily calculate the length
There is no reason to assign series to ap, just do a function using series and assign the result to ap
When splitting the string, don't give the sep argument. If you don't give the argument, then consecutive white space will also be removed and leading and trailing white space will also be ignored. This is more friendly on the format of the data.
I've combined a few operations. For example the split and the list comprehension converting to integer make sense to group together. There is also no need to create cd as a list and then convert that to a set. Just build it as a set to start with.
I don't like that the function returns the original series in the case of no missing element. The value None would be more in keeping with the name of the function.
Your original function returned a one item set as the result. That seems odd, so I've used pop() to extract that item and return just the missing element.
The last item was more of an experiment with combining all of the code at the bottom into a single statement. Don't know if it is better, but it's something to think about. I built a set with all the correct numbers and a set with the given numbers and then subtracted them and returned the number that was missing.
Here's the code that I came up with:
def find_missing_elm_ap_series(series):
ap = [int(i) for i in series.split()]
n = len(ap)
cd = {ap[i+1]-ap[i] for i in range(n-1)}
if len(cd) == 1:
print 'The series is complete'
return None
else:
common_diff = min([abs(i) for i in cd])
if ap[0] > ap[1]:
common_diff = (-1)*common_diff
return set(range(ap[0],ap[0]+common_diff*n,common_diff)).difference(set(ap)).pop()

Assuming the first & last items are not missing, we can also make use of range() or xrange() with the step of the common difference, getting rid of the n altogether, it can also return more than 1 missing item (although not reliably depending on number of items missing):
In [13]: def find_missing_elm(series):
ap = map(int, series.split())
cd = map(lambda x: x[1]-x[0], zip(ap[:-1], ap[1:]))
if len(set(cd)) == 1:
print 'complete series'
return ap
mcd = min(cd) if ap[0] < ap[1] else max(cd)
sap = set(ap)
return filter(lambda x: x not in sap, xrange(ap[0], ap[-1], mcd))
....:
In [14]: find_missing_elm('1 3 5 9 11 15')
Out[14]: [7, 13]
In [15]: find_missing_elm('15 11 9 5 3 1')
Out[15]: [13, 7]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Find repeats with certain length within a string using python - python

Related

Algorithm verification: Get all the combinaison of possible word

Reducing a string by detecting patterns

Accessing two following indexes in a loop

Optimal brute force solution for finding longest palindromic substring

Is there a better way to write the following method in python?

Categories

Resources