Optimal brute force solution for finding longest palindromic substring - python

This is my current approach:
def isPalindrome(s):
    if (s[::-1] == s):
        return True
    return False
def solve(s):
    l = len(s)
    ans = ""
    for i in range(l):
        subStr = s[i]
        for j in range(i + 1, l):
            subStr += s[j]
            if (j - i + 1 <= len(ans)):
                continue
            if (isPalindrome(subStr)):
                ans = max(ans, subStr, key=len)
    return ans if len(ans) > 1 else s[0]
print(solve(input()))
My code exceeds the time limit according to the auto-scoring system. I've already spent some time searching on Google, and all of the solutions I found either use the same idea with no optimization or use dynamic programming, but sadly I must use only brute force to solve this problem. I tried to break out of the loop earlier by skipping all the substrings that are no longer than the longest palindromic substring found so far, but I still fail to meet the time requirement. Is there any other way to break out of these loops earlier, or a more time-efficient approach than the above?

With subStr += s[j], a new string is created over the length of the previous subStr. And with s[::-1], the substring from the previous offset j is copied over and over again. Both are inefficient because strings are immutable in Python and have to be copied as a new string for any string operation. On top of that, the string comparison in s[::-1] == s is also inefficient because you've already compared all of the inner substrings in the previous iterations and need to compare only the outermost two characters at the current offset.
You can instead keep track of just the index and the offset of the longest palindrome so far, and only slice the string upon return. To account for palindromes of both odd and even lengths, you can either increase the index by 0.5 at a time, or double the length to avoid having to deal with float-to-int conversions:
def solve(s):
    length = len(s) * 2
    index_longest = offset_longest = 0
    for index in range(length):
        offset = 0
        for offset in range(1 + index % 2, min(index, length - index), 2):
            if s[(index - offset) // 2] != s[(index + offset) // 2]:
                offset -= 2
                break
        if offset > offset_longest:
            index_longest = index
            offset_longest = offset
    return s[(index_longest - offset_longest) // 2: (index_longest + offset_longest) // 2 + 1]

Solved by using the "Expand Around Center" approach, thanks @Maruthi Adithya
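For readers unfamiliar with the name, a minimal sketch of expanding around each center might look like the following. The function name longest_palindrome is illustrative and not taken from the thread. Returning the leftmost palindrome on ties and an empty string for empty input are assumptions, not requirements stated in the question.

```python
def longest_palindrome(s):
    """Return the longest palindromic substring of s (leftmost on ties)."""
    if not s:
        return ""  # assumption: empty input yields an empty result
    n = len(s)
    best_start, best_len = 0, 1
    # There are 2n - 1 centers: n on characters, n - 1 between characters.
    for center in range(2 * n - 1):
        left = center // 2
        right = left + center % 2
        # Grow outward while the two outermost characters match.
        while left >= 0 and right < n and s[left] == s[right]:
            left -= 1
            right += 1
        # The loop overshoots by one on each side, so the palindrome
        # found for this center is s[left + 1:right].
        length = right - left - 1
        if length > best_len:
            best_start, best_len = left + 1, length
    return s[best_start:best_start + best_len]


print(longest_palindrome("xyzabccbalmn"))
```

Each step compares only two characters and no substring is copied until the final slice, so the worst case is quadratic in the length of the string rather than cubic.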

This modification of your code should improve performance. You can stop as soon as the longest possible substring starting at i is no longer than the answer you have already computed. Also, you should start your second loop at i + len(ans) + 1 instead of i + 1 to avoid useless iterations:
def solve(s):
    l = len(s)
    ans = ""
    for i in range(l):
        # No substring starting at i can beat the current answer.
        if (l - i <= len(ans)):
            break
        # Only look at substrings strictly longer than the current answer.
        for j in range(i + len(ans) + 1, l + 1):
            subStr = s[i:j]
            if (isPalindrome(subStr)):
                ans = subStr
    return ans if len(ans) > 1 else s[0]

This is a solution that has a time complexity greater than the solutions provided.
Note: This post is to think about the problem better and does not specifically answer the question. I have taken a mathematical approach to find a time complexity greater than 2^L (where L is size of input string)
Note: This is a post to discuss potential algorithms. You will not find the answer here. And the logic shown here has not been proven extensively.
Do let me know if there is something that I haven't considered.
Approach: Create the set of possible substrings. Compare the elements pairwise and find the pair from this set that gives the longest possible palindrome.
Example case with input string: "abc".
In this example, substring set has: "a","b","c","ab","ac","bc","abc".
7 elements.
Comparing each element with all other elements will involve: 7^2 = 49 calculations.
Hence, input size is 3 & no of calculations is 49.
Time Complexity:
First compute time complexity for generating the substring set:
$$\sum_{a=1}^{L} C_{a}^{L}$$
Here, we are adding all the different substring size combination from the input size L.
To make it clear: In the above example input size is 3. So we find all the pairs with size =1 (i.e: "a","b","c"). Then size =2 (i.e: "ab","ac","bc") and finally size = 3 (i.e: "abc").
So choosing 1 character from input string = combination of taking L things 1 at a time without repetition.
In our case number of combinations = 3.
This can be mathematically shown as (where a = 1):
$$C_{a}^{L}$$
Similarly choosing 2 char from input string = 3
Choosing 3 char from input string = 1
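As a quick check of the sum above: the example set includes "ac", which is a non-contiguous selection, so the sketch below assumes the set is meant to contain every non-empty selection of characters (subsequences). Under that assumption the count is 2^L - 1. For comparison it also counts contiguous substrings only, of which there are L(L+1)/2. The function names are illustrative, and math.comb needs Python 3.8 or later.

```python
from math import comb


def selection_count(L):
    """Sum of C(L, a) for a = 1..L: non-empty selections of characters."""
    return sum(comb(L, a) for a in range(1, L + 1))


def contiguous_substring_count(L):
    """Number of non-empty contiguous substrings, counted by position."""
    return L * (L + 1) // 2


for L in (3, 10, 20):
    print(L, selection_count(L), 2 ** L - 1, contiguous_substring_count(L))
```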
Finding time complexity of palindrome pair from generated set with maximum length:
Size of generated set: N
For this we have to compare each string in set with all other strings in set.
So N*N, or 2 for loops. Hence the final time complexity is:
$$\sum_{a=1}^{L}\left( C_{a}^{L} \right)^{2}$$
This is a diverging function greater than 2^L for L > 1.
However, there can be multiple optimizations applied to this. For example: there is no need to compare "a" with "abc" as "a" will also be compared with "a". Even if this optimization is applied, it will still have a time complexity > 2^L (For the most cases).
Hope this gave you a new perspective to the problem.
PS: This is my first post.

Rather than scanning substrings from the beginning of the string, start from the middle of a candidate palindrome and expand outward.
For example, for the string xyzabccbalmn, your solution will cost about 6 * 11 comparisons, but searching from the middle will cost about 11 * 2 + 2 operations.
But anyhow, brute-forcing will never ensure that your solution will run fast enough for any arbitrary string.

Try this:
def solve(s):
    if len(s)==1:
        print(0)
        return '1'
    if len(s)<=2 and not(isPalindrome(s)):
        print (0)
        return '1'
    elif isPalindrome(s):
        print( len(s))
        return '1'
    elif isPalindrome(s[0:len(s)-1]) or isPalindrome(s[1:len(s)]):
        print (len(s)-1)
        return '1'
    elif len(s)>=2:
        solve(s[0:len(s)-1])
        return '1'
    return 0

Related

Can an algorithm with higher Complexity be Faster?

I have written a code for a problem and used 2 double-nested loops within the implementation, but this code runs too long with big O as O(n^2).
So I googled for a faster solution for the same problem and found the second code below, which uses a tripled-nested loop with big O as O(n^3).
Is it because the number of computations is higher for the first code, although it has lower big O?
If so can I conclude that big O is not reliable for small "n" values and I have to do experimentation to be able to judge?
Code 1:
def sherlockAndAnagrams(s):
    # 1 . Traverse all possible substrings within string
    count = 0
    lst_char_poss_str = []
    len_s = len(s)
    for i in range(len_s):#for each char in string
        temp_str = ""#a temp string to include characters next to evaluating char
        for j in range(i , len_s):#for all possible length of string from that char
            temp_str += s[j] #possible substrings from that char
            lst_char_poss_str.append(temp_str)#All possible substrings within string
    # 2 . Check if any two substrings of equal length are anagrams
    new_lst_char_poss_str = []
    for i in lst_char_poss_str:
        i = list(i)#sorted list, so, "abb" and "bba" will be both "abb"
        i.sort()
        new_lst_char_poss_str.append(i)#a 2-d list of lists of characters for All possible substrings within string
    len_new_s = len(new_lst_char_poss_str)
    for i in range (len_new_s - 1):
        for j in range (i + 1, len_new_s):
            if new_lst_char_poss_str[i] == new_lst_char_poss_str[j]:
                count += 1
    return(count)
Code 2:
def sherlockAndAnagrams(s):
    count = 0
    slen = len(s)
    for i in range(slen):
        for j in range(i+1, slen):
            substr = ''.join(sorted(s[i:j]))#Sortingall characters after a char in string
            sublen = len(substr)
            for x in range(i+1, slen):
                if x + sublen > slen: #if index out of range
                    break
                substr2 = ''.join(sorted(s[x:x+sublen]))
                if substr == substr2:
                    anagrams += 1
    return count
You might have an algorithm whose running time is 1,000,000 n, because of the other operations it performs. That is O(n), because it is bounded by a constant times n. You might also have another algorithm with a running time of 2 n^2.
You would say that the 1,000,000 n algorithm is better than the 2 n^2 one, since a linear O(n) running time is better than O(n^2). That is true, but only in the limit, and the limit is reached quite late, when n is really large. For small instances, 2 n^2 might actually take less time than 1,000,000 n. We must be careful about the constants as well.
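To make the point about constants concrete, here is a small sketch that evaluates the two cost formulas from the paragraph above. The function names are illustrative, and the numbers are abstract operation counts rather than measured timings.

```python
def linear_cost(n):
    """Operation count of the hypothetical 1,000,000 n algorithm."""
    return 1_000_000 * n


def quadratic_cost(n):
    """Operation count of the hypothetical 2 n^2 algorithm."""
    return 2 * n * n


def cheaper(n):
    """Name the formula with the smaller count at input size n."""
    lin, quad = linear_cost(n), quadratic_cost(n)
    if lin == quad:
        return "tie"
    return "linear" if lin < quad else "quadratic"


for n in (10, 1_000, 500_000, 10_000_000):
    print(n, cheaper(n))
```

Setting 1,000,000 n = 2 n^2 gives n = 500,000, so under these formulas the quadratic algorithm does less work for every input smaller than that.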
There are a lot of points to consider:
the second algorithm never increments count: it increments an undefined name, anagrams, so it either returns 0 or raises a NameError.
in the first, temp_str += s[j] is inefficient; the second does not use this string concatenation.
the second is faster because it uses slicing to retrieve pieces of the string, but to be sure you would need to profile the code carefully.
other than this, as noted by @pjs, big O notation is an asymptotic description.

Accessing two following indexes in a loop

I'm new to Python (although I believe this is more of a logic question than a syntax question), and I wonder what the proper way is to access two consecutive objects in a loop.
I can't really provide a specific example without making my explanation too cumbersome, but let's just say that I usually try to tackle this with either [index + 1] or [index - 1], and both are problematic: the former on the last iteration (IndexError), and the latter on the first iteration (it addresses the last position right at the beginning).
Is there a well-known way to address this? I haven't really seen any questions about this floating around, so it made me think it's basic logic I'm missing here.
For example, this piece of code wouldn't have worked had I not wrapped everything in try/except, and the second inner loop works only because it checks for identical characters; otherwise it could have been a mess.
(Explanation for clarity: it receives a string (my_string) and a number (k) and checks whether a sequence of identical characters of length k exists in my_string.)
# ex2 5
my_string = 'abaadddefggg'
sub_my_string = ''
k = 9
count3 = 0
try:
    for index in range(len(my_string)):
        i = 0
        while i < k:
            sub_my_string += my_string[index + i]
            i += 1
        for index2 in range(len(sub_my_string)):
            if sub_my_string[index2] == sub_my_string[index2 - 1]:
                count3 += 1
        if count3 == k:
            break
        else:
            sub_my_string = ""
            count3 = 0
    print(f"For length {k}, found the substring {sub_my_string}!")
except IndexError:
    print(f"Didn't find a substring of length {k}")
Thanks a lot
First off, by definition you need to give special attention to the first or last element, because they really don't have a pair.
Second-off, I personally tend to use list-comprehensions of the following type for these cases -
[something_about_the_two_consecutive_elements(x, y) for x, y in zip(my_list, my_list[1:])]
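As a concrete illustration of the zip pattern above, the sketch below pairs each element with its successor. The helper names consecutive_pairs and has_adjacent_duplicate are made up for this example. Because zip stops at the shorter input, the last element simply gets no pair and no IndexError can occur.

```python
def consecutive_pairs(items):
    """Pair each element with the one that follows it."""
    return list(zip(items, items[1:]))


def has_adjacent_duplicate(text):
    """True if any two neighbouring characters are equal."""
    return any(x == y for x, y in zip(text, text[1:]))


print(consecutive_pairs("abcd"))
print(has_adjacent_duplicate("abaadddefggg"))
```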
And last but not least, the whole code snippet seems like major overkill. How about a simple one-liner -
my_string = 'abaadddefggg'
k = 3
existing_substrings = ([x * k for x in set(my_string) if x * k in my_string])
print(f'For length {k}, found substrings {existing_substrings}')
(To be adapted by one's needs of course)
Explanation:
For each of the unique characters in the string, we can check if a string of that character repeated k times appears in my_string.
set(my_string) gives a set of the unique characters over which we iterate (that's the for x in set(my_string) in the list comprehension).
Taking a character x and multiplying by k gives a string xx...x of length k.
So x * k in my_string tests whether my_string includes the substring xx...x.
Summing up the list-comprehension, we return only characters for which x * k in my_string is True.
If I am understanding what you are trying to achieve, I would approach this differently using string slices and a set.
my_string = "abaadddefggg"
sub_my_string = ""
k = 3
count3 = 0
found = False
for index, _ in enumerate(my_string):
    if index + k > len(my_string):
        continue
    sub_my_string = my_string[index : index + k]
    if len(set(sub_my_string)) == 1:
        found = True
        break
if found:
    print(f"For length {k}, found the substring {sub_my_string}!")
else:
    print(f"Didn't find a substring of length {k}")
Here we use:
enumerate, as this usually signals that we are looking at the indices of an iterable.
A check for whether the slice would take us past the end of the string, as there's no point in checking those positions.
A string slice to take a substring of the string.
A set to see whether all the characters are the same.

How can i improve time or memory result in Python?

I am trying to learn Python and ran into a problem: my course imposes requirements of max time 1 sec and max memory 512 MB. The task is to find the smallest palindrome in alphabetical order. The minimum length for a palindrome is 2.
For example, in ghghwwdkjnccjknjn the palindromes include ghg, cc, ww and njn. We need the smallest, cc or ww; alphabetically c comes before w (as in a dictionary), so the answer is cc. Likewise aba comes before aca (c > b), and so on.
Here is my code:
s = input("")
lst = []
for i in range(0, len(s)):
    for j in range(i + 1, len(s) + 1):
        p = s[i:j]
        if p == p[::-1] and len(p)>=2:
            lst.append(p)
            lst.sort()
        del p
if not lst:
    print("-1")
else:
    #lst.sort()
    print(sorted(lst, key = len)[0])
In this way I get 1.088s 9.89Mb and with lst.sort() moving to the end I get 0.901s 527.30Mb - both bad. How can I do it better? Thank you!
Efficient clean implementation of all improvements mentioned below:
def substrings(string, length):
    for i in range(len(string) - length + 1):
        yield string[i : i+length]
def palindromes(strings):
    for string in strings:
        if string == string[::-1]:
            yield string
def best_palindrome(string):
    for length in 2, 3:
        if result := min(palindromes(substrings(string, length)), default=None):
            return result
    return -1
print(best_palindrome(input()))
Got accepted at Code Forces with 218 ms and 796 KB (using Python 3.7.2).
Perhaps the simplest modification to make it a lot more efficient is to add this as the first thing in the j-loop:
if j - i > 3:
    break
That is, don't check lengths above 3. Because any longer palindrome, like abba or abcba, contains a shorter one, like bb or bcb in those cases. Since you want a shortest anyway, any longer ones are always useless.
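The claim that every longer palindrome contains one of length 2 or 3 can be checked with a small sketch: stripping the first and last character of a palindrome leaves a palindrome, so repeating that step always ends at its 2- or 3-character core. The function name inner_palindrome is illustrative only.

```python
def inner_palindrome(p):
    """Strip matching outer characters until 2 or 3 characters remain."""
    if len(p) < 2 or p != p[::-1]:
        raise ValueError("expected a palindrome of length at least 2")
    while len(p) > 3:
        p = p[1:-1]  # still a palindrome, two characters shorter
    return p


for word in ("abba", "abcba", "abccba"):
    print(word, inner_palindrome(word))
```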
Also, do sort only at the end, not after every append.
With those two changes, I got it accepted at Code Forces (link from your comment below).
Further possible improvements:
Don't sort, just get the minimum.
Start j at i + 2 instead of at i + 1 and remove the len(p)>=2 check.
For memory reduction, don't collect everything in a list (use a set or produce the candidates in a generator).
First try only all substrings of length 2, and only if that fails, try length 3.

with recursion in python calculate the amount of letters strings s and t share

I have to calculate the lingo score of two given strings, either recursively or with a list comprehension. There is one point for every letter that the two strings share.
I tried doing this, but it only works if s[0] is in t but otherwise it doesn't do what it is supposed to and I cannot see what is actually going wrong here.
def count(e, L):
    lc = [1 for x in L if x == e]
    return sum(lc)
def lingo(s, t):
    if s == '' or t == '':
        return 0
    elif s == t:
        return len(s)
    if s[0] in t:
        lc = [count(s[x], t) for x in range(len(t))]
        return sum(lc)
    else:
        #remove s[0] and try again
        lingo(s[:1], t)
these assertions are with the assignment:
assert lingo('diner', 'proza') == 1
assert lingo('beeft', 'euvel') == 2
assert lingo('gattaca', 'aggtccaggcgc') == 5
assert lingo('gattaca', '') == 0
The most obvious mistake
You are missing a return statement on the last line of your code. In addition, s[:1] keeps only the first character of s; the slice that removes s[0] is s[1:]. Instead of:
    else:
        #remove s[0] and try again
        lingo(s[:1], t)
it should be:
    else:
        #remove s[0] and try again
        return lingo(s[1:], t)
A redundancy in your code
The following piece of your code is unnecessary:
    elif s == t:
        return len(s)
Although this returns the correct result, it is a special case and doesn't particularly help the general case. In most cases s and t will be different; and the logic to calculate their amount of shared letters should work also when they are equal.
A mistake in the algorithm logic
This line of your code is highly suspicious:
lc = [count(s[x], t) for x in range(len(t))]
First of all, x is in range of the length of t, but is used as an index for s. If t is longer than s, this will immediately raise an IndexError exception. If t is shorter than or same length as s, then it will not raise an exception, but will most likely return the wrong result.
Note this interesting test case that was provided:
assert lingo('beeft', 'euvel') == 2
The letter 'e' appears twice in 'beeft' and twice in 'euvel', and the result is 2. Yet if you calculate count(s[1], t) + count(s[2], t) you will find the value 4. This is because the first 'e' of s is found twice in t, and the second 'e' of s is also found twice in t.
Janecx's answer provides one way to carefully fix this. You need to understand the logic behind min(s.count(s[0]), t.count(s[0])).
Other python solutions
Right now you absolutely want to use recursion and list comprehensions. In case you are interested in other ways to solve your problem, here are different algorithms.
Sorting the strings (sorting is a powerful tool that makes many problems easy)
def lingo(s, t):
    s = sorted(s) # this doesn't modify the original string, it makes a local copy
    t = sorted(t) # this doesn't modify the original string, it makes a local copy
    result = 0
    i = 0
    j = 0
    while (i < len(s) and j < len(t)):
        if s[i] == t[j]:
            result += 1
            i += 1
            j += 1
        elif s[i] < t[j]:
            i += 1
        else:
            j += 1
    return result
Complexity analysis: sorting takes N log N + M log M operations, where N=len(s) and M=len(t). The whole while loop only takes N + M operations; it is that fast because s and t are sorted in the same order, so we reach an element of s at the same time as the corresponding element in t, and we don't need to compare every element of s against every element of t.
collections.Counter (a python object specifically designed for counting occurrences)
import collections
def lingo(s, t):
    return sum((collections.Counter(s) & collections.Counter(t)).values())
Complexity analysis: this takes N + M operations, where N=len(s) and M=len(t). Counter simply counts the number of occurrences of each letter in s by going through s once, and the number of occurrences of each letter in t by going through t once; then the & operation keeps the minimum of the two counts for each letter (reminiscent of Janecx's min(...) operation); then all the counts are summed up. Summing up only takes as many operations as there are distinct letters, which in the case of a DNA sequence is 4; in the case of an alphabetical word is 26; and in general in a ASCII/Latin-1 string is at most 256.
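To see what the & operation keeps, here is a small sketch that exposes the intermediate Counter for one of the assignment's test cases. The helper name shared_letter_counts is made up for this illustration.

```python
from collections import Counter


def shared_letter_counts(s, t):
    """Multiset intersection: per letter, the smaller of the two counts."""
    return Counter(s) & Counter(t)


common = shared_letter_counts('beeft', 'euvel')
print(dict(common))          # {'e': 2}
print(sum(common.values()))  # 2
```

Letters that appear in only one of the two strings have a minimum count of zero and are dropped from the result, which is why only 'e' survives here.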
Recursive approach from Janecx's answer
Complexity analysis: takes N * M operations, where N=len(s) and M=len(t). This is much slower than the other two approaches, because for every element of s we need to go through every element of t; written iteratively, this would be a for loop nested inside a second for loop.
There you go. What does this code do? If one of the strings is empty, it returns 0. Otherwise, it takes the smaller of the number of occurrences of s[0] in s and in t, and then recurses on the rest of s and on the version of t with every occurrence of s[0] removed, and so on.
def lingo(s, t):
    if s == '' or t == '':
        return 0
    return min(s.count(s[0]), t.count(s[0])) + lingo(s[1:], t.replace(s[0], ''))
assert lingo('diner', 'proza') == 1
assert lingo('beeft', 'euvel') == 2
assert lingo('gattaca', 'aggtccaggcgc') == 5
assert lingo('gattaca', '') == 0

Find repeats with certain length within a string using python

I am trying to use the regex module to find non-overlapping repeats (duplicated sub-strings) within a given string (30 char), with the following requirements:
I am only interested in non-overlapping repeats that are 6-15 char long.
allow 1 mis-match
return the positions for each match
One way I thought of is that for each possible repeat length, let python loop through the 30char string input. For example,
import regex  # the third-party "regex" module, not the standard-library re
string = "ATAGATATATGGCCCGGCCCATAGATATAT" #input
#for 6char repeats, first one in loop would be for the following event:
text = "ATAGAT"
text2 ="(" + text + ")"+ "{e<=1}" #this is to allow 1 mismatch later in regex
string2="ATATGGCCCGGCCCATAGATATAT" #string after excluding text
for x in regex.finditer(text2,string2,overlapped=True):
    print(x.span())
#then still for 6char repeats, I will move on to text = "TAGATA"...
#after 6char, loop again for 7char...
There should be two outputs for this particular string = "ATAGATATATGGCCCGGCCCATAGATATAT". 1. The bold two "ATAGATATAT" + 1 mismatch: "ATAGATATATG" &"CATAGATATAT" with position index returned as (0,10)&(19, 29); 2. "TGGCCC" & "GGCCCA" (need add one mismatch to be at least 6 char), with index (9,14)&(15,20). Numbers can be in a list or table.
I'm sorry that I didn't include a real loop, but I hope the idea is clear... As you can see, this is a very inefficient method, not to mention that it would create redundancy: e.g. 10-char repeats would be counted more than once, because they would also match in the 9-, 8-, 7- and 6-char repeat loops. Moreover, I have a lot of such 30-char strings to work with, so I would appreciate your advice on cleaner methods.
Thank you very much:)
I'd try a straightforward algorithm instead of regex (which is quite confusing in this instance):
s = "ATAGATATATGGCCCGGCCCATAGATATAT"
def fuzzy_compare(s1, s2):
    # sanity check
    if len(s1) != len(s2):
        return False
    diffs = 0
    for a, b in zip(s1, s2):
        if a != b:
            diffs += 1
            if diffs > 1:
                return False
    return True
slen = len(s) # 30
for l in range(6, 16):
    i = 0
    while (i + l * 2) <= slen:
        sub1 = s[i:i+l]
        for j in range(i+l, slen - l):
            sub2 = s[j:j+l]
            if fuzzy_compare(sub1, sub2):
                # checking if this could be partial
                partial = False
                if i + l < j and j + l < slen:
                    extsub1 = s[i:i+l+1]
                    extsub2 = s[j:j+l+1]
                    # if it is partial, we'll get it later in the main loop
                    if fuzzy_compare(extsub1, extsub2):
                        partial = True
                if not partial:
                    print((i, i+l), (j, j+l))
        i += 1
It's a first draft, so feel free to experiment with it. It also seems to be clunky and not optimal, but try running it first - it may be sufficient enough.
