I am trying to learn Python and so I ran into a problem: for my courses there are requirments: max time 1 sec and max memory 512Mb. The task is to find smallest palindrome in alphabetical order. minimal long for palindrome is 2.
for example: ghghwwdkjnccjknjn here are: ghg, cc, ww, njn. We need the smallest - cc or ww - in alphabet c is in front of w (like in dictionaries). aba is in front of aca (c>b) and so on
Here is my code:
s = input("")
lst = []
for i in range(0, len(s)):
for j in range(i + 1, len(s) + 1):
p = s[i:j]
if p == p[::-1] and len(p)>=2:
lst.append(p)
lst.sort()
del p
if not lst:
print("-1")
else:
#lst.sort()
print(sorted(lst, key = len)[0])
In this way I get 1.088s 9.89Mb and with lst.sort() moving to the end I get 0.901s 527.30Mb - both bad. How can I do it better? Thank you!
Efficient clean implementation of all improvements mentioned below:
def substrings(string, length):
for i in range(len(string) - length + 1):
yield string[i : i+length]
def palindromes(strings):
for string in strings:
if string == string[::-1]:
yield string
def best_palindrome(string):
for length in 2, 3:
if result := min(palindromes(substrings(string, length)), default=None):
return result
return -1
print(best_palindrome(input()))
Got accepted at Code Forces with 218 ms and 796 KB (using Python 3.7.2).
Perhaps the simplest modification to make it a lot more efficient is to add this as the first thing in the j-loop:
if j - i > 3:
break
That is, don't check lengths above 3. Because any longer palindrome, like abba or abcba, contains a shorter one, like bb or bcb in those cases. Since you want a shortest anyway, any longer ones are always useless.
Also, do sort only at the end, not after every append.
With those two changes, I got it accepted at Code Forces (link from your comment below).
Further possible improvements:
Don't sort, just get the minimum.
Start j at i + 2 instead of at i + 1 and remove the len(p)>=2 check.
For memory reduction, don't collect everything in a list (use a set or produce the candidates in a generator).
First try only all substrings of length 2, and only if that fails, try length 3.
Related
Given a string, I would like to detect the repeating substrings, and then reduce abab to (ab)2.
For instance, ababababacdecdecdeababab would reduce to (ab)4a(cde)3(ab)3.
The string does not have the same character twice in a row. So, aaab is an invalid string.
Here is the Python that I wrote:
def superscript(n):
return "".join(["⁰¹²³⁴⁵⁶⁷⁸⁹"[ord(c)-ord('0')] for c in str(n)])
signature = 'hdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfb'
d = {}
processed = []
for k in range(2, len(signature)):
i = 0
j = i + k
while j <= len(signature):
repeat_count = 0
while signature[i:i+k] == signature[j:j+k]:
repeat_count += 1
j += k
if repeat_count > 0 and i not in processed:
d[i] = [i+k, repeat_count + 1]
for j in range(i, (i+k)*repeat_count + 1):
processed.append(j)
i = j
j = i + k
else:
i += 1
j = i + k
od = collections.OrderedDict(sorted(d.items()))
output = ''
for k,v in od.items():
print(k, v)
output += '(' + signature[k:v[0]] + ')' + superscript(v[1])
Which aims to detect the repeating substrings of length 2, 3, 4, and so on. I mark the start and the end of a repeating substring by using a dict. I also mark the index of the processed characters by keeping a list to avoid replacing (ab)4 by (abab)2 (since the latter one will overwrite the beginning index in the dict).
The example string I work with is hdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfb which should output (hd)4(cg)4c(bf)4b(ae)4a(dh)4d(cg)4c(bf)4b(ae)4a(dh)4d(cg)4c(bf)4b(ae)4a(dh)4d(cg)4cbfb.
However, I get this output:
(hd)4(dcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdh)5(cg)4(ea)2(dh)4(hd)2(cg)4
I don't know whether this is a well-known problem, but I couldn't find any resources. I don't mind the time complexity of the algorithm.
Where did I make a mistake?
The algorithm I try to describe looks like this:
First, find the repeating substrings of length 2, then 3, then 4, ..., up to the length of the input string.
Then, do the same operation until there is no repetition at all.
A step-by-step example looks like this:
abcabcefefefghabcdabcdefefefghabcabcefefefghabcdabcdefefefgh
abcabc(ef)²ghabcdabcd(ef)²ghabcabc(ef)²ghabcdabcd(ef)²gh
(abc)³(ef)²ghabcdabcd(ef)²gh(abc)³(ef)²ghabcdabcd(ef)²gh
(abc)³(ef)²gh(abcd)²(ef)²gh(abc)³(ef)²gh(abcd)²(ef)²gh
((abc)³(ef)²gh(abcd)²(ef)²gh)²
You can use re.sub to match any repeating two chars and then pass a replacement function that formats the pattern you desire
import re
def superscript(n):
return "".join(["⁰¹²³⁴⁵⁶⁷⁸⁹"[ord(c)-ord('0')] for c in str(n)])
s = 'hdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfb'
max_length = 5
result = re.sub(
rf'(\w{{2,{max_length}}}?)(\1+)', # Omit the second number in the repetition to match any number of repeating chars (\w{2,}?)(\1+)
lambda m: f'({m.group(1)}){superscript(len(m.group(0))//len(m.group(1)))}',
s
)
print(result) # (hd)⁴(cg)⁴c(bf)⁴b(ae)⁴a(dh)⁴d(cg)⁴c(bf)⁴b(ae)⁴a(dh)⁴d(cg)⁴c(bf)⁴b(ae)⁴a(dh)⁴d(cg)⁴c(bf)⁴b(ae)⁴a(dh)⁴d(cg)⁴c(bf)⁴b(ae)⁴a(dh)⁴d(cg)⁴cbfb
The problem in your code happens when you put together the list of repeating patterns. When you are merging patterns of length 2 and patterns of length 3, you are using patterns that are not compatible with each other.
hdhdhdhd = (hd)4 starts at index 0 and ends at index 7 (included).
(dcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdh)5, which is a correct pattern in your string, starts at index 7 (included).
This means when you merge the two patterns, you get an incorrect end result because the letter at index 7 is shared.
This problem stems from the fact that one pattern is even in length, while the other is odd and their limits are not aligning. So, they don't even overwrite each other in d and you end up with your result.
I think you tried to solve this problem using the dictionary d with the starting index as key and with the processed list, but there is still a couple of problems.
for j in range(i, (i+k)*repeat_count + 1): should be for l in range(i, j), otherwise you are skipping the last index of the pattern (j). Also, I changed the loop index to l because j was already used. This fixed the problem I described above.
Even with that fixed, there is still an issue. You check for patterns starting from length 2 (for k in range(2, len(signature))), so single letters not belonging to any pattern, like the c in (hd)4(cg)4c(bf)4 will never make it in the dictionary and therefore you will still have overlapping patterns with different lengths at those positions.
Sorry for the noob question, but is there a less time expensive method to iterate through the input list, as upon submission I receive timeout errors. I tried changing the method of checking for the minimum answer by appending to a list and using min function, but as expected that didn't help at all.
Input:
6 3
3
6
4
2
5
Solution:
with open("cloudin.txt", "r") as input_file:
n, covered = map(int, input_file.readline().split())
ls = [None for i in range(100005)]
for i in range(n-1):
ls[i] = int(input_file.readline().strip())
ans = 1000000001
file = open("cloudout.txt", "w")
for i in range(n-covered):
a = 0
for j in range(covered):
a += ls[i+j]
if a < ans:
ans = a
file.write(str(ans))
output:
11
https://orac2.info/problem/aio18cloud/
Note: Blue + White indicates timeout
The core logic of your code is contained in these lines:
ans = 1000000001
for i in range(n-covered):
a = 0
for j in range(covered):
a += ls[i+j]
if a < ans:
ans = a
Let's break down what this code actually does. For each closed interval (i.e. including the endpoints) [left, right] from the list [0, covered-1], [1, covered], [2, covered+1], ..., [n-covered-1, n-2] (that is, all closed intervals containing exactly covered elements and that are subintervals of [0, n-2]), you are computing the range sum ls[left] + ls[left+1] + ... + ls[right]. Then you set ans to the minimum such range sum.
Currently, that nested loop takes O((n-covered)*covered)) steps, which is O(n^2) if covered is n/2, for example. You want a way to compute that range sum in constant time, eliminating the nested loop, to make the runtime O(n).
The easiest way to do this is with a prefix sum array. In Python, itertools.accumulate() is the standard/simplest way to generate those. To see how this helps:
Original Sum: ls[left] + ls[left+1] + ... + ls[right]
can be rewritten as the difference of prefix sums
(ls[0] + ls[1] + ... + ls[right])
- (ls[0] + ls[1] + ... + ls[left-1])
which is prefix_sum(0, right) - prefix_sum(0, left-1)
where are intervals are written in inclusive notation.
Pulling this into a separate range_sum() function, you can rewrite the original core logic block as:
prefix_sums = list(itertools.accumulate(ls, initial=0))
def range_sum(left: int, right: int) -> int:
"""Given indices left and right, returns the sum of values of
ls in the inclusive interval [left, right].
Equivalent to sum(ls[left : right+1])"""
return prefix_sums[right+1] - prefix_sums[left]
ans = 1000000001
for i in range(n - covered):
a = range_sum(left=i, right=i+covered-1)
if a < ans:
ans = a
The trickiest part of prefix sum arrays is just avoiding off-by-one errors in indexes. Notice that our prefix sum array of the length-n array ls has n+1 elements, since it starts with the empty initial prefix sum of 0, and so we add 1 to array accesses to prefix_sums compared to our formula.
Also, it's possible there may be an off-by-one error in your original code, as the value ls[n-1] is never accessed or used for anything after being set?
I have written a code for a problem and used 2 double-nested loops within the implementation, but this code runs too long with big O as O(n^2).
So I googled for a faster solution for the same problem and found the second code below, which uses a tripled-nested loop with big O as O(n^3).
Is it because the number of computations is higher for the first code, although it has lower big O?
If so can I conclude that big O is not reliable for small "n" values and I have to do experimentation to be able to judge?
Code 1:
def sherlockAndAnagrams(s):
# 1 . Traverse all possible substrings within string
count = 0
lst_char_poss_str = []
len_s = len(s)
for i in range(len_s):#for each char in string
temp_str = ""#a temp string to include characters next to evaluating char
for j in range(i , len_s):#for all possible length of string from that char
temp_str += s[j] #possible substrings from that char
lst_char_poss_str.append(temp_str)#All possible substrings within string
# 2 . Check if any two substrings of equal length are anagrams
new_lst_char_poss_str = []
for i in lst_char_poss_str:
i = list(i)#sorted list, so, "abb" and "bba" will be both "abb"
i.sort()
new_lst_char_poss_str.append(i)#a 2-d list of lists of characters for All possible substrings within string
len_new_s = len(new_lst_char_poss_str)
for i in range (len_new_s - 1):
for j in range (i + 1, len_new_s):
if new_lst_char_poss_str[i] == new_lst_char_poss_str[j]:
count += 1
return(count)
Code 2:
def sherlockAndAnagrams(s):
count = 0
slen = len(s)
for i in range(slen):
for j in range(i+1, slen):
substr = ''.join(sorted(s[i:j]))#Sortingall characters after a char in string
sublen = len(substr)
for x in range(i+1, slen):
if x + sublen > slen: #if index out of range
break
substr2 = ''.join(sorted(s[x:x+sublen]))
if substr == substr2:
anagrams += 1
return count
You might have an algorithm whose running time is 1,000,000 n, because you may be doing some other operations. But you might have an algorithm of this running time. 1,000,000n is O (n), because this is <= some constant time n and you might have some other algorithm with the running time of 2 n^2.
You would say that 1,000,000 n algorithm is better than 2 n^2. The one with the linear running time which is O (n) running time is better than O ( n^2). It is true but in the limit and the limit is achieved very late when n is really large. For small instances this 2 n^2 might actually take less amount of time than your 1,000,000 n. We must be careful about the constants also.
There are lot of point to be considered:
the second algorithm always return 0, nobody increment count
in the first : temp_str += s[j] is not efficient, in the second this string concatenation is not used.
the second is faster because use slicing to retrieve pieces of the string. but to be sure maybe you must do a precise profile of the code.
other than this, as told by #pjs big O notation is an asymptotical explanation.
This is my current approach:
def isPalindrome(s):
if (s[::-1] == s):
return True
return False
def solve(s):
l = len(s)
ans = ""
for i in range(l):
subStr = s[i]
for j in range(i + 1, l):
subStr += s[j]
if (j - i + 1 <= len(ans)):
continue
if (isPalindrome(subStr)):
ans = max(ans, subStr, key=len)
return ans if len(ans) > 1 else s[0]
print(solve(input()))
My code exceeds the time limit according to the auto scoring system. I've already spend some time to look up on Google, all of the solutions i found have the same idea with no optimization or using dynamic programming, but sadly i must and only use brute force to solve this problem. I was trying to break the loop earlier by skipping all the substrings that are shorter than the last found longest palindromic string, but still end up failing to meet the time requirement. Is there any other way to break these loops earlier or more time-efficient approach than the above?
With subStr += s[j], a new string is created over the length of the previous subStr. And with s[::-1], the substring from the previous offset j is copied over and over again. Both are inefficient because strings are immutable in Python and have to be copied as a new string for any string operation. On top of that, the string comparison in s[::-1] == s is also inefficient because you've already compared all of the inner substrings in the previous iterations and need to compare only the outermost two characters at the current offset.
You can instead keep track of just the index and the offset of the longest palindrome so far, and only slice the string upon return. To account for palindromes of both odd and even lengths, you can either increase the index by 0.5 at a time, or double the length to avoid having to deal with float-to-int conversions:
def solve(s):
length = len(s) * 2
index_longest = offset_longest = 0
for index in range(length):
offset = 0
for offset in range(1 + index % 2, min(index, length - index), 2):
if s[(index - offset) // 2] != s[(index + offset) // 2]:
offset -= 2
break
if offset > offset_longest:
index_longest = index
offset_longest = offset
return s[(index_longest - offset_longest) // 2: (index_longest + offset_longest) // 2 + 1]
Solved by using the approach "Expand Around Center", thanks #Maruthi Adithya
This modification of your code should improve performance. You can stop your code when the max possible substring is smaller than your already computed answer. Also, you should start your second loop with j+ans+1 instead of j+1 to avoid useless iterations :
def solve(s):
l = len(s)
ans = ""
for i in range(l):
if (l-i+1 <= len(ans)):
break
subStr = s[i:len(ans)]
for j in range(i + len(ans) + 1, l+1):
if (isPalindrome(subStr)):
ans = subStr
subStr += s[j]
return ans if len(ans) > 1 else s[0]
This is a solution that has a time complexity greater than the solutions provided.
Note: This post is to think about the problem better and does not specifically answer the question. I have taken a mathematical approach to find a time complexity greater than 2^L (where L is size of input string)
Note: This is a post to discuss potential algorithms. You will not find the answer here. And the logic shown here has not been proven extensively.
Do let me know if there is something that I haven't considered.
Approach: Create set of possible substrings. Compare and find the maximum pair* from this set that has the highest possible pallindrome.
Example case with input string: "abc".
In this example, substring set has: "a","b","c","ab","ac","bc","abc".
7 elements.
Comparing each element with all other elements will involve: 7^2 = 49 calculations.
Hence, input size is 3 & no of calculations is 49.
Time Complexity:
First compute time complexity for generating the substring set:
<img src="https://latex.codecogs.com/gif.latex?\sum_{a=1}^{L}\left&space;(&space;C_{a}^{L}&space;\right&space;)" title="\sum_{a=1}^{L}\left ( C_{a}^{L} \right )" />
(The math equation is shown in the code snippet)
Here, we are adding all the different substring size combination from the input size L.
To make it clear: In the above example input size is 3. So we find all the pairs with size =1 (i.e: "a","b","c"). Then size =2 (i.e: "ab","ac","bc") and finally size = 3 (i.e: "abc").
So choosing 1 character from input string = combination of taking L things 1 at a time without repetition.
In our case number of combinations = 3.
This can be mathematically shown as (where a = 1):
<img src="https://latex.codecogs.com/gif.latex?C_{a}^{L}" title="C_{a}^{L}" />
Similarly choosing 2 char from input string = 3
Choosing 3 char from input string = 1
Finding time complexity of palindrome pair from generated set with maximum length:
Size of generated set: N
For this we have to compare each string in set with all other strings in set.
So N*N, or 2 for loops. Hence the final time complexity is:
<img src="https://latex.codecogs.com/gif.latex?\sum_{a=1}^{L}\left&space;(&space;C_{a}^{L}&space;\right&space;)^{2}" title="\sum_{a=1}^{L}\left ( C_{a}^{L} \right )^{2}" />
This is diverging function greater than 2^L for L > 1.
However, there can be multiple optimizations applied to this. For example: there is no need to compare "a" with "abc" as "a" will also be compared with "a". Even if this optimization is applied, it will still have a time complexity > 2^L (For the most cases).
Hope this gave you a new perspective to the problem.
PS: This is my first post.
You should not find the string start from the beginning of that string, but you should start from the middle of it & expand the current string
For example, for the string xyzabccbalmn, your solution will cost ~ 6 * 11 comparison but searching from the middle will cost ~ 11 * 2 + 2 operations
But anyhow, brute-forcing will never ensure that your solution will run fast enough for any arbitrary string.
Try this:
def solve(s):
if len(s)==1:
print(0)
return '1'
if len(s)<=2 and not(isPalindrome(s)):
print (0)
return '1'
elif isPalindrome(s):
print( len(s))
return '1'
elif isPalindrome(s[0:len(s)-1]) or isPalindrome(s[1:len(s)]):
print (len(s)-1)
return '1'
elif len(s)>=2:
solve(s[0:len(s)-1])
return '1'
return 0
I am trying to use the regex module to find non-overlapping repeats (duplicated sub-strings) within a given string (30 char), with the following requirements:
I am only interested in non-overlapping repeats that are 6-15 char long.
allow 1 mis-match
return the positions for each match
One way I thought of is that for each possible repeat length, let python loop through the 30char string input. For example,
string = "ATAGATATATGGCCCGGCCCATAGATATAT" #input
#for 6char repeats, first one in loop would be for the following event:
text = "ATAGAT"
text2 ="(" + text + ")"+ "{e<=1}" #this is to allow 1 mismatch later in regex
string2="ATATGGCCCGGCCCATAGATATAT" #string after excluding text
for x in regex.finditer(text2,string2,overlapped=True):
print x.span()
#then still for 6char repeats, I will move on to text = "TAGATA"...
#after 6char, loop again for 7char...
There should be two outputs for this particular string = "ATAGATATATGGCCCGGCCCATAGATATAT". 1. The bold two "ATAGATATAT" + 1 mismatch: "ATAGATATATG" &"CATAGATATAT" with position index returned as (0,10)&(19, 29); 2. "TGGCCC" & "GGCCCA" (need add one mismatch to be at least 6 char), with index (9,14)&(15,20). Numbers can be in a list or table.
I'm sorry that I didn't include a real loop, but I hope the idea is clear...As you can see, this is a very less efficient method, not to mention it would create redundancy --- e.g. 10char repeats will be counted more than once, because it would suit for 9,8,7 and 6 char repeats loops. Moreover, I have a lot of such 30 char strings to work with, so I would appreciate your advice on some cleaner methods.
Thank you very much:)
I'd try straightforward algorithm instead of regex (which are quite confusing in this instance);
s = "ATAGATATATGGCCCGGCCCATAGATATAT"
def fuzzy_compare(s1, s2):
# sanity check
if len(s1) != len(s2):
return False
diffs = 0
for a, b in zip(s1, s2):
if a != b:
diffs += 1
if diffs > 1:
return False
return True
slen = len(s) # 30
for l in range(6, 16):
i = 0
while (i + l * 2) <= slen:
sub1 = s[i:i+l]
for j in range(i+l, slen - l):
sub2 = s[j:j+l]
if fuzzy_compare(sub1, sub2):
# checking if this could be partial
partial = False
if i + l < j and j + l < slen:
extsub1 = s[i:i+l+1]
extsub2 = s[j:j+l+1]
# if it is partial, we'll get it later in the main loop
if fuzzy_compare(extsub1, extsub2):
partial = True
if not partial:
print (i, i+l), (j, j+l)
i += 1
It's a first draft, so feel free to experiment with it. It also seems to be clunky and not optimal, but try running it first - it may be sufficient enough.