Reducing a string by detecting patterns - python

Given a string, I would like to detect the repeating substrings, and then reduce abab to (ab)2.
For instance, ababababacdecdecdeababab would reduce to (ab)4a(cde)3(ab)3.
The string does not have the same character twice in a row. So, aaab is an invalid string.
Here is the Python that I wrote:
import collections

def superscript(n):
    return "".join(["⁰¹²³⁴⁵⁶⁷⁸⁹"[ord(c)-ord('0')] for c in str(n)])

signature = 'hdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfb'
d = {}
processed = []
for k in range(2, len(signature)):
    i = 0
    j = i + k
    while j <= len(signature):
        repeat_count = 0
        while signature[i:i+k] == signature[j:j+k]:
            repeat_count += 1
            j += k
        if repeat_count > 0 and i not in processed:
            d[i] = [i+k, repeat_count + 1]
            for j in range(i, (i+k)*repeat_count + 1):
                processed.append(j)
            i = j
            j = i + k
        else:
            i += 1
            j = i + k
od = collections.OrderedDict(sorted(d.items()))
output = ''
for k,v in od.items():
    print(k, v)
    output += '(' + signature[k:v[0]] + ')' + superscript(v[1])
This aims to detect repeating substrings of length 2, 3, 4, and so on. I mark the start and the end of each repeating substring in a dict. I also keep a list of the indices of already-processed characters, to avoid replacing (ab)4 with (abab)2 (since the latter would overwrite the beginning index in the dict).
The example string I work with is hdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfb which should output (hd)4(cg)4c(bf)4b(ae)4a(dh)4d(cg)4c(bf)4b(ae)4a(dh)4d(cg)4c(bf)4b(ae)4a(dh)4d(cg)4cbfb.
However, I get this output:
(hd)4(dcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdh)5(cg)4(ea)2(dh)4(hd)2(cg)4
I don't know whether this is a well-known problem, but I couldn't find any resources. I don't mind the time complexity of the algorithm.
Where did I make a mistake?
The algorithm I try to describe looks like this:
First, find the repeating substrings of length 2, then 3, then 4, ..., up to the length of the input string.
Then, do the same operation until there is no repetition at all.
A step-by-step example looks like this:
abcabcefefefghabcdabcdefefefghabcabcefefefghabcdabcdefefefgh
abcabc(ef)³ghabcdabcd(ef)³ghabcabc(ef)³ghabcdabcd(ef)³gh
(abc)²(ef)³ghabcdabcd(ef)³gh(abc)²(ef)³ghabcdabcd(ef)³gh
(abc)²(ef)³gh(abcd)²(ef)³gh(abc)²(ef)³gh(abcd)²(ef)³gh
((abc)²(ef)³gh(abcd)²(ef)³gh)²
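The multi-pass idea described above can be sketched on a list of tokens instead of raw indices, which sidesteps the bookkeeping with d and processed. This is only one possible reading of the description, not the asker's code: the names reduce_once and reduce_string are made up for illustration, and the scan is assumed to be greedy from left to right.

```python
def superscript(n):
    return "".join("⁰¹²³⁴⁵⁶⁷⁸⁹"[int(c)] for c in str(n))

def reduce_once(tokens, k):
    """Collapse adjacent repeats of k-token units, scanning left to right."""
    out = []
    i = 0
    while i < len(tokens):
        unit = tokens[i:i + k]
        count = 1
        # Count how many times the unit repeats back to back.
        while len(unit) == k and tokens[i + count * k:i + (count + 1) * k] == unit:
            count += 1
        if count > 1:
            # Replace the whole run with a single composite token.
            out.append("(" + "".join(unit) + ")" + superscript(count))
            i += count * k
        else:
            out.append(tokens[i])
            i += 1
    return out

def reduce_string(s, min_len=2):
    """Try unit lengths min_len, min_len + 1, ... and repeat until stable."""
    tokens = list(s)
    changed = True
    while changed:
        changed = False
        k = min_len
        while 2 * k <= len(tokens):
            reduced = reduce_once(tokens, k)
            if reduced != tokens:
                tokens = reduced
                changed = True
            k += 1
    return "".join(tokens)

print(reduce_string("ababababacdecdecdeababab"))  # (ab)⁴a(cde)³(ab)³
```

Because an already-reduced group becomes a single token, later passes can nest groups inside larger ones without any index arithmetic on the original string.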

You can use re.sub to match any repeating group of two or more characters and pass a replacement function that produces the format you want:
import re

def superscript(n):
    return "".join(["⁰¹²³⁴⁵⁶⁷⁸⁹"[ord(c)-ord('0')] for c in str(n)])

s = 'hdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfb'
max_length = 5
result = re.sub(
    # Omit the second number in the repetition to match any number of repeating chars: (\w{2,}?)(\1+)
    rf'(\w{{2,{max_length}}}?)(\1+)',
    lambda m: f'({m.group(1)}){superscript(len(m.group(0))//len(m.group(1)))}',
    s
)
print(result) # (hd)⁴(cg)⁴c(bf)⁴b(ae)⁴a(dh)⁴d(cg)⁴c(bf)⁴b(ae)⁴a(dh)⁴d(cg)⁴c(bf)⁴b(ae)⁴a(dh)⁴d(cg)⁴c(bf)⁴b(ae)⁴a(dh)⁴d(cg)⁴c(bf)⁴b(ae)⁴a(dh)⁴d(cg)⁴cbfb
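As a smaller check, the same idea can be wrapped in a function and run on the short example from the question. This is a sketch using the unbounded form of the pattern mentioned in the comment; the names REPEAT and compress are made up for illustration.

```python
import re

def superscript(n):
    return "".join("⁰¹²³⁴⁵⁶⁷⁸⁹"[int(c)] for c in str(n))

# Lazy group of two or more word characters, followed by one or more copies of itself.
REPEAT = re.compile(r"(\w{2,}?)\1+")

def compress(s):
    def replace(m):
        unit = m.group(1)
        # Total match length divided by unit length gives the repeat count.
        return f"({unit}){superscript(len(m.group(0)) // len(unit))}"
    return REPEAT.sub(replace, s)

print(compress("ababababacdecdecdeababab"))  # (ab)⁴a(cde)³(ab)³
```

Note that this is a single pass: the lazy quantifier picks the shortest repeating unit at the leftmost position, and groups are not nested inside larger groups.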

The problem in your code happens when you put together the list of repeating patterns. When you are merging patterns of length 2 and patterns of length 3, you are using patterns that are not compatible with each other.
hdhdhdhd = (hd)4 starts at index 0 and ends at index 7 (included).
(dcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdh)5, which is a correct pattern in your string, starts at index 7 (included).
This means when you merge the two patterns, you get an incorrect end result because the letter at index 7 is shared.
This problem stems from the fact that one pattern is even in length, while the other is odd and their limits are not aligning. So, they don't even overwrite each other in d and you end up with your result.
I think you tried to solve this problem using the dictionary d, keyed by starting index, together with the processed list, but there are still a couple of problems.
for j in range(i, (i+k)*repeat_count + 1): should be for l in range(i, j); otherwise the last index of the pattern is never marked as processed. I also renamed the loop variable to l because j was already in use. This fixes the problem I described above.
Even with that fixed, there is still an issue. You check for patterns starting from length 2 (for k in range(2, len(signature))), so single letters not belonging to any pattern, like the c in (hd)4(cg)4c(bf)4 will never make it in the dictionary and therefore you will still have overlapping patterns with different lengths at those positions.
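The difference between the two range bounds can be seen on the first pattern of the example string. The snippet below is a small illustration with a shortened signature; it counts repeats the same way the question's inner loop does.

```python
signature = "hdhdhdhdcgcgcgcgc"  # shortened version of the example string
i, k = 0, 2

# Count the repeats of signature[i:i+k], as in the question's inner while loop.
j = i + k
repeat_count = 0
while signature[i:i + k] == signature[j:j + k]:
    repeat_count += 1
    j += k

buggy = list(range(i, (i + k) * repeat_count + 1))  # bound used in the question
fixed = list(range(i, j))                            # bound suggested above

print(repeat_count, j)  # 3 8
print(buggy)            # [0, 1, 2, 3, 4, 5, 6]
print(fixed)            # [0, 1, 2, 3, 4, 5, 6, 7]
```

With the original bound, index 7 (the last d of hdhdhdhd) is never marked as processed, which is what lets the longer pattern starting at index 7 slip through.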


Accessing two following indexes in a loop

I'm new to Python (although this is more of a logic question than a syntax question, I believe), and I wonder what the proper way is to access two consecutive objects in a loop.
I can't really provide a specific example without making my explanation too cumbersome, but let's just say that I usually try to tackle this with either [index + 1] or [index - 1], and both are problematic: the first fails on the last iteration (IndexError), and the second addresses the last position right at the first iteration.
Is there a well-known way to address this? I haven't really seen any questions about this, which makes me think I'm missing some basic logic here.
For example, this piece of code would not have worked had I not wrapped everything in try/except. The second inner loop also works only because it checks for identical characters; otherwise it could have been a mess.
(Explanation for clarity: it receives a string (my_string) and a number (k) and checks whether a sequence of identical characters of length k exists in my_string.)
# ex2 5
my_string = 'abaadddefggg'
sub_my_string = ''
k = 9
count3 = 0
try:
    for index in range(len(my_string)):
        i = 0
        while i < k:
            sub_my_string += my_string[index + i]
            i += 1
        for index2 in range(len(sub_my_string)):
            if sub_my_string[index2] == sub_my_string[index2 - 1]:
                count3 += 1
        if count3 == k:
            break
        else:
            sub_my_string = ""
            count3 = 0
    print(f"For length {k}, found the substring {sub_my_string}!")
except IndexError:
    print(f"Didn't find a substring of length {k}")
Thanks a lot
First off, by definition you need to give special attention to the first or last element, because they really don't have a pair.
Second-off, I personally tend to use list-comprehensions of the following type for these cases -
[something_about_the_two_consecutive_elements(x, y) for x, y in zip(my_list, my_list[1:])]
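To make the zip pattern concrete, here is a tiny sketch with made-up data. zip stops at the shorter of its two inputs, so the last element never needs an [index + 1] lookup.

```python
my_list = [3, 5, 5, 9]

# Pair each element with its successor; the result has len(my_list) - 1 items.
pairs = list(zip(my_list, my_list[1:]))
print(pairs)  # [(3, 5), (5, 5), (5, 9)]

# Example use: which neighbouring elements are equal?
equal_neighbours = [x == y for x, y in pairs]
print(equal_neighbours)  # [False, True, False]
```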
And last but not least, the whole code snippet seems like major overkill. How about a simple one-liner -
my_string = 'abaadddefggg'
k = 3
existing_substrings = ([x * k for x in set(my_string) if x * k in my_string])
print(f'For length {k}, found substrings {existing_substrings}')
(To be adapted by one's needs of course)
Explanation:
For each of the unique characters in the string, we can check if a string of that character repeated k times appears in my_string.
set(my_string) gives a set of the unique characters over which we iterate (that's the for x in set(my_string) in the list comprehension).
Taking a character x and multiplying by k gives a string xx...x of length k.
So x * k in my_string tests whether my_string includes the substring xx...x.
Summing up the list-comprehension, we return only characters for which x * k in my_string is True.
If I am understanding what you are trying to achieve, I would approach this differently using string slices and a set.
my_string = "abaadddefggg"
sub_my_string = ""
k = 3
count3 = 0
found = False
for index, _ in enumerate(my_string):
    if index + k > len(my_string):
        continue
    sub_my_string = my_string[index : index + k]
    if len(set(sub_my_string)) == 1:
        found = True
        break
if found:
    print(f"For length {k}, found the substring {sub_my_string}!")
else:
    print(f"Didn't find a substring of length {k}")
Here we:
Use enumerate, as this usually signals that we are looking at the indices of an iterable.
Check whether the slice would take us past the end of the string, as there's no point in checking those positions.
Use a string slice to take a substring of length k.
Use a set to see whether all the characters in it are the same.
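A different technique that avoids index arithmetic altogether is itertools.groupby, which groups consecutive equal characters. This is not what either answer above uses; it is a sketch of an alternative, and the helper names runs and has_run are made up.

```python
from itertools import groupby

def runs(s):
    """Return (character, run length) for each block of consecutive equal characters."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]

def has_run(s, k):
    """True if some character appears at least k times in a row."""
    return any(length >= k for _, length in runs(s))

print(runs("abaadddefggg"))
# [('a', 1), ('b', 1), ('a', 2), ('d', 3), ('e', 1), ('f', 1), ('g', 3)]
print(has_run("abaadddefggg", 3))  # True
print(has_run("abaadddefggg", 4))  # False
```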

Generating all possibilities with letters in python and exploiting results in python3

First, I got this problem: how many words are there (counting all of them, even those that don't make sense) of 5 letters that have at least one I and at least two T's, but no K or Y?
First, I defined the alphabet, which has 24 letters (K and Y aren't counted). After that, I wrote some code to generate all possibilities:
import itertools

alphabet = list(range(1, 24))
for L in range(0, len(alphabet)+1):
    for subset in itertools.permutations(alphabet, L):
        pass  # placeholder: the loop in the question has no body
I don't know how to use the data.
If a “brute force” method is enough for you, this will work:
import string
import itertools

alphabet = string.ascii_uppercase.replace("K", "").replace("Y", "")
count = 0
for word in itertools.product(alphabet, repeat = 5):
    if "I" in word and word.count("T") >= 2:
        count += 1
print(count)
It will print the result 15645.
Note that you have to use itertools.product(), because itertools.permutations() never repeats an element, so it will never contain “T” twice.
Edit: Alternatively, you can calculate the count with a list comprehension or a generator expression. It takes advantage of the fact that boolean True and False are equivalent to integer values 1 and 0, respectively.
count = sum(
    "I" in word and word.count("T") >= 2
    for word in itertools.product(alphabet, repeat = 5)
)
NB: Interestingly, the first solution (explicit for loop with counter += 1) runs about 15 % faster on my computer than the second solution (generator expression with sum()). Both require the same amount of memory (this is expected).
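The brute-force result can be cross-checked with a direct count. This is not part of the answer above, just a sketch of one way to reason about it: choose which t positions hold a T (t from 2 to 5), then fill the remaining 5 - t positions with non-T letters (23 choices each) such that at least one is an I, which is 23^(5-t) minus the fillings with no I, 22^(5-t).

```python
from math import comb  # available from Python 3.8

LETTERS = 24  # alphabet without K and Y
LENGTH = 5

count = sum(
    # positions of the T's  *  fillings of the rest with at least one I and no T
    comb(LENGTH, t) * ((LETTERS - 1) ** (LENGTH - t) - (LETTERS - 2) ** (LENGTH - t))
    for t in range(2, LENGTH + 1)
)
print(count)  # 15645
```

The terms are 10 * 1519 + 10 * 45 + 5 * 1 + 0, which agrees with the enumeration.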

How can I improve time or memory result in Python?

I am trying to learn Python and ran into a problem: my course has requirements of at most 1 second of run time and at most 512 MB of memory. The task is to find the smallest palindrome in alphabetical order. The minimum length of a palindrome is 2.
For example, in ghghwwdkjnccjknjn there are ghg, cc, ww and njn. We need the smallest, so cc or ww; c comes before w in the alphabet (as in a dictionary), so the answer is cc. Likewise, aba comes before aca (c > b), and so on.
Here is my code:
s = input("")
lst = []
for i in range(0, len(s)):
    for j in range(i + 1, len(s) + 1):
        p = s[i:j]
        if p == p[::-1] and len(p)>=2:
            lst.append(p)
            lst.sort()
del p
if not lst:
    print("-1")
else:
    #lst.sort()
    print(sorted(lst, key = len)[0])
In this way I get 1.088s 9.89Mb and with lst.sort() moving to the end I get 0.901s 527.30Mb - both bad. How can I do it better? Thank you!
Efficient clean implementation of all improvements mentioned below:
def substrings(string, length):
    for i in range(len(string) - length + 1):
        yield string[i : i+length]

def palindromes(strings):
    for string in strings:
        if string == string[::-1]:
            yield string

def best_palindrome(string):
    for length in 2, 3:
        if result := min(palindromes(substrings(string, length)), default=None):
            return result
    return -1

print(best_palindrome(input()))
Got accepted at Code Forces with 218 ms and 796 KB (using Python 3.7.2).
Perhaps the simplest modification to make it a lot more efficient is to add this as the first thing in the j-loop:
        if j - i > 3:
            break
That is, don't check lengths above 3. Because any longer palindrome, like abba or abcba, contains a shorter one, like bb or bcb in those cases. Since you want a shortest anyway, any longer ones are always useless.
Also, do sort only at the end, not after every append.
With those two changes, I got it accepted at Code Forces (link from your comment below).
Further possible improvements:
Don't sort, just get the minimum.
Start j at i + 2 instead of at i + 1 and remove the len(p)>=2 check.
For memory reduction, don't collect everything in a list (use a set or produce the candidates in a generator).
First try only all substrings of length 2, and only if that fails, try length 3.
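For comparison with the generator version, here is a sketch of the question's own loop with only the smaller changes applied: the length cutoff, j starting at i + 2, and a single minimum at the end instead of repeated sorting. Wrapping it in a function named smallest_palindrome is my own choice so it can be called on test strings.

```python
def smallest_palindrome(s):
    found = []
    for i in range(len(s)):
        for j in range(i + 2, len(s) + 1):  # substrings of length >= 2
            if j - i > 3:
                break  # longer palindromes always contain one of length 2 or 3
            p = s[i:j]
            if p == p[::-1]:
                found.append(p)
    if not found:
        return "-1"
    # Shortest first, ties broken alphabetically, as in the original output line.
    return min(found, key=lambda p: (len(p), p))

print(smallest_palindrome("ghghwwdkjnccjknjn"))  # cc
```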

Find repeats with certain length within a string using python

I am trying to use the regex module to find non-overlapping repeats (duplicated sub-strings) within a given string (30 char), with the following requirements:
I am only interested in non-overlapping repeats that are 6-15 char long.
allow 1 mis-match
return the positions for each match
One way I thought of is that for each possible repeat length, let python loop through the 30char string input. For example,
import regex

string = "ATAGATATATGGCCCGGCCCATAGATATAT" #input
#for 6char repeats, first one in loop would be for the following event:
text = "ATAGAT"
text2 ="(" + text + ")"+ "{e<=1}" #this is to allow 1 mismatch later in regex
string2="ATATGGCCCGGCCCATAGATATAT" #string after excluding text
for x in regex.finditer(text2,string2,overlapped=True):
    print(x.span())
#then still for 6char repeats, I will move on to text = "TAGATA"...
#after 6char, loop again for 7char...
There should be two outputs for this particular string = "ATAGATATATGGCCCGGCCCATAGATATAT". 1. The two copies of "ATAGATATAT" plus 1 mismatch: "ATAGATATATG" & "CATAGATATAT", with position indices returned as (0,10) & (19,29). 2. "TGGCCC" & "GGCCCA" (one mismatch has to be added to reach at least 6 char), with indices (9,14) & (15,20). The numbers can be in a list or a table.
I'm sorry that I didn't include a real loop, but I hope the idea is clear. As you can see, this is a very inefficient method, not to mention that it would create redundancy: e.g. 10-char repeats will be counted more than once, because they also fit the 9, 8, 7 and 6-char repeat loops. Moreover, I have a lot of such 30-char strings to work with, so I would appreciate your advice on some cleaner methods.
Thank you very much:)
I'd try a straightforward algorithm instead of regexes (which are quite confusing in this instance):
s = "ATAGATATATGGCCCGGCCCATAGATATAT"

def fuzzy_compare(s1, s2):
    # sanity check
    if len(s1) != len(s2):
        return False
    diffs = 0
    for a, b in zip(s1, s2):
        if a != b:
            diffs += 1
            if diffs > 1:
                return False
    return True

slen = len(s) # 30
for l in range(6, 16):
    i = 0
    while (i + l * 2) <= slen:
        sub1 = s[i:i+l]
        # + 1 so that a repeat ending exactly at the end of s is also checked
        for j in range(i+l, slen - l + 1):
            sub2 = s[j:j+l]
            if fuzzy_compare(sub1, sub2):
                # checking if this could be partial
                partial = False
                if i + l < j and j + l < slen:
                    extsub1 = s[i:i+l+1]
                    extsub2 = s[j:j+l+1]
                    # if it is partial, we'll get it later in the main loop
                    if fuzzy_compare(extsub1, extsub2):
                        partial = True
                if not partial:
                    print((i, i+l), (j, j+l))
        i += 1
It's a first draft, so feel free to experiment with it. It also seems to be clunky and not optimal, but try running it first - it may be sufficient enough.
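The mismatch test itself can be written more compactly as a bounded Hamming distance. This is a sketch of an equivalent check, not a change to the algorithm above; the name within_one_mismatch is made up, and the DNA fragments are illustrative inputs.

```python
def within_one_mismatch(s1, s2):
    """True if s1 and s2 have equal length and differ in at most one position."""
    return len(s1) == len(s2) and sum(a != b for a, b in zip(s1, s2)) <= 1

print(within_one_mismatch("ATAGAT", "ATAGAT"))  # True  (identical)
print(within_one_mismatch("ATAGAT", "ATAGAA"))  # True  (one mismatch)
print(within_one_mismatch("ATAGAT", "TTAGAA"))  # False (two mismatches)
```

Unlike fuzzy_compare, this version always scans the whole pair rather than stopping at the second mismatch, which hardly matters for substrings of at most 15 characters.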

Basic indexing recurrences of a substring within a string (python)

I'm working on teaching myself basic programming.
One simple project is to find the index of recurrences of a substring within a string. So for example, in string "abcdefdef" and substring "def", I would like the output to be 3 and 6. I have some code written, but I'm not getting the answers I want. Following is what I have written
Note: I'm aware that there may be easier ways to produce the result by leveraging built-in features or packages of the language, such as regular expressions. I'm also aware that my approach is probably not an optimal algorithm. Nevertheless, at this time I'm only seeking advice on fixing the following logic, rather than on using more idiomatic approaches.
import string

def MIT(String, substring): # "String" is the main string I'm searching within
    String_list = list(String)
    substring_list = list(substring)
    i = 0
    j = 0
    counter = 0
    results = []
    while i < (len(String)-1):
        if [j] == [i]:
            j = j + 1
            i = i + 1
            counter = counter + 1
            if counter == len(substring):
                results.append([i - len(substring)+1])
                counter = 0
                j = 0
                i = i+1
        else:
            counter = 0
            j = 0
            i = i+1
    print(results)
    return
My line of reasoning is as such. I turn the String and substring into a list. That allows for indexing of each letter in the string. I set i and j = 0--these will be my first values in the String and substring index, respectively. I also have a new variable, counter, which I set = to 0. Basically, I'm using counter to count how many times the letter in position [i] is equal to the element in position [j]. If counter equals the length of substring, then I know that [i - len(substring) + 1] is a position where my substring starts, so I add it to a list called results. Then I reset counter and j and continue searching for more substrings.
I know the code is awkward, but I thought that I should still be able to get the answer. Instead I get:
>>> MIT("abcdefghi", "def")
[[3]]
>>> MIT("abcdefghi", "efg")
[[3]]
>>> MIT("abcdefghi", "b")
[[1]]
>>> MIT("abcdefghi", "k")
[[1]]
Any thoughts?
The regular expressions module (re) is much more suited for this task.
Good reference:
http://docs.python.org/howto/regex.html
Also:
http://docs.python.org/library/re.html
EDIT:
A more 'manual' way may be to use slicing
s = len(String)
l = len(substring)
for i in range(s-l+1):
    if String[i:i+l] == substring:
        pass #add to results or whatever
I'm not clear on whether you want to learn some good string searching algorithms, or a straightforward way to do it in Python. If it's the latter, then string.find is your friend. Something like
def find_all_indexes(needle, haystack):
    """Find the index for the beginning of each occurrence of ``needle`` in ``haystack``. Overlaps are allowed."""
    indexes = []
    last_index = haystack.find(needle)
    while -1 != last_index:
        indexes.append(last_index)
        last_index = haystack.find(needle, last_index + 1)
    return indexes

if __name__ == '__main__':
    print(find_all_indexes('is', 'This is my string.'))
While this is a pretty naive approach, it should be easily understandable.
If you're looking for something that uses even less of the standard library (and will actually teach you a fairly common algorithm used when implementing libraries), you could try implementing the Boyer-Moore string search algorithm.
The main problems are the following:
for comparison, use: if String[i] == substring[j]
you increment i twice when you found a match, remove the second increment.
the loop should go till while i < len(String):
and of course it won't find overlapping matches (eg: MIT("aaa", "aa"))
There are some minor "problems", it's not really pythonic, there is no need for building lists, increment is clearer if written i += 1, a useful function should return the values not print them, etc...
If you want proper and fast code, check the classic algorithm book: http://www.amazon.com/Introduction-Algorithms-Thomas-H-Cormen/dp/0262033844 . It has a whole chapter about string search.
If you want a pythonic solution without implementing the whole thing check the other answers.
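Putting the fixes listed above together gives roughly the following. Two further changes are my own assumptions rather than part of the list: the start index is computed as i - len(sub) (without the + 1, since i has already been advanced past the match), and on a mismatch the scan backs up to just after where the partial match began, so that a case like "aab" / "ab" is not missed. The function name find_starts is made up.

```python
def find_starts(text, sub):
    """Start indexes of non-overlapping occurrences of sub in text."""
    results = []
    if not sub:
        return results
    i = 0  # position in text
    j = 0  # position in sub, i.e. number of characters matched so far
    while i < len(text):
        if text[i] == sub[j]:
            i += 1
            j += 1
            if j == len(sub):
                results.append(i - len(sub))  # i already points past the match
                j = 0
        else:
            # Assumption: restart one character after the start of the partial match.
            i = i - j + 1
            j = 0
    return results

print(find_starts("abcdefdef", "def"))  # [3, 6]
print(find_starts("aab", "ab"))         # [1]
print(find_starts("aaa", "aa"))         # [0]  (overlapping matches are not reported)
```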
First, I added some comments to your code to give some tips:
import string

def MIT(String, substring):
    String_list = list(String) # this doesn't need to be done; you can index strings
    substring_list = list(substring)
    i = 0
    j = 0
    counter = 0
    results = []
    while i < (len(String)-1):
        if [j] == [i]: # here you're comparing two one-item lists; you must compare substring[j] and String[i]
            j = j + 1
            i = i + 1
            counter = counter + 1
            if counter == len(substring):
                results.append([i - len(substring)+1]) # remove the brackets; append doesn't require them
                counter = 0
                j = 0
                i = i+1 # remove this
        else:
            counter = 0
            j = 0
            i = i+1
    print(results)
    return
Here's how I would do it without using built-in libraries and such:
def MIT(fullstring, substring):
    results = []
    sub_len = len(substring)
    for i in range(len(fullstring)): # range yields the values from 0 to (len(fullstring) - 1)
        if fullstring[i:i+sub_len] == substring: # this is slice notation; it means take characters i up to (but not including) i + the length of the substring
            results.append(i)
    return results
For finding the position of substring in a string this algorithm will do:
def posnof_substring(string,sub_string):
    l=len(sub_string)
    for i in range(len(string)-len(sub_string)+1):
        if(string[i:i+len(sub_string)] == sub_string ):
            posn=i+1
    return posn
I myself checked this algorithm and it worked!
Based on @Hank Gay's answer, using regex and adding an option to search for whole words:
import re

def find_all(item, text, as_word=False):
    indexes = []
    re_term = rf'\b{item}\b' if as_word else item
    for r in re.finditer(re_term, text.lower()):
        indexes.append(r.start())
    return indexes

if __name__ == '__main__':
    word = 'for'
    text = 'Now for a bold step forward.'
    print(find_all(word, text), find_all(word, text, as_word=True))
