Basic indexing recurrences of a substring within a string (python) - python

I'm working on teaching myself basic programming.
One simple project is to find the index of each occurrence of a substring within a string. For example, given the string "abcdefdef" and the substring "def", I would like the output to be 3 and 6. I have some code written, but I'm not getting the answers I want. Here is what I have written:
Note: I'm aware that there may be easier ways to produce the result by leveraging built-in features or packages of the language, such as regular expressions. I'm also aware that my approach is probably not an optimal algorithm. Nevertheless, at this time I'm only seeking advice on fixing the following logic, rather than on more idiomatic approaches.
import string

def MIT(String, substring): # "String" is the main string I'm searching within
    String_list = list(String)
    substring_list = list(substring)
    i = 0
    j = 0
    counter = 0
    results = []
    while i < (len(String)-1):
        if [j] == [i]:
            j = j + 1
            i = i + 1
            counter = counter + 1
            if counter == len(substring):
                results.append([i - len(substring)+1])
                counter = 0
                j = 0
                i = i+1
        else:
            counter = 0
            j = 0
            i = i+1
    print results
    return
My line of reasoning is as such. I turn the String and substring into a list. That allows for indexing of each letter in the string. I set i and j = 0--these will be my first values in the String and substring index, respectively. I also have a new variable, counter, which I set = to 0. Basically, I'm using counter to count how many times the letter in position [i] is equal to the element in position [j]. If counter equals the length of substring, then I know that [i - len(substring) + 1] is a position where my substring starts, so I add it to a list called results. Then I reset counter and j and continue searching for more substrings.
I know the code is awkward, but I thought that I should still be able to get the answer. Instead I get:
>>> MIT("abcdefghi", "def")
[[3]]
>>> MIT("abcdefghi", "efg")
[[3]]
>>> MIT("abcdefghi", "b")
[[1]]
>>> MIT("abcdefghi", "k")
[[1]]
Any thoughts?

The regular expressions module (re) is much more suited for this task.
Good reference:
http://docs.python.org/howto/regex.html
Also:
http://docs.python.org/library/re.html
EDIT:
A more 'manual' way may be to use slicing
s = len(String)
l = len(substring)
for i in range(s-l+1):
    if String[i:i+l] == substring:
        pass #add to results or whatever
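For comparison, the regular-expression route mentioned at the top of this answer might look roughly like the sketch below. The function name find_all_regex is made up for illustration. It escapes the needle so that it is matched literally, and wraps it in a lookahead so that overlapping occurrences are reported too.

```python
import re

def find_all_regex(needle, haystack):
    """Return the start index of every (possibly overlapping) occurrence."""
    # re.escape makes characters such as '.' match literally.
    # The lookahead (?=...) consumes nothing, so overlaps are found as well.
    pattern = '(?=' + re.escape(needle) + ')'
    return [m.start() for m in re.finditer(pattern, haystack)]

print(find_all_regex('def', 'abcdefdef'))
```

If overlapping matches are not wanted, dropping the lookahead and searching for re.escape(needle) directly should give non-overlapping results.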

I'm not clear on whether you want to learn some good string searching algorithms or a straightforward way to do it in Python. If it's the latter, then the find method of strings (str.find) is your friend. Something like
def find_all_indexes(needle, haystack):
    """Find the index for the beginning of each occurrence of ``needle`` in ``haystack``. Overlaps are allowed."""
    indexes = []
    last_index = haystack.find(needle)
    while -1 != last_index:
        indexes.append(last_index)
        last_index = haystack.find(needle, last_index + 1)
    return indexes

if __name__ == '__main__':
    print find_all_indexes('is', 'This is my string.')
While this is a pretty naive approach, it should be easily understandable.
If you're looking for something that uses even less of the standard library (and will actually teach you a fairly common algorithm used when implementing libraries), you could try implementing the Boyer-Moore string search algorithm.
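As a starting point for that exercise, here is a minimal sketch of the Horspool simplification of Boyer-Moore. It is not the full algorithm (it uses only a bad-character style shift table and omits the good-suffix rule), and the name horspool_find_all is invented for this example.

```python
def horspool_find_all(needle, haystack):
    """Return start indexes of all occurrences (overlaps included)."""
    m, n = len(needle), len(haystack)
    if m == 0 or m > n:
        return []
    # For each character of the needle except the last one, record how far
    # the window may shift when that character is aligned with the window's
    # last position. Later indexes overwrite earlier ones, so the smallest
    # (safe) shift wins.
    shift = {ch: m - 1 - idx for idx, ch in enumerate(needle[:-1])}
    results = []
    i = 0
    while i <= n - m:
        if haystack[i:i + m] == needle:
            results.append(i)
        # Characters that do not occur in needle[:-1] allow a full shift of m.
        i += shift.get(haystack[i + m - 1], m)
    return results

print(horspool_find_all('is', 'This is my string.'))
```

The benefit over the find-based loop only shows up on long inputs with long needles, where many positions can be skipped without being inspected.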

The main problems are the following:
for the comparison, use: if String[i] == substring[j]
you increment i twice when you find a match; remove the second increment.
the loop should run to the end of the string: while i < len(String):
and of course it won't find overlapping matches (e.g. MIT("aaa", "aa"))
There are also some minor "problems": it's not really Pythonic, there is no need to build lists, an increment is clearer when written i += 1, a useful function should return its values rather than print them, etc.
If you want proper and fast code, check the classic algorithm book: http://www.amazon.com/Introduction-Algorithms-Thomas-H-Cormen/dp/0262033844 . It has a whole chapter about string search.
If you want a pythonic solution without implementing the whole thing check the other answers.
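Putting the fixes listed above together, a character-by-character version could look like the sketch below. The function and variable names are illustrative, not the asker's. One change goes beyond the list: after a failed partial match the search restarts one position after the candidate start, because otherwise an input such as "ddef" would skip over the real match.

```python
def find_occurrences(text, pattern):
    """Return start indexes of non-overlapping occurrences of pattern."""
    results = []
    if not pattern:
        return results
    start = 0  # candidate start of a match in text
    j = 0      # number of pattern characters matched so far
    while start + j < len(text):
        if text[start + j] == pattern[j]:
            j += 1
            if j == len(pattern):
                results.append(start)
                start += len(pattern)  # jump past the match (no overlaps)
                j = 0
        else:
            start += 1  # retry from the next candidate start
            j = 0
    return results

print(find_occurrences("abcdefdef", "def"))
```

Replacing start += len(pattern) with start += 1 should make it report overlapping matches as well.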

First, I added some comments to your code to give some tips:
import string

def MIT(String, substring):
    String_list = list(String) # this doesn't need to be done; you can index strings
    substring_list = list(substring)
    i = 0
    j = 0
    counter = 0
    results = []
    while i < (len(String)-1):
        if [j] == [i]: # here you're comparing two one-item lists; you must compare substring[j] and String[i]
            j = j + 1
            i = i + 1
            counter = counter + 1
            if counter == len(substring):
                results.append([i - len(substring)+1]) # remove the brackets; append doesn't require them
                counter = 0
                j = 0
                i = i+1 # remove this
        else:
            counter = 0
            j = 0
            i = i+1
    print results
    return
Here's how I would do it without using built-in libraries and such:
def MIT(fullstring, substring):
    results = []
    sub_len = len(substring)
    for i in range(len(fullstring)): # range returns a list of values from 0 to (len(fullstring) - 1)
        if fullstring[i:i+sub_len] == substring: # this is slice notation; it means take characters i up to (but not including) i + the length of the substring
            results.append(i)
    return results

For finding the position of substring in a string this algorithm will do:
def posnof_substring(string,sub_string):
    l=len(sub_string)
    for i in range(len(string)-len(sub_string)+1):
        if(string[i:i+len(sub_string)] == sub_string ):
            posn=i+1
    return posn
I myself checked this algorithm and it worked!

Based on @Hank Gay's answer, using a regex and adding an option to search for whole words only.
import re

def find_all(item, text, as_word=False):
    indexes = []
    re_term = rf'\b{item}\b' if as_word else item
    for r in re.finditer(re_term, text.lower()):
        indexes.append(r.start())
    return indexes

if __name__ == '__main__':
    word = 'for'
    text = 'Now for a bold step forward.'
    print(find_all(word, text), find_all(word, text, as_word=True))

Related

Algorithm verification: Get all the combinaison of possible word

I wanted to know whether the algorithm I wrote below in Python is correct.
My goal is an algorithm that prints/finds all the possible words that can be formed using the characters from '!' (decimal value 33) to '~' (decimal value 126) in the ASCII table:
Here is the code, using recursion:
import sys

byteWord = bytearray(b'\x20') # Hex = '\x21' & Dec = '33' & Char = '!'
cntVerif = 0 # Test

def comb_fct(bytes_arr, cnt: int):
    global cntVerif # Test
    if len(bytes_arr) > 3: # Test
        print(f'{cntVerif+1}:TEST END')
        sys.exit()
    if bytes_arr[cnt] == 126:
        if cnt == len(bytes_arr) or len(bytes_arr) == 1:
            bytes_arr.insert(0, 32)
        bytes_arr[cnt] = 32
        cnt += 1
        cntVerif += 1 # Test
        print(f'{cntVerif}:if bytes_arr[cnt] == 126: \n\tbytes_arr = {bytes_arr}') # Test
        comb_fct(bytes_arr, cnt)
    if cnt == -1 or cnt == len(bytes_arr)-1:
        bytes_arr[cnt] = bytes_arr[cnt] + 1
        cntVerif += 1 # Test
        print(f'{cntVerif}:if cnt==-1: \n\tbytes_arr = {bytes_arr}') # Test
        comb_fct(bytes_arr, cnt=-1) # index = -1 means last index
    bytes_arr[cnt] = bytes_arr[cnt] + 1
    cntVerif += 1 # Test
    print(f'{cntVerif}:None if: \n\tbytes_arr={bytes_arr}') # Test
    comb_fct(bytes_arr, cnt+1)

comb_fct(byteWord, -1)
Thank you for your help. Python only allows a limited recursion depth (996 on my computer), so for example I can't verify whether my algorithm produces all the words of length 3 that can be formed from the range of characters described above.
Of course, if anyone has a better idea for writing this algorithm (a faster one, for example), I will be happy to read it.
Although you might be able to tweak this a bit, I think the code below is close to the most efficient solution to your problem, which I take to be "generate all possible sequences of maximum length N from a given set of characters". That might be a bit more general than you need, since your set of characters is fixed, but the general solution is more useful and little overhead is added.
Note that the function is written as a generator, using functions from the itertools standard library module. Itertools is described as a set of "functions creating iterators for efficient looping" (emphasis added), and it indeed is. Generators are one of Python's great features, since they allow you to easily and efficiently iterate over complex sequences. If you want to write efficient and "pythonic" code, you should familiarise yourself with these concepts (as well as other essential features, such as comprehensions). So I'm not going to explain these features further; please read the tutorial sections for details.
So here's the simple solution:
from itertools import product, chain

def genseq(maxlen, chars):
    return map(''.join,
               chain.from_iterable(product(chars, repeat=i)
                                   for i in range(maxlen+1)))

# Example usage:
chars = ''.join(chr(i) for i in range(33, 127))
for word in genseq(4, chars):
    pass  # Do something with word
There are 78,914,411 possible words (including the empty word); the above generates all of them in 7 seconds on my laptop. Much of that time is spent creating (and garbage collecting) those strings; you might well be able to do better using a bytearray and recycling it for each generated word. I didn't try that.
For the record, here's a simpler way of "unindexing" an enumeration of such strings. The enumeration starts with the empty word, followed by all 1-character words, then 2-character words, and so on. This ordering makes it unnecessary to specify the length (or even maximum length) of the resulting string.
def unindex(i, chars):
    v = []
    n = len(chars)
    while i > 0:
        i -= 1
        v.append(i % n)
        i //= n
    return ''.join(chars[j] for j in v[::-1])

# Example to generate the same words as above:
# chars as above
index_limit = (len(chars) ** 5 - 1) // (len(chars) - 1)
for i in range(0, index_limit):
    word = unindex(i, chars)
    # Do something with word
Again, you can probably speed this up a bit by using a recycled bytearray. As written above, it took about two minutes, sixteen times as long as my first version.
Note that using bytearrays in the way you do in your answer does not significantly speed things up, because it creates a new bytearray each time. In order to achieve the savings, you have to use a single bytearray for the entire generations, modifying it rather than recreating it. That's more awkward in practice, because it means that if you need to keep a generated word around for later, perhaps because it passed some test, you must copy it. It's easy to forget that, and the resulting bug can be very hard to track down.
You don't need recursion here. Consider your word as an n-digit number, where the digits are the ASCII symbols in the range of interest ([!..~]). Start with the smallest one (all !) and increment it by 1 until you reach the largest (all ~).
To increment the long number, add 1 to the least significant byte. If that byte was already ~, reset it to ! and try to increment the next one, and so on.
Keep in mind that the number of words is huge. There are 94 ** n n-letter words. For n == 4 there are 78074896 of them.
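The increment-with-carry idea described above can be sketched as a generator that reuses a single bytearray. The function name and the lo/hi parameters are assumptions made for this example; the defaults correspond to '!' (33) and '~' (126).

```python
def words_of_length(n, lo=33, hi=126):
    """Yield every n-character word over the byte range [lo, hi], in order."""
    word = bytearray([lo]) * n  # smallest word: all '!' by default
    while True:
        yield word.decode('ascii')  # decode makes a copy, so reuse is safe
        pos = n - 1
        # Carry: every trailing byte that is already at hi wraps back to lo.
        while pos >= 0 and word[pos] == hi:
            word[pos] = lo
            pos -= 1
        if pos < 0:
            return  # the word was all hi, so the enumeration is complete
        word[pos] += 1

# Small demonstration over the alphabet 'a'..'c'
print(list(words_of_length(2, ord('a'), ord('c'))))
```

No recursion is involved, so the recursion limit mentioned in the question does not come into play.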
EXPLANATION:
To solve this problem, I think I have found a more elegant and faster way that does not use a recursive algorithm.
Complexity:
I also think it is optimal in time and space.
In time it is O(n), where n is the total number of possible combinations, which can be very high, and in theory O(1) in space. In practice, because of how Python works, my code creates a lot of bytearrays. This can be corrected with a light modification. But for better code, see the solution posted by @ricci, which I marked as the accepted answer.
Mathematical principle used:
I am using the fact that there is a bijection between numbers written in base 10 and numbers written in base 94.
Each number in base 94 can be written as a unique sequence of characters, using the characters in the range [33, 126] (decimal values) of the ASCII code as digits.
Example of base conversion:
https://www.rapidtables.com/convert/number/decimal-to-hex.html
The operator '//' is the quotient operator and the operator '%' is the modulo operator.
I will be happy if anyone can confirm that my solution is correct. :-)
ALGORITHM
VERSION 1:
Use this version if you are NOT interested in the words starting with '!'.
For example, for length 2, you are NOT interested in words of the form '!!' ... '!A' '!B' ... '!R' ... '!~' (since in our base '!' is equivalent to zero).
# Get all ascii relevant character in a list
asciiList = []
for c in (chr(i) for i in range(33, 127)):
    asciiList.append(c)
print(f'ascii List: \n{asciiList} \nlist length: {len(asciiList)}')

def base10_to_base94_fct(int_to_convert: int) -> str:
    sol_str = ''
    loop_condition = True
    while loop_condition is True:
        quo = int_to_convert // 94
        mod = int_to_convert % 94
        sol_str = asciiList[mod] + sol_str
        int_to_convert = quo
        if quo == 0:
            loop_condition = False
    return sol_str

# test = base10_to_base94_fct(94**2-1)
# print(f'TEST result: {test}')

def comb_fct(word_length: int) -> None:
    max_iter = 94**word_length
    cnt = 1
    while cnt < max_iter:
        str_tmp = base10_to_base94_fct(cnt)
        cnt += 1
        print(f'{cnt}: Current word check:{str_tmp}')

# Test
comb_fct(3)
VERSION 2:
Use this version if you are interested in the words starting with '!'.
For example, for length 2, you are interested in words of the form '!!' ... '!A' '!B' ... '!R' ... '!~' (since in our base '!' is equivalent to zero).
# Get all ascii relevant character in a list
asciiList = []
for c in (chr(i) for i in range(33, 127)):
    asciiList.append(c)
print(f'The word should contain only the character in the following ascii List: \n{asciiList} \nlist length: {len(asciiList)}')

def base10_to_base94_fct(int_to_convert: int, str_length: int) -> bytearray:
    sol_str = bytearray(b'\x21') * str_length
    digit_nbr = str_length-1
    loop_condition = True
    while loop_condition is True:
        quo = int_to_convert // 94
        mod = int_to_convert % 94
        sol_str[digit_nbr] = 33 + mod
        digit_nbr -= 1
        int_to_convert = quo
        if digit_nbr == -1:
            loop_condition = False
    return sol_str

def comb_fct(max_word_length: int) -> None:
    max_iter_abs = (94/93) * (94**max_word_length-1) # sum of a geometric series: 94 + 94^2 + 94^3 + 94^4 + ... + 94^N
    max_iter_rel = 94
    word_length = 1
    cnt_rel = 0 # rel = relative
    cnt_abs = 0 # abs = absolute
    while cnt_rel < max_iter_rel**word_length and cnt_abs < max_iter_abs:
        str_tmp = base10_to_base94_fct(cnt_rel, word_length)
        print(f'{cnt_abs}:Current word test:{str_tmp}.')
        print(f'cnt_rel = {cnt_rel} and cnt_abs={cnt_abs}')
        if str_tmp == bytearray(b'\x7e') * word_length:
            word_length += 1
            cnt_rel = 0
            continue
        cnt_rel += 1
        cnt_abs += 1

comb_fct(2) # Test

Reducing a string by detecting patterns

Given a string, I would like to detect the repeating substrings, and then reduce abab to (ab)2.
For instance, ababababacdecdecdeababab would reduce to (ab)4a(cde)3(ab)3.
The string does not have the same character twice in a row. So, aaab is an invalid string.
Here is the Python that I wrote:
import collections

def superscript(n):
    return "".join(["⁰¹²³⁴⁵⁶⁷⁸⁹"[ord(c)-ord('0')] for c in str(n)])
signature = 'hdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfb'
d = {}
processed = []
for k in range(2, len(signature)):
    i = 0
    j = i + k
    while j <= len(signature):
        repeat_count = 0
        while signature[i:i+k] == signature[j:j+k]:
            repeat_count += 1
            j += k
        if repeat_count > 0 and i not in processed:
            d[i] = [i+k, repeat_count + 1]
            for j in range(i, (i+k)*repeat_count + 1):
                processed.append(j)
            i = j
            j = i + k
        else:
            i += 1
            j = i + k
od = collections.OrderedDict(sorted(d.items()))
output = ''
for k,v in od.items():
    print(k, v)
    output += '(' + signature[k:v[0]] + ')' + superscript(v[1])
Which aims to detect the repeating substrings of length 2, 3, 4, and so on. I mark the start and the end of a repeating substring by using a dict. I also mark the index of the processed characters by keeping a list to avoid replacing (ab)4 by (abab)2 (since the latter one will overwrite the beginning index in the dict).
The example string I work with is hdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfb which should output (hd)4(cg)4c(bf)4b(ae)4a(dh)4d(cg)4c(bf)4b(ae)4a(dh)4d(cg)4c(bf)4b(ae)4a(dh)4d(cg)4cbfb.
However, I get this output:
(hd)4(dcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdh)5(cg)4(ea)2(dh)4(hd)2(cg)4
I don't know whether this is a well-known problem, but I couldn't find any resources. I don't mind the time complexity of the algorithm.
Where did I make a mistake?
The algorithm I try to describe looks like this:
First, find the repeating substrings of length 2, then 3, then 4, ..., up to the length of the input string.
Then, do the same operation until there is no repetition at all.
A step-by-step example looks like this:
abcabcefefefghabcdabcdefefefghabcabcefefefghabcdabcdefefefgh
abcabc(ef)²ghabcdabcd(ef)²ghabcabc(ef)²ghabcdabcd(ef)²gh
(abc)³(ef)²ghabcdabcd(ef)²gh(abc)³(ef)²ghabcdabcd(ef)²gh
(abc)³(ef)²gh(abcd)²(ef)²gh(abc)³(ef)²gh(abcd)²(ef)²gh
((abc)³(ef)²gh(abcd)²(ef)²gh)²
You can use re.sub to match any repeating two chars and then pass a replacement function that formats the pattern you desire
import re
def superscript(n):
    return "".join(["⁰¹²³⁴⁵⁶⁷⁸⁹"[ord(c)-ord('0')] for c in str(n)])
s = 'hdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfb'
max_length = 5
result = re.sub(
    rf'(\w{{2,{max_length}}}?)(\1+)', # Omit the second number in the repetition to match any number of repeating chars (\w{2,}?)(\1+)
    lambda m: f'({m.group(1)}){superscript(len(m.group(0))//len(m.group(1)))}',
    s
)
print(result) # (hd)⁴(cg)⁴c(bf)⁴b(ae)⁴a(dh)⁴d(cg)⁴c(bf)⁴b(ae)⁴a(dh)⁴d(cg)⁴c(bf)⁴b(ae)⁴a(dh)⁴d(cg)⁴c(bf)⁴b(ae)⁴a(dh)⁴d(cg)⁴c(bf)⁴b(ae)⁴a(dh)⁴d(cg)⁴cbfb
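To see the same substitution on the shorter string from the question, the approach can be wrapped in a small function. The name reduce_repeats and its max_length default are assumptions for this sketch; the regex itself is the one used above.

```python
import re

def superscript(n):
    return "".join("⁰¹²³⁴⁵⁶⁷⁸⁹"[int(c)] for c in str(n))

def reduce_repeats(s, max_length=5):
    # Lazily pick the shortest unit of 2..max_length word characters that is
    # immediately followed by one or more copies of itself.
    pattern = rf'(\w{{2,{max_length}}}?)(\1+)'
    return re.sub(
        pattern,
        lambda m: f'({m.group(1)}){superscript(len(m.group(0)) // len(m.group(1)))}',
        s,
    )

print(reduce_repeats('ababababacdecdecdeababab'))  # (ab)⁴a(cde)³(ab)³
```

This is a single pass, so it does not produce the nested reductions from the question's step-by-step example.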
The problem in your code happens when you put together the list of repeating patterns. When you are merging patterns of length 2 and patterns of length 3, you are using patterns that are not compatible with each other.
hdhdhdhd = (hd)4 starts at index 0 and ends at index 7 (included).
(dcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdh)5, which is a correct pattern in your string, starts at index 7 (included).
This means when you merge the two patterns, you get an incorrect end result because the letter at index 7 is shared.
This problem stems from the fact that one pattern is even in length, while the other is odd and their limits are not aligning. So, they don't even overwrite each other in d and you end up with your result.
I think you tried to solve this problem using the dictionary d with the starting index as key and with the processed list, but there is still a couple of problems.
for j in range(i, (i+k)*repeat_count + 1): should be for l in range(i, j), otherwise you are skipping the last index of the pattern (j). Also, I changed the loop index to l because j was already used. This fixed the problem I described above.
Even with that fixed, there is still an issue. You check for patterns starting from length 2 (for k in range(2, len(signature))), so single letters not belonging to any pattern, like the c in (hd)4(cg)4c(bf)4 will never make it in the dictionary and therefore you will still have overlapping patterns with different lengths at those positions.

Accessing two following indexes in a loop

I'm new to Python (although I believe this is more of a logic question than a syntax question), and I wonder what the proper way is to access two consecutive objects in a loop.
I can't really provide a specific example without getting too cumbersome with my explanation but let's just say that I usually try to tackle this with either [index + 1] or [index - 1] and both are problematic when it comes to either the last (IndexError) or first (addresses the last position right at the beginning) iterations respectively.
Is there a well known way to address this? I haven't really seen any questions regarding this floating around so it made me think it's basic logic I'm missing here.
For example, this piece of code would not have worked had I not wrapped everything in try/except, and the second inner loop works only because it checks for identical characters; otherwise it could have been a mess.
(Explanation for clarity: it receives a string (my_string) and a number (k) and checks whether a sequence of identical characters of length k exists in my_string.)
# ex2 5
my_string = 'abaadddefggg'
sub_my_string = ''
k = 9
count3 = 0
try:
    for index in range(len(my_string)):
        i = 0
        while i < k:
            sub_my_string += my_string[index + i]
            i += 1
        for index2 in range(len(sub_my_string)):
            if sub_my_string[index2] == sub_my_string[index2 - 1]:
                count3 += 1
        if count3 == k:
            break
        else:
            sub_my_string = ""
            count3 = 0
    print(f"For length {k}, found the substring {sub_my_string}!")
except IndexError:
    print(f"Didn't find a substring of length {k}")
Thanks a lot
First off, by definition you need to give special attention to the first or last element, because they really don't have a pair.
Second-off, I personally tend to use list-comprehensions of the following type for these cases -
[something_about_the_two_consecutive_elements(x, y) for x, y in zip(my_list, my_list[1:])]
And last but not least, the whole code snippet seems like major overkill. How about a simple one-liner -
my_string = 'abaadddefggg'
k = 3
existing_substrings = ([x * k for x in set(my_string) if x * k in my_string])
print(f'For length {k}, found substrings {existing_substrings}')
(To be adapted by one's needs of course)
Explanation:
For each of the unique characters in the string, we can check if a string of that character repeated k times appears in my_string.
set(my_string) gives a set of the unique characters over which we iterate (that's the for x in set(my_string) in the list comprehension).
Taking a character x and multiplying by k gives a string xx...x of length k.
So x * k in my_string tests whether my_string includes the substring xx...x.
Summing up the list-comprehension, we return only characters for which x * k in my_string is True.
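To make the zip(my_list, my_list[1:]) idiom mentioned earlier in this answer concrete, here is a small sketch that walks over consecutive pairs without touching index + 1 or index - 1. The helper name longest_run is invented for this example.

```python
def longest_run(text):
    """Length of the longest run of identical consecutive characters."""
    if not text:
        return 0
    best = current = 1
    # zip stops at the shorter sequence, so the last character is never
    # paired with a non-existent successor and no IndexError can occur.
    for prev, cur in zip(text, text[1:]):
        current = current + 1 if prev == cur else 1
        best = max(best, current)
    return best

my_string = 'abaadddefggg'
k = 3
print(longest_run(my_string) >= k)
```

The first element still gets its special treatment, but only through the initial value of current rather than through an index check inside the loop.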
If I am understanding what you are trying to achieve, I would approach this differently using string slices and a set.
my_string = "abaadddefggg"
sub_my_string = ""
k = 3
count3 = 0
found = False
for index, _ in enumerate(my_string):
    if index + k > len(my_string):
        continue
    sub_my_string = my_string[index : index + k]
    if len(set(sub_my_string)) == 1:
        found = True
        break
if found:
    print(f"For length {k}, found the substring {sub_my_string}!")
else:
    print(f"Didn't find a substring of length {k}")
Here we:
use enumerate, as this usually signals that we are looking at the indices of an iterable;
check whether the slice would take us past the end of the string, as there's no point in checking those positions;
use the string slice to take a substring;
use a set to see whether all the characters are the same.

with recursion in python calculate the amount of letters strings s and t share

I have to calculate the lingo score of two given strings, either recursively or with a list comprehension. There is one point for every letter that the two strings share.
I tried doing this, but it only works if s[0] is in t but otherwise it doesn't do what it is supposed to and I cannot see what is actually going wrong here.
def count(e, L):
    lc = [1 for x in L if x == e]
    return sum(lc)

def lingo(s, t):
    if s == '' or t == '':
        return 0
    elif s == t:
        return len(s)
    if s[0] in t:
        lc = [count(s[x], t) for x in range(len(t))]
        return sum(lc)
    else:
        #remove s[0] and try again
        lingo(s[:1], t)
these assertions are with the assignment:
assert lingo('diner', 'proza') == 1
assert lingo('beeft', 'euvel') == 2
assert lingo('gattaca', 'aggtccaggcgc') == 5
assert lingo('gattaca', '') == 0
The most obvious mistake
You are missing a return statement on the last line of your code. Instead of:
    else:
        #remove s[0] and try again
        lingo(s[:1], t)
it should be (note also that s[1:], not s[:1], is the slice that removes s[0]):
    else:
        #remove s[0] and try again
        return lingo(s[1:], t)
A redundancy in your code
The following piece of your code is unnecessary:
    elif s == t:
        return len(s)
Although this returns the correct result, it is a special case and doesn't particularly help the general case. In most cases s and t will be different; and the logic to calculate their amount of shared letters should work also when they are equal.
A mistake in the algorithm logic
This line of your code is highly suspicious:
lc = [count(s[x], t) for x in range(len(t))]
First of all, x is in range of the length of t, but is used as an index for s. If t is longer than s, this will immediately raise an IndexError exception. If t is shorter than or same length as s, then it will not raise an exception, but will most likely return the wrong result.
Note this interesting test case that was provided:
assert lingo('beeft', 'euvel') == 2
The letter 'e' appears twice in 'beeft' and twice in 'euvel', and the result is 2. Yet if you calculate count(s[1], t) + count(s[2], t) you will find the value 4. This is because the first 'e' of s is found twice in t, and the second 'e' of s is also found twice in t.
Janecx's answer provides one way to carefully fix this. You need to understand the logic behind min(s.count(s[0]), t.count(s[0])).
Other python solutions
Right now you absolutely want to use recursion and list comprehensions. In case you are interested in other ways to solve your problem, here are different algorithms.
Sorting the strings (sorting is a powerful tool that makes many problems easy)
def lingo(s, t):
    s = sorted(s) # this doesn't modify the original string, it makes a local copy
    t = sorted(t) # this doesn't modify the original string, it makes a local copy
    result = 0
    i = 0
    j = 0
    while (i < len(s) and j < len(t)):
        if s[i] == t[j]:
            result += 1
            i += 1
            j += 1
        elif s[i] < t[j]:
            i += 1
        else:
            j += 1
    return result
Complexity analysis: sorting takes N log N + M log M operations, where N=len(s) and M=len(t). The whole while loop only takes N + M operations; it is that fast because s and t are sorted in the same order, so we reach an element of s at the same time as the corresponding element in t, and we don't need to compare every element of s against every element of t.
collections.Counter (a python object specifically designed for counting occurrences)
import collections

def lingo(s, t):
    return sum((collections.Counter(s) & collections.Counter(t)).values())
Complexity analysis: this takes N + M operations, where N=len(s) and M=len(t). Counter simply counts the number of occurrences of each letter in s by going through s once, and the number of occurrences of each letter in t by going through t once; then the & operation keeps the minimum of the two counts for each letter (reminiscent of Janecx's min(...) operation); then all the counts are summed up. Summing up only takes as many operations as there are distinct letters, which in the case of a DNA sequence is 4; in the case of an alphabetical word is 26; and in general in a ASCII/Latin-1 string is at most 256.
Recursive approach from Janecx's answer
Complexity analysis: takes N * M operations, where N=len(s) and M=len(t). This is much slower than the other two approaches, because for every element of s we need to go through every element of t; written iteratively, this would be a for loop nested inside a second for loop.
There you go. What does this code do? If either string is empty, it returns 0. Otherwise, it takes the smaller of the number of occurrences of s[0] in s and in t, and then recurses on s[1:] and on a version of t with every occurrence of s[0] removed, and so on.
def lingo(s, t):
    if s == '' or t == '':
        return 0
    return min(s.count(s[0]), t.count(s[0])) + lingo(s[1:], t.replace(s[0], ''))

assert lingo('diner', 'proza') == 1
assert lingo('beeft', 'euvel') == 2
assert lingo('gattaca', 'aggtccaggcgc') == 5
assert lingo('gattaca', '') == 0
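Since the assignment also allows a list comprehension, the same min-of-counts idea can be sketched without recursion. The function name lingo_comprehension is made up here to keep it apart from the versions above.

```python
def lingo_comprehension(s, t):
    # For each distinct letter of s, the shared amount is the smaller of its
    # counts in s and in t; iterating over set(s) avoids counting a repeated
    # letter more than once.
    return sum([min(s.count(c), t.count(c)) for c in set(s)])

print(lingo_comprehension('beeft', 'euvel'))  # 2
```

Like the recursive version, this rescans both strings for every distinct letter, so it is fine for short words but slower than the Counter approach on long inputs.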

Find repeats with certain length within a string using python

I am trying to use the regex module to find non-overlapping repeats (duplicated sub-strings) within a given string (30 char), with the following requirements:
I am only interested in non-overlapping repeats that are 6-15 char long.
allow 1 mis-match
return the positions for each match
One way I thought of is that for each possible repeat length, let python loop through the 30char string input. For example,
string = "ATAGATATATGGCCCGGCCCATAGATATAT" #input
#for 6char repeats, first one in loop would be for the following event:
text = "ATAGAT"
text2 ="(" + text + ")"+ "{e<=1}" #this is to allow 1 mismatch later in regex
string2="ATATGGCCCGGCCCATAGATATAT" #string after excluding text
for x in regex.finditer(text2,string2,overlapped=True):
    print x.span()
#then still for 6char repeats, I will move on to text = "TAGATA"...
#after 6char, loop again for 7char...
There should be two outputs for this particular string = "ATAGATATATGGCCCGGCCCATAGATATAT". 1. The bold two "ATAGATATAT" + 1 mismatch: "ATAGATATATG" &"CATAGATATAT" with position index returned as (0,10)&(19, 29); 2. "TGGCCC" & "GGCCCA" (need add one mismatch to be at least 6 char), with index (9,14)&(15,20). Numbers can be in a list or table.
I'm sorry that I didn't include a real loop, but I hope the idea is clear. As you can see, this is a very inefficient method, not to mention that it would create redundancy: for example, 10-char repeats will be counted more than once, because they also qualify in the 9-, 8-, 7- and 6-char repeat loops. Moreover, I have a lot of such 30-char strings to work with, so I would appreciate your advice on some cleaner methods.
Thank you very much:)
I'd try a straightforward algorithm instead of a regex (which is quite confusing in this instance):
s = "ATAGATATATGGCCCGGCCCATAGATATAT"

def fuzzy_compare(s1, s2):
    # sanity check
    if len(s1) != len(s2):
        return False
    diffs = 0
    for a, b in zip(s1, s2):
        if a != b:
            diffs += 1
            if diffs > 1:
                return False
    return True

slen = len(s) # 30
for l in range(6, 16):
    i = 0
    while (i + l * 2) <= slen:
        sub1 = s[i:i+l]
        for j in range(i+l, slen - l):
            sub2 = s[j:j+l]
            if fuzzy_compare(sub1, sub2):
                # checking if this could be partial
                partial = False
                if i + l < j and j + l < slen:
                    extsub1 = s[i:i+l+1]
                    extsub2 = s[j:j+l+1]
                    # if it is partial, we'll get it later in the main loop
                    if fuzzy_compare(extsub1, extsub2):
                        partial = True
                if not partial:
                    print (i, i+l), (j, j+l)
        i += 1
It's a first draft, so feel free to experiment with it. It also seems to be clunky and not optimal, but try running it first - it may be sufficient enough.
