String incrementation

String incrementation - python

I've just started to learn Python and I'm doing some exercises on Codewars. The instructions are simple: if the string already ends with a number, the number should be incremented by 1.
If the string does not end with a number, the number 1 should be appended to the new string.
I wrote this:
if strng[-1].isdigit():
    return strng.replace(strng[-1],str(int(strng[-1])+1))
else:
    return strng + "1"
return(strng)
It works sometimes (for example 'foobar001' -> 'foobar002', 'foobar' -> 'foobar1'). But in other cases it adds 1 to each digit at the end (for example 'foobar11' -> 'foobar22'). I would like the code to add 1 only to the trailing number, so that 'foobar99' becomes 'foobar100'; the number has to be considered as a whole. I would be grateful for any advice for a beginner :)!

First, you have to make some assumptions.
Assume that the numerical values are always at the end of the string, and that the first character from the right that is not a digit marks the end of the non-number part, i.e.
>>> input = "foobar123456"
>>> output = 123456 + 1
Second, we need to decide what happens when no number exists at the end of the string.
If we encounter a string without a number, we could decide that the Python code should throw an error and not try to add 1.
>>> input = "foobar"
Or we decide that we automatically generate a 0 digit, which would require us to do something like
input = input if input[-1].isdigit() else input + "0"
Let's assume the latter decision for simplicity of the explanation.
Next, we will read the digits from the right until we get to a non-digit.
Let's use reversed() to flip the string and then a for-loop to read the characters until we reach a non-digit, i.e.
>>> s = "foobar123456"
>>> output = 123456
>>> for character in reversed(s):
...     if not character.isdigit():
...         break
...     else:
...         print(character)
...
6
5
4
3
2
1
Now, let's use a list to keep the digit characters:
>>> digits_in_reverse = []
>>> for character in reversed(s):
...     if not character.isdigit():
...         break
...     else:
...         digits_in_reverse.append(character)
...
>>> digits_in_reverse
['6', '5', '4', '3', '2', '1']
Then we reverse it:
>>> ''.join(reversed(digits_in_reverse))
'123456'
And convert it into an integer:
>>> int(''.join(reversed(digits_in_reverse)))
123456
Now the +1 increment would be easy!
How do we find the string preceding the number?
# The input string.
s = "foobar123456"
s = s if s[-1].isdigit() else s + "0"
# Keep a list of the digits in reverse.
digits_in_reverse = []
# Iterate through each character from the right.
for character in reversed(s):
    # If we meet a character that is not a digit, stop.
    if not character.isdigit():
        break
    # Otherwise, keep collecting the digits.
    else:
        digits_in_reverse.append(character)
# Reverse the reversed digits, then convert them into an integer.
number_str = "".join(reversed(digits_in_reverse))
number = int(number_str)
print(number)
# End of the string preceding the number.
end = s.rindex(number_str)
print(s[:end])
# Increment +1
print(s[:end] + str(number + 1))
[output]:
123456
foobar
foobar123457
Bonus: Can you do it with a one-liner?
Not exactly one line, but close:
import itertools
s = "foobar123456"
s = s if s[-1].isdigit() else s + "0"
number_str = "".join(itertools.takewhile(lambda ch: ch.isdigit(), reversed(s)))[::-1]
end = s.rindex(number_str)
print(s[:end] + str(int(number_str) + 1))
Bonus: But how about regex?
Yeah, with regex it's pretty magical. You would still make the same assumptions as before, and to keep the regex as simple as possible you have to add another assumption: the characters preceding the number are only a-z or A-Z.
Then you can do this:
import re
s = "foobar123456"
s = s if s[-1].isdigit() else s + "0"
alpha, numeric = re.match(r"([a-zA-Z]+)(\d+)", s).groups()
print(alpha + str(int(numeric) + 1))
But you have to understand the regex, which might be a steep learning curve; see https://regex101.com/r/9iiaCW/1
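The steps above can be collected into a single function. This is only a minimal sketch: the name increment_trailing_number is hypothetical, and it assumes, as in the explanation, that a string without trailing digits is treated as if it ended in 0.

```python
def increment_trailing_number(s):
    """Add 1 to the number formed by the trailing digits of s."""
    # Collect digits from the right until the first non-digit.
    digits_in_reverse = []
    for character in reversed(s):
        if not character.isdigit():
            break
        digits_in_reverse.append(character)
    number_str = "".join(reversed(digits_in_reverse))
    # Everything before the trailing digits is kept unchanged.
    prefix = s[:len(s) - len(number_str)]
    # Assumption: no trailing digits means the number is 0.
    number = int(number_str) if number_str else 0
    return prefix + str(number + 1)


print(increment_trailing_number("foobar123456"))
print(increment_trailing_number("foobar"))
```

Because the digits go through int(), leading zeros are dropped in this version ('foobar001' becomes 'foobar2'); keeping them needs an extra padding step.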

One simple solution would be:
Have two empty variables head (=non-numeric prefix) and tail (numeric suffix). Iterate the string normally, from left to right. If the current character is a digit, add it to tail. Otherwise, join head and tail, add the current char to head and empty tail. Once complete, increment tail and return head + tail:
def foo(s):
    head = tail = ''
    for char in s:
        if char.isdigit():
            tail += char
        else:
            head += tail + char
            tail = ''
    tail = int(tail or '0')
    return head + str(tail + 1)
Leading zeroes (x001 -> x002), if needed, left as an exercise ;)
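For readers who want to check their attempt at that exercise, here is one possible sketch. It does not reuse the loop above: it strips the trailing digits with str.rstrip and pads the result with str.zfill. The function name increment_keep_zeros is made up for this example.

```python
def increment_keep_zeros(s):
    """Increment the trailing number of s, keeping its zero padding."""
    # head is s without its trailing ASCII digits, tail is those digits.
    head = s.rstrip("0123456789")
    tail = s[len(head):]
    if not tail:
        # No trailing number: append 1, as the exercise asks.
        return head + "1"
    # zfill pads back to the original width; a longer result is left as is.
    return head + str(int(tail) + 1).zfill(len(tail))


print(increment_keep_zeros("x001"))
print(increment_keep_zeros("x99"))
```

The padding width is taken from the original digit run, so 'x009' becomes 'x010' while 'x99' still grows to 'x100'.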

First check whether your string is alphanumeric. If it is, check whether the last character is a digit.
If both conditions hold, find the index of the first digit of the number at the end of the string.
Once you have that index, separate the character part from the numeric part.
Then convert the numeric part to an integer and add 1. Finally, join the character part and the incremented number; that is your answer.
string = 'randomstring2345'
# index marks where the trailing number starts; by default there is none.
index = len(string)
if string.isalnum() and string[-1].isdigit():
    # Walk left while the character just before index is a digit.
    while index > 0 and string[index - 1].isdigit():
        index -= 1
char_part = string[:index]
int_part = string[index:]
# An empty numeric part counts as 0, so the result ends in 1.
integer = int(int_part) if int_part else 0
modified_int = integer + 1
new_string = ''.join([char_part, str(modified_int)])
print(new_string)
output
randomstring2346

Regex can be a useful tool in Python. Here I make two groups: the first, (.*?), matches as few characters as possible, while the second, (\d*$), matches as many digits at the end of the string as possible. For a more in-depth explanation see regexr.
import re
def increment(s):
    word, digits = re.match(r'(.*?)(\d*$)', s).groups()
    digits = str(int(digits) + 1).zfill(len(digits)) if digits else '1'
    return word + digits
print(increment('foobar001'))
print(increment('foobar009'))
print(increment('foobar19'))
print(increment('foobar20'))
print(increment('foobar99'))
print(increment('foobar'))
print(increment('1a2c1'))
print(increment(''))
print(increment('01'))
Output:
foobar002
foobar010
foobar20
foobar21
foobar100
foobar1
1a2c2
1
02

Source
def solve(data):
    result = None
    if len(data) == 0 or not data[-1].isdigit():
        result = data + str(1) #appending 1
    else:
        lin = 0
        for index, ch in enumerate(data[::-1]):
            if ch.isdigit():
                lin = len(data) - index -1
            else:
                break
        result = data[0 : lin] + str(int(data[lin:]) + 1) # incrementing result
    return result
print(solve("Hey123"))
print(solve("aaabbbzzz"))
output :
Hey124
aaabbbzzz1

Related

How to find the amount of equal characters that are next to eachother in a string?

I just started using Python and I'm a beginner.
This is an example of the string I have to work with: "--+-+++----------------+-+"
The program needs to find the longest "+" chain, that is, how many times + appears consecutively. In this example it should find the chain of 3 + symbols, so that I can print that the longest + chain contains 3 + symbols.

a = "--+-+++----------------+-+"
count = 0
most = 0
for x in range(len(a)):
    if a[x] == "+":
        count+=1
    else:
        count = 0
    if count > most:
        most = count
print(f"longest + chain includes {most} symbols")
There might be a better way, but this one is more self-explanatory.

Try this. It uses regular expressions and a list comprehension, so you may need to read about them.
But the idea is to find all the + chains, calculate their lengths and get the maximum length
import re
s = '+++----------------+-+'
occurs = re.findall(r'\++',s)
print(max([len(i) for i in occurs]))
Output:
3

You can use a regular expression to specify "one or more + characters". The character for specifying this kind of repetition in a regex is itself +, so to specify the actual + character you have to escape it.
import re
haystack = "--+-+++----------------+-+"
needle = re.compile(r"\++")
Now we can use findall to find all the occurrences of this pattern in the original string, and max to find the longest of these.
longest = max(len(x) for x in needle.findall(haystack))
If you instead need the position of the longest sequence in the target string, you can use:
pos = haystack.index(max(needle.findall(haystack), key=len))
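As a variation on the last line, re.finditer returns match objects that already carry their position, so the start and the length can be read from the same match. This is a hedged sketch; the function name longest_plus_run and its (-1, 0) result for "no + found" are choices made for this example only.

```python
import re


def longest_plus_run(haystack):
    """Return (start, length) of the first longest run of '+', or (-1, 0)."""
    matches = list(re.finditer(r"\++", haystack))
    # max() raises ValueError on an empty sequence, so handle that case first.
    if not matches:
        return -1, 0
    # On ties, max() keeps the first (leftmost) match it sees.
    best = max(matches, key=lambda m: len(m.group()))
    return best.start(), len(best.group())


print(longest_plus_run("--+-+++----------------+-+"))
```

Unlike max() on the bare findall() result, this version does not fail when the string contains no + at all.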

A simple solution is to iterate over the string one character at a time. When the character is the same as the previous one, add one to a counter; each time the character differs from the previous one, restart the count at 1.
s = "--+-+++----------------+-+"
p = s[0]
longest, count = 0, 0
for c in s:
    if c == p:
        count = count + 1
    else:
        count = 1
    if count > longest:
        longest = count
    p = c
s is the string, c is the character being checked, p is the previous character, count is the counter, and longest is the highest value found (named longest rather than max so that the built-in max() is not shadowed). Note that this measures the longest run of any character, so to count only + chains you would also need to check that c == "+".

If the only other character in your string is a minus sign, you can split the string on the minus sign and get maximum length of the resulting substrings:
a = "--+-+++----------------+-+"
r = max(map(len,a.split('-')))
print(r) # 3
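Another way to express the same idea, without assuming that minus is the only other character, is itertools.groupby, which splits a string into runs of equal characters. A minimal sketch, with longest_run as a hypothetical helper name:

```python
from itertools import groupby


def longest_run(text, symbol="+"):
    """Length of the longest run of symbol in text (0 if it never occurs)."""
    # groupby yields (character, iterator over one run) for each run in text.
    run_lengths = (
        sum(1 for _ in group)
        for char, group in groupby(text)
        if char == symbol
    )
    # default=0 covers strings that do not contain symbol at all.
    return max(run_lengths, default=0)


print(longest_run("--+-+++----------------+-+"))
```

The default=0 argument avoids the ValueError that max() raises on an empty sequence.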

Remove string character after run of n characters in string

Suppose you have a given string and an integer, n. Every time a character appears in the string more than n times in a row, you want to remove some of the characters so that it only appears n times in a row. For example, for the case n = 2, we would want the string 'aaabccdddd' to become 'aabccdd'. I have written this crude function that compiles without errors but doesn't quite get me what I want:
def strcut(string, n):
    for i in range(len(string)):
        for j in range(n):
            if i + j < len(string)-(n-1):
                if string[i] == string[i+j]:
                    beg = string[:i]
                    ends = string[i+1:]
                    string = beg + ends
    print(string)
These are the outputs for strcut('aaabccdddd', n):
n   output    expected
1   'abcdd'   'abcd'
2   'acdd'    'aabccdd'
3   'acddd'   'aaabccddd'
I am new to python but I am pretty sure that my error is in line 3, 4 or 5 of my function. Does anyone have any suggestions or know of any methods that would make this easier?

This may not answer why your code does not work, but here's an alternate solution using regex:
import re
def strcut(string, n):
    return re.sub(fr"(.)\1{{{n-1},}}", r"\1"*n, string)
How it works: First, the pattern formatted is "(.)\1{n-1,}". If n=3 then the pattern becomes "(.)\1{2,}"
(.) is a capture group that matches any single character
\1 matches the first capture group
{2,} matches the previous token 2 or more times
The replacement string is the first capture group repeated n times
For example: str = "aaaab" and n = 3. The first "a" is the capture group (.). The next 3 "aaa" matches \1{2,} - in this example a{2,}. So the whole thing matches "a" + "aaa" = "aaaa". That is replaced with "aaa".
regex101 can explain it better than me.
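To see what the f-string actually produces, the pattern can be built separately and printed. This is just a sketch around the same expression; build_pattern is a hypothetical helper, and n is assumed to be at least 1 (for n = 0 the quantifier {-1,} is not a valid repetition).

```python
import re


def build_pattern(n):
    # Doubled braces are literal braces inside an f-string; {n-1} is evaluated.
    return fr"(.)\1{{{n-1},}}"


def strcut(string, n):
    # Replace each run of n or more equal characters by exactly n copies.
    return re.sub(build_pattern(n), r"\1" * n, string)


print(build_pattern(3))
for n in (1, 2, 3):
    print(n, strcut("aaabccdddd", n))
```

One caveat: by default . does not match a newline, so runs of newline characters are left untouched unless re.DOTALL is passed.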

You can implement a stack data structure.
The idea is to push each new character onto the stack. If it is the same as the character on top of the stack, increase the counter and push the character only while the counter stays within the limit n. If the new character differs from the previous one, push it and reset the counter to 1.
def func(string, n):
    stack = []
    counter = None
    for i in string:
        if not stack:
            counter = 1
            stack.append(i)
        elif stack[-1]==i:
            if counter+1<=n:
                stack.append(i)
                counter+=1
        elif stack[-1]!=i:
            stack.append(i)
            counter = 1
    return ''.join(stack)
print(func('aaabbcdaaacccdsdsccddssse', 2)=='aabbcdaaccdsdsccddsse')
print(func('aaabccdddd',1 )=='abcd')
print(func('aaabccdddd',2 )=='aabccdd')
print(func('aaabccdddd',3 )=='aaabccddd')
output
True
True
True
True

The method I would use is to create a new empty string at the start of the function and then, every time a character exceeds the allowed run length, simply not insert it into the output string. This is efficient because it makes a single pass over the input:
def strcut(string,n) :
    new_string = ""
    # string[:1] instead of string[0] so that an empty input does not raise.
    first_c, s = string[:1], 0
    for c in string :
        if c != first_c :
            first_c, s= c, 0
        s += 1
        if s > n : continue
        else : new_string += c
    return new_string
print(strcut("aabcaaabbba",2)) # output : #aabcaabba

Simply, to answer the question
appears in the string more than n times in a row
the following code is small and simple, and will work fine :-)
def strcut(string: str, n: int) -> str:
    tmp = "*" * (n+1)
    for char in string:
        if tmp[len(tmp) - n:] != char * n:
            tmp += char
    print(tmp[n+1:])
strcut("aaabccdddd", 1)
strcut("aaabccdddd", 2)
strcut("aaabccdddd", 3)
Output:
abcd
aabccdd
aaabccddd
Notes:
The character "*" in the line tmp = "*" * (n+1) can be any character that is not in the string; it's just a placeholder to handle the start case when there are no characters yet.
The print(tmp[n+1:]) line simply removes the "*" characters added at the beginning.

You don't need nested loops. Keep track of the current character and its count. Include characters when the count is less than or equal to n, and reset the current character and count when it changes.
def strcut(s,n):
    result = ''                     # resulting string
    char,count = '',0               # initial character and count
    for c in s:                     # only loop once on the characters
        if c == char: count += 1    # increase count
        else: char,count = c,1      # reset character/count
        if count<=n: result += c    # include character if count is ok
    return result
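The same "current character and count" bookkeeping is what itertools.groupby does internally, so the function can also be sketched with groupby and islice. This is an alternative formulation, not the code above; strcut_groupby is a name chosen for this example.

```python
from itertools import groupby, islice


def strcut_groupby(s, n):
    """Keep at most n characters from every run of equal characters."""
    # groupby yields one iterator per run; islice takes its first n items.
    return "".join("".join(islice(run, n)) for _, run in groupby(s))


print(strcut_groupby("aaabccdddd", 2))
```

groupby skips whatever is left of a run when it moves to the next one, so the characters beyond the first n are simply never consumed.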

Just to give some ideas, this is a different approach. I didn't like how the inner loop over n restarts each time: even at i=3 and n=2, the code still jumps to i=4 although that character was already checked while going through n. And since you are checking the next n characters in the string, your method doesn't fit with keeping the strings in order. Here is a rough method that I find easier to read.
def strcut(string, n):
    for i in range(len(string)-1,0,-1): # I go backwards assuming you want to keep the front characters
        # Note: count() counts every occurrence in the whole string, not only
        # consecutive ones, so this only works when each character forms a single run.
        if string.count(string[i]) > n:
            string = remove(string,i)
    print(string)
def remove(string, i):
    if i > len(string):
        return string[:i]
    return string[:i] + string[i+1:]
strcut('aaabccdddd',2)

finding the minimum window substring

The problem says to create a string, take 3 non-consecutive characters from the string, put them into a sub-string, and print the positions of the first and the last of those characters.
str="subliminal"
sub="bmn"
n = len(str)-3
for i in range(0, n):
    print(str1[i:i+4])
    if sub1 in str1:
        print(sub1[i])
This should print 3 to 8, because b is the 3rd letter and n is the 8th letter.
I also don't know how to make the code work for substrings that aren't 3 characters long without rewriting the code entirely.

Not sure if this is what you meant. I assume that the substring is already valid, which means that it contains non-consecutive letters. Then I get the first and last letter of the substring and create a list of all the letters in the string using a list comprehension. Then I just loop through the letters and save where the first and last letter occur. If anything is missing, let me know.
sub = "bmn"
string = "subliminal"
first_letter = sub[0]
last_letter = sub[-1]
start = None
end = None
letters = [let for let in string]
for i, letter in enumerate(letters):
    if letter == first_letter:
        start = i
    if letter == last_letter:
        end = i
# Compare against None: index 0 is a valid position but is falsy.
if start is not None and end is not None:
    print(f"From {start + 1} to {end + 1}.") # Output: From 3 to 8.

Some recursion for good health:
def minimum_window_substring(strn, sub, beg=0, fin=0, firstFound=False):
    if len(sub) == 0 or len(strn) == 0:
        return f'From {beg + 1} to {fin}'
    elif strn[0] == sub[0]:
        return minimum_window_substring(strn[1:], sub[1:], beg, fin + 1, True)
    if not firstFound:
        beg += 1
    return minimum_window_substring(strn[1:], sub, beg, fin + 1, firstFound)
Explanation:
The base case is when the original string or the sub-string has length 0; we then stop and return the beginning and the end of the substring in the original string.
If the first letter of the current string equals the first letter of the sub-string, we start the counter (we fix the beginning "beg" with the flag "firstFound"), then increment until we finish (sub is an empty string / original string is empty).
Something to think about / More explanation:
If the sub-string occurs more than once, the first occurrence wins. For example, if the original string is "sububusubulum" and the sub is "sbl", then as soon as we hit the first "s" the window starts there: any later "sbl" inside the original string also supplies the remaining letters for that first "s". In other words, if there are 2 occurrences of the substring, we pick the first one, no matter what.
Note: This function does not care whether the sub-string contains consecutive letters, and it does not check whether the characters are in the string itself, because you said that we are given characters from the original string. The positive side is that the function accepts a sub-string of any length, not just 3 characters.
When I say "original string" I mean subliminal (or other inputs)
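The same left-to-right scan can also be written without recursion, using str.find with a start offset. This is a sketch, not the function above: window_positions is a hypothetical name, it returns a tuple instead of a formatted string, and it returns None when a character is missing (a case the recursive version does not check). Like the recursive version, it reports the first window it finds, which is not necessarily the shortest one.

```python
def window_positions(string, sub):
    """Return 1-based (first, last) positions spanning sub inside string.

    The characters of sub must appear in order; None means no such window.
    """
    if not sub:
        return None
    first = string.find(sub[0])
    if first == -1:
        return None
    position = first
    for char in sub[1:]:
        # Search strictly to the right of the previously matched character.
        position = string.find(char, position + 1)
        if position == -1:
            return None
    return first + 1, position + 1


print(window_positions("subliminal", "bmn"))
print(window_positions("sububusubulum", "sbl"))
```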

There are many different ways you could do it;
here is one solution:
import re
def Func(String, SubString):
    # Require at least one letter between consecutive characters of SubString.
    patt = "".join([char + "[A-Za-z]" + "+" for char in SubString[:-1]] + [SubString[-1]])
    MatchedString = re.findall(patt, String)[0]
    FirstIndex = String.find(MatchedString) + 1
    LastIndex = FirstIndex + len(MatchedString) -1
    return FirstIndex, LastIndex
string="subliminal"
sub="bmn"
FirstIndex, LastIndex = Func(string, sub)
This returns 3, 8. It works for substrings of any length and assumes you want only the first match.

Most common character in a string

Write a function that takes a string consisting of alphabetic
characters as input argument and returns the most common character.
Ignore white spaces i.e. Do not count any white space as a character.
Note that capitalization does not matter here i.e. that a lower case
character is equal to a upper case character. In case of a tie between
certain characters return the last character that has the most count
This is the updated code
def most_common_character (input_str):
    input_str = input_str.lower()
    new_string = "".join(input_str.split())
    print(new_string)
    length = len(new_string)
    print(length)
    count = 1
    j = 0
    higher_count = 0
    return_character = ""
    for i in range(0, len(new_string)):
        character = new_string[i]
        while (length - 1):
            if (character == new_string[j + 1]):
                count += 1
            j += 1
            length -= 1
            if (higher_count < count):
                higher_count = count
    return (character)
#Main Program
input_str = input("Enter a string: ")
result = most_common_character(input_str)
print(result)
The above is my code. I am getting a string index out of bounds error and I can't understand why. Also, the code only checks the occurrences of the first character; I am confused about how to proceed to the next character and take the maximum count.
The error I get when I run my code:
> Your answer is NOT CORRECT Your code was tested with different inputs.
> For example when your function is called as shown below:
>
> most_common_character ('The cosmos is infinite')
>
> ############# Your function returns ############# e The returned variable type is: type 'str'
>
> ######### Correct return value should be ######## i The returned variable type is: type 'str'
>
> ####### Output of student print statements ###### thecosmosisinfinite 19

You can use a regex pattern to search for all characters. \w matches any alphanumeric character and the underscore; for ASCII text this is equivalent to the set [a-zA-Z0-9_]. The + after [\w] means to match one or more repetitions.
Finally, you use Counter to total them and most_common(1) to get the top value. See below for the case of a tie.
from collections import Counter
import re
s = "Write a function that takes a string consisting of alphabetic characters as input argument and returns the most common character. Ignore white spaces i.e. Do not count any white space as a character. Note that capitalization does not matter here i.e. that a lower case character is equal to a upper case character. In case of a tie between certain characters return the last character that has the most count"
>>> Counter(c.lower() for c in re.findall(r"\w", s)).most_common(1)
[('t', 46)]
In the case of a tie, it is a little more tricky.
def top_character(some_string):
    # r"\w" (without +) so that each match is a single character, not a whole word.
    joined_characters = re.findall(r"\w", some_string.lower())
    d = Counter(joined_characters)
    highest = max(d.values())
    top_characters = [c for c, n in d.most_common() if n == highest]
    if len(top_characters) == 1:
        return top_characters[0]
    reversed_characters = joined_characters[::-1]
    for c in reversed_characters:
        if c in top_characters:
            return c
>>> top_character(s)
't'
>>> top_character('the the')
'e'
In the case of your code above and your sentence "The cosmos is infinite", you can see that 'i' occurs more frequently than 'e' (the output of your function):
>>> Counter(c.lower() for c in "".join(re.findall(r"[\w]+", 'The cosmos is infinite'))).most_common(3)
[('i', 4), ('s', 3), ('e', 2)]
You can see the issue in your code block:
for i in range(0, len(new_string)):
    character = new_string[i]
    ...
return (character)
You iterate through the sentence and assign each letter to the variable character, which is never reassigned elsewhere. When the loop ends, character therefore always holds the last character in your string.

Actually your code is almost correct. You need to move count, j and length inside your for i in range(0, len(new_string)) loop, because they must start over on each iteration. Also, if count is greater than higher_count, you need to save that character as return_character and return it instead of character, which is always the last char of your string because of character = new_string[i].
I don't see why you used j+1 and while length-1. After correcting them, the code now covers tie situations as well.
def most_common_character (input_str):
    input_str = input_str.lower()
    new_string = "".join(input_str.split())
    higher_count = 0
    return_character = ""
    for i in range(0, len(new_string)):
        count = 0
        length = len(new_string)
        j = 0
        character = new_string[i]
        while length > 0:
            if (character == new_string[j]):
                count += 1
            j += 1
            length -= 1
        if (higher_count <= count):
            higher_count = count
            return_character = character
    return (return_character)

If we ignore the "tie" requirement; collections.Counter() works:
from collections import Counter
from itertools import chain
def most_common_character(input_str):
    return Counter(chain(*input_str.casefold().split())).most_common(1)[0][0]
Example:
>>> most_common_character('The cosmos is infinite')
'i'
>>> most_common_character('ab' * 3)
'a'
To return the last character that has the most count, we could use collections.OrderedDict:
from collections import Counter, OrderedDict
from itertools import chain
from operator import itemgetter
class OrderedCounter(Counter, OrderedDict):
    pass
def most_common_character(input_str):
    counter = OrderedCounter(chain(*input_str.casefold().split()))
    return max(reversed(counter.items()), key=itemgetter(1))[0]
Example:
>>> most_common_character('ab' * 3)
'b'
Note: this solution assumes that max() returns the first character that has the most count (and therefore there is a reversed() call, to get the last one) and all characters are single Unicode codepoints. In general, you might want to use \X regular expression (supported by regex module), to extract "user-perceived characters" (eXtended grapheme cluster) from the Unicode string.
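The tie rule can also be made explicit in the key passed to max(). The sketch below rests on one assumption about the task wording: "the last character that has the most count" is read as the tied character whose final occurrence comes latest in the string. The function name most_common_last is made up for this example.

```python
from collections import Counter


def most_common_last(text):
    """Most frequent non-whitespace character, ignoring case.

    Assumption: ties go to the character whose last occurrence is latest.
    """
    chars = "".join(text.casefold().split())
    if not chars:
        return ""
    counts = Counter(chars)
    # Compare by count first, then by the position of the last occurrence.
    return max(counts, key=lambda ch: (counts[ch], chars.rindex(ch)))


print(most_common_last("The cosmos is infinite"))
print(most_common_last("ab" * 3))
```

Because the tie-break is part of the key, the result does not depend on the iteration order of the counter.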

Limiting re.findall() to record values before a certain number. Python

sequence_list = ['atgttttgatggATGTTTGATTAG','atggggtagatggggATGGGGTGA','atgaaataatggggATGAAATAA']
I take each element (fdna) from sequence_list and search for sequences starting with ATG, then reading in steps of 3 until reaching either a TAA, TGA, or TAG.
Each element in sequence_list is made up of two sequences. The first sequence is lowercase and the second is uppercase, so each string is composed of lowercase + UPPERCASE.
Gathering CDS Starts & Upper()
cds_start_positions = []
cds_start_positions.append(re.search("[A-Z]", fdna).start())
fdna = fdna.upper()
So after I find where the uppercase sequence starts, I record the index number in cds_start_positions and then convert the entire string (fdna) to uppercase
This statement gathers all ATG-xxx-xxx- that are followed by either a TAA|TAG|TGA
Gathering uORFs
ORF_sequences = re.findall(r'ATG(?:...)*?(?:TAA|TAG|TGA)',fdna)
So what I'm trying to do is gather all occurrences where ATG-xxx-xxx is followed by either TAA, TGA, or TAG.
My input data is composed of 2 sequences (lowercaseUPPERCASE) and I want to find these sequences when:
1: the ATG is followed by TAA|TGA|TAG in the lowercase (which are now uppercase but the value where they become uppercase is stored in the cds_start_positions)
2: the ATG is in the lowercase portion (less than the cds_start_position value) and the next TAA|TGA|TAG that is following it in uppercase.
NOTE: the way it is set up now, an ATG that was in the original uppercase portion (greater than the cds_start_position value) is also saved to the list.
What the "Gathering CDS Starts & Upper()" part does is find where the uppercase sequence starts.
Is there any way to put constraints on the "Gathering uORFs" part so that it only recognizes an ATG at a position before the corresponding element in the list cds_start_positions?
I want to put a condition in the ORF_sequences line so that it only finds 'ATG' before each element in the list 'cds_start_positions'.
Example of what cds_start_positions would look like
cds_start_positions = [12, 15, 14] #where each value indicates where the uppercase portion starts in the sequence_list elements (fdna)
for the first sequence in sequence_list
i would want this result:
#input
fdna = 'atgttttgatggATGTTTGATTAG'
#what i want for the output
ORF_sequences = ['ATGTTTTGA','ATGGATGTTTGA']
#what i'm getting for the output
ORF_sequences = ['ATGTTTTGA','ATGGATGTTTGA','ATGTTTGATTAG']
That 3rd entry is found after the value 12 (the corresponding value in the list cds_start_positions), and I don't want that. However, the 2nd entry has its starting ATG before the value 12 and its TAA|TGA|TAG after the value 12, which should be allowed.
***Note
I have another line of code that just takes the start positions of where these ATG-xxx-xxx-TAA|TGA|TAG occur and that is:
start_positions = [i for i in start_positions if i < k-1]
Is there a way to use this principle in re.findall ?
let me know if i need to clarify anything

Yesterday, I wrote a first answer.
Then I read ATOzTOA's answer, in which he had a very good idea: using a positive look-behind assertion.
I thought that my answer was completely off and that his idea was the right way to do it.
But afterwards, I realized that there's a flaw in ATOzTOA's code.
Say there is a portion 'ATGxzyxyzyyxyATGxzxzyxyzxxzzxzyzyxyzTGA' in the examined string: the regular match will occur on 'xzyxyzyyxyATGxzxzyxyzxxzzxzyzyxyzTGA' and the look-behind assertion on the preceding 'ATG', so the portion will constitute a match; that's OK.
But it means that just after this match the regex engine is positioned at the end of this portion 'xzyxyzyyxyATGxzxzyxyzxxzzxzyzyxyzTGA'.
So when the regex engine searches for the next match, it won't find a match beginning at the 'ATG' inside this portion, since it resumes from a position well after it.
.
So the only way to achieve what the question requires is effectively the first algorithm I had written, so I repost it.
The job is done by a function find_ORF_seq()
If you pass True as the second argument (the parameter messages) of find_ORF_seq(), it will print messages that help to understand the algorithm.
If not, the parameter messages takes the default value None
The pattern is written '(atg).+?(?:TAA|TGA|TAG)' with some letters lowercased and the others uppercased, but that is not why the portions are caught correctly relative to the upper- and lower-cased letters. As you will see, the flag re.IGNORECASE is used: this flag is necessary since the part matched by (?:TAA|TGA|TAG) can fall in the lower-cased part as well as in the upper-cased part.
The essence of the algorithm lies in the while-loop, which is necessary because the searched portions may overlap, as I explained above (provided I understood correctly and the samples and explanations you gave are correct).
So findall() or finditer() cannot be used directly with this pattern, and I do a loop.
To avoid iterating over the fdna sequence one base after the other, I use the ma.start() method, which gives the position of the beginning of a match ma in a string, and I update s with s = s + p + 1 (+1 so as not to search again from the start of the found match!)
My algorithm doesn't need the information in start_positions because I don't use a look-behind assertion but a real match on the first 3 letters: a match is rejected when its start is in the upper-cased part, that is to say when ma.group(1), which captures the first three bases (they can be 'ATG' or 'atg' since the regex ignores case), is equal to 'ATG'
I was obliged to use s = s + p + 1 instead of s = s + p + 3 because it seems that the portions you search for are not spaced by a multiple of three bases.
import re
sequence_list = ['atgttttgatgATGTTTTGATTT',
                 'atggggtagatggggATGGGGTGA',
                 'atgaaataatggggATGAAATAA',
                 'aaggtacttctcggctaACTTTTTCCAAGT']
pat = '(atg).+?(?:TAA|TGA|TAG)'
reg = re.compile(pat,re.IGNORECASE)
def find_ORF_seq(fdna,messages=None,s=0,reg=reg):
    ORF_sequences = []
    if messages:
        print('s before == ',s)
    while True:
        if messages:
            print('---------------------------\n'
                  's == %d\n'
                  'fdna[%d:] == %r' % (s,s,fdna[s:]))
        ma = reg.search(fdna[s:])
        if messages:
            print('reg.search(fdna[%d:]) == %r' % (s,ma))
        if ma:
            if messages:
                print('ma.group() == %r\n'
                      'ma.group(1) == %r'
                      % (ma.group(),ma.group(1)))
            if ma.group(1)=='ATG':
                if messages:
                    print("ma.group(1) is uppercased 'ATG' then I break")
                break
            else:
                ORF_sequences.append(ma.group().upper())
                p = ma.start()
                if messages:
                    print(' The match is at position p == %d in fdna[%d:]\n'
                          ' and at position s + p == %d + %d == %d in fdna\n'
                          ' then I put s = s + p + 1 == %d'
                          % (p,s, s,p,s+p, s+p+1))
                s = s + p + 1
        else:
            break
    if messages:
        print('\n==== RESULT ======\n')
    return ORF_sequences
for fdna in sequence_list:
    print('\n============================================')
    print('fdna == %s\n'
          'ORF_sequences == %r'
          % (fdna, find_ORF_seq(fdna,True)))
###############################
print('\n\n\n######################\n\ninput sample')
fdna = 'atgttttgatggATGTTTGATTTATTTTAG'
print(' fdna == %s' % fdna)
print(' **atgttttga**tggATGTTTGATTTATTTTAG')
print(' atgttttg**atggATGTTTGA**TTTATTTTAG')
print('output sample')
print(" ORF_sequences = ['ATGTTTTGA','ATGGATGTTTGA']")
print('\nfind_ORF_seq(fdna) ==',find_ORF_seq(fdna))
.
The same function without the print instructions to better see the algorithm.
import re
pat = '(atg).+?(?:TAA|TGA|TAG)'
reg = re.compile(pat,re.IGNORECASE)
def find_ORF_seq(fdna,messages=None,s =0,reg=reg):
    ORF_sequences = []
    while True:
        ma = reg.search(fdna[s:])
        if ma:
            if ma.group(1)=='ATG':
                break
            else:
                ORF_sequences.append(ma.group().upper())
                s = s + ma.start() + 1
        else:
            break
    return ORF_sequences
.
I compared the two functions, ATOzTOA's and mine, with an fdna sequence that reveals the flaw. This confirms what I described.
from find_ORF_sequences import find_ORF_seq
from ATOz_get_sequences import getSequences
fdna = 'atgggatggtagatggatgggATGGGGTGA'
print('fdna == %s' % fdna)
print('find_ORF_seq(fdna)')
print(find_ORF_seq(fdna))
print('getSequences(fdna)')
print(getSequences(fdna))
result
fdna == atgggatggtagatggatgggATGGGGTGA
find_ORF_seq(fdna)
['ATGGGATGGTAG', 'ATGGTAG', 'ATGGATGGGATGGGGTGA', 'ATGGGATGGGGTGA']
getSequences(fdna)
['ATGGGATGGTAG', 'ATGGATGGGATGGGGTGA']
.
But after all, maybe, I wonder.... :
do you want the matches that are inner parts of another matching, like 'ATGGGATGGGGTGA' at the end of 'ATGGATGGGATGGGGTGA' ?
If not, the answer of ATOzTOA will fit also.
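For completeness, overlapping matches can also be collected in a single finditer() pass by wrapping the whole pattern in a lookahead, because a lookahead consumes no characters. The sketch below is a different technique from both functions compared above, and it makes its own assumptions: it uses the codon-aligned pattern from the question rather than '.+?', the name find_uorfs is hypothetical, and a sequence without an uppercase part is treated as entirely upstream.

```python
import re

# The lookahead (?=...) matches at a position without consuming it, so the
# next search starts one base later and overlapping ORFs are all seen.
ORF_LOOKAHEAD = re.compile(r"(?=(ATG(?:...)*?(?:TAA|TAG|TGA)))")


def find_uorfs(fdna):
    """ORFs whose ATG starts in the lowercase (upstream) part of fdna."""
    upper_start = re.search("[A-Z]", fdna)
    # Assumption: no uppercase part means every position counts as upstream.
    cds_start = upper_start.start() if upper_start else len(fdna)
    sequence = fdna.upper()
    return [m.group(1)
            for m in ORF_LOOKAHEAD.finditer(sequence)
            if m.start() < cds_start]


print(find_uorfs('atgttttgatggATGTTTGATTAG'))
```

Filtering on m.start() is the same idea as the start_positions check quoted in the question, applied to the match objects directly.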

Update 2 - Complete code
import re
def getSequences(fdna):
    start = re.search("[A-Z]", fdna).start()
    fdna = fdna.upper()
    ORF_sequences = re.finditer(r'(?<=ATG)(.*?)(?:TAA|TAG|TGA)',fdna)
    sequences = []
    for match in ORF_sequences:
        s = match.start()
        if s < start:
            sequences.append("ATG" + match.group(0))
    print(sequences)
    return sequences
sequence_list = ['atgttttgatggATGTTTGATTAG','atggggtagatggggATGGGGTGA','atgaaataatggggATGAAATAA']
for fdna in sequence_list:
    getSequences(fdna)
Output
>>>
['ATGTTTTGA', 'ATGGATGTTTGA']
['ATGGGGTAG', 'ATGGGGATGGGGTGA']
['ATGAAATAA', 'ATGGGGATGA']
Update
If you need re, then try this:
ORF_sequences = re.finditer(r'(?<=ATG)(.*?)(?:TAA|TAG|TGA)',fdna)
for match in ORF_sequences:
    print(match.span())
    print("ATG" + match.group(0))
Output
>>>
(3, 9)
ATGTTTTGA
(11, 20)
ATGGATGTTTGA
Note
This won't always work. But you can check the value of match.start() against cds_start_position and remove unwanted sequences.
Try this, not re, but works...
def getSequences(fdna, start):
    """Find the sequence fully to left of start and
    laying over the start"""
    i = 0
    j = 0
    f = False
    while True:
        m = fdna[i:i+3]
        if f is False:
            if m == "ATG":
                f = True
                j = i
                i += 2
        else:
            if m in ["TAA", "TAG", "TGA"]:
                i += 2
                seq1 = fdna[j: i+1]
                break
        i += 1
    i = 1
    j = 0
    f = False
    while True:
        m = fdna[i:i+3]
        if f is False:
            if m == "ATG" and i < start:
                f = True
                j = i
                i += 2
        else:
            if m in ["TAA", "TAG", "TGA"] and i > start:
                i += 2
                seq2 = fdna[j: i+1]
                break
        i += 1
    print("Sequence 1 : " + seq1)
    print("Sequence 2 : " + seq2)
Test
fdna = 'atgttttgatggATGTTTGATTAG'
cds_start_positions = []
cds_start_positions.append(re.search("[A-Z]", fdna).start())
fdna = fdna.upper()
getSequences(fdna, cds_start_positions[0])
Output
>>>
Sequence 1 : ATGTTTTGA
Sequence 2 : ATGGATGTTTGA
