Simple Python Message Encoder/Decoder Issue - python

I'm new to Python and just playing around with some code. I am trying to build a "secret message generator" which takes a string (e.g. "1234567890") and outputs it based on a simple pattern (e.g. "1357924680"). I have the encoder working 90% (currently it can't handle apostrophes), but the decoder is giving me a lot of trouble. For anything over 6 characters, there is no problem: inputting "1357924680" outputs "1234567890". However, for shorter odd-length strings (e.g. "Hello"), it does not show the last character (e.g. it outputs "Hell"). My code is below. There may be a simpler way to write it, but since I built this myself, I'd appreciate working with my code rather than rewriting it. So, how can it be fixed?
#simple decoder
def decode(s2):
    oddlist = []
    evenlist = []
    final = []
    s2 = s2.lower() #makes a string lowercase
    size = len(s2) #gets the string size
    print "Size " + str(size) #test print
    half = size / 2
    print "Half " + str(half)
    i = 0
    while i < half:
        if size % 2 == 0: #checks if it is even
            split = size / 2 #splits it
            oddlist.append(s2[0:split]) #makes a list of the first half
            evenlist.append(s2[split:]) #makes a list of the second half
            joinodd = ''.join(oddlist) #list -> string
            joineven = ''.join(evenlist) #list -> string
        else:
            split = (size / 2) + 1
            print split
            oddlist.append(s2[0:split]) #makes a list of the first half
            evenlist.append(s2[split:]) #makes a list of the second half
            joinodd = ''.join(oddlist) #list -> string
            joineven = ''.join(evenlist) #list -> string
        string = joinodd[i] + joineven[i]
        final.append(string)
        i = i + 1
    decoded = ''.join(final)
    print final
    return decoded
print decode("hello")

Maybe another answer will point out the error in your code, but I want to make a recommendation: if you are using Python slice notation, use it all the way! Here is an example of how you can do what you want in a more Pythonic way:
import itertools
def encode(s):
    return s[::2] + s[1::2]
def decode(s):
    lim = (len(s)+1)/2
    return ''.join([odd + even for odd,even in itertools.izip_longest(s[:lim], s[lim:],fillvalue="")])
def test(s):
    print "enc:",encode(s)
    print "dec:",decode(encode(s))
    print "orig:",s
    print
test("")
test("1")
test("123")
test("1234")
test("1234567890")
test("123456789")
test("Hello")
Output:
enc:
dec:
orig:
enc: 1
dec: 1
orig: 1
enc: 132
dec: 123
orig: 123
enc: 1324
dec: 1234
orig: 1234
enc: 1357924680
dec: 1234567890
orig: 1234567890
enc: 135792468
dec: 123456789
orig: 123456789
enc: Hloel
dec: Hello
orig: Hello

Your code pairs up the characters of the two halves, which doesn't work for words of odd length. So either you skip the last character with
while i < half:
> ['hl', 'eo']
or you make sure that you reach every character with:
while i <= half:
> ['hl', 'eo', 'll']
though this adds an extra letter to the output, since it appends another full pair. You might need to rethink that algorithm.
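One possible way to rethink the loop while keeping the same split-into-halves idea is sketched below in Python 3. This is an illustration rather than the asker's exact code: it drops the lowercasing and the test prints, and it assumes the encoder puts the extra character in the first half when the length is odd.

```python
def decode(s2):
    """Interleave the two halves of an encoded string (Python 3 sketch)."""
    size = len(s2)
    split = (size + 1) // 2          # the first half gets the extra character when size is odd
    first, second = s2[:split], s2[split:]
    final = []
    for i in range(len(second)):     # pair characters while both halves still have one
        final.append(first[i] + second[i])
    if len(first) > len(second):     # odd length: one unpaired character is left over
        final.append(first[-1])
    return ''.join(final)

print(decode("hloel"))       # hello
print(decode("1357924680"))  # 1234567890
```

The leftover character is appended on its own, so no extra pair is created for odd lengths.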

Related

Brute force solution without itertools in Python

I'm new to Python.
I'm trying to calculate 3 ** 16 (¹[0-f], ²[0-f], ³[0-f])
but it's not working properly.
This is my code:
inp = str(input('value len 0-3 digit:'))
hexa = ('0123456789abcdef');
#len(hexa) = 16 digit
#pass = '36f'
pass = inp
for x in range(0, 3 ** len(hexa)):
    #range = 0..(3 ^ 16)
    if(hexa[x] == pass):
        found = hexa[x]
        #if result valid
        print("pos: %d, pass: %s" % (x, found))
        #print position
I got the error "index out of bound".
I need output like this:
000 #first
001
002
...
...
36f #found then break
...
...
fff #last
How do I fix it?
I believe your IndexError: string index out of range error comes from this logic:
for x in range(0, 3 ** len(hexa)):
Which probably should be:
for x in range(len(hexa) ** len(inp)):
That is a much smaller number. The following comparison is also never going to work on input of more than one digit, because hexa[x] is a single character:
if(hexa[x] == pass):
You need to rethink this. Using Python's own hex() function, I came up with an approximation of what I'm guessing you're trying to do:
hexadecimal = input('hex value of 1-3 digits: ').lower()
hex_digits = '0123456789abcdef'
for x in range(len(hex_digits) ** len(hexadecimal)):
    if hex(x) == "0x" + hexadecimal:
        print("pos: %d, pass: %s" % (x, hexadecimal))
        break
OUTPUT
> python3 test.py
hex value of 1-3 digits: 36f
pos: 879, pass: 36f
>
If that's not what you're trying to do, please explain further in your question.
Other issues to consider: don't use pass as the name of a variable, it's a Python keyword; input() returns a str, you don't need to call str on it; avoid semicolons (;) in Python.
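If the goal is to walk through zero-padded candidates such as 000, 001, 002 as in the desired output, a variant based on format() may be closer. This is only a sketch: the function name find_hex and the width parameter are made up for illustration. Unlike the hex() comparison, it also matches inputs with leading zeros such as 00a.

```python
def find_hex(target, width=3):
    """Return the position of target among all width-digit hex strings, or None."""
    target = target.lower()
    for x in range(16 ** width):
        candidate = format(x, '0{}x'.format(width))  # zero-padded, e.g. 5 -> '005'
        if candidate == target:
            return x                                 # found: stop searching
    return None

print(find_hex('36f'))  # 879
```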

Why is my Caesar cipher only printing the last letter of the string? python

def caesar_cipher(offset, string):
    words = string.replace(" ", " ")
    cipher_chars = "abcdefghijklmnopqrstuvwxyz"
    word_i = 0
    while word_i < len(words):
        word = words[word_i]
        letter_i = 0
        while letter_i < len(word):
            char_i = ord(word[letter_i]) - ord("c")
            new_char_i = (char_i + offset) % 26
            value = chr(new_char_i + ord("c"))
            letter_i += 1
        word_i += 1
    return words.join(value)
print caesar_cipher(3, "abc")
Hey everyone, for some reason my Caesar cipher is only printing the last letter of my string when I want it to encipher the whole string. For example, with an offset of 3 and the string "abc" it should print def, but instead it just prints f. Any help is greatly appreciated!
value is overwritten on every pass through the loop. You want to build a list and pass it to join (at the moment you're joining only one character):
value = []
then
value.append(chr(new_char_i + ord("c")))
The join statement is also wrong; just do:
return "".join(value)
Note that there are other issues in your code. It seems intended to process several words, but it doesn't: there's no list of words, just a single string, so the outer loop walks over characters and the inner loop runs only once each time. What you are doing can be summarized as follows (using a simple list comprehension):
def caesar_cipher(offset, string):
    return "".join([chr((ord(letter) - ord("c") + offset) % 26 + ord("c")) for letter in string])
and for a sentence:
print(" ".join([caesar_cipher(3, w) for w in "a full sentence".split()]))
As a commenter kindly noted, using c as the start letter is not correct: the output range then runs from c to |, so letters that should wrap around to a or b come out as { or |. There's no reason not to start from a (the results are the same for the rest of the letters):
def caesar_cipher(offset, string):
    return "".join([chr((ord(letter) - ord("a") + offset) % 26 + ord("a")) for letter in string])
Aside: a quick, similar algorithm is rot13. It's not really a cipher, but it's natively supported:
import codecs
print(codecs.encode("a full sentence","rot13"))
(apply it again to the encoded string to decode it)
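The one-liners above assume lowercase letters only. A slightly longer sketch that leaves spaces, digits and punctuation alone and keeps the case of each letter could look like this (Python 3). The handling of uppercase and non-letters is an assumption about what is wanted, not something stated in the question.

```python
def caesar_cipher(offset, text):
    """Shift ASCII letters by offset, wrapping within the alphabet; leave other characters alone."""
    result = []
    for ch in text:
        if 'a' <= ch <= 'z':
            base = ord('a')
        elif 'A' <= ch <= 'Z':
            base = ord('A')
        else:
            result.append(ch)      # not a letter: copy unchanged
            continue
        result.append(chr((ord(ch) - base + offset) % 26 + base))
    return ''.join(result)

print(caesar_cipher(3, "abc"))            # def
print(caesar_cipher(3, "Hello, World!"))  # Khoor, Zruog!
```

Decoding is the same call with the negated offset, for example caesar_cipher(-3, "def").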

Selecting specific int values from list and changing them

I have been playing with Python and came across a task from MIT, which is to create a coded message (a Caesar cipher, where for example you change the letters ABCD in the message to CDEF). This is what I came up with:
Phrase = input('Type message to encrypt: ')
shiftValue = int(input('Enter shift value: '))
listPhrase = list(Phrase)
listLenght = len(listPhrase)
ascii = []
for ch in listPhrase:
    ascii.append(ord(ch))
print (ascii)
asciiCoded = []
for i in ascii:
    asciiCoded.append(i+shiftValue)
print (asciiCoded)
phraseCoded = []
for i in asciiCoded:
    phraseCoded.append(chr(i))
print (phraseCoded)
stringCoded = ''.join(phraseCoded)
print (stringCoded)
The code works, but I still have to avoid shifting the ASCII values of spaces and special characters in the message.
So my idea is to select the values in the list that fall in range(65,90) and range(97,122) and change them while leaving all the others unchanged. But how do I do that?
If you want to use that gigantic code :) to do something as simple as that, then you can add a check like so:
asciiCoded = []
for i in ascii:
    if 65 <= i <= 90 or 97 <= i <= 122: # only letters get changed
        asciiCoded.append(i+shiftValue)
    else:
        asciiCoded.append(i)
But you know what, Python can do all of that in a single line, using a comprehension. Watch this:
Phrase = input('Type message to encrypt: ')
shiftValue = int(input('Enter shift value: '))
# encoding to cypher, in single line
stringCoded = ''.join(chr(ord(c)+shiftValue) if c.isalpha() else c for c in Phrase)
print(stringCoded)
A little explanation: the comprehension boils down to this for loop, which is easier to follow. Caught something? :)
temp_list = []
for c in Phrase:
    if c.isalpha():
        # shift if c is a letter
        temp_list.append(chr(ord(c)+shiftValue))
    else:
        # no shift if c is not a letter
        temp_list.append(c)
# join the list to form a string
stringCoded = ''.join(temp_list)
It is much easier to use the maketrans function from the string module:
>>import string
>>
>>caesar = string.maketrans('ABCD', 'CDEF')
>>
>>s = 'CAD BA'
>>
>>print s
>>print s.translate(caesar)
CAD BA
ECF DC
EDIT: This was for Python 2.7
With 3.5 just do
caesar = str.maketrans('ABCD', 'CDEF')
And here is an easy function that returns a mapping:
>>> def encrypt(shift):
...     alphabet = string.ascii_uppercase
...     move = (len(alphabet) + shift) % len(alphabet)
...     map_to = alphabet[move:] + alphabet[:move]
...     return str.maketrans(alphabet, map_to)
>>> "ABC".translate(encrypt(4))
'EFG'
This function uses modulo addition to construct the encrypted caesar string.
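The same translation-table idea can be extended to cover both cases at once. Anything not in the table (spaces, digits, punctuation) passes through translate() unchanged, which is what the question asks for. This is a Python 3 sketch, and the name caesar_table is made up for illustration.

```python
import string

def caesar_table(shift):
    """Build a str.translate table that rotates upper- and lowercase ASCII letters by shift."""
    shift %= 26
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    return str.maketrans(
        lower + upper,
        lower[shift:] + lower[:shift] + upper[shift:] + upper[:shift],
    )

print("Attack at dawn, XYZ!".translate(caesar_table(2)))  # Cvvcem cv fcyp, ZAB!
```

Because the rotation wraps inside each alphabet, letters near the end (X, Y, Z) map back to the start instead of turning into punctuation.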
asciiCoded = []
final_ascii = ""
for i in ascii:
    final_ascii = i+shiftValue #add shiftValue to ascii value of character
    if final_ascii in range(65,91) or final_ascii in range(97,123): #Condition to skip the special characters
        asciiCoded.append(final_ascii)
    else:
        asciiCoded.append(i)
print (asciiCoded)

Changing parts of a string with '#'

Wanted to see if I'm going in the right direction. I have to change everything but the last four characters of a string into #. I've got two ideas so far.
First one:
def maskify(cc):
    cc = raw_input("Enter passcode: ")
    n = len(cc)
    cc.replace(cc[0:n-4], #) # this one gives me the unexpected EOF while parsing error
Second one (I think this one's closer because it supposedly needs an algorithm):
def maskify(cc):
    cc = raw_input("Enter passcode: ")
    n = len(cc)
    for i in range (0, n-4): # i think a for loop would be good but i don't know how i'm going to use it yet
        cc.replace( #not entirely sure what to put here
        pass
cc = raw_input("Enter passcode: ")
cc = ''.join(('#' * (len(cc) - 4), cc[-4:]))
The problem in the first example is that the # is unquoted. You need to change it to '#', otherwise it is parsed as the start of a comment and the closing parenthesis becomes part of that comment. This only fixes the parsing error, though.
The problem with strings is that you can't change characters inside them (they are immutable). A common way around this is to create a list of the string's characters, change the ones you want to change, and then convert the list back to a string (often using ''.join(character_list)). Try that!
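A minimal sketch of that list-then-join approach (Python 3), with the prompt left outside the function. The loop simply does nothing when the string has four characters or fewer.

```python
def maskify(cc):
    """Replace every character except the last four with '#'."""
    chars = list(cc)                  # strings are immutable, lists are not
    for i in range(len(chars) - 4):   # indices of everything except the last four
        chars[i] = '#'
    return ''.join(chars)             # back to a string

print(maskify("kased"))     # #ased
print(maskify("12345678"))  # ####5678
```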
How about the following?
def maskify() :
    cc = input("Enter passcode: ")
    mask = '#'*(len(cc)-4)
    return mask + cc[-4:]
I'm not sure how the flow of the rest of your program works, but I doubt whether you should be prompting for raw_input inside of this function. You can decide that depending on your needs. The alternative would look something like this:
def maskify(cc) :
    return '#'*(len(cc)-4) + cc[-4:]
myInput = input("Enter passcode: ")
maskedInput = maskify( myInput )
NB: python2 uses raw_input instead of input
Just a little change to your own code:
cc = raw_input("Enter passcode: ")
n = len(cc)
c=""
for i in range(0, n-4): # i think a for loop would be good but i don't know how i'm going to use it yet
    c+="#" #not entirely sure what to put here
cc= c+cc [-4:]
print cc
output:
Enter passcode: kased
#ased
The following solution assumes a security-type use: passcodes of 4 or fewer characters should be masked entirely, otherwise someone would see the whole passcode.
import string
def maskify(cc):
    if len(cc) < 9:
        split = [0,1,2,3,4,4,4,4,4][len(cc)]
    else:
        split = len(cc) - 4
    return "#" * split + cc[split:]
for length in range(1,12):
    test = string.lowercase[:length]
    print "%s > %s" % (test, maskify(test))
Giving the following results:
a > #
ab > ##
abc > ###
abcd > ####
abcde > ####e
abcdef > ####ef
abcdefg > ####efg
abcdefgh > ####efgh
abcdefghi > #####fghi
abcdefghij > ######ghij
abcdefghijk > #######hijk
If masking short passcodes is not required, simply change the lookup list as follows to get the other results:
def maskify(cc):
    if len(cc) < 9:
        split = [0,0,0,0,0,1,2,3,4][len(cc)]
    else:
        split = len(cc) - 4
    return "#" * split + cc[split:]
Giving:
a > a
ab > ab
abc > abc
abcd > abcd
abcde > #bcde
abcdef > ##cdef
abcdefg > ###defg
abcdefgh > ####efgh
abcdefghi > #####fghi
abcdefghij > ######ghij
abcdefghijk > #######hijk
The string literal '#' is not the same as the character # which starts an inline comment.
def maskify(cc):
    cc = raw_input("Enter passcode: ")
    mask = '#'*(len(cc)-4)
    return mask + cc[-4:]
As others have mentioned # starts a comment. If you want a string containing a hash you need to do '#'.
As André Laszlo mentioned, Python strings are immutable, so it's impossible for a string operation to change a string's content. Thus the str.replace() method can't change the original string: it needs to create a new string which is a modified version of the original string.
So if you do
cc = 'cat'
cc.replace('c', 'b')
then Python would create a new string containing 'bat' which would get thrown away because you're not saving it anywhere.
Instead, you need to do something like
cc = 'cat'
cc = cc.replace('c', 'b')
This discards the original string object 'cat' that was bound to the name cc and binds the new string 'bat' to it.
The best approach (AFAIK) to solving your problem is given in bebop's answer. Here's a slightly modified version of bebop's code, showing that it handles short strings (including the empty string) correctly.
def maskify(s) :
    return '#' * (len(s) - 4) + s[-4:]
alpha = 'abcdefghij'
data = [alpha[:i] for i in range(len(alpha)+1)]
for s in data:
    print((s, maskify(s)))
output
('', '')
('a', 'a')
('ab', 'ab')
('abc', 'abc')
('abcd', 'abcd')
('abcde', '#bcde')
('abcdef', '##cdef')
('abcdefg', '###defg')
('abcdefgh', '####efgh')
('abcdefghi', '#####fghi')
('abcdefghij', '######ghij')
First of all, strings are immutable (they can't be changed once created), so you can't change the value of cc in place with replace(). To change every part of the string except the last four characters, do this:
def doso(text):
    assert len(str(text))>4, 'Length of text should be >4'
    ctext=str(text).replace(str(text)[:(len(str(text))-4)],'#'*(len(str(text))-4))
    print(ctext)
>>>doso(123) raises an AssertionError with the message 'Length of text should be >4', because len(str(text)) == 3, which is what we expect
However,
>>>doso(12345678) prints ####5678, which is exactly what we expect
Note: '#' * (len(str(text))-4) produces one # for each character we want to replace
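One caveat with the replace()-based approach is that str.replace substitutes every occurrence of the prefix, not just the one at the start, so a passcode whose ending repeats its beginning gets masked completely. The sketch below (Python 3) contrasts it with a purely positional version. Both function names are made up for illustration.

```python
def mask_with_replace(text):
    """Mask via str.replace, as in the answer above (assumes len(text) > 4)."""
    n = len(text) - 4
    return text.replace(text[:n], '#' * n)   # replaces EVERY occurrence of the prefix

def mask_with_slice(text):
    """Mask by position: n hashes followed by whatever comes after the first n characters."""
    n = max(len(text) - 4, 0)
    return '#' * n + text[n:]

print(mask_with_replace('12345678'))  # ####5678
print(mask_with_replace('aaaaaaaa'))  # ########  (the last four characters are lost)
print(mask_with_slice('aaaaaaaa'))    # ####aaaa
```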
# return masked string
def maskify(cc):
    maskify = cc
    maskifyL= len(maskify) - 4
    char = ""
    a=0
    if maskifyL <= 2:
        c2 = cc
    else:
        for a in range(maskifyL):
            char += "#"
            a += 1
        c2 = maskify.replace(maskify[:maskifyL], char, maskifyL)
    return c2

Limiting re.findall() to record values before a certain number. Python

sequence_list = ['atgttttgatggATGTTTGATTAG','atggggtagatggggATGGGGTGA','atgaaataatggggATGAAATAA']
I take each element (fdna) from sequence_list and search for sequences that start with ATG and are then read three bases at a time until a TAA, TGA, or TAG is reached.
Each element in sequence_list is made up of two sequences. The first sequence is lowercase and the second is uppercase, so each string is composed of lowercase + UPPERCASE.
Gathering CDS Starts & Upper()
cds_start_positions = []
cds_start_positions.append(re.search("[A-Z]", fdna).start())
fdna = fdna.upper()
So after I find where the uppercase sequence starts, I record the index number in cds_start_positions and then convert the entire string (fdna) to uppercase
This statement gathers all ATG-xxx-xxx- that are followed by either a TAA|TAG|TGA
Gathering uORFs
ORF_sequences = re.findall(r'ATG(?:...)*?(?:TAA|TAG|TGA)',fdna)
So what I'm trying to do is gather all occurrences where ATG-xxx-xxx is followed by TAA, TGA, or TAG.
My input data is composed of two sequences (lowercaseUPPERCASE), and I want to find these sequences when:
1: the ATG and the TAA|TGA|TAG that follows it are both in the lowercase portion (which is now uppercase, but the index where the uppercase portion begins is stored in cds_start_positions)
2: the ATG is in the lowercase portion (before the cds_start_positions value) and the next TAA|TGA|TAG that follows it is in the uppercase portion.
NOTE: the way it is set up now, an ATG that was in the original uppercase portion (at or after the cds_start_positions value) is also saved to the list.
What the "Gathering CDS Starts & Upper()" part does is find where the uppercase sequence starts.
Is there any way to put constraints on the "Gathering uORFs" part so that it only recognizes an ATG at a position before the corresponding element in the list cds_start_positions?
I want to put a condition in the ORF_sequences line so that it only finds 'ATG' before each element in the list 'cds_start_positions'.
Example of what cds_start_positions would look like
cds_start_positions = [12, 15, 14] #where each value indicates where the uppercase portion starts in the sequence_list elements (fdna)
for the first sequence in sequence_list
i would want this result:
#input
fdna = 'atgttttgatggATGTTTGATTAG'
#what i want for the output
ORF_sequences = ['ATGTTTTGA','ATGGATGTTTGA']
#what i'm getting for the output
ORF_sequences = ['ATGTTTTGA','ATGGATGTTTGA','ATGTTTGATTAG']
That 3rd entry starts at index 12, which is not before the corresponding value in cds_start_positions, and I don't want that. However, the 2nd entry has its starting ATG before index 12 and its TAA|TGA|TAG after index 12, which should be allowed.
***Note
I have another line of code that just takes the start positions of where these ATG-xxx-xxx-TAA|TGA|TAG occur and that is:
start_positions = [i for i in start_positions if i < k-1]
Is there a way to use this principle in re.findall ?
let me know if i need to clarify anything
Yesterday, I wrote a first answer.
Then I read ATOzTOA's answer, which contains a very good idea: using a positive look-behind assertion.
I thought that my answer was completely off and that his idea was the right way to go.
But afterward, I realized that there's a flaw in ATOzTOA's code.
Say the examined string contains the portion 'ATGxzyxyzyyxyATGxzxzyxyzxxzzxzyzyxyzTGA': the consuming part of the pattern matches 'xzyxyzyyxyATGxzxzyxyzxxzzxzyzyxyzTGA' and the look-behind assertion matches the preceding 'ATG', so the portion constitutes a match; that's OK.
But it means that just after this match, the regex engine is positioned at the end of the portion 'xzyxyzyyxyATGxzxzyxyzxxzzxzyzyxyzTGA'.
So when the regex engine searches for the next match, it won't find one beginning at the second 'ATG' inside this portion, since it resumes from a position well past it.
So the only way to achieve what the question requires is in fact the first algorithm I had written, so I am reposting it.
The job is done by the function find_ORF_seq().
If you pass True for the second parameter, messages, of find_ORF_seq(), it prints messages that help to understand the algorithm.
If not, the parameter messages takes the default value None.
The pattern is written '(atg).+?(?:TAA|TGA|TAG)' with some letters uppercased and the others lowercased, but that is not why the portions are captured correctly with respect to upper and lower case. As you will see, the flag re.IGNORECASE is used: this flag is necessary since the part matched by (?:TAA|TGA|TAG) can fall in the lowercase part as well as in the uppercase part.
The essence of the algorithm lies in the while loop, which is necessary because the portions searched for may overlap, as I explained above (provided I understood correctly and the samples and explanations you gave are correct).
So findall() and finditer() cannot return these overlapping matches with this pattern, and I use a loop instead.
To avoid iterating over the fdna sequence one base at a time, I use the ma.start() method, which gives the position of the beginning of a match ma in a string, and I advance s with s = s + p + 1 (+1 so as not to search again from the start of the match just found!).
My algorithm doesn't need the information in start_positions because I don't use a look-behind assertion but a real match on the first three letters: a match is rejected when its start is in the uppercase part, that is to say when ma.group(1), which captures the first three bases (either 'ATG' or 'atg', since the regex ignores case), is equal to 'ATG'.
I had to use s = s + p + 1 instead of s = s + p + 3 because it seems that the portions you are searching for are not spaced by multiples of three bases.
import re
sequence_list = ['atgttttgatgATGTTTTGATTT',
                 'atggggtagatggggATGGGGTGA',
                 'atgaaataatggggATGAAATAA',
                 'aaggtacttctcggctaACTTTTTCCAAGT']
pat = '(atg).+?(?:TAA|TGA|TAG)'
reg = re.compile(pat,re.IGNORECASE)
def find_ORF_seq(fdna,messages=None,s=0,reg=reg):
    ORF_sequences = []
    if messages:
        print 's before == ',s
    while True:
        if messages:
            print ('---------------------------\n'
                   's == %d\n'
                   'fdna[%d:] == %r' % (s,s,fdna[s:]))
        ma = reg.search(fdna[s:])
        if messages:
            print 'reg.search(fdna[%d:]) == %r' % (s,ma)
        if ma:
            if messages:
                print ('ma.group() == %r\n'
                       'ma.group(1) == %r'
                       % (ma.group(),ma.group(1)))
            if ma.group(1)=='ATG':
                if messages:
                    print "ma.group(1) is uppercased 'ATG' then I break"
                break
            else:
                ORF_sequences.append(ma.group().upper())
                p = ma.start()
                if messages:
                    print (' The match is at position p == %d in fdna[%d:]\n'
                           ' and at position s + p == %d + %d == %d in fdna\n'
                           ' then I put s = s + p + 1 == %d'
                           % (p,s, s,p,s+p, s+p+1))
                s = s + p + 1
        else:
            break
    if messages:
        print '\n==== RESULT ======\n'
    return ORF_sequences
for fdna in sequence_list:
    print ('\n============================================')
    print ('fdna == %s\n'
           'ORF_sequences == %r'
           % (fdna, find_ORF_seq(fdna,True)))
###############################
print '\n\n\n######################\n\ninput sample'
fdna = 'atgttttgatggATGTTTGATTTATTTTAG'
print ' fdna == %s' % fdna
print ' **atgttttga**tggATGTTTGATTTATTTTAG'
print ' atgttttg**atggATGTTTGA**TTTATTTTAG'
print 'output sample'
print " ORF_sequences = ['ATGTTTTGA','ATGGATGTTTGA']"
print '\nfind_ORF_seq(fdna) ==',find_ORF_seq(fdna)
The same function without the print instructions to better see the algorithm.
import re
pat = '(atg).+?(?:TAA|TGA|TAG)'
reg = re.compile(pat,re.IGNORECASE)
def find_ORF_seq(fdna,messages=None,s =0,reg=reg):
    ORF_sequences = []
    while True:
        ma = reg.search(fdna[s:])
        if ma:
            if ma.group(1)=='ATG':
                break
            else:
                ORF_sequences.append(ma.group().upper())
                s = s + ma.start() + 1
        else:
            break
    return ORF_sequences
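For comparison, overlapping matches can also be collected with finditer() by wrapping the whole pattern in a lookahead with a capturing group: the lookahead consumes nothing, so the scan advances one position at a time and every ATG gets a chance to start a match. The sketch below (Python 3) is a different technique from the loop above, not a rewrite of it. It uses the in-frame pattern from the question rather than '(atg).+?', and the name find_uorfs is made up for illustration.

```python
import re

# Zero-width lookahead; group 1 holds ATG, whole codons, then the first in-frame stop codon.
ORF_PATTERN = re.compile(r'(?=(ATG(?:...)*?(?:TAA|TAG|TGA)))')

def find_uorfs(fdna):
    """Return ORFs whose ATG lies before the first uppercase base of fdna."""
    first_upper = re.search(r'[A-Z]', fdna)
    # Assumption: if there is no uppercase part, every ATG is allowed.
    cds_start = first_upper.start() if first_upper else len(fdna)
    upper = fdna.upper()
    return [m.group(1) for m in ORF_PATTERN.finditer(upper) if m.start() < cds_start]

print(find_uorfs('atgttttgatggATGTTTGATTAG'))  # ['ATGTTTTGA', 'ATGGATGTTTGA']
```

The m.start() < cds_start filter plays the same role as the start_positions check mentioned in the question.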
I compared the two functions, ATOzTOA's and mine, on an fdna sequence that reveals the flaw. The result confirms what I described.
from find_ORF_sequences import find_ORF_seq
from ATOz_get_sequences import getSequences
fdna = 'atgggatggtagatggatgggATGGGGTGA'
print 'fdna == %s' % fdna
print 'find_ORF_seq(fdna)\n',find_ORF_seq(fdna)
print 'getSequences(fdna)\n',getSequences(fdna)
result
fdna == atgggatggtagatggatgggATGGGGTGA
find_ORF_seq(fdna)
['ATGGGATGGTAG', 'ATGGTAG', 'ATGGATGGGATGGGGTGA', 'ATGGGATGGGGTGA']
getSequences(fdna)
['ATGGGATGGTAG', 'ATGGATGGGATGGGGTGA']
But after all, I wonder:
do you want the matches that are inner parts of another match, like 'ATGGGATGGGGTGA' at the end of 'ATGGATGGGATGGGGTGA'?
If not, ATOzTOA's answer will also fit.
Update 2 - Complete code
import re
def getSequences(fdna):
    start = re.search("[A-Z]", fdna).start()
    fdna = fdna.upper()
    ORF_sequences = re.finditer(r'(?<=ATG)(.*?)(?:TAA|TAG|TGA)',fdna)
    sequences = []
    for match in ORF_sequences:
        s = match.start()
        if s < start:
            sequences.append("ATG" + match.group(0))
    print sequences
    return sequences
sequence_list = ['atgttttgatggATGTTTGATTAG','atggggtagatggggATGGGGTGA','atgaaataatggggATGAAATAA']
for fdna in sequence_list:
    getSequences(fdna)
Output
>>>
['ATGTTTTGA', 'ATGGATGTTTGA']
['ATGGGGTAG', 'ATGGGGATGGGGTGA']
['ATGAAATAA', 'ATGGGGATGA']
Update
If you need re, then try this:
ORF_sequences = re.finditer(r'(?<=ATG)(.*?)(?:TAA|TAG|TGA)',fdna)
for match in ORF_sequences:
    print match.span()
    print "ATG" + match.group(0)
Output
>>>
(3, 9)
ATGTTTTGA
(11, 20)
ATGGATGTTTGA
Note
This won't always work. But you can check the value of match.start() against cds_start_position and remove unwanted sequences.
Try this; it doesn't use re, but it works...
def getSequences(fdna, start):
    """Find the sequence fully to left of start and
    laying over the start"""
    i = 0
    j = 0
    f = False
    while True:
        m = fdna[i:i+3]
        if f is False:
            if m == "ATG":
                f = True
                j = i
                i += 2
        else:
            if m in ["TAA", "TAG", "TGA"]:
                i += 2
                seq1 = fdna[j: i+1]
                break
        i += 1
    i = 1
    j = 0
    f = False
    while True:
        m = fdna[i:i+3]
        if f is False:
            if m == "ATG" and i < start:
                f = True
                j = i
                i += 2
        else:
            if m in ["TAA", "TAG", "TGA"] and i > start:
                i += 2
                seq2 = fdna[j: i+1]
                break
        i += 1
    print "Sequence 1 : " + seq1
    print "Sequence 2 : " + seq2
Test
fdna = 'atgttttgatggATGTTTGATTAG'
cds_start_positions = []
cds_start_positions.append(re.search("[A-Z]", fdna).start())
fdna = fdna.upper()
getSequences(fdna, cds_start_positions[0])
Output
>>>
Sequence 1 : ATGTTTTGA
Sequence 2 : ATGGATGTTTGA
