Parsing Successfully Until IndexError? - python

I have a script that parses out the first upper-case word on each line of this file:
IMPORT fs
IF fs.exists("fs.pyra") THEN
PRINT "fs.pyra Exists!"
END
The script looks like this:
file = open(sys.argv[1], "r")
file = file.read().split("\n")
while '' in file:
    findIt = file.index('')
    file.pop(findIt)
for line in file:
    func = ""
    index = 0
    while line[index] == " ":
        index = index + 1
    while not line[index] == " " or "=" and line[index].isupper():
        func = func + line[index]
        index = index + 1
    print func
All used modules are already imported.
I passed the path of the file being parsed as a command-line argument, and I'm getting this output:
IMPORT
IF
PRINT
Traceback (most recent call last):
File "src/source.py", line 20, in <module>
while not line[index] == " " or "=" and line[index].isupper():
IndexError: string index out of range
Which means it's parsing successfully until the last element of the list, and then it's not parsing that one at all. How do I fix this?

You don't need to increment the index on spaces - line.strip() will remove leading and trailing spaces.
You could split() the line on spaces to get words.
Then you can iterate over those strings and use isupper() to check whole words, rather than individual characters
Alternatively, run the whole file through a pattern matcher for [A-Z]+
Anyways, your error...
while not line[index] == " " or "=" and line[index].isupper():
The or "=" never actually compares anything to "=" - a non-empty string is simply always truthy - so that part of the condition can never stop the loop. Every other line has a space after its first word, which is what ends the loop, but the last line, "END", has nothing after it, so index walks past the end of the string and line[index] raises the IndexError.

If the file you're trying to process is compatible with Python's built-in tokenizer, you can use that (it also handles text inside quotes properly), then take the very first all-capitals name token it finds on each line, e.g.:
import sys
from itertools import groupby
from tokenize import generate_tokens, NAME

with open(sys.argv[1]) as fin:
    # Tokenize and group the tokens by the physical line they came from
    grouped = groupby(generate_tokens(fin.readline), lambda L: L[4])
    # Go over the lines
    for k, g in grouped:
        try:
            # Get the first capitalised name
            print next(t[1] for t in g if t[0] == NAME and t[1].isupper())
        except StopIteration:
            # Couldn't find one - so no panic - move on
            pass
This gives you:
IMPORT
IF
PRINT
END

Related

Python regex is taking more time

I want to get the complete line after the first occurrence of the delimiter character. I am using the regex below, but it is taking a very long time to parse:
import re
str = "abc:dad:kdl--sa:dajs: idsa:kd"
mypat = re.compile(r'.+?:(.+)')
result = mypat.search(str)
line = result.group(1)
line = line.replace("\r", "").replace("\n", "")
print (line)
"I want to get complete line after first occurrence of the delimiter char"
first_occurence = line.index(DELIMITER)
line_after = line[first_occurence+1:]
Note: this will raise a ValueError if the DELIMITER is not present.
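For example, applied to the sample string from the question with ':' as the delimiter (just a sketch of the idea above, not the only way to write it):
s = "abc:dad:kdl--sa:dajs: idsa:kd"
first_occurrence = s.index(":")         # position of the first delimiter
line_after = s[first_occurrence + 1:]   # everything after it
print(line_after)                       # dad:kdl--sa:dajs: idsa:kd
If you would rather not handle the ValueError, s.partition(":")[2] gives the same result and simply returns an empty string when the delimiter is missing.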

Why throw ValueError with int() builtin reading parts of lines from .txt file?

This is a subroutine which reads from studentNamesfile.txt
def calculate_average():
    '''Calculates and displays average mark.'''
    test_results_file = open('studentNamesfile.txt', 'r')
    total = 0
    num_recs = 0
    line = ' '
    while line != '':
        line = test_results_file.readline()
        # Convert everything after the delimiting pipe character to an integer, and add it to total.
        total += int(line[line.find('|') + 1:])
        num_recs += 1
    test_results_file.close()
[num_recs holds the number of records read from the file.]
The format of studentNamesfile.txt is as follows:
Student 01|10
Student 02|20
Student 03|30
and so on. This subroutine is designed to read the mark for all the student records in the file, but I get this error when it runs:
Traceback (most recent call last):
File "python", line 65, in <module>
File "python", line 42, in calculate_average
ValueError: invalid literal for int() with base 10: ''
This error is pretty explicit, but I can't figure out why it's being thrown. I tried tracing the value of line[line.find('|') + 1:], but Python insists it has the correct value (e.g. 10) when I use print(line[line.find('|') + 1:]) on the previous line. What's wrong?
Update: I'm considering the possibility that line[line.find('|') + 1:] includes the newline, which is breaking int(). But using line[line.find('|') + 1:line.find('\\')] doesn't fix the problem - the same error is thrown.
Here:
while line != '':
line = test_results_file.readline()
When you hit the end of the file, .readline() returns an empty string, but since this happens after the while line != '' test, you still try to process this line.
The canonical (and much simpler) way to iterate over a file line by line, which is to, well, iterate over the file, would avoid this problem:
for line in test_result_file:
    do_something_with(line)
You'll just have to take care of calling .rstrip() on line if you want to get rid of the ending newline character (which is the case for your code).
Also, you want to make sure that the file is properly closed whatever happens. The canonical way is to use open() as a context manager:
with open("path/to/file.txt") as f:
for line in test_result_file:
do_something_with(line)
This will call f.close() when exiting the with block, however it's exited (whether the for loop just finished or an exception happened).
Also, instead of doing complex computation to find the part after the pipe, you can just split your string:
for line in test_results_file:
    total += int(line.strip().split("|")[1])
    num_recs += 1
And finally, you could use the stdlib's csv module to parse your file instead of doing it manually...
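For completeness, a minimal sketch of that csv-based version, assuming every record looks like the 'Student 01|10' lines shown in the question (exactly one '|' per line, no blank lines):
import csv

total = 0
num_recs = 0
with open('studentNamesfile.txt') as f:
    for name, mark in csv.reader(f, delimiter='|'):
        total += int(mark)               # the reader already strips the line ending
        num_recs += 1
print(float(total) / num_recs)           # average mark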
Because it's not a numeric value: Python throws a ValueError when it cannot convert the string to an integer. You can use the code below to check it.
def calculate_average():
    test_results_file = open('studentNamesfile.txt', 'r')
    total = 0
    num_recs = 0
    for line in test_results_file.readlines():
        try:
            total += int(line[line.find('|') + 1:])
            num_recs += 1
        except ValueError:
            print("Invalid Data: ", line[line.find('|') + 1:])
    test_results_file.close()
    print("total:", total)
    print("num_recs:", num_recs)
    print("Average:", float(total)/num_recs)
readlines vs readline
from io import StringIO
s = 'hello\n hi\n how are you\n'
f = StringIO(unicode(s))
l = f.readlines()
print(l)
# OUTPUT: [u'hello\n', u' hi\n', u' how are you\n']
f = StringIO(unicode(s))
l1 = f.readline()
# u'hello\n'
l2 = f.readline()
# u' hi\n'
l3 = f.readline()
# u' how are you\n'
l4 = f.readline()
# u''
l5 = f.readline()
# u''
readlines
If we use readlines, it returns a list of all the lines, split on the \n character.
readline
From the code above we can see that the StringIO only has 3 lines, but once they are used up readline() keeps returning an empty string. In your code you then try to convert that empty string to an integer, and that is why you get the ValueError exception.
A simpler approach.
Demo:
total = 0
num_recs = 0
with open(filename) as infile:                         # Read file
    for line in infile:                                # Iterate over each line
        if "|" in line:                                # Check if | is in the line
            total += int(line.strip().split("|")[-1])  # Extract value and sum
            num_recs += 1
print(total, num_recs)

IndexError: cannot fit 'int' into an index-sized integer

So I'm trying to make my program print out the indexes of each word and punctuation mark as they occur in a text file. I have done that part. But the problem is when I try to recreate the original text (with punctuation) using those index positions. Here is my code:
with open('newfiles.txt') as f:
    s = f.read()

import re
# Splitting string into a list using regex and a capturing group:
matches = [x.strip() for x in re.split("([a-zA-Z]+)", s) if x not in ['', ' ']]
print (matches)

d = {}
i = 1
list_with_positions = []
# the dictionary entries:
for match in matches:
    if match not in d.keys():
        d[match] = i
        i += 1
    list_with_positions.append(d[match])
print (list_with_positions)

file = open("newfiletwo.txt", "w")
file.write(''.join(str(e) for e in list_with_positions))
file.close()
file = open("newfilethree.txt", "w")
file.write(''.join(matches))
file.close()

word_base = None
with open('newfilethree.txt', 'rt') as f_base:
    word_base = [None] + [z.strip() for z in f_base.read().split()]

sentence_seq = None
with open('newfiletwo.txt', 'rt') as f_select:
    sentence_seq = [word_base[int(i)] for i in f_select.read().split()]
print(' '.join(sentence_seq))
As I said, the first part works fine, but then I get this error:
Traceback (most recent call last):
File "E:\Python\Indexes.py", line 33, in <module>
sentence_seq = [word_base[int(i)] for i in f_select.read().split()]
File "E:\Python\Indexes.py", line 33, in <listcomp>
sentence_seq = [word_base[int(i)] for i in f_select.read().split()]
IndexError: cannot fit 'int' into an index-sized integer
This error occurs when the program runs through 'sentence_seq' towards the bottom of the code
newfiles is the original text file - a random article with more than one sentence with punctuation
list_with_positions is the list with the actual positions of where each word occurs within the original text
matches is the separated DIFFERENT words - if words repeat in the file (which they do) matches should have only the different words.
Does anyone know why I get the error?
The issue with your approach is using ''.join(), as this joins everything with no spaces. So the immediate issue is that you then attempt to split() what is effectively one long run of digits with no separators; what you get back is a single value with 100+ digits, far too large to be used as a list index - hence the error. Even more of an issue is that indices go into double digits and beyond; how did you expect split() to deal with that when the numbers are joined without spaces?
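A tiny demonstration of that first problem, with made-up position numbers just to show the effect of the missing separator:
positions = [1, 2, 10]
print(''.join(str(e) for e in positions))    # 1210   - split() cannot tell where one number ends
print(' '.join(str(e) for e in positions))   # 1 2 10 - split() gives back ['1', '2', '10']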
Beyond that, you fail to treat punctuation properly. ' '.join() is equally invalid when trying to reconstruct a sentence because you have commas, full stops etc. getting whitespace on either side.
I tried my best to stick with your current code/approach (I don't think there's huge value in changing the entire approach when trying to understand where an issue comes from), but it still feels shaky to me. I dropped the regex; perhaps that was needed. I'm not immediately aware of a library for doing this kind of thing, but almost certainly there must be a better way.
import string

punctuation_list = set(string.punctuation)  # Has to be treated differently

word_base = []
index_dict = {}
with open('newfiles.txt', 'r') as infile:
    raw_data = infile.read().split()
    for index, item in enumerate(raw_data):
        index_dict[item] = index
        word_base.append(item)

with open('newfiletwo.txt', 'w') as outfile1, open('newfilethree.txt', 'w') as outfile2:
    for item in word_base:
        outfile1.write(str(item) + ' ')              # the words
        outfile2.write(str(index_dict[item]) + ' ')  # their index positions

reconstructed = ''
with open('newfiletwo.txt', 'r') as infile1, open('newfilethree.txt', 'r') as infile2:
    words = infile1.read().split()
    indices = infile2.read().split()
    reconstructed = ''.join([item + ' ' if item in punctuation_list else ' ' + item + ' ' for item in word_base])

How to skip reading the first line of an input file in Python

I'm trying to write a script that will automatically process the files containing DNA sequences and convert them into a protein sequence. My only hiccup thus far is that in order for my script to process the infile I need to somehow omit the first line.
My code looks like this:
#!/usr/bin/python
### This script will translate in all three frames ###
import glob
from Bio.Seq import *
from Bio.Alphabet import generic_dna

print "Please drag in the directory to be processed: "
folder = raw_input().replace(" ", "")
file = glob.glob(str(folder) + "/" + '*.seq')

for i in file:
    with open(i, "r") as myfile:
        ### Need to somehow remove / read over the first line of the input...###
        seq = myfile.read().replace(" ", "").replace("\n", "")
    x = 0
    output = open(i + ".tran", "w+")
    while x < 3:
        cd = Seq(seq[x:], generic_dna)
        qes = seq[::-1]
        cdr = Seq(qes[x:], generic_dna)
        error = cd.translate().count('*')
        reverror = cdr.translate().count('*')
        output.write(str(cd.translate()) + "\nStops: " + str(error) + "\n\n")
        output.write("Reverse\n" + str(cdr.translate()) + "\nStops: " + str(reverror) + "\n\n")
        x += 1
And my input files might look something like this
>G2-pBAD-Forward_A11.ab1
NNNNNNNNNNCNNNCNGNNGCTTTTTATCGCAACTCTCTACTGTTTCTCCATACCCGTTTTTTTGGGCTAGCGAATTCGA
GCTCGAAATAATTTTGTTTAACTTTAAGAAGGAGATATACATATGATTGTAATGAAACGAGTTATTACCCTGTTTGCTGT
ACTGCTGATGGGCTGGTCGGTAAATGCCTGGTCAAGCTTGGCTGTTTTGGCGGATGAGAGAAGATTTTCAGCCTGATACA
GATTAAATCAGAACGCAGAAGCGGTCTGATAAAACAGAATTTGCCTGGCGGCAGTAGCGCGGTGGTCCCACCTGACCCCA
TGCCGAACTCAGAAGTGAAACGCCGTAGCGCCGATGGTAGTGTGGGGTCTCCCCATGCGAGAGTAGGGAACTGCCAGGCA
TCAAATAAAACGAAAGGCTCAGTCGAAAGACTGGGCCTTTCGTTTTATCTGTTGTTTGTCGGTGAACGCTCTCCTGAGTA
GGACAAATCCGCCGGGAGCGGATTTGAACGTTGCGAAGCAACGGCCCGGAGGGTGGCGGGCAGGACGCCCGCCATAAACT
GCCAGGCATCAAATTAAGCAGAAGGCCATCCTGACGGATGGCCTTTTTGCGTTTCTACAAACTCTTTTGTTTATTTTTCT
AAATACATTCAAATATGTATCCGCTCATGAGACAATAACCCTGATAAATGCTTCAATAATATTGAAAAAGGAAGAGTATG
AGTATTCAACATTTCCGTGTCGCCCTTATTCCCTTTTTTGCGGCATTTTGCCTTCCTGTTTTTGCTCACCCAGAAACGCT
GGTGAAAGTAAAAGATGCTGAAGATCAGTTGGGTGCAGCAAACTATTAACTGGCGAACTACTTACTCTAGCTTCCCGGCA
ACAATTAATAGACTGGATGGAGGCGGATAAAGTTGCAGGACCACTTCTGCGCTCGGCCCTTCCGGCTGGCTGGGTTTATT
GCTGATAAATCTGGAGCCGGTGAGCGTGNTCTCGCGGTATCATTGCAGCACTGGGGCCAGATGGTAAGCCCTTCCCGTAT
CGNANTTNNCTACACGAN
What I need is an elegant way to remove that first line containing the '>'.
You can use SeqIO:
from Bio import SeqIO

with open(i) as myfile:
    for record in SeqIO.parse(myfile, "fasta"):
        print record.id
        print record.seq
record.seq will give you the sequence and record.id will give you the id; if you only have (or only want) a single sequence in each file, you can just call next:
with open(i) as myfile:
    print(next(SeqIO.parse(myfile, "fasta")).seq)
I don't see any spaces in your input, so I'm not sure what the replace is for; this will output the sequence as a single string.
Output:
NNNNNNNNNNCNNNCNGNNGCTTTTTATCGCAACTCTCTACTGTTTCTCCATACCCGTTTTTTTGGGCTAGCGAATTCGAGCTCGAAATAATTTTGTTTAACTTTAAGAAGGAGATATACATATGATTGTAATGAAACGAGTTATTACCCTGTTTGCTGTACTGCTGATGGGCTGGTCGGTAAATGCCTGGTCAAGCTTGGCTGTTTTGGCGGATGAGAGAAGATTTTCAGCCTGATACAGATTAAATCAGAACGCAGAAGCGGTCTGATAAAACAGAATTTGCCTGGCGGCAGTAGCGCGGTGGTCCCACCTGACCCCATGCCGAACTCAGAAGTGAAACGCCGTAGCGCCGATGGTAGTGTGGGGTCTCCCCATGCGAGAGTAGGGAACTGCCAGGCATCAAATAAAACGAAAGGCTCAGTCGAAAGACTGGGCCTTTCGTTTTATCTGTTGTTTGTCGGTGAACGCTCTCCTGAGTAGGACAAATCCGCCGGGAGCGGATTTGAACGTTGCGAAGCAACGGCCCGGAGGGTGGCGGGCAGGACGCCCGCCATAAACTGCCAGGCATCAAATTAAGCAGAAGGCCATCCTGACGGATGGCCTTTTTGCGTTTCTACAAACTCTTTTGTTTATTTTTCTAAATACATTCAAATATGTATCCGCTCATGAGACAATAACCCTGATAAATGCTTCAATAATATTGAAAAAGGAAGAGTATGAGTATTCAACATTTCCGTGTCGCCCTTATTCCCTTTTTTGCGGCATTTTGCCTTCCTGTTTTTGCTCACCCAGAAACGCTGGTGAAAGTAAAAGATGCTGAAGATCAGTTGGGTGCAGCAAACTATTAACTGGCGAACTACTTACTCTAGCTTCCCGGCAACAATTAATAGACTGGATGGAGGCGGATAAAGTTGCAGGACCACTTCTGCGCTCGGCCCTTCCGGCTGGCTGGGTTTATTGCTGATAAATCTGGAGCCGGTGAGCGTGNTCTCGCGGTATCATTGCAGCACTGGGGCCAGATGGTAAGCCCTTCCCGTATCGNANTTNNCTACACGAN
You can also use range instead of your while loop and pass the alphabet to SeqIO:
for record in SeqIO.parse(myfile, "fasta", generic_dna):
    ...
for x in range(3):
    ...
This should be closer to what you want:
import glob
from Bio import SeqIO
from Bio.Alphabet import generic_dna
from Bio.Seq import Seq

folder = raw_input().replace(" ", "")
files = glob.glob(folder + "/" + '*.seq')

for i in files:
    with open(i) as myfile:
        seq = next(SeqIO.parse(myfile, "fasta", generic_dna)).seq
        qes = seq[::-1]
    with open("{}.tran".format(i), "w+") as output:
        for x in range(3):
            cd = seq[x:]
            cdr = qes[x:]
            error = cd.translate().count('*')
            reverror = cdr.translate().count('*')
            output.write("{}\nstops: {}\n\n".format(cd.translate(), error))
            output.write("Reverse: {}\nStops: {}\n\n ".format(cdr.translate(), reverror))
Which outputs:
XXXXXXXFLSQLSTVSPYPFFWASEFELEIILFNFKKEIYI*L**NELLPCLLYC*WAGR*MPGQAWLFWRMREDFQPDTD*IRTQKRSDKTEFAWRQ*RGGPT*PHAELRSETP*RRW*CGVSPCESRELPGIK*NERLSRKTGPFVLSVVCR*TLS*VGQIRRERI*TLRSNGPEGGGQDARHKLPGIKLSRRPS*RMAFLRFYKLFCLFF*IHSNMYPLMRQ*P**MLQ*Y*KRKSMSIQHFRVALIPFFAAFCLPVFAHPETLVKVKDAEDQLGAANY*LANYLL*LPGNN**TGWRRIKLQDHFCARPFRLAGFIADKSGAGERXLAVSLQHWGQMVSPSRIXXXYT
stops: 25
Reverse: XHIXXXYALPEW*TGVTTLLWRSXASGRGLNSRYLGRSAFPARVFTRTLK*AEVGQIINNGPSISFIKRSIIKRRGLTRSRRK*KWSQRPTRFCPSVLRRFFPYSRCAFTTYEYEKEKVIITS*IVPITEYSPMYKLT*IFLFVFSNIFAFFR*AVLPEDELNYGPSNTARRTGGGRPGNEALQV*ARAA*TG*VLSQVAVCCLFCFPGQKADSESKINYGPSRDESVPLWGVMVAAMPQSEDSSRTPVHPGGAMTAVRLRQNSLAKTQD*IRHSPTFRRE*AVLSVRTGP*MAGRVVVMSFVPLLSKVMLVYI*RKNFNLF**SSSLSDRVFLPIPLCHLSTLFFXXXXXX
Stops: 15
XXXXXXAFYRNSLLFLHTRFFGLANSSSK*FCLTLRRRYTYDCNETSYYPVCCTADGLVGKCLVKLGCFGG*EKIFSLIQIKSERRSGLIKQNLPGGSSAVVPPDPMPNSEVKRRSADGSVGSPHARVGNCQASNKTKGSVERLGLSFYLLFVGERSPE*DKSAGSGFERCEATARRVAGRTPAINCQASN*AEGHPDGWPFCVSTNSFVYFSKYIQICIRS*DNNPDKCFNNIEKGRV*VFNISVSPLFPFLRHFAFLFLLTQKRW*K*KMLKISWVQQTINWRTTYSSFPATINRLDGGG*SCRTTSALGPSGWLGLLLINLEPVSVXSRYHCSTGARW*ALPVSXXXTR
stops: 25
Reverse: STSXXAMPFPNGRPGSRRYYGALVRVAEV*IVVIWVGRPSRLASSPGR*NRRR*VR*LTTALRSHSSSGQLSNDVG*LEVVENESGRKDPLVFVLPFYGVFSLIPAVPLQLMSMRRKKL**LRK*SQ*QSTRLCINLHKSFYLFSQTSLRFSGRQSYRKTN*TTDRQIPPAGRAVGGPATKRCKFRRGPPKQDESSRKWLFVVYFAFRVRKLTRKAK*TTDRQGMRAYPSGV*W*PRCRKVKTQAVPQSTLVAR*RRSV*DKIVWRRRKTKLDIVRLLEESRRFCRFELVRKWLVG*SSCRLSHY*AK*C*YTYRGRISICFNKARA*AIGFFCPYLFVISQRYFSXXXXXX
Stops: 20
XXXXXXLFIATLYCFSIPVFLG*RIRARNNFV*L*EGDIHMIVMKRVITLFAVLLMGWSVNAWSSLAVLADERRFSA*YRLNQNAEAV**NRICLAAVARWSHLTPCRTQK*NAVAPMVVWGLPMRE*GTARHQIKRKAQSKDWAFRFICCLSVNALLSRTNPPGADLNVAKQRPGGWRAGRPP*TARHQIKQKAILTDGLFAFLQTLLFIFLNTFKYVSAHETITLINASIILKKEEYEYSTFPCRPYSLFCGILPSCFCSPRNAGESKRC*RSVGCSKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWVYC**IWSR*AXSRGIIAALGPDGKPFPYRXXLHX
stops: 25
Reverse: AHXXXLCPSRMVDRGHDVTMALXCEWPRSK*SLFGSVGLPGSRLHQDVEIGGGRSDN*QRPFDLIHQAVNYQTTWVD*KS*KMKVVAKTHSFLSFRFTAFFPLFPLCLYNL*V*EGKSYNNFVNSPNNRVLAYV*TYINLFICFLKHLCVFPVGSPTGRRIKLRTVKYRPQDGRWEARQRSVASLGEGRLNRMSPLASGCLLSILLSGSES*LGKQNKLRTVKG*ERTPLGCDGSRDAAK*RLKPYPSPPWWRDDGGPFKTK*SGEDARLN*T*SDF*KRVGGFVGSNWSVNGWSGSRHVVCPIIEQSNVSIHIEEEFQFVLIKLELKRSGFFAHTSLSSLNAIFRXXXXXX
Stops: 14
Although there is a warning because your sequence lengths are not a multiple of three.
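If you want to silence it, one option (a sketch, assuming the warning you see is Biopython's partial-codon warning) is to trim each frame to a whole number of codons before translating:
from Bio.Seq import Seq

seq = Seq("ATGAAATGAGG")                     # 11 bases - not a multiple of three
trimmed = seq[:len(seq) - len(seq) % 3]      # drop the leftover 1-2 bases
print(trimmed.translate())                   # MK* - translates without the warning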
starting = True
lines = []
with open(i, "r") as myfile:
    for line in myfile:
        if starting:
            starting = False
            continue
        lines.append(line)
# now you have all the lines except the first one in "lines"
When you go to read the file, check whether the first character of the line is '>'. If it is, just read the line and skip it. As mentioned in the comments above, you may want to check this each time you read a line so you know whether you are dealing with a new sequence or the same sequence, as your file may have multiple sequences.
This link will give you all of the documentation on reading a file in python.
https://docs.python.org/2/tutorial/inputoutput.html
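A small sketch of that check-and-skip idea, assuming i is the filename from the glob loop in your script and that each .seq file holds a single record:
seq_lines = []
with open(i, "r") as myfile:
    for line in myfile:
        if line.startswith(">"):      # header line - skip it
            continue
        seq_lines.append(line.strip())
seq = "".join(seq_lines)              # the sequence with the header removed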

How can I reverse complement a multiple sequence fasta file with python?

I am new to python and I am trying to figure out how to read a fasta file with multiple sequences and then create a new fasta file containing the reverse complement of the sequences. The file will look something like:
>homo_sapiens
ACGTCAGTACGTACGTCATGACGTACGTACTGACTGACTGACTGACGTACTGACTGACTGACGTACGTACGTACGTACGTACGTACTG
>Canis_lupus
CAGTCATGCATGCATGCAGTCATGACGTCAGTCAGTACTGCATGCATGCATGCATGCATGACTGCAGTACTGACGTACTGACGTCATGCATGCAGTCATG
>Pan_troglodytus
CATGCATACTGCATGCATGCATCATGCATGCATGCATGCATGCATGCATCATGACTGCAGTCATGCAGTCAGTCATGCATGCATCAT
I am trying to learn how to use for and while loops so if the solution can incorporate one of them it would be preferred.
So far I managed to do it in a rather inelegant manner, as follows:
file1 = open('/path/to/file', 'r')
for line in file1:
    if line[0] == '>':
        print line.strip() # to capture the title line
    else:
        import re
        seq = line.strip()
        line = re.sub(r'T', r'P', seq)
        seq = line
        line = re.sub(r'A', r'T', seq)
        seq = line
        line = re.sub(r'G', r'R', seq)
        seq = line
        line = re.sub(r'C', r'G', seq)
        seq = line
        line = re.sub(r'P', r'A', seq)
        seq = line
        line = re.sub(r'R', r'C', seq)
        print line[::-1]
file1.close()
This worked but I know there is a better way to iterate through that end part. Any better solutions?
I know you consider this an exercise for yourself, but in case you are interested in using existing facilities, have a look at the Biopython package. Especially if you are going to do more sequence work.
That would allow you to instantiate a sequence with e.g. seq = Seq('GATTACA'). Then, seq.reverse_complement() will give you the reverse complement.
Note that the reverse complement is more than just string reversal, the nucleotide bases need to be replaced with their complementary letter as well.
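For example, a quick sketch assuming Biopython is installed:
from Bio.Seq import Seq

seq = Seq('GATTACA')
print(seq.reverse_complement())    # TGTAATC - complemented and reversed in one call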
Assuming I got you right, would the code below work for you? You could just add the exchanges you want to the dictionary.
d = {'A':'T', 'C':'G', 'T':'A', 'G':'C'}
with open("seqs.fasta", 'r') as in_file:
    for line in in_file:
        if line != '\n': # skip empty lines
            line = line.strip() # Remove new line character (I'm working on Windows)
            if line.startswith('>'):
                head = line
            else:
                print head
                print ''.join(d[nuc] for nuc in line[::-1])
Output:
>homo_sapiens
CAGTACGTACGTACGTACGTACGTACGTCAGTCAGTCAGTACGTCAGTCAGTCAGTCAGTACGTACGTCATGACGTACGTACTGACGT
>Canis_lupus
CATGACTGCATGCATGACGTCAGTACGTCAGTACTGCAGTCATGCATGCATGCATGCATGCAGTACTGACTGACGTCATGACTGCATGCATGCATGACTG
>Pan_troglodytus
ATGATGCATGCATGACTGACTGCATGACTGCAGTCATGATGCATGCATGCATGCATGCATGCATGATGCATGCATGCAGTATGCATG
Here is a simple example of a string reversal.
Python Code
string = raw_input("Enter a string:")
reverse_string = ""
print "our string is %s" % string
print "our range will be %s\n" % range(0, len(string))
for num in range(0, len(string)):
    offset = len(string) - 1
    reverse_string += string[offset - num]
    print "the num is currently: %d" % num
    print "the offset is currently: %d" % offset
    print "the index is currently: %d" % int(offset - num)
    print "the new string is currently: %s" % reverse_string
    print "-------------------------------"
    offset = -1
print "\nOur reverse string is: %s" % reverse_string
Added print commands to show you what is happening in the script.
Run it in python and see what happens.
Usually, to iterate over lines in a text file you use a for loop, because "open" returns a file object which is iterable
>>> f = open('workfile', 'w')
>>> print f
<open file 'workfile', mode 'w' at 80a0960>
There is more about this here
You can also use context manager "with" to open a file. This key statement will close the file object for you, so you will never forget it.
I decided not to include a "for line in f:" statement because you have to read several lines to process one sequence (title, sequence and blank line). If you try to mix a for loop with "readline()" you will end up with a ValueError (try it :).
So I would use string.translate. This script opens a file named "test" with your example in it:
import string

if __name__ == "__main__":
    file_name = "test"
    # translate() swaps every base in a single pass, so no placeholder letters are needed
    translator = string.maketrans("ACGT", "TGCA")
    with open(file_name, "r") as f:
        while True:
            title = f.readline().strip()
            if not title: # end of file
                break
            rev_seq = f.readline().strip().translate(translator)[::-1]
            f.readline() # blank line
            print(title)
            print(rev_seq)
Output (with your example):
>homo_sapiens
CAGTACGTACGTACGTACGTACGTACGTCAGTCAGTCAGTACGTCAGTCAGTCAGTCAGTACGTACGTCATGACGTACGTACTGACGT
>Canis_lupus
CATGACTGCATGCATGACGTCAGTACGTCAGTACTGCAGTCATGCATGCATGCATGCATGCAGTACTGACTGACGTCATGACTGCATGCATGCATGACTG
>Pan_troglodytus
ATGATGCATGCATGACTGACTGCATGACTGCAGTCATGATGCATGCATGCATGCATGCATGCATGATGCATGCATGCAGTATGCATG
