I'm new to BioPython and I'm trying to import a fasta/fastq file and iterate through each sequence, while performing some operation on each sequence. I know this seems basic, but my code below for some reason is not printing correctly.
from Bio import SeqIO
newfile = open("new.txt", "w")
records = list(SeqIO.parse("rosalind_gc.txt", "fasta"))
i = 0
dna = records[i]
while i <= len(records):
print (dna.name)
i = i + 1
I'm trying to basically iterate through records and print the name, however my code ends up only printing "records[0]", where I want it to print "records[1-10]". Can someone explain why it ends up only print "records[0]"?
The reason for your problem is here:
i = 0
dna = records[i]
Your object 'dna' is fixed to the index 0 of records, i.e., records[0]. Since you are not calling it again, dna will always be fixed on that declaration. On your print statement within your while loop, use something like this:
while i <= len(records):
print (records[i].name)
i = i + 1
If you would like to have an object dna as a copy of records entries, you would need to reassign dna to every single index, making this within your while loop, like this:
while i <= len(records):
dna = records[i]
print (dna.name)
i = i + 1
However, that's not the most efficient way. Finally, for you to learn, a much nicer way than with your while loop with i = i + 1 is to use a for loop, like this:
for i in range(0,len(records)):
print (records[i].name)
For loops do the iteration automatically, one by one. range() will give a set of integers from 0 to the length of records. There are also other ways, but I'm keeping it simple.
Related
I'm making a password generator.
import random
pw = []
MAX_LIMIT = 100
random_string = ''
#random str generator
for i in range(10):
random_integer = random.randint(0, MAX_LIMIT)
random_string += (chr(random_integer))
#password saving
pw.append(random_string,'{}'.format(random.randint(0,10)))
and I am trying to append random_string into the list pw, with a .format(random.randint(0,10)) on the end of the random string. However, it is not letting me append 2 arguments.
I want my final outcome to look something like this.
'6;?2^0D\x1c]\x1b'9
with the 9 on the end being the random.randint(0,10)
cheers lads
I don't actually know what the problem is, but I know how to fix it.
For one, try making the randint that is formated a string by using the str function.
Also, you could use + instead of ,:
pw.append(random_string + str(random.randint(0,10)))
As said in a comment, trying to keep the int outside is impossible.
I wrote a Python3 code to manipulate lists of strings but the code gives Runtime Error for long strings. Here is my code for the problem:
string = "BANANA"
slist= list (string)
mark = list(range(len(slist)))
vowel_substrings = list()
consonants_substrings = list()
#print(mark)
for i in range(len(slist)):
if slist[i]=='A' or slist[i]=='E' or slist[i]=='I' or slist[i]=='O' or mark[i]=='U':
mark[i] = 1
else:
mark[i] = 0
#print(mark)
for j in range(len(slist)):
if mark[j] == 1:
for l in range(j,len(string)):
vowel_substrings.append(string[j:l+1])
#print(string[j:l+1])
else:
for l in range(j,len(string)):
consonants_substrings.append(string[j:l+1])
#print(consonants_substrings)
unique_consonants = list(set(consonants_substrings))
unique_vowels = list(set(vowel_substrings))
##add two lists
all_substrings = consonants_substrings+(vowel_substrings)
#print(all_substrings)
##Find points earned by vowel guy and consonant guy
vowel_guy_score = 0
consonant_guy_score = 0
for strng in unique_vowels:
vowel_guy_score += vowel_substrings.count(strng)
for strng in unique_consonants:
consonant_guy_score += consonants_substrings.count(strng)
#print(vowel_guy_score) #Kevin
#print(consonant_guy_score) #Stuart
if vowel_guy_score > consonant_guy_score:
print("Kevin ",vowel_guy_score)
elif vowel_guy_score < consonant_guy_score:
print("Stuart ",consonant_guy_score)
else:
print("Draw")
gives the right answer. But if you have a long string, shown below, it fails.
NANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANAN
I think initialization or memory allocation might be a problem but I don't know how to allocate memory before even knowing how much memory the code will need. Thank you in advance for any help you can provide.
In the middle there, you generate a data structure of size O(n³): for each starting position × each ending position × length of the substring. That's probably where your memory problems appear (you haven't posted a traceback).
One possible optimisation would be, instead of having a list of substrings and then generating the set, use instead a Counter class. That would let you know how many times each substring appears without storing all the copies:
vowel_substrings = collections.Counter()
consonant_substrings = collections.Counter()
for j in range(len(slist)):
if mark[j] == 1:
for l in range(j,len(string)):
vowel_substrings[string[j:l+1]] += 1
#print(string[j:l+1])
else:
for l in range(j,len(string)):
consonants_substrings[string[j:l+1]] += 1
Even better would be to calculate the scores as you go along, without storing any of the substrings. If I'm reading the code correctly, the substrings aren't actually used for anything — each letter is effectively scored based on its distance from the end of the string, and the scores are added up. This can be calculated in a single pass through the string, without making any additional copies or keeping track of anything other than the cumulative scores and the length of the string.
So for this problem I had to create a program that takes in two arguments. A CSV database like this:
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
And a DNA sequence like this:
TAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
My program works by first getting the "Short Tandem Repeat" (STR) headers from the database (AGATC, etc.), then counting the highest number of times each STR repeats consecutively within the sequence. Finally, it compares these counted values to the values of each row in the database, printing out a name if a match is found, or "No match" otherwise.
The program works for sure, but is ridiculously slow whenever ran using the larger database provided, to the point where the terminal pauses for an entire minute before returning any output. And unfortunately this is causing the 'check50' marking system to time-out and return a negative result upon testing with this large database.
I'm presuming the slowdown is caused by the nested loops within the 'STR_count' function:
def STR_count(sequence, seq_len, STR_array, STR_array_len):
# Creates a list to store max recurrence values for each STR
STR_count_values = [0] * STR_array_len
# Temp value to store current count of STR recurrence
temp_value = 0
# Iterates over each STR in STR_array
for i in range(STR_array_len):
STR_len = len(STR_array[i])
# Iterates over each sequence element
for j in range(seq_len):
# Ensures it's still physically possible for STR to be present in sequence
while (seq_len - j >= STR_len):
# Gets sequence substring of length STR_len, starting from jth element
sub = sequence[j:(j + (STR_len))]
# Compares current substring to current STR
if (sub == STR_array[i]):
temp_value += 1
j += STR_len
else:
# Ensures current STR_count_value is highest
if (temp_value > STR_count_values[i]):
STR_count_values[i] = temp_value
# Resets temp_value to break count, and pushes j forward by 1
temp_value = 0
j += 1
i += 1
return STR_count_values
And the 'DNA_match' function:
# Searches database file for DNA matches
def DNA_match(STR_values, arg_database, STR_array_len):
with open(arg_database, 'r') as csv_database:
database = csv.reader(csv_database)
name_array = [] * (STR_array_len + 1)
next(database)
# Iterates over one row of database at a time
for row in database:
name_array.clear()
# Copies entire row into name_array list
for column in row:
name_array.append(column)
# Converts name_array number strings to actual ints
for i in range(STR_array_len):
name_array[i + 1] = int(name_array[i + 1])
# Checks if a row's STR values match the sequence's values, prints the row name if match is found
match = 0
for i in range(0, STR_array_len, + 1):
if (name_array[i + 1] == STR_values[i]):
match += 1
if (match == STR_array_len):
print(name_array[0])
exit()
print("No match")
exit()
However, I'm new to Python, and haven't really had to consider speed before, so I'm not sure how to improve upon this.
I'm not particularly looking for people to do my work for me, so I'm happy for any suggestions to be as vague as possible. And honestly, I'll value any feedback, including stylistic advice, as I can only imagine how disgusting this code looks to those more experienced.
Here's a link to the full program, if helpful.
Thanks :) x
Thanks for providing a link to the entire program. It seems needlessly complex, but I'd say it's just a lack of knowing what features are available to you. I think you've already identified the part of your code that's causing the slowness - I haven't profiled it or anything, but my first impulse would also be the three nested loops in STR_count.
Here's how I would write it, taking advantage of the Python standard library. Every entry in the database corresponds to one person, so that's what I'm calling them. people is a list of dictionaries, where each dictionary represents one line in the database. We get this for free by using csv.DictReader.
To find the matches in the sequence, for every short tandem repeat in the database, we create a regex pattern (the current short tandem repeat, repeated one or more times). If there is a match in the sequence, the total number of repetitions is equal to the length of the match divided by the length of the current tandem repeat. For example, if AGATCAGATCAGATC is present in the sequence, and the current tandem repeat is AGATC, then the number of repetitions will be len("AGATCAGATCAGATC") // len("AGATC") which is 15 // 5, which is 3.
count is just a dictionary that maps short tandem repeats to their corresponding number of repetitions in the sequence. Finally, we search for a person whose short tandem repeat counts match those of count exactly, and print their name. If no such person exists, we print "No match".
def main():
import argparse
from csv import DictReader
import re
parser = argparse.ArgumentParser()
parser.add_argument("database_filename")
parser.add_argument("sequence_filename")
args = parser.parse_args()
with open(args.database_filename, "r") as file:
reader = DictReader(file)
short_tandem_repeats = reader.fieldnames[1:]
people = list(reader)
with open(args.sequence_filename, "r") as file:
sequence = file.read().strip()
count = dict(zip(short_tandem_repeats, [0] * len(short_tandem_repeats)))
for short_tandem_repeat in short_tandem_repeats:
pattern = f"({short_tandem_repeat}){{1,}}"
match = re.search(pattern, sequence)
if match is None:
continue
count[short_tandem_repeat] = len(match.group()) // len(short_tandem_repeat)
try:
person = next(person for person in people if all(int(person[k]) == count[k] for k in short_tandem_repeats))
print(person["name"])
except StopIteration:
print("No match")
return 0
if __name__ == "__main__":
import sys
sys.exit(main())
I am trying to create a code to compare gene file with gene panels.
The gene panel file is in csv format and has Chromosome, gene, start location and end locations.
patients file has chromosome, mutations and the location.
so i made a loop to pass gene panel information to a function where the comparison is done to return me a list of similar items.
the function works great when i call it with manual data. but doenst not do the comparison inside the loop.
import vcf
import os, sys
records = open('exampleGenePanel.csv')
read = vcf.Reader(open('examplePatientFile.vcf','r'))
#functions to find mutations in patients sequence
def findMutations(gn,chromo,start,end):
start = int(start)
end = int(end)
for each in read:
CHROM = each.CHROM
if CHROM != chromo:
continue
POS = each.POS
if POS < start:
continue
if POS > end:
continue
REF = each.REF
ALT = each.ALT
print (gn,CHROM,POS,REF,ALT)
list.append([gn,CHROM,POS,REF,ALT])
return list
gene = records.readlines()
list=[]
y = len (gene)
x=1
while x < 3:
field = gene[x].split(',')
gname = field[0]
chromo = field[1]
gstart = field[2]
gend = field[3]
findMutations(gname,chromo,gstart,gend)
x = x+1
if not list:
print ('Mutation not found')
else:
print (len(list),' Mutations found')
print (list)
i want to get the details of matching mutations in the list.
This works as expected when i pass the data manually to the function.
Eg.findMutations('TESTGene','chr8','146171437','146229161')
But doesnt compare when passed through the loop
The problem is that findMutations attempts to read from read each time it is called, but after the first call, read has already been read and there's nothing left. I suggest reading the contents of read once, before calling the function, then save the results in a list. Then findMutations can read the list each time it is called.
It would also be a good idea to use a name other than list for your result list, since that name conflicts with the Python built-in function. It would also be better to have findMutations return its result list rather than append it to a global.
for synset in wn.synsets(wordstr):
len_lemma_names = len (synset.lemma_names)
#print len_lemma_names, synset.lemma_names
count_lemma = count_lemma + len_lemma_names
for synset_scores in swn_senti_synset:
count_synset = count_synset + 1
#print count_synset, synset_scores
I am trying to print len_lemma_names in front of count_synset but it did not work. Is there any way possible for printing them together? Thank you...
I think that you are wanting to iterate over the two, together. If this is the case, you want to use zip, or to avoid turning it all into one big list at once, itertools.izip.
from itertools import izip
for synset, synset_scores in izip(wn.synsets(wordstr), swn_senti_synset):
# Now you can deal with both at once in this loop.
len_lemma_names = len(synset.lemma_names)
count_lemma += len_lemma_names
count_synset += 1
# Mix to taste.
print len_lemma_names, count_synset
Note that the count_synset part may be better done with enumerate (I don't know its initial value or whether you're wanting to use it outside this code).