Separate value that store consecutively in dictionary - python

I am currently working on the CS50 problem set https://cs50.harvard.edu/x/2021/psets/6/dna/
The problem asks us to find the DNA sequences (STRs) that repeat consecutively in a txt file, and to match the longest run of each against the people listed in a csv file.
This is the code I have so far (not complete yet):
import re, csv, sys
def main(argv):
    # Open csv file
    csv_file = open(sys.argv[1], 'r')
    str_person = csv.reader(csv_file)
    nucleotide = next(str_person)[1:]
    # Open dna sequences file
    txt_file = open(sys.argv[2], 'r')
    dna_file = txt_file.read()
    str_repeat = {}
    str_list = find_STRrepeats(str_repeat, nucleotide, dna_file)
def find_STRrepeats(str_list, nucleotide, dna):
    for STR in nucleotide:
        groups = re.findall(rf'(?:{STR})+', dna)
        if len(groups) == 0:
            str_list[STR] = 0
        else:
            str_list[STR] = groups
    print(str_list)
if __name__ == "__main__":
    main(sys.argv[1:])
Output (from the print(str_list)):
{'AGATC': ['AGATCAGATCAGATCAGATC'], 'AATG': ['AATG'], 'TATC': ['TATCTATCTATCTATCTATC']}
But as you can see, each value in the dictionary is stored as one consecutive string. If I use len, as in str_list[STR] = len(groups), the result is 1 for every key. What I want is to find how many times each STR repeats and store that count as the value in my dict.
So I want the repeats stored separately, like this:
{'AGATC': ['AGATC', 'AGATC', 'AGATC', 'AGATC'], 'AATG': ['AATG'], 'TATC': ['TATC', 'TATC', 'TATC', 'TATC', 'TATC']}
What should I add to my code so the repeats are separated by commas like that? Or is there some condition I can add to my regex, groups = re.findall(rf'(?:{STR})+', dna)?
I don't want to change the whole regex, because I think it is already useful for finding the longest run of a string that repeats consecutively. I am a beginner with Python and proud that I got this far without help. Thank you.

I would just store the highest number of repetitions:
...
    if len(groups) == 0:
        str_list[STR] = 0
    else:
        str_list[STR] = max(len(i) // len(STR) for i in groups)
...
BTW, this also correctly handles the case where the STR occurs in more than one run: the longest run wins.
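As a minimal sketch of that idea, the count can be computed directly from the matched runs. The helper name longest_run and the sample string are made up for illustration, and re.escape is an extra precaution that is not in the original code:

```python
import re

def longest_run(STR, dna):
    """Return the largest number of back-to-back copies of STR in dna."""
    # Each match is one unbroken run such as 'AGATCAGATC'.
    runs = re.findall(rf'(?:{re.escape(STR)})+', dna)
    # Run length in characters divided by len(STR) gives the repeat count.
    return max((len(run) // len(STR) for run in runs), default=0)

dna = 'AGATCAGATCAGATCAGATCTTTTTAGATC'  # made-up sample
print(longest_run('AGATC', dna))  # 4
print(longest_run('AATG', dna))   # 0
```

Using max(..., default=0) removes the need for the separate len(groups) == 0 branch.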

Related

Counting repeated STR in DNA PSET6 CS50

I am currently working on CS50. I tried to count the STRs in a DNA sequence file, but my code always overcounts.
For example: how many times does 'AGATC' repeat consecutively in the DNA file?
This code is only an attempt to find out how to count those repeats accurately.
import csv
import re
from sys import argv, exit
def main():
    if len(argv) != 3:
        print("Usage: python dna.py data.csv sequence.txt")
        exit(1)
    with open(argv[1]) as csv_file, open(argv[2]) as dna_file:
        reader = csv.reader(csv_file)
        #for row in reader:
        #    print(row)
        str_sequences = next(reader)[1:]
        dna = dna_file.read()
        for i in range(len(dna)):
            count = len(re.findall(str_sequences[0], dna)) # str_sequences[0] is 'AGATC'
        print(count)
main()
Result for DNA file 11 (AGATC):
$ python dna.py databases/large.csv sequences/11.txt
52
The result is supposed to be 43. For small.csv the count is accurate, but for large.csv it always overcounts. I later realised that my code counts every occurrence of AGATC in the DNA file, whereas the task is to count only the copies that repeat consecutively and to ignore any copy that shows up again later.
{AGATCAGATCAGATCAGATC(T)TTTTAGATC}
So how do I stop counting when the sequence hits the (T), so that the AGATC that comes after it is not counted?
What should I change in my code, especially in the re.findall() call? Some people said to use substrings; how would I do that? Or can I keep using a regex as I did?
Please include code if you can. Sorry for my bad English.
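The for loop is wrong: it iterates over every character of dna and recomputes the same total on each pass, and re.findall(str_sequences[0], dna) counts every occurrence of the STR whether or not it is part of a consecutive run. I think you want to loop over str_sequences instead.
Something like: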
seq_list = []
for STR in str_sequences:
    groups = re.findall(rf'(?:{STR})+', dna)
    if len(groups) == 0:
        seq_list.append('0')
    else:
        seq_list.append(str(max(map(lambda x: len(x)//len(STR), groups))))
print(seq_list)
Also, there are many posts on this problem. Maybe, you can examine some of them to finish your program.
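Since the question also asks how a substring approach would look, here is a rough sketch without regex. The function name and the sample string (modelled on the example in the question) are illustrative only, and it assumes STR is non-empty:

```python
def longest_run(dna, STR):
    """Longest number of consecutive copies of STR, found by slicing."""
    if not STR:
        return 0  # guard: an empty STR would never stop matching
    best = 0
    n = len(STR)
    for start in range(len(dna)):
        count = 0
        # Step forward one STR-length at a time while the slice still matches.
        while dna[start + count * n : start + (count + 1) * n] == STR:
            count += 1
        best = max(best, count)
    return best

dna = "AGATCAGATCAGATCAGATCTTTTTAGATC"
print(dna.count("AGATC"))         # 5: every occurrence, consecutive or not
print(longest_run(dna, "AGATC"))  # 4: only the longest unbroken run
```

The difference between the two printed numbers is the same kind of overcount described in the question.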

How to parse letter by letter and make a list with Python?

I have a text file I am attempting to parse. Fairly new to Python.
It contains an ID, a sequence, and frequency
SA1 GDNNN 12
SA2 TDGNNED 8
SA3 VGGNNN 3
Say the user wants to compare the frequency of the first two sequences. They would input the ID number. I'm having trouble figuring out how I would parse with python to make a list like
GD this occurs once in the two so it = 12
DN this also occurs once =12
NN occurs 3 times = 12 + 12 + 8 =32
TD occurs once in the second sequence = 8
DG ""
NE ""
ED ""
What do you recommend to parse letter by letter? In a sequence GD, then DN, then NN (without repeating it in the list), TD.. Etc.?
I currently have:
#Read File
def main():
    file = open("clonedata.txt", "r")
    lines = file.readlines()
    file.close()
class clone_data:
    def __init__(id, seq, freq):
        id.seq = seq
        id.freq = freq
def myfunc(id)
    id = input ("Input ID number to see frequency: ")
    for line in infile:
        line = line.strip().upper()
        line.find(id)
        #print('y')
I'm not entirely sure from the example, but it sounds like you're trying to look at each line in the file and determine if the ID is in a given line. If so, you want to add the number at the end of that line to the current count.
This can be done in Python with something like this:
def get_total_from_lines_for_id(id_string, lines):
    total = 0 #record the total at the end of each line
    #now loop over the lines searching for the ID string
    for line in lines:
        if id_string in line: #this will be true if the id_string is in the line and will only match once
            split_line = line.split(" ") #split the line at each space character into an array
            number_string = split_line[-1] #get the last item in the array, the number
            number_int = int(number_string) #make the string a number so we can add it
            total = total + number_int #increase the total
    return total
I'm honestly not sure what part of that task seems difficult to you, in part because I'm not sure what exactly is the task you're trying to accomplish.
Unless you expect the datafile to be enormous, the simplest way to start would be to read it all into memory, recording the id, sequence and frequency in a dictionary indexed by id: [Note 1]
with open('clonedata.txt') as file:
    data = { id : (sequence, int(frequency))
             for id, sequence, frequency in (
                 line.split() for line in file)}
With the sample data provided, that gives you: (newlines added for legibility)
>>> data
{'SA1': ('GDNNN', 12),
'SA2': ('TDGNNED', 8),
'SA3': ('VGGNNN', 3)}
and you can get an individual sequence and frequency with something like:
seq, freq = data['SA2']
Apparently, you always want to count the number of digrams (instances of two consecutive letters) in a sequence of letters. You can do that easily with collections.Counter: [Note 2]
from collections import Counter
# ...
seq, freq = data['SA1']
Counter(zip(seq, seq[1:]))
which prints
Counter({('N', 'N'): 2, ('G', 'D'): 1, ('D', 'N'): 1})
It would probably be most convenient to make that into a function:
def count(seq):
    return Counter(zip(seq, seq[1:]))
Also apparently, you actually want to multiply the counted frequency by the frequency extracted from the file. Unfortunately, Counter does not support multiplication (although you can, conveniently, add two Counters to get the sum of frequencies for each key, so there's no obvious reason why they shouldn't support multiplication.) However, you can multiply the counts afterwards:
def count_freq(seq, freq):
    retval = count(seq)
    for digram in retval:
        retval[digram] *= freq
    return retval
If you find tuples of pairs of letters annoying, you can easily turn them back into strings using ''.join().
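To tie this back to the question, here is a small sketch that combines the weighted counts for two IDs. It assumes the sample data from the question is already loaded into a dictionary and relies on Counter addition to sum the weights per digram:

```python
from collections import Counter

def count_freq(seq, freq):
    """Count overlapping digrams in seq, each weighted by freq."""
    counts = Counter(zip(seq, seq[1:]))
    for digram in counts:
        counts[digram] *= freq
    return counts

# Sample data copied from the question.
data = {'SA1': ('GDNNN', 12), 'SA2': ('TDGNNED', 8), 'SA3': ('VGGNNN', 3)}

# Adding Counters sums the weighted counts key by key.
total = count_freq(*data['SA1']) + count_freq(*data['SA2'])
for digram, weight in total.items():
    print(''.join(digram), weight)
```

For SA1 and SA2 this gives NN a weight of 32 (12 + 12 + 8), which matches the figure worked out by hand in the question.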
Notes:
1. That code is completely devoid of error checking; it assumes that your data file is perfect, and will throw an exception for any line with too few elements, including blank lines. You could handle the blank lines by changing for line in file to for line in file if line.strip() or some other similar test, but a fully bullet-proof version would require more work.
2. zip(a, a[1:]) is the idiomatic way of making an iterator out of overlapping pairs of elements of a list. If you want non-overlapping pairs, you can use something very similar, using the same list iterator twice:
def pairwise(a):
    it = iter(a)
    return zip(it, it)
(Or, javascript style: pairwise = lambda a: (lambda it:zip(it, it))(iter(a)).)

CS50 PSET6 DNA no match using regex to count STR

I have been stuck at this point for quite a while and hope to get some tips.
The problem can be simplified to finding the largest number of consecutive occurrences of a pattern in a string. For the pattern AATG and a string like ATAATGAATGAATGGAATG, the right result is 3. I tried to count the occurrences of the pattern using re.compile(). I found in the docs that to match consecutive occurrences of a pattern I probably have to use the special character +. For instance, for the pattern AATG I have to use re.compile(r'(AATG)+') instead of re.compile(r'AATG'); otherwise the occurrences are overcounted. However, in this program the pattern is not a fixed string, so I have to treat it as a variable. I have tried many ways to put it into re.compile() without success. Could anyone show me the correct way to format it (in the function countSTR below)?
After that, I think finditer(the_string_to_be_analysed) should return an iterator over all matches found. I then use match.end() - match.start() to get the length of each match and compare them to find the longest run of consecutive occurrences of the pattern. Maybe something goes wrong there?
Code attached. Any input would be appreciated!
from sys import argv, exit
import csv
import re
def main():
    if len(argv) != 3:
        print("Usage: python dna.py data.csv sequence.txt")
        exit(1)
    # read DNA sequence
    with open(argv[2], "r") as file:
        if file.mode != 'r':
            print(f"database {argv[2]} can not be read")
            exit(1)
        sequence = file.read()
    # read database.csv
    with open(argv[1], newline='') as file:
        if file.mode != 'r':
            print(f"database {argv[1]} can not be read")
            exit(1)
        # get the heading of the csv file in order to obtain STRs
        csv_reader = csv.reader(file)
        headings = next(csv_reader)
        # dictionary to store STRs match result of DNA-sequence
        STR_counter = {}
        for STR in headings[1::]:
            # entry result accounting to the STR keys
            STR_counter[STR] = countSTR(STR, sequence)
    # read csv file as a dictionary
    with open(argv[1], newline='') as file:
        database = csv.DictReader(file)
        for row in database:
            count = 0
            for STR in STR_counter:
                # print("row in database ", row[STR], "STR in STR_counter", STR_counter[STR])
                if int(row[STR]) == int(STR_counter[STR]):
                    count += 1
            if count == len(STR_counter):
                print(row['name'])
                exit(0)
        else:
            print("No match")
# find non-overlapping occurrences of STR in DNA-sequence
def countSTR(STR, sequence):
    count = 0
    maxcount = 0
    # in order to match repeat STR. for example: "('AATG')+" as pattern
    # into re.compile() to match repeat STR
    # rewrite STR to "(STR)+"
    STR = "(" + STR + ")+"
    pattern = re.compile(r'STR')
    # matches should be a iterator object
    matches = pattern.finditer(sequence)
    # go throgh every repeat and find the longest one
    # by match.end() - match.start()
    for match in matches:
        count = match.end() - match.start()
        if count > maxcount:
            maxcount = count
    # return repeat times of the longest repeat
    return maxcount/len(STR)
main()
I just found a correct way to get the desired result.
I am posting it here in case anyone else is also confused.
From what I understand, a variable named var_pattern can be matched with re.compile(rf'{var_pattern}'). To search for consecutive occurrences of var_pattern, use re.compile(rf'({var_pattern})+'). There may be smarter ways to implement this, but I managed to get it working.
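For reference, a minimal sketch of how countSTR might look with the f-string pattern. The name count_str is hypothetical and re.escape is an added precaution. Note that the code in the question reassigns STR to "(STR)+" before dividing by len(STR), so this sketch keeps the original STR untouched for the division:

```python
import re

def count_str(STR, sequence):
    """Longest run of consecutive copies of STR in sequence."""
    # The f-string substitutes the variable; re.escape guards against
    # regex metacharacters (not needed for plain A/C/G/T, but harmless).
    pattern = re.compile(rf'(?:{re.escape(STR)})+')
    longest = 0
    for match in pattern.finditer(sequence):
        longest = max(longest, match.end() - match.start())
    # Divide by the length of the original STR, not of the pattern text.
    return longest // len(STR)

print(count_str('AATG', 'ATAATGAATGAATGGAATG'))  # 3
```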

Why does my code give no output? (Probable logic issue)

The following is a problem from CS50 pset6. The goal is to search a .txt file that represents a DNA strand and check how many times each specific string (STR) occurs consecutively. If the consecutive counts for every STR match those of an individual, print that individual's name.
I am still a newbie with Python; I wrote what I meant in the comments, and I hope it is all right. Any help is appreciated. (At the moment the code does not give any output.)
python dna.py databases/small.csv sequences/2.txt (command-line arguments: the 1st is the csv file and the 2nd is the txt file)
CSV files :
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
name,AGATC,TTTTTTCT,AATG,TCTAG,GATA,TATC,GAAA,TCTG
Albus,15,49,38,5,14,44,14,12
Cedric,31,21,41,28,30,9,36,44
Draco,9,13,8,26,15,25,41,39
Fred,37,40,10,6,5,10,28,8
Ginny,37,47,10,23,5,48,28,23
Hagrid,25,38,45,49,39,18,42,30
Harry,46,49,48,29,15,5,28,40
Hermione,43,31,18,25,26,47,31,36
James,46,41,38,29,15,5,48,22
Kingsley,7,11,18,33,39,31,23,14
Lavender,22,33,43,12,26,18,47,41
Lily,42,47,48,18,35,46,48,50
Lucius,9,13,33,26,45,11,36,39
Luna,18,23,35,13,11,19,14,24
Minerva,17,49,18,7,6,18,17,30
Neville,14,44,28,27,19,7,25,20
Petunia,29,29,40,31,45,20,40,35
Remus,6,18,5,42,39,28,44,22
Ron,37,47,13,25,17,6,13,35
Severus,29,27,32,41,6,27,8,34
Sirius,31,11,28,26,35,19,33,6
Vernon,26,45,34,50,44,30,32,28
Zacharias,29,50,18,23,38,24,22,9
.txt file example :
TGGTTTAGGGCCTATAATTGCAGGACCACTGGCCCTTGTCGAGGTGTACAGGTAGGGAGCTAAGTTCGAAACGCCCCTTGGTCGGGATTACCGCCATTCTAGTAGTCTAACCCCGAACGCGCTCAGGCTTTGAGTTCGCGCAGCATTAAGAAGTCCATGCCGGCACCGAATGTCCCGACGACAGGCAACCAGCACGGATACCCGCCTTGAAGGCGCAATCAGTAGGTCGAGTTACAGAGGCTCCCCCCGAGCTTGTGCTTCCATTGAGTAGGGGCTATAGATATGTAGCACTCAGGTTTAGTAGCGCCCTTTTAACAGCGAGAGCCCGCCTGGTCAGAACCGAACGGCTGATACGCGAGCTGATGGCTAGAGGATGAACACGGTCCTTCTCTTCGCTTCGATCCGGGGTAGTTTTGTAGCGAAGGATAACGCTCTGTGGATTCTCCGAGAATAATCATCAGTACGGTGTGCGTACCCTCTCTTTGATCCACGCCTGGGGCTGGACATAGTCAGGCGCATTTCATCTACTTAACCCCGGTAAGGGCCACGGGCGCGACATCTCCTTACCAGGGTTGTCTTATGCTCGCTTTTCCCAGATGATAAGCATCTCGTTGTAATGAACAGGTACCTAAGAAAACTGAGTTTCGACGACCCGTCGGCTCGTGTTCTTATCTATTGATCTAACCGAGGTGAAGCTCGCCAAAAATTTCGTAATGTAAGAGAGAAATTGAAGGGGTGAATTTTGCACTCTCGTGCATACGTCTTGCTACAATAGCAAGAGCTGTATGCGTGCGACCACTTCACTACCTCTATAGCATGCGATCTCGCAGCCCGGATTGTCGCCTCCTTTGGGCCGCAAAAACGGTATACGGACACACTGCATCTGTGAGCACGCACCAACTCGATGCGCGTAATAGGCATCTGCTCCACCCAGGGGGCAAGGGCACCTGAGGACGATCTGTTCCGAACAGTATATTTGAGCCAATGTTCTATTAGTGAAGGCAAGGTGAGGGCAACTCACACTGGTACCTCGAGGAGTAATCCGATCCATCTAAACTGGCATGCCCGTTCGAGCTCGGGGGGTTCTTGAATTTAGCTGCGTGTACCTCAGGTCTCCTGAAGGTGACACCAAAGTACAAGATAGATAGATAGATAGATAGGTAATATCTGACGTGAATTGCTTACCGCTAGTGAGCGATTTAGGTCACGTCCTCTGAAGATGTACCGTGATCATCGATGAAATGGCTCTGCAGAGCGTTGCTCTTAACTTAGTGGGAGTGTTCCTTTGCAGTCTGATGAAGTCGCTGCATGTTAACTTTGCACTAAGGGCGTTTCCGAACCCTATAGTCATCCTTATTGATTCGCCCTGTCTTACCCAGGATACACTACCGTTCGAGGCTCTTAACGTACCAACGCATGCAGTCAGAAGATGATTCACCATCCAAGCAAATCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTGGCTTTAAAGCACCGAATGTAGTTGGCCATCGGTCTCGGTCACCAATAAGACGGCCTCGTGGGTCACTCGGTCGATGATCTAGGGTCGGGTGCATAGTGTTTCAGGTCGGCGCTCAGGGTTCTTGTCAGGGAAATCTACGGGTGAGTTGGAAAGCGCCGCCAGCGAGATGCCTGTAGGCGATTAGTGTAGAGAGAGCAACATCGGAAAATTGTCCGTGGGGCGCTACGTAAGTGTTCCCAGTATTCTCGTCCAGAGTAAGTCATGCATACCAGTATCAGGCGTCTGTGTGTTACGTTGCAGTGTATCCCGGTAGCGGGAAGCGTATAGAGCGTAACAGACCTGTCCTACAGCACGCAGGATGTCGACCCTTTCTCAGGCACGATACTTCGTGTAACAGCAGTTCCGGTGTCATCTGTAACTGTTCTGTGTTCCATAGTGAAGATATCTATCTATCTA
TCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCGGGCGACCCTGGCAATCAGGCGCCTGCTTATATGACAATTTGTCGCAATACGGGTCGCAGATGTATTGTCCGCATGAGTTTACGGTATCCGGAACTGTCACCGCCGATCCAATATGATCGCAAATCGGGGTGACACAATGGACCGCCGTAGATAAACCCTGCGATGCTGCAATAAGGATATGATATCGCGCGCGGGCCGTAAAACCGATCTTGGAAGGCGGGAAGTCTCCGGGAAAAACTCTCTGATAAAGCCTATTACAAGAAGAGCTCGAAGGCAAGATGGGCATGCCCCGTCGACCACACGGGCAAGCTCTGAGAATCGATGTGGTCGCTTAACCAACCCATACGGAGTGAACGAGACCACGCGGGCGGTTCTTGGTACGCATGATTCCTATTGGTTCTGCCGGGCGTGTGCAGGATTGTTCACTCCCCACCCTGTCGCTCACGAACGCGCTGGTTGCTTAAACCGACCGGAAATTCTGTAGCCGCCCCGTAAGTTTAACGCTTTGAAATACTCCACATGTGCGTACCGGGTCTGATCGCTTACGTGGCGCCACTATGTTAGGAGCTCATAGATATCGATGAATCAAATGTCTTTCATCGCTCCTTAAACAACCTGACGTATTCGCAAAATTGCGCGTATTGAGAAGGGAAAGTTAAAGGAACGATAACAATGAGTCTGCTTTCACCGGCTGCATAACGGGATCGCGCGCTATGGGATTTCCTAACTATAATTCGTGTCGATACTCAGACGCGTTGTACAGGTAAGAAGTCGGCGGGACAGTATTTGAGAAGGGGCTCTGCGGCACCAACGCCGAGCTGTATCAGGGGGGTTAATGTGTAGCGGGCATATAACACAATACAGCCCGCGGCGCGTCGTGGTTACCGTAGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAACACCGGAGCTAGGATCCACGACTACCAGTGGGAAACTGTGAATTGTGCATGGTAATTAAAGGATGACTGGTCAACACCGGTCTCCACGGGCGTTAAACAACCTCGCTCCAGTCAATCTCTAGCGGTGGTTGTGGCAGCTTATTCCTGGAGGTAATACTCTTCCGGGCCCACTAAAAATGTAACGAAGTCGAGGTTGGGTCAGGGGATTGAGTGGGGGCGACTCACTGATTCCACCAGGAATTGTCGTCAATCGCGACGTACTTTGAGCCTTGTATCTTGGCGTTTCTTGTTGGTACGCGGCCGTGTTCGTGAATCACGACGTCGTTCATGATTCATCCGTCCAAGCCTAGACCTAGCGTAAAAACGGTGTCGATCTGTGCTCCAACCGATGGATGGTTTTTACACAAGTGAACTTCGAGGCTGTGGGACAAACAGCACAACTTGTTCACTGCTGACCGTGGTACTAAACCACGCTTGCTTTCAGCCCTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTT
TTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTCAGATGCGGATCAAGGGTTACTCGAGCCGTTCTGAGGTCCTAAAATTTTAGCCCTTGGTGTTAGCTTCGGTTTAAGAACGTAGGTGCGACGCGGGGGTCCTAGAGCTCCGCGATCTGCACTCCCCACCTGGCACCAAAACGAATCCTGCATAACGGCTCTCTGTGCATGGGGGATGGTCGCAACAACGAGCATAGCTGGCATCACTTCGTTTGCTGTGGATTGCTGTTTTATACAGAATACGGTGGTGATCATCAAAGGAAGCATAATCCACATCGGGCACCCCGGGCCATCGTGCGTTCCCTTATAGCCGGCTTGCATGTTGGGGGAGGAGTAAGGCCGGTAACGTCTCGCAGCACTGTCGCGTAACACAGGTACATCTTTATTTCCGGTGCTGTAGAAGTGGTTTTTCGAAGGCGTAACCCAGAACGACTGATATAATAGTCCACTATTCCCTGGTTTAAGACTTCTACAAAGTTTTACGCAAAGTTACATGCACACTCGGCGACGTAAATATTAGCCTTGCTAAATTGCCACGGATATTAATCCCGAGCCAACCTGTTCCCACTAGCGGTCTACGGTCATAGTCCTTTGTGTAGAGCGTCATTGCGGTTGGGGCCCGTCCGCGGAGGTTCCCCTTATGATCTAACCGCGGTGCAGGTTGACTGAATGCCATACACTATAGAGAAGACGTCTAAGTAGAAACGTTCTTTAAAAATCTTGAACTGACGGCCGAGTATTATCAAGAGAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCATGCACAATAAGGTTAGAAGCAGCAAGCATGTATTCTTTGCATAGAGGCGGTAAAGCCGCCTTGCATACCCAGCAGCAGCCGCGAAGGCCTTACTCCAGAGGACAGAACTTCTCACACAGCGTCCGCATACACCGCGGACGTGACAAGGTTAGATAGCTCTAGTTTGCGGCAACCCTCGCATCAGGCCGACTCACCCGCGCTTGCTACCCGGAGGATGGGTCAAGGGATAAACATAGCACGTTAGTTAAGCCTAACGTCAGTTTTTAGAGTTTACATGCACGACTAAGTGCATCGAAATACACGCCGTTGACAGACCAACAGCGTGTCAACTGGGCCTTGAGAATTGTATCATAATAGCCAAATACGAGGCCAAGTAGTCCGACGAGAGGCACGTAGAGACCACTTTCCCTAAACGATCTGTCGCATTACCCTTTGACTCGCACCCTATGCCTTATGTTCCAAGCAGCACCGAAGTTAGATTTAAGGGCGTATCTATCGGTACCTCGGTTGGGCCGGTCCACAGCTCCAGCTGAATTAGTGCTCACCCCGCTTCGAGGTTGAGTAAGGGTCACTTTTAAAAATATGCTTAAGGGTGATTCACATGACAGTAATCGAATAGTGAGATATAAGTAGGTGCGCCCCGCGCACACATCAAAACTGTGCAGACTGAAACTGAATGCTGGAGGCTGAGGAAAATGAAGATCAGAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGAGAAGAGTGAATGAATAGATCTGCCGCTGAATCCCCGCGTGAGGTTTTTGCGAC
Code:
from sys import argv
import csv
from itertools import groupby
#first csv cma 2nd txt
# fread from CSV file first thing in a row is name then the number of strs
# fread from dna seq and read it into a memory
#find how many times each str censequetivel
# if number of strs == with a persons print the person
checkstr = [] #global array that tells us what str to read
def readtxt(csvfile,seq):
    with open(f'{csvfile}','r') as p:#finding which str to read from header line of the csv
        header = csv.reader(p) # Header contains which strs to look for
        for row in header:
            checkstr = row[1:]
            break
    with open(f'{seq}','r') as f:#searching the text for strs
        s = f.read()
        for c in checkstr:
            groups = groupby(s.split(c))
            try:
                return [sum(1 for _ in group)+1 for label, group in groups if label==''][0]
            except IndexError:
                return 0
def readcsv(n):
    with open(f'{n}','r') as f:
        readed = csv.DictReader(f)
        for row in readed:
            return row
def main():
    counter = 0
    if len(argv) != 3:
        print("Please start program with cmd arguments.")
    readtxt(argv[1], argv[2])#for fulling the checkstr
    for i in range(0,len(checkstr)): #Do this as much as the number of special strings
        for j in checkstr: #For each special string in the list
            if readtxt(argv[1], argv[2]) == readcsv(argv[1])[f'{checkstr[j]}']: #If dictionary value that returns for that spesific str is matches to the spesific str
                counter += 1
        if counter == len(checkstr): # if all spesific strs matches, then we found our person!
            print(readcsv(argv[1])['name'])
#readtxt(argv[1], argv[2])
#readcsv(argv[1])
main()
I think the answer is quite simple: you forgot an else branch.
After you check the number of arguments, the rest of main() must go in the else branch.
def main():
    counter = 0
    if len(argv) != 3:
        print("Please use: program <csvfile> <textfile>") # give usage and exit
    else:
        readtxt(argv[1], argv[2])#for fulling the checkstr
        for i in range(0,len(checkstr)): #Do this as much as the number of special strings
            for j in checkstr: #For each special string in the list
                if readtxt(argv[1], argv[2]) == readcsv(argv[1])[f'{checkstr[j]}']: #If dictionary value that returns for that spesific str is matches to the spesific str
                    counter += 1
            if counter == len(checkstr): # if all spesific strs matches, then we found our person!
                print(readcsv(argv[1])['name'])
#readtxt(argv[1], argv[2])
#readcsv(argv[1])
main()
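Separately, and only as a possible explanation based on the code as posted: the assignment checkstr = row[1:] inside readtxt binds a local variable, so the module-level checkstr may stay empty, in which case the loops in main() never run. A minimal sketch of that behaviour, with hypothetical function names:

```python
checkstr = []  # module-level list, as in the question

def fill_local(row):
    # Plain assignment binds a new local name; the global is untouched.
    checkstr = row[1:]

def fill_and_return(row):
    # Returning the value avoids relying on a global at all.
    return row[1:]

header = ['name', 'AGATC', 'AATG', 'TATC']
fill_local(header)
print(checkstr)                 # []
print(fill_and_return(header))  # ['AGATC', 'AATG', 'TATC']
```

Returning the list of STRs from the function and passing it on explicitly is usually easier to follow than mutating a global.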

how do I count the characters in a group of lines separated by another kind of line?

I am currently working with a text file that contains a list of DNA sequences (contigs), each with a header line followed by lines of nucleotides; the number of nucleotides is the length of that contig. There are 120 contigs, and each entry begins with a line starting with ">" that gives the sequence information. The lines after it hold the nucleotides of that sequence.
Example:
>gi|571136972|ref|XM_006625214.1| Plasmodium chabaudi chabaudi small subunit ribosomal protein 5 (Rps5) (rps5) mRNA, complete cds
ATGAGAAATATTTTATTAAAGAAAAAATTATATAATAGTAAAAATATTTATATTTTATATTATATTTTAATAATATTTAAAAGTATTTTTATTATTTTATTTAATAGTAAATATAATGTGAATTATTATTTATATAATAAAATTTATAATTTATTTATTATATATATAAAATTATATTATATTATAAATAATATATATTATAATAATAATTATTATTATATATATAATATGAATTATATA
TATTTTTATATTTATAAATATAATAGTTTAAATAATA
>gi|571136996|ref|XM_006625226.1| Plasmodium chabaudi chabaudi small subunit ribosomal protein 2 (Rps2) (rps2) mRNA, complete cds
ATGTTTATTACATTTAAAGATTTATTAAAATCTAAAATATATATAGGAAATAATTATAAAAATATTTATATTAATAATTATAAATTTATATATAAAATAAAATATAATTATTGTATTTTAAATTTTACATTAATTATATTATATTTATATAAATTATATTTATATATTTATAATATATCTATATTTAATAATAAAATTTTATTTATTATTAATAATAATTTAATTACAAATTTAATTATT
AATATATGTAATTTAACTAATAATTTTTATATTATTA
What I would like to do is make a list of the length of every contig. My problem is that I do not know the syntax needed to tell Python to:
find the line after the line that starts with ">"
take a count of all of the characters in the lines of that sequence
append that value to a list of all contig lengths (i.e. 126, 300, 25...)
make sure the last contig (which has no ">" to denote its end) is counted.
I would like a list of integers, so that I can calculate things like the mean length of the contigs, standard deviation, cool gene equations etc.
I am relatively new to programming. If I am unclear or further information is needed, please let me know.
Don't reinvent the wheel, use biopython as Martin has suggested. Here's a start for you that will print the sequence ID and length to terminal. You can install biopython with pip, i.e. pip install biopython
from Bio import SeqIO
import sys
FileIn = sys.argv[1]
with open(FileIn) as handle: #plain text mode; the old 'rU' mode was removed in Python 3.11
    SeqRecords = SeqIO.parse(handle, 'fasta')
    for record in SeqRecords: #loop through each fasta entry
        length = len(record.seq) #get sequence length
        print("%s: %i bp" % (record.id, length)) #print sequence ID: seq length
Or you could store the results in a dictionary:
sequence_lengths = {}
with open(FileIn) as handle:
    SeqRecords = SeqIO.parse(handle, 'fasta')
    for record in SeqRecords: #loop through each fasta entry
        length = len(record.seq) #get sequence length
        sequence_lengths[record.id] = length
#access dictionary outside of loop
print(sequence_lengths)
This might work for you: it prints the number of ACGT's in the lines that follow a line that includes >:
import re
with open("input.txt") as input_file:
    data = input_file.read()
data = re.split(r">.*", data)[1:]
data = [sum(1 for ch in datum if ch in 'ACGT') for datum in data]
print(data)
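The four steps listed in the question can also be sketched as a plain loop, without regex or biopython. The function name and the tiny sample below are made up for illustration, and the sketch counts every character on a sequence line rather than only A, C, G and T:

```python
def contig_lengths(lines):
    """Return the nucleotide count of each contig in FASTA-style lines."""
    lengths = []
    current = None  # None until the first '>' header is seen
    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            if current is not None:
                lengths.append(current)  # close the previous contig
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)  # the last contig has no '>' after it
    return lengths

sample = [">contig1 description", "ATGAGA", "TATT",
          ">contig2 description", "ATGTTTATT"]
print(contig_lengths(sample))  # [10, 9]
```

Because the function only iterates over lines, an open file object can be passed in place of the sample list.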
Thanks for all the help. I have looked at the biopython stuff and am excited to understand it and incorporate it. The overall goal of this assignment was to teach me how to understand Python rather than to find the solution outright, or at least, if I find the solution, I have to be able to explain it in my own words.
Anyway, I have written code incorporating that element as well as others. I have a few more things to do, and if I am confused, I will return to ask.
Here is my first working code that I made and understand without working directly with my supervisor or tutorials (woo!):
import re
with open("COPYFORTESTINGplastid.1.rna.fna") as fasta:
    contigs = 0
    for line in fasta:
        if line.strip().startswith('>'):
            contigs = contigs + 1
with open("COPYFORTESTINGplastid.1.rna.fna") as fasta:
    data = fasta.read()
    data = re.split(r">.*", data)[1:]
    data = [sum(1 for ch in datum if ch in 'ACGT') for datum in data]
print("Total number of contigs: %s" % contigs)
total_contigs = sum(data)
N50 = sum(data) / 2  # true division in Python 3
print("number used to determine N50 = %s" % N50)
average = 0
total = 0
for n in data:
    total = total + n
mean = total / len(data)
print("mean length of contigs: %s" % mean)
print("total nucleotides in fasta = %s" % total_contigs)
#print("list of contigs by length: %s" % sorted([data]))
l = data
l.sort(reverse = True)
print("list of contigs by length: %s" % l)
This does what I want it to do, but if you have any comments or advice, I would love to hear them.
Next up: determining N50 with this sweet, sweet list. Thanks again!
I created a function to calculate N50 and it seems to work nicely. I can parse the command line and run any .fa file through the program:
def calc_n50(array):
    array.sort(reverse = True)
    n50 = 0 #sums lengths
    n = 0 #n50 sequence
    half = sum(array)/2
    for val in array:
        n50 += val
        if n50 >= half:
            n = val
            break #breaks loop when condition is met
    print("N50 is", n)
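A variant that returns the value instead of printing it may be easier to reuse alongside the other statistics. This is only a sketch with a made-up list of lengths, following the same "first length at which the running total reaches half" definition used above:

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    half = sum(lengths) / 2
    running = 0
    # Walk from longest to shortest without mutating the caller's list.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0

print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```

Using sorted() rather than list.sort() leaves the original list in its input order, which matters if it is used again afterwards.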
