I'm currently working on CS50. I'm trying to count STRs in a DNA sequence file, but my count is always too high.
What I mean is, for example: how many times does 'AGATC' repeat consecutively in the DNA file?
This code is only my attempt to figure out how to count those repeated STRs accurately.
import csv
import re
from sys import argv, exit

def main():
    if len(argv) != 3:
        print("Usage: python dna.py data.csv sequence.txt")
        exit(1)

    with open(argv[1]) as csv_file, open(argv[2]) as dna_file:
        reader = csv.reader(csv_file)
        # for row in reader:
        #     print(row)
        str_sequences = next(reader)[1:]
        dna = dna_file.read()

        for i in range(len(dna)):
            count = len(re.findall(str_sequences[0], dna))  # str_sequences[0] is 'AGATC'
        print(count)

main()
Result for DNA file 11 (AGATC):
$ python dna.py databases/large.csv sequences/11.txt
52
The result is supposed to be 43. For small.csv it counts accurately, but for large.csv it always overcounts. I later realized that my code counts every occurrence of the STR anywhere in the DNA file (AGATC), while the task is to count only the STRs that repeat consecutively and to ignore the same STR if it shows up again later.
{AGATCAGATCAGATCAGATC(T)TTTTAGATC}
So, how do I stop counting once the run hits the (T), so that the AGATC that comes after it isn't counted?
What should I change in my code, especially in the re.findall() that I use? Some people said to use a substring; how do I use a substring? Or can I just keep using a regex like I did?
Please write your code if you can. Sorry for my bad English.
The for loop is wrong: it keeps counting sequences that were already found earlier in the loop. I think you want to loop over the str_sequences instead.
Something like:
seq_list = []
for STR in str_sequences:
    groups = re.findall(rf'(?:{STR})+', dna)
    if len(groups) == 0:
        seq_list.append('0')
    else:
        seq_list.append(str(max(map(lambda x: len(x)//len(STR), groups))))
print(seq_list)
Also, there are many posts on this problem; you may want to examine some of them to finish your program.
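For reference, here is a minimal sketch of how that loop could slot into your existing main() (untested against the CS50 checker, keeping your own file handling):

import csv
import re
from sys import argv, exit

def main():
    if len(argv) != 3:
        print("Usage: python dna.py data.csv sequence.txt")
        exit(1)

    with open(argv[1]) as csv_file, open(argv[2]) as dna_file:
        reader = csv.reader(csv_file)
        str_sequences = next(reader)[1:]  # STR names from the CSV header
        dna = dna_file.read()

    counts = []
    for STR in str_sequences:
        # every maximal run of consecutive repeats of this STR
        groups = re.findall(rf'(?:{STR})+', dna)
        if not groups:
            counts.append(0)
        else:
            # longest run, measured in number of repeats
            counts.append(max(len(g) // len(STR) for g in groups))
    print(counts)

main()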
Related
I'm currently working on CS50 problem set https://cs50.harvard.edu/x/2021/psets/6/dna/
The problem simply asks us to find DNA sequences that repeat consecutively in a txt file and match the totals against the people in a csv file.
This is the code I'm currently working on (not complete yet):
import re, csv, sys

def main(argv):
    # Open csv file
    csv_file = open(sys.argv[1], 'r')
    str_person = csv.reader(csv_file)
    nucleotide = next(str_person)[1:]

    # Open dna sequences file
    txt_file = open(sys.argv[2], 'r')
    dna_file = txt_file.read()

    str_repeat = {}
    str_list = find_STRrepeats(str_repeat, nucleotide, dna_file)

def find_STRrepeats(str_list, nucleotide, dna):
    for STR in nucleotide:
        groups = re.findall(rf'(?:{STR})+', dna)
        if len(groups) == 0:
            str_list[STR] = 0
        else:
            str_list[STR] = groups
    print(str_list)

if __name__ == "__main__":
    main(sys.argv[1:])
Output (from the print(str_list)):
{'AGATC': ['AGATCAGATCAGATCAGATC'], 'AATG': ['AATG'], 'TATC': ['TATCTATCTATCTATCTATC']}
But as you can see, each value in the dictionary is stored as one joined string. If I use the len function, as in str_list[STR] = len(groups), the result is 1 for every key, because what I actually want is the number of times (the total repeat count) each STR repeats, stored as the value in my dict.
So I want the matches stored separately, kind of like this:
{'AGATC': ['AGATC', 'AGATC', 'AGATC', 'AGATC'], 'AATG': ['AATG'], 'TATC': ['TATC', 'TATC', 'TATC', 'TATC', 'TATC']}
What should I add to my code so they are separated by commas like that? Or maybe there is some condition I can add to my regex, groups = re.findall(rf'(?:{STR})+', dna)?
I don't want to change the whole regex, because I think it is already useful for finding the longest consecutive run, and I'm proud of myself for getting it without help since I'm a beginner with Python. Please. Thank you.
I would just store the highest number of repetitions:
...
if len(groups) == 0:
    str_list[STR] = 0
else:
    str_list[STR] = max(len(i) // len(STR) for i in groups)
...
BTW, this would correctly handle the case where more than one sequence exists.
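Putting it together, your find_STRrepeats could then look roughly like this (a sketch based on your own function; note that the division is by len(STR), the length of the repeat unit):

def find_STRrepeats(str_list, nucleotide, dna):
    for STR in nucleotide:
        groups = re.findall(rf'(?:{STR})+', dna)
        if len(groups) == 0:
            str_list[STR] = 0
        else:
            # longest consecutive run, expressed as a repeat count
            str_list[STR] = max(len(g) // len(STR) for g in groups)
    print(str_list)
    return str_list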
I have been stuck at this point for quite a while and hope to get some tips.
The problem can be simplified to: find the largest number of consecutive occurrences of a pattern in a string. For a pattern AATG and a string like ATAATGAATGAATGGAATG, the right result should be 3. I tried to count the occurrences of the pattern by using re.compile(). I found out from the docs that if I want to find consecutive occurrences of a pattern I probably have to use the special character +. For instance, for a pattern like AATG I have to use re.compile(r'(AATG)+') instead of re.compile(r'AATG'); otherwise the occurrences are overcounted. However, in this program the pattern is not a fixed string, so I have to treat it as a variable. I have tried many ways to put it into re.compile() without positive results. Could anyone enlighten me on the correct way to format it (in the function countSTR below)?
After that, I think finditer(the_string_to_be_analyzed) should return an iterator over all the matches found. Then I used match.end() - match.start() to obtain the length of every match and compared them in order to get the longest consecutive run of the pattern. Maybe something goes wrong there?
Code attached. Every input would be appreciated!
from sys import argv, exit
import csv
import re

def main():
    if len(argv) != 3:
        print("Usage: python dna.py data.csv sequence.txt")
        exit(1)

    # read DNA sequence
    with open(argv[2], "r") as file:
        if file.mode != 'r':
            print(f"database {argv[2]} can not be read")
            exit(1)
        sequence = file.read()

    # read database.csv
    with open(argv[1], newline='') as file:
        if file.mode != 'r':
            print(f"database {argv[1]} can not be read")
            exit(1)
        # get the heading of the csv file in order to obtain the STRs
        csv_reader = csv.reader(file)
        headings = next(csv_reader)

    # dictionary to store the STR match results for the DNA sequence
    STR_counter = {}
    for STR in headings[1::]:
        # enter the result according to the STR key
        STR_counter[STR] = countSTR(STR, sequence)

    # read csv file as a dictionary
    with open(argv[1], newline='') as file:
        database = csv.DictReader(file)
        for row in database:
            count = 0
            for STR in STR_counter:
                # print("row in database ", row[STR], "STR in STR_counter", STR_counter[STR])
                if int(row[STR]) == int(STR_counter[STR]):
                    count += 1
            if count == len(STR_counter):
                print(row['name'])
                exit(0)
        else:
            print("No match")

# find non-overlapping occurrences of STR in the DNA sequence
def countSTR(STR, sequence):
    count = 0
    maxcount = 0
    # in order to match repeated STRs, for example "(AATG)+", as the pattern
    # passed into re.compile(), rewrite STR to "(STR)+"
    STR = "(" + STR + ")+"
    pattern = re.compile(r'STR')
    # matches should be an iterator object
    matches = pattern.finditer(sequence)
    # go through every repeat and find the longest one
    # by match.end() - match.start()
    for match in matches:
        count = match.end() - match.start()
        if count > maxcount:
            maxcount = count
    # return the repeat count of the longest run
    return maxcount/len(STR)

main()
I just found a correct way to get the desired result.
Posting it here in case anyone else is also confused.
From what I understand, to match a variable pattern named var_pattern you can use re.compile(rf'{var_pattern}'). Then, if consecutive occurrences of var_pattern should be matched, you can use re.compile(rf'({var_pattern})+'). There may be other smarter ways to implement this, but I managed to get it working as well as before.
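For reference, the working version of countSTR under that approach would look roughly like this (a sketch; note the braces around the variable in the rf-string, and that the repeat count has to be derived from the original STR length, not from a rewritten pattern string):

def countSTR(STR, sequence):
    maxcount = 0
    # interpolate the variable into the pattern; the + matches one
    # or more consecutive repeats of the STR
    pattern = re.compile(rf'(?:{STR})+')
    for match in pattern.finditer(sequence):
        length = match.end() - match.start()
        if length > maxcount:
            maxcount = length
    # convert the longest run's length back into a repeat count
    return maxcount // len(STR)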
I have been given an assignment for a project with no previous programming experience. It asks me to create a motif finder using while loops, incrementing counters, and booleans. I believe I am on the right track, but I'm very uncertain as I have no programming experience. Can anybody help me find my mistakes and tell me what I need to do to correct them? Again, I am a biology guy who was asked to take this on.
gi|14578797|gb|AF230943.1| Vibrio hollisae strain ATCC33564 Hsp60 (hsp60) gene, partial cds
CGCAACTGTACTGGCACAGGCTATCGTAAGCGAAGGTCTGAAAGCCGTTGCTGCAGGCATGAACCCAATG
GACCTGAAGCGTGGTATTGACAAAGCGGTTGCTGCGGCAGTTGAGCAACTGAAAGCGTTGTCTGTTGAGT
GTAATGACACCAAGGCTATTGCACAGGTAGGTACCATTTCTGCTAACTCTGATGAAACTGTAGGTAACAT
CATTGCAGAAGCGATGGAAAAAGTAGGCCGCGACGGTGTTATCACTGTTGAAGAAGGTCAGTCTCTGCAA
GACGAGCTGGATGTGGTTGAAGGTATGCAGTTTGACCGCGGCTACCTGTCTCCATACTTCATCAACAACC
AAGAGTCTGGTTCTGTTGATCTGGAAAACCCATTCATCCTGCTGGTTGACAAAAAAGTATCAAACATCCG
CGAACTGCTGCCTACTCTGGAAGCCGTCGCGAAATCTTCACGTCCACTGCTGATCATCGCTGAAGACGTA
GAAGGTGAAGCACTGGCAACACTGGTTGTAAACAACATGCGTGGCATCGTAAAAGGGCAGCAGTT
gi|14578795|gb|AF230942.1| Photobacterium damselae strain ATCC33539 Hsp60 (hsp60) gene, partial cds
GGCTACAGTACTGGCTCAAGCAATTATCACTGAAGGTCTAAAAGCGGTTGCTGCGGGTATGAACCCAATG
GATCTTAAGCGTGGTATCGACAAAGCAGTAGTTGCTGCTGTTGAAGAGCTAAAAGCACTATCTGTTCCTT
GTGCTGACACTAAAGCGATTGCTCAGGTAGGTACTATCTCTGCAAACTCTGATGCAACTGTGGGTAACCT
AATTGCAAAAGCTATGGATAAAGTTGGTCGTGATGGTGTTATCACGGTTGAAGAAGGCCAAGCGCTACAA
GATGAGTTAGATGTAGTTGAAGGTATGCAGTTCGATCGCGGTTACCTATCTCCATACTTCATCAACAACC
AACAAGCAGGTGCGGTGGAGCTAGAAAGCCCATTTATCCTTCTTGTTGATAAGAAAATCTCTAACATCCG
TGAGCTATTACCAGCACTAGAAGGCGTTGCAAAAGCATCTCGTCCTCTACTGATCATCGCTGAAGATGTT
GAAGGTGAAGCACTAGCAACACTGGTTGTGAACAACATGCGCGGCATTGTTAAAGTTGCTGCTGTT
I am in need of some help.
import re

# function parsing header from sequence
def fasta_splitter(x):
    boo = 0
    seq = ""
    i = 0
    while i < len(lines)
        if line[0] == ">" and boo == 0
            line[i] = header
            boo = 1
            i = 1 + i
        elif line[i][0] == ">"
            header = line[0]
            seq = ""
            i = i + 1
        else
            seq = seq + line[i]
    print("header" + "seq")

# open file and read file by command line
x = open('C:\\Python27\\fasta.py.txt', 'r+')
lines = x.readlines()
fasta_splitter(lines)

# split organism details from actual bases
# not sure how to call the defined function
re.search(pattern, string)

# renaming string seq to dna
seq = "x"
m = re.search(r"GG(ATCG)GTTAC", dna)
print "m"
For starters, read FASTA with the Bio.SeqIO module, so you don't have to write this fasta_splitter monstrosity. Biopython is generally great.
Second, you've messed just about everything up. You call re.search without having defined either a pattern or a string; this will just crash. Then you write
seq="x"
...
print "m"
In both cases you use the literal letters "x" and "m", where what you need are variable names. The correct thing would be
seq = x
...
print(m)
And all this assumes it's a student assignment and not actual research. In the latter case it's generally better to use a modern motif finder tool: those are more sensitive and biologically correct than any bunch of regexes could be.
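As a minimal sketch of the Biopython route (assuming Biopython is installed and your records live in a file named fasta.txt; the motif is just the one from the re.search call above):

import re
from Bio import SeqIO

# SeqIO handles headers and line wrapping, so no hand-written parser is needed
for record in SeqIO.parse("fasta.txt", "fasta"):
    dna = str(record.seq)
    m = re.search(r"GG(ATCG)GTTAC", dna)
    if m:
        print(record.id, "motif found at position", m.start())
    else:
        print(record.id, "no match")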
I have this code, which I want to open a specified file and count every while loop in it, finally outputting the total number of while loops in that file. I decided to convert the input file into a dictionary of word counts, and then create a loop so that every time the word "while" followed by a space was seen it would add 1 to WHILE_, before finally printing WHILE_ at the end.
However, this did not seem to work, and I am at a loss as to why. Any help fixing this would be much appreciated.
This is the code I have at the moment:
WHILE_ = 0
INPUT_ = input("Enter file or directory: ")
OPEN_ = open(INPUT_)
READLINES_ = OPEN_.readlines()
STRING_ = (str(READLINES_))
STRIP_ = STRING_.strip()
input_str1 = STRIP_.lower()
dic = dict()
for w in input_str1.split():
    if w in dic.keys():
        dic[w] = dic[w] + 1
    else:
        dic[w] = 1
DICT_ = (dic)
for LINE_ in DICT_:
    if ("while\\n',") in LINE_:
        WHILE_ += 1
    elif ('while\\n",') in LINE_:
        WHILE_ += 1
    elif ('while ') in LINE_:
        WHILE_ += 1
print("while_loops {0:>12}".format((WHILE_)))
This is the input file I was working from:
'''A trivial test of metrics
Author: Angus McGurkinshaw
Date: May 7 2013
'''

def silly_function(blah):
    '''A silly docstring for a silly function'''
    def nested():
        pass
    print('Hello world', blah + 36 * 14)
    tot = 0  # This isn't a for statement
    for i in range(10):
        tot = tot + i
    if_im_done = false  # Nor is this an if
    print(tot)

blah = 3
while blah > 0:
    silly_function(blah)
    blah -= 1
while True:
    if blah < 1000:
        break
The output should be 2, but my code at the moment prints 0
This is an incredibly bizarre design. You're calling readlines to get a list of strings, then calling str on that list, which will join the whole thing up into one big string with the quoted repr of each line joined by commas and surrounded by square brackets, then splitting the result on spaces. I have no idea why you'd ever do such a thing.
Your bizarre variable names, extra useless lines of code like DICT_ = (dic), etc. only serve to obfuscate things further.
But I can explain why it doesn't work. Try printing out DICT_ after you do all that silliness, and you'll see that the only keys that include while are while and 'while. Since neither of these match any of the patterns you're looking for, your count ends up as 0.
It's also worth noting that you only add 1 to WHILE_ even if there are multiple instances of the pattern, so your whole dict of counts is useless.
This will be a lot easier if you don't obfuscate your strings, try to recover them, and then try to match the incorrectly-recovered versions. Just do it directly.
While I'm at it, I'm also going to fix some other problems so that your code is readable, and simpler, and doesn't leak files, and so on. Here's a complete implementation of the logic you were trying to hack up by hand:
import collections

filename = input("Enter file: ")
counts = collections.Counter()
with open(filename) as f:
    for line in f:
        counts.update(line.strip().lower().split())
print('while_loops {0:>12}'.format(counts['while']))
When you run this on your sample input, you correctly get 2. And extending it to handle if and for is trivial and obvious.
However, note that there's a serious problem in your logic: Anything that looks like a keyword but is in the middle of a comment or string will still get picked up. Without writing some kind of code to strip out comments and strings, there's no way around that. Which means you're going to overcount if and for by 1. The obvious way of stripping—line.partition('#')[0] and similarly for quotes—won't work. First, it's perfectly valid to have a string before an if keyword, as in "foo" if x else "bar". Second, you can't handle multiline strings this way.
These problems, and others like them, are why you almost certainly want a real parser. If you're just trying to parse Python code, the ast module in the standard library is the obvious way to do this. If you want to be write quick&dirty parsers for a variety of different languages, try pyparsing, which is very nice, and comes with some great examples.
Here's a simple example:
import ast

filename = input("Enter file: ")
with open(filename) as f:
    tree = ast.parse(f.read())
while_loops = sum(1 for node in ast.walk(tree) if isinstance(node, ast.While))
print('while_loops {0:>12}'.format(while_loops))
Or, more flexibly:
import ast
import collections

filename = input("Enter file: ")
with open(filename) as f:
    tree = ast.parse(f.read())
counts = collections.Counter(type(node).__name__ for node in ast.walk(tree))
print('while_loops {0:>12}'.format(counts['While']))
print('for_loops {0:>14}'.format(counts['For']))
print('if_statements {0:>10}'.format(counts['If']))
This is mostly about finding the fastest way to do it.
I have a file1 which contains about one million strings (length 6-40), one per line. I want to search for each of them in another file2, which contains about 80,000 strings, and count the occurrences (if a small string is found multiple times within one string, its occurrence still counts as 1). If anyone is interested in comparing performance, there is a link to download file1 and file2.
dropbox.com/sh/oj62918p83h8kus/sY2WejWmhu?m
What I am doing now is constructing a dictionary for file2, using the string IDs as keys and the strings as values (because the strings in file2 have duplicate values, only the string ID is unique).
My code is:
for line in file1:
    substring=line[:-1].split("\t")
    for ID in dictionary.keys():
        bigstring=dictionary[ID]
        IDlist=[]
        if bigstring.find(substring)!=-1:
            IDlist.append(ID)
    output.write("%s\t%s\n" % (substring,str(len(IDlist))))
My code will take hours to finish. Can anyone suggest a faster way to do it?
Both file1 and file2 are only around 50 MB, and my PC has 8 GB of memory; you can use as much memory as you need to make it faster. Any method that can finish in one hour is acceptable :)
After trying some of the suggestions from the comments below, here is a performance comparison: first comes the code, then its run time.
Some improvements suggested by Mark Amery and other people:
import sys
from Bio import SeqIO

# first I load the strings in file2 into a dictionary called var_seq
var_seq={}
handle=SeqIO.parse(file2,'fasta')
for record in handle:
    var_seq[record.id]=str(record.seq)
print len(var_seq)  # prints out 76827, which is the right number; loading file2 into var_seq doesn't take long, about 1 second, so don't focus here to improve performance

output=open(outputfilename,'w')
icount=0
input1=open(file1,'r')
for line in input1:
    icount+=1
    row=line[:-1].split("\t")
    ensp=row[0]     # ensp is just the peptide ID
    peptide=row[1]  # peptide is the substring I want to search for in file2
    num=0
    for ID,bigstring in var_seq.iteritems():
        if peptide in bigstring:
            num+=1
    newline="%s\t%s\t%s\n" % (ensp,peptide,str(num))
    output.write(newline)
    if icount%1000==0:
        break
input1.close()
handle.close()
output.close()
It takes 1m4s to finish, a 20s improvement compared to my old version.
#######NEXT METHOD suggested by entropy
from collections import defaultdict

var_seq=defaultdict(int)
handle=SeqIO.parse(file2,'fasta')
for record in handle:
    var_seq[str(record.seq)]+=1
print len(var_seq)  # prints out 59502; duplicates are removed, but the occurrences of the duplicates are stored as the value
handle.close()

output=open(outputfilename,'w')
icount=0
with open(file1) as fd:
    for line in fd:
        icount+=1
        row=line[:-1].split("\t")
        ensp=row[0]
        peptide=row[1]
        num=0
        for varseq,num_occurrences in var_seq.items():
            if peptide in varseq:
                num+=num_occurrences
        newline="%s\t%s\t%s\n" % (ensp,peptide,str(num))
        output.write(newline)
        if icount%1000==0:
            break
output.close()
This one takes 1m10s, which is not faster as I expected even though it avoids searching duplicates; I don't understand why.
The haystack-and-needle method suggested by Mark Amery turned out to be the fastest. The problem with this method is that the counting result for all substrings is 0, which I don't understand yet.
Here is the code in which I implemented his method.
class Node(object):
    def __init__(self):
        self.words = set()
        self.links = {}

base = Node()

def search_haystack_tree(needle):
    current_node = base
    for char in needle:
        try:
            current_node = current_node.links[char]
        except KeyError:
            return 0
    return len(current_node.words)

input1=open(file1,'r')
needles={}
for line in input1:
    row=line[:-1].split("\t")
    needles[row[1]]=row[0]
print len(needles)

handle=SeqIO.parse(file2,'fasta')
haystacks={}
for record in handle:
    haystacks[record.id]=str(record.seq)
print len(haystacks)

for haystack_id, haystack in haystacks.iteritems():  # should be the same as enumerate(list)
    for i in xrange(len(haystack)):
        current_node = base
        for char in haystack[i:]:
            current_node = current_node.links.setdefault(char, Node())
            current_node.words.add(haystack_id)

icount=0
output=open(outputfilename,'w')
for needle in needles:
    icount+=1
    count = search_haystack_tree(needle)
    newline="%s\t%s\t%s\n" % (needles[needle],needle,str(count))
    output.write(newline)
    if icount%1000==0:
        break
input1.close()
handle.close()
output.close()
It takes only 0m11s to finish, which is much faster than the other methods. However, I don't know whether it is my mistake that makes every count come out as 0, or whether there is a flaw in Mark's method.
Your code doesn't seem like it works (are you sure you didn't just quote it from memory instead of pasting the actual code?).
For example, this line:
substring=line[:-1].split("\t")
will cause substring to be a list. But later you do:
if bigstring.find(substring)!=-1:
That would cause an error, since you are calling str.find() with a list.
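A quick demonstration of that failure (hypothetical values; the exact error message depends on the Python version):

bigstring = "AATGAATG"
substring = "AATG\n".split("\t")  # this is a list: ['AATG\n']
bigstring.find(substring)         # raises a TypeError: find() expects a string, not a list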
In any case, you are building lists uselessly in your innermost loop. This:
IDlist=[]
if bigstring.find(substring)!=-1:
    IDlist.append(ID)
#later
len(IDlist)
That will uselessly allocate and free lists, which causes memory thrashing as well as uselessly bogging everything down.
This is code that should work and uses more efficient means to do the counting:
from collections import defaultdict

dictionary = defaultdict(int)

with open(file2) as fd:
    for line in fd:
        for s in line.split("\t"):
            dictionary[s.strip()] += 1

with open(file1) as fd:
    for line in fd:
        for substring in line.split('\t'):
            count = 0
            for bigstring,num_occurrences in dictionary.items():
                if substring in bigstring:
                    count += num_occurrences
            print substring, count
PS: I am assuming that you have multiple words per line, tab-separated, because you do line.split("\t") at some point. If that is wrong, it should be easy to revise the code.
PPS: If this ends up being too slow for your use (you'd have to try it, but my guess is it should run in ~10 min given the number of strings you said you have), you'll have to use suffix trees, as one of the comments suggested.
Edit: Amended the code so that it handles multiple occurrences of the same string in file2 without negatively affecting performance
Edit 2: Trading maximum space for time.
Below is code that will consume quite a bit of memory and take a while to build the dictionary. However, once that's done, each of the million searches should complete in the time it takes for a single hashtable lookup, that is, O(1).
Note that I have added some statements to log the time each step of the process takes. Keep those so you know which part of the time is spent building and which part searching. This matters a lot since you are testing with only 1000 strings: if 90% of the cost is the build time rather than the search time, then when you test with 1M strings you will still only be doing the build once, so it won't matter.
Also note that I have amended my code to parse file1 and file2 the way you do, so you should be able to just plug this in and test it:
from Bio import SeqIO
from collections import defaultdict
from datetime import datetime

def all_substrings(s):
    result = set()
    for length in range(1, len(s) + 1):
        for offset in range(len(s) - length + 1):
            result.add(s[offset:offset + length])
    return result

print "Building dictionary...."
build_start = datetime.now()

dictionary = defaultdict(int)
handle = SeqIO.parse(file2, 'fasta')
for record in handle:
    for sub in all_substrings(str(record.seq).strip()):
        dictionary[sub] += 1

build_end = datetime.now()
print "Dictionary built in: %gs" % (build_end - build_start).total_seconds()

print "Searching...\n"
search_start = datetime.now()

with open(file1) as fd:
    for line in fd:
        substring = line.strip().split("\t")[1]
        count = dictionary[substring]
        print substring, count

search_end = datetime.now()
print "Search done in: %gs" % (search_end - search_start).total_seconds()
I'm not an algorithms whiz, but I reckon this should give you a healthy performance boost. You need to set 'haystacks' to be a list of the big words you want to look in, and 'needles' to be a list of the substrings you're looking for (either can contain duplicates), which I'll let you implement on your end. It'd be great if you could post your list of needles and list of haystacks so that we can easily compare performance of proposed solutions.
haystacks = <some list of words>
needles = <some list of words>

class Node(object):
    def __init__(self):
        self.words = set()
        self.links = {}

base = Node()

for haystack_id, haystack in enumerate(haystacks):
    for i in xrange(len(haystack)):
        current_node = base
        for char in haystack[i:]:
            current_node = current_node.links.setdefault(char, Node())
            current_node.words.add(haystack_id)

def search_haystack_tree(needle):
    current_node = base
    for char in needle:
        try:
            current_node = current_node.links[char]
        except KeyError:
            return 0
    return len(current_node.words)

for needle in needles:
    count = search_haystack_tree(needle)
    print "Count for %s: %i" % (needle, count)
You can probably figure out what's going on by looking at the code, but just to put it in words: I construct a huge tree of substrings of the haystack words, such that given any needle, you can navigate the tree character by character and end up at a node which has attached to it the set of all haystack ids of haystacks containing that substring. Then for each needle we just go through the tree and count the size of the set at the end.
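A toy run (hypothetical data) showing the behaviour the code above is meant to have:

haystacks = ["AATGAATG", "CCCC", "GAATGA"]
needles = ["AATG", "CCC", "TTT"]
# With those lists plugged into the code above, the printed counts would be:
# Count for AATG: 2   (haystacks 0 and 2 contain it)
# Count for CCC: 1
# Count for TTT: 0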