I have been given an assignment for a project, and I have no previous programming experience. It asks me to create a motif finder using while loops, incrementing counters and booleans. I believe I am on the right track, but I am very uncertain. Can anybody help me find my mistakes and tell me what I need to do to correct them? Again, I am a biology guy who was asked to take this on.
>gi|14578797|gb|AF230943.1| Vibrio hollisae strain ATCC33564 Hsp60 (hsp60) gene, partial cds
CGCAACTGTACTGGCACAGGCTATCGTAAGCGAAGGTCTGAAAGCCGTTGCTGCAGGCATGAACCCAATG
GACCTGAAGCGTGGTATTGACAAAGCGGTTGCTGCGGCAGTTGAGCAACTGAAAGCGTTGTCTGTTGAGT
GTAATGACACCAAGGCTATTGCACAGGTAGGTACCATTTCTGCTAACTCTGATGAAACTGTAGGTAACAT
CATTGCAGAAGCGATGGAAAAAGTAGGCCGCGACGGTGTTATCACTGTTGAAGAAGGTCAGTCTCTGCAA
GACGAGCTGGATGTGGTTGAAGGTATGCAGTTTGACCGCGGCTACCTGTCTCCATACTTCATCAACAACC
AAGAGTCTGGTTCTGTTGATCTGGAAAACCCATTCATCCTGCTGGTTGACAAAAAAGTATCAAACATCCG
CGAACTGCTGCCTACTCTGGAAGCCGTCGCGAAATCTTCACGTCCACTGCTGATCATCGCTGAAGACGTA
GAAGGTGAAGCACTGGCAACACTGGTTGTAAACAACATGCGTGGCATCGTAAAAGGGCAGCAGTT
>gi|14578795|gb|AF230942.1| Photobacterium damselae strain ATCC33539 Hsp60 (hsp60) gene, partial cds
GGCTACAGTACTGGCTCAAGCAATTATCACTGAAGGTCTAAAAGCGGTTGCTGCGGGTATGAACCCAATG
GATCTTAAGCGTGGTATCGACAAAGCAGTAGTTGCTGCTGTTGAAGAGCTAAAAGCACTATCTGTTCCTT
GTGCTGACACTAAAGCGATTGCTCAGGTAGGTACTATCTCTGCAAACTCTGATGCAACTGTGGGTAACCT
AATTGCAAAAGCTATGGATAAAGTTGGTCGTGATGGTGTTATCACGGTTGAAGAAGGCCAAGCGCTACAA
GATGAGTTAGATGTAGTTGAAGGTATGCAGTTCGATCGCGGTTACCTATCTCCATACTTCATCAACAACC
AACAAGCAGGTGCGGTGGAGCTAGAAAGCCCATTTATCCTTCTTGTTGATAAGAAAATCTCTAACATCCG
TGAGCTATTACCAGCACTAGAAGGCGTTGCAAAAGCATCTCGTCCTCTACTGATCATCGCTGAAGATGTT
GAAGGTGAAGCACTAGCAACACTGGTTGTGAACAACATGCGCGGCATTGTTAAAGTTGCTGCTGTT
I am in need of some help.
import re
#function parsing header for sequence
def fasta_splitter(x):
    boo=0
    seq = ""
    i=0
    while i < len(lines)
        if line[0] ==">"and boo ==0
            line[i] = header
            boo = 1
            i=1+i
        elif line [i][0] ==">"
            header=line[0]
            seq=""
            i=i+1
        else
            seq=seq+line[i]
    print ("header" + "seq")
#open file and read file by command line
x=open('C:\\Python27\\fasta.py.txt','r+')
lines = x.readlines()
fasta_splitter(lines)
#split organism details from actual bases
# not sure how to call defined function
re.search(pattern, string)
# renaming string seq to dna
seq ="x"
m = re.search(r"GG(ATCG)GTTAC",dna)
print "m"
For starters, read the FASTA file with Biopython's Bio.SeqIO module, so you don't have to write this fasta_splitter monstrosity. Biopython is generally great.
Second, you've messed just about everything up. You call re.search without having defined either a pattern or a string. This will just crash. Then you write:
seq="x"
...
print "m"
In both cases you use the literal strings "x" and "m", when what you need are variable names. The correct version would be:
seq = x
...
print(m)
And all this is assuming this is a student assignment and not actual research. In the latter case it's generally better to use a modern motif finder tool: those are more sensitive and biologically correct than any bunch of regexes could be.
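If the assignment rules out Biopython, the asker's while-loop approach can be repaired roughly as follows. This is only a sketch under stated assumptions: the function name `split_fasta`, the two inline sample records, and the motif `GG[ATCG]GTTAC` (reading the asker's `(ATCG)` as "any single base") are illustrative and do not come from the original post.

```python
import re

def split_fasta(lines):
    """Return a list of (header, sequence) pairs from FASTA-formatted lines."""
    records = []
    header = ""
    seq = ""
    in_record = False  # the boolean flag: have we seen a header yet?
    i = 0              # the incrementing counter
    while i < len(lines):
        line = lines[i].strip()
        if line.startswith(">"):
            if in_record:
                records.append((header, seq))  # close the previous record
            header = line[1:]
            seq = ""
            in_record = True
        elif line and in_record:
            seq += line  # sequence data may span several lines
        i += 1
    if in_record:
        records.append((header, seq))  # do not forget the last record
    return records

# Hypothetical two-record input, standing in for the result of readlines()
sample_lines = [
    ">seq_a\n",
    "ACGTGGAG\n",
    "TTACTT\n",
    ">seq_b\n",
    "TTTTCCCC\n",
]

# Assumed motif: GG, then any single base, then GTTAC
motif = re.compile(r"GG[ATCG]GTTAC")

results = {}
for header, dna in split_fasta(sample_lines):
    match = motif.search(dna)
    # Store the 0-based start position, or None when the motif is absent
    results[header] = match.start() if match else None
    print(header, results[header])
```

Joining the sequence lines before searching matters here: the motif in `seq_a` straddles a line break and would be missed by a line-by-line search.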
dict = {}
tag = ""
with open('/storage/emulated/0/Download/sequence.fasta.txt','r') as sequence:
    seq = sequence.readlines()
    for line in seq:
        if line.startswith(">"):
            tag = line.replace("\n", "")
        else:
            seq = "".join(seq[1:])
            dict[tag] = seq.replace("\n", "")
print(dict)
Background for those who aren't familiar with FASTA files: this format contains one or more DNA, RNA, or protein sequences, each introduced by a one-line descriptive tag that starts with a ">" and followed by the sequence on the subsequent lines (e.g. for DNA, a long run of A, T, G, and C). It also comes with many unnecessary line breaks. So far this code works when I only have one sequence per file, but it seems to ignore the if condition if there are multiple. It should add each new tag: sequence pair to the dictionary every time it notices a ">", but instead it only runs once: it puts the first description in the dictionary as the key, joins the rest of the file regardless of ">" characters, and uses that as the value. How can I get this loop to notice a new ">" after the first occurrence?
I am purposefully steering away from the biopython module.
UPDATE: the code below now works for multiple-line sequences.
The following code works fine for me:
import re
from collections import defaultdict
sequences = defaultdict(str)
with open('fasta.txt') as f:
    lines = f.readlines()
current_tag = None
for line in lines:
    m = re.match('^>(.+)', line)
    if m:
        current_tag = m.group(1)
    else:
        sequences[current_tag] += line.strip()
for k, v in sequences.items():
    print(f"{k}: {v}")
It uses a number of features you may be unfamiliar with, such as regular expressions (which are probably very useful in bioinformatics) and f-string formatting. If anything confuses you, ask away. One thing I should add is that you don't want to define a variable as dict because that will clobber something Python has defined at startup. I chose sequences, which doesn't do this and is more informative.
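To make the point about the name dict concrete, here is a small sketch of what goes wrong once the name is rebound. The function name `shadowing_demo` is made up for illustration; the rebinding is kept inside a function so that it does not affect the rest of a session.

```python
def shadowing_demo():
    dict = {}  # inside this function, the name now refers to an empty dict
    try:
        dict(a=1)  # the built-in constructor is no longer reachable by name
    except TypeError as exc:
        return str(exc)
    return "no error"

message = shadowing_demo()
print(message)
```

The same thing happens at module level, where the shadowing lasts until the name is deleted or the interpreter exits.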
For reference, this is the content of the example FASTA file fasta.txt I used in this instance:
>seq0
FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq1
KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME
LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
>seq4
EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVVSYEMRLFGVQKDNFALEHSLL
>seq5
SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTTLLMR
>seq6
FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKLSSTLLRSI
>seq7
SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNNIFF
>seq8
SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq9
KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK
>seq10
FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK
I'm trying to determine the most common words, or "terms" (I think) as I iterate over many different files.
Example - For this line of code found in a file:
for w in sorted(strings, key=strings.get, reverse=True):
I'd want these unique strings/terms returned to my dictionary as keys:
for
w
in
sorted
strings
key
strings
get
reverse
True
However, I want this code to be tunable so that I can return strings with periods or other characters between them as well, because I just don't know what makes sense yet until I run the script and count up the "terms" a few times:
strings.get
How can I approach this problem? It would help to understand how I can do this one line at a time so I can loop it as I read my file's lines in. I've got the basic logic down but I'm currently just doing the tallying by unique line instead of "term":
strings = dict()
fname = '/tmp/bigfile.txt'
with open(fname, "r") as f:
    for line in f:
        if line in strings:
            strings[line] += 1
        else:
            strings[line] = 1
for w in sorted(strings, key=strings.get, reverse=True):
    print str(w).rstrip() + " : " + str(strings[w])
(Yes I used code from my little snippet here as the example at the top.)
If the only Python token you want to keep together is the object.attr construct, then all the tokens you are interested in would fit the regular expression
\w+\.?\w*
which basically means "one or more alphanumeric characters (including _), optionally followed by a . and then some more characters".
Note that this would also match number literals like 42 or 7.6, but those would be easy enough to filter out afterwards.
Then you can use collections.Counter to do the actual counting for you:
import collections
import re
pattern = re.compile(r"\w+\.?\w*")
#here I'm using the source file for `collections` as the test example
with open(collections.__file__, "r") as f:
    tokens = collections.Counter(t.group() for t in pattern.finditer(f.read()))
for token, count in tokens.most_common(5): #show only the top 5
    print(token, count)
Running python version 3.6.0a1 the output is this:
self 226
def 173
return 170
self.data 129
if 102
This makes sense for the collections module, since it is full of classes that use self and define methods. It also shows that the pattern captures self.data, which fits the construct you are interested in.
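The question also asked for a line-at-a-time version, and the answer mentions filtering out number literals afterwards. Both could be sketched as below; the helper name `count_terms`, the `NUMBER` pattern, and the two sample lines are assumptions made for illustration only.

```python
import collections
import re

TOKEN = re.compile(r"\w+\.?\w*")   # same pattern as in the answer
NUMBER = re.compile(r"\d+\.?\d*")  # assumed definition of a plain number literal

def count_terms(lines):
    """Count tokens line by line, skipping tokens that are just numbers."""
    counts = collections.Counter()
    for line in lines:
        for match in TOKEN.finditer(line):
            token = match.group()
            if NUMBER.fullmatch(token):
                continue  # drop literals such as 42 or 7.6
            counts[token] += 1
    return counts

# Hypothetical input lines; in practice these would come from `for line in f`
sample = [
    "for w in sorted(strings, key=strings.get, reverse=True):\n",
    "    total = total + 42 * 7.6\n",
]
terms = count_terms(sample)
print(terms.most_common(3))
```

Changing `TOKEN` is the tuning knob: dropping the `\.?\w*` part, for example, would count `strings` and `get` separately instead of `strings.get`.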
I have downloaded the following dictionary from Project Gutenberg http://www.gutenberg.org/cache/epub/29765/pg29765.txt (it is 25 MB so if you're on a slow connection avoid clicking the link)
In the file, the keywords I am looking for are in uppercase, for instance HALLUCINATION. The dictionary also has some lines dedicated to pronunciation, which are irrelevant for me.
What I want to extract is the definition, indicated by "Defn:", and then print its lines. I have come up with this rather ugly 'solution':
def lookup(search):
    find = search.upper() # transforms our search parameter all upper letters
    output = [] # empty dummy list
    infile = open('webster.txt', 'r') # opening the webster file for reading
    for line in infile:
        for part in line.split():
            if (find == part):
                for line in infile:
                    if (line.find("Defn:") == 0): # ugly I know, but my only guess so far
                        output.append(line[6:])
                        print output # uncertain about how to proceed
                        break
Now this of course only prints the first line that comes right after "Defn:". I am new to manipulating .txt files in Python and therefore clueless about how to proceed. I did read the lines into a tuple and noticed that there are special newline characters.
So I suppose I want to somehow tell Python to keep on reading until it runs out of newline characters, although that rule must not exclude the last line, which still has to be read.
Could someone please point me to useful functions I could use to solve this problem? A minimal example would be appreciated.
Example of desired output:
lookup("hallucinate")
out: To wander; to go astray; to err; to blunder; -- used of mental
processes. [R.] Byron.
lookup("hallucination")
out: The perception of objects which have no reality, or of \r\n
sensations which have no corresponding external cause, arising from \r\n
disorder or the nervous system, as in delirium tremens; delusion.\r\n
Hallucinations are always evidence of cerebral derangement and are\r\n
common phenomena of insanity. W. A. Hammond.
from text:
HALLUCINATE
Hal*lu"ci*nate, v. i. Etym: [L. hallucinatus, alucinatus, p. p. of
hallucinari, alucinari, to wander in mind, talk idly, dream.]
Defn: To wander; to go astray; to err; to blunder; -- used of mental
processes. [R.] Byron.
HALLUCINATION
Hal*lu`ci*na"tion, n. Etym: [L. hallucinatio cf. F. hallucination.]
1. The act of hallucinating; a wandering of the mind; error; mistake;
a blunder.
This must have been the hallucination of the transcriber. Addison.
2. (Med.)
Defn: The perception of objects which have no reality, or of
sensations which have no corresponding external cause, arising from
disorder or the nervous system, as in delirium tremens; delusion.
Hallucinations are always evidence of cerebral derangement and are
common phenomena of insanity. W. A. Hammond.
HALLUCINATOR
Hal*lu"ci*na`tor, n. Etym: [L.]
Here is a function that returns the first definition:
def lookup(word):
    word_upper = word.upper()
    found_word = False
    found_def = False
    defn = ''
    with open('dict.txt', 'r') as file:
        for line in file:
            l = line.strip()
            if not found_word and l == word_upper:
                found_word = True
            elif found_word and not found_def and l.startswith("Defn:"):
                found_def = True
                defn = l[6:]
            elif found_def and l != '':
                defn += ' ' + l
            elif found_def and l == '':
                return defn
    # the file may end in the middle of a definition
    return defn if found_def else False
print lookup('hallucination')
Explanation: There are four different cases we have to consider.
1. We haven't found the word yet. We compare the current line to the word we are looking for in uppercase. If they are equal, we have found the word.
2. We have found the word, but haven't found the start of the definition. We therefore look for a line that starts with Defn:. If we find one, we add the line to the definition (excluding the six characters of "Defn: ").
3. We have already found the start of the definition and the current line is not empty. In that case, we just add the line to the definition.
4. We have already found the start of the definition and the current line is empty. The definition is complete and we return it.
If we found nothing, we return False.
Note: There are certain entries, e.g. CRANE, that have multiple definitions. The above code is not able to handle that. It will just return the first definition. However, it is far from easy to code a perfect solution considering the format of the file.
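For entries with several definitions, one possible extension of the same state machine is to keep collecting instead of returning at the first blank line. The sketch below is not a complete solution for the real file. It assumes that a headword is any line consisting only of uppercase text, that repeated headwords (such as CRANE as a noun and as a verb) belong together, and that each definition ends at a blank line. The name `lookup_all` and the sample text are made up.

```python
def lookup_all(word, lines):
    """Return every 'Defn:' paragraph found under the given headword."""
    target = word.upper()
    definitions = []
    in_entry = False  # are we inside an entry for the target word?
    current = None    # lines of the definition being collected, if any
    for raw in lines:
        line = raw.strip()
        if current is not None:
            if line:
                current.append(line)  # definition continues
                continue
            definitions.append(" ".join(current))  # blank line ends it
            current = None
        if line.isupper():
            in_entry = (line == target)  # a new headword starts here
        elif in_entry and line.startswith("Defn:"):
            current = [line[len("Defn:"):].strip()]
    if current is not None:
        definitions.append(" ".join(current))  # input ended mid-definition
    return definitions

# Hypothetical excerpt in the same layout as the dictionary file
sample = """HALLUCINATE
Hallucinate, v. i.

Defn: To wander; to go astray.

CRANE
Crane, n.

Defn: A wading bird.

CRANE
Crane, v. t.

Defn: To stretch, as a crane
stretches its neck.

CRANK
Defn: A bent arm.
""".splitlines()

crane_definitions = lookup_all("crane", sample)
print(crane_definitions)
```

Passing an open file object as `lines` works as well, since the function only iterates over it once.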
You can split into paragraphs and use the index of the search word and find the first Defn paragraph after:
def find_def(f,word):
    import re
    with open(f) as f:
        lines = f.read()
    try:
        start = lines.index("{}\r\n".format(word)) # find where our search word is
    except ValueError:
        return "Cannot find search term"
    paras = re.split(r"\s+\r\n",lines[start:],10) # split into paragraphs using maxsplit = 10 as there are no grouping of paras longer in the definitions
    for para in paras:
        if para.startswith("Defn:"): # if para startswith Defn: we have what we need
            return para # return the para
print(find_def("in.txt","HALLUCINATION"))
Using the whole file returns:
In [5]: print find_def("gutt.txt","VACCINATOR")
Defn: One who, or that which, vaccinates.
In [6]: print find_def("gutt.txt","HALLUCINATION")
Defn: The perception of objects which have no reality, or of
sensations which have no corresponding external cause, arising from
disorder or the nervous system, as in delirium tremens; delusion.
Hallucinations are always evidence of cerebral derangement and are
common phenomena of insanity. W. A. Hammond.
A slightly shorter version:
def find_def(f,word):
    import re
    with open(f) as f:
        lines = f.read()
    try:
        start = lines.index("{}\r\n".format(word))
    except ValueError:
        return "Cannot find search term"
    defn = lines[start:].index("Defn:")
    return re.split(r"\s+\r\n",lines[start+defn:],1)[0]
From here I learned an easy way to deal with memory mapped files and use them as if they were strings. Then you can use something like this to get the first definition for a term.
import mmap
def lookup(search):
    term = search.upper()
    f = open('webster.txt')
    s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    index = s.find('\r\n\r\n' + term + '\r\n')
    if index == -1:
        return None
    definition = s.find('Defn:', index) + len('Defn:') + 1
    endline = s.find('\r\n\r\n', definition)
    return s[definition:endline]
print lookup('hallucination')
print lookup('hallucinate')
Assumptions:
There is at least one definition per term
If there is more than one, only the first is returned
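The function above is written for Python 2, where a memory map can be searched with ordinary strings. On Python 3 the map holds bytes, so the search terms have to be bytes too. A possible adaptation is sketched here; the extra `path` parameter, the UTF-8 decoding, and the temporary sample file are assumptions added so that the example is self-contained.

```python
import mmap
import os
import tempfile

def lookup(path, search):
    """Return the first definition of a term, or None if it is not found."""
    term = search.upper().encode("utf-8")
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as s:
            index = s.find(b"\r\n\r\n" + term + b"\r\n")
            if index == -1:
                return None
            start = s.find(b"Defn:", index)
            if start == -1:
                return None  # term present but no definition follows
            start += len(b"Defn:") + 1
            end = s.find(b"\r\n\r\n", start)
            if end == -1:
                end = len(s)  # definition runs to the end of the file
            return s[start:end].decode("utf-8", errors="replace")

# Hypothetical excerpt with the same CRLF line endings as the real file
sample = (
    b"HALLUCINATE\r\nHallucinate, v. i.\r\n\r\n"
    b"Defn: To wander; to go astray.\r\n\r\n"
    b"HALLUCINATION\r\nHallucination, n.\r\n\r\n"
    b"Defn: The perception of objects\r\nwhich have no reality.\r\n\r\n"
)
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(sample)
    sample_path = tmp.name

result = lookup(sample_path, "hallucination")
missing = lookup(sample_path, "zebra")
os.remove(sample_path)
print(result)
```

As in the original, a headword is only recognised when it is preceded by a blank line, so a term on the very first line of the file would not be found.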
I have an abstract which I've split into sentences in Python. I want to write to two tables. The first has the following columns: abstract ID (the file number that I extracted from my document), sentence ID (automatically generated), and the sentence itself, with each sentence of the abstract on its own row.
I would want a table that looks like this:
abstractID  SentenceID  Sentence
a9001755    0000001     Myxococcus xanthus development is regulated by (1st sentence)
a9001755    0000002     The C signal appears to be the polypeptide product (2nd sentence)
and another table, NSFClasses, with abstractID and nsfOrg.
How do I write the sentences (one per row) to the table and assign the sentence ID as shown above?
This is my code:
import glob;
import re;
import json
org = "NSF Org";
fileNo = "File";
AbstractString = "Abstract";
abstractFlag = False;
abstractContent = []
path = 'awardsFile/awd_1990_00/*.txt';
files = glob.glob(path);
for name in files:
    fileA = open(name,'r');
    for line in fileA:
        if line.find(fileNo)!= -1:
            file = line[14:]
        if line.find(org) != -1:
            nsfOrg = line[14:].split()
    print file
    print nsfOrg
    fileA = open(name,'r')
    content = fileA.read().split(':')
    abstract = content[len(content)-1]
    abstract = abstract.replace('\n','')
    abstract = abstract.split();
    abstract = ' '.join(abstract)
    sentences = abstract.split('.')
    print sentences
    key = str(len(sentences))
    print "Sentences--- "
As others have pointed out, it's very difficult to follow your code. I think this code will do what you want, based on your expected output and what we can see. I could be way off, though, since we can't see the file you are working with. I'm especially troubled by one part of your code that I can't see enough of to refactor, but that feels obviously wrong. It's marked below.
import glob
for filename in glob.glob('awardsFile/awd_1990_00/*.txt'):
    fh = open(filename, 'r')
    abstract = fh.read().split(':')[-1]
    fh.seek(0) # reset file pointer
    # See comments below
    for line in fh:
        if line.find('File') != -1:
            absID = line[14:].strip() # drop the trailing newline
            print absID
        if line.find('NSF Org') != -1:
            print line[14:].split()
    # End see comments
    fh.close()
    # collapse newlines and runs of spaces into single spaces
    concat_abstract = ' '.join(abstract.split())
    # sentence IDs start at 1 and are zero-padded to 7 digits, as in the question
    for s_id, sentence in enumerate(concat_abstract.split('.'), 1):
        # Adjust numeric width arguments to prettify table
        print absID.ljust(15),
        print '{:07d}'.format(s_id).ljust(15),
        print sentence
In that section marked, you are searching for the last occurrence of the strings 'File' and 'NSF Org' in the file (whether you mean to or not because the loop will keep overwriting your variables as long as they occur), then doing something with the 15th character onward of that line. Without seeing the file, it is impossible to say how to do it, but I can tell you there is a better way. It probably involves searching through the whole file as one string (or at least the first part of it if this is in its header) rather than looping over it.
Also, notice how I condensed your code. You store a lot of things in variables that you aren't using at all, and you collect a lot of cruft that spreads the state around. To understand what line N does, I have to keep glancing ahead at line N+5 and back over lines N-34 to N-17 to inspect variables. This creates a lot of action at a distance, which for reasons cited is best to avoid. In the smaller version, you can see how I substituted in string literals in places where they are only used once and called print statements immediately instead of storing the results for later. The results are usually more concise and easily understood.
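The suggestion to search the file as one string could look roughly like this. Since the real files are not shown, the layout is a guess: the sketch assumes header lines of the form "File        : a9001755" and "NSF Org     : DMS" and an "Abstract    :" label before the abstract text. The function name `parse_award` and the sample text are hypothetical.

```python
import re

def parse_award(text):
    """Return (sentence_rows, class_row) for one award file's text."""
    file_match = re.search(r"^File\s*:\s*(\S+)", text, re.MULTILINE)
    org_match = re.search(r"^NSF Org\s*:\s*(\S+)", text, re.MULTILINE)
    abstract_match = re.search(r"^Abstract\s*:(.*)", text,
                               re.MULTILINE | re.DOTALL)
    abstract_id = file_match.group(1) if file_match else None
    nsf_org = org_match.group(1) if org_match else None
    # Collapse line breaks and indentation into single spaces
    abstract = " ".join(abstract_match.group(1).split()) if abstract_match else ""
    # Naive sentence split on '.', as in the question
    sentences = [s.strip() for s in abstract.split(".") if s.strip()]
    sentence_rows = [
        (abstract_id, "{:07d}".format(n), sentence)
        for n, sentence in enumerate(sentences, 1)
    ]
    class_row = (abstract_id, nsf_org)
    return sentence_rows, class_row

# Made-up file content in the assumed layout
sample_text = (
    "Title       : Example award\n"
    "NSF Org     : DMS\n"
    "File        : a9001755\n"
    "\n"
    "Abstract    :\n"
    "      First sentence of the abstract. Second\n"
    "      sentence of the abstract.\n"
)
sentence_rows, class_row = parse_award(sample_text)
for row in sentence_rows:
    print(row)
print(class_row)
```

Each tuple in `sentence_rows` corresponds to one row of the sentence table, and `class_row` to one row of NSFClasses, so both can be handed to whatever database or CSV writer is used. Splitting on "." will break on abbreviations and decimal numbers, so a proper sentence tokenizer may be needed for real abstracts.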
I have this code, which is meant to open a specified file, count every while loop in it, and finally output the total number of while loops in that file. I decided to convert the input file to a dictionary, and then write a for loop that adds 1 to WHILE_ every time the word while followed by a space is seen, before finally printing WHILE_ at the end.
However this did not seem to work, and I am at a loss as to why. Any help fixing this would be much appreciated.
This is the code I have at the moment:
WHILE_ = 0
INPUT_ = input("Enter file or directory: ")
OPEN_ = open(INPUT_)
READLINES_ = OPEN_.readlines()
STRING_ = (str(READLINES_))
STRIP_ = STRING_.strip()
input_str1 = STRIP_.lower()
dic = dict()
for w in input_str1.split():
    if w in dic.keys():
        dic[w] = dic[w]+1
    else:
        dic[w] = 1
DICT_ = (dic)
for LINE_ in DICT_:
    if ("while\\n',") in LINE_:
        WHILE_ += 1
    elif ('while\\n",') in LINE_:
        WHILE_ += 1
    elif ('while ') in LINE_:
        WHILE_ += 1
print ("while_loops {0:>12}".format((WHILE_)))
This is the input file I was working from:
'''A trivial test of metrics
Author: Angus McGurkinshaw
Date: May 7 2013
'''
def silly_function(blah):
    '''A silly docstring for a silly function'''
    def nested():
        pass
    print('Hello world', blah + 36 * 14)
    tot = 0 # This isn't a for statement
    for i in range(10):
        tot = tot + i
        if_im_done = false # Nor is this an if
    print(tot)
blah = 3
while blah > 0:
    silly_function(blah)
    blah -= 1
    while True:
        if blah < 1000:
            break
The output should be 2, but my code at the moment prints 0
This is an incredibly bizarre design. You're calling readlines to get a list of strings, then calling str on that list, which will join the whole thing up into one big string with the quoted repr of each line joined by commas and surrounded by square brackets, then splitting the result on spaces. I have no idea why you'd ever do such a thing.
Your bizarre variable names, extra useless lines of code like DICT_ = (dic), etc. only serve to obfuscate things further.
But I can explain why it doesn't work. Try printing out DICT_ after you do all that silliness, and you'll see that the only keys that include while are while and 'while. Since neither of these matches any of the patterns you're looking for, your count ends up as 0.
It's also worth noting that you only add 1 to WHILE_ even if there are multiple instances of the pattern, so your whole dict of counts is useless.
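The effect described above can be reproduced in a few lines. The three input lines are a cut-down stand-in for the sample file, and the variable names are made up for this sketch.

```python
# Stand-in for OPEN_.readlines() on a file with two while loops
lines = ["blah = 3\n", "while blah > 0:\n", "    while True:\n"]

# str() of a list gives the repr of every line: quotes, commas, and a
# literal backslash followed by n instead of a real newline
as_text = str(lines)
tokens = as_text.strip().lower().split()

# The only tokens that contain "while"
while_tokens = sorted(t for t in set(tokens) if "while" in t)

# The three patterns the question tests for
patterns = ["while\\n',", 'while\\n",', "while "]
hits = [t for t in while_tokens if any(p in t for p in patterns)]

print(while_tokens)
print(hits)
```

None of the patterns can match: splitting on whitespace removes every space, and the `\n'` residue ends up attached to the last word of each line, not to while.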
This will be a lot easier if you don't obfuscate your strings, try to recover them, and then try to match the incorrectly-recovered versions. Just do it directly.
While I'm at it, I'm also going to fix some other problems so that your code is readable, and simpler, and doesn't leak files, and so on. Here's a complete implementation of the logic you were trying to hack up by hand:
import collections
filename = input("Enter file: ")
counts = collections.Counter()
with open(filename) as f:
    for line in f:
        counts.update(line.strip().lower().split())
print('while_loops {0:>12}'.format(counts['while']))
When you run this on your sample input, you correctly get 2. And extending it to handle if and for is trivial and obvious.
However, note that there's a serious problem in your logic: Anything that looks like a keyword but is in the middle of a comment or string will still get picked up. Without writing some kind of code to strip out comments and strings, there's no way around that. Which means you're going to overcount if and for by 1. The obvious way of stripping—line.partition('#')[0] and similarly for quotes—won't work. First, it's perfectly valid to have a string before an if keyword, as in "foo" if x else "bar". Second, you can't handle multiline strings this way.
These problems, and others like them, are why you almost certainly want a real parser. If you're just trying to parse Python code, the ast module in the standard library is the obvious way to do this. If you want to write quick-and-dirty parsers for a variety of different languages, try pyparsing, which is very nice, and comes with some great examples.
Here's a simple example:
import ast
filename = input("Enter file: ")
with open(filename) as f:
    tree = ast.parse(f.read())
while_loops = sum(1 for node in ast.walk(tree) if isinstance(node, ast.While))
print('while_loops {0:>12}'.format(while_loops))
Or, more flexibly:
import ast
import collections
filename = input("Enter file: ")
with open(filename) as f:
    tree = ast.parse(f.read())
counts = collections.Counter(type(node).__name__ for node in ast.walk(tree))
print('while_loops {0:>12}'.format(counts['While']))
print('for_loops {0:>14}'.format(counts['For']))
print('if_statements {0:>10}'.format(counts['If']))
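For comparison, and not part of the answer above: the standard library's tokenize module sits between the word-splitting approach and a full ast parse. It does not build a tree, but it does report comments and string literals as their own token types, so keywords inside them are not counted. A rough sketch, with the helper name `count_keywords` and the sample source made up for illustration:

```python
import collections
import io
import keyword
import tokenize

def count_keywords(source):
    """Count Python keywords in source, ignoring comments and strings."""
    counts = collections.Counter()
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        # Comments are COMMENT tokens and string literals are STRING tokens,
        # so only genuine keywords show up as NAME tokens here
        if tok.type == tokenize.NAME and keyword.iskeyword(tok.string):
            counts[tok.string] += 1
    return counts

# Hypothetical source with keywords hidden in a comment and a string
sample = (
    "tot = 0  # This isn't a for statement\n"
    "for i in range(10):\n"
    "    tot = tot + i\n"
    "while tot > 0:\n"
    "    tot -= 1\n"
    "print('while in a string')\n"
)
counts = count_keywords(sample)
print('while_loops {0:>12}'.format(counts['while']))
print('for_loops {0:>14}'.format(counts['for']))
```

This counts keyword occurrences, not statements, so the for inside a list comprehension or the if inside a conditional expression is counted as well. When the distinction matters, the ast approach is the better fit.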