This is going to be long, but I don't know how else to explain this effectively.
So I have 2 files that I am reading in. The first one has a list of characters. The second file is a list of 3 characters and then its matching identifier character (separated by a tab).
With the second file I made a dictionary with the 3 characters as the keys and the single character as the corresponding value.
What I need to do is take 3 characters at a time from the first list and look each group up in the dictionary. If there is a match, I need to take the corresponding value and append it to a new list that I will print out. If the matched value is a '*' character, I need to stop comparing against the dictionary.
I'm having trouble with the comparison and with building the new list using append.
Here is part of the first input file:
Seq0
ATGGAAGCGAGGATGtGa
Here is part of the second:
AUU I
AUC I
AUA I
CUU L
GUU V
UGA *
Here is my code so far:
input = open("input.fasta", "r")
codons = open("codons.txt", "r")

counts = 1
amino_acids = {}
for lines in codons:
    lines = lines.strip()
    codon, acid = lines.split("\t")
    amino_acids[codon] = acid
    counts += 1

count = 1
for line in input:
    if count%2 == 0:
        line = line.upper()
        line = line.strip()
        line = line.replace(" ", "")
        line = line.replace("T", "U")
        import re
        if not re.match("^[AUCG]*$", line):
            print "Error!"
        if re.match("^[AUCG]*$", line):
            mrna = len(line)/3
            first = 0
            last = 3
            while mrna != 0:
                codon = line[first:last]
                first += 3
                last += 3
                mrna -= 1
                list = []
                if codon == amino_acids[codon]:
                    list.append(acid)
                    if acid == "*":
                        mrna = 0
for acid in list:
    print acid
So I want my output to look something like this:
M L I V *
But I'm not getting even close to this.
Please help!
The following is purely untested code. Check the indentation, syntax and logic, but it should be closer to what you want.

import re

codons = open("codons.txt", "r")
amino_acids = {}
for lines in codons:
    lines = lines.strip()
    codon, acid = lines.split("\t")
    amino_acids[codon] = acid

input = open("input.fasta", "r")
count = 0
list = []
for line in input:
    count += 1
    if count % 2 == 0:  # i.e. only care about even lines
        line = line.upper()
        line = line.strip()
        line = line.replace(" ", "")
        line = line.replace("T", "U")
        if not re.match("^[AUCG]*$", line):
            print "Error!"
        else:
            mrna = len(line)/3
            first = 0
            while mrna != 0:
                codon = line[first:first+3]
                first += 3
                mrna -= 1
                if codon in amino_acids:
                    acid = amino_acids[codon]  # look the acid up; don't reuse the stale loop variable
                    list.append(acid)
                    if acid == "*":
                        mrna = 0

for acid in list:
    print acid
In Python there's usually a way to avoid writing explicit loops with counters and such. There's an incredibly powerful list comprehension syntax that lets you construct lists in one line. To wit, here's an alternate way to write your second for loop:
import re

def codons_to_acids(amino_acids, sequence):
    sequence = sequence.upper().strip().replace(' ', '').replace('T', 'U')
    codons = re.findall(r'...', sequence)
    acids = [amino_acids.get(codon) for codon in codons if codon in amino_acids]
    if '*' in acids:
        acids = acids[:acids.index('*') + 1]
    return acids
The first line performs all of the string sanitization. Chaining together the different methods makes the code more readable to me. You may or may not like that. The second line uses re.findall in a tricky way to split the string every three characters. The third line is a list comprehension which looks up each codon in the amino_acids dict and creates a list of the resulting values.
There's no easy way to break out of a for loop inside a list comprehension, so the final if statement slices off any entries occurring after a *.
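As a quick illustration (with an inline string, just for demonstration), the findall trick on its own does this:

```python
import re

sequence = "AUGGAAGCGAGG"
codons = re.findall(r'...', sequence)  # each match is exactly three characters
print(codons)  # ['AUG', 'GAA', 'GCG', 'AGG']
```

Note that one or two leftover characters at the end of the string are silently dropped, since they can't fill a three-character match.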
You would call this function like so:
amino_acids = {
    'AUU': 'I', 'AUC': 'I', 'AUA': 'I', 'CUU': 'L', 'GUU': 'V', 'UGA': '*'
}
print codons_to_acids(amino_acids, 'ATGGAAGCGAGGATGtGaATT')
If you can solve the problem without regex, it's best not to use it.
with open('input.fasta', 'r') as f1:
    input = f1.read()

codons = list()
with open('codons.txt', 'r') as f2:
    codons = f2.readlines()

input = [x.replace('T', 'U') for x in input.upper() if x in 'ATCG']
chunks = [''.join(input[x:x+3]) for x in xrange(0, len(input), 3)]
codons = [c.replace('\n', '').upper() for c in codons if c != '\n']
my_dict = {q.split()[0]: q.split()[1] for q in codons}

result = list()
for ch in chunks:
    new_elem = my_dict.get(ch, None)  # get, not pop, so a codon that repeats still translates
    if new_elem is None:
        print 'Invalid key!'
    else:
        result.append(new_elem)
        if new_elem == '*':
            break
print result
I need to calculate the occurrences of a motif (including overlaps) in sequences; the motif is passed in the first line of standard input and the sequences in subsequent lines. A sequence name starts with >, and everything after the first whitespace is just a comment about the sequence that should be ignored. The input of the program looks like:
AT
>seq1 Comment......
AGGTATA
TGGCGCC
>seq2 Comment.....
GGCCGGCGC
The output should be:
seq1: 2
seq2: 0
I decided to save the first line as the motif, strip the comment from each sequence name, join the lines of each sequence into one string, and save sequence names (keys) and sequences (values) in a dictionary. I also wrote a motif_count function that I want to call on the dictionary values and then save the results in a separate dictionary for the final output. Can I do it this way, or is there a better approach?
#!/usr/bin/env python3
import sys

sequence = sys.stdin.readlines()
motif = sequence[0]
d = {}
temp_genename = None
temp_sequence = None

def motif_count(m, s):
    count = 0
    next_pos = -1
    while True:
        next_pos = s.find(m, next_pos + 1)
        if next_pos < 0:
            break
        count += 1
    return count

if sequence[1][0] != '>':
    print("ERROR")
    exit(1)

for line in sequence[1:]:
    if line[0] == '>':
        temp_genename = line.split(' ')[0].strip()
        temp_sequence = ""
    else:
        temp_sequence += line.strip()
    d[temp_genename] = temp_sequence

for value in d:
    motif_count(motif, value)
You can simplify your code by using dictionary and string expressions to pull out the key words you need for your processing. Assuming your sequence comments are consistent and similar to what you provided, you can split on the redundant 'This sequence is ' text, then filter for the uppercase letters, and finally compute the occurrences of your motif. This can be done as follows:
def motif_count(motif, key):
    d[key] = d[key].count(motif)

sequence = """AT
>seq1 This sequence is from bacterial genome
AGGTATA
TGGCGCC
>seq2 This sequence is rich is CG
GGCCGGCGC""".split('\n')

d = {}
# print error if format is wrong
if sequence[1][0] != '>':
    print("ERROR")
else:
    seq = "".join(sequence).split('>')[1:]
    func = lambda line: line.split(' This sequence is ')
    d = dict((func(line)[0], ''.join([c for c in func(line)[1] if c.isupper()]))
             for line in seq)
    motif = sequence[0]
    # replace each sequence with its count
    for key in d:
        motif_count(motif, key)
    # print output
    print(d)
Output:
{'seq1': 2, 'seq2': 0}
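One caveat worth noting: str.count only counts non-overlapping occurrences, so it is only safe for motifs like 'AT' that cannot overlap themselves. For self-overlapping motifs, the find-based loop from the question gives the answer the problem statement asks for:

```python
def motif_count(m, s):
    # count occurrences including overlaps by advancing one position at a time
    count, next_pos = 0, -1
    while True:
        next_pos = s.find(m, next_pos + 1)
        if next_pos < 0:
            return count
        count += 1

print("AAAA".count("AA"))         # 2: non-overlapping count
print(motif_count("AA", "AAAA"))  # 3: overlapping count
```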
I am trying to read a quote from a text file and find any duplicated words that appear next to each other. The following is the quote:
"He that would make his own liberty liberty secure,
must guard even his enemy from oppression;
for for if he violates this duty, he
he establishes a precedent that will reach to himself."
-- Thomas Paine
The output should be the following:
Found word: "Liberty" on line 1
Found word: "for" on line 3
Found word: "he" on line 4
I have written the code to read the text from the file, but I am having trouble with the code to identify the duplicates. I have tried enumerating each word in the file and checking if the word at one index is equal to the word at the following index. However, I am getting an index error because the loop continues outside of the index range. Here's what I've come up with so far:
import string

file_str = input("Enter file name: ")
input_file = open(file_str, 'r')
word_list = []
duplicates = []

for line in input_file:
    line_list = line.split()
    for word in line_list:
        if word != "--":
            word_list.append(word)

for idx, word in enumerate(word_list):
    print(idx, word)
    if word_list[idx] == word_list[idx + 1]:
        duplicates.append(word)
Any help with the current method I'm trying would be appreciated, or suggestions for another method.
When you record the word_list you are losing information about which line the word is on.
Perhaps better would be to determine duplicates as you read the lines.
line_number = 1
for line in input_file:
    line_list = line.split()
    previous_word = None
    for word in line_list:
        if word != "--":
            word_list.append(word)
            if word == previous_word:
                duplicates.append([word, line_number])
            previous_word = word
    line_number += 1
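A self-contained sketch of the same idea (with the quote inlined instead of read from a file, so it can be run directly):

```python
quote = """He that would make his own liberty liberty secure,
must guard even his enemy from oppression;
for for if he violates this duty, he
he establishes a precedent that will reach to himself."""

duplicates = []
for line_number, line in enumerate(quote.splitlines(), start=1):
    previous_word = None
    for word in line.split():
        if word != "--":
            if word == previous_word:
                duplicates.append((word, line_number))
            previous_word = word

print(duplicates)  # [('liberty', 1), ('for', 3)]
```

Note that the "he" pair spans a line break, so a per-line scan does not report it.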
This should do the trick, OP. In the for loop over the word list, it now only goes up to the second-to-last element. This won't keep track of the line numbers though; I would use Phillip Martin's solution for that.
import string

file_str = input("Enter file name: ")
input_file = open(file_str, 'r')
word_list = []
duplicates = []

for line in input_file:
    line_list = line.split()
    for word in line_list:
        if word != "--":
            word_list.append(word)

# Here is the change I made:      >    <
for idx, word in enumerate(word_list[:-1]):
    print(idx, word)
    if word_list[idx] == word_list[idx + 1]:
        duplicates.append(word)

print(duplicates)
Here's another approach.
from itertools import tee, izip
from collections import defaultdict

dups = defaultdict(set)

with open('file.txt') as f:
    for no, line in enumerate(f, 1):
        it1, it2 = tee(line.split())
        next(it2, None)
        for word, follower in izip(it1, it2):
            if word != '--' and word == follower:
                dups[no].add(word)
which yields
>>> dups
defaultdict(<type 'set'>, {1: set(['liberty']), 3: set(['for'])})
which is a dictionary which holds a set of pair-duplicates for each line, e.g.
>>> dups[3]
set(['for'])
(I don't know why you expect "he" to be found on line four; it is only doubled across the line break between lines three and four, never within a single line of your sample file.)
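For Python 3 readers: itertools.izip is gone there, and the builtin zip does the same job. The same pairwise scan, with inline lines for illustration:

```python
from itertools import tee
from collections import defaultdict

lines = ["liberty liberty secure,", "guard even his enemy", "for for if he"]
dups = defaultdict(set)
for no, line in enumerate(lines, 1):
    it1, it2 = tee(line.split())
    next(it2, None)  # shift the second iterator forward by one word
    for word, follower in zip(it1, it2):  # izip -> builtin zip in Python 3
        if word != '--' and word == follower:
            dups[no].add(word)

print(dict(dups))  # {1: {'liberty'}, 3: {'for'}}
```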
I'm trying to find the dinucleotide counts and frequencies from a sequence in a text file, but my code is only outputting single nucleotide counts.
e = "ecoli.txt"
ecnt = {}
with open(e) as seq:
    for line in seq:
        for word in line.split():
            for i in range(len(seqr)):
                dinuc = (seqr[i] + seqr[i:i+2])
            for dinuc in seqr:
                if dinuc in ecnt:
                    ecnt[dinuc] += 1
                else:
                    ecnt[dinuc] = 1
for x, y in ecnt.items():
    print(x, y)
Sample input: "AAATTTCGTCGTTGCCC"
Sample output:
AA:2
TT:3
TC:2
CG:2
GT:2
GC:1
CC:2
Right now, I'm only getting single nucleotides in my output:
C 83550600
A 60342100
T 88192300
G 92834000
For the nucleotides that repeat, i.e. "AAA", the count has to include all possible consecutive 'AA' pairs, so the output should be 2 rather than 1. It doesn't matter what order the dinucleotides are listed in; I just need all combinations, and for the code to return the correct count for the repeated nucleotides. I was asking my TA and she said that my only problem was getting my 'for' loop to add the dinucleotides to my dictionary, and I think my range may or may not be wrong. The file is a really big one, so the sequence is split up into lines.
Thank you so much in advance!
I took a look at your code and found several things you might want to revisit.
For testing my solution, since I did not have ecoli.txt, I generated one of my own with random nucleotides with the following function:
import random

def write_random_sequence():
    out_file = open("ecoli.txt", "w")
    num_nts = 500
    nts_per_line = 80
    nts = []
    for i in range(num_nts):
        nt = random.choice(["A", "T", "C", "G"])
        nts.append(nt)
    lines = [nts[i:i+nts_per_line] for i in range(0, len(nts), nts_per_line)]
    for line in lines:
        out_file.write("".join(line) + "\n")
    out_file.close()

write_random_sequence()
Notice that this file has a single sequence of 500 nucleotides separated into lines of 80 nucleotides each. In order to count dinucleotides where you have the first nucleotide at the end of one line and the second nucleotide at the start of the next line, we need to merge all of these separate lines into a single string, without spaces. Let's do that first:
seq = ""
with open("ecoli.txt", "r") as seq_data:
    for line in seq_data:
        seq += line.strip()
Try printing out "seq" and notice that it should be one giant string containing all of the nucleotides. Next, we need to find the dinucleotides in the sequence string. We can do this using slicing, which I see you tried. So for each position in the string, we look at both the current nucleotide and the one after it.
for i in range(len(seq) - 1):  # note the -1
    dinuc = seq[i:i+2]
We can then do the counting of the nucleotides and storage of them in a dictionary "ecnt" very much like you had. The final code looks like this:
ecnt = {}
seq = ""
with open("ecoli.txt", "r") as seq_data:
    for line in seq_data:
        seq += line.strip()

for i in range(len(seq) - 1):
    dinuc = seq[i:i+2]
    if dinuc in ecnt:
        ecnt[dinuc] += 1
    else:
        ecnt[dinuc] = 1

print(ecnt)
A perfect opportunity to use a defaultdict:
from collections import defaultdict

file_name = "ecoli.txt"
dinucleotide_counts = defaultdict(int)
sequence = ""

with open(file_name) as file:
    for line in file:
        sequence += line.strip()

for i in range(len(sequence) - 1):
    dinucleotide_counts[sequence[i:i + 2]] += 1

for key, value in sorted(dinucleotide_counts.items()):
    print(key, value)
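As a quick sanity check against the sample input from the question (inline string rather than a file):

```python
from collections import defaultdict

seq = "AAATTTCGTCGTTGCCC"
counts = defaultdict(int)
for i in range(len(seq) - 1):   # a two-character window at each position
    counts[seq[i:i + 2]] += 1

print(dict(counts))  # 'AA' -> 2, 'TT' -> 3, 'CC' -> 2, ...
```

"AAA" contributes two overlapping 'AA' pairs, which is exactly the behavior the question asks for.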
I have a file.txt with thousands of words, and I need to create a new file based on certain parameters, and then sort them a certain way.
Assuming the user imports the proper libraries when they test, what is wrong with my code? (There are 3 separate functions)
For the first, I must create a file with words containing certain letters, and sort them lexicographically, then put them into a new file list.txt.
def getSortedContain(s, ifile, ofile):
    toWrite = ""
    toWrites = ""
    for line in ifile:
        word = line[:-1]
        if s in word:
            toWrite += word + "\n"
    newList = []
    newList.append(toWrite)
    newList.sort()
    for h in newList:
        toWrites += h
    ofile.write(toWrites[:-1])
The second is similar, but must be sorted reverse lexicographically, if the string inputted is NOT in the word.
def getReverseSortedNotContain(s, ifile, ofile):
    toWrite = ""
    toWrites = ""
    for line in ifile:
        word = line[:-1]
        if s not in word:
            toWrite += word + "\n"
    newList = []
    newList.append(toWrite)
    newList.sort()
    newList.reverse()
    for h in newList:
        toWrites += h
    ofile.write(toWrites[:-1])
For the third, I must select words containing a certain number of characters, and sort them lexicographically by the last character in each word.
def getRhymeSortedCount(n, ifile, ofile):
    toWrite = ""
    for line in ifile:
        word = line[:-1]  # gets rid of \n
        if len(word) == n:
            toWrite += word + "\n"
    reversetoWrite = toWrite[::-1]
    newList = []
    newList.append(toWrite)
    newList.sort()
    newList.reverse()
    for h in newList:
        toWrites += h
    reversetoWrite = toWrites[::-1]
    ofile.write(reversetoWrites[:-1])
Could someone please point me in the right direction for these? Right now they are not sorting as they're supposed to.
There is a lot of stuff that is unclear here, so I'll try my best to clean it up.
You're concatenating all the words into one big string and then appending that single string to a list, so sorting your one-element list obviously does nothing. Instead, put each word into the list individually and then sort that list.
E.g., for your first example, do the following:

def getSortedContain(s, ifile, ofile):
    words = [line.strip() for line in ifile if s in line]  # one word per list element
    words.sort()
    ofile.write("\n".join(words))
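The other two functions follow the same pattern; here is a sketch under the same assumptions (function names taken from the question; ifile can be any iterable of lines, so io.StringIO works for quick testing):

```python
import io

def getReverseSortedNotContain(s, ifile, ofile):
    # keep words NOT containing s, sorted reverse-lexicographically
    words = sorted((line.strip() for line in ifile if s not in line), reverse=True)
    ofile.write("\n".join(words))

def getRhymeSortedCount(n, ifile, ofile):
    # keep words of length n, sorted by their last character ("rhyme" order)
    words = sorted((line.strip() for line in ifile if len(line.strip()) == n),
                   key=lambda w: w[-1])
    ofile.write("\n".join(words))

out = io.StringIO()
getReverseSortedNotContain("a", io.StringIO("apple\nberry\ncherry\nfig\n"), out)
print(out.getvalue())  # fig, cherry, berry (one per line)
```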
I have got this Python program which reads through a wordlist file and, using the endswith() method, checks for the suffix endings given in another file.
The suffixes to check for are saved in the list suffixList.
The counts are being kept in suffixCount.
The following is my code:
fd = open(filename, 'r')
print 'Suffixes: '
x = len(suffixList)
for line in fd:
    for wordp in range(0, x):
        if word.endswith(suffixList[wordp]):
            suffixCount[wordp] = suffixCount[wordp] + 1
for output in range(0, x):
    print "%-6s %10i" % (prefixList[output], prefixCount[output])
fd.close()
The output is this :
Suffixes:
able 0
ible 0
ation 0
the program never reaches the body of this condition:
if word.endswith(suffixList[wordp]):
You need to strip the newline:
word = ln.rstrip().lower()
The words are coming from a file so each line ends with a newline character. You are then trying to use endswith which always fails as none of your suffixes end with a newline.
I would also change the function to return the values you want:
def store_roots(start, end):
    with open("rootsPrefixesSuffixes.txt") as fs:
        lst = [line.split()[0] for line in map(str.strip, fs)
               if '#' not in line and line]
    return lst, dict.fromkeys(lst[start:end], 0)

lst, sfx_dict = store_roots(22, 30)  # List, SuffixList
Then slice from the end and see if the substring is in the dict:
with open('longWordList.txt') as fd:
    print('Suffixes: ')
    # lengths of the longest and shortest suffixes (max/min return the strings, so take len)
    mx = len(max(sfx_dict, key=len))
    mn = len(min(sfx_dict, key=len))
    for ln in map(str.rstrip, fd):
        for i in range(mx, mn - 1, -1):  # try the longest slice first
            suf = ln[-i:]
            if suf in sfx_dict:
                sfx_dict[suf] += 1

for k, v in sfx_dict.items():  # iterate key/value pairs, not just keys
    print("Suffix = {} Count = {}".format(k, v))
Slicing the end of the string should be faster than checking every suffix string individually, especially if you have numerous suffixes of the same length. It does at most mx - mn + 1 iterations per word; if you had 20 four-character suffixes, a single slice and dict lookup would cover all of them at once, since only one substring of a given length can match at a time.
You could use a Counter to count the occurrences of suffix:
from collections import Counter

with open("rootsPrefixesSuffixes.txt") as fp:
    List = [line.strip() for line in fp if line and '#' not in line]
suffixes = List[22:30]  # ?

with open('longWordList.txt') as fp:
    c = Counter(s for word in fp for s in suffixes
                if word.rstrip().lower().endswith(s))
print(c)
Note: add .split()[0] if there can be more than one word per line and you want to ignore the rest; otherwise it is unnecessary.
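For illustration, here is the same Counter pattern with an inline word list (the words and suffixes are made up for the example):

```python
from collections import Counter

suffixes = ["able", "ible", "ation"]
words = ["comfortable", "visible", "station", "creation", "table"]

# count how many words end with each suffix
c = Counter(s for w in words for s in suffixes if w.endswith(s))
print(c)  # Counter({'able': 2, 'ation': 2, 'ible': 1})
```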