Dinucleotide count and frequency - Python

I'm trying to find the dinucleotide counts and frequencies from a sequence in a text file, but my code is only outputting single-nucleotide counts.
e = "ecoli.txt"
ecnt = {}
with open(e) as seq:
for line in seq:
for word in line.split():
for i in range(len(seqr)):
dinuc = (seqr[i] + seqr[i:i+2])
for dinuc in seqr:
if dinuc in ecnt:
ecnt[dinuc] += 1
else:
ecnt[dinuc] = 1
for x,y in ecnt.items():
print(x, y)
Sample input: "AAATTTCGTCGTTGCCC"
Sample output:
AA:2
TT:3
TC:2
CG:2
GT:2
GC:1
CC:2
Right now, I'm only getting single nucleotides in my output:
C 83550600
A 60342100
T 88192300
G 92834000
For nucleotides that repeat, e.g. "AAA", the count has to include all overlapping occurrences of consecutive 'AA', so the output should be 2 rather than 1. It doesn't matter what order the dinucleotides are listed in; I just need all combinations, and for the code to return the correct count for the repeated nucleotides. I asked my TA and she said that my only problem was getting my 'for' loop to add the dinucleotides to my dictionary, and I think my range may or may not be wrong. The file is a really big one, so the sequence is split up across lines.
Thank you so much in advance!!!

I took a look at your code and found several things worth pointing out.
For testing my solution, since I did not have ecoli.txt, I generated a file of my own with random nucleotides using the following function:
import random

def write_random_sequence():
    out_file = open("ecoli.txt", "w")
    num_nts = 500
    nts_per_line = 80
    nts = []
    for i in range(num_nts):
        nt = random.choice(["A", "T", "C", "G"])
        nts.append(nt)
    lines = [nts[i:i+nts_per_line] for i in range(0, len(nts), nts_per_line)]
    for line in lines:
        out_file.write("".join(line) + "\n")
    out_file.close()

write_random_sequence()
Notice that this file has a single sequence of 500 nucleotides separated into lines of 80 nucleotides each. In order to count dinucleotides where you have the first nucleotide at the end of one line and the second nucleotide at the start of the next line, we need to merge all of these separate lines into a single string, without spaces. Let's do that first:
seq = ""
with open("ecoli.txt", "r") as seq_data:
for line in seq_data:
seq += line.strip()
Try printing out "seq" and notice that it should be one giant string containing all of the nucleotides. Next, we need to find the dinucleotides in the sequence string. We can do this using slicing, which I see you tried. So for each position in the string, we look at both the current nucleotide and the one after it.
for i in range(len(seq) - 1):  # note the -1
    dinuc = seq[i:i+2]
We can then count the dinucleotides and store them in a dictionary "ecnt", very much like you had. The final code looks like this:
ecnt = {}
seq = ""
with open("ecoli.txt", "r") as seq_data:
    for line in seq_data:
        seq += line.strip()
for i in range(len(seq) - 1):
    dinuc = seq[i:i+2]
    if dinuc in ecnt:
        ecnt[dinuc] += 1
    else:
        ecnt[dinuc] = 1
print(ecnt)
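As a quick sanity check (a hypothetical snippet using the sample sequence from the question instead of ecoli.txt), the same counting loop reproduces the expected overlapping counts:

seq = "AAATTTCGTCGTTGCCC"  # sample input from the question
ecnt = {}
for i in range(len(seq) - 1):
    dinuc = seq[i:i+2]
    ecnt[dinuc] = ecnt.get(dinuc, 0) + 1
print(ecnt)  # AA:2, TT:3, TC:2, CG:2, GT:2, GC:1, CC:2 (plus AT:1 and TG:1)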

A perfect opportunity to use a defaultdict:
from collections import defaultdict

file_name = "ecoli.txt"
dinucleotide_counts = defaultdict(int)
sequence = ""
with open(file_name) as file:
    for line in file:
        sequence += line.strip()
for i in range(len(sequence) - 1):
    dinucleotide_counts[sequence[i:i + 2]] += 1
for key, value in sorted(dinucleotide_counts.items()):
    print(key, value)
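For comparison, here's a minimal sketch (assuming the same ecoli.txt layout) that builds the pairs with zip and counts them with collections.Counter:

from collections import Counter

with open("ecoli.txt") as file:
    sequence = "".join(line.strip() for line in file)

# Pair each nucleotide with its successor, then count the pairs.
counts = Counter(a + b for a, b in zip(sequence, sequence[1:]))
for key, value in sorted(counts.items()):
    print(key, value)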

Related

How can I print lines of a file specified by a list of numbers in Python?

I need to open a dictionary file that has a word on each line, pull specific lines (the lines are specified using a list of numbers), and at the end print a complete sentence on one line with a space between the words:
N = ['19','85','45','14']
file = open("DICTIONARY", "r")
my_sentence = #?????????
print my_sentence
If your DICTIONARY is not too big (i.e. can fit your memory):
N = [19, 85, 45, 14]
with open("DICTIONARY", "r") as f:
    words = f.readlines()
my_sentence = " ".join([words[i].strip() for i in N])
EDIT: A small clarification: the original post didn't use a space to join the words, so I've changed the code to include it. You can also use ",".join(...) if you need to separate the words by a comma, or any other separator you might need. Also, keep in mind that this code uses zero-based line indices, so the first line of your DICTIONARY would be 0, the second would be 1, etc.
UPDATE: If your dictionary is too big for your memory, or you just want to consume as little memory as possible (if that's the case, why would you go for Python in the first place? ;)), you can extract only the words you're interested in:
N = [19, 85, 45, 14]
words = {}
word_indexes = set(N)
counter = 0
with open("DICTIONARY", "r") as f:
    for line in f:
        if counter in word_indexes:
            words[counter] = line.strip()
        counter += 1
my_sentence = " ".join([words[i] for i in N])
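A slightly tidier variant of the same idea (a sketch, again assuming zero-based line indices) uses enumerate and stops reading once every wanted line has been seen:

N = [19, 85, 45, 14]
wanted = set(N)
words = {}
with open("DICTIONARY", "r") as f:
    for i, line in enumerate(f):
        if i in wanted:
            words[i] = line.strip()
            if len(words) == len(wanted):  # all wanted lines found, stop early
                break
my_sentence = " ".join(words[i] for i in N)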
You can use linecache.getline to fetch the specific line numbers you want (note that linecache uses one-based line numbers, so the first line of the file is line 1):
import linecache

sentence = []
for line_number in N:
    word = linecache.getline('DICTIONARY', line_number)
    sentence.append(word.strip('\n'))
sentence = " ".join(sentence)
Here's a simple one with a more basic approach:
n = ['2', '4', '7', '11']
file = open("DICTIONARY")
counter = 1  # use 1 if you count lines in DICTIONARY from 1, else use 0
output = ""
for line in file:
    line = line.rstrip()  # rstrip() deletes the \n character; without it,
                          # every word would print starting on a new line
    if str(counter) in n:
        output += line + " "
    counter += 1
print output[:-1]  # slicing deletes the trailing white space
                   # after the last word in the string (optional)

Count suffixes appearing in a word file

I have got this Python program which reads through a word-list file and checks for the suffix endings, which are given in another file, using the endswith() method.
The suffixes to check for are saved in the list suffixList[].
The count is kept in suffixCount[].
The following is my code:
fd = open(filename, 'r')
print 'Suffixes: '
x = len(suffixList)
for line in fd:
    for wordp in range(0, x):
        if word.endswith(suffixList[wordp]):
            suffixCount[wordp] = suffixCount[wordp] + 1
for output in range(0, x):
    print "%-6s %10i" % (prefixList[output], prefixCount[output])
fd.close()
The output is this :
Suffixes:
able 0
ible 0
ation 0
The program never gets inside this condition:
if word.endswith(suffixList[wordp]):
You need to strip the newline:
word = ln.rstrip().lower()
The words are coming from a file so each line ends with a newline character. You are then trying to use endswith which always fails as none of your suffixes end with a newline.
I would also change the function to return the values you want:
def store_roots(start, end):
    with open("rootsPrefixesSuffixes.txt") as fs:
        lst = [line.split()[0] for line in map(str.strip, fs)
               if '#' not in line and line]
    return lst, dict.fromkeys(lst[start:end], 0)

lst, sfx_dict = store_roots(22, 30)  # list, suffix dict
Then slice from the end and see if the substring is in the dict:
with open('longWordList.txt') as fd:
    print('Suffixes: ')
    mx = len(max(sfx_dict, key=len))  # length of the longest suffix
    mn = len(min(sfx_dict, key=len))  # length of the shortest suffix
    for ln in map(str.rstrip, fd):
        suf = ln[-mx:]
        for i in range(mx - 1, mn - 2, -1):
            if suf in sfx_dict:
                sfx_dict[suf] += 1
            suf = suf[-i:]
for k, v in sfx_dict.items():
    print("Suffix = {} Count = {}".format(k, v))
Slicing the end of the string incrementally should be faster than checking every suffix, especially if you have numerous suffixes of the same length. It does at most mx - mn + 1 dictionary lookups per word: only one substring of a given length can match at a time, so each length is handled with a single slice and lookup. If you had twenty four-character suffixes, you would still only need one dict lookup to cover all of them.
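For instance, a hypothetical run of the same idea on a small in-memory suffix dictionary (instead of rootsPrefixesSuffixes.txt) behaves like this:

sfx_dict = dict.fromkeys(["able", "ible", "ation"], 0)  # hypothetical suffixes
mx = len(max(sfx_dict, key=len))  # 5
mn = len(min(sfx_dict, key=len))  # 4
for ln in ["stable", "station", "nothing"]:  # stand-in for the word-list file
    for length in range(mx, mn - 1, -1):
        if ln[-length:] in sfx_dict:
            sfx_dict[ln[-length:]] += 1
print(sfx_dict)  # {'able': 1, 'ible': 0, 'ation': 1}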
You could use a Counter to count the occurrences of each suffix:
from collections import Counter

with open("rootsPrefixesSuffixes.txt") as fp:
    List = [line.strip() for line in fp if line and '#' not in line]
suffixes = List[22:30]  # ?

with open('longWordList.txt') as fp:
    c = Counter(s for word in fp for s in suffixes if word.rstrip().lower().endswith(s))
print(c)
Note: add .split()[0] if there is more than one word per line and you want to ignore the extras; otherwise this is unnecessary.
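A quick check of the same Counter idea on in-memory data (hypothetical words and suffixes):

from collections import Counter

suffixes = ["able", "ible", "ation"]  # hypothetical suffix list
words = ["stable", "station", "nothing"]
c = Counter(s for word in words for s in suffixes if word.endswith(s))
print(c)  # Counter({'able': 1, 'ation': 1})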

10 most frequent words in a string - Python

I need to display the 10 most frequent words in a text file, from the most frequent to the least, as well as the number of times each has been used. I can't use the dictionary or Counter function. So far I have this:
import urllib

cnt = 0
i = 0
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt")
uniques = []
for line in txtFile:
    words = line.split()
    for word in words:
        if word not in uniques:
            uniques.append(word)
for word in words:
    while i < len(uniques):
        i += 1
        if word in uniques:
            cnt += 1
print cnt
Now I think I should look for every word in the array 'uniques' and see how many times it is repeated in this file, and then add that to another array that counts the instances of each word. But this is where I am stuck. I don't know how to proceed.
Any help would be appreciated. Thank you
This problem can be done easily using Python's collections module; below is the solution.
from collections import Counter

data_set = "Welcome to the world of Geeks " \
           "This portal has been created to provide well written well" \
           "thought and well explained solutions for selected questions " \
           "If you like Geeks for Geeks and would like to contribute " \
           "here is your chance You can write article and mail your article " \
           " to contribute at geeksforgeeks org See your article appearing on " \
           "the Geeks for Geeks main page and help thousands of other Geeks. "

# split() returns a list of all the words in the string
split_it = data_set.split()

# Pass the split_it list to an instance of the Counter class.
Counters_found = Counter(split_it)
#print(Counters)

# most_common() produces the k most frequently encountered
# input values and their respective counts.
most_occur = Counters_found.most_common(4)
print(most_occur)
You're on the right track. Note that this algorithm is quite slow because for each unique word, it iterates over all of the words. A much faster approach without hashing would involve building a trie.
# The following assumes that we already have alice30.txt on disk.
# Start by splitting the file into lowercase words.
words = open('alice30.txt').read().lower().split()

# Get the set of unique words.
uniques = []
for word in words:
    if word not in uniques:
        uniques.append(word)

# Make a list of (count, unique) tuples.
counts = []
for unique in uniques:
    count = 0               # Initialize the count to zero.
    for word in words:      # Iterate over the words.
        if word == unique:  # Is this word equal to the current unique?
            count += 1      # If so, increment the count.
    counts.append((count, unique))

counts.sort()     # Sorting the list puts the lowest counts first.
counts.reverse()  # Reverse it, putting the highest counts first.

# Print the ten words with the highest counts.
for i in range(min(10, len(counts))):
    count, word = counts[i]
    print('%s %d' % (word, count))
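If you're curious, here is a minimal sketch of the trie idea mentioned above (my own illustration, not from the original answer); each node is a plain dict, and the '#' key holds the count for a completed word:

def count_with_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})  # descend, creating nodes as needed
        node['#'] = node.get('#', 0) + 1    # count one completed word

    # Walk the trie, collecting (count, word) pairs.
    pairs = []
    def walk(node, prefix):
        for ch, child in node.items():
            if ch == '#':
                pairs.append((child, prefix))
            else:
                walk(child, prefix + ch)
    walk(root, '')
    return sorted(pairs, reverse=True)[:10]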
from string import punctuation  # you will need it to strip the punctuation
import urllib

txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt")
counter = {}
for line in txtFile:
    words = line.split()
    for word in words:
        k = word.strip(punctuation).lower()  # so "the"/"The" or "you"/"You" are counted only once
        # you still have words like I've, you're, Alice's;
        # you could change re to are, ve to have, etc...
        if "'" in k:
            ks = k.split("'")
        else:
            ks = [k, ]
        # now the tally
        for k in ks:
            counter[k] = counter.get(k, 0) + 1
# and sorting the counter by the value which holds the tally
for word in sorted(counter, key=lambda k: counter[k], reverse=True)[:10]:
    print word, "\t", counter[word]
import urllib
import operator

txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt").readlines()
txtFile = " ".join(txtFile)  # this, with .readlines(), replaces newlines with spaces
txtFile = "".join(char for char in txtFile if char.isalnum() or char.isspace())  # removes everything that's not alphanumeric or a space

word_counter = {}
for word in txtFile.split(" "):  # split on every space
    if len(word) > 0 and word != '\r\n':
        if word not in word_counter:  # if 'word' is not in word_counter, add it and set its value to 1
            word_counter[word] = 1
        else:
            word_counter[word] += 1  # if 'word' is already in word_counter, increment it by 1

# sort the dict by value, from top to bottom, and take the top 10 items
for i, word in enumerate(sorted(word_counter, key=word_counter.get, reverse=True)[:10]):
    print "%s: %s - %s" % (i + 1, word, word_counter[word])
output:
1: the - 1432
2: and - 734
3: to - 703
4: a - 579
5: of - 501
6: she - 466
7: it - 440
8: said - 434
9: I - 371
10: in - 338
This method ensures that only alphanumeric characters and spaces end up in the counter. It doesn't matter that much, though.
Personally I'd make my own implementation of collections.Counter. I assume you know how that object works, but if not I'll summarize:
text = "some words that are mostly different but are not all different not at all"
words = text.split()
resulting_count = collections.Counter(words)
# {'all': 2,
# 'are': 2,
# 'at': 1,
# 'but': 1,
# 'different': 2,
# 'mostly': 1,
# 'not': 2,
# 'some': 1,
# 'that': 1,
# 'words': 1}
We can certainly sort that based on frequency by using the key keyword argument of sorted, and return the first 10 items in that list. However that doesn't much help you because you don't have Counter implemented. I'll leave THAT part as an exercise for you, and show you how you might implement Counter as a function rather than an object.
def counter(iterable):
    d = {}
    for element in iterable:
        if element in d:
            d[element] += 1
        else:
            d[element] = 1
    return d
Not difficult, actually. Go through each element of an iterable. If that element is NOT in d, add it to d with a value of 1. If it IS in d, increment that value. It's more easily expressed by:
def counter(iterable):
    d = {}
    for element in iterable:
        d[element] = d.get(element, 0) + 1
    return d
Note that in your use case, you probably want to strip out the punctuation and possibly casefold the whole thing (so that someword gets counted the same as Someword rather than as two separate words). I'll leave that to you as well, but I will point out that str.strip takes an argument specifying which characters to strip, and string.punctuation contains all the punctuation you're likely to need.
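Putting the pieces together, here is a sketch that reuses the counter function above along with the sorted call described earlier:

from string import punctuation

def top_ten(text):
    # Normalize case and strip surrounding punctuation before counting.
    words = [w.strip(punctuation).lower() for w in text.split()]
    d = counter(w for w in words if w)  # skip tokens that were pure punctuation
    return sorted(d, key=d.get, reverse=True)[:10]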
You can also do it with pandas DataFrames and get the result in a convenient tabular form, ordered as "word: its frequency":
import pandas as pn

def count_words(words_list):
    words_df = pn.DataFrame(words_list)
    words_df.columns = ["word"]
    words_df_unique = pn.DataFrame(pn.unique(words_list))
    words_df_unique.columns = ["unique"]
    words_df_unique["count"] = 0
    i = 0
    for word in pn.Series.tolist(words_df_unique.unique):
        words_df_unique.iloc[i, 1] = len(words_df.word[words_df.word == word])
        i += 1
    res = words_df_unique.sort_values('count', ascending=False)
    return res
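Called like this (a hypothetical example):

res = count_words(["the", "cat", "sat", "on", "the", "mat"])
print(res)  # 'the' comes first with a count of 2, the rest have a count of 1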
To do the same operation on a pandas data frame, you may use Counter from collections as follows:
from collections import Counter

cnt = Counter()
for text in df['text']:
    for word in text.split():
        cnt[word] += 1

# Find the 10 most common words in the Pandas dataframe
cnt.most_common(10)
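The same thing more compactly (a sketch, assuming df has a 'text' column of strings):

from collections import Counter
import pandas as pd

df = pd.DataFrame({'text': ["a b a", "b c"]})  # hypothetical frame
cnt = Counter(word for text in df['text'] for word in text.split())
print(cnt.most_common(10))  # [('a', 2), ('b', 2), ('c', 1)]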

Python function that is supposed to take a certain file and print out the occurrences of a certain part of the file

For example, if I take this file: http://vlm1.uta.edu/~athitsos/courses/cse1310_summer2013/assignments/assignment7/albums.txt
I need the function to count each band and the number of times it is listed in the file, and print that on screen in descending order.
It should be in this format:
band1: number1
band2: number2
band3: number3
this is what I have so far:
def read_albums(filename):
    counter = 0
    work_list = []
    my_file = open(filename, 'r')
    for line in my_file:
        my_list = line.split()
        work_list = line.split()
    for i in range(0, len(my_list)):
        item = my_list[0]
        counter = 1
        j = i + 1
        for j in range(j, len(my_list)):
            if j > len(my_list):
                j = len(my_list)
            if item == my_list[0]:
                counter = counter + 1
                work_list[j] = None
            else:
                continue
        if work_list[0] != None:
            print(work_list[0], counter)
Any tips regarding what I am doing wrong would be very helpful; I just can't seem to get it.
from collections import defaultdict

d = defaultdict(int)
with open("some_file.txt") as f:
    for line in f:
        artist, album = line.split("-")
        d[artist] += 1
for k, v in d.items():
    print "%s:%s" % (k, v)
Something like this would be the Pythonic way to go:
from collections import Counter

with open('albums.txt') as f:
    print Counter(line.split(' - ')[0] for line in f)
I recommend you take a look at this talk.
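To get the band1: number1 format from the question, a small extension of the same idea (a sketch):

from collections import Counter

with open('albums.txt') as f:
    counts = Counter(line.split(' - ')[0] for line in f)

# Print in descending order of count, in the "band: number" format asked for.
for band, number in counts.most_common():
    print "%s: %s" % (band, number)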
You already have a working answer so I will just say where you went wrong.
my_list = line.split()
work_list = line.split()
They are exactly the same so I'll just stick with work_list.
work_list = line.split()
This splits the text at each space, so "Pink Floyd - Album" will become ["Pink", "Floyd", "-", "Album"]. It also just sets the variable work_list to the latest line you split. What you want is to put all the split lines in a list:
work_list.append(line.split("-")[0])
This splits the line properly and only returns the first element, which is the band name. This is then appended to the list work_list, which you have properly initialised as empty at the beginning.
Once you have the bands in a list, you can use any method to count all the occurrences. Counter is brilliant for that. Your method had a lot of logic flaws, but I think what you were going for was (in pseudocode):
for each item in the array (item)
    go through all the remaining items (new_item)
        if item == new_item
            increase counter
This doesn't count the occurrences of each item just once. For example, every time it comes across a band, it will count all of that band's duplicates from that point forward again. What you want instead is a set, which is like a list but with no duplicate entries.
work_set = set(work_list)
for band in work_set:
    counter = 0
    for i in range(len(work_list)):
        if work_list[i] == band:
            counter += 1
    print(band, counter)
If your programs don't behave as expected, you can print your variables to see whether they are assigned what you expect them to.

Python: Appending to a list from a dictionary

This is going to be long but I don't know how else to effectively explain this.
So I have 2 files that I am reading in. The first one has a list of characters. The second file is a list of 3-character groups, each followed by its matching identifier character (separated by a tab).
With the second file I made a dictionary with the 3 characters as the keys and the single character as the corresponding value.
What I need to do is take 3 characters at a time from the first list and look them up in the dictionary. If there is a match, I need to take the corresponding value and append it to a new list that I will print out. If the match is a '*' character, I need to stop comparing the list to the dictionary.
I'm having trouble with the comparison and then with building the new list using the append function.
Here is part of the first input file:
Seq0
ATGGAAGCGAGGATGtGa
Here is part the second:
AUU I
AUC I
AUA I
CUU L
GUU V
UGA *
Here is my code so far:
input = open("input.fasta", "r")
codons = open("codons.txt", "r")
counts = 1
amino_acids = {}
for lines in codons:
lines = lines.strip()
codon, acid = lines.split("\t")
amino_acids[codon] = acid
counts += 1
count = 1
for line in input:
if count%2 == 0:
line = line.upper()
line = line.strip()
line = line.replace(" ", "")
line = line.replace("T", "U")
import re
if not re.match("^[AUCG]*$", line):
print "Error!"
if re.match("^[AUCG]*$", line):
mrna = len(line)/3
first = 0
last = 3
while mrna != 0:
codon = line[first:last]
first += 3
last += 3
mrna -= 1
list = []
if codon == amino_acids[codon]:
list.append(acid)
if acid == "*":
mrna = 0
for acid in list:
print acid
So I want my output to look something like this:
M L I V *
But I'm not getting even close to this.
Please help!
The following is purely untested code. Check the indentation, syntax and logic, but it should be closer to what you want.
import re

codons = open("codons.txt", "r")
amino_acids = {}
for lines in codons:
    lines = lines.strip()
    codon, acid = lines.split("\t")
    amino_acids[codon] = acid

input = open("input.fasta", "r")
count = 0
list = []
for line in input:
    count += 1
    if count % 2 == 0:  # i.e. only care about even lines
        line = line.upper()
        line = line.strip()
        line = line.replace(" ", "")
        line = line.replace("T", "U")
        if not re.match("^[AUCG]*$", line):
            print "Error!"
        else:
            mrna = len(line)/3
            first = 0
            while mrna != 0:
                codon = line[first:first+3]
                first += 3
                mrna -= 1
                if codon in amino_acids:
                    list.append(amino_acids[codon])
                    if amino_acids[codon] == "*":  # stop at the stop codon
                        mrna = 0
for acid in list:
    print acid
In Python there's usually a way to avoid writing explicit loops with counters and such. There's an incredibly powerful list comprehension syntax that lets you construct lists in one line. To wit, here's an alternate way to write your second for loop:
import re

def codons_to_acids(amino_acids, sequence):
    sequence = sequence.upper().strip().replace(' ', '').replace('T', 'U')
    codons = re.findall(r'...', sequence)
    acids = [amino_acids.get(codon) for codon in codons if codon in amino_acids]
    if '*' in acids:
        acids = acids[:acids.index('*') + 1]
    return acids
The first line performs all of the string sanitization. Chaining together the different methods makes the code more readable to me. You may or may not like that. The second line uses re.findall in a tricky way to split the string every three characters. The third line is a list comprehension which looks up each codon in the amino_acids dict and creates a list of the resulting values.
There's no easy way to break out of a for loop inside a list comprehension, so the final if statement slices off any entries occurring after a *.
You would call this function like so:
amino_acids = {
    'AUU': 'I', 'AUC': 'I', 'AUA': 'I', 'CUU': 'L', 'GUU': 'V', 'UGA': '*'
}
print codons_to_acids(amino_acids, 'ATGGAAGCGAGGATGtGaATT')
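(With the sample dictionary above, this prints ['*']: AUG, GAA, GCG and AGG are not in the dictionary, and the trailing AUU is sliced off because it comes after the UGA stop codon.)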
If you can solve the problem without regex, it's best not to use it.
with open('input.fasta', 'r') as f1:
    input = f1.read()
codons = list()
with open('codons.txt', 'r') as f2:
    codons = f2.readlines()

input = [x.replace('T', 'U') for x in input.upper() if x in 'ATCG']
chunks = [''.join(input[x:x+3]) for x in xrange(0, len(input), 3)]
codons = [c.replace('\n', '').upper() for c in codons if c != '\n']
my_dict = {q.split()[0]: q.split()[1] for q in codons}

result = list()
for ch in chunks:
    new_elem = my_dict.get(ch)  # get(), not pop(), so repeated codons still match
    if new_elem is None:
        print 'Invalid key!'
    else:
        result.append(new_elem)
        if new_elem == '*':
            break
print result
