Given a large number of FASTA files (the peptidome of secreted peptides for various organisms), how can I read the FASTA files (from UniProt) with Python (or MATLAB) and count the frequencies of each amino acid, and of amino-acid "double" pairings?
(i.e., the output should contain the percentage of each individual amino acid (out of the 22 letters/characters) AND the frequencies of amino-acid pairings.)
Effectively, I want to count the bigram (or n-gram, if easy to implement) frequencies for letter pairs.
The 22 amino acids are each represented by a unique letter in the FASTA file, and the name of each protein is preceded on its line by >. (I already parsed it, so only relevant characters remain.)
Sample of a file:
FFKA
FLRN
MTTVSYVTILLTVLVQVLTSDAKATNNKRELSSGLKERSLSDDAPQFWKGRFSRSEEDPQ
FWKGRFSDPQFWKGRFSDPQFWKGRFSDPQFWKGRFSDPQFWKGRFSDPQFWKGRFSDPQ
FWKGRFSDGTKRENDPQYWKGRFSRSFEDQPDSEAQFWKGRFARTSSGEKREPQYWKGRF
SRDSVPGRYGRELQGRFGRELQGRFGREAQGRFGRELQGRFGREFQGRFGREDQGRFGRE
DQGRFGREDQGRFGREDQGRFGREDQGRFGREDQGRFGRELQGRFGREFQGRFGREDQGR
FGREDQGRFGRELQGRFGREDQGRFGREDQGRFGREDLAKEDQGRFGREDLAKEDQGRFG
REDIAEADQGRFGRNAAAAAAAAAAAKKRTIDVIDIESDPKPQTRFRDGKDMQEKRKVEK
KDKIEKSDDALAKTS
Thank you very much!
How does this look?
>>> sequence = "LTSDAKAARFSDPQFWKGRFSDPQFWKGRSAAKGRFARTSSGAAEKREPQAAYWKGRF "
>>> occurrenceAA = sequence.count("AA")  # count occurrences of the two-letter pattern (non-overlapping)
>>> percent_occurrenceAA = occurrenceAA / len(sequence) * 100  # percentage of the total sequence length
>>> print(occurrenceAA, "Double-alanines in your sequence")
4 Double-alanines in your sequence
>>> print(round(percent_occurrenceAA, 2), "% of total")  # round the percentage to 2 decimal places
6.78 % of total
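In case it helps, here is a rough sketch (not a tested solution) of how one might count single amino acids and overlapping amino-acid pairs across a directory of already-parsed FASTA files with collections.Counter. The directory pattern "peptides/*.fasta" is just a placeholder, and sequences are assumed to possibly span several lines under one > header:

import glob
from collections import Counter

single_counts = Counter()
pair_counts = Counter()

for path in glob.glob("peptides/*.fasta"):   # placeholder pattern for your parsed files
    with open(path) as handle:
        seq = ""
        for line in handle:
            line = line.strip()
            if line.startswith(">"):         # header line: close off the previous record
                if seq:
                    single_counts.update(seq)
                    pair_counts.update(seq[i:i + 2] for i in range(len(seq) - 1))
                seq = ""
            else:
                seq += line                  # a sequence may continue over several lines
        if seq:                              # don't forget the last record in the file
            single_counts.update(seq)
            pair_counts.update(seq[i:i + 2] for i in range(len(seq) - 1))

total = sum(single_counts.values())
for aa, count in single_counts.most_common():
    print(aa, round(100 * count / total, 2), "%")

total_pairs = sum(pair_counts.values())
for pair, count in pair_counts.most_common(20):
    print(pair, round(100 * count / total_pairs, 2), "%")

Note that sequence.count("AA") in the snippet above counts non-overlapping matches only, while the sliding window seq[i:i + 2] counts overlapping pairs, which is usually what you want for bigram statistics.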
Harold is a kidnapper who wrote a ransom note, but now he is worried it will be traced back to him through his handwriting. He found a magazine and wants to know if he can cut out whole words from it and use them to create an untraceable replica of his ransom note. The words in his note are case-sensitive and he must use only whole words available in the magazine. He cannot use substrings or concatenation to create the words he needs.
Given the words in the magazine and the words in the ransom note, print Yes if he can replicate his ransom note exactly using whole words from the magazine; otherwise, print No.
Example
= "attack at dawn" = "Attack at dawn"
The magazine has all the right words, but there is a case mismatch. The answer is .
Function Description
Complete the checkMagazine function in the editor below. It must print Yes if the note can be formed using the magazine, or No otherwise.
checkMagazine has the following parameters:
string magazine[m]: the words in the magazine
string note[n]: the words in the ransom note
Prints
string: either Yes or No; no return value is expected
Input Format
The first line contains two space-separated integers, m and n, the numbers of words in the magazine and the note, respectively.
The second line contains m space-separated strings, the words of the magazine.
The third line contains n space-separated strings, the words of the ransom note.
Constraints
.
Each word consists of English alphabetic letters (i.e., a to z and A to Z).
Sample Input 0
6 4
give me one grand today night
give one grand today
Sample Output 0
Yes
Sample Input 1
6 5
two times three is not four
two times two is four
Sample Output 1
No
Explanation 1
'two' only occurs once in the magazine.
Sample Input 2
7 4
ive got a lovely bunch of coconuts
ive got some coconuts
Sample Output 2
No
Explanation 2
Harold's magazine is missing the word "some".
# Complete the checkMagazine function below.
def checkMagazine(magazine, note):
    hash_table = [[] for _ in range(5)]
    test = 0
    for a in magazine:
        length = len(a)
        key = (length - 1) % 5
        hash_table[key].append(a)
    for a in note:
        key = (len(a) - 1) % 5
        if a not in hash_table[key]:
            test = 1
            break
        else:
            hash_table[key].remove(a)
    if test == 1:
        print("No")
    else:
        print("Yes")
My solution exceeds the time limit on test cases 16 and 17. Please help me figure out how to improve it.
Introducing Python Counters
From the docs: A Counter is a dict subclass for counting hashable objects. It is a collection where elements are stored as dictionary keys and their counts are stored as dictionary values.
from collections import Counter

def checkMagazine(magazine, note):
    magazine_counter = Counter(magazine)
    note_counter = Counter(note)
    print("Yes" if magazine_counter & note_counter == note_counter else "No")
The logical & between the two counters will give you the minimum of corresponding counts. Comparing the result with the note is what you want.
I have written the following (crude) code to find the association strengths among the words in a given piece of text.
import re
## The first paragraph of Wikipedia's article on itself - you can try with other pieces of text with preferably more words (to produce more meaningful word pairs)
text = "Wikipedia was launched on January 15, 2001, by Jimmy Wales and Larry Sanger.[10] Sanger coined its name,[11][12] as a portmanteau of wiki[notes 3] and 'encyclopedia'. Initially an English-language encyclopedia, versions in other languages were quickly developed. With 5,748,461 articles,[notes 4] the English Wikipedia is the largest of the more than 290 Wikipedia encyclopedias. Overall, Wikipedia comprises more than 40 million articles in 301 different languages[14] and by February 2014 it had reached 18 billion page views and nearly 500 million unique visitors per month.[15] In 2005, Nature published a peer review comparing 42 science articles from Encyclopadia Britannica and Wikipedia and found that Wikipedia's level of accuracy approached that of Britannica.[16] Time magazine stated that the open-door policy of allowing anyone to edit had made Wikipedia the biggest and possibly the best encyclopedia in the world and it was testament to the vision of Jimmy Wales.[17] Wikipedia has been criticized for exhibiting systemic bias, for presenting a mixture of 'truths, half truths, and some falsehoods',[18] and for being subject to manipulation and spin in controversial topics.[19] In 2017, Facebook announced that it would help readers detect fake news by suitable links to Wikipedia articles. YouTube announced a similar plan in 2018."
text = re.sub("[\[].*?[\]]", "", text) ## Remove brackets and anything inside it.
text=re.sub(r"[^a-zA-Z0-9.]+", ' ', text) ## Remove special characters except spaces and dots
text=str(text).lower() ## Convert everything to lowercase
## Can add other preprocessing steps, depending on the input text, if needed.
from nltk.corpus import stopwords
import nltk
import numpy as np   # used below for the association matrix
import pandas as pd  # used below for the DataFrames
stop_words = stopwords.words('english')
desirable_tags = ['NN'] # We want only nouns - can also add 'NNP', 'NNS', 'NNPS' if needed, depending on the results
word_list = []
for sent in text.split('.'):
    for word in sent.split():
        '''
        Extract the unique, non-stopword nouns only
        '''
        if word not in word_list and word not in stop_words and nltk.pos_tag([word])[0][1] in desirable_tags:
            word_list.append(word)
'''
Construct the association matrix, where we count 2 words as being associated
if they appear in the same sentence.
Later, I'm going to define associations more properly by introducing a
window size (say, if 2 words are separated by at most 5 words in a sentence,
then we consider them to be associated)
'''
table = np.zeros((len(word_list), len(word_list)), dtype=int)
for sent in text.split('.'):
    for i in range(len(word_list)):
        for j in range(len(word_list)):
            if word_list[i] in sent and word_list[j] in sent:
                table[i, j] += 1
df = pd.DataFrame(table, columns=word_list, index=word_list)
# Count the number of occurrences of each word from word_list in the text
all_words = pd.DataFrame(np.zeros((len(df), 2)), columns=['Word', 'Count'])
all_words.Word = df.index
for sent in text.split('.'):
    for word in sent.split():
        if word in word_list:
            all_words.loc[all_words.Word == word, 'Count'] += 1
# Sort the word pairs in decreasing order of their association strengths
df.values[np.triu_indices_from(df, 0)] = 0  # Make the upper triangle values 0
assoc_df = pd.DataFrame(columns=['Word 1', 'Word 2', 'Association Strength (Word 1 -> Word 2)'])
for row_word in df:
    for col_word in df:
        '''
        If Word1 occurs 10 times in the text, and Word1 & Word2 occur in the same sentence 3 times,
        the association strength of Word1 and Word2 is 3/10 - Please correct me if this is wrong.
        '''
        assoc_df = assoc_df.append({'Word 1': row_word, 'Word 2': col_word,
                                    'Association Strength (Word 1 -> Word 2)': df[row_word][col_word] / all_words[all_words.Word == row_word]['Count'].values[0]},
                                   ignore_index=True)
assoc_df.sort_values(by='Association Strength (Word 1 -> Word 2)', ascending=False)
This produces the word associations like so:
Word 1 Word 2 Association Strength (Word 1 -> Word 2)
330 wiki encyclopedia 3.0
895 encyclopadia found 1.0
1317 anyone edit 1.0
754 peer science 1.0
755 peer encyclopadia 1.0
756 peer britannica 1.0
...
...
...
However, the code contains a lot of for loops, which hurts its running time. Especially the last part (sorting the word pairs in decreasing order of their association strengths) consumes a lot of time, as it computes the association strengths of n^2 word pairs/combinations, where n is the number of words we are interested in (those in word_list in my code above).
So, the following are what I would like some help on:
How do I vectorize the code, or otherwise make it more efficient?
Instead of producing n^2 combinations/pairs of words in the last step, is there any way to prune some of them before producing them? I am going to prune some of the useless/meaningless pairs by inspection after they are produced anyway.
Also, and I know this does not fall into the purview of a coding question, but I would love to know if there's any mistake in my logic, especially when calculating the word association strengths.
Since you asked about your specific code, I will not go into alternate libraries. I will mostly be concentrating on points 1) and 2) of your question:
Instead of iterating through the whole word list twice (i and j), you can already cut the processing time roughly in half by iterating j only from i + 1 to the end of the list. This removes duplicate pairs (index 24 and 42 as well as index 42 and 24) as well as identical pairs (index 42 and 42).
for sent in text.split('.'):
    for i in range(len(word_list)):
        for j in range(i + 1, len(word_list)):
            if word_list[i] in sent and word_list[j] in sent:
                table[i, j] += 1
Be careful with this, though: the in operator will also match partial words (like "and" in "hand").
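A quick illustration of that pitfall, and a simple way around it (checking membership in the list of tokens instead of the raw string):

sent = "hand in hand"
print("and" in sent)          # True, even though "and" never appears as a word
print("and" in sent.split())  # False: membership in the token list is what you actually want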
Of course, you could also remove the j iteration completely by first filtering for all words in your word list and then pairing them afterward:
word_list = set()  # Using a set instead of a list makes lookups faster, since it is a hashed structure
for sent in text.split('.'):
    for word in sent.split():
        '''
        Extract the unique, non-stopword nouns only
        '''
        if word not in word_list and word not in stop_words and nltk.pos_tag([word])[0][1] in desirable_tags:
            word_list.add(word)

(...)

for sent in text.split('.'):
    found_words = [word for word in sent.split() if word in word_list]  # list comprehensions are usually faster than pure for loops
    # If you want to count duplicate words, then leave the whole line below out.
    found_words = tuple(frozenset(found_words))  # make every word unique using a set, then indexable again by converting it into a tuple
    for i in range(len(found_words)):
        for j in range(i + 1, len(found_words)):
            # Note: i and j index found_words here, so for the final table you still need to
            # map each found word back to its row/column position in word_list.
            table[i, j] += 1
In general, though, you should really think about using external libraries for most of this. As some of the comments on your question already pointed out, splitting on '.' may give you wrong results, and the same goes for splitting on whitespace, for example with words that are separated by a hyphen or followed by a comma.
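As one illustration (not part of your code), nltk's own tokenizers handle sentence boundaries and punctuation more robustly than splitting on '.' and whitespace; this sketch assumes the punkt tokenizer data has been downloaded:

import nltk

sample = "Wikipedia was launched on January 15, 2001, by Jimmy Wales and Larry Sanger. Sanger coined its name."
for sent in nltk.sent_tokenize(sample):
    tokens = [w.lower() for w in nltk.word_tokenize(sent) if w.isalpha()]
    print(tokens)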
EDIT: The "already answered" question is not about what I'm asking. My string already comes out in word format; I need to extract those number words from my string into a list.
I'm trying to work with phrase manipulation for a voice assistant in python.
When I speak something like:
"What is 15,276 divided by 5?"
It comes out formatted like this:
"what is fifteen thousand two hundred seventy six divided by five"
I already have a way to change a string to an int for the math part, so is there a way to somehow get a list like this from the phrase?
['fifteen thousand two hundred seventy six','five']
Go through the list of words and check each one for membership in a set of number words. Add adjacent number words to a temporary list; when you find a non-number word, close off the current group and start a new one. Remember to account for non-number words at the beginning of the sentence and number words at the end of the sentence.
result = []
group = []
nums = set('one two three four five six seven eight nine ten eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen nineteen twenty thirty forty fifty sixty seventy eighty ninety hundred thousand million billion trillion quadrillion quintillion'.split())
for word in "what is fifteen thousand two hundred seventy six divided by five".split():
    if word in nums:
        group.append(word)
    else:
        if group:
            result.append(group)
            group = []
if group:
    result.append(group)
Result:
>>> result
[['fifteen', 'thousand', 'two', 'hundred', 'seventy', 'six'], ['five']]
To join each sublist into a single string:
>>> list(map(' '.join, result))
['fifteen thousand two hundred seventy six', 'five']
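For what it's worth, the same grouping can also be written compactly with itertools.groupby, reusing the nums set from above (a minor variation, not a different approach):

from itertools import groupby

words = "what is fifteen thousand two hundred seventy six divided by five".split()
number_phrases = [' '.join(group) for is_num, group in groupby(words, key=lambda w: w in nums) if is_num]
print(number_phrases)  # ['fifteen thousand two hundred seventy six', 'five']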
I'm trying to calculate a cumulative distribution into a dictionary. The distribution should take the letters from a given text, compute the probability of each letter from the number of times it appears in the text, and from this calculate the cumulative distribution.
I don't know if I'm doing it the right way, but here's my code:
with open('text') as infile:
    text = infile.read()
    letters = list(text)
letter_freqs = Counter(letters(text))
letter_sum = len(letters)
letter_proba = [letter_freqs[letter]/letter_sum for letter in letters(text)]
And now I want to calculate the cumulative distribution and plot it as a histogram. Can someone help me?
The following should at least run (which your code as posted won't):
import collections, itertools

with open('text') as infile:
    letters = list(infile.read())  # not just letters: whitespace & punct, too

letter_freqs = collections.Counter(letters)
letter_sum = len(letters)
letters_set = sorted(set(letters))
d = {l: letter_freqs[l] / letter_sum for l in letters_set}
cum = itertools.accumulate(d[l] for l in letters_set)
cum_d = dict(zip(letters_set, cum))
Now you have in cum_d a dictionary mapping each character (not just letters, of course, since you've done nothing to exclude whitespace and punctuation) to the cumulative probability of that character and all those below it in alphabetical order. How you plan to "plot" a dictionary, no idea. But hey, at least this does run, and produces something that might fit at least one interpretation of the vague specs you give for the task!-)
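As for plotting, one minimal option (assuming matplotlib is available) is a bar chart over the characters, since cum_d is just a mapping from character to cumulative probability:

import matplotlib.pyplot as plt

chars = list(cum_d)  # characters, already in sorted order
plt.bar(range(len(chars)), [cum_d[c] for c in chars], tick_label=chars)
plt.ylabel('Cumulative probability')
plt.show()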
How do I sum up the frequencies of words using fd.items() from FreqDist?
>>> fd = FreqDist(text)
>>> most_freq_w = fd.keys()[:10] # gives me the 10 most frequent words in the text
>>> # here I should sum up the number of times each of these 10 frequent words appears in the text
e.g. if each word in most_freq_w appears 10 times, the result should be 100
!!! I don't need the count of all words in the text, just of the 10 most frequent.
I'm not familiar with nltk, but since FreqDist derives from dict, the following should work:
v = fd.values()
v.sort()
count = sum(v[-10:])
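(On Python 3, values() returns a view with no sort method, so the same idea adapted would be:)

count = sum(sorted(fd.values())[-10:])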
To find the number of times a word appears in the corpus (your piece of text):
import nltk
from nltk import FreqDist

raw = "<your file>"
tokens = nltk.word_tokenize(raw)
fd = FreqDist(tokens)
print(fd['<your word here>'])
It has a pretty print feature
fd.pprint()
will do it.
If FreqDist is a mapping of words to their frequencies:
sum(map(fd.get, most_freq_w))
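On newer nltk versions, FreqDist is a Counter subclass, so most_common gives the top words and their counts directly; a small sketch with a toy text:

from nltk import FreqDist

fd = FreqDist("the cat sat on the mat and the cat slept".split())
top = fd.most_common(10)                  # list of (word, count) pairs
print(top)
print(sum(count for word, count in top))  # total occurrences of the 10 most frequent words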