I'm doing a course in bioinformatics. We were supposed to create a function that takes a list of strings like this:
Motifs = [
    "AACGTA",
    "CCCGTT",
    "CACCTT",
    "GGATTA",
    "TTCCGG"]
and turn it into a count matrix that counts the occurrence of the nucleotides (the letters A, C, G and T) in each column and adds a pseudocount 1 to it, represented by a dictionary with multiple values for each key like this:
count = {
    'A': [2, 3, 2, 1, 1, 3],
    'C': [3, 2, 5, 3, 1, 1],
    'G': [2, 2, 1, 3, 2, 2],
    'T': [2, 2, 1, 2, 5, 3]}
For example A occurs 1 + 1 pseudocount = 2 in the first column. C appears 2 + 1 pseudocount = 3 in the fourth column.
Here is my solution:
def CountWithPseudocounts(Motifs):
    t = len(Motifs)
    k = len(Motifs[0])
    count = {}
    for symbol in "ACGT":
        count[symbol] = [1 for j in range(k)]
    for i in range(t):
        for j in range(k):
            symbol = Motifs[i][j]
            count[symbol][j] += 1
    return count
The first for loop generates a dictionary with the keys A, C, G, T and the initial value 1 for each column, like this:
count = {
    'A': [1, 1, 1, 1, 1, 1],
    'C': [1, 1, 1, 1, 1, 1],
    'G': [1, 1, 1, 1, 1, 1],
    'T': [1, 1, 1, 1, 1, 1]}
The second set of for loops counts the occurrence of the nucleotides and adds it to the values of the existing dictionary as seen above.
This works and does its job, but I want to know how to further compress both for loops using dict comprehensions.
NOTE:
I am fully aware that there are a multitude of modules and libraries like biopython, scipy and numpy that can probably turn the entire function into a one-liner. The problem with modules is that their output format often doesn't match what the course's automated solution check expects.
This
count = {}
for symbol in "ACGT":
    count[symbol] = [1 for j in range(k)]
can be changed to a comprehension as follows:
count = {symbol: [1 for j in range(k)] for symbol in "ACGT"}
and then further simplified, using Python's ability to multiply a list by an integer, to
count = {symbol: [1] * k for symbol in "ACGT"}
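One caveat worth knowing about [1] * k: it is safe here because the comprehension evaluates [1] * k once per symbol, so every key gets its own list. A small sketch of the difference compared with dict.fromkeys, which would store one and the same list under every key (k = 3 is just an illustrative value, not taken from the question):

```python
k = 3  # illustrative column count, not from the question

# The comprehension evaluates [1] * k once per key: four separate lists.
independent = {symbol: [1] * k for symbol in "ACGT"}
independent["A"][0] += 1

# dict.fromkeys stores the very same list object under every key.
shared = dict.fromkeys("ACGT", [1] * k)
shared["A"][0] += 1

print(independent["C"])  # [1, 1, 1]
print(shared["C"])       # [2, 1, 1]
```

So the comprehension form is the one to use when the lists are going to be incremented afterwards.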
Compressing the first loop:
count = {symbol: [1 for j in range(k)] for symbol in "ACGT"}
This is called a dict comprehension: it builds a dict from a for loop written inside the braces.
I'm not sure you can compress the second (nested) loop, since it's not generating anything, but changing the first dict.
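For what it's worth, the counting loop can be folded into a comprehension as well if you count column by column instead of mutating the dict. A minimal sketch, assuming all motifs have the same length; the function name count_with_pseudocounts is my own choice and may need renaming to whatever the course checker expects:

```python
def count_with_pseudocounts(motifs):
    """Count each nucleotide per column, starting every count at 1."""
    # zip(*motifs) transposes the strings into one tuple per column.
    columns = list(zip(*motifs))
    return {
        symbol: [column.count(symbol) + 1 for column in columns]
        for symbol in "ACGT"
    }


motifs = ["AACGTA", "CCCGTT", "CACCTT", "GGATTA", "TTCCGG"]
print(count_with_pseudocounts(motifs))
```

This scans each column once per symbol, which is fine for four symbols but is not the most efficient option for large alphabets.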
You can compress your code a lot using collections.Counter and collections.defaultdict:
from collections import Counter, defaultdict
out = defaultdict(list)
bases = 'ACGT'
for m in zip(*Motifs):
    c = Counter(m)
    for b in bases:
        out[b].append(c[b]+1)
dict(out)
output:
{'A': [2, 3, 2, 1, 1, 3],
'C': [3, 2, 5, 3, 1, 1],
'G': [2, 2, 1, 3, 2, 2],
'T': [2, 2, 1, 2, 5, 3]}
You can use collections.Counter:
from collections import Counter
m = ['AACGTA', 'CCCGTT', 'CACCTT', 'GGATTA', 'TTCCGG']
d = [Counter(i) for i in zip(*m)]
r = {a:[j.get(a, 0)+1 for j in d] for a in 'ACGT'}
Output:
{'A': [2, 3, 2, 1, 1, 3], 'C': [3, 2, 5, 3, 1, 1], 'G': [2, 2, 1, 3, 2, 2], 'T': [2, 2, 1, 2, 5, 3]}
Below is a simplified version of an issue that I have.
Let's say I have a list of elements as follows:
my_list = [0, 0, 1, 0, 1, 2, 2, 0, 0, 3, 3, 3, 2, 3]
I want it to be non-decreasing from my_list[0] to my_list[-1], so the smallest value comes first and the largest comes last.
If any element does not follow the sequence, I want to remove it.
So for the above example the output I want is:
my_list = [0, 0, 1, 1, 2, 2, 3, 3, 3, 3]
How can I do this? I know I could just enumerate and check whether the previous element is <= the current one, but with more than one outlier in a row this approach falls apart.
E.g.
new_list = []
for idx, el in enumerate(my_list):
    if idx > 0:
        if my_list[idx-1] <= el:
            new_list.append(el)  # only these values count
output of new_list is:
[0, 1, 1, 2, 2, 0, 3, 3, 3, 3]
So still getting that outlier (0) at index 5
Note - I know I could sort() the list, but I want to actively remove the outliers, not sort.
Since you want every element in new_list to be greater than or equal to its previous element, you should compare it with the last element appended to new_list.
Besides, new_list should start with the first element rather than empty, or new_list[-1] will fail. The loop then has to skip that first element so it is not appended twice.
new_list = [my_list[0]]
for el in my_list[1:]:
    if el >= new_list[-1]:
        new_list.append(el)
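The same idea, comparing each element with the last one that was kept, can be wrapped in a small function that also copes with an empty list. This is only a sketch; the name keep_non_decreasing is made up for illustration:

```python
def keep_non_decreasing(values):
    """Return the elements of values that are >= the last element kept so far."""
    kept = []
    for el in values:
        # Keep the first element unconditionally, then compare with the last kept one.
        if not kept or el >= kept[-1]:
            kept.append(el)
    return kept


my_list = [0, 0, 1, 0, 1, 2, 2, 0, 0, 3, 3, 3, 2, 3]
print(keep_non_decreasing(my_list))  # [0, 0, 1, 1, 2, 2, 3, 3, 3, 3]
```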
You could do this by comparing each number with the cumulative maximum at its position. The cumulative maximum can be computed using the accumulate() function from itertools. Combining the numbers with their respective cumulative maximum can be achieved using the zip() function:
from itertools import accumulate
my_list = [0, 0, 1, 0, 1, 2, 2, 0, 0, 3, 3, 3, 2, 3]
my_list = [a for a,m in zip(my_list,accumulate(my_list,max)) if a==m]
print(my_list)
[0, 0, 1, 1, 2, 2, 3, 3, 3, 3]
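To see why the a == m test works, it may help to look at the cumulative maximum on its own: an element survives exactly when it equals the running maximum at its position. A short sketch using the same data (the variable names running_max and filtered are mine):

```python
from itertools import accumulate

my_list = [0, 0, 1, 0, 1, 2, 2, 0, 0, 3, 3, 3, 2, 3]

# Running maximum: the largest value seen up to and including each position.
running_max = list(accumulate(my_list, max))
print(running_max)  # [0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3]

# An element is kept only where it equals the running maximum.
filtered = [a for a, m in zip(my_list, running_max) if a == m]
print(filtered)     # [0, 0, 1, 1, 2, 2, 3, 3, 3, 3]
```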
I am still new to Python, so apologies in advance if my question is confusing.
I have this
A : [5, 1, 0, 0, 5, 5, 0]
C : [0, 1, 1, 2, 1, 0, 2]
G : [0, 3, 2, 1, 0, 1, 2]
T : [1, 0, 2, 0, 0, 0, 1]
dictionary. For each position in the lists, I want to print the key with the highest value.
For example, the first elements of the lists are 5, 0, 0, 1.
A has the highest value there, so I want to print A; for the second position it's G, and so on.
Thank you.
Edit: Thank you all for your advice, I was able to complete the task.
You can use the following:
d = {"A": [5, 1, 0, 0, 5, 5, 0],
"C": [0, 1, 1, 2, 1, 0, 2],
"G": [0, 3, 2, 1, 0, 1, 2],
"T": [1, 0, 2, 0, 0, 0, 1]}
maxes = [max(d, key=lambda k: d[k][i]) for i in range(len(list(d.values())[0]))]
# ['A', 'G', 'G', 'C', 'A', 'A', 'C']
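An alternative sketch that avoids indexing by position: transpose the lists with zip and look up where the maximum sits in each column. This assumes all lists have the same length and that ties should go to the first key, which is also what max(d, key=...) does; the names keys and maxes_by_column are mine:

```python
d = {"A": [5, 1, 0, 0, 5, 5, 0],
     "C": [0, 1, 1, 2, 1, 0, 2],
     "G": [0, 3, 2, 1, 0, 1, 2],
     "T": [1, 0, 2, 0, 0, 0, 1]}

keys = list(d)  # ['A', 'C', 'G', 'T'], in insertion order

# zip(*d.values()) yields one tuple per position, e.g. (5, 0, 0, 1) for position 0.
maxes_by_column = [keys[col.index(max(col))] for col in zip(*d.values())]
print(maxes_by_column)  # ['A', 'G', 'G', 'C', 'A', 'A', 'C']
```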
Assuming your data is in a dictionary named d, this returns the key whose list contains the largest value overall:
maxkey = max(d.items(), key=lambda x: max(x[1]))[0]
print(maxkey)
Output:
A
Something like this:
>>> def f(di, index):
...     best = None
...     for key in di:
...         if index >= len(di[key]):
...             continue
...         if best is None or di[best][index] < di[key][index]:
...             best = key
...     return best
...
>>> di = {'A': [5, 1, 0, 0, 5, 5, 0],
...       'B': [0, 1, 1, 2, 1, 0, 2]}
>>> f(di, 0)
'A'
>>> f(di, 3)
'B'
For each index (0 to 6) get the key that has the maximum value at that index. That can be done in a list comprehension:
[ max(d,key=lambda k:d[k][i]) for i in range(7)]
['A', 'G', 'G', 'C', 'A', 'A', 'C']
If the lists in your dictionary have different sizes, or you don't know their length in advance, you can get the shortest length first using the min function:
minLen = min(map(len,d.values()))
maxLetters = [ max(d,key=lambda k:d[k][i]) for i in range(minLen)]
You need min() here to make sure d[k][i] never indexes beyond the end of the shortest list.
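A quick sketch of the uneven case, using made-up lists of lengths 3, 2 and 4 (this data is hypothetical, not from the question). Only the first two positions exist in every list, so only those are compared:

```python
# Hypothetical data with lists of different lengths.
d = {"A": [5, 1, 0],
     "C": [0, 2],
     "G": [1, 0, 4, 7]}

min_len = min(map(len, d.values()))  # 2, the length of the shortest list
max_letters = [max(d, key=lambda k: d[k][i]) for i in range(min_len)]
print(max_letters)  # ['A', 'C']
```

Using range(4) here instead would raise an IndexError as soon as i reaches 2 for key "C".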
I have a nested dictionary, something like this:
{'A': {'21-26': 2,
'26-31': 7,
'31-36': 3,
'36-41': 2,
'41-46': 0,
'46-51': 0,
'Above 51': 0},
'B': {'21-26': 2,
'26-31': 11,
'31-36': 5,
'36-41': 4,
'41-46': 1,
'46-51': 0,
'Above 51': 3}}
I want to create a list of the keys of the inner dictionaries,
without any duplicates in the list.
The required output is:
ls = ['21-26','26-31','31-36','36-41','41-46','46-51','Above 51']
Thank you for your time and consideration.
You can use:
>>> list(set(key for val in d.values() for key in val.keys()))
['21-26', '36-41', '31-36', '46-51', 'Above 51', '26-31', '41-46']
Where d is your dictionary.
Simple set comprehension, then convert to list. a is your dict.
list({k for v in a.values() for k in v.keys()})
Output ordering is arbitrary, but you can sort it however you like.
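For these particular keys a plain sorted() happens to give the required order, because the digit-prefixed ranges sort before 'Above 51' in string comparison. A sketch with a shortened copy of the data (trimmed to three keys per inner dict only to keep the example small):

```python
# Shortened copy of the nested dictionary, for illustration only.
a = {'A': {'21-26': 2, '26-31': 7, 'Above 51': 0},
     'B': {'21-26': 2, '26-31': 11, 'Above 51': 3}}

# sorted() accepts the set directly and returns a list.
ls = sorted({k for v in a.values() for k in v})
print(ls)  # ['21-26', '26-31', 'Above 51']
```

Note that this is lexicographic order, so it would not sort ranges such as '5-10' and '10-15' numerically.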
Can you use pandas? If so:
import pandas as pd
a = {'A': {'21-26': 2, '26-31': 7, '31-36': 3, '36-41': 2, '41-46': 0, '46-51': 0, 'Above 51': 0}, 'B': {'21-26': 2, '26-31': 11, '31-36': 5, '36-41': 4, '41-46': 1, '46-51': 0, 'Above 51': 3}}
pd.DataFrame(a).index.to_list()
output:
['21-26', '26-31', '31-36', '36-41', '41-46', '46-51', 'Above 51']
You can use chain.from_iterable() to chain the keys of the inner dictionaries and dict.fromkeys() to remove duplicates while keeping their first-seen order (dct is your dictionary):
from itertools import chain
c = chain.from_iterable(dct.values())
result = list(dict.fromkeys(c))
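Here is the same approach as a runnable sketch on the data from the question. Iterating over a dict yields its keys, which is why chaining the inner dicts gives a stream of keys, and dict.fromkeys keeps the first occurrence of each in insertion order:

```python
from itertools import chain

dct = {'A': {'21-26': 2, '26-31': 7, '31-36': 3, '36-41': 2,
             '41-46': 0, '46-51': 0, 'Above 51': 0},
       'B': {'21-26': 2, '26-31': 11, '31-36': 5, '36-41': 4,
             '41-46': 1, '46-51': 0, 'Above 51': 3}}

# Chain the keys of every inner dict, then drop duplicates in first-seen order.
c = chain.from_iterable(dct.values())
result = list(dict.fromkeys(c))
print(result)
# ['21-26', '26-31', '31-36', '36-41', '41-46', '46-51', 'Above 51']
```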
I know how to count the frequency of elements in a list, but here's a slightly different question. I have a larger vocabulary and a few lists that each use only part of it. Using numbers instead of words as an example:
vocab=[1,2,3,4,5,6,7]
list1=[1,2,3,4]
list2=[2,3,4,5,6,6,7]
list3=[3,2,4,4,1]
and I want the output to keep "0"s when a word is not used:
count1=[1,1,1,1,0,0,0]
count2=[0,1,1,1,1,2,1]
count3=[1,1,1,2,0,0,0]
I guess I need to sort the words, but how do I keep the "0" records?
This can be done using the list object's built-in count method, within a list comprehension.
>>> vocab = [1, 2, 3, 4, 5, 6, 7]
>>> list1 = [1, 2, 3, 4]
>>> list2 = [2, 3, 4, 5, 6, 6, 7]
>>> list3 = [3, 2, 4, 4, 1]
>>> [list1.count(v) for v in vocab]
[1, 1, 1, 1, 0, 0, 0]
>>> [list2.count(v) for v in vocab]
[0, 1, 1, 1, 1, 2, 1]
>>> [list3.count(v) for v in vocab]
[1, 1, 1, 2, 0, 0, 0]
Iterate over each value in vocab, accumulating the frequencies for them.
You could also achieve this with the following (Python 2, where map returns a list; in Python 3, wrap the call in list()):
map(lambda v: list1.count(v), vocab)
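The list.count approach rescans the whole list once per vocabulary entry, which can get slow for a large vocabulary. A sketch of an alternative using collections.Counter, which counts everything in one pass and returns 0 for missing keys; the helper name count_vocab is made up for illustration:

```python
from collections import Counter


def count_vocab(words, vocab):
    """Return the frequency of each vocab entry in words, 0 if it is absent."""
    counts = Counter(words)  # single pass over words
    # Indexing a Counter with a missing key gives 0 instead of raising KeyError.
    return [counts[v] for v in vocab]


vocab = [1, 2, 3, 4, 5, 6, 7]
list2 = [2, 3, 4, 5, 6, 6, 7]
print(count_vocab(list2, vocab))  # [0, 1, 1, 1, 1, 2, 1]
```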