I'm doing a course in bioinformatics. We were supposed to create a function that takes a list of strings like this:
Motifs = [
    "AACGTA",
    "CCCGTT",
    "CACCTT",
    "GGATTA",
    "TTCCGG"]
and turns it into a count matrix that counts the occurrences of the nucleotides (the letters A, C, G and T) in each column and adds a pseudocount of 1 to each count. The matrix is represented by a dictionary with one list of values per key, like this:
count ={
'A': [2, 3, 2, 1, 1, 3],
'C': [3, 2, 5, 3, 1, 1],
'G': [2, 2, 1, 3, 2, 2],
'T': [2, 2, 1, 2, 5, 3]}
For example A occurs 1 + 1 pseudocount = 2 in the first column. C appears 2 + 1 pseudocount = 3 in the fourth column.
Here is my solution:
def CountWithPseudocounts(Motifs):
    t = len(Motifs)
    k = len(Motifs[0])
    count = {}
    for symbol in "ACGT":
        count[symbol] = [1 for j in range(k)]
    for i in range(t):
        for j in range(k):
            symbol = Motifs[i][j]
            count[symbol][j] += 1
    return count
The first set of for loops generates a dictionary with the keys A,C,G,T and the initial values 1 for each column like this:
count ={
'A': [1, 1, 1, 1, 1, 1],
'C': [1, 1, 1, 1, 1, 1],
'G': [1, 1, 1, 1, 1, 1],
'T': [1, 1, 1, 1, 1, 1]}
The second set of for loops counts the occurrence of the nucleotides and adds it to the values of the existing dictionary as seen above.
This works and does its job, but I want to know how to further compress both for loops using dict comprehensions.
NOTE:
I am fully aware that there are a multitude of modules and libraries like biopython, scipy and numpy that can probably turn the entire function into a one-liner. The problem with modules is that their output format often doesn't match what the automated solution check from the course is expecting.
This
count = {}
for symbol in "ACGT":
    count[symbol] = [1 for j in range(k)]
can be changed to a comprehension as follows
count = {symbol:[1 for j in range(k)] for symbol in "ACGT"}
and then further simplified, using Python's ability to multiply a list by an integer, to
count = {symbol:[1]*k for symbol in "ACGT"}
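A side note on why `[1]*k` is safe here: the comprehension evaluates `[1]*k` once per key, so each nucleotide gets its own list. A minimal sketch showing the difference (the `shared` variant built with `dict.fromkeys` is only there for contrast and is not part of the question):

```python
k = 6

# The comprehension builds a fresh list for every key.
count = {symbol: [1] * k for symbol in "ACGT"}
count["A"][0] += 1
print(count["A"][0], count["C"][0])  # 2 1

# Contrast: dict.fromkeys reuses ONE list object for all keys,
# so an update made through one key shows up under every key.
shared = dict.fromkeys("ACGT", [1] * k)
shared["A"][0] += 1
print(shared["A"][0], shared["C"][0])  # 2 2
```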
Compressing the first loop:
count = {symbol: [1 for j in range(k)] for symbol in "ACGT"}
This construct is called a dict comprehension (not a generator): it builds a dict from a for loop in a single expression.
I'm not sure you can compress the second (nested) loop, since it's not generating anything, but changing the first dict.
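For what it's worth, the second loop can also be folded into a comprehension if the counting is rephrased per column instead of as an in-place update. A minimal sketch, assuming all motifs have the same length (the snake_case function name is my own choice, not the name the course checker expects):

```python
def count_with_pseudocounts(motifs):
    # zip(*motifs) yields one tuple per column, e.g. ('A', 'C', 'C', 'G', 'T').
    columns = list(zip(*motifs))
    # For each nucleotide, count it in every column and add the pseudocount 1.
    return {symbol: [column.count(symbol) + 1 for column in columns]
            for symbol in "ACGT"}

motifs = ["AACGTA", "CCCGTT", "CACCTT", "GGATTA", "TTCCGG"]
count = count_with_pseudocounts(motifs)
print(count["A"])  # [2, 3, 2, 1, 1, 3]
```

This scans each column once per nucleotide, which is fine for short motifs but does more work than the original nested loop.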
You can compress your code a lot using collections.Counter and collections.defaultdict:
from collections import Counter, defaultdict
out = defaultdict(list)
bases = 'ACGT'
for m in zip(*Motifs):
    c = Counter(m)
    for b in bases:
        out[b].append(c[b]+1)
dict(out)
output:
{'A': [2, 3, 2, 1, 1, 3],
'C': [3, 2, 5, 3, 1, 1],
'G': [2, 2, 1, 3, 2, 2],
'T': [2, 2, 1, 2, 5, 3]}
You can use collections.Counter:
from collections import Counter
m = ['AACGTA', 'CCCGTT', 'CACCTT', 'GGATTA', 'TTCCGG']
d = [Counter(i) for i in zip(*m)]
r = {a:[j.get(a, 0)+1 for j in d] for a in 'ACGT'}
Output:
{'A': [2, 3, 2, 1, 1, 3], 'C': [3, 2, 5, 3, 1, 1], 'G': [2, 2, 1, 3, 2, 2], 'T': [2, 2, 1, 2, 5, 3]}
Related
This problem seems really stupid but I can't get my head around it.
I have the following list:
a = [2, 1, 3, 1, 1, 2, 3, 2, 3]
I have to produce a second list of the same size, where each value is the number of times the corresponding item has appeared in the list up to that point. For example:
b = [1, 1, 1, 2, 3, 2, 2, 3, 3]
So b[0] = 1 because it's the first time the item 2 appears in list a. b[5] = 2 and b[7] = 3 because those are the second and third times the item 2 appears in list a.
Here is a solution:
from collections import defaultdict
a = [2, 1, 3, 1, 1, 2, 3, 2, 3]
b = []
d = defaultdict(int)
for x in a:
    d[x] += 1
    b.append(d[x])
print(b)
Output:
[1, 1, 1, 2, 3, 2, 2, 3, 3]
I think using a dictionary might help you. Basically, I iterate over the list and store the current frequency of each number.
a = [2, 1, 3, 1, 1, 2, 3, 2, 3]
d = {}
z = []
for i in a:
    if i not in d:
        d[i] = 1
        z.append(1)
    else:
        d[i] += 1
        z.append(d[i])
print(z)
output = [1, 1, 1, 2, 3, 2, 2, 3, 3]
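If the running count is needed in more than one place, the same idea can be wrapped in a small generator. This is only a sketch; the name `running_counts` is hypothetical and it assumes the items are hashable:

```python
from collections import defaultdict

def running_counts(items):
    """Yield, for each item, how many times it has been seen so far."""
    seen = defaultdict(int)
    for item in items:
        seen[item] += 1
        yield seen[item]

a = [2, 1, 3, 1, 1, 2, 3, 2, 3]
b = list(running_counts(a))
print(b)  # [1, 1, 1, 2, 3, 2, 2, 3, 3]
```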
I'm coding in Python, I have an exercise like this :
long = [5, 2, 4]
number = [1, 2, 3]
ids = []
I want to have :
ids = [1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3]
I want to repeat 1 five times, 2 twice and 3 four times.
I don't know how to do it.
from collections import Counter
long = [5, 2, 4]
number = [1, 2, 3]
ids = list(Counter(dict(zip(number, long))).elements())
print(ids)
You can do it with a simple loop that iterates over (times-to-repeat, number) pairs and extends your output list with a generated list of numbers:
for times, n in zip(long, number):
    ids.extend([n] * times)
print(ids) # [1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3]
They've provided you with two very good solutions, but I'll leave the brute-force approach (hardly ever the best one) here, since it's probably the easiest one to understand:
long = [5, 2, 4]
number = [1, 2, 3]
ids = []
for i in range(len(long)):
    aux = 0
    while aux < long[i]:
        ids.append(number[i])
        aux += 1
print(ids)
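Two more compact variants of the same repetition, offered as a sketch and not as the one right answer: a nested list comprehension and an `itertools` version. The name `ids_alt` is introduced here only for illustration.

```python
from itertools import chain, repeat

long = [5, 2, 4]
number = [1, 2, 3]

# Nested comprehension: emit n once per repetition.
ids = [n for n, times in zip(number, long) for _ in range(times)]

# Same result with itertools: repeat(n, times) yields n exactly `times` times.
ids_alt = list(chain.from_iterable(repeat(n, times)
                                   for n, times in zip(number, long)))

print(ids)  # [1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3]
```

Unlike the `Counter(dict(zip(number, long)))` approach, both variants keep repeated values in `number` separate, because nothing is collapsed into dictionary keys along the way.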
I'd like to do the cartesian product of multiple dicts, based on their keys, and then sum the produced tuples, and return that as a dict. Keys that don't exist in one dict should be ignored (this constraint is ideal, but not necessary; i.e. you may assume all keys exist in all dicts if needed). Below is basically what I'm trying to achieve (example shown with two dicts). Is there a simpler way to do this, and with N dicts?
import itertools
from collections import defaultdict
def doProdSum(inp1, inp2):
    prod = defaultdict(lambda: 0)
    for key in set(list(inp1.keys())+list(inp2.keys())):
        if key not in prod:
            prod[key] = []
        # A key present in only one dict keeps that dict's values unchanged
        if key not in inp1 or key not in inp2:
            prod[key] = inp1[key] if key in inp1 else inp2[key]
            continue
        for values in itertools.product(inp1[key], inp2[key]):
            prod[key].append(values[0] + values[1])
    return prod
x = doProdSum({"a":[0,1,2],"b":[10],"c":[1,2,3,4]}, {"a":[1,1,1],"b":[1,2,3,4,5]})
print(x)
Output (as expected):
{'c': [1, 2, 3, 4], 'b': [11, 12, 13, 14, 15], 'a': [1, 1, 1, 2, 2, 2, 3, 3, 3]}
You can do it like this, by first reorganizing your data by key:
from collections import defaultdict
from itertools import product
def doProdSum(list_of_dicts):
    # We reorganize the data by key
    lists_by_key = defaultdict(list)
    for d in list_of_dicts:
        for k, v in d.items():
            lists_by_key[k].append(v)
    # lists_by_key looks like {'a': [[0, 1, 2], [1, 1, 1]], 'b': [[10], [1, 2, 3, 4, 5]], 'c': [[1, 2, 3, 4]]}
    # Then we generate the output
    out = {}
    for key, lists in lists_by_key.items():
        out[key] = [sum(prod) for prod in product(*lists)]
    return out
Example output:
list_of_dicts = [{"a":[0,1,2],"b":[10],"c":[1,2,3,4]}, {"a":[1,1,1],"b":[1,2,3,4,5]}]
doProdSum(list_of_dicts)
# {'a': [1, 1, 1, 2, 2, 2, 3, 3, 3],
# 'b': [11, 12, 13, 14, 15],
# 'c': [1, 2, 3, 4]}
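Since the question also asks about N dicts, here is a sketch of the same group-by-key idea applied to three inputs. The function name `prod_sum` and the sample data are made up for illustration; keys missing from some dicts are simply combined over the dicts that do have them.

```python
from collections import defaultdict
from itertools import product

def prod_sum(dicts):
    # Group the value lists by key across all input dicts.
    lists_by_key = defaultdict(list)
    for d in dicts:
        for key, values in d.items():
            lists_by_key[key].append(values)
    # Sum every combination drawn from the lists collected for each key.
    return {key: [sum(combo) for combo in product(*lists)]
            for key, lists in lists_by_key.items()}

result = prod_sum([{"a": [0, 1]}, {"a": [10, 20]}, {"a": [100], "b": [5]}])
print(result)  # {'a': [110, 120, 111, 121], 'b': [5]}
```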
I know how to count the frequency of elements in a list, but here's a slightly different question. I have a larger vocabulary and a few lists that each use only part of it. Using numbers instead of words as an example:
vocab=[1,2,3,4,5,6,7]
list1=[1,2,3,4]
list2=[2,3,4,5,6,6,7]
list3=[3,2,4,4,1]
and I want the output to keep "0"s when a word is not used:
count1=[1,1,1,1,0,0,0]
count2=[0,1,1,1,1,2,1]
count3=[1,1,1,2,0,0,0]
I guess I need to sort the words, but how do I keep the "0" records?
This can be done using the list object's built-in count method, within a list comprehension.
>>> vocab = [1, 2, 3, 4, 5, 6, 7]
>>> list1 = [1, 2, 3, 4]
>>> list2 = [2, 3, 4, 5, 6, 6, 7]
>>> list3 = [3, 2, 4, 4, 1]
>>> [list1.count(v) for v in vocab]
[1, 1, 1, 1, 0, 0, 0]
>>> [list2.count(v) for v in vocab]
[0, 1, 1, 1, 1, 2, 1]
>>> [list3.count(v) for v in vocab]
[1, 1, 1, 2, 0, 0, 0]
Iterate over each value in vocab, accumulating the frequencies for them.
You could also achieve this with the following (in Python 2; in Python 3, wrap the call in list() to get a list):
map(lambda v: list1.count(v), vocab)
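Note that `list.count` rescans the whole list for every vocabulary entry. For larger inputs, one alternative is to count once with `collections.Counter`, which returns 0 for items it has never seen. This is a sketch; the helper name `vocab_counts` is hypothetical:

```python
from collections import Counter

def vocab_counts(vocab, words):
    # Count every word once; a Counter returns 0 for missing keys.
    counts = Counter(words)
    return [counts[v] for v in vocab]

vocab = [1, 2, 3, 4, 5, 6, 7]
list2 = [2, 3, 4, 5, 6, 6, 7]
print(vocab_counts(vocab, list2))  # [0, 1, 1, 1, 1, 2, 1]
```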
I have a list hns = [[a,b,c],[c,b,a],[b,a,c]] where the positions give the rank in a particular sample, i.e. hns[0] is run 1, hns[1] is run 2 and hns[2] is run 3, and 'a' was ranked 1 in run 1, 3 in run 2 and 2 in run 3.
and another list hnday = [[a,1,2,3],[b,1,2,3],[c,1,2,3]]
So in hns, a is in position 0,0, then 1,2 and then 2,1, which in this problem means that its ranking is 1, 3, 2 respectively, and I need to end up with a table that reflects that:
hnday = [[a,1,3,2],[b,2,2,1],[c,3,1,3]]
Right now (because I am still stuck in for-loop thinking, as I am new to Python) it seems to me that I have to loop through hns and populate hnday as I go, taking the index value of, say, 'a' = 1 and updating hnday[0][1] = 1,
hnday[0][2] = 3 and hnday[0][3] = 2.
This doesn't seem a very Pythonic way to approach this, and I would like to know what other approach I could look at.
This is the most pythonic and beautiful way I can think of:
>>> hns=[['a','b','c'],['c','b','a'],['b','a','c']]
>>> keys = ['a','b','c']
>>> hnday = [[k]+[hns[i].index(k)+1 for i in range(len(hns))] for k in keys]
>>> hnday
[['a', 1, 3, 2], ['b', 2, 2, 1], ['c', 3, 1, 3]]
However, doesn't a dictionary seem most appropriate for the last expression?
With a dictionary you could easily access the rankings of a key with hnday[key], instead of iterating hnday.
It doesn't change much in the comprehension expression:
>>> hnday = {k:[hns[i].index(k)+1 for i in range(len(hns))] for k in keys}
>>> hnday
{'c': [3, 1, 3], 'b': [2, 2, 1], 'a': [1, 3, 2]}
>>> hnday['a']
[1, 3, 2]
>>> hnday['b']
[2, 2, 1]
>>> hnday['c']
[3, 1, 3]
I think you will get better performance if you do it this way:
hns = [['a','b','c'],['c','b','a'],['b','a','c']]
M = {}
for x in hns:
    for i, y in enumerate(x):
        if y in M:
            M[y].append(i+1)
        else:
            M[y] = [i+1]
print(M)
# {'a': [1, 3, 2], 'c': [3, 1, 3], 'b': [2, 2, 1]}
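The single-pass idea above can be written a little more compactly with `defaultdict` and `enumerate(..., start=1)`, and the dictionary can be turned back into the list-of-lists layout from the question if that is what is needed. A sketch, where the name `ranks` is my own and sorting the keys alphabetically is an assumption about the desired row order:

```python
from collections import defaultdict

hns = [['a', 'b', 'c'], ['c', 'b', 'a'], ['b', 'a', 'c']]

# One pass over every run; enumerate(start=1) gives 1-based ranks directly.
ranks = defaultdict(list)
for run in hns:
    for rank, item in enumerate(run, start=1):
        ranks[item].append(rank)

# Convert to the table layout from the question, one row per item.
hnday = [[item] + ranks[item] for item in sorted(ranks)]
print(hnday)  # [['a', 1, 3, 2], ['b', 2, 2, 1], ['c', 3, 1, 3]]
```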