Sum tuples of cartesian product of arbitrary number of dicts - python

I'd like to do the cartesian product of multiple dicts, based on their keys, and then sum the produced tuples, and return that as a dict. Keys that don't exist in one dict should be ignored (this constraint is ideal, but not necessary; i.e. you may assume all keys exist in all dicts if needed). Below is basically what I'm trying to achieve (example shown with two dicts). Is there a simpler way to do this, and with N dicts?
import itertools
from collections import defaultdict
def doProdSum(inp1, inp2):
    prod = defaultdict(lambda: 0)
    for key in set(list(inp1.keys())+list(inp2.keys())):
        if key not in prod:
            prod[key] = []
        if key not in inp1 or key not in inp2:
            prod[key] = inp1[key] if key in inp1 else inp2[key]
            continue
        for values in itertools.product(inp1[key], inp2[key]):
            prod[key].append(values[0] + values[1])
    return prod
x = doProdSum({"a":[0,1,2],"b":[10],"c":[1,2,3,4]}, {"a":[1,1,1],"b":[1,2,3,4,5]})
print(x)
Output (as expected):
{'c': [1, 2, 3, 4], 'b': [11, 12, 13, 14, 15], 'a': [1, 1, 1, 2, 2, 2, 3, 3, 3]}

You can do it like this, by first reorganizing your data by key:
from collections import defaultdict
from itertools import product
def doProdSum(list_of_dicts):
    # We reorganize the data by key
    lists_by_key = defaultdict(list)
    for d in list_of_dicts:
        for k, v in d.items():
            lists_by_key[k].append(v)
    # lists_by_key looks like {'a': [[0, 1, 2], [1, 1, 1]], 'b': [[10], [1, 2, 3, 4, 5]], 'c': [[1, 2, 3, 4]]}
    # Then we generate the output
    out = {}
    for key, lists in lists_by_key.items():
        out[key] = [sum(prod) for prod in product(*lists)]
    return out
Example output:
list_of_dicts = [{"a":[0,1,2],"b":[10],"c":[1,2,3,4]}, {"a":[1,1,1],"b":[1,2,3,4,5]}]
doProdSum(list_of_dicts)
# {'a': [1, 1, 1, 2, 2, 2, 3, 3, 3],
# 'b': [11, 12, 13, 14, 15],
# 'c': [1, 2, 3, 4]}
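To check that this approach really does extend past two dicts, here is a minimal self-contained sketch with three inputs. The name prod_sum is a hypothetical rename of the function above (rewritten with a dict comprehension), and the input values are made up for illustration.

```python
from collections import defaultdict
from itertools import product

def prod_sum(dicts):
    # Group the value lists by key across all input dicts.
    lists_by_key = defaultdict(list)
    for d in dicts:
        for k, v in d.items():
            lists_by_key[k].append(v)
    # Sum every tuple of the cartesian product of the lists found for each key.
    return {k: [sum(t) for t in product(*lists)] for k, lists in lists_by_key.items()}

result = prod_sum([{"a": [0, 1]}, {"a": [10, 20]}, {"a": [100], "b": [5]}])
print(result)  # {'a': [110, 120, 111, 121], 'b': [5]}
```

A key that appears in only one dict ("b" here) passes through unchanged, because the product of a single list yields 1-tuples.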

Related

Dictionary comprehension with multiple values for each key

I'm doing a course in bioinformatics. We were supposed to create a function that takes a list of strings like this:
Motifs =[
"AACGTA",
"CCCGTT",
"CACCTT",
"GGATTA",
"TTCCGG"]
and turn it into a count matrix that counts the occurrence of the nucleotides (the letters A, C, G and T) in each column and adds a pseudocount 1 to it, represented by a dictionary with multiple values for each key like this:
count ={
'A': [2, 3, 2, 1, 1, 3],
'C': [3, 2, 5, 3, 1, 1],
'G': [2, 2, 1, 3, 2, 2],
'T': [2, 2, 1, 2, 5, 3]}
For example A occurs 1 + 1 pseudocount = 2 in the first column. C appears 2 + 1 pseudocount = 3 in the fourth column.
Here is my solution:
def CountWithPseudocounts(Motifs):
    t = len(Motifs)
    k = len(Motifs[0])
    count = {}
    for symbol in "ACGT":
        count[symbol] = [1 for j in range(k)]
    for i in range(t):
        for j in range(k):
            symbol = Motifs[i][j]
            count[symbol][j] += 1
    return count
The first set of for loops generates a dictionary with the keys A,C,G,T and the initial values 1 for each column like this:
count ={
'A': [1, 1, 1, 1, 1, 1],
'C': [1, 1, 1, 1, 1, 1],
'G': [1, 1, 1, 1, 1, 1],
'T': [1, 1, 1, 1, 1, 1]}
The second set of for loops counts the occurrence of the nucleotides and adds it to the values of the existing dictionary as seen above.
This works and does its job, but I want to know how to further compress both for loops using dict comprehensions.
NOTE:
I am fully aware that there are a multitude of modules and libraries like biopython, scipy and numpy that can probably turn the entire function into a one-liner. The problem with modules is that their output format often doesn't match what the course's automated solution check expects.
This
count = {}
for symbol in "ACGT":
    count[symbol] = [1 for j in range(k)]
can be changed to a comprehension as follows
count = {symbol:[1 for j in range(k)] for symbol in "ACGT"}
and then further simplified, using Python's ability to multiply a list by an integer, to
count = {symbol:[1]*k for symbol in "ACGT"}
Compressing the first loop:
count = {symbol: [1 for j in range(k)] for symbol in "ACGT"}
This construct is called a dict comprehension (not a generator): it builds a dict from a for loop in a single expression.
I'm not sure you can compress the second (nested) loop the same way, since it doesn't generate anything; it mutates the dict built by the first loop.
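For what it's worth, the second loop can be folded in if the counting is re-expressed as a per-column sum instead of an in-place update. A minimal sketch, assuming all motifs have the same length and contain only A, C, G and T (count_with_pseudocounts is a hypothetical name, not the one the course expects):

```python
def count_with_pseudocounts(motifs):
    k = len(motifs[0])
    # For each symbol and column j: pseudocount 1 plus the number of motifs
    # whose j-th character equals that symbol (True counts as 1 in sum).
    return {
        symbol: [1 + sum(motif[j] == symbol for motif in motifs) for j in range(k)]
        for symbol in "ACGT"
    }

motifs = ["AACGTA", "CCCGTT", "CACCTT", "GGATTA", "TTCCGG"]
count = count_with_pseudocounts(motifs)
print(count)
```

The trade-off is that the motifs are rescanned once per symbol, which is harmless for four symbols but less direct than the original nested loop.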
You can compress your code a lot using collections.Counter and collections.defaultdict:
from collections import Counter, defaultdict
out = defaultdict(list)
bases = 'ACGT'
for m in zip(*Motifs):
    c = Counter(m)
    for b in bases:
        out[b].append(c[b]+1)
dict(out)
output:
{'A': [2, 3, 2, 1, 1, 3],
'C': [3, 2, 5, 3, 1, 1],
'G': [2, 2, 1, 3, 2, 2],
'T': [2, 2, 1, 2, 5, 3]}
You can use collections.Counter:
from collections import Counter
m = ['AACGTA', 'CCCGTT', 'CACCTT', 'GGATTA', 'TTCCGG']
d = [Counter(i) for i in zip(*m)]
r = {a:[j.get(a, 0)+1 for j in d] for a in 'ACGT'}
Output:
{'A': [2, 3, 2, 1, 1, 3], 'C': [3, 2, 5, 3, 1, 1], 'G': [2, 2, 1, 3, 2, 2], 'T': [2, 2, 1, 2, 5, 3]}

Python Dictionary Filtration with items as a list

I have several lists as items in my dictionary. I want to create a dictionary with the same keys, but only with items that correspond to the unique values of the list in the first key. What's the best way to do this?
Original:
d = {'s': ['a','a','a','b','b','b','b'],
'd': ['c1','d2','c3','d4','c5','d6','c7'],
'g': ['e1','f2','e3','f4','e5','f6','e7']}
Output:
e = {'s': ['a','a','a'],
'd': ['c1','d2','c3'],
'g': ['e1','f2','e3']}
f = {'s': ['b','b','b','b'],
'd': ['d4','c5','d6','c7'],
'g': ['f4','e5','f6','e7']}
I don't think there is an easy way to do this. I created a (not so) little function for you:
def func(entry):
    PARSING_KEY = "s"
    # check that entry is a dict of equal-length lists (optional)
    assert isinstance(entry, dict)
    for key in entry.keys():
        assert isinstance(entry[key], list)
    first_list = entry[PARSING_KEY]
    first_list_len = len(first_list)
    for key in entry.keys():
        assert len(entry[key]) == first_list_len
    # parsing: collect, for each distinct value, the indexes where it occurs
    output_list_index = []
    already_check = set()
    for index1, item1 in enumerate(first_list):
        if item1 not in already_check:
            output_list_index.append([])
            # start=index1 keeps index2 an absolute index into the full list
            for index2, item2 in enumerate(first_list[index1:], start=index1):
                if item2 == item1:
                    output_list_index[-1].append(index2)
            already_check.add(item1)
    # creating one dict per distinct value
    output_list = []
    for indexes in output_list_index:
        new_dict = {}
        for key, value in entry.items():
            new_dict[key] = [value[i] for i in indexes]
        output_list.append(new_dict)
    return output_list
Note that dicts did not guarantee any key order before Python 3.7, so there isn't a reliable "first key"; you have to hardcode the key you want to parse by (with the PARSING_KEY constant at the top of the function).
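A shorter variant of the same idea, as a sketch rather than a drop-in replacement: group the positions of each distinct value with a defaultdict, then slice every list by those positions. The name split_by_key and its key parameter are hypothetical, and the sketch assumes all lists have the same length.

```python
from collections import defaultdict

def split_by_key(data, key):
    # Map each distinct value in data[key] to the positions where it occurs.
    groups = defaultdict(list)
    for i, value in enumerate(data[key]):
        groups[value].append(i)
    # Build one dict per distinct value, keeping only those positions in every list.
    return [{k: [v[i] for i in idx] for k, v in data.items()} for idx in groups.values()]

d = {'s': ['a', 'a', 'a', 'b', 'b', 'b', 'b'],
     'd': ['c1', 'd2', 'c3', 'd4', 'c5', 'd6', 'c7'],
     'g': ['e1', 'f2', 'e3', 'f4', 'e5', 'f6', 'e7']}
parts = split_by_key(d, 's')
print(parts)
```

On Python 3.7+ the resulting dicts come out in the order the distinct values first appear, so parts[0] is the 'a' group and parts[1] is the 'b' group.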
original_dict = {
'a': [1, 3, 5, 8, 4, 2, 1, 2, 7],
'b': [4, 4, 4, 4, 4, 3],
'c': [822, 1, 'hello', 'world']
}
distinct_dict = {k: list(set(v)) for k, v in original_dict.items()}
distinct_dict
yields
{'a': [1, 2, 3, 4, 5, 7, 8], 'b': [3, 4], 'c': [1, 'hello', 'world', 822]}

Restructuring the hierarchy of dictionaries in Python?

If I have a nested dictionary in Python, is there any way to restructure it based on keys?
I'm bad at explaining, so I'll give a little example.
d = {'A':{'a':[1,2,3],'b':[3,4,5],'c':[6,7,8]},
'B':{'a':[7,8,9],'b':[4,3,2],'d':[0,0,0]}}
Re-organize like this
newd = {'a':{'A':[1,2,3],'B':[7,8,9]},
'b':{'A':[3,4,5],'B':[4,3,2]},
'c':{'A':[6,7,8]},
'd':{'B':[0,0,0]}}
Given some function with inputs like
def mysteryfunc(olddict,newkeyorder):
????
mysteryfunc(d,[1,0])
Here the [1,0] list means: put the dictionary's 2nd level of keys at the first level and the first level at the 2nd level. Obviously the values need to stay associated with their original key combination.
Edit:
Looking for an answer that covers the general case, with arbitrary unknown nested dictionary depth.
Input:
d = {'A':{'a':[1,2,3],'b':[3,4,5],'c':[6,7,8]},
'B':{'a':[7,8,9],'b':[4,3,2],'d':[0,0,0]}}
inner_dict = {}
for k, v in d.items():
    for ka, va in v.items():
        if ka not in inner_dict:
            # first time this inner key is seen: start a new dict for it
            inner_dict[ka] = {k: va}
        else:
            inner_dict[ka][k] = va
print(inner_dict)
Output:
{'a': {'A': [1, 2, 3], 'B': [7, 8, 9]},
'b': {'A': [3, 4, 5], 'B': [4, 3, 2]},
'c': {'A': [6, 7, 8]},
'd': {'B': [0, 0, 0]}}
You can use two for loops: one to iterate over each key-value pair, and a second to iterate over the nested dict. At each step of the second loop you can build your desired output:
from collections import defaultdict
new_dict = defaultdict(dict)
for k0, v0 in d.items():
    for k1, v1 in v0.items():
        new_dict[k1][k0] = v1
print(dict(new_dict))
output:
{'a': {'A': [1, 2, 3], 'B': [7, 8, 9]},
'b': {'A': [3, 4, 5], 'B': [4, 3, 2]},
'c': {'A': [6, 7, 8]},
'd': {'B': [0, 0, 0]}}
You can use recursion with a generator to handle input of arbitrary depth:
from collections import defaultdict
def paths(d, c = []):
    for a, b in d.items():
        yield from ([((c+[a])[::-1], b)] if not isinstance(b, dict) else paths(b, c+[a]))
def group(d):
    _d = defaultdict(list)
    for [a, *b], c in d:
        _d[a].append([b, c])
    return {a:b[-1][-1] if not b[0][0] else group(b) for a, b in _d.items()}
print(group(list(paths(d))))
Output:
{'a': {'A': [1, 2, 3], 'B': [7, 8, 9]}, 'b': {'A': [3, 4, 5], 'B': [4, 3, 2]}, 'c': {'A': [6, 7, 8]}, 'd': {'B': [0, 0, 0]}}
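The answers above always swap or fully reverse the levels. For the newkeyorder argument from the question, one possible sketch is to flatten the dict into (key path, value) pairs, permute each path, and rebuild. The names flatten and reorder are hypothetical, and the sketch assumes every leaf sits at the same depth, equal to len(order).

```python
def flatten(d, path=()):
    """Yield (key_path, leaf_value) pairs from a nested dict."""
    for key, value in d.items():
        if isinstance(value, dict):
            yield from flatten(value, path + (key,))
        else:
            yield path + (key,), value

def reorder(d, order):
    """Rebuild d with its key levels permuted according to order."""
    out = {}
    for path, leaf in flatten(d):
        new_path = [path[i] for i in order]
        node = out
        # Walk (and create as needed) the intermediate levels of the new path.
        for key in new_path[:-1]:
            node = node.setdefault(key, {})
        node[new_path[-1]] = leaf
    return out

d = {'A': {'a': [1, 2, 3], 'b': [3, 4, 5], 'c': [6, 7, 8]},
     'B': {'a': [7, 8, 9], 'b': [4, 3, 2], 'd': [0, 0, 0]}}
newd = reorder(d, [1, 0])
print(newd)
```

With order [1, 0] this reproduces the swap shown above; a three-level dict could be passed something like [2, 0, 1] to move the innermost keys to the top.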

same output in values for different keys in for-loop

I'm new to Python and was trying to create a new dictionary from a dictionary with values in list format. Below please find my code:
dataset = {
"a" : [3, 1, 7, 5, 9],
"b" : [4, 8, 7, 5, 3],
"c" : [3, 4, 1, 0, 0],
"d" : [0, 5, 1, 5, 5],
"e" : [3, 7, 5, 5, 1],
"f" : [3, 6, 9, 2, 0],
}
for v in dataset.values():
    total = (sum(v))
    print(total)
for k in dataset:
    print(k)
dict1 = {k:total for k in dataset}
print(dict1)
My expected result is {"a":25, "b":27, ..}.
Instead, when I run the code, the output is
{'a': 20, 'b': 20, 'c': 20, 'd': 20, 'e': 20, 'f': 20}
Which part of my code is wrong?
It's because total is set to the last value in your for v in dataset.values() loop. You should try
dict1={k: sum(v) for k,v in dataset.items()}
You are just storing the last total in your solution. Try doing this instead:
dataset = {
"a" : [3, 1, 7, 5, 9],
"b" : [4, 8, 7, 5, 3],
"c" : [3, 4, 1, 0, 0],
"d" : [0, 5, 1, 5, 5],
"e" : [3, 7, 5, 5, 1],
"f" : [3, 6, 9, 2, 0],
}
dict1 = {k: sum(v) for k, v in dataset.items()}
print(dict1)
'total' changes after each iteration. Every time you go through a value in your dictionary and take the sum, you get a new total. So when the loop ends, total is the sum of the last list in your dictionary.
You are making each key in your dictionary have the value of the total for the last list. Instead, you should assign the total for each key inside the loop you are calculating total.
Try this one:
dataset = {
"a" : [3, 1, 7, 5, 9],
"b" : [4, 8, 7, 5, 3],
"c" : [3, 4, 1, 0, 0],
"d" : [0, 5, 1, 5, 5],
"e" : [3, 7, 5, 5, 1],
"f" : [3, 6, 9, 2, 0],
}
result = {}
for (k, v) in dataset.items():
    result[k] = sum(v)
print(result)
output:
{'a': 25, 'c': 8, 'b': 27, 'e': 21, 'd': 16, 'f': 20}
You are setting total once for every list of values in dataset, but by the time the dict comprehension builds dict1, only the last calculated value of total is left, which happens to be 20. (Before Python 3.7, dict order was not guaranteed, so total could have ended with any of your sum values, depending on the order in which the values were visited; from 3.7 on, iteration follows insertion order, so it is always the sum for "f".)
Instead, you can do all of this in one line using items(), which yields pairs (i.e., 2-tuples) of corresponding keys and values. So you can rewrite your dict1 code like this:
dict1 = {k: sum(v) for (k, v) in dataset.items()}
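To see the difference side by side, here is a small sketch on a trimmed-down copy of the dataset (three keys instead of six, purely to keep it short; the names buggy and fixed are only for illustration):

```python
dataset = {"a": [3, 1, 7, 5, 9], "b": [4, 8, 7, 5, 3], "f": [3, 6, 9, 2, 0]}

# Buggy pattern: total is overwritten on every pass, so only the last sum survives.
for v in dataset.values():
    total = sum(v)
buggy = {k: total for k in dataset}

# Fixed pattern: compute each sum next to the key it belongs to.
fixed = {k: sum(v) for k, v in dataset.items()}

print(buggy)  # {'a': 20, 'b': 20, 'f': 20}
print(fixed)  # {'a': 25, 'b': 27, 'f': 20}
```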
When you run the following part of your code, the total is updated for every value in the dataset dictionary, and then the total is updated to the sum of the last element of the dictionary, i.e. [3, 6, 9, 2, 0], hence total=20
for v in dataset.values():
total = (sum(v))
print(total)
And then when you do dict1 = {k:total for k in dataset}, all values of dict1 are 20.
Two ways to simplify this.
Iterate through the input dictionary and update the output dictionary at the same time; this is the more efficient of the two:
dict1 = {}
for k, v in dataset.items():
    dict1[k] = sum(v)
print(dict1)
#{'a': 25, 'b': 27, 'c': 8, 'd': 16, 'e': 21, 'f': 20}
Iterate through the values and make a list of all the sums, then iterate through the keys and build the output dictionary. Note that this is needlessly roundabout, but it can be treated as an intermediate step between the code you currently have and the first approach.
total = []
dict1 = {}
#Create the list of sums
for v in dataset.values():
    total.append(sum(v))
idx = 0
#Use that list of sums to create the output dictionary
for k in dataset.keys():
    dict1[k] = total[idx]
    idx+=1
print(dict1)
#{'a': 25, 'b': 27, 'c': 8, 'd': 16, 'e': 21, 'f': 20}

Python dictionary sum values

I'm using Python 2.7 and I have a large dictionary that looks a little like this
{J: [92704, 238476902378, 32490872394, 234798327, 2390470], M: [32974097, 237407, 3248707, 32847987, 34879], Z: [8237, 328947, 239487, 234, 182673]}
How can I sum these by position to create a new dictionary that sums the first values of each list, then the second, etc.? Like
{FirstValues: J[0]+M[0]+Z[0]}
etc
In [4]: {'FirstValues': sum(e[0] for e in d.itervalues())}
Out[4]: {'FirstValues': 33075038}
where d is your dictionary.
print [sum(row) for row in zip(*yourdict.values())]
yourdict.values() gets all the lists, zip(* ) groups the first, second, etc items together and sum sums each group.
I don't know why you need a dictionary as output, but here it is:
dict(enumerate( [sum(x) for x in zip(*d.values())] ))
from itertools import izip_longest
totals = (sum(vals) for vals in izip_longest(*mydict.itervalues(), fillvalue=0))
print tuple(totals)
In English...
zip the lists (dict values) together, padding with 0 (if you want, you don't have to).
Sum each zipped group
For example,
mydict = {
'J': [1, 2, 3, 4, 5],
'M': [1, 2, 3, 4, 5],
'Z': [1, 2, 3, 4]
}
## When zipped becomes...
([1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4], [5, 5, 0])
## When summed becomes...
(3, 6, 9, 12, 10)
It does not really make sense to create a new dictionary, as the new keys are (probably) meaningless. The results don't relate to the original keys. A tuple is more appropriate, since results[0] holds the sum of all values at position 0 in the original dict values, and so on.
If you must have a dict, take the totals iterator and turn it into a dict thus:
new_dict = dict(('Values%d' % idx, val) for idx, val in enumerate(totals))
Say you have some dict like:
d = {'J': [92704, 238476902378, 32490872394, 234798327, 2390470],
'M': [32974097, 237407, 3248707, 32847987, 34879],
'Z': [8237, 328947, 239487, 234, 182673]}
Make a defaultdict(int) and accumulate the values by index:
from collections import defaultdict
sum_by_index = defaultdict(int)
for alist in d.values():
    for index,num in enumerate(alist):
        sum_by_index[index] += num
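The snippets in this thread target Python 2.7. For readers on Python 3, a rough equivalent of the zip-and-sum approach might look like the sketch below; it assumes Python 3, where izip_longest became itertools.zip_longest and itervalues() became values(). The 'Values%d' key names follow the earlier answer and are otherwise arbitrary.

```python
from itertools import zip_longest

mydict = {
    'J': [1, 2, 3, 4, 5],
    'M': [1, 2, 3, 4, 5],
    'Z': [1, 2, 3, 4],
}

# Zip the lists position by position, padding shorter lists with 0, then sum each group.
totals = [sum(vals) for vals in zip_longest(*mydict.values(), fillvalue=0)]
print(totals)  # [3, 6, 9, 12, 10]

# Optional: wrap the positional sums in a dict with generated key names.
new_dict = {'Values%d' % idx: val for idx, val in enumerate(totals)}
print(new_dict)
```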
