Python: Elegantly merge dictionaries with sum() of values [duplicate]

This question already has answers here:
Is there any pythonic way to combine two dicts (adding values for keys that appear in both)?
(22 answers)
Closed 10 years ago.
I'm trying to merge logs from several servers. Each log is a list of tuples (date, count). date may appear more than once, and I want the resulting dictionary to hold the sum of all counts from all servers.
Here's my attempt, with some example data:
from collections import defaultdict

a = [("13.5", 100)]
b = [("14.5", 100), ("15.5", 100)]
c = [("15.5", 100), ("16.5", 100)]

input = [a, b, c]
output = defaultdict(int)
for d in input:
    for item in d:
        output[item[0]] += item[1]
print dict(output)
Which gives:
{'14.5': 100, '16.5': 100, '13.5': 100, '15.5': 200}
As expected.
I'm about to go bananas because of a colleague who saw the code. She insists that there must be a more Pythonic and elegant way to do it, without these nested for loops. Any ideas?

Doesn't get simpler than this, I think:
a=[("13.5",100)]
b=[("14.5",100), ("15.5", 100)]
c=[("15.5",100), ("16.5", 100)]
input=[a,b,c]
from collections import Counter
print sum(
(Counter(dict(x)) for x in input),
Counter())
Note that Counter (also known as a multiset) is the most natural data structure for your data: a kind of set whose elements can occur more than once, or equivalently, a map with semantics Element -> OccurrenceCount. You could have used it in the first place, instead of lists of tuples.
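For instance, a minimal sketch of what that could look like if each log arrived as a Counter from the start (hypothetical shape, same numbers as above):
from collections import Counter

# hypothetical: each server's log is a Counter instead of a list of tuples
log_a = Counter({"13.5": 100})
log_b = Counter({"14.5": 100, "15.5": 100})
log_c = Counter({"15.5": 100, "16.5": 100})

print(log_a + log_b + log_c)
# Counter({'15.5': 200, '13.5': 100, '14.5': 100, '16.5': 100})  (tie order may vary)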
The same merge is also possible with reduce:
from collections import Counter
from operator import add
print reduce(add, (Counter(dict(x)) for x in input))
Using reduce(add, seq) instead of sum(seq, initialValue) is generally more flexible and allows you to skip passing the redundant initial value.
Note that you could also use operator.and_ to find the intersection of the multisets instead of the sum.
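For instance, a quick sketch of the intersection (the minimum of each shared count) on two small Counters:
from collections import Counter

x = Counter({"15.5": 100, "16.5": 100})
y = Counter({"15.5": 40, "17.5": 10})

print(x & y)  # keeps only keys present in both, with the smaller count
# Counter({'15.5': 40})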
The reduce(add, ...) variant above is terribly slow, however, because a new Counter is created on every step. Let's fix that.
We know that Counter+Counter returns a new Counter with merged data. This is OK, but we want to avoid extra creation. Let's use Counter.update instead:
update(self, iterable=None, **kwds)    unbound collections.Counter method
    Like dict.update() but add counts instead of replacing them.
    Source can be an iterable, a dictionary, or another Counter instance.
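A quick interactive check of that behaviour (a sketch):
>>> from collections import Counter
>>> c = Counter({'15.5': 100})
>>> c.update({'15.5': 100, '16.5': 100})
>>> c
Counter({'15.5': 200, '16.5': 100})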
That's what we want. Let's wrap it with a function compatible with reduce and see what happens.
def updateInPlace(a, b):
    a.update(b)
    return a

print reduce(updateInPlace, (Counter(dict(x)) for x in input))
This is only marginally slower than the OP's solution.
Benchmark: http://ideone.com/7IzSx (Updated with yet another solution, thanks to astynax)
(Also: if you desperately want a one-liner, you can replace updateInPlace with lambda x, y: x.update(y) or x, which works the same way and even proves to be a split second faster, but fails at readability. Don't :-))

from collections import Counter

a = [("13.5", 100)]
b = [("14.5", 100), ("15.5", 100)]
c = [("15.5", 100), ("16.5", 100)]

inp = [dict(x) for x in (a, b, c)]
count = Counter()
for y in inp:
    count += Counter(y)
print(count)
output:
Counter({'15.5': 200, '14.5': 100, '16.5': 100, '13.5': 100})
Edit:
As Duncan suggested, you can replace these three lines:
count = Counter()
for y in inp:
    count += Counter(y)
with the single line:
count = sum((Counter(y) for y in inp), Counter())

You could use itertools' groupby:
from itertools import groupby, chain

a = [("13.5", 100)]
b = [("14.5", 100), ("15.5", 100)]
c = [("15.5", 100), ("16.5", 100)]

input = sorted(chain(a, b, c), key=lambda x: x[0])
output = {}
for k, g in groupby(input, key=lambda x: x[0]):
    output[k] = sum(x[1] for x in g)
print output
The use of groupby instead of two loops and a defaultdict will make your code clearer.
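For what it's worth, a slightly tighter sketch of the same approach using operator.itemgetter and a dict comprehension (Python 2.7+), with a, b, c as above:
from itertools import groupby, chain
from operator import itemgetter

merged = sorted(chain(a, b, c), key=itemgetter(0))
output = {k: sum(v for _, v in g) for k, g in groupby(merged, key=itemgetter(0))}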

You can use Counter or defaultdict, or you can try my variant:
def merge_with(d1, d2, fn=lambda x, y: x + y):
    res = d1.copy()                      # "= dict(d1)" for lists of tuples
    for key, val in d2.iteritems():      # ".. in d2" for lists of tuples
        try:
            res[key] = fn(res[key], val)
        except KeyError:
            res[key] = val
    return res
>>> merge_with({'a':1, 'b':2}, {'a':3, 'c':4})
{'a': 4, 'c': 4, 'b': 2}
Or even more generic:
def make_merger(fappend=lambda x, y: x + y, fempty=lambda x: x):
    def inner(*dicts):
        res = dict((k, fempty(v)) for k, v
                   in dicts[0].iteritems())        # ".. in dicts[0]" for lists of tuples
        for dic in dicts[1:]:
            for key, val in dic.iteritems():       # ".. in dic" for lists of tuples
                try:
                    res[key] = fappend(res[key], val)
                except KeyError:
                    res[key] = fempty(val)
        return res
    return inner
>>> make_merger()({'a':1, 'b':2}, {'a':3, 'c':4})
{'a': 4, 'c': 4, 'b': 2}
>>> appender = make_merger(lambda x, y: x + [y], lambda x: [x])
>>> appender({'a':1, 'b':2}, {'a':3, 'c':4}, {'b':'BBB', 'c':'CCC'})
{'a': [1, 3], 'c': [4, 'CCC'], 'b': [2, 'BBB']}
Also you can subclass the dict and implement a __add__ method:
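A minimal sketch of what such a subclass could look like (SumDict is a hypothetical name, not from the original answer):
class SumDict(dict):
    def __add__(self, other):
        result = SumDict(self)  # start from a copy of self
        for key, val in other.items():
            result[key] = result.get(key, 0) + val
        return result

>>> SumDict({'a': 1, 'b': 2}) + SumDict({'a': 3, 'c': 4})
{'a': 4, 'c': 4, 'b': 2}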

Related

Assign values to a list of variables in Python

I have made a small demo of a more complex problem:
def f(a):
    return tuple([x for x in range(a)])

d = {}
[d['1'], d['2']] = f(2)
print d
# {'1': 0, '2': 1}
# Works
Now suppose the keys are programmatically generated. How do I achieve the same thing in that case?
n = 10
l = [x for x in range(n)]
[d[x] for x in l] = f(n)
print d
# SyntaxError: can't assign to list comprehension
You can't; the target list is a syntactic feature of the assignment statement, not an expression you can build dynamically. A list comprehension is an expression, so it can't appear on the left-hand side of an assignment.
If you have some function results f() and a list of keys keys, you can use zip to create an iterable of keys and results, and loop over them:
d = {}
for key, value in zip(keys, f()):
    d[key] = value
That is easily rewritten as a dict comprehension:
d = {key: value for key, value in zip(keys, f())}
Or, in this specific case, as mentioned by @JonClements, even as
d = dict(zip(keys, f()))
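Applied to the question's programmatically generated keys, that might look like this (a sketch, with f as defined in the question):
n = 10
keys = [x for x in range(n)]
d = dict(zip(keys, f(n)))  # pairs each key with the corresponding element of f(n)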

How to add dictionary keys with defined values to a list

I'm trying to add only the keys with a value >= n to my list; however, I can't pass .keys() an argument.
n = 2
dict = {'a': 1, 'b': 2, 'c': 3}
for i in dict:
if dict[i] >= n:
list(dict.keys([i])
When I try this, it tells me I can't give .keys() an argument. But if I remove the argument, all keys are added, regardless of value.
Any help?
You don't need to call the .keys() method at all: the for loop already iterates over data_dict's keys.
n = 2
data_dict = {'a': 1, 'b': 2, 'c': 3}
lst = []
for i in data_dict:
    if data_dict[i] >= n:
        lst.append(i)
print lst
Results:
['c', 'b']
You can also achieve this using a list comprehension:
result = [k for k, v in data_dict.iteritems() if v >= 2]
print result
You should read this: Iterating over Dictionaries.
Try using filter:
filtered_keys = filter(lambda x: d[x] >= n, d.keys())
Or using list comprehension:
filtered_keys = [x for x in d.keys() if d[x] >= n]
The error in your code is that dict.keys takes no arguments and returns all keys, as the docs say:
Return a copy of the dictionary's list of keys.
What you want is one key at a time, which a list comprehension gives you. Also, since filtering is basically what you are doing, consider using the appropriate built-in (filter).

How to quickly get a list of keys from dict

I construct a dictionary from an excel sheet and end up with something like:
d = {('a','b','c'): val1, ('a','d'): val2}
The tuples I use as keys contain a handful of values, the goal is to get a list of these values which occur more than a certain number of times.
I've tried two solutions, both of which take entirely too long.
Attempt 1, simple list comprehension filter:
keyList = []
for k in d.keys():
    keyList.extend(list(k))
# The script makes it to here before hanging
commonkeylist = [key for key in keyList if keyList.count(key) > 5]
This takes forever since list.count() traverses the whole list on each iteration of the comprehension.
Attempt 2, create a count dictionary
keyList = []
keydict = {}
for k in d.keys():
    keyList.extend(list(k))
# The script makes it to here before hanging
for k in keyList:
    if k in keydict.keys():
        keydict[k] += 1
    else:
        keydict[k] = 1
commonkeylist = [k for k in keyList if keydict[k] > 50]
I thought this would be faster since we only traverse all of keyList a handful of times, but it still hangs the script.
What other steps can I take to improve the efficiency of this operation?
Use collections.Counter() and a generator expression:
from collections import Counter
counts = Counter(item for key in d for item in key)
commonkeylist = [item for item, count in counts.most_common() if count > 50]
where iterating over the dictionary directly yields the keys without creating an intermediary list object.
Demo with a lower count filter:
>>> from collections import Counter
>>> d = {('a','b','c'): 'val1', ('a','d'): 'val2'}
>>> counts = Counter(item for key in d for item in key)
>>> counts
Counter({'a': 2, 'c': 1, 'b': 1, 'd': 1})
>>> [item for item, count in counts.most_common() if count > 1]
['a']
I thought this would be faster since we only traverse all of keyList a handful of times, but it still hangs the script.
That's because you're still doing an O(n) search: in Python 2, keydict.keys() builds a list of all the keys, and the in test scans that list. Replace this:
for k in keyList:
    if k in keydict.keys():
with this:
for k in keyList:
    if k in keydict:
and see if that helps your 2nd attempt perform better.
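For reference, a sketch of the same tally using collections.defaultdict, which sidesteps the membership test entirely (keyList as in the question):
from collections import defaultdict

keydict = defaultdict(int)
for k in keyList:
    keydict[k] += 1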

creating cumulative percentage from a dictionary of data

Given a dictionary (or Counter) of tally data like the following:
d={'dan':7, 'mike':2, 'john':3}
and a new dictionary "d_cump" that I want to contain cumulative percentages
d_cump={'mike':16, 'john':41, 'dan':100}
EDIT: I should clarify that order doesn't matter for my input set, which is why I'm using a dictionary or Counter. Order does matter when calculating cumulative percentages, so I need to sort the data for that operation; once I have the cumulative percentage for each name, I put it back in a dictionary, since, again, order shouldn't matter when I'm looking at single values.
What is the most elegant/pythonic way to get from d to d_cump?
Here is what I have; it seems a bit clumsy:
from numpy import cumsum

d = {'dan': 7, 'mike': 2, 'john': 3}
sorted_keys = sorted(d, key=lambda x: d[x])
perc = [x * 100 / sum(d.values()) for x in cumsum([d[x] for x in sorted_keys])]
d_cump = dict(zip(sorted_keys, perc))
>>> d_cump
{'mike': 16, 'john': 41, 'dan': 100}
It's hard to tell how a cumulative percentage would be valuable considering the order of the original dictionary is arbitrary.
That said, here's how I would do it:
from numpy import cumsum
from operator import itemgetter
d={'dan':7, 'mike':2, 'john':3}
#unzip keys from values in a sorted order
keys, values = zip(*sorted(d.items(), key=itemgetter(1)))
total = sum(values)
# calculate cumsum and zip with keys into new dict
d_cump = dict(zip(keys, (100*subtotal/total for subtotal in cumsum(values))))
Note that there is no special order to the results because dictionaries are not ordered:
{'dan': 100, 'john': 41, 'mike': 16}
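If you do want the result to keep that sorted order, one sketch is collections.OrderedDict (Python 2.7+), reusing keys, values and total from above:
from collections import OrderedDict

d_cump = OrderedDict(zip(keys, (100 * subtotal / total for subtotal in cumsum(values))))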
Since you're using numpy anyway, you can bypass/simplify the list comprehensions:
>>> from numpy import cumsum
>>> d={'dan':7, 'mike':2, 'john':3}
>>> sorted_keys = sorted(d,key=d.get)
>>> z = cumsum(sorted(d.values())) # or z = cumsum([d[k] for k in sorted_keys])
>>> d2 = dict(zip(sorted_keys, 100.0*z/z[-1]))
>>> d2
{'mike': 16, 'john': 41, 'dan': 100}
but as noted elsewhere, it feels weird to be using a dictionary this way.
Calculating a cumulative value? Sounds like a fold to me!
d = {'dan':7, 'mike':2, 'john':3}
denominator = float(sum(d.viewvalues()))
data = ((k,(v/denominator)) for (k, v) in sorted(d.viewitems(), key = lambda (k,v):v))
import functional
f = lambda (k,v), l : [(k, v+l[0][1])]+l
functional.foldr(f, [(None,0)], [('a', 1), ('b', 2), ('c', 3)])
#=>[('a', 6), ('b', 5), ('c', 3), (None, 0)]
d_cump = { k:v for (k,v) in functional.foldr(f, [(None,0)], data) if k is not None }
Functional isn't a built-in package. You could also re-jig f to work with a left fold, and hence the standard reduce, if you wanted.
As you can see, this isn't much shorter, but it takes advantage of sequence destructuring to avoid splitting/zipping, and it uses a generator as the intermediate data, which avoids building a list.
If you want to further minimise object creation, you can use this alternative function which modifies the initial list passed in (but has to use a stupid trick to return the appropriate value, because list.append returns None).
uni = lambda x:x
ff = lambda (k,v), l : uni(l) if l.insert(0, (k, v+l[0][1])) is None else uni(l)
Incidentally, the left fold is very easy using ireduce (from this page http://www.ibm.com/developerworks/linux/library/l-cpyiter/index.html ), because it eliminates the list construction:
ff = lambda (l, ll), (k, v): (k, v + ll)
g = ireduce(ff, data, (None, 0))
tuple(g)
#=>(('mike', 0.16666666666666666), ('john', 0.41666666666666663), ('dan', 1.0))
def ireduce(func, iterable, init=None):
    if init is None:
        iterable = iter(iterable)
        curr = iterable.next()
    else:
        curr = init
    for x in iterable:
        curr = func(curr, x)
        yield curr
This is attractive because the initial value is not included, and because generators are lazy, and so particularly suitable for chaining.
Note that ireduce above is equivalent to:
def ireduce(func, iterable, init=None):
    from functional import scanl
    if init is None:
        init = next(iterable)
    sg = scanl(func, init, iterable)
    next(sg)
    return sg
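For comparison, a plain-stdlib sketch of the same computation with an explicit running total (no numpy, no functional):
d = {'dan': 7, 'mike': 2, 'john': 3}

total = sum(d.values())
running = 0
d_cump = {}
for key in sorted(d, key=d.get):  # ascending by tally
    running += d[key]
    d_cump[key] = 100 * running // total  # integer percentages, as in the question
print(d_cump)
# {'mike': 16, 'john': 41, 'dan': 100}  (display order may vary)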

How to write a function that takes a string and prints the letters in decreasing order of frequency?

I got this far:
def most_frequent(string):
    d = dict()
    for key in string:
        if key not in d:
            d[key] = 1
        else:
            d[key] += 1
    return d

print most_frequent('aabbbc')
Returning:
{'a': 2, 'c': 1, 'b': 3}
Now I need to:
1. reverse the pair
2. sort by number in decreasing order
3. only print the letters out
Should I convert this dictionary to tuples or list?
Here's a one-line answer:
sortedLetters = sorted(d.iteritems(), key=lambda (k,v): (v,k))
This should do it nicely.
def frequency_analysis(string):
    d = dict()
    for key in string:
        d[key] = d.get(key, 0) + 1
    return d

def letters_in_order_of_frequency(string):
    frequencies = frequency_analysis(string)
    # the number of distinct letters is bounded, so this list stays small
    # no matter how long the input is
    frequency_list = [(freq, letter) for (letter, freq) in frequencies.iteritems()]
    frequency_list.sort(reverse=True)
    return [letter for freq, letter in frequency_list]

string = 'aabbbc'
print letters_in_order_of_frequency(string)
Here is something that returns a list of tuples rather than a dictionary:
import operator

if __name__ == '__main__':
    test_string = 'cnaa'
    string_dict = dict()
    for letter in test_string:
        if letter not in string_dict:
            string_dict[letter] = test_string.count(letter)
    # Sort dictionary by values; credits go here:
    # http://stackoverflow.com/questions/613183/sort-a-dictionary-in-python-by-the-value/613218#613218
    ordered_answer = sorted(string_dict.items(), key=operator.itemgetter(1), reverse=True)
    print ordered_answer
Python 2.7 supports this use case directly:
>>> from collections import Counter
>>> Counter('abracadabra').most_common()
[('a', 5), ('r', 2), ('b', 2), ('c', 1), ('d', 1)]
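To get just the letters, in decreasing order of frequency, a one-line sketch on top of that:
>>> [letter for letter, count in Counter('aabbbc').most_common()]
['b', 'a', 'c']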
chills42's lambda function wins, I think, but as an alternative, how about generating the dictionary with the counts as the keys instead?
def count_chars(string):
    distinct = set(string)
    dictionary = {}
    for s in distinct:
        num = len(string.split(s)) - 1
        dictionary[num] = s   # note: letters with the same count overwrite each other
    return dictionary

def print_dict_in_reverse_order(d):
    _list = d.keys()
    _list.sort()
    _list.reverse()
    for s in _list:
        print d[s]
EDIT: This will do what you want. I'm stealing chills42's line and adding another:
sortedLetters = sorted(d.iteritems(), key=lambda (k,v): (v,k))
sortedString = ''.join([c[0] for c in reversed(sortedLetters)])
------------original answer------------
To print out the sorted string, add another line to chills42's one-liner:
''.join(map(lambda c: str(c[0]*c[1]), reversed(sortedLetters)))
This prints out 'bbbaac'
If you want single letters, 'bac' use this:
''.join([c[0] for c in reversed(sortedLetters)])
from collections import defaultdict

def most_frequent(s):
    d = defaultdict(int)
    for c in s:
        d[c] += 1
    return "".join([
        k for k, v in sorted(
            d.iteritems(), reverse=True, key=lambda (k, v): v)
    ])
EDIT:
Here is my one-liner:
def most_frequent(s):
    return "".join([
        c for frequency, c in sorted(
            [(s.count(c), c) for c in set(s)], reverse=True
        )
    ])
Here's the code for your most_frequent function:
>>> a = 'aabbbc'
>>> {i: a.count(i) for i in set(a)}
{'a': 2, 'c': 1, 'b': 3}
This particular syntax is for py3k, but it's easy to write something similar using the syntax of previous versions. It seems to me a bit more readable than yours.
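For older Pythons, presumably the closest equivalent is dict() over a generator expression:
>>> a = 'aabbbc'
>>> dict((i, a.count(i)) for i in set(a))
{'a': 2, 'c': 1, 'b': 3}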
def reversedSortedFrequency(string):
    from collections import defaultdict
    d = defaultdict(int)
    for c in string:
        d[c] += 1
    return sorted([(v, k) for k, v in d.items()], key=lambda (k, v): -k)
Here is the fixed version (thank you for pointing out bugs)
def frequency(s):
    return ''.join(
        [k for k, v in
         sorted(
             reduce(
                 lambda d, c: d.update([[c, d.get(c, 0) + 1]]) or d,
                 list(s),
                 dict()).items(),
             lambda a, b: cmp(a[1], b[1]),
             reverse=True)])
I think the use of reduce makes the difference in this solution compared to the others...
In action:
>>> from frequency import frequency
>>> frequency('abbbccddddxxxyyyyyz')
'ydbxcaz'
This includes extracting the keys (and counting them) as well!!! Another nice property is the initialization of the dictionary on the same line :)
Also: no includes, just builtins.
The reduce function is kinda hard to wrap my head around, and setting dictionary values in a lambda is also a bit cumbersome in python, but, ah well, it works!
