Finding the difference in value counts by keys in two Dictionaries - python

I have two sample Python dictionaries that count how many times each key appears in a DataFrame.
dict1 = {
    2000 : 2,
    3000 : 3,
    4000 : 4,
    5000 : 6,
    6000 : 8
}
dict2 = {
    4000 : 4,
    3000 : 3,
    2000 : 4,
    6000 : 10,
    5000 : 4
}
I would like to output the following where there is a difference.
diff = {
    2000 : 2,
    5000 : 2,
    6000 : 2
}
I would appreciate any help as I am not familiar with iterating through dictionaries. Even if the output only shows me at which key there is a difference in values, it would work for me. I did the following but it does not produce any output.
for (k,v), (k2,v2) in zip(dict1.items(), dict2.items()):
    if k == k2:
        if v == v2:
            pass
        else:
            print('value is different at k')

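Your approach doesn't work because the two dicts don't list their keys in the same order, so zip pairs up unrelated entries and k == k2 is almost never True. With your sample data the keys only line up for 3000, where the values happen to be equal, so nothing is printed.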
You could use a dict comprehension where you traverse dict1 and subtract the value in dict2 with the matching key:
diff = {k: abs(v - dict2[k]) for k, v in dict1.items()}
Output:
{2000: 2, 3000: 0, 4000: 0, 5000: 2, 6000: 2}
If you have Python >=3.8, and you want only key-value pairs where value > 0, then you could also use the walrus operator:
diff = {k: di for k, v in dict1.items() if (di := abs(v - dict2[k])) > 0}
Output:
{2000: 2, 5000: 2, 6000: 2}
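Both comprehensions above raise a KeyError if dict1 contains a key that dict2 lacks, and they ignore keys that exist only in dict2. Here is a minimal sketch that covers both cases, assuming a missing key should count as 0 (that treatment is an assumption, it is not stated in the question). The helper name count_diff is hypothetical.

```python
dict1 = {2000: 2, 3000: 3, 4000: 4, 5000: 6, 6000: 8}
dict2 = {4000: 4, 3000: 3, 2000: 4, 6000: 10, 5000: 4}

def count_diff(d1, d2):
    """Absolute difference per key; a key missing from one dict counts as 0 (assumption)."""
    all_keys = d1.keys() | d2.keys()  # union of the keys of both dicts
    diffs = {k: abs(d1.get(k, 0) - d2.get(k, 0)) for k in sorted(all_keys)}
    # keep only the keys whose counts actually differ
    return {k: v for k, v in diffs.items() if v}

diff = count_diff(dict1, dict2)
print(diff)
```

Because the lookup goes through the key instead of through the position, the order in which the two dicts were built no longer matters.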
Since you tagged it as pandas, you can also do a similar job in pandas as well.
First, we need to convert the dicts to DataFrame objects, then join them. Since join joins on the index by default and the indexes are the keys of the dicts, you get a DataFrame where you can directly find the difference row-wise. Then use the diff method along axis=1, followed by abs, to find the differences.
import pandas as pd
df1 = pd.DataFrame.from_dict(dict1, orient='index')
df2 = pd.DataFrame.from_dict(dict2, orient='index')
out = df1.join(df2, lsuffix='_x', rsuffix='').diff(axis=1).abs().dropna(axis=1)['0']
Also, instead of creating two DataFrames and joining them, we could also build a single DataFrame by passing a list of the dicts, and use similar methods to get the desired outcome:
out = pd.DataFrame.from_dict([dict1, dict2]).diff().dropna().abs().loc[1]
Output:
2000 2
3000 0
4000 0
5000 2
6000 2
Name: 0, dtype: int64

Since you're counting... how about using Counters?
from collections import Counter
c1, c2 = Counter(dict2), Counter(dict1)
diff = dict((c1-c2) + (c2-c1))
If you used Counters all around, you wouldn't need the conversions from dict and back. And maybe it would also simplify the creation of your two count dicts (Can't tell for sure since you didn't show how you created them).
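As a rough sketch of what "Counters all around" could look like: the two lists below are made-up stand-ins for the DataFrame columns being counted, since the question does not show how the count dicts were created.

```python
from collections import Counter

# hypothetical raw values; in the question these would come from the DataFrame
values_a = [2000, 2000, 3000, 5000, 5000, 5000]
values_b = [2000, 3000, 3000, 5000]

# Counter does the counting directly, no intermediate dict needed
c1, c2 = Counter(values_a), Counter(values_b)

# Counter subtraction drops non-positive results, so adding both
# directions yields the absolute difference for every key that differs
diff = (c1 - c2) + (c2 - c1)
print(dict(diff))
```

Keys with equal counts vanish from both subtractions, so the result contains only the keys where the counts differ.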

Related

Run calculation multiple times with different values

I have two dictionaries, that look like:
dict1 = {1: 10, 2: 23, .... 999: 12}
dict2 = {1: 42, 2: 90, .... 999: 78}
I want to perform a simple calculation: Multiply value of dict1 with value of dict2 for 1 and 2 each.
The code so far is:
dict1[1] * dict2[1]
This calculates 10*42, which is exactly what I want.
Now I want to perform this calculation for every index in the dictionary, so for 1 up to 999.
I tried:
i = {1,2,3,4,5,6 ... 999}
dict1[i] * dict2[i]
But it didn't work.
This creates a new dict with the results:
out = { i: dict1[i] * dict2[i] for i in range(1,1000) }
If you need to work with vectors and matrices take a look at the numpy module. It has data structures and a huge collection of tools for working with them.
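The comprehension above assumes both dicts contain every key from 1 to 999. If that is not guaranteed, a small variation (a sketch, with shortened made-up sample data) is to iterate over the keys the two dicts have in common:

```python
# shortened sample data; 998 is deliberately missing from dict1
dict1 = {1: 10, 2: 23, 999: 12}
dict2 = {1: 42, 2: 90, 998: 5, 999: 78}

# dict key views support set operations, so & gives the shared keys
common = dict1.keys() & dict2.keys()

# multiply the values only where both dicts have the key
out = {k: dict1[k] * dict2[k] for k in sorted(common)}
print(out)
```

This avoids a KeyError for keys that appear in only one of the dicts.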

Python summing up values in a nested dictionary

I have a dictionary P which represents a dictionary within a dictionary within a dictionary. It looks something like this.
P={key1:{keyA:{value1: 1, value2:3}, keyB:{value1:3,value2:4}},
key2:{keyA:{value1: 1, value2:3}, keyB:{value1:3,value2:4}}, key3{...:{...:}}}
What I am trying to do is to express each value of value1 and value2 as a percentage of the totalPopulation under whichever base key it belongs to.
For example key1 should look like
key1:{keyA:{value1: 1/(1+3+3+4), value2:3/(1+3+3+4)}, keyB:
{value1:3/(1+3+3+4),value2:4/(1+3+3+4)}
What I am not sure about is how to iterate over this dictionary and only collect the innermost values of a certain key so I can then sum up all the values and divide each value by that sum.
This can be done in single line using dict comprehension and map like this:
#from __future__ import division # use for Python 2.x
p = {"key1":{"keyA":{"value1": 1, "value2":3}, "keyB":{"value1":3,"value2":4}}}
p = {kOuter:{kInner:{kVal: vVal/sum(map(lambda x: sum(x.values()), vOuter.values())) for kVal, vVal in vInner.iteritems()} for kInner, vInner in vOuter.iteritems()} for kOuter, vOuter in p.iteritems()}
A more readable version of the above:
p = {
    kOuter: {
        kInner: {
            kVal: vVal/sum(map(lambda x: sum(x.values()), vOuter.values())) for kVal, vVal in vInner.iteritems()
        }
        for kInner, vInner in vOuter.iteritems()
    }
    for kOuter, vOuter in p.iteritems()
}
OUTPUT
>>> p
{'key1': {'keyB': {'value2': 0.36363636363636365, 'value1': 0.2727272727272727}, 'keyA': {'value2': 0.2727272727272727, 'value1': 0.09090909090909091}}}
The only problem with this is that the sum is recalculated for every innermost value. You can fix that by calculating the sum for each of key1, key2, ... before the dict comprehension and using the stored values instead, like this:
keyTotals = {kOuter:sum(map(lambda x: sum(x.values()), vOuter.values())) for kOuter, vOuter in p.iteritems()}
and then you can simply access the sums calculated above by keys, like this:
p = {kOuter:{kInner:{kVal: vVal/keyTotals[kOuter] for kVal, vVal in vInner.iteritems()} for kInner, vInner in vOuter.iteritems()} for kOuter, vOuter in p.iteritems()}
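The snippets above are written for Python 2 (iteritems). On Python 3 the same two-step idea could be sketched roughly as follows; the names key_totals and normalized are only illustrative.

```python
p = {"key1": {"keyA": {"value1": 1, "value2": 3},
              "keyB": {"value1": 3, "value2": 4}}}

# step 1: total of all innermost values under each outer key
key_totals = {
    k_outer: sum(sum(inner.values()) for inner in v_outer.values())
    for k_outer, v_outer in p.items()
}

# step 2: divide every innermost value by the total of its outer key
normalized = {
    k_outer: {
        k_inner: {k: v / key_totals[k_outer] for k, v in v_inner.items()}
        for k_inner, v_inner in v_outer.items()
    }
    for k_outer, v_outer in p.items()
}
print(normalized)
```

In Python 3 the / operator always performs true division, so the __future__ import is not needed.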
test = {"key1":{"keyA":{"value1": 1, "value2":3}, "keyB":{"value1":3,"value2":4}}}
for a in test:
    s = 0
    for b in test[a]:
        for c in test[a][b]:
            s += test[a][b][c]
    print(s)
    for b in test[a]:
        for c in test[a][b]:
            test[a][b][c] = test[a][b][c] / s
This should do what you want. I've only included "key1" in this example.

Dict Deconstruction and Reconstruction in Python

I have multiple dictionaries. There is a great deal of overlap between the dictionaries, but they are not identical.
a = {'a':1,'b':2,'c':3}
b = {'a':1,'c':3, 'd':4}
c = {'a':1,'c':3}
I'm trying to figure out how to break these down into the most primitive pieces and then reconstruct the dictionaries in the most efficient manner. In other words, how can I deconstruct and rebuild the dictionaries by typing each key/value pair the minimum number of times (ideally once). It also means creating the minimum number of sets that can be combined to create all possible sets that exist.
In the above example. It could be broken down into:
c = {'a':1,'c':3}
a = dict(c.items() + {'b':2})
b = dict(c.items() + {'d':4})
I'm looking for suggestions on how to approach this in Python.
In reality, I have roughly 60 dictionaries and many of them have overlapping values. I'm trying to minimize the number of times I have to type each k/v pair to minimize potential typo errors and make it easier to cascade update different values for specific keys.
An ideal output would be the most basic dictionaries needed to construct all dictionaries as well as the formula for reconstruction.
Here is a solution. It isn't the most efficient in any way, but it might give you an idea of how to proceed.
a = {'a':1,'b':2,'c':3}
b = {'a':1,'c':3, 'd':4}
c = {'a':1,'c':3}
class Cover:
    def __init__(self,*dicts):
        # Our internal representation is a link to any complete subsets, and then a dictionary of remaining elements
        mtx = [[-1,{}] for d in dicts]
        for i,dct in enumerate(dicts):
            for j,odct in enumerate(dicts):
                if i == j: continue # we're always a subset of ourself
                # if everybody in A is in B, create the reference
                if all( k in dct for k in odct.keys() ):
                    mtx[i][0] = j
                    dif = {key:value for key,value in dct.items() if key not in odct}
                    mtx[i][1].update(dif)
                    break
        for i,m in enumerate(mtx):
            if m[1] == {}: m[1] = dict(dicts[i].items())
        self.mtx = mtx
    def get(self, i):
        r = { key:val for key, val in self.mtx[i][1].items()}
        # -1 means "no subset found"; index 0 is a valid reference, hence >= 0
        if (self.mtx[i][0] >= 0): # if we had found a subset, add that
            r.update(self.mtx[self.mtx[i][0]][1])
        return r
cover = Cover(a,b,c)
print(a,b,c)
print('representation',cover.mtx)
# prints [[2, {'b': 2}], [2, {'d': 4}], [-1, {'a': 1, 'c': 3}]]
# The "-1" in the third element indicates this is a building block that cannot be reduced; the "2"s indicate that these should build from the element at index 2
print('a',cover.get(0))
print('b',cover.get(1))
print('c',cover.get(2))
The idea is very simple: if any of the maps are complete subsets, substitute a reference for the duplication. The compression could certainly backfire for certain matrix combinations, and can easily be improved upon.
Simple improvements
Changing the first line of get, or using the more concise dictionary addition code hinted at in the question, might immediately improve readability.
We don't check for the largest subset, which may be worthwhile.
The implementation is naive and makes no optimizations.
Larger improvements
One could also implement a hierarchical implementation in which "building block" dictionaries formed the root nodes and the tree was descended to build the larger dictionaries. This would only be beneficial if your data was hierarchical to start.
(Note: tested in python3)
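For the "dictionary addition" hinted at in the question: dict(c.items() + {'b':2}) does not run as written, but in Python 3 the same intent can be sketched with dict unpacking. The name base is only illustrative.

```python
# the shared building block, typed once
base = {'a': 1, 'c': 3}

# each dictionary is the building block plus its own extra pairs
a = {**base, 'b': 2}
b = {**base, 'd': 4}
c = dict(base)  # a copy, so later changes to c do not leak into base

print(a, b, c)
```

Changing a value in base and re-running the construction updates all three dictionaries at once.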
Below is a script that generates a script to reconstruct the dictionaries.
For example consider this dictionary of dictionaries:
>>>dicts
{'d2': {'k4': 'k4', 'k1': 'k1'},
'd0': {'k2': 'k2', 'k4': 'k4', 'k1': 'k1', 'k3': 'k3'},
'd4': {'k4': 'k4', 'k0': 'k0', 'k1': 'k1'},
'd3': {'k0': 'k0', 'k1': 'k1'},
'd1': {'k2': 'k2', 'k4': 'k4'}}
For clarity, we continue with sets because the association key value can be done elsewhere.
sets= {k:set(v.keys()) for k,v in dicts.items()}
>>>sets
{'d2': {'k1', 'k4'},
'd0': {'k1', 'k2', 'k3', 'k4'},
'd4': {'k0', 'k1', 'k4'},
'd3': {'k0', 'k1'},
'd1': {'k2', 'k4'}}
Now compute the distances (number of keys to add or/and remove to go from one dict to another):
import pandas as pd
df=pd.DataFrame(dicts)
charfunc=df.notnull()
distances=pd.DataFrame((charfunc.values.T[...,None] != charfunc.values).sum(1),
                       df.columns,df.columns)
>>>distances
d0 d1 d2 d3 d4
d0 0 2 2 4 3
d1 2 0 2 4 3
d2 2 2 0 2 1
d3 4 4 2 0 1
d4 3 3 1 1 0
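The distance used here is the size of the symmetric difference between two key sets. As a cross-check without pandas, it could be sketched like this (the helper name distance is hypothetical):

```python
from itertools import combinations

sets = {'d2': {'k1', 'k4'},
        'd0': {'k1', 'k2', 'k3', 'k4'},
        'd4': {'k0', 'k1', 'k4'},
        'd3': {'k0', 'k1'},
        'd1': {'k2', 'k4'}}

def distance(s1, s2):
    """Number of keys to add and/or remove to turn s1 into s2."""
    return len(s1 ^ s2)  # ^ is the symmetric difference

# nested dict acting as a symmetric distance table
distances = {name: {name: 0} for name in sets}
for x, y in combinations(sorted(sets), 2):
    d = distance(sets[x], sets[y])
    distances[x][y] = d
    distances[y][x] = d

print(distances['d0']['d3'], distances['d2']['d4'])
```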
Then the script that writes the script. The idea is to begin with the shortest set, and then at each step to construct the nearest set from those already built:
import numpy as np
script=open('script.py','w')
# idxmin returns the label of the shortest set (argmin returns a position in recent pandas)
dicoto=df.count().idxmin()
script.write('res={}\nres['+repr(dicoto)+']='+str(sets[dicoto])+'\ns=[\n')
done=[]
todo=df.columns.tolist()
while True :
    done.append(dicoto)
    todo.remove(dicoto)
    if not todo : break
    table=distances.loc[todo,done]
    ito,ifrom=np.unravel_index(table.values.argmin(),table.shape)
    dicofrom=table.columns[ifrom]
    setfrom=sets[dicofrom]
    dicoto=table.index[ito]
    setto=sets[dicoto]
    toadd=setto-setfrom
    toremove=setfrom-setto
    script.write(('('+repr(dicoto)+','+str(toadd)+','+str(toremove)+','
                  +repr(dicofrom)+'),\n').replace('set',''))
script.write("""]
for dt,ta,tr,df in s:
    d=res[df].copy()
    d.update(ta)
    for k in tr: d.remove(k)
    res[dt]=d
""")
script.close()
and the produced file script.py
res={}
res['d1']={'k2', 'k4'}
s=[
('d0',{'k1', 'k3'},(),'d1'),
('d2',{'k1'},{'k2'},'d1'),
('d4',{'k0'},(),'d2'),
('d3',(),{'k4'},'d4'),
]
for dt,ta,tr,df in s:
    d=res[df].copy()
    d.update(ta)
    for k in tr: d.remove(k)
    res[dt]=d
Test :
>>> %run script.py
>>> res==sets
True
With random dicts like these, the script size is about 80% of the sets' size for big dicts (Nd=Nk=100). With more overlap, the ratio would certainly be better.
Complement: a script to generate such dicts.
from pylab import *
import pandas as pd
Nd=5 # number of dicts
Nk=5 # number of keys per dict
index=['k'+str(j) for j in range(Nk)]
columns=['d'+str(i) for i in range(Nd)]
charfunc=pd.DataFrame(randint(0,2,(Nk,Nd)).astype(bool),index=index,columns=columns)
dicts={i : { j:j for j in charfunc.index if charfunc.loc[j,i]} for i in charfunc.columns}  # .loc replaces the removed .ix indexer

Summing up numbers in a defaultdict(list)

I've been experimenting trying to get this to work and I've exhausted every idea and web search. Nothing seems to do the trick. I need to sum the numbers in a defaultdict(list), and I just need the final result, but no matter what I do I can only get there by iterating and printing every running sum along the way. What I've been trying, generally:
d = { key : [1,2,3] }
running_total = 0
#Iterate values
for value in d.itervalues():
    #iterate through list inside value
    for x in value:
        running_total += x
        print running_total
The result is :
1,3,6
I understand it's doing this because it prints on every pass through the for loop. What I don't get is how else I can get to each of these list values without using a loop. Or is there some sort of method I've overlooked?
To be clear, I just want the final number returned, e.g. 6.
EDIT: I neglected a huge factor: the items in the list are timedelta objects, so I have to use .seconds to make them into integers for adding. The solutions below make sense and I've tried similar, but trying to throw the .seconds conversion into the sum statement throws an error.
d = { key : [timedelta_Obj1,timedelta_Obj2,timedelta_Obj3] }
I think this will work for you:
sum(td.seconds for sublist in d.itervalues() for td in sublist)
Try this approach:
from datetime import timedelta as TD
d = {'foo' : [TD(seconds=1), TD(seconds=2), TD(seconds=3)],
     'bar' : [TD(seconds=4), TD(seconds=5), TD(seconds=6), TD(seconds=7)],
     'baz' : [TD(seconds=8)]}
print sum(sum(td.seconds for td in values) for values in d.itervalues())
You could just sum each of the lists in the dictionary, then take one final sum of the returned list.
>>> d = {'foo' : [1,2,3], 'bar' : [4,5,6,7], 'foobar' : [10]}
# sum each value in the dictionary
>>> [sum(d[i]) for i in d]
[10, 6, 22]
# sum each of the sums in the list
>>> sum([sum(d[i]) for i in d])
38
If you don't want to iterate or to use comprehensions you can use this:
d = {'1': [1, 2, 3], '2': [3, 4, 5], '3': [5], '4': [6, 7]}
print(sum(map(sum, d.values())))
If you use Python 2 and your dict has a lot of keys, it is better to use imap (from itertools) and itervalues, which avoid building intermediate lists:
from itertools import imap
print sum(imap(sum, d.itervalues()))
Your question was how to get the value "without using a loop". Well, you can't. But there is one thing you can do: use the high performance itertools.
If you use chain you won't have an explicit loop in your code. chain manages that for you.
>>> data = {'a': [1, 2, 3], 'b': [10, 20], 'c': [100]}
>>> import itertools
>>> sum(itertools.chain.from_iterable(data.itervalues()))
136
If you have timedelta objects you can use the same recipe.
>>> from datetime import timedelta
>>> data = {'a': [timedelta(minutes=1),
...               timedelta(minutes=2),
...               timedelta(minutes=3)],
...         'b': [timedelta(minutes=10),
...               timedelta(minutes=20)],
...         'c': [timedelta(minutes=100)]}
>>> sum(td.seconds for td in itertools.chain.from_iterable(data.itervalues()))
8160
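One caveat with .seconds: it only holds the seconds component of a timedelta and leaves out whole days, so it undercounts durations of 24 hours or more. A Python 3 sketch that sums the timedelta objects themselves and converts once at the end with total_seconds() (variable names are illustrative):

```python
from datetime import timedelta
from itertools import chain

data = {'a': [timedelta(minutes=1), timedelta(minutes=2), timedelta(minutes=3)],
        'b': [timedelta(minutes=10), timedelta(minutes=20)],
        'c': [timedelta(minutes=100)]}

# sum() needs a timedelta start value, otherwise it starts from the int 0
total = sum(chain.from_iterable(data.values()), timedelta())
total_seconds = total.total_seconds()
print(total_seconds)

# why .seconds can mislead: the day part is stored separately
long_td = timedelta(days=1, seconds=5)
print(long_td.seconds, long_td.total_seconds())
```

For the sample data both approaches agree, because every duration is shorter than a day.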

Duplicates in a dictionary (Python)

I need to write a function that returns true if the dictionary has duplicates in it. So pretty much if anything appears in the dictionary more than once, it will return true.
Here is what I have but I am very far off and not sure what to do.
d = {"a", "b", "c"}
def has_duplicates(d):
    seen = set()
    d={}
    for x in d:
        if x in seen:
            return True
        seen.add(x)
    return False
print has_duplicates(d)
If you are looking to find duplication in values of the dictionary:
def has_duplicates(d):
    return len(d) != len(set(d.values()))
print has_duplicates({'a': 1, 'b': 1, 'c': 2})
Outputs:
True
def has_duplicates(d):
    return False
Dictionaries do not contain duplicate keys, ever. Your function, btw., is equivalent to this definition, so it's correct (just a tad long).
If you want to find duplicate values, that's
len(set(d.values())) != len(d)
assuming the values are hashable.
In your code, d = {"a", "b", "c"}, d is a set, not a dictionary.
Neither dictionary keys nor sets can contain duplicates. If you're looking for duplicate values, check if the set of the values has the same size as the dictionary itself:
def has_duplicate_values(d):
    return len(set(d.values())) != len(d)
Python dictionaries already have unique keys.
Are you possibly interested in unique values?
set(d.values())
If so, you can check the length of that set to see if it is smaller than the number of values. This works because sets eliminate duplicates from the input, so if the result is smaller than the input, it means some duplicates were found and eliminated.
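The length comparison always builds the full set first. As an alternative sketch, the loop from the question can be pointed at the values, which allows it to stop at the first repeat. This assumes the values are hashable, as noted above.

```python
def has_duplicate_values(d):
    """Return True as soon as any value is seen a second time."""
    seen = set()
    for value in d.values():  # iterate over the values, not the keys
        if value in seen:
            return True
        seen.add(value)
    return False

print(has_duplicate_values({'a': 1, 'b': 1, 'c': 2}))
print(has_duplicate_values({'a': 1, 'b': 2, 'c': 3}))
```

Note that the parameter d is never reassigned inside the function, unlike in the question's version.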
Not only is your general proposition that dictionaries can have duplicate keys false, but also your implementation is gravely flawed: d={} means that you have lost sight of your input d arg and are processing an empty dictionary!
The only thing that a dictionary can have duplicates of is values. A dictionary is a key-value store where the keys are unique. In Python, you can create a dictionary like so:
d1 = {k1: v1, k2: v2, k3: v1}
d2 = [k1, v1, k2, v2, k3, v1]
d1 was created using the normal dictionary notation. d2 is only a flat list of alternating keys and values, not a dictionary; it would still have to be converted, for example with dict(zip(d2[::2], d2[1::2])). Note that both versions contain a duplicate value.
If you had a function that returned the number of unique values in a dictionary then you could say something like:
len(d1) != func(d1)
Fortunately, Python makes it easy to do this using sets. Simply converting d1 into a set is not sufficient. Let's make our keys and values real so you can run some code.
v1 = 1; v2 = 2
k1 = "a"; k2 = "b"; k3 = "c"
d1 = {k1: v1, k2: v2, k3: v1}
print len(d1)
s = set(d1)
print s
You will notice that s has three members too and looks like set(['c', 'b', 'a']). That's because a simple conversion only uses the keys in the dict. You want to use the values like so:
s = set(d1.values())
print s
As you can see there are only two elements because the value 1 occurs two times. One way of looking at a set is that it is a list with no duplicate elements. That's what print sees when it prints out a set as a bracketed list. Another way to look at it is as a dict with no values. Like many data processing activities you need to start by selecting the data that you are interested in, and then manipulating it. Start by selecting the values from the dict, then create a set, then count and compare.
This is not a dictionary, it is a set:
d = {"a", "b", "c"}
I don't know what you are trying to accomplish, but you can't have a dictionary with the same key twice. Assigning to an existing key simply overwrites its value:
>>> d = {'a': 0, 'b':1}
>>> d['a'] = 2
>>> print d
{'a': 2, 'b': 1}
