I have multiple dictionaries. There is a great deal of overlap between the dictionaries, but they are not identical.
a = {'a':1,'b':2,'c':3}
b = {'a':1,'c':3, 'd':4}
c = {'a':1,'c':3}
I'm trying to figure out how to break these down into the most primitive pieces and then reconstruct the dictionaries in the most efficient manner. In other words, how can I deconstruct and rebuild the dictionaries by typing each key/value pair the minimum number of times (ideally once). It also means creating the minimum number of sets that can be combined to create all possible sets that exist.
In the above example. It could be broken down into:
c = {'a':1,'c':3}
a = dict(c.items() + {'b':2})
b = dict(c.items() + {'d':4})
I'm looking for suggestions on how to approach this in Python.
In reality, I have roughly 60 dictionaries and many of them have overlapping values. I'm trying to minimize the number of times I have to type each k/v pair to minimize potential typo errors and make it easier to cascade update different values for specific keys.
An ideal output would be the most basic dictionaries needed to construct all dictionaries as well as the formula for reconstruction.
Here is a solution. It isn't the most efficient in any way, but it might give you an idea of how to proceed.
a = {'a':1,'b':2,'c':3}
b = {'a':1,'c':3, 'd':4}
c = {'a':1,'c':3}
class Cover:
def __init__(self,*dicts):
# Our internal representation is a link to any complete subsets, and then a dictionary of remaining elements
mtx = [[-1,{}] for d in dicts]
for i,dct in enumerate(dicts):
for j,odct in enumerate(dicts):
if i == j: continue # we're always a subset of ourself
# if everybody in A is in B, create the reference
if all( k in dct for k in odct.keys() ):
mtx[i][0] = j
dif = {key:value for key,value in dct.items() if key not in odct}
mtx[i][1].update(dif)
break
for i,m in enumerate(mtx):
if m[1] == {}: m[1] = dict(dicts[i].items())
self.mtx = mtx
def get(self, i):
r = { key:val for key, val in self.mtx[i][1].items()}
if (self.mtx[i][0] > 0): # if we had found a subset, add that
r.update(self.mtx[self.mtx[i][0]][1])
return r
cover = Cover(a,b,c)
print(a,b,c)
print('representation',cover.mtx)
# prints [[2, {'b': 2}], [2, {'d': 4}], [-1, {'a': 1, 'c': 3}]]
# The "-1" In the third element indicates this is a building block that cannot be reduced; the "2"s indicate that these should build from the 2th element
print('a',cover.get(0))
print('b',cover.get(1))
print('c',cover.get(2))
The idea is very simple: if any of the maps are complete subsets, substitute the duplication for a reference. The compression could certainly backfire for certain matrix combinations, and can be easily improved upon
Simple improvements
Change the first line of get, or using your more concise dictionary addition code hinted at in the question, might immediately improve readability.
We don't check for the largest subset, which may be worthwhile.
The implementation is naive and makes no optimizations
Larger improvements
One could also implement a hierarchical implementation in which "building block" dictionaries formed the root nodes and the tree was descended to build the larger dictionaries. This would only be beneficial if your data was hierarchical to start.
(Note: tested in python3)
Below a script to generate a script that reconstruct dictionaries.
For example consider this dictionary of dictionaries:
>>>dicts
{'d2': {'k4': 'k4', 'k1': 'k1'},
'd0': {'k2': 'k2', 'k4': 'k4', 'k1': 'k1', 'k3': 'k3'},
'd4': {'k4': 'k4', 'k0': 'k0', 'k1': 'k1'},
'd3': {'k0': 'k0', 'k1': 'k1'},
'd1': {'k2': 'k2', 'k4': 'k4'}}
For clarity, we continue with sets because the association key value can be done elsewhere.
sets= {k:set(v.keys()) for k,v in dicts.items()}
>>>sets
{'d2': {'k1', 'k4'},
'd0': {'k1', 'k2', 'k3', 'k4'},
'd4': {'k0', 'k1', 'k4'},
'd3': {'k0', 'k1'},
'd1': {'k2', 'k4'}}
Now compute the distances (number of keys to add or/and remove to go from one dict to another):
df=pd.DataFrame(dicts)
charfunc=df.notnull()
distances=pd.DataFrame((charfunc.values.T[...,None] != charfunc.values).sum(1),
df.columns,df.columns)
>>>>distances
d0 d1 d2 d3 d4
d0 0 2 2 4 3
d1 2 0 2 4 3
d2 2 2 0 2 1
d3 4 4 2 0 1
d4 3 3 1 1 0
Then the script that write the script. The idea is to begin with the shortest set, and then at each step to construct the nearest set from those already built:
script=open('script.py','w')
dicoto=df.count().argmin() # the shortest set
script.write('res={}\nres['+repr(dicoto)+']='+str(sets[dicoto])+'\ns=[\n')
done=[]
todo=df.columns.tolist()
while True :
done.append(dicoto)
todo.remove(dicoto)
if not todo : break
table=distances.loc[todo,done]
ito,ifrom=np.unravel_index(table.values.argmin(),table.shape)
dicofrom=table.columns[ifrom]
setfrom=sets[dicofrom]
dicoto=table.index[ito]
setto=sets[dicoto]
toadd=setto-setfrom
toremove=setfrom-setto
script.write(('('+repr(dicoto)+','+str(toadd)+','+str(toremove)+','
+repr(dicofrom)+'),\n').replace('set',''))
script.write("""]
for dt,ta,tr,df in s:
d=res[df].copy()
d.update(ta)
for k in tr: d.remove(k)
res[dt]=d
""")
script.close()
and the produced file script.py
res={}
res['d1']={'k2', 'k4'}
s=[
('d0',{'k1', 'k3'},(),'d1'),
('d2',{'k1'},{'k2'},'d1'),
('d4',{'k0'},(),'d2'),
('d3',(),{'k4'},'d4'),
]
for dt,ta,tr,df in s:
d=res[df].copy()
d.update(ta)
for k in tr: d.remove(k)
res[dt]=d
Test :
>>> %run script.py
>>> res==sets
True
With random dicts like here, script size is about 80% of sets size for big dicts (Nd=Nk=100) . But for big overlap, the ratio would certainly be better.
Complement : a script to generate such dicts .
from pylab import *
import pandas as pd
Nd=5 # number of dicts
Nk=5 # number of keys per dict
index=['k'+str(j) for j in range(Nk)]
columns=['d'+str(i) for i in range(Nd)]
charfunc=pd.DataFrame(randint(0,2,(Nk,Nd)).astype(bool),index=index,columns=columns)
dicts={i : { j:j for j in charfunc.index if charfunc.ix[j,i]} for i in charfunc.columns}
Related
This question already has answers here:
how to randomly choose multiple keys and its value in a dictionary python
(4 answers)
Closed 7 months ago.
This post was edited and submitted for review 7 months ago and failed to reopen the post:
Duplicate This question has been answered, is not unique, and doesn’t differentiate itself from another question.
I have a (potentially) huge dict in Python 3.10 and want to randomly sample a few values. Alas, random.sample(my_dict, k) says:
TypeError: Population must be a sequence. For dicts or sets, use sorted(d).
and random.sample(my_dict.keys(), k) gives
DeprecationWarning: Sampling from a set deprecated
since Python 3.9 and will be removed in a subsequent version.
I don't want to pay the cost of converting my dictionary keys to a list, and I don't need them sorted.
There's an old question in a similar vein, but that's from before stuff got deprecated in Python and the person asking that question didn't mind converting to a list first.
I also tried running random.choice multiple times to simulate random.sample. But that's even worse: it just throws an exception when you use it on a dict. (Instead of giving you a reasonable error message.)
You need to use sample(list(dct)). (The example code select random 2 items from original dict.)
from random import sample
dct = {'a':1, 'b':2, 'c':3, 'd':4}
rnd_keys = sample(list(dct), 2)
# rnd_keys -> ['c', 'b']
rnd_dct = dict(sample(list(dct.items()), 2))
print(rnd_dct)
{'c': 3, 'b': 2}
Update without converting huge dict to list (this convert use O(n) space and question say, don't do this.). You can generate random number base len(dict) and use enumerate and only get k,v that idx match with random_idx and break from for-loop when reaching to zero base random_number that we want to select (this break helps you don't see all dict).
from random import sample
dct = {'a':1, 'b':2, 'c':3, 'd':4}
# idx -^0^----^1^----^2^----^3^---
number_rnd = 2
rnd_idx = set(sample(range(len(dct)), number_rnd))
print(rnd_idx)
# {0, 3}
res = {}
for idx, (k,v) in enumerate(dct.items()):
if idx in rnd_idx:
res[k] = v
number_rnd -= 1
if number_rnd == 0:
break
print(res)
# {'b': 2, 'c': 3}
Third Approach By thanks Tomerikoo, We can use flip a coin idea, On each iterate over items() we cen generate a random 0 or 1 and if the random number is 1 save the item in the result dict. (Maybe we see all dict items but don't select all random numbers because, maybe we get many random 0.)
import random
dct = {'a':1, 'b':2, 'c':3, 'd':4}
number_rnd = 2
res = {}
for k,v in dct.items():
rnd_ch = random.getrandbits(1)
if rnd_ch:
res[k] = v
number_rnd -= 1
if number_rnd == 0:
break
print(res)
If you’re ok with getting the keys into a list, you can just pick the keys using a random set of integers off of the list. If you don’t want to store them into a list, you can generate your random integers based on the dict size, sort them, then iterate through the dict and sample as you get to indices matching your random integer picks.
I would like to shuffle the key value pairs in this dictionary so that the outcome has no original key value pairs. Starting dictionary:
my_dict = {'A':'a',
'K':'k',
'P':'p',
'Z':'z'}
Example of unwanted outcome:
my_dict_shuffled = {'Z':'a',
'K':'k', <-- Original key value pair
'A':'p',
'P':'z'}
Example of wanted outcome:
my_dict_shuffled = {'Z':'a',
'A':'k',
'K':'p',
'P':'z'}
I have tried while loops and for loops with no luck. Please help! Thanks in advance.
Here's a fool-proof algorithm I learned from a Numberphile video :)
import itertools
import random
my_dict = {'A': 'a',
'K': 'k',
'P': 'p',
'Z': 'z'}
# Shuffle the keys and values.
my_dict_items = list(my_dict.items())
random.shuffle(my_dict_items)
shuffled_keys, shuffled_values = zip(*my_dict_items)
# Offset the shuffled values by one.
shuffled_values = itertools.cycle(shuffled_values)
next(shuffled_values, None) # Offset the values by one.
# Guaranteed to have each value paired to a random different key!
my_random_dict = dict(zip(shuffled_keys, shuffled_values))
Disclaimer (thanks for mentioning, #jf328): this will not generate all possible permutations! It will only generate permutations with exactly one "cycle". Put simply, the algorithm will never give you the following outcome:
{'A': 'k',
'K': 'a',
'P': 'z',
'Z': 'p'}
However, I imagine you can extend this solution by building a random list of sub-cycles:
(2, 2, 3) => concat(zip(*items[0:2]), zip(*items[2:4]), zip(*items[4:7]))
A shuffle which doesn't leave any element in the same place is called a derangement. Essentially, there are two parts to this problem: first to generate a derangement of the keys, and then to build the new dictionary.
We can randomly generate a derangement by shuffling until we get one; on average it should only take 2-3 tries even for large dictionaries, but this is a Las Vegas algorithm in the sense that there's a tiny probability it could take a much longer time to run than expected. The upside is that this trivially guarantees that all derangements are equally likely.
from random import shuffle
def derangement(keys):
if len(keys) == 1:
raise ValueError('No derangement is possible')
new_keys = list(keys)
while any(x == y for x, y in zip(keys, new_keys)):
shuffle(new_keys)
return new_keys
def shuffle_dict(d):
return { x: d[y] for x, y in zip(d, derangement(d)) }
Usage:
>>> shuffle_dict({ 'a': 1, 'b': 2, 'c': 3 })
{'a': 2, 'b': 3, 'c': 1}
theonewhocodes, does this work, if you don't have a right answer, can you update your question with a second use case?
my_dict = {'A':'a',
'K':'k',
'P':'p',
'Z':'z'}
while True:
new_dict = dict(zip(list(my_dict.keys()), random.sample(list(my_dict.values()),len(my_dict))))
if new_dict.items() & my_dict.items():
continue
else:
break
print(my_dict)
print(new_dict)
I'd like to compare all entries in a dict with all other entries – if the values are within a close enough range, I want to merge them under a single key and delete the other key. But I cannot figure out how to iterate through the dict without errors.
An example version of my code (not the real set of values, but you get the idea):
things = { 'a': 1, 'b': 3, 'c': 22 }
for me in things.iteritems():
for other in things.iteritems():
if me == other:
continue
if abs(me-other) < 5:
print 'merge!', me, other
# merge the two into 'a'
# delete 'b'
I'd hope to then get:
>> { 'a': [ 1, 2 ], 'c': 22 }
But if I run this code, I get the first two that I want to merge:
>> merge! ('a', 1) ('b', 2)
Then the same one in reverse (which I want to have merged already):
>> duplicate! ('b', 2) ('a', 1)
If I use del things['b'] I get an error that I'm trying to modify the dict while iterating. I see lots of "how to remove items from a dict" questions, and lots about comparing two separate dicts, but not this particular problem (as far as I can tell).
EDIT
Per feedback in the comments, I realized my example is a little misleading. I want to merge two items if their values are similar enough.
So, to do this in linear time (but requiring extra space) use an intermediate dict to group the keys by value:
>>> things = { 'fruit': 'tomato', 'vegetable': 'tomato', 'grain': 'wheat' }
>>> from collections import defaultdict
>>> grouper = defaultdict(list)
>>> for k, v in things.iteritems():
... grouper[v].append(k)
...
>>> grouper
defaultdict(<type 'list'>, {'tomato': ['vegetable', 'fruit'], 'wheat': ['grain']})
Then, you simply take the first item from your list of values (that used to be keys), as the new key:
>>> {v[0]:k for k, v in grouper.iteritems()}
{'vegetable': 'tomato', 'grain': 'wheat'}
Note, dictionaries are inherently unordered, so if order is important, you should have been using an OrderedDict from the beginning.
Note that your result will depend on the direction of the traversal. Since you are bucketing data depending on distance (in the metric sense), either the right neighbor or the left neighbor can claim the data point.
I am a little stuck on writing a function for a project. This function takes a dictionary of candidates who's values are the number of votes they received. I then have to return a set containing the remaining_candidates. In other words the candidate with the least amount of votes should not be in the set being returned and if for example all of the candidates have the same votes, the set should be empty. I am having trouble getting started here.
For example I know I can sort the dictionary like so:
x = min(canadites, key=canadites.__getitem__)
but that will not work if the candidates have the same value, as it just pops up the last one in the dict.
Any ideas?
Update: To make things clear.
Lets say I have the following dictionary:
canadites = {'X':22,'Y':1, 'Z':0}
Ideally the function should return a set containing only X and Y. But if Y and Z where both 1
x = min(canadites, key=canadites.__getitem__)
seems to only return Z
It's cleaner to create a new dict instead of popping items from the old one:
>>> d = {'a':1, 'b':2, 'c':1, 'd':3}
>>> min_val = min(d.values())
>>> {k:v for k,v in d.items() if v > min_val}
{'b': 2, 'd': 3}
In python2, itervalues and iteritems would be more efficient, although this is a micro-optimization in most cases.
I need to write a function that returns true if the dictionary has duplicates in it. So pretty much if anything appears in the dictionary more than once, it will return true.
Here is what I have but I am very far off and not sure what to do.
d = {"a", "b", "c"}
def has_duplicates(d):
seen = set()
d={}
for x in d:
if x in seen:
return True
seen.add(x)
return False
print has_duplicates(d)
If you are looking to find duplication in values of the dictionary:
def has_duplicates(d):
return len(d) != len(set(d.values()))
print has_duplicates({'a': 1, 'b': 1, 'c': 2})
Outputs:
True
def has_duplicates(d):
return False
Dictionaries do not contain duplicate keys, ever. Your function, btw., is equivalent to this definition, so it's correct (just a tad long).
If you want to find duplicate values, that's
len(set(d.values())) != len(d)
assuming the values are hashable.
In your code, d = {"a", "b", "c"}, d is a set, not a dictionary.
Neither dictionary keys nor sets can contain duplicates. If you're looking for duplicate values, check if the set of the values has the same size as the dictionary itself:
def has_duplicate_values(d):
return len(set(d.values())) != len(d)
Python dictionaries already have unique keys.
Are you possibly interested in unique values?
set(d.values())
If so, you can check the length of that set to see if it is smaller than the number of values. This works because sets eliminate duplicates from the input, so if the result is smaller than the input, it means some duplicates were found and eliminated.
Not only is your general proposition that dictionaries can have duplicate keys false, but also your implementation is gravely flawed: d={} means that you have lost sight of your input d arg and are processing an empty dictionary!
The only thing that a dictionary can have duplicates of, is values. A dictionary is a key, value store where the keys are unique. In Python, you can create a dictionary like so:
d1 = {k1: v1, k2: v2, k3: v1}
d2 = [k1, v1, k2, v2, k3, v1]
d1 was created using the normal dictionary notation. d2 was created from a list with an even number of elements. Note that both versions have a duplicate value.
If you had a function that returned the number of unique values in a dictionary then you could say something like:
len(d1) != func(d1)
Fortunately, Python makes it easy to do this using sets. Simply converting d1 into a set is not sufficient. Lets make our keys and values real so you can run some code.
v1 = 1; v2 = 2
k1 = "a"; k2 = "b"; k3 = "c"
d1 = {k1: v1, k2: v2, k3: v1}
print len(d1)
s = set(d1)
print s
You will notice that s has three members too and looks like set(['c', 'b', 'a']). That's because a simple conversion only uses the keys in the dict. You want to use the values like so:
s = set(d1.values())
print s
As you can see there are only two elements because the value 1 occurs two times. One way of looking at a set is that it is a list with no duplicate elements. That's what print sees when it prints out a set as a bracketed list. Another way to look at it is as a dict with no values. Like many data processing activities you need to start by selecting the data that you are interested in, and then manipulating it. Start by selecting the values from the dict, then create a set, then count and compare.
This is not a dictionary, is a set:
d = {"a", "b", "c"}
I don't know what are you trying to accomplish but you can't have dictionaries with same key. If you have:
>>> d = {'a': 0, 'b':1}
>>> d['a'] = 2
>>> print d
{'a': 2, 'b': 1}