Recursively finding commonality between sets in Python

I have multiple sets (the number is unknown) and I would like to find the commonality between them. If two sets match (at least an 80% overlap), I would like to merge them and then compare the resulting set against all the other sets again, from the beginning.
For example:
A : {1,2,3,4}
B : {5,6,7}
C : {1,2,3,4,5}
D : {2,3,4,5,6,7}
Then A is compared with B and there is no commonality, so A is compared with C, which hits the commonality target; we now have a new set AC = {1,2,3,4,5}. Comparing AC to B doesn't hit the threshold, but D does, so we get a new set ACD, and on the next pass we finally have a hit with B.
I'm currently using two nested loops, but that only solves the case of comparing two sets.
To calculate the commonality I'm using the following calculation:
overlap = a_set & b_set
universe = a_set | b_set
per_overlap = (len(overlap)/len(universe))
I think the solution should be a recursive function, but I'm not sure how to write it since I'm fairly new to Python; maybe there is a different and simpler way to do this.
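For instance, applying that calculation to A and C from the example gives exactly the 80% threshold (a quick sketch):
a_set = {1, 2, 3, 4}
c_set = {1, 2, 3, 4, 5}
overlap = a_set & c_set             # {1, 2, 3, 4}
universe = a_set | c_set            # {1, 2, 3, 4, 5}
per_overlap = len(overlap) / len(universe)
print(per_overlap)                  # 0.8, so A and C would be merged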

I believe this does what you are looking for. The complexity is awful because it starts over each time it gets a match. No recursion is needed.
def commonality(s1, s2):
    overlap = s1 & s2
    universe = s1 | s2
    return len(overlap) / len(universe)
def set_merge(s, threshold=0.8):
    used_keys = set()
    out = s.copy()
    incomplete = True
    while incomplete:
        incomplete = False
        restart = False
        for k1, s1 in list(out.items()):
            if restart:
                incomplete = True
                break
            if k1 in used_keys:
                continue
            for k2, s2 in s.items():
                if k1 == k2 or k2 in used_keys:
                    continue
                print(k1, k2)
                if commonality(s1, s2) >= threshold:
                    out.setdefault(k1 + k2, s1 | s2)
                    out.pop(k1)
                    if k2 in out:
                        out.pop(k2)
                    used_keys.add(k1)
                    used_keys.add(k2)
                    restart = True
                    break
    out.update({k: v for k, v in s.items() if k not in used_keys})
    return out
For your particular example, it only merges A and C, as any other combination is below the threshold.
set_dict = {
    'A': {1, 2, 3, 4},
    'B': {5, 6, 7},
    'C': {1, 2, 3, 4, 5},
    'D': {2, 3, 4, 5, 6, 7},
}
set_merge(set_dict)
# returns:
{'B': {5, 6, 7},
 'D': {2, 3, 4, 5, 6, 7},
 'AC': {1, 2, 3, 4, 5}}
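A quick check with the commonality helper above confirms that every remaining pair stays below the 0.8 threshold:
commonality({1, 2, 3, 4, 5}, {2, 3, 4, 5, 6, 7})  # AC vs D -> 4/7 ≈ 0.571
commonality({1, 2, 3, 4, 5}, {5, 6, 7})           # AC vs B -> 1/7 ≈ 0.143
commonality({5, 6, 7}, {2, 3, 4, 5, 6, 7})        # B  vs D -> 3/6 = 0.5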

Related

Finding the difference in value counts by keys in two Dictionaries

I have two sample Python dictionaries that count how many times each key appears in a DataFrame.
dict1 = {
    2000: 2,
    3000: 3,
    4000: 4,
    5000: 6,
    6000: 8
}
dict2 = {
    4000: 4,
    3000: 3,
    2000: 4,
    6000: 10,
    5000: 4
}
I would like to output the following where there is a difference.
diff = {
    2000: 2,
    5000: 2,
    6000: 2
}
I would appreciate any help, as I am not familiar with iterating through dictionaries. Even if the output only shows me at which keys the values differ, that would work for me. I tried the following, but it does not produce any output.
for (k, v), (k2, v2) in zip(dict1.items(), dict2.items()):
    if k == k2:
        if v == v2:
            pass
        else:
            print('value is different at k')
The way you're doing it doesn't work because the two dicts don't store their keys in the same order, so zip pairs up mismatched keys and k == k2 is almost never True.
You could use a dict comprehension where you traverse dict1 and subtract the value in dict2 with the matching key:
diff = {k: abs(v - dict2[k]) for k, v in dict1.items()}
Output:
{2000: 2, 3000: 0, 4000: 0, 5000: 2, 6000: 2}
If you have Python >=3.8, and you want only key-value pairs where value > 0, then you could also use the walrus operator:
diff = {k: di for k, v in dict1.items() if (di := abs(v - dict2[k])) > 0}
Output:
{2000: 2, 5000: 2, 6000: 2}
Since you tagged the question with pandas, you can do a similar job in pandas as well.
First, we need to convert the dicts to DataFrame objects, then join them. Since join joins on the index by default and the indexes here are the dict keys, you get a DataFrame where you can find the difference row-wise directly. Then use the diff method along axis=1 plus abs to find the differences.
import pandas as pd

df1 = pd.DataFrame.from_dict(dict1, orient='index')
df2 = pd.DataFrame.from_dict(dict2, orient='index')
out = df1.join(df2, lsuffix='_x', rsuffix='').diff(axis=1).abs().dropna(axis=1)['0']
Instead of creating two DataFrames and joining them, we could also build a single DataFrame by passing a list of the dicts, and use similar methods to get the desired outcome:
out = pd.DataFrame.from_dict([dict1, dict2]).diff().dropna().abs().loc[1]
Output:
2000 2
3000 0
4000 0
5000 2
6000 2
Name: 0, dtype: int64
Since you're counting... how about using Counters?
from collections import Counter

c1, c2 = Counter(dict2), Counter(dict1)
diff = dict((c1 - c2) + (c2 - c1))
If you used Counters all around, you wouldn't need the conversions from dict and back. And maybe it would also simplify the creation of your two count dicts (Can't tell for sure since you didn't show how you created them).
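For example, if the counts came from a column of a DataFrame, they could be built as Counters from the start; the frames and the 'key' column below are hypothetical, since the question doesn't show how the dicts were created (a sketch):
from collections import Counter
import pandas as pd

# Hypothetical source data; the real DataFrames were not shown in the question.
df_a = pd.DataFrame({'key': [2000, 2000, 3000, 3000, 3000]})
df_b = pd.DataFrame({'key': [2000, 2000, 2000, 2000, 3000, 3000, 3000]})

c1, c2 = Counter(df_b['key']), Counter(df_a['key'])
diff = dict((c1 - c2) + (c2 - c1))
print(diff)  # {2000: 2}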

Unique elements of multiple sets

I have a list of sets like the one below. I want to write a function that returns the elements that appear only once across those sets. The function I wrote kinda works. I am wondering, is there a better way to handle this problem?
s1 = {1, 2, 3, 4}
s2 = {1, 3, 4}
s3 = {1, 4}
s4 = {3, 4}
s5 = {1, 4, 5}
s = [s1, s2, s3, s4, s5]
from collections import Counter

def unique(s):
    temp = []
    for i in s:
        temp.extend(list(i))
    c = Counter(temp)
    result = set()
    for k, v in c.items():
        if v == 1:
            result.add(k)
    return result
unique(s) # will return {2, 5}
You can use a Counter directly and then get the elements that only appear once.
from collections import Counter
import itertools
c = Counter(itertools.chain.from_iterable(s))
res = {k for k,v in c.items() if v==1}
# {2, 5}
I love the Counter-based solution by @abc. But, just in case, here is a pure set-based one:
result = set()
for _ in s:
    result |= s[0] - set.union(*s[1:])
    s = s[-1:] + s[:-1]  # shift the list of sets
# {2, 5}
This solution is about 6 times faster but cannot be written as a one-liner.
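The speed claim will depend on the data and the machine; a rough way to check it is a small timeit comparison, with both approaches wrapped as functions so the rotation works on a copy of the list (a sketch):
import timeit
from collections import Counter
from itertools import chain

s = [{1, 2, 3, 4}, {1, 3, 4}, {1, 4}, {3, 4}, {1, 4, 5}]

def counter_unique(sets):
    c = Counter(chain.from_iterable(sets))
    return {k for k, v in c.items() if v == 1}

def rotation_unique(sets):
    sets = list(sets)  # work on a copy so the caller's list is untouched
    result = set()
    for _ in range(len(sets)):
        result |= sets[0] - set.union(*sets[1:])
        sets = sets[-1:] + sets[:-1]  # shift the list of sets
    return result

print(timeit.timeit(lambda: counter_unique(s), number=10_000))
print(timeit.timeit(lambda: rotation_unique(s), number=10_000))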
set.union(*[i-set.union(*[j for j in s if j!=i]) for i in s])
I think the proposed solution is similar to what @Bobby Ocean suggested, but not as compressed.
The idea is to loop over the complete list of sets s and, for each target subset si, compute the difference with every other subset (avoiding itself).
For example, starting with s1 we compute st = s1-s2-s3-s4-s5, and starting with s5 we have st = s5-s1-s2-s3-s4.
The logic is that, thanks to these differences, for each target subset si we only keep the elements that are unique to si (compared to the other subsets).
Finally, result is the union of these unique elements.
result = set()
for si in s:              # target subset
    st = si
    for sj in s:          # the other subsets
        if sj != si:      # avoid itself
            st = st - sj  # compute differences
    result = result.union(st)

Counting numbers of sets in a list

I have a list of sets constructed as below. I want to count how many times the set s1 appears in the list. My approach right now is converting each set to a tuple and counting those. Is there another solution for this?
from collections import Counter

s1 = {1, 2}
s2 = {1, 3, 4}
s3 = {1, 4}
s = [s1, s2, s1, s3]
# This won't work because set is unhashable
# c = Counter(s)
s = [tuple(i) for i in s]
c = Counter(s)
print(c)
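One alternative, not from the original thread: since sets compare equal with ==, list.count already works on the list as-is, and frozenset gives a hashable key if you want counts for every distinct set (a sketch):
from collections import Counter

s1 = {1, 2}
s2 = {1, 3, 4}
s3 = {1, 4}
s = [s1, s2, s1, s3]

# Count one particular set; == works between sets, so no conversion is needed.
print(s.count(s1))             # 2

# Or count every distinct set at once by hashing each one as a frozenset.
c = Counter(map(frozenset, s))
print(c[frozenset(s1)])        # 2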

How to use filter in python to filter a dictionary in this situation?

I have never used Python before. Now I have a dictionary like:
d1 = {1:2,3:3,2:2,4:2,5:2}
In each pair, the key is a point and the value is a cluster id. So d1 means point 1 belongs to cluster 2, point 3 belongs to cluster 3, point 2 belongs to cluster 2, point 4 belongs to cluster 2, and point 5 belongs to cluster 2. No point belongs to cluster 1.
How can I use filter (without a loop) to get a dictionary like the following?
d2 = {1:[],2:[1,2,4,5],3:[3]}
It means no point belongs to cluster 1, points 1, 2, 4 and 5 belong to cluster 2, and point 3 belongs to cluster 3.
I tried:
d2 = dict(filter(lambda a,b: a,b if a[1] == b[1] , d1.items()))
I would use a collections.defaultdict
from collections import defaultdict

d2 = defaultdict(list)
for point, cluster in d1.items():
    d2[cluster].append(point)
Your defaultdict won't have a cluster 1 in it, but if you know what clusters you expect, then all will be fine with the world (because the empty list will be put in that slot when you try to look there -- this is the "default" part of the defaultdict):
expected_clusters = [1, 2, 3]
for cluster in expected_clusters:
    print(d2[cluster])
FWIW, doing this problem with the builtin filter is just insanity. However, if you must, something like the following works:
d2 = {}
filter(lambda (pt, cl): d2.setdefault(cl, []).append(pt), d1.items())
Note that I'm using python2.x's unpacking of arguments. For python3.x, you'd need to do something like lambda item: d2.setdefault(item[1], []).append(item[0]), or, maybe we could do something like this which is a bit nicer:
d2 = {}
filter(lambda pt: d2.setdefault(d1[pt], []).append(pt), d1)
We can do a tiny bit better with the reduce builtin (at least the reduce isn't simply a vehicle to create an implicit loop and therefore actually returns the dict we want):
>>> d1 = {1:2,3:3,2:2,4:2,5:2}
>>> reduce(lambda d, k: d.setdefault(d1[k], []).append(k) or d, d1, {})
{2: [1, 2, 4, 5], 3: [3]}
But this is still really ugly python.
>>> d1 = {1:2,3:3,2:2,4:2,5:2}
>>> dict(map(lambda c : (c, [k for k, v in d1.items() if v == c]), d1.values()))
{2: [1, 2, 4, 5], 3: [3]}
lambda function: for each cluster id c, build the pair (c, list of points in that cluster)
map function: apply that lambda over d1.values() (the cluster ids); dict then turns the pairs into the result
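The same grouping can also be written as a dict comprehension, which avoids rebuilding the list for duplicate cluster ids and may read more naturally than map plus lambda (a sketch):
d1 = {1: 2, 3: 3, 2: 2, 4: 2, 5: 2}
d2 = {c: [k for k, v in d1.items() if v == c] for c in set(d1.values())}
print(d2)  # {2: [1, 2, 4, 5], 3: [3]}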

Dict Deconstruction and Reconstruction in Python

I have multiple dictionaries. There is a great deal of overlap between the dictionaries, but they are not identical.
a = {'a':1,'b':2,'c':3}
b = {'a':1,'c':3, 'd':4}
c = {'a':1,'c':3}
I'm trying to figure out how to break these down into the most primitive pieces and then reconstruct the dictionaries in the most efficient manner. In other words, how can I deconstruct and rebuild the dictionaries while typing each key/value pair the minimum number of times (ideally once)? It also means creating the minimum number of building blocks that can be combined to create all the dictionaries that exist.
In the above example, it could be broken down into:
c = {'a':1,'c':3}
a = dict(c.items() + {'b':2})
b = dict(c.items() + {'d':4})
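The sketch above is Python 2-style items concatenation; in Python 3 the same decomposition can be written with dict unpacking (a sketch):
c = {'a': 1, 'c': 3}
a = {**c, 'b': 2}   # {'a': 1, 'c': 3, 'b': 2}
b = {**c, 'd': 4}   # {'a': 1, 'c': 3, 'd': 4}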
I'm looking for suggestions on how to approach this in Python.
In reality, I have roughly 60 dictionaries and many of them have overlapping values. I'm trying to minimize the number of times I have to type each k/v pair to minimize potential typo errors and make it easier to cascade update different values for specific keys.
An ideal output would be the most basic dictionaries needed to construct all dictionaries as well as the formula for reconstruction.
Here is a solution. It isn't the most efficient in any way, but it might give you an idea of how to proceed.
a = {'a':1,'b':2,'c':3}
b = {'a':1,'c':3, 'd':4}
c = {'a':1,'c':3}
class Cover:
    def __init__(self, *dicts):
        # Our internal representation is a link to any complete subset,
        # plus a dictionary of the remaining elements
        mtx = [[-1, {}] for d in dicts]
        for i, dct in enumerate(dicts):
            for j, odct in enumerate(dicts):
                if i == j: continue  # we're always a subset of ourself
                # if every key of odct is also in dct, create the reference
                if all(k in dct for k in odct.keys()):
                    mtx[i][0] = j
                    dif = {key: value for key, value in dct.items() if key not in odct}
                    mtx[i][1].update(dif)
                    break
        for i, m in enumerate(mtx):
            if m[1] == {}: m[1] = dict(dicts[i].items())
        self.mtx = mtx

    def get(self, i):
        r = {key: val for key, val in self.mtx[i][1].items()}
        if self.mtx[i][0] >= 0:  # -1 means no subset was found; otherwise add the referenced subset
            r.update(self.mtx[self.mtx[i][0]][1])
        return r
cover = Cover(a,b,c)
print(a,b,c)
print('representation',cover.mtx)
# prints [[2, {'b': 2}], [2, {'d': 4}], [-1, {'a': 1, 'c': 3}]]
# The "-1" In the third element indicates this is a building block that cannot be reduced; the "2"s indicate that these should build from the 2th element
print('a',cover.get(0))
print('b',cover.get(1))
print('c',cover.get(2))
The idea is very simple: if any of the maps is a complete subset of another, substitute the duplication with a reference. The compression could certainly backfire for certain combinations of dicts, and can easily be improved upon.
Simple improvements
Changing the first line of get, or using your more concise dictionary-addition code hinted at in the question, might immediately improve readability.
We don't check for the largest subset, which may be worthwhile.
The implementation is naive and makes no optimizations.
Larger improvements
One could also implement a hierarchical implementation in which "building block" dictionaries formed the root nodes and the tree was descended to build the larger dictionaries. This would only be beneficial if your data was hierarchical to start.
(Note: tested in python3)
Below is a script that generates a script to reconstruct the dictionaries.
For example, consider this dictionary of dictionaries:
>>> dicts
{'d2': {'k4': 'k4', 'k1': 'k1'},
'd0': {'k2': 'k2', 'k4': 'k4', 'k1': 'k1', 'k3': 'k3'},
'd4': {'k4': 'k4', 'k0': 'k0', 'k1': 'k1'},
'd3': {'k0': 'k0', 'k1': 'k1'},
'd1': {'k2': 'k2', 'k4': 'k4'}}
For clarity, we continue with sets, since the key-value association can be handled elsewhere.
sets = {k: set(v.keys()) for k, v in dicts.items()}
>>> sets
{'d2': {'k1', 'k4'},
'd0': {'k1', 'k2', 'k3', 'k4'},
'd4': {'k0', 'k1', 'k4'},
'd3': {'k0', 'k1'},
'd1': {'k2', 'k4'}}
Now compute the distances (the number of keys to add and/or remove to go from one dict to another):
import numpy as np
import pandas as pd

df = pd.DataFrame(dicts)
charfunc = df.notnull()
distances = pd.DataFrame((charfunc.values.T[..., None] != charfunc.values).sum(1),
                         df.columns, df.columns)
>>> distances
    d0  d1  d2  d3  d4
d0   0   2   2   4   3
d1   2   0   2   4   3
d2   2   2   0   2   1
d3   4   4   2   0   1
d4   3   3   1   1   0
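As a sanity check, each entry is just the size of the symmetric difference of the corresponding key sets; for d0 and d2, for instance (a sketch using the sets dict built above):
print(len(sets['d0'] ^ sets['d2']))  # 2, matching the distances table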
Then the script that writes the script. The idea is to begin with the shortest set and, at each step, construct the nearest set from those already built:
script = open('script.py', 'w')
dicoto = df.count().idxmin()  # label of the shortest set
script.write('res={}\nres[' + repr(dicoto) + ']=' + str(sets[dicoto]) + '\ns=[\n')
done = []
todo = df.columns.tolist()
while True:
    done.append(dicoto)
    todo.remove(dicoto)
    if not todo:
        break
    table = distances.loc[todo, done]
    ito, ifrom = np.unravel_index(table.values.argmin(), table.shape)
    dicofrom = table.columns[ifrom]
    setfrom = sets[dicofrom]
    dicoto = table.index[ito]
    setto = sets[dicoto]
    toadd = setto - setfrom
    toremove = setfrom - setto
    script.write(('(' + repr(dicoto) + ',' + str(toadd) + ',' + str(toremove) + ','
                  + repr(dicofrom) + '),\n').replace('set', ''))
script.write("""]
for dt,ta,tr,df in s:
    d=res[df].copy()
    d.update(ta)
    for k in tr: d.remove(k)
    res[dt]=d
""")
script.close()
and the produced file script.py:
res={}
res['d1']={'k2', 'k4'}
s=[
('d0',{'k1', 'k3'},(),'d1'),
('d2',{'k1'},{'k2'},'d1'),
('d4',{'k0'},(),'d2'),
('d3',(),{'k4'},'d4'),
]
for dt,ta,tr,df in s:
    d=res[df].copy()
    d.update(ta)
    for k in tr: d.remove(k)
    res[dt]=d
Test:
>>> %run script.py
>>> res==sets
True
With random dicts like those generated below, the script is about 80% of the size of the sets for big inputs (Nd=Nk=100). With more overlap, the ratio would certainly be better.
Complement: a script to generate such dicts.
from pylab import *
import pandas as pd
Nd=5 # number of dicts
Nk=5 # number of keys per dict
index=['k'+str(j) for j in range(Nk)]
columns=['d'+str(i) for i in range(Nd)]
charfunc=pd.DataFrame(randint(0,2,(Nk,Nd)).astype(bool),index=index,columns=columns)
dicts = {i: {j: j for j in charfunc.index if charfunc.loc[j, i]} for i in charfunc.columns}
