So I want to compare each key of a dictionary to each other and if a key happens to be too similar to another key (based on fuzzy similarity), I want to merge those 2 entries together into a single key (so one key gets deleted whereas the values will be added up). Is there a more efficient way to do this?
d is a dictionary with {"labels": [list of sentences]}
# First I create a new dictionary that has a dictionary as value which includes the key and value
# of the old dictionary. It feels a bit redundant but afaik you can't loop through a dictionary and
# edit it at the same time + you can't edit keys themselves?
d_new = {}
for key, value in d.items():
    d_new[key] = {"label": key, "sentences": value}
for key1, key2 in itertools.combinations(d, 2):
    if fuzz.token_set_ratio(key1, key2) >= similarity:
        if len(d_new[key1]["sentences"]) > len(d_new[key2]["sentences"]):
            d_new[key2]["label"] = key1
        # Merge keys
        d_new[key2]["sentences"] = list(set(d_new[key1]["sentences"] + d_new[key2]["sentences"]))
        del d_new[key1]
        continue
# Prepare output
result = {}
for key, value in d_new.items():
    result[value["label"]] = list(set(value["sentences"]))
return result
Since you are using itertools.combinations, looping over the dictionary takes quadratic time in the number of keys. Your code otherwise seems about as optimized as it can be, and there probably isn't a better way to find similar keys unless you know in advance what kind of similarity you are looking for.
As the other answer said, further optimization depends on token_set_ratio.
You can also experiment with a dict comprehension to see whether it makes the dict creation faster.
For example:
{value['label']: list(set(value['sentences'])) for key, value in d_new.items()}
Suggestion:
With so little information about what your keys contain it's hard to say, but if your keys are structured and follow a pattern, and your dict is large enough, you could generate the likely variants of each key and look them up directly, since a dict key lookup is O(1). The cost is in 'guessing' the similar keys (for example, if your keys are words, you probably know which words are similar). As I said, the dict has to be huge for this method to pay off.
Whether or not this can be optimized depends on the implementation of token_set_ratio. If token_set_ratio is your only heuristic of similarity between the two tokens, and the implementation of token_set_ratio is a black box, then this is the most Big O efficient way to calculate similarity. The reasoning here is that because your only heuristic is a black box, you must test every binary combination to get a similarity score.
If instead, you have a different heuristic that works on a single token, then you may be able to precompute a feature score that can help you cluster more efficiently. You would expect some false positives.
Alternatively, if you know how the original heuristic is computed, then there may be a way to optimize based on the heuristic. In that case, you should post it here.
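The idea of precomputing a per-key feature can be sketched as follows. This is only an illustration under assumptions: cheap_feature is a hypothetical feature (the alphabetically first token), and similar uses difflib.SequenceMatcher as a stand-in for fuzz.token_set_ratio so the snippet runs without third-party packages. Only keys that land in the same bucket are compared, so similar keys whose feature differs will be missed.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def cheap_feature(key):
    """Hypothetical per-key feature: the alphabetically first lowercase token."""
    tokens = sorted(key.lower().split())
    return tokens[0] if tokens else ""


def similar(a, b, threshold=0.8):
    """Stand-in for the real similarity test (e.g. fuzz.token_set_ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def candidate_pairs(keys):
    """Yield only the pairs of keys that share the same cheap feature."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[cheap_feature(key)].append(key)
    for bucket in buckets.values():
        yield from combinations(bucket, 2)


keys = ["apple pie", "Apple pies", "banana bread", "bread banana", "cherry tart"]
# Only pairs inside a bucket reach the expensive comparison.
pairs = [(a, b) for a, b in candidate_pairs(keys) if similar(a, b)]
```

With these five keys the expensive comparison runs on 2 candidate pairs instead of all 10. How much this saves in practice depends entirely on how well the chosen feature separates the keys.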
If there is a large percentage of the keys that will be merged, it may be worthwhile to implement the combination logic yourself so that keys that have been merged are not checked again.
There is no data to test this with, but here's how it could look:
from collections import deque
# assumes `fuzz` and `similarity` are defined as in the question
def mergeKeys(d):
    result = {k: set(v) for k, v in d.items()}  # start with nothing merged
    keys = deque(d.keys())                      # remaining keys
    while keys:
        key1 = keys.popleft()                   # consume one key
        nextKeys = deque()                      # track unmerged for next pass
        while keys:
            key2 = keys.popleft()               # consume paired key
            if key2 not in result: continue
            if fuzz.token_set_ratio(key1, key2) >= similarity:
                result[key1].update(result.pop(key2))  # merge
            else:
                nextKeys.append(key2)           # not merged, add to next pass
        keys = nextKeys                         # remaining keys for next pass
    return result
I'm looking for the fastest way to do the following: given a dictionary and a key value, return the lowest key in the dictionary greater than the value given. Per this question, the natural way would seem to be to create an OrderedDict, then use bisect on the keys to find the proper key location. The OrderedDict.keys() method doesn't support indexing, so per e.g. this question, one has to convert the keys to a list, before doing bisect or similar.
So once an OrderedDict has been created with its keys in order, in order to access the correct position one has to do the following:
1. Convert the keys to a list.
2. Do a binary search of the keys with bisect or similar.
3. Check that this insertion point isn't at the end of the list, before retrieving the key located after this index.
4. Retrieve the key value in our original OrderedDict.
I'm most concerned about step 1 above, from an efficiency perspective (although all of this looks roundabout to me). Without knowing the details of how Python does the conversion to list, it seems like it would have to be O(n), completely eliminating the savings of using OrderedDict and binary search. I'm hoping someone can tell me whether this assumption I have about step 1 is or isn't correct, and regardless whether or not there may be a better method.
As an alternative, I could simply create a list of tuples, sorted by the first element (key), where the second element is the dict value associated with that key. Then I could pass the key lambda x:x[0] to bisect. This seems reasonable, but I'd prefer to store my key / value pairs more uniformly (e.g. JSON), since that's how it's done with other dicts in the same project that don't need this specific type of comparison.
Here's some example code for a single lookup. Edit: But lest anyone think I'm over-optimizing, the actual dictionary has ~3 million keys, and will be accessed ~7 million times in a batch, daily. So I'm very interested in finding a fast way of doing this.
# Single lookup example
from collections import OrderedDict
from bisect import bisect
d = OrderedDict()
d[5] = 'lowest_value'
d[7] = 'middle_value'
d[12] = 'highest_value'
sample_key = 6 # we want to find the value for the key above this in d, e.g. d[7]
list_of_keys = list(d.keys())
key_insertion_index = bisect(list_of_keys,sample_key)
if key_insertion_index < len(list_of_keys):
    next_higher_key = list_of_keys[key_insertion_index]
    v = d[next_higher_key]
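Converting the keys to a list is indeed O(n), so for a batch of lookups it may help to do that conversion once and reuse the sorted list. A minimal sketch, assuming the dictionary does not change between lookups; the NextKeyLookup class and its value_above method are hypothetical names used only for illustration:

```python
from bisect import bisect_right


class NextKeyLookup:
    """Answers 'value of the lowest key strictly greater than x' via binary search."""

    def __init__(self, mapping):
        self._mapping = mapping
        self._keys = sorted(mapping)  # O(n log n), paid once up front

    def value_above(self, key, default=None):
        # bisect_right returns the index of the first stored key > key
        i = bisect_right(self._keys, key)
        if i == len(self._keys):
            return default  # no stored key is greater than `key`
        return self._mapping[self._keys[i]]


lookup = NextKeyLookup({5: 'lowest_value', 7: 'middle_value', 12: 'highest_value'})
print(lookup.value_above(6))  # middle_value
```

Each lookup is then O(log n) for the binary search plus one average-case O(1) dict access, and the key/value pairs can still be stored as an ordinary dict.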
I have a dictionary with key:value list pairings, and I intend to find the index of the value list that contains the desired element.
E.g., if the dictionary is:
my_dict = {"key1":['v1'], "key2":None, "key3":['v2','v3'], "key4":['v4','v5','v6']}
Then, given element 'v2' I should be able to get index 2.
For a value list with one element, the index can be obtained with: list(my_dict.values()).index(['v1']) , however this approach does not work with lists containing multiple elements.
Using for loop, it can be obtained via:
for key, value in my_dict.items():
    if value is None:
        continue
    if 'v2' in value:
        print (list(my_dict.keys()).index(key))
Is there a neater (pythonic) way to obtain the same?
You've got an XY problem. You want to know the key that points to a value, and you think you need to find the enumeration index iterating the values so you can then use it to find the key by iteration as well. You don't need all that. Just find the key directly:
my_dict = {"key1":['v1'], "key2":None, "key3":['v2','v3'], "key4":['v4','v5','v6']}
value = 'v2'
# Iterate key/vals pairs in genexpr; if the vals contains value, yield the key,
# next stops immediately for the first key yielded so you don't iterate the whole dict
# when the value is found on an early key
key_for_value = next(key for key, vals in my_dict.items() if vals and value in vals)
print(key_for_value)
That'll raise StopIteration if the value doesn't exist, otherwise it directly retrieves the first key where the values list for that key contains the desired value.
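If an exception is not wanted for a missing value, next also accepts a default as its second argument. A small sketch; the find_key helper name is hypothetical:

```python
def find_key(mapping, value, default=None):
    """Return the first key whose value list contains `value`, else `default`."""
    return next(
        (key for key, vals in mapping.items() if vals and value in vals),
        default,
    )


my_dict = {"key1": ['v1'], "key2": None, "key3": ['v2', 'v3'], "key4": ['v4', 'v5', 'v6']}
print(find_key(my_dict, 'v2'))       # key3
print(find_key(my_dict, 'missing'))  # None
```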
If you don't really have an XY problem, and the index is important (it shouldn't be, that's a misuse of dicts) it's trivial to produce it as well, changing the extraction of the key to get both, e.g.:
index, key_for_value = next((i, key) for i, (key, vals) in enumerate(my_dict.items()) if vals and value in vals)
Mind you, this is a terrible solution if you need to perform these lookups a lot and my_dict isn't trivially small; it's O(n) on the total number of values, so a large dict would take quite a while to check (relative to the cost of just looking up an arbitrary key, which is average-case O(1)). In that case, ideally, if my_dict doesn't change much/at all, you'd construct a reversed dictionary up-front to find the key(s) associated with a value, e.g.:
from collections import defaultdict
my_dict = {"key1":['v1'], "key2":None, "key3":['v2','v3'], "key4":['v4','v5','v6']}
reversed_my_dict = defaultdict(set)
for key, vals in my_dict.items():
    for val in vals or ():  # skip keys whose value is None
        reversed_my_dict[val].add(key)
reversed_my_dict = dict(reversed_my_dict)  # Optional: Prevents future autovivification of keys
                                           # by converting back to plain dict
after which you can cheaply determine the key(s) associated with a given value with:
reversed_my_dict.get(value, ()) # Using .get prevents autovivification of keys even if still a defaultdict
which returns the set of all keys that map to that value, if any, or the empty tuple if not (if you convert back to dict above, reversed_my_dict[value] would also work if you'd prefer to get a KeyError when the value is missing entirely; leaving it a defaultdict(set) would silently construct a new empty set, map it to the key and return it, which is fine if this happens rarely, but a problem if you test thousands of unmapped values and create a corresponding thousands of empty sets for no benefit, consuming memory wastefully).
Which you choose depends on how big my_dict is (for small my_dict, O(n) work doesn't really matter that much), how many times you need to search it (fewer searches mean less gain from reversed dict), and whether it's regularly modified. For that last point, if it's never modified, or rarely modified between lookups, rebuilding the reversed dict from scratch after each modification might be worth it for simplicity (assuming you perform many lookups per rebuild); if it's frequently modified, the reversed dict might still be worth it, you'd just have to update both the forward and reversed dicts rather than just one, e.g., expanding:
# New key
my_dict[newkey] = [newval1, newval2]
# Add value
my_dict[existingkey].append(newval)
# Delete value
my_dict[existingkey].remove(badval)
# Delete key
del my_dict[existingkey]
to:
# New key
newvals = my_dict[newkey] = [newval1, newval2]
for newval in newvals:
    reversed_my_dict[newval].add(newkey)  # reversed_my_dict.setdefault(newval, set()).add(newkey) if not defaultdict(set) anymore
# Add value
my_dict[existingkey].append(newval)
reversed_my_dict[newval].add(existingkey)  # reversed_my_dict.setdefault(newval, set()).add(existingkey) if not defaultdict(set) anymore
# Delete value
my_dict[existingkey].remove(badval)
if badval not in my_dict[existingkey]:  # Removed last copy; test only needed if one key can hold same value more than once
    reversed_my_dict[badval].discard(existingkey)
    # Optional delete badval from reverse mapping if last key removed:
    if not reversed_my_dict[badval]:
        del reversed_my_dict[badval]
# Delete key
# set() conversion not needed if my_dict's value lists guaranteed not to contain duplicates
for badval in set(my_dict.pop(existingkey)):
    reversed_my_dict[badval].discard(existingkey)
    # Optional delete badval from reverse mapping if last key removed:
    if not reversed_my_dict[badval]:
        del reversed_my_dict[badval]
respectively, roughly doubling the work incurred by modifications, in exchange for always getting O(1) lookups in either direction.
If you are looking for the key corresponding to a value, you can reverse the dictionary like so:
reverse_dict = {e: k for k, v in my_dict.items() if v for e in v}
Careful with duplicate values though. The last occurrence will override the previous ones.
Don't know if it's the best solution but this works:
value = 'v2'
list(map(lambda x : value in x, list(map(lambda x : x[1] or [], list(my_dict.items()))))).index(True)
I am modeling data for an application and decided to choose dictionary as my data structure. But each row in the data has multiple keys. So I created a dictionary with multiple keys mapping each row, something like:
>>> multiKeyDict = {}
>>> multiKeyDict[('key1','key2','key3')] = 'value1'
>>> multiKeyDict.get(('key1','key2','key3'))
'value1'
Now I have to retrieve all the values with key1 in O(1) time. From my research I know I could do:
use this package to get the job done but not sure if it is O(1)
search for keys as suggested here: https://stackoverflow.com/a/18453567/4085019
I am also open for any better data structures instead of using the dictionary.
You don't have multiple keys. As far as the Python dictionary is concerned, there is just one key, a tuple object. You can't search for the constituents of the tuple in anything other than O(N) linear time.
If your keys are unique, just add each key individually:
multiKeyDict['key1'] = multiKeyDict['key2'] = multiKeyDict['key3'] = 'value1'
Now you have 3 keys all referencing one value. The value object is not duplicated here, only the references to it are.
The multi_key_dict package you found uses an intermediate mapping to map a given constituent key to the composite key, which then maps to the value. This gives you O(1) search too, with the same limitation that each constituent key must be unique.
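The intermediate-mapping idea can be sketched in a few lines. This only illustrates the technique described above and is not the package's actual code; the MultiKeyStore class and its method names are hypothetical:

```python
class MultiKeyStore:
    """Sketch: constituent key -> composite key -> value, both lookups O(1) on average."""

    def __init__(self):
        self._composite_for = {}  # constituent key -> composite tuple
        self._values = {}         # composite tuple -> value

    def set(self, keys, value):
        composite = tuple(keys)
        # Each constituent key may belong to only one composite key.
        for key in composite:
            owner = self._composite_for.get(key, composite)
            if owner != composite:
                raise KeyError(f"{key!r} already belongs to {owner!r}")
        for key in composite:
            self._composite_for[key] = composite
        self._values[composite] = value

    def get(self, key):
        return self._values[self._composite_for[key]]


store = MultiKeyStore()
store.set(('key1', 'key2', 'key3'), 'value1')
print(store.get('key2'))  # value1
```

The uniqueness check makes the limitation explicit: a constituent key that appears in two different rows cannot be resolved to a single value with this layout.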
If your keys are not unique then you need to map each key to another container that then holds the values, like a set for instance:
for key in ('key1', 'key2', 'key3'):
    multiKeyDict.setdefault(key, set()).add(value)
Now looking up a key gives you the set of all values that that key references.
If you need to be able to combine keys too, then you can add additional references with those combinations. Key-value pairings are relatively cheap, it's all just references. The key and value objects themselves are not duplicated.
Another possibility is to build up an index to a list of row-objects which share a key-component. Provided the number of rows sharing any particular key value is small, this will be quite efficient. (Assume row-objects have keys accessed as row.key1, row.key2 etc., that's not a very relevant detail). Untested code:
index = {}
for row in rows:
    index.setdefault( row.key1, []).append(row)
    index.setdefault( row.key2, []).append(row)
    index.setdefault( row.key3, []).append(row)
and then for looking up rows that match, say, key2 and key3:
candidates = index[ key2]
if len( index[key3]) < len(candidates):
    candidates = index[key3]  # use key3 if it offers a better distribution
results = []
for cand in candidates:
    if cand.key2 == key2 and cand.key3 == key3:  # full test is necessary!
        results.append( cand)
I apologize, as this must be a basic question about using dictionaries. I'm learning Python, and my objective is to compare two dictionaries and recover the key and value entries that are identical in both. I understand that order is not relevant in dictionaries the way it is when working with a list. But I adapted some code to compare my dictionaries and I just wanted to make sure that the order of the dictionaries does not matter.
The code I have written so far is:
def compare_dict(first,second):
    with open('Common_hits_python.txt', 'w') as file:
        for keyone in first:
            for keytwo in second:
                if keytwo == keyone:
                    if first[keyone] == second[keytwo]:
                        file.write(keyone + "\t" + first[keyone] + "\n")
Any recommendations would be appreciated. I apologize for the redundancy in the code above. But if someone could confirm that comparing two dictionaries this way does not require the keys to be in the same order, that would be great. Other ways of writing the function would be really appreciated as well.
Since you loop over both dictionaries and compare all the combinations, no, order doesn't matter. Every key in one dictionary is compared with every key in the other dictionary, eventually.
It is not a very efficient way to test for matching keys, however. Testing if a key is present is as simple as keyone in second, no need to loop over all the keys in second here.
Better still, you can use set intersections instead:
for key, value in first.viewitems() & second.viewitems():
    # loops over all key - value pairs that match in both.
    file.write('{}\t{}\n'.format(key, value))
This uses dictionary view objects; if you are using Python 3, then you can use first.items() & second.items() as dictionaries there return dictionary views by default.
Using dict.viewitems() as a set only works if the values are hashable too, but since you are treating your values as strings when writing to the file I assumed they were.
If your values are not hashable, you'll need to validate that the values match, but you can still use views and intersect just the keys:
for key in first.viewkeys() & second.viewkeys():
    # loops over all keys that match in both.
    if first[key] == second[key]:
        file.write('{}\t{}\n'.format(key, first[key]))
Again, in Python 3, use first.keys() & second.keys() for the intersection of the two dictionaries by keys.
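For reference, a Python 3 version of both variants might look like this. It is a minimal sketch; the function names common_items and common_items_unhashable are made up for the example:

```python
def common_items(first, second):
    """Key/value pairs present in both dicts; values must be hashable."""
    return dict(first.items() & second.items())


def common_items_unhashable(first, second):
    """Same result, but works when the values are not hashable (e.g. lists)."""
    return {
        key: first[key]
        for key in first.keys() & second.keys()
        if first[key] == second[key]
    }


first = {'a': '1', 'b': '2', 'c': '3'}
second = {'c': '3', 'b': 'x', 'a': '1'}
print(common_items(first, second) == {'a': '1', 'c': '3'})  # True
```

Neither function depends on the order in which the keys were inserted into either dictionary.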
Your way of doing it is valid. As you loop through both dictionaries, the order of the keys does not matter.
You could do this instead, to optimize your code.
for keyone in first:
    if keyone in second:  # returns true if keyone is present in second.
        if first[keyone] == second[keyone]:
            file.write(keyone + "\t" + first[keyone] + "\n")
The keys of a dictionary are effectively a set, and Python already has a built-in set type with an efficient intersection method. This will produce a set of keys that are common to both dictionaries:
dict0 = {...}
dict1 = {...}
set0 = set(dict0)
set1 = set(dict1)
keys = set0.intersection(set1)
Your goal is to build a dictionary out of these keys, which can be done with a dictionary comprehension. It will require a condition to keep out the keys that have unequal values in the two original dictionaries:
new_dict = {k: dict0[k] for k in keys if dict0[k] == dict1[k]}
Depending on your intended use for the new dictionary, you might want to copy or deepcopy the old dictionary's values into the new one.
Suppose I have some kind of dictionary structure like this (or another data structure representing the same thing):
d = {
    42.123231:'X',
    42.1432423:'Y',
    45.3213213:'Z',
    # ...etc
}
I want to create a function like this:
def f(n, d, e):
    '''Return a list with the values in dictionary d whose keys are
    within (+/-) the float error term e of the float n'''
So if I called the function like this with the above dictionary:
f(42,d,2)
It would return
['X','Y']
However, while it is straightforward to write this function with a loop, I don't want to do something that goes through every value in the dictionary and checks it exhaustively, but I want it to take advantage of the indexed structure somehow (or a even a sorted list could be used) to make the search much faster.
A dictionary is the wrong data structure for this. Write a search tree.
A Python dictionary is a hash map implementation. Its keys can't be compared and traversed in order as in a search tree, so you simply can't do this with a Python dictionary without actually checking all keys.
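As the question itself notes, a sorted list can stand in for a search tree when the data does not change often. A minimal sketch using the standard bisect module; make_range_lookup is a hypothetical helper, and the bounds are treated as inclusive:

```python
from bisect import bisect_left, bisect_right


def make_range_lookup(d):
    """Sort the items once and return f(n, e) -> values with n - e <= key <= n + e."""
    items = sorted(d.items(), key=lambda kv: kv[0])
    keys = [k for k, _ in items]
    values = [v for _, v in items]

    def f(n, e):
        lo = bisect_left(keys, n - e)    # first key >= n - e
        hi = bisect_right(keys, n + e)   # first key > n + e
        return values[lo:hi]

    return f


d = {42.123231: 'X', 42.1432423: 'Y', 45.3213213: 'Z'}
f = make_range_lookup(d)
print(f(42, 2))  # ['X', 'Y']
```

Sorting costs O(n log n) once; each query is then O(log n) plus the size of the result. If keys are inserted or removed frequently, a real balanced tree would be the better fit.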
A plain dict does not keep its keys sorted by value (from Python 3.7 on it only preserves insertion order), so sort the items once into an OrderedDict:
from collections import OrderedDict
d_ordered = OrderedDict(sorted(d.items(), key=lambda i: i[0]))
Then filtering the values is rather simple, and the scan stops at the upper bound:
import itertools
values = [val for k, val in
          itertools.takewhile(lambda kv: kv[0] < upper, d_ordered.items())
          if k > lower]
The sort is what makes takewhile correct here: it stops at the first key that is not below upper, which is only safe when the keys are iterated in ascending order. The scan still walks every key below lower, so it remains linear in the worst case.