Value-sorted dict for Python? - python

I am interested in a dict implementation for Python that provides an iterating interface to sorted values. I.e., a dict with a "sortedvalues()" function.
Naively one can do sorted(dict.values()) but that's not what I want. Every time items are inserted or deleted, one has to run a full sorting which isn't efficient.
Note that I am not asking about key-sorted dict either (for that question, there are excellent answers in Key-ordered dict in Python and Python 2.6 TreeMap/SortedDictionary?).

One solution is to write a class that inherits from dict but also maintains a list of keys sorted by their value (sorted_keys), along with the list of corresponding (sorted) values (sorted_values).
You can then define a __setitem__() method that uses the bisect module in order to know quickly the position k where the new (key, value) pair should be inserted in the two lists. You can then insert the new key and the new value both in the dictionary itself, and in the two lists that you maintain, with sorted_values[k:k] = [new_value] and sorted_keys[k:k] = [new_key]; unfortunately, the time complexity of such an insertion is O(n) (so O(n^2) for the whole dictionary).
Another approach to the ordered element insertion would be to use the heapq module and insert (value, key) pairs in it. Each insertion then takes O(log n), compared with O(n) for the list-based approach of the previous paragraph.
Iterating over the dictionary can then simply be done by iterating over the list of keys (sorted_keys) that you maintain.
This method saves you the time it would take to sort the keys each time you want to iterate over the dictionary (with sorted values), by basically shifting (and increasing, unfortunately) this time cost to the construction of the sorted lists of keys and values.

The problem is that you need to sort or hash it by keys to get reasonable insert and lookup performance. A naive way of implementing it would be a value-sorted tree structure of entries, and a dict to lookup the tree position for a key. You need to get deep into updating the tree though, as this lookup dictionary needs to be kept correct. Essentially, as you would do for an updatable heap.
I figure there are too many options to make a reasonable standard library option out of such a structure, while it is too rarely needed.
Update: a trick that might work for you is to use a dual structure:
a regular dict storing the key-value pairs as usual
any kind of sorted list, for example using bisect
Then you have to implement the common operations on both: a new value is inserted into both structures. The tricky parts are the update and delete operations. You use the first structure to look up the old value, delete the old value from the second structure, then (when updating) reinsert as before.
If you need to know the keys too, store (value, key) pairs in your sorted list.
Update 2: Try this class:
import bisect

class dictvs(dict):
    def __init__(self):
        dict.__init__(self)
        self._list = []  # values, kept sorted

    def __setitem__(self, key, value):
        old = self.get(key)
        if old is None:
            # New key: insert the value at its sorted position.
            bisect.insort(self._list, value)
            dict.__setitem__(self, key, value)
        else:
            # Existing key: shift the elements between the old and new
            # positions by one slot, then write the new value.
            oldpos = bisect.bisect_left(self._list, old)
            newpos = bisect.bisect_left(self._list, value)
            if newpos > oldpos:
                newpos -= 1
                for i in range(oldpos, newpos):
                    self._list[i] = self._list[i + 1]
            else:
                for i in range(oldpos, newpos, -1):
                    self._list[i] = self._list[i - 1]
            self._list[newpos] = value
            dict.__setitem__(self, key, value)

    def __delitem__(self, key):
        old = self.get(key)
        if old is not None:
            # bisect_left returns the index of old itself;
            # bisect.bisect would point one past it.
            oldpos = bisect.bisect_left(self._list, old)
            del self._list[oldpos]
        dict.__delitem__(self, key)

    def values(self):
        return list(self._list)
It's not a complete dict yet, I guess. I haven't tested deletions, and only a tiny set of updates. You should write a larger unit test for it, and compare the return of values() with that of sorted(dict.values(instance)) there. This is just to show how to update the sorted list with bisect.
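If the keys are needed as well, the (value, key) variant mentioned above might look roughly like the sketch below. The class name ValueSortedDict and the method sorted_items() are made up for illustration, and the sketch assumes that values (and, on ties, keys) can be compared with each other. Each insertion is still O(n) because of the list shifting; only the search for the position is O(log n).

```python
import bisect


class ValueSortedDict(dict):
    """Hypothetical dict that also keeps (value, key) pairs in sorted order."""

    def __init__(self):
        super().__init__()
        self._pairs = []  # sorted list of (value, key) tuples

    def _remove_pair(self, key):
        # Locate the exact (value, key) tuple by binary search, then delete it.
        pair = (self[key], key)  # raises KeyError for a missing key
        pos = bisect.bisect_left(self._pairs, pair)
        del self._pairs[pos]

    def __setitem__(self, key, value):
        if key in self:
            self._remove_pair(key)
        bisect.insort(self._pairs, (value, key))
        super().__setitem__(key, value)

    def __delitem__(self, key):
        self._remove_pair(key)
        super().__delitem__(key)

    def sorted_items(self):
        """Yield (key, value) pairs in ascending order of value."""
        for value, key in self._pairs:
            yield key, value
```

Like the class above, this only covers item assignment and deletion; update(), pop(), setdefault() and friends would bypass the sorted list unless they are overridden too.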

Here is another, simpler idea:
You create a class that inherits from dict.
You use a cache: you only sort the keys when iterating over the dictionary, and you mark the dictionary as being sorted; insertions should simply append to the list of keys.
kindall mentions in a comment that sorting lists that are almost sorted is fast, so this approach should be quite fast.
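A minimal sketch of this caching idea, assuming only item assignment and deletion need to be supported. The names LazySortedDict and keys_by_value() are invented for the example.

```python
class LazySortedDict(dict):
    """Hypothetical dict that sorts its keys by value only when asked."""

    def __init__(self):
        super().__init__()
        self._keys = []      # keys; sorted by value only while _dirty is False
        self._dirty = False

    def __setitem__(self, key, value):
        if key not in self:
            self._keys.append(key)   # cheap append; order is fixed up later
        super().__setitem__(key, value)
        self._dirty = True

    def __delitem__(self, key):
        super().__delitem__(key)     # raises KeyError for a missing key
        self._keys.remove(key)       # O(n) scan; remaining order is unchanged

    def keys_by_value(self):
        """Return the keys in ascending order of value, sorting only if needed."""
        if self._dirty:
            self._keys.sort(key=self.__getitem__)
            self._dirty = False
        return list(self._keys)
```

Repeated calls to keys_by_value() without intervening writes skip the sort entirely, and after a handful of writes the list is still mostly ordered, which is the case list.sort() handles well.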

You can use a skip dict. It is a Python dictionary that is permanently sorted by value.
Insertion is slightly more expensive than a regular dictionary, but it is well worth the cost if you frequently need to iterate in order, or perform value-based queries such as:
What's the highest / lowest item?
Which items have a value between X and Y?

Related

Using bisect efficiently with the keys of an OrderedDict

I'm looking for the fastest way to do the following: given a dictionary and a key value, return the lowest key in the dictionary greater than the value given. Per this question, the natural way would seem to be to create an OrderedDict, then use bisect on the keys to find the proper key location. The OrderedDict.keys() method doesn't support indexing, so per e.g. this question, one has to convert the keys to a list before doing bisect or similar.
So once an OrderedDict has been created with its keys in order, in order to access the correct position one has to do the following:
Convert the keys to a list
Do a binary search of the keys with bisect or similar.
Check that this insertion point isn't at the end of the list, before retrieving the key located after this index.
Retrieve the key value in our original OrderedDict.
I'm most concerned about step 1 above, from an efficiency perspective (although all of this looks roundabout to me). Without knowing the details of how Python does the conversion to list, it seems like it would have to be O(n), completely eliminating the savings of using OrderedDict and binary search. I'm hoping someone can tell me whether this assumption I have about step 1 is or isn't correct, and regardless whether or not there may be a better method.
As an alternative, I could simply create a list of tuples, sorted by the first element (key), where the second element is the dict value associated with that key. Then I could pass the key lambda x:x[0] to bisect. This seems reasonable, but I'd prefer to store my key / value pairs more uniformly (e.g. JSON), since that's how it's done with other dicts in the same project that don't need this specific type of comparison.
Here's some example code for a single lookup. Edit: But lest anyone think I'm over-optimizing, the actual dictionary has ~3 million keys, and will be accessed ~7 million times in a batch, daily. So I'm very interested in finding a fast way of doing this.
# Single lookup example
from collections import OrderedDict
from bisect import bisect
d = OrderedDict()
d[5] = 'lowest_value'
d[7] = 'middle_value'
d[12] = 'highest_value'
sample_key = 6 # we want to find the value for the key above this in d, e.g. d[7]
list_of_keys = list(d.keys())
key_insertion_index = bisect(list_of_keys,sample_key)
if key_insertion_index < len(list_of_keys):
    next_higher_key = list_of_keys[key_insertion_index]
    v = d[next_higher_key]
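The concern about step 1 is justified: list(d.keys()) copies every key, so it is O(n) per call. One possible way around it, sketched below under the assumption that the dictionary does not change between lookups in a batch, is to build the sorted key list once and reuse it. The helper names build_index and next_higher_value are made up for this example.

```python
from bisect import bisect_right


def build_index(d):
    """Sort the keys once; reuse the returned list for every lookup."""
    return sorted(d)


def next_higher_value(d, sorted_keys, sample_key, default=None):
    """Return the value for the lowest key strictly greater than sample_key."""
    i = bisect_right(sorted_keys, sample_key)
    if i == len(sorted_keys):
        return default           # no key is greater than sample_key
    return d[sorted_keys[i]]


d = {5: "lowest_value", 7: "middle_value", 12: "highest_value"}
sorted_keys = build_index(d)                 # O(n log n), paid once
print(next_higher_value(d, sorted_keys, 6))  # middle_value
```

With the list built up front, each lookup is one O(log n) bisect plus one dict access, and a plain dict is enough since the ordering lives in sorted_keys.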

Python: Obtaining index of an element within a value list of a dictionary

I have a dictionary with key:value list pairings, and I intend to find the index of the value list that contains the desired element.
E.g., if the dictionary is:
my_dict = {"key1":['v1'], "key2":None, "key3":['v2','v3'], "key4":['v4','v5','v6']}
Then, given element 'v2' I should be able to get index 2.
For a value list with one element, the index can be obtained with: list(my_dict.values()).index(['v1']) , however this approach does not work with lists containing multiple elements.
Using for loop, it can be obtained via:
for key, value in my_dict.items():
    if value is None:
        continue
    if 'v2' in value:
        print(list(my_dict.keys()).index(key))
Is there a neater (pythonic) way to obtain the same?
You've got an XY problem. You want to know the key that points to a value, and you think you need to find the enumeration index iterating the values so you can then use it to find the key by iteration as well. You don't need all that. Just find the key directly:
my_dict = {"key1":['v1'], "key2":None, "key3":['v2','v3'], "key4":['v4','v5','v6']}
value = 'v2'
# Iterate key/vals pairs in genexpr; if the vals contains value, yield the key,
# next stops immediately for the first key yielded so you don't iterate the whole dict
# when the value is found on an early key
key_for_value = next(key for key, vals in my_dict.items() if vals and value in vals)
print(key_for_value)
That'll raise StopIteration if the value doesn't exist, otherwise it directly retrieves the first key where the values list for that key contains the desired value.
If you don't really have an XY problem, and the index is important (it shouldn't be, that's a misuse of dicts) it's trivial to produce it as well, changing the extraction of the key to get both, e.g.:
index, key_for_value = next((i, key) for i, (key, vals) in enumerate(my_dict.items()) if vals and value in vals)
Mind you, this is a terrible solution if you need to perform these lookups a lot and my_dict isn't trivially small; it's O(n) on the total number of values, so a large dict would take quite a while to check (relative to the cost of just looking up an arbitrary key, which is average-case O(1)). In that case, ideally, if my_dict doesn't change much/at all, you'd construct a reversed dictionary up-front to find the key(s) associated with a value, e.g.:
from collections import defaultdict
my_dict = {"key1":['v1'], "key2":None, "key3":['v2','v3'], "key4":['v4','v5','v6']}
reversed_my_dict = defaultdict(set)
for key, vals in my_dict.items():
    for val in vals or ():  # "or ()" skips keys whose value is None
        reversed_my_dict[val].add(key)
reversed_my_dict = dict(reversed_my_dict) # Optional: Prevents future autovivification of keys
                                          # by converting back to plain dict
after which you can cheaply determine the key(s) associated with a given value with:
reversed_my_dict.get(value, ()) # Using .get prevents autovivification of keys even if still a defaultdict
which returns the set of all keys that map to that value, if any, or the empty tuple if not (if you convert back to dict above, reversed_my_dict[value] would also work if you'd prefer to get a KeyError when the value is missing entirely; leaving it a defaultdict(set) would silently construct a new empty set, map it to the key and return it, which is fine if this happens rarely, but a problem if you test thousands of unmapped values and create a corresponding thousands of empty sets for no benefit, consuming memory wastefully).
Which you choose depends on how big my_dict is (for small my_dict, O(n) work doesn't really matter that much), how many times you need to search it (fewer searches mean less gain from reversed dict), and whether it's regularly modified. For that last point, if it's never modified, or rarely modified between lookups, rebuilding the reversed dict from scratch after each modification might be worth it for simplicity (assuming you perform many lookups per rebuild); if it's frequently modified, the reversed dict might still be worth it, you'd just have to update both the forward and reversed dicts rather than just one, e.g., expanding:
# New key
my_dict[newkey] = [newval1, newval2]
# Add value
my_dict[existingkey].append(newval)
# Delete value
my_dict[existingkey].remove(badval)
# Delete key
del my_dict[existingkey]
to:
# New key
newvals = my_dict[newkey] = [newval1, newval2]
for newval in newvals:
    reversed_my_dict[newval].add(newkey) # reversed_my_dict.setdefault(newval, set()).add(newkey) if not defaultdict(set) anymore
# Add value
my_dict[existingkey].append(newval)
reversed_my_dict[newval].add(existingkey) # reversed_my_dict.setdefault(newval, set()).add(existingkey) if not defaultdict(set) anymore
# Delete value
my_dict[existingkey].remove(badval)
if badval not in my_dict[existingkey]: # Removed last copy; test only needed if one key can hold same value more than once
    reversed_my_dict[badval].discard(existingkey)
    # Optional delete badval from reverse mapping if last key removed:
    if not reversed_my_dict[badval]:
        del reversed_my_dict[badval]
# Delete key
# set() conversion not needed if my_dict's value lists guaranteed not to contain duplicates
for badval in set(my_dict.pop(existingkey)):
    reversed_my_dict[badval].discard(existingkey)
    # Optional delete badval from reverse mapping if last key removed:
    if not reversed_my_dict[badval]:
        del reversed_my_dict[badval]
respectively, roughly doubling the work incurred by modifications, in exchange for always getting O(1) lookups in either direction.
If you are looking for the key corresponding to a value, you can reverse the dictionary like so:
reverse_dict = {e: k for k, v in my_dict.items() if v for e in v}
Careful with duplicate values though. The last occurrence will override the previous ones.
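A small illustration of that caveat, using a made-up dictionary in which 'v1' appears under two keys. The second variant, which collects every key in a list, is just one possible workaround.

```python
my_dict = {"key1": ["v1"], "key2": None, "key3": ["v1", "v2"]}

# Last key wins: "v1" appears under key1 and key3, and key3 is visited later.
last_wins = {e: k for k, v in my_dict.items() if v for e in v}

# Keep every key instead by collecting them in a list per value.
all_keys = {}
for k, v in my_dict.items():
    for e in v or ():
        all_keys.setdefault(e, []).append(k)

print(last_wins)  # {'v1': 'key3', 'v2': 'key3'}
print(all_keys)   # {'v1': ['key1', 'key3'], 'v2': ['key3']}
```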
Don't know if it's the best solution but this works:
value = 'v2'
list(map(lambda x : value in x, list(map(lambda x : x[1] or [], list(my_dict.items()))))).index(True)

Optimization of comparisons between each key in a dictionary (Python)

So I want to compare each key of a dictionary to each other and if a key happens to be too similar to another key (based on fuzzy similarity), I want to merge those 2 entries together into a single key (so one key gets deleted whereas the values will be added up). Is there a more efficient way to do this?
d is a dictionary with {"labels": [list of sentences]}
# First I create a new dictionary that has a dictionary as value which includes the key and value
# of the old dictionary. It feels a bit redundant but afaik you can't loop through a dictionary and
# edit it at the same time + you can't edit keys themselves?
d_new = {}
for key, value in d.items():
    d_new[key] = {"label": key, "sentences": value}
for key1, key2 in itertools.combinations(d, 2):
    if fuzz.token_set_ratio(key1, key2) >= similarity:
        if len(d_new[key1]["sentences"]) > len(d_new[key2]["sentences"]):
            d_new[key2]["label"] = key1
        # Merge keys
        d_new[key2]["sentences"] = list(set(d_new[key1]["sentences"] + d_new[key2]["sentences"]))
        del d_new[key1]
        continue
# Prepare output
result = {}
for key, value in d_new.items():
    result[value["label"]] = list(set(value["sentences"]))
return result
Since you are using itertools.combinations, looping over the dictionary takes quadratic time in the number of keys. Otherwise your code seems to be optimized for what you want, and there probably isn't a better way to find similar keys unless you know what kind of similarity you are looking for.
As the other answer said, further optimization depends on token_set_ratio.
You can also experiment with a dict comprehension to see whether it makes the dict creation faster.
For example:
{value['label']: list(set(value['sentences'])) for (key, value) in d_new.items()}
Suggestion:
Due to the limited data on what your keys contain it's hard to say, but if your keys are structured, follow a pattern, and your dict is large enough, you can brute-force candidate keys, since accessing dict keys is O(1). There is, however, the cost of 'guessing' similar things (for example, if your keys are words, you can probably know which words are similar). As I said, the dict should be huge for this method to be efficient.
Whether or not this can be optimized depends on the implementation of token_set_ratio. If token_set_ratio is your only heuristic of similarity between the two tokens, and the implementation of token_set_ratio is a black box, then this is the most Big O efficient way to calculate similarity. The reasoning here is that because your only heuristic is a black box, you must test every binary combination to get a similarity score.
If instead, you have a different heuristic that works on a single token, then you may be able to precompute a feature score that can help you cluster more efficiently. You would expect some false positives.
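As a rough sketch of such a precomputed feature, assuming the keys are whitespace-separated phrases: index the keys by their lowercase tokens and only score pairs that share at least one token. The function candidate_pairs is hypothetical and is not part of any fuzzy-matching library. Pairs that are similar without sharing an exact token (typos, for instance) would be missed, so this trades some recall for speed.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(keys):
    """Yield each pair of keys that shares at least one lowercase token."""
    by_token = defaultdict(set)
    for key in keys:
        for token in key.lower().split():
            by_token[token].add(key)

    seen = set()
    for group in by_token.values():
        # Sorting gives every pair a canonical (a, b) order with a < b.
        for pair in combinations(sorted(group), 2):
            if pair not in seen:
                seen.add(pair)
                yield pair
```

Only the pairs produced here would then be passed to the expensive similarity function. If some token occurs in nearly every key, the candidate set degenerates back towards all pairs, so very common tokens may need to be ignored.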
Alternatively, if you know how the original heuristic is computed, then there may be a way to optimize based on the heuristic. In that case, you should post it here.
If there is a large percentage of the keys that will be merged, it may be worthwhile to implement the combination logic yourself so that keys that have been merged are not checked again.
There is no data to test this but, here's how it could look like:
from collections import deque
def mergeKeys(d):
    # fuzz and similarity are the same names used in the question
    result = {k: set(v) for k, v in d.items()}  # start with nothing merged
    keys = deque(d.keys())                      # remaining keys
    while keys:
        key1 = keys.popleft()                   # consume one key
        nextKeys = deque()                      # track unmerged for next pass
        while keys:
            key2 = keys.popleft()               # consume paired key
            if key2 not in result: continue
            if fuzz.token_set_ratio(key1, key2) >= similarity:
                result[key1].update(result.pop(key2))  # merge
            else:
                nextKeys.append(key2)           # not merged, add to next pass
        keys = nextKeys                         # remaining keys for next pass
    return result

Extract the nth key in a python dictionary?

Given a python dictionary and an integer n, I need to access the nth key. I need to do this repeatedly many times in my project.
I have written a function which does this:
def ix(self,dict,n):
    count=0
    for i in sorted(dict.keys()):
        if n==count:
            return i
        else:
            count+=1
But the problem is that if the dictionary is huge, the time complexity increases when used repeatedly.
Is there an efficient way to do this?
I guess you wanted to do something like this, but as dictionary don't have any order so the order of keys in dict.keys can be anything:
def ix(self, dct, n): #don't use dict as a variable name
    try:
        return list(dct)[n] # or sorted(dct)[n] if you want the keys to be sorted
    except IndexError:
        print('not enough keys')
In Python 2, dict.keys() returns a list, so all you need to do is dict.keys()[n]
But, a dictionary is an unordered collection so nth element does not make any sense in this context.
Note: Indexing dict.keys() is not supported in python3
For those that want to avoid the creation of a new temporary list just to access the nth element, I suggest to use an iterator.
from itertools import islice
def nth_key(dct, n):
    it = iter(dct)
    # Consume n elements.
    next(islice(it, n, n), None)
    # Return the value at the current position.
    # This raises StopIteration if n is beyond the limits.
    # Use next(it, None) to suppress that exception.
    return next(it)
This can be notably faster for very large dictionaries compared to converting the keys into a temporary list first and then accessing its nth element.
Multiple answers mention that dictionaries are unordered. This is only true for Python versions before 3.7 (in CPython 3.6, insertion order was preserved as an implementation detail). From 3.7 onward, dictionaries are guaranteed to preserve insertion order.
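Since the question asks for the nth key many times, a simple option is to build the key list once and index it repeatedly, rather than re-sorting on every call. A minimal sketch with made-up data, assuming Python 3.7+ and a dictionary that is not modified between lookups:

```python
d = {"pear": 3, "apple": 1, "fig": 2}

# Python 3.7+: iteration follows insertion order.
insertion_order = list(d)
print(insertion_order[0])   # pear

# For repeated "nth key in sorted order" queries, sort once and index many times.
sorted_keys = sorted(d)
print(sorted_keys[0])       # apple
print(sorted_keys[2])       # pear
```

The cached list has to be rebuilt whenever keys are added or removed.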

Reversible dictionary for python

I'd like to store some data in Python in a similar form to a dictionary: {1:'a', 2:'b'}. Every value will be unique, not just among other values, but among keys too.
Is there a simple data structure that I can use to get the corresponding object no matter if I ask using the 'key' or the 'value'? For example:
>>> a = {1:'a', 2:'b'}
>>> a[1]
'a'
>>> a['b']
2
>>> a[3]
KeyError
The 'keys' are standard python ints, and the values are short (<256char) strings.
My current solution is creating a reversed dictionary and searching it if I can't find a result in the original dictionary:
pointsreversed = dict((v, k) for k, v in points.iteritems())
def lookup(key):
    return points.get(key) or pointsreversed.get(key)
This uses twice as much space, which isn't great (my dictionaries can be up to a few hundred megs) and is 50% slower on average.
EDIT: as mentioned in a few answers, two dicts don't double memory usage, as it's only the dictionary, not the items within, that is duplicated.
Is there a solution that improves on this?
If your keys and values are non-overlapping, one obvious approach is to simply store them in the same dict. ie:
class BidirectionalDict(dict):
    def __setitem__(self, key, val):
        dict.__setitem__(self, key, val)
        dict.__setitem__(self, val, key)
    def __delitem__(self, key):
        dict.__delitem__(self, self[key])
        dict.__delitem__(self, key)
d = BidirectionalDict()
d['foo'] = 4
print(d[4]) # Prints 'foo'
(You'll also probably want to implement things like the __init__, update and iter* methods to act like a real dict, depending on how much functionality you need).
This should only involve one lookup, though may not save you much in memory (you still have twice the number of dict entries after all). Note however that neither this nor your original will use up twice as much space: the dict only takes up space for the references (effectively pointers), plus an overallocation overhead. The space taken up by your data itself will not be repeated twice since the same objects are pointed to.
Related posts:
Python mapping inverse
Python 1:1 mappings
Of course, if all values and keys are unique, couldn't you just use a single dictionary, and insert both key:value and value:key initially?
In The Art of Computer Programming, Volume 3, Knuth has a section on lookups of secondary keys. For purposes of your question, the value could be considered the secondary key.
The first suggestion is to do what you have done: make an efficient index of the keys by value.
The second suggestion is to setup a large btree that is a composite index of the clustered data, where the branch nodes contain values and the leaves contain the key data and pointers to the larger record (if there is one.)
If the data is geometric (as yours appears to be) there are things called post-office trees. It can answer questions like, what is the nearest object to point x. A few examples are here: http://simsearch.yury.name/russir/01nncourse-hand.pdf Another simple option for this kind of query is the quadtree and the k-d tree. http://en.wikipedia.org/wiki/Quadtree
Another final option is combinatorial hashing, where you combine the key and value into a special kind of hash that lets you do efficient lookups on the hash, even when you don't have both values. I couldn't find a good combinatorial hash explanation online, but it is in TAoCP, Volume 3 Second Edition on page 573.
Granted, for some of these you may have to write your own code. But if memory or performance is really key, you might want to take the time.
It shouldn't use "twice the space". Dictionaries just store references to data, not the data itself. So, if you have a million strings taking up a billion bytes, then each dictionary takes maybe an extra 10-20 million bytes--a tiny fraction of the overall storage. Using two dictionaries is the right thing to do.
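A quick way to convince yourself of this, sketched with a tiny made-up points dictionary: the string objects used as keys of the reversed dictionary are the very same objects stored as values of the original one.

```python
points = {1: "first point label", 2: "second point label"}
reverse = {v: k for k, v in points.items()}

# Each string object is referenced by both dicts; it is not copied.
for key, label in points.items():
    stored = next(s for s in reverse if s == label)
    print(stored is label)   # True
```

Only the hash table entries (references plus some over-allocation) exist twice; the strings themselves exist once.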
Insert reversed pair of (key, value) into same dict:
a = {1:'a', 2:'b'}
a.update(dict((v, k) for k, v in a.iteritems()))
Then you will be able to do both, as you required:
print a[1]
print a['a']
Here's another solution using a user defined class.
And the code...
# search a dictionary for key or value
# using named functions or a class
# tested with Python25 by Ene Uran 01/19/2008
def find_key(dic, val):
    """return the key of dictionary dic given the value"""
    return [k for k, v in dic.items() if v == val][0]

def find_value(dic, key):
    """return the value of dictionary dic given the key"""
    return dic[key]

class Lookup(dict):
    """
    a dictionary which can lookup value by key, or keys by value
    """
    def __init__(self, items=()):
        """items can be a list of pair_lists or a dictionary"""
        dict.__init__(self, items)

    def get_key(self, value):
        """find the key(s) as a list given a value"""
        return [item[0] for item in self.items() if item[1] == value]

    def get_value(self, key):
        """find the value given a key"""
        return self[key]
I've been doing it this way for many years now. I personally like the simplicity of it more than the other solutions out there.
d = {1: 'a', 2: 'b'}
dict(zip(d.values(), d.keys()))
