Using bisect efficiently with the keys of an OrderedDict - python

I'm looking for the fastest way to do the following: given a dictionary and a key value, return the lowest key in the dictionary greater than the value given. Per this question, the natural way would seem to be to create an OrderedDict, then use bisect on the keys to find the proper key location. The OrderedDict.keys() method doesn't support indexing, so per e.g. this question, one has to convert the keys to a list before doing bisect or similar.
So once an OrderedDict has been created with its keys in order, accessing the correct position requires the following:
1. Convert the keys to a list.
2. Do a binary search of the keys with bisect or similar.
3. Check that this insertion point isn't at the end of the list, then retrieve the key at that index.
4. Retrieve the value for that key from our original OrderedDict.
I'm most concerned about step 1 above, from an efficiency perspective (although all of this looks roundabout to me). Without knowing the details of how Python does the conversion to a list, it seems like it would have to be O(n), completely eliminating the savings of using OrderedDict and binary search. I'm hoping someone can tell me whether my assumption about step 1 is correct, and in either case whether there is a better method.
As an alternative, I could simply create a list of tuples, sorted by the first element (key), where the second element is the dict value associated with that key. Then I could pass the key lambda x:x[0] to bisect. This seems reasonable, but I'd prefer to store my key / value pairs more uniformly (e.g. JSON), since that's how it's done with other dicts in the same project that don't need this specific type of comparison.
Here's some example code for a single lookup. Edit: But lest anyone think I'm over-optimizing, the actual dictionary has ~3 million keys, and will be accessed ~7 million times in a batch, daily. So I'm very interested in finding a fast way of doing this.
# Single lookup example
from collections import OrderedDict
from bisect import bisect

d = OrderedDict()
d[5] = 'lowest_value'
d[7] = 'middle_value'
d[12] = 'highest_value'

sample_key = 6  # we want to find the value for the key above this in d, e.g. d[7]
list_of_keys = list(d.keys())
key_insertion_index = bisect(list_of_keys, sample_key)
if key_insertion_index < len(list_of_keys):
    next_higher_key = list_of_keys[key_insertion_index]
    v = d[next_higher_key]
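Since the dictionary itself doesn't change between lookups, one way to avoid paying the O(n) list conversion on every query is to build the sorted key list once and reuse it for the whole batch. A minimal sketch of that idea (the helper name next_higher is mine, not from the question):

from bisect import bisect_right

# Build the sorted key list once, up front, then reuse it for every lookup.
sorted_keys = sorted(d)

def next_higher(d, sorted_keys, sample_key):
    """Return the value for the smallest key strictly greater than sample_key, or None."""
    i = bisect_right(sorted_keys, sample_key)  # O(log n) per lookup
    if i < len(sorted_keys):
        return d[sorted_keys[i]]
    return None

print(next_higher(d, sorted_keys, 6))   # 'middle_value'
print(next_higher(d, sorted_keys, 12))  # None

If the dictionary does change between lookups, a third-party structure such as sortedcontainers.SortedDict offers the same kind of bisection on its keys while staying updatable.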

Related

How to compute word frequencies faster for a large word list in Python and store them in a dictionary

There is a very long word list, about 360,000 words. I want to get the frequency of each word, as a dictionary.
For example:
{'I': 50, 'good': 30,.......}
Since the word list is large, I found it takes a lot of time to compute. Do you have a faster method to accomplish this?
My code, so far, is the following:
dict_pronoun = dict([(i, lst_all_tweet_noun.count(i)) for i in lst_all_tweet_noun])
sorted(dict_pronoun)
You are doing several things wrong here:
You are building a huge list first, then turning that list object into a dictionary. There is no need to use the [..] list comprehension; just dropping the [ and ] would turn it into a much more memory-efficient generator expression.
You are using dict() with a loop instead of a {keyexpr: valueexpr for ... in ...} dictionary comprehension; this would avoid a generator expression altogether and go straight to building a dictionary.
You are using list.count(), which does a full scan of the list for every element. You've turned a linear scan to count N items into an O(N**2) quadratic problem. You could simply increment an integer in the dictionary each time you find the key is already present, and set it to 1 otherwise (a sketch of that follows this list), but there are better options (see below).
The sorted() call is busy-work; it returns a sorted list of keys that is then discarded again. Dictionaries are not sortable, at least not in a way that produces a dictionary again.
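For reference, the manual-increment approach from the last bullet would look roughly like this (a sketch only; the Counter shown below is the better option):

# Count words with a plain dict in a single O(N) pass
counts = {}
for word in lst_all_tweet_noun:
    counts[word] = counts.get(word, 0) + 1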
Use a collections.Counter() object here to do your counting; it uses a linear scan:
from collections import Counter
dict_pronoun = Counter(lst_all_tweet_noun)
A Counter has a Counter.most_common() method which will efficiently give you output sorted by counts, which is what I suspect you wanted to achieve with the sorted() call.
For example, to get the top K elements (where K is smaller than N, the size of the dictionary), a heapq is used to get you those elements in O(NlogK) time (avoiding a full O(NlogN) sort).
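To illustrate, here is a small sketch combining Counter with most_common (the toy word list stands in for the real 360,000-word list):

from collections import Counter

lst_all_tweet_noun = ['I', 'good', 'I', 'day', 'I', 'good']  # toy data

dict_pronoun = Counter(lst_all_tweet_noun)
print(dict_pronoun)                 # Counter({'I': 3, 'good': 2, 'day': 1})
print(dict_pronoun.most_common(2))  # [('I', 3), ('good', 2)] -- the top K, without a full sort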

Ensure list of dicts has a dict with key for each key in list

Context:
I'm using an Ajax call to return some complex JSON from a python module. I have to use a list of keys and confirm that a list of single-item dicts contains a dict with each key.
Example:
mylist=['this', 'that', 'these', 'those']
mydictlist=[{'this':1},{'that':2},{'these':3}]
How do I know that mydictlist is missing the "those" key? Once I know that, I can append {'those': 4} to mydictlist. Simply checking for "those" won't work since the list is dynamic. The data structure cannot change.
Thanks.
Simple code is to convert your search list to a set, then use differencing to determine what you're missing:
missing = set(mylist).difference(*mydictlist)
which gives you a missing set of {'those'}.
Since the named set methods can take multiple arguments (and they need not be sets themselves), you can just unpack all the dicts as arguments to difference to subtract all of them from your set of desired keys at once.
If you do need to handle duplicates (to make sure each key in mylist appears at least that many times among mydictlist's keys - so if mylist contains a value twice, it must occur twice in the dicts), you can use collections and itertools to get the remaining counts:
from collections import Counter
from itertools import chain
c = Counter(mylist)
c.subtract(chain.from_iterable(mydictlist))
# In 3.3+, the easiest way to remove 0/negative counts:
c = +c
# In pre-3.3 Python, replace c = +c with the following for the same effect, slightly less efficiently:
# c += Counter()
The most straightforward way is to iterate over both the containers and check:
for key in mylist:
    if not any(key in dic for dic in mydictlist):
        print key, "missing"
However, if you have a lot of keys and/or dictionaries, this is not going to be efficient: it iterates over mydictlist once for each element in mylist, which is O(n*m). Instead, consider a set operation:
print set(mylist).difference(*mydictlist)
The pandas package is a great way to handle list-of-dicts problems. It takes all the keys and makes them column headers; values with the same key populate the same column.
Check this out:
import pandas as pd
mydictlist=[{'this':1},{'that':2},{'these':3}]
# Convert data to a DataFrame
df = pd.DataFrame(mydictlist)
# List all the column header names and check if any of the key words are missing
df.columns
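The answer stops at listing the columns; an explicit check against mylist could then look like this (a small sketch building on the snippet above):

# Compare the DataFrame's columns against the required keys
missing = set(mylist) - set(df.columns)
print(missing)  # {'those'}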

getting the key index in a Python OrderedDict?

I have a collections.OrderedDict with a list of key, value pairs. I would like to compute the index i such that the ith key matches a given value. For example:
food = OrderedDict([('beans',33),('rice',44),('pineapple',55),('chicken',66)])
I want to go from the key chicken to the index 3, or from the key rice to the index 1. I can do this now with
food.keys().index('rice')
but is there any way to leverage the OrderedDict's ability to look things up quickly by key name? Otherwise it seems like the index-finding would be O(N) rather than O(log N), and I have a lot of items.
I suppose I can do this manually by making my own index:
>>> foodIndex = {k:i for i,k in enumerate(food.keys())}
>>> foodIndex
{'chicken': 3, 'rice': 1, 'beans': 0, 'pineapple': 2}
but I was hoping there might be something built in to an OrderedDict.
Basically, no. OrderedDict gets its ability to look things up quickly by key name just by using a regular, unordered dict under the hood. The order information is stored separately in a doubly linked list. Because of this, there's no way to go directly from the key to its index. The order in an OrderedDict is mainly intended to be available for iteration; a key does not "know" its own order.
As others have pointed out, an OrderedDict is just a dictionary that internally remembers what order entries were added to it. However, you can leverage its ability to look things up quickly by storing the desired index along with the rest of the data for each entry. Here's what I mean:
from collections import OrderedDict
foods = [('beans', 33), ('rice', 44), ('pineapple', 55), ('chicken', 66)]
food = OrderedDict(((v[0], (v[1], i)) for i, v in enumerate(foods))) # saves i
print(food['rice'][1]) # --> 1
print(food['chicken'][1]) # --> 3
The OrderedDict is a subclass of dict which has the ability to traverse its keys in order (and reversed order) by maintaining a doubly linked list. So it does not know the index of a key. It can only traverse the linked list to find the items in O(n) time.
Perusing the source code may be the most satisfying way to confirm that the index is not maintained by OrderedDict. You'll see that nowhere is an index ever used or obtained.

python dictionary float search within range

Suppose I have some kind of dictionary structure like this (or another data structure representing the same thing):
d = {
    42.123231: 'X',
    42.1432423: 'Y',
    45.3213213: 'Z',
    # ...etc
}
I want to create a function like this:
def f(n, d, e):
    '''Return a list with the values in dictionary d corresponding to the float n
    within (+/-) the float error term e'''
So if I called the function like this with the above dictionary:
f(42,d,2)
It would return
['X','Y']
However, while it is straightforward to write this function with a loop, I don't want something that goes through every value in the dictionary and checks it exhaustively; I want it to take advantage of the indexed structure somehow (or even a sorted list could be used) to make the search much faster.
A dictionary is the wrong data structure for this. Write a search tree.
A Python dictionary is a hashmap implementation. Its keys can't be compared and traversed as in a search tree, so you simply can't do this with a Python dictionary without actually checking all the keys.
Dictionaries with numeric keys usually come out sorted by key value, but - to be on the safe side - you may rearrange the data as an OrderedDict; you do it once:
from collections import OrderedDict
d_ordered = OrderedDict(sorted(d.items(), key=lambda i: i[0]))
Then filtering values is rather simple - and it will stop at the upper border
import itertools
values = [val for k, val in
          itertools.takewhile(lambda (k, v): k < upper, d_ordered.iteritems())
          if k > lower]
As I've already stated, ordering the dictionary is not really necessary - but some will say that this assumption is based on the current implementation and may change in the future.
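Alternatively, the sorted-list-plus-bisect approach from the question at the top of this page applies directly here. A minimal sketch (the function name find_in_range is mine):

from bisect import bisect_left, bisect_right

def find_in_range(d, n, e):
    """Return the values of d whose keys lie within n +/- e, using binary search."""
    keys = sorted(d)                # done once, reusable for a whole batch of queries
    lo = bisect_left(keys, n - e)   # first index with key >= n - e
    hi = bisect_right(keys, n + e)  # first index with key > n + e
    return [d[k] for k in keys[lo:hi]]

d = {42.123231: 'X', 42.1432423: 'Y', 45.3213213: 'Z'}
print(find_in_range(d, 42, 2))  # ['X', 'Y']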

Value-sorted dict for Python?

I am interested in a dict implementation for Python that provides an iterating interface to sorted values. I.e., a dict with a "sortedvalues()" function.
Naively one can do sorted(dict.values()) but that's not what I want. Every time items are inserted or deleted, one has to run a full sorting which isn't efficient.
Note that I am not asking about key-sorted dict either (for that question, there are excellent answers in Key-ordered dict in Python and Python 2.6 TreeMap/SortedDictionary?).
One solution is to write a class that inherits from dict but also maintains a list of keys sorted by their value (sorted_keys), along with the list of corresponding (sorted) values (sorted_values).
You can then define a __setitem__() method that uses the bisect module in order to know quickly the position k where the new (key, value) pair should be inserted in the two lists. You can then insert the new key and the new value both in the dictionary itself, and in the two lists that you maintain, with sorted_values[k:k] = [new_value] and sorted_keys[k:k] = [new_key]; unfortunately, the time complexity of such an insertion is O(n) (so O(n^2) for the whole dictionary).
Another approach to the ordered element insertion would be to use the heapq module and insert (value, key) pairs into it. This works in O(log n) per insertion, instead of the O(n) of the list-based approach of the previous paragraph.
Iterating over the dictionary can then simply done by iterating over the list of keys (sorted_keys) that you maintain.
This method saves you the time it would take to sort the keys each time you want to iterate over the dictionary (with sorted values), by basically shifting (and increasing, unfortunately) this time cost to the construction of the sorted lists of keys and values.
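A bare-bones sketch of this first approach (the class name ValueSortedDict is mine; re-assigning an existing key and deletions are not handled, and a fuller version appears in a later answer):

import bisect

class ValueSortedDict(dict):
    """dict that also maintains key/value lists sorted by value (insert-only sketch)."""
    def __init__(self):
        dict.__init__(self)
        self.sorted_keys = []
        self.sorted_values = []

    def __setitem__(self, key, value):
        k = bisect.bisect(self.sorted_values, value)  # position where the new pair belongs
        self.sorted_values[k:k] = [value]             # O(n) list insertion
        self.sorted_keys[k:k] = [key]
        dict.__setitem__(self, key, value)

    def sortedvalues(self):
        return list(self.sorted_values)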
The problem is that you need to sort or hash it by keys to get reasonable insert and lookup performance. A naive way of implementing it would be a value-sorted tree structure of entries, and a dict to lookup the tree position for a key. You need to get deep into updating the tree though, as this lookup dictionary needs to be kept correct. Essentially, as you would do for an updatable heap.
I figure there are too many options to make a reasonable standard library option out of such a structure, while it is too rarely needed.
Update: a trick that might work for you is to use a dual structure:
a regular dict storing the key-value pairs as usual
any kind of sorted list, for example using bisect
Then you have to implement the common operations on both: a new value is inserted into both structures. The tricky part are the update and delete operations. You use the first structure to look up the old value, delete the old value from the second structure, then (when updating) reinsert as before.
If you need to know the keys too, store (value, key) pairs in your sorted list.
Update 2: Try this class:
import bisect

class dictvs(dict):
    def __init__(self):
        self._list = []

    def __setitem__(self, key, value):
        old = self.get(key)
        if old is None:
            bisect.insort(self._list, value)
            dict.__setitem__(self, key, value)
        else:
            oldpos = bisect.bisect_left(self._list, old)
            newpos = bisect.bisect_left(self._list, value)
            if newpos > oldpos:
                newpos -= 1
                # shift the entries between the old and new slots one step left
                for i in xrange(oldpos, newpos):
                    self._list[i] = self._list[i + 1]
            else:
                # shift the entries between the new and old slots one step right
                for i in xrange(oldpos, newpos, -1):
                    self._list[i] = self._list[i - 1]
            self._list[newpos] = value
            dict.__setitem__(self, key, value)

    def __delitem__(self, key):
        old = self.get(key)
        if old is not None:
            # bisect_left so we remove an occurrence of the old value itself
            oldpos = bisect.bisect_left(self._list, old)
            del self._list[oldpos]
        dict.__delitem__(self, key)

    def values(self):
        return list(self._list)
It's not a complete dict yet, I guess. I haven't tested deletions, and have tried only a tiny set of updates. You should write a larger unit test for it, and compare the return of values() with that of sorted(dict.values(instance)). This is just to show how to update the sorted list with bisect.
Here is another, simpler idea:
You create a class that inherits from dict.
You use a cache: you only sort the keys when iterating over the dictionary, and you mark the dictionary as being sorted; insertions should simply append to the list of keys.
kindall mentions in a comment that sorting lists that are almost sorted is fast, so this approach should be quite fast.
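A minimal sketch of this caching idea, assuming only insertion and value-sorted iteration matter (the class and method names are mine):

class LazyValueSortedDict(dict):
    """dict whose sortedvalues() only re-sorts after the contents have changed."""
    def __init__(self, *args, **kwargs):
        dict.__init__(self, *args, **kwargs)
        self._sorted_values = None            # cache; None means "dirty"

    def __setitem__(self, key, value):
        dict.__setitem__(self, key, value)
        self._sorted_values = None            # invalidate the cache

    def sortedvalues(self):
        if self._sorted_values is None:
            self._sorted_values = sorted(dict.values(self))
        return self._sorted_values

This version simply re-sorts from scratch whenever the cache is stale; keeping the previous sorted list and appending new values to it before re-sorting would exploit the fact that sorting an almost-sorted list is fast, as the comment mentioned above suggests.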
You can use a skip dict. It is a Python dictionary that is permanently sorted by value.
Insertion is slightly more expensive than a regular dictionary, but it is well worth the cost if you frequently need to iterate in order, or perform value-based queries such as:
What's the highest / lowest item?
Which items have a value between X and Y?
