python dictionary float search within range

Suppose I have some kind of dictionary structure like this (or another data structure representing the same thing):
d = {
    42.123231: 'X',
    42.1432423: 'Y',
    45.3213213: 'Z',
    # ...etc
}
I want to create a function like this:
def f(n, d, e):
    '''Return a list with the values in dictionary d corresponding to the float n
    within (+/-) the float error term e.'''
So if I called the function like this with the above dictionary:
f(42,d,2)
It would return
['X','Y']
However, while it is straightforward to write this function with a loop, I don't want something that goes through every value in the dictionary and checks it exhaustively. I want it to take advantage of an indexed structure somehow (or even a sorted list could be used) to make the search much faster.

A dictionary is the wrong data structure for this. Write a search tree.

A Python dictionary is a hash map implementation. Its keys can't be compared and traversed as in a search tree, so you simply can't do this with a Python dictionary without actually checking all keys.
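
If a sorted list is acceptable (as the question itself suggests), here is a minimal sketch of f using bisect. The sorted key list is rebuilt on every call here for simplicity; hoist it out if the dictionary is static:
from bisect import bisect_left, bisect_right

def f(n, d, e):
    '''Return the values in d whose keys fall within n +/- e.'''
    keys = sorted(d)                    # O(n log n); do this once if d is static
    lo = bisect_left(keys, n - e)       # index of first key >= n - e
    hi = bisect_right(keys, n + e)      # index of first key > n + e
    return [d[k] for k in keys[lo:hi]]  # O(log n + matches) per query

d = {42.123231: 'X', 42.1432423: 'Y', 45.3213213: 'Z'}
print(f(42, d, 2))  # ['X', 'Y']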

Dictionaries with numeric keys usually come out sorted by key value. But to be on the safe side, you can rearrange it as an OrderedDict - you do this once:
from collections import OrderedDict
d_ordered = OrderedDict(sorted(d.items(), key=lambda i: i[0]))
Then filtering values is rather simple - and it will stop at the upper border:
import itertools
# lower and upper are the bounds of the key range
values = [val for k, val in
          itertools.takewhile(lambda kv: kv[0] < upper, d_ordered.items())
          if k > lower]
As I've already stated, ordering the dictionary is not really necessary - but some will say that this assumption rests on the current implementation and may change in the future.

Related

Using bisect efficiently with the keys of an OrderedDict

I'm looking for the fastest way to do the following: given a dictionary and a key value, return the lowest key in the dictionary greater than the value given. Per this question, the natural way would seem to be to create an OrderedDict, then use bisect on the keys to find the proper key location. The OrderedDict.keys() method doesn't support indexing, so per e.g. this question, one has to convert the keys to a list before doing bisect or similar.
So once an OrderedDict has been created with its keys in order, accessing the correct position requires the following:
1. Convert the keys to a list.
2. Do a binary search of the keys with bisect or similar.
3. Check that this insertion point isn't at the end of the list, before retrieving the key located after this index.
4. Retrieve the key value in our original OrderedDict.
I'm most concerned about step 1 above, from an efficiency perspective (although all of this looks roundabout to me). Without knowing the details of how Python does the conversion to a list, it seems like it would have to be O(n), completely eliminating the savings of using OrderedDict and binary search. I'm hoping someone can tell me whether this assumption about step 1 is correct and, regardless, whether there may be a better method.
As an alternative, I could simply create a list of tuples, sorted by the first element (key), where the second element is the dict value associated with that key. Then I could pass the key lambda x:x[0] to bisect. This seems reasonable, but I'd prefer to store my key / value pairs more uniformly (e.g. JSON), since that's how it's done with other dicts in the same project that don't need this specific type of comparison.
Here's some example code for a single lookup. Edit: But lest anyone think I'm over-optimizing, the actual dictionary has ~3 million keys, and will be accessed ~7 million times in a batch, daily. So I'm very interested in finding a fast way of doing this.
# Single lookup example
from collections import OrderedDict
from bisect import bisect
d = OrderedDict()
d[5] = 'lowest_value'
d[7] = 'middle_value'
d[12] = 'highest_value'
sample_key = 6  # we want to find the value for the key above this in d, e.g. d[7]
list_of_keys = list(d.keys())
key_insertion_index = bisect(list_of_keys, sample_key)
if key_insertion_index < len(list_of_keys):
    next_higher_key = list_of_keys[key_insertion_index]
    v = d[next_higher_key]
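
For the batch workload described (~3 million keys, ~7 million lookups), the O(n) list conversion only hurts if it is repeated per lookup. A minimal sketch that pays it once up front, assuming the dict does not change between lookups (next_higher_value is a hypothetical helper name):
from collections import OrderedDict
from bisect import bisect

d = OrderedDict()
d[5] = 'lowest_value'
d[7] = 'middle_value'
d[12] = 'highest_value'

list_of_keys = list(d.keys())  # O(n), but done once for the whole batch

def next_higher_value(sample_key):
    # O(log n) per lookup once list_of_keys exists
    i = bisect(list_of_keys, sample_key)
    if i < len(list_of_keys):
        return d[list_of_keys[i]]
    return None  # sample_key is >= the largest key

print(next_higher_value(6))  # 'middle_value'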

Comparing multiple values within a dictionary and returning key

I have a dictionary where the keys are arbitrary names and the values are file mtimes. Example:
{'server_1': 1506286408.854673, 'server_2': 1506286219.1254442, 'server_3':1506472359.154043}
I wish to iterate, comparing two of the values from the dictionary, finding the larger of the two, and returning the key of that larger value, continuing to do this until there is only a single key:val pair left.
I know there is a way of "ordering" dictionaries by value with some tricks provided by standard-library helpers like operator and defaultdict. However, I was curious whether there was an easier way to accomplish this goal and avoid trying to sort a naturally unordered structure.
So the end result I'm looking for: the first iteration returns server_3, the next returns server_1, and then it stops there.
It looks like you want to sort the dictionary based on values but ignore the last one.
def keys_sorted_by_values(d):
    return [k for k, v in sorted(d.items(), key=lambda item: item[1], reverse=True)][:-1]

server_to_mtime = {'server_1': 1506286408.854673,
                   'server_2': 1506286219.1254442,
                   'server_3': 1506472359.154043}
for server in keys_sorted_by_values(server_to_mtime):
    print(server)
Output
server_3
server_1
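
An equivalent sketch using heapq.nlargest, which also avoids mutating the dictionary; it selects all keys except the one with the smallest value:
import heapq

server_to_mtime = {'server_1': 1506286408.854673,
                   'server_2': 1506286219.1254442,
                   'server_3': 1506472359.154043}
# keep the n-1 keys with the largest mtimes, in descending order
for server in heapq.nlargest(len(server_to_mtime) - 1,
                             server_to_mtime, key=server_to_mtime.get):
    print(server)
# prints server_3, then server_1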

Faster way to filter a list of dictionaries

I have a large list of dicts (200,000+) and need to filter those dicts based on a key many times (~11,000). What is the fastest way to do this?
I am retrieving a list of dicts (olist), roughly 225,000 dicts, and am trying to filter those dicts based on a single key ('type'). Currently, I build a list of all 'type' values present in the dicts and then iterate over it, filtering the dicts for every 'type'. My problem is it takes ~0.3s to do this initial 'type' filter, which would require almost an hour to run. I use threading, which gets me down to just over 10 min, but I would like to be closer to half that. Below are the relevant snippets of my code; is there a faster way of doing this (either a faster filter or a more effective algorithm)?
tLim = threading.BoundedSemaphore(500)
...
olist = _get_co_(h)  ## this returns a list of ~225,000 dictionaries
idlist = list(set([d['type'] for d in olist]))  ## returns list of ~11,000
for i in idlist:
    t = Thread(target=_typeData_, args=(i, olist, cData))
    threads.append(t)

def _typeData_(i, olist, cData):
    tLim.acquire()
    tList = list(filter(lambda x: x['type'] == i, olist))  ## takes ~0.3s
    # do stuff with tList  ## takes ~0.01s
Please note, I've looked at generator expressions, but it seems like having to store and recall the results might be worse? I haven't tried it, though, and I'm not sure how I would implement it...
Also, increasing the semaphore does not improve time much, if at all.
You could group the dictionaries by type so you can avoid the filter later on:
from collections import defaultdict

id_groups = defaultdict(list)
for dct in olist:
    id_groups[dct['type']].append(dct)
Now you don't need to filter at all; just iterate over id_groups and you'll get the list of all dictionaries of each type:
for i, tList in id_groups.items():
    # the i and tList are identical to your variables in the "_typeData_" function.
    # do something with tList
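
Wired into the threading loop from the question, the grouping replaces ~11,000 O(n) filters with one O(n) pass. A sketch, assuming threads, cData and _typeData_ are otherwise as in the original snippets (with the filter line removed from _typeData_):
id_groups = defaultdict(list)
for dct in olist:
    id_groups[dct['type']].append(dct)

for i, tList in id_groups.items():
    # pass the pre-grouped list so _typeData_ can skip its filter step
    t = Thread(target=_typeData_, args=(i, tList, cData))
    threads.append(t)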

Ensure list of dicts has a dict with key for each key in list

Context:
I'm using an Ajax call to return some complex JSON from a Python module. I have to use a list of keys and confirm that a list of single-item dicts contains a dict with each key.
Example:
mylist = ['this', 'that', 'these', 'those']
mydictlist = [{'this': 1}, {'that': 2}, {'these': 3}]
How do I know that mydictlist is missing the "those" key? Once I know that, I can append {'those':4} to mydictlist. Simply checking for "those" won't work, since the list is dynamic. The data structure cannot change.
Thanks.
The simple approach is to convert your search list to a set, then use differencing to determine what you're missing:
missing = set(mylist).difference(*mydictlist)
which gives you missing == {'those'}.
Since the named set methods can take multiple arguments (and they need not be sets themselves), you can just unpack all the dicts as arguments to difference to subtract all of them from your set of desired keys at once.
If you do need to handle duplicates (to make sure you see each of the keys in mylist at least that many times among mydictlist's keys, so if mylist contains a value twice, it must occur twice in the dicts), you can use collections and itertools to get the remaining counts:
from collections import Counter
from itertools import chain

c = Counter(mylist)
c.subtract(chain.from_iterable(mydictlist))
# In 3.3+, easiest way to remove 0/negative counts
c = +c
# In pre-3.3 Python, use this instead for the same effect, slightly less efficiently:
# c += Counter()
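
With the example data above, c ends up as Counter({'those': 1}); the still-missing keys, with multiplicity, can then be read back out:
missing = list(c.elements())  # ['those']
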
The most straightforward way is to iterate over both the containers and check:
for key in mylist:
    if not any(key in dic for dic in mydictlist):
        print(key, "missing")
However, if you have a lot of keys and/or dictionaries, this is not going to be efficient: it iterates over mydictlist once for each element in mylist, which is O(n*m). Instead, consider a set operation:
print(set(mylist).difference(*mydictlist))
The pandas package is a great way to handle list-of-dicts problems. It takes all the keys and makes them column headers; values with matching keys populate the same column.
Check this out:
import pandas as pd

mydictlist = [{'this': 1}, {'that': 2}, {'these': 3}]
# Convert data to a DataFrame
df = pd.DataFrame(mydictlist)
# List all the column header names and check if any of the keys are missing
df.columns
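
The answer stops at df.columns; a sketch of the actual membership check, using mylist from the question:
missing = set(mylist) - set(df.columns)  # {'those'}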

Can I compare the keys of two dictionaries that are not in the same order?

I apologize, this must be a basic question about using dictionaries. I'm learning Python, and my objective is to compare two dictionaries and recover the key and value entries that are identical in both. I understand that order in dictionaries is not relevant the way it is when working with a list. But I adapted some code to compare my dictionaries, and I just wanted to make sure that the order of the dictionaries does not matter.
The code I have written so far is:
def compare_dict(first, second):
    with open('Common_hits_python.txt', 'w') as file:
        for keyone in first:
            for keytwo in second:
                if keytwo == keyone:
                    if first[keyone] == second[keytwo]:
                        file.write(keyone + "\t" + first[keyone] + "\n")
Any recommendations would be appreciated. I apologize for the redundancy in the code above. But if someone could confirm that comparing two dictionaries this way does not require the keys to be in the same order, that would be great. Other ways of writing the function would be really appreciated as well.
Since you loop over both dictionaries and compare all the combinations, no, order doesn't matter. Every key in one dictionary is compared with every key in the other dictionary, eventually.
It is not a very efficient way to test for matching keys, however. Testing if a key is present is as simple as keyone in second, no need to loop over all the keys in second here.
Better still, you can use set intersections instead:
for key, value in first.viewitems() & second.viewitems():
    # loops over all key - value pairs that match in both.
    file.write('{}\t{}\n'.format(key, value))
This uses dictionary view objects; if you are using Python 3, you can use first.items() & second.items(), since dictionaries there return dictionary views by default.
Using dict.viewitems() as a set only works if the values are hashable too, but since you are treating your values as strings when writing to the file I assumed they were.
If your values are not hashable, you'll need to validate that the values match, but you can still use views and intersect just the keys:
for key in first.viewkeys() & second.viewkeys():
    # loops over all keys that match in both.
    if first[key] == second[key]:
        file.write('{}\t{}\n'.format(key, first[key]))
Again, in Python 3, use first.keys() & second.keys() for the intersection of the two dictionaries by keys.
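
For completeness, a sketch of the whole function in Python 3 using the items() view intersection (same hashable-values assumption as above):
def compare_dict(first, second):
    with open('Common_hits_python.txt', 'w') as file:
        # items() views support set operations in Python 3
        for key, value in first.items() & second.items():
            file.write('{}\t{}\n'.format(key, value))
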
Your way of doing it is valid. As you loop through both dictionaries, their order does not matter.
You could do this instead, to optimize your code.
for keyone in first:
    if keyone in second:  # returns true if keyone is present in second.
        if first[keyone] == second[keyone]:
            file.write(keyone + "\t" + first[keyone] + "\n")
The keys of a dictionary are effectively a set, and Python already has a built-in set type with an efficient intersection method. This will produce a set of keys that are common to both dictionaries:
dict0 = {...}
dict1 = {...}
set0 = set(dict0)
set1 = set(dict1)
keys = set0.intersection(set1)
Your goal is to build a dictionary out of these keys, which can be done with a dictionary comprehension. It will require a condition to keep out the keys that have unequal values in the two original dictionaries:
new_dict = {k: dict0[k] for k in keys if dict0[k] == dict1[k]}
Depending on your intended use for the new dictionary, you might want to copy or deepcopy the old dictionary's values into the new one.
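
A quick worked sketch of the intersection and comprehension above, with made-up example data:
dict0 = {'a': 1, 'b': 2, 'c': 3}
dict1 = {'b': 2, 'c': 4, 'd': 5}
keys = set(dict0).intersection(dict1)  # {'b', 'c'}
new_dict = {k: dict0[k] for k in keys if dict0[k] == dict1[k]}
print(new_dict)  # {'b': 2}; 'c' is dropped because 3 != 4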
