Fastest way to intersect and merge many dictionaries - python

I have 1000 dictionaries (grid_1, grid_2 ... grid_1000) stored as pickle objects (generated by some previous process) and one reference dictionary. I need to compare each of the pickled dictionaries to the reference, and then combine the results.
The input dictionaries might look like:
grid_1 = {'143741': {'467457':1,'501089':2,'903718':1,'999216':5,'1040952':2},'281092':{'1434': 67,'3345': 345}, '33123': {'4566':5,'56788':45}}
grid_2 = {'143741': {'467457':5,'501089':7,'1040952':9},'281092':{'1434': 67,'3345': 20}, '33123': {'4566':7,'56788':38}}
and the reference dictionary might look like:
grid_density_original = {'143741': {'467457':1,'501089':2,'903718':1,'999216':5,'9990':4},'281092':{'1434': 60,'3345': 3,'9991': 43}, '33123': {'56788':4}}
In the first step, we should intersect the individual grid_n dicts like:
# intersection of grid_1 and grid_density_original
assert intersect_1 == {'143741': {'467457':1,'501089':2,'903718':1,'999216':5},'281092':{'1434': 67,'3345': 345}, '33123': {'56788':45}}
# intersection of grid_2 and grid_density_original
assert intersect_2 == {'143741': {'467457':5,'501089':7},'281092':{'1434': 67,'3345': 20}, '33123': {'56788':38}}
Then these results should be combined, as follows:
assert combine12 == {'143741': {'467457':[1,5],'501089':[2,7],'903718':[1,99999],'999216':[5,99999]},'281092':{'1434': [67,67],'3345': [345,20]}, '33123': {'56788':[45,38]}}
This appears to be the slow part, as the inner list size increases each time a new intersect_n is added.
This is the code I have currently. My actual dictionaries have on the order of 10,000 keys, and this code takes about 4 days to run.
from collections import defaultdict
from collections import Counter
import pickle
import gc
import copy
import pickle
import scipy.stats as st
from collections import defaultdict

# grid_density_orignal is the original nested dictionary we compare each of the 1000 grids to:
with open('path\grid_density_original_intercountry.pickle', 'rb') as handle:
    grid_density_orignal = pickle.load(handle, encoding='latin-1')

# A previous process generated 1000 grids and dumped them as .pickle files: grid_1, grid_2 ... grid_1000
for iteration in range(1, 1001):
    # load each grid i.e. grid_1, grid_2 ... grid_1000 into memory sequentially
    filename = 'path\grid_%s' % iteration
    with open(filename, 'rb') as handle:
        globals()['dictt_%s' % iteration] = pickle.load(handle, encoding='latin-1')
    # Counter to store grid-grid densities: same dictionary structure as grid_density_orignal
    globals()['g_den_%s' % iteration] = defaultdict(list)
    for k, v in globals()['dictt_%s' % iteration].items():
        globals()['g_den_%s' % iteration][k] = Counter(v)
    # here we find the common grid-grid connections between grid_density_orignal and each of the 1000 grids
    globals()['intersect_%s' % iteration] = defaultdict(list)
    for k, v in grid_density_orignal.items():
        pergrid = defaultdict(list)
        common_grid_ids = v.keys() & globals()['g_den_%s' % iteration][k].keys()
        for gridid in common_grid_ids:
            pergrid[gridid] = globals()['g_den_%s' % iteration][k][gridid]
        globals()['intersect_%s' % iteration][k] = pergrid
print('All 1000 grids intersection done')

# From the previous code we now have 1000 intersection grids: intersect_1, intersect_2 ... intersect_1000
for iteration in range(1, 1000):
    itnext = iteration + 1  # to get the next intersect out of 1000
    globals()['combine_%s%s' % (iteration, itnext)] = defaultdict(list)  # dictionary to store intermediate combine-step results between 2 intersects: intersect_x and intersect_x+1
    for k, v in globals()['intersect_%s' % iteration].items():
        innr = []
        combine = defaultdict(list)
        for key in set(list(globals()['intersect_%s' % iteration][k].keys()) + list(globals()['intersect_%s' % itnext][k].keys())):  # key in the union of intersect_1, intersect_2
            if (isinstance(globals()['intersect_%s' % iteration][k].get(key, 99999), int) and isinstance(globals()['intersect_%s' % itnext][k].get(key, 99999), int)):  # get the key's value if it exists; if e.g. a certain grid doesn't exist in intersect_1 or intersect_2 we give it a default of 99999 as a placeholder; also check whether the value is an int or a list, as in the initial step it is an int but later we get lists after combining every 2 intersects
                combine[key] = [globals()['intersect_%s' % iteration][k].get(key, 99999)] + [globals()['intersect_%s' % itnext][k].get(key, 99999)]  # combine intersect_1, intersect_2 into a list
            if (isinstance(globals()['intersect_%s' % iteration][k].get(key, 99999), list) and isinstance(globals()['intersect_%s' % itnext][k].get(key, 99999), int)):  # this condition is reached after the initial step of intersect_1 + intersect_2
                combine[key] = globals()['intersect_%s' % iteration][k].get(key, 99999) + [globals()['intersect_%s' % itnext][k].get(key, 99999)]  # combine intersect_1, intersect_2 into a list
        globals()['combine_%s%s' % (iteration, itnext)][k] = combine
    globals()['intersect_%s' % itnext] = copy.copy(globals()['combine_%s%s' % (iteration, itnext)])  # copy the combine dict onto the intersect dict so we can continue combining it with the next iteration
    print('2 combined: ', iteration, itnext)
    del globals()['intersect_%s' % iteration]  # delete the older intersect and combine as we don't need them and they may cause memory overflow when more dicts are in memory
    del globals()['combine_%s%s' % (iteration, itnext)]
    gc.collect()  # explicitly call the garbage collector as this is too big for RAM
print('intersection and combine done')  # at the end we have intersect_1000, which is a dict with all grid ids as keys and a list of densities (list size 1000, corresponding to the 1000 grids)
How can I improve the performance of the code?

I will focus on the second loop (merging the intersect_n dicts), give some general advice, and leave the rest as an exercise. My final result is so much shorter than the original that I feel the need to break down the process into several steps. Hopefully you will learn many useful techniques along the way.
After removing blank lines, comments (we'll rewrite those later anyway) and debug traces, and cleaning up a bit of whitespace, we are starting with:
for iteration in range(1, 1000):
    itnext = iteration + 1
    globals()['combine_%s%s' % (iteration, itnext)] = defaultdict(list)
    for k, v in globals()['intersect_%s' % iteration].items():
        innr = []
        combine = defaultdict(list)
        for key in set(list(globals()['intersect_%s' % iteration][k].keys()) + list(globals()['intersect_%s' % itnext][k].keys())):
            if (isinstance(globals()['intersect_%s' % iteration][k].get(key, 99999), int) and isinstance(globals()['intersect_%s' % itnext][k].get(key, 99999), int)):
                combine[key] = [globals()['intersect_%s' % iteration][k].get(key, 99999)] + [globals()['intersect_%s' % itnext][k].get(key, 99999)]
            if (isinstance(globals()['intersect_%s' % iteration][k].get(key, 99999), list) and isinstance(globals()['intersect_%s' % itnext][k].get(key, 99999), int)):
                combine[key] = globals()['intersect_%s' % iteration][k].get(key, 99999) + [globals()['intersect_%s' % itnext][k].get(key, 99999)]
        globals()['combine_%s%s' % (iteration, itnext)][k] = combine
    globals()['intersect_%s' % itnext] = copy.copy(globals()['combine_%s%s' % (iteration, itnext)])
    del globals()['intersect_%s' % iteration]
    del globals()['combine_%s%s' % (iteration, itnext)]
    gc.collect()
Now we can get to work.
1. Properly structuring the data
Trying to create variable variables is generally a bad idea. It also has a performance cost:
$ python -m timeit -s "global x_0; x_0 = 'test'" "globals()['x_%s' % 0]"
2000000 loops, best of 5: 193 nsec per loop
$ python -m timeit -s "global x; x = ['test']" "x[0]"
10000000 loops, best of 5: 29.1 nsec per loop
Yes, we're talking about nanoseconds, but the existing code is doing it constantly, for nearly every access. But more importantly, visually simplified code is easier to analyze for subsequent improvements.
Clearly we already know how to manipulate nested data structures; adding one more level of nesting isn't an issue. To store the intersect_n results, rather than having 1000 dynamically named variables, the obvious solution is to just make a 1000-element list, where each element is one of those results. (Note that we will start counting them from 0 rather than 1, of course.) As for globals()['combine_%s%s' % (iteration, itnext)] - that makes no sense; we don't need to create a new variable name each time through, because we're going to throw that data away at the end of the loop anyway. So let's just use a constant name.
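For instance, the first loop might fill that list along these lines - a rough sketch that keeps the question's file naming and intersection logic (the reference dict is spelled grid_density_orignal in the question):

intersect = []
for iteration in range(1, 1001):
    with open('path\grid_%s' % iteration, 'rb') as handle:
        grid = pickle.load(handle, encoding='latin-1')
    per_grid = {}
    for k, v in grid_density_original.items():
        inner = grid.get(k, {})
        common_grid_ids = v.keys() & inner.keys()
        per_grid[k] = {grid_id: inner[grid_id] for grid_id in common_grid_ids}
    intersect.append(per_grid)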
Once the first loop is modified to give the right data (which will also look much simpler in that part), the access looks much simpler here:
for iteration in range(999):
    itnext = iteration + 1
    combine_overall = defaultdict(list)
    for k, v in intersect[iteration].items():
        combine = defaultdict(list)
        for key in set(list(intersect[iteration][k].keys()) + list(intersect[itnext][k].keys())):
            if (isinstance(intersect[iteration][k].get(key, 99999), int) and isinstance(intersect[itnext][k].get(key, 99999), int)):
                combine[key] = [intersect[iteration][k].get(key, 99999)] + [intersect[itnext][k].get(key, 99999)]
            if (isinstance(intersect[iteration][k].get(key, 99999), list) and isinstance(intersect[itnext][k].get(key, 99999), int)):
                combine[key] = intersect[iteration][k].get(key, 99999) + [intersect[itnext][k].get(key, 99999)]
        combine_overall[k] = combine
    intersect[itnext] = copy.copy(combine_overall)
You'll notice I removed the memory management stuff at the end. I'll discuss a better approach for that later. The del for the iteration value would mess up iterating over that list, and we don't need to delete combine_overall because we'll just replace it with a new empty defaultdict. I also sneakily removed innr = [], because the value is simply unused. Like I said: visually simpler code is easier to analyze.
2. Unnecessary type checks
All this isinstance stuff is hard to read, and time consuming especially considering all the repeated access:
$ python -m timeit -s "global x; x = {'a': {'b': {'c': 0}}}" "isinstance(x['a']['b'].get('c', 0), int)"
2000000 loops, best of 5: 138 nsec per loop
$ python -m timeit -s "global x; x = {'a': {'b': {'c': 0}}}" "x['a']['b'].get('c', 0)"
5000000 loops, best of 5: 83.9 nsec per loop
We know the exact conditions under which intersect[itnext][k].get(key, 99999) should be an int: always, or else the data is simply corrupted. (We can worry about that later, and probably by doing exception handling in the calling code.) We know the conditions under which intersect[iteration][k].get(key, 99999) should be an int or a list: it will be an int (or missing) the first time through, and a list (or missing) every other time. Fixing this will also make it easier to understand the next step.
for iteration in range(999):
    itnext = iteration + 1
    combine_overall = defaultdict(list)
    for k, v in intersect[iteration].items():
        combine = defaultdict(list)
        for key in set(list(intersect[iteration][k].keys()) + list(intersect[itnext][k].keys())):
            if iteration == 0:
                combine[key] = [intersect[iteration][k].get(key, 99999)] + [intersect[itnext][k].get(key, 99999)]
            else:
                combine[key] = intersect[iteration][k].get(key, [99999]) + [intersect[itnext][k].get(key, 99999)]
        combine_overall[k] = combine
    intersect[itnext] = copy.copy(combine_overall)
Notice how, when the key is either a list or missing, we use a list as the default value. That's the trick to preserving type consistency and making it possible to write the code this way.
3. An unnecessary copy and unnecessary pair-wise iteration
Since combine_overall isn't referenced anywhere else, we don't actually need to copy it over the intersect[itnext] value - we could just reassign it without any aliasing issues. But better yet is to just leave it where it is. Instead of considering adjacent pairs of iteration values that we merge together pairwise, we just merge everything into combine_overall, one at a time (and set up an initial defaultdict once instead of overwriting it).
This does mean we'll have to do some setup work - instead of special-casing the first merge, we'll "merge" intersect[0] by itself into the initial state of combine_overall.
combine_overall = defaultdict(list)
for k, v in intersect[0].items():
    combine = defaultdict(list)
    for key, value in v.items():
        combine[key] = [value]
    combine_overall[k] = combine

for iteration in range(999):
    itnext = iteration + 1
    for k, v in combine_overall.items():
        combine = defaultdict(list)
        for key in set(list(combine_overall[k].keys()) + list(intersect[itnext][k].keys())):
            combine[key] = combine_overall[k].get(key, [99999]) + [intersect[itnext][k].get(key, 99999)]
        combine_overall[k] = combine
Notice how much more simply we can do the initial step - we know which keys we're working with, so no .gets are necessary; and there's only one dict, so no merging of key-sets is necessary. But we aren't done...
4. Some miscellaneous cleanup
Looking at this version, we can more easily notice:
- The iteration loop doesn't use iteration at all, but only itnext - so we can fix that. Also, there's no reason to use indexes like this for a simple loop - we should directly iterate over elements.
- combine_overall will hold dicts, not lists (as we assign the values from combine, which is a defaultdict); so defaultdict(list) makes no sense.
- Instead of using a temporary combine to build a replacement for combine_overall[k] and then assigning it back, we could just directly modify combine_overall[k]. In this way, we would actually get benefit from the defaultdict behaviour. We actually want the default values to be defaultdicts themselves - not completely straightforward, but very doable (see the quick illustration after this list).
- Since we no longer need to make a distinction between the overall combined result and individual results, we can rename combine_overall to just combine to look a little cleaner.
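As a quick illustration of that nested-defaultdict behaviour (the variable name here is just for the demo):

from collections import defaultdict

nested = defaultdict(lambda: defaultdict(list))
nested['a']['b'].append(1)   # both levels are created on first access
# nested now holds {'a': {'b': [1]}} (as nested defaultdicts)

With that in place, the rewritten loop becomes: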
combine = defaultdict(lambda: defaultdict(list))
for k, v in intersect[0].items():
    for key, value in v.items():
        combine[k][key] = [value]

for to_merge in intersect[1:]:
    for k, v in combine.items():
        for key in set(list(combine[k].keys()) + list(to_merge[k].keys())):
            combine[k][key] = combine[k].get(key, [99999]) + [to_merge[k].get(key, 99999)]
5. Oops, there was a bug all along. Also, "special cases aren't special enough to break the rules"
Hopefully, this looks a little strange to you. Why are we using .get on a defaultdict? Why would we have this single-item placeholder value, rather than an empty list? Why do we have to do this complicated check for the possible keys to use? And do we really need to handle the first intersect value differently?
Consider what happens on the following data (using the original naming conventions):
intersect_1 = {'1': {'1': 1}}
intersect_2 = {'1': {'1': 1}}
intersect_3 = {'1': {'1': 1, '2': 1}}
With the original approach, I get a result like:
$ python -i tmp.py
>>> intersect_3
defaultdict(<class 'list'>, {'1': defaultdict(<class 'list'>, {'2': [99999, 1], '1': [1, 1, 1]})})
Oops. intersect_3['1']['2'] only has two elements ([99999, 1]), and thus doesn't match up with intersect_3['1']['1']. That defeats the purpose of the 99999 placeholder values. The problem is that, because the value was missing multiple times at the start, multiple 99999 values should have been inserted, but only one was - the one that came from creating the initial list, when the isinstance check reported an int rather than a list, when it retrieved the 99999 with .get. That lost information: we couldn't distinguish between a missing int and a missing list.
How do we work around this? Simple: we use the same, overall key set each time - the total set of keys that should be present, which we get from grid_density_original[k]. Whenever one of those entries is missing in any of the intersect results, we write the placeholder value instead. Now we are also handling each intersect result the same way - instead of doing special setup with the first value, and merging everything else in, we are merging everything in to an empty initial state.
Instead of iterating over the .items of combine (and expecting to_merge to have the same keys), we iterate over to_merge, which makes a lot more sense. Instead of creating and assigning a list for combine[k][key], we simply append a value to the existing list (and we know there is one available, because we are using defaultdicts properly now).
Thus:
combine = defaultdict(lambda: defaultdict(list))
for to_merge in intersect:
    for k, v in to_merge.items():
        # BTW, you typo'd this as "orignal" in the original code
        for key in grid_density_original[k].keys():
            combine[k][key].append(v.get(key, 99999))
(This does mean that, if none of the intersect dicts contain a particular key, the result will contain a list of 1000 99999 values, instead of omitting the key completely. I hope that isn't an issue.)
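If it is an issue, a small post-pass could strip those entries afterwards - something along these lines, which also turns the defaultdicts into plain dicts:

combine = {
    k: {key: values for key, values in inner.items()
        if any(v != 99999 for v in values)}  # drop keys that were missing from every grid
    for k, inner in combine.items()
}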
6. Okay, but weren't you going to do something about the memory usage?
Oh, right. Take a moment to write the corresponding code for the first loop, please.
Got it? Okay, now. Set up combine first; and then each time you compute one of the would-be elements of intersect, merge it in (using the two inner loops shown here) instead of building the actual intersect list.
Oh, and I think you'll find that - since we're going to iterate over grid_density_original[k].keys() anyway - the preprocessing to remove other keys from the g_den results isn't actually needed at all now.
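For reference, here is one way the whole thing could come together - treat it as an untested sketch that keeps the question's paths and file naming, and assumes the corrected grid_density_original spelling:

from collections import defaultdict
import pickle

with open('path\grid_density_original_intercountry.pickle', 'rb') as handle:
    grid_density_original = pickle.load(handle, encoding='latin-1')

combine = defaultdict(lambda: defaultdict(list))
for iteration in range(1, 1001):
    with open('path\grid_%s' % iteration, 'rb') as handle:
        grid = pickle.load(handle, encoding='latin-1')
    # fold this grid straight into the overall result; no intersect list is ever built
    for k, reference in grid_density_original.items():
        found = grid.get(k, {})
        for key in reference.keys():
            combine[k][key].append(found.get(key, 99999))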

Related

Finding how a key's value compares to the rest of a dictionary of values

I'm creating a virtual race, and for each frame of the simulation, I have a data file containing each racer's distance from the lead point, meaning the racer with the lowest value is winning.
After each race, I need to go back to the data for the midpoint, and find what race position the eventual winner was in. It's easy to SEE in the example below that the winner (1) was in 3rd position, with a distance of 600, but I'm struggling with how to do this in Python (there will be hundreds of races and many more racers in each race).
RaceDistances = {1:600,2:300,3:450,4:1000,5:750}
We can generate the rankings using .items() to get a list of tuples containing the key-value pairs in the dictionary. Then, we sort by distance using sorted() and the key parameter. We then read off the rankings into a list using a list comprehension.
If you only need to get the rank of one specific player, you can then use .index([<id of player>]) + 1:
race_distances = {1:600,2:300,3:450,4:1000,5:750}
player_ranks = [x for x, _ in sorted(race_distances.items(), key=lambda x: x[1])]
result = player_ranks.index(1) + 1
print(result)
If you're going to be looking up the ranks for a lot of players, it is much better to create a dictionary instead, which has faster lookup:
result = {elem: idx + 1 for idx, elem in enumerate(player_ranks)}
print(result[1])
In both cases, this prints:
3
Count how many smaller distances there are, then add 1:
print(sum(d < RaceDistances[1]
          for d in RaceDistances.values()) + 1)
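Wrapped as a small helper (the function name is arbitrary), the same counting idea looks like:

def race_position(distances, racer):
    # 1 + the number of racers strictly closer to the lead point
    return sum(d < distances[racer] for d in distances.values()) + 1

print(race_position(RaceDistances, 1))  # 3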

Python: Removing lower versions from a 2D list

Assume I have the following data structure (a list of lists):
myList = [['Something','1'], ['Something','2'], ['Something Else','5'], ['Yet Another Something','1'], ['Yet ANOTHER Something','0'], ['Yet Another Something','2']]
I have a function that will remove duplicates from that list, choosing the highest number for the 2nd value. However, it seems to choke on very large data sets (150+ entries in myList). For this small data set, I expect the returned list to be:
[['Something','2'], ['Something Else','5'], ['Yet Another Something','2']]
What kind of optimization can be implemented using standard python (without including custom, external modules) into this function so that it returns the same result set without issues on large data sets?
Here is my function:
def remove_duplicates(duplicate):
    final_list = []
    final_list_upper = []
    for k, v in duplicate:
        found = False
        for x in range(len(final_list)):
            if k in final_list[x] or k.upper() in final_list_upper[x]:
                if k == final_list[x][0] or k.upper() == final_list_upper[x][0]:
                    if int(v) >= int(final_list[x][1]):
                        final_list.pop(final_list.index(final_list[x]))
                        final_list_upper.pop(final_list_upper.index(final_list_upper[x]))
                        break
                    else:
                        found = True
                        break
        if not found:
            final_list.append([k, v])
            final_list_upper.append([k.upper(), v])
    final_list_upper = []  # clear the list
    return final_list
You're using a second loop to check if the current "key" that you're checking exists in the list. This is slowing down your code.
Why? Because, as your code demonstrates, checking for membership in lists is slow. Really slow, because you need to iterate over the entire list, which means it's an O(N) operation, so the time depends linearly on the size of the list.
Instead, you could simply change the list to a dictionary. Lookup in a dictionary is an O(1) operation, so the lookup happens in constant (or nearly constant) time regardless of the size of the dictionary.
When you do this, there's no longer a need for two loops. Here's an idea:
def remove_duplicates_new(duplicate):
    final_dict = {}
    case_sensitive_keys = {}
    for k, v in duplicate:
        klower = k.lower()
        vint = int(v)
        old_val = final_dict.get(klower, 0)  # Get the key k, with a default of zero if the key doesn't exist
        if vint > old_val:
            # Replace if current value is greater than old value
            final_dict[klower] = vint
            case_sensitive_keys[klower] = k
    # Now we're done looping, so create the list
    final_list = [[case_sensitive_keys[k], str(v)] for k, v in final_dict.items()]
    return final_list
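Running it on the example list from the question should give back the expected result (on Python 3.7+ dicts preserve insertion order, so the order matches as well):

remove_duplicates_new(myList)
# [['Something', '2'], ['Something Else', '5'], ['Yet Another Something', '2']]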
To compare, let's make a test list with 10000 elements. The "keys" are random numbers between 1 and 100, so we're bound to get a whole bunch of duplicates:
import random
import timeit
testList = [[str(random.randint(1, 100)), str(random.randint(1, 10))] for i in range(10000)]
timeit.timeit('remove_duplicates(testList)', setup='from __main__ import testList, remove_duplicates', number=10)
# Output: 1.1064800999999989
timeit.timeit('remove_duplicates_new(testList)', setup='from __main__ import testList, remove_duplicates_new', number=10)
# Output: 0.03743689999998878
Hot damn! That's a ~30x speedup!

Cython dictionary / map

I have a list of element, label pairs like this: [(e1, l1), (e2, l2), (e3, l1)]
I have to count how many labels two elements have in common - i.e. in the list above, e1 and e3 have the label l1 in common and thus 1 label in common.
I have this Python implementation:
from collections import defaultdict

def common_count(e_l_list):
    count = defaultdict(int)
    l_list = defaultdict(set)
    for e1, l in e_l_list:
        for e2 in l_list[l]:
            if e1 == e2:
                continue
            elif e1 > e2:
                count[e1, e2] += 1
            else:
                count[e2, e1] += 1
        l_list[l].add(e1)
    return count
It takes a list like the one above and computes a dictionary of element pairs and counts. The result for the list above should be {(e3, e1): 1} (the code puts the larger element of each pair first).
Now I have to scale this to millions of elements and labels, and I thought Cython would be a good solution to save CPU time and memory. But I can't find docs on how to use maps in Cython.
How would I implement the above in pure Cython?
It can be assumed that all elements and labels are unsigned integers.
Thanks in advance :-)
I think you are overcomplicating this by creating pairs of elements and storing all common labels as the value, when you can create a dict with the element as the key and a list of all labels associated with that element as the value. When you want to find common labels, convert the lists to sets and perform an intersection on them; the resulting set will contain the common labels between the two. The average time of the intersection, checked with ~20000 lists, is roughly 0.006 seconds, which is very fast.
I tested this with the following code:
from collections import *
import random
import time

l = []
for i in xrange(10000000):
    # With element range 0-10000000 the dictionary creation time increases to ~16 seconds
    l.append((random.randrange(0, 50000), random.randrange(0, 50000)))

start = time.clock()
d = defaultdict(list)
for i in l:  # O(n)
    d[i[0]].append(i[1])  # O(n)
print time.clock() - start

times = []
for i in xrange(10000):
    start = time.clock()
    tmp = set(d[random.randrange(0, 50000)])  # picks a random list of labels
    tmp2 = set(d[random.randrange(0, 50000)])  # not guaranteed to be a different list but more than likely
    times.append(time.clock() - start)
    common_elements = tmp.intersection(tmp2)
print sum(times) / 100.0
18.6747529999 #creation of list
4.17812619876 #creation of dictionary
0.00633531142994 #intersection
Note: The times do change slightly depending on the number of labels. Also, creating the dict might take too long for your situation, but that is only a one-time operation.
I would also highly recommend not creating all pairs of elements. If you have 5,000,000 elements and they all share at least one label, which is the worst case, then you are looking at ~1.25e+13 pairs or, more bluntly, 12.5 trillion. That would be ~1700 terabytes or ~1.7 petabytes.
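As a minimal sketch of that dict-of-label-sets idea (my own packaging of it; it assumes you only need counts for specific pairs on demand):

from collections import defaultdict

def build_label_index(e_l_list):
    labels = defaultdict(set)
    for e, l in e_l_list:
        labels[e].add(l)
    return labels

def common_label_count(labels, e1, e2):
    # number of labels the two elements share
    return len(labels[e1] & labels[e2])

# e.g. common_label_count(build_label_index([(1, 7), (2, 8), (3, 7)]), 1, 3) -> 1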

Efficient way of searching a list of lists according to their hash value?

I have a list of tuples with 3 members in each tuple as seen below:
[(-5092793511388848640, 'test1', 1),
(-5092793511388848639, 'test0', 0),
(-5092793511388848638, 'test3', 3),
(-5092793511388848637, 'test2', 2),
(-5092793511388848636, 'test5', 5)]
The tuples are ordered in ascending order according to the first element of each tuple - the hash value of each key (e.g. 'test0'). I want to find a quick way of searching through these tuples using binary search on their hash values to find a specific key. The problem is that the quickest way I have found is using a for loop:
def get(key, D, hasher=hash):
    '''
    Returns the value in the dictionary corresponding to the given key.

    Arguments:
    key -- desired key to retrieve the value of.
    D -- intended dictionary to retrieve value from.
    hasher -- the hash function to be used on the key.
    '''
    for item in D:
        if item[0] == hash(key):
            return item[2]
    raise TypeError('Key not found in the dictionary.')
The function I have written above seems to be very slow at searching through a much longer list of tuples, let's say a list of 6000 different tuples. It also breaks if there are any hash collisions. I was wondering if there is a more efficient/quick way of searching the list for the correct tuple?
Side note: I know using dictionaries will be a much quicker and easier way to solve my problem but I'd like to avoid using them.
First, prehash the key, don't do it over and over. Second, you can use next with an unpacking generator expression to optimize a bit:
def get(key, D, hasher=hash):
    keyhash = hasher(key)
    try:
        return next(v for hsh, k, v in D if keyhash == hsh and key == k)
    except StopIteration:
        raise TypeError('Key not found in the dictionary.')
That said, you claim you want to do a binary search, but the above is still a linear search, just optimized to avoid redundant work and to stop when the desired key is found (it checks hash first, assuming key comparison is expensive, then checks key equality only on hash match, since you complained about issues with duplicates). If the goal is binary search, and D is sorted by hash code, you'd want to use the bisect module. It's not trivial to do so (because bisect doesn't take a key argument like sorted), but if you could split D into two parts, one with just hash codes, and one with codes, keys and values, you could do:
import bisect

def get(key, Dhashes, D, hasher=hash):
    keyhash = hasher(key)
    # Search whole list of hashes for beginning of range with correct hash
    start = bisect.bisect_left(Dhashes, keyhash)
    # Search for end point of correct hashes (limit to entries after start for speed)
    end = bisect.bisect_right(Dhashes, keyhash, start)
    try:
        # Linear search of only start->end indices for exact key
        return next(v for hsh, k, v in D[start:end] if key == k)
    except StopIteration:
        raise TypeError('Key not found in the dictionary.')
That gets you true binary search, but as noted, requires that the hash codes be separated from the complete tuples of hashcode, key, value ahead of time, before the search. Splitting the hash codes at the time of each search wouldn't be worth it since the loop that split them off could have just found your desired value directly (it would only be worth splitting if you were performing many searches at once).
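For completeness, Dhashes would just be built once, up front, with something like:

Dhashes = [hsh for hsh, k, v in D]  # same order as D; rebuild whenever D changes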
As Padraic notes in his answer, at the expense of giving up the C accelerator code, you could copy and modify the pure Python implementation of bisect.bisect_right and bisect.bisect_left changing each use of a[mid] to a[mid][0] which would get you bisect code that doesn't require you to maintain a separate list of hashes. The memory savings might be worth the higher lookup costs. Don't use itertools.islice to perform the slicing though, as islice with a start index iterates the whole list up to that point; true slicing only reads and copies what you care about. If you want to avoid the second bisect operation though, you could always write your own Sequence-optimized islice and combine it with itertools.takewhile to get a similar effect without having to calculate the end index up front. Code for that might be something like:
from itertools import takewhile

# Copied from bisect.bisect_left, with unused arguments removed and only
# index 0 of each tuple checked
def bisect_idx0_left(a, x):
    lo, hi = 0, len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if a[mid][0] < x:
            lo = mid+1
        else:
            hi = mid
    return lo

def sequence_skipper(seq, start):
    return (seq[i] for i in xrange(start, len(seq)))

def get(key, D, hasher=hash):
    keyhash = hasher(key)
    # Search whole list of hashes for beginning of range with correct hash
    start = bisect_idx0_left(D, keyhash)
    # Make lazy iterator that skips start values in the list
    # and stops producing values when the hash stops matching
    hashmatches = takewhile(lambda x: keyhash == x[0], sequence_skipper(D, start))
    try:
        # Linear search of only indices with matching hashes for exact key
        return next(v for hsh, k, v in hashmatches if key == k)
    except StopIteration:
        raise TypeError('Key not found in the dictionary.')
Note: You could save even more work at the expense of more memory, by having Dhashes actually be (hashcode, key) pairs; assuming uniqueness, this would mean a single bisect.bisect* call, not two, and no need for a scan between indices for a key match; you either found it in the binary search or you didn't. Just for example, I generated 1000 key-value pairs, storing them as either (hashcode, key, value) tuples in a list (which I sorted on the hashcode), or a dict mapping keys->values. The keys were all 65-bit ints (long enough that the hash code wasn't a trivial self-mapping). Using the linear search code I provided up top, it took ~15 microseconds to find the value located at index 321; with binary search (having copied hashes only to a separate list) it took just over 2 microseconds. Looking it up in the equivalent dict took ~55 nanoseconds; the run time overhead even for binary search worked out to ~37x, and linear search ran ~270x higher. And that's before we get into the increased memory costs, increased code complexity, and increased overhead to maintain sorted order (assuming D is ever modified).
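If you went the (hashcode, key) route, the lookup might look roughly like this (a sketch relying on Dpairs being index-aligned with D, fully sorted, and on the unique-hash assumption above):

import bisect

def get(key, Dpairs, D, hasher=hash):
    # Dpairs is [(hashcode, key), ...]; with unique hashes, tuple equality
    # at the bisect position means we found the right entry
    target = (hasher(key), key)
    idx = bisect.bisect_left(Dpairs, target)
    if idx < len(Dpairs) and Dpairs[idx] == target:
        return D[idx][2]
    raise TypeError('Key not found in the dictionary.')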
Lastly: You say "I'd like to avoid using [dicts]", but give no explanation as to why. dicts are the correct way to solve a problem like this; assuming no self-hashing (i.e. key is an int that hashes to itself, possibly saving the cost of the hash code), the memory overhead just for the list of tuples (not including a separate list of hash codes) would be (roughly) twice that of a simple dict mapping keys to values. dict would also prevent accidentally storing duplicates, have ~O(1) insertion cost (even with bisect, insertion maintaining sorted order would have ~O(log n) lookup and O(n) memory move costs), ~O(1) lookup cost (vs. ~O(log n) with bisect), and beyond the big-O differences, would do all the work using C built-in functions that are heavily optimized, so the real savings would be greater.
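For comparison, the dict equivalent of that whole structure is a one-liner:

lookup = {k: v for hsh, k, v in D}  # build once; then lookup[key] replaces get(key, D)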
You could modify bisect to just check the first element:
def bisect_left(a, x, lo=0, hi=None):
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi) // 2
        if a[mid][0] < x:
            lo = mid+1
        else:
            hi = mid
    return lo

def get_bis(key, d):
    h = hash(key)
    ind = bisect_left(d, h)
    if ind == -1:
        raise KeyError()
    for i in xrange(ind, len(d)):
        if d[i][0] != h:
            raise KeyError()
        if d[i][1] == key:
            return d[i][2]
    raise KeyError()
replicating some collisions, it does what it should:
In [41]: l = [(-5092793511388848640, 'test1', 1), (-5092793511388848639, 'test9', 0), (-5092793511388848639, 'test0', 3), (-5092793511388848637, 'test2', 2), (-5092793511388848636, 'test5', 5)]
In [42]: get("test0", l)
Out[42]: 3
In [43]: get("test1", l)
Out[43]: 1
In [44]: get(-5092793511388848639, l)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-44-81e928da1ac8> in <module>()
----> 1 get(-5092793511388848639, l)
<ipython-input-30-499e71432196> in get(key, d)
6 for sub in islice(d, ind, None):
7 if sub[0] != h:
----> 8 raise KeyError()
9 if sub[1] == key:
10 return sub
KeyError:
Some timings:
In [91]: l = sorted((hash(s), s,randint(1,100000)) for s in ("".join(sample(ascii_letters,randint(10,26))) for _ in xrange(1000000)))
In [92]: l[-1]
Out[92]: (9223342880888029755, 'FocWPinpYZXjHhBqRkJxQeGMa', 43768)
In [93]: timeit get_bis(l[-1][1],l)
100000 loops, best of 3: 5.29 µs per loop
In [94]: l[250000]
Out[94]: (-4616437486317828880, 'qXsybdhFPLczWwCQkm', 86136)
In [95]: timeit get_bis(l[250000][1],l)
100000 loops, best of 3: 4.4 µs per loop
In [96]: l[750000]
Out[96]: (4623630109115829672, 'dlQewhpMoBGmn', 39904)
In [97]: timeit get_bis(l[750000][1],l)
100000 loops, best of 3: 4.46 µs per loop
To get a better idea you would have to throw in collisions, but finding the section that the hash may belong to is pretty efficient.
Just type-casting a few variables and compiling with Cython:
def cython_bisect_left(a, long x, long lo=0):
    if lo < 0:
        raise ValueError('lo must be non-negative')
    cdef long hi = len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if a[mid][0] < x:
            lo = mid + 1
        else:
            hi = mid
    return lo

def cython_get(str key, d):
    cdef long h = hash(key)
    cdef ind = cython_bisect_left(d, h)
    if ind == -1:
        raise KeyError()
    for i in xrange(ind, len(d)):
        if d[i][0] != h:
            raise KeyError()
        if d[i][1] == key:
            return d[i][2]
    raise KeyError()
Gets us almost down to 1 microsecond:
In [13]: timeit cython_get(l[-1][1],l)
The slowest run took 40.77 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 1.44 µs per loop
In [14]: timeit cython_get(l[250000][1],l)
1000000 loops, best of 3: 1.33 µs per loop
In [15]: timeit cython_get(l[750000][1],l)
1000000 loops, best of 3: 1.33 µs per loop
Try using list comprehensions. I'm not sure if it's the most efficient way, but it's the pythonic way and quite effective!
[ x for x in D if x[0] == hash(key) ]

Better algorithm (than using a dict) for enumerating pairs with a given sum.

Given a number, I have to find out all possible index-pairs in a given array whose sum equals that number. I am currently using the following algo:
def myfunc(array, num):
    dic = {}
    for x in xrange(len(array)):  # if 6 is the current key,
        if dic.has_key(num-array[x]):  # look at whether num-array[x] is there in dic
            for y in dic[num-array[x]]:  # if yes, print all key-pair values
                print (x, y),
        if dic.has_key(array[x]):  # check whether the current keyed value exists
            dic[array[x]].append(x)  # if so, append the index to the list of indexes for that keyed value
        else:
            dic[array[x]] = [x]  # else create a new list
Will this run in O(N) time? If not, then what should be done to make it so? And in any case, will it be possible to make it run in O(N) time without using any auxiliary data structure?
Will this run in O(N) time?
Yes and no. The complexity is actually O(N + M) where M is the output size.
Unfortunately, the output size is O(N^2) in the worst case; for example, the array [3,3,3,3,3,...,3] with number == 6 will result in a quadratic number of pairs needing to be produced.
However, asymptotically speaking, it cannot be done better than this, because it is linear in the input size and output size.
Here is a very, very simple solution that actually does run in O(N) time, by yielding references to lists of indices rather than expanding every pair. If you want to enumerate all the output pairs, then of course (as amit notes) it must take O(N^2) in the worst case.
from collections import defaultdict

def findpairs(arr, target):
    flip = defaultdict(list)
    for i, j in enumerate(arr):
        flip[j].append(i)
    for i, j in enumerate(arr):
        if target-j in flip:
            yield i, flip[target-j]
Postprocessing to get all of the output values (and filter out (i,i) answers):
def allpairs(arr, target):
    for i, js in findpairs(arr, target):
        for j in js:
            if i < j: yield (i, j)
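A quick illustrative run (values chosen arbitrarily):

print(list(allpairs([1, 5, 3, 3, 5, 1], 6)))
# [(0, 1), (0, 4), (1, 5), (2, 3), (4, 5)]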
This might help - Optimal Algorithm needed for finding pairs divisible by a given integer k
(With a slight modification - there we are looking for all pairs whose sum is divisible by a given number, not necessarily just equal to a given number.)
