I am learning about abstract data types here. Lately I have been reading about hashing with a Map (or some data structure like a dict).
Here is what the code looks like:
class HashTable:
    def __init__(self):
        self.size = 11
        self.slots = [None] * self.size
        self.data = [None] * self.size

    def put(self, key, data):
        hashvalue = self.hashfunction(key, len(self.slots))

        if self.slots[hashvalue] == None:
            self.slots[hashvalue] = key
            self.data[hashvalue] = data
        else:
            if self.slots[hashvalue] == key:
                self.data[hashvalue] = data  # replace
            else:
                nextslot = self.rehash(hashvalue, len(self.slots))
                while self.slots[nextslot] != None and \
                      self.slots[nextslot] != key:
                    nextslot = self.rehash(nextslot, len(self.slots))

                if self.slots[nextslot] == None:
                    self.slots[nextslot] = key
                    self.data[nextslot] = data
                else:
                    self.data[nextslot] = data  # replace

    def hashfunction(self, key, size):
        return key % size

    def rehash(self, oldhash, size):
        return (oldhash + 1) % size

    def get(self, key):
        startslot = self.hashfunction(key, len(self.slots))

        data = None
        stop = False
        found = False
        position = startslot
        while self.slots[position] != None and \
              not found and not stop:
            if self.slots[position] == key:
                found = True
                data = self.data[position]
            else:
                position = self.rehash(position, len(self.slots))
                if position == startslot:
                    stop = True
        return data

    def __getitem__(self, key):
        return self.get(key)

    def __setitem__(self, key, data):
        self.put(key, data)
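Because of __getitem__ and __setitem__, the table is used with bracket syntax; for example (a quick illustrative session, with keys I picked arbitrarily):

h = HashTable()
h[54] = "cat"     # 54 % 11 == 10, goes in slot 10
h[26] = "dog"     # 26 % 11 == 4
h[93] = "lion"    # 93 % 11 == 5
print(h[54])      # cat
print(h[99])      # None -- key not present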
Now within the textbook, the author states that the size of the hashtable is arbitrary. See here:
Note that the initial size for the hash table has been chosen to be 11. Although this is arbitrary, it is important that the size be a prime number so that the collision resolution algorithm can be as efficient as possible.
Why is this arbitrary? It would seem that the number of slots is directly correlated with how many values can be stored. I know that other hash tables may be more flexible and able to store more than one value per slot, but in THIS specific example it isn't just 'arbitrary'. It is exactly how many values can be stored.
Am I missing something here?
Why is this arbitrary?
Because he could have chosen any other small prime.
It would seem that the number of slots is directly correlated with […] how many values can be stored
Yep, and that's irrelevant. If you need to grow your hash table, you resize (reallocate) it and re-hash the elements. This is not what the author is talking about.
The Paramagnetic Croiss answered your main question. The number 11 does of course mean that you can't fit more than 11 elements without reallocating your table and rehashing all your elements, so obviously it's not arbitrary in that sense. But it's arbitrary in the sense that as long as the number is prime (and, yes, larger than the number of inserts you're going to do), everything the author intends to demonstrate will work out the same.*
* In particular, if your elements are natural numbers, and your table size is prime, and small enough compared to the largest integer, % size makes a great hash function.
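To make the point concrete, here is a hypothetical variation that takes the prime as a constructor argument; nothing else in the class needs to change:

class HashTable:
    def __init__(self, size=11):       # any prime works; 11 is just the book's pick
        self.size = size
        self.slots = [None] * self.size
        self.data = [None] * self.size
    # put, get, hashfunction, rehash exactly as in the question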
But for your followup question:
It would seem, though, that making a table with a bigger prime number would give you more available slots, require fewer rehashes, and leave fewer items to search through in each slot (if you extended the data slots to hold more than one value). The items would be spread more thinly in general. Is this not correct?
If I understand you right, you're not using the right words, which is why you're getting confusing answers. Your example code uses a function called rehash, but that's misleading. Rehashing is one way to do probing, but it's not the way you're doing it; you're just doing a combination of linear probing and double hashing.* More commonly, when people talk about rehashing, they're talking about what you do after you grow the hash table and have to rehash every value from the old table into the new one.
* When your hash function is as simple as key%size, the distinction is ambiguous…
Anyway, yes, more load (if you have N elements in M buckets, you have N/M load) means more probing, which is bad. To take the most extreme example, at load 1.0 the average operation has to probe through half the table to find the right bucket, making the hash table as inefficient as brute-force searching an array.
However, as you decrease load, the returns drop off pretty fast. You can draw the exact curve for any particular hash implementation, but the rule of thumb you usually use (for closed hashes like this) is that getting the load below 2/3 is usually not worth it. And keep in mind that a larger hash table has costs as well as benefits. Let's say you're on a 32-bit machine with a 64-byte cache line. So, 11 pointers fit in a single cache line; after any hash operation, the next one is guaranteed to be a cache hit. But 17 pointers are split across two cache lines; after any hash operation, the next one only has a 50% chance of being a cache hit.*
* Of course realistically there's plenty of room inside your loop to use up 2 cache lines for a hash table; that's why people don't usually worry about performance at all when N is in single digits… But you can see how, with larger hash tables, keeping too much empty space can mean more L1 cache misses, more L2 cache misses, and in the worst case even more VM page misses.
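If you want to see the load/probing tradeoff directly, here is a rough standalone sketch (not the book's class) that inserts a handful of random keys into linear-probing tables of various prime sizes and reports the average number of extra probes per insert; the sizes, key count, and key range are arbitrary choices:

import random

def average_probes(table_size, n_keys, trials=100):
    # rough simulation: average extra probes per insert at load n_keys / table_size
    total = 0
    for _ in range(trials):
        slots = [None] * table_size
        for key in random.sample(range(10 ** 6), n_keys):
            pos = key % table_size
            while slots[pos] is not None:      # linear probing, step of 1
                pos = (pos + 1) % table_size
                total += 1
            slots[pos] = key
    return total / float(trials * n_keys)

for size in (11, 17, 23, 31):
    print(size, round(average_probes(size, 7), 2))   # 7 keys, so load is 7/size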
Well, nobody can predict the future, as you never know how many values the data structure user will actually put in the container.
So you start with something small, not to eat too much memory, and then increase and rehash as needed.
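A minimal sketch of what that could look like for the HashTable class from the question (a hypothetical resize method; choosing the next prime is left to the caller):

def resize(self, new_size):
    # hypothetical addition to the HashTable above: grow, then re-insert every pair
    old_slots, old_data = self.slots, self.data
    self.size = new_size                  # e.g. the next prime larger than twice the old size
    self.slots = [None] * self.size
    self.data = [None] * self.size
    for key, value in zip(old_slots, old_data):
        if key is not None:
            self.put(key, value)          # each key lands in a fresh slot of the bigger table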
For the following problem, I used a dictionary to track values while the provided answer used a list. Is there a quick way to determine the most efficient data structures for problems like these?
A robot moves in a plane starting from the original point (0,0). The robot can move UP, DOWN, LEFT and RIGHT by a given number of steps. The trace of the robot's movement is shown as the following: UP 5, DOWN 3, LEFT 3, RIGHT 2. The numbers after the direction are steps. Please write a program to compute the distance from the current position to the original point after a sequence of movements. If the distance is a float, just print the nearest integer. Example: If the following tuples are given as input to the program: UP 5, DOWN 3, LEFT 3, RIGHT 2, then the output of the program should be: 2
My answer uses a dictionary (origin["y"] for y and origin["x"] for x):
direction = 0
steps = 0
command = (direction, steps)
command_list = []
origin = {"x": 0, "y": 0}

while direction != '':                 # stop once the user enters an empty direction
    direction = input("Direction (U, D, L, R):")
    steps = input("Number of steps:")
    command = (direction, steps)
    command_list.append(command)

print(command_list)

while len(command_list) > 0:
    current = command_list[-1]
    if current[0] == 'U':
        origin["y"] += int(current[1])
    elif current[0] == 'D':
        origin["y"] -= int(current[1])
    elif current[0] == 'L':
        origin["x"] -= int(current[1])
    elif current[0] == 'R':
        origin["x"] += int(current[1])
    command_list.pop()

distance = ((origin["x"]) ** 2 + (origin["y"]) ** 2) ** 0.5
print(distance)
The provided answer uses a list (pos[0] for y, and pos[1] for x):
import math

pos = [0, 0]
while True:
    s = raw_input()
    if not s:
        break
    movement = s.split(" ")
    direction = movement[0]
    steps = int(movement[1])
    if direction == "UP":
        pos[0] += steps
    elif direction == "DOWN":
        pos[0] -= steps
    elif direction == "LEFT":
        pos[1] -= steps
    elif direction == "RIGHT":
        pos[1] += steps
    else:
        pass

print int(round(math.sqrt(pos[1] ** 2 + pos[0] ** 2)))
I'll offer a few points on your question because I strongly disagree with the close recommendations. There's much in your question that's not opinion.
In general, your choice of dictionary wasn't appropriate. For a toy program like this it doesn't make much difference, but I assume you're interested in best practice for serious programs. In production software, you wouldn't make this choice. Why?
Error-proneness. A typo in future code, e.g. origin["t"] = 3 when you meant origin["y"] = 3, is a nasty bug that may be difficult to find, because the dict silently accepts the new key. With plain variables a mistyped name is more likely to cause a "fast failure." (In a statically typed language like C++ or Java, it's a sure compile-time error.)
Space overhead. A simple scalar variable requires essentially no space beyond the value itself. An array has a fixed overhead for the "dope vector" that tracks its location, current size, and maximum size. A dictionary requires yet more extra space for open addressing: unused hash buckets and fill tracking.
Speed.
Accessing a scalar variable is very fast: just a few processor instructions.
Accessing a tuple or array element when you know its index is also very fast, though not as fast as variable access. Extra instructions are needed to check array bounds. Adding one element to an array may take O(current array size) to copy current contents into a larger block of memory. The advantage of tuples and arrays is that you can access elements quickly based on a computed integer index. Scalar variables don't do this. Choose an array/tuple when you need integer index access. Favor tuples when you know the exact size and it's unlikely to change. Their immutability tends to make code more understandable (and thread safe).
Accessing a dictionary element is more expensive still, because a hash value must be computed and buckets traversed, with possible collision resolution. Adding a single element can also trigger a table reorganization, which is O(table size) with a constant factor much bigger than list reorganization, because all the elements must be rehashed. The big advantage of dictionaries is that looking up any stored pair takes about the same (amortized constant) time no matter how many pairs are stored, and the keys can be arbitrary hashable values. You should choose a dict only when you need that capability: to store a "map" from keys to values.
Conclude from all the above that the best choice for your origin coordinates would have been simple variables. If you later enhance the program in a way that requires passing (x, y) pairs to/from methods, then you'd consider a Point class.
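As a sketch of that suggestion, the movement-tracking part of the program with plain variables might look like the following (same logic as the code in the question; command_list is assumed to have been built by the input loop above, so the empty sentinel entry simply falls through):

x = y = 0                                 # the robot's position; no container needed
for direction, steps in command_list:
    if direction == 'U':
        y += int(steps)
    elif direction == 'D':
        y -= int(steps)
    elif direction == 'L':
        x -= int(steps)
    elif direction == 'R':
        x += int(steps)                   # anything else (e.g. the '' sentinel) is ignored

distance = (x ** 2 + y ** 2) ** 0.5
print(round(distance))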
Consider two different strings that happen to be of the same length.
I am implementing the Rabin-Karp algorithm and using the hash function below:
prime = 101   # global used by hs()

def hs(pat):
    l = len(pat)
    pathash = 0
    for x in range(l):
        pathash += ord(pat[x]) * prime ** x
    return pathash
It's a hash. By definition there is no guarantee that there will be no collisions – otherwise the hash would have to be at least as long as the hashed value.
The idea behind what you're doing is based in number theory: powers of a number that is coprime to the size of your finite group (which the original author probably meant to be something like 2^N) can give you any number in that finite group, and it's hard to tell which ones they were.
Sadly, the interesting part of this hash function, namely the size-limiting/modulo operation on the hash, has been left out of this code – which makes one wonder where your code comes from. As far as I can immediately see, it has little to do with Rabin-Karp.
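For illustration, here is a hedged sketch of what a Rabin-Karp-style rolling hash usually looks like with the modulo step included. The base and modulus are example choices, and this formulation puts the highest power on the leftmost character, which is the opposite ordering from the hs() above:

BASE = 101            # example base; anything at least the alphabet size is typical
MOD = 2 ** 31 - 1     # example prime modulus that keeps hash values bounded

def poly_hash(s):
    # polynomial hash of the whole string, reduced mod MOD (Horner's rule)
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h

def roll(old_hash, old_char, new_char, high_power):
    # slide the window one character: drop old_char, append new_char
    # high_power must be BASE ** (window_length - 1) % MOD
    h = (old_hash - ord(old_char) * high_power) % MOD
    return (h * BASE + ord(new_char)) % MOD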
Recently I read a problem in order to practice DP. I wasn't able to come up with a DP solution, so I tried a recursive solution, which I later modified to use memoization. The problem statement is as follows:
Making Change. You are given n types of coin denominations of values v(1) < v(2) < ... < v(n) (all integers). Assume v(1) = 1, so you can always make change for any amount of money C. Give an algorithm which makes change for an amount of money C with as few coins as possible. [on problem set 4]
I got the question from here
My solution was as follows:
def memoized_make_change(L, index, cost, d):
    if index == 0:
        return cost                       # only the 1-valued coin is left
    if (index, cost) in d:
        return d[(index, cost)]
    count = cost // L[index]              # take as many of this denomination as possible
    val1 = memoized_make_change(L, index - 1, cost % L[index], d) + count
    val2 = memoized_make_change(L, index - 1, cost, d)   # or skip this denomination entirely
    x = min(val1, val2)
    d[(index, cost)] = x
    return x
This is how I've understood my solution to the problem. Assume that the denominations are stored in L in ascending order. As I iterate from the end to the beginning, I have a choice to either choose a denomination or not choose it. If I choose it, I then recurse to satisfy the remaining amount with lower denominations. If I do not choose it, I recurse to satisfy the current amount with lower denominations.
Either way, at a given function call, I find the best (lowest) count of coins to satisfy a given amount.
Could I have some help in bridging the thought process from here onward to reach a DP solution? I'm not doing this as any HW, this is just for fun and practice. I don't really need any code either, just some help in explaining the thought process would be perfect.
[EDIT]
I recall reading that function calls are expensive, and that this is the reason why bottom-up (iteration-based) solutions might be preferred. Is that possible for this problem?
Here is a general approach for converting memoized recursive solutions to "traditional" bottom-up DP ones, in cases where this is possible.
First, let's express our general "memoized recursive solution". Here, x represents all the parameters that change on each recursive call. We want this to be a tuple of positive integers - in your case, (index, cost). I omit anything that's constant across the recursion (in your case, L), and I suppose that I have a global cache. (But FWIW, in Python you should just use the lru_cache decorator from the standard library functools module rather than managing the cache yourself.)
To solve for(x):
    If x in cache: return cache[x]
    Handle base cases, i.e. where one or more components of x is zero
    Otherwise:
        Make one or more recursive calls
        Combine those results into `result`
        cache[x] = result
        return result
The basic idea in dynamic programming is simply to evaluate the base cases first and work upward:
To solve for(x):
    For y starting at (0, 0, ...) and increasing towards x:
        Do all the stuff from above
However, two neat things happen when we arrange the code this way:
As long as the order of y values is chosen properly (this is trivial when there's only one vector component, of course), we can arrange that the results for the recursive call are always in cache (i.e. we already calculated them earlier, because y had that value on a previous iteration of the loop). So instead of actually making the recursive call, we replace it directly with a cache lookup.
Since every component of y will use consecutively increasing values, and will be placed in the cache in order, we can use a multidimensional array (nested lists, or else a Numpy array) to store the values instead of a dictionary.
So we get something like:
To solve for(x):
    cache = multidimensional array sized according to x
    for i in range(first component of x):
        for j in ...:
            (as many loops as needed; better yet use `itertools.product`)
            If this is a base case, write the appropriate value to cache
            Otherwise, compute "recursive" index values to use, look up
                the values, perform the computation and store the result
    return the appropriate ("last") value from cache
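For the coin-change problem specifically, a bottom-up version might look something like this. It is a standard 1-D formulation indexed by amount rather than a line-by-line translation of your recursion, and it assumes the denominations include 1:

def make_change(denominations, amount):
    # best[c] holds the fewest coins that sum to c
    INF = float('inf')
    best = [0] + [INF] * amount
    for c in range(1, amount + 1):
        for coin in denominations:
            if coin <= c and best[c - coin] + 1 < best[c]:
                best[c] = best[c - coin] + 1
    return best[amount]

print(make_change([1, 5, 10, 25], 63))   # 6  (25 + 25 + 10 + 1 + 1 + 1)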
I suggest considering the relationship between the value you are constructing and the values you need for it.
In this case you are constructing a value for index, cost based on:
index-1 and cost
index-1 and cost%L[index]
What you are searching for is a way of iterating over the choices such that you will always have precalculated everything you need.
In this case you can simply change the code to the iterative approach:
for each choice of index, 0 upwards:
    for each choice of cost:
        compute the value corresponding to (index, cost)
In practice, I find that the iterative approach can be significantly faster (perhaps 4x) for simple problems, as it avoids the overhead of function calls and of checking the cache for pre-existing values.
My problem is as follows:
I have a list of missions, each taking a specific amount of time and granting a specific number of points, and a time 'k' in which to perform them:
e.g: missions = [(14,3),(54,5),(5,4)] and time = 15
In this example I have 3 missions; the first one gives me 14 points and takes 3 minutes.
I have 15 minutes in total.
Each mission is a tuple whose first value is the number of points for the mission and whose second value is the number of minutes needed to perform it.
I have to find, recursively and using memoization, the maximum number of points I am able to get for a given list of missions and a given time.
I am trying to implement a function called choose(missions, time) that will operate recursively and use the function choose_mem(missions, time, mem, k) to achieve my goal.
The function choose_mem should get 'k', which is the number of missions to choose from, and mem, an initially empty dictionary which will contain all the sub-problems that have already been solved.
This is what I have got so far. I need help implementing what is required above, I mean the dictionary usage (which is currently just there and empty all the time), and also the fact that my choose_mem function input is i, j, missions, d, while it should be choose_mem(missions, time, mem, k), where mem = d and k is the number of missions to choose from.
If anyone can help me adjust my code it would be very appreciated.
mem = {}

def choose(missions, time):
    j = time
    result = []
    for i in range(len(missions), 0, -1):
        if choose_mem(missions, j, mem, i) != choose_mem(missions, j, mem, i - 1):
            j -= missions[i - 1][1]
    return choose_mem(missions, time, mem, len(missions))

def choose_mem(missions, time, mem, k):
    if k == 0:
        return 0
    points, a = missions[k - 1]
    if a > time:
        return choose_mem(missions, time, mem, k - 1)
    else:
        return max(choose_mem(missions, time, mem, k - 1),
                   choose_mem(missions, time - a, mem, k - 1) + points)
This is a bit vague, but your problem roughly translates to a very famous NP-complete problem, the Knapsack Problem.
You can read a bit more about it on Wikipedia; if you replace weight with time, you have your problem.
Dynamic programming is a common way to approach that problem, as you can see here:
http://en.wikipedia.org/wiki/Knapsack_problem#Dynamic_programming
Memoization is more or less equivalent to Dynamic Programming for practical purposes, so don't let the fancy name fool you.
The base concept is that you use an additional data structure to store parts of your problem that you already solved. Since the solution you're implementing is recursive, many sub-problems will overlap, and memoization allows you to only calculate each of them once.
So, the hard part is for you to think about your problem and decide what you need to store in the dictionary, so that when you call choose_mem with values that you have already calculated, you simply retrieve them from the dictionary instead of making another recursive call.
If you want to check an implementation of the generic 0-1 Knapsack Problem (your case, since you can't add items partially), then this seemed to me like a good resource:
https://sites.google.com/site/mikescoderama/Home/0-1-knapsack-problem-in-p
It's well explained, and the code is readable enough. If you understand the usage of the matrix to store costs, then you'll have your problem worked out for you.
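To make the memoization part concrete, one possible way to wire the mem dictionary into the existing choose_mem is sketched below, using (time, k) as the key, which is one reasonable choice since those are the two values that change between calls:

def choose_mem(missions, time, mem, k):
    if k == 0:
        return 0
    if (time, k) in mem:                 # this sub-problem was already solved
        return mem[(time, k)]
    points, minutes = missions[k - 1]
    if minutes > time:
        best = choose_mem(missions, time, mem, k - 1)
    else:
        best = max(choose_mem(missions, time, mem, k - 1),
                   choose_mem(missions, time - minutes, mem, k - 1) + points)
    mem[(time, k)] = best                # remember it for next time
    return best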
My first time posting here, so I hope I've asked my question in the right sort of way.
After adding an element to a Python dictionary, is it possible to get Python to tell you if adding that element caused a collision? (And how many locations the collision resolution strategy probed before finding a place to put the element?)
My problem is: I am using dictionaries as part of a larger project, and after extensive profiling, I have discovered that the slowest part of the code is dealing with a sparse distance matrix implemented using dictionaries.
The keys I'm using are IDs of Python objects, which are unique integers, so I know they all hash to different values. But putting them in a dictionary could still cause collisions in principle. I don't believe that dictionary collisions are the thing that's slowing my program down, but I want to eliminate them from my enquiries.
So, for example, given the following dictionary:
import random

d = {}
for i in xrange(15000):
    d[random.randint(15000000, 18000000)] = 0
can you get Python to tell you how many collisions happened when creating it?
My actual code is tangled up with the application, but the above code makes a dictionary that looks very similar to the ones I am using.
To repeat: I don't think that collisions are what is slowing down my code, I just want to eliminate the possibility by showing that my dictionaries don't have many collisions.
Thanks for your help.
Edit: Some code to implement #Winston Ewert's solution:
n = 1500

global collision_count
collision_count = 0

class Foo():
    def __eq__(self, other):
        global collision_count
        collision_count += 1
        return id(self) == id(other)

    def __hash__(self):
        #return id(self) # #John Machin: yes, I know!
        return 1

objects = [Foo() for i in xrange(n)]
d = {}
for o in objects:
    d[o] = 1

print collision_count
Note that when you define __eq__ on a class, Python gives you a TypeError: unhashable instance if you don't also define a __hash__ function.
It doesn't run quite as I expected. If you have the __hash__ function return 1, then you get loads of collisions, as expected (1125560 collisions for n=1500 on my system). But with return id(self), there are 0 collisions.
Anyone know why this is saying 0 collisions?
Edit:
I might have figured this out.
Is it because __eq__ is only called if the __hash__ values of two objects are the same, not their "crunched version" (as #John Machin put it)?
Short answer:
You can't simulate using object ids as dict keys by using random integers as dict keys. They have different hash functions.
Collisions do happen. "Having unique thingies means no collisions" is wrong for several values of "thingy".
You shouldn't be worrying about collisions.
Long answer:
Some explanations, derived from reading the source code:
A dict is implemented as a table of 2 ** i entries, where i is an integer.
dicts are no more than 2/3 full. Consequently for 15000 keys, i must be 15 and 2 ** i is 32768.
When o is an arbitrary instance of a class that doesn't define __hash__(), it is NOT true that hash(o) == id(o). As the address is likely to have zeroes in the low-order 3 or 4 bits, the hash is constructed by rotating the address right by 4 bits; see the source file Objects/object.c, function _Py_HashPointer
It would be a problem if there were lots of zeroes in the low-order bits, because to access a table of size 2 ** i (e.g. 32768), the hash value (often much larger than that) must be crunched to fit, and this is done very simply and quickly by taking the low order i (e.g. 15) bits of the hash value.
Consequently collisions are inevitable.
However this is not cause for panic. The remaining bits of the hash value are factored into the calculation of where the next probe will be. The likelihood of a 3rd etc probe being needed should be rather small, especially as the dict is never more than 2/3 full. The cost of multiple probes is mitigated by the cheap cost of calculating the slot for the first and subsequent probes.
The code below is a simple experiment illustrating most of the above discussion. It presumes random accesses of the dict after it has reached its maximum size. With Python2.7.1, it shows about 2000 collisions for 15000 objects (13.3%).
In any case the bottom line is that you should really divert your attention elsewhere. Collisions are not your problem unless you have achieved some extremely abnormal way of getting memory for your objects. You should look at how you are using the dicts, e.g. use k in d or try/except, not d.has_key(k). Consider one dict accessed as d[(x, y)] instead of two levels accessed as d[x][y]. If you need help with that, ask a separate question.
Update after testing on Python 2.6:
Rotating the address was not introduced until Python 2.7; see this bug report for comprehensive discussion and benchmarks. The basic conclusions are IMHO still valid, and can be augmented by "Update if you can".
>>> n = 15000
>>> i = 0
>>> while 2 ** i / 1.5 < n:
... i += 1
...
>>> print i, 2 ** i, int(2 ** i / 1.5)
15 32768 21845
>>> probe_mask = 2 ** i - 1
>>> print hex(probe_mask)
0x7fff
>>> class Foo(object):
... pass
...
>>> olist = [Foo() for j in xrange(n)]
>>> hashes = [hash(o) for o in olist]
>>> print len(set(hashes))
15000
>>> probes = [h & probe_mask for h in hashes]
>>> print len(set(probes))
12997
>>>
This idea doesn't actually work, see discussion in the question.
A quick look at the C implementation of python shows that the code for resolving collisions does not calculate or store the number of collisions.
However, it will invoke PyObject_RichCompareBool on the keys to check if they match. This means that __eq__ on the key will be invoked for every collision.
So:
Replace your keys with objects that define __eq__ and increment a counter when it is called. This will be slower because of the overhead involved in jumping into python for the compare. However, it should give you an idea of how many collisions are happening.
Make sure you use different objects as the key, otherwise python will take a shortcut because an object is always equal to itself. Also, make sure the objects hash to the same value as the original keys.