Implementation of hashing and dictionaries

Implementation of hashing and dictionaries - python

Let's take this code which does nothing really interesting. It creates a dictionary with a single key 'x' * N (N being an argument of the script), then accesses this item 10000000 times and prints the execution time.
import sys, time
X = 'x' * int(sys.argv[1])
LOOPS = 10000000
attrs = {X: 1}
t1 = time.time()
n = 0
for _ in range(LOOPS):
n += attrs[X]
attrs[X] += 1
t2 = time.time()
print(t2 - t1)
I launched it with different values of N in {1, 10, 100, 1000} and I did not observe any increase of the execution time (around 2.8 seconds each time on my machine). I was expecting some as my first guess was that each access to attrs would cause an hash value to be computed for 'x' * N. So I'm curious about the magic behind this. Is there some caching mechanism applied? Or am I wrong about my assumption on the implementation of dictionaries?

Python's implementation of str.__hash__() caches the hash the first time it's computed for that string, so the hash only has to be computed once for any given string.
Some (but not all) other immutable objects do the same sort of caching. See https://bugs.python.org/issue1462796 (a rejected request to apply the same caching logic to tuples).

The value/length of 'X' variable is changing based on your input i.e., {1, 10, 100*, 1000}. This means the memory used to store the variable is increasing with your input size.
The value of X is used in a dictionary and is used only to hash/index the value in the dictionary. In the dictionary, the hashing is done by internally storing the address of the index. Due to this hash/index, the time to access the value is the same irrespective of the length of X. This is the reason for the CPU time taken to be constant.

Related

Is it more performant to access function from variable?

I was reading sources of Python statistics module and saw strage variable partials_get = partials.get which then was used once in for loop partials[d] = partials_get(d, 0) + n.
def _sum(data, start=0):
count = 0
n, d = _exact_ratio(start)
partials = {d: n}
partials_get = partials.get # STRANGE VARIABLE
T = _coerce(int, type(start))
for typ, values in groupby(data, type):
T = _coerce(T, typ) # or raise TypeError
for n, d in map(_exact_ratio, values):
count += 1
partials[d] = partials_get(d, 0) + n # AND IT'S USAGE
if None in partials:
# The sum will be a NAN or INF. We can ignore all the finite
# partials, and just look at this special one.
total = partials[None]
assert not _isfinite(total)
else:
# Sum all the partial sums using builtin sum.
# FIXME is this faster if we sum them in order of the denominator?
total = sum(Fraction(n, d) for d, n in sorted(partials.items()))
return (T, total, count)
So my question: Why not just write partials[d] = partials.get(d, 0) + n? Is it slower than storing and calling function from variable?

partials.get has to search for the get attribute, starting with the object's dictionary and then going to the dictionary of the class and its parent classes. This will be done each time through the loop.
Assigning it to a variable does this lookup once, rather than repeating it.
This is a microoptimization that's typically only significant if the loop has many repetitions. The statistics library often processes large data sets, so it's reasonable here. It's rarely needed in ordinary application code.

Short answer: yes.
Python is an interpreted language, and while dictionary/attribute access is blazingly fast and very optimized, it still incurs a hit.
Since they are running this in a tight loop, they are taking the slight performance advantage of removing the "dot" from accessing partials.get.
There are other slight improvements from doing this in other cases where the variable is enough of a hint to the compiler (for cpython at least) to ensure this stays local, but I'm not sure this is the case here.

Avoid generating duplicate values from random

I want to generate random numbers and store them in a list as the following:
alist = [random.randint(0, 2 ** mypower - 1) for _ in range(total)]
My concern is the following: I want to generate total=40 million values in the range of (0, 2 ** mypower - 1). If mypower = 64, then alist will be of size ~20GB (40M*64*8) which is very large for my laptop memory. I have an idea to iteratively generate chunk of values, say 5 million at a time, and save them to a file so that I don't have to generate all 40M values at once. My concern is that if I do that in a loop, it is guaranteed that random.randint(0, 2 ** mypower - 1) will not generate values that were already generated from the previous iteration? Something like this:
for i in range(num_of_chunks):
alist = [random.randint(0, 2 ** mypower - 1) for _ in range(chunk)]
# save to file

Well, since efficiency/speed doesn't matter, I think this will work:
s = set()
while len(s) < total:
s.add(random.randint(0, 2 ** mypower - 1))
alist = list(s)
Since sets can only have unique elements in it, i think this will work well enough

To guarantee unique values you should avoid using random. Instead you should use an encryption. Because encryption is reversible, unique inputs guarantee unique outputs, given the same key. Encrypt the numbers 0, 1, 2, 3, ... and you will get guaranteed unique random-seeming outputs back providing you use a secure encryption. Good encryption is designed to give random-seeming output.
Keep track of the key (essential) and how far you have got. For your first batch encrypt integers 0..5,000,000. For the second batch encrypt 5,000,001..10,000,000 and so on.
You want 64 bit numbers, so use DES in ECB mode. DES is a 64-bit cipher, so the output from each encryption will be 64 bits. ECB mode does have a weakness, but that only applies with identical inputs. You are supplying unique inputs so the weakness is not relevant for your particular application.
If you need to regenerate the same numbers, just re-encrypt them with the same key. If you need a different set of random numbers (which will duplicate some from the first set) then use a different key. The guarantee of uniqueness only applies with a fixed key.

One way to generate random values that don't repeat is first to create a list of contiguous values
l = list(range(1000))
then shuffle it:
import random
random.shuffle(l)
You could do that several times, and save it in a file, but you'll have limited ranges since you'll never see the whole picture because of your limited memory (it's like trying to sort a big list without having the memory for it)
As someone noted, to get a wide span of random numbers, you'll need a lot of memory, so simple but not so efficient.
Another hack I just though of: do the same as above but generate a range using a step. Then in a second pass, add a random offset to the values. Even if the offset values repeat, it's guaranteed to never generate the same number twice:
import random
step = 10
l = list(range(0,1000-step,step))
random.shuffle(l)
newlist = [x+random.randrange(0,step) for x in l]
with the required max value and number of iterations that gives:
import random
number_of_iterations = 40*10**6
max_number = 2**64
step = max_number//number_of_iterations
l = list(range(0,max_number-step,step))
random.shuffle(l)
newlist = [x+random.randrange(0,step) for x in l]
print(len(newlist),len(set(newlist)))
runs in 1-2 minutes on my laptop, and gives 40000000 distinct values (evenly scattered across the range)

Usually random number generators are not really random at all. In fact, this is quite helpful in some situations. If you want the values to be unique after the second iteration, send it a different seed value.
random.seed()
The same seed will generate the same list, so if you want the next iteration to be the same, use the same seed. if you want it to be different, use a different seed.

It may need High CPU and Physical memory!
I suggest you to classify your data.
For Example you can Save:
All numbers starting with 10 and are 5 characters(Example:10365) to 10-5.txt
All numbers starting with 11 and are 6 characters(Example:114567) to 11-6.txt
Then for checking a new number:
For Example my number is 9256547
It starts with 92 and is 7 characters.
So I search 92-7.txt for this number and if it isn't duplicate,I will add it to 92-7.txt
Finally,You can join all files together.
Sorry If I have mistakes.My main language isn't English.

Is there anything faster than dict()?

I need a faster way to store and access around 3GB of k:v pairs. Where k is a string or an integer and v is an np.array() that can be of different shapes.
Is there any object that is faster than the standard python dict in storing and accessing such a table? For example, a pandas.DataFrame?
As far I have understood, python dict is a quite fast implementation of a hashtable. Is there anything better than that for my specific case?

No, there is nothing faster than a dictionary for this task and that’s because the complexity of its indexing (getting and setting item) and even membership checking is O(1) in average. (check the complexity of the rest of functionalities on Python doc https://wiki.python.org/moin/TimeComplexity )
Once you saved your items in a dictionary, you can have access to them in constant time which means that it's unlikely for your performance problem to have anything to do with dictionary indexing. That being said, you still might be able to make this process slightly faster by making some changes in your objects and their types that may result in some optimizations at under the hood operations.
e.g. If your strings (keys) are not very large you can intern the lookup key and your dictionary's keys. Interning is caching the objects in memory --or as in Python, table of "interned" strings-- rather than creating them as a separate object.
Python has provided an intern() function within the sys module that you can use for this.
Enter string in the table of “interned” strings and return the interned string – which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup...
also ...
If the keys in a dictionary are interned and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer comparison instead of comparing the string values themselves which in consequence reduces the access time to the object.
Here is an example:
In [49]: d = {'mystr{}'.format(i): i for i in range(30)}
In [50]: %timeit d['mystr25']
10000000 loops, best of 3: 46.9 ns per loop
In [51]: d = {sys.intern('mystr{}'.format(i)): i for i in range(30)}
In [52]: %timeit d['mystr25']
10000000 loops, best of 3: 38.8 ns per loop

No, I don't think there is anything faster than dict. The time complexity of its index checking is O(1).
-------------------------------------------------------
Operation | Average Case | Amortized Worst Case |
-------------------------------------------------------
Copy[2] | O(n) | O(n) |
Get Item | O(1) | O(n) |
Set Item[1] | O(1) | O(n) |
Delete Item | O(1) | O(n) |
Iteration[2] | O(n) | O(n) |
-------------------------------------------------------
PS https://wiki.python.org/moin/TimeComplexity

A numpy.array[] and simple dict = {} comparison:
import numpy
from timeit import default_timer as timer
my_array = numpy.ones([400,400])
def read_out_array_values():
cumsum = 0
for i in range(400):
for j in range(400):
cumsum += my_array[i,j]
start = timer()
read_out_array_values()
end = timer()
print("Time for array calculations:" + str(end - start))
my_dict = {}
for i in range(400):
for j in range(400):
my_dict[i,j] = 1
def read_out_dict_values():
cumsum = 0
for i in range(400):
for j in range(400):
cumsum += my_dict[i,j]
start = timer()
read_out_dict_values()
end = timer()
print("Time for dict calculations:" + str(end - start))
Prints:
Time for dict calculations:0.046898419999999996
Time for array calculations:0.07558204099999999
============= RESTART: C:/Users/user/Desktop/dict-vs-numpyarray.py =============
Time for array calculations:0.07849989000000002
Time for dict calculations:0.047769446000000104

One would think that array indexing is faster than hash lookup.
So if we could store this data in a numpy array, and assume the keys are not strings, but numbers, would that be faster than a python a dictionary?
Unfortunately not, because NumPy is optimized for vector operations, not for individual look up of values.
Pandas fares even worse.
See the experiment here: https://nbviewer.jupyter.org/github/annotation/text-fabric/blob/master/test/pandas/pandas.ipynb
The other candidate could be the Python array, in the array module. But that is not usable for variable-size values.
And in order to make this work, you probably need to wrap it into some pure python code, which will set back all time performance gains that the array offers.
So, even if the requirements of the OP are relaxed, there still does not seem to be a faster option than dictionaries.

You can think of storing them in Data structure like Trie given your key is string. Even to store and retrieve from Trie you need O(N) where N is maximum length of key. Same happen to hash calculation which computes hash for key. Hash is used to find and store in Hash Table. We often don't consider the hashing time or computation.
You may give a shot to Trie, Which should be almost equal performance, may be little bit faster( if hash value is computed differently for say
HASH[i] = (HASH[i-1] + key[i-1]*256^i % BUCKET_SIZE ) % BUCKET_SIZE
or something similar due to collision we need to use 256^i.
You can try to store them in Trie and see how it performs.

I don't understand why/how one of these methods is faster than the others

I wanted to test the difference in time between implementations of some simple code. I decided to count how many values out of a random sample of 10,000,000 numbers is greater than 0.5. The random sample is grabbed uniformly from the range [0.0, 1.0).
Here is my code:
from numpy.random import random_sample; import time;
n = 10000000;
t1 = time.clock();
t = 0;
z = random_sample(n);
for x in z:
if x > 0.5: t += 1;
print t;
t2 = time.clock();
t = 0;
for _ in xrange(n):
if random_sample() > 0.5: t += 1;
print t;
t3 = time.clock();
t = (random_sample(n) > 0.5).sum();
print t;
t4 = time.clock();
print t2-t1; print t3-t2; print t4-t3;
This is the output:
4999445
4999511
5001498
7.0348236652
1.75569394301
0.202538106332
I get that the first implementation sucks because creating a massive array and then counting it element-wise is a bad idea, so I thought that the second implementation would be the most efficient.
But how is the third implementation 10 times faster than the second method? Doesn't the third method also create a massive array in the form of random_sample(n) and then go through it checking values against 0.5?
How is this third method different from the first method and why is it ~35 times faster than the first method?
EDIT: #merlin2011 suggested that Method 3 probably doesn't create the full array in memory. So, to test that theory I tried the following:
z = random_sample(n);
t = (z > 0.5).sum();
print t;
which runs in a time of 0.197948451549 which is practically identical to Method 3. So, this is probably not a factor.

Method 1 generates a full list in memory before using it. This is slow because the memory has to be allocated and then accessed, probably missing the cache multiple times.
Method 2 uses an generator, which never creates the list in memory but instead generates each element on demand.
Method 3 is probably faster because sum() is implemented as a loop in C but I am not 100% sure. My guess is that this is faster for the same reason that Matlab vectorization is faster than for loops in Matlab.
Update: Separating out each of three steps, I observe that method 3 is still equally fast, so I have to agree with utdemir that each individual operator is executing instructions closer to machine code.
z = random_sample(n)
z2 = z > 0.5
t = z2.sum();
In each of the first two methods, you are invoking Python's standard functionality to do a loop, and this is much slower than a C-level loop that is baked into the implementation.

AFAIK
Function calls are heavy, on method two, you're calling random_sample() 10000000 times, but on third method, you just call it once.
Numpy's > and .sum are optimized to their last bits in C, also most probably using SIMD instructions to avoid loops.
So,
On method 2, you are comparing and looping using Python; but on method 3, you're much closer to the processor and using optimized instructions to compare and sum.

Counting collisions in a Python dictionary

my first time posting here, so hope I've asked my question in the right sort of way,
After adding an element to a Python dictionary, is it possible to get Python to tell you if adding that element caused a collision? (And how many locations the collision resolution strategy probed before finding a place to put the element?)
My problem is: I am using dictionaries as part of a larger project, and after extensive profiling, I have discovered that the slowest part of the code is dealing with a sparse distance matrix implemented using dictionaries.
The keys I'm using are IDs of Python objects, which are unique integers, so I know they all hash to different values. But putting them in a dictionary could still cause collisions in principle. I don't believe that dictionary collisions are the thing that's slowing my program down, but I want to eliminate them from my enquiries.
So, for example, given the following dictionary:
d = {}
for i in xrange(15000):
d[random.randint(15000000, 18000000)] = 0
can you get Python to tell you how many collisions happened when creating it?
My actual code is tangled up with the application, but the above code makes a dictionary that looks very similar to the ones I am using.
To repeat: I don't think that collisions are what is slowing down my code, I just want to eliminate the possibility by showing that my dictionaries don't have many collisions.
Thanks for your help.
Edit: Some code to implement #Winston Ewert's solution:
n = 1500
global collision_count
collision_count = 0
class Foo():
def __eq__(self, other):
global collision_count
collision_count += 1
return id(self) == id(other)
def __hash__(self):
#return id(self) # #John Machin: yes, I know!
return 1
objects = [Foo() for i in xrange(n)]
d = {}
for o in objects:
d[o] = 1
print collision_count
Note that when you define __eq__ on a class, Python gives you a TypeError: unhashable instance if you don't also define a __hash__ function.
It doesn't run quite as I expected. If you have the __hash__ function return 1, then you get loads of collisions, as expected (1125560 collisions for n=1500 on my system). But with return id(self), there are 0 collisions.
Anyone know why this is saying 0 collisions?
Edit:
I might have figured this out.
Is it because __eq__ is only called if the __hash__ values of two objects are the same, not their "crunched version" (as #John Machin put it)?

Short answer:
You can't simulate using object ids as dict keys by using random integers as dict keys. They have different hash functions.
Collisions do happen. "Having unique thingies means no collisions" is wrong for several values of "thingy".
You shouldn't be worrying about collisions.
Long answer:
Some explanations, derived from reading the source code:
A dict is implemented as a table of 2 ** i entries, where i is an integer.
dicts are no more than 2/3 full. Consequently for 15000 keys, i must be 15 and 2 ** i is 32768.
When o is an arbitrary instance of a class that doesn't define __hash__(), it is NOT true that hash(o) == id(o). As the address is likely to have zeroes in the low-order 3 or 4 bits, the hash is constructed by rotating the address right by 4 bits; see the source file Objects/object.c, function _Py_HashPointer
It would be a problem if there were lots of zeroes in the low-order bits, because to access a table of size 2 ** i (e.g. 32768), the hash value (often much larger than that) must be crunched to fit, and this is done very simply and quickly by taking the low order i (e.g. 15) bits of the hash value.
Consequently collisions are inevitable.
However this is not cause for panic. The remaining bits of the hash value are factored into the calculation of where the next probe will be. The likelihood of a 3rd etc probe being needed should be rather small, especially as the dict is never more than 2/3 full. The cost of multiple probes is mitigated by the cheap cost of calculating the slot for the first and subsequent probes.
The code below is a simple experiment illustrating most of the above discussion. It presumes random accesses of the dict after it has reached its maximum size. With Python2.7.1, it shows about 2000 collisions for 15000 objects (13.3%).
In any case the bottom line is that you should really divert your attention elsewhere. Collisions are not your problem unless you have achieved some extremely abnormal way of getting memory for your objects. You should look at how you are using the dicts e.g. use k in d or try/except, not d.has_key(k). Consider one dict accessed as d[(x, y)] instead of two levels accessed as d[x][y]. If you need help with that, ask a seperate question.
Update after testing on Python 2.6:
Rotating the address was not introduced until Python 2.7; see this bug report for comprehensive discussion and benchmarks. The basic conclusions are IMHO still valid, and can be augmented by "Update if you can".
>>> n = 15000
>>> i = 0
>>> while 2 ** i / 1.5 < n:
... i += 1
...
>>> print i, 2 ** i, int(2 ** i / 1.5)
15 32768 21845
>>> probe_mask = 2 ** i - 1
>>> print hex(probe_mask)
0x7fff
>>> class Foo(object):
... pass
...
>>> olist = [Foo() for j in xrange(n)]
>>> hashes = [hash(o) for o in olist]
>>> print len(set(hashes))
15000
>>> probes = [h & probe_mask for h in hashes]
>>> print len(set(probes))
12997
>>>

This idea doesn't actually work, see discussion in the question.
A quick look at the C implementation of python shows that the code for resolving collisions does not calculate or store the number of collisions.
However, it will invoke PyObject_RichCompareBool on the keys to check if they match. This means that __eq__ on the key will be invoked for every collision.
So:
Replace your keys with objects that define __eq__ and increment a counter when it is called. This will be slower because of the overhead involved in jumping into python for the compare. However, it should give you an idea of how many collisions are happening.
Make sure you use different objects as the key, otherwise python will take a shortcut because an object is always equal to itself. Also, make sure the objects hash to the same value as the original keys.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Implementation of hashing and dictionaries - python

Related

Is it more performant to access function from variable?

Avoid generating duplicate values from random

Is there anything faster than dict()?

I don't understand why/how one of these methods is faster than the others

Counting collisions in a Python dictionary

Categories

Resources