Most efficient property to hash for numpy array

I need to be able to store a numpy array in a dict for caching purposes. Hash speed is important.
The array represents indices, so while the actual identity of the object is not important, the value is. Mutability is not a concern, as I'm only interested in the current value.
What should I hash in order to store it in a dict?
My current approach is to use str(arr.data), which is faster than md5 in my testing.
I've incorporated some examples from the answers to get an idea of relative times:
In [121]: %timeit hash(str(y))
10000 loops, best of 3: 68.7 us per loop
In [122]: %timeit hash(y.tostring())
1000000 loops, best of 3: 383 ns per loop
In [123]: %timeit hash(str(y.data))
1000000 loops, best of 3: 543 ns per loop
In [124]: %timeit y.flags.writeable = False ; hash(y.data)
1000000 loops, best of 3: 1.15 us per loop
In [125]: %timeit hash((b*y).sum())
100000 loops, best of 3: 8.12 us per loop
It would appear that for this particular use case (small arrays of indices), arr.tostring (spelled arr.tobytes in current NumPy) offers the best performance.
While hashing the read-only buffer is fast on its own, the overhead of setting the writeable flag actually makes it slower.

You can simply hash the underlying buffer, if you make it read-only:
>>> a = random.randint(10, 100, 100000)
>>> a.flags.writeable = False
>>> %timeit hash(a.data)
100 loops, best of 3: 2.01 ms per loop
>>> %timeit hash(a.tostring())
100 loops, best of 3: 2.28 ms per loop
For very large arrays, hash(str(a)) is a lot faster, but then it only takes a small part of the array into account.
>>> %timeit hash(str(a))
10000 loops, best of 3: 55.5 us per loop
>>> str(a)
'[63 30 33 ..., 96 25 60]'

You can try xxhash via its Python binding. For large arrays this is much faster than hash(x.tostring()).
Example IPython session:
>>> import xxhash
>>> import numpy
>>> x = numpy.random.rand(1024 * 1024 * 16)
>>> h = xxhash.xxh64()
>>> %timeit hash(x.tostring())
1 loops, best of 3: 208 ms per loop
>>> %timeit h.update(x); h.intdigest(); h.reset()
100 loops, best of 3: 10.2 ms per loop
And by the way, on various blogs and answers posted to Stack Overflow, you'll see people using sha1 or md5 as hash functions. For performance reasons this is usually not acceptable, as those "secure" hash functions are rather slow. They're useful only if hash collision is one of the top concerns.
Nevertheless, hash collisions happen all the time. And if all you need is implementing __hash__ for data-array objects so that they can be used as keys in Python dictionaries or sets, I think it's better to concentrate on the speed of __hash__ itself and let Python handle the hash collision[1].
[1] You may need to override __eq__ too, to help Python manage hash collision. You would want __eq__ to return a boolean, rather than an array of booleans as is done by numpy.
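As a rough sketch of the footnote above: the wrapper below makes an array usable as a dict key by hashing its bytes once and letting __eq__ return a single boolean. The class name HashableArray is made up for illustration and is not part of numpy.

```python
import numpy as np

class HashableArray:
    """Hypothetical wrapper that lets an array act as a dict or set key."""

    def __init__(self, arr):
        # Copy so later changes to the caller's array cannot alter the key.
        self._arr = np.array(arr, copy=True)
        self._arr.flags.writeable = False
        # Hash once up front. The raw bytes ignore shape and dtype, which is
        # acceptable for a hash because __eq__ settles any collision.
        self._hash = hash(self._arr.tobytes())

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        if not isinstance(other, HashableArray):
            return NotImplemented
        # np.array_equal gives one bool instead of an element-wise array.
        return bool(np.array_equal(self._arr, other._arr))


cache = {HashableArray([1, 0, 2]): "cached result"}
lookup = cache[HashableArray([1, 0, 2])]
```

Two wrappers built from equal arrays hash to the same value and compare equal, so the second lookup finds the entry stored under the first.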

Coming late to the party, but for large arrays, I think a decent way to do it is to randomly subsample the matrix and hash that sample:
def subsample_hash(a):
    rng = np.random.RandomState(89)
    inds = rng.randint(low=0, high=a.size, size=1000)
    b = a.flat[inds]
    b.flags.writeable = False
    return hash(b.data)
I think this is better than doing hash(str(a)), because the latter could confuse arrays that have unique data in the middle but zeros around the edges.

If your np.array() is small and in a tight loop, then one option is to skip hash() completely and just use np.array().data.tobytes() directly as your dict key:
grid = np.array([[True, False, True],[False, False, True]])
hash = grid.data.tobytes()
cache = cache or {}
if hash not in cache:
    cache[hash] = function(grid)
return cache[hash]
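One caveat when using the raw bytes as a key: tobytes() carries neither shape nor dtype, so arrays that differ only in those can produce the same key. A minimal sketch of a key that includes them follows; the helper name array_key is hypothetical.

```python
import numpy as np

def array_key(arr):
    """Hypothetical helper: build a hashable dict key from an array's value."""
    arr = np.ascontiguousarray(arr)
    # tobytes() drops shape and dtype, so carry them along in the key.
    return (arr.shape, arr.dtype.str, arr.tobytes())

flat = np.zeros(6, dtype=np.uint8)
grid = flat.reshape(2, 3)

same_bytes = flat.tobytes() == grid.tobytes()      # bytes alone cannot tell them apart
same_key = array_key(flat) == array_key(grid)      # the tuple key can
```

If all cached arrays are known to share one shape and dtype, the plain bytes are enough and the tuple only adds overhead.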

What kind of data do you have? In particular, how large are the arrays, and can an index occur several times in an array?
If your array consists only of a permutation of indices, you can use a base conversion
(1, 0, 2) -> 1 * 3**0 + 0 * 3**1 + 2 * 3**2 = 19 (201 in base 3)
and use 19 as the hash key via
import numpy as num
base_size = 3
base = base_size ** num.arange(base_size)
max_base = (base * num.arange(base_size)).sum()
hashed_array = (base * array).sum()
Now you can use an array (shape=(max_base + 1,)) instead of a dict in order to access the values.
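To make the arithmetic concrete, here is a small sketch of the base-conversion key together with a flat lookup table. The helper perm_key and the table size base_size ** base_size are illustrative choices, not part of the answer above.

```python
import numpy as np

base_size = 3
base = base_size ** np.arange(base_size)   # place values: [1, 3, 9]

def perm_key(perm):
    """Hypothetical helper: read the entries of perm as base-`base_size` digits."""
    return int((base * np.asarray(perm)).sum())

# A flat table indexed by the key. base_size ** base_size slots cover every
# possible digit sequence, so two distinct arrays never share a slot.
table = np.zeros(base_size ** base_size)
table[perm_key((1, 0, 2))] = 42.0

key = perm_key((1, 0, 2))   # 1*1 + 0*3 + 2*9
```

The table grows as base_size ** base_size, so this only pays off for very short index arrays.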

Related

How to deoptimize memory access in python?

This may not be useful. It's just a challenge I have set up for myself.
Let's say you have a big array. What can you do so that the program does not benefit from caching or cache line prefetching, and so that the next memory access can only be determined after the first access finishes?
So we have our array:
array = [0] * 10000000
What would be the best way to deoptimize the memory access if you had to access all elements in a loop? The idea is to increase the access time of each memory location as much as possible.
I'm not looking for a solution which proposes to do "something else" (which takes time) before doing the next access. The idea is really to increase the access time as much as possible. I guess we have to traverse the array in a certain way (perhaps randomly? I'm still looking into it).

I did not expect any difference, but in fact accessing the elements in random order is significantly slower than accessing them in order or in reverse order (which are both about the same).
>>> N = 10**5
>>> arr = [random.randint(0, 1000) for _ in range(N)]
>>> srt = list(range(N))
>>> rvd = srt[::-1]
>>> rnd = random.sample(srt, N)
>>> %timeit sum(arr[i] for i in srt)
10 loops, best of 5: 24.9 ms per loop
>>> %timeit sum(arr[i] for i in rvd)
10 loops, best of 5: 25.7 ms per loop
>>> %timeit sum(arr[i] for i in rnd)
10 loops, best of 5: 59.2 ms per loop
And it really seems to be the randomness. Just accessing indices out of order, but with a pattern, e.g. as [0, N-1, 2, N-3, ...] or [0, N/2, 1, N/2+1, ...], is just as fast as accessing them in order:
>>> alt1 = [i if i % 2 == 0 else N - i for i in range(N)]
>>> alt2 = [i for p in zip(srt[:N//2], srt[N//2:]) for i in p]
>>> %timeit sum(arr[i] for i in alt1)
10 loops, best of 5: 24.5 ms per loop
>>> %timeit sum(arr[i] for i in alt2)
10 loops, best of 5: 24.1 ms per loop
Interestingly, just iterating the shuffled indices (and calculating their sum as with the array above) is also slower than doing the same with the sorted indices, but not as much. Of the ~35ms difference between srt and rnd, ~10ms seem to come from iterating the randomized indices, and ~25ms for actually accessing the indices in random order.
>>> %timeit sum(i for i in srt)
100 loops, best of 5: 19.7 ms per loop
>>> %timeit sum(i for i in rnd)
10 loops, best of 5: 30.5 ms per loop
>>> %timeit sum(arr[i] for i in srt)
10 loops, best of 5: 24.5 ms per loop
>>> %timeit sum(arr[i] for i in rnd)
10 loops, best of 5: 56 ms per loop
(IPython 5.8.0 / Python 3.7.3 on a rather old laptop running Linux)
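For anyone who wants to rerun the comparison outside IPython, a self-contained sketch using the standard timeit module could look like this. The helper name total and the repeat counts are arbitrary choices, and the absolute numbers will differ from machine to machine.

```python
import random
import timeit

N = 10**5
arr = [random.randint(0, 1000) for _ in range(N)]
in_order = list(range(N))
shuffled = random.sample(in_order, N)

def total(indices):
    # Touch every element of arr exactly once, in the order given by indices.
    return sum(arr[i] for i in indices)

t_order = min(timeit.repeat(lambda: total(in_order), number=5, repeat=3))
t_random = min(timeit.repeat(lambda: total(shuffled), number=5, repeat=3))
print(f"in order: {t_order:.4f} s, shuffled: {t_random:.4f} s")
```

Both calls do the same amount of work and return the same sum, so any gap between the two timings comes from the access order alone.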

Python interns small integers. Use integers > 256. The * operator just adds references to the object already in the list when the list is expanded, so use unique values instead. Caches hate randomness, so go random.
import random
array = list(range(257, 10000257))
while array:
    array.pop(random.randint(0, len(array)-1))
A note on interning small integers. When you create an integer in your program, say 12345, Python creates an object on the heap (28 bytes or more on 64-bit CPython 3). This is expensive. So, the numbers from -5 to 256 are built into CPython to optimize common small number operations. By avoiding these numbers you force Python to allocate integers on the heap, spreading out the amount of memory you will touch and reducing cache efficiency.
If you use a single number in the array [1234] * 100000, that single number is referenced many times. If you use unique numbers, they are all individually allocated on the heap, increasing memory footprint. And when they are removed from the list, python has to touch the object to reduce its reference count which pulls its memory location into cache, invalidating something else.
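The difference between the two kinds of list can be checked directly by counting distinct object identities. This is only a small illustration with made-up variable names:

```python
shared = [1234] * 1000                 # one int object, referenced 1000 times
unique = list(range(257, 1257))        # 1000 different values, so 1000 objects

shared_objects = len({id(v) for v in shared})
unique_objects = len({id(v) for v in unique})
```

The first list touches a single integer object however long it gets, while the second spreads its values over as many objects as it has elements.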

Fastest way to check if duplicates exist in a python list / numpy ndarray

I want to determine whether or not my list (actually a numpy.ndarray) contains duplicates in the fastest possible execution time. Note that I don't care about removing the duplicates, I simply want to know if there are any.
Note: I'd be extremely surprised if this is not a duplicate, but I've tried my best and can't find one. Closest are this question and this question, both of which are requesting that the unique list be returned.

Here are the four ways I thought of doing it.
TL;DR: if you expect very few (fewer than 1/1000) duplicates:
def contains_duplicates(X):
    return len(np.unique(X)) != len(X)
If you expect frequent (more than 1/1000) duplicates:
def contains_duplicates(X):
    seen = set()
    seen_add = seen.add
    for x in X:
        if (x in seen or seen_add(x)):
            return True
    return False
The first two methods timed below are early-exit versions: the first adapts this answer, which originally returned the unique values, and the second applies the same idea to this answer.
>>> import numpy as np
>>> X = np.random.normal(0,1,[10000])
>>> def terhorst_early_exit(X):
...:     elems = set()
...:     for i in X:
...:         if i in elems:
...:             return True
...:         elems.add(i)
...:     return False
>>> %timeit terhorst_early_exit(X)
100 loops, best of 3: 10.6 ms per loop
>>> def peterbe_early_exit(X):
...:     seen = set()
...:     seen_add = seen.add
...:     for x in X:
...:         if (x in seen or seen_add(x)):
...:             return True
...:     return False
>>> %timeit peterbe_early_exit(X)
100 loops, best of 3: 9.35 ms per loop
>>> %timeit len(set(X)) != len(X)
100 loops, best of 3: 4.54 ms per loop
>>> %timeit len(np.unique(X)) != len(X)
1000 loops, best of 3: 967 µs per loop
Do things change if you start with an ordinary Python list, and not a numpy.ndarray?
>>> X = X.tolist()
>>> %timeit terhorst_early_exit(X)
100 loops, best of 3: 9.34 ms per loop
>>> %timeit peterbe_early_exit(X)
100 loops, best of 3: 8.07 ms per loop
>>> %timeit len(set(X)) != len(X)
100 loops, best of 3: 3.09 ms per loop
>>> %timeit len(np.unique(X)) != len(X)
1000 loops, best of 3: 1.83 ms per loop
Edit: what if we have a prior expectation of the number of duplicates?
The above comparison is functioning under the assumption that a) there are likely to be no duplicates, or b) we're more worried about the worst case than the average case.
>>> X = np.random.normal(0, 1, [10000])
>>> for n_duplicates in [1, 10, 100]:
...     print("{} duplicates".format(n_duplicates))
...     duplicate_idx = np.random.choice(len(X), n_duplicates, replace=False)
...     X[duplicate_idx] = 0
...     print("terhost_early_exit")
...     %timeit terhorst_early_exit(X)
...     print("peterbe_early_exit")
...     %timeit peterbe_early_exit(X)
...     print("set length")
...     %timeit len(set(X)) != len(X)
...     print("numpy unique length")
...     %timeit len(np.unique(X)) != len(X)
1 duplicates
terhost_early_exit
100 loops, best of 3: 12.3 ms per loop
peterbe_early_exit
100 loops, best of 3: 9.55 ms per loop
set length
100 loops, best of 3: 4.71 ms per loop
numpy unique length
1000 loops, best of 3: 1.31 ms per loop
10 duplicates
terhost_early_exit
1000 loops, best of 3: 1.81 ms per loop
peterbe_early_exit
1000 loops, best of 3: 1.47 ms per loop
set length
100 loops, best of 3: 5.44 ms per loop
numpy unique length
1000 loops, best of 3: 1.37 ms per loop
100 duplicates
terhost_early_exit
10000 loops, best of 3: 111 µs per loop
peterbe_early_exit
10000 loops, best of 3: 99 µs per loop
set length
100 loops, best of 3: 5.16 ms per loop
numpy unique length
1000 loops, best of 3: 1.19 ms per loop
So if you expect very few duplicates, the numpy.unique function is the way to go. As the number of expected duplicates increases, the early exit methods dominate.
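If you want to stay inside numpy without building the array of unique values, one further option is to sort a copy and compare neighbours. This is a sketch of that variant and was not part of the timings above; the function name is made up.

```python
import numpy as np

def contains_duplicates_sorted(X):
    """Hypothetical variant: sort a copy, then compare each element with its neighbour."""
    s = np.sort(np.asarray(X), axis=None)   # axis=None flattens before sorting
    return bool(np.any(s[1:] == s[:-1]))

no_dups = contains_duplicates_sorted(np.arange(10))
has_dups = contains_duplicates_sorted(np.array([3, 1, 4, 1, 5]))
```

After sorting, equal values sit next to each other, so one vectorized comparison is enough. Like np.unique, it cannot exit early.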

Depending on how large your array is, and how likely duplicates are, the answer will be different.
For example, if you expect the average array to have around 3 duplicates, early exit will cut your average-case time (and space) by 2/3rds; if you expect only 1 in 1000 arrays to have any duplicates at all, it will just add a bit of complexity without improving anything.
Meanwhile, if the arrays are big enough that building a temporary set as large as the array is likely to be expensive, sticking a probabilistic test like a bloom filter in front of it will probably speed things up dramatically, but if not, it's again just wasted effort.
Finally, you want to stay within numpy if at all possible. Looping over an array of floats (or whatever) and boxing each one into a Python object is going to take almost as much time as hashing and checking the values, and of course storing things in a Python set instead of optimized numpy storage is wasteful as well. But you have to trade that off against the other issues: you can't do early exit with numpy, and there may be nice C-optimized bloom filter implementations a pip install away, but none that are numpy-friendly.
So, there's no one best solution for all possible scenarios.
Just to give an idea of how easy it is to write a bloom filter, here's one I hacked together in a couple minutes:
from bitarray import bitarray # pip3 install bitarray
def dupcheck(X):
    # Hardcoded values to give about 5% false positives for 10000 elements
    size = 62352
    hashcount = 4
    bits = bitarray(size)
    bits.setall(0)
    def check(x, hash=hash): # TODO: default-value bits, hashcount, size?
        for i in range(hashcount):
            if not bits[hash((x, i)) % size]: return False
        return True
    def add(x):
        for i in range(hashcount):
            bits[hash((x, i)) % size] = True
    # First pass: anything the filter claims to have seen is only a candidate,
    # because the first occurrence of a value is never stored exactly.
    candidates = {}
    for x in X:
        if check(x):
            candidates[x] = False
        else:
            add(x)
    # Second pass: exact check, restricted to the candidates.
    for x in X:
        if x in candidates:
            if candidates[x]:
                return True
            candidates[x] = True
    return False
This only uses 12KB (a 62352-bit bitarray plus a 500-float set) instead of 80KB (a 10000-float set or np.array). Which doesn't matter when you're only dealing with 10K elements, but with, say, 10B elements that use up more than half of your physical RAM, it would be a different story.
Of course it's almost certainly going to be an order of magnitude or so slower than using np.unique, or maybe even set, because we're doing all that slow looping in Python. But if this turns out to be worth doing, it should be a breeze to rewrite in Cython (and to directly access the numpy array without boxing and unboxing).

My timing tests differ from Scott's for small lists. Using Python 3.7.3, set() is much faster than np.unique for a small numpy array from randint (length 8), but slower for a larger array (length 1000).
Length 8
Timing test iterations: 10000
Function    Min        Avg Sec      Conclusion  p-value
----------  ---------  -----------  ----------  -------
set_len     0          7.73486e-06  Baseline
unique_len  9.644e-06  2.55573e-05  Slower      0
Length 1000
Timing test iterations: 10000
Function    Min         Avg Sec      Conclusion  p-value
----------  ----------  -----------  ----------  -------
set_len     0.00011066  0.000270466  Baseline
unique_len  4.3684e-05  8.95608e-05  Faster      0
Then I tried my own implementation, but I think it would require optimized C code to beat set:
def check_items(key_rand, **kwargs):
    for i, vali in enumerate(key_rand):
        for j in range(i+1, len(key_rand)):
            valj = key_rand[j]
            if vali == valj:
                break  # duplicate found; this leaves only the inner loop
Length 8
Timing test iterations: 10000
Function     Min         Avg Sec      Conclusion  p-value
-----------  ----------  -----------  ----------  -------
set_len      0           6.74221e-06  Baseline
unique_len   0           2.14604e-05  Slower      0
check_items  1.1138e-05  2.16369e-05  Slower      0
(using my randomized compare_time() function from easyinfo)

Efficient way of constructing a matrix with all elements zero except one in numpy

I want to compute the output error of a neural network for each input by comparing the output signal with its true output value, so I need two matrices for this task.
I have an output matrix of shape (n*1), but for the label I just have the index of the neuron that should be activated, so I need a matrix of the same shape with all elements equal to zero except the one whose index is equal to the label. I could do that with a function, but I wonder: is there a built-in method in numpy that can do that for me?

You can do that in multiple ways using numpy or the standard library. One way is to create an array of zeros and set the value at the given index to 1.
n = len(result)
a = np.zeros((n,))
a[id] = 1
It probably is going to be the fastest one as well:
>> %timeit a = np.zeros((n,)); a[id] = 1
1000000 loops, best of 3: 634 ns per loop
Alternatively you can use numpy.pad to pad a [1] array with zeros. But this will almost definitely be slower due to the padding logic.
np.pad([1], (id, n-id-1), 'constant', constant_values=0)
As expected, it is an order of magnitude slower:
>> %timeit np.pad([1], (id, n-id-1), 'constant', constant_values=0)
10000 loops, best of 3: 47.4 µs per loop
And you can try list comprehension as suggested by the comments:
results = [7]
np.matrix([1 if x == id else 0 for x in results])
But it is much slower than the first method as well:
>> %timeit np.matrix([1 if x == id else 0 for x in results])
100000 loops, best of 3: 7.25 µs per loop
Edit:
But in my opinion, if you want to compute the neural network's error, you should just use np.argmax and check whether the prediction was correct. That error calculation may give you more noise than useful information. You can make a confusion matrix if you feel your network tends to confuse similar classes.
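A minimal sketch of the argmax idea, using made-up outputs for three samples and three output neurons:

```python
import numpy as np

# Made-up network outputs: one row per sample, one column per output neuron.
outputs = np.array([[0.1, 0.7, 0.2],
                    [0.8, 0.1, 0.1],
                    [0.3, 0.3, 0.4]])
labels = np.array([1, 2, 2])             # index of the neuron that should fire

predicted = np.argmax(outputs, axis=1)   # most active neuron per sample
correct = predicted == labels
error_rate = 1.0 - correct.mean()
```

This compares label indices directly, so no one-hot matrix has to be built for the evaluation step.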

A few other methods that also seem to be slower than @umutto's above:
%timeit a = np.zeros((n,)); a[id] = 1 #umutto's method
The slowest run took 45.34 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.53 µs per loop
Boolean construction:
%timeit a = np.arange(n) == id
The slowest run took 13.98 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 3.76 µs per loop
Boolean construction to integer:
%timeit a = (np.arange(n) == id).astype(int)
The slowest run took 15.31 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.47 µs per loop
List construction:
%timeit a = [0]*n; a[id] = 1; a=np.asarray(a)
10000 loops, best of 3: 77.3 µs per loop
Using scipy.sparse
%timeit a = sparse.coo_matrix(([1], ([id],[0])), shape=(n,1))
10000 loops, best of 3: 51.1 µs per loop
Now what's actually faster may depend on what's being cached, but it seems like constructing the zero array is probably fastest, especially if you can use np.zeros_like(result) instead of np.zeros(len(result)).

One liner:
x = np.identity(n)[id]
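Both the zero-array approach and the identity-matrix lookup extend to a whole batch of labels. The sketch below uses made-up labels and assumes each label is a valid neuron index:

```python
import numpy as np

n = 4                                   # number of output neurons
labels = np.array([2, 0, 3])            # made-up label indices, one per sample

# Zero matrix plus fancy indexing: one assignment sets a single 1 in each row.
one_hot = np.zeros((labels.size, n))
one_hot[np.arange(labels.size), labels] = 1

# Row lookup in an identity matrix gives the same result, at the cost of
# building an n-by-n matrix first.
via_eye = np.eye(n)[labels]
```

For large n the zero-matrix version avoids the n-by-n temporary that the identity lookup needs.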

How are NumPy's in-place operators implemented to explain the significant performance gain

I know that in Python, in-place operators such as += use the __iadd__ method. For immutable types, __iadd__ is a workaround using __add__, e.g., like tmp = a + b; a = tmp, but mutable types (like lists) are modified in-place, which causes a slight speed boost.
However, if I have a NumPy array where I modify its contained immutable types, e.g., integers or floats, there is also an even more significant speed boost. How does this work? I did some example benchmarks below:
import numpy as np
def inplace(a, b):
    a += b
    return a
def assignment(a, b):
    a = a + b
    return a
int1 = 1
int2 = 1
list1 = [1]
list2 = [1]
npary1 = np.ones((1000,1000))
npary2 = np.ones((1000,1000))
print('Python integers')
%timeit inplace(int1, 1)
%timeit assignment(int2, 1)
print('\nPython lists')
%timeit inplace(list1, [1])
%timeit assignment(list2, [1])
print('\nNumPy Arrays')
%timeit inplace(npary1, 1)
%timeit assignment(npary2, 1)
What I would expect is a similar difference as for the Python integers when I used the in-place operators on NumPy arrays, however the results are completely different:
Python integers
1000000 loops, best of 3: 265 ns per loop
1000000 loops, best of 3: 249 ns per loop
Python lists
1000000 loops, best of 3: 449 ns per loop
1000000 loops, best of 3: 638 ns per loop
NumPy Arrays
100 loops, best of 3: 3.76 ms per loop
100 loops, best of 3: 6.6 ms per loop

Each call to assignment(npary2, 1) requires creating a new one million element array. Consider how much time it takes just to allocate a (1000, 1000)-shaped array of ones:
In [21]: %timeit np.ones((1000, 1000))
100 loops, best of 3: 3.84 ms per loop
This allocation of a new temporary array requires about 3.84 ms on my machine, and is on the right order of magnitude to explain the entire difference between inplace(npary1, 1) and assignment(npary2, 1):
In [12]: %timeit inplace(npary1, 1)
1000 loops, best of 3: 1.8 ms per loop
In [13]: %timeit assignment(npary2, 1)
100 loops, best of 3: 4.04 ms per loop
So, given that allocation is a relatively slow process, it makes sense that in-place addition is significantly faster than assignment to a new array.
NumPy operations on NumPy arrays may be fast, but creation of NumPy arrays is relatively slow. Consider, for example, how much more time it takes to create a NumPy array than a Python list:
In [14]: %timeit list()
10000000 loops, best of 3: 106 ns per loop
In [15]: %timeit np.array([])
1000000 loops, best of 3: 563 ns per loop
This is one reason why it is generally better to use one large NumPy array (allocated once) rather than thousands of small NumPy arrays.
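The allocation difference can be observed directly: the in-place form keeps writing into the same array object, while the assignment form produces a new one. A small sketch with illustrative variable names:

```python
import numpy as np

a = np.ones((1000, 1000))
b = a                       # second name for the same array object

a += 1                      # in place: writes into the existing buffer
still_same = a is b         # no new array was created

c = a + 1                   # allocates a fresh result array
shares = np.shares_memory(a, c)

# np.add with out= is the explicit spelling of the in-place form.
np.add(a, 1, out=a)
```

Because a and b name the same array, the in-place updates are visible through both, which is worth keeping in mind when a function receives an array from its caller.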
