NumPy fast check for complete array equality, like Matlab's isequal - python

In Matlab, the builtin isequal does a check if two arrays are equal. If they are not equal, this might be very fast, as the implementation presumably stops checking as soon as there is a difference:
>> A = zeros(1e9, 1, 'single');
>> B = A(:);
>> B(1) = 1;
>> tic; isequal(A, B); toc;
Elapsed time is 0.000043 seconds.
Is there any equivalent in Python/numpy? all(A==B) or all(equal(A, B)) is far slower, because it compares all elements, even if the initial one differs:
In [13]: A = zeros(1e9, dtype='float32')
In [14]: B = A.copy()
In [15]: B[0] = 1
In [16]: %timeit all(A==B)
1 loops, best of 3: 612 ms per loop
Is there any numpy equivalent? It should be very easy to implement in C, but slow to implement in Python because this is a case where we do not want to broadcast, so it would require an explicit loop.
Edit:
It appears array_equal does what I want. However, it is not faster than all(A==B), because it's not a built-in, but just a short Python function doing A==B. So it does not meet my need for a fast check.
In [12]: %timeit array_equal(A, B)
1 loops, best of 3: 623 ms per loop

First, it should be noted that in the OP's example the arrays have identical elements because B=A[:] is just a view onto the array, so:
>>> print A[0], B[0]
1.0, 1.0
Although the test isn't a fair one, the basic complaint is valid: Numpy does not have a short-circuiting equality check.
One can easily see from the source that allclose, array_equal, and array_equiv are all just variations on all(A==B), adapted to their respective details, and are not notably faster.
An advantage of numpy though is that slices are just views, and are therefore very fast, so one could write their own short-circuiting comparison fairly easily (I'm not saying this is ideal, but it does work):
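from numpy import *

A = zeros(int(1e8), dtype='float32')
B = A[:]        # a view: writing to B also writes to A
B[0] = 1
C = array(B)    # a copy that differs from A in the first element
C[0] = 2
D = array(A)    # a copy that differs from A in the last element
D[-1] = 2

def short_circuit_check(a, b, n):
    # Compare n equal-sized chunks and stop at the first chunk that differs.
    # Assumes len(a) is divisible by n; any remainder is not compared.
    L = len(a) // n
    for i in range(n):
        j = i * L
        if not all(a[j:j+L] == b[j:j+L]):
            return False
    return True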
In [26]: %timeit short_circuit_check(A, C, 100) # 100x faster
1000 loops, best of 3: 1.49 ms per loop
In [27]: %timeit all(A==C)
1 loops, best of 3: 158 ms per loop
In [28]: %timeit short_circuit_check(A, D, 100)
10 loops, best of 3: 144 ms per loop
In [29]: %timeit all(A==D)
10 loops, best of 3: 160 ms per loop
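The chunked idea above can also be sketched with a fixed chunk size instead of a fixed number of chunks, which avoids skipping a remainder when the length is not a multiple of n. This is only a minimal sketch: the name arrays_equal_chunked and the default chunk size are assumptions for illustration, not part of NumPy.

```python
import numpy as np

def arrays_equal_chunked(a, b, chunk=65536):
    """Return True if a and b have the same shape and elements.

    Compares fixed-size chunks and stops at the first chunk that differs.
    """
    a = np.asarray(a)
    b = np.asarray(b)
    if a.shape != b.shape:
        return False
    # ravel() returns a view for contiguous arrays and a copy otherwise
    a_flat = a.ravel()
    b_flat = b.ravel()
    for start in range(0, a_flat.size, chunk):
        stop = start + chunk  # slicing past the end is safe in NumPy
        if not np.array_equal(a_flat[start:stop], b_flat[start:stop]):
            return False
    return True
```

The worst case (equal arrays, or a difference in the last chunk) still costs about as much as all(A==B); the gain only appears when a difference occurs early.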

Related

Fastest way to check if duplicates exist in a python list / numpy ndarray

I want to determine whether or not my list (actually a numpy.ndarray) contains duplicates in the fastest possible execution time. Note that I don't care about removing the duplicates, I simply want to know if there are any.
Note: I'd be extremely surprised if this is not a duplicate, but I've tried my best and can't find one. Closest are this question and this question, both of which are requesting that the unique list be returned.
Here are the four ways I thought of doing it.
TL;DR: if you expect very few (less than 1/1000) duplicates:
def contains_duplicates(X):
    return len(np.unique(X)) != len(X)
If you expect frequent (more than 1/1000) duplicates:
def contains_duplicates(X):
    seen = set()
    seen_add = seen.add
    for x in X:
        if (x in seen or seen_add(x)):
            return True
    return False
The first method is an early-exit version of this answer, which was written to return the unique values; the second is the same idea applied to this answer.
>>> import numpy as np
>>> X = np.random.normal(0,1,[10000])
>>> def terhorst_early_exit(X):
...:     elems = set()
...:     for i in X:
...:         if i in elems:
...:             return True
...:         elems.add(i)
...:     return False
>>> %timeit terhorst_early_exit(X)
100 loops, best of 3: 10.6 ms per loop
>>> def peterbe_early_exit(X):
...:     seen = set()
...:     seen_add = seen.add
...:     for x in X:
...:         if (x in seen or seen_add(x)):
...:             return True
...:     return False
>>> %timeit peterbe_early_exit(X)
100 loops, best of 3: 9.35 ms per loop
>>> %timeit len(set(X)) != len(X)
100 loops, best of 3: 4.54 ms per loop
>>> %timeit len(np.unique(X)) != len(X)
1000 loops, best of 3: 967 µs per loop
Do things change if you start with an ordinary Python list, and not a numpy.ndarray?
>>> X = X.tolist()
>>> %timeit terhorst_early_exit(X)
100 loops, best of 3: 9.34 ms per loop
>>> %timeit peterbe_early_exit(X)
100 loops, best of 3: 8.07 ms per loop
>>> %timeit len(set(X)) != len(X)
100 loops, best of 3: 3.09 ms per loop
>>> %timeit len(np.unique(X)) != len(X)
1000 loops, best of 3: 1.83 ms per loop
Edit: what if we have a prior expectation of the number of duplicates?
The above comparison is functioning under the assumption that a) there are likely to be no duplicates, or b) we're more worried about the worst case than the average case.
>>> X = np.random.normal(0, 1, [10000])
>>> for n_duplicates in [1, 10, 100]:
...:     print("{} duplicates".format(n_duplicates))
...:     duplicate_idx = np.random.choice(len(X), n_duplicates, replace=False)
...:     X[duplicate_idx] = 0
...:     print("terhost_early_exit")
...:     %timeit terhorst_early_exit(X)
...:     print("peterbe_early_exit")
...:     %timeit peterbe_early_exit(X)
...:     print("set length")
...:     %timeit len(set(X)) != len(X)
...:     print("numpy unique length")
...:     %timeit len(np.unique(X)) != len(X)
1 duplicates
terhost_early_exit
100 loops, best of 3: 12.3 ms per loop
peterbe_early_exit
100 loops, best of 3: 9.55 ms per loop
set length
100 loops, best of 3: 4.71 ms per loop
numpy unique length
1000 loops, best of 3: 1.31 ms per loop
10 duplicates
terhost_early_exit
1000 loops, best of 3: 1.81 ms per loop
peterbe_early_exit
1000 loops, best of 3: 1.47 ms per loop
set length
100 loops, best of 3: 5.44 ms per loop
numpy unique length
1000 loops, best of 3: 1.37 ms per loop
100 duplicates
terhost_early_exit
10000 loops, best of 3: 111 µs per loop
peterbe_early_exit
10000 loops, best of 3: 99 µs per loop
set length
100 loops, best of 3: 5.16 ms per loop
numpy unique length
1000 loops, best of 3: 1.19 ms per loop
So if you expect very few duplicates, the numpy.unique function is the way to go. As the number of expected duplicates increases, the early exit methods dominate.
Depending on how large your array is, and how likely duplicates are, the answer will be different.
For example, if you expect the average array to have around 3 duplicates, early exit will cut your average-case time (and space) by 2/3rds; if you expect only 1 in 1000 arrays to have any duplicates at all, it will just add a bit of complexity without improving anything.
Meanwhile, if the arrays are big enough that building a temporary set as large as the array is likely to be expensive, sticking a probabilistic test like a bloom filter in front of it will probably speed things up dramatically, but if not, it's again just wasted effort.
Finally, you want to stay within numpy if at all possible. Looping over an array of floats (or whatever) and boxing each one into a Python object is going to take almost as much time as hashing and checking the values, and of course storing things in a Python set instead of optimized numpy storage is wasteful as well. But you have to trade that off against the other issues: you can't do early exit with numpy, and there may be nice C-optimized bloom filter implementations a pip install away, but none that are numpy-friendly.
So, there's no one best solution for all possible scenarios.
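As one illustration of staying within numpy, a sort-based check can be sketched as follows. It is not taken from the answers above; the function name contains_duplicates_sorted is an assumption, and like np.unique it has no early exit.

```python
import numpy as np

def contains_duplicates_sorted(X):
    """Return True if any value occurs more than once in X."""
    # axis=None sorts a flattened copy, leaving the input untouched
    s = np.sort(np.asarray(X), axis=None)
    # after sorting, equal values sit next to each other
    return bool(np.any(s[1:] == s[:-1]))
```

This does the same sort that np.unique performs internally but skips building the array of unique values, so whether it is actually faster is something to measure on your own data.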
Just to give an idea of how easy it is to write a bloom filter, here's one I hacked together in a couple minutes:
from bitarray import bitarray # pip3 install bitarray

def dupcheck(X):
    # Hardcoded values to give about 5% false positives for 10000 elements
    size = 62352
    hashcount = 4
    bits = bitarray(size)
    bits.setall(0)
    def check(x, hash=hash): # TODO: default-value bits, hashcount, size?
        for i in range(hashcount):
            if not bits[hash((x, i)) % size]: return False
        return True
    def add(x):
        for i in range(hashcount):
            bits[hash((x, i)) % size] = True
    seen = set()
    seen_add = seen.add
    for x in X:
        if check(x) or add(x):
            if x in seen or seen_add(x):
                return True
    return False
This only uses 12KB (a 62352-bit bitarray plus a 500-float set) instead of 80KB (a 10000-float set or np.array). Which doesn't matter when you're only dealing with 10K elements, but with, say, 10B elements that use up more than half of your physical RAM, it would be a different story.
Of course it's almost certainly going to be an order of magnitude or so slower than using np.unique, or maybe even set, because we're doing all that slow looping in Python. But if this turns out to be worth doing, it should be a breeze to rewrite in Cython (and to directly access the numpy array without boxing and unboxing).
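Because a value only enters the exact set once the filter has already flagged it, a one-pass check of this kind can miss a value that occurs exactly twice. A two-pass variant keeps the small memory footprint while still giving an exact answer. The following is a rough sketch under stated assumptions: the input can be iterated twice, the name has_duplicates_two_pass is made up for illustration, and a plain bytearray (one byte per slot) stands in for the bitarray to avoid the extra dependency.

```python
def has_duplicates_two_pass(values, size=62352, hashcount=4):
    """Exact duplicate check using a Bloom filter as a prefilter."""
    bits = bytearray(size)   # one byte per slot, all zero initially
    candidates = set()       # values the filter reported as "maybe seen"

    # Pass 1: every repeated value is guaranteed to be flagged on its
    # second occurrence, because a Bloom filter has no false negatives.
    for x in values:
        slots = [hash((x, i)) % size for i in range(hashcount)]
        if all(bits[s] for s in slots):
            candidates.add(x)
        else:
            for s in slots:
                bits[s] = 1

    if not candidates:
        return False

    # Pass 2: count exact occurrences, but only for the candidates.
    counts = {}
    for x in values:
        if x in candidates:
            counts[x] = counts.get(x, 0) + 1
            if counts[x] > 1:
                return True
    return False
```

The candidate set holds only the true duplicates plus the filter's false positives, so it stays small when duplicates are rare. The price is a second loop over the data.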
My timing tests differ from Scott's for small lists. Using Python 3.7.3, set() is much faster than np.unique for a small numpy array from randint (length 8), but np.unique is faster for a larger array (length 1000).
Length 8
Timing test iterations: 10000
Function    Min         Avg Sec      Conclusion  p-value
----------  ----------  -----------  ----------  -------
set_len     0           7.73486e-06  Baseline
unique_len  9.644e-06   2.55573e-05  Slower      0
Length 1000
Timing test iterations: 10000
Function    Min         Avg Sec      Conclusion  p-value
----------  ----------  -----------  ----------  -------
set_len     0.00011066  0.000270466  Baseline
unique_len  4.3684e-05  8.95608e-05  Faster      0
Then I tried my own implementation, but I think it would require optimized C code to beat set:
def check_items(key_rand, **kwargs):
    for i, vali in enumerate(key_rand):
        for j in range(i+1, len(key_rand)):
            valj = key_rand[j]
            if vali == valj:
                break
Length 8
Timing test iterations: 10000
Function     Min         Avg Sec      Conclusion  p-value
-----------  ----------  -----------  ----------  -------
set_len      0           6.74221e-06  Baseline
unique_len   0           2.14604e-05  Slower      0
check_items  1.1138e-05  2.16369e-05  Slower      0
(using my randomized compare_time() function from easyinfo)

Code optimization python

I wrote the function below to estimate the orientation from a 3-axis accelerometer signal (X, Y, Z):
X.shape
Out[4]: (180000L,)
Y.shape
Out[4]: (180000L,)
Z.shape
Out[4]: (180000L,)
def estimate_orientation(self,X,Y,Z):
    sigIn=np.array([X,Y,Z]).T
    N=len(sigIn)
    sigOut=np.empty(shape=(N,3))
    sigOut[sigOut==0]=None
    i=0
    while i<N:
        sigOut[i,:] = np.arccos(sigIn[i,:]/np.linalg.norm(sigIn[i,:]))*180/math.pi
        i=i+1
    return sigOut
Executing this function with a signal of 180000 samples takes quite a while (~2.2 seconds)... I know that it is not written in a "pythonic way"... Could you help me to optimize the execution time?
Thanks!
Starting approach
One approach, using broadcasting, would be -
np.arccos(sigIn/np.linalg.norm(sigIn,axis=1,keepdims=1))*180/np.pi
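As a quick sanity check, the broadcast expression can be compared with the row-by-row loop from the question. The data here is synthetic and the variable names looped and vectorized are only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic (N, 3) signal; the offset keeps every row away from the zero vector
sigIn = rng.random((1000, 3)) + 0.1

# Row-by-row version, as in the question
looped = np.empty_like(sigIn)
for i in range(len(sigIn)):
    looped[i] = np.arccos(sigIn[i] / np.linalg.norm(sigIn[i])) * 180 / np.pi

# Broadcast version: keepdims=True yields an (N, 1) column of norms,
# which divides each row of the (N, 3) array
norms = np.linalg.norm(sigIn, axis=1, keepdims=True)
vectorized = np.arccos(sigIn / norms) * 180 / np.pi

print(np.allclose(looped, vectorized))
```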
Further optimization - I
We could use np.einsum to replace np.linalg.norm part. Thus :
np.linalg.norm(sigIn,axis=1,keepdims=1)
could be replaced by :
np.sqrt(np.einsum('ij,ij->i',sigIn,sigIn))[:,None]
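A small check that the einsum expression gives the same row norms as np.linalg.norm, using synthetic data (the names sig, norm_ref and norm_einsum are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
sig = rng.standard_normal((1000, 3))

# Reference: row-wise Euclidean norm, kept as an (N, 1) column
norm_ref = np.linalg.norm(sig, axis=1, keepdims=True)

# 'ij,ij->i' multiplies elementwise and sums over j, i.e. the squared
# norm of each row; [:, None] restores the (N, 1) column shape
norm_einsum = np.sqrt(np.einsum('ij,ij->i', sig, sig))[:, None]

print(np.allclose(norm_ref, norm_einsum))
```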
Further optimization - II
A further boost could come from the numexpr module, which works really well with huge arrays and with operations involving trigonometric functions. In our case that would be arccos. So, we will use the einsum part from the previous optimization section and then evaluate arccos with numexpr.
Thus, the implementation would look something like this -
import numexpr as ne
pi_val = np.pi
s = np.sqrt(np.einsum('ij,ij->i',sigIn,sigIn))[:,None]
out = ne.evaluate('arccos(sigIn/s)*180/pi_val')
Runtime test
Approaches -
def original_app(sigIn):
    N=len(sigIn)
    sigOut=np.empty(shape=(N,3))
    sigOut[sigOut==0]=None
    i=0
    while i<N:
        sigOut[i,:] = np.arccos(sigIn[i,:]/np.linalg.norm(sigIn[i,:]))*180/math.pi
        i=i+1
    return sigOut

def broadcasting_app(signIn):
    s = np.linalg.norm(signIn,axis=1,keepdims=1)
    return np.arccos(signIn/s)*180/np.pi

def einsum_app(signIn):
    s = np.sqrt(np.einsum('ij,ij->i',signIn,signIn))[:,None]
    return np.arccos(signIn/s)*180/np.pi

def numexpr_app(signIn):
    pi_val = np.pi
    s = np.sqrt(np.einsum('ij,ij->i',signIn,signIn))[:,None]
    return ne.evaluate('arccos(signIn/s)*180/pi_val')
Timings -
In [115]: a = np.random.rand(180000,3)
In [116]: %timeit original_app(a)
...: %timeit broadcasting_app(a)
...: %timeit einsum_app(a)
...: %timeit numexpr_app(a)
...:
1 loops, best of 3: 1.38 s per loop
100 loops, best of 3: 15.4 ms per loop
100 loops, best of 3: 13.3 ms per loop
100 loops, best of 3: 4.85 ms per loop
In [117]: 1380/4.85 # Speedup number
Out[117]: 284.5360824742268
280x speedup there!

Add multiple of a matrix without build a new one

Say I have two matrices B and M and I want to execute the following statement:
B += 3*M
I execute this instruction repeatedly, so I don't want to build the matrix 3*M each time (3 may change; it is just there to make clear that I only do a scalar-matrix product). Is there a numpy function that performs this computation "in place"?
More precisely, I have a list of scalars As and a list of matrices Ms, and I would like to perform the "dot product" (which is not really one, since the two operands are of different types) of the two, that is to say:
sum(a*M for a, M in zip(As, Ms))
The np.dot function does not do what I expect...
You can use np.tensordot -
np.tensordot(As,Ms,axes=(0,0))
Or np.einsum -
np.einsum('i,ijk->jk',As,Ms)
Sample run -
In [41]: As = [2,5,6]
In [42]: Ms = [np.random.rand(2,3),np.random.rand(2,3),np.random.rand(2,3)]
In [43]: sum(a*M for a, M in zip(As, Ms))
Out[43]:
array([[ 6.79630284, 5.04212877, 10.76217631],
[ 4.91927651, 1.98115548, 6.13705742]])
In [44]: np.tensordot(As,Ms,axes=(0,0))
Out[44]:
array([[ 6.79630284, 5.04212877, 10.76217631],
[ 4.91927651, 1.98115548, 6.13705742]])
In [45]: np.einsum('i,ijk->jk',As,Ms)
Out[45]:
array([[ 6.79630284, 5.04212877, 10.76217631],
[ 4.91927651, 1.98115548, 6.13705742]])
Another way you could do this, particularly if you favour readability, is to make use of broadcasting.
So you could make a 3D array from the 1D and 2D arrays and then sum over the appropriate axis:
>>> Ms = np.random.randn(4, 2, 3) # 4 arrays of size 2x3
>>> As = np.random.randn(4)
>>> np.sum(As[:, np.newaxis, np.newaxis] * Ms, axis=0)
array([[-1.40199248, -0.40337845, -0.69986566],
[ 3.52724279, 0.19547118, 2.1485559 ]])
>>> sum(a*M for a, M in zip(As, Ms))
array([[-1.40199248, -0.40337845, -0.69986566],
[ 3.52724279, 0.19547118, 2.1485559 ]])
However, it's worth noting that np.einsum and np.tensordot are usually much more efficient:
>>> %timeit np.sum(As[:, np.newaxis, np.newaxis] * Ms, axis=0)
The slowest run took 7.38 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.58 µs per loop
>>> %timeit np.einsum('i,ijk->jk', As, Ms)
The slowest run took 19.16 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.44 µs per loop
And this is also true for larger numbers:
>>> Ms = np.random.randn(100, 200, 300)
>>> As = np.random.randn(100)
>>> %timeit np.einsum('i,ijk->jk', As, Ms)
100 loops, best of 3: 5.03 ms per loop
>>> %timeit np.sum(As[:, np.newaxis, np.newaxis] * Ms, axis=0)
100 loops, best of 3: 14.8 ms per loop
>>> %timeit np.tensordot(As,Ms,axes=(0,0))
100 loops, best of 3: 2.79 ms per loop
So np.tensordot works best in this case.
The only good reason to use np.sum and broadcasting is to make the code a little more readable (helps when you have small matrices).
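The answers above compute the whole weighted sum in one call. For the repeated single update B += a*M from the question, one possible sketch reuses a preallocated scratch buffer through the out= argument of the ufuncs, so no new temporary is allocated per call. The helper name add_scaled_inplace and the scratch parameter are assumptions for illustration.

```python
import numpy as np

def add_scaled_inplace(B, M, a, scratch):
    """Compute B += a * M, writing the product into a reusable buffer."""
    np.multiply(M, a, out=scratch)  # scratch <- a * M, no new allocation
    np.add(B, scratch, out=B)       # B <- B + scratch, in place
    return B

B = np.ones((2, 3))
M = np.arange(6.0).reshape(2, 3)
scratch = np.empty_like(B)          # allocated once, reused for every update

for a in (3.0, 0.5):
    add_scaled_inplace(B, M, a, scratch)
```

This still needs one buffer the size of M, but it is created once rather than on every iteration. The scratch array must have a dtype that can hold a * M.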

Dictionaries with numpy - Can I use XY coordinates as a hash? [duplicate]

I need to be able to store a numpy array in a dict for caching purposes. Hash speed is important.
The array represents indices, so while the actual identity of the object is not important, the value is. Mutability is not a concern, as I'm only interested in the current value.
What should I hash in order to store it in a dict?
My current approach is to use str(arr.data), which is faster than md5 in my testing.
I've incorporated some examples from the answers to get an idea of relative times:
In [121]: %timeit hash(str(y))
10000 loops, best of 3: 68.7 us per loop
In [122]: %timeit hash(y.tostring())
1000000 loops, best of 3: 383 ns per loop
In [123]: %timeit hash(str(y.data))
1000000 loops, best of 3: 543 ns per loop
In [124]: %timeit y.flags.writeable = False ; hash(y.data)
1000000 loops, best of 3: 1.15 us per loop
In [125]: %timeit hash((b*y).sum())
100000 loops, best of 3: 8.12 us per loop
It would appear that for this particular use case (small arrays of indices), arr.tostring offers the best performance.
While hashing the read-only buffer is fast on its own, the overhead of setting the writeable flag actually makes it slower.
You can simply hash the underlying buffer, if you make it read-only:
>>> a = random.randint(10, 100, 100000)
>>> a.flags.writeable = False
>>> %timeit hash(a.data)
100 loops, best of 3: 2.01 ms per loop
>>> %timeit hash(a.tostring())
100 loops, best of 3: 2.28 ms per loop
For very large arrays, hash(str(a)) is a lot faster, but then it only takes a small part of the array into account.
>>> %timeit hash(str(a))
10000 loops, best of 3: 55.5 us per loop
>>> str(a)
'[63 30 33 ..., 96 25 60]'
You can try xxhash via its Python binding. For large arrays this is much faster than hash(x.tostring()).
Example IPython session:
>>> import xxhash
>>> import numpy
>>> x = numpy.random.rand(1024 * 1024 * 16)
>>> h = xxhash.xxh64()
>>> %timeit hash(x.tostring())
1 loops, best of 3: 208 ms per loop
>>> %timeit h.update(x); h.intdigest(); h.reset()
100 loops, best of 3: 10.2 ms per loop
And by the way, on various blogs and answers posted to Stack Overflow, you'll see people using sha1 or md5 as hash functions. For performance reasons this is usually not acceptable, as those "secure" hash functions are rather slow. They're useful only if hash collision is one of the top concerns.
Nevertheless, hash collisions happen all the time. And if all you need is implementing __hash__ for data-array objects so that they can be used as keys in Python dictionaries or sets, I think it's better to concentrate on the speed of __hash__ itself and let Python handle the hash collision[1].
[1] You may need to override __eq__ too, to help Python manage hash collision. You would want __eq__ to return a boolean, rather than an array of booleans as is done by numpy.
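The footnote can be illustrated with a small wrapper class. This is only a sketch: the class name HashableArray is hypothetical, and hashing the full byte content is just one possible choice for __hash__.

```python
import numpy as np

class HashableArray:
    """Wrap an array so it can be used as a dict key or set member."""

    def __init__(self, arr):
        # Keep a private read-only copy so the hash cannot go stale
        self._arr = np.array(arr, copy=True)
        self._arr.flags.writeable = False
        self._hash = hash(
            (self._arr.shape, self._arr.dtype.str, self._arr.tobytes())
        )

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        if not isinstance(other, HashableArray):
            return NotImplemented
        # Return a single bool, not an elementwise array. The dtype check
        # keeps __eq__ consistent with __hash__.
        return bool(
            self._arr.dtype == other._arr.dtype
            and np.array_equal(self._arr, other._arr)
        )

cache = {HashableArray([1, 2, 3]): "cached value"}
print(cache[HashableArray([1, 2, 3])])
```

Python calls __eq__ only when two keys land in the same hash bucket, which is how it resolves collisions on its own.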
Coming late to the party, but for large arrays, I think a decent way to do it is to randomly subsample the matrix and hash that sample:
def subsample_hash(a):
    rng = np.random.RandomState(89)
    inds = rng.randint(low=0, high=a.size, size=1000)
    b = a.flat[inds]
    b.flags.writeable = False
    return hash(b.data)
I think this is better than doing hash(str(a)), because the latter could confuse arrays that have unique data in the middle but zeros around the edges.
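On Python 3, hashing a memoryview is restricted to byte-oriented formats, so hash(b.data) may raise a ValueError for integer or float arrays. A variant that hashes the sampled bytes instead might look like this. The name subsample_hash_bytes and the n_samples and seed parameters are assumptions for illustration; the fixed seed of 89 is carried over from the snippet above.

```python
import numpy as np

def subsample_hash_bytes(a, n_samples=1000, seed=89):
    """Hash a fixed pseudo-random sample of a non-empty array's elements."""
    a = np.asarray(a)
    # A fixed seed gives the same sample positions on every call, so
    # arrays of the same size with equal contents hash identically
    rng = np.random.RandomState(seed)
    inds = rng.randint(low=0, high=a.size, size=n_samples)
    # Fancy indexing on .flat returns a new 1-D array of the samples
    return hash(a.flat[inds].tobytes())
```

As with any subsampled hash, two arrays that differ only at unsampled positions get the same hash, so the dict still has to fall back on an exact comparison.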
If your np.array() is small and in a tight loop, then one option is to skip hash() completely and just use np.array().data.tobytes() directly as your dict key:
grid = np.array([[True, False, True],[False, False, True]])
key = grid.data.tobytes()  # the raw bytes themselves serve as the dict key
cache = cache or {}
if key not in cache:
    cache[key] = function(grid)
return cache[key]
What kind of data do you have?
- How large is the array?
- Can an index occur several times in the array?
If your array consists only of a permutation of indices, you can use a base conversion
(1, 0, 2) -> 1 * 3**0 + 0 * 3**1 + 2 * 3**2 = 19 (201 in base 3)
and use 19 as the hash key via
import numpy as num
base_size = 3
base = base_size ** num.arange(base_size)
max_base = (base * num.arange(base_size)).sum()
array = num.array([1, 0, 2])  # example permutation
hashed_array = (base * array).sum()
Now you can use an array (shape=(max_base + 1, )) instead of a dict in order to access the values.
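A minimal sketch of this lookup-table idea follows. The names encode and table are illustrative, and the table is sized base_size ** base_size, which is large enough for any index array of length base_size with entries below base_size.

```python
import numpy as np

base_size = 3
# Positional weights 3**0, 3**1, 3**2 -> [1, 3, 9]
weights = base_size ** np.arange(base_size)

def encode(indices):
    """Map an index array such as (1, 0, 2) to a single integer."""
    return int(np.dot(weights, indices))

# Every possible code is below base_size ** base_size, so a flat array
# can replace the dict
table = np.zeros(base_size ** base_size, dtype=np.int64)

table[encode((1, 0, 2))] = 42       # store a value under the key (1, 0, 2)
value = table[encode((1, 0, 2))]    # look it up again
```

The table grows as base_size ** base_size, so this only makes sense for short index arrays.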
