Comparing multiple numpy arrays - Python

How should I compare more than 2 numpy arrays?
import numpy
a = numpy.zeros((512,512,3),dtype=numpy.uint8)
b = numpy.zeros((512,512,3),dtype=numpy.uint8)
c = numpy.zeros((512,512,3),dtype=numpy.uint8)
if (a==b==c).all():
    pass
This gives a ValueError, and I am not interested in comparing the arrays two at a time.

For three arrays, you can check for elementwise equality between the first and second arrays and then between the second and third arrays, giving two boolean scalars, and finally check whether both of these scalars are True for the final scalar output, like so -
np.logical_and( (a==b).all(), (b==c).all() )
For a larger number of arrays, you could stack them, take the difference along the stacking axis, and check whether all of those differences are zero. If they are, we have equality among all input arrays; otherwise not. The implementation would look like so -
L = [a,b,c] # List of input arrays
out = (np.diff(np.vstack(L).reshape(len(L),-1),axis=0)==0).all()
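A quick sanity check of this stacking approach, as a minimal sketch with small arrays (the names here are illustrative):
import numpy as np
a = np.zeros((2, 2))
b = np.zeros((2, 2))
c = np.ones((2, 2))
L = [a, b, a]
print((np.diff(np.vstack(L).reshape(len(L), -1), axis=0) == 0).all())  # True
L = [a, b, c]
print((np.diff(np.vstack(L).reshape(len(L), -1), axis=0) == 0).all())  # False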

For three arrays, you should really just compare them two at a time:
if np.array_equal(a, b) and np.array_equal(b, c):
    do_whatever()
For a variable number of arrays, let's suppose they're all combined into one big array arrays. Then you could do
if np.all(arrays[:-1] == arrays[1:]):
    do_whatever()
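If the arrays start out separate, one way to combine them into that one big array is np.stack (a minimal sketch, assuming they all share the same shape):
import numpy as np
a = np.zeros((512, 512, 3), dtype=np.uint8)
b = np.zeros((512, 512, 3), dtype=np.uint8)
c = np.zeros((512, 512, 3), dtype=np.uint8)
arrays = np.stack([a, b, c])           # shape (3, 512, 512, 3)
if np.all(arrays[:-1] == arrays[1:]):  # compares each array with the next
    print("all equal")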

To expand on previous answers, I would use combinations from itertools to construct all pairs, then run your comparison on each pair. For example, if I have three arrays and want to confirm that they're all equal, I'd use:
from itertools import combinations
for pair in combinations([a, b, c], 2):
    assert np.array_equal(pair[0], pair[1])
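If you would rather get a boolean back than raise an AssertionError, the same idea wraps neatly in all() (a sketch; all_pairs_equal is an illustrative name):
from itertools import combinations
import numpy as np
def all_pairs_equal(arrays):
    # True only if every pair of arrays is elementwise equal
    return all(np.array_equal(x, y) for x, y in combinations(arrays, 2))
Note that this checks all n*(n-1)/2 pairs; for plain equality, comparing consecutive pairs as in the previous answer is sufficient.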

solution supporting different shapes and NaNs
compare against the first element of the array list:
import numpy as np
a = np.arange(3)
b = np.arange(3)
c = np.arange(3)
d = np.arange(4)
lst_eq = [a, b, c]
lst_neq = [a, b, d]
def all_equal(lst):
    for arr in lst[1:]:
        if not np.array_equal(lst[0], arr, equal_nan=True):
            return False
    return True
print('all_equal(lst_eq)=', all_equal(lst_eq))
print('all_equal(lst_neq)=', all_equal(lst_neq))
output
all_equal(lst_eq)= True
all_equal(lst_neq)= False
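Because equal_nan=True is passed to np.array_equal, arrays containing NaNs compare equal position by position (a small illustration):
e = np.array([0.0, np.nan])
f = np.array([0.0, np.nan])
print(all_equal([e, f]))  # True: equal_nan=True treats NaN == NaN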
for equal shapes and without NaN support
Combine everything into one array, compute the absolute difference along the new axis, and check whether the maximum element along that axis is zero or below some threshold. This should be quite fast.
import numpy as np
a = np.arange(3)
b = np.arange(3)
c = np.arange(3)
d = np.array([0, 1, 3])
lst_eq = [a, b, c]
lst_neq = [a, b, d]
def all_equal(lst, threshold=0):
    arr = np.stack(lst, axis=0)
    return np.max(np.abs(np.diff(arr, axis=0))) <= threshold
print('all_equal(lst_eq)=', all_equal(lst_eq))
print('all_equal(lst_neq)=', all_equal(lst_neq))
output
all_equal(lst_eq)= True
all_equal(lst_neq)= False
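The threshold parameter also makes this useful for approximate float comparison (a small sketch):
g = np.array([0.0, 1.0, 2.0])
h = g + 1e-9  # tiny float noise
print(all_equal([g, h]))                  # False with the exact default threshold
print(all_equal([g, h], threshold=1e-6))  # True within the tolerance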

This might work.
import numpy as np
x = np.random.rand(10)
arrays = [x for _ in range(10)]
print(np.allclose(arrays[:-1], arrays[1:])) # True
arrays.append(np.random.rand(10))
print(np.allclose(arrays[:-1], arrays[1:])) # False

one-liner solution:
arrays = [a, b, c]
all([np.array_equal(a, b) for a, b in zip(arrays, arrays[1:])])
We test the equality of consecutive pairs of arrays

Related

Python: einsum inside for loop

Suppose A and B are two 4 dimensional numpy arrays with the same dimension.
import numpy as np

A = np.random.rand(5,5,2,10)
B = np.random.rand(5,5,2,10)
a, b, c, d = A.shape
dat = []
for k in range(d):
    sum = 0
    for l in range(c):
        sum = sum + np.einsum('ij,ji->', A[:,:,l,k], B[:,:,l,k])
    dat.append(sum)
I was wondering whether I can use einsum to replace the inner for loop, maybe even the outer one, or perhaps some matrix manipulation to replace all of it, because the data set is large.
Is there any faster way to achieve this?
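For reference, the nested loops compute, for each k, the sum over i, j and l of A[i,j,l,k] * B[j,i,l,k], and that contraction maps onto a single einsum call (a sketch, assuming the shapes above; dat_vec is an illustrative name):
# Collapse both loops: contract i against j (transposed), sum over l, keep k.
dat_vec = np.einsum('ijlk,jilk->k', A, B)
assert np.allclose(dat_vec, dat)  # matches the loop result above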

Selecting numpy columns based on values in a row

Suppose I have a numpy array with 2 rows and 10 columns. I want to select columns with even values in the first row. The outcome I want can be obtained as follows:
import numpy as np

a = list(range(10))
b = list(reversed(range(10)))
c = np.concatenate([a, b]).reshape(2, 10).T
c[c[:, 0] % 2 == 0].T
However, this method transposes twice, and I don't think it's very pythonic. Is there a cleaner way to do the same job?
Numpy allows you to select along each dimension separately. You pass in a tuple of indices whose length is the number of dimensions.
Say your array is
a = np.random.randint(10, size=(2, 10))
The even elements in the first row are given by the mask
m = (a[0, :] % 2 == 0)
You can use a[0] to get the first row instead of a[0, :] because missing indices are synonymous with the slice : (take everything).
Now you can apply the mask to just the second dimension:
result = a[:, m]
You can also convert the mask to indices first. There are subtle differences between the two approaches, which you won't see in this simple case. The biggest difference is usually that linear indices are a little faster, especially if applied more than once:
i = np.flatnonzero(m)
result = a[:, i]

element-wise matrix multiplication (Hadamard product) using numpy

So suppose I have two numpy ndarrays whose elements are matrices. I need element-wise multiplication for these two arrays; however, there should be matrix multiplication between the two matrix elements. Of course I would be able to implement this with for loops, but I was looking to solve this problem without an explicit for loop. How do I implement this?
EDIT: This for loop does what I want to do. I'm on Python 2.7.
import numpy as np

n = np.arange(8).reshape(2,2,1,2)
l = np.arange(1,9).reshape(2,2,2,1)
k = np.zeros((2,2))
for i in range(len(n)):
    for j in range(len(n[i])):
        k[i][j] = np.asscalar(n[i][j].dot(l[i][j]))
print k
Assuming your arrays of matrices are given as n+2 dimensional arrays A and B, what you want to achieve is as simple as C = A @ B.
Example
import numpy as np

outer_dims = 2,3,4
inner_dims = 4,5,6
A = np.random.randint(0,10,(*outer_dims, *inner_dims[:2]))
B = np.random.randint(0,10,(*outer_dims, *inner_dims[1:]))
C = A @ B
# check
for I in np.ndindex(outer_dims):
    assert (C[I] == A[I] @ B[I]).all()
UPDATE: Py2 version; thanks @hpaulj, @Divakar
A = np.random.randint(0,10, outer_dims + inner_dims[:2])
B = np.random.randint(0,10, outer_dims + inner_dims[1:])
C = np.matmul(A,B)
# check
for I in np.ndindex(outer_dims):
    assert (C[I] == np.matmul(A[I],B[I])).all()
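Applied to the arrays from the question, the loop result k can be reproduced without any explicit loop (a sketch; the trailing (1, 1) matrix dimensions are squeezed away by indexing):
import numpy as np
n = np.arange(8).reshape(2, 2, 1, 2)
l = np.arange(1, 9).reshape(2, 2, 2, 1)
k = np.matmul(n, l)[..., 0, 0]  # each (1,2) @ (2,1) product collapses to a scalar
print(k)  # [[ 2 18]
          #  [50 98]]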
If I understand correctly, this might work:
import numpy as np
a = np.array([[1,1],[1,0]])
b = np.array([[3,4],[5,4]])
x = np.array([[a,b],[b,a]])
y = np.array([[a,a],[b,b]])
result = np.array([_x @ _y for _x, _y in zip(x,y)])

Delete specific values in 2-Dimension array - Numpy

import numpy as np
I have two arrays of size n (to simplify, I use n = 2 in this example):
A = np.array([[1,2,3],[1,2,3]])
B is two-dimensional and holds, n times, a random integer: 1, 2 or 3.
Let's pretend:
B = np.array([[1],[3]])
What is the most pythonic way to subtract B from A in order to obtain C = np.array([[2,3],[1,2]])?
I tried to use np.subtract, but due to the broadcasting rules I do not obtain C. I do not want to use masks or indices but the elements' values. I also tried np.delete and np.where without success.
Thank you.
This might work and should be quite Pythonic:
dd = [[val for val in A[i] if val not in B[i]] for i in range(len(A))]
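With the arrays from the question, this produces the desired C as nested lists (a quick check):
import numpy as np
A = np.array([[1, 2, 3], [1, 2, 3]])
B = np.array([[1], [3]])
dd = [[int(val) for val in A[i] if val not in B[i]] for i in range(len(A))]
print(dd)  # [[2, 3], [1, 2]]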

Python equivalent of MATLAB's "ismember" function

After many attempts at optimizing this code, it seems that one last resort would be to run it using multiple cores. I don't know exactly how to convert/restructure my code so that it runs much faster using multiple cores. I would appreciate any guidance toward the end goal, which is to run this code as fast as possible for arrays A and B where each array holds about 700,000 elements. Here is the code using small arrays; the 700k-element arrays are commented out.
import numpy as np
def ismember(a,b):
    for i in a:
        index = np.where(b==i)[0]
        if index.size == 0:
            yield 0
        else:
            yield index

def f(A, gen_obj):
    my_array = np.arange(len(A))
    for i in my_array:
        my_array[i] = gen_obj.next()
    return my_array

#A = np.arange(700000)
#B = np.arange(700000)
A = np.array([3,4,4,3,6])
B = np.array([2,5,2,6,3])
gen_obj = ismember(A,B)
f(A, gen_obj)
print 'done'
# if we print f(A, gen_obj) the output will be: [4 0 0 4 3]
# notice that the output array needs to be kept the same size as array A.
What I am trying to do is mimic the MATLAB function ismember (the one formatted as [Lia,Locb] = ismember(A,B)). I am just trying to get the Locb part.
From MATLAB: Locb contains the lowest index in B for each value in A that is a member of B. The output array Locb contains 0 wherever A is not a member of B.
One of the main problems is that I need to be able to perform this operation as efficiently as possible. For testing, I have two arrays of 700k elements each. Creating a generator and stepping through its values doesn't seem to get the job done fast.
Before worrying about multiple cores, I would eliminate the linear scan in your ismember function by using a dictionary:
def ismember(a, b):
    bind = {}
    for i, elt in enumerate(b):
        if elt not in bind:
            bind[elt] = i
    return [bind.get(itm, None) for itm in a]  # None can be replaced by any other "not in b" value
Your original implementation requires a full scan of the elements in B for each element in A, making it O(len(A)*len(B)). The above code requires one full scan of B to build the dict bind. By using a dict, the lookup for each element of A is effectively constant time, making the operation O(len(A)+len(B)). If this is still too slow, then worry about making the above function run on multiple cores.
Edit: I've also modified your indexing slightly. MATLAB uses 0 because all of its arrays start at index 1. Python/numpy arrays start at 0, so if your data set looks like this
A = [2378, 2378, 2378, 2378]
B = [2378, 2379]
and you return 0 for "no element", then your results will exclude all elements of A. The above routine returns None for a missing index instead of 0. Returning -1 is an option, but Python will interpret that as the last element of the array, whereas None will raise an exception if it's used as an index into the array. If you'd like different behavior, change the second argument of bind.get(itm, None) to the value you want returned.
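With the question's sample arrays, the dict-based version returns indices into B, or None where there is no match (a quick illustration):
A = [3, 4, 4, 3, 6]
B = [2, 5, 2, 6, 3]
print(ismember(A, B))  # [4, None, None, 4, 3]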
sfstewman's excellent answer most likely solved the issue for you.
I'd just like to add how you can achieve the same exclusively in numpy.
I make use of numpy's unique and in1d functions.
B_unique_sorted, B_idx = np.unique(B, return_index=True)
B_in_A_bool = np.in1d(B_unique_sorted, A, assume_unique=True)
B_unique_sorted contains the unique values in B, sorted. B_idx holds, for these values, the indices into the original B. B_in_A_bool is a boolean array the size of B_unique_sorted that stores whether a value in B_unique_sorted is in A.
Note: I need to look up the (unique values from B) in A because the output must be returned with respect to B_idx.
Note: I assume that A is already unique.
Now you can use B_in_A_bool to get the common values
B_unique_sorted[B_in_A_bool]
and their respective indices in the original B
B_idx[B_in_A_bool]
Finally, I assume that this is significantly faster than the pure Python for-loop although I didn't test it.
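Run on the question's B (and a unique version of A, since this answer assumes A is unique), the pieces look like this (a quick illustration):
import numpy as np
A = np.array([3, 4, 6])  # unique version of the question's A
B = np.array([2, 5, 2, 6, 3])
B_unique_sorted, B_idx = np.unique(B, return_index=True)
B_in_A_bool = np.in1d(B_unique_sorted, A, assume_unique=True)
print(B_unique_sorted[B_in_A_bool])  # [3 6]  the common values
print(B_idx[B_in_A_bool])            # [4 3]  their indices in the original B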
Try the ismember library.
pip install ismember
Simple example:
# Import library
from ismember import ismember
import numpy as np
# data
A = np.array([3,4,4,3,6])
B = np.array([2,5,2,6,3])
# Lookup
Iloc,idx = ismember(A, B)
# Iloc is a boolean array marking which elements of A occur in B
print(Iloc)
# [ True False False True True]
# idx holds the indices into B for the elements of A found in B
print(idx)
# [4 4 3]
print(B[idx])
# [3 3 6]
print(A[Iloc])
# [3 3 6]
# These vectors will match
A[Iloc]==B[idx]
Speed check:
from ismember import ismember
from datetime import datetime
import numpy as np

# Dict-based reference implementation (from the accepted answer above)
def ismember_stack(a, b):
    bind = {}
    for i, elt in enumerate(b):
        if elt not in bind:
            bind[elt] = i
    return [bind.get(itm, None) for itm in a]  # None can be replaced by any other "not in b" value

t1 = []
t2 = []
# Create some random vectors
ns = np.random.randint(10, 10000, 1000)
for n in ns:
    a_vec = np.random.randint(0, 100, n)
    b_vec = np.random.randint(0, 100, n)
    # Run the dict-based version
    start = datetime.now()
    out1 = ismember_stack(a_vec, b_vec)
    end = datetime.now()
    t1.append(end - start)
    # Run ismember
    start = datetime.now()
    out2 = ismember(a_vec, b_vec)
    end = datetime.now()
    t2.append(end - start)
print(np.sum(t1))
# 0:00:07.778331
print(np.sum(t2))
# 0:00:04.609801
The ismember function from pypi is almost 2x faster.
Large vectors, e.g. 700,000 elements:
from ismember import ismember
from datetime import datetime
import numpy as np
A = np.random.randint(0,100,700000)
B = np.random.randint(0,100,700000)
# Lookup
start = datetime.now()
Iloc,idx = ismember(A, B)
end = datetime.now()
# Print time
print(end-start)
# 0:00:01.194801
Try using a list comprehension:
In [1]: import numpy as np
In [2]: A = np.array([3,4,4,3,6])
In [3]: B = np.array([2,5,2,6,3])
In [4]: [x for x in A if x not in B]
Out[4]: [4, 4]
Generally, list comprehensions are faster than explicit for loops.
To get an equal-length list:
In [19]: map(lambda x: x if x not in B else False, A)
Out[19]: [False, 4, 4, False, False]
This is quite fast for small datasets:
In [20]: C = np.arange(10000)
In [21]: D = np.arange(15000, 25000)
In [22]: %timeit map(lambda x: x if x not in D else False, C)
1 loops, best of 3: 756 ms per loop
For large datasets, you could try using a multiprocessing.Pool.map() to speed up the operation.
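A minimal sketch of that multiprocessing idea (assuming Python 3; Pool.map needs a picklable, module-level function rather than a lambda, and locate is an illustrative name):
import numpy as np
from multiprocessing import Pool
D = np.arange(15000, 25000)
def locate(x):
    # Same logic as the lambda above: keep x if it is absent from D, else False
    return x if x not in D else False
if __name__ == "__main__":
    C = np.arange(10000)
    with Pool() as pool:
        result = pool.map(locate, C)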
Here is an exact MATLAB equivalent that returns both output arguments [Lia, Locb], matching MATLAB except that in Python 0 is also a valid index. So this function doesn't return the 0s; it essentially returns Locb(Locb>0). The performance is also equivalent to MATLAB.
def ismember(a_vec, b_vec):
    """ MATLAB equivalent ismember function """
    bool_ind = np.isin(a_vec, b_vec)
    common = a_vec[bool_ind]
    common_unique, common_inv = np.unique(common, return_inverse=True)  # common = common_unique[common_inv]
    b_unique, b_ind = np.unique(b_vec, return_index=True)  # b_unique = b_vec[b_ind]
    common_ind = b_ind[np.isin(b_unique, common_unique, assume_unique=True)]
    return bool_ind, common_ind[common_inv]
An alternate implementation that is a bit (~5x) slower but doesn't use the unique function is here:
def ismember(a_vec, b_vec):
    ''' MATLAB equivalent ismember function. Slower than the above implementation. '''
    # Build the value -> index map in reverse so that, for duplicate values,
    # the lowest index wins, matching MATLAB's "lowest index in B" behavior.
    b_dict = {b_vec[i]: i for i in range(len(b_vec) - 1, -1, -1)}
    indices = [b_dict.get(x) for x in a_vec if b_dict.get(x) is not None]
    booleans = np.in1d(a_vec, b_vec)
    return booleans, np.array(indices, dtype=int)
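Called on the question's sample arrays, the first version reproduces MATLAB's outputs minus the zeros (a quick check):
import numpy as np
A = np.array([3, 4, 4, 3, 6])
B = np.array([2, 5, 2, 6, 3])
Lia, Locb = ismember(A, B)
print(Lia)   # [ True False False  True  True]
print(Locb)  # [4 4 3]  lowest index in B for each matching element of A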
