Python: equivalent of Matlab's ismember on rows for large arrays

I can't find an efficient way to reproduce Matlab's "ismember(a, b, 'rows')" in Python, where a and b are arrays of size (ma, 2) and (mb, 2) respectively and ma, mb are the numbers of pairs.
The ismember module (https://pypi.org/project/ismember/) crashes because at some point, i.e. when doing np.all(a[:, None] == b, axis=2).any(axis=1), it needs to create an array of size (ma, mb, 2), which is too big. Moreover, even when the function works (because the arrays are small enough), it is about 100 times slower than in Matlab. I guess that is because Matlab uses a built-in mex function. Why doesn't Python have what I would think to be such an important function? I use it countless times in my calculations...
PS: the solution proposed here, Python version of ismember with 'rows' and index, does not correspond to Matlab's true ismember function since it does not work row by row, i.e. it does not verify that a pair of values of 'a' exists in 'b', but only whether the values of each column of 'a' exist in the corresponding column of 'b'.

You can use np.unique(array, axis=0) to find the identical rows of an array. With this function you can reduce your 2D problem to a 1D problem, which can easily be solved with np.isin():
import numpy as np

# Dummy example arrays:
a = np.array([[1,2],[3,4]])
b = np.array([[3,5],[2,3],[3,4]])

# ismember_row function: which rows of a are in b?
def ismember_row(a, b):
    # Get the unique row index
    _, rev = np.unique(np.concatenate((b, a)), axis=0, return_inverse=True)
    # Split the index
    a_rev = rev[len(b):]
    b_rev = rev[:len(b)]
    # Return the result:
    return np.isin(a_rev, b_rev)

res = ismember_row(a, b)
# res = array([False,  True])
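Matlab's ismember(a, b, 'rows') also returns a second output with the location of each match. The answer above only gives the boolean part; the following is just a sketch of one way to get the locations as well (the helper name and the -1 placeholder for "no match" are my own choices, not part of the answer):

import numpy as np

def ismember_rows_loc(a, b):
    # Map each row of b (as a tuple) to the index of its first occurrence
    lookup = {}
    for i, row in enumerate(map(tuple, b)):
        lookup.setdefault(row, i)
    # tf[k]: does row k of a appear in b?  loc[k]: index of the first match in b, or -1
    tf = np.array([tuple(row) in lookup for row in a])
    loc = np.array([lookup.get(tuple(row), -1) for row in a])
    return tf, loc

tf, loc = ismember_rows_loc(a, b)
# tf = array([False,  True]), loc = array([-1,  2])

It loops in plain Python rather than in NumPy, so the membership test itself is slower than the np.unique approach, but it never builds the (ma, mb, 2) broadcast array.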

Related

Is there any Python/numpy equivalent to Matlab's [row,col] = find(X)

I have this code in Matlab
[r,c] = find (X)
According to Matlab's doc, this returns the row and column subscripts of each nonzero element in array X.
I need to do the same in Python, and I've found that numpy's np.nonzero(X) does something similar, but it only returns the values in c.
How do I get also the values in r?
Some code I tried:
x = np.array([1,0,3,0,5])
If I do r,c = np.nonzero(x)
I get ValueError: need more than 1 value to unpack
I want to obtain r = [0,0,0] and c = [0,2,4]
I found out the issue by myself.
Doing rc = np.nonzero(x) gives only one array of indices because of the way x is declared.
That way x is a 1d array. If x is instead declared as a 2d array like
x = np.array([[1,0,3,0,5]])
then rc is going to store both the row and column indices of each element satisfying the condition (being non-zero in this case).
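To illustrate with the example from the question (a quick sketch, not part of the original answer):

import numpy as np

x = np.array([[1, 0, 3, 0, 5]])   # note the extra brackets: x now has shape (1, 5)
r, c = np.nonzero(x)              # one index array per dimension
# r = array([0, 0, 0]), c = array([0, 2, 4])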

Python alternative for MATLAB code 'min(Ar_1(Ar_1~=0))'

I want to achieve, with the least complexity in Python, the same result as min(Ar(Ar~=0)) in MATLAB, where Ar is a 2D numpy array.
For those who are not familiar with MATLAB, ~= means != or not equal to.
Is there a function in Python which returns the indexes of the elements:
1. whose values fulfill a condition (elements which are != 0 in this case), and
2. which can directly be used as an index into another array? (The result of Ar~=0 is being used as an index, as in Ar(Ar~=0).)
Here Ar~=0 has been used as an index into Ar, and then the minimum of Ar(Ar~=0) is found. In other words, the minimum value of the array is found, excluding the elements whose value is 0.
The Python syntax for a numpy array A would be:
A[A != 0].min()
You can also set array elements, just as an example of setting a cutoff:
B = A.copy()
B[A == 0] = A[A != 0].min()
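If you also want the index positions themselves rather than a boolean mask (the question asks for indexes), np.nonzero or np.argwhere will return them. A small sketch, assuming a 2D array Ar:

import numpy as np

Ar = np.array([[0, 2, 0],
               [5, 0, 1]])

rows, cols = np.nonzero(Ar)      # index arrays, usable as Ar[rows, cols]
print(Ar[rows, cols].min())      # 1, the minimum over the non-zero elements
print(np.argwhere(Ar != 0))      # the same indices as an (n, 2) array of (row, col) pairs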

How to concatenate an empty array with Numpy.concatenate?

I need to create an array of a specific size mxn filled with empty values so that when I concatenate to that array the initial values will be overwritten with the added values.
My current code:
a = numpy.empty([2,2]) # Make empty 2x2 matrix
b = numpy.array([[1,2],[3,4]]) # Make example 2x2 matrix
myArray = numpy.concatenate((a,b)) # Combine empty and example arrays
Unfortunately, I end up making a 4x2 matrix instead of a 2x2 matrix with the values of b.
Is there any way to make an actually empty array of a certain size so that when I concatenate to it, its values become my added values instead of the default + added values?
Like Oniow said, concatenate does exactly what you saw.
If you want 'default values' that will differ from regular scalar elements, I would suggest initializing your array with NaNs (as your 'default value'). If I understand your question, you want to merge matrices so that regular scalars will override your 'default value' elements.
Anyway, I suggest adding the following:
import numpy as np

def get_default(size_x, size_y):
    # Returns a new matrix filled with 'default values' (NaN)
    tmp = np.empty([size_x, size_y])
    tmp.fill(np.nan)
    return tmp

And also:

def merge(a, b):
    # Take the value from a unless it is still the 'default value' (NaN),
    # in which case take the value from b
    l = lambda x, y: y if np.isnan(x) else x
    l = np.vectorize(l)
    return l(a, b)
Note that if you merge 2 matrices and both values are non-'default', it will take the value from the left matrix.
Using NaN as the default value gives the expected behavior for a default value; for example, all math ops on it result in 'default', since this value indicates that you don't really care about this index in the matrix.
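A fully vectorized alternative to the np.vectorize helper above (my own variation, not part of the original answer) is to combine np.where with np.isnan directly:

import numpy as np

a = np.full((2, 2), np.nan)           # the 'default' matrix
b = np.array([[1.0, 2.0], [3.0, 4.0]])

merged = np.where(np.isnan(a), b, a)  # take b where a still holds the default
# merged equals b here, since every element of a was NaN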
If I understand your question correctly - concatenate is not what you are looking for. Concatenate does as you saw: joins along an axis.
If you are trying to have an empty matrix that becomes the values of another you could do the following:
import numpy as np
a = np.zeros([2,2])
b = np.array([[1,2],[3,4]])
my_array = a + b
--or--
import numpy as np
my_array = np.zeros([2,2]) # you can use empty here instead in this case.
my_array[0,0] = float(input('Enter first value: ')) # However you get your data to put them into the arrays.
But, I am guessing that is not what you really want as you could just use my_array = b. If you edit your question with more info I may be able to help more.
If you are worried about values adding over time to your array...
import numpy as np

a = np.zeros([2, 2])
my_array = b      # b is some other 2x2 matrix
''' Do stuff '''
# ... new_b appears: a new array with the latest values
my_array = new_b  # update your array to these new values; values will not add.
# Note: if you make changes to my_array, those changes will carry over to new_b.
# To avoid this, at the cost of some memory:
my_array = np.copy(new_b)

Efficiently convert a vector of bin counts to a vector of bin indices [duplicate]

Given an array of integer counts c, how can I transform that into an array of integers inds such that np.all(np.bincount(inds) == c) is true?
For example:
>>> c = np.array([1,3,2,2])
>>> inverse_bincount(c) # <-- what I need
array([0,1,1,1,2,2,3,3])
Context: I'm trying to keep track of the location of multiple sets of data, while performing computation on all of them at once. I concatenate all the data together for batch processing, but I need an index array to extract the results back out.
Current workaround:
from itertools import chain
import numpy as np

def inverse_bincount(c):
    return np.array(list(chain.from_iterable([i] * n for i, n in enumerate(c))))
Using numpy.repeat:
np.repeat(np.arange(c.size), c)
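For the example in the question this gives (a quick check of the one-liner above):

import numpy as np

c = np.array([1, 3, 2, 2])
inds = np.repeat(np.arange(c.size), c)
# inds = array([0, 1, 1, 1, 2, 2, 3, 3])
print(np.all(np.bincount(inds) == c))   # True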
No numpy needed:
from functools import reduce  # needed on Python 3; reduce is a builtin in Python 2

c = [1, 3, 2, 2]
reduce(lambda x, y: x + [y] * c[y], range(len(c)), [])
The following is about twice as fast on my machine than the currently accepted answer; although I must say I am surprised by how well np.repeat does. I would expect it to suffer a lot from temporary object creation, but it does pretty well.
import numpy as np

c = np.array([1, 3, 2, 2])
p = np.cumsum(c)
i = np.zeros(p[-1], int)
np.add.at(i, p[:-1], 1)
print(np.cumsum(i))   # [0 1 1 1 2 2 3 3]

What is a more efficient way to process numpy arrays based on multiple criteria?

I have written some code in which, for a range of years (e.g. 15 years), ndimage.filters.convolve is used to convolve an array (e.g. array1); then, where the resulting array (e.g. array2) is above a randomly generated number, another array (e.g. array3) is given a value of 1. Once array3 has been assigned a value of one, it counts up every year, and when it eventually reaches a certain value (e.g. 5), array1 is updated at that location.
Sorry if this is a little confusing. I've actually got the script working by using numpy.where(boolean expression, value, value), but where I needed multiple expressions (e.g. where array2 == 1 and array3 == 0), I used a for loop to iterate through each value in the arrays. This works fine on the example here, but when I substitute the arrays for larger arrays (the full script imports GIS grids and converts them into arrays), this for loop takes a few minutes to process for every year. As we have to run the model over 60 years 1000 times, I need to find a much more efficient way to process these arrays.
I've tried to use multiple expressions within numpy.where but couldn't work out how to get it to work. I also tried zip(array) to zip the arrays together, but I couldn't update them; I think this is because it created tuples of the array elements.
I've attached a copy of the script; as mentioned earlier, it works exactly as I need it to. However, it needs to do this more efficiently. If anyone has any suggestions that would be great. This is my first post regarding Python, so I still consider myself a novice.
import numpy as np
from scipy import ndimage
import random
from pylab import *

###################### FUNCTIONS ###########################
def convolveArray1(array1, kern1):
    newArray = ndimage.filters.convolve(array1, kern1, mode='constant')
    return newArray

######################## MAIN ##############################
## Set the number of years
nYears = range(1, 16)

## Create array1
array1 = np.zeros((10, 10), dtype=int)  # vegThreshMask
# Add some values to array1
array1[[4, 4], [4, 5]] = 8
array1[5, 4] = 8
array1[5, 5] = 8

## Create kernel array
kernal = np.ones((3, 3), dtype=np.float32)

## Create an empty array to be used as a counter
array3 = np.zeros((10, 10), dtype=int)

## Iterate through nYears
for y, yea in enumerate(nYears):
    # Create a random number for the year
    randNum = randint(7, 40)
    print('The random number for year %i is %i' % (yea, randNum))
    print()
    # Call the convolveArray function
    convArray = convolveArray1(array1, kernal)
    # Update array2 where it is greater than the random number
    array2 = np.where(convArray > randNum, 1, 0)
    print('Where convArray > randNum in year %i' % (yea))
    print(array2)
    print()
    # Iterate through array2
    for a, ar in enumerate(array2):
        for b, arr in enumerate(ar):
            if arr == 1 and array3[a][b] == 0:
                array3[a][b] = 1
            else:
                if array3[a][b] > 0:
                    array3[a][b] = array3[a][b] + 1
                    if array3[a][b] == 5:
                        array1[a][b] = 8
    # Remove the initial array (array1) from the updated array3
    array3 = np.where(array1 > 0, 0, array3)
    print('New array3 after %i years' % (yea))
    print('(Excluding initial array)')
    print(array3)
    print()

print('The final output of the initial array')
print(array1)
I suspect you could gain a substantial speedup if you start using broadcasting. For example, starting from your line # Iterate through array2 we can remove the explicit loop and simply broadcast over the variables we want to change. Note I'm using AX instead of arrayX for clarity:
# Iterate through A2
idx = (A2==1) & (A3==0)
idx2 = (~idx) & (A3>0)
A3[idx ] = 1
A3[idx2] += 1
A1[A3==5] = 8
In addition, this greatly improves code clarity once you get used to this style as you aren't explicitly dealing with the indices (your a and b here).
Is it worth the trouble?
I asked the OP to do a speed test after trying the code above:
If you do implement loop change, please let me know the speed-up on your real-world code.
It would be useful to know if the advice given is simply glorified syntactic sugar, or has a notable effect.
After testing, the response was a substantial 40x speedup! When dealing with large arrays of contiguous data where simple masks are being applied, numpy is a far better alternative to native Python lists.
It sounds like you were trying to use multiple conditions in np.where using expressions like array1 > 0 and array2 < 0. This doesn't work because of the way boolean operations work in Python, as documented here. First, array1 > 0 is evaluated, then it is converted to a boolean value using the __nonzero__ method (renamed to __bool__ in Python 3). There isn't a unique useful way of converting an array into a bool, and there is currently no way of overriding the behaviour of the boolean operators (though I believe this is being discussed for future versions), so in numpy, ndarray.__nonzero__ is defined to raise an exception. Instead, you can use np.logical_and, np.logical_or, and np.logical_not, which have the behaviour you would expect.
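For example, a small sketch of that suggestion, using names similar to the arrays in the question:

import numpy as np

array2 = np.array([[1, 0], [1, 1]])
array3 = np.array([[0, 0], [2, 0]])

# 'array2 == 1 and array3 == 0', combined element-wise:
mask = np.logical_and(array2 == 1, array3 == 0)
array3 = np.where(mask, 1, array3)
# array3 is now [[1, 0], [2, 1]]

The & operator on two boolean arrays does the same thing as np.logical_and, which is what the first answer uses.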
I don't know how much of a speedup this will give you, though. If you do end up performing lots of array indexing operations in loops, it might be worth looking into cython, with which you can easily speed up array operations by moving them into a C extension.
