Find values closest to zero among ndarrays - python

I have two numpy ndarrays named A and B, each of dimension 2 by 3. For each grid point, I have to find which element of the two arrays is closest to zero and assign a flag accordingly: the flag takes value 1 for array A and value 2 for array B. That is, if the element at (0,0) (i.e., row 0 and column 0) of array A is closer to zero than the (0,0) element of array B, then the output has the value 1 at row 0, column 0. The output array will also have dimension 2 by 3.
I give an example below:
A = np.array([[0.1, 2, 0.3], [0.4, 3, 2]])
B = np.array([[1, 0.2, 0.5], [4, 0.03, 0.02]])
The output should be
[[1,2,1],[1,2,2]]
Is there an efficient way of doing this without writing a for loop? Many thanks.

Here's what I would do:
import numpy as np
a = np.array([[0.1, 2, 0.3], [0.4, 3, 2]])
b = np.array([[1, 0.2, 0.5], [4, 0.03, 0.02]])
# Stack into a (2, 2, 3) array, then find which layer (0 for a, 1 for b)
# has the smaller absolute value at each position; +1 maps that to flags 1/2.
c = np.abs(np.stack([a, b])).argmin(0) + 1
Output:
array([[1, 2, 1],
       [1, 2, 2]])
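An equivalent approach (my own sketch, not part of the answer above) skips the stacking and compares the magnitudes directly with np.where; ties go to A, matching argmin's preference for the first match:
import numpy as np
a = np.array([[0.1, 2, 0.3], [0.4, 3, 2]])
b = np.array([[1, 0.2, 0.5], [4, 0.03, 0.02]])
# Flag 1 where |a| <= |b|, flag 2 where |b| is strictly smaller.
flags = np.where(np.abs(a) <= np.abs(b), 1, 2)
print(flags)
# [[1 2 1]
#  [1 2 2]]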

Related

How do I locate an element in a NumPy array equal to a given value?

I work in the field of image processing, and I converted an image into a matrix. The values of this matrix are only two numbers: 0 and 255. I want to know in which row and in which column the value 0 occurs within this matrix. Please help.
I wrote this:
array = np.array(binary_img)
print("array", array)
for i in array:
    a = np.where(i == 0)
    print(a)
    continue
The == operator on an array returns a boolean array with a True/False value for each element. The argwhere function returns the coordinates of the nonzero (i.e. True) points in its argument:
array = np.array(binary_img)
for pt in np.argwhere(array == 0):
    print(pt)
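If you would rather have the row indices and column indices as two separate arrays instead of (row, column) pairs, np.nonzero provides that form directly. A small sketch (the binary_img value here is a hypothetical stand-in for the asker's image data):
import numpy as np
# Hypothetical stand-in for the asker's matrix of 0s and 255s.
binary_img = [[0, 255, 255],
              [255, 0, 255]]
array = np.array(binary_img)
rows, cols = np.nonzero(array == 0)  # parallel arrays of coordinates
print(rows)  # [0 1]
print(cols)  # [0 1]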

Checking whether there is only one occurrence of maximum in a numpy array

I have a 4 x 4 numpy array. I would like to determine whether the maximum of each row is unique, i.e. there is only one occurrence of the maximum value. I'm new to Python and numpy and wondered if there is a pythonic way (method) of doing this rather than running a for loop.
You could, for example, try this:
import numpy as np
x = np.random.randint(0, 10, (4, 4))
# Compare each row against its own maximum and count the matches;
# a count above 1 means that row's maximum is not unique.
res = np.sum(x == x.max(axis=1, keepdims=True), axis=1) > 1
This gives you a boolean array whose nth entry is True if the maximum of the nth row occurs more than once in that row.
x.max(axis=1, keepdims=True) computes the maximum along each row and, thanks to keepdims, keeps the result two-dimensional so it broadcasts against the input. The comparison then marks every occurrence of each row's maximum, giving a boolean array of the same shape as the input array. In Python, booleans behave as the integers 0 and 1, so you can sum them along each row; if the sum is greater than 1, the maximum is not unique.
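A worked example (using a fixed array instead of the random one, so the output is reproducible):
import numpy as np
x = np.array([[1, 3, 3, 0],
              [5, 2, 1, 0],
              [7, 7, 7, 7],
              [0, 4, 2, 4]])
res = np.sum(x == x.max(axis=1, keepdims=True), axis=1) > 1
print(res)  # [ True False  True  True]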

Getting the row index of a specific value for each column using numpy

I have a matrix of 10,000 by 10,000 filled with 1s and 0s. What I want to do is go through each column and find the rows that contain the value 1.
Then I want to store the result in a new matrix with 2 columns: column 1 = the column index, and column 2 = an array of the row indices that contain 1. Some columns do not contain any 1s at all, in which case the array would be empty.
A for loop works, but it is computationally inefficient.
I tried it with a smaller matrix first:
# sample matrix
import numpy as np
from random import randint
n = 4
mat = [[randint(0, 1) for _ in range(n)] for _ in range(n)]
arr = np.random.randint(0, size=(4, 2))
for col in range(n):
    arr[n][1] = n
    arr[n][2] = np.where(col == 1)
but this runs quite slowly for a 10,000 by 10,000 matrix. I am wondering if this is right and if there is a better way?
Getting indices where a[i][j] == 1
You can get the data you're looking for (the locations of ones within a matrix of zeros and ones) efficiently using numpy.argwhere() or numpy.nonzero(); however, you will not be able to get it in the format specified in your original question using NumPy ndarrays alone.
You could achieve your specified format using a combination of ndarrays and standard Python lists, but since efficiency is paramount given the size of your data, it is best to focus on getting the data rather than on wrapping it in an ndarray of irregular Python lists.
You can always reformat the results (the indices of 1 within your matrix) after the computation if that format is a hard requirement (a sketch of one way to do so appears at the end of this answer). This way your code benefits from NumPy's optimisations during the heavy computation, reducing the overall execution time.
Example using np.argwhere()
import numpy as np
a = np.random.randint(0, 2, size=(4,4))
b = np.argwhere(a == 1)
print(f'a\n{a}')
print(f'b\n{b}')
Output
a
[[1 1 1 1]
[0 0 0 0]
[1 0 1 0]
[1 1 1 1]]
b
[[0 0]
[0 1]
[0 2]
[0 3]
[2 0]
[2 2]
[3 0]
[3 1]
[3 2]
[3 3]]
As you can see, np.argwhere(a == 1) returns a 2-d ndarray whose rows are the (row, column) indices of the locations in a whose values x meet the condition x == 1.
I gave the above method a try with a = np.random.randint(0, 2, size=(10000, 10000)) on my laptop (nothing fancy) a few times, and it finished in around 3-5 seconds each time.
Getting row indices where all values != 1
If you also want to store the indices of all rows of a containing no values == 1, the most straightforward way (assuming you are using my example code above) is numpy.setdiff1d(), which returns the row indices not present within b - i.e. the set difference between an array of all row indices of a and the 1d array b[:, 0], which holds the row index of every value in a that is == 1.
Assuming the same a and b as in the example above:
c = np.setdiff1d(np.arange(a.shape[0]), b[:, 0])
print(c)
Output
[1]
In the above example c = [1], as row 1 is the only row of a that doesn't contain any values == 1.
It is worth noting that if a is defined as np.random.randint(0, 2, size=(10000, 10000)), the probability of c being anything but a zero-length (i.e. empty) array is vanishingly small: for a row to contain no values == 1, np.random would have to return 0 for all 10,000 entries of that row.
Why use multiple NumPy arrays?
I know that it may seem strange to use b and c to store results pertaining to the locations where a == 1 and the rows containing no 1s, respectively. Why not just use an irregular list as outlined in your original question?
The answer, in short, is efficiency. NumPy arrays let you vectorise computations on your data and largely avoid costly Python loops, and that benefit is magnified considerably by the size of the data you are working with.
You can always store your data in a more human-friendly format and map it back to NumPy as required, but the above examples will likely be substantially faster than the approach in your original question.
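For completeness, here is the reformatting sketch promised above (my own addition, not part of the original answer). Each entry of rows_by_col is an array of the row indices holding a 1 in that column, and is empty where the column has none:
import numpy as np
a = np.random.randint(0, 2, size=(4, 4))
# One vectorised pass per column; the heavy lifting stays inside NumPy,
# which is still far cheaper than an element-by-element Python loop.
rows_by_col = [np.flatnonzero(a[:, j] == 1) for j in range(a.shape[1])]
print(rows_by_col)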

Selecting numpy columns based on values in a row

Suppose I have a numpy array with 2 rows and 10 columns. I want to select the columns with even values in the first row. The outcome I want can be obtained as follows:
a = list(range(10))
b = list(reversed(range(10)))
c = np.concatenate([a, b]).reshape(2, 10).T
c[c[:, 0] % 2 == 0].T
However, this method transposes twice, and I don't suppose it's very pythonic. Is there a cleaner way to do the same job?
Numpy allows you to select along each dimension separately. You pass in a tuple of indices whose length is the number of dimensions.
Say your array is
a = np.random.randint(10, size=(2, 10))
The even elements in the first row are given by the mask
m = (a[0, :] % 2 == 0)
You can use a[0] to get the first row instead of a[0, :] because missing indices are synonymous with the slice : (take everything).
Now you can apply the mask to just the second dimension:
result = a[:, m]
You can also convert the mask to indices first. There are subtle differences between the two approaches, which you won't see in this simple case. The biggest difference is usually that integer indices are a little faster, especially when applied more than once:
i = np.flatnonzero(m)
result = a[:, i]
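To tie this back to the arrays in the question, a small sketch under the question's setup (an ascending and a descending first row):
import numpy as np
a = np.arange(10)
b = a[::-1]
c = np.stack([a, b])    # 2 x 10, as in the question
m = (c[0] % 2 == 0)     # even values in the first row
result = c[:, m]
print(result)
# [[0 2 4 6 8]
#  [9 7 5 3 1]]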

Sort array in Python (numpy) by two other lists

I have one array of vectors (for simplicity you can assume I just have one array). Based on some computation I build two further arrays of the same size as the first, containing numbers between 0 and 1 (for each index of the first array, the second and third arrays contain hamming distances). Now I want to sort the first array based on the second and third arrays at the same time.
Since the second and third arrays contain values between 0 and 1, I want to sort the first array so that each element's corresponding pair of values is ordered by its closeness to (0.5, 0.5).
To give a concrete example:
first = numpy.array([1,2,3])
second = numpy.array([0.2, 0.5, 0.97])
third = numpy.array([0.3, 0.45, .98])
first = sort(first, second, third)
After that, first must be something like this: [2, 1, 3].
Why should I have this? Because (second[1], third[1]) is the closest point to (0.5, 0.5) (by closest I mean something like Euclidean distance or any other one), and after index 1, index 0 has the second-closest distance to (0.5, 0.5).
Or, if I have second = numpy.array([0.62, 0.61, 0.97]) and third = numpy.array([0.49, 0.72, 0.97]), I want the sort to return the indices [0, 1, 2], so the first array after sorting is [1, 2, 3]. Why? Because (second[0], third[0]) is the closest point to (0.5, 0.5), and so on.
import numpy as np
first = np.array([1, 2, 3])
second = np.array([0.2, 0.5, 0.97])
third = np.array([0.3, 0.45, 0.98])
# Squared Euclidean distance of each (second, third) pair from (0.5, 0.5);
# since sqrt is monotonic, sorting by squared distance sorts by distance.
sqdist = (second - 0.5)**2 + (third - 0.5)**2
idx = np.argsort(sqdist)
first = first[idx]
print(first)
yields
[2 1 3]
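The same idx can also reorder the other two arrays consistently, should all three need to stay aligned after sorting (a small follow-up, not part of the original answer):
second = second[idx]  # [0.5, 0.2, 0.97]
third = third[idx]    # [0.45, 0.3, 0.98]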
