I have a NumPy array of shape (30435615, 3) containing coordinates, expressed for example as (0.0 0.0 0.0 1), and I'm looking for a way to set to True the indices whose coordinates are contained in another array. I tried the numpy.where method but I'm having some problems.
If I print the 50th element of my array I get:
>>> print(coordsRAS[50,:])
[-165.31173706 7.91322422 -271.87799072]
But if I search for this point:
>>> import numpy as np
>>> print(np.where((coordsRAS[:,0]==-165.31173706) & (coordsRAS[:,1] == 7.91322422) & (coordsRAS[:,2] == -256.87799072)))
(array([], dtype=int64),)
I can't figure out why it can't find the point.
EDIT 1:
Sorry, I copied the wrong value above: -256.87799072 instead of -271.87799072. However, the real problem was the approximation in the printed output: the stored value actually has more significant digits than what is printed, which is why the point could not be found. This way it works:
np.where((np.round(coordsRAS[:,0],8)==-165.31173706) & (np.round(coordsRAS[:,1],8) == 7.91322422) & (np.round(coordsRAS[:,2],8) == -271.87799072))
But now I have another problem. The other array I want to compare coordsRAS with is smaller, so when I try an elementwise == comparison it gives me an error.
>>> coordsRAS = np.where(coordsRAS[:,:]==points[:,:3],True,False)
C:/Users/silvi/AppData/Local/Temp/xpython_8292/987583353.py:11: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
coordsRAS = np.where(coordsRAS [:,:]==points[:,:3],True,False)
How can I set coordsRAS values to True that are also present in points?
When you are working with floats, it is not a good idea to use equality comparisons to find numbers, because you are always dealing with numerical inaccuracies. The answer given by Majid will fail if you, for example, multiply your coordsRAS by pi and then divide by pi again. Theoretically this should give you the same values back, but the lookup fails:
import numpy as np
coordsRAS = np.random.random((5, 3))
point = [-165.31173706, 7.91322422, -256.87799072]
coordsRAS[4, :] = point
coordsRAS *= np.pi
coordsRAS /= np.pi
result1 = np.where((coordsRAS[:, 0] == -165.31173706) & (coordsRAS[:, 1] == 7.91322422) & (coordsRAS[:, 2] == -256.87799072))
print(coordsRAS[result1])
We have multiplied and divided by the same number, but now we cannot find the point anymore due to numerical round-off error. The result in this case is:
[]
So the result is empty, because your float has slightly changed due to numerical round off errors.
The solution is to calculate the difference between your array and the required point, and search for the locations where the distance falls below a certain tolerance. So you should do:
distance = np.linalg.norm(coordsRAS - point, axis=-1)
row = np.where(distance < 1e-10)
result2 = coordsRAS[row]
Now the correct point can still be found:
print(result2)
[[-165.31173706 7.91322422 -256.87799072]]
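The same per-component tolerance check can also be written with np.isclose (rtol is set to 0 here so that only the absolute tolerance is used):
row = np.where(np.all(np.isclose(coordsRAS, point, rtol=0, atol=1e-10), axis=-1))
result2 = coordsRAS[row]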
EDIT1:
In case you want to find all the locations stored in another, smaller array, you can iterate over the points. E.g. you have the following two arrays:
coordsRAS = np.random.random((10, 3))
points = np.random.random((3, 3))
coordsRAS[4:7, :] = points
where the points are also stored in the coordsRAS array, you can find their locations back in coordsRAS as follows:
mask_total = None
for point in points:
    distance = np.linalg.norm(coordsRAS - point, axis=-1)
    mask = distance < 1e-10
    if mask_total is None:
        mask_total = mask
    else:
        mask_total = mask_total | mask

result = coordsRAS[mask_total]
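The Python loop can also be avoided with broadcasting, at the cost of building a (len(coordsRAS), len(points)) distance matrix in memory; this is only a sketch of that trade-off:
# pairwise distances between every row of coordsRAS and every row of points
distances = np.linalg.norm(coordsRAS[:, None, :] - points[None, :, :], axis=-1)
mask_total = np.any(distances < 1e-10, axis=1)  # True where a row matches any point
result = coordsRAS[mask_total]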
I am looking for an efficient way to do the following calculation on millions of arrays. For the values in each array, I want to calculate the mean of the values that fall into the bin with the highest frequency, as demonstrated below. Some of the arrays may contain NaN values; the other values are floats. The loop over my actual data takes too long to finish.
import numpy as np
array = np.array([np.random.uniform(0, 10) for i in range(800,)])
# adding nan values
mask = np.random.choice([1, 0], array.shape, p=[.7, .3]).astype(bool)
array[mask] = np.nan
array = array.reshape(50, 16)
bin_values=np.linspace(0, 10, 21)
f = np.apply_along_axis(lambda a: np.histogram(a, bins=bin_values)[0], 1, array)
bin_start = np.apply_along_axis(lambda a: bin_values[np.argmax(a)], 1, f).reshape(array.shape[0], -1)
bin_end = bin_start + abs(bin_values[1] - bin_values[0])
values = np.zeros(array.shape[0])
for i in range(array.shape[0]):
    values[i] = np.nanmean(array[i][(array[i] >= bin_start[i]) * (array[i] < bin_end[i])])
Also, when I run the above code I get three warnings. The first is 'RuntimeWarning: Mean of empty slice' for the line where I calculate the values variable. I added a condition to skip this line when a row is all NaN, but the warning did not go away, and I am wondering why. The other two warnings are for when the less and greater_equal comparisons involve NaN values, which makes sense to me.
The arrays that I want to run this algorithm on are independent, but I am already processing them with 12 separate scripts. Running the code in parallel would be an option, however, for now I am looking to improve the algorithm itself.
The reason I am using a lambda function is to run numpy.histogram over an axis, since the histogram function does not seem to take an axis argument. I was able to use a mask and remove the loop from the code; it is now twice as fast, but I think it can still be improved.
To explain what I want in more detail with an example: imagine I have 36 numbers that are greater than 0 and smaller than 20, and bins of equal width 0.5 over the same interval (0.0_0.5, 0.5_1.0, 1.0_1.5, ..., 19.5_20.0). If I put the 36 numbers into their corresponding bins, I want the mean of the numbers that fall into the bin containing the most numbers.
Please post your solution if you can think of a faster algorithm.
import numpy as np
# creating an array to test the algorithm
array = np.array([np.random.uniform(0, 10) for i in range(800,)])
# adding nan values
mask = np.random.choice([1, 0], array.shape, p=[.7, .3]).astype(bool)
array[mask] = np.nan
array = array.reshape(50, 16)
# the algorithm
bin_values=np.linspace(0, 10, 21)
# calculating the frequency of each bin
f = np.apply_along_axis(lambda a: np.histogram(a, bins=bin_values)[0], 1, array)
bin_start = np.apply_along_axis(lambda a: bin_values[np.argmax(a)], 1, f).reshape(array.shape[0], -1)
bin_end = bin_start + (abs(bin_values[1]-bin_values[0]))
# creating a mask to get the mean over the bin with maximum frequency
mask = (array>=bin_start) * (array<bin_end)
mask_nan = np.tile(np.nan, (mask.shape[0], mask.shape[1]))
mask_nan[mask] = 1
v = np.nanmean(array * mask_nan, axis = 1)
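If you also want to get rid of apply_along_axis, one possibility is to bin every element at once with np.digitize and accumulate the per-row bin counts with np.add.at. This is only a sketch that reuses array and bin_values exactly as defined above; whether it is actually faster on your data would need to be timed:
bin_idx = np.digitize(array, bin_values) - 1                  # bin index of every element
bin_idx = np.clip(bin_idx, 0, len(bin_values) - 2)            # keep indices in range (NaNs are masked out below)
valid = ~np.isnan(array)
counts = np.zeros((array.shape[0], len(bin_values) - 1))
np.add.at(counts, (np.arange(array.shape[0])[:, None], bin_idx), valid)  # per-row bin counts, ignoring NaNs
best = counts.argmax(axis=1)                                  # most frequent bin per row
in_best = (bin_idx == best[:, None]) & valid
v = np.nanmean(np.where(in_best, array, np.nan), axis=1)      # mean of the values in that bin, per row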
I wrote the following function, which takes as inputs three 1D arrays (namely int_array, x, and y) and a number lim. The output is a number as well.
def integrate_to_lim(int_array, x, y, lim):
    if lim >= np.max(x):
        res = 0.0
    elif lim <= np.min(x):
        res = int_array[0]
    else:
        index = np.argmax(x > lim)  # To find the first element of x larger than lim
        partial = int_array[index]
        slope = (y[index - 1] - y[index]) / (x[index - 1] - x[index])
        rest = (x[index] - lim) * (y[index] + (lim - x[index]) * slope / 2.0)
        res = partial + rest
    return res
Basically, outside of the limit cases lim >= np.max(x) and lim <= np.min(x), the idea is that the function finds the index of the first value of x larger than lim and then uses it for some simple calculations.
In my case, however, lim can also be a fairly big 2D array (shape ~2000 by ~1000 elements).
I would like to rewrite it such that it makes the same calculations for the case that lim is a 2D array.
Obviously, the output should also be a 2D array of the same shape of lim.
I am having a real struggle figuring out how to vectorize it.
I would like to stick only to the numpy package.
PS: I want to vectorize my function because efficiency is important, and as I understand it, using for loops is not a good choice in this regard.
Edit: my attempt
I was not aware of the function np.take, which made the task way easier.
Here is my brute-force attempt that seems to work (suggestions on how to clean it up or make it faster are more than welcome).
def integrate_to_lim_vect(int_array, x, y, lim_mat):
    lim_mat = np.asarray(lim_mat)  # make sure that it is an array
    shape_3d = list(lim_mat.shape) + [1]
    x_3d = np.ones(shape_3d) * x  # 3-dimensional version of x
    lim_3d = np.expand_dims(lim_mat, axis=2) * np.ones(x_3d.shape)  # also 3d
    # I use np.argmax on the 3d matrices (is there a simpler way?)
    index_mat = np.argmax(x_3d > lim_3d, axis=2)
    # Silly calculations
    partial = np.take(int_array, index_mat)
    y1_mat = np.take(y, index_mat)
    y2_mat = np.take(y, index_mat - 1)
    x1_mat = np.take(x, index_mat)
    x2_mat = np.take(x, index_mat - 1)
    slope = (y1_mat - y2_mat) / (x1_mat - x2_mat)
    rest = (x1_mat - lim_mat) * (y1_mat + (lim_mat - x1_mat) * slope / 2.0)
    res = partial + rest
    # Handle the limit cases with np.select
    condlist = [lim_mat >= np.max(x), lim_mat <= np.min(x)]
    choicelist = [0.0, int_array[0]]  # should these options be a 2d matrix?
    output = np.select(condlist, choicelist, default=res)
    return output
I am aware that if the limit is larger than the maximum value in the array, np.argmax returns index zero (leading to wrong results). This is why I used np.select to check and correct for these cases.
Is it necessary to define the three-dimensional matrices x_3d and lim_3d, or is there a simpler way to find the 2D matrix of indices index_mat?
Suggestions, especially to improve the way I expanded the dimension of the arrays, are welcome.
I think you can solve this using two tricks. First, a 2d array can be easily flattened to a 1d array, and then your answers can be converted back into a 2d array with reshape.
Next, your use of argmax suggests that your array x is sorted. You can then find the full set of indices at once using np.digitize, so instead of a single index you get a complete array of indices. All the calculations you are doing are intrinsically supported as array operations in numpy, so that should not cause any problems.
You will have to specifically look at the limiting cases. If those are rare enough, then it might be okay to let the answers be derived by the default formula (they will be garbage values), and then replace them with the actual values you desire.
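For example, a minimal sketch of that idea (reusing int_array, x, y and lim_mat from the question, and assuming x is sorted in ascending order):
import numpy as np

def integrate_to_lim_2d(int_array, x, y, lim_mat):
    int_array, x, y = np.asarray(int_array), np.asarray(x), np.asarray(y)
    lim_flat = np.asarray(lim_mat).ravel()       # flatten the 2D limits to 1D
    index = np.digitize(lim_flat, x)             # first index of x exceeding each limit
    index = np.clip(index, 1, len(x) - 1)        # keep index and index - 1 in bounds
    partial = int_array[index]
    slope = (y[index - 1] - y[index]) / (x[index - 1] - x[index])
    rest = (x[index] - lim_flat) * (y[index] + (lim_flat - x[index]) * slope / 2.0)
    res = partial + rest
    # overwrite the limiting cases afterwards, as suggested above
    res[lim_flat >= np.max(x)] = 0.0
    res[lim_flat <= np.min(x)] = int_array[0]
    return res.reshape(np.shape(lim_mat))        # back to the original 2D shape
This keeps memory at the size of lim_mat instead of the three-dimensional broadcast arrays built in the attempt above.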
I am curious if anyone can explain what exactly leads to the discrepancy in this particular handling of C versus Fortran ordered arrays in numpy. See the code below:
system:
Ubuntu 18.10
Miniconda python 3.7.1
numpy 1.15.4
import numpy as np

def test_array_sum_function(arr):
    idx = 0
    val1 = arr[idx, :].sum()
    val2 = arr.sum(axis=1)[idx]
    print('axis sums:', val1)
    print('          ', val2)
    print('    equal:', val1 == val2)
    print('total sum:', arr.sum())
n = 2_000_000
np.random.seed(42)
rnd = np.random.random(n)
print('Fortran order:')
arrF = np.zeros((2, n), order='F')
arrF[0, :] = rnd
test_array_sum_function(arrF)
print('\nC order:')
arrC = np.zeros((2, n), order='C')
arrC[0, :] = rnd
test_array_sum_function(arrC)
prints:
Fortran order:
axis sums: 999813.1414744433
           999813.1414744079
    equal: False
total sum: 999813.1414744424
C order:
axis sums: 999813.1414744433
           999813.1414744433
    equal: True
total sum: 999813.1414744433
This is almost certainly a consequence of numpy sometimes using pairwise summation and sometimes not.
Let's build a diagnostic array:
eps = (np.nextafter(1.0, 2)-1.0) / 2
1+eps+eps+eps
# 1.0
(1+eps)+(eps+eps)
# 1.0000000000000002
X = np.full((32, 32), eps)
X[0, 0] = 1
X.sum(0)[0]
# 1.0
X.sum(1)[0]
# 1.000000000000003
X[:, 0].sum()
# 1.000000000000003
This strongly suggests that 1D arrays and contiguous axes use pairwise summation while strided axes in a multidimensional array don't.
Note that to see that effect the array has to be large enough, otherwise numpy falls back to ordinary summation.
Floating point math isn't necessarily associative, i.e. (a+b)+c != a+(b+c).
Since you're adding along different axes, the order of operations is different, which can affect the final result. As a simple example, consider this matrix, whose exact sum is 1:
a = np.array([[1e100, 1], [-1e100, 0]])
print(a.sum()) # returns 0, the incorrect result
af = np.asfortranarray(a)
print(af.sum()) # prints 1
(Interestingly, a.T.sum() still gives 0, as does aT = a.T; aT.sum(), so I'm not sure how exactly this is implemented in the backend.)
The C order is using the sequence of operations (left-to-right) 1e100 + 1 + (-1e100) + 0 whereas the Fortran order uses 1e100 + (-1e100) + 1 + 0. The problem is that (1e100+1) == 1e100 because floats don't have enough precision to represent that small difference, so the 1 gets lost.
In general, don't do equality testing on floating-point numbers; instead, compare using a small epsilon (e.g. abs(float1 - float2) < 0.00001, or np.isclose). If you need arbitrary precision, use the Decimal library or a fixed-point representation with ints.
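For example, a quick illustration of both comparison styles:
import numpy as np

x1 = 0.1 + 0.2
x2 = 0.3
print(x1 == x2)              # False: 0.1 + 0.2 is not exactly 0.3 in floating point
print(abs(x1 - x2) < 1e-9)   # True: manual epsilon comparison
print(np.isclose(x1, x2))    # True: numpy's tolerance-based comparison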
I'm working with some position vectors. I am combining each position with every other position, and am using matrices to do it as efficiently as I can. I ran into a problem with my most recent version where it gives me a warning: RuntimeWarning: invalid value encountered in sqrt
return sqrt(add.reduce(s, axis=axis, keepdims=keepdims))
An example of some code that gives me this warning is below.
This warning is caused by np.linalg.norm and only happens when I specify a data type for the array, it also only happens in the example code below when I have more than 90 vectors.
Is this a NumPy bug, a known limitation in NumPy, or am I doing something wrong?
import numpy as np

x = np.full((100, 3), 1)  # Create an array of vectors, in this case all [1, 1, 1]
ps, qs = np.broadcast_arrays(x, np.expand_dims(x, 1)) # Created so that I can operate each vector on each other vector.
z = np.subtract(ps, qs, dtype=np.float32) # Get the difference between them.
np.linalg.norm(z, axis=2) # Get the magnitude of the difference.
You should make sure that z doesn't contain any negative values!
Test whether you have negative values:
print(len(z[z < 0]))
Consider the following code
import numpy as np

X = np.matrix([[1, -1, 1], [-1, 0, 1]])
print(X.T)
'''
[[ 1 -1]
 [-1  0]
 [ 1  1]]
'''
I want to check whether a solution exists for which the transpose gives all values < 0. For example, this would mean checking whether the following system of inequalities has a solution:
1*y1 + -1*y2 < 0
-1*y1 + 0*y2 < 0
1*y1 + 1*y2 < 0
I tried reading http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.solve.html#numpy.linalg.solve but apparently no such luck.
It seems that your question is equivalent to asking whether the plane that contains the origin and the vectors U = r_[1, -1, 1] and V = r_[-1, 0, 1] extends into the octant of 3-D space where all coordinates are negative.
The cross product UxV (or cross(U, V)) is normal to this plane. If this cross product has three nonzero components, all of the same sign, then no point of the plane can lie in the dreaded octant. For your numbers I get all three components negative, so there is no solution.
[UPDATE]
In general, the tricky things happen when the normal contains zeros:
Three zeros: your original vectors are parallel, or one of them is zero. Pick one that is not zero; if all of its components have the same sign, then you have a solution.
Two zeros: your plane is one of X=0, Y=0, Z=0, so one coordinate is always zero (never negative) and there are no solutions.
One zero: your plane includes the X, Y or Z axis. There is a solution if and only if the remaining two components of the normal have differing signs.
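A minimal sketch of the non-degenerate sign check described above (assuming numpy; the zero cases from the update still need the extra handling listed):
import numpy as np

U = np.array([1, -1, 1])
V = np.array([-1, 0, 1])
n = np.cross(U, V)  # normal to the plane spanned by U and V -> [-1 -2 -1]

if np.all(n != 0) and (np.all(n > 0) or np.all(n < 0)):
    print("no all-negative solution exists")  # this branch is taken for these numbers
else:
    print("handle the zero / mixed-sign cases separately")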
Here is the documentation you need:
numpy apply along axis
import numpy as np

def func(b, y1, y2):
    a = b.T
    if a[0]*y1 + a[1]*y2 < 0:
        return True
    else:
        return False
np.apply_along_axis(func,0,X,y1,y2)
So now let's say you want y1 as -1 and y2 as 3:
>>> np.apply_along_axis(func,0,X,-1,3)
array([ True, False, False], dtype=bool)
So this means that the first row of the transpose (which is the first column of the original matrix) satisfies your condition; the second and third do not!
Here is a function for an arbitrary number of y values, i.e. for as large a matrix as you want:
def func(b, *args):
    a = b.T
    total = [a[i]*args[i] for i in range(len(args))]
    if sum(total) < 0:
        return True
    else:
        return False
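A hypothetical usage example with three y values (X3 and the coefficients are made up purely for illustration):
X3 = np.matrix([[1, -1, 1], [-1, 0, 1], [2, 1, -1]])
print(np.apply_along_axis(func, 0, X3, -1, 3, 0.5))
# expected output: [ True False False]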