Float rounding error with Numpy isin function - python

I'm trying to use the isin() function from the NumPy library to find elements that are common to two arrays.
Seems pretty basic, but one of those arrays is created using linspace() and the other I just put hard values in.
But it seems like isin() is using == for its comparisons, and so the result returned by the method is missing one of the numbers.
Is there a way I can work around this, either by defining my arrays differently or by using a method other than isin() ?
thetas = np.array(np.linspace(.25, .50, 51))
known_thetas = [.3, .35, .39, .41, .45]
unknown_thetas = thetas[np.isin(thetas, known_thetas, assume_unique = True, invert = True)]
Printing the three arrays, I find that .41 is still in the third array: when printing the values one by one, the value in the first array is actually 0.41000000000000003, so the == comparison returns False. What is the best way of working around this?

We could make use of np.isclose after extending one of those arrays to 2D for an outer isclose match, and then do an ANY reduction to get a 1D boolean array that can be used to mask the relevant input array -
thetas[~np.isclose(thetas[:,None],known_thetas).any(1)]
To customize the level of tolerance for matches, we could feed in custom relative and absolute tolerance values to np.isclose.
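For example, with arbitrarily chosen tolerance values (picked here only to illustrate the keyword arguments) -
thetas[~np.isclose(thetas[:,None], known_thetas, rtol=0, atol=1e-3).any(1)]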
If you are looking for performance on large arrays, we could optimize for memory (and hence performance) with a NumPy implementation of np.isin that accepts a tolerance argument for floating point numbers, built on np.searchsorted -
thetas[~isin_tolerance(thetas,known_thetas,tol=0.001)]
Feed in your tolerance value in the tol argument.
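isin_tolerance itself is not defined in this excerpt; as a rough, hypothetical sketch (not the original implementation), a np.searchsorted-based version could look like this -
def isin_tolerance(A, B, tol):
    # For each element of A, check whether its nearest neighbour in sorted B lies within tol.
    A = np.asarray(A)
    B = np.sort(np.asarray(B))
    idx = np.searchsorted(B, A)                 # insertion positions of A into sorted B
    left = B[np.maximum(idx - 1, 0)]            # candidate neighbour on the left
    right = B[np.minimum(idx, len(B) - 1)]      # candidate neighbour on the right
    return (np.abs(A - left) <= tol) | (np.abs(A - right) <= tol)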

If you have a fixed absolute tolerance, you can use np.around to round the values before comparing:
unknown_thetas = thetas[np.isin(np.around(thetas, 5), known_thetas, assume_unique = True, invert = True)]
This rounds thetas to 5 decimal digits, but it's up to you to decide how close the numbers need to be for you to consider them equal.
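For instance, with the arrays from the question (a quick illustration, not part of the original answer):
import numpy as np

thetas = np.linspace(.25, .50, 51)
known_thetas = [.3, .35, .39, .41, .45]
unknown_thetas = thetas[np.isin(np.around(thetas, 5), known_thetas, assume_unique=True, invert=True)]
print(np.isclose(unknown_thetas, 0.41).any())   # expected: False, the 0.41000000000000003 entry is gone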

Related

Numpy Divide Arrays With Multiple Out Conditions

I have two two-dimensional arrays (say arrayA and arrayB) that are exactly the same size (2500 in X, 1500 in Y). I am interested in dividing array A by array B, but have three conditions that I would like to be excluded from the division and instead replaced with a specific value. These conditions are:
If arrayB contains zero at point (Bx,By), replace output (Cx,Cy) with (arrayA*arrayA)
If arrayA contains zero at point (Ax,Ay), replace output (Cx,Cy) with 0.50
If both arrayA & B at overlapping points (Ax,Ay & Bx,By) contain 0, replace output (Cx,Cy) with 1
I've found that numpy.divide parameters out and where allow me to define each of these individually, so I've taken the first condition and arranged it as follows:
arrayC = np.divide(arrayA, arrayB, out=(arrayA*arrayA), where=arrayB!=0)
My question is how can I combine the other two conditions and their desired outputs within this operation?
One solution, though I'm not sure it is the fastest:
za = A == 0                 # where A is zero
zb = B == 0                 # where B is zero
case0 = (~za) & ~zb         # neither is zero: ordinary division
case1 = zb & ~za            # only B is zero: use A*A
case2 = za & ~zb            # only A is zero: use 0.5
case3 = za & zb             # both are zero: use 1
C = case3*1 + case2*0.5 + case1*A*A   # cases 3, 2 and 1
C[case0] = A[case0] / B[case0]        # case 0
Could be more compact with fewer intermediate values, but I've chosen clarity.
You could also use a cascade of np.where
zb=B==0
C=np.where(A==0, np.where(zb,1,0.5), np.where(zb, A*A, A/B))
Edit: better version (but still not perfect)
zb=B==0
za=A==0
C=np.where(za, np.where(zb,1,0.5), A*A)
np.divide(A, B, out=C, where=(~zb)&~za)
It combines np.where and your np.divide where=
It is as fast as the previous solution.
And does not complain about 0-division, since division occurs only for the cases where it is needed.
Nevertheless, it computes the first version of C (the one before np.divide), and particularly A*A, everywhere, even where it is not needed, since it will be overwritten.
So, it could probably be better.
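For reference, a small self-contained sketch of that last approach on made-up 2x2 arrays (values chosen only for illustration):
import numpy as np

A = np.array([[4.0, 0.0], [6.0, 0.0]])
B = np.array([[2.0, 3.0], [0.0, 0.0]])

za = A == 0
zb = B == 0
C = np.where(za, np.where(zb, 1, 0.5), A * A)    # fill the three special cases first
np.divide(A, B, out=C, where=(~zb) & ~za)        # overwrite with A/B where both are non-zero
print(C)                                         # expected: [[ 2.   0.5] [36.   1. ]]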

What does numpy.percentile mean and how to use this for splitting array?

I am trying to understand percentiles in numpy.
import numpy as np
nd_array = np.array([3.6216, 4.5459, -3.5637, -2.5419])
step_intervals = range(100, 0, -5)
for percentile_interval in step_intervals:
    threshold_attr_value = np.percentile(np.array(nd_array), percentile_interval)
    print "percentile interval ={interval}, threshold_attr_value = {threshold_attr_value}, {arr}".format(interval=percentile_interval, threshold_attr_value=threshold_attr_value, arr=sorted(nd_array))
I get a value of these as
percentile interval =100, threshold_attr_value = 4.5459, [-3.5636999999999999, -2.5419, 3.6215999999999999, 4.5458999999999996]
...
percentile interval =5, threshold_attr_value = -3.41043, [-3.5636999999999999, -2.5419, 3.6215999999999999, 4.5458999999999996]
What does the percentiles value mean?
100% of the values in the array are < 4.5459?
5% of values in the array are < -3.41043?
Is that the correct way to read these?
I want to split the numpy array into small sub-arrays. I want to do it based on the percentile occurances of the elements. How can I do this?
To be more precise, you should say that a = np.percentile(arr, q) indicates that nearly q% of the elements of arr are lower than a. Why do I emphasize nearly?
If q=100, it always returns the maximum of arr. So, you cannot say that q% of elements are "lower than" a.
If q=0, it always returns the minimum of arr. So, you cannot say that q% of elements are "lower than or equal to" a.
In addition, the returned value depends on the type of interpolation.
The following code shows the role of interpolation parameter:
>>> import numpy as np
>>> arr = np.array([1,2,3,4,5])
>>> np.percentile(arr, 90) # default interpolation='linear'
4.5999999999999996
>>> np.percentile(arr, 90, interpolation='lower')
4
>>> np.percentile(arr, 90, interpolation='higher')
5
No: as you can see by inspection, only 75% of the values in your array are strictly less than 4.5459, and 25% of the values are strictly less than -3.41043. If you had written "less than or equal to", you would have been giving one common definition of "percentile", but it also happens not to be the one applied in your case. Instead, NumPy applies an interpolation scheme that makes the mapping from a number q in [0, 100] to the corresponding percentile continuous and piecewise linear, while still giving the "right" value at the ranks corresponding to the values in the given array. As it turns out, even this can be done in many different ways, all of them reasonable, as described in the Wikipedia article on the subject. As you can see in the documentation of numpy.percentile, you have some control over the interpolation behaviour, and by default it uses what the Wikipedia article calls the second variant, $C = 1$.
Perhaps the easiest way to understand the implications of this is to simply plot the result of calculating the different values of np.percentile for your fixed length 4 array:
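(The plot itself is not reproduced here; a minimal matplotlib sketch that would produce it, assuming matplotlib is available, is given below.)
import numpy as np
import matplotlib.pyplot as plt

nd_array = np.array([3.6216, 4.5459, -3.5637, -2.5419])
ps = np.linspace(0, 100, 501)
plt.plot(ps, [np.percentile(nd_array, p) for p in ps])
for k in range(len(nd_array)):
    plt.axvline(k * 100 / (len(nd_array) - 1), linestyle=':')   # ranks of the actual array values
plt.xlabel('q')
plt.ylabel('np.percentile(nd_array, q)')
plt.show()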
Note how the kinks are spread evenly across [0, 100] and that the percentiles corresponding to the actual values in your array are given by evaluating lambda p: np.percentile(nd_array, p) at 0*100/(4-1), 1*100/(4-1), 2*100/(4-1), and 3*100/(4-1) respectively.

can I do fast set difference with floats using numpy if elements are equal up to some tolerance

I have two lists of float numbers, and I want to calculate the set difference between them.
With numpy I originally wrote the following code:
aprows = allpoints.view([('',allpoints.dtype)]*allpoints.shape[1])
rprows = toberemovedpoints.view([('',toberemovedpoints.dtype)]*toberemovedpoints.shape[1])
diff = setdiff1d(aprows, rprows).view(allpoints.dtype).reshape(-1, 2)
This works well for things like integers. In case of 2d points with float coordinates that are the result of some geometrical calculations, there's a problem of finite precision and rounding errors causing the set difference to miss some equalities. For now I resorted to the much, much slower:
diff = []
for a in allpoints:
    remove = False
    for p in toberemovedpoints:
        if norm(p-a) < 0.1:
            remove = True
    if not remove:
        diff.append(a)
return array(diff)
But is there a way to write this with numpy and gain back the speed?
Note that I want the remaining points to still have their full precision, so first rounding the numbers and then do a set difference probably is not the way forward (or is it? :) )
Edited to add a solution based on scipy.spatial.KDTree that seems to work:
def remove_points_fast(allpoints, toberemovedpoints):
    diff = []
    removed = 0
    # prepare a KDTree
    from scipy.spatial import KDTree
    tree = KDTree(toberemovedpoints, leafsize=allpoints.shape[0]+1)
    for p in allpoints:
        distance, ndx = tree.query([p], k=1)
        if distance < 0.1:
            removed += 1
        else:
            diff.append(p)
    return array(diff), removed
If you want to do this with the matrix form, you have a lot of memory consumption with larger arrays. If that does not matter, then you get the difference matrix by:
diff_array = allpoints[:,None] - toberemovedpoints[None,:]
The resulting array has as many rows as there are points in allpoints, and as many columns as there are points in toberemovedpoints. Then you can manipulate this any way you want, e.g. take the absolute value and compare it against the tolerance, which gives you a boolean array. To find which rows have any hits (absolute difference < .1), use numpy.any:
hits = numpy.any(numpy.abs(diff_array) < .1, axis=1)
Now you have a vector which has the same number of items as there were rows in the difference array. You can use that vector to index all points (negation because we wanted the non-matching points):
return allpoints[~hits]
This is a numpyish way of doing this. But, as I said above, it takes a lot of memory.
If you have larger data, then you are better off doing it point by point. Something like this:
return allpoints[~numpy.array([numpy.any(numpy.abs(a - toberemovedpoints) < .1) for a in allpoints])]
This should perform well in most cases, and the memory use is much lower than with the matrix solution. (For stylistic reasons you may want to use numpy.all instead of numpy.any and turn the comparison around to get rid of the negation.)
(Beware, there may be printing mistakes in the code.)
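As an additional, hypothetical sketch (not from the original answer), here is a version that applies the Euclidean distance from the question instead of per-coordinate differences, under the same memory caveat as the matrix approach:
import numpy as np

def remove_close_points(allpoints, toberemovedpoints, tol=0.1):
    # pairwise differences: shape (len(allpoints), len(toberemovedpoints), 2)
    diff_array = allpoints[:, None, :] - toberemovedpoints[None, :, :]
    dists = np.linalg.norm(diff_array, axis=2)     # pairwise Euclidean distances
    hits = np.any(dists < tol, axis=1)             # rows close to some removed point
    return allpoints[~hits]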

Comparing two vectors

I have some code where I want to test if the product of a matrix and vector is the zero vector. An example of my attempt is:
n = 2
zerovector = np.asarray([0]*n)
for column in itertools.product([0,1], repeat = n):
    for row in itertools.product([0,1], repeat = n-1):
        M = toeplitz(column, [column[0]]+list(row))
        for v in itertools.product([-1,0,1], repeat = n):
            vector = np.asarray(v)
            if (np.dot(M,v) == zerovector):
                print M, "No good!"
                break
But the line if (np.dot(M,v) == zerovector): gives the error ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). What is the right way to do this?
The problem is that == between two arrays is an element-wise comparison—you get back an array of boolean values. An array of boolean values isn't a boolean value itself, so you can't use it in an if. This is what the error is trying to tell you.
You could solve this by using the all method, to check whether all of the elements in the boolean array are true. But you're making this way more complicated than you need to. Nonzero values are truthy, zero values are falsey, so you can just use any without a comparison:
if not np.dot(M, v).any():
If you want to make the comparison to zero explicit, just compare to a scalar, don't build a zero vector; it'll get broadcast the same way. And, if you ever do want to build a zero vector, just use the zeros function; don't build a list of zeros in a complicated way and pass it to asarray.
You could also use the count_nonzero function here as a different alternative. If it returns anything truthy (that is, any non-zero number), the array had at least one non-zero.
In general, you're making almost everything harder than necessary, and working through a brief NumPy tutorial and then scanning the main docs pages for useful functions would really help you.
Also, if your values aren't integers, you probably don't actually want to compare == 0 in the first place. Floating-point numbers accumulate rounding errors. To handle that, use the allclose function instead.
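A tiny self-contained illustration of those suggestions (M and v are made-up values in the ranges used by the question's loops):
import numpy as np

M = np.array([[1, 1], [1, 1]])
v = np.array([1, -1])
result = np.dot(M, v)                  # [0, 0]

if not result.any():                   # True only if every entry is exactly zero
    print(M, "No good!")

if np.count_nonzero(result) == 0:      # equivalent check via count_nonzero
    print(M, "No good!")

if np.allclose(result, 0):             # for floating-point results, compare with a tolerance
    print(M, "No good!")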
As the error says, you need to use all:
if all(np.dot(M,v) == zerovector):
or np.all. np.dot(M,v) == zerovector gives you a vector which is the element-wise comparison of the two vectors.

Efficient math operations on parts of "sparse" numpy arrays

I have the following challenge in a simulation for my PhD thesis:
I need to optimize the following code:
repelling_forces = repelling_force_prefactor * np.exp(-(height_r_t/potential_steepness))
In this code snippet, 'height_r_t' is a real NumPy array and 'potential_steepness' is a scalar. 'repelling_force_prefactor' is also a NumPy array, which is mostly ZERO but ONE at pre-calculated positions, which do NOT change during runtime (i.e. a mask).
Obviously the code is inefficient as it would make much more sense to only calculate the exponential function at the positions, where 'repelling_force_prefactor' is non-zero.
The question is how do I do this in the most efficient manner?
The only idea I have so far is to define slices of 'height_r_t' using 'repelling_force_prefactor' and apply 'np.exp' to those slices. However, in my experience slicing is slow (I'm not sure whether that is generally true), and the solution seems awkward.
Just as a side note, the ratio of 1's to 0's in 'repelling_force_prefactor' is about 1/1000, and I am running this in a loop, so efficiency is very important.
(Comment: I wouldn't have a problem with resorting to Cython, as I will need/want to learn it at some point anyway... but I am a novice, so I'd need a good pointer/explanation.)
Masked arrays are implemented exactly for your purposes.
Performance is the same as Sven's answer:
height_r_t = np.ma.masked_where(repelling_force_prefactor == 0, height_r_t)
repelling_forces = np.ma.exp(-(height_r_t/potential_steepness))
The advantage of masked arrays is that you do not have to slice and expand your array: the size stays the same, and NumPy automatically knows not to compute the exp where the array is masked.
Also, you can sum arrays with different masks, and the resulting array is masked wherever either operand is masked (so its valid entries are the intersection of the unmasked regions).
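As a quick, made-up illustration of this (values are arbitrary):
import numpy as np

height_r_t = np.array([0.1, 0.2, 0.3, 0.4])
repelling_force_prefactor = np.array([0.0, 1.0, 0.0, 1.0])
potential_steepness = 0.5

masked_h = np.ma.masked_where(repelling_force_prefactor == 0, height_r_t)
repelling_forces = np.ma.exp(-(masked_h / potential_steepness))
print(repelling_forces.filled(0.0))    # plain array with 0 at the masked positions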
Slicing is probably much faster than computing all the exponentials. Instead of using the mask repelling_force_prefactor for slicing directly, I suggest precomputing the indices where it is non-zero and using them for slicing:
# before the loop
indices = np.nonzero(repelling_force_prefactor)
# inside the loop
repelling_forces = np.exp(-(height_r_t[indices]/potential_steepness))
Now repelling_forces will contain only the results that are non-zero. If you have to update some array of the original shape of height_r_t with these values, you can use slicing with indices again, or use np.put() or a similar function.
Slicing with the list of indices will be more efficient than slicing with a boolean mask in this case, since the list of indices is shorter by a factor of a thousand. Actually measuring the performance is of course up to you.
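A short hypothetical sketch of that pattern, including the write-back step (array sizes and the prefactor pattern are made up):
import numpy as np

height_r_t = np.random.rand(1000)
repelling_force_prefactor = np.zeros(1000)
repelling_force_prefactor[::100] = 1.0             # stand-in sparse mask
potential_steepness = 0.5

# before the loop
indices = np.nonzero(repelling_force_prefactor)

# inside the loop
repelling_forces = np.zeros_like(height_r_t)
repelling_forces[indices] = np.exp(-(height_r_t[indices] / potential_steepness))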
