Check for numpy array equality with specific NaN - python

There are several different types of NaN possible in most floating point representations (e.g. quiet NaNs, signalling NaNs, etc.). I assume this is also true in numpy. I have a specific bit representation of a NaN, defined in C and imported into python. I wish to test whether an array contains entirely this particular floating point bit pattern. Is there any way to do that?
Note that I want to test whether the array contains this particular NaN, not whether it has NaNs in general.

Numpy allows you to have direct access to the bytes in your array. For a simple case you can view NaNs directly as integers:
import numpy as np
quiet_nan1 = np.uint64(0b0111111111111000000000000000000000000000000000000000000000000000)
x = np.arange(10, dtype=np.float64)
x.view(np.uint64)[5] = quiet_nan1
x.view(np.uint64)
Now you can just compare the elements for the bit-pattern of your exact NaN. This version will preserve shape since the elements are the same size.
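For instance, a minimal sketch of that test, continuing the example above:
# elementwise test against the exact bit pattern of quiet_nan1
is_my_nan = x.view(np.uint64) == quiet_nan1
contains_only_my_nan = is_my_nan.all()  # True only if every element carries this NaN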
A more general solution, which would let you work with types like float128 that don't have a corresponding integer analog on most systems, is to use bytes:
quiet_nan1l = np.frombuffer((0b01111111111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000).to_bytes(16, 'big'), dtype=np.uint8)
x = np.arange(3 * 4 * 5, dtype=np.float128).reshape(3, 4, 5)
x.view(np.uint8).reshape(*x.shape, 16)[2, 2, 3, :] = quiet_nan1l
x.view(np.uint8).reshape(*x.shape, 16)
The final reshape is not strictly necessary, but it is very convenient, since it isolates the original array elements along the last dimension.
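A minimal sketch of the byte-level test, continuing the example above, reduces over that last byte axis:
# True where an element's 16 raw bytes match the pattern exactly
matches = (x.view(np.uint8).reshape(*x.shape, 16) == quiet_nan1l).all(axis=-1)
# matches[2, 2, 3] is True here; matches.all() answers whether the whole array is this NaN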
In both cases, modifying the view modifies the original array. That's the point of a view.
And of course it goes without saying (which is why I'm saying it), that this applies to any other bit pattern you may want to assign or test for, not just NaNs.

Related

Float rounding error with Numpy isin function

I'm trying to use the isin() function from Numpy library to find elements that are common in two arrays.
Seems pretty basic, but one of those arrays is created using linspace() and the other I just put hard values in.
But it seems like isin() is using == for its comparisons, and so the result returned by the method is missing one of the numbers.
Is there a way I can work around this, either by defining my arrays differently or by using a method other than isin()?
thetas = np.array(np.linspace(.25, .50, 51))
known_thetas = [.3, .35, .39, .41, .45]
unknown_thetas = thetas[np.isin(thetas, known_thetas, assume_unique = True, invert = True)]
Printing the three arrays, I find that .41 is still in the third array: when printing the values one by one, the value in the first array is actually 0.41000000000000003, which means the == comparison returns False. What is the best way of working around this?
We could make use of np.isclose after extending one of those arrays to 2D for an outer isclose match-finding and then doing an ANY reduction to give us a 1D boolean array that can be used to mask the relevant input array -
thetas[~np.isclose(thetas[:,None],known_thetas).any(1)]
To customize the level of tolerance for matches, we could feed in custom relative and absolute tolerance values to np.isclose.
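For example, with a purely absolute tolerance (the 1e-3 here is just for illustration):
thetas[~np.isclose(thetas[:,None], known_thetas, rtol=0, atol=1e-3).any(1)]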
If you are looking for performance on large arrays, we could optimize on memory, and hence performance too, with a NumPy implementation of np.isin with a tolerance arg for floating point numbers based on np.searchsorted (a sketch of such a helper is given below) -
thetas[~isin_tolerance(thetas,known_thetas,tol=0.001)]
Feed in your tolerance value in tol arg.
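The isin_tolerance helper is not spelled out here; a minimal sketch of such a searchsorted-based helper (its name and exact behaviour are assumptions) could look like this:
import numpy as np

def isin_tolerance(a, b, tol):
    # For each element of a, find its nearest neighbour in sorted b via
    # searchsorted and check whether that neighbour lies within tol.
    a = np.asarray(a)
    b = np.sort(np.asarray(b).ravel())
    idx = np.searchsorted(b, a)
    left = b[np.maximum(idx - 1, 0)]        # nearest candidate on the left
    right = b[np.minimum(idx, len(b) - 1)]  # nearest candidate on the right
    return np.minimum(np.abs(a - left), np.abs(a - right)) <= tol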
If you have a fixed absolute tolerance, you can use np.around to round the values before comparing:
unknown_thetas = thetas[np.isin(np.around(thetas, 5), known_thetas, assume_unique = True, invert = True)]
This rounds thetas to 5 decimal digits, but it's up to you to decide how close the numbers need to be for you to consider them equal.

Creating an empty multidimensional array

In Python, when using np.empty(), for example np.empty((3,1)), we get an array that is of size (3,1) but, in reality, it is not empty: it contains very small values (e.g., 1.7*(10^-315)). Is it possible to create an array that is really empty/has no values but has given dimensions/shape?
I'd suggest using np.full to choose the fill-value directly...
x = np.full((3, 1), None, dtype=object)
... of course the dtype you chose kind of defines what you mean by "empty"
I am guessing that by empty, you mean an array filled with zeros.
Use np.zeros() to create an array with zeros. np.empty() just allocates the array, so the numbers in there are garbage. It is provided as a way to avoid even the cost of setting the values to zero. But it is generally safer to use np.zeros().
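For example (the leftover values from np.empty will differ from run to run):
import numpy as np

np.zeros((3, 1))   # guaranteed zeros
np.empty((3, 1))   # only allocated: the contents are whatever was in memory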
I suggest using np.nan, as shown below:
yourdata = np.empty((3,1)) * np.nan
Alternatively, you can use np.zeros((3,1)), but that fills every value with zero, which does not clearly signal "no data". I feel like using np.nan is best in practice.
It's all up to you and depends on your requirements.

Numpy: Uniform way of retrieving `dtype`

If I have a numpy array x, I can get its data type by using dtype like this:
t = x.dtype
However, that obviously won't work for things like lists. I wonder if there is a standard way of retrieving types for lists and numpy arrays. In the case of lists, I guess this would mean the largest type which fits all of the data. For instance, if
x = [ 1, 2.2 ]
I would want such a method to return float, or better yet numpy.float64.
Intuitively, I thought that this was the purpose of the numpy.dtype method. However, that is not the case. That method is used to create a type, not extract a type.
The only method that I know of getting a type is to wrap whatever object is passed in with a numpy array, and then get the dtype:
import numpy

def dtype(x):
    return numpy.asarray(x).dtype
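For example, this returns the promoted dtype for a plain list and the existing dtype for an array:
dtype([1, 2.2])                             # dtype('float64')
dtype(numpy.array([1, 2], numpy.float32))   # dtype('float32')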
The issue with this approach, however, is that it will copy the array if it is not already a numpy array. In this circumstance, that is extremely heavy for such a simple operation.
So is there a numpy method that I can use which won't require me to do any list copies?
EDIT
I am designing a library for doing some geometric manipulations... Conversions between rotation matrices, rotation vectors, quaternions, euler angles, etc.
It can easily happen that the user is simply working with a single rotation vector (which has 3 elements). In that case, they might write something like
q = vectorToQuaternion([ .1, 0, 0 ])
In this case, I would want the output quaternion to be a numpy array of type numpy.float64. However, sometimes to speed up the calculations, the user might want to use a numpy array of float32's:
q = vectorToQuaternion(numpy.float32([ .1, 0, 0 ]))
In which case, I think it is natural to expect that the output is the same type.
The issue is that I cannot use the zeros_like function (or empty_like, etc) because a quaternion has 4 components, while a vector has 3. So internally, I have to do something like
def vectorToQuaternion(v):
    q = empty((4,), dtype=asarray(v).dtype)
    ...
If there was a way of using empty_like which extracts all of the properties of the input, but lets me specify the shape of the output, then that would be the ideal function for me. However, to my knowledge, you cannot specify the shape in the call to empty_like.
EDIT
Here are some gists for the class I am talking about, and a test class (so that you can see how I intend to use it).
Class: https://gist.github.com/mholzel/c3af45562a56f2210270d9d1f292943a
Tests: https://gist.github.com/mholzel/1d59eecf1e77f21be7b8aadb37cc67f2
If you really want to do it that way, you will probably have to use np.asarray, but I'm not sure that's the most solid way of dealing with the problem. If the user forgets to add a decimal point and gives [1, 0, 0], then you will be creating integer outputs, which most definitely does not make sense for quaternions. I would default to np.float64, using the dtype of the input if it is an array of some float type, and maybe also giving the option to explicitly pass a dtype:
import numpy as np

def vectorToQuaternion(v, dtype=None):
    if dtype is None:
        if isinstance(v, np.ndarray) and np.issubdtype(v.dtype, np.floating):
        # Or, if you prefer:
        # if np.issubdtype(getattr(v, 'dtype', np.int64), np.floating):
            dtype = v.dtype
        else:
            dtype = np.float64
    q = np.empty((4,), dtype=dtype)
    # ...
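Assuming the elided body goes on to fill in and return q, the dtype then propagates as intended:
vectorToQuaternion(np.float32([.1, 0, 0]))       # would give a float32 quaternion
vectorToQuaternion([.1, 0, 0])                   # a plain list falls back to float64
vectorToQuaternion([1, 0, 0], dtype=np.float32)  # an explicit dtype always wins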

Is there any performance reason to use ndim 1 or 2 vectors in numpy?

This seems like a pretty basic question, but I didn't find anything related to it on stack. Apologies if I missed an existing question.
I've seen some mathematical/linear algebraic reasons why one might want to use numpy vectors "proper" (i.e. ndim 1), as opposed to row/column vectors (i.e. ndim 2).
But now I'm wondering: are there any (significant) efficiency reasons why one might pick one over the other? Or is the choice pretty much arbitrary in that respect?
(edit) To clarify: By "ndim 1 vs ndim 2 vectors" I mean representing a vector that contains, say, numbers 3 and 4 as either:
np.array([3, 4]) # ndim 1
np.array([[3, 4]]) # ndim 2
The numpy documentation seems to lean towards the first case as the default, but like I said, I'm wondering if there's any performance difference.
If you use numpy properly, then no - it is not a consideration.
If you look at the numpy internals documentation, you can see that
Numpy arrays consist of two major components, the raw array data (from now on, referred to as the data buffer), and the information about the raw array data. The data buffer is typically what people think of as arrays in C or Fortran, a contiguous (and fixed) block of memory containing fixed sized data items. Numpy also contains a significant set of data that describes how to interpret the data in the data buffer.
So, irrespective of the dimensions of the array, all data is stored in a contiguous buffer. Now consider
a = np.array([1, 2, 3, 4])
and
b = np.array([[1, 2], [3, 4]])
It is true that accessing a[1] requires (slightly) fewer operations than b[1, 1] (as translating the indices 1, 1 to a flat offset requires some calculation), but, for high performance, vectorized operations are required anyway.
If you want to sum all elements in the arrays, then, in both cases, you would use the same thing: a.sum() and b.sum(), and the sum would run over elements in contiguous memory either way. Conversely, if the data is inherently 2D, then you could do things like b.sum(axis=1) to sum over rows. Doing this yourself on a 1D array would be error-prone, and no more efficient.
So, basically, a 2D array, if it is natural for the problem, just gives greater functionality, with zero or negligible overhead.
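A small sketch of the point: the same contiguous buffer can be viewed as 1D or 2D, reductions walk the same memory either way, and the 2D shape just adds axis-aware operations:
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)   # ndim 1
b = a.reshape(1000, 1000)                    # ndim 2 view of the same buffer

assert a.sum() == b.sum()   # identical reduction over the same contiguous memory
row_sums = b.sum(axis=1)    # the kind of axis-aware operation only the 2D shape offers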

Replacing missing values with random in a numpy array

I have a 2D numpy array with binary data, i.e. 0s and 1s (not observed or observed). For some instances, that information is missing (NaN). Since the missing values are random in the data set, I think the best way to replace them would be using random 0s and 1s.
Here is some example code:
import numpy as np
row, col = 10, 5
matrix = np.random.randint(2, size=(row,col))
matrix = matrix.astype(float)
matrix[1,2] = np.nan
matrix[5,3] = np.nan
matrix[8,0] = np.nan
matrix[np.isnan(matrix)] = np.random.randint(2)
The problem with this is that all NaNs are replaced with the same value, either 0 or 1, while I would like both. Is there a simpler solution than for example a for loop calling each NaN separately? The data set I'm working on is a lot bigger than this example.
Try
nan_mask = np.isnan(matrix)
matrix[nan_mask] = np.random.randint(0, 2, size=np.count_nonzero(nan_mask))
You can use a vectorized function:
random_replace = np.vectorize(lambda x: np.random.randint(2) if np.isnan(x) else x)
random_replace(matrix)
Since the missing values are random in the data set, I think the best way to replace them would be using random 0s and 1s.
I'd heartily contradict you here. Unless you have a stochastic model that justifies assuming equal probability for each element to be either 0 or 1, that choice will bias your observations.
Now, I don't know where your data comes from, but "2D array" sure sounds like an image signal, or something of the like. You can find that most of the energy in many signal types is in low frequencies; if something of the like is the case for you, you can probably get lesser distortion by replacing the missing values with an element of a low-pass filtered version of your 2D array.
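One possible sketch of that idea, using scipy.ndimage purely for illustration (an assumption, not part of the original suggestion), thresholding back to {0, 1} since the observations are binary:
import numpy as np
from scipy import ndimage

nan_mask = np.isnan(matrix)
filled = np.where(nan_mask, 0.5, matrix)             # neutral placeholder for the NaNs
smoothed = ndimage.uniform_filter(filled, size=3)    # simple low-pass (local mean) filter
matrix[nan_mask] = (smoothed[nan_mask] > 0.5).astype(float)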
Either way, since you need to call numpy.isnan from python to check whether a value is NaN, I think the only way to solve this is writing an efficient loop, unless you want to senselessly calculate a huge random 2D array just to fill in a few missing numbers.
EDIT: oh, I like the vectorized version; it's effectively what I'd call an efficient loop, since it does the looping without interpreting a Python loop iteration each time.
EDIT2: the mask method with counting nonzeros is even more effective, I guess :)
