Removing NaNs in numpy arrays - python

I have two numpy arrays that contain NaNs:
A = np.array([np.nan, 2, np.nan, 3, 4])
B = np.array([1, 2, 3, 4, np.nan])
Is there any smart way, using numpy, to remove the NaNs in both arrays, and also remove whatever is at the corresponding index in the other array?
Making it look like this:
A = array([2., 3.])
B = array([2., 4.])

What you could do is add the two arrays together; this will propagate NaN to every position where either array has one. You can then use that to generate a boolean mask and index into your original numpy arrays:
In [193]:
A = np.array([np.nan, 2, np.nan, 3, 4])
B = np.array([1, 2, 3, 4, np.nan])
idx = np.where(~np.isnan(A + B))
print(A[idx])
print(B[idx])
[ 2.  3.]
[ 2.  4.]
output from A+B:
In [194]:
A+B
Out[194]:
array([ nan,   4.,  nan,   7.,  nan])
EDIT
As @Oliver W. correctly pointed out, the np.where is unnecessary, as np.isnan will produce a boolean mask that you can use to index into the arrays directly:
In [199]:
A = np.array([np.nan, 2, np.nan, 3, 4])
B = np.array([1, 2, 3, 4, np.nan])
idx = ~np.isnan(A + B)
print(A[idx])
print(B[idx])
[ 2.  3.]
[ 2.  4.]

You can also build the mask explicitly from both arrays, which avoids computing the intermediate A + B:
A[~(np.isnan(A) | np.isnan(B))]
B[~(np.isnan(A) | np.isnan(B))]
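If you need this more than once, the mask logic is easy to wrap in a small function; a minimal sketch (drop_paired_nans is a hypothetical name, assuming equal-length 1-D float arrays):
import numpy as np

def drop_paired_nans(a, b):
    # Keep only the positions where neither array holds a NaN
    # (hypothetical helper wrapping the answers above).
    mask = ~(np.isnan(a) | np.isnan(b))
    return a[mask], b[mask]

A = np.array([np.nan, 2, np.nan, 3, 4])
B = np.array([1, 2, 3, 4, np.nan])
A, B = drop_paired_nans(A, B)
print(A)  # [2. 3.]
print(B)  # [2. 4.]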


Avoid output via negative index for numpy

Given a numpy array A such as:
[[   0.  482. 1900.  961.  579.   56.]
 [   0.  530. 1906.  914.  584.   44.]
 [  43.    0. 1932.  948.  556.   51.]
 [   0.  482. 1917.  946.  581.   52.]
 [   0.  520. 1935.  878.  589.   55.]]
I am getting the element I need to filter like this:
C = array([-1, -1, -1, 1, 2], dtype=int64)
R = array([[-2, -5],
           [-1, -5],
           [ 0, -4],
           [ 1, -3],
           [ 2, -2],
           [ 3, -1]])
Extracting this way: A[R.T, C]
Issue: Negative indexing is giving me trouble. I would like to get NaN for the entries with either R or C or both <0. Is this possible?
For this exercise I assume your R or C arrays have the wrong shape (because for your data A[R.T, C] raises a shape-mismatch error), so I choose to remove one row from R:
import numpy as np
A = np.arange(30).reshape(5, 6)
C = np.array([-1, -1, -1, 1, 2], dtype=np.int64)
R = np.array([[-2, -5],
              [ 0, -4],
              [ 1, -3],
              [ 2, -2],
              [ 3, -1]])
and then A[R.T, C] evaluates to:
array([[23,  5, 11, 13, 20],
       [ 5, 11, 17, 19, 26]])
Now I assume you're trying to replace every value that would otherwise be supplied by a negative index with NaN. That leads me to an expected result of:
array([[nan, nan, nan, 13., 20.],
       [nan, nan, nan, nan, nan]])
Is that a correct expectation?
If so, I'd suggest appending a row and a column of NaN values to A and replacing all negative indices with -1, which will do the trick:
nan_row = np.full((1, 6), np.nan)
nan_col = np.full((6, 1), np.nan)
A = np.r_[A, nan_row]
A = np.c_[A, nan_col]
R[R < 0] = -1
C[C < 0] = -1
A[R.T, C] now returns the expected array:
array([[nan, nan, nan, 13., 20.],
       [nan, nan, nan, nan, nan]])
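If you'd rather not mutate A, R, or C in place, the same padding trick can be wrapped in a function; take_or_nan is a hypothetical name, and this is a minimal sketch assuming a 2-D A and integer index arrays that broadcast against each other:
import numpy as np

def take_or_nan(A, rows, cols):
    # Pad A with one NaN row and one NaN column, then route every
    # negative index to that padding (hypothetical helper).
    P = np.full((A.shape[0] + 1, A.shape[1] + 1), np.nan)
    P[:-1, :-1] = A
    return P[np.where(rows < 0, -1, rows), np.where(cols < 0, -1, cols)]

A = np.arange(30).reshape(5, 6)
C = np.array([-1, -1, -1, 1, 2])
R = np.array([[-2, -5], [0, -4], [1, -3], [2, -2], [3, -1]])
print(take_or_nan(A, R.T, C))
# [[nan nan nan 13. 20.]
#  [nan nan nan nan nan]]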

Replacing values in n-dimensional tensor given indices from np.argwhere()

I'm somewhat new to numpy so this might be a dumb question, but here goes:
Let's say I have a tensor of any shape and size, say (100,5,5) or (3,3,10,15,4). I have a randomly generated list of indices for points I want to replace with np.nan. For a (3,3,3) test case, it would be as follows:
>> data = np.random.randn(3,3,3)
>> data
array([[[ 0.21368315, -1.42814113,  1.23021783],
        [ 0.25835315,  0.44775156, -1.20489094],
        [ 0.25928972,  0.39486046, -1.79189447]],

       [[ 2.24080908, -0.89617961, -0.29550817],
        [ 0.21756087,  1.33996913, -1.24418745],
        [-0.63617598,  0.56848439,  0.8175564 ]],

       [[ 0.61367002, -1.16104071, -0.53488283],
        [ 1.0363354 , -0.76888041,  1.24524786],
        [-0.84329375, -0.61744489,  1.50502058]]])
>> idxs = np.argwhere(np.isfinite(data))
>> dropidxs = idxs[np.random.choice(idxs.shape[0], 3, replace=False)]
>> dropidxs
array([[1, 1, 1],
       [2, 0, 2],
       [2, 1, 0]])
How do I replace the corresponding values? Previously, when I was only dealing with the 3D case, I did it using the following.
for idx in dropidxs:
    i, j, k = idx
    missingCube[i, j, k] = np.nan
But now, I want the function to be able to handle tensors of any size.
I've tried
for idx in dropidxs:
    missingCube[idx] = np.nan
and
missingCube[dropidxs] = np.nan
But both (unsurprisingly) end up removing a corresponding slice along axis=0. How should I approach this? Is there an easier way to achieve what I'm trying to do?
In [486]: data = np.random.randn(3,3,3)
With this creation all terms are finite, so nonzero returns a tuple of (27,) arrays:
In [487]: idx = np.nonzero(np.isfinite(data))
In [488]: len(idx)
Out[488]: 3
In [489]: idx[0].shape
Out[489]: (27,)
argwhere produces the same numbers, but in a 2d array:
In [490]: idxs = np.argwhere(np.isfinite(data))
In [491]: idxs.shape
Out[491]: (27, 3)
So you select a subset.
In [492]: dropidxs = idxs[np.random.choice(idxs.shape[0], 3, replace=False)]
In [493]: dropidxs.shape
Out[493]: (3, 3)
In [494]: dropidxs
Out[494]:
array([[1, 1, 0],
       [2, 1, 2],
       [2, 1, 1]])
We could have generated the same subset by drawing x = np.random.choice(...) once and applying that x to each of the arrays in the idx tuple. But in this case, the argwhere array is easier to work with.
But to apply that array to indexing we still need a tuple of arrays:
In [495]: tup = tuple([dropidxs[:,i] for i in range(3)])
In [496]: tup
Out[496]: (array([1, 2, 2]), array([1, 1, 1]), array([0, 2, 1]))
In [497]: data[tup]
Out[497]: array([-0.27965058, 1.2981397 , 0.4501406 ])
In [498]: data[tup]=np.nan
In [499]: data
Out[499]:
array([[[-0.4899279 ,  0.83352547, -1.03798762],
        [-0.91445783,  0.05777183,  0.19494065],
        [ 0.6835925 , -0.47846423,  0.13513958]],

       [[-0.08790631,  0.30224828, -0.39864576],
        [        nan, -0.77424244,  1.4788093 ],
        [ 0.41915952, -0.09335664, -0.47359613]],

       [[-0.40281937,  1.64866377, -0.40354504],
        [ 0.74884493,         nan,         nan],
        [ 0.13097487, -1.63995208, -0.98857852]]])
Or we could index with:
In [500]: data[dropidxs[:,0],dropidxs[:,1],dropidxs[:,2]]
Out[500]: array([nan, nan, nan])
Actually, a transpose of dropidxs might be more convenient:
In [501]: tdrop = dropidxs.T
In [502]: tuple(tdrop)
Out[502]: (array([1, 2, 2]), array([1, 1, 1]), array([0, 2, 1]))
In [503]: data[tuple(tdrop)]
Out[503]: array([nan, nan, nan])
Sometimes we can use * to expand a list/array into a tuple, but not when indexing (at least not on this Python version; 3.11 and later do accept a starred expression inside a subscript):
In [504]: data[*tdrop]
File "<ipython-input-504-cb619d907adb>", line 1
data[*tdrop]
^
SyntaxError: invalid syntax
but we can create the tuple with:
In [506]: data[(*tdrop,)]
Out[506]: array([nan, nan, nan])
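To wrap the tuple-conversion into something that handles a tensor of any size (the question's actual goal), here is a minimal sketch; drop_random_entries is a hypothetical name:
import numpy as np

def drop_random_entries(arr, n, seed=None):
    # Set n randomly chosen entries of arr to NaN, regardless of ndim
    # (hypothetical helper built on the tuple-indexing shown above).
    rng = np.random.default_rng(seed)
    idxs = np.argwhere(np.isfinite(arr))                # shape (count, ndim)
    drop = idxs[rng.choice(idxs.shape[0], n, replace=False)]
    arr[tuple(drop.T)] = np.nan                         # tuple of per-axis index arrays
    return arr

data = drop_random_entries(np.random.randn(3, 3, 10, 15, 4), 7)
print(np.isnan(data).sum())  # 7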
Is this what you're searching for?
import numpy as np
x = np.random.randn(10, 3, 3, 3)
new_value = 0
x[x < 0] = new_value
or
x[x == -np.inf] = 0
You can choose from the flattened indices and convert them back to data indices to set elements to np.nan. Here with a seeded generator (seed 41) to make the results reproducible, choosing 3 elements.
import numpy as np
data = np.random.randn(3,3,3)
rng = np.random.default_rng(41)
idx = rng.choice(np.arange(data.size), 3, replace=False)
data[np.unravel_index(idx, data.shape)] = np.nan
data
Output
array([[[ 0.13180452, -0.81228319, -0.04456739],
        [ 0.53060077, -0.2246579 ,  1.83926463],
        [-0.38670047, -0.53703577,  0.49275628]],

       [[ 0.36671354,  1.44012848, -0.57209412],
        [ 0.53960111, -1.06578638,  1.10669842],
        [ 1.1772824 ,         nan, -0.82792041]],

       [[-0.03352594,  0.29351109,  0.57021538],
        [-0.33291872,         nan,  0.04675677],
        [        nan,  2.59450517, -1.9579655 ]]])
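The flat-index route needs no per-axis bookkeeping at all, so it also generalizes to any shape; a minimal sketch (drop_random_flat is a hypothetical name):
import numpy as np

def drop_random_flat(data, n, seed=None):
    # Draw n distinct flat positions and map them back to n-dimensional
    # indices with unravel_index (hypothetical helper).
    rng = np.random.default_rng(seed)
    flat = rng.choice(data.size, n, replace=False)
    data[np.unravel_index(flat, data.shape)] = np.nan
    return data

x = drop_random_flat(np.random.randn(100, 5, 5), 10)
print(np.isnan(x).sum())  # 10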

How do I correctly handle a multi-dimensional numpy array?

I'm a Python newbie and struggling a bit with multi-dimensional arrays in a for loop. What I have is:
CLASSES = ["background", "aeroplane", "bicycle", "bird", "boat",
"bottle", "bus", "car", "cat", "chair", "cow", "diningtable",
"dog", "horse", "motorbike", "person", "pottedplant", "sheep",
"sofa", "train", "tvmonitor"]
...
...
idxs = np.argsort(preds[0])[::-1][:5]
print(idxs)
# loop over the top 5 predictions & display them
for (i, idx) in enumerate(idxs):
    # draw the top prediction on the input image
    print(idx)
    if i == 0:
        print(preds)
        text = "Label: {}, {:.2f}%".format(CLASSES[idx], preds[0][idx] * 100)
        cv2.putText(frame, text, (5, 25), cv2.FONT_HERSHEY_SIMPLEX,
                    0.7, (0, 0, 255), 2)
    # display the predicted label + associated probability to the console
    print("[INFO] {}. label: {}, probability: {:.5}".format(i + 1, CLASSES[idx], preds[0][idx]))
and I get something like:
[[[ 0.          7.          0.3361728   0.2269333   0.6589312
    0.70067763  0.8960621 ]
  [ 0.         15.          0.44955394  0.5509065   0.4315516
    0.6530549   0.7223625 ]]]
[[[0 3 2 4 5 6 1]
  [0 4 2 3 5 6 1]]]
[[0 3 2 4 5 6 1]
 [0 4 2 3 5 6 1]]
[[[[ 0.          7.          0.3361728   0.2269333   0.6589312
     0.70067763  0.8960621 ]
   [ 0.         15.          0.44955394  0.5509065   0.4315516
     0.6530549   0.7223625 ]]]]
Traceback (most recent call last):
File "real_time_object_detection.py", line 80, in <module>
text = "Label: {}, {:.2f}%".format(CLASSES[idx], preds[0][idx] * 100)
TypeError: only integer scalar arrays can be converted to a scalar index
I've copied this code from https://www.pyimagesearch.com/2017/08/21/deep-learning-with-opencv/ but it looks like I'm doing something wrong, as idx should be an int but is an array instead.
UPDATE:
I tried to figure out what's going on here but I got stuck with the following: why do all argsort calls give the same result? :o
>>> preds[0] = [[[ 0., 7., 0.3361728, 0.2269333, 0.6589312,0.70067763, 0.8960621 ],[ 0., 15., 0.44955394, 0.5509065, 0.4315516,0.6530549, 0.7223625 ]]]
>>> print(preds[0])
[[[0.0, 7.0, 0.3361728, 0.2269333, 0.6589312, 0.70067763, 0.8960621], [0.0, 15.0, 0.44955394, 0.5509065, 0.4315516, 0.6530549, 0.7223625]]]
>>> import numpy as np
>>> np.argsort(preds[0])
array([[[0, 3, 2, 4, 5, 6, 1],
        [0, 4, 2, 3, 5, 6, 1]]])
>>> np.argsort(preds[0])[::-1]
array([[[0, 3, 2, 4, 5, 6, 1],
        [0, 4, 2, 3, 5, 6, 1]]])
>>> np.argsort(preds[0])[::-1][:5]
array([[[0, 3, 2, 4, 5, 6, 1],
        [0, 4, 2, 3, 5, 6, 1]]])
Plus, why does it seem to alter the data? Shouldn't it just sort it?
Your preds[0], assigned to a variable name, is a 3d array:
In [449]: preds0 = np.array([[[ 0., 7., 0.3361728, 0.2269333, 0.6589312, 0.70067763, 0.8960621],
     ...:                     [ 0., 15., 0.44955394, 0.5509065, 0.4315516, 0.6530549, 0.7223625]]])
In [450]: preds0.shape
Out[450]: (1, 2, 7)
argsort applied to that is an array of the same shape:
In [451]: np.argsort(preds0)
Out[451]:
array([[[0, 3, 2, 4, 5, 6, 1],
        [0, 4, 2, 3, 5, 6, 1]]])
In [452]: _.shape
Out[452]: (1, 2, 7)
With that size-1 initial dimension, no amount of reversing or slicing on that dimension makes a difference. I suspect you wanted to reverse and slice the last dimension, the size-7 one. BUT be careful about that: the argsort of a multidimensional array, even when applied to one dimension (the default last), is a hard thing to understand and to use.
The shape matches the array, but the values are the range 0-6, indexing along the last dimension. numpy 1.15 added a couple of functions to make it easier to use the result of argsort (and some other functions):
In [455]: np.take_along_axis(preds0, Out[451], axis=-1)
Out[455]:
array([[[ 0.        ,  0.2269333 ,  0.3361728 ,  0.6589312 ,
          0.70067763,  0.8960621 ,  7.        ],
        [ 0.        ,  0.4315516 ,  0.44955394,  0.5509065 ,
          0.6530549 ,  0.7223625 , 15.        ]]])
Notice that rows are now sorted, same as produced by np.sort(preds0, axis=-1).
I could pick one 'row' of the index array:
In [459]: idxs = Out[451]
In [461]: idx = idxs[0,0]
In [462]: idx
Out[462]: array([0, 3, 2, 4, 5, 6, 1])
In [463]: idx[::-1] # reverse
Out[463]: array([1, 6, 5, 4, 2, 3, 0])
In [464]: idx[::-1][:5] # select
Out[464]: array([1, 6, 5, 4, 2])
In [465]: preds0[0,0,Out[464]]
Out[465]: array([7. , 0.8960621 , 0.70067763, 0.6589312 , 0.3361728 ])
Now I have the five largest values of preds0[0,0,:] in reverse order.
And to do it to the whole preds0 array:
np.take_along_axis(preds0, idxs[:,:,::-1][:,:,:5], axis=-1)
or for earlier versions:
preds0[[0], [[0],[1]], idxs[:,:,::-1][:,:,:5]]
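Pulled together as a self-contained snippet, the fix for the original loop is to reverse and slice the last axis of one 1-D row of scores rather than the first axis of the 3-D array; a minimal sketch of the steps above:
import numpy as np

preds0 = np.array([[[0., 7., 0.3361728, 0.2269333, 0.6589312, 0.70067763, 0.8960621],
                    [0., 15., 0.44955394, 0.5509065, 0.4315516, 0.6530549, 0.7223625]]])

row = preds0[0, 0]                  # pick one 1-D row of scores
top5 = np.argsort(row)[::-1][:5]    # reverse and slice the LAST axis, not the first
print(top5)       # [1 6 5 4 2]
print(row[top5])  # [7.         0.8960621  0.70067763 0.6589312  0.3361728 ]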

Convert a Pandas DataFrame to a multidimensional ndarray

I have a DataFrame with columns for the x, y, z coordinates and the value at this position and I want to convert this to a 3-dimensional ndarray.
To make things more complicated, not all values exist in the DataFrame (these can just be replaced by NaN in the ndarray).
Just a simple example:
df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2],
                   'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
                   'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
Should result in the ndarray:
array([[[ 1.,  2., nan],
        [ 3., nan,  4.]],

       [[ 5.,  6.,  7.],
        [ 8.,  9., nan]]])
For two dimensions, this is easy:
array = df.pivot_table(index="y", columns="x", values="value").as_matrix()
However, this method cannot be applied to three or more dimensions.
Could you give me some suggestions?
Bonus points if this also works for more than three dimensions, handles multiple defined values (by taking the average) and ensures that all x, y, z coordinates are consecutive (by inserting row/columns of NaN when a coordinate is missing).
EDIT: Some more explanations:
I read data from a CSV file which has columns for the x, y, z coordinates, optionally the frequency, and the measurement value at that point and frequency. Then I round the coordinates to a specified precision (e.g. 0.1 m) and want to get an ndarray which contains the averaged measurement values at each (rounded) coordinate. The indices of the values do not need to coincide with the locations, but they need to be in the correct order.
EDIT: I just ran a quick performance test: jakevdp's solution takes 1.598 s, Divakar's takes 7.405 s, JohnE's takes 7.867 s, and Wen's takes 6.286 s to complete.
You can use a groupby followed by the approach from Transform Pandas DataFrame with n-level hierarchical index into n-D Numpy array:
grouped = df.groupby(['z', 'y', 'x'])['value'].mean()
# create an empty array of NaN of the right dimensions
shape = tuple(map(len, grouped.index.levels))
arr = np.full(shape, np.nan)
# fill it using Numpy's advanced indexing
arr[grouped.index.labels] = grouped.values.flat
print(arr)
# [[[ 1.  2. nan]
#   [ 3. nan  4.]]
#
#  [[ 5.  6.  7.]
#   [ 8.  9. nan]]]
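Note for current pandas: MultiIndex.labels was renamed to MultiIndex.codes in pandas 0.24 (and labels was later removed), so on a recent install the same approach would look like this sketch:
grouped = df.groupby(['z', 'y', 'x'])['value'].mean()
shape = tuple(map(len, grouped.index.levels))
arr = np.full(shape, np.nan)
arr[tuple(grouped.index.codes)] = grouped.to_numpy()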
Here's one NumPy approach -
def dataframe_to_array_averaged(df):
    arr = df[['z', 'y', 'x']].values
    arr -= arr.min(0)
    out_shp = arr.max(0) + 1
    L = np.prod(out_shp)
    val = df['value'].values
    ids = np.ravel_multi_index(arr.T, out_shp)
    avgs = np.bincount(ids, val, minlength=L) / np.bincount(ids, minlength=L)
    return avgs.reshape(out_shp)
Note that this shows a warning, because places with no x,y,z triplet have zero counts, so the average there is 0/0 = NaN; since that's the expected output for those places, you can ignore the warning. To avoid the warning, we can employ indexing, as discussed in the alternative method below.
Sample run -
In [106]: df
Out[106]:
   value  x  y  z
0      1  1  1  1   # <=== this is repeated
1      2  2  1  1
2      3  1  2  1
3      4  3  2  1
4      5  1  1  2
5      6  2  1  2
6      7  3  1  2
7      8  1  2  2
8      9  2  2  2
9      4  1  1  1   # <=== this is repeated
In [107]: dataframe_to_array_averaged(df)
__main__:42: RuntimeWarning: invalid value encountered in divide
Out[107]:
array([[[ 2.5,  2. ,  nan],
        [ 3. ,  nan,  4. ]],

       [[ 5. ,  6. ,  7. ],
        [ 8. ,  9. ,  nan]]])
Alternative method
To avoid the warning, an alternative way would be to fill only the bins that actually occur, like so -
out = np.full(out_shp, np.nan)
sums = np.bincount(ids, val)
unq_ids, count = np.unique(ids, return_counts=True)
out.flat[unq_ids] = sums[unq_ids] / count
Another solution is to use the xarray package:
import pandas as pd
import xarray as xr
df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2],
                   'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
                   'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
df = pd.pivot_table(df, values='value', index=['x', 'y', 'z'])
xrTensor = xr.DataArray(df).unstack("dim_0")
array = xrTensor.values[0].T
print(array)
Output:
array([[[ 1.,  2., nan],
        [ 3., nan,  4.]],

       [[ 5.,  6.,  7.],
        [ 8.,  9., nan]]])
Note that the xrTensor object is very handy, since xarray's DataArrays carry their labels, so you may just go on with that object rather than pulling out the ndarray:
print(xrTensor)
Output:
<xarray.DataArray (dim_1: 1, x: 3, y: 2, z: 2)>
array([[[[ 1.,  5.],
         [ 3.,  8.]],

        [[ 2.,  6.],
         [nan,  9.]],

        [[nan,  7.],
         [ 4., nan]]]])
Coordinates:
  * dim_1    (dim_1) object 'value'
  * x        (x) int64 1 2 3
  * y        (y) int64 1 2
  * z        (z) int64 1 2
We can use stack:
np.reshape(df.groupby(['z', 'y', 'x'])['value'].mean().unstack([1,2]).stack([0,1],dropna=False).values,(2,2,3))
Out[451]:
array([[[ 1.,  2., nan],
        [ 3., nan,  4.]],

       [[ 5.,  6.,  7.],
        [ 8.,  9., nan]]])
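Unpacked for readability, the same chain looks like this (a sketch; the hard-coded (2, 2, 3) shape comes from this example's coordinate ranges):
result = (df.groupby(['z', 'y', 'x'])['value']
            .mean()                        # average any duplicate (z, y, x) entries
            .unstack([1, 2])               # move the y and x levels into the columns
            .stack([0, 1], dropna=False)   # restack them, keeping missing combos as NaN
            .values
            .reshape(2, 2, 3))             # (len(z), len(y), len(x)) for this example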

Get subset of rows in the numpy matrix based on the values from the column of another matrix

The title looks complicated, but the problem is not that hard. I have 2 matrices: data_X and data_Y. I have to construct a new matrix based on data_X, which will consist of all the rows of data_X where the corresponding value in column column of data_Y is not equal to someNumber, and the same for data_Y. For example, here is a 5 by 2 data_X matrix and a 5 by 1 data_Y matrix, with column = 0 and someNumber = -1.
[[ 0.09580361  0.11221975]
 [ 0.71409124  0.24583188]
 [ 0.67346718  0.72550385]
 [ 0.40641294  0.01172211]
 [ 0.89974846  0.70378831]]  # data_X
and data_Y = np.array([[5], [-1], [4], [2], [-1]]).
The result would be:
[[ 0.09580361  0.11221975]
 [ 0.67346718  0.72550385]
 [ 0.40641294  0.01172211]]
[5 4 2]
It is not hard to see that this can be achieved by the following:
data_x, data_y = [], []
for i in xrange(len(data_Y)):
    if data_Y[i][column] != someNumber:
        data_y.append(data_Y[i][column])
        data_x.append(data_X[i])
But I believe there is a far easier way (like 2 or 3 numpy operations) to get the results I need.
Use boolean indexing -
In [228]: X
Out[228]:
array([[ 0.09580361,  0.11221975],
       [ 0.71409124,  0.24583188],
       [ 0.67346718,  0.72550385],
       [ 0.40641294,  0.01172211],
       [ 0.89974846,  0.70378831]])
In [229]: Y
Out[229]:
array([[ 5],
       [-1],
       [ 4],
       [ 2],
       [-1]])
In [230]: mask = Y!=-1 # Create mask for boolean indexing
In [231]: X[mask.ravel()]
Out[231]:
array([[ 0.09580361,  0.11221975],
       [ 0.67346718,  0.72550385],
       [ 0.40641294,  0.01172211]])
In [232]: Y[mask]
Out[232]: array([5, 4, 2])
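As a self-contained recap using the question's own names (column and someNumber come from the question; the data values are the ones shown there):
import numpy as np

data_X = np.array([[0.09580361, 0.11221975],
                   [0.71409124, 0.24583188],
                   [0.67346718, 0.72550385],
                   [0.40641294, 0.01172211],
                   [0.89974846, 0.70378831]])
data_Y = np.array([[5], [-1], [4], [2], [-1]])
column, someNumber = 0, -1

mask = data_Y[:, column] != someNumber  # one boolean per row
print(data_X[mask])                     # rows of data_X where mask is True
print(data_Y[mask, column])             # [5 4 2]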
