Avoid output via negative index for a numpy array (Python)

Given a numpy array A such as:
[[  0. 482. 1900. 961. 579.  56.]
 [  0. 530. 1906. 914. 584.  44.]
 [ 43.   0. 1932. 948. 556.  51.]
 [  0. 482. 1917. 946. 581.  52.]
 [  0. 520. 1935. 878. 589.  55.]]
I am getting the indices I need for filtering like this:
C = array([-1, -1, -1, 1, 2], dtype=int64)
R = array([[-2, -5],
           [-1, -5],
           [ 0, -4],
           [ 1, -3],
           [ 2, -2],
           [ 3, -1]])
Extracting this way: A[R.T, C]
Issue: negative indexing is giving me trouble. I would like to get NaN for the entries where R or C (or both) is < 0. Is this possible?

As an exercise I assume your R or C arrays have the wrong shape (for your data, A[R.T, C] raises a shape-mismatch error), so I chose to remove one row from R:
import numpy as np
A = np.arange(30).reshape(5, 6)
C = np.array([-1, -1, -1, 1, 2], dtype=np.int64)
R = np.array([[-2, -5],
              [ 0, -4],
              [ 1, -3],
              [ 2, -2],
              [ 3, -1]])
and then A[R.T, C] evaluates to:
array([[23,  5, 11, 13, 20],
       [ 5, 11, 17, 19, 26]])
Now I assume you're, for some reason, trying to replace all values that would otherwise be supplied by a negative index with the NaN value. That leads me to an expected result of:
array([[nan, nan, nan, 13., 20.],
       [nan, nan, nan, nan, nan]])
Is that a correct expectation?
If so, I'd suggest adding a row and a column of NaN values to A and replacing all negative indices with -1, which will do the trick:
nan_row = np.full((1, 6), np.nan)
nan_col = np.full((6, 1), np.nan)
A = np.r_[A, nan_row]   # A is now (6, 6); row index -1 hits the NaN row
A = np.c_[A, nan_col]   # A is now (6, 7); column index -1 hits the NaN column
R[R < 0] = -1           # collapse every negative index to -1
C[C < 0] = -1
A[R.T, C] now returns the expected array:
array([[nan, nan, nan, 13., 20.],
       [nan, nan, nan, nan, nan]])
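For reference, a self-contained sketch of the same idea that pads A with np.pad instead of np.r_/np.c_ and leaves the original index arrays untouched (the A_pad, R2 and C2 names are my own choices, not from the answer above):
import numpy as np

A = np.arange(30, dtype=float).reshape(5, 6)
C = np.array([-1, -1, -1, 1, 2], dtype=np.int64)
R = np.array([[-2, -5],
              [ 0, -4],
              [ 1, -3],
              [ 2, -2],
              [ 3, -1]])

# Pad one NaN row and one NaN column; index -1 then lands on NaN.
A_pad = np.pad(A, ((0, 1), (0, 1)), constant_values=np.nan)
R2 = np.where(R < 0, -1, R)   # collapse all negative indices to -1
C2 = np.where(C < 0, -1, C)
print(A_pad[R2.T, C2])
# [[nan nan nan 13. 20.]
#  [nan nan nan nan nan]]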

Related

Replacing values in n-dimensional tensor given indices from np.argwhere()

I'm somewhat new to numpy so this might be a dumb question, but here goes:
Let's say I have a tensor of any shape and size, say (100,5,5) or (3,3,10,15,4). I have a randomly generated list of indices for points I want to replace with np.nan. For a (3,3,3) test case, it would be as follows:
>>> data = np.random.randn(3,3,3)
>>> data
array([[[ 0.21368315, -1.42814113,  1.23021783],
        [ 0.25835315,  0.44775156, -1.20489094],
        [ 0.25928972,  0.39486046, -1.79189447]],
       [[ 2.24080908, -0.89617961, -0.29550817],
        [ 0.21756087,  1.33996913, -1.24418745],
        [-0.63617598,  0.56848439,  0.8175564 ]],
       [[ 0.61367002, -1.16104071, -0.53488283],
        [ 1.0363354 , -0.76888041,  1.24524786],
        [-0.84329375, -0.61744489,  1.50502058]]])
>>> idxs = np.argwhere(np.isfinite(data))
>>> dropidxs = idxs[np.random.choice(idxs.shape[0], 3, replace=False)]
>>> dropidxs
array([[1, 1, 1],
       [2, 0, 2],
       [2, 1, 0]])
How do I replace the corresponding values? Previously, when I was only dealing with the 3D case, I did it using the following.
for idx in dropidxs:
    i, j, k = idx
    missingCube[i, j, k] = np.nan
But now, I want the function to be able to handle tensors of any size.
I've tried
for idx in dropidxs:
    missingCube[idx] = np.nan
and
missingCube[dropidxs] = np.nan
But both (unsurprisingly) end up overwriting a corresponding slice along axis=0 instead. How should I approach this? Is there an easier way to achieve what I'm trying to do?
In [486]: data = np.random.randn(3,3,3)
With this creation all terms are finite, so nonzero returns a tuple of (27,) arrays:
In [487]: idx = np.nonzero(np.isfinite(data))
In [488]: len(idx)
Out[488]: 3
In [489]: idx[0].shape
Out[489]: (27,)
argwhere produces the same numbers, but in a 2d array:
In [490]: idxs = np.argwhere(np.isfinite(data))
In [491]: idxs.shape
Out[491]: (27, 3)
So you select a subset.
In [492]: dropidxs = idxs[np.random.choice(idxs.shape[0], 3, replace=False)]
In [493]: dropidxs.shape
Out[493]: (3, 3)
In [494]: dropidxs
Out[494]:
array([[1, 1, 0],
       [2, 1, 2],
       [2, 1, 1]])
We could have generated the same subset with x = np.random.choice(...) and applied that x to the arrays in idx. But in this case, the argwhere array is easier to work with.
But to apply that array to indexing we still need a tuple of arrays:
In [495]: tup = tuple([dropidxs[:,i] for i in range(3)])
In [496]: tup
Out[496]: (array([1, 2, 2]), array([1, 1, 1]), array([0, 2, 1]))
In [497]: data[tup]
Out[497]: array([-0.27965058, 1.2981397 , 0.4501406 ])
In [498]: data[tup]=np.nan
In [499]: data
Out[499]:
array([[[-0.4899279 ,  0.83352547, -1.03798762],
        [-0.91445783,  0.05777183,  0.19494065],
        [ 0.6835925 , -0.47846423,  0.13513958]],
       [[-0.08790631,  0.30224828, -0.39864576],
        [        nan, -0.77424244,  1.4788093 ],
        [ 0.41915952, -0.09335664, -0.47359613]],
       [[-0.40281937,  1.64866377, -0.40354504],
        [ 0.74884493,         nan,         nan],
        [ 0.13097487, -1.63995208, -0.98857852]]])
Or we could index with:
In [500]: data[dropidxs[:,0],dropidxs[:,1],dropidxs[:,2]]
Out[500]: array([nan, nan, nan])
Actually, a transpose of dropidxs might be more convenient:
In [501]: tdrop = dropidxs.T
In [502]: tuple(tdrop)
Out[502]: (array([1, 2, 2]), array([1, 1, 1]), array([0, 2, 1]))
In [503]: data[tuple(tdrop)]
Out[503]: array([nan, nan, nan])
Sometimes we can use * to expand a list/array into a tuple, but not when indexing (on Python versions before 3.11 this is a syntax error):
In [504]: data[*tdrop]
File "<ipython-input-504-cb619d907adb>", line 1
data[*tdrop]
^
SyntaxError: invalid syntax
but we can create the tuple with:
In [506]: data[(*tdrop,)]
Out[506]: array([nan, nan, nan])
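To wrap this up, a minimal sketch of the tuple-indexing idea as a dimension-agnostic helper (the helper and its name are my own, not from the discussion above):
import numpy as np

def drop_at(data, dropidxs):
    # Set data to NaN at each row of dropidxs; works for any ndim,
    # because dropidxs.T yields one index array per axis.
    data[tuple(dropidxs.T)] = np.nan
    return data

cube = np.random.randn(3, 3, 3)
rows = np.argwhere(np.isfinite(cube))
drop_at(cube, rows[np.random.choice(rows.shape[0], 3, replace=False)])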
Is this what you're searching for:
import numpy as np
x = np.random.randn(10, 3, 3, 3)
new_value = 0
x[x < 0] = new_value
or
x[x == -np.inf] = 0
You can choose from flattened indices and convert back to data indices to set elements to np.nan. Here a generator seeded with 41 makes the chosen positions reproducible; three elements are picked.
import numpy as np
data = np.random.randn(3,3,3)
rng = np.random.default_rng(41)
idx = rng.choice(np.arange(data.size), 3, replace=False)
data[np.unravel_index(idx, data.shape)] = np.nan
data
Output
array([[[ 0.13180452, -0.81228319, -0.04456739],
        [ 0.53060077, -0.2246579 ,  1.83926463],
        [-0.38670047, -0.53703577,  0.49275628]],
       [[ 0.36671354,  1.44012848, -0.57209412],
        [ 0.53960111, -1.06578638,  1.10669842],
        [ 1.1772824 ,         nan, -0.82792041]],
       [[-0.03352594,  0.29351109,  0.57021538],
        [-0.33291872,         nan,  0.04675677],
        [        nan,  2.59450517, -1.9579655 ]]])
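Note that only the choice of positions is seeded above; the data itself still comes from the global generator. A small variant (my adjustment) draws the data from the same seeded generator, so the whole output is reproducible:
import numpy as np

rng = np.random.default_rng(41)
data = rng.standard_normal((3, 3, 3))          # seeded data
idx = rng.choice(data.size, 3, replace=False)  # seeded flat positions
data[np.unravel_index(idx, data.shape)] = np.nan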

How to replace different values in each column with NaN values?

Please let me know if anyone knows a better way to do the following.
I am trying to replace some values in a numpy array.
The replacement condition differs for each column.
Suppose I have a numpy array and list of nodata values like:
import numpy as np
array = np.array([[ 1,  2,  3],
                  [ 4,  5,  6],
                  [ 7,  8,  9],
                  [10, 11, 12]])
nodata_values = [4, 8, 3]
and what I want is the array with those values replaced, like:
array([[ 1.,  2., nan],
       [nan,  5.,  6.],
       [ 7., nan,  9.],
       [10., 11., 12.]])
I know I can do this:
np.array([np.where(i == nodata_values[idx], np.nan, i)
          for idx, i in enumerate(array.T)]).T
But this code uses a Python-level for loop, so applying it to a table with tens of thousands of rows takes time.
Use a broadcast comparison to create the boolean index: comparing the (4, 3) array against the length-3 list of nodata values matches each column against its own value. Use astype first to avoid ValueError: cannot convert float NaN to integer:
import numpy as np
array = array.astype(float)
array[array == np.array(nodata_values)] = np.nan
(Note that np.isin(array, nodata_values) is not equivalent here: it would replace a nodata value wherever it appears, not only in its own column.)
[[ 1.  2. nan]
 [nan  5.  6.]
 [ 7. nan  9.]
 [10. 11. 12.]]
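Putting the question's data and the fix together, a self-contained version for reference:
import numpy as np

array = np.array([[ 1,  2,  3],
                  [ 4,  5,  6],
                  [ 7,  8,  9],
                  [10, 11, 12]], dtype=float)
nodata_values = [4, 8, 3]

# (4, 3) == (3,) broadcasts: each column is compared to its own value.
array[array == np.array(nodata_values)] = np.nan
print(array)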

Create a mask from a matrix

Hi there, I have a matrix like this:
A = [[nan, 4, nan], [3, 7, 8], [nan, 23, nan]]
and I would like to get a mask from the matrix A that looks as follows:
mask = [[nan, 0, nan], [0, 0, 0], [nan, 0, nan]]
For that I have tried:
import numpy as np
A = np.array([[np.nan, 4, np.nan], [3, 7, 8], [np.nan, 23, np.nan]])
mask = A
mask[np.isfinite(A)] = 0
But this also deletes the numerical values of the matrix A.
You need to make a copy of A in order to keep the values in A, see: https://docs.python.org/2/library/copy.html
In your case this would be
A = np.array([[np.nan, 4, np.nan], [3, 7, 8], [np.nan, 23, np.nan]])
mask = A.copy()          # independent copy, so A keeps its values
mask[~np.isnan(A)] = 0
You could use a masked array, in order to mask those values that aren't np.nan, and fill the masked array with 0:
A = np.array([[np.nan, 4, np.nan],
              [3, 7, 8],
              [np.nan, 23, np.nan]])
np.ma.masked_array(A, mask=~np.isnan(A)).filled(0)
array([[nan,  0., nan],
       [ 0.,  0.,  0.],
       [nan,  0., nan]])
Using A[~np.isnan(A)]:
import numpy as np
A = [[np.nan, 4, np.nan], [3, 7, 8], [np.nan, 23, np.nan]]
A = np.array(A)
A[~np.isnan(A)] = 0
print(A)
OUTPUT:
[[nan  0. nan]
 [ 0.  0.  0.]
 [nan  0. nan]]
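For comparison, the same mask as a one-liner with np.where (my own addition, not one of the answers above):
import numpy as np

A = np.array([[np.nan, 4, np.nan],
              [3, 7, 8],
              [np.nan, 23, np.nan]])
mask = np.where(np.isnan(A), np.nan, 0.0)   # keep NaN, zero out the rest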

Convert a Pandas DataFrame to a multidimensional ndarray

I have a DataFrame with columns for the x, y, z coordinates and the value at this position and I want to convert this to a 3-dimensional ndarray.
To make things more complicated, not all values exist in the DataFrame (these can just be replaced by NaN in the ndarray).
Just a simple example:
df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2],
                   'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
                   'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
Should result in the ndarray:
array([[[ 1.,  2., nan],
        [ 3., nan,  4.]],
       [[ 5.,  6.,  7.],
        [ 8.,  9., nan]]])
For two dimensions, this is easy:
array = df.pivot_table(index="y", columns="x", values="value").to_numpy()
However, this method cannot be applied to three or more dimensions.
Could you give me some suggestions?
Bonus points if this also works for more than three dimensions, handles multiple defined values (by taking the average) and ensures that all x, y, z coordinates are consecutive (by inserting row/columns of NaN when a coordinate is missing).
EDIT: Some more explanations:
I read data from a CSV file which has columns for the x, y, z coordinates, optionally the frequency, and the measurement value at this point and frequency. Then I round the coordinates to a specified precision (e.g. 0.1 m) and want to get an ndarray which contains the averaged measurement values at each (rounded) coordinate. The indices of the values do not need to coincide with the locations, but they need to be in the correct order.
EDIT: I just ran a quick performance test:
The solution of jakevdp takes 1.598 s, Divakar's solution takes 7.405 s, JohnE's solution takes 7.867 s and Wen's solution takes 6.286 s to complete.
You can use a groupby followed by the approach from Transform Pandas DataFrame with n-level hierarchical index into n-D Numpy array:
grouped = df.groupby(['z', 'y', 'x'])['value'].mean()
# create an empty array of NaN of the right dimensions
shape = tuple(map(len, grouped.index.levels))
arr = np.full(shape, np.nan)
# fill it using NumPy's advanced indexing
# (MultiIndex.labels was renamed to MultiIndex.codes in modern pandas)
arr[tuple(grouped.index.codes)] = grouped.values.flat
print(arr)
# [[[ 1.  2. nan]
#   [ 3. nan  4.]]
#
#  [[ 5.  6.  7.]
#   [ 8.  9. nan]]]
Here's one NumPy approach -
def dataframe_to_array_averaged(df):
    arr = df[['z','y','x']].values
    arr -= arr.min(0)
    out_shp = arr.max(0) + 1
    L = np.prod(out_shp)
    val = df['value'].values
    ids = np.ravel_multi_index(arr.T, out_shp)
    avgs = np.bincount(ids, val, minlength=L) / np.bincount(ids, minlength=L)
    return avgs.reshape(out_shp)
Note that this shows a warning, because places with no x,y,z triplet have zero counts and the average there is 0/0 = NaN; since that's the expected output for those places, you can ignore the warning. To avoid it, we can employ indexing, as discussed in the alternative method below.
Sample run -
In [106]: df
Out[106]:
   value  x  y  z
0      1  1  1  1   # <=== this is repeated
1      2  2  1  1
2      3  1  2  1
3      4  3  2  1
4      5  1  1  2
5      6  2  1  2
6      7  3  1  2
7      8  1  2  2
8      9  2  2  2
9      4  1  1  1   # <=== this is repeated
In [107]: dataframe_to_array_averaged(df)
__main__:42: RuntimeWarning: invalid value encountered in divide
Out[107]:
array([[[ 2.5,  2. ,  nan],
        [ 3. ,  nan,  4. ]],
       [[ 5. ,  6. ,  7. ],
        [ 8. ,  9. ,  nan]]])
Alternative method
To avoid the warning, an alternative way is to assign only at the indices that actually occur, leaving the other cells as NaN:
out = np.full(out_shp, np.nan)
sums = np.bincount(ids, val)                        # per-index sums
unq_ids, count = np.unique(ids, return_counts=True)
out.flat[unq_ids] = sums[unq_ids] / count           # averages where data exists
Another solution is to use the xarray package:
import pandas as pd
import xarray as xr
df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2],
                   'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
                   'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
df = pd.pivot_table(df, values='value', index=['x', 'y', 'z'])
xrTensor = xr.DataArray(df).unstack("dim_0")
array = xrTensor.values[0].T
print(array)
Output:
array([[[ 1.,  2., nan],
        [ 3., nan,  4.]],
       [[ 5.,  6.,  7.],
        [ 8.,  9., nan]]])
Note that the xrTensor object is very handy, since xarray's DataArrays carry their labels, so you may just go on with that object rather than pulling out the ndarray:
print(xrTensor)
Output:
<xarray.DataArray (dim_1: 1, x: 3, y: 2, z: 2)>
array([[[[ 1.,  5.],
         [ 3.,  8.]],
        [[ 2.,  6.],
         [nan,  9.]],
        [[nan,  7.],
         [ 4., nan]]]])
Coordinates:
  * dim_1    (dim_1) object 'value'
  * x        (x) int64 1 2 3
  * y        (y) int64 1 2
  * z        (z) int64 1 2
We can use stack:
np.reshape(df.groupby(['z', 'y', 'x'])['value'].mean().unstack([1, 2]).stack([0, 1], dropna=False).values, (2, 2, 3))
Out[451]:
array([[[ 1.,  2., nan],
        [ 3., nan,  4.]],
       [[ 5.,  6.,  7.],
        [ 8.,  9., nan]]])
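On current pandas, a compact sketch of the same conversion via Series.to_xarray, which densifies the MultiIndex and fills the holes with NaN (my addition, not one of the answers above; it assumes the xarray package is installed):
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2],
                   'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
                   'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

# groupby averages any duplicates; to_xarray densifies (z, y, x) into a cube
arr = df.groupby(['z', 'y', 'x'])['value'].mean().to_xarray().values
print(arr.shape)   # (2, 2, 3)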

Removing NaNs in numpy arrays

I have two numpy arrays that contain NaNs:
A = np.array([np.nan, 2, np.nan, 3, 4])
B = np.array([ 1 , 2, 3 , 4, np.nan])
Is there any smart way using numpy to remove the NaNs in both arrays, and also remove what's at the corresponding index in the other array?
Making it look like this:
A = array([2., 3.])
B = array([2., 4.])
What you could do is add the two arrays together: NaN propagates, so the sum is NaN wherever either array has one. Use that to generate a boolean mask index, and then use the index to select from your original numpy arrays:
In [193]:
A = np.array([np.nan, 2, np.nan, 3, 4])
B = np.array([ 1 , 2, 3 , 4, np.nan])
idx = np.where(~np.isnan(A+B))
idx
print(A[idx])
print(B[idx])
[ 2. 3.]
[ 2. 4.]
output from A+B:
In [194]:
A+B
Out[194]:
array([ nan, 4., nan, 7., nan])
EDIT
As @Oliver W. has correctly pointed out, the np.where is unnecessary, as np.isnan produces a boolean index that you can use directly to index into the arrays:
In [199]:
A = np.array([np.nan, 2, np.nan, 3, 4])
B = np.array([ 1 , 2, 3 , 4, np.nan])
idx = (~np.isnan(A+B))
print(A[idx])
print(B[idx])
[ 2. 3.]
[ 2. 4.]
A[~(np.isnan(A) | np.isnan(B))]
B[~(np.isnan(A) | np.isnan(B))]
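The last two lines compute each isnan twice; a small variant (my phrasing of the same idea) computes the mask once and reuses it:
import numpy as np

A = np.array([np.nan, 2, np.nan, 3, 4])
B = np.array([1, 2, 3, 4, np.nan])

keep = ~(np.isnan(A) | np.isnan(B))   # True where both entries are valid
A, B = A[keep], B[keep]
print(A, B)   # [2. 3.] [2. 4.]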
