How to replace different values in each column with NaN values? - python

Please let me know if anyone knows of a better way to do the following.
I am trying to replace some values in a numpy array, where the replacement condition differs for each column.
Suppose I have a numpy array and list of nodata values like:
import numpy as np
array = np.array([[ 1,  2,  3],
                  [ 4,  5,  6],
                  [ 7,  8,  9],
                  [10, 11, 12]])
nodata_values = [4, 8, 3]
and what I want is an array whose values are replaced like this:
array([[ 1.,  2., nan],
       [nan,  5.,  6.],
       [ 7., nan,  9.],
       [10., 11., 12.]])
I know I can do this:
np.array([np.where(i == nodata_values[idx], np.nan, i)
          for idx, i in enumerate(array.T)]).T
But this code uses a Python for loop, so applying it to a table with tens of thousands of rows will take time.

Use np.isin to create a boolean index. Cast with astype first to avoid ValueError: cannot convert float NaN to integer, since an integer array cannot hold NaN:
import numpy as np
array = array.astype(float)  # np.float is deprecated; the builtin float works
array[np.isin(array, nodata_values)] = np.nan
[[ 1.  2. nan]
 [nan  5.  6.]
 [ 7. nan  9.]
 [10. 11. 12.]]
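Note that np.isin tests membership against the whole nodata_values list, so a nodata value is masked wherever it appears, not only in its own column. If the condition must be strictly per-column, a comparison that broadcasts the list across the columns does exactly that (a minimal sketch of mine, not from the original answer, using only the arrays defined above):
import numpy as np

array = np.array([[ 1,  2,  3],
                  [ 4,  5,  6],
                  [ 7,  8,  9],
                  [10, 11, 12]], dtype=float)
nodata_values = [4, 8, 3]

# Broadcasting compares column j against nodata_values[j] only.
array[array == nodata_values] = np.nan
print(array)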

Related

Create a mask from a matrix

Hi there, I have a matrix like this:
A = [[nan, 4, nan], [3, 7, 8], [nan, 23, nan]]
and I would like to get a mask from the the Matrix A, that is as follows
mask=[[nan, 0, nan],[0, 0, 0],[nan, 0, nan]]
for that I have tried:
import numpy as np
A = np.array([[np.nan, 4, np.nan], [3, 7, 8], [np.nan, 23, np.nan]])
mask = A
mask[np.isfinite(A)] = 0
But this also deletes the numerical values of the matrix A.
You need to make a copy of A in order to keep the values of A, see: https://docs.python.org/2/library/copy.html
In your case this would be
A = np.array([[np.nan, 4, np.nan], [3, 7, 8], [np.nan, 23, np.nan]])
mask = A.copy()
mask[~np.isnan(A)] = 0
You could use a masked array, in order to mask those values that aren't np.nan, and fill the masked array with 0:
A = np.array([[np.nan, 4, np.nan], [3, 7, 8], [np.nan, 23, np.nan]])
np.ma.masked_array(A, mask=~np.isnan(A)).filled(0)
array([[nan,  0., nan],
       [ 0.,  0.,  0.],
       [nan,  0., nan]])
Using A[~np.isnan(A)]:
import numpy as np
A = np.array([[np.nan, 4, np.nan], [3, 7, 8], [np.nan, 23, np.nan]])
A[~np.isnan(A)] = 0
print(A)
OUTPUT:
[[nan  0. nan]
 [ 0.  0.  0.]
 [nan  0. nan]]
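As a further one-liner alternative (my sketch, not part of the original answers), np.where builds the mask without mutating A at all:
import numpy as np

A = np.array([[np.nan, 4, np.nan], [3, 7, 8], [np.nan, 23, np.nan]])

# Keep nan where A is nan, write 0 everywhere else; A itself is untouched.
mask = np.where(np.isnan(A), np.nan, 0.0)
print(mask)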

Convert a Pandas DataFrame to a multidimensional ndarray

I have a DataFrame with columns for the x, y, z coordinates and the value at this position and I want to convert this to a 3-dimensional ndarray.
To make things more complicated, not all values exist in the DataFrame (these can just be replaced by NaN in the ndarray).
Just a simple example:
df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2],
                   'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
                   'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
Should result in the ndarray:
array([[[ 1.,  2., nan],
        [ 3., nan,  4.]],

       [[ 5.,  6.,  7.],
        [ 8.,  9., nan]]])
For two dimensions, this is easy:
array = df.pivot_table(index="y", columns="x", values="value").to_numpy()  # as_matrix() was removed from pandas
However, this method cannot be applied to three or more dimensions.
Could you give me some suggestions?
Bonus points if this also works for more than three dimensions, handles multiple defined values (by taking the average) and ensures that all x, y, z coordinates are consecutive (by inserting row/columns of NaN when a coordinate is missing).
EDIT: Some more explanations:
I read data from a CSV file which has columns for the x, y, z coordinates, optionally the frequency, and the measurement value at this point and frequency. Then I round the coordinates to a specified precision (e.g. 0.1 m) and want to get an ndarray which contains the averaged measurement values at each (rounded) coordinate. The indices of the values do not need to coincide with the location, but they need to be in the correct order.
EDIT: I just ran a quick performance test:
jakevdp's solution takes 1.598 s, Divakar's solution takes 7.405 s, JohnE's solution takes 7.867 s, and Wen's solution takes 6.286 s to complete.
You can use a groupby followed by the approach from Transform Pandas DataFrame with n-level hierarchical index into n-D Numpy array:
grouped = df.groupby(['z', 'y', 'x'])['value'].mean()

# create an empty array of NaN of the right dimensions
shape = tuple(map(len, grouped.index.levels))
arr = np.full(shape, np.nan)

# fill it using Numpy's advanced indexing
arr[tuple(grouped.index.codes)] = grouped.values.flat  # MultiIndex.labels was renamed .codes in pandas 0.24+
print(arr)
# [[[ 1.  2. nan]
#   [ 3. nan  4.]]
#
#  [[ 5.  6.  7.]
#   [ 8.  9. nan]]]
Here's one NumPy approach -
def dataframe_to_array_averaged(df):
    arr = df[['z', 'y', 'x']].values
    arr -= arr.min(0)
    out_shp = arr.max(0) + 1
    L = np.prod(out_shp)
    val = df['value'].values
    ids = np.ravel_multi_index(arr.T, out_shp)
    avgs = np.bincount(ids, val, minlength=L) / np.bincount(ids, minlength=L)
    return avgs.reshape(out_shp)
Note that this shows a warning, because places with no x,y,z triplet have a zero count and the average there is 0/0 = NaN; since that's the expected output for those places, you can ignore the warning. To avoid the warning entirely, we can employ indexing, as discussed in the second method (Alternative method) below.
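If you would rather keep the single division but silence just this warning, numpy's errstate context manager can wrap it (a sketch of mine reusing the function above; np.errstate is standard numpy):
def dataframe_to_array_averaged_quiet(df):
    # Same as dataframe_to_array_averaged, but the 0/0 -> NaN happens silently.
    arr = df[['z', 'y', 'x']].values
    arr -= arr.min(0)
    out_shp = arr.max(0) + 1
    L = np.prod(out_shp)
    val = df['value'].values
    ids = np.ravel_multi_index(arr.T, out_shp)
    with np.errstate(invalid='ignore'):
        avgs = np.bincount(ids, val, minlength=L) / np.bincount(ids, minlength=L)
    return avgs.reshape(out_shp)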
Sample run -
In [106]: df
Out[106]:
   value  x  y  z
0      1  1  1  1   # <=== this is repeated
1      2  2  1  1
2      3  1  2  1
3      4  3  2  1
4      5  1  1  2
5      6  2  1  2
6      7  3  1  2
7      8  1  2  2
8      9  2  2  2
9      4  1  1  1   # <=== this is repeated

In [107]: dataframe_to_array_averaged(df)
__main__:42: RuntimeWarning: invalid value encountered in divide
Out[107]:
array([[[ 2.5,  2. ,  nan],
        [ 3. ,  nan,  4. ]],

       [[ 5. ,  6. ,  7. ],
        [ 8. ,  9. ,  nan]]])
Alternative method
To avoid the warning, an alternative way would be like so -
out = np.full(out_shp, np.nan)
sums = np.bincount(ids, val)
unq_ids, count = np.unique(ids, return_counts=True)
out.flat[unq_ids] = sums[unq_ids] / count  # only write cells that actually occur, so absent cells stay NaN
Another solution is to use the xarray package:
import pandas as pd
import xarray as xr

df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2],
                   'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
                   'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
df = pd.pivot_table(df, values='value', index=['x', 'y', 'z'])
xrTensor = xr.DataArray(df).unstack("dim_0")
array = xrTensor.values[0].T
print(array)
Output:
array([[[ 1.,  2., nan],
        [ 3., nan,  4.]],

       [[ 5.,  6.,  7.],
        [ 8.,  9., nan]]])
Note that the xrTensor object is very handy, since xarray's DataArrays contain the labels, so you may just go on with that object rather than pulling out the ndarray:
print(xrTensor)
Output:
<xarray.DataArray (dim_1: 1, x: 3, y: 2, z: 2)>
array([[[[ 1.,  5.],
         [ 3.,  8.]],

        [[ 2.,  6.],
         [nan,  9.]],

        [[nan,  7.],
         [ 4., nan]]]])
Coordinates:
  * dim_1    (dim_1) object 'value'
  * x        (x) int64 1 2 3
  * y        (y) int64 1 2
  * z        (z) int64 1 2
We can use unstack and stack:
np.reshape(df.groupby(['z', 'y', 'x'])['value'].mean().unstack([1, 2]).stack([0, 1], dropna=False).values, (2, 2, 3))
Out[451]:
array([[[ 1.,  2., nan],
        [ 3., nan,  4.]],

       [[ 5.,  6.,  7.],
        [ 8.,  9., nan]]])

Most efficient way to forward-fill NaN values in numpy array

Example Problem
As a simple example, consider the numpy array arr as defined below:
import numpy as np
arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])
where arr looks like this in console output:
array([[ 5., nan, nan,  7.,  2.],
       [ 3., nan,  1.,  8., nan],
       [ 4.,  9.,  6., nan, nan]])
I would now like to row-wise 'forward-fill' the nan values in array arr. By that I mean replacing each nan value with the nearest valid value from the left. The desired result would look like this:
array([[ 5.,  5.,  5.,  7.,  2.],
       [ 3.,  3.,  1.,  8.,  8.],
       [ 4.,  9.,  6.,  6.,  6.]])
Tried thus far
I've tried using for-loops:
for row_idx in range(arr.shape[0]):
    for col_idx in range(arr.shape[1]):
        if np.isnan(arr[row_idx][col_idx]):
            arr[row_idx][col_idx] = arr[row_idx][col_idx - 1]
I've also tried using a pandas dataframe as an intermediate step (since pandas dataframes have a very neat built-in method for forward-filling):
import pandas as pd
df = pd.DataFrame(arr)
df = df.ffill(axis=1)  # fillna(method='ffill') is deprecated in recent pandas
arr = df.to_numpy()    # as_matrix() was removed from pandas
Both of the above strategies produce the desired result, but I keep on wondering: wouldn't a strategy that uses only numpy vectorized operations be the most efficient one?
Summary
Is there another more efficient way to 'forward-fill' nan values in numpy arrays? (e.g. by using numpy vectorized operations)
Update: Solutions Comparison
I've tried to time all solutions thus far. This was my setup script:
import numba as nb
import numpy as np
import pandas as pd

def random_array():
    choices = [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]
    out = np.random.choice(choices, size=(1000, 10))
    return out

def loops_fill(arr):
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out

@nb.jit
def numba_loops_fill(arr):
    '''Numba decorator solution provided by shx2.'''
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out

def pandas_fill(arr):
    df = pd.DataFrame(arr)
    df = df.ffill(axis=1)
    out = df.to_numpy()
    return out

def numpy_fill(arr):
    '''Solution provided by Divakar.'''
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    out = arr[np.arange(idx.shape[0])[:, None], idx]
    return out
followed by this console input:
%timeit -n 1000 loops_fill(random_array())
%timeit -n 1000 numba_loops_fill(random_array())
%timeit -n 1000 pandas_fill(random_array())
%timeit -n 1000 numpy_fill(random_array())
resulting in this console output:
1000 loops, best of 3: 9.64 ms per loop
1000 loops, best of 3: 377 µs per loop
1000 loops, best of 3: 455 µs per loop
1000 loops, best of 3: 351 µs per loop
Here's one approach -
mask = np.isnan(arr)
# for each valid element keep its own column index, put 0 at the NaNs
idx = np.where(~mask, np.arange(mask.shape[1]), 0)
# running maximum: every position now holds the column of the last valid value to its left
np.maximum.accumulate(idx, axis=1, out=idx)
# gather those columns row by row
out = arr[np.arange(idx.shape[0])[:, None], idx]
If you don't want to create another array and just fill the NaNs in arr itself, replace the last step with this -
arr[mask] = arr[np.nonzero(mask)[0], idx[mask]]
Sample input, output -
In [179]: arr
Out[179]:
array([[ 5., nan, nan,  7.,  2.,  6.,  5.],
       [ 3., nan,  1.,  8., nan,  5., nan],
       [ 4.,  9.,  6., nan, nan, nan,  7.]])

In [180]: out
Out[180]:
array([[ 5.,  5.,  5.,  7.,  2.,  6.,  5.],
       [ 3.,  3.,  1.,  8.,  8.,  5.,  5.],
       [ 4.,  9.,  6.,  6.,  6.,  6.,  7.]])
Update: As pointed out by financial_physician in the comments, my initially proposed solution can simply be replaced by ffill on the reversed array, reversing the result afterwards. There is no relevant performance loss; my initial solution seems to be 2% or 3% faster according to %timeit. I updated the code example below but left my initial text as it was.
For those that came here looking for a backward-fill of NaN values, I modified the solution provided by Divakar above to do exactly that. The trick is that you have to do the accumulation on the reversed array, using the minimum instead of the maximum.
Here is the code:
# ffill along axis 1, as provided in the answer by Divakar
def ffill(arr):
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    out = arr[np.arange(idx.shape[0])[:, None], idx]
    return out

# Simple solution for bfill provided by financial_physician in comment below
def bfill(arr):
    return ffill(arr[:, ::-1])[:, ::-1]

# My outdated modification of Divakar's answer to do a backward-fill
def bfill_old(arr):
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), mask.shape[1] - 1)
    idx = np.minimum.accumulate(idx[:, ::-1], axis=1)[:, ::-1]
    out = arr[np.arange(idx.shape[0])[:, None], idx]
    return out

# Test both functions
arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])
print('Array:')
print(arr)
print('\nffill')
print(ffill(arr))
print('\nbfill')
print(bfill(arr))
Output:
Array:
[[ 5. nan nan  7.  2.]
 [ 3. nan  1.  8. nan]
 [ 4.  9.  6. nan nan]]

ffill
[[5. 5. 5. 7. 2.]
 [3. 3. 1. 8. 8.]
 [4. 9. 6. 6. 6.]]

bfill
[[ 5.  7.  7.  7.  2.]
 [ 3.  1.  1.  8. nan]
 [ 4.  9.  6. nan nan]]
Edit: Updated according to a comment by MS_.
I liked Divakar's answer on pure numpy.
Here's a generalized function for n-dimensional arrays:
def np_ffill(arr, axis):
    idx_shape = tuple([slice(None)] + [np.newaxis] * (len(arr.shape) - axis - 1))
    idx = np.where(~np.isnan(arr), np.arange(arr.shape[axis])[idx_shape], 0)
    np.maximum.accumulate(idx, axis=axis, out=idx)
    slc = [np.arange(k)[tuple([slice(None) if dim == i else np.newaxis
                               for dim in range(len(arr.shape))])]
           for i, k in enumerate(arr.shape)]
    slc[axis] = idx
    return arr[tuple(slc)]
AFAIK pandas can only work with two dimensions, despite having a multi-index to make up for it. The only way to accomplish this with pandas would be to flatten the DataFrame, unstack the desired level, restack, and finally reshape to the original. This unstacking/restacking/reshaping, with the pandas sorting involved, is just unnecessary overhead to achieve the same result.
Testing:
def random_array(shape):
    choices = [1, 2, 3, 4, np.nan]
    out = np.random.choice(choices, size=shape)
    return out

ra = random_array((2, 4, 8))
print('arr')
print(ra)
print('\nffill')
print(np_ffill(ra, 1))
Output:
arr
[[[ 3. nan  4.  1.  4.  2.  2.  3.]
  [ 2. nan  1.  3. nan  4.  4.  3.]
  [ 3.  2. nan  4. nan nan  3.  4.]
  [ 2.  2.  2. nan  1.  1. nan  2.]]

 [[ 2.  3.  2. nan  3.  3.  3.  3.]
  [ 3.  3.  1.  4.  1.  4.  1. nan]
  [ 4.  2. nan  4.  4.  3. nan  4.]
  [ 2.  4.  2.  1.  4.  1.  3. nan]]]

ffill
[[[ 3. nan  4.  1.  4.  2.  2.  3.]
  [ 2. nan  1.  3.  4.  4.  4.  3.]
  [ 3.  2.  1.  4.  4.  4.  3.  4.]
  [ 2.  2.  2.  4.  1.  1.  3.  2.]]

 [[ 2.  3.  2. nan  3.  3.  3.  3.]
  [ 3.  3.  1.  4.  1.  4.  1.  3.]
  [ 4.  2.  1.  4.  4.  3.  1.  4.]
  [ 2.  4.  2.  1.  4.  1.  3.  4.]]]
Use Numba. This should give a significant speedup:
import numba

@numba.jit
def loops_fill(arr):
    ...
I like Divakar's answer, but it doesn't work for an edge case where a row starts with np.nan, like the arr below
arr = np.array([[9, np.nan, 4, np.nan, 6, 6, 7, 2, 3, np.nan],
                [np.nan, 5, 5, 6, 5, 3, 2, 1, np.nan, 10]])
The output using Divakar's code would be:
[[ 9.  9.  4.  4.  6.  6.  7.  2.  3.  3.]
 [nan  5.  5.  6.  5.  3.  2.  1.  1. 10.]]
Divakar's code can be simplified a bit, and the simplified version handles that row at the same time (note that for a leading NaN the -1 column index wraps around, so the value is taken from the end of the same row):
arr[np.isnan(arr)] = arr[np.nonzero(np.isnan(arr))[0], np.nonzero(np.isnan(arr))[1]-1]
In case of several np.nans in a row (either at the beginning or in the middle), just repeat this operation several times. For instance, if the array has 5 consecutive np.nans, the following code will "forward fill" all of them with the number before these np.nans:
for _ in range(5):
    value[np.isnan(value)] = value[np.nonzero(np.isnan(value))[0], np.nonzero(np.isnan(value))[1] - 1]
For those who are interested in the problem of leading np.nan values remaining after forward-filling, the following works:
mask = np.isnan(arr)
first_non_zero_idx = (~mask != 0).argmax(axis=1)  # index of the first non-NaN value in each row
arr = [np.hstack([[arr[i, first_nonzero]] * first_nonzero,
                  arr[i, first_nonzero:]])
       for i, first_nonzero in enumerate(first_non_zero_idx)]
bottleneck's push function is a good option for forward-filling. It's normally used internally in packages like xarray; it should be faster than the other alternatives, and the package also has a set of benchmarks.
Example:
import numpy as np
from bottleneck import push
a = np.array([[1, np.nan, 3],
              [np.nan, 3, 2],
              [2, np.nan, np.nan]])
push(a, axis=0)
array([[ 1., nan,  3.],
       [ 1.,  3.,  2.],
       [ 2.,  3.,  2.]])
Use the bottleneck module. It is an optional dependency of pandas, so it may already be installed; otherwise install it separately (e.g. pip install bottleneck).
The code below should give you the desired result:
import bottleneck as bn
bn.push(arr, axis=1)
If you're willing to use pandas/xarray: let axis be the direction you wish to ffill/bfill over, as shown below.
xr.DataArray(arr).ffill(f'dim_{axis}').values
xr.DataArray(arr).bfill(f'dim_{axis}').values
More information:
http://xarray.pydata.org/en/stable/generated/xarray.DataArray.ffill.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ffill.html
One-liner (note that this replaces NaNs with the constant 0; it does not forward-fill):
result = np.where(np.isnan(arr), 0, arr)
In a function, forcing a float conversion (I needed it in my case because I had dtype=object):
def fillna(arr):
    arr = np.array(arr, dtype=float)
    out = np.where(np.isnan(arr), 0, arr)
    return out
arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])
result = fillna(arr)
print(result)
# result
# array([[5., 0., 0., 7., 2.],
#        [3., 0., 1., 8., 0.],
#        [4., 9., 6., 0., 0.]])
Unless I am missing something, the solutions do not work on this example:
arr = np.array([[ 3.],
                [ 8.],
                [np.nan],
                [ 7.],
                [np.nan],
                [ 1.],
                [np.nan],
                [ 3.],
                [ 8.],
                [ 8.]])
print("A:::: \n", arr)
print("numpy_fill::: \n", numpy_fill(arr))
print("loop_fill", loops_fill(arr))
A::::
[[ 3.]
[ 8.]
[nan]
[ 7.]
[nan]
[ 1.]
[nan]
[ 3.]
[ 8.]
[ 8.]]
numpy_fill:::
[[ 3.]
[ 8.]
[nan]
[ 7.]
[nan]
[ 1.]
[nan]
[ 3.]
[ 8.]
[ 8.]]
loop_fill [[ 3.]
[ 8.]
[nan]
[ 7.]
[nan]
[ 1.]
[nan]
[ 3.]
[ 8.]
[ 8.]]
Comments?
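A likely explanation (my note, not from the original thread): the example array has shape (10, 1), and both numpy_fill and loops_fill fill along axis 1, where each row holds a single element, so there is never anything to fill from. Transposing first (or using an axis-aware variant such as np_ffill above) gives the expected result:
import numpy as np

def numpy_fill(arr):
    '''Row-wise forward fill (Divakar's solution from above).'''
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    return arr[np.arange(idx.shape[0])[:, None], idx]

arr = np.array([[3.], [8.], [np.nan], [7.], [np.nan],
                [1.], [np.nan], [3.], [8.], [8.]])

# Fill down the column: transpose, fill along rows, transpose back.
filled = numpy_fill(arr.T).T
print(filled.ravel())  # [3. 8. 8. 7. 7. 1. 1. 3. 8. 8.]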
Minor improvement of RichieV's generalized pure numpy solution, with axis selection and 'backward' support:
def _np_fill_(arr, axis=-1, fill_dir='f'):
    """Base function for np_fill, np_ffill, np_bfill."""
    if axis < 0:
        axis = len(arr.shape) + axis
    if fill_dir.lower() in ['b', 'backward']:
        dir_change = tuple([*[slice(None)] * axis, slice(None, None, -1)])
        # reverse along the requested axis, forward-fill, reverse back
        return np_ffill(arr[dir_change], axis=axis)[dir_change]
    elif fill_dir.lower() not in ['f', 'forward']:
        raise KeyError(f"fill_dir must be one of: 'b', 'backward', 'f', 'forward'. Got: {fill_dir}")
    idx_shape = tuple([slice(None)] + [np.newaxis] * (len(arr.shape) - axis - 1))
    idx = np.where(~np.isnan(arr), np.arange(arr.shape[axis])[idx_shape], 0)
    np.maximum.accumulate(idx, axis=axis, out=idx)
    slc = [np.arange(k)[tuple([slice(None) if dim == i else np.newaxis
                               for dim in range(len(arr.shape))])]
           for i, k in enumerate(arr.shape)]
    slc[axis] = idx
    return arr[tuple(slc)]

def np_fill(arr, axis=-1, fill_dir='f'):
    """General fill function which supports multiple filling steps, i.e.
    fill_dir=['f', 'b'] or fill_dir=['b', 'f']."""
    if isinstance(fill_dir, (tuple, list, np.ndarray)):
        for i in fill_dir:
            arr = _np_fill_(arr, axis=axis, fill_dir=i)
    else:
        arr = _np_fill_(arr, axis=axis, fill_dir=fill_dir)
    return arr

def np_ffill(arr, axis=-1):
    return np_fill(arr, axis=axis, fill_dir='forward')

def np_bfill(arr, axis=-1):
    return np_fill(arr, axis=axis, fill_dir='backward')
I used np.nan_to_num to replace NaNs with the mean of the valid values.
Example:
data = np.nan_to_num(data, nan=np.nanmean(data))
Note that the nan= keyword is required; passed positionally, the second argument would be interpreted as the copy flag. np.nanmean ignores the NaNs that a plain data.mean() would propagate.
Reference: NumPy documentation

Adding two 2D NumPy arrays ignoring NaNs in them

What is the right way to add 2 numpy arrays a and b (both 2D) with numpy.nan as missing value?
a + b
or
numpy.ma.sum(a,b)
Since the inputs are 2D arrays, you can stack them along the third axis with np.dstack and then use np.nansum, which ensures NaNs are ignored, unless there are NaNs in both input arrays, in which case the output also has NaN. Thus, the implementation would look something like this -
np.nansum(np.dstack((A,B)),2)
Sample run -
In [157]: A
Out[157]:
array([[ 0.77552455,  0.89241629,         nan,  0.61187474],
       [ 0.62777982,  0.80245533,         nan,  0.66320306],
       [ 0.41578442,  0.26144272,  0.90260667,         nan],
       [ 0.65122428,  0.3211213 ,  0.81634856,         nan],
       [ 0.52957704,  0.73460363,  0.16484994,  0.20701344]])

In [158]: B
Out[158]:
array([[ 0.55809925,  0.1339353 ,         nan,  0.35154039],
       [ 0.94484722,  0.23814073,  0.36048809,  0.20412318],
       [ 0.25191484,         nan,  0.43721322,  0.95810905],
       [ 0.69115038,  0.51490958,         nan,  0.44613473],
       [ 0.01709308,  0.81771896,  0.3229837 ,  0.64013882]])

In [159]: np.nansum(np.dstack((A,B)),2)
Out[159]:
array([[ 1.3336238 ,  1.02635159,         nan,  0.96341512],
       [ 1.57262704,  1.04059606,  0.36048809,  0.86732624],
       [ 0.66769925,  0.26144272,  1.33981989,  0.95810905],
       [ 1.34237466,  0.83603089,  0.81634856,  0.44613473],
       [ 0.54667013,  1.55232259,  0.48783363,  0.84715226]])
Just replace the NaNs with zeros in both arrays:
a[np.isnan(a)] = 0 # replace all nan in a with 0
b[np.isnan(b)] = 0 # replace all nan in b with 0
And then perform the addition:
a + b
This relies on the fact that 0 is the "identity element" for addition.
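One caveat worth noting (my addition, not part of the original answer): zeroing the NaNs first means a position that is NaN in both arrays comes out as 0, whereas the np.nansum approach above keeps it as NaN. A sketch that preserves that behavior without stacking:
import numpy as np

def nan_add(a, b):
    # Treat NaN as 0 for the addition...
    total = np.nan_to_num(a) + np.nan_to_num(b)
    # ...but keep NaN where *both* inputs were missing.
    return np.where(np.isnan(a) & np.isnan(b), np.nan, total)

a = np.array([[1.0, np.nan], [np.nan, 2.0]])
b = np.array([[3.0, 4.0], [np.nan, np.nan]])
print(nan_add(a, b))
# [[ 4.  4.]
#  [nan  2.]]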

numpy classification comparison with 3d array

I'm trying to do some basic classification of numpy arrays...
I want to compare a 2d array against a 3d array, along the 3rd dimension, and make a classification based on the corresponding z-axis values.
so given 3 arrays that are stacked into a 3d array:
import numpy as np
a1 = np.array([[1,1,1],[1,1,1],[1,1,1]])
a2 = np.array([[3,3,3],[3,3,3],[3,3,3]])
a3 = np.array([[5,5,5],[5,5,5],[5,5,5]])
a3d = np.dstack((a1, a2, a3))
and another 2d array
a2d = np.array([[1,2,4],[5,5,2],[2,3,3]])
I want to be able to compare a2d against a3d, and return a 2d array of which level of a3d is closest. (Or, I suppose, any custom function that can compare each value along the z-axis and return a value based on that comparison.)
EDIT
I modified my arrays to more closely match my data. a1 would be the minimum values, a2 the average values, and a3 the maximum values. So I want to output whether each a2d value is closer to a1 (classed "1"), a2 (classed "2") or a3 (classed "3"). I'm doing it as a 3d array because in the real data it won't be a simple 3-array choice, but for SO purposes it helps to keep it simple. We can assume that in the case of a tie, we'll take the lower, so 2 would be classed as level "1", 4 as level "2".
You can use the following list comprehension:
>>> [sum(sum(abs(i-j)) for i,j in z) for z in [zip(i,a2d) for i in a3d]]
[30.0, 22.5, 30.0]
In the preceding code I create the following list with zip, that is, the zip of each sub-array of your 3d list; then all you need is to calculate the sum of the absolute differences of those pairs, and then sum them again:
>>> [zip(i,a2d) for i in a3d]
[[(array([ 1., 3., 1.]), array([1, 2, 1])), (array([ 2., 2., 1.]), array([5, 5, 4])), (array([ 3., 1., 1.]), array([9, 8, 8]))], [(array([ 4., 6., 4.]), array([1, 2, 1])), (array([ 5. , 6.5, 4. ]), array([5, 5, 4])), (array([ 6., 4., 4.]), array([9, 8, 8]))], [(array([ 7., 9., 7.]), array([1, 2, 1])), (array([ 8., 8., 7.]), array([5, 5, 4])), (array([ 9., 7., 7.]), array([9, 8, 8]))]]
then for all of your sub arrays you'll have the following list:
[30.0, 22.5, 30.0]
which for each sub-list shows the level of difference with the 2d array! Then you can get the relevant sub-array from a3d like the following:
>>> a3d[l.index(min(l))]
array([[ 4. , 6. , 4. ],
[ 5. , 6.5, 4. ],
[ 6. , 4. , 4. ]])
Also you can put it in a function:
>>> def find_nearest(sub,main):
... l=[sum(sum(abs(i-j)) for i,j in z) for z in [zip(i,sub) for i in main]]
... return main[l.index(min(l))]
...
>>> find_nearest(a2d,a3d)
array([[ 4. , 6. , 4. ],
[ 5. , 6.5, 4. ],
[ 6. , 4. , 4. ]])
You might consider a different approach using numpy.vectorize which lets you efficiently apply a python function to each element of your array.
In this case, your python function could just classify each pixel with whatever breaks you define:
import numpy as np
a2d = np.array([[1,2,4],[5,5,2],[2,3,3]])
def classify(x):
    if x >= 4:
        return 3
    elif x >= 2:
        return 2
    elif x > 0:
        return 1
    else:
        return 0

vclassify = np.vectorize(classify)
result = vclassify(a2d)
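For fixed numeric breaks like these, np.digitize gives the same classification for positive values without a per-element Python call (my suggestion, not part of the original answer; np.vectorize is essentially a loop under the hood):
import numpy as np

a2d = np.array([[1, 2, 4], [5, 5, 2], [2, 3, 3]])

# Bin edges 0, 2, 4 reproduce classify() for x > 0: 0<x<2 -> 1, 2<=x<4 -> 2, x>=4 -> 3.
result = np.digitize(a2d, bins=[0, 2, 4])
print(result)
# [[1 2 3]
#  [3 3 2]
#  [2 2 2]]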
Thanks to @perrygeo and @Kasra - they got me thinking in a good direction.
Since I want a classification of the closest 3d array's z value, I couldn't do simple math - I needed the (z)index of the closest value.
I did it by enumerating both axes of the 2d array, and doing a proximity compare against the corresponding (z)index of the 3d array.
There might be a way to do this without iterating the 2d array, but at least I'm avoiding iterating the 3d.
import numpy as np

a1 = np.array([[1,1,1],[1,1,1],[1,1,1]])
a2 = np.array([[3,3,3],[3,3,3],[3,3,3]])
a3 = np.array([[5,5,5],[5,5,5],[5,5,5]])
a3d = np.dstack((a1, a2, a3))
a2d = np.array([[1,2,4],[5,5,2],[2,3,3]])

classOut = np.empty_like(a2d)

def find_nearest_idx(array, value):
    idx = (np.abs(array - value)).argmin()
    return idx

# enumerate to get indices
for i, a in enumerate(a2d):
    for ii, v in enumerate(a):
        valStack = a3d[i, ii]
        nearest = find_nearest_idx(valStack, v)
        classOut[i, ii] = nearest

print(classOut)
which gets me
[[0 0 1]
 [2 2 0]
 [0 1 1]]
This tells me that (for example) a2d[0,0] is closest to the 0-index of a3d[0,0], which in my case means it is closest to the min value for that 2d position. a2d[1,1] is closest to the 2-index, which in my case means closer to the max value for that 2d position.
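For what it's worth, the whole double loop can also be replaced by a vectorized argmin over the z-axis (my sketch based on the arrays above; it reproduces classOut exactly, with ties resolved to the lower index just like argmin in the loop):
import numpy as np

a3d = np.dstack((np.full((3, 3), 1), np.full((3, 3), 3), np.full((3, 3), 5)))
a2d = np.array([[1, 2, 4], [5, 5, 2], [2, 3, 3]])

# |a3d - a2d| along z, then take the index of the smallest difference per cell.
classOut = np.abs(a3d - a2d[:, :, None]).argmin(axis=2)
print(classOut)
# [[0 0 1]
#  [2 2 0]
#  [0 1 1]]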
