Masked `np.nan` in the `np.ma.array` problem in jupyter - python

Let's run the following Python 3 NumPy code in an Anaconda Jupyter notebook:
import numpy as np
y = np.ma.array(np.matrix([[np.nan, 2.0]]), mask=[0, 1])
m = (y < 0.01)
and we get the warning: /.../anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: RuntimeWarning: invalid value encountered in less.
Substituting 1.0 (or any other finite value) for np.nan removes the warning.
Why can't the np.nan be masked and then compared?

MA has several strategies for implementing methods.
1) evaluate the method on y.data, and make a new ma with y.mask. It may suppress any runtime warnings.
2) evaluate the method on y.filled() # with the default fill value
3) evaluate the method on y.filled(1) # or some other innocuous value
4) evaluate the method on y.compressed()
5) evaluate the method on y.data[~y.mask]
Multiplication, for example, uses filled(1), and addition uses filled(0).
It appears that the comparisons are done with 1).
I haven't studied the ma code in detail, but I don't think it does 5).
If you are using ma just to avoid the runtime warning, there are some alternatives.
there's a collection of np.nan... functions that filter out nan before calculating
there are ways of suppressing runtime warnings
ufuncs have a where parameter that can be used to skip some elements. Use it with an out parameter to define the skipped ones.
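For example, a minimal sketch of the where/out route (my own values, not from the post): compare only where the data is finite and pre-fill the skipped slots.
import numpy as np

data = np.array([np.nan, 0.0, 2.0])
valid = np.isfinite(data)
m = np.zeros(data.shape, dtype=bool)       # value kept for the skipped elements
np.less(data, 0.01, where=valid, out=m)    # no RuntimeWarning
print(m)                                   # [False  True False]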
===
Looking at np.ma.core.py, I see functions like ma.less.
In [857]: y = np.ma.array([np.nan, 0.0, 2.0], mask=[1, 0, 0])
In [858]: y >1.0
/usr/local/bin/ipython3:1: RuntimeWarning: invalid value encountered in greater
#!/usr/bin/python3
Out[858]:
masked_array(data=[--, False, True],
mask=[ True, False, False],
fill_value=True)
In [859]: np.ma.greater(y,1.0)
Out[859]:
masked_array(data=[--, False, True],
mask=[ True, False, False],
fill_value=True)
Looking at the code, ma.less and friends are instances of a MaskedBinaryOperation class, and use strategy 1): evaluate on the data with
np.seterr(divide='ignore', invalid='ignore')
The result mask is the logical combination of the arguments' masks.
https://docs.scipy.org/doc/numpy/reference/maskedarray.generic.html#operations-on-masked-arrays
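A minimal sketch of doing by hand what MaskedBinaryOperation does internally (using the y from the transcript above; np.errstate is the context-manager form of np.seterr):
import numpy as np

y = np.ma.array([np.nan, 0.0, 2.0], mask=[1, 0, 0])
with np.errstate(invalid='ignore'):
    data = np.greater(y.data, 1.0)       # compare the raw data, no RuntimeWarning
result = np.ma.array(data, mask=np.ma.getmaskarray(y))   # combine masks (the scalar has none)
print(result)   # [-- False True]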

To simplify the issue, let's assume:
y = np.ma.array([np.nan, 0.0, 2.0], mask=[1, 0, 0])
m = (y > 1.0)
print(y, y.shape) ; print(y[m], y[m].shape, m.shape)
and the output is:
[-- 0.0 2.0] (3,)
[2.0] (1,) (3,)
with the RuntimeWarning: /.../anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: RuntimeWarning: invalid value encountered in greater.
Changing:
...
m = (y != 2.0)
...
We get:
[-- 0.0 2.0] (3,)
[-- 0.0] (2,) (3,)
so we have a masked element and the result without any RuntimeWarning.
Changing now:
...
m = y.mask.copy() ; y[np.isnan(y)] = 9.0 ; y.mask = m ; m = (y > 1.0)
...
We get (without a RuntimeWarning):
[-- 0.0 2.0] (3,)
[-- 2.0] (2,) (3,)
This workaround is strange, however (it puts an arbitrary value in place of np.nan and saves/restores the mask). Comparing something with a masked element should always give masked, shouldn't it?
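A less intrusive alternative (a sketch building on the ma.greater behaviour shown above) is to use the np.ma comparison function directly; it suppresses the invalid-value warning and keeps the masked element masked in the result:
y = np.ma.array([np.nan, 0.0, 2.0], mask=[1, 0, 0])
m = np.ma.greater(y, 1.0)   # no RuntimeWarning
print(m)
# masked_array(data=[--, False, True],
#              mask=[ True, False, False],
#              fill_value=True)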

Related

How to apply a function element-wise with inputs from multiple numpy masked arrays to create a new masked array?

I have a function that takes 4 scalar inputs and returns a single float output, for example:
import numpy as np
from scipy.stats import multivariate_normal

grid_step = 0.25  # in units of sigma
grid_x, grid_y = np.mgrid[-2:2+grid_step:grid_step, -2:2+grid_step:grid_step]
pos = np.dstack((grid_x, grid_y))
rv = multivariate_normal([0.0, 0.0], [[1.0, 0], [0, 1.0]])
grid_pdf = rv.pdf(pos)*grid_step**2
norm_pdf = np.sum(rv.pdf(pos))*grid_step**2

def cal_prob(x, x_err, y, y_err):
    x_grid = grid_x*x_err + x
    y_grid = grid_y*y_err + y
    PSB_grid = ((x_grid > 3) & (y_grid < 10) & (y_grid < 10**(0.23*x_grid - 0.46)))
    PSB_prob = np.sum(PSB_grid*grid_pdf)/norm_pdf
    return PSB_prob
What this function does is estimate the probability that some x-y measurement is within some defined limit in x-y space, given the uncertainties on x and y. It assumes the uncertainties are Gaussian and uncorrelated. Then, using the pre-made grid_pdf, it checks which grid points (scaled by x_err/y_err and shifted by x/y) are within the defined limit, multiplies the True/False grid by grid_pdf, and normalizes by norm_pdf. The probability is given by the sum of the normalized array.
I want this function to be applied element-wise with those 4 inputs stored in 4 separate numpy masked arrays of the same shape, with possibly different masks, then use the function outputs to create a new array of the same shape. Is there a way that doesn't use a for loop?
Thanks!
My current solution is this:
mask1 = np.array([[False, True, False],[True, True, True],[True, False, False]])
mask2 = np.array([[True, True, True],[True, True, False],[False, False, True]])
# the only overlaps should be [0,1], [1,0] and [1,1]
x = np.ma.array(np.random.randn(*mask1.shape), mask=~mask1)
x_err = np.ma.array(np.abs(np.random.randn(*mask1.shape))*0.1, mask=~mask1)
y = np.ma.array(np.random.randn(*mask2.shape), mask=~mask2)
y_err = np.ma.array(np.abs(np.random.randn(*mask2.shape))*0.1, mask=~mask2)
# a combined mask to iterate through
all_mask = x+x_err+y+y_err
prob = np.zeros(mask1.shape)
prob = np.ma.masked_where(np.ma.getmask(all_mask), prob)
for i, xi in np.ma.ndenumerate(all_mask):
    prob[i] = cal_prob(xi, x_err[i], y[i], y_err[i])
A test of np.vectorize with a masked array input:
In [180]: def foo(x):
     ...:     print(x)
     ...:     return 2*x
     ...:
In [181]: np.vectorize(foo)(np.ma.masked_array([1,2,3],[True,False,True]))
1
1
2
3
Out[181]:
masked_array(data=[--, 4, --],
mask=[ True, False, True],
fill_value=999999)
In [182]: _.data
Out[182]: array([2, 4, 6])
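Building on this, a sketch (my own construction, not from the answer) for applying a scalar function to several masked arrays without an explicit for loop: vectorize over the filled data and re-apply a combined mask afterwards. The scalar_func below is a hypothetical stand-in for cal_prob.
import numpy as np

def scalar_func(a, b):
    # hypothetical stand-in for cal_prob; any scalar -> scalar function works
    return a + 2.0*b

x = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
y = np.ma.array([10.0, 20.0, 30.0], mask=[True, False, False])

combined_mask = np.ma.getmaskarray(x) | np.ma.getmaskarray(y)
data = np.vectorize(scalar_func)(x.filled(0.0), y.filled(0.0))
result = np.ma.array(data, mask=combined_mask)
print(result)   # [-- -- 63.0]
Note that np.vectorize still calls the function once per element under the hood; it removes the explicit loop, not the per-element cost.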

how to avoid division by zero in 2d numpy array when taking average?

Let's say I have three arrays:
import numpy as np
A = np.array([[2,2,2],[1,0,0],[1,2,1]])
B = np.array([[2,0,2],[0,1,0],[1,2,1]])
C = np.array([[2,0,1],[0,1,0],[1,1,2]])
A,B,C
(array([[2, 2, 2],
[1, 0, 0],
[1, 2, 1]]),
array([[2, 0, 2],
[0, 1, 0],
[1, 2, 1]]),
array([[2, 0, 1],
[0, 1, 0],
[1, 1, 2]]))
When I take the average of C/(A+B), I get nan/inf values with a RuntimeWarning.
The resulting array looks like the following:
np.average(C/(A+B), axis = 1)
array([0.25 , nan, 0.58333333])
I would like to change any inf/nan value to 0.
What I tried so far was:
# doesn't work (maybe I'm doing this wrong...)
mask = A+B > 0
np.average(C[mask]/(A[mask]+B[mask]), axis = 1)
# does not work, and not an ideal solution
avg = np.average(C/(A+B), axis = 1)
avg[avg == np.nan] = 0
any help would be appreciated!
Both of the approaches you tried are valid ways of dealing with it, but they need slight changes.
Avoiding the division upfront, by only calculating the result where it's valid (e.g. non-zero):
Indexing with the boolean mask you defined makes the resulting arrays 1D, so you have to allocate the result array upfront and assign into it using that same mask.
mask = A+B > 0
result = np.zeros_like(A, dtype=np.float32)
result[mask] = C[mask]/(A[mask]+B[mask])
It does require the averaging over the second dimension to be done separately, and the result also has to be set to zero for the rows where the division could not be done because of the zeros.
result = result.mean(axis=1)
result[(~mask).any(axis=1)] = 0
To me the main benefit would be avoiding the warning from NumPy, and perhaps, in the case of a large number of zeros (in A+B), you could gain a little performance by avoiding that calculation altogether. But overall it seems like a lot of effort to me.
Masking invalid values afterwards:
The main takeaway here is that you should never compare against np.nan directly, since the result will always be False. You can check this yourself by looking at the result of np.nan == np.nan. The way to handle this is to use the dedicated np.isnan function, or alternatively negate the np.isfinite function if you also want to catch +/- np.inf values at the same time.
avg = np.average(C/(A+B), axis = 1)
avg[np.isnan(avg)] = 0
# or to include inf
avg[~np.isfinite(avg)] = 0
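A quick demonstration of the pitfall described above (my own lines):
import numpy as np
print(np.nan == np.nan)                                # False
print(np.isnan(np.array([1.0, np.nan])))               # [False  True]
print(np.isfinite(np.array([1.0, np.nan, np.inf])))    # [ True False False]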
import numpy as np
a = np.array([1, np.nan])
print(a)  # [ 1. nan]
a = np.nan_to_num(a)
print(a)  # [1. 0.]
https://numpy.org/doc/stable/reference/generated/numpy.nan_to_num.html
for inf and -inf
from numpy import inf
avg[avg == inf] = 0
avg[avg == -inf] = 0
Or simply skip the division where the divisor is zero (note that without an out argument the skipped elements are left uninitialized):
np.divide(a, b, where=b.astype(bool))
This is tougher than I thought, as np.mean's where argument doesn't work if it results in empty arrays, and np.average's weights have to be 1-D.
# these don't work
# >>> np.mean(div, axis=1, where=mask.all(1, keepdims=True))
# RuntimeWarning: Mean of empty slice.
# RuntimeWarning: invalid value encountered in true_divide
# >>> np.average(div, axis=1, weights=mask.all(1, keepdims=True))
# TypeError: 1D weights expected when shapes of a and weights differ.
import numpy as np
A = np.array([[2,2,2],[1,0,0],[1,2,1]])
B = np.array([[2,0,2],[0,1,0],[1,2,1]])
C = np.array([[2,0,1],[0,1,0],[1,1,2]])
div = np.zeros(C.shape)
AB = A+B # avoid repeated summing
mask = AB > 0 # AB != 0 to include all valid divisors
np.divide(C, AB, where=mask, out=div) # out=None won't initialize unused elements
np.mean(div * mask.all(1, keepdims=True), axis = 1)
Output
array([0.25 , 0. , 0.58333333])
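Another compact route (a sketch of my own, not one of the answers above): let the division produce nan silently inside np.errstate and zero the invalid entries afterwards with np.nan_to_num. For these arrays it reproduces the same output.
import numpy as np
A = np.array([[2,2,2],[1,0,0],[1,2,1]])
B = np.array([[2,0,2],[0,1,0],[1,2,1]])
C = np.array([[2,0,1],[0,1,0],[1,1,2]])
with np.errstate(divide='ignore', invalid='ignore'):
    avg = np.average(C/(A+B), axis=1)
print(np.nan_to_num(avg))   # [0.25       0.         0.58333333]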

Using numpy where functions

I am trying to understand the behavior of the following piece of code:
import numpy as np
theta = np.arange(0,1.1,0.1)
prior_theta = 0.7
prior_prob = np.where(theta == prior_theta)
print(prior_prob)
However, if I explicitly give the datatype, the where function works as expected:
import numpy as np
theta = np.arange(0,1.1,0.1,dtype = np.float32)
prior_theta = 0.7
prior_prob = np.where(theta == prior_theta)
print(prior_prob)
This seems like a data type comparison issue. Any insight into this would be very helpful.
This is just how floating point numbers work. You can't rely on exact comparisons. The number 0.7 cannot be represented exactly in binary -- it is an infinitely repeating fraction. arange has to compute 0.1+0.1+0.1+0.1 etc., and the round-off errors accumulate. The value at index 7 is not exactly the same as the literal value 0.7. The rounding is different for float32, so you happened to get lucky.
You need to get in the habit of using "close enough" comparisons, like where(np.abs(theta-prior_theta) < 0.0001).
np.isclose (and np.allclose) is useful when testing floats.
In [240]: theta = np.arange(0,1.1,0.1)
In [241]: theta
Out[241]: array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
In [242]: theta == 0.7
Out[242]:
array([False, False, False, False, False, False, False, False, False,
False, False])
np.arange warns us about using float increments - read the warnings section.
In [243]: theta.tolist()
Out[243]:
[0.0,
0.1,
0.2,
0.30000000000000004,
0.4,
0.5,
0.6000000000000001,
0.7000000000000001,
0.8,
0.9,
1.0]
In [244]: np.isclose(theta, 0.7)
Out[244]:
array([False, False, False, False, False, False, False, True, False,
False, False])
In [245]: np.nonzero(np.isclose(theta, 0.7))
Out[245]: (array([7]),)
The arange docs suggest using np.linspace, but that's more to address the endpoint issue, which you've already handled with the 1.1 stop value. The 0.7 value would still be the same.
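Putting it back into the original snippet, a sketch using np.isclose in place of the exact equality:
import numpy as np
theta = np.arange(0, 1.1, 0.1)
prior_theta = 0.7
prior_prob = np.where(np.isclose(theta, prior_theta))
print(prior_prob)   # (array([7]),)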

Numpy only on finite entries

Here's a brief example of a function. It maps a vector to a vector. However, entries that are NaN or inf should be ignored. Currently this looks rather clumsy to me. Do you have any suggestions?
from scipy import stats
import numpy as np
def p(vv):
    mask = np.isfinite(vv)
    y = np.NaN * vv
    v = vv[mask]
    y[mask] = 1/v*(stats.hmean(v)/len(v))
    return y
You can change the NaN values to zero with Numpy's isnan function and then remove the zeros as follows:
import numpy as np
def p(vv):
    # assuming vv is your array
    # use NumPy's isnan function to replace the NaN values in the array with zero
    replace_NaN = np.isnan(vv)
    vv[replace_NaN] = 0

    # convert array vv to a list
    vv_list = vv.tolist()
    new_list = []

    # loop over vv_list and exclude 0 values
    for i in vv_list:
        if i != 0:
            new_list.append(i)

    # set array vv again
    vv = np.array(new_list, dtype='float64')
    return vv
I have come up with this kind of construction:
from scipy import stats
import numpy as np
## operate only on the valid entries of x and use the same mask on the resulting vector y
def __f(func, x):
    mask = np.isfinite(x)
    y = np.NaN * x
    y[mask] = func(x[mask])
    return y

# implementation of the parity function
def __pp(x):
    return 1/x*(stats.hmean(x)/len(x))

def pp(vv):
    return __f(__pp, vv)
Masked arrays accomplish this functionality and allow you to specify the mask as you desire. The numpy 1.18 docs for it are here: https://numpy.org/doc/1.18/reference/maskedarray.generic.html#what-is-a-masked-array
In masked arrays, entries whose mask value is False are used in calculations, while entries with True are ignored.
Example for obtaining the mean of only the finite values using np.isfinite():
import numpy as np
# Seeding for reproducing these results
np.random.seed(0)
# Generate random data and add some non-finite values
x = np.random.randint(0, 5, (3, 3)).astype(np.float32)
x[1,2], x[2,1], x[2,2] = np.inf, -np.inf, np.nan
# array([[ 4., 0., 3.],
# [ 3., 3., inf],
# [ 3., -inf, nan]], dtype=float32)
# Make masked array. Note the logical not of isfinite
x_masked = np.ma.masked_array(x, mask=~np.isfinite(x))
# Mean of entire masked matrix
x_masked.mean()
# 2.6666666666666665
# Masked matrix's row means
x_masked.mean(1)
# masked_array(data=[2.3333333333333335, 3.0, 3.0],
# mask=[False, False, False],
# fill_value=1e+20)
# Masked matrix's column means
x_masked.mean(0)
# masked_array(data=[3.3333333333333335, 1.5, 3.0],
# mask=[False, False, False],
# fill_value=1e+20)
Note that scipy.stats.hmean() also works with masked arrays.
Note that if all you care about is detecting NaNs and leaving infs, then you can use np.isnan() instead of np.isfinite().
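As a sketch (my adaptation, not from the answers above), the original p(vv) can be rewritten on top of a masked array: mask the non-finite entries, compute on the valid data only, and fill the skipped positions back with NaN.
import numpy as np
from scipy import stats

def p_ma(vv):
    mv = np.ma.masked_invalid(vv)        # masks NaN and +/- inf
    valid = mv.compressed()              # plain 1-D array of the finite values
    out = np.ma.masked_all(vv.shape)     # start fully masked, float dtype
    out[~np.ma.getmaskarray(mv)] = 1/valid*(stats.hmean(valid)/len(valid))
    return out.filled(np.nan)            # NaN where the input was not finite

print(p_ma(np.array([1.0, 2.0, np.nan, 4.0])))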

Find the min/max excluding zeros in a numpy array (or a tuple) in python

I have an array whose valid values are non-zero (either positive or negative). I want to find the minimum and maximum within the array without taking zeros into account. For example, if the numbers are all negative, zeros will be problematic.
How about:
import numpy as np
minval = np.min(a[np.nonzero(a)])
maxval = np.max(a[np.nonzero(a)])
where a is your array.
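For example, with an all-negative array like the one mentioned in the question (my example values):
import numpy as np
a = np.array([-2, 0, -4, 0, -3, -2])
print(np.min(a[np.nonzero(a)]), np.max(a[np.nonzero(a)]))   # -4 -2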
If you can choose the "invalid" value in your array, it is better to use nan instead of 0:
>>> a = numpy.array([1.0, numpy.nan, 2.0])
>>> numpy.nanmax(a)
2.0
>>> numpy.nanmin(a)
1.0
If this is not possible, you can use an array mask:
>>> a = numpy.array([1.0, 0.0, 2.0])
>>> masked_a = numpy.ma.masked_equal(a, 0.0, copy=False)
>>> masked_a.max()
2.0
>>> masked_a.min()
1.0
Compared to Josh's answer using advanced indexing, this has the advantage of not creating a copy of the array.
Here's another way of masking which I think is easier to remember (although it does copy the array). For the case in point, it goes like this:
>>> import numpy
>>> a = numpy.array([1.0, 0.0, 2.0])
>>> ma = a[a != 0]
>>> ma.max()
2.0
>>> ma.min()
1.0
>>>
It generalizes to other expressions such as a > 0, numpy.isnan(a), ...
And you can combine masks with standard operators (+ means OR, * means AND, ~ means NOT), e.g.:
# Identify elements that are outside interpolation domain or NaN
outside = (xi < x[0]) + (eta < y[0]) + (xi > x[-1]) + (eta > y[-1])
outside += numpy.isnan(xi) + numpy.isnan(eta)
inside = ~outside
xi = xi[inside]
eta = eta[inside]
You could use a generator expression to filter out the zeros:
array = [-2, 0, -4, 0, -3, -2]
max(x for x in array if x != 0)
Masked arrays in general are designed exactly for this kind of purpose. You can mask the zeros out of an array (or apply ANY other kind of mask you desire, even masks that are more complicated than a simple equality) and do pretty much everything you do on regular arrays on your masked array. You can also specify an axis along which to find the min (see the sketch after the example below):
import numpy.ma as ma
mx = ma.masked_array(x, mask=x==0)
mx.min()
Example input:
x = np.array([1.0, 0.0, 2.0])
output:
1.0
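And a small sketch of the per-axis case mentioned above (my own example values): take the row-wise minimum while ignoring the zeros.
import numpy as np
import numpy.ma as ma

x = np.array([[3.0, 0.0, 5.0],
              [0.0, -2.0, 7.0]])
mx = ma.masked_array(x, mask=(x == 0))
print(mx.min(axis=1))   # [3.0 -2.0]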
A simple way would be to use a list comprehension to exclude zeros.
>>> tup = (0, 1, 2, 5, 2)
>>> min([x for x in tup if x !=0])
1
