I would like to convert a NumPy array to a unit vector. More specifically, I am looking for an equivalent version of this normalisation function:
def normalize(v):
norm = np.linalg.norm(v)
if norm == 0:
return v
return v / norm
This function handles the situation where vector v has the norm value of 0.
Is there any similar functions provided in sklearn or numpy?
If you're using scikit-learn you can use sklearn.preprocessing.normalize:
import numpy as np
from sklearn.preprocessing import normalize
x = np.random.rand(1000)*10
norm1 = x / np.linalg.norm(x)
norm2 = normalize(x[:,np.newaxis], axis=0).ravel()
print np.all(norm1 == norm2)
# True
I agree that it would be nice if such a function were part of the included libraries. But it isn't, as far as I know. So here is a version for arbitrary axes that gives optimal performance.
import numpy as np
def normalized(a, axis=-1, order=2):
l2 = np.atleast_1d(np.linalg.norm(a, order, axis))
l2[l2==0] = 1
return a / np.expand_dims(l2, axis)
A = np.random.randn(3,3,3)
print(normalized(A,0))
print(normalized(A,1))
print(normalized(A,2))
print(normalized(np.arange(3)[:,None]))
print(normalized(np.arange(3)))
This might also work for you
import numpy as np
normalized_v = v / np.sqrt(np.sum(v**2))
but fails when v has length 0.
In that case, introducing a small constant to prevent the zero division solves this.
As proposed in the comments one could also use
v/np.linalg.norm(v)
To avoid zero division I use eps, but that's maybe not great.
def normalize(v):
norm=np.linalg.norm(v)
if norm==0:
norm=np.finfo(v.dtype).eps
return v/norm
If you have multidimensional data and want each axis normalized to its max or its sum:
def normalize(_d, to_sum=True, copy=True):
# d is a (n x dimension) np array
d = _d if not copy else np.copy(_d)
d -= np.min(d, axis=0)
d /= (np.sum(d, axis=0) if to_sum else np.ptp(d, axis=0))
return d
Uses numpys peak to peak function.
a = np.random.random((5, 3))
b = normalize(a, copy=False)
b.sum(axis=0) # array([1., 1., 1.]), the rows sum to 1
c = normalize(a, to_sum=False, copy=False)
c.max(axis=0) # array([1., 1., 1.]), the max of each row is 1
If you don't need utmost precision, your function can be reduced to:
v_norm = v / (np.linalg.norm(v) + 1e-16)
You mentioned sci-kit learn, so I want to share another solution.
sci-kit learn MinMaxScaler
In sci-kit learn, there is a API called MinMaxScaler which can customize the the value range as you like.
It also deal with NaN issues for us.
NaNs are treated as missing values: disregarded in fit, and maintained
in transform. ... see reference [1]
Code sample
The code is simple, just type
# Let's say X_train is your input dataframe
from sklearn.preprocessing import MinMaxScaler
# call MinMaxScaler object
min_max_scaler = MinMaxScaler()
# feed in a numpy array
X_train_norm = min_max_scaler.fit_transform(X_train.values)
# wrap it up if you need a dataframe
df = pd.DataFrame(X_train_norm)
Reference
[1] sklearn.preprocessing.MinMaxScaler
There is also the function unit_vector() to normalize vectors in the popular transformations module by Christoph Gohlke:
import transformations as trafo
import numpy as np
data = np.array([[1.0, 1.0, 0.0],
[1.0, 1.0, 1.0],
[1.0, 2.0, 3.0]])
print(trafo.unit_vector(data, axis=1))
If you work with multidimensional array following fast solution is possible.
Say we have 2D array, which we want to normalize by last axis, while some rows have zero norm.
import numpy as np
arr = np.array([
[1, 2, 3],
[0, 0, 0],
[5, 6, 7]
], dtype=np.float)
lengths = np.linalg.norm(arr, axis=-1)
print(lengths) # [ 3.74165739 0. 10.48808848]
arr[lengths > 0] = arr[lengths > 0] / lengths[lengths > 0][:, np.newaxis]
print(arr)
# [[0.26726124 0.53452248 0.80178373]
# [0. 0. 0. ]
# [0.47673129 0.57207755 0.66742381]]
If you want to normalize n dimensional feature vectors stored in a 3D tensor, you could also use PyTorch:
import numpy as np
from torch import FloatTensor
from torch.nn.functional import normalize
vecs = np.random.rand(3, 16, 16, 16)
norm_vecs = normalize(FloatTensor(vecs), dim=0, eps=1e-16).numpy()
If you're working with 3D vectors, you can do this concisely using the toolbelt vg. It's a light layer on top of numpy and it supports single values and stacked vectors.
import numpy as np
import vg
x = np.random.rand(1000)*10
norm1 = x / np.linalg.norm(x)
norm2 = vg.normalize(x)
print np.all(norm1 == norm2)
# True
I created the library at my last startup, where it was motivated by uses like this: simple ideas which are way too verbose in NumPy.
Without sklearn and using just numpy.
Just define a function:.
Assuming that the rows are the variables and the columns the samples (axis= 1):
import numpy as np
# Example array
X = np.array([[1,2,3],[4,5,6]])
def stdmtx(X):
means = X.mean(axis =1)
stds = X.std(axis= 1, ddof=1)
X= X - means[:, np.newaxis]
X= X / stds[:, np.newaxis]
return np.nan_to_num(X)
output:
X
array([[1, 2, 3],
[4, 5, 6]])
stdmtx(X)
array([[-1., 0., 1.],
[-1., 0., 1.]])
For a 2D array, you can use the following one-liner to normalize across rows. To normalize across columns, simply set axis=0.
a / np.linalg.norm(a, axis=1, keepdims=True)
If you want all values in [0; 1] for 1d-array then just use
(a - a.min(axis=0)) / (a.max(axis=0) - a.min(axis=0))
Where a is your 1d-array.
An example:
>>> a = np.array([0, 1, 2, 4, 5, 2])
>>> (a - a.min(axis=0)) / (a.max(axis=0) - a.min(axis=0))
array([0. , 0.2, 0.4, 0.8, 1. , 0.4])
Note for the method. For saving proportions between values there is a restriction: 1d-array must have at least one 0 and consists of 0 and positive numbers.
A simple dot product would do the job. No need for any extra package.
x = x/np.sqrt(x.dot(x))
By the way, if the norm of x is zero, it is inherently a zero vector, and cannot be converted to a unit vector (which has norm 1). If you want to catch the case of np.array([0,0,...0]), then use
norm = np.sqrt(x.dot(x))
x = x/norm if norm != 0 else x
Related
For machine learning, I'm appliying Parzen Window algorithm.
I have an array (m,n). I would like to check on each row if any of the values is > 0.5 and if each of them is, then I would return 0, otherwise 1.
I would like to know if there is a way to do this without a loop thanks to numpy.
You can use np.all with axis=1 on a boolean array.
import numpy as np
arr = np.array([[0.8, 0.9], [0.1, 0.6], [0.2, 0.3]])
print(np.all(arr>0.5, axis=1))
>> [True False False]
import numpy as np
# Value Initialization
a = np.array([0.75, 0.25, 0.50])
y_predict = np.zeros((1, a.shape[0]))
#If the value is greater than 0.5, the value is 1; otherwise 0
y_predict = (a > 0.5).astype(float)
I have an array (m,n). I would like to check on each row if any of the values is > 0.5
That will be stored in b:
import numpy as np
a = # some np.array of shape (m,n)
b = np.any(a > 0.5, axis=1)
and if each of them is, then I would return 0, otherwise 1.
I'm assuming you mean 'and if this is the case for all rows'. In this case:
c = 1 - 1 * np.all(b)
c contains your return value, either 0 or 1.
I am trying to apply a softmax function to a numpy array. But I am not getting the desired results. This is the code I have tried:
import numpy as np
x = np.array([[1001,1002],[3,4]])
softmax = np.exp(x - np.max(x))/(np.sum(np.exp(x - np.max(x)))
print softmax
I think the x - np.max(x) code is not subtracting the max of each row. The max needs to be subtracted from x to prevent very large numbers.
This is supposed to output
np.array([
[0.26894142, 0.73105858],
[0.26894142, 0.73105858]])
But I am getting:
np.array([
[0.26894142, 0.73105858],
[0, 0]])
A convenient way to keep the axes that are consumed by "reduce" operations such as max or sum is the keepdims keyword:
mx = np.max(x, axis=-1, keepdims=True)
mx
# array([[1002],
# [ 4]])
x - mx
# array([[-1, 0],
# [-1, 0]])
numerator = np.exp(x - mx)
denominator = np.sum(numerator, axis=-1, keepdims=True)
denominator
# array([[ 1.36787944],
# [ 1.36787944]])
numerator/denominator
# array([[ 0.26894142, 0.73105858],
[ 0.26894142, 0.73105858]])
My 5-liner (which uses scipy logsumexp for the tricky bits):
def softmax(a, axis=None):
"""
Computes exp(a)/sumexp(a); relies on scipy logsumexp implementation.
:param a: ndarray/tensor
:param axis: axis to sum over; default (None) sums over everything
"""
from scipy.special import logsumexp
lse = logsumexp(a, axis=axis) # this reduces along axis
if axis is not None:
lse = np.expand_dims(lse, axis) # restore that axis for subtraction
return np.exp(a - lse)
You may have to use from scipy.misc import logsumexp if you have an older scipy version.
EDIT. As of version 1.2.0, scipy includes softmax as a special function:
https://scipy.github.io/devdocs/generated/scipy.special.softmax.html
I wrote a very general softmax function operating over an arbitrary axis, including the tricky max subtraction bit. The function is below, and I wrote a blog post about it here.
def softmax(X, theta = 1.0, axis = None):
"""
Compute the softmax of each element along an axis of X.
Parameters
----------
X: ND-Array. Probably should be floats.
theta (optional): float parameter, used as a multiplier
prior to exponentiation. Default = 1.0
axis (optional): axis to compute values along. Default is the
first non-singleton axis.
Returns an array the same size as X. The result will sum to 1
along the specified axis.
"""
# make X at least 2d
y = np.atleast_2d(X)
# find axis
if axis is None:
axis = next(j[0] for j in enumerate(y.shape) if j[1] > 1)
# multiply y against the theta parameter,
y = y * float(theta)
# subtract the max for numerical stability
y = y - np.expand_dims(np.max(y, axis = axis), axis)
# exponentiate y
y = np.exp(y)
# take the sum along the specified axis
ax_sum = np.expand_dims(np.sum(y, axis = axis), axis)
# finally: divide elementwise
p = y / ax_sum
# flatten if X was 1D
if len(X.shape) == 1: p = p.flatten()
return p
The x - np.max(x) code is not doing row-wise subtraction.
Let's do it step-wise. First we will make a 'maxes' array by tiling or making a copy of the column:
maxes = np.tile(np.max(x,1), (2,1)).T
This will create a 2X2 matrix which will correspond to the maxes for each row by making a duplicate column(tile). After this you can do:
x = np.exp(x - maxes)/(np.sum(np.exp(x - maxes), axis = 1))
You should get your result with this. The axis = 1 is for the row-wise softmax you mentioned in the heading of your answer. Hope this helps.
How about this?
For taking max along the rows just specify the argument as axis=1 and then convert the result as a column vector(but a 2D array actually) using np.newaxis/None.
In [40]: x
Out[40]:
array([[1001, 1002],
[ 3, 4]])
In [41]: z = x - np.max(x, axis=1)[:, np.newaxis]
In [42]: z
Out[42]:
array([[-1, 0],
[-1, 0]])
In [44]: softmax = np.exp(z) / np.sum(np.exp(z), axis=1)[:, np.newaxis]
In [45]: softmax
Out[45]:
array([[ 0.26894142, 0.73105858],
[ 0.26894142, 0.73105858]])
In the last step, again when you take sum just specify the argument axis=1 to sum it along the rows.
I am learning numpy/scipy, coming from a MATLAB background. The xcorr function in Matlab has an optional argument "maxlag" that limits the lag range from –maxlag to maxlag. This is very useful if you are looking at the cross-correlation between two very long time series but are only interested in the correlation within a certain time range. The performance increases are enormous considering that cross-correlation is incredibly expensive to compute.
In numpy/scipy it seems there are several options for computing cross-correlation. numpy.correlate, numpy.convolve, scipy.signal.fftconvolve. If someone wishes to explain the difference between these, I'd be happy to hear, but mainly what is troubling me is that none of them have a maxlag feature. This means that even if I only want to see correlations between two time series with lags between -100 and +100 ms, for example, it will still calculate the correlation for every lag between -20000 and +20000 ms (which is the length of the time series). This gives a 200x performance hit! Do I have to recode the cross-correlation function by hand to include this feature?
Here are a couple functions to compute auto- and cross-correlation with limited lags. The order of multiplication (and conjugation, in the complex case) was chosen to match the corresponding behavior of numpy.correlate.
import numpy as np
from numpy.lib.stride_tricks import as_strided
def _check_arg(x, xname):
x = np.asarray(x)
if x.ndim != 1:
raise ValueError('%s must be one-dimensional.' % xname)
return x
def autocorrelation(x, maxlag):
"""
Autocorrelation with a maximum number of lags.
`x` must be a one-dimensional numpy array.
This computes the same result as
numpy.correlate(x, x, mode='full')[len(x)-1:len(x)+maxlag]
The return value has length maxlag + 1.
"""
x = _check_arg(x, 'x')
p = np.pad(x.conj(), maxlag, mode='constant')
T = as_strided(p[maxlag:], shape=(maxlag+1, len(x) + maxlag),
strides=(-p.strides[0], p.strides[0]))
return T.dot(p[maxlag:].conj())
def crosscorrelation(x, y, maxlag):
"""
Cross correlation with a maximum number of lags.
`x` and `y` must be one-dimensional numpy arrays with the same length.
This computes the same result as
numpy.correlate(x, y, mode='full')[len(a)-maxlag-1:len(a)+maxlag]
The return vaue has length 2*maxlag + 1.
"""
x = _check_arg(x, 'x')
y = _check_arg(y, 'y')
py = np.pad(y.conj(), 2*maxlag, mode='constant')
T = as_strided(py[2*maxlag:], shape=(2*maxlag+1, len(y) + 2*maxlag),
strides=(-py.strides[0], py.strides[0]))
px = np.pad(x, maxlag, mode='constant')
return T.dot(px)
For example,
In [367]: x = np.array([2, 1.5, 0, 0, -1, 3, 2, -0.5])
In [368]: autocorrelation(x, 3)
Out[368]: array([ 20.5, 5. , -3.5, -1. ])
In [369]: np.correlate(x, x, mode='full')[7:11]
Out[369]: array([ 20.5, 5. , -3.5, -1. ])
In [370]: y = np.arange(8)
In [371]: crosscorrelation(x, y, 3)
Out[371]: array([ 5. , 23.5, 32. , 21. , 16. , 12.5, 9. ])
In [372]: np.correlate(x, y, mode='full')[4:11]
Out[372]: array([ 5. , 23.5, 32. , 21. , 16. , 12.5, 9. ])
(It will be nice to have such a feature in numpy itself.)
Until numpy implements the maxlag argument, you can use the function ucorrelate from the pycorrelate package. ucorrelate operates on numpy arrays and has a maxlag keyword. It implements the correlation from using a for-loop and optimizes the execution speed with numba.
Example - autocorrelation with 3 time lags:
import numpy as np
import pycorrelate as pyc
x = np.array([2, 1.5, 0, 0, -1, 3, 2, -0.5])
c = pyc.ucorrelate(x, x, maxlag=3)
c
Result:
Out[1]: array([20, 5, -3])
The pycorrelate documentation contains a notebook showing perfect match between pycorrelate.ucorrelate and numpy.correlate:
matplotlib.pyplot provides matlab like syntax for computating and plotting of cross correlation , auto correlation etc.
You can use xcorr which allows to define the maxlags parameter.
import matplotlib.pyplot as plt
import numpy as np
data = np.arange(0,2*np.pi,0.01)
y1 = np.sin(data)
y2 = np.cos(data)
coeff = plt.xcorr(y1,y2,maxlags=10)
print(*coeff)
[-10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7
8 9 10] [ -9.81991753e-02 -8.85505028e-02 -7.88613080e-02 -6.91325329e-02
-5.93651264e-02 -4.95600447e-02 -3.97182508e-02 -2.98407146e-02
-1.99284126e-02 -9.98232812e-03 -3.45104289e-06 9.98555430e-03
1.99417667e-02 2.98641953e-02 3.97518558e-02 4.96037706e-02
5.94189688e-02 6.91964864e-02 7.89353663e-02 8.86346584e-02
9.82934198e-02] <matplotlib.collections.LineCollection object at 0x00000000074A9E80> Line2D(_line0)
#Warren Weckesser's answer is the best as it leverages numpy to get performance savings (and not just call corr for each lag). Nonetheless, it returns the cross-product (eg the dot product between the inputs at various lags). To get the actual cross-correlation I modified his answer w/ an optional mode argument, which if set to 'corr' returns the cross-correlation as such:
def crosscorrelation(x, y, maxlag, mode='corr'):
"""
Cross correlation with a maximum number of lags.
`x` and `y` must be one-dimensional numpy arrays with the same length.
This computes the same result as
numpy.correlate(x, y, mode='full')[len(a)-maxlag-1:len(a)+maxlag]
The return vaue has length 2*maxlag + 1.
"""
py = np.pad(y.conj(), 2*maxlag, mode='constant')
T = as_strided(py[2*maxlag:], shape=(2*maxlag+1, len(y) + 2*maxlag),
strides=(-py.strides[0], py.strides[0]))
px = np.pad(x, maxlag, mode='constant')
if mode == 'dot': # get lagged dot product
return T.dot(px)
elif mode == 'corr': # gets Pearson correlation
return (T.dot(px)/px.size - (T.mean(axis=1)*px.mean())) / \
(np.std(T, axis=1) * np.std(px))
I encountered the same problem some time ago, I paid more attention to the efficiency of calculation.Refer to the source code of MATLAB's function xcorr.m, I made a simple one.
import numpy as np
from scipy import signal, fftpack
import math
import time
def nextpow2(x):
if x == 0:
y = 0
else:
y = math.ceil(math.log2(x))
return y
def xcorr(x, y, maxlag):
m = max(len(x), len(y))
mx1 = min(maxlag, m - 1)
ceilLog2 = nextpow2(2 * m - 1)
m2 = 2 ** ceilLog2
X = fftpack.fft(x, m2)
Y = fftpack.fft(y, m2)
c1 = np.real(fftpack.ifft(X * np.conj(Y)))
index1 = np.arange(1, mx1+1, 1) + (m2 - mx1 -1)
index2 = np.arange(1, mx1+2, 1) - 1
c = np.hstack((c1[index1], c1[index2]))
return c
if __name__ == "__main__":
s = time.clock()
a = [1, 2, 3, 4, 5]
b = [6, 7, 8, 9, 10]
c = xcorr(a, b, 3)
e = time.clock()
print(c)
print(e-c)
Take the results of a certain run as an exmple:
[ 29. 56. 90. 130. 110. 86. 59.]
0.0001745000000001884
comparing with MATLAB code:
clear;close all;clc
tic
a = [1, 2, 3, 4, 5];
b = [6, 7, 8, 9, 10];
c = xcorr(a, b, 3)
toc
29.0000 56.0000 90.0000 130.0000 110.0000 86.0000 59.0000
时间已过 0.000279 秒。
If anyone can give a strict mathematical derivation about this,that would be very helpful.
I think I have found a solution, as I was facing the same problem:
If you have two vectors x and y of any length N, and want a cross-correlation with a window of fixed len m, you can do:
x = <some_data>
y = <some_data>
# Trim your variables
x_short = x[window:]
y_short = y[window:]
# do two xcorrelations, lagging x and y respectively
left_xcorr = np.correlate(x, y_short) #defaults to 'valid'
right_xcorr = np.correlate(x_short, y) #defaults to 'valid'
# combine the xcorrelations
# note the first value of right_xcorr is the same as the last of left_xcorr
xcorr = np.concatenate(left_xcorr, right_xcorr[1:])
Remember you might need to normalise the variables if you want a bounded correlation
Here is another answer, sourced from here, seems faster on the margin than np.correlate and has the benefit of returning a normalised correlation:
def rolling_window(self, a, window):
shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
strides = a.strides + (a.strides[-1],)
return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
def xcorr(self, x,y):
N=len(x)
M=len(y)
meany=np.mean(y)
stdy=np.std(np.asarray(y))
tmp=self.rolling_window(np.asarray(x),M)
c=np.sum((y-meany)*(tmp-np.reshape(np.mean(tmp,-1),(N-M+1,1))),-1)/(M*np.std(tmp,-1)*stdy)
return c
as I answered here, https://stackoverflow.com/a/47897581/5122657
matplotlib.xcorr has the maxlags param. It is actually a wrapper of the numpy.correlate, so there is no performance saving. Nevertheless it gives exactly the same result given by Matlab's cross-correlation function. Below I edited the code from matplotlib so that it will return only the correlation. The reason is that if we use matplotlib.corr as it is, it will return the plot as well. The problem is, if we put complex data type as the arguments into it, we will get "casting complex to real datatype" warning when matplotlib tries to draw the plot.
<!-- language: python -->
import numpy as np
import matplotlib.pyplot as plt
def xcorr(x, y, maxlags=10):
Nx = len(x)
if Nx != len(y):
raise ValueError('x and y must be equal length')
c = np.correlate(x, y, mode=2)
if maxlags is None:
maxlags = Nx - 1
if maxlags >= Nx or maxlags < 1:
raise ValueError('maxlags must be None or strictly positive < %d' % Nx)
c = c[Nx - 1 - maxlags:Nx + maxlags]
return c
I would like to improve the speed of my code by computing a function once on a numpy array instead of a for loop is over a function of this python library. If I have a function as following:
import numpy as np
import galsim
from math import *
M200=1e14
conc=6.9
def func(M200, conc):
halo_z=0.2
halo_pos =[1200., 3769.7]
halo_pos = galsim.PositionD(x=halo_pos_arcsec[0],y=halo_pos_arcsec[1])
nfw = galsim.NFWHalo(mass=M200, conc=conc, redshift=halo_z,halo_pos=halo_pos, omega_m = 0.3, omega_lam =0.7)
for i in range(len(shear_z)):
shear_pos=galsim.PositionD(x=pos_arcsec[i,0],y=pos_arcsec[i,1])
model_g1, model_g2 = nfw.getShear(pos=self.shear_pos, z_s=shear_z[i])
l=np.sum(model_g1-model_g2)/sqrt(np.pi)
return l
While pos_arcsec is a two-dimensional array of 24000x2 and shear_z is a 1D array with 24000 elements as well.
The main problem is that I want to calculate this function on a grid where M200=np.arange(13., 16., 0.01) and conc = np.arange(3, 10, 0.01). I don't know how to broadcast this function to be estimated for this two dimensional array over M200 and conc. It takes a lot to run the code. I am looking for the best approaches to speed up these calculations.
This here should work when pos is an array of shape (n,2)
import numpy as np
def f(pos, z):
r=np.sqrt(pos[...,0]**2+pos[...,1]**2)
return np.log(r)*(z+1)
Example:
z = np.arange(10)
pos = np.arange(20).reshape(10,2)
f(pos,z)
# array([ 0. , 2.56494936, 5.5703581 , 8.88530251,
# 12.44183436, 16.1944881 , 20.11171117, 24.17053133,
# 28.35353608, 32.64709419])
Use numpy.linalg.norm
If you have an array:
import numpy as np
import numpy.linalg as la
a = np.array([[3, 4], [5, 12], [7, 24]])
then you can determine the magnitude of the resulting vector (sqrt(a^2 + b^2)) by
b = np.sqrt(la.norm(a, axis=1)
>>> print b
array([ 5., 15. 25.])
Here's a brief example of a function. It maps a vector to a vector. However, entries that are NaN or inf should be ignored. Currently this looks rather clumsy to me. Do you have any suggestions?
from scipy import stats
import numpy as np
def p(vv):
mask = np.isfinite(vv)
y = np.NaN * vv
v = vv[mask]
y[mask] = 1/v*(stats.hmean(v)/len(v))
return y
You can change the NaN values to zero with Numpy's isnan function and then remove the zeros as follows:
import numpy as np
def p(vv):
# assuming vv is your array
# use Nympy's isnan function to replace the NaN values in the array with zero
replace_NaN = np.isnan(vv)
vv[replace_NaN] = 0
# convert array vv to list
vv_list = vv.tolist()
new_list = []
# loop vv_list and exclude 0 values:
for i in vv_list:
if i != 0:
new.list.append(i)
# set array vv again
vv = np.array(new_list, dtype = 'float64')
return vv
I have came up with this kind of construction:
from scipy import stats
import numpy as np
## operate only on the valid entries of x and use the same mask on the resulting vector y
def __f(func, x):
mask = np.isfinite(x)
y = np.NaN * x
y[mask] = func(x[mask])
return y
# implementation of the parity function
def __pp(x):
return 1/x*(stats.hmean(x)/len(x))
def pp(vv):
return __f(__pp, vv)
Masked arrays accomplish this functionality and allow you to specify the mask as you desire. The numpy 1.18 docs for it are here: https://numpy.org/doc/1.18/reference/maskedarray.generic.html#what-is-a-masked-array
In masked arrays, False mask values are used in calculations, while True are ignored for calculations.
Example for obtaining the mean of only the finite values using np.isfinite():
import numpy as np
# Seeding for reproducing these results
np.random.seed(0)
# Generate random data and add some non-finite values
x = np.random.randint(0, 5, (3, 3)).astype(np.float32)
x[1,2], x[2,1], x[2,2] = np.inf, -np.inf, np.nan
# array([[ 4., 0., 3.],
# [ 3., 3., inf],
# [ 3., -inf, nan]], dtype=float32)
# Make masked array. Note the logical not of isfinite
x_masked = np.ma.masked_array(x, mask=~np.isfinite(x))
# Mean of entire masked matrix
x_masked.mean()
# 2.6666666666666665
# Masked matrix's row means
x_masked.mean(1)
# masked_array(data=[2.3333333333333335, 3.0, 3.0],
# mask=[False, False, False],
# fill_value=1e+20)
# Masked matrix's column means
x_masked.mean(0)
# masked_array(data=[3.3333333333333335, 1.5, 3.0],
# mask=[False, False, False],
# fill_value=1e+20)
Note that scipy.stats.hmean() also works with masked arrays.
Note that if all you care about is detecting NaNs and leaving infs, then you can use np.isnan() instead of np.isfinite().