Calculate moving average in numpy array with NaNs

I am trying to calculate the moving average in a large numpy array that contains NaNs. Currently I am using:
import numpy as np
def moving_average(a,n=5):
    ret = np.cumsum(a,dtype=float)
    ret[n:] = ret[n:]-ret[:-n]
    return ret[n-1:]/n
When calculating with a masked array:
x = np.array([1.,3,np.nan,7,8,1,2,4,np.nan,np.nan,4,4,np.nan,1,3,6,3])
mx = np.ma.masked_array(x,np.isnan(x))
y = moving_average(mx).filled(np.nan)
print(y)
>>> array([3.8,3.8,3.6,nan,nan,nan,2,2.4,nan,nan,nan,2.8,2.6])
The result I am looking for (below) should have NaNs only where the original array, x, had NaNs, and each average should be taken over the number of non-NaN elements in the window (so the divisor n in the function needs to vary from window to window).
y = array([4.75,4.75,nan,4.4,3.75,2.33,3.33,4,nan,nan,3,3.5,nan,3.25,4,4.5,3])
I could loop over the entire array and check index by index but the array I am using is very large and that would take a long time. Is there a numpythonic way to do this?

Pandas has a lot of really nice functionality for this. For example:
import numpy as np
import pandas as pd
x = np.array([np.nan, np.nan, 3, 3, 3, np.nan, 5, 7, 7])
# requires three valid values in a row or the resulting value is null
print(pd.Series(x).rolling(3).mean())
#output
nan,nan,nan, nan, 3, nan, nan, nan, 6.333
# only requires 2 valid values out of three for size=3 window
print(pd.Series(x).rolling(3, min_periods=2).mean())
#output
nan, nan, nan, 3, 3, 3, 4, 6, 6.3333
You can play around with the windows/min_periods and consider filling-in nulls all in one chained line of code.

I'll just add to the great answers before that you could still use cumsum to achieve this:
import numpy as np
def moving_average(a, n=5):
    ret = np.cumsum(a.filled(0))
    ret[n:] = ret[n:] - ret[:-n]
    counts = np.cumsum(~a.mask)
    counts[n:] = counts[n:] - counts[:-n]
    ret[~a.mask] /= counts[~a.mask]
    ret[a.mask] = np.nan
    return ret
x = np.array([1.,3,np.nan,7,8,1,2,4,np.nan,np.nan,4,4,np.nan,1,3,6,3])
mx = np.ma.masked_array(x,np.isnan(x))
y = moving_average(mx)
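The cumulative-sum idea can also be arranged so that each window starts at the current index, which is the alignment the question's desired output uses. The following is a minimal sketch under that assumption; the function name forward_nan_moving_average and the zero-padded prefix sums are illustrative choices, not part of the answer above.

```python
import numpy as np

def forward_nan_moving_average(x, n=5):
    """Mean of the non-NaN values in x[i:i+n]; NaN wherever x[i] is NaN."""
    x = np.asarray(x, dtype=float)
    valid = ~np.isnan(x)
    # Prefix sums with a leading zero, so that sums[j] - sums[i] == sum(x[i:j]).
    sums = np.concatenate(([0.0], np.cumsum(np.where(valid, x, 0.0))))
    counts = np.concatenate(([0], np.cumsum(valid)))
    start = np.arange(x.size)
    stop = np.minimum(start + n, x.size)  # windows shrink at the right edge
    window_sum = sums[stop] - sums[start]
    window_count = counts[stop] - counts[start]
    out = np.full(x.shape, np.nan)
    # A valid x[i] guarantees window_count[i] >= 1, so no division by zero.
    out[valid] = window_sum[valid] / window_count[valid]
    return out

x = np.array([1., 3, np.nan, 7, 8, 1, 2, 4, np.nan, np.nan, 4, 4, np.nan, 1, 3, 6, 3])
y = forward_nan_moving_average(x, n=5)
```

Because only two cumulative sums are computed, the cost stays linear in the array length regardless of the window size n.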

You could create a temporary array and use np.nanmean() (new in version 1.8 if I'm not mistaken):
import numpy as np
temp = np.vstack([x[i:-(5-i)] for i in range(5)]) # stacks vertically the strided arrays
means = np.nanmean(temp, axis=0)
and put the original NaNs back in place with means[np.isnan(x[:-5])] = np.nan
However, this looks redundant both in terms of memory (stacking five shifted copies of the same array) and computation.

If I understand correctly, you want to create a moving average and then populate the resulting elements as nan if their index in the original array was nan.
>>> import numpy as np
>>> inc = 5 #the moving avg increment
>>> x = np.array([1.,3,np.nan,7,8,1,2,4,np.nan,np.nan,4,4,np.nan,1,3,6,3])
>>> mov_avg = np.array([np.nanmean(x[idx:idx+inc]) for idx in range(len(x))])
# Determine indices in x that are nans
>>> nan_idxs = np.where(np.isnan(x))[0]
# Populate output array with nans
>>> mov_avg[nan_idxs] = np.nan
>>> mov_avg
array([ 4.75, 4.75, nan, 4.4, 3.75, 2.33333333, 3.33333333, 4., nan, nan, 3., 3.5, nan, 3.25, 4., 4.5, 3.])

Here's an approach using strides -
w = 5 # Window size
n = x.strides[0]
avgs = np.nanmean(np.lib.stride_tricks.as_strided(x,
                  shape=(x.size-w+1,w), strides=(n,n)),1)
x_rem = np.append(x[-w+1:],np.full(w-1,np.nan))
avgs_rem = np.nanmean(np.lib.stride_tricks.as_strided(x_rem,
                      shape=(w-1,w), strides=(n,n)),1)
avgs = np.append(avgs,avgs_rem)
avgs[np.isnan(x)] = np.nan
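On NumPy 1.20 or newer, the same windowed view can be built with numpy.lib.stride_tricks.sliding_window_view, which checks shapes for you and returns a read-only view. The sketch below is one possible arrangement, not the method of the answer above: it pads the tail with NaNs so every index gets a window, and the helper name forward_nanmean is made up for illustration.

```python
import warnings
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def forward_nanmean(x, w=5):
    """nanmean over x[i:i+w] for every i; NaN wherever x[i] is NaN."""
    x = np.asarray(x, dtype=float)
    # Pad the tail with NaNs so the last w-1 positions still get a window.
    padded = np.concatenate((x, np.full(w - 1, np.nan)))
    windows = sliding_window_view(padded, w)  # shape (x.size, w), no copy
    with warnings.catch_warnings():
        # All-NaN windows make nanmean warn; their result (NaN) is what we want.
        warnings.simplefilter("ignore", category=RuntimeWarning)
        avgs = np.nanmean(windows, axis=1)
    avgs[np.isnan(x)] = np.nan
    return avgs

x = np.array([1., 3, np.nan, 7, 8, 1, 2, 4, np.nan, np.nan, 4, 4, np.nan, 1, 3, 6, 3])
avgs = forward_nanmean(x, w=5)
```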

The bottleneck package should do the trick reliably and quickly. Here is a slightly adjusted example from https://kwgoodman.github.io/bottleneck-doc/reference.html#bottleneck.move_mean:
>>> import bottleneck as bn
>>> a = np.array([1.0, 2.0, 3.0, np.nan, 5.0])
>>> bn.move_mean(a, window=2)
array([ nan, 1.5, 2.5, nan, nan])
>>> bn.move_mean(a, window=2, min_count=1)
array([ 1. , 1.5, 2.5, 3. , 5. ])
Note that each resulting mean corresponds to the last index of its window.
The package is available from the Ubuntu repositories, pip, etc. It can operate along an arbitrary axis of a NumPy array, and it is claimed to be faster than a plain-NumPy implementation in many cases.

Related

generating random numbers between two numbers

Is it possible to generate random numbers that are almost equally spaced, but not exactly the same as the numpy.linspace output?
I looked into the numpy.random.uniform function, but it does not give the required results.
Moreover, the sum of the generated values should be the same as the sum of the values generated by numpy.linspace.
code
import numpy as np
np.random.seed(42)
data=np.random.uniform(2,4,10)
print(data)

You might consider drawing random samples around the output of numpy.linspace. Setting these numbers as the mean of the normal distribution and setting the variance not too high would generate numbers close to the output of numpy.linspace. For example,
>>> import numpy as np
>>> exact_numbers = np.linspace(2.0, 10.0, num=5)
>>> exact_numbers
array([ 2., 4., 6., 8., 10.])
>>> approximate_numbers = np.random.normal(exact_numbers, np.ones(5) * 0.1)
>>> approximate_numbers
array([2.12950013, 3.9804745 , 5.80670316, 8.07868932, 9.85288221])
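Adding noise this way does not by itself keep the total equal to the sum of the linspace output, which the question also asks for. One way to get that, sketched here as an assumption rather than as part of the answer above, is to subtract the mean of the noise before adding it; the jitter width of 0.05 and the seed are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded only for reproducibility
base = np.linspace(2, 4, 10)

# Hypothetical jitter width; smaller values stay closer to the linspace grid.
noise = rng.uniform(-0.05, 0.05, size=base.size)
noise -= noise.mean()  # zero-mean noise leaves the total unchanged

jittered = base + noise
print(jittered)
print(jittered.sum(), base.sum())
```

The two sums agree up to floating-point rounding, while the individual values differ from the evenly spaced ones.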

One trick that may help combines numpy.linspace and numpy.random.uniform: randomly choose pairs of indexes, then increase one entry and decrease the other by the same amount, so the total sum stays unchanged:
(You can change size=10 and threshold=0.1 to control how far the numbers move up or down.)
import numpy as np
size = 10
threshold = 0.1
r = np.linspace(2,4,size) # r.sum()=30
# array([2. , 2.22222222, 2.44444444, 2.66666667, 2.88888889,
# 3.11111111, 3.33333333, 3.55555556, 3.77777778, 4. ])
c = np.random.uniform(0,threshold,size)
# array([0.02246768, 0.08661081, 0.0932445 , 0.00360563, 0.06539992,
# 0.0107167 , 0.06490493, 0.0558159 , 0.00268924, 0.00070247])
s = np.random.choice(range(size), size+1)
# array([5, 5, 8, 3, 6, 4, 1, 8, 7, 1, 7])
for idx, (i,j) in enumerate(zip(s, s[1:])):
    r[i] += c[idx]
    r[j] -= c[idx]
print(r)
print(r.sum())
Output:
[2. 2.27442369 2.44444444 2.5770278 2.83420567 3.19772192
3.39512762 3.50172642 3.77532244 4. ]
30

Numpy only on finite entries

Here's a brief example of a function. It maps a vector to a vector. However, entries that are NaN or inf should be ignored. Currently this looks rather clumsy to me. Do you have any suggestions?
from scipy import stats
import numpy as np
def p(vv):
    mask = np.isfinite(vv)
    y = np.nan * vv
    v = vv[mask]
    y[mask] = 1/v*(stats.hmean(v)/len(v))
    return y

You can change the NaN values to zero with NumPy's isnan function and then remove the zeros as follows:
import numpy as np
def p(vv):
    # assuming vv is your array
    # use NumPy's isnan function to replace the NaN values in the array with zero
    replace_NaN = np.isnan(vv)
    vv[replace_NaN] = 0
    # convert array vv to list
    vv_list = vv.tolist()
    new_list = []
    # loop over vv_list and exclude 0 values:
    for i in vv_list:
        if i != 0:
            new_list.append(i)
    # set array vv again
    vv = np.array(new_list, dtype = 'float64')
    return vv

I came up with this kind of construction:
from scipy import stats
import numpy as np
## operate only on the valid entries of x and use the same mask on the resulting vector y
def __f(func, x):
    mask = np.isfinite(x)
    y = np.nan * x
    y[mask] = func(x[mask])
    return y
# implementation of the parity function
def __pp(x):
    return 1/x*(stats.hmean(x)/len(x))
def pp(vv):
    return __f(__pp, vv)
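For positive inputs, the harmonic mean of n values is n divided by the sum of their reciprocals, so hmean(v)/len(v) reduces to 1/sum(1/v). Under that assumption the same result can be sketched with NumPy alone; the name p_finite is hypothetical and the sketch presumes the finite entries are strictly positive.

```python
import numpy as np

def p_finite(vv):
    """Reciprocal weights over the finite entries; NaN everywhere else."""
    vv = np.asarray(vv, dtype=float)
    mask = np.isfinite(vv)
    out = np.full(vv.shape, np.nan)
    inv = 1.0 / vv[mask]          # assumes the finite entries are > 0
    out[mask] = inv / inv.sum()   # equals 1/v * (hmean(v) / len(v))
    return out

weights = p_finite([1.0, 2.0, np.nan, np.inf, 4.0])
print(weights)
```

Written this way, it is also easy to see that the finite entries of the result always sum to 1.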

Masked arrays accomplish this functionality and allow you to specify the mask as you desire. The numpy 1.18 docs for it are here: https://numpy.org/doc/1.18/reference/maskedarray.generic.html#what-is-a-masked-array
In a masked array, entries whose mask value is False are used in calculations, while entries whose mask value is True are ignored.
Example for obtaining the mean of only the finite values using np.isfinite():
import numpy as np
# Seeding for reproducing these results
np.random.seed(0)
# Generate random data and add some non-finite values
x = np.random.randint(0, 5, (3, 3)).astype(np.float32)
x[1,2], x[2,1], x[2,2] = np.inf, -np.inf, np.nan
# array([[ 4., 0., 3.],
# [ 3., 3., inf],
# [ 3., -inf, nan]], dtype=float32)
# Make masked array. Note the logical not of isfinite
x_masked = np.ma.masked_array(x, mask=~np.isfinite(x))
# Mean of entire masked matrix
x_masked.mean()
# 2.6666666666666665
# Masked matrix's row means
x_masked.mean(1)
# masked_array(data=[2.3333333333333335, 3.0, 3.0],
# mask=[False, False, False],
# fill_value=1e+20)
# Masked matrix's column means
x_masked.mean(0)
# masked_array(data=[3.3333333333333335, 1.5, 3.0],
# mask=[False, False, False],
# fill_value=1e+20)
Note that scipy.stats.hmean() also works with masked arrays.
Note that if all you care about is detecting NaNs and leaving infs, then you can use np.isnan() instead of np.isfinite().
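As a shorter route to the same kind of mask, np.ma.masked_invalid masks NaN and infinite entries in one call. The sketch below assumes a small made-up array and shows both a reduction over the finite entries and an elementwise operation whose masked positions are turned back into NaN with filled().

```python
import numpy as np

x = np.array([4.0, 0.0, 3.0, np.inf, -np.inf, np.nan])

# Mask every NaN and +/-inf entry in a single step.
xm = np.ma.masked_invalid(x)

mean_finite = xm.mean()            # mean over 4, 0 and 3 only
doubled = (2 * xm).filled(np.nan)  # masked positions become NaN again

print(mean_finite)
print(doubled)
```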

How to normalize a NumPy array to a unit vector?

I would like to convert a NumPy array to a unit vector. More specifically, I am looking for an equivalent version of this normalisation function:
def normalize(v):
    norm = np.linalg.norm(v)
    if norm == 0:
        return v
    return v / norm
This function handles the situation where vector v has the norm value of 0.
Are there any similar functions provided in sklearn or numpy?

If you're using scikit-learn you can use sklearn.preprocessing.normalize:
import numpy as np
from sklearn.preprocessing import normalize
x = np.random.rand(1000)*10
norm1 = x / np.linalg.norm(x)
norm2 = normalize(x[:,np.newaxis], axis=0).ravel()
print(np.all(norm1 == norm2))
# True

I agree that it would be nice if such a function were part of the included libraries. But it isn't, as far as I know. So here is a version for arbitrary axes that gives optimal performance.
import numpy as np
def normalized(a, axis=-1, order=2):
    l2 = np.atleast_1d(np.linalg.norm(a, order, axis))
    l2[l2==0] = 1
    return a / np.expand_dims(l2, axis)
A = np.random.randn(3,3,3)
print(normalized(A,0))
print(normalized(A,1))
print(normalized(A,2))
print(normalized(np.arange(3)[:,None]))
print(normalized(np.arange(3)))

This might also work for you
import numpy as np
normalized_v = v / np.sqrt(np.sum(v**2))
but it fails when v has norm 0, because the division is then by zero.
In that case, adding a small constant to the denominator prevents the division by zero.
As proposed in the comments one could also use
v/np.linalg.norm(v)

To avoid zero division I use eps, but that's maybe not great.
def normalize(v):
    norm=np.linalg.norm(v)
    if norm==0:
        norm=np.finfo(v.dtype).eps
    return v/norm

If you have multidimensional data and want each dimension (column) normalized to its max or its sum:
def normalize(_d, to_sum=True, copy=True):
    # d is a (n x dimension) np array
    d = _d if not copy else np.copy(_d)
    d -= np.min(d, axis=0)
    d /= (np.sum(d, axis=0) if to_sum else np.ptp(d, axis=0))
    return d
This uses NumPy's peak-to-peak function, np.ptp.
a = np.random.random((5, 3))
b = normalize(a, copy=False)
b.sum(axis=0) # array([1., 1., 1.]), the columns sum to 1
c = normalize(a, to_sum=False, copy=False)
c.max(axis=0) # array([1., 1., 1.]), the max of each column is 1

If you don't need utmost precision, your function can be reduced to:
v_norm = v / (np.linalg.norm(v) + 1e-16)

You mentioned scikit-learn, so I want to share another solution.
scikit-learn MinMaxScaler
In scikit-learn, there is an API called MinMaxScaler, which can scale the values to whatever range you like.
It also deals with NaN values for us:
NaNs are treated as missing values: disregarded in fit, and maintained in transform. ... see reference [1]
Code sample
The code is simple, just type
# Let's say X_train is your input dataframe
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# create a MinMaxScaler object
min_max_scaler = MinMaxScaler()
# feed in a numpy array
X_train_norm = min_max_scaler.fit_transform(X_train.values)
# wrap it up if you need a dataframe
df = pd.DataFrame(X_train_norm)
Reference
[1] sklearn.preprocessing.MinMaxScaler

There is also the function unit_vector() to normalize vectors in the popular transformations module by Christoph Gohlke:
import transformations as trafo
import numpy as np
data = np.array([[1.0, 1.0, 0.0],
[1.0, 1.0, 1.0],
[1.0, 2.0, 3.0]])
print(trafo.unit_vector(data, axis=1))

If you work with a multidimensional array, the following fast solution is possible.
Say we have a 2D array that we want to normalize along the last axis, while some rows have zero norm.
import numpy as np
arr = np.array([
[1, 2, 3],
[0, 0, 0],
[5, 6, 7]
], dtype=float)
lengths = np.linalg.norm(arr, axis=-1)
print(lengths) # [ 3.74165739 0. 10.48808848]
arr[lengths > 0] = arr[lengths > 0] / lengths[lengths > 0][:, np.newaxis]
print(arr)
# [[0.26726124 0.53452248 0.80178373]
# [0. 0. 0. ]
# [0.47673129 0.57207755 0.66742381]]

If you want to normalize n dimensional feature vectors stored in a 3D tensor, you could also use PyTorch:
import numpy as np
from torch import FloatTensor
from torch.nn.functional import normalize
vecs = np.random.rand(3, 16, 16, 16)
norm_vecs = normalize(FloatTensor(vecs), dim=0, eps=1e-16).numpy()

If you're working with 3D vectors, you can do this concisely using the toolbelt vg. It's a light layer on top of numpy and it supports single values and stacked vectors.
import numpy as np
import vg
x = np.random.rand(1000)*10
norm1 = x / np.linalg.norm(x)
norm2 = vg.normalize(x)
print(np.all(norm1 == norm2))
# True
I created the library at my last startup, where it was motivated by uses like this: simple ideas which are way too verbose in NumPy.

Without sklearn, using just NumPy, you can define a function.
This assumes that the rows are the variables and the columns are the samples (axis=1):
import numpy as np
# Example array
X = np.array([[1,2,3],[4,5,6]])
def stdmtx(X):
    means = X.mean(axis=1)
    stds = X.std(axis=1, ddof=1)
    X = X - means[:, np.newaxis]
    X = X / stds[:, np.newaxis]
    return np.nan_to_num(X)
output:
X
array([[1, 2, 3],
[4, 5, 6]])
stdmtx(X)
array([[-1., 0., 1.],
[-1., 0., 1.]])

For a 2D array, you can use the following one-liner to normalize across rows. To normalize across columns, simply set axis=0.
a / np.linalg.norm(a, axis=1, keepdims=True)
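The one-liner divides by zero for any all-zero row. A possible guard, sketched here with a hypothetical helper name, is to let np.divide skip those rows through its where argument and leave them as zeros; treating zero rows that way is a design choice, not something the answer above prescribes.

```python
import numpy as np

def normalize_rows(a):
    """Scale each row to unit L2 norm; all-zero rows are left as zeros."""
    a = np.asarray(a, dtype=float)
    norms = np.linalg.norm(a, axis=1, keepdims=True)  # shape (n_rows, 1)
    # Divide only where the norm is non-zero; other entries keep the 0 from `out`.
    return np.divide(a, norms, out=np.zeros_like(a), where=norms > 0)

arr = np.array([[1, 2, 3],
                [0, 0, 0],
                [5, 6, 7]])
unit_rows = normalize_rows(arr)
print(unit_rows)
```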

If you want all values in [0, 1] for a 1-D array, then just use
(a - a.min(axis=0)) / (a.max(axis=0) - a.min(axis=0))
Where a is your 1d-array.
An example:
>>> a = np.array([0, 1, 2, 4, 5, 2])
>>> (a - a.min(axis=0)) / (a.max(axis=0) - a.min(axis=0))
array([0. , 0.2, 0.4, 0.8, 1. , 0.4])
A note on this method: to preserve the proportions between values, the 1-D array must contain at least one 0 and otherwise consist only of non-negative numbers.
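The formula divides by zero when every element is equal, since max and min then coincide. A minimal sketch of a guarded version follows; the function name min_max_scale and the choice to return zeros for a constant array are assumptions made for illustration.

```python
import numpy as np

def min_max_scale(a):
    """Rescale a 1-D array to [0, 1]; a constant array maps to all zeros."""
    a = np.asarray(a, dtype=float)
    lo, hi = a.min(), a.max()
    if hi == lo:
        # No spread to rescale; returning zeros is an arbitrary convention.
        return np.zeros_like(a)
    return (a - lo) / (hi - lo)

scaled = min_max_scale([0, 1, 2, 4, 5, 2])
constant = min_max_scale([3, 3, 3])
print(scaled)
print(constant)
```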

A simple dot product would do the job. No need for any extra package.
x = x/np.sqrt(x.dot(x))
By the way, if the norm of x is zero, it is inherently a zero vector, and cannot be converted to a unit vector (which has norm 1). If you want to catch the case of np.array([0,0,...0]), then use
norm = np.sqrt(x.dot(x))
x = x/norm if norm != 0 else x

Finding the max and min in a tuple of tuples

I'm new to python and having some problems finding the minimum and maximum values for a tuple of tuples. I need them to normalise my data. So, basically, I have a list that is a row of 13 numbers, each representing something. Each number makes a column in a list, and I need the max and min for each column. I tried indexing/iterating through but keep getting an error of
max_j = max(j)
TypeError: 'float' object is not iterable
any help would be appreciated!
The code is (assuming data_set_tup is a tuple of tuples, eg ((1,3,4,5,6,7,...),(5,6,7,3,6,73,2...)...(3,4,5,6,3,2,2...)) I also want to make a new list using the normalised values.
normal_list = []
for i in data_set_tup:
    for j in i[1:]: # first column doesn't need to be normalised
        max_j = max(j)
        min_j = min(j)
        normal_j = (j-min_j)/(max_j-min_j)
        normal_list.append(normal_j)
normal_tup = tuple(normal_list)

You can transpose rows to columns and vice versa with zip(*...). (In Python 3, wrap it as list(zip(*...)) so that the result can be indexed.)
cols = list(zip(*data_set_tup))
normal_cols = [cols[0]] # first column doesn't need to be normalised
for j in cols[1:]:
    max_j = max(j)
    min_j = min(j)
    normal_cols.append(tuple((k-min_j)/(max_j-min_j) for k in j))
normal_list = list(zip(*normal_cols))
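The transposition idea can be wrapped in a small self-contained function. This is only a sketch: the name normalise_columns, the skip_first flag, the sample data, and the choice to map a constant column to 0.0 are all illustrative assumptions.

```python
def normalise_columns(rows, skip_first=True):
    """Min-max normalise each column of a tuple of tuples (pure Python)."""
    cols = list(zip(*rows))  # transpose: one tuple per column
    out = []
    for idx, col in enumerate(cols):
        if skip_first and idx == 0:
            out.append(col)  # first column is passed through unchanged
            continue
        lo, hi = min(col), max(col)
        span = hi - lo
        # A constant column has span 0; map it to 0.0 instead of dividing by zero.
        out.append(tuple((v - lo) / span if span else 0.0 for v in col))
    return tuple(zip(*out))  # transpose back to rows

data_set_tup = ((1, 2.0, 10.0), (2, 4.0, 20.0), (3, 6.0, 40.0))
normal_tup = normalise_columns(data_set_tup)
print(normal_tup)
```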

This really sounds like a job for the non-builtin numpy module, or maybe the pandas module, depending on your needs.
Adding an extra dependency to your application should not be done lightly, but if you do a lot of work on matrix-like data, then your code will likely be both faster and more readable if you use one of the above modules throughout your application.
I do not recommend converting a list of lists to a numpy array and back again just to get this single result; it's better to use the pure Python method from Jannes' answer. Also, seeing that you're a Python beginner, numpy may be overkill right now. But I think your question deserves an answer pointing out that this is an option.
Here's a step-by-step console illustration of how this would work in numpy:
>>> import numpy as np
>>> a = np.array([[1,3,4,5,6],[5,6,7,3,6],[3,4,5,6,3]], dtype=float)
>>> a
array([[ 1., 3., 4., 5., 6.],
[ 5., 6., 7., 3., 6.],
[ 3., 4., 5., 6., 3.]])
>>> min = np.min(a, axis=0)
>>> min
array([ 1., 3., 4., 3., 3.])
>>> max = np.max(a, axis=0)
>>> max
array([ 5., 6., 7., 6., 6.])
>>> normalized = (a - min) / (max - min)
>>> normalized
array([[ 0. , 0. , 0. , 0.66666667, 1. ],
[ 1. , 1. , 1. , 0. , 1. ],
[ 0.5 , 0.33333333, 0.33333333, 1. , 0. ]])
So in actual code:
import numpy as np
def normalize_by_column(a):
    min = np.min(a, axis=0)
    max = np.max(a, axis=0)
    return (a - min) / (max - min)

We have nested_tuple = ((1, 2, 3), (4, 5, 6), (7, 8, 9)).
First of all we need to flatten it. The Pythonic way:
flat_tuple = [x for row in nested_tuple for x in row]
Output: [1, 2, 3, 4, 5, 6, 7, 8, 9] # it's a list
Convert it to a tuple with tuple(flat_tuple), get the max value with max(flat_tuple), and get the min value with min(flat_tuple).
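Flattening gives the overall extremes, whereas the question asks for the extremes of each column. Both can be sketched with the standard library alone; the variable names below are illustrative.

```python
from itertools import chain

nested_tuple = ((1, 2, 3), (4, 5, 6), (7, 8, 9))

# Overall min and max across every element.
flat = tuple(chain.from_iterable(nested_tuple))
overall_min, overall_max = min(flat), max(flat)

# Per-column min and max: zip(*...) yields one tuple per column.
col_min = tuple(min(col) for col in zip(*nested_tuple))
col_max = tuple(max(col) for col in zip(*nested_tuple))

print(overall_min, overall_max)
print(col_min, col_max)
```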

How to normalize a 2-dimensional numpy array in python less verbose?

Given a 3 times 3 numpy array
a = numpy.arange(0,27,3).reshape(3,3)
# array([[ 0, 3, 6],
# [ 9, 12, 15],
# [18, 21, 24]])
To normalize the rows of the 2-dimensional array I thought of
row_sums = a.sum(axis=1) # array([ 9, 36, 63])
new_matrix = numpy.zeros((3,3))
for i, (row, row_sum) in enumerate(zip(a, row_sums)):
    new_matrix[i,:] = row / row_sum
There must be a better way, isn't there?
Perhaps to clarify: by normalizing I mean that the sum of the entries per row must be one. But I think that will be clear to most people.

Broadcasting is really good for this:
row_sums = a.sum(axis=1)
new_matrix = a / row_sums[:, numpy.newaxis]
row_sums[:, numpy.newaxis] reshapes row_sums from being (3,) to being (3, 1). When you do a / b, a and b are broadcast against each other.
You can learn more about broadcasting here or even better here.
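To make the shape argument concrete, here is a small sketch comparing the division with and without the extra axis. The names col_vector and wrong are illustrative only.

```python
import numpy as np

a = np.arange(0, 27, 3).reshape(3, 3)

row_sums = a.sum(axis=1)              # shape (3,)
col_vector = row_sums[:, np.newaxis]  # shape (3, 1)

# (3, 3) / (3, 1): each row is divided by its own sum.
new_matrix = a / col_vector

# (3, 3) / (3,): the sums line up with the columns instead,
# so column j is divided by row_sums[j], which is not what we want.
wrong = a / row_sums

print(new_matrix.sum(axis=1))
print(wrong.sum(axis=1))
```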

Scikit-learn offers a function normalize() that lets you apply various normalizations. The "make it sum to 1" is called L1-norm. Therefore:
from sklearn.preprocessing import normalize
matrix = numpy.arange(0,27,3).reshape(3,3).astype(numpy.float64)
# array([[ 0., 3., 6.],
# [ 9., 12., 15.],
# [ 18., 21., 24.]])
normed_matrix = normalize(matrix, axis=1, norm='l1')
# [[ 0. 0.33333333 0.66666667]
# [ 0.25 0.33333333 0.41666667]
# [ 0.28571429 0.33333333 0.38095238]]
Now your rows will sum to 1.

I think this should work,
a = numpy.arange(0,27.,3).reshape(3,3)
a /= a.sum(axis=1)[:,numpy.newaxis]

In case you are trying to normalize each row such that its magnitude is one (i.e. a row's unit length is one or the sum of the square of each element in a row is one):
import numpy as np
a = np.arange(0,27,3).reshape(3,3)
result = a / np.linalg.norm(a, axis=-1)[:, np.newaxis]
# array([[ 0. , 0.4472136 , 0.89442719],
# [ 0.42426407, 0.56568542, 0.70710678],
# [ 0.49153915, 0.57346234, 0.65538554]])
Verifying:
np.sum( result**2, axis=-1 )
# array([ 1., 1., 1.])

I think you can normalize the row elements sum to 1 by this:
new_matrix = a / a.sum(axis=1, keepdims=1)
The column normalization can be done with new_matrix = a / a.sum(axis=0, keepdims=1). Hope this helps.

You could use built-in numpy function:
np.linalg.norm(a, axis = 1, keepdims = True)

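It appears that this also works, provided the row sums keep their second dimension (keepdims=True); without it, the division broadcasts along the wrong axis:
def normalizeRows(M):
    row_sums = M.sum(axis=1, keepdims=True)
    return M / row_sums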

You could also use matrix transposition:
(a.T / row_sums).T

Here is one more possible way using reshape:
a_norm = (a/a.sum(axis=1).reshape(-1,1)).round(3)
print(a_norm)
Or using None works too:
a_norm = (a/a.sum(axis=1)[:,None]).round(3)
print(a_norm)
Output:
array([[0. , 0.333, 0.667],
[0.25 , 0.333, 0.417],
[0.286, 0.333, 0.381]])

Use
a = a / np.linalg.norm(a, ord = 2, axis = 0, keepdims = True)
Due to broadcasting, this divides each column by its L2 norm; use axis = 1 to normalize the rows instead.

Or use a lambda function, like
>>> import numpy as np
>>> vec = np.arange(0,27,3).reshape(3,3)
>>> norm_vec = list(map(lambda row: row/np.linalg.norm(row), vec))
Each row in norm_vec will then have unit norm.

We can achieve the same effect by premultiplying with the diagonal matrix whose main diagonal is the reciprocal of the row sums.
A = np.diag(1 / A.sum(1)) @ A
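Premultiplying by a diagonal matrix scales row i by the i-th diagonal entry, so using reciprocal row sums on the diagonal should match the broadcasting approach. The following sketch checks that on the question's matrix, using a float array so the reciprocals are well defined; the variable names are illustrative.

```python
import numpy as np

A = np.arange(0, 27, 3, dtype=float).reshape(3, 3)

# Diagonal matrix holding 1 / (row sum) for each row.
D = np.diag(1.0 / A.sum(axis=1))

via_diag = D @ A                                  # row i scaled by D[i, i]
via_broadcast = A / A.sum(axis=1, keepdims=True)  # same result via broadcasting

print(via_diag)
print(np.allclose(via_diag, via_broadcast))
```

The diagonal form builds an n-by-n matrix, so for large arrays the broadcasting version is likely the cheaper of the two.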
