Related
I want to compute discrete difference of identity matrix.
The code below use numpy and scipy.
import numpy as np
from scipy.sparse import identity
from scipy.sparse import csc_matrix
x = identity(4).toarray()
y = csc_matrix(np.diff(x, n=2))
print(y)
I would like to improve performance or memory usage.
Since identity matrix produce many zeros, it would reduce memory usage to perform calculation in compressed sparse column(csc) format. However, np.diff() does not accept csc format, so converting between csc and normal format using csc_matrix would slow it down a bit.
Normal format
x = identity(4).toarray()
print(x)
[[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]
csc format
x = identity(4)
print(x)
(0, 0) 1.0
(1, 1) 1.0
(2, 2) 1.0
(3, 3) 1.0
Thanks
Here is my hacky solution to get the sparse matrix as you want.
L - the length of the original identity matrix,
n - the parameter of np.diff.
In your question they are:
L = 4
n = 2
My code produces the same y as your code, but without the conversions between csc and normal formats.
Your code:
from scipy.sparse import identity, csc_matrix
x = identity(L).toarray()
y = csc_matrix(np.diff(x, n=n))
My code:
from scipy.linalg import pascal
def get_data(n, L):
nums = pascal(n + 1, kind='lower')[-1].astype(float)
minuses_from = n % 2 + 1
nums[minuses_from : : 2] *= -1
return np.tile(nums, L - n)
data = get_data(n, L)
row_ind = (np.arange(n + 1) + np.arange(L - n).reshape(-1, 1)).flatten()
col_ind = np.repeat(np.arange(L - n), n + 1)
y = csc_matrix((data, (row_ind, col_ind)), shape=(L, L - n))
I have noticed that after applying np.diff to the identity matrix n times, the values of the columns are the binomial coefficients with their signs alternating. This is my variable data.
Then I am just constructing the csc_matrix.
Unfortunately, it does not seem that SciPy provides any tools for this kind of sparse matrix manipulation. Regardless, by cleverly manipulating the indices and data of the entries one can emulate np.diff(x,n) in a straightforward fashion.
Given 2D NumPy array (matrix) of dimension MxN, np.diff() multiplies each column (of column index y) with -1 and adds the next column to it (column index y+1). Difference of order k is just the iterative application of k differences of order 1. A difference of order 0 is just the returns the input matrix.
The method below makes use of this, iterateively eliminating duplicate entries by addition through sum_duplicates(), reducing the number of columns by one, and filtering non-valid indices.
def csc_diff(x, n):
'''Emulates np.diff(x,n) for a sparse matrix by iteratively taking difference of order 1'''
assert isinstance(x, csc_matrix) or (isinstance(x, np.ndarray) & len(x.shape) == 2), "Input matrix must be a 2D np.ndarray or csc_matrix."
assert isinstance(n, int) & n >= 0, "Integer n must be larger or equal to 0."
if n >= x.shape[1]:
return csc_matrix(([], ([], [])), shape=(x.shape[0], 0))
if isinstance(x, np.ndarray):
x = csc_matrix(x)
# set-up of data/indices via column-wise difference
if(n > 0):
for k in range(1,n+1):
# extract data/indices of non-zero entries of (current) sparse matrix
M, N = x.shape
idx, idy = x.nonzero()
dat = x.data
# difference: this row (y) * (-1) + next row (y+1)
idx = np.concatenate((idx, idx))
idy = np.concatenate((idy, idy-1))
dat = np.concatenate(((-1)*dat, dat))
# filter valid indices
validInd = (0<=idy) & (idy<N-1)
# x_diff: csc_matrix emulating np.diff(x,1)'s output'
x_diff = csc_matrix((dat[validInd], (idx[validInd], idy[validInd])), shape=(M, N-1))
x_diff.sum_duplicates()
x = x_diff
return x
Moreover, the method outputs an empty csc_matrix of dimension Mx0 when the difference order is larger or equal to the number of columns of the input matrix. This is why the output is identical, see
csc_diff(x, 2).toarray()
> array([[ 1., 0.],
[-2., 1.],
[ 1., -2.],
[ 0., 1.]])
which is identical to
np.diff(x.toarray(), 2)
> array([[ 1., 0.],
[-2., 1.],
[ 1., -2.],
[ 0., 1.]])
This identity holds for other difference orders, too
(csc_diff(x, 0).toarray() == np.diff(x.toarray(), 0)).all()
>True
(csc_diff(x, 3).toarray() == np.diff(x.toarray(), 3)).all()
>True
(csc_diff(x, 13).toarray() == np.diff(x.toarray(), 13)).all()
>True
I am trying to calculate the moving average in a large numpy array that contains NaNs. Currently I am using:
import numpy as np
def moving_average(a,n=5):
ret = np.cumsum(a,dtype=float)
ret[n:] = ret[n:]-ret[:-n]
return ret[-1:]/n
When calculating with a masked array:
x = np.array([1.,3,np.nan,7,8,1,2,4,np.nan,np.nan,4,4,np.nan,1,3,6,3])
mx = np.ma.masked_array(x,np.isnan(x))
y = moving_average(mx).filled(np.nan)
print y
>>> array([3.8,3.8,3.6,nan,nan,nan,2,2.4,nan,nan,nan,2.8,2.6])
The result I am looking for (below) should ideally have NaNs only in the place where the original array, x, had NaNs and the averaging should be done over the number of non-NaN elements in the grouping (I need some way to change the size of n in the function.)
y = array([4.75,4.75,nan,4.4,3.75,2.33,3.33,4,nan,nan,3,3.5,nan,3.25,4,4.5,3])
I could loop over the entire array and check index by index but the array I am using is very large and that would take a long time. Is there a numpythonic way to do this?
Pandas has a lot of really nice functionality with this. For example:
x = np.array([np.nan, np.nan, 3, 3, 3, np.nan, 5, 7, 7])
# requires three valid values in a row or the resulting value is null
print(pd.Series(x).rolling(3).mean())
#output
nan,nan,nan, nan, 3, nan, nan, nan, 6.333
# only requires 2 valid values out of three for size=3 window
print(pd.Series(x).rolling(3, min_periods=2).mean())
#output
nan, nan, nan, 3, 3, 3, 4, 6, 6.3333
You can play around with the windows/min_periods and consider filling-in nulls all in one chained line of code.
I'll just add to the great answers before that you could still use cumsum to achieve this:
import numpy as np
def moving_average(a, n=5):
ret = np.cumsum(a.filled(0))
ret[n:] = ret[n:] - ret[:-n]
counts = np.cumsum(~a.mask)
counts[n:] = counts[n:] - counts[:-n]
ret[~a.mask] /= counts[~a.mask]
ret[a.mask] = np.nan
return ret
x = np.array([1.,3,np.nan,7,8,1,2,4,np.nan,np.nan,4,4,np.nan,1,3,6,3])
mx = np.ma.masked_array(x,np.isnan(x))
y = moving_average(mx)
You could create a temporary array and use np.nanmean() (new in version 1.8 if I'm not mistaken):
import numpy as np
temp = np.vstack([x[i:-(5-i)] for i in range(5)]) # stacks vertically the strided arrays
means = np.nanmean(temp, axis=0)
and put original nan back in place with means[np.isnan(x[:-5])] = np.nan
However this look redundant both in terms of memory (stacking the same array strided 5 times) and computation.
If I understand correctly, you want to create a moving average and then populate the resulting elements as nan if their index in the original array was nan.
import numpy as np
>>> inc = 5 #the moving avg increment
>>> x = np.array([1.,3,np.nan,7,8,1,2,4,np.nan,np.nan,4,4,np.nan,1,3,6,3])
>>> mov_avg = np.array([np.nanmean(x[idx:idx+inc]) for idx in range(len(x))])
# Determine indices in x that are nans
>>> nan_idxs = np.where(np.isnan(x))[0]
# Populate output array with nans
>>> mov_avg[nan_idxs] = np.nan
>>> mov_avg
array([ 4.75, 4.75, nan, 4.4, 3.75, 2.33333333, 3.33333333, 4., nan, nan, 3., 3.5, nan, 3.25, 4., 4.5, 3.])
Here's an approach using strides -
w = 5 # Window size
n = x.strides[0]
avgs = np.nanmean(np.lib.stride_tricks.as_strided(x, \
shape=(x.size-w+1,w), strides=(n,n)),1)
x_rem = np.append(x[-w+1:],np.full(w-1,np.nan))
avgs_rem = np.nanmean(np.lib.stride_tricks.as_strided(x_rem, \
shape=(w-1,w), strides=(n,n)),1)
avgs = np.append(avgs,avgs_rem)
avgs[np.isnan(x)] = np.nan
Currently bottleneck package should do the trick quite reliably and quickly. Here is slightly adjusted example from https://kwgoodman.github.io/bottleneck-doc/reference.html#bottleneck.move_mean:
>>> import bottleneck as bn
>>> a = np.array([1.0, 2.0, 3.0, np.nan, 5.0])
>>> bn.move_mean(a, window=2)
array([ nan, 1.5, 2.5, nan, nan])
>>> bn.move_mean(a, window=2, min_count=1)
array([ 1. , 1.5, 2.5, 3. , 5. ])
Note that the resulting means correspond to the last index of the window.
The package is available from Ubuntu repos, pip etc. It can operate over arbitrary axis of numpy-array etc. Besides that, it is claimed to be faster than plain-numpy implementation in many cases.
I would like to improve the speed of my code by computing a function once on a numpy array instead of a for loop is over a function of this python library. If I have a function as following:
import numpy as np
import galsim
from math import *
M200=1e14
conc=6.9
def func(M200, conc):
halo_z=0.2
halo_pos =[1200., 3769.7]
halo_pos = galsim.PositionD(x=halo_pos_arcsec[0],y=halo_pos_arcsec[1])
nfw = galsim.NFWHalo(mass=M200, conc=conc, redshift=halo_z,halo_pos=halo_pos, omega_m = 0.3, omega_lam =0.7)
for i in range(len(shear_z)):
shear_pos=galsim.PositionD(x=pos_arcsec[i,0],y=pos_arcsec[i,1])
model_g1, model_g2 = nfw.getShear(pos=self.shear_pos, z_s=shear_z[i])
l=np.sum(model_g1-model_g2)/sqrt(np.pi)
return l
While pos_arcsec is a two-dimensional array of 24000x2 and shear_z is a 1D array with 24000 elements as well.
The main problem is that I want to calculate this function on a grid where M200=np.arange(13., 16., 0.01) and conc = np.arange(3, 10, 0.01). I don't know how to broadcast this function to be estimated for this two dimensional array over M200 and conc. It takes a lot to run the code. I am looking for the best approaches to speed up these calculations.
This here should work when pos is an array of shape (n,2)
import numpy as np
def f(pos, z):
r=np.sqrt(pos[...,0]**2+pos[...,1]**2)
return np.log(r)*(z+1)
Example:
z = np.arange(10)
pos = np.arange(20).reshape(10,2)
f(pos,z)
# array([ 0. , 2.56494936, 5.5703581 , 8.88530251,
# 12.44183436, 16.1944881 , 20.11171117, 24.17053133,
# 28.35353608, 32.64709419])
Use numpy.linalg.norm
If you have an array:
import numpy as np
import numpy.linalg as la
a = np.array([[3, 4], [5, 12], [7, 24]])
then you can determine the magnitude of the resulting vector (sqrt(a^2 + b^2)) by
b = np.sqrt(la.norm(a, axis=1)
>>> print b
array([ 5., 15. 25.])
I have very large matrix, so dont want to sum by going through each row and column.
a = [[1,2,3],[3,4,5],[5,6,7]]
def neighbors(i,j,a):
return [a[i][j-1], a[i][(j+1)%len(a[0])], a[i-1][j], a[(i+1)%len(a)][j]]
[[np.mean(neighbors(i,j,a)) for j in range(len(a[0]))] for i in range(len(a))]
This code works well for 3x3 or small range of matrix, but for large matrix like 2k x 2k this is not feasible. Also this does not work if any of the value in matrix is missing or it's like na
This code works well for 3x3 or small range of matrix, but for large matrix like 2k x 2k this is not feasible. Also this does not work if any of the value in matrix is missing or it's like na. If any of the neighbor values is na then skip that neighbour in getting the average
Shot #1
This assumes you are looking to get sliding windowed average values in an input array with a window of 3 x 3 and considering only the north-west-east-south neighborhood elements.
For such a case, signal.convolve2d with an appropriate kernel could be used. At the end, you need to divide those summations by the number of ones in kernel, i.e. kernel.sum() as only those contributed to the summations. Here's the implementation -
import numpy as np
from scipy import signal
# Inputs
a = [[1,2,3],[3,4,5],[5,6,7],[4,8,9]]
# Convert to numpy array
arr = np.asarray(a,float)
# Define kernel for convolution
kernel = np.array([[0,1,0],
[1,0,1],
[0,1,0]])
# Perform 2D convolution with input data and kernel
out = signal.convolve2d(arr, kernel, boundary='wrap', mode='same')/kernel.sum()
Shot #2
This makes the same assumptions as in shot #1, except that we are looking to find average values in a neighborhood of only zero elements with the intention to replace them with those average values.
Approach #1: Here's one way to do it using a manual selective convolution approach -
import numpy as np
# Convert to numpy array
arr = np.asarray(a,float)
# Pad around the input array to take care of boundary conditions
arr_pad = np.lib.pad(arr, (1,1), 'wrap')
R,C = np.where(arr==0) # Row, column indices for zero elements in input array
N = arr_pad.shape[1] # Number of rows in input array
offset = np.array([-N, -1, 1, N])
idx = np.ravel_multi_index((R+1,C+1),arr_pad.shape)[:,None] + offset
arr_out = arr.copy()
arr_out[R,C] = arr_pad.ravel()[idx].sum(1)/4
Sample input, output -
In [587]: arr
Out[587]:
array([[ 4., 0., 3., 3., 3., 1., 3.],
[ 2., 4., 0., 0., 4., 2., 1.],
[ 0., 1., 1., 0., 1., 4., 3.],
[ 0., 3., 0., 2., 3., 0., 1.]])
In [588]: arr_out
Out[588]:
array([[ 4. , 3.5 , 3. , 3. , 3. , 1. , 3. ],
[ 2. , 4. , 2. , 1.75, 4. , 2. , 1. ],
[ 1.5 , 1. , 1. , 1. , 1. , 4. , 3. ],
[ 2. , 3. , 2.25, 2. , 3. , 2.25, 1. ]])
To take care of the boundary conditions, there are other options for padding. Look at numpy.pad for more info.
Approach #2: This would be a modified version of convolution based approach listed earlier in Shot #1. This is same as that earlier approach, except that at the end, we selectively replace
the zero elements with the convolution output. Here's the code -
import numpy as np
from scipy import signal
# Inputs
a = [[1,2,3],[3,4,5],[5,6,7],[4,8,9]]
# Convert to numpy array
arr = np.asarray(a,float)
# Define kernel for convolution
kernel = np.array([[0,1,0],
[1,0,1],
[0,1,0]])
# Perform 2D convolution with input data and kernel
conv_out = signal.convolve2d(arr, kernel, boundary='wrap', mode='same')/kernel.sum()
# Initialize output array as a copy of input array
arr_out = arr.copy()
# Setup a mask of zero elements in input array and
# replace those in output array with the convolution output
mask = arr==0
arr_out[mask] = conv_out[mask]
Remarks: Approach #1 would be the preferred way when you have fewer number of zero elements in input array, otherwise go with Approach #2.
This is an appendix to comments under #Divakar's answer (rather than an independent answer).
Out of curiosity I tried different 'pseudo' convolutions against the scipy convolution. The fastest one was the % (modulus) wrapping one, which surprised me: obviously numpy does something clever with its indexing, though obviously not having to pad will save time.
fn3 -> 9.5ms, fn1 -> 21ms, fn2 -> 232ms
import timeit
setup = """
import numpy as np
from scipy import signal
N = 1000
M = 750
P = 5 # i.e. small number -> bigger proportion of zeros
a = np.random.randint(0, P, M * N).reshape(M, N)
arr = np.asarray(a,float)"""
fn1 = """
arr_pad = np.lib.pad(arr, (1,1), 'wrap')
R,C = np.where(arr==0)
N = arr_pad.shape[1]
offset = np.array([-N, -1, 1, N])
idx = np.ravel_multi_index((R+1,C+1),arr_pad.shape)[:,None] + offset
arr[R,C] = arr_pad.ravel()[idx].sum(1)/4"""
fn2 = """
kernel = np.array([[0,1,0],
[1,0,1],
[0,1,0]])
conv_out = signal.convolve2d(arr, kernel, boundary='wrap', mode='same')/kernel.sum()
mask = arr == 0.0
arr[mask] = conv_out[mask]"""
fn3 = """
R,C = np.where(arr == 0.0)
arr[R, C] = (arr[(R-1)%M,C] + arr[R,(C-1)%N] + arr[R,(C+1)%N] + arr[(R+1)%M,C]) / 4.0
"""
print(timeit.timeit(fn1, setup, number = 100))
print(timeit.timeit(fn2, setup, number = 100))
print(timeit.timeit(fn3, setup, number = 100))
Using numpy and scipy.ndimage, you can apply a "footprint" that defines where you look for the neighbours of each element and apply a function to those neighbours:
import numpy as np
import scipy.ndimage as ndimage
# Getting neighbours horizontally and vertically,
# not diagonally
footprint = np.array([[0,1,0],
[1,0,1],
[0,1,0]])
a = [[1,2,3],[3,4,5],[5,6,7]]
# Need to make sure that dtype is float or the
# mean won't be calculated correctly
a_array = np.array(a, dtype=float)
# Can specify that you want neighbour selection to
# wrap around at the borders
ndimage.generic_filter(a_array, np.mean,
footprint=footprint, mode='wrap')
Out[36]:
array([[ 3.25, 3.5 , 3.75],
[ 3.75, 4. , 4.25],
[ 4.25, 4.5 , 4.75]])
I would like to convert a NumPy array to a unit vector. More specifically, I am looking for an equivalent version of this normalisation function:
def normalize(v):
norm = np.linalg.norm(v)
if norm == 0:
return v
return v / norm
This function handles the situation where vector v has the norm value of 0.
Is there any similar functions provided in sklearn or numpy?
If you're using scikit-learn you can use sklearn.preprocessing.normalize:
import numpy as np
from sklearn.preprocessing import normalize
x = np.random.rand(1000)*10
norm1 = x / np.linalg.norm(x)
norm2 = normalize(x[:,np.newaxis], axis=0).ravel()
print np.all(norm1 == norm2)
# True
I agree that it would be nice if such a function were part of the included libraries. But it isn't, as far as I know. So here is a version for arbitrary axes that gives optimal performance.
import numpy as np
def normalized(a, axis=-1, order=2):
l2 = np.atleast_1d(np.linalg.norm(a, order, axis))
l2[l2==0] = 1
return a / np.expand_dims(l2, axis)
A = np.random.randn(3,3,3)
print(normalized(A,0))
print(normalized(A,1))
print(normalized(A,2))
print(normalized(np.arange(3)[:,None]))
print(normalized(np.arange(3)))
This might also work for you
import numpy as np
normalized_v = v / np.sqrt(np.sum(v**2))
but fails when v has length 0.
In that case, introducing a small constant to prevent the zero division solves this.
As proposed in the comments one could also use
v/np.linalg.norm(v)
To avoid zero division I use eps, but that's maybe not great.
def normalize(v):
norm=np.linalg.norm(v)
if norm==0:
norm=np.finfo(v.dtype).eps
return v/norm
If you have multidimensional data and want each axis normalized to its max or its sum:
def normalize(_d, to_sum=True, copy=True):
# d is a (n x dimension) np array
d = _d if not copy else np.copy(_d)
d -= np.min(d, axis=0)
d /= (np.sum(d, axis=0) if to_sum else np.ptp(d, axis=0))
return d
Uses numpys peak to peak function.
a = np.random.random((5, 3))
b = normalize(a, copy=False)
b.sum(axis=0) # array([1., 1., 1.]), the rows sum to 1
c = normalize(a, to_sum=False, copy=False)
c.max(axis=0) # array([1., 1., 1.]), the max of each row is 1
If you don't need utmost precision, your function can be reduced to:
v_norm = v / (np.linalg.norm(v) + 1e-16)
You mentioned sci-kit learn, so I want to share another solution.
sci-kit learn MinMaxScaler
In sci-kit learn, there is a API called MinMaxScaler which can customize the the value range as you like.
It also deal with NaN issues for us.
NaNs are treated as missing values: disregarded in fit, and maintained
in transform. ... see reference [1]
Code sample
The code is simple, just type
# Let's say X_train is your input dataframe
from sklearn.preprocessing import MinMaxScaler
# call MinMaxScaler object
min_max_scaler = MinMaxScaler()
# feed in a numpy array
X_train_norm = min_max_scaler.fit_transform(X_train.values)
# wrap it up if you need a dataframe
df = pd.DataFrame(X_train_norm)
Reference
[1] sklearn.preprocessing.MinMaxScaler
There is also the function unit_vector() to normalize vectors in the popular transformations module by Christoph Gohlke:
import transformations as trafo
import numpy as np
data = np.array([[1.0, 1.0, 0.0],
[1.0, 1.0, 1.0],
[1.0, 2.0, 3.0]])
print(trafo.unit_vector(data, axis=1))
If you work with multidimensional array following fast solution is possible.
Say we have 2D array, which we want to normalize by last axis, while some rows have zero norm.
import numpy as np
arr = np.array([
[1, 2, 3],
[0, 0, 0],
[5, 6, 7]
], dtype=np.float)
lengths = np.linalg.norm(arr, axis=-1)
print(lengths) # [ 3.74165739 0. 10.48808848]
arr[lengths > 0] = arr[lengths > 0] / lengths[lengths > 0][:, np.newaxis]
print(arr)
# [[0.26726124 0.53452248 0.80178373]
# [0. 0. 0. ]
# [0.47673129 0.57207755 0.66742381]]
If you want to normalize n dimensional feature vectors stored in a 3D tensor, you could also use PyTorch:
import numpy as np
from torch import FloatTensor
from torch.nn.functional import normalize
vecs = np.random.rand(3, 16, 16, 16)
norm_vecs = normalize(FloatTensor(vecs), dim=0, eps=1e-16).numpy()
If you're working with 3D vectors, you can do this concisely using the toolbelt vg. It's a light layer on top of numpy and it supports single values and stacked vectors.
import numpy as np
import vg
x = np.random.rand(1000)*10
norm1 = x / np.linalg.norm(x)
norm2 = vg.normalize(x)
print np.all(norm1 == norm2)
# True
I created the library at my last startup, where it was motivated by uses like this: simple ideas which are way too verbose in NumPy.
Without sklearn and using just numpy.
Just define a function:.
Assuming that the rows are the variables and the columns the samples (axis= 1):
import numpy as np
# Example array
X = np.array([[1,2,3],[4,5,6]])
def stdmtx(X):
means = X.mean(axis =1)
stds = X.std(axis= 1, ddof=1)
X= X - means[:, np.newaxis]
X= X / stds[:, np.newaxis]
return np.nan_to_num(X)
output:
X
array([[1, 2, 3],
[4, 5, 6]])
stdmtx(X)
array([[-1., 0., 1.],
[-1., 0., 1.]])
For a 2D array, you can use the following one-liner to normalize across rows. To normalize across columns, simply set axis=0.
a / np.linalg.norm(a, axis=1, keepdims=True)
If you want all values in [0; 1] for 1d-array then just use
(a - a.min(axis=0)) / (a.max(axis=0) - a.min(axis=0))
Where a is your 1d-array.
An example:
>>> a = np.array([0, 1, 2, 4, 5, 2])
>>> (a - a.min(axis=0)) / (a.max(axis=0) - a.min(axis=0))
array([0. , 0.2, 0.4, 0.8, 1. , 0.4])
Note for the method. For saving proportions between values there is a restriction: 1d-array must have at least one 0 and consists of 0 and positive numbers.
A simple dot product would do the job. No need for any extra package.
x = x/np.sqrt(x.dot(x))
By the way, if the norm of x is zero, it is inherently a zero vector, and cannot be converted to a unit vector (which has norm 1). If you want to catch the case of np.array([0,0,...0]), then use
norm = np.sqrt(x.dot(x))
x = x/norm if norm != 0 else x