I'm trying to adapt a method based on the Mahalanobis distance, which works on images, to my code, which has to process time series. This is the MATLAB code, where the user passes an image as input, reshapes it, and then calculates the mean, the covariance matrix and its inverse (the comments show the sizes for his image):
function out = rxd(X)
% X input size = (126, 150, 204)
sizes = size(X);
X = reshape(X, [sizes(1)*sizes(2), sizes(3)]);
% X size after reshape = (18900, 204)
M = mean(X);
% M size = (1, 204)
C = cov(X);
% C size = (204, 204)
Q = inv(C);
% Q size = (204, 204)
This is my code, where I implemented this first part in Python. I do not have an image but a time series, whose shape is (24230, 30), which is why I skipped the reshaping step:
import os
import numpy as np
X = np.load('dataset.npy')
# dataset shape: (24230, 30)
# 1. Calculate the mean of the matrix
M = np.mean(X, axis=0) # shape = (30,)
# 2. Calculate the Covariance matrix
C = np.cov(X) #shape = (24230, 24230)
# 3. Calculate the inverse of the Covariance matrix
Q = np.linalg.inv(C) #Error
If I try to run it I get the error:
LinAlgError: Singular matrix
What could be the problem? I noticed that the only difference from the MATLAB outputs is in the shape of the mean, but I can't tell whether my conversion is wrong.
Ignoring everything about MATLAB, your covariance matrix C can't be inverted; by definition this means the matrix C is "singular", hence the Singular Matrix error. (So it's not about the code, but the data).
If you wish to calculate the inverse of this matrix anyway, the pseudo-inverse function np.linalg.pinv can do this; but do be sure to understand why you are doing it.
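For illustration, here is a minimal sketch of that pseudo-inverse route on a small stand-in array (the shapes are invented for the example; only the 30-column layout mirrors the question):
import numpy as np
# Small stand-in that is singular for the same reason as in the question:
# np.cov(X) treats the rows as variables, so with only 30 columns
# (observations) the covariance matrix has rank at most 29.
X = np.random.randn(100, 30)       # stand-in for the (24230, 30) series
C = np.cov(X)                      # (100, 100) and singular
Q = np.linalg.pinv(C)              # Moore-Penrose pseudo-inverse, defined even for singular C
print(np.linalg.matrix_rank(C))    # typically 29 here, far below 100, hence singular
# Aside: np.cov(X, rowvar=False) follows MATLAB's cov(X) convention and
# would give a (30, 30) covariance matrix instead.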
I am following this tutorial to implement object tracking for my project - https://www.pyimagesearch.com/2018/07/23/simple-object-tracking-with-opencv/
The method is to find the centroids of detected objects in the initial frame, and then calculate the shortest distance to the centroids of detected objects that show up in the next frame. The assumption is that the closest centroid belongs to the same object.
In the tutorial -
from scipy.spatial import distance as dist
...
D = dist.cdist(np.array(objectCentroids), newCentroids)
is used to calculate the distance (Euclidean distance). Unfortunately, I cannot use the scipy module, as I am trying to deploy this to AWS Lambda (size limit). In this case, the recommendation is to use np.linalg.norm - https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.norm.html
D = np.linalg.norm(objectCentroids - newCentroids)
The issue is that, unlike dist.cdist, which computes the full distance matrix, np.linalg.norm outputs a single value, calculated after newCentroids is subtracted from the objectCentroids matrix. My plan is to loop n times (however big the matrix is) and append to another matrix to construct the result I need. However, I wasn't sure whether my understanding of this is correct, so I wanted to seek some help. If anyone knows of a better way, I would appreciate any pointers.
UPDATE
Based on the feedback/answer I got, I updated the code a bit, and well... it seems to be working -
n = arrayObjectCentroids.shape[0]
m = inputCentroids.shape[0]
T = []
for i in range(0, n):
    for z in range(0, m):
        Tv = np.linalg.norm(arrayObjectCentroids[i] - inputCentroids[z])
        # print(f'Tv is \n {Tv}')
        T = np.append(T, Tv)
        # print(f'T is \n {T}')
print(f'new T is \n {T}')
D = np.reshape(T, (n, m))
print(f'D is \n {D}')
In this case, if there is one object moving a little -
newCentroids is [[224 86]], and its shape is (1, 2)...
objectCentroids is [[224 86]], and its shape is (1, 2)
D is [[0.]]
If I have 3 objects -
newCentroids is
[[228 79]
[ 45 127]
[103 123]]
shape of inputCentroids is (3, 2)
objectCentroids is
[[228 79]
[ 45 127]
[103 123]]
the shape of objectCentroids is (3, 2)
D is
[[ 0. 189.19038031 132.51792332]
[189.19038031 0. 58.13776741]
[132.51792332 58.13776741 0. ]]
Great that it works, but I feel like this may not be the best solution out there, and if you have any pointer, I would appreciate it!
Thanks!
EDIT: Edited code to address comments below
If in your case you have vectors in Euclidean space, then np.linalg.norm will return the length of that vector.
So objectCentroid - newCentroid will give you the vector between the point at objectCentroid and the point at newCentroid. Note that this is between 2 points and not an array containing ALL points.
To get all combinations of points I've used itertools and then reshaped the array to give the same output as dist.cdist:
import numpy as np
from scipy.spatial import distance as dist
import itertools
# Example data
objectCentroids = np.array([[0,0,0],[1,1,1],[2,2,2], [3,3,3]])
newCentroids = np.array([[4,4,4],[5,5,5],[6,6,6],[7,7,7]])
comb = list(itertools.product(objectCentroids, newCentroids))
all_dist = []
for pair in comb:
    dis = np.linalg.norm(pair[0] - pair[1])
    all_dist.append(dis)
all_dist = np.reshape(all_dist, (len(objectCentroids), len(newCentroids)))
D = dist.cdist(objectCentroids, newCentroids)
print(D)
print(" ")
print(all_dist)
You can use Numpy broadcasting to create a distance matrix.
Read about it here and here.
The basic idea is:
Stack (reshape) your centroids as (1, n, 3) and (n, 1, 3), where the last dimension of size 3 holds (x, y, z). Then subtract the arrays and apply np.linalg.norm along the last axis (the coordinate axis). That yields an (n, n) distance matrix (or (n, m) if the two sets have different sizes).
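For illustration, a minimal sketch of that broadcasting approach, reusing the example arrays from the previous answer (the names here belong to that example, not to the tracking code):
import numpy as np
objectCentroids = np.array([[0,0,0],[1,1,1],[2,2,2],[3,3,3]], dtype=float)
newCentroids = np.array([[4,4,4],[5,5,5],[6,6,6],[7,7,7]], dtype=float)
# (n, 1, 3) - (1, m, 3) broadcasts to (n, m, 3); the norm over the last axis
# collapses the (x, y, z) coordinates, leaving an (n, m) matrix like dist.cdist
diff = objectCentroids[:, np.newaxis, :] - newCentroids[np.newaxis, :, :]
D = np.linalg.norm(diff, axis=-1)
print(D)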
I am trying to run a PCA analysis over a dataset representing the 3 bands of an image. The dataset is of size (300000, 3), pixels by 3 bands. I find the eigenvalues and eigenvectors, which are then put into a list of tuples called eig_pairs. I then calculate the explained variance to determine how many bands to use for PCA.
I determine that I wish to use 2 bands.
My eig_pairs is a list of 3 tuples.
Following this tutorial, it says I need to reshape everything, reducing from the original dimension space (3) down to however many dimensions I wish to use (2). Their example goes from 7 to 4, as shown here:
matrix_w = np.hstack((eig_pairs[0][1].reshape(7,1),
eig_pairs[1][1].reshape(7,1),
eig_pairs[2][1].reshape(7,1),
eig_pairs[3][1].reshape(7,1)))
Following this logic I changed my own to:
matrix_w = np.hstack((eig_pairs[0][1].reshape(3,1),
                      eig_pairs[1][1].reshape(3,1)))
However I get the error ValueError: shapes (3131892,3) and (2,3) not aligned: 3 (dim 1) != 2 (dim 0)
import cv2
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

#read in image
img = cv2.imread('/Volumes/EXTERNAL/Stitched-Photos-for-Chris/p7_0015_20161005-949am-75m-pass-1.jpg.png',1)
row,col = img.shape[:2]
b,g,r = cv2.split(img)
# Pandas dataset
# samples = 3000000, features = 3
dataSet = pd.DataFrame({'bBnad':b.flat[:],'gBnad':g.flat[:],'rBnad':r.flat[:]})
print(dataSet.head())
# Standardize the data
X = dataSet.values
X_std = StandardScaler().fit_transform(X) #converts data from unit8 to float64
#Calculating Eigenvectors and eigenvalues of Covariance matrix
meanVec = np.mean(X_std, axis=0)
covarianceMatx = np.cov(X_std.T)
eigVals, eigVecs = np.linalg.eig(covarianceMatx)
# Create a list of (eigenvalue, eigenvector) tuples
eig_pairs = [ (np.abs(eigVals[i]),eigVecs[:,i]) for i in range(len(eigVals))]
# Sort from high to low
eig_pairs.sort(key = lambda x: x[0], reverse= True)
# Determine how many PC going to choose for new feature subspace via
# the explained variance measure which is calculated from eigen vals
# The explained variance tells us how much information (variance) can
# be attributed to each of the principal components
tot = sum(eigVals)
var_exp = [(i / tot)*100 for i in sorted(eigVals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
#convert 3-dimensional space to 2-dimensional space, giving a 3x2 matrix W
matrix_w = np.hstack((eig_pairs[0][1].reshape(3,1),
eig_pairs[1][1].reshape(3,1)))
Appreciate any help.
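For reference, a minimal shape sketch of the projection step on synthetic stand-in data (illustrative only, not a verified fix for the script above; np.linalg.eigh is used here simply because the covariance matrix is symmetric):
import numpy as np
X_std = np.random.randn(300000, 3)                  # stand-in for the standardized pixel data
eigVals, eigVecs = np.linalg.eigh(np.cov(X_std.T))  # eigh returns ascending eigenvalues
matrix_w = eigVecs[:, ::-1][:, :2]                  # columns for the two largest eigenvalues -> (3, 2)
Y = X_std.dot(matrix_w)                             # (300000, 3) x (3, 2) -> (300000, 2)
print(matrix_w.shape, Y.shape)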
I have seen several discussions in this forum about applying a median filter with a moving window, but my application has a special peculiarity.
I have a 3D array of dimension 750x12000x10000 and I need to apply a median filter to produce a 2D array (12000x10000). For this, each median calculation should consider a fixed neighborhood window (usually 100x100) and all z-axis values. There are some zero values in the matrix and they should not be considered for the calculation of the median. To process the real data, I am using numpy.memmap:
fp = np.memmap(filename, dtype='float32', mode='w+', shape=(750, 12000, 10000))
To process the real data stored with memmap, my input array is subdivided into several chunks, but to speed up my tests I will use in this post a reduced array (11, 200, 300) and a smaller window, (11, 5, 5) or (11, 50, 50), and I expect a (200, 300) result matrix:
import numpy as np
from timeit import default_timer as timer
zsize, ysize, xsize = (11, 200, 300)
w_size = 5 #to generate a 3D window (all_z, w_size, w_size)
#w_size = 50 #to generate a 3D window (all_z, w_size, w_size)
m_in=np.arange(zsize*ysize*xsize).reshape(zsize, ysize, xsize)
m_out = np.zeros((ysize, xsize))
First, I've tried the brute force method, but it is very slow as expected (even for the small array):
start = timer()
for l in range(0, ysize):
    i_l = max(0, l - w_size//2)
    o_l = min(ysize, i_l + w_size//2)
    for c in range(0, xsize):
        i_c = max(0, c - w_size//2)
        o_c = min(xsize, i_c + w_size//2)
        values = m_in[:, i_l:o_l, i_c:o_c]
        values = values[np.nonzero(values)]
        value = np.median(values)
        m_out[l, c] = value
end = timer()
print("Time elapsed: %f seconds" % (end - start))
#11.7 seconds with w_size = 50, 7.9 seconds with w_size = 5
To remove the double for loop, I tried itertools.product, but it is still slow:
from itertools import product
for l, c in product(range(0, ysize), range(0, xsize)):
    i_l = max(0, l - w_size//2)
    o_l = min(ysize, i_l + w_size//2)
    i_c = max(0, c - w_size//2)
    o_c = min(xsize, i_c + w_size//2)
    values = m_in[:, i_l:o_l, i_c:o_c]
    values = values[np.nonzero(values)]
    value = np.median(values)
    m_out[l, c] = value
#11.7 seconds with w_size = 50, 2.3 seconds with w_size = 5
So I tried to take advantage of numpy's fast matrix operations, starting with scipy.ndimage:
from scipy import ndimage
m_all = ndimage.median_filter(m_in, size=(zsize, w_size, w_size))
m_out[:] = m_all[0] #only first layer of 11, considering all the same
#a lot of seconds with w_size = 50, 7.9 seconds with w_size = 5
and scipy.signal too:
from scipy import signal
m_all = signal.medfilt(m_in, kernel_size=(zsize, w_size, w_size))
m_out[:] = m_all[0] #only first layer of 11, considering all the same
#a lot of seconds with w_size = 50, 7.8 seconds with w_size = 5
But in both scipy cases there is a waste of processing, because the function is applied at every 3D position of the input matrix, when it only needs to be applied to the first layer using a sliding window of dimension (all_z, w_size, w_size).
In all my tests I did not get a fast execution time, even with the reduced matrix and windows ((11, 200, 300) and (11, 50, 50)). The performance will be even more critical with my real data (an array of 750x12000x10000 and a window of 750x100x100).
Please, can anyone help me apply the median filter (3D array to 2D array) in a more Pythonic way?
Edit1
The real data array has many zero values. Considering a single z-column, of the 750 values only about 15 are non-zero. The zeros must be discarded in the processing, and because of this I am not using a sparse array representation.
This ended up being too long for a comment:
If you were applying a mean-filter, this problem would be trivial: you would take the mean over the z-axis and then apply the mean filter in 2D; this would be exactly equivalent to computing the mean over the full (x,y,z) neighbourhood in one go as the mean operation is associative (if that is the term; I mean: f(f(a,b), c) = f(a, b, c)).
In principle, this is not true for the median. However, as your neighbourhoods in (x,y) and z are both fairly large, I would assume that associativity still approximately holds (unless your data is drawn from a whacky distribution which it probably is not as this looks like some sort of imaging data). If I were you, I would test on some test data if applying the median in z first and then the median filter (or maybe even a mean filter) in (x,y) results in an unacceptable error compared to computing the median exactly by filtering in (x,y,z) simultaneously.
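To make that suggestion concrete, here is a minimal sketch of the two-stage approximation (median over z first, ignoring zeros, then a 2D median filter); the shapes and w_size follow the reduced test case from the question, and the data is synthetic:
import numpy as np
from scipy import ndimage
zsize, ysize, xsize = 11, 200, 300
w_size = 5
m_in = np.random.rand(zsize, ysize, xsize)
m_in[m_in < 0.1] = 0                   # sprinkle in some zeros, as in the real data
# Stage 1: median over z, ignoring zeros (nanmedian warns if a column is all zeros)
tmp = np.where(m_in == 0, np.nan, m_in)
z_med = np.nanmedian(tmp, axis=0)      # shape (ysize, xsize)
# Stage 2: 2D median filter over the (w_size, w_size) neighbourhood
m_out = ndimage.median_filter(z_med, size=w_size)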
Context
I'm running into an error when trying to use sparse matrices as an input to sklearn.neural_network.MLPRegressor. Nominally, this method is able to handle sparse matrices. I think this might be a bug in scikit-learn, but wanted to check on here before I submit an issue.
The Problem
When passing a scipy.sparse input to sklearn.neural_network.MLPRegressor I get:
ValueError: input must be a square array
The error is raised by the matrix_power function within numpy.matrixlib.defmatrix. It seems to occur because matrix_power passes the sparse matrix to numpy.asanyarray (L137), which returns an array of size=1, ndim=0 containing the sparse matrix object. matrix_power then performs some dimension checks (L138-141) to make sure the input is a square matrix, which fail because the array returned by numpy.asanyarray is not square, even though the underlying sparse matrix is square.
As far as I can tell, the problem stems from numpy.asanyarray preventing the dimensions of the sparse matrix from being determined. The sparse matrix itself has a shape attribute which would allow it to pass the dimension checks, but only if it's not run through asanyarray.
I think this might be a bug, but I don't want to go filing issues until I've confirmed that I'm not just being an idiot! Please see below to check.
If it is a bug, where would be the most appropriate place to raise an issue? NumPy? SciPy? or Scikit-Learn?
Minimal Example
Environment
Arch Linux
kernel 4.15.7-1
Python 3.6.4
numpy 1.14.1
scipy 1.0.0
sklearn 0.19.1
Code
import numpy as np
from scipy import sparse
from sklearn import model_selection
from sklearn.preprocessing import StandardScaler, Imputer
from sklearn.neural_network import MLPRegressor
## Generate some synthetic data
def fW(A, B, C):
    return A * np.random.normal(.3, .1) + B * np.random.normal(.6, .1)
def fX(A, B, C):
    return B * np.random.normal(-1, .1) + A * np.random.normal(-.9, .1) / C
# independent variables
N = int(1e4)
A = np.random.uniform(2, 12, N)
B = np.random.uniform(2, 12, N)
C = np.random.uniform(2, 12, N)
# synthetic data
mW = fW(A, B, C)
mX = fX(A, B, C)
# combine datasets
real = np.vstack([A, B, C]).T
meas = np.vstack([mW, mX]).T
# add noise to meas
meas *= np.random.normal(1, 0.0001, meas.shape)
## Make data sparse
prob_null = 0.2
real[np.random.choice([True, False], real.shape, p=[prob_null, 1-prob_null])] = np.nan
meas[np.random.choice([True, False], meas.shape, p=[prob_null, 1-prob_null])] = np.nan
# NB: problem persists whichever sparse matrix method is used.
real = sparse.csr_matrix(real)
meas = sparse.csr_matrix(meas)
# replace missing values with mean
rmnan = Imputer()
real = rmnan.fit_transform(real)
meas = rmnan.fit_transform(meas)
# split into test/training sets
real_train, real_test, meas_train, meas_test = model_selection.train_test_split(real, meas, test_size=0.3)
# create scalers and apply to data
real_scaler = StandardScaler(with_mean=False)
meas_scaler = StandardScaler(with_mean=False)
real_scaler.fit(real_train)
meas_scaler.fit(meas_train)
treal_train = real_scaler.transform(real_train)
tmeas_train = meas_scaler.transform(meas_train)
treal_test = real_scaler.transform(real_test)
tmeas_test = meas_scaler.transform(meas_test)
nn = MLPRegressor((100,100,10), solver='lbfgs', early_stopping=True, activation='tanh')
nn.fit(tmeas_train, treal_train)
## ERROR RAISED HERE
## The problem:
# the sparse matrix has a shape attribute that would pass the square matrix validation
tmeas_train.shape
# but not after it's been through asanyarray
np.asanyarray(tmeas_train).shape
MLPRegressor.fit(), as given in the documentation, supports a sparse matrix for X but not for y:
Parameters:
X : array-like or sparse matrix, shape (n_samples, n_features)
The input data.
y : array-like, shape (n_samples,) or (n_samples, n_outputs)
The target values (class labels in classification, real numbers in regression).
I am able to successfully run your code with:
nn.fit(tmeas_train, treal_train.toarray())
I would like to use a generic filter to calculate the mean of values within a given window (or kernel), for values that fulfill a couple of conditions. I expected the following code to produce a mean filter of the first array in a 3-layer window, using the other two arrays to mask values from the mean calculation.
from scipy import ndimage
import numpy as np
#some test data
tstArr = np.random.rand(3,7,7)
tstArr = tstArr*10
tstArr = np.int_(tstArr)
tstArr[1] = tstArr[1]*100
tstArr[2] = tstArr[2] *1000
#mean function
def testFun(tstData,processLayer,nLayers,kernelSize):
    funData = tstData.reshape((nLayers,kernelSize,kernelSize))
    meanLayer = funData[processLayer]
    maskedData = meanLayer[(funData[1]>1)&(funData[2]<9000)]
    returnMean = np.mean(maskedData)
    return returnMean
#number of layers in the array
nLayers = np.shape(tstArr)[0]
#window size
kernelSize = 5
#create a sampling window of 5x5 elements from each array
footprnt = np.ones((nLayers,kernelSize,kernelSize), dtype=int)
# calculate the mean of the first layer in the array (other two are for masking)
processLayer = 0
tstOut = ndimage.generic_filter(tstArr, testFun, footprint=footprnt, extra_arguments = (processLayer,nLayers,kernelSize))
I thought this would yield a 7x7 array of masked mean values from the first layer in the input array. The output is a 3x7x7 array, and I don't understand what the values represent. I'm not sure how to produce the "masked" mean-filtered array, or how to interpret the output as given.
Your code produces a mean filter of the first array in a 3-layer window, using the other two arrays to mask values from the mean calculation. You will find the result in tstOut[1].
What is going on? When you call ndimage.generic_filter with tstArr of shape (3, 7, 7) and footprint=np.ones((3, 5, 5)), then for every i from 0 to 2, every j from 0 to 6 and every k from 0 to 6, testFun is called with the subarray of tstArr centered at (i, j, k), of shape (3, 5, 5) (the array is reflected at the boundary to supply missing values).
In the end:
tstOut[0] is the mean filter of tstArr[0] with tstArr[0] and tstArr[1] as masks
tstOut[1] is the mean filter of tstArr[0] with tstArr[1] and tstArr[2] as masks
tstOut[2] is the mean filter of tstArr[1] with tstArr[2] and tstArr[2] as masks
Again, the wanted result is in tstOut[1].
I hope this will help you.
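If it helps to convince yourself, here is a small check (reusing tstArr, tstOut, processLayer=0 and kernelSize=5 from the question) that tstOut[1] at an interior pixel really is the masked mean of tstArr[0] over the (3, 5, 5) neighbourhood:
# interior pixel (1, 3, 3): the (3, 5, 5) window fits entirely inside the array,
# so no boundary reflection is involved
window = tstArr[:, 1:6, 1:6]
mask = (window[1] > 1) & (window[2] < 9000)
print(tstOut[1, 3, 3], window[0][mask].mean())   # the two values should match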