The following trivial example returns a singular matrix. Why? Any ways to overcome it?
In: from scipy.stats import gaussian_kde
Out:
In: points
Out: (array([63, 84]), array([46, 42]))
In: gaussian_kde(points)
LinAlgError: singular matrix
Looking at the backtrace, you can see that it fails when inverting the covariance matrix. This is due to exact multicollinearity in your data. By definition, you have multicollinearity if two variables are collinear, i.e. if
the correlation between two independent variables is equal to 1 or -1
In this case, each of the two variables has only two samples, so they are always collinear (trivially, there is always exactly one line passing through two distinct points). We can check that:
np.corrcoef(np.array([63, 84]), np.array([46, 42]))
[[ 1. -1.]
[-1. 1.]]
For two variables not to be necessarily collinear, they must have at least n=3 samples. In addition to this constraint, there is the limitation pointed out by ali_m: the number of samples n should be greater than or equal to the number of variables p. Putting the two together,
n>=max(3,p)
In this case p=2, so n>=3 is the right constraint.
The error occurs when gaussian_kde() tries to invert the covariance matrix of your input data. For the covariance matrix to be nonsingular, the number of (non-identical) points in your input must be greater than or equal to the number of variables. Try adding a third point and you should see that it works.
This answer on Crossvalidated has a proper explanation for why this is the case.
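The point made in both answers can be checked with a small sketch. The third point (70, 50) is a made-up value, chosen only because it does not lie on the line through the original two points. gaussian_kde expects an array of shape (number of variables, number of samples), and the sample covariance of n points has rank at most n - 1, which is why two points in two dimensions always give a singular matrix.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Original data: 2 variables (rows), 2 samples (columns).
two_points = np.array([[63, 84], [46, 42]], dtype=float)
rank_two = np.linalg.matrix_rank(np.cov(two_points))

# Hypothetical third sample (70, 50), not collinear with the first two.
three_points = np.array([[63, 84, 70], [46, 42, 50]], dtype=float)
rank_three = np.linalg.matrix_rank(np.cov(three_points))

# With a full-rank covariance matrix the estimator can be built and evaluated.
kde = gaussian_kde(three_points)
density = kde(three_points)

print("covariance rank with 2 points:", rank_two)
print("covariance rank with 3 points:", rank_three)
print("density at the sample points:", density)
```

Three points that all lie on one line would still produce a singular covariance matrix, so the extra point has to break the collinearity.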
I am not particularly good at math. I would like to get some breadcrumbs about how to solve the following formula using python code.
Assume an [m,n] matrix M and an [m,1] vector y.
Solve the least-squares problem using scipy.linalg.lstsq(M, y).
The output includes an [n,1] vector of coefficients a in the equation Ma=y.
As per this question, any vector of solutions like a in a regression is basically a series of single points, each taken from a normal distribution that describes the error of that point on the regression. In effect, every entry in the solution vector a is the mean of a normal distribution of errors centred on zero.
I would like to find those normal distributions rather than the scalar value for every single point in the solution. Apologies for the poor description of the mathy bits, I was never trained in math in Uni.
Here is a hint. Let me know if you want more.
scipy.linalg.lstsq(M, y) returns four things:
x : (N,) or (N, K) ndarray
Least-squares solution.
residues : (K,) ndarray or float
Square of the 2-norm for each column in b - a x, if M > N and ndim(A) == n
(returns a scalar if b is 1-D). Otherwise a (0,)-shaped array is returned.
rank : int
Effective rank of a.
s : (min(M, N),) ndarray or None
Singular values of a. The condition number of a is s[0] / s[-1].
residues is going to be of interest to you!
https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.lstsq.html
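To make the hint concrete, here is a minimal sketch of how residues can be turned into a normal distribution for each coefficient. It assumes the textbook ordinary-least-squares model (independent Gaussian noise with a common variance), which the question does not state, and it uses synthetic data; true_a, the noise level 0.1 and the sizes m and n are made up for illustration.

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)

# Synthetic problem (illustrative values only): m observations, n coefficients.
m, n = 50, 3
M = rng.normal(size=(m, n))
true_a = np.array([1.0, -2.0, 0.5])
y = M @ true_a + 0.1 * rng.normal(size=m)

a, residues, rank, s = linalg.lstsq(M, y)

# Unbiased estimate of the noise variance: residual sum of squares
# divided by the degrees of freedom (m - n when M has full column rank).
sigma2 = float(residues) / (m - rank)

# Under the OLS assumptions, a ~ Normal(true coefficients, sigma2 * inv(M.T @ M)).
cov_a = sigma2 * np.linalg.inv(M.T @ M)
std_err = np.sqrt(np.diag(cov_a))

for coef, err in zip(a, std_err):
    print(f"coefficient {coef: .4f} +/- {err:.4f} (one standard error)")
```

Each entry of a together with the matching entry of std_err describes one such normal distribution. The off-diagonal entries of cov_a say how the coefficient estimates are correlated.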
I am fairly new to Python and learning some basic python for data science, and I am trying to ascertain the min and max values for an array when you run randn.
To clarify the question: how low and how high could the numbers below potentially get?
Does it have anything to do with the row/column values entered?
I thought they were for values between -1 and 1 but this is not the case when I test.
import numpy as np
np.random.randn(3,3)
array([[ 1.61311526, -1.20028357, -0.41723647],
[-0.31983635, -3.05411198, -0.43453723],
[ 0.09385744, -0.28239577, -1.17262933]])
As mentioned by others, graphically, the probability distribution looks like this.
Probability of getting a value from -0.5 to 0.5: 19.1% + 19.1% = 38.2%
Probability of getting a value larger than 3 = 0.1%
np.random.randn returns a sample (or samples) from the “standard normal” distribution (see documentation here).
The standard normal distribution is not bounded. However, extreme values are rare: for example, the probability of getting a sample smaller than -3 is about 0.0013.
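The probabilities quoted in these answers can be reproduced from the standard normal CDF. This is a small sketch using scipy.stats.norm, which the original posts do not use.

```python
from scipy.stats import norm

# Probability of a standard normal sample falling below -3.
p_below_minus3 = norm.cdf(-3)

# Probability of a sample above 3 (survival function = 1 - cdf).
p_above_3 = norm.sf(3)

# Probability of a sample landing between -0.5 and 0.5.
p_within_half = norm.cdf(0.5) - norm.cdf(-0.5)

print(f"P(x < -3)         = {p_below_minus3:.4f}")
print(f"P(x > 3)          = {p_above_3:.4f}")
print(f"P(-0.5 < x < 0.5) = {p_within_half:.4f}")
```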
The function numpy.random.randn returns values from the standard normal distribution, which can be anything between negative and positive infinity, so there's no max or min. These values are distributed along the "bell curve" centered at 0, and are exponentially less likely to occur the farther you get from 0.
The row/column parameters don't determine any (non-existent) max/min; they just determine the shape of the output array (see the documentation).
So in your example, passing (3,3) into np.random.randn(3,3) returns a 3x3 array of values from the standard normal distribution.
Basically, there's no max or min value, but since higher numbers are less likely to come up or in other words have lower probabilities, you're usually only looking at -3.5 to 3.5. But the larger the size of random data you are trying to generate, the higher the chances of generating a larger value.
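A quick experiment illustrates the last point. The sketch below draws one large sample and looks at the largest absolute value among its first k entries, so the reported value can only stay the same or grow as k increases. The seed and the sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(1_000_000)

# Largest absolute value seen in the first k draws, for growing k.
sizes = [10, 1_000, 100_000, 1_000_000]
largest = [float(np.abs(samples[:k]).max()) for k in sizes]

for k, value in zip(sizes, largest):
    print(f"first {k:>9,d} samples: largest |x| = {value:.3f}")
```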
The numpy.random.randn function is based on a standard normal distribution, meaning that there is no maximum or minimum value. However, values far from zero, in either direction, are less likely to be produced than values close to zero.
I have a function that gives me probability distributions for each class, in terms of a matrix corresponding to mean values and another matrix corresponding to variance values. For example, if I had four classes then I would have the following outputs:
y_means = [1,2,3,4]
y_variance = [0.01,0.02,0.03,0.04]
I need to do the following calculation to the mean values to continue with the rest of my program:
y_means = np.array(y_means)
y_means = np.reshape(y_means,(y_means.size,1))
A = np.random.randn(10,y_means.size)
y_means = np.matmul(A,y_means)
Here, I have used the numpy.random.randn function to generate random samples from a standard normal distribution, and then multiply this with the matrix with the mean value to obtain a new output matrix. The dimension of the output matrix would then be of the size (10 x 1).
I need to do a similar calculation such that my output_variances will also be a (10 x 1) matrix. But it is not meaningful to multiply the variances in the same way with random samples from a standard normal distribution, because this would result in negative values as well. This is undesirable because my ultimate aim would be to create a normal distribution with these mean values and their corresponding variance values using:
torch.distributions.normal.Normal(loc=y_means, scale=y_variance)
So my question is whether there is any method by which I can get a variance value for each random sample generated by numpy.random.randn. The multiplication of such a matrix with output_variance would then make more sense.
Or if there is any other strategy for this that I might be unaware of, please let me know.
The problem mentioned in the question required another matrix of the same dimension as A that corresponded to a variance measure for the random samples present in A.
Taking a row-wise or column-wise variance of the matrix denoted by A using numpy.var() didn't give a similar 10 x 4 matrix to multiply with y_variance.
I had solved the above problem by using the following approach:
First create a matrix with the same dimensions as A with zero entries, using the following line of code:
A_var = np.zeros_like(A)
then, using torch.distributions, create normal distributions with the values in A as the mean and zeroes as variance:
dist_A = torch.distributions.normal.Normal(loc=torch.Tensor(A), scale=torch.Tensor(A_var))
https://pytorch.org/docs/stable/distributions.html lists all the operations possible on Normal distributions in PyTorch. The sample() method can generate samples of any size from a given distribution. This property was exploited to first generate a sample tensor of size 10 x 10 x 4 and then calculate the variance along axis 0.
np.var(np.array(dist_A.sample((10,))), axis=0)
This would result in a variance matrix of size 10 x 4, which can be used for calculations with y_variance.
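For comparison, here is a sketch of a different strategy from the one described above, not the poster's method. It assumes the class distributions are independent and treats A as a fixed matrix. Under those assumptions the variance of a linear combination is Var(sum_i a_i X_i) = sum_i a_i^2 Var(X_i), so the variances are propagated with the squared entries of A and can never become negative. Note also that the scale argument of torch.distributions.normal.Normal is a standard deviation, so the square root of the propagated variance would be passed there.

```python
import numpy as np

rng = np.random.default_rng(0)

y_means = np.array([1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)
y_variance = np.array([0.01, 0.02, 0.03, 0.04]).reshape(-1, 1)

A = rng.standard_normal((10, y_means.size))

# Means transform linearly.
out_means = A @ y_means

# Variances of independent terms add up, weighted by the squared coefficients.
out_variance = (A ** 2) @ y_variance

# Standard deviation, the quantity a Normal "scale" parameter expects.
out_std = np.sqrt(out_variance)

print(out_means.shape, out_variance.shape)
```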
I want to make sure I am using numpy's correlate correctly, because it is not giving me the answer I expect. Perhaps I am misunderstanding the correlate function. Here is a code snippet with comments:
import numpy as np
ref = np.sin(np.linspace(-2*np.pi, 2*np.pi, 10000)) # make some data
fragment = ref[2149:7022] # create a fragment of data from ref
corr = np.correlate(ref, fragment) # Find the correlation between the two
maxLag = np.argmax(corr) # find the maximum lag, this should be the offset that we chose above, 2149
print(maxLag)
2167 # I expected this to be 2149.
Isn't the index in the corr array where the correlation is maximum the lag between these two datasets? I would think the starting index I chose for the smaller dataset would be the offset with the greatest correlation.
Why is there a discrepancy between what I expect, 2149, and the result, 2167?
Thanks
That looks like a precision error to me. Cross-correlation is an integral, and it will always have problems when represented in discrete space; I guess the problem arises when the values are close to 0. Maybe the difference will disappear if you increase the number of points or the precision, but I don't think that is really necessary, since you are already dealing with an approximation when using the discrete cross-correlation. Plotting the correlation shows that the values around the peak are indeed close.
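One way to probe this further, sketched here under assumptions that are not part of the original posts: np.correlate in its default 'valid' mode returns raw sliding dot products that are not normalized by the energy of each window of ref, so its peak need not sit exactly at the true offset. A sum of squared differences (SSD) between the fragment and every window is zero only where the window equals the fragment, and it can be computed from the same corr array. The names window_energy, ssd and best_offset are illustrative.

```python
import numpy as np

ref = np.sin(np.linspace(-2 * np.pi, 2 * np.pi, 10000))
fragment = ref[2149:7022]

# Raw sliding dot products ('valid' mode is the default).
corr = np.correlate(ref, fragment)

# Energy of every window of ref with the same length as the fragment,
# computed from a cumulative sum of squares.
length = fragment.size
cumulative = np.concatenate(([0.0], np.cumsum(ref ** 2)))
window_energy = cumulative[length:] - cumulative[:-length]

# SSD(k) = sum((ref[k:k+length] - fragment)**2)
#        = window_energy[k] - 2 * corr[k] + sum(fragment**2)
ssd = window_energy - 2.0 * corr + np.sum(fragment ** 2)

best_offset = int(np.argmin(ssd))
print("offset with the smallest squared difference:", best_offset)
```

If the SSD minimum lands on the expected offset while the raw correlation peak does not, the gap comes from the missing normalization rather than from floating point precision.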
In my code I'm using theano to calculate an euclidean distance matrix (code from here):
import theano
import theano.tensor as T
MAT = T.fmatrix('MAT')
squared_euclidean_distances = (MAT ** 2).sum(1).reshape((MAT.shape[0], 1)) + (MAT ** 2).sum(1).reshape((1, MAT.shape[0])) - 2 * MAT.dot(MAT.T)
f_euclidean = theano.function([MAT], T.sqrt(squared_euclidean_distances))
def pdist_euclidean(mat):
    return f_euclidean(mat)
But this code causes some values of the matrix to be NaN. I've read that this happens when calculating theano.tensor.sqrt(), and here it's suggested to
Add an eps inside the sqrt (or max(x,EPs))
So I've added an eps to my code:
import theano
import theano.tensor as T
eps = 1e-9
MAT = T.fmatrix('MAT')
squared_euclidean_distances = (MAT ** 2).sum(1).reshape((MAT.shape[0], 1)) + (MAT ** 2).sum(1).reshape((1, MAT.shape[0])) - 2 * MAT.dot(MAT.T)
f_euclidean = theano.function([MAT], T.sqrt(eps+squared_euclidean_distances))
def pdist_euclidean(mat):
    return f_euclidean(mat)
And I'm adding it before performing sqrt. I'm getting fewer NaNs, but I'm still getting them. What is the proper solution to the problem? I've also noticed that if MAT is T.dmatrix() there are no NaNs.
There are two likely sources of NaNs when computing Euclidean distances.
Floating point representation approximation issues causing negative distances when it's really just zero. The square root of a negative number is undefined (assuming you're not interested in the complex solution).
Imagine MAT has the value
[[ 1.62434536 -0.61175641 -0.52817175 -1.07296862 0.86540763]
[-2.3015387 1.74481176 -0.7612069 0.3190391 -0.24937038]
[ 1.46210794 -2.06014071 -0.3224172 -0.38405435 1.13376944]
[-1.09989127 -0.17242821 -0.87785842 0.04221375 0.58281521]]
Now, if we break down the computation we see that (MAT ** 2).sum(1).reshape((MAT.shape[0], 1)) + (MAT ** 2).sum(1).reshape((1, MAT.shape[0])) has value
[[ 10.3838024 14.27675714 13.11072431 7.54348446]
[ 14.27675714 18.16971188 17.00367905 11.4364392 ]
[ 13.11072431 17.00367905 15.83764622 10.27040637]
[ 7.54348446 11.4364392 10.27040637 4.70316652]]
and 2 * MAT.dot(MAT.T) has value
[[ 10.3838024 -9.92394296 10.39763039 -1.51676099]
[ -9.92394296 18.16971188 -14.23897281 5.53390084]
[ 10.39763039 -14.23897281 15.83764622 -0.65066204]
[ -1.51676099 5.53390084 -0.65066204 4.70316652]]
The diagonals of these two values should be equal (the distance between a vector and itself is zero), and from this textual representation it looks like that is true. In fact they are slightly different; the differences are too small to show up when we print the floating point values like this.
This becomes apparent when we print the value of the full expression (the second of the matrices above subtracted from the first)
[[ 0.00000000e+00 2.42007001e+01 2.71309392e+00 9.06024545e+00]
[ 2.42007001e+01 -7.10542736e-15 3.12426519e+01 5.90253836e+00]
[ 2.71309392e+00 3.12426519e+01 0.00000000e+00 1.09210684e+01]
[ 9.06024545e+00 5.90253836e+00 1.09210684e+01 0.00000000e+00]]
The diagonal is almost composed of zeros but the item in the second row, second column is now a very small negative value. When you then compute the square root of all these values you get NaN in that position because the square root of a negative number is undefined (for real numbers).
[[ 0. 4.91942071 1.64714721 3.01002416]
[ 4.91942071 nan 5.58951267 2.42951402]
[ 1.64714721 5.58951267 0. 3.30470398]
[ 3.01002416 2.42951402 3.30470398 0. ]]
Computing the gradient of a Euclidean distance expression with respect to a variable inside the input to the function. This can happen not only if a negative number is generated due to floating point approximations, as above, but also if any of the inputs are zero length.
If y = sqrt(x) then dy/dx = 1/(2 * sqrt(x)). So if x=0 or, for your purposes, if squared_euclidean_distances=0 then the gradient will be NaN because 2 * sqrt(0) = 0 and dividing by zero is undefined.
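The gradient problem can be reproduced with plain NumPy arithmetic. This is only an illustration of the floating point behaviour, not Theano's actual gradient code: the local derivative at x = 0 evaluates to infinity, and as soon as the chain rule multiplies it by a zero from elsewhere in the graph the result is NaN. The bound 1e-9 mirrors the eps used in the question.

```python
import numpy as np

x = np.float64(0.0)

with np.errstate(divide="ignore", invalid="ignore"):
    # d/dx sqrt(x) = 1 / (2 * sqrt(x)), which overflows to inf at x = 0.
    local_grad = 1.0 / (2.0 * np.sqrt(x))

    # Chain rule with a zero upstream factor: 0 * inf is NaN.
    chained = 0.0 * local_grad

    # Bounding x away from zero keeps the derivative finite.
    safe_grad = 1.0 / (2.0 * np.sqrt(np.maximum(x, 1e-9)))

print(local_grad, chained, safe_grad)
```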
The solution to the first problem can be achieved by ensuring squared distances are never negative by forcing them to be no less than zero:
T.sqrt(T.maximum(squared_euclidean_distances, 0.))
To solve both problems (if you need gradients) then you need to make sure the squared distances are never negative or zero, so bound with a small positive epsilon:
T.sqrt(T.maximum(squared_euclidean_distances, eps))
The first solution makes sense since the problem only arises from approximate representations. The second is a bit more questionable because the true distance is zero so, in a sense, the gradient should be undefined. Your specific use case may yield some alternative solution that maintains the semantics without an artificial bound (e.g. by ensuring that gradients are never computed/used for zero-length vectors). But NaN values can be pernicious: they can spread like weeds.
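The clamping idea can be tried outside Theano. The sketch below is a NumPy stand-in for the Theano expression, with float32 input to mimic T.fmatrix. The function name pdist_euclidean_np, the eps parameter and the random test data are illustrative and not part of the original code.

```python
import numpy as np

def pdist_euclidean_np(mat, eps=0.0):
    """Pairwise Euclidean distances with squared distances clamped at eps."""
    sq_norms = (mat ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (mat @ mat.T)
    # Rounding can leave tiny negative values, mostly on the diagonal;
    # clamping them before the square root avoids NaN.
    return np.sqrt(np.maximum(sq_dists, eps))

rng = np.random.default_rng(1)
mat = rng.standard_normal((200, 16)).astype(np.float32)

dists = pdist_euclidean_np(mat)
print("any NaN:", bool(np.isnan(dists).any()))
```

Passing a small positive eps instead of 0.0 corresponds to the second variant above, the one intended for cases where gradients are needed.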
Just checking
In squared_euclidean_distances you're adding a column, a row, and a matrix. Are you sure this is what you want?
More precisely, if MAT is of shape (n, p), you're adding matrices of shapes (n, 1), (1, n) and (n, n).
Theano seems to silently repeat the rows (resp. the columns) of each one-dimensional member to match the number of rows and columns of the dot product.
If this is what you want
In reshape, you should probably specify ndim=2 according to basic tensor functionality : reshape.
If the shape is a Variable argument, then you might need to use the optional ndim parameter to declare how many elements the shape has, and therefore how many dimensions the reshaped Variable will have.
Also, it seems that squared_euclidean_distances should never be negative, unless rounding errors in the difference turn zero values into small negative values. If this is true, and if negative values are responsible for the NaNs you're seeing, you could indeed get rid of them without corrupting your result by wrapping squared_euclidean_distances in abs(...).