Fitting 2D Gaussian to a 2D matrix of values - python

I'm trying to fit a gaussian to this set of data:
It is a 2D matrix with values (probability distribution). If I plot it in 3D it looks like:
As far as I understood from this other question (https://mathematica.stackexchange.com/questions/27642/fitting-a-two-dimensional-gaussian-to-a-set-of-2d-pixels) I need to compute the mean and the covariance matrix of my data and the Gaussian that I need will be exactly the one defined by that mean and covariance matrix.
However, I can not properly understand the code of that other question (as it is from Mathematica) and I am pretty stuck with statistics.
How would I compute in Python (Numpy, PyTorch...) the mean and the covariance matrix of the Gaussian?
I'm trying to avoid all these optimization frameworks (LSQ, KDE) as I think that the solution is much simpler and the computational cost is something that I have to take into account...
Thanks!

Let's call your data matrix D with shape d x n where d is the data dimension and n is the number of samples. I will assume that in your example, d=5 and n=6, although you will need to determine for yourself which is the data dimension and which is the sample dimension. In that case, we can find the mean and covariance using the following code:
import numpy as np
n = 6
d = 5
D = np.random.random([d, n])
mean = D.mean(axis=1)
covariance = np.cov(D)

Related

Vectorization of numpy matrix that contains pdf of multiple gaussians and multiple samples

My problem is this: I have GMM model with K multi-variate gaussians, and also I have N samples.
I want to create a N*K numpy matrix, which in it's [i,k] cell there is the pdf function of the k'th gaussian on the i'th sample, i.e. in this cell there is
In short, I'm intrested in the following matrix:
pdf matrix
This what I have now (I'm working with python):
Q = np.array([scipy.stats.multivariate_normal(mu_t[k], cov_t[k]).pdf(X) for k in range(self.K)]).T
X in the code is a matrix whose lines are my samples.
It's works fine on small toy dataset from small dimension, but the dataset I'm working with is 10,000 28*28 pictures, and on it this line run extremely slowly...
I want to find a solution that doesn't envolve loops but only vector\matrix operation (i.e. vectorization). The scipy 'multivariate_normal' function cannot parameters of more than 1 gaussians, as far as I understand it (but it's 'pdf' function can calculates on multiple samples at once).
Does someone have an Idea?
I am afraid, that the main speed killer in your problem is the inversion and deteminant calculation for the cov_t matrices. If you somehow managed to precalculate these, you could enroll the calculation and use np.add.outer to get all combinations of x_i - mu_k and then use array comprehension to calculate the probabilities with the full formula of the normal distribution function.
Try
S = np.add.outer(X,-mu_t)
cov_t_inv = ??
cov_t_inv_det = ??
Q = 1/(2*np.pi*cov_t_inv_det)**0.5 * np.exp(-0.5*np.einsum('ikr,krs,kis->ik',S,cov_t_inv,S))
Where you insert precalculated arrays cov_t_inv for the inverse covariance matrices and cov_t_inv_det for their determinants.

Getting variance values for random samples generated from a standard normal distribution using numpy

I have a function that gives me probability distributions for each class, in terms of a matrix corresponding to mean values and another matrix corresponding to variance values. For example, if I had four classes then I would have the following outputs:
y_means = [1,2,3,4]
y_variance = [0.01,0.02,0.03,0.04]
I need to do the following calculation to the mean values to continue with the rest of my program:
y_means = np.array(y_means)
y_means = np.reshape(y_means,(y_means.size,1))
A = np.random.randn(10,y_means.size)
y_means = np.matmul(A,y_means)
Here, I have used the numpy.random.randn function to generate random samples from a standard normal distribution, and then multiply this with the matrix with the mean value to obtain a new output matrix. The dimension of the output matrix would then be of the size (10 x 1).
I need to do a similar calculation such that my output_variances will also be a (10 x 1) matrix. But it is not meaningful to multiply the variances in the same way with random samples from a standard normal distribution, because this would result in negative values as well. This is undesirable because my ultimate aim would be to create a normal distribution with these mean values and their corresponding variance values using:
torch.distributions.normal.Normal(loc=y_means, scale=y_variance)
So my question is if there is any method by which I get a variance value for each random sample generated by numpy.random.randn? Because then the multplication of such a matrix would make more sense with output_variance.
Or if there is any other strategy for this that I might be unaware of, please let me know.
The problem mentioned in the question required another matrix of the same dimension as A that corresponded to a variance measure for the random samples present in A.
Taking a row-wise or column-wise variance of the matrix denoted by A using numpy.var() didn't give a similar 10 x 4 matrix to multiply with y_variance.
I had solved the above problem by using the following approach:
First create a matrix with the same dimensions as A with zero entries, using the following line of code:
A_var = np.zeros_like(A)
then, using torch.distributions, create normal distributions with the values in A as the mean and zeroes as variance:
dist_A = torch.distributions.normal.Normal(loc=torch.Tensor(A), scale=torch.Tensor(A_var))
https://pytorch.org/docs/stable/distributions.html lists all the operations possible on Normal distributions in PyTorch. The sample() method can generate samples from a given distribution for any size. This property was exploited to first generate a sample matrix of size 10 X 10 x 4 and then calculating the variance along axis 0.
np.var(np.array(dist2.sample((10,))),axis=0)
This would result in a variance matrix of size 10 x 4, which can be used for calculations with y_variance.

Coefficients from linear sum of matrices?

I have a basis set of square matrices and a data set that I need to find the coefficients for given that my data is a linear sum of the basis set.
def basis(a,b,c):
return a*gam1+b*gam2+c*kapp+allsky
So data = basis and I need the best fit (least square) values of the coefficients a,b and c. The basis matrices and the data matrices are all square matrices of 89x89. I have tried using np.linalg.lstsq however since my A matrix would need to be a matrix of the 4 basis matrices the array dimension becomes 4 and throws an error stating the array dimension must be 2. Any help is appreciated.

Python-Generating numbers according to a corellation matrix

Hi, I am trying to generate correlated data as close to the first table as possible (first three rows shown out of a total of 13). The correlation matrix for the relevant columns is also shown (corr_total).
I am trying the following code, which shows the error:
"LinAlgError: 4-th leading minor not positive definite"
from scipy.linalg import cholesky
# Correlation matrix
# Compute the (upper) Cholesky decomposition matrix
upper_chol = cholesky(corr_total)
# What should be here? The mu and sigma of one row of a table?
rnd = np.random.normal(2.57, 0.78, size=(10,7))
# Finally, compute the inner product of upper_chol and rnd
ans = rnd # upper_chol
My question is what goes into the values of The mu and sigma, and how to resolve the error shown above.
Thanks!
P.S I have edited the question to show the original table. It shows data for four patients. I basically want to make synthetic data for more cases, that replicates the patterns found in these patients
Thank you for answering my question about when data you have access to. The error that you received was generated when you called cholesky. cholesky requires that your matrix be positive semidefinite. One way to check if a matrix is semi-positive definite is to see if all of its eigenvalues are greater than zero. One of the eigenvalues of your correlation/covarance matrix is nearly zero. I think that cholesky is just being fussy. Use can use scipy.linalg.sqrtm as an alternate decomposition.
For your question on the generation of multivariate normals, the random normal that you generate should be a standard random normal, i.e. a mean of 0 and a width of 1. Numpy provides a standard random normal generator with np.random.randn.
To generate a multivariate normal, you should also take the decomposition of the covariance, not the correlation matrix. The following will generate a multivariate normal using an affine transformation, as in your question.
from scipy.linalg import cholesky, sqrtm
relavant_columns = ['Affecting homelife',
'Affecting mobility',
'Affecting social life/hobbies',
'Affecting work',
'Mood',
'Pain Score',
'Range of motion in Doc']
# df is a pandas dataframe containing the data frame from figure 1
mu = df[relavant_columns].mean().values
cov = df[relavant_columns].cov().values
number_of_sample = 10
# generate using affine transformation
#c2 = cholesky(cov).T
c2 = sqrtm(cov).T
s = np.matmul(c2, np.random.randn(c2.shape[0], number_of_sample)) + mu.reshape(-1, 1)
# transpose so each row is a sample
s = s.T
Numpy also has a built-in function which can generate multivariate normals directly
s = np.random.multivariate_normal(mu, cov, size=number_of_sample)

creating a normally distributed matrix in numpy with specified mean and variance per element

Assume 3 matrices Mean, Variance, Sample all with the same dimensionality. Is there a 1 line solution to generate the Sample matrix in numpy such that:
Sample[i,j] is drawn from NormalDistribution(Mean[i,j], Variance[i,j])
Using linearity of mean and Var(aX +b) = a**2 Var(X):
Generate a centered and reduced 2D array with np.random.randn(). Multiply pointwise by std (np.sqrt(variance)) and add (still pointwise) mean.

Categories

Resources