I have two 2D arrays of size N x 3968 and M x 3968, respectively. They represent the N and M feature vectors of two sets of molecules (N and M), each of which consists of 3968 features (floating point numbers). I want to measure the divergence between these two multivariate distributions, which don't follow any norm (they are not multivariate normal or multinomial, etc.). Is there an implementation in Python to compute the Jensen-Shannon divergence or Kullback-Leibler divergence between two such multi-variate distributions?
I have found a function in the dit package, but it seems to take only 1D distributions: https://dit.readthedocs.io/en/latest/measures/divergences/jensen_shannon_divergence.html
Also scipy has an implementation of Jensen-Shannon distance but it also computes it between 1D distributions: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jensenshannon.html?highlight=jensenshannon
The latter has an argument that controls along which axis the distances are computed and at the end returns a 1D array of distances. Given my two 2D arrays (N x 3968 and M x 3968), would computing the Jensen-Shannon distance along axis=0 and subsequently averaging the resulting 3968 distances to a single number, give a reasonable estimate of the divergence of the two multivariate distributions? No, because the mean doesn't capture variable (feature) dependences. What about their product?
What about a dimensionality reduction technique to remove nonlinear dependence between features? A technique that estimates the percentage of the variance "pv" that each lower space dimension explains (as PCA does for linear dependence), could facilitate calculation of the weighted mean of the JS distances on the low dimensions, where the weight will be the "pv" of each low dimension.
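For reference, the per-feature averaging idea from the question can be sketched as below. This is only an illustration under assumptions: the function name mean_per_feature_js and the bins parameter are made up for this sketch, each feature is discretized with a shared histogram (scipy's jensenshannon expects probability vectors, not raw samples), and, as the question already notes, the average ignores any dependence between features.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon


def mean_per_feature_js(a, b, bins=32):
    """Average the 1D Jensen-Shannon distance over all features.

    a has shape (N, F) and b has shape (M, F). Hypothetical helper:
    it treats every feature independently.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    distances = np.empty(a.shape[1])
    for j in range(a.shape[1]):
        # Use the same bin edges for both samples so the histograms are comparable.
        lo = min(a[:, j].min(), b[:, j].min())
        hi = max(a[:, j].max(), b[:, j].max())
        p, _ = np.histogram(a[:, j], bins=bins, range=(lo, hi))
        q, _ = np.histogram(b[:, j], bins=bins, range=(lo, hi))
        # jensenshannon normalizes the counts itself; base=2 bounds the result by 1.
        distances[j] = jensenshannon(p, q, base=2)
    return distances.mean()


rng = np.random.default_rng(0)
set_a = rng.normal(size=(500, 4))           # N x F
set_b = rng.normal(loc=3.0, size=(400, 4))  # M x F, shifted distribution
set_c = rng.normal(size=(400, 4))           # M x F, same distribution as set_a
print(mean_per_feature_js(set_a, set_b), mean_per_feature_js(set_a, set_c))
```

The result depends on the number of bins, so it is best compared only between runs that use the same binning.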
Related
I have a large M x N 2D matrix of positive coefficients (8bit unsigned) with M,N~10^3.
I want to optimize the parameters (M*N~10^6) such that my error-function (where I put the matrix in) is minimized. Boundary conditions: Neighboring parameters vary slowly/smoothly.
I have already tried to downscale the matrix in order to reduce the number of parameters and then flatten the matrix to feed it into Scipy's Minimize function. This causes memory errors quickly.
Is there a smart approach to optimize this large coefficient matrix in a finite amount of time without applying a low-parametric model?
I have a numpy array in range [0,1000] with exponential distribution with lambda_x, I want to transform this numpy array to an array with different exponential distribution lambda_y. How can I find a function that does this mapping?
I tried to use the inverse function, but it didn't work.
import numpy as np

def inverse(X, lambd):
    """Inverse CDF of the exponential distribution."""
    return -np.log(1 - X) / lambd
It should be as simple as taking your original exponentials X and scaling them by multiplying by λx/λy to produce Y's.
A well known mechanism is to generate exponentials via inverse transform sampling. The second example on that page shows that if you generate U's which are uniformly distributed between 0 and 1 (where 100*U corresponds to the percentiles of the distribution) and transform them using the formula -ln(1 - U) / λ, you will get exponentials with rate λ. If λ is λx it yields your X distribution, and if λ is λy it yields the Y distribution. Hence rescaling by the ratio of the lambdas will convert from one to the other for a given percentile.
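A minimal sketch of the rescaling, assuming made-up rates and a helper name (rescale_exponential) chosen just for this illustration. It also checks that the result agrees with going through the percentiles explicitly:

```python
import numpy as np


def rescale_exponential(x, lambda_x, lambda_y):
    """Map samples with rate lambda_x to samples with rate lambda_y."""
    return np.asarray(x, dtype=float) * lambda_x / lambda_y


rng = np.random.default_rng(0)
lambda_x, lambda_y = 0.5, 2.0  # hypothetical rates, for illustration only
x = rng.exponential(scale=1.0 / lambda_x, size=100_000)
y = rescale_exponential(x, lambda_x, lambda_y)

# Same mapping via percentiles: apply the CDF of X, then the inverse CDF of Y.
u = 1.0 - np.exp(-lambda_x * x)
y_via_percentiles = -np.log(1.0 - u) / lambda_y

print(x.mean(), y.mean())  # close to 1/lambda_x and 1/lambda_y
```

Note that NumPy's exponential sampler is parameterized by scale = 1/λ rather than by the rate.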
Assume 3 matrices Mean, Variance, Sample all with the same dimensionality. Is there a 1 line solution to generate the Sample matrix in numpy such that:
Sample[i,j] is drawn from NormalDistribution(Mean[i,j], Variance[i,j])
Using linearity of mean and Var(aX +b) = a**2 Var(X):
Generate a standard normal (zero mean, unit variance) 2D array with np.random.randn(). Multiply it elementwise by the standard deviation, np.sqrt(Variance), and add Mean, again elementwise.
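As a sketch of that one-liner, with small made-up Mean and Variance matrices and NumPy's Generator API used in place of np.random.randn (the arithmetic is the same):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs, for illustration only.
Mean = np.array([[0.0, 10.0], [-5.0, 3.0]])
Variance = np.array([[1.0, 4.0], [0.25, 9.0]])

# Standard normal draws, scaled by the standard deviation and shifted by the mean.
Sample = Mean + np.sqrt(Variance) * rng.standard_normal(Mean.shape)
print(Sample)
```

np.random.normal(Mean, np.sqrt(Variance)) should give an equivalent result, since its loc and scale arguments accept arrays.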
I have a large sparse matrix and I want to find its eigenvectors with specific eigenvalue. In scipy.sparse.linalg.eigs, it says the required argument k:
"k is the number of eigenvalues and eigenvectors desired. k must be smaller than N-1. It is not possible to compute all eigenvectors of a matrix".
The problem is that I don't know how many eigenvectors correspond to the eigenvalue I want. What should I do in this case?
I'd suggest using Singular Value Decomposition (SVD) instead. SciPy provides svds (from scipy.sparse.linalg import svds), which can handle sparse matrices. You can find the singular values (which play the role of eigenvalues here) and the corresponding vectors as follows:
U, Sigma, VT = svds(X, k=n_components, tol=tol)
where X can be a sparse CSR matrix, and U and VT hold the left and right singular vectors (the SVD counterparts of eigenvectors) corresponding to the singular values in Sigma. Here, you control the number of components. I'd say start with a small n_components and then increase it. You can sort Sigma and look at the distribution of singular values: typically a few are large and the rest drop off quickly. You can then set a threshold on the singular values to decide how many vectors to keep.
If you want to use scikit-learn, there is a class sklearn.decomposition.TruncatedSVD that lets you do what I explained.
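A small runnable sketch of the svds call, assuming a random sparse test matrix and an arbitrary n_components, neither of which comes from the question. The ordering of the values returned by svds is not something to rely on, so the sketch sorts them explicitly:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Hypothetical sparse input: 200 x 100 CSR matrix with 5% nonzero entries.
X = sparse_random(200, 100, density=0.05, format="csr", random_state=0)

n_components = 5  # start small, increase if the spectrum has not dropped off yet
U, Sigma, VT = svds(X, k=n_components, tol=1e-10)

# Sort from largest to smallest singular value, keeping the vectors aligned.
order = np.argsort(Sigma)[::-1]
U, Sigma, VT = U[:, order], Sigma[order], VT[order, :]

print(Sigma)  # inspect how quickly the singular values decay
```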
I know that numpy.cov calculates the covariance given an N-dimensional array.
I can see from the documentation on GitHub that the normalisation is done by (N-1). But for my specific case, the covariance matrix is given by:
where xi is the quantity. i and j are the bins.
As you can see from the above equation, this covariance matrix is normalised by (N-1)/N.
TO GET THE ABOVE NORMALISATION
Can I simply multiply the covariance matrix obtained from numpy.cov by (N-1)**2 / N to get the above normalisation? Is that correct?
Or Should I use the bias parameter inside numpy.cov? If so how?
There are two ways of doing this.
We can call np.cov with bias=1 and then multiply the result by N-1
or
We can multiply the covariance matrix obtained from the default np.cov call (which normalises by N-1) by (N-1)**2/N
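A quick check that both routes agree, using random data as a stand-in. The layout here is an assumption: rows are the bins and columns are the N observations, which is what np.cov expects by default (rowvar=True).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 50))  # hypothetical data: 4 bins, N = 50 observations
N = x.shape[1]

# Route 1: biased estimate (divides by N), then multiply by N - 1.
cov_bias = np.cov(x, bias=True) * (N - 1)

# Route 2: default estimate (divides by N - 1), then multiply by (N - 1)**2 / N.
cov_default = np.cov(x) * (N - 1) ** 2 / N

# Direct computation with the (N - 1) / N prefactor, for comparison.
deviations = x - x.mean(axis=1, keepdims=True)
cov_direct = (N - 1) / N * deviations @ deviations.T

print(np.allclose(cov_bias, cov_default), np.allclose(cov_bias, cov_direct))
```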