I have two problems.
1) First: How do I generate a 1000x100 matrix where each dimension (column) follows a marginal distribution with mean 0 and variance 1? I know I can use a univariate distribution for each, but how do you put 100 such distributions into one numpy/pandas matrix?
2) Generate a 100x1000 matrix. This would use a multivariate distribution, but how do you specify the mean and covariance matrix for that numpy function? It has to be random.
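For illustration only, here is one possible sketch of what this could look like (assuming standard normal marginals for 1), and, for 2), a random mean vector together with a random positive semi-definite covariance built as A @ A.T):
import numpy as np

rng = np.random.default_rng()

# 1) 1000x100 matrix: each of the 100 columns is an independent
#    standard normal marginal (mean 0, variance 1)
X = rng.standard_normal((1000, 100))

# 2) 100x1000 matrix: 100 samples from a 1000-dimensional multivariate normal
#    with a random mean and a random covariance; A @ A.T guarantees the
#    covariance matrix is positive semi-definite
mean = rng.standard_normal(1000)
A = rng.standard_normal((1000, 1000))
cov = A @ A.T
Y = rng.multivariate_normal(mean, cov, size=100)   # shape (100, 1000)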
I have two 2D arrays of size N x 3968 and M x 3968, respectively. They represent the feature vectors of two sets of molecules (N and M molecules), each described by 3968 floating-point features. I want to measure the divergence between these two multivariate distributions, which do not follow any standard parametric form (they are not multivariate normal, multinomial, etc.). Is there an implementation in Python to compute the Jensen-Shannon divergence or Kullback-Leibler divergence between two such multivariate distributions?
I have found a function in the dit package, but it seems to take only 1D distributions: https://dit.readthedocs.io/en/latest/measures/divergences/jensen_shannon_divergence.html
Scipy also has an implementation of the Jensen-Shannon distance, but it too computes it between 1D distributions: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jensenshannon.html?highlight=jensenshannon
The latter has an argument that controls the axis along which the distances are computed and returns a 1D array of distances. Given my two 2D arrays (N x 3968 and M x 3968), would computing the Jensen-Shannon distance along axis=0 and then averaging the resulting 3968 distances into a single number give a reasonable estimate of the divergence between the two multivariate distributions? No, because the mean doesn't capture dependencies between the variables (features). What about their product?
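For concreteness, a minimal, purely illustrative sketch of that per-feature averaging idea (the helper name and the histogramming step are my own assumptions: each feature is binned over a shared range first, since scipy's jensenshannon expects probability vectors, and the final mean still ignores dependencies between features):
import numpy as np
from scipy.spatial.distance import jensenshannon

def mean_featurewise_js(P, Q, bins=50):
    # P: (N, F) and Q: (M, F) feature matrices of the two molecule sets
    distances = []
    for j in range(P.shape[1]):
        # shared binning so both histograms are comparable probability vectors
        lo = min(P[:, j].min(), Q[:, j].min())
        hi = max(P[:, j].max(), Q[:, j].max())
        p_hist, edges = np.histogram(P[:, j], bins=bins, range=(lo, hi))
        q_hist, _ = np.histogram(Q[:, j], bins=edges)
        distances.append(jensenshannon(p_hist, q_hist, base=2))
    return np.mean(distances)   # a plain average over features, no dependencies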
What about a dimensionality reduction technique to remove nonlinear dependence between features? A technique that estimates the percentage of the variance "pv" that each lower space dimension explains (as PCA does for linear dependence), could facilitate calculation of the weighted mean of the JS distances on the low dimensions, where the weight will be the "pv" of each low dimension.
I'm trying to fit a Gaussian to this set of data:
It is a 2D matrix with values (probability distribution). If I plot it in 3D it looks like:
As far as I understood from this other question (https://mathematica.stackexchange.com/questions/27642/fitting-a-two-dimensional-gaussian-to-a-set-of-2d-pixels) I need to compute the mean and the covariance matrix of my data and the Gaussian that I need will be exactly the one defined by that mean and covariance matrix.
However, I cannot properly understand the code from that other question (it is Mathematica code), and I am pretty stuck with the statistics.
How would I compute in Python (Numpy, PyTorch...) the mean and the covariance matrix of the Gaussian?
I'm trying to avoid all these optimization frameworks (LSQ, KDE) as I think that the solution is much simpler and the computational cost is something that I have to take into account...
Thanks!
Let's call your data matrix D with shape d x n where d is the data dimension and n is the number of samples. I will assume that in your example, d=5 and n=6, although you will need to determine for yourself which is the data dimension and which is the sample dimension. In that case, we can find the mean and covariance using the following code:
import numpy as np

n = 6   # number of samples
d = 5   # data dimension
D = np.random.random([d, n])   # placeholder data; substitute your own d x n matrix

mean = D.mean(axis=1)    # per-dimension mean, shape (d,)
covariance = np.cov(D)   # d x d covariance matrix (np.cov treats rows as variables)
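As a possible follow-up (not part of the original answer), a minimal sketch of turning that mean and covariance into an actual Gaussian density with scipy:
from scipy.stats import multivariate_normal

# the fitted Gaussian is fully defined by the estimated mean and covariance
gaussian = multivariate_normal(mean=mean, cov=covariance)

# evaluate the density at one of the data points (column 0 of D)
print(gaussian.pdf(D[:, 0]))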
I'm doing nonparametric regression, and I need a function to expand the design matrix X into a basis matrix. Is there a package that can do this?
For example, if X is 200*10 (200 observations and 10 features), using a B-spline basis expansion with 5 basis functions per feature will yield a 200*50 basis matrix.
I tried scipy.interpolate.BSpline, but it seems it does not return the basis matrix.
The patsy library in Python is useful.
from patsy import dmatrix

# B-spline basis expansion of the 1D vector x;
# df sets the number of basis functions, degree the spline degree
transformed_x = dmatrix(
    "bs(x, df=df, degree=degree, include_intercept=False)",
    {"x": x, "df": df, "degree": degree}, return_type='matrix')
This will return the basis expansion for a vector x. If you have a data matrix, do this for each column and stack the results, as sketched below.
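A rough sketch of that column-by-column expansion (my own illustration, not from the original answer; the placeholder X stands in for the 200*10 design matrix from the question, and degree=3 is an assumed cubic spline):
import numpy as np
from patsy import dmatrix

X = np.random.random((200, 10))   # placeholder for the 200 x 10 design matrix
df, degree = 5, 3                 # 5 basis functions per feature, cubic splines

basis_blocks = []
for j in range(X.shape[1]):
    block = dmatrix(
        "bs(x, df=df, degree=degree, include_intercept=False)",
        {"x": X[:, j], "df": df, "degree": degree}, return_type='matrix')
    # drop the intercept column that dmatrix prepends, keeping the df spline columns
    basis_blocks.append(np.asarray(block)[:, 1:])

basis = np.hstack(basis_blocks)   # shape (200, 10 * 5) = (200, 50)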
I want to apply sklearn clustering algorithms to my graphs, but they don't accept networkx input in .gexf format. What kind of library/transformations do I need to make my .gexf graphs suitable for sklearn?
Clustering algorithms accept distance matrices, affinity matrices, or feature matrices. For example, k-means accepts a feature matrix (say X, with n points in m dimensions) and applies the Euclidean distance metric, while affinity propagation accepts either an affinity matrix (a square n x n matrix) or a feature matrix (depending on its affinity parameter).
If you want to apply an sklearn (or other non-graph) clustering algorithm, you can extract adjacency matrices from your networkx graphs.
A = nx.to_scipy_sparse_matrix(G)   # adjacency matrix of graph G (use nx.to_scipy_sparse_array on networkx >= 3.0)
I guess you should make sure the diagonal is 1; if it is not, do numpy.fill_diagonal(A, 1) (after converting A to a dense array with A.toarray()).
This then leaves only applying the clustering algorithm:
from sklearn.cluster import AffinityPropagation

ap = AffinityPropagation(affinity='precomputed').fit(A)   # A is treated as the precomputed affinity matrix
print(ap.labels_)                                         # cluster label assigned to each node
You can also convert your adjacency matrix to a distance matrix if you want to apply other algorithms or even project your adjacency/distance matrix to a feature matrix.
Going through all of this here would take us too far. But as for getting the distance matrix: if you have binary edges, you can do D = 1 - A; if you have weighted edges, you can do D = A.max() - A.
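As an illustration only (not part of the original answer), a minimal sketch of feeding such a distance matrix into another sklearn algorithm that supports precomputed distances, e.g. DBSCAN; the eps and min_samples values are arbitrary placeholders:
import numpy as np
from sklearn.cluster import DBSCAN

A_dense = A.toarray()      # dense adjacency matrix (binary edges assumed here)
D = 1 - A_dense            # distance matrix: connected nodes are "close"
np.fill_diagonal(D, 0)     # a node is at distance 0 from itself

db = DBSCAN(eps=0.5, min_samples=2, metric='precomputed').fit(D)
print(db.labels_)          # -1 marks noise points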
Assume 3 matrices, Mean, Variance, and Sample, all with the same shape. Is there a one-line solution in numpy to generate the Sample matrix such that:
Sample[i,j] is drawn from NormalDistribution(Mean[i,j], Variance[i,j])
Using linearity of the mean and Var(aX + b) = a**2 * Var(X):
Generate a standardized (mean 0, standard deviation 1) 2D array with np.random.randn(), multiply it pointwise by the standard deviation (np.sqrt(Variance)), and add the Mean (still pointwise).
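A one-line sketch of that recipe (the placeholder Mean and Variance arrays are only there to make it runnable):
import numpy as np

Mean = np.zeros((3, 4))       # placeholder Mean matrix
Variance = np.ones((3, 4))    # placeholder Variance matrix (entries must be non-negative)

# each entry Sample[i, j] ~ Normal(Mean[i, j], Variance[i, j])
Sample = Mean + np.sqrt(Variance) * np.random.randn(*Mean.shape)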