Python-Generating numbers according to a corellation matrix

Python-Generating numbers according to a corellation matrix - python

Hi, I am trying to generate correlated data as close to the first table as possible (first three rows shown out of a total of 13). The correlation matrix for the relevant columns is also shown (corr_total).
I am trying the following code, which shows the error:
"LinAlgError: 4-th leading minor not positive definite"
from scipy.linalg import cholesky
# Correlation matrix
# Compute the (upper) Cholesky decomposition matrix
upper_chol = cholesky(corr_total)
# What should be here? The mu and sigma of one row of a table?
rnd = np.random.normal(2.57, 0.78, size=(10,7))
# Finally, compute the inner product of upper_chol and rnd
ans = rnd # upper_chol
My question is what goes into the values of The mu and sigma, and how to resolve the error shown above.
Thanks!
P.S I have edited the question to show the original table. It shows data for four patients. I basically want to make synthetic data for more cases, that replicates the patterns found in these patients

Thank you for answering my question about when data you have access to. The error that you received was generated when you called cholesky. cholesky requires that your matrix be positive semidefinite. One way to check if a matrix is semi-positive definite is to see if all of its eigenvalues are greater than zero. One of the eigenvalues of your correlation/covarance matrix is nearly zero. I think that cholesky is just being fussy. Use can use scipy.linalg.sqrtm as an alternate decomposition.
For your question on the generation of multivariate normals, the random normal that you generate should be a standard random normal, i.e. a mean of 0 and a width of 1. Numpy provides a standard random normal generator with np.random.randn.
To generate a multivariate normal, you should also take the decomposition of the covariance, not the correlation matrix. The following will generate a multivariate normal using an affine transformation, as in your question.
from scipy.linalg import cholesky, sqrtm
relavant_columns = ['Affecting homelife',
'Affecting mobility',
'Affecting social life/hobbies',
'Affecting work',
'Mood',
'Pain Score',
'Range of motion in Doc']
# df is a pandas dataframe containing the data frame from figure 1
mu = df[relavant_columns].mean().values
cov = df[relavant_columns].cov().values
number_of_sample = 10
# generate using affine transformation
#c2 = cholesky(cov).T
c2 = sqrtm(cov).T
s = np.matmul(c2, np.random.randn(c2.shape[0], number_of_sample)) + mu.reshape(-1, 1)
# transpose so each row is a sample
s = s.T
Numpy also has a built-in function which can generate multivariate normals directly
s = np.random.multivariate_normal(mu, cov, size=number_of_sample)

Related

Fitting 2D Gaussian to a 2D matrix of values

I'm trying to fit a gaussian to this set of data:
It is a 2D matrix with values (probability distribution). If I plot it in 3D it looks like:
As far as I understood from this other question (https://mathematica.stackexchange.com/questions/27642/fitting-a-two-dimensional-gaussian-to-a-set-of-2d-pixels) I need to compute the mean and the covariance matrix of my data and the Gaussian that I need will be exactly the one defined by that mean and covariance matrix.
However, I can not properly understand the code of that other question (as it is from Mathematica) and I am pretty stuck with statistics.
How would I compute in Python (Numpy, PyTorch...) the mean and the covariance matrix of the Gaussian?
I'm trying to avoid all these optimization frameworks (LSQ, KDE) as I think that the solution is much simpler and the computational cost is something that I have to take into account...
Thanks!

Let's call your data matrix D with shape d x n where d is the data dimension and n is the number of samples. I will assume that in your example, d=5 and n=6, although you will need to determine for yourself which is the data dimension and which is the sample dimension. In that case, we can find the mean and covariance using the following code:
import numpy as np
n = 6
d = 5
D = np.random.random([d, n])
mean = D.mean(axis=1)
covariance = np.cov(D)

Vector and RMS averaging in FFT

I have a data array on which I have performed an FFT. This is the code that I have applied.
import numpy as np
# "data" is a column vector on which FFT needs to be performed
# N = No. of points in "data"
# dt = time interval between two corresponding data points
FFT_data = np.fft.fft(data) # Complex values
FFT_data_real = 2/N*abs(FFT_data) # Absolute values
However, I went through following link: https://www.dsprelated.com/showarticle/1159.php
Here it says, to enhance the SNR we can apply "RMS-averaged FFT" and "Vector Averaged FFT".
Can somebody please let me know how to we go about doing these two methodologies in Python or is there any documentation/links to which we can refer ?

As your reference indicates:
If you take the square root of the average of the squares of your sample spectra, you are doing RMS Averaging. Another alternative is Vector Averaging in which you average the real and complex components separately.
Obviously to be able to perform either averaging you'd need to have more than a single data set to average. In your example code, you have a single column vector data. Let's assume you have multiple such column vectors arranged as a 2D NxM matrix, where N is the number of points per dataset and M is the number of datasets. Since the datasets are stored in columns, when computing the FFT you will need to specify the parameter axis=0 to compute the FFT along columns.
RMS-averaged FFT
As the name suggests, for this method you need to take the square-root of the mean of the squared amplitudes. Since the different sets are stored in columns, you'd need to do the average along the axis 1 (the other axis than the one used for the FFT).
FFT_data = np.fft.fft(data, axis=0) # Complex values
FFT_data_real = 2/N*abs(FFT_data) # Absolute values
rms_averaged = np.sqrt(np.mean(FFT_data_real**2, axis=1))
Vector Averaged FFT
In this case you need to obtain the real and imaginary components of the FFT data, then compute the average on each separately:
FFT_data = np.fft.fft(data, axis=0) # Complex values
real_part_avg = 2/N*np.mean(np.real(FFT_data),axis=1)
imag_part_avg = 2/N*np.mean(np.imag(FFT_data),axis=1)
vector_averaged = np.abs(real_part_avg+1j*imag_part_avg)
Note that I've kept the 2/N scaling you had for the absolute values.
But what can I do if I really only have one dataset?
If that dataset happens to be stationary and sufficiently large then you could break down your dataset into smaller blocks. This can be done by reshaping your vector into an NxM matrix with the following:
data = data.reshape(N,M)
...
Then you could perform the averaging with either method.

Getting variance values for random samples generated from a standard normal distribution using numpy

I have a function that gives me probability distributions for each class, in terms of a matrix corresponding to mean values and another matrix corresponding to variance values. For example, if I had four classes then I would have the following outputs:
y_means = [1,2,3,4]
y_variance = [0.01,0.02,0.03,0.04]
I need to do the following calculation to the mean values to continue with the rest of my program:
y_means = np.array(y_means)
y_means = np.reshape(y_means,(y_means.size,1))
A = np.random.randn(10,y_means.size)
y_means = np.matmul(A,y_means)
Here, I have used the numpy.random.randn function to generate random samples from a standard normal distribution, and then multiply this with the matrix with the mean value to obtain a new output matrix. The dimension of the output matrix would then be of the size (10 x 1).
I need to do a similar calculation such that my output_variances will also be a (10 x 1) matrix. But it is not meaningful to multiply the variances in the same way with random samples from a standard normal distribution, because this would result in negative values as well. This is undesirable because my ultimate aim would be to create a normal distribution with these mean values and their corresponding variance values using:
torch.distributions.normal.Normal(loc=y_means, scale=y_variance)
So my question is if there is any method by which I get a variance value for each random sample generated by numpy.random.randn? Because then the multplication of such a matrix would make more sense with output_variance.
Or if there is any other strategy for this that I might be unaware of, please let me know.

The problem mentioned in the question required another matrix of the same dimension as A that corresponded to a variance measure for the random samples present in A.
Taking a row-wise or column-wise variance of the matrix denoted by A using numpy.var() didn't give a similar 10 x 4 matrix to multiply with y_variance.
I had solved the above problem by using the following approach:
First create a matrix with the same dimensions as A with zero entries, using the following line of code:
A_var = np.zeros_like(A)
then, using torch.distributions, create normal distributions with the values in A as the mean and zeroes as variance:
dist_A = torch.distributions.normal.Normal(loc=torch.Tensor(A), scale=torch.Tensor(A_var))
https://pytorch.org/docs/stable/distributions.html lists all the operations possible on Normal distributions in PyTorch. The sample() method can generate samples from a given distribution for any size. This property was exploited to first generate a sample matrix of size 10 X 10 x 4 and then calculating the variance along axis 0.
np.var(np.array(dist2.sample((10,))),axis=0)
This would result in a variance matrix of size 10 x 4, which can be used for calculations with y_variance.

Calculation of Mahalanobis distance doesn't work when using scipy's cdist function

I want to calculate the mahalanobis distance between every row of a matrix and a single row vector. Originally, I used np.tile to create a matrix that contains multiple copies of the vector to create an equal shaped matrix and then used sklearn.metrics.pairwise.paired_distances to calculate the distances row-wise.
I wanted to know if there's a more readable alternative to this, that is why I asked this question a few weeks ago. User Divakar came up with the idea to use cdist(a,b).ravel(). I found this alternative more readable and therefore used it. I tested out both functions with metric='euclidean' and found that they produce the same result. However, when I switch the metric to metric='mahalanobis' (which I want to use now), the first method seems to work but the cdist alternative throws the following error:
ValueError: The number of observations (3) is too small; the
covariance matrix is singular. For observations with 3 dimensions, at
least 4 observations are required.
Can someone explain what is going on here? Why do the functions give the same results when using euclidean distance but not when using Mahalanobis distance?
Here's my code:
import numpy as np
from sklearn.metrics.pairwise import paired_distances
from scipy.spatial.distance import cdist
# get matrix a and vector b
a = np.array([[1,2,3],[4,5,6]])
b = np.array([[7],[8],[9]]).transpose()
# define metric
METRIC = 'mahalanobis'
# define distance function that computes the distance between each row of a matrix
# and another matrix that contains duplicates of the single input vector.
def vector_to_matrix_distance(matrix_array,vector_array,metric):
vector_array_tiled = np.tile(vector_array,(matrix_array.shape[0],1))
return paired_distances(matrix_array,vector_array_tiled)
# run distance function
distances_1 = vector_to_matrix_distance(matrix_array=a,
vector_array=b,
metric=METRIC)
# Use cdist to do the SAME calculation.This works when METRIC = 'euclidean' but
# not when METRIC = 'mahalanobis'
distances_2 = cdist(a,b,metric=METRIC).ravel()

scikit-learn: Finding the features that contribute to each KMeans cluster

Say you have 10 features you are using to create 3 clusters. Is there a way to see the level of contribution each of the features have for each of the clusters?
What I want to be able to say is that for cluster k1, features 1,4,6 were the primary features where as cluster k2's primary features were 2,5,7.
This is the basic setup of what I am using:
k_means = KMeans(init='k-means++', n_clusters=3, n_init=10)
k_means.fit(data_features)
k_means_labels = k_means.labels_

You can use
Principle Component Analysis (PCA)
PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or singular value decomposition of a data matrix, usually after mean centering (and normalizing or using Z-scores) the data matrix for each attribute. The results of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed variable values corresponding to a particular data point), and loadings (the weight by which each standardized original variable should be multiplied to get the component score).
Some essential points:
the eigenvalues reflect the portion of variance explained by the corresponding component. Say, we have 4 features with eigenvalues 1, 4, 1, 2. These are the variances explained by the corresp. vectors. The second value belongs to the first principle component as it explains 50 % off the overall variance and the last value belongs to the second principle component explaining 25 % of the overall variance.
the eigenvectors are the component's linear combinations. The give the weights for the features so that you can know, which feature as high/low impact.
use PCA based on correlation matrix instead of empiric covariance matrix, if the eigenvalues strongly differ (magnitudes).
Sample approach
do PCA on entire dataset (that's what the function below does)
take matrix with observations and features
center it to its average (average of feature values among all observations)
compute empiric covariance matrix (e.g. np.cov) or correlation (see above)
perform decomposition
sort eigenvalues and eigenvectors by eigenvalues to get components with highest impact
use components on original data
examine the clusters in the transformed dataset. By checking their location on each component you can derive the features with high and low impact on distribution/variance
Sample function
You need to import numpy as np and scipy as sp. It uses sp.linalg.eigh for decomposition. You might want to check also the scikit decomposition module.
PCA is performed on a data matrix with observations (objects) in rows and features in columns.
def dim_red_pca(X, d=0, corr=False):
r"""
Performs principal component analysis.
Parameters
----------
X : array, (n, d)
Original observations (n observations, d features)
d : int
Number of principal components (default is ``0`` => all components).
corr : bool
If true, the PCA is performed based on the correlation matrix.
Notes
-----
Always all eigenvalues and eigenvectors are returned,
independently of the desired number of components ``d``.
Returns
-------
Xred : array, (n, m or d)
Reduced data matrix
e_values : array, (m)
The eigenvalues, sorted in descending manner.
e_vectors : array, (n, m)
The eigenvectors, sorted corresponding to eigenvalues.
"""
# Center to average
X_ = X-X.mean(0)
# Compute correlation / covarianz matrix
if corr:
CO = np.corrcoef(X_.T)
else:
CO = np.cov(X_.T)
# Compute eigenvalues and eigenvectors
e_values, e_vectors = sp.linalg.eigh(CO)
# Sort the eigenvalues and the eigenvectors descending
idx = np.argsort(e_values)[::-1]
e_vectors = e_vectors[:, idx]
e_values = e_values[idx]
# Get the number of desired dimensions
d_e_vecs = e_vectors
if d > 0:
d_e_vecs = e_vectors[:, :d]
else:
d = None
# Map principal components to original data
LIN = np.dot(d_e_vecs, np.dot(d_e_vecs.T, X_.T)).T
return LIN[:, :d], e_values, e_vectors
Sample usage
Here's a sample script, which makes use of the given function and uses scipy.cluster.vq.kmeans2 for clustering. Note that the results vary with each run. This is due to the starting clusters a initialized randomly.
import numpy as np
import scipy as sp
from scipy.cluster.vq import kmeans2
import matplotlib.pyplot as plt
SN = np.array([ [1.325, 1.000, 1.825, 1.750],
[2.000, 1.250, 2.675, 1.750],
[3.000, 3.250, 3.000, 2.750],
[1.075, 2.000, 1.675, 1.000],
[3.425, 2.000, 3.250, 2.750],
[1.900, 2.000, 2.400, 2.750],
[3.325, 2.500, 3.000, 2.000],
[3.000, 2.750, 3.075, 2.250],
[2.075, 1.250, 2.000, 2.250],
[2.500, 3.250, 3.075, 2.250],
[1.675, 2.500, 2.675, 1.250],
[2.075, 1.750, 1.900, 1.500],
[1.750, 2.000, 1.150, 1.250],
[2.500, 2.250, 2.425, 2.500],
[1.675, 2.750, 2.000, 1.250],
[3.675, 3.000, 3.325, 2.500],
[1.250, 1.500, 1.150, 1.000]], dtype=float)
clust,labels_ = kmeans2(SN,3) # cluster with 3 random initial clusters
# PCA on orig. dataset
# Xred will have only 2 columns, the first two princ. comps.
# evals has shape (4,) and evecs (4,4). We need all eigenvalues
# to determine the portion of variance
Xred, evals, evecs = dim_red_pca(SN,2)
xlab = '1. PC - ExpVar = {:.2f} %'.format(evals[0]/sum(evals)*100) # determine variance portion
ylab = '2. PC - ExpVar = {:.2f} %'.format(evals[1]/sum(evals)*100)
# plot the clusters, each set separately
plt.figure()
ax = plt.gca()
scatterHs = []
clr = ['r', 'b', 'k']
for cluster in set(labels_):
scatterHs.append(ax.scatter(Xred[labels_ == cluster, 0], Xred[labels_ == cluster, 1],
color=clr[cluster], label='Cluster {}'.format(cluster)))
plt.legend(handles=scatterHs,loc=4)
plt.setp(ax, title='First and Second Principle Components', xlabel=xlab, ylabel=ylab)
# plot also the eigenvectors for deriving the influence of each feature
fig, ax = plt.subplots(2,1)
ax[0].bar([1, 2, 3, 4],evecs[0])
plt.setp(ax[0], title="First and Second Component's Eigenvectors ", ylabel='Weight')
ax[1].bar([1, 2, 3, 4],evecs[1])
plt.setp(ax[1], xlabel='Features', ylabel='Weight')
Output
The eigenvectors show the weighting of each feature for the component
Short Interpretation
Let's just have a look at cluster zero, the red one. We'll be mostly interested in the first component as it explains about 3/4 of the distribution. The red cluster is in the upper area of the first component. All observations yield rather high values. What does it mean? Now looking at the linear combination of the first component we see on first sight, that the second feature is rather unimportant (for this component). The first and fourth feature are the highest weighted and the third one has a negative score. This means, that - as all red vertices have a rather high score on the first PC - these vertices will have high values in the first and last feature, while at the same time they have low scores concerning the third feature.
Concerning the second feature we can have a look at the second PC. However, note that the overall impact is far smaller as this component explains only roughly 16 % of the variance compared to the ~74 % of the first PC.

You can do it this way:
>>> import numpy as np
>>> import sklearn.cluster as cl
>>> data = np.array([99,1,2,103,44,63,56,110,89,7,12,37])
>>> k_means = cl.KMeans(init='k-means++', n_clusters=3, n_init=10)
>>> k_means.fit(data[:,np.newaxis]) # [:,np.newaxis] converts data from 1D to 2D
>>> k_means_labels = k_means.labels_
>>> k1,k2,k3 = [data[np.where(k_means_labels==i)] for i in range(3)] # range(3) because 3 clusters
>>> k1
array([44, 63, 56, 37])
>>> k2
array([ 99, 103, 110, 89])
>>> k3
array([ 1, 2, 7, 12])

Try this,
estimator=KMeans()
estimator.fit(X)
res=estimator.__dict__
print res['cluster_centers_']
You will get matrix of cluster and feature_weights, from that you can conclude, the feature having more weight takes major part to contribute cluster.

I assume that by saying "a primary feature" you mean - had the biggest impact on the class. A nice exploration you can do is look at the coordinates of the cluster centers . For example, plot for each feature it's coordinate in each of the K centers.
Of course that any features that are on large scale will have much larger effect on the distance between the observations, so make sure your data is well scaled before performing any analysis.

a method I came up with is calculating the standard deviation of each feature in relation to the distribution - basically how is the data is spread across each feature
the lesser the spread, the better the feature of each cluster basically:
1 - (std(x) / (max(x) - min(x))
I wrote an article and a class to maintain it
https://github.com/GuyLou/python-stuff/blob/main/pluster.py
https://medium.com/#guylouzon/creating-clustering-feature-importance-c97ba8133c37

It might be difficult to talk about feature importance separately for each cluster. Rather, it could be better to talk globally about which features are most important for separating different clusters.
For this goal, a very simple method is described as follow. Note that the Euclidean distance between two cluster centers is a sum of square difference between individual features. We can then just use the square difference as the weight for each feature.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python-Generating numbers according to a corellation matrix - python

Related

Fitting 2D Gaussian to a 2D matrix of values

Vector and RMS averaging in FFT

Getting variance values for random samples generated from a standard normal distribution using numpy

Calculation of Mahalanobis distance doesn't work when using scipy's cdist function

scikit-learn: Finding the features that contribute to each KMeans cluster

Categories

Resources