scipy - generate random variables with correlations - python

I'm working to implement a basic Monte Carlo simulator in Python for some project management risk modeling I'm trying to do (basically Crystal Ball / @RISK, but in Python).
I have a set of n random variables (all scipy.stats instances). I know that I can use rv.rvs(size=k) to generate k independent observations from each of these n variables.
I'd like to introduce correlations among the variables by specifying an n x n positive semi-definite correlation matrix.
Is there a clean way to do this in scipy?
What I've Tried
This answer and this answer seem to indicate that "copulas" would be an answer, but I don't see any reference in scipy to them.
This link seems to implement what I'm looking for, but I'm not sure if scipy has this functionality implemented already. I'd also like it to work for non-normal variables.
It seems that the Iman-Conover method is the standard approach.

If you just want correlation through a Gaussian Copula (*), then it can be calculated in a few steps with numpy and scipy.
1. Create multivariate normal random variables with the desired covariance using numpy.random.multivariate_normal, giving a (nobs by k_variables) array.
2. Apply scipy.stats.norm.cdf to each column/variable to transform the normal samples into uniform marginal distributions.
3. Apply dist.ppf to transform the uniform margins to the desired distribution, where dist can be any of the distributions in scipy.stats (a sketch of these steps follows at the end of this answer).
(*) Gaussian copula is only one choice and it is not the best when we are interested in tail behavior, but it is the easiest to work with
for example http://archive.wired.com/techbiz/it/magazine/17-03/wp_quant?currentPage=all
Two references:
https://stats.stackexchange.com/questions/37424/how-to-simulate-from-a-gaussian-copula
http://www.mathworks.com/products/demos/statistics/copulademo.html
(I might have done this a while ago in python, but don't have any scripts or function right now.)
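A minimal sketch of those three steps (my own illustration, not the original author's script; the Beta and Gamma marginals and the 0.7 correlation are arbitrary choices):
import numpy as np
from scipy import stats

nobs = 10000
corr = np.array([[1.0, 0.7],
                 [0.7, 1.0]])

# 1. correlated multivariate normal draws, shape (nobs, k_variables)
z = np.random.multivariate_normal(np.zeros(2), corr, size=nobs)

# 2. transform each column to uniform marginals with the normal cdf
u = stats.norm.cdf(z)

# 3. transform the uniform margins to the desired scipy.stats distributions via ppf
x0 = stats.beta(a=2, b=5).ppf(u[:, 0])
x1 = stats.gamma(a=3, scale=2).ppf(u[:, 1])

print(np.corrcoef(x0, x1)[0, 1])  # induced correlation (close to, but not exactly, the 0.7 used for the normals)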

It seems like a rejection-based sampling method such as the Metropolis-Hastings algorithm is what you want. Scipy can implement such methods with its scipy.optimize.basinhopping function.
Rejection-based sampling methods allow you to draw samples from any given probability distribution. The idea is that you draw random samples from another "proposal" pdf that is easy to sample from (such as uniform or gaussian distributions) and then use a random test to decide if this sample from the proposal distribution should be "accepted" as representing a sample of the desired distribution.
The remaining tricks will then be:
Figure out the form of the joint N-dimensional probability density function which has marginals of the form you want along each dimension, but with the correlation matrix that you want. This is easy to do for the Gaussian distribution, where the desired correlation matrix and mean vector is all you need to define the distribution. If your marginals have a simple expression, you can probably find this pdf with some straightforward-but-tedious algebra. This paper cites several others which do what you are talking about, and I'm certain that there are many more.
Formulate a function for basinhopping to minimize such that its accepted "minima" amount to samples of the pdf you have defined.
Given the results of (1), (2) should be straightforward.
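For illustration of the accept/reject idea described above, here is a bare-bones Metropolis-Hastings sketch for a two-dimensional target (a correlated bivariate normal, chosen only because its density is easy to write down); in practice, the target here would be replaced by the joint pdf you derive in step (1):
import numpy as np
import scipy.stats

# Illustrative target: a correlated bivariate normal (replace with the joint pdf from step 1)
target = scipy.stats.multivariate_normal(mean=[0, 0], cov=[[1, 0.8], [0.8, 1]])

def metropolis_hastings(logpdf, n_samples, x0, proposal_scale=0.5):
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_samples):
        # propose a move from a symmetric Gaussian proposal
        x_new = x + proposal_scale * np.random.randn(len(x))
        # accept with probability min(1, p(x_new) / p(x))
        if np.log(np.random.rand()) < logpdf(x_new) - logpdf(x):
            x = x_new
        samples.append(x.copy())
    return np.array(samples)

samples = metropolis_hastings(target.logpdf, 5000, x0=[0.0, 0.0])
print(np.corrcoef(samples.T))  # roughly the target correlation after burn-in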

If you already have a positive semi-definite correlation matrix R (n x n), it's easy to build a NormalCopula taking R as input. Here is an example with n = 3, based on the OpenTURNS library.
import openturns as ot
# you can replace this part by your matrix
dim = 3
R = ot.CorrelationMatrix(dim)
R[0,1] = 0.25
R[0,2] = 0.6
R[1,2] = 0.9
copula = ot.NormalCopula(R)
To draw a sample of a given size, just write:
size = 5
print(copula.getSample(size))
>>> [ X0 X1 X2 ]
0 : [ 0.355353 0.76205 0.632379 ]
1 : [ 0.902567 0.984443 0.989552 ]
2 : [ 0.423219 0.811016 0.754304 ]
3 : [ 0.303776 0.471557 0.450188 ]
4 : [ 0.746168 0.918729 0.891347 ]
EDIT - Following the comment of @Michael_Baudin
Of course, if you want to set the marginal distributions, e.g. to LogNormal, Beta and Uniform marginals, it's also possible:
X0 = ot.LogNormal(0.1, 1, 0)
X1 = ot.Beta()
X2 = ot.Uniform(1.0, 2.0)
distribution = ot.ComposedDistribution([X0, X1, X2], copula)
print(distribution.getSample(size))
>>> [ X0 X1 X2 ]
0 : [ 3.97678 0.158823 1.75635 ]
1 : [ 1.18929 -0.554092 1.18952 ]
2 : [ 2.59542 0.0751359 1.68599 ]
3 : [ 1.33363 -0.18407 1.42241 ]
4 : [ 1.34084 0.198019 1.6553 ]

import typing
import numpy as np
import scipy.stats
def run_gaussian_copula_simulation_and_get_samples(
    ppfs: typing.List[typing.Callable[[np.ndarray], np.ndarray]],  # list of $num_dims percentile point functions
    cov_matrix: np.ndarray,  # covariance matrix, shape ($num_dims, $num_dims)
    num_samples: int,  # number of random samples to draw
) -> np.ndarray:
    num_dims = len(ppfs)
    # Draw random samples from the multivariate normal distribution -> shape ($num_samples, $num_dims)
    ran = np.random.multivariate_normal(np.zeros(num_dims), cov_matrix, (num_samples,), check_valid="raise")
    # Transform back into a uniform distribution, i.e. the space [0, 1]^$num_dims
    U = scipy.stats.norm.cdf(ran)
    # Apply each ppf to transform its column into the desired distribution
    # Each row of the returned array represents one random sample -> access with a[i]
    return np.array([ppfs[i](U[:, i]) for i in range(num_dims)]).T  # shape ($num_samples, $num_dims)
# Example 1. Uncorrelated data, i.e. both distributions are independent
f1 = run_gaussian_copula_simulation_and_get_samples(
    [lambda x: scipy.stats.norm.ppf(x, loc=100, scale=15), scipy.stats.norm.ppf],
    [[1, 0], [0, 1]],
    6,
)
# Example 2. Completely correlated data, i.e. both percentiles match
f2 = run_gaussian_copula_simulation_and_get_samples(
    [lambda x: scipy.stats.norm.ppf(x, loc=100, scale=15), scipy.stats.norm.ppf],
    [[1, 1], [1, 1]],
    6,
)
np.set_printoptions(suppress=True)  # suppress scientific notation
print(f1)
print(f2)
A few notes on this function. np.random.multivariate_normal does a lot of the heavy lifting for us; note in particular that we do not need to decompose the correlation matrix ourselves.
ppfs is passed as a list of functions, each of which takes one array argument and returns one array.
In my particular use case I needed to generate multivariate-t-distributed random variables (in addition to normally distributed ones); consult this answer on how to do that: https://stackoverflow.com/a/41967819/2111778. Additionally, I used scipy.stats.t.cdf for the back-transform step (a sketch of this variant follows below).
In my particular use case the desired distributions were empirical distributions representing expected financial losses.
The final data points then had to be added together to get a total financial loss across all
of the individual-but-correlated financial events.
Thus, np.array(...).T is actually replaced by sum(...) in my code base.
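For completeness, a minimal sketch of the t-copula variant mentioned above (my own sketch, not taken from that code base; the degrees-of-freedom value and the normal-over-chi-square construction of the multivariate t are assumptions noted in the comments):
import numpy as np
import scipy.stats

def run_t_copula_simulation(ppfs, cov_matrix, num_samples, df=5):
    # Illustrative helper, not part of the original answer: t-copula variant
    num_dims = len(ppfs)
    # Multivariate t draw: normal draw scaled by sqrt(df / chi2) (standard construction)
    z = np.random.multivariate_normal(np.zeros(num_dims), cov_matrix, (num_samples,))
    chi2 = np.random.chisquare(df, size=(num_samples, 1))
    t_samples = z * np.sqrt(df / chi2)
    # Back-transform to uniforms with the t cdf, then apply the desired ppfs
    U = scipy.stats.t.cdf(t_samples, df)
    return np.array([ppfs[i](U[:, i]) for i in range(num_dims)]).T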

Related

How can I generate samples from a non-normal multivariable distribution in Python?

I have an input dataframe df_input with 10 variables and 100 rows. These data are not normally distributed.
I would like to generate an output dataframe with 10 variables and 10,000 rows, such that the covariance matrix and mean of the new dataframe are the same as those of the original one. The output variables should not be normally distributed, but rather have a distribution similar to the input variables.
That is:
Cov(df_output) = Cov(df_input) and
mean(df_output) = mean(df_input)
Is there a Python function that does it?
Note: np.random.multivariate_normal(mean_input, Cov_input, 10000) does almost this, but the output variables are normally distributed, whereas I need them to have the same (or similar) distribution as the input.
Update
I just noticed your mention of np.random.multivariate_normal... It does in one fell swoop the equivalent of gen_like() below!
I'll leave it here to help people understand the mechanics of this, but to summarize:
you can match the mean and covariance of an empirical distribution with a (rotated, scaled, translated) normal;
for a better match of higher moments, you should look at the copula.
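For example, a minimal one-liner sketch of that equivalence (assuming d0 is the (n_obs, n_features) input array used in the code below):
import numpy as np

# d0: input data array, shape (n_obs, n_features)
d1 = np.random.multivariate_normal(np.mean(d0, axis=0), np.cov(d0.T), size=10000)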
Original answer
Since you are interested in matching only the first two moments (mean, variance), you can use a simple PCA to obtain a suitable model of the initial data. Note that the newly generated data will be a normal ellipsoid, rotated, scaled, and translated to match the empirical mean and covariance of the initial data.
If you want more sophisticated "replication" of the original distribution, then you should look at Copula as I said in the comments.
So, for the first two moments only, assuming your input data is d0:
import numpy as np
from sklearn.decomposition import PCA

def gen_like(d0, n):
    pca = PCA(n_components=d0.shape[1]).fit(d0)
    z0 = pca.transform(d0)  # z0 is centered and uncorrelated (cov is diagonal)
    z1 = np.random.normal(size=(n, d0.shape[1])) * np.std(z0, 0)
    # project back to input space
    d1 = pca.inverse_transform(z1)
    return d1
Example:
# generate some random data
# arbitrary transformation matrix
F = np.array([
    [1, 2, 3],
    [2, 1, 4],
    [5, 1, 3],
])
d0 = np.random.normal(2, 4, size=(10000, 3)) @ F.T
np.mean(d0, 0)
# ex: array([12.12791066, 14.10333273, 17.95212292])
np.cov(d0.T)
# ex: array([[225.09691912, 257.39878551, 259.40288019],
#            [257.39878551, 338.34087242, 373.4773562 ],
#            [259.40288019, 373.4773562 , 566.29288861]])
# try to match mean, variance of d0
d1 = gen_like(d0, 10000)
np.allclose(np.mean(d0, 0), np.mean(d1, 0), rtol=0.1)
# often True (but not guaranteed)
np.allclose(np.cov(d0.T), np.cov(d1.T), rtol=0.1)
# often True (but not guaranteed)
What's funny is that you can fit a square peg into a round hole (i.e., this demonstrates that really only the mean and variance are matched, not the higher moments):
d0 = np.random.uniform(5, 10, size=(1000, 3)) @ F.T
d1 = gen_like(d0, 10000)
np.allclose(np.mean(d0, 0), np.mean(d1, 0), rtol=0.1)
# often True (but not guaranteed)
np.allclose(np.cov(d0.T), np.cov(d1.T), rtol=0.1)
# often True (but not guaranteed)
Have you tried looking at the NumPy docs?
https://numpy.org/doc/stable/reference/random/generated/numpy.random.multivariate_normal.html
Have you considered using a GAN (generative adversarial network)? Takes a bit more effort than just using a predefined function, but essentially it does exactly what you are hoping to do. Here's the original paper: https://arxiv.org/abs/1406.2661
There are many PyTorch/Tensorflow codes that you can download and fit to your purposes, for example this one: https://github.com/eriklindernoren/PyTorch-GAN
Here is also a blog post I found quite helpful with an introduction to GANs. https://medium.com/ai-society/gans-from-scratch-1-a-deep-introduction-with-code-in-pytorch-and-tensorflow-cb03cdcdba0f
Maybe GAN is overkill for this problem and there are simpler methods for upscaling the sample size, in which case I'd be interested to learn about them.
The best method is indeed to use copulas, as suggested by many. A simple description is in the link below, which also provides simple Python code.
The method preserves covariances while augmenting the data, and it generalizes to non-symmetric or non-normal distributions. Thanks to all for helping.
https://sdv.dev/Copulas/tutorials/03_Multivariate_Distributions.html
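For reference, a minimal sketch along the lines of that tutorial, using the copulas package's GaussianMultivariate model (treat the exact API as an assumption and check the linked page):
from copulas.multivariate import GaussianMultivariate

# df_input: the original (100 x 10) dataframe from the question
copula = GaussianMultivariate()
copula.fit(df_input)              # fits marginal distributions plus a Gaussian copula
df_output = copula.sample(10000)  # synthetic rows with similar marginals and correlations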

Python - Generating numbers according to a correlation matrix

Hi, I am trying to generate correlated data as close to the first table as possible (first three rows shown out of a total of 13). The correlation matrix for the relevant columns is also shown (corr_total).
I am trying the following code, which shows the error:
"LinAlgError: 4-th leading minor not positive definite"
from scipy.linalg import cholesky
# Correlation matrix
# Compute the (upper) Cholesky decomposition matrix
upper_chol = cholesky(corr_total)
# What should be here? The mu and sigma of one row of a table?
rnd = np.random.normal(2.57, 0.78, size=(10,7))
# Finally, compute the inner product of upper_chol and rnd
ans = rnd @ upper_chol
My question is what should go into the values of mu and sigma here, and how to resolve the error shown above.
Thanks!
P.S. I have edited the question to show the original table. It shows data for four patients. I basically want to make synthetic data for more cases that replicates the patterns found in these patients.
Thank you for answering my question about what data you have access to. The error that you received was generated when you called cholesky. cholesky requires that your matrix be positive definite. One way to check whether a matrix is positive definite is to see if all of its eigenvalues are greater than zero. One of the eigenvalues of your correlation/covariance matrix is nearly zero, so I think cholesky is just being fussy. You can use scipy.linalg.sqrtm as an alternate decomposition.
For your question on the generation of multivariate normals, the random normal that you generate should be a standard random normal, i.e. with a mean of 0 and a standard deviation of 1. NumPy provides a standard random normal generator with np.random.randn.
To generate a multivariate normal, you should also take the decomposition of the covariance, not the correlation matrix. The following will generate a multivariate normal using an affine transformation, as in your question.
from scipy.linalg import cholesky, sqrtm
relavant_columns = ['Affecting homelife',
'Affecting mobility',
'Affecting social life/hobbies',
'Affecting work',
'Mood',
'Pain Score',
'Range of motion in Doc']
# df is a pandas dataframe containing the data frame from figure 1
mu = df[relavant_columns].mean().values
cov = df[relavant_columns].cov().values
number_of_sample = 10
# generate using affine transformation
#c2 = cholesky(cov).T
c2 = sqrtm(cov).T
s = np.matmul(c2, np.random.randn(c2.shape[0], number_of_sample)) + mu.reshape(-1, 1)
# transpose so each row is a sample
s = s.T
Numpy also has a built-in function which can generate multivariate normals directly
s = np.random.multivariate_normal(mu, cov, size=number_of_sample)

scikit-learn: Finding the features that contribute to each KMeans cluster

Say you have 10 features you are using to create 3 clusters. Is there a way to see the level of contribution each of the features have for each of the clusters?
What I want to be able to say is that for cluster k1, features 1, 4, 6 were the primary features, whereas cluster k2's primary features were 2, 5, 7.
This is the basic setup of what I am using:
k_means = KMeans(init='k-means++', n_clusters=3, n_init=10)
k_means.fit(data_features)
k_means_labels = k_means.labels_
You can use Principal Component Analysis (PCA).
PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or singular value decomposition of a data matrix, usually after mean centering (and normalizing or using Z-scores) the data matrix for each attribute. The results of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed variable values corresponding to a particular data point), and loadings (the weight by which each standardized original variable should be multiplied to get the component score).
Some essential points:
the eigenvalues reflect the portion of variance explained by the corresponding component. Say we have 4 features with eigenvalues 1, 4, 1, 2. These are the variances explained by the corresponding vectors. The second value belongs to the first principal component, as it explains 50 % of the overall variance, and the last value belongs to the second principal component, explaining 25 % of the overall variance.
the eigenvectors are the components' linear combinations. They give the weights for the features, so that you can tell which features have high/low impact.
use PCA based on the correlation matrix instead of the empirical covariance matrix if the eigenvalues differ strongly in magnitude.
Sample approach
do PCA on entire dataset (that's what the function below does)
take matrix with observations and features
center it to its average (average of feature values among all observations)
compute the empirical covariance matrix (e.g. np.cov) or correlation matrix (see above)
perform decomposition
sort eigenvalues and eigenvectors by eigenvalues to get components with highest impact
use components on original data
examine the clusters in the transformed dataset. By checking their location on each component you can derive the features with high and low impact on distribution/variance
Sample function
You need to import numpy as np and scipy as sp. It uses sp.linalg.eigh for decomposition. You might want to check also the scikit decomposition module.
PCA is performed on a data matrix with observations (objects) in rows and features in columns.
def dim_red_pca(X, d=0, corr=False):
    r"""
    Performs principal component analysis.

    Parameters
    ----------
    X : array, (n, d)
        Original observations (n observations, d features)
    d : int
        Number of principal components (default is ``0`` => all components).
    corr : bool
        If true, the PCA is performed based on the correlation matrix.

    Notes
    -----
    Always all eigenvalues and eigenvectors are returned,
    independently of the desired number of components ``d``.

    Returns
    -------
    Xred : array, (n, m or d)
        Reduced data matrix
    e_values : array, (m)
        The eigenvalues, sorted in descending manner.
    e_vectors : array, (n, m)
        The eigenvectors, sorted corresponding to eigenvalues.
    """
    # Center to average
    X_ = X - X.mean(0)
    # Compute correlation / covariance matrix
    if corr:
        CO = np.corrcoef(X_.T)
    else:
        CO = np.cov(X_.T)
    # Compute eigenvalues and eigenvectors
    e_values, e_vectors = sp.linalg.eigh(CO)
    # Sort the eigenvalues and the eigenvectors descending
    idx = np.argsort(e_values)[::-1]
    e_vectors = e_vectors[:, idx]
    e_values = e_values[idx]
    # Get the number of desired dimensions
    d_e_vecs = e_vectors
    if d > 0:
        d_e_vecs = e_vectors[:, :d]
    else:
        d = None
    # Map principal components to original data
    LIN = np.dot(d_e_vecs, np.dot(d_e_vecs.T, X_.T)).T
    return LIN[:, :d], e_values, e_vectors
Sample usage
Here's a sample script which makes use of the given function and uses scipy.cluster.vq.kmeans2 for clustering. Note that the results vary with each run. This is because the starting clusters are initialized randomly.
import numpy as np
import scipy as sp
from scipy.cluster.vq import kmeans2
import matplotlib.pyplot as plt
SN = np.array([ [1.325, 1.000, 1.825, 1.750],
[2.000, 1.250, 2.675, 1.750],
[3.000, 3.250, 3.000, 2.750],
[1.075, 2.000, 1.675, 1.000],
[3.425, 2.000, 3.250, 2.750],
[1.900, 2.000, 2.400, 2.750],
[3.325, 2.500, 3.000, 2.000],
[3.000, 2.750, 3.075, 2.250],
[2.075, 1.250, 2.000, 2.250],
[2.500, 3.250, 3.075, 2.250],
[1.675, 2.500, 2.675, 1.250],
[2.075, 1.750, 1.900, 1.500],
[1.750, 2.000, 1.150, 1.250],
[2.500, 2.250, 2.425, 2.500],
[1.675, 2.750, 2.000, 1.250],
[3.675, 3.000, 3.325, 2.500],
[1.250, 1.500, 1.150, 1.000]], dtype=float)
clust,labels_ = kmeans2(SN,3) # cluster with 3 random initial clusters
# PCA on orig. dataset
# Xred will have only 2 columns, the first two princ. comps.
# evals has shape (4,) and evecs (4,4). We need all eigenvalues
# to determine the portion of variance
Xred, evals, evecs = dim_red_pca(SN,2)
xlab = '1. PC - ExpVar = {:.2f} %'.format(evals[0]/sum(evals)*100) # determine variance portion
ylab = '2. PC - ExpVar = {:.2f} %'.format(evals[1]/sum(evals)*100)
# plot the clusters, each set separately
plt.figure()
ax = plt.gca()
scatterHs = []
clr = ['r', 'b', 'k']
for cluster in set(labels_):
    scatterHs.append(ax.scatter(Xred[labels_ == cluster, 0], Xred[labels_ == cluster, 1],
                                color=clr[cluster], label='Cluster {}'.format(cluster)))
plt.legend(handles=scatterHs,loc=4)
plt.setp(ax, title='First and Second Principle Components', xlabel=xlab, ylabel=ylab)
# plot also the eigenvectors for deriving the influence of each feature
fig, ax = plt.subplots(2,1)
ax[0].bar([1, 2, 3, 4],evecs[0])
plt.setp(ax[0], title="First and Second Component's Eigenvectors ", ylabel='Weight')
ax[1].bar([1, 2, 3, 4],evecs[1])
plt.setp(ax[1], xlabel='Features', ylabel='Weight')
Output (plots not shown)
The eigenvector bar plots show the weighting of each feature for each component.
Short Interpretation
Let's just have a look at cluster zero, the red one. We'll be mostly interested in the first component, as it explains about 3/4 of the distribution. The red cluster is in the upper area of the first component; all of its observations yield rather high values. What does that mean? Looking at the linear combination of the first component, we see at first glance that the second feature is rather unimportant (for this component). The first and fourth features are the highest weighted and the third one has a negative score. This means that - as all red vertices have a rather high score on the first PC - these vertices will have high values in the first and last features, while at the same time they have low values for the third feature.
Concerning the second feature, we can have a look at the second PC. However, note that the overall impact is far smaller, as this component explains only roughly 16 % of the variance compared to the ~74 % of the first PC.
You can do it this way:
>>> import numpy as np
>>> import sklearn.cluster as cl
>>> data = np.array([99,1,2,103,44,63,56,110,89,7,12,37])
>>> k_means = cl.KMeans(init='k-means++', n_clusters=3, n_init=10)
>>> k_means.fit(data[:,np.newaxis]) # [:,np.newaxis] converts data from 1D to 2D
>>> k_means_labels = k_means.labels_
>>> k1,k2,k3 = [data[np.where(k_means_labels==i)] for i in range(3)] # range(3) because 3 clusters
>>> k1
array([44, 63, 56, 37])
>>> k2
array([ 99, 103, 110, 89])
>>> k3
array([ 1, 2, 7, 12])
Try this,
estimator = KMeans()
estimator.fit(X)
res = estimator.__dict__
print(res['cluster_centers_'])
You will get a matrix of cluster centers versus features; from that you can conclude that the features with larger weights in a center contribute more to that cluster.
I assume that by saying "a primary feature" you mean the one that had the biggest impact on the class. A nice exploration you can do is to look at the coordinates of the cluster centers. For example, plot for each feature its coordinate in each of the K centers.
Of course, any features that are on a large scale will have a much larger effect on the distance between the observations, so make sure your data is well scaled before performing any analysis.
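A minimal sketch of that exploration, assuming a fitted KMeans object k_means as in the question:
import numpy as np
import matplotlib.pyplot as plt

centers = k_means.cluster_centers_  # shape (n_clusters, n_features)
for k, center in enumerate(centers):
    plt.plot(np.arange(len(center)), center, marker='o', label='cluster {}'.format(k))
plt.xlabel('feature index')
plt.ylabel('center coordinate')
plt.legend()
plt.show()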
A method I came up with is calculating the standard deviation of each feature in relation to its range - basically, how spread out the data is across each feature.
The smaller the spread, the better the feature characterizes that cluster, basically:
1 - (std(x) / (max(x) - min(x)))
(a sketch of this score follows after the links below)
I wrote an article and a class to maintain it
https://github.com/GuyLou/python-stuff/blob/main/pluster.py
https://medium.com/@guylouzon/creating-clustering-feature-importance-c97ba8133c37
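A minimal sketch of that per-cluster spread score (the function name and details are my own, not taken from the linked code):
import numpy as np

def feature_spread_score(X, labels):
    """Return a (n_clusters, n_features) array of 1 - std/range scores.

    Higher scores mean the feature is tightly concentrated within the cluster.
    """
    scores = []
    for k in np.unique(labels):
        Xk = X[labels == k]
        rng = Xk.max(axis=0) - Xk.min(axis=0)
        rng[rng == 0] = 1.0  # avoid division by zero for constant features
        scores.append(1 - Xk.std(axis=0) / rng)
    return np.array(scores)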
It might be difficult to talk about feature importance separately for each cluster. Rather, it could be better to talk globally about which features are most important for separating different clusters.
For this goal, a very simple method is described as follows. Note that the squared Euclidean distance between two cluster centers is a sum of squared differences between individual features. We can then just use the squared difference as the weight for each feature.
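A minimal sketch of that idea, assuming a fitted KMeans object k_means; summing the squared differences over all pairs of centers is my own aggregation choice:
import numpy as np
from itertools import combinations

centers = k_means.cluster_centers_  # shape (n_clusters, n_features)
# Sum the squared per-feature differences over all pairs of cluster centers
weights = np.zeros(centers.shape[1])
for i, j in combinations(range(len(centers)), 2):
    weights += (centers[i] - centers[j]) ** 2
weights /= weights.sum()  # normalize so the weights sum to 1
print(weights)            # larger weight => feature separates the clusters more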

How to compute the shannon entropy and mutual information of N variables

I need to compute the mutual information, and therefore the Shannon entropy, of N variables.
I wrote code that computes the Shannon entropy of a given distribution.
Let's say that I have a variable x, an array of numbers.
Following the definition of Shannon entropy I need to compute the normalized probability density function, which is easy to get using numpy.histogram.
import scipy.integrate as scint
from numpy import*
from scipy import*
def shannon_entropy(a, bins):
    p, binedg = histogram(a, bins, normed=True)
    p = p/len(p)
    x = binedg[:-1]
    g = -p*log2(p)
    g[isnan(g)] = 0.
    return scint.simps(g, x=x)
Choosing x and the bin number carefully, this function works.
But this function is very dependent on the bin number: choosing different values of this parameter gives different results.
In particular, if my input is an array of constant values:
x = [0, 0, 0, ..., 0, 0, 0]
the entropy of this variable obviously has to be 0. If I choose a bin number equal to 1 I get the right answer, but if I choose different values I get strange, nonsensical (negative) answers. My feeling is that numpy.histogram has the arguments normed=True or density=True which (as said in the official documentation) should give back the histogram normalized, and that I probably make some error at the moment I switch from the probability density function (the output of numpy.histogram) to the probability mass function (the input of the Shannon entropy). I do:
p, binedg = histogram(a, bins, normed=True)
p = p/len(p)
I would like to find a way to solve these problems; I would like an efficient method to compute the Shannon entropy that is independent of the bin number.
I wrote a function to compute the Shannon entropy of a joint distribution of several variables, but I get the same problem.
The code is below; the input of the function shannon_entropydd is an array where each position holds one of the variables involved in the statistical computation:
def intNd(c, axes):
    assert len(c.shape) == len(axes)
    assert all([c.shape[i] == axes[i].shape[0] for i in range(len(axes))])
    if len(axes) == 1:
        return scint.simps(c, axes[0])
    else:
        return intNd(scint.simps(c, axes[-1]), axes[:-1])

def shannon_entropydd(c, bins=30):
    hist, ax = histogramdd(c, bins, normed=True)
    for i in range(len(ax)):
        ax[i] = ax[i][:-1]
    p = -hist*log2(hist)
    p[isnan(p)] = 0
    return intNd(p, ax)
I need these quantities in order to compute the mutual information between certain sets of variables:
M_info(x,y,z) = H(x) + H(y) + H(z) - H(x,y,z)
where H(x) is the Shannon entropy of the variable x.
I have to find a way to compute these quantities, so if someone has a completely different kind of code that works I can switch to it; I don't need to repair this code, but to find a right way to compute these statistical functions!
The result will depend pretty strongly on the estimated density. Can you assume a specific form for the density? You can reduce the dependence of the result on the estimate if you avoid histograms or other general-purpose estimates such as kernel density estimates. If you can give more detail about the variables involved, I can make more specific comments.
I worked with estimates of mutual information as part of the work for my dissertation [1]. There is some stuff about MI in section 8.1 and appendix F.
[1] http://riso.sourceforge.net/docs/dodier-dissertation.pdf
I think that if you choose bins = 1, you will always find an entropy of 0, as there is no "uncertainty" over the possible bin the values are in ("uncertainty" is what entropy measures). You should choose a number of bins "big enough" to account for the diversity of the values that your variable can take. If you have discrete values: for binary values, you should have bins >= 2. If the values your variable can take are in {0, 1, 2}, you should have bins >= 3, and so on...
I must say that I did not read your code, but this works for me:
import numpy as np

x = [0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1]
bins = 10
cx = np.histogram(x, bins)[0]

def entropy(c):
    c_normalized = c / float(np.sum(c))
    c_normalized = c_normalized[np.nonzero(c_normalized)]
    h = -sum(c_normalized * np.log(c_normalized))
    return h

hx = entropy(cx)
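Building on that, a minimal sketch of the multi-information M_info(x,y,z) = H(x) + H(y) + H(z) - H(x,y,z) from the question, computed from histogram counts (natural log, to match entropy() above; the bin number is still a free choice, as discussed):
import numpy as np

def entropy_from_counts(counts):
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def multi_information(x, y, z, bins=10):
    # marginal entropies from 1-D histograms
    hx = entropy_from_counts(np.histogram(x, bins)[0].astype(float))
    hy = entropy_from_counts(np.histogram(y, bins)[0].astype(float))
    hz = entropy_from_counts(np.histogram(z, bins)[0].astype(float))
    # joint entropy from the 3-D histogram
    joint = np.histogramdd(np.column_stack([x, y, z]), bins=bins)[0].astype(float)
    hxyz = entropy_from_counts(joint.ravel())
    return hx + hy + hz - hxyz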

Separate mixture of gaussians in Python

There is a result of some physical experiment, which can be represented as a histogram [i, amount_of(i)]. I suppose that the result can be estimated by a mixture of 4 - 6 Gaussian functions.
Is there a package in Python which takes a histogram as an input and returns the mean and variance of each Gaussian distribution in the mixture?
Original data, for example:
This is a mixture of Gaussians, and can be estimated using an expectation maximization approach (basically, it finds the centers and widths of the distributions at the same time as it estimates how they are mixed together).
This is implemented in the PyMix package. Below I generate an example of a mixture of normals, and use PyMix to fit a mixture model to them, including figuring out what you're interested in, which is the size of subpopulations:
# requires numpy and PyMix (matplotlib is just for making a histogram)
import random
import numpy as np
from matplotlib import pyplot as plt
import mixture
random.seed(010713) # to make it reproducible
# create a mixture of normals:
# 1000 from N(0, 1)
# 2000 from N(6, 2)
mix = np.concatenate([np.random.normal(0, 1, [1000]),
np.random.normal(6, 2, [2000])])
# histogram:
plt.hist(mix, bins=20)
plt.savefig("mixture.pdf")
All the above code does is generate and plot the mixture. It looks like this:
Now to actually use PyMix to figure out what the percentages are:
data = mixture.DataSet()
data.fromArray(mix)
# start them off with something arbitrary (probably based on a guess from the figure)
n1 = mixture.NormalDistribution(-1,1)
n2 = mixture.NormalDistribution(1,1)
m = mixture.MixtureModel(2,[0.5,0.5], [n1,n2])
# perform expectation maximization
m.EM(data, 40, .1)
print m
The output model of this is:
G = 2
p = 1
pi =[ 0.33307859 0.66692141]
compFix = [0, 0]
Component 0:
ProductDist:
Normal: [0.0360178848449, 1.03018725918]
Component 1:
ProductDist:
Normal: [5.86848468319, 2.0158608802]
Notice it found the two normals quite correctly (one N(0, 1) and one N(6, 2), approximately). It also estimated pi, which is the fraction in each of the two distributions (you mention in the comments that's what you're most interested in). We had 1000 in the first distribution and 2000 in the second distribution, and it gets the division almost exactly right: [ 0.33307859 0.66692141]. If you want to get this value directly, do m.pi.
A few notes:
This approach takes a vector of values, not a histogram. It should be easy to convert your data into a 1D vector (that is, turn [(1.4, 2), (2.6, 3)] into [1.4, 1.4, 2.6, 2.6, 2.6])
We had to guess the number of gaussian distributions in advance (it won't figure out a mix of 4 if you ask for a mix of 2).
We had to put in some initial estimates for the distributions. If you make even remotely reasonable guesses it should converge to the correct estimates.
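As a side note, PyMix can be hard to install nowadays; here is a minimal sketch of the same EM fit using scikit-learn's GaussianMixture (my own substitution, not part of the original answer):
import numpy as np
from sklearn.mixture import GaussianMixture

# same mixture as above: 1000 from N(0, 1) and 2000 from N(6, 2)
mix = np.concatenate([np.random.normal(0, 1, 1000),
                      np.random.normal(6, 2, 2000)])

gm = GaussianMixture(n_components=2).fit(mix.reshape(-1, 1))
print(gm.weights_)                       # mixing fractions, analogous to pi
print(gm.means_.ravel())                 # estimated means
print(np.sqrt(gm.covariances_).ravel())  # estimated standard deviations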
