Separate mixture of gaussians in Python - python

I have the result of a physical experiment, which can be represented as a histogram of [i, amount_of(i)] pairs. I suspect the result can be approximated by a mixture of 4-6 Gaussian functions.
Is there a package in Python which takes a histogram as an input and returns the mean and variance of each Gaussian distribution in the mixture distribution?
Original data, for example:

This is a mixture of Gaussians, and can be estimated using an expectation-maximization (EM) approach: it estimates the mean and variance of each component at the same time as it estimates how the components are mixed together.
This is implemented in the PyMix package. Below I generate an example of a mixture of normals, and use PyMix to fit a mixture model to them, including figuring out what you're interested in, which is the size of subpopulations:
# requires numpy and PyMix (matplotlib is just for making a histogram)
import numpy as np
from matplotlib import pyplot as plt
import mixture  # PyMix
np.random.seed(10713)  # seed numpy's RNG so the generated data are reproducible
# create a mixture of normals:
# 1000 from N(0, 1)
# 2000 from N(6, 2)
mix = np.concatenate([np.random.normal(0, 1, [1000]),
                      np.random.normal(6, 2, [2000])])
# histogram:
plt.hist(mix, bins=20)
plt.savefig("mixture.pdf")
All the above code does is generate and plot the mixture. It looks like this:
Now to actually use PyMix to figure out what the percentages are:
data = mixture.DataSet()
data.fromArray(mix)
# start them off with something arbitrary (probably based on a guess from the figure)
n1 = mixture.NormalDistribution(-1,1)
n2 = mixture.NormalDistribution(1,1)
m = mixture.MixtureModel(2,[0.5,0.5], [n1,n2])
# perform expectation maximization
m.EM(data, 40, .1)
print(m)
The output model of this is:
G = 2
p = 1
pi =[ 0.33307859 0.66692141]
compFix = [0, 0]
Component 0:
ProductDist:
Normal: [0.0360178848449, 1.03018725918]
Component 1:
ProductDist:
Normal: [5.86848468319, 2.0158608802]
Notice it found the two normals quite correctly (one N(0, 1) and one N(6, 2), approximately). It also estimated pi, which is the fraction in each of the two distributions (you mention in the comments that's what you're most interested in). We had 1000 in the first distribution and 2000 in the second distribution, and it gets the division almost exactly right: [ 0.33307859 0.66692141]. If you want to get this value directly, do m.pi.
A few notes:
This approach takes a vector of values, not a histogram. It should be easy to convert your data into a 1D vector (that is, turn [(1.4, 2), (2.6, 3)] into [1.4, 1.4, 2.6, 2.6, 2.6]); a short sketch of this conversion follows these notes.
We had to guess the number of gaussian distributions in advance (it won't figure out a mix of 4 if you ask for a mix of 2).
We had to put in some initial estimates for the distributions. If you make even remotely reasonable guesses it should converge to the correct estimates.
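For illustration, here is a minimal sketch of that histogram-to-vector conversion (the bin values and counts are made up, standing in for your real histogram):
import numpy as np
# hypothetical [bin_value, count] pairs standing in for the real histogram
hist = [(1.4, 2), (2.6, 3)]
values = np.array([v for v, _ in hist])
counts = np.array([c for _, c in hist]).astype(int)
mix = np.repeat(values, counts)   # -> array([1.4, 1.4, 2.6, 2.6, 2.6])
print(mix)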

Related

Is there a way to get the probability of a prediction using XGBoostRegressor?

I have built an XGBoostRegressor model using around 200 categorical features to predict a continuous time variable.
But I would want to get both the actual prediction and the probability of that prediction as output. Is there any way to get this from the XGBoostRegressor model?
So I want both the predicted value and P(Y|X) as output. Any idea how to do this?
There is no probability in regression. In regression the only output you get is a predicted value (that is why it is called regression), so for a plain regressor the probability of a prediction is not available; that concept only exists in classification.
As mentioned before, there is no probability associated with regression.
However, you could probably add a confidence interval on that regression, to see whether or not your regression can be trusted.
One thing to note though, is that the variance might not be the same along the data.
Let's assume you study a time-based phenomenon. Specifically, you have the temperature (y) inside an oven after a time (x) (in seconds, for instance). At x = 0 s it is at 20°C, then you start heating it and want to model the evolution in order to predict the temperature after x seconds. The variance could be the same after 20 seconds and after 5 minutes, or completely different. This is called heteroscedasticity.
If you want to use a confidence interval, you probably want to make sure that you took care of heteroscedasticity, so your interval is the same for all the data.
You can try to estimate the distribution of your known outputs, locate the prediction on that curve, and check the p-value. But that would only give you a measure of how plausible that output is, without taking the input into consideration. If you know your inputs/outputs lie in a specific interval, this could work.
EDIT
This is how I would do it. In practice, the outputs would be your real outputs; here they are simulated.
import numpy as np
import matplotlib.pyplot as plt
from scipy import integrate
from scipy.interpolate import interp1d
N = 1000  # the number of samples
mean = 0
std = 1
outputs = np.random.normal(loc=mean, scale=std, size=N)
# We want to get a normed histogram (since this is PDF, if we integrate
# it must be equal to 1)
nbins = N // 10
n = int(N / nbins)
# density=True replaces the deprecated normed=True
p, x = np.histogram(outputs, bins=n, density=True)
plt.hist(outputs, bins=n, density=True)
x = x[:-1] + (x[1] - x[0]) / 2  # converting bin edges to bin centers
# Now we want to interpolate :
# f = CubicSpline(x=x, y=p, bc_type='not-a-knot')
f = interp1d(x=x, y=p, kind='quadratic', fill_value='extrapolate')
x = np.linspace(-2.9*std, 2.9*std, 10000)
plt.plot(x, f(x))
plt.show()
# To check :
area = integrate.quad(f, x[0], x[-1])
print(area) # (should be close to 1)
Now, the interpolation is not great for outliers: if a predicted value is extremely far (more than about 3 standard deviations) from your distribution, it won't work. Other than that, you can now use the PDF to get meaningful results.
It is not perfect, but it is the best I could come up with in the time. I'm sure there are better ways to do it. If your data follow a normal distribution, it becomes trivial.
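Continuing the snippet above (reusing f, std and scipy's integrate; the predicted value is just a made-up number), one way to score a prediction against the estimated PDF is to evaluate the density at the prediction and integrate the tail beyond it:
y_pred = 1.5                                          # hypothetical model prediction
density_at_pred = f(y_pred)                           # how "typical" the predicted value is
tail_mass, _ = integrate.quad(f, y_pred, 2.9 * std)   # one-sided tail area beyond the prediction
print(density_at_pred, tail_mass)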
I suggest you look into NGBoost (essentially a wrapper around XGBoost that ultimately provides a probabilistic model).
Here you can find slides on how NGBoost works, as well as the original NGBoost paper.
The basic idea is to assume a specific distribution for $P(Y|X=x)$ (by default the Gaussian distribution) and fit an XGBoost-style model to estimate the best parameters of that distribution (for the Gaussian, $\mu$ and $\sigma$). The model splits the variable space into different regions with different distributions, i.e. the same family (e.g. Gaussian) but different parameters.
After training the model, you're provided with the method `pred_dist`, which returns the estimated distribution $P(Y|X=x)$ for a given set of values $x$.
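For completeness, here is a rough sketch of that workflow (not from the answer above; the class and method names are from the ngboost package as I recall them, so double-check against the current docs):
import numpy as np
from ngboost import NGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                       # toy features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=500)

ngb = NGBRegressor().fit(X, y)                      # Gaussian P(Y|X=x) by default
dist = ngb.pred_dist(X[:5])                         # estimated distributions for 5 rows
print(dist.params["loc"])                           # estimated means (mu)
print(dist.params["scale"])                         # estimated standard deviations (sigma)
print(ngb.predict(X[:5]))                           # point predictions (the distribution means)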

PDF of the sum of Gaussian distributions using FFT

I am trying to derive the PDF of the sum of independent random variables. To start, I would like to do this for a simple case: the sum of Gaussian random variables.
I was surprised to see that I don't get a Gaussian density function when I sum an even number of Gaussian random variables. I actually get:
which looks like two halves of a Gaussian distribution.
On the other hand, when I sum an odd number of Gaussian distributions I get the right distribution:
Below is the code I used to produce the results above:
import numpy as np
from scipy.stats import norm
from scipy.fftpack import fft,ifft
import matplotlib.pyplot as plt
%matplotlib inline
a = 10**(-15)
end = norm(0, 1).ppf(a)
sample = np.linspace(end, -end, 1000)
pdf = norm(0, 1).pdf(sample)
plt.subplot(211)
plt.plot(np.real(ifft(fft(pdf)**2)))
plt.subplot(212)
plt.plot(np.real(ifft(fft(pdf)**3)))
Could someone help me understand why I get odd results for even sums of Gaussian distributions?
Even though your code creates a zero-mean Gaussian PDF:
sample=np.linspace(end,-end,1000)
pdf=norm(0,1).pdf(sample)
the FFT does not know about sample, and only sees pdf with samples at 0, 1, 2, 3, ... 999. The FFT expects the origin to be the first sample of the signal. To the FFT function, your PDF is not zero mean, but has a mean of 500.
Thus, what is going on here is that you are adding two PDFs with a 500 mean, leading to one with a 1000 mean. And because the FFT imposes a periodicity to the spatial domain signal, you are seeing the PDF exiting the graph on the right and coming back in on the left.
Adding 3 PDFs shifts the mean to 1500, which due to periodicity is the same as 500, meaning it ends up in the same place as the original PDF.
The solution is to shift the origin to the first sample for the FFT, and shift the result back:
from scipy.fftpack import fftshift, ifftshift
pdf2 = fftshift(ifft(fft(ifftshift(pdf))**2))
ifftshift shifts the signal so that the center sample ends up at the first sample, and fftshift shifts it back to where you wanted it for display.
But do note that the way you generate the PDF, the origin is not at a sample, and so the above will not work exactly. Instead, use:
sample=np.linspace(end,-end,1001)
pdf=norm(0,1).pdf(sample)
By picking 1001 samples instead of 1000, zero is exactly at the middle sample.
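Putting both fixes together, here is a sketch of a corrected version of the original code (the extra multiplication by the sample spacing keeps the result scaled as a proper density; that scaling is my addition, not part of the answer above):
import numpy as np
from scipy.stats import norm
from scipy.fftpack import fft, ifft, fftshift, ifftshift
import matplotlib.pyplot as plt

a = 10**(-15)
end = norm(0, 1).ppf(a)
sample = np.linspace(end, -end, 1001)    # odd length: the origin falls exactly on the middle sample
dx = sample[1] - sample[0]
pdf = norm(0, 1).pdf(sample)

# shift the origin to the first sample, convolve via the FFT, shift back
pdf2 = np.real(fftshift(ifft(fft(ifftshift(pdf))**2))) * dx       # sum of 2 N(0, 1) variables
pdf3 = np.real(fftshift(ifft(fft(ifftshift(pdf))**3))) * dx**2    # sum of 3 N(0, 1) variables

plt.plot(sample, pdf2)
plt.plot(sample, pdf3)
plt.show()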
Use R!
library(ggplot2)
f <- function(n) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  X <- x1 + x2
  # label each sample with its size so ggplot can colour the densities by n
  ds <- data.frame(X = X, n = as.factor(n))
  return(ds)
}
ds.list <- lapply(10^(2:5), f)
ds <- Reduce(rbind, ds.list)
ggplot(ds, aes(X, fill = n)) + geom_density(alpha = 0.5) + xlab("")
Here's the distribution plot:

Computing percentiles when given a distribution

Let's say I have a vector of values, and a vector of probabilities. I want to compute the percentile over the values, but using the given vector of probabilities.
Say, for example,
import numpy as np
vector = np.array([4, 2, 3, 1])
probs = np.array([0.7, 0.1, 0.1, 0.1])
Ignoring probs, np.percentile(vector, 10) gives me 1.3. However, it's clear that the lowest 10% here have value of 1, so that would be my desired output.
If the result lies between two data points, I'd prefer linear interpolation as documented for the original percentile function.
How would I solve this in Python most conveniently? As in my example, vector will not be sorted. probs always sums to 1. I'd prefer solutions that don't require "non-standard" packages, by any reasonable definition.
If you're prepared to sort your values, then you can construct an interpolating function that allows you to compute the inverse of the probability distribution. This is probably more easily done with scipy.interpolate than with pure numpy routines:
import scipy.interpolate
ordering = np.argsort(vector)
distribution = scipy.interpolate.interp1d(np.cumsum(probs[ordering]), vector[ordering], bounds_error=False, fill_value='extrapolate')
If you interrogate this distribution with the percentile (in the range 0..1), you should get the answers you want, e.g. distribution(0.1) gives 1.0, distribution(0.5) gives about 3.29.
A similar thing can be done with numpy's interp() function, avoiding the extra dependency on scipy, but that would involve reconstructing the interpolating function every time you want to calculate a percentile. This might be fine if you have a fixed list of percentiles that is known before you estimate the probability distribution.
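For reference, a sketch of that np.interp variant (the function name is mine; q is the percentile expressed as a fraction in [0, 1]):
import numpy as np

def weighted_percentile(vector, probs, q):
    # sort the values and accumulate their probabilities, then invert by interpolation
    order = np.argsort(vector)
    cum = np.cumsum(probs[order])
    return np.interp(q, cum, vector[order])   # clips (rather than extrapolates) outside [cum[0], 1]

vector = np.array([4, 2, 3, 1])
probs = np.array([0.7, 0.1, 0.1, 0.1])
print(weighted_percentile(vector, probs, 0.1))   # -> 1.0
print(weighted_percentile(vector, probs, 0.5))   # -> about 3.29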
One solution would be to use sampling via numpy.random.choice and then numpy.percentile:
N = 50 # number of samples to draw
samples = np.random.choice(vector, size=N, p=probs, replace=True)
interpolation = "nearest"
print("25th percentile",np.percentile(samples, 25, interpolation=interpolation),)
print("75th percentile",np.percentile(samples, 75, interpolation=interpolation),)
Depending on your kind of data (discrete or continuous) you may want to use different values for the interpolation parameter.

scipy - generate random variables with correlations

I'm working to implement a basic Monte Carlo simulator in Python for some project management risk modeling I'm trying to do (basically Crystal Ball / @Risk, but in Python).
I have a set of n random variables (all scipy.stats instances). I know that I can use rv.rvs(size=k) to generate k independent observations from each of these n variables.
I'd like to introduce correlations among the variables by specifying an n x n positive semi-definite correlation matrix.
Is there a clean way to do this in scipy?
What I've Tried
This answer and this answer seem to indicate that "copulas" would be an answer, but I don't see any reference in scipy to them.
This link seems to implement what I'm looking for, but I'm not sure if scipy has this functionality implemented already. I'd also like it to work for non-normal variables.
It seems that the Iman–Conover paper describes the standard method.
If you just want correlation through a Gaussian Copula (*), then it can be calculated in a few steps with numpy and scipy.
create multivariate normal random variables with the desired covariance using numpy.random.multivariate_normal, giving a (nobs by k_variables) array
apply scipy.stats.norm.cdf to each column/variable to transform the normal samples into uniform random variables, i.e. uniform marginal distributions
apply dist.ppf to transform the uniform marginals into the desired distribution, where dist can be any of the distributions in scipy.stats (a short sketch of these steps follows after the references below)
(*) Gaussian copula is only one choice and it is not the best when we are interested in tail behavior, but it is the easiest to work with
for example http://archive.wired.com/techbiz/it/magazine/17-03/wp_quant?currentPage=all
two references
https://stats.stackexchange.com/questions/37424/how-to-simulate-from-a-gaussian-copula
http://www.mathworks.com/products/demos/statistics/copulademo.html
(I might have done this a while ago in python, but don't have any scripts or function right now.)
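A minimal sketch of those three steps, with arbitrarily chosen marginals (gamma and exponential) and a two-variable correlation of 0.8; note that the correlation measured after the transforms will differ somewhat from the one put into the Gaussian copula:
import numpy as np
from scipy import stats

nobs = 2000
cov = [[1.0, 0.8], [0.8, 1.0]]                               # step 1: correlated normals
z = np.random.multivariate_normal([0.0, 0.0], cov, size=nobs)
u = stats.norm.cdf(z)                                        # step 2: uniform marginals
x0 = stats.gamma(a=2.0).ppf(u[:, 0])                         # step 3: any scipy.stats marginal
x1 = stats.expon(scale=3.0).ppf(u[:, 1])
print(np.corrcoef(x0, x1)[0, 1])                             # induced correlation (close to, not equal to, 0.8)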
It seems like a rejection-based sampling method such as the Metropolis-Hastings algorithm is what you want. Scipy can implement such methods with its scipy.optimize.basinhopping function.
Rejection-based sampling methods allow you to draw samples from any given probability distribution. The idea is that you draw random samples from another "proposal" pdf that is easy to sample from (such as uniform or gaussian distributions) and then use a random test to decide if this sample from the proposal distribution should be "accepted" as representing a sample of the desired distribution.
The remaining tricks will then be:
Figure out the form of the joint N-dimensional probability density function which has marginals of the form you want along each dimension, but with the correlation matrix that you want. This is easy to do for the Gaussian distribution, where the desired correlation matrix and mean vector are all you need to define the distribution. If your marginals have a simple expression, you can probably find this pdf with some straightforward-but-tedious algebra. This paper cites several others which do what you are talking about, and I'm certain that there are many more.
Formulate a function for basinhopping to minimize such that its accepted "minima" amount to samples of the pdf you have defined.
Given the results of (1), (2) should be straightforward.
If you already have a positive semi-definite correlation matrix R [n x n], it's easy to build a NormalCopula taking R as input. I'll show you an example with n = 3. The code is based on the OpenTURNS library.
import openturns as ot
# you can replace this part by your matrix
dim = 3
R = ot.CorrelationMatrix(dim)
R[0,1] = 0.25
R[0,2] = 0.6
R[1,2] = 0.9
copula = ot.NormalCopula(R)
Should you like to get a sample of a given size, just write:
size = 5
print(copula.getSample(size))
>>> [ X0 X1 X2 ]
0 : [ 0.355353 0.76205 0.632379 ]
1 : [ 0.902567 0.984443 0.989552 ]
2 : [ 0.423219 0.811016 0.754304 ]
3 : [ 0.303776 0.471557 0.450188 ]
4 : [ 0.746168 0.918729 0.891347 ]
EDIT - Following the comment of @Michael_Baudin
Of course, if you want to set the marginal distributions to be, e.g., LogNormal, Beta, and Uniform marginals, it's also possible:
X0 = ot.LogNormal(0.1, 1, 0)
X1 = ot.Beta()
X2 = ot.Uniform(1.0, 2.0)
distribution = ot.ComposedDistribution([X0, X1, X2], copula)  # reuse the NormalCopula built above
print(distribution.getSample(size))
>>> [ X0 X1 X2 ]
0 : [ 3.97678 0.158823 1.75635 ]
1 : [ 1.18929 -0.554092 1.18952 ]
2 : [ 2.59542 0.0751359 1.68599 ]
3 : [ 1.33363 -0.18407 1.42241 ]
4 : [ 1.34084 0.198019 1.6553 ]
import typing
import numpy as np
import scipy.stats
def run_gaussian_copula_simulation_and_get_samples(
    ppfs: typing.List[typing.Callable[[np.ndarray], np.ndarray]],  # list of $num_dims percentile point functions
    cov_matrix: np.ndarray,  # covariance matrix, shape ($num_dims, $num_dims)
    num_samples: int,  # number of random samples to draw
) -> np.ndarray:
    num_dims = len(ppfs)
    # Draw random samples from the multidimensional normal distribution -> shape ($num_samples, $num_dims)
    ran = np.random.multivariate_normal(np.zeros(num_dims), cov_matrix, (num_samples,), check_valid="raise")
    # Transform back into a uniform distribution, i.e. the space [0, 1]^$num_dims
    U = scipy.stats.norm.cdf(ran)
    # Apply each ppf to transform the samples into the desired distribution
    # Each row of the returned array represents one random sample -> access with a[i]
    return np.array([ppfs[i](U[:, i]) for i in range(num_dims)]).T  # shape ($num_samples, $num_dims)
# Example 1. Uncorrelated data, i.e. both distributions are independent
f1 = run_gaussian_copula_simulation_and_get_samples(
    [lambda x: scipy.stats.norm.ppf(x, loc=100, scale=15), scipy.stats.norm.ppf],
    [[1, 0], [0, 1]],
    6,
)
# Example 2. Completely correlated data, i.e. both percentiles match
f2 = run_gaussian_copula_simulation_and_get_samples(
    [lambda x: scipy.stats.norm.ppf(x, loc=100, scale=15), scipy.stats.norm.ppf],
    [[1, 1], [1, 1]],
    6,
)
np.set_printoptions(suppress=True)  # suppress scientific notation
print(f1)
print(f2)
A few notes on this function. np.random.multivariate_normal does a lot of the heavy lifting for us; note in particular that we do not need to decompose the correlation matrix ourselves.
ppfs is passed as a list of functions which each have one input and one return value.
In my particular use case I needed to generate multivariate-t-distributed random variables (in addition to normal-distributed ones),
consult this answer on how to do that: https://stackoverflow.com/a/41967819/2111778.
Additionally, I used scipy.stats.t.cdf for the back-transform part.
In my particular use case the desired distributions were empirical distributions representing expected financial loss.
The final data points then had to be added together to get a total financial loss across all
of the individual-but-correlated financial events.
Thus, np.array(...).T is actually replaced by sum(...) in my code base.

normality test of a distribution in python

I have some data I sampled from a radar satellite image and want to perform some statistical tests on it. Before that I wanted to run a normality test so I could be sure my data are normally distributed. My data appear to be normally distributed, but when I perform the test I'm getting a p-value of 0, suggesting my data are not normally distributed.
I have attached my code along with the output and a histogram of the distribution (I'm relatively new to Python, so apologies if my code is clunky in any way). Can anyone tell me if I'm doing something wrong? I find it hard to believe from my histogram that my data are not normally distributed.
import h5py
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
values = 'inputfile.h5'
f = h5py.File(values, 'r')
dset = f['/DATA/DATA']
array = dset[..., 0]
print('normality =', scipy.stats.normaltest(array))
max = np.amax(array)
min = np.amin(array)
histo = np.histogram(array, bins=100, range=(min, max))
freqs = histo[0]
rangebins = (max - min)
numberbins = (len(histo[1]) - 1)
interval = (rangebins / numberbins)
newbins = np.arange(min, max, interval)
histogram = plt.bar(newbins, freqs, width=0.2, color='gray')
plt.show()
This prints: (41099.095955202931, 0.0). The first element is a chi-square statistic and the second is the p-value.
I have made a graph of the data, which I have attached. I thought that maybe dealing with negative values was causing a problem, so I normalised the values, but the problem persists.
This question explains why you're getting such a small p-value. Essentially, normality tests almost always reject the null on very large sample sizes (in yours, for example, you can see just some skew in the left side, which at your enormous sample size is way more than enough).
What would be much more practically useful in your case is to plot a normal curve fit to your data. Then you can see how the normal curve actually differs (for example, you can see whether the tail on the left side does indeed go too long). For example:
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import norm
# scipy.stats.norm.pdf replaces the removed matplotlib.mlab.normpdf,
# and density=True replaces the deprecated normed=1
n, bins, patches = plt.hist(array, 50, density=True)
mu = np.mean(array)
sigma = np.std(array)
plt.plot(bins, norm.pdf(bins, mu, sigma))
(Note the density=True argument: this ensures that the histogram is normalized to have a total area of 1, which makes it comparable to a density like the normal distribution; it replaces the deprecated normed=1 argument from older matplotlib versions.)
In general, when the number of samples is less than about 50, you should be careful about using tests of normality: these tests need enough evidence to reject the null hypothesis ("the distribution of the data is normal"), and when the number of samples is small they may not be able to find that evidence.
Keep in mind that when you fail to reject the null hypothesis it does not mean that the alternative hypothesis is correct.
There is another possibility: some implementations of the statistical tests for normality compare the distribution of your data to the standard normal distribution. To avoid this, I suggest you standardize the data and then apply the test of normality.
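A minimal sketch of that suggestion; this matters for tests that compare against the standard normal, e.g. scipy.stats.kstest with "norm" (the data here are simulated, standing in for the array from the question):
import numpy as np
from scipy import stats

data = np.random.normal(loc=5.0, scale=3.0, size=10000)    # stand-in for the real array
z = (data - data.mean()) / data.std()                      # standardize first
print(stats.kstest(data, "norm"))                          # rejects: wrong location and scale
print(stats.kstest(z, "norm"))                             # large p-value: fails to reject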
