Marginal Density Probability using np - python

P = np.array([
    [0.03607908, 0.03760034, 0.00503184, 0.0205082 , 0.01051408,
     0.03776221, 0.00131325, 0.03760817, 0.01770659],
    [0.03750162, 0.04317351, 0.03869997, 0.03069872, 0.02176718,
     0.04778769, 0.01021053, 0.00324185, 0.02475319],
    [0.03770951, 0.01053285, 0.01227089, 0.0339596 , 0.02296711,
     0.02187814, 0.01925662, 0.0196836 , 0.01996279],
    [0.02845139, 0.01209429, 0.02450163, 0.00874645, 0.03612603,
     0.02352593, 0.00300314, 0.00103487, 0.04071951],
    [0.00940187, 0.04633153, 0.01094094, 0.00172007, 0.00092633,
     0.02032679, 0.02536328, 0.03552956, 0.01107725]
])
Here's the dataset, where X corresponds to rows and Y corresponds to columns. I'm trying to figure out how to calculate the Covariance and the Marginal Density Probability (MDP) for Y (the columns). For the Covariance I have the following:
np.cov(P)
array([[ 2.247e-04, 6.999e-05, 2.571e-05, -2.822e-05, 1.061e-04],
[ 6.999e-05, 2.261e-04, 9.535e-07, 8.165e-05, -2.013e-05],
[ 2.571e-05, 9.535e-07, 7.924e-05, 1.357e-05, -8.118e-05],
[-2.822e-05, 8.165e-05, 1.357e-05, 2.039e-04, -1.267e-04],
[ 1.061e-04, -2.013e-05, -8.118e-05, -1.267e-04, 2.372e-04]])
How do I get the MDP? Also, is there a way to use numpy to select just the X and Y vals and assign them to variables where X= P's rows and Y=P's columns?

The data stored in P are a little ambiguous. In statistics, X and Y have a very specific meaning. Usually each row refers to one observation (i.e. datapoint) of some statistical object, while each column represents a feature that is measured for that object. Since np.cov(P) treats each row of P as one variable, the matching design matrix here would be P.T, i.e. 9 observations of 5 features. Such a design matrix X is considered exogenous (independent) and serves as the foundation of most statistical learning algorithms. In supervised learning there is additionally a response vector Y, whose length equals the number of rows of X.
Your task at hand is of an unsupervised nature, as there is no response vector Y here and you are interested in the distribution of the data alone. This opens up additional questions. Indeed, np.cov(P) computes the empirical covariance matrix measuring the pairwise covariance between each of the 5 features (the rows of P), resulting in the symmetric 5x5 matrix you show. Asking for the marginal probability density of each feature, however, refers to its univariate distribution alone; the covariance between features is irrelevant for that task.
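As for the side question of assigning X and Y to their own variables, plain NumPy indexing is enough; a minimal sketch using your naming (X = P's rows, Y = P's columns):
import numpy as np

X = P            # shape (5, 9): X[i] is the i-th row of P
Y = P.T          # shape (9, 5): Y[j] is the j-th column of P

# or as explicit lists of 1-D arrays
rows = [P[i, :] for i in range(P.shape[0])]
cols = [P[:, j] for j in range(P.shape[1])]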
There are several methodologies for obtaining estimates of an unknown distribution given some data. Broadly speaking, they fall into two categories: parametric and non-parametric. I'll explain how each methodology works and how it can be implemented by leveraging NumPy exactly in the way you alluded to.
1. Parametric density estimation
In many cases one assumes that the data stems from a particular parametric distribution. This distributional assumption is mostly based on convenience rather than prior knowledge. Estimating the unknown parameter values then completely determines the distribution(s). In your case, for example, you could assume that each feature follows a univariate normal distribution.
import numpy as np
from matplotlib import pyplot as plt
from scipy import stats

# numerical grid for plotting the densities (no statistical meaning)
x_ = np.linspace(0.0, 0.055, 1000)

# estimate mean (mu) and standard deviation (sigma) for each feature,
# i.e. for each row of P (axis=1 averages over the 9 observations)
mean_vec = np.mean(P, axis=1)
std_vec = np.std(P, axis=1)

# plot the fitted normal density of each feature
for i in range(5):
    plt.plot(x_, stats.norm.pdf(x_, loc=mean_vec[i], scale=std_vec[i]),
             label='Feature {}'.format(i + 1))

plt.suptitle('Marginal Distribution Estimates', fontsize=21, y=1.025)
plt.title('parametric via univariate Normal densities', fontsize=14)
plt.legend(loc='upper right')
plt.show()
2. Nonparametric density estimation
Alternatively, you can use a histogram as a non-parametric estimator of the unknown probability density function of each feature. Note, however, that you still have to choose a bandwidth h that determines the width of the bins. Additionally, non-parametric tools require a larger sample size to provide accurate estimates; your sample size of 9 observations per feature is likely insufficient.
import numpy as np
from matplotlib import pyplot as plt

# bin edges: 24 equal-width bins imply a bandwidth of h = 0.055/24 ~ 0.00229
bins = np.linspace(0.0, 0.055, 25)
h = np.diff(bins)[0]

# histogram of each feature (row of P)
for i in range(5):
    plt.hist(P[i, :], bins, alpha=0.5, label='Feature {}'.format(i + 1))

plt.suptitle('Marginal Distribution Estimates', fontsize=21, y=1.025)
plt.title('nonparametric via histograms (h={})'.format(round(h, 4)), fontsize=14)
plt.legend(loc='upper right')
plt.show()

Related

How to calculate 95% confidence level of Fourier transform in Python?

After calculating the Fast Fourier Transform (FFT) of a time series in Python/SciPy, I am trying to plot the 95% confidence level at which the power spectrum is different from red or white noise, but haven't found a straightforward way to do so. I tried following this thread: Power spectrum in python - significance levels
and wrote the following code to test for a sine function with random noise:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2
from scipy.fft import rfft, rfftfreq
x=np.linspace(0,10,500)
data = np.sin(20*np.pi*x)+np.random.rand(500) - 0.5
yf = rfft(data)
xf = rfftfreq(len(data), 1)
n=len(data)
var=np.var(data)
### degrees of freedom
M=n/2
phi=(2*(n-1)-M/2.)/M
###values of chi-squared
chi_val_99 = chi2.isf(q=0.01/2, df=phi) #/2 for two-sided test
chi_val_95 = chi2.isf(q=0.05/2, df=phi)
### normalization of power spectrum with 1/n
plt.figure(figsize=(5,5))
plt.plot(xf,np.abs(yf)/n, color='k')
plt.axhline(y=(var/n)*(chi_val_95/phi),color='r',linestyle='--')
But the resulting line lies below all of the power spectrum, as in Fig. 1. What am I doing wrong? Is there another way to get the significance of the FFT power spectrum?
Background considerations
I did not read all of the references included in the answer you linked to (in particular Pankofsky et al.), so I couldn't find an explicit derivation of the formula nor of exactly which conditions the results apply under. On the other hand, I found a few other references where a derivation could more readily be confirmed.
Based on the answer to this question on dsp.stackexchange.com, if you only had white Gaussian noise with unit variance, the squared amplitude of each Fourier coefficient would have a chi-squared distribution with asymptotically 2 degrees of freedom (the sum of the squares of 2 Gaussians, one for the real and one for the imaginary part of the complex Fourier coefficient, when n >> 1). When the noise does not have unit variance, it follows a more general Gamma distribution (although in this case you can simply think of it as scaling the survival function). For noise with a uniform distribution in the [-0.5, 0.5] range and a sufficiently large number of samples, the distribution can also be approximated by a Gamma distribution thanks to the Central Limit Theorem.
To illustrate and better understand these distributions, we can go through gradually more complex cases.
Frequency domain distribution of random noise
For the sake of comparison with the later case of uniformly distributed data, we will use Gaussian noise with a matching variance. Since the variance of data uniformly distributed in the range [-0.5, 0.5] is 1/12, this gives us the following data:
data = np.sqrt(1.0/12)*np.random.randn(500)
Now let us check the statistics of the power spectrum. As indicated earlier, the squared magnitude of each frequency coefficient is a random variable with an approximately Gamma distribution. Its shape parameter is half the degrees of freedom of the chi-squared distribution that would apply in the unit-variance case (so 1 in this case), and its scale parameter corresponds to the square of the time-domain scaling (by linearity the variate yf scales as data, so np.abs(yf)**2 scales as the square of data).
We can validate this by plotting the histogram of the data against the probability density function:
from scipy.stats import gamma

yf = rfft(data)
spectrum = np.abs(yf)**2/len(data)
plt.figure(figsize=(5,5))
plt.hist(spectrum, bins=100, density=True, label='data')
z = np.linspace(0, np.max(spectrum), 100)
plt.plot(z, gamma.pdf(z, 1, scale=1.0/12), 'k', label=r'$\Gamma(1,{:.3f})$'.format(1.0/12))
plt.legend()
As you can see the values are in pretty good agreement:
Going back to the spectrum plot:
# degrees of freedom
phi = 2
###values of chi-squared
chi_val_95 = chi2.isf(q=0.05/2, df=phi) #/2 for two-sided test
### normalization of power spectrum with 1/n
plt.figure(figsize=(5,5))
plt.plot(xf,np.abs(yf)**2/n, color='k')
# the following two lines should overlap
plt.axhline(y=var*(chi_val_95/phi),color='r',linestyle='--')
plt.axhline(y=gamma.isf(q=0.05/2, a=1, scale=var),color='b')
Just changing the data to use a uniform distribution in the [-0.5,0.5] range (with data = np.random.rand(500) - 0.5) gives an almost identical plot, with the confidence level remaining unchanged.
Frequency domain distribution of signal with noise
To get a single threshold value corresponding to a 95% confidence interval within which the noise part would fall if you could separate it from the data containing both a sinusoidal component and noise (otherwise stated: the 95% confidence interval under the null hypothesis that the data is white noise), you would need the variance of the noise. While trying to estimate this variance you may quickly realize that the sinusoidal component contributes a non-negligible portion of the overall data's variance. To remove this contribution we can take advantage of the fact that sinusoidal signals are more readily separated in the frequency domain.
So we could simply discard the x% largest values of the spectrum, under the assumption that those are mostly contributed by the spike of the sinusoidal component in the frequency domain. Note that the 95th-percentile choice below for the outliers is somewhat arbitrary:
# remove outliers
threshold = np.percentile(np.abs(yf)**2, 95)
filtered = [x for x in np.abs(yf)**2 if x <= threshold]
Then we can get the time-domain variance using Parseval's theorem:
# estimate variance
# In the time domain: variance ~ np.sum(data**2)/len(data)
# In the frequency domain, Parseval's theorem (for the one-sided rfft spectrum) gives
# np.sum(data**2)/len(data) ~ np.mean(np.abs(yf)**2)/len(data)
var = np.mean(filtered)/len(data)
Note that due to the dynamic range of values across the spectrum, you may prefer to visualize the results on a logarithmic scale:
plt.figure(figsize=(5,5))
plt.plot(xf,10*np.log10(np.abs(yf)**2/n), color='k')
plt.axhline(y=10*np.log10(gamma.isf(q=0.05/2, a=1, scale=var)),color='r',linestyle='--')
If on the other hand you are trying to obtain a frequency-dependent 95% confidence interval, then you'd need to consider the contribution of the sinusoidal component at each frequency. For the sake of simplicity we will assume here that the amplitude of the sinusoidal component and the variance of the noise are known (otherwise we'd first need to estimate them). In this case the distribution gets shifted by the sinusoidal component's contribution:
signal = np.sin(20*np.pi*x)
data = signal + np.random.rand(500) - 0.5
Sf = rfft(signal) # Assuming perfect knowledge of the sinusoidal component
yf = rfft(data)
noiseVar = 1.0/12 # Assuming perfect knowledge of the noise variance
threshold95 = np.abs(Sf)**2/n + gamma.isf(q=0.05/2, a=1, scale=noiseVar)
plt.figure(figsize=(5,5))
plt.plot(xf, 10*np.log10(np.abs(yf)**2/n), color='k')
plt.plot(xf, 10*np.log10(threshold95), color='r',linestyle='--')
Finally, while I kept the final plots in squared-amplitude units, nothing prevents you from taking the square root and viewing the corresponding thresholds in amplitude units.
Edit: I've used a gamma(1,s) distribution, which is asymptotically a good distribution for data with a sufficient number of samples n. For really small data sizes the distribution more closely matches a gamma(0.5*(n/(n//2+1)),s) (due to the DC and Nyquist coefficients being purely real, thus having 1 degree of freedom unlike all other coefficients).
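As a small sketch of that correction (reusing n, var and gamma from above; the shape value just plugs in the formula from this note):
# shape slightly below 1 for small n, because the DC and Nyquist bins
# are purely real and therefore contribute only 1 degree of freedom
a_small = 0.5 * (n / (n // 2 + 1))
threshold95_small_n = gamma.isf(q=0.05/2, a=a_small, scale=var)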

Interpretation of PCA biplot

I need to understand what the scatterplot created by 2 principal components conveys.
I was working on the 'boston housing' dataset from the 'sklearn.datasets' library. I standardized the predictors and then used 'PCA' from the 'sklearn.decomposition' library to get 2 principal components and plotted them on a graph.
Now all I want is help in interpreting what the plot says in simple language.
Each principal component can be understood as a linear combination of all the features in your dataset. For example, if you have three variables A, B and C, then one possibility for a principal component could be calculated as 0.5A + 0.25B + 0.25C. A datapoint with values [1, 2, 4] would then end up with 0.5*1 + 0.25*2 + 0.25*4 = 2 on that principal component.
The first principal component is extracted by determining the combination of features that yields the highest variance in the data. This roughly means that we tweak the multipliers (0.5, 0.25, 0.25) for each variable such that the variance across all observations is maximized.
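Here is a tiny NumPy sketch of that arithmetic; the weights and the data matrix are made up for illustration, while a real component's weights would come from the fitted model (e.g. sklearn's pca.components_):
import numpy as np

# hypothetical weights of one principal component for variables A, B, C
weights = np.array([0.5, 0.25, 0.25])

point = np.array([1, 2, 4])
print(point @ weights)          # 2.0 -- the datapoint's score on that component

# for a whole (centered) data matrix with one row per observation:
X = np.array([[1., 2., 4.],
              [0., 1., 1.],
              [2., 0., 3.]])
scores = X @ weights            # one score per observation
print(scores)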
The first principal component (green) and the second (pink) of 2D data are visualised by the lines through the data in this plot
The PCs are linear combinations of the features. Basically, you can order the PCs by the variance they capture in the data, from highest to lowest: PC1 contains most of the variance, then PC2, etc. Thus for each PC it is known exactly how much variance it explains. However, when you scatterplot the data in 2D, as you did for the boston housing dataset, it is hard to say "how much" and "which" features contributed to the PCs. This is where the "biplot" comes into play. The biplot shows, for each feature, its contribution via the angle and length of its vector. When you do this, you will not only know how much variance was explained by the top PCs, but also which features were most important.
Try the ‘pca’ library. This will plot the explained variance, and create a biplot.
pip install pca
from pca import pca
# Initialize to reduce the data up to the number of components that explains 95% of the variance.
model = pca(n_components=0.95)
# Or reduce the data towards 2 PCs
model = pca(n_components=2)
# Fit transform
results = model.fit_transform(X)
# Plot explained variance
fig, ax = model.plot()
# Scatter first 2 PCs
fig, ax = model.scatter()
# Make biplot
fig, ax = model.biplot(n_feat=4)

How to get probability density function using Kullback-Leibler Divergence in Python

I have a set of 1D data in Python. I can get the probability density function using the gaussian_kde function from SciPy. I want to know whether the resulting distribution matches a theoretical distribution such as the normal distribution. For that, can I use KL divergence? If so, how can I do that in Python?
This is my Python code to get the probability density function:
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

array = np.array(values)   # values holds the 1D data
KDEpdf = gaussian_kde(array)
x = np.linspace(0, 50, 1500)
kdepdf = KDEpdf.evaluate(x)
plt.plot(x, kdepdf, label="", color="blue")
plt.legend()
plt.show()
There are a couple of ways to do it:
Plot it against a fitted normal density, e.g. plt.hist(array, bins=50, density=True) for a normalized histogram with plt.plot(x, norm.pdf(x, mu, std)) on top of it.
Compare the KDE against a reference sample drawn from the theoretical distribution, using something like a Q-Q plot of the two datasets.
Use a chi-square goodness-of-fit test, being cautious with the bin size you choose. Basically, this tests whether the number of draws that fall into the various bins is consistent with the theoretical distribution.
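If you specifically want a KL-divergence number, one hedged sketch (reusing array from your code; the normal parameters are simply fitted by moments) is to evaluate both densities on a common grid and pass them to scipy.stats.entropy, which then returns KL(p || q):
import numpy as np
from scipy.stats import gaussian_kde, norm, entropy

grid = np.linspace(array.min(), array.max(), 1000)
p = gaussian_kde(array).evaluate(grid)                         # KDE of your data
q = norm.pdf(grid, loc=array.mean(), scale=array.std(ddof=1))  # normal fitted by moments

kl_pq = entropy(p, q)   # entropy(p, q) normalizes both and returns KL(p || q)
print(kl_pq)            # 0 means identical; larger values mean a worse match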

scikit-learn: Finding the features that contribute to each KMeans cluster

Say you have 10 features you are using to create 3 clusters. Is there a way to see the level of contribution each of the features has for each of the clusters?
What I want to be able to say is that for cluster k1, features 1, 4 and 6 were the primary features, whereas cluster k2's primary features were 2, 5 and 7.
This is the basic setup of what I am using:
from sklearn.cluster import KMeans

k_means = KMeans(init='k-means++', n_clusters=3, n_init=10)
k_means.fit(data_features)
k_means_labels = k_means.labels_
You can use Principal Component Analysis (PCA).
PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or singular value decomposition of a data matrix, usually after mean centering (and normalizing or using Z-scores) the data matrix for each attribute. The results of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed variable values corresponding to a particular data point), and loadings (the weight by which each standardized original variable should be multiplied to get the component score).
Some essential points:
the eigenvalues reflect the portion of variance explained by the corresponding component. Say we have 4 features with eigenvalues 1, 4, 1, 2. These are the variances explained by the corresponding eigenvectors. The second value belongs to the first principal component, as it explains 50 % of the overall variance, and the last value belongs to the second principal component, explaining 25 % of the overall variance (see the short sketch after these points).
the eigenvectors are the components' linear combinations. They give the weights of the features, so you can see which features have high or low impact.
use PCA based on the correlation matrix instead of the empirical covariance matrix if the eigenvalues differ strongly in magnitude.
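As a quick sketch of the eigenvalue arithmetic from the first point (the eigenvalues are the made-up ones from the example, not from real data):
import numpy as np

e_values = np.array([1., 4., 1., 2.])            # toy eigenvalues from the example
order = np.argsort(e_values)[::-1]               # sort descending
explained = e_values[order] / e_values.sum()     # portion of variance per component
print(explained)                                 # [0.5   0.25  0.125 0.125]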
Sample approach
do PCA on entire dataset (that's what the function below does)
take matrix with observations and features
center it to its average (average of feature values among all observations)
compute the empirical covariance matrix (e.g. np.cov) or the correlation matrix (see above)
perform decomposition
sort eigenvalues and eigenvectors by eigenvalues to get components with highest impact
use components on original data
examine the clusters in the transformed dataset. By checking their location on each component you can derive the features with high and low impact on distribution/variance
Sample function
You need to import numpy as np and scipy's linalg submodule (e.g. import scipy as sp together with import scipy.linalg), since the function uses sp.linalg.eigh for the decomposition. You might also want to check the scikit-learn decomposition module.
PCA is performed on a data matrix with observations (objects) in rows and features in columns.
def dim_red_pca(X, d=0, corr=False):
    r"""
    Performs principal component analysis.

    Parameters
    ----------
    X : array, (n, d)
        Original observations (n observations, d features)
    d : int
        Number of principal components (default is ``0`` => all components).
    corr : bool
        If true, the PCA is performed based on the correlation matrix.

    Notes
    -----
    Always all eigenvalues and eigenvectors are returned,
    independently of the desired number of components ``d``.

    Returns
    -------
    Xred : array, (n, m or d)
        Reduced data matrix
    e_values : array, (m)
        The eigenvalues, sorted in descending manner.
    e_vectors : array, (n, m)
        The eigenvectors, sorted corresponding to eigenvalues.
    """
    # Center to average
    X_ = X - X.mean(0)
    # Compute correlation / covariance matrix
    if corr:
        CO = np.corrcoef(X_.T)
    else:
        CO = np.cov(X_.T)
    # Compute eigenvalues and eigenvectors
    e_values, e_vectors = sp.linalg.eigh(CO)
    # Sort the eigenvalues and the eigenvectors descending
    idx = np.argsort(e_values)[::-1]
    e_vectors = e_vectors[:, idx]
    e_values = e_values[idx]
    # Get the number of desired dimensions
    d_e_vecs = e_vectors
    if d > 0:
        d_e_vecs = e_vectors[:, :d]
    else:
        d = None
    # Map principal components to original data
    LIN = np.dot(d_e_vecs, np.dot(d_e_vecs.T, X_.T)).T
    return LIN[:, :d], e_values, e_vectors
Sample usage
Here's a sample script, which makes use of the given function and uses scipy.cluster.vq.kmeans2 for clustering. Note that the results vary with each run, because the starting clusters are initialized randomly.
import numpy as np
import scipy as sp
from scipy.cluster.vq import kmeans2
import matplotlib.pyplot as plt
SN = np.array([ [1.325, 1.000, 1.825, 1.750],
[2.000, 1.250, 2.675, 1.750],
[3.000, 3.250, 3.000, 2.750],
[1.075, 2.000, 1.675, 1.000],
[3.425, 2.000, 3.250, 2.750],
[1.900, 2.000, 2.400, 2.750],
[3.325, 2.500, 3.000, 2.000],
[3.000, 2.750, 3.075, 2.250],
[2.075, 1.250, 2.000, 2.250],
[2.500, 3.250, 3.075, 2.250],
[1.675, 2.500, 2.675, 1.250],
[2.075, 1.750, 1.900, 1.500],
[1.750, 2.000, 1.150, 1.250],
[2.500, 2.250, 2.425, 2.500],
[1.675, 2.750, 2.000, 1.250],
[3.675, 3.000, 3.325, 2.500],
[1.250, 1.500, 1.150, 1.000]], dtype=float)
clust,labels_ = kmeans2(SN,3) # cluster with 3 random initial clusters
# PCA on orig. dataset
# Xred will have only 2 columns, the first two princ. comps.
# evals has shape (4,) and evecs (4,4). We need all eigenvalues
# to determine the portion of variance
Xred, evals, evecs = dim_red_pca(SN,2)
xlab = '1. PC - ExpVar = {:.2f} %'.format(evals[0]/sum(evals)*100) # determine variance portion
ylab = '2. PC - ExpVar = {:.2f} %'.format(evals[1]/sum(evals)*100)
# plot the clusters, each set separately
plt.figure()
ax = plt.gca()
scatterHs = []
clr = ['r', 'b', 'k']
for cluster in set(labels_):
    scatterHs.append(ax.scatter(Xred[labels_ == cluster, 0], Xred[labels_ == cluster, 1],
                                color=clr[cluster], label='Cluster {}'.format(cluster)))
plt.legend(handles=scatterHs, loc=4)
plt.setp(ax, title='First and Second Principal Components', xlabel=xlab, ylabel=ylab)
# plot also the eigenvectors for deriving the influence of each feature
fig, ax = plt.subplots(2,1)
ax[0].bar([1, 2, 3, 4],evecs[0])
plt.setp(ax[0], title="First and Second Component's Eigenvectors ", ylabel='Weight')
ax[1].bar([1, 2, 3, 4],evecs[1])
plt.setp(ax[1], xlabel='Features', ylabel='Weight')
Output
The eigenvectors show the weighting of each feature for the component
Short Interpretation
Let's just have a look at cluster zero, the red one. We'll be mostly interested in the first component, as it explains about 3/4 of the variance. The red cluster is in the upper area of the first component: all of its observations yield rather high values. What does that mean? Looking at the linear combination of the first component, we see at first glance that the second feature is rather unimportant (for this component). The first and fourth features are the highest weighted and the third one has a negative score. This means that, as all red points have a rather high score on the first PC, these points will have high values in the first and last features, while at the same time having low values in the third feature.
Concerning the second feature, we can have a look at the second PC. However, note that its overall impact is far smaller, since this component explains only roughly 16 % of the variance compared to the ~74 % of the first PC.
You can do it this way:
>>> import numpy as np
>>> import sklearn.cluster as cl
>>> data = np.array([99,1,2,103,44,63,56,110,89,7,12,37])
>>> k_means = cl.KMeans(init='k-means++', n_clusters=3, n_init=10)
>>> k_means.fit(data[:,np.newaxis]) # [:,np.newaxis] converts data from 1D to 2D
>>> k_means_labels = k_means.labels_
>>> k1,k2,k3 = [data[np.where(k_means_labels==i)] for i in range(3)] # range(3) because 3 clusters
>>> k1
array([44, 63, 56, 37])
>>> k2
array([ 99, 103, 110, 89])
>>> k3
array([ 1, 2, 7, 12])
Try this:
estimator = KMeans(n_clusters=3, n_init=10)
estimator.fit(X)
res = estimator.__dict__
print(res['cluster_centers_'])
You will get a matrix of cluster centers (one row per cluster, one column per feature). From that you can conclude that the features with the largest weights in a center contribute most to that cluster.
I assume that by "a primary feature" you mean one that had the biggest impact on the cluster. A nice exploration you can do is to look at the coordinates of the cluster centers: for example, plot for each feature its coordinate in each of the K centers, as in the sketch below.
Of course, any feature on a large scale will have a much larger effect on the distance between observations, so make sure your data is well scaled before performing any analysis.
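A minimal sketch of that plot, assuming a fitted sklearn KMeans object like k_means from the question (with scaled data_features):
import numpy as np
import matplotlib.pyplot as plt

centers = k_means.cluster_centers_        # shape (n_clusters, n_features)
n_clusters, n_features = centers.shape

x = np.arange(n_features)
width = 0.8 / n_clusters
for k in range(n_clusters):
    # one group of bars per feature, one bar per cluster center
    plt.bar(x + k * width, centers[k], width, label='Cluster {}'.format(k))
plt.xlabel('Feature index')
plt.ylabel('Center coordinate')
plt.legend()
plt.show()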
A method I came up with is calculating the standard deviation of each feature relative to its range, i.e. how spread out a cluster's data is along that feature.
The smaller the spread, the more characteristic the feature is for that cluster; basically:
1 - (std(x) / (max(x) - min(x)))
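A small sketch of that score, assuming data_features and k_means_labels from the question; here the feature range is taken over the whole dataset, which is one possible reading of the formula:
import numpy as np

X = np.asarray(data_features)                  # shape (n_samples, n_features)
feat_range = X.max(axis=0) - X.min(axis=0)     # range of each feature over all data

for k in np.unique(k_means_labels):
    Xk = X[k_means_labels == k]
    # 1 - std/range: values close to 1 mean the cluster is tight along that feature
    score = 1 - Xk.std(axis=0) / feat_range
    print('Cluster {}: {}'.format(k, np.round(score, 3)))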
I wrote an article and a class for this:
https://github.com/GuyLou/python-stuff/blob/main/pluster.py
https://medium.com/#guylouzon/creating-clustering-feature-importance-c97ba8133c37
It might be difficult to talk about feature importance separately for each cluster. Rather, it could be better to talk globally about which features are most important for separating the different clusters.
For this goal, a very simple method is described as follows. Note that the squared Euclidean distance between two cluster centers is a sum of squared differences over the individual features. We can then just use the squared difference as the weight for each feature.
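A minimal sketch of that weighting, assuming a fitted sklearn KMeans object k_means as in the question; summing the squared differences over all pairs of cluster centers is my own choice of aggregation:
import numpy as np
from itertools import combinations

centers = k_means.cluster_centers_            # shape (n_clusters, n_features)

# per-feature squared difference, summed over all pairs of cluster centers
importance = np.zeros(centers.shape[1])
for a, b in combinations(range(len(centers)), 2):
    importance += (centers[a] - centers[b]) ** 2

importance /= importance.sum()                # normalize to proportions
print(np.argsort(importance)[::-1])           # feature indices, most important first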

How can I maximize the Poissonian likelihood of a histogram given a fit curve with scipy/numpy?

I have data in a python/numpy/scipy environment that needs to be fit to a probability density function. A way to do this is to create a histogram of the data and then fit a curve to this histogram. The method scipy.optimize.leastsq does this by minimizing the sum of (y - f(x))**2, where (x,y) would in this case be the histogram's bin centers and bin contents.
In statistical terms, this least-squares fit maximizes the likelihood of obtaining that histogram when each bin count is sampled from a Gaussian centered on the fit function at that bin's position. You can easily see this: each term (y-f(x))**2 is, up to constants, -log(gauss(y|mean=f(x))), and the sum is minus the logarithm of the product of the Gaussian likelihoods over all bins.
That's however not always accurate: for the type of statistical data I'm looking at, each bin count is the result of a Poisson process, so I want to maximize (the logarithm of) the product over all bins (x,y) of poisson(y|mean=f(x)), or equivalently minimize its negative. The Poisson distribution comes very close to the Gaussian for large values of f(x), but if my histogram doesn't have that good statistics, the difference is relevant and influences the fit.
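For concreteness, here is a minimal sketch of what I mean, with a toy histogram and a hypothetical Gaussian-shaped fit curve: minimize the negative Poisson log-likelihood of the bin counts with scipy.optimize.minimize.
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

# toy data: histogram of Gaussian samples (stand-in for the real binned data)
rng = np.random.default_rng(0)
counts, edges = np.histogram(rng.normal(5.0, 1.0, size=200), bins=30)
centers = 0.5 * (edges[:-1] + edges[1:])

def model(x, params):
    # hypothetical fit curve: a Gaussian bump with free amplitude, mean and width
    amp, mu, sigma = params
    return amp * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

def neg_log_likelihood(params):
    mu_bins = np.clip(model(centers, params), 1e-12, None)   # Poisson mean per bin
    # -sum_i log Poisson(counts_i | mu_i) = sum_i (mu_i - counts_i*log(mu_i) + log(counts_i!))
    return np.sum(mu_bins - counts * np.log(mu_bins) + gammaln(counts + 1))

res = minimize(neg_log_likelihood, x0=[counts.max(), centers.mean(), 1.0],
               method='Nelder-Mead')
print(res.x)   # fitted amplitude, mean and width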
If I understood correctly, you have data and want to see whether or not some probability distribution fits it.
Well, if that's the case, you need a Q-Q plot. Take a look at this StackOverflow question and answer; however, that one is about the normal distribution, and you need code for the Poisson distribution. All you need to do is create some random data from a Poisson distribution and test your samples against it. Here you can find an example of a Q-Q plot for the Poisson distribution; here's the code from that website:
#! /usr/bin/env python
from pylab import *
p = poisson(lam=10, size=4000)
m = mean(p)
s = std(p)
n = normal(loc=m, scale=s, size=p.shape)
a = m-4*s
b = m+4*s
figure()
plot(sort(n), sort(p), 'o', color='0.85')
plot([a,b], [a,b], 'k-')
xlim(a,b)
ylim(a,b)
xlabel('Normal Distribution')
ylabel('Poisson Distribution with $\lambda=10$')
grid(True)
savefig('qq.pdf')
show()
