Calculating the pdf from a numpy array distribution in Python

Given an array of values, I want to be able to fit a density function to it and find the pdf of an arbitrary input value. Is this possible, and how would I go about it? There aren't necessarily assumptions of normality, and I don't need the function itself.
For instance, given:
x = array([ 0.62529759, -0.08202699, 0.59220673, -0.09074541, 0.05517865,
0.20153703, 0.22773723, -0.26229708, 0.76137555, -0.61229314,
0.27292745, 0.35596795, -0.01373896, 0.32464979, -0.22932331,
1.14796175, 0.17268531, 0.40692172, 0.13846154, 0.22752953,
0.13087359, 0.14111479, -0.09932381, 0.12800392, 0.02605917,
0.18776078, 0.45872642, -0.3943505 , -0.0771418 , -0.38822433,
-0.09171721, 0.23083624, -0.21603973, 0.05425592, 0.47910286,
0.26359565, -0.19917942, 0.40182097, -0.0797546 , 0.47239264,
-0.36654449, 0.4513859 , -0.00282486, -0.13950512, -0.05375369,
0.03331833, 0.48951555, -0.13760504, 2.788 , -0.15017848,
0.02930675, 0.10910646, 0.03868301, -0.048482 , 0.7277376 ,
0.08841259, -0.10968462, 0.50371324, 0.86379698, 0.01674877,
0.19542421, -0.06639165, 0.74500856, -0.10148342, 0.02482331,
0.79195804, 0.40401969, 0.25120005, 0.21020794, -0.01767013,
-0.13453783, -0.09605592, -0.88044229, 0.04689623, 0.09043851,
0.21232286, 0.34129982, -0.3736799 , 0.17313858])
I would like to find how a value of 0.3 compares to all of the above, and what percent of the above values it is greater than.

I personally like using the scipy.stats package. It has a useful implementation of Kernel Density Estimation (KDE). Basically, it estimates a probability density function from your data as a combination of Gaussian (or other) kernels; in scipy's gaussian_kde the kernel is Gaussian and the bandwidth is a parameter you can set. See the documentation and related examples here: https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html#kernel-density-estimation
And for more about KDE: https://en.wikipedia.org/wiki/Kernel_density_estimation
Once you have built the KDE, you can evaluate and integrate it like a PDF. For example, to compute the probability that a value as large as or larger than 0.3 occurs, you would do the following:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
kde = stats.gaussian_kde(np.array(x))
# visualize the KDE
fig = plt.figure()
ax = fig.add_subplot(111)
x_eval = np.linspace(-1, 3, num=200)  # widened to cover the range of the data
ax.plot(x_eval, kde(x_eval), 'k-')
# probability that a value of 0.3 or larger occurs
kde.integrate_box_1d(0.3, np.inf)
TLDR:
Calculate a KDE, then use the KDE as if it were a PDF.
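To answer the last part of the original question (what percent of the values 0.3 is greater than), you can integrate the KDE up to 0.3, or simply compare against the empirical sample; a short sketch:
import numpy as np
from scipy import stats
kde = stats.gaussian_kde(x)
p_below = kde.integrate_box_1d(-np.inf, 0.3)  # P(X <= 0.3) under the KDE
empirical = (x < 0.3).mean()                  # raw fraction of samples below 0.3
print(p_below, empirical)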

You can use OpenTURNS for that: its Gaussian kernel smoothing does this easily. From the doc:
import openturns as ot
kernel = ot.KernelSmoothing()
estimated = kernel.build(x)  # x must be an ot.Sample; see the conversion in the next answer
That's it, now you have a distribution object :)
This library is very cool for statistics! (I am not affiliated with them.)

First we create a Sample from the NumPy array.
Then we compute the complementary CDF with the computeComplementaryCDF method of the distribution (a small improvement over Yoda's answer).
import numpy as np
x = np.array([ 0.62529759, -0.08202699, 0.59220673, -0.09074541, 0.05517865,
0.20153703, 0.22773723, -0.26229708, 0.76137555, -0.61229314,
0.27292745, 0.35596795, -0.01373896, 0.32464979, -0.22932331,
1.14796175, 0.17268531, 0.40692172, 0.13846154, 0.22752953,
0.13087359, 0.14111479, -0.09932381, 0.12800392, 0.02605917,
0.18776078, 0.45872642, -0.3943505 , -0.0771418 , -0.38822433,
-0.09171721, 0.23083624, -0.21603973, 0.05425592, 0.47910286,
0.26359565, -0.19917942, 0.40182097, -0.0797546 , 0.47239264,
-0.36654449, 0.4513859 , -0.00282486, -0.13950512, -0.05375369,
0.03331833, 0.48951555, -0.13760504, 2.788 , -0.15017848,
0.02930675, 0.10910646, 0.03868301, -0.048482 , 0.7277376 ,
0.08841259, -0.10968462, 0.50371324, 0.86379698, 0.01674877,
0.19542421, -0.06639165, 0.74500856, -0.10148342, 0.02482331,
0.79195804, 0.40401969, 0.25120005, 0.21020794, -0.01767013,
-0.13453783, -0.09605592, -0.88044229, 0.04689623, 0.09043851,
0.21232286, 0.34129982, -0.3736799 , 0.17313858])
import openturns as ot
kernel = ot.KernelSmoothing()
sample = ot.Sample(x, 1)  # wrap the 1D array as a Sample of dimension 1
distribution = kernel.build(sample)
q = distribution.computeComplementaryCDF(0.3)
print(q)
which prints:
0.29136124840835353


Plot PDF of Pareto distribution in Python

I have a specific Pareto distribution. For example,
Pareto(beta=0.00317985, alpha=0.147365, gamma=1.0283)
which I obtained from this answer, and now I want to plot a graph of its Probability Density Function (PDF) in matplotlib. I believe the x-axis should span the positive real numbers, and the y-axis should be positive as well.
How exactly can I obtain the appropriate PDF information and plot it? Programmatically obtaining the mathematical PDF function or coordinates is a requirement for this question.
UPDATE:
The drawPDF method returns a Graph object that contains coordinates for the PDF. However, I don't know how to access these coordinates programmatically. I certainly don't want to convert the object to a string nor use a regex to pull out the information:
In [45]: pdfg = distribution.drawPDF()
In [46]: pdfg
Out[46]: class=Graph name=pdf as a function of X0 implementation=class=GraphImplementation name=pdf as a function of X0 title= xTitle=X0 yTitle=PDF axes=ON grid=ON legendposition=topright legendFontSize=1 drawables=[class=Drawable name=Unnamed implementation=class=Curve name=Unnamed derived from class=DrawableImplementation name=Unnamed legend=X0 PDF data=class=Sample name=Unnamed implementation=class=SampleImplementation name=Unnamed size=129 dimension=2 data=[[-1610.7,0],[-1575.83,0],[-1540.96,0],[-1506.09,0],[-1471.22,0],[-1436.35,0],[-1401.48,0],[-1366.61,0],...,[-1331.7,6.95394e-06],[2852.57,6.85646e-06]] color=red fillStyle=solid lineStyle=solid pointStyle=none lineWidth=2]
I assume that you want to perform different tasks:
To plot the PDF
To compute the PDF at a single point
To compute the PDF for a range of values
Each of these tasks requires a different script. Let me detail them.
I first create the Pareto distribution:
import openturns as ot
import numpy as np
beta = 0.00317985
alpha = 0.147365
gamma = 1.0283
distribution = ot.Pareto(beta, alpha, gamma)
print("distribution", distribution)
To plot the PDF, use the drawPDF() method. This creates an ot.Graph which can be viewed directly in Jupyter Notebook or IPython. We can force the creation of the plot with View:
import openturns.viewer as otv
graph = distribution.drawPDF()
otv.View(graph)
This produces a plot of the PDF.
To compute the PDF at a single point, use computePDF(x), where x is an ot.Point. It can also be a Python list, a tuple, or a 1D numpy array, as the conversion is automatically managed by OpenTURNS:
x = 500.0
y = distribution.computePDF(x)
print("y=", y)
The previous script prints:
y= 5.0659235352823877e-05
To compute the PDF for a range of values, use computePDF(x), where x is an ot.Sample. It can also be a Python list of lists or a 2D numpy array, as the conversion is automatically managed by OpenTURNS.
x = ot.Sample([[v] for v in np.linspace(0.0, 1000.0)])
y = distribution.computePDF(x)
print("y=", y)
The previous script prints:
y=
0 : [ 0 ]
1 : [ 0.00210511 ]
[...]
49 : [ 2.28431e-05 ]
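To access the coordinates that drawPDF() computed, as asked in the update, one option is to pull the data out of the Graph object. This is a sketch assuming the PDF curve is the first drawable of the Graph:
import numpy as np
graph = distribution.drawPDF()
curve = graph.getDrawable(0)    # the PDF curve
xy = np.array(curve.getData())  # shape (n, 2): columns are x and pdf(x)
print(xy[:5])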

How to generate a numpy array with a skewed distribution, given the maximum, minimum and number of points to be generated

I am trying to generate latitudes from a latitude boundary (min and max latitudes, plus the number of points).
I have two options for generating this:
using multiple probability distributions and generating x with one of them, as that won't be normal,
or
making a distribution with skew introduced.
The main aim is to get an array of points that is not normally distributed.
I might be making a conceptual mistake here.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import skewnorm
fig, ax = plt.subplots()  # fig and ax were presumably created earlier; added so the snippet runs
def array_gen(lat_min, lat_max):
    new_lat_ = np.linspace(lat_min, lat_max, 1000)
    a = 3
    r = skewnorm.rvs(a, size=1000)
    # not sure how to use the new_lat_
    arr = skewnorm.pdf(new_lat_, *skewnorm.fit(new_lat_))
    ax.hist(r, density=True, histtype='stepfilled', alpha=0.2)
    ax.hist(arr, density=True, histtype='stepfilled', alpha=0.2)
    return arr
array = array_gen(12.786963, 13.140868)
fig  # displays the figure in a notebook
I expect a skewed distribution, and I do get one, but it isn't in the range I expected. Expected range: (12.786963, 13.140868). What I get instead looks like this:
0.43939953, 0.75707352, 0.63534797, 0.40377254, 0.27907808,
0.23454434, 0.11875663, 0.07422289, 0.02375133, 0.00296892
-0.1491852 , 0.1876381 , 0.5244614 , 0.8612847 , 1.198108,
1.53493129, 1.87175459, 2.20857789, 2.54540119, 2.88222449,
3.21904779
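One way to get skewed samples that actually land inside the target range is to rescale the skewnorm draws into (min, max). A minimal sketch, where the helper name and the min-max rescaling choice are my own, not from the original post:
import numpy as np
from scipy.stats import skewnorm
def skewed_in_range(lo, hi, n=1000, a=3):
    r = skewnorm.rvs(a, size=n)  # skewed, but on skewnorm's own scale
    # min-max rescale into [lo, hi]
    return lo + (r - r.min()) * (hi - lo) / (r.max() - r.min())
samples = skewed_in_range(12.786963, 13.140868)
print(samples.min(), samples.max())  # endpoints of the expected range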

QQ-plot python mean and standard deviation

I am trying to plot a Q-Q plot using python. I was checking scipy.stats.probplot, and the input seems to be the measurements compared against a normal distribution.
import numpy as np
import pylab
import scipy.stats as stats
measurements = np.random.normal(loc = 20, scale = 5, size=100)
stats.probplot(measurements, dist="norm", plot=pylab)
pylab.show()
and in my code, I had
stats.probplot(mean, dist="norm", plot=plt)
to compare distributions.
But I am wondering: where can I input the standard deviation? I thought that's a very important factor when comparing distributions, but so far I can only input the mean.
Thanks
Let's suppose you have a list of floats:
X = [-1.31, 4.82, 2.18, 1.99, 4.37, 2.58, 7.22, 3.93, 6.95, 2.41,
     2.02, 2.48, -1.01, 2.3, 2.87, -0.06, 2.13, 3.62, 5.24, 0.57]
If you want to make a QQ-plot test, you need to compare X against a distribution.
For example: N(0, 1), a normal distribution with mean = 0 and sigma = 1.
In OpenTURNS, it goes like this:
import openturns as ot
from openturns.viewer import View
sample = ot.Sample([[p] for p in X])
graph = ot.VisualTest.DrawQQplot(sample, ot.Normal(0,1))
View(graph)
Explanation: I tell OpenTURNS that I have a sample of 20 points [p] coming from X, not one point in dimension 20. Then I call ot.VisualTest.DrawQQplot with 2 arguments: the sample and the Normal(0, 1) distribution, ot.Normal(0,1).
We can see on the resulting graph that the test fails.
The question now is: what is the best normal distribution fitting the sample?
Thanks to NormalFactory() the answer is simple:
BestNormalDistribution = ot.NormalFactory().build(sample)
If you print(BestNormalDistribution) you get the parameters of this distribution:
Normal(mu = 2.76832, sigma = 2.27773)
If we repeat the QQ-plot test of the sample against BestNormalDistribution, the fit is much better.
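The repeated test is the same call with the fitted distribution:
graph = ot.VisualTest.DrawQQplot(sample, BestNormalDistribution)
View(graph)  # the points now fall much closer to the diagonal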

How to properly sample truncated distributions?

I am trying to learn how to sample truncated distributions. To begin with, I decided to try a simple example I found here: example.
I didn't really understand the division by the CDF, so I decided to tweak the algorithm a bit. Being sampled is an exponential distribution for values x > 0. Here is an example python code:
# Sample exponential distribution for the case x>0
import numpy as np
import matplotlib.pyplot as plt
def pdf(x):
    return x*np.exp(-x)
xvec = np.zeros(1000000)
x = 1.
for i in range(1000000):
    a = x + np.random.normal()  # propose a new point
    xs = x
    if a > 0.:
        xs = a
    A = pdf(xs)/pdf(x)  # Metropolis acceptance ratio
    if np.random.uniform() < A:
        x = xs
    xvec[i] = x
x = np.linspace(0, 15, 1000)
plt.plot(x, pdf(x))
plt.hist([x for x in xvec if x != 0], bins=150, density=True)  # 'normed' was removed in newer matplotlib
plt.show()
And the output is:
The code above seems to work fine only when using the condition if a > 0.: (i.e. positive x); choosing another condition (e.g. if a > 0.5:) produces wrong results.
Since my final goal was to sample a 2D Gaussian pdf on a truncated interval, I tried extending the simple example using the exponential distribution (see the code below). Unfortunately, since the simple case didn't work, I assume the code given below would also yield wrong results.
I assume that all this can be done using the advanced tools of python. However, since my primary idea was to understand the principle behind it, I would greatly appreciate your help in understanding my mistake.
Thank you for your help.
EDIT:
# code updated according to the answer of CrazyIvan
import numpy as np
RANGE = 100000
a = 2.06072E-02
b = 1.10011E+00
a_range = [0.001, 0.5]
b_range = [0.01, 2.5]
cov = [[3.1313994E-05, 1.8013737E-03], [1.8013737E-03, 1.0421529E-01]]
x = a
y = b
j = 0
for i in range(RANGE):
    a_t, b_t = np.random.multivariate_normal([a, b], cov)
    # accept if within bounds - all that is needed to truncate
    if a_range[0] < a_t < a_range[1] and b_range[0] < b_t < b_range[1]:
        print(a_t, b_t)  # the original printed undefined dx, dy; printing the accepted pair here
EDIT:
I changed the code by norming the analytic pdf according to this scheme, following the answers given by @CrazyIvan and @Leandro Caniglia, for the case where the bottom of the pdf is removed: that is, dividing by (1 - CDF(0.5)), since my accept condition is x > 0.5. This again seems to show some discrepancies. Again the mystery prevails.
import numpy as np
import matplotlib.pyplot as plt
def pdf(x):
    return x*np.exp(-x)
# included the corresponding cdf
def cdf(x):
    return 1. - np.exp(-x) - x*np.exp(-x)
xvec = np.zeros(1000000)
x = 1.
for i in range(1000000):
    a = x + np.random.normal()
    xs = x
    if a > 0.5:
        xs = a
    A = pdf(xs)/pdf(x)
    if np.random.uniform() < A:
        x = xs
    xvec[i] = x
x = np.linspace(0, 15, 1000)
# new part: norm the analytic pdf to fix the area
plt.plot(x, pdf(x)/(1. - cdf(0.5)))
plt.hist([x for x in xvec if x != 0], bins=200, density=True)  # 'normed' was removed in newer matplotlib
plt.savefig("test_exp.png")
plt.show()
It seems that this can be cured by choosing a larger shift size:
shift = 15.
a = x + np.random.normal()*shift
which is in general an issue with Metropolis-Hastings. See the graph below:
I also checked shift = 150.
The bottom line is that changing the shift size definitely improves the convergence. The mystery is why, since the Gaussian is unbounded.
You say you want to learn the basic idea of sampling a truncated distribution, but your source is a blog post about
Metropolis–Hastings algorithm? Do you actually need this "method for obtaining a sequence of random samples from a probability distribution for which direct sampling is difficult"? Taking this as your starting point is like learning English by reading Shakespeare.
Truncated normal
For a truncated normal, basic rejection sampling is all you need: generate samples from the original distribution, reject those outside of bounds. As Leandro Caniglia noted, you should not expect the truncated distribution to have the same PDF except on a shorter interval: that is plainly impossible, because the area under the graph of a PDF is always 1. If you cut off stuff from the sides, there has to be more in the middle; the PDF gets rescaled.
It's quite inefficient to gather samples one by one when you need 100000. I would grab 100000 normal samples at once, accept only those that fit, and repeat until I have enough. An example of sampling a truncated normal between amin and amax:
import numpy as np
n_samples = 100000
amin, amax = -1, 2
samples = np.zeros((0,))  # empty for now
while samples.shape[0] < n_samples:
    s = np.random.normal(0, 1, size=(n_samples,))
    accepted = s[(s >= amin) & (s <= amax)]
    samples = np.concatenate((samples, accepted), axis=0)
samples = samples[:n_samples]  # we probably got more than needed, so discard the extra ones
And here is the comparison with the PDF curve, rescaled by division by cdf(amax) - cdf(amin) as explained above.
import matplotlib.pyplot as plt
from scipy.stats import norm
_ = plt.hist(samples, bins=50, density=True)
t = np.linspace(-2, 3, 500)
plt.plot(t, norm.pdf(t)/(norm.cdf(amax) - norm.cdf(amin)), 'r')
plt.show()
Truncated multivariate normal
Now we want to keep the first coordinate between amin and amax, and the second between bmin and bmax. Same story, except there will be a 2-column array and the comparison with bounds is done in a relatively sneaky way:
(np.min(s - [amin, bmin], axis=1) >= 0) & (np.max(s - [amax, bmax], axis=1) <= 0)
This means: subtract amin, bmin from each row and keep only the rows where both results are nonnegative (meaning we had a >= amin and b >= bmin). Also do a similar thing with amax, bmax. Accept only the rows that meet both criteria.
n_samples = 10
amin, amax = -1, 2
bmin, bmax = 0.2, 2.4
mean = [0.3, 0.5]
cov = [[2, 1.1], [1.1, 2]]
samples = np.zeros((0, 2))  # 2 columns now
while samples.shape[0] < n_samples:
    s = np.random.multivariate_normal(mean, cov, size=(n_samples,))
    accepted = s[(np.min(s - [amin, bmin], axis=1) >= 0) & (np.max(s - [amax, bmax], axis=1) <= 0)]
    samples = np.concatenate((samples, accepted), axis=0)
samples = samples[:n_samples, :]
Not going to plot, but here are some values: naturally, within bounds.
array([[ 0.43150033, 1.55775629],
[ 0.62339265, 1.63506963],
[-0.6723598 , 1.58053835],
[-0.53347361, 0.53513105],
[ 1.70524439, 2.08226558],
[ 0.37474842, 0.2512812 ],
[-0.40986396, 0.58783193],
[ 0.65967087, 0.59755193],
[ 0.33383214, 2.37651975],
[ 1.7513789 , 1.24469918]])
To compute the truncated density function pdf_t from the entire density function pdf, do the following (a code sketch follows this list):
Let [a, b] be the truncation interval (on the x axis);
Let A := cdf(a) and B := cdf(b), where cdf is the non-truncated cumulative distribution function;
Then pdf_t(x) := pdf(x) / (B - A) if x is in [a, b], and 0 elsewhere.
In cases where a = -infinity (resp. b = +infinity), take A := 0 (resp. B := 1).
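Here is a minimal sketch of these steps for a standard normal; the helper name truncated_pdf is my own:
import numpy as np
from scipy.stats import norm
def truncated_pdf(x, a, b, dist=norm):
    # pdf of dist truncated to [a, b]: rescaled inside, zero elsewhere
    A, B = dist.cdf(a), dist.cdf(b)
    x = np.asarray(x, dtype=float)
    return np.where((x >= a) & (x <= b), dist.pdf(x) / (B - A), 0.0)
print(truncated_pdf([-2.0, 0.0, 1.0, 3.0], a=-1, b=2))  # zero outside [-1, 2]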
As for the "mystery" you see: please note that your blue curve is wrong. It is not the pdf of your truncated distribution; it is just the pdf of the non-truncated one, scaled by the correct amount (division by 1 - cdf(0.5)). The actual truncated pdf curve starts with a vertical line at x = 0.5 which goes up until it reaches your current blue curve. In other words, you only scaled the curve but forgot to truncate it, in this case to the left. Such a truncation corresponds to the "0 elsewhere" part of step 3 in the algorithm above.
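A short sketch of the corrected plot, reusing pdf, cdf and xvec from the question's code, which zeroes the curve below the cutoff:
t = np.linspace(0, 15, 1000)
pdf_t = np.where(t >= 0.5, pdf(t)/(1. - cdf(0.5)), 0.0)  # truncate AND rescale
plt.plot(t, pdf_t)
plt.hist([v for v in xvec if v != 0], bins=200, density=True)
plt.show()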

Fitting a pareto distribution with (python) Scipy

I have a data set that I know has a Pareto distribution. Can someone point me to how to fit this data set in Scipy? I got the below code to run but I have no idea what is being returned to me (a,b,c). Also, after obtaining a,b,c, how do I calculate the variance using them?
import scipy.stats as ss
import scipy as sp
a,b,c=ss.pareto.fit(data)
Be very careful fitting power laws!! Many reported power laws are actually badly fitted by a power law. See Clauset et al. for all the details (also on arXiv if you don't have access to the journal). They have a companion website for the article which now links to a Python implementation. I don't know whether it uses Scipy, because I used their R implementation when I last used it.
Here's a quickly written version, taking some hints from the reference page that Rupert gave.
This is currently work in progress in scipy and statsmodels; it requires MLE with some fixed or frozen parameters, which is only available in the trunk versions.
No standard errors on the parameter estimates or other result statistics are available yet.
'''estimating pareto with 3 parameters (shape, loc, scale) with nested
minimization, MLE inside minimizing Kolmogorov-Smirnov statistic
running some examples looks good
Author: josef-pktd
'''
import numpy as np
from scipy import stats, optimize
# the following adds my frozen fit method to the distributions
# scipy trunk also has a fit method with some parameters fixed.
import scikits.statsmodels.sandbox.stats.distributions_patch
true = (0.5, 10, 1.)  # try different values
shape, loc, scale = true
rvs = stats.pareto.rvs(shape, loc=loc, scale=scale, size=1000)
rvsmin = rvs.min()  # for starting value to fmin
def pareto_ks(loc, rvs):
    # fit shape and scale with loc frozen, then score the fit with the KS statistic
    est = stats.pareto.fit_fr(rvs, 1., frozen=[np.nan, loc, np.nan])
    args = (est[0], loc, est[1])
    return stats.kstest(rvs, 'pareto', args)[0]
# minimize the KS statistic over loc, with the MLE nested inside
locest = optimize.fmin(pareto_ks, rvsmin*0.7, (rvs,))
est = stats.pareto.fit_fr(rvs, 1., frozen=[np.nan, locest, np.nan])
args = (est[0], locest[0], est[1])
print('estimate')
print(args)
print('kstest')
print(stats.kstest(rvs, 'pareto', args))
print('estimation error', args - np.array(true))
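On the variance part of the question: scipy's pareto.fit returns (shape, loc, scale), so the a, b, c in the question are those three parameters. A sketch of reading off the variance from them, assuming data is the sample array from the question (note the Pareto variance is infinite when the shape parameter is <= 2):
import scipy.stats as ss
shape, loc, scale = ss.pareto.fit(data)  # the question's a, b, c
variance = ss.pareto.var(shape, loc=loc, scale=scale)
print(variance)  # inf if shape <= 2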
Let's say your data is formatted like this:
import openturns as ot
data = [
[2.7018013],
[8.53280352],
[1.15643882],
[1.03359467],
[1.53152735],
[32.70434285],
[12.60709624],
[2.012235],
[1.06747063],
[1.41394096],
]
sample = ot.Sample(data)  # data is already a list of one-element lists, so no extra wrapping is needed
You can easily fit a Pareto distribution using the ParetoFactory of the OpenTURNS library:
distribution = ot.ParetoFactory().build(sample)
You can of course print it:
print(distribution)
>>> Pareto(beta = 0.00317985, alpha=0.147365, gamma=1.0283)
or plot its PDF:
from openturns.viewer import View
pdf_graph = distribution.drawPDF()
pdf_graph.setTitle(str(distribution))
View(pdf_graph, add_legend=False)
More details on the ParetoFactory are provided in the documentation.
If your data is a flat list of floats, make sure to convert it this way before passing it to the build() function in OpenTURNS:
data = [[i] for i in data]
because the Sample() constructor expects one point per observation and may otherwise return an error.
FYI @Tropilio
