Plot PDF of Pareto distribution in Python - python

I have a specific Pareto distribution. For example,
Pareto(beta=0.00317985, alpha=0.147365, gamma=1.0283)
which I obtained from this answer and now I want to plot a graph of its Probability Density Function (PDF) in matplotlib. So I believe that the x-axis will be all positive real numbers, and the y-axis will be the same.
How exactly can I obtain the appropriate PDF information and plot it? Programmatically obtaining the mathematical PDF function or coordinates is a requirement for this question.
UPDATE:
The drawPDF method returns a Graph object that contains coordinates for the PDF. However, I don't know how to access these coordinates programmatically. I certainly don't want to convert the object to a string nor use a regex to pull out the information:
In [45]: pdfg = distribution.drawPDF()
In [46]: pdfg
Out[46]: class=Graph name=pdf as a function of X0 implementation=class=GraphImplementation name=pdf as a function of X0 title= xTitle=X0 yTitle=PDF axes=ON grid=ON legendposition=topright legendFontSize=1
drawables=[class=Drawable name=Unnamed implementation=class=Curve name=Unnamed derived from class=DrawableImplementation name=Unnamed legend=X0 PDF data=class=Sample name=Unnamed implementation=class=Sam
pleImplementation name=Unnamed size=129 dimension=2 data=[[-1610.7,0],[-1575.83,0],[-1540.96,0],[-1506.09,0],[-1471.22,0],[-1436.35,0],[-1401.48,0],[-1366.61,0],...,[-1331.7,6.95394e-06],[2852.57,6.85646e-06]] color
=red fillStyle=solid lineStyle=solid pointStyle=none lineWidth=2]

I assume that you want to perform different tasks:
To plot the PDF
To compute the PDF at a single point
To compute the PDF for a range of values
Each of these tasks requires a different script, detailed below.
I first create the Pareto distribution:
import openturns as ot
import numpy as np
beta = 0.00317985
alpha = 0.147365
gamma = 1.0283
distribution = ot.Pareto(beta, alpha, gamma)
print("distribution", distribution)
To plot the PDF, use the drawPDF() method. This creates an ot.Graph, which can be viewed directly in a Jupyter Notebook or IPython. We can force the creation of the plot with View:
import openturns.viewer as otv
graph = distribution.drawPDF()
otv.View(graph)
This plots: [figure: PDF as a function of X0]
To compute the PDF at a single point, use computePDF(x), where x is an ot.Point(). This can also be a Python list, a tuple or a 1D numpy array, as the conversion is managed automatically by OpenTURNS:
x = 500.0
y = distribution.computePDF(x)
print("y=", y)
The previous script prints:
y= 5.0659235352823877e-05
To compute the PDF for a range of values, use computePDF(x), where x is an ot.Sample(). This can also be a Python list of lists or a 2D numpy array, as the conversion is managed automatically by OpenTURNS:
x = ot.Sample([[v] for v in np.linspace(0.0, 1000.0)])
y = distribution.computePDF(x)
print("y=", y)
The previous script prints:
y=
0 : [ 0 ]
1 : [ 0.00210511 ]
[...]
49 : [ 2.28431e-05 ]
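As a cross-check that does not need OpenTURNS, the same density can be evaluated with NumPy alone. This is only a sketch: it assumes that ot.Pareto(beta, alpha, gamma) has the density alpha * beta**alpha / (x - gamma)**(alpha + 1) for x >= beta + gamma and 0 elsewhere, which is consistent with the values printed above. The name pareto_pdf is made up for this example.

```python
import numpy as np

beta = 0.00317985
alpha = 0.147365
gamma = 1.0283


def pareto_pdf(x, beta, alpha, gamma):
    """Hypothetical helper: closed-form PDF under the assumed parametrization."""
    x = np.asarray(x, dtype=float)
    pdf = np.zeros_like(x)
    inside = x >= beta + gamma  # the density is zero below beta + gamma
    pdf[inside] = alpha * beta**alpha / (x[inside] - gamma) ** (alpha + 1.0)
    return pdf


x = np.linspace(0.0, 1000.0)  # same 50 points as in the script above
y = pareto_pdf(x, beta, alpha, gamma)
print("pdf(500) =", pareto_pdf([500.0], beta, alpha, gamma)[0])
```

The x and y arrays can then be passed to matplotlib, for example plt.plot(x, y). If OpenTURNS is installed, graph.getDrawable(0).getData() should give the plotted coordinates as an ot.Sample instead, which np.array() can convert.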

Related

How to generate numpy array with skewed distribution given maximum,minimum and number of points to be generated

I am trying to generate latitudes from a latitude boundary (minimum and maximum latitudes, and the number of points).
I have two choices for generating them:
Use multiple probability distributions and generate x with one of them, as that won't be normal
Or
Make a distribution with skew introduced.
The main aim is to get an array of points that is not normally distributed.
I might be making a conceptual mistake here.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import skewnorm
fig, ax = plt.subplots(1, 1)
def array_gen(min,max):
    new_lat_ = np.linspace(min,max,1000)
    a = 3
    r = skewnorm.rvs(a, size=1000)
    #not sure how to use the new_lat_
    arr = skewnorm.pdf(new_lat_, *skewnorm.fit(new_lat_))
    ax.hist(r, density=True,histtype='stepfilled', alpha=0.2)
    ax.hist(arr, density=True,histtype='stepfilled', alpha=0.2)
    return arr
array = array_gen(12.786963,13.140868)
fig
I expect a skewed distribution, and I get one, but it isn't in the range I expected. Expected range = (12.786963, 13.140868)
0.43939953, 0.75707352, 0.63534797, 0.40377254, 0.27907808,
0.23454434, 0.11875663, 0.07422289, 0.02375133, 0.00296892
-0.1491852 , 0.1876381 , 0.5244614 , 0.8612847 , 1.198108,
1.53493129, 1.87175459, 2.20857789, 2.54540119, 2.88222449,
3.21904779
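The code above evaluates skewnorm.pdf on the latitudes, which returns density values rather than latitudes, and this is presumably why the output falls outside the expected range. One possible way to get skewed values that stay inside the boundary is sketched below. It swaps skewnorm for a Beta distribution rescaled to [lo, hi], which is an assumption about what is acceptable here; skewed_points and its default shape parameters are hypothetical.

```python
import numpy as np


def skewed_points(lo, hi, n, a=2.0, b=5.0, seed=0):
    """Hypothetical helper: n skewed values guaranteed to lie in [lo, hi]."""
    rng = np.random.default_rng(seed)
    unit = rng.beta(a, b, size=n)  # Beta samples lie in [0, 1]; a < b skews them toward 0
    return lo + (hi - lo) * unit   # affine map from [0, 1] onto [lo, hi]


lats = skewed_points(12.786963, 13.140868, 1000)
print(lats.min(), lats.max())
```

Swapping a and b would move the bulk of the points toward the upper bound instead.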

Querying data in pandas where points are grouped by a hexbin function

Both seaborn and pandas provide APIs to plot bivariate histograms as a hexbin plot (example plotted below). However, I am looking for a way to query the points that are located in the same hexbin. Is there a function to retrieve the rows associated with the data points in a hexbin?
To give an example:
My data frame contains 3 columns: A, B and C. I use sns.jointplot(x=A,y=B) to plot the density. Now, I want to execute a query on the data points located in the same bin. For instance, for each bin, compute the mean of the C values associated with its points.
Current solution -- Quick Hack
Currently, I have implemented the following function to apply a function to the data associated with a (x,y) coordinate located in the same hexbin:
def hexagonify(x, y, values, func=None):
    hexagonized_list = []
    fig = plt.figure()
    fig.set_visible(False)
    if func is not None:
        image = plt.hexbin(x=x, y=y, C=values, reduce_C_function=func)
    else:
        image = plt.hexbin(x=x, y=y, C=values)
    values = image.get_array()
    verts = image.get_offsets()
    for offc in range(verts.shape[0]):
        binx, biny = verts[offc][0], verts[offc][1]
        val = values[offc]
        if val:
            hexagonized_list.append((binx, biny, val))
    fig.clear()
    plt.close(fig)
    return hexagonized_list
The values (with the same size as x or y) are passed through the values parameter. The hexbins are computed by the hexbin function of matplotlib. The aggregated values are retrieved through the get_array() method of the returned PolyCollection. By default, the np.mean function is applied to the accumulated values per bin. This behavior can be changed by providing a function to the func parameter. Subsequently, the get_offsets() method gives us the centers of the bins (discussed here). In this way, we can associate the mean (by default) of the provided values with each hexbin. However, this solution is a hack, so any improvements are welcome.
From matplotlib
If you have already drawn the plot, you can get Bin Counts from polycollection returned by matplotlib:
polycollection: A PolyCollection instance; use PolyCollection.get_array on this to get the counts in each hexagon.
This functionality is also available in:
matplotlib.pyplot.hist2d;
numpy.histogram2d;
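As a rough illustration of the numpy.histogram2d route, the sketch below computes the mean of a third variable per bin by dividing a weighted histogram by the plain counts. Note that it uses rectangular bins, not hexagons, and that the random data and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = rng.normal(size=1000)
c = rng.random(size=1000)

# Number of points per rectangular bin
counts, xedges, yedges = np.histogram2d(x, y, bins=3)
# Sum of c per bin, reusing the same edges so the bins line up
sums, _, _ = np.histogram2d(x, y, bins=[xedges, yedges], weights=c)

with np.errstate(invalid="ignore", divide="ignore"):
    mean_c = sums / counts  # NaN where a bin is empty

print(mean_c)
```

Other reductions (max, median, ...) cannot be expressed through weights, so they would need an explicit grouping step.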
Pure pandas
Here is an MCVE using only pandas that can handle the C property:
import numpy as np
import pandas as pd
# Trial Dataset:
N=1000
d = np.array([np.random.randn(N), np.random.randn(N), np.random.rand(N)]).T
df = pd.DataFrame(d, columns=['x', 'y', 'c'])
# Create bins:
df['xb'] = pd.cut(df.x, 3)
df['yb'] = pd.cut(df.y, 3)
# Group by and Aggregate:
p = df.groupby(['xb', 'yb']).agg('mean')['c']
p.unstack()
First we create bins using pandas.cut. Then we group by and aggregate. You can pick whichever agg function you like to aggregate C (e.g. max, median, etc.).
The output looks like:
yb               (-2.857, -0.936]  (-0.936, 0.98]  (0.98, 2.895]
xb
(-2.867, -0.76]          0.454424        0.519920       0.507443
(-0.76, 1.34]            0.535930        0.484818       0.513158
(1.34, 3.441]            0.441094        0.493657       0.385987
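To get at the rows themselves rather than only an aggregate, one option is to store an integer bin index per row and filter on it. The following is a sketch under that assumption; the column names xi and yi are invented for the example, and the bins are again rectangular rather than hexagonal.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.normal(size=1000),
    "y": rng.normal(size=1000),
    "c": rng.random(size=1000),
})

# labels=False makes pd.cut return integer bin indices instead of intervals
df["xi"] = pd.cut(df["x"], 3, labels=False)
df["yi"] = pd.cut(df["y"], 3, labels=False)

# All rows whose point falls in the central bin (index 1 along both axes)
central = df[(df["xi"] == 1) & (df["yi"] == 1)]
print(len(central), central["c"].mean())

# The same grouping gives the aggregate for every bin at once
per_bin_mean = df.groupby(["xi", "yi"])["c"].mean()
```

Any per-row query can then be run on central, or on each group of the groupby object.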

Python - calculating pdf from a numpy array distribution

Given an array of values, I want to be able to fit a density function to it and find the pdf of an arbitrary input value. Is this possible, and how would I go about it? There aren't necessarily assumptions of normality, and I don't need the function itself.
For instance, given:
x = array([ 0.62529759, -0.08202699, 0.59220673, -0.09074541, 0.05517865,
0.20153703, 0.22773723, -0.26229708, 0.76137555, -0.61229314,
0.27292745, 0.35596795, -0.01373896, 0.32464979, -0.22932331,
1.14796175, 0.17268531, 0.40692172, 0.13846154, 0.22752953,
0.13087359, 0.14111479, -0.09932381, 0.12800392, 0.02605917,
0.18776078, 0.45872642, -0.3943505 , -0.0771418 , -0.38822433,
-0.09171721, 0.23083624, -0.21603973, 0.05425592, 0.47910286,
0.26359565, -0.19917942, 0.40182097, -0.0797546 , 0.47239264,
-0.36654449, 0.4513859 , -0.00282486, -0.13950512, -0.05375369,
0.03331833, 0.48951555, -0.13760504, 2.788 , -0.15017848,
0.02930675, 0.10910646, 0.03868301, -0.048482 , 0.7277376 ,
0.08841259, -0.10968462, 0.50371324, 0.86379698, 0.01674877,
0.19542421, -0.06639165, 0.74500856, -0.10148342, 0.02482331,
0.79195804, 0.40401969, 0.25120005, 0.21020794, -0.01767013,
-0.13453783, -0.09605592, -0.88044229, 0.04689623, 0.09043851,
0.21232286, 0.34129982, -0.3736799 , 0.17313858])
I would like to find how a value of 0.3 compares to all of the above, and what percent of the above values it is greater than.
I personally like using the scipy.stats package. It has a useful implementation of Kernel Density Estimation. Basically, it estimates the probability density function of the data using a combination of Gaussian (or other) distributions. Which distributions are used is a parameter you can set. Look at the documentation and related examples here: https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html#kernel-density-estimation
And for more about KDE: https://en.wikipedia.org/wiki/Kernel_density_estimation
Once you have built your KDE, then you can perform the same operations on it to get probabilities. For example, if you want to calculate the probability that a value occurs that is as large or larger than 0.3 you would do the following:
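import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# x is the array of values from the question
kde = stats.gaussian_kde(np.array(x))
# visualize the KDE
fig = plt.figure()
ax = fig.add_subplot(111)
x_eval = np.linspace(-.2, .2, num=200)
ax.plot(x_eval, kde(x_eval), 'k-')
# probability of a value of 0.3 or larger
kde.integrate_box_1d(0.3, np.inf)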
TLDR:
Calculate a KDE, then use the KDE as if it were a PDF.
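To make the idea concrete, here is a hand-rolled sketch of a 1D Gaussian KDE using only NumPy and the standard library. It assumes Scott's rule for the bandwidth (which, as far as I know, is also the default in scipy.stats.gaussian_kde), uses synthetic data as a stand-in for the array in the question, and gaussian_kde_1d is a hypothetical name.

```python
import math
import numpy as np


def gaussian_kde_1d(data):
    """Hypothetical helper: return (pdf, cdf) callables for a 1D Gaussian KDE."""
    data = np.asarray(data, dtype=float)
    n = data.size
    h = data.std(ddof=1) * n ** (-1.0 / 5.0)  # Scott's rule bandwidth (assumed)

    def pdf(v):
        # Average of Gaussian bumps of width h centred on each data point
        z = (v - data) / h
        return np.mean(np.exp(-0.5 * z**2)) / (h * math.sqrt(2.0 * math.pi))

    def cdf(v):
        # Average of the Gaussian CDFs of the same bumps
        z = (v - data) / (h * math.sqrt(2.0))
        return np.mean([0.5 * (1.0 + math.erf(t)) for t in z])

    return pdf, cdf


rng = np.random.default_rng(0)
sample = rng.normal(loc=0.1, scale=0.4, size=200)  # stand-in for the array x
pdf, cdf = gaussian_kde_1d(sample)
print("density at 0.3:", pdf(0.3))
print("estimated share of values below 0.3:", cdf(0.3))
print("empirical share of values below 0.3:", np.mean(sample < 0.3))
```

The last line answers the "what percent of the values is it greater than" part directly from the data, without any smoothing, which may be all that is needed.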
You can use openTURNS for that. You can use a Gaussian kernel smoothing to do that easily! From the doc:
import openturns as ot
kernel = ot.KernelSmoothing()
estimated = kernel.build(x)
That's it, now you have a distribution object :)
This library is very cool for statistics! (I am not related to them).
We first have to create the Sample from the Numpy array.
Then we compute the complementary CDF with the computeComplementaryCDF method of the distribution (a small improvement over Yoda's answer).
import numpy as np
x = np.array([ 0.62529759, -0.08202699, 0.59220673, -0.09074541, 0.05517865,
0.20153703, 0.22773723, -0.26229708, 0.76137555, -0.61229314,
0.27292745, 0.35596795, -0.01373896, 0.32464979, -0.22932331,
1.14796175, 0.17268531, 0.40692172, 0.13846154, 0.22752953,
0.13087359, 0.14111479, -0.09932381, 0.12800392, 0.02605917,
0.18776078, 0.45872642, -0.3943505 , -0.0771418 , -0.38822433,
-0.09171721, 0.23083624, -0.21603973, 0.05425592, 0.47910286,
0.26359565, -0.19917942, 0.40182097, -0.0797546 , 0.47239264,
-0.36654449, 0.4513859 , -0.00282486, -0.13950512, -0.05375369,
0.03331833, 0.48951555, -0.13760504, 2.788 , -0.15017848,
0.02930675, 0.10910646, 0.03868301, -0.048482 , 0.7277376 ,
0.08841259, -0.10968462, 0.50371324, 0.86379698, 0.01674877,
0.19542421, -0.06639165, 0.74500856, -0.10148342, 0.02482331,
0.79195804, 0.40401969, 0.25120005, 0.21020794, -0.01767013,
-0.13453783, -0.09605592, -0.88044229, 0.04689623, 0.09043851,
0.21232286, 0.34129982, -0.3736799 , 0.17313858])
import openturns as ot
kernel = ot.KernelSmoothing()
sample = ot.Sample(x,1)
distribution = kernel.build(sample)
q = distribution.computeComplementaryCDF(0.3)
print(q)
which prints:
0.29136124840835353

Python: Get values of array which correspond to contour lines

Is there a way to extract the data from an array, which corresponds to a line of a contourplot in python? I.e. I have the following code:
n = 100
x, y = np.mgrid[0:1:n*1j, 0:1:n*1j]
plt.contour(x,y,values)
where values is a 2d array with data (I stored the data in a file but it seems not to be possible to upload it here). The picture below shows the corresponding contourplot. My question is, if it is possible to get exactly the data from values, which corresponds e.g. to the left contourline in the plot?
Worth noting here, since this post was the top hit when I had the same question, that this can be done with scikit-image much more simply than with matplotlib. I'd encourage you to check out skimage.measure.find_contours. A snippet of their example:
import numpy as np
from skimage import measure
x, y = np.ogrid[-np.pi:np.pi:100j, -np.pi:np.pi:100j]
r = np.sin(np.exp((np.sin(x)**3 + np.cos(y)**2)))
contours = measure.find_contours(r, 0.8)
which can then be plotted/manipulated as you need. I like this more because you don't have to get into the deep weeds of matplotlib.
plt.contour returns a QuadContourSet (called cs here). From that, we can access the individual lines using:
cs.collections[0].get_paths()
This returns all the individual paths. To access the actual x, y locations, we need to look at the vertices attribute of each path. The first contour drawn should be accessible using:
X, Y = cs.collections[0].get_paths()[0].vertices.T
See the example below to see how to access any of the given lines. In the example I only access the first one:
import matplotlib.pyplot as plt
import numpy as np
n = 100
x, y = np.mgrid[0:1:n*1j, 0:1:n*1j]
values = x**0.5 * y**0.5
fig1, ax1 = plt.subplots(1)
cs = plt.contour(x, y, values)
lines = []
for line in cs.collections[0].get_paths():
    lines.append(line.vertices)
fig1.savefig('contours1.png')
fig2, ax2 = plt.subplots(1)
ax2.plot(lines[0][:, 0], lines[0][:, 1])
fig2.savefig('contours2.png')
contours1.png: [figure: the full contour plot]
contours2.png: [figure: the first contour line only]
plt.contour returns a QuadContourSet which holds the data you're after.
See Get coordinates from the contour in matplotlib? (which this question is probably a duplicate of...)
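In newer Matplotlib releases the collections attribute of a contour set is, as far as I know, deprecated, so a variant that reads the vertices from the allsegs attribute may be more portable. The sketch below assumes a single, explicitly chosen level of 0.5 and reuses the test function from the example above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

n = 100
x, y = np.mgrid[0:1:n*1j, 0:1:n*1j]
values = x**0.5 * y**0.5

fig, ax = plt.subplots()
cs = ax.contour(x, y, values, levels=[0.5])  # one explicitly chosen level

# allsegs[i] is the list of (N, 2) vertex arrays for level i
segment = cs.allsegs[0][0]
xs, ys = segment[:, 0], segment[:, 1]
print(segment.shape)
plt.close(fig)
```

The vertices are interpolated between grid points, so the data "on the line" is, by construction, approximately the contour level itself; xs and ys give where that level is reached.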

IPython FITS file plotting gives different results

I am facing a problem while running a script ( please find the code below ).
I am trying to plot an array of values, write it into a FITS file format, read it back again and plot it ---> I don't get the same plots!
If you could please help me with this it would be great.
The following are the versions of my packages and compiler:
matplotlib : '2.0.0b1'
numpy : '1.11.0'
astropy : u'1.1.2'
python : 2.7
Sincerely,
Anik Halder
import numpy as np
from pylab import *
from astropy.io import fits
# Just making a 10x10 meshgrid
x = np.arange(10)
X , Y = np.meshgrid(x,x)
# finding the distance of different points on the meshgrid from a point suppose at (5,5)
Z = ((X-5)**2 + (Y-5)**2)**0.5
# plotting Z (see image [link below] - left one)
imshow(Z, origin = "lower")
colorbar()
show()
# writing the Z data into a fits file
fits.writeto("my_file.fits", Z)
# reading the same fits file and storing the data
Z_read = fits.open("my_file.fits")[0].data
# plotting Z_read : we expect it to show the same plot as before
imshow(Z_read, origin = "lower")
colorbar()
show()
# Lo! That's not the case for me! It's not the same plot! (see image - right one)
# Hence, I try to check whether the values stored in Z and Z_read are different..
print Z - Z_read
# No! It returns an array full of zeros! This means Z and Z_read are the same! I don't get why the plots look different!
Please find the image in this link: http://imgur.com/1TklSjU
It turns out that this is caused by the version of matplotlib.
Answered by a matplotlib developer, Jens Nielsen:
This doesn't occur with matplotlib version 1.5.1.
In version 2 beta 1, the problem seems to be that the data read back from the FITS file is stored as big-endian 8-byte floats (dtype '>f8') rather than native float64, which imshow does not handle correctly. Please look at the following link:
https://gist.github.com/jenshnielsen/86d4a86d8f667fadddc09f88c5fb87e6
The issue has been posted and you can look it up over here:
https://github.com/matplotlib/matplotlib/issues/6671
In the meantime, to get the same plots for Z and Z_read with matplotlib 2 beta 1, use the following code:
imshow(Z_read.astype('float64'), origin = "lower")
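The byte-order difference can be inspected without astropy or matplotlib. The sketch below builds a big-endian copy of Z as a stand-in for the array read from the FITS file (an assumption, since FITS stores data in big-endian order) and shows that the values are equal while the dtypes differ.

```python
import numpy as np

x = np.arange(10)
X, Y = np.meshgrid(x, x)
Z = ((X - 5) ** 2 + (Y - 5) ** 2) ** 0.5  # native float64

# Stand-in for the array read back from the FITS file (assumed big-endian '>f8')
Z_read = Z.astype(">f8")

print(Z.dtype, Z_read.dtype)         # compare the two dtypes
print(np.array_equal(Z, Z_read))     # the stored numbers are identical

# Same numbers converted to the machine's native byte order
Z_native = Z_read.astype("float64")
print(Z_native.dtype.byteorder)
```

This is why Z - Z_read is all zeros even though a plotting routine that ignores the byte order can render the two arrays differently.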
