How to find the inverse cumulative distribution function of discrete functions in Python

I am trying to find the inverse CDF of a discrete probability distribution in Python and then plot it. My CDF is given by the following numpy output:
array([ 0.228157, 0.440671, 0.588515, 0.683326, 0.740365, 0.783288,
0.81362 , 0.840518, 0.859213, 0.876764, 0.889355, 0.89813 ,
0.909194, 0.916443, 0.9256 , 0.930369, 0.938572, 0.942387,
0.946012, 0.951353, 0.954405, 0.956694, 0.965088, 0.966614,
0.96814 , 0.969475, 0.970047, 0.971001, 0.971573, 0.973099,
0.974816, 0.975388, 0.977105, 0.984163, 0.984354, 0.984736,
0.98569 , 0.985881, 0.986072, 0.986644, 0.990269, 0.990651,
0.990842, 0.993322, 0.993704, 0.994467, 0.995039, 0.995802,
0.996184, 0.996375, 0.996566, 0.996757, 0.997329, 0.99752 ,
0.997711, 0.997902, 0.998093, 0.998284, 0.998475, 0.998666,
0.998857, 0.999239, 0.999621, 0.999812, 1.00000])
I tried rv_discrete.ppf(q, *args, **kwds), but it works for random variables, which is not my case.

Since you have many points, linear interpolation between adjacent points may be acceptable. Use a binary search (np.searchsorted) to find the points adjacent to the probability you are looking for. Like this, with some tidying up:
import numpy as np
CDF = np.array([ 0.228157, 0.440671, 0.588515, 0.683326, 0.740365, 0.783288, 0.81362 , 0.840518, 0.859213, 0.876764, 0.889355, 0.89813 , 0.909194, 0.916443, 0.9256 , 0.930369, 0.938572, 0.942387, 0.946012, 0.951353, 0.954405, 0.956694, 0.965088, 0.966614, 0.96814 , 0.969475, 0.970047, 0.971001, 0.971573, 0.973099, 0.974816, 0.975388, 0.977105, 0.984163, 0.984354, 0.984736, 0.98569 , 0.985881, 0.986072, 0.986644, 0.990269, 0.990651, 0.990842, 0.993322, 0.993704, 0.994467, 0.995039, 0.995802, 0.996184, 0.996375, 0.996566, 0.996757, 0.997329, 0.99752 , 0.997711, 0.997902, 0.998093, 0.998284, 0.998475, 0.998666, 0.998857, 0.999239, 0.999621, 0.999812, 1.00000] )
# inverse of .3: binary search for the first CDF entry >= .3
index = np.searchsorted(CDF, .3)
print(index)
# fractional position of .3 between CDF[index-1] and CDF[index]
print((.3 - CDF[index-1]) / (CDF[index] - CDF[index-1]))
Output is this.
1
0.338062433534
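If a reusable function is wanted, the same idea can be wrapped up roughly as follows. This is only a sketch: the names quantile_index and quantile_interp are made up here, and it assumes the k-th CDF entry belongs to outcome k (0-based), which the question does not state.

```python
import numpy as np

def quantile_index(cdf, q):
    """Step-function inverse: smallest index k with cdf[k] >= q."""
    cdf = np.asarray(cdf, dtype=float)
    return np.searchsorted(cdf, q, side="left")

def quantile_interp(cdf, q):
    """Linearly interpolated inverse; q below cdf[0] is clamped to index 0."""
    cdf = np.asarray(cdf, dtype=float)
    return np.interp(q, cdf, np.arange(len(cdf)))

# Small made-up CDF, used only for illustration.
demo_cdf = [0.2, 0.5, 0.9, 1.0]
step_result = quantile_index(demo_cdf, [0.3, 0.5, 0.95])
interp_result = quantile_interp(demo_cdf, 0.35)
print(step_result, interp_result)
```

To plot the inverse, either function can be evaluated on a grid such as np.linspace(0, 1, 200) and the result passed to plt.step or plt.plot.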

Related

scipy RBFInterpolator doesn't reproduce data points

I'm trying to perform a 3D interpolation using a cubic kernel with scipy's RBFInterpolator routine. I'm running into a problem where the interpolator won't reproduce the input data points, even with smoothing set to 0. I've tried other kernels, such as 'linear' and 'quintic', and varied the neighbors argument, but the interpolator still won't reproduce the input data.
My two questions are:
Why won't RBFInterpolator reproduce the input data?
Is there another Python module that can perform 3D cubic spline interpolation?
Here's a minimal example:
import numpy as np
from numpy import array
import scipy.interpolate as spint
import matplotlib.pyplot as plt
TMPenden=array([[[0.00033903, 0.00034132, 0.00034312, 0.00034442, 0.0003452 ,
0.00034547, 0.0003452 , 0.00034441, 0.00034311, 0.0003413 ],
[0.00041104, 0.0004126 , 0.00041383, 0.00041471, 0.00041524,
0.00041542, 0.00041524, 0.00041471, 0.00041382, 0.00041259],
[0.00044731, 0.00044845, 0.00044934, 0.00044999, 0.00045037,
0.0004505 , 0.00045037, 0.00044998, 0.00044934, 0.00044844],
[0.00046978, 0.00047068, 0.00047138, 0.00047189, 0.00047219,
0.00047229, 0.00047219, 0.00047188, 0.00047138, 0.00047067],
[0.00048431, 0.00048505, 0.00048563, 0.00048604, 0.00048629,
0.00048638, 0.00048629, 0.00048604, 0.00048563, 0.00048504],
[0.00049386, 0.00049449, 0.00049499, 0.00049534, 0.00049555,
0.00049562, 0.00049555, 0.00049534, 0.00049498, 0.00049449],
[0.00050013, 0.00050068, 0.00050111, 0.00050142, 0.0005016 ,
0.00050166, 0.0005016 , 0.00050142, 0.00050111, 0.00050068],
[0.00050414, 0.00050462, 0.000505 , 0.00050527, 0.00050544,
0.00050549, 0.00050544, 0.00050527, 0.000505 , 0.00050462],
[0.00050653, 0.00050696, 0.0005073 , 0.00050754, 0.00050769,
0.00050774, 0.00050769, 0.00050754, 0.0005073 , 0.00050696],
[0.00050773, 0.00050812, 0.00050843, 0.00050865, 0.00050878,
0.00050882, 0.00050878, 0.00050865, 0.00050843, 0.00050812]],
[[0.00047842, 0.00048166, 0.00048422, 0.00048606, 0.00048717,
0.00048755, 0.00048717, 0.00048605, 0.0004842 , 0.00048164],
[0.00058177, 0.00058398, 0.00058571, 0.00058696, 0.00058771,
0.00058796, 0.00058771, 0.00058695, 0.0005857 , 0.00058396],
[0.00063274, 0.00063436, 0.00063562, 0.00063653, 0.00063708,
0.00063726, 0.00063708, 0.00063653, 0.00063562, 0.00063435],
[0.00066297, 0.00066424, 0.00066524, 0.00066595, 0.00066638,
0.00066653, 0.00066638, 0.00066595, 0.00066523, 0.00066424],
[0.00068146, 0.00068251, 0.00068333, 0.00068392, 0.00068427,
0.00068439, 0.00068427, 0.00068392, 0.00068333, 0.00068251],
[0.00069271, 0.00069361, 0.0006943 , 0.0006948 , 0.0006951 ,
0.0006952 , 0.0006951 , 0.0006948 , 0.0006943 , 0.0006936 ],
[0.00069925, 0.00070002, 0.00070062, 0.00070106, 0.00070132,
0.00070141, 0.00070132, 0.00070106, 0.00070062, 0.00070001],
[0.00070256, 0.00070325, 0.00070378, 0.00070416, 0.00070439,
0.00070447, 0.00070439, 0.00070416, 0.00070378, 0.00070324],
[0.00070361, 0.00070422, 0.0007047 , 0.00070505, 0.00070525,
0.00070532, 0.00070525, 0.00070504, 0.0007047 , 0.00070422],
[0.00070302, 0.00070358, 0.00070401, 0.00070432, 0.0007045 ,
0.00070457, 0.0007045 , 0.00070432, 0.00070401, 0.00070357]],
[[0.00064677, 0.00065116, 0.00065462, 0.00065712, 0.00065863,
0.00065914, 0.00065863, 0.00065711, 0.00065461, 0.00065114],
[0.00078754, 0.00079052, 0.00079286, 0.00079455, 0.00079556,
0.0007959 , 0.00079556, 0.00079454, 0.00079285, 0.0007905 ],
[0.00085515, 0.00085733, 0.00085904, 0.00086027, 0.00086101,
0.00086126, 0.00086101, 0.00086026, 0.00085903, 0.00085732],
[0.00089318, 0.0008949 , 0.00089624, 0.0008972 , 0.00089778,
0.00089798, 0.00089778, 0.0008972 , 0.00089623, 0.00089489],
[0.00091474, 0.00091616, 0.00091726, 0.00091805, 0.00091853,
0.00091869, 0.00091853, 0.00091805, 0.00091726, 0.00091615],
[0.00092632, 0.00092752, 0.00092845, 0.00092913, 0.00092953,
0.00092967, 0.00092953, 0.00092912, 0.00092845, 0.00092751],
[0.00093148, 0.00093252, 0.00093333, 0.00093391, 0.00093426,
0.00093438, 0.00093426, 0.00093391, 0.00093332, 0.00093251],
[0.00093233, 0.00093325, 0.00093396, 0.00093448, 0.00093479,
0.00093489, 0.00093478, 0.00093447, 0.00093396, 0.00093324],
[0.0009302 , 0.00093102, 0.00093166, 0.00093212, 0.00093239,
0.00093248, 0.00093239, 0.00093211, 0.00093166, 0.00093102],
[0.00092595, 0.00092669, 0.00092726, 0.00092768, 0.00092792,
0.00092801, 0.00092792, 0.00092767, 0.00092726, 0.00092668]]])
TMPraddata=array([ 2.43597707, 3.43597707, 4.43597707, 5.43597707, 6.43597707,
7.43597707, 8.43597707, 9.43597707, 10.43597707, 11.43597707])
TMPthetadata=array([1.41381669, 1.44523262, 1.47664855, 1.50806447, 1.5394804 ,
1.57089633, 1.60231225, 1.63372818, 1.66514411, 1.69656003])
TMPalpha = np.array([0.06,0.07,0.08])
coords = np.asarray([[alpha,r,theta] for alpha in TMPalpha for r in TMPraddata for theta in TMPthetadata])
yvalues = np.ravel(TMPenden)
tmprbf = spint.RBFInterpolator(
coords,
yvalues,
neighbors=1000,
kernel='linear',
smoothing=0,
degree=2
)
tmpalpha = 0.6
pnts = [[tmpalpha,i,TMPthetadata[5]] for i in TMPraddata]
datpnts = tmprbf(pnts)
plt.scatter(TMPraddata, datpnts, label="Radial Basis Function",marker=".", linewidth=.5)
plt.plot(TMPraddata, TMPenden[0,:,5],label="Data")
plt.legend();
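
One thing that may be worth checking in the example above: the interpolator is evaluated at tmpalpha = 0.6, while TMPalpha only contains 0.06, 0.07 and 0.08, and the curve it is compared against, TMPenden[0,:,5], belongs to alpha = 0.06. The sketch below uses a small made-up grid (not the data from the question) to illustrate that, with smoothing=0, RBFInterpolator returns the input values when it is evaluated at the input coordinates, and that moving one coordinate well outside the data turns the evaluation into an extrapolation.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Made-up, well-scaled 3D grid standing in for (alpha, r, theta).
a = np.array([1.0, 2.0, 3.0])
b = np.linspace(0.0, 3.0, 4)
c = np.linspace(0.0, 3.0, 4)
A, B, C = np.meshgrid(a, b, c, indexing="ij")
coords = np.column_stack([A.ravel(), B.ravel(), C.ravel()])
values = np.sin(A.ravel()) + 0.1 * B.ravel() * C.ravel()

rbf = RBFInterpolator(coords, values, kernel="cubic", smoothing=0)

# Evaluated at the data sites, the interpolant matches the data.
node_error = np.max(np.abs(rbf(coords) - values))

# With the first coordinate far outside [1, 3] this is an extrapolation
# and need not resemble the a = 1 slice of the data.
on_first_slice = coords[:, 0] == 1.0
shifted = coords[on_first_slice].copy()
shifted[:, 0] = 10.0
off_grid_error = np.max(np.abs(rbf(shifted) - values[on_first_slice]))

print(node_error, off_grid_error)
```

As for the second question: for data on a regular grid, recent SciPy versions also offer method="cubic" in scipy.interpolate.RegularGridInterpolator, though to my knowledge it needs at least four points along each axis, which the three alpha values above would not provide.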

Area under the histogram is not 1 when using density in plt.hist

Consider the following dataset with random data:
test_dataset = np.array([ -2.09601881, -4.26602684, 1.09105452, -4.59559669,
1.05865251, -0.93076762, -14.70398945, -18.01937129,
4.64126152, -10.34178822, -9.46058493, -5.66864965,
-3.17562022, 15.7030379 , 10.59675205, -5.80882413,
-24.00604149, -4.81518663, -1.94333927, 1.18142171,
12.72030312, 3.84917581, -0.4468796 , 11.91828567,
-17.99171774, 9.35108712, -5.57233376, 5.77547128,
5.49296099, -10.96132844, -18.75174336, 5.27843303,
25.73548956, -21.58043021, -14.24734733, 12.57886018,
-22.10002076, 1.72207555, -6.0411867 , -3.63568527,
7.26542117, -0.21449529, -6.64974714, -0.94574606,
-4.23339431, 16.76199734, -12.42195793, 18.965854 ,
-23.85336123, -15.55104466, 6.17215868, 7.34993316,
8.62461351, -16.30482638, -16.35601099, 1.96857833,
18.74440399, -22.48374434, -10.895831 , -10.14393648,
-17.62768751, 4.83388855, 20.1578181 , 6.04299626,
0.97198296, -3.40889754, -10.62734293, 1.70240472,
20.4203839 , 10.26751364, 15.47859675, -10.97940064,
1.82728251, 4.22894717, 8.31502887, -5.48502811,
-1.09244874, -11.32072796, -24.88520436, -7.42108403,
19.4200716 , 4.82704045, -12.46290135, -15.18466755,
6.37714692, -11.06825059, 5.10898588, -9.07485484,
1.63946084, -12.2270078 , 12.63776832, -25.03916909,
2.42972082, -14.22890171, 18.2199446 , 6.9819771 ,
-12.07795089, 2.59948596, -16.90206575, 6.35192719,
7.33823106, -23.69653447, -11.66091871, -19.40251179,
-12.64863792, 11.04004231, 13.7247356 , -16.36107329,
20.43227515, 17.97334692, 16.92675175, -5.62051239,
-8.66304184, -8.40848514, -23.20919855, 0.96808137,
-5.03287253, -3.13212582, 18.81155666, -8.27988284,
3.85708447, 12.43039322, 17.98003878, 18.11009997,
-3.74294421, -16.62276121, 9.4446743 , 2.2060981 ,
8.34853736, 14.79144713, -1.91113975, -5.17061419,
4.53451746, 8.19090358, 7.98343201, 11.44592322,
-16.9132677 , -25.92554857, 10.10638432, -8.09236786,
20.8878207 , 19.52368296, 0.85858125, 2.61760415,
9.21360649, -8.1192651 , -6.94829273, 2.73562447,
13.40981323, -9.05018331, -17.77563166, -21.03927199,
4.10415845, -1.31550732, 5.68284828, 15.08670773,
-19.78675315, 12.94697869, -11.51797637, 1.91485992,
16.69417993, -16.04271622, -1.14028558, 9.79830109,
-18.58386093, -7.52963269, -10.10059878, -25.2194216 ,
-0.10598426, -15.77641532, -14.15999125, 14.35011271,
11.15178588, -14.43856266, 15.84015226, -3.41221883,
11.90724469, 0.57782081, 18.82127466, -6.01068727,
-19.83684476, 2.20091942, -1.38707755, -8.62821053,
-11.89000913, -11.69539815, 5.70242019, -3.83781841,
5.35894135, -0.30995954, 21.76661212, 8.52974329,
-9.13065082, -11.06209 , -12.00654618, 2.769838 ,
-12.21579496, -27.2686534 , -4.58538197, -6.94388425])
I'd like to plot a normalized histogram of it, so I pass density=True to plt.hist:
import numpy as np
import matplotlib.pyplot as plt
data1, bins, _ = plt.hist(test_dataset, density=True);
print(np.trapz(data1))
print(sum(data1))
which plots the histogram and prints:
0.18206124014272715
0.18866449755723017
From matplotlib documentation:
The density parameter, which normalizes bin heights so that the integral of the histogram is 1. The resulting histogram is an approximation of the probability density function.
But my example clearly shows that the integral of the histogram is NOT 1 and depends strongly on the number of bins: if I set it to 40, for example, the sum increases:
data1, bins, _ = plt.hist(test_dataset, bins=40, density=True);
print(np.trapz(data1))
print(sum(data1))
0.7508847002777762
0.7546579902289207
Is the description in the documentation incorrect, or am I misunderstanding something?

You are not calculating the area. The area of each bar is its height times its bin width, so in your example the area should be calculated as follows:
np.isclose(np.sum(data1 * np.diff(bins)), 1.0)
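
A quick way to check this for different bin counts, sketched with np.histogram (which plt.hist relies on for the binning) and a made-up random sample rather than the dataset from the question:

```python
import numpy as np

rng = np.random.default_rng(0)             # made-up sample, not the question's data
sample = rng.normal(scale=12.0, size=200)

areas = {}
for n_bins in (10, 40):
    heights, edges = np.histogram(sample, bins=n_bins, density=True)
    # Area under the histogram: sum over bars of height * width.
    areas[n_bins] = np.sum(heights * np.diff(edges))
    # The plain sum of heights is 1 / bin_width for equal-width bins,
    # which is why it changes with the number of bins.
    print(n_bins, heights.sum(), areas[n_bins])
```

The plain sum of the heights grows with the number of bins because the bins get narrower, while the area stays at 1.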

joint probability distribution using Python

P = np.array(
[
[0.03607908, 0.03760034, 0.00503184, 0.0205082 , 0.01051408,
0.03776221, 0.00131325, 0.03760817, 0.01770659],
[0.03750162, 0.04317351, 0.03869997, 0.03069872, 0.02176718,
0.04778769, 0.01021053, 0.00324185, 0.02475319],
[0.03770951, 0.01053285, 0.01227089, 0.0339596 , 0.02296711,
0.02187814, 0.01925662, 0.0196836 , 0.01996279],
[0.02845139, 0.01209429, 0.02450163, 0.00874645, 0.03612603,
0.02352593, 0.00300314, 0.00103487, 0.04071951],
[0.00940187, 0.04633153, 0.01094094, 0.00172007, 0.00092633,
0.02032679, 0.02536328, 0.03552956, 0.01107725]
]
)
I'd like to find the expectation E[X+Y] and the variance Var(X+Y), where X corresponds to the rows of the table above and Y to its columns.
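
One possible way to compute this, as a sketch: it treats P[i, j] as P(X = x_i, Y = y_j) and needs the values x_i and y_j that X and Y take, which the question does not specify. The function name sum_moments and the small demo table are made up for illustration.

```python
import numpy as np

def sum_moments(P, x_vals, y_vals):
    """Return (E[X+Y], Var(X+Y)) for a joint pmf P with P[i, j] = P(X=x_i, Y=y_j)."""
    P = np.asarray(P, dtype=float)
    # S[i, j] = x_i + y_j, the value of X + Y in each cell of the table.
    S = np.add.outer(np.asarray(x_vals, dtype=float), np.asarray(y_vals, dtype=float))
    mean = np.sum(S * P)
    var = np.sum((S - mean) ** 2 * P)
    return mean, var

# Made-up 2x2 table: two independent fair coins with values 0 and 1.
demo_P = [[0.25, 0.25],
          [0.25, 0.25]]
demo_mean, demo_var = sum_moments(demo_P, [0, 1], [0, 1])
print(demo_mean, demo_var)
```

If X and Y are simply the row and column indices (an assumption), the call for the table above would be sum_moments(P, np.arange(P.shape[0]), np.arange(P.shape[1])).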

Non-evenly spaced np.array with more points near the boundary

I have an interval, say (0, 9), and I have to generate points in it such that they are denser near both boundaries. I know the number of points, say n_x. A parameter alpha controls the "denseness" such that the points are evenly spaced if alpha = 1.
The cross product of n_x and n_y is supposed to look like this:
[image not available]
So far the closest I've been to this is by using np.geomspace, but it's only dense near the left-hand side of the domain,
In [55]: np.geomspace(1,10,15) - 1
Out[55]:
array([0. , 0.17876863, 0.38949549, 0.63789371, 0.93069773,
1.27584593, 1.6826958 , 2.16227766, 2.72759372, 3.39397056,
4.17947468, 5.1054023 , 6.19685673, 7.48342898, 9. ])
I also tried dividing the domain into two parts, (0,4), (5,10) but that did not help either (since geomspace gives more points only at the LHS of the domain).
In [29]: np.geomspace(5,10, 15)
Out[29]:
array([ 5. , 5.25378319, 5.52044757, 5.80064693, 6.09506827,
6.40443345, 6.72950096, 7.07106781, 7.42997145, 7.80709182,
8.20335356, 8.61972821, 9.05723664, 9.51695153, 10. ])
Apart from that, I am not sure which mathematical function I can use to generate such an array.

You can use the cumulative distribution function of a symmetric beta distribution and map it to your range.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta
def denseboundspace(size=30, start=0, end=9, alpha=.5):
    x = np.linspace(0, 1, size)
    return start + beta.cdf(x, 2.-alpha, 2.-alpha) * (end-start)
n_x = denseboundspace()
#[0. 0.09681662 0.27092155 0.49228501 0.74944966 1.03538131
# 1.34503326 1.67445822 2.02038968 2.38001283 2.75082572 3.13054817
# 3.51705806 3.9083439 4.30246751 4.69753249 5.0916561 5.48294194
# 5.86945183 6.24917428 6.61998717 6.97961032 7.32554178 7.65496674
# 7.96461869 8.25055034 8.50771499 8.72907845 8.90318338 9. ]
plt.vlines(n_x, 0,2);
n_x = denseboundspace(size=13, start=1.2, end=7.8, alpha=1.0)
#[1.2 1.75 2.3 2.85 3.4 3.95 4.5 5.05 5.6 6.15 6.7 7.25 7.8 ]
plt.vlines(n_x, 0,2);
The spread is continuously controlled by the alpha parameter.
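
For the two-dimensional case mentioned in the question, the same function can be applied per axis and combined with np.meshgrid. A small sketch, repeating the function so the block runs on its own; n_y and the grid sizes are assumptions chosen for illustration:

```python
import numpy as np
from scipy.stats import beta

def denseboundspace(size=30, start=0, end=9, alpha=.5):
    x = np.linspace(0, 1, size)
    return start + beta.cdf(x, 2.-alpha, 2.-alpha) * (end-start)

n_x = denseboundspace(size=15)
n_y = denseboundspace(size=15)
X, Y = np.meshgrid(n_x, n_y)    # all (x, y) combinations of the two axes

# Gaps between neighbouring points: small at both ends, larger in the middle.
gaps = np.diff(n_x)
print(gaps[0], gaps[len(gaps) // 2], gaps[-1])
```

Plotting X against Y with plt.scatter should then show the points crowding toward all four edges of the square.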

Different values weibull pdf

I was wondering why the values of the Weibull pdf from the prebuilt function dweibull.pdf are roughly half of what they should be.
As a test, for the same x I computed the Weibull pdf for A=10 and K=2 twice: once by writing out the formula myself and once with the prebuilt dweibull function.
import numpy as np
from scipy.stats import exponweib,dweibull
import matplotlib.pyplot as plt
from matplotlib.figure import Figure
K=2.0
A=10.0
x=np.arange(0.,20.,1)
#own function
def weib(data,a,k):
    return (k / a) * (data / a)**(k - 1) * np.exp(-(data / a)**k)
pdf1=weib(x,A,K)
print(sum(pdf1))
#prebuilt function
dist=dweibull(K,1,A)
pdf2=dist.pdf(x)
print(sum(pdf2))
f=plt.figure()
suba=f.add_subplot(121)
suba.plot(x,pdf1)
suba.set_title('pdf own function')
subb=f.add_subplot(122)
subb.plot(x,pdf2)
subb.set_title('pdf dweibull')
f.show()
With dweibull the pdf values seem to be halved, which looks wrong: the values should sum to about 1, not to around 0.5 as they do with dweibull. When I write out the formula myself, the sum is around 1.

scipy.stats.dweibull implements the double Weibull distribution. Its support is the real line, so the probability mass is split symmetrically between negative and positive x, and the density for x > 0 is half that of the one-sided Weibull. Your function weib corresponds to the PDF of scipy's weibull_min distribution.
Compare your function weib to weibull_min.pdf:
In [128]: from scipy.stats import weibull_min
In [129]: x = np.arange(0, 20, 1.0)
In [130]: K = 2.0
In [131]: A = 10.0
Your implementation:
In [132]: weib(x, A, K)
Out[132]:
array([ 0. , 0.019801 , 0.03843158, 0.05483587, 0.0681715 ,
0.07788008, 0.08372116, 0.0857677 , 0.08436679, 0.08007445,
0.07357589, 0.0656034 , 0.05686266, 0.04797508, 0.03944036,
0.03161977, 0.02473752, 0.01889591, 0.014099 , 0.0102797 ])
scipy.stats.weibull_min.pdf:
In [133]: weibull_min.pdf(x, K, scale=A)
Out[133]:
array([ 0. , 0.019801 , 0.03843158, 0.05483587, 0.0681715 ,
0.07788008, 0.08372116, 0.0857677 , 0.08436679, 0.08007445,
0.07357589, 0.0656034 , 0.05686266, 0.04797508, 0.03944036,
0.03161977, 0.02473752, 0.01889591, 0.014099 , 0.0102797 ])
By the way, there is a mistake in this line of your code:
dist=dweibull(K,1,A)
The order of the parameters is shape, location, scale, so you are setting the location parameter to 1. That's why the values in your second plot are shifted by one. That line should have been
dist = dweibull(K, 0, A)
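
The factor of about one half noticed in the question can be checked numerically. A short sketch, using a location of 0 and the same K and A as above:

```python
import numpy as np
from scipy.stats import dweibull, weibull_min

K, A = 2.0, 10.0
x = np.arange(0.0, 20.0, 1.0)

double_pdf = dweibull.pdf(x, K, loc=0, scale=A)    # two-sided, symmetric about 0
one_sided_pdf = weibull_min.pdf(x, K, scale=A)     # supported on x >= 0 only

# For x >= 0 the double Weibull density is half the one-sided density.
print(np.allclose(double_pdf, 0.5 * one_sided_pdf))
```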
