Probability density function numpy histogram/scipy stats - python

We have the array a = range(10). Using numpy.histogram:
hist, bins = numpy.histogram(a, bins=int((np.max(a) - np.min(a)) / 1), range=(np.min(a), np.max(a)), density=True)
According to the numpy documentation:
If density=True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1.
The result is:
array([ 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2])
I try to do the same using scipy.stats:
from scipy.stats import norm
mean = np.mean(a)
sigma = np.std(a)
norm.pdf(a, mean, sigma)
However, the result is different:
array([ 0.04070852, 0.06610774, 0.09509936, 0.12118842, 0.13680528, 0.13680528, 0.12118842, 0.09509936, 0.06610774, 0.04070852])
I want to know why.
Update: I would like to ask a more general question. How can we compute the probability density function of an array without using numpy.histogram with density=True?

If density=True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1.
The "normalized" there does not mean that it will be transformed using a Normal Distribution. It simply says that each value in the bin will be divided by the total number of entries so that the total density would be equal to 1.

You can't compare numpy.histogram() and scipy.stats.norm() for this simple reason:
scipy.stats.norm() represents a continuous normal random variable, while numpy.histogram() deals with finite sequences of samples (discrete data).
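To address the Update: a common way to estimate a probability density function from an array without numpy.histogram is kernel density estimation. A minimal sketch using scipy.stats.gaussian_kde (the evaluation grid is illustrative):
import numpy as np
from scipy.stats import gaussian_kde

a = np.arange(10)
kde = gaussian_kde(a)                    # smooth density estimate from the samples
xs = np.linspace(a.min(), a.max(), 100)  # grid on which to evaluate the estimate
print(kde(xs))                           # estimated pdf values on the grid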

Refer to this blog post for a detailed explanation: Plotting a Continuous Probability Function (PDF) from a Histogram, Solved in Python (http://howdoudoittheeasiestway.blogspot.com/2017/09/plotting-continuous-probability.html). Otherwise, you can use the code below.
import numpy as np
import matplotlib.pyplot as plt

# A is your data array
n, bins, patches = plt.hist(A, 40, histtype='bar')
plt.show()
n = n / len(A)       # normalize bin counts to fractions
n = np.append(n, 0)  # pad so n has the same length as bins
mu = np.mean(n)
sigma = np.std(n)
plt.bar(bins, n, width=(bins[-1] - bins[0]) / 40)
# overlay a scaled normal curve (the 0.03 factor is an ad-hoc scaling)
y1 = (1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(bins - mu)**2 / (2 * sigma**2))) * 0.03
plt.plot(bins, y1, 'r--', linewidth=2)
plt.show()
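Note that the snippet above computes mu and sigma from the bin heights. If the goal is instead to overlay the normal density fitted to the data itself, a sketch along these lines may be closer to what you want (A is again assumed to be your raw data array):
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

plt.hist(A, 40, density=True, alpha=0.6)  # histogram normalized as a density
xs = np.linspace(np.min(A), np.max(A), 200)
plt.plot(xs, st.norm.pdf(xs, np.mean(A), np.std(A)), 'r--', linewidth=2)
plt.show()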

Related

Data generated from Scipy truncnorm.rvs does not match specified standard deviation

I am trying to generate data which follows a specified truncated normal distribution. Based on answers here and here, I wrote:
import scipy.stats
lower, upper, mu, sigma, N = 5, 15, 10, 5, 10000
samples = scipy.stats.truncnorm.rvs((lower-mu)/sigma, (upper-mu)/sigma, loc=mu, scale=sigma, size=N)
samples.std()
I get output like
> 2.673
which is obviously nowhere close to the expected value of 5. Repeating it does not change the result considerably, so it's not a sample-size issue. Any suggestions?
Indeed, truncating the normal distribution reduces the variability (and thereby the standard deviation) of the possible realizations of the random variable. So we know why it is not 5.0; but we don't yet know why it should be 2.673, other than that it must be smaller.
What if we compute the exact standard deviation of the truncated normal distribution analytically and compare it to the empirical value you obtained? Then you can be sure that everything checks out.
from scipy import stats
from scipy.integrate import quad
import numpy as np
from matplotlib import pyplot as plt

# re-normalization constant (inverse of the normal probability mass on [lower, upper])
p = stats.norm.cdf(upper, loc=mu, scale=sigma) - stats.norm.cdf(lower, loc=mu, scale=sigma)
# plot the truncated normal density
x_axis = np.linspace(0, 25, 10000)
plt.title('Truncated Normal Density', fontsize=18)
plt.plot(x_axis, stats.truncnorm.pdf(x_axis, (lower-mu)/sigma, (upper-mu)/sigma, loc=mu, scale=sigma))
plt.show()
This showcases the truncated normal density, alluding to the fact that the narrower the interval [lower, upper] is chosen, the smaller the standard deviation will be (even approaching 0 asymptotically as lower and upper get infinitesimally close).
Let's make this rigorous to really be sure. For our truncated normal random variable X with density f, the age-old formulas are E[X] = \int_{lower}^{upper} x f(x) dx and Var(X) = \int_{lower}^{upper} (x - E[X])^2 f(x) dx.
Then, defining the helper functions
def xfx(x, lower=lower, upper=upper, mu=mu, sigma=sigma):
    '''helper function returning x*f(x) for the truncated normal density f'''
    return x * stats.truncnorm.pdf(x, (lower-mu)/sigma, (upper-mu)/sigma, loc=mu, scale=sigma)

def x_EX_fx(x, lower=lower, upper=upper, mu=mu, sigma=sigma):
    '''helper function returning (x - E[X])**2 * f(x) for the truncated normal density f'''
    EX = quad(func=xfx, a=lower, b=upper)[0]
    return ((x - EX)**2) * stats.truncnorm.pdf(x, (lower-mu)/sigma, (upper-mu)/sigma, loc=mu, scale=sigma)
allows us to compute the exact values:
# E[X], expected value of X
quad(func=xfx,a=lower,b=upper)[0]
> 10.0
# (Var(X))^(1/2), standard deviation of X
np.sqrt(quad(func=x_EX_fx,a=lower,b=upper)[0])
> 2.697
This looks eerily similar to your observed value 2.673. Let's see if the difference is merely based on the finite sample size by running a simulation study to observe if the empirical standard deviation approaches the theoretical one.
# simulation study
np.random.seed(7447)
stdList = [stats.truncnorm.rvs((lower-mu)/sigma, (upper-mu)/sigma, loc=mu, scale=sigma, size=round(10**N)).std() for N in range(2, 8)]
# plot
plt.title(r"Convergence behaviour of $\hat{\sigma}_n$ to $\sigma$", fontsize=18)
plt.plot(range(2, 8), stdList)
plt.axhline(2.697800468774485, color='red', lw=0.85)
plt.legend(['empirical', 'theoretical'], fontsize=14)
plt.xlabel(r"$\log_{10}(N)$", fontsize=14)
plt.show()
yielding a plot in which the empirical standard deviation converges to the theoretical one as N grows. This confirms that your output is sound.
This is generating a clipped (truncated) normal distribution on [5, 15]. That interval is +/- 1 s.d. around the mean, so the s.d. measured across this sample will not be equal to the input.
If you clip the range of outputs, you necessarily reduce the observed s.d.
As lower/upper -> +/-infinity, the sample std -> 5.
As lower/upper -> 10, the sample std -> 0.
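As a quick cross-check (a sketch relying on scipy's built-in moment computation for distributions), scipy can produce the theoretical standard deviation directly:
from scipy import stats

lower, upper, mu, sigma = 5, 15, 10, 5
print(stats.truncnorm.std((lower-mu)/sigma, (upper-mu)/sigma, loc=mu, scale=sigma))
# ~2.6978, matching the quadrature result above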

Using scipy to fit CDF with real data, but CDF start not from 0

Here are my samples and my code for fitting the CDF.
import numpy as np
import pandas as pd
import scipy.stats as st
samples = [2,3,10,7,9,6,1,3,7,2,5,4,6,3,4,1,4,6,3,10,3,7,5,6,6,5,4,2,2,5,4,5,6,4,4,6,3,3,3,2,2,2,4,2,6,2,7,4,3,2,2,1,4,2,2,5,3,9,6,8,3,6,6,3,9,2,3,3,3,5,4,4,5,4,1,8,5,8,6,6,7,6,3,2,4,2,16,6,2,3,4,2,2,9,9,5,5,5,1,5,2,8,5,3,5,8,11,4,7,4,11,3,7,3,6,6,1,4,2,1,1,1,9,4,15,2,1,3,4,9,3,3,4,3,6,3,3,5,5,6,3,3,4,8,4,4,2,5,6,7,3,5,5,2,5,9,7,6,1,3,4,9,3,2,4,8,5,8,4,4,5,6,5,8,6,1,3,7,9,6,7,12,4,1,4,5,5,7,1,7,1,15,3,3,2,3,7,7,15,6,5,1,7,4,2,10,1,3,3,8,3,8,1,5,4,7,4,2,9,2,1,3,6,1,6,10,6,3,4,7,5,7,3,3,7,4,4,3,5,3,5,2,2,1,2,3,1,1,2,1,1,2,3,10,7,3,2,6,5,6,5,11,1,7,5,2,9,5,12,6,3,9,9,4,3,4,6,4,10,4,8,6,1,7,2,5,8,3,1,3,1,1,3,3,2,2,6,3,3,2,6,6,6,4,2,4,1,10,5,3,5,6,3,4,1,1,7,6,6,5,7,6,3,4,6,6,5,3,2,3,2,1,2,4,1,1,1,3,7,1,6,3,4,3,3,6,7,3,7,4,1,1,7,1,4,4,3,4,2,4,2,6,6,2,2,6,5,4,6,5,6,3,5,1,5,3,3,2,2,2,2,3,3,3,2,2,1,4,2,3,5,7,2,5,1,2,2,5,6,5,2,1,2,4,5,2,3,2,4,9,3,5,2,2,5,4,2,3,4,2,3,1,3,6,7,2,6,3,5,4,2,2,2,2,1,2,5,2,2,3,4,2,5,2,2,3,5,3,2,4,3,2,5,4,1,4,8,6,8,2,2,3,1,2,3,8,2,3,4,3,3,2,1,1,1,3,3,4,3,4,1,2,8,2,2,7,3,1,2,3,3,2,3,1,2,1,1,1,3,2,2,2,4,7,2,1,2,3,1,3,1,1,6,2,1,1,3,1,4,4,1,3,1,1,4,1,1,2,4,4,3,2,3,2,1,2,1,4,2,5,3,4,2,1,1,1,3,1,2,1,1,4,2,1,3,2,1,3,2,1,1,1,2,1,1,1,1,2,1,1,1,1,1,1,1]
bins=np.arange(1, 18, 0.1)
# Because min(samples) = 1, I start from 1.
y, x = np.histogram(samples, bins=bins, density=True)
params = st.lognorm.fit(samples)
# Separate parts of parameters
arg = params[:-2]
loc = params[-2]
scale = params[-1]
ccdf = st.lognorm.cdf(x, *arg, loc=loc, scale=scale)
cdf = pd.Series(ccdf, x)
#cdf[1.0] is not 0... That is the issue...
When I print out the first value, cdf[1.0], it is not equal to 0. According to theory it should be 0. As the picture below shows, the first CDF value is not 0. I have checked my code again and again, but I cannot fix the problem. Any suggestions would be appreciated.
In your code you are trying to plot a bar chart from your sample. This is good, but what you get on the graph is not a histogram; it is the distribution function of the sample.
The code does not match the picture.
Here is the pdf graph and histogram.
Code for graph above:
# ... insert your sample and calculate the lognorm parameters (already in your code)
import matplotlib.pyplot as plt
x = np.linspace(min(samples), max(samples), 100)
pdf = st.lognorm.pdf(x, *arg, loc=loc, scale=scale)
plt.plot(x, pdf)
plt.hist(samples, bins=max(samples) - min(samples), density=True, alpha=0.75)
plt.show()
You also ask scipy for the cdf parameters in your code, and scipy finds them. And on the graph you draw exactly this cdf.
What you are missing is that the cdf value at the minimum value of the sample does not have to be zero.
Keep in mind that the fit function only brings the approximated curve close to your sample; it does not produce a curve that exactly matches the empirical distribution function.
Scipy simply thinks your sample may contain values less than one, although there are no such values in the training set.
The fitted pdf also says that a value greater than 14 is extremely unlikely, yet your sample contains values greater than 14.
As a result, the cdf should not be expected to equal zero at your point cdf[1.0].
P.S. The cdf will still be equal to zero at zero if you pass that point to it.
Code for graph above:
# ... insert your sample and calculate the lognorm parameters (already in your code)
x = np.linspace(0, max(samples), 100)
cdf = st.lognorm.cdf(x, *arg, loc=loc, scale=scale)
plt.plot(x, cdf)
plt.show()
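For a quick numeric check (a sketch; the exact numbers depend on the parameters your fit returns), you can evaluate the fitted cdf at 0 and at the sample minimum:
print(st.lognorm.cdf(0.0, *arg, loc=loc, scale=scale))  # 0 whenever the fitted loc >= 0
print(st.lognorm.cdf(1.0, *arg, loc=loc, scale=scale))  # small but positive, as explained above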

GPflow classification: interpretation of posterior variance

In the tutorial on multiclass classification on the GPflow website, a Sparse Variational Gaussian Process (SVGP) is used on a 1D toy example. As is the case for all other GPflow models, the SVGP model has a method predict_y(self, Xnew) which returns the mean and variance of held-out data at the points Xnew.
From the tutorial it is clear that the first argument that is unpacked from predict_y is the posterior predictive probability of each of the three classes (cells [7] and [8]), shown as the colored lines in the second panel of the plot below. However, the authors do not elaborate on the second argument that can be unpacked from predict_y, which are the variances of the predictions. In a regression setting its interpretation is clear to me, as the posterior predictive distribution in this case would be a Gaussian.
But I fail to understand what could be the interpretation here. Especially, I would like to know how this measure could be used to construct error bars denoting uncertainty around class predictions for any new data point.
I altered the code of the tutorial slightly to add an additional panel to the plot below: the third panel shows in black the maximal standard deviation (square root of the obtained variance from predict_y). It clearly is a good measure for uncertainty and it is probably also no coincidence that the highest possible value is 0.5, but I could not find how it is calculated and what it represents.
Complete notebook with all code here.
def plot(m):
    f = plt.figure(figsize=(12, 8))
    a1 = f.add_axes([0.05, 0.05, 0.9, 0.5])
    av = f.add_axes([0.05, 0.6, 0.9, 0.1])
    a2 = f.add_axes([0.05, 0.75, 0.9, 0.1])
    a3 = f.add_axes([0.05, 0.9, 0.9, 0.1])
    xx = np.linspace(m.X.read_value().min() - 0.3, m.X.read_value().max() + 0.3, 200).reshape(-1, 1)
    mu, var = m.predict_f(xx)
    mu, var = mu.copy(), var.copy()
    p, v = m.predict_y(xx)
    a3.set_xticks([])
    a3.set_yticks([])
    av.set_xticks([])
    lty = ['-', '--', ':']
    for i in range(m.likelihood.num_classes):
        x = m.X.read_value()[m.Y.read_value().flatten() == i]
        points, = a3.plot(x, x * 0, '.')
        color = points.get_color()
        a1.fill_between(xx[:, 0], mu[:, i] + 2 * np.sqrt(var[:, i]), mu[:, i] - 2 * np.sqrt(var[:, i]), alpha=0.2)
        a1.plot(xx, mu[:, i], color=color, lw=2)
        a2.plot(xx, p[:, i], '-', color=color, lw=2)
    av.plot(xx, np.sqrt(np.max(v[:, :], axis=1)), c="black", lw=2)
    for ax in [a1, av, a2, a3]:
        ax.set_xlim(xx.min(), xx.max())
    a2.set_ylim(-0.1, 1.1)
    a2.set_yticks([0, 1])
    a2.set_xticks([])

plot(m)
Model.predict_y() calls Likelihood.predict_mean_and_var(). If you look at the documentation of the latter function [1], you see that all it does is compute the mean and variance of the predictive distribution. That is, we first compute the marginal predictive distribution q(y) = \int p(y|f) q(f) df, and then we compute the mean and variance of q(y).
For a Gaussian, the mean and variance can be specified independently of each other, and they have interpretations as a point prediction and the uncertainty. For a Bernoulli likelihood, the mean and variance are both completely determined by the single parameter p: the mean is p and the variance is p(1 - p). The mean of the distribution is the probability of the event, which already tells us the uncertainty; the variance doesn't give much more. This also explains the ceiling of 0.5 you observed: p(1 - p) attains its maximum of 0.25 at p = 0.5, so the standard deviation never exceeds 0.5.
However, you are right that the variance is a nice metric of uncertainty where higher means more uncertainty. The entropy as a function of p looks very similar (although the two differ in behaviour near the edges):
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 1 - 0.001, 1000)[:, None]
q = 1 - p
plt.plot(p, -p * np.log(p) - q * np.log(q), label='entropy')
plt.plot(p, p * q, label='variance')
plt.legend()
plt.xlabel('probability')
plt.show()
[1] https://github.com/GPflow/GPflow/blob/b8ed8332549a375da8658a1117470ac86d823e7f/gpflow/likelihoods.py#L76

How to do linear regression, taking errorbars into account?

I am doing a computer simulation of some physical system of finite size, after which I extrapolate to infinity (the thermodynamic limit). Some theory says that the data should scale linearly with system size, so I am doing linear regression.
The data I have is noisy, but for each data point I can estimate errorbars. So, for example, the data points look like:
x_list = [0.3333333333333333, 0.2886751345948129, 0.25, 0.23570226039551587, 0.22360679774997896, 0.20412414523193154, 0.2, 0.16666666666666666]
y_list = [0.13250359351851854, 0.12098339583333334, 0.12398501145833334, 0.09152715, 0.11167239583333334, 0.10876248333333333, 0.09814170444444444, 0.08560799305555555]
y_err = [0.003306749165349316, 0.003818446389148108, 0.0056036878203831785, 0.0036635292592592595, 0.0037034897788415424, 0.007576672222222223, 0.002981084130692832, 0.0034913019065973983]
Let's say I am trying to do this in Python.
First way that I know is:
m, c, r_value, p_value, std_err = scipy.stats.linregress(x_list, y_list)
I understand this gives me errorbars of the result, but this does not take into account errorbars of the initial data.
Second way that I know is:
c, m = numpy.polynomial.polynomial.polyfit(x_list, y_list, 1, w=[1.0 / ty for ty in y_err], full=False)  # note: coefficients come lowest degree first
Here we use the inverse of the errorbar of each point as a weight in the least-squares approximation. So if a point is not really that reliable, it will not influence the result a lot, which is reasonable.
But I cannot figure out how to get something that combines both of these methods.
What I really want is what the second method does, namely use regression where every point influences the result with a different weight. But at the same time I want to know how accurate my result is, that is, I want to know the errorbars of the resulting coefficients.
How can I do this?
Not entirely sure if this is what you mean, but… using pandas, statsmodels, and patsy, we can compare an ordinary least-squares fit and a weighted least-squares fit which uses the inverse of the noise you provided as a weight matrix (statsmodels will complain about sample sizes < 20, by the way).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 300
import statsmodels.formula.api as sm
x_list = [0.3333333333333333, 0.2886751345948129, 0.25, 0.23570226039551587, 0.22360679774997896, 0.20412414523193154, 0.2, 0.16666666666666666]
y_list = [0.13250359351851854, 0.12098339583333334, 0.12398501145833334, 0.09152715, 0.11167239583333334, 0.10876248333333333, 0.09814170444444444, 0.08560799305555555]
y_err = [0.003306749165349316, 0.003818446389148108, 0.0056036878203831785, 0.0036635292592592595, 0.0037034897788415424, 0.007576672222222223, 0.002981084130692832, 0.0034913019065973983]
# put x and y into a pandas DataFrame, and the weights into a Series
ws = pd.DataFrame({
    'x': x_list,
    'y': y_list
})
weights = pd.Series(y_err)

wls_fit = sm.wls('x ~ y', data=ws, weights=1 / weights).fit()
ols_fit = sm.ols('x ~ y', data=ws).fit()

# show the fit summary by calling wls_fit.summary()
# wls fit r-squared is 0.754
# ols fit r-squared is 0.701

# let's plot our data
plt.clf()
fig = plt.figure()
ax = fig.add_subplot(111, facecolor='w')
ws.plot(
    kind='scatter',
    x='x',
    y='y',
    style='o',
    alpha=1.,
    ax=ax,
    title='x vs y scatter',
    edgecolor='#ff8300',
    s=40
)

# weighted prediction
wp, = ax.plot(
    wls_fit.predict(),
    ws['y'],
    color='#e55ea2',
    lw=1.,
    alpha=1.0
)

# unweighted prediction
op, = ax.plot(
    ols_fit.predict(),
    ws['y'],
    color='k',
    ls='solid',
    lw=1,
    alpha=1.0
)

leg = plt.legend(
    (op, wp),
    ('Ordinary Least Squares', 'Weighted Least Squares'),
    loc='upper left',
    fontsize=8
)

plt.tight_layout()
fig.set_size_inches(6.40, 5.12)
plt.show()
WLS residuals:
[0.025624005084707302,
0.013611438189866154,
-0.033569595462217161,
0.044110895217014695,
-0.025071632845910546,
-0.036308252199571928,
-0.010335514810672464,
-0.0081511479431851663]
The mean squared error of the residuals for the weighted fit (wls_fit.mse_resid or wls_fit.scale) is 0.22964802498892287, and the r-squared value of the fit is 0.754.
You can obtain a wealth of data about the fits by calling their summary() method, and/or doing dir(wls_fit), if you need a list of every available property and method.
I wrote a concise function to perform the weighted linear regression of a data set, which is a direct translation of GSL's gsl_fit_wlinear function. This is useful if you want to know exactly what your function is doing when it performs the fit.
import numpy as np

def wlinear_fit(x, y, w):
    """
    Fit (x, y, w) to a linear function, using exact formulae for weighted linear
    regression. This code was translated from the GNU Scientific Library (GSL);
    it is an exact copy of the function gsl_fit_wlinear.
    """
    # compute the weighted means and weighted deviations from the means
    # wm denotes a "weighted mean", wm(f) = (sum_i w_i f_i) / (sum_i w_i)
    W = np.sum(w)
    wm_x = np.average(x, weights=w)
    wm_y = np.average(y, weights=w)
    dx = x - wm_x
    dy = y - wm_y
    wm_dx2 = np.average(dx**2, weights=w)
    wm_dxdy = np.average(dx*dy, weights=w)
    # in terms of y = a + b x
    b = wm_dxdy / wm_dx2
    a = wm_y - wm_x*b
    cov_00 = (1.0/W) * (1.0 + wm_x**2/wm_dx2)
    cov_11 = 1.0 / (W*wm_dx2)
    cov_01 = -wm_x / (W*wm_dx2)
    # compute chi^2 = \sum w_i (y_i - (a + b*x_i))^2
    chi2 = np.sum(w * (y - (a + b*x))**2)
    return a, b, cov_00, cov_11, cov_01, chi2
To perform your fit, you would do:
a, b, cov_00, cov_11, cov_01, chi2 = wlinear_fit(np.array(x_list), np.array(y_list), 1.0/np.array(y_err)**2)
This will return the best estimate for the coefficients a (the intercept) and b (the slope) of the linear regression, along with the elements of the covariance matrix cov_00, cov_01, and cov_11. The best estimate of the error on a is then the square root of cov_00, and the one on b is the square root of cov_11. The weighted sum of the residuals is returned in the chi2 variable.
IMPORTANT: this function accepts inverse variances, not inverse standard deviations, as the weights for the data points.
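Another route worth mentioning (a sketch, not part of the answers above): scipy.optimize.curve_fit accepts per-point uncertainties via its sigma argument, and with absolute_sigma=True it returns a covariance matrix from which the errorbars of the coefficients follow directly:
import numpy as np
from scipy.optimize import curve_fit

def line(x, m, c):
    return m * x + c

popt, pcov = curve_fit(line, x_list, y_list, sigma=y_err, absolute_sigma=True)
m, c = popt
m_err, c_err = np.sqrt(np.diag(pcov))  # errorbars of the slope and intercept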
sklearn.linear_model.LinearRegression supports specification of weights during fit:
from sklearn.linear_model import LinearRegression

x_data = np.array(x_list).reshape(-1, 1)  # the model expects shape (n_samples, n_features)
y_data = np.array(y_list)
y_err = np.array(y_err)
model = LinearRegression()
model.fit(x_data, y_data, sample_weight=1/y_err)
Here the sample weight is specified as 1 / y_err. Different versions are possible and often it's a good idea to clip these sample weights to a maximum value in case the y_err varies strongly or has small outliers:
sample_weight = 1 / y_err
sample_weight = np.minimum(sample_weight, MAX_WEIGHT)
where MAX_WEIGHT should be determined from your data (by looking at the y_err or 1 / y_err distributions, e.g. if they have outliers they can be clipped).
I found this document helpful in understanding and setting up my own weighted least squares routine (applicable for any programming language).
Typically learning and using optimized routines is the best way to go but there are times where understanding the guts of a routine is important.

Chi square numpy.polyfit (numpy)

Could someone explain how to get Chi^2/doF using numpy.polyfit?
Assume you have some data points
import numpy
x = numpy.array([0.0, 1.0, 2.0, 3.0])
y = numpy.array([3.6, 1.3, 0.2, 0.9])
To fit a parabola to those points, use numpy.polyfit():
p = numpy.polyfit(x, y, 2)
To get the chi-squared value for this fit, evaluate the polynomial at the x values of your data points, subtract the y values, square and sum:
chi_squared = numpy.sum((numpy.polyval(p, x) - y) ** 2)
You can divide this number by the number of degrees of freedom if you like.
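For instance (assuming the common convention dof = number of points minus number of fitted parameters):
dof = len(x) - (2 + 1)                 # quadratic fit: 3 parameters
chi_squared_dof = chi_squared / dof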
Numpy's polyfit has, at least since release 1.3, supported a full parameter. If that is set to True, polyfit returns a few more values, including the sum of the squared residuals, which is chi-squared (not normalized by the degrees of freedom).
So a simple example would be
p, residuals, _, _, _ = numpy.polyfit(x, y, 2, full=True)
chisq_dof = residuals / (len(x) - 3)
I have not tried this myself with weights, but I assume polyfit does the right thing here (since numpy 1.7, polyfit accepts a parameter w to provide weights for the fit).
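A hedged sketch of that weighted variant (per the numpy docs, w should be 1/sigma for measurement errors sigma; y_err_example is a hypothetical array of per-point errors, and the manual sum below lets you verify what the returned residuals actually contain):
w = 1.0 / numpy.array(y_err_example)   # y_err_example: hypothetical per-point errors
p, residuals, _, _, _ = numpy.polyfit(x, y, 2, w=w, full=True)
chisq = numpy.sum((w * (numpy.polyval(p, x) - y)) ** 2)  # manual cross-check of residuals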

Categories

Resources