I have very little knowledge of statistics, so forgive me, but I'm very confused by how the numpy function std works, and the documentation is unfortunately not clearing it up.
From what I understand it will compute the standard deviation of a distribution from the array, but when I set up a Gaussian with a standard deviation of 0.5 with the following code, numpy.std returns 0.2:
sigma = 0.5
mu = 1
x = np.linspace(0, 2, 100)
f = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp((-1 / 2) * ((x - mu) / sigma)**2)
plt.plot(x, f)
plt.show()
print(np.std(f))
This is the distribution:
I have no idea what I'm misunderstanding about how the function works. I thought maybe I would have to tell it the x-values associated with the y-values of the distribution but there's no argument for that in the function. Why is numpy.std not returning the actual standard deviation of my distribution?
I suspect that you understand perfectly well how the function works, but are misunderstanding the meaning of your data. Standard deviation is a measure of the spread of data about the mean value.
When you say std(f), you are computing the spread of the y-values about their mean. Looking at the graph in the question, a vertical mean of ~0.5 and a standard deviation of ~0.2 are not far-fetched. Notice that std(f) does not involve the x-values in any way.
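A quick way to see this (a minimal check that continues the question's snippet; it is not part of the original computation) is that reordering the y-values does not change the result, and x never enters the calculation at all:
print(np.std(f))                          # ~0.2, as reported in the question
print(np.std(np.random.permutation(f)))   # same value: only the y-values matter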
What you are expecting to get is the standard deviation of the x-values, weighted by the y-values. This is essentially the idea behind a probability density function (PDF).
Let's go through the computation manually to understand the difference. The mean of the x-values is normally x.sum() / x.size. But that is only true when the weight of each value is 1. If you weight each value by the corresponding f value, you can write
m = (x * f).sum() / f.sum()
Standard deviation is the root-mean-square deviation about the mean: compute the average squared deviation from the mean, then take the square root. We can compute the weighted mean of the squared deviations in exactly the same way as before:
s = np.sqrt(np.sum((x - m)**2 * f) / f.sum())
Notice that the value of s computed this way from your question is not 0.5, but rather 0.44. This is because your PDF is truncated to the interval [0, 2], and the missing tails would have added significantly to the spread.
Here is an example showing that the standard deviation converges to the expected value as you compute it for a larger sample of the PDF:
>>> def s(x, y):
...     m = (x * y).sum() / y.sum()
...     return np.sqrt(np.sum((x - m)**2 * y) / y.sum())
>>> sigma = 0.5
>>> x1 = np.linspace(-1, 1, 100)
>>> y1 = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * (x1 / sigma)**2)
>>> s(x1, y1)
0.4418881290522094
>>> x2 = np.linspace(-2, 2, 100)
>>> y2 = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * (x2 / sigma)**2)
>>> s(x2, y2)
0.49977093783005005
>>> x3 = np.linspace(-3, 3, 100)
>>> y3 = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * (x3 / sigma)**2)
>>> s(x3, y3)
0.49999998748515206
np.std computes the standard deviation of the array it is given. The computation boils down to these steps:
First, compute the mean of the data.
Then compute the sum of (x - mean)**2 over all elements.
Then divide that sum by the number of elements to get the average squared deviation (the variance).
Finally, take the square root of that average.
So the function returns the standard deviation of whatever array you pass to it, which in your case is the y-values f, not the x-values.
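As a quick illustration of those steps (a minimal sketch with an arbitrary example array, not taken from the question):
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
mean = a.mean()                      # step 1: mean of the data
sq_dev_sum = np.sum((a - mean)**2)   # step 2: sum of squared deviations
variance = sq_dev_sum / a.size       # step 3: average squared deviation
std = np.sqrt(variance)              # step 4: square root of the variance
print(std, np.std(a))                # both print the same value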
Related
I asked a similar question in January that @Miłosz Wieczór was kind enough to answer. Now I am faced with a similar but different challenge, since I need to fit two parameters (fc and alpha) simultaneously on two datasets (e_exp and iq_exp). I basically need to find the values of fc and alpha that best fit both datasets e_exp and iq_exp.
import numpy as np
import math
from scipy.optimize import curve_fit, least_squares, minimize
f_exp = np.array([1, 1.6, 2.7, 4.4, 7.3, 12, 20, 32, 56, 88, 144, 250000])
e_exp = np.array([7.15, 7.30, 7.20, 7.25, 7.26, 7.28, 7.32, 7.25, 7.35, 7.34, 7.37, 11.55])
iq_exp = np.array([0.010, 0.009, 0.011, 0.011, 0.010, 0.012, 0.019, 0.027, 0.038, 0.044, 0.052, 0.005])
ezero = np.min(e_exp)
einf = np.max(e_exp)
ig_fc = 500
ig_alpha = 0.35
def CCRI(f_exp, fc, alpha):
    x = np.log(f_exp/fc)
    R = ezero + 1/2 * (einf - ezero) * (1 + np.sinh((1 - alpha) * x) / (np.cosh((1 - alpha) * x) + np.sin(1/2 * alpha * math.pi)))
    I = 1/2 * (einf - ezero) * np.cos(alpha * math.pi / 2) / (np.cosh((1 - alpha) * x) + np.sin(alpha * math.pi / 2))
    RI = np.sqrt(R ** 2 + I ** 2)
    return RI
def CCiQ(f_exp, fc, alpha):
    x = np.log(f_exp/fc)
    R = ezero + 1/2 * (einf - ezero) * (1 + np.sinh((1 - alpha) * x) / (np.cosh((1 - alpha) * x) + np.sin(1/2 * alpha * math.pi)))
    I = 1/2 * (einf - ezero) * np.cos(alpha * math.pi / 2) / (np.cosh((1 - alpha) * x) + np.sin(alpha * math.pi / 2))
    iQ = I / R
    return iQ
poptRI, pcovRI = curve_fit(CCRI, f_exp, e_exp, p0=(ig_fc, ig_alpha))
poptiQ, pcoviQ = curve_fit(CCiQ, f_exp, iq_exp, p0=(ig_fc, ig_alpha))
einf, ezero, and f_exp are all constants, and the variables I need to optimize are fc and alpha (ig_fc and ig_alpha are the initial guesses). In the code above I get two different pairs of fc and alpha because I solve the two fits independently. However, I need to solve them simultaneously so that fc and alpha are shared between both fits.
Is there a way to solve two different functions to provide universal solutions for fc and alpha?
The docs state on the second returned value from curve_fit:
pcov
The estimated covariance of popt. The diagonals provide the variance of the parameter estimate. To compute one standard deviation errors on the parameters use perr = np.sqrt(np.diag(pcov)).
So if you want to minimize the overall error, you need to combine the errors of both your fits.
def objective(what, ever):
    poptRI, pcovRI = curve_fit(CCRI, f_exp, e_exp, p0=(ig_fc, ig_alpha))
    poptiQ, pcoviQ = curve_fit(CCiQ, f_exp, iq_exp, p0=(ig_fc, ig_alpha))
    # not sure if this is the correct equation, but you can start with it
    err_total = np.sum(np.sqrt(np.diag(pcovRI))) + np.sum(np.sqrt(np.diag(pcoviQ)))
    return err_total
On total errors of 2d Gaussian functions:
https://www.visiondummy.com/2014/04/draw-error-ellipse-representing-covariance-matrix/
Update:
Since you want poptRI and poptiQ to be the same, you need to minimize their distance.
This can be done like
from numpy import linalg
def objective(what, ever):
    poptRI, pcovRI = curve_fit(CCRI, f_exp, e_exp, p0=(ig_fc, ig_alpha))
    poptiQ, pcoviQ = curve_fit(CCiQ, f_exp, iq_exp, p0=(ig_fc, ig_alpha))
    delta = linalg.norm(poptiQ - poptRI)
    return delta
Minimizing this function will (should) result in similar values for poptRI and poptiQ. You take the parameters as vectors, and try to minimize the length of their delta vector.
However, this approach assumes that poptRI and poptiQ (and their coefficients) are roughly in the same range, since you are applying a metric to them. If, say, one of them is in the range of 2000 and the other in the range of 2, then the optimizer will favour tuning the first one. But maybe this is fine.
If you somehow want to treat them the same you need to normalize them.
One approach (assuming all coefficients are similar) could be
linalg.norm((poptiQ / linalg.norm(poptiQ)) - (poptRI / linalg.norm(poptRI)))
You normalize the results to unit vectors, subtract them, and then take the norm of the difference.
The same is true for the inputs to the function, but it might not be that important there. See the links below.
But this strongly depends on the problem you are trying to solve. There is no general solution.
Some links related to this:
Is normalization useful/necessary in optimization?
Why do we have to normalize the input for an artificial neural network?
Another objective function:
Is this what you are trying to do?
You want to find the best fc and alpha so the fit results of both functions are as close as possible?
def objective(fc, alpha):
    poptRI, pcovRI = curve_fit(CCRI, f_exp, e_exp, p0=(fc, alpha))
    poptiQ, pcoviQ = curve_fit(CCiQ, f_exp, iq_exp, p0=(fc, alpha))
    delta = linalg.norm(poptiQ - poptRI)
    return delta
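If it is, here is a minimal sketch of how you might drive that objective with scipy.optimize.minimize (the Nelder-Mead method and the starting point are my own choices for illustration, not something from your code):
from scipy.optimize import minimize

res = minimize(lambda p: objective(p[0], p[1]), x0=[ig_fc, ig_alpha], method='Nelder-Mead')
best_fc, best_alpha = res.x
print(best_fc, best_alpha)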
I'm using scipy skewnorm to create a skewed distribution with a loc and scale.
I adjust the loc and scale passed to scipy.stats.skewnorm based on Adelchi Azzalini's page (here is the link), using the section at the bottom of that page on "mean value" and "delta".
The code I'm using is:
import math
import scipy.stats
skew = -2
mean = 0.05
stdev = 0.05
delta = skew / math.sqrt(1. + math.pow(skew, 2.))
adjMean = mean - stdev * math.sqrt(2. / math.pi) * delta
adjStdev = math.sqrt(math.pow(stdev, 2.) / (1. - 2. * math.pow(delta, 2.) / math.pi))
print('target mean={:.4f} actual mean={:.4f}'.format(mean, float(scipy.stats.skewnorm.stats(skew, loc=adjMean, scale=adjStdev, moments='mvsk')[0])))
print('target stdev={:.4f} actual stdev={:.4f}'.format(stdev, math.sqrt(float(scipy.stats.skewnorm.stats(skew, loc=adjMean, scale=adjStdev, moments='mvsk')[1]))))
When I run it, though, I'm not getting the mean I expect, while the stdev is what I expect:
target mean=0.0500 actual mean=0.0347
target stdev=0.0500 actual stdev=0.0500
I feel like I'm missing something, either about the skew-normal distribution itself or about how scipy.stats.skewnorm implements it...
I have numerically integrated the distribution and the mean matches the "actual mean" above.
You have an algebra mistake. You have
adjMean = mean - stdev * math.sqrt(2. / math.pi) * delta
but on the right side, stdev should be adjStdev.
Here's a modified version of your code:
import math
import scipy.stats
skew = 2.0
mean = 1.5
stdev = 3.0
delta = skew / math.sqrt(1. + math.pow(skew, 2.))
adjStdev = math.sqrt(math.pow(stdev, 2.) / (1. - 2. * math.pow(delta, 2.) / math.pi))
adjMean = mean - adjStdev * math.sqrt(2. / math.pi) * delta
print('target mean={:.4f} actual mean={:.4f}'.format(mean, float(scipy.stats.skewnorm.stats(skew, loc=adjMean, scale=adjStdev, moments='mvsk')[0])))
print('target stdev={:.4f} actual stdev={:.4f}'.format(stdev, math.sqrt(float(scipy.stats.skewnorm.stats(skew, loc=adjMean, scale=adjStdev, moments='mvsk')[1]))))
Here's the output:
target mean=1.5000 actual mean=1.5000
target stdev=3.0000 actual stdev=3.0000
I need to know how to generate 1000 random numbers between 500 and 600 that have a mean of 550 and a standard deviation of 30 in Python.
import pylab
import random
xrandn = pylab.zeros(1000,float)
for j in range(500,601):
    xrandn[j] = pylab.randn()
???????
You are looking for stats.truncnorm:
import scipy.stats as stats
a, b = 500, 600
mu, sigma = 550, 30
dist = stats.truncnorm((a - mu) / sigma, (b - mu) / sigma, loc=mu, scale=sigma)
values = dist.rvs(1000)
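As a quick sanity check on the sample (illustrative only; the exact numbers depend on the random seed), note that truncating at 500 and 600 cuts off the tails, so the sample standard deviation comes out below the scale of 30 even though the mean stays close to 550:
print(values.mean(), values.std())
# the mean is close to 550, but the standard deviation is noticeably smaller
# than 30 because the values beyond the truncation bounds have been removed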
There are other choices for your problem too. Wikipedia has a list of continuous distributions with bounded intervals, depending on the distribution you may be able to get your required characteristics with the right parameters. For example, if you want something like "a bounded Gaussian bell" (not truncated) you can pick the (scaled) beta distribution:
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
def my_distribution(min_val, max_val, mean, std):
    scale = max_val - min_val
    location = min_val
    # Mean and standard deviation of the unscaled beta distribution
    unscaled_mean = (mean - min_val) / scale
    unscaled_var = (std / scale) ** 2
    # Computation of alpha and beta can be derived from mean and variance formulas
    t = unscaled_mean / (1 - unscaled_mean)
    beta = ((t / unscaled_var) - (t * t) - (2 * t) - 1) / ((t * t * t) + (3 * t * t) + (3 * t) + 1)
    alpha = beta * t
    # Not all parameters may produce a valid distribution
    if alpha <= 0 or beta <= 0:
        raise ValueError('Cannot create distribution for the given parameters.')
    # Make scaled beta distribution with computed parameters
    return scipy.stats.beta(alpha, beta, scale=scale, loc=location)
np.random.seed(100)
min_val = 1.5
max_val = 35
mean = 9.87
std = 3.1
my_dist = my_distribution(min_val, max_val, mean, std)
# Plot distribution PDF
x = np.linspace(min_val, max_val, 100)
plt.plot(x, my_dist.pdf(x))
# Stats
print('mean:', my_dist.mean(), 'std:', my_dist.std())
# Get a large sample to check bounds
sample = my_dist.rvs(size=100000)
print('min:', sample.min(), 'max:', sample.max())
Output:
mean: 9.87 std: 3.100000000000001
min: 1.9290674232087306 max: 25.03903889816994
Probability density function plot:
Note that not every possible combination of bounds, mean and standard deviation will produce a valid distribution in this case, though, and depending on the resulting values of alpha and beta the probability density function may look like an "inverted bell" instead (even though mean and standard deviation would still be correct).
I'm not exactly sure what the OP wanted, but if the goal is just an array xrandn matching the bottom plot, here are the steps:
First, draw samples from a normal (Gaussian) distribution; the easiest way is probably numpy:
import numpy as np
random_nums = np.random.normal(loc=550, scale=30, size=1000)
And then you keep only the numbers within the desired range with a list comprehension:
random_nums_filtered = [i for i in random_nums if i>500 and i<600]
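Keep in mind that the filtering step discards samples, so random_nums_filtered contains fewer than 1000 values, and its standard deviation ends up below 30 because the tails are removed. A quick check (continuing the snippet above):
print(len(random_nums_filtered))
print(np.mean(random_nums_filtered), np.std(random_nums_filtered))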
I am learning gradient descent for calculating coefficients. Below is what I am doing:
#!/usr/bin/Python
import numpy as np
# m denotes the number of examples here, not the number of features
def gradientDescent(x, y, theta, alpha, m, numIterations):
    xTrans = x.transpose()
    for i in range(0, numIterations):
        hypothesis = np.dot(x, theta)
        loss = hypothesis - y
        # avg cost per example (the 2 in 2*m doesn't really matter here.
        # But to be consistent with the gradient, I include it)
        cost = np.sum(loss ** 2) / (2 * m)
        #print("Iteration %d | Cost: %f" % (i, cost))
        # avg gradient per example
        gradient = np.dot(xTrans, loss) / m
        # update
        theta = theta - alpha * gradient
    return theta
X = np.array([41.9,43.4,43.9,44.5,47.3,47.5,47.9,50.2,52.8,53.2,56.7,57.0,63.5,65.3,71.1,77.0,77.8])
y = np.array([251.3,251.3,248.3,267.5,273.0,276.5,270.3,274.9,285.0,290.0,297.0,302.5,304.5,309.3,321.7,330.7,349.0])
n = np.max(X.shape)
x = np.vstack([np.ones(n), X]).T
m, n = np.shape(x)
numIterations= 100000
alpha = 0.0005
theta = np.ones(n)
theta = gradientDescent(x, y, theta, alpha, m, numIterations)
print(theta)
Now my above code works fine. If I now try multiple variables and replace X with X1 like the following:
X1 = np.array([[41.9,43.4,43.9,44.5,47.3,47.5,47.9,50.2,52.8,53.2,56.7,57.0,63.5,65.3,71.1,77.0,77.8], [29.1,29.3,29.5,29.7,29.9,30.3,30.5,30.7,30.8,30.9,31.5,31.7,31.9,32.0,32.1,32.5,32.9]])
then my code fails and shows me the following error:
JustTestingSGD.py:14: RuntimeWarning: overflow encountered in square
cost = np.sum(loss ** 2) / (2 * m)
JustTestingSGD.py:19: RuntimeWarning: invalid value encountered in subtract
theta = theta - alpha * gradient
[ nan nan nan]
Can anybody tell me how can I do gradient descent using X1? My expected output using X1 is:
[-153.5 1.24 12.08]
I am open to other Python implementations also. I just want the coefficients (also called thetas) for X1 and y.
The problem is that your algorithm does not converge; it diverges instead. The first error:
JustTestingSGD.py:14: RuntimeWarning: overflow encountered in square
cost = np.sum(loss ** 2) / (2 * m)
comes from the fact that at some point the squared loss can no longer be represented: 64-bit floats overflow once a value exceeds roughly 1.8 * 10^308.
JustTestingSGD.py:19: RuntimeWarning: invalid value encountered in subtract
theta = theta - alpha * gradient
This is only a consequence of the previous error: once values have overflowed, the arithmetic produces inf and NaN, so the numbers are no longer usable for the calculation.
You can actually see the divergence by uncommenting your debug print line. The cost starts to grow, as there is no convergence.
If you try your function with X1 and a smaller value for alpha, it converges.
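A minimal sketch of that experiment, reusing the question's gradientDescent and data (the learning rate below is a guess; with unscaled features of this magnitude the intercept converges very slowly, so expect to need a much larger iteration count and some tuning before you reach the quoted coefficients):
x1 = np.vstack([np.ones(X1.shape[1]), X1]).T  # 17 rows, 3 columns: bias term plus the two feature rows of X1
theta1 = np.ones(x1.shape[1])
theta1 = gradientDescent(x1, y, theta1, 0.0001, x1.shape[0], numIterations)
print(theta1)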
I have a Gaussian function that I have fitted to my data from a data file. I now need to integrate the Gaussian to get the area under it.
This is my gaussian function
def I(theta,max_x,max_y,sigma):
    return (max_y/(sigma*(math.sqrt(2*pi))))*np.exp(-((theta-max_x)**2)/(2*sigma**2))
COMPARING WITH GENERAL FORMULA
N(x | mu, sigma, n) := (n/(sigma*sqrt(2*pi))) * exp((-(x-mu)^2)/(2*sigma^2))
i.e. n = max_y, mu = max_x, x = theta
this is what is given on another page:
If Phi(z) = integral(N(x|0,1,1), -inf, z); that is, Phi(z) is the integral of the standard normal distribution from minus infinity up to z, then it's true by the definition of the error function that
Phi(z) = 0.5 + 0.5 * erf(z / sqrt(2)).
Likewise, if Phi(z | mu, sigma, n) = integral(N(x | mu, sigma, n), -inf, z); that is, Phi(z | mu, sigma, n) is the integral of the normal distribution with parameters mu, sigma, and n from minus infinity up to z, then it's true by the definition of the error function that
Phi(z | mu, sigma, n) = (n/2) * (1 + erf((z - mu) / (sigma * sqrt(2)))).
I am unsure how this helps. I just want to integrate my function over the plotted range to get the area under the curve. Is it saying that this is the integral:
Phi(z | mu, sigma, n) = (n/2) * (1 + erf((z - mu) / (sigma * sqrt(2))))
The expression you have there is the integral from minus infinity up to z, i.e. an antiderivative of your Gaussian. If you would like a numerical answer between two x limits, evaluate that expression at the two limits and take the difference.
Your Gaussian function is defined over all real numbers (−∞, +∞), but in practice you are only interested in the middle part, since the tails are very close to 0. To get a numerical estimate of the total area you can do exactly what you say: evaluate Phi at a point well below the peak and at a point well above it, where the density is essentially 0, and take the difference.
If Phi(z | mu, sigma, n) returns a function you could do:
integral = Phi(z | mu, sigma, n)
area = integral(X_HIGH) - integral(X_LOW)
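For the concrete case in the question, here is a minimal sketch of that idea using the question's parameter names (the fit parameters and integration limits below are made up for illustration):
import math

def Phi(z, max_x, max_y, sigma):
    # integral of I(theta, max_x, max_y, sigma) from -inf up to z
    return (max_y / 2.0) * (1.0 + math.erf((z - max_x) / (sigma * math.sqrt(2.0))))

# made-up fit parameters and integration limits
max_x, max_y, sigma = 5.0, 10.0, 1.2
area = Phi(max_x + 4 * sigma, max_x, max_y, sigma) - Phi(max_x - 4 * sigma, max_x, max_y, sigma)
print(area)  # close to max_y, since +/- 4 sigma covers almost all of the area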