I'm using scipy's skewnorm to create a skewed distribution with a given loc and scale.
I adjust the loc and scale passed to scipy.stats.skewnorm based on Adelchi Azzalini's skew-normal page (link), using the section at the bottom of that page on the "mean value" and "delta".
The code I'm using is:
import math
import scipy.stats
skew = -2
mean = 0.05
stdev = 0.05
delta = skew / math.sqrt(1. + math.pow(skew, 2.))
adjMean = mean - stdev * math.sqrt(2. / math.pi) * delta
adjStdev = math.sqrt(math.pow(stdev, 2.) / (1. - 2. * math.pow(delta, 2.) / math.pi))
print('target mean={:.4f} actual mean={:.4f}'.format(mean, float(scipy.stats.skewnorm.stats(skew, loc=adjMean, scale=adjStdev, moments='mvsk')[0])))
print('target stdev={:.4f} actual stdev={:.4f}'.format(stdev, math.sqrt(float(scipy.stats.skewnorm.stats(skew, loc=adjMean, scale=adjStdev, moments='mvsk')[1]))))
When I run it, though, I'm not getting the mean I expect, while the stdev is what I expect:
target mean=0.0500 actual mean=0.0347
target stdev=0.0500 actual stdev=0.0500
I feel like I'm missing something either about skewnorm or in scipy.stats.skewnorm...
I have numerically integrated the distribution and the mean matches the "actual mean" above.
You have an algebra mistake. You have
adjMean = mean - stdev * math.sqrt(2. / math.pi) * delta
but on the right side, stdev should be adjStdev.
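For reference, the relations behind this (the mean and variance of a skew-normal distribution with location xi and scale omega, which is what Azzalini's page gives) are:

E[X] = xi + omega * delta * sqrt(2/pi)
Var[X] = omega**2 * (1 - 2 * delta**2 / pi)

So omega (adjStdev) must be computed from the target variance first, and only then can xi (adjMean) be recovered as mean - omega * delta * sqrt(2/pi).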
Here's a modified version of your code:
import math
import scipy.stats
skew = 2.0
mean = 1.5
stdev = 3.0
delta = skew / math.sqrt(1. + math.pow(skew, 2.))
adjStdev = math.sqrt(math.pow(stdev, 2.) / (1. - 2. * math.pow(delta, 2.) / math.pi))
adjMean = mean - adjStdev * math.sqrt(2. / math.pi) * delta
print('target mean={:.4f} actual mean={:.4f}'.format(mean, float(scipy.stats.skewnorm.stats(skew, loc=adjMean, scale=adjStdev, moments='mvsk')[0])))
print('target stdev={:.4f} actual stdev={:.4f}'.format(stdev, math.sqrt(float(scipy.stats.skewnorm.stats(skew, loc=adjMean, scale=adjStdev, moments='mvsk')[1]))))
Here's the output:
target mean=1.5000 actual mean=1.5000
target stdev=3.0000 actual stdev=3.0000
I have very little knowledge of statistics, so forgive me, but I'm very confused by how the numpy function std works, and the documentation is unfortunately not clearing it up.
From what I understand it will compute the standard deviation of a distribution from the array, but when I set up a Gaussian with a standard deviation of 0.5 with the following code, numpy.std returns 0.2:
import numpy as np
import matplotlib.pyplot as plt

sigma = 0.5
mu = 1
x = np.linspace(0, 2, 100)
f = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp((-1 / 2) * ((x - mu) / sigma)**2)
plt.plot(x, f)
plt.show()
print(np.std(f))
This is the distribution:
I have no idea what I'm misunderstanding about how the function works. I thought maybe I would have to tell it the x-values associated with the y-values of the distribution but there's no argument for that in the function. Why is numpy.std not returning the actual standard deviation of my distribution?
I suspect that you understand perfectly well how the function works, but are misunderstanding the meaning of your data. Standard deviation is a measure of the spread of data about the mean value.
When you say std(f), you are computing the spread of the y-values about their mean. Looking at the graph in the question, a vertical mean of ~0.5 and a standard deviation of ~0.2 are not far-fetched. Notice that std(f) does not involve the x-values in any way.
What you are expecting to get is the standard deviation of the x-values, weighted by the y-values. This is essentially the idea behind a probability density function (PDF).
Let's go through the computation manually to understand the differences. The mean of the x-values is normally x.sum() / x.size. But that is only true if the weight of each value is 1. If you weight each value by the corresponding f value, you can write
m = (x * f).sum() / f.sum()
Standard deviation is the root-mean-square about the mean. That means computing the average squared deviation from the mean, and taking the square root. We can compute the weighted mean of squared deviation in the exact same way we did before:
s = np.sqrt(np.sum((x - m)**2 * f) / f.sum())
Notice that the value of s computed this way from your question is not 0.5, but rather 0.44. This is because your PDF is incomplete, and the missing tails add significantly to the spread.
Here is an example showing that the standard deviation converges to the expected value as you compute it for a larger sample of the PDF:
>>> def s(x, y):
... m = (x * y).sum() / y.sum()
... return np.sqrt(np.sum((x - m)**2 * y) / y.sum())
>>> sigma = 0.5
>>> x1 = np.linspace(-1, 1, 100)
>>> y1 = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * (x1 / sigma)**2)
>>> s(x1, y1)
0.4418881290522094
>>> x2 = np.linspace(-2, 2, 100)
>>> y2 = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * (x2 / sigma)**2)
>>> s(x2, y2)
0.49977093783005005
>>> x3 = np.linspace(-3, 3, 100)
>>> y3 = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * (x3 / sigma)**2)
>>> s(x3, y3)
0.49999998748515206
np.std is used to compute the standard deviation. It can be computed in the following steps:
First, compute the mean of the distribution.
Then find the sum of (x - x.mean())**2.
Then find the mean of the above sum (by dividing by the number of elements in the distribution).
Then take the square root of this mean (calculated in step 3).
Thus this function calculates the standard deviation of the array being passed to it; here, that is the array of y-values, not the x-values.
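A quick numeric check of those steps against np.std, using a small made-up array:

import numpy as np

f = np.array([1.0, 2.0, 3.0, 4.0])      # example data (made up)
mean = f.mean()                          # step 1: mean
msd = np.sum((f - mean) ** 2) / f.size   # steps 2 and 3: mean squared deviation
print(np.sqrt(msd), np.std(f))           # step 4: both print 1.118033...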
It doesn't appear to be a regular log-normal pdf as described at https://en.wikipedia.org/wiki/Log-normal_distribution
The function comes from https://www.tensorflow.org/tutorials/generative/cvae:
def log_normal_pdf(sample, mean, logvar, raxis=1):
    log2pi = tf.math.log(2. * np.pi)
    return tf.reduce_sum(
        -.5 * ((sample - mean) ** 2. * tf.exp(-logvar) + logvar + log2pi),
        axis=raxis)
This is the logarithm of the probability density under a normal distribution, i.e. log(p(x)) where p is a normal/Gaussian density, not a log-normal pdf, so the naming is a little confusing.
In case anyone else wanders down this rabbit hole, the previous answer checks out: "the logarithm of the pdf according to a normal distribution". Here's a simple check that you can run based on the gaussian function definition from wiki:
import numpy as np
def log_normal_pdf(sample, mean, std):
    """Function from tensorflow VAE example"""
    logvar = np.log(std**2)
    log2pi = np.log(2*np.pi)
    return -.5 * ((sample - mean) ** 2. * np.exp(-logvar) + logvar + log2pi)

def test(sample, mean, std):
    """Alternate calc taking the log of the wiki gaussian function"""
    out = (1 / (std * np.sqrt(2*np.pi))) * np.exp(-0.5 * (sample-mean)**2 / std**2)
    return np.log(out)
print(log_normal_pdf(9, 10, 1))
print(test(9, 10, 1))
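Both calls print the same value, approximately -1.4189, confirming that the tutorial's function is the log of the normal pdf evaluated at sample.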
I asked a similar question in January that @Miłosz Wieczór was kind enough to answer. Now, I am faced with a similar but different challenge, since I need to fit two parameters (fc and alpha) simultaneously on two datasets (e_exp and iq_exp). I basically need to find the values of fc and alpha that best fit both e_exp and iq_exp.
import numpy as np
import math
from scipy.optimize import curve_fit, least_squares, minimize
f_exp = np.array([1, 1.6, 2.7, 4.4, 7.3, 12, 20, 32, 56, 88, 144, 250000])
e_exp = np.array([7.15, 7.30, 7.20, 7.25, 7.26, 7.28, 7.32, 7.25, 7.35, 7.34, 7.37, 11.55])
iq_exp = np.array([0.010, 0.009, 0.011, 0.011, 0.010, 0.012, 0.019, 0.027, 0.038, 0.044, 0.052, 0.005])
ezero = np.min(e_exp)
einf = np.max(e_exp)
ig_fc = 500
ig_alpha = 0.35
def CCRI(f_exp, fc, alpha):
    x = np.log(f_exp/fc)
    R = ezero + 1/2 * (einf - ezero) * (1 + np.sinh((1 - alpha) * x) / (np.cosh((1 - alpha) * x) + np.sin(1/2 * alpha * math.pi)))
    I = 1/2 * (einf - ezero) * np.cos(alpha * math.pi / 2) / (np.cosh((1 - alpha) * x) + np.sin(alpha * math.pi / 2))
    RI = np.sqrt(R ** 2 + I ** 2)
    return RI

def CCiQ(f_exp, fc, alpha):
    x = np.log(f_exp/fc)
    R = ezero + 1/2 * (einf - ezero) * (1 + np.sinh((1 - alpha) * x) / (np.cosh((1 - alpha) * x) + np.sin(1/2 * alpha * math.pi)))
    I = 1/2 * (einf - ezero) * np.cos(alpha * math.pi / 2) / (np.cosh((1 - alpha) * x) + np.sin(alpha * math.pi / 2))
    iQ = I / R
    return iQ
poptRI, pcovRI = curve_fit(CCRI, f_exp, e_exp, p0=(ig_fc, ig_alpha))
poptiQ, pcoviQ = curve_fit(CCiQ, f_exp, iq_exp, p0=(ig_fc, ig_alpha))
einf, ezero, and f_exp are all constants, and the variables I need to optimize are fc and alpha (ig stands for initial guess). In the code above I get two different (fc, alpha) pairs because I solve the two fits independently. I need to solve them simultaneously, however, so that fc and alpha are universal.
Is there a way to solve the two different functions together to provide universal solutions for fc and alpha?
The docs state on the second returned value from curve_fit:
pcov: The estimated covariance of popt. The diagonals provide the variance of the parameter estimate. To compute one standard deviation errors on the parameters use perr = np.sqrt(np.diag(pcov)).
So if you want to minimize the overall error, you need to combine the errors of both your fits.
def objective(what, ever):
    poptRI, pcovRI = curve_fit(CCRI, f_exp, e_exp, p0=(ig_fc, ig_alpha))
    poptiQ, pcoviQ = curve_fit(CCiQ, f_exp, iq_exp, p0=(ig_fc, ig_alpha))
    # not sure if this is the correct equation, but you can start with it
    err_total = np.sum(np.sqrt(np.diag(pcovRI))) + np.sum(np.sqrt(np.diag(pcoviQ)))
    return err_total
On total errors of 2d Gaussian functions:
https://www.visiondummy.com/2014/04/draw-error-ellipse-representing-covariance-matrix/
Update:
Since you want poptRI and poptiQ to be the same, you need to minimize their distance. This can be done like this:
from numpy import linalg
def objective(what, ever):
poptRI, pcovRI = curve_fit(CCRI, f_exp, e_exp, p0=(ig_fc, ig_alpha))
poptiQ, pcoviQ = curve_fit(CCiQ, f_exp, iq_exp, p0=(ig_fc, ig_alpha))
delta = linalg.norm(poptiQ - poptRI)
return delta
Minimizing this function will (should) result in similar values for poptRI and poptiQ. You take the parameters as vectors, and try to minimize the length of their delta vector.
However, this approach assumes that poptRI and poptiQ (and their coefficients) are in roughly the same range, since you are using a metric on them. If, say, one of them is in the range of 2000 and the other in the range of 2, the optimizer will favour tuning the first one. But maybe this is fine.
If you somehow want to treat them the same you need to normalize them.
One approach (assuming all coefficients are similar) could be
linalg.norm((poptiQ / linalg.norm(poptiQ)) - (poptRI / linalg.norm(poptRI)))
You normalize the results to unit vectors, then subtract them, then create the norm.
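A quick illustration with made-up numbers of how the raw norm is dominated by the larger coefficient, and how normalizing evens it out:

import numpy as np

# hypothetical fit results on very different scales
poptRI = np.array([2000.0, 2.0])
poptiQ = np.array([1800.0, 3.0])

# raw distance is dominated almost entirely by the first coefficient
print(np.linalg.norm(poptiQ - poptRI))  # ~200.0

# after normalizing to unit vectors, both coefficients contribute comparably
a = poptiQ / np.linalg.norm(poptiQ)
b = poptRI / np.linalg.norm(poptRI)
print(np.linalg.norm(a - b))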
The same is true for the inputs to the function, but it might not be that important there. See the links below.
But this strongly depends on the problem you are trying to solve. There is no general solution.
Some links related to this:
Is normalization useful/necessary in optimization?
Why do we have to normalize the input for an artificial neural network?
Another objective function:
Is this what you are trying to do? You want to find the best fc and alpha so that the fit results of both functions are as close as possible?
def objective(fc, alpha):
    poptRI, pcovRI = curve_fit(CCRI, f_exp, e_exp, p0=(fc, alpha))
    poptiQ, pcoviQ = curve_fit(CCiQ, f_exp, iq_exp, p0=(fc, alpha))
    delta = linalg.norm(poptiQ - poptRI)
    return delta
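To actually run the search you would hand this objective to an optimizer. Here is a minimal sketch using scipy.optimize.minimize (assuming f_exp, e_exp, iq_exp, CCRI, CCiQ, ig_fc and ig_alpha from the question; minimize expects a single parameter vector, and objective_vec is my naming, not from the original):

import numpy as np
from scipy.optimize import curve_fit, minimize

def objective_vec(params):
    fc, alpha = params
    poptRI, _ = curve_fit(CCRI, f_exp, e_exp, p0=(fc, alpha))
    poptiQ, _ = curve_fit(CCiQ, f_exp, iq_exp, p0=(fc, alpha))
    return np.linalg.norm(poptiQ - poptRI)

# Nelder-Mead is a reasonable choice for a derivative-free objective like this
res = minimize(objective_vec, x0=[ig_fc, ig_alpha], method='Nelder-Mead')
print(res.x)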
I have this equation:
import numpy as np
from scipy import optimize
def wealth_evolution(price, wealth=10, rate=0.01, q=1, realEstate=0.1, prev_price=56):
    sum_wantedEstate = 100
    for delta in range(1,4):
        z = rate - ((price-prev_price) / (price + q / rate))
        k = delta * np.divide(1.0, float(np.maximum(0.0, z)))
        wantedEstate = (wealth / (price + q / rate)) * np.minimum(k, 1) - realEstate
        sum_wantedEstate += wantedEstate
    return sum_wantedEstate
So I find the solution of this equation:
sol = optimize.fsolve(wealth_evolution, 200)
But if I substitute sol into the equation I don't get 0 (wealth_evolution(sol) != 0). Why does this happen? fsolve finds the roots of f(x) = 0.
UPD:
The full_output gives:
(array([ 2585200.]), {'qtf': array([-99.70002298]), 'nfev': 14, 'fjac': array([[-1.]]), 'r': array([ 3.45456519e-11]), 'fvec': array([ 99.7000116])}, 5, 'The iteration is not making good progress, as measured by the \n improvement from the last ten iterations.')
Have you tried plotting your function?
import numpy as np
from scipy import optimize
from matplotlib import pyplot as plt
small = 1e-30
def wealth_evolution(price, wealth=10, rate=0.01, q=1, realEstate=0.1, prev_price=56):
    sum_wantedEstate = 100
    for delta in range(1,4):
        z = rate - ((price-prev_price) / (price + q / rate))
        k = delta * np.divide(1.0, float(np.maximum(small, z)))
        wantedEstate = (wealth / (price + q / rate)) * np.minimum(k, 1) - realEstate
        sum_wantedEstate += wantedEstate
    return sum_wantedEstate
price_range = np.linspace(0,10000,10000)
we = [wealth_evolution(p) for p in price_range]
plt.plot(price_range,we)
plt.xlabel('price')
plt.ylabel('wealth_evolution(price)')
plt.show()
At least for the parameters you specify it does not have a root, which is what fsolve tries to find. If you want to minimize a function you can try fmin. For this function this will not help though, because it seems to just asymptotically decay to 99.7 or so. So minimizing it would lead to infinite price.
So either you have to live with this or come up with a different function to optimize or constrain your search range (in which case you don't have to search, because it will just be the maximum value...).
I need to know how to generate 1000 random numbers between 500 and 600 that have a mean of 550 and a standard deviation of 30, in Python.
import pylab
import random
xrandn = pylab.zeros(1000,float)
for j in range(500,601):
    xrandn[j] = pylab.randn()
???????
You are looking for stats.truncnorm:
import scipy.stats as stats
a, b = 500, 600
mu, sigma = 550, 30
dist = stats.truncnorm((a - mu) / sigma, (b - mu) / sigma, loc=mu, scale=sigma)
values = dist.rvs(1000)
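One caveat: truncation changes the moments. With bounds symmetric about loc, the mean of the truncated distribution stays at 550, but its standard deviation comes out smaller than the sigma passed in. A quick check, reusing dist from above:

print(dist.mean(), dist.std())  # mean is 550.0 by symmetry; std is roughly 23.9, not 30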
There are other choices for your problem too. Wikipedia has a list of continuous distributions with bounded intervals, depending on the distribution you may be able to get your required characteristics with the right parameters. For example, if you want something like "a bounded Gaussian bell" (not truncated) you can pick the (scaled) beta distribution:
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
def my_distribution(min_val, max_val, mean, std):
    scale = max_val - min_val
    location = min_val
    # Mean and standard deviation of the unscaled beta distribution
    unscaled_mean = (mean - min_val) / scale
    unscaled_var = (std / scale) ** 2
    # Computation of alpha and beta can be derived from mean and variance formulas
    t = unscaled_mean / (1 - unscaled_mean)
    beta = ((t / unscaled_var) - (t * t) - (2 * t) - 1) / ((t * t * t) + (3 * t * t) + (3 * t) + 1)
    alpha = beta * t
    # Not all parameters may produce a valid distribution
    if alpha <= 0 or beta <= 0:
        raise ValueError('Cannot create distribution for the given parameters.')
    # Make scaled beta distribution with computed parameters
    return scipy.stats.beta(alpha, beta, scale=scale, loc=location)
np.random.seed(100)
min_val = 1.5
max_val = 35
mean = 9.87
std = 3.1
my_dist = my_distribution(min_val, max_val, mean, std)
# Plot distribution PDF
x = np.linspace(min_val, max_val, 100)
plt.plot(x, my_dist.pdf(x))
# Stats
print('mean:', my_dist.mean(), 'std:', my_dist.std())
# Get a large sample to check bounds
sample = my_dist.rvs(size=100000)
print('min:', sample.min(), 'max:', sample.max())
Output:
mean: 9.87 std: 3.100000000000001
min: 1.9290674232087306 max: 25.03903889816994
Probability density function plot:
Note that not every possible combination of bounds, mean and standard deviation will produce a valid distribution in this case, though, and depending on the resulting values of alpha and beta the probability density function may look like an "inverted bell" instead (this U shape appears when both alpha and beta come out below 1), even though the mean and standard deviation would still be correct.
I'm not exactly sure what the OP wanted, but if they just need an array xrandn matching the bottom plot, here are the steps:
First, create a normal (Gaussian) distribution; the easiest way might be to use numpy:
import numpy as np
random_nums = np.random.normal(loc=550, scale=30, size=1000)
And then you keep only the numbers within the desired range with a list comprehension:
random_nums_filtered = [i for i in random_nums if i>500 and i<600]
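Note that the list comprehension discards out-of-range values, so random_nums_filtered will end up with fewer than 1000 elements. If exactly 1000 are needed, one option is to keep drawing until enough values survive; a minimal sketch (variable names are mine):

import numpy as np

kept = np.empty(0)
while kept.size < 1000:
    draws = np.random.normal(loc=550, scale=30, size=1000)
    # keep only the in-range draws and accumulate them
    kept = np.concatenate([kept, draws[(draws > 500) & (draws < 600)]])
random_nums_filtered = kept[:1000]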