Someone suggested in passing that I could use the half-normal distribution in Python, whose support runs from 0 to infinity, to set the min and max points:
halfnorm.rvs()
The 0 takes care of the min, but I have no idea what to do about the infinity.
I would like to build a number generator for values from 0 to 15 that follow a normal distribution, but I am having a hard time finding a function that never goes above the max or below the min, because the normal distribution itself is unbounded.
I would try to use the beta-distribution: https://en.wikipedia.org/wiki/Beta_distribution. It's quite simple (e.g. to integrate) and capable of fitting typical reaction time distributions.
Now the question is how to sample this efficiently for fixed α and β parameters ... scipy has done it for us: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.beta.html
Edit: Motivated by the comment and by curiosity, here is an example that plots the histograms of 10 samples of 1000 values each:
from scipy.stats import beta
from numpy import histogram
import pylab
max_time = 3
min_time = 0.5
a, b = 2, 7
dist = beta(a, b)
for _ in range(10):
    # rescale the beta samples from [0, 1] to [min_time, max_time]
    sample = min_time + dist.rvs(size=1000) * (max_time - min_time)
    his, bins = histogram(sample, bins=20, density=True)
    pylab.plot(bins[:-1], his, ".")
pylab.xlabel("Reaction time [s]")
pylab.ylabel("Probability density [1/s]")
pylab.grid()
pylab.show()
I just answered a similar question here. I'll copy my answer, as I think this question's title is much more informative:
You can draw from a uniform distribution whose boundaries are the min and max "translated" from normal to uniform space (using the error function), and then convert the samples to a normal distribution using the inverse error function.
import matplotlib.pyplot as plt
import numpy as np
from scipy import special
mean = 0
std = 7
min_value = 0
max_value = 15
# The normal CDF is 0.5 * (1 + erf((x - mean) / (std * sqrt(2)))), so the
# sqrt(2) factor is needed for the samples to end up with the requested std.
min_in_standard_domain = (min_value - mean) / (std * np.sqrt(2))
max_in_standard_domain = (max_value - mean) / (std * np.sqrt(2))
min_in_erf_domain = special.erf(min_in_standard_domain)
max_in_erf_domain = special.erf(max_in_standard_domain)
random_uniform_data = np.random.uniform(min_in_erf_domain, max_in_erf_domain, 10000)
random_gaussianized_data = (special.erfinv(random_uniform_data) * std * np.sqrt(2)) + mean
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
axes[0].hist(random_uniform_data, 30)
axes[1].hist(random_gaussianized_data, 30)
axes[0].set_title('uniform distribution samples')
axes[1].set_title('erfinv(uniform distribution samples)')
plt.show()
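As an alternative to the erf-based approach above, SciPy also ships a ready-made truncated normal. The following is only a minimal sketch: it reuses the mean, std and bounds from the answer above, and the fixed seed is an arbitrary choice made for reproducibility.

```python
import numpy as np
from scipy.stats import truncnorm

mean, std = 0.0, 7.0              # parameters of the untruncated normal
min_value, max_value = 0.0, 15.0  # hard limits for the generated numbers

# truncnorm expects its bounds in units of std, measured from loc
a = (min_value - mean) / std
b = (max_value - mean) / std
dist = truncnorm(a, b, loc=mean, scale=std)

# random_state=0 is an arbitrary seed, used only to make the run repeatable
samples = dist.rvs(size=10_000, random_state=0)
print(samples.min(), samples.max())
```

Every sample lies inside [min_value, max_value] by construction, so no clipping or retrying is needed afterwards.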
I recently ran into a similar issue.
To get around this and keep my values within reasonable bounds, I just added some if statements that catch any number below the real min or above the real max and reflect it back into the range.
if value < 0:
    value = abs(value)
elif value > 15:
    diff = value - 15
    value = 15 - diff
This was close enough for me.
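The same reflection idea can be sketched for a whole NumPy array at once. The helper name `reflect_into_range` is hypothetical, and repeating the reflection until every value is in range is an addition here (a single reflection can still overshoot the opposite bound); the sketch assumes finite inputs.

```python
import numpy as np

def reflect_into_range(values, low, high):
    """Mirror out-of-range values back across the violated bound.

    Hypothetical helper: reflection is repeated until every (finite)
    value lies inside [low, high].
    """
    values = np.asarray(values, dtype=float).copy()
    while ((values < low) | (values > high)).any():
        # mirror values below the lower bound upwards
        values = np.where(values < low, 2 * low - values, values)
        # mirror values above the upper bound downwards
        values = np.where(values > high, 2 * high - values, values)
    return values

print(reflect_into_range([-1.0, 16.0, 7.0], 0, 15))
```

Note that reflecting changes the density near the bounds compared with a truly truncated normal, which matches the "close enough" spirit of the answer rather than an exact method.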
Is there any library/function in Python which allows us to generate discrete data that matches given target moments (mean, standard deviation, skewness, kurtosis)? I do not wish to necessarily enforce any specific underlying continuous distribution.
That is, I want to generate, say, 10000 numbers, such that when we calculate their first four moments using standard formulae we get something close to the target moments given as input.
Is there any known library in Python that implements such a method? Here is an example of a paper in which this specific problem is solved (as part of a larger problem):
https://link.springer.com/article/10.1023/A:1021853807313
Thanks!
Yes, although not with 100% accuracy, this is possible.
import statsmodels.sandbox.distributions.extras as extras
import scipy.interpolate as interpolate
import scipy.stats as ss
import matplotlib.pyplot as plt
import numpy as np
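def generate_normal_four_moments(mu, sigma, skew, kurt, size=10000, sd_wide=10):
    # pdf_mvsk expects [mean, variance, skew, excess kurtosis], hence sigma ** 2
    f = extras.pdf_mvsk([mu, sigma ** 2, skew, kurt])
    x = np.linspace(mu - sd_wide * sigma, mu + sd_wide * sigma, num=500)
    y = [f(i) for i in x]
    # the normalised cumulative sum approximates the CDF on the grid
    yy = np.cumsum(y) / np.sum(y)
    # interpolate the inverse CDF and feed it uniform random numbers
    inv_cdf = interpolate.interp1d(yy, x, fill_value="extrapolate")
    rr = np.random.rand(size)
    return inv_cdf(rr)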
Next, we generate the data by using
data = generate_normal_four_moments(mu=0, sigma=1, skew=-1, kurt=3)
Let's check the moments:
np.mean(data)
np.var(data)
ss.skew(data)
ss.kurtosis(data)
gives
-0.039986656405454374
1.051375501684874
-1.071149838792561
2.9813805363255472
Is there a way to calculate the square of a number (closest approximation), say 4, using a Gaussian distribution where mu is the number and sigma is 0.16, with 1000 random points?
I searched the internet a lot but couldn't find a solution. Any piece of code would be very helpful, as I am new to Python.
Assuming you have already generated your data, you can approximate mu (which is the square of your number) by taking the mean of the data. By the law of large numbers, the approximation becomes more accurate as the size of the data grows. Example:
import random
def generate_data(size):
    mu, sigma = 4 ** 2, 0.16
    return [random.gauss(mu, sigma) for _ in range(size)]
def mean(ls):
    return sum(ls) / len(ls)
print(mean(generate_data(10))) #15.976644889526114
print(mean(generate_data(100))) #16.004123848232233
print(mean(generate_data(1000))) #16.00164187802018
print(mean(generate_data(10000))) #16.001000022147206
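To put a number on "more accurate": for independent draws, the standard deviation of the sample mean is sigma / sqrt(n). The sketch below is just that formula applied to the sigma of 0.16 used above; the helper name `standard_error` is made up for illustration.

```python
import math

def standard_error(sigma, n):
    """Standard deviation of the mean of n independent draws with std sigma."""
    return sigma / math.sqrt(n)

# typical size of the error of the sample mean for the sample sizes used above
for n in (10, 100, 1000, 10000):
    print(n, standard_error(0.16, n))
```

Each tenfold increase in the sample size shrinks the typical error by a factor of about 3.16, which is why the printed means above settle only slowly around 16.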
You can use numpy.random.randn to draw from a standard Gaussian distribution and then scale the samples as needed. From the docs:
For random samples from N(\mu, \sigma^2), use:
sigma * np.random.randn(...) + mu
which, for your example, gives:
import numpy as np
import matplotlib.pyplot as plt
N = 4.
mu = N**2
sigma = 1/N**2  # treated as a variance below, so the standard deviation is np.sqrt(sigma)
dist = np.sqrt(sigma) * np.random.randn(1000) + mu
plt.hist(dist,30)
plt.show()
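If you don't want to use numpy, you could also use the random module. Note that random.normalvariate takes the standard deviation as its second argument, so the square root is taken to match the numpy version:
import random
dist = [random.normalvariate(mu, sigma ** 0.5) for i in range(1000)]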
I calculated the minimum variance hedge ratio (MVHR) of two securities' returns by:
1. Calculating the optimal h* = Cov(S,F) / Var(F) using samples
2. Running an OLS regression and obtain the beta value
Both values differ slightly, for example I got h* = 0.9547 and beta = 0.9537. But they are supposed to be the same. Why is that so?
Below is my code:
import numpy as np
import statsmodels.api as sm
indexAvg = np.mean(indexRets)
secAvg = np.mean(secRets)
var = np.var(secRets, ddof=1)
cov_num = 0.0
cov_denom = len(secRets) - 1
for i in range(0, len(secRets)):
    cov_num += (indexRets[i] - indexAvg) * (secRets[i] - secAvg)
cov = cov_num / cov_denom
h = cov / var
ols_res = sm.OLS(indexRets, secRets).fit()
beta = ols_res.params[0]
print(h, beta)
indexRets and secRets are lists of daily returns of the index and the security (futures), respectively.
This is a case of a missing constant in the OLS regression. The covariance and variance calculations subtract the mean, which in a linear regression corresponds to including a constant. statsmodels does not include a constant by default unless you use the formula interface.
For more details and an example, see the related question "OLS of statsmodels does not work with inversely proportional data?"
Also, you can replace the Python loop that calculates the covariance with a call to numpy.cov.
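A minimal sketch of the point above, using plain NumPy least squares on made-up returns (the data, seed and variable names are assumptions for illustration only). In statsmodels the constant column would typically be added with sm.add_constant before calling sm.OLS.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic daily returns, invented purely for illustration
sec_rets = rng.normal(0.0005, 0.01, size=250)
index_rets = 0.0002 + 0.95 * sec_rets + rng.normal(0.0, 0.002, size=250)

# h* = Cov(index, sec) / Var(sec); np.cov uses ddof=1 by default
cov_matrix = np.cov(index_rets, sec_rets)
h = cov_matrix[0, 1] / cov_matrix[1, 1]

# Least squares without a constant (the model sm.OLS(y, x) fits as written)
beta_no_const = np.linalg.lstsq(sec_rets[:, None], index_rets, rcond=None)[0][0]

# Least squares with a constant column: the slope is the second coefficient
X = np.column_stack([np.ones_like(sec_rets), sec_rets])
beta_with_const = np.linalg.lstsq(X, index_rets, rcond=None)[0][1]

print(h, beta_no_const, beta_with_const)
```

The slope from the regression with a constant agrees with h up to floating-point error, while the slope without a constant is generally slightly different.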
I have been trying to get the result of a lognormal distribution using SciPy. I already have the mu and sigma, so I don't need to do any other prep work. To be more specific (as far as my limited knowledge of stats allows), I am looking for the cumulative distribution function (cdf in SciPy). The problem is that I can't figure out how to do this with just the mean and standard deviation so that the answer returned is something between 0 and 1. I'm also not sure which method of dist I should be using to get the answer. I've tried reading the documentation and looking through SO, but the relevant questions (like this and this) didn't seem to provide the answers I was looking for.
Here is a code sample of what I am working with. Thanks.
from scipy.stats import lognorm
stddev = 0.859455801705594
mean = 0.418749176686875
total = 37
dist = lognorm.cdf(total,mean,stddev)
UPDATE:
So after a bit of work and a little research, I got a little further. But I still am getting the wrong answer. The new code is below. According to R and Excel, the result should be .7434, but that's clearly not what is happening. Is there a logic flaw I am missing?
dist = lognorm([1.744],loc=2.0785)
dist.cdf(25) # yields=0.96374596, expected=0.7434
UPDATE 2:
Working lognorm implementation which yields the correct 0.7434 result.
import math
def lognorm(x, mu=0, sigma=1):
    # CDF of a log-normal variable: normal CDF evaluated at (ln(x) - mu) / sigma
    a = (math.log(x) - mu) / math.sqrt(2 * sigma**2)
    p = 0.5 + 0.5 * math.erf(a)
    return p
lognorm(25, 2.0785, 1.744)
> 0.7434
I know this is a bit late (almost one year!), but I've been doing some research on the lognorm function in scipy.stats. A lot of folks seem confused about the input parameters, so I hope to help these people out. The example above is almost correct, but I found it strange to set the mean to the location ("loc") parameter: this signals that the cdf or pdf doesn't 'take off' until the value is greater than the mean. Also, the mean and standard deviation of the log of the variate should be passed as scale = exp(mean) and shape = standard deviation, respectively.
Simply put, the arguments are (x, shape, loc, scale), with the parameter definitions below:
loc - No equivalent, this gets subtracted from your data so that 0 becomes the infimum of the range of the data.
scale - exp μ, where μ is the mean of the log of the variate. (When fitting, typically you'd use the sample mean of the log of the data.)
shape - the standard deviation of the log of the variate.
I went through the same frustration as most people with this function, so I'm sharing my solution. Just be careful because the explanations aren't very clear without a compendium of resources.
For more information, I found these sources helpful:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.lognorm.html#scipy.stats.lognorm
https://stats.stackexchange.com/questions/33036/fitting-log-normal-distribution-in-r-vs-scipy
And here is an example, taken from @serv-inc's answer, posted on this page:
import math
from scipy import stats
# standard deviation of normal distribution
sigma = 0.859455801705594
# mean of normal distribution
mu = 0.418749176686875
# hopefully, total is the value where you need the cdf
total = 37
frozen_lognorm = stats.lognorm(s=sigma, scale=math.exp(mu))
frozen_lognorm.cdf(total) # use whatever function and value you need here
It sounds like you want to instantiate a "frozen" distribution from known parameters. In your example, you could do something like:
from scipy.stats import lognorm
stddev = 0.859455801705594
mean = 0.418749176686875
dist=lognorm([stddev],loc=mean)
which will give you a lognorm distribution object with the mean and standard deviation you specify. You can then get the pdf or cdf like this:
import numpy as np
import pylab as pl
x=np.linspace(0,6,200)
pl.plot(x,dist.pdf(x))
pl.plot(x,dist.cdf(x))
Is this what you had in mind?
from math import exp
from scipy import stats
def lognorm_cdf(x, mu, sigma):
    shape = sigma
    loc = 0
    scale = exp(mu)
    return stats.lognorm.cdf(x, shape, loc, scale)
x = 25
mu = 2.0785
sigma = 1.744
p = lognorm_cdf(x, mu, sigma) #yields the expected 0.74341
Similar to Excel and R, the lognorm_cdf function above parameterizes the CDF of the log-normal distribution using mu and sigma.
Although SciPy uses shape, loc and scale parameters to characterize its probability distributions, for the log-normal distribution I find it slightly easier to think of these parameters at the variable level rather than at the distribution level. Here's what I mean...
A log-normal variable X is related to a normal variable Z as follows:
X = exp(mu + sigma * Z) #Equation 1
which is the same as:
X = exp(mu) * exp(Z)**sigma #Equation 2
This can be sneakily re-written as follows:
X = exp(mu) * exp(Z-Z0)**sigma #Equation 3
where Z0 = 0. This equation is of the form:
f(x) = a * ( (x-x0) ** b ) #Equation 4
If you can visualize equations in your head, it should be clear that the scale, shape and location parameters in Equation 4 are a, b and x0, respectively. This means that in Equation 3 the scale, shape and location parameters are exp(mu), sigma and zero, respectively.
If you can't visualize that very clearly, let's rewrite Equation 2 as a function:
f(Z) = exp(mu) * exp(Z)**sigma #(same as Equation 2)
and then look at the effects of mu and sigma on f(Z). The figure below holds sigma constant and varies mu. You should see that mu vertically scales f(Z). However, it does so in a nonlinear manner; the effect of changing mu from 0 to 1 is smaller than the effect of changing mu from 1 to 2. From Equation 2 we see that exp(mu) is actually the linear scaling factor. Hence SciPy's "scale" is exp(mu).
The next figure holds mu constant and varies sigma. You should see that the shape of f(Z) changes. That is, f(Z) has a constant value when Z=0 and sigma affects how quickly f(Z) curves away from the horizontal axis. Hence SciPy's "shape" is sigma.
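The mapping scale = exp(mu), shape = sigma, loc = 0 can also be checked numerically. The sketch below is only a sanity check under the assumption that Equation 1 holds: it compares SciPy's log-normal CDF with the normal CDF evaluated at (ln(x) - mu) / sigma, using the mu and sigma from the example above and a few arbitrary x values.

```python
import numpy as np
from scipy import stats

mu, sigma = 2.0785, 1.744               # values from the example above
x = np.array([0.5, 5.0, 25.0, 100.0])   # arbitrary evaluation points

# CDF via SciPy's parameterisation: shape=sigma, loc=0, scale=exp(mu)
p_scipy = stats.lognorm.cdf(x, sigma, 0, np.exp(mu))

# CDF via Equation 1: X <= x is the same event as Z <= (ln(x) - mu) / sigma
p_normal = stats.norm.cdf((np.log(x) - mu) / sigma)

print(np.allclose(p_scipy, p_normal))
```

If the two arrays agree, the parameter mapping is consistent with Equation 1 at those points.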
Even later, but in case it's helpful to anyone else: I found that Excel's
LOGNORM.DIST(x,Ln(mean),standard_dev,TRUE)
provides the same results as python's
from scipy.stats import lognorm
lognorm.cdf(x,sigma,0,mean)
Likewise, Excel's
LOGNORM.DIST(x,Ln(mean),standard_dev,FALSE)
seems equivalent to Python's
from scipy.stats import lognorm
lognorm.pdf(x,sigma,0,mean)
@lucas' answer has the usage down pat. As a code example, you could use
import math
from scipy import stats
# standard deviation of normal distribution
sigma = 0.859455801705594
# mean of normal distribution
mu = 0.418749176686875
# hopefully, total is the value where you need the cdf
total = 37
frozen_lognorm = stats.lognorm(s=sigma, scale=math.exp(mu))
frozen_lognorm.cdf(total) # use whatever function and value you need here
Known mean and stddev of the lognormal distribution
In case someone is looking for it, here is a solution for getting the scipy.stats.lognorm distribution if the mean mu and standard deviation sigma of the lognormal distribution are known. In this case we have to calculate the stats.lognorm parameters from the known mu and sigma like so:
import numpy as np
from scipy import stats
mu = 10
sigma = 3
a = 1 + (sigma / mu) ** 2
s = np.sqrt(np.log(a))
scale = mu / np.sqrt(a)
This was obtained by looking into the implementation of the variance and mean calculations in the stats.lognorm.stats method and essentially reversing it (solving for the input).
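For readers who want to see where these expressions come from, here is a short derivation sketch. The symbols are assumptions introduced only for this block: m and sd are the known mean and standard deviation of the log-normal variable (mu and sigma in the code above), while \mu_L and \sigma_L are the mean and standard deviation of its logarithm.

```latex
\begin{align*}
m &= \exp\!\left(\mu_L + \tfrac{1}{2}\sigma_L^2\right), \qquad
\mathrm{sd}^2 = \left(e^{\sigma_L^2} - 1\right)\exp\!\left(2\mu_L + \sigma_L^2\right)
             = \left(e^{\sigma_L^2} - 1\right) m^2 \\
a &= 1 + \left(\frac{\mathrm{sd}}{m}\right)^2 = e^{\sigma_L^2}
  \quad\Longrightarrow\quad s = \sigma_L = \sqrt{\ln a} \\
\text{scale} &= e^{\mu_L} = m\, e^{-\sigma_L^2/2} = \frac{m}{\sqrt{a}}
\end{align*}
```

These match the three lines of code above: a, s = np.sqrt(np.log(a)) and scale = mu / np.sqrt(a).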
Then we can initialize the frozen distribution instance
distr = stats.lognorm(s, 0, scale)
# generate some randomvals
randomvals = distr.rvs(1_000_000)
# calculate mean and variance using the dedicated method
mu_stats, var_stats = distr.stats("mv")
Compare means and stddevs from input, randomvals and analytical solution from distr.stats:
print(f"""
                 Mean    Std
----------------------------
Input:         {mu:6.2f} {sigma:6.2f}
Randomvals:    {randomvals.mean():6.2f} {randomvals.std():6.2f}
lognorm.stats: {mu_stats:6.2f} {np.sqrt(var_stats):6.2f}
""")
                 Mean    Std
----------------------------
Input:          10.00   3.00
Randomvals:     10.00   3.00
lognorm.stats:  10.00   3.00
Plot PDF from stats.lognorm and histogram of the random values:
import holoviews as hv
hv.extension('bokeh')
x = np.linspace(0, 30, 301)
counts, _ = np.histogram(randomvals, bins=x)
counts = counts / counts.sum() / (x[1] - x[0])
(hv.Histogram((counts, x))
* hv.Curve((x, distr.pdf(x))).opts(color="r").opts(width=900))
If you are reading this and just want a function that behaves like the lnorm family in R (e.g. rlnorm), then spare yourself the violent anger and use numpy's numpy.random.lognormal.
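A minimal sketch of that suggestion, using NumPy's Generator API; the mu and sigma values are borrowed from the example earlier on this page, and the seed is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
mu, sigma = 2.0785, 1.744        # mean and std of log(X), borrowed from the example above

# lognormal takes the mean and std of the underlying normal distribution,
# much like meanlog and sdlog in R's rlnorm
samples = rng.lognormal(mean=mu, sigma=sigma, size=100_000)

# the log of the samples should recover mu and sigma approximately
print(np.log(samples).mean(), np.log(samples).std())
```

Unlike scipy.stats.lognorm, no conversion to shape and scale is needed here, because the arguments are already the parameters of the log of the variable.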