I need to know how to generate 1000 random numbers between 500 and 600 that have a mean = 550 and standard deviation = 30 in Python.
import pylab
import random
xrandn = pylab.zeros(1000,float)
for j in range(500, 601):
    xrandn[j] = pylab.randn()
???????
You are looking for stats.truncnorm:
import scipy.stats as stats
a, b = 500, 600
mu, sigma = 550, 30
dist = stats.truncnorm((a - mu) / sigma, (b - mu) / sigma, loc=mu, scale=sigma)
values = dist.rvs(1000)
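As a quick sanity check (my addition, not part of the original answer), you can confirm the draws stay inside the bounds and look at the sample moments:

print(values.min(), values.max())   # both within (500, 600)
print(values.mean(), values.std())  # mean ~550; note the truncation pulls the std below 30 (to roughly 24)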
There are other choices for your problem too. Wikipedia has a list of continuous distributions on bounded intervals; depending on the distribution, you may be able to get your required characteristics with the right parameters. For example, if you want something like "a bounded Gaussian bell" (not truncated) you can pick the (scaled) beta distribution:
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
def my_distribution(min_val, max_val, mean, std):
    scale = max_val - min_val
    location = min_val
    # Mean and standard deviation of the unscaled beta distribution
    unscaled_mean = (mean - min_val) / scale
    unscaled_var = (std / scale) ** 2
    # Computation of alpha and beta can be derived from mean and variance formulas
    t = unscaled_mean / (1 - unscaled_mean)
    beta = ((t / unscaled_var) - (t * t) - (2 * t) - 1) / ((t * t * t) + (3 * t * t) + (3 * t) + 1)
    alpha = beta * t
    # Not all parameters may produce a valid distribution
    if alpha <= 0 or beta <= 0:
        raise ValueError('Cannot create distribution for the given parameters.')
    # Make scaled beta distribution with computed parameters
    return scipy.stats.beta(alpha, beta, scale=scale, loc=location)
np.random.seed(100)
min_val = 1.5
max_val = 35
mean = 9.87
std = 3.1
my_dist = my_distribution(min_val, max_val, mean, std)
# Plot distribution PDF
x = np.linspace(min_val, max_val, 100)
plt.plot(x, my_dist.pdf(x))
# Stats
print('mean:', my_dist.mean(), 'std:', my_dist.std())
# Get a large sample to check bounds
sample = my_dist.rvs(size=100000)
print('min:', sample.min(), 'max:', sample.max())
Output:
mean: 9.87 std: 3.100000000000001
min: 1.9290674232087306 max: 25.03903889816994
Probability density function plot:
Note that not every combination of bounds, mean, and standard deviation will produce a valid distribution in this case, and depending on the resulting values of alpha and beta the probability density function may look like an "inverted bell" instead (even though the mean and standard deviation would still be correct).
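For instance (my addition, plugging the question's numbers into the function above), my_distribution(500, 600, 550, 30) gives alpha ≈ beta ≈ 0.89; both are below 1, so the density is exactly that "inverted bell" (U-shaped) case, even though the mean and standard deviation come out right:

question_dist = my_distribution(500, 600, 550, 30)
x = np.linspace(500, 600, 100)
plt.plot(x, question_dist.pdf(x))  # U-shaped, since alpha and beta are both < 1
print('mean:', question_dist.mean(), 'std:', question_dist.std())  # ~550 and ~30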
I'm not exactly sure what the OP desired, but if they just wanted an array xrandn matching the bottom plot, below are the steps:
First, create a normal (Gaussian) distribution; the easiest way might be to use numpy:
import numpy as np
random_nums = np.random.normal(loc=550, scale=30, size=1000)
And then you keep only the numbers within the desired range with a list comprehension:
random_nums_filtered = [i for i in random_nums if i>500 and i<600]
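Note that the filter usually leaves fewer than 1000 values. If exactly 1000 in-range numbers are needed, a minimal sketch (my addition, not part of the original answer) is to keep drawing until enough survive the filter:

import numpy as np

random_nums_filtered = []
while len(random_nums_filtered) < 1000:
    draws = np.random.normal(loc=550, scale=30, size=1000)
    # keep only the draws strictly inside (500, 600)
    random_nums_filtered.extend(draws[(draws > 500) & (draws < 600)])
random_nums_filtered = random_nums_filtered[:1000]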
I have the code below, which generates a normal distribution as a PDF, centered at the mean 400 with standard deviation 40:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
muPrev, sigmaPrev = 400, 40.
a = np.random.normal(muPrev, sigmaPrev, 100000)
count, bins, ignored = plt.hist(a, 1000, density=True)
plt.plot(bins, 1/(sigmaPrev * np.sqrt(2 * np.pi)) *
         np.exp(-(bins - muPrev)**2 / (2 * sigmaPrev**2)), linewidth=3, color='r')
and I can visualise it. But what if I wanted to convert this into a lognormal distribution? So that I now get values of mu and sigma that correspond to this as a lognormal distribution?
What is posted by @SamMason is not correct. It somewhat works only because your mean and sd are relatively large.
OK, here is the correct way to get the parameters of the log-normal distribution.
You have predefined values of mean (corresponding to your Gaussian mean) and sd (again, your Gaussian sd).
Mean = exp(μ + σ²/2)
Var = (exp(σ²) - 1) exp(2μ + σ²)
Here μ and σ are the log-normal (NOT Gaussian) parameters. You have to find them.
Compute the mean from your Gaussian mean (OK, that one is easy, they are equal)
Compute the variance from your Gaussian sd (square it)
Using the formulas above, solve the system of two non-linear equations and get your μ and σ
Plug μ and σ into your sampling routine and draw samples
UPDATE
Mean² = exp(2μ + σ²)
Var/Mean² = exp(σ²) - 1
So here is your σ. To be more explicit:
Sd²/Mean² = exp(σ²) - 1
exp(σ²) = 1 + Sd²/Mean²
σ² = ln(1 + Sd²/Mean²)
From the first equation you can now get μ:
2μ + σ² = ln(Mean²)
2μ = ln(Mean²) - σ² = ln(Mean²) - ln(1 + Sd²/Mean²) = ln(Mean²/(1 + Sd²/Mean²))
Please check the math, but this is the way to get PRECISE log-normal μ, σ parameters to match the desired Mean and Sd.
@SamMason's approximation works, I believe, only if in the expression
σ² = ln(1 + Sd²/Mean²)
the second term is much larger than 1. Then you could drop the 1 and have a log of ratios.
UPDATE II
2μ = ln(Mean²/(1 + Sd²/Mean²)) = ln(Mean⁴/(Mean² + Sd²))
μ = (1/2) ln(Mean⁴/(Mean² + Sd²)) = ln(Mean²/sqrt(Mean² + Sd²))
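As a minimal sketch (my addition, simply translating the formulas above into code), the exact parameters can be computed and checked numerically:

import numpy as np

def lognormal_params(mean, sd):
    # Exact log-normal mu and sigma matching the desired mean and sd
    sigma2 = np.log(1 + sd**2 / mean**2)
    mu = np.log(mean**2 / np.sqrt(mean**2 + sd**2))
    return mu, np.sqrt(sigma2)

mu, sigma = lognormal_params(400, 40)
samples = np.random.lognormal(mu, sigma, 100_000)
print(samples.mean(), samples.std())  # close to 400 and 40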
You could directly generate samples from a lognormal distribution with https://numpy.org/doc/stable/reference/random/generated/numpy.random.lognormal.html; alternatively:
log_norm = np.exp(a)
Note that if you want to generate the lognormal directly you need to calculate the appropriate mean and variance https://en.wikipedia.org/wiki/Log-normal_distribution
To give a more complete answer, here's some code that draws a figure with two plots: one shows your existing Gaussian draws and another for log-normal draws. I keep the first and second moments the same (i.e. mean and variance) by setting the log-normal mu=log(mu) and sigma=sd/mu.
import numpy as np
import scipy.stats as sps
import matplotlib.pyplot as plt
mu, sd = 400, 40
n = 100_000
# draw samples from distributions
a = np.random.normal(mu, sd, n)
b = np.random.lognormal(np.log(mu), sd / mu, n)
# use Scipy for analytical PDFs
d1 = sps.norm(mu, sd)
# warning: scipy parameterises its distributions very strangely
d2 = sps.lognorm(sd / mu, scale=mu)
# bins to use for histogram and x for PDFs
lo, hi = np.min([a, b]), np.max([a, b])
dx = (hi - lo) * 0.06
bins = np.linspace(lo, hi, 101)
x = np.linspace(lo - dx, hi + dx, 501)
# draw figure
fig, [ax1, ax2] = plt.subplots(nrows=2, sharex=True, sharey=True, figsize=(8, 5))
ax1.set_title("Normal draws")
ax1.set_xlim(lo - dx, hi + dx)
ax1.hist(a, bins, density=True, alpha=0.5)
ax1.plot(x, d1.pdf(x))
ax1.plot(x, d2.pdf(x), '--')
ax2.set_title("Log-Normal draws")
ax2.hist(b, bins, density=True, alpha=0.5, label="Binned density")
ax2.plot(x, d1.pdf(x), '--', label="Normal PDF")
ax2.plot(x, d2.pdf(x), label="Log-Normal PDF")
ax2.legend()
fig.supylabel("Density")
which produces the following output:
Because the distributions are so close here, I've included dashed lines to show the other distribution for easier comparison. Note that the log-normal distribution will always be slightly right-skewed, more so as the variance increases.
I have two questions:
1- This code takes too long to execute. Any idea how I can make it faster?
With the code below I want to generate 100 random discrete values between 700 and 1200.
I chose the Weibull distribution because I wanted to generate failure-rate data; please see the histogram below.
import random
nums = []
alpha = 0.6
beta = 0.4
while len(nums) != 100:
    temp = int(random.weibullvariate(alpha, beta))
    if 700 <= temp < 1200:
        nums.append(temp)
print(nums)
# plotting a graph
#plt.hist(nums, bins = 200)
#plt.show()
print(nums)
I wanted to generate a histogram like this one:
Histogram
2- I have this function for the discrete Weibull distribution:
def DiscreteWeibull(q, b, x):
    return q**(x**b) - q**((x + 1)**b)
How can I generate random values that follow this distribution?
Since a Weibull random variable with shape parameter k and scale parameter lambda can be generated from a Uniform(0, 1) variable U as W = lambda * (-ln U)^(1/k), we can 'cut' the distribution to a desired minimum and maximum value. We do this by inverting the equation, setting W to 700 or 1200, and finding the values between 0 and 1 that correspond. Here's some sample code.
import math
import random
import matplotlib.pyplot as plt

def weibull_from_uniform(shape, scale, x):
    # Inverse transform: maps a uniform draw x in (0, 1] to a Weibull value
    assert 0 <= x <= 1
    return scale * pow(-1 * math.log(x), 1.0 / shape)

scale_param = 0.6
shape_param = 0.4
min_value = 700.0
max_value = 1200.0
# Uniform values corresponding to the desired Weibull bounds
lower_bound = math.exp(-1 * pow(min_value / scale_param, shape_param))
upper_bound = math.exp(-1 * pow(max_value / scale_param, shape_param))
if lower_bound > upper_bound:
    lower_bound, upper_bound = upper_bound, lower_bound

nums = []
while len(nums) < 100:
    nums.append(weibull_from_uniform(shape_param, scale_param, random.uniform(lower_bound, upper_bound)))
print(nums)
plt.hist(nums, bins=8)
plt.show()
This code gives a histogram very similar to the one you provided; the method will give values from the same distribution as your original method, just faster. Note that this direct approach only works when our shape parameter K <= 1, so that the density function is strictly decreasing. When K > 1, the Weibull density function increases to a mode, then decreases, so you may need to draw from two uniform intervals for particular min and max values (since inverting for W and U may give two answers).
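As for the second question, here is a minimal sketch of my own (not from the answer above): the discrete Weibull pmf q**(x**b) - q**((x + 1)**b) has CDF F(x) = 1 - q**((x + 1)**b), so it can be sampled by inverse transform:

import math
import random

def discrete_weibull_sample(q, b):
    # Invert F(x) = 1 - q**((x + 1)**b) at a uniform draw u
    u = random.random()
    x = math.ceil((math.log(1 - u) / math.log(q)) ** (1.0 / b)) - 1
    return max(x, 0)

# q and b here are arbitrary example values, not fitted to your data
samples = [discrete_weibull_sample(0.9, 0.8) for _ in range(100)]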
Your question is not very clear on why you thought using this Weibull distribution was a good idea, nor what distribution you are looking to achieve.
Discrete uniform distribution
Here are two ways to achieve the discrete uniform distribution on [700, 1200).
1) With random
import random
nums = [random.randrange(700, 1200) for _ in range(100)]
2) With numpy
import numpy
nums = numpy.random.randint(700, 1200, 100)
Geometric distribution
You have edited your question with an example histogram, and the mention "I wanted to generate a histogram like this one". The histogram vaguely looks like a geometric distribution.
We can use numpy.random.geometric:
import numpy
n_samples = 100
p = 0.5
a, b = 50, 650
cap = 1200
nums = numpy.random.geometric(p, size = 2 * n_samples) * a + b
nums = nums[numpy.where(nums < cap)][:n_samples]
So, I have this code, which calculates the bounds of a confidence interval:
import statsmodels.api as sm
from statsmodels.stats.proportion import proportion_confint
def bin_conf(k, n, a):
    alpha, count, nobs = a, k, n
    return proportion_confint(count, nobs, alpha, method='normal')
bin_conf(75, 300, 0.05)
>>> (0.20100090038649865, 0.29899909961350135)
But I need to find the alpha for which a given, already defined confidence interval is obtained.
For example: number of trials (n) = 500. Successful trials = 200. CI = [0.35, 0.45]. Alpha = ?
Does statsmodels or any other Python library have a way to find that?
statsmodels does not have a helper function for this because it is not a common use case.
However, confidence intervals based on the normal distribution can easily be inverted. The width of the confidence interval is two times the critical value times the standard deviation of the mean.
import numpy as np
from scipy import stats

ci = [0.35, 0.45]
count, nobs = 200, 500
p = count / nobs
std = np.sqrt(p * (1 - p) / nobs)
critval = (ci[1] - ci[0]) / std / 2
alpha = stats.norm.sf(critval) * 2  # two-sided
alpha
0.02247887336612522
Check that it is the same as the "normal" proportion_confint:
proportion_confint(count, nobs, alpha, method='normal')
(0.35, 0.45000000000000007)
The same approach can be used for confidence intervals based on the t-distribution, by replacing stats.norm with stats.t and using the appropriate degrees of freedom.
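A minimal sketch of that t-based variant (my addition, assuming nobs - 1 degrees of freedom):

import numpy as np
from scipy import stats

ci = [0.35, 0.45]
count, nobs = 200, 500
p = count / nobs
std = np.sqrt(p * (1 - p) / nobs)
critval = (ci[1] - ci[0]) / std / 2
alpha_t = stats.t.sf(critval, df=nobs - 1) * 2  # two-sided, t instead of normal
print(alpha_t)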
I am trying to fit my data to a Negative Binomial Distribution with the package scipy in Python. However, my validation seems to fail.
These are my steps:
I have some demand data which is described by the statistics:
mu = 1.4
std = 1.59
print(mu, std)
I use the parameterization function below, taken from this post, to compute the two NB parameters.
def convert_params(mu, theta):
    """
    Convert mean/dispersion parameterization of a negative binomial to the ones scipy supports

    See https://en.wikipedia.org/wiki/Negative_binomial_distribution#Alternative_formulations
    """
    r = theta
    var = mu + 1 / r * mu ** 2
    p = (var - mu) / var
    return r, 1 - p
I pass (hopefully correctly...) my two statistics; the naming convention between different sources (p, r, k) is rather confusing at this point.
firstParam, secondParam = convert_params(mu, std)
I would then use these two parameters to fit the distribution:
from scipy.stats import nbinom
rv = nbinom(firstParam, secondParam)
Then I calculate a value R with the Percent Point Function .ppf(0.95). The value R in the context of my problem is a Reorder Point.
R = rv.ppf(0.95)
Now is when I expect to validate the previous steps, but I do not manage to retrieve my original statistics mu and std with mean and math.sqrt(var) respectively.
import math
mean, var = nbinom.stats(firstParam, secondParam, moments='mv')
print(mean, math.sqrt(var))
What am I missing? Any feedback about the parameterization implemented in Scipy?
The conversion code is wrong: I believe SciPy is NOT using the Wikipedia convention, but the Mathematica convention.
#%%
import numpy as np
from scipy.stats import nbinom
def convert_params(mean, std):
    """
    Convert mean/std parameterization of a negative binomial to the one scipy supports

    See https://mathworld.wolfram.com/NegativeBinomialDistribution.html
    """
    p = mean/std**2
    n = mean*p/(1.0 - p)
    return n, p
mean = 1.4
std = 1.59
n, p = convert_params(mean, std)
print((n, p))
#%%
m, v = nbinom.stats(n, p, moments='mv')
print(m, np.sqrt(v))
The code prints back the (1.4, 1.59) pair.
And the reorder point, computed as
rv = nbinom(n, p)
print("reorder point:", rv.ppf(0.95))
outputs 5
It looks like you are using a different conversion. The last bullet in the cited Wikipedia section gives the formulas shown below. With these formulas you get back exactly the same mu and std:
import numpy as np
from scipy.stats import nbinom
def convert_mu_std_to_r_p(mu, std):
    r = mu ** 2 / (std ** 2 - mu)
    p = 1 - mu / std ** 2
    return r, 1 - p
mu = 1.4
std = 1.59
print("mu, std:", mu, std)
firstParam, secondParam = convert_mu_std_to_r_p(mu, std)
mean, var = nbinom.stats(firstParam, secondParam, moments='mv')
print("mean, sqrt(var):", mean, np.sqrt(var))
rv = nbinom(firstParam, secondParam)
print("reorder point:", rv.ppf(0.95))
Output:
mu, std: 1.4 1.59
mean, sqrt(var): 1.4 1.59
reorder point: 5.0
I am trying to fit some data using scipy.optimize.curve_fit. I have read the documentation and also this StackOverflow post, but neither seem to answer my question.
I have some data which is simple, 2D data which looks approximately like a trig function. I want to fit it with a general trig function of the form A*cos(omega*t + dphi) + C using scipy.
My approach is as follows:
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
#Load the data
data = np.loadtxt('example_data.txt')
t = data[:,0]
y = data[:,1]
#define the function to fit
def func_cos(t, A, omega, dphi, C):
    # A is the amplitude, omega the frequency, dphi and C the horizontal/vertical shifts
    return A*np.cos(omega*t + dphi) + C
#do a scipy fit
popt, pcov = curve_fit(func_cos, t,y)
#Plot fit data and original data
fig = plt.figure(figsize=(14,10))
ax1 = plt.subplot2grid((1,1), (0,0))
ax1.plot(t,y)
ax1.plot(t,func_cos(t,*popt))
This outputs a plot where blue is the data and orange is the fit. Clearly I am doing something wrong. Any pointers?
If no values are provided for the initial guess of the parameters p0, then a value of 1 is assumed for each of them. From the docs:
p0 : array_like, optional
Initial guess for the parameters (length N). If None, then the initial values will all be 1 (if the number of parameters for the function can be determined using introspection, otherwise a ValueError is raised).
Since your data has very large x-values and very small y-values an initial guess of 1 is far from the actual solution and hence the optimizer does not converge. You can help the optimizer by providing suitable initial parameter values that can be guessed / approximated from the data:
Amplitude: A = (y.max() - y.min()) / 2
Offset: C = (y.max() + y.min()) / 2
Frequency: Here we can estimate the number of zero crossings by multiplying consecutive y-values and checking which products are smaller than zero. This number divided by the total x-range gives the frequency, and in order to get it in units of pi we can multiply that number by pi: y_shifted = y - offset; omega = np.pi * np.sum(y_shifted[:-1] * y_shifted[1:] < 0) / (t.max() - t.min())
Phase shift: can be set to zero, dphi = 0
So in summary, the following initial parameter guess can be used:
offset = (y.max() + y.min()) / 2
y_shifted = y - offset
p0 = (
    (y.max() - y.min()) / 2,
    np.pi * np.sum(y_shifted[:-1] * y_shifted[1:] < 0) / (t.max() - t.min()),
    0,
    offset
)
popt, pcov = curve_fit(func_cos, t, y, p0=p0)
Which gives me the following fit function: