I'm building a simulation which requires random draws from the tail of a lognormal distribution. A threshold τ (tau) is chosen, and the resulting conditional distribution is given by:
F_τ(x) = (F(x) - F(τ)) / (1 - F(τ)), for x >= τ
I need to randomly sample from that conditional distribution, where F(x) is lognormal with a chosen µ (mu) and σ (sigma), and τ (tau) is set by the user.
My inelegant solution right now is simply to sample from the lognormal, tossing out any values under τ (tau), until I have the sample size I need. But I'm sure this can be improved.
Thanks for the help!
The easiest way is probably to leverage the truncated normal distribution as provided by Scipy.
This gives the following code, with ν (nu) as the variable of the standard Gaussian distribution, and τ (tau) mapping to ν0 on that distribution. This function returns a Numpy array containing ranCount lognormal variates:
import math
import numpy as np
from scipy.stats import truncnorm

def getMySamplesScipy(ranCount, mu, sigma, tau):
    nu0 = (math.log(tau) - mu) / sigma              # position of tau on unit Gaussian
    xs = truncnorm.rvs(nu0, np.inf, size=ranCount)  # truncated unit normal samples
    ys = np.exp(mu + sigma * xs)                    # go back to x space
    return ys
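As a quick sanity check (a sketch with arbitrary, illustrative parameter values), every returned variate should land at or above τ:

samples = getMySamplesScipy(100_000, mu=0.0, sigma=1.0, tau=2.0)
print(samples.min())   # should be >= tau = 2.0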
If for some reason this is not suitable, note that some of the tricks commonly used for Gaussian variates, such as Box-Muller, do not work for a truncated distribution. But we can always fall back on a general principle: Inverse Transform Sampling.
So we generate cumulative probabilities for our variates by transforming uniform variates, and we rely on Scipy's inverse of the erf error function to map those probabilities back to values in x space.
This gives something like the following Python code (without any attempt at optimization):
import math
import random
import numpy as np
import numpy.random as nprd
import scipy.special as spfn

# using the "Inverse Method":
def getMySamples(ranCount, mu, sigma, tau):
    nu0 = (math.log(tau) - mu) / sigma            # position of tau in standard Gaussian curve
    headCP = (1/2) * (1 + spfn.erf(nu0/math.sqrt(2)))
    tailCP = 1.0 - headCP                         # probability of being in the "tail"
    uvs = np.random.uniform(0.0, 1.0, ranCount)   # uniform variates
    cps = headCP + uvs * tailCP                   # cumulative probabilities
    nus = math.sqrt(2) * spfn.erfinv(2*cps - 1)   # positions in standard Gaussian
    xs = np.exp(mu + sigma * nus)                 # go back to x space
    return xs
Alternatives:
We can leverage the significant amount of material related to the Truncated Gaussian distribution.
There is a relatively recent (2016) review paper on the subject by Zdravko Botev and Pierre L'Ecuyer. This paper provides a pointer to publicly available R source code. Some of the material is much older, for example the 1986 book by Luc Devroye: Non-Uniform Random Variate Generation.
For example, a possible rejection-based method: if τ (tau) maps to ν0 on the standard Gaussian curve, the unit Gaussian density is proportional to exp(-ν²/2). If we write ν = ν0 + δ, this is (as a function of δ) proportional to: exp(-δ²/2) * exp(-ν0*δ).
The idea is to approximate the exact distribution beyond ν0 by an exponential one with rate ν0. Note that the exact density lies everywhere below the approximating one. We can then accept the relatively cheap exponential variates with probability exp(-δ²/2).
We can just pick an equivalent algorithm in the literature. In the Devroye book, chapter IX page 382, there is some pseudo-code:
REPEAT
    generate independent exponential random variates X and Y
UNTIL X² <= 2*ν0²*Y
RETURN R <-- ν0 + X/ν0
for which a Numpy rendition could be written like this:
def getMySamplesXpRj(rawRanCount, mu, sigma, tau):
    nu0 = (math.log(tau) - mu) / sigma       # position of tau in standard Gaussian
    if nu0 <= 0:
        print("Error: τ (tau) too small in getMySamplesXpRj")
    rnu0 = 1.0 / nu0
    xs = nprd.exponential(1.0, rawRanCount)  # exponential "raw" variates
    ys = nprd.exponential(1.0, rawRanCount)
    allSamples = nu0 + (rnu0 * xs)
    boolArray = (xs*xs - 2*nu0*nu0*ys) <= 0.0
    samples = allSamples[boolArray]          # keep only the accepted variates
    ys = np.exp(mu + sigma * samples)        # go back to x space
    return ys
According to Table 3 in the Botev-L'Ecuyer paper, the rejection rate of this algorithm is nicely low.
Besides, if you are willing to allow for some sophistication, there is also some literature about the Ziggurat algorithm as used for truncated Gaussian distributions, for example the 2012 arXiv 1201.6140 paper by Nicolas Chopin at ENSAE-CREST.
Side note: with recent versions of Python, it seems that you can use Greek letters for your variable names directly, σ instead of sigma, τ instead of tau, just as in the statistics books:
$ python3
Python 3.9.6 (default, Jun 29 2021, 00:00:00)
>>>
>>> σ = 2
>>> τ = 7
>>>
>>> στ = σ * τ
>>>
>>> στ + 1
15
>>>
A clean way is to define a subclass of rv_continuous with an implementation of _cdf. To draw variates you may want to also define _ppf or _rvs methods.
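For example, a minimal sketch of that idea for the left-truncated lognormal (the class name, attribute names and parameter values below are mine, not from the question; overriding _cdf and _ppf lets rvs draw via inverse-CDF sampling):

import numpy as np
from scipy import stats

class TailLognorm(stats.rv_continuous):
    """Sketch: lognormal(mu, sigma) conditioned on x >= tau."""
    def __init__(self, mu, sigma, tau):
        super().__init__(a=tau, name="tail_lognorm")   # a = lower bound of the support
        self._mu, self._sigma = mu, sigma
        self._Ftau = stats.lognorm.cdf(tau, s=sigma, scale=np.exp(mu))

    def _cdf(self, x):
        F = stats.lognorm.cdf(x, s=self._sigma, scale=np.exp(self._mu))
        return (F - self._Ftau) / (1.0 - self._Ftau)

    def _ppf(self, q):
        # invert via the untruncated lognormal percent-point function
        return stats.lognorm.ppf(self._Ftau + q * (1.0 - self._Ftau),
                                 s=self._sigma, scale=np.exp(self._mu))

dist = TailLognorm(mu=0.0, sigma=1.0, tau=2.0)   # illustrative parameters
samples = dist.rvs(size=10_000)                  # drawn through _ppf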
Related
I have a discrete power spectrum f(x) which is a power law with a specific exponent b: f(x) = A*x^(b). I want to estimate that exponent when I have f(x) and x. I'm interested in using a Maximum Likelihood Estimator (MLE) to find it. Why? Well, there's a paper that says it's the method with the least bias for such distributions: Clauset et al. 2007.
I already tried many different methods. For instance, I took the commonly known route of taking the log of both power and frequency and applying a least-squares fit to the log-log spectrum, but the resulting slope (the sought exponent) is sometimes biased. There's already a Python package implementing the mentioned paper, but it does MLE based on its own PDF or CDF, not on the probability distribution you provide. I tried editing it but it's quite complicated. I also tried using pytorch to do the optimization, with no success, since I'm not experienced with it.
How can I implement such MLE in python for my problem?
Edit: I have found someone ask a similar question but in R. Someone answered them here but again it's in R not python.
Edit: added code
from astroML.time_series import generate_power_law
import numpy as np
import scipy.fft
N=72 # number of points
dt=5*60 # time resolution
exponent= -2 # power law exponent
rand_seed= 1
time_max = 6*6*60 # observation time
x = np.linspace(0, time_max, N ) # time array
y = generate_power_law(N, dt, -exponent, random_state= rand_seed)
# amplitudes with power law array
# Get the spectrum by FFT for example
sample_freq = scipy.fft.rfftfreq(1*N, d=dt)[1:] # sampling frequency
sig_fft = scipy.fft.rfft(y,1*N)[1:] # FFT amplitudes
psd = (np.abs(sig_fft)**2)*2*dt/(N) # power spectral density
# do least squares fitting
logA = np.log(sample_freq )
logB = np.log(psd )
slope, intercept = np.polyfit(logA, logB, 1)
Another attempt at MLE, based on the comment linking to an article:
import numpy as np
from scipy.optimize import minimize

def func_powerlaw(x, params):
    slope = params[0]
    coefficient = params[1]
    return (x**slope) * coefficient

params = [1, 1]

# For MLE, minimize the negative log likelihood
def neglnlike(params, x, y):
    model = func_powerlaw(x, params)
    output = np.sum(np.log(model) + y/model)
    # Check that this is valid, returning a large number if not
    if not np.isfinite(output):
        return 1.0e30
    return output

res = minimize(neglnlike, params, args=(sample_freq, psd),
               method='Nelder-Mead')
See this library named powerlaw
https://github.com/jeffalstott/powerlaw
powerlaw is a toolbox using the statistical methods developed in Clauset et al. 2007 and Klaus et al. 2011 to determine if a probability distribution fits a power law.
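For example, a minimal usage sketch (the data here is synthetic and purely illustrative, generated with a Pareto tail; with real data you would pass your own sample array):

import numpy as np
import powerlaw

# synthetic, illustrative sample with a power-law tail
data = np.random.pareto(2.5, size=10_000) + 1.0

fit = powerlaw.Fit(data)               # MLE fit; also estimates the lower cutoff xmin
print(fit.power_law.alpha, fit.xmin)   # estimated exponent and cutoff
# likelihood-ratio comparison against an alternative distribution
R, p = fit.distribution_compare('power_law', 'lognormal')
print(R, p)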
I am binning some time series data, and I need to apply a half-normal filter to the binned data. How can I do this in python? I've provided a toy example below. I need Xbinned to be smoothed with a half-gaussian filter with a std of 0.25 (or whatever). I'm pretty sure the half-gaussian should be facing the forward time direction.
import numpy as np
X = np.random.randint(2, size=100) #example random process
bin_size = 5
Xbinned = []
for i in range(0, len(X)+1, bin_size):
    Xbinned.append(sum(X[i:i+(bin_size-1)])/bin_size)
How to implement half-gaussian filtering
Scipy has a function called scipy.ndimage.gaussian_filter(). It nearly implements what we want here. Unfortunately, there's no option to use a half-gaussian instead of a gaussian. However, scipy is open-source, so we can just take the source code and modify it to be a half-gaussian.
I used this source code, and removed all of the parts that are not needed for this particular case. At the end, I had this:
import numpy as np
import scipy.ndimage

def halfgaussian_kernel1d(sigma, radius):
    """
    Computes a 1-D Half-Gaussian convolution kernel.
    """
    sigma2 = sigma * sigma
    x = np.arange(0, radius+1)
    phi_x = np.exp(-0.5 / sigma2 * x ** 2)
    phi_x = phi_x / phi_x.sum()
    return phi_x

def halfgaussian_filter1d(input, sigma, axis=-1, output=None,
                          mode="constant", cval=0.0, truncate=4.0):
    """
    Convolves a 1-D Half-Gaussian convolution kernel.
    """
    sd = float(sigma)
    # make the radius of the filter equal to truncate standard deviations
    lw = int(truncate * sd + 0.5)
    weights = halfgaussian_kernel1d(sigma, lw)
    origin = -lw // 2
    return scipy.ndimage.convolve1d(input, weights, axis, output, mode, cval, origin)
A short summary of how this works:
First, it generates a convolution kernel. It uses the formula e^(-1/2 * (x/sigma)^2) to generate the gaussian distribution. It keeps going until you're 4 standard deviations away from the center.
Next, it convolves that kernel against your signal. It adjusts the kernel to start at the current timestep instead of being centered on the current timestep.
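The call might look roughly like this (a sketch: Xbinned is the list built in the question, converted to a float array, and sigma=0.25 is the value mentioned there):

Xbinned = np.asarray(Xbinned, dtype=float)              # binned signal from the question
Xsmoothed = halfgaussian_filter1d(Xbinned, sigma=0.25)  # forward-facing half-gaussian smoothing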
Trying this on your signal, I get a result like this:
array([0.59979879, 0.6 , 0.40006707, 0.59993293, 0.79993293,
0.40013414, 0.20006707, 0.59986586, 0.40006707, 0.4 ,
0.99979879, 0.00033535, 0.59979879, 0.40006707, 0.00013414,
0.59979879, 0.20013414, 0.00006707, 0.19993293, 0.59986586])
Choice of standard deviation
If you pick a standard deviation of 0.25, that is going to have almost no effect on your signal. Here are the convolution weights it uses: [0.99966465 0.00033535]. In other words, this has less than a 0.1% effect on the signal.
I'd recommend using a larger sigma value.
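For instance, a quick way to see how the weights change with sigma (this reuses the halfgaussian_kernel1d helper above; the sigma values are just illustrative):

# inspect the normalized kernel weights for a few sigma values
for s in (0.25, 1.0, 2.0):
    radius = int(4.0 * s + 0.5)    # same truncation rule as halfgaussian_filter1d
    print(s, halfgaussian_kernel1d(s, radius))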
Off by one error
Also, I want to point out the off-by-one error here:
for i in range(0, len(X)+1, bin_size):
    Xbinned.append(sum(X[i:i+(bin_size-1)])/bin_size)
Python slices exclude the end index, so X[i:i+(bin_size-1)] actually captures 4 elements, not 5.
To fix this, you can change it to this:
for i in range(0, len(X), bin_size):
    Xbinned.append(X[i:i+bin_size].mean())
(Also, I fixed an off-by-one error in the loop specification and used a numpy shortcut for finding the mean.)
Is there any library/function in Python which allows us to generate discrete data that matches given target moments (mean, standard deviation, skewness, kurtosis)? I do not wish to necessarily enforce any specific underlying continuous distribution.
That is, I want to generate, say, 10000 numbers, such that when we calculate their first four moments using standard formulae we get something close to the target moments given as input.
Any known library in Python that implements such a method? Here is an example of a paper in which this specific problem is solved (as part of a larger problem):
https://link.springer.com/article/10.1023/A:1021853807313
Thanks!
Yes, although not with 100% accuracy, this is possible.
import statsmodels.sandbox.distributions.extras as extras
import scipy.interpolate as interpolate
import scipy.stats as ss
import matplotlib.pyplot as plt
import numpy as np
def generate_normal_four_moments(mu, sigma, skew, kurt, size=10000, sd_wide=10):
    f = extras.pdf_mvsk([mu, sigma, skew, kurt])
    x = np.linspace(mu - sd_wide * sigma, mu + sd_wide * sigma, num=500)
    y = [f(i) for i in x]
    yy = np.cumsum(y) / np.sum(y)
    inv_cdf = interpolate.interp1d(yy, x, fill_value="extrapolate")
    rr = np.random.rand(size)
    return inv_cdf(rr)
Next, we generate the data by using
data = generate_normal_four_moments(mu=0, sigma=1, skew=-1, kurt=3)
Let's check the moments:
np.mean(data)
np.var(data)
ss.skew(data)
ss.kurtosis(data)
gives
-0.039986656405454374
1.051375501684874
-1.071149838792561
2.9813805363255472
I'd like to generate random numbers that follow a dropping linear frequency distribution, taking n = 1 - x as an example.
The numpy library however seems to offer only more complex distributions.
So, it turns out you can totally use random.triangular(0,1,0) for this. See documentation here: https://docs.python.org/2/library/random.html
random.triangular(low, high, mode)
Return a random floating point number N such that low <= N <= high and with the specified mode between those bounds.
Histogram made with matplotlib:
import matplotlib.pyplot as plt
import random
bins = [0.1 * i for i in range(12)]
plt.hist([random.triangular(0,1,0) for i in range(2500)], bins)
For the unnormalized PDF with density 1-x on the range [0...1), the normalization constant is 1/2 and the CDF is equal to 2x - x^2. Thus, sampling is quite straightforward:
r = 1.0 - math.sqrt(random.random())
A sample program produces pretty much the same plot:
import math
import random
import matplotlib.pyplot as plt
bins = [0.1 * i for i in range(12)]
plt.hist([(1.0 - math.sqrt(random.random())) for k in range(10000)], bins)
plt.show()
UPDATE
Let's denote S as the integral sign, with S_a^b being the definite integral from a to b.
So
Denormalized PDF(x) = 1-x
Normalization:
N = S_0^1 (1-x) dx = 1/2
Thus, normalized PDF
PDF(x) = 2*(1-x)
Let's compute CDF
CDF(x) = S_0^x PDF(t) dt = 2x - x*x
Checking: CDF(0) = 0, CDF(1) = 1
Sampling is via the inverse CDF method, by solving for x in
CDF(x) = U(0,1)
where U(0,1) is uniform random in [0,1)
This is a simple quadratic equation whose root in [0, 1) is
x = 1 - sqrt(1 - U(0,1)) = 1 - sqrt(U(0,1))
(the second equality holds in distribution, because 1 - U(0,1) is itself uniform on (0, 1]), which translates directly into Python code:
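Restating the sampler line from the plot code above:

import math
import random

x = 1.0 - math.sqrt(random.random())   # one draw from the normalized 2*(1-x) density on [0, 1)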
I have been trying to get the result of a lognormal distribution using Scipy. I already have the Mu and Sigma, so I don't need to do any other prep work. If I need to be more specific (and I am trying to be with my limited knowledge of stats), I would say that I am looking for the cumulative function (cdf under Scipy). The problem is that I can't figure out how to do this with just the mean and standard deviation on a scale of 0-1 (i.e. the answer returned should be something from 0-1). I'm also not sure which method from dist I should be using to get the answer. I've tried reading the documentation and looking through SO, but the relevant questions (like this and this) didn't seem to provide the answers I was looking for.
Here is a code sample of what I am working with. Thanks.
from scipy.stats import lognorm
stddev = 0.859455801705594
mean = 0.418749176686875
total = 37
dist = lognorm.cdf(total,mean,stddev)
UPDATE:
So after a bit of work and a little research, I got a little further. But I still am getting the wrong answer. The new code is below. According to R and Excel, the result should be .7434, but that's clearly not what is happening. Is there a logic flaw I am missing?
dist = lognorm([1.744],loc=2.0785)
dist.cdf(25) # yields=0.96374596, expected=0.7434
UPDATE 2:
Working lognorm implementation which yields the correct 0.7434 result.
import math

def lognorm(x, mu=0, sigma=1):
    a = (math.log(x) - mu) / math.sqrt(2 * sigma**2)
    p = 0.5 + 0.5 * math.erf(a)
    return p

lognorm(25, mu=2.0785, sigma=1.744)
> 0.7434
I know this is a bit late (almost one year!) but I've been doing some research on the lognorm function in scipy.stats. A lot of folks seem confused about the input parameters, so I hope to help these people out. The example above is almost correct, but I found it strange to set the mean as the location ("loc") parameter - this signals that the cdf or pdf doesn't 'take off' until the value is greater than the mean. Also, the scale and shape arguments should be exp(µ) and σ respectively, where µ and σ are the mean and standard deviation of the log of the variate.
Simply put, the arguments are (x, shape, loc, scale), with the parameter definitions below:
loc - No equivalent, this gets subtracted from your data so that 0 becomes the infimum of the range of the data.
scale - exp μ, where μ is the mean of the log of the variate. (When fitting, typically you'd use the sample mean of the log of the data.)
shape - the standard deviation of the log of the variate.
I went through the same frustration as most people with this function, so I'm sharing my solution. Just be careful because the explanations aren't very clear without a compendium of resources.
For more information, I found these sources helpful:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.lognorm.html#scipy.stats.lognorm
https://stats.stackexchange.com/questions/33036/fitting-log-normal-distribution-in-r-vs-scipy
And here is an example, taken from serv-inc's answer posted on this page:
import math
from scipy import stats
# standard deviation of normal distribution
sigma = 0.859455801705594
# mean of normal distribution
mu = 0.418749176686875
# hopefully, total is the value where you need the cdf
total = 37
frozen_lognorm = stats.lognorm(s=sigma, scale=math.exp(mu))
frozen_lognorm.cdf(total) # use whatever function and value you need here
It sounds like you want to instantiate a "frozen" distribution from known parameters. In your example, you could do something like:
from scipy.stats import lognorm
stddev = 0.859455801705594
mean = 0.418749176686875
dist=lognorm([stddev],loc=mean)
which will give you a lognorm distribution object with the mean and standard deviation you specify. You can then get the pdf or cdf like this:
import numpy as np
import pylab as pl
x=np.linspace(0,6,200)
pl.plot(x,dist.pdf(x))
pl.plot(x,dist.cdf(x))
Is this what you had in mind?
from math import exp
from scipy import stats
def lognorm_cdf(x, mu, sigma):
    shape = sigma
    loc = 0
    scale = exp(mu)
    return stats.lognorm.cdf(x, shape, loc, scale)
x = 25
mu = 2.0785
sigma = 1.744
p = lognorm_cdf(x, mu, sigma) #yields the expected 0.74341
Similar to Excel and R, the lognorm_cdf function above parameterizes the CDF for the log-normal distribution using mu and sigma.
Although SciPy uses shape, loc and scale parameters to characterize its probability distributions, for the log-normal distribution I find it slightly easier to think of these parameters at the variable level rather than at the distribution level. Here's what I mean...
A log-normal variable X is related to a normal variable Z as follows:
X = exp(mu + sigma * Z) #Equation 1
which is the same as:
X = exp(mu) * exp(Z)**sigma #Equation 2
This can be sneakily re-written as follows:
X = exp(mu) * exp(Z-Z0)**sigma #Equation 3
where Z0 = 0. This equation is of the form:
f(x) = a * ( (x-x0) ** b ) #Equation 4
If you can visualize equations in your head it should be clear that the scale, shape and location parameters in Equation 4 are: a, b and x0, respectively. This means that in Equation 3 the scale, shape and location parameters are: exp(mu), sigma and zero, respectively.
If you can't visualize that very clearly, let's rewrite Equation 2 as a function:
f(Z) = exp(mu) * exp(Z)**sigma #(same as Equation 2)
and then look at the effects of mu and sigma on f(Z). The figure below holds sigma constant and varies mu. You should see that mu vertically scales f(Z). However, it does so in a nonlinear manner; the effect of changing mu from 0 to 1 is smaller than the effect of changing mu from 1 to 2. From Equation 2 we see that exp(mu) is actually the linear scaling factor. Hence SciPy's "scale" is exp(mu).
The next figure holds mu constant and varies sigma. You should see that the shape of f(Z) changes. That is, f(Z) has a constant value when Z=0 and sigma affects how quickly f(Z) curves away from the horizontal axis. Hence SciPy's "shape" is sigma.
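If it helps, here is a small numerical check of that reading of the parameters (a sketch; mu, sigma and x are the same illustrative values used earlier in this thread):

import math
from scipy import stats

mu, sigma, x = 2.0785, 1.744, 25.0

# SciPy parameterization: shape s = sigma, loc = 0, scale = exp(mu)
scipy_cdf = stats.lognorm.cdf(x, s=sigma, scale=math.exp(mu))

# Closed form for the lognormal CDF: Phi((ln x - mu) / sigma)
closed_form = 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

print(scipy_cdf, closed_form)   # both come out around 0.7434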
Even later to the party, but in case it's helpful to anyone else: I found that Excel's
LOGNORM.DIST(x,Ln(mean),standard_dev,TRUE)
provides the same results as python's
from scipy.stats import lognorm
lognorm.cdf(x,sigma,0,mean)
Likewise, Excel's
LOGNORM.DIST(x,Ln(mean),standard_dev,FALSE)
seems equivalent to Python's
from scipy.stats import lognorm
lognorm.pdf(x,sigma,0,mean)
@lucas' answer has the usage down pat. As a code example, you could use
import math
from scipy import stats
# standard deviation of normal distribution
sigma = 0.859455801705594
# mean of normal distribution
mu = 0.418749176686875
# hopefully, total is the value where you need the cdf
total = 37
frozen_lognorm = stats.lognorm(s=sigma, scale=math.exp(mu))
frozen_lognorm.cdf(total) # use whatever function and value you need here
Known mean and stddev of the lognormal distribution
In case someone is looking for it, here is a solution for getting the scipy.stats.lognorm distribution if the mean mu and standard deviation sigma of the lognormal distribution are known. In this case we have to calculate the stats.lognorm parameters from the known mu and sigma like so:
import numpy as np
from scipy import stats
mu = 10
sigma = 3
a = 1 + (sigma / mu) ** 2
s = np.sqrt(np.log(a))
scale = mu / np.sqrt(a)
This was obtained by looking into the implementation of the variance and mean calculations in the stats.lognorm.stats method and essentially reversing it (solving for the input).
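For reference (my own restatement of that reversal, not part of the original answer), it amounts to solving the standard lognormal moment formulas for the underlying normal parameters m and s, where the variate is exp(Normal(m, s^2)):

mean:     mu      = exp(m + s^2/2)
variance: sigma^2 = (exp(s^2) - 1) * exp(2*m + s^2)

Dividing the variance by mu^2 gives sigma^2/mu^2 = exp(s^2) - 1, so with a = 1 + (sigma/mu)^2:

s     = sqrt(log(a))            # the shape parameter
scale = exp(m) = mu / sqrt(a)   # the scale parameter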
Then we can initialize the frozen distribution instance
distr = stats.lognorm(s, 0, scale)
# generate some randomvals
randomvals = distr.rvs(1_000_000)
# calculate mean and variance using the dedicated method
mu_stats, var_stats = distr.stats("mv")
Compare means and stddevs from input, randomvals and analytical solution from distr.stats:
print(f"""
Mean Std
----------------------------
Input: {mu:6.2f} {sigma:6.2f}
Randomvals: {randomvals.mean():6.2f} {randomvals.std():6.2f}
lognorm.stats: {mu_stats:6.2f} {np.sqrt(var_stats):6.2f}
""")
Mean Std
----------------------------
Input: 10.00 3.00
Randomvals: 10.00 3.00
lognorm.stats: 10.00 3.00
Plot PDF from stats.lognorm and histogram of the random values:
import holoviews as hv
hv.extension('bokeh')
x = np.linspace(0, 30, 301)
counts, _ = np.histogram(randomvals, bins=x)
counts = counts / counts.sum() / (x[1] - x[0])
(hv.Histogram((counts, x))
* hv.Curve((x, distr.pdf(x))).opts(color="r").opts(width=900))
If you read this and just want a function with behaviour similar to rlnorm in R, then relieve yourself from violent anger and use numpy's numpy.random.lognormal.
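A minimal sketch of that (the parameter values are just illustrative):

import numpy as np

# analogous to rlnorm(n, meanlog, sdlog) in R
samples = np.random.lognormal(mean=2.0785, sigma=1.744, size=1000)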