Generating random numbers from custom continuous probability density function


As the title states, I am trying to generate random numbers from a custom continuous probability density function:
0.001257 * x^4 * e^(-0.285714 * x)
To do so, I use scipy.stats.rv_continuous (on Python 3) and then call rvs() to generate them:
from decimal import Decimal
from scipy import stats
import numpy as np
class my_distribution(stats.rv_continuous):
    def _pdf(self, x):
        return (Decimal(0.001257) * Decimal(x)**(4) * Decimal(np.exp(-0.285714 * x)))
distribution = my_distribution()
distribution.rvs()
note that I used Decimal to get rid of an OverflowError: (34, 'Result too large').
Still, I get an error RuntimeError: Failed to converge after 100 iterations.
What's going on there? What's the proper way to achieve what I need to do?

I've found out the reason for your issue.
By default, rvs relies on numerical integration of the PDF, which is slow and can fail in some cases. Your PDF is presumably one of those cases: on the left side (as x goes to negative infinity) it grows without bound.
For this reason, you should specify the distribution's support as follows (the following example shows that the support is in the interval [-4, 4]):
distribution = my_distribution(a = -4, b = 4)
With this interval, the PDF will be bounded from above, allowing the integration (and thus the random variate generation) to work as normal. Note that by default, rv_continuous assumes the distribution is supported on the entire real line.
However, this will only work for the particular PDF you give here, not necessarily for arbitrary PDFs.
Usually, when you only give a PDF to your rv_continuous subclass, the subclass's rvs, mean, etc. will be very slow, because the method needs to integrate the PDF every time it generates a random variate or calculates a statistic. For example, random variate generation uses numerical integration of the PDF, and this process can fail to converge depending on the PDF.
In future cases when you're dealing with arbitrary distributions, and particularly when speed is at a premium, you will thus need to add an _rvs method that uses its own sampler. One example is a much simpler rejection sampler given in the answer to a related question.
See also my section "Sampling from an Arbitrary Distribution".
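As a rough illustration of the rejection-sampling idea mentioned above, here is a minimal sketch using plain NumPy. The support [0, 60], the helper names `pdf_unnormalised` and `rejection_sample`, and the seed are assumptions made for this example; they are not taken from the question or from the linked answer. The constant in front of the PDF does not need to normalise it, because rejection sampling only uses the shape of the curve.

```python
import numpy as np

def pdf_unnormalised(x):
    # Density from the question; only its shape matters for rejection sampling.
    return 0.001257 * x**4 * np.exp(-0.285714 * x)

def rejection_sample(pdf, lo, hi, pdf_max, size, rng):
    """Draw `size` samples from `pdf` restricted to [lo, hi] (hypothetical helper)."""
    out = np.empty(0)
    while out.size < size:
        x = rng.uniform(lo, hi, size)        # candidate points
        u = rng.uniform(0.0, pdf_max, size)  # heights under the bounding box
        out = np.concatenate([out, x[u < pdf(x)]])  # keep points under the curve
    return out[:size]

rng = np.random.default_rng(0)
lo, hi = 0.0, 60.0          # assumed support, chosen for illustration
x_peak = 4 / 0.285714       # maximiser of x**4 * exp(-c * x) is x = 4 / c
samples = rejection_sample(pdf_unnormalised, lo, hi,
                           pdf_unnormalised(x_peak), 10_000, rng)
```

The same function could be called from an `_rvs` method of the subclass, though the exact `_rvs` signature depends on the SciPy version, so check it before wiring this in.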

Related

Write a scipy function without using a standard library (exponential power)

My question might come across as stupid or so simple, but I could not work towards finding a solution. Here is my question: I want to write an exponential power distribution function which is available in scipy. However, I don't want to use the scipy for this. How do I go about it?
Here are my efforts so far:
import math
import numpy as np
def ExpPowerFun(x, b, size=1000):
    distribution = b*x**(b-1)*math.exp(1+x**b-math.exp(x**b))
    return distribution
I used this equation based on this scipy doc. To be fair, using this equation and writing a function using it doesn't do much. As you can see, it returns only one value. I want to generate a distribution of random numbers based on scipy's exponential power distribution function without using scipy.
I have looked at the class exponpow_gen from the GitHub code. However, it uses scipy.special (imported as sc), so it's of little use to me unless there is a workaround that avoids scipy.
I can't figure out how to go about it. Again, this might be a simple task, but I am stuck. Please help.

The simplest way to generate random numbers from a given distribution is to use the inverse of its CDF, the PPF (percent point function): applying it to uniformly distributed numbers gives you the distribution you need.
For your case, the PPF (taken directly from the scipy source code with some modifications) is:
np.power(np.log(1-np.log(1-x)), 1.0/b)
Hence your code should look like this:
import numpy as np
import matplotlib.pyplot as plt
def ExpPowerFun(b, size=1000):
    # uniform samples on [0, 1), pushed through the PPF
    x = np.random.rand(size)
    return np.power(np.log(1-np.log(1-x)), 1.0/b)
plt.hist(ExpPowerFun(2.7, 10000), 20)
plt.show()
Edit: the uniform distribution has to be on the interval from 0 to 1, of course, since probabilities run from 0% to 100%.
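One way to sanity-check the PPF above is to push values through it and then through the CDF and confirm they come back unchanged. The sketch below assumes the CDF 1 - exp(1 - exp(x^b)), which is what you get by integrating the PDF from the question; the function names are made up for this example.

```python
import numpy as np

def exponpow_cdf(x, b):
    # CDF whose derivative is b * x**(b-1) * exp(1 + x**b - exp(x**b))
    return 1.0 - np.exp(1.0 - np.exp(x**b))

def exponpow_ppf(u, b):
    # Inverse of the CDF above, same expression as in the answer
    return np.power(np.log(1.0 - np.log(1.0 - u)), 1.0 / b)

u = np.linspace(0.01, 0.99, 99)
roundtrip = exponpow_cdf(exponpow_ppf(u, 2.7), 2.7)
print(np.allclose(roundtrip, u))
```

If the round trip reproduces `u`, the PPF really is the inverse of the CDF, which is the property inverse-transform sampling relies on.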

Numba support for big integers?

I have a factorial lookup table that contains the first 30 integer factorials. This table is used in a function that is compiled with numba.njit. The issue is, above 20!, the number is larger than a 64-bit signed integer (9,223,372,036,854,775,807), which causes numba to raise a TypingError. If the table is reduced to only include the first 20 integer factorials the function runs fine.
Is there a way to get around this in numba? Perhaps by declaring larger integer types in the jit compiled function where the lookup table is used?

There may be some way to handle large integers in Numba, but it's not a method that I'm aware of.
But, since we know that you're trying to hand-code the evaluation of the Beta distribution in Numba, I have some other suggestions.
First though, we must be careful with our language so we don't confuse the Beta distribution and the Beta function.
What I'd actually recommend is moving all your computations on to the log scale. That is, instead of computing the pdf of the Beta distribution you'd compute the log of the pdf of the Beta distribution.
This trick is commonly used in statistical computing as the log of the pdf is more numerically stable than the pdf. The Stan project, for example, works exclusively to allow the computation of the log posterior density.
From your post history I also know that you're interested in MCMC, where working with log pdfs is also common practice. Instead of the posterior being proportional to the prior times the likelihood, on the log scale the log-posterior equals the log-prior plus the log-likelihood, up to an additive constant.
I'd recommend you use log distributions as this allows you to avoid having to ever compute $\Gamma(n)$ for large n, which is prone to integer overflow. Instead, you compute $\log(\Gamma(n))$. But don't you need to compute $\Gamma(n)$ to compute $\log(\Gamma(n))$? Actually, no. You can take a look at the scipy.special function gammaln which avoids having to compute $\Gamma(n)$ at all. One way forward then would be to look at the source code in scipy.special.gammaln and make your own numba implementation from this.
In your comment you also mention using Spouge's Approximation to approximate the Gamma function. I've not used Spouge's approximation before, but I have had success with Stirling's approximation. If you want to use one of these approximations, working on the log scale you would now take the log of the approximation. You'll want to use the rules of logs to rewrite these approximations.
With all the above considered, what I'd recommend is moving computations from the pdf to the log of the pdf. To compute the log pdf of the Beta distribution I'd make use of this approximation of the Beta function, using the rules of logs to rewrite both the approximation and the Beta pdf. You could then implement this in Numba without having to worry about integer overflow.
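To make the log-scale idea concrete, here is a minimal sketch of a Beta log-pdf in plain Python. Note the swap: it uses the standard library's `math.lgamma` instead of the Stirling or Spouge approximations discussed above, and the function names are hypothetical. Whether it can be decorated with `numba.njit` as-is depends on your Numba version supporting `math.lgamma`, so treat that as something to verify.

```python
import math

def log_beta_function(a, b):
    # log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_logpdf(x, a, b):
    # log of x**(a-1) * (1-x)**(b-1) / B(a, b), valid for 0 < x < 1
    return ((a - 1.0) * math.log(x)
            + (b - 1.0) * math.log1p(-x)
            - log_beta_function(a, b))

small = beta_logpdf(0.5, 2.0, 2.0)      # Beta(2, 2) density at 0.5 is 1.5
large = beta_logpdf(0.5, 500.0, 500.0)  # no factorial table needed
```

Nothing here ever forms a factorial or Gamma value directly, so the 64-bit integer limit never comes into play.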
Edit
Apologies, I'm not sure how to format maths on stack overflow.

FFT derivatives using Numpy and the Nyquist frequency

I am having trouble understanding Numpy's behavior regarding the Nyquist frequency. Consider the following example:
import numpy as np
x=np.linspace(0, 2*np.pi, 21)[:-1]
k=np.fft.rfftfreq(len(x), d=x[1]-x[0])
FFT=np.fft.rfft(x)
x1=np.fft.irfft(1j*k*FFT)
FFT[-1]+=1e5
x2=np.fft.irfft(1j*k*FFT)
print(np.allclose(x1,x2))
Prints True. So apparently it doesn't matter what I do with the Nyquist frequency in FFT, the result is always the same and the change is ignored. Curiously, this does not happen when trying to just recover the function (no derivation):
x1=np.fft.irfft(FFT)
FFT[-1]+=1e5
x2=np.fft.irfft(FFT)
print(np.allclose(x1,x2))
prints False.
I may be misunderstanding what the Nyquist frequency is here (Wikipedia and other sources weren't very helpful) but aren't both results supposed to be affected by a change in the Nyquist frequency? The closest explanation I can find is that the Nyquist frequency is supposed to be a real number, but still doesn't seem to explain both behaviors.
The reason I'm asking is that I'm trying to reproduce results that I know are correct from a Fortran code that does do some stuff with the Nyquist frequency when differentiating. My results are always about 1% off and I'm guessing this is the culprit.

The r in np.fft.rfft() indicates that you are using the DFT on real input. But if that is not True, you will get unexpected behaviors like this one. Just use fft functions for complex values. As a side note, always try to inspect your data.
EDIT (additional explanation):
In particular, when you compute the "DFT for real inputs" you are assuming certain properties of your data: the (D)FT of a real-valued function is Hermitian-symmetric, so the negative-frequency (D)FT coefficients are redundant, and rfft and later irfft are optimized for computation under this assumption.
See the documentation of np.fft.rfft() and np.fft.irfft() for more information.
Briefly, because of this symmetry, half of your coefficients (the negative-frequency ones) are not computed by np.fft.rfft(). The same symmetry makes the first component (the zero-frequency term) purely real and, for an even number of samples, the last component (the Nyquist term) purely real as well.
Because of the 1j multiplication, whatever was purely real is now purely imaginary (and vice versa) in the subsequent irfft calls.
Since irfft() ignores the imaginary part of the first and last components, your modification of FFT[-1] does not affect its result.
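The behaviour described above can be checked directly. This is a small sketch with made-up variable names and an arbitrary random real signal of even length: an imaginary perturbation of the Nyquist bin should leave the output of irfft unchanged, while a real perturbation of the same size should not.

```python
import numpy as np

n = 20  # even length, so the last rfft bin is the Nyquist frequency
rng = np.random.default_rng(0)
spectrum = np.fft.rfft(rng.standard_normal(n))

# Perturb only the imaginary part of the Nyquist bin.
imag_shifted = spectrum.copy()
imag_shifted[-1] += 1e5j
same_after_imag = np.allclose(np.fft.irfft(spectrum, n),
                              np.fft.irfft(imag_shifted, n))

# Perturb only the real part of the Nyquist bin.
real_shifted = spectrum.copy()
real_shifted[-1] += 1e5
same_after_real = np.allclose(np.fft.irfft(spectrum, n),
                              np.fft.irfft(real_shifted, n))

print(same_after_imag, same_after_real)
```

This mirrors the two cases in the question: multiplying by 1j*k turns the real perturbation into an imaginary one, which is why only the derivative case appears unaffected.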

Integration with Scipy giving incorrect results with negative lower bound

I am attempting to calculate integrals between two limits using python/scipy.
I am using online calculators to double check my results (http://www.wolframalpha.com/widgets/view.jsp?id=8c7e046ce6f4d030f0b386ea5c17b16a, http://www.integral-calculator.com/), and my results disagree when I have certain limits set.
The code used is:
import numpy as np
from scipy import integrate
def integrand(x):
    return np.exp(-0.5*x**2)
def int_test(a, b):
    # a and b are the lower and upper bounds of the integration
    return integrate.quad(integrand, a, b)
When setting the limits (a,b) to (-np.inf,1) I get answers that agree (2.10894...).
However, if I set (-np.inf,300) I get an answer of zero.
On further investigation using:
for i in range(50):
    print(i, int_test(-np.inf, i))
I can see that the result goes wrong at i=36.
I was wondering if there was a way to avoid this?
Thanks,
Matt

I am guessing this has to do with the infinite bounds. scipy.integrate.quad is a wrapper around quadpack routines.
https://people.sc.fsu.edu/~jburkardt/f_src/quadpack/quadpack.html
In the end, these routines choose suitable intervals and try to get the value of the integral through function evaluations and then numerical integration. This works fine for finite integrals, assuming you know roughly how fine you can make the steps of the function evaluation.
For infinite integrals it depends how well the algorithms choose respective subintervals and how accurately they are computed.
My advice: do NOT use numerical integration software AT ALL if you are interested in accurate values for infinite integrals.
If your problem can be solved analytically, try that or confine yourself to certain bounds.
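For this particular integrand the analytic route is available, since the integral of exp(-x^2/2) from minus infinity to b can be written with the error function. The sketch below uses only the standard library; the function name is made up for this example.

```python
import math

def gaussian_integral_to(b):
    """Integral of exp(-x**2 / 2) from -inf to b, in closed form."""
    # sqrt(pi/2) * (1 + erf(b / sqrt(2)))
    return math.sqrt(math.pi / 2.0) * (1.0 + math.erf(b / math.sqrt(2.0)))

value_at_1 = gaussian_integral_to(1.0)      # should match the 2.10894... above
value_at_300 = gaussian_integral_to(300.0)  # approaches sqrt(2 * pi)
print(value_at_1, value_at_300)
```

For large upper bounds the result tends to sqrt(2*pi) rather than collapsing to zero, which gives a reference value to compare any numerical result against.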

R and Python Give Different Results (Median, IQR, Mean, and STD)

I am doing feature scaling on my data, and R and Python are giving me different answers in the scaling. R and Python give different answers for many statistical values:
Median:
Numpy gives 14.948499999999999 with this code: np.percentile(X[:, 0], 50, interpolation = 'midpoint').
The built-in statistics package in Python gives the same answer with the following code: statistics.median(X[:, 0]).
On the other hand, R gives the result 14.9632 with this code: median(X[, 1]). Interestingly, the summary() function in R gives 14.960 as the median.
A similar difference occurs when computing the mean of this same data. R gives 13.10936 using the built-in mean() function and both Numpy and the Python statistics package give 13.097945407088607.
Again, the same thing happens when computing the standard deviation. R gives 7.390328 and Numpy (with DDOF = 1) gives 7.3927612774052083. With DDOF = 0, Numpy gives 7.3927565984408936.
The IQR also gives different results. Using the built-in IQR() function in R, the result is 12.3468. Using Numpy with this code: np.percentile(X[:, 0], 75) - np.percentile(X[:, 0], 25), the result is 12.358700000000002.
What is going on here? Why are Python and R always giving different results? It may help to know that my data has 795066 rows and is being treated as an np.array() in Python. The same data is being treated as a matrix in R.

tl;dr there are a few potential differences in algorithms even for such simple summary statistics, but given that you're seeing differences across the board and even in relatively simple computations such as the median, I think the problem is more likely that the values are getting truncated/modified/losing precision somehow in the transfer between platforms.
(This is more of an extended comment than an answer, but it was getting awkwardly long.)
you're unlikely to get much farther without a reproducible example; there are various ways to create examples to test hypotheses for the differences, but it's better if you do so yourself rather than making answerers do it.
how are you transferring data to/from Python/R? Is there some rounding in the representation used in the transfer? (What do you get for max/min, which should be based on a single number with no floating-point computations? How about if you drop one value to get an odd-length vector and take the median?)
medians: I was originally going to say that this could be a function of different ways to define quantile interpolation for an even-length vector, but the definition of the median is somewhat simpler than general quantiles, so I'm not sure. The differences you're reporting above seem way too big to be driven by floating-point computation in this case (since the computation is just an average of two values of similar magnitude).
IQRs: similarly, there are different possible definitions of percentiles/quantiles: see ?quantile in R.
median() vs summary(): R's summary() reports values at reduced precision (often useful for a quick overview); this is a common source of confusion.
mean/sd: there are some possible subtleties in the algorithm here -- for example, R uses extended precision internally when summing to reduce instability; I don't know if Python does or not. However, this shouldn't make as big a difference as you're seeing unless the data are a bit weird:
> x <- rnorm(1000000,mean=0,sd=1)
> mean(x)
[1] 0.001386724
> sum(x)/length(x)
[1] 0.001386724
> mean(x)-sum(x)/length(x)
[1] -1.734723e-18
Similarly, there are more- and less-stable ways to compute a variance/standard deviation.
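To illustrate what "more- and less-stable" means here, the sketch below compares the textbook sum-of-squares formula with a two-pass computation, in plain Python. The data set (small values on top of a large offset) and the function names are chosen for this example only; they have nothing to do with the data in the question.

```python
def naive_variance(xs):
    # One-pass "sum of squares" formula; prone to catastrophic cancellation.
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return (total_sq - total * total / n) / (n - 1)

def two_pass_variance(xs):
    # Compute the mean first, then sum squared deviations from it.
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

# Deviations from the mean are -6, -3, 3, 6, so the sample variance is 30.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
naive = naive_variance(data)
stable = two_pass_variance(data)
print(naive, stable)
```

On this input the two-pass version recovers 30 while the naive formula does not, because the squares are so large that the small differences between them are lost to rounding. Effects like this only show up with awkward data, which fits the point above that algorithmic differences alone are unlikely to explain the gaps reported in the question.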
