Numba support for big integers? - python

I have a factorial lookup table that contains the first 30 integer factorials. This table is used in a function that is compiled with numba.njit. The issue is, above 20!, the number is larger than a 64-bit signed integer (9,223,372,036,854,775,807), which causes numba to raise a TypingError. If the table is reduced to only include the first 20 integer factorials the function runs fine.
Is there a way to get around this in numba? Perhaps by declaring larger integer types in the jit compiled function where the lookup table is used?

There may be some way to handle large integers in Numba, but it's not a method that I'm aware of.
But, since we know that you're trying to hand-code the evaluation of the Beta distribution in Numba, I have some other suggestions.
First though, we must be careful with our language so we don't confuse the Beta distribution and the Beta function.
What I'd actually recommend is moving all your computations on to the log scale. That is, instead of computing the pdf of the Beta distribution you'd compute the log of the pdf of the Beta distribution.
This trick is commonly used in statistical computing, as the log of the pdf is more numerically stable than the pdf itself. The Stan project, for example, works exclusively on the log scale, computing the log posterior density.
From your post history I also know that you're interested in MCMC; it is also common practice to use log pdfs to perform MCMC. In the case of MCMC, instead of having the posterior proportional to the prior times the likelihood, on the log scale you would have the log-posterior proportional to the log-prior plus the log-likelihood.
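In symbols, with $\theta$ the parameters and $y$ the data, that statement is just $\log p(\theta \mid y) = \log p(\theta) + \log p(y \mid \theta) + \text{const}$.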
I'd recommend you use log distributions as this allows you to avoid having to ever compute $\Gamma(n)$ for large n, which is prone to integer overflow. Instead, you compute $\log(\Gamma(n))$. But don't you need to compute $\Gamma(n)$ to compute $\log(\Gamma(n))$? Actually, no. You can take a look at the scipy.special function gammaln which avoids having to compute $\Gamma(n)$ at all. One way forward then would be to look at the source code in scipy.special.gammaln and make your own numba implementation from this.
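For illustration, here is a minimal sketch (not your code) of the log pdf of the Beta distribution under numba.njit. It assumes Numba's nopython-mode support for math.lgamma, which gives you $\log(\Gamma(n))$ directly without ever forming $\Gamma(n)$, so you could also skip porting gammaln yourself:

import math
from numba import njit

@njit
def log_beta_pdf(x, a, b):
    # log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x) - log_beta

print(log_beta_pdf(0.3, 25.0, 40.0))   # fine, even though 64! overflows a 64-bit integer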
In your comment you also mention using Spouge's Approximation to approximate the Gamma function. I've not used Spouge's approximation before, but I have had success with Stirling's approximation. If you want to use one of these approximations, working on the log scale you would now take the log of the approximation. You'll want to use the rules of logs to rewrite these approximations.
With all the above considered, what I'd recommend is moving computations from the pdf to the log of the pdf. To compute the log pdf of the Beta distribution I'd make use of this approximation of the Beta function. Use the rules of logs to rewrite this approximation and the Beta pdf; you could then implement this in Numba without having to worry about integer overflow.
Edit
Apologies, I'm not sure how to format maths on stack overflow.

Related

Checkgradient without solving optimization problem in MATLAB

I have a relatively complicated function and I have calculated the analytical form of the Jacobian of this function. However, sometimes, I mess up this Jacobian.
MATLAB has a nice way to check for the accuracy of the Jacobian when using some optimization technique as described here.
The problem though is that it looks like MATLAB solves the optimization problem and then returns if the Jacobian was correct or not. This is extremely time consuming, especially considering that some of my optimization problems take hours or even days to compute.
Python has a somewhat similar function in scipy as described here which just compares the analytical gradient with a finite difference approximation of the gradient for some user provided input.
Is there anything I can do to check the accuracy of the Jacobian in MATLAB without having to solve the entire optimization problem?
A laborious but useful method I've used for this sort of thing is to check that the (numerical) integral of the purported derivative equals the difference of the function values at the end points. I have found this more convenient than comparing fractions like (f(x+h)-f(x))/h with f'(x), because of the difficulty of choosing h: on the one hand, h must not be so small that the fraction is dominated by rounding error, and on the other, h must be small enough that the fraction is close to f'(x).
In the case of a function F of a single variable, the assumption is that you have code f to evaluate F and code fd, say, to evaluate F'. Then the test is, for various intervals [a,b], to look at the difference, which the fundamental theorem of calculus says should be 0:
Integral{ a<=x<=b | fd(x) } - (f(b) - f(a))
with the integral being computed numerically. There is no need for the intervals to be small.
Part of the error will, of course, be due to the error in the numerical approximation to the integral. For this reason I tend to use, for example, an order-40 Gauss-Legendre integrator.
For functions of several variables, you can test one variable at a time. For several functions, these can be tested one at a time.
I've found that these tests, which are of course by no means exhaustive, show up the kinds of mistakes that occur in computing derivatives quite readily.
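For reference, a sketch of this check in Python (the idea carries over directly to MATLAB); f and fd below are hypothetical stand-ins for your function and your hand-coded derivative:

import numpy as np

def f(x):
    return np.sin(x) * np.exp(-0.5 * x)            # stand-in for your function

def fd(x):
    return (np.cos(x) - 0.5 * np.sin(x)) * np.exp(-0.5 * x)   # its hand-coded derivative

def derivative_check(f, fd, a, b, order=40):
    nodes, weights = np.polynomial.legendre.leggauss(order)   # Gauss-Legendre rule on [-1, 1]
    x = 0.5 * (b - a) * nodes + 0.5 * (b + a)                 # map nodes to [a, b]
    integral = 0.5 * (b - a) * np.sum(weights * fd(x))
    return integral - (f(b) - f(a))                # ~0 if fd really is the derivative of f

print(derivative_check(f, fd, 0.0, 3.0))           # close to machine precision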
Have you considered using complex-step differentiation to check your gradient? See this description.
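For completeness, a Python sketch of the complex-step idea (the same one-liner works in MATLAB, which also supports complex arithmetic); f is again a hypothetical stand-in:

import numpy as np

def f(x):
    return np.sin(x) * np.exp(-0.5 * x)            # hypothetical test function

def complex_step_grad(f, x, h=1e-20):
    # f'(x) ~= Im(f(x + i*h)) / h: no subtractive cancellation, so h can be tiny.
    return np.imag(f(x + 1j * h)) / h

x0 = 1.3
exact = (np.cos(x0) - 0.5 * np.sin(x0)) * np.exp(-0.5 * x0)
print(complex_step_grad(f, x0) - exact)            # agrees to machine precision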

Generating random numbers from custom continuous probability density function

As the title states, I am trying to generate random numbers from a custom continuous probability density function, which is:
0.001257 *x^4 * e^(-0.285714 *x)
To do so, I use (on Python 3) scipy.stats.rv_continuous and then rvs() to generate them:
from decimal import Decimal
from scipy import stats
import numpy as np

class my_distribution(stats.rv_continuous):
    def _pdf(self, x):
        return (Decimal(0.001257) * Decimal(x)**(4) * Decimal(np.exp(-0.285714 * x)))

distribution = my_distribution()
distribution.rvs()
note that I used Decimal to get rid of an OverflowError: (34, 'Result too large').
Still, I get an error RuntimeError: Failed to converge after 100 iterations.
What's going on there? What's the proper way to achieve what I need to do?
I've found out the reason for your issue.
rvs by default uses numerical integration, which is a slow process and can fail in some cases. Your PDF is presumably one of those cases: it grows without bound on the left, as x goes to minus infinity.
For this reason, you should specify the distribution's support as follows (the following example shows that the support is in the interval [-4, 4]):
distribution = my_distribution(a = -4, b = 4)
With this interval, the PDF will be bounded from above, allowing the integration (and thus the random variate generation) to work as normal. Note that by default, rv_continuous assumes the distribution is supported on the entire real line.
However, this will only work for the particular PDF you give here, not necessarily for arbitrary PDFs.
Usually, when you only give a PDF to your rv_continuous subclass, the subclass's rvs, mean, etc., will then be very slow, because these methods need to integrate the PDF every time a random variate is generated or a statistic is calculated. For example, random variate generation requires using numerical integration to integrate the PDF, and this process can fail to converge depending on the PDF.
In future cases when you're dealing with arbitrary distributions, and particularly when speed is at a premium, you will thus need to add an _rvs method that uses its own sampler. One example is a much simpler rejection sampler given in the answer to a related question.
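For reference, a minimal sketch of such a sampler: plain rejection sampling from the truncated pdf on [-4, 4], which you could then wire into an _rvs override. The helper names and the grid-based bound below are illustrative, not part of scipy:

import numpy as np

def pdf(x):
    return 0.001257 * x**4 * np.exp(-0.285714 * x)     # unnormalized is fine for rejection sampling

def sample_pdf(n, a=-4.0, b=4.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    M = 1.01 * pdf(np.linspace(a, b, 1001)).max()      # crude upper bound on the pdf over [a, b]
    out = np.empty(0)
    while out.size < n:
        x = rng.uniform(a, b, n)
        u = rng.uniform(0.0, M, n)
        out = np.concatenate([out, x[u < pdf(x)]])     # accept x with probability pdf(x) / M
    return out[:n]

print(sample_pdf(5))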
See also my section "Sampling from an Arbitrary Distribution".

Approaches for using statistics packages for maximum likelihood estimation for hundreds of covariates

I am trying to investigate the distribution of maximum likelihood estimates, specifically for a large number of covariates p in a high-dimensional regime (meaning that p/n, with n the sample size, is about 1/5). I am generating the data and then using statsmodels.api.Logit to fit the parameters to my model.
The problem is that this only seems to work in a low-dimensional regime (like 300 covariates and 40,000 observations). Specifically, I get warnings that the maximum number of iterations has been reached and that the log-likelihood is inf, i.e. has diverged, as well as a 'singular matrix' error.
I am not sure how to remedy this. Initially, when I was still working with smaller values (say 80 covariates, 4000 observations), and I got this error occasionally, I set a maximum of 70 iterations rather than 35. This seemed to help.
However it clearly will not help now, because my log likelihood function is diverging. It is not just a matter of non-convergence within the maximum number of iterations.
It would be easy to answer that these packages are simply not meant to handle such numbers, however there have been papers specifically investigating this high dimensional regime, say here where p=800 covariates and n=4000 observations are used.
Granted, this paper used R rather than python. Unfortunately I do not know R. However I should think that python optimisation should be of comparable 'quality'?
My questions:
Might it be the case that R is better suited to handle data in this high p/n regime than python statsmodels? If so, why and can the techniques of R be used to modify the python statsmodels code?
How could I modify my code to work for numbers around p=800 and n=4000?
In the code you currently use (from several other questions), you implicitly use the Newton-Raphson method. This is the default for the sm.Logit model. It computes and inverts the Hessian matrix to speed up estimation, but that is incredibly costly for large matrices, and it often results in numerical instability when the matrix is near singular, as you have already witnessed. This is briefly explained on the relevant Wikipedia entry.
You can work around this by using a different solver, e.g. bfgs (or lbfgs), like so:
import statsmodels.api as sm

model = sm.Logit(y, X)
result = model.fit(method='bfgs')
This runs perfectly well for me even with n = 10000, p = 2000.
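For reference, a self-contained sketch of that fix; the simulated data below is illustrative, not your generating process:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 4000, 800                        # the regime from the question
X = rng.standard_normal((n, p))
beta = rng.normal(scale=0.05, size=p)   # assumed "true" coefficients
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))

model = sm.Logit(y, X)
result = model.fit(method='lbfgs', maxiter=500, disp=0)
print(result.params[:5])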
Aside from estimation, and more problematically, your code for generating samples results in data that suffer from a large degree of quasi-separability, in which case the whole MLE approach is questionable at best. You should urgently look into this, as it suggests your data may not be as well-behaved as you might like them to be. Quasi-separability is very well explained here.

How to avoid numerical overflow error with exponential terms?

Numerical calculations involving exponential terms often become painful thanks to overflow errors. For example, suppose you have a probability density, P(x)=C*exp(f(x)/k), where k is a very small number, say of the order of 10^(-5).
To find the value of C one has to integrate P(x). Here comes the overflow error. I know it also depends on the form of f(x). But for this moment let us assume, f(x)=sin(x).
How to deal with such problems?
What are the tricks we may use to avoid them?
Is the severity of such problems language dependent? If yes, in which language should one write his code?
As I mentioned in the comments, I strongly advise using analytical methods as far as you can. However, if you want to compute integrals of the form
I=Integral[Exp[f(x)],{x,a,b}]
where f(x) could potentially overflow the exponent, then you might want to renormalize the system a bit in the following way:
Assume that f(c) is the maximum of f(x) in the domain [a,b], then you can write:
I=Exp[f(c)] Integral[Exp[f(x)-f(c)],{x,a,b}]
It's an ugly trick, but at least your exponents will be small in the integral.
Note: I just realized this is the same as roygvib's comment.
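For concreteness, a short Python sketch of the trick applied to the example in the question, where the exponent is $\sin(x)/k$ with $k=10^{-5}$. The result is returned on the log scale, since the factored-out prefactor itself would overflow a double (the integration domain below is illustrative):

import numpy as np
from scipy.integrate import quad

k = 1e-5
a, b = 0.0, np.pi                      # illustrative integration domain

def g(x):
    return np.sin(x) / k               # the exponent; up to 1e5, so exp(g) overflows

g_max = 1.0 / k                        # maximum of g on [a, b], attained at x = pi/2

# The shifted integrand exp(g(x) - g_max) stays in (0, 1]; tell quad where the sharp peak is.
shifted, _ = quad(lambda x: np.exp(g(x) - g_max), a, b, points=[np.pi / 2])
log_I = g_max + np.log(shifted)        # log of Integral[Exp[g(x)], {x, a, b}]
print(log_I)                           # ~99995.2; exp(log_I) itself would overflow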
One of the options is to use GSL - GNU Scientific Library (python and fortran wrappers are available).
There is a function gsl_sf_exp_e10_e that, according to the documentation,
computes the exponential \exp(x) using the gsl_sf_result_e10 type to return a result with extended range. This function may be useful if the value of \exp(x) would overflow the numeric range of double.
However, I would like to note, that it's slow due to additional checks during evaluation.
P.S. As it was said earlier, it's better to use analytical solutions where possible.

fft2 different result in numpy and matlab

I was trying to port some code from Python to MATLAB, but I encountered an inconsistency between numpy's fft2 and matlab's fft2:
peak =
4.377491037053e-223 3.029446976068e-216 ...
1.271610790463e-209 3.237410810582e-203 ...
(The data is too large to list directly; it can be accessed here: https://drive.google.com/file/d/0Bz1-hopez9CGTFdzU0t3RDAyaHc/edit?usp=sharing)
Matlab:
fft2(peak) --(sample result)
12.5663706143590 -12.4458341615690
-12.4458341615690 12.3264538927637
Python:
np.fft.fft2(peak) --(sample result)
12.56637061 +0.00000000e+00j -12.44583416 +3.42948517e-15j
-12.44583416 +3.35525358e-15j 12.32645389 -6.78073635e-15j
Please help me to explain why, and give suggestion on how to fix it.
The Fourier transform of a real, even function is real and even (ref). Therefore it appears that your FFT should be real. NumPy is probably just struggling with the numerics, while MATLAB may outright check for symmetry and force the solution to be real.
MATLAB uses FFTW3, while my research indicates NumPy uses a library called FFTPack. FFTW is one of the standards for FFT performance and uses a number of tricks to work quickly and perform calculations to the best precision possible. You have incredibly tiny numbers, and this poses numerical challenges that any library will be hard pressed to resolve.
You might consider executing the Python code against an FFTW3 wrapper like pyFFTW3 and see if you get similar results.
It appears that your input data is Gaussian, real, and even, in which case we do expect the FFT2 of the signal to be real and even. If all your inputs are this way you could just take the real part, or round to a certain precision. I would trust MATLAB's FFTW code over the Python code.
Or you could just ignore it. The differences are quite small and a value of 3e-15i is effectively zero for most applications. If you have automated the comparison, consider calling them equivalent if the mean square error of all the entries is less than some threshold (say 1e-8 or 1e-15 or 1e-20).
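For instance, here is a small sketch with a synthetic real, even array standing in for peak (the grid below is illustrative): the imaginary parts of fft2 are pure rounding residue and can be dropped or compared within a tolerance.

import numpy as np

N = 8
d = np.minimum(np.arange(N), N - np.arange(N))          # circular distance from index 0
peak = np.exp(-(d[:, None]**2 + d[None, :]**2) / 4.0)   # real and even, like the data above

F = np.fft.fft2(peak)
print(np.abs(F.imag).max())                 # tiny: rounding residue only
F_real = np.real_if_close(F, tol=1000)      # returns a real array once the residue is negligible
print(F_real.dtype)                         # float64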
