Overflow in numpy.exp() - python

I have to calculate the exponential of the following array for my project:
w = [-1.52820754859, -0.000234000845064, -0.00527938881237, 5797.19232191, -6.64682108484,
18924.7087966, -69.308158911, 1.1158892974, 1.04454511882, 116.795573742]
But I've been getting overflow due to the number 18924.7087966.
The goal is to avoid using extra packages such as bigfloat (only "numpy" is allowed) and to get a close result (one with a small relative error).
So far I've tried using higher precision (i.e. float128):
def getlogZ_robust(w):
    Z = sum(np.exp(np.dot(x, w).astype(np.float128)) for x in iter_all_observations())
    return np.log(Z)
But I still get "inf" which is what I want to avoid.
I've also tried clipping it using np.clip():
def getlogZ_robust(w):
    Z = sum(np.exp(np.clip(np.dot(x, w).astype(np.float128), -11000, 11000)) for x in iter_all_observations())
    return np.log(Z)
But the relative error is too big.
Can you help me solve this problem, if it is possible?

Only significantly extended or arbitrary-precision packages will be able to handle the huge differences between these numbers. The exponentials of the largest and most negative entries in w differ by roughly 8000 (!) orders of magnitude. float (i.e. double precision) has 'only' about 15 digits of precision (meaning 1 + 1e-16 is numerically equal to 1), so adding the small exponentials to the huge exponential of the largest number has no effect. In fact, exp(18924.7087966) is so huge that it completely dominates the sum. Below is a script performing the sum with extended precision in mpmath: the ratio of exp(18924.7087966) to the sum of all the exponentials is essentially 1.
w = [-1.52820754859, -0.000234000845064, -0.00527938881237, 5797.19232191, -6.64682108484,
18924.7087966, -69.308158911, 1.1158892974, 1.04454511882, 116.795573742]
u = min(w)
v = max(w)
import mpmath
#using plenty of precision
mpmath.mp.dps = 32768
print('%.5e' % mpmath.log10(mpmath.exp(v)/mpmath.exp(u)))
#exp(w) differs by 8000 orders of magnitude for largest and smallest number
s = sum([mpmath.exp(mpmath.mpf(x)) for x in w])
print('%.5e' % (mpmath.exp(v)/s))
#the largest exp(w) dominates, so the ratio of exp(max(w)) to the sum of exp(w) is approx. 1

If the issue of losing digits in the final result due to the hugely different orders of magnitude of the added terms is not a concern, one can also mathematically transform the log of the sum over exponentials in the following way, avoiding exp of large numbers:
log(sum(exp(w)))
= log(sum(exp(w-wmax)*exp(wmax)))
= wmax + log(sum(exp(w-wmax)))
In python:
import numpy as np
v = np.array(w)
m = np.max(v)
print(m + np.log(np.sum(np.exp(v-m))))
Note that np.log(np.sum(np.exp(v-m))) is numerically zero as the exponential of the largest number completely dominates the sum here.

Numpy has a function called logaddexp which computes
logaddexp(x1, x2) == log(exp(x1) + exp(x2))
without explicitly computing the intermediate exp() values. This way it avoids the overflow. So here is the solution:
def getlogZ_robust(w):
    Z = -np.inf  # log(0), so the first logaddexp simply returns the first term
    for x in iter_all_observations():
        Z = np.logaddexp(Z, np.dot(x, w))
    return Z
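As a quick sanity check (a minimal sketch, applying the same idea directly to the w array from the question rather than to the dot products), logaddexp can also be folded over a whole array with its ufunc reduce method:
import numpy as np

w = np.array([-1.52820754859, -0.000234000845064, -0.00527938881237,
              5797.19232191, -6.64682108484, 18924.7087966,
              -69.308158911, 1.1158892974, 1.04454511882, 116.795573742])

# log(sum(exp(w))) computed pairwise, without ever forming exp(18924.7...)
print(np.logaddexp.reduce(w))  # ~18924.7087966, no overflow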

Related

Why is ifft2 not working but fft2 is fine?

I want to implement ifft2 using DFT matrix. The following code works for fft2.
import numpy as np
def DFT_matrix(N):
    i, j = np.meshgrid(np.arange(N), np.arange(N))
    omega = np.exp( - 2 * np.pi * 1j / N )
    W = np.power( omega, i * j ) # Normalization by sqrt(N) not included
    return W
sizeM=40
sizeN=20
np.random.seed(0)
rA=np.random.rand(sizeM,sizeN)
rAfft=np.fft.fft2(rA)
dftMtxM=DFT_matrix(sizeM)
dftMtxN=DFT_matrix(sizeN)
# Matrix multiply the 3 matrices together
mA = dftMtxM @ rA @ dftMtxN
print(np.allclose(np.abs(mA), np.abs(rAfft)))
print(np.allclose(np.angle(mA), np.angle(rAfft)))
To get to ifft2 I assumed I only need to change the DFT matrices to their conjugates, so I expected the following to work, but I got False for the last two prints. Any suggestion please?
import numpy as np
def DFT_matrix(N):
    i, j = np.meshgrid(np.arange(N), np.arange(N))
    omega = np.exp( - 2 * np.pi * 1j / N )
    W = np.power( omega, i * j ) # Normalization by sqrt(N) not included
    return W
sizeM=40
sizeN=20
np.random.seed(0)
rA=np.random.rand(sizeM,sizeN)
rAfft=np.fft.ifft2(rA)
dftMtxM=np.conj(DFT_matrix(sizeM))
dftMtxN=np.conj(DFT_matrix(sizeN))
# Matrix multiply the 3 matrices together
mA = dftMtxM @ rA @ dftMtxN
print(np.allclose(np.abs(mA), np.abs(rAfft)))
print(np.allclose(np.angle(mA), np.angle(rAfft)))
I am going to build on some things from my answer to your previous question. Please note that I will try to distinguish between the terms Discrete Fourier Transform (DFT) and Fast Fourier Transform (FFT). Remember that the DFT is the transform, while the FFT is only an efficient algorithm for computing it. People, including myself, nevertheless very commonly refer to the DFT as the FFT, since the FFT is practically the only algorithm used for computing the DFT.
The problem here is again the normalization of the data. It's interesting that this is such a fundamental and confusing part of any DFT operation, yet I couldn't find a good explanation for it on the internet. I will try to provide a summary about DFT normalization at the end; however, I think the best way to understand it is by working through some examples yourself.
Why do the comparisons fail?
It's important to note that even though both of the allclose tests seemingly fail, they are actually not a very good method of comparing two arrays of complex numbers.
Difference between two angles
In particular, the problem arises when comparing angles. If you just take the difference of two nearly equal angles that lie on the border between -pi and pi, you can get a value that is around 2*pi. allclose just takes the difference between values and checks that it is below some threshold, so in our case it can report a false negative.
A better way to compare angles is something along the lines of this function:
def angle_difference(a, b):
    diff = a - b
    diff[diff < -np.pi] += 2*np.pi
    diff[diff > np.pi] -= 2*np.pi
    return diff
You can then take the maximum absolute value and check that it's below some threshold:
np.max(np.abs(angle_difference(np.angle(mA), np.angle(rAfft)))) < threshold
In the case of your example, the maximum difference was 3.072209153742733e-12.
So the angles are actually correct!
Magnitude scaling
We can get an idea of the issue when we look at the magnitude ratio between the matrix iDFT and the library iFFT.
print(np.abs(mA)/np.abs(rAfft))
We find that all the values of this ratio are 800, which means that our absolute values are 800 times larger than those computed by the library. Suspiciously, 800 = 40 * 20, the dimensions of our data! I think you can see where I am going with this.
Confusing DFT normalization
We spot an indication of why this is the case when we have a look at the DFT formulas as given in the Numpy FFT documentation:
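A_k = sum_{m=0}^{n-1} a_m * exp(-2*pi*i*m*k/n),       k = 0, ..., n-1   (forward transform, no scaling)
a_m = (1/n) * sum_{k=0}^{n-1} A_k * exp(2*pi*i*m*k/n), m = 0, ..., n-1   (inverse transform, scaled by 1/n)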
You will notice that the forward transform doesn't normalize by anything, while the inverse transform multiplies the output by 1/N. These are the 1D transforms, but the exact same thing applies in the 2D case: the inverse transform multiplies everything by 1/(N*M).
So in our example, if we update this line, we will get the magnitudes to agree:
mA = dftMtxM @ rA/(sizeM * sizeN) @ dftMtxN
A side note on comparing the outputs: an alternative way to compare complex numbers is to compare their real and imaginary components:
print(np.allclose(mA.real, rAfft.real))
print(np.allclose(mA.imag, rAfft.imag))
And we find that now indeed both methods agree.
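Putting it all together, a minimal corrected version of the original script (same DFT_matrix helper as above) would look something like this; if the analysis above holds, both checks should print True:
import numpy as np

def DFT_matrix(N):
    i, j = np.meshgrid(np.arange(N), np.arange(N))
    omega = np.exp( - 2 * np.pi * 1j / N )
    return np.power( omega, i * j )

sizeM, sizeN = 40, 20
np.random.seed(0)
rA = np.random.rand(sizeM, sizeN)
rAifft = np.fft.ifft2(rA)

# Conjugate DFT matrices plus the 1/(M*N) normalization used by ifft2
dftMtxM = np.conj(DFT_matrix(sizeM))
dftMtxN = np.conj(DFT_matrix(sizeN))
mA = dftMtxM @ rA / (sizeM * sizeN) @ dftMtxN

print(np.allclose(mA.real, rAifft.real))
print(np.allclose(mA.imag, rAifft.imag))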
Why all this normalization mess and which should I use?
The fundamental property the DFT must satisfy is that iDFT(DFT(x)) = x. When you work through the math, you find that the product of the two normalization coefficients in front of the sums has to be 1/N.
There is also Parseval's theorem. In simple terms, it states that the energy of the signal, i.e. the sum of squared absolute values, is the same in the time domain and in the frequency domain (up to the normalization). For the DFT conventions above, this boils down to the following relationship:
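sum_{m=0}^{n-1} |a_m|^2 = (1/n) * sum_{k=0}^{n-1} |A_k|^2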
Here is the function for computing the energy of a signal:
def energy(x):
    return np.sum(np.abs(x)**2)
You are basically faced with a choice about the 1/N factor:
You can put the 1/N before the forward DFT sum. This makes sense, as then the k=0 DC component will be equal to the average of the time domain values. However, you will have to multiply the energy in the frequency domain by N in order to match it with the time domain energy.
N = len(x)
X = np.fft.fft(x)/N # Compute the FFT scaled by `1/N`
# Energy related by `N`
np.allclose(energy(x), energy(X) * N) == True
# Perform some processing...
Y = X * H
y = np.fft.ifft(Y*N) # Compute the iFFT, remember to cancel out the built in `1/N` of ifft
You can put the 1/N before the iDFT sum. This is, slightly counterintuitively, what most implementations, including Numpy, do. I could not find a definitive consensus on the reasoning behind this, but I think it has something to do with implementation efficiency. (If anyone has a better explanation for this, please leave it in the comments.) As shown in the equations earlier, the energy in the frequency domain has to be divided by N to match the time domain energy.
N = len(x)
X = np.fft.fft(x) # Compute the FFT without scaling
# Energy, related by 1/N
np.allclose(energy(x), energy(X) / N) == True
# Perform some processing...
Y = X * H
y = np.fft.ifft(Y) # Compute the iFFT with the build in `1/N`
You can split the 1/N by placing 1/sqrt(N) before each of the transforms, making them perfectly symmetric. In Numpy, you can provide the parameter norm="ortho" to the fft functions, which will make them use the 1/sqrt(N) normalization instead: np.fft.fft(x, norm="ortho"). The nice property here is that the energy now matches in both domains.
X = np.fft.fft(x, norm='ortho') # Compute the FFT scaled by `1/sqrt(N)`
# Perform some processing...
# Energies are equal:
np.allclose(energy(x), energy(X)) == True
Y = X * H
y = np.fft.ifft(Y, norm='ortho') # Compute the iFFT, also scaled by `1/sqrt(N)`
In the end it boils down to what you need. Most of the time the absolute magnitude of your DFT is actually not that important. You are mostly interested in the ratio of various components, or you want to perform some operation in the frequency domain and then transform back to the time domain, or you are interested in the phase (angles). In all of these cases, the normalization does not really play an important role, as long as you stay consistent.

Accuracy of math.pow, numpy.power, numpy.float_power, pow and ** in python

Is there a difference in accuracy between math.pow, numpy.power, numpy.float_power, pow() and ** in python, for two floating point numbers x, y?
I assume x is very close to 1, and y is large.
One way in which you would lose precision in all cases is if you are computing a small number (z say) and then computing
p = pow( 1.0+z, y)
The problem is that doubles have around 16 significant figures, so if z is say 1e-8, in forming 1.0+z you will lose half of those figures. Worse, if z is smaller than 1e-16, 1.0+z will be exactly 1.
You can get around this by using the numpy function log1p. This computes the log of its argument plus one, without actually adding 1 to its argument, so no precision is lost.
You can compute p above as
p = exp( log1p(z)*y)
which will eliminate the loss of precision due to calculating 1+z
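A quick illustration (a minimal sketch with made-up values of z and y; here the exact answer is e ≈ 2.718):
import numpy as np

z = 1e-18   # far below the double precision epsilon of ~2.2e-16
y = 1e18

naive = (1.0 + z) ** y             # 1.0 + z rounds to exactly 1.0, so this gives 1.0
robust = np.exp(np.log1p(z) * y)   # keeps the tiny z, giving ~2.718...

print(naive, robust)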

When not to do eigendecomposition for repeated linear transformation

Let's say we have a point p, e.g. (1, 2, 3) on which we want to apply a linear transformation N times. If the transformation is denoted by matrix A, then the final transformation would be given by A^N . p. Matrix multiplication being costly, I was assuming eigen-decomposition followed by diagonalization would speed up the whole process. But to my surprise, this supposedly improved method is taking more time. What am I missing here?
import timeit
mysetup = '''
import numpy as np
from numpy import linalg as LA
from numpy.linalg import matrix_power
EXP = 5 # no. of time linear transformation is applied
LT = 10 # range from which numbers are picked at random for matrices and points.
N = 100 # dimension of the vector space
A_init = np.random.randint(LT, size=(N, N))
A = (A_init + A_init.T)/2
p = np.random.randint(LT, size=N)
def run_sim_1():
    An = matrix_power(A, EXP)
    return An @ p
def run_sim_2():
    λ, V = LA.eig(A)
    Λ = np.diag(λ)
    Λ[np.diag_indices(N)] = λ ** EXP
    An = V @ Λ @ V.T
    return An @ p
'''
# code snippet whose execution time is to be measured
# naive implementation
mycode_1 = '''run_sim_1()'''
print(timeit.timeit(setup = mysetup, stmt = mycode_1, number = 1000))
# time taken = 0.14894760597962886
# improved code snippet whose execution time is to be measured
# expecting this to take much less time.
mycode_2 = '''run_sim_2()'''
# timeit statement
print(timeit.timeit(setup = mysetup, stmt = mycode_2, number = 1000))
# time taken = 8.035318267997354
This is a bit hard to answer authoritatively. Standard implementations of both matrix multiplication and eigendecomposition are O(n^3), so there's no a priori reason to expect one to be faster than the other. And anecdotally, my experience is that eigendecomposition is generally much slower than a single matrix multiplication, so this result doesn't entirely surprise me.
Because the matrix power operation in this case involves twenty multiplications, I can see why you might expect it to be slower than eigendecomposition. But if you look at the source code, this interesting tidbit shows up:
# Use binary decomposition to reduce the number of matrix multiplications.
# Here, we iterate over the bits of n, from LSB to MSB, raise `a` to
# increasing powers of 2, and multiply into the result as needed.
z = result = None
while n > 0:
    z = a if z is None else fmatmul(z, z)
    n, bit = divmod(n, 2)
    if bit:
        result = z if result is None else fmatmul(result, z)
So in fact, it's not really doing 20 multiplications! It's using a divide-and-conquer approach that reduces that number. After thinking through the algorithm, which is really quite elegant, I believe it will never do more than 2*log(p) multiplications for a given power p. This maximum is reached when all the bits of p are one, i.e. when p is one less than a power of two.
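To see this concretely, here is a small sketch (not part of numpy, just a hypothetical helper mirroring the loop above) that counts the multiplications the binary decomposition performs for a given power p:
def count_matmuls(p):
    count = 0
    z = result = None
    while p > 0:
        if z is None:
            z = True            # first iteration: z = a, no multiplication
        else:
            count += 1          # squaring step: z = z @ z
        p, bit = divmod(p, 2)
        if bit:
            if result is None:
                result = True   # first set bit: result = z, no multiplication
            else:
                count += 1      # accumulation step: result = result @ z
    return count

print(count_matmuls(20))   # 5 multiplications instead of 19 naive ones
print(count_matmuls(31))   # 8, close to the 2*log(p) worst case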
The upshot is that although eigendecomposition might be faster in theory than repeated matrix multiplication, it carries constant overhead that makes it less efficient until p gets very large — maybe larger than any practical value.
I should add this: won't multiplying the vector directly be faster than raising the matrix to a power? Twenty vector multiplications would still be O(n^2), no? But perhaps what you really want to do is perform this operation on 10k vectors, in which case the matrix power approach is clearly superior.
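To illustrate that last point, a rough sketch (with arbitrary sizes, not the question's timing harness): applying A to the vector EXP times costs O(EXP * N^2), while forming the matrix power first costs O(N^3 * log(EXP)):
import numpy as np
from numpy.linalg import matrix_power

N, EXP = 100, 20
rng = np.random.default_rng(0)
A = rng.random((N, N))
p = rng.random(N)

# Repeated matrix-vector products: O(EXP * N^2)
v = p.copy()
for _ in range(EXP):
    v = A @ v

# Matrix power first, then one matrix-vector product: O(N^3 * log(EXP))
w = matrix_power(A, EXP) @ p

print(np.allclose(v, w))   # same result, very different cost profiles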
Both my_code_1 and my_code_2 contain just a single def statement. So your calls to timeit are only timing how long it takes to define the functions; the functions are never called.
Move the function definitions to the setup code, and replace the statements to be timed with just the call of the appropriate function, e.g.
mycode_1 = '''
run_sim_1()
'''
Then you should lower (by a lot) the value of number that you pass to timeit. And then you'll have to fix run_sim_2() to perform the correct calculation:
def run_sim_2():
    λ, V = LA.eig(A)
    Λ = np.diag(λ)
    Λ[np.diag_indices(N)] = λ ** 20
    An = V @ Λ @ V.T
    return An @ p
Once you've made those changes, you'll still find that run_sim_1() is faster. See @senderle's answer for the likely reason.

Matrix inversion for matrix with large values in python

I'm doing matrix inversion in python, and I found it very weird that the result differs by the data scale.
In the code below, it is expected that A_inv/B_inv = B/A. However, it shows that the difference between A_inv/B_inv and B/A becomes larger and larger depending on the data scale... Is this because Python cannot compute the matrix inverse precisely for a matrix with large values?
Also, I checked the condition number for B, which is a constant ~3.016 no matter what the scale is.
Thanks!!!
import numpy as np
from matplotlib import pyplot as plt
D = 30
N = 300
np.random.seed(10)
original_data = np.random.sample([D, N])
A = np.cov(original_data)
A_inv = np.linalg.inv(A)
B_cond = []
diff = []
for k in xrange(1,10):
    B = A * np.power(10,k)
    B_cond.append(np.linalg.cond(B))
    B_inv = np.linalg.inv(B)
    ### Two measurements of difference are used
    diff.append(np.log(np.linalg.norm(A_inv/B_inv - B/A)))
    #diff.append(np.max(np.abs(A_inv/B_inv - B/A)))
# print B_cond
plt.figure()
plt.plot(xrange(1,10), diff)
plt.xlabel('data(B) / data(A)')
plt.ylabel('log(||A_inv/B_inv - B/A||)')
plt.savefig('Inversion for large matrix')
I may be wrong, but I think it comes from the number representation in the machine.
When you are dealing with very large numbers, your inverse matrix is going to have entries that are very small in magnitude (close to zero). And close to zero, the floating point representation is not precise enough, I guess...
https://en.wikipedia.org/wiki/Floating-point_arithmetic
There is no reason that you should expect np.linalg.norm(A_inv/B_inv - B/A) to be equal to anything special. Instead, you can check the quality of the inverse calculation by multiplying the original matrix by its inverse and checking the determinant, np.linalg.det(A.dot(A_inv)), which should be equal to 1.
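For instance, a minimal sketch of such a check (reusing the question's setup, written for Python 3) could look like:
import numpy as np

np.random.seed(10)
original_data = np.random.sample([30, 300])
A = np.cov(original_data)

for k in range(1, 10):
    B = A * 10.0**k
    B_inv = np.linalg.inv(B)
    # det(B.dot(B_inv)) should stay close to 1 regardless of the scaling
    print(k, np.linalg.det(B.dot(B_inv)))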

Calculating Beta Binomial likelihood with n>1000

I'm struggling with a numerical precision issue in calculating the beta binomial likelihood. My goal is to estimate the probability y mod 10 = d for some digit d, given that y is Binomial(n,p) and p is Beta(a,b). I'm trying to come up with a fast solution for large n, by which I mean at least 1000. One thing that seems to be giving me reasonable answers is to use simulation.
import numpy as np
from scipy.stats import beta, binom

def npliketest_exact(n,digit,a,b):
    #draw 1000 values of p
    probs = np.array(beta.rvs(a,b,size=1000))
    #create an array of numbers whose last digit is digit
    digits = np.arange(digit,n+1,10)
    #create a function that calculates the pmf at x given p
    exact_func = lambda x,p: binom(n,p).pmf(x)
    #given p, the likelihood of last digit "digit" is the sum over all entries in digits
    likelihood = lambda p: exact_func(digits,p).sum()
    #return the average of that likelihood over all the draws
    return np.vectorize(likelihood)(probs).mean()
np.random.seed(1)
print npliketest_exact(1000,9,1,1) #0.0992310195195
This might be ok, but I'm worried about the precision of this strategy. In particular if there's a better/more precise way to do this calculation I'm eager to figure out how to do it.
I've started trying to use the log likelihood to come up with that answer, but I'm running into numerical stability issues even with that.
from scipy.special import gammaln

def llike(n,k,a,b):
    out = gammaln(n+1) + gammaln(k+a) + gammaln(n-k+b) + gammaln(a+b) - \
          ( gammaln(k+1) + gammaln(n-k+1) + gammaln(a) + gammaln(b) + gammaln(n+a+b) )
    return out
print exp(llike(1000,9,1,1)) #.000999000999001
print exp(llike(1000,500,1,1)) #.000999000999001
Since the beta 1,1 has mean 0.5, the probability of getting y=500 from a beta binomial with n=1000 should be much higher than getting 9, but the above calculations show a suspicious constant value.
Another thing I tried, which was suggested elsewhere on stackoverflow to deal with this problem, was to use some clever tricks to support numerical stability that are apparently hidden in scipy's betaln formula.
from scipy.special import betaln
from numpy import log

def binomln(n, k): #log of the binomial coefficient
    # Assumes binom(n, k) >= 0
    return -betaln(1 + n - k, 1 + k) - log(n + 1)

def log_betabinom_exact(n,k,a,b):
    return binomln(n,k) + betaln(k+a,n-k+b) - betaln(a,b)
print exp(log_betabinom_exact(1000,9,1,1)) #.000999000999001
print exp(log_betabinom_exact(1000,500,1,1)) #0.000999000999001
Again, same suspicious constants. Would appreciate any advice. Would using sympy be of any help on this?
Followup
Sorry guys, dumb mistake on my part: Beta(1,1) is uniform, so the results I was getting make sense. Trying different parameters makes things look different for different values of k.
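For completeness, a quick check along the lines of that followup (a minimal sketch reusing log_betabinom_exact from above, written for Python 3, with hypothetical parameter values a = b = 10):
import numpy as np
from scipy.special import betaln

def binomln(n, k):
    return -betaln(1 + n - k, 1 + k) - np.log(n + 1)

def log_betabinom_exact(n, k, a, b):
    return binomln(n, k) + betaln(k + a, n - k + b) - betaln(a, b)

# With a = b = 1 the beta-binomial is uniform over k = 0..n, hence the constant 1/1001
print(np.exp(log_betabinom_exact(1000, 9, 1, 1)))      # ~0.000999
# With a = b = 10 the mass concentrates around k = 500
print(np.exp(log_betabinom_exact(1000, 9, 10, 10)))    # much smaller than 1/1001
print(np.exp(log_betabinom_exact(1000, 500, 10, 10)))  # much larger than 1/1001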
