How to increase precision in matplotlib? - python

I am trying to plot a set of extreme floating-point values that require high precision. It seems to me there are precision limits in matplotlib. It cannot go further than the scale of 1e28.
This is my code for displaying a graph.
import matplotlib.pyplot as plt
import numpy as np
x = np.array([1737100, 38380894.5188064386003616016502, 378029000.0], dtype=np.longdouble)
y = np.array([-76188946654889063420743355676.5, -76188946654889063419450832178.0, -76188946654889063450098993033.0], dtype=np.longdouble)
plt.scatter(x, y)
#coefficients = np.polyfit(x, y, 2)
#poly = np.poly1d(coefficients)
#new_x = np.linspace(x[0], x[-1])
#new_y = poly(new_x)
#plt.plot(new_x, new_y)
plt.xlim([x[0], x[-1]])
plt.title('U vs. r')
plt.xlabel('Distance r')
plt.ylabel('Total gravitational potential energy U(r)')
plt.show()
I am expecting the middle point to be located higher than the other two points. It requires very high precision. How can I configure it?

Your current issue is likely not with matplotlib but with np.longdouble. To discover whether this is the case, run np.finfo(np.longdouble). This will be machine dependent, but on my machine, this says I'm using a float128 with the following description
Machine parameters for float128
---------------------------------------------------------------
precision = 18 resolution = 1.0000000000000000715e-18
machep = -63 eps = 1.084202172485504434e-19
negep = -64 epsneg = 5.42101086242752217e-20
minexp = -16382 tiny = 3.3621031431120935063e-4932
maxexp = 16384 max = 1.189731495357231765e+4932
nexp = 15 min = -max
---------------------------------------------------------------
The precision is just an estimate (due to binary vs decimal representation), but 18 digits is the float128 limit, and your specific numbers only start to become interesting after that.
An easy test is to print y[1]-y[0] and see if you get something other than 0.0.
An easy solution is to use Python ints since you'd capture most of the difference (or int of 10*y) since Python has infinite precision ints. So something like this:
x = np.array([1737100, 38380894.5188064386003616016502, 378029000.0], dtype=np.longdouble)
y = [-76188946654889063420743355676, -76188946654889063419450832178, -76188946654889063450098993033]
plt.scatter(x, [z-y[0] for z in y])
Another solution is to represent the numbers from the start so that they require a more accessible precision (ie, with most of the offset removed). And another is to use a high precision float library. It depends on which way you want to go.
It's also worth noting that, at least for my system which I think is typical, the default np.float is float64. For float64 the floating point mantisaa is 52 bits, whereas for float128 it's only 63 bits. Or in decimal, from about 15 digits to 18. So there's not a great precision increase for going from np.float to np.float128. (Here's a discussion of why np.longdouble ( or np.float128) sounds like it's going to add a lot of precision, but doesn't.)
(Finally, because this may cause confusion for some, if it were the case that np.longdouble or np.float128 were useful for this problem, it's worth noting that the line in the question that sets the initial array wouldn't give the intended precision of np.longdouble. That is, y=np.array( [-76188946654889063420743355676.5], dtype=np.longdouble) first creates and array of Python floats, and then creates the numpy array from that, but the precision will be lost in the Python array. So if longdouble were the solution, a different approach to initializing the array would be needed.)

Related

Accuracy of math.pow, numpy.power, numpy.float_power, pow and ** in python

Is there are difference in accuracy between math.pow, numpy.power, numpy.float_power, pow() and ** in python, between two floating point numbers x,y?
I assume x is very close to 1, and y is large.
One way in which you would lose precision in all cases is if you are computing a small number (z say) and then computing
p = pow( 1.0+z, y)
The problem is that doubles have around 16 significant figures, so if z is say 1e-8, in forming 1.0+z you will lose half of those figures. Worse, if z is smaller than 1e-16, 1.0+z will be exactly 1.
You can get round this by using the numpy function log1p. This computes the log of its argument plus one, without actually adding 1 to its argument, so not losing precision.
You can compute p above as
p = exp( log1p(z)*y)
which will eliminate the loss of precision due to calculating 1+z

Overflow in numpy.exp()

I have to calculate the exponential of the following array for my project:
w = [-1.52820754859, -0.000234000845064, -0.00527938881237, 5797.19232191, -6.64682108484,
18924.7087966, -69.308158911, 1.1158892974, 1.04454511882, 116.795573742]
But I've been getting overflow due to the number 18924.7087966.
The goal is to avoid using extra packages such as bigfloat (except "numpy") and get a close result (which has a small relative error).
1.So far I've tried using higher precision (i.e. float128):
def getlogZ_robust(w):
Z = sum(np.exp(np.dot(x,w).astype(np.float128)) for x in iter_all_observations())
return np.log(Z)
But I still get "inf" which is what I want to avoid.
I've tried clipping it using nump.clip():
def getlogZ_robust(w):
Z = sum(np.exp(np.clip(np.dot(x,w).astype(np.float128),-11000, 11000)) for x in iter_all_observations())
return np.log(Z)
But the relative error is too big.
Can you help me solving this problem, if it is possible?
Only significantly extended or arbitrary precision packages will be able to handle the huge differences in numbers. The exponential of the largest and most negative numbers in w differ by 8000 (!) orders of magnitude. float (i.e. double precision) has 'only' 15 digits of precision (meaning 1+1e-16 is numerically equal to 1), such that adding the small numbers to the huge exponential of the largest number has no effect. As a matter of fact, exp(18924.7087966) is so huge, that it dominates the sum. Below is a script performing the sum with extended precision in mpmath: the ratio of the sum of exponentials and exp(18924.7087966) is basically 1.
w = [-1.52820754859, -0.000234000845064, -0.00527938881237, 5797.19232191, -6.64682108484,
18924.7087966, -69.308158911, 1.1158892974, 1.04454511882, 116.795573742]
u = min(w)
v = max(w)
import mpmath
#using plenty of precision
mpmath.mp.dps = 32768
print('%.5e' % mpmath.log10(mpmath.exp(v)/mpmath.exp(u)))
#exp(w) differs by 8000 orders of magnitude for largest and smallest number
s = sum([mpmath.exp(mpmath.mpf(x)) for x in w])
print('%.5e' % (mpmath.exp(v)/s))
#largest exp(w) dominates such that ratio over the sums of exp(w) and exp(max(w)) is approx. 1
If the issues of loosing digits in the final results due to hugely different orders of magnitudes of added terms in not a concern, one could also mathematically transform the log of sums over exponentials the following way avoiding exp of large numbers:
log(sum(exp(w)))
= log(sum(exp(w-wmax)*exp(wmax)))
= wmax + log(sum(exp(w-wmax)))
In python:
import numpy as np
v = np.array(w)
m = np.max(v)
print(m + np.log(np.sum(np.exp(v-m))))
Note that np.log(np.sum(np.exp(v-m))) is numerically zero as the exponential of the largest number completely dominates the sum here.
Numpy has a function called logaddexp which computes
logaddexp(x1, x2) == log(exp(x1) + exp(x2))
without explicitly computing the intermediate exp() values. This way it avoids the overflow. So here is the solution:
def getlogZ_robust(w):
Z = 0
for x in iter_all_observations():
Z = np.logaddexp(Z, np.dot(x,w))
return Z

Mean and standard deviation of lognormal distribution do not match analytic values

As part of my research, I measure the mean and standard deviation of draws from a lognormal distribution. Given a value of the underlying normal distribution, it should be possible to analytically predict these quantities (as given at https://en.wikipedia.org/wiki/Log-normal_distribution).
However, as can be seen in the plots below, this does not seem to be the case. The first plot shows the mean of the lognormal data against the sigma of the gaussian, while the second plot shows the sigma of the lognormal data against that of the gaussian. Clearly, the "calculated" lines deviate from the "analytic" ones very significantly.
I take the mean of the gaussian distribution to be related to the sigma by mu = -0.5*sigma**2 as this ensures that the lognormal field ought to have mean of 1. Note, this is motivated by the area of physics that I work in: the deviation from analytic values still occurs if you set mu=0.0 for example.
By copying and pasting the code at the bottom of the question, it should be possible to reproduce the plots below. Any advice as to what might be causing this would be much appreciated!
Mean of lognormal vs sigma of gaussian:
Sigma of lognormal vs sigma of gaussian:
Note, to produce the plots above, I used N=10000, but have put N=1000 in the code below for speed.
import numpy as np
import matplotlib.pyplot as plt
mean_calc = []
sigma_calc = []
mean_analytic = []
sigma_analytic = []
ss = np.linspace(1.0,10.0,46)
N = 1000
for s in ss:
mu = -0.5*s*s
ln = np.random.lognormal(mean=mu, sigma=s, size=(N,N))
mean_calc += [np.average(ln)]
sigma_calc += [np.std(ln)]
mean_analytic += [np.exp(mu+0.5*s*s)]
sigma_analytic += [np.sqrt((np.exp(s**2)-1)*(np.exp(2*mu + s*s)))]
plt.loglog(ss,mean_calc,label='calculated')
plt.loglog(ss,mean_analytic,label='analytic')
plt.legend();plt.grid()
plt.xlabel(r'$\sigma_G$')
plt.ylabel(r'$\mu_{LN}$')
plt.show()
plt.loglog(ss,sigma_calc,label='calculated')
plt.loglog(ss,sigma_analytic,label='analytic')
plt.legend();plt.grid()
plt.xlabel(r'$\sigma_G$')
plt.ylabel(r'$\sigma_{LN}$')
plt.show()
TL;DR
Lognormal are positively skewed and heavy tailed distribution. When performing float arithmetic operations (such as sum, mean or std) on sample drawn from a highly skewed distribution, the sampling vector contains values with discrepancy over several order of magnitude (many decades). This makes the computation inaccurate.
The problem comes from those two lines:
mean_calc += [np.average(ln)]
sigma_calc += [np.std(ln)]
Because ln contains both very low and very high values with order of magnitude much higher than float precision.
The problem can be easily detected to warn user that its computation are wrong using the following predicate:
(max(ln) + min(ln)) <= max(ln)
Which is obviously false in Strictly Positive Real but must be considered when using Finite Precision Arithmetic.
Modifying your MCVE
If we slightly modify your MCVE to:
from scipy import stats
for s in ss:
mu = -0.5*s*s
ln = stats.lognorm(s, scale=np.exp(mu)).rvs(N*N)
f = stats.lognorm.fit(ln, floc=0)
mean_calc += [f[2]*np.exp(0.5*s*s)]
sigma_calc += [np.sqrt((np.exp(f[0]**2)-1)*(np.exp(2*mu + s*s)))]
mean_analytic += [np.exp(mu+0.5*s*s)]
sigma_analytic += [np.sqrt((np.exp(s**2)-1)*(np.exp(2*mu + s*s)))]
It gives the reasonably correct mean and standard deviation estimation even for high value of sigma.
The key is that fit uses MLE algorithm to estimates parameters. This totally differs from your original approach which directly performs the mean of the sample.
The fit method returns a tuple with (sigma, loc=0, scale=exp(mu)) which are parameters of the scipy.stats.lognorm object as specified in documentation.
I think you should investigate how you are estimating mean and standard deviation. The divergence probably comes from this part of your algorithm.
There might be several reasons why it diverges, at least consider:
Biased estimator: Are your estimator correct and unbiased? Mean is unbiased estimator (see next section) but maybe not efficient;
Sampled outliers from Pseudo Random Generator may not be as intense as they should be compared to the theoretical distribution: maybe MLE is less sensitive than your estimator New MCVE bellow does not support this hypothesis, but Float Arithmetic Error can explain why your estimators are underestimated;
Float Arithmetic Error New MCVE bellow highlights that it is part of your problem.
A scientific quote
It seems naive mean estimator (simply taking mean), even if unbiased, is inefficient to properly estimate mean for large sigma (see Qi Tang, Comparison of Different Methods for Estimating Log-normal Means, p. 11):
The naive estimator is easy to calculate and it is unbiased. However,
this estimator can be inefficient when variance is large and sample
size is small.
The thesis reviews several methods to estimate mean of a lognormal distribution and uses MLE as reference for comparison. This explains why your method has a drift as sigma increases and MLE stick better alas it is not time efficient for large N. Very interesting paper.
Statistical considerations
Recalling than:
Lognormal is a heavy and long tailed distribution positively skewed. One consequence is: as the shape parameter sigma grows, the asymmetry and skweness grows, so does the strength of outliers.
Effect of Sample Size: as the number of samples drawn from a distribution grows, the expectation of having an outlier increases (so does the extent).
Building a new MCVE
Lets build a new MCVE to make it clearer. The code bellow draws samples of different sizes (N ranges between 100 and 10000) from lognormal distribution where shape parameter varies (sigma ranges between 0.1 and 10) and scale parameter is set to be unitary.
import warnings
import numpy as np
from scipy import stats
# Make computation reproducible among batches:
np.random.seed(123456789)
# Parameters ranges:
sigmas = np.arange(0.1, 10.1, 0.1)
sizes = np.logspace(2, 5, 21, base=10).astype(int)
# Placeholders:
rv = np.empty((sigmas.size,), dtype=object)
xmean = np.full((3, sigmas.size, sizes.size), np.nan)
xstd = np.full((3, sigmas.size, sizes.size), np.nan)
xextent = np.full((2, sigmas.size, sizes.size), np.nan)
eps = np.finfo(np.float64).eps
# Iterate Shape Parameter:
for (i, s) in enumerate(sigmas):
# Create Random Variable:
rv[i] = stats.lognorm(s, loc=0, scale=1)
# Iterate Sample Size:
for (j, N) in enumerate(sizes):
# Draw Samples:
xs = rv[i].rvs(N)
# Sample Extent:
xextent[:,i,j] = [np.min(xs), np.max(xs)]
# Check (max(x) + min(x)) <= max(x)
if (xextent[0,i,j] + xextent[1,i,j]) - xextent[1,i,j] < eps:
warnings.warn("Potential Float Arithmetic Errors: logN(mu=%.2f, sigma=%2f).sample(%d)" % (0, s, N))
# Generate different Estimators:
# Fit Parameters using MLE:
fit = stats.lognorm.fit(xs, floc=0)
xmean[0,i,j] = fit[2]
xstd[0,i,j] = fit[0]
# Naive (Bad Estimators because of Float Arithmetic Error):
xmean[1,i,j] = np.mean(xs)*np.exp(-0.5*s**2)
xstd[1,i,j] = np.sqrt(np.log(np.std(xs)**2*np.exp(-s**2)+1))
# Log-transform:
xmean[2,i,j] = np.exp(np.mean(np.log(xs)))
xstd[2,i,j] = np.std(np.log(xs))
Observation: The new MCVE starts to raise warnings when sigma > 4.
MLE as Reference
Estimating shape and scale parameters using MLE performs well:
The two figures above show than:
Error on estimation grows alongside with shape parameter;
Error on estimation reduces as sample size increases (CTL);
Note than MLE also fits well the shape parameter:
Float Arithmetic
It is worthy to plot the extent of drawn samples versus shape parameter and sample size:
Or the decimal magnitude between smallest and largest number form the sample:
On my setup:
np.finfo(np.float64).precision # 15
np.finfo(np.float64).eps # 2.220446049250313e-16
It means we have at maximum 15 significant figures to work with, if the magnitude between two numbers exceed then the largest number absorb the smaller ones.
A basic example: What is the result of 1 + 1e6 if we can only keep four significant figures?
The exact result is 1,000,001.0 but it must be rounded off to 1.000e6. This implies: the result of the operation equals to the highest number because of the rounding precision. It is inherent of Finite Precision Arithmetic.
The two previous figures above in conjunction with statistical consideration supports your observation that increasing N does not improve estimation for large value of sigma in your MCVE.
Figures above and below show than when sigma > 3 we haven't enough significant figures (less than 5) to performs valid computations.
Further more we can say that estimator will be underestimated because largest numbers will absorb smallest and the underestimated sum will then be divided by N making the estimator biased by default.
When shape parameter becomes sufficiently large, computations are strongly biased because of Arithmetic Float Errors.
It means using quantities such as:
np.mean(xs)
np.std(xs)
When computing estimate will have huge Arithmetic Float Error because of the important discrepancy among values stored in xs. Figures below reproduce your issue:
As stated, estimations are in default (not in excess) because high values (few outliers) in sampled vector absorb small values (most of the sampled values).
Logarithmic Transformation
If we apply a logarithmic transformation, we can drastically reduces this phenomenon:
xmean[2,i,j] = np.exp(np.mean(np.log(xs)))
xstd[2,i,j] = np.std(np.log(xs))
And then the naive estimation of the mean is correct and will be far less affected by Arithmetic Float Error because all sample values will lie within few decades instead of having relative magnitude higher than the float arithmetic precision.
Actually, taking the log-transform returns the same result for mean and std estimation than MLE for each N and sigma:
np.allclose(xmean[0,:,:], xmean[2,:,:]) # True
np.allclose(xstd[0,:,:], xstd[2,:,:]) # True
Reference
If you are looking for complete and detailed explanations of this kind of issues when performing scientific computing, I recommend you to read the excellent book: N. J. Higham, Accuracy and Stability of Numerical Algorithms, Siam, Second Edition, 2002.
Bonus
Here an example of figure generation code:
import matplotlib.pyplot as plt
fig, axe = plt.subplots()
idx = slice(None, None, 5)
axe.loglog(sigmas, xmean[0,:,idx])
axe.axhline(1, linestyle=':', color='k')
axe.set_title(r"MLE: $x \sim \log\mathcal{N}(\mu=0,\sigma)$")
axe.set_xlabel(r"Standard Deviation, $\sigma$")
axe.set_ylabel(r"Mean Estimation, $\hat{\mu}$")
axe.set_ylim([0.1,10])
lgd = axe.legend([r"$N = %d$" % s for s in sizes[idx]] + ['Exact'], bbox_to_anchor=(1,1), loc='upper left')
axe.grid(which='both')
fig.savefig('Lognorm_MLE_Emean_Sigma.png', dpi=120, bbox_extra_artists=(lgd,), bbox_inches='tight')

Numpy and R give non-zero intercept in linear regression when x = y

I was testing some code which, among other things, runs a linear regression of the form y = m * x + b on some data. To keep things simple, I set my x and y data equal to each other, expecting the model to return one for the slope and zero for the intercept. However, that's not what I saw. Here's a super boiled-down example, taken mostly from the numpy docs:
>>> y = np.arange(5)
>>> x = np.arange(5)
>>> A = np.vstack([x, np.ones(5)]).T
>>> np.linalg.lstsq(A, y)
(array([ 1.00000000e+00, -8.51331872e-16]), array([ 7.50403936e-31]), 2, array([ 5.78859314, 1.22155205]))
>>> # ^slope ^intercept ^residuals ^rank ^singular values
Numpy finds the exact slope of the true line of best fit (one), but reports an intercept that, while very very small, is not zero. Additionally, even though the data can be perfectly modeled by a linear equation y = 1 * x + 0, because this exact equation is not found, numpy reports a tiny but non-zero residual value.
As a sanity check, I tried this out in R (my "native" language), and observed similar results:
> x <- c(0 : 4)
> y <- c(0 : 4)
> lm(y ~ x)
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
-3.972e-16 1.000e+00
My question is, why and under what circumstances does this happen? Is it an artifact of looking for a model with a perfect fit, or is there always a tiny bit of noise added to regression output that we usually just don't see? In this case, the answer is almost certainly close enough to zero, so I'm mainly driven by academic curiosity. However, I also wonder if there are cases where this effect could be magnified to be nontrivial relative to the data.
I've probably revealed this by now, but I have basically no understanding of lower-level programming languages, and while I once had a cursory understanding of how to do this sort of linear algebra "by hand", it has long ago faded from my mind.
It looks like numerical error, the y-intercept is extremely small.
Python, and numpy included, uses double precision floating point numbers by default. These numbers are formatted to having a 52 bit coefficient (see this for floating point explanation, and this for scientific notation explanation of "base")
In your case, you found a y-intercept of ~4e-16. As it turns out, a 52 bit coefficient has roughly 2e-16 accuracy. Basically, in the regression, you subtracted a number on the order of 1 from something closely resembling itself, and hit the numerical precision of double floating point.

Is there any documentation of numpy numerical stability?

I looked around for some documentation of how numpy/scipy functions behave in terms of numerical stability, e.g. are any means taken to improve numerical stability or are there alternative stable implementations.
I am specifically interested in addition (+ operator) of floating point arrays, numpy.sum(), numpy.cumsum() and numpy.dot(). In all cases I am essentially summing a very large quantity of floating points numbers and I am concerned about the accuracy of such calculations.
Does anyone know of any reference to such issues in the numpy/scipy documentation or some other source?
The phrase "stability" refers to an algorithm. If your algorithm is unstable to start with then increasing precision or reducing rounding error of the component steps is not going to gain much.
The more complex numpy routines like "solve" are wrappers for the ATLAS/BLAS/LAPACK routines. You can refer to documentation there, for example "dgesv" solves a system of real valued linear equations using an LU decomposition with partial pivoting and row interchanges : underlying Fortran code docs for LAPACK can be seen here http://www.netlib.org/lapack/explore-html/ but http://docs.scipy.org/doc/numpy/user/install.html points out that many different versions of the standard routine implementations are available - speed optimisation and precision will vary between them.
Your examples don't introduce much rounding, "+" has no unnecessary rounding, the precision depends purely on rounding implicit in the floating point datatype when the smaller number has low-order bits that cannot be represented in an answer. Sum and dot depend only on the order of evaluation. Cumsum cannot be easily re-ordered as it outputs an array.
For the cumulative rounding during a "cumsum" or "dot" function you do have choices:
On Linux 64bit numpy provides access to a high precision "long double" type float128 which you could use to reduce loss of precision in intermediate calculations at the cost of performance and memory.
However on my Win64 install "numpy.longdouble" maps to "numpy.float64" a normal C double type so your code is not cross-platform, check "finfo". (Neither float96 or float128 with genuinely higher precision exist on Canopy Express Win64)
log2(finfo(float64).resolution)
> -49.828921423310433
actually 53-bits of mantissa internally ~ 16 significant decimal figures
log2(finfo(float32).resolution)
> -19.931568 # ~ only 7 meaningful digits
Since sum() and dot() reduce the array to a single value, maximising precision is easy with built-ins:
x = arange(1, 1000000, dtype = float32)
y = map(lambda z : float32(1.0/z), arange(1, 1000000))
sum(x) # 4.9994036e+11
sum(x, dtype = float64) # 499999500000.0
sum(y) # 14.357357
sum(y, dtype = float64) # 14.392725788474309
dot(x,y) # 999999.0
einsum('i,i', x, y) # * dot product = 999999.0
einsum('i,i', x, y, dtype = float64) # 999999.00003965141
note the single precision roundings within "dot" cancel in this case as each almost-integer is rounded to an exact integer
Optimising rounding depends on the kind of thing you are adding up - adding many small numbers first can help delay rounding but would not avoid problems where big numbers exist but cancel each other out as intermediate calculations still cause a loss of precision
example showing evaluation order dependence ...
x = array([ 1., 2e-15, 8e-15 , -0.7, -0.3], dtype=float32)
# evaluates to
# array([ 1.00000000e+00, 2.00000001e-15, 8.00000003e-15,
# -6.99999988e-01, -3.00000012e-01], dtype=float32)
sum(x) # 0
sum(x,dtype=float64) # 9.9920072216264089e-15
sum(random.permutation(x)) # gives 9.9999998e-15 / 2e-15 / 0.0

Categories

Resources