I have looked around for documentation of how numpy/scipy functions behave in terms of numerical stability, e.g. whether any measures are taken to improve numerical stability, or whether alternative stable implementations exist.
I am specifically interested in addition (the + operator) of floating-point arrays, numpy.sum(), numpy.cumsum() and numpy.dot(). In all cases I am essentially summing a very large quantity of floating-point numbers, and I am concerned about the accuracy of such calculations.
Does anyone know of any reference to such issues in the numpy/scipy documentation or some other source?
The phrase "stability" refers to an algorithm. If your algorithm is unstable to start with then increasing precision or reducing rounding error of the component steps is not going to gain much.
The more complex numpy routines like "solve" are wrappers for the ATLAS/BLAS/LAPACK routines, so you can refer to the documentation there. For example, "dgesv" solves a system of real-valued linear equations using an LU decomposition with partial pivoting and row interchanges; the underlying Fortran docs for LAPACK can be seen at http://www.netlib.org/lapack/explore-html/. However, http://docs.scipy.org/doc/numpy/user/install.html points out that many different versions of the standard routine implementations are available, and speed optimisation and precision will vary between them.
Your examples don't introduce much rounding. "+" has no unnecessary rounding: the precision depends purely on the rounding implicit in the floating-point datatype, when the smaller number has low-order bits that cannot be represented in the answer. sum and dot depend only on the order of evaluation. cumsum cannot easily be re-ordered, as it outputs an array.
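A minimal illustration of that order dependence (my own example, not from the numpy docs): float addition is not associative, so a reduction's result depends on how the terms are grouped.

import numpy as np

a = np.float32(1e8)
b = np.float32(-1e8)
c = np.float32(1.0)
print((a + b) + c)   # 1.0  -- the big terms cancel first, the 1.0 survives
print(a + (b + c))   # 0.0  -- the 1.0 is absorbed into -1e8 before the cancellation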
For the cumulative rounding during a "cumsum" or "dot" function you do have choices:
On 64-bit Linux, numpy provides access to a high-precision "long double" type, float128, which you could use to reduce loss of precision in intermediate calculations at the cost of performance and memory.
However, on my Win64 install "numpy.longdouble" maps to "numpy.float64", a normal C double, so such code is not cross-platform; check "finfo". (Neither float96 nor float128 with genuinely higher precision exists on Canopy Express Win64.)
import numpy as np

np.log2(np.finfo(np.float64).resolution)
> -49.828921423310433
# actually 53 bits of mantissa internally, ~16 significant decimal figures

np.log2(np.finfo(np.float32).resolution)
> -19.931568
# only ~7 meaningful digits
Since sum() and dot() reduce the array to a single value, maximising precision is easy with built-ins:
x = np.arange(1, 1000000, dtype=np.float32)
y = (1.0 / np.arange(1, 1000000)).astype(np.float32)   # each 1/z computed in double, stored as float32

np.sum(x)                                  # 4.9994036e+11
np.sum(x, dtype=np.float64)                # 499999500000.0
np.sum(y)                                  # 14.357357
np.sum(y, dtype=np.float64)                # 14.392725788474309
np.dot(x, y)                               # 999999.0
np.einsum('i,i', x, y)                     # dot product, also 999999.0
np.einsum('i,i', x, y, dtype=np.float64)   # 999999.00003965141
Note that the single-precision roundings within "dot" cancel in this case, as each almost-integer partial result is rounded to an exact integer.
Optimising rounding depends on the kind of thing you are adding up. Adding many small numbers first can help delay rounding, but it will not avoid problems where big numbers exist but cancel each other out, since the intermediate calculations still cause a loss of precision (a sketch of a few accumulation options follows the example below).
An example showing evaluation-order dependence:

x = np.array([1., 2e-15, 8e-15, -0.7, -0.3], dtype=np.float32)
# x evaluates to
# array([ 1.00000000e+00,  2.00000001e-15,  8.00000003e-15,
#        -6.99999988e-01, -3.00000012e-01], dtype=float32)

np.sum(x)                          # 0.0
np.sum(x, dtype=np.float64)        # 9.9920072216264089e-15
np.sum(np.random.permutation(x))   # gives 9.9999998e-15 / 2e-15 / 0.0 depending on the order
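Here is a sketch of the accumulation options mentioned above, using made-up data; the exact printed values will depend on your numpy version and on the data, so treat it as an illustration of the choices rather than a benchmark.

import math
import numpy as np

rng = np.random.default_rng(0)
# values spanning several orders of magnitude, stored in single precision
vals = (10.0 ** rng.uniform(-8, 0, 100_000)).astype(np.float32)

print(np.sum(vals))                      # plain float32 accumulation
print(np.sum(np.sort(vals)))             # smallest-first ordering, still float32
print(np.sum(vals, dtype=np.float64))    # widen the accumulator instead
print(math.fsum(vals))                   # exactly rounded reference sum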
Related
I'm trying to plot a mathematical expression in Python. I have a sum of functions f_i of the following type:
f_i(x) = -x/r^2 * exp(-r*x) + 2/r^3 * (1 - exp(-r*x)) - x/r^2 * exp(-r*x_i)
where x_i takes values between 1/360 and 50, and r is quite small, around 0.0001. I'm interested in plotting the behavior of these functions (actually in plotting sum_i n_i * f_i(x), for some real n_i) as x converges to zero. I know the exact analytical expression, which I can reproduce. However, the plot for very small x tends to start oscillating and doesn't seem to converge. I'm now wondering if this has to do with floating-point precision in Python, since I'm considering very small x, like 0.0000001.
0.0000001 isn't very small.
Check out these pages:
https://docs.python.org/3/library/stdtypes.html#typesnumeric
https://docs.python.org/3/library/sys.html#sys.float_info
Try casting your intermediate values in the computations to float before doing math.
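A quick way to see the limits that the sys.float_info link describes (a minimal sketch, not part of the original answer):

import sys

# IEEE-754 double precision, which Python floats use
print(sys.float_info.epsilon)   # ~2.22e-16: relative spacing of floats near 1.0
print(sys.float_info.dig)       # 15: decimal digits that survive a round trip
print(sys.float_info.min)       # ~2.2e-308: smallest positive normal float

# x = 1e-7 is far above these limits, so x itself is representable just fine;
# any trouble comes from how the terms of the expression combine.
x = 0.0000001
print(x)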
Is there a difference in accuracy between math.pow, numpy.power, numpy.float_power, pow() and ** in Python, for two floating-point numbers x, y?
I assume x is very close to 1, and y is large.
One way in which you would lose precision in all cases is if you are computing a small number (z say) and then computing
p = pow( 1.0+z, y)
The problem is that doubles have around 16 significant figures, so if z is say 1e-8, in forming 1.0+z you will lose half of those figures. Worse, if z is smaller than 1e-16, 1.0+z will be exactly 1.
You can get round this by using the numpy function log1p. This computes the log of its argument plus one, without actually adding 1 to its argument, so not losing precision.
You can compute p above as
p = exp(log1p(z) * y)
which will eliminate the loss of precision due to calculating 1 + z.
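A minimal sketch comparing the two routes, with numbers chosen here just to show the effect:

import numpy as np

z = 1e-20                      # much smaller than double-precision eps (~2.2e-16)
y = 1e6

naive  = (1.0 + z) ** y        # 1.0 + z is exactly 1.0, so the power is just 1.0
stable = np.exp(np.log1p(z) * y)

print(naive)                   # 1.0
print(stable)                  # ~1.00000000000001, the effect of z is retained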
I am trying to plot a set of extreme floating-point values that require high precision. It seems to me there are precision limits in matplotlib. It cannot go further than the scale of 1e28.
This is my code for displaying a graph.
import matplotlib.pyplot as plt
import numpy as np
x = np.array([1737100, 38380894.5188064386003616016502, 378029000.0], dtype=np.longdouble)
y = np.array([-76188946654889063420743355676.5, -76188946654889063419450832178.0, -76188946654889063450098993033.0], dtype=np.longdouble)
plt.scatter(x, y)
#coefficients = np.polyfit(x, y, 2)
#poly = np.poly1d(coefficients)
#new_x = np.linspace(x[0], x[-1])
#new_y = poly(new_x)
#plt.plot(new_x, new_y)
plt.xlim([x[0], x[-1]])
plt.title('U vs. r')
plt.xlabel('Distance r')
plt.ylabel('Total gravitational potential energy U(r)')
plt.show()
I am expecting the middle point to be located higher than the other two points. It requires very high precision. How can I configure it?
Your current issue is likely not with matplotlib but with np.longdouble. To discover whether this is the case, run np.finfo(np.longdouble). This will be machine dependent, but on my machine, this says I'm using a float128 with the following description
Machine parameters for float128
---------------------------------------------------------------
precision = 18 resolution = 1.0000000000000000715e-18
machep = -63 eps = 1.084202172485504434e-19
negep = -64 epsneg = 5.42101086242752217e-20
minexp = -16382 tiny = 3.3621031431120935063e-4932
maxexp = 16384 max = 1.189731495357231765e+4932
nexp = 15 min = -max
---------------------------------------------------------------
The precision is just an estimate (due to binary vs decimal representation), but 18 digits is the float128 limit, and your specific numbers only start to become interesting after that.
An easy test is to print y[1]-y[0] and see if you get something other than 0.0.
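For example (a minimal sketch; the result depends on what longdouble actually is on your platform):

import numpy as np

y = np.array([-76188946654889063420743355676.5,
              -76188946654889063419450832178.0], dtype=np.longdouble)

# If this prints 0.0, the dtype (or the Python float literals it was built from)
# cannot resolve the differences between your y values.
print(y[1] - y[0])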
An easy solution is to use Python ints, since Python has arbitrary-precision integers, so you'd capture most of the difference (or use ints of 10*y to keep the .5). So something like this:
x = np.array([1737100, 38380894.5188064386003616016502, 378029000.0], dtype=np.longdouble)
y = [-76188946654889063420743355676, -76188946654889063419450832178, -76188946654889063450098993033]
plt.scatter(x, [z-y[0] for z in y])
Another solution is to represent the numbers from the start so that they require a more accessible precision (ie, with most of the offset removed). And another is to use a high precision float library. It depends on which way you want to go.
It's also worth noting that, at least on my system, which I think is typical, the default numpy float is float64. For float64 the floating-point mantissa is 52 bits, whereas for float128 it's only 63 bits; in decimal terms, that's going from about 15 digits to 18. So there's not a great precision increase in going from float64 to np.float128. (Here's a discussion of why np.longdouble (or np.float128) sounds like it's going to add a lot of precision, but doesn't.)
(Finally, because this may cause confusion for some: if np.longdouble or np.float128 were useful for this problem, it's worth noting that the line in the question that sets the initial array wouldn't give the intended precision of np.longdouble. That is, y = np.array([-76188946654889063420743355676.5], dtype=np.longdouble) first creates a list of Python floats and then builds the numpy array from that, so the precision is already lost in the intermediate Python floats. If longdouble were the solution, a different approach to initializing the array would be needed.)
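If longdouble does offer extra precision on your platform, one way to keep it is to let numpy parse the decimal literals itself rather than routing them through Python floats. A sketch of that idea (not from the original answer):

import numpy as np

# Construct each value directly from a string, so no 64-bit Python float
# intermediate is involved.
y = np.array([np.longdouble("-76188946654889063420743355676.5"),
              np.longdouble("-76188946654889063419450832178.0"),
              np.longdouble("-76188946654889063450098993033.0")])

print(np.finfo(y.dtype).precision)   # how many decimal digits this dtype holds
print(y[1] - y[0], y[2] - y[0])      # still limited to ~18-19 digits at best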
I'm doing a normalization operation and, to my surprise, when trying to revert the operation I get a mismatch of 100% for the default 6-decimal precision of assert_array_almost_equal. Why is this occurring? Can it be due to the precision of my maximum value? If so, how can I get more precision in numpy.ndarray.max()?
import numpy

_max = numpy.float128(67.1036)  # output of numpy.ndarray.max() on a float32 array

def divide_and_mult(x, y):
    return numpy.divide(numpy.float128(x), y) * y

for i in range(100):
    try:
        numpy.testing.assert_array_equal(divide_and_mult(i, _max), numpy.float128(i))
    except AssertionError as e:
        print(e)
You can't get more precision with numpy arrays than float128; on most systems the best available is even lower, float64.
Normally you just don't care about a bit loss in precision and use np.testing.assert_almost_equal or similar functions that let you test for a specific absolute and/or relative difference.
In case you want to do it with much higher precision you need to use a type that has infinite or at least user-defined precision: decimal.Decimal or fractions.Fraction or switch to a symbolic math library like sympy.
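A minimal sketch of the exact-arithmetic route using fractions.Fraction (the values here mirror the question, but the snippet is mine):

from fractions import Fraction

_max = Fraction("67.1036")         # the decimal literal is represented exactly as a rational

for i in range(100):
    x = Fraction(i)
    assert (x / _max) * _max == x  # exact: rational arithmetic introduces no rounding at all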
I was testing some code which, among other things, runs a linear regression of the form y = m * x + b on some data. To keep things simple, I set my x and y data equal to each other, expecting the model to return one for the slope and zero for the intercept. However, that's not what I saw. Here's a super boiled-down example, taken mostly from the numpy docs:
>>> y = np.arange(5)
>>> x = np.arange(5)
>>> A = np.vstack([x, np.ones(5)]).T
>>> np.linalg.lstsq(A, y)
(array([ 1.00000000e+00, -8.51331872e-16]), array([ 7.50403936e-31]), 2, array([ 5.78859314, 1.22155205]))
>>> # ^slope ^intercept ^residuals ^rank ^singular values
Numpy finds the exact slope of the true line of best fit (one), but reports an intercept that, while very very small, is not zero. Additionally, even though the data can be perfectly modeled by a linear equation y = 1 * x + 0, because this exact equation is not found, numpy reports a tiny but non-zero residual value.
As a sanity check, I tried this out in R (my "native" language), and observed similar results:
> x <- c(0 : 4)
> y <- c(0 : 4)
> lm(y ~ x)
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
-3.972e-16 1.000e+00
My question is, why and under what circumstances does this happen? Is it an artifact of looking for a model with a perfect fit, or is there always a tiny bit of noise added to regression output that we usually just don't see? In this case, the answer is almost certainly close enough to zero, so I'm mainly driven by academic curiosity. However, I also wonder if there are cases where this effect could be magnified to be nontrivial relative to the data.
I've probably revealed this by now, but I have basically no understanding of lower-level programming languages, and while I once had a cursory understanding of how to do this sort of linear algebra "by hand", it has long ago faded from my mind.
It looks like numerical error; the y-intercept is extremely small.
Python, and numpy with it, uses double-precision floating-point numbers by default. These numbers are stored with a 52-bit coefficient (see this for a floating-point explanation, and this for a scientific-notation explanation of "base").
In your case, you found a y-intercept of ~4e-16. As it turns out, a 52-bit coefficient gives roughly 2e-16 relative accuracy. Basically, in the regression you subtracted a number on the order of 1 from something closely resembling itself, and hit the numerical precision of double-precision floating point.
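To put a number on that (a small sketch, not part of the original answer):

import numpy as np

print(np.finfo(np.float64).eps)   # ~2.22e-16, relative precision of a double

# Even a trivially "exact" computation leaves residue of this size:
print(0.1 + 0.2 - 0.3)            # ~5.55e-17 rather than 0.0

# The ~ -4e-16 intercept is the same kind of leftover: a few ulps that
# survive when nearly equal quantities are subtracted inside the solver.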