poly1d gives erroneous coefficients when they are very large integers - python

I am working with Python 3.5.2 on Ubuntu 16.04.2 LTS, with NumPy 1.12.1. When I use the poly1d function to get the coefficients, there is a mistake in the computation:
>>> from numpy import poly1d
>>> from math import fabs
>>> pol = poly1d([2357888,459987,78123455],True)
>>> [int(x) for x in pol.coeffs]
[1, -80941330, 221226728585581, -84732529566356586496]
As you can see, the last element of this list is not correct. When I build the polynomial using Wolfram Alpha, I get:
x^3 - 80941330 x^2 + 221226728585581 x - 84732529566356580480
The last coefficient is different with poly1d (the first one ends in ...496 and the other ends in ...480). I have to assume that the correct one is the latter computation (made by Wolfram Alpha).
Is this a bug, or is there something I am not taking into account? I've tried roots with small absolute values, and in that case the computation is correct. But when I use "big" roots, the difference is noticeable.

As Warren Weckesser said, this is a precision issue. But it can be worked around by declaring the array of roots to be of type object. In this way you can take advantage of Python's big integers, or of higher precision provided by mpmath objects. NumPy is considerate enough not to coerce them to double precision. Example:
import numpy as np
roots = np.array([2357888, 459987, 78123455], dtype=object)
pol = np.poly1d(roots, True)
print(pol.coeffs)
Output: [1 -80941330 221226728585581 -84732529566356580480]

The coefficients are stored as 64 bit floating point values. These do not have enough precision to represent the value -84732529566356580480 exactly.
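To see the limitation directly, here is a minimal sketch that compares against Python's exact integers (the constant term is taken from the Wolfram Alpha expansion above):
import numpy as np
exact = 84732529566356580480     # absolute value of the exact constant term
as_float = np.float64(exact)     # what actually fits in a 64-bit float
print(int(as_float) == exact)    # False: this integer is not exactly representable
print(np.spacing(as_float))      # gap between adjacent float64 values at this magnitude, about 1.6e4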

Related

How to optimise the numerical evaluation of a SymPy integral?

I'm somewhat of a newbie to SymPy and was hoping someone could point out ways to optimise my code.
I need to numerically evaluate a somewhat involved expression with very high decimal places (150–300), and it is taking 30 seconds or longer per parameter set – which is very long given the parameter space to be calculated.
I have used lambdify with the mpmath backend and meijerg=True in the integral handling and it brought down run-times significantly. Are there any other methods that could be used? Ideally it would be great to push evaluation times below 1 second. My code is:
import mpmath
from mpmath import mpf, mp
mp.dps = 150 # ideally would like to have this set to 300
import numpy as np
from sympy import besselj, symbols, hankel2, legendre, sin, cos, tan, summation, I
from sympy import lambdify, expand, Integral
import time
x, alpha, k, m,n, r1, R, theta = symbols('x alpha k m n r1 R theta')
r1 = (R*cos(alpha))/cos(theta) #
Imn_part1 = (n*hankel2(n-1,k*r1)-(n+1)*hankel2(n+1,k*r1))*legendre(n, cos(theta))*cos(theta)
Imn_part2 = n*(n+1)*hankel2(n, k*r1)*(legendre(n-1, cos(theta)-legendre(n+1, cos(theta))))/k*r1
Imn_parts = expand(Imn_part1+Imn_part2)
Imn_expr = expand(Imn_parts*legendre(m,cos(theta))*(r1**2/R**2)*tan(theta))
Imn = Integral(Imn_expr, (theta, 0, alpha)).doit(meijerg=True)
# the lambdified expression
Imn_lambdify = lambdify([m,n,k,R,alpha], Imn,'mpmath')
When giving numerical inputs to the function, it takes a long time (30–40 s).
substitute_dict = {'alpha':mpf(np.radians(10)), 'k':5,'R':mpf(0.1), 'm':20,'n':10}
print('starting calculation...')
start = time.time()
output = Imn_lambdify(substitute_dict['m'],
                      substitute_dict['n'],
                      substitute_dict['k'],
                      substitute_dict['R'],
                      substitute_dict['alpha'])
print(time.time()-start)
OS/package versions used:
Linux Mint 19.2
Python 3.8.5
SymPy 1.7.1
MPMath 1.2.1
Setting meijerg=True has just caused SymPy to not try as hard in evaluating the integral. It still can't evaluate it, but it has split it into 5 sub-integrals, which you can see if you print Imn. You might as well just leave it as one integral (leave off the doit()):
Imn = Integral(Imn_expr, (theta, 0, alpha))
For me, the split integral evaluates a little faster, but this is also about the same speed:
Imn = Integral(simplify(Imn_expr), (theta, 0, alpha))
Ultimately, the thing that makes this slow is the number of digits you are using. If you don't actually need that many digits, you shouldn't use them. Note that mpmath automatically increases the precision internally to avoid cancellation, so it is unnecessary to do so yourself. I get the same value (just with fewer digits) with the default dps of 15 as with 150.
You can try substituting your values directly into your expression, if they do not change, and seeing if SymPy can simplify Imn_expr further with them.
As an aside, you are using np.radians(10), which is a machine float, since that is what NumPy uses. This completely defeats the purpose of computing the final answer to 150 digits, since this input parameter is only accurate to about 15. Consider using mpmath.pi/18 instead to get a value that is correct to the number of digits you specified.
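For example, here is a small sketch of the difference (assuming mp.dps has been set as in the question):
import numpy as np
from mpmath import mp, mpf
mp.dps = 150
alpha_float = mpf(np.radians(10))   # only the first ~15-16 digits of this value are meaningful
alpha_mp = mp.pi / 18               # 10 degrees, accurate to the full working precision
print(alpha_float)
print(alpha_mp)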

Is there any difference of precision in python methods to calculate euclidean distance?

I am calculating the euclidean distance array by array in a numpy array. I was using np.linalg.norm(v1-v2) to do this. Since I was planning to use other distance measures, I changed it to scipy.spatial.distance.euclidean(v1,v2) to keep a pattern in my code.
I noticed the last digits vary a bit in each case. I thought they wouldn't, since the scipy euclidean version uses functions from the numpy core like dot and sqrt. I tried other ways in Python to calculate the euclidean distance to compare, and for a specific example I got these results:
>>> math.sqrt(sum([(a-b)**2 for a,b in zip(v1,v2)]))
1.0065822095995844
>>> numpy.linalg.norm(v1-v2)
1.0065822095995838
>>> sklearn.metrics.pairwise.euclidean_distances(v1.reshape(1,-1),v2.reshape(1,-1))[0,0]
1.0065822095995838
>>> scipy.spatial.distance.euclidean(v1,v2)
1.006582209599584
Just for the record, in my examples, v1 and v2 are normalized histograms.
Why is there this difference in precision? Should this happen?
Floating point numbers are stored in the computer as a fraction with a 53-bit numerator. So you cannot get a floating point answer with more than about 15-17 significant digits of precision.
https://docs.python.org/3/tutorial/floatingpoint.html
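On top of that, the different implementations can perform the additions and squarings in different orders, and floating point addition is not associative, so the last digit or two can differ. A minimal illustration:
>>> (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)
False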

numpy.sum transition to kahan but with masked arrays for increased precision

I have a multi-array stack of data that is masked to exclude 'bad' or problematic values; the stacking is in the 3rd dimension. The current code uses np.sum, but the level of precision (for both large and small numbers) has negatively impacted the results. I've attempted to implement the kahan_sum referenced here, but it doesn't take the masked arrays into account, and the results are not similar (due to the masking). It is my hope that the extra precision retained by using a Kahan summation and accumulator will let downstream operations carry less error.
Source/research:
https://github.com/numpy/numpy/issues/8786
Kahan summation
Python floating point precision sum (I've jacked up the precision as far as possible but it doesn't help)
import numpy as np
import numpy.ma as ma
def kahan_sum(a, axis=None):
    s = np.zeros(a.shape[:axis] + a.shape[axis+1:])
    c = np.zeros(s.shape)
    for i in range(a.shape[axis]):
        # compensated summation step, from http://stackoverflow.com/a/42817610/353337
        y = a[(slice(None),) * axis + (i,)] - c
        t = s + y
        c = (t - s) - y
        s = t.copy()
    return s
data = np.random.rand(5, 5, 5)
dd = np.ma.masked_array(data=data, mask=np.random.rand(5, 5, 5) < 0.2)
I want to sum along the 3rd dimension (axis=2), as that's essentially my 'stack' of photos.
The masks are not coming out as I expected. It's possible I'm just overtired...
np.sum(dd, axis=2)
kahan_sum(dd, axis=2)
np.sum provides a fully populated array of data and excludes the 'masked' values.
kahan_sum essentially OR'd all of the masks together, and I've been unable to come up with a pattern for it.
Printing the mask makes it pretty evident that that's where the problem is; I'm just not figuring out how to fix it, or why it's operating the way it is.
Thank you.
If you really need more precision, consider using math.fsum, which is accurate to full floating point resolution. If A is your 3D masked array, something like:
import math
i, j, k = A.shape
np.frompyfunc(lambda i, j: math.fsum(A[i, j].compressed().tolist()), 2, 1)(*np.ogrid[:i, :j])
But before that I'd triple-check that np.sum really isn't good enough. As far as I know it uses pairwise summation along contiguous axes, which in practice tends to be pretty good.
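If you do want to keep the compensated sum from the question, one possible workaround (just a sketch, assuming that replacing masked entries with zero is acceptable, which is effectively what the masked np.sum does when it skips them) is to fill the mask before summing:
filled = dd.filled(0.0)           # replace masked entries with 0 and drop the mask
ks = kahan_sum(filled, axis=2)    # compensated sum over the photo stack, returns a plain ndarray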

Numpy and R give non-zero intercept in linear regression when x = y

I was testing some code which, among other things, runs a linear regression of the form y = m * x + b on some data. To keep things simple, I set my x and y data equal to each other, expecting the model to return one for the slope and zero for the intercept. However, that's not what I saw. Here's a super boiled-down example, taken mostly from the numpy docs:
>>> y = np.arange(5)
>>> x = np.arange(5)
>>> A = np.vstack([x, np.ones(5)]).T
>>> np.linalg.lstsq(A, y)
(array([ 1.00000000e+00, -8.51331872e-16]), array([ 7.50403936e-31]), 2, array([ 5.78859314, 1.22155205]))
>>> # ^slope ^intercept ^residuals ^rank ^singular values
Numpy finds the exact slope of the true line of best fit (one), but reports an intercept that, while very very small, is not zero. Additionally, even though the data can be perfectly modeled by a linear equation y = 1 * x + 0, because this exact equation is not found, numpy reports a tiny but non-zero residual value.
As a sanity check, I tried this out in R (my "native" language), and observed similar results:
> x <- c(0 : 4)
> y <- c(0 : 4)
> lm(y ~ x)
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
-3.972e-16 1.000e+00
My question is, why and under what circumstances does this happen? Is it an artifact of looking for a model with a perfect fit, or is there always a tiny bit of noise added to regression output that we usually just don't see? In this case, the answer is almost certainly close enough to zero, so I'm mainly driven by academic curiosity. However, I also wonder if there are cases where this effect could be magnified to be nontrivial relative to the data.
I've probably revealed this by now, but I have basically no understanding of lower-level programming languages, and while I once had a cursory understanding of how to do this sort of linear algebra "by hand", it has long ago faded from my mind.
It looks like numerical error; the y-intercept is extremely small.
Python, and NumPy with it, uses double precision floating point numbers by default. These numbers are formatted with a 52-bit coefficient (see this for a floating point explanation, and this for a scientific-notation explanation of "base").
In your case, you found a y-intercept of ~4e-16. As it turns out, a 52-bit coefficient gives roughly 2e-16 relative accuracy. Basically, in the regression you subtracted a number on the order of 1 from something closely resembling itself, and hit the numerical precision of double precision floating point.
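A quick way to see the scale involved (a small sketch, not part of the original answer):
import numpy as np
eps = np.finfo(np.float64).eps   # relative spacing of float64 values near 1.0
print(eps)                       # about 2.2e-16
print((1.0 + eps / 2) - 1.0)     # 0.0: increments below ~eps are lost for numbers of order 1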

Can lambdify return an array with dytpe np.float128?

I am solving a large non-linear system of equations and I need a high degree of numerical precision. I am currently using sympy.lambdify to convert symbolic expressions for the system of equations and its Jacobian into vectorized functions that take ndarrays as inputs and return an ndarray as outputs.
By default, lambdify returns an array with dtype of numpy.float64. Is it possible to have it return an array with dtype numpy.float128? Perhaps this requires the inputs to have dtype of numpy.float128?
If you need a lot of precision, you can try using SymPy floats, or mpmath directly (which is part of SymPy), which provides arbitrary precision. For example, sympy.Float('2.0', 100) creates a float of 2.0 with 100 digits of precision. You can use something like sympy.sin(2).evalf(100) to get 100 digits of sin(2) for instance. This will be a lot slower than numpy because it is arbitrary precision, meaning it doesn't use machine floats, and it is implemented in pure Python (whereas numpy is written in Fortran and C).
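For instance, a small comparison sketch:
import sympy
import numpy as np
print(sympy.sin(2).evalf(50))    # sin(2) to 50 significant digits via SymPy/mpmath
print(np.sin(np.float64(2)))     # float64: only ~15-17 of its digits are meaningful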
The output just reflects the input:
from numpy import float128
from sympy.abc import x
from sympy.utilities import lambdify
f = lambdify(x, x ** 2)
result = f(float128(2))
result
#>>> 4.0
type(result)
#>>> <class 'numpy.float128'>
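One caveat worth noting (an aside, not from the answers above): on most x86 platforms numpy.float128 is the C long double, i.e. 80-bit extended precision padded to 128 bits, not a true IEEE quad, so it only buys a few extra digits:
import numpy as np
print(np.finfo(np.float128).precision)   # approximate decimal digits, typically 18 on x86 Linux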
