How precise are eigenvalues in Python? And can precision be improved?

When calculating eigenvalues and eigenvectors of a matrix, the eigenvector matrix times its transpose should give the identity matrix (E @ E.T = I). In practice this is rarely exact, as some (small) errors always occur.
So question 1: how precise are eigenvalues / eigenvectors calculated?
And question 2: is there any way to improve precision?

I assume you are using a standard library for this, such as:
https://numpy.org/doc/stable/reference/generated/numpy.linalg.eigh.html
or
https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.eig.html
which I would expect to employ high-speed floating point operations under the hood.
So the (high) precision of 64-bit IEEE floats is a bound.
Far to the right of the decimal point, you may see variation in results from run to run, due to the way FPUs cache and process (higher-precision) intermediate results across context switches. Google returned a number of quick hits on this, such as:
https://indico.cern.ch/event/166141/sessions/125686/attachments/201416/282784/Corden_FP_control.pdf
As for improving precision, your question is well discussed here, where a slower but higher resolution library is mentioned:
Higher precision eigenvalues with numpy
You might also consider using Mathematica.
https://math.stackexchange.com/questions/87871/how-to-obtain-eigenvalues-with-maximum-precision-in-mathematica
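As a quick way to see how large the error actually is, here is a minimal sketch (my own, not from the linked discussion) that builds a random symmetric matrix, diagonalizes it with numpy.linalg.eigh, and measures how far E.T @ E is from the identity; with 64-bit floats the residual is typically around 1e-15:

import numpy as np

# Random symmetric test matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
A = (A + A.T) / 2

# eigh returns eigenvalues w and an (ideally orthogonal) eigenvector matrix E.
w, E = np.linalg.eigh(A)

# Largest deviation of E.T @ E from the identity -- a rough measure of the error.
print(np.abs(E.T @ E - np.eye(A.shape[0])).max())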

Related

Internals of numpy.sum

Why are the two sums returning different values? In fact, if 0.1 is summed up 10 times in IEEE arithmetic, the result should not be exactly 1. It could be that np.sum() groups the sum differently, so that by chance the result is exactly 1. But is there any documentation about this (besides studying the source code)? Surely numpy isn't rounding off the result, and AFAIK it is also using IEEE double floating point.
import numpy as np

print(1 - sum(np.ones(10) * 0.1))
print(1 - np.sum(np.ones(10) * 0.1))
-----
1.1102230246251565e-16
0.0
Sorry, I found the solution in the docs. The algorithm of numpy.sum() is indeed more clever:
For floating point numbers the numerical precision of sum (and np.add.reduce) is in general limited by directly adding each number individually to the result causing rounding errors in every step. However, often numpy will use a numerically better approach (partial pairwise summation) leading to improved precision in many use-cases. This improved precision is always provided when no axis is given. When axis is given, it will depend on which axis is summed. Technically, to provide the best speed possible, the improved precision is only used when the summation is along the fast axis in memory. Note that the exact precision may vary depending on other parameters. In contrast to NumPy, Python’s math.fsum function uses a slower but more precise approach to summation. Especially when summing a large number of lower precision floating point numbers, such as float32, numerical errors can become significant. In such cases it can be advisable to use dtype="float64" to use a higher precision for the output.
See https://numpy.org/doc/stable/reference/generated/numpy.sum.html
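As an illustration of the different summation strategies mentioned in the quoted documentation, here is a minimal sketch (not from the original post) comparing Python's built-in left-to-right sum, numpy's pairwise np.sum, and math.fsum:

import math
import numpy as np

x = np.ones(10) * 0.1

print(1 - sum(x))        # naive left-to-right summation: ~1.1e-16 off
print(1 - np.sum(x))     # pairwise summation: 0.0 in this case
print(1 - math.fsum(x))  # exact sum of the rounded 0.1 values, rounded once: 0.0 here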

Can Python floating-point error affect statistical test over small numbers?

Can floating-point errors affect my calculations in the following scenario, where the values are small?
My purpose is to compare two sets of values and determine if their means are statistically different.
I am performing the usual large-sample unpaired tests on very small values, with data like this:
first group (obtained from 100 samples):
first item's mean = 2.7977620220553945e-24
std dev = 3.2257148207429583e-15
second group (obtained from 100 samples):
first item's mean = 3.1086244689504383e-15
std dev = 3.92336102789548e-15
The goal is to find out whether or not the two means are statistically significantly different.
I plan to follow the usual steps of finding the standard error of the difference and the z-score and so on. I will be using Python (or Java).
My question is not about the statistical test but about the potential problem with the smallness of the numbers (floating-point errors).
Should I (must I) approximate each of the above two means to zero (and thus conclude that there is no difference)?
That is, given the smallness of the means, is it computationally meaningless to go about performing the statistical test?
64-bit floating point numbers allot 52 bits for the stored significand (53 bits of precision with the implicit leading bit). This is approximately 15-16 decimal places (log10(2^52) ~ 15.6). In scientific notation, this is the difference between, say, 1e-9 and 1e-24 (because 10^-9 / 10^-24 == 10^15, i.e. they differ by 15 decimal places).
What does this all mean? Well, it means that if you add 10^-24 to 10^-9, it is just on the border of being too small to show up in the larger number (10^-9).
Observe:
>>> a = 1e-9
>>> a
1e-09
>>> a + 1e-23
1.00000000000001e-09
>>> a + 1e-24
1.000000000000001e-09
>>> a + 1e-25
1e-09
Since the z-score calculation basically involves adding and subtracting a few standard deviations from the mean, it will definitely have problems if the difference in the exponents is 16, and it's probably not a good situation if the difference is 14 or 15. The difference in your exponents is 9, which still leaves you accuracy of about one millionth of a standard deviation in the final sum. Since statistical significance is about errors on the order of maybe a tenth of a standard deviation, you should be fine.
Single-precision (32-bit) floats store 23 bits of significand, which is about 6.9 decimal places.
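To make this concrete, here is a minimal sketch (my own) of the large-sample z computation on the summary statistics quoted in the question (n = 100 per group); plain float64 arithmetic handles it without trouble:

import math

n1 = n2 = 100
mean1, sd1 = 2.7977620220553945e-24, 3.2257148207429583e-15
mean2, sd2 = 3.1086244689504383e-15, 3.92336102789548e-15

# Standard error of the difference of the two sample means.
se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
z = (mean1 - mean2) / se
print(z)   # roughly -6.1; nowhere near the limits of double precision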
In principle, if you work with numbers of the same order of magnitude, the float representation retains the same relative precision as for numbers close to 1.
However, it is much more robust to perform the computation on whitened data.
If whitening is not an option for your use case, you can use arbitrary-precision arithmetic for non-integer data (Python offers built-in arbitrary-precision integers), for example via decimal, fractions and/or statistics, and do all the computations with that.
EDIT
However, just looking at your numbers, the standard deviation ranges (the intervals [µ-σ, µ+σ]) largely overlap, so you have no evidence that the two means are statistically significantly different. Of course this is meaningful only for (at least asymptotically) normally distributed populations / samples.
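For the arbitrary-precision route mentioned above, here is a minimal sketch (with hypothetical data) using fractions.Fraction, which represents each float exactly so that only the final conversion back to float rounds:

from fractions import Fraction

# Hypothetical very small measurements.
data = [2.9e-24, 3.1e-24, 2.5e-24, 2.7e-24]

# Each float converts to an exact Fraction; the sum and division are exact.
exact_mean = sum(Fraction(x) for x in data) / len(data)
print(float(exact_mean))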

Polynomial Kernel not PSD?

I am currently going through different Machine Learning methods on a low-level basis. Since the polynomial kernel
K(x,z)=(1+x^T*z)^d
is one of the more commonly used kernels, I was under the assumption that this kernel must yield a positive (semi-)definite matrix for a fixed set of data {x1,...,xn}.
However, in my implementation, this seems to not be the case, as the following example shows:
import numpy as np

np.random.seed(93)
x = np.random.uniform(0, 1, 5)

# Assuming d = 1
kernel = 1 + x[:, None] @ x[None, :]
print(np.linalg.eigvals(kernel))
The output would be
[ 6.9463439e+00 1.6070016e-01 9.5388039e-08 -1.5821310e-08 -3.7724433e-08]
I'm also getting negative eigenvalues for d>2.
Am I totally misunderstanding something here? Or is the polynomial kernel simply not PSD?
EDIT: In a previous version, I used x = np.float32(np.random.uniform(0, 1, 5)) to reduce computing time, which led to a greater number of negative eigenvalues (I believe due to numerical instabilities, as @user2357112 mentioned). I guess this is a good example that precision does matter? Since negative eigenvalues still occur even with float64 precision, the follow-up question would then be how to avoid such numerical instabilities.
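As a rough illustration of how much the working precision matters here (my own sketch, not part of the original question), the same kernel can be built in float32 and float64 and diagonalized with np.linalg.eigvalsh, which exploits the symmetry of the Gram matrix; the mathematically zero eigenvalues come back as tiny values of either sign, typically larger in magnitude for float32:

import numpy as np

for dtype in (np.float32, np.float64):
    np.random.seed(93)                       # same draw for both dtypes
    x = np.random.uniform(0, 1, 5).astype(dtype)
    kernel = 1 + x[:, None] @ x[None, :]
    # eigvalsh assumes a symmetric matrix and returns real eigenvalues;
    # the negative ones are within roundoff of zero, not truly negative.
    print(dtype.__name__, np.linalg.eigvalsh(kernel))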

What is the difference between expect and mean in scipy.stats?

According to the definition of the expected value, it is the same thing as the mean.
But in scipy.stats.binom they give different values, like this:
import scipy.stats as st
st.binom.mean(10, 0.3) ----> 3.0
st.binom.expect(args=(10, 0.3)) ---->3.0000000000000013
That confuses me! Why?
In this example the difference comes from floating point computation, as pointed out. In general there might also be truncation in expect, depending on the integration tolerance.
For many distributions, the mean and some other moments have an analytical solution, in which case we usually get a precise result.
expect is a general function that computes the expectation for arbitrary (*) functions through summation in the discrete case and numerical integration in the continuous case. This accumulates floating point noise but also depends on the convergence criteria for the numerical integration and will, in general, be less precise than an analytically computed moment.
(*) There might be numerical problems in the integration for some "not nice" functions, which can happen for example with default settings in scipy.integrate.quad
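A minimal sketch (mine) of what that means for the binomial example; the hand-rolled sum over the support shows the same kind of accumulated rounding that expect is subject to:

import scipy.stats as st

n, p = 10, 0.3

print(st.binom.mean(n, p))            # analytical mean: n * p = 3.0
print(st.binom.expect(args=(n, p)))   # expectation by summation over the support

# Roughly the same summation done by hand: sum of k * pmf(k) for k = 0..n.
print(sum(k * st.binom.pmf(k, n, p) for k in range(n + 1)))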
This could simply be a result of numerical imprecision when calculating the average. Mathematically, they should be identical, but there are different ways of calculating the mean which have different properties when implemented using finite-precision arithmetic. For example, adding up the numbers and dividing by the count is not particularly reliable, especially when the numbers fluctuate by a small amount around the true (theoretical) mean, or have opposite signs. Recursive estimates may have much better properties.

Why is numpy's sine function so inaccurate at some points?

I just checked numpy's sine function. Apparently, it produces highly inaccurate results around pi.
In [26]: import numpy as np
In [27]: np.sin(np.pi)
Out[27]: 1.2246467991473532e-16
The expected result is 0. Why is numpy so inaccurate there?
To some extent, I feel uncertain whether it is right to regard the calculated result as inaccurate: its absolute error is within one machine epsilon (for binary64), whereas the relative error is +inf, which is why I feel somewhat confused. Any idea?
[Edit] I fully understand that floating-point calculation can be inaccurate. But most floating-point libraries manage to deliver results within a small range of error. Here, the relative error is +inf, which seems unacceptable. Just imagine that we want to calculate
1/(1e-16 + sin(pi))
The result would be disastrously wrong if we use numpy's implementation.
The main problem here is that np.pi is not exactly π, it's a finite binary floating point number that is close to the true irrational real number π but still off by ~1e-16. np.sin(np.pi) is actually returning a value closer to the true infinite-precision result for sin(np.pi) (i.e. the ideal mathematical sin() function being given the approximated np.pi value) than 0 would be.
The value is dependent upon the algorithm used to compute it. A typical implementation will use some quickly-converging infinite series, carried out until it converges within one machine epsilon. Many modern chips (starting with the Intel 960, I think) had such functions in the instruction set.
To get 0 returned for this, we would need either a notably more accurate algorithm, one that ran extra-precision arithmetic to guarantee the closest-match result, or something that recognizes special cases: detect a multiple of PI and return the exact value.
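A minimal sketch of the point about np.pi (assuming the third-party mpmath package is available for the higher-precision reference values): the gap between the true π and the double np.pi is about 1.22e-16, and the true sine of that double is essentially that gap, which is what np.sin reports:

import numpy as np
import mpmath

mpmath.mp.dps = 40                        # 40 decimal digits of working precision

# Gap between the real number pi and its closest double, np.pi.
print(mpmath.pi - mpmath.mpf(np.pi))

# True sine of the double np.pi, computed in high precision.
print(mpmath.sin(mpmath.mpf(np.pi)))

# What numpy returns -- it agrees with the high-precision value, not with 0.
print(np.sin(np.pi))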
