Can floating-point errors affect my calculations in the following scenario, where the values are small?
My purpose is to compare two sets of values and determine if their means are statistically different.
I perform large-sample unpaired tests in the usual way, and the data involve very small values like these:
first group (obtained from 100 samples):
first item's mean = 2.7977620220553945e-24
std dev = 3.2257148207429583e-15
second group (obtained from 100 samples):
first item's mean = 3.1086244689504383e-15
std dev = 3.92336102789548e-15
The goal is to find out whether or not the two means are statistically significantly different.
I plan to follow the usual steps of finding the standard error of the difference and the z-score and so on. I will be using Python (or Java).
My question is not about the statistical test but about the potential problem with the smallness of the numbers (floating-point errors).
Should I (must I) approximate each of the above two means to zero (and thus conclude that there is no difference)?
That is, given the smallness of the means, is it computationally meaningless to go about performing the statistical test?
64-bit floating point numbers allot 52 bits for the significand. This is approximately 15-16 decimal places (log10(2^52) ~ 15.6). In scientific notation, this is the difference between, say, 1e-9 and 1e-24 (because 10^-9 / 10^-24 == 10^15, i.e. they differ by 15 decimal places).
What does this all mean? Well, it means that if you add 10^-24 to 10^-9, it is just on the border of being too small to show up in the larger number (10^-9).
Observe:
>>> a = 1e-9
>>> a
1e-09
>>> a + 1e-23
1.00000000000001e-09
>>> a + 1e-24
1.000000000000001e-09
>>> a + 1e-25
1e-09
Since the z-score statistics stuff involves basically adding and subtracting a few standard deviations from the mean, or so, it will definitely have problems if the difference in the exponent is 16. It's probably not a good situation if the difference is like 14 or 15. The difference in your exponents is 9, which will still allow you 1/10^6 standard deviations of accuracy in the final sum. Since we're worried about errors on the order of, maybe, a tenth of a standard deviation or so when we talk about statistical significance, you should be fine.
For 32-bit floats (single precision), the significand gets 23 bits, which is about 6.9 decimal places.
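To make the point concrete, here is a minimal sketch of that computation done entirely in ordinary 64-bit doubles, using the sample sizes and summary statistics quoted in the question (the variable names are my own):

import math

n1 = n2 = 100
mean1, sd1 = 2.7977620220553945e-24, 3.2257148207429583e-15
mean2, sd2 = 3.1086244689504383e-15, 3.92336102789548e-15

# standard error of the difference between the two sample means
se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)

# z-score of the observed difference; these magnitudes are nowhere near
# the limits of double precision, so the arithmetic itself is fine
z = (mean2 - mean1) / se
print(se, z)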
In principle, if you work with numbers of the same order of magnitude, the float representation is sufficient to retain the same precision as for numbers close to 1.
However, it is much more robust to perform the computation on whitened data.
If whitening is not an option for your use case, you can use an arbitrary precision representation for non-integer data (Python offers built-in arbitrary precision integers), such as decimal or fractions (which the statistics module also accepts), and do all the computations with that.
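As a rough sketch of both suggestions (rescaling by a common factor rather than full whitening, plus decimal for the arithmetic; the factor 1e15 and the 50-digit precision are just illustrative choices):

from decimal import Decimal, getcontext

getcontext().prec = 50                   # work with 50 significant digits

# rescaling both groups by a common factor leaves the test statistic unchanged
scale = Decimal("1e15")
mean1 = Decimal("2.7977620220553945e-24") * scale
sd1   = Decimal("3.2257148207429583e-15") * scale
mean2 = Decimal("3.1086244689504383e-15") * scale
sd2   = Decimal("3.92336102789548e-15") * scale

n = Decimal(100)                         # both groups have 100 samples
se = ((sd1 ** 2 + sd2 ** 2) / n).sqrt()  # standard error of the difference
z = (mean2 - mean1) / se
print(z)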
EDIT
However, just looking at your numbers, the standard deviation ranges (the intervals [µ-σ, µ+σ]) largely overlap, so you have no evidence that the two means are statistically significantly different. Of course, this is meaningful only for (at least asymptotically) normally distributed populations / samples.
Related
Why are the two sums returning different values? In fact, if 0.1 is summed up 10 times in IEEE arithmetic, the result should not be exactly 1. It could be that np.sum() groups the sum differently, so that by chance the result is exactly 1. But is there any documentation about this (besides studying the source code)? Surely numpy isn't rounding off the result, and AFAIK it is also using IEEE double precision floating point.
import numpy as np
print(1-sum(np.ones(10)*0.1))
print(1-np.sum(np.ones(10)*0.1))
-----
1.1102230246251565e-16
0.0
Sorry, I found the solution in the docs. The algorithm of numpy.sum() is indeed more clever:
For floating point numbers the numerical precision of sum (and np.add.reduce) is in general limited by directly adding each number individually to the result causing rounding errors in every step. However, often numpy will use a numerically better approach (partial pairwise summation) leading to improved precision in many use-cases. This improved precision is always provided when no axis is given. When axis is given, it will depend on which axis is summed. Technically, to provide the best speed possible, the improved precision is only used when the summation is along the fast axis in memory. Note that the exact precision may vary depending on other parameters. In contrast to NumPy, Python’s math.fsum function uses a slower but more precise approach to summation. Especially when summing a large number of lower precision floating point numbers, such as float32, numerical errors can become significant. In such cases it can be advisable to use dtype="float64" to use a higher precision for the output.
See https://numpy.org/doc/stable/reference/generated/numpy.sum.html
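For comparison, here is a small side-by-side of the three summation strategies the documentation mentions (exact outputs can vary with NumPy version and array layout):

import math
import numpy as np

xs = np.ones(10) * 0.1
print(1 - sum(xs))        # naive left-to-right summation
print(1 - np.sum(xs))     # pairwise summation in many cases
print(1 - math.fsum(xs))  # slower but correctly rounded summation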
According to the definition of the expected value, it should be the same as the mean.
But in scipy.stats.binom, they give different values, like this:
import scipy.stats as st
st.binom.mean(10, 0.3) ----> 3.0
st.binom.expect(args=(10, 0.3)) ----> 3.0000000000000013
So that confuses me! Why?
In the example the difference is in floating point computation as pointed out. In general there might also be a truncation in expect depending on the integration tolerance.
The mean and some other moments have, for many distributions, an analytical solution, in which case we usually get a precise estimate.
expect is a general function that computes the expectation for arbitrary (*) functions through summation in the discrete case and numerical integration in the continuous case. This accumulates floating point noise but also depends on the convergence criteria for the numerical integration and will, in general, be less precise than an analytically computed moment.
(*) There might be numerical problems in the integration for some "not nice" functions, which can happen for example with default settings in scipy.integrate.quad
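You can mimic what expect does for a discrete distribution by summing k * pmf(k) over the support yourself (a rough sketch, not scipy's actual code path):

import scipy.stats as st

n, p = 10, 0.3

# analytical mean of a binomial distribution is simply n * p
print(st.binom.mean(n, p))

# explicit summation over the support accumulates floating point noise
print(sum(k * st.binom.pmf(k, n, p) for k in range(n + 1)))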
This could simply be a result of numerical imprecision when calculating the average. Mathematically, they should be identical, but there are different ways of calculating the mean which have different properties when implemented using finite-precision arithmetic. For example, adding up the numbers and dividing by their count is not particularly reliable, especially when the numbers fluctuate by a small amount around the true (theoretical) mean, or have opposite signs. Recursive estimates may have much better properties.
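To illustrate the last point, here is a sketch of a naive mean next to one common form of recursive estimate (the running-mean update); whether it helps depends on the data:

def naive_mean(xs):
    # sum everything first, then divide once at the end
    return sum(xs) / len(xs)

def running_mean(xs):
    # recursive update: m_k = m_{k-1} + (x_k - m_{k-1}) / k
    m = 0.0
    for k, x in enumerate(xs, start=1):
        m += (x - m) / k
    return m

data = [1e8 + 0.1, 1e8 - 0.1, 1e8 + 0.3, 1e8 - 0.3]  # small fluctuations around a large value
print(naive_mean(data), running_mean(data))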
I just checked numpy's sine function. Apparently, it produces highly inaccurate results around pi.
In [26]: import numpy as np
In [27]: np.sin(np.pi)
Out[27]: 1.2246467991473532e-16
The expected result is 0. Why is numpy so inaccurate there?
To some extent, I feel uncertain whether it is acceptable to regard the calculated result as inaccurate: its absolute error is within one machine epsilon (for binary64), whereas the relative error is +inf, which is why I feel somewhat confused. Any ideas?
[Edit] I fully understand that floating-point calculation can be inaccurate. But most of the floating-point libraries can manage to deliver results within a small range of error. Here, the relative error is +inf, which seems unacceptable. Just imagine that we want to calculate
1/(1e-16 + sin(pi))
The results would be disastrously wrong if we use numpy's implementation.
The main problem here is that np.pi is not exactly π, it's a finite binary floating point number that is close to the true irrational real number π but still off by ~1e-16. np.sin(np.pi) is actually returning a value closer to the true infinite-precision result for sin(np.pi) (i.e. the ideal mathematical sin() function being given the approximated np.pi value) than 0 would be.
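You can check that claim directly: near π, sin(x) ≈ π - x, so the value NumPy returns is essentially the gap between np.pi and the true π. A quick sketch using a few extra digits of π via decimal:

from decimal import Decimal, getcontext
import numpy as np

getcontext().prec = 40
true_pi = Decimal("3.141592653589793238462643383279502884197")

# gap between the true pi and the closest 64-bit double (np.pi)
gap = true_pi - Decimal(float(np.pi))

print(gap)            # about 1.22e-16
print(np.sin(np.pi))  # essentially the same value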
The value is dependent upon the algorithm used to compute it. A typical implementation will use some quickly-converging infinite series, carried out until it converges within one machine epsilon. Many modern chips (starting with the Intel 960, I think) had such functions in the instruction set.
To get 0 returned for this, we would need either a notably more accurate algorithm, one that ran extra-precision arithmetic to guarantee the closest-match result, or something that recognizes special cases: detect a multiple of PI and return the exact value.
I am calculating relative frequencies of words (word count / total number of words). This results in quite a few very small numbers (e.g. 1.2551539760140076e-05). I have read about some of the issues with using floats in this context, e.g. in this article
A float has roughly seven decimal digits of precision ...
Some suggest using logged values instead. I am going to multiply these numbers and was wondering
In general, is the seven digit rule something to go by in Python?
In my case, should I use log values instead?
What bad things could happen if I don't -- just a less accurate value or straight up errors, e.g. in the multiplication?
And if so, do I just convert the float with math.log()? I feel that at that point the information is already lost.
Any help is much appreciated!
That article talks about the type float in C, which is a 32 bit quantity. The Python type float is a 64 bit number, like C's double, and can therefore store roughly 17 significant decimal digits (a 53-bit significand instead of 24 bits with C's float). While that too can be too little precision for some applications, it's much less dire than with 32-bit floats.
Furthermore, because it is a floating point format, small numbers such as 1.2551539760140076e-05 (which actually isn't that small) are not inherently disadvantaged. While only about 17 decimal digits can be represented, these 17 digits need not be the first 17 digits after the decimal point. They can be shifted around, so to speak [1]. In fact, you use the same concept of a floating (decimal) point when you give a number as a bunch of decimal digits times a power of ten (e-5). To give extreme examples, 10^-300 can be represented just fine [2], as can 10^300; problems happen only when these two numbers meet (1e300 + 1e-300 == 1e300).
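A quick way to convince yourself (the specific values are only for illustration):

# values of wildly different magnitude: the tiny term is swallowed by the huge one
print(1e300 + 1e-300 == 1e300)    # True
# values of like magnitude combine without any such loss
print(1e-300 + 1e-300 == 2e-300)  # True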
As for a log representation, you would take the log of all values as early as possible and perform as many calculations as possible in log space. In your example you'd calculate the relative frequency of a word as log(word_count) - log(total_words), which is the same as log(word_count / total_words) but possibly more accurate.
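A sketch of that in code (the counts are made-up placeholders):

import math

word_count, total_words = 23, 1_832_219  # hypothetical counts

# relative frequency in log space: a quotient becomes a difference of logs
log_freq = math.log(word_count) - math.log(total_words)

# multiplying several such frequencies becomes adding their logs,
# which stays comfortably within double range instead of underflowing
log_product = 3 * log_freq               # e.g. the same frequency used three times

# convert back only at the very end, if the result is big enough to represent
print(math.exp(log_product))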
What bad things could happen if I don't -- just a less accurate value or straight up errors, e.g. in the multiplication?
I'm not sure what the distinction is. Numeric calculations can have almost perfect accuracy (relative rounding error on the scale of 2^-50 or better), but unstable algorithms can also give laughably bad results in some cases. There are quite strict bounds on the rounding error of each individual operation [3], but in longer calculations, they interact in surprising ways to cause very large errors. For example, even just summing up a large list of floats can introduce significant error, especially if they are of very different magnitudes and signs. The proper analysis and design of reliable numeric algorithms is an art of its own which I cannot do justice here, but thanks to the good design of IEEE-754, most algorithms usually work out okay. Don't worry too much about it, but don't ignore it either.
[1] In reality we're talking about 53 binary digits being shifted around, but this is unimportant for this concept. Decimal floating point formats exist.
[2] With a relative rounding error of less than 2^-54, which occurs for any fraction whose denominator isn't a power of two, including such mundane ones as 1/3 or 0.1 (see the short demo after these notes).
[3] For basic arithmetic operations, the rounding error is at most half a unit in the last place, i.e., the result must be calculated exactly and then rounded correctly. For transcendental functions the error is rarely more than one or two units in the last place but can be larger.
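A one-liner that shows note [2] in action, using decimal to write out the exact double that gets stored for 0.1:

from decimal import Decimal
print(Decimal(0.1))  # the double closest to 0.1, printed exactly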
I have a Bayesian Classifier programmed in Python, the problem is that when I multiply the features probabilities I get VERY small float values like 2.5e-320 or something like that, and suddenly it turns into 0.0. The 0.0 is obviously of no use to me since I must find the "best" class based on which class returns the MAX value (greater value).
What would be the best way to deal with this? I thought about taking the exponent of the number (-320) and, if it gets too low, multiplying the value by 1e20 or some value like that. But maybe there is a better way?
What you describe is a standard problem with the naive Bayes classifier. You can search for "underflow" together with that term to find the answer, or see here.
The short answer is it is standard to express all that in terms of logarithms. So rather than multiplying probabilities, you sum their logarithms.
You might want to look at other algorithms as well for classification.
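A minimal sketch of the log-sum idea (the class names and probabilities below are made up for illustration):

import math

# hypothetical per-class feature probabilities
class_probs = {
    "spam": [1e-5, 3e-4, 2e-7, 5e-6],
    "ham":  [2e-4, 1e-6, 4e-5, 9e-8],
}

# instead of multiplying the probabilities (which underflows), sum their logs
log_scores = {cls: sum(math.log(p) for p in probs)
              for cls, probs in class_probs.items()}

# the argmax is unchanged because log is monotonically increasing
best = max(log_scores, key=log_scores.get)
print(best, log_scores[best])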
Would it be possible to do your work in a logarithmic space? (For example, instead of storing 1e-320, just store -320, and use addition instead of multiplication)
Floating point numbers don't have infinite precision, which is why you saw the numbers turn to 0. Could you multiply all the probabilities by a large scalar, so that your numbers stay in a higher range? If you're only worried about max and not magnitude, you don't even need to bother dividing through at the end. Alternatively you could use an infinite precision decimal, like ikanobori suggests.
Take a look at Decimal from the stdlib.
from decimal import Decimal, getcontext
getcontext().prec = 320
Decimal(1) / Decimal(7)
I am not posting the result here as it is quite long.