In Python, small floats tending to zero

I have a Bayesian classifier programmed in Python. The problem is that when I multiply the feature probabilities I get VERY small float values like 2.5e-320 or so, and suddenly it turns into 0.0. The 0.0 is obviously of no use to me since I must find the "best" class based on which class returns the MAX value (greatest value).
What would be the best way to deal with this? I thought about extracting the exponent of the number (-320) and, if it gets too low, multiplying the value by 1e20 or some value like that. But maybe there is a better way?

What you describe is a standard problem with the naive Bayes classifier; search for "naive Bayes underflow" to find the standard treatments.
The short answer is that it is standard to express all of this in terms of logarithms: rather than multiplying probabilities, you sum their logarithms.
You might want to look at other algorithms as well for classification.
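For illustration, here is a minimal sketch of the log-space trick; the class names and probability values below are hypothetical, made up just to show the underflow:
import math
# Hypothetical per-class feature likelihoods and priors; the raw products
# (~6e-335 and ~2e-342) would underflow ordinary floats to 0.0.
likelihoods = {"spam": [1e-110, 3e-105, 2e-120], "ham": [5e-100, 1e-115, 4e-128]}
priors = {"spam": 0.4, "ham": 0.6}
# Sum logarithms instead of multiplying probabilities.
log_scores = {cls: math.log(priors[cls]) + sum(math.log(p) for p in probs)
              for cls, probs in likelihoods.items()}
# argmax over the log-scores picks the same class as argmax over the products.
best = max(log_scores, key=log_scores.get)
print(log_scores, best)
The products themselves are never formed, so nothing underflows no matter how many features you multiply in.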

Would it be possible to do your work in a logarithmic space? (For example, instead of storing 1e-320, just store -320, and use addition instead of multiplication)

Floating-point numbers have limited precision and range, which is why you saw the numbers turn into 0. Could you multiply all the probabilities by a large scalar, so that your numbers stay in a higher range? If you're only worried about max and not magnitude, you don't even need to bother dividing through at the end. Alternatively you could use an arbitrary-precision Decimal, as ikanobori suggests.
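A rough sketch of that rescaling idea (the constant C below is an arbitrary illustrative choice, in the spirit of the 1e20 mentioned in the question):
C = 1e20  # same constant applied to every feature probability
def scaled_score(probabilities):
    # Each class's product is inflated by C**len(probabilities), the same
    # factor for every class, so the argmax is unchanged.
    score = 1.0
    for p in probabilities:
        score *= p * C
    return score
print(scaled_score([1e-110, 3e-105, 2e-120]))  # raw product ~6e-335 would underflow
print(scaled_score([5e-100, 1e-115, 4e-128]))  # raw product ~2e-342 would underflow
Note that with enough features the rescaled product can still underflow (or overflow), which is why the logarithm approach above is the usual fix.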

Take a look at Decimal from the stdlib.
from decimal import Decimal, getcontext
getcontext().prec = 320   # allow 320 significant digits
Decimal(1) / Decimal(7)   # division carried out to that precision
I am not posting the result here as it is quite long.

Related

In NumPy, how to use a float that is larger than float64's max value?

I have a calculation that may result in very, very large numbers that won't fit into a float64. I thought about using np.longdouble, but that may not be large enough either.
I'm not so interested in precision (just 8 digits would do for me). It's the decimal part that won't fit. And I need to have an array of those.
Is there a way to represent / hold an unlimited-size number, say, only limited by the available memory? Or if not, what is the absolute max value I can place in a NumPy array?
Can you rework the calculation so it works with the logarithms of the numbers instead?
That's pretty much how the built-in floats work in any case...
You would only convert the number back to linear for display, at which point you'd separate the integer and fractional parts; the fractional part gets exponentiated as normal to give the 8 digits of precision, and the integer part goes into the "×10ⁿ" or "×eⁿ" or "×2ⁿ" part of the output (depending on what base logarithm you use).
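A rough sketch of that idea, assuming the values can be produced as logarithms in the first place (the helper name and the sample values are made up for illustration):
import numpy as np
# Store log10 of the values instead of the values themselves; a product of
# huge factors becomes a sum of ordinary-sized logs.
log_values = np.array([123.456789, 987.654321, 2500.0])  # hypothetical log10 values
def format_from_log10(log_x, digits=8):
    # Split into integer exponent and fractional part; only the fractional
    # part is exponentiated, giving the requested number of significant digits.
    exponent = int(np.floor(log_x))
    mantissa = 10.0 ** (log_x - exponent)
    return f"{mantissa:.{digits - 1}f}e{exponent:+d}"
for lx in log_values:
    print(format_from_log10(lx))  # prints '<8 significant digits>e<exponent>' strings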

What is the Precision of SymPy's 'expand' Function?

I am trying to expand a function of the form (X + Y + Z) ^ N where N is sufficiently large so that the expanded product will contain terms with coefficients much greater than 2 ^ 64; for the sake of this discussion let's just say that N is greater than 200. This is an issue because I am hoping to do an analysis of the expanded form of this function, and this analysis requires exact precision for all of the terms and their coefficients.
To expand the function I am using the Python module SymPy, which has seemed very promising thus far and been able to expand functions where N is > 150 in a relatively short amount of time. My concern though is that after looking through some of the expanded functions, I am seeing coefficients with more trailing zeroes than I might expect. I know that I can run everything through mpmath for my analysis after the function has been expanded, but as of now, I am unsure as to whether or not some of the larger coefficients are even exactly correct in the first place.
Under the documentation for SymPy's expand function, there is no clarification of how precise the coefficients of the expansion are when working with very large numbers. I know for a fact that SymPy uses the mpmath module for some of its functions, so I know that it is capable of arbitrary precision, I just don't know if arbitrary precision explicitly applies to this case.
I know that I could also confirm whether the expand function is arbitrarily precise by summing all of the coefficients of a given expansion and checking whether that sum is equal to 3 ^ N, but I'd rather not spend a few hours coding all the necessary pieces to make that assessment, only to find out that expand is imprecise.
If anyone has any suggestions for easier ways to confirm the precision of expand, then I would appreciate that if direct confirmation of its precision cannot be given.
Although PR 18960 has not yet been merged, you can use it to confirm that the coefficients are correct:
>>> from sympy.abc import x, y, z
>>> multinomial(15, 16, 14)  # multinomial as provided by the PR referenced above
50151543548788717200
>>> ((x + y + z)**(15 + 16 + 14)).expand().coeff(x**15*y**16*z**14)
50151543548788717200
>>> _ > 2**64
True
Since Python supports unlimited integers and the coefficients are integers, I don't know any reason that they would not be accurate.
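If you want a quick independent check without writing the full analysis, one option (a sketch, with N kept modest so it runs quickly) is to use the fact that substituting x = y = z = 1 must give exactly 3**N:
>>> from sympy import expand
>>> from sympy.abc import x, y, z
>>> N = 60
>>> expanded = expand((x + y + z)**N)
>>> expanded.subs({x: 1, y: 1, z: 1}) == 3**N  # sum of all coefficients
True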

Can Python floating-point error affect statistical test over small numbers?

Can floating-point errors affect my calculations in the following scenario, where the values are small?
My purpose is to compare two sets of values and determine if their means are statistically different.
I am performing the usual large-sample unpaired tests on data like this, where the values involved are very small:
first group (obtained from 100 samples):
first item's mean = 2.7977620220553945e-24
std dev = 3.2257148207429583e-15
second group (obtained from 100 samples):
first item's mean = 3.1086244689504383e-15
std dev = 3.92336102789548e-15
The goal is to find out whether or not the two means are statistically significantly different.
I plan to follow the usual steps of finding the standard error of the difference and the z-score and so on. I will be using Python (or Java).
My question is not about the statistical test but about the potential problem with the smallness of the numbers (floating-point errors).
Should I (must I) approximate each of the above two means to zero (and thus conclude that there is no difference)?
That is, given the smallness of the means, is it computationally meaningless to go about performing the statistical test?
64-bit floating-point numbers allot 52 bits for the significand. This is approximately 15-16 decimal places (log10(2^52) ≈ 15.6). In scientific notation, this is the difference between, say, 1e-9 and 1e-24 (because 10^-9 / 10^-24 == 10^15, i.e. they differ by 15 decimal places).
What does this all mean? Well, it means that if you add 10^-24 to 10^-9, it is just on the border of being too small to show up in the larger number (10^-9).
Observe:
>>> a = 1e-9
>>> a
1e-09
>>> a + 1e-23
1.00000000000001e-09
>>> a + 1e-24
1.000000000000001e-09
>>> a + 1e-25
1e-09
Since the z-score statistics stuff involves basically adding and subtracting a few standard deviations from the mean, or so, it will definitely have problems if the difference in the exponent is 16. It's probably not a good situation if the difference is like 14 or 15. The difference in your exponents is 9, which will still allow you 1/10^6 standard deviations of accuracy in the final sum. Since we're worried about errors on the order of, maybe, a tenth of a standard deviation or so when we talk about statistical significance, you should be fine.
For 32-bit (single-precision) floats, the significand gets 23 bits, which is about 6.9 decimal places. (Note that Python's float is a 64-bit double regardless of platform.)
In principle, if you work with numbers with the same order of magnitude, the float representation of data is sufficient to retain the same precision as numbers close to 1.
However, it is much more robust to perform the computation on whitened data (rescaled to zero mean and unit variance).
If whitening is not an option for your use case, you can use arbitrary-precision arithmetic for non-integer data (Python already offers built-in arbitrary-precision integers), e.g. the decimal and fractions modules together with statistics, and do all the computations with that.
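A minimal sketch of the whitening idea, using made-up stand-in values (the real data would be your two 100-sample groups):
import statistics
group = [3.1e-15, 2.9e-15, 3.3e-15, 2.7e-15, 3.0e-15]  # hypothetical observations
mu = statistics.mean(group)
sigma = statistics.stdev(group)
# Whitened data: zero mean and unit variance, so all further arithmetic
# happens with numbers near 1.0 instead of near 1e-15.
whitened = [(x - mu) / sigma for x in group]
print(whitened)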
EDIT
However, just looking at your numbers, the standard deviation ranges (the intervals [µ-σ, µ+σ]) largely overlap, so you have no evidence that the two means are statistically significantly different. Of course, this is meaningful only for (at least asymptotically) normally distributed populations / samples.

What is the difference between expect and mean in scipy.stats?

According to the definition of the expected value, it should be the same as the mean.
But in scipy.stats.binom they give different values, like this:
import scipy.stats as st
st.binom.mean(10, 0.3) ----> 3.0
st.binom.expect(args=(10, 0.3)) ----> 3.0000000000000013
That confuses me! Why?
In the example the difference is in floating point computation as pointed out. In general there might also be a truncation in expect depending on the integration tolerance.
The mean and some other moments have an analytical solution for many distributions, in which case we usually get a precise result.
expect is a general function that computes the expectation for arbitrary (*) functions through summation in the discrete case and numerical integration in the continuous case. This accumulates floating point noise but also depends on the convergence criteria for the numerical integration and will, in general, be less precise than an analytically computed moment.
(*) There might be numerical problems in the integration for some "not nice" functions, which can happen for example with default settings in scipy.integrate.quad
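To see where the tiny discrepancy comes from, here is a small sketch that redoes the discrete summation by hand for the binomial case (the exact accumulation order inside expect may differ slightly):
import scipy.stats as st
n, p = 10, 0.3
print(st.binom.mean(n, p))           # analytic n*p -> 3.0
print(st.binom.expect(args=(n, p)))  # numerical summation -> 3.0000000000000013
# expect for a discrete distribution essentially sums k * pmf(k) over the
# support; doing the same sum by hand shows the accumulated rounding noise.
print(sum(k * st.binom.pmf(k, n, p) for k in range(n + 1)))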
This could be simply a result of numerical imprecision when calculating the average. Mathematically, they should be identical, but there are different ways of calculating the mean which have different properties when implemented using finite-precision arithmetic. For example, adding up the numbers and dividing by the total is not particularly reliable, especially when the numbers fluctuate by a small amount around the true (theoretical) mean, or have opposite signs. Recursive estimates may have much better properties.

Why is numpy's sine function so inaccurate at some points?

I just checked numpy's sine function. Apparently, it produces highly inaccurate results around pi.
In [26]: import numpy as np
In [27]: np.sin(np.pi)
Out[27]: 1.2246467991473532e-16
The expected result is 0. Why is numpy so inaccurate there?
To some extent, I feel uncertain whether it is acceptable to regard the calculated result as inaccurate: its absolute error is within one machine epsilon (for binary64), whereas the relative error is +inf, which is why I feel somewhat confused. Any ideas?
[Edit] I fully understand that floating-point calculations can be inaccurate. But most floating-point libraries manage to deliver results within a small range of error. Here, the relative error is +inf, which seems unacceptable. Just imagine that we want to calculate
1/(1e-16 + sin(pi))
The results would be disastrously wrong if we use numpy's implementation.
The main problem here is that np.pi is not exactly π, it's a finite binary floating point number that is close to the true irrational real number π but still off by ~1e-16. np.sin(np.pi) is actually returning a value closer to the true infinite-precision result for sin(np.pi) (i.e. the ideal mathematical sin() function being given the approximated np.pi value) than 0 would be.
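A quick way to see this, using mpmath as a higher-precision reference (a sketch; mpmath is an extra dependency here):
import numpy as np
from mpmath import mp, mpf, sin
mp.dps = 40                        # 40 significant digits as a reference
gap = mp.pi - mpf(np.pi)           # how far the double np.pi is from the true pi
print(gap)                         # about 1.22e-16
# sin(pi - gap) = sin(gap) ~ gap, so the correctly rounded sine of the argument
# numpy actually received is essentially that gap, not 0.
print(np.sin(np.pi))               # 1.2246467991473532e-16
print(float(sin(mpf(np.pi))))      # agrees with numpy's value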
The value is dependent upon the algorithm used to compute it. A typical implementation will use some quickly-converging infinite series, carried out until it converges within one machine epsilon. Many modern chips (starting with the Intel 960, I think) had such functions in the instruction set.
To get 0 returned for this, we would need either a notably more accurate algorithm, one that ran extra-precision arithmetic to guarantee the closest-match result, or something that recognizes special cases: detect a multiple of PI and return the exact value.
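For the special-case route, a symbolic library already behaves that way; a small illustration using sympy (which is an assumption here, since the question itself is about numpy):
import sympy
print(sympy.sin(sympy.pi))      # exactly 0, because pi is represented symbolically
print(sympy.sin(2 * sympy.pi))  # exactly 0
# Evaluating numerically only at the very end avoids the ~1e-16 residue:
print(1 / (sympy.Rational(1, 10**16) + sympy.sin(sympy.pi)))  # 10000000000000000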
