I have a numerical problem while doing likelihood ratio tests in Python. I'll not go into too much detail about what the statistics mean; my problem comes down to calculating this:
LR = LR_H0 / LR_H1
where LR is the number of interest and LR_H0 and LR_H1 are numbers that can be VERY close to zero. This leads to a few numerical issues; if LR_H1 is too small then python will recognise this as a division by zero.
ZeroDivisionError: float division by zero
Also, although this is not the main issue, if LR_H1 is small enough to allow the division then the fraction LR_H0 / LR_H1 might become too big (I'm assuming that Python also has an upper limit on the value a float can hold).
Any tips on what the best way is to circumvent this problem? I'm considering doing something like:
def small_enough(num):
    if num == 0.0:
        return OTHER_SMALL_NUMBER  # placeholder: some other small number
    else:
        return num
But this is not ideal because it would approximate the LR value and I would like to guarantee some precision.
Work with logarithms. Take the log of all your likelihoods, and add or subtract logarithms instead of multiplying and dividing. You'll be able to work with much greater ranges of values without losing precision.
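As a minimal sketch of the logarithm approach: the per-observation probabilities below are made up for illustration, and log_likelihood is a hypothetical helper name, not something from the question. The point is only that the ratio becomes a difference of sums of logs, which stays in range even when the raw likelihoods underflow.

```python
import math

def log_likelihood(probabilities):
    """Sum of log-probabilities, i.e. the log of their product."""
    return sum(math.log(p) for p in probabilities)

# Hypothetical per-observation probabilities under each hypothesis.
probs_h0 = [1e-5] * 100
probs_h1 = [2e-5] * 100

# The direct products underflow to 0.0, so LR_H0 / LR_H1 cannot be formed.
print(math.prod(probs_h0), math.prod(probs_h1))

# In log space the ratio becomes a difference and stays well within range.
log_lr = log_likelihood(probs_h0) - log_likelihood(probs_h1)
print(log_lr)  # close to -100 * log(2)
```

If the test statistic is defined in terms of log(LR) anyway, as is common for likelihood ratio tests, there is no need to exponentiate back. (math.prod assumes Python 3.8 or later.)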
I found myself needing to compute the "integer cube root", meaning the cube root of an integer, rounded down to the nearest integer. In Python, we could use the NumPy floating-point cbrt() function:
import numpy as np
def icbrt(x):
    return int(np.cbrt(x))
Though this works most of the time, it fails for certain inputs x, returning one less than expected. For example, icbrt(15**3) == 14, which comes about because np.cbrt(15**3) == 14.999999999999998. The following finds all such failures among the first 100,000 non-negative integers:
print([x for x in range(100_000) if (icbrt(x) + 1)**3 == x])
# [3375, 19683, 27000, 50653] == [15**3, 27**3, 30**3, 37**3]
Question: What is special about 15, 27, 30, 37, ..., making cbrt() return ever so slightly below the exact result? I can find no obvious underlying pattern for these numbers.
A few observations:
The story is the same if we switch from NumPy's cbrt() to that of Python's math module, or if we switch from Python to C (not surprising, as I believe that both numpy.cbrt() and math.cbrt() delegate to cbrt() from the C math library in the end).
Replacing cbrt(x) with x**(1/3) (pow(x, 1./3.) in C) leads to many more cases of failure. Let us stick to cbrt().
For the square root, a similar problem does not arise, meaning that
import numpy as np
def isqrt(x):
    return int(np.sqrt(x))
returns the correct result for all x (tested up to 100,000,000). Test code:
print([x for x in range(100_000) if (y := isqrt(x))**2 > x or (y + 1)**2 <= x])
Extra
As the above icbrt() only seems to fail on cubic input, we can correct for the occasional mistakes by adding a fixup, like so:
import numpy as np
def icbrt(x):
    y = int(np.cbrt(x))
    if (y + 1)**3 == x:
        y += 1
    return y
A different solution is to stick to exact integer computation, implementing icbrt() without the use of floating-point numbers. This is discussed e.g. in this SO question. An extra benefit of such approaches is that they are (or can be) faster than using the floating-point cbrt().
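For illustration, here is one possible integer-only version, a sketch using plain bisection. This is not necessarily the approach from the linked question, and icbrt_exact is a name chosen here; no speed claim is implied.

```python
def icbrt_exact(x):
    """Floor of the cube root of a non-negative integer, using only integers."""
    if x < 0:
        raise ValueError("x must be non-negative")
    # Find an upper bound hi with hi**3 > x by repeated doubling.
    lo, hi = 0, 1
    while hi**3 <= x:
        hi *= 2
    # Bisect while keeping the invariant lo**3 <= x < hi**3.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mid**3 <= x:
            lo = mid
        else:
            hi = mid
    return lo

print(icbrt_exact(15**3))      # 15
print(icbrt_exact(15**3 - 1))  # 14
```

Because Python integers have arbitrary precision, this stays exact for inputs far beyond the range where a float can represent x.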
To be clear, my question is not about how to write a better icbrt(), but about why cbrt() fails at some specific inputs.
This problem is caused by a bad implementation of cbrt. It is not caused by floating-point arithmetic because floating-point arithmetic is not a barrier to computing the cube root well enough to return an exactly correct result when the exactly correct result is representable in the floating-point format.
For example, if one were to use integer arithmetic to compute nine-fifths of 80, we would expect a correct result of 144. If a routine to compute nine-fifths of a number were implemented as int NineFifths(int x) { return 9/5*x; }, we would blame that routine for being implemented incorrectly, not blame integer arithmetic for not handling fractions. Similarly, if a routine uses floating-point arithmetic to calculate an incorrect result when a correct result is representable, we blame the routine, not floating-point arithmetic.
Some mathematical functions are difficult to calculate, and we accept some amount of error in them. In fact, for some of the routines in the math library, humans have not yet figured out how to calculate them with correct rounding in a known-bounded execution time. So we accept that not every math routine is correctly rounded.
However, when the mathematical value of a function is exactly representable in a floating-point format, the correct result can be obtained by faithful rounding rather than correct rounding. So this is a desirable goal for math library functions.
Correctly rounded means the computed result equals the number you would obtain by rounding the exact mathematical result to the nearest representable value.[1] Faithfully rounded means the computed result is less than one ULP from the exact mathematical result. A ULP is a unit of least precision, the distance between two adjacent representable numbers.
Correctly rounding a function can be difficult because, in general, a function can be arbitrarily close to a rounding decision point. For round-to-nearest, this is midway between two adjacent representable numbers. Consider two adjacent representable numbers a and b. Their midpoint is m = (a+b)/2. If the mathematical value of some function f(x) is just below m, it should be rounded to a. If it is just above, it should be rounded to b. As we implement f in software, we might compute it with some very small error e. When we compute f(x), if our computed result lies in [m-e, m+e], and we only know the error bound is e, then we cannot tell whether f(x) is below m or above m. And because, in general, a function f(x) can be arbitrarily close to m, this is always a problem: No matter how accurately we compute f, no matter how small we make the error bound e, there is a possibility that our computed value will lie very close to a midpoint m, closer than e, and therefore our computation will not tell us whether to round down or to round up.
For some specific functions and floating-point formats, studies have been made and proofs have been written about how close the functions approach such rounding decision points, and so certain functions like sine and cosine can be implemented with correct rounding with known bounds on the compute time. Other functions have eluded proof so far.
In contrast, faithful rounding is easier to implement. If we compute a function with an error bound less than ½ ULP, then we can always return a faithfully rounded result, one that is within one ULP of the exact mathematical result. Once we have computed some result y, we round that to the nearest representable value[2] and return that. Starting with y having error less than ½ ULP, the rounding may add up to ½ ULP more error, so the total error is less than one ULP, which is faithfully rounded.
A benefit of faithful rounding is that a faithfully rounded implementation of a function always produces the exact result when the exact result is representable. This is because the next nearest result is one ULP away, but faithful rounding always has an error less than one ULP. Thus, a faithfully rounded cbrt function returns exact results when they are representable.
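The definition can be checked numerically. The sketch below assumes Python 3.9 or later (for math.ulp and math.nextafter); is_faithfully_rounded is a hypothetical helper, and it takes the ULP at the computed value, which is a simplification near powers of two.

```python
import math
from fractions import Fraction

def is_faithfully_rounded(computed, exact):
    """True if `computed` is less than one ULP away from `exact`.

    `exact` is a Fraction holding the true mathematical value.
    """
    error = abs(Fraction(computed) - exact)
    return error < Fraction(math.ulp(computed))

exact_root = Fraction(15)               # the cube root of 15**3
just_below = math.nextafter(15.0, 0.0)  # the double immediately below 15.0

print(is_faithfully_rounded(15.0, exact_root))        # True
print(is_faithfully_rounded(just_below, exact_root))  # False: error is exactly one ULP
```

The value just below 15.0 is the 14.999999999999998 reported in the question, so by this definition that result is not faithfully rounded.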
What is special about 15, 27, 30, 37, ..., making cbrt() return ever so slightly below the exact result? I can find no obvious underlying pattern for these numbers.
The bad cbrt implementation might compute the cube root by reducing the argument to a value in [1, 8) or similar interval and then applying a precomputed polynomial approximation. Each addition and multiplication in that polynomial may introduce a rounding error as the result of each operation is rounded to the nearest representable value in floating-point format. Additionally, the polynomial has inherent error. Rounding errors behave somewhat like a random process, sometimes rounding up, sometimes down. As they accumulate over several calculations, they may happen to round in different directions and cancel, or they may round in the same direction and reinforce. If the errors happen to cancel by the end of the calculations, you get an exact result from cbrt. Otherwise, you may get an incorrect result from cbrt.
Footnotes
[1] In general, there is a choice of rounding rules. The default and most common is round-to-nearest, ties-to-even. Others include round-upward, round-downward, and round-toward-zero. This answer focuses on round-to-nearest.
[2] Inside a mathematical function, numbers may be computed using extended precision, so we may have computed results that are not representable in the destination floating-point format; they will have more precision.
I happen to have a numpy array of floats:
a.dtype, a.shape
#(dtype('float64'), (32769,))
The values are:
a[0]
#3.699822718929953
all(a == a[0])
True
However:
a.mean()
3.6998227189299517
The mean is off in the 15th and 16th significant figures.
Can anybody show how this difference accumulates when averaging 30K elements, and whether there is a way to avoid it?
In case it matters my OS is 64 bit.
Here is a rough approximation of a bound on the maximum error. This will not be representative of average error, and it could be improved with more analysis.
Consider calculating a sum using floating-point arithmetic with round-to-nearest ties-to-even:
sum = 0;
for (i = 0; i < n; ++i)
    sum += a[i];
where each a[i] is in [0, m).
Let ULP(x) denote the unit of least precision in the floating-point number x. (For example, in the IEEE-754 binary64 format with 53-bit significands, if the largest power of 2 not greater than |x| is 2^p, then ULP(x) = 2^(p−52).) With round-to-nearest, the maximum error in any operation with result x is ½ULP(x).
If we neglect rounding errors, the maximum value of sum after i iterations is i•m. Therefore, a bound on the error in the addition in iteration i is ½ULP(i•m). (Actually zero for i=1, since that case adds to zero, which has no error, but we neglect that for this approximation.) Then the total of the bounds on all the additions is the sum of ½ULP(i•m) for i from 1 to n. This is approximately ½•n•(n+1)/2•ULP(m) = ¼•n•(n+1)•ULP(m). (This is an approximation because it moves i outside the ULP function, but ULP is a discontinuous function. It is “approximately linear,” but there are jumps. Since the jumps are by factors of two, the approximation can be off by at most a factor of two.)
So, with 32,769 elements, we can say the total rounding error will be at most about ¼•32,769•32,770•ULP(m), about 2.7•10^8 times the ULP of the maximum element value. The ULP is 2^−52 times the greatest power of two not greater than m, so that is about 2.7•10^8•2^−52 = 6•10^−8 times m.
Of course, the likelihood that 32,768 sums (not 32,769 because the first necessarily has no error) all round in the same direction by chance is vanishingly small but I conjecture one might engineer a sequence of values that gets close to that.
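The arithmetic of that bound can be reproduced in a few lines. This is only a sketch of the approximate formula ¼•n•(n+1)•ULP(m) above; the function name is made up here, and math.ulp assumes Python 3.9 or later.

```python
import math

def worst_case_sum_error_bound(n, m):
    """Approximate bound 1/4 * n * (n + 1) * ULP(m) on the accumulated
    rounding error when summing n values, each in [0, m)."""
    return 0.25 * n * (n + 1) * math.ulp(m)

# With m = 1.0, ULP(m) = 2**-52, so this gives the "about 6e-8 times m" figure.
print(worst_case_sum_error_bound(32769, 1.0))

# The bound for the questioner's element value, and the corresponding bound
# on the mean (the sum's bound divided by n).
bound = worst_case_sum_error_bound(32769, 3.699822718929953)
print(bound, bound / 32769)
```

The observed error in the question (about 1.3e-15 in the mean) is far below this worst-case bound, as expected when rounding errors partly cancel.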
An Experiment
Here is a chart of (in blue) the mean error over 10,000 samples of summing arrays with sizes 100 to 32,800 by 100s and elements drawn randomly from a uniform distribution over [0, 1). The error was calculated by comparing the sum calculated with float (IEEE-754 binary32) to that calculated with double (IEEE-754 binary64). (The samples were all multiples of 2^−24, and double has enough precision so that the sum for up to 2^29 such values is exact.)
The green line is c n √n with c set to match the last point of the blue line. We see it tracks the blue line over the long term. At points where the average sum crosses a power of two, the mean error increases faster for a time. At these points, the sum has entered a new binade, and further additions have higher average errors due to the increased ULP. Over the course of the binade, this fixed ULP decreases relative to n, bringing the blue line back to the green line.
This is due to the inability of the float64 type to store the sum of your float numbers with sufficient precision. To get around this problem you need to use a larger data type, of course*. NumPy has a longdouble dtype that you can use in such cases:
In [23]: np.mean(a, dtype=np.longdouble)
Out[23]: 3.6998227189299530693
Also, note:
In [25]: print(np.longdouble.__doc__)
Extended-precision floating-point number type, compatible with C
``long double`` but not necessarily with IEEE 754 quadruple-precision.
Character code: ``'g'``.
Canonical name: ``np.longdouble``.
Alias: ``np.longfloat``.
Alias *on this platform*: ``np.float128``: 128-bit extended-precision floating-point number type.
* read the comments for more details.
The mean is (by definition):
a.sum()/a.size
Unfortunately, adding all those values up and dividing accumulates floating point errors. They are usually around the magnitude of:
np.finfo(float).eps
Out[]: 2.220446049250313e-16
Yeah, e-16, about where you get them. You can make the error smaller by using higher-accuracy floats like float128 (if your system supports it) but errors will always accumulate whenever you're summing a large number of floats together. If you truly want the identity, you'll have to hardcode it:
def mean_(arr):
    if np.all(arr == arr[0]):
        return arr[0]
    else:
        return arr.mean()
In practice, you never really want to use == between floats. Generally in numpy we use np.isclose or np.allclose to compare floats for exactly this reason. There are ways around it using other packages and leveraging arcane machine-level methods of calculating numbers to get (closer to) exact equality, but it's rarely worth the performance and clarity hit.
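One option not mentioned in the answers above is Python's math.fsum, which returns a correctly rounded sum and so removes the accumulated error from the summation step. The sketch below uses a plain list in place of the NumPy array, and accurate_mean is a name chosen for illustration. The final division still rounds once, so exact equality with the element value is not guaranteed in general.

```python
import math

def accurate_mean(values):
    """Mean computed from a correctly rounded sum (math.fsum)."""
    return math.fsum(values) / len(values)

# Same length and element value as the array in the question.
a = [3.699822718929953] * 32769
mean = accurate_mean(a)
print(mean, mean == a[0])
```

math.fsum accepts any iterable of floats, so it can also be passed a one-dimensional float64 array, at the cost of being much slower than a.mean().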
For the moment, put aside any issues relating to pseudorandom number generators and assume that numpy.random.rand perfectly samples from the discrete distribution of floating point numbers over [0, 1). What are the odds of getting at least two exactly identical floating point numbers in the result of:
numpy.random.rand(n)
for any given value of n?
Mathematically, I think this is equivalent to first asking how many IEEE 754 singles or doubles there are in the interval [0, 1). Then I guess the next step would be to solve the equivalent birthday problem? I'm not really sure. Anyone have some insight?
The computation performed by numpy.random.rand for each element generates a number 0.<53 random bits>, for a total of 2^53 equally likely outputs. (Of course, the memory representation isn't a fixed-point 0.stuff; it's still floating point.) This computation is incapable of producing most binary64 floating-point numbers between 0 and 1; for example, it cannot produce 1/2^60. You can see the code in numpy/random/mtrand/randomkit.c:
double
rk_double(rk_state *state)
{
    /* shifts : 67108864 = 0x4000000, 9007199254740992 = 0x20000000000000 */
    long a = rk_random(state) >> 5, b = rk_random(state) >> 6;
    return (a * 67108864.0 + b) / 9007199254740992.0;
}
(Note that rk_random produces 32-bit outputs, regardless of the size of long.)
Assuming a perfect source of randomness, the probability of repeats in numpy.random.rand(n) is 1-(1-0/k)(1-1/k)(1-2/k)...(1-(n-1)/k), where k=2^53. It's probably best to use an approximation instead of calculating this directly for large values of n. (The approximation may even be more accurate, depending on how the approximation error compares to the rounding error accumulated in a direct computation.)
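The answer does not say which approximation to use. One common choice, shown here as an assumption rather than as the answerer's method, is the standard birthday-problem approximation 1 - exp(-n(n-1)/(2k)), which is accurate when n is much smaller than k. The function name is hypothetical.

```python
import math

def collision_probability(n, k=2**53):
    """Approximate probability of at least one repeat among n draws
    from k equally likely values, using 1 - exp(-n(n-1)/(2k))."""
    # -expm1(-t) computes 1 - exp(-t) without cancellation when t is tiny.
    return -math.expm1(-n * (n - 1) / (2 * k))

for n in (10**3, 10**6, 10**8):
    print(n, collision_probability(n))
```

Using math.expm1 keeps the result meaningful for small n, where 1 - math.exp(-t) would round to zero.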
I think you are correct, this is like the birthday problem.
But you need to decide on the number of possible options. You do this by deciding the precision of your floating point numbers.
For example, if you decide on a precision of 2 digits after the decimal point, then there are 100 options (including zero and excluding 1).
And if you have n numbers then the probability of not having a collision is:
P = (100/100)(99/100)(98/100)...((100-n+1)/100)
or when given R possible numbers and N data points, the probability of no collision is:
P = R! / (R^N (R-N)!)
And the probability of a collision is 1 - P.
This is because the probability of getting any given number is 1/R. And at any point, the probability of a data point not colliding with prior data points is (R-i)/R for i being the index of the data point. But to get the probability of no data points colliding with each other, we need to multiply all the probabilities of data points not colliding with those prior to them. Applying some algebraic operations, we get the equation above.
I'm doing calculations with 3D vectors with floating point coordinates. Occasionally, I want to check if a vector is nonzero. However, with floating point numbers, there's always a chance of a rounding error.
Is there a standard way in Python to check if a floating point number is sufficiently close to zero? I could write abs(x) < 0.00001, but it's the hard-coded cutoff that bugs me on general grounds ...
Like Ami wrote in the comments, it depends on what you're doing. The system epsilon is good for single operation errors, but when you use already rounded values in further calculations, the errors can get much larger than the system epsilon. Take this extreme example:
import sys
print('%.20f\n' % sys.float_info.epsilon)
x = 0.1
for _ in range(25):
    print('%.20f' % x)
    x = 11*x - 1
With exact values, x would always be 0.1, since 11*0.1-1 is 0.1 again. But what really happens is this:
0.00000000000000022204
0.10000000000000000555
0.10000000000000008882
0.10000000000000097700
0.10000000000001074696
0.10000000000011821655
0.10000000000130038202
0.10000000001430420227
0.10000000015734622494
0.10000000173080847432
0.10000001903889321753
0.10000020942782539279
0.10000230370607932073
0.10002534076687252806
0.10027874843559780871
0.10306623279157589579
0.13372856070733485367
0.47101416778068339042
4.18115584558751685051
44.99271430146268357930
493.91985731608951937233
5432.11843047698494046926
59752.30273524683434516191
657274.33008771517779678106
7230016.63096486683934926987
79530181.94061353802680969238
Note that the original x differed from 0.1 by far less than my system epsilon, but the error quickly grew larger than that epsilon and even your 0.00001 and now it's in the millions.
This is an extreme example, though, and it's highly unlikely you'll encounter something this bad. But it shows that the precision really depends on what you're doing, so you'll have to find a good way for your particular calculation.
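For the check itself, the standard library does offer math.isclose, which the answer above does not mention. A minimal sketch follows; the tolerance 1e-9 is an arbitrary placeholder that still has to be chosen for the calculation at hand, and both function names are made up here.

```python
import math

def is_near_zero(x, abs_tol=1e-9):
    """True if x is within abs_tol of zero.

    A relative tolerance is useless when comparing against 0.0,
    so an absolute tolerance has to be supplied.
    """
    return math.isclose(x, 0.0, abs_tol=abs_tol)

def vector_is_nonzero(v, abs_tol=1e-9):
    """True if the Euclidean length of v exceeds abs_tol."""
    return math.hypot(*v) > abs_tol

print(is_near_zero(1e-12))                    # True
print(vector_is_nonzero((0.0, 0.0, 1e-3)))    # True
print(vector_is_nonzero((1e-12, 0.0, 0.0)))   # False
```

This does not remove the hard-coded cutoff; it only makes it an explicit parameter. Passing more than two coordinates to math.hypot assumes Python 3.8 or later.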
I have a binary search implemented in Python.
Now I want to check whether the element math.floor(n ^ (1/p)) can be found by my binary search.
But p is a very, very large number. I wrote using fractions module:
binary_search.search(list,int (n**fractions.Fraction('1'+'/'+str(p))))
But I have an error OverflowError: integer division result too large for a float
How can I raise n to a fractional power, and do it fast?
Unless your values of n are also incredibly large, floor(n^(1/p)) is going to tend toward 1 for "very, very large" values of p. Since you're only interested in the integer portion, you could get away with a simple loop to test if 1^p, 2^p, 3^p and so on are greater than n.
Don't waste time finding exact values if you don't need them.
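The loop described above might look like the following sketch, which uses only integer arithmetic and assumes n >= 1 and p >= 1. The function name is chosen here for illustration.

```python
def integer_root_floor(n, p):
    """Largest integer r with r**p <= n, found by trying r = 1, 2, 3, ...

    Assumes n >= 1 and p >= 1.
    """
    r = 1
    while (r + 1) ** p <= n:
        r += 1
    return r

print(integer_root_floor(1000, 3))        # 10
print(integer_root_floor(999, 3))         # 9
print(integer_root_floor(10**6, 10**4))   # 1
```

When p is very large the loop stops after the first test, since 2**p already exceeds n. For small p and large n a bisection would be faster than stepping by one.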
n^(1/p)=exp(ln(n)/p) ~~ 1+ln(n)/p for big p values
So you can compare p with natural logarithm of n. If the ratio p/ln(n) >> 1 (much larger), then you can use approximation above (which tends to 1)
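The comparison of p with the logarithm of n can be made exact in the base-2 case: floor(n^(1/p)) equals 1 exactly when n < 2^p. The sketch below uses that base-2 variant, which is a substitution for the natural-log ratio in the answer, and tests it with int.bit_length so no floating point is involved. The function name is hypothetical.

```python
def floor_root_is_one(n, p):
    """True when floor(n ** (1/p)) == 1, i.e. when 1 <= n < 2**p."""
    if n < 1 or p < 1:
        raise ValueError("n and p must be positive integers")
    # n < 2**p holds exactly when n needs at most p bits.
    return n.bit_length() <= p

print(floor_root_is_one(1000, 10))        # True:  1000 < 2**10
print(floor_root_is_one(1024, 10))        # False: 1024 ** (1/10) == 2
print(floor_root_is_one(10**6, 10**4))    # True
```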