I found myself needing to compute the "integer cube root", meaning the cube root of an integer, rounded down to the nearest integer. In Python, we could use the NumPy floating-point cbrt() function:
import numpy as np

def icbrt(x):
    return int(np.cbrt(x))
Though this works most of the time, it fails for certain inputs x, with the result being one less than expected. For example, icbrt(15**3) == 14, which comes about because np.cbrt(15**3) == 14.999999999999998. The following searches the first 100,000 integers for such failures:
print([x for x in range(100_000) if (icbrt(x) + 1)**3 == x])
# [3375, 19683, 27000, 50653] == [15**3, 27**3, 30**3, 37**3]
Question: What is special about 15, 27, 30, 37, ..., making cbrt() return ever so slightly below the exact result? I can find no obvious underlying pattern for these numbers.
A few observations:
The story is the same if we switch from NumPy's cbrt() to that of Python's math module, or if we switch from Python to C (not surprising, as I believe that both numpy.cbrt() and math.cbrt() delegate to cbrt() from the C math library in the end).
Replacing cbrt(x) with x**(1/3) (pow(x, 1./3.) in C) leads to many more cases of failure. Let us stick to cbrt().
For the square root, a similar problem does not arise, meaning that
import numpy as np
def isqrt(x):
    return int(np.sqrt(x))
returns the correct result for all x (tested up to 100,000,000). Test code:
print([x for x in range(100_000) if (y := isqrt(x))**2 != x and (y + 1)**2 <= x])
Extra
As the above icbrt() only seems to fail on inputs that are perfect cubes, we can correct for the occasional mistakes by adding a fixup, like so:
import numpy as np

def icbrt(x):
    y = int(np.cbrt(x))
    if (y + 1)**3 == x:
        y += 1
    return y
A different solution is to stick to exact integer computation, implementing icbrt() without the use of floating-point numbers. This is discussed e.g. in this SO question. An extra benefit of such approaches is that they are (or can be) faster than using the floating-point cbrt().
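For illustration, here is a minimal integer-only sketch of that idea (one possible approach, not necessarily the one from the linked question; the name icbrt_exact is mine), using a Newton iteration on integers similar in spirit to how math.isqrt is often implemented:

def icbrt_exact(n):
    """floor(cbrt(n)) for n >= 0, using integers only."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return 0
    # Start from a power of two that is at least as large as the cube root.
    x = 1 << ((n.bit_length() + 2) // 3)
    while True:
        y = (2 * x + n // (x * x)) // 3   # integer Newton step for y**3 - n
        if y >= x:
            return x
        x = y

print([x for x in range(100_000) if (icbrt_exact(x) + 1)**3 <= x])   # [] -> no failures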
To be clear, my question is not about how to write a better icbrt(), but about why cbrt() fails at some specific inputs.
This problem is caused by a bad implementation of cbrt. It is not caused by floating-point arithmetic because floating-point arithmetic is not a barrier to computing the cube root well enough to return an exactly correct result when the exactly correct result is representable in the floating-point format.
For example, if one were to use integer arithmetic to compute nine-fifths of 80, we would expect a correct result of 144. If a routine to compute nine-fifths of a number were implemented as int NineFifths(int x) { return 9/5*x; }, we would blame that routine for being implemented incorrectly, not blame integer arithmetic for not handling fractions. Similarly, if a routine uses floating-point arithmetic to calculate an incorrect result when a correct result is representable, we blame the routine, not floating-point arithmetic.
Some mathematical functions are difficult to calculate, and we accept some amount of error in them. In fact, for some of the routines in the math library, humans have not yet figured out how to calculate them with correct rounding in a known-bounded execution time. So we accept that not every math routine is correctly rounded.
However, when the mathematical value of a function is exactly representable in a floating-point format, the correct result can be obtained by faithful rounding rather than correct rounding. So this is a desirable goal for math library functions.
Correctly rounded means the computed result equals the number you would obtain by rounding the exact mathematical result to the nearest representable value.¹ Faithfully rounded means the computed result is less than one ULP from the exact mathematical result. An ULP is the unit of least precision, the distance between two adjacent representable numbers.
Correctly rounding a function can be difficult because, in general, a function can be arbitrarily close to a rounding decision point. For round-to-nearest, this is midway between two adjacent representable numbers. Consider two adjacent representable numbers a and b. Their midpoint is m = (a+b)/2. If the mathematical value of some function f(x) is just below m, it should be rounded to a. If it is just above, it should be rounded to b. As we implement f in software, we might compute it with some very small error e. When we compute f(x), if our computed result lies in [m-e, m+e], and we only know the error bound is e, then we cannot tell whether f(x) is below m or above m. And because, in general, a function f(x) can be arbitrarily close to m, this is always a problem: No matter how accurately we compute f, no matter how small we make the error bound e, there is a possibility that our computed value will lie very close to a midpoint m, closer than e, and therefore our computation will not tell us whether to round down or to round up.
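To make the midpoint idea concrete, here is a small sketch (math.nextafter requires Python 3.9+): the exact midpoint between two adjacent doubles is itself not representable, so any computation whose true value lands near it forces a rounding decision.

import math
from fractions import Fraction

a = 15.0
b = math.nextafter(a, math.inf)        # the adjacent representable double above 15.0
m = (Fraction(a) + Fraction(b)) / 2    # the exact mathematical midpoint
print(b - a)                           # 1.7763568394002505e-15, one ULP at 15.0
print(float(m) == a or float(m) == b)  # True: the midpoint is not a double, so
                                       # converting it must round to a neighbour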
For some specific functions and floating-point formats, studies have been made and proofs have been written about how close the functions approach such rounding decision points, and so certain functions like sine and cosine can be implemented with correct rounding with known bounds on the compute time. Other functions have eluded proof so far.
In contrast, faithful rounding is easier to implement. If we compute a function with an error bound less than ½ ULP, then we can always return a faithfully rounded result, one that is within one ULP of the exact mathematical result. Once we have computed some result y, we round that to the nearest representable value² and return that. Starting with y having error less than ½ ULP, the rounding may add up to ½ ULP more error, so the total error is less than one ULP, which is faithfully rounded.
A benefit of faithful rounding is that a faithfully rounded implementation of a function always produces the exact result when the exact result is representable. This is because the next nearest result is one ULP away, but faithful rounding always has an error less than one ULP. Thus, a faithfully rounded cbrt function returns exact results when they are representable.
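We can see this in the question's failing case: on platforms where np.cbrt(15**3) returns 14.999999999999998, the computed result is a full ULP below the exact, representable answer 15.0, so that cbrt implementation is not faithfully rounded there (math.ulp requires Python 3.9+):

import math
import numpy as np

x = 15**3
y = float(np.cbrt(x))
print(y)                            # 14.999999999999998 on affected platforms
print((15.0 - y) / math.ulp(15.0))  # 1.0 -> one full ULP below the exact answer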
What is special about 15, 27, 30, 37, ..., making cbrt() return ever so slightly below the exact result? I can find no obvious underlying pattern for these numbers.
The bad cbrt implementation might compute the cube root by reducing the argument to a value in [1, 8) or a similar interval and then applying a precomputed polynomial approximation. Each addition and multiplication in that polynomial may introduce a rounding error as the result of each operation is rounded to the nearest representable value in floating-point format. Additionally, the polynomial has inherent error. Rounding errors behave somewhat like a random process, sometimes rounding up, sometimes down. As they accumulate over several calculations, they may happen to round in different directions and cancel, or they may round in the same direction and reinforce. If the errors happen to cancel by the end of the calculations, you get an exact result from cbrt. Otherwise, you may get an incorrect result from cbrt.
Footnotes
¹ In general, there is a choice of rounding rules. The default and most common is round-to-nearest, ties-to-even. Others include round-upward, round-downward, and round-toward-zero. This answer focuses on round-to-nearest.
² Inside a mathematical function, numbers may be computed using extended precision, so we may have computed results that are not representable in the destination floating-point format; they will have more precision.
I have the following code:
import numpy as np
float_number_1 = -0.09115307778120041
float_number_2 = -0.41032475233078003
print(float_number_1) #-0.09115307778120041
print(float_number_2) #-0.41032475233078003
my_array = np.array([[float_number_1, float_number_2]], dtype=np.float64)

for number in my_array:
    print(number)  # [-0.09115308 -0.41032475]
Now when I add the following:
np.set_printoptions(precision=50)
and print my_array again I get the full numbers:
[-0.09115307778120041 -0.41032475233078003]
Is set_printoptions just for display purposes or does this affect the actual precision of the numbers held in the numpy array?
I need to keep the precision for calculation purposes.
Also, what is the maximum precision I can set for float64?
Yes, it is just there for display purposes. The storage format of the numbers is not altered. From the numpy.set_printoptions() documentation:
These options determine the way floating point numbers, arrays and other NumPy objects are displayed.
(Bold emphasis mine)
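A quick way to convince yourself that only the display changes (a small sketch, using the value from your question):

import numpy as np

a = np.array([-0.09115307778120041], dtype=np.float64)

np.set_printoptions(precision=4)
print(a)                                    # [-0.0912]  (display only)
print(a[0].item())                          # -0.09115307778120041  (stored value intact)
print(float(a[0]) == -0.09115307778120041)  # True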
A 64-bit floating point number has 53 bits of significand precision, so the smallest relative step (the machine epsilon) is 2^-52, or about 2.220446049250313e-16; that corresponds to roughly 16 decimal digits, so there is not much point in going beyond np.set_printoptions(precision=17) (17 significant digits are enough to uniquely identify any float64 value).
Note that the setting for floatmode also matters; the default is 'maxprec_equal', which means that the number of digits actually shown depends on the actual values in your array; if you set precision=16, but your array values can be all be uniquely represented with just 4 decimals, then numpy will not use more. Only if floatmode is set to 'fixed' will numpy stick to a larger precision setting.
On what it means to uniquely represent floating point numbers: because floating point numbers are an approximation using binary fractions, there is a whole range of real numbers that would all result in the exact same floating point representation, e.g. 0.5 and 0.50000000000000005 both end up as the binary value 00111111000000000000000000000000 (the 32-bit pattern for 0.5). NumPy strives to find the fewest decimal digits that still uniquely identify the stored floating point value, and shows you that.
This is more of a numerical analysis rather than programming question, but I suppose some of you will be able to answer it.
In the sum of two floats, is there any precision lost? Why?
In the sum of a float and an integer, is there any precision lost? Why?
Thanks.
In the sum of two floats, is there any precision lost?
If both floats have differing magnitude and both are using the complete precision range (of about 7 decimal digits) then yes, you will see some loss in the last places.
Why?
This is because floats are stored in the form (sign) (mantissa) × 2^(exponent). If two values have differing exponents and you add them, then the smaller value will get reduced to fewer digits in the mantissa (because it has to adapt to the larger exponent):
PS> [float]([float]0.0000001 + [float]1)
1
In the sum of a float and an integer, is there any precision lost?
Yes, a normal 32-bit integer is capable of representing values exactly which do not fit exactly into a float. A float can still store approximately the same number, but no longer exactly. Of course, this only applies to numbers that are large enough, i.e. longer than 24 bits.
Why?
Because float has 24 bits of precision and (32-bit) integers have 32. float will still be able to retain the magnitude and most of the significant digits, but the last places may likely differ:
PS> [float]2100000050 + [float]100
2100000100
The precision depends on the magnitude of the original numbers. In floating point, the computer represents the number 312 internally as scientific notation:
3.12000000000 * 10 ^ 2
The decimal places in the left hand side (mantissa) are fixed. The exponent also has an upper and lower bound. This allows it to represent very large or very small numbers.
If you try to add two numbers which are the same in magnitude, the result should remain the same in precision, because the decimal point doesn't have to move:
312.0 + 643.0 <==>
3.12000000000 * 10 ^ 2 +
6.43000000000 * 10 ^ 2
-----------------------
9.55000000000 * 10 ^ 2
If you tried to add a very big and a very small number, you would lose precision because they must be squeezed into the above format. Consider 312 + 1230000000000 (1.23 * 10 ^ 12). First you have to scale the smaller number to line up with the bigger one, then add:
1.23000000000 * 10 ^ 12 +
0.00000000031 * 10 ^ 12
-----------------------
1.23000000031 * 10 ^ 12 <-- precision lost here! (the trailing 2 of 312 is gone)
Floating point can handle very large, or very small numbers. But it can't represent both at the same time.
As for ints and doubles being added, the int gets turned into a double immediately, then the above applies.
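A quick Python illustration of both effects with doubles (53 significand bits), assuming nothing beyond the standard float type:

big = 2.0**53            # 9007199254740992.0
print(big + 1.0 == big)  # True: the added 1.0 is too small to register
n = 2**53 + 1
print(float(n) == n)     # False: this int already loses precision as a float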
When adding two floating point numbers, there is generally some error. D. Goldberg's "What Every Computer Scientist Should Know About Floating-Point Arithmetic" describes the effect and the reasons in detail, and also how to calculate an upper bound on the error, and how to reason about the precision of more complex calculations.
When adding a float to an integer, the integer is first converted to a float by C++, so two floats are being added and error is introduced for the same reasons as above.
The precision available for a float is limited, so of course there is always the risk that any given operation drops precision.
The answer for both your questions is "yes".
If you try adding a very large float to a very small one, you will for instance have problems.
Or if you try to add an integer to a float, where the integer uses more bits than the float has available for its mantissa.
The short answer: a computer represents a float with a limited number of bits, which is often done with mantissa and exponent, so only a few bytes are used for the significant digits, and the others are used to represent the position of the decimal point.
If you were to try to add (say) 10^23 and 7, then it won't be able to accurately represent that result. A similar argument applies when adding a float and integer -- the integer will be promoted to a float.
In the sum of two floats, is there any precision lost?
In the sum of a float and an integer, is there any precision lost? Why?
Not always. If the sum is representable with the available precision, you won't get any precision loss.
Example: 0.5 + 0.75 => no precision loss
x * 0.5 => no precision loss (except if x is so small that the result becomes denormal)
In the general case, one adds floats of slightly different magnitudes, so there is some precision loss, which depends on the rounding mode.
I.e.: if you're adding numbers with totally different magnitudes, expect precision problems.
Denormals are there to give extra precision in extreme cases, at the expense of CPU time.
Depending on how your compiler handles floating-point computation, results can vary.
With strict IEEE semantics, adding two 32-bit floats should not give better accuracy than 32 bits.
In practice it may require more instructions to ensure that, so you shouldn't rely on accurate and repeatable results with floating point.
In both cases yes:
assert( 1E+36f + 1.0f == 1E+36f );
assert( 1E+36f + 1 == 1E+36f );
The case float + int is the same as float + float, because a standard conversion is applied to the int. In the case of float + float, this is implementation dependent, because an implementation may choose to do the addition at double precision. There may be some loss when you store the result, of course.
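A NumPy sketch of the difference the working precision makes (float32 and float64 here stand in for C's float and double):

import numpy as np

a = np.float32(1e8)
b = np.float32(1.0)
print(a + b == a)                     # True: the 1.0 is lost at float32 precision
print(np.float64(a) + np.float64(b))  # 100000001.0: preserved when added as doubles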
In both cases, the answer is "yes". When adding an int to a float, the integer is converted to floating point representation before the addition takes place anyway.
To understand why, I suggest you read this gem: What Every Computer Scientist Should Know About Floating-Point Arithmetic.
Why do some numbers lose accuracy when stored as floating point numbers?
For example, the decimal number 9.2 can be expressed exactly as a ratio of two decimal integers (92/10), both of which can be expressed exactly in binary (0b1011100/0b1010). However, the same ratio stored as a floating point number is never exactly equal to 9.2:
32-bit "single precision" float: 9.19999980926513671875
64-bit "double precision" float: 9.199999999999999289457264239899814128875732421875
How can such an apparently simple number be "too big" to express in 64 bits of memory?
In most programming languages, floating point numbers are represented a lot like scientific notation: with an exponent and a mantissa (also called the significand). A very simple number, say 9.2, is actually this fraction:
5179139571476070 * 2^-49
Where the exponent is -49 and the mantissa is 5179139571476070. The reason it is impossible to represent some decimal numbers this way is that both the exponent and the mantissa must be integers. In other words, all floats must be an integer multiplied by an integer power of 2.
9.2 may be simply 92/10, but 10 cannot be expressed as 2^n if n is limited to integer values.
Seeing the Data
First, a few functions to see the components that make a 32- and 64-bit float. Gloss over these if you only care about the output (example in Python):
import struct
from itertools import islice

def float_to_bin_parts(number, bits=64):
    if bits == 32:    # single precision
        int_pack = 'I'
        float_pack = 'f'
        exponent_bits = 8
        mantissa_bits = 23
        exponent_bias = 127
    elif bits == 64:  # double precision. all python floats are this
        int_pack = 'Q'
        float_pack = 'd'
        exponent_bits = 11
        mantissa_bits = 52
        exponent_bias = 1023
    else:
        raise ValueError('bits argument must be 32 or 64')
    bin_iter = iter(bin(struct.unpack(int_pack, struct.pack(float_pack, number))[0])[2:].rjust(bits, '0'))
    return [''.join(islice(bin_iter, x)) for x in (1, exponent_bits, mantissa_bits)]
There's a lot of complexity behind that function, and it'd be quite the tangent to explain, but if you're interested, the important resource for our purposes is the struct module.
Python's float is a 64-bit, double-precision number. In other languages such as C, C++, Java and C#, double-precision has a separate type double, which is often implemented as 64 bits.
When we call that function with our example, 9.2, here's what we get:
>>> float_to_bin_parts(9.2)
['0', '10000000010', '0010011001100110011001100110011001100110011001100110']
Interpreting the Data
You'll see I've split the return value into three components. These components are:
Sign
Exponent
Mantissa (also called Significand, or Fraction)
Sign
The sign is stored in the first component as a single bit. It's easy to explain: 0 means the float is a positive number; 1 means it's negative. Because 9.2 is positive, our sign value is 0.
Exponent
The exponent is stored in the middle component as 11 bits. In our case, 0b10000000010. In decimal, that represents the value 1026. A quirk of this component is that you must subtract a number equal to 2^(# of bits - 1) - 1 to get the true exponent; in our case, that means subtracting 0b1111111111 (decimal number 1023) to get the true exponent, 0b00000000011 (decimal number 3).
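You can check the bias arithmetic directly in the interpreter:

>>> int('10000000010', 2)          # the raw stored exponent field
1026
>>> int('10000000010', 2) - 1023   # subtract the bias to get the true exponent
3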
Mantissa
The mantissa is stored in the third component as 52 bits. However, there's a quirk to this component as well. To understand this quirk, consider a number in scientific notation, like this:
6.0221413 × 10^23
The mantissa would be the 6.0221413. Recall that the mantissa in scientific notation always begins with a single non-zero digit. The same holds true for binary, except that binary only has two digits: 0 and 1. So the binary mantissa always starts with 1! When a float is stored, the 1 at the front of the binary mantissa is omitted to save space; we have to place it back at the front of our third element to get the true mantissa:
1.0010011001100110011001100110011001100110011001100110
This involves more than just a simple addition, because the bits stored in our third component actually represent the fractional part of the mantissa, to the right of the radix point.
When dealing with decimal numbers, we "move the decimal point" by multiplying or dividing by powers of 10. In binary, we can do the same thing by multiplying or dividing by powers of 2. Since our third element has 52 bits, we divide it by 2^52 to move it 52 places to the right:
0.0010011001100110011001100110011001100110011001100110
In decimal notation, that's the same as dividing 675539944105574 by 4503599627370496 to get 0.1499999999999999. (This is one example of a ratio that can be expressed exactly in binary, but only approximately in decimal; for more detail, see: 675539944105574 / 4503599627370496.)
Now that we've transformed the third component into a fractional number, adding 1 gives the true mantissa.
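You can check in the interpreter that the raw integer value of those 52 bits, and the power of two we divided by, are exactly the numerator and denominator quoted above:

>>> int('0010011001100110011001100110011001100110011001100110', 2)
675539944105574
>>> 2**52
4503599627370496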
Recapping the Components
Sign (first component): 0 for positive, 1 for negative
Exponent (middle component): Subtract 2^(# of bits - 1) - 1 to get the true exponent
Mantissa (last component): Divide by 2^(# of bits) and add 1 to get the true mantissa
Calculating the Number
Putting all three parts together, we're given this binary number:
1.0010011001100110011001100110011001100110011001100110 x 10^11 (all digits binary)
Which we can then convert from binary to decimal:
1.1499999999999999 x 2^3 (inexact!)
And multiply to reveal the final representation of the number we started with (9.2) after being stored as a floating point value:
9.1999999999999993
Representing as a Fraction
9.2
Now that we've built the number, it's possible to reconstruct it into a simple fraction:
1.0010011001100110011001100110011001100110011001100110 x 10^11 (binary)
Shift mantissa to a whole number:
10010011001100110011001100110011001100110011001100110 x 10^(11-110100) (binary)
Convert to decimal:
5179139571476070 x 2^(3-52)
Subtract the exponent:
5179139571476070 x 2^-49
Turn negative exponent into division:
5179139571476070 / 2^49
Multiply exponent:
5179139571476070 / 562949953421312
Which equals:
9.1999999999999993
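A quick sanity check of this reconstruction with Python's fractions module (Fraction(9.2) converts the stored double exactly):

from fractions import Fraction

print(Fraction(9.2) == Fraction(5179139571476070, 2**49))   # True
print(5179139571476070 / 2**49)                             # 9.2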
9.5
>>> float_to_bin_parts(9.5)
['0', '10000000010', '0011000000000000000000000000000000000000000000000000']
Already you can see the mantissa is only 4 digits followed by a whole lot of zeroes. But let's go through the paces.
Assemble the binary scientific notation:
1.0011 x 10^11 (binary)
Shift the decimal point:
10011 x 10^(11-100) (binary)
Subtract the exponent:
10011 x 10^-1 (binary)
Binary to decimal:
19 x 2^-1
Negative exponent to division:
19 / 2^1
Multiply exponent:
19 / 2
Equals:
9.5
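And a quick check, reusing float_to_bin_parts() from above, that 19/2 really is stored as exactly the same value as 9.5:

>>> 19 / 2 == 9.5
True
>>> float_to_bin_parts(19 / 2) == float_to_bin_parts(9.5)
True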
Further reading
The Floating-Point Guide: What Every Programmer Should Know About Floating-Point Arithmetic, or, Why don’t my numbers add up? (floating-point-gui.de)
What Every Computer Scientist Should Know About Floating-Point Arithmetic (Goldberg 1991)
IEEE Double-precision floating-point format (Wikipedia)
Floating Point Arithmetic: Issues and Limitations (docs.python.org)
Floating Point Binary
This isn't a full answer (mhlester already covered a lot of good ground I won't duplicate), but I would like to stress how much the representation of a number depends on the base you are working in.
Consider the fraction 2/3
In good-ol' base 10, we typically write it out as something like
0.666...
0.666
0.667
When we look at those representations, we tend to associate each of them with the fraction 2/3, even though only the first representation is mathematically equal to the fraction. The second and third representations/approximations have an error on the order of 0.001, which is actually much worse than the error between 9.2 and 9.1999999999999993. In fact, the second representation isn't even rounded correctly! Nevertheless, we don't have a problem with 0.666 as an approximation of the number 2/3, so we shouldn't really have a problem with how 9.2 is approximated in most programs. (Yes, in some programs it matters.)
Number bases
So here's where number bases are crucial. If we were trying to represent 2/3 in base 3, then
(2/3)₁₀ = 0.2₃
In other words, we have an exact, finite representation for the same number by switching bases! The take-away is that even though you can convert any number to any base, all rational numbers have exact finite representations in some bases but not in others.
To drive this point home, let's look at 1/2. It might surprise you that even though this perfectly simple number has an exact representation in base 10 and 2, it requires a repeating representation in base 3.
(1/2)₁₀ = 0.5₁₀ = 0.1₂ = 0.1111...₃
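Here is a small helper (the name expand and its signature are my own, purely for illustration) that prints the digits of a fraction in an arbitrary base, reproducing the expansions above:

from fractions import Fraction

def expand(frac, base, digits=12):
    """Digits of frac (with 0 <= frac < 1) after the radix point in the given base."""
    out = []
    for _ in range(digits):
        frac *= base
        d = int(frac)        # next digit
        out.append(str(d))
        frac -= d
    return "0." + "".join(out)

print(expand(Fraction(2, 3), 10))  # 0.666666666666
print(expand(Fraction(2, 3), 3))   # 0.200000000000
print(expand(Fraction(1, 2), 3))   # 0.111111111111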
Why are floating point numbers inaccurate?
Because often-times, they are approximating rationals that cannot be represented finitely in base 2 (the digits repeat), and in general they are approximating real (possibly irrational) numbers which may not be representable in finitely many digits in any base.
While all of the other answers are good there is still one thing missing:
It is impossible to represent irrational numbers (e.g. π, sqrt(2), log(3), etc.) precisely!
And that actually is why they are called irrational. No amount of bit storage in the world would be enough to hold even one of them. Only symbolic arithmetic is able to preserve their precision.
If you limit your math needs to rational numbers only, the problem of precision becomes manageable. You would need to store a pair of (possibly very big) integers a and b to hold the number represented by the fraction a/b. All your arithmetic would have to be done on fractions just like in high-school math (e.g. a/b * c/d = ac/bd).
But of course you would still run into the same kind of trouble when pi, sqrt, log, sin, etc. are involved.
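Python's fractions module is one example of exactly this approach (a small sketch):

from fractions import Fraction

print(Fraction(9, 2) * Fraction(2, 3))      # 3       (a/b * c/d = ac/bd, then reduced)
print(Fraction(92, 10) + Fraction(1, 3))    # 143/15, still exact
print(Fraction(92, 10) == Fraction(46, 5))  # True: fractions are kept in lowest terms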
TL;DR
In hardware-accelerated arithmetic, only a limited set of rational numbers can be represented. Every non-representable number is approximated. Some numbers (e.g. the irrationals) can never be represented exactly, no matter the system.
There are infinitely many real numbers (so many that you can't enumerate them), and there are infinitely many rational numbers (it is possible to enumerate them).
The floating-point representation is a finite one (like anything in a computer), so unavoidably many, many numbers are impossible to represent. In particular, 64 bits only allow you to distinguish among only 18,446,744,073,709,551,616 different values (which is nothing compared to infinity). With the standard convention, 9.2 is not one of them. Those that can be represented are of the form m · 2^e for some integers m and e.
You might come up with a different numeration system, base 10 for instance, where 9.2 would have an exact representation. But other numbers, say 1/3, would still be impossible to represent.
Also note that double-precision floating-point numbers are extremely accurate. They can represent any number in a very wide range with as many as 15 exact digits. For daily life computations, 4 or 5 digits are more than enough. You will never really need those 15, unless you want to count every millisecond of your lifetime.
Why can we not represent 9.2 in binary floating point?
Floating point numbers are (simplifying slightly) a positional numbering system with a restricted number of digits and a movable radix point.
A fraction can only be expressed exactly using a finite number of digits in a positional numbering system if the prime factors of the denominator (when the fraction is expressed in its lowest terms) are factors of the base.
The prime factors of 10 are 5 and 2, so in base 10 we can represent any fraction of the form a/(2^b 5^c).
On the other hand, the only prime factor of 2 is 2, so in base 2 we can only represent fractions of the form a/(2^b).
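In code, the test "is this fraction exact in binary?" boils down to checking whether the reduced denominator is a power of two (a small sketch; exact_in_base2 is my own name):

from fractions import Fraction

def exact_in_base2(numer, denom):
    """True if numer/denom has a finite binary expansion."""
    d = Fraction(numer, denom).denominator   # denominator in lowest terms
    return d & (d - 1) == 0                  # power of two?

print(exact_in_base2(92, 10))   # False -> 9.2 cannot be stored exactly
print(exact_in_base2(19, 2))    # True  -> 9.5 can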
Why do computers use this representation?
Because it's a simple format to work with and it is sufficiently accurate for most purposes. Basically the same reason scientists use "scientific notation" and round their results to a reasonable number of digits at each step.
It would certainly be possible to define a fraction format, with (for example) a 32-bit numerator and a 32-bit denominator. It would be able to represent numbers that IEEE double precision floating point could not, but equally there would be many numbers that can be represented in double precision floating point that could not be represented in such a fixed-size fraction format.
However the big problem is that such a format is a pain to do calculations on. For two reasons.
If you want to have exactly one representation of each number, then after each calculation you need to reduce the fraction to its lowest terms. That means that for every operation you basically need to do a greatest common divisor calculation.
If after your calculation you end up with an unrepresentable result because the numerator or denominator has grown too big, you need to find the closest representable result. This is non-trivial.
Some languages do offer fraction types, but usually they do it in combination with arbitrary precision. This avoids needing to worry about approximating fractions, but it creates its own problem: when a number passes through a large number of calculation steps, the size of the denominator, and hence the storage needed for the fraction, can explode.
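You can watch this blow-up happen with Python's Fraction type (an arbitrary repeated calculation, purely for illustration):

from fractions import Fraction

x = Fraction(1, 3)
for step in range(8):
    x = x * x + Fraction(1, 7)
    print(step, x.denominator.bit_length())   # the denominator's bit length roughly doubles each step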
Some languages also offer decimal floating point types; these are mainly used in scenarios where it is important that the results the computer gets match pre-existing rounding rules that were written with humans in mind (chiefly financial calculations). These are slightly more difficult to work with than binary floating point, but the biggest problem is that most computers don't offer hardware support for them.
I noticed a glitch while using math.sin(math.pi).
The answer should have been 0 but instead it is 1.2246467991473532e-16.
If the statement is math.sin(math.pi/2) the answer is 1.0 which is correct.
why this error?
The result is normal: numbers in computers are usually represented with floats, which have a finite precision (they are stored in only a few bytes). This means that only a finite number of real numbers can be represented by floats. In particular, π cannot be represented exactly, so math.pi is not π but a very good approximation of it. This is why math.sin(math.pi) does not have to be sin(π) but only something very close to it.
The precise value that you observe for math.sin(math.pi) is understandable: the relative precision of (double precision) floats is about 1e-16. This means that math.pi can be wrong by about π*1e-16 ~ 3e-16. Since sin(π-ε) ~ ε, the value that you obtain with math.sin(math.pi) can be in the worst case ~3e-16 (in absolute value), which is the case (this calculation is not supposed to give the exact value but just the correct order of magnitude, and it does).
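A rough check of that order-of-magnitude argument (math.ulp requires Python 3.9+): the value returned by math.sin(math.pi) is a small fraction of one ULP of math.pi, i.e. about the size of math.pi's own representation error:

import math

err = math.sin(math.pi)
print(err)                       # 1.2246467991473532e-16
print(err / math.ulp(math.pi))   # ~0.28: math.pi is about a quarter ULP away from the true pi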
Now, the fact that math.sin(math.pi/2) == 1 is not shocking: it may be (I haven't checked) that math.pi/2 (a float) is so close to the exact value π/2 that the float which is the closest to sin(math.pi/2) is exactly 1. In general, you can expect the result of a function applied to a floating point number to be off by about 1e-16 relative (or be about 1e-16 instead of 0).