Float versus logged values in Python

I am calculating relative frequencies of words (word count / total number of words). This results in quite a few very small numbers (e.g. 1.2551539760140076e-05). I have read about some of the issues with using floats in this context, e.g. in this article
A float has roughly seven decimal digits of precision ...
Some suggest using logged values instead. I am going to multiply these numbers and was wondering:
In general, is the seven digit rule something to go by in Python?
In my case, should I use log values instead?
What bad things could happen if I don't -- just a less accurate value or straight up errors, e.g. in the multiplication?
And if so, do I just convert the float with math.log()? I feel that, at that point, the information is already lost.
Any help is much appreciated!

That article talks about the type float in C, which is a 32-bit quantity. The Python type float is a 64-bit number, like C's double, and can therefore store roughly 15-17 significant decimal digits (53 significand bits instead of C float's 24). While that too can be too little precision for some applications, it's much less dire than with 32-bit floats.
Furthermore, because it is a floating point format, small numbers such as 1.2551539760140076e-05 (which actually isn't that small) are not inherently disadvantaged. While only about 17 decimal digits can be represented, these 17 digits need not be the first 17 digits after the decimal point. They can be shifted around, so to speak¹. In fact, you used the same concept of a floating (decimal) point when you gave the number as a bunch of decimal digits times a power of ten (e-5). To give extreme examples, 10^-300 can be represented just fine², as can 10^300; problems happen only when these two numbers meet (1e300 + 1e-300 == 1e300).
As for a log representation, you would take the log of all values as early as possible and perform as many calculations as possible in log space. In your example you'd calculate the relative frequency of a word as log(word_count) - log(total_words), which is the same as log(word_count / total_words) but possibly more accurate.
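A minimal sketch of that approach (the word counts below are made-up example data, not from the question):

import math

# Hypothetical counts; in practice these come from your corpus.
word_counts = {"the": 1200, "tardigrade": 3}
total_words = 95000

# Relative frequency in log space, without ever forming the tiny ratio itself.
log_rel_freq = {w: math.log(c) - math.log(total_words) for w, c in word_counts.items()}

# Multiplying probabilities becomes adding their logarithms.
log_product = sum(log_rel_freq.values())

# Leave log space only at the very end, if the result is representable at all.
print(log_product, math.exp(log_product))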
What bad things could happen if I don't -- just a less accurate value or straight up errors, e.g. in the multiplication?
I'm not sure what the distinction is. Numeric calculations can have almost perfect accuracy (relative rounding error on the scale of 2^-50 or better), but unstable algorithms can also give laughably bad results in some cases. There are quite strict bounds on the rounding error of each individual operation³, but in longer calculations they interact in surprising ways and can cause very large errors. For example, even just summing up a large list of floats can introduce significant error, especially if they are of very different magnitudes and signs (a short demonstration follows the footnotes below). The proper analysis and design of reliable numeric algorithms is an art of its own which I cannot do justice to here, but thanks to the good design of IEEE-754, most algorithms usually work out okay. Don't worry too much about it, but don't ignore it either.
¹ In reality we're talking about 53 binary digits being shifted around, but this is unimportant for this concept. Decimal floating point formats exist, too.
² With a relative rounding error of at most 2^-53 (the unit roundoff of a 64-bit float); a nonzero error occurs for any fraction whose denominator isn't a power of two, including such mundane ones as 1/3 or 0.1.
³ For basic arithmetic operations, the rounding error is at most half a unit in the last place, i.e., the result must be computed exactly and then rounded correctly. For transcendental functions the error is rarely more than one or two units in the last place, but it can be larger.
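To make the summation point concrete, here is a small, self-contained demonstration (the values are arbitrary); math.fsum keeps track of the low-order bits that a naive left-to-right sum throws away:

import math

# Many values of wildly different magnitude and sign (arbitrary example data).
values = [1e16, 1.0, -1e16, 1.0] * 1000

print(sum(values))        # 1.0: almost every 1.0 is absorbed by the huge running total
print(math.fsum(values))  # 2000.0: the correctly rounded exact sum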

Related

In NumPy, how to use a float that is larger than float64's max value?

I have a calculation that may result in very, very large numbers that won't fit into a float64. I thought about using np.longdouble, but that may not be large enough either.
I'm not so interested in precision (just 8 digits would do for me). It's the decimal part that won't fit. And I need to have an array of those.
Is there a way to represent / hold an unlimited-size number, say, limited only by the available memory? Or if not, what is the absolute max value I can place in a numpy array?
Can you rework the calculation so it works with the logarithms of the numbers instead?
That's pretty much how the built-in floats work in any case...
You would only convert the number back to linear for display, at which point you'd separate the integer and fractional parts; the fractional part gets exponentiated as normal to give the 8 digits of precision, and the integer part goes into the "×10ⁿ" or "×eⁿ" or "×2ⁿ" part of the output (depending on what base logarithm you use).
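A minimal numpy sketch of that idea (the factors array is made-up; in practice it would hold whatever quantities you are multiplying together):

import numpy as np

# Hypothetical factors whose product is far beyond float64's maximum (~1.8e308).
factors = np.full(1000, 1e308)

# Keep only the base-10 logarithm of the product.
log10_value = np.log10(factors).sum()

# Convert back to "mantissa x 10^exponent" form only for display.
exponent = int(np.floor(log10_value))
mantissa = 10.0 ** (log10_value - exponent)
print(f"{mantissa:.8f}e+{exponent}")   # roughly 1.00000000e+308000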

Transferring a double from C++ to python without loss of precision

I have some C++ code which outputs an array of double values. I want to use these double values in python. The obvious and easiest way to transfer the values would of course be dumping them into a file and then rereading the file in python. However, this would lead to loss of precision, because not all decimal places may be transferred. On the other hand, if I add more decimal places, the file gets larger. The array I am trying to transfer has a few million entries. Hence, my idea is to use the double's binary representation, dump them into a binary file and rereading that in python.
The first problem is that I do not know how the double values are formatted in memory, for example here. It is easy to read the binary representation of an object from memory, but I have to know where the sign bit, the exponent and the mantissa are located. There are of course standards for this. The first question is therefore: how do I know which standard my compiler uses? I want to use g++-9. I tried googling this question for various compilers, but without any precise answer. The next question would be how to turn the bytes back into a double, given the format.
Another possibility may be to compile the C++ code as a python module and use it directly, transferring the array without a file from memory only. But I do not know if this would be easy to set up quickly.
I have also seen that it is possible to compile C++ code directly from a string in python using numpy, but I cannot find any documentation for that.
You could write out the double value(s) in binary form and then read and convert them in python with struct.unpack("d", file.read(8)), thereby assuming that IEEE 754 is used.
There are a couple of issues, however:
C++ does not specify the bit representation of doubles. While it is IEEE 754 on any platform I have come across, this should not be taken for granted.
struct.unpack uses the machine's native byte order unless you say otherwise, so the writer and the reader must agree on endianness. It is safest to be explicit: use "<d" (little endian) or ">d" (big endian) when reading, or fix the byte order before writing.
If this code targets a specific machine, I would advise just testing the approach on that machine.
This code should then not be assumed to work on other architectures, so it is advisable to have checks in your Makefile/CMake files that refuse to build on unexpected targets.
Another approach would be to use a common serialization format, such as protobuf. They essentially have to deal with the same problems but I would argue that they have solved it.
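If you go the raw-bytes route, here is a minimal sketch of the Python reading side, assuming the C++ program wrote the plain 8-byte IEEE-754 doubles to a file called doubles.bin in little-endian order (the file name and layout are assumptions, not from the question):

import struct
import numpy as np

# Read a single double with an explicit byte order ("<" = little endian, "d" = double).
with open("doubles.bin", "rb") as f:
    (first_value,) = struct.unpack("<d", f.read(8))

# For a few million entries, reading them all at once with NumPy is more convenient;
# "<f8" means little-endian 64-bit float.
values = np.fromfile("doubles.bin", dtype="<f8")
print(first_value, values[:5])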
I have not checked it, but Python's C++ interface will most probably store doubles by simply copying the 64-bit binary image they represent, because both languages almost certainly use the same internal representation of binary floating point numbers (the IEEE-754 binary64 format). There is a simple reason for this: both use the floating point hardware to operate on them, and that is the format it requires.
One question arises that you don't address: how have you determined that you are losing precision in the data? Have you only compared printed decimal digits, or have you exported the actual binary representation to check for differences in the bit patterns? A common mistake is to print both numbers with, say, 20 significant digits and then observe differences in the last two or three digits. This happens because doubles represented this way (in binary IEEE-754 format) carry only around 17 significant decimal digits (it depends on the number), so you can see differences at the 17th digit or later simply because the numbers are binary encoded.
What I strongly don't recommend is converting those numbers to a decimal representation and sending them as ASCII strings with too few digits. You can lose precision (in the form of rounding errors) in the encoding, and then again in the decoding phase in Python. Converting a binary floating point number to decimal and back is lossless only if you print enough significant digits (17 for a 64-bit double); shorter, human-friendly representations generally are not. Keep in mind that a number that can be represented exactly in decimal (like 0.1) cannot be represented exactly in binary: you get an infinitely repeating sequence, just as dividing 1.0 by 3.0 in decimal gives a result that is not exact. The opposite direction is different: every finite binary fraction has an exact finite decimal expansion, but most finite decimal fractions do not fit exactly into the 53 bits dedicated to the significand of a 64-bit floating point number.
So, my advice is to recheck where your numbers show differences and compare with what I say here. If the numbers differ only after the 16th significant digit, those differences are fine; they only reflect the different algorithms the C++ library and the Python library use to convert the numbers to decimal. If the differences occur before that, check how floating point numbers are represented in Python, or check whether at some point you lose precision by storing the numbers in a single-precision float variable (this happens more often than one might expect), and see if there is some difference (I don't believe there will be) in the formats used by both environments. By the way, showing such differences in your question would have been a plus, as we could then tell you whether the differences you observe are normal or not.
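If you want to check whether precision is really being lost, compare bit patterns rather than printed digits; a small sketch (the value 0.1 is just an arbitrary example):

import struct

x = 0.1   # arbitrary example value

# The exact bit pattern of the double, independent of any decimal formatting.
print(x.hex())                     # 0x1.999999999999ap-4
print(struct.pack("<d", x).hex())  # the raw little-endian bytes

# A decimal string with 17 significant digits round-trips without loss,
# and Python's repr() already guarantees such a lossless round trip.
assert float(f"{x:.17g}") == x
assert float(repr(x)) == x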

Unexpected rounding error with simple addition [duplicate]

This question already has answers here:
Is floating point math broken?
(31 answers)
Closed 2 years ago.
Good day,
I'm getting a strange rounding error and I'm unsure as to why.
print(-0.0075+0.005)
is coming out in the terminal as
-0.0024999999999999996
Which appears to be throwing off the math in the rest of my program. Given that the numbers in this function could be anywhere between 1 and 0.0001, the number of decimal places can vary.
Any ideas why I'm not getting the expected answer of -0.0025?
Joe,
The 'rounding error' you refer to is a consequence of the arithmetic system Python is using to perform the requested operation (though by no means limited to Python!).
All real numbers need to be represented by a finite number of bits in a computer program, so they are 'rounded' to a suitable representation. This is necessary because with a finite number of bits it is not possible to represent all real numbers exactly (or even all the numbers in a finite interval of the real line). So while you may define a variable as a = 0.005, the computer will store it as something very close to, but not exactly, that. Typically this rounding is done through a floating-point representation, which is the case in standard Python. In the binary version of this system, real numbers are approximated by integers multiplied by powers of 2.
Consequently, operations such as the sum you are performing work on these 'rounded' versions of the numbers and return another rounded result. This means that arithmetic in the computer is always approximate, although it is usually precise enough that we do not care. If the rounding errors are too large for your application, you can try switching to a more precise representation (more bits). You can find a good explainer with examples in Python's docs.
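If you need the exact decimal answer (for example for money-like quantities), the standard library's decimal module avoids this particular surprise, since it stores the values as the decimal digits you wrote; a minimal sketch:

from decimal import Decimal

# Binary floats, as in the question: the inputs are already rounded before the addition.
print(-0.0075 + 0.005)                        # -0.0024999999999999996

# Decimal values constructed from strings are exact, so the sum is exact too.
print(Decimal("-0.0075") + Decimal("0.005"))  # -0.0025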

Getting the most accurate precision with a function involving factorial, division and squaring [duplicate]

I'm using the Decimal class for operations that requires precision.
I would like to use 'largest possible' precision. With this, I mean as precise as the system on which the program runs can handle.
To set a certain precision it's simple:
import decimal
decimal.getcontext().prec = 123 #123 decimal precision
I tried to figure out the maximum precision the 'Decimal' class can compute:
print(decimal.MAX_PREC)
>> 999999999999999999
So I tried to set the precision to the maximum precision (knowing it probably won't work..):
decimal.getcontext().prec = decimal.MAX_PREC
But, of course, this throws a Memory Error (on division)
So my question is: How do I figure out the maximum precision the current system can handle?
Extra info:
import sys
print(sys.maxsize)
>> 9223372036854775807
Trying to do this is a mistake. Throwing more precision at a problem is a tempting trap for newcomers to floating-point, but it's not that useful, especially to this extreme.
Your operations wouldn't actually require the "largest possible" precision even if that was a well-defined notion. Either they require exact arithmetic, in which case decimal.Decimal is the wrong tool entirely and you should look into something like fractions.Fraction or symbolic computation, or they don't require that much precision, and you should determine how much precision you actually need and use that.
If you still want to throw all the precision you can at your problem, then how much precision that actually is will depend on what kind of math you're doing, and how many absurdly precise numbers you're attempting to store in memory at once. This can be determined by analyzing your program and the memory requirements of Decimal objects, or you can instead take the precision as a parameter and binary search for the largest precision that doesn't cause a crash.
I'd like to suggest a function that allows you to estimate your maximum precision for a given operation in a brute force way:
import decimal

def find_optimum(a, b, max_iter):
    for i in range(max_iter):
        print(i)
        c = int((a + b) / 2)
        decimal.getcontext().prec = c
        try:
            dummy = decimal.Decimal(1) / decimal.Decimal(7)  # your operation
            a = c
            print("no fail")
        except MemoryError:
            print("fail")
            dummy = 1
            b = c
        print(c)
        del dummy
This just halves the interval one step at a time and checks whether an error occurs. Calling it with max_iter=10, a=int(1e9) and b=int(1e11) gives:
>>> find_optimum(int(1e9), int(1e11), 10)
0
fail
50500000000
1
no fail
25750000000
2
no fail
38125000000
3
no fail
44312500000
4
fail
47406250000
5
fail
45859375000
6
no fail
45085937500
7
no fail
45472656250
8
no fail
45666015625
9
no fail
45762695312
This may give a rough idea of what you are dealing with. It took approximately half an hour on an i5-3470 with 16 GB of RAM, so you would really only use it for testing purposes.
I don't think there is an exact way of getting the maximum precision for your operation, as you would need exact knowledge of how your operation's memory consumption depends on the precision. I hope this helps you at least a bit, and I would really like to know what you need that kind of precision for.
EDIT: I feel like this really needs to be added, since I read your comments under the top-rated post here. Using arbitrarily high precision in this manner is not the way people calculate constants. You would write something that uses disk space in a smart way (for example, calculating a bunch of digits in RAM and writing that bunch to a text file), but never rely on RAM/swap alone, because that will always limit your results. With modern algorithms to calculate pi, you don't need infinite RAM; you just put another 4 TB hard drive in the machine and let it write the next digits. So much for mathematical constants.
Now for physical constants: they are not precise. They rely on measurement. I'm not quite sure at the moment (will edit), but I think the most exact physical constant has a relative error of about 10**(-8). Throwing more precision at it doesn't make it more exact; you just calculate more wrong digits.
As an experiment though, this was a fun idea, which is why I even posted the answer in the first place.
The maximum precision of the Decimal class is a function of the memory on the device, so there's no good way to set it for the general case. Basically, you're allocating all of the memory on the machine to one variable to get the maximum precision.
If the mathematical operation supports it, long integers will give you unlimited precision. However, you are limited to whole numbers.
Addition, subtraction, multiplication, and simple exponents can be performed exactly with long integers.
Prior to Python 3, the built-in long data type would perform arbitrary precision calculations.
https://docs.python.org/2/library/functions.html#long
In Python >=3, the int data type now represents long integers.
https://docs.python.org/3/library/functions.html#int
One example of an exact integer-math implementation is bitcoind, where transaction calculations require exact values. However, the precision of Bitcoin transactions is limited to 1 "Satoshi"; each Bitcoin is defined as 10^8 (integer) Satoshi.
The Decimal class works similarly under the hood. A Decimal precision of 10^-8 is similar to the Bitcoin-Satoshi paradigm.
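If the operation also involves true division, so that plain integers are not enough on their own, the standard library's fractions.Fraction keeps results exact as a ratio of two arbitrary-precision integers; a small illustration with a made-up expression in the spirit of the question's title:

from fractions import Fraction
from math import factorial

# Made-up expression combining factorial, division and squaring.
x = Fraction(factorial(20), 3 ** 20) ** 2   # stays an exact ratio of big integers

print(x)          # exact numerator/denominator
print(float(x))   # rounded only once, at the very end, for display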
From your reply above:
What if I just wanted to find more digits in pi than already found? what if I wanted to test the irrationality of e or mill's constant.
I get it. I really do. My one SO question, several years old, is about arbitrary-precision floating point libraries for Python. If those are the types of numerical representations you want to generate, be prepared for the deep dive. Decimal/FP arithmetic is notoriously tricky in Computer Science.
Some programmers, when confronted with a problem, think “I know, I’ll use floating point arithmetic.” Now they have 1.999999999997 problems. – #tomscott
I think when others have said it's a "mistake" or "it depends" to wonder what the max precision is for a Python Decimal type on a given platform, they're taking your question more literally than I'm guessing it was intended. You asked about the Python Decimal type, but if you're interested in FP arithmetic for educational purposes -- "to find more digits in pi" -- you're going to need more powerful, more flexible tools than Decimal or float. These built-in Python types don't even come close. Those are good enough for NASA maybe, but they have limits... in fact, the very limits you are asking about.
That's what multiple-precision (or arbitrary-precision) floating point libraries are for: arbitrarily-precise representations. Want to compute pi for the next 20 years? Python's Decimal type won't even get you through the day.
The fact is, multi-precision binary FP arithmetic is still kinda fringe science. For Python, you'll need to install the GNU MPFR library on your Linux box, then you can use the Python library gmpy2 to dive as deep as you like.
Then, the question isn't, "What's the max precision my program can use?"
It's, "How do I write my program so that it'll run until the electricity goes out?"
And that's a whole other problem, but at least it's restricted by your algorithm, not the hardware it runs on.
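For a concrete idea of what that looks like, here is a minimal sketch using gmpy2, assuming the gmpy2 package and the underlying MPFR library are installed; the 1000-bit precision is an arbitrary choice:

import gmpy2

# MPFR precision is set in bits on the current context (1000 bits is ~301 decimal digits).
gmpy2.get_context().precision = 1000

print(gmpy2.const_pi())           # pi at the current precision
print(gmpy2.sqrt(gmpy2.mpfr(2)))  # sqrt(2) at the same precision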

Can Python floating-point error affect statistical test over small numbers?

Can floating-point errors affect my calculations in the following scenario, where the values are small?
My purpose is to compare two sets of values and determine if their means are statistically different.
I handle very small values the usual way in performing large-sample unpaired tests with data like this:
first group (obtained from 100 samples):
first item's mean = 2.7977620220553945e-24
std dev = 3.2257148207429583e-15
second group (obtained from 100 samples):
first item's mean = 3.1086244689504383e-15
std dev = 3.92336102789548e-15
The goal is to find out whether or not the two means are statistically significantly different.
I plan to follow the usual steps of finding the standard error of the difference and the z-score and so on. I will be using Python (or Java).
My question is not about the statistical test but about the potential problem with the smallness of the numbers (floating-point errors).
Should I (must I) approximate each of the above two means to zero (and thus conclude that there is no difference)?
That is, given the smallness of the means, is it computationally meaningless to go about performing the statistical test?
64-bit floating point numbers allot 52 bits for the significand (plus one implicit leading bit). This is approximately 15-16 decimal places (log10(2^52) ~ 15.6). In scientific notation, this is the difference between, say, 1e-9 and 1e-24 (because 10^-9 / 10^-24 == 10^15, i.e. they differ by 15 decimal places).
What does this all mean? Well, it means that if you add 10^-24 to 10^-9, it is just on the border of being too small to show up in the larger number (10^-9).
Observe:
>>> a = 1e-9
>>> a
1e-09
>>> a + 1e-23
1.00000000000001e-09
>>> a + 1e-24
1.000000000000001e-09
>>> a + 1e-25
1e-09
Since the z-score statistics basically involve adding and subtracting a few standard deviations from the mean, they will definitely have problems if the difference in exponents is 16, and it's probably not a good situation if the difference is 14 or 15 either. The difference in your exponents is 9, which still leaves you accuracy of about one part in 10^6 of a standard deviation in the final sum. Since, when we talk about statistical significance, we worry about errors on the order of maybe a tenth of a standard deviation, you should be fine.
For 32-bit floats, the significand gets 23 bits, which is about 6.9 decimal places (Python's float is the 64-bit kind, as noted above).
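To make the scale argument concrete, here is the two-sample z computation with the numbers from the question, assuming n = 100 per group and the usual unpooled standard error; none of the intermediate quantities comes anywhere near float64's limits:

import math

n = 100
mean1, sd1 = 2.7977620220553945e-24, 3.2257148207429583e-15
mean2, sd2 = 3.1086244689504383e-15, 3.92336102789548e-15

# Standard error of the difference between the two sample means.
se = math.sqrt(sd1**2 / n + sd2**2 / n)

z = (mean1 - mean2) / se
print(se, z)   # se is about 5e-16 and z about -6; the arithmetic itself is unproblematic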
In principle, if you work with numbers with the same order of magnitude, the float representation of data is sufficient to retain the same precision as numbers close to 1.
However, it is much more robust to be able to perform the computation on whitened data (data rescaled to zero mean and unit variance).
If whitening is not an option for your use case, you can use an arbitrary-precision representation for non-integer data (Python already offers built-in arbitrary-precision integers), such as decimal or fractions, possibly combined with the statistics module, and do all the computations with that.
EDIT
However, just looking at your numbers, the standard deviation ranges (the intervals [µ-σ, µ+σ]) largely overlap, so you have no evidence that the two means are statistically significantly different. Of course this is meaningful only for (at least asymptotically) normally distributed populations/samples.
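A quick sanity check of that overlap observation with the numbers from the question (this only compares the one-sigma intervals; it is not a formal test):

mu1, s1 = 2.7977620220553945e-24, 3.2257148207429583e-15
mu2, s2 = 3.1086244689504383e-15, 3.92336102789548e-15

lo1, hi1 = mu1 - s1, mu1 + s1
lo2, hi2 = mu2 - s2, mu2 + s2

# The intervals overlap if neither lies entirely to one side of the other.
overlap = max(lo1, lo2) <= min(hi1, hi2)
print((lo1, hi1), (lo2, hi2), overlap)   # True: the one-sigma intervals do overlap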
