I have a problem because MongoDB doesn't seem to maintain precision when incrementing floats. For example, the following should yield 2.0:
from decimal import Decimal  # for python precision
for i in range(40):
    db.test.update({}, {'$inc': {'count': float(Decimal(1) / 20)}}, upsert=True)
print db.test.find_one()['count']
2.000000000000001
How can I get around this issue?
Unfortunately, you can't -- at least not directly. Mongo stores floating-point numbers as double-precision IEEE floats (https://en.wikipedia.org/wiki/IEEE_floating_point), and those rounding errors are inherent to the format.
I'm noticing you're using Decimals in your code -- they're converted to Python floats (which are doubles) before being sent to the DB. If you want to keep your true decimal precision, you'll have to store your numbers as stringified Decimals, which means you'll also have to give up Mongo's number-handling facilities such as $inc.
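If you control the schema, one hedged workaround is to store an integer number of fixed-size "ticks" (hundredths here), so that $inc stays exact, and convert back to Decimal only when you read. This is just a sketch: the count_ticks field name is made up, and it reuses the same db handle and old-style update() call as your snippet.

from decimal import Decimal

TICK = Decimal('0.01')          # smallest unit we care about

def to_ticks(amount):
    # exact integer count of ticks for a Decimal amount
    return int(amount / TICK)

def from_ticks(ticks):
    # back to a Decimal with two places
    return ticks * TICK

for i in range(40):
    db.test.update({}, {'$inc': {'count_ticks': to_ticks(Decimal('0.05'))}},
                   upsert=True)

print from_ticks(db.test.find_one()['count_ticks'])    # 2.00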
It is, sadly, a tradeoff you'll be confronted with in most databases and programming languages: IEEE floating-point numbers are the format CPUs natively deal with, and any attempt to stray from them (for example, to use arbitrary-precision decimals like decimal.Decimal) comes with a big performance and usability penalty.
Related
I have some C++ code which outputs an array of double values. I want to use these double values in python. The obvious and easiest way to transfer the values would of course be dumping them into a file and then rereading the file in python. However, this would lead to loss of precision, because not all decimal places may be transferred. On the other hand, if I add more decimal places, the file gets larger. The array I am trying to transfer has a few million entries. Hence, my idea is to use the double's binary representation, dump them into a binary file and rereading that in python.
The first problem is that I do not know how the double values are formatted in memory. It is easy to read the binary representation of an object from memory, but I have to know where the sign bit, the exponent and the mantissa are located. There are of course standards for this. The first question is therefore: how do I know which standard my compiler uses? I want to use g++-9. I tried googling this question for various compilers, but without any precise answer. The next question would be how to turn the bytes back into a double, given the format.
Another possibility may be to compile the C++ code as a python module and use it directly, transferring the array without a file from memory only. But I do not know if this would be easy to set up quickly.
I have also seen that it is possible to compile C++ code directly from a string in python using numpy, but I cannot find any documentation for that.
You could write out the double value(s) in binary form and then read and convert them in python with struct.unpack("d", file.read(8)), thereby assuming that IEEE 754 is used.
There are a couple of issues, however:
C++ does not specify the bit representation of doubles. While it is IEEE 754 on any platform I have come across, this should not be taken for granted.
Byte order matters: struct.unpack uses the machine's native byte order unless you tell it otherwise, so if the writer and the reader disagree you have to specify the endianness explicitly ('<d' or '>d') when reading, or swap the bytes before writing.
If this code is targeted at a specific machine, I would advise just testing the approach on that machine.
This code should then not be assumed to work on other architectures, so it is advisable to have checks in your Makefile/CMakeLists.txt that refuse to build on unexpected targets.
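On the Python side, a minimal sketch of the struct approach, assuming the C++ program wrote raw IEEE 754 doubles in little-endian order to a file called values.bin (the file name is made up):

import struct

doubles = []
with open('values.bin', 'rb') as f:
    while True:
        chunk = f.read(8)
        if len(chunk) < 8:
            break
        # '<d' forces little-endian; use '>d' if the data is big-endian
        doubles.append(struct.unpack('<d', chunk)[0])

With a few million entries, numpy.fromfile('values.bin', dtype='<f8') reads the same bytes into an array in a single call, if numpy is an option for you.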
Another approach would be to use a common serialization format, such as protobuf. They essentially have to deal with the same problems but I would argue that they have solved it.
I have not checked this, but Python's C++ interface will most likely store doubles by simply copying their binary image (the 64-bit pattern), since both languages almost certainly use the same internal representation of binary floating-point numbers (the IEEE 754 binary64 format). There is a reason for that: both use the floating-point hardware to operate on them, and that is the format the hardware requires.
One question arises that you don't address: how have you determined that you are losing precision in the data? Have you only compared decimal digits, or have you exported the actual binary representation to check for differences in the bit patterns? A common mistake is to print both numbers with, say, 20 significant digits and then observe differences in the last two or three digits. This happens because doubles represented this way (in binary IEEE 754 format) carry only around 17 significant decimal digits (it depends on the number, but you can see differences from the 17th digit onwards, because the numbers are binary encoded).
What I strongly advise against is converting those numbers to a decimal representation and sending them as ASCII strings. You will lose some precision (in the form of rounding errors, see below) in the encoding, and again in the decoding phase in Python. Converting a binary floating-point number to decimal and then back to binary, even at maximum precision, is almost always a lossy process. The problem is that a number that can be represented exactly in decimal (like 0.1) cannot be represented exactly in binary form (you get an infinitely repeating sequence, just as dividing 1.0 by 3.0 in decimal gives a result that is not exact). The opposite direction is different: you can always convert a finite binary fraction exactly into a finite decimal number, but the reverse generally does not fit in the 53 bits dedicated to the significand of a 64-bit floating-point number.
So my advice is to recheck where your numbers differ and compare against what I say here: if the differences show up after the 16th significant digit, they are fine -- they only reflect the different algorithms the C++ and Python libraries use to convert numbers to decimal format. If the differences occur before that, check how Python represents floating-point numbers, and check whether at some point you lose precision by storing the numbers in a single-precision float variable (this happens more often than people expect), then see whether there is any difference (I don't believe there will be) in the formats used by the two environments. By the way, showing such differences in your question would have helped (something you have not done), as we could then tell you whether the differences you observe are normal or not.
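If it helps, here is a tiny check along those lines (assuming IEEE 754 binary64 on both sides): compare the stored bit patterns instead of printed decimal digits, so rounding in the formatting step can't fool you.

import struct, binascii

def bits(x):
    # exact IEEE 754 binary64 pattern of x, big-endian, as hex
    return binascii.hexlify(struct.pack('>d', x))

# identical patterns mean the values are truly identical, regardless of
# how many decimal digits a print statement happens to show
print(bits(0.1) == bits(1.0 / 10.0))    # True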
So currency/money has lots of well-known math issues when using floating point. It seems that in Python, decimal is what money libraries use, but according to the Python docs, decimal is itself based on a floating-point model. So how does it not have the same problems?
context
A lot of currency libraries represent their monetary values as integers (so cents of USD, not dollars). We've just had the issue of a Python application representing its money as decimal; it goes into JavaScript, which then needs to convert it to an integer for another service.
A value like 10.5, multiplied by 100 to get cents, became 1050.0000...1, which is of course not an integer. So I was wondering why Python chose this route, as most recommendations I've seen say to treat money as integers.
You are confusing binary floating point with decimal floating point. From the module documentation:
The decimal module provides support for fast correctly-rounded decimal floating point arithmetic.
[...]
Decimal numbers can be represented exactly. In contrast, numbers like 1.1 and 2.2 do not have exact representations in binary floating point
(bold emphasis mine).
The floating point aspect refers to the variability of the exponent; the number 12300000 can be represented as 123 with a decimal exponent of 5 (123 * 10 ** 5). Both float and decimal use a floating point representation. But float adds up binary fractions (1/2 + 1/4 + 1/8 + 1/16 + ...), and that makes it unsuitable for representing currencies, because binary fractions cannot precisely model the 1/100ths and 1/10ths that currency values use all the time.
The DZone article on floating-point issues for currency that you link to also covers Java's java.math.BigDecimal class. Python's decimal is essentially the same thing; where the BigDecimal documentation talks about values "consist[ing] of an arbitrary precision integer unscaled value and a 32-bit integer scale", the scale is essentially the position of the floating point.
Because decimal can represent 1/100ths (cents) in currency values exactly, it is far more suitable to model currency values.
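You can see the parallel with BigDecimal's unscaled-value-plus-scale pair directly: a Decimal carries an exact integer coefficient and a decimal exponent (output shown for CPython's decimal module).

from decimal import Decimal

price = Decimal('19.99')
print(price.as_tuple())
# (sign=0, digits=(1, 9, 9, 9), exponent=-2), i.e. exactly 1999 * 10**-2,
# with no binary rounding anywhere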
Decimal avoids some of the problems of binary floating-point, but not all, possibly not even most.
The actual problem is not floating-point but numerical formats. No numerical format can represent all real numbers, or even all rational numbers, so no numerical format can handle all the operations we want to do with numbers.
Money is commonly represented in decimal fractions of a unit of currency. For example, the US dollar and many other currencies have a “cent”, which is 1/100th of a dollar. A decimal format can represent 1/100th exactly; a binary format cannot. So, with a decimal format, you can:
Represent decimal units of currency exactly (within bounds of the format).
Add and subtract decimal amounts of currency exactly (within bounds of the format).
Multiply decimal units of currency by integers exactly (within bounds of the format).
However, problems arise when you try:
To average numbers or divide by numbers other than powers of ten (or two or five). For example, if a grocery wants to sell a product at three for a dollar, there is no way to represent ⅓ exactly in a decimal format.
Multiplying numbers with decimal fractions more than a few times. Each multiplication will increase the number of digits after the decimal point. For example, interest compounded monthly for a year cannot be computed exactly with typical decimal formats.
Any complex (in the general sense, not mathematical) operations such as exponentiation that may be involved in considering the time value of money, stock market options evaluation, and so on.
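A short Python sketch of both halves of this tradeoff; the amounts and the half-up rounding policy are just illustrations, not a recommendation.

from decimal import Decimal, ROUND_HALF_UP

# Exact: decimal addition and multiplication by an integer (within bounds)
print(Decimal('0.10') + Decimal('0.20'))    # 0.30
print(Decimal('19.99') * 3)                 # 59.97

# Not exact: dividing by 3 has no finite decimal expansion, so a rounding
# rule has to be chosen explicitly
third = (Decimal('1.00') / 3).quantize(Decimal('0.01'), rounding=ROUND_HALF_UP)
print(third)                                # 0.33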
There is no general solution to how to compute numerically. Studying numerical computing and its errors is an entire field of study with textbooks, courses, and research papers. So you cannot solve numerical problems merely by choosing a format. It is important to understand whatever format(s) you use, what errors arise in using them, how to deal with those errors, and what results you need to achieve.
Decimal types allow decimal floating point rather than binary floating point. The class of problems you are referring to relates to the latter.
I came across a strange behavior in Python (2.6.1) dictionaries:
The code I have is:
new_item = {'val': 1.4}
print new_item['val']
print new_item
And the result is:
1.4
{'val': 1.3999999999999999}
Why is this? It happens with some numbers, but not others. For example:
0.1 becomes 0.1000...001
0.4 becomes 0.4000...002
0.7 becomes 0.6999...996
1.9 becomes 1.8999...999
This is not Python-specific; the issue appears in every language that uses binary floating point (which is pretty much every mainstream language).
From the Floating-Point Guide:
Because internally, computers use a format (binary floating-point)
that cannot accurately represent a number like 0.1, 0.2 or 0.3 at all.
When the code is compiled or interpreted, your “0.1” is already
rounded to the nearest number in that format, which results in a small
rounding error even before the calculation happens.
Some values can be represented exactly as binary fractions, and output formatting routines will often display the shortest decimal that is closer to the actual stored value than to any other floating-point number, which masks some of the rounding errors.
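You can see both effects directly on Python 2.7+ or 3.x (where the Decimal constructor accepts floats and repr() prints the shortest round-tripping string):

from decimal import Decimal

# The exact double that the literal 0.1 is rounded to:
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625

# Newer interpreters print the shortest string that round-trips, so 1.4
# shows as 1.4; Python 2.6's repr showed 17 digits, hence 1.3999999999999999
print(repr(1.4))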
This problem is related to floating point representations in binary, as others have pointed out.
But I thought you might want something that would help you solve your implied problem in Python.
It's unrelated to dictionaries, so if I were you, I would remove that tag.
If you can use a fixed-precision decimal number for your purposes, I would recommend you check out the Python decimal module. From the page (emphasis mine):
Decimal “is based on a floating-point model which was designed with people in mind, and necessarily has a paramount guiding principle – computers must provide an arithmetic that works in the same way as the arithmetic that people learn at school.” – excerpt from the decimal arithmetic specification.
Decimal numbers can be represented exactly. In contrast, numbers like 1.1 and 2.2 do not have exact representations in binary floating point. End users typically would not expect 1.1 + 2.2 to display as 3.3000000000000003 as it does with binary floating point.
The exactness carries over into arithmetic. In decimal floating point, 0.1 + 0.1 + 0.1 - 0.3 is exactly equal to zero. In binary floating point, the result is 5.5511151231257827e-017. While near to zero, the differences prevent reliable equality testing and differences can accumulate. For this reason, decimal is preferred in accounting applications which have strict equality invariants.
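A quick illustration of that last point; note that the Decimals are built from strings, not from floats that are already rounded.

from decimal import Decimal

print(0.1 + 0.1 + 0.1 - 0.3 == 0)                  # False with binary floats
print(Decimal('0.1') * 3 - Decimal('0.3') == 0)    # True with decimal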
I was looking at the Golden Ratio formula for finding the nth Fibonacci number, and it made me curious.
I know Python handles arbitrarily large integers, but what sort of precision do you get with decimals? Is it just straight on top of a C double or something, or does it use a more accurate modified implementation too? (Obviously not with arbitrary accuracy. ;D)
Almost all platforms map Python floats to IEEE-754 “double precision”.
http://docs.python.org/tutorial/floatingpoint.html#representation-error
There's also the decimal module for arbitrary-precision decimal floating-point math.
Python floats use the double type of the underlying C compiler. As Bwmat says, this is generally IEEE-754 double precision.
However if you need more precision than that you can use the Python decimal module which was added in Python 2.4.
Python 2.6 also added the fraction module which may be a better fit for some problems.
Both of these are going to be slower than using the float type, but that is the price for more precision.
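For a rough feel of the three options (the output shown is typical for an IEEE-754 platform):

import sys
from decimal import Decimal, getcontext
from fractions import Fraction

# float: a C double, 53 bits of significand, about 15-17 decimal digits
print(sys.float_info.dig)            # 15

# decimal: pick your own working precision (default is 28 digits)
getcontext().prec = 50
print(Decimal(1) / Decimal(7))       # 0.142857... to 50 significant digits

# fractions: exact rational arithmetic, no precision setting at all
print(Fraction(1, 7) * 7)            # 1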
I'm creating a financial app and it seems my floats in sqlite are floating around. Sometimes a 4.0 will be a 4.000009, and a 6.0 will be a 6.00006, things like that. How can I make these more exact and not affect my financial calculations?
Values are coming from Python if that matters. Not sure which area the messed up numbers are coming from.
Please use Decimal
http://docs.python.org/library/decimal.html
Seeing as this is a financial application, if you only have calculations up to 2 or 3 decimal places, you can store all the data internally as integers, and only convert them to float for presentation purposes.
E.g.
6.00 -> 600
4.35 -> 435
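A minimal sketch of that idea (positive amounts only, helper names made up); parse from strings so a float never enters the picture:

def to_cents(text):
    # '4.35' -> 435
    dollars, _, cents = text.partition('.')
    return int(dollars) * 100 + int((cents + '00')[:2])

def to_display(cents):
    # 435 -> '4.35'
    return '%d.%02d' % (cents // 100, cents % 100)

print(to_cents('4.35'))     # 435
print(to_display(600))      # 6.00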
This is a common problem using SQLite as it does not have a Currency type.
As S.Mark said, you can use the Decimal representation library. However, SQLite 3 only supports binary floating-point numbers (SQLite type REAL), so you would have to store the Decimal-encoded value as either TEXT or a BLOB, or convert it to REAL (but then you'd be back to a 64-bit binary float).
So consider the range of numbers that you need to represent and whether you need to be able to perform calculations from within the Database.
You may be better off using a different DB which supports NUMERIC types, e.g. MySQL, PostgreSQL, Firebird.
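If you do stay on SQLite, one hedged option is to lean on the sqlite3 module's adapter/converter hooks and keep the values as TEXT; the "DECTEXT" type name and the table below are made up for illustration.

import sqlite3
from decimal import Decimal

sqlite3.register_adapter(Decimal, lambda d: str(d))
sqlite3.register_converter("DECTEXT", lambda b: Decimal(b.decode()))

conn = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
conn.execute("CREATE TABLE ledger (amount DECTEXT)")
conn.execute("INSERT INTO ledger VALUES (?)", (Decimal("19.99"),))
print(conn.execute("SELECT amount FROM ledger").fetchone()[0])  # 19.99, as a Decimal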
Most people would probably use Decimal for this; however, if it doesn't map onto a database type, you may take a performance hit.
If performance is important you might want to consider using Integers to represent an appropriate currency unit - often cents or tenths of cents is ok.
There should be business rules about how amounts are to be rounded in various situations and you should have tests covering each scenario.
Use Decimal to manipulate your figures, then use pickle to save it and load it from SQLite as text, since SQLite doesn't handle arbitrary-precision decimal types.
Finally, use unittest and doctest; for financial operations, you want to ensure all the code does what it is supposed to do in any circumstances. You can't fix bugs along the way like with, let's say, a social network...
You have to use decimal numbers.
Decimal numbers can be represented exactly.
In decimal floating point, 0.1 + 0.1 + 0.1 - 0.3 is exactly equal to zero. In binary floating point, the result is 5.5511151231257827e-017.
So, just try decimal:
import decimal
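For example (the amounts and the 8.25% tax rate are made up for illustration; the Decimals are built from strings, and the rounding policy is chosen explicitly):

import decimal
from decimal import Decimal

price = Decimal('4.00')
tax = price * Decimal('0.0825')                 # exactly 0.330000
total = (price + tax).quantize(Decimal('0.01'),
                               rounding=decimal.ROUND_HALF_UP)
print(total)                                    # 4.33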