Float precision differs between elements in pandas dataframe

Float precision differs between elements in pandas dataframe - python

I am trying to read a dataframe from a csv, do some calculations with it and then export the results to another csv. While doing that I noticed that the value 8.1e-202 is getting changed to 8.1000000000000005e-202. But all the other numbers are represented correctly.
Example:
A example.csv looks like this:
id,e-value
ID1,1e-20
ID2,8.1e-202
ID3,9.24e-203
If I do:
import pandas as pd
df = pd.read_csv("example.csv")
df.iloc[1]["e-value"]
>>> 8.1000000000000005e-202
df.iloc[2]["e-value"]
>>> 9.24e-203
Why is 8.1e-202 being altered but 9.24e-203 isn't?
I tried to change the datatype that pandas is using from the default
df["e-value"].dtype
>>> dtype('float64')
to numpy datatypes like this:
import numpy as np
df = pd.read_csv("./temp/test", dtype={"e-value" : np.longdouble})
but this will just result in:
df.iloc[1]["e-value"]
>>> 8.100000000000000522e-202
Can someone explain to me why this is happening? I can't replicate this problem with any other number. Everything bigger or smaller than 8.1e-202 seems to work normally.
EDIT:
To specify my problem. I am aware that floats are not perfect. My actual problem with this is that once I write the dataframe back to a csv the resulting file will then look like this:
id,e-value
ID1,1e-20
ID2,8.1000000000000005e-202
ID3,9.24e-203
And I need the second row to be ID2,8.1e-202
I "fixed" this by just formatting this column before I write the csv, but I'm unhappy with this solution since the formatting will change other elements to something scientific notation where it was just a normal float.
def format_eval(e):
return "{0:.1e}".format(e)
df["e-value"] = df["e-value"].apply(lambda x: format_eval(x))

Float number representation is something not so simple. Not every real number can be represented and almost all (relatively speaking) are actually approximations. Is not like integers, the precision varies and python has a precision undefined float really.
Each floating point standar will have their own set of real numbers that can represent exactly. There's no work around.
https://en.wikipedia.org/wiki/Single-precision_floating-point_format
https://en.wikipedia.org/wiki/IEEE_754-2008_revision
If the problem really is the arithmetic or comparison, you should consider if error will grow or decrease. For example multiplying by large numbers can grow the representation error.
And also, when comparing you should do things like math.is_close. Basically comparing the distance between the numbers.
If you are trying to represent and operate real numbers, that aren't irrational numbers. Like integers, fractions or decimal numbers with finite digits, you can also consider cast to the proper digit representation like: int, decimal or fraction.
See this for further ideas:
https://davidamos.dev/the-right-way-to-compare-floats-in-python/#:~:text=How%20To%20Compare%20Floats%20in%20Python&text=If%20abs(a%20%2D%20b),rel_tol%20keyword%20argument%20of%20math.

Related

Xlwings reading range of floats as decimals, and truncates decimals

I am trying to read in a 2 dimensional range of values from a ".xlsb" file using xlwings. The range contains a series of formulas, that returns floats. When I read in the values, it gets read in as Decimals rather than floats. The problem is Decimals beyond 4 spots get truncated. For example, I have a value in excel of 0.0913495 but it gets read in as Decimal('0.0913'). To make matters worse, when I try converting these decimals to floats, I see that any precision beyond 4 decimal places has been completely ignored. For example calling float(Decimal('0.0913')) returns 0.0913!
So far I have tried the following to fix this problem, none have worked:
Set precision to 28 by calling decimal.getcontext().prec = 28. I have also tried 7, 8, etc. This seems to change nothing.
Use the .options method: sheet.range("myrange").options(numbers = lambda x : float(x)).value
Tried ".raw_value"
Ironically (2) still returns numbers as decimals, it is as if my options got ignored.
This is a problem as for my particular application I rely on a higher degree of accuracy than 4 decimals places, yet xlwings refuses to read in the estimated values at any precision beyond 4 decimal places. How do I fix this?
For reference, I am using xlwings 0.23.0 with Python 3.8.8 and Excel version 2108 (Build 14326.20238)

Storing bernoulli number is giving overflow error in python even after using decimal module

I'm trying to store first 1000 bernoulli numbers in a dictionary in python. At first I just stored the numbers as it is. So I got an overflow error. Now after going through previous answers I thought of using decimal module.
So here it is
-5218507479961513801890596392421261361036935624312258325065379143295948300812040703848766095836974598734762472300638625802884257082786883956679824964010841565051175167717451747328911935282639583972372470105587187736495055501208701522099921363239317373617854217050435670713936357978555246779460902210809009009539232173 / 2291190
The 260th bernouli number. I was able to store all the previous ones in the dictionary.
This is the sample code I've written.
from decimal import *
d = Decimal
getcontext().prec = 10000
di = {260: d(-5218507479961513801890596392421261361036935624312258325065379143295948300812040703848766095836974598734762472300638625802884257082786883956679824964010841565051175167717451747328911935282639583972372470105587187736495055501208701522099921363239317373617854217050435670713936357978555246779460902210809009009539232173 / 2291190)}
This is the error snap shot
Is there any better way to handle such huge numbers ? Please tell me if there is some thing that can be done to store these numbers.

You should convert the large number to Decimal before doing the division, i.e.:
(Note the end of brackets)
di = {260: d(-5218507479961513801890596392421261361036935624312258325065379143295948300812040703848766095836974598734762472300638625802884257082786883956679824964010841565051175167717451747328911935282639583972372470105587187736495055501208701522099921363239317373617854217050435670713936357978555246779460902210809009009539232173) / 2291190}

Python iterate the list with decimal precision

I used netCDF Python library to read netCDF variable which has list(variable) returns correct decimal precision as in the image (using PyCharm IDE). However, when I try to get the element by index, e.g: variable[0], it returns the rounded value instead (e.g: 5449865.55794), while I need 5449865.55793999997.
How can I iterate this list with correct decimal precision ?
Some basic code
from netCDF4 import Dataset
nc_dataset = Dataset(self.get_file().get_filepath(), "r")
variable_name = "E"
// netCDF file contains few variables (axis dimensions)
variable = nc_dataset.variables[variable_name]
variable is not a list but a netCDF object, however when using list() or variable[index] will return element value of the axis dimension.

The decimals you are chasing are bogus. The difference is only in the way these numbers are represented on your screen, not in how they are stored in your computer.
Try the following to convince yourself
a = 5449865.55793999997
a
# prints 5449865.55794
The difference between the two numbers if we were to take them literally is 3x10^-11. The smallest difference a 64 bit floating point variable at the size of a can resolve is more than an order of magnitude larger. So your computer cannot tell these two decimal numbers apart.
But look at the bright side. Your data aren't corrupted by some mysterious process.

Hope this is what suits your needs:
import decimal
print decimal.Decimal('5449865.55793999997')

vector magnitude for large components

I noticed that numpy has a built in function linalg.norm(vector), which produces the magnitude. For small values I get the desired output
>>> import numpy as np
>>> np.linalg.norm([0,2])
2.0
However for large values:
>>> np.linalg.norm([0,149600000000])
2063840737.6330884
This is a huge error, what could I do instead. Making my own function seems to produce the same error. What is the problem here, is a rounding error this big?, and what can I do instead?

Your number is written as an integer, and yet it is too big to fit into a numpy.int32. This problem seems to happen even in python3, where
the native numbers are big.
In numerical work I try to make everything floating point unless it is an index. So I tried:
In [3]: np.linalg.norm([0.0,149600000000.0])
Out[3]: 149600000000.0
To elaborate: in this case Adding the .0 was an easy way of turning integers into doubles. In more realistic code, you might have incoming data which is of uncertain type. The safest (but not always the right) thing to do is just coerce to a floating point array at the top of your function.
def do_something_with_array(arr):
arr = np.double(arr) # or np.float32 if you prefer.
... do something ...

Python plotting

I have a question about the plotting. I want to plot some data between ranges :
3825229325678980.0786812569752124806963380417361932
and
3825229325678980.078681262584097479512892231994772
but I get the following error:
Attempting to set identical bottom==top results
in singular transformations; automatically expanding.
bottom=3.82522932568e+15, top=3.82522932568e+15
How should I increase the decimal points here to solve the problem?

The difference between your min and max value is less than the precision an eps of a double (~1e-15).
Basically using a 4-byte floating point representation you can not distinguish between the two numbers.
I suggest to remove all the integer digits from your data and represent only the decimal part. The integer part is only a big constant that you can always add later.

It might be easiest to scale your data to provide a range that looks less like zero.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Float precision differs between elements in pandas dataframe - python

Related

Xlwings reading range of floats as decimals, and truncates decimals

Storing bernoulli number is giving overflow error in python even after using decimal module

Python iterate the list with decimal precision

vector magnitude for large components

Python plotting

Categories

Resources