Why does a float64 value 123456789.0 in a pandas DataFrame get converted to 123456792.0, preserving only about 7 significant digits?
import pandas as pd

df = pd.DataFrame([123456789.0])
print(df)
#              0
# 0  123456789.0

df = df.astype('float32')
print(df)
#              0
# 0  123456792.0
float32 here is NumPy's single-precision dtype. The reason you see a difference in precision when converting float64 to float32 is that 123456789.0 cannot be represented exactly in float32, which is a 32-bit dtype (1 sign bit, 8 exponent bits, 23 mantissa bits, giving roughly 7 decimal digits of precision).
In general, float32 requires half the memory that float64 requires to represent a numerical value; in exchange, float32 represents numbers less accurately than float64.
Note that there is no workaround for this. If you need to represent particular numbers that cannot be represented exactly in a 32-bit dtype like float32, go for a higher-precision dtype (float64).
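You can check the rounding directly in NumPy, without pandas involved (a minimal sketch):
import numpy as np

# float32 values near 1.2e8 are spaced 8 apart, so 123456789.0 gets
# rounded to the nearest representable value, 123456792.0
print(float(np.float32(123456789.0)))  # 123456792.0
print(np.finfo(np.float32).nmant)      # 23 stored mantissa bits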
My dataframe looks something like this:
import pandas as pd
df = pd.read_sql('select * from foo', con)  # con: your database connection
a b c
0.1 0.2 0.3
0.3 0.4 0.5
If I directly run df['a'] * df['b'], the result is not exact as I expected, because of floating-point issues.
I tried
from decimal import Decimal
df['a'].apply(Decimal) * df['b'].apply(Decimal)
But when I inspect df['a'].apply(Decimal) in PyCharm, the column turns out to contain something strange (here is just an example, not the real numbers):
a
0.09999999999999999
0.30000000000001231
I wonder how to do exact multiplication in pandas.
The problem is not in pandas but in floating-point inaccuracy: decimal.Decimal(0.1) is Decimal('0.1000000000000000055511151231257827021181583404541015625') on my 64-bit system.
A simple trick would be to first change the floats to strings, because pandas knows enough about string conversion to properly round the values:
from decimal import Decimal

x = df['a'].astype(str).apply(Decimal) * df['b'].astype(str).apply(Decimal)
You will get a nice Series of Decimal:
>>> print(x.values)
[Decimal('0.02') Decimal('0.12')]
So you get exact decimal operations, which can matter if you process monetary values.
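Putting it together as a self-contained sketch (the column values here are stand-ins for the database query result):
from decimal import Decimal
import pandas as pd

# stand-in for the result of pd.read_sql
df = pd.DataFrame({'a': [0.1, 0.3], 'b': [0.2, 0.4]})

# round-trip through strings so Decimal sees the rounded display value,
# not the binary float's full expansion
x = df['a'].astype(str).apply(Decimal) * df['b'].astype(str).apply(Decimal)
print(x.values)  # [Decimal('0.02') Decimal('0.12')]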
I have a large DataFrame (circa 4e+07 rows).
When summing it, I get 2 significantly different results depending on whether I do the sum before or after the column selection.
Also, the dtype changes from float32 to float64, even though the totals are all below 2**31.
df[[col1, col2, col3]].sum()
Out[1]:
col1 9.36e+07
col2 1.39e+09
col3 6.37e+08
dtype: float32
df.sum()[[col1, col2, col3]]
Out[2]:
col1 1.21e+08
col2 1.70e+09
col3 7.32e+08
dtype: float64
I am obviously missing something. Has anybody had the same issue?
Thanks for your help.
To understand what's going on here, you need to understand what Pandas is doing under the hood. I'm going to simplify a bit, since there are lots of bells and whistles and special cases to consider, but roughly it looks like this:
Suppose you've got a Pandas DataFrame object df with various numeric columns (we'll ignore datetime columns, categorical columns, and the like). When you compute df.sum(), Pandas:
1. Extracts the values of the dataframe into a two-dimensional NumPy array.
2. Applies the NumPy sum function to that 2d array with axis=0 to compute the column sums.
It's the first step that's important here. The columns of a DataFrame might have different dtypes, but a 2d NumPy array can only have a single dtype. If df has a mixture of float32 and int32 columns (for example), Pandas has to choose a single dtype that's appropriate for both columns simultaneously, and in this case it chooses float64. So when the sum is computed, it's computed on double-precision values, using double-precision arithmetic. This is what's happening in your second example.
On the other hand, if you cut down to just the float32 columns in the first place, then Pandas can and will use the float32 dtype for the 2d NumPy array, and so the sum computation is performed in single precision. This is what's happening in your first example.
Here's a simple example showing this in action: we'll set up a DataFrame with 100 million rows and three columns, of dtypes float32, float32 and int32 respectively. All the values are ones:
>>> import numpy as np, pandas as pd
>>> s = np.ones(10**8, dtype=np.float32)
>>> t = np.ones(10**8, dtype=np.int32)
>>> df = pd.DataFrame(dict(A=s, B=s, C=t))
>>> df.head()
A B C
0 1.0 1.0 1
1 1.0 1.0 1
2 1.0 1.0 1
3 1.0 1.0 1
4 1.0 1.0 1
>>> df.dtypes
A float32
B float32
C int32
dtype: object
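You can check the dtype promotion directly by looking at the values array that pandas extracts (same df as above):
>>> df.values.dtype
dtype('float64')
>>> df[['A', 'B']].values.dtype
dtype('float32')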
Now when we compute the sums directly, Pandas first turns everything into float64s. The computation is also done using the float64 type, for all three columns, and we get an accurate answer.
>>> df.sum()
A 100000000.0
B 100000000.0
C 100000000.0
dtype: float64
But if we first cut down our dataframe to just the float32 columns, then float32-arithmetic is used for the sum, and we get very poor answers.
>>> df[['A', 'B']].sum()
A 16777216.0
B 16777216.0
dtype: float32
The inaccuracy is of course due to using a dtype that doesn't have enough precision for the task in question: at some point in the summation, we end up repeatedly adding 1.0 to 16777216.0 (that's 2**24, the point beyond which float32 can no longer represent every consecutive integer) and getting 16777216.0 back each time, thanks to the usual floating-point problems. The solution is to explicitly convert to float64 yourself before doing the computation.
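You can watch a single addition get stuck at that threshold:
>>> float(np.float32(2**24) + np.float32(1.0))
16777216.0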
However, this isn't quite the end of the surprises that Pandas has in store for us. With the same dataframe as above, let's try just computing the sum for column "A":
>>> df[['A']].sum()
A 100000000.0
dtype: float32
Suddenly we're getting full accuracy again! So what's going on? This has little to do with dtypes: we're still using float32 to do the summation. It's now the second step (the NumPy summation) that's responsible for the difference.
What's happening is that NumPy can, and sometimes does, use a more accurate summation algorithm, called pairwise summation, and with float32 dtype and arrays of the size we're using, that accuracy can make a hugely significant difference to the final result. However, NumPy only uses that algorithm when summing along the fastest-varying axis of an array; see this NumPy issue for related discussion.
In the case where we compute the sum of both column "A" and column "B", we end up with a values array of shape (100000000, 2). The fastest-varying axis is axis 1, and we're computing the sum along axis 0, so the naive summation algorithm is used and we get poor results. But if we only ask for the sum of column "A", we get the accurate result, computed using pairwise summation.
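The same effect is easy to reproduce in pure NumPy (a sketch; exact behaviour can vary with NumPy version):
>>> float(np.ones(10**8, dtype=np.float32).sum())      # 1-D: pairwise summation
100000000.0
>>> np.ones((10**8, 2), dtype=np.float32).sum(axis=0)  # along the slow axis: naive
array([16777216., 16777216.], dtype=float32)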
In sum, when working with DataFrames of this size, you want to be careful to (a) work with double precision rather than single precision whenever possible, and (b) be prepared for differences in output results due to NumPy making different algorithm choices.
You can lose precision with np.float32 relative to np.float64
np.finfo(np.float32)
finfo(resolution=1e-06, min=-3.4028235e+38, max=3.4028235e+38, dtype=float32)
And
np.finfo(np.float64)
finfo(resolution=1e-15, min=-1.7976931348623157e+308, max=1.7976931348623157e+308, dtype=float64)
A contrived example
df = pd.DataFrame(dict(
    x=[-60499999.315, 60500002.685] * int(2e7),
    y=[-60499999.315, 60500002.685] * int(2e7),
    z=[-60499999.315, 60500002.685] * int(2e7),
)).astype(dict(x=np.float64, y=np.float32, z=np.float32))
print(df.sum()[['y', 'z']], df[['y', 'z']].sum(), sep='\n\n')
y 80000000.0
z 80000000.0
dtype: float64
y 67108864.0
z 67108864.0
dtype: float32
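The specific numbers here aren't arbitrary, by the way. float32 values near 6e7 are spaced 4.0 apart, so the cast itself rounds each entry, and every pair then sums to exactly 4.0; accumulated in float64, 2e7 pairs give 8e7, while the float32 accumulation stalls at 2**26 = 67108864, where the float32 spacing (8.0) is too coarse for an increment of 4.0 to register:
print(float(np.float32(-60499999.315)), float(np.float32(60500002.685)))
# -60500000.0 60500004.0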
I am trying to filter my dataframe with the following line of code:
df = df[df['P'] <= cutoff]
However, if for example I set cutoff to 0.1, numbers such as 0.100496 make it through the filter.
My suspicion is that my initial dataframe has entries in scientific notation as well as plain float format. Could this be affecting the rounding and precision? Is there a potential workaround for this issue?
Thank you in advance.
EDIT: I am reading from a file. Here is a sample of the total data.
2.29E-98
1.81E-42
2.19E-35
3.35E-30
0.0313755
0.0313817
0.03139
0.0313991
0.0314062
0.1003476
0.1003483
0.1003487
0.1003521
0.100496
Floating point comparison isn't perfect. For example
>>> 0.10000000000000000000000000000000000001 <= 0.1
True
Have a look at numpy.isclose. It allows you to compare floats and set a tolerance for the comparison.
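For instance, a sketch of a tolerance-aware version of your filter (the column name and cutoff are taken from the question; the atol value is an assumption you should tune to your data):
import numpy as np

cutoff = 0.1
# keep rows strictly below the cutoff, plus rows equal to it within tolerance
df = df[(df['P'] < cutoff) | np.isclose(df['P'], cutoff, atol=1e-9)]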
Similar question here
Ok, I have a pandas dataframe like this:
lat long level date time value
3341 29.232 -15.652 10.0 20100109.0 700.0 0.5
3342 27.887 -13.668 120.0 20100109.0 700.0 3.2
...
3899 26.345 -11.234 0.0 20100109.0 700.0 5.8
The reason for the strange index numbers is that the dataframe comes from a CSV converted to a pandas dataframe with some rows filtered out. The columns level, date and time are not really relevant.
I am trying, in IPython, to look at some rows by filtering on latitude, so I do (if the dataframe is c):
c[c['lat'] == 26.345]
or
c.loc[c['lat'] == 26.345]
and I can see whether the value is present or not, but sometimes it outputs nothing for latitude values that I can see in the dataframe!? For instance, I can see latitude 27.702 in the dataframe, but when I do c[c['lat'] == 27.702] or c.loc[c['lat'] == 27.702] I get an empty dataframe, even though the value for that latitude is right there. What is happening here?
Thank you.
This is probably because you are asking for an exact match against floating point values, which is very, very dangerous. They are approximations, often printed to less precision than actually stored.
It's very easy to see 0.735471 printed, say, and think that's all there is, when in fact the value is really 0.73547122072282867; the display function has simply truncated the result. But when you try a strict equality test on the attractively short value, boom. Doesn't work.
Instead of
c[c['lat'] == 26.345]
Try:
import numpy as np
c[np.isclose(c['lat'], 26.345)]
Now you'll get rows whose values are within a certain range of the value you specified; the tolerance is adjustable via the rtol and atol arguments.
It is a bit difficult to give a precise answer, as the question does not contain a reproducible example, but let me try. Most probably, this is due to floating-point issues. It is possible that the number you see (and try to compare with) is not the same number that is stored in memory, due to rounding. For example:
import numpy as np
x = 0.1
arr = np.array([x + x + x])
print(arr)
# [ 0.3]
print(arr[arr == 0.3])
# []
print(x + x + x)
# 0.30000000000000004
# in fact 0.1 is not exactly equal to 1/10,
# so 0.1 + 0.1 + 0.1 is not equal to 0.3
You can overcome this issue using np.isclose instead of ==:
print(np.isclose(arr, 0.3))
# [ True]
print(arr[np.isclose(arr, 0.3)])
# [ 0.3]
In addition to the answers addressing comparison of floating-point values, some of the values in your lat column may be of string type instead of numeric.
EDIT: You indicated that this is not the problem, but I'll leave this response here in case it helps someone else. :)
Use the to_numeric() function from pandas to convert them to numeric.
import pandas as pd
df['lat'] = pd.to_numeric(df['lat'])
# you can adjust the errors parameter as you need
df['lat'] = pd.to_numeric(df['lat'], errors='coerce')
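A quick way to check whether this applies is to inspect the column's dtype (object usually means the column holds strings; a properly numeric column would show float64 or similar):
print(df['lat'].dtype)
# object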