My dataframe looks something like this:
import pandas as pd
df = pd.read_sql('select * from foo', con)  # con: an already open database connection
a b c
0.1 0.2 0.3
0.3 0.4 0.5
If I directly run df['a'] * df['b'], the result is not as exact as I expected because of floating-point issues.
I tried
from decimal import Decimal
df['a'].apply(Decimal) * df['b'].apply(Decimal)
But when I inspect df['a'].apply(Decimal) with PyCharm, the column turns out to be something strange, here is just an example, not real numbers:
a
0.09999999999999999
0.30000000000001231
I wonder how to do exact multiplication in pandas.
The problem is not in pandas but in floating-point inaccuracy: decimal.Decimal(0.1) is Decimal('0.1000000000000000055511151231257827021181583404541015625') on my 64-bit system.
A simple trick would be to first change the floats to strings, because pandas knows enough about string conversion to properly round the values:
x = df['a'].astype(str).apply(Decimal) * df['b'].astype(str).apply(Decimal)
You will get a nice Series of Decimal:
>>> print(x.values)
[Decimal('0.02') Decimal('0.12')]
So you get exact decimal operations, which can matter if you process monetary values...
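To make the difference concrete, here is a minimal standard-library sketch (the variable names are illustrative, not from the question) contrasting a Decimal built from a float with a Decimal built from the float's string form:

```python
from decimal import Decimal

a, b = 0.1, 0.2

# Built from floats: each Decimal carries the binary rounding error of its float.
from_floats = Decimal(a) * Decimal(b)

# Built from strings: str() gives the short repr, so the Decimals are exactly 0.1 and 0.2.
from_strings = Decimal(str(a)) * Decimal(str(b))

print(from_floats == Decimal("0.02"))  # False
print(from_strings)                    # 0.02
```

The astype(str) step in the pandas one-liner plays the same role as str() here.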
I am trying to add two columns together using the Decimal module in Python but can't seem to get the syntax right for this. I have 2 columns called month1 and month2 and do not want these to become floats at any point in the outcome as division and then rounding will later be required.
The month1 and month2 columns already have several decimal places, as they are averages, and I need to preserve this accuracy in the addition.
I can see guidance online for how to add numbers together using Decimal but not how to apply it to columns in a pandas dataframe. I've tried things like:
df['MonthTotal'] = Decimal.decimal(df['Month1']) + Decimal.decimal(df['Month1'])
What is the solution?
from decimal import Decimal
def convert_decimal(row):
    row["monthtotal"] = Decimal(row["month1"]) + Decimal(row["month2"])
    return row
df = df.apply(convert_decimal, axis=1)
decimal.Decimal is designed to accept a single value, not a pandas.Series of them. Assuming that your columns hold strings representing numbers, you can use .applymap to apply decimal.Decimal element-wise (in pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map), i.e.:
import decimal
import pandas as pd
df = pd.DataFrame({'x':['0.1','0.1','0.1'],'y':['0.1','0.1','0.1'],'z':['0.1','0.1','0.1']})
df_decimal = df.applymap(decimal.Decimal)
df_decimal["total"] = df_decimal.x + df_decimal.y + df_decimal.z
print(df_decimal.total[0])
print(type(df_decimal.total[0]))
output
0.3
<class 'decimal.Decimal'>
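For the month1/month2 case specifically, a sketch along the following lines may help. It assumes the averages are available as strings (the sample values and the "half" column are made up for illustration), converts each column with Series.map, and uses Decimal.quantize for the later division and rounding step:

```python
from decimal import Decimal, ROUND_HALF_UP

import pandas as pd

# Hypothetical data: averages kept as strings so no float is ever created.
df = pd.DataFrame({"month1": ["10.125", "3.3333"],
                   "month2": ["0.005", "1.6667"]})

# Convert element-wise; the resulting columns have dtype object and hold Decimals.
month1 = df["month1"].map(Decimal)
month2 = df["month2"].map(Decimal)
df["monthtotal"] = month1 + month2

# Divide, then round to 2 places with an explicit rounding mode.
cent = Decimal("0.01")
df["half"] = df["monthtotal"].map(
    lambda d: (d / 2).quantize(cent, rounding=ROUND_HALF_UP)
)

print(df["half"].tolist())  # [Decimal('5.07'), Decimal('2.50')]
```

Columns of Decimal objects are slower than float64 columns, so this trade-off is mostly worth it when exactness matters more than speed.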
What I am looking to do is make it so that regardless of the value, it displays 2 decimal places.
What I have tried thus far:
DF['price'] = DF['price'].apply(lambda x: round(x, 2))
However, the problem is that I wish to display everything in 2 decimal places, but values like 0.5 are staying at 1 decimal place since they don't need to be rounded.
Is there a function I can apply that gives the following type of output:
Current     After Changes
0           0.00
0.5         0.50
1.01        1.01
1.133333    1.13
Ideally, these values will be rounded but I am open to truncating if that is all that works.
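I think you want something like this:
DF['price'] = DF['price'].apply(lambda x: "{:.2f}".format(x))
This applies the change just to that column. Note that the column then holds strings rather than floats; converting the result back with float() would drop the trailing zeros again.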
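The difference between rounding a value and formatting it for display can be seen in a small plain-Python sketch (the sample prices are the ones from the table in the question):

```python
prices = [0, 0.5, 1.01, 1.133333]

# round() changes the value but says nothing about how it is printed.
rounded = [round(p, 2) for p in prices]

# Formatting produces strings with a fixed number of decimal places.
formatted = ["{:.2f}".format(p) for p in prices]

print(rounded)    # [0, 0.5, 1.01, 1.13]
print(formatted)  # ['0.00', '0.50', '1.01', '1.13']
```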
You have to set the precision for pandas display. Put this on top of your script after importing pandas:
import pandas as pd
pd.set_option('precision', 2)
If you want to only modify the format of your values without doing any operation in pandas, you should just execute the following instruction:
pd.options.display.float_format = "{:,.2f}".format
You should be able to get more info here:
https://pandas.pydata.org/docs/user_guide/options.html#number-formatting
Try the full option name (the short name 'precision' can be ambiguous in recent pandas versions):
import pandas as pd
pd.set_option('display.precision', 2)
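If you would rather not change the setting globally, one possible approach is pd.option_context, which applies a display option only inside a with-block. The sketch below uses the sample prices from the question and leaves the stored floats untouched:

```python
import pandas as pd

df = pd.DataFrame({"price": [0, 0.5, 1.01, 1.133333]})

# The float_format option only affects how floats are rendered.
with pd.option_context("display.float_format", "{:.2f}".format):
    shown = df.to_string()

print(shown)

# The underlying values keep their full precision.
print(df["price"].iloc[3])
```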
I am trying to alter my dataframe with the following line of code:
df = df[df['P'] <= cutoff]
However, if for example I set cutoff to be 0.1, numbers such as 0.100496 make it through the filter.
My suspicion is that my initial dataframe has entries in both scientific notation and plain float format. Could this be affecting the rounding and precision? Is there a potential workaround for this issue?
Thank you in advance.
EDIT: I am reading from a file. Here is a sample of the total data.
2.29E-98
1.81E-42
2.19E-35
3.35E-30
0.0313755
0.0313817
0.03139
0.0313991
0.0314062
0.1003476
0.1003483
0.1003487
0.1003521
0.100496
Floating point comparison isn't perfect. For example
>>> 0.10000000000000000000000000000000000001 <= 0.1
True
Have a look at numpy.isclose. It allows you to compare floats and set a tolerance for the comparison.
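As a rough sketch of how that could look for a "less than or equal" cutoff: the values below are a few of the samples from the question plus one made-up entry equal to the cutoff, parsed from text with pd.to_numeric, and the tolerance atol=1e-12 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Scientific notation and plain decimals both parse to float64.
raw = ["2.29E-98", "0.0313755", "0.1", "0.1003476", "0.100496"]
df = pd.DataFrame({"P": pd.to_numeric(pd.Series(raw))})

cutoff = 0.1

# Keep values strictly below the cutoff, or equal to it within a tolerance.
mask = (df["P"] < cutoff) | np.isclose(df["P"], cutoff, rtol=0, atol=1e-12)
filtered = df[mask]

print(len(filtered))  # 3
```

With the column parsed as numbers, 0.1003476 and 0.100496 are excluded, so if such values still pass your filter it is worth checking df['P'].dtype first.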
I am aware of Python having floating point errors when using the normal types. That is why I am using Pandas instead.
I suddenly started having some issues with data I input (not calculation) and cannot explain the following behavior:
In [600]: df = pd.DataFrame([[0.05], [0.05], [0.05], [0.05]], columns = ['a'])
In [601]: df.dtypes
Out[601]:
a float64
dtype: object
In [602]: df['a'].sum()
Out[602]: 0.20000000000000001
In [603]: df['a'].round(2).sum()
Out[603]: 0.20000000000000001
In [604]: (df['a'] * 1000000).round(0).sum()
Out[604]: 200000.0
In [605]: (df['a'] * 1000000).round(0).sum() / 1000000
Out[605]: 0.20000000000000001
Hopefully somebody can help me either fix this or figure out how to correctly sum 0.2 (or I don't mind if the result is 20 or 2000, but as you can see when I then divide I get to the same point where the sum is incorrect!).
(to run my code remember to do import pandas as pd)
Ok so this works:
In [642]: (((df * 1000000).round(0)) / 1000000).sum()
Out[642]:
a 0.2
dtype: float64
But this doesn't:
In [643]: (((df * 1000000).round(0))).sum() * 1000000
Out[643]:
a 2.000000e+11
dtype: float64
So you have to do all calculations inside the pandas array or risk breaking things.
"I get to the same point where the sum is incorrect!" By your definition of incorrect nearly all floating point operations would be incorrect. Only numbers that are finite sums of powers of 2 (such as 0.5, 0.75 or 3.0) are represented exactly by floating point values; everything else carries a rounding error around the 16th or 17th significant digit (for double precision floats). Some applications just hide this error better than others when displaying these values.
That precision is far more than sufficient for the data you are using.
If you are bothered by the ugly-looking output, then you can do "{:.1f}".format(value) to round the output string to 1 decimal digit after the point or "{:g}".format(value) to automatically select a reasonable number of digits for display.
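A short standard-library sketch may make this more tangible. It suggests that the sum in the question is the very same double as the literal 0.2, and that the long output is just that double printed with 17 significant digits:

```python
from decimal import Decimal

total = sum([0.05] * 4)

print(total == 0.2)            # True
print(format(total, ".17g"))   # 0.20000000000000001
print("{:.1f}".format(total))  # 0.2
print("{:g}".format(total))    # 0.2

# Decimal(float) exposes the exact stored value, so it shows which
# numbers are representable without error.
print(Decimal(0.75) == Decimal("0.75"))  # True  (0.75 = 1/2 + 1/4)
print(Decimal(0.1) == Decimal("0.1"))    # False
```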
Ok, I have a pandas dataframe like this:
        lat     long  level        date   time  value
3341 29.232  -15.652   10.0  20100109.0  700.0    0.5
3342 27.887  -13.668  120.0  20100109.0  700.0    3.2
...
3899 26.345  -11.234    0.0  20100109.0  700.0    5.8
The index numbers look strange because the dataframe comes from a CSV converted to a pandas dataframe with some values filtered out. The columns level, date, and time are not really relevant.
I am trying, in IPython, to look at some rows by filtering on latitude, so I do (if the dataframe is c):
c[c['lat'] == 26.345]
or
c.loc[c['lat'] == 26.345]
and I can see if the value is present or not, but sometimes it outputs nothing for latitude values that I am seeing in the dataframe !?! (For instance, I can see in the dataframe the value of latitude 27.702 and when I do c[c['lat'] == 27.702] or c.loc[c['lat'] == 27.702] I get an empty dataframe and I am seeing the value for such latitude). What is happening here?
Thank you.
This is probably because you are asking for an exact match against floating point values, which is very, very dangerous. They are approximations, often printed to less precision than actually stored.
It's very easy to see 0.735471 printed, say, and think that's all there is, when in fact the value is really 0.73547122072282867; the display function has simply truncated the result. But when you try a strict equality test on the attractively short value, boom. Doesn't work.
Instead of
c[c['lat'] == 26.345]
Try:
import numpy as np
c[np.isclose(c['lat'], 26.345)]
Now you'll get values that are within a certain range of the value you specified. You can set the tolerance.
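Here is a small sketch of that idea. The frame is made up for illustration: one latitude is stored with extra digits that the default display would hide. The rtol and atol arguments of np.isclose control the tolerance; the values used in the last filter are an arbitrary example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the third value displays as 27.702 but is not exactly that.
c = pd.DataFrame({"lat": [29.232, 27.887, 27.7020000001, 26.345]})

exact = c[c["lat"] == 27.702]                                # strict equality
close = c[np.isclose(c["lat"], 27.702)]                      # default tolerances
tight = c[np.isclose(c["lat"], 27.702, rtol=0, atol=1e-6)]   # absolute tolerance only

print(len(exact), len(close), len(tight))  # 0 1 1
```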
It is a bit difficult to give a precise answer, as the question does not contain a reproducible example, but let me try. Most probably, this is due to floating point issues. It is possible that the number you see (and try to compare with) is not the same number that is stored in memory, due to rounding. For example:
import numpy as np
x = 0.1
arr = np.array([x + x + x])
print(np.array([x + x + x]))
# [ 0.3]
print(arr[arr == 0.3])
# []
print(x + x + x)
# 0.30000000000000004
# in fact 0.1 is not exactly equal to 1/10,
# so 0.1 + 0.1 + 0.1 is not equal to 0.3
You can overcome this issue using np.isclose instead of ==:
print(np.isclose(arr, 0.3))
# [ True]
print(arr[np.isclose(arr, 0.3)])
# [ 0.3]
In addition to the answers addressing comparison on floating point values, some of the values in your lat column may be string type instead of numeric.
EDIT: You indicated that this is not the problem, but I'll leave this response here in case it helps someone else. :)
Use the to_numeric() function from pandas to convert them to numeric.
import pandas as pd
df['lat'] = pd.to_numeric(df['lat'])
# you can adjust the errors parameter as you need
df['lat'] = pd.to_numeric(df['lat'], errors='coerce')
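To illustrate the string case, here is a minimal sketch with a made-up column that holds text, including one unparseable entry. Comparing the text column with a float finds nothing, and errors='coerce' turns the bad entry into NaN:

```python
import pandas as pd

# Hypothetical column read in as text rather than numbers.
df = pd.DataFrame({"lat": pd.Series(["26.345", "27.702", "n/a"], dtype=object)})

# Strings never equal a float, so this finds no rows.
matches_as_text = (df["lat"] == 26.345).sum()

# After conversion the column is float64; "n/a" becomes NaN.
df["lat"] = pd.to_numeric(df["lat"], errors="coerce")

print(matches_as_text)          # 0
print(df["lat"].dtype)          # float64
print(df["lat"].isna().sum())   # 1
```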