I am trying to alter my dataframe with the following line of code:
df = df[df['P'] <= cutoff]
However, if, for example, I set cutoff to 0.1, numbers such as 0.100496 make it through the filter.
My suspicion is that my initial dataframe has entries in both scientific notation and plain float format. Could this be affecting the rounding and precision? Is there a potential workaround to this issue?
Thank you in advance.
EDIT: I am reading from a file. Here is a sample of the total data.
2.29E-98
1.81E-42
2.19E-35
3.35E-30
0.0313755
0.0313817
0.03139
0.0313991
0.0314062
0.1003476
0.1003483
0.1003487
0.1003521
0.100496
Floating point comparison isn't perfect. For example:
>>> 0.10000000000000000000000000000000000001 <= 0.1
True
Have a look at numpy.isclose. It allows you to compare floats and set an explicit tolerance for the comparison.
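A minimal sketch of how that could look for the filter in the question (the dataframe contents here are assumed from the sample above):
import numpy as np
import pandas as pd

# hypothetical dataframe mirroring the sample values in the question
df = pd.DataFrame({'P': [2.29e-98, 0.0313755, 0.1003476, 0.100496]})
cutoff = 0.1

# keep rows strictly below the cutoff, or equal to it within a tolerance
df = df[(df['P'] < cutoff) | np.isclose(df['P'], cutoff, atol=1e-9)]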
How can I retain only 2 decimals for each value in a Pandas series? (I'm working with latitudes and longitudes.) The dtype is float64.
series = [-74.002568, -74.003085, -74.003546]
I tried using the round function but, as the name suggests, it rounds. I looked into trunc(), but that removes all decimals. Then I figured: why not try a for loop? I tried the following:
for i in series:
    i = "{0:.2f}".format(i)
I was able to run the code without any errors but it didn't modify the data in any way.
Expected output would be the following:
[-74.00, -74.00, -74.00]
Anyone knows how to achieve this? Thanks!
series = [-74.002568, -74.003085, -74.003546]
["%0.2f" % (x,) for x in series]
['-74.00', '-74.00', '-74.00']
This converts the data to string/object dtype, so it is only suitable for display. If you want the values for calculation, cast them back to float; note that trailing zeros are then dropped, so only one decimal digit is shown.
[float('{0:.2f}'.format(x)) for x in series]
[-74.0, -74.0, -74.0]
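As an aside, the loop in the question ran without effect because assigning to the loop variable i only rebinds that name; it never touches the list. Also note that both "{0:.2f}" and "%0.2f" round rather than truncate; if you specifically need truncation (the two differ when the third decimal is 5 or higher), a small sketch with math.trunc:
import math

series = [-74.002568, -74.003085, -74.003546]
# scale up, truncate toward zero, then scale back down
truncated = [math.trunc(x * 100) / 100 for x in series]
print(truncated)  # [-74.0, -74.0, -74.0]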
Here is one way to do it, assuming you meant a pandas.Series (the question defines only a plain list):
import pandas as pd

# build a Series from the list in the question
series = [-74.002568, -74.003085, -74.003546]
s = pd.Series(series)
# use a regex extract to pick everything up to the first two decimal places
out = s.astype(str).str.extract(r"(.*\..{2})")[0]
out
0 -74.00
1 -74.00
2 -74.00
Name: 0, dtype: object
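If you need numbers rather than strings afterwards, you can cast the result back, though the float representation drops trailing zeros:
out.astype(float)
# 0   -74.0
# 1   -74.0
# 2   -74.0
# Name: 0, dtype: float64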
Change the display options. This shouldn't change your underlying data.
pd.options.display.float_format = "{:,.2f}".format
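A quick sketch of the effect, using the series from the earlier question:
import pandas as pd

pd.options.display.float_format = "{:,.2f}".format
s = pd.Series([-74.002568, -74.003085, -74.003546])
print(s)            # each value displays as -74.00
print(s.iloc[0])    # -74.002568: the stored value keeps full precision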
When I apply the bar styling to a pandas dataframe after rounding, I lose the rounding formatting, and I can't figure out how to apply the rounding afterwards, because df.style.bar doesn't return a dataframe but a Styler object.
df = pd.DataFrame({'A': [1.23456, 2.34567,3.45678], 'B':[2,3,4]})
df['A'] = df['A'].round(2)
df.style.bar(subset='A')
This renders the bar-styled table, but the values display with all of the extra zeros, which I don't want.
You will have to treat a styler as purely a rendering of the original dataframe. This means you can use a format to display the data rounded to 2 decimal places.
The basic idea behind styling is that a user will want to modify the way the data is presented but still preserve the underlying format for further manipulation.
f = {'A': '{:.2f}'}  # format column A to 2 decimal places
df.style.format(f).bar(subset='A')
Read the excellent pandas styling tutorial to explore everything you can do with it and how.
EDIT: Added a formatting dict to show general use and to only apply the format to a single column.
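For reference, a self-contained version of the example; the format is applied at render time only, so df itself keeps full precision:
import pandas as pd

df = pd.DataFrame({'A': [1.23456, 2.34567, 3.45678], 'B': [2, 3, 4]})
styled = df.style.format({'A': '{:.2f}'}).bar(subset='A')
# in a Jupyter notebook the Styler renders as HTML automatically;
# elsewhere, styled.to_html() produces the markup (recent pandas versions)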
The Dataframe contains float numbers like 7.5600000000000005, 2.36599999999954.
I would like to change the numbers to 7.56 and 2.366.
Note that the number of decimal places is not fixed.
How to achieve it?
Thank you very much.
The pandas.DataFrame.round function can be used. Given a dataframe df:
df = pd.DataFrame([( 7.5600000000000005, 2.36599999999954)])
df.round(3)
The output, rounded to the nearest thousandth, will be:
      0      1
0  7.56  2.366
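If different columns need different precision, round also accepts a dict keyed by column name (the column labels here are assumed for illustration):
import pandas as pd

df = pd.DataFrame([(7.5600000000000005, 2.36599999999954)], columns=['a', 'b'])
print(df.round({'a': 2, 'b': 3}))
#       a      b
# 0  7.56  2.366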
Ok, I have a pandas dataframe like this:
lat long level date time value
3341 29.232 -15.652 10.0 20100109.0 700.0 0.5
3342 27.887 -13.668 120.0 20100109.0 700.0 3.2
...
3899 26.345 -11.234 0.0 20100109.0 700.0 5.8
The index numbering looks odd because the dataframe comes from a CSV converted to a pandas dataframe with some values filtered out. The columns level, date, and time are not really relevant.
I am trying, in IPython, to view some rows by filtering on latitude, so I do (if the dataframe is c):
c[c['lat'] == 26.345]
or
c.loc[c['lat'] == 26.345]
and I can see whether the value is present or not, but sometimes it outputs nothing for latitude values that I can plainly see in the dataframe. For instance, I can see latitude 27.702 in the dataframe, yet both c[c['lat'] == 27.702] and c.loc[c['lat'] == 27.702] return an empty dataframe. What is happening here?
Thank you.
This is probably because you are asking for an exact match against floating point values, which is very, very dangerous. They are approximations, often printed to less precision than actually stored.
It's very easy to see 0.735471 printed, say, and think that's all there is, when in fact the value is really 0.73547122072282867; the display function has simply truncated the result. But when you try a strict equality test on the attractively short value, boom. Doesn't work.
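You can see the same effect with pandas' default display precision of 6 digits:
import pandas as pd

s = pd.Series([0.73547122072282867])
print(s)                        # 0    0.735471  <- display is truncated
print(s.iloc[0] == 0.735471)    # False: the stored value has more digits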
Instead of
c[c['lat'] == 26.345]
Try:
import numpy as np
c[np.isclose(c['lat'], 26.345)]
Now you'll get values that are within a certain range of the value you specified. You can set the tolerance.
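For example, assuming c from the question, with an explicit absolute tolerance:
import numpy as np

# match latitudes within one millionth of a degree of the target
c[np.isclose(c['lat'], 26.345, atol=1e-6)]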
It is a bit difficult to give a precise answer, as the question does not contain a reproducible example, but let me try. Most probably, this is due to floating point issues. It is possible that the number you see (and compare against) is not the same number stored in memory, because of rounding. For example:
import numpy as np
x = 0.1
arr = np.array([x + x + x])
print(np.array([x + x + x]))
# [ 0.3]
print(arr[arr == 0.3])
# []
print(x + x + x)
# 0.30000000000000004
# in fact 0.1 is not exactly equal to 1/10,
# so 0.1 + 0.1 + 0.1 is not equal to 0.3
You can overcome this issue using np.isclose instead of ==:
print(np.isclose(arr, 0.3))
# [ True]
print(arr[np.isclose(arr, 0.3)])
# [ 0.3]
In addition to the answers addressing comparison on floating point values, some of the values in your lat column may be string type instead of numeric.
EDIT: You indicated that this is not the problem, but I'll leave this response here in case it helps someone else. :)
Use the to_numeric() function from pandas to convert them to numeric.
import pandas as pd
df['lat'] = pd.to_numeric(df['lat'])
# you can adjust the errors parameter as you need
df['lat'] = pd.to_numeric(df['lat'], errors='coerce')
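A quick way to check whether this is the problem before converting:
# object dtype usually means the column holds strings
print(df['lat'].dtype)
# shows which Python types actually appear in the column
print(df['lat'].map(type).value_counts())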
I import a csv data file into a pandas DataFrame df with pd.read_csv. The text file contains a column with strings like these:
y
0.001
0.0003
0.0001
3e-05
1e-05
1e-06
If I print the DataFrame, pandas outputs the decimal representation of these values with 6 digits after the decimal point, and everything looks good.
When I try to select rows by value, like here:
df[df['y'] == value],
by typing the corresponding decimal representation of value, pandas correctly matches certain values (for example, rows 0, 2, 4) but does not match others (rows 1, 3, 5). This is of course because those rows' values do not have an exact representation in base two.
I was able to work around this problem this way:
df[abs(df['y']/value - 1) <= 0.0001]
but it seems somewhat awkward. What I'm wondering is: numpy already has a function, isclose, meant specifically for this purpose.
Is there a way to use isclose in a case like this? Or a more direct solution in pandas?
Yes, you can use numpy's isclose:
df[np.isclose(df['y'], value)]
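A runnable sketch with the sample data from the question. One caveat worth knowing: np.isclose's default tolerances are rtol=1e-05 and atol=1e-08, so values much smaller than atol all compare as close to zero and to each other; tighten atol if your data lives at very small scales.
import numpy as np
import pandas as pd

df = pd.DataFrame({'y': [0.001, 0.0003, 0.0001, 3e-05, 1e-05, 1e-06]})
value = 3e-05

# row 3 (value 3e-05) is the only match with the default tolerances
print(df[np.isclose(df['y'], value)])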