Pandas selecting row by column value, strange behaviour - python

Ok, I have a pandas dataframe like this:
lat long level date time value
3341 29.232 -15.652 10.0 20100109.0 700.0 0.5
3342 27.887 -13.668 120.0 20100109.0 700.0 3.2
...
3899 26.345 -11.234 0.0 20100109.0 700.0 5.8
The reason for the strange index numbers is that the dataframe comes from a CSV converted to a pandas dataframe with some values filtered out. The columns level, date, and time are not really relevant.
I am trying, in IPython, to view some rows by filtering on latitude, so I do (if the dataframe is c):
c[c['lat'] == 26.345]
or
c.loc[c['lat'] == 26.345]
and I can see whether the value is present or not, but sometimes it outputs nothing for latitude values that I can see in the dataframe!? (For instance, I can see the latitude value 27.702 in the dataframe, but when I do c[c['lat'] == 27.702] or c.loc[c['lat'] == 27.702] I get an empty dataframe, even though the value is there.) What is happening here?
Thank you.

This is probably because you are asking for an exact match against floating point values, which is very, very dangerous. They are approximations, often printed to less precision than actually stored.
It's very easy to see 0.735471 printed, say, and think that's all there is, when in fact the value is really 0.73547122072282867; the display function has simply truncated the result. But when you try a strict equality test on the attractively short value, boom. Doesn't work.
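A quick way to see the truncation (the value is just an illustration; pandas' default display precision is 6 digits):
import pandas as pd
s = pd.Series([0.73547122072282867])
print(s)          # 0    0.735471  <- the display is truncated
print(s.iloc[0])  # 0.7354712207228287  <- much closer to what is actually stored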
Instead of
c[c['lat'] == 26.345]
Try:
import numpy as np
c[np.isclose(c['lat'], 26.345)]
Now you'll get values that are within a certain range of the value you specified. You can set the tolerance.
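For instance, a small sketch with made-up data (the atol value is only an illustration; the defaults are rtol=1e-05 and atol=1e-08):
import numpy as np
import pandas as pd
c = pd.DataFrame({'lat': [29.232, 27.887, 27.702 + 1e-13]})
print(c[c['lat'] == 27.702])                       # empty: the stored value differs in its last bits
print(c[np.isclose(c['lat'], 27.702, atol=1e-6)])  # finds the row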

It is a bit difficult to give a precise answer, as the question does not contain a reproducible example, but let me try. Most probably this is due to floating point issues. It is possible that the number you see (and try to compare with) is not the same number that is stored in memory, due to rounding. For example:
import numpy as np
x = 0.1
arr = np.array([x + x + x])
print(np.array([x + x + x]))
# [ 0.3]
print(arr[arr == 0.3])
# []
print(x + x + x)
# 0.30000000000000004
# in fact 0.1 is not exactly equal to 1/10,
# so 0.1 + 0.1 + 0.1 is not equal to 0.3
You can overcome this issue using np.isclose instead of ==:
print(np.isclose(arr, 0.3))
# [ True]
print(arr[np.isclose(arr, 0.3)])
# [ 0.3]

In addition to the answers addressing comparison on floating point values, some of the values in your lat column may be string type instead of numeric.
EDIT: You indicated that this is not the problem, but I'll leave this response here in case it helps someone else. :)
Use the to_numeric() function from pandas to convert them to numeric.
import pandas as pd
df['lat'] = pd.to_numeric(df['lat'])
# you can adjust the errors parameter as you need
df['lat'] = pd.to_numeric(df['lat'], errors='coerce')
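If you want to check first whether mixed types really are the problem, one quick diagnostic (the mixed column here is made up) is:
import pandas as pd
df = pd.DataFrame({'lat': [26.345, '27.702']})  # one float, one string
print(df['lat'].dtype)                     # object -> a hint that not everything is numeric
print(df['lat'].map(type).value_counts())  # how many values of each Python type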

Related

Python pandas: issues when subsetting Series using .values attribute [duplicate]

This question already has answers here: python numpy arange unexpected results.
I'm having an issue with a pandas Series: I've created an array with some values in it. For testing purposes I was trying to make sure certain values were present in the Series, so I'm subsetting it like this:
A = np.arange(start=-10, stop=10, step=0.1)
Aseries = pd.Series(A)
Aseries[Aseries.values == 9]
and this returns an empty Series. But if I just change the step (from 0.1 to 1), it works... I've double checked that the Series actually contains the value I'm looking for (for both step values...).
Here's the code for when I change the step (With the output as proof)
#Generating an array containing 200 values from -10 to 10 with a step of 0.1
A = np.arange(start=-10, stop=10, step=0.1)
Aseries = pd.Series(A)
Aseries[Aseries.values == 9]
#Generating an array containing 20 values from -10 to 10 with a step of 1
B = np.arange(start=-10, stop=10, step=1)
Bseries = pd.Series(B)
print("'Aseries' having the value 9:")
print(Aseries[Aseries.values == 9])
print("'Bseries' having the value 9:")
print(Bseries[Bseries.values == 9])
output:
'Aseries' having the value 9:
Series([], dtype: float64)
'Bseries' having the value 9:
19 9
dtype: int32
any idea of what's going on here? thanks in advance!
[EDIT]: for some reason I can't add any other post to this thread, so I'll add the solution I found here:
As #Quang Hoang and #Kim Rop explained, this is caused by the non-integer step value, which doesn't really return what it's supposed to. So after:
Aseries = pd.Series(A)
I simply added a rounding instruction to keep only one decimal for the values in the array, and adapted my subsetting operation to a range check like this:
Aseries[(Aseries.values > 8.9) & (Aseries.values < 9.1)]
I'm not having the issue anymore... Thanks #Quang Hoang and #Kim Rop
According to the official documentation:
When using a non-integer step, such as 0.1, the results will often not
be consistent. It is better to use numpy.linspace for these cases.
And this is also partially because of floating point precision.
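A sketch combining both fixes, snapping the grid values back onto one decimal (the linspace line is just an alternative way to build the same grid):
import numpy as np
import pandas as pd
A = np.arange(start=-10, stop=10, step=0.1).round(1)
# or: A = np.linspace(-10, 10, num=200, endpoint=False).round(1)
Aseries = pd.Series(A)
print(Aseries[Aseries.values == 9])  # now returns the entry holding exactly 9.0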

How to do exact decimal number calculation in pandas?

My dataframe looks something like this:
import pandas as pd
df = pd.read_sql('select * from foo')
a b c
0.1 0.2 0.3
0.3 0.4 0.5
If I directly run df['a'] * df['b'], the result is not exact as I expected, because of floating point issues.
I tried
from decimal import Decimal
df['a'].apply(Decimal) * df['b'].apply(Decimal)
But when I inspect df['a'].apply(Decimal) in PyCharm, the column turns out to be something strange; here is just an example, not the real numbers:
a
0.09999999999999999
0.30000000000001231
I wonder how to do exact multiplication in pandas.
The problem is not in pandas but in floating point inaccuracy: decimal.Decimal(0.1) is Decimal('0.1000000000000000055511151231257827021181583404541015625') on my 64 bits system.
A simple trick would be to first change the floats to strings, because pandas knows enough about string conversion to properly round the values:
x = df['a'].astype(str).apply(Decimal) * df['b'].astype(str).apply(Decimal)
You will get a nice Series of Decimal:
>>> print(x.values)
[Decimal('0.02') Decimal('0.12')]
So you get exact decimal operations, which can matter if you process monetary values...
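A compact end-to-end sketch of the trick (the data is made up to mirror the question):
from decimal import Decimal
import pandas as pd
df = pd.DataFrame({'a': [0.1, 0.3], 'b': [0.2, 0.4]})
# going through str gives Decimal the short printed form, not the binary approximation
x = df['a'].astype(str).apply(Decimal) * df['b'].astype(str).apply(Decimal)
print(x.values)  # [Decimal('0.02') Decimal('0.12')]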

pandas rounding when converting float to integer

I've got a pandas DataFrame with a float (decimal) index which I use to look up values (similar to a dictionary). As floats are not exactly the values they are supposed to be, I multiplied everything by 1000 and converted it to integers with .astype(int) before setting it as the index. However, this seems to do a floor instead of rounding, so 1.999999999999999992 is converted to 1 instead of 2. Rounding with the pandas.DataFrame.round() method beforehand does not avoid this problem, as the values are still stored as floats.
The original idea (which obviously raises a KeyError) was this:
idx = np.arange(1,3,0.001)
s = pd.Series(range(2000))
s.index=idx
print(s[2.022])
Trying with a conversion to integers:
idx_int = idx*1000
idx_int = idx_int.astype(int)
s.index = idx_int
for i in range(1000, 3000):
    print(s[i])
the output is always a bit random, as the 'real' value of an integer can be slightly above or below the wanted value. In this case the index contains the value 1000 twice and does not contain the value 2999.
You are right, astype(int) does a conversion toward zero:
‘integer’ or ‘signed’: smallest signed int dtype
from pandas.to_numeric documentation (which is linked from astype() for numeric conversions).
If you want to round, you need to do a float round, and then convert to int:
df.round(0).astype(int)
Use other rounding functions according to your needs.
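A quick illustration of the difference (the values are chosen to sit just below ±1):
import pandas as pd
s = pd.Series([0.9999999999999999, -0.9999999999999999])
print(s.astype(int).tolist())           # [0, 0]   -> truncated toward zero
print(s.round(0).astype(int).tolist())  # [1, -1]  -> rounded first, then converted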
the output is always a bit random as the 'real' value of an integer can be slightly above or below the wanted value
Floats are able to represent whole numbers exactly, making a conversion after round(0) lossless and non-risky.
If I understand right, you could just perform the rounding operation followed by converting it to an integer?
s1 = pd.Series([1.2,2.9])
s1 = s1.round().astype(int)
Which gives the output:
0 1
1 3
dtype: int32
In case the data frame contains both numeric and non-numeric values and you only want to touch the numeric fields:
df = df.applymap(lambda x: int(round(x, 0)) if isinstance(x, (int, float)) and not pd.isna(x) else x)  # skip NaN: int(nan) raises
There is a potential for NA values (a float type) to exist in the dataframe, so an alternative solution is: df.fillna(0).astype('int')
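If the NA values should survive the conversion instead of becoming 0, pandas' nullable integer dtype (available since pandas 0.24) is an alternative; a sketch:
import numpy as np
import pandas as pd
s = pd.Series([1.2, 2.9, np.nan])
print(s.round().astype('Int64'))  # 1, 3, <NA> -- missing values are preserved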

Pandas Dataframe Float Precision

I am trying to alter my dataframe with the following line of code:
df = df[df['P'] <= cutoff]
However, if for example I set cutoff to be 0.1, numbers such as 0.100496 make it through the filter.
My suspicion is that my initial dataframe has entries in scientific notation as well as in plain float format. Could this be affecting the rounding and precision? Is there a potential workaround for this issue?
Thank you in advance.
EDIT: I am reading from a file. Here is a sample of the total data.
2.29E-98
1.81E-42
2.19E-35
3.35E-30
0.0313755
0.0313817
0.03139
0.0313991
0.0314062
0.1003476
0.1003483
0.1003487
0.1003521
0.100496
Floating point comparison isn't perfect. For example
>>> 0.10000000000000000000000000000000000001 <= 0.1
True
Have a look at numpy.isclose. It allows you to compare floats and set a tolerance for the comparison.
Similar question here
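One way to make the cutoff itself tolerant is to treat values within a small absolute tolerance of the cutoff as passing (the atol value is an assumption you would tune to your data):
import numpy as np
import pandas as pd
cutoff = 0.1
df = pd.DataFrame({'P': [2.29e-98, 0.0313755, 0.1, 0.1003476, 0.100496]})
mask = (df['P'] < cutoff) | np.isclose(df['P'], cutoff, atol=1e-9)
print(df[mask])  # keeps values at or below the cutoff; 0.1003476 and 0.100496 are excluded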

Returning rows in a dataframe to a list of integers

I have a dataframe with multiple columns and a few 1000 rows with text data. One column contains floats that represent time in ascending order (0, 0.45, 0.87, 1.10, etc.). From this I want to build a new dataframe that contains only the rows where these time values are closest to the integers x = 0, 1, 2, 3, ... etc.
Here on Stack Overflow I found an answer to a very similar question, posted by DSM. The code is essentially this, modified (hopefully) to give the closest number to x; df is my data frame.
df.loc[(df.ElapsedTime-x).abs().argsort()[:1]]
This seems to essentially do what I need for one x value, but I can't figure out how to iterate this over the entire data frame to extract all rows where the column value is closest to x = 0, 1, 2, 3, ... in ascending order. This code gives me a data frame; there must be a way to loop this and append the resulting data frames to get the desired result?
I have tried this:
L=[]
for x in np.arange(len(df)):
    L.append(df.loc[(df.ElapsedTime-x).abs().argsort()[:1]])
L
L, in principle, has the right rows, but it is a messy list, and it takes a long time to execute because for loops are not a great way to iterate over a data frame. I'd prefer to get a data frame as the result.
I feel I am missing something trivial.
Not sure how to post the desired dataframe.
Let's say the time values are (taken from my dataframe):
0.00,0.03,0.58,1.59,1.71,1.96,2.21,2.33,2.46,2.58,2.7,2.83,2.95,3.07
The values grabbed for 0,1,2,3 would be 0, .58, 1.96, 2.95
#beroe: if the numbers are 0.8, 1.1, 1.4, 2.8, then 1.1 should be grabbed for 1 and 1.4 should be grabbed for 2. If, as an example, the numbers are 0.5, 1.5, 2.5: while I think it is unlikely this will happen in my data, I think it would be fine to grab 1.5 as 1 and 2.5 as 2. In this application I don't think it is that critical, although I am not sure how I would implement this.
Please let me know if anyone needs any additional info.
Don't know how fast this would be, but you could round the times to get "integer" candidates, take the absolute value of the difference to give yourself a way to find the closest, then sort by that difference, and finally group by the integer time to return just the rows that are close to integers:
# setting up my fake data
import pandas as pd

df = pd.DataFrame()
df['ElapsedTime'] = pd.Series([0.5, 0.8, 1.1, 1.4, 1.8, 2.2, 3.1])

# To use your own data set, set df = Z, and start here...
df['bintime'] = df.ElapsedTime.round()
df['d'] = abs(df.ElapsedTime - df.bintime)
dfindex = df.sort_values('d').groupby('bintime').first()
For the fake time series defined above, the contents of dfindex is:
ElapsedTime d
bintime
0 0.5 0.5
1 1.1 0.1
2 1.8 0.2
3 3.1 0.1
Consider the following pd.Series s
s = pd.Series(np.arange(5000), np.random.rand(5000) * 100).sort_index()
s.head()
0.002587 3007
0.003418 4332
0.060767 2045
0.125182 3179
0.134487 4614
dtype: int64
Get all the integer targets to match against with:
idx = (s.index // 1).unique()
Then reindex with method='nearest'
s.reindex(idx, method='nearest').head()
0.0 3912
1.0 3617
2.0 2574
3.0 811
4.0 932
dtype: int64
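Applied to the timing question above, a sketch using the sample times from that question (the value column is made up so the rows carry something to return):
import numpy as np
import pandas as pd
times = [0.00, 0.03, 0.58, 1.59, 1.71, 1.96, 2.21,
         2.33, 2.46, 2.58, 2.70, 2.83, 2.95, 3.07]
df = pd.DataFrame({'ElapsedTime': times, 'value': range(len(times))})
targets = np.arange(int(df['ElapsedTime'].max()) + 1)  # 0, 1, 2, 3
# nearest-match lookup on the (unique, ascending) time index
print(df.set_index('ElapsedTime').reindex(targets, method='nearest'))
# picks the rows at times 0.00, 0.58, 1.96, 2.95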
