Selecting rows by value in a floating point column in pandas - python

I import a csv data file into a pandas DataFrame df with pd.read_csv. The text file contains a column with strings like these:
y
0.001
0.0003
0.0001
3e-05
1e-05
1e-06
If I print the DataFrame, pandas outputs the decimal representation of these values with 6 digits after the decimal point, and everything looks good.
When I try to select rows by value, like here:
df[df['y'] == value],
by typing the corresponding decimal representation of value, pandas correctly matches some values (for example rows 0, 2, 4) but does not match others (rows 1, 3, 5). This is of course because those rows' values do not have an exact representation in base two.
I was able to work around this problem in this way:
df[abs(df['y']/value-1) <= 0.0001]
but it seems somewhat awkward. NumPy already has a method, .isclose, that exists specifically for this purpose.
Is there a way to use .isclose in a case like this? Or a more direct solution in pandas?

Yes, you can use numpy's isclose
df[np.isclose(df['y'], value)]
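A minimal, self-contained sketch of that usage, with the column name and values taken from the question and the tolerances written out explicitly (shown here with their NumPy defaults):
import numpy as np
import pandas as pd

df = pd.DataFrame({'y': [0.001, 0.0003, 0.0001, 3e-05, 1e-05, 1e-06]})
value = 3e-05

# np.isclose returns a boolean array that can be used as a mask;
# rtol/atol are the relative and absolute tolerances
mask = np.isclose(df['y'], value, rtol=1e-05, atol=1e-08)
print(df[mask])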

Related

Multiply pd DataFrame column with 7-digit scalar

I am trying to modify a pandas dataframe column this way:
Temporary=DF.loc[start:end].copy()
SLICE=Temporary.unstack("time").copy()
SLICE["Var"]["Jan"] = 2678400*SLICE["Var"]["Jan"]
However, this does not work. The resulting column SLICE["Var"]["Jan"] is still the same as before the multiplication.
If I multiply with 2 orders of magnitude less, the multiplication works. Also a subsequent multiplication with 100 to receive the same value that was intended in the first place, works.
SLICE["Var"]["Jan"] = 26784*SLICE["Var"]["Jan"]
SLICE["Var"]["Jan"] = 100*SLICE["Var"]["Jan"]
It seems like the scalar is too large for the multiplication. Is this a Python thing or a pandas thing? How can I make sure that the multiplication with the 7-digit number works directly?
I am using Python 3.8. The precision of the numbers in the dataframe is float32; they are in a range between 5.0e-5 and -5.0e-5, with some numbers having an absolute value smaller than 1e-11.
EDIT: It might have to do with the 2-level column indexing. When I delete the first level, the calculation works:
Temporary=DF.loc[start:end].copy()
SLICE=Temporary.unstack("time").copy()
SLICE=SLICE.droplevel(0, axis=1)
SLICE["Jan"] = 2678400*SLICE["Jan"]
Your first method might give a SettingWithCopyWarning, which basically means the changes are not made to the actual dataframe. You can use a single .loc call instead:
SLICE.loc[:,('Var', 'Jan')] = SLICE.loc[:,('Var', 'Jan')]*2678400
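A small sketch of the same assignment, assuming a two-level column index like the one produced by unstack (the 'Var'/'Jan' labels come from the question; the values are invented):
import pandas as pd

# Hypothetical frame with MultiIndex columns, standing in for the unstacked slice
SLICE = pd.DataFrame({('Var', 'Jan'): [1.2e-5, -3.4e-6],
                      ('Var', 'Feb'): [5.6e-6, 7.8e-7]})

# A single .loc call avoids the chained indexing that triggers the warning
SLICE.loc[:, ('Var', 'Jan')] = SLICE.loc[:, ('Var', 'Jan')] * 2678400
print(SLICE)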

In pandas, how to convert the result of dividing two columns from a decimal to a percentage?

When using pandas, how do I convert the decimal result of dividing one column by another (stored as a new column) into a percentage?
For example:
df_income_structure['C'] = (df_income_structure['A']/df_income_structure['B'])
If the value of df_income_structure['C'] is a decimal, how to convert it to a percentage ?
Format it like this:
df_income_structure.style.format({'C': '{:,.2%}'.format})
Change the number depending on how many decimal places you'd like.
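A self-contained sketch of the formatting approach, with invented column values; note that .style.format only changes how the numbers are displayed, not the stored floats:
import pandas as pd

df_income_structure = pd.DataFrame({'A': [25.0, 40.0], 'B': [100.0, 80.0]})
df_income_structure['C'] = df_income_structure['A'] / df_income_structure['B']

# Renders C as "25.00%" and "50.00%"; the underlying values stay 0.25 and 0.5
styled = df_income_structure.style.format({'C': '{:,.2%}'.format})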
Use basic pandas operators
For example, if we have a dataframe with columns named column1, column2, column3, ..., we can do:
Columns = ['column1', 'column2', 'column3', ...]
df[Columns] = df[Columns].div(df[Columns].sum(axis=1), axis=0).multiply(100)
df[Columns].sum(axis=1) sums each row (axis=1 runs the summation across the columns of a row).
df[Columns].div(df[Columns].sum(axis=1), axis=0) then divides the dataframe by those row sums (axis=0 aligns the divisor with the rows).
Multiply the result by 100 to express it as a percentage.
I hope this answer solves your problem. Good luck!
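Below is a minimal, self-contained sketch of that row-wise percentage idea, with made-up column names and values (the real column names come from your own dataframe):
import pandas as pd

df = pd.DataFrame({'column1': [1.0, 2.0], 'column2': [3.0, 2.0]})
Columns = ['column1', 'column2']

# Divide every row by its own sum and scale to 100,
# so each row of the result adds up to 100.
df[Columns] = df[Columns].div(df[Columns].sum(axis=1), axis=0).multiply(100)
print(df)   # row 0 -> 25.0, 75.0; row 1 -> 50.0, 50.0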

Quickly search through a numpy array and sum the corresponding values

I have an array with around 160k entries which I get from a CSV-file and it looks like this:
data_arr = np.array([['ID0524', 1.0],
                     ['ID0965', 2.5],
                     ...
                     ['ID0524', 6.7],
                     ['ID0324', 3.0]])
I now get around 3k unique IDs from a database, and what I have to do is look up each of these IDs in the array and sum the corresponding numbers.
So if I looked up "ID0524", the sum would be 7.7.
My current working code looks something like this (I'm sorry that it's pretty ugly, I'm very new to numpy):
def sumValues(self, id):
    sub_arr = data_arr[data_arr[0:data_arr.size, 0] == id]
    sum_arr = sub_arr[0:sub_arr.size, 1]
    return sum_arr.sum()
And it takes around ~18s to do this for all 3k IDs.
I wondered if there is a faster way to do this, as the current runtime seems a bit too long to me. I would appreciate any guidance and hints on this. Thank you!
You could try using the built-in NumPy methods:
numpy.intersect1d to find the unique IDs
numpy.sum to sum them up
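One way to combine built-in NumPy methods for this kind of group-sum is np.unique (with return_inverse=True) plus np.bincount, a slightly different pairing than the one named above; a minimal sketch with invented data shaped like the question's array:
import numpy as np

data_arr = np.array([['ID0524', '1.0'],
                     ['ID0965', '2.5'],
                     ['ID0524', '6.7'],
                     ['ID0324', '3.0']])

ids = data_arr[:, 0]
values = data_arr[:, 1].astype(float)

# np.unique gives the distinct IDs plus, for every row, the position of its ID;
# np.bincount then sums the values that share the same position.
unique_ids, inverse = np.unique(ids, return_inverse=True)
sums = np.bincount(inverse, weights=values)
print(dict(zip(unique_ids, sums)))   # {'ID0324': 3.0, 'ID0524': 7.7, 'ID0965': 2.5}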
A convenient tool to do your task is Pandas, with its grouping mechanism.
Start from the necessary import:
import pandas as pd
Then convert data_arr to a pandasonic DataFrame:
df = pd.DataFrame({'Id': data_arr[:, 0], 'Amount': data_arr[:, 1].astype(float)})
The reason for the extra conversion in the above code is that the elements of your input array are all of a single type (in this case, object), so it is necessary to convert the second column to float.
Then you can get the expected result in a single instruction:
result = df.groupby('Id').sum()
The result, for your data sample, is:
Amount
Id
ID0324 3.0
ID0524 7.7
ID0965 2.5
Another approach is to read your CSV file directly into a DataFrame (see the read_csv method), so there is no need to use any NumPy array at all.
The advantage is that read_csv is clever enough to recognize the data type of each column separately; at the very least, it can tell numbers apart from strings.
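A hedged sketch of that approach; the file name and column labels are hypothetical and assume the CSV has no header row:
import pandas as pd

# Hypothetical file and column names; adjust to the real CSV layout
df = pd.read_csv('data.csv', names=['Id', 'Amount'])

# One grouped sum over the whole file
result = df.groupby('Id')['Amount'].sum()

# Looking up a single ID is then a plain index access
print(result['ID0524'])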

Getting the index of a float in a column using pandas

I have a dataset that I am pulling using pandas. It looks like this:
import pandas as pd
dataset=pd.read_csv('D:\\filename.csv', header=None, usecols=[3,4,10,16,22,28])
time=dataset.iloc[:,0]
Now, the 'time' dataset has a value of 0.00017 somewhere down the column and I want to find the index number of that location. How can I get that?
Assuming you're dealing with floats, you can't use an equality comparison here (because of floating point inaccuracies creeping in).
Use np.isclose + np.argmax:
idx = np.isclose(df['time'], 0.00017).argmax()
If there's a possibility this value may not exist:
m = np.isclose(df['time'], 0.00017)
if m.sum() > 0:
    idx = m.argmax()
Otherwise, set idx to whatever (None, -1, etc).
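Put together as a self-contained sketch (the column name and target value come from the question; the surrounding values are made up):
import numpy as np
import pandas as pd

dataset = pd.DataFrame({'time': [0.0005, 0.00017, 0.0031]})

mask = np.isclose(dataset['time'], 0.00017)
if mask.any():
    idx = mask.argmax()   # position of the first match, here 1
else:
    idx = None            # value not present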

Pandas Dataframe Float Precision

I am trying to alter my dataframe with the following line of code:
df = df[df['P'] <= cutoff]
However, if for example I set cutoff to be 0.1, numbers such as 0.100496 make it through the filter.
My suspicion is that my initial dataframe has entries in scientific notation as well as plain float format. Could this be affecting the rounding and precision? Is there a potential workaround to this issue?
Thank you in advance.
EDIT: I am reading from a file. Here is a sample of the total data.
2.29E-98
1.81E-42
2.19E-35
3.35E-30
0.0313755
0.0313817
0.03139
0.0313991
0.0314062
0.1003476
0.1003483
0.1003487
0.1003521
0.100496
Floating point comparison isn't perfect. For example
>>> 0.10000000000000000000000000000000000001 <= 0.1
True
Have a look at numpy.isclose. It allows you to compare floats and set a tolerance for the comparison.
Similar question here
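A sketch of applying a tolerance to the filter from the question; the column name and cutoff come from the question, the sample values from the posted data, and the tolerance itself is illustrative:
import numpy as np
import pandas as pd

df = pd.DataFrame({'P': [2.29e-98, 0.0313755, 0.1003476, 0.100496]})
cutoff = 0.1

# Keep rows below the cutoff, treating values within the tolerance as equal to it
mask = (df['P'] < cutoff) | np.isclose(df['P'], cutoff, rtol=0, atol=1e-9)
print(df[mask])   # only the rows well below 0.1 remain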
