How to subtract two columns when floating point precision matters - python

I am trying to subtract two columns in a dataframe, but it gives me the same result for all the rows.
Here is my data:
       a        b
0  0.35805 -0.01315
1  0.35809 -0.01311
2  0.35820 -0.01300
3  0.35852 -0.01268
I tried the approach suggested here, but it repeats the same result for all the rows.

This looks more like a precision issue; I always use Decimal:
from decimal import Decimal
df.a.map(Decimal) - df.b.map(Decimal)
Out[189]:
0 0.3711999999999999796246319406
1 0.3712000000000000195232718880
2 0.3712000000000000177885484121
3 0.3712000000000000056454840802
dtype: object

I think this will work fine:
df['a-b'] = df['a'] - df['b']
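For the data shown, the identical results in every row are actually correct, not a bug: every exact decimal difference is 0.3712, and the float64 results differ only around the 17th digit, as the Decimal output above shows. A minimal sketch reproducing the posted data:
import pandas as pd

df = pd.DataFrame({'a': [0.35805, 0.35809, 0.35820, 0.35852],
                   'b': [-0.01315, -0.01311, -0.01300, -0.01268]})

# Plain float64 subtraction; every row prints as 0.3712
df['a-b'] = df['a'] - df['b']
print(df['a-b'])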

Related

How to round values of the index in a pandas dataframe

I would like to round the index values of a pandas dataframe named results, so that they do not have any decimal values. I use the following code, which I took from here: Round columns in pandas dataframe. Basically, I have a column named "set_timeslots" and I would like to round its values and then use it as an index:
cols = ['set_timeslots']
results[cols] = results[cols].round(0)
results.set_index('set_timeslots', inplace=True)
However, I still get a decimal value as you can see in the screenshot
Do you know what I have to do in order to get rid of the decimal values? I'd appreciate every comment.
If you need to round and convert to integers, add Series.astype:
results[cols] = results[cols].round(0).astype(int)
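Since the goal is to use the rounded column as the index, a minimal end-to-end sketch (the value column is a hypothetical stand-in, as the original frame is not shown):
import pandas as pd

results = pd.DataFrame({'set_timeslots': [1.0, 2.0, 3.0, 4.0, 5.0],
                        'value': [10, 20, 30, 40, 50]})  # hypothetical payload column
cols = ['set_timeslots']

results[cols] = results[cols].round(0).astype(int)
results.set_index('set_timeslots', inplace=True)
print(results.index.dtype)  # int64 - no decimal values left in the index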
pandas.DataFrame.round() alone won't work in this scenario, because round() only trims the decimal places; the dtype stays float. Let's take our case:
# Import the required library
import pandas as pd
# Reproduce a sample 'set_timeslots' column
results = pd.DataFrame({
    'set_timeslots': [1.0, 2.0, 3.0, 4.0, 5.0]
})
# 'cols' stores the column(s) to convert
cols = ['set_timeslots']
# Print the result
results[cols]
# Output of the above cell:
set_timeslots
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
# Apply 'round'; note the values are still floats
results[cols] = results[cols].round(0)
# Print the result after rounding
results
# Output of the above cell:
set_timeslots
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
Appropriate solution:
To convert set_timeslots from float to int, use pandas.DataFrame.astype(). The code for the scenario above:
# Convert with 'astype(int)'
results[cols] = results[cols].astype(int)
# Print the result after the conversion
results
# Output of the above cell:
set_timeslots
0 1
1 2
2 3
3 4
4 5
As you can see, we have achieved the desired output: the decimal points are gone from the set_timeslots column. Hope this clarifies the difference between round() and astype().
To learn more, see the pandas documentation for pandas.DataFrame.round() and pandas.DataFrame.astype().

Can anyone suggest better ASSERT method to compare two columns of a single dataframe in pytest?

I am using pytest to compare two columns of a dataframe with the assert method below:
def test_compare():
    np.testing.assert_almost_equal(v['col1'].values, v['col2'].values, decimal=4, verbose=True)
The issue with assert_almost_equal() is that when comparing a col1 value of 0.850341028331584 (0.8503 to 4 decimal places) with a col2 value of 0.850341028331585 (also 0.8503 to 4 decimal places), it throws an error:
> raise AssertionError(msg)
E AssertionError:
E Arrays are not almost equal to 4 decimals
E
E x and y nan location mismatch:
E x: array([0.8503, 0.1234, ..., 0.9028, 0.981 , 0.9789])
E y: array([0.8503, 0.1234, ..., 0.9028, 0.981 , 0.9789])
Is there any workaround, or some other assert function, that compares strictly up to 4 decimal places instead of rounding off?
You could try the pandas method assert_series_equal with relative tolerance - it seems to work for me:
>>> import pandas as pd
>>> df = pd.DataFrame([[0.850341028331584,0.850341028331585]])
>>> df
0 1
0 0.850341 0.850341
>>> df[0][0]
0.850341028331584
>>> df[1][0]
0.850341028331585
>>> pd.testing.assert_series_equal(df[1].rename(0), df[0], check_exact=False, rtol=1e-4)
>>>
Unfortunately, you have to rename the first series to have the same name as the second in order for this function to work. There is a closed issue related to this.
You can use np.isclose for these situations, where you can control the precision if you wish:
assert np.all(np.isclose(v['col1'].values, v['col2'].values))
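To actually pin the precision, np.isclose accepts rtol and atol; for example, an absolute tolerance of 1e-4 approximates "equal up to 4 decimal places" (v is assumed to be the asker's dataframe):
import numpy as np

# rtol=0 disables relative scaling; atol=1e-4 is the absolute cutoff
assert np.all(np.isclose(v['col1'].values, v['col2'].values, rtol=0, atol=1e-4))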
If you want an assert method that strictly compares data up to a certain number of decimal places, use this:
from numpy.testing import assert_array_almost_equal

assert_array_almost_equal(df['col1'], df['col2'], decimal=13, err_msg='', verbose=True)
This method works fine without performing the default round-off.
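If "strictly up to 4 decimal places" means truncation rather than rounding, one minimal sketch is to scale, truncate with np.trunc, and compare exactly; the helper name is illustrative, and v is assumed to be the asker's dataframe:
import numpy as np

def assert_equal_truncated(x, y, decimals=4):
    # Shift the digits of interest left of the decimal point,
    # then drop everything after it (trunc never rounds up)
    factor = 10.0 ** decimals
    np.testing.assert_array_equal(np.trunc(x * factor), np.trunc(y * factor))

assert_equal_truncated(v['col1'].values, v['col2'].values)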

Iterating over dataframe and using replace method based on condtions

I am attempting to iterate over a specific column in my dataframe.
The column is:
df['column'] = ['1.4million', '1,235,000', '100million', np.nan, '14million', '2.5mill']
I am trying to clean this column and eventually convert it all to integers for further work. I am stuck on the step of cleaning out "million". I would like to replace the "million" with five zeros when there is a decimal (i.e. 1.4million becomes 1.400000) and with six zeros when there is no decimal (i.e. 100million becomes 100000000).
To simplify, the first step I'm trying is to filter out the values with a decimal and replace those with five zeros. I attempted to use np.where for this; however, I cannot use the replace method with numpy.
I also attempted to use pd.DataFrame.where, but I am getting an error:
for i, row in df.iterrows():
    df.at[i, 'column'] = pd.DataFrame.where('.' in df.at[i, 'column'], df.at[i, 'column'].replace('million', ''), df.at[i, 'column'])

AttributeError: 'numpy.ndarray' object has no attribute 'replace'
I'm sure there is something I'm missing here. (I'm also sure that I'll be told I don't need to use iterrows here, so I am open to suggestions on that as well.)
Given your sample data - it looks like you can strip out commas and then take all digits (and . characters) until the string mill or end of string and split those out, eg:
x = df['column'].str.replace(',', '').str.extract('(.*?)(mill.*)?$')
This'll give you:
0 1
0 1.4 million
1 1235000 NaN
2 100 million
3 NaN NaN
4 14 million
5 2.5 mill
Then take the number part and multiply it by a million where there's something in column 1, else multiply it by 1, eg:
res = pd.to_numeric(x[0]) * np.where(x[1].notna(), 1_000_000, 1)
That'll give you:
0 1400000.0
1 1235000.0
2 100000000.0
3 NaN
4 14000000.0
5 2500000.0
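Putting the whole approach together, and since the goal was integers, a runnable sketch that finishes with pandas' nullable Int64 dtype so the NaN row survives the conversion (assumes a reasonably recent pandas):
import numpy as np
import pandas as pd

df = pd.DataFrame({'column': ['1.4million', '1,235,000', '100million',
                              np.nan, '14million', '2.5mill']})

# Strip commas, then split the number from the 'mill...' suffix
x = df['column'].str.replace(',', '', regex=False).str.extract(r'(.*?)(mill.*)?$')
res = pd.to_numeric(x[0]) * np.where(x[1].notna(), 1_000_000, 1)

# Nullable Int64 keeps the missing entry as <NA> instead of raising on NaN
df['as_int'] = res.astype('Int64')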
Try this:
df['column'].apply(lambda x: x.replace('million', '00000'))
Make sure your dtype is string before applying this
For the given data:
df['column'].apply(lambda x: float(str(x).split('m')[0]) * 10**6
                   if 'million' in str(x) or 'mill' in str(x) else x)
If many forms of "million" may appear in the column, use a regex search instead.

Why does Pandas/Numpy automatically round up 9999999999 to 1.000000e+10?

I have a Pandas dataframe, with 4 rows, and one of the columns (named limit) contains floating point values, where any zeros must be replaced with 9999999999 (9.999999999 billion). The column is set to the float32 data type, and I use the pandas.DataFrame.where method to do the replacement. But it's not working as expected because Numpy is rounding up 9999999999 to 10000000000 (10 billion).
I've tried this in iPython 3 (Python 3.6.8), Pandas version 0.24.2, Numpy version 1.14.0.
This is the replacement statement
df['limit'] = df['limit'].where(df['limit'] != 0, 9999999999)
I'm seeing the following column values for limit:
0 1.000000e+10
1 1.000000e+10
2 1.000000e+10
3 1.000000e+10
but I'm expecting
0 9999999999.0
1 9999999999.0
2 9999999999.0
3 9999999999.0
Why does the rounding up happen? This doesn't happen with plain Python
In [1]: (9.999999999) * 10**9
Out[1]: 9999999999.0
This is simply because float32 is not capable of representing that number exactly. You can check this by computing the number of bits needed to represent it:
In [24]: np.floor(np.log2(9999999999)) + 1
Out[24]: 34.0
As you can see, you need at least 34 bits to represent that number exactly, but float32's significand holds only 24 bits, so the value is rounded to the nearest representable float32, which is 1e10. You should use a wider type such as float64 (53 significand bits) or int64 to represent it.
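The rounding can be reproduced with numpy alone, which confirms it comes from the float32 dtype rather than from pandas (a quick check, not from the original post):
import numpy as np

print(np.float32(9999999999))  # 1e+10 - the nearest representable float32
print(np.float64(9999999999))  # 9999999999.0 - exact in float64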
Note that a plain int32 can't hold the number either; putting it in a series with that data type silently overflows:
In [25]: s = pd.Series([9999999999], dtype=np.int32)
In [26]: s
Out[26]:
0    1410065407
dtype: int32
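A minimal sketch of the fix, assuming the column can simply be widened: cast to float64 (or to int64 if there are no NaNs) before the replacement. The sample values below are hypothetical:
import numpy as np
import pandas as pd

df = pd.DataFrame({'limit': np.array([0.0, 12.5, 0.0, 7.25], dtype=np.float32)})

# float64's 53-bit significand easily holds the 34 bits this number needs
df['limit'] = df['limit'].astype(np.float64).where(df['limit'] != 0, 9999999999)

assert df['limit'].iloc[0] == 9999999999.0  # no rounding to 1e10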

Dividing two pandas values

I have the following counts from this command:
df.ab.value_counts()
Out[154]:
0 31196
1 18804
dtype: int64
I want to find the fraction of 1s out of the total count, so basically 18804/50000.
I do the following:
(df.ab.value_counts()[1])/(df.ab.count())
Out[146]:
0
As you can see, it gives me zero while it should be 0.37608 (18804/50000).
Any idea why?
Edit II:
Any idea how to plot a bar graph of these two values (0 and 1) and their counts?
I think you need to add a cast to float (in Python 2, dividing two integers performs floor division):
print (df.ab.value_counts()[1])/float(df.ab.count())
0.37608
print 18804/50000
0
print 18804/float(50000)
0.37608
print float(18804)/50000
0.37608
You can do it more easily with:
df.ab.value_counts(normalize=True)
from __future__ import division
(df.ab.value_counts()[1])/(df.ab.count())  # should give the float value 0.37608
I haven't tested it, but I hope it helps.
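The Edit II question (a bar graph of the two values and their counts) was not addressed above; a minimal sketch with pandas' built-in plotting (requires matplotlib):
import matplotlib.pyplot as plt

ax = df.ab.value_counts().plot(kind='bar')
ax.set_xlabel('ab')
ax.set_ylabel('count')
plt.show()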
