Pandas .round() is not rounded as desired - python

I have the following row from a huge dataframe with many ids; I picked this one to show the problem:
                 id  year  anual_jobs   anual_wage
874180  20001150368  2010        10.5  1071.595917
After this I run:
df.anual_jobs= df.anual_jobs.round()
I get this warning, but the code runs anyway:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self[name] = value
My result is:
                 id  year  anual_jobs   anual_wage
874180  20001150368  2010        10.0  1071.595917
but I want anual_jobs rounded to 11.0 instead of 10.0.

As @cᴏʟᴅsᴘᴇᴇᴅ pointed out, this happens because numpy rounds half-values to the nearest even integer, and pandas uses numpy for most of its numerical work. You can resolve this by rounding the "old-fashioned" way:
import numpy as np
df.anual_jobs = np.floor(df.anual_jobs + 0.5)
or, on older pandas versions only (the pd.np alias was deprecated in pandas 1.0 and removed in 2.0):
import pandas as pd
df.anual_jobs = pd.np.floor(df.anual_jobs + 0.5)
As @cᴏʟᴅsᴘᴇᴇᴅ also pointed out, you can resolve the slice assignment warning by making your dataframe a free-standing frame instead of a view on an older dataframe, i.e., execute the following at some point before you assign values into the dataframe:
df = df.copy()
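To make the difference concrete, here is a minimal sketch that compares the default half-to-even rounding with the floor(x + 0.5) approach on a free-standing copy. The frame df_all, the filter on year, and every value other than the 10.5 row are made up for illustration and are not taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical parent frame; only the 10.5 row mirrors the question.
df_all = pd.DataFrame({
    "id": [20001150368, 20001150369, 20001150370],
    "year": [2010, 2010, 2011],
    "anual_jobs": [10.5, 11.5, 10.2],
})

# .copy() gives a free-standing frame, so the assignment below
# does not write into a slice of df_all.
df = df_all[df_all["year"] == 2010].copy()

# Default behaviour: halves go to the nearest even integer.
half_even = df["anual_jobs"].round()

# "Old-fashioned" rounding: halves always go up.
df["anual_jobs"] = np.floor(df["anual_jobs"] + 0.5)

print(half_even.tolist())
print(df["anual_jobs"].tolist())
```

Note that floor(x + 0.5) rounds negative halves toward positive infinity (for example -10.5 becomes -10.0), which may or may not be what you want.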

If the half-integer case is what matters to you, use the decimal module:
from decimal import Decimal, ROUND_HALF_UP
print(Decimal(10.5).quantize(0, ROUND_HALF_UP))
print(Decimal(10.2).quantize(0, ROUND_HALF_UP))
>> 11
>> 10
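The same idea can be wrapped in a small helper. This is only a sketch: the name round_half_up and its places parameter are hypothetical, and going through str() is an assumption made here so that the Decimal sees the printed value of the float rather than its full binary expansion.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value, places=0):
    """Round a number so that ties go away from zero (hypothetical helper)."""
    # Quantization target: 1 for places=0, 0.1 for places=1, 0.01 for places=2, ...
    exponent = Decimal(10) ** -places
    # str() keeps the decimal digits as printed, e.g. "2.675".
    rounded = Decimal(str(value)).quantize(exponent, rounding=ROUND_HALF_UP)
    return float(rounded)


print(round_half_up(10.5))
print(round_half_up(10.2))
print(round_half_up(2.675, 2))
```

It could then be applied to a column with something like df.anual_jobs.map(round_half_up), at the cost of being slower than the vectorized numpy version.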

Related

How to subset Pandas Dataframe using an OR operator whilst avoiding "FutureWarning: elementwise comparison failed;"

I have a Pandas dataframe (tempDF) of 5 columns by N rows. Each element of the dataframe is an object (string in this case). For example, the dataframe looks like (this is fake data - not real world):
I have two tuples, each contains a collection of numbers as a string type. For example:
codeset = ('6108','532','98120')
additionalClinicalCodes = ('131','1','120','130')
I want to retrieve a subset of the rows from the tempDF in which the columns "medcode" OR "enttype" have at least one entry in the tuples above. Thus, from the example above, I would retrieve a subset containing rows with the index 8 and 9 and 11.
Until updating some packages earlier today (too many now to work out which has started throwing the warning), this did work:
tempDF = tempDF[tempDF["medcode"].isin(codeset) | tempDF["enttype"].isin(additionalClinicalCodes)]
But now it is throwing the warning:
FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
mask |= (ar1 == a)
Looking at the API, isin states the condition "if ALL" is in the iterable collection. I want an "if ANY" condition.
UPDATE #1
The problem lies with using the | operator, and also with the np.logical_or method. If I remove the second isin condition, i.e., just keep tempDF[tempDF["medcode"].isin(codeset)], then no warning is thrown, but I'm only subsetting on one of the two conditions.
import numpy as np
tempDF = tempDF[np.logical_or(tempDF["medcode"].isin(codeset), tempDF["enttype"].isin(additionalClinicalCodes))]
I'm unable to reproduce your warning (I assume you are using an outdated numpy version), however I believe it is related to the fact that your enttype column is a numerical type, but you're using strings in additionalClinicalCodes.
Try this:
tempDF = tempDF[tempDF["medcode"].isin(list(codeset)) | tempDF["enttype"].isin(list(additionalClinicalCodes))]
Boiling your question down to an executable example:
import pandas as pd
tempDF = pd.DataFrame({'medcode': ['6108', '6154', '95744', '98120'], 'enttype': ['99', '131', '372', '372']})
codeset = ('6108','532','98120')
additionalClinicalCodes = ('131','1','120','130')
newDF = tempDF[tempDF["medcode"].isin(codeset) | tempDF["enttype"].isin(additionalClinicalCodes)]
print(newDF)
print("Pandas Version")
print(pd.__version__)
This returns for me
medcode enttype
0 6108 99
1 6154 131
3 98120 372
Pandas Version
1.4.2
Thus I am not able to reproduce your warning.
This is strange numpy behaviour. I think your way is the right way to do this, but if the warning bothers you, try this:
tempDF = tempDF[
    (
        tempDF.medcode.isin(codeset).astype(int) +
        tempDF.enttype.isin(additionalClinicalCodes).astype(int)
    ) >= 1
]
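One of the answers above suspects a dtype mismatch between a numeric enttype column and string codes. The following is a minimal sketch of that scenario, assuming (hypothetically) that enttype holds integers while both tuples hold strings; the data is made up. Casting the column to str before calling isin keeps both sides of the comparison the same type, and the | operator gives the "if ANY" behaviour across the two columns.

```python
import pandas as pd

# Made-up data; enttype is numeric here on purpose (an assumption).
tempDF = pd.DataFrame({
    "medcode": ["6108", "6154", "95744", "98120"],
    "enttype": [99, 131, 372, 372],
})
codeset = ("6108", "532", "98120")
additionalClinicalCodes = ("131", "1", "120", "130")

# Compare like with like: cast the numeric column to str first.
in_codeset = tempDF["medcode"].isin(codeset)
in_additional = tempDF["enttype"].astype(str).isin(additionalClinicalCodes)

# A row is kept when either condition holds.
subset = tempDF[in_codeset | in_additional]
print(subset)
```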

How to add two columns of a dataframe as Decimals?

I am trying to add two columns together using the Decimal module in Python but can't seem to get the syntax right for this. I have 2 columns called month1 and month2 and do not want these to become floats at any point in the outcome as division and then rounding will later be required.
The month1 and month2 columns are already to several decimals as they are averages and I need to preserve this accuracy in the addition.
I can see guidance online for how to add numbers together using Decimal but not how to apply it to columns in a pandas dataframe. I've tried things like:
df['MonthTotal'] = Decimal.decimal(df['Month1']) + Decimal.decimal(df['Month1'])
What is the solution?
from decimal import Decimal
def convert_decimal(row):
    row["monthtotal"] = Decimal(row["month1"]) + Decimal(row["month2"])
    return row
df = df.apply(convert_decimal, axis=1)
decimal.Decimal is designed to accept a single value, not a pandas.Series of them. Assuming that your columns hold strings representing numbers, you can use .applymap to apply decimal.Decimal element-wise (in pandas 2.1 and later, DataFrame.map is the preferred name for the same operation), i.e.:
import decimal
import pandas as pd
df = pd.DataFrame({'x':['0.1','0.1','0.1'],'y':['0.1','0.1','0.1'],'z':['0.1','0.1','0.1']})
df_decimal = df.applymap(decimal.Decimal)
df_decimal["total"] = df_decimal.x + df_decimal.y + df_decimal.z
print(df_decimal.total[0])
print(type(df_decimal.total[0]))
output
0.3
<class 'decimal.Decimal'>
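For just the two columns from the question, a per-column Series.map is a possible alternative. This is a sketch under the assumption that month1 and month2 arrive as strings (for example read with dtype=str), so that no float conversion happens along the way; the sample values are made up.

```python
from decimal import Decimal

import pandas as pd

# Made-up values, kept as strings so they never pass through float.
df = pd.DataFrame({
    "month1": ["1.005", "2.10"],
    "month2": ["0.995", "0.20"],
})

# Convert each column element-wise; the result is an object-dtype Series of Decimal.
month1 = df["month1"].map(Decimal)
month2 = df["month2"].map(Decimal)

# Addition of object columns is delegated to Decimal.__add__ per element.
df["monthtotal"] = month1 + month2
print(df["monthtotal"].tolist())
```

If the columns are already floats, Decimal(value) captures the exact binary value of the float, so converting through str first may be closer to what you expect.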

How do you display values in a pandas dataframe column with 2 decimal places?

What I am looking to do is make it so that regardless of the value, it displays 2 decimal places.
What I have tried thus far:
DF['price'] = DF['price'].apply(lambda x: round(x, 2))
However, the problem is that I wish to display everything in 2 decimal places, but values like 0.5 are staying at 1 decimal place since they don't need to be rounded.
Is there a function I can apply that gives the following type of output:
Current     After Changes
0           0.00
0.5         0.50
1.01        1.01
1.133333    1.13
Ideally, these values will be rounded but I am open to truncating if that is all that works.
I think you want something like this
DF['price'] = DF['price'].apply(lambda x: float("{:.2f}".format(x)))
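This applies the change just to that column. Note that float() turns the formatted text back into a number, so the stored value for 0.5 is still 0.5; drop the float() call if you want the two-decimal strings themselves.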
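You have to set the display precision for pandas. Put this at the top of your script after importing pandas (the short option name 'precision' only resolves on older pandas versions; newer versions need the full name 'display.precision'):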
import pandas as pd
pd.set_option('precision', 2)
If you want to only modify the format of your values without doing any operation in pandas, you should just execute the following instruction:
pd.options.display.float_format = "{:,.2f}".format
You should be able to get more info here:
https://pandas.pydata.org/docs/user_guide/options.html#number-formatting
Try:
import pandas as pd
pd.set_option('display.precision', 2)
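If the goal is a column that always shows two decimals regardless of display options, one option is to store the formatted text next to the numbers. A minimal sketch, using the sample values from the question; the column name price_text is hypothetical.

```python
import pandas as pd

DF = pd.DataFrame({"price": [0, 0.5, 1.01, 1.133333]})

# Keep the numeric column for calculations and add a text column for display.
# "{:.2f}".format rounds to two decimals and pads with trailing zeros.
DF["price_text"] = DF["price"].map("{:.2f}".format)

print(DF["price_text"].tolist())
```

The trade-off is that price_text holds strings, so it should only be used for output, not for further arithmetic.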

Using latest panda APIs to compute exponential moving average

I have a Python v3.6 function that uses pandas to compute the exponential moving average of a list of floating-point numbers. Here is the function; it is tested and works:
def get_moving_average(values, period):
    import pandas as pd
    import numpy as np
    values = np.array(values)
    moving_average = pd.ewma(values, span=period)[-1]
    return moving_average
However, pd.ewma is a deprecated function, and although it still works, I would like to use the latest API and use pandas the correct way.
Here is the documentation for the latest exponential moving average API.
http://pandas.pydata.org/pandas-docs/stable/api.html#exponentially-weighted-moving-window-functions
I modified the original function into this to use the latest API;
def get_moving_average(values, period, type="exponential"):
    import pandas as pd
    import numpy as np
    values = np.array(values)
    moving_average = 0
    moving_average = pd.ewm.mean(values, span=period)[-1]
    return moving_average
Unfortunately, I got the error AttributeError: module 'pandas' has no attribute 'EWM'
The ewm() method now has a similar API to rolling() and expanding(): you call ewm() on a Series or DataFrame and then follow it with a compatible method like mean(). For example:
import numpy as np
import pandas as pd
df=pd.DataFrame({'x':np.random.randn(5)})
df['x'].ewm(halflife=2).mean()
0 -0.442148
1 -0.318170
2 0.099168
3 -0.062827
4 -0.371739
Name: x, dtype: float64
If you try df['x'].ewm() without arguments it will tell you:
Must pass one of com, span, halflife, or alpha
See below for documentation that may be more clear than the link in the OP:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.ewm.html#pandas.DataFrame.ewm
http://pandas.pydata.org/pandas-docs/stable/computation.html#exponentially-weighted-windows
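Applied to the function from the question, the method-chaining API could look like the sketch below. It assumes values is a plain list of numbers and keeps the default adjust=True of ewm(); whether that matches the exact numbers the old pd.ewma call produced in your setup is something to check rather than take for granted.

```python
import pandas as pd


def get_moving_average(values, period):
    """Return the last value of the exponential moving average of `values`."""
    # ewm() is a method on Series/DataFrame, so wrap the list in a Series first.
    series = pd.Series(values, dtype="float64")
    # span=period sets the decay to alpha = 2 / (period + 1).
    return series.ewm(span=period).mean().iloc[-1]


print(get_moving_average([1.0, 2.0, 3.0], 2))
```

Using .iloc[-1] instead of [-1] makes it explicit that the last position is wanted, independent of the index labels.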

Pandas Groupby - Sparse Matrix Error

This question is related to the question I asked previously about using pandas get_dummies() function (link below).
Pandas Get_dummies for nested tables
However, in the course of using the solution provided in the answer, I noticed odd behavior in the groupby function. The issue is that repeated (non-unique) index values for a dataframe appear to cause an error when the matrix is represented in sparse format, while working as expected for a dense matrix.
I have extremely high-dimensional data, so a sparse matrix is required for memory reasons. An example of the error is below. If anyone has a workaround, it would be greatly appreciated.
Working:
import pandas as pd
df = pd.DataFrame({'Instance':[1,1,2,3],'Cat_col':
['John','Smith','Jane','Doe']})
result= pd.get_dummies(df.Cat_col, prefix='Name')
result['Instance'] = df.Instance
result = result.set_index('Instance')
result = result.groupby(level=0).apply(max)
Failing
import pandas as pd
df = pd.DataFrame({'Instance':[1,1,2,3],'Cat_col':
['John','Smith','Jane','Doe']})
result= pd.get_dummies(df.Cat_col, prefix='Name',sparse=True)
result['Instance'] = df.Instance
result = result.set_index('Instance')
result = result.groupby(level=0).apply(max)
Note you will need version 0.16.1 or greater of pandas.
Thank you in advance
You can perform your groupby in a different way as a workaround. Don't set Instance as the index; instead, use the column for your groupby and then drop the Instance column (the last column in this case, since it was just added). Groupby will make an Instance index.
import pandas as pd
df = pd.DataFrame({'Instance':[1,1,2,3],'Cat_col':
['John','Smith','Jane','Doe']})
result= pd.get_dummies(df.Cat_col, prefix='Name',sparse=True)
result['Instance'] = df.Instance
#WORKAROUND:
result=result.groupby('Instance').apply(max)[result.columns[:-1]]
result
Out[58]:
          Name_Doe  Name_Jane  Name_John  Name_Smith
Instance
1                0          0          1           1
2                0          1          0           0
3                1          0          0           0
Note: The sparse dataframe stores your Instance ints as floats within a BlockIndex in the dataframe column. To make the index exactly the same as in the first example, you'd need to convert it from float to int.
result.index=result.index.map(int)
result.index.name='Instance'
