Invalid comparison error in Python while scanning for outliers - python

I'm trying to extract the outliers from a pandas DataFrame using the interquartile range (IQR), but my code raises "TypeError: Invalid comparison between dtype=float64 and str". I don't see where a string is involved: when I checked the objects I was comparing, upp_iqr and low_iqr were pandas.core.series.Series and prices_list was a pandas.core.frame.DataFrame. The DataFrame has only numbers in its cells. This is my code:
prices_list = df.filter(regex='Price')
q1 = prices_list.quantile(0.25)
q3 = prices_list.quantile(0.75)
iqr = q3 - q1
low_iqr = q1 - (1.5 * iqr)
upp_iqr = q3 + (1.5 * iqr)
if any(x > low_iqr for x in prices_list):
    print("check_outlier: False")
    logging.debug(print(price_outliers))
else:
    print("check_outlier: False")
    logging.debug(print(price_outliers))
I expected it to return a DataFrame with the outliers' values and positions. Any help would be appreciated. Thank you.
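The error most likely comes from `for x in prices_list`: iterating over a DataFrame yields its column labels (strings), not its cell values, so each `x > low_iqr` compares a str with a float64 Series. A minimal sketch of one way around it, assuming a toy frame with made-up columns `Price_A` and `Price_B`, is to compare the whole frame against the per-column fences and keep only the cells that fall outside them:

```python
import pandas as pd

# Toy data; the column names and values are made up for illustration.
df = pd.DataFrame({
    "Item": ["a", "b", "c", "d", "e"],
    "Price_A": [10.0, 11.0, 12.0, 13.0, 100.0],
    "Price_B": [5.0, 6.0, 7.0, 8.0, 9.0],
})

prices_list = df.filter(regex="Price")

# Iterating over a DataFrame yields column labels, not cell values.
labels = list(prices_list)

q1 = prices_list.quantile(0.25)
q3 = prices_list.quantile(0.75)
iqr = q3 - q1
low_iqr = q1 - 1.5 * iqr
upp_iqr = q3 + 1.5 * iqr

# Compare the whole frame column-wise against the per-column fences.
is_outlier = (prices_list.lt(low_iqr, axis="columns")
              | prices_list.gt(upp_iqr, axis="columns"))

# Keep outlier values in place (everything else becomes NaN), then drop
# the rows that contain no outlier at all.
price_outliers = prices_list.where(is_outlier).dropna(how="all")
print("check_outlier:", bool(is_outlier.to_numpy().any()))
print(price_outliers)
```

The result keeps the original row index, so the position of each outlier can be read directly from `price_outliers`.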

Related

Fixing IndexingError to clean the data

I'm trying to identify outliers in each housing type category, but I'm running into an issue. Whenever I run the code, I get the following error: "IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)."
grouped = df.groupby('Type')
q1 = grouped["price"].quantile(0.25)
q3 = grouped["price"].quantile(0.75)
iqr = q3 - q1
upper_bound = q3 + (1.5 * iqr)
lower_bound = q1 - (1.5 * iqr)
outliers = df[(df["price"].reset_index(drop=True) > upper_bound[df["Type"]].reset_index(drop=True)) | (df["price"].reset_index(drop=True) < lower_bound[df["Type"].reset_index(drop=True)])]
print(outliers)
When I run this part of the code
(df["price"].reset_index(drop=True) > upper_bound[df["Type"]].reset_index(drop=True)).reset_index(drop = True)
I get a boolean Series, but when I put it inside df[] it breaks.
Use transform to compute q1/q3, this will maintain the original index:
q1 = grouped["price"].transform(lambda x: x.quantile(0.25))
q3 = grouped["price"].transform(lambda x: x.quantile(0.75))
iqr = q3 - q1
upper_bound = q3 + (1.5 * iqr)
lower_bound = q1 - (1.5 * iqr)
outliers = df[df["price"].gt(upper_bound) | df["price"].lt(lower_bound)]
Use Series.map, then reset_index is not necessary:
outliers = df[(df["price"] > df["Type"].map(upper_bound)) |
              (df["price"] < df["Type"].map(lower_bound))]
print(outliers)
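To see that the two suggestions select the same rows, here is a small self-contained sketch. The column names follow the question, while the data and the variable names `via_transform` and `via_map` are made up for illustration:

```python
import pandas as pd

# Toy data; column names follow the question, values are made up.
df = pd.DataFrame({
    "Type": ["house"] * 5 + ["flat"] * 5,
    "price": [100, 110, 120, 130, 500, 50, 55, 60, 65, 5],
})
grouped = df.groupby("Type")["price"]

# Option 1: transform returns one value per row, aligned with df's index.
q1_rows = grouped.transform(lambda s: s.quantile(0.25))
q3_rows = grouped.transform(lambda s: s.quantile(0.75))
iqr_rows = q3_rows - q1_rows
via_transform = df[df["price"].gt(q3_rows + 1.5 * iqr_rows)
                   | df["price"].lt(q1_rows - 1.5 * iqr_rows)]

# Option 2: per-group bounds (indexed by Type), mapped back onto each row.
q1 = grouped.quantile(0.25)
q3 = grouped.quantile(0.75)
iqr = q3 - q1
upper_bound = q3 + 1.5 * iqr
lower_bound = q1 - 1.5 * iqr
via_map = df[(df["price"] > df["Type"].map(upper_bound))
             | (df["price"] < df["Type"].map(lower_bound))]

print(via_map)
```

In both cases the boolean mask shares df's index, which is what df[...] needs; the original attempt failed because reset_index changed the index of the mask but not of df.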

Warning "Automatic reindexing" when applying IQR on the Pandas Dataset

When I try to remove outliers from my dataset, I get this warning:
Code
def remout(df):
    Q1 = df.quantile(0.02)
    Q3 = df.quantile(0.98)
    IQR = Q3 - Q1
    df = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
    return df

df = remout(df)
df
Warning Message
FutureWarning: Automatic reindexing on DataFrame vs Series comparisons is deprecated and will raise ValueError in a future version. Do `left, right = left.align(right, axis=1, copy=False)` before e.g. `left == right`
df = df[~((df < (Q1 - 1.5 * IQR)) |(df > (Q3 + 1.5 * IQR))).any(axis=1)]
First, you're not specifying which column to remove outliers from. If your DataFrame has a single column, that's fine, but in general this function should take a DataFrame and a single column as input.
Second, the interquartile range (IQR) is defined as the difference between the 75th percentile and the 25th percentile, so the values you used for Q1 and Q3 (the 2nd and 98th percentiles) are not quartiles.
Third, there is a problem with the logic in your function. This version solves it without any FutureWarning:
def remout(df, col):
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    print(Q1, Q3, IQR)
    df = df[(df[col] > (Q1 - 1.5 * IQR)) & (df[col] < (Q3 + 1.5 * IQR))]
    return df

df = remout(df, 'col_name')
Instead of col_name specify the name of the column you want to remove outliers from.
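If you do want to filter on several columns at once, the warning can probably be avoided by making sure the DataFrame and the quantile Series carry exactly the same labels. The sketch below assumes the warning was caused by non-numeric columns in df that are missing from the quantile result (the question does not show the data, so this is a guess); `remout_all` is a hypothetical helper name:

```python
import pandas as pd

def remout_all(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with an IQR outlier in any numeric column (hypothetical helper)."""
    num = df.select_dtypes(include="number")
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1
    # num's columns and the quantile Series share the same labels,
    # so the comparison needs no reindexing.
    outlier = (num < (q1 - 1.5 * iqr)) | (num > (q3 + 1.5 * iqr))
    return df[~outlier.any(axis=1)]

# Toy data, made up for illustration.
df = pd.DataFrame({
    "name": list("abcdef"),
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0],
    "y": [10, 11, 12, 13, 14, 15],
})
cleaned = remout_all(df)
print(cleaned)
```

The non-numeric column is kept in the result; it is only excluded from the outlier test.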

Python function that calculates lower and upper bound using quantiles

I have a dataset named data. I would like to calculate the lower and upper bounds for multiple numeric variables: Loan, Amount, Value, LTV, UR. Instead of computing them one by one, how can I automate the following Python code?
#Loan
Q1= data['LOAN'].quantile(q=0.25)
Q3= data['LOAN'].quantile(q=0.75)
IQR= Q3 - Q1
Lower = (Q1 - 1.5*IQR)
Upper = (Q3 + 1.5*IQR)
print('Loan')
print(Lower)
print(Upper)
I'd define a list with the column names, then loop over the columns, calling a function that calculates the values:
def calculateLowerUpper(column):
    # Use the column passed in rather than a hard-coded 'LOAN'
    q1 = data[column].quantile(q=0.25)
    q3 = data[column].quantile(q=0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return lower, upper

# Define the function before the loop that calls it
names = ['LOAN', 'AMOUNT', 'VALUE', 'LTV', 'UR']
for column in names:
    x, y = calculateLowerUpper(column)
    print(column)
    print(x)
    print(y)
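As an alternative to looping, DataFrame.quantile accepts a list of quantiles and returns one row per quantile, so all bounds can be computed in one go. A minimal sketch, assuming a toy `data` frame with only two of the question's columns and made-up values:

```python
import pandas as pd

# Toy data; column names follow the question, values are made up.
data = pd.DataFrame({
    "LOAN": [1.0, 2.0, 3.0, 4.0, 5.0],
    "VALUE": [10.0, 20.0, 30.0, 40.0, 50.0],
})
names = ["LOAN", "VALUE"]

# Rows are the quantiles 0.25 and 0.75, columns are the selected names.
q = data[names].quantile([0.25, 0.75])
iqr = q.loc[0.75] - q.loc[0.25]

# One row per variable, with its lower and upper bound.
bounds = pd.DataFrame({
    "Lower": q.loc[0.25] - 1.5 * iqr,
    "Upper": q.loc[0.75] + 1.5 * iqr,
})
print(bounds)
```

The resulting `bounds` table can then be looked up per variable, for example `bounds.loc["LOAN", "Upper"]`.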

removing outliers from numerical features

Hi, I'm trying to remove outliers from the numerical columns, but when I execute my code the whole dataset is removed. Can anyone tell me what I'm doing wrong?
numerical_columns = data.select_dtypes(include=['int64', 'float64']).columns.tolist()
print('Number of rows before discarding outlier = %d' % (data.shape[0]))
for i in numerical_columns:
    q1 = data[i].quantile(0.25)
    q3 = data[i].quantile(0.75)
    iqr = q3 - q1  # Interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    data = data.loc[(data[i] > fence_low) & (data[i] < fence_high)]
print('Number of rows after discarding outlier = %d' % (data.shape[0]))
The code below worked for me. Here col is the numerical column of the DataFrame from which you need to remove outliers. Note that it uses a 3-standard-deviation rule instead of IQR fences:
import numpy as np
# Remove outliers: keep only the rows that are within +3 to -3
# standard deviations of the column mean
df = df[np.abs(df[col] - df[col].mean()) <= (3 * df[col].std())]
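As for why the IQR loop can empty the dataset, two possible causes (guesses, since the data is not shown) are a column whose IQR is 0, where fence_low equals fence_high and the strict `>` and `<` tests reject every row, and missing values, which fail every comparison. A minimal sketch that computes all fences once and uses inclusive comparisons, with toy data made up to show both cases:

```python
import pandas as pd

# Toy data, made up for illustration: "flag" has an IQR of 0 and
# "score" contains a missing value.
data = pd.DataFrame({
    "amount": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0],
    "flag": [0, 0, 0, 0, 0, 0],
    "score": [1.0, None, 3.0, 4.0, 5.0, 6.0],
})

num = data.select_dtypes(include="number")
q1 = num.quantile(0.25)
q3 = num.quantile(0.75)
iqr = q3 - q1
fence_low = q1 - 1.5 * iqr
fence_high = q3 + 1.5 * iqr

# Inclusive comparisons keep values that sit exactly on a fence, so a
# zero-IQR column such as "flag" no longer empties the frame.
inside = (num >= fence_low) & (num <= fence_high)
# Missing values are not treated as outliers here (a design choice).
keep = (inside | num.isna()).all(axis=1)
cleaned = data[keep]
print(len(data), "->", len(cleaned))
```

Computing the fences from the unfiltered data also means that the result does not depend on the order in which the columns are processed.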

Referring to specific df column when filtering for values in column

I am removing outliers from my dataset and wanted to get some thoughts on efficient methods.
I am currently using IQR to filter out any outliers in my data as below:
Q1 = df.grades.quantile(0.25)
Q3 = df.grades.quantile(0.75)
IQR = Q3 - Q1
Where the grades column in my df contains the values where I want to remove the outliers.
I have existing code that does this for the whole DataFrame. How can I edit the line below so that it only considers the grades column (df.grades) rather than all of df?
df = df[~((df < (Q1 - 1.5 * IQR)) |(df > (Q3 + 1.5 * IQR))).any(axis=1)]
Thanks!
I think you need to remove any and test only the grades column, like:
df = df[~((df.grades < (Q1 - 1.5 * IQR)) | (df.grades > (Q3 + 1.5 * IQR)))]
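The same filter can also be written with Series.between, which is inclusive on both ends by default. A small sketch with made-up data (the `student` column and the values are assumptions for illustration):

```python
import pandas as pd

# Toy data; the grades column follows the question, values are made up.
df = pd.DataFrame({
    "grades": [50, 55, 60, 65, 70, 5, 100],
    "student": list("abcdefg"),
})

Q1 = df.grades.quantile(0.25)
Q3 = df.grades.quantile(0.75)
IQR = Q3 - Q1

# Keep rows whose grade lies within the fences (both ends included).
filtered = df[df.grades.between(Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)]
print(filtered)
```

One difference to be aware of: between drops rows where grades is NaN, while the negated form above keeps them.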
