I have a pandas data frame with a few columns.
Now I know that certain rows are outliers based on a certain column value.
For instance,
column 'Vol' has all values around 12xx and one value is 4000 (an outlier).
Now I would like to exclude the rows that have a 'Vol' value like this.
So, essentially I need to put a filter on the data frame so that we select all rows where the values of a certain column are within, say, 3 standard deviations from the mean.
What is an elegant way to achieve this?
Remove all rows that have outliers in at least one column
If you have multiple columns in your dataframe and would like to remove all rows that have outliers in at least one column, the following expression would do that in one shot:
import pandas as pd
import numpy as np
from scipy import stats
df = pd.DataFrame(np.random.randn(100, 3))
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
Description:
For each column, it first computes the Z-score of each value in the
column, relative to the column mean and standard deviation.
It then takes the absolute Z-score because the direction does not
matter, only if it is below the threshold.
all(axis=1) ensures that, for each row, all columns satisfy the
constraint.
Finally, the result of this condition is used to index the dataframe.
Filter other columns based on a single column
Specify a column for the zscore, df[0] for example, and remove .all(axis=1).
df[(np.abs(stats.zscore(df[0])) < 3)]
For each of your dataframe column, you could get quantile with:
q = df["col"].quantile(0.99)
and then filter with:
df[df["col"] < q]
If you need to remove both lower and upper outliers, combine the conditions with an AND statement:
q_low = df["col"].quantile(0.01)
q_hi = df["col"].quantile(0.99)
df_filtered = df[(df["col"] < q_hi) & (df["col"] > q_low)]
Use boolean indexing as you would do in numpy.array
# example dataset of normally distributed data
df = pd.DataFrame({'Data': np.random.normal(size=200)})
# keep only the ones that are within +3 to -3 standard deviations in the column 'Data'
df[np.abs(df.Data - df.Data.mean()) <= (3 * df.Data.std())]
# or, if you prefer the other way around
df[~(np.abs(df.Data - df.Data.mean()) > (3 * df.Data.std()))]
For a series it is similar:
S = pd.Series(np.random.normal(size=200))
S[~((S-S.mean()).abs() > 3*S.std())]
This answer is similar to the one provided by @tanemaki, but uses a lambda expression instead of scipy.stats.
df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))
standard_deviations = 3
df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < standard_deviations)
.all(axis=1)]
To filter the DataFrame where only ONE column (e.g. 'B') is within three standard deviations:
df[((df['B'] - df['B'].mean()) / df['B'].std()).abs() < standard_deviations]
See here for how to apply this z-score on a rolling basis: Rolling Z-score applied to pandas dataframe
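A minimal sketch of the rolling variant (the window size of 20 is an arbitrary assumption, not from the linked answer):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))
window = 20
rolling_mean = df['A'].rolling(window).mean()
rolling_std = df['A'].rolling(window).std()
rolling_z = (df['A'] - rolling_mean) / rolling_std
# keep rows whose rolling z-score on column 'A' is within 3 standard deviations
# (the first window-1 rows have NaN z-scores and are dropped by the comparison)
df_filtered = df[rolling_z.abs() < 3]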
#------------------------------------------------------------------------------
# accept a dataframe, remove outliers, return cleaned data in a new dataframe
# see http://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm
#------------------------------------------------------------------------------
def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out
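Usage, for example on the question's 'Vol' column (df here is a placeholder for your dataframe):
df_clean = remove_outlier(df, 'Vol')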
Before answering the actual question we should ask another one that's very relevant depending on the nature of your data:
What is an outlier?
Imagine the series of values [3, 2, 3, 4, 999] (where the 999 seemingly doesn't fit in) and analyse various ways of outlier detection.
Z-Score
The problem here is that the value in question distorts our measures mean and std heavily, resulting in inconspicuous z-scores of roughly [-0.5, -0.5, -0.5, -0.5, 2.0], keeping every value within two standard deviations of the mean. One very large outlier can hence distort your whole assessment of outliers. I would discourage this approach.
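A quick check with scipy illustrates this:
import numpy as np
from scipy import stats

values = np.array([3, 2, 3, 4, 999])
print(stats.zscore(values))               # roughly [-0.5 -0.5 -0.5 -0.5  2.0]
print(np.abs(stats.zscore(values)) < 3)   # all True, so the 999 is never flagged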
Quantile Filter
A far more robust approach is given in this answer, eliminating the bottom and top 1% of data. However, this eliminates a fixed fraction regardless of whether those data are really outliers. You might lose a lot of valid data, and on the other hand still keep some outliers if more than 1% or 2% of your data are outliers.
IQR-distance from Median
An even more robust version of the quantile principle: eliminate all data that is more than f times the interquartile range away from the median of the data. That's also the transformation that sklearn's RobustScaler uses, for example. IQR and median are robust to outliers, so this sidesteps the problems of the z-score approach.
In a normal distribution, we have roughly iqr = 1.35 * s, so you would translate z = 3 of a z-score filter to f = 2.22 of an IQR filter. This will drop the 999 in the above example.
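A minimal sketch of that filter on the same toy series:
import pandas as pd

s = pd.Series([3, 2, 3, 4, 999])
iqr = s.quantile(0.75) - s.quantile(0.25)   # robust measure of spread
dist = (s - s.median()).abs() / iqr         # distance from the median in IQR units
print(s[dist < 2.22])                       # keeps 3, 2, 3, 4 and drops the 999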
The basic assumption is that at least the "middle half" of your data is valid and resembles the distribution well; the filter fails if your distribution has wide tails and a narrow q_25% to q_75% interval.
Advanced Statistical Methods
Of course there are fancier statistical methods like the Peirce criterion, Grubbs' test or Dixon's Q-test, just to mention a few, that are also suitable for non-normally distributed data. None of them is easily implemented, so they are not addressed further.
Code
Replacing all outliers for all numerical columns with np.nan on an example data frame. The method is robust against all dtypes that pandas provides and can easily be applied to data frames with mixed types:
import pandas as pd
import numpy as np
# sample data of all dtypes in pandas (column 'a' has an outlier) # dtype:
df = pd.DataFrame({'a': list(np.random.rand(8)) + [123456, np.nan], # float64
'b': [0,1,2,3,np.nan,5,6,np.nan,8,9], # float64 (int-like, but NaN forces float)
'c': [np.nan] + list("qwertzuio"), # object
'd': [pd.to_datetime(_) for _ in range(10)], # datetime64[ns]
'e': [pd.Timedelta(_) for _ in range(10)], # timedelta64[ns]
'f': [True] * 5 + [False] * 5, # bool
'g': pd.Series(list("abcbabbcaa"), dtype="category")}) # category
cols = df.select_dtypes('number').columns # limits to a and b (float) and e (timedelta)
df_sub = df.loc[:, cols]
# OPTION 1: z-score filter: z-score < 3
lim = np.abs((df_sub - df_sub.mean()) / df_sub.std(ddof=0)) < 3
# OPTION 2: quantile filter: discard 1% upper / lower values
lim = np.logical_and(df_sub < df_sub.quantile(0.99, numeric_only=False),
df_sub > df_sub.quantile(0.01, numeric_only=False))
# OPTION 3: iqr filter: within 2.22 IQR (equiv. to z-score < 3)
iqr = df_sub.quantile(0.75, numeric_only=False) - df_sub.quantile(0.25, numeric_only=False)
lim = np.abs((df_sub - df_sub.median()) / iqr) < 2.22
# replace outliers with nan
df.loc[:, cols] = df_sub.where(lim, np.nan)
To drop all rows that contain at least one nan-value:
df.dropna(subset=cols, inplace=True) # drop rows with NaN in numerical columns
# or
df.dropna(inplace=True) # drop rows with NaN in any column
Using pandas 1.3 functions:
pandas.DataFrame.select_dtypes()
pandas.DataFrame.quantile()
pandas.DataFrame.where()
pandas.DataFrame.dropna()
Since I haven't seen an answer that deals with numerical and non-numerical attributes, here is a complementary answer.
You might want to drop the outliers only on numerical attributes (categorical variables can hardly be outliers).
Function definition
I have extended @tanemaki's suggestion to handle data when non-numeric attributes are also present:
import numpy as np
from scipy import stats

def drop_numerical_outliers(df, z_thresh=3):
    # constraints holds True where the absolute z-score of a value is below the threshold
    constraints = df.select_dtypes(include=[np.number]) \
        .apply(lambda x: np.abs(stats.zscore(x)) < z_thresh) \
        .all(axis=1)
    # Drop (in place) the rows that are rejected
    df.drop(df.index[~constraints], inplace=True)
Usage
drop_numerical_outliers(df)
Example
Imagine a dataset df with some values about houses: alley, land contour, sale price, ... E.g: Data Documentation
First, you want to visualise the data on a scatter graph (with a z-score threshold of 3):
# Plot data before dropping those greater than z-score 3.
# The scatterAreaVsPrice function's definition has been removed for readability's sake.
scatterAreaVsPrice(df)
# Drop the outliers on every numerical attribute
drop_numerical_outliers(df)
# Plot the result. All outliers were dropped. Note that the red points are not
# the same outliers as in the first plot, but the newly computed outliers based on the new data frame.
scatterAreaVsPrice(df)
For each series in the dataframe, you could use between and quantile to remove outliers.
x = pd.Series(np.random.normal(size=200)) # with outliers
x = x[x.between(x.quantile(.25), x.quantile(.75))] # without outliers
scipy.stats has methods trim1() and trimboth() to cut the outliers out in a single line, based on rank and a given proportion of values to remove.
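For illustration, a minimal sketch with trimboth (note that it returns a trimmed numpy array, not a pandas object):
import numpy as np
from scipy import stats

x = np.random.normal(size=200)
trimmed = stats.trimboth(x, 0.05)   # cut 5% off each end
print(len(x), len(trimmed))         # 200 -> 180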
If you like method chaining, you can get your boolean condition for all numeric columns like this:
df.sub(df.mean()).div(df.std()).abs().lt(3)
Each value of each column will be converted to True/False based on whether or not it is less than three standard deviations away from the mean.
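To actually filter rows with that condition, reduce it per row, for example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))
mask = df.sub(df.mean()).div(df.std()).abs().lt(3).all(axis=1)
df_filtered = df[mask]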
Another option is to transform your data so that the effect of outliers is mitigated. You can do this by winsorizing your data.
import pandas as pd
from scipy.stats import mstats
%matplotlib inline
test_data = pd.Series(range(30))
test_data.plot()
# Truncate values to the 5th and 95th percentiles
transformed_test_data = pd.Series(mstats.winsorize(test_data, limits=[0.05, 0.05]))
transformed_test_data.plot()
You can use a boolean mask:
import pandas as pd
def remove_outliers(df, q=0.05):
    upper = df.quantile(1 - q)
    lower = df.quantile(q)
    mask = (df < upper) & (df > lower)
    return mask
t = pd.DataFrame({'train': [1,1,2,3,4,5,6,7,8,9,9],
'y': [1,0,0,1,1,0,0,1,1,1,0]})
mask = remove_outliers(t['train'], 0.1)
print(t[mask])
output:
train y
2 2 0
3 3 1
4 4 1
5 5 0
6 6 0
7 7 1
8 8 1
Since I am in a very early stage of my data science journey, I am treating outliers with the code below.
#Outlier Treatment
import numpy as np

def outlier_detect(df):
    for i in df.describe().columns:
        Q1 = df.describe().at['25%', i]
        Q3 = df.describe().at['75%', i]
        IQR = Q3 - Q1              # interquartile range
        LTV = Q1 - 1.5 * IQR       # lower fence
        UTV = Q3 + 1.5 * IQR       # upper fence
        x = np.array(df[i])
        p = []
        for j in x:
            # replace values outside the fences with the column median
            if j < LTV or j > UTV:
                p.append(df[i].median())
            else:
                p.append(j)
        df[i] = p
    return df
Get the 98th and 2nd percentiles as the limits of our outliers, then cap the values outside them:
upper_limit = np.percentile(X_train['logerror'].values, 98)
lower_limit = np.percentile(X_train['logerror'].values, 2)
# Cap the outliers in the dataframe
X_train.loc[X_train['logerror'] > upper_limit, 'logerror'] = upper_limit
X_train.loc[X_train['logerror'] < lower_limit, 'logerror'] = lower_limit
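The same capping can also be written more compactly with clip, for example:
X_train['logerror'] = X_train['logerror'].clip(lower=lower_limit, upper=upper_limit)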
A full example with data and 2 groups follows:
Imports:
from io import StringIO
import pandas as pd
#pandas config
pd.set_option('display.max_rows', 20)
Data example with 2 groups: G1:Group 1. G2: Group 2:
TESTDATA = StringIO("""G1;G2;Value
1;A;1.6
1;A;5.1
1;A;7.1
1;A;8.1
1;B;21.1
1;B;22.1
1;B;24.1
1;B;30.6
2;A;40.6
2;A;51.1
2;A;52.1
2;A;60.6
2;B;80.1
2;B;70.6
2;B;90.6
2;B;85.1
""")
Read text data to pandas dataframe:
df = pd.read_csv(TESTDATA, sep=";")
Define the outliers using standard deviations
stds = 1.0
outliers = df[['G1', 'G2', 'Value']].groupby(['G1','G2']).transform(
lambda group: (group - group.mean()).abs().div(group.std())) > stds
Define filtered data values and the outliers:
dfv = df[outliers.Value == False]
dfo = df[outliers.Value == True]
Print the result:
print('\n' * 5, 'All values with decimal .1 are non-outliers; on the other hand, all values with decimal .6 are outliers.')
print('\nDATA:\n%s\n\nFiltered values with %s stds:\n%s\n\nOutliers:\n%s' % (df, stds, dfv, dfo))
My function for dropping outliers
def drop_outliers(df, field_name):
    distance = 1.5 * (np.percentile(df[field_name], 75) - np.percentile(df[field_name], 25))
    df.drop(df[df[field_name] > distance + np.percentile(df[field_name], 75)].index, inplace=True)
    df.drop(df[df[field_name] < np.percentile(df[field_name], 25) - distance].index, inplace=True)
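Usage, again with the question's 'Vol' column as an example (the function drops rows in place):
drop_outliers(df, 'Vol')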
I prefer to clip rather than drop. The following will clip each column in place at the 2nd and 98th percentiles.
minPercentile = 0.02
maxPercentile = 0.98
for col in list(df):
    df[col] = df[col].clip(df[col].quantile(minPercentile), df[col].quantile(maxPercentile))
If your data frame has outliers, there are many ways you can handle them.
Most of them are covered in my articles: Give this a read
Find the code here : Notebook
I believe that deleting and dropping outliers is statistically wrong.
It makes the data different from the original data.
It also leaves the data unevenly shaped, so the best way is to reduce or avoid the effect of outliers by log-transforming the data.
This worked for me:
np.log(data.iloc[:, :])
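Note that np.log is only defined for positive values; if the data contains zeros, np.log1p (log of 1 + x) is a common alternative, for example:
import numpy as np
import pandas as pd

data = pd.DataFrame({'x': [0, 1, 10, 1000]})
print(np.log1p(data))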
I am working with a large biological dataset.
I want to calculate the PCC (Pearson's correlation coefficient) of all 2-column combinations in my data table and save the result as a DataFrame or CSV file.
The data table looks like the one below: columns are gene names, and rows are dataset codes. The float numbers indicate how strongly the gene is activated in the dataset.
GeneA GeneB GeneC ...
DataA 1.5 2.5 3.5 ...
DataB 5.5 6.5 7.5 ...
DataC 8.5 8.5 8.5 ...
...
As output, I want to build a table (DataFrame or CSV file) like the one below, because scipy.stats.pearsonr returns (PCC, p-value).
In my example, XX and YY mean the results of pearsonr([1.5, 5.5, 8.5], [2.5, 6.5, 8.5]). In the same way, ZZ and AA mean the result of pearsonr([1.5, 5.5, 8.5], [3.5, 7.5, 8.5]). I do not need the redundant data such as GeneB_GeneA or GeneC_GeneB in my test.
PCC P-value
GeneA_GeneB XX YY
GeneA_GeneC ZZ AA
GeneB_GeneC BB CC
...
As there are many columns and rows (over 100) and their names are complicated, using column names or row names will be difficult.
It might be a simple problem for experts, but I do not know how to deal with this kind of table with Python and the pandas library. In particular, making a new DataFrame and adding results to it seems very difficult.
Sorry for my poor explanation, but I hope someone could help me.
from pandas import *
import numpy as np
from scipy.stats import pearsonr
import itertools
Creating random sample data:
df = DataFrame(np.random.random((5, 5)), columns=['gene_' + chr(i + ord('a')) for i in range(5)])
print(df)
gene_a gene_b gene_c gene_d gene_e
0 0.471257 0.854139 0.781204 0.678567 0.697993
1 0.292909 0.046159 0.250902 0.064004 0.307537
2 0.422265 0.646988 0.084983 0.822375 0.713397
3 0.113963 0.016122 0.227566 0.206324 0.792048
4 0.357331 0.980479 0.157124 0.560889 0.973161
correlations = {}
columns = df.columns.tolist()
for col_a, col_b in itertools.combinations(columns, 2):
    correlations[col_a + '__' + col_b] = pearsonr(df.loc[:, col_a], df.loc[:, col_b])
result = DataFrame.from_dict(correlations, orient='index')
result.columns = ['PCC', 'p-value']
print(result.sort_index())
PCC p-value
gene_a__gene_b 0.461357 0.434142
gene_a__gene_c 0.177936 0.774646
gene_a__gene_d -0.854884 0.064896
gene_a__gene_e -0.155440 0.802887
gene_b__gene_c -0.575056 0.310455
gene_b__gene_d -0.097054 0.876621
gene_b__gene_e 0.061175 0.922159
gene_c__gene_d -0.633302 0.251381
gene_c__gene_e -0.771120 0.126836
gene_d__gene_e 0.531805 0.356315
Get unique combinations of DataFrame columns using
itertools.combinations(iterable, r)
Iterate through these combinations and calculate pairwise correlations using scipy.stats.pearsonr
Add results (PCC and p-value tuple) to dictionary
Build DataFrame from dictionary
You could then also save the result with result.to_csv(). You might find it convenient to use a MultiIndex (two index levels containing the names of each pair of columns) instead of the concatenated names for the pairwise correlations.
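A minimal sketch of that MultiIndex variant, reusing the df from the example above (the output filename is arbitrary):
import itertools
import pandas as pd
from scipy.stats import pearsonr

correlations = {}
for col_a, col_b in itertools.combinations(df.columns, 2):
    r, p = pearsonr(df[col_a], df[col_b])
    correlations[(col_a, col_b)] = (r, p)

result = pd.DataFrame.from_dict(correlations, orient='index', columns=['PCC', 'p-value'])
result.index = pd.MultiIndex.from_tuples(result.index)
result.to_csv('correlations.csv')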
A simple solution is to use the pairwise_corr function of the Pingouin package (which I created):
import pingouin as pg
pg.pairwise_corr(data, method='pearson')
This will give you a DataFrame with all combinations of columns, and, for each of those, the r-value, p-value, sample size, and more.
There are also a number of options to specify one or more columns (e.g. one-vs-all behavior), as well as covariates for partial correlation and different methods to calculate the correlation coefficient. Please see this example Jupyter Notebook for a more in-depth demo.
Getting the pairs is a combinations problem. You can concat all the rows into one result dataframe.
import pandas as pd
from itertools import combinations
from scipy.stats import pearsonr

df = pd.read_csv('gene.csv')
# get the column names as a list; these are the gene names
column_list = df.columns.values.tolist()
result = []
for c in combinations(column_list, 2):
    first_gene, second_gene = c
    # get the PCC and p-value using scipy
    pcc, p_value = pearsonr(df[first_gene], df[second_gene])
    result.append(pd.DataFrame([{'PCC': pcc, 'P-value': p_value}],
                               index=[first_gene + '_' + second_gene],
                               columns=['PCC', 'P-value']))
result_df = pd.concat(result)
#result_df.to_csv(...)
Assuming the data you have is in a pandas DataFrame.
df.corr('pearson') # 'kendall', and 'spearman' are the other 2 options
will provide you a correlation matrix between each column.
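If you only need the coefficients (no p-values) in pair form, one way is to keep the upper triangle of that matrix and stack it, for example (assuming df is the gene table):
import numpy as np

corr = df.corr('pearson')
# keep each pair only once (upper triangle, excluding the diagonal), then flatten
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()   # MultiIndexed Series: (GeneA, GeneB) -> r
print(pairs)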
I'm currently learning how to use Pandas, and I'm in a situation where I'm attempting to replace missing data (Horsepower feature) using a best-fit line generated from linear regression with the Displacement column. What I'm doing is iterating through only the parts of the dataframe that are marked as NaN in the Horsepower column and replacing the data by feeding in the value of Displacement in that same row into the best-fit algorithm. My code looks like this:
for row, value in auto_data.HORSEPOWER[pd.isnull(auto_data.HORSEPOWER)].iteritems():
    auto_data.HORSEPOWER[row] = int(round(slope * auto_data.DISPLACEMENT[row] + intercept))
Now, the code works and the data is replaced as expected, but it generates the SettingWithCopyWarning when I run it. I understand why the warning is generated, and that in this case I'm fine, but if there is a better way to iterate through the subset, or a method that's just more elegant, I'd rather avoid chained indexing that could cause a real problem in the future.
I've looked through the docs, and through other answers on Stack Overflow. All solutions to this seem to use .loc, but I just can't figure out the correct syntax to get the subset of NaN rows using .loc. Any help is appreciated. If it helps, the dataframe looks like this:
auto_data.dtypes
Out[15]:
MPG float64
CYLINDERS int64
DISPLACEMENT float64
HORSEPOWER float64
WEIGHT int64
ACCELERATION float64
MODELYEAR int64
NAME object
dtype: object
IIUC you should be able to just do:
auto_data.loc[auto_data['HORSEPOWER'].isnull(), 'HORSEPOWER'] = np.round(slope * auto_data['DISPLACEMENT'] + intercept)
The above is vectorised and avoids looping; the warning you get comes from doing this:
auto_data.HORSEPOWER[row]
I think if you did:
auto_data.loc[row,'HORSEPOWER']
then the warning should not be raised
Instead of looping through the DataFrame row-by-row, it would be more efficient to calculate the extrapolated values in a vectorized way for the entire column:
y = (slope * auto_data['DISPLACEMENT'] + intercept).round()
and then use update to replace the NaN values:
auto_data['HORSEPOWER'].update(y)
Using update works for the particular case of replacing NaN values.
Ed Chum's solution shows how to replace the value in arbitrary rows by using a boolean mask and auto_data.loc.
For example,
import numpy as np
import pandas as pd
auto_data = pd.DataFrame({
'HORSEPOWER':[1, np.nan, 2],
'DISPLACEMENT': [3, 4, 5]})
slope, intercept = 2, 0.5
y = (slope * auto_data['DISPLACEMENT'] + intercept).round()
auto_data['HORSEPOWER'].update(y)
print(auto_data)
yields
DISPLACEMENT HORSEPOWER
0 3 6
1 4 8
2 5 10
On a pandas dataframe, I know I can groupby on one or more columns and then filter values that occur more/less than a given number.
But I want to do this on every column on the dataframe. I want to remove values that are too infrequent (let's say that occur less than 5% of times) or too frequent. As an example, consider a dataframe with following columns: city of origin, city of destination, distance, type of transport (air/car/foot), time of day, price-interval.
import pandas as pd
import string
import numpy as np
vals = [(c, np.random.choice(list(string.ascii_lowercase), 100, replace=True)) for c in
        ('city of origin', 'city of destination', 'distance, type of transport (air/car/foot)',
         'time of day, price-interval')]
df = pd.DataFrame(dict(vals))
>> df.head()
city of destination city of origin distance, type of transport (air/car/foot) time of day, price-interval
0 f p a n
1 k b a f
2 q s n j
3 h c g u
4 w d m h
If this is a big dataframe, it makes sense to remove rows that have spurious items, for example, if time of day = night occurs only 3% of the time, or if foot mode of transport is rare, and so on.
I want to remove all such values from all columns (or a list of columns). One idea I have is to do a value_counts on every column, transform and add one column for each value_counts; then filter based on whether they are above or below a threshold. But I think there must be a better way to achieve this?
This procedure will go through each column of the DataFrame and eliminate rows where the given category is less than a given threshold percentage, shrinking the DataFrame on each loop.
This answer is similar to that provided by @Ami Tavory, but with a few subtle differences:
It normalizes the value counts so you can just use a percentile threshold.
It calculates counts just once per column instead of twice. This results in faster execution.
Code:
threshold = 0.03
for col in df:
    counts = df[col].value_counts(normalize=True)
    df = df.loc[df[col].isin(counts[counts > threshold].index), :]
Code timing:
df2 = pd.DataFrame(np.random.choice(list(string.ascii_lowercase), (1000000, 4), replace=True),
                   columns=list('ABCD'))
%%timeit df=df2.copy()
threshold = 0.03
for col in df:
    counts = df[col].value_counts(normalize=True)
    df = df.loc[df[col].isin(counts[counts > threshold].index), :]
1 loops, best of 3: 485 ms per loop
%%timeit df=df2.copy()
m = 0.03 * len(df)
for c in df:
    df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]
1 loops, best of 3: 688 ms per loop
I would go with one of the following:
Option A
m = 0.03 * len(df)
df[np.all(
    df.apply(
        lambda c: c.isin(c.value_counts()[c.value_counts() > m].index).to_numpy()),
    axis=1)]
Explanation:
m = 0.03 * len(df) is the threshold (it's nice to take the constant out of the complicated expression)
df[np.all(..., axis=1)] retains the rows where some condition was obtained across all columns.
df.apply(...) applies a function to each column; .to_numpy() turns each result into a plain array, so together they make a matrix of the results.
c.isin(...) checks, for each column item, whether it is in some set.
c.value_counts()[c.value_counts() > m].index is the set of all values in a column whose count is above m.
Option B
m = 0.03 * len(df)
for c in df.columns:
    df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]
The explanation is similar to the one above.
Tradeoffs:
Personally, I find B more readable.
B creates a new DataFrame for each filtering of a column; for large DataFrames, it's probably more expensive.
I am new to Python and using Pandas. I came up with the following solution below. Maybe other people might have a better or more efficient approach.
Assuming your DataFrame is DF, you can use the following code below to filter out all infrequent values. Just be sure to update the col and bin_freq variable. DF_Filtered is your new filtered DataFrame.
# Column you want to filter
col = 'time of day'
# Set your frequency to filter out. Currently set to 5%
bin_freq = float(5)/float(100)
DF_Filtered = pd.DataFrame()
for i in DF[col].unique():
    counts = DF[DF[col] == i].count()[col]
    total_counts = DF[col].count()
    freq = float(counts) / float(total_counts)
    if freq > bin_freq:
        DF_Filtered = pd.concat([DF[DF[col] == i], DF_Filtered])
print(DF_Filtered)
Older pandas versions provide clip_lower(threshold, axis=None) and clip_upper(threshold, axis=None) on DataFrames, which cap all values below or above (respectively) a certain threshold; in current pandas the same is done with clip(lower=..., upper=...).
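For example (the thresholds here are arbitrary):
import pandas as pd

df = pd.DataFrame({'A': [1, 50, 500], 'B': [2, 60, 700]})
print(df.clip(lower=10, upper=100))   # values below 10 become 10, values above 100 become 100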
We can also replace all the rare categories with one label, say "Rare" and remove later if this doesn't add value to prediction.
# function that finds the labels occurring more often than a certain percentage/threshold
def get_freq_labels(df, var, rare_perc):
    df = df.copy()
    tmp = df.groupby(var)[var].count() / len(df)
    return tmp[tmp > rare_perc].index
vars_cat = [val for val in data.columns if data[val].dtype=='O']
for var in vars_cat:
    # find the frequent categories
    frequent_cat = get_freq_labels(data, var, 0.05)
    # replace rare categories by the string "Rare"
    data[var] = np.where(data[var].isin(frequent_cat), data[var], 'Rare')