Detect and exclude outliers in a pandas DataFrame - python
I have a pandas data frame with a few columns.
Now I know that certain rows are outliers based on a certain column value.
For instance
column 'Vol' has all values around 12xx and one value is 4000 (outlier).
Now I would like to exclude those rows that have a 'Vol' value like this.
So, essentially I need to put a filter on the data frame such that we select all rows where the values of a certain column are within, say, 3 standard deviations from the mean.
What is an elegant way to achieve this?
Remove all rows that have outliers in at least one column
If you have multiple columns in your dataframe and would like to remove all rows that have outliers in at least one column, the following expression would do that in one shot:
import pandas as pd
import numpy as np
from scipy import stats
df = pd.DataFrame(np.random.randn(100, 3))
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
Description:
For each column, it first computes the Z-score of each value in the
column, relative to the column mean and standard deviation.
It then takes the absolute Z-score because the direction does not
matter, only whether it is below the threshold.
all(axis=1) ensures that for each row, all columns satisfy the
constraint.
Finally, the result of this condition is used to index the dataframe.
Filter other columns based on a single column
Specify a column for the zscore, df[0] for example, and remove .all(axis=1).
df[(np.abs(stats.zscore(df[0])) < 3)]
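Applied to the question's own setup, a minimal sketch follows; the column name 'Vol' and the numbers are hypothetical stand-ins for the data described above, and stats.zscore is assumed to see no NaNs (it does not skip them by default).
import numpy as np
import pandas as pd
from scipy import stats

# hypothetical data in the spirit of the question: values around 12xx plus one 4000
vol = [1210, 1250, 1230, 1280, 1220, 1240, 1260, 1215, 1235, 1270,
       1225, 1245, 1255, 1265, 1205, 1290, 1212, 1248, 1233, 1275, 4000]
df = pd.DataFrame({'Vol': vol})

# keep only the rows whose 'Vol' value lies within 3 standard deviations of the mean
df_clean = df[np.abs(stats.zscore(df['Vol'])) < 3]   # the 4000 row is dropped here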
For each of your DataFrame columns, you could get the quantile with:
q = df["col"].quantile(0.99)
and then filter with:
df[df["col"] < q]
If you need to remove lower and upper outliers, combine the conditions with an AND statement:
q_low = df["col"].quantile(0.01)
q_hi = df["col"].quantile(0.99)
df_filtered = df[(df["col"] < q_hi) & (df["col"] > q_low)]
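A compact variant of the same two-sided filter, for what it's worth, is Series.between; note that it is inclusive of both bounds by default, unlike the strict comparisons above.
import numpy as np
import pandas as pd

df = pd.DataFrame({"col": np.random.randn(1000)})   # hypothetical example column

q_low = df["col"].quantile(0.01)
q_hi = df["col"].quantile(0.99)

# keep values inside [q_low, q_hi]; bounds are inclusive by default
df_filtered = df[df["col"].between(q_low, q_hi)]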
Use boolean indexing as you would in numpy.array
df = pd.DataFrame({'Data':np.random.normal(size=200)})
# example dataset of normally distributed data.
df[np.abs(df.Data-df.Data.mean()) <= (3*df.Data.std())]
# keep only the ones that are within +3 to -3 standard deviations in the column 'Data'.
df[~(np.abs(df.Data-df.Data.mean()) > (3*df.Data.std()))]
# or if you prefer the other way around
For a series it is similar:
S = pd.Series(np.random.normal(size=200))
S[~((S-S.mean()).abs() > 3*S.std())]
This answer is similar to the one provided by @tanemaki, but uses a lambda expression instead of scipy.stats.
df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))
standard_deviations = 3
df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < standard_deviations)
.all(axis=1)]
To filter the DataFrame where only ONE column (e.g. 'B') is within three standard deviations:
df[((df['B'] - df['B'].mean()) / df['B'].std()).abs() < standard_deviations]
See here for how to apply this z-score on a rolling basis: Rolling Z-score applied to pandas dataframe
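For the rolling case mentioned above, a rough sketch is shown below; the window length of 30 is an arbitrary assumption, and the first window-1 points have no rolling statistics and are therefore excluded as well.
import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(500))    # hypothetical series

window = 30                            # assumed window length
rolling_mean = s.rolling(window).mean()
rolling_std = s.rolling(window).std()
rolling_z = (s - rolling_mean) / rolling_std

# keep points whose rolling z-score is within 3 standard deviations;
# points without a full trailing window have NaN z-scores and are dropped
s_filtered = s[rolling_z.abs() < 3]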
#------------------------------------------------------------------------------
# accept a dataframe, remove outliers, return cleaned data in a new dataframe
# see http://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm
#------------------------------------------------------------------------------
def remove_outlier(df_in, col_name):
q1 = df_in[col_name].quantile(0.25)
q3 = df_in[col_name].quantile(0.75)
iqr = q3-q1 #Interquartile range
fence_low = q1-1.5*iqr
fence_high = q3+1.5*iqr
df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
return df_out
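Usage would look something like the sketch below; the column name 'Vol' and the values are hypothetical, taken from the question's description.
import pandas as pd

df = pd.DataFrame({'Vol': [1210, 1250, 1230, 1280, 1220, 4000]})   # hypothetical data
df_clean = remove_outlier(df, 'Vol')   # the 4000 row falls outside the upper fence and is removed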
Before answering the actual question we should ask another one that's very relevant depending on the nature of your data:
What is an outlier?
Imagine the series of values [3, 2, 3, 4, 999] (where the 999 seemingly doesn't fit in) and analyse various ways of outlier detection.
Z-Score
The problem here is that the value in question distorts our measures mean and std heavily, resulting in inconspicuous z-scores of roughly [-0.5, -0.5, -0.5, -0.5, 2.0], keeping every value within two standard deviations of the mean. One very large outlier might hence distort your whole assessment of outliers. I would discourage this approach.
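To see this masking effect concretely, here is a quick check on that toy series (figures are rounded; the exact z-scores depend on the ddof used):
import numpy as np
from scipy import stats

values = np.array([3, 2, 3, 4, 999])
print(stats.zscore(values))   # roughly [-0.5, -0.5, -0.5, -0.5, 2.0]
# the 999 stays below |z| = 3 (even below 2), so a z-score filter would keep it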
Quantile Filter
A much more robust approach is given in this answer, eliminating the bottom and top 1% of the data. However, this eliminates a fixed fraction independent of whether these data are really outliers. You might lose a lot of valid data, and on the other hand still keep some outliers if you have more than 1% or 2% of your data as outliers.
IQR-distance from Median
An even more robust version of the quantile principle: eliminate all data that is more than f times the interquartile range away from the median of the data. That is also the transformation that sklearn's RobustScaler uses, for example. IQR and median are robust to outliers, so you sidestep the problems of the z-score approach.
In a normal distribution, we have roughly iqr = 1.35*s, so you would translate z = 3 of a z-score filter to f = 2.22 of an IQR filter. This will drop the 999 in the above example.
The basic assumption is that at least the "middle half" of your data is valid and resembles the distribution well, whereas you will also run into trouble if your distribution has wide tails and a narrow q_25% to q_75% interval.
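A minimal sketch of that IQR-from-median rule on the same toy series, using the f = 2.22 derived above:
import pandas as pd

s = pd.Series([3, 2, 3, 4, 999])
median = s.median()                          # 3.0
iqr = s.quantile(0.75) - s.quantile(0.25)    # 1.0

s_clean = s[(s - median).abs() <= 2.22 * iqr]   # the 999 is dropped, the rest is kept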
Advanced Statistical Methods
Of course there are fancy mathematical methods like the Peirce criterion, Grubbs' test or Dixon's Q-test, just to mention a few, that are also suitable for non-normally distributed data. None of them is easily implemented, so they are not addressed further.
Code
Replacing all outliers for all numerical columns with np.nan on an example data frame. The method is robust against all dtypes that pandas provides and can easily be applied to data frames with mixed types:
import pandas as pd
import numpy as np
# sample data of all dtypes in pandas (column 'a' has an outlier) # dtype:
df = pd.DataFrame({'a': list(np.random.rand(8)) + [123456, np.nan], # float64
'b': [0,1,2,3,np.nan,5,6,np.nan,8,9], # int64
'c': [np.nan] + list("qwertzuio"), # object
'd': [pd.to_datetime(_) for _ in range(10)], # datetime64[ns]
'e': [pd.Timedelta(_) for _ in range(10)], # timedelta[ns]
'f': [True] * 5 + [False] * 5, # bool
'g': pd.Series(list("abcbabbcaa"), dtype="category")}) # category
cols = df.select_dtypes('number').columns # limits to a (float), b (int) and e (timedelta)
df_sub = df.loc[:, cols]
# OPTION 1: z-score filter: z-score < 3
lim = np.abs((df_sub - df_sub.mean()) / df_sub.std(ddof=0)) < 3
# OPTION 2: quantile filter: discard 1% upper / lower values
lim = np.logical_and(df_sub < df_sub.quantile(0.99, numeric_only=False),
df_sub > df_sub.quantile(0.01, numeric_only=False))
# OPTION 3: iqr filter: within 2.22 IQR (equiv. to z-score < 3)
iqr = df_sub.quantile(0.75, numeric_only=False) - df_sub.quantile(0.25, numeric_only=False)
lim = np.abs((df_sub - df_sub.median()) / iqr) < 2.22
# replace outliers with nan
df.loc[:, cols] = df_sub.where(lim, np.nan)
To drop all rows that contain at least one nan-value:
df.dropna(subset=cols, inplace=True) # drop rows with NaN in numerical columns
# or
df.dropna(inplace=True) # drop rows with NaN in any column
Using pandas 1.3 functions:
pandas.DataFrame.select_dtypes()
pandas.DataFrame.quantile()
pandas.DataFrame.where()
pandas.DataFrame.dropna()
Since I haven't seen an answer that deals with numerical and non-numerical attributes, here is a complementary answer.
You might want to drop the outliers only on numerical attributes (categorical variables can hardly be outliers).
Function definition
I have extended @tanemaki's suggestion to handle data when non-numeric attributes are also present:
import numpy as np
from scipy import stats

def drop_numerical_outliers(df, z_thresh=3):
    # `constrains` holds True for rows where every numeric value is below the threshold
    constrains = df.select_dtypes(include=[np.number]) \
        .apply(lambda x: np.abs(stats.zscore(x)) < z_thresh) \
        .all(axis=1)
    # Drop (in place) the rows marked for rejection
    df.drop(df.index[~constrains], inplace=True)
Usage
drop_numerical_outliers(df)
Example
Imagine a dataset df with some values about houses: alley, land contour, sale price, ... (e.g. the Data Documentation).
First, you want to visualise the data on a scatter graph (with z-score threshold = 3):
# Plot the data before dropping rows whose z-score exceeds 3.
# The scatterAreaVsPrice function's definition has been removed for readability's sake.
scatterAreaVsPrice(df)
# Drop the outliers on every attribute
drop_numerical_outliers(df)
# Plot the result. All outliers were dropped. Note that the red points are not
# the same outliers from the first plot, but the newly computed outliers based on the new data frame.
scatterAreaVsPrice(df)
For each series in the dataframe, you could use between and quantile to remove outliers.
x = pd.Series(np.random.normal(size=200)) # with outliers
x = x[x.between(x.quantile(.25), x.quantile(.75))] # without outliers
scipy.stats has the methods trim1() and trimboth() to cut the outliers out in a single line, based on rank and a given percentage of values to remove.
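A short sketch of how those would be used; note that both operate on the values and return plain arrays, so the original order/index is not preserved.
import numpy as np
from scipy import stats

x = np.random.normal(size=200)                      # hypothetical data

trimmed_both = stats.trimboth(x, 0.05)              # drop the lowest and highest 5% of values
trimmed_high = stats.trim1(x, 0.05, tail='right')   # drop only the highest 5% of values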
If you like method chaining, you can get your boolean condition for all numeric columns like this:
df.sub(df.mean()).div(df.std()).abs().lt(3)
Each value of each column will be converted to True/False based on whether it is less than three standard deviations away from the mean or not.
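To actually drop the offending rows with that condition, you can chain .all(axis=1) onto it, mirroring the z-score answer above; the random data here is just an illustration.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))   # example data

# keep rows where every column is within 3 standard deviations of its mean
df_clean = df[df.sub(df.mean()).div(df.std()).abs().lt(3).all(axis=1)]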
Another option is to transform your data so that the effect of outliers is mitigated. You can do this by winsorizing your data.
import pandas as pd
from scipy.stats import mstats
%matplotlib inline
test_data = pd.Series(range(30))
test_data.plot()
# Truncate values to the 5th and 95th percentiles
transformed_test_data = pd.Series(mstats.winsorize(test_data, limits=[0.05, 0.05]))
transformed_test_data.plot()
You can use a boolean mask:
import pandas as pd
def remove_outliers(df, q=0.05):
upper = df.quantile(1-q)
lower = df.quantile(q)
mask = (df < upper) & (df > lower)
return mask
t = pd.DataFrame({'train': [1,1,2,3,4,5,6,7,8,9,9],
'y': [1,0,0,1,1,0,0,1,1,1,0]})
mask = remove_outliers(t['train'], 0.1)
print(t[mask])
output:
train y
2 2 0
3 3 1
4 4 1
5 5 0
6 6 0
7 7 1
8 8 1
Since I am in a very early stage of my data science journey, I am treating outliers with the code below.
# Outlier treatment: replace values outside the 1.5*IQR fences with the column median
import numpy as np

def outlier_detect(df):
    for i in df.describe().columns:
        Q1 = df.describe().at['25%', i]
        Q3 = df.describe().at['75%', i]
        IQR = Q3 - Q1
        LTV = Q1 - 1.5 * IQR   # lower threshold value
        UTV = Q3 + 1.5 * IQR   # upper threshold value
        x = np.array(df[i])
        p = []
        for j in x:
            if j < LTV or j > UTV:
                p.append(df[i].median())
            else:
                p.append(j)
        df[i] = p
    return df
Get the 98th and 2nd percentiles as the limits for our outliers:
import numpy as np

upper_limit = np.percentile(X_train.logerror.values, 98)
lower_limit = np.percentile(X_train.logerror.values, 2)

# Cap the outliers in the dataframe at those limits
data['target'].loc[X_train['target'] > upper_limit] = upper_limit
data['target'].loc[X_train['target'] < lower_limit] = lower_limit
A full example with data and two groups follows:
Imports:
from io import StringIO
import pandas as pd
#pandas config
pd.set_option('display.max_rows', 20)
Example data with two groups (G1: group 1, G2: group 2):
TESTDATA = StringIO("""G1;G2;Value
1;A;1.6
1;A;5.1
1;A;7.1
1;A;8.1
1;B;21.1
1;B;22.1
1;B;24.1
1;B;30.6
2;A;40.6
2;A;51.1
2;A;52.1
2;A;60.6
2;B;80.1
2;B;70.6
2;B;90.6
2;B;85.1
""")
Read text data to pandas dataframe:
df = pd.read_csv(TESTDATA, sep=";")
Define the outliers using standard deviations
stds = 1.0
outliers = df[['G1', 'G2', 'Value']].groupby(['G1','G2']).transform(
lambda group: (group - group.mean()).abs().div(group.std())) > stds
Define filtered data values and the outliers:
dfv = df[outliers.Value == False]
dfo = df[outliers.Value == True]
Print the result:
print('\n'*5, 'All values with decimal 1 are non-outliers. On the other hand, all values with decimal 6 are.')
print('\nDef DATA:\n%s\n\nFiltered values with %s stds:\n%s\n\nOutliers:\n%s' % (df, stds, dfv, dfo))
My function for dropping outliers
import numpy as np

def drop_outliers(df, field_name):
    # 1.5 times the interquartile range of the column
    distance = 1.5 * (np.percentile(df[field_name], 75) - np.percentile(df[field_name], 25))
    df.drop(df[df[field_name] > distance + np.percentile(df[field_name], 75)].index, inplace=True)
    df.drop(df[df[field_name] < np.percentile(df[field_name], 25) - distance].index, inplace=True)
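Note that it mutates the passed DataFrame via inplace=True; usage would be along these lines (the column name 'Vol' and the values are hypothetical, echoing the question):
import pandas as pd

df = pd.DataFrame({'Vol': [1210, 1250, 1230, 1280, 1220, 4000]})   # hypothetical data
drop_outliers(df, 'Vol')   # df now has the 4000 row removed, since it lies beyond the upper fence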
I prefer to clip rather than drop. The following will clip each column in place at its 2nd and 98th percentiles.
df_list = list(df)           # column names
minPercentile = 0.02
maxPercentile = 0.98

for col in df_list:
    df[col] = df[col].clip(df[col].quantile(minPercentile), df[col].quantile(maxPercentile))
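If you prefer to avoid the explicit loop, DataFrame.clip accepts per-column bounds as Series, so (as far as I can tell) the same clipping can be written in one call; this reuses df, minPercentile and maxPercentile from the snippet above and assumes all columns are numeric.
lower = df.quantile(minPercentile)   # per-column 2nd percentile
upper = df.quantile(maxPercentile)   # per-column 98th percentile

# align the bound Series with the columns and clip everything at once
df_clipped = df.clip(lower=lower, upper=upper, axis=1)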
If your data frame has outliers, there are many ways you can handle them.
Most of them are covered in my article: Give this a read
Find the code here: Notebook
Deleting and dropping outliers is, I believe, statistically wrong.
It makes the data different from the original data.
It also makes the data unequally shaped, and hence the best way is to reduce or avoid the effect of outliers by log-transforming the data.
This worked for me:
np.log(data.iloc[:, :])
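One caveat: np.log is only defined for strictly positive values, so this only works if every entry is greater than zero. For non-negative data that may contain zeros, np.log1p is a common workaround; the small frame below is purely illustrative.
import numpy as np
import pandas as pd

data = pd.DataFrame({'a': [1.0, 2.0, 0.0, 4000.0]})   # hypothetical non-negative data
log_data = np.log1p(data)    # log(1 + x), still finite where x == 0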
Related
How to compute occurrences of a specific value and its percentage for each column, based on a condition, in a pandas dataframe?
I have the following dataframe df, in which I highlighted in green the cells with the values of interest, and I would like to obtain, for each column (therefore by considering the whole dataframe), the following statistics: the occurrence of values less than or equal to 0.5 (the green cells; NaN values are not to be included) and its percentage in the considered column, in order to use, say, 50% as a benchmark.
For the point asked I tried value_counts, like (df['A'].value_counts()/df['A'].count())*100, but this returns a partial result, not the way I would like, and only for specific columns; I was also thinking about using filter or a lambda function like df.loc[lambda x: x <= 0.5], but clearly that is not the result I wanted.
The goal/output will be a dataframe that displays just the columns that "beat" the benchmark (recall: at least 50% of their values <= 0.5). E.g. in column A the count would be 2 and the percentage 2/3 * 100 = 66%, while in column B the count would be 4 and the percentage 4/8 * 100 = 50% (the same goes for columns X, Y and Z). On the other hand, column C, where 2/8 * 100 = 25%, won't beat the benchmark and is therefore not considered in the output.
Is there a suitable way to achieve this IYHO? Apologies in advance if this is a kind of duplicated question, but I found no other questions able to help me out, and thanks to any saviour.
I believe I have understood your ask in the code below. It would be good if you could provide an expected output in your question so that it is easier to follow. Anyway, the first part of the code below is just set-up, so it can be ignored as you already have your data set up. Basically I have created a quick function for you that will return the percentage of values that are under a threshold that you can define. This function is called in a loop over all the columns within your dataframe, and if this percentage is more than the output threshold (again, you can define it) the column is kept for the actual output.
import pandas as pd
import numpy as np
import random
import datetime

### SET UP ###
base = datetime.datetime.today()
date_list = [base - datetime.timedelta(days=x) for x in range(10)]

def rand_num_list(length):
    peak = [round(random.uniform(0, 1), 1) for i in range(length)] + [0] * (10 - length)
    random.shuffle(peak)
    return peak

df = pd.DataFrame(
    {
        'A': rand_num_list(3),
        'B': rand_num_list(5),
        'C': rand_num_list(7),
        'D': rand_num_list(2),
        'E': rand_num_list(6),
        'F': rand_num_list(4)
    },
    index=date_list
)

df = df.replace({0: np.nan})
##############

print(df)

def less_than_threshold(thresh_df, thresh_col, threshold):
    if len(thresh_df[thresh_col].dropna()) == 0:
        return 0
    return len(thresh_df.loc[thresh_df[thresh_col] <= threshold]) / len(thresh_df[thresh_col].dropna())

output_dict = {'cols': []}
col_threshold = 0.5
output_threshold = 0.5

for col in df.columns:
    if less_than_threshold(df, col, col_threshold) >= output_threshold:
        output_dict['cols'].append(col)

df_output = df.loc[:, output_dict.get('cols')]

print(df_output)
Hope this achieves your goal!
How to replace the outliers with the 95th and 5th percentile in Python?
I am trying to do outlier treatment on my time series data, where I want to replace the values greater than the 95th percentile with the 95th percentile and the values less than the 5th percentile with the 5th percentile value. I have prepared some code but I am unable to get the desired result. I am trying to create an outliertreatment function using a sub-function called cut. The code is given below:
def outliertreatment(df, high_limit, low_limit):
    df_temp = df['y'].apply(cut, high_limit, low_limit, extra_kw=1)
    return df_temp

def cut(column, high_limit, low_limit):
    conds = [column > np.percentile(column, high_limit),
             column < np.percentile(column, low_limit)]
    choices = [np.percentile(column, high_limit),
               np.percentile(column, low_limit)]
    return np.select(conds, choices, column)
I expect to send the dataframe, 95 as high_limit and 5 as low_limit to the outliertreatment function. How can I achieve the desired result?
I'm not sure if this approach is a suitable way to deal with outliers, but to achieve what you want, the clip function is useful. It assigns values outside the boundary to the boundary values. You can read more in the documentation.
data = pd.Series(np.random.randn(100))
data.clip(lower=data.quantile(0.05), upper=data.quantile(0.95))
If your data contains multiple columns:
For an individual column:
p_05 = df['sales'].quantile(0.05)   # 5th percentile
p_95 = df['sales'].quantile(0.95)   # 95th percentile
df['sales'].clip(p_05, p_95, inplace=True)
For more than one numerical column:
# or you can create a custom list of numerical columns
num_col = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
df[num_col] = df[num_col].apply(lambda x: x.clip(*x.quantile([0.05, 0.95])))
Bonus: to check for outliers using a box plot:
import matplotlib.pyplot as plt
for x in num_col:
    df[num_col].boxplot(x)
    plt.figure()
Scaling a column with a for loop
I want to scale all the values of a column of a dataframe with a function. This is the function so far:
def scale0_1(cname):
    temp = array(cname)
    for i in range(len(temp)):
        value = temp[i] - min(temp) / (max(temp) - min(temp))
        temp[i] = value
    return pd.DataFrame(temp)
Here is a sample column to test the function with:
samplecolumn = pd.DataFrame([7.0, 15.8, 19.4, 11.4])
However, when I use the function with a column of a data frame (any numeric column should work), it just returns the original values, doing nothing. There is no error message. Does anyone have an idea how to fix this? I would be very grateful for any help :)
Using np.interp:
a = df[0].values
np.interp(a, (a.min(), a.max()), (0, +1))

Out[36]: array([0.        , 0.70967742, 1.        , 0.35483871])
With pandas dataframes you can apply operations to entire columns. This allows you to do something like this:
def scale0_1(cname):
    scale_factor = min(cname) / (max(cname) - min(cname))
    return cname - scale_factor
This also allows you to keep the data in a pandas Series or DataFrame through the whole operation and avoids the added complexity of converting it into an array and back.
Where possible, you should use a vectorised approach rather than iterating rows explicitly. For example, you can calculate a column's maximum and minimum; then, when performing operations with series, the calculations are automatically vectorised.
df = pd.DataFrame({'A': [7.0, 15.8, 19.4, 11.4]})

col_min = df['A'].min()
col_max = df['A'].max()

df['B'] = (df['A'] - col_min) / (col_max - col_min)
This is a frequent task, so you will find it exists in other third-party libraries. For example, using sklearn:
from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()
df['B'] = min_max_scaler.fit_transform(df['A'])
Result:
print(df)

      A         B
0   7.0  0.000000
1  15.8  0.709677
2  19.4  1.000000
3  11.4  0.354839
Pandas: Filter dataframe for values that are too frequent or too rare
On a pandas dataframe, I know I can groupby on one or more columns and then filter values that occur more/less than a given number. But I want to do this on every column of the dataframe. I want to remove values that are too infrequent (let's say that occur less than 5% of the time) or too frequent. As an example, consider a dataframe with the following columns: city of origin, city of destination, distance, type of transport (air/car/foot), time of day, price-interval.
import pandas as pd
import string
import numpy as np

vals = [(c, np.random.choice(list(string.lowercase), 100, replace=True))
        for c in 'city of origin', 'city of destination',
                 'distance, type of transport (air/car/foot)',
                 'time of day, price-interval']
df = pd.DataFrame(dict(vals))

>> df.head()
  city of destination  city of origin  distance, type of transport (air/car/foot)  time of day, price-interval
0                   f               p                                            a                             n
1                   k               b                                            a                             f
2                   q               s                                            n                             j
3                   h               c                                            g                             u
4                   w               d                                            m                             h
If this is a big dataframe, it makes sense to remove rows that have spurious items, for example, if time of day = night occurs only 3% of the time, or if the foot mode of transport is rare, and so on. I want to remove all such values from all columns (or a list of columns). One idea I have is to do a value_counts on every column, transform and add one column for each value_counts; then filter based on whether they are above or below a threshold. But I think there must be a better way to achieve this?
This procedure will go through each column of the DataFrame and eliminate rows where the given category is less than a given threshold percentage, shrinking the DataFrame on each loop.
This answer is similar to that provided by @Ami Tavory, but with a few subtle differences:
It normalizes the value counts so you can just use a percentile threshold.
It calculates counts just once per column instead of twice. This results in faster execution.
Code:
threshold = 0.03
for col in df:
    counts = df[col].value_counts(normalize=True)
    df = df.loc[df[col].isin(counts[counts > threshold].index), :]
Code timing:
df2 = pd.DataFrame(np.random.choice(list(string.lowercase), [1e6, 4], replace=True),
                   columns=list('ABCD'))

%%timeit df=df2.copy()
threshold = 0.03
for col in df:
    counts = df[col].value_counts(normalize=True)
    df = df.loc[df[col].isin(counts[counts > threshold].index), :]

1 loops, best of 3: 485 ms per loop

%%timeit df=df2.copy()
m = 0.03 * len(df)
for c in df:
    df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]

1 loops, best of 3: 688 ms per loop
I would go with one of the following:
Option A
m = 0.03 * len(df)
df[np.all(
    df.apply(
        lambda c: c.isin(c.value_counts()[c.value_counts() > m].index).as_matrix()),
    axis=1)]
Explanation:
m = 0.03 * len(df) is the threshold (it's nice to take the constant out of the complicated expression)
df[np.all(..., axis=1)] retains the rows where some condition was obtained across all columns.
df.apply(...).as_matrix applies a function to all columns, and makes a matrix of the results.
c.isin(...) checks, for each column item, whether it is in some set.
c.value_counts()[c.value_counts() > m].index is the set of all values in a column whose count is above m.
Option B
m = 0.03 * len(df)
for c in df.columns:
    df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]
The explanation is similar to the one above.
Tradeoffs:
Personally, I find B more readable.
B creates a new DataFrame for each filtering of a column; for large DataFrames, it's probably more expensive.
I am new to Python and using Pandas. I came up with the following solution below. Maybe other people might have a better or more efficient approach. Assuming your DataFrame is DF, you can use the following code below to filter out all infrequent values. Just be sure to update the col and bin_freq variables. DF_Filtered is your new filtered DataFrame.
# Column you want to filter
col = 'time of day'

# Set the frequency to filter out. Currently set to 5%
bin_freq = float(5)/float(100)

DF_Filtered = pd.DataFrame()

for i in DF[col].unique():
    counts = DF[DF[col]==i].count()[col]
    total_counts = DF[col].count()
    freq = float(counts)/float(total_counts)

    if freq > bin_freq:
        DF_Filtered = pd.concat([DF[DF[col]==i], DF_Filtered])

print(DF_Filtered)
DataFrames support clip_lower(threshold, axis=None) and clip_upper(threshold, axis=None), which clip all values below or above (respectively) a certain threshold.
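As a side note, clip_lower and clip_upper have since been deprecated and removed from pandas (around 1.0, if I recall correctly); in recent versions the equivalent is clip with a single bound, sketched below on made-up data.
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 100]})   # hypothetical data

capped_low = df.clip(lower=2)    # what clip_lower(2) used to do
capped_high = df.clip(upper=3)   # what clip_upper(3) used to do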
We can also replace all the rare categories with one label, say "Rare", and remove them later if this doesn't add value to prediction.
# function finds the labels that occur more often than a certain percentage/threshold
def get_freq_labels(df, var, rare_perc):
    df = df.copy()
    tmp = df.groupby(var)[var].count() / len(df)
    return tmp[tmp > rare_perc].index

vars_cat = [val for val in data.columns if data[val].dtype == 'O']

for var in vars_cat:
    # find the frequent categories
    frequent_cat = get_freq_labels(data, var, 0.05)
    # replace rare categories by the string "Rare"
    data[var] = np.where(data[var].isin(frequent_cat), data[var], 'Rare')