Remove outliers using groupby in data with several categories - python

I have a time series with several products. I want to remove outliers using the Tukey fence method. The idea is to create a column with a flag indicating outlier or not, using groupby. It should look like this (the flag column is added by the groupby):
date  prod  units  flag
   1     a    100     0
   2     a     90     0
   3     a     80     0
   4     a     15     1
   1     b    200     0
   2     b    180     0
   3     b    190     0
   4     b  30000     1
I was able to do it by separating the prods with a for-loop and then making the corresponding joins, but I would like to do it more cleanly.

I would compute the quantiles first, then derive the IQR from them. Compute the fence bounds, call merge() to map these limits onto the original dataframe, and call eval() to check whether the units fall within their respective Tukey fence bounds.
# compute quantiles
quantiles = df.groupby('prod')['units'].quantile([0.25, 0.75]).unstack()
# compute interquartile range for each prod
iqr = quantiles.diff(axis=1).bfill(axis=1)
# compute fence bounds
fence_bounds = quantiles + iqr * [-1.5, 1.5]
# check if units are outside their respective tukey ranges
df['flag'] = df.merge(fence_bounds, left_on='prod', right_index=True).eval('not (`0.25` < units < `0.75`)').astype(int)
df
The intermediate fence bounds are:
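(Computed here from the sample data above, using pandas' default linear interpolation for the quantiles; column 0.25 holds the lower fence, 0.75 the upper fence.)
            0.25       0.75
prod
a         20.625    135.625
b     -11006.250  18843.750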


mean of the values in interquartile range in python

above25percentile=df.loc[df["order_amount"]>np.percentile(df["order_amount"],25)]
below75percentile=df.loc[df["order_amount"]<np.percentile(df["order_amount"],75)]
interquartile=above25percentile & below75percentile
print(interquartile.mean())
can't seem to get the mean here. any thoughts?
You attempt to compute interquartile with the & operator, but its operands are not boolean masks: they are the filtered frames themselves, still holding the original values, so & does not give you an intersection of their rows. And even if they were boolean masks, in your subsequent usage you would be taking the mean of a bunch of zeros and ones, which comes out around 0.5 (it is, in fact, the fraction of the data that falls within the IQR), not the mean of the values.
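To illustrate that last point with a tiny made-up Series (not your data): the mean of a boolean mask is the fraction of True values, not the mean of the values the mask selects.
import pandas as pd
s = pd.Series([1, 2, 3, 100])
mask = s < 10
print(mask.mean())     # 0.75 -> fraction of rows satisfying the condition
print(s[mask].mean())  # 2.0  -> mean of the values that satisfy it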
First, compute interquartile as a proper mask. Pandas has its own quantile method, which, like np.percentile and siblings, accepts multiple percentiles simultaneously. You can combine that with between to get your mask more efficiently:
interquartile = df['order_amount'].between(*df['order_amount'].quantile([0.25, 0.75]))
You can apply the mask to the column and take the mean like this:
df.loc[interquartile, 'order_amount'].mean()
Try:
above25percentile = df["order_amount"]>np.percentile(df['order_amount'],25)
below75percentile = df['order_amount']<np.percentile(df['order_amount'],75)
print(df.loc[above25percentile & below75percentile, 'order_amount'].mean())
Or you can use between:
df.loc[df['order_amount'].between(np.percentile(df['order_amount'], 25),
                                  np.percentile(df['order_amount'], 75),
                                  inclusive='neither'), 'order_amount'].mean()
Suppose the following dataframe:
df = pd.DataFrame({'order_amount': range(0, 10)})
print(df)
# Output
order_amount
0 0 # Excluded
1 1 # "
2 2 # "
3 3
4 4 # mean <- (3 + 4 + 5 + 6) / 4 = 4.5
5 5
6 6
7 7 # Excluded
8 8 # "
9 9 # "
Output:
>>> df.loc[df['order_amount'].between(np.percentile(df['order_amount'], 25),
...                                   np.percentile(df['order_amount'], 75),
...                                   inclusive='neither'), 'order_amount'].mean()
4.5

Add multiple columns resulting from the same operation to a pandas DataFrame

The question
Let's say I have a DataFrame df with a numeric column x and a categorical column y. For each group of y, I want to find the smallest q whose quantile is greater than 0, together with the quantile value itself. With these two values, q and the quantile at q, I then want to count the number of elements lower than that quantile. To do this, I start at q=0 and keep increasing q by 0.05 until the quantile at q is greater than 0.
The solution
groups = df.groupby(y_col)
less_quantile = np.empty(len(groups))
quantiles = np.empty(len(groups))
qs = np.empty(len(groups))
for i, (_, group) in zip(range(len(groups)), groups):
    q = 0.0
    while q <= 1.0:
        quantile = group[x_col].quantile(q)
        if quantile > 0.0:
            less_quantile[i] = (group[x_col] < quantile).sum()  # count Trues
            qs[i] = q
            quantiles[i] = quantile
            break
        q += 0.05
df_final = df.drop_duplicates(y_col)\
    .assign(quantile=quantiles, q=qs, less_quantile_count=less_quantile)
The problem
There are the following problems regarding the implementation above:
I'm not using any pandas or numpy optimizations (e.g. vectorization).
Pandas doesn't guarantee that drop_duplicates and groupby yield the groups in the same order, which the assignment above relies on. The only reason I did it this way is that, by experimentation, the orders happened to match.
What are my other options?
Using agg or apply and merge (see the sketch after the example below)
Creating empty columns, then filling in q, the quantile, and the count of values below the quantile for each group
What other problems can it cause?
I'll probably need to recalculate quantile lots of times.
I'll occupy lots of memory with repeated values within each row of each group.
Why do I care for that?
As a C programmer, it hurts to use so much memory and so many redundant operations. This DataFrame is small, so I can get away with it, but if the DataFrame were large I would want to know the best way to do this. I'm not sure whether this is a limitation imposed by the library's abstraction level or whether I just don't know enough to solve the problem the way it should be solved.
Edit 1 - Added an example
x y
0 pear 0.0
1 pear 0.0
2 pear 0.194329
3 apple 0.714319
4 apple 0.171905
5 apple 0.337234
6 apple 0.769216
7 orange 0.529154
8 orange 0.844691
# Let's take pear as an example:
quantile = group_pear.quantile(0) # 0.0
quantile = group_pear.quantile(0.05) # 0.0
...
quantile = group_pear.quantile(0.50) # 0.0
quantile = group_pear.quantile(0.55) # 0.09857
q = 0.55
# Found quantile of q with q=0.55 resulting in 0.09857
# Now I just need to count how many rows within the pear group have 'y'
# less than 0.09857
count_pear = (group_pear['y'] < 0.09857).sum()
# I just need to do the same for other groups and then produce a
# DataFrame like this
x q quantile count_less
0 pear 0.55 0.09857 2
1 apple 0.0 ... ...
2 orange 0.0 ... ...
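A minimal sketch of the agg/apply option mentioned above (assuming, as in the loop solution, that x_col names the numeric column and y_col the grouping column; the helper name first_positive_quantile is made up). It computes q, the quantile, and the count in one pass per group and avoids relying on the ordering of drop_duplicates:
import numpy as np
import pandas as pd

def first_positive_quantile(s):
    # Scan q = 0, 0.05, ..., 1.0 and stop at the first quantile above 0.
    for q in np.linspace(0.0, 1.0, 21):
        quantile = s.quantile(q)
        if quantile > 0.0:
            return pd.Series({'q': q, 'quantile': quantile,
                              'less_quantile_count': (s < quantile).sum()})
    return pd.Series({'q': np.nan, 'quantile': np.nan, 'less_quantile_count': 0})

df_final = (df.groupby(y_col)
              .apply(lambda g: first_positive_quantile(g[x_col]))
              .reset_index())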

Is there a way to replace anomalies with the mean of the rows on either side of them?

I have a column that I'm trying to smooth out the results. Most of the data creates a smooth chart but sometimes I get a random spike. I want to reduce the impact of the spike.
My thought was to take the outlier and replace it with the mean of the values on either side of it, but I'm struggling and not getting the result I want.
Here's what I'm doing right now:
df = pd.DataFrame(np.random.randint(0, 100, size=(5, 1)), columns=list('A'))

def aDetection(inputs):
    median = inputs["A"].median()
    std = inputs["A"].std()
    outliers = (inputs["A"] - median).abs() > std
    print("outliers")
    print(outliers)
    inputs[outliers]["A"] = np.nan  # this isn't working.
    inputs[outliers] = np.nan       # works but wipes out entire row
    inputs['A'].fillna(median, inplace=True)
    print("modified:")
    print(inputs)

print("original")
print(df)
aDetection(df)
original
A
0 4
1 86
2 40
3 99
4 97
outliers
0 True
1 False
2 True
3 False
4 False
Name: A, dtype: bool
modified:
A
0 86.0
1 86.0
2 86.0
3 99.0
4 97.0
For one, it seems to change the entire row, not just the single column. But the bigger problem is that every outlier in my example gets replaced with 86. I realize this is because I fill with the median of the entire column, but I would rather use the mean of the rows immediately before and after the outlier.
For a single column, you can do your task with the following one-liner
(for readability folded into 2 lines):
df.A = df.A.mask((df.A - df.A.median()).abs() > df.A.std(),
                 pd.concat([df.A.shift(), df.A.shift(-1)], axis=1).mean(axis=1))
Details:
(df.A - df.A.median()).abs() > df.A.std() - computes outliers.
df.A.shift() - computes a Series of previous values.
df.A.shift(-1) - computes a Series of following values.
pd.concat(...) - creates a DataFrame from both the above Series.
mean(axis=1) - computes means by rows.
mask(...) - takes original values of A column for non-outliers
and the value from concat for outliers.
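For the sample column above, the neighbour means that mask falls back on look like this (the same concat/mean expression shown on its own):
pd.concat([df.A.shift(), df.A.shift(-1)], axis=1).mean(axis=1)
# 0    86.0   (no previous value, so just the next one)
# 1    22.0
# 2    92.5
# 3    68.5
# 4    99.0   (no following value, so just the previous one)
Only rows 0 and 2 were flagged as outliers, so only 86.0 and 92.5 end up replacing values in the result below.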
The result is:
A
0 86.0
1 86.0
2 92.5
3 99.0
4 97.0
If you want to apply this mechanism to all columns of your DataFrame,
then:
Change the above code to a function:
def replOutliers(col):
    return col.mask((col - col.median()).abs() > col.std(),
                    pd.concat([col.shift(), col.shift(-1)], axis=1).mean(axis=1))
Apply it (to each column):
df = df.apply(replOutliers)

Scaling numbers within a dataframe column to the same proportion

I have a series of numbers of two different magnitudes in a dataframe column. They are
0 154480.429000
1 154.480844
2 154480.433000
3 154.480844
4 154480.433000
......
As we can see above, the column mixes two magnitudes, and I am not sure how to set up a condition that scales the small numbers like 154.480844 to the same order of magnitude as the large ones like 154480.433000.
How can this be done efficiently with pandas?
Use np.log10 to determine the order of magnitude of each value, then multiply by the power of 10 needed to bring it up to the largest magnitude. Something like this:
# ser is the column of interest, e.g. ser = df[1]
v = np.log10(ser).astype(int)         # integer order of magnitude of each value
ser * 10 ** (v.max() - v).values      # scale each value up to the largest magnitude
0 154480.429
1 154480.844
2 154480.433
3 154480.844
4 154480.433
Name: 1, dtype: float64
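To write the rescaled values back into the DataFrame (a small sketch, assuming the column is literally named 1, as the Name: 1 above suggests):
ser = df[1]
v = np.log10(ser).astype(int)
df[1] = ser * 10 ** (v.max() - v)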

Identifying statistical outliers with pandas: groupby and individual columns

I'm trying to understand how to identify statistical outliers which I will be sending to a spreadsheet. I will need to group the rows by the index and then find the stdev for specific columns and anything that exceeds the stdev would be used to populate a spreadsheet.
df = pandas.DataFrame({'Sex': ['M','M','M','F','F','F','F'], 'Age': [33,42,19,64,12,30,32], 'Height': ['163','167','184','164','162','158','160'],})
Using a dataset like this I would like to group by sex, and then find entries that exceed either the stdev of age or height. Most examples I've seen are addressing the stdev of the entire dataset as opposed to broken down by columns. There will be additional columns such as state, so I don't need the stdev of every column just particular ones out of the set.
I'm looking for the output to contain just the rows that are identified as statistical outliers in either of the columns. For instance:
0 F 64 164
1 M 19 184
Assuming that being 64 years old puts her more than one standard deviation from the mean age for women, and that 184 cm is more than one standard deviation from the mean height for men.
First, convert your height from strings to values.
df['Height'] = df['Height'].astype(float)
You then need to group on Sex, use transform to compute per-group z-scores for Age and Height, and compare them to a threshold to mark which values are statistical outliers within their group.
stds = 1.0  # Number of standard deviations that defines an 'outlier'.
z = df[['Sex', 'Age', 'Height']].groupby('Sex').transform(
    lambda group: (group - group.mean()).div(group.std()))
outliers = z.abs() > stds
>>> outliers
Age Height
0 False False
1 False False
2 True True
3 True True
4 True False
5 False True
6 False False
Now filter for rows that contain any outliers:
>>> df[outliers.any(axis=1)]
Age Height Sex
2 19 184 M
3 64 164 F
4 12 162 F
5 30 158 F
If you only care about the upper side of the distribution (i.e. values more than stds standard deviations above the group mean), then just drop the .abs(), i.e. outliers = z > stds.
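A quick end-to-end sketch of that one-sided variant, reusing the z computed above:
upper_outliers = z > stds              # no .abs(): only the high side of each group is flagged
df[upper_outliers.any(axis=1)]         # rows with at least one high-side outlier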
