Identifying statistical outliers with pandas: groupby and individual columns - python

I'm trying to understand how to identify statistical outliers that I will then send to a spreadsheet. I need to group the rows by the index, find the stdev for specific columns, and use anything that exceeds the stdev to populate the spreadsheet.
import pandas
df = pandas.DataFrame({'Sex': ['M','M','M','F','F','F','F'], 'Age': [33,42,19,64,12,30,32], 'Height': ['163','167','184','164','162','158','160']})
Using a dataset like this, I would like to group by sex and then find entries that exceed the stdev of either age or height. Most examples I've seen address the stdev of the entire dataset rather than breaking it down by column. There will be additional columns such as state, so I don't need the stdev of every column, just particular ones out of the set.
Looking for the output to contain just the data for the rows that are identified as statistical outliers in either of the columns. For instance:
0 M 64 164
1 M 19 184
Assuming that 64 years old exceeds the men's stdev for age and 184 cm tall exceeds the men's stdev for height.

First, convert your height from strings to values.
df['Height'] = df['Height'].astype(float)
You then need to group on Sex and use transform to compute per-group z-scores, from which a boolean indicator marks whether Age or Height is a statistical outlier within its group.
stds = 1.0  # Number of standard deviations that defines an 'outlier'.
z = df[['Sex', 'Age', 'Height']].groupby('Sex').transform(
    lambda group: (group - group.mean()).div(group.std()))
outliers = z.abs() > stds
>>> outliers
Age Height
0 False False
1 False False
2 True True
3 True True
4 True False
5 False True
6 False False
Now filter for rows that contain any outliers:
>>> df[outliers.any(axis=1)]
Age Height Sex
2 19 184 M
3 64 164 F
4 12 162 F
5 30 158 F
If you only care about the upper side of the distribution (i.e. values more than stds standard deviations above the group mean), just drop the .abs() from the test, i.e. use outliers = z > stds instead of outliers = z.abs() > stds.
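A minimal sketch of that one-sided check, reusing df, z and stds from above:
upper_only = z > stds  # no .abs(): only values above the group mean can flag
df[upper_only.any(axis=1)]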

Related

Remove outliers using groupby in data with several categories

I have a time-series with several products. I want to remove outliers using the Tukey Fence method. The idea is to create a column with a flag indicating outlier or not, using groupby. It should look like this (the flag column is added by the groupby):
date prod units flag
1 a 100 0
2 a 90 0
3 a 80 0
4 a 15 1
1 b 200 0
2 b 180 0
3 b 190 0
4 b 30000 1
I was able to do it by separating the prods with a for-loop and then making the corresponding joins, but I wish to do it more cleanly.
I would compute the quantiles first and then derive the IQR from them. Compute the fence bounds, call merge() to map these limits onto the original dataframe, and call eval() to check whether the units fall within their respective Tukey fences.
# compute quantiles
quantiles = df.groupby('prod')['units'].quantile([0.25, 0.75]).unstack()
# compute interquartile range for each prod
iqr = quantiles.diff(axis=1).bfill(axis=1)
# compute fence bounds
fence_bounds = quantiles + iqr * [-1.5, 1.5]
# check if units are outside their respective tukey ranges
df['flag'] = df.merge(fence_bounds, left_on='prod', right_index=True).eval('not (`0.25` < units < `0.75`)').astype(int)
df
The intermediate fence_bounds frame holds the lower and upper Tukey fence per prod (its columns keep the 0.25 and 0.75 labels after the arithmetic).
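If you would rather avoid the merge/eval step, a rough equivalent (a sketch, assuming the same df with prod and units columns) computes the fences directly with groupby().transform:
g = df.groupby('prod')['units']
q1 = g.transform(lambda s: s.quantile(0.25))
q3 = g.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1
df['flag'] = (~df['units'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)).astype(int)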

Pandas: row operations on a column, given one reference value on a different column

I am working with a database that looks like the one below. For each fruit (just apples and pears below, for conciseness), we have:
1. yearly sales,
2. current sales,
3. monthly sales and
4. the standard deviation of sales.
Their ordering may vary, but it's always 4 values per fruit.
import pandas as pd

dataset = {'apple_yearly_avg': [57],
           'apple_sales': [100],
           'apple_monthly_avg': [80],
           'apple_st_dev': [12],
           'pears_monthly_avg': [33],
           'pears_yearly_avg': [35],
           'pears_sales': [40],
           'pears_st_dev': [8]}
df = pd.DataFrame(dataset).T  # transpose
df = df.reset_index()  # clear index
df.columns = ['Description', 'Value']  # name the 2 columns
I would like to perform two sets of operations.
For the first set of operations, we isolate a fruit, say 'pears', and subtract each average from the current sales.
df_pear = df[df.loc[:, 'Description'].str.contains('pear')]
df_pear['temp'] = df_pear['Value'].where(df_pear.Description.str.contains('sales')).bfill()
df_pear['some_op'] = df_pear['Value'] - df_pear['temp']
The above works by creating a temporary column holding the pears sales value of 40, backfilling it, and then using it to subtract values.
Question 1: is there a cleaner way to perform this operation without a temporary column? Also, I do get the common warning saying I should use .loc[row_indexer, col_indexer], even though the output still works.
For the second set of operations, I need to add new_purchases (5) rows to the bottom of the dataframe and then fill df_pear['some_op'] with sales * (1 + std_dev * some_multiplier).
df_pear['temp2'] = df_pear['Value'].where(df_pear['Description'].str.contains('st_dev')).bfill()
new_purchases = 5
for i in range(new_purchases):
    df_pear = df_pear.append(df_pear.iloc[-1])  # appends 5 copies of the last row

counter = 1
for i in range(len(df_pear)-1, len(df_pear)-new_purchases, -1):  # backward loop from the bottom
    df_pear.some_op.iloc[i] = df_pear['temp'].iloc[0] * (1 + df_pear['temp2'].iloc[i] * counter)
    counter += 1
This 'backwards' loop achieves it, but again I'm worried about readability, since another temporary column is created and the indexing is rather ugly.
Thank you.
I think there is a cleaner way to perform both of your tasks, for each fruit in one go:
Add 2 columns, Fruit and Descr, as the result of splitting Description at the first "_":
df[['Fruit', 'Descr']] = df['Description'].str.split('_', n=1, expand=True)
To see the result you may print df now.
Define the following function to "reformat" the current group:
def reformat(grp):
    wrk = grp.set_index('Descr')
    sal = wrk.at['sales', 'Value']
    dev = wrk.at['st_dev', 'Value']
    avg = wrk.at['yearly_avg', 'Value']
    # Subtract the (yearly) average
    wrk['some_op'] = wrk.Value - avg
    # New rows
    wrk2 = pd.DataFrame([wrk.loc['st_dev']] * 5).assign(
        some_op=[sal * (1 + dev * i) for i in range(5, 0, -1)])
    return pd.concat([wrk, wrk2])  # Old and new rows
Apply this function to each group (grouped by Fruit), drop the Fruit column, and save the result back in df:
df = df.groupby('Fruit').apply(reformat)\
       .reset_index(drop=True).drop(columns='Fruit')
Now, when you print(df), the result is:
Description Value some_op
0 apple_yearly_avg 57 0
1 apple_sales 100 43
2 apple_monthly_avg 80 23
3 apple_st_dev 12 -45
4 apple_st_dev 12 6100
5 apple_st_dev 12 4900
6 apple_st_dev 12 3700
7 apple_st_dev 12 2500
8 apple_st_dev 12 1300
9 pears_monthly_avg 33 -2
10 pears_sales 40 5
11 pears_yearly_avg 35 0
12 pears_st_dev 8 -27
13 pears_st_dev 8 1640
14 pears_st_dev 8 1320
15 pears_st_dev 8 1000
16 pears_st_dev 8 680
17 pears_st_dev 8 360
Edit
I'm in doubt whether Description should also be replicated to the new rows copied from the "st_dev" row. If you want some other content there, set it in the reformat function after wrk2 is created, as sketched below.
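For instance, a hedged sketch (the 'new_purchase' label is only illustrative, not part of the original answer), placed inside reformat right after wrk2 is built:
# Illustrative only: relabel the replicated rows inside reformat(), after wrk2 is created.
wrk2['Description'] = wrk2['Description'].str.replace('st_dev', 'new_purchase')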

How to reject a window containing an outlier with a condition during rolling average using python?

The problem I am facing is how I can reject a window of 10 rows if one or more of the rows contain an outlier while computing a rolling average using python pandas. The assistance I require is with the conditional logic, based on the scenarios mentioned below.
The condition on the outlier in a window is:
The upper bound for an outlier is 15, the lower bound is 0.
If the frequency of occurrence of outliers in a window is greater than 10%, we reject that particular window and move to the next one.
If the frequency of occurrence of outliers in a window is less than 10%, we accept the window with the following changes: 1) replace each outlier with the average of the non-outlier values, i.e. the rest of the 9 rows, then average the same window again before moving to the next one.
Here's the code so far:
_filter = lambda x: float("inf") if x > 15 or x < 0 else x
# Map outliers to inf, then take the rolling mean over the window
result = df_list["speed"].apply(_filter).rolling(10).mean().dropna()
# Print the max rolling average
print("The max rolling average is:")
result.max()
Use rolling with a custom aggregation function:
df = pd.DataFrame({"a": range(100), "speed": np.random.randint(0, 17, 100)})
MAX = 15
MIN = 0
def my_mean(s):
outlier_count = ((s<MIN) | (s > MAX)).sum()
if outlier_count > 2: # defined 2 as the threshold - can put any other number here
return np.NaN
res = s[(s <= MAX) & (s >= MIN)].mean()
return res
df["roll"] = df.speed.rolling(10).apply(my_mean)
This results, in one example, in:
...
35 35 8 9.444444
36 36 14 9.666667
37 37 11 9.888889
38 38 16 10.250000
39 39 16 NaN
40 40 15 NaN
41 41 6 NaN
42 42 9 11.375000
43 43 2 10.000000
44 44 8 9.125000
...
What happens here is as follows:
We create a rolling window of size 10 (df.speed.rolling(10))
For each window, which is a series of 10 numbers, we apply the function my_mean.
my_mean first counts the number of outliers, by summing the number of cases in which elements in the series s are smaller than the minimum or larger than the maximum.
If the count of outliers is too large, we just say that there's no mean and return not-a-number.
Otherwise, we filter out outliers and calculate the mean of the other numbers (s[(s <= MAX) & (s >= MIN)].mean()).
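If you want the threshold to follow the 10% rule from the question literally (more than one outlier in a 10-row window means rejection), a sketch of the adjusted function, reusing MIN and MAX from above, could be:
def my_mean_strict(s):
    outlier_count = ((s < MIN) | (s > MAX)).sum()
    if outlier_count > 0.1 * len(s):  # more than 10% of the window
        return np.nan
    # Mean of the non-outliers; replacing outliers with this mean and
    # re-averaging the window gives exactly the same number.
    return s[(s >= MIN) & (s <= MAX)].mean()

df["roll_strict"] = df.speed.rolling(10).apply(my_mean_strict)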

Conditional Sampling of Data Frame in Python

I have a DataFrame of names, sex, and ages of individuals.
I would like to create a new Dataframe by sampling a fixed number of samples such that the average age of the new DataFrame is the same as the original DataFrame.
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({'Var': ['A','B','C','D','E'], 'Ages': [22, 35, 43, 18, np.nan]})
sample_df
Out[410]:
Var Ages
0 A 22
1 B 35
2 C 43
3 D 18
4 E NaN
I would like to sample only 3 rows such that the age of 'E' is equal to the mean of A,B,C,D
Consider an indefinite iteration using while True, breaking once your needs are met; depending on the variability of your data, this may take some time to process. The code below builds a list of 100-row samples and breaks after ten such samples are collected.
samples = []
while True:
    sample_df = df.sample(n=100)
    if sample_df['Age'].mean() == df['Age'].mean():
        samples.append(sample_df)
    if len(samples) == 10:
        break
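Note that an exact == comparison of float means will rarely succeed on real-valued ages; a hedged tweak is to accept a sample whose mean is merely close to the population mean (the tolerance below is illustrative):
import numpy as np

if np.isclose(sample_df['Age'].mean(), df['Age'].mean(), atol=0.1):  # tolerance is illustrative
    samples.append(sample_df)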

Scaling numbers within a dataframe column to the same proportion

I have a series of numbers of two different magnitudes in a dataframe column. They are
0 154480.429000
1 154.480844
2 154480.433000
3 154.480844
4 154480.433000
......
As can be seen above, I am not sure how to set a condition to scale small numbers like 154.480844 to the same order of magnitude as large ones like 154480.433000 in the dataframe.
How can this be done efficiently with pandas?
Use np.log10 to determine the scaling factor required. Something like this:
v = np.log10(ser).astype(int)
ser * 10 ** (v.max() - v).values
0 154480.429
1 154480.844
2 154480.433
3 154480.844
4 154480.433
Name: 1, dtype: float64
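Put together as a minimal, self-contained sketch (the hard-coded values simply mirror the excerpt above):
import numpy as np
import pandas as pd

ser = pd.Series([154480.429, 154.480844, 154480.433, 154.480844, 154480.433])
v = np.log10(ser).astype(int)       # integer order of magnitude of each value
scaled = ser * 10 ** (v.max() - v)  # raise every value to the largest magnitude
print(scaled)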
