How to compute Minitab-equivalent quartiles using NumPy - python

I have a homework assignment that I was doing with Minitab to find quartiles and the interquartile range of a data set. When I tried to replicate the results using NumPy, the results were different. After doing some googling, I see that there are many different algorithms for computing quartiles, as listed here. I've tried all the interpolation types listed in the NumPy docs for the percentile function, but none of them match Minitab's algorithm. Is there a lazy way to get the Minitab algorithm with NumPy, or will I need to roll my own code and implement the algorithm?
Sample code:
import pandas as pd
import numpy as np
terrestrial = pd.Series([76.5, 6.03, 3.51, 9.96, 4.24, 7.74, 9.54, 41.7, 1.84, 2.5, 1.64])
aquatic = pd.Series([.27, .61, .54, .14, .63, .23, .56, .48, .16, .18])
df = pd.DataFrame({'terrestrial': terrestrial, 'aquatic': aquatic})
This is the method I used with NumPy
q75,q25 = np.percentile(df.aquatic, [75,25], interpolation='linear')
iqr = q75 - q25
The results from Minitab are different:
Descriptive Statistics: aquatic, terrestrial
Variable Q1 Q3 IQR
aquatic 0.1750 0.5725 0.3975
terrestrial 2.50 9.96 7.46

Here's an attempt to implement Minitab's algorithm. I've written these functions assuming that you've already dropped missing observations from the series a:
# Drop missing obs
x = df.aquatic[~pd.isnull(df.aquatic)]

def get_quartile1(a):
    a = a.sort_values()  # Series.sort() was removed from pandas; sort_values() returns a sorted copy
    pos1 = (len(a) + 1) / 4.0
    round_pos1 = int(np.floor(pos1))
    first_part = a.iloc[round_pos1 - 1]
    extra_prop = pos1 - round_pos1
    interp_part = extra_prop * (a.iloc[round_pos1] - first_part)
    return first_part + interp_part

get_quartile1(x)
Out[84]: 0.17499999999999999
def get_quartile3(a):
    a = a.sort_values()
    pos3 = (3 * len(a) + 3) / 4.0
    round_pos3 = int(np.floor(pos3))  # floor, as in get_quartile1; round() would misplace positions whose fractional part exceeds 0.5
    first_part = a.iloc[round_pos3 - 1]
    extra_prop = pos3 - round_pos3
    interp_part = extra_prop * (a.iloc[round_pos3] - first_part)
    return first_part + interp_part

get_quartile3(x)
Out[86]: 0.57250000000000001

I think you will have to roll your own. The interpolation methods provided by np.percentile only affect how the interpolation is done between the two data points nearest the quantile position, but it appears that Minitab uses a different method for determining the quantile position in the first place.
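A follow-up worth checking: newer NumPy releases expose more quantile estimators than the five interpolation options that existed when this was asked. The 'weibull' method is the Hyndman & Fan type 6 estimator, which places the p-th quantile at position p*(n+1), the same rule used in the Minitab-style functions above, so the "lazy" solution may now exist. A minimal sketch, assuming NumPy >= 1.22:
# Requires NumPy >= 1.22, where the `interpolation` argument became `method`
# and additional estimators were added.
vals = df.aquatic.dropna().to_numpy()
q25, q75 = np.percentile(vals, [25, 75], method='weibull')
print(q25, q75, q75 - q25)  # expected: 0.175 0.5725 0.3975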

Related

Converting linreg function from pinescript to Python?

I am trying to convert a TradingView indicator into Python (also using pandas to store its results).
This is the public code of the indicator I want to convert:
https://www.tradingview.com/script/sU9molfV/
I am stuck recreating the Pine Script linreg built-in function.
This is the fragment of the Pine Script indicator I have trouble with:
lrc = linreg(src, length, 0)
lrc1 = linreg(src,length,1)
lrs = (lrc-lrc1)
TSF = linreg(src, length, 0)+lrs
This is its documentation:
Linear regression curve. A line that best fits the prices specified
over a user-defined time period. It is calculated using the least
squares method. The result of this function is calculated using the
formula: linreg = intercept + slope * (length - 1 - offset), where
length is the y argument, offset is the z argument, intercept and
slope are the values calculated with the least squares method on
source series (x argument). linreg(source, length, offset) →
series[float]
Source:
https://www.tradingview.com/pine-script-reference/#fun_linreg
I found this MQL4 code and tried to follow it step by step in order to convert it and finally create a linreg function in Python that I can then use to build the Pine Script indicator:
https://www.mql5.com/en/code/8016
And this is my code so far:
# calculate linear regression:
# https://www.mql5.com/en/code/8016
barsToCount = 14

# sumy += Close[i];
df['sumy'] = df['Close'].rolling(window=barsToCount).mean()

# sumxy += Close[i]*i;
tmp = []
sumxy_lst = []
for window in df['Close'].rolling(window=barsToCount):
    for index in range(len(window)):
        tmp.append(window[index] * index)
    sumxy_lst.append(sum(tmp))
    del tmp[:]
df.loc[:, 'sumxy'] = sumxy_lst
# sumx += i;
sumx = 0
for i in range(barsToCount):
    sumx += i

# sumx2 += i*i;
sumx2 = 0
for i in range(barsToCount):
    sumx2 += i * i
# c=sumx2*barsToCount-sumx*sumx;
c = sumx2*barsToCount - sumx*sumx
# Line equation:
# b=(sumxy*barsToCount-sumx*sumy)/c;
df['b'] = ((df['sumxy']*barsToCount)-(sumx*df['sumy']))/c
# a=(sumy-sumx*b)/barsToCount;
df['a'] = (df['sumy']-sumx*df['b'])/barsToCount
# Linear regression line in buffer:
df['LR_line'] = 0.0
for x in range(barsToCount):
    # LR_line[x] = a + b*x;
    df['LR_line'].iloc[x] = df['a'].iloc[x] + df['b'].iloc[x] * x
    # print(x, df['a'].iloc[x], df['b'].iloc[x], df['b'].iloc[x]*x)
print(df.tail(50))
print(list(df))
It doesn't work.
Any idea how to create a similar Pine Script linreg function in Python, please?
Thank you in advance!
I used talib to calculate the slope and intercept on the closing prices, then realised talib offers the full calculation as well. The result looks to be the same as TradingView's (just eyeballing).
Did the following in jupyterlab:
import pandas as pd
import numpy as np
import talib as tl
from pandas_datareader import data
%run "../../plt_setup.py"
asset = data.DataReader('^AXJO', 'yahoo', start='1/1/2015')
n = 270
(asset
.assign(linreg = tl.LINEARREG(asset.Close, n))
[['Close', 'linreg']]
.dropna()
.loc['2019-01-01':]
.plot()
);
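If talib is not available, the same per-bar value can be sketched directly with pandas and NumPy by following the documented formula linreg = intercept + slope * (length - 1 - offset). This is only a rough sketch (a naive least-squares fit per rolling window, not verified tick-for-tick against TradingView):
import numpy as np
import pandas as pd

def linreg(src: pd.Series, length: int, offset: int = 0) -> pd.Series:
    """Rolling least-squares line evaluated at x = length - 1 - offset."""
    def eval_fit(window: np.ndarray) -> float:
        x = np.arange(len(window))
        slope, intercept = np.polyfit(x, window, 1)  # degree-1 least-squares fit
        return intercept + slope * (length - 1 - offset)
    return src.rolling(length).apply(eval_fit, raw=True)

# Mirroring the Pine Script fragment from the question (src and length are placeholders):
# lrc  = linreg(src, length, 0)
# lrc1 = linreg(src, length, 1)
# TSF  = lrc + (lrc - lrc1)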

Checking if Frequentist approach is correct? Bayesian approach using MCMC for AB test. How to calculate Bayes Factors in Python?

I've been trying to get my head around Frequentist and Bayesian approaches for a toy data AB test problem.
The results don't really make sense to me: I am struggling to interpret them and to tell whether I have computed them correctly (an error on my part is quite likely). Furthermore, after much research, I am still somewhat lost as to how to compute Bayes factors. I've seen packages in R that make this look somewhat easy. Alas, I am not familiar with R and would prefer to solve this problem in Python.
I would greatly appreciate any help and guidance regarding this!
Here is the data:
# imports
import pingouin as pg
import pymc3 as pm
import pandas as pd
import numpy as np
import scipy.stats as scs
import statsmodels.stats.api as sms
import math
import matplotlib.pyplot as plt
# A = control -- B = treatment
a_success = 10730
a_failure = 61988
a_total = a_success + a_failure
a_cr = a_success / a_total
b_success = 10966
b_failure = 60738
b_total = b_success + b_failure
b_cr = b_success / b_total
I started by doing some power analysis, to determine the number of required samples with a power of 0.8, alpha of 0.05 and a practical significance of 2%. I'm not sure whether expected conversion rates should be supplied, or the baseline + some proportion. Depending on the effect size, the required number of samples increases significantly.
# determine required sample size
baseline_rate = a_cr
practical_significance = 0.02
alpha = 0.05
power = 0.8
nobs1 = None

# is this how to calculate effect size?
effect_size = sms.proportion_effectsize(baseline_rate, baseline_rate + practical_significance)  # 5204
# # or this?
# effect_size = sms.proportion_effectsize(baseline_rate, baseline_rate + baseline_rate * practical_significance)  # 228583
sample_size = sms.NormalIndPower().solve_power(effect_size=effect_size,
                                               power=power,
                                               alpha=alpha,
                                               nobs1=nobs1,
                                               ratio=1)
I continued trying to determine if the null hypothesis could be rejected:
# calculate pooled probability
pooled_probability = (a_success + b_success) / (a_total + b_total)

# calculate pooled standard error and margin of error
se_pooled = math.sqrt(pooled_probability * (1 - pooled_probability) * (1 / b_total + 1 / a_total))
z_score = scs.norm.ppf(1 - alpha / 2)
margin_of_error = se_pooled * z_score

# the estimated difference between the conversion probabilities of both groups
d_hat = (b_success / b_total) - (a_success / a_total)

# test if null hypothesis can be rejected
lower_bound = d_hat - margin_of_error
upper_bound = d_hat + margin_of_error
if practical_significance < lower_bound:
    print("reject null hypothesis -- groups do not have the same conversion rates")
else:
    print("do not reject the null hypothesis -- groups have the same conversion rates")
which evaluates to 'do not reject the null ...' despite group B (treatment) showing a 3.65% relative improvement in conversion rate over group A (control), which seems... odd?
I tried a slightly different approach (I guess a slightly different hypothesis?):
successes = [a_success, b_success]
nobs = [a_total, b_total]
z_stat, p_value = sms.proportions_ztest(successes, nobs=nobs)
(lower_a, lower_b), (upper_a, upper_b) = sms.proportion_confint(successes, nobs=nobs, alpha=alpha)
if p_value < alpha:
    print("reject null hypothesis -- groups do not have the same conversion rates")
else:
    print("do not reject the null hypothesis -- groups have the same conversion rates")
Which evaluates to 'reject null hypothesis ... ' with p-value: 0.004236. This seems highly contradictory, especially since the p-value is < 0.01.
On to Bayes... I created some arrays of successes and failures (and only tested on a subset of the observations, due to how long this takes), and ran the following:
# generate arrays of 1s and 0s
obs_a = np.repeat([1, 0], [a_success, a_failure])
obs_b = np.repeat([1, 0], [b_success, b_failure])
for _ in range(10):
    np.random.shuffle(obs_a)
    np.random.shuffle(obs_b)

with pm.Model() as model:
    p_A = pm.Beta("p_A", 1, 1)
    p_B = pm.Beta("p_B", 1, 1)
    delta = pm.Deterministic("delta", p_A - p_B)
    obs_A = pm.Bernoulli("obs_A", p_A, observed=obs_a[:1000])
    obs_B = pm.Bernoulli("obs_B", p_B, observed=obs_b[:1000])
    step = pm.NUTS()
    trace = pm.sample(1000, step=step, chains=2)
Firstly, I understand that you are supposed to burn some proportion of the trace -- how do you determine an appropriate number of indices to burn?
In trying to evaluate the posterior probabilities, is the following code the correct way to do this?
b_lift = (trace['p_B'].mean() - trace['p_A'].mean()) / trace['p_A'].mean() * 100
b_prob = np.mean(trace["delta"] > 0)
a_lift = (trace['p_A'].mean() - trace['p_B'].mean()) / trace['p_B'].mean() * 100
a_prob = np.mean(trace["delta"] < 0)
# is the Bayes Factor just the ratio of the posterior probabilities for these two models?
BF = (trace['p_B'] / trace['p_A']).mean()
print(f'There is {b_prob} probability B outperforms A by a magnitude of {round(b_lift, 2)}%')
print(f'There is {a_prob} probability A outperforms B by a magnitude of {round(a_lift, 2)}%')
print('BF:', BF)
-- output:
There is 0.666 probability B outperforms A by a magnitude of 1.29%
There is 0.334 probability A outperforms B by a magnitude of -1.28%
BF: 1.013357654428127
I suspect that this is not the correct way to calculate Bayes Factors. How can the Bayes Factor be calculated?
I really hope you can help me understand all of the above... I realize it's an exceptionally long post. But I've tried every resource I can find and am still stuck!
Kind regards.
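On the Bayes factor question, a minimal sketch of one common approach, the Savage-Dickey density ratio, under the assumptions of the model above: for the point null delta = 0 nested in that model, BF01 is the posterior density of delta at 0 divided by its prior density at 0, and with the Beta(1, 1) priors the prior density of delta = p_A - p_B at 0 is exactly 1. This is a sketch for sanity-checking, not a validated analysis:
from scipy.stats import gaussian_kde

# Posterior density of delta at 0, estimated from the MCMC samples by a KDE.
posterior_at_zero = gaussian_kde(trace['delta'])(0.0)[0]

# Prior density of delta = p_A - p_B at 0: the difference of two independent
# Beta(1, 1) (i.e. uniform) variables is triangular on [-1, 1] with height 1 at 0.
prior_at_zero = 1.0

bf_01 = posterior_at_zero / prior_at_zero  # evidence for "no difference"
bf_10 = 1.0 / bf_01                        # evidence for "a difference"
print(f'BF01: {bf_01:.3f}, BF10: {bf_10:.3f}')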

Improving performance of Cronbach Alpha code python numpy

I made some code for calculating Cronbach's alpha that works. But I am not too good with lambda functions. Is there a way to reduce the code and improve efficiency by using a lambda instead of the svar() function, and by getting rid of some of the for loops by using numpy arrays?
import numpy as np

def svar(X):
    n = float(len(X))
    svar = (sum([(x - np.mean(X))**2 for x in X]) / n) * n/(n-1.)
    return svar

def CronbachAlpha(itemscores):
    itemvars = [svar(item) for item in itemscores]
    tscores = [0] * len(itemscores[0])
    for item in itemscores:
        for i in range(len(item)):
            tscores[i] += item[i]
    nitems = len(itemscores)
    # print("total scores=", tscores, 'number of items=', nitems)
    Calpha = nitems/(nitems-1.) * (1 - sum(itemvars)/svar(tscores))
    return Calpha

########### Test ################
itemscores = [[4, 14, 3, 3, 23, 4, 52, 3, 33, 3],
              [5, 14, 4, 3, 24, 5, 55, 4, 15, 3]]
print("Cronbach alpha =", CronbachAlpha(itemscores))
def CronbachAlpha(itemscores):
    itemscores = np.asarray(itemscores)
    itemvars = itemscores.var(axis=1, ddof=1)
    tscores = itemscores.sum(axis=0)
    nitems = len(itemscores)
    return nitems / (nitems-1.) * (1 - itemvars.sum() / tscores.var(ddof=1))
NumPy has a variance function built in. Specifying ddof=1 uses a denominator of N-1, giving a sample variance. There's also a sum builtin.
As Julien Marrec mentioned, I suggest the following refactoring of CronbachAlpha:
def CronbachAlpha(itemscores):
    # cols are items, rows are observations
    itemscores = np.asarray(itemscores)
    itemvars = itemscores.var(axis=0, ddof=1)
    tscores = itemscores.sum(axis=1)
    nitems = itemscores.shape[1]
    return (nitems / (nitems-1)) * (1 - (itemvars.sum() / tscores.var(ddof=1)))
Same as the other answers, just a bit more Pythonic. X is a data matrix -- that is, the rows are samples, the columns are items. X may be a numpy array or pandas DataFrame.
def cronbach_alpha(X):
    num_items = X.shape[1]
    sum_of_item_variances = X.var(axis=0).sum()
    variance_of_sum_of_items = X.sum(axis=1).var()
    return num_items/(num_items - 1) * (1 - sum_of_item_variances/variance_of_sum_of_items)
(It's not necessary to specify ddof, as the term appears in the denominator and numerator, and cancels.)
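A quick check tying this back to the itemscores test data from the question (there the rows are items, so transpose to this row = observation orientation); both implementations should return the same value:
import numpy as np

X = np.asarray(itemscores).T   # rows = observations, columns = items
print(cronbach_alpha(X))       # same value as CronbachAlpha(itemscores) above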

Is there a numpy builtin to reject outliers from a list

Is there a numpy builtin to do something like the following? That is, take a list d and return a list filtered_d with any outlying elements removed based on some assumed distribution of the points in d.
import numpy as np

def reject_outliers(data):
    m = 2
    u = np.mean(data)
    s = np.std(data)
    filtered = [e for e in data if (u - m * s < e < u + m * s)]
    return filtered

>>> d = [2, 4, 5, 1, 6, 5, 40]
>>> filtered_d = reject_outliers(d)
>>> print(filtered_d)
[2, 4, 5, 1, 6, 5]
I say 'something like' because the function might allow for varying distributions (poisson, gaussian, etc.) and varying outlier thresholds within those distributions (like the m I've used here).
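As a near-builtin alternative: NumPy itself has no such function, but SciPy provides iterative sigma clipping via scipy.stats.sigmaclip. A minimal sketch, assuming SciPy is acceptable alongside NumPy:
import numpy as np
from scipy import stats

d = np.array([2, 4, 5, 1, 6, 5, 40])
clipped, low, high = stats.sigmaclip(d, low=2.0, high=2.0)  # iterative 2-sigma clipping
print(clipped)  # [2 4 5 1 6 5] -- the 40 is clipped away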
Something important when dealing with outliers is that one should try to use estimators that are as robust as possible. The mean of a distribution will be biased by outliers, but e.g. the median will be much less so.
Building on eumiro's answer:
def reject_outliers(data, m=2.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d / mdev if mdev else np.zeros(len(d))
    return data[s < m]
Here I have replaced the mean with the more robust median, and the standard deviation with the median absolute distance to the median. I then scaled the distances by their (again) median value so that m is on a reasonable relative scale.
Note that for the data[s<m] syntax to work, data must be a numpy array.
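A side note on that relative scale: for Gaussian data the MAD relates to the standard deviation by a factor of roughly 1.4826, so m here plays a role comparable to "m robust standard deviations". A quick sketch with made-up data:
data = np.random.normal(0, 1, 10000)
mad = np.median(np.abs(data - np.median(data)))
print(np.std(data), 1.4826 * mad)  # both should come out close to 1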
This method is almost identical to yours, just written in a more NumPy-style way (and also working on numpy arrays only):
def reject_outliers(data, m=2):
    return data[abs(data - np.mean(data)) < m * np.std(data)]
Benjamin Bannier's answer yields a pass-through when the median of distances from the median is 0, so I found this modified version a bit more helpful for cases as given in the example below.
def reject_outliers_2(data, m=2.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d / (mdev if mdev else 1.)
    return data[s < m]
Example:
data_points = np.array([10, 10, 10, 17, 10, 10])
print(reject_outliers(data_points))
print(reject_outliers_2(data_points))
Gives:
[10, 10, 10, 17, 10, 10]  # 17 is not filtered
[10, 10, 10, 10, 10]      # 17 is filtered (its scaled distance, 7, is greater than m)
Building on Benjamin's, using pandas.Series, and replacing MAD with IQR:
def reject_outliers(sr, iq_range=0.5):
    pcnt = (1 - iq_range) / 2
    qlow, median, qhigh = sr.dropna().quantile([pcnt, 0.50, 1 - pcnt])
    iqr = qhigh - qlow
    return sr[(sr - median).abs() <= iqr]
For instance, if you set iq_range=0.6, the percentiles of the interquartile-range would become: 0.20 <--> 0.80, so more outliers will be included.
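For example, on the data from the question (with the default iq_range=0.5 this keeps only values within one IQR of the median, so it is stricter than the 2-sigma filter above):
import pandas as pd

sr = pd.Series([2, 4, 5, 1, 6, 5, 40])
print(reject_outliers(sr).tolist())  # [4, 5, 6, 5] -- 40 is dropped, and so are 1 and 2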
An alternative is to make a robust estimation of the standard deviation (assuming Gaussian statistics). Looking up online calculators, I see that the 90th percentile corresponds to 1.2815σ and the 95th to 1.645σ (http://vassarstats.net/tabs.html?#z).
As a simple example:
import numpy as np
# Create some random numbers
x = np.random.normal(5, 2, 1000)
# Calculate the statistics
print("Mean= ", np.mean(x))
print("Median= ", np.median(x))
print("Max/Min=", x.max(), " ", x.min())
print("StdDev=", np.std(x))
print("90th Percentile", np.percentile(x, 90))
# Add a few large points
x[10] += 1000
x[20] += 2000
x[30] += 1500
# Recalculate the statistics
print()
print("Mean= ", np.mean(x))
print("Median= ", np.median(x))
print("Max/Min=", x.max(), " ", x.min())
print("StdDev=", np.std(x))
print("90th Percentile", np.percentile(x, 90))
# Measure the percentile intervals and then estimate Standard Deviation of the distribution, both from median to the 90th percentile and from the 10th to 90th percentile
p90 = np.percentile(x, 90)
p10 = np.percentile(x, 10)
p50 = np.median(x)
# p50 to p90 is 1.2815 sigma
rSig = (p90-p50)/1.2815
print("Robust Sigma=", rSig)
rSig = (p90-p10)/(2*1.2815)
print("Robust Sigma=", rSig)
The output I get is:
Mean= 4.99760520022
Median= 4.95395274981
Max/Min= 11.1226494654 -2.15388472011
StdDev= 1.976629928
90th Percentile 7.52065379649
Mean= 9.64760520022
Median= 4.95667658782
Max/Min= 2205.43861943 -2.15388472011
StdDev= 88.6263902244
90th Percentile 7.60646688694
Robust Sigma= 2.06772555531
Robust Sigma= 1.99878292462
Which is close to the expected value of 2.
If we want to remove points above/below 5 standard deviations (with 1000 points we would expect 1 value > 3 standard deviations):
y = x[abs(x - p50) < rSig*5]
# Print the statistics again
print("Mean= ", np.mean(y))
print("Median= ", np.median(y))
print("Max/Min=", y.max(), " ", y.min())
print("StdDev=", np.std(y))
Which gives:
Mean= 4.99755359935
Median= 4.95213030447
Max/Min= 11.1226494654 -2.15388472011
StdDev= 1.97692712883
I have no idea which approach is the more efficient/robust.
I wanted to do something similar, except setting the value to NaN rather than removing it from the data, since if you remove it you change the length, which can mess up plotting (i.e. if you're only removing outliers from one column in a table, you need it to remain the same length as the other columns so you can plot them against each other).
To do so I used numpy's masking functions:
def reject_outliers(data, m=2):
    stdev = np.std(data)
    mean = np.mean(data)
    maskMin = mean - stdev * m
    maskMax = mean + stdev * m
    mask = np.ma.masked_outside(data, maskMin, maskMax)
    print('Masking values outside of {} and {}'.format(maskMin, maskMax))
    return mask
I would like to provide two methods in this answer: a solution based on the z-score and a solution based on the IQR.
The code provided in this answer works on both single-dimensional and multi-dimensional numpy arrays.
Let's import some modules firstly.
import collections.abc
import numpy as np
import scipy.stats as stat
from scipy.stats import iqr
z-score based method
This method tests whether the value falls outside three standard deviations. Based on this rule, the method returns True if the value is an outlier and False otherwise.
def sd_outlier(x, axis=None, bar=3, side='both'):
    assert side in ['gt', 'lt', 'both'], 'Side should be `gt`, `lt` or `both`.'
    d_z = stat.zscore(x, axis=axis)
    if side == 'gt':
        return d_z > bar
    elif side == 'lt':
        return d_z < -bar
    elif side == 'both':
        return np.abs(d_z) > bar
IQR based method
This method tests whether the value is less than q1 - 1.5 * iqr or greater than q3 + 1.5 * iqr, which is similar to SPSS's plot method.
def q1(x, axis=None):
    return np.percentile(x, 25, axis=axis)

def q3(x, axis=None):
    return np.percentile(x, 75, axis=axis)

def iqr_outlier(x, axis=None, bar=1.5, side='both'):
    assert side in ['gt', 'lt', 'both'], 'Side should be `gt`, `lt` or `both`.'
    d_iqr = iqr(x, axis=axis)
    d_q1 = q1(x, axis=axis)
    d_q3 = q3(x, axis=axis)
    iqr_distance = np.multiply(d_iqr, bar)
    stat_shape = list(x.shape)
    if isinstance(axis, collections.abc.Iterable):
        for single_axis in axis:
            stat_shape[single_axis] = 1
    else:
        stat_shape[axis] = 1
    if side in ['gt', 'both']:
        upper_range = d_q3 + iqr_distance
        upper_outlier = np.greater(x - upper_range.reshape(stat_shape), 0)
    if side in ['lt', 'both']:
        lower_range = d_q1 - iqr_distance
        lower_outlier = np.less(x - lower_range.reshape(stat_shape), 0)
    if side == 'gt':
        return upper_outlier
    if side == 'lt':
        return lower_outlier
    if side == 'both':
        return np.logical_or(upper_outlier, lower_outlier)
Finally, if you want to filter out the outliers, use a numpy selector.
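For instance, a minimal sketch of that last step using the sd_outlier mask on the question's data (bar is lowered to 2 here so the single large value is actually flagged):
d = np.array([2., 4., 5., 1., 6., 5., 40.])
mask = sd_outlier(d, axis=0, bar=2, side='both')  # boolean outlier mask
filtered_d = d[~mask]                             # keep the non-outliers
print(filtered_d)                                 # [2. 4. 5. 1. 6. 5.]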
Have a nice day.
Consider that all the above methods fail when your standard deviation gets very large due to huge outliers.
(Similar to how the mean calculation fails and you should rather calculate the median; the standard deviation is at least as prone to such an error as the mean.)
You could try to apply your algorithm iteratively, or you can filter using the interquartile range:
(here "factor" relates to an n*sigma range, but only when your data follows a Gaussian distribution)
import numpy as np

def sortoutOutliers(dataIn, factor):
    quant3, quant1 = np.percentile(dataIn, [75, 25])
    iqr = quant3 - quant1
    iqrSigma = iqr / 1.34896
    medData = np.median(dataIn)
    dataOut = [x for x in dataIn if ((x > medData - factor * iqrSigma) and (x < medData + factor * iqrSigma))]
    return dataOut
So many answers, but I'm adding a new one that can be useful for the author or even for other users.
You could use the Hampel filter, but you need to work with a Series.
The Hampel filter returns the outlier indices; you can then drop them from the Series and convert it back to a list.
To use the Hampel filter, you can easily install the package with pip:
pip install hampel
Usage:
# Imports
from hampel import hampel
import pandas as pd
list_d = [2, 4, 5, 1, 6, 5, 40]
# List to Series
time_series = pd.Series(list_d)
# Outlier detection with Hampel filter
# Returns the Outlier indices
outlier_indices = hampel(ts = time_series, window_size = 3)
# Drop Outliers indices from Series
filtered_d = time_series.drop(outlier_indices)
filtered_d.values.tolist()
print(f'filtered_d: {filtered_d.values.tolist()}')
And the output will be:
filtered_d: [2, 4, 5, 1, 6, 5]
Here, ts is a pandas Series object and window_size is the half-window size; the total window will be computed as 2 * window_size + 1.
For this Series I set window_size to the value 3.
The cool thing about working with Series is being able to generate graphics:
# Imports
import matplotlib.pyplot as plt
plt.style.use('seaborn-darkgrid')
# Plot Original Series
time_series.plot(style = 'k-')
plt.title('Original Series')
plt.show()
# Plot Cleaned Series
filtered_d.plot(style = 'k-')
plt.title('Cleaned Series (Without detected Outliers)')
plt.show()
And the output will be two plots: the original series, and the cleaned series without the detected outliers.
To learn more about Hampel filter, I recommend the following readings:
Python implementation of the Hampel Filter
Outlier Detection with Hampel Filter
Clean-up your time series data with a Hampel Filter
If you want to get the index positions of the outliers, idx_list will return them.
def reject_outliers(data, m=2.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d / mdev if mdev else 0.
    data_range = np.arange(len(data))
    idx_list = data_range[s >= m]
    return data[s < m], idx_list

data_points = np.array([8, 10, 35, 17, 73, 77])
print(reject_outliers(data_points))
after rejection: [ 8 10 35 17], index positions of outliers: [4 5]
For a set of images (each image has 3 dimensions), where I wanted to reject outliers for each pixel, I used:
mean = np.mean(imgs, axis=0)
std = np.std(imgs, axis=0)
mask = np.greater(0.5 * std + 1, np.abs(imgs - mean))
masked = np.multiply(imgs, mask)
Then it is possible to compute the mean:
masked_mean = np.divide(np.sum(masked, axis=0), np.sum(mask, axis=0))
(I use it for Background Subtraction)
Here I find the outliers in x and substitute them with the median of a window of points (win) around them (taking the median deviation from Benjamin Bannier's answer):
def outlier_smoother(x, m=3, win=3, plots=False):
    ''' finds outliers in x, points > m*mdev(x) [mdev: median deviation]
    and replaces them with the median of win points around them '''
    x_corr = np.copy(x)
    d = np.abs(x - np.median(x))
    mdev = np.median(d)
    idxs_outliers = np.nonzero(d > m*mdev)[0]
    for i in idxs_outliers:
        if i-win < 0:
            x_corr[i] = np.median(np.append(x[0:i], x[i+1:i+win+1]))
        elif i+win+1 > len(x):
            x_corr[i] = np.median(np.append(x[i-win:i], x[i+1:len(x)]))
        else:
            x_corr[i] = np.median(np.append(x[i-win:i], x[i+1:i+win+1]))
    if plots:
        plt.figure('outlier_smoother', clear=True)
        plt.plot(x, label='orig.', lw=5)
        plt.plot(idxs_outliers, x[idxs_outliers], 'ro', label='outliers')
        plt.plot(x_corr, '-o', label='corrected')
        plt.legend()
    return x_corr
Trim outliers in a numpy array along axis and replace them with min or max values along this axis, whichever is closer. The threshold is z-score:
def np_z_trim(x, threshold=10, axis=0):
    """ Replace outliers in numpy ndarray along axis with min or max values
    within the threshold along this axis, whichever is closer."""
    mean = np.mean(x, axis=axis, keepdims=True)
    std = np.std(x, axis=axis, keepdims=True)
    masked = np.where(np.abs(x - mean) < threshold * std, x, np.nan)
    min = np.nanmin(masked, axis=axis, keepdims=True)
    max = np.nanmax(masked, axis=axis, keepdims=True)
    repl = np.where(np.abs(x - max) < np.abs(x - min), max, min)
    return np.where(np.isnan(masked), repl, masked)
My solution drops the top and bottom percentiles, keeping values that are equal to the boundary:
def remove_percentile_outliers(data, percent_to_drop=0.001):
    low, high = data.quantile([percent_to_drop / 2, 1 - percent_to_drop / 2])
    return data[(data >= low) & (data <= high)]
My solution sets each outlier equal to the previous value.
test_data = [2, 4, 5, 1, 6, 5, 40, 3]

def reject_outliers(data, m=2):
    mean = np.mean(data)
    std = np.std(data)
    for i in range(len(data)):
        if np.abs(data[i] - mean) > m * std:
            data[i] = data[i-1]
    return data

reject_outliers(test_data)
Output:
[2, 4, 5, 1, 6, 5, 5, 3]

Is a Fuzzy C-Means algorithm available for Python?

I have some points in a 3-dimensional space and would like to cluster them. I know Python's module "cluster", but it only has K-Means. Do you know a module which has FCM (Fuzzy C-Means)?
(If you know of some other Python modules related to clustering, you could name them as a bonus. But the important question is the one about an FCM algorithm in Python.)
Matlab
It seems to be quite easy to use FCM in Matlab (example). Isn't something like this available for Python?
NumPy, SciPy and Sage
I didn't find FCM in NumPy, SciPy or Sage. I've downloaded the documentation and searched for it. No results
Python-cluster
It seems like the cluster module will add fuzzy C-Means with the next version (see Roadmap). But I need it now
PEACH will provide some Fuzzy C-Means functionality:
http://code.google.com/p/peach/
However there doesn't seem to be any usable documentation as the wiki is empty. An example for using FCM with PEACH can be found on its website.
Have a look at the scikit-fuzzy package. It has basic fuzzy logic functionality, including fuzzy c-means clustering.
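A minimal usage sketch (the data here is made up, and the call should be double-checked against the current scikit-fuzzy docs; note that cmeans expects the data transposed, as features x samples):
import numpy as np
import skfuzzy as fuzz

pts = np.random.rand(200, 3)  # made-up 3-D points, one per row
cntr, u, u0, d, jm, p, fpc = fuzz.cluster.cmeans(pts.T, 5, 2.0, error=1e-5, maxiter=1000)
hard_labels = np.argmax(u, axis=0)  # crisp assignment from the fuzzy memberships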
Python
There is a fuzzy-c-means package on PyPI. Check out the link: fuzzy-c-means Python
This is the simplest way to use FCM in Python. Hope it helps.
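A minimal sketch based on that package's documented interface (made-up data; the API may have changed, so check the project page):
import numpy as np
from fcmeans import FCM

X = np.random.rand(100, 3)  # made-up 3-D points, one per row
fcm = FCM(n_clusters=5)     # the fuzziness parameter m defaults to 2.0
fcm.fit(X)
centers = fcm.centers       # cluster centers
memberships = fcm.u         # soft membership matrix, shape (100, 5)
labels = fcm.predict(X)     # hardened cluster labels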
I have done it from scratch, using K++ initialization (with fixed seeds and 5 centroids; it shouldn't be too difficult to adapt it to your desired number of centroids):
# K++ initialization algorithm:
import random
import numpy as np

def initialize(X, K):
    C = [X[0]]
    for k in range(1, K):
        D2 = np.array([min([np.inner(c - x, c - x) for c in C]) for x in X])
        probs = D2 / D2.sum()
        cumprobs = probs.cumsum()
        np.random.seed(20)   # fixing seeds
        # random.seed(0)     # fixing seeds
        r = np.random.rand()
        for j, p in enumerate(cumprobs):
            if r < p:
                i = j
                break
        C.append(X[i])
    return C

a = initialize(data2, 5)  # "a" is the initial centroids array... I used 5 centroids
# Now the fuzzy c-means algorithm:
from numpy import linalg as LA   # LA.norm is used for the distances below

m = 1.5           # fuzziness parameter (it can be tuned)
r = (2 / (m - 1))

# Initial centroids:
c1, c2, c3, c4, c5 = a[0], a[1], a[2], a[3], a[4]
# Prepare empty lists to collect the centroid history:
cc1, cc2, cc3, cc4, cc5 = [], [], [], [], []

n_iterations = 10000
for j in range(n_iterations):
    u1, u2, u3, u4, u5 = [], [], [], [], []
    for i in range(len(data2)):
        # Distances (of every point to each centroid):
        a = LA.norm(data2[i] - c1)
        b = LA.norm(data2[i] - c2)
        c = LA.norm(data2[i] - c3)
        d = LA.norm(data2[i] - c4)
        e = LA.norm(data2[i] - c5)
        # Membership matrix vectors:
        U1 = 1 / (1 + (a/b)**r + (a/c)**r + (a/d)**r + (a/e)**r)
        U2 = 1 / ((b/a)**r + 1 + (b/c)**r + (b/d)**r + (b/e)**r)
        U3 = 1 / ((c/a)**r + (c/b)**r + 1 + (c/d)**r + (c/e)**r)
        U4 = 1 / ((d/a)**r + (d/b)**r + (d/c)**r + 1 + (d/e)**r)
        U5 = 1 / ((e/a)**r + (e/b)**r + (e/c)**r + (e/d)**r + 1)
        # We will get an array of n row points x K centroids, with their degree of membership
        u1.append(U1)
        u2.append(U2)
        u3.append(U3)
        u4.append(U4)
        u5.append(U5)
    # now we calculate the new centers:
    c1 = (np.array(u1)**2).dot(data2) / np.sum(np.array(u1)**2)
    c2 = (np.array(u2)**2).dot(data2) / np.sum(np.array(u2)**2)
    c3 = (np.array(u3)**2).dot(data2) / np.sum(np.array(u3)**2)
    c4 = (np.array(u4)**2).dot(data2) / np.sum(np.array(u4)**2)
    c5 = (np.array(u5)**2).dot(data2) / np.sum(np.array(u5)**2)
    cc1.append(c1)
    cc2.append(c2)
    cc3.append(c3)
    cc4.append(c4)
    cc5.append(c5)
    if j > 5:
        change_rate1 = np.sum(3*cc1[j] - cc1[j-1] - cc1[j-2] - cc1[j-3]) / 3
        change_rate2 = np.sum(3*cc2[j] - cc2[j-1] - cc2[j-2] - cc2[j-3]) / 3
        change_rate3 = np.sum(3*cc3[j] - cc3[j-1] - cc3[j-2] - cc3[j-3]) / 3
        change_rate4 = np.sum(3*cc4[j] - cc4[j-1] - cc4[j-2] - cc4[j-3]) / 3
        change_rate5 = np.sum(3*cc5[j] - cc5[j-1] - cc5[j-2] - cc5[j-3]) / 3
        change_rate = np.array([change_rate1, change_rate2, change_rate3, change_rate4, change_rate5])
        changed = np.sum(change_rate > 0.0000001)
        if changed == 0:
            break

print(c1)  # to check centroid coordinates; c1 - c5 are the last centroids calculated, so supposedly they converged
# The degree of membership of each point to each centroid (n row points x K centroid columns):
U = np.column_stack([u1, u2, u3, u4, u5])
print(U)
I know it is not very Pythonic, but I hope it can be a starting point for your complete fuzzy c-means algorithm. I think that "soft clustering" is the way to go when data is not easily separable (for example, when t-SNE visualization shows all the data together instead of clearly separated groups; in that case, forcing each point to belong strictly to only one cluster can be dangerous). I would try m = 1.1 to m = 2.0, so you can see how the fuzziness parameter affects the membership matrix.
