Outlier detection using Gaussian mixture

Outlier detection using Gaussian mixture - python

I have 5000 data points for each of my 17 features in a numpy array resulting in a 5000 x 17 array. I am trying to find the outliers for each feature using Gaussian mixture and I am rather confused on the following: 1)how many components should I use for my GaussiasnMixture? 2) Should I fit the GaussianMixture directly on the array of 5000 x 17 or to each feature column seperately resulting in 17 GaussianMixture models?
clf = mixture.GaussianMixture(n_components=1, covariance_type='full')
clf.fit(full_feature_array)
or
clf = mixture.GaussianMixture(n_components=17, covariance_type='full')
clf.fit(full_feature_array)
or
for feature in range(0, full_feature_matrix):
clf[feature] = mixture.GaussianMixture(n_components=1, covariance_type='full')
clf.fit(full_feature_array[:,feature)

The task of selecting the number of components to model a distribution with a Gaussian mixture model is an instance of Model Selection. This is not so straightforward and there exist many approaches. A good summary can be found here https://en.m.wikipedia.org/wiki/Model_selection . One of the simplest and most widely used is to perform cross validation.
Normally outliers can be determined as those belonging to the component or components with the largest variance. You would call this strategy an unsupervised approach, however it can still be difficult to decide what the cutoff variance should be. A better approach (if applicable) is a supervised approach where you would train the GMM with outlier-free data (by manually removing outliers). You then use this to classify outliers as those which have particularly low likelihood scores. The second way to do it with a supervised approach would be to train two GMMs (one for outliers and one for inliers using model selection) then perform two-class classification for new data. Regarding your question about training univariate versus multivariate GMMs - it's difficult to say but for the purposes of outlier detection univariate GMMs ( or equivalently multivariate GMMs with diagonal covariance matrices) may be sufficient and require training fewer parameters compared to general multivariate GMMs, so I would start with that.

Using Gaussian Mixture Model (GMM) any point sitting on low-density area can be considered outlier - Perhaps the challenge is how to define low density area - For example you can say whatever lower than 4th quantile density is outlier.
densities = gm.score_samples(X)
density_threshold = np.percentile(densities, 4)
anomalies = X[densities < density_threshold]
regarding choosing the number of component - look into "information criterion" provided by AIC or BIC given different number of components - they often agree in such cases. The lowest is better.
gm.bic(x)
gm.aic(x)
alternatively, BayesianGaussianMixture gives zero as weight to those clusters that are unnecessary.
from sklearn.mixture import BayesianGaussianMixture
bgm = BayesianGaussianMixture(n_components=8, n_init=10) # n_components should be large enough
bgm.fit(X)
np.round(bgm.weights_, 2)
output
array([0.5 , 0.3, 0.2 , 0. , 0. , 0. , 0. , 0. ])
so here it the bayesian gmm detected there are three clusters.

Related

Clusters with k-means having bad results

I have a dataset with 60 000 rows and 19 columns ( I will leave a sample below) and I am trying to make clusters.
Using the k-means algorithm I am getting a very low score.
Dataset Sample
Since some of my columns are categorical variables I proceded to transform them in continous using dictionaries variables as follows:
def education_dict(data):
education_dict= {
"Bachelors": 0,
"Graduate Degree": 1,
"High School": 2,
"Partial College": 3,
"Partial High School": 4
}
data["IDEducation"]=data["Education"].map(education_dict)
After converting the categorical variable to a continuous variable I delete the old variable.
After that I do the normalization of the data ( all columns since now all of them are continuous) and I proceed to the k-means algorithm.
mms=MinMaxScaler()
mms.fit(data)
Xnorm=mms.transform(data)
print(Xnorm.min(axis=0))
print(Xnorm.max(axis=0))
print(Xnorm.shape)
km=KMeans(n_clusters=10,n_init=1000,max_iter=800,random_state=42)
y_kmeans=km.fit_predict(Xnorm)
#Clustering evaluation
#Silhouette score
#the closest to 1 the better
silSc=silhouette_score(data,y_kmeans,metric="euclidean")
print("Silhouette score: " , round(silSc,3))
print("\nThese measures need grand truth\n")
The cluster evaluation is returning me a silhouette score of about 0.08 and this is to low.
If I run hieraquical clustering, which is not suited to big datasets like mine, I get a score of about 0.54
segmentation=["single","average","complete"]
results=[]
for met in segmentation:
distance_matrix=linkage(Xnorm,method=met,metric="euclidean")
#Assign cluster lables
cluster_labels=fcluster(distance_matrix,3,criterion="maxclust")
silSc=silhouette_score(data,cluster_labels,metric="euclidean")
print("Silhouette score: " , round(silSc,3))
Am I doing something wrong?

Clustering algos simply do what you would expect them to do. They are unsupervised learners. Nevertheless, you can find the accuracy of a unsupervised algo, similar to supervised learning. See the link below for details.
https://smorbieu.gitlab.io/accuracy-from-classification-to-clustering-evaluation/
Now, I think you should try several clustering algos on your given data set, and see which performs optimally. See the link below for several samples of different clustering algos.
https://machinelearningmastery.com/clustering-algorithms-with-python/
Just set different models, fit each model, and check the output of each one. Finally, I see that you are scaling your data. That's great! It's of utmost importance to scale the data before doing K-means clustering, or any algorithm that uses distances. Without scaling, features on a larger scale will weight more heavily in the algorithm. All features should weigh equally at the initial stage.

Using SHAP: Scaling the Shapley values for each model and then averaging across models, or just adding the Shapley values for each model?

I'm running n trials on a Keras model with k features, after which I apply SHAPs DeepExplainer to the model on each trial. All the data is the same, but it is randomly split between the training and testing sets. I'm trying to figure out the best way to combine the model outputs, whether it be directly by adding the Shapley values for each trial, feature by feature, and then averaging - or by scaling the Shapley values output each trial first and then adding them and averaging.
My initial thought was that, as the "baseline is always relative based on the average of all predictions" (from here), the overall average would be skewed and there might be a better way of combining the data. Though I wonder if, despite the different samples in the train/test split and a different relative "baseline" for each model, if averaging over many models would give a final averaged model should have as much interpretation value as a single model. Should this be the case?
However, would scaling the features per model offer any benefits: Again from here I can (save for the caveats) scale a features Shapley values for a single observation in a model. It seems then that I should be able to scale each of the features Shapley values after summing over all observations, over each bin such that all Shapley values for each feature sum to 1. If this is the case, that I can scale by features within the model can I average the models this way? I am thinking a benefit of this is that all models will then have equal weight since the features are scaled within each. Is this a valid approach, and if so, does it offer any benefit over adding all the Shapley values, feature by feature together over all models?
To be clear on what mean concerning the bins, they are the the lists returned from the explainer, equal to the number of classifications:
explainer = shap.DeepExplainer(model, X_train)
ShapleyBinVals = explainer.shap_values(X_test)
Bin = ShapleyBinVals[n]
where n is the number of output classifications. Here's a bar plot of the scaled output:
Notice that for each feature e.g. PSWQ_2 the y-value is a percentage and the sum of percentages over all bins is 1.

High AUC but bad predictions with imbalanced data

I am trying to build a classifier with LightGBM on a very imbalanced dataset. Imbalance is in the ratio 97:3, i.e.:
Class
0 0.970691
1 0.029309
Params I used and the code for training is as shown below.
lgb_params = {
'boosting_type': 'gbdt',
'objective': 'binary',
'metric':'auc',
'learning_rate': 0.1,
'is_unbalance': 'true', #because training data is unbalance (replaced with scale_pos_weight)
'num_leaves': 31, # we should let it be smaller than 2^(max_depth)
'max_depth': 6, # -1 means no limit
'subsample' : 0.78
}
# Cross-validate
cv_results = lgb.cv(lgb_params, dtrain, num_boost_round=1500, nfold=10,
verbose_eval=10, early_stopping_rounds=40)
nround = cv_results['auc-mean'].index(np.max(cv_results['auc-mean']))
print(nround)
model = lgb.train(lgb_params, dtrain, num_boost_round=nround)
preds = model.predict(test_feats)
preds = [1 if x >= 0.5 else 0 for x in preds]
I ran CV to get the best model and best round. I got 0.994 AUC on CV and similar score in Validation set.
But when I am predicting on the test set I am getting very bad results. I am sure that the train set is sampled perfectly.
What parameters are needed to be tuned.? What is the reason for the problem.? Should I resample the dataset such that the highest class is reduced.?

The issue is that, despite the extreme class imbalance in your dataset, you are still using the "default" threshold of 0.5 when deciding the final hard classification in
preds = [1 if x >= 0.5 else 0 for x in preds]
This should not be the case here.
This is a rather big topic, and I strongly suggest you do your own research (try googling for threshold or cut off probability imbalanced data), but here are some pointers to get you started...
From a relevant answer at Cross Validated (emphasis added):
Don't forget that you should be thresholding intelligently to make predictions. It is not always best to predict 1 when the model probability is greater 0.5. Another threshold may be better. To this end you should look into the Receiver Operating Characteristic (ROC) curves of your classifier, not just its predictive success with a default probability threshold.
From a relevant academic paper, Finding the Best Classification Threshold in Imbalanced Classification:
2.2. How to set the classification threshold for the testing set
Prediction
results
are
ultimately
determined
according
to
prediction
probabilities.
The
threshold
is
typically
set
to
0.5.
If
the
prediction
probability
exceeds
0.5,
the
sample
is
predicted
to
be
positive;
otherwise,
negative.
However,
0.5
is
not
ideal
for
some
cases,
particularly
for
imbalanced
datasets.
The post Optimizing Probability Thresholds for Class Imbalances from the (highly recommended) Applied Predictive Modeling blog is also relevant.
Take home lesson from all the above: AUC is seldom enough, but the ROC curve itself is often your best friend...
On a more general level regarding the role of the threshold itself in the classification process (which, according to my experience at least, many practitioners get wrong), check also the Classification probability threshold thread (and the provided links) at Cross Validated; key point:
the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.

Sklearn GaussianMixture

I have been learning for myself for several months artificial intelligence through a project of character recognition and transcription of handwriting. Until now I have successfully used Keras, Theano and Tensorflow by implementing CNN, CTC neural networks.
Today, I try to use Gaussian mixture models, the first step towards hidden markov models with Gaussian emission. To do so, I used the sklearn mixture with pca reduction to select the best model with Akaike and Bayesian information criterion. With type of covariance Full for Aic which provides a nice U-curve and Tied for Bic, because with Full covariance Bic gives just a linear curve. With 12.000 samples, I get the best model at 60 n-components for Aic and 120 n-components for Bic.
My input images have 64 pixels aside which represent only the capital letters of the English alphabet, 26 categories numbered from 0 to 25.
The fit method of Sklearn GaussianMixture ignore labels and the predict method returns the position of the component (0 to 59 or 0 to 119) into the n-components regarding the probabilities.
How to retrieve the original label the position of the character in a list using sklearn GaussianMixture ?

So, you want to use GaussianMixture in a generative classifier. You need to compute P(Y|X) for each label and estimate label according to these probabilities. To do so, you need to keep a GMM for each label and train with data from corresponding label. Then score method will give you likelihood, P(X|Y), of given data (or log-likelihood, you may want to check that). If you multiple likelihood with prior, you get posterior, P(Y|X). For each label, you will get a posterior e.g. P(Y=0|X), P(Y=1|X), ... Label with the maximum posterior probability can be reported as estimated label.
You can get some hints from the code sample below. (Here it is assumed that prior probabilities are equal, you need to consider that in your implementation)
Y_predicted = clf.predict(X_test)
score = np.empty((Y_test.shape[0], 10))
predictor_list = []
for i in range(10):
predictor = GMM()
predictor.fit(X[Y==i])
predictor_list.append(predictor)
score[:, i] = predictor.score(X_test)
Y_predicted = np.argmax(score, axis=1)

Dimension of data before and after performing PCA

I'm attempting kaggle.com's digit recognizer competition using Python and scikit-learn.
After removing labels from the training data, I add each row in CSV into a list like this:
for row in csv:
train_data.append(np.array(np.int64(row)))
I do the same for the test data.
I pre-process this data with PCA in order to perform dimension reduction (and feature extraction?):
def preprocess(train_data, test_data, pca_components=100):
# convert to matrix
train_data = np.mat(train_data)
# reduce both train and test data
pca = decomposition.PCA(n_components=pca_components).fit(train_data)
X_train = pca.transform(train_data)
X_test = pca.transform(test_data)
return (X_train, X_test)
I then create a kNN classifier and fit it with the X_train data and make predictions using the X_test data.
Using this method I can get around 97% accuracy.
My question is about the dimensionality of the data before and after PCA is performed
What are the dimensions of train_data and X_train?
How does the number of components influence the dimensionality of the output? Are they the same thing?

TL;DR: Yes, the number of the desired PCA components is the dimensionality of the output data (after the transformation).
The PCA algorithm finds the eigenvectors of the data's covariance matrix. What are eigenvectors? Nobody knows, and nobody cares (just kidding!). What's important is that the first eigenvector is a vector parallel to the direction along which the data has the largest variance (intuitively: spread). The second one denotes the second-best direction in terms of the maximum spread, and so on. Another important fact is that these vectors are orthogonal to each other, so they form a basis.
The pca_components parameter tells the algorithm how many best basis vectors are you interested in. So, if you pass 100 it means you want to get 100 basis vectors that describe (statistician would say: explain) most of the variance of your data.
The transform function transforms (srsly?;)) the data from the original basis to the basis formed by the chosen PCA components (in this example - the first best 100 vectors). You can visualize this as a cloud of points being rotated and having some of its dimensions ignored. As correctly pointed out by Jaime in the comments, this is equivalent of projecting the data onto the new basis.
For the 3D case, if you wanted to get a basis formed of the first 2 eigenvectors, then again, the 3D point cloud would be first rotated, so the most variance would be parallel to the coordinate axes. Then, the axis where the variance is smallest is being discarded, leaving you with 2D data.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.