Clustering with k-means giving bad results - Python

I have a dataset with 60,000 rows and 19 columns (I will leave a sample below) and I am trying to build clusters.
Using the k-means algorithm I am getting a very low score.
Dataset Sample
Since some of my columns are categorical variables, I proceeded to transform them into numeric values using dictionary mappings, as follows:
def education_dict(data):
    education_dict = {
        "Bachelors": 0,
        "Graduate Degree": 1,
        "High School": 2,
        "Partial College": 3,
        "Partial High School": 4
    }
    data["IDEducation"] = data["Education"].map(education_dict)
After converting each categorical variable to a numeric one, I delete the old column.
After that I normalize the data (all columns, since they are now all numeric) and proceed to the k-means algorithm.
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

mms = MinMaxScaler()
mms.fit(data)
Xnorm = mms.transform(data)
print(Xnorm.min(axis=0))
print(Xnorm.max(axis=0))
print(Xnorm.shape)

km = KMeans(n_clusters=10, n_init=1000, max_iter=800, random_state=42)
y_kmeans = km.fit_predict(Xnorm)

# Clustering evaluation
# Silhouette score: the closer to 1, the better
silSc = silhouette_score(data, y_kmeans, metric="euclidean")
print("Silhouette score:", round(silSc, 3))
print("\nThese measures need ground truth\n")
The cluster evaluation returns a silhouette score of about 0.08, which is too low.
If I run hierarchical clustering, which is not well suited to big datasets like mine, I get a score of about 0.54:
from scipy.cluster.hierarchy import linkage, fcluster

segmentation = ["single", "average", "complete"]
results = []
for met in segmentation:
    distance_matrix = linkage(Xnorm, method=met, metric="euclidean")
    # Assign cluster labels
    cluster_labels = fcluster(distance_matrix, 3, criterion="maxclust")
    silSc = silhouette_score(data, cluster_labels, metric="euclidean")
    print("Silhouette score:", round(silSc, 3))
Am I doing something wrong?

Clustering algorithms simply do what you would expect them to do: they are unsupervised learners. Nevertheless, you can measure the accuracy of an unsupervised algorithm in a way similar to supervised learning. See the link below for details.
https://smorbieu.gitlab.io/accuracy-from-classification-to-clustering-evaluation/
Now, I think you should try several clustering algorithms on your data set and see which performs best. See the link below for examples of different clustering algorithms.
https://machinelearningmastery.com/clustering-algorithms-with-python/
Just set up different models, fit each one, and check its output. Finally, I see that you are scaling your data. That's great! It is of utmost importance to scale the data before doing k-means clustering, or any algorithm that uses distances. Without scaling, features on a larger scale will weigh more heavily in the algorithm; all features should weigh equally at the outset.
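As a rough illustration of that comparison loop, here is a minimal sketch (assuming Xnorm is the scaled feature matrix from the question; the chosen algorithms and parameter values are only examples to tune, not recommendations):

from sklearn.cluster import KMeans, Birch, DBSCAN
from sklearn.metrics import silhouette_score

# Candidate models; the parameters below are placeholders to tune, not recommendations
models = {
    "kmeans": KMeans(n_clusters=10, n_init=50, random_state=42),
    "birch": Birch(n_clusters=10),
    "dbscan": DBSCAN(eps=0.5, min_samples=10),
}

for name, model in models.items():
    labels = model.fit_predict(Xnorm)
    # Silhouette is only defined when there is more than one cluster label
    if len(set(labels)) > 1:
        score = silhouette_score(Xnorm, labels, metric="euclidean")
        print(name, "silhouette:", round(score, 3))
    else:
        print(name, "found a single cluster")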

Related

Why are probabilities hand-calculated from sklearn.linear_model.LogisticRegression coefficients different from .predict_proba()?

I am running a multinomial logistic regression in sklearn, using sklearn.linear_model.LogisticRegression(multi_class="multinomial"). The dependent categorical variable has 3 options: Agree, Disagree, Unsure. The independent variables are two categorical variables: Education and Gender (binary gender for simplicity in this example). I get different results when I hand-calculate the probabilities from the regression coefficients versus using the built-in predict_proba().
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
import pandas as pd

mnlr = LogisticRegression(multi_class="multinomial")
mnlr.fit(
    pd.get_dummies(df[["Education", "Gender"]]),
    preprocessing.LabelEncoder().fit_transform(df["statement"])
)
I concatenate the outputs of mnlr.intercept_ and mnlr.coef_ into a regression coefficients table that looks like this:
Using mnlr.predict_proba(), I get results that I cast into a dataframe to which I add the independent variables like this:
These sum to 1 across the 3 potential categories for each data point.
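For context, here is a minimal sketch of how such tables might be assembled (the actual tables in the question were shown as images; the column layout below is only an assumption, and mnlr.classes_ holds the label-encoded classes):

import numpy as np
import pandas as pd

X = pd.get_dummies(df[["Education", "Gender"]])

# Coefficient table: one row per class, intercept plus one column per dummy feature
coef_table = pd.DataFrame(
    np.column_stack([mnlr.intercept_, mnlr.coef_]),
    index=mnlr.classes_,
    columns=["intercept"] + list(X.columns),
)

# Predicted probabilities, one column per class, next to the original inputs
proba_df = pd.DataFrame(mnlr.predict_proba(X), columns=mnlr.classes_)
proba_table = pd.concat(
    [df[["Education", "Gender"]].reset_index(drop=True), proba_df], axis=1
)

# Each row of probabilities sums to 1
assert np.allclose(proba_df.sum(axis=1), 1.0)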
However, I cannot seem to reproduce these results when I try to calculate the predicted probabilities by hand from the logistic regression coefficients.
First, for each Gender x Education combination, I calculate the logit (aka log-odds, if I understand correctly) by simply adding the intercept and the relevant variable terms. For example, to get the logit for a Woman with a Bachelor's degree with the Agree regression: 0.88076 + 0.21827 + 0.21687 = 1.31590. The table of logits looks like this:
From this table, as I understand it, I should be able to convert these logits (log-odds) to predicted probabilities: p = e^logit/(1+e^logit) for a given model and respondent (e.g., the probability that Women with Bachelor's Agree with the statement). When I try this, however, I get very different results than I receive from .predict_proba(), and the hand-calculated probabilities do not sum to 1, as indicated in the table below:
For example, Women with Bachelor's here have a 0.78850 probability of Agreeing with the statement, instead of the 0.7819 probability from .predict_proba(). Additionally, the hand-calculated probabilities across the 3 categories do not sum to 1, but rather to 1.47146.
I am almost certain this is a basic error on my part, but I cannot for the life of me figure it out. What am I doing incorrectly?
I figured this one out eventually. The answer is probably obvious to folks who really know multinomial logistic regression. The struggle I was having was that I needed to apply the softmax function (also known more descriptively as the normalized exponential function) to the logits. This function involves exponentiating the logit (log-odds) for each class and then dividing it by the sum of exponentiated logits for all classes. In this example, for Women with a Bachelor's degree, this would mean:
p(Agree) = e^(logit_Agree) / (e^(logit_Agree) + e^(logit_Disagree) + e^(logit_Unsure)) = 0.737007424626824
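For anyone who wants to check this numerically, here is a minimal sketch of the softmax calculation (only the Agree logit, 1.31590, comes from the example above; the Disagree and Unsure logits below are placeholders, since the full logit table is not reproduced here):

import numpy as np

# Logits for one respondent (Woman with a Bachelor's degree), one per class.
# Only the Agree value (1.31590) is from the example; the other two are placeholders.
logits = np.array([1.31590, 0.25, -0.40])  # Agree, Disagree, Unsure (illustrative)

# Softmax: exponentiate each logit, then divide by the sum of exponentiated logits
probs = np.exp(logits) / np.exp(logits).sum()

print(probs)        # class probabilities
print(probs.sum())  # sums to 1 by construction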
Hopefully this will be helpful to anyone else trying to understand how to do this by hand! (Which for me is really useful for trying to apply model-based inference as an alternative to design-based inference in sample surveys).
Sources that got me here:
How do I correctly manually recreate sklearn (python) logistic regression predict_proba outcome for multiple classification, https://en.wikipedia.org/wiki/Softmax_function

Using SHAP: Scaling the Shapley values for each model and then averaging across models, or just adding the Shapley values for each model?

I'm running n trials on a Keras model with k features, after which I apply SHAP's DeepExplainer to the model in each trial. All the data is the same, but it is randomly split between the training and testing sets. I'm trying to figure out the best way to combine the model outputs: either directly, by adding the Shapley values from each trial, feature by feature, and then averaging, or by first scaling the Shapley values output by each trial and then adding and averaging them.
My initial thought was that, as the "baseline is always relative based on the average of all predictions" (from here), the overall average would be skewed and there might be a better way of combining the data. Though I wonder whether, despite the different samples in the train/test split and a different relative "baseline" for each model, averaging over many models would give a final averaged model with as much interpretive value as a single model. Should this be the case?
However, would scaling the features per model offer any benefits? Again from here, I can (save for the caveats) scale a feature's Shapley values for a single observation in a model. It seems, then, that I should be able to scale each feature's Shapley values after summing over all observations, over each bin, such that all Shapley values for each feature sum to 1. If I can scale by feature within a model this way, can I also average the models this way? I am thinking a benefit of this is that all models will then have equal weight, since the features are scaled within each. Is this a valid approach, and if so, does it offer any benefit over simply adding all the Shapley values, feature by feature, over all models?
To be clear on what I mean concerning the bins: they are the lists returned from the explainer, one per output classification:
explainer = shap.DeepExplainer(model, X_train)
ShapleyBinVals = explainer.shap_values(X_test)
Bin = ShapleyBinVals[n]
where n indexes the output classification. Here's a bar plot of the scaled output:
Notice that for each feature, e.g. PSWQ_2, the y-value is a percentage and the sum of percentages over all bins is 1.
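To make the two options concrete, here is a minimal sketch of both combination strategies for one output bin (all_trials is a hypothetical list holding that bin's Shapley array from each trial, each of shape (n_samples, n_features); the per-feature normalization is just one reading of the scaling described above):

import numpy as np

def combine_by_averaging(all_trials):
    # Mean absolute Shapley value per feature within each trial, then average over trials
    per_trial = np.stack([np.abs(sv).mean(axis=0) for sv in all_trials])
    return per_trial.mean(axis=0)

def combine_with_scaling(all_trials):
    # Scale each trial so its feature importances sum to 1, then average over trials
    scaled = []
    for sv in all_trials:
        importance = np.abs(sv).mean(axis=0)
        scaled.append(importance / importance.sum())
    return np.stack(scaled).mean(axis=0)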

High AUC but bad predictions with imbalanced data

I am trying to build a classifier with LightGBM on a very imbalanced dataset. Imbalance is in the ratio 97:3, i.e.:
Class
0    0.970691
1    0.029309
The params I used and the code for training are shown below.
import numpy as np
import lightgbm as lgb

lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.1,
    'is_unbalance': 'true',  # because training data is unbalanced (replaced with scale_pos_weight)
    'num_leaves': 31,        # we should let it be smaller than 2^(max_depth)
    'max_depth': 6,          # -1 means no limit
    'subsample': 0.78
}

# Cross-validate
cv_results = lgb.cv(lgb_params, dtrain, num_boost_round=1500, nfold=10,
                    verbose_eval=10, early_stopping_rounds=40)

nround = cv_results['auc-mean'].index(np.max(cv_results['auc-mean']))
print(nround)

model = lgb.train(lgb_params, dtrain, num_boost_round=nround)
preds = model.predict(test_feats)
preds = [1 if x >= 0.5 else 0 for x in preds]
I ran CV to get the best model and best round. I got 0.994 AUC on CV and a similar score on the validation set.
But when I predict on the test set I get very bad results. I am sure that the train set is sampled correctly.
Which parameters need to be tuned? What is the reason for the problem? Should I resample the dataset so that the majority class is reduced?
The issue is that, despite the extreme class imbalance in your dataset, you are still using the "default" threshold of 0.5 when deciding the final hard classification in
preds = [1 if x >= 0.5 else 0 for x in preds]
This should not be the case here.
This is a rather big topic, and I strongly suggest you do your own research (try googling for "threshold" or "cut-off probability" with "imbalanced data"), but here are some pointers to get you started...
From a relevant answer at Cross Validated (emphasis added):
Don't forget that you should be thresholding intelligently to make predictions. It is not always best to predict 1 when the model probability is greater 0.5. Another threshold may be better. To this end you should look into the Receiver Operating Characteristic (ROC) curves of your classifier, not just its predictive success with a default probability threshold.
From a relevant academic paper, Finding the Best Classification Threshold in Imbalanced Classification:
2.2. How to set the classification threshold for the testing set
Prediction results are ultimately determined according to prediction probabilities. The threshold is typically set to 0.5. If the prediction probability exceeds 0.5, the sample is predicted to be positive; otherwise, negative. However, 0.5 is not ideal for some cases, particularly for imbalanced datasets.
The post Optimizing Probability Thresholds for Class Imbalances from the (highly recommended) Applied Predictive Modeling blog is also relevant.
Take home lesson from all the above: AUC is seldom enough, but the ROC curve itself is often your best friend...
On a more general level regarding the role of the threshold itself in the classification process (which, according to my experience at least, many practitioners get wrong), check also the Classification probability threshold thread (and the provided links) at Cross Validated; key point:
the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.
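As a minimal sketch of picking a non-default threshold from the ROC curve (here via Youden's J statistic on a held-out validation set; X_val and y_val are assumed to exist alongside the question's model and test_feats):

import numpy as np
from sklearn.metrics import roc_curve

# Predicted probabilities on a held-out validation set (not the test set)
val_probs = model.predict(X_val)

fpr, tpr, thresholds = roc_curve(y_val, val_probs)

# Youden's J statistic: maximise TPR - FPR over the candidate thresholds
best_threshold = thresholds[np.argmax(tpr - fpr)]
print("Chosen threshold:", best_threshold)

# Apply the chosen threshold instead of the default 0.5
test_probs = model.predict(test_feats)
preds = (test_probs >= best_threshold).astype(int)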

Outlier detection using Gaussian mixture

I have 5000 data points for each of my 17 features in a numpy array, resulting in a 5000 x 17 array. I am trying to find the outliers for each feature using a Gaussian mixture and I am rather confused about the following: 1) how many components should I use for my GaussianMixture? 2) Should I fit the GaussianMixture directly on the 5000 x 17 array, or to each feature column separately, resulting in 17 GaussianMixture models?
clf = mixture.GaussianMixture(n_components=1, covariance_type='full')
clf.fit(full_feature_array)
or
clf = mixture.GaussianMixture(n_components=17, covariance_type='full')
clf.fit(full_feature_array)
or
for feature in range(full_feature_array.shape[1]):
    clf[feature] = mixture.GaussianMixture(n_components=1, covariance_type='full')
    clf[feature].fit(full_feature_array[:, feature].reshape(-1, 1))
The task of selecting the number of components to model a distribution with a Gaussian mixture model is an instance of model selection. This is not so straightforward and there exist many approaches; a good summary can be found here: https://en.m.wikipedia.org/wiki/Model_selection . One of the simplest and most widely used is to perform cross-validation.
Normally, outliers can be determined as those belonging to the component or components with the largest variance. You would call this strategy an unsupervised approach; however, it can still be difficult to decide what the cutoff variance should be. A better approach (if applicable) is a supervised one, where you train the GMM on outlier-free data (by manually removing outliers) and then classify outliers as points with particularly low likelihood scores. A second supervised approach would be to train two GMMs (one for outliers and one for inliers, using model selection) and then perform two-class classification on new data. Regarding your question about training univariate versus multivariate GMMs: it is difficult to say, but for the purposes of outlier detection, univariate GMMs (or, equivalently, multivariate GMMs with diagonal covariance matrices) may be sufficient and require fewer trained parameters than general multivariate GMMs, so I would start with that.
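A minimal sketch of that low-likelihood strategy (X_clean is a hypothetical, manually cleaned training array and X_new is new data; the number of components and the percentile cutoff are placeholders to be chosen via model selection):

import numpy as np
from sklearn import mixture

# Fit on outlier-free data; n_components is a placeholder to pick via model selection
gmm = mixture.GaussianMixture(n_components=3, covariance_type='full', random_state=0)
gmm.fit(X_clean)

# Per-sample log-likelihood of new data under the fitted mixture
log_likelihood = gmm.score_samples(X_new)

# Flag points whose likelihood falls below a chosen cutoff, e.g. the 1st percentile
# of the training likelihoods; the cutoff itself is a modelling choice
cutoff = np.percentile(gmm.score_samples(X_clean), 1)
outliers = X_new[log_likelihood < cutoff]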
Using a Gaussian Mixture Model (GMM), any point sitting in a low-density area can be considered an outlier. Perhaps the challenge is how to define a low-density area; for example, you can say that anything below the 4th percentile of density is an outlier.
densities = gm.score_samples(X)
density_threshold = np.percentile(densities, 4)
anomalies = X[densities < density_threshold]
Regarding choosing the number of components: look into the information criteria provided by AIC or BIC for different numbers of components; they often agree in such cases. The lowest value is best.
gm.bic(X)
gm.aic(X)
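A minimal sketch of that sweep (the range of component counts is just an example):

from sklearn.mixture import GaussianMixture

# Fit mixtures with different numbers of components and compare the criteria
for k in range(1, 9):
    gm_k = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
    print(k, "BIC:", gm_k.bic(X), "AIC:", gm_k.aic(X))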
Alternatively, BayesianGaussianMixture gives (effectively) zero weight to clusters that are unnecessary.
from sklearn.mixture import BayesianGaussianMixture
bgm = BayesianGaussianMixture(n_components=8, n_init=10) # n_components should be large enough
bgm.fit(X)
np.round(bgm.weights_, 2)
output
array([0.5 , 0.3, 0.2 , 0. , 0. , 0. , 0. , 0. ])
So here the Bayesian GMM detected that there are three clusters.

Can you use counts in sklearn logistic regression input?

So, I know that in R you can provide data for a logistic regression in this form:
model <- glm( cbind(count_1, count_0) ~ [features] ..., family = 'binomial' )
Is there a way to do something like cbind(count_1, count_0) with sklearn.linear_model.LogisticRegression? Or do I actually have to provide all those duplicate rows? (My features are categorical, so there would be a lot of redundancy.)
If they are categorical, you should provide a binarized (one-hot encoded) version of them. I don't know how that code in R works, but you should always binarize your categorical features, because you have to make it explicit that each value of a feature is unrelated to the others; i.e. for a feature "blood_type" with possible values 1, 2, 3, 4, your classifier must learn that 2 is not related to 3 and 4 is not related to 1 in any sense. This is achieved by binarization.
If you have too many features after binarization, you can reduce the dimensionality of the binarized dataset with FeatureHasher or more sophisticated methods like PCA.
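A minimal sketch of that binarization step in sklearn (the DataFrame df and its column names are hypothetical):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical features and binary target
X_cat = df[["blood_type", "education"]]
y = df["outcome"]

# One-hot (binarized) encoding of the categorical columns
encoder = OneHotEncoder(handle_unknown="ignore")
X_bin = encoder.fit_transform(X_cat)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_bin, y)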
