Machine Learning: Move Threshold - python

I'm trying to solve a binary classification problem where 80% of the data belongs to class x and 20% of the data belongs to class y. All my models (AdaBoost, Neural Networks and SVC) just predict all data to be part of class x as this is the highest accuracy they can achieve.
My goal is to achieve a higher precision for all entries of class x and I don't care how many entries are falsely classified to be part of class y.
My idea would be to just put entries in class x when the model is super sure about them and put them in class y otherwise.
How would I achieve this? Is there a way to move the threshold so that only very obvious entries are classified as class x?
I'm using Python and sklearn.
Sample Code:
adaboost = AdaBoostClassifier(random_state=1)
adaboost.fit(X_train, y_train)
adaboost_prediction = adaboost.predict(X_test)
confusion_matrix(adaboost_prediction,y_test) outputs:
array([[    0,     0],
       [10845, 51591]])

Using AdaBoostClassifier you can output class probabilities and then threshold them by using predict_proba instead of predict:
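from sklearn.ensemble import AdaBoostClassifier
adaboost = AdaBoostClassifier(random_state=1)
adaboost.fit(X_train, y_train)
# predict_proba returns one column per class, ordered as in adaboost.classes_
adaboost_probs = adaboost.predict_proba(X_test)
threshold = 0.8 # for example
# True where the first class in adaboost.classes_ is predicted with enough confidence
thresholded_adaboost_prediction = adaboost_probs[:, 0] > threshold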
Using this approach you could also inspect (just debug print, or maybe sort and plot on a graph) how the confidence levels vary in your final model on the test data to help decide whether it is worth taking further.
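As a rough illustration of that inspection step, here is a minimal sketch on synthetic data. The 80/20 dataset from make_classification, the use of label 0 as "class x", the helper name predict_x_if_confident and the thresholds tried are all assumptions made for the example, not part of the question. AdaBoost probabilities often sit close to 0.5, so the thresholds are kept modest.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic 80/20 data standing in for the asker's X and y (assumption).
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=1
)

adaboost = AdaBoostClassifier(random_state=1).fit(X_train, y_train)

# Column order of predict_proba follows adaboost.classes_; label 0 plays "x" here.
x_col = list(adaboost.classes_).index(0)
p_x = adaboost.predict_proba(X_test)[:, x_col]


def predict_x_if_confident(p_x, threshold):
    """Hypothetical helper: label 0 ("x") only when P(x) > threshold, else 1 ("y")."""
    return np.where(p_x > threshold, 0, 1)


counts = []
for threshold in (0.5, 0.55, 0.6):
    pred = predict_x_if_confident(p_x, threshold)
    n_x = int((pred == 0).sum())
    counts.append(n_x)
    # Precision for class x: share of predicted-x rows that really are x.
    precision_x = float((y_test[pred == 0] == 0).mean()) if n_x else float("nan")
    print(f"threshold={threshold:.2f}  predicted x={n_x}  precision(x)={precision_x:.3f}")
```

Raising the threshold can only shrink the set of rows labelled x. Whether precision for x actually improves depends on how well the probabilities rank the data, which is why it is worth printing both numbers.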
There is more than one way to approach your problem though. For example see Miriam Farber's answer which looks at re-weighting the classifier to adjust for your 80/20 class imbalance during training. You might find you have other problems, including perhaps that the classifiers you are using cannot realistically separate the x and y classes given your current data. Going through all possibilities of a data problem like this might take a few different approaches.
If you have more questions about issues with your data problem as opposed to the code, there are Stack Exchange sites that could help you as well as Stack Overflow (do read the site guidelines before posting): Data Science and Cross Validated.

In SVM, one way to move the threshold is to choose class_weight in such a way that you put much more weight on data points from class y. Consider the below example, taken from SVM: Separating hyperplane for unbalanced classes:
The straight line is the decision boundary that you get when you use SVC with default class weights (same weight for every class). The dashed line is the decision boundary that you get when you use class_weight={1: 10} (that is, put much more weight on class 1, relative to class 0).
Class weights basically adjust the penalty parameter in SVM:
class_weight : {dict, ‘balanced’}, optional
Set the parameter C of class i to class_weight[i]*C for SVC. If not
given, all classes are supposed to have weight one. The “balanced”
mode uses the values of y to automatically adjust weights inversely
proportional to class frequencies in the input data as n_samples /
(n_classes * np.bincount(y))
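A small sketch of the effect described above, using synthetic blobs rather than the asker's data (the dataset, the linear kernel and C=1.0 are assumptions chosen for the illustration). It fits one SVC with default weights and one with class_weight={1: 10}, then counts how many points each model assigns to the rare class 1.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic imbalanced data (assumption): 1000 points of class 0, 100 of class 1.
X, y = make_blobs(
    n_samples=[1000, 100],
    centers=[[0.0, 0.0], [2.0, 2.0]],
    cluster_std=[1.5, 0.5],
    random_state=0,
)

plain = SVC(kernel="linear", C=1.0).fit(X, y)
# class_weight={1: 10} multiplies C by 10 for class 1 only; class 0 keeps weight 1.
weighted = SVC(kernel="linear", C=1.0, class_weight={1: 10}).fit(X, y)

n_plain = int((plain.predict(X) == 1).sum())
n_weighted = int((weighted.predict(X) == 1).sum())
print("points assigned to class 1, default weights:", n_plain)
print("points assigned to class 1, class_weight={1: 10}:", n_weighted)
```

Mistakes on class 1 become more expensive, so the boundary shifts toward class 0 and more points end up on the class 1 side. That is the same trade the question asks for: fewer, more certain predictions of the majority class.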

Related

How to use KMeans clustering to improve the accuracy of a logistic regression model?

I am a beginner in machine learning in python, and I am working on a binary classification problem. I have implemented a logistic regression model with an average accuracy of around 75%. I have tried numerous ways to improve the accuracy of the model, such as one-hot encoding of categorical variables, scaling of the continuous variables, and I did a grid search to find the best parameters. They all failed to improve the accuracy. So, I looked into unsupervised learning methods in order to improve it.
I tried using KMeans clustering, and I set n_clusters to 2. I trained the logistic regression model using the X_train and y_train values. After that, I tried testing the model on the training data using cross-validation but I set the cross-validation to be against the labels predicted by the KMeans:
kmeans = KMeans(n_clusters = 2)
kmeans.fit(X_train)
logreg = LogisticRegression().fit(X_train, y_train)
cross_val_score(logreg, X_train, kmeans.labels_, cv = 5)
When using the cross_val_score, the accuracy is averaging over 95%. However, when I use the .score() method:
logreg.score(X_train, kmeans.labels_)
the score is in the 60s. My questions are:
1) What is the significance (or meaning) of the score that is produced when testing the model against the labels predicted by k-means?
2) How can I use k-means clustering to improve the accuracy of the model? I tried adding a 'cluster' column that contains the clustering labels to the training data and fit the logistic regression, but it also didn't improve the score.
3) Why is there a huge discrepancy between the score when evaluated via cross_val_score and the .score() method?
I'm having a hard time understanding the context of your problem based on the snippet you provided. Strong work for providing minimal code, but in this case I feel it may have been a bit too minimal. Regardless, I'm going to read between the lines and state some relevant ideas. I'll then attempt to answer your questions more directly.
I am working on a binary classification problem. I have implemented a logistic regression model with an average accuracy of around 75%
This only tells a small part of the story. Knowing what data you're classifying and its general form is pretty vital, and accuracy doesn't tell us a lot about how inaccuracy is distributed through the problem.
Some natural questions:
Is one class 50% accurate and the other class 100% accurate? Are the classes both 75% accurate?
What is the class balance? (Is there more of one class than the other?)
How much overlap do these classes have?
I recommend profiling your training and testing set, and maybe running your data through TSNE to get an idea of class overlap in your vector space.
These plots will give you an idea of how much overlap your two classes have. In essence, TSNE maps a high dimensional X to a 2d X while attempting to preserve proximity. You can then plot your flagged Y values as color and the 2d X values as points on a grid to get an idea of how tightly packed your classes are in high dimensional space. In the image above, this is a very easy classification problem as each class exists in its own island. The more these islands mix together, the harder classification will be.
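For reference, a minimal sketch of that projection with scikit-learn's TSNE. The blob data is a hypothetical stand-in for X_train and y_train, and the perplexity value is just the library default written out.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Hypothetical stand-in for X_train / y_train: 300 points, 10 features, 2 classes.
X, y = make_blobs(n_samples=300, n_features=10, centers=2, random_state=0)

# Map the 10-dimensional points to 2-D while trying to preserve neighbourhoods.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(embedding.shape)  # (300, 2)

# To look at class overlap, scatter the two columns coloured by label, e.g.
#   import matplotlib.pyplot as plt
#   plt.scatter(embedding[:, 0], embedding[:, 1], c=y); plt.show()
```

The embedding is only a visual aid: distances between far-apart islands are not meaningful, so treat it as a check on overlap rather than as features for the classifier.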
did a grid search to find the best parameters
Hot take, but don't use grid search; random search is better (source: Artificial Intelligence by Jones and Barlett). Grid search repeats too much information, wasting time re-exploring similar parameters.
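If you want to try random search, scikit-learn ships RandomizedSearchCV. A minimal sketch follows; the synthetic data, the choice of C as the only tuned parameter and its log-uniform range are assumptions for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the real training set (assumption).
X, y = make_classification(n_samples=500, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    # Sample C on a log scale instead of stepping through a fixed grid.
    param_distributions={"C": loguniform(1e-3, 1e2)},
    n_iter=20,
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Each of the 20 draws is a fresh value of C, so no budget is spent re-evaluating the same handful of grid points.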
I tried using KMeans clustering, and I set n_clusters to 2. I trained the logistic regression model using the X_train and y_train values. After that, I tried testing the model on the training data using cross-validation but I set the cross-validation to be against the labels predicted by the KMeans:
So, to rephrase, you trained your model to predict an output given some input, then tested how it performed predicting the same data and got 75%. This is called training accuracy (as opposed to validation or test accuracy). A low training accuracy is indicative of one of two things:
There's a lot of overlap between your classes. If this is the case, I would look into feature engineering. Find a vector space which better segregates the two classes.
There's not a lot of overlap, but the front between the two classes is complex. You need a model with more parameters to segregate your two classes.
Model complexity isn't free though. See the curse of dimensionality and overfitting.
OK, answering more directly:
1) These accuracy scores mean your model isn't complex enough to learn the problem, or there's too much overlap between the two classes to see a better accuracy.
2) I wouldn't use k-means clustering to try to improve this. k-means attempts to find cluster information based on location in a vector space, but you already have flagged data y_train so you already know which clusters data should belong in. Try modifying X_train in some way to get better segregation, or try a more complex model. You can use things like k-means or TSNE to check your transformed X_train for better segregation, but I wouldn't use them directly. Obligatory reminder that you need to test and validate with holdout data. See another answer I provided for more info.
3) I'd need more code to figure that one out.
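One thing that can be said from the snippet alone: cross_val_score clones the estimator and refits it on whatever labels it is given, whereas .score() uses the model as it was already fitted. The sketch below reproduces the two calls on synthetic data (the dataset and the KMeans settings are assumptions), so the two numbers are measuring different models.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the asker's X_train / y_train (assumption).
X_train, y_train = make_classification(n_samples=1000, random_state=0)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# cross_val_score clones logreg and refits each clone on the labels it is given,
# so this measures how well logistic regression can learn the cluster labels.
cv_vs_clusters = cross_val_score(logreg, X_train, kmeans.labels_, cv=5)

# .score uses the model already fitted on y_train, so this measures how often
# the predicted classes happen to coincide with the cluster ids.
fixed_vs_clusters = logreg.score(X_train, kmeans.labels_)

print("refit on cluster labels:", cv_vs_clusters.mean())
print("fitted on y_train, scored on cluster labels:", fixed_vs_clusters)
```

With two clusters the k-means boundary is a straight line between the centroids, which a linear model can learn almost perfectly. That would explain a cross-validated score in the 90s that says nothing about the real labels.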
P.S. Welcome to Stack Overflow! Keep at it.

How to find feature importance for each class in multiclass classification

I have written code to find the importance of each feature in the entire dataset for multiclass classification. Now I want to find feature importance for each class in multiclass classification, i.e. I want to find the list of features (for each class) that are most important for classifying that individual class.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
# x3, y3 and df are the asker's own features, labels and source DataFrame
model = DecisionTreeClassifier()
model.fit(x3, y3)
importance = model.feature_importances_
for i, v in enumerate(importance):
    print('Feature[%0d]:%s, Score: %.6f' % (i, df.columns[i], v))
plt.subplots(figsize=(15,7))
plt.bar([x for x in range(len(importance))], importance)
plt.xlabel('Feature index')
plt.ylabel('Feature importance score')
plt.xticks(rotation=90)
plt.xticks(np.arange(0,len(df.columns)-2, 2.0))
plt.show()
EDIT (28-04-2022):
I read a paper titled Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization; quoting:
On the evaluate section, we fist extract the 80 traffic features from the dataset and clarify the best short feature set to detect each attack family using RandomForestRegressor algorithm. Afterwards, we examine the performance and accuracy of the selected features with seven common machine learning algorithms.
Can anyone explain how this is done?
The decision trees are split into nodes that maximise information gain. Each split is based on the Gini index or entropy values. So the only way I think what you want to do can be achieved is by printing out the tree and examining it yourself visually, provided there are not too many nodes.
You can't say with certainty that one of your features is very important in discriminating against a certain class because suppose you have two classes, A and B. The feature that discriminates class A against class B is also discriminating class B against class A. So the importance of that feature is for both classes. In general, you can only get the overall feature importance not specific to any of your classes but the features that help get the work done.
Trees are highly unstable, and a slight change in your dataset will build an entirely new different tree from the first.
EDIT(28-04-2022):
The paper says they used RandomForestRegressor, which is different from the decision tree you used. Using RandomForestRegressor means they had a regression task. The paper used the algorithm as a feature selection technique to reduce the 80 features. The few features selected (based on feature importance) were then used to train seven other different models. Using fewer features instead of the whole 80 will make the resulting models more elegant and less prone to overfitting.
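One possible reading of "the best short feature set to detect each attack family" is a one-vs-rest setup: for each class, fit a RandomForestRegressor on a 0/1 target for that class and rank the features by importance. This is an assumption about the procedure, not something the paper quote or this answer confirms, and the synthetic data below only stands in for the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor

# Synthetic 3-class data standing in for the real dataset (assumption).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3, random_state=0
)

per_class_importance = {}
for cls in np.unique(y):
    # One-vs-rest target: 1.0 for this class, 0.0 for every other class.
    target = (y == cls).astype(float)
    forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, target)
    per_class_importance[cls] = forest.feature_importances_

for cls, importance in per_class_importance.items():
    top = np.argsort(importance)[::-1][:3]
    print(f"class {cls}: top feature indices {top.tolist()}")
```

The caveat from earlier in this answer still applies: a feature that separates class A from the rest is equally what separates the rest from class A, so these rankings describe each one-vs-rest split rather than a property of the class alone.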
It is important to know that Random forest is an ensemble method and has a lot of random happenings in the background such as bagging and bootstrapping. Feature importance is a form of model interpretation. It is difficult to interpret Ensemble algorithms the way you have described. Such a way would be too detailed. So, definitely, what they wrote in the paper is different from what you think.
Decision trees are a lot more interpretable. If you want to understand causality in your decision tree model, you can convert the model into rules or, as suggested earlier, inspect the tree visually.
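For the rules route, scikit-learn's export_text prints a fitted tree as nested threshold rules. A minimal sketch, using the iris dataset and a depth limit of 3 purely as assumptions to keep the output short:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(
    iris.data, iris.target
)

# export_text renders each root-to-leaf path as nested threshold rules.
rules = export_text(model, feature_names=list(iris.feature_names))
print(rules)
```

Reading down one branch shows which features and thresholds lead to a given class, which is the closest a single tree gets to a per-class explanation.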

Scikit-Learn: Label not x is present in all training examples

I'm trying to do multilabel classification with SVM.
I have nearly 8k features and a y vector of length nearly 400. My Y vectors are already binarized, so I didn't use MultiLabelBinarizer(), but when I use it on the raw form of my Y data, it still gives the same result.
I'm running this code:
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
X = np.genfromtxt('data_X', delimiter=";")
Y = np.genfromtxt('data_y', delimiter=";")
training_X = X[:2600,:]
training_y = Y[:2600,:]
test_sample = X[2600:2601,:]
test_result = Y[2600:2601,:]
classif = OneVsRestClassifier(SVC(kernel='rbf'))
classif.fit(training_X, training_y)
print(classif.predict(test_sample))
print(test_result)
After the whole fitting process, when it comes to the prediction part, it says Label not x is present in all training examples (x is one of a few different numbers in the range of my y vector length, which is 400). After that it gives the predicted y vector, which is always a zero vector of length 400 (the y vector length).
I'm new at scikit-learn and also in machine learning. I couldn't figure out the problem here. What's the problem and what should I do to fix it?
Thanks.
There are 2 problems here:
1) The missing label warning
2) You are getting all 0's for predictions
The warning means that some of your classes are missing from the training data. This is a common problem. If you have 400 classes, then some of them must only occur very rarely, and on any split of the data, some classes may be missing from one side of the split. There may also be classes that simply don't occur in your data at all. You could try Y.sum(axis=0).all() and if that is False, then some classes do not occur even in Y. This all sounds horrible, but realistically, you aren't going to be able to correctly predict classes that occur 0, 1, or any very small number of times anyway, so predicting 0 for those is probably about the best you can do.
As for the all-0 predictions, I'll point out that with 400 classes, probably all of your classes occur much less than half the time. You could check Y.mean(axis=0).max() to get the highest label frequency. With 400 classes, it might only be a few percent. If so, a binary classifier that has to make a 0-1 prediction for each class will probably pick 0 for all classes on all instances. This isn't really an error, it is just because all of the class frequencies are low.
If you know that each instance has a positive label (at least one), you could get the decision values (clf.decision_function) and pick the class with the highest one for each instance. You'll have to write some code to do that, though.
I once had a top-10 finish in a Kaggle contest that was similar to this. It was a multilabel problem with ~200 classes, none of which occurred with even a 10% frequency, and we needed 0-1 predictions. In that case I got the decision values and took the highest one, plus anything that was above a threshold. I chose the threshold that worked the best on a holdout set. The code for that entry is on Github: Kaggle Greek Media code. You might take a look at it.
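A minimal sketch of the checks and the decision-value approach described in this answer. The synthetic multilabel data, the 250/50 split and the threshold of 0.0 are assumptions; the threshold in particular is a placeholder to tune on a holdout set.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic multilabel data (assumption); Y is a 0/1 indicator matrix.
X, Y = make_multilabel_classification(
    n_samples=300, n_features=20, n_classes=5, random_state=0
)
X_train, Y_train, X_test = X[:250], Y[:250], X[250:]

# The two sanity checks suggested above.
print("every label occurs at least once:", bool(Y_train.sum(axis=0).all()))
print("highest label frequency:", Y_train.mean(axis=0).max())

classif = OneVsRestClassifier(SVC(kernel="rbf")).fit(X_train, Y_train)
scores = classif.decision_function(X_test)  # shape (n_samples, n_labels)

threshold = 0.0  # placeholder; choose the value that works best on a holdout set
pred = (scores > threshold).astype(int)
# Force the best-scoring label on for every instance, so no row is all zeros.
pred[np.arange(len(pred)), scores.argmax(axis=1)] = 1
print(pred[:5])
```

This only makes sense if every instance is known to carry at least one label. Otherwise, drop the argmax step and keep the threshold alone.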
If you made it this far, thanks for reading. Hope that helps.

What is the difference between sample weight and class weight options in scikit learn?

I have a class imbalance problem and want to solve it using cost-sensitive learning. The options I know of are:
1) undersample and oversample
2) give weights to classes to use a modified loss function
Question
Scikit-learn has two options called class weights and sample weights. Is sample weight actually doing option 2) and class weight option 1)? Is option 2) the recommended way of handling class imbalance?
They are similar concepts, but with sample_weight you can force the estimator to pay more attention to some samples, and with class_weight you can force it to pay more attention to a particular class. sample_weight=0 or class_weight=0 basically means that the estimator doesn't need to take those samples/classes into consideration in the learning process at all. Thus a classifier (for example) will never predict a class if class_weight = 0 for that class. If a sample_weight/class_weight is bigger than the sample_weight/class_weight of other samples/classes, the estimator will try to minimize the error on those samples/classes first. You can use user-defined sample_weight and class_weight simultaneously.
If you want to undersample/oversample your training set by simply cloning/removing samples, this is equivalent to increasing/decreasing the corresponding sample_weight/class_weight.
In more complex cases you can also try artificially generating samples, with techniques like SMOTE.
sample_weight and class_weight have a similar function, that is to make your estimator pay more attention to some samples.
Actual sample weights will be sample_weight * weights from class_weight.
This serves the same purpose as under/oversampling but the behavior is likely to be different: say you have an algorithm that randomly picks samples (like in random forests), it matters whether you oversampled or not.
To sum it up:
class_weight and sample_weight both do 2), and option 2) is one way to handle class imbalance. I don't know of a universally recommended way; I would try 1), 2) and 1) + 2) on your specific problem to see what works best.
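To make the relationship concrete, here is a small sketch that expands a class_weight dict into per-sample weights with compute_sample_weight and fits the same model both ways. The synthetic data, the weights {0: 1.0, 1: 5.0} and the choice of LogisticRegression are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic imbalanced data (assumption).
X, y = make_classification(n_samples=600, weights=[0.8, 0.2], random_state=0)

class_weight = {0: 1.0, 1: 5.0}
# Expand the per-class weights into one weight per sample.
sample_weight = compute_sample_weight(class_weight, y)

by_class = LogisticRegression(max_iter=1000, class_weight=class_weight).fit(X, y)
by_sample = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=sample_weight)

# Compare the fitted coefficients of the two models.
print(np.allclose(by_class.coef_, by_sample.coef_, atol=1e-6))
```

For a deterministic estimator like this one the two routes describe the same weighted loss. The difference shows up with estimators that resample internally, as noted above for random forests.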

Unbalanced classification using RandomForestClassifier in sklearn

I have a dataset where the classes are unbalanced. The classes are either '1' or '0' where the ratio of class '1':'0' is 5:1. How do you calculate the prediction error for each class and the rebalance weights accordingly in sklearn with Random Forest, kind of like in the following link: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#balance
You can pass a sample_weight argument to the Random Forest fit method:
sample_weight : array-like, shape = [n_samples] or None
Sample weights. If None, then samples are equally weighted. Splits
that would create child nodes with net zero or negative weight are
ignored while searching for a split in each node. In the case of
classification, splits are also ignored if they would result in any
single class carrying a negative weight in either child node.
In older versions there was a preprocessing.balance_weights method to generate balance weights for given samples, such that classes become uniformly distributed. It is still there, in the internal but still usable preprocessing._weights module, but it is deprecated and will be removed in future versions. I don't know the exact reasons for this.
Update
Some clarification, as you seem to be confused. sample_weight usage is straightforward once you remember that its purpose is to balance the target classes in the training dataset. That is, if you have X as observations and y as classes (labels), then len(X) == len(y) == len(sample_weight), and each element of the 1-d sample_weight array represents the weight for the corresponding (observation, label) pair. For your case, if class 1 is represented 5 times as often as class 0, and you want to balance the class distributions, you could simply use
sample_weight = np.array([5 if i == 0 else 1 for i in y])
assigning weight of 5 to all 0 instances and weight of 1 to all 1 instances. See link above for a bit more crafty balance_weights weights evaluation function.
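Put together, a minimal sketch of that weighting with RandomForestClassifier. The random features and the 500/100 class counts are hypothetical data chosen to match the 5:1 ratio in the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Hypothetical data with a 5:1 ratio of class 1 to class 0.
y = np.array([1] * 500 + [0] * 100)
X = rng.normal(size=(600, 4)) + y[:, None]  # shift class 1 so there is signal

# One weight per (observation, label) pair: 5 for the rare class 0, 1 for class 1.
sample_weight = np.array([5 if label == 0 else 1 for label in y])

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y, sample_weight=sample_weight)

# Total weight per class is now equal: 100 * 5 == 500 * 1.
print(sample_weight[y == 0].sum(), sample_weight[y == 1].sum())  # 500 500
```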
It is really a shame that sklearn's "fit" method does not allow specifying a performance measure to be optimized. No one around seems to understand, question or be interested in what's actually going on when one calls the fit method on a data sample when solving a classification task.
We (users of the scikit-learn package) are silently left with the suggestion to indirectly use cross-validated grid search with a specific scoring method suitable for unbalanced datasets, in the hope of stumbling upon a parameter/metaparameter set which produces an appropriate AUC or F1 score.
But think about it: it looks like the "fit" method called under the hood each time always optimizes accuracy. So in effect, if we aim to maximize the F1 score, GridSearchCV gives us the "model with the best F1 from all models with the best accuracy". Is that not silly? Would it not be better to directly optimize the model's parameters for maximal F1 score?
Remember the good old Matlab ANNs package, where you can set the desired performance metric to RMSE, MAE, or whatever you want, given that a gradient calculating algorithm is defined. Why is the choice of performance metric silently omitted from sklearn?
At least, why is there no simple option to assign class instance weights automatically to remedy unbalanced dataset issues? Why do we have to calculate weights manually? Besides, in many machine learning books/articles I saw authors praising sklearn's manual as awesome, if not the best source of information on the topic. No, really? Why is the unbalanced datasets problem (which is obviously of utter importance to data scientists) not covered anywhere in the docs then?
I address these questions to the contributors of sklearn, should they read this. Anyone who knows the reasons for this is welcome to comment and clear things up.
UPDATE
Since scikit-learn 0.17, there is class_weight='balanced' option which you can pass at least to some classifiers:
The “balanced” mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data
as n_samples / (n_classes * np.bincount(y)).
Use the parameter class_weight='balanced'
From sklearn documentation: The balanced mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
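The formula can be checked by hand for the 5:1 case in the question. The sketch below compares scikit-learn's compute_class_weight with the quoted expression; the 500/100 label counts are an assumption chosen to match the ratio.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 5:1 ratio of class 1 to class 0, as in the question.
y = np.array([1] * 500 + [0] * 100)

balanced = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
# n_samples / (n_classes * np.bincount(y)) with n_classes = 2
by_formula = len(y) / (2 * np.bincount(y))

print(balanced)    # [3.  0.6]
print(by_formula)  # [3.  0.6]
```

The rare class 0 gets weight 3.0 and the common class 1 gets 0.6, a 5:1 ratio in favour of the minority class.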
If the majority class is 1, and the minority class is 0, and they are in the ratio 5:1, the sample_weight array should be:
sample_weight = np.array([5 if i == 0 else 1 for i in y])
Note that the ratios are inverted: the larger weight goes to the minority class, which is also what the balanced mode computes. The same applies to class_weight.
