How to select the metric to optimize in sklearn's fit function? - python

When using TensorFlow to train a neural network, I can set the loss function arbitrarily. Is there a way to do the same in sklearn when training an SVM? Let's say I want my classifier to optimize only sensitivity (regardless of whether that is sensible); how would I do that?

As far as I know, this is not possible with Support Vector Machines. With other models you can either change the loss that is optimized or change the classification threshold applied to the predicted probability.
SVMs, however, minimize the hinge loss and model a separating hyperplane rather than class probabilities, so there is not much room for manual adjustments.
If you need to focus on sensitivity or specificity, use a different model that lets you optimize that function directly or that predicts class probabilities (for example, logistic regression or tree-based methods).
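To make the threshold idea concrete, here is a minimal sketch. It assumes a logistic regression trained on synthetic imbalanced data from make_classification; the helper predict_with_threshold and the threshold value 0.2 are illustrative choices of mine, not something from the answer above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data (assumption: stands in for the real dataset).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # predicted probability of class 1


def predict_with_threshold(proba, threshold):
    """Label a sample as positive when its probability reaches the threshold."""
    return (proba >= threshold).astype(int)


# Sensitivity (recall of the positive class) at the default and a lower threshold.
recall_default = recall_score(y_test, predict_with_threshold(proba, 0.5))
recall_low = recall_score(y_test, predict_with_threshold(proba, 0.2))
print(f"recall at 0.5: {recall_default:.2f}, recall at 0.2: {recall_low:.2f}")
```

Lowering the threshold can only keep or increase sensitivity, usually at the cost of more false positives, so in practice the threshold would be picked on a validation set.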

Related

Adding a "closeness" estimate to the output of a neural network model

I have a neural network which maps a set of 4 floating-point input parameters to a set of 10 floating point outputs trained on a dataset of ~300 points. The points themselves are intrinsically multi-modal, and there are some sparse areas in the training set that I don't currently have any good way to gather data for (although in real-world deployment they will eventually be encountered).
The model trained as expected (the test-split loss decreased steadily during training, and the errors are all within acceptable levels), so I believe this model is mapping the variables well against each other. However, I'm concerned about how well the model generalizes in areas where it has no training data.
So I'm looking to add an additional output to the model that provides a "closeness" estimate to the training points. My current implementation uses scipy to calculate a Gaussian KDE from the 4 parameters and then checks a point's closeness to the training space based on that. Then, in deployment, I return a warning/error if the inputs are too far from the space the model was trained on. This works okay, but I have to pass the entire test "X" set around with the model, which is a little inconvenient and kludgy.
Is there a way to embed this closeness estimate in the model itself? Or is there a more formalized way to handle this (e.g., to give a "confidence" estimate in the model output)?
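For reference, the KDE-based check described in the question might look roughly like the sketch below. The random training data, the 1% density quantile used as a cutoff, and the helper name is_near_training_data are all assumptions for illustration; the asker's actual implementation is not shown in the post.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Stand-in for the ~300 training points with 4 input parameters (assumption).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 4))

# gaussian_kde expects data with shape (n_dimensions, n_points).
kde = gaussian_kde(X_train.T)

# Hypothetical cutoff: the density below which only 1% of training points fall.
threshold = np.quantile(kde(X_train.T), 0.01)


def is_near_training_data(x):
    """Return a boolean per input row: True if its density reaches the cutoff."""
    points = np.atleast_2d(x).T  # shape (4, n_points)
    return kde(points) >= threshold


print(is_near_training_data(np.zeros(4)))       # a point in the dense region
print(is_near_training_data(np.full(4, 50.0)))  # a point far from all data
```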
I think you want to take another look at the loss value of your model. At its core, a loss function measures how well your model predicts the expected outcome (or value). We convert the learning problem into an optimization problem, define a loss function, and then optimize the model to minimize that loss.
The loss value gives you this closeness estimate between the data and the model output.
The loss value can be accessed through the History object returned when fitting the TensorFlow model:
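>>> import numpy as np
>>> import tensorflow as tf
>>> model = tf.keras.Sequential([tf.keras.layers.Dense(1)])  # stand-in model (assumption)
>>> model.compile(optimizer='sgd', loss='mse')
>>> history = model.fit(np.arange(100).reshape(5, 20), np.zeros(5),
...                     epochs=10, verbose=0)
>>> print(history.history.keys())
dict_keys(['loss'])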
If you want a closeness estimate between two different models, you need the KL divergence (Kullback–Leibler divergence), also called relative entropy.
From Wikipedia:
KL divergence is a type of statistical distance: a measure of how one probability distribution P is different from a second, reference probability distribution Q. A simple interpretation of the KL divergence of P from Q is the expected excess surprise from using Q as a model when the actual distribution is P.
In the simple case, a relative entropy of 0 indicates that the two distributions in question have identical quantities of information.
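As a small worked example of the definition quoted above, for discrete distributions the KL divergence is the sum over outcomes of P(x) * log(P(x) / Q(x)). The sketch below computes it by hand and compares it with scipy.stats.entropy, which returns the same quantity when given two distributions. The two distributions p and q are made up for illustration.

```python
import numpy as np
from scipy.stats import entropy


def kl_divergence(p, q):
    """KL divergence of P from Q for discrete distributions.

    Assumes q > 0 wherever p > 0; terms with p == 0 contribute nothing.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


p = np.array([0.5, 0.5])  # "actual" distribution (illustrative)
q = np.array([0.9, 0.1])  # model distribution (illustrative)

kl_pq = kl_divergence(p, q)  # positive, because Q differs from P
kl_pp = kl_divergence(p, p)  # zero, because the distributions are identical
kl_qp = kl_divergence(q, p)  # generally differs from kl_pq
kl_scipy = entropy(p, q)     # scipy computes the same relative entropy

print(kl_pq, kl_pp, kl_qp, kl_scipy)
```

Note that the divergence is not symmetric, so it matters which distribution is treated as the reference.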
KL divergence is used, for example, to distill the knowledge of a teacher model into a student model and to check whether both carry identical quantities of information.
Trivia: knowledge distillation is used to compress deep learning models with only a small compromise in quality.
KLDivergence can be imported directly from TensorFlow as follows:
from tensorflow.keras.losses import KLDivergence
You can find a full-fledged knowledge distillation implementation using Keras at https://keras.io/examples/vision/knowledge_distillation/

Should the same cross-validation method be used across multiple models?

The assignment is to write a simple ML program that trains and predicts on a dataset of our choice. I want to determine the best model for my data. The response is a class (0/1). I wrote code to try different cross-validation methods (validation set, leave-one-out, and k-fold) on multiple models (linear regression, logistic regression, k-nearest neighbors, linear discriminant analysis). Per model, I report the MSE for each cross-validation method and track the lowest one. I then pick the model with the lowest tracked MSE. This is where I think I went wrong. If I am cross-validating multiple models, should I use the same cross-validation method?
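One way to make the comparison in the question like-for-like is to score every model on the same folds with the same metric. The sketch below is only an illustration under my own assumptions: synthetic data from make_classification, a shared StratifiedKFold splitter, accuracy as the metric, and three of the classifiers mentioned in the question.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 0/1 classification data (assumption: stands in for the real dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# One splitter with a fixed seed, so every model is scored on identical folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "lda": LinearDiscriminantAnalysis(),
}

mean_accuracy = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in models.items()
}
best_model = max(mean_accuracy, key=mean_accuracy.get)
print(mean_accuracy, best_model)
```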

XGBoost for multiclassification and imbalanced data

I am dealing with a classification problem with 3 classes [0,1,2], and imbalanced class distribution as shown below.
I want to apply XGBClassifier (in Python) to this classification problem, but the model does not respond to class_weight adjustments and skews towards the majority class 0, and ignores the minority classes 1,2. Which hyperparameters other than class_weight can help me?
I tried 1) computing class weights using sklearn compute_class_weight; 2) setting weights according to the relative frequency of the classes; 3) and also manually adjusting classes with extreme values to see if any change happens at all, such as {0:0.5,1:100,2:200}. But in any case, it does not help the classifier to take the minority classes into account.
Observations:
I can handle the problem in the binary case: if I turn the problem into a binary classification by merging classes [1,2], then I can get the classifier to work properly by adjusting scale_pos_weight (even in this case class_weight alone does not help).
But scale_pos_weight, as far as I know, only works for binary classification. Is there an analogue of this parameter for multi-class problems?
Using RandomForestClassifier instead of XGBClassifier, I can handle the problem by setting class_weight='balanced_subsample' and tuning max_leaf_nodes. But, for some reason, this approach does not work for XGBClassifier.
Remark: I know about balancing techniques such as over/undersampling or SMOTE. But I want to avoid them as much as possible, and would prefer a solution based on hyperparameter tuning of the model if possible.
My observation above shows that this can work for the binary case.
The sample_weight parameter is useful for handling imbalanced data when training with XGBoost. You can compute the sample weights with compute_sample_weight() from the sklearn library.
This code should work for multiclass data:
from sklearn.utils.class_weight import compute_sample_weight
sample_weights = compute_sample_weight(
    class_weight='balanced',
    y=train_df['class']  # provide your own target name
)
xgb_classifier.fit(X, y, sample_weight=sample_weights)
You can use sample_weight as @Prakash Dahal suggested, but compute your own weights. I found that different weights made a dramatic difference (I have 12 classes and very imbalanced data).
If you compute your own weights, you need to assign the relevant weight to each training sample and pass the parameter to the classifier in the same way:
xgb_class.fit(X_train, y_train, sample_weight=weights)
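As a rough sketch of what "assign the relevant weight to each entry" can look like: map a per-class weight onto every training label so that the result has one weight per sample. The labels and the weight values below are made up for illustration (the weights reuse the extreme example from the question), and the fit call is left as a comment so the snippet runs without XGBoost installed.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Illustrative labels for a 3-class problem dominated by class 0 (assumption).
y_train = np.array([0, 0, 0, 0, 1, 2, 0, 1])

# Hand-picked per-class weights (hypothetical values, tune for your data).
class_weights = {0: 0.5, 1: 100.0, 2: 200.0}

# One weight per training sample, looked up from the sample's class.
weights = np.array([class_weights[label] for label in y_train.tolist()])

# sklearn builds the same array when given the dict directly.
weights_sklearn = compute_sample_weight(class_weight=class_weights, y=y_train)

print(weights)
# xgb_class.fit(X_train, y_train, sample_weight=weights)
```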

How to perform supervised training of a deep neural network when the target label has only 0 and 1?

I am trying to train a Deep Neural Network (DNN) with labeled data. The labels are encoded in such a way that they only contain the values 0 and 1. The shape of the encoded label is 5 x 5 x 232. About 95% of the values in the label are 0 and the rest are 1. Currently, I am using the binary_crossentropy loss function to train the network.
What is the best technique to train the DNN in such a scenario? Is binary_crossentropy an appropriate choice of loss function in this case? Any suggestions to improve the performance of the model?
You can try MSE loss. If you want to stick to binary cross-entropy (used in binary classification), consider using label smoothing.
You may use two alternative loss functions instead of binary cross-entropy:
Hinge Loss
An alternative to cross-entropy for binary classification problems is the hinge loss function, primarily developed for use with Support Vector Machine (SVM) models.
It is intended for use with binary classification where the target values are in the set {-1, 1}.
Squared Hinge Loss
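To illustrate the two losses named above, here is a minimal NumPy sketch. The hinge loss averages max(0, 1 - y * score) over the samples, and the squared hinge loss squares each of those terms before averaging. Since the labels in the question are 0/1, they are first mapped to {-1, 1}. The example labels and scores are made up.

```python
import numpy as np


def hinge_loss(y, scores):
    """Mean hinge loss; y must be in {-1, 1}, scores are raw model outputs."""
    return float(np.mean(np.maximum(0.0, 1.0 - y * scores)))


def squared_hinge_loss(y, scores):
    """Mean squared hinge loss; penalizes large margin violations more."""
    return float(np.mean(np.maximum(0.0, 1.0 - y * scores) ** 2))


labels_01 = np.array([1, 0, 1])       # 0/1 labels as in the question
y = 2 * labels_01 - 1                 # map {0, 1} -> {-1, 1}
scores = np.array([0.8, -0.5, -0.3])  # illustrative raw scores

print(hinge_loss(y, scores), squared_hinge_loss(y, scores))
```

The third sample is on the wrong side of the boundary, so it dominates both losses, and more so for the squared version.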
For more detail on loss functions with examples, click here.
Hope this helps, happy learning.
binary_crossentropy as the loss is fine.
Don't use accuracy as your metric, because the model can simply predict everything as label 0 and still get 95% accuracy. Use the F1 score (or precision or recall) instead.
Use a weighted loss, i.e. penalize errors on class 1 more heavily than errors on class 0.
Instead of class weights, you can also oversample the minority class (with techniques like SMOTE).
How to calculate class weight
You can use sklearn.utils.class_weight to calculate weight from your labels. Check this answer
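The points above can be sketched in a few lines. The example builds a toy label vector with 95% zeros, shows why accuracy is misleading for an all-zero prediction, computes balanced class weights with compute_class_weight, and plugs them into a hand-written weighted binary cross-entropy. The weighted_bce helper and the toy data are my own illustration, not a Keras API; in Keras you would typically pass comparable weights via class_weight or sample_weight in fit.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with the 95% / 5% split from the question (assumption).
y_true = np.array([0] * 95 + [1] * 5)

# A model that always predicts 0 looks good on accuracy but not on F1.
y_all_zero = np.zeros_like(y_true)
acc = accuracy_score(y_true, y_all_zero)
f1 = f1_score(y_true, y_all_zero, zero_division=0)

# "balanced" weights are n_samples / (n_classes * count_of_class).
w_neg, w_pos = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_true
)


def weighted_bce(y, p, w_neg, w_pos, eps=1e-7):
    """Binary cross-entropy with separate weights for class 0 and class 1."""
    p = np.clip(p, eps, 1 - eps)
    per_element = -(w_pos * y * np.log(p) + w_neg * (1 - y) * np.log(1 - p))
    return float(per_element.mean())


# Loss of a constant prediction p = 0.05 for every element.
loss = weighted_bce(y_true, np.full(y_true.shape, 0.05), w_neg, w_pos)
print(acc, f1, w_neg, w_pos, loss)
```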
In scenarios like this, where you have highly imbalanced data, I would suggest going with a Random Forest combined with up-sampling. This approach up-samples the minority class and hence improves the model accuracy.

How can the output of a model be displayed?

I am performing a machine learning task wherein I am using logistic regression for topic classification.
If this is my code:
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model = model.fit(mat_tmp, label_tmp)
y_train_pred = model.predict(mat_tmp_test)
print(metrics.accuracy_score(label_tmp_test, y_train_pred))
Is there a way I can output what exactly is happening inside the model, such as a working example of what my model is doing? For instance, could I display 2-3 documents and show how they are being classified?
To be fully aware of what is happening in your model, you must first take some time to study the logistic regression algorithm (e.g. from lecture notes or Wikipedia). As with other supervised techniques, logistic regression has hyper-parameters and parameters. Hyper-parameters specify how your algorithm runs, and you must provide them at initialisation (i.e. before it sees any data). For example, you could have prior information about the distribution of classes, which would then be a hyper-parameter. Parameters are "learnt" from your data.
Once you understand the algorithm, the interesting question is what the parameters of your model are (recall that these are learnt from the data). In the attributes section of the documentation you will find the fitted parameters of this classifier (for example coef_ and intercept_), which you can access by their attribute names.
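As a hedged illustration of inspecting a fitted model, the sketch below trains a logistic regression on a tiny made-up corpus, lists the words with the largest coefficients, and prints the class probabilities for two new documents. The corpus, the labels, and the use of CountVectorizer are assumptions; the question does not show how mat_tmp was built.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny illustrative corpus with two topics (assumption).
train_docs = [
    "the team won the football match",
    "a great goal in the final match",
    "the player scored a late goal",
    "the election results were announced",
    "parliament passed the new law",
    "the minister gave a speech on the law",
]
train_labels = ["sports"] * 3 + ["politics"] * 3

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
model = LogisticRegression().fit(X_train, train_labels)

# Learned parameters: one coefficient per word, plus an intercept.
index_to_word = {index: word for word, index in vectorizer.vocabulary_.items()}
top_indices = np.argsort(model.coef_[0])[-3:][::-1]
# Positive coefficients push towards model.classes_[1].
top_words = [index_to_word[int(i)] for i in top_indices]
print("classes:", model.classes_, "top words:", top_words)

# How individual documents are classified, with class probabilities.
test_docs = ["the goal decided the match", "a new law on the election"]
probabilities = model.predict_proba(vectorizer.transform(test_docs))
predictions = model.predict(vectorizer.transform(test_docs))
for doc, pred, probs in zip(test_docs, predictions, probabilities):
    print(doc, "->", pred, dict(zip(model.classes_, probs.round(2))))
```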
If you are not interested in such details and only want to assess the accuracy of your classifier, a useful technique is cross-validation. You split your labeled data into k equal-sized subsets and train your classifier on k-1 of them. Then you evaluate the trained classifier on the remaining subset and calculate the accuracy (i.e. the proportion of the data that was predicted correctly). This is repeated so that each subset serves once as the evaluation set, and the k accuracies are averaged. The method has its drawbacks, but proves to be very useful in general.
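The k-fold procedure described above can be written out as an explicit loop. This is a minimal sketch on synthetic data from make_classification with k = 5; both choices are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Synthetic labeled data (assumption: stands in for mat_tmp / label_tmp).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_accuracies = []
for train_idx, test_idx in kfold.split(X):
    # Train on k-1 subsets, evaluate on the one subset that was held out.
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[test_idx])
    fold_accuracies.append(accuracy_score(y[test_idx], predictions))

mean_accuracy = float(np.mean(fold_accuracies))
print(fold_accuracies, mean_accuracy)
```

sklearn's cross_val_score wraps the same loop in a single call if you do not need access to the individual folds.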
