I have an imbalanced binary classification problem. The ratio of positive to negative examples is about 1:10. I trained an XGBoost tree model to predict these two classes, using continuous and categorical data as input.
Using the XGBoost library, I predict the probability of new inputs with predict_proba. I am assuming the probability values output here are the likelihood of these new test data being the positive class? Given an entire test set with test labels, how can I evaluate the quality of these (I'm assuming) likelihoods?
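For reference, a minimal sketch of what that evaluation could look like, assuming an already-fitted XGBClassifier named clf and a labelled test set X_test, y_test (all names here are placeholders). Log loss and the Brier score are proper scoring rules for probability estimates, while ROC AUC measures ranking quality:

from sklearn.metrics import log_loss, brier_score_loss, roc_auc_score

proba = clf.predict_proba(X_test)        # one column per class, ordered as clf.classes_
pos_col = list(clf.classes_).index(1)    # column holding P(y = positive class)
p_pos = proba[:, pos_col]

print("log loss:   ", log_loss(y_test, p_pos))
print("Brier score:", brier_score_loss(y_test, p_pos))
print("ROC AUC:    ", roc_auc_score(y_test, p_pos))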
I'm trying to make a model which can predict test scores. I'm currently using a simple linear regression model, but I'm getting an accuracy score close to 0 because it guesses a single exact number as the score. I was wondering if there is a way to have the model predict a range of about 10 numbers, so that if the true number falls in that range it is marked as a correct guess.
The dataset I am using
Github page with notebook
It seems like you are using LogisticRegression. Despite its name, LogisticRegression is not for regression; it is for classification (for example, deciding whether an input belongs to class A or class B).
Use sklearn.linear_model.LinearRegression for linear regression; read this for more details.
There are also many other regression algorithms, too many to list in a single answer. If you want to go beyond simple linear regression, read this for all the supervised learning algorithms scikit-learn provides; Ridge regression and SVR might be good places to start.
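To illustrate the suggestion, here is a minimal sketch, assuming hypothetical arrays X (features) and y (test scores); the ±5 window is just one way to encode the "range of about 10 numbers" idea from the question:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

# count a prediction as correct when the true score lies within +/-5 of it
within_window = np.abs(y_pred - y_test) <= 5
print("share of predictions within a 10-point window:", within_window.mean())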
I am trying to get confidence intervals from an XGBoost model that was saved as a .tar.gz file, created with the Python XGBoost library.
The problem is that the model has already been fitted and I don't have the training data any more; I only have inference (serving) data to predict on. All the examples I found involve using training and test data to build either quantile regression models or bagged models, but I don't think I have the option to do that.
Why your desired approach will not work
I assume we are talking about regression here. Given a regression model that you cannot modify, I think you will not be able to achieve your desired result using only the given model. The model was trained to calculate a continuous value that approximates some objective value (i.e., its true value) based on some given input. Nothing more.
Possible solution
The only workaround I can think of would be to train two more models. These models' training goal would be to predict the quality of the output of your given model. One would calculate the upper bound of a given (i.e., predefined by you at training time) confidence interval and the other one the lower bound. This would probably involve a lot of feature engineering; you would want to find features that correlate with the prediction quality of the original model.
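As a rough sketch of that workaround, assuming you can still gather some labelled data (called X_extra, y_extra below) and that base_model is the frozen XGBoost model, one could fit two quantile regressors on top of it; all names here are placeholders:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# auxiliary features: the original inputs plus the frozen model's own prediction
base_pred = base_model.predict(X_extra)
X_aux = np.column_stack([X_extra, base_pred])

# one model per bound of a 90% interval (alpha = quantile to predict)
lower = GradientBoostingRegressor(loss="quantile", alpha=0.05).fit(X_aux, y_extra)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.95).fit(X_aux, y_extra)

# at serving time, wrap the point prediction with the learned bounds
new_pred = base_model.predict(X_new)
X_new_aux = np.column_stack([X_new, new_pred])
interval = (lower.predict(X_new_aux), upper.predict(X_new_aux))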
I have followed an example of applying scikit-learn's machine learning to facial recognition.
https://scikit-learn.org/stable/auto_examples/applications/plot_face_recognition.html#sphx-glr-auto-examples-applications-plot-face-recognition-py
I have been able to adapt the example to my own data successfully. However, I am lost on one point:
after preparing the data and training the model, you ultimately end up with the line:
Y_pred = clf.predict(X_test_pca)
This produces a vector of predictions, one per face.
What I can't figure out is how to get any confidence measurement to correspond with that.
The classification method is a forced choice, so that each face passed in MUST be classified as one of the known faces, even if it isn't even close.
How can I get a number per face that will reflect how well the result matches the known face?
It seems like you are looking for the .predict_proba() method of the scikit-learn estimators. It returns the probabilities of possible outcomes instead of a single prediction.
The example you are referring to uses an SVC. It is a little special in regard to this function, as its documentation states:
The model need to have probability information computed at training time: fit with attribute probability set to True.
So, if you are using the same model as in the example, instantiate it with:
SVC(kernel='rbf', class_weight='balanced', probability=True)
and use .predict_proba() instead of .predict():
y_pred = clf.predict_proba(X_test_pca)
This returns an array of shape (n_samples, n_classes), i.e. the probabilities of each class for each sample. The probabilities of class k for all samples can then be accessed with y_pred[:, k], for example.
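Putting it together with the estimator from the face recognition example (the variable names X_train_pca, y_train, X_test_pca follow that tutorial and are assumed to exist already):

from sklearn.svm import SVC

clf = SVC(kernel='rbf', class_weight='balanced', probability=True)
clf.fit(X_train_pca, y_train)

proba = clf.predict_proba(X_test_pca)   # shape (n_samples, n_classes)

# columns follow the order of clf.classes_, so the probabilities of the
# first class for all test faces are proba[:, 0];
# a simple per-face confidence is the probability of the predicted class:
confidence = proba.max(axis=1)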
For a classification task, I am using a voting classifier to ensemble logistic regression and SVM, with the voting parameter set to soft. The result is clearly better than each individual model, but I am not sure I understand how it works. How can the model find a majority vote between only two models?
Assume you have two classes, class-A and class-B.
Logistic Regression (which has a built-in predict_proba() method) and SVC (with probability=True) are both able to estimate class probabilities for their outputs, i.e. they predict that the input is class-A with probability a and class-B with probability b. If a > b, the predicted class is A, otherwise B. In a voting classifier, setting the voting parameter to soft lets each of them (SVM and LogReg) compute its probabilities (also known as confidence scores) individually and pass them to the voting classifier, which then averages them and outputs the class with the highest average probability.
Make sure that if you set voting='soft', the classifiers you provide can actually compute this confidence score.
To compare the accuracy of each individual classifier and of the ensemble, you can do:
from sklearn.metrics import accuracy_score

# classifier_name = trained SVM / LogisticRegression / VotingClassifier
y_pred = classifier_name.predict(X_test)
print(classifier_name.__class__.__name__, accuracy_score(y_true, y_pred))
NOTE: a + b may not appear to be exactly 1 due to floating point round-off, but conceptually it is 1. I can't speak for other confidence scores such as decision functions, but with predict_proba() this is the case.
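A minimal sketch of the setup described above (X_train, y_train, X_test are assumed to be your own data):

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

log_reg = LogisticRegression(max_iter=1000)
svm = SVC(probability=True)                 # needed so SVC can supply predict_proba

voting = VotingClassifier(
    estimators=[('lr', log_reg), ('svm', svm)],
    voting='soft'                           # average the class probabilities of both models
)
voting.fit(X_train, y_train)

print(voting.predict_proba(X_test[:5]))     # averaged probabilities per class
print(voting.predict(X_test[:5]))           # class with the highest average probability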
I am trying to build my own PMML exporter for a Naive Bayes model that I have built in scikit-learn. Reading the PMML documentation, it seems that for each feature you can either express the model in terms of count data if the feature is discrete, or as a Gaussian/Poisson distribution if it is continuous. But the coefficients of my scikit-learn model are stored as empirical log probabilities of the features, i.e. p(x_i|y). Is it possible to specify the Bayes input parameters in terms of these probabilities rather than counts?
Since the PMML representation of the Naive Bayes model expresses the joint counts via the "PairCounts" element, one can simply replace those counts with the probability outputs (not the log probabilities). Since the final probabilities are normalized, the difference doesn't matter. If the model involves a large number of probabilities that are mostly 0, the "threshold" attribute of the model can be used to set a default value for such probabilities.
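Assuming a fitted scikit-learn MultinomialNB called nb (a placeholder name), the values to write into the PairCounts cells could be recovered roughly like this:

import numpy as np

# scikit-learn stores empirical log probabilities; exponentiating gives the
# normalized probabilities that can stand in for the counts
feature_prob = np.exp(nb.feature_log_prob_)   # shape (n_classes, n_features), P(x_i | y)
class_prior = np.exp(nb.class_log_prior_)     # P(y)

# feature_prob[c, i] is then the value to put in the PairCounts entry for
# feature i and target value c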