Confidence score for machine learning with SciKit Learn? - python

I have followed an example of applying SciKit Learning's machine learning to facial recognition.
https://scikit-learn.org/stable/auto_examples/applications/plot_face_recognition.html#sphx-glr-auto-examples-applications-plot-face-recognition-py
I have been able to adapt the example to my own data successfully. However, I am lost on one point:
after preparing the data, training the model, ultimately, you end up with the line:
Y_pred = clf.predict(X_test_pca)
This produces a vector of predictions, one per face.
What I can't figure out is how to get any confidence measurement to correspond with that.
The classification method is a forced choice, so that each face passed in MUST be classified as one of the known faces, even if it isn't even close.
How can I get a number per face that will reflect how well the result matches the known face?

It seems like you are looking for the .predict_proba() method of the scikit-learn estimators. It returns the probabilities of possible outcomes instead of a single prediction.
The example you are referring to is using an SVC. It is a little special in regard to this function as it states:
The model need to have probability information computed at training time: fit with attribute probability set to True.
So, if you are using the same model as in the example, instantiate it with:
SVC(kernel='rbf', class_weight='balanced', probability=True)
and use .predict_proba() instead of .predict():
y_pred = clf.predict_proba(X_test_pca)
This returns an array of shape (n_samples, n_classes), i.e. the probabilities for each class for each sample. Accessing the probabilities for class k could then be done by calling y_pred[k] for example.

Related

Getting confidence intervals from an Xgboost fitted model

I am trying to get the confidence intervals from an XGBoost saved model in a .tar.gz file that is created using python XGBoost library.
The problem is that the model has already been fitted, and I dont have training data any more, I just have inference or serving data to predict. All the examples that I found entail using a training and test data to create either quantile regression models, or bagged models, but I dont think I have the chance to do that.
Why your desired approach will not work
I assume we are talking about regression here. Given a regression model that you cannot modify, I think you will not be able to achieve your desired result using only the given model. The model was trained to calculate a continuous value that appoximates some objective value (i.e., its true value) based on some given input. Nothing more.
Possible solution
The only workaround I can think of would be to train two more models. These model's training goal would be to predict the quality of the output of your given model. One would calculate the upper bound of a given (i.e., predefined by you at training time) confidence interval and the other one the lower bound. This would probably include a lot of feature engineering. One would probably like to find features that correlate with the prediction quality of the original model.

Single sample prediction problem when training set dimensions reduced with LDA

I have a trouble with supervised classification method I am using for my data.
Let's think we are training our algorithm with a data (N=70) after reducing the dimensions from 100 to 2 by using LDA dimensionality reduction method.
Now, we would like to predict the class of the 71st sample, whose class is completely unknown to us. However, it has 100 features still; so we have to reduce its dimensions.
That seems easy in the first look: I can use the transform characteristics of the first reduction. For example, in python:
clf.fit(X,Y)
lda = LinearDiscriminantAnalysis(n_components=2)
flda = lda.fit(X, Y)
X_lda = flda.transform(X)
I had stored the fitting properties of training data. X_p is my single sample. So when I use 'flda' again for transformation, same fitting information is used:
X_p = flda.transform(X_p.reshape(1, -1))
However, it doesn't predict properly! To test, I used my first N=70 data. Extract one of them (so now, it is N=69). I used 70th data as test sample. And it didn't predict properly again.
When I compared my previous data (N=70) and the new one (N=69), I saw that every single number changed! If I am not missing something (I hope I am missing and you can tell me what I am missing) LDA dimensionality reduction is not applicable for real machine learning applications, because only one data could change everything.
As a note, the plot of the reduced data doesn't change, despite all numbers significanly change (which means the relative locations of points do not change).
Do you know how LDA dimensionality reduction is used in real machine learning applications? What must I do, to test one sample in the following order:
Reduce dimensions to 2 for training data
Reduce dimensions to 2 for test data
Predict!
without using the same tranformation charactheristics?

How can the output of a model be displayed?

I am performing a machine learning task wherein I am using logistic regression for topic classification.
If this is my code:
model= LogisticRegression()
model= model.fit(mat_tmp, label_tmp)
y_train_pred = model.predict(mat_tmp_test)
print(metrics.accuracy_score(label_tmp_test, y_train_pred))
Is there a way I can output what exactly is happening inside the model. Like probably a working example of what my model is doing? Like maybe displaying 2-3 documents and how they are being classified?
In order to be fully aware of what is happening in your model, you must first take some time to study the logistic regression algorithm (eg. from lecture notes or Wikipedia). As with other supervised techniques, logistic regression has hyper-parameters and parameters. Hyper-parameters basically specify how your algorithm runs, which you must provide at initialisation (ie. before it sees any data). For example, you could have prior information about the distribution of classes, which then would be a hyper-parameter. Parameters are "learnt" from your data.
Once you understand the algorithm, the interesting question will be what the parameters of your model are (recall that these are retrieved from the data). By visiting the documentation, you find in the attributes section, that this classifier has 3 parameters, which you can access by their field names.
If you are not interested in such details, but only want to assess the accuracy of your classifier, a useful technique is cross-validation. You split your labeled data into k equal sized subsets, and train your classifier using k-1 of them. Then you evaluate the trained classifier on the remaining 1 subset and calculate the accuracy (ie. what proportion of the data could be predicted properly). This method has its drawbacks, but proves to be very useful in general.

Classifying new occurances - Multinomial Naive Bayes

So I have currently trained a Multinomial Naive Bayes classifier, using [SKiLearn][1]
Now what I can do is classify test data by using predict.
But if I want to run this every night, as a script, I clearly need to always have a classifier already trained up! Now what I'd like to be able to do, is take classifier coefficients, informative words, and use these to classify new data.
Is this possible - to develop my own method for classification? Or should I be simply training the SkiLearn classifier nightly?
EDIT: One thing, it seems I can do, is retain and save my trained classifier.
However with logistic regression, you can take the coefficients and use these on new data. Is there anything similar to this for NB?
Do you mean [sklearn]? Are you using python? If that is the case, it turns out that [sklearn] provides a function for getting the parameters of the model [get_params(deep=True)] as well as a function for setting them [set_params(**params)].
Therefore, a possible procedure could be:
Training stage:
1) Train the model
2) Get the parameters of the model by using get_params()
3) Save the parameters into a binary file (e.g. by using pickle.dump())
Prediction stage:
1) Load the parameters of the model from the binary file (e.g. by using pickle.load())
2) Set the parameters of the model by using set_params()
3) Classify new data by using the predict() function
Hope that helps.

Support vector regression online learning

I am predicting stock prices using support vector regression. I have trained with some values but when i predict the values every time I have to train based on that(online learning). So i have passed the values to train inside the loop after predicted.
inside loop
//prediction
clf.fit(testx[i],testy[i])
So when i call the fit function everytime how svr training work internally based on one input?
clf.fit is not incremental. You have to pass all the previous training points in addition to the new instance to re-train a new model that benefits from the new data points unfortunately.
This is a limitation of the SMO algorithm implemented by the libsvm library used internally in the sklearn.svm.SVR class.

Categories

Resources