I am using python to do a bit of machine learning.
I have a NumPy ndarray with 2000 entries. Each entry holds information about a subject and ends with a boolean telling me whether that subject is a vampire or not.
Each entry in the array looks like this:
[height(cm), weight(kg), stake aversion, garlic aversion, reflectance, shiny, IS_VAMPIRE?]
My goal is to be able to give a probability that a new subject is a vampire given the data shown above for the subject.
I have used sklearn to do some machine learning for me:
clf = tree.DecisionTreeRegressor()
clf=clf.fit(X,Y)
print(clf.predict(W))
Where W is an array of data for the new subject. The script I have written returns booleans, but I would like it to return probabilities. How can I modify it?
If you are using DecisionTreeRegressor(), then you may use the score function to determine the coefficient of determination R^2 of the prediction.
Please find the link to the documentation below:
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor
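As a rough sketch (assuming clf is the fitted regressor from your code, and X_test, Y_test are hypothetical held-out data):
# Coefficient of determination R^2 on held-out data
print(clf.score(X_test, Y_test))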
You can also list the cross-validation scores (with 10 folds) as below:
from sklearn.model_selection import cross_val_score
clf = tree.DecisionTreeRegressor()
clf = clf.fit(X, Y)
scores = cross_val_score(clf, X, Y, cv=10)
print(scores)
print(clf.predict(W))
The printed cross-validation scores look something like this:
array([ 0.61..., 0.57..., -0.34..., 0.41..., 0.75...,
0.07..., 0.29..., 0.33..., -1.42..., -1.77...])
Use a DecisionTreeClassifier instead of a regressor, and use the predict_proba method. Alternatively, you could use logistic regression (also available in scikit-learn).
The basic idea is this:
clf = tree.DecisionTreeClassifier()
clf=clf.fit(X,Y)
print(clf.predict_proba(W))
You want to use a classifier that gives you a probability. Also, make sure the data points in your testing array W are not replicates of any of your training data: if a point matches a training example exactly, the tree considers it definitely a vampire or definitely not a vampire, and will give you 0 or 1.
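For example (a small illustrative sketch, assuming X and Y are the training arrays from your question), you can see this behaviour by asking for probabilities on a row taken straight from the training data:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
# A fully grown tree will typically output a hard 0 or 1 for a memorised training row
print(clf.predict_proba(X[:1]))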
You're using a regressor but you probably want to use a classifier.
You'll also want to use a classifier that can give you posterior probabilities like a decision tree or logistic regression. Other classifiers may give you a score (some kind of confidence measure) which may also work for your needs.
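For example, a minimal sketch with logistic regression (assuming X and Y are your training arrays and W holds the new subject's features, as in your question):
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf = clf.fit(X, Y)
# The second column is the probability of the positive (vampire) class
print(clf.predict_proba(W)[:, 1])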
Related
I used eli5 to apply the permutation procedure for feature importance. In the documentation, there is some explanation and a small example but it is not clear.
I am using a sklearn SVC model for a classification problem.
My question is: Are these weights the change (decrease/increase) of the accuracy when the specific feature is shuffled OR is it the SVC weights of these features?
In this Medium article, the author states that these values show the reduction in model performance caused by the reshuffling of that feature, but I am not sure if that's indeed the case.
Small example:
from sklearn import datasets
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.svm import SVC, SVR
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
clf = SVC(kernel='linear')
perms = PermutationImportance(clf, n_iter=1000, cv=10, scoring='accuracy').fit(X, y)
print(perms.feature_importances_)
print(perms.feature_importances_std_)
[0.38117333 0.16214 ]
[0.1349115 0.11182505]
eli5.show_weights(perms)
I did some deep research.
After going through the source code, here is what I believe for the case where cv is used and is not prefit or None. I use a K-Fold scheme for my application. I also use an SVC model; thus, the score is the accuracy in this case.
By looking at the fit method of the PermutationImportance object, the _cv_scores_importances are computed (https://github.com/TeamHG-Memex/eli5/blob/master/eli5/sklearn/permutation_importance.py#L202). The specified cross-validation scheme is used, and the base_scores and feature_importances are returned using the test data (function: _get_score_importances inside _cv_scores_importances).
By looking at the get_score_importances function (https://github.com/TeamHG-Memex/eli5/blob/master/eli5/permutation_importance.py#L55), we can see that base_score is the score on the non-shuffled data, and the feature_importances (there called scores_decreases) are defined as the non-shuffled score minus the shuffled score (see https://github.com/TeamHG-Memex/eli5/blob/master/eli5/permutation_importance.py#L93).
Finally, the errors (feature_importances_std_) are the standard deviation of the above feature_importances (https://github.com/TeamHG-Memex/eli5/blob/master/eli5/sklearn/permutation_importance.py#L209), and feature_importances_ is the mean of the above feature_importances (non-shuffled score minus shuffled score).
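To make that concrete, here is a rough hand-written sketch of the same idea (not the eli5 code itself; manual_permutation_importance is a hypothetical helper name), for a single fitted estimator est and a dense test split X_test, y_test:
import numpy as np

def manual_permutation_importance(est, X_test, y_test, n_iter=100, random_state=0):
    rng = np.random.RandomState(random_state)
    base_score = est.score(X_test, y_test)              # non-shuffled score
    decreases = np.zeros((n_iter, X_test.shape[1]))
    for it in range(n_iter):
        for j in range(X_test.shape[1]):
            X_perm = X_test.copy()
            rng.shuffle(X_perm[:, j])                    # shuffle one feature column
            # decrease = non-shuffled score - shuffled score
            decreases[it, j] = base_score - est.score(X_perm, y_test)
    # feature_importances_ ~ mean, feature_importances_std_ ~ standard deviation
    return decreases.mean(axis=0), decreases.std(axis=0)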
A fair bit shorter answer to your original question: regardless of the setting for the cv parameter, eli5 will calculate the average decrease in the scorer you provide. Because you're using the sklearn wrapper, the scorer will come from scikit-learn: in your case, accuracy. As a word on the package overall, some of these details are particularly difficult to figure out without digging into the source code; it might be worth submitting a pull request to make the documentation more detailed where possible.
I am trying to evaluate logistic regression using the AUROC curve and cross-validate my scores. When I don't cross-validate I have no issues, but I really want to use cross-validation to help decrease bias in my method.
Anyway, below is the code and the error I get for the beginning part of my code:
X = df.drop('Survived', axis=1)
y = df['Survived']
skf = StratifiedKFold(n_splits=5)
logmodel = LogisticRegression()
i=0
for train, test in skf.split(X, y):
    logmodel.fit(X[train], y[train])  # error occurs here
    predictions = logmodel.predict_proba(X[test])
    # a bunch of code that I haven't included which creates the ROC curve
    i += 1
The error occurs in the fourth to last line, and returns a list of integers followed by 'not in index'
I don't really understand what the problem is?
This is my understanding of the code: First I create an instance of both stratified kfold and logistic regression. The instance of stratified kfold states that five folds are to be made. Next, I say that for each train and test fold in my dataset X, y I fit the logistic model to the data and then create a list of predictions for different probabilities based on the test data. Later (this part is not showed) I will create a ROC curve for each k-fold of data.
Again, I don't really understand what the problem is but maybe somebody can clarify. My work is more or less copied directly from this link in sklearn: https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html#sphx-glr-auto-examples-model-selection-plot-roc-crossval-py
Please add more details so the problem can be properly examined. Preferably (and actually required) a piece of code that one can run to reproduce the error.
At first glance, you take a pandas DataFrame and feed it into the model, and that is done incorrectly.
See the following lines that are correct for retrieving data and feeding it to the model:
X = df.drop('Survived', axis=1).values
y = df['Survived'].values
The .values attribute accesses the underlying NumPy array stored in those DataFrames, which is consistent with the rest of the code: integer indexing with the fold indices from skf.split works on NumPy arrays.
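Alternatively, if you'd rather keep the DataFrame, positional indexing with .iloc should also work, since skf.split returns positional indices:
for train, test in skf.split(X, y):
    logmodel.fit(X.iloc[train], y.iloc[train])
    predictions = logmodel.predict_proba(X.iloc[test])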
Hopefully that helps you to solve the error.
Good luck!
I'm confused about the reason to use cross_val_score.
From what I understood, cross_val_score tells me whether my model is 'overfitting' or 'underfitting'. Moreover, it does not train my model. Since I have only 1 feature, which is tf-idf (a sparse matrix), I don't know what to do if it is under/over-fitting.
Q1: Did I use it in the wrong order? I've seen both 'cross->fit' and 'fit->cross' examples.
Q2: What do the scores in '# print1' tell me? Does it mean I have to train my model k times (with the same training set), where k is the fold that gives the best score?
My code now:
model1 = GaussianNB(priors=None)
score = cross_val_score(model1, X_train.toarray(), y_train, cv=3, scoring='accuracy')
# print1
print(score.mean())
model1.fit(X_train.toarray(), y_train)
predictions1 = model1.predict(X_test.toarray())  # held out data
# print2
print(classification_report(predictions1, y_test))
Here is some information about cross-validation.
The order (cross then fit) seems fine to me.
First, you evaluate the performance of your model on known data. Taking the mean of all the CV scores is interesting, but it may be better to look at the raw per-fold scores to see whether your model performs poorly on some of the folds.
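For example, a small sketch of inspecting the per-fold scores rather than only the mean (reusing model1, X_train and y_train from your code):
scores = cross_val_score(model1, X_train.toarray(), y_train, cv=3, scoring='accuracy')
print(scores)                       # one accuracy value per fold
print(scores.mean(), scores.std())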
If your model works, then you can fit it on your train set and predict on your test set.
Training the same model k times won't change anything.
I use a linear SVM from scikit-learn (LinearSVC) for a binary classification problem. I understand that LinearSVC can give me the predicted labels and the decision scores, but I want probability estimates (confidence in the label). I want to continue using LinearSVC because of speed (compared to sklearn.svm.SVC with a linear kernel). Is it reasonable to use a logistic function to convert the decision scores to probabilities?
import sklearn.svm as suppmach
# Fit model:
svmmodel=suppmach.LinearSVC(penalty='l1',C=1)
predicted_test= svmmodel.predict(x_test)
predicted_test_scores= svmmodel.decision_function(x_test)
I want to check if it makes sense to obtain probability estimates simply as [1 / (1 + exp(-x))], where x is the decision score.
Alternatively, are there other classifier options that I can use to do this efficiently?
Thanks.
scikit-learn provides CalibratedClassifierCV, which can be used to solve this problem: it allows you to add probability output to LinearSVC or any other classifier that implements the decision_function method:
svm = LinearSVC()
clf = CalibratedClassifierCV(svm)
clf.fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)
The user guide has a nice section on that. By default, CalibratedClassifierCV + LinearSVC will get you Platt scaling, but it also provides other options (the isotonic regression method), and it is not limited to SVM classifiers.
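For example, to use isotonic regression instead of the default sigmoid (Platt) calibration, with the same X_train / y_train / X_test as above:
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

svm = LinearSVC()
clf = CalibratedClassifierCV(svm, method='isotonic', cv=5)
clf.fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)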
I took a look at the APIs in the sklearn.svm.* family. All of the models below, e.g.,
sklearn.svm.SVC
sklearn.svm.NuSVC
sklearn.svm.SVR
sklearn.svm.NuSVR
have a common interface that supplies a
probability: boolean, optional (default=False)
parameter to the model. If this parameter is set to True, libsvm will train a probability transformation model on top of the SVM's outputs, based on the idea of Platt scaling. The form of the transformation is similar to a logistic function, as you pointed out; however, two specific constants A and B are learned in a post-processing step. Also see this Stack Overflow post for more details.
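For example, with SVC it would look like this (a sketch only; x_train and y_train are assumed training data, and training will be slower than LinearSVC because the calibration runs an internal cross-validation):
from sklearn.svm import SVC

model = SVC(kernel='linear', probability=True)
model.fit(x_train, y_train)             # x_train, y_train assumed to exist
proba = model.predict_proba(x_test)     # class probabilities via Platt scaling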
I actually don't know why this post-processing is not available for LinearSVC. Otherwise, you would just call predict_proba(X) to get the probability estimate.
Of course, if you just apply a naive logistic transform, it will not perform as well as a calibrated approach like Platt scaling. If you can understand the underlying algorithm of Platt scaling, you can probably write your own or contribute to the scikit-learn SVM family. :) Also feel free to use the above four SVM variations that support predict_proba.
If you want speed, then just replace the SVM with sklearn.linear_model.LogisticRegression. That uses the exact same training algorithm as LinearSVC, but with log-loss instead of hinge loss.
Using [1 / (1 + exp(-x))] will produce probabilities, in a formal sense (numbers between zero and one), but they won't adhere to any justifiable probability model.
If what you really want is a measure of confidence rather than actual probabilities, you can use the method LinearSVC.decision_function(). See the documentation.
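A minimal sketch of the LogisticRegression route (assuming x_train, y_train and x_test as in your setup):
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(penalty='l1', C=1, solver='liblinear')
logreg.fit(x_train, y_train)
proba = logreg.predict_proba(x_test)        # well-founded probabilities
scores = logreg.decision_function(x_test)   # or raw confidence scores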
Just as an extension for binary classification with SVMs: you could also take a look at SGDClassifier, which fits a linear SVM with stochastic gradient descent by default. For estimation of the binary probabilities it uses the modified Huber loss, via
(clip(decision_function(X), -1, 1) + 1) / 2
An example would look like:
from sklearn.linear_model import SGDClassifier
svm = SGDClassifier(loss="modified_huber")
svm.fit(X_train, y_train)
proba = svm.predict_proba(X_test)
I've trained a Random Forest (regressor in this case) model using scikit-learn (Python), and I would like to plot the error rate on a validation set based on the number of estimators used. In other words, is there a way to predict using only a portion of the estimators in your RandomForestRegressor?
Using predict(X) will give you the predictions based on the mean of every single tree's result. Is there a way to limit the number of trees used? Or, alternatively, to get each individual output for each single tree in the forest?
Thanks to cohoz I've figured out how to do it.
I've written a couple of functions, which turned out to be handy while plotting the learning curve of the random forest regressor on the test set.
## Error metric
import numpy as np

def rmse(train, test):
    return np.sqrt(np.mean((test - train) ** 2))

## Print test set error
## Input the RandomForestRegressor, test set features and test set known values
def rfErrCurve(rf_model, test_X, test_y):
    p = []
    for i, tree in enumerate(rf_model.estimators_):
        p.insert(i, tree.predict(test_X))
        # RMSE of the running mean over the first i+1 trees
        print(rmse(np.mean(p, axis=0), test_y))
Once trained, you can access these via the "estimators_" attribute of the random forest object.
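For instance, a minimal sketch (assuming rf is a fitted RandomForestRegressor and X_val, y_val are a hypothetical held-out validation set) that computes the validation RMSE as a function of the number of trees used:
import numpy as np

# Per-tree predictions on the validation set: shape (n_trees, n_samples)
per_tree = np.array([t.predict(X_val) for t in rf.estimators_])

# Running mean over the first k trees, for k = 1..n_trees
cum_mean = np.cumsum(per_tree, axis=0) / np.arange(1, len(rf.estimators_) + 1)[:, None]

# Validation RMSE when using only the first k trees
rmse_by_k = np.sqrt(np.mean((cum_mean - y_val) ** 2, axis=1))
print(rmse_by_k)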