I am using scikit-learn's LinearSVC classifier for text mining. My y values are 0/1 labels and my X values are TF-IDF vectors of the text documents.
I use a pipeline like the one below:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ('count_vectorizer', TfidfVectorizer(ngram_range=(1, 2))),
    ('classifier', LinearSVC())
])
For a prediction, I would like to get the confidence score or the probability of a data point being classified as 1, in the range (0, 1).
I currently use the decision_function method:
pipeline.decision_function(test_X)
However, it returns positive and negative values that seem to indicate confidence, and I am not entirely sure what they mean.
Is there a way to get the values in the range 0-1 instead?
For example, here is the output of the decision function for some of the data points:
-0.40671879072078421,
-0.40671879072078421,
-0.64549376401063352,
-0.40610652684648957,
-0.40610652684648957,
-0.64549376401063352,
-0.64549376401063352,
-0.5468745098794594,
-0.33976011539714374,
0.36781572474117097,
-0.094943829974515004,
0.37728641897721765,
0.2856211778200019,
0.11775493140003235,
0.19387473663623439,
-0.062620918785563556,
-0.17080866610522819,
0.61791016307670399,
0.33631340372946961,
0.87081276844501176,
1.026991628346146,
0.092097790098391641,
-0.3266704728249083,
0.050368652422013376,
-0.046834129250376291,
You can't.
However, you can use sklearn.svm.SVC with kernel='linear' and probability=True.
It may take longer to train, but you can then get probabilities from this classifier using the predict_proba method.
import sklearn.svm

clf = sklearn.svm.SVC(kernel='linear', probability=True)
clf.fit(X, y)
clf.predict_proba(X_test)
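For reference, here is a minimal sketch of dropping this into the original text pipeline (train_X and train_y are hypothetical names for the raw training texts and 0/1 labels; test_X is the test text from the question):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Same pipeline as in the question, but with SVC(probability=True) as the classifier
pipeline = Pipeline([
    ('count_vectorizer', TfidfVectorizer(ngram_range=(1, 2))),
    ('classifier', SVC(kernel='linear', probability=True))
])

pipeline.fit(train_X, train_y)           # raw training texts and 0/1 labels (hypothetical names)
probs = pipeline.predict_proba(test_X)   # shape (n_samples, 2)
prob_of_class_1 = probs[:, 1]            # column order follows pipeline.classes_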
If you insist on using the LinearSVC class, you can wrap it in a sklearn.calibration.CalibratedClassifierCV object and fit the calibrated classifier which will give you a probabilistic classifier.
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn import datasets

# Load iris dataset
iris = datasets.load_iris()
X = iris.data[:, :2]  # Using only two features
y = iris.target       # 3 classes: 0, 1, 2

linear_svc = LinearSVC()  # The base estimator

# This is the calibrated classifier, which can give probabilistic predictions
calibrated_svc = CalibratedClassifierCV(linear_svc,
                                        method='sigmoid',  # sigmoid uses Platt scaling; see the documentation for other methods
                                        cv=3)
calibrated_svc.fit(X, y)

# Predict
prediction_data = [[2.3, 5],
                   [4, 7]]
predicted_probs = calibrated_svc.predict_proba(prediction_data)  # important to use predict_proba
print(predicted_probs)
Here is the output:
[[ 9.98626760e-01 1.27594869e-03 9.72912751e-05]
[ 9.99578199e-01 1.79053170e-05 4.03895759e-04]]
which shows probabilities for each class for each data point.
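Applied to the pipeline from the question, a minimal sketch might look like this (train_X and train_y are hypothetical names for the raw training texts and labels):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

# Wrap LinearSVC in CalibratedClassifierCV inside the TF-IDF pipeline
pipeline = Pipeline([
    ('count_vectorizer', TfidfVectorizer(ngram_range=(1, 2))),
    ('classifier', CalibratedClassifierCV(LinearSVC(), method='sigmoid', cv=3))
])

pipeline.fit(train_X, train_y)                          # raw training texts and 0/1 labels
prob_of_class_1 = pipeline.predict_proba(test_X)[:, 1]  # probability of class 1, in (0, 1)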
I'm trying to figure out how to unscale my data (presumably using inverse_transform) for predictions when I'm using a pipeline. The data below is just an example. My actual data is much larger and complicated, but I'm looking to use RobustScaler (as my data has outliers) and Lasso (as my data has dozens of useless features). I am new to pipelines in general.
Basically, if I try to use this model to predict anything, I want that prediction in unscaled terms. Is this possible with a pipeline? How can I do this with inverse_transform?
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
data = [[100, 1, 50],[500 , 3, 25],[1000 , 10, 100]]
df = pd.DataFrame(data,columns=['Cost','People', 'Supplies'])
X = df[['People', 'Supplies']]
y = df[['Cost']]
#Split
X_train,X_test,y_train,y_test = train_test_split(X,y)
#Pipeline
pipeline = Pipeline([('scale', RobustScaler()),
                     ('alg', Lasso())])
clf = pipeline.fit(X_train,y_train)
train_score = clf.score(X_train,y_train)
test_score = clf.score(X_test,y_test)
print ("training score:", train_score)
print ("test score:", test_score)
#Predict example
example = [[10,100]]
clf.predict(example)
Simple Explanation
Your pipeline is only transforming the values in X, not y. The differences you are seeing in y for predictions are related to the differences in the coefficient values between two models fitted using scaled vs. unscaled data.
So, if you "want that prediction in unscaled terms" then take the scaler out of your pipeline. If you want that prediction in scaled terms you need scale the new prediction data prior to passing it to the .predict() function. The Pipeline actually does this for you automatically if you have included a scaler step in it.
Scaling and Regression
The practical purpose of scaling here would be when people and supplies have different dynamic ranges. Using the RobustScaler() removes the median and scales the data according to the quantile range. Typically you would only do this if you thought that your people or supply data has outliers that would influence the sample mean / variance in a negative way. If this is not the case, you would likely use the StandardScaler() to remove the mean and scale to unit variance.
Once the data is scaled, you can compare the regression coefficients to better understand how the model is making its predictions. This is important since the coefficients for unscaled data may be very misleading.
An Example Using Your Code
The following example shows:
Predictions using both scaled and unscaled data with and without the pipeline.
The predictions match in both cases.
You can see what the pipeline is doing in the background by looking at the non-pipeline examples.
I have also included the model coefficients in both cases. Note that the coefficients or weights for the scaled vs. unscaled fitted models are very different.
These coefficients are used to generate each prediction value for the variable example (a manual check of this is sketched after the outputs below).
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
data = [[100, 1, 50],[500 , 3, 25],[1000 , 10, 100]]
df = pd.DataFrame(data,columns=['Cost','People', 'Supplies'])
X = df[['People', 'Supplies']]
y = df[['Cost']]
#Split
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=0)
#Pipeline
pipeline_scaled = Pipeline([('scale', RobustScaler()),
                            ('alg', Lasso(random_state=0))])
pipeline_unscaled = Pipeline([('alg', Lasso(random_state=0))])
clf1 = pipeline_scaled.fit(X_train,y_train)
clf2 = pipeline_unscaled.fit(X_train,y_train)
#Pipeline predict example
example = [[10,100]]
print('Pipe Scaled: ', clf1.predict(example))
print('Pipe Unscaled: ',clf2.predict(example))
#------------------------------------------------
rs = RobustScaler()
reg = Lasso(random_state=0)
# Scale the training data
X_train_scaled = rs.fit_transform(X_train)
reg.fit(X_train_scaled, y_train)
# Scale the example
example_scaled = rs.transform(example)
# Predict using the scaled data
print('----------------------')
print('Reg Scaled: ', reg.predict(example_scaled))
print('Scaled Coefficents:',reg.coef_)
#------------------------------------------------
reg.fit(X_train, y_train)
print('Reg Unscaled: ', reg.predict(example))
print('Unscaled Coefficents:',reg.coef_)
Outputs:
Pipe Scaled: [1892.]
Pipe Unscaled: [-699.6]
----------------------
Reg Scaled: [1892.]
Scaled Coefficents: [199. -0.]
Reg Unscaled: [-699.6]
Unscaled Coefficents: [ 0. -15.9936]
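As a rough sanity check of how those coefficients produce the prediction (a sketch reusing the objects defined above), the Lasso prediction is just the intercept plus the dot product of the coefficients with the scaled inputs:

import numpy as np

# reg was refit on the unscaled data above, so refit it on the scaled data first
reg.fit(X_train_scaled, y_train)
example_scaled = rs.transform(example)

# prediction = intercept + coefficients . scaled features
manual = np.ravel(reg.intercept_) + example_scaled @ np.ravel(reg.coef_)
print(manual)   # ~[1892.], matching reg.predict(example_scaled)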
For Completeness
Your original question asks about "unscaling" your data. I don't think this is what you actually need, since X_train is your unscaled data. However, the following example shows how you could do this as well, using the scaler object from your pipeline.
#------------------------------------------------
pipeline_scaled['scale'].inverse_transform(X_train_scaled)
Output
array([[ 3., 25.],
[ 1., 50.]])
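As a side note, if you actually wanted to scale y as well and still get predictions back in the original Cost units, one option (a sketch, not part of the original question) is sklearn's TransformedTargetRegressor, which applies the inverse transform to predictions automatically:

from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Scale X inside the pipeline, scale y via TransformedTargetRegressor;
# predictions are inverse-transformed back to the original units automatically.
model = TransformedTargetRegressor(
    regressor=Pipeline([('scale', RobustScaler()),
                        ('alg', Lasso(random_state=0))]),
    transformer=RobustScaler())

model.fit(X_train, y_train)
print(model.predict([[10, 100]]))  # already in original Cost units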
I'm using the RandomForestClassifier of sklearn on data with highly imbalanced classes: lots of 0s and few 1s. I'm interested in the number of 1s in the prediction. Example (credit):
# Load libraries
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn import datasets
# Load data
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Make class highly imbalanced by removing the first 46 observations
X = X[46:,:]
y = y[46:]
# Create target vector: 1 if the original class was 0, otherwise 0
y = np.where((y == 0), 1, 0)
#split into training and testing
trainx = X[::2]
trainy = y[::2]
testx = X[1::2]
testy = y[1::2]
# Create random forest classifier object
clf = RandomForestClassifier()
# Train model
clf.fit(trainx, trainy)
print(clf.predict(testx).sum())
This returns 2, which is fine, except that on my real data the result is a bit lower than the true answer. I want to deal with this by using the class_weight parameter. However when I do:
clf = RandomForestClassifier(class_weight="balanced")
# Train model
clf.fit(trainx, trainy)
print(clf.predict(testx).sum())
I get a result of 0. Same if I use class_weight={1:10}. If I use class_weight={1:.1} I get 2 again.
I get similar behavior on my real data: the higher the weight I give to class 1, the fewer 1s I get in the prediction.
This is the opposite of the behavior I expect (and the opposite of what the class_weight parameter does in svm). What's going on here? This question suggests sklearn is assigning the class labels by some kind of default, but that seems bizarre. Why wouldn't it use the class labels I gave it?
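One way to sanity-check what the model is doing (a sketch, not from the original thread) is to inspect clf.classes_ and the per-class probabilities, and to move the decision threshold directly instead of relying on class_weight:

clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(trainx, trainy)

print(clf.classes_)                        # column order used by predict_proba, e.g. [0 1]
proba = clf.predict_proba(testx)           # shape (n_samples, 2)
p1 = proba[:, list(clf.classes_).index(1)] # probability of class 1

# Predict class 1 whenever its probability exceeds a chosen threshold
threshold = 0.3
print((p1 > threshold).sum())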
I have a data set with a target variable that can have 7 different labels. Each sample in my training set has only one label for the target variable.
For each sample, I want to calculate the probability for each of the target labels. So my prediction would consist of 7 probabilities for each row.
On the sklearn website I read about multi-label classification, but this doesn't seem to be what I want.
I tried the following code, but this only gives me one classification per sample.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier

clf = OneVsRestClassifier(DecisionTreeClassifier())
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
Does anyone have some advice on this? Thanks!
You can do that by simply removing the OneVsRestClassifier and using the predict_proba method of the DecisionTreeClassifier. You can do the following:
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
pred = clf.predict_proba(X_test)
This will give you a probability for each of your 7 possible classes.
Hope that helps!
You can try using scikit-multilearn, an extension of sklearn that handles multilabel classification. If your labels are not overly correlated, you can train one classifier per label and get all predictions. Try the following (after pip install scikit-multilearn):
from skmultilearn.problem_transform import BinaryRelevance
classifier = BinaryRelevance(classifier = DecisionTreeClassifier())
# train
classifier.fit(X_train, y_train)
# predict
predictions = classifier.predict(X_test)
predictions will be a sparse matrix of size (n_samples, n_labels); in your case n_labels = 7, and each column contains the predictions for one label across all samples.
In case your labels are correlated you might need more sophisticated methods for multi-label classification.
Disclaimer: I'm the author of scikit-multilearn, feel free to ask more questions.
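For example (a small sketch building on the snippet above), the sparse outputs can be densified for inspection, and per-label probabilities are available too:

import numpy as np

# predictions is a scipy sparse matrix of 0/1 labels, shape (n_samples, 7)
dense_predictions = np.asarray(predictions.todense())

# Per-label probabilities are also available as a sparse matrix, one column per label
probabilities = np.asarray(classifier.predict_proba(X_test).todense())
print(dense_predictions.shape, probabilities.shape)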
If you insist on using the OneVsRestClassifier, then you could also call predict_proba(X_test), as it is supported by OneVsRestClassifier as well.
For example:
from sklearn.multiclass import OneVsRestClassifier
clf = OneVsRestClassifier(DecisionTreeClassifier())
clf.fit(X_train, y_train)
pred = clf.predict_proba(X_test)
The order of the labels for which you get the result can be found in:
clf.classes_
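For example (a short sketch building on the snippet above), the probability columns can be paired with those class labels:

import numpy as np

pred = clf.predict_proba(X_test)    # shape (n_samples, 7)

# Pair each probability column with its label, e.g. for the first sample:
for label, p in zip(clf.classes_, pred[0]):
    print(label, p)

# Or pick the most likely label per sample:
most_likely = clf.classes_[np.argmax(pred, axis=1)]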
I am dealing with multi-label classification with OneVsRestClassifier and SVC,
from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older versions

L = 3
X, y = make_multilabel_classification(n_classes=L, n_labels=2,
                                      allow_unlabeled=True,
                                      random_state=1, return_indicator=True)

model_to_set = OneVsRestClassifier(SVC())

parameters = {
    "estimator__C": [1, 2, 4, 8],
    "estimator__kernel": ["poly", "rbf"],
    "estimator__degree": [1, 2, 3, 4],
}

model_tunning = GridSearchCV(model_to_set, param_grid=parameters,
                             scoring='f1')

model_tunning.fit(X, y)

print(model_tunning.best_score_)
print(model_tunning.best_params_)
# 0.855175822314
# {'estimator__kernel': 'poly', 'estimator__C': 1, 'estimator__degree': 3}
1st question
What does the number 0.85 represent? Is it the best score among the L classifiers, or an average? Similarly, does the set of parameters correspond to the best scorer among the L classifiers?
2nd question
If I am right that OneVsRestClassifier literally builds one classifier per label (L classifiers in total), one would expect to be able to observe the performance for EACH label. But how, in the above example, can I obtain L scores from the GridSearchCV object?
EDIT
To simplify the problem and help myself understand OneVsRestClassifier better, I ran the following before tuning the model:
model_to_set.fit(X,y)
gp = model_to_set.predict(X) # the "global" prediction
fp = model_to_set.estimators_[0].predict(X) # the first-class prediction
sp = model_to_set.estimators_[1].predict(X) # the second-class prediction
tp = model_to_set.estimators_[2].predict(X) # the third-class prediction
It can be shown that gp.T[0]==fp, gp.T[1]==sp and gp.T[2]==tp, so the "global" prediction is simply the L individual predictions stacked together, and the 2nd question is solved.
But it is still confusing to me: if the OneVsRestClassifier meta-classifier contains L classifiers, how can GridSearchCV return only ONE best score, corresponding to one of the 4*2*4 sets of parameters?
Any comments would be greatly appreciated.
GridSearchCV creates a grid from your parameter values and evaluates your OneVsRestClassifier as an atomic classifier (i.e., GridSearchCV doesn't know what is inside this meta-classifier).
First: 0.85 is the best score of the OneVsRestClassifier among all possible combinations (32 in your case, 4*2*4) of the parameters ("estimator__C", "estimator__kernel", "estimator__degree"). This means that GridSearchCV evaluates 32 (again, only in this particular case) possible OneVsRestClassifiers, each of which contains L SVCs. All L classifiers inside one OneVsRestClassifier share the same parameter values (but each of them learns to recognize its own class out of the L possible classes),
that is, from the set
{OneVsRestClassifier(SVC(C=1, kernel="poly", degree=1)),
OneVsRestClassifier(SVC(C=1, kernel="poly", degree=2)),
...,
OneVsRestClassifier(SVC(C=8, kernel="rbf", degree=3)),
OneVsRestClassifier(SVC(C=8, kernel="rbf", degree=4))}
it chooses the one with the best score.
model_tunning.best_params_ here represents the parameters of the OneVsRestClassifier(SVC()) with which it achieves model_tunning.best_score_.
You can get that best OneVsRestClassifier from the model_tunning.best_estimator_ attribute.
Second: There is no ready-to-use code to obtain separate scores for the L classifiers from OneVsRestClassifier, but you can look at the implementation of the OneVsRestClassifier.fit method, or use this (it should work :) ):
# Here X, y - your dataset
one_vs_rest = model_tunning.best_estimator_
yT = one_vs_rest.label_binarizer_.transform(y).toarray().T

# Iterate through all L classifiers
for classifier, is_ith_class in zip(one_vs_rest.estimators_, yT):
    print(classifier.score(X, is_ith_class))
Inspired by @Olologin's answer, I realized that 0.85 is the best weighted average of the f1 scores (in this example) obtained from the L predictions. In the following code, I evaluate the model on the training data itself (an inner test), using the macro average of the f1 score:
from sklearn.metrics import f1_score
import numpy as np

# Case A, inspect F1 score using the meta-classifier
F_A = f1_score(y, model_tunning.best_estimator_.predict(X), average='macro')

# Case B, inspect F1 scores of each label (binary task) and collect them by macro average
F_B = []
for label, clc in zip(y.T, model_tunning.best_estimator_.estimators_):
    F_B.append(f1_score(label, clc.predict(X)))
F_B = np.mean(F_B)

F_A == F_B  # True
So this implies that GridSearchCV applies one of the 4*2*4 sets of parameters to build the meta-classifier, which in turn makes a prediction for each label with one of the L classifiers. The outcome is L f1 scores for the L labels, each measuring the performance of a binary task. Finally, a single score is obtained by taking the average (macro or weighted, as specified by the average parameter of f1_score) of the L f1 scores.
GridSearchCV then chooses the best averaged f1 score among the 4*2*4 sets of parameters, which is 0.85 in this example.
Though it is convenient to use the wrapper for a multi-label problem, it can only maximize the averaged f1 score with the same set of parameters used to build all L classifiers. If one wants to optimize the performance of each label separately, it seems one has to build the L classifiers without using the wrapper, as sketched below.
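A minimal sketch of that idea (assuming the same X and label-indicator y as above; the parameter grid is just illustrative) would run one GridSearchCV per label column:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

per_label_models = []
for i in range(y.shape[1]):
    # Tune an independent SVC for label i against its own 0/1 column
    gs = GridSearchCV(SVC(),
                      param_grid={"C": [1, 2, 4, 8],
                                  "kernel": ["poly", "rbf"],
                                  "degree": [1, 2, 3, 4]},
                      scoring='f1')
    gs.fit(X, y[:, i])
    per_label_models.append(gs.best_estimator_)
    print("label %d: best f1 = %.3f, params = %s" % (i, gs.best_score_, gs.best_params_))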
As for your second question, you might want to use GridSearchCV with scikit-multilearn's BinaryRelevance classifier. Like OneVsRestClassifier, Binary Relevance creates L single-label classifiers, one per label. For each label, the training target is 1 if the label is present and 0 if it is not. The best selected classifier set is the BinaryRelevance instance in the best_estimator_ property of GridSearchCV. To predict floats of probabilities, use the predict_proba method of the BinaryRelevance object. An example can be found in the scikit-multilearn docs for model selection.
In your case I would run the following code:
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

model_to_set = BinaryRelevance(SVC())

parameters = {
    "classifier__estimator__C": [1, 2, 4, 8],
    "classifier__estimator__kernel": ["poly", "rbf"],
    "classifier__estimator__degree": [1, 2, 3, 4],
}

model_tunning = GridSearchCV(model_to_set, param_grid=parameters,
                             scoring='f1')

model_tunning.fit(X, y)

# for some X_test testing set
predictions = model_tunning.best_estimator_.predict(X_test)

# average=None gives per-label scores
metrics.f1_score(y_test, predictions, average=None)
Please note that there are much better methods for multi-label classification than Binary Relevance :) You can find them in Madjarov's comparison or my recent paper.
I'm using scikit-learn's (sklearn) linear SVM (LinearSVC) for sentiment analysis on 3 classes (positive, negative and neutral), and I'm currently trying to remove the 10% most predictive features to see if I can prevent overfitting while working on domain adaptation. I know that it's possible to access the feature weights via svm.LinearSVC().coef_, but I'm not sure how to remove the 10% most predictive features. Does anyone know how to proceed? Thanks in advance for your help. Here's my code:
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer as cv
# Using linear SVM classifier
clf = svm.LinearSVC()
# Count vectorizer used by the SVM classifier (select which ngrams to use here)
vec = cv(lowercase=True, ngram_range=(1,2))
# Fit count vectorizer with training text data
vec.fit(trainStringList)
# X represents the text data from the respective datasets
# Transforms text into vectors for the training set
X_train = vec.transform(trainStringList)
#transforms text into vectors for the test set
X_test = vec.transform(testStringList)
# Y represents the labels from the respective datasets
# Converting labels from the respective data sets to integers (0="positive", 1= "neutral", 2= "negative")
Y_train = trainLabels
Y_test = testLabels
# Fitting the training data to the linear SVM classifier
clf.fit(X_train,Y_train)
for feature_vector in clf.coef_:
???
The coefficients with the largest absolute values indicate the features with the most influence on the predictions. You can eliminate the features associated with these coefficients, but I do not advise this. If your goal is to reduce overfitting, note that C is the regularization parameter of this model and that the regularization strength is inversely proportional to C: smaller values regularize more strongly. Try a smaller C value when instantiating the LinearSVC object (the default is 1):
clf = svm.LinearSVC(C=0.1)
You should be doing some sort of cross-validation to determine the best values for hyperparameters, such as C.
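For reference, here is a rough sketch (assuming clf, vec, X_train, X_test and Y_train are the objects defined in the question) of how one might identify the 10% of features with the largest absolute weights, drop them, and cross-validate C:

import numpy as np
from sklearn.model_selection import GridSearchCV

# --- Identify the 10% most predictive features (largest absolute weights) ---
# For 3 classes, clf.coef_ has shape (3, n_features); take the max |weight| of
# each feature over the classes as its overall importance.
importance = np.abs(clf.coef_).max(axis=0)
n_top = int(0.1 * importance.shape[0])
top_idx = np.argsort(importance)[-n_top:]      # indices of the top 10% features

feature_names = np.asarray(vec.get_feature_names_out())   # get_feature_names() on older sklearn
print(feature_names[top_idx])                              # the ngrams you would remove

# Drop those columns from the document-term matrices
keep_idx = np.setdiff1d(np.arange(importance.shape[0]), top_idx)
X_train_reduced = X_train[:, keep_idx]
X_test_reduced = X_test[:, keep_idx]

# --- Cross-validate C instead (usually the better way to control overfitting) ---
grid = GridSearchCV(svm.LinearSVC(), param_grid={'C': [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, Y_train)
print(grid.best_params_)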