Scikit Learn: How can I set the SVM output range in regression?

I have some input data in the range [-1, 1] and output data in the range [0, 1]. When I use SVM regression to predict the output, the predicted values are between -1 and 1. What am I missing? The code is:
svr = svm.SVR(C=0.1, gamma=0.01, kernel='rbf')
y_rbf = svr.fit(TrainingIn, TrainingOut)
y_hat = svr.predict(TestIn)
Thank you!

Given the information here, it's impossible to reconstruct your problem. I'm pretty sure, though, that it has to do with the preprocessing/scaling of your data. An example snippet to get SVR running might look like this (feel free to adapt it to your needs):
from sklearn.svm import SVR
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; any regression dataset works here
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# replace this part with your data, e.g. TrainingIn/TrainingOut
boston = load_boston()
X, y = boston.data, boston.target
X1, X2, y1, y2 = train_test_split(X, y)

svr = SVR(C=80)
scaler = StandardScaler()

# fit the scaler on the training split only, then reuse it on the test split
svr.fit(scaler.fit_transform(X1), y1)
y_pred = svr.predict(scaler.transform(X2))
print(mean_squared_error(y2, y_pred))
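If you want the same recipe with the variable names from the question (TrainingIn, TrainingOut and TestIn are the asker's arrays, so this is just a sketch), the scaling step can be folded into a Pipeline so it is applied consistently at fit and predict time:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# the pipeline scales the inputs on fit and applies the same scaling on predict
model = make_pipeline(StandardScaler(), SVR(C=0.1, gamma=0.01, kernel='rbf'))
model.fit(TrainingIn, TrainingOut)
y_hat = model.predict(TestIn)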

I'm keeping this answer only for future reference (it does not directly answer PSan's question).
It's important to note that (perhaps contrary to its name) sklearn.svm.SVR can be used like both a regressor and a classifier: if fed labeled data such as {-1, +1}, predict will output values around -1 and +1.
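To see that behavior, here is a quick synthetic sketch (not from the original answer; the data is made up):
import numpy as np
from sklearn.svm import SVR

# regress on {-1, +1} "labels" and inspect the raw predictions
rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = np.where(X[:, 0] > 0, 1.0, -1.0)
svr = SVR().fit(X, y)
print(svr.predict(X[:5]))  # continuous values clustered near -1 and +1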

Related

How do I manually `predict_proba` from logistic regression model in scikit-learn?

I am trying to manually reproduce the predictions of a logistic regression model using the coefficient and intercept outputs from a scikit-learn model. However, I can't match up my probability predictions with the predict_proba method from the classifier.
I have tried:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from scipy.special import expit
import numpy as np

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0).fit(X, y)

# use sklearn's predict_proba function
sk_probas = clf.predict_proba(X[:1, :])

# and attempting manually (using scipy's inverse logit)
manual_probas = expit(np.dot(X[:1], clf.coef_.T) + clf.intercept_)

# with a completely manual inverse logit
full_manual_probas = 1 / (1 + np.exp(-(np.dot(X[:1], clf.coef_.T) + clf.intercept_)))
outputs:
>>> sk_probas
array([[9.81815067e-01, 1.81849190e-02, 1.44120963e-08]])
>>> manual_probas
array([[9.99352591e-01, 9.66205386e-01, 2.26583306e-05]])
>>> full_manual_probas
array([[9.99352591e-01, 9.66205386e-01, 2.26583306e-05]])
I do seem to get the classes to match (using np.argmax), but the probabilities are different. What am I missing?
I've looked at this and this but haven't managed to figure it out yet.
The documentation states that
For a multi_class problem, if multi_class is set to be “multinomial” the softmax function is used to find the predicted probability of each class
That is, in order to get the same values as sklearn you have to normalize using softmax, like this:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import numpy as np
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0, max_iter=1000).fit(X, y)
decision = np.dot(X[:1], clf.coef_.T) + clf.intercept_
print(clf.predict_proba(X[:1]))
print(np.exp(decision) / np.exp(decision).sum())
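As a quick sanity check, the two results can be compared directly (a sketch; np.allclose uses its default tolerance):
manual_probas = np.exp(decision) / np.exp(decision).sum()
assert np.allclose(clf.predict_proba(X[:1]), manual_probas)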
To use sigmoids instead, you can do it like this:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import numpy as np
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0, max_iter=1000, multi_class='ovr').fit(X, y)  # notice the extra argument
full_manual_probas = 1 / (1 + np.exp(-(np.dot(X[:1], clf.coef_.T) + clf.intercept_)))
print(clf.predict_proba(X[:1]))
print(full_manual_probas / full_manual_probas.sum())

How to calculate accuracy measures for a "OneVsRestClassifier" classifier?

I have a multiclass classification problem; the code below can classify the data at the multiclass level.
from sklearn import datasets
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Binarize the output
y_bin = label_binarize(y, classes=[0, 1, 2])
n_classes = y_bin.shape[1]
clf = OneVsRestClassifier(QDA())
y_score = cross_val_predict(clf, X, y, cv=10, method='predict_proba')
How can I calculate the performance measures listed below for this classifier using the above code?
accuracy
specificity
sensitivity
precision
MCC
F1
recall
Thank you!
You should be able to do it with:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
In this case, from what I understand, y_test is what you defined as y, and y_pred can easily be calculated via:
y_pred = clf.predict(X)
You can find further information regarding the metrics here and some explanations on Wikipedia; just remember that for multilabel/multiclass problems, the concepts of true negatives and true positives should be understood differently.
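classification_report covers accuracy, precision, recall (sensitivity) and F1, but not specificity or MCC. A minimal sketch computing all of the requested measures from cross-validated hard predictions (the macro averaging is an assumption; pick whatever fits your setting):
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, confusion_matrix)
from sklearn.model_selection import cross_val_predict

# hard class predictions from the same 10-fold cross-validation
y_pred = cross_val_predict(clf, X, y, cv=10)

print('accuracy :', accuracy_score(y, y_pred))
print('precision:', precision_score(y, y_pred, average='macro'))
print('recall   :', recall_score(y, y_pred, average='macro'))  # sensitivity
print('f1       :', f1_score(y, y_pred, average='macro'))
print('mcc      :', matthews_corrcoef(y, y_pred))

# specificity per class: TN / (TN + FP), from one-vs-rest counts
cm = confusion_matrix(y, y_pred)
for k in range(cm.shape[0]):
    tp = cm[k, k]
    fn = cm[k].sum() - tp
    fp = cm[:, k].sum() - tp
    tn = cm.sum() - tp - fn - fp
    print('specificity class %d: %.3f' % (k, tn / (tn + fp)))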

Should I use feature scaling with polynomial regression with scikit-learn?

I have been playing around with lasso regression on polynomial functions using the code below. My question: should I be doing feature scaling as part of the lasso regression (when attempting to fit a polynomial function)? The R^2 results and the plot produced by the code below suggest not. I'd appreciate any advice on why this is the case, or on whether I have fundamentally stuffed something up. Thanks in advance.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

np.random.seed(0)
n = 15
x = np.linspace(0, 10, n) + np.random.randn(n) / 5
y = np.sin(x) + x / 6 + np.random.randn(n) / 10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)

def answer_regression():
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import Lasso, LinearRegression
    from sklearn.metrics import r2_score
    from sklearn.preprocessing import MinMaxScaler
    import matplotlib.pyplot as plt
    scaler = MinMaxScaler()
    global X_train, X_test, y_train, y_test
    degrees = 12
    poly = PolynomialFeatures(degree=degrees)
    X_train_poly = poly.fit_transform(X_train.reshape(-1, 1))
    X_test_poly = poly.fit_transform(X_test.reshape(-1, 1))
    # Lasso regression model
    X_train_scaled = scaler.fit_transform(X_train_poly)
    X_test_scaled = scaler.transform(X_test_poly)
    # no feature scaling
    linlasso = Lasso(alpha=0.01, max_iter=10000).fit(X_train_poly, y_train)
    y_test_lassopredict = linlasso.predict(X_test_poly)
    Lasso_R2_test_score = r2_score(y_test, y_test_lassopredict)
    # with feature scaling
    linlasso = Lasso(alpha=0.01, max_iter=10000).fit(X_train_scaled, y_train)
    y_test_lassopredict_scaled = linlasso.predict(X_test_scaled)
    Lasso_R2_test_score_scaled = r2_score(y_test, y_test_lassopredict_scaled)
    %matplotlib notebook
    plt.figure()
    plt.scatter(X_test, y_test, label='Test data')
    plt.scatter(X_test, y_test_lassopredict, label='Predict data - No Scaling')
    plt.scatter(X_test, y_test_lassopredict_scaled, label='Predict data - With Scaling')
    return (Lasso_R2_test_score, Lasso_R2_test_score_scaled)

answer_regression()
Your X range is around [0, 10], so the polynomial features will have a much wider range. Without scaling, their weights are already small (because of their large values), so Lasso does not need to set them to zero. If you scale the features, their weights become much larger, and Lasso sets most of them to zero. That's why it makes poor predictions in the scaled case (those features are needed to capture the true trend of y).
You can confirm this by inspecting the weights (linlasso.coef_) for both cases: you will see that most of the weights in the second (scaled) case are set to zero.
It seems your alpha is larger than an optimal value and should be tuned; if you decrease alpha, you will get similar results for both cases.
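To tune alpha, a sketch using LassoCV on the scaled polynomial features (reusing X_train_scaled/X_test_scaled from the snippet above; cv=5 is an assumption):
from sklearn.linear_model import LassoCV

# let cross-validation pick alpha instead of hard-coding 0.01
lasso_cv = LassoCV(cv=5, max_iter=100000).fit(X_train_scaled, y_train)
print('best alpha:', lasso_cv.alpha_)
print('test R^2:', lasso_cv.score(X_test_scaled, y_test))
print('nonzero weights:', (lasso_cv.coef_ != 0).sum())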

How to set 'update_weights' in OpenCV's MLP implementation?

I'm used to using sklearn, whose documentation I find very easy to follow. However, I now need to learn to use OpenCV - in particular, I need to be able to use an MLP classifier, and to update its weights as new training data comes in.
In sklearn, this can be done using the partial_fit method. According to the OpenCV documentation, there is an UPDATE_WEIGHTS flag that can be set, but I can't figure out how to include it in my code.
Here's an MCVE of what I have so far:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.neural_network import MLPClassifier
import numpy as np
import cv2

def softmax(x):
    # row-wise softmax over the raw network outputs
    softmaxes = np.zeros(x.shape)
    for i in range(x.shape[1]):
        softmaxes[:, i] = np.exp(x)[:, i] / np.sum(np.exp(x), axis=1)
    return softmaxes

data = load_breast_cancer()
X = data.data
y = data.target.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1729)

# one-hot encode the targets for the OpenCV MLP
y2 = np.zeros((y_train.shape[0], 2))
y2[:, 0] = np.where(y_train.ravel() == 0, 1, 0)
y2[:, 1] = np.where(y_train.ravel() == 1, 1, 0)

ann = cv2.ml.ANN_MLP_create()
ann.setLayerSizes(np.array([X.shape[1], y2.shape[1]]))
ann.setActivationFunction(cv2.ml.ANN_MLP_SIGMOID_SYM)
ann.train(np.float32(X_train), cv2.ml.ROW_SAMPLE, np.float32(y2))

mlp = MLPClassifier()
mlp.fit(X_train, y_train.ravel())

preds_proba = softmax(ann.predict(np.float32(X_test))[1])
print(roc_auc_score(y_test, preds_proba[:, 1]))
print(roc_auc_score(y_test, mlp.predict_proba(X_test)[:, 1]))
As the scores of the OpenCV classifier and the sklearn one are comparable, I'm pretty confident it's implemented correctly.
How can I modify this code, so that when a new training sample comes in, I can update the weights based on that sample alone, rather than retraining on the entire train set?
The equivalent in sklearn would be:
mlp.partial_fit(X_new_sample, y_new_sample).
Figured out the answer. The syntax is the following:
ann.train(cv2.ml.TrainData_create(np.float32(X), 0, np.float32(y)), flags=1)
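For readability, the magic number 1 can be replaced with the named constant; a sketch with hypothetical X_new_sample/y_new_sample standing in for the incoming data:
# incremental update on new data instead of retraining from scratch
new_data = cv2.ml.TrainData_create(np.float32(X_new_sample), cv2.ml.ROW_SAMPLE,
                                   np.float32(y_new_sample))
ann.train(new_data, flags=cv2.ml.ANN_MLP_UPDATE_WEIGHTS)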

How to get comparable and reproducible results from LogisticRegressionCV and GridSearchCV

I want to score different classifiers with different parameters.
For a speedup, I use LogisticRegressionCV for LogisticRegression (it is at least 2x faster) and plan to use GridSearchCV for the others.
The problem is that while they give me the same C parameter, the AUC ROC scores differ.
I tried fixing many parameters: scorer, random_state, solver, max_iter, tol...
Please look at the example (the real data doesn't matter):
Test data and common part:
from sklearn import datasets

boston = datasets.load_boston()
X = boston.data
y = boston.target
# binarize the target at its mean
y[y <= y.mean()] = 0; y[y > 0] = 1

import numpy as np
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

fold = KFold(n_splits=5, shuffle=True, random_state=777)
GridSearchCV
grid = {
    'C': np.power(10.0, np.arange(-10, 10)),
    'solver': ['newton-cg'],
}
clf = LogisticRegression(penalty='l2', random_state=777, max_iter=10000, tol=10)
gs = GridSearchCV(clf, grid, scoring='roc_auc', cv=fold)
gs.fit(X, y)
print('gs.best_score_:', gs.best_score_)
gs.best_score_: 0.939162082194
LogisticRegressionCV
searchCV = LogisticRegressionCV(
    Cs=list(np.power(10.0, np.arange(-10, 10))),
    penalty='l2',
    scoring='roc_auc',
    cv=fold,
    random_state=777,
    max_iter=10000,
    fit_intercept=True,
    solver='newton-cg',
    tol=10,
)
searchCV.fit(X, y)
print('Max auc_roc:', searchCV.scores_[1].max())
Max auc_roc: 0.970588235294
The newton-cg solver is used just to provide a fixed value; others were tried too.
What am I missing?
P.S. In both cases I also get the warning "/usr/lib64/python3.4/site-packages/sklearn/utils/optimize.py:193: UserWarning: Line Search failed", which I don't understand either. I'll be happy if someone can also explain what it means, but I hope it is not relevant to my main question.
EDIT / UPDATES
Following @joeln's comment, I added the max_iter=10000 and tol=10 parameters. The result does not change in any digit, but the warning disappeared.
Here is a copy of the answer by Tom on the scikit-learn issue tracker:
LogisticRegressionCV.scores_ gives the score for all the folds.
GridSearchCV.best_score_ gives the best mean score over all the folds.
To get the same result, you need to change your code:
print('Max auc_roc:', searchCV.scores_[1].max()) # is wrong
print('Max auc_roc:', searchCV.scores_[1].mean(axis=0).max()) # is correct
By also using the default tol=1e-4 instead of your tol=10, I get:
('gs.best_score_:', 0.939162082193857)
('Max auc_roc:', 0.93915947999923843)
The (small) remaining difference might come from warm starting in LogisticRegressionCV (which is actually what makes it faster than GridSearchCV).
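For reference, a short sketch of the shapes involved (assuming the 5 folds and 20 Cs from the snippets above):
fold_scores = searchCV.scores_[1]       # shape (n_folds, n_Cs), here (5, 20)
print(fold_scores.max())                # best single-fold score (the wrong comparison)
print(fold_scores.mean(axis=0).max())   # best mean-over-folds score, comparable to gs.best_score_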
