I am building a classifier to maximize the margin between positively and negatively labelled points.
I am using sklearn.svm.LinearSVC to do this. I have to find both the weights (a vector, theta) and the intercept (a scalar, theta_0). I also need to calculate the maximum margin, so I wrote the code below.
import numpy as np
import sklearn
from sklearn.svm import LinearSVC
# training data
X_train = np.array([[0,0],[2,0],[3,0],[0,2],[2,2],[5,1],[5,2],[2,4],[4,4],[5,5]])
y_train = [-1,-1,-1,-1,-1,1,1,1,1,1]
classifier = LinearSVC(random_state=0, C=1.0, fit_intercept=True)
classifier.fit(X_train, y_train)
theta = classifier.coef_
theta_0.intercept_
norm = np.linalg.norm(theta)
margin = 2/norm
As per my understanding, LinearSVC is the right estimator for this, though I have seen tutorials in which people use SVC with kernel='linear'.
I am not sure whether I should set the fit_intercept parameter to True. I get different values for theta and theta_0 when I set it to False instead.
Can somebody guide me on the meaning of this parameter, whether the margin calculation is correct, and whether LinearSVC is the right model? Thanks.
This statement is wrong:
theta_0.intercept_
I assume that it should be:
theta_0 = classifier.intercept_
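With that fix in place, the rest of the snippet from the question would read, for example (just a sketch, using the 2/||theta|| margin formula the question already proposes):
theta = classifier.coef_              # weight vector, shape (1, n_features)
theta_0 = classifier.intercept_       # intercept, shape (1,)
margin = 2 / np.linalg.norm(theta)    # margin width computed as 2 / ||theta||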
This block of code:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
X = 'some_data'
y = 'some_target'
penalty = 1.5e-5
A = Ridge(normalize=True, alpha=penalty).fit(X, y)
triggers the following warning:
FutureWarning: 'normalize' was deprecated in version 1.0 and will be removed in 1.2.
If you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior:
from sklearn.pipeline import make_pipeline
model = make_pipeline(StandardScaler(with_mean=False), Ridge())
If you wish to pass a sample_weight parameter, you need to pass it as a fit parameter to each step of the pipeline as follows:
kwargs = {s[0] + '__sample_weight': sample_weight for s in model.steps}
model.fit(X, y, **kwargs)
Set parameter alpha to: original_alpha * n_samples.
warnings.warn(
Ridge(alpha=1.5e-05)
But that code gives me completely different coefficients, which is expected, because normalisation and standardisation are different.
B = make_pipeline(StandardScaler(with_mean=False), Ridge(alpha=penalty))
B[1].fit(B[0].fit_transform(X), y)
Output:
A.coef_[0], B[1].coef_[0]
(124.87330648168594, 125511.75051106009)
The result still does not match if I set alpha = penalty * n_features.
Output:
A.coef_[0], B[1].coef_[0]
(124.87330648168594, 114686.09835548172)
This is even though Ridge() normalizes slightly differently from what I expected; per its documentation of normalize:
the regressor X will be normalized by subtracting mean and dividing by l2-norm
So what is the proper way to use ridge regression with normalization? Given that the l2-norm can seemingly only be obtained by fitting once, modifying the data and fitting again, nothing comes to mind for how to do this with scikit-learn's ridge regression, especially from version 1.2 onwards.
Prepare the data for experimenting:
import pandas as pd
url = 'https://drive.google.com/file/d/1bu64NqQkG0YR8G2CQPkxR1EQUAJ8kCZ6/view?usp=sharing'
url = 'https://drive.google.com/uc?id=' + url.split('/')[-2]
data = pd.read_csv(url, index_col=0)
X = data.iloc[:,:15]
y = data['target']
The difference is that the coefficients reported with normalize=True are to be applied directly to the unscaled inputs, whereas the pipeline approach applies its coefficients to the model's inputs, which are the scaled features.
You can "normalize" (an unfortunate overloading of the word) the coefficients by multiplying/dividing by the features' standard deviation. Together with the change to penalty suggested in the future warning, I get the same outputs:
np.allclose(A.coef_, B[1].coef_ / B[0].scale_)
# True
(I've tested using sklearn.datasets.load_diabetes.)
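For completeness, a minimal end-to-end sketch of that comparison (assuming a scikit-learn version older than 1.2 so that normalize=True still runs, and load_diabetes as the data; the alpha rescaling by n_samples follows the FutureWarning above):
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
penalty = 1.5e-5

# old behaviour: coefficients apply to the raw, unscaled features
A = Ridge(normalize=True, alpha=penalty).fit(X, y)

# pipeline equivalent: alpha scaled by n_samples, coefficients apply to the scaled features
B = make_pipeline(StandardScaler(with_mean=False),
                  Ridge(alpha=penalty * len(y))).fit(X, y)

# express the pipeline's coefficients in raw-feature units before comparing
print(np.allclose(A.coef_, B[-1].coef_ / B[0].scale_))  # True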
I am trying to get my head around how to use KNeighborsTransformer correctly, so I am using the Iris dataset to test it.
However, I find that when I use KNeighborsTransformer before the KNeighborsClassifier I get different results than using KNeighborsClassifier directly.
When I plot the decision boundaries, they are similar, but different.
I have given the metric and weights mode explicitly, so that cannot be the problem.
Why do I get this difference?
Does it have something to do with whether they count a point as its own nearest neighbour?
Or does it have something to do with the metric='precomputed'?
Below is the code I use to consider the two classifiers.
import numpy as np
from sklearn import neighbors, datasets
from sklearn.pipeline import make_pipeline
# import data
iris = datasets.load_iris()
# We only take the first two features.
X = iris.data[:, :2]
y = iris.target
n_neighbors = 15
knn_metric = 'minkowski'
knn_mode = 'distance'
# With estimator with KNeighborsTransformer
estimator = make_pipeline(
    neighbors.KNeighborsTransformer(
        # one extra neighbour should already be computed when mode == 'distance',
        # and the extra neighbour should be filtered out by the following KNeighborsClassifier
        n_neighbors=n_neighbors + 1,
        metric=knn_metric,
        mode=knn_mode),
    neighbors.KNeighborsClassifier(
        n_neighbors=n_neighbors, metric='precomputed'))
estimator.fit(X, y)
print(estimator.score(X, y)) # 0.82
# with just KNeighborsClassifier
clf = neighbors.KNeighborsClassifier(
    n_neighbors,
    weights=knn_mode,
    metric=knn_metric)
clf.fit(X, y)
print(clf.score(X, y)) # 0.9266666666666666
Your pipeline approach uses the default uniform vote, but your direct approach uses the distance-weighted vote. Making them match (either both distance or both uniform) almost makes the behaviour match; the remaining difference appears to be in the tie-breaking of nearest neighbours. I'm not sure yet why the tie-breaking happens differently in the two cases, but it is likely not a big issue with more realistic datasets.
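A sketch of that change using the question's own variables; the only difference from the question's pipeline is weights='distance' on the classifier step:
estimator = make_pipeline(
    neighbors.KNeighborsTransformer(
        n_neighbors=n_neighbors + 1,
        metric=knn_metric,
        mode='distance'),
    neighbors.KNeighborsClassifier(
        n_neighbors=n_neighbors,
        metric='precomputed',
        weights='distance'))        # now matches the direct classifier's vote
estimator.fit(X, y)

clf = neighbors.KNeighborsClassifier(
    n_neighbors, weights='distance', metric=knn_metric).fit(X, y)

# the two scores should now agree up to the tie-breaking of equidistant neighbours
print(estimator.score(X, y), clf.score(X, y))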
I am using scikit-learn's CalibratedClassifierCV with GaussianNB() to run binary classification on some data.
I have verified the inputs in .fit(X_train, y_train) and they have matching dimensions and both pass the np.isfinite test.
My problem is when I run .predict_proba(X_test).
For some of the samples, the probabilities returned are array([-inf, inf]), and I can't really understand why.
This came to light when I tried running brier_score_loss on the resulting predictions, and it threw a ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I have added some data to this Google drive link.
It's larger than what I wanted but I couldn't get consistent reproduction with smaller datasets.
The code for reproduction is below.
There is some randomness in the code, so if no infinities are found, try running it again; in my experiments it finds them on the first try.
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import StratifiedShuffleSplit
import numpy as np
loaded = np.load('data.npz')
X = loaded['X']
y = loaded['y']
num = 2*10**4
sss = StratifiedShuffleSplit(n_splits = 10, test_size = 0.2)
cal_classifier = CalibratedClassifierCV(GaussianNB(), method = 'isotonic', cv = sss)
classifier_fit = cal_classifier.fit(X[:num], y[:num])
predicted_probabilities = classifier_fit.predict_proba(X[num:num+num//4])[:,1]
predicted_probabilities[np.argwhere(~np.isfinite(predicted_probabilities))]
It seems that the isotonic regression (used by CalibratedClassifierCV) is producing the inf values.
More precisely, they come from the linear interpolation used inside IsotonicRegression:
declared here - https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/isotonic.py#L266
called here - https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/isotonic.py#L389
The interpolation, when called on very small values (below a certain threshold but greater than 0), returns inf.
In debug mode, self.f_([0, 3.2392382784e-313]) returns [0.10430463576158941, inf], which is strange behaviour. The implementation of interpolate.interp1d probably doesn't handle this kind of "super-small" value. Hope it helps.
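This does not fix the underlying interpolation issue, but as a defensive workaround you could sanitise the calibrated probabilities before scoring, along these lines (a sketch reusing the question's variables):
probs = classifier_fit.predict_proba(X[num:num + num // 4])[:, 1]
print((~np.isfinite(probs)).sum(), 'non-finite calibrated probabilities')
# map NaN/-inf/inf back into [0, 1] so that e.g. brier_score_loss can run
probs = np.clip(np.nan_to_num(probs, nan=0.0, posinf=1.0, neginf=0.0), 0.0, 1.0)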
I want to score different classifiers with different parameters.
For speed, I use LogisticRegressionCV for LogisticRegression (it is at least 2x faster) and plan to use GridSearchCV for the others.
The problem is that while both give me the same best C parameter, they do not give the same ROC AUC score.
I tried fixing many parameters, such as the scorer, random_state, solver, max_iter, tol...
Please look at the example (the real data does not matter):
Test data and common part:
from sklearn import datasets
boston = datasets.load_boston()
X = boston.data
y = boston.target
y[y <= y.mean()] = 0; y[y > 0] = 1
import numpy as np
from sklearn.cross_validation import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import LogisticRegressionCV
fold = KFold(len(y), n_folds=5, shuffle=True, random_state=777)
GridSearchCV
grid = {
    'C': np.power(10.0, np.arange(-10, 10)),
    'solver': ['newton-cg'],
}
clf = LogisticRegression(penalty='l2', random_state=777, max_iter=10000, tol=10)
gs = GridSearchCV(clf, grid, scoring='roc_auc', cv=fold)
gs.fit(X, y)
print ('gs.best_score_:', gs.best_score_)
gs.best_score_: 0.939162082194
LogisticRegressionCV
searchCV = LogisticRegressionCV(
Cs=list(np.power(10.0, np.arange(-10, 10)))
,penalty='l2'
,scoring='roc_auc'
,cv=fold
,random_state=777
,max_iter=10000
,fit_intercept=True
,solver='newton-cg'
,tol=10
)
searchCV.fit(X, y)
print ('Max auc_roc:', searchCV.scores_[1].max())
Max auc_roc: 0.970588235294
The newton-cg solver is used just to provide a fixed value; I tried others too.
What did I forget?
P.S. In both cases I also get the warning "/usr/lib64/python3.4/site-packages/sklearn/utils/optimize.py:193: UserWarning: Line Search failed
warnings.warn('Line Search failed')", which I can't understand either. I would be happy if someone could also explain what it means, but I hope it is not relevant to my main question.
EDIT / UPDATE
Following @joeln's comment, I added the max_iter=10000 and tol=10 parameters too. This does not change the result by a single digit, but the warning disappeared.
Here is a copy of the answer by Tom on the scikit-learn issue tracker:
LogisticRegressionCV.scores_ gives the score for all the folds.
GridSearchCV.best_score_ gives the best mean score over all the folds.
To get the same result, you need to change your code:
print('Max auc_roc:', searchCV.scores_[1].max()) # is wrong
print('Max auc_roc:', searchCV.scores_[1].mean(axis=0).max()) # is correct
By also using the default tol=1e-4 instead of your tol=10, I get:
('gs.best_score_:', 0.939162082193857)
('Max auc_roc:', 0.93915947999923843)
The (small) remaining difference might come from warm starting in LogisticRegressionCV (which is actually what makes it faster than GridSearchCV).
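To make the shape issue concrete, a small sketch (assuming the searchCV object from the question):
fold_by_C = searchCV.scores_[1]        # shape (n_folds, n_Cs): one row of scores per fold
print(fold_by_C.shape)
print(fold_by_C.mean(axis=0).max())    # mean over folds, then best C: comparable to gs.best_score_
print(fold_by_C.max())                 # best single fold/C pair: the number reported in the question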
I plan on using scikit svm for class prediction.
I have a two-class dataset consisting of about 100 experiments. Each experiment encapsulates my data-points (vectors) + classification.
Training an SVM according to http://scikit-learn.org/stable/modules/svm.html should be straightforward.
I will have to put all vectors in an array, generate another array with the corresponding class labels, and train the SVM. However, in order to run leave-one-out error estimation, I need to leave out a specific subset of vectors: one experiment.
How do I achieve that with the available score function?
Cheers,
EL
You could manually train on everything but the one observation, using numpy indexing to drop it out. Then you can use any of sklearn's helpers to evaluate the classification. For example:
import numpy as np
from sklearn import svm
clf = svm.SVC(...)
idx = np.arange(len(observations))
preds = np.zeros(len(observations))
for i in idx:
    is_train = idx != i
    clf.fit(observations[is_train, :], labels[is_train])
    preds[i] = clf.predict(observations[[i], :])[0]  # index with [[i]] to keep a 2D array for predict
Alternatively, scikit-learn has a helper to do leave-one-out, and another helper to get cross-validation scores:
from sklearn import svm, cross_validation
clf = svm.SVC(...)
loo = cross_validation.LeaveOneOut(len(observations))
was_right = cross_validation.cross_val_score(clf, observations, labels, cv=loo)
total_acc = np.mean(was_right)
See the user's guide for more. cross_val_score actually returns a score for each fold (which is a little strange IMO), but since we have one fold per observation, this will just be 0 if it was wrong and 1 if it was right.
Of course, leave-one-out is very slow and has terrible statistical properties to boot, so you should probably use KFold instead.
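Note that the cross_validation module has since been removed (scikit-learn 0.20+), so with a current version the same idea would look roughly like this, with X_all and y_all standing in for the observations and labels arrays:
import numpy as np
from sklearn import svm
from sklearn.model_selection import LeaveOneOut, cross_val_score

clf = svm.SVC(kernel='linear')    # the kernel choice is only an example
was_right = cross_val_score(clf, X_all, y_all, cv=LeaveOneOut())
total_acc = np.mean(was_right)    # leave-one-out accuracy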