Extract decision boundary with scikit-learn linear SVM - python

I have a very simple 1D classification problem: a list of values [0, 0.5, 2] and their associated classes [0, 1, 2]. I would like to get the classification boundaries between those classes.
Adapting the iris example (for visualization purposes), getting rid of the non-linear models:
X = np.array([[x, 1] for x in [0, 0.5, 2]])
Y = np.array([1, 0, 2])
C = 1.0 # SVM regularization parameter
svc = svm.SVC(kernel='linear', C=C).fit(X, Y)
lin_svc = svm.LinearSVC(C=C).fit(X, Y)
Gives the following result:
LinearSVC is returning junk (why?), but the SVC with a linear kernel works fine. So I would like to get the boundary values, which you can guess graphically: ~0.25 and ~1.25.
That's where I'm lost: svc.coef_ returns
array([[ 0.5       ,  0.        ],
       [-1.33333333,  0.        ],
       [-1.        ,  0.        ]])
while svc.intercept_ returns array([-0.125     ,  1.66666667,  1.        ]).
This is not explicit.
I must be missing something silly: how do I obtain those values? They seem obvious to compute, and it would be ridiculous to iterate over the x-axis to find the boundary...

I had the same question and eventually found the solution in the sklearn documentation.
Given the weights W=svc.coef_[0] and the intercept I=svc.intercept_ , the decision boundary is the line
y = a*x - b
with
a = -W[0]/W[1]
b = I[0]/W[1]
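For the one-dimensional data in the question, the second feature is the constant 1 and its coefficient is 0, so dividing by W[1] is not possible there. A minimal sketch for that case, using the numbers printed in the question rather than a refitted model, and assuming that each row of coef_ and intercept_ belongs to one pairwise (one-vs-one) classifier in the order 0 vs 1, 0 vs 2, 1 vs 2: each boundary is the x where W[0]*x + I = 0.

```python
import numpy as np

# Values copied from the question (svc.coef_ and svc.intercept_).
coef = np.array([[0.5, 0.0],
                 [-1.33333333, 0.0],
                 [-1.0, 0.0]])
intercept = np.array([-0.125, 1.66666667, 1.0])

# Only the first feature varies, so solve coef[k, 0] * x + intercept[k] = 0.
boundaries = -intercept / coef[:, 0]

# Assumed row order for three classes: (0 vs 1), (0 vs 2), (1 vs 2).
pairs = [(0, 1), (0, 2), (1, 2)]
for (a, b), x0 in zip(pairs, boundaries):
    print(f"class {a} vs class {b}: boundary at x = {x0:.2f}")
```

The first two values, 0.25 and 1.25, are the boundaries guessed from the plot. The third, 1.0, belongs to the class 1 vs class 2 classifier; under one-vs-one voting it should not show up as a visible boundary here, because class 0 wins both of its pairwise votes for x between 0.25 and 1.25.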

Exact boundary calculated from coef_ and intercept_
I think this is a great question, and I haven't been able to find a general answer to it anywhere in the documentation. This site really needs LaTeX, but anyway, I'll do my best without it...
In general, a hyperplane is defined by its unit normal and an offset from the origin. So we hope to find some decision function of the form: x dot n + d > 0 (where the > may of course be replaced with >=).
In the case of the SVM Margins Example, we can manipulate the equation they start with to clarify its conceptual significance. First, let's establish the notational convenience of writing coef for coef_[0] and intercept for intercept_[0], since for a two-class problem coef_ has a single row and intercept_ a single value. Then some simple substitution yields the equation:
y + coef[0]*x/coef[1] + intercept/coef[1] = 0
Multiplying through by coef[1], we obtain
coef[1]*y + coef[0]*x + intercept = 0
And so we see that the coefficients and intercept function roughly as their names would imply. Applying one quick generalization of notation should make the answer clear - we will replace x and y with a single vector x.
coef[0]*x[0] + coef[1]*x[1] + intercept = 0
In general, the coef_ and intercept_ members of the svm classifier will have dimension matching the data set it was trained on, so we can extrapolate this equation to data of arbitrary dimension. And to avoid leading anyone astray, here is the final generalized decision boundary using the original variable names from the svm:
coef_[0][0]*x[0] + coef_[0][1]*x[1] + coef_[0][2]*x[2] + ... + coef_[0][n-1]*x[n-1] + intercept_[0] = 0
where the dimension of the data is n.
Or more tersely:
sum(coef_[0][i]*x[i]) + intercept_[0] = 0
where i sums over the range of the dimension of the input data.
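The equation above can be checked numerically. The following is a small sketch, assuming a two-class SVC with a linear kernel fitted on the same make_blobs data used in demo 1 below; it compares the hand-computed sum with decision_function and, for the 2D case, rewrites the boundary as a line y = slope*x + offset.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

# Two separable blobs, as in the scikit-learn margins example.
X, y = make_blobs(n_samples=40, centers=2, random_state=6)
clf = svm.SVC(kernel="linear", C=1000).fit(X, y)

# sum(coef_[0][i] * x[i]) + intercept_[0], evaluated for every sample at once.
manual = X @ clf.coef_[0] + clf.intercept_[0]
print(np.allclose(manual, clf.decision_function(X)))

# In 2D the boundary coef[0]*x + coef[1]*y + intercept = 0 is a line.
w0, w1 = clf.coef_[0]
b = clf.intercept_[0]
slope, offset = -w0 / w1, -b / w1
print(f"boundary: y = {slope:.3f} * x + {offset:.3f}")
```

The sign of the sum decides the predicted class: positive values map to clf.classes_[1], negative values to clf.classes_[0].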

Get decision line from SVM, demo 1
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.datasets import make_blobs
# we create 40 separable points
X, y = make_blobs(n_samples=40, centers=2, random_state=6)
# fit the model, don't regularize for illustration purposes
clf = svm.SVC(kernel='linear', C=1000)
clf.fit(X, y)
plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=plt.cm.Paired)
# plot the decision function
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()
# create grid to evaluate model
xx = np.linspace(xlim[0], xlim[1], 30)
yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = clf.decision_function(xy).reshape(XX.shape)
# plot decision boundary and margins
ax.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1], alpha=0.5,
linestyles=['--', '-', '--'])
# plot support vectors
ax.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=100,
linewidth=1, facecolors='none')
plt.show()
Approximate the separating n-1 dimensional hyperplane of an SVM, Demo 2
import numpy as np
from sklearn import svm
from sklearn.svm import SVC
import matplotlib.pyplot as plt
np.random.seed(0)
mean1, cov1, n1 = [1, 5], [[1,1],[1,2]], 200 # 200 samples of class 1
x1 = np.random.multivariate_normal(mean1, cov1, n1)
y1 = np.ones(n1, dtype=int) # np.int was removed from NumPy; use the builtin int
mean2, cov2, n2 = [2.5, 2.5], [[1,0],[0,1]], 300 # 300 samples of class 0
x2 = np.random.multivariate_normal(mean2, cov2, n2)
y2 = np.zeros(n2, dtype=int)
X = np.concatenate((x1, x2), axis=0) # concatenate the class 1 and class 0 samples
y = np.concatenate((y1, y2))
clf = svm.SVC() # the default kernel is 'rbf', so the boundary is not a straight line
# fit the decision boundary between the two clouds of data, should be fast
clf.fit(X, y)
# Output of fit() in the original session:
# SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
# decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
# max_iter=-1, probability=False, random_state=None, shrinking=True,
# tol=0.001, verbose=False)
production_point = [1., 2.5]
answer = clf.predict([production_point])
print("Answer: " + str(answer))
plt.plot(x1[:,0], x1[:,1], 'ob', x2[:,0], x2[:,1], 'or', markersize = 5)
colormap = ['r', 'b']
color = colormap[answer[0]]
plt.plot(production_point[0], production_point[1], 'o' + str(color), markersize=20)
#I want to draw the decision lines
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()
xx = np.linspace(xlim[0], xlim[1], 30)
yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = clf.decision_function(xy).reshape(XX.shape)
ax.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1], alpha=0.5,
linestyles=['--', '-', '--'])
plt.show()
These hyperplanes are all straight as an arrow; they're just straight in a higher-dimensional space that can't be visualized by mere mortals confined to 3 dimensions. The kernel function (RBF by default, as in demo 2) implicitly maps the data into that higher-dimensional space, and the flat boundary found there is then projected back into the visible dimensions for your viewing pleasure, where it looks curved. Here is a video trying to impart some intuition of what is going on in demo 2: https://www.youtube.com/watch?v=3liCbRZPrZA

Related

Noise-level estimation with scikit-learn GPR package for multidimensional data

I am trying to estimate the noise level for Gaussian Process. Scikit-learn gives an example on this website https://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gpr_noisy.html. The last chunk of code seems to do exactly what I need, but it is only for the 1-d case
from matplotlib.colors import LogNorm
length_scale = np.logspace(-2, 4, num=50)
noise_level = np.logspace(-2, 1, num=50)
length_scale_grid, noise_level_grid = np.meshgrid(length_scale, noise_level)
log_marginal_likelihood = [
gpr.log_marginal_likelihood(theta=np.log([0.36, scale, noise]))
for scale, noise in zip(length_scale_grid.ravel(), noise_level_grid.ravel())
]
log_marginal_likelihood = np.reshape(
log_marginal_likelihood, newshape=noise_level_grid.shape
)
vmin, vmax = (-log_marginal_likelihood).min(), 50
level = np.around(np.logspace(np.log10(vmin), np.log10(vmax), num=50), decimals=1)
plt.contour(
length_scale_grid,
noise_level_grid,
-log_marginal_likelihood,
levels=level,
norm=LogNorm(vmin=vmin, vmax=vmax),
)
plt.colorbar()
plt.xscale("log")
plt.yscale("log")
plt.xlabel("Length-scale")
plt.ylabel("Noise-level")
plt.title("Log-marginal-likelihood")
plt.show()
However, I have inputs X which is two-dimensional, so I have two length scales to optimize. I modified the code as follows:
from matplotlib.colors import LogNorm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
kernel = 1.0 * RBF(length_scale=np.array([1e-1,1e-1])) + WhiteKernel(
    noise_level=1e-2, noise_level_bounds=(1e-10, 1e1)
)
gpr = GaussianProcessRegressor(kernel=kernel, alpha=0.0)
gpr.fit(T, y)
res=gpr.get_params
length_scale1 = np.logspace(-2, 4, num=50)
length_scale2 = np.logspace(-2, 4, num=50)
noise_level = np.logspace(-2, 1, num=50)
length_scale_grid1,length_scale_grid2, noise_level_grid = np.meshgrid(length_scale1, length_scale2,noise_level)
log_marginal_likelihood = [
gpr.log_marginal_likelihood(theta=np.log([0.36, scale1, scale2, noise]))
for scale1, scale2, noise in zip(length_scale_grid1.ravel(),length_scale_grid2.ravel(), noise_level_grid.ravel())
]
log_marginal_likelihood = np.reshape(
log_marginal_likelihood, newshape=noise_level_grid.shape
)
vmin, vmax = (-log_marginal_likelihood).min(), 50
level = np.around(np.logspace(np.log10(vmin), np.log10(vmax), num=50), decimals=1)
The result seems wrong since it always gives the largest length scale and noise estimation. I looked up the source code of scikit-learn and it says
theta : array-like of shape (n_kernel_params,)
So my questions are: 1) for the 1-D case, theta should be of length 2 (one entry for the length scale and one for the noise), so why is there a third value, 0.36, in np.log([0.36, scale, noise])? 2) Is my modification correct? If not, how should I change my code for the 2-D case?
Thank you very much!
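On the first point, one way to see where the extra entry comes from is to inspect the kernel itself. The sketch below assumes the kernels are built as in the snippets above; the leading `1.0 *` creates a constant (amplitude) kernel whose value is itself a tunable hyperparameter, so theta holds the log of the amplitude, the length scale(s), and the noise level, in that order. The variable names are only illustrative.

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# One shared length scale (the 1-D example from the documentation).
kernel_iso = 1.0 * RBF(length_scale=1e-1) + WhiteKernel(noise_level=1e-2)
# One length scale per input dimension (the 2-D modification).
kernel_aniso = 1.0 * RBF(length_scale=np.array([1e-1, 1e-1])) + WhiteKernel(noise_level=1e-2)

for label, kernel in [("isotropic", kernel_iso), ("anisotropic", kernel_aniso)]:
    names = [h.name for h in kernel.hyperparameters]
    # theta stores log-transformed values, so exp() recovers the originals.
    print(label, names, np.exp(kernel.theta))
```

With two length scales theta has four entries, which matches the `np.log([0.36, scale1, scale2, noise])` call in the modified code. Note that the three-axis grid then has 50**3 points, so the likelihood is evaluated 125,000 times and the result is a 3-D array that cannot be passed to a 2-D contour plot directly.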

How to plot 2 trendlines on a single scatterplot? (python)

I want to plot 2 trendlines for one scatterplot using Matplotlib in Python but I don't know how. The graph should be similar to this target plot (from here, fig.2).
I managed to plot 1 trendline on a scatterplot here but can't figure out how to plot another trend.
Below is what I have tried so far. This approach worked for other parameters that I plotted, but not for this case, which led me to conclude that it is not quite correct:
X = vO2.reshape(-1, 1)
Y = ve.reshape(-1, 1)
linear_regressor = LinearRegression()
linear_regressor.fit(X, Y)
y_pred = linear_regressor.predict(X)
x_pred = linear_regressor.predict(Y)
plt.scatter(X, Y)
plt.plot(X, y_pred, '-*',label="O2")
plt.plot(x_pred, Y, '-*',label="vent")
plt.xlabel("VO2 (L/min)")
plt.ylabel("VE (L/min)")
plt.show()
and also
z1 = np.polyfit(vO2, ve, 1)
p1 = np.poly1d(z1)
z2 = np.polyfit(ve, vO2, 1)
p2 = np.poly1d(z2)
plt.scatter(vO2, ref_vent, label='original')
plt.plot(vO2, p1(vO2), label='trendline')
plt.plot(ve, p2(ve), label='trendline')
plt.show()
which also didn't look similar to the target plot.
I don't know how to continue. Thanks in advance!
example dataset:
vo2 = [1.673925 1.9015125 1.981775 2.112875 2.1112625 2.086375 2.13475
2.1777 2.176975 2.1857125 2.258925 2.2718375 2.3381 2.3330875
2.353725 2.4879625 2.448275 2.4829875 2.5084375 2.511275 2.5511
2.5678375 2.5844625 2.6101875 2.6457375 2.6602125 2.6939875 2.7210625
2.720475 2.767025 2.751375 2.7771875 2.776025 2.7319875 2.564
2.3977625 2.4459125 2.42965 2.401275 2.387175 2.3544375]
ve = [ 3.93125 7.1975 9.04375 14.06125 14.11875 13.24375
14.6625 15.3625 15.2 15.035 17.7625 17.955
19.2675 19.875 21.1575 22.9825 23.75625 23.30875
25.9925 25.6775 27.33875 27.7775 27.9625 29.35
31.86125 32.2425 33.7575 34.69125 36.20125 38.6325
39.4425 42.085 45.17 47.18 42.295 37.5125
38.84375 37.4775 34.20375 33.18 32.67708333]
OK, so you need to find the point where the slope of the line changes. I tried the 2nd derivative, but it was noisy and I couldn't find the right spot.
Another way is to try all possible split points, fit a regression line to the left and to the right of each, and pick the pair with the best fit (r2 coefficient). Give this code a try. It is not complete: I do not know how to force the regression lines to pass through the point in the middle. It might also be better to work with interpolated data if there are not enough data points.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
vo2 = [1.673925,1.9015125,1.981775,2.112875,2.1112625,2.086375,2.13475,2.1777,2.176975,2.1857125,2.258925,2.2718375,2.3381,2.3330875,2.353725,2.4879625,2.448275,2.4829875,2.5084375,2.511275,2.5511,2.5678375,2.5844625,2.6101875,2.6457375,2.6602125,2.6939875,2.7210625,2.720475,2.767025,2.751375,2.7771875,2.776025,2.7319875,2.564,2.3977625,2.4459125,2.42965,2.401275,2.387175,2.3544375]
ve = [ 3.93125,7.1975,9.04375,14.06125,14.11875,13.24375,14.6625,15.3625,15.2,15.035,17.7625,17.955,19.2675,19.875,21.1575,22.9825,23.75625,23.30875,25.9925,25.6775,27.33875,27.7775,27.9625,29.35,31.86125,32.2425,33.7575,34.69125,36.20125,38.6325,39.4425,42.085,45.17,47.18,42.295,37.5125,38.84375,37.4775,34.20375,33.18,32.67708333]
x = np.array(vo2)
y = np.array(ve)
sort_idx = x.argsort()
x = x[sort_idx]
y = y[sort_idx]
assert len(x) == len(y)
def fit(x, y):
    # least-squares line through (x, y) and its r2 score
    p = np.polyfit(x, y, 1)
    f = np.poly1d(p)
    r2 = r2_score(y, f(x))
    return p, f, r2
skip = 5 # minimal length of split data
r2 = [0] * len(x)
funcs = {}
for i in range(len(x)):
    if i < skip or i > len(x) - skip:
        continue
    _, f_left, r2_left = fit(x[:i], y[:i])
    _, f_right, r2_right = fit(x[i:], y[i:])
    r2[i] = r2_left * r2_right
    funcs[i] = (f_left, f_right)
split_ix = np.argmax(r2) # index of split
f_left,f_right = funcs[split_ix]
print(f"split point index: {split_ix}, x: {x[split_ix]}, y: {y[split_ix]}")
xd = np.linspace(min(x), max(x), 100)
plt.plot(x, y, "o")
plt.plot(xd, f_left(xd))
plt.plot(xd, f_right(xd))
plt.plot(x[split_ix], y[split_ix], "x")
plt.show()
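One possible way to make the two lines meet at the split point is to fit a single continuous piecewise-linear model instead of two independent lines. This is a different technique from the code above, sketched here under the assumption that one breakpoint is enough: the model is y = a + b*x + c*max(0, x - x0), so the slope is b left of x0 and b + c right of it. The helper names fit_hinge and best_breakpoint are hypothetical, and the demo uses synthetic data rather than the vo2/ve values.

```python
import numpy as np

def fit_hinge(x, y, x0):
    """Least-squares fit of y = a + b*x + c*max(0, x - x0); continuous at x0."""
    A = np.column_stack([np.ones_like(x), x, np.maximum(0.0, x - x0)])
    params, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = float(np.sum((A @ params - y) ** 2))
    return params, sse

def best_breakpoint(x, y, skip=5):
    """Try every interior x value as breakpoint and keep the smallest squared error."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    candidates = x[skip:len(x) - skip]
    fits = [fit_hinge(x, y, x0) for x0 in candidates]
    k = int(np.argmin([sse for _, sse in fits]))
    return candidates[k], fits[k][0]

# Synthetic check: slope 1 up to x = 4, slope 3 afterwards.
x_demo = np.linspace(0.0, 10.0, 41)
y_demo = 1.0 + 1.0 * x_demo + 2.0 * np.maximum(0.0, x_demo - 4.0)
x0, (a, b, c) = best_breakpoint(x_demo, y_demo)
print(f"breakpoint: {x0:.2f}, left slope: {b:.2f}, right slope: {b + c:.2f}")
```

To plot the result, evaluate a + b*xd + c*np.maximum(0, xd - x0) on a dense grid xd; the curve bends at x0 without a jump.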

Python package for Bayesian optimization among a given set of samples

I have a set of samples and their corresponding target values. Below is the visualization:
I am trying to use Bayesian optimization to find the sample with maximum target value. My question is if someone knows a python package that does this. I was looking at BayesianOptimization package and it seems that it can perform optimization over an interval, but not necessarily only among the samples.
Edit: Here is a sample code:
import matplotlib.pyplot as plt
import numpy as np
from bayes_opt import BayesianOptimization
mean = (1, 2)
cov = [[1, 0], [0, 1]]
my_x, my_y = np.random.multivariate_normal(mean, cov, 1000).T
cmap_data = -my_x ** 2 - (my_y - 1) ** 2 + 1
search = []
def black_box_function(x, y):
    a = np.hstack((x, y))
    search.append(a)
    plt.scatter(my_x, my_y, marker='.', s=1, c=np.array(cmap_data))
    plt.axis('equal')
    # plt.scatter(np.array(search)[:, 0], np.array(search)[:, 1], marker='.', s=20, c='black')
    plt.plot(np.array(search)[:, 0], np.array(search)[:, 1], marker='.', c='black', linewidth=0.5)
    plt.show()
    return -x ** 2 - (y - 1) ** 2 + 1
# -- optimization
pbounds = {'x': (min(my_x), max(my_x)), 'y': (min(my_y), max(my_y))}
optimizer = BayesianOptimization(
f=black_box_function,
pbounds=pbounds,
random_state=1,
)
for i in range(1000):
    optimizer.probe(
        params=[my_x[i], my_y[i]],
        lazy=True,
    )
optimizer.maximize(
init_points=0,
n_iter=0,
)
But it seems optimizer.probe just evaluates the values (to be used in future iterations), and it is not an actual iteration of optimization.
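For reference, restricting the search to a finite set of samples can be sketched without the BayesianOptimization package. The code below is not that package's API; it is a minimal illustration that assumes a Gaussian process surrogate from scikit-learn and an upper-confidence-bound rule (mean + kappa * std) evaluated only at the given samples. The function name maximize_over_samples and its parameters are hypothetical.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def maximize_over_samples(candidates, objective, n_init=5, n_iter=25, kappa=2.0, seed=0):
    """Pick points to evaluate from `candidates` only, guided by a GP surrogate."""
    rng = np.random.default_rng(seed)
    # Start from a few randomly chosen samples.
    evaluated = [int(i) for i in rng.choice(len(candidates), size=n_init, replace=False)]
    values = [float(objective(candidates[i])) for i in evaluated]
    for _ in range(n_iter):
        kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3)
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=seed)
        gp.fit(candidates[evaluated], np.array(values))
        mean, std = gp.predict(candidates, return_std=True)
        score = mean + kappa * std      # upper confidence bound
        score[evaluated] = -np.inf      # never evaluate the same sample twice
        nxt = int(np.argmax(score))
        evaluated.append(nxt)
        values.append(float(objective(candidates[nxt])))
    best = int(np.argmax(values))
    return evaluated[best], values[best], evaluated

# Samples and target shaped like the ones in the question.
rng = np.random.default_rng(1)
samples = rng.multivariate_normal([1, 2], [[1, 0], [0, 1]], size=200)

def target(p):
    return -p[0] ** 2 - (p[1] - 1) ** 2 + 1

best_idx, best_val, visited = maximize_over_samples(samples, target)
print(f"best sample {samples[best_idx]} with value {best_val:.3f} after {len(visited)} evaluations")
```

Each iteration is a real optimization step in the sense asked about: the surrogate is refitted and the next sample is chosen by the acquisition rule, but only indices of existing samples can ever be selected.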

having ambiguity using customized kernel for `sklearn.svm` regressor

I want to use a customized kernel function in the Epsilon-Support Vector Regression module of sklearn.svm. I found this code as an example of a customized kernel for SVC in the scikit-learn documentation:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features. We could
# avoid this ugly slicing by using a two-dim dataset
Y = iris.target
def my_kernel(X, Y):
    """
    We create a custom kernel:

                 (2  0)
    k(X, Y) = X  (    ) Y.T
                 (0  1)
    """
    M = np.array([[2, 0], [0, 1.0]])
    return np.dot(np.dot(X, M), Y.T)
h = .02 # step size in the mesh
# we create an instance of SVM and fit out data.
clf = svm.SVC(kernel=my_kernel)
clf.fit(X, Y)
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)
# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired, edgecolors='k')
plt.title('3-Class classification using Support Vector Machine with custom'
' kernel')
plt.axis('tight')
plt.show()
I want to define some function like:
def my_new_kernel(X):
    a,b,c = (random.randint(0,100) for _ in range(3))
    # imagine f1,f2,f3 are functions like sin(x), cos(x), ...
    ans = a*f1(X) + b*f2(X) + c*f3(X)
    return ans
My understanding of the kernel method is that the kernel is a function that takes the feature matrix (X) as input and returns a matrix of shape (n, 1). The SVM then appends the returned matrix to the feature columns and uses the result to classify the labels Y.
In the code above the kernel is used in svm.fit function and I can't figure out what are X and Y inputs of kernel and their shapes. if X and Y (inputs of my_kernel method) are the features and label of dataset, so then how does the kernel work for test data where we have no labels?
Specifically, I want to use SVM on a dataset of shape (10000, 6) (5 feature columns and 1 label column). If I use my_new_kernel, what would its inputs and output be, and what shapes would they have?
Your exact issue is quite unclear; here are some remarks which may be helpful nevertheless.
I can't figure out what are X and Y inputs of kernel and their shapes. if X and Y (inputs of my_kernel method) are the features and label of dataset,
Not quite: the arguments of fit and the arguments of the kernel are two different things. The X and y passed to fit are the features and labels; from the documentation of fit:
Parameters:
X : {array-like, sparse matrix}, shape (n_samples, n_features)
Training vectors, where n_samples is the number of samples and n_features is the number of features. For kernel="precomputed", the expected shape of X is (n_samples, n_samples).
y : array-like, shape (n_samples,)
Target values (class labels in classification, real numbers in regression)
exactly like they are for the default available kernels. The X and Y arguments of my_kernel, on the other hand, are both feature matrices, of shape (n_samples_1, n_features) and (n_samples_2, n_features), and the kernel must return a matrix of shape (n_samples_1, n_samples_2); the labels are never passed to the kernel.
so then how does the kernel work for test data where we have no labels?
A close look at the code you have provided will reveal that the labels Y are indeed used only during training (fit); they are of course not used during prediction (clf.predict() in the code above; don't confuse them with yy, which has nothing to do with Y).
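To make the shapes concrete, here is a small sketch with SVR on synthetic data with 5 features (100 rows instead of the 10000 in the question). It assumes the custom kernel receives two feature matrices and must return their Gram matrix; the kernel records the shapes it is called with. The names logging_kernel and seen_shapes are only illustrative.

```python
import numpy as np
from sklearn.svm import SVR

seen_shapes = []  # shapes of the two arguments at every kernel call

def logging_kernel(A, B):
    """Weighted linear kernel: returns a matrix of shape (len(A), len(B))."""
    seen_shapes.append((A.shape, B.shape))
    M = np.diag([2.0, 1.0, 1.0, 1.0, 1.0])
    return A @ M @ B.T

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))                      # 5 feature columns
y_train = X_train @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])  # 1 label column
X_test = rng.normal(size=(20, 5))                        # no labels needed

reg = SVR(kernel=logging_kernel).fit(X_train, y_train)
pred = reg.predict(X_test)

print("during fit:    ", seen_shapes[0])
print("during predict:", seen_shapes[-1])
print("predictions:   ", pred.shape)
```

During fit the kernel compares the training features with themselves; during predict it compares the test features with the training features. A function of a single argument that returns an (n, 1) column, like my_new_kernel in the question, does not fit this interface.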

Canonical Discriminant Function in Python sklearn

I am learning about Linear Discriminant Analysis and am using the scikit-learn module. I am confused by the "coef_" attribute from the LinearDiscriminantAnalysis class. As far as I understand, these are the discriminant function coefficients (sklearn calls them weight vectors). Since there should be (n_classes-1) discriminant functions, I would expect the coef_ attribute to be an array with shape (n_components, n_features), but instead it prints an (n_classes, n_features) array. Below is an example of this using the Iris dataset example from sklearn. Since there are 3 classes and 2 components, I would expect print(lda.coef_) to give me a 2x4 array instead of a 3x4 array...
Maybe I'm misinterpreting what the weight vectors are, perhaps they are the coefficients for the classification function?
And how do I get the coefficients for each variable in each discriminant/canonical function?
Code here:
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import numpy as np
iris = datasets.load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names
lda = LinearDiscriminantAnalysis(n_components=2,store_covariance=True)
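X_r = lda.fit(X, y).transform(X)
colors = ['navy', 'turquoise', 'darkorange'] # one color per class (assumed; not defined in the original snippet)
plt.figure()
for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], alpha=.8, color=color,
                label=target_name)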
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.xlabel('Function 1 (%.2f%%)' %(lda.explained_variance_ratio_[0]*100))
plt.ylabel('Function 2 (%.2f%%)' %(lda.explained_variance_ratio_[1]*100))
plt.title('LDA of IRIS dataset')
print(lda.coef_)
#output -> [[ 6.24621637 12.24610757 -16.83743427 -21.13723331]
# [ -1.51666857 -4.36791652 4.64982565 3.18640594]
# [ -4.72954779 -7.87819105 12.18760862 17.95082737]]
You can calculate the coefficients with the following code:
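import numpy as np
import pandas as pd
def LDA_coefficients(X,lda):
    # X must be a pandas DataFrame (its column names are reused below)
    nb_col = X.shape[1]
    # identity rows (one per feature) followed by a final all-zero row
    matrix= np.zeros((nb_col+1,nb_col), dtype=int)
    Z=pd.DataFrame(data=matrix,columns=X.columns)
    for j in range(0,nb_col):
        Z.iloc[j,j] = 1
    LD = lda.transform(Z)
    nb_funct= LD.shape[1]
    results = pd.DataFrame();
    index = ['const']
    for j in range(0,LD.shape[0]-1):
        index = np.append(index,'C'+str(j+1))
    for i in range(0,LD.shape[1]):
        # the transform of the zero row is the constant term
        coef = [LD[-1][i]]
        for j in range(0,LD.shape[0]-1):
            # transform of a unit vector minus the constant is that feature's coefficient
            coef = np.append(coef,LD[j][i]-LD[-1][i])
        result = pd.Series(coef)
        result.index = index
        column_name = 'LD' + str(i+1)
        results[column_name] = result
    return results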
Before calling this function you need to complete the linear discriminant analysis:
lda = LinearDiscriminantAnalysis()
lda.fit(X,y)
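The same idea can be written with plain NumPy arrays, which also works when X is not a DataFrame. This is a sketch that relies only on the assumption that lda.transform is an affine map (a matrix product plus a constant): the transform of the zero vector gives the constants, and the transform of each unit vector minus that constant gives the coefficients.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)

n_features = X.shape[1]
# Constant term of each discriminant function: transform of the zero vector.
const = lda.transform(np.zeros((1, n_features)))[0]
# Coefficient of feature j in each function: transform of unit vector j minus the constant.
coefs = lda.transform(np.eye(n_features)) - const

# Sanity check: the affine form reproduces the projected data.
reconstructed = X @ coefs + const
print(coefs.shape)
print(np.allclose(reconstructed, lda.transform(X)))
```

Here coefs has one row per feature and one column per discriminant function, which is the (n_components, n_features) layout the question expected, transposed.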
