plotting linear SVM

plotting linear SVM - python

I tried following the example here but i am having trouble applying it when i have 16 features. lin_svc is trained with those 16 features (i deleted the line to re-train it again from the example). it works and i tried it and also extracted .coef_before.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
#features is an array of 16
#lin_svc variable is available
#train is a pandas DF
X = train[features].as_matrix()
y = train.outcome
h = .02 # step size in the mesh
# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
# title for the plots
titles = ['SVC with linear kernel']
for i, clf in enumerate([lin_svc]):
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, m_max]x[y_min, y_max].
plt.subplot(2, 2, i + 1)
plt.subplots_adjust(wspace=0.4, hspace=0.4)
Z = clf.predict(X)
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.title(titles[i])
plt.show()
The error i am getting is:
ValueError Traceback (most recent call last)
<ipython-input-8-d52ca252fc3a> in <module>()
24
25 # Put the result into a color plot
---> 26 Z = Z.reshape(xx.shape)
27 plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
28
ValueError: total size of new array must be unchanged

I've encountered this same issue myself. Since you're really interested in plotting Z as a function of xx and yy, you should be passing those to clf.predict() rathan than passing X. Try replacing
Z = clf.predict(X)
with
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
and the plot should show nicely (assuming no other bugs).
Also you may want to change the title of your question to something like "Plotting 2-D Decision Boundary," since this has nothing to do with SVMs specifically. You'll encounter this kind of issue with any of the sklearn classifiers.

Related

Error in plotting the decision boundary for SVC Laplace kernel

I'm trying to plot the decision boundary of the SVM classifier using a precomputed Laplace kernel (code below) on the similar lines of this scikit-learn post. I'm taking test points as mesh grid values (xx, yy) just like as mentioned in the post and train points as X and y. I'm able to fit the pre-computed kernel using train points.
import numpy as np
#from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.metrics.pairwise import laplacian_kernel
#Load the iris data
iris_data = load_iris()
#Split the data and target
X = iris_data.data[:, :2]
y = iris_data.target
#Step size in mesh plot
h = 0.02
#Convert X and y to a numpy array
X = np.array(X)
y = np.array(y)
#Using Laplacian kernel - https://scikit-learn.org/stable/modules/metrics.html#laplacian-kernel
K = np.array(laplacian_kernel(X, gamma=.5))
svm = SVC(kernel='precomputed').fit(K, np.ravel(y))
# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
#plt.subplot(2, 2, i + 1)
#plt.subplots_adjust(wspace=0.4, hspace=0.4)
# Calculate the gram matrix for test points. Here is where the error is coming. xx- test, X-train.
K_test = np.array(laplacian_kernel(xx, X, gamma=.5))
#Predict using the gram matrix for test
Z = svm.predict(np.c_[K_test])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.title('SVC with Laplace kernel')
plt.show()
However, when I try to plot the decision boundary on graph for grid points, I get the below error.
Traceback (most recent call last):
File "/home/user/Src/laplce.py", line 37, in <module>
K_test = np.array(laplacian_kernel(xx, X, gamma=.5))
File "/home/user/.local/lib/python3.9/site-packages/sklearn/metrics/pairwise.py", line 1136, in laplacian_kernel
X, Y = check_pairwise_arrays(X, Y)
File "/home/user/.local/lib/python3.9/site-packages/sklearn/utils/validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "/home/user/.local/lib/python3.9/site-packages/sklearn/metrics/pairwise.py", line 160, in check_pairwise_arrays
raise ValueError("Incompatible dimension for X and Y matrices: "
ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 280 while Y.shape[1] == 2
So, how do I resolve the error and plot the decision boundary for iris data ? Thanks in advance

The issue is getting your meshgrid into the same dimensions as the training matrix, before applying the laplacian. So if we run the code below to fit the svm :
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.metrics.pairwise import laplacian_kernel
iris_data = load_iris()
X = iris_data.data[:, :2]
y = iris_data.target
h = 0.02
K = laplacian_kernel(X,gamma=.5)
svm = SVC(kernel='precomputed').fit(K, y)
Create the grid like you did:
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
x_test = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
xx,yy = np.meshgrid(np.arange(x_min, x_max, h),np.arange(y_min, y_max, h))
Your original input into the laplacian was (150,2) so you need to basically put your xx,yy into 2 columns:
x_test = np.vstack([xx.ravel(),yy.ravel()]).T
K_test = laplacian_kernel(x_test, X, gamma=.5)
Z = svm.predict(K_test)
Z = Z.reshape(xx.shape)
Then plot:
plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
The points are more or less correct, you can see it does not resolve 1,2 very well:
pd.crosstab(y,svm.predict(K))
col_0 0 1 2
row_0
0 49 1 0
1 0 35 15
2 0 11 39

How to have multiple categorical markers on a scatterplot

I want to train logistic regression model, and after that create a plot which shows boundary lines, but in specific way.
My work so far
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from matplotlib.colors import ListedColormap
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
Y = iris.target
logreg = LogisticRegression(C=1e5)
# Create an instance of Logistic Regression Classifier and fit the data.
logreg.fit(X, Y)
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
h = .02 # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(4, 3))
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
# Plot also the training points
plt.scatter(X[:, 0], X[:,1], c=Y, marker='x',edgecolors='k', cmap=cmap_bold)
plt.xlabel('Sepal length'),
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.show()
However I find it very unreadable. I want to have other markers for each classification and legend in left upper corner. Just like in the image below :
Do you have any idea how can I change that ? I played with marker ='s', marker='x', but those change all points on scatter plot, instead of one specific classification.

Since you are plotting with categorical values, you can just plot each class separately:
# Replace this
# plt.scatter(X[:, 0], X[:,1], c=Y, marker='x',edgecolors='k', cmap=cmap_bold)
# with this
markers = 'sxo'
for m,i in zip(markers,np.unique(Y)):
mask = Y==i
plt.scatter(X[mask, 0], X[mask,1], c=cmap_bold.colors[i],
marker=m,edgecolors='k', label=i)
plt.legend()
Output:

I find it easier to create a dataframe from X & Y, and then plot the data points with seaborn.scatterplot.
seaborn is a high-level api for matplotlib
As shown in How to extract the boundary values from k-nearest neighbors predict, the dataframe columns can be used to specify all data for fitting, and x and y min and max.
load and setup the data
import numpy as np
import matplotlib.pyplot as plt # version 3.3.1
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from matplotlib.colors import ListedColormap
import seaborn # versuin 0.11.0
import pandas # version 1.1.3
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
# seaborn.scatterplot palette parameter takes a list
palette = ['#FF0000', '#00FF00', '#0000FF']
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
Y = iris.target
# add X & Y to dataframe
df = pd.DataFrame(X, columns=iris.feature_names[:2])
df['label'] = Y
# map the number values to the species name and add it to the dataframe
species_map = dict(zip(range(3), iris.target_names))
df['species'] = df.label.map(species_map)
logreg = LogisticRegression(C=1e5)
# Create an instance of Logistic Regression Classifier and fit the data.
logreg.fit(X, Y)
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
h = .02 # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plot the data
plt.figure(1, figsize=(8, 6))
plt.pcolormesh(xx, yy, Z, cmap=cmap_light, shading='auto')
# Plot also the training points
# add data points using seaborn
sns.scatterplot(data=df, x='sepal length (cm)', y='sepal width (cm)', hue='species',
style='species', edgecolor='k', alpha=0.5, palette=palette, s=70)
# change legend location
plt.legend(title='Species', loc=2)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
# plt.xticks(())
# plt.yticks(())
plt.show()
alpha=0.5 is used with sns.scatterplot, to show that some values of 'versicolor' and 'virginica' overlap.
If the species label is desired for the legend, instead of the name, change hue='species' to hue='label'.

You need to change a single call to plt.scatter to one call for each marker type, since matplotlib does not allow passing multiple marker types as it does with color.
The plot code becomes something like
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(4, 3))
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
# Plot also the training points
X0 = X[Y==0]
X1 = X[Y==1]
X2 = X[Y==2]
Y0 = Y[Y==0]
Y1 = Y[Y==1]
Y2 = Y[Y==2]
plt.scatter(X0[:, 0], X0[:,1], marker='s',color="red")
plt.scatter(X1[:, 0], X1[:,1], marker='x',color="blue")
plt.scatter(X2[:, 0], X2[:,1], marker='o',color="green")
plt.xlabel('Sepal length'),
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.show()
where you individually set the marker type and color of each class. You can also create a list for the marker type and another for the color and use a loop.

X.shape[1] = 2 should be equal to 9, the number of features at training time

I'm trying to plot my Binary SVM classifier results using matplotlib.pyplot and using this documentation as a guide:
https://scikit-learn.org/0.18/auto_examples/svm/plot_iris.html
Here's my code:
# create a mesh to plot in
h = .02
x_min, x_max = X.min() - 1, X.max() + 1
y_min, y_max = 0,1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
# PLOT
for i, clf in enumerate((svc, lin_svc, rbf_svc, poly_svc)):
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
plt.subplot(2, 2, i + 1)
plt.subplots_adjust(wspace=0.4, hspace=0.4)
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm)
plt.xlabel('X')
plt.ylabel('Y')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.title(titles[i])
plt.show()
Where X.shape = (10769, 9), and the class values are either 0 or 1, hence why I set y_min and y_max to 0 and 1.
This is the error I'm getting:
X.shape[1] = 2 should be equal to 9, the number of features at training time
I'm not sure if I understand how to plot SVMs correctly - what am I doing wrong?
Here is the link to the notebook with the full code: https://colab.research.google.com/drive/1F3_CFIDv8qaDeWodujpeu4phQADFpodP?usp=sharing
Any help would be much appreciated!!

Subplot drawn by auxiliary function experiences unexpected shift

I'm trying to merge two plots in one:
http://scikit-learn.org/stable/auto_examples/linear_model/plot_sgd_iris.html
http://scikit-learn.org/stable/auto_examples/ensemble/plot_voting_decision_regions.html#sphx-glr-auto-examples-ensemble-plot-voting-decision-regions-py
In the left plot I want to display the decision boundary with the hyperplane corresponding to the OVA classifiers and in the right plot I would like to show the decision probabilities.
This is the code so far:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn import datasets
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
def plot_hyperplane(c, color, fitted_model):
"""
Plot the one-against-all classifiers for the given model.
Parameters
--------------
c : index of the hyperplane to be plot
color : color to be used when drawing the line
fitted_model : the fitted model
"""
xmin, xmax = plt.xlim()
ymin, ymax = plt.ylim()
try:
coef = fitted_model.coef_
intercept = fitted_model.intercept_
except:
return
def line(x0):
return (-(x0 * coef[c, 0]) - intercept[c]) / coef[c, 1]
plt.plot([xmin, xmax], [line(xmin), line(xmax)], ls="--", color=color, zorder=3)
def plot_decision_boundary(X, y, fitted_model, features, targets):
"""
This function plots a model decision boundary as well as it tries to plot
the decision probabilities, if available.
Requires a model fitted with two features only.
Parameters
--------------
X : the data to learn
y : the classification labels
fitted_model : the fitted model
"""
cmap = plt.get_cmap('Set3')
prob = cmap
colors = [cmap(i) for i in np.linspace(0, 1, len(fitted_model.classes_))]
plt.figure(figsize=(9.5, 5))
for i, plot_type in enumerate(['Decision Boundary', 'Decision Probabilities']):
plt.subplot(1, 2, i+1)
mesh_step_size = 0.01 # step size in the mesh
x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
xx, yy = np.meshgrid(np.arange(x_min, x_max, mesh_step_size), np.arange(y_min, y_max, mesh_step_size))
# First plot, predicted results using the given model
if i == 0:
Z = fitted_model.predict(np.c_[xx.ravel(), yy.ravel()])
for h, color in zip(fitted_model.classes_, colors):
plot_hyperplane(h, color, fitted_model)
# Second plot, predicted probabilities using the given model
else:
prob = 'RdYlBu_r'
try:
Z = fitted_model.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
except:
plt.text(0.4, 0.5, 'Probabilities Unavailable', horizontalalignment='center',
verticalalignment='center', transform=plt.gca().transAxes, fontsize=12)
plt.axis('off')
break
Z = Z.reshape(xx.shape)
# Display Z
plt.imshow(Z, interpolation='nearest', cmap=prob, alpha=0.5,
extent=(x_min, x_max, y_min, y_max), origin='lower', zorder=1)
# Plot the data points
for i, color in zip(fitted_model.classes_, colors):
idx = np.where(y == i)
plt.scatter(X[idx, 0], X[idx, 1], facecolor=color, edgecolor='k', lw=1,
label=iris.target_names[i], cmap=cmap, alpha=0.8, zorder=2)
plt.title(plot_type + '\n' +
str(fitted_model).split('(')[0]+ ' Test Accuracy: ' + str(np.round(fitted_model.score(X, y), 5)))
plt.xlabel(features[0])
plt.ylabel(features[1])
plt.gca().set_aspect('equal')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.tight_layout()
plt.subplots_adjust(top=0.9, bottom=0.08, wspace=0.02)
plt.show()
if __name__ == '__main__':
iris = datasets.load_iris()
X = iris.data[:, [0, 2]]
y = iris.target
scaler = preprocessing.StandardScaler().fit_transform(X)
clf1 = DecisionTreeClassifier(max_depth=4)
clf2 = KNeighborsClassifier(n_neighbors=7)
clf3 = SVC(kernel='rbf', probability=True)
clf4 = SGDClassifier(alpha=0.001, n_iter=100).fit(X, y)
clf1.fit(X, y)
clf2.fit(X, y)
clf3.fit(X, y)
clf4.fit(X, y)
plot_decision_boundary(X, y, clf1, iris.feature_names, iris.target_names[[0, 2]])
plot_decision_boundary(X, y, clf2, iris.feature_names, iris.target_names[[0, 2]])
plot_decision_boundary(X, y, clf3, iris.feature_names, iris.target_names[[0, 2]])
plot_decision_boundary(X, y, clf4, iris.feature_names, iris.target_names[[0, 2]])
And the results:
As can be seen, for the last example (clf4 in the given code) I'm so far unable to plot the hyperplane in the wrong position. I wonder how to correct this. They should be translated to the correct range regarding the used features to fit the model.
Thanks.

Apparently, the problem is the ends of dashed lines representing hyperplanes are not consistent with the final and expected xlim and ylim. A good thing about this case is that you already have x_min, x_max, y_min, y_max defined. So use that and fix xlim and ylim by applying the following 3 lines before plotting hyperplanes (specifically, add in front of your comment line # First plot, predicted results using the given model).
ax = plt.gca()
ax.set_xlim((x_min, x_max), auto=False)
ax.set_ylim((y_min, y_max), auto=False)

numpy.ndarray syntax understanding for confirmation

I am referring the code example here (http://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html), and specifically confused by this line iris.data[:, :2], since iris.data is 150 (row) * 4 (column) dimensional I think it means, select all rows, and the first two columns. I ask here to confirm if my understanding is correct, since I take time but cannot find such syntax definition official document.
Another question is, I am using the following code to get # of rows and # of columns, not sure if better more elegant ways? My code is more Python native style and not sure if numpy has better style to get the related values.
print len(iris.data) # for number of rows
print len(iris.data[0]) # for number of columns
Using Python 2.7 with miniconda interpreter.
print(__doc__)
# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
Y = iris.target
h = .02 # step size in the mesh
logreg = linear_model.LogisticRegression(C=1e5)
# we create an instance of Neighbours Classifier and fit the data.
logreg.fit(X, Y)
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, m_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(4, 3))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)
# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.show()
regards,
Lin

You are right. The first syntax selects the first 2 columns/features. Another way to query dimensions is to look at iris.data.shape. This will return a n-dimensional tuple with the length. You can find some documentation here: http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html
import numpy as np
x = np.random.rand(100, 200)
# Select the first 2 columns
y = x[:, :2]
# Get the row length
print (y.shape[0])
# Get the column length
print (y.shape[1])
# Number of dimensions
print (len(y.shape))

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

plotting linear SVM - python

Related

Error in plotting the decision boundary for SVC Laplace kernel

How to have multiple categorical markers on a scatterplot

X.shape[1] = 2 should be equal to 9, the number of features at training time

Subplot drawn by auxiliary function experiences unexpected shift

numpy.ndarray syntax understanding for confirmation

Categories

Resources