Having trouble developing a KNeighborsClassifier analysis in Python

I'm trying to produce a routine using KNeighborsClassifier in Python in Jupyter. My goal is to group the diversity values into 4 types of water masses, but when I test my code, Jupyter reports a "Dead kernel".
I want to produce a figure similar to this (figure omitted), only adapted to my data. This is the code I'm working on:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.cm as cm
from matplotlib.colors import ListedColormap, BoundaryNorm
import matplotlib.patches as mpatches
from sklearn import neighbors, datasets
from sklearn.neighbors import KNeighborsClassifier
index = pd.read_excel('diverty_index.xlsx') #This is my data set
X = index[['Shannon', 'Depth']]
y = index['Water_mass']
# default is 75% / 25% train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
def plot_water_knn(X, y, n_neighbors, weights):
    X_mat = X[['Shannon', 'Depth']].values  # Shannon is a diversity index
    y_mat = y.values
    # Create color maps
    cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF', '#AFAFAF'])
    cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF', '#AFAFAF'])
    clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
    clf.fit(X_mat, y_mat)
    # Plot the decision boundary by assigning a color in the color map
    # to each mesh point.
    mesh_step_size = .01  # step size in the mesh
    plot_symbol_size = 50
    x_min, x_max = X_mat[:, 0].min() - 1, X_mat[:, 0].max() + 1
    y_min, y_max = X_mat[:, 1].min() - 1, X_mat[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, mesh_step_size),
                         np.arange(y_min, y_max, mesh_step_size))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
    # Plot training points
    plt.scatter(X_mat[:, 0], X_mat[:, 1], s=plot_symbol_size, c=y, cmap=cmap_bold, edgecolor='black')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    patch0 = mpatches.Patch(color='#FF0000', label='AASW')
    patch1 = mpatches.Patch(color='#00FF00', label='CDW')
    patch2 = mpatches.Patch(color='#0000FF', label='MWDW')
    patch3 = mpatches.Patch(color='#AFAFAF', label='AABW')
    plt.legend(handles=[patch0, patch1, patch2, patch3])
    plt.xlabel('Shannon H')
    plt.ylabel('Depth (m)')
    plt.show()

plot_water_knn(X_train, y_train, 5, 'uniform')

I think the dead kernel is a result of using a fine mesh (mesh_step_size) over a large feature space. Standardizing your data will help, and should also improve the model. If it doesn't, your entire dataset might be too large for your machine.
But the first-order problem is that this code is a bit jumbled, mixing modeling and plotting in a sort of half-written function. Let's simplify everything and start with the classifier. Forget the plotting for now.
Making a classifier
I refactored a bit, using conventional names for things. Try this:
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_excel('diverty_index.xlsx')
X = df[['Shannon', 'Depth']]
y = df['Water_mass']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
Note here that it's important that your samples are not in 'clumps', e.g. location A with 5 measurements at various depths, then location B with 4 measurements at different depths, etc. If they are 'clumped' like this, you can't just split randomly with train_test_split; instead you need to split the clumps (e.g. the locations).
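If your samples are clumped like that, one option is a group-aware split so that whole locations stay together in either train or test. A minimal sketch, assuming a hypothetical 'Station' column in df that identifies the sampling location of each row:
from sklearn.model_selection import GroupShuffleSplit

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=df['Station']))  # 'Station' is hypothetical
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]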
Now you need to scale your data. This classifier depends on distances, and distances don't mean much if your features are different scales. So do this:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test) # Do not fit scaler to Test.
Now you can fit the classifier:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(classification_report(y_test, y_pred))
How does this look? What's the weighted F1 score? Does it improve if you change the value of n_neighbors? How good can you make this model? (Strictly speaking you should have another blind dataset to test all this, but that's a detail.)
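For example, one quick way to scan a few values of n_neighbors (a sketch reusing the scaled X_train, X_test, y_train, y_test from above):
from sklearn.metrics import f1_score

for k in range(1, 16, 2):
    knn_k = KNeighborsClassifier(n_neighbors=k, weights='uniform')
    knn_k.fit(X_train, y_train)
    # Weighted F1 on the held-out data for this value of k.
    print(k, f1_score(y_test, knn_k.predict(X_test), average='weighted'))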
If you got this far with an intact kernel, then you can feel good about having a somewhat useful KNN model, and you can move on to the data viz.
Decision region visualization
I suspect that this line was killing your kernel: mesh_step_size = .01. If the features of X have a large range (e.g. depths from 0 to 10,000), the mesh becomes gigantic (a range of 10,000 at a step of 0.01 is a million points along that axis alone) and eats your memory, crashing the kernel. But now that we've standardized the data, things are more predictable, because most values will fall in the range -3 to +3.
This minimalist approach adapted from this famous plot should produce something:
import numpy as np
import matplotlib.pyplot as plt

# If the water-mass labels are strings, map them to integer codes for plotting.
classes, y_test_codes = np.unique(y_test, return_inverse=True)

fig, ax = plt.subplots()
# Plot the validation data.
ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test_codes)
# Set up the grid parameters.
h = .02
x_min, x_max = X_train[:, 0].min() - .5, X_train[:, 0].max() + .5
y_min, y_max = X_train[:, 1].min() - .5, X_train[:, 1].max() + .5
# Make the grid.
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Predict on the grid and convert the predicted labels to the same integer codes.
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = np.searchsorted(classes, Z).reshape(xx.shape)
# Put the result into a color plot.
im = ax.pcolormesh(xx, yy, Z, alpha=0.2, zorder=1, shading='auto')
If this produces something at least, then you're off to the races. I'm sure you can add a title, axis annotation, etc, and make it pretty. Good luck.
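For instance, a few finishing touches (a small sketch continuing the figure above; the label text is just a placeholder):
ax.set_xlabel('Shannon H (standardized)')
ax.set_ylabel('Depth (standardized)')
ax.set_title('KNN decision regions (k=5)')
plt.show()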

Related

How to plot my own logistic regression decision boundaries and SKlearn's ones on the same figure

I have an assignment in which I need to compare my own multi-class logistic regression and the built-in SKlearn one.
As part of it, I need to plot the decision boundaries of each, on the same figure (for 2,3, and 4 classes separately).
These are my model's decision boundaries for 3 classes (figure omitted), made with this code:
x1_min, x1_max = X[:,0].min()-.5, X[:,0].max()+.5
x2_min, x2_max = X[:,1].min()-.5, X[:,1].max()+.5
xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max), np.linspace(x2_min, x2_max))
grid = np.c_[xx1.ravel(), xx2.ravel()]
for i in range(len(ws)):
    probs = ol.predict_prob(grid, ws[i]).reshape(xx1.shape)
    plt.contour(xx1, xx2, probs, [0.5], linewidths=1, colors='green')
where
ol is my own logistic regression implementation, and
ws are the current weights.
That's how I tried to plot the Sklearn boundaries:
for i in range(len(clf.coef_)):
    w = clf.coef_[i]
    a = -w[0] / w[1]
    xx = np.linspace(x1_min, x1_max)
    yy = a * xx - (clf.intercept_[0]) / w[1]
    plt.plot(xx, yy, 'k-')
Resulting in this (figure omitted).
I understand that it's due to the 1dim vs 2dim grids, but I can't understand how to solve it.
I also tried the built-in DecisionBoundaryDisplay, but I couldn't figure out how to plot it alongside my own boundaries, and it doesn't plot only the lines: the whole background gets painted in the corresponding class color.
A couple fixes:
Change clf.intercept_[0] to clf.intercept_[i] inside the loop.
If the xlimits and ylimits in the plot look strange, you can constrain them.
ax.set_xlim([x1_min, x1_max])
ax.set_ylim([x2_min, x2_max])
MRE:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
X, y = make_blobs(n_features=2, centers=3, random_state=42)
fig, ax = plt.subplots(1, 2)
x1_min, x1_max = X[:,0].min()-.5, X[:,0].max()+.5
x2_min, x2_max = X[:,1].min()-.5, X[:,1].max()+.5
def draw_coef_lines(clf, X, y, ax, title):
    for i in range(len(clf.coef_)):
        w = clf.coef_[i]
        a = -w[0] / w[1]
        xx = np.linspace(x1_min, x1_max)
        yy = a * xx - (clf.intercept_[i]) / w[1]
        ax.plot(xx, yy, 'k-')
    ax.scatter(X[:, 0], X[:, 1], c=y)
    ax.set_xlim([x1_min, x1_max])
    ax.set_ylim([x2_min, x2_max])
    ax.set_title(title)
clf1 = LogisticRegression().fit(X, y)
clf2 = LogisticRegression(multi_class="ovr").fit(X, y)
draw_coef_lines(clf1, X, y, ax[0], "Multinomial")
draw_coef_lines(clf2, X, y, ax[1], "OneVsRest")
plt.show()
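On the DecisionBoundaryDisplay point from the question: it can draw only the boundary lines instead of painting the whole background if you pass plot_method="contour". A rough sketch (assumes scikit-learn >= 1.1, reusing clf1, X and y from the MRE above):
from sklearn.inspection import DecisionBoundaryDisplay

fig2, ax2 = plt.subplots()
# plot_method="contour" draws only the class-boundary lines, not filled regions.
DecisionBoundaryDisplay.from_estimator(clf1, X, plot_method="contour",
                                       response_method="predict", colors="green", ax=ax2)
ax2.scatter(X[:, 0], X[:, 1], c=y)
plt.show()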

Trouble with unsupervised clustering method of dataframe

I'm working on some Python ML exercises and I'm stuck on a question.
I have a dataframe with 7 columns and almost 10k rows. 6 of those columns/variables are objects and 1 is a float. The 7 variables are: Company, Job, Technologies, Degree, Experience (the one float variable, number of years), City, and Exp_level.
I want to do an unsupervised clustering to show 2 variables I deem important.
The code I've been testing hasn't been working and it seems that there is an issue with the mixed variables I have.
x = df
y = x.pop('Metier')
y.unique()
OneHotEncoder().fit(df.dropna()).categories_
x.values, y
for weights in ['uniform', 'distance']:
    # we create an instance of Neighbours Classifier and fit the data.
    clf = KNN.KNeighborsClassifier(5, weights=weights)
    clf.fit(x.values, y.values)
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold,
                edgecolor='k', s=20)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("3-Class classification (k = %i, weights = '%s')"
              % (n_neighbors, weights))
    plt.show()
This is the 8th exercise by the way, so all my imports and dataframe loading were done in the beginning.
The error I keep having is ValueError: could not convert string to float: 'Sanofi' (the name of a company).
I'm doing my best to train and work on my Python skills, and I hope I gave enough information to show that. Is there a better way to achieve my goal? I can only use these libraries:
import pandas as pd
import numpy as np
import re
import sklearn as sk
import sklearn.neighbors as KNN
from sklearn.preprocessing import OneHotEncoder
import seaborn as sb
from matplotlib import pyplot as plt
Hoping I can figure out this tricky exercise, any help would be greatly appreciated! I thank you in advance :) Super happy to be working on my Python skills more and more.
This is my df (screenshot omitted).

SVM: plot decision surface when working with more than 2 features

I am working with scikit-learn's breast cancer dataset, consisting of 30 features.
Following this tutorial for the much less depressing iris dataset, I figured out how to plot the decision surface separating the "benign" and "malignant" categories when considering the dataset's first two features (mean radius and mean texture).
This is what I get (figure omitted).
But how to represent the hyperplane computed when using all features in the dataset?
I am aware that I cannot plot a graph in 30 dimensions, but I would like to "project" the hyperplane created when running svm.SVC(kernel='linear', C=1).fit(X_train, y_train) onto the 2D scatter plot showing mean texture against mean radius.
I read about using PCA to reduce dimensionality, but I suspect that fitting a "reduced" dataset is not the same as projecting the hyperplane computed over all 30 features onto a 2D plot.
Here is my code so far:
from sklearn import datasets
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import svm
import numpy as np
#Load dataset
cancer = datasets.load_breast_cancer()
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.3,random_state=109) # 70% training and 30% test
h = .02 # mesh step
C = 1.0 # Regularisation
clf = svm.SVC(kernel='linear', C=C).fit(X_train[:,:2], y_train) # Linear Kernel
x_min, x_max = X_train[:, 0].min() - 1, X_train[:, 0].max() + 1
y_min, y_max = X_train[:, 1].min() - 1, X_train[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
scat=plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train)
legend1 = plt.legend(*scat.legend_elements(),
                     loc="upper right", title="diagnostic")
plt.xlabel('mean_radius')
plt.ylabel('mean_texture')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.show()
You cannot visualize the decision surface for a lot of features. This is because the dimensions will be too many and there is no way to visualize an N-dimensional surface.
I have also written an article about this here:
https://towardsdatascience.com/support-vector-machines-svm-clearly-explained-a-python-tutorial-for-classification-problems-29c539f3ad8?source=friends_link&sk=80f72ab272550d76a0cc3730d7c8af35
However, you can use 2 features and plot nice decision surfaces as follows.
Case 1: 2D plot for 2 features and using the iris dataset
from sklearn.svm import SVC
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
y = iris.target
def make_meshgrid(x, y, h=.02):
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    return xx, yy

def plot_contours(ax, clf, xx, yy, **params):
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out
model = svm.SVC(kernel='linear')
clf = model.fit(X, y)
fig, ax = plt.subplots()
# title for the plots
title = ('Decision surface of linear SVC ')
# Set-up grid for plotting.
X0, X1 = X[:, 0], X[:, 1]
xx, yy = make_meshgrid(X0, X1)
plot_contours(ax, clf, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
ax.set_ylabel('y label here')
ax.set_xlabel('x label here')
ax.set_xticks(())
ax.set_yticks(())
ax.set_title(title)
ax.legend()
plt.show()
Case 2: 3D plot for 3 features and using the iris dataset
from sklearn.svm import SVC
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from mpl_toolkits.mplot3d import Axes3D
iris = datasets.load_iris()
X = iris.data[:, :3] # we only take the first three features.
Y = iris.target
#make it binary classification problem
X = X[np.logical_or(Y==0,Y==1)]
Y = Y[np.logical_or(Y==0,Y==1)]
model = svm.SVC(kernel='linear')
clf = model.fit(X, Y)
# The equation of the separating plane is given by all x so that np.dot(svc.coef_[0], x) + b = 0.
# Solve for w3 (z)
z = lambda x,y: (-clf.intercept_[0]-clf.coef_[0][0]*x -clf.coef_[0][1]*y) / clf.coef_[0][2]
tmp = np.linspace(-5,5,30)
x,y = np.meshgrid(tmp,tmp)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot3D(X[Y==0,0], X[Y==0,1], X[Y==0,2],'ob')
ax.plot3D(X[Y==1,0], X[Y==1,1], X[Y==1,2],'sr')
ax.plot_surface(x, y, z(x,y))
ax.view_init(30, 60)
plt.show()
You can't plot the 30-dim data without any transformation to 2-d.
https://github.com/tmadl/highdimensional-decision-boundary-plot
What is a Voronoi Tessellation?
Given a set P := {p1, ..., pn} of sites, a Voronoi Tessellation is a subdivision of the space into n cells, one for each site in P, with the property that a point q lies in the cell corresponding to a site pi iff d(pi, q) < d(pj, q) for i distinct from j. The segments in a Voronoi Tessellation correspond to all points in the plane equidistant to the two nearest sites. Voronoi Tessellations have applications in computer science. - https://philogb.github.io/blog/2010/02/12/voronoi-tessellation/
In geometry, a centroidal Voronoi tessellation (CVT) is a special type of Voronoi tessellation or Voronoi diagram. A Voronoi tessellation is called centroidal when the generating point of each Voronoi cell is also its centroid, i.e. the arithmetic mean or center of mass. It can be viewed as an optimal partition corresponding to an optimal distribution of generators. A number of algorithms can be used to generate centroidal Voronoi tessellations, including Lloyd's algorithm for K-means clustering or Quasi-Newton methods like BFGS. - Wiki
import numpy as np, matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.manifold import TSNE
from sklearn import svm
bcd = load_breast_cancer()
X,y = bcd.data, bcd.target
X_Train_embedded = TSNE(n_components=2).fit_transform(X)
print(X_Train_embedded.shape)
h = .02 # mesh step
C = 1.0 # Regularisation
clf = svm.SVC(kernel='linear', C=C) # Linear Kernel
clf = clf.fit(X,y)
y_predicted = clf.predict(X)
resolution = 100 # 100x100 background pixels
X2d_xmin, X2d_xmax = np.min(X_Train_embedded[:,0]), np.max(X_Train_embedded[:,0])
X2d_ymin, X2d_ymax = np.min(X_Train_embedded[:,1]), np.max(X_Train_embedded[:,1])
xx, yy = np.meshgrid(np.linspace(X2d_xmin, X2d_xmax, resolution), np.linspace(X2d_ymin, X2d_ymax, resolution))
# approximate Voronoi tesselation on resolution x resolution grid using 1-NN
background_model = KNeighborsClassifier(n_neighbors=1).fit(X_Train_embedded, y_predicted)
voronoiBackground = background_model.predict(np.c_[xx.ravel(), yy.ravel()])
voronoiBackground = voronoiBackground.reshape((resolution, resolution))
#plot
plt.contourf(xx, yy, voronoiBackground)
plt.scatter(X_Train_embedded[:,0], X_Train_embedded[:,1], c=y)
plt.show()

How to plot SVM decision boundary in sklearn Python?

Using SVM with the sklearn library, I would like to plot the data with each label shown in its own color. I don't want to color the points but to fill the areas with color.
I now have:
d_pred, d_train_std, d_test_std, l_train, l_test
d_pred are the predicted labels.
I would like to plot d_pred against d_train_std, which has shape (70000, 2), with the first column on the X-axis and the second column on the Y-axis.
Thank you.
You cannot visualize the decision surface for a lot of features. This is because the dimensions will be too many and there is no way to visualize an N-dimensional surface.
However, you can use 2 features and plot nice decision surfaces as follows.
I have also written an article about this here:
https://towardsdatascience.com/support-vector-machines-svm-clearly-explained-a-python-tutorial-for-classification-problems-29c539f3ad8?source=friends_link&sk=80f72ab272550d76a0cc3730d7c8af35
Case 1: 2D plot for 2 features and using the iris dataset
from sklearn.svm import SVC
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
y = iris.target
def make_meshgrid(x, y, h=.02):
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    return xx, yy

def plot_contours(ax, clf, xx, yy, **params):
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out
model = svm.SVC(kernel='linear')
clf = model.fit(X, y)
fig, ax = plt.subplots()
# title for the plots
title = ('Decision surface of linear SVC ')
# Set-up grid for plotting.
X0, X1 = X[:, 0], X[:, 1]
xx, yy = make_meshgrid(X0, X1)
plot_contours(ax, clf, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
ax.set_ylabel('y label here')
ax.set_xlabel('x label here')
ax.set_xticks(())
ax.set_yticks(())
ax.set_title(title)
ax.legend()
plt.show()
Case 2: 3D plot for 3 features and using the iris dataset
from sklearn.svm import SVC
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from mpl_toolkits.mplot3d import Axes3D
iris = datasets.load_iris()
X = iris.data[:, :3] # we only take the first three features.
Y = iris.target
#make it binary classification problem
X = X[np.logical_or(Y==0,Y==1)]
Y = Y[np.logical_or(Y==0,Y==1)]
model = svm.SVC(kernel='linear')
clf = model.fit(X, Y)
# The equation of the separating plane is given by all x so that np.dot(svc.coef_[0], x) + b = 0.
# Solve for w3 (z)
z = lambda x,y: (-clf.intercept_[0]-clf.coef_[0][0]*x -clf.coef_[0][1]*y) / clf.coef_[0][2]
tmp = np.linspace(-5,5,30)
x,y = np.meshgrid(tmp,tmp)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot3D(X[Y==0,0], X[Y==0,1], X[Y==0,2],'ob')
ax.plot3D(X[Y==1,0], X[Y==1,1], X[Y==1,2],'sr')
ax.plot_surface(x, y, z(x,y))
ax.view_init(30, 60)
plt.show()
It can be difficult to visualize the function in 3D. An easy way to get a visualization is to generate a large number of points covering your feature space, run them through your learned function (my_model.predict), keep the points that fall inside the decision region, and visualize those. The more points you add, the more defined the boundary will be.
Here's my code that does what @Christian Tuchez describes:
outputs = my_clf.predict(l_test)
hits = []
for i in range(outputs.size):
    if outputs[i] == 1:
        hits.append(i)  # save the index where it's 1
This saves the index of all the points that hit in the function (saved in the "hits" list). You can probably accomplish this without a loop, I just found it easiest for me.
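For instance, the loop-free version would just be (a sketch, using the same outputs array as above):
# Indices of all points predicted as class 1, without an explicit loop.
hits = np.flatnonzero(outputs == 1)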
Then to display just those points, you'd write something like this:
ax.scatter(l_test[hits[:], 0], l_test[hits[:], 1], l_test[hits[:], 2], c="cyan", s=2, edgecolor=None)

How to plot the text classification using tf-idf svm sklearn in python

I have implemented text classification using tf-idf and SVM by following this tutorial.
The classification is working properly.
Now I want to plot the tf-idf values (i.e. the features) and also see the final hyperplane that classifies the data into two classes.
The code implemented is as follows:
import os
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold
def make_Corpus(root_dir):
    polarity_dirs = [os.path.join(root_dir, f) for f in os.listdir(root_dir)]
    corpus = []
    for polarity_dir in polarity_dirs:
        reviews = [os.path.join(polarity_dir, f) for f in os.listdir(polarity_dir)]
        for review in reviews:
            doc_string = ""
            with open(review) as rev:
                for line in rev:
                    doc_string = doc_string + line
            if not corpus:
                corpus = [doc_string]
            else:
                corpus.append(doc_string)
    return corpus
#Create a corpus with each document having one string
root_dir = 'txt_sentoken'
corpus = make_Corpus(root_dir)
#Stratified 10-cross fold validation with SVM and Multinomial NB
labels = np.zeros(2000);
labels[0:1000]=0;
labels[1000:2000]=1;
kf = StratifiedKFold(n_splits=10)
totalsvm = 0 # Accuracy measure on 2000 files
totalNB = 0
totalMatSvm = np.zeros((2,2)); # Confusion matrix on 2000 files
totalMatNB = np.zeros((2,2));
for train_index, test_index in kf.split(corpus, labels):
    X_train = [corpus[i] for i in train_index]
    X_test = [corpus[i] for i in test_index]
    y_train, y_test = labels[train_index], labels[test_index]
    vectorizer = TfidfVectorizer(min_df=5, max_df=0.8, sublinear_tf=True, use_idf=True, stop_words='english')
    train_corpus_tf_idf = vectorizer.fit_transform(X_train)
    test_corpus_tf_idf = vectorizer.transform(X_test)
    model1 = LinearSVC()
    model2 = MultinomialNB()
    model1.fit(train_corpus_tf_idf, y_train)
    model2.fit(train_corpus_tf_idf, y_train)
    result1 = model1.predict(test_corpus_tf_idf)
    result2 = model2.predict(test_corpus_tf_idf)
    totalMatSvm = totalMatSvm + confusion_matrix(y_test, result1)
    totalMatNB = totalMatNB + confusion_matrix(y_test, result2)
    totalsvm = totalsvm + sum(y_test == result1)
    totalNB = totalNB + sum(y_test == result2)

print(totalMatSvm, totalsvm/2000.0, totalMatNB, totalNB/2000.0)
I have read how to plot the graphs, but couldn't find any tutorial related to plot the features of tf-idf and also the hyperplane generated by SVM.
First, you need to select only 2 features in order to create the 2-dimensional decision surface plot.
Example using some sample text data (the 20 newsgroups dataset):
from sklearn.svm import SVC
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
newsgroups_train = fetch_20newsgroups(subset='train',
categories=['alt.atheism', 'sci.space'])
pipeline = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer())])
X = pipeline.fit_transform(newsgroups_train.data).todense()
# Select ONLY 2 features
X = np.array(X)
X = X[:, [0,1]]
y = newsgroups_train.target
def make_meshgrid(x, y, h=.02):
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    return xx, yy

def plot_contours(ax, clf, xx, yy, **params):
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out
model = svm.SVC(kernel='linear')
clf = model.fit(X, y)
fig, ax = plt.subplots()
# title for the plots
title = ('Decision surface of linear SVC ')
# Set-up grid for plotting.
X0, X1 = X[:, 0], X[:, 1]
xx, yy = make_meshgrid(X0, X1)
plot_contours(ax, clf, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
ax.set_ylabel('y label here')
ax.set_xlabel('x label here')
ax.set_xticks(())
ax.set_yticks(())
ax.set_title(title)
ax.legend()
plt.show()
RESULTS
The plot is not pretty, since we arbitrarily selected only the first 2 features to create it. One way to improve it: use a univariate ranking method (e.g. the ANOVA F-value test) to find the best top-2 features, and then use those top-2 features to create a nicer separating surface plot, as in the sketch below.
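A minimal sketch of that ranking step, assuming X_full is the full tf-idf matrix from the pipeline above (i.e. before the manual selection of columns 0 and 1) and y the labels:
from sklearn.feature_selection import SelectKBest, f_classif

# Keep the 2 features with the highest ANOVA F-value against the class labels.
selector = SelectKBest(f_classif, k=2)
X_top2 = selector.fit_transform(X_full, y)  # X_full is assumed from the code above
# X_top2 has shape (n_samples, 2); feed it to make_meshgrid / plot_contours above.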
