Trouble with unsupervised clustering method of dataframe - python

I'm working on some Python ML exercises and I'm stuck on a question.
I have a dataframe with 7 columns and almost 10k rows. Six of those columns/variables are objects and one is a float. The 7 variables are: Company, Job, Technologies, Degree, Experience (the float variable, number of years), City, and Exp_level.
I want to do an unsupervised clustering to show 2 variables I deem important.
The code I've been testing hasn't been working, and the issue seems to come from the mixed variable types I have.
x = df
y = x.pop('Metier')
y.unique()

OneHotEncoder().fit(df.dropna()).categories_
x.values, y

for weights in ['uniform', 'distance']:
    # we create an instance of Neighbours Classifier and fit the data.
    clf = KNN.KNeighborsClassifier(5, weights=weights)
    clf.fit(x.values, y.values)

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold,
                edgecolor='k', s=20)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("3-Class classification (k = %i, weights = '%s')"
              % (n_neighbors, weights))

plt.show()
This is the 8th exercise by the way, so all my imports and dataframe loading were done in the beginning.
The error I keep having is ValueError: could not convert string to float: 'Sanofi' (the name of a company).
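(Side note: that traceback appears because KNeighborsClassifier only accepts numeric features, so the object columns have to be encoded before fitting. Below is a minimal sketch using only the libraries listed further down; it assumes 'Metier' is the Job column from the description, and the other column names may need adjusting.)
# Sketch only: encode the object columns so KNN receives numbers.
x = df.dropna().copy()
y = x.pop('Metier')  # the target, as in the original code

cat_cols = ['Company', 'Technologies', 'Degree', 'City', 'Exp_level']
enc = OneHotEncoder(handle_unknown='ignore')
x_cat = enc.fit_transform(x[cat_cols]).toarray()   # strings -> 0/1 columns
x_num = x[['Experience']].to_numpy()               # the single float column
x_all = np.hstack([x_num, x_cat])                  # all-numeric feature matrix

clf = KNN.KNeighborsClassifier(5)
clf.fit(x_all, y)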
I'm doing my best to train and improve my Python skills, and I hope I've given enough information to show that. Is there a better way to achieve my goal? I can only use these libraries:
import pandas as pd
import numpy as np
import re
import sklearn as sk
import sklearn.neighbors as KNN
from sklearn.preprocessing import OneHotEncoder
import seaborn as sb
from matplotlib import pyplot as plt
Hoping I can figure out this tricky exercise; any help would be greatly appreciated! Thanks in advance :) I'm really happy to be working on my Python skills more and more.
This is my df: [screenshot of the dataframe not shown]
Related

Having trouble developing a KNeighboursClassifier analysis in Python

I'm trying to produce a routine using KNeighborsClassifier in Python in Jupyter. My goal is to group the diversity values into 4 types of water masses, but when I test my code, "Dead Kernel" appears on my Jupyter page.
I want to produce a figure similar to this one [example figure not shown], only adapting it to my data. This is the code I'm working on:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.cm as cm
from matplotlib.colors import ListedColormap, BoundaryNorm
import matplotlib.patches as mpatches
from sklearn import neighbors, datasets
from sklearn.neighbors import KNeighborsClassifier

index = pd.read_excel('diverty_index.xlsx')  # This is my data set

X = index[['Shannon', 'Depth']]
y = index['Water_mass']

# default is 75% / 25% train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def plot_water_knn(X, y, n_neighbors, weights):
    X_mat = X[['Shannon', 'Depth']].values  # Shannon is a diversity index
    y_mat = y.values

    # Create color maps
    cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF', '#AFAFAF'])
    cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF', '#AFAFAF'])

    clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
    clf.fit(X_mat, y_mat)

    # Plot the decision boundary by assigning a color in the color map
    # to each mesh point.
    mesh_step_size = .01  # step size in the mesh
    plot_symbol_size = 50

    x_min, x_max = X_mat[:, 0].min() - 1, X_mat[:, 0].max() + 1
    y_min, y_max = X_mat[:, 1].min() - 1, X_mat[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, mesh_step_size),
                         np.arange(y_min, y_max, mesh_step_size))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot training points
    plt.scatter(X_mat[:, 0], X_mat[:, 1], s=plot_symbol_size, c=y,
                cmap=cmap_bold, edgecolor='black')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())

    patch0 = mpatches.Patch(color='#FF0000', label='AASW')
    patch1 = mpatches.Patch(color='#00FF00', label='CDW')
    patch2 = mpatches.Patch(color='#0000FF', label='MWDW')
    patch3 = mpatches.Patch(color='#AFAFAF', label='AABW')
    plt.legend(handles=[patch0, patch1, patch2, patch3])

    plt.xlabel('Shannon H')
    plt.ylabel('Profundidade(m)')
    plt.show()

plot_water_knn(X_train, y_train, 5, 'uniform')
I think the dead-kernel issue is a result of using a fine mesh (mesh_step_size) on a large feature space. Standardizing your data will help, and should improve the model. If it doesn't, your entire dataset might be too large for your machine.
But the first-order problem is that this code is a bit jumbled up, mixing modeling and plotting in a sort of half-written function. Let's simplify everything and start with the classifier. Forget the plotting for now.
Making a classifier
I refactored a bit, using conventional names for things. Try this:
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_excel('diverty_index.xlsx')
X = df[['Shannon', 'Depth']]
y = df['Water_mass']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
Note here that it's important that your samples are not in 'clumps', e.g. location A with 5 measurements at various depths, then location B with 4 measurements at different depths, and so on. If they are 'clumped' like this, you can't just split randomly with train_test_split; instead you need to split on the clumps themselves (e.g. the locations), as in the sketch below.
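Here is a hedged sketch of a group-aware split; it assumes a hypothetical 'Location' column that identifies each clump, so adjust it to whatever column actually marks the clumps in your spreadsheet:
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical: 'Location' marks which clump each sample belongs to.
groups = df['Location']
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]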
Now you need to scale your data. This classifier depends on distances, and distances don't mean much if your features are on different scales. So do this:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test) # Do not fit scaler to Test.
Now you can fit the classifier:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(classification_report(y_test, y_pred))
How does this look? What's the weighted F1 score? Does it improve if you change the value of n_neighbors? How good can you make this model? (Strictly speaking you should have another blind dataset to test all this, but that's a detail.)
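For example, here is a quick way to compare a few values of n_neighbors on the held-out data; it is only a sketch and reuses the scaled arrays from above:
from sklearn.metrics import f1_score

# Sweep k and report the weighted F1 on the test split.
for k in range(1, 16):
    model = KNeighborsClassifier(n_neighbors=k, weights='uniform')
    model.fit(X_train, y_train)
    score = f1_score(y_test, model.predict(X_test), average='weighted')
    print(f"k={k:2d}  weighted F1={score:.3f}")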
If you got this far with an intact kernel, then you can feel good about having a somewhat useful KNN model, and you can move on to the data viz.
Decision region visualization
I suspect that this line was killing your kernel: mesh_step_size = .01, because if the features of X have a large range (e.g. 0 to 10,000), the mesh will be gigantic and eat your memory, crashing the kernel. But now that we've standardized the data, things are more predictable because most values will be in the range -3 to +3.
This minimalist approach adapted from this famous plot should produce something:
fig, ax = plt.subplots()
# Plot the validation data.
ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test)
# Set up the grid parameters.
h = .02
x_min, x_max = X_train[:, 0].min() - .5, X_train[:, 0].max() + .5
y_min, y_max = X_train[:, 1].min() - .5, X_train[:, 1].max() + .5
# Make the grid.
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Predict on the grid.
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot.
Z = Z.reshape(xx.shape)
im = ax.pcolormesh(xx, yy, Z, alpha=0.2, zorder=1, shading='auto')
If this produces something at least, then you're off to the races. I'm sure you can add a title, axis annotation, etc, and make it pretty. Good luck.
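For instance, labels and a title could be added like this (axis names are assumed from your original plot, and remember the data are now standardized):
ax.set_xlabel('Shannon H (standardized)')
ax.set_ylabel('Depth (standardized)')
ax.set_title('KNN decision regions (k = 5)')
plt.show()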


Why are the decision boundaries in binary classification with support vector machine overlapping on my plots? [closed]

I'm performing binary classification using scikit-learn. Everything appears to be orderly in terms of prediction, but when I plot the decision boundaries, they overlap (see plot). I realize that multiclass SVM will inevitably lead to overlapping decision boundaries, but why is this occurring with binary SVM classification? As far as I know they should never overlap, since the space is being divided into two. So any idea why my plots look so disorderly, with so many different colors when there should only be two? Is it how I am plotting? Thank you.
Updated picture with subplots: [image not shown]
def createSVMandPlot(X, y, x_name, y_name):
    h = .02  # step size in the mesh

    # we create an instance of SVM and fit out data. We do not scale our
    # data since we want to plot the support vectors
    C = 1.0  # SVM regularization parameter
    svc = svm.SVC(kernel='linear', C=C).fit(X, y)                              # 1 vs 1
    rbf_svc = svm.SVC(kernel='rbf', gamma='scale', C=C).fit(X, y)              # 1 vs 1
    poly_svc = svm.SVC(kernel='poly', degree=3, gamma='scale', C=C).fit(X, y)  # 1 vs 1
    lin_svc = svm.LinearSVC(C=C).fit(X, y)                                     # 1 vs rest

    print(str(x_name) + ' vs. ' + str(y_name))

    for i, clf in enumerate((svc, lin_svc, rbf_svc, poly_svc)):
        X_pred = clf.predict(X)
        X_pred1 = np.asarray(X_pred).reshape(len(X_pred), 1)
        A = confusion_matrix(X_pred1, y)
        print(A)

        c = 0
        for r in range(len(X_pred)):
            if X_pred[r] == y[r]:
                c += 1
        print(str(c) + ' out of 34 predicted correctly (true positives)')

    # =============================================================================
    # with warnings.catch_warnings():
    #     warnings.filterwarnings("ignore")
    # =============================================================================

    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    # title for the plots
    titles = ['SVC w/ linear kernel',
              'LinearSVC (w/ linear kernel)',
              'SVM w/ RBF kernel',
              'SVM w/ poly(degree 3) kernel']

    plt.pause(7)

    for i, clf in enumerate((svc, lin_svc, rbf_svc, poly_svc)):
        # point in the mesh [x_min, x_max]x[y_min, y_max].
        plt.subplot(2, 2, i + 1)
        plt.subplots_adjust(wspace=0.4, hspace=0.4)

        Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

        # Put the result into a color plot
        Z = Z.reshape(xx.shape)
        plt.contourf(xx, yy, Z, alpha=.5)

        # Plot also the training points
        plt.scatter(X[:, 0], X[:, 1], s=13, c=y)
        plt.xlabel(x_name)
        plt.ylabel(y_name)
        plt.xlim(xx.min(), xx.max())
        plt.ylim(yy.min(), yy.max())
        plt.xticks(())
        plt.yticks(())
        plt.title(titles[i])

    plt.show()
Because you have four different support vector machines: svc, rbf_svc, poly_svc and lin_svc, and you are plotting all of them iteratively. That is why you see overlapping boundaries: four different decision boundaries end up drawn in the same single plot.
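If the goal is one boundary per plot, a minimal sketch would be to give each classifier its own figure; this reuses xx, yy, the titles list and the four fitted models from the code above, so it is not self-contained on its own:
import numpy as np
import matplotlib.pyplot as plt

for clf, title in zip((svc, lin_svc, rbf_svc, poly_svc), titles):
    # Predict over the mesh and draw this classifier's regions alone.
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.figure()
    plt.contourf(xx, yy, Z, alpha=.5)
    plt.scatter(X[:, 0], X[:, 1], s=13, c=y)
    plt.title(title)
    plt.show()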

numpy.ndarray syntax understanding for confirmation

I am referring to the code example here (http://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html), and I am specifically confused by this line: iris.data[:, :2]. Since iris.data is 150 (rows) * 4 (columns), I think it means: select all rows, and the first two columns. I am asking here to confirm that my understanding is correct, since I have spent time looking but cannot find this syntax defined in the official documentation.
Another question: I am using the following code to get the number of rows and columns, and I am not sure if there is a better, more elegant way. My code is written in a plain-Python style, and I don't know whether numpy has a nicer idiom for getting these values.
print len(iris.data) # for number of rows
print len(iris.data[0]) # for number of columns
Using Python 2.7 with miniconda interpreter.
print(__doc__)
# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
Y = iris.target
h = .02 # step size in the mesh
logreg = linear_model.LogisticRegression(C=1e5)
# we create an instance of Neighbours Classifier and fit the data.
logreg.fit(X, Y)
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(4, 3))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)
# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.show()
regards,
Lin
You are right. The first syntax selects all rows and the first 2 columns/features. Another way to query the dimensions is to look at iris.data.shape. This returns an n-dimensional tuple with the lengths. You can find some documentation here: http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html
import numpy as np
x = np.random.rand(100, 200)
# Select the first 2 columns
y = x[:, :2]
# Get the row length
print (y.shape[0])
# Get the column length
print (y.shape[1])
# Number of dimensions
print (len(y.shape))
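Applied to the iris array from the question, that looks like:
from sklearn import datasets

iris = datasets.load_iris()
print(iris.data.shape)         # (150, 4)  -> (number of rows, number of columns)
print(iris.data[:, :2].shape)  # (150, 2)  -> all rows, first two columns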

plot decision boundary matplotlib

I am very new to matplotlib and am working on simple projects to get acquainted with it. I was wondering how I might plot the decision boundary, which is the weight vector of the form [w1, w2] that separates the two classes, let's say C1 and C2, using matplotlib.
Is it as simple as plotting a line from (0,0) to the point (w1,w2) (since W is the weight "vector")? If so, how do I extend it in both directions if I need to?
Right now all I am doing is :
import matplotlib.pyplot as plt
plt.plot([0,w1],[0,w2])
plt.show()
Thanks in advance.
A decision boundary is generally much more complex than just a line, so (in the 2-dimensional case) it is better to use code for the generic case, which will also work with linear classifiers. The simplest idea is to plot a contour plot of the decision function:
import numpy as np
import matplotlib.pyplot as plt

# X - some data in a 2-dimensional np.array
# Y - the corresponding class labels
h = .02  # mesh step size
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# here "model" is your model's prediction (classification) function
Z = model(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
plt.axis('off')

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired)
There are some examples of this in the sklearn documentation.
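For completeness, here is a small self-contained sketch of the same idea, using the iris data and a logistic regression in the spirit of the sklearn examples; the classifier choice is illustrative, not taken from the question:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

# Two features so the decision regions can be drawn in the plane.
iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target

model = LogisticRegression(max_iter=1000).fit(X, y)

h = .02  # mesh step size
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Color each mesh point by its predicted class, then overlay the data.
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=.6)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired, edgecolors='k')
plt.show()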
