Confirming my understanding of numpy.ndarray slicing syntax - python

I am referring to the code example here (http://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html), and I am specifically confused by the line iris.data[:, :2]. Since iris.data is 150 (rows) * 4 (columns) dimensional, I think it means: select all rows, and the first two columns. I am asking here to confirm that my understanding is correct, since I have spent time looking but cannot find this slicing syntax defined in an official document.
Another question: I am using the following code to get the number of rows and number of columns, and I am not sure if there is a better, more elegant way. My code is plain Python style, and I do not know whether numpy has a more idiomatic way to get these values.
print len(iris.data) # for number of rows
print len(iris.data[0]) # for number of columns
Using Python 2.7 with the Miniconda interpreter.
print(__doc__)
# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
Y = iris.target
h = .02 # step size in the mesh
logreg = linear_model.LogisticRegression(C=1e5)
# Create an instance of Logistic Regression Classifier and fit the data.
logreg.fit(X, Y)
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(4, 3))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)
# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.show()
regards,
Lin

You are right: that syntax selects all rows and the first two columns/features. Another way to query the dimensions is to look at iris.data.shape, which returns a tuple with the length of each dimension. You can find the indexing documentation here: http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html
import numpy as np

x = np.random.rand(100, 200)
# Select the first 2 columns
y = x[:, :2]
# Number of rows
print(y.shape[0])
# Number of columns
print(y.shape[1])
# Number of dimensions
print(len(y.shape))
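Applied to the iris data from the question, a quick check (assuming scikit-learn is available):

from sklearn import datasets

iris = datasets.load_iris()
print(iris.data.shape)           # (150, 4): 150 rows, 4 columns
print(iris.data.shape[0])        # 150 rows
print(iris.data.shape[1])        # 4 columns
print(iris.data[:, :2].shape)    # (150, 2): all rows, first two columns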

Related

Plotting boundary decision for five dimensional independent variable [duplicate]

I am very new to matplotlib and am working on simple projects to get acquainted with it. I was wondering how I might plot the decision boundary, which is the weight vector of the form [w1,w2] that separates the two classes, let's say C1 and C2, using matplotlib.
Is it as simple as plotting a line from (0,0) to the point (w1,w2) (since W is the weight "vector")? If so, how do I extend the line in both directions if I need to?
Right now all I am doing is :
import matplotlib.pyplot as plt
plt.plot([0,w1],[0,w2])
plt.show()
Thanks in advance.
A decision boundary is generally much more complex than just a line, so (in the two-dimensional case) it is better to use code for the generic case, which will also work well with linear classifiers. The simplest idea is to plot a contour plot of the decision function:
# X - some data in a 2-dimensional np.array
h = .02  # step size in the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
# here "model" is your model's prediction (classification) function
Z = model(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
plt.axis('off')
# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired)
There are some examples of this in the sklearn documentation.
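For a self-contained illustration, here is a minimal runnable version of the same idea; the synthetic blobs data and the LogisticRegression classifier are assumptions for the sketch, not part of the original answer:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs

# two-class synthetic data (illustrative assumption)
X, Y = make_blobs(n_samples=200, centers=2, n_features=2, random_state=0)
# any classifier's predict function can stand in for "model"
model = LogisticRegression().fit(X, Y).predict

h = .02  # step size in the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
Z = model(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired, edgecolors='k')
plt.show()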

How to have multiple categorical markers on a scatterplot

I want to train a logistic regression model, and then create a plot that shows the boundary lines, but in a specific way.
My work so far
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from matplotlib.colors import ListedColormap
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
Y = iris.target
logreg = LogisticRegression(C=1e5)
# Create an instance of Logistic Regression Classifier and fit the data.
logreg.fit(X, Y)
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
h = .02 # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(4, 3))
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, marker='x', edgecolors='k', cmap=cmap_bold)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.show()
However, I find it very unreadable. I want to have a different marker for each class and a legend in the upper-left corner, just like in the image below:
Do you have any idea how I can change that? I played with marker='s' and marker='x', but those change all the points on the scatter plot instead of one specific class.
Since you are plotting with categorical values, you can just plot each class separately:
# Replace this
# plt.scatter(X[:, 0], X[:, 1], c=Y, marker='x', edgecolors='k', cmap=cmap_bold)
# with this
markers = 'sxo'
for m, i in zip(markers, np.unique(Y)):
    mask = Y == i
    plt.scatter(X[mask, 0], X[mask, 1], c=cmap_bold.colors[i],
                marker=m, edgecolors='k', label=i)
plt.legend()
Output:
I find it easier to create a dataframe from X & Y, and then plot the data points with seaborn.scatterplot.
seaborn is a high-level API for matplotlib.
As shown in How to extract the boundary values from k-nearest neighbors predict, the dataframe columns can be used to specify all the data for fitting, as well as the x and y min and max values.
load and set up the data
import numpy as np
import matplotlib.pyplot as plt # version 3.3.1
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from matplotlib.colors import ListedColormap
import seaborn as sns  # version 0.11.0
import pandas as pd  # version 1.1.3
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
# seaborn.scatterplot palette parameter takes a list
palette = ['#FF0000', '#00FF00', '#0000FF']
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
Y = iris.target
# add X & Y to dataframe
df = pd.DataFrame(X, columns=iris.feature_names[:2])
df['label'] = Y
# map the number values to the species name and add it to the dataframe
species_map = dict(zip(range(3), iris.target_names))
df['species'] = df.label.map(species_map)
logreg = LogisticRegression(C=1e5)
# Create an instance of Logistic Regression Classifier and fit the data.
logreg.fit(X, Y)
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
h = .02 # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plot the data
plt.figure(1, figsize=(8, 6))
plt.pcolormesh(xx, yy, Z, cmap=cmap_light, shading='auto')
# Plot also the training points
# add data points using seaborn
sns.scatterplot(data=df, x='sepal length (cm)', y='sepal width (cm)', hue='species',
                style='species', edgecolor='k', alpha=0.5, palette=palette, s=70)
# change legend location
plt.legend(title='Species', loc=2)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
# plt.xticks(())
# plt.yticks(())
plt.show()
alpha=0.5 is used with sns.scatterplot to show that some values of 'versicolor' and 'virginica' overlap.
If the species label is desired for the legend, instead of the name, change hue='species' to hue='label'.
You need to change the single call to plt.scatter into one call per marker type, since matplotlib does not allow passing multiple marker types the way it does with colors.
The plot code becomes something like
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(4, 3))
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
# Plot also the training points
X0 = X[Y == 0]
X1 = X[Y == 1]
X2 = X[Y == 2]
plt.scatter(X0[:, 0], X0[:, 1], marker='s', color="red")
plt.scatter(X1[:, 0], X1[:, 1], marker='x', color="blue")
plt.scatter(X2[:, 0], X2[:, 1], marker='o', color="green")
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.show()
where you individually set the marker type and color of each class. You can also create one list of marker types and another of colors and use a loop, as in the sketch below.
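A minimal sketch of that loop variant (the particular marker and color lists are illustrative choices, not part of the original answer):

markers = ['s', 'x', 'o']
colors = ['red', 'blue', 'green']
for cls, (m, c) in enumerate(zip(markers, colors)):
    pts = X[Y == cls]
    plt.scatter(pts[:, 0], pts[:, 1], marker=m, color=c, label=cls)
plt.legend()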

Trouble with unsupervised clustering method of dataframe

I'm working on some Python ML exercises and I'm stuck on a question.
I have a dataframe with 7 columns and almost 10k rows. Six of those columns/variables are objects and one is a float. The 7 variables are: Company, Job, Technologies, Degree, Experience (the one float variable, number of years), City, and Exp_level.
I want to do unsupervised clustering to show 2 variables I deem important.
The code I've been testing hasn't been working, and it seems there is an issue with the mixed variable types I have.
x = df
y = x.pop('Metier')
y.unique()
OneHotEncoder().fit(df.dropna()).categories_
x.values, y
for weights in ['uniform', 'distance']:
    # we create an instance of Neighbours Classifier and fit the data.
    clf = KNN.KNeighborsClassifier(5, weights=weights)
    clf.fit(x.values, y.values)
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold,
                edgecolor='k', s=20)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("3-Class classification (k = %i, weights = '%s')"
              % (n_neighbors, weights))
plt.show()
This is the 8th exercise, by the way, so all my imports and dataframe loading were done at the beginning.
The error I keep getting is ValueError: could not convert string to float: 'Sanofi' (the name of a company).
I'm doing my best to train and improve my Python skills, and I hope I gave enough information to show that. Is there a better way to achieve my goal? I can only use these libraries:
import pandas as pd
import numpy as np
import re
import sklearn as sk
import sklearn.neighbors as KNN
from sklearn.preprocessing import OneHotEncoder
import seaborn as sb
from matplotlib import pyplot as plt
Hoping I can figure out this tricky exercise, any help would be greatly appreciated! I thank you in advance :) Super happy to be working on my Python skills more and more.
This is my df :
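The error itself points at the categorical columns: KNeighborsClassifier needs an all-numeric matrix, so the object columns have to be one-hot encoded before calling fit. A minimal sketch using only the libraries allowed in the question (the toy values and exact column handling are assumptions for illustration):

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# toy frame mirroring the question's columns (values made up)
df = pd.DataFrame({'Company': ['Sanofi', 'Acme'],
                   'City': ['Paris', 'Lyon'],
                   'Experience': [2.0, 5.0],
                   'Metier': ['data', 'web']})
y = df.pop('Metier')
cat_cols = df.select_dtypes(include='object').columns
enc = OneHotEncoder(handle_unknown='ignore')
X_cat = enc.fit_transform(df[cat_cols]).toarray()      # strings -> 0/1 columns
X = np.hstack([X_cat, df[['Experience']].values])      # keep the numeric column
# X is now all-numeric and can be passed to KNN.KNeighborsClassifier(...).fit(X, y)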

plotting linear SVM

I tried following the example here, but I am having trouble applying it when I have 16 features. lin_svc is trained with those 16 features (I deleted the line that re-trains it from the example). It works; I tried it and also extracted .coef_ beforehand.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
#features is an array of 16
#lin_svc variable is available
#train is a pandas DF
X = train[features].as_matrix()
y = train.outcome
h = .02 # step size in the mesh
# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
# title for the plots
titles = ['SVC with linear kernel']
for i, clf in enumerate([lin_svc]):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    plt.subplot(2, 2, i + 1)
    plt.subplots_adjust(wspace=0.4, hspace=0.4)
    Z = clf.predict(X)
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title(titles[i])
plt.show()
The error i am getting is:
ValueError Traceback (most recent call last)
<ipython-input-8-d52ca252fc3a> in <module>()
24
25 # Put the result into a color plot
---> 26 Z = Z.reshape(xx.shape)
27 plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
28
ValueError: total size of new array must be unchanged
I've encountered this same issue myself. Since you're really interested in plotting Z as a function of xx and yy, you should be passing those to clf.predict() rather than passing X. Try replacing
Z = clf.predict(X)
with
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
and the plot should show nicely (assuming no other bugs).
Also you may want to change the title of your question to something like "Plotting 2-D Decision Boundary," since this has nothing to do with SVMs specifically. You'll encounter this kind of issue with any of the sklearn classifiers.
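A quick way to see why the original reshape fails, as a sketch: reshape only works when the prediction array has exactly xx.size entries, i.e. one prediction per mesh point rather than one per training row:

import numpy as np

xx, yy = np.meshgrid(np.arange(0, 1, .02), np.arange(0, 1, .02))
print(xx.size)                  # 2500 mesh points
Z_wrong = np.zeros(150)         # predict(X) gives one value per training row
# Z_wrong.reshape(xx.shape)     # ValueError: total size of new array must be unchanged
Z_right = np.zeros(xx.size)     # predicting on the mesh gives one value per mesh point
print(Z_right.reshape(xx.shape).shape)  # (50, 50)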

