I have an assignment in which I need to compare my own multi-class logistic regression with the built-in scikit-learn one.
As part of it, I need to plot the decision boundaries of each on the same figure (for 2, 3, and 4 classes separately).
This is my model's decision boundaries for 3 classes:
Made with this code:
x1_min, x1_max = X[:,0].min()-.5, X[:,0].max()+.5
x2_min, x2_max = X[:,1].min()-.5, X[:,1].max()+.5
xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max), np.linspace(x2_min, x2_max))
grid = np.c_[xx1.ravel(), xx2.ravel()]
for i in range(len(ws)):
    probs = ol.predict_prob(grid, ws[i]).reshape(xx1.shape)
    plt.contour(xx1, xx2, probs, [0.5], linewidths=1, colors='green')
where
ol - my own logistic regression model
ws - the learned weights (one weight vector per class)
That's how I tried to plot the Sklearn boundaries:
for i in range(len(clf.coef_)):
w = clf.coef_[i]
a = -w[0] / w[1]
xx = np.linspace(x1_min, x1_max)
yy = a * xx - (clf.intercept_[0]) / w[1]
plt.plot(xx, yy, 'k-')
Resulting in:
I understand that it's due to the 1dim vs 2dim grids, but I can't understand how to solve it.
I also tried to use the built-in DecisionBoundaryDisplay, but I couldn't figure out how to plot it together with my boundaries; it also doesn't plot only the lines, since the whole background gets painted in the corresponding class color.
A couple of fixes:
Change clf.intercept_[0] to clf.intercept_[i]
If the xlimits and ylimits in the plot look strange, you can constrain them.
ax.set_xlim([x1_min, x1_max])
ax.set_ylim([x2_min, x2_max])
MRE:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
X, y = make_blobs(n_features=2, centers=3, random_state=42)
fig, ax = plt.subplots(1, 2)
x1_min, x1_max = X[:,0].min()-.5, X[:,0].max()+.5
x2_min, x2_max = X[:,1].min()-.5, X[:,1].max()+.5
def draw_coef_lines(clf, X, y, ax, title):
    for i in range(len(clf.coef_)):
        w = clf.coef_[i]
        a = -w[0] / w[1]
        xx = np.linspace(x1_min, x1_max)
        yy = a * xx - (clf.intercept_[i]) / w[1]
        ax.plot(xx, yy, 'k-')
    ax.scatter(X[:, 0], X[:, 1], c=y)
    ax.set_xlim([x1_min, x1_max])
    ax.set_ylim([x2_min, x2_max])
    ax.set_title(title)
clf1 = LogisticRegression().fit(X, y)
clf2 = LogisticRegression(multi_class="ovr").fit(X, y)
draw_coef_lines(clf1, X, y, ax[0], "Multinomial")
draw_coef_lines(clf2, X, y, ax[1], "OneVsRest")
plt.show()
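Regarding the DecisionBoundaryDisplay issue mentioned in the question: it takes a plot_method argument, and passing plot_method="contour" draws only the boundary lines instead of painting the whole background. A minimal sketch reusing the MRE above (requires sklearn >= 1.1; the explicit levels are just a suggestion for three classes):
from sklearn.inspection import DecisionBoundaryDisplay
# Only the class-boundary lines of the multinomial model on the left axis;
# the default plot_method="contourf" would fill the background with class colors.
DecisionBoundaryDisplay.from_estimator(
    clf1, X,
    plot_method="contour",
    response_method="predict",
    levels=[0.5, 1.5],   # boundaries between labels 0/1 and 1/2
    colors="k",
    ax=ax[0],
)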
I am trying to make a natural cubic spline using the patsy library.
Here is my code:
import numpy as np
from sklearn.linear_model import LinearRegression
from patsy import cr
import matplotlib.pyplot as plt
x = df.age #some data
y = df.wage
x_basis = cr(x, df=15)
model = LinearRegression().fit(x_basis, y)
y_hat = model.predict(x_basis)
plt.scatter(x, y)
plt.plot(x, y_hat, 'r')
plt.show()
The output is the following:
I believe that there should be one line. How can I solve the issue?
This is just a plotting issue. By default, plt.plot draws a line plot without markers. Since your data is not sorted by the x variable, the line jumps back and forth, making the result look messy. I generated sample data and plotted it first with the plt.plot defaults, and then with the line hidden and markers added. The results are:
With default plt.plot arguments
Rest of the code below (not repeated for convenience)
spline_basis = patsy.cr(x, df=3)
model = LinearRegression().fit(spline_basis, y)
y_spline = model.predict(spline_basis)
plt.scatter(x, y)
plt.plot(x, y_spline, color="red")
plt.show()
With plt.plot when marker='.' and ls=''
Rest of the code below (not repeated for convenience)
spline_basis = patsy.cr(x, df=3)
model = LinearRegression().fit(spline_basis, y)
y_spline = model.predict(spline_basis)
plt.scatter(x, y)
plt.plot(x, y_spline, ls="", marker=".", color="red") # Only this changed
plt.show()
By reordering data
Rest of the code below (not repeated for convenience)
If you want to draw a line plot, you could just reorder the data for plotting, like this:
spline_basis = patsy.cr(x, df=3)
model = LinearRegression().fit(spline_basis, y)
y_spline = model.predict(spline_basis)
plt.scatter(x, y)
xsorted, ysorted = zip(*[(X, Y) for (X, Y) in sorted(zip(x, y_spline))]) # simple reordering
plt.plot(xsorted, ysorted, color="red")
plt.show()
By using new data for prediction
Usually, models are created for prediction. The idea is to fit the model on training data and then use it on some new data. This new data can be anything, and if it is sorted it can be plotted as a line plot. In this case, you can create new x-values, for example
new_x = np.linspace(10, 100, 100)
and calculate the predicted y-values for them. For this, you need to know (and save) only a few values: just df, lower_bound, upper_bound, and the fitted coefficients (model.coef_ and model.intercept_).
# Fit model
spline_basis = patsy.cr(x, df=3, lower_bound=x.min(), upper_bound=x.max())
model = LinearRegression().fit(spline_basis, y)
y_train = model.predict(spline_basis)
# Use model
new_x = np.linspace(10, 100, 100) # 100 points
spline_basis_new = patsy.cr(new_x, df=3, lower_bound=x.min(), upper_bound=x.max())
new_y = model.predict(spline_basis_new)
plt.scatter(x, y)
plt.plot(x, y_train, color="red", ls="", marker=".")
plt.plot(new_x, new_y, color="green")
plt.show()
Rest of the code
from matplotlib import pyplot as plt
import numpy as np
import patsy
from sklearn.linear_model import LinearRegression
def dummy_data():
    np.random.seed(1)
    x = np.random.choice(np.arange(18, 81), size=4000)

    def model(x):
        a = 83 / 107520
        b = -895 / 5376
        c = 17747 / 1680
        d = -622 / 7
        return a * x ** 3 + b * x ** 2 + c * x + d

    def noisemodel(x):
        an = -0.0591836734693878
        bn = 5.25510204081633
        cn = -31.6326530612245
        return an * x ** 2 + bn * x + cn

    y = model(x)
    ynoise = np.array([np.random.randn() * noisemodel(_) for _ in x])
    return x, y + ynoise
x, y = dummy_data()
In short: sort your x values before passing them to the patsy function.
I am implementing a paper in Python which was originally implemented in MATLAB. The paper says that a fifth-degree polynomial was found using curve fitting from a set of sampling data points. I did not want to use their polynomial, so I started from the sample data points (given in the paper) and tried to find a fifth-degree polynomial using sklearn's PolynomialFeatures and linear_model. It is a multivariate function f(x, y), where x and y are the length and width of a certain pond and f is the initial concentration of the pollutant.
My problem is that PolynomialFeatures transforms the train and test points into n-degree polynomial features (as far as I understand it). However, drawing a surface plot with Matplotlib requires a meshgrid, and when I meshgrid my sklearn-transformed test points their shape becomes NxN, whereas clf.predict (where clf is the trained model) expects an array of shape Nxn (N rows, n polynomial features).
Is there any possible way to draw the mesh points for this polynomial?
Paper Link: http://www.ajer.org/papers/v5(11)/A05110105.pdf
Paper title: Mathematical Model of Biological Oxygen Demand in Facultative Wastewater Stabilization Pond Based on Two-Dimensional Advection-Dispersion Model
If possible, please look at the figure 5 and 6 in the paper (link above). This is what I am trying to achieve.
Thank you
from math import exp
import numpy as np
from operator import itemgetter
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
fig = plt.figure()
ax = fig.gca(projection='3d')
def model_BOD(cn):
    cnp1 = []
    n = len(cn)
    # variables:
    dmx = 1e-5
    dmy = 1e-5
    u = 2.10e-4
    v = 2.10e-4
    obs_time = 100
    dt = 0.1
    for t in np.arange(0.1, obs_time, dt):
        for i in range(N):
            for j in range(N):
                d = j + (i-1)*N
                dxp1 = d + N
                dyp1 = d + 1
                dxm1 = d - N
                dym1 = d - 1
                cnp1.append(t*(((-2*dmx/dx**2)+(-2*dmy/dy**2)+(1/t))*cn[dxp1] + (dmx/dx**2)*cn[dyp1] \
                            + (dmy/dy**2)*cn[dym1] - (u/(2*dx))*cn[dxp1] + (u/(2*dx))*cn[dxm1] \
                            - (v/(2*dy))*cn[dyp1] + (v/(2*dy))*cn[dym1]))
        cn = cnp1
        cnp1 = []
    return cn
N = 20
Length = 70
Width = 77
dx = Length/N
dy = Width/N
deg_of_poly = 5
datapoints = np.array([
[12.5,70,81.32],[25,70,88.54],[37.5,70,67.58],[50,70,55.32],[62.5,70,56.84],[77,70,49.52],
[0,11.5,71.32],[77,57.5,67.20],
[0,23,58.54],[25,46,51.32],[37.5,46,49.52],
[0,34.5,63.22],[25,34.5,48.32],[37.5,34.5,82.30],[50,34.5,56.42],[77,34.5,48.32],[37.5,23,67.32],
[0,46,64.20],[77,11.5,41.89],[77,46,55.54],[77,23,52.22],
[0,57.5,93.72],
[0,70,98.20],[77,0,42.32]
])
X = datapoints[:,0:2]
Y = datapoints[:,-1]
predict_x = []
predict_y = []
for i in np.linspace(0, Width, N):
    for j in np.linspace(0, Length, N):
        predict_x.append([i, j])
predict = np.array(predict_x)
poly = PolynomialFeatures(degree=deg_of_poly)
X_ = poly.fit_transform(X)
predict_ = poly.fit_transform(predict)
clf = linear_model.LinearRegression()
clf.fit(X_, Y)
prediction = []
for k, i in enumerate(predict_):
    prediction.append(clf.predict(np.array([i]))[0])
prediction_ = model_BOD(prediction)
print(prediction_)
XX = []
XX = predict[:,0]
YY = []
YY = predict[:,-1]
XX,YY = np.meshgrid(X,Y)
Z = prediction
##R = np.sqrt(XX**2+YY**2)
##Z = np.tan(R)
surf = ax.plot_surface(XX,YY,Z)
plt.show()
If I understand correctly, the key logic here is to generate polynomial features from your meshgrid, make the prediction and plot the prediction with the original meshgrid. Hope the following gives what you want:
import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model
# The training set
datapoints = np.array([
[12.5,70,81.32], [25,70,88.54], [37.5,70,67.58], [50,70,55.32],
[62.5,70,56.84], [77,70,49.52], [0,11.5,71.32], [77,57.5,67.20],
[0,23,58.54], [25,46,51.32], [37.5,46,49.52], [0,34.5,63.22],
[25,34.5,48.32], [37.5,34.5,82.30], [50,34.5,56.42], [77,34.5,48.32],
[37.5,23,67.32], [0,46,64.20], [77,11.5,41.89], [77,46,55.54],
[77,23,52.22], [0,57.5,93.72], [0,70,98.20], [77,0,42.32]
])
X = datapoints[:,0:2]
Y = datapoints[:,-1]
# 5 degree polynomial features
deg_of_poly = 5
poly = PolynomialFeatures(degree=deg_of_poly)
X_ = poly.fit_transform(X)
# Fit linear model
clf = linear_model.LinearRegression()
clf.fit(X_, Y)
# The test set, or plotting set
N = 20
Length = 70
predict_x0, predict_x1 = np.meshgrid(np.linspace(0, Length, N),
np.linspace(0, Length, N))
predict_x = np.concatenate((predict_x0.reshape(-1, 1),
predict_x1.reshape(-1, 1)),
axis=1)
predict_x_ = poly.fit_transform(predict_x)
predict_y = clf.predict(predict_x_)
# Plot
fig = plt.figure(figsize=(16, 6))
ax1 = fig.add_subplot(121, projection='3d')
surf = ax1.plot_surface(predict_x0, predict_x1, predict_y.reshape(predict_x0.shape),
rstride=1, cstride=1, cmap=cm.jet, alpha=0.5)
ax1.scatter(datapoints[:, 0], datapoints[:, 1], datapoints[:, 2], c='b', marker='o')
ax1.set_xlim((70, 0))
ax1.set_ylim((0, 70))
fig.colorbar(surf, ax=ax1)
ax2 = fig.add_subplot(122)
cs = ax2.contourf(predict_x0, predict_x1, predict_y.reshape(predict_x0.shape))
ax2.contour(cs, colors='k')
fig.colorbar(cs, ax=ax2)
plt.show()
I could really use a tip to help me plot a decision boundary to separate two classes of data. I created some sample data (from a Gaussian distribution) via Python NumPy. In this case, every data point is a 2D coordinate, i.e., a 1-column vector consisting of 2 rows. E.g.,
[ 1
2 ]
Let's assume I have 2 classes, class1 and class2, and I created 100 data points for class1 and 100 data points for class2 via the code below (assigned to the variables x1_samples and x2_samples).
mu_vec1 = np.array([0,0])
cov_mat1 = np.array([[2,0],[0,2]])
x1_samples = np.random.multivariate_normal(mu_vec1, cov_mat1, 100)
mu_vec1 = mu_vec1.reshape(1,2).T # to 1-col vector
mu_vec2 = np.array([1,2])
cov_mat2 = np.array([[1,0],[0,1]])
x2_samples = np.random.multivariate_normal(mu_vec2, cov_mat2, 100)
mu_vec2 = mu_vec2.reshape(1,2).T
When I plot the data points for each class, it would look like this:
Now, I came up with an equation for a decision boundary to separate both classes and would like to add it to the plot. However, I am not really sure how I can plot this function:
def decision_boundary(x_vec, mu_vec1, mu_vec2):
    g1 = (x_vec - mu_vec1).T.dot((x_vec - mu_vec1))
    g2 = 2*((x_vec - mu_vec2).T.dot((x_vec - mu_vec2)))
    return g1 - g2
I would really appreciate any help!
EDIT:
Intuitively (If I did my math right) I would expect the decision boundary to look somewhat like this red line when I plot the function...
Your question is more complicated than a simple plot: you need to draw the contour that maximizes the inter-class distance. Fortunately it's a well-studied field, particularly for SVM machine learning.
The easiest method is to install the scikit-learn module, which provides many tools for drawing such boundaries: scikit-learn: Support Vector Machines
Code :
# -*- coding: utf-8 -*-
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
import scipy
from sklearn import svm
mu_vec1 = np.array([0,0])
cov_mat1 = np.array([[2,0],[0,2]])
x1_samples = np.random.multivariate_normal(mu_vec1, cov_mat1, 100)
mu_vec1 = mu_vec1.reshape(1,2).T # to 1-col vector
mu_vec2 = np.array([1,2])
cov_mat2 = np.array([[1,0],[0,1]])
x2_samples = np.random.multivariate_normal(mu_vec2, cov_mat2, 100)
mu_vec2 = mu_vec2.reshape(1,2).T
fig = plt.figure()
plt.scatter(x1_samples[:,0],x1_samples[:,1], marker='+')
plt.scatter(x2_samples[:,0],x2_samples[:,1], c= 'green', marker='o')
X = np.concatenate((x1_samples,x2_samples), axis = 0)
Y = np.array([0]*100 + [1]*100)
C = 1.0 # SVM regularization parameter
clf = svm.SVC(kernel = 'linear', gamma=0.7, C=C )
clf.fit(X, Y)
Linear Plot
w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)
yy = a * xx - (clf.intercept_[0]) / w[1]
plt.plot(xx, yy, 'k-')
Non-linear (RBF kernel) plot
C = 1.0 # SVM regularization parameter
clf = svm.SVC(kernel = 'rbf', gamma=0.7, C=C )
clf.fit(X, Y)
h = .02 # step size in the mesh
# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.contour(xx, yy, Z, cmap=plt.cm.Paired)
Implementation
If you want to implement it yourself, you need to solve the following quadratic programming problem (see the Wikipedia article on support vector machines):
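For reference, the standard soft-margin SVM primal (textbook form, not specific to this answer) is

$$\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{m}\xi_i \quad \text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,$$

whose solution gives the separating hyperplane $w^\top x + b = 0$.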
Unfortunately, for non-linear boundaries like the one you drew, it becomes a harder problem that relies on the kernel trick, and there is no clear-cut closed-form solution.
Based on the way you've written decision_boundary you'll want to use the contour function, as Joe noted above. If you just want the boundary line, you can draw a single contour at the 0 level:
f, ax = plt.subplots(figsize=(7, 7))
c1, c2 = "#3366AA", "#AA3333"
ax.scatter(*x1_samples.T, c=c1, s=40)
ax.scatter(*x2_samples.T, c=c2, marker="D", s=40)
x_vec = np.linspace(*ax.get_xlim())
ax.contour(x_vec, x_vec,
decision_boundary(x_vec, mu_vec1, mu_vec2),
levels=[0], cmap="Greys_r")
Which makes:
Those were some great suggestions, thanks a lot for your help! I ended up solving the equation analytically, and this is the solution I ended up with (I just want to post it for future reference):
# 2-category classification with random 2D-sample data
# from a multivariate normal distribution
import numpy as np
from matplotlib import pyplot as plt
def decision_boundary(x_1):
    """Calculates the x_2 value for plotting the decision boundary."""
    return 4 - np.sqrt(-x_1**2 + 4*x_1 + 6 + np.log(16))
# Generating a Gaussian dataset:
# creating random vectors from the multivariate normal distribution
# given mean and covariance
mu_vec1 = np.array([0,0])
cov_mat1 = np.array([[2,0],[0,2]])
x1_samples = np.random.multivariate_normal(mu_vec1, cov_mat1, 100)
mu_vec1 = mu_vec1.reshape(1,2).T # to 1-col vector
mu_vec2 = np.array([1,2])
cov_mat2 = np.array([[1,0],[0,1]])
x2_samples = np.random.multivariate_normal(mu_vec2, cov_mat2, 100)
mu_vec2 = mu_vec2.reshape(1,2).T # to 1-col vector
# Main scatter plot and plot annotation
f, ax = plt.subplots(figsize=(7, 7))
ax.scatter(x1_samples[:,0], x1_samples[:,1], marker='o', color='green', s=40, alpha=0.5)
ax.scatter(x2_samples[:,0], x2_samples[:,1], marker='^', color='blue', s=40, alpha=0.5)
plt.legend(['Class1 (w1)', 'Class2 (w2)'], loc='upper right')
plt.title('Densities of 2 classes with 25 bivariate random patterns each')
plt.ylabel('x2')
plt.xlabel('x1')
ftext = 'p(x|w1) ~ N(mu1=(0,0)^t, cov1=I)\np(x|w2) ~ N(mu2=(1,1)^t, cov2=I)'
plt.figtext(.15,.8, ftext, fontsize=11, ha='left')
# Adding decision boundary to plot
x_1 = np.arange(-5, 5, 0.1)
bound = decision_boundary(x_1)
plt.plot(x_1, bound, 'r--', lw=3)
x_vec = np.linspace(*ax.get_xlim())
x_1 = np.arange(0, 100, 0.05)
plt.show()
And the code can be found here
EDIT:
I also have a convenience function for plotting decision regions for classifiers that implement a fit and predict method, e.g., the classifiers in scikit-learn, which is useful if the solution cannot be found analytically. A more detailed description of how it works can be found here.
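The linked helper itself is not reproduced here, but a minimal sketch of that kind of fit/predict-based region plot (the name and styling below are mine, not the original function) could look like:
import numpy as np
import matplotlib.pyplot as plt

def plot_regions_sketch(clf, X, y, resolution=0.02):
    # evaluate the fitted classifier on a dense grid covering the data
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, resolution),
                         np.arange(y_min, y_max, resolution))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.3)                   # filled decision regions
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k')    # data points on top
    plt.show()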
You can create your own equation for the boundary by parameterizing it in polar form around a center point (x0, y0), with radius r(theta) = sum_{i=0..n} (a_i*sin(i*theta) + b_i*cos(i*theta)), so the boundary points are (x0 + r*cos(theta), y0 + r*sin(theta)). You have to find the center x0 and y0 as well as the constants a_i and b_i of the radius equation, i.e. 2*(n+1)+2 variables in total. Using scipy.optimize.leastsq is straightforward for this type of problem.
The code attached below builds the residual for leastsq, penalizing the points outside the boundary. The result for your problem, obtained with:
x, y = find_boundary(x2_samples[:,0], x2_samples[:,1], n)
ax.plot(x, y, '-k', lw=2.)
x, y = find_boundary(x1_samples[:,0], x1_samples[:,1], n)
ax.plot(x, y, '--k', lw=2.)
using n=1:
using n=2:
using n=5:
using n=7:
import numpy as np
from numpy import sin, cos, pi
from scipy.optimize import leastsq
def find_boundary(x, y, n, plot_pts=1000):

    def sines(theta):
        ans = np.array([sin(i*theta) for i in range(n+1)])
        return ans

    def cosines(theta):
        ans = np.array([cos(i*theta) for i in range(n+1)])
        return ans

    def residual(params, x, y):
        x0 = params[0]
        y0 = params[1]
        c = params[2:]
        r_pts = ((x-x0)**2 + (y-y0)**2)**0.5
        thetas = np.arctan2((y-y0), (x-x0))
        m = np.vstack((sines(thetas), cosines(thetas))).T
        r_bound = m.dot(c)
        delta = r_pts - r_bound
        delta[delta > 0] *= 10
        return delta

    # initial guess for x0 and y0
    x0 = x.mean()
    y0 = y.mean()
    params = np.zeros(2 + 2*(n+1))
    params[0] = x0
    params[1] = y0
    params[2:] += 1000

    popt, pcov = leastsq(residual, x0=params, args=(x, y),
                         ftol=1.e-12, xtol=1.e-12)

    thetas = np.linspace(0, 2*pi, plot_pts)
    m = np.vstack((sines(thetas), cosines(thetas))).T
    c = np.array(popt[2:])
    r_bound = m.dot(c)
    x_bound = popt[0] + r_bound*cos(thetas)
    y_bound = popt[1] + r_bound*sin(thetas)

    return x_bound, y_bound
I like the mglearn library to draw decision boundaries. Here is one example from the book "Introduction to Machine Learning with Python" by A. Mueller:
import matplotlib.pyplot as plt
import mglearn
from sklearn.neighbors import KNeighborsClassifier

# X, y: any two-feature dataset, e.g. X, y = mglearn.datasets.make_forge()
fig, axes = plt.subplots(1, 3, figsize=(10, 3))
for n_neighbors, ax in zip([1, 3, 9], axes):
    clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X, y)
    mglearn.plots.plot_2d_separator(clf, X, fill=True, eps=0.5, ax=ax, alpha=.4)
    mglearn.discrete_scatter(X[:, 0], X[:, 1], y, ax=ax)
    ax.set_title("{} neighbor(s)".format(n_neighbors))
    ax.set_xlabel("feature 0")
    ax.set_ylabel("feature 1")
axes[0].legend(loc=3)
If you want to use scikit learn, you can write your code like this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
# read data
data = pd.read_csv('ex2data1.txt', header=None)
X = data[[0,1]].values
y = data[2]
# use LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X, y)
# Coefficient of the features in the decision function. (from theta 1 to theta n)
parameters = log_reg.coef_[0]
# Intercept (a.k.a. bias) added to the decision function. (theta 0)
parameter0 = log_reg.intercept_
# Plotting the decision boundary
fig = plt.figure(figsize=(10,7))
x_values = [np.min(X[:, 0]) - 5, np.max(X[:, 0]) + 5]
# calculate y values on the boundary line
y_values = np.dot((-1./parameters[1]), (np.dot(parameters[0],x_values) + parameter0))
colors=['red' if l==0 else 'blue' for l in y]
plt.scatter(X[:, 0], X[:, 1], label='Logistic regression', color=colors)
plt.plot(x_values, y_values, label='Decision Boundary')
plt.show()
see: Building-a-Logistic-Regression-with-Scikit-learn
Just solved a very similar problem with a different approach (root finding) and wanted to post this alternative as an answer here for future reference:
def discr_func(x, y, cov_mat, mu_vec):
    """
    Calculates the value of the discriminant function for a dx1 dimensional
    sample given covariance matrix and mean vector.

    Keyword arguments:
        x_vec: A dx1 dimensional numpy array representing the sample.
        cov_mat: numpy array of the covariance matrix.
        mu_vec: dx1 dimensional numpy array of the sample mean.

    Returns a float value as result of the discriminant function.
    """
    x_vec = np.array([[x], [y]])

    W_i = (-1/2) * np.linalg.inv(cov_mat)
    assert(W_i.shape[0] > 1 and W_i.shape[1] > 1), 'W_i must be a matrix'

    w_i = np.linalg.inv(cov_mat).dot(mu_vec)
    assert(w_i.shape[0] > 1 and w_i.shape[1] == 1), 'w_i must be a column vector'

    omega_i_p1 = (((-1/2) * (mu_vec).T).dot(np.linalg.inv(cov_mat))).dot(mu_vec)
    omega_i_p2 = (-1/2) * np.log(np.linalg.det(cov_mat))
    omega_i = omega_i_p1 - omega_i_p2
    assert(omega_i.shape == (1, 1)), 'omega_i must be a scalar'

    g = ((x_vec.T).dot(W_i)).dot(x_vec) + (w_i.T).dot(x_vec) + omega_i
    return float(g)
#g1 = discr_func(x, y, cov_mat=cov_mat1, mu_vec=mu_vec_1)
#g2 = discr_func(x, y, cov_mat=cov_mat2, mu_vec=mu_vec_2)
x_est50 = list(np.arange(-6, 6, 0.1))
y_est50 = []
for i in x_est50:
    y_est50.append(scipy.optimize.bisect(lambda y: discr_func(i, y, cov_mat=cov_est_1, mu_vec=mu_est_1) - \
                                         discr_func(i, y, cov_mat=cov_est_2, mu_vec=mu_est_2), -10, 10))
y_est50 = [float(i) for i in y_est50]
Here is the result:
(blue: the quadratic case, red: the linear case with equal variances)
I know this question has been answered in a very thorough way analytically. I just wanted to share a possible 'hack' to the problem. It is unwieldy but gets the job done.
Start by building a meshgrid of the 2D area, and then, based on the classifier, build a class map of the entire space. Subsequently, detect row-wise changes in the decision, store the edge points in a list, and scatter-plot those points.
def disc(x):  # returns the class of the point based on location x = [x,y]
    temp = 0.5 + 0.5*np.sign(disc0(x) - disc1(x))
    # disc0() and disc1() are the discriminant functions of the respective classes
    return 0*temp + 1*(1-temp)
num = 200
a = np.linspace(-4,4,num)
b = np.linspace(-6,6,num)
X,Y = np.meshgrid(a,b)
def decColor(x, y):
    temp = np.zeros((num, num))
    print(x.shape, np.size(x, axis=0))
    for l in range(num):
        for m in range(num):
            p = np.array([x[l, m], y[l, m]])
            #print(p)
            temp[l, m] = disc(p)
    return temp
boundColorMap = decColor(X,Y)
group = 0
boundary = []
for x in range(num):
    group = boundColorMap[x, 0]
    for y in range(num):
        if boundColorMap[x, y] != group:
            boundary.append([X[x, y], Y[x, y]])
        group = boundColorMap[x, y]
boundary = np.array(boundary)
Sample decision boundary for a simple bivariate Gaussian classifier
Given two bivariate normal distributions, you can use Gaussian Discriminant Analysis (GDA) to come up with a decision boundary as the difference between the logs of the two PDFs.
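In symbols (my notation, just restating the description above), the function whose zero contour gives the boundary is

$$g(x) = \ln\big(p_1\,\mathcal{N}(x;\mu_1,\Sigma_1)\big) - \ln\big(p_2\,\mathcal{N}(x;\mu_2,\Sigma_2)\big),$$

and the decision boundary is the level set $g(x) = 0$.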
Here's a way to do it using scipy multivariate_normal (the code is not optimized):
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal
from numpy.linalg import norm
from numpy.linalg import inv
from scipy.spatial.distance import mahalanobis
def normal_scatter(mean, cov, p):
    size = 100
    sigma_x = cov[0, 0]
    sigma_y = cov[1, 1]
    mu_x = mean[0]
    mu_y = mean[1]
    x_ps, y_ps = np.random.multivariate_normal(mean, cov, size).T
    x, y = np.mgrid[mu_x-3*sigma_x:mu_x+3*sigma_x:1/size, mu_y-3*sigma_y:mu_y+3*sigma_y:1/size]
    grid = np.empty(x.shape + (2,))
    grid[:, :, 0] = x; grid[:, :, 1] = y
    z = p*multivariate_normal.pdf(grid, mean, cov)
    return x_ps, y_ps, x, y, z
# Dist 1
mu_1 = np.array([1, 1])
cov_1 = .5*np.array([[1, 0], [0, 1]])
p_1 = .5
x_ps, y_ps, x,y,z = normal_scatter(mu_1, cov_1, p_1)
plt.plot(x_ps,y_ps,'x')
plt.contour(x, y, z, cmap='Blues', levels=3)
# Dist 2
mu_2 = np.array([2, 1])
#cov_2 = np.array([[2, -1], [-1, 1]])
cov_2 = cov_1
p_2 = .5
x_ps, y_ps, x,y,z = normal_scatter(mu_2, cov_2, p_2)
plt.plot(x_ps,y_ps,'.')
plt.contour(x, y, z, cmap='Oranges', levels=3)
# Decision Boundary
X = np.empty(x.shape + (2,))
X[:, :, 0] = x; X[:, :, 1] = y
g = np.log(p_1*multivariate_normal.pdf(X, mu_1, cov_1)) - np.log(p_2*multivariate_normal.pdf(X, mu_2, cov_2))
plt.contour(x, y, g, [0])
plt.grid()
plt.axhline(y=0, color='k')
plt.axvline(x=0, color='k')
plt.plot([mu_1[0], mu_2[0]], [mu_1[1], mu_2[1]], 'k')
plt.show()
If the covariance matrices differ (for example the commented-out cov_2 above), you get a non-linear (quadratic) boundary. The decision boundary is given by g above.
Then to plot the decision hyper-plane (a line in 2D), you evaluate g on a 2D mesh and take its zero contour, which gives the separating line.
You can also assume to have equal co-variance matrices for both distributions, which will give a linear decision boundary. In this case, you can replace the calculation of g in the above code with the following:
W = inv(cov_1).dot(mu_1-mu_2)
x_0 = 1/2*(mu_1+mu_2) - cov_1.dot(np.log(p_1/p_2)).dot((mu_1-mu_2)/mahalanobis(mu_1, mu_2, cov_1))
X = np.empty(x.shape + (2,))
X[:, :, 0] = x; X[:, :, 1] = y
g = (X-x_0).dot(W)
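For reference, with a shared covariance $\Sigma$ this is the textbook linear discriminant boundary (standard GDA/LDA form; with equal priors, as in the code above, the prior term vanishes and $x_0$ is just the midpoint of the means):

$$g(x) = (x - x_0)^\top \Sigma^{-1}(\mu_1 - \mu_2), \qquad x_0 = \tfrac{1}{2}(\mu_1 + \mu_2) - \frac{\ln(p_1/p_2)}{(\mu_1 - \mu_2)^\top \Sigma^{-1}(\mu_1 - \mu_2)}\,(\mu_1 - \mu_2),$$

so $g(x) = 0$ is a straight line in 2D.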
I use this method from the book Python Machine Learning (2nd edition):
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt
import numpy as np

def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):
    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.3, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0],
                    y=X[y == cl, 1],
                    alpha=0.8,
                    c=colors[idx],
                    marker=markers[idx],
                    label=cl,
                    edgecolor='black')

    # highlight test samples
    if test_idx:
        # plot all samples
        X_test, y_test = X[test_idx, :], y[test_idx]
        plt.scatter(X_test[:, 0],
                    X_test[:, 1],
                    c='none',  # transparent face so only the edges show
                    edgecolor='black',
                    alpha=1.0,
                    linewidth=1,
                    marker='o',
                    s=100,
                    label='test set')
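A short usage sketch (the dataset and classifier below are only an example, not from the book excerpt):
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_features=2, centers=3, random_state=0)
clf = LogisticRegression().fit(X, y)
plot_decision_regions(X, y, classifier=clf)
plt.legend(loc='upper left')
plt.show()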
Since version 1.1, sklearn has a function for this:
https://scikit-learn.org/stable/modules/generated/sklearn.inspection.DecisionBoundaryDisplay.html#sklearn.inspection.DecisionBoundaryDisplay
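A minimal sketch of typical usage (assuming a classifier clf already fitted on two-feature data X with labels y):
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay

# Filled class regions by default; pass plot_method="contour" to get only the boundary lines.
disp = DecisionBoundaryDisplay.from_estimator(clf, X, response_method="predict", alpha=0.4)
disp.ax_.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
plt.show()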