How to plot a scikit-learn linear regression graph - Python
I am new to scikit-learn and have been working on a regression problem (the King County house-price CSV on Kaggle). I have trained a regression model to predict house prices and I want to plot the results, but I have no idea how to do so. I am using Python 3.6. Any advice or suggestion would be greatly appreciated.
# importing numpy, pandas, and seaborn
import numpy as np     # linear algebra
import pandas as pd    # data preprocessing, CSV file I/O
import seaborn as sns  # for plotting graphs
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
data = pd.read_csv('kc_house_data.csv')
data = data.drop('date',axis=1)
data = data.drop('id',axis=1)
X = data
Y = X['price'].values
X = X.drop('price', axis = 1).values
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=21)
reg = LinearRegression()
kfold = KFold(n_splits=15, shuffle=True, random_state=21)  # shuffle=True is required when random_state is set
cv_results = cross_val_score(reg, X_train, Y_train, cv=kfold, scoring='r2')
print(cv_results)
print(round(np.mean(cv_results) * 100, 2))  # mean cross-validated R^2, as a percentage
The scikit-learn documentation has a worked example of plotting an ordinary least squares fit: https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html
You can use matplotlib for plotting:
import matplotlib.pyplot as plt
plt.figure(figsize=(16, 9))
plt.plot(cv_results)
plt.show()
There are multiple types of plots you can use, such as a simple line plot or a scatter plot:
plt.barh(x, y)     # horizontal bar graph
plt.plot(x, y)     # line graph
plt.scatter(x, y)  # scatter graph
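Because your model has many features, you cannot plot price against all of them in one 2-D figure. One common option is to plot predicted against actual prices; a perfect model would put every point on the diagonal. Here is a minimal sketch, reusing reg, X_train, X_test, Y_train, and Y_test from your code:
reg.fit(X_train, Y_train)
Y_pred = reg.predict(X_test)

plt.figure(figsize=(8, 8))
plt.scatter(Y_test, Y_pred, alpha=0.3)  # one point per test house
lims = [min(Y_test.min(), Y_pred.min()), max(Y_test.max(), Y_pred.max())]
plt.plot(lims, lims, color='red')       # y = x reference line
plt.xlabel('Actual price')
plt.ylabel('Predicted price')
plt.title('Predicted vs. actual house prices')
plt.show()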
Seaborn is a very useful visualization library, so much so that you can use seaborn.regplot to plot the data and the regression-model fit line directly. It takes the predictor variable and the response variable and produces a plot of the data points plus the best-fit line. Here is the documentation on how to use it:
https://seaborn.pydata.org/generated/seaborn.regplot.html
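For example, a minimal sketch, assuming the data DataFrame from your code and picking sqft_living (one column in the King County data set) as the predictor:
import seaborn as sns
import matplotlib.pyplot as plt

# Scatter of the data points plus a fitted regression line with a
# confidence band, all in one call
sns.regplot(x='sqft_living', y='price', data=data)
plt.show()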
I have also done the same competition on Kaggle.
For regressions I would go for a scatter plot:
import matplotlib.pyplot as plt  # note: pyplot, not the bare matplotlib package
plt.scatter(x, y)                # scatter plot of predictor x vs. response y
As for the visualisations on that particular competition I would use the following code:
# visualising some more outliers in the data values
# (assumes `train` is the competition's training DataFrame and `numeric`
# is a list of its numeric column names)
plt.figure(figsize=(12, 120))
plt.subplots_adjust(right=2, top=2)
sns.set_palette(sns.color_palette("husl", 8))  # actually apply the palette
for i, feature in enumerate(list(train[numeric]), 1):
    if feature == 'MiscVal':
        break
    plt.subplot(len(list(numeric)), 3, i)
    sns.scatterplot(x=feature, y='SalePrice', hue='SalePrice',
                    palette='Blues', data=train)
    plt.xlabel('{}'.format(feature), size=15, labelpad=12.5)
    plt.ylabel('SalePrice', size=15, labelpad=12.5)
    plt.tick_params(axis='x', labelsize=12)
    plt.tick_params(axis='y', labelsize=12)
    plt.legend(loc='best', prop={'size': 10})
plt.show()
I have actually uploaded the full code for that competition to my GitHub if you want to have a look ;) (I am currently in the top 14% in that competition).
Related
Visualization of the iris data set and a Naive Bayes model
There are many ways to visualize a data set, and I want to collect those methods together here, using the iris data set for the examples. I use either pandas' plotting or seaborn's.
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
import pandas as pd

# Parallel coordinates
# Load the data set
iris = sns.load_dataset("iris")
parallel_coordinates(iris, 'species', color=('#556270', '#4ECDC4', '#C7F464'))
plt.show()
The result is as follows:
from pandas.plotting import andrews_curves

# Andrews curves
andrews_curves(iris, 'species')
plt.show()
Its plot is shown below:
from seaborn import pairplot

# Pair plot
pairplot(iris, hue='species')
plt.show()
which plots the following figure:
And there is another plot, which I think is the least used and the most important:
from plotly.express import scatter_3d

# 3-D scatter plot with plotly.express; the result can be zoomed,
# rotated, and reoriented interactively
scatter_3d(iris, x='sepal_length', y='sepal_width', z='petal_length',
           size='petal_width', color='species',
           color_discrete_map={'setosa': 'blue', 'versicolor': 'violet',
                               'virginica': 'pink'}).show()
This one renders in your browser (it requires HTML5) and you can explore it as you wish. Remember that it is a scatter plot and the size of each ball shows the petal_width value, so all four features appear in one single plot.
Naive Bayes is a classification algorithm for binary (two-class) and multiclass classification problems. It is called Naive Bayes because the calculation of the probabilities for each class is simplified to make it tractable: rather than attempting to model how the attribute values interact, they are assumed to be conditionally independent given the class value. This is a very strong assumption that is most unlikely to hold in real data, i.e. that the attributes do not interact. Nevertheless, the approach performs surprisingly well even on data where the assumption does not hold. Here is a good example of developing a model to predict labels for this data set; you can use it as the basis for developing any other model.
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data set
iris = sns.load_dataset("iris")
iris = iris.rename(index=str, columns={
    'sepal_length': '1_sepal_length', 'sepal_width': '2_sepal_width',
    'petal_length': '3_petal_length', 'petal_width': '4_petal_width'})

# Set up X and y data (the original post referenced undefined frames
# df1/df2; the iris frame itself is used here instead)
X_data_plot = iris[['1_sepal_length', '2_sepal_width']]
y_labels = iris['species'].replace({'setosa': 0, 'versicolor': 1, 'virginica': 2}).copy()
x_train, x_test, y_train, y_test = train_test_split(
    iris.iloc[:, 0:4], y_labels, test_size=0.25, random_state=42)

# Fit one model on the two plotting features and one on all four
model_sk_plot = GaussianNB(priors=None)
nb_model = GaussianNB(priors=None)
model_sk_plot.fit(X_data_plot, y_labels)
nb_model.fit(x_train, y_train)

# The 2-dimensional classifier is evaluated over a grid of sepal
# length/width values
N_plot = 100
X_plot = np.linspace(4, 8, N_plot)
Y_plot = np.linspace(1.5, 5, N_plot)
X_plot, Y_plot = np.meshgrid(X_plot, Y_plot)

plot = sns.FacetGrid(iris, hue="species", height=5, palette='husl') \
    .map(plt.scatter, "1_sepal_length", "2_sepal_width").add_legend()
my_ax = plot.ax

# Compute the predicted class for each point on the grid
zz = np.array([model_sk_plot.predict([[xx, yy]])[0]
               for xx, yy in zip(np.ravel(X_plot), np.ravel(Y_plot))])

# Reshape the predicted classes into the meshgrid shape
Z = zz.reshape(X_plot.shape)

# Plot the filled and boundary contours
my_ax.contourf(X_plot, Y_plot, Z, 2, alpha=.1, colors=('blue', 'green', 'red'))
my_ax.contour(X_plot, Y_plot, Z, 2, alpha=1, colors=('blue', 'green', 'red'))

# Add axis labels and a title
my_ax.set_xlabel('Sepal length')
my_ax.set_ylabel('Sepal width')
my_ax.set_title('Gaussian Naive Bayes decision boundaries')
plt.show()
Add whatever you think is necessary to this; decision boundaries in 3-D, for example, are something I have not done before.
plt.plot draws multiple curves instead of a single curve
Here is the link to the dataset I used: Dataset
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Let's begin with polynomial regression
df = pd.read_excel('enes.xlsx', index_col=0)
X = pd.DataFrame(df['hacim'])
Y = pd.DataFrame(df['delay'])

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, Y)

plt.scatter(X, Y, color='red')
plt.plot(X, lin_reg_2.predict(poly_reg.fit_transform(X)), color='blue')
plt.title('X Vs Y')
plt.xlabel('hacim')
plt.ylabel('delay')
plt.show()
The last plt.show displays a graph with many lines instead of the single polynomial regression curve I wanted. What is wrong and how can I fix it?
Data:
,hacim,delay
0,815,1.44
1,750,1.11
2,321,2.37
3,1021,1.44
4,255,1.09
5,564,1.61
6,1455,15.27
7,525,2.7
8,1118,106.98
9,1036,3.47
10,396,1.34
11,1485,21.49
12,1017,12.22
13,1345,2.72
14,312,1.71
15,742,33.79
16,1100,39.62
17,1445,4.88
18,847,1.55
19,991,1.82
20,1296,10.77
21,854,1.81
22,1198,61.9
23,1162,8.22
24,1463,42.25
25,1272,4.31
26,745,2.36
27,521,2.14
28,1247,94.33
29,732,12.55
30,489,1.05
31,1494,12.78
32,591,3.18
33,257,1.18
34,602,4.24
35,335,2.06
36,523,3.63
37,752,7.61
38,349,1.76
39,771,0.79
40,855,39.08
41,948,3.95
42,1378,97.28
43,598,2.69
44,558,1.67
45,634,34.69
46,1146,12.22
47,1087,1.74
48,628,1.03
49,711,3.34
50,1116,7.27
51,748,1.09
52,1212,14.16
53,434,1.42
54,1046,8.25
55,568,1.33
56,894,2.61
57,1041,4.79
58,801,1.84
59,1387,11.5
60,1171,161.21
61,734,2.43
62,1471,17.42
63,461,1.42
64,751,2.36
65,898,2.4
66,593,1.74
67,942,3.39
68,825,1.09
69,715,20.23
70,725,5.43
71,1128,7.57
72,1348,4.49
73,1393,9.77
74,1379,97.76
75,859,2.59
76,612,15.98
77,1495,8.22
78,887,1.85
79,867,38.65
80,1353,1.6
81,851,60.25
82,1079,24.05
83,1100,25.58
84,638,1.23
85,1115,1.94
86,1443,4.79
87,1421,10.33
88,1279,7.29
89,1176,173.44
90,315,1.53
91,1019,34.03
92,1337,48.67
93,576,28.83
94,919,2.88
95,361,1.5
96,989,1.47
97,1286,32.11
Let's use pandas plot; it is much easier:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = pd.DataFrame(df['hacim'])
Y = pd.DataFrame(df['delay'])

poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, Y)

# flatten to 1-D for the column assignment
df['y_pred'] = lin_reg_2.predict(poly_reg.fit_transform(X)).ravel()
df = df.sort_values('hacim')  # sort so the curve is drawn left to right

ax = df.plot.scatter('hacim', 'delay')
df.plot('hacim', 'y_pred', ax=ax, color='r')
plt.title('X Vs Y')
plt.xlabel('hacim')
plt.ylabel('delay')
plt.show()
Output:
The root cause of the scattered lines was unsorted data being drawn as a line graph. Alternatively, you could keep your original code and draw markers without connecting lines:
plt.plot(X, lin_reg_2.predict(poly_reg.fit_transform(X)), color='blue', marker='o', linestyle='none')
Output:
Plotting SVC decision region
I am following some SVC code in a book using the moons dataset. Here is the code:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15)
rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001))
])
rbf_kernel_svm_clf.fit(X, y)
I have tried plotting these graphs with the following code, but nothing so far:
plt.scatter(X, y)
Any help? Thanks.
You need something more than just a scatter plot to plot the decision regions. A very useful module for this is MLxtend, which makes it easy to plot the decision regions of a fitted model with plot_decision_regions. Here's how you could do it with your example:
from mlxtend.plotting import plot_decision_regions

plt.figure(figsize=(12, 8))
# Pass the whole fitted pipeline so the scaler is applied before the SVC
plot_decision_regions(X, y, clf=rbf_kernel_svm_clf, legend=2)
plt.show()
Multiple traces on Polynomial Regression Graph
I am implementing simple polynomial regression to predict the time for a video given its size, on my own dataset. For some reason, I am getting multiple traces in my plot.
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('estSize.csv')
X = dataset.iloc[:, 0].values.reshape(-1, 1)
y = dataset.iloc[:, 1].values.reshape(-1, 1)

# Fitting Polynomial Regression to the dataset
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(X)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)

# Visualising the Polynomial Regression results
plt.scatter(X, y, color='red')
plt.plot(X, lin_reg_2.predict(poly_reg.fit_transform(X)), color='blue')
plt.show()
Your data needs to be ordered with respect to the predictor. After the line
dataset = pd.read_csv('estSize.csv')
add this line:
dataset = dataset.sort_values(by=['col1'])
where col1 is your column header for the file-size values.
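To see why this matters: plt.plot connects points in the order given, so unsorted x values make the line zig-zag back and forth across the figure. Here is a minimal self-contained sketch, with made-up data standing in for estSize.csv:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Made-up stand-in for estSize.csv: file sizes in random order
rng = np.random.RandomState(0)
sizes = rng.uniform(100, 1500, 50)
times = 0.01 * sizes + rng.normal(0, 2, 50)
dataset = pd.DataFrame({'col1': sizes, 'col2': times})

dataset = dataset.sort_values(by=['col1'])  # the crucial step
X = dataset['col1'].values.reshape(-1, 1)
y = dataset['col2'].values.reshape(-1, 1)

poly_reg = PolynomialFeatures(degree=2)
lin_reg = LinearRegression().fit(poly_reg.fit_transform(X), y)

plt.scatter(X, y, color='red')
plt.plot(X, lin_reg.predict(poly_reg.fit_transform(X)), color='blue')  # one smooth curve
plt.show()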
Subplot 3 graph in one figure
I am having trouble with the last subplot. The last crosstab plot appears by itself in its own figure, and the original figure has the first two subplots, but the third one is empty and contains no data. How can I arrange it so that all three graphs appear in one figure and share the same y axis ('Frequency')?
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import cross_val_score

# Data exploration
data = sm.datasets.fair.load_pandas().data
data['affair'] = np.where(data['affairs'] > 0, 1, 0)
print(data)
print(data.groupby('affair').mean())
print(data.groupby('rate_marriage').mean())

plt.subplot(331)
data['educ'].hist()
plt.title('Histogram of Education')
plt.xlabel('Education Level')
plt.ylabel('Frequency')

plt.subplot(332)
data['rate_marriage'].hist()
plt.title('Histogram of Marriage Rating')
plt.xlabel('Marriage Rating')
plt.ylabel('Frequency')

plt.subplot(333)
pd.crosstab(data['rate_marriage'], data['affair'].astype(bool)).plot(kind='bar')
plt.title('Marriage Rating distribution by affair Status')
plt.xlabel('Marriage Rating')
plt.ylabel('Frequency')
plt.show()
You need to tell the pandas plotting function where to plot the data. This can be achieved through the ax keyword:
ax = plt.subplot(333)
pd.crosstab(data['rate_marriage'], data['affair'].astype(bool)).plot(kind='bar', ax=ax)
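If you also want all three panels to share the same y axis, one option (a sketch assuming the same data frame as above) is to create the axes up front with plt.subplots and sharey=True, then pass each axis explicitly:
import matplotlib.pyplot as plt
import pandas as pd

# Three side-by-side panels sharing the y axis
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True, figsize=(15, 5))

data['educ'].hist(ax=ax1)
ax1.set_title('Histogram of Education')
ax1.set_xlabel('Education Level')
ax1.set_ylabel('Frequency')

data['rate_marriage'].hist(ax=ax2)
ax2.set_title('Histogram of Marriage Rating')
ax2.set_xlabel('Marriage Rating')

pd.crosstab(data['rate_marriage'], data['affair'].astype(bool)).plot(kind='bar', ax=ax3)
ax3.set_title('Marriage Rating by Affair Status')
ax3.set_xlabel('Marriage Rating')

plt.show()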