How to plot a scikit-learn linear regression graph - Python
I am new to scikit-learn and have been working on a regression problem (the King County house-price CSV on Kaggle). I have trained a regression model to predict house prices and I want to plot the results, but I have no idea how to do so. I am using Python 3.6. Any advice or suggestion would be greatly appreciated.
# importing numpy, pandas, and seaborn
import numpy as np     # linear algebra
import pandas as pd    # data preprocessing, CSV file I/O
import seaborn as sns  # for plotting graphs
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
data = pd.read_csv('kc_house_data.csv')
data = data.drop('date',axis=1)
data = data.drop('id',axis=1)
X = data
Y = X['price'].values
X = X.drop('price', axis = 1).values
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=21)
reg = LinearRegression()
kfold = KFold(n_splits=15, shuffle=True, random_state=21)  # shuffle=True is required when random_state is set
cv_results = cross_val_score(reg, X_train, Y_train, cv=kfold, scoring='r2')
print(cv_results)
print(round(np.mean(cv_results) * 100, 2))  # mean cross-validated R^2, as a percentage
The scikit-learn documentation has a worked example of plotting an ordinary least squares fit: https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html
You can use matplotlib for plotting:
import matplotlib.pyplot as plt
plt.figure(figsize=(16, 9))
plt.plot(cv_results)
plt.show()
There are multiple types of plots you can use, such as a simple line plot or a scatter plot:
plt.barh(x, y)     # horizontal bar graph
plt.plot(x, y)     # line graph
plt.scatter(x, y)  # scatter graph
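Because your model has many features, you cannot plot price against all of them in one 2-D figure. One common option is to plot predicted against actual prices; a perfect model would put every point on the diagonal. Here is a minimal sketch, reusing reg, X_train, X_test, Y_train, and Y_test from your code:
reg.fit(X_train, Y_train)
Y_pred = reg.predict(X_test)

plt.figure(figsize=(8, 8))
plt.scatter(Y_test, Y_pred, alpha=0.3)  # one point per test house
lims = [min(Y_test.min(), Y_pred.min()), max(Y_test.max(), Y_pred.max())]
plt.plot(lims, lims, color='red')       # y = x reference line
plt.xlabel('Actual price')
plt.ylabel('Predicted price')
plt.title('Predicted vs. actual house prices')
plt.show()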
Seaborn is a very useful visualization library, so much so that you can use seaborn.regplot to plot the data and the regression-model fit line directly. It takes the predictor variable and the response variable and produces a plot of the data points plus the best-fit line. Here is the documentation on how to use it:
https://seaborn.pydata.org/generated/seaborn.regplot.html
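For example, a minimal sketch, assuming the data DataFrame from your code and picking sqft_living (one column in the King County data set) as the predictor:
import seaborn as sns
import matplotlib.pyplot as plt

# Scatter of the data points plus a fitted regression line with a
# confidence band, all in one call
sns.regplot(x='sqft_living', y='price', data=data)
plt.show()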
I have also done the same competition on Kaggle.
For regressions I would go for a scatter plot:
import matplotlib.pyplot as plt  # note: pyplot, not the bare matplotlib package
plt.scatter(x, y)                # scatter plot of predictor x vs. response y
As for the visualisations on that particular competition I would use the following code:
# visualising some more outliers in the data values
# (assumes `train` is the competition's training DataFrame and `numeric`
# is a list of its numeric column names)
plt.figure(figsize=(12, 120))
plt.subplots_adjust(right=2, top=2)
sns.set_palette(sns.color_palette("husl", 8))  # actually apply the palette
for i, feature in enumerate(list(train[numeric]), 1):
    if feature == 'MiscVal':
        break
    plt.subplot(len(list(numeric)), 3, i)
    sns.scatterplot(x=feature, y='SalePrice', hue='SalePrice',
                    palette='Blues', data=train)
    plt.xlabel('{}'.format(feature), size=15, labelpad=12.5)
    plt.ylabel('SalePrice', size=15, labelpad=12.5)
    plt.tick_params(axis='x', labelsize=12)
    plt.tick_params(axis='y', labelsize=12)
    plt.legend(loc='best', prop={'size': 10})
plt.show()
I have actually uploaded the full code for that competition to my GitHub if you want to have a look ;) (I am currently in the top 14% in that competition).
Related
Visualization of the iris data set and a Naive Bayes model
There are many ways to visualize a data set, and I want to collect those methods together here, using the iris data set for the examples. I use either pandas' plotting or seaborn's.
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
import pandas as pd

# Parallel coordinates
# Load the data set
iris = sns.load_dataset("iris")
parallel_coordinates(iris, 'species', color=('#556270', '#4ECDC4', '#C7F464'))
plt.show()
The result is as follows:
from pandas.plotting import andrews_curves

# Andrews curves
andrews_curves(iris, 'species')
plt.show()
Its plot is shown below:
from seaborn import pairplot

# Pair plot
pairplot(iris, hue='species')
plt.show()
which plots the following figure:
And there is another plot, which I think is the least used and the most important:
from plotly.express import scatter_3d

# 3-D scatter plot with plotly.express; the result can be zoomed,
# rotated, and reoriented interactively
scatter_3d(iris, x='sepal_length', y='sepal_width', z='petal_length',
           size='petal_width', color='species',
           color_discrete_map={'setosa': 'blue', 'versicolor': 'violet',
                               'virginica': 'pink'}).show()
This one renders in your browser (it requires HTML5) and you can explore it as you wish. Remember that it is a scatter plot and the size of each ball shows the petal_width value, so all four features appear in one single plot.
Naive Bayes is a classification algorithm for binary (two-class) and multiclass classification problems. It is called Naive Bayes because the calculation of the probabilities for each class is simplified to make it tractable: rather than attempting to model how the attribute values interact, they are assumed to be conditionally independent given the class value. This is a very strong assumption that is most unlikely to hold in real data, i.e. that the attributes do not interact. Nevertheless, the approach performs surprisingly well even on data where the assumption does not hold. Here is a good example of developing a model to predict labels for this data set; you can use it as the basis for developing any other model.
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data set
iris = sns.load_dataset("iris")
iris = iris.rename(index=str, columns={
    'sepal_length': '1_sepal_length', 'sepal_width': '2_sepal_width',
    'petal_length': '3_petal_length', 'petal_width': '4_petal_width'})

# Set up X and y data (the original post referenced undefined frames
# df1/df2; the iris frame itself is used here instead)
X_data_plot = iris[['1_sepal_length', '2_sepal_width']]
y_labels = iris['species'].replace({'setosa': 0, 'versicolor': 1, 'virginica': 2}).copy()
x_train, x_test, y_train, y_test = train_test_split(
    iris.iloc[:, 0:4], y_labels, test_size=0.25, random_state=42)

# Fit one model on the two plotting features and one on all four
model_sk_plot = GaussianNB(priors=None)
nb_model = GaussianNB(priors=None)
model_sk_plot.fit(X_data_plot, y_labels)
nb_model.fit(x_train, y_train)

# The 2-dimensional classifier is evaluated over a grid of sepal
# length/width values
N_plot = 100
X_plot = np.linspace(4, 8, N_plot)
Y_plot = np.linspace(1.5, 5, N_plot)
X_plot, Y_plot = np.meshgrid(X_plot, Y_plot)

plot = sns.FacetGrid(iris, hue="species", height=5, palette='husl') \
    .map(plt.scatter, "1_sepal_length", "2_sepal_width").add_legend()
my_ax = plot.ax

# Compute the predicted class for each point on the grid
zz = np.array([model_sk_plot.predict([[xx, yy]])[0]
               for xx, yy in zip(np.ravel(X_plot), np.ravel(Y_plot))])

# Reshape the predicted classes into the meshgrid shape
Z = zz.reshape(X_plot.shape)

# Plot the filled and boundary contours
my_ax.contourf(X_plot, Y_plot, Z, 2, alpha=.1, colors=('blue', 'green', 'red'))
my_ax.contour(X_plot, Y_plot, Z, 2, alpha=1, colors=('blue', 'green', 'red'))

# Add axis labels and a title
my_ax.set_xlabel('Sepal length')
my_ax.set_ylabel('Sepal width')
my_ax.set_title('Gaussian Naive Bayes decision boundaries')
plt.show()
Add whatever you think is necessary to this; decision boundaries in 3-D, for example, are something I have not done before.
plt.plot draws multiple curves instead of a single curve
Here is the link to the dataset I used: Dataset
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Let's begin with polynomial regression
df = pd.read_excel('enes.xlsx', index_col=0)
X = pd.DataFrame(df['hacim'])
Y = pd.DataFrame(df['delay'])

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, Y)

plt.scatter(X, Y, color='red')
plt.plot(X, lin_reg_2.predict(poly_reg.fit_transform(X)), color='blue')
plt.title('X Vs Y')
plt.xlabel('hacim')
plt.ylabel('delay')
plt.show()
The last plt.show displays a graph with many lines instead of the single polynomial regression curve I wanted. What is wrong and how can I fix it?
Data:
,hacim,delay
0,815,1.44
1,750,1.11
2,321,2.37
3,1021,1.44
4,255,1.09
5,564,1.61
6,1455,15.27
7,525,2.7
8,1118,106.98
9,1036,3.47
10,396,1.34
11,1485,21.49
12,1017,12.22
13,1345,2.72
14,312,1.71
15,742,33.79
16,1100,39.62
17,1445,4.88
18,847,1.55
19,991,1.82
20,1296,10.77
21,854,1.81
22,1198,61.9
23,1162,8.22
24,1463,42.25
25,1272,4.31
26,745,2.36
27,521,2.14
28,1247,94.33
29,732,12.55
30,489,1.05
31,1494,12.78
32,591,3.18
33,257,1.18
34,602,4.24
35,335,2.06
36,523,3.63
37,752,7.61
38,349,1.76
39,771,0.79
40,855,39.08
41,948,3.95
42,1378,97.28
43,598,2.69
44,558,1.67
45,634,34.69
46,1146,12.22
47,1087,1.74
48,628,1.03
49,711,3.34
50,1116,7.27
51,748,1.09
52,1212,14.16
53,434,1.42
54,1046,8.25
55,568,1.33
56,894,2.61
57,1041,4.79
58,801,1.84
59,1387,11.5
60,1171,161.21
61,734,2.43
62,1471,17.42
63,461,1.42
64,751,2.36
65,898,2.4
66,593,1.74
67,942,3.39
68,825,1.09
69,715,20.23
70,725,5.43
71,1128,7.57
72,1348,4.49
73,1393,9.77
74,1379,97.76
75,859,2.59
76,612,15.98
77,1495,8.22
78,887,1.85
79,867,38.65
80,1353,1.6
81,851,60.25
82,1079,24.05
83,1100,25.58
84,638,1.23
85,1115,1.94
86,1443,4.79
87,1421,10.33
88,1279,7.29
89,1176,173.44
90,315,1.53
91,1019,34.03
92,1337,48.67
93,576,28.83
94,919,2.88
95,361,1.5
96,989,1.47
97,1286,32.11
Let's use pandas plot; it is much easier:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = pd.DataFrame(df['hacim'])
Y = pd.DataFrame(df['delay'])

poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, Y)

# flatten to 1-D for the column assignment
df['y_pred'] = lin_reg_2.predict(poly_reg.fit_transform(X)).ravel()
df = df.sort_values('hacim')  # sort so the curve is drawn left to right

ax = df.plot.scatter('hacim', 'delay')
df.plot('hacim', 'y_pred', ax=ax, color='r')
plt.title('X Vs Y')
plt.xlabel('hacim')
plt.ylabel('delay')
plt.show()
Output:
The root cause of the scattered lines was unsorted data being drawn as a line graph. Alternatively, you could keep your original code and draw markers without connecting lines:
plt.plot(X, lin_reg_2.predict(poly_reg.fit_transform(X)), color='blue', marker='o', linestyle='none')
Output:
Plotting SVC decision region
I am following some SVC code in a book using the moons dataset. Here is the code:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15)
rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001))
])
rbf_kernel_svm_clf.fit(X, y)
I have tried plotting these graphs with the following code, but nothing so far:
plt.scatter(X, y)
Any help? Thanks.
You need something more than just a scatter plot to plot the decision regions. A very useful module for this is MLxtend, which makes it easy to plot the decision regions of a fitted model with plot_decision_regions. Here's how you could do it with your example:
from mlxtend.plotting import plot_decision_regions

plt.figure(figsize=(12, 8))
# Pass the whole fitted pipeline so the scaler is applied before the SVC
plot_decision_regions(X, y, clf=rbf_kernel_svm_clf, legend=2)
plt.show()
Multiple traces on Polynomial Regression Graph
I am implementing simple polynomial regression to predict the time for a video given its size, on my own dataset. For some reason, I am getting multiple traces in my plot.
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('estSize.csv')
X = dataset.iloc[:, 0].values.reshape(-1, 1)
y = dataset.iloc[:, 1].values.reshape(-1, 1)

# Fitting Polynomial Regression to the dataset
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(X)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)

# Visualising the Polynomial Regression results
plt.scatter(X, y, color='red')
plt.plot(X, lin_reg_2.predict(poly_reg.fit_transform(X)), color='blue')
plt.show()
Your data needs to be ordered with respect to the predictor. After the line
dataset = pd.read_csv('estSize.csv')
add this line:
dataset = dataset.sort_values(by=['col1'])
where col1 is your column header for the file-size values.
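To see why this matters: plt.plot connects points in the order given, so unsorted x values make the line zig-zag back and forth across the figure. Here is a minimal self-contained sketch, with made-up data standing in for estSize.csv:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Made-up stand-in for estSize.csv: file sizes in random order
rng = np.random.RandomState(0)
sizes = rng.uniform(100, 1500, 50)
times = 0.01 * sizes + rng.normal(0, 2, 50)
dataset = pd.DataFrame({'col1': sizes, 'col2': times})

dataset = dataset.sort_values(by=['col1'])  # the crucial step
X = dataset['col1'].values.reshape(-1, 1)
y = dataset['col2'].values.reshape(-1, 1)

poly_reg = PolynomialFeatures(degree=2)
lin_reg = LinearRegression().fit(poly_reg.fit_transform(X), y)

plt.scatter(X, y, color='red')
plt.plot(X, lin_reg.predict(poly_reg.fit_transform(X)), color='blue')  # one smooth curve
plt.show()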
Subplot 3 graph in one figure
I am having trouble with the last subplot. The last crosstab plot appears by itself in its own figure, and the original figure has the first two subplots, but the third one is empty and contains no data. How can I arrange it so that all three graphs appear in one figure and share the same y axis ('Frequency')?
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import cross_val_score

# Data exploration
data = sm.datasets.fair.load_pandas().data
data['affair'] = np.where(data['affairs'] > 0, 1, 0)
print(data)
print(data.groupby('affair').mean())
print(data.groupby('rate_marriage').mean())

plt.subplot(331)
data['educ'].hist()
plt.title('Histogram of Education')
plt.xlabel('Education Level')
plt.ylabel('Frequency')

plt.subplot(332)
data['rate_marriage'].hist()
plt.title('Histogram of Marriage Rating')
plt.xlabel('Marriage Rating')
plt.ylabel('Frequency')

plt.subplot(333)
pd.crosstab(data['rate_marriage'], data['affair'].astype(bool)).plot(kind='bar')
plt.title('Marriage Rating distribution by affair Status')
plt.xlabel('Marriage Rating')
plt.ylabel('Frequency')
plt.show()
You need to tell the pandas plotting function where to plot the data. This can be achieved through the ax keyword:
ax = plt.subplot(333)
pd.crosstab(data['rate_marriage'], data['affair'].astype(bool)).plot(kind='bar', ax=ax)
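If you also want all three panels to share the same y axis, one option (a sketch assuming the same data frame as above) is to create the axes up front with plt.subplots and sharey=True, then pass each axis explicitly:
import matplotlib.pyplot as plt
import pandas as pd

# Three side-by-side panels sharing the y axis
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True, figsize=(15, 5))

data['educ'].hist(ax=ax1)
ax1.set_title('Histogram of Education')
ax1.set_xlabel('Education Level')
ax1.set_ylabel('Frequency')

data['rate_marriage'].hist(ax=ax2)
ax2.set_title('Histogram of Marriage Rating')
ax2.set_xlabel('Marriage Rating')

pd.crosstab(data['rate_marriage'], data['affair'].astype(bool)).plot(kind='bar', ax=ax3)
ax3.set_title('Marriage Rating by Affair Status')
ax3.set_xlabel('Marriage Rating')

plt.show()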