There are so many ways to visualize a data set. I want to gather those methods together here, and I have chosen the iris data set for that. The snippets below do exactly that.
I will use either pandas' visualization tools or seaborn's.
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
import pandas as pd
# Parallel Coordinates
# Load the data set
iris = sns.load_dataset("iris")
parallel_coordinates(iris, 'species', color=('#556270', '#4ECDC4', '#C7F464'))
plt.show()
The result is as follows:
from pandas.plotting import andrews_curves
# Andrews curves
andrews_curves(iris, 'species')
plt.show()
and its plot is shown below:
from seaborn import pairplot
# Pair Plot
pairplot(iris, hue='species')
plt.show()
which produces the following figure:
Another plot, which I think is the least used and yet the most important, is the following one:
from plotly.express import scatter_3d
# Plotting in 3D by plotly.express that would show the plot with capability of zooming,
# changing the orientation, and rotating
scatter_3d(iris, x='sepal_length', y='sepal_width', z='petal_length', size="petal_width",
color="species", color_discrete_map={"Joly": "blue", "Bergeron": "violet", "Coderre": "pink"})\
.show()
This one renders in your browser (it requires HTML5), and you can zoom and rotate it as you like. The next figure shows it. Remember that it is a SCATTER plot: the size of each ball encodes petal_width, so all four features appear in one single plot.
Naive Bayes is a classification algorithm for binary (two-class) and multiclass classification
problems. It is called Naive Bayes because the calculation of the probabilities for each class is
simplified to make it tractable. Rather than attempting to calculate the joint probability of all
attribute values, the attributes are assumed to be conditionally independent given the
class value. This is a very strong assumption that is most unlikely in real data, i.e. that the
attributes do not interact. Nevertheless, the approach performs surprisingly well on data where
this assumption does not hold.
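To make that simplification concrete, here is a minimal sketch of the Naive Bayes rule; the per-class Gaussian parameters and the sample values below are illustrative assumptions, not values fitted to the data:
import numpy as np
from scipy.stats import norm
# P(class | x1, x2) is proportional to P(class) * P(x1 | class) * P(x2 | class)
# Assumed per-class Gaussian parameters for two features (illustrative only)
params = {
    "setosa":     {"prior": 1/3, "means": [5.0, 3.4], "stds": [0.35, 0.38]},
    "versicolor": {"prior": 1/3, "means": [5.9, 2.8], "stds": [0.52, 0.31]},
    "virginica":  {"prior": 1/3, "means": [6.6, 3.0], "stds": [0.64, 0.32]},
}
x = [5.1, 3.5]  # a new sample: sepal_length, sepal_width
scores = {}
for cls, p in params.items():
    # Multiplying the per-feature likelihoods is exactly the "naive" independence assumption
    likelihood = np.prod([norm.pdf(xi, m, s) for xi, m, s in zip(x, p["means"], p["stds"])])
    scores[cls] = p["prior"] * likelihood
# Normalize to get posterior probabilities and pick the most likely class
total = sum(scores.values())
posteriors = {cls: s / total for cls, s in scores.items()}
print(posteriors)
print(max(posteriors, key=posteriors.get))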
Here is a good example of developing a model to predict the labels of this data set. You can use it as a template for other models, since it covers the basics.
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# Load the data set
iris = sns.load_dataset("iris")
iris = iris.rename(index=str, columns={'sepal_length': '1_sepal_length', 'sepal_width': '2_sepal_width',
'petal_length': '3_petal_length', 'petal_width': '4_petal_width'})
# Setup X and y data
X_data_plot = iris.iloc[:, 0:2]
y_labels_plot = iris['species'].replace({'setosa': 0, 'versicolor': 1, 'virginica': 2}).copy()
x_train, x_test, y_train, y_test = train_test_split(iris.iloc[:, 0:4], y_labels_plot, test_size=0.25,
random_state=42) # This is for the model
# Fit model
model_sk_plot = GaussianNB(priors=None)
nb_model = GaussianNB(priors=None)
model_sk_plot.fit(X_data_plot, y_labels_plot)
nb_model.fit(x_train, y_train)
# Our 2-dimensional classifier will be over variables X and Y
N_plot = 100
X_plot = np.linspace(4, 8, N_plot)
Y_plot = np.linspace(1.5, 5, N_plot)
X_plot, Y_plot = np.meshgrid(X_plot, Y_plot)
plot = sns.FacetGrid(iris, hue="species", height=5, palette='husl').map(plt.scatter, "1_sepal_length",
                                                                         "2_sepal_width").add_legend()
my_ax = plot.ax
# Computing the predicted class function for each value on the grid
zz = np.array([model_sk_plot.predict([[xx, yy]])[0] for xx, yy in zip(np.ravel(X_plot), np.ravel(Y_plot))])
# Reshaping the predicted class into the meshgrid shape
Z = zz.reshape(X_plot.shape)
# Plot the filled and boundary contours
my_ax.contourf(X_plot, Y_plot, Z, 2, alpha=.1, colors=('blue', 'green', 'red'))
my_ax.contour(X_plot, Y_plot, Z, 2, alpha=1, colors=('blue', 'green', 'red'))
# Add axis and title
my_ax.set_xlabel('Sepal length')
my_ax.set_ylabel('Sepal width')
my_ax.set_title('Gaussian Naive Bayes decision boundaries')
plt.show()
Add whatever you think is necessary to this; for example, decision boundaries in 3D are something I have not done before.
Here is the link to the data set I used: Dataset
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Let's begin with polynomial regression
df = pd.read_excel('enes.xlsx', index='hacim')
X=pd.DataFrame(df['hacim'])
Y=pd.DataFrame(df['delay'])
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 4)
X_poly = poly_reg.fit_transform(X)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, Y)
plt.scatter(X, Y, color = 'red')
plt.plot(X, lin_reg_2.predict(poly_reg.fit_transform(X)), color = 'blue')
plt.title('X Vs Y')
plt.xlabel('hacim')
plt.ylabel('delay')
plt.show()
The last plt.show() shows a graph with many lines instead of the single polynomial regression line I wanted. What is wrong and how can I fix it?
Data
,hacim,delay
0,815,1.44
1,750,1.11
2,321,2.37
3,1021,1.44
4,255,1.09
5,564,1.61
6,1455,15.27
7,525,2.7
8,1118,106.98
9,1036,3.47
10,396,1.34
11,1485,21.49
12,1017,12.22
13,1345,2.72
14,312,1.71
15,742,33.79
16,1100,39.62
17,1445,4.88
18,847,1.55
19,991,1.82
20,1296,10.77
21,854,1.81
22,1198,61.9
23,1162,8.22
24,1463,42.25
25,1272,4.31
26,745,2.36
27,521,2.14
28,1247,94.33
29,732,12.55
30,489,1.05
31,1494,12.78
32,591,3.18
33,257,1.18
34,602,4.24
35,335,2.06
36,523,3.63
37,752,7.61
38,349,1.76
39,771,0.79
40,855,39.08
41,948,3.95
42,1378,97.28
43,598,2.69
44,558,1.67
45,634,34.69
46,1146,12.22
47,1087,1.74
48,628,1.03
49,711,3.34
50,1116,7.27
51,748,1.09
52,1212,14.16
53,434,1.42
54,1046,8.25
55,568,1.33
56,894,2.61
57,1041,4.79
58,801,1.84
59,1387,11.5
60,1171,161.21
61,734,2.43
62,1471,17.42
63,461,1.42
64,751,2.36
65,898,2.4
66,593,1.74
67,942,3.39
68,825,1.09
69,715,20.23
70,725,5.43
71,1128,7.57
72,1348,4.49
73,1393,9.77
74,1379,97.76
75,859,2.59
76,612,15.98
77,1495,8.22
78,887,1.85
79,867,38.65
80,1353,1.6
81,851,60.25
82,1079,24.05
83,1100,25.58
84,638,1.23
85,1115,1.94
86,1443,4.79
87,1421,10.33
88,1279,7.29
89,1176,173.44
90,315,1.53
91,1019,34.03
92,1337,48.67
93,576,28.83
94,919,2.88
95,361,1.5
96,989,1.47
97,1286,32.11
Let's use pandas' plotting; it is much easier:
X=pd.DataFrame(df['hacim'])
Y=pd.DataFrame(df['delay'])
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 4)
X_poly = poly_reg.fit_transform(X)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, Y)
df['y_pred'] = lin_reg_2.predict(poly_reg.fit_transform(X))
df = df.sort_values('hacim')
ax = df.plot.scatter('hacim','delay')
df.plot('hacim', 'y_pred', ax=ax, color='r')
plt.title('X Vs Y')
plt.xlabel('hacim')
plt.ylabel('delay')
plt.show()
Output:
The root cause of the scattered lines was unsorted data when plotting the line graph.
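For completeness, here is a minimal sketch of the sort-first fix applied directly to the question's code (it reuses the X, Y, poly_reg and lin_reg_2 defined there):
X_sorted = X.sort_values('hacim')
plt.scatter(X, Y, color='red')
# Predict on the sorted values so the curve is drawn left to right as one line
plt.plot(X_sorted, lin_reg_2.predict(poly_reg.fit_transform(X_sorted)), color='blue')
plt.title('X Vs Y')
plt.xlabel('hacim')
plt.ylabel('delay')
plt.show()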
You could do this:
plt.plot(X, lin_reg_2.predict(poly_reg.fit_transform(X)), color = 'blue', marker='o', linestyle='none')
Output:
I am new to scikit-learn and I have been working on a regression problem (the King County house prices CSV) on Kaggle. I have been training a regression model to predict the price of a house, and I want to plot the results but I have no idea how to do so. I am using Python 3.6. Any advice or suggestion would be greatly appreciated.
#importing numpy and pandas, seaborn
import numpy as np #linear algebra
import pandas as pd #datapreprocessing, CSV file I/O
import seaborn as sns #for plotting graphs
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
data = pd.read_csv('kc_house_data.csv')
data = data.drop('date',axis=1)
data = data.drop('id',axis=1)
X = data
Y = X['price'].values
X = X.drop('price', axis = 1).values
X_train, X_test, Y_train, Y_test = train_test_split (X, Y, test_size = 0.30, random_state=21)
reg = LinearRegression()
kfold = KFold(n_splits=15, shuffle=True, random_state=21)  # shuffle=True is required when random_state is set
cv_results = cross_val_score(reg, X_train, Y_train, cv=kfold, scoring='r2')
print(cv_results)
round(np.mean(cv_results)*100, 2)
This is the code from sklearn: https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html
You can use matplotlib for plotting
import matplotlib.pyplot as plt
plt.figure(figsize=(16, 9))
plt.plot(cv_results)
plt.show()
There are multiple types of plots you can use, such as a simple line plot or a scatter plot.
plt.barh(x, y) # for a horizontal bar graph
plt.plot(x, y) # for a line graph
plt.scatter(x, y) # for a scatter graph
Seaborn is a very useful visualization library. So much so that you can use 'seaborn.regplot' to directly plot the data and regression-model-fit line. It directly takes in the predictor variable and response variable, and spits out the plot of data points and best fit line. Here is the link on how to use it:
https://seaborn.pydata.org/generated/seaborn.regplot.html
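For example, a minimal regplot sketch on this data set might look like this (assuming the standard kc_house_data.csv columns; 'sqft_living' is just one example predictor):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.read_csv('kc_house_data.csv')
# Scatter of one predictor against the target, with a fitted regression line on top
sns.regplot(x='sqft_living', y='price', data=data)
plt.show()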
I have also done the same competition on kaggle.
For regressions I would go for a scatter plot:
import matplotlib.pyplot as plt
plt.scatter(x, y)
As for the visualisations on that particular competition I would use the following code:
# visualising some more outliers in the data values
# Note: this assumes a `train` DataFrame and a `numeric` list of numeric column names
# from that competition's notebook.
fig = plt.figure(figsize=(12, 120))  # nrows=0 is not valid for plt.subplots, so create the figure directly
plt.subplots_adjust(right=2)
plt.subplots_adjust(top=2)
sns.color_palette("husl", 8)
for i, feature in enumerate(list(train[numeric]), 1):
    if feature == 'MiscVal':
        break
    plt.subplot(len(list(numeric)), 3, i)
    sns.scatterplot(x=feature, y='SalePrice', hue='SalePrice', palette='Blues', data=train)
    plt.xlabel('{}'.format(feature), size=15, labelpad=12.5)
    plt.ylabel('SalePrice', size=15, labelpad=12.5)
    for j in range(2):
        plt.tick_params(axis='x', labelsize=12)
        plt.tick_params(axis='y', labelsize=12)
    plt.legend(loc='best', prop={'size': 10})
plt.show()
I have actually uploaded the full code for that competition on my GitHub if you want to have a look ;) (I am currently in the top 14% on that competition).
I am getting the error
"ValueError: Expected 2D array, got 1D array instead: array=[ 45000.
50000. 60000. 80000. 110000. 150000. 200000. 300000.
500000. 1000000.]. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it
contains a single sample."
while executing the following code:
# SVR
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Position_S.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y)
# Fitting SVR to the dataset
from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(X, y)
# Visualising the SVR results
plt.scatter(X, y, color = 'red')
plt.plot(X, regressor.predict(X), color = 'blue')
plt.title('Truth or Bluff (SVR)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
# Visualising the SVR results (for higher resolution and smoother curve)
X_grid = np.arange(min(X), max(X), 0.01)
X_grid = X_grid.reshape((len(X_grid), 1))
plt.scatter(X, y, color = 'red')
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
plt.title('Truth or Bluff (SVR)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
It seems the expected dimension is wrong. Could you try:
regressor = SVR(kernel='rbf')
regressor.fit(X.reshape(-1, 1), y)
The problem is that if you type y.ndim, you will see the dimension as 1, and if you type X.ndim, you will see the dimension as 2.
So to solve this problem, you have to change y from one dimension to two.
For this, just use the reshape function from numpy.
data=pd.read_csv("Position_Salaries.csv")
X=data.iloc[:,1:2].values
y=data.iloc[:,2].values
y = np.reshape(y, (10, 1))  # (10, 1) matches the 10 rows of this data set; y.reshape(-1, 1) works in general
It should solve the problem caused by the dimension mismatch.
Do the regular Feature Scaling after the above code and it will work for sure.
Do vote if it works for you.
Thanks.
from sklearn.preprocessing import StandardScaler
#Creating two objects for dependent and independent variable
ss_X = StandardScaler()
ss_y = StandardScaler()
X = ss_X.fit_transform(X)
y = ss_y.fit_transform(y.reshape(-1,1))
After the reshape it will work fine.