I am getting the following error:
"ValueError: Expected 2D array, got 1D array instead: array=[ 45000.
50000. 60000. 80000. 110000. 150000. 200000. 300000.
500000. 1000000.]. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it
contains a single sample."
while executing the following code:
# SVR
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Position_S.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y)
# Fitting SVR to the dataset
from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(X, y)
# Visualising the SVR results
plt.scatter(X, y, color = 'red')
plt.plot(X, regressor.predict(X), color = 'blue')
plt.title('Truth or Bluff (SVR)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
# Visualising the SVR results (for higher resolution and smoother curve)
X_grid = np.arange(min(X), max(X), 0.01)
X_grid = X_grid.reshape((len(X_grid), 1))
plt.scatter(X, y, color = 'red')
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
plt.title('Truth or Bluff (SVR)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
It seems the expected dimension is wrong. Could you try:
regressor = SVR(kernel='rbf')
regressor.fit(X.reshape(-1, 1), y)
The problem is that if you type y.ndim you will see the dimension is 1, while X.ndim is 2.
To solve this, you have to change y from one dimension to two.
For that, just use NumPy's reshape function:
data = pd.read_csv("Position_Salaries.csv")
X = data.iloc[:, 1:2].values
y = data.iloc[:, 2].values
y = np.reshape(y, (10, 1))  # or y.reshape(-1, 1), which works for any number of rows
That should solve the problem caused by the dimension mismatch.
Do the regular feature scaling after the above code and it will work.
Do vote if it works for you.
Thanks.
from sklearn.preprocessing import StandardScaler
#Creating two objects for dependent and independent variable
ss_X = StandardScaler()
ss_y = StandardScaler()
X = ss_X.fit_transform(X)
y = ss_y.fit_transform(y.reshape(-1,1))
After the reshape it will work fine.
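For completeness, a minimal sketch of the rest of the pipeline on the scaled data (reusing the ss_X / ss_y scalers above; the 6.5 position level is just an example input). Note that predictions then have to be inverse-transformed back to the original salary scale:
from sklearn.svm import SVR

regressor = SVR(kernel='rbf')
regressor.fit(X, y.ravel())  # SVR expects a 1d target, so flatten the (n, 1) scaled y

# Predict the salary for position level 6.5 and undo the scaling on the result
level_scaled = ss_X.transform([[6.5]])
salary_scaled = regressor.predict(level_scaled)
salary = ss_y.inverse_transform(salary_scaled.reshape(-1, 1))
print(salary)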
I tried to visualise the PCA-transformed data of the MNIST digits dataset using the sns.scatterplot and plt.scatter approaches below.
from keras.datasets import mnist
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
(X_train, y_train), (X_test, y_test) = mnist.load_data()
dim_1 = X_train.shape[0]
dim_2 = X_train.shape[1]
dim_3 = X_train.shape[2]
arr = X_train.reshape(dim_1, dim_2 * dim_3)
sc = StandardScaler()
norm_arr = sc.fit_transform(arr)
pca = PCA(n_components=2)
pca_arr = pca.fit_transform(norm_arr)
pca_arr = np.vstack((pca_arr.T, y_train)).T
pca_df = pd.DataFrame(data=pca_arr, columns=("1st_principal", "2nd_principal", "label"))
pca_df = pca_df.astype({'label': 'int32'})
Using scatterplot from matplotlib produces this visual:
sns.FacetGrid(pca_df, hue="label", height=6).map(plt.scatter, '1st_principal', '2nd_principal').add_legend()
plt.show()
On the other hand, the scatterplot from seaborn is quite different, in particular the location of digit 9 (in the bottom right corner instead of the upper left corner compared to the first plot).
plt.figure(figsize=(7, 7))
sns.scatterplot(x=pca_arr_combo[:, 0], y=pca_arr_combo[:, 1],
                hue=pca_arr_combo[:, 2], palette=sns.hls_palette(10), legend='full')
plt.show()
Can someone please explain why two different visuals are produced from the same dataset? I was wondering whether sns.FacetGrid had something to do with it, but I am not sure why. Which scatterplot is correct?
Thanks.
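One way to narrow this down is to draw both plots from the same pca_df built above; if they then agree, the difference comes from the pca_arr_combo array used in the seaborn call rather than from the plotting library itself. A sketch reusing the DataFrame from the question:
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# matplotlib scatter, coloured by the integer label
axes[0].scatter(pca_df["1st_principal"], pca_df["2nd_principal"], c=pca_df["label"], cmap="tab10", s=5)
axes[0].set_title("plt.scatter")
# seaborn scatter drawn from the same DataFrame
sns.scatterplot(data=pca_df, x="1st_principal", y="2nd_principal", hue="label",
                palette=sns.hls_palette(10), s=5, legend="full", ax=axes[1])
axes[1].set_title("sns.scatterplot")
plt.show()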
I have been trying to change the gradient palette colours of shap.summary_plot() to ones of my own choosing, specified in RGB.
To illustrate it, I have tried to use matplotlib to create my palette. However, it has not worked so far.
Could someone help me with that?
This is what I have tried so far:
Creating an example with the iris dataset (no problem here):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets
from sklearn.model_selection import train_test_split
import xgboost as xgb
import shap
# import some data to play with
iris = datasets.load_iris()
Y = pd.DataFrame(iris.target, columns = ["Species"])
X = pd.DataFrame(iris.data, columns = iris.feature_names)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0, stratify=Y)
params = {
    # General Parameters
    'booster': 'gbtree',
    # Parameters for boosting
    'eta': 0.2,
    'gamma': 1,
    'max_depth': 5,
    'min_child_weight': 5,
    'subsample': 0.5,
    'colsample_bynode': 0.5,
    'lambda': 0,  # default = 0
    'alpha': 1,   # default = 1
    # Command line parameters
    'num_rounds': 10000,
    # Learning Task Parameters
    'objective': 'multi:softprob'
}
model = xgb.XGBClassifier(**params, verbose=0, cv=5)
# fitting the model
model.fit(X_train,np.ravel(Y_train), eval_set=[(X_test, np.ravel(Y_test))], early_stopping_rounds=20)
# Tree on XGBoost
explainerXGB = shap.TreeExplainer(model, data=X, model_output ="margin")
# Recall that one can instead pass model_output="probability", in which case we explain the output of the model
# transformed into probability space (note that this means the SHAP values then sum to the probability output of the model).
shap_values_XGB_test = explainerXGB.shap_values(X_test)
shap_values_XGB_train = explainerXGB.shap_values(X_train)
shap.summary_plot(shap_values_XGB_train, X_train, )#color=cmap
Up to this point, if you run the code you should get the summary plot with the default colors. In order to change the default ones, I have tried to create my 2-color gradient palette as follows:
from matplotlib import cm
from matplotlib.colors import ListedColormap, LinearSegmentedColormap
RGB_val = 255
color01= (0,150,200) # Blue wanted
color04= (220,60,60) # red wanted
Colors = [color01, color04]
# Creating a blue red palette transition for graphics
Colors= [(R/RGB_val,G/RGB_val,B/RGB_val) for idx, (R,G,B) in enumerate(Colors)]
n = 256
# Start of the creation of the gradient
Color01= ListedColormap(Colors[0], name='Color01', N=None)
Color04= ListedColormap(Colors[1], name='Color04', N=None)
top = cm.get_cmap(Color01,128)
bottom = cm.get_cmap(Color04,128)
newcolors = np.vstack((top(np.linspace(0, 1, 128)),
                       bottom(np.linspace(0, 1, 128))))
mymin0 = newcolors[0][0]
mymin1 = newcolors[0][1]
mymin2 = newcolors[0][2]
mymin3 = newcolors[0][3]
mymax0 = newcolors[255][0]
mymax1 = newcolors[255][1]
mymax2 = newcolors[255][2]
mymax3 = newcolors[255][3]
GradientBlueRed = [np.linspace(mymin0, mymax0, n),
                   np.linspace(mymin1, mymax1, n),
                   np.linspace(mymin2, mymax2, n),
                   np.linspace(mymin3, mymax3, n)]
GradientBlueRed_res =np.transpose(GradientBlueRed)
# End of the creation of the gradient
newcmp = ListedColormap(GradientBlueRed_res, name='BlueRed')
shap.summary_plot(shap_values_XGB_train, X_train, color=newcmp)
But I have not been able to change the colors of the plot.
Can someone explain how to do this for:
(A) a 2-color gradient, or
(B) a 3-color gradient (specifying a color in the middle between the other 2)?
Thank you so much for your time in advance.
As already shown here, my workaround solution uses the set_cmap() function on the figure's artists:
# Create colormap
newcmp = ListedColormap(GradientBlueRed_res, name='BlueRed')
# Plot the summary without showing it
plt.figure()
shap.summary_plot(shap_values_XGB_train, X_train, show=False)
# Change the colormap of the artists
for fc in plt.gcf().get_children():
    for fcc in fc.get_children():
        if hasattr(fcc, "set_cmap"):
            fcc.set_cmap(newcmp)
Result
Actually I found a solution for this SHAP plot (the current version is 0.39). Basically you can generate a cmap and then pass it in via the plot's color parameter (plot_color for the decision plot below).
An example:
import shap
from matplotlib.colors import LinearSegmentedColormap
# Generate colormap through matplotlib
newCmap = LinearSegmentedColormap.from_list("", ['#c4cfd4','#3345ea'])
# Set plot
shap.decision_plot(..., plot_color=newCmap)
I am not sure which version of SHAP you are using, but in version 0.4.0 (02-2022) the summary plot has a cmap parameter, so you can directly pass the cmap you built to it:
shap.summary_plot(shap_values, plot_type='dot', plot_size=(12, 6), cmap='hsv')
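For case (B) in the question, a 3-color gradient with an explicit middle colour can be built the same way with LinearSegmentedColormap.from_list. A sketch, assuming the cmap parameter also accepts a Colormap object; the middle grey is an arbitrary placeholder:
from matplotlib.colors import LinearSegmentedColormap

RGB_val = 255
three_color_cmap = LinearSegmentedColormap.from_list(
    "BlueGreyRed",
    [(0 / RGB_val, 150 / RGB_val, 200 / RGB_val),    # blue from the question
     (200 / RGB_val, 200 / RGB_val, 200 / RGB_val),  # assumed middle colour
     (220 / RGB_val, 60 / RGB_val, 60 / RGB_val)])   # red from the question

shap.summary_plot(shap_values_XGB_train, X_train, cmap=three_color_cmap)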
My question mainly comes from this post: https://stats.stackexchange.com/questions/53/pca-on-correlation-or-covariance
In the article, the author plotted the vector direction and length of each variable. Based on my understanding, after performing PCA all we get are the eigenvectors and eigenvalues. For a dataset of dimension M x N, each eigenvector should be a 1 x N vector. So my question is: maybe the length of the vector is the eigenvalue, but how do I find the direction of the vector for each variable mathematically? And what is the physical meaning of the length of the vector?
Also, if it is possible, can I do something similar with the scikit-learn PCA function in Python?
Thanks!
This plot is called a biplot and it is very useful for understanding the PCA results. The length of the vectors is just the value that each feature/variable has on each principal component, a.k.a. the PCA loadings.
Example:
These loadings are accessible through print(pca.components_). Using the Iris dataset, the loadings are:
[[ 0.52106591, -0.26934744,  0.5804131 ,  0.56485654],
 [ 0.37741762,  0.92329566,  0.02449161,  0.06694199],
 [-0.71956635,  0.24438178,  0.14212637,  0.63427274],
 [-0.26128628,  0.12350962,  0.80144925, -0.52359713]]
Here, each row is one PC and each column corresponds to one variable/feature. So feature/variable 1 has the value 0.52106591 on PC1 and 0.37741762 on PC2. These are the values used to plot the vectors that you saw in the biplot; the coordinates of Var1 are exactly those values.
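To see this mapping explicitly, you can put the transposed components into a labelled DataFrame. A sketch, assuming a PCA fitted on the four Iris features as in the code below:
import pandas as pd

# Rows = original features, columns = principal components
loadings = pd.DataFrame(pca.components_.T,
                        columns=['PC1', 'PC2', 'PC3', 'PC4'],
                        index=iris.feature_names)
print(loadings)  # e.g. the first feature has ~0.521 on PC1 and ~0.377 on PC2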
Finally, to create this plot in Python you can use the following, which uses sklearn:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
iris = datasets.load_iris()
X = iris.data
y = iris.target
#In general it is a good idea to scale the data
scaler = StandardScaler()
scaler.fit(X)
X=scaler.transform(X)
pca = PCA()
pca.fit(X,y)
x_new = pca.transform(X)
def myplot(score, coeff, labels=None):
    xs = score[:, 0]
    ys = score[:, 1]
    n = coeff.shape[0]
    plt.scatter(xs, ys, c=y)  # without scaling
    for i in range(n):
        plt.arrow(0, 0, coeff[i, 0], coeff[i, 1], color='r', alpha=0.5)
        if labels is None:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, "Var" + str(i + 1), color='g', ha='center', va='center')
        else:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, labels[i], color='g', ha='center', va='center')
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()
#Call the function.
myplot(x_new[:,0:2], pca.components_.T)
plt.show()
See also this post: https://stackoverflow.com/a/50845697/5025009
and
https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e?source=friends_link&sk=65bf5440e444c24aff192fedf9f8b64f
Try the pca library. This will plot the explained variance, and create a biplot.
pip install pca
A small example:
from pca import pca
# Initialize to reduce the data up to the number of componentes that explains 95% of the variance.
model = pca(n_components=0.95)
# Or reduce the data towards 2 PCs
model = pca(n_components=2)
# Load example dataset
import pandas as pd
import sklearn
from sklearn.datasets import load_iris
X = pd.DataFrame(data=load_iris().data, columns=load_iris().feature_names, index=load_iris().target)
# Fit transform
results = model.fit_transform(X)
# Plot explained variance
fig, ax = model.plot()
# Scatter first 2 PCs
fig, ax = model.scatter()
# Make biplot with the number of features
fig, ax = model.biplot(n_feat=4)
The results object is a dict containing many statistics of the PCs, loadings, etc.
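A quick sketch of how you might inspect that dict; the exact keys depend on the version of the pca package, so print them first:
# See which statistics were returned (keys vary by pca version)
print(results.keys())
# Then inspect each entry, e.g. the loadings and the transformed PCs
for key, value in results.items():
    print(key, type(value))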
I recently wrote a logistic regression model using scikit-learn. However, I'm having a really hard time plotting the decision boundary line. I'm explicitly multiplying the coefficients and the intercepts and plotting them (which in turn produces a wrong figure).
Could someone point me in the right direction on how to plot the decision boundary?
Is there an easier way to plot the line without having to manually multiply the coefficients and the intercepts?
Thanks a Million!
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
#Import Dataset
dataset = pd.read_csv("Students Exam Dataset.txt", names=["Exam 1", "Exam 2", "Admitted"])
print(dataset.head())
#Visualizing Dataset
positive = dataset[dataset["Admitted"] == 1]
negative = dataset[dataset["Admitted"] == 0]
plt.scatter(positive["Exam 1"], positive["Exam 2"], color="blue", marker="o", label="Admitted")
plt.scatter(negative["Exam 1"], negative["Exam 2"], color="red", marker="x", label="Not Admitted")
plt.title("Student Admission Plot")
plt.xlabel("Exam 1")
plt.ylabel("Exam 2")
plt.legend()
plt.plot()
plt.show()
#Preprocessing Data
col = len(dataset.columns)
x = dataset.iloc[:,0:col].values
y = dataset.iloc[:,col-1:col].values
print(f"X Shape: {x.shape} Y Shape: {y.shape}")
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1306)
#Initialize Model
reg = LogisticRegression()
reg.fit(x_train, y_train)
#Output
predictions = reg.predict(x_test)
accuracy = accuracy_score(y_test, predictions) * 100
coeff = reg.coef_
intercept = reg.intercept_
print(f"Accuracy Score : {accuracy} %")
print(f"Coefficients = {coeff}")
print(f"Intercept Coefficient = {intercept}")
#Visualizing Output
xx = np.linspace(30,100,100)
decision_boundary = (coeff[0,0] * xx + intercept.item()) / coeff[0,1]
plt.scatter(positive["Exam 1"], positive["Exam 2"], color="blue", marker="o", label="Admitted")
plt.scatter(negative["Exam 1"], negative["Exam 2"], color="red", marker="x", label="Not Admitted")
plt.plot(xx, decision_boundary, color="green", label="Decision Boundary")
plt.title("Student Admission Plot")
plt.xlabel("Exam 1")
plt.ylabel("Exam 2")
plt.legend()
plt.show()
Dataset: Student Dataset.txt
Is there an easier way to plot the line without having to manually multiply the coefficients and the intercepts?
Yes, if you don't need to build this from scratch, there is an excellent implementation of plotting decision boundaries for scikit-learn classifiers in the mlxtend package. The documentation at the link provided is extensive, and it's easy to install with pip install mlxtend.
First, a couple points about the Preprocessing block of the code you posted:
1. x should not include the class labels.
2. y should be a 1d array.
#Preprocessing Data
col = len(dataset.columns)
x = dataset.iloc[:,0:col-1].values # assumes your labels are always in the final column.
y = dataset.iloc[:,col-1:col].values
y = y.reshape(-1) # convert to 1d
Now the plotting is as easy as:
from mlxtend.plotting import plot_decision_regions
plot_decision_regions(x, y,
                      X_highlight=x_test,
                      clf=reg,
                      legend=2)
This particular plot highlights x_test data points by encircling them.
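If you still want to draw the line manually (after the preprocessing fix above, so that the model only sees the two exam features), the decision boundary is where coeff[0,0]*x1 + coeff[0,1]*x2 + intercept = 0; solving for x2 gives -(coeff[0,0]*x1 + intercept)/coeff[0,1]. Note the leading minus sign, which the formula in the question is missing. A sketch reusing the variables from the question:
xx = np.linspace(30, 100, 100)
# Solve coeff[0,0]*x1 + coeff[0,1]*x2 + intercept = 0 for x2
decision_boundary = -(coeff[0, 0] * xx + intercept.item()) / coeff[0, 1]
plt.plot(xx, decision_boundary, color="green", label="Decision Boundary")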
I am implementing simple polynomial regression to predict the time for a video given its size, using my own dataset. For some reason, I am getting multiple traces in my plot.
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('estSize.csv')
X = dataset.iloc[:, 0].values.reshape(-1,1)
y = dataset.iloc[:, 1].values.reshape(-1,1)
from sklearn.linear_model import LinearRegression
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 2)
X_poly = poly_reg.fit_transform(X)
poly_reg.fit(X_poly, y)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)
# Visualising the Polynomial Regression results
plt.scatter(X, y, color = 'red')
plt.plot(X, lin_reg_2.predict(poly_reg.fit_transform(X)), color = 'blue')
plt.show()
Your data needs to be ordered with respect to the predictor.
After the line
dataset = pd.read_csv('estSize.csv')
Add this line:
dataset = dataset.sort_values(by=['col1'])
where col1 is the column header for your file-size values.
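Equivalently, you can leave the DataFrame as-is and sort only for plotting with NumPy's argsort. A sketch reusing the variables from the question:
# Sort by file size just for the plot, so the curve is drawn left to right
order = X[:, 0].argsort()
plt.scatter(X, y, color='red')
plt.plot(X[order], lin_reg_2.predict(poly_reg.fit_transform(X[order])), color='blue')
plt.show()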