Modeling-Support Vector Regression (SVR) vs. Linear Regression - python

I'm a little new with modeling techniques and I'm trying to compare SVR and Linear Regression. I've used f(x) = 5x+10 linear function to generate training and test data set. I've written following code snippet so far:
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
with open('test.csv', 'r') as f1:
train_dataframe = pd.read_csv(f1)
X_train = train_dataframe.iloc[:30,(0)]
y_train = train_dataframe.iloc[:30,(1)]
with open('test.csv','r') as f2:
test_dataframe = pd.read_csv(f2)
X_test = test_dataframe.iloc[30:,(0)]
y_test = test_dataframe.iloc[30:,(1)]
svr = svm.SVR(kernel="rbf", gamma=0.1)
log = LinearRegression()
svr.fit(X_train.reshape(-1,1),y_train)
log.fit(X_train.reshape(-1,1), y_train)
predSVR = svr.predict(X_test.reshape(-1,1))
predLog = log.predict(X_test.reshape(-1,1))
plt.plot(X_test, y_test, label='true data')
plt.plot(X_test, predSVR, 'co', label='SVR')
plt.plot(X_test, predLog, 'mo', label='LogReg')
plt.legend()
plt.show()
As you can see in the picture, Linear Regression works fine but SVM has poor prediction accuracy.
Please let me know if you any suggestion to tackle this issue.
Thanks

The reason is SVR with kernel rbf don't apply the feature scaling. You need to apply feature scaling before fitting the data to the model.
Sample Code For Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X = sc_X.fit_transform(X)
sc_y = StandardScaler()
y = sc_y.fit_transform(y)

Please see the code below:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.cross_validation import train_test_split
X = np.linspace(0,100,101)
y = np.array([(100*np.random.rand(1)+num) for num in (5*x+10)])
X_train, X_test, y_train, y_test = train_test_split(X, y)
svr = SVR(kernel='linear')
lm = LinearRegression()
svr.fit(X_train.reshape(-1,1),y_train.flatten())
lm.fit(X_train.reshape(-1,1), y_train.flatten())
pred_SVR = svr.predict(X_test.reshape(-1,1))
pred_lm = lm.predict(X_test.reshape(-1,1))
plt.plot(X,y, label='True data')
plt.plot(X_test[::2], pred_SVR[::2], 'co', label='SVR')
plt.plot(X_test[1::2], pred_lm[1::2], 'mo', label='Linear Reg')
plt.legend(loc='upper left');
The reason you were going nowhere was rbf kernel

If we adjust SVR rbf model with constrains like this:
svr_rbf=SVR(kernel='rbf', C=1e3, gamma=0.1)
we will see different result as below chart. Green Stars are new SVR_rbf model prediction. Hope it helps.

Related

Problem with plotting decision regions for classification model

I have a problem with plotting decision regions for Logistic Regression classification model. Can somebody help me and explain something how to do that? I put the colab link to this project here -> https://colab.research.google.com/drive/1JqFyoAk0zithy4esfjiyo6MdB12iBndi?usp=sharing
Dataset from Kaggle -> https://www.kaggle.com/datasets/muratkokludataset/date-fruit-datasets
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
from mlxtend.plotting import plot_decision_regions
np.set_printoptions(suppress=True, edgeitems=30, linewidth=100000, formatter=dict(float=lambda x: f'{x:.8f}'))
np.random.seed(42)
sns.set()
desired_width = 320
pd.options.display.float_format = '{:,.8f}'.format
pd.set_option('display.width', desired_width)
pd.set_option('display.max_columns', 12)
raw_data = pd.read_excel(io='/content/Date_Fruit_Datasets.xlsx',
sheet_name='Date_Fruit_Datasets')
data = raw_data.copy()
data.head(n=10)
data.describe().transpose()
data.info()
data.shape
# Creating data and target
X = data.drop(columns='Class')
y = data['Class']
X.shape
y.shape
# Encoding target
encoder = LabelEncoder()
y = encoder.fit_transform(y=y)
# Creating train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Scalling data
scaler = StandardScaler()
X_train = scaler.fit_transform(X=X_train)
X_test = scaler.transform(X=X_test)
# Creating classifier, fitting and predicting
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X=X_train, y=y_train)
y_pred = classifier.predict(X=X_test)
y_pred_proba = classifier.predict_proba(X=X_test)
# Checking finally reports and scores
score = accuracy_score(y_true=y_test, y_pred=y_pred)
report = classification_report(y_true=y_test, y_pred=y_pred, target_names=encoder.classes_)
confusion_matrix = confusion_matrix(y_true=y_test, y_pred=y_pred)
# Compare y_true and y_pred in DataFrame
results = pd.DataFrame(data={
'y_true': y_test,
'y_pred': y_pred
})
# Creating Data Frame with predict proba
predict_proba = pd.DataFrame(data=classifier.predict_proba(X=X_test), columns=encoder.classes_)
# Saving results to csv
results.to_csv(path_or_buf='/content/data_fruit_predictions.csv')
predict_proba.to_csv(path_or_buf='/content/data_fruit_predict_proba.csv')
# Plotting decision regions
value = 1.5
width = 0.75
plt.figure(figsize=(10, 8))
plot_decision_regions(X=X.values, y=y, clf=classifier,
filler_feature_values={i: value for i in range(1, 34)},
filler_feature_ranges={i: width for i in range(1, 34)}, legend=2)
plt.show()
After using function plot_decision_regions PyCharm shows me error like:
UserWarning: No contour levels were found within the data range.
ax.contour(xx, yy, Z, cset.levels,
and
UserWarning: You passed a edgecolor/edgecolors ('black') for an unfilled marker ('x'). Matplotlib is ignoring the edgecolor in favor of the facecolor. This behavior may change in the future.
ax.scatter(x=x_data,

Calculating AUC for LogisticRegression model

Let's take data
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
data = load_breast_cancer()
X = data.data
y = data.target
I want to create model using only first principal component and calculate AUC for it.
My work so far
scaler = StandardScaler()
scaler.fit(X_train)
X_scaled = scaler.transform(X)
pca = PCA(n_components=1)
principalComponents = pca.fit_transform(X)
principalDf = pd.DataFrame(data = principalComponents
, columns = ['principal component 1'])
clf = LogisticRegression()
clf = clf.fit(principalDf, y)
pred = clf.predict_proba(principalDf)
But while I'm trying to use
fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)
Following error occurs :
y should be a 1d array, got an array of shape (569, 2) instead.
I tried to reshape my data
fpr, tpr, thresholds = metrics.roc_curve(y.reshape(1,-1), pred, pos_label=2)
But it didn't solve the issue (it outputs) :
multilabel-indicator format is not supported
Do you have any idea how can I perform AUC on this first principal component?
You may wish to try:
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
X,y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X,y)
scaler = StandardScaler()
pca = PCA(2)
clf = LogisticRegression()
ppl = Pipeline([("scaler",scaler),("pca",pca),("clf",clf)])
ppl.fit(X_train, y_train)
preds = ppl.predict(X_test)
fpr, tpr, thresholds = metrics.roc_curve(y_test, preds, pos_label=1)
metrics.plot_roc_curve(ppl, X_test, y_test)
The problem is that predict_proba returns a column for each class. Generally with binary classification, your classes are 0 and 1, so you want the probability of the second class, so it's quite common to slice as follows (replacing the last line in your code block):
pred = clf.predict_proba(principalDf)[:, 1]

Dataset indices for predicted values is not matching with those for actual values

I am a python novice who is trying to solve a regression problem with neural networks. I am at the stage where I want to plot the predicted vs actual followed by determining the regression coefficient.
Model training
#import statements
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
%matplotlib inline
#importing the dataset
data = pd.read_csv("PPV_dataset.csv")
X = np.array(data.drop(["PPV"],1))
y = np.array(data["PPV"])
#model training & prediction
nn = MLPRegressor(hidden_layer_sizes=(100,), activation = 'logistic', solver = 'sgd')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
nn.fit(X_train, y_train)
pred = nn.predict(X_test)
#indices of test set
a = X_test
indices = []
for row in range(len(X)):
for i in range(len(a)):
if np.all(a[i]==X[row]):
indices.append(row)
#listing actual values in an array
actual_values = []
for i in range(len(indices)):
actual_values.append(y[indices[i]])
Comparing actual to predicted values
len(actual_values)
13
len(pred)
12
Image of dataset
You should use the matplotlib and the seaborn libraries for plotting you graph,
and for coeficient r_sq = nn.score(actual_values, pred)
I recommend using seaborn.lmplot() in your case
for roberts particular case I suggest:
from sklearn.metrics import r2_score
r2_score(y_true, y_pred)

Should I use feature scaling with polynomial regression with scikit-learn?

I have been playing around with lasso regression on polynomial functions using the code below. The question I have is should I be doing feature scaling as part of the lasso regression (when attempting to fit a polynomial function). The R^2 results and plot as outlined in the code I have pasted below suggests not. Appreciate any advice on why this is not the case or if I have fundamentally stuffed something up. Thanks in advance for any advice.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
def answer_regression():
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics.regression import r2_score
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
scaler = MinMaxScaler()
global X_train, X_test, y_train, y_test
degrees = 12
poly = PolynomialFeatures(degree=degrees)
X_train_poly = poly.fit_transform(X_train.reshape(-1,1))
X_test_poly = poly.fit_transform(X_test.reshape(-1,1))
#Lasso Regression Model
X_train_scaled = scaler.fit_transform(X_train_poly)
X_test_scaled = scaler.transform(X_test_poly)
#No feature scaling
linlasso = Lasso(alpha=0.01, max_iter = 10000).fit(X_train_poly, y_train)
y_test_lassopredict = linlasso.predict(X_test_poly)
Lasso_R2_test_score = r2_score(y_test, y_test_lassopredict)
#With feature scaling
linlasso = Lasso(alpha=0.01, max_iter = 10000).fit(X_train_scaled, y_train)
y_test_lassopredict_scaled = linlasso.predict(X_test_scaled)
Lasso_R2_test_score_scaled = r2_score(y_test, y_test_lassopredict_scaled)
%matplotlib notebook
plt.figure()
plt.scatter(X_test, y_test, label='Test data')
plt.scatter(X_test, y_test_lassopredict, label='Predict data - No Scaling')
plt.scatter(X_test, y_test_lassopredict_scaled, label='Predict data - With Scaling')
return (Lasso_R2_test_score, Lasso_R2_test_score_scaled)
answer_regression()```
Your X range is around [0,10], so the polynomial features will have a much wider range. Without scaling, their weights are already small (because of their larger values), so Lasso will not need to set them to zero. If you scale them, their weights will be much larger, and Lasso will set most of them to zero. That's why it has a poor prediction for the scaled case (those features are needed to capture the true trend of y).
You can confirm this by getting the weights (linlasso.coef_) for both cases, where you will see that most of the weights for the second case (scaled one) are set to zero.
It seems your alpha is larger than an optimal value and should be tuned. If you decrease alpha, you will get similar results for both cases.

added Standardscaler but receive errors in Cross Validation and the correlation matrix

This is the code I built to apply a multiple linear regression. I added standard scaler to fix the Y intercept p-value which was not significant but the problem that the results of CV RMSE in the end changed and have nosense anymore and received an error in the code for plotting the correlation Matrix saying : AttributeError: 'numpy.ndarray' object has no attribute 'corr'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
from scipy.stats.stats import pearsonr
# Import Excel File
data = pd.read_excel("C:\\Users\\AchourAh\\Desktop\\Multiple_Linear_Regression\\SP Level Reasons Excels\\SP000273701_PL14_IPC_03_09_2018_Reasons.xlsx",'Sheet1') #Import Excel file
# Replace null values of the whole dataset with 0
data1 = data.fillna(0)
print(data1)
# Extraction of the independent and dependent variables
X = data1.iloc[0:len(data1),[1,2,3,4,5,6,7]] #Extract the column of the COPCOR SP we are going to check its impact
Y = data1.iloc[0:len(data1),9] #Extract the column of the PAUS SP
# Data Splitting to train and test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size =0.25,random_state=1)
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
# Statistical Analysis of the training set with Statsmodels
X = sm.add_constant(X_train) # add a constant to the model
est = sm.OLS(Y_train, X).fit()
print(est.summary()) # print the results
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import math
lm = LinearRegression() # create an lm object of LinearRegression Class
lm.fit(X_train,Y_train) # train our LinearRegression model using the training set of data - dependent and independent variables as parameters. Teaching lm that Y_train values are all corresponding to X_train.
print(lm.intercept_)
print(lm.coef_)
mse_test = mean_squared_error(Y_test, lm.predict(X_test))
print(math.sqrt(mse_test))
# Data Splitting to train and test set of the reduced data
X_1 = data1.iloc[0:len(data1),[1,2]] #Extract the column of the COPCOR SP we are going to check its impact
X_train2, X_test2, Y_train2, Y_test2 = train_test_split(X_1, Y, test_size =0.25,random_state=1)
X_train2 = ss.fit_transform(X_train2)
X_test2 = ss.transform(X_test2)
# Statistical Analysis of the reduced model with Statsmodels
X_reduced = sm.add_constant(X_train2) # add a constant to the model
est_reduced = sm.OLS(Y_train2, X_reduced).fit()
print(est_reduced.summary()) # print the results
# Fitting a Linear Model for the reduced model with Scikit-Learn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import math
lm1 = LinearRegression() #create an lm object of LinearRegression Class
lm1.fit(X_train2, Y_train2)
print(lm1.intercept_)
print(lm1.coef_)
mse_test1 = mean_squared_error(Y_test2, lm1.predict(X_test2))
print(math.sqrt(mse_test1))
#Cross Validation and Training again the model
from sklearn.model_selection import KFold
from sklearn import model_selection
kf = KFold(n_splits=6, random_state=1)
for train_index, test_index in kf.split(X_train2):
print("Train:", train_index, "Validation:",test_index)
X_train1, X_test1 = X[train_index], X[test_index]
Y_train1, Y_test1 = Y[train_index], Y[test_index]
results = -1 * model_selection.cross_val_score(lm1, X_train1, Y_train1,scoring='neg_mean_squared_error', cv=kf)
print(np.sqrt(results))
#RMSE values interpretation
print(math.sqrt(mse_test1))
print(math.sqrt(results.mean()))
#Good model built no overfitting or underfitting (Barely Same for test and training :Goal of Cross validation but low prediction accuracy = Value is big
import seaborn
Corr=X_train2.corr(method='pearson')
mask=np.zeros_like(Corr)
mask[np.triu_indices_from(mask)]=True
seaborn.heatmap(Corr,cmap='RdYlGn_r',vmax=1.0,vmin=-1.0,mask=mask, linewidths=2.5)
plt.yticks(rotation=0)
plt.xticks(rotation=90)
plt.show()
enter code here
Do you have an idea how to fix the issue ?
I'm guessing the problem lies with:
Corr=X_train2.corr(method='pearson')
.corr is a pandas dataframe method but X_train2 is a numpy array at that stage. If a dataframe/series is passed into StandardScaler, a numpy array is returned. Try replacing the above with:
Corr=pd.DataFrame(X_train2).corr(method='pearson')
or make use of numpy.corrcoef or numpy.correlate in their respective forms.

Categories

Resources