Should I use feature scaling with polynomial regression with scikit-learn?

Should I use feature scaling with polynomial regression with scikit-learn? - python

I have been playing around with lasso regression on polynomial functions using the code below. The question I have is should I be doing feature scaling as part of the lasso regression (when attempting to fit a polynomial function). The R^2 results and plot as outlined in the code I have pasted below suggests not. Appreciate any advice on why this is not the case or if I have fundamentally stuffed something up. Thanks in advance for any advice.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
def answer_regression():
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics.regression import r2_score
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
scaler = MinMaxScaler()
global X_train, X_test, y_train, y_test
degrees = 12
poly = PolynomialFeatures(degree=degrees)
X_train_poly = poly.fit_transform(X_train.reshape(-1,1))
X_test_poly = poly.fit_transform(X_test.reshape(-1,1))
#Lasso Regression Model
X_train_scaled = scaler.fit_transform(X_train_poly)
X_test_scaled = scaler.transform(X_test_poly)
#No feature scaling
linlasso = Lasso(alpha=0.01, max_iter = 10000).fit(X_train_poly, y_train)
y_test_lassopredict = linlasso.predict(X_test_poly)
Lasso_R2_test_score = r2_score(y_test, y_test_lassopredict)
#With feature scaling
linlasso = Lasso(alpha=0.01, max_iter = 10000).fit(X_train_scaled, y_train)
y_test_lassopredict_scaled = linlasso.predict(X_test_scaled)
Lasso_R2_test_score_scaled = r2_score(y_test, y_test_lassopredict_scaled)
%matplotlib notebook
plt.figure()
plt.scatter(X_test, y_test, label='Test data')
plt.scatter(X_test, y_test_lassopredict, label='Predict data - No Scaling')
plt.scatter(X_test, y_test_lassopredict_scaled, label='Predict data - With Scaling')
return (Lasso_R2_test_score, Lasso_R2_test_score_scaled)
answer_regression()```

Your X range is around [0,10], so the polynomial features will have a much wider range. Without scaling, their weights are already small (because of their larger values), so Lasso will not need to set them to zero. If you scale them, their weights will be much larger, and Lasso will set most of them to zero. That's why it has a poor prediction for the scaled case (those features are needed to capture the true trend of y).
You can confirm this by getting the weights (linlasso.coef_) for both cases, where you will see that most of the weights for the second case (scaled one) are set to zero.
It seems your alpha is larger than an optimal value and should be tuned. If you decrease alpha, you will get similar results for both cases.

Related

Why LightGBM Python-package gives bad prediction using for regression task?

I have a sample time-series dataset (23, 208), which is a pivot table count for 24hrs count for some users; I was experimenting with different regressors from sklearn which work fine (except for SGDRegressor()), but this LightGBM Python-package gives me very linear prediction as follows:
my tried code:
import pandas as pd
dff = pd.read_csv('ex_data2.csv',sep=',')
dff.set_index("timestamp",inplace=True)
print(dff.shape)
from sklearn.model_selection import train_test_split
trainingSetf, testSetf = train_test_split(dff,
#target_attribute,
test_size=0.2,
random_state=42,
#stratify=y,
shuffle=False)
import lightgbm as lgb
from sklearn.multioutput import MultiOutputRegressor
username = 'MMC_HEC_LVP' # select one column for plotting & check regression performance
user_list = []
for column in dff.columns:
user_list.append(column)
index = user_list.index(username)
X_trainf = trainingSetf.iloc[:,:].values
y_trainf = trainingSetf.iloc[:,:].values
X_testf = testSetf.iloc[:,:].values
y_testf = testSetf.iloc[:,:].values
test_set_copy = y_testf.copy()
model_LGBMRegressor = MultiOutputRegressor(lgb.LGBMRegressor()).fit(X_trainf, y_trainf)
pred_LGBMRegressor = model_LGBMRegressor.predict(X_testf)
test_set_copy[:,[index]] = pred_LGBMRegressor[:,[index]]
#plot the results for selected user/column
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
plt.figure(figsize=(12, 10))
plt.xlabel("Date")
plt.ylabel("Values")
plt.title(f"{username} Plot")
plt.plot(trainingSetf.iloc[:,[index]],label='trainingSet')
plt.plot(testSetf.iloc[:,[index]],"--",label='testSet')
plt.plot(test_set_copy[:,[index]],'b--',label='RF_predict')
plt.legend()
So what I am missing is if I use default (hyper-)parameters?

Short Answer
Your dataset has a very small number of rows, and LightGBM's parameters have default values set to provide good performance on medium-sized datasets.
Set the following parameters to force LightGBM to fit to the provided data.
min_data_in_bin = 1
min_data_in_leaf = 1
Long Answer
Before training, LightGBM does some pre-processing on the input data.
For example:
bundling sparse features
binning continuous features into histograms
dropping features which are guaranteed to be uninformative (for example, features which are constant)
The result of that preprocessing is a LightGBM Dataset object, and running that preprocessing is called Dataset "construction". LightGBM performs boosting on this Dataset object, not raw data like numpy arrays or pandas data frames.
To speed up construction and prevent overfitting during training, LightGBM provides ability to the prevent creation of histogram bins that are too small (min_data_in_bin) or splits that produce leaf nodes which match too few records (min_data_in_leaf).
Setting those parameters to very low values may be required to train on small datasets.
I created the following minimal, reproducible example, using Python 3.8.12, lightgbm==3.3.2, numpy==1.22.2, and scikit-learn==1.0.2 demonstrating this behavior.
from lightgbm import LGBMRegressor
from sklearn.metrics import r2_score
from sklearn.datasets import make_regression
# 20-row input data
X, y = make_regression(
n_samples=20,
n_informative=5,
n_features=5,
random_state=708
)
# training produces 0 trees, and predicts mean(y)
reg = LGBMRegressor(
num_boost_round=20,
verbosity=0
)
reg.fit(X, y)
print(f"r2 (defaults): {r2_score(y, reg.predict(X))}")
# 0.000
# training fits and predicts well
reg = LGBMRegressor(
min_data_in_bin=1,
min_data_in_leaf=1,
num_boost_round=20,
verbosity=0
)
reg.fit(X, y)
print(f"r2 (small min_data): {r2_score(y, reg.predict(X))}")
# 0.985
If you use LGBMRegressor(min_data_in_bin=1, min_data_in_leaf=1) in the code in the original post, you'll see predictions that better fit to the provided data.

In this way the model is overfitted!
If you do a random split after creating the dataset and evaluate the model on the test dataset, you will notice that the performance is essentially the same or worse (as in this example).
# SETUP
# =============================================================
from lightgbm import LGBMRegressor
from sklearn.metrics import r2_score
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
X, y = make_regression(
n_samples=200, n_informative=10, n_features=40, random_state=123
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42
)
# =============================================================
# TEST 1
reg = LGBMRegressor(num_boost_round=20, verbosity=0)
reg.fit(X, y)
print(f"r2 (defaults): {r2_score(y, reg.predict(X))}")
# 0.815
reg = LGBMRegressor(
min_data_in_bin=1, min_data_in_leaf=1, num_boost_round=20, verbosity=0
)
reg.fit(X, y)
print(f"r2 (small min_data): {r2_score(y, reg.predict(X))}")
# 0.974
# =============================================================
# TEST 2
reg = LGBMRegressor(num_boost_round=20, verbosity=0)
reg.fit(X_train, y_train)
print(f"r2 (defaults): {r2_score(y_train, reg.predict(X_train))}")
# 0.759
reg = LGBMRegressor(
min_data_in_bin=1, min_data_in_leaf=1, num_boost_round=20, verbosity=0
)
reg.fit(X_train, y_train)
print(f"r2 (small min_data): {r2_score(y_test, reg.predict(X_test))}")
# 0.219

Dataset indices for predicted values is not matching with those for actual values

I am a python novice who is trying to solve a regression problem with neural networks. I am at the stage where I want to plot the predicted vs actual followed by determining the regression coefficient.
Model training
#import statements
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
%matplotlib inline
#importing the dataset
data = pd.read_csv("PPV_dataset.csv")
X = np.array(data.drop(["PPV"],1))
y = np.array(data["PPV"])
#model training & prediction
nn = MLPRegressor(hidden_layer_sizes=(100,), activation = 'logistic', solver = 'sgd')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
nn.fit(X_train, y_train)
pred = nn.predict(X_test)
#indices of test set
a = X_test
indices = []
for row in range(len(X)):
for i in range(len(a)):
if np.all(a[i]==X[row]):
indices.append(row)
#listing actual values in an array
actual_values = []
for i in range(len(indices)):
actual_values.append(y[indices[i]])
Comparing actual to predicted values
len(actual_values)
13
len(pred)
12
Image of dataset

You should use the matplotlib and the seaborn libraries for plotting you graph,
and for coeficient r_sq = nn.score(actual_values, pred)
I recommend using seaborn.lmplot() in your case
for roberts particular case I suggest:
from sklearn.metrics import r2_score
r2_score(y_true, y_pred)

Writing a loop trying different values of degree and train SVM with ply kernel

I'm trying to write a loop trying different values of degree while training an SVM with the poly kernel in python using the digits dataset. I'd also like every value of degree to have an accuracy and to plot the graphs with degrees on the x axis and the test accuracy on the y axis.
Here's what I have so far...it's practically nothing but I don't know where to begin. Thanks in advance.
from sklearn.datasets import load_digits
digits = load_digits()
# Create the features matrix
X = digits.data
# Create the target vector
y = digits.target
# Create training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,
y,
test_size=0.25,
random_state=1)
from sklearn.svm import SVC
classifierObj1 = SVC(kernel='poly', degree=3)
classifierObj1.fit(X_train, y_train)
y_pred = classifierObj1.predict(X_test)

Polynomial Regression Degree Increasing Error

I am trying to predict Boston Housing prices. When I choose polynomial regression degree 1 or 2, R2 score is OK. But 3rd degree decreases R2 score.
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
from sklearn.datasets import load_boston
boston_dataset = load_boston()
dataset = pd.DataFrame(boston_dataset.data, columns = boston_dataset.feature_names)
dataset['MEDV'] = boston_dataset.target
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values.reshape(-1,1)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 2) # <-- Tuning to 3
X_poly = poly_reg.fit_transform(X_train)
poly_reg.fit(X_poly, y_train)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y_train)
y_pred = lin_reg_2.predict(poly_reg.fit_transform(X_test))
from sklearn.metrics import r2_score
print('Prediction Score is: ', r2_score(y_test, y_pred))
Output (degree=2):
Prediction Score is: 0.6903318065831567
Output (degree=3):
Prediction Score is: -12898.308114085281

It is called overfitting the model.What you are doing is fitting the model perfectly on the training set that will lead to high variance.When you fit your hypothesis well on the training set it will then fail on the test set. You can check your r2_score for your training set using r2_score(X_train,y_train). It will be high. You need to balance the trade-off between bias and variance.
You can try other regression models like lasso and ridge and can play with their alpha value in case you are looking for a high r2_score. For better understanding, I am putting up an image that will show how hypothesis line gets affected on increasing the degree of the polynomial.

How do I extract the estimation parameters (theta) from GaussianProcessClassifier

# Scale/ Normalize Independent Variables
X = StandardScaler().fit_transform(X)
#Split data into train an test set at 50% each
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=42)
gpc= GaussianProcessClassifier(1.0 * RBF(1.0), n_jobs=-1)
gpc.fit(X_train,y_train)
y_proba=gpc.predict_proba(X_test)
#classify as 1 if prediction probablity greater than 15.8%
y_pred = [1 if x >= .158 else 0 for x in y_proba[:, 1]]
The above code runs as expected. However, in order to explain the model, something like, 'a 1 unit change in Beta1 will result in a .7% improvement in probability of sucess' , I need to be able to see the theta. How do I do this?
Thanks for the assist. BTW, this is for a homework assignment

Very nice question. You can indeed access the thetas however in the documentation it is not clear how to do this.
Use the following. Here I use the iris dataset.
from sklearn.gaussian_process.kernels import RBF
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Scale/ Normalize Independent Variables
X = StandardScaler().fit_transform(X)
#Split data into train an test set at 50% each
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= .5, random_state=42)
gpc= GaussianProcessClassifier(1.0 * RBF(1.0), n_jobs=-1)
gpc.fit(X_train,y_train)
y_proba=gpc.predict_proba(X_test)
#classify as 1 if prediction probablity greater than 15.8%
y_pred = [1 if x >= .158 else 0 for x in y_proba[:, 1]]
# thetas
gpc.kernel_.theta
Results:
array([7.1292252 , 1.35355145, 5.54106817, 0.61431805, 7.00063873,
1.3175175 ])
An example from the documentation that access thetas can be found HERE
Hope this helps.

It looks as if the theta value you are looking for is a property of the kernel object that you pass into the classifier. You can read more in this section of the sklearn documentation. You can access the log-transformed values of theta for the classifier kernel by using classifier.kernel_.theta where classifier is the name of your classifier object.
Note that the kernel object also has a method clone_with_theta(theta) that might come in handy if you're making modifications to theta.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Should I use feature scaling with polynomial regression with scikit-learn? - python

Related

Why LightGBM Python-package gives bad prediction using for regression task?

Dataset indices for predicted values is not matching with those for actual values

Writing a loop trying different values of degree and train SVM with ply kernel

Polynomial Regression Degree Increasing Error

How do I extract the estimation parameters (theta) from GaussianProcessClassifier

Categories

Resources