How to set penalty of elastic net in Logistic Regression

How to set penalty of elastic net in Logistic Regression - python

I was currently working on one of the assignments from the IBM Machine learning course. I ended up getting one error multiple times while training the model even when I set penalty = 'elasticnet'. I know that the elastic net model needs an L1 ratio and I am not even sure that I need to set the l1_ratio at all or where should I set the L1_ratio.
The code I was working on is below:
#defining Logistic Regression with Elastic Net penalty
l1_ratio=0.5
#elastic net penalty to shrink coefficients without removing any features from the model
penalty= 'elasticnet'
# Our classification problem is multinomial
multi_class = 'multinomial'
#Use saga for elastic net penalty and multinomial classes. sklearn only support saga for elastic net
solver = 'saga'
#setting max iteration to 1000
max_iter = 1000
#Initiating the LogisticRegression and training the model
e_net_model = LogisticRegression(random_state=rs, penalty=penalty, multi_class=multi_class, solver=solver, max_iter = 1000)
#training
e_net_model.fit(X_train, y_train)
Error I was having while fitting the model:
TypeError Traceback (most recent call last)
Input In [60], in <cell line: 2>()
1 # Type your code here
----> 2 e_net_model.fit(X_train, y_train)
File ~\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:1291, in LogisticRegression.fit(self, X, y, sample_weight)
Picture of the Error

Try this code:
from scipy.sparse.construct import rand
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1)
#defining Logistic Regression with Elastic Net penalty
l1_ratio=0.5
#elastic net penalty to shrink coefficients without removing any features from the model
penalty= 'elasticnet'
# Our classification problem is multinomial
multi_class = 'multinomial'
#Use saga for elastic net penalty and multinomial classes. sklearn only support saga for elastic net
solver = 'saga'
#setting max iteration to 1000
max_iter = 1000
#Initiating the LogisticRegression and training the model
e_net_model = LogisticRegression(random_state=0, penalty=penalty, multi_class=multi_class, solver=solver, max_iter = 1000,l1_ratio=l1_ratio)
#training
e_net_model.fit(X, y)
# define a single row of input data
row = [1.89149379, -0.39847585, 1.63856893, 0.01647165, 1.51892395, -3.52651223, 1.80998823, 0.58810926, -0.02542177, -0.52835426]
# predict the class label
yhat = e_net_model.predict([row])
# summarize the predicted class
print('Predicted Class: %d' % yhat[0])
Also, make sure that your training data doesn't have missing values.

Related

How to use k-fold cross-validation instead of train_test_split for Regression Neural Network

We have developed an Artificial Neural Network (ANN), where we split our data into training and testing data with train_test_split. As we want a better and more generalized estimate of our performance scores, we would like to split data with k-fold instead.
Now, we split the data into 70% training and 30% testing data with train_test_split
def splitData(dataframe):
X = dataframe[Predictors].values
y = dataframe[TargetVariable].values
### Sandardization of data ###
PredictorScaler = StandardScaler()
TargetVarScaler = StandardScaler()
# Storing the fit object for later reference
PredictorScalerFit = PredictorScaler.fit(X)
TargetVarScalerFit = TargetVarScaler.fit(y)
# Generating the standardized values of X and y
X = PredictorScalerFit.transform(X)
y = TargetVarScalerFit.transform(y)
# Split the data into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
return X_train, X_test, y_train, y_test, PredictorScalerFit, TargetVarScalerFit
How do we split our data with k-fold instead?
And are we right in our assumptions that the performance scores will result in better and more generalized estimates if using k-fold cross-validation instead of train_test_split?
Our main
if __name__ == '__main__':
print('--------------')
# ...
dataframe = pd.read_csv("/.../file.csv")
# Splitting data into training and tesing data
X_train, X_test, y_train, y_test, PredictorScalerFit, TargetVarScalerFit = splitData(dataframe=dataframe)
# Making the Regression Artificial Neural Network (ANN)
ann = ANN(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test, PredictorScalerFit=PredictorScalerFit, TargetVarScalerFit=TargetVarScalerFit)
# Evaluation of the performance of the Aritifical Neural Network (ANN)
performanceEvaluation(y_test_orig=ann['temp'], y_test_pred=ann['Predicted_temp'])
Our ANN Prediction function
def ANN(X_train, y_train, X_test, y_test, TargetVarScalerFit, PredictorScalerFit):
# Making the regression Artificial Neural Network (ANN) Model
model = make_regression_ann()
# Fitting the ANN to the Training set
model.fit(X_train, y_train, batch_size=5, epochs=50, verbose=1)
# Generating Predictions on testing data
Predictions = model.predict(X_test)
# Scaling the predicted temp data back to original price scale
Predictions = TargetVarScalerFit.inverse_transform(Predictions)
# Scaling the y_test temp data back to original temp scale
y_test_orig = TargetVarScalerFit.inverse_transform(y_test)
# Scaling the test data back to original scale
Test_Data = PredictorScalerFit.inverse_transform(X_test)
TestingData = pd.DataFrame(data=Test_Data, columns=Predictors)
TestingData['temp'] = y_test_orig
TestingData['Predicted_temp'] = Predictions
TestingData.head()
TestingData.to_csv("TestingData.csv")
return TestingData
Making the regression ann model
def make_regression_ann(initializer='normal', activation='relu', optimizer='adam', loss='mse'):
# create ANN model
model = Sequential()
# Defining the Input layer and FIRST hidden layer, both are same!
model.add(Dense(units=8, input_dim=7, kernel_initializer=initializer, activation=activation))
# Defining the Second layer of the model
# after the first layer we don't have to specify input_dim as keras configure it automatically
model.add(Dense(units=6, kernel_initializer=initializer, activation=activation))
# The output neuron is a single fully connected node
# Since we will be predicting a single number
model.add(Dense(1, kernel_initializer=initializer))
# Compiling the model
model.compile(loss=loss, optimizer=optimizer)
return model
Our function to generate performance scores
def performanceEvaluation(y_test_orig, y_test_pred):
# Computing the Mean Absolute Percent Error
MAPE = mean_absolute_percentage_error(y_test_orig, y_test_pred)
# Computing R2 Score
r2 = r2_score(y_test_orig, y_test_pred)
# Computing Mean Square Error (MSE)
MSE = mean_squared_error(y_test_orig, y_test_pred)
# Computing Root Mean Square Error (RMSE)
RMSE = mean_squared_error(y_test_orig, y_test_pred, squared=False)
# Computing Mean Absolute Error (MAE)
MAE = mean_absolute_error(y_test_orig, y_test_pred)
# Computing Mean Bias Error (MBE)
MBE = np.mean(y_test_pred - y_test_orig) # here we calculate MBE
print('--------------')
print('The Coefficient of Determination (R2) of ANN model is:', r2)
print("The Root Mean Squared Error (RMSE) of ANN model is:", RMSE)
print("The Mean Squared Error (MSE) of ANN model is:", MSE)
print('The Mean Absolute Percent Error (MAPE) of ANN model is:', MAPE)
print("The Mean Absolute Error (MAE) of ANN model is:", MAE)
print("The Mean Bias Error (MBE) of ANN model is:", MBE)
print('--------------')
eval_list = [r2, RMSE, MSE, MAPE, MAE, MBE]
columns = ['Coefficient of Determination (R2)', 'Root Mean Square Error (RMSE)', 'Mean Squared Error (MSE)',
'Mean Absolute Percent Error (MAPE)', 'Mean Absolute Error (MAE)', 'Mean Bias Error (MBE)']
dataframe = pd.DataFrame([eval_list], columns=columns)
return dataframe
Our performance scores
Coefficient of Determination (R2) Root Mean Square Error (RMSE) Mean Squared Error (MSE) Mean Absolute Percent Error (MAPE) Mean Absolute Error (MAE) Mean Bias Error (MBE)
0.982052940563799 0.7293977725380798 0.5320211105835124 0.0894734801108027 0.5711224962560332 0.049541171482965995
What we tried to reach our goal
Splitting whole dataset into X (Predictors) , y (TargetVariable)
def splitData(dataframe):
X = dataframe[Predictors].values
y = dataframe[TargetVariable].values
### Sandardization of data ###
PredictorScaler = StandardScaler()
TargetVarScaler = StandardScaler()
# Storing the fit object for later reference
PredictorScalerFit = PredictorScaler.fit(X)
TargetVarScalerFit = TargetVarScaler.fit(y)
# Generating the standardized values of X and y
X = PredictorScalerFit.transform(X)
y = TargetVarScalerFit.transform(y)
return X, y
Utilize ANN
def ANN_test(X,y):
# Fitting the ANN to the Training set
model = KerasRegressor(build_fn=make_regression_ann(), epochs=50, batch_size=5)
cv = KFold(n_splits=10, random_state=1, shuffle=True)
test = cross_val_score(model, X=X, y=y, cv=cv, scoring="neg_mean_squared_error", n_jobs=1)
print(test)
mean = test.mean()
print(mean)
Error receiving from using this function:
2021-12-24 16:16:47.909705: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
/Users/yusuf/PycharmProjects/MasterThesis-ArtificialNeuralNetwork/main.py:168: DeprecationWarning: KerasRegressor is deprecated, use Sci-Keras (https://github.com/adriangb/scikeras) instead.
model = KerasRegressor(build_fn=make_regression_ann(), epochs=50, batch_size=5)
2021-12-24 16:16:48.193312: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
[nan nan nan nan nan nan nan nan nan nan]
nan
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/model_selection/_validation.py:372: FitFailedWarning:
10 fits failed out of a total of 10.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.
Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 681, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/keras/wrappers/scikit_learn.py", line 152, in fit
self.model = self.build_fn(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/keras/engine/base_layer.py", line 3076, in _split_out_first_arg
raise ValueError(
ValueError: The first argument to `Layer.call` must always be passed.
warnings.warn(some_fits_failed_message, FitFailedWarning)
We are operating on:
Mac Monterey, Version: 12.0.1
Tensorflow Version: 2.7.0
Keras Version: 2.7.0
Libraries
import os
import time
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pathlib import Path
from sklearn.metrics import accuracy_score, make_scorer, mean_absolute_percentage_error, mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, cross_val_score
from keras.wrappers.scikit_learn import KerasRegressor
from keras.models import Sequential
from keras.layers import Dense

You need to use KerasRegressor to wrap your keras model as a scikit learn model.
Take a look at example 1 here
from keras.wrappers.scikit_learn import KerasRegressor
model = KerasRegressor(build_fn=make_regression_ann, epochs=512, batch_size=3)
kfold = KFold(n_splits=10, random_state=seed)
scores = cross_val_score(model, x, y, cv=kfold)

ConvergenceWarning: Stochastic Optimizer: Maximum iterations (10) reached and the optimization hasn't converged yet

I am trying sklearn to write about the training of a neutral_work on MNIST datasets.
Why the optimizer did not converge? What else can be done to increase the accuracy which I get in?
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.neural_network import MLPClassifier
print(__doc__)
# load data from http://www.openml.org/d/d554
X, y = fetch_openml('mnist_784', version=1, return_X_y=True)
X = X / 255.
# rescale the data. use the traditional train/test split
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]
# mlp = MLPClassifier(hidden_layer_sizes=(100, 100), max_iter=400, alpha=1e-4,
# solver='sgd' , verbose=10, tol=14-4, random_state=1)
mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=10, alpha=1e-4,
solver='sgd', verbose=10, tol=1e-4, random_state=1,
learning_rate_init=.1)
mlp.fit(X_train, y_train)
print("Training set score: %f" % mlp.score(X_train, y_train))
print("Test set score: %f" % mlp.score(X_test, y_test))
fig, axes = plt.subplots(4, 4)
# use global min/max to ensure all weights are shown on the same scale
vmin, vmax = mlp.coefs_[0].min(), mlp.coefs_[0].max()
for coef, ax in zip(mlp.coefs_[0].T, axes.ravel()):
ax.matshow(coef.reshape(28, 28), cmap=plt.cm.gray, vmin=.5 * vmin,
vmax=.5 * vmax)
ax.set_xticks(())
ax.set_yticks(())
plt.show()

"ConvergenceWarning: Stochastic Optimizer: Maximum iterations (10) reached and the optimization hasn't converged yet. ConvergenceWarning)"
A convergence point is a machine learning models localized optimal state. It basically means that the variables within the model have the best posible values (Within a certain vicinity) in order to predict a target feature based on another set of features. In a Multi-layer perceptron (MLP), these variables are the weights within each neuron. Generally, when a data set doesn't represent a organized and discernable pattern, machine learning algorithms might not be able to find a convergence point. However, if there is a convergence point, a machine learning model will do its best to find it. In order to train a MLP you need to iterate a data set within the network many times in order for its weights to find a convergence point. You can also limit the amount of iterations in order to limit processing time or as a regularization tool.
In your code example you have 2 MLP models, however I'll focus on the snippet that isn't commented:
mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=10, alpha=1e-4,solver='sgd', verbose=10, tol=1e-4, random_state=1,learning_rate_init=.1)
There are several parameters that could influence the amount of iterations the model would need in order to converge, however the simplest change you could make is increasing the maximum amount of iterations, for example setting it to max_iter=100 or any other needed greater value.
However, there might be a deeper issue in this machine learning model. The MNIST data set is a set of written character images. MLP is a highly flexible machine learning model, nevertheless, it might not be an adequate model for computer vision and image classification. You might get some positive results with MLP, you might get even better results with Convolutional Neural Networks, which are basically really fancy MLPs.

How can correct sample_weight in sklearn.naive_bayes?

I'm implementing Naive Bayes by sklearn with imbalanced data.
My data has more than 16k records and 6 output categories.
I tried to fit the model with the sample_weight calculated by sklearn.utils.class_weight
The sample_weight received something like:
sample_weight = [11.77540107 1.82284768 0.64688602 2.47138047 0.38577435 1.21389195]
import numpy as np
data_set = np.loadtxt("./data/_vector21.csv", delimiter=",")
inp_vec = data_set[:, 1:22]
out_vec = data_set[:, 22:]
#
# # Split dataset into training set and test set
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(inp_vec, out_vec, test_size=0.2) # 80% training and 20% test
#
# class weight
from keras.utils.np_utils import to_categorical
output_vec_categorical = to_categorical(y_train)
from sklearn.utils import class_weight
y_ints = [y.argmax() for y in output_vec_categorical]
c_w = class_weight.compute_class_weight('balanced', np.unique(y_ints), y_ints)
cw = {}
for i in set(y_ints):
cw[i] = c_w[i]
# Create a Gaussian Classifier
from sklearn.naive_bayes import *
model = GaussianNB()
# Train the model using the training sets
print(c_w)
model.fit(X_train, y_train, c_w)
# Predict the response for test dataset
y_pred = model.predict(X_test)
# Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print("\nClassification Report: \n", (metrics.classification_report(y_test, y_pred)))
print("\nAccuracy: %.3f%%" % (metrics.accuracy_score(y_test, y_pred)*100))
I got this message:
ValueError: Found input variables with inconsistent numbers of samples: [13212, 6]
Can anyone tell me what did I do wrong and how can fix it?
Thanks a lot.

The sample_weight and class_weight are two different things.
As their name suggests:
sample_weight is to be applied to individual samples (rows in your data). So the length of sample_weight must match the number of samples in your X.
class_weight is to make the classifier give more importance and attention to the classes. So the length of class_weight must match the number of classes in your targets.
You are calculating class_weight and not sample_weight by using the sklearn.utils.class_weight, but then try to pass it to the sample_weight. Hence the dimension mismatch error.
Please see the following questions for more understanding of how these two weights interact internally:
What is the difference between sample weight and class weight options in scikit learn?
https://stats.stackexchange.com/questions/244630/difference-between-sample-weight-and-class-weight-randomforest-classifier

This way I was able to calculate the weights to deal with class imbalance.
from sklearn.utils import class_weight
sample = class_weight.compute_sample_weight('balanced', y_train)
#Classifier Naive Bayes
naive = naive_bayes.MultinomialNB()
naive.fit(X_train,y_train, sample_weight=sample)
predictions_NB = naive.predict(X_test)

Prediction after feature selection python

I am trying to build a predictive model using python. The training and test data set has over 400 variables. On using feature selection on training data set the number of variables are reduced to 180
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold = .9)
and then I am training a model using gradient boosting algorithm achieveing .84 AUC accuracy in cross validation.
from sklearn import ensemble
from sklearn.cross_validation import train_test_split
from sklearn.metrics import roc_auc_score as auc
df_fit, df_eval, y_fit, y_eval= train_test_split( df, y, test_size=0.2, random_state=1 )
boosting_model = ensemble.GradientBoostingClassifier(n_estimators=100, max_depth=3,
min_samples_leaf=100, learning_rate=0.1,
subsample=0.5, random_state=1)
boosting_model.fit(df_fit, y_fit)
But when I am trying to use this model to predict for prediction data set it is giving me error
predict_target = boosting_model.predict(df_prediction)
Error: Number of variables in prediction data set 'df_prediction' does not match the number of variables in the model
Which makes sense because total variables in testing data remains to be over 400.
My question is there anyway to bypass this problem and keep using feature selection for predictive modeling. Because if I remove it the accuracy of model drops down to .5 which is very poor.
Thanks!

You should transform your prediction matrix through your feature selection too. So somewhere in your code you do
df = sel.fit_transform(X)
and before predicting
df_prediction = sel.transform(X_prediction)

partial_fit Sklearn's MLPClassifier

I've been trying to use Sklearn's neural network MLPClassifier. I have a dataset that is of size 1000 instances (with binary outputs) and I want to apply a basic Neural Net with 1 hidden layer to it.
The issue is that my data instances are not available all at the same time. At any point in time, I only have access to 1 data instance. I thought that partial_fit method of MLPClassifier can be used for this so I simulated the problem with an imaginary dataset of 1000 inputs and looped over the inputs one at a time and partial_fit to each instance but when I run the code, the neural net learns nothing and the predicted output is all zeros.
I am clueless as to what might be causing the problem. Any thought is hugely appreciated.
from __future__ import division
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
#Creating an imaginary dataset
input, output = make_classification(1000, 30, n_informative=10, n_classes=2)
input= input / input.max(axis=0)
N = input.shape[0]
train_input = input[0:N/2,:]
train_target = output[0:N/2]
test_input= input[N/2:N,:]
test_target = output[N/2:N]
#Creating and training the Neural Net
clf = MLPClassifier(activation='tanh', algorithm='sgd', learning_rate='constant',
alpha=1e-4, hidden_layer_sizes=(15,), random_state=1, batch_size=1,verbose= True,
max_iter=1, warm_start=True)
classes=[0,1]
for j in xrange(0,100):
for i in xrange(0,train_input.shape[0]):
input_inst = [train_input[i,:]]
input_inst = np.asarray(input_inst)
target_inst= [train_target[i]]
target_inst = np.asarray(target_inst)
clf=clf.partial_fit(input_inst,target_inst,classes)
#Testing the Neural Net
y_pred = clf.predict(test_input)
print y_pred

Explanation of the problem
The problem is with self.label_binarizer_.fit(y) in line 895 in multilayer_perceptron.py.
Whenever you call clf.partial_fit(input_inst,target_inst,classes), you call self.label_binarizer_.fit(y) where y has only one sample corresponding to one class, in this case. Therefore, if the last sample is of class 0, then your clf will classify everything as class 0.
Solution
As a temporary fix, you can edit multilayer_perceptron.py at line 895.
It is found in a directory similar to this python2.7/site-packages/sklearn/neural_network/
At line 895, change,
self.label_binarizer_.fit(y)
to
if not incremental:
self.label_binarizer_.fit(y)
else:
self.label_binarizer_.fit(self.classes_)
That way, if you are using partial_fit, then self.label_binarizer_ fits on the classes rather than on the individual sample.
Further, the code you posted can be changed to the following to make it work,
from __future__ import division
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
#Creating an imaginary dataset
input, output = make_classification(1000, 30, n_informative=10, n_classes=2)
input= input / input.max(axis=0)
N = input.shape[0]
train_input = input[0:N/2,:]
train_target = output[0:N/2]
test_input= input[N/2:N,:]
test_target = output[N/2:N]
#Creating and training the Neural Net
# 1. Disable verbose (verbose is annoying with partial_fit)
clf = MLPClassifier(activation='tanh', algorithm='sgd', learning_rate='constant',
alpha=1e-4, hidden_layer_sizes=(15,), random_state=1, batch_size=1,verbose= False,
max_iter=1, warm_start=True)
# 2. Set what the classes are
clf.classes_ = [0,1]
for j in xrange(0,100):
for i in xrange(0,train_input.shape[0]):
input_inst = train_input[[i]]
target_inst= train_target[[i]]
clf=clf.partial_fit(input_inst,target_inst)
# 3. Monitor progress
print "Score on training set: %0.8f" % clf.score(train_input, train_target)
#Testing the Neural Net
y_pred = clf.predict(test_input)
print y_pred
# 4. Compute score on testing set
print clf.score(test_input, test_target)
There are 4 main changes in the code. This should give you a good prediction on both the training and the testing set!
Cheers.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.