Scikit-Learn vs Keras (Tensorflow) for multinomial logistic regression - python

I am trying a simple multinomial logistic regression using Keras, but the results are quite different from the standard scikit-learn approach.
For example with iris data:
import numpy as np
import pandas as pd
df = pd.read_csv("./data/iris.data", header=None)
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.3, random_state=52)
X_train = df_train.drop(4, axis=1)
y_train = df_train[4]
X_test = df_test.drop(4, axis=1)
y_test = df_test[4]
Using scikit-learn:
from sklearn.linear_model import LogisticRegression
scikit_model = LogisticRegression(multi_class='multinomial', solver ='saga', max_iter=500)
scikit_model.fit(X_train, y_train)
the average weighted f1-score on test set:
y_test_pred = scikit_model.predict(X_test)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_pred, labels=scikit_model.classes_))
is 0.96.
Then with Keras:
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils
# first we have to encode class values as integers
encoder = LabelEncoder()
encoder.fit(y_train)
y_train_encoded = encoder.transform(y_train)
Y_train = np_utils.to_categorical(y_train_encoded)
y_test_encoded = encoder.transform(y_test)
Y_test = np_utils.to_categorical(y_test_encoded)
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from keras.regularizers import l2
#model construction
input_dim = 4 # 4 variables
output_dim = 3 # 3 possible outputs
def classification_model():
    model = Sequential()
    model.add(Dense(output_dim, input_dim=input_dim, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
    return model
#training
keras_model = classification_model()
keras_model.fit(X_train, Y_train, epochs=500, verbose=0)
the average weighted f1-score on test set:
classes = np.argmax(keras_model.predict(X_test), axis = 1)
y_test_pred = encoder.inverse_transform(classes)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_pred, labels=encoder.classes_))
is 0.89.
Is it possible to perform identical (or at least as much as possible) logistic regression with Keras as with scikit-learn?

I tried to run your example and noticed a couple of potential sources of the discrepancy:
The test set is very small, only 45 instances. To get from an accuracy of .89 to .96, the model only needs to predict three more instances correctly, so due to randomness in training your Keras results can oscillate quite a bit.
As explained by @meowongac in https://stackoverflow.com/a/59643522/1467943, you're using a different optimizer. One point is that scikit-learn's saga solver sets its step size automatically; for SGD in Keras, tweaking the learning rate and/or the number of epochs could lead to improvements.
scikit-learn quietly applies L2 regularization by default.
Using your code, I was able to get accuracy ranging from .89 to .96 by running SGD with the learning rate set to .05. When switching to Adam (also with this rather high learning rate), I got more stable results ranging from .92 to .96 (although this is more of an impression, as I didn't run too many trials).
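To make the two setups more comparable, one option (a sketch along the lines above, not the only way) is to give the Keras layer an explicit L2 penalty roughly matching scikit-learn's default C=1.0 and to raise the SGD learning rate. The l2_strength mapping below is only approximate (scikit-learn penalizes the summed loss, Keras the mean), so treat it as a starting point:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l2
from tensorflow.keras.optimizers import SGD
n_train = len(X_train)  # 105 samples with the 70/30 split above
l2_strength = 1.0 / (2.0 * n_train)  # rough equivalent of scikit-learn's default C=1.0
def comparable_model():
    # a single softmax layer is a multinomial logistic regression, now with explicit L2
    model = Sequential()
    model.add(Dense(output_dim, input_dim=input_dim, activation='softmax',
                    kernel_regularizer=l2(l2_strength)))
    model.compile(loss='categorical_crossentropy',
                  optimizer=SGD(learning_rate=0.05),  # higher learning rate, as noted above
                  metrics=['accuracy'])
    return model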

One obvious difference is that saga (a variant of SAG) is used in LogisticRegression, while SGD is used in your NN. As far as I know, LogisticRegression doesn't support SGD. Alternatively, you can use SGDClassifier (or SGDRegressor for regression tasks) instead of LogisticRegression. And here is a blog discussing the differences between them.
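For reference, a minimal sketch of the SGDClassifier route (keeping the logistic loss so it remains a logistic regression) could look like this; note that for multiclass problems SGDClassifier fits one-vs-rest models rather than a single multinomial model, so it is still not an exact match:
from sklearn.linear_model import SGDClassifier
sgd_model = SGDClassifier(loss='log_loss', penalty='l2', max_iter=500)  # the loss is named 'log' in older scikit-learn versions
sgd_model.fit(X_train, y_train)
print(sgd_model.score(X_test, y_test))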

Related

Problem with building an ANN for Iris Dataset

I am new to machine learning. I have been trying to get this code working, but the loss is stuck at 1.12 and is neither increasing nor decreasing. Any help would be appreciated.
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
dataset = pd.read_csv('Iris.csv')
# for encoding labels
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
dataset["Labels"] = encoder.fit_transform(dataset["Species"])
X = dataset.iloc[:,1:5]
Y = dataset['Labels']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
X_train = np.array(X_train).astype(np.float32)
X_test = np.array(X_test).astype(np.float32)
y_train = np.array(y_train).astype(np.float32)
y_test = np.array(y_test).astype(np.float32)
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(8, input_shape=(4,), activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax')
])
opt = tf.keras.optimizers.Adam(0.01)
model.compile(optimizer=opt, loss='mse')
r = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50)
This is a classification problem where you have to predict the class of an Iris plant (source). You have specified the mse loss, which stands for 'Mean Squared Error'. It measures the average squared deviation of predicted values from actual values; the square ensures you penalize a large deviation more heavily than a small one. This loss is used for regression problems where you have to predict a continuous value like price, clicks, sales etc.
A few suggestions that will help are:
Change the loss to a classification loss function; categorical_crossentropy is a good choice here (since your labels are integer-encoded rather than one-hot vectors, use sparse_categorical_crossentropy, or one-hot encode the labels first). Without going into too many details, in classification problems the model outputs a score for a sample belonging to each class. The softmax function you use converts these scores to normalized probabilities, and the cross-entropy loss ensures that your model is penalized when it gives a high probability to the wrong class.
Try standardizing your data to zero mean and unit variance. This helps the model converge.
You may refer to this article for building a neural network for Iris dataset.
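As an illustration of those two suggestions (one possible fix, not the only one), you could standardize the features and switch to a cross-entropy loss; since the labels here are integer-encoded, sparse_categorical_crossentropy avoids having to one-hot encode them:
from sklearn.preprocessing import StandardScaler
# standardize to zero mean and unit variance using the training set statistics
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(8, input_shape=(4,), activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax')
])
model.compile(optimizer=tf.keras.optimizers.Adam(0.01),
              loss='sparse_categorical_crossentropy',  # integer labels from LabelEncoder
              metrics=['accuracy'])
r = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50)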

How to add a sklearn target transformer (for output variable) to keras neural network pipeline?

I want to build a neural network using Keras on transforms of my input variables AND my output variables using the sklearn Pipeline (so I can perform CV). I am trying to use TransformedTargetRegressor, but my mean squared errors do not make sense to me.
This is my code, adapted from sklearn's example for TransformedTargetRegressor on the Boston Housing dataset, with a simple neural network added and scaling applied to the input variables (X).
Set up (this section is fine):
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import load_boston
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import train_test_split
#load data
X, y = load_boston(return_X_y=True)
#define simple neural network
def simple_nn():
    model = Sequential()
    model.add(Dense(13, input_dim=13, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model
#create pipeline for input variables (X) preprocessing
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasRegressor(build_fn=simple_nn, epochs=100, batch_size=5, verbose=True)))
pipeline = Pipeline(estimators)
I am trying to do the following (section in question):
#Section in question
transformer = MinMaxScaler()
model = TransformedTargetRegressor(regressor=pipeline, transformer=transformer)
results = cross_val_score(model, X, y, cv=KFold(n_splits=5))
The resulting cross validation scores are:
array([ 0.61321517, 0.35811762, -2.67674546, -0.30623006, -0.38187424])
The middle number is of particular concern to me since the y target is supposed to have been scaled from 0 to 1, so a mean squared error of -2.67 seems wrong. What am I doing wrong here?
A mean squared error is squared, and thus can't be negative.
That means that your score is not the mean squared error.
The cross_val_score documentation tells us that if no scorer is defined, it defaults to the estimator's scorer:
"If None, the estimator's default scorer (if available) is used."
In your case, it is the TransformedTargetRegressor that is being used, and the TransformedTargetRegressor documentation tells us that its default score method will:
Return the coefficient of determination R^2 of the prediction.
So the values you are displaying are R^2 scores. They can be negative if your model performs badly. See this question for instance.
As a good practice, you should always define the scorer you want to use, to avoid relying on the wrong one.
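For example, to get mean squared errors explicitly, pass a scorer to cross_val_score (a small sketch reusing the model from the question):
results = cross_val_score(model, X, y, cv=KFold(n_splits=5),
                          scoring='neg_mean_squared_error')
print(-results)  # scikit-learn negates errors so that higher is always better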

Machine Learning Model overfitting

So I built a GRU model and I'm comparing 3 different datasets on the same model. I was running the first dataset with the number of epochs set to 25, but I have noticed that my validation loss starts increasing just after the 6th epoch. Doesn't that indicate overfitting? Am I doing something wrong?
import pandas as pd
import tensorflow as tf
from keras.layers.core import Dense
from keras.layers.recurrent import GRU
from keras.models import Sequential
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from google.colab import files
from tensorboardcolab import TensorBoardColab, TensorBoardColabCallback
tbc=TensorBoardColab() # Tensorboard
df10=pd.read_csv('/content/drive/My Drive/Isolation Forest/IF 10 PERCENT.csv',index_col=None)
df2_10= pd.read_csv('/content/drive/My Drive/2019 Dataframe/2019 10minutes IF 10 PERCENT.csv',index_col=None)
X10_train= df10[['WindSpeed_mps','AmbTemp_DegC','RotorSpeed_rpm','RotorSpeedAve','NacelleOrientation_Deg','MeasuredYawError','Pitch_Deg','WindSpeed1','WindSpeed2','WindSpeed3','GeneratorTemperature_DegC','GearBoxTemperature_DegC']]
X10_train=X10_train.values
y10_train= df10['Power_kW']
y10_train=y10_train.values
X10_test= df2_10[['WindSpeed_mps','AmbTemp_DegC','RotorSpeed_rpm','RotorSpeedAve','NacelleOrientation_Deg','MeasuredYawError','Pitch_Deg','WindSpeed1','WindSpeed2','WindSpeed3','GeneratorTemperature_DegC','GearBoxTemperature_DegC']]
X10_test=X10_test.values
y10_test= df2_10['Power_kW']
y10_test=y10_test.values
# scaling values for model
x_scale = MinMaxScaler()
y_scale = MinMaxScaler()
X10_train= x_scale.fit_transform(X10_train)
y10_train= y_scale.fit_transform(y10_train.reshape(-1,1))
X10_test= x_scale.transform(X10_test)  # use the scalers fitted on the training data
y10_test= y_scale.transform(y10_test.reshape(-1,1))
X10_train = X10_train.reshape((-1,1,12))
X10_test = X10_test.reshape((-1,1,12))
# creating model using Keras
model10 = Sequential()
model10.add(GRU(units=512, return_sequences=True, input_shape=(1,12)))
model10.add(GRU(units=256, return_sequences=True))
model10.add(GRU(units=256))
model10.add(Dense(units=1, activation='sigmoid'))
model10.compile(loss=['mse'], optimizer='adam',metrics=['mse'])
model10.summary()
history10=model10.fit(X10_train, y10_train, batch_size=256, epochs=25,validation_split=0.20, verbose=1, callbacks=[TensorBoardColabCallback(tbc)])
score = model10.evaluate(X10_test, y10_test)
print('Score: {}'.format(score))
y10_predicted = model10.predict(X10_test)
y10_predicted = y_scale.inverse_transform(y10_predicted)
y10_test = y_scale.inverse_transform(y10_test)
plt.plot( y10_predicted, label='Predicted')
plt.plot( y10_test, label='Measurements')
plt.legend()
plt.savefig('/content/drive/My Drive/Figures/Power Prediction 10 Percent.png')
plt.show()
LSTMs (and GRUs too, despite their lighter construction) are notorious for overfitting easily.
Reduce the number of units (the output size) in each of the layers, e.g. 32 in the first layer and 64 in the second; you could also eliminate the last GRU layer altogether.
Second, you are using the activation 'sigmoid', but your loss function and metric are mse.
Ensure that your problem is either a regression or a classification one. If it is indeed a regression, then the activation function of the last layer should be 'linear'. If it is a classification one, you should change your loss function to binary_crossentropy and your metric to 'accuracy'.
Therefore, the plot displayed is misleading for the moment. If you modify the model as I suggested and still get such a train-validation loss plot, then we can state for sure that you have an overfitting case.
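For instance, keeping the regression reading (the target here is a MinMax-scaled power value), a slimmed-down sketch of the model along the lines above could be:
model10 = Sequential()
model10.add(GRU(units=32, return_sequences=True, input_shape=(1, 12)))
model10.add(GRU(units=64))  # could also be removed (then set return_sequences=False above)
model10.add(Dense(units=1, activation='linear'))  # linear output for a regression target
model10.compile(loss='mse', optimizer='adam', metrics=['mse'])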

How to optimize accuracy of ANN

I have 50 target classes across 300 data samples.
This is my sample dataset, with 98 features:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
dataset = pd.read_csv(root_path + 'pima-indians-diabetes.data.csv', header=None)
X= dataset.iloc[:,0:8]
y= dataset.iloc[:,8]
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3)
from keras import Sequential
from keras.layers import Dense
classifier = Sequential()
#First Hidden Layer
classifier.add(Dense(units = 10, activation='relu',kernel_initializer='random_normal', input_dim=8))
#Second Hidden Layer
classifier.add(Dense(units = 10, activation='relu',kernel_initializer='random_normal'))
#Output Layer
classifier.add(Dense(units = 1, activation='sigmoid',kernel_initializer='random_normal'))
#Compiling the neural network
classifier.compile(optimizer ='adam',loss='binary_crossentropy', metrics =['accuracy'])
#Fitting the data to the training dataset
classifier.fit(X_train,y_train, batch_size=2, epochs=10)
I get 19% accuracy here, and I don't know how to optimize my prediction result.
I am assuming that you have performed a dimensionality reduction technique on your original data with 98 features, and that you are therefore using an 8-dimensional input feature vector in your model.
I have a few observations on your implementation:
[As a Classification Problem]
As you have mentioned that your samples belong to 50 different classes, the problem is certainly a multiclass classification problem. So, you need to encode your labels first, like:
from keras.utils import to_categorical
y = to_categorical(y, num_classes=50, dtype='float32')
In this case, you need to change the number of output nodes (one per class) and the activation function in the final layer as follows:
classifier.add(Dense(units = 50, activation='softmax'))
Furthermore, you have to use categorical_crossentropy as the loss function while compiling your model.
classifier.compile(optimizer ='adam',loss='categorical_crossentropy', metrics =['accuracy'])
[As a Regression Problem]
You can also consider this problem as a multiple regression problem as the output is within the range of 0 to 50 (continuous) and can keep a single output node in the final layer as you did. But in that case, you should use a linear activation function instead of sigmoid.
So, the final layer should be like:
classifier.add(Dense(units = 1)) # default activation is linear
Additionally, in the case of a regression problem, mean_squared_error is the most relevant cost function to use (assuming there are not many outliers in your dataset), and accuracy as a performance metric is irrelevant (rather, you may use mean_absolute_error, which is analogous to the loss). Hence, the second modification is:
classifier.compile(optimizer ='adam',loss='mean_squared_error')
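Putting the classification-route changes together, a sketch of the modified setup (keeping the layer sizes from your code) would be:
from keras.utils import to_categorical
y_cat = to_categorical(y, num_classes=50, dtype='float32')
X_train, X_test, y_train, y_test = train_test_split(X, y_cat, test_size=0.3)
classifier = Sequential()
classifier.add(Dense(units=10, activation='relu', kernel_initializer='random_normal', input_dim=8))
classifier.add(Dense(units=10, activation='relu', kernel_initializer='random_normal'))
classifier.add(Dense(units=50, activation='softmax'))  # one output node per class
classifier.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
classifier.fit(X_train, y_train, batch_size=2, epochs=10)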

I am not sure why decision tree and random forest is displaying 100% accuracy?

I am currently working on a model that reads structured data and determines if someone has a disease. I think the issue is that the data is not being split between training and testing data, but I am not sure how to do that.
I am not sure what to try.
import pandas as pd
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
heart_data = pd.read_csv('cardio_train.csv')
heart_data.head()
heart_data.shape
heart_data.describe()
heart_data.isnull().sum()
heart_data_columns = heart_data.columns
predictors = heart_data[heart_data_columns[heart_data_columns != 'target']] # all columns except the target
target = heart_data['target'] # target column
#This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type
predictors.head()
target.head()
#normalize the data by subtracting the mean and dividing by the standard deviation.
predictors_norm = (predictors - predictors.mean()) / predictors.std()
predictors_norm.head()
n_cols = predictors_norm.shape[1] # number of predictors
def regression_model():
    # create model
    model = Sequential()
    # inputs
    model.add(Dense(50, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(50, activation='relu'))  # activation function
    model.add(Dense(1))
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    # loss measures the results and figures out how bad it did; the optimizer generates the next guess
    return model
# build the model
model = regression_model()
print (model)
# fit the model
history=model.fit(predictors_norm, target, validation_split=0.3, epochs=10, verbose=2)
#Decision Tree
print ("Processing Decision Tree")
dtc = DecisionTreeClassifier()
dtc.fit(predictors_norm,target)
print("Decision Tree Test Accuracy {:.2f}%".format(dtc.score(predictors_norm, target)*100))
#Support Vector Machine
print ("Processing Support Vector Machine")
svm = SVC(random_state = 1)
svm.fit(predictors_norm, target)
print("Test Accuracy of SVM Algorithm: {:.2f}%".format(svm.score(predictors_norm,target)*100))
#Random Forest
print ("Processing Random Forest")
rf = RandomForestClassifier(n_estimators = 1000, random_state = 1)
rf.fit(predictors_norm, target)
print("Random Forest Algorithm Accuracy Score : {:.2f}%".format(rf.score(predictors_norm,target)*100))
The message I am getting is this:
Decision Tree Test Accuracy 100.00%
However, the support vector machine is getting 73.37%.
You are evaluating your model on the same data you trained it on: you are probably overfitting. To overcome this, you must separate the data into two parts, one for training and one for testing:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(predictors, target, test_size=0.2)
Then, fit your model on the train dataset and evaluate it on the test dataset:
dtc = DecisionTreeClassifier()
dtc.fit(x_train, y_train)
accuracy = dtc.score(x_test, y_test) * 100
print(f"Decision Tree test accuracy : {accuracy} %.")
