Accuracy is 84 percent but prediction results are wrong - python

Here is my complete code. I'm trying to predict protein classes from protein sequences.
from sklearn.preprocessing import LabelBinarizer
# Transform labels to one-hot
lb = LabelBinarizer()
Y = lb.fit_transform(df.classification)
from keras.preprocessing import text, sequence
from keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split
#maximum length of sequence, everything afterwards is discarded!
max_length = 500
#create and fit tokenizer
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(seqs)
X = tokenizer.texts_to_sequences(seqs)
X = sequence.pad_sequences(X, maxlen=max_length)
from __future__ import print_function
import numpy as np
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional, Conv1D
from keras.layers.convolutional import MaxPooling1D
import tensorflow as tf
from tensorflow.keras import layers
embedding_vecor_length = 128
max_length = 500
model = Sequential()
model.add(Embedding(len(tokenizer.word_index)+1, embedding_vecor_length, input_length=max_length))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Bidirectional(LSTM(64)))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=.2)
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=512)
This is the accuracy of the model:
train-acc = 0.8485087800799034
test-acc = 0.8203392530062913
and my prediction results are:
[9.65313017e-02 1.33084046e-04 1.73516816e-03 4.62103529e-08
8.45071673e-03 2.42734270e-04 3.54182965e-04 2.88571493e-04
1.99087553e-05 8.92244339e-01]
[8.89207274e-02 1.99566261e-04 1.76228161e-04 2.08527595e-02
1.64435953e-01 2.83987029e-03 1.53038520e-02 7.07270563e-01
5.16798650e-07 2.19354401e-08]
[9.36142087e-01 6.09822795e-02 3.55492946e-09 2.19342492e-05
5.41335670e-04 1.89031591e-04 2.66434945e-04 1.84136129e-03
1.54582867e-05 3.31551647e-10]
Any help in this regard would be appreciated. I'm stuck with it and don't know how to solve it. Also, I'm kind of new to deep learning.

As you can see, your last layer uses a softmax activation:
model.add(Dense(10, activation='softmax'))
So when you predict, the output passes through that softmax function in the last layer, which gives you those strange-looking float values.
The softmax function normalizes its inputs so that every component lies in the range (0, 1) and all the components add up to 1, so each value can be read as the probability of the corresponding class. You can read more about the softmax function here: https://en.wikipedia.org/wiki/Softmax_function.
To find the predicted label id, take the index of the maximum value in each output array; that index is the class the model is predicting.
You can use NumPy's argmax function to find the index of the maximum value along an axis of a multidimensional array. You can refer here: https://numpy.org/doc/stable/reference/generated/numpy.argmax.html
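As a minimal sketch of the step described above, the three prediction vectors from the question can be turned into class ids with np.argmax. The softmax helper and the example scores passed to it are illustrative assumptions, not part of the original model code.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# The three prediction vectors from the question (one row per sequence).
preds = np.array([
    [9.65313017e-02, 1.33084046e-04, 1.73516816e-03, 4.62103529e-08,
     8.45071673e-03, 2.42734270e-04, 3.54182965e-04, 2.88571493e-04,
     1.99087553e-05, 8.92244339e-01],
    [8.89207274e-02, 1.99566261e-04, 1.76228161e-04, 2.08527595e-02,
     1.64435953e-01, 2.83987029e-03, 1.53038520e-02, 7.07270563e-01,
     5.16798650e-07, 2.19354401e-08],
    [9.36142087e-01, 6.09822795e-02, 3.55492946e-09, 2.19342492e-05,
     5.41335670e-04, 1.89031591e-04, 2.66434945e-04, 1.84136129e-03,
     1.54582867e-05, 3.31551647e-10],
])

row_sums = preds.sum(axis=1)          # each row sums to (almost exactly) 1
class_ids = np.argmax(preds, axis=1)  # index of the largest probability per row
print(class_ids)                      # [9 7 0]

# Hypothetical raw scores, only to show what softmax does to them.
example = softmax(np.array([2.0, 1.0, 0.1]))
print(example, example.sum())
```

Since the labels in the question were encoded with LabelBinarizer, indexing lb.classes_ with these ids (for example lb.classes_[class_ids]) should give back the original class names.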

Related

How can I train my CNN model on a dataset from a .csv file?

I'm a Python beginner and I'm using Google Colab to train my first CNN model. I'm stuck on the training part: I know I have to use model.fit() to train the model, but I have no idea what goes inside model.fit(). I'm also not sure how to split my dataset, which comes from a CSV file. Here's my code:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.utils import to_categorical
from keras.preprocessing import image
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical
from sklearn.metrics import accuracy_score
from tqdm import tqdm
from numpy.random import RandomState
import csv
df = pd.read_csv('classes.csv')
print(df.head(3500))
#splitting the dataset in training and testing sets.
rng = RandomState()
train = df.sample(frac=0.7, random_state=rng)
test = df.loc[~df.index.isin(train.index)]
#model's structure
model = Sequential()
#convolutional layer
model.add(Conv2D(32, kernel_size=(3, 3),activation='relu',input_shape=(28,28,1)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
#flatten output of conv
model.add(Flatten())
#hidden layer
model.add(Dense(128, activation='relu'))
#output layer
model.add(Dense(10, activation='softmax'))
#compiling sequential model
model.compile(loss='categorical_crossentropy',optimizer='Adam',metrics=['accuracy'])
model.summary()
#training the model
model.fit()
OK, let's see if we can help. The first thing is splitting the data frame. It is best to split it into a train set, a validation set and a test set. The best way to do that is to use sklearn's train_test_split. The documentation for that is [here.][1] Use the code below to split the data frame into 3 sets:
from sklearn.model_selection import train_test_split
train_split = .8 # fraction of df to use for training
test_split = .05 # fraction of df to use for test
# note valid_split = 1 - train_split - test_split, in this case .15
# train_test_split takes train_size (there is no train_split argument)
train_df, dummy_df = train_test_split(df, train_size=train_split, shuffle=True, random_state=123)
dummy_split = test_split/(1.0 - train_split)
test_df, valid_df = train_test_split(dummy_df, train_size=dummy_split) # note dummy_df is 20% of df
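The arithmetic behind the two-stage split can be checked with a small NumPy-only sketch. It uses a random permutation of row indices rather than train_test_split, and the dataset size of 1000 rows is a hypothetical value chosen for illustration.

```python
import numpy as np

n_rows = 1000          # hypothetical dataset size
train_split = 0.8      # fraction used for training
test_split = 0.05      # fraction used for testing

# Share of the held-out 20% that becomes the test set (0.05 / 0.2 = 0.25).
dummy_split = test_split / (1.0 - train_split)

rng = np.random.default_rng(123)
indices = rng.permutation(n_rows)   # shuffled row positions

n_train = round(train_split * n_rows)
n_test = round(dummy_split * (n_rows - n_train))

train_idx = indices[:n_train]
test_idx = indices[n_train:n_train + n_test]
valid_idx = indices[n_train + n_test:]

print(len(train_idx), len(test_idx), len(valid_idx))   # 800 50 150
```

The three index sets are disjoint and together cover every row, which is the same 80/5/15 proportion the answer aims for.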
Now you need to create the train, test and validation generators. I am assuming you have image inputs. Use ImageDataGenerator.flow_from_dataframe() to create your generators. Documentation for that is [here.][2] It is typical to scale the pixel values either to the range 0 to 1 or to the range -1 to +1. I prefer the latter. Below is the code for the generators:
import tensorflow as tf

def scalar(img):
    return img/127.5 - 1 # scale pixel values from [0, 255] to [-1, 1]

x_col = 'Filename' # set this to point to the df column that holds the full path to the image file
y_col = 'Label' # set this to the df column that holds the class labels
batch_size = 32 # set this to the batch size
class_mode = 'categorical'
image_shape = (64,64) # set this to desired image size; it must match the model's input_shape
# tgen augments the training images; gen only rescales and is used for test and validation
tgen = tf.keras.preprocessing.image.ImageDataGenerator(preprocessing_function=scalar, horizontal_flip=True)
gen = tf.keras.preprocessing.image.ImageDataGenerator(preprocessing_function=scalar)
train_gen = tgen.flow_from_dataframe(train_df, x_col=x_col,
                                     y_col=y_col, target_size=image_shape,
                                     class_mode=class_mode, batch_size=batch_size,
                                     shuffle=True, seed=123)
test_gen = gen.flow_from_dataframe(test_df, x_col=x_col, y_col=y_col,
                                   target_size=image_shape, class_mode=class_mode,
                                   batch_size=batch_size, shuffle=False)
valid_gen = gen.flow_from_dataframe(valid_df, x_col=x_col, shuffle=False,
                                    y_col=y_col, batch_size=batch_size,
                                    class_mode=class_mode, target_size=image_shape)
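The scaling function can be checked in isolation before it is handed to the generators. This is a small sketch with three hand-picked pixel values (an assumption for illustration), showing that 0, 127.5 and 255 map to -1, 0 and +1.

```python
import numpy as np

def scalar(img):
    # Same preprocessing function as in the answer: map [0, 255] to [-1, 1].
    return img / 127.5 - 1

pixels = np.array([0.0, 127.5, 255.0])   # darkest, middle and brightest values
scaled = scalar(pixels)
print(scaled)
```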
Now it is time to train your model using model.fit. Documentation is [here.][3]
epochs = 15 # set this to the number of epochs to run
history=model.fit(train_gen, epochs=epochs, validation_data= valid_gen, verbose=1)
When the model has completed training, you want to see how well it performs on the test set. You do this with model.evaluate, as shown below:
accuracy = model.evaluate(test_gen, verbose=1)[1]
print (accuracy)
You can use your model to make predictions using model.predict
preds=model.predict(test_gen, verbose=1)
preds is a matrix whose number of rows equals the number of samples in the test set and whose number of columns equals the number of classes you have. Each entry is the probability that the image is in that class. You want to select the column with the highest probability value. Use np.argmax to find the column (class) with the highest probability value. You can get the test set labels using labels = test_gen.labels. Code is below:
correct_preds=0
for i in range(len(preds)):
    index=np.argmax(preds[i]) # column with highest probability
    label=test_gen.labels[i] # true class of image
    if index == label:
        correct_preds +=1
pred_accuracy= correct_preds/ len(preds)
print ( pred_accuracy)
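The same accuracy can be computed without a Python loop by applying np.argmax along axis 1. In this sketch the preds and labels arrays are small hypothetical stand-ins for model.predict(test_gen) and test_gen.labels.

```python
import numpy as np

# Hypothetical stand-ins for model.predict(test_gen) and test_gen.labels.
preds = np.array([
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
])
labels = np.array([1, 0, 1, 1])

predicted = np.argmax(preds, axis=1)          # predicted class index for each row
pred_accuracy = np.mean(predicted == labels)  # fraction of rows where prediction matches the label
print(pred_accuracy)                          # 0.75
```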
[1]: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
[2]: https://keras.io/api/preprocessing/image/
[3]: https://www.tensorflow.org/api_docs/python/tf/keras/Model

Not getting reproducible results TensorFlow-Keras-Google Colab

I've been trying to create a model that recognizes different singing techniques. I have got good results, but I want to run different tests with different optimizers, layers, etc. However, I can't get reproducible results. When I run this model training twice:
num_epochs = 100
batch_size = 128
history = modelo.fit(X_train_f, Y_train, validation_data=(X_test_f,Y_test), epochs=num_epochs, batch_size=batch_size, verbose=2)
I can get 25% accuracy the first run and then 34% the second. Then if I change the optimizer from "sgd" to "adam", I would get a 99%. If I come back to the previous "sgd" optimizer that got me 34% the second run, I would get 100% or something crazy like that. I don't understand why.
I've tried many things I've read in similar questions. The following lines show how I am trying to make my code reproducible, and these are actually the first lines of my whole code:
import numpy as np
import tensorflow as tf
import random as rn
import os
#https://stackoverflow.com/questions/57305909/tensorflow-keras-reproducibility-problem-on-google-colab
os.environ['PYTHONHASHSEED']=str(5)
np.random.seed(5)
rn.seed(12345)
session_conf = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1,
inter_op_parallelism_threads=1)
tf.compat.v1.set_random_seed(1234)
sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(), config=session_conf)
tf.compat.v1.keras.backend.set_session(sess)
Question is, what am I doing wrong with the code above that is not working (as I mentioned)?
Here's where I create the training sets:
from keras.datasets import mnist
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers.convolutional import Conv1D, MaxPooling1D
from keras.layers.core import Dense, Flatten
from keras.layers import BatchNormalization,Activation
from keras.optimizers import SGD, Adam
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.2, random_state=2)
My model:
from tensorflow.keras import layers
from tensorflow.keras import initializers
input_dim = X_train_f.shape[1]
output_dim = Y_train.shape[1]
modelo = Sequential()
modelo.add(Conv1D(filters=6, kernel_initializer=initializers.glorot_uniform(seed=5), kernel_size=5, activation='relu', input_shape=(40, 1))) # 6
modelo.add(MaxPooling1D(pool_size=2))
modelo.add(Conv1D(filters=16, kernel_initializer=initializers.glorot_uniform(seed=5), kernel_size=5, activation='relu')) # 16
modelo.add(MaxPooling1D(pool_size=2))
modelo.add(Flatten())
modelo.add(Dense(120, kernel_initializer=initializers.glorot_uniform(seed=5), activation='relu')) # 120
modelo.add(Dense(84, kernel_initializer=initializers.glorot_uniform(seed=5), activation='relu')) # 84
modelo.add(Dense(nclases, kernel_initializer=initializers.glorot_uniform(seed=5), activation='softmax'))
sgd = SGD(lr=0.1)
#modelo.compile(loss='categorical_crossentropy',
# optimizer='adam',
# metrics=['accuracy'])
modelo.compile(loss='categorical_crossentropy',
optimizer=sgd,
metrics=['accuracy'])
modelo.summary()
modelo.input_shape
A large gap between the two optimizers is expected. Adam usually converges much faster than plain SGD because it scales the step for each parameter individually. Adam implicitly performs coordinate-wise gradient clipping and can hence, unlike SGD, tackle heavy-tailed noise.
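The clipping-like behaviour can be seen in a small NumPy sketch of a single update step. It assumes the standard Adam update rule with its usual default hyperparameters and a hypothetical gradient with one tiny and one huge component; the learning rate 0.1 mirrors the SGD(lr=0.1) in the question.

```python
import numpy as np

def sgd_step(grad, lr):
    """Plain SGD: the step is proportional to the gradient."""
    return -lr * grad

def adam_first_step(grad, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam update (t = 1), starting from zero moment estimates."""
    m = (1 - beta1) * grad            # first moment estimate
    v = (1 - beta2) * grad ** 2       # second moment estimate
    m_hat = m / (1 - beta1)           # bias correction for t = 1
    v_hat = v / (1 - beta2)
    return -lr * m_hat / (np.sqrt(v_hat) + eps)

grad = np.array([0.01, -1000.0])      # hypothetical heavy-tailed gradient
lr = 0.1

sgd_update = sgd_step(grad, lr)
adam_update = adam_first_step(grad, lr)
print(sgd_update)    # step sizes differ by a factor of 100000
print(adam_update)   # both steps have magnitude close to lr
```

With SGD the huge gradient component produces a huge step, while Adam's per-coordinate normalization keeps both steps near the learning rate, which is one reason the two optimizers can end up with very different accuracies on the same model.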

How do I code input layer in Deep Learning using Keras (Basic)

Okay, so I'm pretty new to deep learning and have a very basic question. My input data is an array of 255 samples (array shape (255,)) in epochs_data, with the corresponding labels in new_labels (array shape (255,)).
I split the data using the following code:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(epochs_data, new_labels, test_size = 0.2, random_state=30)
I'm using a sequential model:
from keras.models import Sequential
from keras import layers
from keras.layers import Dense, Activation, Flatten
model = Sequential()
I know how to code for the hidden layers and output layer:
model.add(Dense(500, activation='relu')) #Hidden Layer
model.add(Dense(2, activation='softmax')) #Output Layer
But I don't know how to code the input layer with input_shape specified. X_train is the input. It's an array of shape (180,). Please also tell me how to write the model.fit() call for it. Any help is appreciated.
Add this line before the hidden layer. You can add whichever activation function you want. This line represents both the input layer and the first hidden layer, so you have to choose the number of neurons (I put 100):
model.add(Dense(100, input_shape = (X_train.shape[1],)))
EDIT:
Before fitting your model you have to configure your model with this line:
model.compile(loss = 'mse', optimizer = 'Adam', metrics = ['mse'])
So you have to choose a loss and a metric, which in this case are both Mean Squared Error, and an optimizer such as Adam, Adamax, etc.
Then you can fit your model by choosing the data (X, y), the number of epochs, the validation split and the batch size.
history = model.fit(X_train, y_train, epochs = 200,
validation_split = 0.1, batch_size=250)
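The input_shape = (X_train.shape[1],) argument used in the answer only works when X_train is a 2-D array of shape (n_samples, n_features). The question reports a shape of (180,), so the data would probably need to be stacked or reshaped first. A minimal NumPy sketch, assuming hypothetical data where each sample is a flat vector of 64 features:

```python
import numpy as np

# Hypothetical data: 180 samples, each a flat vector of 64 features.
n_samples, n_features = 180, 64
samples = [np.random.rand(n_features) for _ in range(n_samples)]

X_train = np.stack(samples)          # 2-D array of shape (180, 64)
input_shape = (X_train.shape[1],)    # (64,), the tuple that Dense(..., input_shape=...) expects

# If each sample is just a single number, add a feature axis instead.
X_scalar = np.arange(180, dtype="float32").reshape(-1, 1)   # shape (180, 1)

print(X_train.shape, input_shape, X_scalar.shape)
```

If each sample is itself multi-dimensional, a Flatten layer (already imported in the question) or an explicit reshape would be needed before the first Dense layer.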

Tuning number of hidden layers in Keras

I'm just trying to explore keras and tensorflow with the famous MNIST dataset.
I already applied some basic neural networks, but when it comes to tuning some hyperparameters, especially the number of layers, thanks to the sklearn wrapper GridSearchCV, I get the error below:
Parameter values for parameter (hidden_layers) need to be a sequence(but not a string) or np.ndarray.
So you can have a better view I post the main parts of my code.
Data preparation
# Extract label
X_train=train.drop(labels = ["label"],axis = 1,inplace=False)
Y_train=train['label']
del train
# Reshape to fit MLP
X_train = X_train.values.reshape(X_train.shape[0],784).astype('float32')
X_train = X_train / 255
# Label format
from keras.utils import np_utils
Y_train = keras.utils.to_categorical(Y_train, num_classes = 10)
num_classes = Y_train.shape[1]
Keras part
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV
# Function with hyperparameters to optimize
def create_model(optimizer='adam', activation = 'sigmoid', hidden_layers=2):
    # Initialize the constructor
    model = Sequential()
    # Add an input layer (input_shape must be a tuple)
    model.add(Dense(32, activation=activation, input_shape=(784,)))
    for i in range(hidden_layers):
        # Add one hidden layer
        model.add(Dense(16, activation=activation))
    # Add an output layer
    model.add(Dense(num_classes, activation='softmax'))
    #compile model
    model.compile(loss='categorical_crossentropy', optimizer=optimizer,
                  metrics=['accuracy'])
    return model
# Model which will be the input for the GridSearchCV function
modelCV = KerasClassifier(build_fn=create_model, verbose=0)
GridSearchCV
from keras.activations import relu, sigmoid
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import Dropout
from keras.utils import np_utils
activations = [sigmoid, relu]
param_grid = dict(hidden_layers=3,activation=activations, batch_size = [256], epochs=[30])
grid = GridSearchCV(estimator=modelCV, param_grid=param_grid, scoring='accuracy')
grid_result = grid.fit(X_train, Y_train)
I just want to let you know that the same kind of question has already been asked here Grid Search the number of hidden layers with keras but the answer is not complete at all and I can't add a comment to reply to the answerer.
Thank you!
Cast hidden_layers to an int inside create_model:
for i in range(int(hidden_layers)):
    # Add one hidden layer
    model.add(Dense(16, activation=activation))
Then pass every value in param_grid as a list:
param_grid = {"hidden_layers": [3]}
With hidden_layers=3 you are passing a bare integer, and GridSearchCV requires the values for each parameter to be a sequence (but not a string) or an np.ndarray, which is exactly what your error says.
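The reason every value has to be a list is that a grid search tries the Cartesian product of all candidate values. The sketch below illustrates the idea with itertools.product; it is not scikit-learn's implementation, and the candidate values for hidden_layers are hypothetical.

```python
from itertools import product

# Every value is a list, even when there is only one candidate.
param_grid = {
    "hidden_layers": [1, 2, 3],
    "activation": ["sigmoid", "relu"],
    "batch_size": [256],
    "epochs": [30],
}

keys = list(param_grid)
# One dict per combination of candidate values.
combinations = [
    dict(zip(keys, values))
    for values in product(*(param_grid[k] for k in keys))
]

print(len(combinations))   # 6
print(combinations[0])
```

Each of the resulting dictionaries corresponds to one call of create_model with those keyword arguments, so a bare 3 in place of [3] gives the search nothing to iterate over.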

Keras LSTM model giving different predictions on same input when size of input has changed?

I am building an LSTM for text classification with Keras, and am playing around with different input sentences to get a sense of what is happening, but I'm getting strange outputs. For example:
Sentence 1 = "On Tuesday, Ms. [Mary] Barra, 51, completed a remarkable personal odyssey when she was named as the next chief executive of G.M.--and the first woman to ascend to the top job at a major auto company."
Sentence 2 = "On Tuesday, Ms. [Mary] Barra, 51, was named as the next chief executive of G.M.--and the first woman to ascend to the top job at a major auto company."
The model predicts the class "objective" (0), output 0.4242 when the Sentence 2 is the only element in the input array. It predicts "subjective" (1), output 0.9061 for Sentence 1. If they are both (as separate strings) fed as input in the same array, both are classified as "subjective" (1) - but Sentence 1 outputs 0.8689 and 2 outputs 0.5607. It seems as though they are affecting each other's outputs. It does not matter which index in the input array each sentence is.
Here is the code:
max_length = 500
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=5000, lower=True,split=' ')
tokenizer.fit_on_texts(dataset["sentence"].values)
#print(tokenizer.word_index) # To see the dictionary
X = tokenizer.texts_to_sequences(dataset["sentence"].values)
X = pad_sequences(X, maxlen=max_length)
y = np.array(dataset["label"])
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
# fix random seed for reproducibility
numpy.random.seed(7)
X_train = sequence.pad_sequences(X_train, maxlen=max_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_length)
embedding_vector_length = 32
###LSTM
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
model = Sequential()
model.add(Embedding(5000, embedding_vector_length, input_length=max_length))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='sigmoid'))
model.add(MaxPooling1D(pool_size=2))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
from keras import optimizers
sgd = optimizers.SGD(lr=0.9)
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3, batch_size=64)
# save model
model.save('LSTM.h5')
I then reloaded the model in a separate script and am feeding it hard-coded sentences:
model = load_model('LSTM.h5')
max_length = 500
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=5000, lower=True,split=' ')
tokenizer.fit_on_texts(article_sentences)
#print(tokenizer.word_index) # To see the dictionary
X = tokenizer.texts_to_sequences(article_sentences)
X = pad_sequences(X, maxlen=max_length)
prediction = model.predict(X)
print(prediction)
for i in range(len(X)):
    print('%s\nLabel:%d' % (article_sentences[i], prediction[i]))
I set the random seed before training the model and in the script where I load the model, am I missing something when loading the model? Should I be arranging my data differently?
At prediction time you fit the tokenizer again on the prediction text. Print out X and you will probably see that the same sentence is encoded differently between the two predictions.
The tokenizer used for prediction must be the same one used in training, but in your code you fit a new tokenizer on the prediction text.
You're fitting a new tokenizer, which is different from the tokenizer your model was trained on. So the first thing you need to do is pickle the tokenizer after training and, at prediction time, load that pickled tokenizer instead of fitting a new one.
You can then use the loaded tokenizer directly, as in X = tokenizer.texts_to_sequences(article_sentences)
There is no need to fit again. I hope it helps!
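To see why refitting changes the model's input, here is a standard-library sketch. The functions fit_word_index and texts_to_sequences are hypothetical stand-ins that mimic the frequency-ranked word index of the Keras Tokenizer; they are not the Keras API, and the example sentences are made up.

```python
import pickle
from collections import Counter

def fit_word_index(texts):
    """Hypothetical stand-in for Tokenizer.fit_on_texts: rank words by frequency, ids start at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(word_index, texts):
    """Map each known word to its id; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index] for text in texts]

training_texts = ["the model was named", "the model was trained", "she was named chief"]
new_texts = ["she was named"]

trained_index = fit_word_index(training_texts)   # fitted once, at training time
refit_index = fit_word_index(new_texts)          # the mistake: fitting again at prediction time

# Save the fitted mapping at training time and load it at prediction time.
restored_index = pickle.loads(pickle.dumps(trained_index))

seq_restored = texts_to_sequences(restored_index, new_texts)
seq_refit = texts_to_sequences(refit_index, new_texts)
print(seq_restored)
print(seq_refit)
```

The two printed sequences differ even though the sentence is identical, so a model trained on the first encoding sees unrelated word ids when the tokenizer is refitted. With the real Tokenizer, the same idea would be pickle.dump on the fitted object in the training script and pickle.load in the prediction script, as the answer suggests.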
