Training process stuck in a local minimum for neural network - python

I'm working on a neural network that's designed to identify patterns in a large dataset, but I'm running into an issue where the training process seems to be stuck in a local minimum. Despite trying a variety of different optimization algorithms and adjusting the learning rate, I can't seem to get the network to converge on a more optimal solution. Here's the code I'm using to train the network:
import numpy as np
import tensorflow as tf
# Load dataset
data = np.load('data.npy')
# Split dataset into training and validation sets
train_data = data[:5000]
val_data = data[5000:]
# Define neural network architecture
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(512, activation='relu', input_shape=(data.shape[1],)),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
# Compile model
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])
# Train model
history = model.fit(train_data[:, :-1], train_data[:, -1],
                    validation_data=(val_data[:, :-1], val_data[:, -1]),
                    batch_size=32,
                    epochs=100,
                    verbose=1)
I suspect that there might be an issue with the dataset itself, but I've tried normalizing and standardizing the data, as well as applying various preprocessing techniques, with no luck. Any insights or suggestions would be helpful.
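For reference, the kind of standardization I tried looks roughly like this (simplified sketch; the last column is assumed to be the label, and the statistics are computed on the training split only):

# Standardize features using statistics from the training split only.
X_train, y_train = train_data[:, :-1], train_data[:, -1]
X_val, y_val = val_data[:, :-1], val_data[:, -1]

mean = X_train.mean(axis=0)
std = X_train.std(axis=0) + 1e-8   # avoid division by zero for constant features

X_train = (X_train - mean) / std
X_val = (X_val - mean) / std       # reuse the training statistics on the validation set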

Here is one source related to your problem (https://github.com/christianversloot/machine-learning-articles/blob/main/getting-out-of-loss-plateaus-by-adjusting-learning-rates.md). Have a look; it may give you some ideas.
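Along the same lines as that article, Keras ships a ReduceLROnPlateau callback that lowers the learning rate when the validation loss stops improving; a minimal sketch (parameter values are illustrative):

import tensorflow as tf

# Halve the learning rate if val_loss has not improved for 5 epochs.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                                                 patience=5, min_lr=1e-6)

history = model.fit(train_data[:, :-1], train_data[:, -1],
                    validation_data=(val_data[:, :-1], val_data[:, -1]),
                    batch_size=32, epochs=100, callbacks=[reduce_lr])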

Have you tried using different train/val splits? It could be something specific to the split you happen to have selected.
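If you want to try that quickly, a shuffled split is a reasonable sanity check; a minimal sketch assuming the same data layout as in your code:

from sklearn.model_selection import train_test_split

# Shuffle before splitting so the validation set is not just the tail of the file.
X, y = data[:, :-1], data[:, -1]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  shuffle=True, random_state=42)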

Related

Tensorflow: Using Word Embeddings to Transform Training Targets into Embedding Vectors, does it make sense?

I have been studying TF for only a few months now, so bear with me; think of this as a hypothetical question.
For example, I have audio spectrograms of words as input, and I am training a model to guess the word from a vocabulary after seeing the spectrogram. So the words are the target dataset, not the input dataset. Can I train a small pretrained embedding model to transform them into embedding vectors and use those as y_train in my actual model, instead of using just one-hot-encoded text?
I found below code from Francois Chollet's book as an example.
# PRETRAINED EMBEDDING MODEL
import tensorflow
from tensorflow import keras
import tensorflow_hub as hub

embed_model = keras.Sequential([
    hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1",
                   dtype=tensorflow.string, input_shape=[], output_shape=[50]),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid")
])
embed_model.compile(loss="binary_crossentropy", optimizer="adam",
                    metrics=["accuracy"])
batch_size = 32
# vectorized_text_y is the tf.data.Dataset of target words prepared earlier
train_set = vectorized_text_y.batch(batch_size).prefetch(1)
embed_history = embed_model.fit(train_set, epochs=5)
I know that in this setting word embeddings would not contribute much to guessing words from their sound, since their semantic relations are not relevant, but I am still curious whether this approach would make sense.

Different overfitting for three models with the same structure

I want to design three models that have the same structure, but in the end one of them should overfit seriously, another should overfit less, and the last one should not overfit at all.
The idea is that I want to see how much information exists in the last layer of each model for some test data. Let's say I am using the MNIST dataset as the training and test set, and the structure of all models should be like this:
# Network architecture
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

network = Sequential()
# Input layer
network.add(Dense(512, activation='relu', input_shape=(28*28,)))
# Hidden layers
network.add(Dense(64, activation='relu', name='features'))
# Output layer
network.add(Dense(10, activation='softmax'))
network.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Train the model
history = network.fit(train_img, train_label, epochs=50, batch_size=256, validation_split=0.2)
So now the question is how to change this training setup so that it fulfills my needs for three models with different amounts of overfitting.
I am new to machine learning topics and I hope I have explained my question as well as possible.
Thanks in advance.
Overfit:
The MNIST dataset is rather simple, therefore it should be easy to overfit with the model you are suggesting. Increase the number of epochs: eventually, your model will memorize the training data very well. If you struggle to overfit the data, you might need a more complex network - but I doubt that this will be the case.
Just right:
Probably the easiest way to obtain the model which is just right (no overfitting or underfitting) is to use a callback. Specifically, we can use early stopping. The callback will stop training if the validation loss stops improving. For your code, all you have to do is modify the training as follows:
First define a callback
callback_es = tf.keras.callbacks.EarlyStopping(monitor = 'val_loss')
Add the callback to your training
history = network.fit(train_img, train_label, epochs=50, batch_size=256, validation_split=0.2, callbacks=[callback_es])
Underfit
Similar idea as with overfitting. In this case, you want to stop your training early on: train your model for a limited number of epochs only. If you find that your model still overfits too quickly, try lowering the learning rate.
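Putting the three regimes together, a rough sketch; build_network() is a hypothetical helper that rebuilds the architecture shown above, and the epoch counts are only illustrative:

# Each regime should start from a freshly initialized copy of the same architecture.
overfit_net = build_network()          # hypothetical helper returning the model above
overfit_net.fit(train_img, train_label, epochs=200, batch_size=256, validation_split=0.2)

just_right_net = build_network()
just_right_net.fit(train_img, train_label, epochs=200, batch_size=256,
                   validation_split=0.2, callbacks=[callback_es])

underfit_net = build_network()
underfit_net.fit(train_img, train_label, epochs=3, batch_size=256, validation_split=0.2)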

How to fix (improve) a text classification model using word2vec

I'm a freshman in machine learning and neural networks. I have a problem with text classification. I use an LSTM architecture with the Keras library.
My model reaches results of about 97% every time. I have a database with about 1 million records, where 600k of them are positive and 400k are negative.
I also have 2 labeled classes: 0 (for negative) and 1 (for positive). My database is split into a training set and a test set in an 80:20 ratio. For the NN input, I use Word2Vec trained on PubMed articles.
My network architecture:
model = Sequential()
model.add(emb_layer)
model.add(LSTM(64, dropout=0.5))
model.add(Dense(2))
model.add(Activation('softmax'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=50, batch_size=32)
How can I fix (do better) my NN created model in this kind of text classification?
The problem we are dealing with here is called overfitting.
First of all, make sure your input data is properly cleaned. One of the principles of machine learning is 'garbage in, garbage out'. Next, you should balance your data collection, for example to 400k positive and 400k negative records. Then the dataset should be divided into a training, test and validation set (60%:20%:20%), for example using the scikit-learn library, as in the following example:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2)
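For the balancing step mentioned above, one simple option is to downsample the larger class before splitting; a sketch assuming X and y are numpy arrays with labels 0/1 (positives being the majority, as in your counts):

import numpy as np

rng = np.random.default_rng(0)
pos_idx = np.where(y == 1)[0]
neg_idx = np.where(y == 0)[0]

# Downsample the positives so both classes have the same number of records.
pos_idx = rng.choice(pos_idx, size=len(neg_idx), replace=False)
keep = np.concatenate([pos_idx, neg_idx])
rng.shuffle(keep)

X_bal, y_bal = X[keep], y[keep]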
Then I would use a different neural network architecture and try to optimize the parameters.
Personally, I would suggest using a 2-layer LSTM network or a combination of a convolutional and a recurrent neural network (it trains faster and, from articles I have read, gives better results).
1) 2-layer LSTM:
model = Sequential()
model.add(emb_layer)
model.add(LSTM(64, dropout=0.5, recurrent_dropout=0.5, return_sequences=True))
model.add(LSTM(64, dropout=0.5, recurrent_dropout=0.5))
model.add(Dense(2))
model.add(Activation('sigmoid'))
You can try using 2 layers with 64 hidden neurons each and add the recurrent_dropout parameter.
The main reason we use the sigmoid function is that its output lies between 0 and 1. Therefore, it is especially used for models where we have to predict a probability as an output. Since a probability only exists in the range 0 to 1, sigmoid is a reasonable choice.
2) CNN + LSTM
model = Sequential()
model.add(emb_layer)
model.add(Convolution1D(32, 3, padding='same'))
model.add(Activation('relu'))
model.add(MaxPool1D(pool_size=2))
model.add(Dropout(0.5))
model.add(LSTM(32, dropout=0.5, recurrent_dropout=0.5, return_sequences=True))
model.add(LSTM(64, dropout=0.5, recurrent_dropout=0.5))
model.add(Dense(2))
model.add(Activation('sigmoid'))
You can try using a combination of a CNN and an RNN. With this architecture, the model learns faster (up to 5 times faster).
Then, in both cases, you need to apply an optimizer and a loss function.
A good optimizer for both cases is the "Adam" optimizer.
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
In the last step, we validate our network on the validation set.
In addition, we use a callback, which will stop the training process when, for example, there is no improvement in the classification accuracy for 3 consecutive epochs.
from keras.callbacks import EarlyStopping
early_stopping = EarlyStopping(patience=3)
model.fit(X_train, y_train, epochs=100, batch_size=32, validation_data=(X_val, y_val), callbacks=[early_stopping])
We can also control the overfitting using graphs. If you want to see how to do it, check here.
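For the graphs, a minimal sketch using the History object returned by fit (assumes matplotlib is installed):

import matplotlib.pyplot as plt

# fit() returns a History object; diverging train/validation loss curves indicate overfitting.
history = model.fit(X_train, y_train, epochs=100, batch_size=32,
                    validation_data=(X_val, y_val), callbacks=[early_stopping])

plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='val loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()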
If you need further help, let me know in a comment.

Keras: Why do Sequential and Model give different outputs?

I am using Keras to build a simple sequence classification neural network. I played with the different modules and found that there are two ways to create a sequential neural network.
The first way is to use the Sequential API. This is the most common way, which I found in a lot of tutorials/documentation.
Here is the code :
# Sequential Neural Network using Sequential()
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, LSTM, Dense

model = Sequential()
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu', input_shape=(27, 300)))
model.add(MaxPooling1D(pool_size=2))
model.add(LSTM(100))
model.add(Dense(7, activation='softmax'))
model.summary()
The second way is to build the sequential neural network from "scratch" with the Model API. Here is the code:
# Sequential neural network using Model()
from keras.models import Model
from keras.layers import Input, Conv1D, MaxPooling1D, LSTM, Dense

inputs = Input(shape=(27, 300))
x = Conv1D(filters=32, kernel_size=3, padding='same', activation='relu')(inputs)
x = MaxPooling1D(pool_size=2)(x)
x = LSTM(100)(x)
predictions = Dense(7, activation='softmax')(x)
model = Model(inputs=inputs, outputs=predictions)
model.summary()
I trained both with a fixed seed (np.random.seed(1337)) and the same training data, and my outputs are different...
Note that the only difference in the summary is the explicit input layer with the Model API.
Does anyone know why these neural networks are different?
And if they are not, why did I get different results?
Thanks
You set up the random seed only in numpy and not in tensorflow (in case it is the backend of keras in your case). Try adding this to your code:
from numpy.random import seed
seed(1337)
from tensorflow import set_random_seed
set_random_seed(1337)
There is a detailed article about this topic here.
tf.keras.backend.clear_session()
tf.random.set_seed(seed_value)
You can use the above code block and run the loaded model for some iterations, then check if the error still persists.
I was facing the same issue with reproducibility, and it worked for me.
As mentioned by andrey, over and above these 2 seed setters, you need to set the Python hash seed environment variable:
import os
os.environ['PYTHONHASHSEED']=str(seed_value)
You can also add one more block to force TensorFlow to use a single thread (if you are on a multicore machine).
Multiple threads are a potential source of non-reproducible results.
session_conf = tf.ConfigProto(
    intra_op_parallelism_threads=1,
    inter_op_parallelism_threads=1)
sess = tf.Session(config=session_conf)
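For what it's worth, ConfigProto/Session is TensorFlow 1.x API; on TensorFlow 2.x the rough equivalent is the tf.config.threading module (to be called before any ops are created):

import tensorflow as tf

# Force single-threaded execution for intra- and inter-op parallelism.
tf.config.threading.set_intra_op_parallelism_threads(1)
tf.config.threading.set_inter_op_parallelism_threads(1)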

Keras: Training loss decreases (accuracy increases) while validation loss increases (accuracy decreases)

I am working on a very sparse dataset with the goal of predicting 6 classes.
I have tried working with a lot of models and architectures, but the problem remains the same.
When I start training, the training accuracy slowly starts to increase and the loss decreases, whereas the validation does the exact opposite.
I have really tried to deal with overfitting, and I still cannot believe that this is what is causing the issue.
What have I tried
Transfer learning on VGG16:
exclude top layer and add dense layer with 256 units and 6 units softmax output layer
finetune the top CNN block
finetune the top 3-4 CNN blocks
To deal with overfitting I used heavy augmentation in Keras and dropout after the 256-unit dense layer with p=0.5 (a sketch of this setup follows below).
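For context, a minimal sketch of that VGG16 setup (the 256/6 layer sizes come from the description above; the weights, pooling, and frozen base are assumptions):

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.applications.vgg16 import VGG16

# img_width / img_height are assumed to be defined as in the generators below.
base = VGG16(weights='imagenet', include_top=False, pooling='avg',
             input_shape=(img_width, img_height, 3))
base.trainable = False  # freeze the convolutional base; unfreeze the top block(s) to finetune

model = Sequential([
    base,
    Dense(256, activation='relu'),
    Dropout(0.5),
    Dense(6, activation='softmax'),
])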
Creating own CNN with VGG16-ish architecture:
including batch normalization wherever possible
L2 regularization on each CNN+dense layer
Dropout from anywhere between 0.5-0.8 after each CNN+dense+pooling layer
Heavy data augmentation "on the fly" in Keras
Realising that perhaps I have too many free parameters:
decreasing the network to only contain 2 CNN blocks + dense + output.
dealing with overfitting in the same manner as above.
Without exception all training sessions are looking like this:
Training & Validation loss+accuracy
The last mentioned architecture looks like this:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, BatchNormalization, Activation, Dropout, Flatten, Dense
from keras import regularizers

reg = 0.0001
model = Sequential()
model.add(Conv2D(8, (3, 3), input_shape=input_shape, padding='same',
                 kernel_regularizer=regularizers.l2(reg)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.7))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.5))
model.add(Conv2D(16, (3, 3), input_shape=input_shape, padding='same',
                 kernel_regularizer=regularizers.l2(reg)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.7))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(16, kernel_regularizer=regularizers.l2(reg)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(6))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='SGD',metrics=['accuracy'])
And the data is augmented by the generator in Keras and is loaded with flow_from_directory:
train_datagen = ImageDataGenerator(rotation_range=10,
                                   width_shift_range=0.05,
                                   height_shift_range=0.05,
                                   shear_range=0.05,
                                   zoom_range=0.05,
                                   rescale=1/255.,
                                   fill_mode='nearest',
                                   channel_shift_range=0.2*255)
train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    shuffle=True,
    class_mode='categorical')
validation_datagen = ImageDataGenerator(rescale=1/255.)
validation_generator = validation_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_width, img_height),
    batch_size=1,
    shuffle=True,
    class_mode='categorical')
What I can think of by analyzing your metric outputs (from the link you provided):
It seems to me that around epoch 30 your model starts to overfit. Therefore you can try stopping your training at that point, or just train it for ~30 epochs (or the exact number). The Keras callbacks may be useful here, especially ModelCheckpoint, which saves the model during training so that you can stop whenever you want (Ctrl+C) or when a certain criterion is met and still keep the saved weights. Here is an example of basic ModelCheckpoint use:
from keras.callbacks import ModelCheckpoint

# set save_best_only=True to save the weights only when the monitored metric improves
chk = ModelCheckpoint("myModel.h5", monitor='val_loss', save_best_only=False)
callbacks_list = [chk]
#pass callback on fit
history = model.fit(X, Y, ... , callbacks=callbacks_list)
(Edit:) As suggested in the comments, another option is the EarlyStopping callback, where you can specify the minimum change tolerated and the 'patience' (epochs without such improvement) before stopping the training. If you use it, pass it to the callbacks argument as explained before.
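A minimal sketch of that callback (parameter values are illustrative; restore_best_weights requires a reasonably recent Keras version):

from keras.callbacks import EarlyStopping

# Stop once val_loss has improved by less than min_delta for `patience` consecutive epochs.
early_stop = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5,
                           restore_best_weights=True)
callbacks_list = [chk, early_stop]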
With the current setup your model has (and with the modifications you have tried), that point in your training seems to be the optimal training time for your case; training it further will bring no benefit to your model (in fact, it will make it generalize worse).
Given that you have tried several modifications, one thing you can do is try to increase your network depth, to give it more capacity. Try adding more layers, one at a time, and check for improvements. Also, you usually want to start with simpler models first, before attempting a multi-layer solution.
If a simple model doesn't work, add one layer and test again, repeating until satisfied or until it is no longer feasible. And by simple I mean really simple: have you tried a non-convolutional approach? Although CNNs are great for images, maybe you are overkilling it here.
If nothing seems to work, maybe it is time to get more data, or to generate more data from what you have by sampling or other techniques. For that last suggestion, try checking this Keras blog, which I have found really useful. Deep learning algorithms usually require a substantial amount of training data, especially for complex inputs like images, so be aware this may not be an easy task. Hope this helps.
IMHO, this is just a normal situation for DL. In Keras you can set up a callback that will save the best model (depending on the evaluation metric that you provide), and a callback that will stop training if the model isn't improving.
See the ModelCheckpoint & EarlyStopping callbacks respectively.
P.S. Sorry, maybe I misunderstood the question: is your validation loss decreasing from the first step?
Validation loss is increasing. This means you need more data or more regularization. It is a standard situation here, and nothing to be worried about. By the way, more parameters (a bigger model) will only worsen this problem unless you fix it.
So you can now investigate profitably by introducing more examples, or by adding L2, L1, or dropout regularization.
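For reference, a sketch of how those regularizers would plug into a dense layer like the ones above (coefficients are illustrative):

from keras import regularizers
from keras.layers import Dense, Dropout

# Combine L1 (pushes weights toward sparsity) and L2 (weight decay), plus dropout after the layer.
model.add(Dense(16, kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4)))
model.add(Dropout(0.5))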
I faced a similar problem and managed to fix it by removing the batch normalization layer just before the output dense layer. That made a ton of difference. One of the suggestions I was given was also to remove the dropout layer, as it might be causing a variance shift; check this paper.
I got part of the solution from this thread.
