I have been struggling to create an automatic speech recognition neural network using TensorFlow, trained on the Hugging Face Mozilla Common Voice 11 dataset. The model seems to train well for around 100 batches before the loss suddenly goes to infinity.
Here is the code for the data preprocessing:
dataset = datasets.load_dataset("mozilla-foundation/common_voice_11_0", "en")
dataset = dataset.remove_columns(['client_id', 'audio', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'])
def prepare_dataset(batch):
    wav_file = batch['path']
    # Remove file name
    split = wav_file.split("\\")
    joined = "\\".join(split[:-1]) + "\\"
    # Get the train number
    complete_path = glob.glob(joined + "*")
    # Combine all the parts
    file = complete_path[0] + "\\" + split[-1]
    batch['path'] = file
    return batch
train_dataset = dataset['train'].map(prepare_dataset).shuffle(len(dataset['train']))
val_dataset = dataset['validation'].map(prepare_dataset).shuffle(len(dataset['validation']))
frame_length = 256
frame_step = 160
fft_length = 384
def load_mp3(wav_file):
    audio = tfio.audio.AudioIOTensor(wav_file, dtype=tf.float32)
    sample_rate = tf.cast(audio.rate, dtype=tf.int64)
    audio = tf.squeeze(audio.to_tensor())
    audio = tfio.audio.resample(audio, rate_in=sample_rate, rate_out=8000)
    audio = tfio.audio.fade(audio, fade_in=1000, fade_out=2000, mode="logarithmic")
    return audio
def convert_to_spect(audio):
    spectrogram = tf.signal.stft(
        audio, frame_length=frame_length, frame_step=frame_step, fft_length=fft_length
    )
    spectrogram = tf.abs(spectrogram)
    spectrogram = tf.math.pow(spectrogram, 0.5)
    spectrogram = tfio.audio.freq_mask(spectrogram, param=25)
    spectrogram = tfio.audio.time_mask(spectrogram, param=25)
    spectrogram = tfio.audio.freq_mask(spectrogram, param=25)
    spectrogram = tfio.audio.time_mask(spectrogram, param=25)
    means = tf.math.reduce_mean(spectrogram, 1, keepdims=True)
    stddevs = tf.math.reduce_std(spectrogram, 1, keepdims=True)
    spectrogram = (spectrogram - means) / (stddevs + 1e-10)
    return spectrogram
def process_text(label):
    label = tf.strings.lower(label)
    label = tf.strings.unicode_split(label, input_encoding="UTF-8")
    label = char_to_num(label)
    return label
def encode_mozilla_sample(wav_file, label):
    audio = load_mp3(wav_file)
    spectrogram = convert_to_spect(audio)
    label = process_text(label)
    return spectrogram, label
And here is the code for the model:
def CTCLoss(y_true, y_pred):
    # Compute the training-time loss value
    batch_len = tf.cast(tf.shape(y_true)[0], dtype="int64")
    input_length = tf.cast(tf.shape(y_pred)[1], dtype="int64")
    label_length = tf.cast(tf.shape(y_true)[1], dtype="int64")
    input_length = input_length * tf.ones(shape=(batch_len, 1), dtype="int64")
    label_length = label_length * tf.ones(shape=(batch_len, 1), dtype="int64")
    loss = tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)
    return loss
def build_model(input_dim, output_dim, rnn_layers=5, conv_units=128, rnn_units=128, dropout=0.5):
input_spectrogram = tf.keras.layers.Input((None, input_dim), name="input")
x = tf.keras.layers.Reshape((-1, input_dim, 1), name="expand_dim")(input_spectrogram)
# Conv layers
x = tf.keras.layers.Conv2D(
filters=conv_units,
kernel_size=[11, 41],
strides=[2, 2],
padding="same",
use_bias=False,
name="conv_1",
)(x)
x = tf.keras.layers.BatchNormalization(name="conv_1_bn")(x)
x = tf.keras.layers.ReLU(name="conv_1_relu")(x)
x = tf.keras.layers.Conv2D(
filters=conv_units,
kernel_size=[11, 21],
strides=[1, 2],
padding="same",
use_bias=False,
name="conv_2",
)(x)
x = tf.keras.layers.BatchNormalization(name="conv_2_bn")(x)
x = tf.keras.layers.ReLU(name="conv_2_relu")(x)
x = tf.keras.layers.Reshape((-1, x.shape[-2] * x.shape[-1]))(x)
# RNN layers
for i in range(1, rnn_layers + 1):
recurrent = tf.keras.layers.GRU(
units=rnn_units,
activation="tanh",
recurrent_activation="sigmoid",
use_bias=True,
return_sequences=True,
reset_after=True,
name=f"gru_{i}",
)
x = tf.keras.layers.Bidirectional(
recurrent, name=f"bidirectional_{i}", merge_mode="concat"
)(x)
x = tf.keras.layers.BatchNormalization(name=f"rnn_{i}_bn")(x)
if i < rnn_layers:
x = tf.keras.layers.Dropout(rate=dropout)(x)
# Dense layer
x = tf.keras.layers.Dense(units=rnn_units * 2, activation="gelu", name="dense_1")(x)
x = tf.keras.layers.Dropout(rate=dropout)(x)
# Classification layer
output = tf.keras.layers.Dense(units=output_dim + 1, activation="softmax", name="output_layer")(x)
# Model
model = tf.keras.Model(input_spectrogram, output, name="DeepSpeech_2")
# Optimizer
opt = tf.keras.optimizers.Adam(learning_rate=0.01)
# Compile the model and return
model.compile(optimizer=opt, loss=CTCLoss)
return model
# Get the model
model = build_model(
input_dim=fft_length // 2 + 1,
output_dim=char_to_num.vocabulary_size(),
rnn_units=32,
conv_units=32,
rnn_layers=5,
dropout=0.5
)
Versions:
tensorflow: 2.10.1
python: 3.9.12
gpu: Nvidia GeForce RTX 3080
OS: Windows 11
cuDNN: 8.1
CUDA: 11.2
I have tried increasing the batch size, expecting the model to generalize better, but any batch size of 256 or higher caused the GPU to run out of memory. The infinite loss occurs with any batch size of 128 or less. I have also tried increasing the batch size while using less data, but the result is the same. I thought that reducing the size of the neural network would help, but no matter what I do, the loss goes to infinity after reaching a value of around 200. Other changes I have tried include different activation functions (relu, leakyrelu, gelu), optimizers (SGD, Adam, AdamW), and numbers of RNN/conv layers.
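One check that may be worth running here, offered as an assumption and not as a diagnosis confirmed in this thread: CTC can only align a label to an output sequence that has at least as many time steps as the label has characters, plus one extra step for every pair of identical adjacent characters. When a sample violates that, its loss is infinite. The sketch below is plain Python; the helper names are hypothetical, the frame count follows `tf.signal.stft` with its default `pad_end=False`, and the time strides (2, 1) are taken from `conv_1` and `conv_2` in the model above.

```python
import math

def stft_frames(num_samples, frame_length, frame_step):
    """Frames produced by an STFT that does not pad the end of the signal."""
    if num_samples < frame_length:
        return 0
    return 1 + (num_samples - frame_length) // frame_step

def frames_after_model(num_frames, time_strides=(2, 1)):
    """Time steps left after the conv layers; padding='same' rounds up."""
    for stride in time_strides:
        num_frames = math.ceil(num_frames / stride)
    return num_frames

def min_ctc_frames(label):
    """Smallest output length that CTC can align to this label."""
    repeats = sum(1 for a, b in zip(label, label[1:]) if a == b)
    return len(label) + repeats

def ctc_is_feasible(num_samples, label, frame_length=256, frame_step=160):
    """True if the model emits enough time steps for this label."""
    frames = frames_after_model(stft_frames(num_samples, frame_length, frame_step))
    return frames >= min_ctc_frames(label)

# A 2-second clip at 8 kHz: 99 STFT frames, 50 time steps after the convs.
print(frames_after_model(stft_frames(16000, 256, 160)))
print(ctc_is_feasible(16000, "hello world"))  # needs 12 steps
print(ctc_is_feasible(16000, "a" * 60))       # needs 119 steps
```

If some clips fail this check, filtering them out before batching would be one way to test the idea. Note also that `CTCLoss` above passes the padded label width as `label_length` for every sample, so padding entries are counted as label characters.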
Note: I have considered using a pretrained model but I have always wanted to successfully create ASR from scratch using tensorflow. Will it even be possible to get even moderately acceptable results using my GPU and data or will I have to resort to using wav2vec?
Another note: I was first inspired to create this project after watching the video https://www.youtube.com/watch?v=YereI6Gn3bM
made by "The A.I. Hacker - Michael Phi", who first convinced me that this was possible. Before, I had thought that my computer would not be able to handle this task, but after seeing him do it with PyTorch, similar computer specs, and the same data, I thought that I would be able to do so as well.
Update:
I have recently tried replacing the 2D Conv layers with a single 1D Conv layer, making the GRU layer not bidirectional, and going back to the AdamW optimizer but nothing has changed.
Thanks for the solution. I just changed the number of neurons in the second to last dense layer to 512 and the model is currently running without error. Now I am just going to have to figure out how to improve the model so I can finally wrap up this project.
I previously posted an issue with a neural network where the loss kept going to infinity after 200 batches. After solving that problem, the model now gets stuck at a local minimum with a loss of about 180, and I am not able to get below it. Does anyone have any suggestions to help the model perform better? I have tried many learning rates from 0.1 to 1e-4, a learning rate warmup, and ReduceLROnPlateau. When I try a batch size above 32, my GPU runs out of memory and the program crashes. Could the issue be with the parameters for the spectrogram?
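On the spectrogram question, some plain arithmetic may help. This is a sketch under assumptions: it compares the settings of the earlier post (8 kHz audio, `frame_length=256`, `frame_step=160`, `fft_length=384`) with the ones listed below, taking the sample rate to be the 44100 set at module level and the frame step to be the 441 samples that `int(sample_rate * 0.010)` intends. The helper names are hypothetical.

```python
def window_ms(frame_length, sample_rate):
    """Duration of one STFT window in milliseconds."""
    return 1000.0 * frame_length / sample_rate

def frames_per_second(frame_step, sample_rate):
    """Spectrogram frames produced per second of audio."""
    return sample_rate / frame_step

def freq_bins(fft_length):
    """Frequency bins returned by a real-input STFT."""
    return fft_length // 2 + 1

settings = {
    "earlier post": dict(sample_rate=8000, frame_length=256,
                         frame_step=160, fft_length=384),
    "this post": dict(sample_rate=44100, frame_length=512,
                      frame_step=441, fft_length=512),
}

for name, s in settings.items():
    print(
        name,
        round(window_ms(s["frame_length"], s["sample_rate"]), 1), "ms window,",
        frames_per_second(s["frame_step"], s["sample_rate"]), "frames/s,",
        freq_bins(s["fft_length"]), "bins",
    )
```

Under these assumptions the new settings double the number of frames per second (100 instead of 50) and widen each frame from 193 to 257 bins, which may contribute to the memory limit at batch sizes above 32, while the window shrinks from 32 ms to about 11.6 ms. Speech front ends often use a window of roughly 20 to 25 ms with a 10 ms step. Also note that `load_mp3` below resamples with `rate_in` equal to `rate_out`, so the audio keeps whatever rate the file has; the 10 ms step only holds if the files really are 44.1 kHz.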
Here is the code for the data processing:
# The set of characters accepted in the transcription.
characters = [x for x in "abcdefghijklmnopqrstuvwxyz'.?! "]
# Mapping characters to integers
char_to_num = tf.keras.layers.StringLookup(vocabulary=characters, oov_token="")
# Mapping integers back to original characters
num_to_char = tf.keras.layers.StringLookup(
vocabulary=char_to_num.get_vocabulary(), oov_token="", invert=True
)
print(
f"The vocabulary is: {char_to_num.get_vocabulary()} "
f"(size ={char_to_num.vocabulary_size()})"
)
sample_rate = 44100
frame_length = 512
frame_step = int(sample_rate * 0.010)
fft_length = 512
def load_mp3(wav_file):
    audio = tfio.audio.AudioIOTensor(wav_file, dtype=tf.float32)
    sample_rate = tf.cast(audio.rate, dtype=tf.int64)
    audio = tf.squeeze(audio.to_tensor())
    audio = tfio.audio.resample(audio, rate_in=sample_rate, rate_out=sample_rate)
    audio = tfio.audio.fade(audio, fade_in=1000, fade_out=2000, mode="logarithmic")
    return audio
def convert_to_stft(audio):
    # 4. Get the spectrogram
    spectrogram = tf.signal.stft(
        audio, frame_length=frame_length, frame_step=frame_step#, fft_length=fft_length
    )
    spectrogram = tf.abs(spectrogram)
    spectrogram = tf.math.pow(spectrogram, 0.5)
    means = tf.math.reduce_mean(spectrogram, 1, keepdims=True)
    stddevs = tf.math.reduce_std(spectrogram, 1, keepdims=True)
    spectrogram = (spectrogram - means) / (stddevs + 1e-10)
    return spectrogram
def process_text(label):
    label = tf.strings.lower(label)
    label = tf.strings.unicode_split(label, input_encoding="UTF-8")
    label = char_to_num(label)
    label = label[:255]
    return label
def encode_mozilla_sample(wav_file, label):
    audio = load_mp3(wav_file)
    spectrogram = convert_to_stft(audio)
    label = process_text(label)
    return spectrogram, label
And here is the code for the model:
def CTCLoss(y_true, y_pred):
    # Compute the training-time loss value
    batch_len = tf.cast(tf.shape(y_true)[0], dtype="int64")
    input_length = tf.cast(tf.shape(y_pred)[1], dtype="int64")
    label_length = tf.cast(tf.shape(y_true)[1], dtype="int64")
    input_length = input_length * tf.ones(shape=(batch_len, 1), dtype="int64")
    label_length = label_length * tf.ones(shape=(batch_len, 1), dtype="int64")
    loss = tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)
    return loss
def build_model(input_dim, output_dim, rnn_layers=5, conv_units=128, rnn_units=128,
dropout=0.5):
input_spectrogram = tf.keras.layers.Input((None, input_dim), name="input")
x = tf.keras.layers.Reshape((-1, input_dim, 1), name="expand_dim")(input_spectrogram)
# Conv layers
x = tf.keras.layers.Conv2D(
filters=conv_units,
kernel_size=[11, 41],
strides=[2, 2],
padding="same",
use_bias=False,
name="conv_1",
)(x)
x = tf.keras.layers.BatchNormalization(name="conv_1_bn")(x)
x = tf.keras.layers.ReLU(name="conv_1_relu")(x)
x = tf.keras.layers.Conv2D(
filters=conv_units,
kernel_size=[11, 21],
strides=[1, 2],
padding="same",
use_bias=False,
name="conv_2",
)(x)
x = tf.keras.layers.BatchNormalization(name="conv_2_bn")(x)
x = tf.keras.layers.ReLU(name="conv_2_relu")(x)
x = tf.keras.layers.Reshape((-1, x.shape[-2] * x.shape[-1]))(x)
# RNN layers
for i in range(1, rnn_layers + 1):
recurrent = tf.keras.layers.GRU(
units=rnn_units,
activation="tanh",
recurrent_activation="sigmoid",
use_bias=True,
return_sequences=True,
reset_after=True,
name=f"gru_{i}",
)
x = tf.keras.layers.Bidirectional(
recurrent, name=f"bidirectional_{i}", merge_mode="concat"
)(x)
x = tf.keras.layers.BatchNormalization(name=f"rnn_{i}_bn")(x)
if i < rnn_layers:
x = tf.keras.layers.Dropout(rate=dropout)(x)
# Dense layer
x = tf.keras.layers.Dense(units=rnn_units * 2, activation="gelu", name="dense_1")(x)
x = tf.keras.layers.Dropout(rate=dropout)(x)
# Classification layer
output = tf.keras.layers.Dense(units=output_dim + 1, activation="softmax",
name="output_layer")(x)
# Model
model = tf.keras.Model(input_spectrogram, output, name="DeepSpeech_2")
# Optimizer
opt = tf.keras.optimizers.Adam(learning_rate=0.01)
# Compile the model and return
model.compile(optimizer=opt, loss=CTCLoss)
return model
# Get the model
model = build_model(
input_dim=fft_length // 2 + 1,
output_dim=char_to_num.vocabulary_size(),
rnn_units=32,
conv_units=32,
rnn_layers=5,
dropout=0.5
)
Versions:
tensorflow: 2.10.1
python: 3.9.12
gpu: Nvidia GeForce RTX 3080(10Gb)
OS: Windows 11
cuDNN: 8.1
CUDA: 11.2
Note: There is more information in the previous question I posted with the loss going to infinity.
I made my custom yolo loss function, essentially same as the one here https://github.com/Neerajj9/Text-Detection-using-Yolo-Algorithm-in-keras-tensorflow/blob/master/Yolo.ipynb
While training, it shows a loss of nan. Why is it so?
def yolo_loss_function(y_true,y_pred):
    #y_true,y_pred:None,16,16,1,5
    l_coords = 5.0
    l_noob = 0.5
    coords = y_true[:,:,:,:,0]*l_coords
    noobs = (-1*(y_true[:,:,:,:,0]-1)*l_noob)
    p_pred = y_pred[:,:,:,:,0] #probability that there is text or not
    p_true = y_true[:,:,:,:,0] #Always 1 or 0
    x_true = y_true[:,:,:,:,1]
    x_pred = y_pred[:,:,:,:,1]
    yy_true = y_true[:,:,:,:,2]
    yy_pred = y_pred[:,:,:,:,2]
    w_true = y_true[:,:,:,:,3]
    w_pred = y_pred[:,:,:,:,3]
    h_true = y_true[:,:,:,:,4]
    h_pred = y_pred[:,:,:,:,4]
    #We have different loss value depending on whether text is present or not
    p_loss_absent = K.sum(K.square(p_pred-p_true)*noobs)
    p_loss_present = K.sum(K.square(p_pred-p_true))
    x_loss = K.sum(K.square(x_pred-x_true)*coords)
    yy_loss = K.sum(K.square(yy_pred-yy_true)*coords)
    xy_loss = x_loss + yy_loss
    w_loss = K.sum(K.square(K.sqrt(w_pred)-K.sqrt(w_true))*coords)
    h_loss = K.sum(K.square(K.sqrt(h_pred)-K.sqrt(h_true))*coords)
    wh_loss = w_loss+h_loss
    loss = p_loss_present+p_loss_absent + xy_loss + wh_loss
    return loss
#optimizer
opt = Adam(lr=0.0001,beta_1=0.9,beta_2=0.999,epsilon=1e-08,decay=0.0)
#checkpoint
checkpoint = ModelCheckpoint('model/text_detect.h5',monitor='val_loss',verbose =1,save_best_only=True,mode='min',period=1)
model.compile(loss=yolo_loss_function,optimizer=opt,metrics=['accuracy'])
I'm using transfer learning using the MobileNetV2 architecture.
P.S. As suggested in the post "Loss goes to NAN when training the custom YOLO model", I tried removing sqrt from my loss function. That removed the nan, but my loss does not decrease: it increases steadily and then stays constant at about 6. The answer to that post does not seem to help, as I cannot see a division by 0 anywhere.
Edit:
def yolo_model(input_shape):
inp = Input(input_shape)
model = MobileNetV2( input_tensor= inp , include_top=False, weights='imagenet')
last_layer = model.output
conv = Conv2D(512,(3,3) , activation='relu' , padding='same')(last_layer)
conv = Dropout(0.4)(conv)
bn = BatchNormalization()(conv)
lr = LeakyReLU(alpha=0.1)(bn)
conv = Conv2D(128,(3,3) , activation='relu' , padding='same')(lr)
conv = Dropout(0.4)(conv)
bn = BatchNormalization()(conv)
lr = LeakyReLU(alpha=0.1)(bn)
conv = Conv2D(5,(3,3) , activation='sigmoid' , padding='same')(lr)
final = Reshape((grid_h,grid_w,classes,info))(conv)
model = Model(inp,final)
return model
I'm adding my model above. The activation in the last Conv2D layer was relu, which I changed to sigmoid in response to an answer. Also, my images are normalised to the range (-1, 1). After the first epoch, my program showed loss: nan and accuracy: 1.0000, and below that there was a line saying the loss could not be brought down from inf.
You are using relu in the last layer, which is not what the loss expects, and this may be causing dying gradients.
In the original YOLO paper the outputs are bounded: the coordinates, heights, and widths are normalized to the range (0, 1). So try getting rid of relu and using a linear or sigmoid activation instead.
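A related point about the square-root terms, sketched in plain Python as an illustration and not as a confirmed diagnosis of this model. The derivative of sqrt(w) is 0.5 / sqrt(w), which is infinite at w = 0, and a relu output can be exactly 0. In cells without text the `coords` weight is 0, so the backward pass multiplies 0 by infinity, which is NaN under IEEE floating-point rules. The function name and the `eps` parameter are hypothetical.

```python
import math

def sqrt_term_grad(w_pred, w_true, weight, eps=0.0):
    """Gradient of weight * (sqrt(w_pred + eps) - sqrt(w_true + eps))**2
    with respect to w_pred, following IEEE rules (0.5 / 0 is inf)."""
    root_pred = math.sqrt(w_pred + eps)
    root_true = math.sqrt(w_true + eps)
    d_sqrt = math.inf if root_pred == 0.0 else 0.5 / root_pred
    upstream = weight * 2.0 * (root_pred - root_true)
    return upstream * d_sqrt

# Empty cell: weight 0, relu output exactly 0 -> 0 * inf = nan.
print(sqrt_term_grad(0.0, 0.0, weight=0.0))

# Cell with text, prediction exactly 0 -> the gradient is infinite.
print(sqrt_term_grad(0.0, 0.25, weight=5.0))

# A small epsilon inside the square root keeps both cases finite.
print(sqrt_term_grad(0.0, 0.0, weight=0.0, eps=1e-7))
print(sqrt_term_grad(0.0, 0.25, weight=5.0, eps=1e-7))
```

A sigmoid output is normally strictly between 0 and 1, which avoids the exact zero; adding a small epsilon inside `K.sqrt` would be another way to guard against it.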
model.add(Conv2D(7,(3,3),padding="same"))
model.add(Activation("relu"))
adam = optimizers.adam(lr=0.001)
model.compile(loss=custom_loss,optimizer=adam,metrics=["accuracy"])
I have been bashing my head against the wall for the past few days - and I simply cannot figure it out.
Would some of you good people perhaps let me know what I am doing wrong?
I am trying to port code from https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Deep%20Q%20Learning/Doom/Deep%20Q%20learning%20with%20Doom.ipynb (written in Tensorflow) to Keras. Here is the original part of the code:
class DQNetwork:
def __init__(self, state_size, action_size, learning_rate, name='DQNetwork'):
self.state_size = state_size
self.action_size = action_size
self.learning_rate = learning_rate
with tf.variable_scope(name):
self.inputs_ = tf.placeholder(tf.float32, [None, *state_size], name="inputs")
self.actions_ = tf.placeholder(tf.float32, [None, 3], name="actions_")
self.target_Q = tf.placeholder(tf.float32, [None], name="target")
#First convnet: CNN => BatchNormalization => ELU; Input is 84x84x4
self.conv1 = tf.layers.conv2d(inputs = self.inputs_,
filters = 32, kernel_size = [8,8],strides = [4,4],padding = "VALID",
kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(), name = "conv1")
self.conv1_batchnorm = tf.layers.batch_normalization(self.conv1,training = True,
epsilon = 1e-5,name = 'batch_norm1')
self.conv1_out = tf.nn.elu(self.conv1_batchnorm, name="conv1_out")
## --> [20, 20, 32]
#Second convnet: CNN => BatchNormalization => ELU
self.conv2 = tf.layers.conv2d(inputs = self.conv1_out,
filters = 64,kernel_size = [4,4],strides = [2,2],padding = "VALID",
kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),name = "conv2")
self.conv2_batchnorm = tf.layers.batch_normalization(self.conv2,training = True,
epsilon = 1e-5,name = 'batch_norm2')
self.conv2_out = tf.nn.elu(self.conv2_batchnorm, name="conv2_out")
## --> [9, 9, 64]
#Third convnet: CNN => BatchNormalization => ELU
self.conv3 = tf.layers.conv2d(inputs = self.conv2_out,
filters = 128,kernel_size = [4,4],strides = [2,2],padding = "VALID",
kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),name = "conv3")
self.conv3_batchnorm = tf.layers.batch_normalization(self.conv3,training = True,
epsilon = 1e-5,name = 'batch_norm3')
self.conv3_out = tf.nn.elu(self.conv3_batchnorm, name="conv3_out")
## --> [3, 3, 128]
self.flatten = tf.layers.flatten(self.conv3_out)
## --> [1152]
self.fc = tf.layers.dense(inputs = self.flatten,
units = 512, activation = tf.nn.elu,
kernel_initializer=tf.contrib.layers.xavier_initializer(),name="fc1")
self.output = tf.layers.dense(inputs = self.fc, kernel_initializer=tf.contrib.layers.xavier_initializer(),
units = 3, activation=None)
# Q is our predicted Q value.
self.Q = tf.reduce_sum(tf.multiply(self.output, self.actions_), axis=1)
# The loss is the difference between our predicted Q_values and the Q_target
# Sum(Qtarget - Q)^2
self.loss = tf.reduce_mean(tf.square(self.target_Q - self.Q))
self.optimizer = tf.train.RMSPropOptimizer(self.learning_rate).minimize(self.loss)
# farther below...
Qs_next_state = sess.run(DQNetwork.output, feed_dict = {DQNetwork.inputs_: next_states_mb})
# Set Q_target = r if the episode ends at s+1, otherwise set Q_target = r + gamma*maxQ(s', a')
for i in range(0, len(batch)):
terminal = dones_mb[i]
# If we are in a terminal state, only equals reward
if terminal:
target_Qs_batch.append(rewards_mb[i])
else:
target = rewards_mb[i] + gamma * np.max(Qs_next_state[i])
target_Qs_batch.append(target)
targets_mb = np.array([each for each in target_Qs_batch])
loss, _ = sess.run([DQNetwork.loss, DQNetwork.optimizer],
feed_dict={DQNetwork.inputs_: states_mb,
DQNetwork.target_Q: targets_mb,
DQNetwork.actions_: actions_mb})
And here is my conversion:
class DQNetworkA:
def __init__(self, state_size, action_size, learning_rate):
self.state_size = state_size
self.action_size = action_size
self.learning_rate = learning_rate
self.model = keras.models.Sequential()
self.model.add(keras.layers.Conv2D(32, (8, 8), strides=(4, 4), padding = "VALID", input_shape=state_size))#, kernel_initializer='glorot_normal'))
self.model.add(keras.layers.BatchNormalization(epsilon = 1e-5))
self.model.add(keras.layers.Activation('elu'))
self.model.add(keras.layers.Conv2D(64, (4, 4), strides=(2, 2), padding = "VALID"))#, kernel_initializer='glorot_normal'))
self.model.add(keras.layers.BatchNormalization(epsilon = 1e-5))
self.model.add(keras.layers.Activation('elu'))
self.model.add(keras.layers.Conv2D(128, (4, 4), strides=(2, 2), padding = "VALID"))#, kernel_initializer='glorot_normal'))
self.model.add(keras.layers.BatchNormalization(epsilon = 1e-5))
self.model.add(keras.layers.Activation('elu'))
self.model.add(keras.layers.Flatten())
self.model.add(keras.layers.Dense(512))
self.model.add(keras.layers.Activation('elu'))
self.model.add(keras.layers.Dense(action_size))
self.model.compile(loss="mse", optimizer=keras.optimizers.RMSprop(lr=self.learning_rate))
print(self.model.summary())
# farther below...
Qs = DQNetwork.predict(states_mb)
Qs_next_state = DQNetwork.predict(next_states_mb)
# Set Q_target = r if the episode ends at s+1, otherwise set Q_target = r + gamma*maxQ(s', a')
for i in range(0, len(batch)):
terminal = dones_mb[i]
t = np.copy(Qs[i])
a = np.argmax(actions_mb[i])
# If we are in a terminal state, only equals reward
if terminal:
t[a] = rewards_mb[i]
else:
t[a] = rewards_mb[i] + gamma * np.max(Qs_next_state[i])
target_Qs_batch.append(t)
dbg_target_Qs_batch.append(t[a])
targets_mb = np.array([each for each in target_Qs_batch])
loss = DQNetwork.train_on_batch(states_mb, targets_mb)
Everything else is the same. I have even tried a custom loss function to minimize the differences between the two versions, and it simply does not work! While the original code converges quickly, my Keras version does not seem to learn at all.
Does anyone have a clue? Any hints or help would be highly appreciated...
A little further explanation:
This is a simple DQN playing Doom. With the TF code, after about 100 episodes (games) the model seems to be able to shoot the target without a problem in every episode. Loss goes down and rewards per game go up, as one would expect. With the Keras model, however, the loss graph is flat and the reward graph is flat; it almost seems unable to learn anything (see the graphs linked below).
Here is how it works. In the TF code, the model outputs a tensor [a, b, c], where a, b and c are the predicted Q-values of the actions the main character might take (i.e. [left, right, shoot]). The model is given a reward for every action, so it is passed a target value (targets_mb, e.g. 10) along with the action the target is for (one-hot encoded in actions_mb, e.g. [0,1,0] if the target is for moving right). The loss is then a simple MSE over the difference between the target and the model's predicted value for the given action.
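The target and loss computation described above can be sketched in plain Python as follows. This is only an illustration of the arithmetic in the TF code, with hypothetical function names and made-up numbers, not a replacement for the network code.

```python
def q_targets(rewards, dones, next_qs, gamma):
    """Bellman targets: r if the episode ended, else r + gamma * max Q(s', a')."""
    return [
        r if done else r + gamma * max(q_next)
        for r, done, q_next in zip(rewards, dones, next_qs)
    ]

def masked_mse(outputs, actions_one_hot, targets):
    """Mean over the batch of (target - Q(s, a))**2 for the action taken."""
    errors = []
    for q_values, mask, target in zip(outputs, actions_one_hot, targets):
        # The one-hot mask selects the Q-value of the action that was taken.
        q_taken = sum(q * m for q, m in zip(q_values, mask))
        errors.append((target - q_taken) ** 2)
    return sum(errors) / len(errors)

outputs = [[3.0, 4.0, 5.0], [1.0, 0.0, 2.0]]   # network outputs for two states
actions = [[0, 1, 0], [0, 0, 1]]               # one-hot actions taken
targets = q_targets(
    rewards=[1.0, -1.0],
    dones=[False, True],
    next_qs=[[2.0, 18.0, 0.0], [0.0, 0.0, 0.0]],
    gamma=0.5,
)
print(targets)                                # [10.0, -1.0]
print(masked_mse(outputs, actions, targets))  # 22.5
```

The first sample contributes (10 - 4)^2 = 36 and the second (-1 - 2)^2 = 9, so the batch mean is 22.5. Only the output of the taken action receives a gradient.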
I have done two things:
1) I tried to use the standard "mse" loss, as I have seen in other models of this type. To make the loss behave the same way, I pass the model its own prediction with only the entry for the taken action replaced by the target value. So if the model predicts [3,4,5] and the target is 10 for [0,1,0], we pass [3,10,5] as the truth to the model. This should be equivalent to what the TF model does, i.e. the difference between 10 and 4, squared, and then the mean over all differences in the batch.
2) When 1) did not work, I tried to make a custom loss function that attempts to mimic the behaviour of the TF model as closely as possible. So if the model predicts [3,4,5] and the target is 10 for [0,1,0] (as above), we pass [0,10,0] as the truth to the model. The custom loss function then, through some finicky multiplication and division, arrives at the difference between 10 and 4, squares it, and takes the mean of all squared errors, as below:
def custom_loss(y_true, y_pred):
    isolated_truths = tf.reduce_sum(y_true, axis=1)
    isolated_predictions = tf.divide(tf.reduce_sum(tf.multiply(y_true, y_pred), axis=1), isolated_truths)
    delta = isolated_predictions - isolated_truths
    return tf.reduce_mean(tf.square(delta))
# when training, this small modification is made to targets:
loss = DQN_Keras.train_on_batch(states_mb, targets_mb.reshape(len(targets_mb),1) * actions_mb)
And it still does not work (although you can see on the graphs that the loss seems to behave far more reasonably!).
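For comparison, here is a plain-Python sketch of what the two Keras variants compute for one sample. It rests on two facts: the built-in "mse" loss in Keras averages over the output axis, and `tf.divide` follows IEEE rules where 0 / 0 is NaN. Python raises on division by zero, so the hypothetical helper `ieee_divide` mimics the TensorFlow behaviour; the other names are made up for the example as well.

```python
import math

def keras_style_mse(y_true, y_pred):
    """Per-sample mean over all outputs, as the built-in "mse" loss does."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def ieee_divide(a, b):
    """Float division that returns nan or inf instead of raising."""
    if b == 0.0:
        return math.nan if a == 0.0 else math.copysign(math.inf, a)
    return a / b

def custom_loss_sample(y_true, y_pred):
    """Squared error of one sample, following custom_loss in the post."""
    truth = sum(y_true)
    selected = sum(t * p for t, p in zip(y_true, y_pred))
    prediction = ieee_divide(selected, truth)
    return (prediction - truth) ** 2

pred = [3.0, 4.0, 5.0]

# Variant 1: the error (10 - 4)**2 = 36 is divided by the 3 outputs.
print(keras_style_mse([3.0, 10.0, 5.0], pred))     # 12.0

# Variant 2: recovers the full squared error for a non-zero target.
print(custom_loss_sample([0.0, 10.0, 0.0], pred))  # 36.0

# Variant 2 with a target of exactly 0: 0 / 0 gives nan.
print(custom_loss_sample([0.0, 0.0, 0.0], pred))
```

So variant 1 differs from the TF loss by a constant factor equal to the number of actions, and variant 2 yields NaN whenever a target happens to be exactly 0. Neither observation is confirmed here as the reason the Keras model fails to learn.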
Take a look at the graphs:
tf model: https://pasteboard.co/IN1b5MN.png
keras model with mse loss: https://pasteboard.co/IN1kH6P.png
keras model with custom loss: https://pasteboard.co/IN17ktg.png
edit #2 - runnable code
Original TF code - copy pasted from tutorial above, working:
=> https://pastebin.com/QLb7nWZi
My code with custom loss in full:
=> https://pastebin.com/3HiYg6t7
Well, I have made it work by removing the BatchNormalization layers. Now I am completely mystified: does batch normalization work differently in Keras and TensorFlow? Or is the missing clue the mysterious "training=True" parameter in the TF code (which has no counterpart in my Keras code)?
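A possible explanation, offered as a hedged sketch and not as something verified against the linked code: with `training=True`, `tf.layers.batch_normalization` always normalises with the statistics of the current batch, including when the next-state Q-values are computed. A Keras `BatchNormalization` layer uses batch statistics in `train_on_batch` but its moving averages in `predict`, and those start at mean 0 and variance 1 and move slowly with the default momentum of 0.99. The plain-Python example below shows how far apart the two modes can be early in training; all names are hypothetical.

```python
import math

def batch_stats(values):
    """Mean and (biased) variance of one channel over the batch."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

def normalise(values, mean, var, eps=1e-5):
    """Batch-norm normalisation step without the learned scale and offset."""
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def update_moving(moving, batch_value, momentum=0.99):
    """Exponential moving average used for the inference statistics."""
    return momentum * moving + (1.0 - momentum) * batch_value

activations = [48.0, 50.0, 52.0, 50.0]   # one channel, one batch
mean, var = batch_stats(activations)      # 50.0 and 2.0

# Training mode: use the statistics of this batch.
train_out = normalise(activations, mean, var)

# Inference mode after a single update of the moving statistics.
moving_mean = update_moving(0.0, mean)    # about 0.5
moving_var = update_moving(1.0, var)      # about 1.01
infer_out = normalise(activations, moving_mean, moving_var)

print([round(v, 2) for v in train_out])
print([round(v, 2) for v in infer_out])
```

In training mode the outputs are centred around 0, while in inference mode the same activations come out in the high 40s. If that is what happens here, the Q-values returned by `predict` and the ones trained on in `train_on_batch` come from effectively different networks until the moving averages catch up. Keras layers do accept a `training` argument when called, which is one way to test this.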
PS.
While digging into the issue, I also found this very useful article describing how to create advanced Keras models with several inputs like masks (like in the original TF code!):
https://becominghuman.ai/lets-build-an-atari-ai-part-1-dqn-df57e8ff3b26
Hi, I need to change the first convolution of a model from rgb/resnet_v1_50/conv1/weights:0 (float32_ref 7x7x3x64) to rgb/resnet_v1_50/conv1/weights:0 (float32_ref 7x7x4x64), so basically increasing the number of input channels from 3 to 4 to accept 4-channel images, while keeping the pretrained weights everywhere else (only the additional channel is initialized randomly).
Do you have an idea of how to do that in TensorFlow 1.x (I'm more of a PyTorch guy...)?
In PyTorch I do:
net = model.resnet50(num_classes=dataset_train.num_classes(),pretrained=True)
new_conv1 = nn.Conv2d(4, 64, kernel_size=7, stride=2,padding=3,bias=False)
conv1 = net.conv1
with torch.no_grad():
    new_conv1.weight[:, :3, :, :]= conv1.weight
    new_conv1.bias = conv1.bias
net.conv1 = new_conv1
Here is how the model is created in tensorflow:
def single_stream(self, images, modality, is_training, reuse=False):
with tf.variable_scope(modality, reuse=reuse):
with slim.arg_scope(resnet_v1.resnet_arg_scope()):
_, end_points = resnet_v1.resnet_v1_50(
images, self.no_classes, is_training=is_training, reuse=reuse)
# last bottleneck before logits
net = end_points[modality + '/resnet_v1_50/block4']
if 'autoencoder' in self.mode:
return net
with tf.variable_scope(modality + '/resnet_v1_50', reuse=reuse):
bottleneck = slim.conv2d(net, self.hidden_repr_size, [
7, 7], padding='VALID', activation_fn=tf.nn.relu, scope='f_repr')
net = slim.conv2d(bottleneck, self.no_classes, [
1, 1], activation_fn=None, scope='_logits_')
if ('train_hallucination' in self.mode or 'test_disc' in self.mode or 'train_eccv' in self.mode):
return net, bottleneck
return net
In build_model, I am able to change the 3 to a 4 with self.images = tf.placeholder(tf.float32, [None, 224, 224, 4], modality + '_images'), which gives rgb/resnet_v1_50/conv1/weights:0 (float32_ref 7x7x4x64) [12544, bytes: 50176], but the problem is now with the checkpoint!
Thanks a lot for your help!
As you do with PyTorch, you can do the same in Keras, which is now a module of TF2.
Here is one possible way to do so:
net_conv1 = model.layers[2] # first 2D convolutional layer, from model.layers, or model.summary()
# your new set of weights must have the same shapes as the layer's existing weights
print( 'weights shape: ', numpy.shape(net_conv1.weights) )
print( net_conv1.weights[0].shape )
print( net_conv1.weights[1].shape )
# New weights
osh_0 = net_conv1.weights[0].shape.as_list()
osh_1 = net_conv1.weights[1].shape.as_list()
print(osh_0, osh_1)
new_conv1_w_0 = numpy.random.rand( *osh_0 )
new_conv1_w_1 = numpy.random.rand( *osh_1 )
# update the weights
net_conv1.set_weights([new_conv1_w_0, new_conv1_w_1])
# check the result
net_conv1.get_weights()
# update the model
model.layers[2] = net_conv1
Check the layers section of Keras doc.
Hope it will be helpful
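To make the channel-expansion step from the question concrete, here is a minimal sketch in plain Python on nested lists. It assumes the TensorFlow kernel layout [height, width, in_channels, out_channels] (the 7x7x3x64 shape in the question) and uses a tiny stand-in kernel; `expand_input_channels` and its parameters are hypothetical names, and the random initialisation is only one possible choice.

```python
import random

def expand_input_channels(kernel, new_in_channels, stddev=0.01, seed=0):
    """Copy a [h][w][in][out] kernel and append randomly initialised
    input channels until it has new_in_channels of them."""
    rng = random.Random(seed)
    expanded = []
    for row in kernel:
        new_row = []
        for cell in row:  # cell has layout [in][out]
            out_channels = len(cell[0])
            extra = [
                [rng.gauss(0.0, stddev) for _ in range(out_channels)]
                for _ in range(new_in_channels - len(cell))
            ]
            new_row.append([list(weights) for weights in cell] + extra)
        expanded.append(new_row)
    return expanded

# Tiny stand-in for the pretrained kernel: 2x2 spatial, 3 inputs, 2 outputs.
pretrained = [
    [[[float(i + o) for o in range(2)] for i in range(3)] for _ in range(2)]
    for _ in range(2)
]
expanded = expand_input_channels(pretrained, new_in_channels=4)

shape = (len(expanded), len(expanded[0]),
         len(expanded[0][0]), len(expanded[0][0][0]))
print(shape)                                   # (2, 2, 4, 2)
print(expanded[0][0][:3] == pretrained[0][0])  # True
```

With NumPy arrays the same step is a concatenation along axis 2. For the TF 1.x checkpoint, one possible route is to restore every variable except `conv1/weights` with a `tf.train.Saver` built on a reduced `var_list`, read the old 7x7x3x64 kernel from the checkpoint, and assign the expanded array to the new variable.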
I am trying to create a simple 3D U-net for image segmentation, just to learn how to use the layers. Therefore I do a 3D convolution with stride 2 and then a transposed convolution to get back to the same image size. I am also overfitting to a small set (the test set) just to see if my network is learning.
I created the same net in Keras and it works just fine. Now I want to create it in TensorFlow, but I have been having trouble with it.
The cost changes slightly, but no matter what I do (reduce the learning rate, add more epochs, add more layers, change the batch size...) the output is always the same. I believe the net is not updating the weights. I am sure I am doing something wrong, but I cannot find what it is. Any help would be greatly appreciated.
Here is my code:
def forward_propagation(X):
if ( mode == 'train'): print(" --------- Net --------- ")
# Convolutional Layer 1
with tf.variable_scope('CONV1'):
Z1 = tf.layers.conv3d(X, filters = 16, kernel =[3,3,3], strides = [ 2, 2, 2], padding='SAME', name = 'S2/conv3d')
A1 = tf.nn.relu(Z1, name = 'S2/ReLU')
if ( mode == 'train'): print("Convolutional Layer 1 S2 " + str(A1.get_shape()))
# DEConvolutional Layer 1
with tf.variable_scope('DeCONV1'):
output_deconv1 = tf.stack([X.get_shape()[0] , X.get_shape()[1], X.get_shape()[2], X.get_shape()[3], 1])
dZ1 = tf.nn.conv3d_transpose(A1, filters = 1, kernel =[3,3,3], strides = [2, 2, 2], padding='SAME', name = 'S2/conv3d_transpose')
dA1 = tf.nn.relu(dZ1, name = 'S2/ReLU')
if ( mode == 'train'): print("Deconvolutional Layer 1 S1 " + str(dA1.get_shape()))
return dA1
def compute_cost(output, target, method = 'dice_hard_coe'):
with tf.variable_scope('COST'):
if (method == 'sigmoid_cross_entropy') :
# Make them vectors
output = tf.reshape( output, [-1, output.get_shape().as_list()[0]] )
target = tf.reshape( target, [-1, target.get_shape().as_list()[0]] )
loss = tf.nn.sigmoid_cross_entropy_with_logits(logits = output, labels = target)
cost = tf.reduce_mean(loss)
return cost
and the main function for the model:
def model(X_h5, Y_h5, learning_rate = 0.009,
num_epochs = 100, minibatch_size = 64, print_cost = True):
ops.reset_default_graph() # to be able to rerun the model without overwriting tf variables
#tf.set_random_seed(1) # to keep results consistent (tensorflow seed)
#seed = 3 # to keep results consistent (numpy seed)
(m, n_D, n_H, n_W, num_channels) = X_h5["test_data"].shape #TTT
num_labels = Y_h5["test_mask"].shape[4] #TTT
img_size = Y_h5["test_mask"].shape[1] #TTT
costs = [] # To keep track of the cost
accuracies = [] # To keep track of the accuracy
# Create Placeholders of the correct shape
X, Y = create_placeholders(n_H, n_W, n_D, minibatch_size)
# Forward propagation: Build the forward propagation in the tensorflow graph
nn_output = forward_propagation(X)
prediction = tf.nn.sigmoid(nn_output)
# Cost function: Add cost function to tensorflow graph
cost_method = 'sigmoid_cross_entropy'
cost = compute_cost(nn_output, Y, cost_method)
# Backpropagation: Define the tensorflow optimizer. Use an AdamOptimizer that minimizes the cost.
optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost)
# Initialize all the variables globally
init = tf.global_variables_initializer()
# Start the session to compute the tensorflow graph
with tf.Session() as sess:
print('------ Training ------')
# Run the initialization
tf.local_variables_initializer().run(session=sess)
sess.run(init)
# Do the training loop
for i in range(num_epochs*m):
# ----- TRAIN -------
current_epoch = i//m
patient_start = i-(current_epoch * m)
patient_end = patient_start + minibatch_size
current_X_train = np.zeros((minibatch_size, n_D, n_H, n_W,num_channels))
current_X_train[:,:,:,:,:] = np.array(X_h5["test_data"][patient_start:patient_end,:,:,:,:]) #TTT
current_X_train = np.nan_to_num(current_X_train) # make nan zero
current_Y_train = np.zeros((minibatch_size, n_D, n_H, n_W, num_labels))
current_Y_train[:,:,:,:,:] = np.array(Y_h5["test_mask"][patient_start:patient_end,:,:,:,:]) #TTT
current_Y_train = np.nan_to_num(current_Y_train) # make nan zero
feed_dict = {X: current_X_train, Y: current_Y_train}
_ , temp_cost = sess.run([optimizer, cost], feed_dict=feed_dict)
# ----- TEST -------
# Print the cost every 1/5 epoch
if ((i % (num_epochs*m/5) )== 0):
# Calculate the predictions
test_predictions = np.zeros(Y_h5["test_mask"].shape)
for j in range(0, X_h5["test_data"].shape[0], minibatch_size):
patient_start = j
patient_end = patient_start + minibatch_size
current_X_test = np.zeros((minibatch_size, n_D, n_H, n_W, num_channels))
current_X_test[:,:,:,:,:] = np.array(X_h5["test_data"][patient_start:patient_end,:,:,:,:])
current_X_test = np.nan_to_num(current_X_test) # make nan zero
current_Y_test = np.zeros((minibatch_size, n_D, n_H, n_W, num_labels))
current_Y_test[:,:,:,:,:] = np.array(Y_h5["test_mask"][patient_start:patient_end,:,:,:,:])
current_Y_test = np.nan_to_num(current_Y_test) # make nan zero
feed_dict = {X: current_X_test, Y: current_Y_test}
_, current_prediction = sess.run([cost, prediction], feed_dict=feed_dict)
test_predictions[j:j + minibatch_size,:,:,:,:] = current_prediction
costs.append(temp_cost)
print ("[" + str(current_epoch) + "|" + str(num_epochs) + "] " + "Cost : " + str(costs[-1]))
display_progress(X_h5["test_data"], Y_h5["test_mask"], test_predictions, 5, n_H, n_W)
# plot the cost
plt.plot(np.squeeze(costs))
plt.ylabel('cost')
plt.xlabel('epochs')
plt.show()
return
I call the model with:
model(hdf5_data_file, hdf5_mask_file, num_epochs = 500, minibatch_size = 1, learning_rate = 1e-3)
These are the results that I am currently getting:
Edit:
I have tried reducing the learning rate and it doesn't help. I also tried using tensorboard debug and the weights are not being updated:
I am not sure why this is happening.
I Created the same simple model in keras and it works fine. I am not sure what I am doing wrong in tensorflow.
Not sure if you are still looking for help, as I am answering this question half a year after you posted it. :) I've listed my observations and some suggestions for you to try below. If my primary observation is right, then you probably just need a coffee break or a good night's sleep.
primary observation:
tf.reshape( output, [-1, output.get_shape().as_list()[0]] ) seems wrong. If you prefer to flatten the vector, it should be something like tf.reshape(output,[-1,np.prod(image_shape_list)]).
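To see what the two reshape calls do to the shapes, here is a small plain-Python sketch. The output shape is a hypothetical example (a minibatch of 1 and a 64x64x64 volume with one channel), and `reshaped` is a made-up helper that resolves the -1 entry the same way `tf.reshape` does.

```python
from math import prod

def reshaped(shape, new_shape):
    """Resolve a single -1 entry in new_shape from the total element count."""
    total = prod(shape)
    known = prod(d for d in new_shape if d != -1)
    return tuple(total // known if d == -1 else d for d in new_shape)

output_shape = (1, 64, 64, 64, 1)  # hypothetical: minibatch_size=1

# The reshape from the question: every voxel becomes its own row.
print(reshaped(output_shape, [-1, output_shape[0]]))         # (262144, 1)

# The suggested reshape: one row per sample, one column per voxel.
print(reshaped(output_shape, [-1, prod(output_shape[1:])]))  # (1, 262144)
```

One caveat: `tf.nn.sigmoid_cross_entropy_with_logits` is elementwise, so when it is followed by `tf.reduce_mean` and output and target are reshaped in the same way, both layouts give the same cost. It may therefore be worth confirming that this is the actual cause before ruling out the other observations.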
other observations:
With such a shallow network, I doubt the network has enough spatial resolution to differentiate tumor voxels from non-tumor voxels. Can you show the Keras implementation and its performance compared to the pure TF implementation? I would probably go with 2+ layers. Let's say with 3 layers, with a stride of 2 per layer, and an input image width of 256, you will end up with a width of 32 at your deepest encoder layer. (If you have limited GPU memory, downsample the input image.)
If changing the loss computation does not work, then, as @bremen_matt mentioned, reduce the LR to maybe 1e-5.
After the basic architecture tweaks, once you "feel" that the network is sort of learning and not stuck, try augmenting the training data, adding dropout and batch norm during training, and then maybe fancy up your loss by adding a discriminator.
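The width arithmetic from the observation about network depth can be checked with a few lines of plain Python. This sketch assumes 'SAME' padding, where each strided layer divides the width by the stride and rounds up; `encoder_widths` is a hypothetical helper name.

```python
import math

def encoder_widths(input_width, layers, stride=2):
    """Feature-map width after each strided layer with 'SAME' padding."""
    widths = []
    width = input_width
    for _ in range(layers):
        width = math.ceil(width / stride)
        widths.append(width)
    return widths

print(encoder_widths(256, layers=3))  # [128, 64, 32]
print(encoder_widths(256, layers=1))  # [128], the single layer in the question
```

Each extra stride-2 layer halves the width, so three layers take a 256-wide input down to 32, as stated above.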