I've trained the following model for some timeseries in Keras:
input_layer = Input(batch_shape=(56, 3864))
first_layer = Dense(24, input_dim=28, activation='relu',
activity_regularizer=None,
kernel_regularizer=None)(input_layer)
first_layer = Dropout(0.3)(first_layer)
second_layer = Dense(12, activation='relu')(first_layer)
second_layer = Dropout(0.3)(second_layer)
out = Dense(56)(second_layer)
model_1 = Model(input_layer, out)
Then I defined a new model with the trained layers of model_1 and added dropout layers with a different rate, drp, to it:
input_2 = Input(batch_shape=(56, 3864))
first_dense_layer = model_1.layers[1](input_2)
first_dropout_layer = model_1.layers[2](first_dense_layer)
new_dropout = Dropout(drp)(first_dropout_layer)
snd_dense_layer = model_1.layers[3](new_dropout)
snd_dropout_layer = model_1.layers[4](snd_dense_layer)
new_dropout_2 = Dropout(drp)(snd_dropout_layer)
output = model_1.layers[5](new_dropout_2)
model_2 = Model(input_2, output)
Then I'm getting the prediction results of these two models as follow:
result_1 = model_1.predict(test_data, batch_size=56)
result_2 = model_2.predict(test_data, batch_size=56)
I was expecting to get completely different results because the second model has new dropout layers and theses two models are different (IMO), but that's not the case. Both are generating the same result. Why is that happening?
As I mentioned in the comments, the Dropout layer is turned off in inference phase (i.e. test mode), so when you use model.predict() the Dropout layers are not active. However, if you would like to have a model that uses Dropout both in training and inference phase, you can pass training argument when calling it, as suggested by François Chollet:
# ...
new_dropout = Dropout(drp)(first_dropout_layer, training=True)
# ...
Alternatively, If you have already trained your model and now want to use it in inference mode and keep the Dropout layers (and possibly other layers which have different behavior in training/inference phase such as BatchNormalization) active, you can define a backend function that takes the model's inputs as well as Keras learning phase:
from keras import backend as K
func = K.function(model.inputs + [K.learning_phase()], model.outputs)
# to use it pass 1 to set the learning phase to training mode
outputs = func([input_arrays] + [1.])
your question has a simple solution in the latest version of Tensorflow. you can set the training argument of the call method to true.
you can run a code like the below code:
model(input,training=True)
by using training=True TensorFlow automatically applies the Dropout layer in inference mode.
As there are already some working code solutions above, I will simply add a few more details regarding dropout during inference to prevent confusion.
Based on the original paper, Dropout layers play the role of turning off (setting gradients to zero) the neuron nodes during training to reduce overfitting. However, once we finish off with training and start testing the model, we do not 'touch' any neurons, thus, all the units are considered to make the decision when inferencing. This causes previously 'dead' neuron weights to be large than expected due to the usage of Dropout. To prevent this, a scaling factor is applied to balance the network node. To be more precise, if a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p during the prediction stage.
Related
In the book "Machine Learning with scikit-learn and Tensorflow" there's a code fragment I can't wrap my head around. Until that chapter, their models were only explicitly using layers - be it in a sequential fashion, or functional. But in the chapter 16, there's this:
import tensorflow_addons as tfa
encoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
decoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
sequence_lengths = keras.layers.Input(shape=[], dtype=np.int32)
embeddings = keras.layers.Embedding(vocab_size, embed_size)
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)
encoder = keras.layers.LSTM(512, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_embeddings)
encoder_state = [state_h, state_c]
sampler = tfa.seq2seq.sampler.TrainingSampler()
decoder_cell = keras.layers.LSTMCell(512)
output_layer = keras.layers.Dense(vocab_size)
decoder = tfa.seq2seq.basic_decoder.BasicDecoder(decoder_cell, sampler,
output_layer=output_layer)
final_outputs, final_state, final_sequence_lengths = decoder(
decoder_embeddings, initial_state=encoder_state,
sequence_length=sequence_lengths)
Y_proba = tf.nn.softmax(final_outputs.rnn_output)
model = keras.models.Model(
inputs=[encoder_inputs, decoder_inputs, sequence_lengths],
outputs=[Y_proba])
And then he just runs the model in a standard way:
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
X = np.random.randint(100, size=10*1000).reshape(1000, 10)
Y = np.random.randint(100, size=15*1000).reshape(1000, 15)
X_decoder = np.c_[np.zeros((1000, 1)), Y[:, :-1]]
seq_lengths = np.full([1000], 15)
history = model.fit([X, X_decoder, seq_lengths], Y, epochs=2)
I have trouble understanding the code starting at line 7. The author is creating an Embedding layer which he immediately calls on encoder_inputs and decoder_inputs, then he does basically the same with the LSTM layer that he calls on the previously created encoder_embeddings and tensors returned by this operation are used in the code slightly below. What I don't get here is how are those tensors trained? It looks like he's not using the layers creating them in the model, but if so, then how come the embeddings are learned and the whole model converges?
To understand this overall flow, you must understand how things are made under the hood. Tensorflow uses graph execution when making the model. When you have passed [encoder_inputs, decoder_inputs, sequence_lengths] as an inputs and [Y_proba] as a output. The model doesn't immediately start the training, first it builds the model. So, what builds means here, the thing is it makes a computational graph first and then stores this computational graph. model.compile() does this for you, it makes a computational graph for you.
Let me explain it further let's suppose I wanna compute a + b = c and b + d = 2 and finally c * d = 6 using a computational graph, then first Tensorflow will make 3 nodes for it, how will it look like? see in the picture below.
As you are seeing in the picture above the same exact thing is done by TensorFlow when you pass your inputs and outputs. The picture above is the depiction of forward pass. Now the same graph would be used by Tensorflow to do the backward pass. See the figure below.
Now, first, the computational graph is made and then the same computational graph is used to compute the forward pass and backward pass.
The graph above computes the computational graph of your complete model. But how? in your case specifically. The model will ask how Y_prob comes here. The graph consists of the operations and tensors created between the inputs and outputs. The Embedding layer is created and applied to the inputs encoder_inputs and decoder_inputs to obtain encoder_embeddings and decoder_embeddings, respectively. The LSTM layer is applied to encoder_embeddings to produce encoder_outputs, state_h, and state_c. These tensors are then passed as inputs to the BasicDecoder layer, which combines the decoder_embeddings, encoder_state (constructed from state_h and state_c), and sequence_lengths to produce final_outputs, final_state, and final_sequence_lengths. Finally, the softmax function is applied to the rnn_output of final_outputs to produce the final output Y_proba.
All the entities which are mentioned in the paragraph above in quotes would be your intermediate nodes in a computational graph.
So, it will start with the inputs and bring it down to the Y-Prob. During graph computation, the weights of the model and other parameters are also initiated. The graph is made once, which is then easy to compute the forward pass and backward pass.
How do these layers are trained and optimized for convergence?
when you specify inputs=[encoder_inputs, decoder_inputs, sequence_lengths] and outputs=[Y_proba], the model knows the intermediate layers that are used to compute Y_proba from encoder_inputs, decoder_inputs and sequence_lengths. These intermediate layers are the Embedding and LSTM layers, as well as the TrainingSampler, LSTMCell, Dense and BasicDecoder layers. These layers are automatically included in the computation graph of the model, allowing the optimizer to update the parameters of these layers during training.
This is an example of the Keras Functional API. In this style of defining a model, you write the blueprints first, and then use it later. Think of it like wiring a circuit: while you're connecting things, there's no electricity flowing through them (the electricity corresponds to data in our metaphor). Later, you turn on the power source, and electricity flows through.
This is how the Functional API works as well. First, let's read the last line:
model = keras.models.Model(
inputs=[encoder_inputs, decoder_inputs, sequence_lengths],
outputs=[Y_proba])
This says "Hey Keras, I need a model whose inputs are encoder_inputs, decoder_inputs, and sequence_lengths, and they will eventually produce Y_proba. The details of how they will produce this output is defined above. Let's look at the specific lines you're having trouble with:
embeddings = keras.layers.Embedding(vocab_size, embed_size)
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)
The first of these says, "Keras, give me a layer that will produce embeddings". embeddings is a layer object. The next two lines are the wiring that we talked about: you can connect layers preemptively before data flows through them: that's the crux of the Functional API. So the second layer says, "Keras, encoder_inputs, which is an Input, will connect to (and go through) the Embedding layer I just created, and the result of that will be in a variable I call encoder_embeddings.
The rest of the code follows the same logic: you're connecting the wires together, before you compile your model and eventually fit it with the data.
I'm using tensorflow and keras 2.8.0 version.
I have the following network:
#defining model
model=Sequential()
#adding convolution layer
model.add(Conv2D(256,(3,3),activation='relu',input_shape=(256,256,3)))
#adding pooling layer
model.add(MaxPool2D(2,2))
#adding fully connected layer
model.add(Flatten())
model.add(Dense(100,activation='relu'))
#adding output layer
model.add(Dense(len(classes),activation='softmax'))
#compiling the model
model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
#fitting the model
model.fit(x_tr,y_tr,epochs=epochs, )
# Alla 12-esima epoca, va a converge a 1
# batch size è 125 credo, non so il motivo
#evaluting the model
loss_value, accuracy = model.evaluate(x_te, y_te)
#loss_value, accuracy, top_k_accuracy = model.evaluate(x_te, y_te, batch_size=batch_size)
print("loss_value: " + str(loss_value))
print("acuracy: " + str(accuracy))
#predict first 4 images in the test set
ypred = model.predict(x_te)
The point is that now i'm trying to save the model in ".h5" format but if i train it for 100 epochs or for 1 epochs i will get a 4.61Gb file model.
Why the size of this file is that big?
How can i reduce this model size ?
General reason: The size of your h5 file is based only on the number of parameters your model has.
After constructing the model add the line model.summary() and look at the number of parameters the model has in general.
Steps to reduce model size: You have a LOT of filters in your conv layer. Since I don't know what you want to achieve with your model, I would still advise you to seperate the number of filters to different conv layers and add Pooling layers in between. The will scale down the image and will especially reduce the number of parameters for the Flatten layer.
More information on Pooling layers can be found here.
What I find out, after 5 months of experience, is that the steps to do in order to reduce the model size, improve the accuracy score and reduce the loss value are the following:
categorize the labels and then change the loss function of the model
normalize data in values in the range [-1,1]
use the dense layer increase the parameters and then the dimension of the model: is not even helpful sometimes. Having more parameters doesn't mean have more accuracy. In order to find a solution you have to do several try changing the network, using different activation function and optimizer such as SGD or Adam.
Choose good parameters for learning_rate, decay_rate, decay_values and so on. These parameters give you a better or worse result.
Use batch_size = 32 or 64
Use function that load the dataset step by step and not all in one time in RAM because it makes the process slower and is not even needed: if you are using keras then you can use tf.data.Dataset.from_tensor_slices((x, y)).batch(32 , drop_remainder=True) of course it should be done for train,test,validation
Hope that it helps
I am building a hurricane track predictor using satellite data. I have a multiple to many output in a multilayer LSTM model, with input and output arrays following the structure [samples[time[features]]]. I have as features of inputs and outputs the coordinates of the hurricane, WS, and other dimensions.
The problem is that the error reduction, and as a consequence, the model predicts always a constant. After reading several posts, I standardized the data, removed some unnecessary layers, but still, the model always predicts the same output.
I think the model is big enough, activation functions make sense, given that the outputs are all within [-1;1].
So my questions are : What am I doing wrong ?
The model is the following:
class Stacked_LSTM():
def __init__(self, training_inputs, training_outputs, n_steps_in, n_steps_out, n_features_in, n_features_out, metrics, optimizer, epochs):
self.training_inputs = training_inputs
self.training_outputs = training_outputs
self.epochs = epochs
self.n_steps_in = n_steps_in
self.n_steps_out = n_steps_out
self.n_features_in = n_features_in
self.n_features_out = n_features_out
self.metrics = metrics
self.optimizer = optimizer
self.stop = EarlyStopping(monitor='loss', min_delta=0.000000000001, patience=30)
self.model = Sequential()
self.model.add(LSTM(360, activation='tanh', return_sequences=True, input_shape=(self.n_steps_in, self.n_features_in,))) #, kernel_regularizer=regularizers.l2(0.001), not a good idea
self.model.add(layers.Dropout(0.1))
self.model.add(LSTM(360, activation='tanh'))
self.model.add(layers.Dropout(0.1))
self.model.add(Dense(self.n_features_out*self.n_steps_out))
self.model.add(Reshape((self.n_steps_out, self.n_features_out)))
self.model.compile(optimizer=self.optimizer, loss='mae', metrics=[metrics])
def fit(self):
return self.model.fit(self.training_inputs, self.training_outputs, callbacks=[self.stop], epochs=self.epochs)
def predict(self, input):
return self.model.predict(input)
Notes
1) In this particular problem, the time series data is not "continuous", because one time serie belongs to a particular hurricane. I have therefore adapted the training and test samples of the time series to each hurricane. The implication of this is that I cannot use the function stateful=True in my layers because it would then mean that the model doesn't makes any difference between the different hurricanes (if my understanding is correct).
2) No image data, so no convolutionnal model needed.
Few suggestions, based on my experience:
4 layers of LSTM is too much. Stick to two, maximum three.
Don't use relu as activations for LSTMs.
Do not use BatchNormalization for time-series.
Other than these, I'd also suggest removing the dense layers between two LSTM layers.
I have tried to implement Proximal Policy Optimization with Intrinsic Curiosity Rewards for statefull LSTM neural network.
Losses in both PPO and ICM are diverging and I would like to find out if its bug in code or badly selected hyperparameters.
Code (where some wrong implementation could be):
In ICM model I use first layer LSTM too to match input dimensions.
In ICM whole dataset is propagated at once, with zeros as initial hidden(resultin tensors are different, than they would be if I propagated only 1 state or batch and re-use hidden cells)
In PPO advantage and discount reward processing the dataset is propagated one by one and hidden cells are re-used (exact opposite than in ICM because here it uses same model for selecting actions and this approach is "real-time-like")
In PPO training model is trained on batches with re-use of hidden cells
I have used https://github.com/adik993/ppo-pytorch as default code and reworked it to run on my environment and use LSTM
I may provide code samples later if specifically requested due to large amount of rows
Hyperparameters:
def __init_curiosity(self):
curiosity_factory=ICM.factory(MlpICMModel.factory(), policy_weight=1,
reward_scale=0.1, weight=0.2,
intrinsic_reward_integration=0.01,
reporter=self.reporter)
self.curiosity = curiosity_factory.create(self.state_converter,
self.action_converter)
self.curiosity.to(self.device, torch.float32)
self.reward_normalizer = StandardNormalizer()
def __init_PPO_trainer(self):
self.PPO_trainer = PPO(agent = self,
reward = GeneralizedRewardEstimation(gamma=0.99, lam=0.95),
advantage = GeneralizedAdvantageEstimation(gamma=0.99, lam=0.95),
learning_rate = 1e-3,
clip_range = 0.3,
v_clip_range = 0.3,
c_entropy = 1e-2,
c_value = 0.5,
n_mini_batches = 32,
n_optimization_epochs = 10,
clip_grad_norm = 0.5)
self.PPO_trainer.to(self.device, torch.float32)
Training graphs:
(Notice large numbers on y axis)
UPDATE
For now I have reworked LSTM processing to use batches and hidden memory on all places (for both main model and ICM), but the problem is still present. I have traced it to output from ICM's model, here the output diverges mainly in action_hat tensor.
Found the problem... In main model I use softmax for eval runs and log_softmax for training in output layer and according to PyTorch docs the CrossEntropyLoss uses log_softmax inside, so as advised I used NLLLoss but forthe computation of ICM model loss which does not have softmax fnc in output layer! So switching back to CrossEntropyLoss (which was originaly in reference code) solved ICM loss divergence.
I am trying to fine-tune a model using keras, according to this description: https://keras.io/applications/#inceptionv3
However, during training I discovered that the output of the network does not remain constant after training when using the same input (while all relevant layers were frozen), which I do not want.
I constructed the following toy example to investigate this:
import keras.applications.resnet50 as resnet50
from keras.layers import Dense, Flatten, Input
from keras.models import Model
from keras.utils import to_categorical
from keras import optimizers
from keras.preprocessing.image import ImageDataGenerator
import numpy as np
# data
i = np.random.rand(1,224,224,3)
X = np.random.rand(32,224,224,3)
y = to_categorical(np.random.randint(751, size=32), num_classes=751)
# model
base_model = resnet50.ResNet50(weights='imagenet', include_top=False, input_tensor=Input(shape=(224,224,3)))
layer = base_model.output
layer = Flatten(name='myflatten')(layer)
layer = Dense(751, activation='softmax', name='fc751')(layer)
model = Model(inputs=base_model.input, outputs=layer)
# freeze all layers
for layer in model.layers:
layer.trainable = False
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
# features and predictions before training
feat0 = base_model.predict(i)
pred0 = model.predict(i)
weights0 = model.layers[-1].get_weights()
# before training output is consistent
feat00 = base_model.predict(i)
pred00 = model.predict(i)
print(np.allclose(feat0, feat00)) # True
print(np.allclose(pred0, pred00)) # True
# train
model.fit(X, y, batch_size=2, epochs=3, shuffle=False)
# features and predictions after training
feat1 = base_model.predict(i)
pred1 = model.predict(i)
weights1 = model.layers[-1].get_weights()
# these are not the same
print(np.allclose(feat0, feat1)) # False
# Optionally: printing shows they are in fact very different
# print(feat0)
# print(feat1)
# these are not the same
print(np.allclose(pred0, pred1)) # False
# Optionally: printing shows they are in fact very different
# print(pred0)
# print(pred1)
# these are the same and loss does not change during training
# so layers were actually frozen
print(np.allclose(weights0[0], weights1[0])) # True
# Check again if all layers were in fact untrainable
for layer in model.layers:
assert layer.trainable == False # All succeed
# Being overly cautious also checking base_model
for layer in base_model.layers:
assert layer.trainable == False # All succeed
Since I froze all layers i fully expect both the predictions and both the features to be equal, but surprisingly they aren't.
So probably I am making some kind of mistake, but I can't figure what.. Any suggestions would be greatly appreciated!
So the problem seems to be that the model uses batch normalization layers, which do update their internal state (i.e. their weights) based on the seen data during training. This even happens when their trainable flag have been set to False. And as their weights are thus updated, the output also changes. You can check this by using the code in the question and changing the following codelines:
This weights0 = model.layers[-1].get_weights()
to weights0 = model.layers[2].get_weights()
and this weights1 = model.layers[-1].get_weights()
to weights1 = model.layers[2].get_weights()
or the index of any other batch normalization layer.
Because then the following assertion will no longer hold:
print(np.allclose(weights0, weights1)) # Now this is False
As far as I am aware there is currently no solution for this yet..
See also my issue on Keras' Github page.
One more reason for unstable training could be since you are using a very small batch size, i.e., batch_size=2. At least, use batch_size=32. This value is too small for the batch normalization to compute reliably the estimation of the training distribution statistics (mean and variance). These mean and variance values are then used to normalize first the distribution and followed by learning of beta and gamma parameters (actual distribution).
Check the following links for more details:
In the introduction and related works, the authors criticized BatchNorm and do check figure 1: https://arxiv.org/pdf/1803.08494.pdf
Nice article on "Curse of Batch Norm": https://towardsdatascience.com/curse-of-batch-normalization-8e6dd20bc304