pycaffe get gradients/weights/biases - python

So I have initialized a caffe.Net object with
network = caffe.Net('path/to/lenet.prototxt', caffe.TEST)
and I want to get activation, weights, biases, gradients for each layer with parameters. My current approach is to do a step(100) to go through 100 iterations and then look at every layer:
for layer_name in network._layer_names:
    if layer_name in network.params:
        x = layer_name
        output = np.array(network.blobs[x].data)
        weight = np.array(network.params[x][0].data)
        bias = np.array(network.params[x][1].data)
This should give me the activations, weights and biases of each layer, which I then save. I have no idea how to get the gradients, though.
Is this approach for weights/biases/activation the right one?

Use .diff instead of .data. From the Caffe documentation:
Implementation Details
As we are often interested in the values as well as the gradients of the blob, a Blob stores two chunks of memories, data and diff. The former is the normal data that we pass along, and the latter is the gradient computed by the network.
Note that these will be initialized to zero, unless you have run some training steps.
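A minimal sketch of how the .data/.diff pairs could be collected per layer. It assumes only that net.params maps each layer name to a sequence of blobs (weights first, bias second) exposing .data and .diff, as in the question; collect_params_and_grads and the stand-in fake_net are hypothetical names, used here so the snippet runs without Caffe installed. With a real net the .diff values are only meaningful after a backward pass or some solver steps.

```python
from types import SimpleNamespace


def collect_params_and_grads(net):
    """Return {layer_name: {"weights", "weight_grads", "biases", "bias_grads"}}.

    Relies only on net.params[name][i].data / .diff, so it works with any
    object shaped like a pycaffe Net. Bias entries are None for layers
    that have a single parameter blob.
    """
    collected = {}
    for name, blobs in net.params.items():
        entry = {
            "weights": blobs[0].data.copy(),
            "weight_grads": blobs[0].diff.copy(),
            "biases": None,
            "bias_grads": None,
        }
        if len(blobs) > 1:
            entry["biases"] = blobs[1].data.copy()
            entry["bias_grads"] = blobs[1].diff.copy()
        collected[name] = entry
    return collected


# Stand-in for a caffe.Net (made-up values) so the sketch runs on its own.
fake_net = SimpleNamespace(params={
    "conv1": [
        SimpleNamespace(data=[0.5, -0.2], diff=[0.01, 0.03]),  # weights
        SimpleNamespace(data=[0.1], diff=[0.0]),               # bias
    ],
    "scale": [SimpleNamespace(data=[1.0], diff=[0.2])],        # no bias blob
})

stats = collect_params_and_grads(fake_net)
print(stats["conv1"]["weight_grads"])
```

The same pattern would apply to activations through net.blobs[name].data and net.blobs[name].diff, keeping in mind that blob names do not always match layer names.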


Load weights for last layer (output layer) to a new model from trained network

Is it possible to load the weights into the last layer of my new model from a trained network using the set_weights and get_weights scheme?
The point is, I saved the weights of each layer as a .mat file (after training) to do some calculations in MATLAB, and I want just the modified weights of the last layer to be loaded into the last layer of my new model, while the other layers get the same weights as the trained model. It is a bit tricky, since the saved format is .mat.
weights1 = lstm_model1.layers[0].get_weights()[0]
biases1 = lstm_model1.layers[0].get_weights()[1]
weights2 = lstm_model1.layers[2].get_weights()[0]
biases2 = lstm_model1.layers[2].get_weights()[1]
weights3 = lstm_model1.layers[4].get_weights()[0]
biases3 = lstm_model1.layers[4].get_weights()[1]
# Save the weights and biases for adaptation algorithm
savemat("weights1.mat", mdict={'weights1': weights1})
savemat("biases1.mat", mdict={'biases1': biases1})
savemat("weights2.mat", mdict={'weights2': weights2})
savemat("biases2.mat", mdict={'biases2': biases2})
savemat("weights3.mat", mdict={'weights3': weights3})
savemat("biases3.mat", mdict={'biases3': biases3})
How can I load just the old weights of the other layers into the new model (without the last layer) and the modified weights of the last layer into the last layer of the new one?
If it was saved in the .h5 file format, this works. However, I'm not sure about .mat:
Put simply, you just have to call get_weights on the desired layer and, similarly, set_weights on the corresponding layer of the other model:
last_layer_weights = old_model.layers[-1].get_weights()
new_model.layers[-1].set_weights(last_layer_weights)
For a more complete code sample, here you go:
# Create an arbitrary model with some weights, for example
model = Sequential(layers=[
    Dense(70, input_shape=(100,)),
    # ...
])
# Save the weights of the model
# Later, load in the model (we only really need the layer in question)
old_model = Sequential(layers=[
    Dense(70, input_shape=(100,)),
    # ...
])
# Create a new model with slightly different architecture (except for the layer in question, at least)
new_model = Sequential(layers=[
    Dense(80, input_shape=(100,)),
    # ...
])
# Set the weights of the final layer of the new model to the weights of the final layer of the old model, but leaving other layers unchanged.
new_model.layers[-1].set_weights(old_model.layers[-1].get_weights())
# Assert that the weights of the final layer are the same, but the others are not.
print(np.all(new_model.layers[-1].get_weights()[0] == old_model.layers[-1].get_weights()[0]))
>> True
print(np.all(new_model.layers[-2].get_weights()[0] == old_model.layers[-2].get_weights()[0]))
>> False
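For the .mat case from the question, a rough sketch of one way it might be done with scipy.io.loadmat. The main pitfall is that savemat stores 1-D arrays (such as a bias vector) as 2-D arrays of shape (1, n) by default, so the loaded arrays have to be reshaped to what the layer expects before calling set_weights. StandInLayer and set_layer_weights_from_mat are hypothetical names; the stand-in only mimics get_weights/set_weights so the example runs without Keras, and it assumes a layer with exactly a kernel and a bias (an LSTM layer has more arrays).

```python
import io

import numpy as np
from scipy.io import loadmat, savemat


class StandInLayer:
    """Minimal stand-in for a Keras layer (hypothetical, for illustration)."""

    def __init__(self, kernel, bias):
        self._weights = [np.asarray(kernel), np.asarray(bias)]

    def get_weights(self):
        return [w.copy() for w in self._weights]

    def set_weights(self, weights):
        for current, new in zip(self._weights, weights):
            if current.shape != new.shape:
                raise ValueError(f"shape mismatch: {current.shape} vs {new.shape}")
        self._weights = [np.asarray(w) for w in weights]


def set_layer_weights_from_mat(layer, kernel_file, kernel_key, bias_file, bias_key):
    """Load kernel and bias from .mat files and assign them to `layer`."""
    reference = layer.get_weights()
    loaded = [loadmat(kernel_file)[kernel_key], loadmat(bias_file)[bias_key]]
    # savemat stores 1-D arrays as 2-D (1, n) by default, so reshape each array
    # back to the shape (and dtype) the layer expects before assigning it.
    restored = [
        np.asarray(arr).reshape(ref.shape).astype(ref.dtype)
        for arr, ref in zip(loaded, reference)
    ]
    layer.set_weights(restored)


# Round trip through in-memory .mat files (stand-ins for weights3.mat / biases3.mat).
kernel_buf, bias_buf = io.BytesIO(), io.BytesIO()
savemat(kernel_buf, {"weights3": np.ones((3, 2), dtype=np.float32) * 2})
savemat(bias_buf, {"biases3": np.array([0.5, -0.5], dtype=np.float32)})
kernel_buf.seek(0)
bias_buf.seek(0)

last_layer = StandInLayer(np.zeros((3, 2), dtype=np.float32),
                          np.zeros(2, dtype=np.float32))
set_layer_weights_from_mat(last_layer, kernel_buf, "weights3", bias_buf, "biases3")
new_kernel, new_bias = last_layer.get_weights()
print(new_kernel.shape, new_bias.shape)
```

With real models, the same function would presumably be called with new_model.layers[-1] and the paths of the modified .mat files, while the remaining layers are copied with set_weights(old_layer.get_weights()).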

Adding softmax layer to LSTM network "freezes" output

I've been trying to teach myself the basics of RNNs with a personal project in PyTorch. I want to produce a simple network that is able to predict the next character in a sequence (idea mainly from this article but I wanted to do most of the stuff myself).
My idea is this: I take a batch of B input sequences of size n (np array of n integers), one-hot encode them and pass them through my network composed of several LSTM layers, one fully connected layer and one softmax unit.
I then compare the output to the target sequences which are the input sequences shifted one step ahead.
My issue is that when I include the softmax layer, the output is the same every single epoch for every single batch. When I don't include it, the network seems to learn appropriately. I can't figure out what's wrong.
My implementation is the following :
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, one_hot_length, dropout_prob, num_units, num_layers):
        super().__init__()
        self.num_units = num_units  # used in forward_pass
        self.LSTM = nn.LSTM(one_hot_length, num_units, num_layers, batch_first=True, dropout=dropout_prob)
        self.dropout = nn.Dropout(dropout_prob)
        self.fully_connected = nn.Linear(num_units, one_hot_length)
        self.softmax = nn.Softmax(dim=1)
        # dim = 1 as the tensor is of shape (batch_size*seq_length, one_hot_length) when entering the softmax unit

    def forward_pass(self, input_seq, hc_states):
        output, hc_states = self.LSTM(input_seq, hc_states)
        output = output.view(-1, self.num_units)
        output = self.fully_connected(output)
        # I simply comment out the next line when I run the network without the softmax layer
        output = self.softmax(output)
        return output, hc_states
one_hot_length is the size of my character dictionary (~200, also the size of a one-hot encoded vector).
num_units is the number of hidden units in a LSTM cell, num_layers the number of LSTM layers in the network.
The inside of the training loop (simplified) goes as follows :
input, target = next_batches(data, batch_pointer)
input = nn.functional.one_hot(input, num_classes=one_hot_length).float()
for state in hc_states:
    ...
output, states = net.forward_pass(input, hc_states)
loss = nn.CrossEntropyLoss()(output, target.reshape(-1))
nn.utils.clip_grad_norm_(net.parameters(), MaxGradNorm)
Here hc_states is a tuple with the hidden states tensor and the cell states tensor, input is a tensor of size (B, n, one_hot_length), and target is (B, n).
I'm training on a really small dataset (sentences in a .txt of ~400 KB) just to tune my code, and did 4 different runs with different parameters. Each time the outcome was the same: the network doesn't learn at all when it has the softmax layer, and trains somewhat appropriately without it.
I don't think it is an issue with tensors shapes as I'm almost sure I checked everything.
My understanding of my problem is that I'm trying to do classification, and that the usual is to put a softmax unit at the end to get "probabilities" of each character to appear, but clearly this isn't right.
Any ideas to help me ?
I'm also fairly new to Pytorch and RNN so I apologize in advance if my architecture/implementation is some kind of monstrosity to a knowledgeable person. Feel free to correct me and thanks in advance.

Diverging losses in PPO + ICM using LSTM

I have tried to implement Proximal Policy Optimization with Intrinsic Curiosity Rewards for a stateful LSTM neural network.
The losses in both PPO and ICM are diverging, and I would like to find out whether it is a bug in the code or badly selected hyperparameters.
Code (where some wrong implementation could be):
In the ICM model I also use an LSTM as the first layer to match the input dimensions.
In ICM the whole dataset is propagated at once, with zeros as the initial hidden state (the resulting tensors are different than they would be if I propagated only one state or batch at a time and re-used the hidden cells).
In the PPO advantage and discounted-reward processing, the dataset is propagated one state at a time and the hidden cells are re-used (the exact opposite of ICM, because here the same model is used for selecting actions and this approach is "real-time-like").
In PPO training, the model is trained on batches with re-use of the hidden cells.
I used existing code as the starting point and reworked it to run on my environment and use LSTM.
I can provide code samples later if specifically requested; there is a large number of lines.
def __init_curiosity(self):
    curiosity_factory=ICM.factory(MlpICMModel.factory(), policy_weight=1,
                                  reward_scale=0.1, weight=0.2,
    self.curiosity = curiosity_factory.create(self.state_converter,
                                              self.action_converter), torch.float32)
    self.reward_normalizer = StandardNormalizer()

def __init_PPO_trainer(self):
    self.PPO_trainer = PPO(agent = self,
                           reward = GeneralizedRewardEstimation(gamma=0.99, lam=0.95),
                           advantage = GeneralizedAdvantageEstimation(gamma=0.99, lam=0.95),
                           learning_rate = 1e-3,
                           clip_range = 0.3,
                           v_clip_range = 0.3,
                           c_entropy = 1e-2,
                           c_value = 0.5,
                           n_mini_batches = 32,
                           n_optimization_epochs = 10,
                           clip_grad_norm = 0.5), torch.float32)
Training graphs:
(Notice large numbers on y axis)
For now I have reworked the LSTM processing to use batches and hidden memory in all places (for both the main model and ICM), but the problem is still present. I have traced it to the output of the ICM model, where the output diverges mainly in the action_hat tensor.
Found the problem... In the main model I use softmax for eval runs and log_softmax for training in the output layer, and according to the PyTorch docs CrossEntropyLoss uses log_softmax internally, so as advised I used NLLLoss. But I also used it for the computation of the ICM model loss, which does not have a softmax function in its output layer! Switching back to CrossEntropyLoss (which was originally in the reference code) solved the ICM loss divergence.
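The relationship between the two losses can be checked numerically. The following is a plain-Python sketch, not PyTorch itself: the helper functions are simplified single-sample re-implementations written for illustration, and the logits are made up. It shows that cross-entropy on raw scores equals NLL on log_softmax outputs, that NLL fed raw scores is just the negated score (so it can be pushed down without bound, consistent with a diverging loss), and that applying softmax before a cross-entropy loss flattens the scores and keeps the loss from approaching zero.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits):
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def nll_loss(log_probs, target):
    """Negative log-likelihood: expects log-probabilities as input."""
    return -log_probs[target]


def cross_entropy(logits, target):
    """Cross-entropy on raw scores: log_softmax followed by NLL."""
    return nll_loss(log_softmax(logits), target)


logits = [4.0, 1.0, -2.0]  # raw scores from a layer without softmax
target = 0

correct = cross_entropy(logits, target)                  # raw scores into cross-entropy
also_correct = nll_loss(log_softmax(logits), target)     # log_softmax into NLL
wrong_nll = nll_loss(logits, target)                     # NLL fed raw scores: -4.0
double_softmax = cross_entropy(softmax(logits), target)  # softmax applied twice

print(correct, also_correct, wrong_nll, double_softmax)
```

In this toy case the doubly-softmaxed loss cannot drop below about 0.55 for three classes, however confident the network is, which is one plausible reason a network with an explicit softmax in front of CrossEntropyLoss appears not to learn.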

Adding Dropout to testing/inference phase

I've trained the following model for some timeseries in Keras:
input_layer = Input(batch_shape=(56, 3864))
first_layer = Dense(24, input_dim=28, activation='relu')(input_layer)
first_layer = Dropout(0.3)(first_layer)
second_layer = Dense(12, activation='relu')(first_layer)
second_layer = Dropout(0.3)(second_layer)
out = Dense(56)(second_layer)
model_1 = Model(input_layer, out)
Then I defined a new model with the trained layers of model_1 and added dropout layers with a different rate, drp, to it:
input_2 = Input(batch_shape=(56, 3864))
first_dense_layer = model_1.layers[1](input_2)
first_dropout_layer = model_1.layers[2](first_dense_layer)
new_dropout = Dropout(drp)(first_dropout_layer)
snd_dense_layer = model_1.layers[3](new_dropout)
snd_dropout_layer = model_1.layers[4](snd_dense_layer)
new_dropout_2 = Dropout(drp)(snd_dropout_layer)
output = model_1.layers[5](new_dropout_2)
model_2 = Model(input_2, output)
Then I'm getting the prediction results of these two models as follow:
result_1 = model_1.predict(test_data, batch_size=56)
result_2 = model_2.predict(test_data, batch_size=56)
I was expecting to get completely different results because the second model has new dropout layers and these two models are different (IMO), but that's not the case. Both generate the same result. Why is that happening?
As I mentioned in the comments, the Dropout layer is turned off in inference phase (i.e. test mode), so when you use model.predict() the Dropout layers are not active. However, if you would like to have a model that uses Dropout both in training and inference phase, you can pass training argument when calling it, as suggested by François Chollet:
# ...
new_dropout = Dropout(drp)(first_dropout_layer, training=True)
# ...
Alternatively, if you have already trained your model and now want to use it in inference mode and keep the Dropout layers (and possibly other layers which have different behavior in training/inference phase such as BatchNormalization) active, you can define a backend function that takes the model's inputs as well as Keras learning phase:
from keras import backend as K
func = K.function(model.inputs + [K.learning_phase()], model.outputs)
# to use it pass 1 to set the learning phase to training mode
outputs = func([input_arrays] + [1.])
Your question has a simple solution in recent versions of TensorFlow: you can set the training argument of the model's call method to True, for example model(inputs, training=True).
With training=True, TensorFlow keeps the Dropout layers active even in inference mode.
As there are already some working code solutions above, I will simply add a few more details regarding dropout during inference to prevent confusion.
Based on the original paper, Dropout layers randomly turn off neurons (setting their activations to zero) during training to reduce overfitting. However, once training is finished and we start testing the model, no neurons are dropped, so all units contribute to the prediction. As a result, the input each downstream unit receives is larger in expectation than it was during training. To compensate, a scaling factor is applied: if a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at prediction time.
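The scaling rule can be illustrated with a small simulation. This is a sketch in plain Python with made-up activations; dropout_train and dropout_inference are hypothetical helper names, and scaling the activations by keep_prob stands in for scaling the outgoing weights, which gives the same downstream input.

```python
import random


def dropout_train(activations, keep_prob, rng):
    """Classic dropout: zero each activation with probability 1 - keep_prob."""
    return [a if rng.random() < keep_prob else 0.0 for a in activations]


def dropout_inference(activations, keep_prob):
    """At test time keep every unit, but scale its output by keep_prob."""
    return [a * keep_prob for a in activations]


rng = random.Random(0)
activations = [2.0, 4.0, 6.0]
keep_prob = 0.7

# Average many training-time masks: the mean approaches keep_prob * activation.
n_trials = 20000
totals = [0.0] * len(activations)
for _ in range(n_trials):
    dropped = dropout_train(activations, keep_prob, rng)
    totals = [t + d for t, d in zip(totals, dropped)]
train_mean = [t / n_trials for t in totals]

test_output = dropout_inference(activations, keep_prob)
print(train_mean, test_output)
```

Note that Keras and TensorFlow implement the "inverted" variant instead: the kept activations are scaled up by 1 / (1 - rate) during training, where rate is the drop probability, so nothing needs to be rescaled at inference time.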

Tensorflow: Different activation values for same image

I'm trying to retrain (read finetune) a MobileNet image Classifier.
The script for retraining given by tensorflow here (from the tutorial), updates only the weights of the newly added fully connected layer. I modified this script to update weights of all the layers of the pre-trained model. I'm using MobileNet architecture with depth multiplier of 0.25 and input size of 128.
However, while retraining I observed a strange thing: if I give a particular image as input for inference in a batch with some other images, the activation values after some layers are different from those obtained when the image is passed alone. The activation values for the same image also differ between batches. Example, for two batches:
batch_1 : [img1, img2, img3]; batch_2 : [img1, img4, img5]. The activations for img1 differ between the two batches.
Here is the code I use for inference -
with tf.Session(graph=tf.get_default_graph()) as sess:
    graph = sess.graph
    image_path = '/tmp/images/10dsf00003.jpg'
    id_ = gfile.FastGFile(image_path, 'rb').read()
    # The line below loads the jpeg using tf.decode_jpeg and does some preprocessing
    # (decoded_image_tensor is an assumed name for the preprocessing output tensor)
    id = sess.run(decoded_image_tensor, {jpeg_data_tensor: id_})
    input_image_tensor = graph.get_tensor_by_name('input:0')
    layerXname = 'MobilenetV1/MobilenetV1/Conv2d_1_depthwise/Relu:0'  # Name of the layer whose activations to inspect.
    layerX = graph.get_tensor_by_name(layerXname)
    layerxactivations = sess.run(layerX, {input_image_tensor: id})
The above code is executed once as it is and once with the following change in the last line:
    layerxactivations_batch = sess.run(layerX, {input_image_tensor: np.asarray([np.squeeze(id), np.squeeze(id), np.squeeze(id)])})
Following are some nodes in the graph :
[u'input', u'MobilenetV1/Conv2d_0/weights', u'MobilenetV1/Conv2d_0/weights/read', u'MobilenetV1/MobilenetV1/Conv2d_0/convolution', u'MobilenetV1/Conv2d_0/BatchNorm/beta', u'MobilenetV1/Conv2d_0/BatchNorm/beta/read', u'MobilenetV1/Conv2d_0/BatchNorm/gamma', u'MobilenetV1/Conv2d_0/BatchNorm/gamma/read', u'MobilenetV1/Conv2d_0/BatchNorm/moving_mean', u'MobilenetV1/Conv2d_0/BatchNorm/moving_mean/read', u'MobilenetV1/Conv2d_0/BatchNorm/moving_variance', u'MobilenetV1/Conv2d_0/BatchNorm/moving_variance/read', u'MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/batchnorm/add/y', u'MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/batchnorm/add', u'MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/batchnorm/Rsqrt', u'MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/batchnorm/mul', u'MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/batchnorm/mul_1', u'MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/batchnorm/mul_2', u'MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/batchnorm/sub', u'MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/batchnorm/add_1', u'MobilenetV1/MobilenetV1/Conv2d_0/Relu6', u'MobilenetV1/Conv2d_1_depthwise/depthwise_weights', u'MobilenetV1/Conv2d_1_depthwise/depthwise_weights/read', ... ...]
Now when layerXname = 'MobilenetV1/MobilenetV1/Conv2d_0/convolution', the activations are the same in both of the cases above (i.e. layerxactivations and layerxactivations_batch[0] are the same).
But after this layer, all layers have different activation values. I suspect that the batchNorm operations after the 'MobilenetV1/MobilenetV1/Conv2d_0/convolution' layer behave differently for batch inputs and a single image. Or is the issue caused by something else?
Any help/pointers would be appreciated.
When you build the MobileNet there is a parameter called is_training. If you don't set it to False, the dropout and batch normalization layers will give you different results in different iterations. Batch normalization will probably change the values only slightly, but dropout will change them a lot, as it drops some input values.
Take a look at the signature of mobilenet_v1:
def mobilenet_v1(inputs,
                 ...):
  """Mobilenet v1 model for classification.

  Args:
    inputs: a tensor of shape [batch_size, height, width, channels].
    num_classes: number of predicted classes.
    dropout_keep_prob: the percentage of activation values that are retained.
    is_training: whether is training or not.
    min_depth: Minimum depth value (number of channels) for all convolution ops.
      Enforced when depth_multiplier < 1, and not an active constraint when
      depth_multiplier >= 1.
    depth_multiplier: Float multiplier for the depth (number of channels)
      for all convolution ops. The value must be greater than zero. Typical
      usage will be to set this value in (0, 1) to reduce the number of
      parameters or computation cost of the model.
    conv_defs: A list of ConvDef namedtuples specifying the net architecture.
    prediction_fn: a function to get predictions out of logits.
    spatial_squeeze: if True, logits is of shape is [B, C], if false logits is
      of shape [B, 1, 1, C], where B is batch_size and C is number of classes.
    reuse: whether or not the network and its variables should be reused. To be
      able to reuse 'scope' must be given.
    scope: Optional variable_scope.

  Returns:
    logits: the pre-softmax activations, a tensor of size
      [batch_size, num_classes]
    end_points: a dictionary from components of the network to the corresponding
      activation.

  Raises:
    ValueError: Input rank is invalid.
  """
This is due to batch normalisation.
How are you running inference? Are you loading the model from checkpoint files, or are you using a frozen protobuf model? If you use a frozen model, you can expect similar results for different formats of inputs.
Check this out. A similar issue for a different application is raised here.
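The batch dependence described in this thread can be reproduced in a few lines of plain Python. This is a simplified sketch, not the MobileNet code: each image is reduced to a single scalar activation, the learned scale and offset (gamma, beta) are left out, and all values, including the moving averages, are made up for illustration.

```python
import math

EPS = 1e-3


def batch_norm_training(batch):
    """Normalise with the statistics of the current batch (training mode)."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / math.sqrt(var + EPS) for x in batch]


def batch_norm_inference(batch, moving_mean, moving_var):
    """Normalise with fixed moving averages (inference mode)."""
    return [(x - moving_mean) / math.sqrt(moving_var + EPS) for x in batch]


# One scalar "activation" per image.
img1, img2, img3, img4, img5 = 1.0, 2.0, 3.0, 1.5, 100.0
batch_1 = [img1, img2, img3]
batch_2 = [img1, img4, img5]

train_1 = batch_norm_training(batch_1)[0]
train_2 = batch_norm_training(batch_2)[0]
infer_1 = batch_norm_inference(batch_1, moving_mean=5.0, moving_var=4.0)[0]
infer_2 = batch_norm_inference(batch_2, moving_mean=5.0, moving_var=4.0)[0]

print(train_1, train_2)  # differ: img1 is normalised against different batches
print(infer_1, infer_2)  # identical: the statistics no longer depend on the batch
```

The convolution before the first batch-norm layer has no such dependence, which matches the observation that activations only start to differ after 'MobilenetV1/MobilenetV1/Conv2d_0/convolution'.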

