Caffe LeNet: Difference between `solver.step(1)` and `solver.net.forward()`


I was checking the Caffe LeNet Tutorial here and a question came to mind:
What's the difference between these two lines of code:
self.solver.step(1)
and
self.solver.net.forward() # train net
They both seem to train the network, at least according to the comment.
Personally, I think the first one trains the network on the training data and updates the weights of both net and test_net, while the second one seems only to forward a batch of data through the network, using the weights learned in the previous step.
If what I think is right, then what is the purpose of the second line in the tutorial? Why does the code call net.forward()? Can't solver.step(1) do this itself?
Thanks for your time

step does one full iteration, covering all three phases: forward evaluation, backward propagation, and update. The call to forward does only the first of these. There are also differences in the signature (parameter list).
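To make the distinction concrete, here is a minimal sketch in plain Python. ToyNet and ToySolver are hypothetical stand-ins, not Caffe classes, and unlike Caffe (where the data layer supplies the batch) the toy methods take the sample as arguments. It only illustrates that a forward call leaves the weights untouched, while a step runs forward, backward, and update.

```python
class ToyNet:
    """Hypothetical stand-in for a network: one weight, squared-error loss."""

    def __init__(self, weight=0.0):
        self.weight = weight
        self.grad = 0.0

    def forward(self, x, y):
        # Forward evaluation only: compute the loss, change no weights.
        self.x, self.y = x, y
        self.pred = self.weight * x
        return 0.5 * (self.pred - y) ** 2

    def backward(self):
        # Backward propagation: gradient of the loss w.r.t. the weight.
        self.grad = (self.pred - self.y) * self.x


class ToySolver:
    """Hypothetical stand-in for a solver that owns a net and a learning rate."""

    def __init__(self, net, lr=0.1):
        self.net = net
        self.lr = lr

    def step(self, n, x, y):
        # Each iteration covers all three phases: forward, backward, update.
        loss = None
        for _ in range(n):
            loss = self.net.forward(x, y)
            self.net.backward()
            self.net.weight -= self.lr * self.net.grad
        return loss


solver = ToySolver(ToyNet())
w0 = solver.net.weight
solver.net.forward(2.0, 4.0)          # forward only
w_after_forward = solver.net.weight   # same as w0
solver.step(1, 2.0, 4.0)              # forward + backward + update
w_after_step = solver.net.weight      # has moved away from w0
```

In the toy, the weight only changes inside step, which is the behavior the answer describes for the real solver.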

I noticed a strange difference in behavior between solver.step(1) and solver.net.forward(). I use a custom layer as the network's input, and my layer instance needs a variable to be set before it is used:
solver.net.layers[0].mySet(variable)
That value is stored in a variable local to my layer. But when I call solver.step, the variable does not appear. It does, however, when I use solver.net.forward(). I am not certain, but maybe solver.step instantiates a new variable for the layer.


How to structure and size Y-labels for multivariate sequence prediction using Keras LSTMs

I am working on a sequence prediction problem where my inputs are of size (numOfSamples, numOfTimeSteps, features) where each sample is independent, number of time steps is uniform for each sample (after pre-padding the length with 0's using keras.pad_sequences), and my number of features is 2. To summarize my question(s), I am wondering how to structure my Y-label dataset to feed the model and want to gain some insight on how to properly structure my model to output what I want.
My first feature is a categorical variable encoded to a unique int and my second is numerical. I want to be able to predict the next categorical variable as well as an associated feature2 value, and then use this to feed back into the network to predict a sequence until the EOS category is output.
This is a main source I've been referencing to try and understand how to create a generator for use with keras.fit_generator.
[1]
There is no confusion about how the mini-batch for the "X" data is grabbed, but for the "Y" data I am not sure about the proper format for what I am trying to do. Since I am trying to predict a category, I figured a one-hot vector representation of the t+1 timestep would be the proper way to encode the first feature, which I guess results in a 4-dimensional (?) numpy array. But I'm kind of lost on how to deal with the second, numerical feature.
Now, this leads me to questions concerning architecture and how to structure a model to do what I am wanting. Does the following architecture make sense? I believe there is something missing that I am not understanding.
Proposed architecture (parameters loosely filled in, nothing set yet):
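from keras.models import Sequential
from keras.layers import Masking, LSTM, TimeDistributed, Dense, Activation

# timesteps, features, hidden_size and vocab_size are placeholders, not set yet
model = Sequential()
model.add(Masking(mask_value=0., input_shape=(timesteps, features)))
model.add(LSTM(hidden_size, return_sequences=True))
model.add(TimeDistributed(Dense(vocab_size)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])
model.fit_generator(...)  # I'll figure this out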
So, at the end, a softmax activation can predict the next categorical value for feature1. How do I also output a value for feature2 so that I can feed the new prediction for both features back as the next time-step? Do I need some sort of parallel architecture with two LSTMs that are combined somehow?
This is my first attempt at doing anything with neural networks or Keras, and I would not say I'm "great" at Python, though I can get by. I feel I have a decent grasp of the fundamental theoretical concepts, but lack the practice.
This question is sorta open ended, with encouragement to pick apart my current strategy.
Once again, the overall goal is to predict both features (categorical, numeric) in order to predict "full sequences" from intermediate length sequences.
Ex. I train on these padded max-len sequences, but in production I want to use this to predict the remaining part of the currently unseen time-steps, which would be variable length.
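The feed-back loop described above (predict the next time step, append it, repeat until EOS) can be sketched as follows. This is only an illustration of the control flow: generate, predict_next, and toy_predict_next are hypothetical names, and the toy predictor stands in for a trained model that returns a (category, value) pair.

```python
def generate(prefix, predict_next, eos_category, max_steps=50):
    """Extend a partial sequence one step at a time until EOS is predicted."""
    sequence = list(prefix)
    for _ in range(max_steps):
        # The model sees everything produced so far, including its own outputs.
        category, value = predict_next(sequence)
        sequence.append((category, value))
        if category == eos_category:
            break
    return sequence


def toy_predict_next(sequence):
    # Hypothetical stand-in for a trained model: counts the category upward
    # and halves the numeric feature.
    last_category, last_value = sequence[-1]
    return last_category + 1, last_value / 2.0


# Start from one observed time step and run until category 4 (treated as EOS).
completed = generate([(1, 8.0)], toy_predict_next, eos_category=4)
```

The max_steps cap is a design choice to keep the loop from running forever if EOS is never predicted.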

Okay, so if I understand you properly (correct me if I'm wrong), you would like to predict the next features based on the current ones.
When it comes to categorical variables, you are on point: your Dense layer should output a vector of N-1 values containing the probability of each class (while we are at it, if you by any chance use pandas.get_dummies, remember to specify the argument drop_first=True; a similar approach should be employed with whatever you are using for one-hot encoding).
Besides that N-1 output vector for each sample, it should output one more number for the numerical value.
Remember to output logits (no activation; don't use softmax at the end like you currently do). Afterwards, the network output should be split, and the N-1 part (your categorical feature) passed to a loss function able to handle logits (e.g. in TensorFlow it is tf.nn.softmax_cross_entropy_with_logits_v2, which applies a numerically stable softmax for you).
Now, the N-th element of the network output should be passed to a different loss, probably Mean Squared Error.
Based on the values of those two losses (you could take the mean of both to obtain one loss value), you backpropagate through the network, and it might do just fine.
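As a rough sketch of the two-loss idea for a single time step, written in plain Python rather than Keras or TensorFlow: the leading entries of the output are treated as class logits and the last entry as the numeric prediction. The function names and the num_logits split point are assumptions made for the example, and averaging the two losses is just the simple option mentioned above.

```python
import math


def log_softmax(logits):
    # Numerically stable log-softmax: subtract the max before exponentiating.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def combined_loss(output, target_class, target_value, num_logits):
    """Split a raw output vector and average a cross-entropy and an MSE term."""
    logits = output[:num_logits]          # categorical part (no activation)
    predicted_value = output[num_logits]  # numeric part
    cross_entropy = -log_softmax(logits)[target_class]
    squared_error = (predicted_value - target_value) ** 2
    total = 0.5 * (cross_entropy + squared_error)
    return total, cross_entropy, squared_error


# Three equal logits plus one numeric prediction of 1.5 against a target of 1.0.
total, ce, se = combined_loss([0.0, 0.0, 0.0, 1.5],
                              target_class=1, target_value=1.0, num_logits=3)
```

In practice the two terms can have very different scales, so a weighted sum instead of a plain mean may be worth trying.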
Unfortunately I'm not skilled enough in Keras in order to help you with the code, but I think you will figure it out yourself. While we're at it, I would like to suggest PyTorch for more custom neural networks (I think yours fits this description), though it's definitely doable in Keras as well, your choice.
Additional 'maybe helpful' thought: you may check Teacher Forcing for your kind of task. More on the topic and theory behind it can be found in the outstanding Deep Learning Book and code example (though in PyTorch once again), can be found in their docs here.
BTW interesting idea, mind if I use it in connection with my current research trajectory (with kudos going to you of course)? Comment on this answer if so we can talk it out in chat.

Basically every answer I was looking for was demonstrated and explained in this tutorial. It is an absolutely great resource for trying to understand how to model multi-output networks, and it gives a lengthy walkthrough of a multi-output CNN architecture. It only took me about three weeks to stumble upon, however.
https://www.pyimagesearch.com/2018/06/04/keras-multiple-outputs-and-multiple-losses/

Potential bug in tensorflow CNN tutorial code using ExponentialMovingAverage?

I'm studying the code of the TensorFlow convolutional neural network tutorial, which trains a CNN on the CIFAR-10 dataset. The source code is here on GitHub and the tutorial is in the documentation.
My question is specifically about the use of ExponentialMovingAverage (doc here) in cifar10.py, lines 375-378, which are:
with tf.control_dependencies([apply_gradient_op, variables_averages_op]):
    train_op = tf.no_op(name='train')
return train_op
Here, variables_averages_op is an operation that updates all the shadow variables, and apply_gradient_op is an operation that applies the computed gradients to all the original variables (that is, it updates the original variables, a.k.a. the model weights).
Since control_dependencies doesn't guarantee the execution order of the ops passed to it, the order of apply_gradient_op and variables_averages_op is arbitrary in this example. So when train_op runs, we could either update the original variables first and then the corresponding shadow variables, or update the shadow variables before the original ones. The latter seems unreasonable to me.
According to the official doc of ExponentialMovingAverage(link above), the update of the shadow variable relies on the original variable:
shadow_variable = decay * shadow_variable + (1 - decay) * variable
The update of the original variable should be before the update of the shadow ones, which is not the case in the tutorial code.
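To see what the ordering changes numerically, here is a small plain-Python sketch of one training step under both orders. It uses the shadow-variable formula quoted above; train_step and its arguments are hypothetical names for this illustration and are not TensorFlow API.

```python
def ema_update(shadow, variable, decay):
    # The formula from the ExponentialMovingAverage documentation.
    return decay * shadow + (1 - decay) * variable


def train_step(variable, shadow, gradient, lr, decay, average_first):
    """Run one gradient update and one EMA update in the requested order."""
    if average_first:
        # Shadow is computed from the variable as it was BEFORE this step.
        shadow = ema_update(shadow, variable, decay)
        variable = variable - lr * gradient
    else:
        # Shadow is computed from the freshly updated variable.
        variable = variable - lr * gradient
        shadow = ema_update(shadow, variable, decay)
    return variable, shadow


# Same starting point and gradient, only the order differs.
var_a, shadow_a = train_step(1.0, 1.0, gradient=1.0, lr=0.5, decay=0.9,
                             average_first=False)
var_b, shadow_b = train_step(1.0, 1.0, gradient=1.0, lr=0.5, decay=0.9,
                             average_first=True)
```

Both orders leave the variable at the same value; only the shadow differs, because it averages in a different snapshot of the variable.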
Can anyone help me clear this up? Thanks.

I believe you are right. It does seem like a bug in the example. It is probably not important in practice, as the order of the variable update and the moving average update is likely to be stable. Even if it is the "wrong" order, in the worst case your moving average will be "one step ahead of the variable", which is likely to have a less significant effect than changing your decay from 0.999 to 0.998 or something like that.
Just created a pull request to fix this: https://github.com/tensorflow/models/pull/3946

How to skip redundant forward prop during RL training step in TensorFlow

I have a TensorFlow question regarding Reinforcement Learning. I have everything working and training, but there is something that feels redundant. I want to point it out and hear your thoughts:
Let's assume something simple like episodic REINFORCE. Given this standard setup:
state -> network -> logits
When I want to train (when episode complete), I need to:
1. pass in an array of states (saved from running the episode) to a TF Placeholder
2. do a forward pass with those saved states to produce logits
3. compute log_probs (using saved array of actions)
4. compute loss (using saved array of advantages)
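The four steps above can be sketched in plain Python as follows. The linear policy, the weights layout, and the function names are assumptions made for the illustration, not TensorFlow code. Note that the logits are recomputed from the saved states inside the loss; that recomputation is what ties the loss to the weights.

```python
import math


def log_softmax(logits):
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def policy_logits(state, weights):
    # Hypothetical linear policy: one weight row per action.
    return [sum(w * s for w, s in zip(row, state)) for row in weights]


def reinforce_loss(states, actions, advantages, weights):
    """Episodic REINFORCE loss: minus the sum of log_prob * advantage."""
    loss = 0.0
    for state, action, advantage in zip(states, actions, advantages):  # step 1
        logits = policy_logits(state, weights)   # step 2: forward pass
        log_prob = log_softmax(logits)[action]   # step 3: log-prob of the action taken
        loss -= log_prob * advantage             # step 4: accumulate the loss
    return loss


# Two saved steps of an episode, two actions, all-zero policy weights.
weights = [[0.0, 0.0], [0.0, 0.0]]
states = [[1.0, 0.0], [0.0, 1.0]]
actions = [0, 1]
advantages = [1.0, -0.5]
loss = reinforce_loss(states, actions, advantages, weights)
```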
This works fine. However, what seems redundant is steps 1&2. I'd prefer to calculate log_probs during each step of episode while episode is being run. This way I don't have to do steps 1,2,3 during training, and forward pass is only performed once (during episode). I'd have my log_probs calculated by the time the episode was over.
However, if I create placeholders for log_probs and advantages, and don't pass in states (for the redundant forward prop), then I don't know how to get TF to know where the variables are for backprop. I get the error:
ValueError: No gradients provided for any variable
So my questions are:
1. If I'm passing in states, is it true that forward prop is being run again during training?
2. Can I prevent this with my method above, by finding some way to tell TF where the gradients are?
In case anyone wants to see actual code (I tried to be clear enough not to need it), here is a gist of the script in question
EDIT: I think the answer has something to do with using optimizer.compute_gradients (where I can pass in variables) and optimizer.apply_gradients, but I'm not sure how yet...

Alex-Net for feature extraction

I am trying to get reliable features for ImageNet images so that I can do further classification on them. To achieve that, I would like to use TensorFlow with AlexNet for feature extraction. That means I would like to get the values from the last layer of the CNN. Could someone write a piece of Python code that explains how that works?

As jonrsharpe mentioned, that's not really stackoverflow's MO, but in practice, many people do choose to write code to help explain answers (because it's often easier).
So I'm going to assume that this was just miscommunication, and you really intended to ask one of the following two questions:
How does one grab the values of the last layer of Alexnet in TensorFlow?
How does feature extraction from the last layer of a deep convolutional network like alexnet work?
The answer to the first question is actually very easy. I'll use the cifar10 example code in TensorFlow (which is loosely based on AlexNet) as an example. The forward pass of the network is built in the inference function, which returns a variable representing the output of the softmax layer. To actually get predicted image labels, you just argmax the logits, like this: (I've left out some of the setup code, but if you're already running alexnet, you already have that working)
logits = cifar10.inference(images)
predictions = tf.argmax(logits, 1)
# Actually run the computation
labels = session.run(predictions)
So grabbing just the last layer features is literally just as easy as asking for them. The only wrinkle is that, in this case, cifar10 doesn't natively expose them, so you need to modify the cifar10.inference function to return both:
# old code in cifar10.inference:
# return softmax_linear
# new code in cifar10.inference:
return softmax_linear, local4
And then modify all the calls to cifar10.inference, like the one we just showed:
logits,local4 = cifar10.inference(images)
predictions = tf.argmax(logits,1)
# Actually run the computation, this time asking for both answers
labels,last_layer = session.run([predictions, local4])
And that's it. last_layer contains the last layer for all of the inputs you gave the model.
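The same "return the intermediate layer alongside the logits" pattern, stripped of TensorFlow, looks roughly like this. It is a toy two-layer network in plain Python; the weights are made up and the hidden layer merely plays the role that local4 plays in the cifar10 code.

```python
def inference(image, hidden_weights, output_weights):
    """Toy forward pass that returns both the logits and the hidden features."""
    # Hidden layer with ReLU: this is the feature vector we want to expose.
    features = [max(0.0, sum(w * x for w, x in zip(row, image)))
                for row in hidden_weights]
    # Final linear layer producing one logit per class.
    logits = [sum(w * f for w, f in zip(row, features))
              for row in output_weights]
    return logits, features


hidden_weights = [[1.0, -1.0], [0.5, 0.5]]
output_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# One call yields both the prediction inputs and the extracted features.
logits, last_layer = inference([2.0, 1.0], hidden_weights, output_weights)
prediction = max(range(len(logits)), key=logits.__getitem__)  # argmax
```

The features can then be fed to any separate classifier without touching the rest of the network.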
As for the second question, that's a much deeper question, but I'm guessing that's why you want to work on it. I'd suggest starting by reading up on some of the papers published in this area. I'm not an expert here, but I do like Bolei Zhou's work. For instance, try looking at Figure 2 in "Learning Deep Features for Discriminative Localization". It's a localization paper, but it's using very similar techniques (and several of Bolei's papers use it).

Building custom Caffe layer in python

After going through many links about building Caffe layers in Python, I still have difficulty understanding a few concepts. Can someone please clarify them?
Blobs and weights python structure for network is explained here: Finding gradient of a Caffe conv-filter with regards to input.
Network and Solver structure is explained here: Cheat sheet for caffe / pycaffe?.
Example of defining python layer is here: pyloss.py on git.
Layer tests here: test layer on git.
Development of new layers for C++ is described here: git wiki.
What I am still missing is:
1. setup() method: what should I do here? Why, in the example, should I compare the length of the 'bottom' param with 2? Why should it be 2? It does not seem to be the batch size, because that is arbitrary. And as I understand it, bottom is a blob, so wouldn't its first dimension be the batch size?
2. reshape() method: as I understand it, the 'bottom' input param is the blob of the layer below, the 'top' param is the blob of the layer above, and I need to reshape the top blob according to the output shape of my forward-pass calculations. But why do I need to do this on every forward pass if these shapes do not change from pass to pass and only the weights change?
3. The reshape and forward methods use index 0 on the 'top' input param. Why would I need to use top[0].data=... or top[0].input=... instead of top.data=... and top.input=...? What is this index about? If we do not use the rest of this top list, why is it exposed this way? I suspect it is an artifact of the C++ backbone, but it would be good to know exactly.
4. reshape() method, the line with:
if bottom[0].count != bottom[1].count
What am I doing here? Why is the length 2 again? And what am I counting? Why should both blobs (0 and 1) have the same number of elements (count)?
5. forward() method: what do I define with this line:
self.diff[...] = bottom[0].data - bottom[1].data
Where is it used after the forward pass, if I define it? Can we just use
diff = bottom[0].data - bottom[1].data
instead, to compute the loss later in this method without assigning to self, or is it done for some purpose?
6. backward() method: what is this about: for i in range(2):? Why is the range 2 again?
7. backward() method, propagate_down parameter: why is it defined? As I see it, if it is True, the gradient should be assigned to bottom[X].diff, but why would someone call the method with propagate_down = False if it then does nothing and still loops?
I'm sorry if those questions are too obvious, I just wasn't able to find a good guide to understand them and asking for help here.

You asked a lot of questions here. I'll give you some highlights and pointers that I hope will clarify matters for you, but I will not explicitly answer all of your questions.
It seems like you are most confused about the difference between a blob and a layer's input/output. Indeed, most layers have a single blob as input and a single blob as output, but that is not always the case. Consider a loss layer: it has two inputs, predictions and ground truth labels. So in this case bottom is a vector of length 2(!), with bottom[0] being a (4-D) blob representing the predictions, while bottom[1] is another blob with the labels. Thus, when constructing such a layer you must ascertain that you have exactly (hard coded) 2 input blobs (see e.g., ExactNumBottomBlobs() in the AccuracyLayer definition).
The same goes for top blobs as well: indeed in most cases there is a single top for each layer, but it's not always the case (see e.g., AccuracyLayer). Therefore, top is also a vector of 4-D blobs, one for each top of the layer. Most of the time there would be a single element in that vector, but sometimes you might find more than one.
I believe this covers your questions 1,3,4 and 6.
As for reshape() (Q.2): this function is not called every forward pass; it is called only when the net is set up, to allocate space for inputs/outputs and params.
Occasionally, you might want to change input size for your net (e.g., for detection nets) then you need to call reshape() for all layers of the net to accommodate the new input size.
As for propagate_down parameter (Q.7): since a layer may have more than one bottom you would need, in principle, to pass the gradient to all bottoms during backprop. However, what is the meaning of a gradient to the label bottom of a loss layer? There are cases when you do not want to propagate to all bottoms: this is what this flag is for. (here's an example with a loss layer with three bottoms that expect gradient to all of them).
For more information, see this "Python" layer tutorial.

Why it should be 2?
That specific gist is about the Euclidean loss layer. Euclidean loss is the mean squared error between two vectors. Hence the layer must receive two input blobs, one per vector. The two vectors must have the same length because the difference is taken element-wise. You can see this check in the reshape method.
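For reference, here is a framework-free sketch of such a Euclidean loss layer, modeled on the snippets quoted in the question. Blob is a hypothetical stand-in for a Caffe blob (flat Python lists instead of numpy arrays), so this shows the logic only and is not code that plugs into Caffe.

```python
class Blob:
    """Hypothetical stand-in for a Caffe blob: flat data plus a matching diff."""

    def __init__(self, data, num):
        self.data = list(data)
        self.diff = [0.0] * len(self.data)
        self.num = num  # batch size

    @property
    def count(self):
        return len(self.data)


class EuclideanLoss:
    def setup(self, bottom, top):
        # bottom is a list of blobs: predictions and labels, hence exactly 2.
        if len(bottom) != 2:
            raise ValueError("Need two inputs to compute distance.")

    def reshape(self, bottom, top):
        # Element-wise difference needs the same number of elements in both.
        if bottom[0].count != bottom[1].count:
            raise ValueError("Inputs must have the same dimension.")
        self.diff = [0.0] * bottom[0].count

    def forward(self, bottom, top):
        # Stored on self because backward() reuses it.
        self.diff = [a - b for a, b in zip(bottom[0].data, bottom[1].data)]
        top[0].data[0] = sum(d * d for d in self.diff) / bottom[0].num / 2.0

    def backward(self, top, propagate_down, bottom):
        for i in range(2):  # one pass per bottom blob
            if not propagate_down[i]:
                continue  # e.g. no gradient wanted for the labels
            sign = 1.0 if i == 0 else -1.0
            bottom[i].diff = [sign * d / bottom[i].num for d in self.diff]


predictions = Blob([1.0, 2.0, 3.0, 5.0], num=2)
labels = Blob([1.0, 0.0, 3.0, 1.0], num=2)
loss_blob = Blob([0.0], num=1)
bottom, top = [predictions, labels], [loss_blob]

layer = EuclideanLoss()
layer.setup(bottom, top)
layer.reshape(bottom, top)
layer.forward(bottom, top)
layer.backward(top, [True, False], bottom)  # gradient for predictions only
```

Passing [True, False] as propagate_down leaves the label blob's diff untouched, which is the situation the propagate_down flag exists for.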
Thanks.
