Building custom Caffe layer in Python

After reading many links about building Caffe layers in Python, I still have difficulty understanding a few concepts. Can someone please clarify them?
Blobs and weights python structure for network is explained here: Finding gradient of a Caffe conv-filter with regards to input.
Network and Solver structure is explained here: Cheat sheet for caffe / pycaffe?.
Example of defining python layer is here: pyloss.py on git.
Layer tests here: test layer on git.
Development of new layers for C++ is described here: git wiki.
What I am still missing is:
setup() method: what should I do here? Why, in the example, should I compare the length of the 'bottom' param with 2? Why should it be 2? It does not seem to be the batch size, because that is arbitrary. And bottom, as I understand it, is a blob, and then the first dimension is the batch size?
reshape() method: as I understand it, the 'bottom' input param is the blob of the layer below, and the 'top' param is the blob of the layer above, and I need to reshape the top layer according to the output shape of my forward-pass calculations. But why do I need to do this on every forward pass, if these shapes do not change from pass to pass and only the weights change?
The reshape and forward methods use index 0 on the 'top' input param. Why would I need to use top[0].data=... or top[0].input=... instead of top.data=... and top.input=...? What is this index about? If we do not use the other part of this top list, why is it exposed this way? I suspect it is a C++ backbone coincidence, but it would be good to know exactly.
reshape() method, line with:
if bottom[0].count != bottom[1].count
What do I do here? Why is its dimension 2 again? And what am I counting here? Why should both parts of the blobs (0 and 1) be equal in the number of some members (count)?
forward() method, what do I define with this line:
self.diff[...] = bottom[0].data - bottom[1].data
When is it used after the forward pass, if I define it? Can we just use
diff = bottom[0].data - bottom[1].data
instead, to compute the loss later in this method without assigning to self, or is it done with some purpose?
backward() method: what is for i in range(2): about? Why is the range 2 again?
backward() method, propagate_down parameter: why is it defined? I mean, if it is True the gradient should be assigned to bottom[X].diff, as I see it, but why would someone call a method that does nothing when propagate_down = False, if it just does nothing and still cycles inside?
I'm sorry if these questions are too obvious; I just wasn't able to find a good guide to understand them and am asking for help here.

You asked a lot of questions here, I'll give you some highlights and pointers that I hope will clarify matters for you. I will not explicitly answer all your questions.
It seems like you are most confused about the difference between a blob and a layer's input/output. Indeed, most layers have a single blob as input and a single blob as output, but that is not always the case. Consider a loss layer: it has two inputs, predictions and ground-truth labels. So, in this case bottom is a vector of length 2(!), with bottom[0] being a (4-D) blob representing predictions and bottom[1] another blob with the labels. Thus, when constructing such a layer you must ascertain that you have exactly (hard coded) 2 input blobs (see e.g., ExactNumBottomBlobs() in the AccuracyLayer definition).
The same goes for top blobs: indeed, in most cases there is a single top for each layer, but that's not always the case (see e.g., AccuracyLayer). Therefore, top is also a vector of 4-D blobs, one for each top of the layer. Most of the time there is a single element in that vector, but sometimes you may find more than one.
I believe this covers your questions 1, 3, 4 and 6.
As for reshape() (Q.2): this function is not called on every forward pass; it is called only when the net is set up, to allocate space for inputs/outputs and params.
Occasionally, you might want to change the input size of your net (e.g., for detection nets); then you need to call reshape() for all layers of the net to accommodate the new input size.
As for the propagate_down parameter (Q.7): since a layer may have more than one bottom, you would, in principle, need to pass the gradient to all bottoms during backprop. However, what is the meaning of the gradient with respect to the label bottom of a loss layer? There are cases when you do not want to propagate to all bottoms: this is what this flag is for. (Here's an example of a loss layer with three bottoms that expects gradients for all of them.)
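To make the mechanics concrete, here is the Euclidean loss Python layer from the pyloss.py example linked above, lightly commented (a sketch for illustration, not a drop-in for every use case):

import caffe
import numpy as np

class EuclideanLossLayer(caffe.Layer):
    def setup(self, bottom, top):
        # a loss layer has exactly two bottoms: predictions and labels
        if len(bottom) != 2:
            raise Exception("Need two inputs to compute distance.")

    def reshape(self, bottom, top):
        # the element-wise difference requires equal element counts
        if bottom[0].count != bottom[1].count:
            raise Exception("Inputs must have the same dimension.")
        # buffer for the difference, reused later in backward()
        self.diff = np.zeros_like(bottom[0].data, dtype=np.float32)
        # the loss output is a scalar (a blob with a single element)
        top[0].reshape(1)

    def forward(self, bottom, top):
        self.diff[...] = bottom[0].data - bottom[1].data
        top[0].data[...] = np.sum(self.diff ** 2) / bottom[0].num / 2.

    def backward(self, top, propagate_down, bottom):
        # one pass per bottom; skip bottoms that should not get gradients
        for i in range(2):
            if not propagate_down[i]:
                continue
            sign = 1 if i == 0 else -1
            bottom[i].diff[...] = sign * self.diff / bottom[i].num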
For more information, see this "Python" layer tutorial.

Why should it be 2?
That specific gist is talking about the Euclidean loss layer. Euclidean loss is the mean squared error between 2 vectors. Hence this layer must have exactly 2 input blobs: the two vectors being compared. The length of each vector must be the same because the difference is element-wise. You can see this check in the reshape method.
Thanks.

Related

Can I use a convolution filter instead of a dense layer for classification?

I was reading a decent paper, S-DCNet, and came upon a section (page 3, table 1, classifier) where a convolution layer is used on the feature map to produce a binary classification output as part of an internal process. Since I am a noob, and when someone talks to me about classification I automatically think of FCs combined with softmax, I started wondering: is this a possible thing to do? Can a convolutional layer indeed be used to classify a binary outcome? The whole concept triggered my imagination so much that I insist on getting answers...
Honestly, how does this actually work? What is the difference between using a convolution filter and a fully connected layer for classification purposes?
Edit (uncertain answer on how it works): I asked a colleague, and he told me that using a filter with the same spatial shape as the feature map at the current stage may lead to a learnable binary output (provided you also reduce the number of channels of the feature map to a single channel). But I still don't understand the motivation behind such a technique...
Using convolutions as FCs can be done (for example) with filters of spatial size (1,1) and depth equal to the FC input size.
The resulting feature map would have the same spatial size as the input feature map, but each pixel would be the output of a "FC" layer whose weights are the weights of the shared 1x1 conv filter.
This kind of thing is used mainly for semantic segmentation, meaning classification per pixel. U-net is a good example if memory serves.
Also see this.
Also note that 1x1 convolutions have other uses as well.
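To make the FC equivalence concrete, here is a minimal PyTorch sketch (shapes and variable names are illustrative) showing that a 1x1 convolution applies the same fully connected layer at every pixel:

import torch
import torch.nn as nn

C, num_classes = 128, 2
conv = nn.Conv2d(C, num_classes, kernel_size=1)

# build an equivalent nn.Linear by copying the conv weights
fc = nn.Linear(C, num_classes)
fc.weight.data = conv.weight.data.view(num_classes, C)
fc.bias.data = conv.bias.data

x = torch.randn(1, C, 16, 16)                 # (N, C, H, W) feature map
out_conv = conv(x)                            # (1, num_classes, 16, 16)
out_fc = fc(x.permute(0, 2, 3, 1))            # FC applied per pixel
out_fc = out_fc.permute(0, 3, 1, 2)           # back to (N, classes, H, W)

print(torch.allclose(out_conv, out_fc, atol=1e-6))  # True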
See also paperswithcode; some of the nets there probably use this trick.

How to implement location-specific convolutional filters in TensorFlow or PyTorch?

I want to implement a convolutional layer with a different convolutional filter for each output location. Specifically, think of the case where the output is 16*16*128 (W * H * C). Instead of having one 3*3*128 filter, we have 16*16 filters, each of size 3*3*128. This would lead to a huge number of parameters, but it may be the case that each 3*3*128 filter is the same except scaled by a different constant, and the constants can be learned through a side network. This way the number of parameters won't be too large.
A similar idea appears briefly in Dynamic Filter Networks, but I cannot find an implementation of location-specific filters. My question is: if we want a location-specific convolutional filter, how do I implement it in TensorFlow or PyTorch? Do I need to write my own op, or is there some smart way to use the functions provided? If I have to write an op, is there any trick that can easily achieve this idea? Any help is appreciated!
A convolution, by definition, is not location specific: that is what makes it a convolution. If you wish to generalize convolution, bear in mind that a convolution is ultimately a special case of a simple linear operation. Therefore, you can implement your "location specific" convolution as a fully-connected layer (nn.Linear) with very specific sparse weights.
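One more observation on the specific parameterization you propose: if the filter at location (i, j) is the shared filter scaled by a constant s(i, j), the scaling factors out of the convolution sum, so the whole op reduces to an ordinary convolution followed by element-wise multiplication with a learned scale map. A minimal PyTorch sketch of that idea (the module name is made up; it assumes one scalar per location, shared across channels):

import torch
import torch.nn as nn

class ScaledLocationConv(nn.Module):
    # filter at (i, j) = scale[i, j] * shared_filter, so the whole op
    # reduces to a standard conv followed by an element-wise scaling
    def __init__(self, in_ch, out_ch, h, w):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        # plain parameters here; they could instead be produced
        # by a side network from the input
        self.scale = nn.Parameter(torch.ones(1, 1, h, w))

    def forward(self, x):
        return self.conv(x) * self.scale   # broadcasts over batch/channels

layer = ScaledLocationConv(in_ch=128, out_ch=128, h=16, w=16)
y = layer(torch.randn(1, 128, 16, 16))     # (1, 128, 16, 16)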
The reason convolutions are more efficient than fully connected layers is that they are translation invariant. If you wish to have convolutions that depend on location, you would need to add two extra parameters to the convolution, i.e. have N+2 input channels where the x, y coordinates are the values of the two additional channels (as in, e.g., CoordConv or Location Biased Convolutions).
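For the coordinate-channel approach, a minimal PyTorch sketch (the helper name is illustrative; see the CoordConv paper for the original formulation):

import torch

def add_coord_channels(x):
    # append normalized x and y coordinate maps as two extra channels
    n, _, h, w = x.shape
    ys = torch.linspace(-1, 1, h).view(1, 1, h, 1).expand(n, 1, h, w)
    xs = torch.linspace(-1, 1, w).view(1, 1, 1, w).expand(n, 1, h, w)
    return torch.cat([x, xs, ys], dim=1)      # (n, c + 2, h, w)

x = torch.randn(2, 128, 16, 16)
print(add_coord_channels(x).shape)            # torch.Size([2, 130, 16, 16])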

What is the meaning of 'self.diff' in 'forward' of a custom python loss layer for Caffe training?

I am trying to use a custom Python loss layer. Checking several examples online, such as:
Euclidean loss layer, Dice loss layer,
I noticed that a variable 'self.diff' is always assigned in 'forward'. In particular, for the Dice loss layer,
self.diff[...] = bottom[1].data
I wonder if there is any reason this variable has to be introduced in forward, or whether I can just use bottom[1].data to access the ground-truth label?
In addition, what is the point of top[0].reshape(1) in reshape, since by definition the loss output in forward is a scalar itself?
You need to store diff as an attribute of the layer so that it persists after forward() returns: it is then available in the other methods of the class, most importantly in backward(), where it is typically reused to compute the gradients without recomputing the difference. bottom is a local parameter and is not available elsewhere in the same form.
In general, the code is written to be expandable for a variety of applications and more complex computations; the reshaping is part of this, ensuring that the returned value is a scalar (a blob with a single element), even if someone expands the inputs to work with vectors or matrices.
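For illustration, this is the pattern in the Euclidean loss example you linked: the diff cached in forward is exactly what backward consumes, so it is not recomputed:

def forward(self, bottom, top):
    # cache the difference; backward() needs it again
    self.diff[...] = bottom[0].data - bottom[1].data
    top[0].data[...] = np.sum(self.diff ** 2) / bottom[0].num / 2.

def backward(self, top, propagate_down, bottom):
    for i in range(2):
        if propagate_down[i]:
            sign = 1 if i == 0 else -1
            # reuses self.diff computed during the forward pass
            bottom[i].diff[...] = sign * self.diff / bottom[i].num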

Caffe LeNet: Difference between `solver.step(1)` and `solver.net.forward()`

I was checking the Caffe LeNet Tutorial here and a question came to mind:
What's the difference between these 2 codes:
self.solver.step(1)
and
self.solver.net.forward() # train net
They both seem to train the network, at least according to the comment.
Personally, I think the first one trains the network on the training data and updates the weights of both net and test_net, while the second one seems to only forward a batch of data and apply the weights learned in the previous step.
If what I think is right, then what is the purpose of the second line in the tutorial? Why did the code do a net.forward? Can't solver.step(1) do this by itself?
Thanks for your time
step does one full iteration, covering all three phases: forward evaluation, backward propagation, and update. The call to forward does only the first of these. There are also differences in the signature (parameter list).
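In pycaffe terms, a rough conceptual sketch (the update phase happens inside the solver and has no one-line pycaffe equivalent):

# solver.step(1) is conceptually:
solver.net.forward()    # 1. forward: compute the loss
solver.net.backward()   # 2. backward: compute the gradients
# 3. update: apply the solver's rule (SGD, momentum, ...) to the
#    weights and clear the diffs; this part is internal to the solver

# solver.net.forward() alone performs only step 1.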
I discovered a strange behavior with solver.step(1) and solver.net.forward(). When I used a custom layer for the input network, my layer instance needed a variable to be set before use:
solver.net.layers[0].mySet(variable)
That variable was stored in a local variable of my layer. But when I called solver.step, the variable was gone; however, it persisted when I used solver.net.forward(). I am not certain, but maybe solver.step instantiates a new copy of the layer.

Alex-Net for feature extraction

I am trying to get reliable features from ImageNet to do further classification on them. To achieve that, I would like to use TensorFlow with AlexNet for feature extraction; that is, I would like to get the values from the last layer of the CNN. Could someone write a piece of Python code that explains how that works?
As jonrsharpe mentioned, that's not really stackoverflow's MO, but in practice, many people do choose to write code to help explain answers (because it's often easier).
So I'm going to assume that this was just miscommunication, and you really intended to ask one of the following two questions:
How does one grab the values of the last layer of Alexnet in TensorFlow?
How does feature extraction from the last layer of a deep convolutional network like AlexNet work?
The answer to the first question is actually very easy. I'll use the cifar10 example code in TensorFlow (which is loosely based on AlexNet) as an example. The forward pass of the network is built in the inference function, which returns a variable representing the output of the softmax layer. To actually get predicted image labels, you just take the argmax of the logits, like this (I've left out some of the setup code, but if you're already running AlexNet, you already have that working):
logits = cifar10.inference(images)
predictions = tf.argmax(logits,1)
# Actually run the computation
labels = session.run([predictions])
So grabbing just the last-layer features is literally as easy as asking for them. The only wrinkle is that, in this case, cifar10 doesn't natively expose them, so you need to modify the cifar10.inference function to return both:
# old code in cifar10.inference:
# return softmax_linear
# new code in cifar10.inference:
return softmax_linear, local4
And then modify all the calls to cifar10.inference, like the one we just showed:
logits,local4 = cifar10.inference(images)
predictions = tf.argmax(logits,1)
# Actually run the computation, this time asking for both answers
labels,last_layer = session.run([predictions, local4])
And that's it. last_layer contains the last layer for all of the inputs you gave the model.
As for the second question, that's a much deeper question, but I'm guessing that's why you want to work on it. I'd suggest starting by reading up on some of the papers published in this area. I'm not an expert here, but I do like Bolei Zhou's work. For instance, try looking at Figure 2 in "Learning Deep Features for Discriminative Localization". It's a localization paper, but it's using very similar techniques (and several of Bolei's papers use it).
