For a standard machine learning problem, e.g., image classification on MNIST, the loss function is fixed, so the optimization can be carried out simply by calling library functions and feeding the input into them. There is no need to derive gradients and code the descent procedure by hand.
But I get confused when I meet a more complicated formulation. Say we are solving a semi-supervised problem, and the loss function has two parts: Ls + lambda * Lu. The first part is a standard classification loss, e.g., cross-entropy. The second part varies; in my case, Lu is a matrix factorization loss, specifically Lu = MF(D, C * W). The total loss function can be written as:
L = \sum_i log p(y_i | x_i) + lambda * MF(D, C * W)
  = \sum_i log p(y_i | W_i) + lambda * MF(D, C * W)
  = \sum_i log p(y_i | T * W_i + b) + lambda * MF(D, C * W)
The parameters are W, C, T and b. The first part is a classification loss, where the input x_i is a row of W, i.e. W_i, a vector of size (d, 1). The label y_i is a one-hot vector of size (c, 1), so the parameters T and b map the input to the label size. The second part is a matrix factorization loss.
Now I'm unsure how to optimize this function using SGD. It could be solved by writing down the formulation, deriving the gradients by hand, and implementing the training procedure from scratch, but I'm wondering if there is a simpler way. It's easy to use a deep learning tool like TensorFlow or Keras to train a classification model: all you need to do is build a network and feed in the data.
So, similarly, is there a tool that can automatically compute gradients once I have defined the loss function? Deriving gradients and implementing them from scratch is really tedious. Both the classification loss and the matrix factorization loss are very common, so I would expect their combination to be well supported.
Theano and TensorFlow will do exactly this for you if you can formulate your optimization problem within their framework/language. These frameworks are also general enough to implement non-NN algorithms, such as simple first-order optimizations like yours.
If that's not possible, you can try autograd, which can do this on a subset of NumPy. Just formulate your loss as a NumPy function (while sticking to the supported functions; read the docs) and let autograd build the gradients.
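A minimal autograd sketch of this idea, under assumptions: a squared-error MF term, a softmax classifier on the rows of W, and illustrative sizes (nothing here comes from the original question except the parameter names W, C, T, b and lambda):

import autograd.numpy as np
from autograd import grad

lam = 0.1                                     # illustrative trade-off weight
n, d, c, m = 100, 20, 5, 50                   # illustrative sizes

def log_softmax(z):
    z = z - np.max(z, axis=1, keepdims=True)
    return z - np.log(np.sum(np.exp(z), axis=1, keepdims=True))

def total_loss(params, D, Y):
    W, C, T, b = params
    Ls = -np.mean(np.sum(Y * log_softmax(np.dot(W, T) + b), axis=1))  # cross-entropy part
    Lu = np.sum((D - np.dot(C, W)) ** 2)                              # squared-error MF part (an assumption)
    return Ls + lam * Lu

params = (np.random.randn(n, d), np.random.randn(m, n),
          np.random.randn(d, c), np.zeros(c))
D = np.random.randn(m, d)                     # matrix to factorize
Y = np.eye(c)[np.random.randint(c, size=n)]   # random one-hot labels, for illustration

gW, gC, gT, gb = grad(total_loss)(params, D, Y)   # gradients w.r.t. all four parameters

The returned gradients can then be plugged into any hand-written SGD update.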
Keep in mind that the graph-construction approach used by Theano and TensorFlow will generally be more efficient (because the inputs are more fully specified and because these two libraries are somewhat more mature).
Both Theano and TensorFlow have built-in differentiation, so you only need to write down the loss.
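For example, a TF 1.x-style sketch of the combined loss; the shapes and the squared-error form of the MF term are illustrative assumptions, not part of the original question:

import tensorflow as tf  # TF 1.x graph API

n, d, c, m, lam = 100, 20, 5, 50, 0.1           # illustrative sizes and trade-off weight

D = tf.placeholder(tf.float32, [m, d])          # matrix to factorize
Y = tf.placeholder(tf.float32, [n, c])          # one-hot labels for the rows of W

W = tf.Variable(tf.random_normal([n, d]))       # rows W_i are the classifier inputs
C = tf.Variable(tf.random_normal([m, n]))
T = tf.Variable(tf.random_normal([d, c]))
b = tf.Variable(tf.zeros([c]))

logits = tf.matmul(W, T) + b
Ls = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=Y, logits=logits))
Lu = tf.reduce_sum(tf.square(D - tf.matmul(C, W)))   # squared-error MF loss (an assumption)
loss = Ls + lam * Lu

# The framework derives all gradients; no manual differentiation is needed.
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)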
Related
I am implementing an autoencoder, used to rebuild color images. The loss function I want to use requires a reduced color set (max ~100 different colors) but I am struggling to find a suitable differentiable algorithm.
Another doubt I have is the following: is it better to apply such quantization directly in the loss function, or can I implement it in a custom non-trainable layer? In the second case, does the algorithm need to be differentiable?
My first idea when approaching this problem was to quantize the images before feeding them to the network, but I don't know how to "force" the network to produce only the quantized colors as output.
Any suggestion is greatly appreciated, I do not need code, just some ideas or new perspectives. Being pretty new to Tensorflow I am probably missing something.
If you want to compress the image, it seems you want to find a discrete color set for image compression. In that case an auto-encoder is not a suitable approach.
A typical auto-encoder compresses a tensor of images (B x C x H x W) into a latent code for each image (B x D, typically D = 512). The beauty of this approach is that the optimal latent space is found 'automatically'.
Nevertheless, if you want to use TensorFlow's optimization tools, a continuous relaxation technique such as interpolation could be helpful (see the sketch after the references below).
In the following paper, they use continuous relaxation for discrete path selection in a neural network.
Liu, H., Simonyan, K., & Yang, Y. (2018). DARTS: Differentiable architecture search. ICLR.
In the following paper, they use interpolation to learn a quantized kernel bank stored in a look-up table.
Jo, Y., & Kim, S. J. (2021). Practical single-image super-resolution using look-up table. CVPR.
Both of them provide code.
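As a concrete illustration of the continuous-relaxation idea, here is a minimal sketch of soft (differentiable) color quantization against a fixed palette in TensorFlow; the palette, the temperature, and the softmax-based assignment are illustrative assumptions, not the method of either paper:

import tensorflow as tf

def soft_quantize(images, palette, temperature=0.1):
    # images:  (B, H, W, 3) tensor with values in [0, 1]
    # palette: (K, 3) tensor of K reference colors
    pix = tf.reshape(images, [-1, 1, 3])                   # (B*H*W, 1, 3)
    dist = tf.reduce_sum((pix - palette[None]) ** 2, -1)   # squared distance to each palette color
    weights = tf.nn.softmax(-dist / temperature, axis=-1)  # soft one-hot assignment over the palette
    quant = tf.matmul(weights, palette)                    # convex combination of palette colors
    return tf.reshape(quant, tf.shape(images))

As the temperature goes to zero this approaches hard nearest-color quantization while remaining differentiable for any positive temperature, so it can sit either inside the loss or as a final layer; annealing the temperature during training pushes the output toward the discrete color set.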
I am trying to implement a research paper that uses a CNN and a CRF for page object detection. According to the paper, we have to build two neural networks (named Unary-Net and Pairwise-Net). The training data (a set of images) are passed through and both CNNs are trained. After that we are supposed to apply the CRF.
Following are the equations for CRF:
U and V are unary and pairwise potentials obtained from the CNNs using the following equations:
A maximum a posteriori (MAP) strategy is used to predict the labels of line regions given a new document. MAP inference for CRFs can be formulated as the following optimization problem:
The parameters of our CRF include Unary-Net's weights, Pairwise-Net's weights, and a combination coefficient vector λ for U and V. The weights of U and V (w) are learned using SGD; they are then fixed and λ is learned using the pseudo-likelihood method.
I have created the neural networks but I am not able to implement the CRF part. Can someone help me implement this or suggest a Python library that makes it easier? (I have tried the Python library pystruct but could not install it.)
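For reference, a generic sketch of what MAP decoding with unary and pairwise potentials can look like (ICM-style coordinate updates in NumPy); the array shapes, the edge list, and the way λ enters are illustrative assumptions, not the paper's inference or pseudo-likelihood procedure:

import numpy as np

def icm_map(U, V, edges, lam=1.0, n_iters=10):
    # U: (n_nodes, n_labels) unary potentials from Unary-Net
    # V: (n_labels, n_labels) pairwise potentials from Pairwise-Net
    # edges: list of (i, j) node pairs; lam weights the pairwise term
    labels = U.argmax(axis=1)                      # start from the unary-only decisions
    for _ in range(n_iters):
        for i in range(U.shape[0]):
            score = U[i].copy()                    # score of each candidate label for node i
            for a, b in edges:
                if a == i:
                    score += lam * V[:, labels[b]]
                elif b == i:
                    score += lam * V[labels[a], :]
            labels[i] = score.argmax()             # greedily pick the best label for node i
    return labels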
I am using the Physics Informed Neural Networks (PINNs) methodology to solve non-linear PDEs in high dimension. Specifically, I am using this class https://github.com/maziarraissi/PINNs/blob/master/appendix/continuous_time_inference%20(Burgers)/Burgers.py
where the function def net_f(self, x, t): is modified to include more variables than just x and t, i.e. def net_f(self, w, z, v, t):
I have two PDEs and a policy function (each an instance of the PhysicsInformedNN class), and at some point I have a function which combines the approximations from def net_f(self, w, z, v, t):, say y = PDE1(w,z,v,t) + PDE2(w,z,v,t) + policy(w,z,v,t). I want to take derivatives of this function with respect to w, z and v. However, I can't figure out how to do that in TensorFlow 1.15.2. This is more or less trivial in TF2, but I want to stick to TensorFlow 1.15.2 for several reasons.
Basically, the problem boils down to this: from an instantiated model
model = PhysicsInformedNN(X_u_train, u_train, X_f_train, layers, lb, ub, nu)
take a derivative of model.net_u(x, t) with respect to x or t, whether or not the model has been trained. If I can do that, then I can figure out how to take derivatives of the function y above w.r.t. the variables in each PDE and the policy.
Note: this can be done fully analytically, i.e. I can hard-code the formulas for the derivatives of y using values from model.predict() (which are numpy arrays), and I can check with TF2 that the derivative formulas are correct. I just want to use automatic differentiation to do this, since the formulas are complicated and become very cumbersome as the dimension of the PDEs increases.
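A sketch of the TF 1.15 graph-mode approach with tf.gradients, assuming the model keeps the placeholders and session from the Burgers example (x_u_tf, t_u_tf, sess; the attribute names may differ in a modified class) and that x_star, t_star are hypothetical evaluation points:

import tensorflow as tf  # TF 1.15, graph mode

u = model.net_u(model.x_u_tf, model.t_u_tf)                  # symbolic network output
u_x, u_t = tf.gradients(u, [model.x_u_tf, model.t_u_tf])     # derivatives w.r.t. the inputs

u_x_val, u_t_val = model.sess.run(
    [u_x, u_t],
    feed_dict={model.x_u_tf: x_star, model.t_u_tf: t_star})  # works trained or untrained

The same pattern extends to the combined function y: build the symbolic sum of the three networks' outputs (provided they live in the same graph) and call tf.gradients on it with respect to the relevant placeholders.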
I've implemented gradient descent in Python to perform a regularized polynomial regression using the MSE as the loss function, on linear data (to demonstrate the role of regularization).
So my model has the form: ŷ(x) = w0 + w1 * x + w2 * x^2
And my loss function is the MSE plus a regularization term R: L(w) = (1/N) * \sum_k (y_k - ŷ(x_k))^2 + lambda * R(w)
Taking the L2-norm as our regularization, the partial derivatives of the loss function w.r.t. w_i are: dL/dw_i = -(2/N) * \sum_k (y_k - ŷ(x_k)) * x_k^i + 2 * lambda * w_i
Finally, the coefficients w_i are updated using a constant learning rate mu: w_i <- w_i - mu * dL/dw_i
The problem is that I'm unable to make it converge, because the regularization penalizes both the degree-2 coefficient (w2) and the degree-1 coefficient (w1) of the polynomial, while in my case I want it to penalize only the former, since the data is linear.
Is it possible to achieve this, as both LassoCV and RidgeCV in scikit-learn manage to do? Or is there a mistake in my equations above?
I also suspect that a constant learning rate (mu) could be problematic; what is a simple formula to make it adaptive?
I ended up using coordinate descent as described in this tutorial, to which I added a regularization term (L1 or L2). After a relatively large number of iterations, w2 was almost zero (and therefore the predicted model was linear).
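For reference, a minimal NumPy sketch of coordinate descent with an L1 penalty on degree-2 polynomial features of linear data; the loss form 0.5*||y - X*w||^2 + lambda*||w||_1, the unpenalized bias column, and the hyperparameters are assumptions rather than the tutorial's exact recipe:

import numpy as np

def soft_threshold(rho, lam):
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def coordinate_descent(X, y, lam=0.5, n_iters=200):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        for j in range(d):
            r_j = y - X @ w + X[:, j] * w[j]         # partial residual excluding feature j
            rho = X[:, j] @ r_j
            z = X[:, j] @ X[:, j]
            if j == 0:
                w[j] = rho / z                       # bias column is not penalized
            else:
                w[j] = soft_threshold(rho, lam) / z  # L1 shrinkage drives unneeded w_j to zero
    return w

x = np.linspace(-1, 1, 50)
y = 2 * x + 1 + 0.05 * np.random.randn(50)           # linear data with a little noise
X = np.column_stack([np.ones_like(x), x, x ** 2])    # degree-2 polynomial features
print(coordinate_descent(X, y))                      # w[2] ends up (close to) zero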
I am trying to implement the distributed synchronous SGD approach described in this paper using TensorFlow. For that I need to compute and apply gradients layer-wise. In principle I can do it in the following way (note: incomplete code):
#WORKER CODE
opt = tf.train.GradientDescentOptimizer(learning_rate)
for layer_vars in all_layer_vars:
    # one compute_gradients() call and one session.run() per layer
    grads_vars = opt.compute_gradients(loss, layer_vars)
    grads = sess.run([grad_var[0] for grad_var in grads_vars], feed_dict)
    send_grads_to_master(zip(grads, layer_vars))

#MASTER CODE
while True:
    grads_vars = receive_grads_from_worker()
    sess.run(opt.apply_gradients(grads_vars))
What I wonder is whether in this scenario (with several compute_gradients() calls inside different session.run() invocations) the number of internal operations performed by TensorFlow is the same as, or higher than, in the "standard" scenario where all gradients are computed with just one invocation of compute_gradients().
That is, thinking of the backpropagation algorithm, I wonder whether in this distributed scenario TensorFlow will compute the intermediate "deltas" only once or several times. If the latter, is there a more efficient way of doing what I want?
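For comparison, a sketch of what the "standard" single-invocation scenario could look like, reusing the hypothetical helpers from the snippet above (all_layer_vars, send_grads_to_master, feed_dict):

all_vars = [v for layer_vars in all_layer_vars for v in layer_vars]
grads_vars = opt.compute_gradients(loss, all_vars)             # one backward graph for all layers
grad_values = sess.run([g for g, _ in grads_vars], feed_dict)  # one session.run for everything

offset = 0
for layer_vars in all_layer_vars:
    layer_grads = grad_values[offset:offset + len(layer_vars)]
    send_grads_to_master(list(zip(layer_grads, layer_vars)))
    offset += len(layer_vars)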