Derivatives from a class instance in TF1 - Python

I am using the Physics-Informed Neural Networks (PINNs) methodology to solve non-linear PDEs in high dimension. Specifically, I am using this class https://github.com/maziarraissi/PINNs/blob/master/appendix/continuous_time_inference%20(Burgers)/Burgers.py
where the function def net_f(self, x, t): is modified to include more variables than just x and t, i.e. def net_f(self, w, z, v, t):
I have two PDEs and a policy function (each an instance of the PhysicsInformedNN class), and at some point I have a function which combines the approximations from def net_f(self, w, z, v, t):, say y = PDE1(w,z,v,t) + PDE2(w,z,v,t) + policy(w,z,v,t). I want to take derivatives of this function with respect to w, z and v. However, I can't figure out how to do that in TensorFlow 1.15.2. This is more or less trivial in TF2, but I want to stick to TensorFlow 1.15.2 for several reasons.
Basically, the problem boils down to this: from an instantiated model
model = PhysicsInformedNN(X_u_train, u_train, X_f_train, layers, lb, ub, nu)
take a derivative of model.net_u(x, t) with respect to x or t, whether the model is trained or not. If I can do that, then I can figure out how to take derivatives of the function y above w.r.t. the variables in each PDE and the policy.
Note: This can be done fully analytically, i.e. I can hard-code the formulas for the derivatives of y using values from model.predict() (which would be numpy arrays), and I can check that the derivative formulas are correct with TF2. I just want to use automatic differentiation to do this, since the formulas are complicated and become very cumbersome as the dimension of the PDEs increases.
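For concreteness, the kind of graph I am trying to build in TF 1.15 looks roughly like the sketch below. The placeholder names and the X_star evaluation array are made up, and I am assuming the class keeps its tf.Session in model.sess, as in the linked repo.

import numpy as np
import tensorflow as tf  # 1.15.x

# Hypothetical placeholders matching net_u's two inputs
x_ph = tf.placeholder(tf.float32, shape=[None, 1])
t_ph = tf.placeholder(tf.float32, shape=[None, 1])

u = model.net_u(x_ph, t_ph)                    # reuses the model's weights/biases
du_dx, du_dt = tf.gradients(u, [x_ph, t_ph])   # symbolic derivatives w.r.t. the inputs

# Evaluate in the model's existing session (works whether the model is trained or not)
X_star = np.random.rand(10, 2)                 # made-up evaluation points
feed = {x_ph: X_star[:, 0:1], t_ph: X_star[:, 1:2]}
du_dx_val, du_dt_val = model.sess.run([du_dx, du_dt], feed)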

Related

Why does scaling down the parameters many times during training keep the learning speed the same for all weights in Progressive GAN?

Equalized learning rate is one of the special techniques in Progressive GAN, a paper from the NVIDIA team. Using this method, they state that
Our approach ensures that the dynamic range, and thus the learning speed, is the same for all weights.
In detail, they initialize all learnable parameters from a standard normal distribution N(0, 1). During training, at each forward pass, they scale the result by the per-layer normalization constant from He's initializer.
I reproduced the code from the pytorch GAN zoo GitHub repo:
def forward(self, x, equalized):
    # Compute He's constant from the size of the wrapped module's weight tensor W
    size = self.module.weight.size()
    fan_in = prod(size[1:])            # number of input connections per output unit
    self.weight = math.sqrt(2.0 / fan_in)
    '''
    A module example:
    import torch.nn as nn
    module = nn.Conv2d(nChannelsPrevious, nChannels, kernelSize,
                       padding=padding, bias=bias)
    '''
    x = self.module(x)
    if equalized:
        x *= self.weight               # rescale the output at every forward pass
    return x
At first, I thought the He constant would be the one from He's paper, c = sqrt(2 / fan_in). Normally, I would expect the weights to be scaled up, which increases the gradients in backpropagation and, as the formula in ProGAN's paper suggests, prevents vanishing gradients. However, the code shows that c = sqrt(2.0 / fan_in) is less than 1 for any realistic fan-in, so the weights are effectively scaled down at every pass.
In summary, I can't understand why scaling down the parameters many times during training helps keep the learning speed stable.
I asked this question in some communities, e.g. Artificial Intelligence and Mathematics, and still haven't gotten an answer.
Please help me understand it, thank you!
There is already an explanation in the paper for the reason for scaling down the parameters in every single pass:
The benefit of doing this dynamically instead of during initialization is somewhat
subtle and relates to the scale-invariance in commonly used adaptive stochastic gradient descent
methods such as RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015). These
methods normalize a gradient update by its estimated standard deviation, thus making the update
independent of the scale of the parameter. As a result, if some parameters have a larger dynamic
range than others, they will take longer to adjust. This is a scenario modern initializers cause, and
thus it is possible that a learning rate is both too large and too small at the same time.
I believe multiplying by He's constant in every pass ensures that the dynamic range of the parameters will not be too wide at any point during backpropagation, and therefore they will not take too long to adjust. So if, for example, the discriminator at some point in the learning process adjusts faster than the generator, it will not take long for the generator to catch up, and consequently their learning processes will equalize.
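To see the scale-invariance argument in code, here is a small sketch (the numbers are made up): two parameters start at very different scales, and Adam's normalized update moves both by roughly the same absolute amount, so the larger-scale one takes far more steps to change by the same relative amount.

import torch

# Two parameters at very different scales (as different fan-ins would give under He init)
small = torch.nn.Parameter(torch.tensor(0.01))
large = torch.nn.Parameter(torch.tensor(1.0))
opt = torch.optim.Adam([small, large], lr=1e-3)

loss = (small - 0.02) ** 2 + (large - 2.0) ** 2
loss.backward()
opt.step()

# Adam divides each update by the gradient's running scale, so both parameters
# move by roughly lr = 1e-3: 'small' doubles in ~10 steps, 'large' needs ~1000.
print(small.item(), large.item())

Scaling every layer by its He constant at runtime keeps all weights in the same dynamic range, so this mismatch does not arise.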

Tensorflow, tf.gradients calculations

I am learning how to use Tensorflow and at one particular point I am really stuck and cannot make sense of it. Imagine I have a 5-layer network whose output is represented by output. Now suppose I want to find the gradient of output with respect to layer_2. For that purpose, the code I would write in Tensorflow is something like:
gradients_i_want = tf.gradients(output, layer_2)
Theoretically, this gradient should be calculated via the chain rule. I want to ask whether Tensorflow calculates these gradients via the chain rule, or whether it just takes the derivative of output with respect to layer_2 directly.
Tensorflow will create a graph for your model, where each node is an operation (e.g. addition, multiplication, or a combination of them). Basic ops have manually defined gradient functions, and those functions will be used when applying the chain rule while traveling backwards through the graph.
If you write your own custom op, you might need to also write the corresponding gradient function.
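A minimal TF 1.x sketch (layer sizes and names are made up): tf.gradients walks the graph backwards from output to layer_2 and multiplies the registered gradient of every op in between, i.e. it applies the chain rule through exactly the ops that separate the two tensors.

import tensorflow as tf  # TF 1.x

x = tf.placeholder(tf.float32, [None, 3])
layer_1 = tf.layers.dense(x, 8, activation=tf.nn.relu)
layer_2 = tf.layers.dense(layer_1, 8, activation=tf.nn.relu)
layer_3 = tf.layers.dense(layer_2, 8, activation=tf.nn.relu)
output = tf.layers.dense(layer_3, 1)

# Chain rule through layer_3 and the output layer only; the layers
# below layer_2 never enter this computation.
gradients_i_want = tf.gradients(output, layer_2)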

Neural network toy model to fit sine function fails, what's wrong?

Graduate student here, new to Keras and neural networks, trying to fit a very simple feedforward neural network to a one-dimensional sine.
Below are three examples of the best fit that I could get. On the plots, you can see the output of the network vs. the ground truth.
The complete code, just a few lines, is posted here: example Keras.
I played with the number of layers, different activation functions, different initializations, different loss functions, the batch size, and the number of training samples. It seems that none of those were able to improve the results beyond the above examples.
I would appreciate any comments and suggestions. Is sine a hard function for a neural network to fit? I suspect the answer is no, so I must be doing something wrong...
There is a similar question here from 5 years ago, but the OP there didn't provide the code and it is still not clear what went wrong or how he was able to resolve this problem.
In order to make your code work, you need to:
scale the input values to the [-1, +1] range (neural networks don't like big values)
scale the output values as well, as the tanh activation doesn't work too well close to +/-1
use the relu activation instead of tanh in all but the last layer (it converges way faster)
With these modifications, I was able to run your code with two hidden layers of 10 and 25 neurons.
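A sketch of those modifications (the linked gist is not reproduced here, so the sine frequency and training details below are assumptions; the layer sizes follow the answer):

import numpy as np
from tensorflow import keras

x = np.arange(1000, dtype=np.float32)
y = np.sin(0.1 * x)                                          # assumed frequency

x_scaled = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0   # inputs in [-1, +1]
y_scaled = 0.9 * y                                           # keep targets away from tanh's +/-1

model = keras.Sequential([
    keras.layers.Dense(10, activation="relu", input_shape=(1,)),
    keras.layers.Dense(25, activation="relu"),
    keras.layers.Dense(1, activation="tanh"),
])
model.compile(optimizer="adam", loss="mse")
model.fit(x_scaled.reshape(-1, 1), y_scaled.reshape(-1, 1),
          epochs=200, batch_size=32, verbose=0)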
Since there is already an answer that provides a workaround, I'm going to focus on the problems with your approach.
Input data scale
As others have stated, your input data range of 0 to 1000 is quite big. This problem can easily be solved by scaling your input data to zero mean and unit variance (X = (X - X.mean()) / X.std()), which will improve training performance. For tanh this improvement can be explained by saturation: tanh maps to [-1, 1] and will therefore return either -1 or 1 for almost all sufficiently big (> 3) x, i.e. it saturates. In saturation the gradient of tanh will be close to zero and nothing will be learned. Of course, you could also use ReLU instead, which won't saturate for values > 0; however, you will have a similar problem, as the gradients then depend (almost) solely on x, and therefore later inputs will always have a higher impact than earlier inputs (among other things).
While re-scaling or normalization may be a solution, another solution would be to treat your input as a categorical input and map your discrete values to a one-hot encoded vector, so instead of
>>> X = np.arange(T)
>>> X.shape
(1000,)
you would have
>>> X = np.eye(len(X))
>>> X.shape
(1000, 1000)
Of course this might not be desirable if you want to learn continuous inputs.
Modeling
You are currently trying to model a mapping from a linear function to a non-linear function: you map f(x) = x to g(x) = sin(x). While I understand that this is a toy problem, this way of modeling is limited to this one curve, as f(x) is in no way related to g(x). As soon as you try to model different curves, say both sin(x) and cos(x), with the same network, you will have a problem with your X, as it has exactly the same values for both curves. A better approach to modeling this problem is to predict the next value of the curve, i.e. instead of
X = range(T)
Y = sin(X)
you want
s = sin(range(T))
X = s[:-1]
Y = s[1:]
so for time-step 2 you will get the y value of time-step 1 as input and your loss expects the y value of time-step 2. This way you implicitly model time.
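A sketch of this next-value formulation (the window of past values is an addition on top of the two-liner above, since a single sine value alone does not determine whether the curve is rising or falling; the frequency and sizes are made up):

import numpy as np
from tensorflow import keras

T, window = 1000, 10
s = np.sin(0.1 * np.arange(T, dtype=np.float32))

# Input: a short window of past values; target: the next value of the curve.
X = np.stack([s[i:i + window] for i in range(T - window)])
Y = s[window:].reshape(-1, 1)

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(window,)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, Y, epochs=50, batch_size=32, verbose=0)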

Is there some way to accomplish stochastic gradient descent without writing it from scratch?

For a standard machine learning problem, e.g. image classification on MNIST, the loss function is fixed, and therefore the optimization can be accomplished simply by calling library functions and feeding them the input. There is no need to derive gradients and code the descent procedure by hand.
But now I'm confused by a more complicated formulation. Say we are solving a semi-supervised problem, and the loss function has two parts: Ls + lambda * Lu. The first part is a normal classification loss, e.g. cross-entropy, and the second part varies. In my situation, Lu is a matrix factorization loss, specifically Lu = MF(D, C * W). The total loss function can be written as:
L = \sum log p(yi|xi) + MF(D, C * W)
= \sum log p(yi|Wi) + MF(D, C * W)
= \sum log p(yi|T * Wi + b) + MF(D, C * W)
where the parameters are W, C, T and b. The first part is a classification loss, and the input xi is a row of W, i.e. Wi, a vector of size (d, 1). The label yi is a one-hot vector of size (c, 1), so the parameters T and b map the input to the label size. The second part is a matrix factorization loss.
Now I'm confused about how to optimize this function using SGD. It could be solved by writing down the formulation, deriving the gradients, and implementing a training procedure from scratch, but I'm wondering if there is a simpler way. It's easy to use a deep learning tool like Tensorflow or Keras to train a classification model: all you need to do is build a network and feed it the data.
So, similarly, is there a tool that can automatically compute gradients once I have defined the loss function? Deriving gradients and implementing them from scratch is really tedious. Both the classification loss and the matrix factorization loss are very common, so I would think the combination should be achievable with existing tools.
Theano and Tensorflow will do exactly this for you if you can formulate your optimization problem within their framework/language. These frameworks are also general enough to implement non-NN algorithms, including simple first-order optimization problems like yours.
If that's not possible, you can try autograd, which can do this on a subset of numpy. Just formulate your loss as a numpy function (while sticking to supported functions; read the docs) and let autograd build the gradients.
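A minimal sketch with autograd for the matrix-factorization part alone (the shapes are made up): the loss is an ordinary numpy function and grad builds its gradient.

import autograd.numpy as np
from autograd import grad

D = np.random.randn(200, 32)           # matrix to factorize
C = np.random.randn(200, 100)

def mf_loss(W):
    return np.sum((D - np.dot(C, W)) ** 2)

mf_grad = grad(mf_loss)                # gradient function w.r.t. W, built automatically

W = 0.01 * np.random.randn(100, 32)
for _ in range(200):
    W -= 1e-5 * mf_grad(W)             # plain gradient descent step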
Keep in mind that the graph-construction approach used by Theano & Tensorflow will be more efficient (because of the more constrained input and because these two libraries are more mature).
Both Theano & Tensorflow have automatic differentiation built in, so you only need to define the loss.
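For the combined loss, a TF 1.x sketch along these lines (all shapes, variable names, and the squared-error form of the MF term are assumptions, since the question doesn't fix them) lets the framework derive every gradient:

import tensorflow as tf  # TF 1.x

n, d, c, m = 100, 32, 10, 200                  # made-up sizes
W = tf.Variable(tf.random_normal([n, d]))      # one row Wi per example
C = tf.Variable(tf.random_normal([m, n]))      # factor so that D ~ C * W
T = tf.Variable(tf.random_normal([d, c]))      # maps a row Wi to class scores
b = tf.Variable(tf.zeros([c]))

y = tf.placeholder(tf.int32, [n])              # labels for the supervised part
D = tf.placeholder(tf.float32, [m, d])         # matrix to be factorized

logits = tf.matmul(W, T) + b
Ls = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
Lu = tf.reduce_mean(tf.square(D - tf.matmul(C, W)))   # squared-error MF term

loss = Ls + 0.1 * Lu                           # lambda = 0.1, made up
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)  # gradients derived automatically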

Extracting coefficients with a nonlinear Kernel using sklearn.svm python

This may not be possible in theory; if so, please elaborate.
I am trying to fit some data with Python's sklearn SVM class.
When I use a linear kernel, I can extract the coefficients via the coef_ attribute, which is documented as:
coef_ : array, shape = [n_features] if n_classes == 2 else [n_classes,
n_features] Weights assigned to the features (coefficients in the
primal problem). This is only available in the case of linear kernel.
So I can find the equation of best fit that depends on all the independent variables, and am able to use this equation elsewhere.
Is it possible to do the same (get a non-linear equation) from a nonlinear kernel (like the RBF or the polynomial kernel) using sklearn?
Thanks!
Tim
According to the documentation:
The decision function is:
sgn( \sum_i y_i \alpha_i K(x_i, x) + \rho )
These parameters can be accessed through the attributes dual_coef_, which holds the product y_i \alpha_i, support_vectors_, which holds the support vectors, and intercept_, which holds the independent term \rho ...
("support vectors" means the x_i in the decision function equation).
Each kernel has a different function, which you'll need to understand to compute the K(x_i,x) term.
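A sketch of how the pieces fit together for the RBF kernel, where K(x_i, x) = exp(-gamma * ||x_i - x||^2) (the data here is made up):

import numpy as np
from sklearn.svm import SVC

X = np.random.randn(100, 3)
y = (X[:, 0] + X[:, 1] ** 2 > 0).astype(int)

clf = SVC(kernel="rbf", gamma=0.5).fit(X, y)

x_new = np.random.randn(3)
# Kernel value between every support vector and the new point
K = np.exp(-clf.gamma * np.sum((clf.support_vectors_ - x_new) ** 2, axis=1))
# dual_coef_ holds y_i * alpha_i, intercept_ holds rho
decision = K @ clf.dual_coef_.ravel() + clf.intercept_[0]

print(decision, clf.decision_function(x_new.reshape(1, -1))[0])   # should agree

So you can still evaluate the fitted model elsewhere, but unlike the linear case the result is a sum over support vectors rather than a closed-form equation in the original features.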
