I am looking to use floor() method in one of my models. I would like to understand what pytorch does with its gradient propagation since as such floor is a discontinuous method.
If there is no gradient defined, I could override the backward method to define my own gradient as necessary but I would like to understand what the default behavior is and the corresponding source code if possible.
import torch
x = torch.rand(20, requires_grad=True)
y = 20*x
z = y.floor().sum()
z.backward()
x.grad returns zeros.
z has a grad_fn=
So FloorBackward is the gradient method. But there is no reference to the source code of FloorBackward in pytorch repository.
As the floor function is piece wise constant. This means the gradient must be zero almost everywhere.
While the code doesn't say anything about it, I expect that the gradient is set to a constant zero everywhere.
Related
I am looking for help about the implementation of a logit model with statsmodel for binary variables.
Here is my code :
(I am using feature selection methods : MinimumRedundancyMaximumRelevance and RecursiveFeatureElimination available on python)
for i_mrmr in range(4,20):
for i_rfe in range(3,i_mrmr):
regressors_step1 = I am selecting the MRMR features
regressors_step2 = I am selecting features from the previous list with RFE method
for method in ['newton', 'nm', 'bfgs', 'lbfgs', 'powell', 'cg', 'ncg']:
logit_model = Logit(y,X.loc[:,regressors_step2])
try:
result = logit_model.fit(method=method, cov_type='HC1')
print(result.summary)
except:
result = "error"
I am using Logit from statsmodels.discrete.discrete_model.Logit.
The y variable, the target, is a binary.
All explanatory variables, the X, are binary too.
The logit model is "functionning" for the different optimization methods. That is to say, I end up with some summary to print. Nonetheless, various warnings print such as : "Maximum Likelihood optimization failed to converge."
The optimization methods presented in the statsmodel algorithm are the ones from scipy :
‘newton’ for Newton-Raphson, ‘nm’ for Nelder-Mead
‘bfgs’ for Broyden-Fletcher-Goldfarb-Shanno (BFGS)
‘lbfgs’ for limited-memory BFGS with optional box constraints
‘powell’ for modified Powell’s method
‘cg’ for conjugate gradient
‘ncg’ for Newton-conjugate gradient
We can find these methods on scipy.optimize.
Here are my questions :
I did not find anywhere any argument against the use of these optimization methods for a binary set of variables. But, because of the warnings, I am asking myself if it is correct to do so. And then, what is the best method, the one which is the more appropriate in this case ?
Here : Scipy minimize: how to restrict x only to 0 and 1? it is implicitely said that a model of the kind Python MIP (Mixed-Integer Linear Programming) could be better in the binary set of variables case. In the documentation of the MIP package of python it appears that to implement this kind of model I should explicitly give a function to minimize or maximize and also I should express the constraints... (see : https://docs.python-mip.com/en/latest/quickstart.html#creating-models)
Therefore I am wondering if i need to define a logit function as the objective function ? What are the constraints I should express ? Is there any easier way to do ?
I am working on a weighted version of SparseCategoricalCrossentropy. right now my implementation is converting y_true to one hot form and calculates the cross entropy then multiplies it with a weight matrix. I get the same output between my implementation and SparseCategoricalCrossentropy when weights are all 1 however my problem is with one hot encoding. I have a lot of classes (32+bg) and when using one hot encoding I run out of memory for large images/batch sizes which does not happen with SparseCategoricalCrossentropy. I am trying to figure out how is the built in one implemented (is there a way to avoid one hot encoding etc.). How is the built in one implemented or where is it implemented looking at [1] it is probably implemented on the native side but I can not find it?
[1] https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/losses.py#L692
The SparseCategoricalCrossentropy documentation has a "View Source on GitHub" tab you can click on. This will show you the implementation. Doing this leads us to line 666 of tensorflow.python.keras.losses. We can see from the class definition that it wraps a function sparse_categorical_crossentropy which is defined on line 4867 of tensorflow.keras.backend. We can see at the bottom of the function definition this is a wrapper around tf.nn.sparse_softmax_cross_entropy_with_logits and this function definition can be found in tensorflow.python.ops.nn_ops. At the bottom of this function definition, we can see it is a wrapper around gen_nn_ops.sparse_softmax_cross_entropy_with_logits. If you look for gen_nn_ops, you won't find it. It is the name of the *.so file that python imports to run tensorflow's C++ op code. So what we are really looking for is a sparse softmax C++ kernel, which can be found in tensorflow.core.kernels.sparse_xent_op.cc. This op calls a functor which calls a method SparseXentEigenImpl whose implementation can be found in the corresponding header file, sparse_xent_op.h. And starting on line 47 of that file you can see how they create the sparse loss.
// Generator for calculation of the sparse Xent loss.
// This generator takes the logits, the sum of the exponentiated
// logits, and the label indices. For each minibatch entry, ignoring
// the batch index b, it calculates:
//
// loss[j] = (log(sum_exp_logits) - logits[j]) * 1{ j == label }
//
// for j = 0 .. num_classes. This value must be summed over all j for
// the final loss.
And on line 224 there is a comment of outlining the loss calculation formula.
// sum(-labels *
// ((logits - max_logits) - log(sum(exp(logits - max_logits)))))
// along classes
Not sure if this helps you create your weighted op, but this is how sparse xent is calculated in tensorflow.
Edit:
There also is a method tf.nn.weighted_cross_entropy_with_logits. Not sure if that will work with your sparsity requirement, but will probably work better than trying to implement something yourself.
I want to implement the following sigmoid function with a custom slope parameter k.
y = f(x)= 1/ ( 1+exp(-1*k*x))
gradient gy = k * f(x)*(1-f(x))
I want to use this in my autoencoder. How do I implement this in Chainer?
If k is constant (i.e., a hyperparameter), F.sigmoid(k * x) should just work.
If k is a parameter that should be learned in the same way as other weights, you may want to subclass a link like L.PReLU, and use it just like other links, e.g. L.Linear and L.Convolution2D. You can still implement the forward method of the link like the above simple expression.
An activation function should be a subclass of Chainer.FunctionNode (FunctionNode docs). An example of this is the Swish function provided by chainer library. You can observe its source here and clone it (or any other function such as tanh) to make necessary changes to its forward and backward operation declaration to fit it to your needs.
I have two loss functions here to be minimized:
The first one is a local one, where:
min f1(x1),
min f2(x2),
min f3(x3),....
min fn(xn)
The other one is global one, where:
min f(x1,x2,...,xn) = f1(x1)+f2(x2)+...fn(xn)
For each local problem fi(x), I have 2 variables to be optimized, and I have 1000 local problems. Correspondingly, for the global problem, I have 2000 variables to be optimized. Surely the 2nd one has more parameters to be optimized, but since f1, f2, f3...fn are independent with each other, I hope they two should be comparable.
I use the scipy minimize function for optimization (scipy.optimize.minimize). But the 2nd one much much slower than the 1st one.
The only drawback of the global one, i think, is taking more gradients than it actually need to. For example, the gradient of x1 only comes from f1, but the global computes its gradient from f2, f3... fn, which is 0. Thus, making it slower. If that is the case, I do hope there would be some ways for acceleration.
BTW, since I later on need to add a global constraint to the optimization, this is why I must use the global loss function instead of the local one.
I think your guess is correct that the amount of time that it takes more is because it needs to compute the gradients. Based on the description page for scipy.optimize.minimize (https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html), it seems that the method computes the gradient numerically if you provide jac = False (optional and set to False by default).
jac : bool or callable, optional
Jacobian (gradient) of objective function. Only for CG, BFGS, Newton-CG, L-BFGS- B, TNC, SLSQP, dogleg, trust-ncg, trust-krylov, trust-region-exact. If jac is a Boolean and is True, fun is assumed to return the gradient along with the objective function. If False, the gradient will be estimated numerically. jac can also be a callable returning the gradient of the objective. In this case, it must accept the same arguments as fun.
Based on the above, you can set jac = True and then you should provide your function as a callable that returns function value as well as the gradients. This should speed up the process.
One other way is to write your own customizable minimizer as callable.
This is a follow up question to this one.
I'm still working on the cifar10 example on the file cifar10.py and noticed some strange behavior regarding the creation of variables.
But first a side-question: Why are the variables created with weight decay factor of wd=0.0 and not wd=None? That way you would have less vertices in the computation graph.
Next, the strange behavior. I added the following function to make it more convenient to create variables:
def _create_variable(name, shape, initializer, wd=None):
dtype = tf.float16 if FLAGS.use_fp16 else tf.float32
with tf.device('/cpu:0'):
var = tf.get_variable(name, shape, dtype, initializer)
if wd is not None:
wd_val = tf.mul(tf.nn.l2_loss(var), wd, name='weight_loss')
tf.add_to_collection('losses', wd_val)
return var
When using this function to create the variables (with the original parameters), the logits that are computed come from a range of +-1e13 for the first batch, gradually getting better reaching +-1.5. The loss on the other hand starts at around 400000 and gets bigger until it hits NaN.
When using the original functions to create the variables, the logits come from a range of +-1 right from the beginning and the loss start at around 4.5, gradually getting smaller.
Can somebody explain to me what the difference between my and the provided functions for variable generation is, and why the effect is so huge? I don't see it.
The full code of my modified cifar10.py can be found here. To test it out simple replace the original file with my version. To than switch between the original and my function simply change line 212 to CUSTOM = False
Thank you in advance.
Stupid me! I used my own function the wrong way and passed the values for stddev as the mean and used the default stddev of 1.
The curse of not addressing the arguments by their name.
Anyway, why does this cause such a huge loss; sometimes even NaN?