I am working on a weighted version of SparseCategoricalCrossentropy. Right now my implementation converts y_true to one-hot form, calculates the cross entropy, and then multiplies it with a weight matrix. I get the same output as SparseCategoricalCrossentropy when the weights are all 1; my problem is with the one-hot encoding itself. I have a lot of classes (32 + background), and when using one-hot encoding I run out of memory for large images / batch sizes, which does not happen with SparseCategoricalCrossentropy. I am trying to figure out how the built-in loss is implemented (is there a way to avoid one-hot encoding, etc.?). How is it implemented, or where? Looking at [1], it is probably implemented on the native side, but I cannot find it.
[1] https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/losses.py#L692
The SparseCategoricalCrossentropy documentation has a "View Source on GitHub" tab you can click on; this shows you the implementation. Doing this leads us to line 666 of tensorflow.python.keras.losses. We can see from the class definition that it wraps the function sparse_categorical_crossentropy, which is defined on line 4867 of tensorflow.keras.backend. At the bottom of that function definition we can see it is a wrapper around tf.nn.sparse_softmax_cross_entropy_with_logits, whose definition can be found in tensorflow.python.ops.nn_ops. At the bottom of that definition, we can see it in turn wraps gen_nn_ops.sparse_softmax_cross_entropy_with_logits. If you look for gen_nn_ops, you won't find it: it is the name of the *.so file that python imports to run tensorflow's C++ op code. So what we are really looking for is a sparse softmax C++ kernel, which can be found in tensorflow.core.kernels.sparse_xent_op.cc. This op calls a functor, which calls a method SparseXentEigenImpl whose implementation can be found in the corresponding header file, sparse_xent_op.h. Starting on line 47 of that file you can see how they create the sparse loss.
// Generator for calculation of the sparse Xent loss.
// This generator takes the logits, the sum of the exponentiated
// logits, and the label indices. For each minibatch entry, ignoring
// the batch index b, it calculates:
//
// loss[j] = (log(sum_exp_logits) - logits[j]) * 1{ j == label }
//
// for j = 0 .. num_classes. This value must be summed over all j for
// the final loss.
And on line 224 there is a comment outlining the loss calculation formula.
// sum(-labels *
// ((logits - max_logits) - log(sum(exp(logits - max_logits)))))
// along classes
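In plain numpy terms, for a single example with an integer label the formula above reduces to a stable log-sum-exp. A tiny illustration with made-up values:

import numpy as np

# illustrative logits for one example with 3 classes and true label 0
logits = np.array([2.0, 1.0, 0.1])
label = 0
max_logit = logits.max()
# loss = log(sum(exp(logits - max))) - (logits[label] - max)
loss = np.log(np.exp(logits - max_logit).sum()) - (logits[label] - max_logit)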
Not sure if this helps you create your weighted op, but this is how sparse xent is calculated in tensorflow.
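For the memory problem in the question, the key point is that you can keep the labels in integer form and fetch a per-class weight with tf.gather instead of multiplying by a one-hot matrix. A minimal sketch (the function name and the 1-D class_weights tensor, one entry per class, are illustrative assumptions, not part of TensorFlow):

import tensorflow as tf

def weighted_sparse_ce(y_true, logits, class_weights):
    # per-element loss computed directly from integer labels (no one-hot)
    labels = tf.cast(y_true, tf.int32)
    per_elem = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits)
    # look up each element's class weight instead of using a one-hot product
    weights = tf.gather(class_weights, labels)
    return tf.reduce_mean(per_elem * weights)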
Edit:
There is also a method tf.nn.weighted_cross_entropy_with_logits. Not sure if that will work with your sparsity requirement, but it will probably work better than trying to implement something yourself.
Related
I am looking for help with the implementation of a logit model with statsmodels for binary variables.
Here is my code:
(I am using the feature selection methods MinimumRedundancyMaximumRelevance and RecursiveFeatureElimination, available in Python.)
from statsmodels.discrete.discrete_model import Logit

for i_mrmr in range(4, 20):
    for i_rfe in range(3, i_mrmr):
        regressors_step1 = ...  # features selected with the MRMR method
        regressors_step2 = ...  # features from regressors_step1, refined with RFE
        for method in ['newton', 'nm', 'bfgs', 'lbfgs', 'powell', 'cg', 'ncg']:
            logit_model = Logit(y, X.loc[:, regressors_step2])
            try:
                result = logit_model.fit(method=method, cov_type='HC1')
                print(result.summary())
            except Exception:
                result = "error"
I am using Logit from statsmodels.discrete.discrete_model.
The y variable, the target, is binary.
All the explanatory variables, the X, are binary too.
The logit model is "functioning" for the different optimization methods; that is to say, I end up with a summary to print. Nonetheless, various warnings are printed, such as: "Maximum Likelihood optimization failed to converge."
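As a side note, the convergence flag can also be checked programmatically instead of relying on the printed warning. A small sketch, assuming the same y, X and regressors_step2 as above (the exact keys in mle_retvals may differ between optimizers):

result = Logit(y, X.loc[:, regressors_step2]).fit(
    method='bfgs', maxiter=500, cov_type='HC1', disp=False)
if not result.mle_retvals.get('converged', False):
    print('did not converge for', list(regressors_step2))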
The optimization methods offered by statsmodels are the ones from scipy:
‘newton’ for Newton-Raphson
‘nm’ for Nelder-Mead
‘bfgs’ for Broyden-Fletcher-Goldfarb-Shanno (BFGS)
‘lbfgs’ for limited-memory BFGS with optional box constraints
‘powell’ for modified Powell’s method
‘cg’ for conjugate gradient
‘ncg’ for Newton-conjugate gradient
These methods can be found in scipy.optimize.
Here are my questions:
I did not find any argument anywhere against the use of these optimization methods for a binary set of variables. But because of the warnings, I am asking myself whether it is correct to do so, and if so, which method is the most appropriate in this case.
Here: Scipy minimize: how to restrict x only to 0 and 1? it is implicitly suggested that a model of the Python MIP (Mixed-Integer Linear Programming) kind could be better for the binary-variables case. From the documentation of the Python MIP package, it appears that to implement this kind of model I should explicitly give a function to minimize or maximize and also express the constraints (see: https://docs.python-mip.com/en/latest/quickstart.html#creating-models).
Therefore I am wondering whether I need to define a logit function as the objective function. What constraints should I express? Is there an easier way to do this?
I want to implement the following sigmoid function with a custom slope parameter k.
y = f(x) = 1 / (1 + exp(-k * x))
gradient: gy = k * f(x) * (1 - f(x))
I want to use this in my autoencoder. How do I implement this in Chainer?
If k is constant (i.e., a hyperparameter), F.sigmoid(k * x) should just work.
If k is a parameter that should be learned in the same way as other weights, you may want to subclass a link, as L.PReLU does, and use it just like other links, e.g. L.Linear and L.Convolution2D. You can still implement the forward pass of the link with the simple expression above.
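A minimal sketch of such a link with a learnable scalar slope (the class name and the initial value 1.0 are illustrative, not part of Chainer):

import numpy as np
import chainer
import chainer.functions as F

class ScaledSigmoid(chainer.Link):
    def __init__(self, k_init=1.0):
        super(ScaledSigmoid, self).__init__()
        with self.init_scope():
            # k is registered as a parameter, so the optimizer updates it
            self.k = chainer.Parameter(np.array(k_init, dtype=np.float32))

    def __call__(self, x):
        # forward: sigmoid(k * x); Chainer's autograd yields k * f(x) * (1 - f(x))
        return F.sigmoid(self.k * x)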
An activation function should be a subclass of chainer.FunctionNode (see the FunctionNode docs). An example of this is the Swish function provided by the Chainer library. You can look at its source and copy it (or any other function such as tanh), then adapt its forward and backward definitions to fit your needs.
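For reference, a stripped-down sketch of what such a FunctionNode subclass might look like for a fixed slope k (CPU-only numpy for brevity; the real Swish also handles GPU arrays and a learnable parameter):

import numpy as np
from chainer import function_node

class ScaledSigmoidFn(function_node.FunctionNode):
    # f(x) = 1 / (1 + exp(-k * x)) with a fixed slope k
    def __init__(self, k):
        self.k = k

    def forward(self, inputs):
        x, = inputs
        y = 1.0 / (1.0 + np.exp(-self.k * x))
        self.retain_outputs((0,))          # keep y for the backward pass
        return y.astype(x.dtype),

    def backward(self, indexes, grad_outputs):
        gy, = grad_outputs
        y, = self.get_retained_outputs()
        # gradient: gy * k * f(x) * (1 - f(x))
        return gy * self.k * y * (1.0 - y),

def scaled_sigmoid(x, k=1.0):
    y, = ScaledSigmoidFn(k).apply((x,))
    return y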
I am using scipy's minimize function, and I'd like one of the parameters to be searched only over values with two decimals.
from scipy.optimize import minimize
from sklearn.metrics import mean_squared_error

def cost(parameters, input, target):
    output = self.model(parameters=parameters, input=input)
    return mean_squared_error(target.flatten(), output.flatten())

parameters = [1, 1]  # initial parameters
res = minimize(fun=cost, x0=parameters, args=(input, target))
model_parameters = res.x
Here self.model is a function that performs some matrix manipulation based on the parameters; input and target are two matrices. The function works the way I want, except that I would like parameters[1] to be constrained. Ideally I'd just like to give it a numpy array, like np.arange(0, 10, 0.01). Is this possible?
In general this is very hard to do, as smoothness is one of the core assumptions of those optimizers.
Problems where some variables are discrete and some are not are hard, and are usually tackled either by mixed-integer optimization (which works well for mixed-integer linear programming and reasonably well for mixed-integer convex programming, although there are fewer good solvers) or by global optimization (usually derivative-free).
Depending on the details of your task, I recommend decomposing the problem:
an outer loop that fixes the variable to each value in np.arange(0, 10, 0.01)
an inner loop that optimizes the remaining variables with this one fixed
return the model with the best objective (with status=success)
This results in N inner optimizations, where N is the size of the state space of the variable being fixed.
Depending on your task/data, it might be a good idea to traverse the fixing space monotonically (as np.arange does) and use the solution of iteration i as the initial point for problem i+1 (potentially fewer iterations if the guess is good). But this is probably not relevant here; see the next part.
If you really have only 2 parameters, as indicated, this decomposition leaves an inner problem with a single variable. In that case don't use minimize; use minimize_scalar (faster and more robust; it does not need an initial point).
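A rough sketch of that decomposition, reusing the cost function from the question (model, input_data and target stand in for the asker's own objects):

import numpy as np
from scipy.optimize import minimize_scalar

best_fun, best_params = np.inf, None
for p1 in np.arange(0, 10, 0.01):        # outer loop: fix the two-decimal parameter
    # inner 1-D problem over the remaining free parameter
    res = minimize_scalar(lambda p0: cost([p0, p1], input_data, target))
    if res.success and res.fun < best_fun:
        best_fun, best_params = res.fun, [res.x, p1]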
As far as I understand the minimize function with the trust-ncg method, the method-specific parameter max_trust_radius is the maximum step size for a new optimization step.
However, I experience a weird behaviour.
I am working with my doctorate data and have code that invokes the minimize function (with the trust-ncg method),
passing the parameters
opt_par = {
    'initial_trust_radius': 0.1,
    'max_trust_radius': 1,
    'eta': 0.15,
    'gtol': 1e-5,
    'disp': True
}
I invoke minimize function as:
res = minimize(bbox, x0, method='trust-ncg',jac=bbox_der, hess=bbox_hess,options=opt_par)
where
bbox is the function that evaluates the objective function
x0 is the initial guess
bbox_der is the gradient function
bbox_hess is the Hessian function
opt_par is the dictionary above with the parameters.
bbox invokes the simulation code and gets the data back. It works: minimize goes back and forth proposing new values, and bbox invokes the simulation.
Everything works well until I hit a weird issue.
The "x" vector contains 8 values. I noticed that in one of the iterations the last value is greater than 1.
Given max_trust_radius, I think it should be less than 1, but it is 1.0621612802208713e+00.
This causes problems because bbox cannot accept a value of 1 or greater: it passes the values to a simulation program that has this constraint.
I looked at the scipy code and tried to find a bug or something wrong, but could not.
My main concerns are:
My understanding is that there is a bug in the scipy minimize code, since the new value is greater than max_trust_radius.
How can I control the values so that they never become greater than 1?
Can you suggest how to investigate the issue?
The max_trust_radius controls how large steps you are allowed to take:
max_trust_radius : float
Maximum value of the trust-region radius.
No steps that are longer than this value will be proposed.
Since you are very likely to take many steps during the minimization, each of which can be up to 1 long, it is not strange at all that you (assuming ||x0|| = 0) end up with ||x|| > 1.
If your problem is strictly bounded then you need to apply an optimization algorithm that supports bounds on the parameters.
For scipy.optimize.minimize only L-BFGS-B, TNC and SLSQP methods seem to support the bounds= keyword.
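For example, a hedged sketch of switching the call from the question to a bounded method (the 0.999 upper limit is just a stand-in for "strictly below 1", and L-BFGS-B does not use the Hessian):

from scipy.optimize import minimize

bounds = [(None, 0.999)] * len(x0)   # cap every component just below 1
res = minimize(bbox, x0, method='L-BFGS-B', jac=bbox_der, bounds=bounds)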
This is a follow up question to this one.
I'm still working on the cifar10 example, in the file cifar10.py, and noticed some strange behavior regarding the creation of variables.
But first a side question: why are the variables created with a weight decay factor of wd=0.0 rather than wd=None? That way you would have fewer vertices in the computation graph.
Next, the strange behavior. I added the following function to make it more convenient to create variables:
def _create_variable(name, shape, initializer, wd=None):
    dtype = tf.float16 if FLAGS.use_fp16 else tf.float32
    with tf.device('/cpu:0'):
        var = tf.get_variable(name, shape, dtype, initializer)
    if wd is not None:
        wd_val = tf.mul(tf.nn.l2_loss(var), wd, name='weight_loss')
        tf.add_to_collection('losses', wd_val)
    return var
When using this function to create the variables (with the original parameters), the computed logits are in a range of ±1e13 for the first batch, gradually improving until they reach ±1.5. The loss, on the other hand, starts at around 400000 and grows until it hits NaN.
When using the original functions to create the variables, the logits are in a range of ±1 right from the beginning and the loss starts at around 4.5, gradually getting smaller.
Can somebody explain what the difference is between my function and the provided ones for variable creation, and why the effect is so huge? I don't see it.
The full code of my modified cifar10.py can be found here. To test it out, simply replace the original file with my version. To then switch between the original and my function, change line 212 to CUSTOM = False.
Thank you in advance.
Stupid me! I used my own function the wrong way: I passed the value meant for stddev as the mean, and used the default stddev of 1.
The curse of not addressing the arguments by their name.
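To illustrate the mix-up: the first positional argument of tf.truncated_normal_initializer is mean, not stddev (the 5e-2 value below is just an example):

# what was effectively created: mean=0.05, stddev=1.0 (the default)
init_wrong = tf.truncated_normal_initializer(5e-2)
# what was intended: mean=0.0, stddev=0.05
init_right = tf.truncated_normal_initializer(stddev=5e-2)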
Anyway, why does this cause such a huge loss; sometimes even NaN?