I am trying to debug my TensorFlow code, which suddenly produces a NaN loss after about 30 epochs. You may find my specific problem and the things I tried in this SO question.
I monitored the weights of all layers for each mini-batch during training and found that the weights suddenly jump to NaN although all weight values were less than 1 during the previous iteration (I have set kernel_constraint max_norm to 1). This makes it very hard to figure out which operation is the culprit.
PyTorch has a handy debugging method, torch.autograd.detect_anomaly, which raises an error at any backward computation that produces a NaN value and shows the traceback. This makes it easy to debug the code.
Is there something similar in TensorFlow? If not, can you suggest a method to debug this?
There is indeed a similar debugging tool in TensorFlow: tf.debugging.check_numerics.
It can be used to track down the tensors that produce inf or NaN values during training. As soon as such a value is found, TensorFlow raises an InvalidArgumentError.
tf.debugging.check_numerics(LayerN, "LayerN is producing nans!")
If the tensor LayerN contains NaNs, you get an error like this:
Traceback (most recent call last):
File "trainer.py", line 506, in <module>
worker.train_model()
File "trainer.py", line 211, in train_model
l, tmae = train_step(*batch)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
result = self._call(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 855, in _call
return self._stateless_fn(*args, **kwds) # pylint: disable=not-callable
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2943, in __call__
filtered_flat_args, captured_inputs=graph_function.captured_inputs) # pylint: disable=protected-access
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 560, in call
ctx=ctx)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: LayerN is producing nans! : Tensor had NaN values
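The fail-fast idea behind that call can be sketched without TensorFlow. The helper below, check_finite, is a hypothetical plain-Python stand-in (it is not a TensorFlow API) that works on a list of floats: it scans the values and raises at the first NaN or inf, tagging the error with the message you supply. In real training code you would wrap each suspect tensor in tf.debugging.check_numerics instead, and the first check that fires tells you which stage introduced the bad value.

```python
import math


def check_finite(values, message):
    """Return `values` unchanged, or raise ValueError at the first NaN/inf.

    Hypothetical stand-in for tf.debugging.check_numerics, operating on a
    flat list of Python floats instead of a tensor.
    """
    for index, value in enumerate(values):
        if math.isnan(value) or math.isinf(value):
            raise ValueError(f"{message} : bad value {value!r} at index {index}")
    return values


# Check each intermediate result in order; the first check that raises
# names the stage where non-finite values first appeared.
activations = check_finite([0.5, -0.25, 0.9], "layer_1 output")
# Stand-in for an op that is undefined for some inputs (log of a negative).
logits = [math.log(a) if a > 0 else float("nan") for a in activations]

try:
    check_finite(logits, "logits")
    failed_stage = None
except ValueError as err:
    failed_stage = str(err)
```

Here the first check passes and the second one fails, which localizes the problem to the step between them.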
I have used NSGA-Net neural architecture search to generate and train several architectures. I am trying to generate PGD adversarial examples using my trained PyTorch models. I tried using both Adversarial Robustness Toolbox 1.3 (ART) and torchattacks 2.4 but I get the same error.
These few lines of code describe the main functionality of my code and what I am trying to achieve here:
from torchattacks import PGD

# net is my trained NSGA-Net PyTorch model
# Define the PGD attack
pgd_attack = PGD(net, eps=4 / 255, alpha=2 / 255, steps=3)
# Create adversarial examples from the validation data using the PGD attack
for images, labels in valid_data:
    images = pgd_attack(images, labels).cuda()
    outputs = net(images)
So here is what the error generally looks like:
Traceback (most recent call last):
File "torch-attacks.py", line 296, in <module>
main()
File "torch-attacks.py", line 254, in main
images = pgd_attack(images, labels).cuda()
File "\Anaconda3\envs\GPU\lib\site-packages\torchattacks\attack.py", line 114, in __call__
images = self.forward(*input, **kwargs)
File "\Anaconda3\envs\GPU\lib\site-packages\torchattacks\attacks\pgd.py", line 57, in forward
outputs = self.model(adv_images)
File "\envs\GPU\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "\codes\NSGA\nsga-net\models\macro_models.py", line 79, in forward
x = self.gap(self.model(x))
File "\Anaconda3\envs\GPU\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "\Anaconda3\envs\GPU\lib\site-packages\torch\nn\modules\container.py", line 100, in forward
input = module(input)
File "\Anaconda3\envs\GPU\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "\codes\NSGA\nsga-net\models\macro_decoder.py", line 978, in forward
x = self.first_conv(x)
File "\Anaconda3\envs\GPU\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "\Anaconda3\envs\GPU\lib\site-packages\torch\nn\modules\conv.py", line 345, in forward
return self.conv2d_forward(input, self.weight)
File "\Anaconda3\envs\GPU\lib\site-packages\torch\nn\modules\conv.py", line 342, in conv2d_forward
self.padding, self.dilation, self.groups)
RuntimeError: Expected object of scalar type Float but got scalar type Double for argument #2 'weight' in call to _thnn_conv2d_forward
I have used the same code with a simple PyTorch model and it worked, but I am using NSGA-Net, so I haven't designed the model myself. I also tried calling .float() on both the model and the inputs and still got the same error.
Keep in mind that I only have access to the following files:
torch-attacks.py
macro_models.py
macro_decoder.py
You should convert images to the desired type (images.float() in your case). Labels must not be converted to any floating type. They are allowed to be either int or long tensors.
I'm trying to train a convolutional neural network using Keras/Tensorflow. My model compiles correctly, but as soon as training begins the following error is returned:
Using TensorFlow backend.
Epoch 1/3
Traceback (most recent call last):
File "./main.py", line 17, in <module>
history = CNN.fit(TrainImages, TrainMasks, epochs = 3)
File "/home/tomhalmos/.local/lib/python3.6/site-packages/keras/engine/training.py", line 1239, in fit
validation_freq=validation_freq)
File "/home/tomhalmos/.local/lib/python3.6/site-packages/keras/engine/training_arrays.py", line 196, in fit_loop
outs = fit_function(ins_batch)
File "/home/tomhalmos/.local/lib/python3.6/site-packages/tensorflow_core/python/keras/backend.py", line 3727, in __call__
outputs = self._graph_fn(*converted_inputs)
File "/home/tomhalmos/.local/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1551, in __call__
return self._call_impl(args, kwargs)
File "/home/tomhalmos/.local/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1591, in _call_impl
return self._call_flat(args, self.captured_inputs, cancellation_manager)
File "/home/tomhalmos/.local/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/home/tomhalmos/.local/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 545, in call
ctx=ctx)
File "/home/tomhalmos/.local/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: BiasGrad requires tensor size <= int32 max
[[node gradients/conv2d_22/BiasAdd_grad/BiasAddGrad (defined at /home/tomhalmos/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:3009) ]] [Op:__inference_keras_scratch_graph_5496]
Function call stack:
keras_scratch_graph
Happy to provide any further details if the above is not sufficient.
The bounds checking is on the number of elements in a tensor. The size is limited to 2.147 billion values (int32).
Take your image size (h x v) times the sample batch size. Multiply that by the number of channels in your operation (such as Conv2D). The place where you get a count larger than 2.1e9 is the guilty operation. There is no solution that I can see other than reducing one of those numbers.
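The arithmetic above can be checked with a few lines of plain Python. This is only a sketch: the function names and the example sizes are hypothetical, chosen to show one tensor under the limit and one over it. Plug in your own batch size, image dimensions, and per-layer channel counts.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647, the "2.147 billion" limit


def element_count(batch_size, height, width, channels):
    """Number of values in a (batch, height, width, channels) tensor."""
    return batch_size * height * width * channels


def exceeds_int32(batch_size, height, width, channels):
    """True if the tensor holds more elements than fit in an int32 count."""
    return element_count(batch_size, height, width, channels) > INT32_MAX


# Hypothetical sizes, chosen only to show both outcomes.
small = element_count(8, 512, 512, 64)      # 8 images of 512x512, 64 channels
large = element_count(32, 4096, 4096, 64)   # 32 images of 4096x4096, 64 channels
```

Running the check for every layer's output shape points at the operation whose count crosses the limit.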
I switched my job to run on the GPU and it works well.
When using custom estimators in Tensorflow 2, when the model contains BatchNorm or Dropout layers, tf fails while building the graph with the following error. It works just fine when I comment out the Dropout and BatchNorm layers.
The model I use is a simple CNN model with two conv blocks and dense layer at the end:
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import (BatchNormalization, Conv2D, Dense,
                                     Dropout, Flatten, MaxPool2D, ReLU)
from tensorflow.keras.optimizers import Adam

def build_conv_block(x: Model, filter_map_count: int, name: str):
    x = Conv2D(filter_map_count, (3, 3), name=f'{name}_conv_2d')(x)
    x = BatchNormalization(name=f'{name}_bn')(x)  # <-- error when not commented out
    x = ReLU(name=f'{name}_relu')(x)
    x = MaxPool2D((2, 2), name=f'{name}_max_pool_2d')(x)
    x = Dropout(0.25, name=f'{name}_dropout')(x)  # <-- error when not commented out
    return x

def get_model(params):
    input_image = Input(shape=params.input_shape)
    x = build_conv_block(input_image, filter_map_count=64, name='layer_1')
    x = build_conv_block(x, filter_map_count=128, name='layer_2')
    x = Flatten(name='flatten_conv')(x)
    output_pred = Dense(10, activation='softmax', name='output')(x)
    model = Model(inputs=input_image, outputs=output_pred)
    model.optimizer = Adam(learning_rate=params.learning_rate)
    return model
I have a standard train_op in the model_fn that takes mnist images and labels as input and the class as output:
# Calculate gradients
with tf.GradientTape() as tape:
    y_pred = model(features, training=training)
    loss = tf.losses.categorical_crossentropy(labels, y_pred)
if mode == tf.estimator.ModeKeys.TRAIN:
    gradients = tape.gradient(loss, model.trainable_variables)
    train_op = model.optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
Here's the traceback of the error I get:
Traceback (most recent call last):
File "F:/Projects/python/my_project/train.py", line 38, in <module>
tf.estimator.train_and_evaluate(estimator, train_spec=train_spec, eval_spec=eval_spec)
File "F:\Python\envs\tf2\lib\site-packages\tensorflow_estimator\python\estimator\training.py", line 473, in train_and_evaluate
return executor.run()
File "F:\Python\envs\tf2\lib\site-packages\tensorflow_estimator\python\estimator\training.py", line 613, in run
return self.run_local()
File "F:\Python\envs\tf2\lib\site-packages\tensorflow_estimator\python\estimator\training.py", line 714, in run_local
saving_listeners=saving_listeners)
File "F:\Python\envs\tf2\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "F:\Python\envs\tf2\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 1160, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "F:\Python\envs\tf2\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 1190, in _train_model_default
features, labels, ModeKeys.TRAIN, self.config)
File "F:\Python\envs\tf2\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 1148, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "F:\Projects\python\my_project\model.py", line 62, in model_fn
gradients = tape.gradient(loss, model.trainable_variables)
File "F:\Python\envs\tf2\lib\site-packages\tensorflow_core\python\eager\backprop.py", line 1014, in gradient
unconnected_gradients=unconnected_gradients)
File "F:\Python\envs\tf2\lib\site-packages\tensorflow_core\python\eager\imperative_grad.py", line 76, in imperative_grad
compat.as_str(unconnected_gradients.value))
File "F:\Python\envs\tf2\lib\site-packages\tensorflow_core\python\eager\backprop.py", line 138, in _gradient_function
return grad_fn(mock_op, *out_grads)
File "F:\Python\envs\tf2\lib\site-packages\tensorflow_core\python\ops\cond_v2.py", line 120, in _IfGrad
true_graph, grads, util.unique_grad_fn_name(true_graph.name))
File "F:\Python\envs\tf2\lib\site-packages\tensorflow_core\python\ops\cond_v2.py", line 395, in _create_grad_func
func_graph=_CondGradFuncGraph(name, func_graph))
File "F:\Python\envs\tf2\lib\site-packages\tensorflow_core\python\framework\func_graph.py", line 915, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "F:\Python\envs\tf2\lib\site-packages\tensorflow_core\python\ops\cond_v2.py", line 394, in <lambda>
lambda: _grad_fn(func_graph, grads), [], {},
File "F:\Python\envs\tf2\lib\site-packages\tensorflow_core\python\ops\cond_v2.py", line 373, in _grad_fn
src_graph=func_graph)
File "F:\Python\envs\tf2\lib\site-packages\tensorflow_core\python\ops\gradients_util.py", line 550, in _GradientsHelper
gradient_uid)
File "F:\Python\envs\tf2\lib\site-packages\tensorflow_core\python\ops\gradients_util.py", line 175, in _DefaultGradYs
constant_op.constant(1, dtype=y.dtype, name="grad_ys_%d" % i)))
File "F:\Python\envs\tf2\lib\site-packages\tensorflow_core\python\framework\constant_op.py", line 227, in constant
allow_broadcast=True)
File "F:\Python\envs\tf2\lib\site-packages\tensorflow_core\python\framework\constant_op.py", line 265, in _constant_impl
allow_broadcast=allow_broadcast))
File "F:\Python\envs\tf2\lib\site-packages\tensorflow_core\python\framework\tensor_util.py", line 484, in make_tensor_proto
(dtype, nparray.dtype, values))
TypeError: Incompatible types: <dtype: 'variant'> vs. int32. Value is 1
It looks similar to the error mentioned in TF Issue #31894, but it doesn't seem to solve this problem. The TypeError does not tell much about where and why the error is happening and directly googling it does not help.
Although it may not be too obvious from the TypeError variant vs int32, if we carefully check the logs, we can see that the error occurs when finding gradients:
File "F:\Projects\python\my_project\model.py", line 62, in model_fn
gradients = tape.gradient(loss, model.trainable_variables)
Note also that we get the same error even if only one of the two layers is present. BatchNormalization and Dropout may not seem to have much in common, but on closer inspection they are the only layers in the model that behave differently in the train and test phases: dropout doesn't zero out values in the test phase, and batch norm uses a moving mean and variance during the test phase.
So the problem narrows down to using any layer that has a different train/test phase. This happens because TensorFlow determines whether training mode is on from the training argument passed to the model.
This problem can be solved by using
y_pred = model(features, training=True)
when finding the gradients i.e. for the training phase and by using
y_pred = model(features, training=False)
otherwise i.e. for predict and eval phases.
Related: errors where the moving mean does not update have also been reported, and they can be solved by passing the same argument.
I am using Keras to build an LSTM model.
from keras.models import Sequential
from keras.layers import Dense, LSTM, Masking

def LSTM_model_1(X_train, Y_train, Dropout, hidden_units):
    model = Sequential()
    model.add(Masking(mask_value=666, input_shape=(X_train.shape[1], X_train.shape[2])))
    model.add(LSTM(hidden_units, activation='tanh', return_sequences=True, dropout=Dropout))
    model.add(LSTM(hidden_units, return_sequences=True))
    model.add(LSTM(hidden_units, return_sequences=True))
    model.add(Dense(Y_train.shape[-1], activation='softmax'))
    # metrics must be passed as a keyword argument
    model.compile(loss='mean_squared_error', optimizer='adam', metrics=['categorical_accuracy'])
    return model
The input data is of shape
X_train.shape=(77,100,34); Y_train.shape=(77,100,7)
The Y data is one-hot encoded. Both input tensors are zero-padded for the last list entry. The padded values in Y_train are 0, so no state gets a value of 1 for the padded end. dropout=0 and hidden_units=2, which seem unrelated to the following error.
Unfortunately, I get the following error, which I think is connected with the shape of Y, but I cannot put my finger on it. The error happens when the first LSTM layer is initialized/added.
ValueError: Initializer for variable lstm_58/kernel/ is from inside a
control-flow construct, such as a loop or conditional. When creating a
variable inside a loop or conditional, use a lambda as the
initializer.
Following the error, I noticed that it comes down to this:
dtype: If set, initial_value will be converted to the given type.
If None, either the datatype will be kept (if initial_value is
a Tensor), or convert_to_tensor will decide.
"convert_to_tensor" creates an object which is then None and leads to the error. Apparently, the LSTM tries to convert the input into a tensor, but if I look at my input, it is already a tensor.
Does anyone have an idea what went wrong, or how to use a lambda as an initializer? Thanks
Edit: the stack trace
File "C:\Users\310122653\Documents\GitHub\DNN\build_model.py", line 44, in LSTM_model_1
model.add(LSTM(hidden_units, activation='tanh', return_sequences=True, dropout=Dropout))
File "C:\ProgramData\Anaconda3\lib\site-packages\keras\models.py", line 492, in add
output_tensor = layer(self.outputs[0])
File "C:\ProgramData\Anaconda3\lib\site-packages\keras\layers\recurrent.py", line 499, in call
return super(RNN, self).call(inputs, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\keras\engine\topology.py", line 592, in call
self.build(input_shapes[0])
File "C:\ProgramData\Anaconda3\lib\site-packages\keras\layers\recurrent.py", line 461, in build
self.cell.build(step_input_shape)
File "C:\ProgramData\Anaconda3\lib\site-packages\keras\layers\recurrent.py", line 1838, in build
constraint=self.kernel_constraint)
File "C:\ProgramData\Anaconda3\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\keras\engine\topology.py", line 416, in add_weight
constraint=constraint)
File "C:\ProgramData\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py", line 395, in variable
v = tf.Variable(value, dtype=tf.as_dtype(dtype), name=name)
File "C:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\ops\variables.py", line 235, in init
constraint=constraint)
File "C:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\ops\variables.py", line 356, in _init_from_args
"initializer." % name)
The solution, in this case, was to restart the Kernel.
Thanks to Daniel Möller
I'm using my own data with the TensorFlow MNIST example pipeline but getting:
ValueError: input has 16384 elements, which isn't divisible by 65536
I've been practicing with the example's data successfully. However, after introducing my own images, which I resized to 128x128 px and converted to ubyte idx files, I get the following error:
Traceback (most recent call last):
File "tensorimage.py", line 132, in <module>
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
File "/home/ubuntu/tensorflow/python3/lib/python3.4/site-packages/tensorflow/python/training/optimizer.py", line 196, in minimize
grad_loss=grad_loss)
File "/home/ubuntu/tensorflow/python3/lib/python3.4/site-packages/tensorflow/python/training/optimizer.py", line 253, in compute_gradients
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "/home/ubuntu/tensorflow/python3/lib/python3.4/site-packages/tensorflow/python/ops/gradients.py", line 478, in gradients
in_grads = _AsList(grad_fn(op, *out_grads))
File "/home/ubuntu/tensorflow/python3/lib/python3.4/site-packages/tensorflow/python/ops/array_grad.py", line 298, in _ReshapeGrad
return [array_ops.reshape(grad, array_ops.shape(op.inputs[0])), None]
File "/home/ubuntu/tensorflow/python3/lib/python3.4/site-packages/tensorflow/python/ops/gen_array_ops.py", line 1758, in reshape
name=name)
File "/home/ubuntu/tensorflow/python3/lib/python3.4/site-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
op_def=op_def)
File "/home/ubuntu/tensorflow/python3/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 2319, in create_op
set_shapes_for_outputs(ret)
File "/home/ubuntu/tensorflow/python3/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 1711, in set_shapes_for_outputs
shapes = shape_func(op)
File "/home/ubuntu/tensorflow/python3/lib/python3.4/site-packages/tensorflow/python/ops/array_ops.py", line 1867, in _ReshapeShape
(num_elements, known_elements))
ValueError: input has 16384 elements, which isn't divisible by 65536
What's confusing to me is that I did indeed set the input to 16384 elements (128x128), but I don't understand where 65536 came from. I combed through all of the code, including the file tensorflow/python3/lib/python3.4/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py, but can't find where the number 65536 comes from.
It's hard to tell exactly what's going wrong without seeing more of your code, but in summary, TensorFlow thinks that the other dimensions of input result in a stride of 65536 elements. It tries to infer the missing dimension by dividing the number of elements present by the product of the known dimensions, and reports an error because the division is not exact:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/array_ops.py#L1702
What happens if you print the size of input just before this error?
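The inference described above can be sketched in plain Python. The helper infer_missing_dim is hypothetical and only imitates how a reshape resolves a single -1 entry. The shapes in the usage lines are assumptions as well, since the original code is not shown: a target such as [-1, 256, 256, 1] is just one way the known dimensions could multiply out to 65536.

```python
from math import prod


def infer_missing_dim(num_elements, shape):
    """Resolve the single -1 entry in `shape`, the way a reshape would.

    Hypothetical helper: divides the element count by the product of the
    known dimensions and fails if the division is not exact.
    """
    known = prod(d for d in shape if d != -1)
    if num_elements % known != 0:
        raise ValueError(
            f"input has {num_elements} elements, which isn't divisible by {known}"
        )
    return num_elements // known


# One 128x128 image reshaped to a matching target: the -1 resolves to 1.
ok = infer_missing_dim(128 * 128, [-1, 128, 128, 1])

# The same image against a target still sized for 256x256 inputs (assumed).
try:
    infer_missing_dim(128 * 128, [-1, 256, 256, 1])
    error = None
except ValueError as err:
    error = str(err)
```

If a reshape or pooling-output size in the model still assumes the original example's image dimensions, a mismatch of this kind would be expected once the input size changes.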