I have a dummy model (a linear autoencoder). Training on a dataset of 1,000 records works, but on a dataset three orders of magnitude larger it runs out of GPU memory, even though the batch size is fixed and the machine has enough RAM to hold the data.
Am I doing something silly?
Note: it works fine on TF 2.5, but crashes on TF 2.6-2.9. It always works if training on CPU.
The model is:
from tensorflow.keras import layers, models

def get_model(n_inputs: int) -> models.Model:
    inp = layers.Input(shape=(n_inputs,))
    out = layers.Dense(n_inputs, activation='linear')(inp)
    m = models.Model(inputs=inp, outputs=out)
    m.compile(loss='mse', optimizer='adam')
    m.summary()
    return m
I am feeding the data through the tf.data API:
import numpy as np
import tensorflow as tf

def wrap_data(data: np.ndarray) -> tf.data.Dataset:
    dataset = tf.data.Dataset.from_tensor_slices(data)
    shuffled = dataset.shuffle(buffer_size=len(data), reshuffle_each_iteration=True)
    batched = shuffled.batch(16, num_parallel_calls=tf.data.AUTOTUNE, deterministic=False)
    autoencoder = batched.map(lambda x: (x, x)).prefetch(5)
    return autoencoder
The full reproducing script is here. Running python benchmark.py works, but python benchmark.py --big doesn't.
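For orientation, the driver does roughly the following (a hypothetical sketch only; the record count, feature count, dtype, and data source are placeholders, not the values from the linked script):

import numpy as np

def main(big: bool = False) -> None:
    # Placeholder sizes: the --big dataset is three orders of magnitude larger.
    n_records = 1_000_000 if big else 1_000
    n_inputs = 100
    train_data = np.random.rand(n_records, n_inputs)

    model = get_model(n_inputs)
    train_data_iterator = wrap_data(train_data)
    model.fit(train_data_iterator, epochs=2)  # epoch count is arbitrary here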
I am using Python 3.9 on Fedora 36. The GPU is an Nvidia RTX 2070 with 8 GiB of RAM. The driver version is 515.48.07 and the CUDA version is 11.7. nvidia-smi reports that most of the memory is available between runs, and the small version requires less than 800 MiB.
The full traceback is:
2022-09-05 15:29:37.525261: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 16384000000 exceeds 10% of free system memory.
2022-09-05 15:29:54.002629: W tensorflow/core/common_runtime/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 15.26GiB (rounded to 16384000000)requested by op _EagerConst
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
Current allocation summary follows.
2022-09-05 15:29:54.002987: W tensorflow/core/common_runtime/bfc_allocator.cc:491] *_******____________________________________________________________________________________________
Traceback (most recent call last):
File "/home/david/[path]/benchmark.py", line 49, in <module>
main(parser.parse_args().big)
File "/home/david/[path]/benchmark.py", line 40, in main
train_data_iterator = wrap_data(train_data)
File "/home/david/[path]/benchmark.py", line 33, in wrap_data
dataset = tf.data.Dataset.from_tensor_slices(data)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 809, in from_tensor_slices
return TensorSliceDataset(tensors, name=name)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 4551, in __init__
element = structure.normalize_element(element)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/data/util/structure.py", line 125, in normalize_element
ops.convert_to_tensor(t, name="component_%d" % i, dtype=dtype))
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/profiler/trace.py", line 183, in wrapped
return func(*args, **kwargs)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/ops.py", line 1640, in convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 48, in _default_conversion_function
return constant_op.constant(value, dtype, name=name)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 267, in constant
return _constant_impl(value, dtype, shape, name, verify_shape=False,
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 279, in _constant_impl
return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 304, in _constant_eager_impl
t = convert_to_eager_tensor(value, ctx, dtype)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 102, in convert_to_eager_tensor
return ops.EagerTensor(value, ctx.device_name, dtype)
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.
And setting TF_GPU_ALLOCATOR=cuda_malloc_async, as the warning suggests, doesn't help:
2022-09-05 15:33:19.542433: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 16384000000 exceeds 10% of free system memory.
2022-09-05 15:33:25.973935: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:288] gpu_async_0 cuMemAllocAsync failed to allocate 16384000000 bytes: CUDA error: out of memory (CUDA_ERROR_OUT_OF_MEMORY)
Reported by CUDA: Free memory/Total memory: 1115357184/8369799168
2022-09-05 15:33:25.973961: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:293] Stats: Limit: 6221922304
InUse: 67126312
MaxInUse: 201327628
NumAllocs: 13
MaxAllocSize: 67108864
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
2022-09-05 15:33:25.973970: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:56] Histogram of current allocation: (allocation_size_in_bytes, nb_allocation_of_that_sizes), ...;
2022-09-05 15:33:25.973974: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 4, 5
2022-09-05 15:33:25.973976: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 8, 2
2022-09-05 15:33:25.973979: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 1028, 1
2022-09-05 15:33:25.973982: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 16384, 1
2022-09-05 15:33:25.973985: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 67108864, 1
Traceback (most recent call last):
File "/home/david/[path]/benchmark.py", line 48, in <module>
main(parser.parse_args().big)
File "/home/david/[path]/benchmark.py", line 40, in main
train_data_iterator = wrap_data(train_data)
File "/home/david/[path]/benchmark.py", line 33, in wrap_data
dataset = tf.data.Dataset.from_tensor_slices(data)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 809, in from_tensor_slices
return TensorSliceDataset(tensors, name=name)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 4551, in __init__
element = structure.normalize_element(element)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/data/util/structure.py", line 125, in normalize_element
ops.convert_to_tensor(t, name="component_%d" % i, dtype=dtype))
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/profiler/trace.py", line 183, in wrapped
return func(*args, **kwargs)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/ops.py", line 1640, in convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 48, in _default_conversion_function
return constant_op.constant(value, dtype, name=name)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 267, in constant
return _constant_impl(value, dtype, shape, name, verify_shape=False,
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 279, in _constant_impl
return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 304, in _constant_eager_impl
t = convert_to_eager_tensor(value, ctx, dtype)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 102, in convert_to_eager_tensor
return ops.EagerTensor(value, ctx.device_name, dtype)
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.
Related
I recently got a new laptop with an integrated Intel graphics card and a discrete Nvidia one.
I installed CUDA and drivers of this version:
| NVIDIA-SMI 510.39.01 Driver Version: 510.39.01 CUDA Version: 11.6
I also have TensorFlow 2.7.
I am trying to run a network that was working perfectly on my old computer, taken more or less from this repository:
https://github.com/zhixuhao/unet.git
but I get the following warning when I initialize the model:
2022-02-02 14:47:03.039319: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
and when I actually try to train the model, it runs OOM without training at all. (Error message simplified to fit the character limit.)
2022-02-02 14:50:00.390958: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.02GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2022-02-02 14:50:00.391000: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.02GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2022-02-02 14:50:02.218748: W tensorflow/core/common_runtime/bfc_allocator.cc:275]
2022-02-02 14:50:12.687857: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f16daade200 of size 524288 next 155
2022-02-02 14:50:12.687865: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f16dab5e200 of size 1179648 next 157
2022-02-02 14:50:12.688670: I tensorflow/core/common_runtime/bfc_allocator.cc:1078] Sum Total of in-use chunks: 5.17GiB
2022-02-02 14:50:12.688678: I tensorflow/core/common_runtime/bfc_allocator.cc:1080] total_region_allocated_bytes_: 6427901952 memory_limit_: 6427901952 available bytes: 0 curr_region_allocation_bytes_: 12855803904
2022-02-02 14:50:12.688695: I tensorflow/core/common_runtime/bfc_allocator.cc:1086] Stats:
Limit: 6427901952
InUse: 5546784512
MaxInUse: 6126439680
NumAllocs: 645
MaxAllocSize: 3716153344
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
2022-02-02 14:50:12.688718: W tensorflow/core/common_runtime/bfc_allocator.cc:474] **********************************************************_____******************************_______
2022-02-02 14:50:12.688819: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at conv_grad_input_ops.cc:335 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[1,128,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
---------------------------------------------------------------------------
ResourceExhaustedError Traceback (most recent call last)
Input In [7], in <module>
----> 1 model.fit_generator(myGene,steps_per_epoch=300,epochs=10,callbacks=[model_checkpoint])
File ~/.local/lib/python3.8/site-packages/keras/engine/training.py:2016, in Model.fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
2005 """Fits the model on data yielded batch-by-batch by a Python generator.
2006
2007 DEPRECATED:
2008 `Model.fit` now supports generators, so there is no longer any need to use
2009 this endpoint.
2010 """
2011 warnings.warn(
2012 '`Model.fit_generator` is deprecated and '
2013 'will be removed in a future version. '
2014 'Please use `Model.fit`, which supports generators.',
2015 stacklevel=2)
-> 2016 return self.fit(
2017 generator,
2018 steps_per_epoch=steps_per_epoch,
2019 epochs=epochs,
2020 verbose=verbose,
2021 callbacks=callbacks,
2022 validation_data=validation_data,
2023 validation_steps=validation_steps,
2024 validation_freq=validation_freq,
2025 class_weight=class_weight,
2026 max_queue_size=max_queue_size,
2027 workers=workers,
2028 use_multiprocessing=use_multiprocessing,
2029 shuffle=shuffle,
2030 initial_epoch=initial_epoch)
File ~/.local/lib/python3.8/site-packages/keras/utils/traceback_utils.py:67, in filter_traceback.<locals>.error_handler(*args, **kwargs)
65 except Exception as e: # pylint: disable=broad-except
66 filtered_tb = _process_traceback_frames(e.__traceback__)
---> 67 raise e.with_traceback(filtered_tb) from None
68 finally:
69 del filtered_tb
File ~/.local/lib/python3.8/site-packages/tensorflow/python/eager/execute.py:58, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
56 try:
57 ctx.ensure_initialized()
---> 58 tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
59 inputs, attrs, num_outputs)
60 except core._NotOkStatusException as e:
61 if name is not None:
ResourceExhaustedError: OOM when allocating tensor with shape[1,128,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node gradient_tape/model/conv2d_20/Conv2D/Conv2DBackpropInput
(defined at /home/john/.local/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py:464)
]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[Op:__inference_train_function_3020]
Errors may have originated from an input operation.
Input Source operations connected to node gradient_tape/model/conv2d_20/Conv2D/Conv2DBackpropInput:
In[0] gradient_tape/model/conv2d_20/Conv2D/ShapeN:
In[1] model/conv2d_20/Conv2D/ReadVariableOp (defined at /home/john/.local/lib/python3.8/site-packages/keras/layers/convolutional/base_conv.py:224)
In[2] gradient_tape/model/conv2d_20/ReluGrad:
Operation defined at: (most recent call last)
>>> File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
>>> return _run_code(code, main_globals, None,
>>>
>>> File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
>>> exec(code, run_globals)
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/ipykernel_launcher.py", line 16, in <module>
>>> app.launch_new_instance()
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/traitlets/config/application.py", line 846, in launch_instance
>>> app.start()
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 677, in start
>>> self.io_loop.start()
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/tornado/platform/asyncio.py", line 199, in start
>>> self.asyncio_loop.run_forever()
>>>
>>> File "/usr/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
>>> self._run_once()
>>>
>>> File "/usr/lib/python3.8/asyncio/base_events.py", line 1859, in _run_once
>>> handle._run()
>>>
>>> File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
>>> self._context.run(self._callback, *self._args)
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 461, in dispatch_queue
>>> await self.process_one()
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 450, in process_one
>>> await dispatch(*args)
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 357, in dispatch_shell
>>> await result
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 652, in execute_request
>>> reply_content = await reply_content
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/ipykernel/ipkernel.py", line 353, in do_execute
>>> res = shell.run_cell(code, store_history=store_history, silent=silent)
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/ipykernel/zmqshell.py", line 532, in run_cell
>>> return super().run_cell(*args, **kwargs)
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 2768, in run_cell
>>> result = self._run_cell(
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 2814, in _run_cell
>>> return runner(coro)
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner
>>> coro.send(None)
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3012, in run_cell_async
>>> has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3191, in run_ast_nodes
>>> if await self.run_code(code, result, async_=asy):
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3251, in run_code
>>> exec(code_obj, self.user_global_ns, self.user_ns)
>>>
>>> File "/tmp/ipykernel_20858/1898079364.py", line 1, in <module>
>>> model.fit_generator(myGene,steps_per_epoch=300,epochs=10,callbacks=[model_checkpoint])
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/keras/engine/training.py", line 2016, in fit_generator
>>> return self.fit(
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
>>> return fn(*args, **kwargs)
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/keras/engine/training.py", line 1216, in fit
>>> tmp_logs = self.train_function(iterator)
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/keras/engine/training.py", line 878, in train_function
>>> return step_function(self, iterator)
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/keras/engine/training.py", line 867, in step_function
>>> outputs = model.distribute_strategy.run(run_step, args=(data,))
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/keras/engine/training.py", line 860, in run_step
>>> outputs = model.train_step(data)
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/keras/engine/training.py", line 816, in train_step
>>> self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py", line 530, in minimize
>>> grads_and_vars = self._compute_gradients(
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py", line 583, in _compute_gradients
>>> grads_and_vars = self._get_gradients(tape, loss, var_list, grad_loss)
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py", line 464, in _get_gradients
>>> grads = tape.gradient(loss, var_list, grad_loss)
>>>
Does anyone have any idea what could have caused this, or how to resolve it?
Best
Try limiting how much GPU memory TensorFlow grabs, as shown in this guide, and let us know if it works. For example, enabling memory growth so that TensorFlow allocates GPU memory on demand instead of reserving it all up front:
import tensorflow as tf
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.compat.v1.Session(config=config)
Also check here.
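On TF 2.x, a roughly equivalent setup (a minimal sketch using the tf.config API; the 4096 MiB cap is an arbitrary example, adjust it to your card) looks like this:

import tensorflow as tf

# Must run before any op initializes the GPU.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Allocate GPU memory on demand instead of reserving it all at start-up.
    tf.config.experimental.set_memory_growth(gpus[0], True)
    # Alternatively, cap TensorFlow at a fixed amount (here 4096 MiB),
    # which cannot be combined with memory growth:
    # tf.config.set_logical_device_configuration(
    #     gpus[0],
    #     [tf.config.LogicalDeviceConfiguration(memory_limit=4096)])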
I am executing the head2head model presented in the Github repo here.
When I am running the code using the following command:
./scripts/train/train_on_target.sh Obama head2headDataset
with contents of the train_on_target.sh file as:
target_name=$1
dataset_name=$2
python train.py --checkpoints_dir checkpoints/$dataset_name \
--target_name $target_name \
--name head2head_$target_name \
--dataroot datasets/$dataset_name/dataset \
--serial_batches
Then I am getting the following error:
Traceback (most recent call last):
File "train.py", line 108, in <module>
flow_ref, conf_ref, t_scales, n_frames_D)
File "/home/nitin/head2head/util/util.py", line 48, in get_skipped_flows
flow_ref_skipped[s], conf_ref_skipped[s] = flowNet(real_B[s][:,1:], real_B[s][:,:-1])
File "/home/nitin/anaconda3/envs/head2head/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/nitin/anaconda3/envs/head2head/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/nitin/anaconda3/envs/head2head/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/nitin/head2head/models/flownet.py", line 38, in forward
flow, conf = self.compute_flow_and_conf(input_A, input_B)
File "/home/nitin/head2head/models/flownet.py", line 55, in compute_flow_and_conf
flow1 = self.flowNet(data1)
File "/home/nitin/anaconda3/envs/head2head/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/nitin/head2head/models/flownet2_pytorch/models.py", line 156, in forward
flownetfusion_flow = self.flownetfusion(concat3)
File "/home/nitin/anaconda3/envs/head2head/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/nitin/head2head/models/flownet2_pytorch/networks/FlowNetFusion.py", line 62, in forward
concat0 = torch.cat((out_conv0,out_deconv0,flow1_up),1)
RuntimeError: CUDA out of memory. Tried to allocate 82.00 MiB (GPU 0; 5.80 GiB total capacity; 4.77 GiB already allocated; 73.56 MiB free; 4.88 GiB reserved in total by PyTorch)
I have checked the batch size in the file options/base_options.py; it is already set to 1. How can I solve the above-mentioned exception? My system has a 6 GB NVIDIA GTX 1660 Super GPU.
Data management:
You can try reducing the dataset used for training to check whether it is a hardware limitation.
Moreover, if it is an image dataset, you can reduce the dimensions of the images by lowering their resolution (dpi).
Model parameter management:
Another approach is to reduce the number of parameters of your model. The first suggestion would be to shrink the Dense layer size, and then tune the other neural network hyperparameters.
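As an illustration only (a hypothetical Keras model, not the network from the question), narrowing a Dense layer shrinks its weight matrix and therefore its memory footprint:

from tensorflow.keras import layers, models

def small_model(n_inputs: int, width: int = 64) -> models.Model:
    # A narrower hidden layer means fewer weights and less GPU memory.
    inp = layers.Input(shape=(n_inputs,))
    hidden = layers.Dense(width, activation='relu')(inp)
    out = layers.Dense(n_inputs)(hidden)
    m = models.Model(inputs=inp, outputs=out)
    m.compile(loss='mse', optimizer='adam')
    return m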
While moving a model to eager execution, I encountered an error when using gradient_tape for backpropagation. As far as I can tell, all operations are taking place on the GPU, yet during backprop I get the following error:
File "tf_registration_continuous.py", line 128, in single_registration_step
elastic_grads = tape.gradient(loss_value, elastic_variable_list)
File "/share/software/user/open/py-tensorflow/1.8.0_py27/lib/python2.7/site-packages/tensorflow/python/eager/backprop.py", line 767, in gradient
output_gradients=output_gradients)
File "/share/software/user/open/py-tensorflow/1.8.0_py27/lib/python2.7/site-packages/tensorflow/python/eager/imperative_grad.py", line 63, in imperative_grad
tape._tape, vspace, target, sources, output_gradients) # pylint: disable=protected-access
File "/share/software/user/open/py-tensorflow/1.8.0_py27/lib/python2.7/site-packages/tensorflow/python/eager/backprop.py", line 147, in grad_fn
op_inputs, op_outputs, orig_outputs)
File "/share/software/user/open/py-tensorflow/1.8.0_py27/lib/python2.7/site-packages/tensorflow/python/eager/backprop.py", line 115, in _magic_gradient_function
return grad_fn(mock_op, *out_grads)
File "/share/software/user/open/py-tensorflow/1.8.0_py27/lib/python2.7/site-packages/tensorflow/python/ops/array_grad.py", line 427, in _GatherV2Grad
params_shape = math_ops.to_int32(params_shape)
File "/share/software/user/open/py-tensorflow/1.8.0_py27/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 875, in to_int32
return cast(x, dtypes.int32, name=name)
File "/share/software/user/open/py-tensorflow/1.8.0_py27/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 787, in cast
x = gen_math_ops.cast(x, base_type, name=name)
File "/share/software/user/open/py-tensorflow/1.8.0_py27/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1548, in cast
_six.raise_from(_core._status_to_exception(e.code, message), None)
File "/share/software/user/open/py-scipystack/1.0_py27/lib/python2.7/site-packages/six.py", line 718, in raise_from
raise value
tensorflow.python.framework.errors_impl.InvalidArgumentError: Tensors on conflicting devices: cannot compute Cast as input #0 was expected to be on /job:localhost/replica:0/task:0/device:GPU:0 but is actually on /job:localhost/replica:0/task:0/device:CPU:0 (operation running on /job:localhost/replica:0/task:0/device:GPU:0) Tensors can be copied explicitly using .gpu() or .cpu() methods, or transparently copied by using tf.enable_eager_execution(device_policy=tfe.DEVICE_PLACEMENT_SILENT). Copying tensors between devices may slow down your model [Op:Cast] name: ToInt32/
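The error message itself suggests two workarounds; a minimal sketch of the device-policy one (TF 1.x eager API, matching the TF 1.8 in the traceback, untested here):

import tensorflow as tf
import tensorflow.contrib.eager as tfe

# Must be called once, at program start-up, before any ops run.
# Lets TensorFlow copy tensors between CPU and GPU silently instead of
# raising "Tensors on conflicting devices".
tf.enable_eager_execution(device_policy=tfe.DEVICE_PLACEMENT_SILENT)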
I'm trying to train an LSTM model in Keras and am running into the error below. The following reproduces it:
from keras.layers import LSTM
from keras.models import Sequential
model = Sequential()
model.add(LSTM(128, input_shape=(1000,1000)))
I'm using Python 3.4 and Keras 2.0.4 with the TensorFlow backend, TensorFlow version 0.12.1.
Here's the traceback:
File "test.py", line 6, in <module>
model.add(LSTM(128, input_shape=(1000,1000)))
File "/usr/local/lib/python3.4/dist-packages/keras/models.py", line 433, in add
layer(x)
File "/usr/local/lib/python3.4/dist-packages/keras/layers/recurrent.py", line 243, in __call__
return super(Recurrent, self).__call__(inputs, **kwargs)
File "/usr/local/lib/python3.4/dist-packages/keras/engine/topology.py", line 558, in __call__
self.build(input_shapes[0])
File "/usr/local/lib/python3.4/dist-packages/keras/layers/recurrent.py", line 1012, in build
constraint=self.bias_constraint)
File "/usr/local/lib/python3.4/dist-packages/keras/legacy/interfaces.py", line 88, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.4/dist-packages/keras/engine/topology.py", line 391, in add_weight
weight = K.variable(initializer(shape), dtype=dtype, name=name)
File "/usr/local/lib/python3.4/dist-packages/keras/layers/recurrent.py", line 1004, in bias_initializer
self.bias_initializer((self.units * 2,), *args, **kwargs),
File "/usr/local/lib/python3.4/dist-packages/keras/backend/tensorflow_backend.py", line 1681, in concatenate
return tf.concat([to_dense(x) for x in tensors], axis)
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/ops/array_ops.py", line 1075, in concat
dtype=dtypes.int32).get_shape(
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/ops.py", line 669, in convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/constant_op.py", line 176, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/constant_op.py", line 165, in constant
tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/tensor_util.py", line 367, in make_tensor_proto
_AssertCompatible(values, dtype)
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/tensor_util.py", line 302, in _AssertCompatible
(dtype.name, repr(mismatch), type(mismatch).__name__))
TypeError: Expected int32, got list containing Tensors of type '_Message' instead.
I've seen a few people report the same issue, e.g. Tensorflow Slim: TypeError: Expected int32, got list containing Tensors of type '_Message' instead. There, the problem seemed to be that the arguments to the concat function needed to be switched around, but here they are already in the right form.
Any idea what the cause of this is?
When I run TensorFlow's official sample code 'tensorflow-mnist-tutorial':
$ python3 mnist_1.0_softmax.py
something goes wrong:
Traceback (most recent call last):
File "/Users/holmes/Desktop/untitled/gitRepository/tensorflow-mnist-tutorial/mnist_1.0_softmax.py", line 78, in <module>
I = tensorflowvisu.tf_format_mnist_images(X, Y, Y_) # assembles 10x10 images by default
File "/Users/holmes/Desktop/untitled/gitRepository/tensorflow-mnist-tutorial/tensorflowvisu.py", line 42, in tf_format_mnist_images
everything_incorrect_first = tf.concat(0, [incorrectly_recognised_indices, correctly_recognised_indices]) # images reordered with indeces of unrecognised images first
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/tensorflow/python/ops/array_ops.py", line 1047, in concat
dtype=dtypes.int32).get_shape(
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 651, in convert_to_tensor
as_ref=False)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 716, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py", line 176, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py", line 165, in constant
tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/tensorflow/python/framework/tensor_util.py", line 367, in make_tensor_proto
_AssertCompatible(values, dtype)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/tensorflow/python/framework/tensor_util.py", line 302, in _AssertCompatible
(dtype.name, repr(mismatch), type(mismatch).__name__))
TypeError: Expected int32, got list containing Tensors of type '_Message' instead.