Error - Keras+Theano - loss function - python

I am trying to run a Keras script on an AWS instance. While the script runs fine on my own computer (Python 2.7, no GPU), it causes an error on AWS. I have installed the latest version of Theano, and other scripts (e.g. the MNIST tutorial) do not give errors. The script that is causing the issue is a standard Keras tutorial script (https://github.com/fchollet/keras/blob/master/examples/cifar10_cnn.py). The error is copied below (apologies, there might be a better way to capture errors straight from the command line). Any help much appreciated.
[error message not preserved; the asker posted the first page of the traceback and noted the full stack of keras/layers frames was omitted]

Somehow you're passing a symbolic value for the border_mode parameter. If this works fine on CPU but not on GPU then, for some reason, the CPU version of the code supports symbolic border modes but the GPU version does not.
If you can, change the border_mode parameter value to be a Python literal instead of a Theano symbolic variable.
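As an illustration of the distinction (Theano is not imported here; `SymbolicVar` is just a stand-in for a Theano tensor variable, and `check_border_mode` is a hypothetical helper, not a Keras function):

```python
class SymbolicVar:
    """Stand-in for a Theano symbolic variable."""
    pass

def check_border_mode(border_mode):
    # Keras expects a plain Python string such as 'same' or 'valid';
    # a symbolic value here is what triggers the GPU-path error.
    if not isinstance(border_mode, str):
        raise TypeError("border_mode must be a Python string literal, "
                        "got %r" % type(border_mode).__name__)
    return border_mode

print(check_border_mode('same'))   # a literal is fine
try:
    check_border_mode(SymbolicVar())
except TypeError as e:
    print("rejected:", e)
```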

Related

MemoryError Precedes BrokenPipeError While Training CNN

Generally for running pip with no cache we use --no-cache-dir, like
pip install torch --no-cache-dir.
I downloaded a CNN model I want to use from github.
The first two lines of execution
python generate_dataset.py --is_train=True --use_phase=True --chip_size=100 --patch_size=94 --use_phase=True --dataset=soc
python generate_dataset.py --is_train=False --use_phase=True --chip_size=128 --patch_size=128 --use_phase=True --dataset=soc
executed successfully. But while running
python train.py --config_name=config/AConvNet-SOC.json
it gives a MemoryError.
The publisher of the above repository used 32 GB RAM and an 11 GB GPU, but I have 8 GB RAM and an 8 GB GPU.
Here is what I have done:
I thought of running it without a cache, like
python train.py --config_name=config/AConvNet-SOC.json --no-cache-dir
But it is throwing below error
FATAL Flags parsing error: Unknown command line flag 'no-cache-dir' Pass --helpshort or --helpfull to see help on flags.
I think it is because the no-cache-dir argument is not defined in the script via absl.flags. Does Python support a no-cache-directory option like pip's?
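For comparison, here is a stdlib `argparse` sketch (the repo uses `absl.flags`, but the principle is the same): a script only understands the flags it defines itself, so pip's `--no-cache-dir` is unknown to `train.py`:

```python
import argparse

# A script's parser accepts only the flags it declares; --no-cache-dir
# is a pip option, not a general Python feature.
parser = argparse.ArgumentParser()
parser.add_argument('--config_name', default='')

args, unknown = parser.parse_known_args(
    ['--config_name=config/AConvNet-SOC.json', '--no-cache-dir'])
print(args.config_name)  # config/AConvNet-SOC.json
print(unknown)           # ['--no-cache-dir'] -> absl.flags reports this as fatal
```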
I was able to work around it by decreasing the number of epochs and the batch_size, but I want to run it for the full number of epochs.
Using zero_grad() in PyTorch zeroes the gradients for every minibatch, so that the GPU won't run out of memory. But it is already used in _base.py in the code I am running. Is there any way I can leverage this further?
How can I resolve this?
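One memory-saving option in this direction is gradient accumulation: keep the per-step batch small but step the optimizer only every few minibatches. A numbers-only sketch (no PyTorch needed) of why the effective batch size is unchanged:

```python
# Hypothetical numbers-only illustration of gradient accumulation:
# process small micro-batches, accumulate their (scaled) gradients,
# and step once -- same effective batch, lower peak memory.
def grad(batch):
    # stand-in for loss.backward(): the "gradient" is just the batch mean
    return sum(batch) / len(batch)

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
accum_steps = 4
micro = len(data) // accum_steps  # micro-batches of 2 instead of one batch of 8

accumulated = 0.0
for i in range(accum_steps):
    batch = data[i * micro:(i + 1) * micro]
    accumulated += grad(batch) / accum_steps  # scale like loss / accum_steps

full = grad(data)  # what one big batch would give
print(accumulated, full)  # both 4.5
```

In real PyTorch code this corresponds to calling `optimizer.zero_grad()` and `optimizer.step()` only once per `accum_steps` minibatches, with the loss divided by `accum_steps`.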

cudaErrorInvalidAddressSpace: operation not supported on global/shared address space

I'm trying to run EnyaHermite's pytorch implementation of PicassoNet-II (https://github.com/EnyaHermite/Picasso) on an Ubuntu 18.04.6 LTS GPU cluster and I encounter the following error:
terminate called after throwing an instance of 'thrust::system::system_error'
what(): CUDA free failed: cudaErrorInvalidAddressSpace: operation not supported on global/shared address space
The framework is utilizing a few CPP functions in its main python script and one of them is this decimate_gpu.cu file (https://github.com/EnyaHermite/Picasso/blob/main/pytorch/picasso/mesh/modules/source/decimate_gpu.cu).
I cannot debug the file since I'm running it on a GPU cluster; I only know the crash happens because of this file.
I've only seen one similar issue (here: https://forums.developer.nvidia.com/t/invalidaddressspace-when-using-pointer-from-continuation-callable-parameters/184951/7). The issue in that post was an incorrectly defined callable, so changing __global__ to __device__ made it work.
I'm not sure if this error is similar, however I have no idea how to fix it.
Best,
Bjonze

Failed to get convolution algorithm. This is probably because cuDNN failed to initialize. [{node conv2d_1/Conv2D}]

Error
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node conv2d_1/Conv2D (defined at C:\Users\Rajshree\Anaconda3\lib\site-packages\tensorflow_core\python\framework\ops.py:1751) ]] [Op:__inference_keras_scratch_graph_808]
I'm running two programs. Both use this and perform the similar task of recognizing human expressions; the difference only lies in the CNN model they use. One is working perfectly fine, so what could be the possible problem with the other?
This could be due to a multitude of things. Are you running the two programs at the same time? Are you running this on a GPU? If so, it could be that one is already using the GPU, and the other finds that the GPU is already in use so it throws an error.
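If GPU contention is the cause, one commonly used workaround is to make TensorFlow allocate GPU memory on demand rather than grabbing it all at start-up. The environment variable must be set before TensorFlow is imported (the import is left commented out so the sketch stays self-contained):

```python
import os

# Ask TensorFlow to grow GPU memory allocation on demand instead of
# reserving the whole GPU when the first program starts. This must be
# in the environment before TensorFlow initializes.
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'
# import tensorflow as tf  # import only after the variable is set

print(os.environ['TF_FORCE_GPU_ALLOW_GROWTH'])
```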

Tensorflow Object Detection API - Error running model_builder_test.py module 'tensorflow' has no attribute 'contrib'

I installed the Tensorflow Object Detection API, and ran the model_builder_test.py script to make sure everything was working. I got the following error:
AttributeError: module 'tensorflow' has no attribute 'contrib'
I'm using Python 3.7.3 and Tensorflow 2.0.0. According to this answer, it may be related to Tensorflow version 2. I'm going to use this method to upgrade the model_builder_test.py script. However, I'm worried about other issues in the Object Detection API using Tensorflow 2.
My questions are:
1) Am I correct in interpreting this error?
2) Is it safe to use Object Detection with Tensorflow 2, or should I downgrade to Tensorflow 1.x?
Thanks!
1) Yes
2) Yes, and it may in fact work better per several bug fixes in TF2 - but make sure you follow the linked guide closely to confirm model behavior doesn't change unexpectedly (i.e. compare execution in TF1 vs. TF2)
However, the "make sure" in (2) is easier said than done - we're talking about an entire API here. This is best left to the API's devs themselves, unless you're highly familiar with the relevant parts of the repository. Even if you fix one bug, there may be others, including ones that don't throw errors, due to class/method-level functionality changes (especially in Eager vs. Graph interactions). There's not much harm in using TF 1.x, and it may even run faster.
Lastly, I'd suggest opening a TF Git issue on this; contributors/devs may respond there & not here.
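For reference, the error in (1) can be reproduced without TensorFlow installed; `tf.contrib` was removed in TF 2.0, so the attribute lookup itself fails (the stand-in module below mimics a TF 2 install):

```python
import types

# Minimal illustration: a bare module object with no `contrib`
# attribute fails exactly like a TensorFlow 2.x install does.
tf = types.ModuleType('tensorflow')
tf.__version__ = '2.0.0'

try:
    tf.contrib  # what model_builder_test.py effectively does
except AttributeError as e:
    print('AttributeError:', e)

# A version check is the usual guard before touching contrib-era APIs:
uses_contrib = int(tf.__version__.split('.')[0]) < 2
print(uses_contrib)  # False -> migrate the code or downgrade to TF 1.x
```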

Specifying the GPU ID using tensorflow backend

I am trying to train two different models, each on a specific GPU ID.
I have tried the command CUDA_VISIBLE_DEVICES=1 python filename.py but it picks up the GPU 0 rather than 1.
I also added the os environment variable in my code, but I got the same behavior.
I am not sure how I can fix this, as this is my first time using a GPU.
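One common pitfall: `CUDA_VISIBLE_DEVICES` only takes effect if it is set before the framework initializes CUDA, i.e. before `import tensorflow` / `import keras` runs. A minimal sketch (the framework import is commented out so it stays self-contained):

```python
import os

# Expose only GPU 1 to this process. This line must execute before the
# deep-learning framework is imported; setting it after import has no
# effect, which matches the behaviour described above.
os.environ['CUDA_VISIBLE_DEVICES'] = '1'
# import tensorflow as tf  # import only after the variable is set

print(os.environ['CUDA_VISIBLE_DEVICES'])
```

Note that inside the process the single visible GPU is then renumbered as device 0, so logs saying "device 0" can still mean physical GPU 1.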
