MXNet ML lib C++ segmentation fault on OS X - python

I have a problem with Apache MXNet machine learning library on OS X.
I have been able to run the Python version of LeNet, a convolutional neural network.
I installed these with pip under both Anaconda Python 2.7 and 3.6.
conda create -n mxnet27 python=2.7
conda info --envs
source activate mxnet27
conda list
pip install mxnet==0.12.1
But when I run the C++ example file cpp-package/example/lenet.cpp I get this segfault:
Segmentation fault: 11
This is the place in the code where the segfault is thrown:
Symbol conv1 =
Convolution("conv1", data, conv1_w, conv1_b, Shape(5, 5), 20);
I get a similar segfault for the other C++ examples.
I have built MXNet on OS X 10.13.2.
I disabled as many libraries as possible, e.g. OpenCV and CUDA.
On Simon Corston-Oliver's suggestion I upgraded to MXNet 1.0.0, but that version did not compile with Clang on OS X. Error message:
operator_tune.h:150:36: note: add an explicit instantiation declaration to suppress this
warning if 'mxnet::op::OperatorTuneByType<float>::tuning_mode_' is explicitly instantiated in another translation unit
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/unordered_map:601:15: error: object of type 'std::__1::pair<int,
mxnet::test::perf::TimingInstrument::Info>' cannot be assigned because its copy assignment operator is implicitly deleted

I don't know of a specific issue with v0.12 that would lead to a segfault but before we dig in, I'd recommend upgrading to v1.0 which was released 2017-12-04.
If you still encounter the same problem with 1.0 we can work to debug.

I found a solution to compiling MXNet 1.0.0 posted here by helloniklas:
https://github.com/apache/incubator-mxnet/issues/9217
It involved only using make instead of CMake.
This solution worked for me and the code compiled.
The C++ examples now run without the segfault, but documentation is scarce. I only got one of them to do training.

Related

How can I know which is the latest Python version compatible with TensorFlow v2.x?

As the title says. My requirement is simple: I will probably need to use the latest features of Python at work, so I would like to know the latest version of Python that can be used with TensorFlow v2.x without compatibility problems. I must emphasize that I need to use the tensorflow.keras module, and I don't want to get an error message during model training. Any advice?
I tried to follow the issue on their GitHub about supporting Python 3.9. Although the issue is closed, most of the comments there are NOT from contributors or maintainers, and the last comment dates from June 2021. Is Python 3.9 the latest version compatible with TensorFlow v2.x?
TensorFlow 2 is supported from Python 3.7 to 3.10 according to their website: https://www.tensorflow.org/install?hl=en
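As a rough illustration, the range quoted above can be turned into a guard that runs before TensorFlow is imported. This is only a sketch: the bounds (3.7 and 3.10) are copied from the answer and will differ between TensorFlow releases, and the function name is made up for this example.

```python
import sys

# Assumed bounds, copied from the answer above; check the install page
# for the TensorFlow release you actually plan to use.
SUPPORTED_MIN = (3, 7)
SUPPORTED_MAX = (3, 10)


def python_in_supported_range(version=None, lo=SUPPORTED_MIN, hi=SUPPORTED_MAX):
    """Return True if (major, minor) lies within [lo, hi], inclusive."""
    if version is None:
        version = sys.version_info[:2]
    # Only major and minor matter here; drop the patch level if present.
    return lo <= tuple(version[:2]) <= hi


if __name__ == "__main__":
    print(sys.version_info[:2], python_in_supported_range())
```

Calling this at the top of a training script fails fast on an unsupported interpreter instead of partway through a run.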

Tensorflow 2.4.1 - Couldn't invoke ptxas.exe

I am trying to run TensorFlow with GPU support (GTX 1660 SUPER).
I created an environment using Anaconda, then installed cudatoolkit (version 11.0.221) and tensorflow-gpu (version 2.4.1). Afterwards, I downloaded cuDNN (version 8.0.4) and copied all files from cuDNN's bin folder to my environment's bin folder at anaconda3\envs\<env name>\Library\bin.
In my script, I've set the memory limit to my GPU's memory using tf.config.experimental.set_memory_growth.
When I run the script (which uses convolutional algorithms), I get a warning that says Couldn't invoke ptxas.exe --version, which comes after a Call to CreateProcess failed. Error code: 2 error.
After the launch failure, I get: Relying on driver to perform ptx compilation. Modify $PATH to customize ptxas location.
I've already tried switching to cuDNN version 8.1.1.
How do I fix this?
I got a new fix for this.
First I tried using tensorflow=2.3, cudnn=7.6.5 and cudatoolkit=10.1 as mentioned in previous answers. However, every time I started training a model, the process stalled and training seemed to be stuck in epoch 1.
I then managed to include ptxas in my conda environment by running conda install -c nvidia cuda-nvcc. The packages I am using are:
tensorflow=2.9, cudnn=8.1.0, cudatoolkit=11.2.2, cuda-nvcc=11.7.99 and python=3.9
I am running everything on Windows 10 flawlessly now.
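To confirm that ptxas is actually reachable from the active environment, a quick check along these lines may help. It is a minimal sketch using only the standard library; the function name is made up for this example, and TensorFlow may also look in locations other than PATH, so treat the result as a diagnostic hint only.

```python
import shutil


def find_ptxas(path=None):
    """Return the full path of ptxas found on `path` (defaults to PATH), or None."""
    # On Windows, shutil.which also tries the extensions listed in PATHEXT,
    # so "ptxas" matches ptxas.exe.
    return shutil.which("ptxas", path=path)


if __name__ == "__main__":
    location = find_ptxas()
    if location is None:
        print("ptxas not found on PATH")
    else:
        print("ptxas found at", location)
```

If this prints "ptxas not found on PATH" inside the activated conda environment, the warning from the question is likely to appear again.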
For the benefit of the community, adding @Zuk Levinson's comment:
The issue was solved by using
tensorflow=2.3, cudnn=7.6.5 and cudatoolkit=10.1

module tensorflow has no attribute contrib

I've written some code following a tutorial based on TensorFlow 1.6 which uses 'contrib', and this is not compatible with my current TensorFlow version (2.1.0).
I haven't been able to run the upgrade script and downgrading my version of tf causes another host of problems.
I've also tried using other modules in TensorFlow 2, such as tensorflow-addons, and disabling version 2 behaviour.
What should I do?
Thanks to @jdehesa.
Here is the information from the official TensorFlow website:
Warning: The tf.contrib module is not included in TensorFlow 2. Many
of its submodules have been integrated into TensorFlow core, or
spun-off into other projects like tensorflow_io, or tensorflow_addons.
For instructions on how to upgrade see the Migration guide.
https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/contrib
https://www.tensorflow.org/guide/migrate
Or, you can just convert the code to an appropriate version for TF 2.x.
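Before converting by hand, it can help to list which tf.contrib references the code actually uses. The following is a hypothetical helper, not part of TensorFlow or its upgrade script: it is a plain text scan, so it will miss aliased imports such as `from tensorflow.contrib import rnn as r` used later as `r.something`.

```python
import re

# Matches dotted references such as tf.contrib.layers.xavier_initializer
CONTRIB_PATTERN = re.compile(r"\b(?:tf|tensorflow)\.contrib(?:\.\w+)*")


def find_contrib_usages(source):
    """Return (line_number, reference) pairs for each tf.contrib reference in source."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in CONTRIB_PATTERN.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```

Each reported reference can then be looked up in the migration guide to see whether it moved into TensorFlow core or into a separate project.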

Compiling binary with tensorflow library for cpu: Cannot find cuda library?

In development, I have been using the gpu-accelerated tensorflow
https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.2.1-cp35-cp35m-linux_x86_64.whl
I am attempting to deploy my trained model along with an application binary for my users. I use PyInstaller (3.3.dev0+f0df2d2bb) on Python 3.5.2 to build the application into a binary.
For deployment, I install the cpu version, https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.2.1-cp35-cp35m-linux_x86_64.whl
However, upon successful compilation, I run my program and receive the infamous tensorflow cuda error:
tensorflow.python.framework.errors_impl.NotFoundError:
tensorflow/contrib/util/tensorflow/contrib/cudnn_rnn/python/ops/_cudnn_rnn_ops.so:
cannot open shared object file: No such file or directory
Why is it looking for CUDA when I've only got the CPU version installed? (Not to mention that I'm still on my development machine with CUDA, so it should find it anyway. I can use tensorflow-gpu/CUDA fine in uncompiled scripts, but this is irrelevant because deployment machines won't have CUDA.)
My first thought was that I'm somehow importing the wrong tensorflow, but I not only ran pip uninstall tensorflow-gpu, I also deleted the tensorflow-gpu in /usr/local/lib/python3.5/dist-packages/
Any ideas what could be happening? Maybe I need to start using a virtualenv.
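One way to narrow this down is to check whether the shared object named in the error exists in the installed package and in the bundled output. The helper below is a sketch with a made-up name; the directories passed to it are assumptions (for example `dist`, PyInstaller's default output folder, or the dist-packages path from the question).

```python
import os


def find_files(root, filename):
    """Return sorted paths under root whose basename equals filename."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            matches.append(os.path.join(dirpath, filename))
    return sorted(matches)


if __name__ == "__main__":
    # Hypothetical locations; adjust to your own build and install paths.
    for root in ("dist", "/usr/local/lib/python3.5/dist-packages"):
        print(root, find_files(root, "_cudnn_rnn_ops.so"))
```

If the file is present in the installed package but missing from the bundle, the problem is more likely in how the binary was packaged than in which TensorFlow build is installed.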

undefined symbol: cudnnCreate in ubuntu google cloud vm instance

I'm trying to run a tensorflow python script in a google cloud vm instance with GPU enabled. I have followed the process for installing GPU drivers, cuda, cudnn and tensorflow. However whenever I try to run my program (which runs fine in a super computing cluster) I keep getting:
undefined symbol: cudnnCreate
I have added the following to my ~/.bashrc:
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/extras/CUPTI/lib64:/usr/local/cuda-8.0/lib64"
export CUDA_HOME="/usr/local/cuda-8.0"
export PATH="$PATH:/usr/local/cuda-8.0/bin"
but it still does not work and produces the same error.
Answering my own question: the problem was not that the library was missing, but that the installed version was the wrong one, so the symbol could not be found. In this case it was cuDNN 5.0. However, even after installing the right version it still didn't work, due to incompatibilities between the versions of the driver, CUDA and cuDNN. I solved all these issues by reinstalling everything, including the driver, taking TensorFlow's library requirements into account.
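A quick way to see which cuDNN files are visible through LD_LIBRARY_PATH is sketched below. The function name is made up for this example, and the check is partial: the dynamic loader also consults other locations (such as the ldconfig cache), so an empty result does not prove the library is absent.

```python
import glob
import os


def find_cudnn_libraries(ld_library_path=None):
    """Return sorted libcudnn.so* paths found in the directories of a path list."""
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    found = set()
    for directory in ld_library_path.split(os.pathsep):
        # Skip empty entries produced by leading, trailing or doubled separators.
        if directory:
            found.update(glob.glob(os.path.join(directory, "libcudnn.so*")))
    return sorted(found)


if __name__ == "__main__":
    for lib in find_cudnn_libraries():
        print(lib)
```

The version suffix in the file names (for example `libcudnn.so.5`) can then be compared with the cuDNN version that the installed TensorFlow build requires.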
