I have been trying to find ways to enable parallel processing in Theano while training a neural network, but I can't seem to find any. Right now, when I train a network, Theano only uses a single core.
Also, I do not have access to a GPU, so if I could make Theano use all the cores on the machine, that would hopefully speed things up.
Any tips on speeding up Theano are very welcome!
This is what I have been able to figure out.
Follow the instructions on this page
http://deeplearning.net/software/theano/install_ubuntu.html
It seems that I had not installed BLAS properly, so I reinstalled everything according to the instructions on the website.
Theano has config flags that have to be set.
Also follow the discussion here: Why does multiprocessing use only a single core after I import numpy?
Using all of this, when I run the script
THEANO_FLAGS='openmp=True' OMP_NUM_THREADS=N OPENBLAS_MAIN_FREE=1 python <script>.py
# where N is the number of cores
Theano uses all the cores on my machine.
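To double-check that the flags were actually picked up, you can inspect Theano's config from Python (a quick sketch, assuming Theano imports in the same environment):

import theano

print(theano.config.openmp)        # should print True when openmp=True was passed
print(theano.config.blas.ldflags)  # shows which BLAS library Theano is linked against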
I implemented a neural network class using only Python and NumPy, and I want to run some experiments with it. The problem is that it takes too long to train. My computer does not have a high-end GPU or a particularly good CPU, so I thought about some sort of 'cloud training'.
I know libraries such as TensorFlow or PyTorch use backends to train neural networks faster, and I was wondering if something similar could be achieved with NumPy. Is there a way to run NumPy in the cloud?
Even something slow that doesn't use GPUs would be fine for me. I tried loading my files into Google Colab, but it didn't work so well: it stopped running after some time due to inactivity.
Is there any nice solution out there?
Thanks for reading it all!
Try using CuPy instead of NumPy. It runs on the GPU (it works well on a Colab GPU instance), and you will probably only need to make a few small modifications to your code.
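For example, a minimal sketch of what the switch looks like (assuming a CUDA GPU is available, e.g. a Colab GPU runtime; the array shapes are just illustrative):

import cupy as cp  # install a wheel matching your CUDA version, e.g. pip install cupy-cuda11x

# CuPy mirrors the NumPy API for most array operations:
W = cp.random.randn(784, 128).astype(cp.float32)
x = cp.random.randn(64, 784).astype(cp.float32)
h = cp.tanh(x @ W)          # the matrix multiply runs on the GPU

h_cpu = cp.asnumpy(h)       # copy the result back to host memory when needed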
I'm currently trying to find an effective way of running a machine learning task over a set number of cores using TensorFlow. From the information I found, there are two main approaches to doing this.
The first is using the two TensorFlow options intra_op_parallelism_threads and inter_op_parallelism_threads and then creating a session with that configuration.
The second is using OpenMP. Setting the environment variable OMP_NUM_THREADS allows you to control the number of threads spawned for the process.
My problem arose when I discovered that installing TensorFlow through conda and through pip gives two different environments. With the conda install, modifying the OpenMP environment variables seemed to change the way the process was parallelised, whilst in the 'pip environment' the only thing that appeared to change it was the inter/intra config options I mentioned earlier.
This made it difficult to compare the two installs for benchmarking purposes. If I set OMP_NUM_THREADS to 1 and inter/intra to 16 on a 48-core processor with the conda install, I only get about 200% CPU usage, as most of the threads are idle at any given time.
import os

# Thread counts for OpenMP/MKL (mainly relevant for the MKL-backed conda build);
# set before TensorFlow initialises its runtime.
omp_threads = 1
mkl_threads = 1
os.environ["OMP_NUM_THREADS"] = str(omp_threads)
os.environ["MKL_NUM_THREADS"] = str(mkl_threads)

import tensorflow as tf
from keras import backend as K

# TensorFlow's own thread pools (TF 1.x API)
config = tf.ConfigProto()
config.intra_op_parallelism_threads = 16
config.inter_op_parallelism_threads = 16
session = tf.Session(config=config)
K.set_session(session)
I would expect this code to spawn 32 threads, most of which would be utilised at any given time, when in fact it spawns 32 threads and only 4-5 are being used at once.
Has anyone run into anything similar before when using TensorFlow?
Why does installing through conda and through pip seem to give two different environments?
Is there any way of getting comparable performance on the two installs by using some combination of the two methods discussed earlier?
Finally, is there maybe an even better way to limit Python to a specific number of cores?
Answers to your first and last questions.
Yes, I ran into a similar situation while using TensorFlow installed through pip.
You can limit Python to a specific number of cores by using thread affinity, numactl, or taskset on Linux.
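For example (a sketch, not taken from the links below), on Linux you can pin the current Python process to a fixed set of cores with os.sched_setaffinity:

import os

# Pin this process (pid 0 = the current process) to cores 0-3 only (Linux-only API)
os.sched_setaffinity(0, {0, 1, 2, 3})
print(os.sched_getaffinity(0))  # verify which cores the process is allowed to run on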
Looking at the details provided in the following links, TensorFlow will always spawn multiple threads, and most of them will be sleeping by default.
How to stop TensorFlow from multi-threading
How can I run Tensorflow on one single core?
Change number of threads for Tensorflow inference with C API
I think the point here is that conda installs TensorFlow with MKL, but pip does not.
OpenMP control only works with MKL; with the pip install, the OpenMP environment variables have no effect, and only setting intra/inter in the session config affects multi-threading.
As far as I know, tf.device('/GPU') can be used to run TensorFlow on the GPU. Is there any similar way to run arbitrary Python code on the GPU (CUDA), or should I use PyCUDA?
For parallel processing in Python, some intermediate library or package needs to sit between your code and the GPU/CPU to handle parallel execution. Some popular packages are PyCUDA, Numba, etc. If you want to do GPU programming using simple Python syntax without using other frameworks like TensorFlow, then take a look at this.
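To give an idea of what that looks like, here is a minimal sketch of a GPU kernel written with Numba's CUDA support (assumes a CUDA-capable GPU and the numba package; the kernel and sizes are purely illustrative):

import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(x, y, out):
    i = cuda.grid(1)          # global thread index
    if i < x.size:
        out[i] = x[i] + y[i]

n = 1_000_000
x = np.ones(n, dtype=np.float32)
y = np.ones(n, dtype=np.float32)
out = np.zeros(n, dtype=np.float32)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
add_kernel[blocks, threads_per_block](x, y, out)  # Numba copies host arrays to/from the device
print(out[:5])  # [2. 2. 2. 2. 2.]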
Today I was asking myself whether it is possible to do matrix calculations on the GPU instead of the CPU, because I know that a GPU is designed to do them faster than a CPU.
I searched the net and found articles about doing matrix calculations on the GPU with various Python libraries, but my question is: does any documentation exist that describes how we should write code to communicate with a GPU?
I'm asking because I want to develop my own to better understand how GPUs work and to try something different.
Thanks to all.
I solved that problem with OpenCL.
OpenCL is a standard that GPU vendors implement themselves. NVIDIA, for example, supports OpenCL alongside the other features it provides through the CUDA library.
Here is a good guide to get started
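To show what this looks like from Python, here is a small sketch using the pyopencl package (assumes an OpenCL driver and pyopencl are installed; the kernel just does an element-wise add):

import numpy as np
import pyopencl as cl

a = np.random.rand(50_000).astype(np.float32)
b = np.random.rand(50_000).astype(np.float32)

ctx = cl.create_some_context()        # pick an OpenCL device (GPU if available)
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

# The kernel is plain OpenCL C, compiled at runtime for the chosen device
program = cl.Program(ctx, """
__kernel void add(__global const float *a, __global const float *b, __global float *out) {
    int i = get_global_id(0);
    out[i] = a[i] + b[i];
}
""").build()

program.add(queue, a.shape, None, a_buf, b_buf, out_buf)

result = np.empty_like(a)
cl.enqueue_copy(queue, result, out_buf)   # copy the result back to the host
print(np.allclose(result, a + b))         # True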
I have a question regarding TensorFlow that is somewhat critical to the task I'm trying to accomplish.
My scenario is as follows:
1. I have a TensorFlow script that has been set up, trained, and tested. It is working well.
2. The training and testing were done on a devBox with 2 Titan X cards.
3. We now need to port this system to a live-pilot testing stage and are required to deploy it on a virtual machine running Ubuntu 14.04.
Here lies the problem: a VM will not have access to the underlying GPUs and must validate the incoming data in CPU-only mode. My questions:
Will the absence of GPUs hinder the validation process of my ML system? Does TensorFlow use GPUs for CNN computation by default, and will the absence of a GPU affect execution?
How do I run my script in CPU-only mode?
Will setting CUDA_VISIBLE_DEVICES to none help with validation in CPU-only mode after the system has been trained on GPU boxes?
I'm sorry if this comes across as a noob question, but I am new to TF and any advice would be much appreciated. Please let me know if you need any further information about my scenario.
Testing with CUDA_VISIBLE_DEVICES set to an empty string will make sure that you don't have anything that depends on a GPU being present, and theoretically it should be enough. In practice, there are some bugs in the GPU code path that can get triggered when there are no GPUs (like this one), so you want to make sure your GPU software environment (CUDA version) is the same.
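For example, a minimal sketch of hiding the GPUs from within the script itself (assumes the TF 1.x API; the environment variable must be set before TensorFlow initialises CUDA):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""   # empty string -> no GPUs visible to TensorFlow

import tensorflow as tf

# With TF 1.x you can confirm that every op lands on the CPU:
session = tf.Session(config=tf.ConfigProto(log_device_placement=True))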
Alternatively, you could compile TensorFlow without GPU support (bazel build -c opt tensorflow); that way you don't have to worry about matching CUDA environments or setting CUDA_VISIBLE_DEVICES.