How to use OpenMP parallelism effectively with tensorflow 1.14.0 - python

I'm currently trying to find an effective way of running a machine learning task over a set number of cores using TensorFlow. From the information I found, there are two main approaches to doing this.
The first is to use the two TensorFlow configuration options intra_op_parallelism_threads and inter_op_parallelism_threads and then create a session with that configuration.
The second is to use OpenMP: setting the environment variable OMP_NUM_THREADS controls the number of threads spawned for the process.
My problem arose when I discovered that installing TensorFlow through conda and through pip gives two different environments. In the conda install, modifying the OpenMP environment variables seemed to change the way the process was parallelised, whilst in the pip environment the only thing which appeared to change it was the inter/intra config options mentioned earlier.
This made it difficult to compare the two installs for benchmarking. If I set OMP_NUM_THREADS to 1 and inter/intra to 16 on a 48-core processor on the conda install, I only get about 200% CPU usage, as most of the threads are idle at any given time.
import os

# These must be set before TensorFlow is imported, otherwise the
# OpenMP/MKL runtime may already have been initialised
omp_threads = 1
mkl_threads = 1
os.environ["OMP_NUM_THREADS"] = str(omp_threads)
os.environ["MKL_NUM_THREADS"] = str(mkl_threads)

import tensorflow as tf
from tensorflow.keras import backend as K

config = tf.ConfigProto()
config.intra_op_parallelism_threads = 16
config.inter_op_parallelism_threads = 16
session = tf.Session(config=config)
K.set_session(session)
I would expect this code to spawn 32 threads, most of which would be in use at any given time; in fact it spawns 32 threads and only 4-5 are being used at once.
Has anyone run into anything similar before when using TensorFlow?
Why does installing through conda and through pip seem to give two different environments?
Is there any way of getting comparable performance on the two installs by using some combination of the two methods discussed earlier?
Finally, is there an even better way to limit Python to a specific number of cores?

An answer to your first and last questions:
Yes, I ran into a similar situation while using TensorFlow installed through pip.
You can limit Python to a specific number of cores by using thread affinity, numactl, or taskset on Linux.
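For example, on Linux you can pin the current process to a fixed set of cores directly from Python; a minimal sketch (the core IDs 0-3 are arbitrary):

import os

# Pin the current process (pid 0) to cores 0-3; this is the Python
# equivalent of launching with: taskset -c 0-3 python script.py
os.sched_setaffinity(0, {0, 1, 2, 3})

# Confirm which cores the process is now allowed to run on
print(os.sched_getaffinity(0))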
Judging by the details provided in the following links, TensorFlow will always spawn multiple threads, and most of them will be sleeping by default.
How to stop TensorFlow from multi-threading
How can I run Tensorflow on one single core?
Change number of threads for Tensorflow inference with C API

I think the point here is that conda installs TensorFlow with MKL, but pip does not.
OpenMP control only works with MKL; in a pip install, the OpenMP environment variables have no effect, and only setting the session config's intra/inter options affects multi-threading.
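If you are unsure which flavour you have, you can ask TensorFlow itself whether it was built against MKL; a best-effort sketch for TF 1.x (this relies on an internal module, so it may change between versions):

# TF 1.x internal module, not a stable public API
from tensorflow.python import pywrap_tensorflow
print(pywrap_tensorflow.IsMklEnabled())  # True for conda/MKL builds, False for stock pip builds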

Related

Multiple Python 3 kernels in a Docker Jupyter Notebook installation

I'm trying to add different Python 3 kernels to a Docker-based Jupyter Notebook installation. This would be optimal, since many different notebooks will need to run on this installation.
So far I have tried installing a virtual environment and adding the second kernel, but however I order the instructions, I always end up with the same kernel in both. Is this even possible? All mentions online seem to be for py2 and py3 kernels. Any examples you could share? Thanks!
I tried creating a virtual environment, activating it and then adding the ipykernel, but in the end both kernels had the same version in Jupyter Notebook.
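For context, the usual sequence for registering a venv as an extra kernel looks roughly like this (the env name and path are illustrative):

python3 -m venv /opt/myenv
/opt/myenv/bin/pip install ipykernel
/opt/myenv/bin/python -m ipykernel install --user --name myenv --display-name "Python (myenv)"

The kernelspec records the interpreter of whichever python ran the ipykernel install step, which is a common way to end up with two kernels pointing at the same Python version.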

Is it possible to upgrade the Python version of a conda env, specifically of the base env?

Suppose I want to
back up the base env by cloning it,
and then upgrade the Python version of the base env from, say, 3.7.x to 3.10.x.
Is it possible? If it is, how should I proceed?
Run
conda activate base
conda install python=3.10
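For the backup in step 1, clone base before upgrading (the clone name base_backup is arbitrary):
conda create --name base_backup --clone base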
Important notes
Even though it is possible, this practice is not recommended by the official documentation:
It is not recommended, rather it is preferable to create a new environment. The resolver has to work very hard to determine exactly which packages to upgrade. But it is possible (...)
Depending on how many packages there are (and how many conflicts), the procedure might take a long while (after more than 2 hours I quit).
Console output is quite limited: although it shows progress while resolving the packages (it took 5 minutes for 444 packages), it does not show progress while resolving the conflicts it has identified, only:
the words "Found conflicts! Looking for incompatible packages. This can take several minutes. Press CTRL-C to abort." (while resolving the conflicts actually took more than 2 hours, before I quit),
a spinning cursor,
and as many package names as the screen width can hold.

Can you pre-install libraries on Databricks Pool nodes?

We have a number of Python Databricks jobs that all use the same underlying Wheel package to install their dependencies. Installing this Wheel package, even on a node that has been idling in a Pool, still takes 90 seconds.
Some of these jobs are very long-running, so we would like to use Jobs Compute clusters for the lower cost in DBUs.
Some of these jobs are much shorter-running (<10 seconds), where the 90-second install time seems more significant. We have been considering using a hot cluster (All-Purpose Compute) for these shorter jobs, but we would like to avoid the extra cost of All-Purpose Compute if possible.
Reading the Databricks documentation suggests that the Idle instances in the Pool are reserved for us but not costing us DBUs. Is there a way for us to pre-install the required libraries on our Idle instances so that when a job comes through we are able to immediately start processing it?
Is there an alternate approach that can fulfill a similar use case?
You can't install libraries directly onto the nodes in a pool, because the actual code is executed in the Docker container corresponding to the Databricks Runtime. There are several ways to speed up installation of the libraries:
Create your own Docker image with all necessary libraries pre-installed, and pre-load the Databricks Runtime version and your Docker image. This part can't be done via the UI, so you need to use the REST API (see the description of the preloaded_docker_images attribute), databricks-cli, or the Databricks Terraform provider. The main disadvantage of custom Docker images is that some functionality isn't available out of the box, for example arbitrary files in Repos, the web terminal, etc. (I don't remember the full list).
Put all necessary libraries and their dependencies onto DBFS and install them via a cluster init script (a sketch follows this list). It's very important to collect binary dependencies, not packages with only source code, so that nothing needs to be compiled at install time. This can be done once:
for Python, this can be done with pip download --prefer-binary lib1 lib2 ...
for Java/Scala, you can use mvn dependency:get -Dartifact=<maven_coordinates>, which will download the dependencies into the ~/.m2/repository folder, from which you can copy the jars to DBFS and, in the init script, use the command cp /dbfs/.../jars/* /databricks/jars/
for R, it's slightly more complicated, but also doable
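For the Python case, the init script itself can then stay tiny; a sketch, assuming the wheels were downloaded to a DBFS folder (the path is hypothetical):

#!/bin/bash
# Install the pre-downloaded binary wheels; DBFS is mounted at /dbfs on the nodes
pip install /dbfs/FileStore/wheels/*.whl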

Dask on single OSX machine - is it parallel by default?

I have installed Dask on OSX Mojave. Does it execute computations in parallel by default? Or do I need to change some settings?
I am using the DataFrame API. Does that make a difference to the answer?
I installed it with pip. Does that make a difference to the answer?
Yes, Dask is parallel by default.
Unless you specify otherwise, or create a distributed Client, execution will happen with the "threaded" scheduler, using a number of threads equal to your number of cores. Note, however, that because of the Python GIL (only one Python instruction is executed at a time), you may not get as much parallelism as is available, depending on how good your specific tasks are at releasing the GIL. That is why you have a choice of schedulers.
Being on OSX and installing with pip make no difference. Using dataframes makes a difference in that it dictates the sorts of tasks you're likely running; Pandas is good at releasing the GIL for many operations.
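As a quick illustration of the default behaviour and of opting into a different scheduler (the toy dataframe here is arbitrary):

import dask.dataframe as dd
import pandas as pd

# Dask splits the frame into partitions that can be processed in parallel
df = dd.from_pandas(pd.DataFrame({"x": range(1_000_000)}), npartitions=8)

# By default this runs on the threaded scheduler, one thread per core
print(df.x.mean().compute())

# Per-call opt-in to the process-based scheduler, which sidesteps the GIL
# at the cost of inter-process serialization overhead
print(df.x.mean().compute(scheduler="processes"))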

How to enable parallel processing in Theano? (no GPU)

I have been trying to find ways to enable parallel processing in Theano while training a neural network, but I can't seem to find any. Right now, when I train a network, Theano only uses a single core.
Also, I do not have access to a GPU, so if I could make Theano use all the cores on the machine, it would hopefully speed things up.
Any tips on speeding up Theano are very welcome!
This is what I have been able to figure out.
Follow the instructions on this page
http://deeplearning.net/software/theano/install_ubuntu.html
It seems that I had not installed BLAS properly, so I reinstalled everything according to the instructions on the website.
Theano has config flags that have to be set.
Also follow the discussion here: Why does multiprocessing use only a single core after I import numpy?
Using all this when I run the script
THEANO_FLAGS='openmp=True' OMP_NUM_THREADS=N OPENBLAS_MAIN_FREE=1 python <script>.py
# where N is the number of cores
Theano uses all the cores on my machine.
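To confirm the flag was picked up, you can check Theano's config from inside the script; a minimal sketch:

import theano
# Prints True when the script is launched with THEANO_FLAGS='openmp=True'
print(theano.config.openmp)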
