Distributed tensorflow monopolizes GPUs after running server.__init__ - python

I have two computers with two GPUs each. I am trying to get started with distributed TensorFlow and am very confused about how it all works. On computer A I would like to have one ps task (I have the impression this should go on the CPU) and two worker tasks (one per GPU). And I would like to have two 'worker' tasks on computer B. Here's how I have tried to implement this, in test.py:
import tensorflow as tf
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--job_name', required = True, type = str)
parser.add_argument('--task_idx', required = True, type = int)
args, _ = parser.parse_known_args()

JOB_NAME = args.job_name
TASK_INDEX = args.task_idx

ps_hosts = ["computerB-i9:2222"]
worker_hosts = ["computerA-i7:2222", "computerA-i7:2223", "computerB-i9:2223", "computerB-i9:2224"]

cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
server = tf.train.Server(cluster, job_name = JOB_NAME, task_index = TASK_INDEX)

if JOB_NAME == "ps":
    server.join()
elif JOB_NAME == "worker":
    is_chief = (TASK_INDEX == 0)
    with tf.device(tf.train.replica_device_setter(
            worker_device = "/job:worker/task:%d" % TASK_INDEX, cluster = cluster)):
        a = tf.constant(8)
        b = tf.constant(9)
    with tf.Session(server.target) as sess:
        sess.run(tf.multiply(a, b))
What I am finding by running python3 test.py --job_name ps --task_idx 0 on computer A is that both GPUs on computer A are immediately reserved by the script, while computer B shows no activity. This is not what I expected. I thought that since the ps job simply runs server.join(), it should not use the GPU. However, I can see by setting pdb breakpoints that as soon as the server is initialized, the GPUs are taken. This leaves me with several questions:
- Why does the server immediately take all the GPU capacity?
- How am I supposed to allocate GPU and launch different processes?
- Does my original plan even make sense? (I am still a little confused by tasks vs. clusters vs. servers etc...)
I have watched the Tensorflow Developer Summit 2017 video on distributed Tensorflow and I have also been looking around on Github and blogs. I have not been able to find a working code example using the latest or even relatively recent distributed tensorflow functions. Likewise, I notice that many questions on Stack Overflow are not answered, so I have read related questions but not any that resolve my questions. I would appreciate any guidance or recommendations about other resources. Thanks!

I found that the following works when invoking from the command line:
CUDA_VISIBLE_DEVICES="" python3 test.py --job_name ps --task_idx 0 --dir_name TEST
Since I found this in a lot of code examples, it seems like this may be the standard way to control an individual server's access to GPU resources.
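For completeness, the same effect can be expressed in code rather than on the command line. The sketch below is only an illustration, not part of the original post: it assumes CUDA_VISIBLE_DEVICES is set before TensorFlow initializes the GPUs, and that tf.train.Server is given a session config that hides the devices; cluster refers to the ClusterSpec from test.py above.
import os

# Option 1: hide all GPUs from this process before TensorFlow touches them
# (the in-process equivalent of the CUDA_VISIBLE_DEVICES="" workaround above).
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import tensorflow as tf

# Option 2: keep the GPUs visible on the machine, but tell this particular
# server not to allocate any of them.
ps_config = tf.ConfigProto(device_count={"GPU": 0})
server = tf.train.Server(cluster, job_name="ps", task_index=0, config=ps_config)
server.join()
For the worker tasks, a config whose gpu_options.visible_device_list names a single GPU (e.g. "0" or "1") can pin each task to its own device in the same way.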

Related

Apache Beam with DirectRunner (SUBPROCESS_SDK) uses only one worker, how do I force it to use all available workers?

The following code:
import sys

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.portability import python_urns
from apache_beam.portability.api import beam_runner_api_pb2
from apache_beam.runners.portability import fn_api_runner

def get_pipeline(workers):
    pipeline_options = PipelineOptions(['--direct_num_workers', str(workers)])
    return beam.Pipeline(options=pipeline_options,
                         runner=fn_api_runner.FnApiRunner(
                             default_environment=beam_runner_api_pb2.Environment(
                                 urn=python_urns.SUBPROCESS_SDK,
                                 payload=b'%s -m apache_beam.runners.worker.sdk_worker_main'
                                         % sys.executable.encode('ascii'))))

# input_files and all_columns are defined elsewhere in the original question
with get_pipeline(4) as pipeline:
    _ = (
        pipeline
        | 'ReadTestData' >> beam.io.ReadFromParquet(input_files, columns=all_columns)
        | "write" >> beam.io.WriteToText("/tmp/txt2")
    )
uses only one worker out of 4 available and generates only one big output file (even though there are many input files).
How do I force the Beam pipeline to work in parallel i.e. how do I force every input file to get processed separately by a different worker?
Which version of Beam are you using?
I have the same issue with Beam 2.16.0, but version 2.17.0 seems to have the expected behavior.
You might want to try that version instead while keeping your code as it is.
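As a small, hedged illustration of that advice (the fix version comes from the answer above, not from anything verified against the Beam docs), one can check the runtime version before building the pipeline:
import apache_beam as beam
from pkg_resources import parse_version

# The reported fix is a version bump, not a code change: keep the pipeline
# construction above as-is, but make sure the runtime is 2.17.0 or newer.
if parse_version(beam.__version__) < parse_version("2.17.0"):
    raise RuntimeError("DirectRunner used a single worker up to 2.16.0; "
                       "upgrade with: pip install 'apache-beam>=2.17.0'")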

Get the number of GPUs used in Tensorflow Distributed in a multi node approach

I am currently trying to compare Horovod and the TensorFlow Distributed API.
When using Horovod, I am able to access the total number of GPUs currently in use as follows:
import horovod.tensorflow as hvd
size = hvd.size()
A similar concept is available when using PyTorch Distributed API:
size = int(os.environ["WORLD_SIZE"])
I would like to perform the same operation and obtain the number of GPUs currently in use across multiple GPUs/nodes with the official TF Distributed API.
I can't use the CUDA_VISIBLE_DEVICES environment variable, as it would only work on a single node.
A few findings which answer my question:
Equivalent of hvd.size() (unlike hvd, the session must be started and initialized first! Otherwise you will just get "1"):
==> tf.distribute.get_strategy().num_replicas_in_sync
Equivalent of hvd.rank() (unlike hvd, the session must be started and initialized first! Otherwise you will just get "0"):
def get_rank():
    replica_id = tf.distribute.get_replica_context().replica_id_in_sync_group
    if isinstance(replica_id, tf.Tensor):
        return tf.get_static_value(replica_id)
    else:
        return 0
Is TF Distributed running?: tf.distribute.has_strategy() => True/False (same remark as above; otherwise you just get False)
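To make the "must be initialized first" remark concrete, here is a minimal sketch (TF 2.x assumed, with a single-node tf.distribute.MirroredStrategy used purely as an example):
import tensorflow as tf

# Outside any strategy scope, the default strategy is returned:
# num_replicas_in_sync == 1 and has_strategy() is False.
print(tf.distribute.get_strategy().num_replicas_in_sync)  # -> 1
print(tf.distribute.has_strategy())                       # -> False

strategy = tf.distribute.MirroredStrategy()  # one replica per local GPU

with strategy.scope():
    # Inside the scope, get_strategy() returns the active strategy,
    # so this reports the real replica count (the hvd.size() analogue).
    print(tf.distribute.get_strategy().num_replicas_in_sync)
    print(tf.distribute.has_strategy())                    # -> True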

TensorFlow: minimalist program fails on distributed mode

I wrote a very simple program that runs just fine without distribution but hangs on CheckpointSaverHook in distributed mode (everything on my localhost, though!). I've seen there have been a few questions about hanging in distributed mode, but none seem to match my question.
Here's the script (made to toy with the new layers API):
import numpy as np
import tensorflow as tf
from tensorflow.contrib.learn.python.learn import learn_runner
from tensorflow.contrib import layers

DATA_SIZE = 10
DIMENSION = 5
FEATURES = 'features'

def generate_input_fn():
    def _input_fn():
        mid = int(DATA_SIZE / 2)
        data = np.array([np.ones(DIMENSION) if x < mid else -np.ones(DIMENSION)
                         for x in range(DATA_SIZE)])
        labels = ['0' if x < mid else '1' for x in range(DATA_SIZE)]
        table = tf.contrib.lookup.string_to_index_table_from_tensor(tf.constant(['0', '1']))
        label_tensor = table.lookup(tf.convert_to_tensor(labels, dtype=tf.string))
        return dict(zip([FEATURES], [tf.convert_to_tensor(data, dtype=tf.float32)])), label_tensor
    return _input_fn

def build_estimator(model_dir):
    features = layers.real_valued_column(FEATURES, dimension=DIMENSION)
    return tf.contrib.learn.DNNLinearCombinedClassifier(
        model_dir=model_dir,
        dnn_feature_columns=[features],
        dnn_hidden_units=[20, 20])

def generate_exp_fun():
    def _exp_fun(output_dir):
        return tf.contrib.learn.Experiment(
            build_estimator(output_dir),
            train_input_fn=generate_input_fn(),
            eval_input_fn=generate_input_fn(),
            train_steps=100
        )
    return _exp_fun

if __name__ == '__main__':
    tf.logging.set_verbosity(tf.logging.DEBUG)
    learn_runner.run(generate_exp_fun(), 'job_dir')
To test distributed mode, I simply launch it with the environment variable TF_CONFIG={"cluster": {"ps":["localhost:5040"], "worker":["localhost:5041"]}, "task":{"type":"worker","index":0}, "environment": "local"} (this is for the worker; the same, with the ps type, is used to launch the parameter server).
I use tensorflow-1.0.1 (but had the same behavior with 1.0.0) on windows-64, CPU only. I never actually get any error; it just hangs after INFO:tensorflow:Create CheckpointSaverHook. forever... I've tried to attach the Visual Studio C++ debugger to the process, but with little success so far, so I can't print a stack trace for what's happening in the native part.
P.S.: it's not a problem with DNNLinearCombinedClassifier because it fails as well with a simple tf.contrib.learn.LinearClassifier. And as noted in the comments, it's not due to both process running on localhost, since it fails also when running on separate VMs.
EDIT: I think there's actually an issue with server launching. It looks like the server is not launched when you're in local mode (whether distributed or not), cf. tensorflow/contrib/learn/python/learn/experiment.py l.250-258:
# Start the server, if needed. It's important to start the server before
# we (optionally) sleep for the case where no device_filters are set.
# Otherwise, the servers will wait to connect to each other before starting
# to train. We might as well start as soon as we can.
config = self._estimator.config
if (config.environment != run_config.Environment.LOCAL and
        config.environment != run_config.Environment.GOOGLE and
        config.cluster_spec and config.master):
    self._start_server()
This will prevent the server from being started in local mode for the workers... Does anyone have an idea whether this is a bug, or whether there's something I'm missing?
So this has been answered in https://github.com/tensorflow/tensorflow/issues/8796. Finally, one should use the CLOUD environment for any distributed operation.
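For reference, a hedged sketch of what that means for the launch described in the question: the only change relative to the TF_CONFIG shown above is the environment field (run_config.Environment.CLOUD corresponds to the string "cloud"), set in each process before the script runs.
import json
import os

# Same cluster and task as in the question; only "environment" changes,
# so Experiment actually starts the tf.train.Server for the worker.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"ps": ["localhost:5040"], "worker": ["localhost:5041"]},
    "task": {"type": "worker", "index": 0},
    "environment": "cloud",
})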

gstreamer videomixer2 cpu usage 164%. Any solution to decrease

As a part of my project, I will have to synchronize 2 videos. Since I am implementing it in Python, I started using GStreamer.
My pipeline looks like this
filesrc -> decoder-> queuev -> videobox
filesrc-1 -> decoder-> queuev1 -> videobox1
Both of these videobox elements are joined to the mixer like this:
[videobox 1 and 2] -> mixer -> ffmpegcolorspace -> videosink
All of them in a single pipeline.
But the problem here is that when I run the code, I get 174% CPU usage, which I think is not really optimized. Is there any way to reduce this? Because even if I simply run 3 videos in parallel pipelines, I get 14% CPU usage.
I am also uploading part of my code here.
self.pipeline = gst.Pipeline('pipleline')
self.filesrc = gst.element_factory_make("filesrc", "filesrc")
self.filesrc.set_property('location', videoloc1)
self.pipeline.add(self.filesrc)
self.decode = gst.element_factory_make("decodebin2", "decode")
self.pipeline.add(self.decode)
self.queuev = gst.element_factory_make("queue", "queuev")
self.pipeline.add(self.queuev)
self.video = gst.element_factory_make("autovideosink", "video")
self.pipeline.add(self.video)
self.filesrc_2 = gst.element_factory_make("filesrc", "filesrc2")
self.filesrc_2.set_property('location', videoloc2)
self.pipeline.add(self.filesrc_2)
self.decode_2 = gst.element_factory_make("decodebin2", "decode_2")
self.pipeline.add(self.decode_2)
self.queuev_2 = gst.element_factory_make("queue", "queuev_2")
self.pipeline.add(self.queuev_2)
self.mixer = gst.element_factory_make("videomixer2", "mixer")
self.pipeline.add(self.mixer)
self.videobox_1 = gst.element_factory_make("videobox", "videobox_1")
self.pipeline.add(self.videobox_1)
self.videobox_2 = gst.element_factory_make("videobox", "videobox_2")
self.pipeline.add(self.videobox_2)
self.ffmpeg1 = gst.element_factory_make("ffmpegcolorspace", "ffmpeg1")
self.pipeline.add(self.ffmpeg1)
# decodebin2 exposes its source pads dynamically, so the decoders are linked
# to the queues in a pad-added callback (not shown in this snippet).
gst.element_link_many(self.filesrc,self.decode)
gst.element_link_many(self.filesrc_2,self.decode_2)
gst.element_link_many(self.queuev,self.videobox_1,self.mixer,self.ffmpeg1,self.video)
gst.element_link_many(self.queuev_2,self.videobox_2,self.mixer)
Videomixer is using the CPU to mix the videos. Anyway, in order to know for sure, run a profiler (oprofile, sysprof) to see what code is using the most CPU. Also, you did not say anything about the resolutions and colorspaces involved or the hardware you run this on, so it is hard to say whether it is unexpectedly slow.
Finally, you don't need to mix videos to sync them; you can just run them in a single pipeline. It is up to your application to e.g. render into separate drawing areas in your window, or whatever.
You can use streamsynchronizer https://gstreamer.freedesktop.org/data/doc/gstreamer/head/gst-plugins-base-plugins/html/gst-plugins-base-plugins-streamsynchronizer.html
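A minimal sketch of the "single pipeline, no mixer" suggestion, in the same gst-python 0.10 style as the question (file paths are placeholders, the dynamic decodebin2 pads are linked in a callback, and both branches share one pipeline clock, which is what keeps them in sync):
import pygst
pygst.require("0.10")
import gst

def add_branch(pipeline, location, name):
    src = gst.element_factory_make("filesrc", "src_" + name)
    src.set_property("location", location)
    dec = gst.element_factory_make("decodebin2", "dec_" + name)
    queue = gst.element_factory_make("queue", "q_" + name)
    conv = gst.element_factory_make("ffmpegcolorspace", "conv_" + name)
    sink = gst.element_factory_make("autovideosink", "sink_" + name)
    for element in (src, dec, queue, conv, sink):
        pipeline.add(element)
    src.link(dec)
    gst.element_link_many(queue, conv, sink)

    def on_pad_added(element, pad):
        # only connect the video stream from this decoder; ignore audio pads
        if pad.get_caps().to_string().startswith("video/"):
            pad.link(queue.get_pad("sink"))
    dec.connect("pad-added", on_pad_added)

pipeline = gst.Pipeline("sync_pipeline")
add_branch(pipeline, "/path/to/video1.avi", "a")  # placeholder path
add_branch(pipeline, "/path/to/video2.avi", "b")  # placeholder path
pipeline.set_state(gst.STATE_PLAYING)
# keep your existing GLib/GTK main loop running so the pipeline keeps playing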

"embarrassingly parallel" programming using python and PBS on a cluster

I have a function (a neural network model) which produces figures. I wish to test several parameters, methods and different inputs (meaning hundreds of runs of the function) from Python, using PBS on a standard cluster with Torque.
Note: I tried parallelpython, ipython and such and was never completely satisfied, since I want something simpler. The cluster is in a given configuration that I cannot change, and such a solution integrating python + qsub will certainly benefit the community.
To simplify things, I have a simple function such as:
import myModule

def model(input, a=1., N=100):
    do_lots_number_crunching(input, a, N)
    pylab.savefig('figure_' + input.name + '_' + str(a) + '_' + str(N) + '.png')
where input is an object representing the input, input.name is a string, and do_lots_number_crunching may last hours.
My question is: is there a correct way to transform something like a scan of parameters such as
for a in pylab.linspace(0., 1., 100):
    model(input, a)
into "something" that would launch a PBS script for every call to the model function?
#PBS -l ncpus=1
#PBS -l mem=i1000mb
#PBS -l cput=24:00:00
#PBS -V
cd /data/work/
python experiment_model.py
I was thinking of a function that would include the PBS template and call it from the python script, but could not yet figure it out (decorator?).
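One minimal sketch of that idea, offered purely as an illustration and not taken from the answers below: generate a PBS script per parameter value and hand it to qsub via subprocess (this assumes experiment_model.py reads the parameter from its command line and that qsub is on PATH).
import os
import subprocess
import tempfile

import numpy as np

PBS_TEMPLATE = """#PBS -l ncpus=1
#PBS -l mem=1000mb
#PBS -l cput=24:00:00
#PBS -V
cd /data/work/
python experiment_model.py {a}
"""

def submit(a):
    # write a throwaway job script for this parameter value and submit it
    with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as f:
        f.write(PBS_TEMPLATE.format(a=a))
        script = f.name
    job_id = subprocess.check_output(["qsub", script]).strip()
    os.remove(script)
    return job_id

for a in np.linspace(0., 1., 100):
    print(submit(a))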
pbs_python[1] could work for this. If experiment_model.py takes 'a' as an argument, you could do:
import pbs, os
import pylab

server_name = pbs.pbs_default()
c = pbs.pbs_connect(server_name)

attropl = pbs.new_attropl(4)
attropl[0].name = pbs.ATTR_l
attropl[0].resource = 'ncpus'
attropl[0].value = '1'

attropl[1].name = pbs.ATTR_l
attropl[1].resource = 'mem'
attropl[1].value = 'i1000mb'

attropl[2].name = pbs.ATTR_l
attropl[2].resource = 'cput'
attropl[2].value = '24:00:00'

attropl[3].name = pbs.ATTR_V

script = '''
cd /data/work/
python experiment_model.py %f
'''

jobs = []

for a in pylab.linspace(0., 1., 100):
    script_name = 'experiment_model.job' + str(a)
    with open(script_name, 'w') as scriptf:
        scriptf.write(script % a)
    job_id = pbs.pbs_submit(c, attropl, script_name, 'NULL', 'NULL')
    jobs.append(job_id)
    os.remove(script_name)

print jobs
[1]: https://oss.trac.surfsara.nl/pbs_python/wiki/TorqueUsage (pbs_python)
You can do this easily using jug (which I developed for a similar setup).
You'd write in a file (e.g., model.py):
from jug import TaskGenerator

@TaskGenerator
def model(param1, param2):
    res = complex_computation(param1, param2)
    pyplot.coolgraph(res)

for param1 in np.linspace(0, 1., 100):
    for param2 in xrange(2000):
        model(param1, param2)
And that's it!
Now you can launch "jug jobs" on your queue: jug execute model.py, and this will parallelise automatically. What happens is that each job will, in a loop, do something like:
while not all_done():
    for t in tasks_that_i_can_run():
        if t.lock_for_me():
            t.run()
(It's actually more complicated than that, but you get the point).
It uses the filesystem for locking (if you're on an NFS system) or a redis server if you prefer. It can also handle dependencies between tasks.
This is not exactly what you asked for, but I believe it's a cleaner architecture to separate this from the job queueing system.
It looks like I'm a little late to the party, but I also had the same question of how to map embarrassingly parallel problems onto a cluster in python a few years ago and wrote my own solution. I recently uploaded it to github here: https://github.com/plediii/pbs_util
To write your program with pbs_util, I would first create a pbs_util.ini in the working directory containing
[PBSUTIL]
numnodes=1
numprocs=1
mem=i1000mb
walltime=24:00:00
Then a python script like this
import pbs_util.pbs_map as ppm
import pylab
import myModule

class ModelWorker(ppm.Worker):

    def __init__(self, input, N):
        self.input = input
        self.N = N

    def __call__(self, a):
        myModule.do_lots_number_crunching(self.input, a, self.N)
        pylab.savefig('figure_' + self.input.name + '_' + str(a) + '_' + str(self.N) + '.png')

# You need "main" protection like this since pbs_map will import this file on the compute nodes
if __name__ == "__main__":
    input, N = something, picklable
    # Use list to force the iterator
    list(ppm.pbs_map(ModelWorker, pylab.linspace(0., 1., 100),
                     startup_args=(input, N),
                     num_clients=100))
And that would do it.
I just started working with clusters and EP applications. My goal (I'm with the Library) is to learn enough to help other researchers on campus access HPC with EP applications...especially researchers outside of STEM. I'm still very new, but thought it may help this question to point out the use of GNU Parallel in a PBS script to launch basic python scripts with varying arguments. In the .pbs file, there are two lines to point out:
module load gnu-parallel # this is required on my environment
parallel -j 4 --env PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
--workdir $NODE_LOCAL_DIR --transfer --return 'output.{#}' --clean \
`pwd`/simple.py '{#}' '{}' ::: $INPUT_DIR/input.*
# `-j 4` is the number of processors to use per node, will be cluster-specific
# {#} will substitute the process number into the string
# `pwd`/simple.py `{#}` `{}` this is the command that will be run multiple times
# ::: $INPUT_DIR/input.* all of the files in $INPUT_DIR/ that start with 'input.'
# will be substituted into the python call as the second(3rd) argument where the
# `{}` resides. These can be simple text files that you use in your 'simple.py'
# script to pass the parameter sets, filenames, etc.
As a newbie to EP supercomputing, even though I don't yet understand all the other options of "parallel", this command allowed me to launch Python scripts in parallel with different parameters. This would work well if you can generate a slew of parameter files ahead of time that will parallelize your problem, for example running simulations across a parameter space, or processing many files with the same code.
