Distributed TensorFlow - Not running some workers

I'm trying to get a very simple example of distributed TensorFlow working. However, I'm hitting a bug that appears non-deterministically between runs. On some runs, it works perfectly, outputting something along the lines of:
Worker 2 | step 0
Worker 0 | step 0
Worker 1 | step 0
Worker 3 | step 0
Worker 2 | step 1
Worker 0 | step 1
Worker 1 | step 1
Worker 3 | step 1
...
However, every once in a while, one or more of the workers fails to run, resulting in output like this:
Worker 0 | step 0
Worker 3 | step 0
Worker 0 | step 1
Worker 3 | step 1
Worker 0 | step 2
Worker 3 | step 2
...
If I run the loop indefinitely, the missing workers always seem to start up at some point, but only minutes later, which isn't practical.
I've found that two things make the issue go away (but make the program useless):
1. Not declaring any tf.Variable inside the with tf.device(tf.train.replica_device_setter()) scope. If I declare even one variable (e.g. nasty_var below), the issue starts cropping up.
2. Setting the is_chief param in tf.train.MonitoredTrainingSession() to True for all workers. This makes the bug go away even if variables are declared, but it seems wrong to make all of the workers the chief. The way I'm currently setting it below, is_chief=(task_index == 0), is taken directly from a TensorFlow tutorial.
Here's the simplest code I can get to replicate the issue. (You may have to run it multiple times to see the bug, but it almost always shows up within 5 runs.)
from multiprocessing import Process
import tensorflow as tf
from time import sleep
from numpy.random import random_sample
cluster = tf.train.ClusterSpec({'ps': ['localhost:2222'],
                                'worker': ['localhost:2223',
                                           'localhost:2224',
                                           'localhost:2225',
                                           'localhost:2226']})
def create_worker(task_index):
    server = tf.train.Server(cluster, job_name='worker', task_index=task_index)
    with tf.device(tf.train.replica_device_setter(worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
        nasty_var = tf.Variable(0)  # This line causes the problem. No issue when this is commented out.
    with tf.train.MonitoredTrainingSession(master=server.target, is_chief=(task_index == 0)):
        for step in xrange(10000):
            sleep(random_sample())  # Simulate some work being done.
            print 'Worker %d | step %d' % (task_index, step)
def create_ps(task_index):
    param_server = tf.train.Server(cluster, job_name='ps',
                                   task_index=task_index)
    param_server.join()
# Launch workers and ps in separate processes.
processes = []
for i in xrange(len(cluster.as_dict()['worker'])):
    print 'Forking worker process ', i
    p = Process(target=create_worker, args=[i])
    p.start()
    processes.append(p)
for i in xrange(len(cluster.as_dict()['ps'])):
    print 'Forking ps process ', i
    p = Process(target=create_ps, args=[i])
    p.start()
    processes.append(p)
for p in processes:
    p.join()

I'm guessing the cause here is the implicit coordination protocol in how a tf.train.MonitoredTrainingSession starts, which is implemented as follows:
If this session is the chief:
    Run the variable initializer op.
Else (if this session is not the chief):
    Run an op to check if the variables have been initialized.
    While any of the variables has not yet been initialized:
        Wait 30 seconds.
        Try creating a new session, and check whether the variables have been initialized.
(I discuss the rationale behind this protocol in a video about Distributed TensorFlow.)
When every session is the chief, or there are no variables to initialize, the tf.train.MonitoredTrainingSession will always start immediately. However, once there is a single variable, and you only have a single chief, you will see that the non-chief workers have to wait for the chief to act.
The reason for using this protocol is that it is robust to various processes failing, and the delay—while very noticeable when running everything on a single process—is short compared to the expected running time of a typical distributed training job.
Looking at the implementation again, it does seem that this 30-second timeout should be configurable (as the recovery_wait_secs argument to tf.train.SessionManager()), but there is currently no way to set this timeout when you create a tf.train.MonitoredTrainingSession, because it uses a hardcoded set of arguments for creating a session manager.
This seems like an oversight in the API, so please feel free to open a feature request on the GitHub issues page!
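To make the effect of the polling interval concrete, here is a minimal sketch that simulates the non-chief side of the protocol in plain Python. It does not use TensorFlow; wait_for_initialization, FakeClock and simulate are hypothetical names invented for this illustration, and the fake clock stands in for real sleeping so the example runs instantly.

```python
import time


def wait_for_initialization(is_initialized, recovery_wait_secs=30,
                            sleep=time.sleep, max_checks=100):
    """Poll is_initialized() and return the total number of seconds waited."""
    waited = 0
    for _ in range(max_checks):
        if is_initialized():
            return waited
        # Not ready yet: a non-chief sleeps for the full interval before rechecking.
        sleep(recovery_wait_secs)
        waited += recovery_wait_secs
    raise TimeoutError("variables were never initialized")


class FakeClock:
    """Simulated clock, so the demo does not really sleep."""

    def __init__(self):
        self.now = 0.0

    def sleep(self, secs):
        self.now += secs


def simulate(chief_ready_at, recovery_wait_secs):
    """Seconds a non-chief waits if the chief finishes at chief_ready_at."""
    clock = FakeClock()
    return wait_for_initialization(
        is_initialized=lambda: clock.now >= chief_ready_at,
        recovery_wait_secs=recovery_wait_secs,
        sleep=clock.sleep,
    )


if __name__ == "__main__":
    # The chief finishes initializing 2 simulated seconds after the first check.
    print("30 s interval:", simulate(chief_ready_at=2, recovery_wait_secs=30))
    print(" 1 s interval:", simulate(chief_ready_at=2, recovery_wait_secs=1))
```

In this simulation, a non-chief that checks just before the chief finishes pays for a whole polling interval, which matches the multi-second gaps seen in the question when a worker loses the race.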

As mrry said, the problem exists because:
A non-chief worker relies on the chief to initialize the model.
If the model isn't initialized yet, the non-chief waits for 30 seconds before checking again.
Performance-wise, it makes little difference if a worker waits for the chief and kicks in 30 seconds later. However, I was recently doing research that required strictly synchronized updates, so this problem needed to be taken care of.
The key here is to use a barrier, depending on your distributed setting. Assume you are using thread 1 to run the ps and threads 2 to 5 to run the workers; then you only need to do two things.
First, instead of using a MonitoredTrainingSession, use a tf.train.Supervisor, which lets you set recovery_wait_secs (default: 30 seconds). Change it to 1 to reduce your wait time:
sv = tf.train.Supervisor(is_chief=is_chief,
                         logdir=...,
                         init_op=...,
                         ...
                         recovery_wait_secs=1)
sess = sv.prepare_or_wait_for_session(server.target,
                                      config=sess_config)
Second, use a barrier. Assuming you are using threads, in main:
barrier = threading.Barrier(parties=num_workers)
for i in range(num_workers):
    threads.append(threading.Thread(target=run_model, args=("worker", i, barrier)))
threads.append(threading.Thread(target=run_model, args=("ps", 0, barrier)))
In the actual training function:
_ = sess.run([train_op], feed_dict=train_feed)
barrier.wait()
Then training just proceeds happily. The barrier makes sure that all workers reach this step before any of them continues, so no worker races ahead of the others.
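The barrier idea above can be tried without TensorFlow. The following is a minimal sketch in Python 3, where run_worker and run_lockstep are hypothetical names and appending to a shared list stands in for the sess.run(train_op) call; it only illustrates how threading.Barrier keeps the workers in lockstep.

```python
import threading


def run_worker(worker_id, num_steps, barrier, log):
    for step in range(num_steps):
        # Stand-in for sess.run([train_op], ...); list.append is thread-safe.
        log.append((step, worker_id))
        # Block until every worker has finished this step.
        barrier.wait()


def run_lockstep(num_workers=4, num_steps=3):
    """Run the workers in lockstep and return the order in which steps ran."""
    barrier = threading.Barrier(parties=num_workers)
    log = []
    threads = [
        threading.Thread(target=run_worker, args=(i, num_steps, barrier, log))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    for step, worker_id in run_lockstep():
        print("Worker %d | step %d" % (worker_id, step))
```

Because each worker records its step before waiting, no entry for step k+1 can appear in the log until every worker has recorded step k. The order of workers within a step is still arbitrary.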

Related

Modify global variable with Pool in multiprocessing module

I'm not familiar with the multiprocessing module. I am trying to verify that variables in different processes are independent. After the test, I find that different tasks apparently "share" variables. That happens when the tasks run in a process with the same pid. Is there some relationship?
Environment : Windows 10 ; python 3.7
# -*- coding: utf-8 -*-
import os
from multiprocessing import Pool
p=0
def Child_process(id_number):
    global p
    print('Task start: %s(%s)' % (id_number, os.getpid()))
    print('p = %d' % p)
    p=p+1
    print('Task {} end'.format(id_number))
if __name__ == '__main__':
    p = Pool(4)
    p.map(Child_process,range(5))
    p.close()
    p.join()
The result is:
Task start: 0(7668)
p = 0
Task start: 1(10384)
Task 0 end
p = 0
Task start: 2(7668)
p = 1
Task 1 end
Task 2 end
Task start: 3(7668)
Task start: 4(10384)
p = 1
Task 4 end
p = 2
Task 3 end
I think the p should always be 0, but it increases when different processes have the same pid?

By definition, a thread/process pool will re-use the same thread/process. This lets you set up resources when the thread/process starts so that each task doesn't have to initialize them again. This includes global variables, open files, sockets, etc. You can do the one-time initialization by passing an initializer function to the pool. So if you set or increment the variable p, it will remain set throughout the various tasks run by that process. If you want the variable to always start at 0 for each task, you'll need to set it to 0 at the start of each task.
This note is in the multiprocessing.pool.Pool class:
Note: Worker processes within a Pool typically live for the complete duration of the Pool’s work queue. A frequent pattern found in other systems (such as Apache, mod_wsgi, etc) to free resources held by workers is to allow a worker within a pool to complete only a set amount of work before being exiting, being cleaned up and a new process spawned to replace the old one. The maxtasksperchild argument to the Pool exposes this ability to the end user.
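As a rough illustration of the note above, the sketch below runs the same task function in a pool with and without maxtasksperchild=1. The names counter, bump and run are made up for this example. chunksize=1 is passed so that each item counts as one task; otherwise map may group several items into a single task.

```python
import multiprocessing as mp

counter = 0  # module-level state; every worker process has its own copy


def bump(_):
    """Increment this process's copy of counter and return the new value."""
    global counter
    counter += 1
    return counter


def run(maxtasksperchild=None):
    # With maxtasksperchild=1 the pool replaces each worker after one task.
    with mp.Pool(processes=2, maxtasksperchild=maxtasksperchild) as pool:
        return pool.map(bump, range(6), chunksize=1)


if __name__ == "__main__":
    print("reused workers:       ", run())
    print("fresh worker per task:", run(maxtasksperchild=1))
```

With reused workers, the returned values grow as each process handles more tasks, mirroring the increasing p in the question. With a fresh worker per task, every task should see the counter at its initial value, at the cost of starting a new process each time.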

How to perform batch computation in python by adding processes as soon as cores become free?

Bash has the function "wait -n" which can be used in a relatively trivial way to halt subsequent execution of child processes until a certain number of processor cores have been made available. E.g. I can do the following,
for IJOB in ${IJOBRANGE};
do
    ./func.x ${IJOB} &
    # checking the number of background processes
    # and halting the execution accordingly
    bground=( $(jobs -p) );
    if (( ${#bground[@]} >= CORES )); then
        wait -n
    fi
done || exit 1
This snippet can batch execute an arbitrary C process "func.x" with varying arguments and always maintains a fixed number of parallel instances of the child processes, set to the value "CORES".
I was wondering if something similar could be done with a python script and python child processes (or functions). Currently, I define a python function, set up a one-dimensional parameter array and use the Pool routine from the python multiprocessing module to compute the function over the parameter array in parallel. The pool functions perform a set number (# of CPU CORES in the following example) of evaluations of my function and wait until all the instances of the spawned processes have concluded before moving to the next batch.
import multiprocessing as mp
def func(x):
    # some computation with x
    pass
def main(j):
    # setting the parameter array
    xarray = range(j)
    pool = mp.Pool()
    pool.map(func,xarray)
I would like to know if it is possible to modify this snippet so that it always runs a fixed number of parallel computations of my subroutine, i.e. adds another process as soon as one of the child processes has finished. All the "func" processes here are supposed to be independent, and the order of execution does not matter either. I am new to the python way and it would be really great to have some helpful perspectives.

Following our discussion in the comments, here's some test code adapted from yours that shows Pools don't wait for all parallel tasks to complete before assigning a new one to available workers:
import multiprocessing as mp
from time import sleep, time
def func(x):
    """sleeps for x seconds"""
    name = mp.current_process().name
    print("{} {}: sleep {}".format(time(), name, x))
    sleep(x)
    print("{} {}: done sleeping".format(time(), name))
def main():
    # A pool of two processes, for the sake of simplicity
    pool = mp.Pool(processes=2)
    # Here's how that works out visually:
    #
    #     0s        1s        2s        3s
    # P1  [sleep(1)][     sleep(2)     ]
    # P2  [     sleep(2)     ][sleep(1)]
    sleeps = [1, 2, 2, 1]
    pool.map(func, sleeps)
if __name__ == "__main__":
    main()
Running this code gives (timestamps simplified for clarity):
$ python3 mp.py
0s: ForkPoolWorker-1: sleep 1
0s: ForkPoolWorker-2: sleep 2
1s: ForkPoolWorker-1: done sleeping
1s: ForkPoolWorker-1: sleep 2
2s: ForkPoolWorker-2: done sleeping
2s: ForkPoolWorker-2: sleep 1
3s: ForkPoolWorker-1: done sleeping
3s: ForkPoolWorker-2: done sleeping
We can see that the first process doesn't wait for the second process to complete its first task before starting its second task.
So I guess that should answer the point you were raising, hope I've understood you clearly.

how to start multiple jobs in python and communicate with the main job

I am a novice user of python multithreading/multiprocessing, so please bear with me.
I would like to solve the following problem and I need some help/suggestions in this regard.
Let me describe in brief:
I would like to start a python script which does something in the beginning sequentially.
After the sequential part is over, I would like to start some jobs in parallel.
Assume that there are four parallel jobs I want to start.
I would also like to start these jobs on some other machines using "lsf" on the computing cluster. My initial script is also running on an "lsf" machine.
The four jobs which I started on four machines will perform two logical steps A and B---one after the other.
When the jobs start, they begin with logical step A and finish it.
After every job (all four) has finished step A, they should notify the first job, which started them. In other words, the main job is waiting for confirmation from these four jobs.
Once the main job receives confirmation from these four jobs, it should notify all four jobs to do logical step B.
Logical step B will automatically terminate the jobs after finishing the task.
Main job is waiting for the all jobs to finish and later on it should continue with the sequential part.
An example scenario would be:
Python script running on an “lsf” machine in the cluster starts four "tcl shells" on four “lsf” machines.
In each tcl shell, a script is sourced to do the logical step A.
Once the step A is done, somehow they should inform the python script which is waiting for the acknowledgement.
Once the acknowledgement is received from all the four, python script inform them to do the logical step B.
Logical step B is also a script which is sourced in their tcl shell; this script will also close the tcl shell at the end.
Meanwhile, python script is waiting for all the four jobs to finish.
After all four jobs are finished; it should continue with the sequential part again and finish later on.
Here are my questions:
I am confused about---should I use multithreading/multiprocessing. Which one suits better?
In fact what is the difference between these two? I read about these but I wasn't able to conclude.
What is python GIL? I also read somewhere at any one point in time only one thread will execute.
I need some explanation here. It gives me an impression that I can't use threads.
Any suggestions on how could I solve my problem systematically and in a more pythonic way.
I am looking for some verbal step by step explanation and some pointers to read on each step.
Once the concepts are clear, I would like to code it myself.
Thanks in advance.

In addition to roganjosh's answer, I would include some signaling to start the step B after A has finished:
import multiprocessing as mp
import time
import random
import sys
def func_A(process_number, queue, proceed):
    print "Process {} has been created".format(process_number)
    print "Process {} has ended step A".format(process_number)
    sys.stdout.flush()
    queue.put((process_number, "done"))
    proceed.wait() # wait for the signal to do the second part
    print "Process {} has ended step B".format(process_number)
    sys.stdout.flush()
def multiproc_master():
    queue = mp.Queue()
    proceed = mp.Event()
    # func_A takes three arguments, so the shared Event must be passed too
    processes = [mp.Process(target=func_A, args=(x, queue, proceed)) for x in range(4)]
    for p in processes:
        p.start()
    # block=True waits until there is something available
    results = [queue.get(block=True) for p in processes]
    proceed.set() # set continue-flag
    for p in processes: # wait for all to finish (also in windows)
        p.join()
    return results
if __name__ == '__main__':
    split_jobs = multiproc_master()
    print split_jobs

1) From the options you listed in your question, you should probably use multiprocessing in this case to leverage multiple CPU cores and compute things in parallel.
2) Going further from point 1: the Global Interpreter Lock (GIL) means that only one thread can actually execute code at any one time.
A simple example for multithreading that pops up often here is having a prompt for user input for, say, an answer to a maths problem. In the background, they want a timer to keep incrementing at one second intervals to register how long the person took to respond. Without multithreading, the program would block whilst waiting for user input and the counter would not increment. In this case, you could have the counter and the input prompt run on different threads so that they appear to be running at the same time. In reality, both threads are sharing the same CPU resource and are constantly passing an object backwards and forwards (the GIL) to grant them individual access to the CPU. This is hopeless if you want to properly process things in parallel. (Note: In reality, you'd just record the time before and after the prompt and calculate the difference rather than bothering with threads.)
3) I have made a really simple example using multiprocessing. In this case, I spawn 4 processes that compute the sum of squares for a randomly chosen range. These processes do not have a shared GIL and therefore execute independently unlike multithreading. In this example, you can see that all processes start and end at slightly different times, but we can aggregate the results of the processes into a single queue object. The parent process will wait for all 4 child processes to return their computations before moving on. You could then repeat the code for func_B (not included in the code).
import multiprocessing as mp
import time
import random
import sys
def func_A(process_number, queue):
    start = time.time()
    print "Process {} has started at {}".format(process_number, start)
    sys.stdout.flush()
    my_calc = sum([x**2 for x in xrange(random.randint(1000000, 3000000))])
    end = time.time()
    print "Process {} has ended at {}".format(process_number, end)
    sys.stdout.flush()
    queue.put((process_number, my_calc))
def multiproc_master():
    queue = mp.Queue()
    processes = [mp.Process(target=func_A, args=(x, queue)) for x in xrange(4)]
    for p in processes:
        p.start()
    # Unhash the below if you run on Linux (Windows and Linux treat multiprocessing
    # differently as Windows lacks os.fork())
    #for p in processes:
    #    p.join()
    results = [queue.get() for p in processes]
    return results
if __name__ == '__main__':
    split_jobs = multiproc_master()
    print split_jobs

Why threading increase processing time?

I was working on multitasking a basic 2-D DLA simulation. Diffusion Limited Aggregation (DLA) is when you have particles performing a random walk that aggregate when they touch the current aggregate.
In the simulation, I have 10,000 particles walking in a random direction at each step. I use a pool of workers and a queue to feed them. I feed them a list of particles, and each worker calls the method .updatePositionAndAggregate() on each particle.
If I have one worker, I feed it a list of 10,000 particles; if I have two workers, I feed them a list of 5,000 particles each; if I have three workers, I feed them a list of 3,333 particles each; and so on.
Here is the code for the worker:
class Worker(Thread):
    """
    The worker class is here to process a list of particles and try to aggregate
    them.
    """
    def __init__(self, name, particles):
        """
        Initialize the worker and its events.
        """
        Thread.__init__(self, name = name)
        self.daemon = True
        self.particles = particles
        self.start()
    def run(self):
        """
        The worker is started just after its creation and wait to be feed with a
        list of particles in order to process them.
        """
        while True:
            particles = self.particles.get()
            # print self.name + ': wake up with ' + str(len(self.particles)) + ' particles' + '\n'
            # Processing the particles that has been feed.
            for particle in particles:
                particle.updatePositionAndAggregate()
            self.particles.task_done()
            # print self.name + ': is done' + '\n'
And in the main thread:
# Create the workers.
workerQueue = Queue(num_threads)
for i in range(0, num_threads):
    Worker("worker_" + str(i), workerQueue)
# We run the simulation until all the particle has been created
while some_condition():
    # Feed all the workers.
    startWorker = datetime.datetime.now()
    for i in range(0, num_threads):
        j = i * len(particles) / num_threads
        k = (i + 1) * len(particles) / num_threads
        # Feeding the worker thread.
        # print "main: feeding " + worker.name + ' ' + str(len(worker.particles)) + ' particles\n'
        workerQueue.put(particles[j:k])
    # Wait for all the workers
    workerQueue.join()
    workerDurations.append((datetime.datetime.now() - startWorker).total_seconds())
print sum(workerDurations) / len(workerDurations)
So, I print the average time spent waiting for the workers to finish their tasks. I ran some experiments with different numbers of threads.
| num threads | average workers duration (s.) |
|-------------|-------------------------------|
| 1 | 0.147835636364 |
| 2 | 0.228585818182 |
| 3 | 0.258296454545 |
| 10 | 0.294294636364 |
I really wonder why adding workers increases the processing time. I thought that having at least 2 workers would decrease the processing time, but it increases dramatically from 0.14 s to 0.23 s. Can you explain why?
EDIT:
So the explanation is Python's threading implementation. Is there a way to get real multitasking?

This is happening because the threads don't execute at the same time: CPython can run only one thread of Python code at a time due to the GIL (Global Interpreter Lock).
While one thread holds the GIL, the others wait. The interpreter switches between them periodically, and creating and switching threads takes time of its own.
Frankly, the specific code hardly matters: for CPU-bound work in Python, code using 100 threads is slower than code using 10 threads, because the extra threads add overhead without adding parallelism.
Here's an exact quote from the Python docs:
CPython implementation detail:
In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing or concurrent.futures.ProcessPoolExecutor. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
Wikipedia about GIL
StackOverflow about GIL

Threads in python (at least in 2.7) are NOT executed simultaneously because of GIL: https://wiki.python.org/moin/GlobalInterpreterLock - they run in single process and share CPU, therefore you can't use threads for speeding your computation up.
If you want to use parallel computation to speed up your calculation (at least in python2.7), use processes - package multiprocessing.
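As a rough sketch of what a process-based version might look like in Python 3, the example below splits the work into one chunk per worker, as in the question, and hands the chunks to a process pool. split_chunks, step_chunk and step_all are hypothetical names, and adding 1 to a number stands in for particle.updatePositionAndAggregate(). Processes do not share memory, so each chunk has to return its updated values to the parent instead of mutating shared particle objects.

```python
from concurrent.futures import ProcessPoolExecutor


def split_chunks(items, n):
    """Split items into n contiguous chunks of nearly equal size."""
    size = len(items)
    return [items[i * size // n:(i + 1) * size // n] for i in range(n)]


def step_chunk(positions):
    """Stand-in for calling updatePositionAndAggregate() on each particle."""
    return [x + 1 for x in positions]


def step_all(positions, workers=2):
    """Advance all positions by one step using a pool of worker processes."""
    chunks = split_chunks(positions, workers)
    with ProcessPoolExecutor(max_workers=workers) as executor:
        # map() yields the chunk results in the same order as the inputs.
        results = executor.map(step_chunk, chunks)
        return [x for chunk in results for x in chunk]


if __name__ == "__main__":
    print(step_all(list(range(10)), workers=3))
```

Whether this is faster in practice depends on how expensive each update is compared with the cost of pickling the chunks and sending them between processes, so it is worth timing on the real simulation.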

This is due to Python's global interpreter lock. Unfortunately, with the GIL only one thread can execute Python code at a time, so CPU-bound threads will never exceed usage of 1 CPU core. Have a look here to get you started on understanding the GIL: https://wiki.python.org/moin/GlobalInterpreterLock
Check your running processes (Task Manager in Windows, for example) and you will notice that only one core is being utilized by your Python application.
I would suggest looking at multi-processing in Python, which is not hindered by the GIL: https://docs.python.org/2/library/multiprocessing.html

It takes time to actually create the other thread and start processing it. Since we don't have control of the scheduler, I'm willing to bet both of these threads get scheduled on the same core (since the work is so small), therefore you are adding the time it takes to create the thread and no parallel processing is done

Python Multiprocessing Worker/Queue

I have a python function that has to run 12 times in total. I have this set up currently to use Pool from the multiprocessing library to run up to all of them in parallel. Typically I run 6 at a time because the function is CPU intensive and running 12 in parallel often causes the program to crash. When we do 6 at a time, the second set of 6 will not begin until all of the first 6 processes are finished. Ideally, we would like another one (e.g. the 7th) to kick off as soon as one from the initial batch of 6 is finished, so that 6 are always running at once while there are more to start. Right now the code looks like this (it would be called twice, passing the first 6 elements in one list and then the second 6 in another):
from multiprocessing import Pool
def start_pool(project_list):
    pool = Pool(processes=6)
    pool.map(run_assignments_parallel,project_list[0:6])
So I have been trying to implement a worker/queue solution and have run into some issues. I have a worker function that looks like this:
def worker(work_queue, done_queue):
    try:
        for proj in iter(work_queue.get, 'STOP'):
            print proj
            run_assignments_parallel(proj)
            done_queue.put('finished ' + proj )
    except Exception, e:
        done_queue.put("%s failed on %s with: %s" % (current_process().name, proj, e.message))
    return True
And the code to call the worker function is as follows:
workers = 6
work_queue = Queue()
done_queue = Queue()
processes = []
for project in project_list:
    print project
    work_queue.put(project)
for w in xrange(workers):
    p = Process(target=worker, args=(work_queue, done_queue))
    p.start()
    processes.append(p)
    work_queue.put('STOP')
for p in processes:
    p.join()
done_queue.put('STOP')
for status in iter(done_queue.get, 'STOP'):
    print status
project_list is just a list of paths for the 12 projects that need to be run in the function 'run_assignments_parallel.'
The way this is written now, the function is getting called more than once for the same project, and I can't really tell what is going on. This code is based on an example I found, and I am pretty sure the looping structure is messed up. Any help would be great, and I apologize for my ignorance on the matter. Thanks!

Ideally, we would like another one (e.g. the 7th) to kick off as soon as one from the initial batch of 6 is finished- So that 6 are running at once while there are more to start.
All you need to change is to pass all 12 input parameters instead of 6:
from multiprocessing import Pool
pool = Pool(processes=6) # run no more than 6 at a time
pool.map(run_assignments_parallel, project_list) # pass full list (12 items)
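If you also want to see each status as soon as its project finishes, rather than only after all 12 are done, a variation along these lines may help. It is a minimal Python 3 sketch: run_project is a hypothetical stand-in for run_assignments_parallel, and imap_unordered is used because it yields results in completion order.

```python
from multiprocessing import Pool


def run_project(path):
    """Hypothetical stand-in for run_assignments_parallel."""
    return "finished " + path


def run_all(project_list, workers=6):
    """Run every project with at most `workers` running at once."""
    statuses = []
    with Pool(processes=workers) as pool:
        # A worker picks up the next project as soon as it becomes free.
        for status in pool.imap_unordered(run_project, project_list):
            print(status)
            statuses.append(status)
    return statuses


if __name__ == "__main__":
    run_all(["project_%d" % i for i in range(12)])
```

The statuses arrive in whatever order the projects complete, so this suits the case where order does not matter.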

You can use the MPipe module.
Create a 6-worker, single-stage pipeline and feed in all your projects as tasks. Then just read the results (in your case, statuses) off the end.
from mpipe import Pipeline, OrderedStage
...
pipe = Pipeline(OrderedStage(run_assignments_parallel), 6)
for project in project_list:
    pipe.put(project)
pipe.put(None) # Signal end of input.
for status in pipe.results():
    print(status)
