Python parallel programming model - python

I'm writing a machine learning program with the following components:
A shared "Experience Pool" with a binary-tree-like data structure.
N simulator processes. Each adds an "experience object" to the pool every once in a while. The pool is responsible for balancing its tree.
M learner processes that sample a batch of "experience objects" from the pool every few moments and perform whatever learning procedure.
I don't know the best way to implement the above. I'm not using Tensorflow, so I cannot take advantage of its parallel capabilities. More concretely,
My first thought is Python 3's built-in multiprocessing library. Unlike with multithreading, however, the multiprocessing module cannot simply have different processes update the same global object. My hunch is that I should use the server-proxy model. Could anyone please give me a rough skeleton code to start with?
Is MPI4py a better solution?
Any other libraries that would be a better fit? I've looked at celery, disque, etc. It's not obvious to me how to adapt them to my use case.

Based on the comments, what you're really looking for is a way to update a shared object from a set of processes that are carrying out a CPU-bound task. The CPU-bounding makes multiprocessing an obvious choice - if most of your work was IO-bound, multithreading would have been a simpler choice.
Your problem follows a simpler server-client model: the clients use the server as a simple stateful store, no communication between any child processes is needed, and no process needs to be synchronised.
Thus, the simplest way to do this is to:
Start a separate process that contains a server.
Inside the server logic, provide methods to update and read from a single object.
Treat both your simulator and learner processes as separate clients that can periodically read and update the global state.
From the server's perspective, the identity of the clients doesn't matter - only their actions do.
Thus, this can be accomplished by using a customised manager in multiprocessing, like so:
# server.py
from multiprocessing.managers import BaseManager
# this represents the data structure you've already implemented.
from ... import ExperienceTree

# An important note: the way proxy objects work is by a shared weak reference to
# the object. If all of your workers die, they take your proxy object with
# them. Thus, if you have an instance, the instance is garbage-collected
# once all references to it have been erased. I have chosen to sidestep
# this in my code by using class variables and classmethods so that instances
# are never used - you may define __init__, etc. if you so wish, but
# just be aware of what will happen to your object once all workers are gone.
class ExperiencePool(object):
    tree = ExperienceTree()

    @classmethod
    def update(cls, experience_object):
        '''Update the tree with an experience object.'''
        cls.tree.update(experience_object)

    @classmethod
    def sample(cls):
        '''Sample a batch of experience objects from the tree.'''
        return cls.tree.sample()

# subclass BaseManager
class Server(BaseManager):
    pass

# register the class you just created - now you can access an instance of
# ExperiencePool using Server.Shared_Experience_Pool().
Server.register('Shared_Experience_Pool', ExperiencePool)

if __name__ == '__main__':
    # serve forever on port 8080 of the local machine
    manager = Server(('localhost', 8080), authkey=b'none')
    server = manager.get_server()
    server.serve_forever()
Now for all of your clients you can just do:
# client.py - you can always have a separate client file for a learner and a simulator.
from multiprocessing.managers import BaseManager
from server import ExperiencePool

class Server(BaseManager):
    pass

Server.register('Shared_Experience_Pool', ExperiencePool)

if __name__ == '__main__':
    # connect to the server listening on port 8080 of the local machine
    server_process = Server(('localhost', 8080), authkey=b'none')
    server_process.connect()
    experience_pool = server_process.Shared_Experience_Pool()
    # now do your own thing and call `experience_pool.sample()` or `experience_pool.update(...)` whenever you want.
You may then launch one server.py and as many workers as you want.
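For illustration, a simulator's and a learner's main loops might look roughly like this (a sketch only: run_one_simulation_step and learn_from are hypothetical stand-ins for your own logic, and experience_pool is obtained exactly as in client.py above):
import time

# inside a simulator client, after obtaining experience_pool as in client.py
while True:
    experience = run_one_simulation_step()   # hypothetical: your simulator logic
    experience_pool.update(experience)        # push it into the shared pool
    time.sleep(0.1)                           # "every once in a while"

# inside a learner client, after obtaining experience_pool as in client.py
while True:
    batch = experience_pool.sample()          # pull a batch from the shared pool
    learn_from(batch)                         # hypothetical: your learning procedure
    time.sleep(1.0)                           # "every few moments"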
Is This The Best Design?
Not always. You may run into race conditions: a learner may read stale data if it competes with a simulator process that is writing at the same time.
If you want to ensure a preference for latest writes, you may additionally use a lock whenever your simulators are trying to write something, preventing your other processes from getting a read until the write finishes.
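A minimal sketch of that, reusing the ExperiencePool class from server.py above: since every proxy call is executed inside the single server process, an ordinary threading.Lock held around update() and sample() is enough to keep a read from interleaving with a write.
import threading
from ... import ExperienceTree  # your data structure, as before

class ExperiencePool(object):
    tree = ExperienceTree()
    lock = threading.Lock()

    @classmethod
    def update(cls, experience_object):
        with cls.lock:                 # writers block readers until they finish
            cls.tree.update(experience_object)

    @classmethod
    def sample(cls):
        with cls.lock:                 # readers wait for any in-flight write
            return cls.tree.sample()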

Related

Running two Tensorflow trainings in parallel using joblib and dask

I have the following code that runs two TensorFlow trainings in parallel using Dask workers implemented in Docker containers.
I need to launch two processes, using the same dask client, where each will train their respective models with N workers.
To that end, I do the following:
I use joblib.delayed to spawn the two processes.
Within each process I run with joblib.parallel_backend('dask'): to execute the fit/training logic. Each training process triggers N dask workers.
The problem is that I don't know if the entire process is thread safe. Are there any concurrency elements that I'm missing?
# First, submit the function twice using joblib.delayed
delayed_funcs = [joblib.delayed(train)(sub_task) for sub_task in [123, 456]]
parallel_pool = joblib.Parallel(n_jobs=2)
parallel_pool(delayed_funcs)

# Second, submit each training process
def train(sub_task):
    global client
    if client is None:
        print('connecting')
        client = Client()

    data = some_data_to_train

    # Third, process the training itself with N workers
    with joblib.parallel_backend('dask'):
        X = data[columns]
        y = data[label]

        niceties = dict(verbose=False)
        model = KerasClassifier(build_fn=build_layers,
                                loss=tf.keras.losses.MeanSquaredError(), **niceties)
        model.fit(X, y, epochs=500, verbose=0)
This is pure speculation, but one potential concurrency issue is due to if client is None: part, where two processes could race to create a Client.
If this is resolved (e.g. by explicitly creating a client in advance), then the dask scheduler will prioritise tasks based on their time of submission (unless a priority is explicitly assigned) and on the structure of the task graph (DAG); further details are available in the docs.
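For illustration only (not the asker's joblib setup): dask.distributed's Client.submit accepts a priority keyword, so an explicit preference could be expressed like
# sketch: make the priority explicit so submission order is not the deciding factor
future_a = client.submit(train, 123, priority=10)   # scheduled preferentially
future_b = client.submit(train, 456, priority=1)
results = client.gather([future_a, future_b])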
The question, as given, could easily be marked as "unclear" for SO. A couple of notes:
global client : makes the client object available outside of the function. But since the function runs in another process, making the client there does not affect the other process.
if client is None : this is a name error, your code doesn't actually run as written
client = Client() : you make a new cluster in each subprocess, each assuming the total resources available, oversubscribing those resources.
dask knows whether any client has been created in the current process, but that doesn't help you here
You must ask yourself: why are you creating processes for the two fits at all? Why not just let Dask figure out its parallelism, which is what it's meant for?
--
-EDIT-
To answer the form of the question asked in a comment:
My question is whether using the same client variable in these two parallel processes creates a problem.
No, the two client variables are unrelated to one-another. You may see a warning message about not being able to bind to a default port, which you can safely ignore. However, please don't make it global as this is unnecessary and makes what you are doing less clear.
--
I think I must answer the question as phrased in your comment, which I advise you to add to the main question:
I need to launch two processes, using the same dask client, where each will train their respective models with N workers.
You have the following options:
create a client with a specific known address within your program or beforehand, then connect to it
create a default client Client() and get its address (e.g., client._scheduler_identity['address']) and connect to that
write a scheduler information file with client.write_scheduler_file and use that
You will connect in the function with
client = Client(address)
or
client = Client(scheduler_file=the_file_you_wrote)
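Putting option 3 together, a minimal sketch (the file name, worker count, and the body of the fit are placeholders, not the asker's actual code) might look like:
from dask.distributed import Client, LocalCluster
import joblib

def train(sub_task):
    # connect to the already-running scheduler instead of creating a new cluster
    client = Client(scheduler_file='scheduler.json')  # placeholder file name
    with joblib.parallel_backend('dask'):
        pass  # fit the model for this sub_task here

if __name__ == '__main__':
    cluster = LocalCluster(n_workers=4)               # placeholder worker count
    client = Client(cluster)
    client.write_scheduler_file('scheduler.json')     # record how to reach the scheduler
    delayed_funcs = [joblib.delayed(train)(t) for t in [123, 456]]
    joblib.Parallel(n_jobs=2)(delayed_funcs)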

Python multi-processing: one worker, dynamic number of receivers of all worker data (1:n)

I am planning to set up a small proxy service for a remote sensor that only accepts one connection. I have a temporary solution and I am now designing a more robust version, and have therefore dived deeper into the python multiprocessing module.
I have written a couple of systems in python using a main process which spawns subprocesses using the multiprocessing module, and used multiprocessing.Queue to communicate between them. This works quite well and some of these programs/scripts are doing their job in a production environment.
The new case is slightly different since it uses 2+n processes:
One data collector, which reads data from the sensor (at 100 Hz) and every once in a while receives short ASCII strings as commands
One main server, which binds to a socket, listens for new connections and spawns...
n child servers, which handle clients who want to receive the sensor data
While communication from the child servers to the data collector seems pretty straightforward using a multiprocessing.Queue, which handles an n:1 connection well enough, I have problems with the other direction. I can't use a queue for that as well, because all child servers need to get all the data the sensor produces while they are active. At least I haven't found a way to configure a Queue to mimic that behaviour, since get removes the front item from the Queue by design.
I already looked into shared memory, which massively increases the management overhead, since as far as I understand it I would basically need to implement a streaming buffer myself.
The only safe way I see right now is using a redis server and message queues, but I am a bit hesitant, since that would need more infrastructure than I would like.
Is there a pure python internal way?
Maybe you can use MQTT for that?
You did not clearly specify, but it sounds like the observer pattern -
or do you want the clients to poll each time they need data?
It depends on which delays / data rate / jitter etc. you can accept.
After you provided the information:
The whole setup runs on one machine in one process space. What I would like to have, is a way without going through a third party process
I would suggest checking out the observer pattern.
More information can be found, for example, at:
https://www.youtube.com/watch?v=_BpmfnqjgzQ&t=1882s
and
https://refactoring.guru/design-patterns/observer/python/example
and
https://www.protechtraining.com/blog/post/tutorial-the-observer-pattern-in-python-879
and
https://python-3-patterns-idioms-test.readthedocs.io/en/latest/Observer.html
Your server should fork for each new connection and register as an observer, and will therefore be informed about every change.
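A pure-python sketch of that idea with multiprocessing (the process names and sample source are placeholders, not the asker's code): the data collector keeps one Queue per child server and puts every sample into each of them, so every observer sees all the data.
import multiprocessing as mp

def data_collector(subscriber_queues, samples):
    # `samples` stands in for the real 100 Hz sensor source
    for sample in samples:
        for q in subscriber_queues:   # fan the sample out to every observer
            q.put(sample)
    for q in subscriber_queues:
        q.put(None)                   # sentinel: no more data

def child_server(q, client_id):
    while True:
        sample = q.get()
        if sample is None:
            break
        print('client %d got %r' % (client_id, sample))

if __name__ == '__main__':
    queues = [mp.Queue() for _ in range(3)]   # one queue per child server
    servers = [mp.Process(target=child_server, args=(q, i))
               for i, q in enumerate(queues)]
    for p in servers:
        p.start()
    data_collector(queues, samples=range(10))
    for p in servers:
        p.join()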

Concurrency within redis queue

I'm working with a django application hosted on heroku with the redistogo addon (nano pack). I'm using rq to execute tasks in the background - the tasks are initiated by online users. I have a constraint on the number of connections; limited resources, I'm afraid.
I'm currently having a single worker running over 'n' number of queues. Each queue uses an instance of connection from the connection pool to handle 'n' different types of task. For instance, let's say 4 users initiate the same type of task; I would like to have my main worker create child processes dynamically to handle it. Is there a way to achieve the required multiprocessing and concurrency?
I tried with the multiprocessing module, initially without introducing Lock(); but that exposes and overwrites the user-passed data to the initiating function with the previous request's data. After applying locks, it prevents the second user from initiating requests by returning a server error (500).
github link #1: Looks like the team is working on the PR; not yet released though!
github link #2: This post helps to explain creating more workers at runtime.
This solution however also overrides the data. The new request is again processed with the previous request's data.
Let me know if you need to see some code. I'll try to post a minimal reproducible snippet.
Any thoughts/suggestions/guidelines?
Did you get a chance to try AutoWorker?
Spawn RQ Workers automatically.
from autoworker import AutoWorker
aw = AutoWorker(queue='high', max_procs=6)
aw.work()
It makes use of multiprocessing with StrictRedis from the redis module and the following imports from rq:
from rq.contrib.legacy import cleanup_ghosts
from rq.queue import Queue
from rq.worker import Worker, WorkerStatus
After looking under the hood, I realised the Worker class already implements multiprocessing.
The work function internally calls execute_job(job, queue), which, as quoted in the module,
Spawns a work horse to perform the actual work and passes it a job.
The worker will wait for the work horse and make sure it executes within the given timeout bounds,
or will end the work horse with SIGALRM.
The execute_job() function implicitly calls fork_work_horse(job, queue), which spawns a work horse to perform the actual work and passes it a job, as per the following logic:
def fork_work_horse(self, job, queue):
    child_pid = os.fork()
    os.environ['RQ_WORKER_ID'] = self.name
    os.environ['RQ_JOB_ID'] = job.id
    if child_pid == 0:
        self.main_work_horse(job, queue)
    else:
        self._horse_pid = child_pid
        self.procline('Forked {0} at {1}'.format(child_pid, time.time()))
The main_work_horse makes an internal call to perform_job(job, queue) which makes a few other calls to actually perform the job.
All the steps of The Worker Lifecycle mentioned on rq's official documentation page are taken care of within these calls.
It's not the multiprocessing I was expecting, but I guess they have a way of doing things. However, my original post is still not answered by this, and I'm still not sure about concurrency.
The documentation there still needs to be worked upon, since it hardly covers the true essence of this library!
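For what it's worth, a minimal sketch (not AutoWorker's internals; the queue name and worker count are placeholders) of running several independent rq workers in their own processes, all draining the same queue, so same-type tasks run concurrently:
from multiprocessing import Process
from redis import Redis
from rq import Queue, Worker

def run_worker(queue_name='high'):
    conn = Redis()   # note: each worker process holds its own Redis connection
    worker = Worker([Queue(queue_name, connection=conn)], connection=conn)
    worker.work()

if __name__ == '__main__':
    procs = [Process(target=run_worker) for _ in range(4)]   # placeholder count
    for p in procs:
        p.start()
    for p in procs:
        p.join()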

Storing subprocess object in memory using global singleton instance

So I am using subprocess to spawn a long running process through the web interface using Django. Now if a user wants to come back to the page I would like to give him the option of terminating the subprocess at a later stage.
How can I do this? I implemented the same thing in Java and made a global singleton ProcessManager dictionary to store the Process object in memory. Can I do something similar in Python?
EDIT
Yes, singletons and a hash of ProcessManagers are the way of doing it cleanly. Emmanuel's code works perfectly fine with a few modifications.
Thanks
I think an easy way to implement the singleton pattern in Python is via class attributes:
import subprocess

class ProcessManager(object):
    __PROCESS = None

    @staticmethod
    def set_process(args):
        # Sets the singleton process; refer to the attribute through the class
        # so the name-mangled class attribute is resolved correctly
        if ProcessManager.__PROCESS is None:
            ProcessManager.__PROCESS = subprocess.Popen(args)
        # else: exception handling

    @staticmethod
    def kill_process():
        # Kills the singleton process
        if ProcessManager.__PROCESS is None:
            pass  # exception handling
        else:
            ProcessManager.__PROCESS.kill()
Then you can use this class via:
from my_module import ProcessManager
my_args = ...
ProcessManager.set_process(my_args)
...
ProcessManager.kill_process()
Notes:
the ProcessManager is in charge of creating the process, to be symmetrical with its ending
I don't have enough knowledge in multi-threading to know if this works in multi-threading mode
You can use the same technique in Python as you did in Java, that is store the reference to the process in a module variable or implement a kind of a singleton.
The only problem you have, as opposed to Java, is that Python does not have a rich analogue of the Servlet specification, and there is no interface to handle application start or finish. In most cases you should not be worried about how many instances of your application are running, because you fetch all data from a persistent storage. But in this case you should understand how your application is deployed.
If there is a single long-running instance of your application (a FastCGI instance, for example, or a single WSGI application on cherrypy), you can isolate the process-handling functionality in a separate module and load it when the module is imported (any module is imported only once within an application). If there are many instances (many FastCGI instances, or plain CGI scripts), you should instead detach the child processes, keep their ids in a persistent storage (a database, or files), and intersect them with the list of currently running processes on demand.
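A rough sketch of that second case, assuming a simple file-based store for the pid (a real app might use its database instead); the path and helper names here are made up for illustration:
import os
import signal
import subprocess

PID_FILE = '/tmp/long_running_task.pid'   # hypothetical location

def start_task(args):
    proc = subprocess.Popen(args)
    with open(PID_FILE, 'w') as f:
        f.write(str(proc.pid))            # persist the pid for later requests

def is_running():
    try:
        with open(PID_FILE) as f:
            pid = int(f.read())
        os.kill(pid, 0)                   # signal 0 only checks that the pid exists
        return True
    except (IOError, OSError, ValueError):
        return False

def terminate_task():
    if is_running():
        with open(PID_FILE) as f:
            os.kill(int(f.read()), signal.SIGTERM)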

How to synchronize a python dict with multiprocessing

I am using Python 2.6 and the multiprocessing module for multi-threading. Now I would like to have a synchronized dict (where the only atomic operation I really need is the += operator on a value).
Should I wrap the dict with a multiprocessing.sharedctypes.synchronized() call? Or is another approach the way to go?
Intro
There seems to be a lot of arm-chair suggestions and no working examples. None of the answers listed here even suggest using multiprocessing and this is quite a bit disappointing and disturbing. As python lovers we should support our built-in libraries, and while parallel processing and synchronization is never a trivial matter, I believe it can be made trivial with proper design. This is becoming extremely important in modern multi-core architectures and cannot be stressed enough!
That said, I am far from satisfied with the multiprocessing library, as it is still in its infancy stages with quite a few pitfalls, bugs, and being geared towards functional programming (which I detest). Currently I still prefer the Pyro module (which is way ahead of its time) over multiprocessing due to multiprocessing's severe limitation in being unable to share newly created objects while the server is running. The "register" class-method of the manager objects will only actually register an object BEFORE the manager (or its server) is started.
Enough chatter, more code:
Server.py
from multiprocessing.managers import SyncManager

class MyManager(SyncManager):
    pass

syncdict = {}
def get_dict():
    return syncdict

if __name__ == "__main__":
    MyManager.register("syncdict", get_dict)
    manager = MyManager(("127.0.0.1", 5000), authkey="password")
    manager.start()
    raw_input("Press any key to kill server".center(50, "-"))
    manager.shutdown()
In the above code example, Server.py makes use of multiprocessing's SyncManager, which can supply synchronized shared objects. This code will not work running in the interpreter because the multiprocessing library is quite touchy about how to find the "callable" for each registered object. Running Server.py will start a customized SyncManager that shares the syncdict dictionary for use by multiple processes and can be connected to by clients either on the same machine, or, if run on an IP address other than loopback, from other machines. In this case the server is run on loopback (127.0.0.1) on port 5000. The authkey parameter secures the connections that manipulate syncdict. When any key is pressed the manager is shut down.
Client.py
from multiprocessing.managers import SyncManager
import sys, time

class MyManager(SyncManager):
    pass

MyManager.register("syncdict")

if __name__ == "__main__":
    manager = MyManager(("127.0.0.1", 5000), authkey="password")
    manager.connect()
    syncdict = manager.syncdict()

    print "dict = %s" % (dir(syncdict))
    key = raw_input("Enter key to update: ")
    inc = float(raw_input("Enter increment: "))
    sleep = float(raw_input("Enter sleep time (sec): "))

    try:
        # if the key doesn't exist create it
        if not syncdict.has_key(key):
            syncdict.update([(key, 0)])
        # increment key value every sleep seconds
        # then print syncdict
        while True:
            syncdict.update([(key, syncdict.get(key) + inc)])
            time.sleep(sleep)
            print "%s" % (syncdict)
    except KeyboardInterrupt:
        print "Killed client"
The client must also create a customised SyncManager, registering "syncdict", this time without passing in a callable to retrieve the shared dict. It then uses the customised SyncManager to connect using the loopback IP address (127.0.0.1) on port 5000 and an authkey, establishing a secure connection to the manager started in Server.py. It retrieves the shared dict syncdict by calling the registered callable on the manager. It prompts the user for the following:
The key in syncdict to operate on
The amount to increment the value accessed by the key every cycle
The amount of time to sleep per cycle in seconds
The client then checks to see if the key exists. If it doesn't it creates the key on the syncdict. The client then enters an "endless" loop where it updates the key's value by the increment, sleeps the amount specified, and prints the syncdict only to repeat this process until a KeyboardInterrupt occurs (Ctrl+C).
Annoying problems
The Manager's register methods MUST be called before the manager is started otherwise you will get exceptions even though a dir call on the Manager will reveal that it indeed does have the method that was registered.
All manipulations of the dict must be done with methods and not dict assignments (syncdict["blast"] = 2 will fail miserably because of the way multiprocessing shares custom objects)
Using SyncManager's dict method would alleviate annoying problem #2 except that annoying problem #1 prevents the proxy returned by SyncManager.dict() being registered and shared. (SyncManager.dict() can only be called AFTER the manager is started, and register will only work BEFORE the manager is started so SyncManager.dict() is only useful when doing functional programming and passing the proxy to Processes as an argument like the doc examples do)
The server AND the client both have to register even though intuitively it would seem like the client would just be able to figure it out after connecting to the manager (Please add this to your wish-list multiprocessing developers)
Closing
I hope you enjoyed this quite thorough and slightly time-consuming answer as much as I have. I was having a great deal of trouble getting straight in my mind why I was struggling so much with the multiprocessing module where Pyro makes it a breeze and now thanks to this answer I have hit the nail on the head. I hope this is useful to the python community on how to improve the multiprocessing module as I do believe it has a great deal of promise but in its infancy falls short of what is possible. Despite the annoying problems described I think this is still quite a viable alternative and is pretty simple. You could also use SyncManager.dict() and pass it to Processes as an argument the way the docs show and it would probably be an even simpler solution depending on your requirements it just feels unnatural to me.
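For completeness, a minimal sketch of that last alternative - a manager dict passed to child processes as an argument, in the style of the docs (the worker function and counts below are made up for illustration):
from multiprocessing import Process, Manager

def incrementer(shared, key, amount):
    # note: this read-modify-write is not atomic, so concurrent increments can be lost
    shared[key] = shared.get(key, 0) + amount

if __name__ == "__main__":
    manager = Manager()
    shared = manager.dict()
    shared["count"] = 0
    procs = [Process(target=incrementer, args=(shared, "count", 1)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(shared["count"])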
I would dedicate a separate process to maintaining the "shared dict": just use e.g. xmlrpclib to make that tiny amount of code available to the other processes, exposing via xmlrpclib e.g. a function taking key, increment to perform the increment and one taking just the key and returning the value, with semantic details (is there a default value for missing keys, etc, etc) depending on your app's needs.
Then you can use any approach you like to implement the shared-dict dedicated process: all the way from a single-threaded server with a simple dict in memory, to a simple sqlite DB, etc, etc. I suggest you start with code "as simple as you can get away with" (depending on whether you need a persistent shared dict, or persistence is not necessary to you), then measure and optimize as and if needed.
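A minimal sketch of such a dedicated process, assuming the Python 2 standard-library names (SimpleXMLRPCServer on the server side, xmlrpclib on the client side); the port and function names are placeholders:
# shared_dict_server.py
from SimpleXMLRPCServer import SimpleXMLRPCServer

shared = {}

def increment(key, amount):
    shared[key] = shared.get(key, 0) + amount
    return shared[key]

def get_value(key):
    return shared.get(key, 0)

server = SimpleXMLRPCServer(("127.0.0.1", 9000), allow_none=True)
server.register_function(increment)
server.register_function(get_value)
# the default server is single-threaded, so increments are naturally serialized
server.serve_forever()

# a client process would then do, e.g.:
#   import xmlrpclib
#   proxy = xmlrpclib.ServerProxy("http://127.0.0.1:9000")
#   proxy.increment("some_key", 1)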
In response to the question of an appropriate solution to the concurrent-write issue: I did some very quick research and found that this article suggests a lock/semaphore solution. (http://effbot.org/zone/thread-synchronization.htm)
While the example isn't specific to a dictionary, I'm pretty sure you could code a class-based wrapper object to help you work with dictionaries based on this idea (a rough sketch follows the snippet below).
If I had a requirement to implement something like this in a thread-safe manner, I'd probably use the Python semaphore solution. (Assuming my earlier merge technique wouldn't work.) I believe that semaphores generally slow down thread efficiency due to their blocking nature.
From the site:
A semaphore is a more advanced lock mechanism. A semaphore has an internal counter rather than a lock flag, and it only blocks if more than a given number of threads have attempted to hold the semaphore. Depending on how the semaphore is initialized, this allows multiple threads to access the same code section simultaneously.
semaphore = threading.BoundedSemaphore()
semaphore.acquire() # decrements the counter
... access the shared resource; work with dictionary, add item or whatever.
semaphore.release() # increments the counter
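A rough sketch of that class-based wrapper, guarding a plain dict with a BoundedSemaphore as in the quoted snippet (the class and method names are made up for illustration):
import threading

class SynchronizedDict(object):
    def __init__(self):
        self._lock = threading.BoundedSemaphore()
        self._data = {}

    def increment(self, key, amount=1):
        self._lock.acquire()           # decrements the counter
        try:
            self._data[key] = self._data.get(key, 0) + amount
        finally:
            self._lock.release()       # increments the counter

    def get(self, key, default=None):
        self._lock.acquire()
        try:
            return self._data.get(key, default)
        finally:
            self._lock.release()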
Is there a reason that the dictionary needs to be shared in the first place? Could you have each thread maintain their own instance of a dictionary and either merge at the end of the thread processing or periodically use a call-back to merge copies of the individual thread dictionaries together?
I don't know exactly what you are doing, so keep in mind that my written plan may not work verbatim. What I'm suggesting is more of a high-level design idea.
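To illustrate the merge idea from the previous paragraph, a small sketch (the chunking and counting are made up for illustration): each worker builds its own local dict and the parent merges the results afterwards, so nothing needs to be shared while the work runs.
from multiprocessing import Pool

def count_chunk(chunk):
    local = {}
    for item in chunk:
        local[item] = local.get(item, 0) + 1
    return local

if __name__ == "__main__":
    chunks = [["a", "b"], ["a", "c"], ["b", "b"]]
    pool = Pool(processes=3)
    merged = {}
    for partial in pool.map(count_chunk, chunks):
        for key, value in partial.items():
            merged[key] = merged.get(key, 0) + value
    pool.close()
    pool.join()
    print(merged)   # {'a': 2, 'b': 3, 'c': 1} (dict order may vary)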
