Multiprocesses python with shared memory - python

I have an object that connects to a remote websocket server. I need to run another task in parallel at the same time, but I don't want to create a new connection to the server. Since threads are the easiest way to do this, that is what I have been using so far. However, I have been getting huge latency because of the GIL. Can I achieve the same thing as with threads, but with multiple processes running in parallel?
This is the code that I have:
import thread

class WebSocketApp(object):
    def on_open(self):
        # Create another thread to make sure the commands are always being read
        print "Creating thread..."
        try:
            thread.start_new_thread(self.read_commands, ())
        except:
            print "Error: Unable to start thread"
Is there an equivalent way to do this with multiprocesses?
Thanks!

The direct equivalent is
import multiprocessing

class WebSocketApp(object):
    def on_open(self):
        # Create another process to make sure the commands are always being read
        print "Creating process..."
        try:
            multiprocessing.Process(target=self.read_commands).start()
        except:
            print "Error: Unable to start process"
However, this doesn't address the "shared memory" aspect, which has to be handled a little differently than it is with threads, where you can just use global variables. You haven't really specified what objects you need to share between processes, so it's hard to say exactly what approach you should take. The multiprocessing documentation does cover ways to deal with shared state, however. Do note that in general it's better to avoid shared state if possible, and just explicitly pass state between the processes, either as an argument to the Process constructor or via something like a Queue.
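As a minimal sketch of "explicitly pass state" (written for Python 3; the function names and the "cmd" prefix are made up for illustration, and read_commands here is only a stand-in for the asker's method), state goes into the child through args and comes back through a Queue. The sketch pins the POSIX-only "fork" start method so that it runs as a flat script; with the default start method on Windows or macOS the driver code would need to sit under an `if __name__ == "__main__":` guard instead.

```python
import multiprocessing as mp

# Assumption: POSIX. "fork" lets this run as a flat script without a
# __main__ guard; it is not available on Windows.
ctx = mp.get_context("fork")


def read_commands(prefix, out_queue):
    """Hypothetical stand-in for the asker's read_commands method."""
    for n in range(3):
        out_queue.put("%s-%d" % (prefix, n))
    out_queue.put(None)  # sentinel: no more commands


def collect_commands(prefix):
    out_queue = ctx.Queue()
    # State goes in through args and comes back through the queue.
    child = ctx.Process(target=read_commands, args=(prefix, out_queue))
    child.start()
    commands = list(iter(out_queue.get, None))  # drain until the sentinel
    child.join()
    return commands


print(collect_commands("cmd"))
```

Draining the queue before calling join() avoids waiting on a child that is itself still blocked flushing data into the queue.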

You sure can, use something along the lines of:
from multiprocessing import Process

class WebSocketApp(object):
    def on_open(self):
        # Create another process to make sure the commands are always being read
        print "Creating process..."
        try:
            # Add other arguments to this tuple
            p = Process(target=WebSocketApp.read_commands, args=(self,))
            p.start()
        except:
            print "Error: Unable to start process"
It is important to note, however, that as soon as the object is sent to the other process, the self in the parent and the self in the child diverge and represent different objects. If you wish to communicate between them, you will need to use something like the Queue or Pipe included in the multiprocessing module.
You may need to keep a reference to all the processes (p in this case) in your main process in order to be able to tell them that your program is terminating (a still-running child process will appear to hang the parent when it exits), but that depends on the nature of your program.
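The divergence can be seen in a small sketch (Python 3; the Counter class is hypothetical and merely stands in for WebSocketApp). The child changes its own copy of the object and reports the new value through a Pipe, while the parent's copy stays untouched. The "fork" start method is an assumption that keeps the sketch runnable as a flat script on POSIX systems.

```python
import multiprocessing as mp

ctx = mp.get_context("fork")  # assumption: POSIX-only start method


class Counter(object):
    """Hypothetical object standing in for WebSocketApp."""

    def __init__(self):
        self.count = 0

    def bump(self, conn):
        self.count += 1        # changes the child's copy only
        conn.send(self.count)  # report the new value explicitly
        conn.close()


def run_demo():
    counter = Counter()
    parent_end, child_end = ctx.Pipe()
    p = ctx.Process(target=counter.bump, args=(child_end,))
    p.start()
    seen_by_child = parent_end.recv()
    p.join()  # keep the handle so the child can be waited on
    return counter.count, seen_by_child


print(run_demo())
```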
If you wish to keep the object the same, you can do one of a few things:
Make all of your object properties either single values or arrays and then do something similar to this:
from multiprocessing import Process, Value, Array

class WebSocketApp(object):
    def __init__(self):
        self.my_value = Value('d', 0.3)
        self.my_array = Array('i', [4, 10, 4])
        # -- Snip --
And then these values should work as shared memory. The types are very restrictive though (You must specify their types)
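A short sketch of how such values behave across a process boundary (Python 3; the scale function and the numbers are invented for illustration, and the "fork" start method is assumed so the sketch runs as a flat script on POSIX):

```python
import multiprocessing as mp

ctx = mp.get_context("fork")  # assumption: POSIX-only start method


def scale(value, array, factor):
    # Runs in the child, but writes land in memory the parent also sees.
    value.value *= factor
    for i in range(len(array)):
        array[i] *= factor


def run_demo():
    my_value = ctx.Value('d', 0.5)         # 'd' = C double
    my_array = ctx.Array('i', [4, 10, 4])  # 'i' = C int
    p = ctx.Process(target=scale, args=(my_value, my_array, 2))
    p.start()
    p.join()
    return my_value.value, my_array[:]


print(run_demo())
```

Only one process writes here, so no locking is shown; with several writers each read-modify-write would need to hold the value's lock.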
A different answer is to use a manager:
from multiprocessing import Process, Manager

class WebSocketApp(object):
    def __init__(self):
        self.my_manager = Manager()
        self.my_list = self.my_manager.list()
        self.my_dict = self.my_manager.dict()
        # -- Snip --
And then self.my_list and self.my_dict act as a shared-memory list and dictionary respectively.
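A sketch of the manager approach in use (Python 3; the record function and the worker names are made up, and "fork" is assumed so the sketch runs as a flat script on POSIX). Note that manager objects are proxies to a separate server process rather than raw shared memory, so every access is a round trip.

```python
import multiprocessing as mp

ctx = mp.get_context("fork")  # assumption: POSIX-only start method


def record(shared_list, shared_dict, name):
    # Each call is forwarded to the manager's server process.
    shared_list.append(name)
    shared_dict[name] = len(name)


def run_demo():
    with ctx.Manager() as manager:
        my_list = manager.list()
        my_dict = manager.dict()
        workers = [ctx.Process(target=record, args=(my_list, my_dict, n))
                   for n in ("alpha", "beta")]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        # Copy the contents out before the manager shuts down.
        return sorted(my_list[:]), my_dict.copy()


print(run_demo())
```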
However, the types for both of these approaches can be restrictive so you may have to roll your own technique with a Queue and a Semaphore. But it depends what you're doing.
Check out the multiprocessing documentation for more information.

Related

multiprocessing's Queue inside Manager.Namespace()

I am currently creating a class which is supposed to execute some methods in a multi-threaded way, using the multiprocessing module. I execute the real computation using a Pool of n workers. Now I wanted to assign each of the currently n active workers an index between 0 and n for some other calculation. To do this, I wanted to use a shared Queue to assign an index in a way, that at every time no two workers have the same id. To share the same Queue inside the class between the different threads, I wanted to store it inside a Manager.Namespace(). But doing this, I got some problems with the Queue. Therefore, I created a minimal version of my problem and ended up with something like this:
from multiprocess import Process, Queue, Manager, Pool, cpu_count

class A(object):
    def __init__(self):
        manager = Manager()
        self.ns = manager.Namespace()
        self.ns.q = manager.Queue()

    def foo(self):
        for i in range(10):
            print(i)
            self.ns.q.put(i)
            print(self.ns.q.get())
            print(self.ns.q.qsize())

a = A()
a.foo()
In this code, the execution stops before the second print statement - therefore, I think, that no data is actually written in the Queue. When I remove the namespace related stuff the code works flawlessly. Is this the intended behaviour of the multiprocessings objects and am I doing something wrong? Or is this some kind of bug?
Yes, you should not use a Namespace here. When you put a Queue object into manager.Namespace(), each process gets a new Queue instance; the readers and writers of those newly created queue objects have no connection to the parent process, so no message is ever received by the worker processes. Share the Queue on its own instead.
By the way, you mention "thread" many times, but in the context of the multiprocess module a worker is a process, not a thread.
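One way to share the queue on its own, sketched with the standard multiprocessing module (Python 3): the manager queue is passed to each worker directly instead of being stored in a Namespace. The worker body and the names are hypothetical, plain Process objects are used in place of the asker's Pool to keep the sketch small, and the "fork" start method is assumed so it runs as a flat script on POSIX.

```python
import multiprocessing as mp

ctx = mp.get_context("fork")  # assumption: POSIX-only start method


def worker(index_queue, log):
    index = index_queue.get()   # claim an index nobody else holds
    try:
        log.append(index)       # placeholder for the real computation
    finally:
        index_queue.put(index)  # give the index back for the next worker


def run_demo(n_workers=4):
    with ctx.Manager() as manager:
        index_queue = manager.Queue()
        log = manager.list()
        for i in range(n_workers):
            index_queue.put(i)  # one token per concurrently active worker
        procs = [ctx.Process(target=worker, args=(index_queue, log))
                 for _ in range(n_workers)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        return log[:]


print(run_demo())
```

Because an index is only available while no worker holds it, two workers never use the same index at the same time, although an index may be reused once it has been handed back.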

In Python, do I need to protect data transfer between multi-threaded processes?

For example, if a process is generating an image, and other parallel process is accessing this image through a get method, my intuition tells me that it may be dangerous to access that image while it is being written.
In C++ I have to use mutexes to make sure the image isn't accessed while it is being written, otherwise I'm experiencing random segfaults. but since python has some data protection mechanisms that I don't fully know, I'm not sure if I need to do this.
PSEUDO-CODE:
class Camera(object):
    def __init__(self):
        # camera_handler is an object that loads the driver and lets you control the camera.
        self._capture = camera_handler()
        self.new_image = None
        self._is_to_run = False

    def start(self):
        self._is_to_run = True
        self._thread = thread_object(target=self.run)
        self._thread.start()

    def run(self):
        while self._is_to_run:
            self.new_image = self._capture.update()

cam = Camera()
cam.start()
while True:
    image = cam.new_image
    result = do_some_process_image(image)
Is this safe?
First of all, the threading module uses threads, not different processes!
The crucial difference between threads and processes is that the former share an address space (memory), while the latter don't.
The "standard" Python implementation (CPython) uses a Global Interpreter Lock to ensure that only one thread at a time can be executing Python bytecode. So for data that can be updated with a single bytecode instruction (such as STORE_FAST) you might not need mutexes. When a thread that is modifying such a variable is interrupted, either the store has been done or it hasn't.
But in general you definitely need to protect data structures from concurrent reading and modification by multiple threads. If a thread is interrupted while it is in the process of modifying, say, a large dictionary, and execution passes to another thread that tries to read from that dictionary, the reader might find the data in an inconsistent state.
Python shouldn't segfault in situations like this: the global interpreter lock is your friend. However, even in your example there's every chance that a camera interface will call into some C library that doesn't necessarily behave itself. And the GIL doesn't prevent all race conditions in your own code, so you could easily find inconsistent data because of that.
Python does have Lock which is very low-level and doesn't provide much functionality. Condition is a higher-level type that is better for implementing a mutex-like lock:
# Consume one item
with cv:
    while not an_item_is_available():
        cv.wait()
    get_an_available_item()

# Produce one item
with cv:
    make_an_item_available()
    cv.notify()
Incidentally, there was a mutex in Python 2, which was deprecated in 2.6 and removed in Python 3.
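The consume/produce fragments above can be filled in as a small runnable sketch (Python 3). The deque, the function names and the item count are assumptions added for illustration; only the `with cv` / `wait` / `notify` shape comes from the fragment.

```python
import collections
import threading

cv = threading.Condition()
items = collections.deque()  # shared state, only touched while holding cv


def producer(n):
    for i in range(n):
        with cv:
            items.append(i)  # make an item available
            cv.notify()      # wake one waiting consumer


def consumer(n, out):
    for _ in range(n):
        with cv:
            while not items:  # re-check after every wake-up
                cv.wait()
            out.append(items.popleft())


def run_demo(n=5):
    out = []
    c = threading.Thread(target=consumer, args=(n, out))
    p = threading.Thread(target=producer, args=(n,))
    c.start()
    p.start()
    p.join()
    c.join()
    return out


print(run_demo())
```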
I think what you are looking for is the Lock object -> https://docs.python.org/2/library/threading.html#lock-objects
A primitive lock is a synchronization primitive that is not owned by a
particular thread when locked. In Python, it is currently the lowest
level synchronization primitive available, implemented directly by the
thread extension module.
In your example, I would encapsulate the access to the image in a function like this
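def image_access(self, image_data=None):
    # self._lock must be a single threading.Lock() created once in __init__;
    # a lock created inside this method would be new on every call and protect nothing.
    with self._lock:
        previous = self.new_image
        if image_data is not None:
            self.new_image = image_data
    if image_data is None:
        return previous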
For more on Thread synchronization, see -> http://effbot.org/zone/thread-synchronization.htm
Edit:
Here are the changes to the other functions:
def run(self):
    while self._is_to_run:
        self.image_access(self._capture.update())
...
while True:
    result = do_some_process_image(cam.image_access())
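Putting the pieces together, a self-contained sketch of a lock-protected camera might look like this (Python 3). The `capture` callable, the `stop` method and the use of an Event for the run flag are assumptions added so the example can run without real camera hardware; they are not part of the original code.

```python
import threading


class Camera(object):
    def __init__(self, capture):
        self._capture = capture        # hypothetical callable returning a frame
        self._lock = threading.Lock()  # one lock, shared by every accessor
        self._image = None
        self._running = threading.Event()
        self._thread = None

    def start(self):
        self._running.set()
        self._thread = threading.Thread(target=self._run)
        self._thread.daemon = True
        self._thread.start()

    def stop(self):
        self._running.clear()
        self._thread.join()

    def _run(self):
        while self._running.is_set():
            frame = self._capture()  # build the frame outside the lock
            with self._lock:
                self._image = frame  # publish it under the lock

    def get_image(self):
        with self._lock:
            return self._image
```

Building the frame outside the lock keeps the critical section short, so a reader is only ever blocked for the duration of a single assignment.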

Shared pool map between processes with object-oriented python

(python2.7)
I'm trying to do a kind of scanner, that has to walk through CFG nodes, and split in different processes on branching for parallelism purpose.
The scanner is represented by an object of class Scanner. This class has one method traverse that walks through the said graph and splits if necessary.
Here is how it looks:
class Scanner(object):
    def __init__(self, atrb1, ...):
        self.attribute1 = atrb1
        self.process_pool = Pool(processes=4)

    def traverse(self, ...):
        [...]
        if branch:
            self.process_pool.map(my_func, todo_list)
My problem is the following:
How do I create an instance of multiprocessing.Pool that is shared between all of my processes? I want it to be shared because a path can be split again, and I do not want to end up with a kind of fork bomb; having the same Pool would help me limit the number of processes running at the same time.
The above code does not work, since a Pool cannot be pickled. Consequently, I tried this:
class Scanner(object):
    def __getstate__(self):
        self_dict = self.__dict__.copy()
        del self_dict['process_pool']
        return self_dict
    [...]
But obviously, it results in having self.process_pool not defined in the created processes.
Then, I tried to create a Pool as a module attribute:
process_pool = Pool(processes=4)

def my_func(x):
    [...]

class Scanner(object):
    def __init__(self, atrb1, ...):
        self.attribute1 = atrb1

    def traverse(self, ...):
        [...]
        if branch:
            process_pool.map(my_func, todo_list)
It does not work, and this answer explains why.
But here comes the thing, wherever I create my Pool, something is missing. If I create this Pool at the end of my file, it does not see self.attribute1, the same way it did not see answer and fails with an AttributeError.
I'm not even trying to share it yet, and I'm already stuck with the multiprocessing way of doing things.
I don't know whether I have been thinking about the whole thing incorrectly, but I cannot believe it's so complicated to handle something as simple as "having a worker pool and giving it tasks".
Thank you,
EDIT:
I resolved my first problem (AttributeError), my class had a callback as its attribute, and this callback was defined in the main script file, after the import of the scanner module... But the concurrency and "do not fork bomb" thing is still a problem.
What you want to do can't be done safely. Suppose you somehow had a single Pool shared across the parent and the worker processes, with, say, two worker processes. The parent runs a map that tries to perform two tasks, and each task needs to map two more tasks. The two parent-dispatched tasks go one to each worker, and the parent blocks. Each worker sends two more tasks to the shared pool and blocks waiting for them to complete. But now all workers are occupied, each waiting for a worker to become free; you've deadlocked.
A safer approach would be to have the workers return enough information to dispatch additional tasks in the parent. Then you could do something like:
import collections
import multiprocessing

class MoreWork(object):
    def __init__(self, func, *args):
        self.func = func
        self.args = args

pool = multiprocessing.Pool()
try:
    base_task = somefunc, someargs
    outstanding = collections.deque([pool.apply_async(*base_task)])
    while outstanding:
        result = outstanding.popleft().get()
        if isinstance(result, MoreWork):
            outstanding.append(pool.apply_async(result.func, result.args))
        else:
            ... do something with a "final" result, maybe breaking the loop ...
finally:
    pool.terminate()
What the functions are is up to you, they'd just return information in a MoreWork when there was more to do, not launch a task directly. The point is to ensure that by having the parent be solely responsible for task dispatch, and the workers solely responsible for task completion, you can't deadlock due to all workers being blocked waiting for tasks that are in the queue, but not being processed.
This is also not at all optimized; ideally, you wouldn't block waiting on the first item in the queue if other items in the queue were complete; it's a lot easier to do this with the concurrent.futures module, specifically with concurrent.futures.wait to wait on the first available result from an arbitrary number of outstanding tasks, but you'd need a third party PyPI package to get concurrent.futures on Python 2.7.
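A rough sketch of the concurrent.futures variant (Python 3), under some loudly stated assumptions: the `explore` task and its "L"/"R" node names are invented, a task signals branching by returning a list of MoreWork objects rather than a single one, and a ThreadPoolExecutor is swapped in for the process pool purely so the example runs anywhere. ProcessPoolExecutor exposes the same submit interface, provided the task functions are importable at module level.

```python
import concurrent.futures as cf


class MoreWork(object):
    def __init__(self, func, *args):
        self.func = func
        self.args = args


def explore(node, depth):
    """Hypothetical task: branch twice until depth 0, then return a leaf."""
    if depth == 0:
        return node
    # Ask the parent to schedule the children; never submit from a worker.
    return [MoreWork(explore, node + "L", depth - 1),
            MoreWork(explore, node + "R", depth - 1)]


def run_all(executor, func, *args):
    finals = []
    pending = {executor.submit(func, *args)}
    while pending:
        # Wake up as soon as any outstanding task finishes.
        done, pending = cf.wait(pending, return_when=cf.FIRST_COMPLETED)
        for future in done:
            result = future.result()
            if isinstance(result, list):  # more work requested
                pending.update(executor.submit(w.func, *w.args) for w in result)
            else:
                finals.append(result)
    return finals


with cf.ThreadPoolExecutor(max_workers=2) as executor:
    leaves = sorted(run_all(executor, explore, "", 2))
print(leaves)
```

Only the parent ever calls submit, so the number of running tasks stays bounded by max_workers and no worker can block waiting on another.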

Share a class variable across multiple processes in python

I have a class variable declared as a list that I want to update from a method declared within that class. However since this method processes a large amount of data, I am using multiprocessing to invoke it and hence I need to put lock on the class variable before updating it. I am unable to figure out how to put such a lock and update the class variable. If it matters, I am only creating one object of the said class at any given time.
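Processes created with multiprocessing do not share memory by default: each one has its own address space (which is also why they are not held back by the GIL), so out of the box they only suit completely separate tasks.
But you can still make it happen by using multiprocessing's shared Array/Value: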
from https://docs.python.org/2/library/multiprocessing.html#sharing-state-between-processes
from multiprocessing import Process, Value, Array

def f(n, a):
    n.value = 3.1415927
    for i in range(len(a)):
        a[i] = -a[i]

if __name__ == '__main__':
    num = Value('d', 0.0)
    arr = Array('i', range(10))
    p = Process(target=f, args=(num, arr))
    p.start()
    p.join()
    print num.value
    print arr[:]
Now, as you asked, you need to ensure that different processes won't access the same variable at the same time, so you need a Lock. Fortunately, every shared variable created through Value or Array in the multiprocessing module is paired with a Lock by default.
To access the lock :
num.acquire() # get the lock
# do stuff
num.release() # don't forget to release it
I hope this helps.
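As a sketch of why the lock matters (Python 3; the function names and counts are made up for illustration), several processes increment one shared counter. The documented `get_lock()` method returns the lock paired with the Value and can be used as a context manager, which guarantees the release. The "fork" start method is assumed so the sketch runs as a flat script on POSIX.

```python
import multiprocessing as mp

ctx = mp.get_context("fork")  # assumption: POSIX-only start method


def add_many(counter, times):
    for _ in range(times):
        # "+=" is a read followed by a write; hold the lock across both.
        with counter.get_lock():
            counter.value += 1


def run_demo(n_procs=4, times=1000):
    counter = ctx.Value('i', 0)
    procs = [ctx.Process(target=add_many, args=(counter, times))
             for _ in range(n_procs)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return counter.value


print(run_demo())
```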
If you're using the multiprocessing module (as opposed to multithreading, which is different), then unless I'm mistaken, the multiple processes forked don't share memory and each process would have its own copy of your class. This would mean that a lock would not be necessary, but it would also mean that the class attribute is not shared like you want it to be.
The multiprocessing module does offer several ways to allow communication between processes, including shared array objects. Perhaps this is what you're looking for.
Depending on what you're doing, you might also consider using the master-worker pattern, where you create a worker class with methods to manipulate your data, spawn several processes to run this class, and then dispatch datasets to the workers from your main process using the Queue class from the multiprocessing module.

How do I run a Python method as a subprocess?

I need help with a Python project:
Example:
class MyFrame(wx.Frame):
    def __init__(self, parent, title):
        super(MyFrame, self).__init__(parent, title=title, size=(330, 300))
        self.InitUI()
        self.Centre()
        self.Show()

    def InitUI(self):
        """
        Subprocess
        """
        subprocess.execMethodFromClass( self , 'Connection' , args1 , args2 , ... )

    def Connection( self ):
        self.connection = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.connection.connect(( '192.0.1.135' , 3345 ))
        while True:
            data = self.connection.recv(1024)
            if not data:
                break
            else:
                print data
Show:
subprocess.execMethodFromClass( self , 'Connection' , args1 , args2 , ... )
Thanks!
As the friendly dogcow says, to run a function in a child process, all you have to do is use a multiprocessing.Process:
p = multiprocessing.Process(target=f, args=('bob',))
p.start()
p.join()
Of course you probably want to hang onto p and join it later in most* real-life use cases. You're obviously not getting any parallelism by spawning a new process just to make your main process sit around and wait.
So, in your case, that's just:
p = multiprocessing.Process(target=self.Connection, args=(args1, args2))
But this probably won't work in your case, because you're trying to call a method on the current self object.
First, depending on your platform and Python version, multiprocessing may have to pass the bound method self.Connection to the child by pickling it and sending it over a pipe. This involves pickling the self instance as well as the method. So it will only work if MyFrame objects are pickleable. And I'm pretty sure that a wx.Frame can't be pickled.
And even if you do get the self object to the child, it will obviously be a copy, not a shared instance. So, when the child process's Connection method sets self.connection = …, that won't affect the original parent process's self.
Even worse if you try to call any wx.Frame methods. Even if all the Python stuff worked, on most platforms, trying to modify GUI resources like windows from the wrong process will not work.
The only kinds of objects you can actually share are the kinds you can put in multiprocessing.Value or multiprocessing.sharedctypes.
The way around this is to factor out the code you want to childify into a separate, isolated function, that shares as little as possible (ideally nothing, or nothing but a Queue or Pipe) with the parent.
For your example, this is easy:
class Client(object):
    def connect_and_fetch(self):
        self.connection = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.connection.connect(( '192.0.1.135' , 3345 ))
        while True:
            data = self.connection.recv(1024)
            if not data:
                break
            else:
                print data

def do_client():
    client = Client()
    client.connect_and_fetch()

class MyFrame(wx.Frame):
    # ...
    def Connection(self):
        self.child = multiprocessing.Process(target=do_client)
        self.child.start()
        # and now put a self.child.join() somewhere
In fact, you don't even need a class at all here, because the only use you have for self is to store a variable that could just as easily be a local. But I'm guessing in your real-life program, there's a bit more state than that.
There's an interesting (if a bit outdated) example on the wxpython wiki, called MultiProcessing, which looks like it does most of what you want and more. (It's using a classmethod for the child process instead of a standalone function for some reason, and using old-style syntax for it because it's old, but hopefully it's still helpful.)
If you're using wx for your GUI, you may want to consider using its inter-process mechanisms instead of the native Python ones. While it's more complicated and less pythonic in the general case, when you're trying to integrate a child process and its communications pipe into your main event loop, why not let wx take care of it?
The alternative is to create a thread to wait on the child process and/or whatever Pipe or Queue you give it, and then create and post wx.Events to the main thread.
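The thread-based alternative could be sketched roughly as follows (Python 3). Everything here is illustrative: `fetch` stands in for the socket loop, the byte strings are made up, and `on_data` is a plain callback where a wx application would presumably hand the data over to the GUI thread. The "fork" start method is assumed so the sketch runs as a flat script on POSIX.

```python
import multiprocessing as mp
import threading

ctx = mp.get_context("fork")  # assumption: POSIX-only start method


def fetch(out_queue):
    """Child process: hypothetical stand-in for the socket receive loop."""
    for chunk in (b"one", b"two"):
        out_queue.put(chunk)
    out_queue.put(None)  # sentinel: connection closed


def pump(out_queue, on_data):
    """Parent-side thread: forward each chunk to a callback."""
    for chunk in iter(out_queue.get, None):
        on_data(chunk)  # a GUI app would post an event to its main thread here


def run_demo():
    received = []
    q = ctx.Queue()
    child = ctx.Process(target=fetch, args=(q,))
    child.start()  # fork before starting any thread
    listener = threading.Thread(target=pump, args=(q, received.append))
    listener.start()
    listener.join()
    child.join()
    return received


print(run_demo())
```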
* Most, not all. For example, if f temporarily uses up a whole lot of memory, running it in a child process means you release that memory to the OS as quickly as possible. Or, if it calls badly-written third-party/legacy/whatever code that has nasty and poorly-documented global side-effects, you're isolated from those side-effects. And so on.
From http://docs.python.org/dev/library/multiprocessing.html:
from multiprocessing import Process

def f(name):
    print('hello', name)

if __name__ == '__main__':
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()
You can't. You use subprocess to call another application or script to run in a separate process.
subprocess.Popen(cmds)
If you need to run some long running process, look into threads or the multiprocessing module. Here are some links:
http://docs.python.org/2/library/multiprocessing.html
http://wiki.wxpython.org/LongRunningTasks
http://www.blog.pythonlibrary.org/2012/08/03/python-concurrency-porting-from-a-queue-to-multiprocessing/
http://www.blog.pythonlibrary.org/2010/05/22/wxpython-and-threads/
