I have a main Python process, and a bunch of workers created by the main process using os.fork().
I need to pass large and fairly involved data structures from the workers back to the main process. What existing libraries would you recommend for that?
The data structures are a mix of lists, dictionaries, numpy arrays, custom classes (which I can tweak) and multi-layer combinations of the above.
Disk I/O should be avoided. If I could also avoid creating copies of the data -- for example by having some kind of shared-memory solution -- that would be nice too, but is not a hard constraint.
For the purposes of this question, it is mandatory that the workers are created using os.fork(), or a wrapper thereof that would clone the master process's address space.
This only needs to work on Linux.
multiprocessing's queue implementation works. Internally, it pickles data to a pipe.
import multiprocessing
import os

q = multiprocessing.Queue()
if os.fork() == 0:
    print(q.get())
else:
    q.put(5)
# outputs: 5
Related
I want to share a multiprocessing.Array between the parent and its child processes:
self.procs = [MP.Process(target=worker, kwargs=dict(sh_array=array))
              for _ in range(num_workers)]
Does the code above do the right thing? I only want fast IPC communication based on shared memory / file-mapping when I access the shared array. I don't want any message passing or copy-based IPC happening behind the curtain. That would defeat the purpose of the code I'm writing.
Also, I'd like to pass the same way different instances of a class which all refer to the same shared array. Will this work correctly or should I pass the shared array separately and then rebuild the objects in the child processes manually?
Does the code above do the right thing?
Yes, that's exactly what multiprocessing.Array is for and how it is used.
I don't want any IPC communication when I access the shared array.
I don't think the term "IPC" is used correctly here. IPC stands for inter-process communication: if an array is shared between processes, then anything one process writes to it can be read by the other processes. In other words, you are communicating between processes, which is IPC. Shared memory is a form of IPC, so if you don't want any IPC at all, you can't share anything between processes.
You may mean something completely different. Maybe you don't want to pass messages back and forth, or something like that?
Also, I'd like to pass the same way different instances of a class which all refer to the same shared array. Will this work correctly or should I pass the shared array separately and then rebuild the objects in the child processes manually?
Either way works. Do whichever option makes the code more natural to read.
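For illustration, here is a minimal sketch of both options, assuming the fork-based start method on Linux; the worker function and the Holder wrapper class are made-up names, and the shared array is assumed to be a flat array of doubles:

import multiprocessing as MP

class Holder:
    # Hypothetical wrapper class that keeps a reference to the shared array.
    def __init__(self, sh_array, offset):
        self.sh_array = sh_array
        self.offset = offset

def worker(sh_array=None, holder=None):
    # Both arguments refer to the same underlying shared memory after fork().
    if holder is not None:
        holder.sh_array[holder.offset] = 1.0
    else:
        sh_array[0] = 2.0

if __name__ == "__main__":
    array = MP.Array('d', 10)   # shared memory, backed by an anonymous mmap

    # Option 1: pass the shared array directly.
    p1 = MP.Process(target=worker, kwargs=dict(sh_array=array))
    # Option 2: pass an object that holds a reference to the shared array.
    p2 = MP.Process(target=worker, kwargs=dict(holder=Holder(array, 1)))

    for p in (p1, p2):
        p.start()
    for p in (p1, p2):
        p.join()
    print(array[:3])   # [2.0, 1.0, 0.0]

Writes made in the children are visible in the parent because the underlying buffer lives in shared memory; only the small Python wrapper objects are duplicated.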
The following code parallelizes a for-loop.
import networkx as nx
import numpy as np
from joblib import Parallel, delayed
import multiprocessing

def core_func(repeat_index, G, numpy_arrary_2D):
    for u in G.nodes():
        numpy_arrary_2D[repeat_index][u] = 2
    return

if __name__ == "__main__":
    G = nx.erdos_renyi_graph(100000, 0.99)
    nRepeat = 5000
    numpy_array = np.zeros([nRepeat, G.number_of_nodes()])
    Parallel(n_jobs=4)(delayed(core_func)(repeat_index, G, numpy_array)
                       for repeat_index in range(nRepeat))
    print(np.mean(numpy_array))
As can be seen, the expected value to be printed is 2. However, when I run my code on a cluster (multi-core, shared memory), it returns 0.0.
I think the problem is that each worker creates its own copy of the numpy_array object, and the one created in the main function is not updated. How can I modify the code such that the numpy array numpy_array can be updated?
joblib uses the multiprocessing pool of processes by default, as its manual says:
Under the hood, the Parallel object create a multiprocessing pool that
forks the Python interpreter in multiple processes to execute each of
the items of the list. The delayed function is a simple trick to be
able to create a tuple (function, args, kwargs) with a function-call
syntax.
This means that every process inherits the original state of the array, but whatever it writes into it is lost when the process exits. Only the function's return value is delivered back to the calling (main) process; since you do not return anything, None is returned.
To make the shared array modifiable, you have two options: using threads or using shared memory.
Threads, unlike processes, share memory. So you can write to the array and every job will see the change. According to the joblib manual, it is done this way:
Parallel(n_jobs=4, backend="threading")(delayed(core_func)(repeat_index, G, numpy_array)
                                        for repeat_index in range(nRepeat))
When you run it:
$ python r1.py
2.0
However, when you write more complex things into the array, make sure you properly lock the data (or the individual pieces of it), or you will run into race conditions.
Also read carefully about the GIL, since computational multithreading in Python is limited (unlike I/O multithreading).
If you still need processes (e.g. because of the GIL), you can put the array into shared memory.
This is a slightly more involved topic, but a joblib + numpy shared-memory example is also shown in the joblib manual.
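As a rough sketch of that second route, independent of joblib: back the numpy array with a multiprocessing buffer created before the workers are forked. The sizes are shrunk here so it runs quickly, and it assumes the default fork start method on Linux:

import multiprocessing
import numpy as np

nRepeat, nNodes = 50, 100

# Raw shared buffer created BEFORE the pool forks, so every worker inherits it.
shared_buf = multiprocessing.RawArray('d', nRepeat * nNodes)
shared_array = np.frombuffer(shared_buf, dtype=np.float64).reshape(nRepeat, nNodes)

def core_func(repeat_index):
    # All workers see the same underlying memory after fork().
    shared_array[repeat_index, :] = 2
    return repeat_index

if __name__ == "__main__":
    with multiprocessing.Pool(4) as pool:
        pool.map(core_func, range(nRepeat))
    print(np.mean(shared_array))   # 2.0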
As Sergey wrote in his answer, processes don't share state and memory. This is why you don't see the expected result.
Threads share state and memory space, as they run inside the same process. This is useful if you have many I/O operations, but it won't get you more processing power (more CPUs) because of the GIL.
One technique to communicate between processes is proxy objects via a Manager. You create a manager object, which synchronizes resources between the processes.
A manager object returned by Manager() controls a server process which holds Python objects and allows other processes to manipulate them using proxies.
I haven't tested this code (I don't have all the modules you use), and it might require more modifications, but using a Manager object it should look like this:
if __name__ == "__main__":
    G = nx.erdos_renyi_graph(100000, 0.99)
    nRepeat = 5000
    manager = multiprocessing.Manager()
    numpys = manager.list(np.zeros([nRepeat, G.number_of_nodes()]))
    Parallel(n_jobs=4)(delayed(core_func)(repeat_index, G, numpys)
                       for repeat_index in range(nRepeat))
    print(np.mean(numpys))
I am currently working on a project that involves connecting two devices to a python script, retrieving data from them and outputting the data.
Code outline:
• Scans for paired devices
• When a paired device is found, a thread instance is created (two devices connected = two thread instances)
• Data is printed within the thread, i.e. each instance has a separate bundle of data
Basically, when two devices are connected, two instances of my thread class are created. Each thread instance returns a different bundle of data.
My question is: Is there a way I can combine the two bundles of data into one bundle of data?
Any help on this is appreciated :)
I assume you are using the threading module.
Threading in Python
Python is not truly multithreaded for CPU-bound work. The interpreter still uses a GIL (Global Interpreter Lock) for most operations, which effectively serializes the operations in a Python script. Threading is good for I/O, however, as other threads can be woken up while one thread waits for I/O.
Idea
Because of the GIL we can just use a standard list to combine our data. The idea is to pass the same list or dictionary to every Thread we create using the args parameter. See pydoc for threading.
Our simple implementation uses two threads to show how it can be done. In real-world applications you would probably use a thread pool or something similar.
Implementation
from threading import Thread

def worker(data):
    # retrieve data from device
    data.append(1)
    data.append(2)

l = []
# Let's pass our list to the target via args.
a = Thread(target=worker, args=(l,))
b = Thread(target=worker, args=(l,))
# Start our threads
a.start()
b.start()
# Join them and print result
a.join()
b.join()
print(l)
Further thoughts
If you want to be 100% correct and not rely on the GIL to serialize access to your list, you can use a simple mutex (threading.Lock) for locking and unlocking, or use the queue module, which implements correct locking.
Depending on the nature of the data a dictionary might be more convenient to join data by certain keys.
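A minimal sketch of both ideas, assuming two devices identified by made-up integer ids, with explicit locking instead of relying on the GIL and a dictionary keyed by device:

from threading import Thread, Lock

results = {}            # one entry per device
results_lock = Lock()

def worker(device_id, results, lock):
    # Hypothetical stand-in for reading a bundle of data from the device.
    bundle = [device_id * 10, device_id * 10 + 1]
    with lock:
        results[device_id] = bundle

threads = [Thread(target=worker, args=(i, results, results_lock)) for i in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)   # e.g. {1: [10, 11], 2: [20, 21]}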
Other considerations
Threads should be considered carefully. Alternatives such as asyncio might be better suited.
My general advice: avoid using any of these things:
avoid threads
avoid the multiprocessing module in Python
avoid the futures module in Python.
Use a tool like http://python-rq.org/
Benefits:
You are forced to define the input and output data well, since only serializable data can be passed around.
You have distinct interpreters.
No deadlocks.
Easier to debug.
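For a flavour of what that looks like, here is a minimal sketch along the lines of the example on python-rq.org; it assumes a Redis server is running locally and that the count_words function lives in a module (tasks.py here) that the worker processes can import:

# tasks.py -- must be importable by the RQ worker processes
def count_words(text):
    return len(text.split())

# main.py
from redis import Redis
from rq import Queue

from tasks import count_words

q = Queue(connection=Redis())   # connects to localhost:6379 by default
job = q.enqueue(count_words, "a small, serializable payload")
# Run `rq worker` in another shell; once the job has finished,
# job.result holds the return value.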
def content_generator(applications, dict):
    for app in applications:
        yield (app, dict[app])

with open('abc.pickle', 'r') as f:
    very_large_dict = pickle.load(f)
all_applications = set(very_large_dict.keys())

pool = multiprocessing.Pool()
for result in pool.imap_unordered(func_process_application,
                                  content_generator(all_applications, very_large_dict)):
    # do some aggregation on result
    ...
I have a really large dictionary whose keys are strings (application names) and whose values are information about the application. Since the applications are independent, I want to use multiprocessing to process them in parallel. Parallelization works when the dictionary is not that big, but all the Python processes are killed when the dictionary is too big. I used dmesg to check what went wrong and found they were killed because the machine ran out of memory. I ran top while the pool processes were running and found that they each occupy the same amount of resident memory (RES), 3.4 GB. This confuses me, since it seems the whole dictionary was copied into the spawned processes. I thought I was breaking up the dictionary and passing only what is relevant to each spawned process by yielding only dict[app] instead of the whole dict. Any thoughts on what I did wrong?
The comments are becoming impossible to follow, so I'm pasting in my important comment here:
On a Linux-y system, new processes are created by fork(), so get a copy of the entire parent-process address space at the time they're created. It's "copy on write", so is more of a "virtual" copy than a "real" copy, but still ... ;-) For a start, try creating your Pool before creating giant data structures. Then the child processes will inherit a much smaller address space.
Then some answers to questions:
so in python 2.7, there is no way to spawn a new process?
On Linux-y systems, no. The ability to use "spawn" on those was first added in Python 3.4. On Windows systems, "spawn" has always been the only choice (no fork() on Windows).
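For reference, on Python 3.4+ the start method can be selected explicitly:

import multiprocessing

if __name__ == '__main__':
    multiprocessing.set_start_method('spawn')   # available since Python 3.4
    pool = multiprocessing.Pool()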
The big dictionary is passed in to a function as an argument and I could only create the pool inside this function. How would I be able to create the pool before the big dictionary?
As simple as this: make these two lines the first two lines in your program:
import multiprocessing
pool = multiprocessing.Pool()
You can create the pool any time you like (just so long as it exists sometime before you actually use it), and worker processes will inherit the entire address space at the time the Pool constructor is invoked.
ANOTHER SUGGESTION
If you're not mutating the dict after it's created, try using this instead:
def content_generator(dict):
    for app in dict:
        yield app, dict[app]
That way you don't have to materialize a giant set of the keys either. Or, even better (if possible), skip all that and iterate directly over the items:
for result in pool.imap_unordered(func_process_application, very_large_dict.iteritems()):
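Putting those suggestions together, a reworked driver might look roughly like this; func_process_application is only a placeholder for the real per-application work from the question, and .iteritems() matches the Python 2.7 context:

import multiprocessing
import pickle

def func_process_application(item):
    # Placeholder for the real per-application work; receives a (key, value) pair.
    app, info = item
    return app

if __name__ == '__main__':
    # Create the pool BEFORE loading the giant dict, so the forked workers
    # inherit a small address space (but after the worker function is defined,
    # so the children can resolve it by name).
    pool = multiprocessing.Pool()

    with open('abc.pickle', 'rb') as f:
        very_large_dict = pickle.load(f)

    # Iterate directly over the items; no giant key set, no extra generator layer.
    for result in pool.imap_unordered(func_process_application,
                                      very_large_dict.iteritems()):
        pass   # do some aggregation on result here

    pool.close()
    pool.join()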
I have a Python function that creates and stores an object instance in a global list, and this function is called by a thread. While the thread runs, the list is filled up as it should be, but when the thread exits the list is empty and I have no idea why. Any help would be appreciated.
simulationResults = []

def run(width1, height1, seed1, prob1):
    global simulationResults
    instance = Life(width1, height1, seed1, prob1)
    instance.run()
    simulationResults.append(instance)
this is called in my main by:
for i in range(1, nsims + 1):
    simulations.append(multiprocessing.Process(target=run, args=(width, height, seed, prob)))
    simulations[(len(simulations) - 1)].start()
for i in simulations:
    i.join()
multiprocessing is based on processes, not threads. The important difference: Each process has a separate memory space, while threads share a common memory space. When first created, a process may (depending on OS, spawn method, etc.) be able to read the same values the parent process has, but if it writes to them, only the local values are changed, not the parent's copy. Only threads can rely on being able to access an arbitrary single shared global variable and have it behave as expected.
I'd suggest looking at either multiprocessing.Pool and its various methods to dispatch tasks and retrieve their results later, or if you must use raw Processes, look at the various ways to exchange data between processes; you can't just assign to a global variable, because globals stop being shared when the new Process is forked/spawned.
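A rough sketch of the Pool approach, with run reworked to return the instance instead of appending to a global; this assumes Life instances are picklable, and width, height, seed, prob and nsims come from the question's surrounding code:

import multiprocessing

def run(width1, height1, seed1, prob1):
    # Return the finished instance; Pool ships return values back to the
    # parent by pickling them.
    instance = Life(width1, height1, seed1, prob1)
    instance.run()
    return instance

if __name__ == '__main__':
    args = [(width, height, seed, prob) for _ in range(nsims)]
    with multiprocessing.Pool() as pool:
        simulationResults = pool.starmap(run, args)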
In your code you are creating new processes rather than threads. When a process is created, it gets copies of the variables in the main process, but the copies are independent of each other. I think for your case it makes sense to use processes rather than threads, because that allows you to utilise multiple cores, as opposed to threads, which are limited to a single core due to the GIL.
You will have to use inter-process communication techniques to communicate between processes. Since in your case the processes are not persistent daemons, it would make sense to have each process write its simulationResults into a distinct, unique file and read them back from the main process, as in the sketch below.
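A minimal sketch of the file route; the filename scheme is made up, Life is the class from the question, and its instances are assumed to be picklable:

import os
import pickle

def run(width1, height1, seed1, prob1):
    instance = Life(width1, height1, seed1, prob1)
    instance.run()
    # Each child process writes its own result file, keyed by its PID.
    with open('result_%d.pickle' % os.getpid(), 'wb') as f:
        pickle.dump(instance, f)

def collect_results():
    # Called in the parent after all processes have been joined.
    results = []
    for name in os.listdir('.'):
        if name.startswith('result_') and name.endswith('.pickle'):
            with open(name, 'rb') as f:
                results.append(pickle.load(f))
    return results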