Python multiprocessing with big shared data - python

I am using Python to develop an app that processes data with the multiprocessing module. The code looks like this:
import multiprocessing

globalData = loadData()  # very large data

def f(v):
    global globalData
    return someOperation(globalData, v)

if __name__ == '__main__':
    pool = multiprocessing.Pool()
    arr = loadArray()  # some big list
    res = pool.map(f, arr)
The problem is that every child process needs the same global data to run the function, so each one loads it and that takes a long time. What is the best way to share this data among all child processes, given that it is already loaded in the parent?

Multiprocessing on ms-windows works differently than it does on UNIX-like systems.
UNIX-like systems have the fork system call, which makes a copy of the current process. In modern systems with copy-on-write virtual memory management, this is not even a very expensive operation.
This means that global data in the parent process will be shared with the child process, until the child process writes to that page, in which case it will be copied.
The thing is that ms-windows doesn't have fork. It has CreateProcess instead. So on ms-windows, this happens:
The parent process starts a fresh Python interpreter process. The child process will only inherit those resources necessary to run the process object's run() method. In particular, unnecessary file descriptors and handles from the parent process will not be inherited. Starting a process using this method is rather slow compared to using fork or forkserver.
So since your global data is referenced in your function it will be loaded. But every child process will load it separately.
What you could try is to have your processes load the data using mmap with ACCESS_READ. I would expect that the ms-windows memory subsystem is smart enough to load the data only once when the same file is mapped by multiple processes.
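For illustration, here is a minimal sketch of that idea (not tested on your data): it assumes the large data has already been written to a file, here called data.bin (a hypothetical name), and that your someOperation can work on raw bytes read from the map.

import mmap
import multiprocessing

def init_worker():
    # Each worker maps the same file read-only; the OS page cache should
    # keep a single physical copy shared by all processes.
    global data_map
    fh = open("data.bin", "rb")
    data_map = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)

def f(offset):
    # Stand-in for someOperation: read 4 bytes at the given offset.
    return data_map[offset:offset + 4]

if __name__ == '__main__':
    pool = multiprocessing.Pool(initializer=init_worker)
    print(pool.map(f, [0, 4, 8]))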

I am also new to Python, but if I understand your question correctly, it's very easy: in the following script we use 5 workers to get the squares of the first 10000 numbers.
import multiprocessing

globalData = range(10000)  # very large data

def f(x):
    return x*x

if __name__ == '__main__':
    pool = multiprocessing.Pool(5)
    print(pool.map(f, globalData))

Related

multiprocessing in python - what gets inherited by forkserver process from parent process?

I am trying to use forkserver and I encountered NameError: name 'xxx' is not defined in worker processes.
I am using Python 3.6.4, but the documentation should be the same. From https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods it says:
The fork server process is single threaded so it is safe for it to use os.fork(). No unnecessary resources are inherited.
Also, it says:
Better to inherit than pickle/unpickle
When using the spawn or forkserver start methods many types from multiprocessing need to be picklable so that child processes can use them. However, one should generally avoid sending shared objects to other processes using pipes or queues. Instead you should arrange the program so that a process which needs access to a shared resource created elsewhere can inherit it from an ancestor process.
So apparently a key object that my worker processes need did not get inherited by the server process and then passed on to the workers. Why did that happen? I wonder what exactly gets inherited by the forkserver process from the parent process?
Here is what my code looks like:
import multiprocessing
import (a bunch of other modules)

def worker_func(nameList):
    global largeObject
    for item in nameList:
        # get some info from largeObject using item as index
        # do some calculation
        return [item, info]

if __name__ == '__main__':
    result = []
    largeObject  # This is my large object, it's read-only and no modification will be made to it.
    nameList  # Here is a list variable; I will need to get info for each item in it from largeObject.
    ctx_in_main = multiprocessing.get_context('forkserver')
    print('Start parallel, using forking/spawning/?:', ctx_in_main.get_context())
    cores = ctx_in_main.cpu_count()
    with ctx_in_main.Pool(processes=4) as pool:
        for x in pool.imap_unordered(worker_func, nameList):
            result.append(x)
Thank you!
Best,
Theory
Below is an excerpt from Bojan Nikolic's blog:
Modern Python versions (on Linux) provide three ways of starting the separate processes:
Fork()-ing the parent process and continuing with the same process image in both parent and child. This method is fast, but potentially unreliable when the parent state is complex.
Spawning the child processes, i.e., fork()-ing and then execv to replace the process image with a new Python process. This method is reliable but slow, as the process image is reloaded afresh.
The forkserver mechanism, which consists of a separate Python server that has a relatively simple state and which is fork()-ed when a new process is needed. This method combines the speed of fork()-ing with good reliability (because the parent being forked is in a simple state).
Forkserver
The third method is the forkserver, illustrated in Bojan's post. Note that the children retain a copy of the forkserver state. This state is intended to be relatively simple, but it is possible to adjust it through the multiprocessing API via the set_forkserver_preload() method.
Practice
Thus, if you want something to be inherited by the child processes from the parent, it must be put into the forkserver state by means of set_forkserver_preload(modules_names), which sets the list of module names to try to load in the forkserver process. I give an example below:
# inherited.py
large_obj = {"one": 1, "two": 2, "three": 3}

# main.py
import multiprocessing
import os
from time import sleep

from inherited import large_obj

def worker_func(key: str):
    print(f"PID={os.getpid()}, obj id={id(large_obj)}")
    sleep(1)
    return large_obj[key]

if __name__ == '__main__':
    result = []
    ctx_in_main = multiprocessing.get_context('forkserver')
    ctx_in_main.set_forkserver_preload(['inherited'])
    cores = ctx_in_main.cpu_count()
    with ctx_in_main.Pool(processes=cores) as pool:
        for x in pool.imap(worker_func, ["one", "two", "three"]):
            result.append(x)
    for res in result:
        print(res)
Output:
# The PIDs are different but the address is always the same
PID=18603, obj id=139913466185024
PID=18604, obj id=139913466185024
PID=18605, obj id=139913466185024
And if we don't use preloading:
...
ctx_in_main = multiprocessing.get_context('forkserver')
# ctx_in_main.set_forkserver_preload(['inherited'])
cores = ctx_in_main.cpu_count()
...
# The PIDs are different, the addresses are different too
# (but sometimes they can coincide)
PID=19046, obj id=140011789067776
PID=19047, obj id=140011789030976
PID=19048, obj id=140011789030912
So after an inspiring discussion with Alex I think I have sufficient info to address my question: what exactly gets inherited by forkserver process from parent process?
Basically, when the server process starts, it imports your main module, and everything before if __name__ == '__main__' is executed. That's why my code doesn't work: large_object is nowhere to be found in the server process, nor in any of the worker processes forked from it.
Alex's solution works because large_object now gets imported by both the main and the server process, so every worker forked from the server also gets large_object. Combined with set_forkserver_preload(modules_names), all workers might even get the same large_object, from what I saw. The reason for using forkserver is explained in the Python documentation and in Bojan's blog:
When the program starts and selects the forkserver start method, a server process is started. From then on, whenever a new process is needed, the parent process connects to the server and requests that it fork a new process. The fork server process is single threaded so it is safe for it to use os.fork(). No unnecessary resources are inherited.
The forkserver mechanism, which consists of a separate Python server that has a relatively simple state and which is fork()-ed when a new process is needed. This method combines the speed of fork()-ing with good reliability (because the parent being forked is in a simple state).
So it's more about being on the safe side here.
On a side note, if you use fork as the start method you don't need to import anything, since every child process gets a copy of the parent process's memory (or a reference, if the system uses copy-on-write; please correct me if I am wrong). In this case, using global large_object gives you access to large_object in worker_func directly.
The forkserver might not be a suitable approach for me, because the issue I am facing is memory overhead. All the operations that get me large_object in the first place are memory-consuming, so I don't want any unnecessary resources in my worker processes.
If I put all those calculations directly into inherited.py as Alex suggested, they will be executed twice (once when I import the module in main and once when the server imports it; maybe even more times when worker processes are born?). This is fine if I just want a single-threaded process that is safe for the workers to fork from, but since I am trying to get workers to not inherit unnecessary resources and only get large_object, this won't work.
And putting those calculations in __main__ in inherited.py won't work either since now none of the processes will execute them, including main and server.
So, as a conclusion, if the goal here is to get workers to inherit minimal resources, I am better off breaking my code in two: run calculation.py first, pickle large_object, exit the interpreter, and start a fresh one that loads the pickled large_object. Then I can just go nuts with either fork or forkserver.
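A rough sketch of that two-stage workflow (file names are just illustrative, and the small dict stands in for the expensive build) could look like this:

# calculation.py -- run first in its own interpreter, then exit, so the
# memory used to build large_object is released before the workers start.
import pickle

large_object = {"one": 1, "two": 2, "three": 3}  # stands in for the expensive build
with open("large_object.pkl", "wb") as f:
    pickle.dump(large_object, f)

# main.py -- started fresh afterwards; large_object is loaded once at
# module level, so fork/forkserver children inherit it.
import pickle
import multiprocessing

with open("large_object.pkl", "rb") as f:
    large_object = pickle.load(f)

def worker_func(key):
    return large_object[key]

if __name__ == "__main__":
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool() as pool:
        print(pool.map(worker_func, ["one", "two", "three"]))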

Python call multiple functions in parallel and combine result

I have to write a job that performs different types of analysis on a given document. I know I can do it sequentially, i.e., call each parser one by one.
A very high-level script structure is given below:
def summarize(doc):
    pass

def LengthCount(doc):
    pass

def LanguageFinder(doc):
    pass

def ProfanityFinder(doc):
    pass

if __name__ == '__main__':
    doc = "Some document"
    smry = summarize(doc)
    length = LengthCount(doc)
    lang = LanguageFinder(doc)
    profanity = ProfanityFinder(doc)
    # Save summary, length, language, profanity information in database
But to improve performance, I think these tasks can be run in parallel. How can I do that? What are the possible ways to do this in Python, especially in 3.x? It is quite possible that one parser (module) takes more time than another, but overall, running them in parallel should increase performance. Lastly, if this is not possible in Python, any other language is also welcome.
In Python you have a few options for concurrency/parallelism. There is the threading module, which allows you to execute code in multiple logical threads, and the multiprocessing module, which allows you to spawn multiple processes. There is also the concurrent.futures module, which provides an API over both of these mechanisms.
If your process is CPU-bound (i.e. you are running at 100% of the CPU available to Python throughout - note this is not 100% CPU if you have a multi-core or hyper-threaded machine), you are unlikely to see much benefit from threading, as it doesn't actually run multiple CPU threads in parallel; it just allows one to take over from another while the first is waiting for IO. Multiprocessing is likely to be more useful for you, as it allows you to actually use multiple CPU threads in parallel. You can start each of your functions in its own process using the Process class:
import multiprocessing

# function defs here

p = multiprocessing.Process(target=LengthCount, args=(doc,))
p.start()
# repeat for other processes
You will need to tweak your code to have the functions write their results to a shared variable (or straight to your database) rather than returning them directly, so that you can access them once the processes are complete.
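If you would rather get the return values back directly, here is a rough sketch using concurrent.futures (mentioned above); the parser bodies are placeholders standing in for the real analysis.

from concurrent.futures import ProcessPoolExecutor

# Placeholder parsers; the real ones would do the actual analysis.
def summarize(doc):
    return "summary of " + doc

def LengthCount(doc):
    return len(doc)

def LanguageFinder(doc):
    return "en"

def ProfanityFinder(doc):
    return []

if __name__ == '__main__':
    doc = "Some document"
    parsers = (summarize, LengthCount, LanguageFinder, ProfanityFinder)
    with ProcessPoolExecutor() as executor:
        # Submit each parser as its own task and collect the results.
        futures = [executor.submit(parser, doc) for parser in parsers]
        smry, length, lang, profanity = [f.result() for f in futures]
    # Save summary, length, language, profanity information in database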

How to write to a shared variable in python joblib

The following code parallelizes a for-loop.
import networkx as nx
import numpy as np
from joblib import Parallel, delayed
import multiprocessing

def core_func(repeat_index, G, numpy_array_2D):
    for u in G.nodes():
        numpy_array_2D[repeat_index][u] = 2
    return

if __name__ == "__main__":
    G = nx.erdos_renyi_graph(100000, 0.99)
    nRepeat = 5000
    numpy_array = np.zeros([nRepeat, G.number_of_nodes()])
    Parallel(n_jobs=4)(delayed(core_func)(repeat_index, G, numpy_array) for repeat_index in range(nRepeat))
    print(np.mean(numpy_array))
As can be seen, the expected value to be printed is 2. However, when I run my code on a cluster (multi-core, shared memory), it returns 0.0.
I think the problem is that each worker creates its own copy of the numpy_array object, and the one created in the main function is not updated. How can I modify the code such that the numpy array numpy_array can be updated?
joblib uses the multiprocessing pool of processes by default, as its manual says:
Under the hood, the Parallel object create a multiprocessing pool that forks the Python interpreter in multiple processes to execute each of the items of the list. The delayed function is a simple trick to be able to create a tuple (function, args, kwargs) with a function-call syntax.
Which means that every process inherits the original state of the array, but whatever it writes into it is lost when the process exits. Only the function result is delivered back to the calling (main) process. But you do not return anything, so None is returned.
To make the shared array modifiable, you have two options: using threads or using shared memory.
Threads, unlike processes, share memory. So you can write to the array and every job will see the change. According to the joblib manual, it is done this way:
Parallel(n_jobs=4, backend="threading")(delayed(core_func)(repeat_index, G, numpy_array) for repeat_index in range(nRepeat))
When you run it:
$ python r1.py
2.0
However, when you write complex things into the array, make sure you properly handle locks around the data or pieces of it, or you will hit race conditions.
Also read carefully about the GIL, as computational multithreading in Python is limited (unlike I/O multithreading).
If you still need the processes (e.g. because of GIL), you can put that array into the shared memory.
This is a slightly more complicated topic, but a joblib + numpy shared memory example is also shown in the joblib manual.
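For reference, a rough sketch of that route (with sizes shrunk purely for illustration, not tested on your cluster) backs the array with a numpy memmap on disk; joblib can hand the memmap to the workers without copying, so their writes end up in the shared file.

import os
import tempfile

import numpy as np
from joblib import Parallel, delayed

def core_func(repeat_index, shared_array):
    # Writes go to the file-backed memmap, visible to every process.
    shared_array[repeat_index, :] = 2

if __name__ == "__main__":
    nRepeat, nNodes = 100, 1000
    path = os.path.join(tempfile.mkdtemp(), "shared.mmap")
    shared_array = np.memmap(path, dtype=np.float64, mode="w+", shape=(nRepeat, nNodes))
    Parallel(n_jobs=4)(delayed(core_func)(i, shared_array) for i in range(nRepeat))
    print(np.mean(shared_array))  # should print 2.0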
As Sergey wrote in his answer, processes don't share state and memory. This is why you don't see the expected result.
Threads share state and memory space, as they run under the same process. This is useful if you have many I/O operations. It won't get you more processing power (more CPUs) because of the GIL.
One technique to communicate between processes is Proxy Objects using a Manager. You create a manager object, which synchronizes resources between the processes.
A manager object returned by Manager() controls a server process which holds Python objects and allows other processes to manipulate them using proxies.
I haven't tested this code (I don't have all the modules you use), and it might require further modification, but using a Manager object it should look something like this:
if __name__ == "__main__":
G = nx.erdos_renyi_graph(100000,0.99);
nRepeat = 5000;
manager = multiprocessing.Manager()
numpys = manager.list(np.zeros([nRepeat, G.number_of_nodes()])
Parallel(n_jobs=4)(delayed(core_func)(repeat_index, G, numpys, que) for repeat_index in range(nRepeat));
print(np.mean(numpys));

python multiprocessing.pool.map, passing arguments to spawned processes

def content_generator(applications, dict):
    for app in applications:
        yield (app, dict[app])

with open('abc.pickle', 'r') as f:
    very_large_dict = pickle.load(f)
all_applications = set(very_large_dict.keys())

pool = multiprocessing.Pool()
for result in pool.imap_unordered(func_process_application, content_generator(all_applications, very_large_dict)):
    # do some aggregation on result
I have a really large dictionary whose keys are strings (application names) and whose values are information concerning the applications. Since the applications are independent, I want to use multiprocessing to process them in parallel. Parallelization works when the dictionary is not that big, but all the Python processes were killed when the dictionary was too big. I used dmesg to check what went wrong and found they were killed because the machine ran out of memory. I ran top while the pool processes were running and found that they all occupy the same amount of resident memory (RES), which is 3.4G. This confuses me, since it seems the whole dictionary gets copied into the spawned processes. I thought I had broken up the dictionary and was passing only what is relevant to each spawned process by yielding only dict[app] instead of dict. Any thoughts on what I did wrong?
The comments are becoming impossible to follow, so I'm pasting in my important comment here:
On a Linux-y system, new processes are created by fork(), so get a copy of the entire parent-process address space at the time they're created. It's "copy on write", so is more of a "virtual" copy than a "real" copy, but still ... ;-) For a start, try creating your Pool before creating giant data structures. Then the child processes will inherit a much smaller address space.
Then some answers to questions:
so in python 2.7, there is no way to spawn a new process?
On Linux-y systems, no. The ability to use "spawn" on those was first added in Python 3.4. On Windows systems, "spawn" has always been the only choice (no fork() on Windows).
The big dictionary is passed in to a function as an argument and I could only create the pool inside this function. How would I be able to create the pool before the big dictionary?
As simple as this: make these two lines the first two lines in your program:
import multiprocessing
pool = multiprocessing.Pool()
You can create the pool any time you like (just so long as it exists sometime before you actually use it), and worker processes will inherit the entire address space at the time the Pool constructor is invoked.
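To make the ordering concrete, here is a toy sketch (with stand-in data instead of your pickle file, and Python 3 syntax): the Pool is created while the parent is still small, so the forked workers inherit a small address space; the big dict is built afterwards and its items are streamed to the workers.

import multiprocessing

def func_process_application(item):
    # Stand-in for the real worker: receives one (app, info) pair.
    app, info = item
    return app, len(info)

if __name__ == '__main__':
    pool = multiprocessing.Pool()  # create workers first, while the parent is small
    very_large_dict = {"app%d" % i: "x" * 1000 for i in range(10000)}  # then build the data
    for result in pool.imap_unordered(func_process_application, very_large_dict.items()):
        pass  # do some aggregation on result
    pool.close()
    pool.join()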
ANOTHER SUGGESTION
If you're not mutating the dict after it's created, try using this instead:
def content_generator(dict):
    for app in dict:
        yield app, dict[app]
That way you don't have to materialize a giant set of the keys either. Or, even better (if possible), skip all that and iterate directly over the items:
for result in pool.imap_unordered(func_process_application, very_large_dict.iteritems()):

Parallel python loss of data

I have a Python function that creates and stores an object instance in a global list, and this function is called by a thread. While the thread runs, the list is filled up as it should be, but when the thread exits the list is empty and I have no idea why. Any help would be appreciated.
simulationResults = []

def run(width1, height1, seed1, prob1):
    global simulationResults
    instance = Life(width1, height1, seed1, prob1)
    instance.run()
    simulationResults.append(instance)
this is called in my main by:
for i in range(1, nsims + 1):
    simulations.append(multiprocessing.Process(target=run, args=(width, height, seed, prob)))
    simulations[(len(simulations) - 1)].start()
for i in simulations:
    i.join()
multiprocessing is based on processes, not threads. The important difference: Each process has a separate memory space, while threads share a common memory space. When first created, a process may (depending on OS, spawn method, etc.) be able to read the same values the parent process has, but if it writes to them, only the local values are changed, not the parent's copy. Only threads can rely on being able to access an arbitrary single shared global variable and have it behave as expected.
I'd suggest looking either at multiprocessing.Pool and its various methods to dispatch tasks and retrieve their results later, or, if you must use raw Process objects, at the various ways to exchange data between processes; you can't just assign to a global variable, because globals stop being shared when the new Process is forked/spawned.
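For example, a minimal sketch of the Pool route, reusing the Life class and parameters from the question (so it is not runnable on its own, and it requires Life instances to be picklable), might look like this:

import multiprocessing

def run(params):
    # Each worker builds and runs one simulation and returns the instance,
    # which is pickled and sent back to the parent instead of being
    # appended to a global list.
    width1, height1, seed1, prob1 = params
    instance = Life(width1, height1, seed1, prob1)
    instance.run()
    return instance

if __name__ == '__main__':
    with multiprocessing.Pool() as pool:
        simulationResults = pool.map(run, [(width, height, seed, prob)] * nsims)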
In your code you are creating new processes rather than threads. When a process is created, it gets copies of the variables in the main process, but the two are independent of each other. I think for your case it makes sense to use processes rather than threads, because processes allow you to utilise multiple cores, whereas threads are limited to a single core due to the GIL.
You will have to use interprocess communication techniques to pass data between processes. Since in your case the processes are not persistent daemons, it would make sense for each process to write its simulation result to a unique file and for the main process to read them back.
