Multiprocessing, pooling and randomness - python

I am experiencing a strange thing: I wrote a program to simulate economies. Instead of running this simulation one by one on one CPU core, I want to use multiprocessing to make things faster. So I run my code (fine), and I want to get some stats from the simulations I am doing. Then arises one surprise: all the simulations done at the same time yield the very same result! Is there some strange relationship between Pool() and random.seed()?
To be much clearer, here is what the code can be summarized as:
class Economy(object):
def __init__(self,i):
self.run_number = i
self.Statistics = Statistics()
self.process()
def run_and_return(i):
eco = Economy(i)
return eco
collection = []
def get_result(x):
collection.append(x)
if __name__ == '__main__':
pool = Pool(processes=4)
for i in range(NRUN):
pool.apply_async(run_and_return, (i,), callback=get_result)
pool.close()
pool.join()
The process(i) is the function that goes through every step of the simulation, during i steps. Basically I simulate NRUN Economies, from which I get the Statistics that I put in the list collection.
Now the strange thing is that the output of this is exactly the same for the first 4 runs: during the same "wave" of simulations, I get the very same output. Once I get to the second wave, then I get a different output for the next 4 simulations!
All these simulations run well if I use the same program with processes=1: I get different results when I only work on one core, taking simulations one by one... I have tried a few things, but can't get my head around this, hence my post...
Thank you very much for taking the time to read this long post, do not hesitate to ask for more precisions!
All the best,

If you are on Linux then each pool process is made by forking the parent process. This means the process is literally duplicated - this includes the seed any random object may be using.
The random module selects the seed for its default functions on import. Meaning the seed has already been selected before you create the Pool.
To get around this you must use an initialiser for each pool process that sets the random seed to something unique.
A decent way to seed random would be to use the process id and the current time. The process id is bound to be unique on a single run of your program. Whilst using the time will ensure uniqueness over multiple runs in case the same process id is produced. Passing process id and time through as a string will mean that the digest of the string is also used to seed the random number generator -- meaning two similar strings will produce substantially different seeds. Alternatively, you could use the uuid module to generate seeds.
def proc_init():
random.seed(str(os.getpid()) + str(time.time()))
pool = Pool(num_procs, initializer=proc_init)

Related

How to have a multi-procsesing function return and store values in python?

I have a function which I will run using multi-processing. However the function returns a value and I do not know how to store that value once it's done.
I read somewhere online about using a queue but I don't know how to implement it or if that'd even work.
cores = []
for i in range(os.cpu_count()):
cores.append(Process(target=processImages, args=(dataSets[i],)))
for core in cores:
core.start()
for core in cores:
core.join()
Where the function 'processImages' returns a value. How do I save the returned value?
In your code fragment you have input dataSets which is a list of some unspecified size. You have a function processImages which takes a dataSet element and apparently returns a value you want to capture.
cpu_count == dataset length ?
The first problem I notice is that os.cpu_count() drives the range of values i which then determines which datasets you process. I'm going to assume you would prefer these two things to be independent. That is, you want to be able to crunch some X number of datasets and you want it to work on any machine, having anywhere from 1 - 1000 (or more...) cores.
An aside about CPU-bound work
I'm also going to assume that you have already determined that the task really is CPU-bound, thus it makes sense to split by core. If, instead, your task is disk io-bound, you would want more workers. You could also be memory bound or cache bound. If optimal parallelization is important to you, you should consider doing some trials to see which number of workers really gives you maximum performance.
Here's more reading if you like
Pool class
Anyway, as mentioned by Michael Butscher, the Pool class simplifies this for you. Yours is a standard use case. You have a set of work to be done (your list of datasets to be processed) and a number of workers to do it (in your code fragment, your number of cores).
TLDR
Use those simple multiprocessing concepts like this:
from multiprocessing import Pool
# Renaming this variable just for clarity of the example here
work_queue = datasets
# This is the number you might want to find experimentally. Or just run with cpu_count()
worker_count = os.cpu_count()
# This will create processes (fork) and join all for you behind the scenes
worker_pool = Pool(worker_count)
# Farm out the work, gather the results. Does not care whether dataset count equals cpu count
processed_work = worker_pool.map(processImages, work_queue)
# Do something with the result
print(processed_work)
You cannot return the variable from another process. The recommended way would be to create a Queue (multiprocessing.Queue), then have your subprocess put the results to that queue, and once it's done, you may read them back -- this works if you have a lot of results.
If you just need a single number -- using Value or Array could be easier.
Just remember, you cannot use a simple variable for that, it has to be wrapped with above mentioned classes from multiprocessing lib.
If you want to use the result object returned by a multiprocessing, try this
from multiprocessing.pool import ThreadPool
def fun(fun_argument1, ... , fun_argumentn):
<blabla>
return object_1, object_2
pool = ThreadPool(processes=number_of_your_process)
async_num1 = pool.apply_async(fun, (fun_argument1, ... , fun_argumentn))
object_1, object_2 = async_num1.get()
then you can do whatever you want.

Multiprocessing on PBS cluster node

I have to run multiple simulations of the same model with varying parameters (or random number generator seed). Previously I worked on a server with many cores, where I used python multiprocessing library with apply_async. This was very handy as I could decide the maximum number of cores to occupy and simulations would just go into a queue.
As I understand from other questions, multiprocessing works on pbs clusters as long as you work on just one node, which can be fine for now. However, my code doesn't always work.
To let you understand my kind of code:
import functions_library as L
import multiprocessing as mp
if __name__ == "__main__":
N = 100
proc = 50
pool = mp.Pool(processes = proc)
seed = 342
np.random.seed(seed)
seeds = np.random.randint(low=1,high=100000,size=N)
resul = []
for SEED in seeds:
SEED = int(SEED)
resul.append(pool.apply_async(L.some_function, args = (some_args)))
print(SEED)
results = [p.get() for p in resul]
database = pd.DataFrame(results)
database.to_csv("prova.csv")
The function creates 3 N=10000 networkx graphs and perform some computations on them, then returns a simple short python dictionary.
The weird thing I cannot debug is the following error message:
multiprocessing.pool.MaybeEncodingError: Error sending result: >''. >Reason: 'RecursionError('maximum recursion depth exceeded while calling a >Python object')'
What's strange is that I run multiple istances of the code on different nodes. 3 times the code correctly worked, whereas most of the times it returns the previous error. I tried lunching different number of parallel simulation, from 7 to 20 (# cores of the nodes), but there doesn't seem to be a pattern, so I guess it's not a memory issue.
In other questions similar error seems to be related to pickling strange or big objects, but in this case the only thing that comes out of the function is a short dictionary, so it shouldn't be related to that.
I also tried increasing the allowed recursion depth with the sys library at the beginning og the work but didn't work up to 15000.
Any idea to solve or at least understand this behavior?
It was related to eigenvector_centrality() not converging.
When running outside of multiprocessing it correctly returns a networkx error, whereas inside it only this recursion error is returned.
I am not aware if this is a weird very function specific behavior or sometimes multiprocessing cannot handle some library errors.

Run large python script in parallel/multiprocessing

I have a large python script (an economic model with rows > 1500) which I want to excecute in parallel on several cpu cores. All the examples for multiprocessing I found so far were about simple functions, but not whole scripts. Could you please give me a hint how to achieve this?
Thanks!
Clarification: the model generates as an output a dataset for a multitude of variables. Each result is randomly different from the other model runs. Therefore I have to run the model often enough till some deviation measure is achieved (let's say 50 times). Model input is allways the same, but not the output.
Edit, got it:
import os
from multiprocessing import Pool
n_cores = 4
n_iterations = 5
def run_process(process):
os.system('python myscript.py')
if __name__ == '__main__':
p = Pool(n_cores)
p.map(run_process, range(n_iterations))
If you want to use a pool of workers, I usually do the following.
import multiprocessing as mp
def MyFunctionInParallel(foo, bar, queue):
res = foo + bar
queue.put({res: res})
return
if __name__ == '__main__':
data = []
info = {}
num =
ManQueue = mp.Manager().Queue()
with mp.Pool(processes=numProcs) as pool:
pool.starmap(MyFunctionInParallel, [(data[v], info, ManQueue)
for v in range(num)])
resultdict = {}
for i in range(num):
resultdict.update(ManQueue.get())
To be clearer, your script becomes the body of MyFunctionInParallel. This means that you need to slightly change your script so that the variables which depend on your input (i.e. each of your models) can be passed as arguments to MyFunctionInParallel. Then, depending on what you want to do with the results you get for each run, you can either use a Queue as sketched above or for example, write your results in a file. If you use a Queue, it means that you want to be able to retrieve your data at the end of the parallel execution (i.e. in the same script execution), and I would advise to use dictionaries as a way to store your results in the Queue, as they are very flexible on the data they can contain. On the other hand, writing up your results in a file is I guess better if you wish to share them with other users/applications. You have to be careful with concurrent writing from all the workers, so as to produce a meaningful output, but writing one file per model can also be OK.
For the main part of the code, num would be the number of models you will be running, data and info some parameters which are specific (or not) to each model and numProcs the number of processes that you wish to launch. For the call to starmap, it will basically map the arguments in the list comprehension to each call of MyFunctionInParallel, allowing each execution to have different input arguments.

Using pool for multiprocessing in Python (Windows)

I have to do my study in a parallel way to run it much faster. I am new to multiprocessing library in python, and could not yet make it run successfully.
Here, I am investigating if each pair of (origin, target) remains at certain locations between various frames of my study. Several points:
It is one function, which I want to run faster (It is not several processes).
The process is performed subsequently; it means that each frame is compared with the previous one.
This code is a very simpler form of the original code. The code outputs a residece_list.
I am using Windows OS.
Can someone check the code (the multiprocessing section) and help me improve it to make it work. Thanks.
import numpy as np
from multiprocessing import Pool, freeze_support
def Main_Residence(total_frames, origin_list, target_list):
Previous_List = {}
residence_list = []
for frame in range(total_frames): #Each frame
Current_List = {} #Dict of pair and their residence for frames
for origin in range(origin_list):
for target in range(target_list):
Pair = (origin, target) #Eahc pair
if Pair in Current_List.keys(): #If already considered, continue
continue
else:
if origin == target:
if (Pair in Previous_List.keys()): #If remained from the previous frame, add residence
print "Origin_Target remained: ", Pair
Current_List[Pair] = (Previous_List[Pair] + 1)
else: #If new, add it to the current
Current_List[Pair] = 1
for pair in Previous_List.keys(): #Add those that exited from residence to the list
if pair not in Current_List.keys():
residence_list.append(Previous_List[pair])
Previous_List = Current_List
return residence_list
if __name__ == '__main__':
pool = Pool(processes=5)
Residence_List = pool.apply_async(Main_Residence, args=(20, 50, 50))
print Residence_List.get(timeout=1)
pool.close()
pool.join()
freeze_support()
Residence_List = np.array(Residence_List) * 5
Multiprocessing does not make sense in the context you are presenting here.
You are creating five subprocesses (and three threads belonging to the pool, managing workers, tasks and results) to execute one function once. All of this is coming at a cost, both in system resources and execution time, while four of your worker processes don't do anything at all. Multiprocessing does not speed up the execution of a function. The code in your specific example will always be slower than plainly executing Main_Residence(20, 50, 50) in the main process.
For multiprocessing to make sense in such a context, your work at hand would need to be broken down to a set of homogenous tasks that can be processed in parallel with their results potentially being merged later.
As an example (not necessarily a good one), if you want to calculate the largest prime factors for a sequence of numbers, you can delegate the task of calculating that factor for any specific number to a worker in a pool. Several workers would then do these individual calculations in parallel:
def largest_prime_factor(n):
p = n
i = 2
while i * i <= n:
if n % i:
i += 1
else:
n //= i
return p, n
if __name__ == '__main__':
pool = Pool(processes=3)
start = datetime.now()
# this delegates half a million individual tasks to the pool, i.e.
# largest_prime_factor(0), largest_prime_factor(1), ..., largest_prime_factor(499999)
pool.map(largest_prime_factor, range(500000))
pool.close()
pool.join()
print "pool elapsed", datetime.now() - start
start = datetime.now()
# same work just in the main process
[largest_prime_factor(i) for i in range(500000)]
print "single elapsed", datetime.now() - start
Output:
pool elapsed 0:00:04.664000
single elapsed 0:00:08.939000
(the largest_prime_factor function is taken from #Stefan in this answer)
As you can see, the pool is only roughly twice as fast as single process execution of the same amount of work, all while running in three processes in parallel. That's due to the overhead introduced by multiprocessing/the pool.
So, you stated that the code in your example has been simplified. You'll have to analyse your original code to see if it can be broken down to homogenous tasks that can be passed down to your pool for processing. If that is possible, using multiprocessing might help you speed up your program. If not, multiprocessing will likely cost you time, rather than save it.
Edit:
Since you asked for suggestions on the code. I can hardly say anything about your function. You said yourself that it is just a simplified example to provide an MCVE (much appreciated by the way! Most people don't take the time to strip down their code to its bare minimum). Requests for a code review are anyway better suited over at Codereview.
Play around a bit with the available methods of task delegation. In my prime factor example, using apply_async came with a massive penalty. Execution time increased ninefold, compared to using map. But my example is using just a simple iterable, yours needs three arguments per task. This could be a case for starmap, but that is only available as of Python 3.3.Anyway, the structure/nature of your task data basically determines the correct method to use.
I did some q&d testing with multiprocessing your example function.
The input was defined like this:
inp = [(20, 50, 50)] * 5000 # that makes 5000 tasks against your Main_Residence
I ran that in Python 3.6 in three subprocesses with your function unaltered, except for the removal of the print statment (I/O is costly). I used, starmap, apply, starmap_async and apply_async and also iterated through the results each time to account for the blocking get() on the async results.
Here's the output:
starmap elapsed 0:01:14.506600
apply elapsed 0:02:11.290600
starmap async elapsed 0:01:27.718800
apply async elapsed 0:01:12.571200
# btw: 5k calls to Main_Residence in the main process looks as bad
# as using apply for delegation
single elapsed 0:02:12.476800
As you can see, the execution times differ, although all four methods do the same amount of work; the apply_async you picked appears to be the fastest method.
Coding Style. Your code looks quite ... unconventional :) You use Capitalized_Words_With_Underscore for your names (both, function and variable names), that's pretty much a no-no in Python. Also, assigning the name Previous_List to a dictionary is ... questionable. Have a look at PEP 8, especially the section Naming Conventions to see the commonly accepted coding style for Python.
Judging by the way your print looks, you are still using Python 2. I know that in corporate or institutional environments that's sometimes all you have available. Still, keep in mind that the clock for Python 2 is ticking

Multiprocessing -- Thread Pool Memory Leak?

I am observing memory usage that I cannot explain to myself. Below I provide a stripped down version of my actual code that still exhibits this behavior. The code is intended to accomplish the following:
Read a text file in chunks of 1000 lines. Each line is a sentence. Split these 1000 sentences into 4 generators. Pass these generators to a thread pool and run feature extraction in parallel on 250 sentences.
In my actual code I accumulate features and labels from all sentences of the entire file.
Now here comes the weird thing: Memory gets allocated but not freed again even when not accumulating these values! And it has something to do with the thread pool I think. The amount of memory taken in total is dependent on how many features are extracted for any given word. I simulate this here with range(100). Have a look:
from sys import argv
from itertools import chain, islice
from multiprocessing import Pool
from math import ceil
# dummyfied feature extraction function
# the lengt of the range determines howmuch mamory is used up in total,
# eventhough the objects are never stored
def features_from_sentence(sentence):
return [{'some feature' 'some value'} for i in range(100)], ['some label' for i in range(100)]
# split iterable into generator of generators of length `size`
def chunks(iterable, size=10):
iterator = iter(iterable)
for first in iterator:
yield chain([first], islice(iterator, size - 1))
def features_from_sentence_meta(l):
return list(map (features_from_sentence, l))
def make_X_and_Y_sets(sentences, i):
print(f'start: {i}')
pool = Pool()
# split sentences into a generator of 4 generators
sentence_chunks = chunks(sentences, ceil(50000/4))
# results is a list containing the lists of pairs of X and Y of all chunks
results = map(lambda x : x[0], pool.map(features_from_sentence_meta, sentence_chunks))
X, Y = zip(*results)
print(f'end: {i}')
return X, Y
# reads file in chunks of `lines_per_chunk` lines
def line_chunks(textfile, lines_per_chunk=1000):
chunk = []
i = 0
with open(textfile, 'r') as textfile:
for line in textfile:
if not line.split(): continue
i+=1
chunk.append(line.strip())
if i == lines_per_chunk:
yield chunk
i = 0
chunk = []
yield chunk
textfile = argv[1]
for i, line_chunk in enumerate(line_chunks(textfile)):
# stop processing file after 10 chunks to demonstrate
# that memory stays occupied (check your system monitor)
if i == 10:
while True:
pass
X_chunk, Y_chunk = make_X_and_Y_sets(line_chunk, i)
The file I am using to debug this has 50000 nonempty lines, which is why I use the hardcoded 50000 at one place. If you want to use the same file, he is a link for your convenience:
https://www.dropbox.com/s/v7nxb7vrrjim349/de_wiki_50000_lines?dl=0
Now when you run this script and open your system monitor you will observe that memory gets used up and the usage keeps going until the 10th chunk, where I artificially go into an endless loop to demonstrate that the memory stays in use, even though I never store anything.
Can you explain to me why this happens? I seem to be missing something about how multiprocessing pools are supposed to be used.
First, let's clear up some misunderstandings—although, as it turns out, this wasn't actually the right avenue to explore in the first place.
When you allocate memory in Python, of course it has to go get that memory from the OS.
When you release memory, however, it rarely gets returned to the OS, until you finally exit. Instead, it goes into a "free list"—or, actually, multiple levels of free lists for different purposes. This means that the next time you need memory, Python already has it lying around, and can find it immediately, without needing to talk to the OS to allocate more. This usually makes memory-intensive programs much faster.
But this also means that—especially on modern 64-bit operating systems—trying to understand whether you really do have any memory pressure issues by looking at your Activity Monitor/Task Manager/etc. is next to useless.
The tracemalloc module in the standard library provides low-level tools to see what actually is going on with your memory usage. At a higher level, you can use something like memory_profiler, which (if you enable tracemalloc support—this is important) can put that information together with OS-level information from sources like psutil to figure out where things are going.
However, if you aren't seeing any actual problems—your system isn't going into swap hell, you aren't getting any MemoryError exceptions, your performance isn't hitting some weird cliff where it scales linearly up to N and then suddenly goes all to hell at N+1, etc.—you usually don't need to bother with any of this in the first place.
If you do discover a problem, then, fortunately, you're already half-way to solving it. As I mentioned at the top, most memory that you allocated doesn't get returned to the OS until you finally exit. But if all of your memory usage is happening in child processes, and those child processes have no state, you can make them exit and restart whenever you want.
Of course there's a performance cost to doing so—process teardown and startup time, and page maps and caches that have to start over, and asking the OS to allocate the memory again, and so on. And there's also a complexity cost—you can't just run a pool and let it do its thing; you have to get involved in its thing and make it recycle processes for you.
There's no builtin support in the multiprocessing.Pool class for doing this.
You can, of course, build your own Pool. If you want to get fancy, you can look at the source to multiprocessing and do what it does. Or you can build a trivial pool out of a list of Process objects and a pair of Queues. Or you can just directly use Process objects without the abstraction of a pool.
Another reason you can have memory problems is that your individual processes are fine, but you just have too many of them.
And, in fact, that seems to be the case here.
You create a Pool of 4 workers in this function:
def make_X_and_Y_sets(sentences, i):
print(f'start: {i}')
pool = Pool()
# ...
… and you call this function for every chunk:
for i, line_chunk in enumerate(line_chunks(textfile)):
# ...
X_chunk, Y_chunk = make_X_and_Y_sets(line_chunk, i)
So, you end up with 4 new processes for every chunk. Even if each one has pretty low memory usage, having hundreds of them at once is going to add up.
Not to mention that you're probably severely hurting your time performance by having hundreds of processes competing over 4 cores, so you waste time in context switching and OS scheduling instead of doing real work.
As you pointed out in a comment, the fix for this is trivial: just make a single global pool instead of a new one for each call.
Sorry for getting all Columbo here, but… just one more thing… This code runs at the top level of your module:
for i, line_chunk in enumerate(line_chunks(textfile)):
# ...
X_chunk, Y_chunk = make_X_and_Y_sets(line_chunk, i)
… and that's the code that tries to spin up the pool and all the child tasks. But each child process in that pool needs to import this module, which means they're all going to end up running the same code, and spinning up another pool and a whole extra set of child tasks.
You're presumably running this on Linux or macOS, where the default startmethod is fork, which means multiprocessing can avoid this import, so you don't have a problem. But with the other startmethods, this code would basically be a forkbomb that eats up all of your system resources. And that includes spawn, which is the default startmethod on Windows. So, if there's ever any chance anyone might run this code on Windows, you should put all of that top-level code in a if __name__ == '__main__': guard.

Categories

Resources