I have been fiddling with Python's multiprocessing functionality for upwards of an hour now, trying to parallelize a rather complex graph traversal function using multiprocessing.Process and multiprocessing.Manager:
import networkx as nx
import csv
import time
from operator import itemgetter
import os
import multiprocessing as mp

cutoff = 1
exclusionlist = ["cpd:C00024"]

DG = nx.read_gml("KeggComplete.gml", relabel=True)
for exclusion in exclusionlist:
    DG.remove_node(exclusion)

# check whether the 'memorizedPaths' directory exists, and if not, create it
fn = os.path.join(os.path.dirname(__file__),
                  'memorizedPaths' + str(cutoff+1))
if not os.path.exists(fn):
    os.makedirs(fn)

manager = mp.Manager()
memorizedPaths = manager.dict()
filepaths = manager.dict()

degreelist = sorted(DG.degree_iter(),
                    key=itemgetter(1),
                    reverse=True)

def _all_simple_paths_graph(item, DG, cutoff, memorizedPaths, filepaths):
    source = item[0]
    uniqueTreePaths = []
    if cutoff < 1:
        return
    visited = [source]
    stack = [iter(DG[source])]
    while stack:
        children = stack[-1]
        child = next(children, None)
        if child is None:
            stack.pop()
            visited.pop()
        elif child in memorizedPaths:
            for path in memorizedPaths[child]:
                newPath = (tuple(visited) + tuple(path))
                if ((len(newPath) <= cutoff) and
                        (len(set(visited) & set(path)) == 0)):
                    uniqueTreePaths.append(newPath)
            continue
        elif len(visited) < cutoff:
            if child not in visited:
                visited.append(child)
                stack.append(iter(DG[child]))
            if visited not in uniqueTreePaths:
                uniqueTreePaths.append(tuple(visited))
        else:  # len(visited) == cutoff
            if ((visited not in uniqueTreePaths) and
                    (child not in visited)):
                uniqueTreePaths.append(tuple(visited + [child]))
            stack.pop()
            visited.pop()
    # write the absolute path of this node's path file into the hash table
    filepaths[source] = str(fn) + "/" + str(source) + "path.txt"
    with open(filepaths[source], "wb") as csvfile2:
        writer = csv.writer(csvfile2, delimiter=" ", quotechar="|")
        for path in uniqueTreePaths:
            writer.writerow(path)
    memorizedPaths[source] = uniqueTreePaths

############################################################################

if __name__ == '__main__':
    start = time.clock()
    for item in degreelist:
        test = mp.Process(target=_all_simple_paths_graph,
                          args=(item, DG, cutoff, memorizedPaths, filepaths))
        test.start()
        test.join()
    end = time.clock()
    print(end - start)
Currently - through luck and magic - it works (sort of). My problem is that I'm only using 12 of my 24 cores.
Can someone explain why this might be the case? Is it that my code isn't the best multiprocessing solution, or is it a feature of my architecture (Intel Xeon CPU E5-2640 @ 2.50GHz x18, running on Ubuntu 13.04 x64)?
EDIT:
I managed to get:
p = mp.Pool()
for item in degreelist:
    p.apply_async(_all_simple_paths_graph,
                  args=(item, DG, cutoff, memorizedPaths, filepaths))
p.close()
p.join()
working. However, it's VERY SLOW! So I assume I'm using the wrong function for the job. Hopefully this helps clarify exactly what I'm trying to accomplish!
EDIT2: .map attempt:
from functools import partial

partialfunc = partial(_all_simple_paths_graph,
                      DG=DG,
                      cutoff=cutoff,
                      memorizedPaths=memorizedPaths,
                      filepaths=filepaths)
p = mp.Pool()
for item in processList:
    processVar = p.map(partialfunc, xrange(len(processList)))
p.close()
p.join()
It works, but it is slower than single-core. Time to optimize!
Too much piling up here to address in comments, so, where mp is multiprocessing:
mp.cpu_count() should return the number of processors. But test it. Some platforms are funky, and this info isn't always easy to get. Python does the best it can.
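For example, a quick sanity check (a sketch; os.sched_getaffinity is Linux-only) to compare what the machine reports with what your process is actually allowed to use:
import os
import multiprocessing as mp

print(mp.cpu_count())                # logical CPUs the machine reports
print(os.cpu_count())                # the same information via the os module
# Linux only: the CPUs this process may actually run on; if taskset or
# cgroups restricted it, this set can be smaller than cpu_count()
print(len(os.sched_getaffinity(0)))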
If you start 24 processes, they'll do exactly what you tell them to do ;-) Looks like mp.Pool() would be most convenient for you. You pass the number of processes you want to create to its constructor. mp.Pool(processes=None) will use mp.cpu_count() for the number of processors.
Then you can use, for example, .imap_unordered(...) on your Pool instance to spread your degreelist across processes. Or maybe some other Pool method would work better for you - experiment.
If you can't bash the problem into Pool's view of the world, you could instead create an mp.Queue to create a work queue, .put()'ing nodes (or slices of nodes, to reduce overhead) to work on in the main program, and write the workers to .get() work items off that queue. Ask if you need examples. Note that you need to put sentinel values (one per process) on the queue, after all the "real" work items, so that worker processes can test for the sentinel to know when they're done.
FYI, I like queues because they're more explicit. Many others like Pools better because they're more magical ;-)
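For reference, a minimal sketch of that queue-plus-sentinels pattern (the doubling is a placeholder, not your graph code; one None sentinel is put per worker):
import multiprocessing as mp

def worker(work_queue, result_queue):
    # keep pulling nodes until the None sentinel shows up
    for node in iter(work_queue.get, None):
        result_queue.put((node, node * 2))   # placeholder "work"

if __name__ == "__main__":
    nprocs = mp.cpu_count()
    work_queue = mp.Queue()
    result_queue = mp.Queue()

    items = range(1000)                      # stand-in for degreelist
    for node in items:
        work_queue.put(node)
    for _ in range(nprocs):                  # one sentinel per worker process
        work_queue.put(None)

    procs = [mp.Process(target=worker, args=(work_queue, result_queue))
             for _ in range(nprocs)]
    for p in procs:
        p.start()

    results = [result_queue.get() for _ in range(len(items))]   # drain before joining
    for p in procs:
        p.join()
    print(len(results))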
Pool Example
Here's an executable prototype for you. This shows one way to use imap_unordered with Pool and chunksize that doesn't require changing any function signatures. Of course you'll have to plug in your real code ;-) Note that the init_worker approach allows passing "most of" the arguments only once per processor, not once for every item in your degreelist. Cutting the amount of inter-process communication can be crucial for speed.
import multiprocessing as mp

def init_worker(mps, fps, cut):
    global memorizedPaths, filepaths, cutoff
    global DG

    print "process initializing", mp.current_process()
    memorizedPaths, filepaths, cutoff = mps, fps, cut
    DG = 1##nx.read_gml("KeggComplete.gml", relabel=True)

def work(item):
    _all_simple_paths_graph(DG, cutoff, item, memorizedPaths, filepaths)

def _all_simple_paths_graph(DG, cutoff, item, memorizedPaths, filepaths):
    pass # print "doing " + str(item)

if __name__ == "__main__":
    m = mp.Manager()
    memorizedPaths = m.dict()
    filepaths = m.dict()
    cutoff = 1 ##
    # use all available CPUs
    p = mp.Pool(initializer=init_worker, initargs=(memorizedPaths,
                                                   filepaths,
                                                   cutoff))
    degreelist = range(100000) ##
    for _ in p.imap_unordered(work, degreelist, chunksize=500):
        pass
    p.close()
    p.join()
I strongly advise running this exactly as-is, so you can see that it's blazing fast. Then add things to it a bit at a time, to see how that affects the time. For example, just adding
memorizedPaths[item] = item
to _all_simple_paths_graph() slows it down enormously. Why? Because the dict gets bigger and bigger with each addition, and this process-safe dict has to be synchronized (under the covers) among all the processes. The unit of synchronization is "the entire dict" - there's no internal structure the mp machinery can exploit to do incremental updates to the shared dict.
If you can't afford this expense, then you can't use a Manager.dict() for this. Opportunities for cleverness abound ;-)
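One common way around that expense, sketched here under the assumption that each worker only needs to report its own results, is to skip the shared dict entirely: every worker returns its piece, and the parent merges the pieces into a plain local dict.
import multiprocessing as mp

def work(item):
    # compute paths locally; no shared state is touched here
    paths = [(item, item + 1)]           # placeholder result
    return item, paths                   # the result travels back exactly once

if __name__ == "__main__":
    p = mp.Pool()
    memorizedPaths = {}                  # ordinary dict, lives only in the parent
    for source, paths in p.imap_unordered(work, range(100000), chunksize=500):
        memorizedPaths[source] = paths   # merging happens in one process, no locking
    p.close()
    p.join()
    print(len(memorizedPaths))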
Related
I am trying to run 1200 iterations of a function with different values using multiprocessing.
Is there a way I can set the priority and CPU affinity of the processes within the function itself?
Here is an example of what I am doing :
with multiprocessing.Pool(processes=3) as pool:
    r = pool.map(func, (c for c in combinations))
I want each of the 3 processes to have high priority using psutil, and the cpu_affinity to be specified. While I can use psutil.Process().HIGH_PRIORITY_CLASS within func, how should I specify different affinities for the three processes?
I would use the initializer function in mp.Pool:
# Same priority for each child:
def init_priority(prio_level):
    set_prio(prio_level)

if __name__ == "__main__":
    with Pool(nprocs, init_priority, (prio_level,)) as p:
        p.map(...)

# Different priority for each child (this may not be very useful,
# because you cannot choose which child will accept each "task"):
def init_priority(q):
    prio_level = q.get()
    set_prio(prio_level)

if __name__ == "__main__":
    q = mp.Queue()
    for _ in range(nprocs):  # put one prio_level for each process
        q.put(prio_level)
    with Pool(nprocs, init_priority, (q,)) as p:
        p.map(...)
If you need some high-priority child processes and some low-priority ones, and need to be able to discern easily between them, I would skip mp.Pool and just use your own Process instances.
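For completeness, set_prio above is left abstract; one possible sketch using psutil (the priority constant mentioned is Windows-specific, and on Unix you would pass a niceness value instead) could look roughly like this:
import psutil

def set_prio(prio_level, cpus=None):
    proc = psutil.Process()          # the current (child) process
    proc.nice(prio_level)            # e.g. psutil.HIGH_PRIORITY_CLASS on Windows, or -5 on Unix
    if cpus is not None:
        proc.cpu_affinity(cpus)      # e.g. [0, 1] pins this worker to the first two cores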
I have a Python 3.6 data processing task that involves pre-loading a large dict for looking up dates by ID for use in a subsequent step by a pool of sub-processes managed by the multiprocessing module. This process was eating up most if not all of the memory on the box, so one optimisation I applied was to 'intern' the string dates being stored in the dict. This reduced the memory footprint of the dict by several GBs as I expected it would, but it also had another unexpected effect.
Before applying interning, the sub-processes would gradually eat more and more memory as they executed, which I believe was down to them having to copy the dict gradually from global memory across to the sub-processes' individual allocated memory (this is running on Linux and so benefits from the copy-on-write behaviour of fork()). Even though I'm not updating the dict in the sub-processes, it looks like read-only access can still trigger copy-on-write through reference counting.
I was only expecting the interning to reduce the memory footprint of the dict, but in fact it stopped the memory usage gradually increasing over the sub-processes' lifetimes as well.
Here's a minimal example I was able to build that replicates the behaviour, although it requires a large file to load in and populate the dict with and a sufficient amount of repetition in the values to make sure that interning provides a benefit.
import multiprocessing
import sys

# initialise a large dict that will be visible to all processes
# and that contains a lot of repeated values
global_map = dict()

with open(sys.argv[1], 'r', encoding='utf-8') as file:
    if len(sys.argv) > 2:
        print('interning is on')
    else:
        print('interning is off')
    for i, line in enumerate(file):
        if i > 30000000:
            break
        parts = line.split('|')
        if len(sys.argv) > 2:
            global_map[str(i)] = sys.intern(parts[2])
        else:
            global_map[str(i)] = parts[2]

def read_map():
    # do some nonsense processing with each value in the dict
    global global_map
    for i in range(30000000):
        x = global_map[str(i)]
        y = x + '_'
    return y

print("starting processes")
process_pool = multiprocessing.Pool(processes=10)

for _ in range(10):
    process_pool.apply_async(read_map)

process_pool.close()
process_pool.join()
I ran this script and monitored htop to see the total memory usage.
interning?   mem usage just after 'starting processes' printed   peak mem usage after that
no           7.1GB                                                28.0GB
yes          5.5GB                                                5.6GB
While I am delighted that this optimisation seems to have fixed all my memory issues at once, I'd like to understand better why this works. If the creeping memory usage by the sub-processes is down to copy-on-write, why doesn't this happen if I intern the strings?
The CPython implementation stores interned strings in a global object that is a regular Python dictionary where both keys and values are pointers to string objects.
When a new child process is created, it gets a copy of the parent's address space, so it will use the reduced data dictionary with interned strings.
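A quick way to see the effect of that table, independent of multiprocessing (a small sketch of plain CPython behaviour):
import sys

a = ''.join(['py', 'thon'])              # built at runtime: a fresh str object
b = ''.join(['py', 'thon'])
print(a == b, a is b)                    # typically: True False (equal values, distinct objects)
print(sys.intern(a) is sys.intern(b))    # True: both now point at the single interned object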
I've compiled Python with the patch below and as you can see, both processes have access to the table with interned strings:
test.py:
import multiprocessing as mp
import sys
import _string

PROCS = 2
STRING = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

def worker():
    proc = mp.current_process()
    interned = _string.interned()
    try:
        idx = interned.index(STRING)
    except ValueError:
        s = None
    else:
        s = interned[idx]
    print(f"{proc}: <{s}>")

def main():
    sys.intern(STRING)
    procs = []
    for _ in range(PROCS):
        p = mp.Process(target=worker)
        p.start()
        procs.append(p)
    for p in procs:
        p.join()

if __name__ == "__main__":
    main()
Test:
# python test.py
<Process name='Process-1' parent=3917 started>: <https://www.youtube.com/watch?v=dQw4w9WgXcQ>
<Process name='Process-2' parent=3917 started>: <https://www.youtube.com/watch?v=dQw4w9WgXcQ>
Patch:
--- Objects/unicodeobject.c 2021-05-15 15:08:05.117433926 +0100
+++ Objects/unicodeobject.c.tmp 2021-05-15 23:48:35.236152366 +0100
@@ -16230,6 +16230,11 @@
_PyUnicode_FiniEncodings(&tstate->interp->unicode.fs_codec);
}
+static PyObject *
+interned_impl(PyObject *module)
+{
+ return PyDict_Values(interned);
+}
/* A _string module, to export formatter_parser and formatter_field_name_split
to the string.Formatter class implemented in Python. */
@@ -16239,6 +16244,8 @@
METH_O, PyDoc_STR("split the argument as a field name")},
{"formatter_parser", (PyCFunction) formatter_parser,
METH_O, PyDoc_STR("parse the argument as a format string")},
+ {"interned", (PyCFunction) interned_impl,
+ METH_NOARGS, PyDoc_STR("lookup interned strings")},
{NULL, NULL}
};
You may also want to take a look at the multiprocessing.shared_memory module.
References:
The internals of Python string interning
Not an answer, but I thought it was of interest to provide an MWE that does not require an input file. Peak memory usage is much higher when manual interning is turned off, which HTF properly explained in my opinion.
from multiprocessing import Pool
from random import choice
from string import ascii_lowercase
# from sys import intern

def rand_str(length):
    return ''.join([choice(ascii_lowercase) for i in range(length)])

def read_map():
    for value in global_map.values():
        x = value
        y = x + '_'
    return y

global_map = dict()
for i in range(20_000_000):
    # global_map[str(i)] = intern(rand_str(4))
    global_map[str(i)] = rand_str(4)
print("starting processes")

if __name__ == '__main__':
    with Pool(processes=2) as process_pool:
        processes = [process_pool.apply_async(read_map)
                     for process in range(process_pool._processes)]
        for process in processes:
            process.wait()
            print(process.get())
I want to use a trie search with python 3.7 in order to match a string with some given words.
The trie search algorithm is actually quite fast, however I also want to use all of the cores my CPU has. Let's assume my PC has 8 cores and I want to use 7 of them.
So I split my word database into 7 equally big lists and created a trie from each one. (That's the basic idea for parallelizing the code.)
However, when I call Process() from the multiprocessing module, the Process().start() method can take up to a couple of seconds on the real database. (The search itself takes about a microsecond.)
To be honest, I'm not yet a professional programmer, which means I have probably made some major mistake in the code. Does someone see the reason the start of the process is so damn slow?
Please consider that I tested the script with a way bigger database than the trie below. I also tested the script with calling only 1 process each time and that was also significantly slower.
I wanted to provide less code, however I think it's nice to see the actual running problem. I can also provide additional info if needed.
import string
import sys
import time
from multiprocessing import Process, Manager
from itertools import combinations_with_replacement

class TrieNode:
    def __init__(self):
        self.isString = False
        self.children = {}

    def insertString(self, word, root):
        currentNode = root
        for char in word:
            if char not in currentNode.children:
                currentNode.children[char] = TrieNode()
            currentNode = currentNode.children[char]
        currentNode.isString = True

    def findStrings(self, prefix, node, results):
        # append the result when an end node is found
        if node.isString:
            results.append(prefix)
        for char in node.children:
            self.findStrings(prefix + char, node.children[char], results)

    def findSubStrings(self, start_prefix, root, results):
        currentNode = root
        for char in start_prefix:
            # stop the loop on missing prefixes or missing children
            if char not in currentNode.children:
                break
            # otherwise move on to the child node
            else:
                currentNode = currentNode.children[char]
        # use findStrings recursively to locate the end nodes
        self.findStrings(start_prefix, currentNode, results)
        return results

def gen_word_list(num_words, min_word_len=4, max_word_len=10):
    wordList = []
    total_words = 0
    for long_word in combinations_with_replacement(string.ascii_lowercase, max_word_len):
        wordList.append(long_word)
        total_words += 1
        if total_words >= num_words:
            break
        for cut_length in range(1, max_word_len-min_word_len+1):
            wordList.append(long_word[:-cut_length])
            total_words += 1
            if total_words >= num_words:
                break
    return wordList

if __name__ == '__main__':
    # Sample word list
    wordList = gen_word_list(1.5 * 10**5)

    # Configs
    try:
        n_cores = int(sys.argv[-1] or 7)
    except ValueError:
        n_cores = 7

    # Repetitions to do in order to estimate the runtime of a single run
    num_repeats = 20
    real_num_repeats = n_cores * num_repeats

    # Creating the trie
    root = TrieNode()
    # Adding words
    for word in wordList:
        root.insertString(word, root)

    # Extending the trie to use it on multiple cores at once
    multiroot = [root] * n_cores

    # Measure time
    print('Single process ...')
    t_0 = time.time()
    for i in range(real_num_repeats):
        r = []
        root.findSubStrings('he', root, r)
    single_proc_time = (time.time() - t_0)
    print(single_proc_time / real_num_repeats)

    # Using multiple cores to speed up the process
    man = Manager()

    # Loop to test the multicore solution
    # (fewer repetitions are done to compare the timings to the single-core solution)
    print('\nMultiprocess ...')
    t_00 = time.time()
    p_init_time = 0
    procs_append_time = 0
    p_start_time = 0
    for i in range(num_repeats):
        # Create a share-able list
        res = man.list()
        procs = []
        for i in range(n_cores):
            t_0 = time.time()
            p = Process(target=multiroot[i].findSubStrings, args=('a', multiroot[i], res))
            t_1 = time.time()
            p_init_time += t_1 - t_0
            procs.append(p)
            t_2 = time.time()
            procs_append_time += t_2 - t_1
            p.start()
            p_start_time += time.time() - t_2
        for p in procs:
            p.join()
    multi_proc_time = time.time() - t_00
    print(multi_proc_time / real_num_repeats)

    init_overhead = p_init_time / single_proc_time
    append_overhead = procs_append_time / single_proc_time
    start_overhead = p_start_time / single_proc_time
    total_overhead = (multi_proc_time - single_proc_time) / single_proc_time
    print(f"Process(...) overhead: {init_overhead:.1%}")
    print(f"procs.append(p) overhead: {append_overhead:.1%}")
    print(f"p.start() overhead: {start_overhead:.1%}")
    print(f"Total overhead: {total_overhead:.1%}")
Single process ...
0.007229958261762347
Multiprocess ...
0.7615800397736686
Process(...) overhead: 0.9%
procs.append(p) overhead: 0.0%
p.start() overhead: 8.2%
Total overhead: 10573.8%
General idea
There are many things to consider, and most of them are already described in Multiprocessing > Programming guidelines. The most important thing is to remember that you are actually working with multiple processes, so there are three (or four) ways in which variables are handled:
Synchronized wrappers over ctypes shared-state variables (like multiprocessing.Value). The actual variable is always "one object" in memory, and the wrapper by default uses "locking" to set/get the real value.
Proxies (like Manager().list()). These variables are similar to shared-state variables, but are placed in a special "server process", and all operations on them actually send pickled values between the manager process and the active process: results.append(x) pickles x in the process that makes the call, sends it to the manager process, where it is unpickled and applied. Any other access to results (like len(results), or iteration over results) involves the same pickling/sending/unpickling round trip.
So proxies are generally much slower than any other approach for common variables, and in many cases using a manager for "local" parallelization will give worse performance even compared to single-process runs.
But a manager server can be used remotely, so it's reasonable to use one when you want to parallelize the work using workers distributed over multiple machines.
Objects available during subprocess creation. For the "fork" start method, all objects available during creation of the subprocess are still available and "not shared", so changing them only changes them "locally for the subprocess". But until they are changed, each process really "shares" the memory of each such object, so:
If they are used "read-only", nothing is copied or "communicated".
If they are changed, they are copied inside the subprocess, and the copy is what changes. This is called copy-on-write (COW).
Please note that making a new reference to an object, e.g. assigning a variable to reference it or appending it to a list, increases the ref_count of the object, and that is considered to be "a change" (see the small sketch below).
Behavior may also vary depending on the "start method": e.g. for the "spawn"/"forkserver" methods, changeable global variables are not really "the same objects", and the value seen by the subprocess may not be the same as in the parent process.
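A small illustration of the ref-count point (my sketch, not from the original code): even "just reading" an object in Python touches its reference count, which lives in the object's header, so under fork the memory page holding that header gets copied.
import sys

data = [0] * 1000
print(sys.getrefcount(data))   # baseline (getrefcount itself adds a temporary +1)

alias = data                   # a plain assignment creates a new reference ...
print(sys.getrefcount(data))   # ... and the count stored inside the object changes,
                               # which already counts as a "write" from the
                               # copy-on-write point of view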
So the initial values of multiroot[i] (used in Process(target=..., args=(..., multiroot[i], ...))) are shared, but:
if you are not using the 'fork' start method (and by default Windows does not use it), then all the args are pickled at least once for each subprocess, and so start may take a long time if multiroot[i].children is huge.
Even if you are using fork: initially multiroot[i] seems to be shared and not copied, but I'm not sure what happens when variables are assigned inside the findSubStrings method (e.g. currentNode = ...): maybe it causes copy-on-write (COW), and so the whole instance of TrieNode gets copied.
What can be done to improve the situation:
If you are using the fork start method, then make sure that the "database" objects (TrieNode instances) are truly read-only and don't even have methods with variable assignments in them. For example, you can move findSubStrings to another class, and make sure all the instance.insertString calls happen before the subprocesses are started.
You are using a man.list() instance as the results argument to findSubStrings. This means that a different "wrapper" is created for each subprocess, and every results.append(prefix) pickles prefix and then sends it to the server process. If you are using a Pool with a limited number of processes, it's not a big deal. If you are spawning a huge number of subprocesses, it might affect performance. And I think that by default they all use "locking", so concurrent appends might be relatively slow. If the order of items in results does not matter (I'm not experienced with prefix trees and don't remember the theory behind them), then you can fully avoid any overhead related to concurrent results.append:
create a new results list inside the findSubStrings method, and don't use res = man.list() at all;
to get the "final" results, iterate over every result object returned by pool.apply_async(), get the results, and "merge" them.
Using weak references
Using currentNode = root in findSubStrings will result in COW of root. That's why weak references (currentNodeRef = weakref.ref(root)) can give a little extra benefit.
Example
import string
import sys
import time
import weakref
from copy import deepcopy
from multiprocessing import Pool
from itertools import combinations_with_replacement

class TrieNode:
    def __init__(self):
        self.isString = False
        self.children = {}

    def insertString(self, word, root):
        current_node = root
        for char in word:
            if char not in current_node.children:
                current_node.children[char] = TrieNode()
            current_node = current_node.children[char]
        current_node.isString = True

# findStrings: not a method of TrieNode anymore, and works with a reference to a node.
def findStrings(prefix, node_ref, results):
    # append the result when an end node is found
    if node_ref().isString:
        results.append(prefix)
    for char in node_ref().children:
        findStrings(prefix + char, weakref.ref(node_ref().children[char]), results)

# findSubStrings: not a method of TrieNode anymore, and works with a reference to a node.
def findSubStrings(start_prefix, node_ref, results=None):
    if results is None:
        results = []
    current_node_ref = node_ref
    for char in start_prefix:
        # stop the loop on missing prefixes or missing children
        if char not in current_node_ref().children:
            break
        # otherwise move on to the child node
        else:
            current_node_ref = weakref.ref(current_node_ref().children[char])
    # use findStrings recursively to locate the end nodes
    findStrings(start_prefix, current_node_ref, results)
    return results

def gen_word_list(num_words, min_word_len=4, max_word_len=10):
    wordList = []
    total_words = 0
    for long_word in combinations_with_replacement(string.ascii_lowercase, max_word_len):
        wordList.append(long_word)
        total_words += 1
        if total_words >= num_words:
            break
        for cut_length in range(1, max_word_len-min_word_len+1):
            wordList.append(long_word[:-cut_length])
            total_words += 1
            if total_words >= num_words:
                break
    return wordList

if __name__ == '__main__':
    # Sample word list
    wordList = gen_word_list(1.5 * 10**5)

    # Configs
    try:
        n_cores = int(sys.argv[-1] or 7)
    except ValueError:
        n_cores = 7

    # Repetitions to do in order to estimate the runtime of a single run
    real_num_repeats = 420
    simulated_num_repeats = real_num_repeats // n_cores

    # Creating the trie
    root = TrieNode()
    # Adding words
    for word in wordList:
        root.insertString(word, root)

    # Create tries for subprocesses:
    multiroot = [deepcopy(root) for _ in range(n_cores)]
    # NOTE: actually all subprocesses could use the same `root`, but let's copy them to simulate
    # that we are using different tries when splitting the job into sub-jobs

    # localFindSubStrings: defined after `multiroot`, so `multiroot` can be used as a "shared" variable
    def localFindSubStrings(start_prefix, root_index=None, results=None):
        if root_index is None:
            root_ref = weakref.ref(root)
        else:
            root_ref = weakref.ref(multiroot[root_index])
        return findSubStrings(start_prefix, root_ref, results)

    # Measure time
    print('Single process ...')
    single_proc_num_results = None
    t_0 = time.time()
    for i in range(real_num_repeats):
        iteration_results = localFindSubStrings('help')
        if single_proc_num_results is None:
            single_proc_num_results = len(iteration_results)
    single_proc_time = (time.time() - t_0)
    print(single_proc_time / real_num_repeats)

    # Loop to test the multicore solution
    # (fewer repetitions are done to compare the timings to the single-core solution)
    print('\nMultiprocess ...')
    p_init_time = 0
    apply_async_time = 0
    results_join_time = 0

    # Should processes be joined between repeats (simulate a single job on multiple cores)
    # or not (simulate multiple jobs running simultaneously)?
    PARALLEL_REPEATS = True

    if PARALLEL_REPEATS:
        t_0 = time.time()
        pool = Pool(processes=n_cores)
        t_1 = time.time()
        p_init_time += t_1 - t_0
        async_results = []

    final_results = []

    t_00 = time.time()
    for repeat_num in range(simulated_num_repeats):

        final_result = []
        final_results.append(final_result)

        if not PARALLEL_REPEATS:
            t_0 = time.time()
            pool = Pool(processes=n_cores)
            t_1 = time.time()
            p_init_time += t_1 - t_0
            async_results = []
        else:
            t_1 = time.time()

        async_results.append(
            (
                final_result,
                pool.starmap_async(
                    localFindSubStrings,
                    [('help', core_num) for core_num in range(n_cores)],
                )
            )
        )
        t_2 = time.time()
        apply_async_time += t_2 - t_1

        if not PARALLEL_REPEATS:
            for _, a_res in async_results:
                for result_part in a_res.get():
                    t_3 = time.time()
                    final_result.extend(result_part)
                    results_join_time += time.time() - t_3
            pool.close()
            pool.join()

    if PARALLEL_REPEATS:
        for final_result, a_res in async_results:
            for result_part in a_res.get():
                t_3 = time.time()
                final_result.extend(result_part)
                results_join_time += time.time() - t_3
        pool.close()
        pool.join()

    multi_proc_time = time.time() - t_00

    # Work is not really parallelized; instead it's just 'duplicated' over cores,
    # and so we divide using `real_num_repeats` (not `simulated_num_repeats`)
    print(multi_proc_time / real_num_repeats)

    init_overhead = p_init_time / single_proc_time
    apply_async_overhead = apply_async_time / single_proc_time
    results_join_percent = results_join_time / single_proc_time
    total_overhead = (multi_proc_time - single_proc_time) / single_proc_time
    print(f"Pool(...) overhead: {init_overhead:.1%}")
    print(f"pool.starmap_async(...) overhead: {apply_async_overhead:.1%}")
    print(f"Results join time percent: {results_join_percent:.1%}")
    print(f"Total overhead: {total_overhead:.1%}")

    for iteration_results in final_results:
        num_results = len(iteration_results) / n_cores
        if num_results != single_proc_num_results:
            raise AssertionError(f'length of results should not change! {num_results} != {single_proc_num_results}')
NOTES:
PARALLEL_REPEATS=True simulates running multiple jobs (for example, each job would be started for a different prefix, but in the example I use the same prefix to have a consistent "load" for each run), and each job is "parallelized" over all cores.
PARALLEL_REPEATS=False simulates running a single job parallelized over all cores, and it's slower than the single-process solution.
It seems that parallelism is only better when each worker in the pool is issued apply_async more than once.
Example output:
Single process ...
0.007109369550432477
Multiprocess ...
0.002928720201764788
Pool(...) overhead: 1.3%
pool.apply_async(...) overhead: 1.5%
Results join time percent: 1.8%
Total overhead: -58.8%
First of all I want to thank everyone who participated, as every answer contributed to the solution.
As the first comments pointed out, creating a new process every time leads to Python copying the needed data into the process. This can take a couple of seconds and leads to an undesired delay.
What brought the ultimate solution for me is creating the processes (one per core) with the Process class of the multiprocessing library once, during the startup of the program.
You can then communicate with the process using the Pipe class of the same module.
I found the ping-pong example here really helping: https://www.youtube.com/watch?v=s1SkCYMnfbY&t=900s
It is still not optimal, as multiple pipes trying to talk to the process at the same time cause the process to crash.
However, I should be able to solve this issue using queues. If someone is interested in the solution feel free to ask.
I'm performing analyses of time series of simulations. Basically, it's doing the same tasks for every time step. As there is a very high number of time steps, and as the analysis of each of them is independent, I wanted to create a function that can multiprocess another function. The latter will have arguments, and return a result.
Using a shared dictionary and the concurrent.futures library, I managed to write this:
import concurrent.futures as Cfut
import multiprocessing as mlp

def multiprocess_loop_grouped(function, param_list, group_size, Nworkers, *args):
    # function   : function that is run in parallel
    # param_list : list of items
    # group_size : size of the groups
    # Nworkers   : number of groups/items running at the same time
    # *args      : fixed parameters passed through to `function`
    # (`grouper` splits param_list into chunks of group_size; not shown here)
    manager = mlp.Manager()
    dic = manager.dict()
    executor = Cfut.ProcessPoolExecutor(Nworkers)
    futures = [executor.submit(function, param, dic, *args)
               for param in grouper(param_list, group_size)]
    Cfut.wait(futures)
    return [dic[i] for i in sorted(dic.keys())]
Typically, I can use it like this:
def read_file(files, dictionnary):
    for file in files:
        i = int(file[4:9])
        #print(str(i))
        if 'bz2' in file:
            os.system('bunzip2 ' + file)
            file = file[:-4]
        dictionnary[i] = np.loadtxt(file)
        os.system('bzip2 ' + file)

Map = np.array(multiprocess_loop_grouped(read_file, list_alti, Group_size, N_thread))
Or like this:
def autocorr(x):
    result = np.correlate(x, x, mode='full')
    return result[result.size//2:]

def find_lambda_finger(indexes, dic, Deviation):
    for i in indexes:
        #print(str(i))
        # Beach = Deviation[i,:] - np.mean(Deviation[i,:])
        dic[i] = Anls.find_first_max(autocorr(Deviation[i,:]), valmax=True)

args = [Deviation]
Temp = Rescal.multiprocess_loop_grouped(find_lambda_finger, range(Nalti), Group_size, N_thread, *args)
Basically, it is working. But it is not working well. Sometimes it crashes. Sometimes it actually launches a number of Python processes equal to Nworkers, and sometimes there are only 2 or 3 of them running at a time even though I specified Nworkers = 15.
For example, a classic error I obtain is described in the following topic I raised: Calling matplotlib AFTER multiprocessing sometimes results in error: main thread not in main loop
What is the most Pythonic way to achieve what I want? How can I improve the control of this function? How can I better control the number of running Python processes?
One of the basic concepts for Python multiprocessing is using queues. It works quite well when you have an input list that can be iterated and which does not need to be altered by the sub-processes. It also gives you good control over all the processes, because you spawn the number you want, and you can keep them idle or stop them.
It is also a lot easier to debug. Sharing data explicitly is usually an approach that is much more difficult to set up correctly.
Queues can hold almost anything (any picklable object), so you can fill them with file path strings for reading files, numbers for doing calculations, or even images for drawing.
In your case a layout could look like this:
import multiprocessing as mp
import numpy as np
import itertools as it

def worker1(in_queue, out_queue):
    # blocks when nothing is available, stops when 'STOP' is seen
    for a in iter(in_queue.get, 'STOP'):
        # do something
        out_queue.put({a: result})  # return your result linked to the input

def worker2(in_queue, out_queue):
    for a in iter(in_queue.get, 'STOP'):
        # do something differently
        out_queue.put({a: result})  # return your result linked to the input

def multiprocess_loop_grouped(function, param_list, group_size, Nworkers, *args):
    # your final result
    result = {}

    in_queue = mp.Queue()
    out_queue = mp.Queue()

    # fill your input
    for a in param_list:
        in_queue.put(a)
    # stop command at end of input
    for n in range(Nworkers):
        in_queue.put('STOP')

    # set up your worker processes doing the task as specified
    process = [mp.Process(target=function,
                          args=(in_queue, out_queue), daemon=True) for x in range(Nworkers)]

    # run processes
    for p in process:
        p.start()

    # wait for processes to finish
    for p in process:
        p.join()

    # collect your results from the calculations
    for a in param_list:
        result.update(out_queue.get())

    return result

temp = multiprocess_loop_grouped(worker1, param_list, group_size, Nworkers, *args)
map = multiprocess_loop_grouped(worker2, param_list, group_size, Nworkers, *args)
It can be made a bit more dynamic if you are afraid that your queues will run out of memory. Then you need to fill and empty the queues while the processes are running. See this example here.
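A rough sketch of that more dynamic variant (my assumptions: a feeder thread in the parent and a bounded queue; the squaring is a placeholder for real work):
import multiprocessing as mp
import threading

def worker(in_queue, out_queue):
    for a in iter(in_queue.get, 'STOP'):
        out_queue.put(a * a)               # placeholder computation

def feeder(in_queue, param_list, n_workers):
    # blocks whenever the bounded queue is full, so memory use stays capped
    for a in param_list:
        in_queue.put(a)
    for _ in range(n_workers):
        in_queue.put('STOP')

if __name__ == '__main__':
    n_workers = 4
    n_items = 10000
    in_queue = mp.Queue(maxsize=100)       # bounded: holds at most 100 pending items
    out_queue = mp.Queue()

    threading.Thread(target=feeder,
                     args=(in_queue, range(n_items), n_workers)).start()
    procs = [mp.Process(target=worker, args=(in_queue, out_queue))
             for _ in range(n_workers)]
    for p in procs:
        p.start()

    results = [out_queue.get() for _ in range(n_items)]   # drain while workers run
    for p in procs:
        p.join()
    print(len(results))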
Final words: it is not more Pythonic as you requested. But it is easier to understand for a newbie ;-)
I am testing the parallel capabilities of Python 3, which I intend to use in my code. I observe unexpectedly slow behaviour, so I boiled my code down to the following proof of principle. Let's calculate a simple logarithmic series. Let's do it serially, and in parallel using 1 core. One would imagine that the timing for these two examples would be the same, except for a small overhead associated with initializing and closing the multiprocessing.Pool class. However, what I observe is that the overhead grows linearly with problem size, and thus the parallel solution on 1 core is significantly worse relative to the serial solution even for large inputs. Please tell me if I am doing something wrong.
import time
import numpy as np
import multiprocessing
import matplotlib.pyplot as plt

def foo(x):
    return sum([np.log(1 + i*x) for i in range(10)])

def serial_series(rangeMax):
    return [foo(x) for x in range(rangeMax)]

def parallel_series_1core(rangeMax):
    pool = multiprocessing.Pool(processes=1)
    rez = pool.map(foo, tuple(range(rangeMax)))
    pool.terminate()
    pool.join()
    return rez

nTask = [1 + i ** 2 * 1000 for i in range(1, 2)]
nTimeSerial = []
nTimeParallel = []
for taskSize in nTask:
    print('TaskSize', taskSize)

    start = time.time()
    rez = serial_series(taskSize)
    end = time.time()
    nTimeSerial.append(end - start)

    start = time.time()
    rez = parallel_series_1core(taskSize)
    end = time.time()
    nTimeParallel.append(end - start)

plt.plot(nTask, nTimeSerial)
plt.plot(nTask, nTimeParallel)
plt.legend(['serial', 'parallel 1 core'])
plt.show()
Edit:
It was commented that the overhead may be due to creating multiple jobs. Here is a modification of the parallel function that should explicitly only make 1 job. I still observe linear growth of the overhead.
def parallel_series_1core(rangeMax):
    pool = multiprocessing.Pool(processes=1)
    rez = pool.map(serial_series, [rangeMax])
    pool.terminate()
    pool.join()
    return rez
Edit 2: Once more, the exact code that produces linear growth. It can be verified with a print statement inside the serial_series function that it is only called once for each call of parallel_series_1core.
import time
import numpy as np
import multiprocessing
import matplotlib.pyplot as plt

def foo(x):
    return sum([np.log(1 + i*x) for i in range(10)])

def serial_series(rangeMax):
    return [foo(i) for i in range(rangeMax)]

def parallel_series_1core(rangeMax):
    pool = multiprocessing.Pool(processes=1)
    rez = pool.map(serial_series, [rangeMax])
    pool.terminate()
    pool.join()
    return rez

nTask = [1 + i ** 2 * 1000 for i in range(1, 20)]
nTimeSerial = []
nTimeParallel = []
for taskSize in nTask:
    print('TaskSize', taskSize)

    start = time.time()
    rez1 = serial_series(taskSize)
    end = time.time()
    nTimeSerial.append(end - start)

    start = time.time()
    rez2 = parallel_series_1core(taskSize)
    end = time.time()
    nTimeParallel.append(end - start)

plt.plot(nTask, nTimeSerial)
plt.plot(nTask, nTimeParallel)
plt.plot(nTask, [i / j for i, j in zip(nTimeParallel, nTimeSerial)])
plt.legend(['serial', 'parallel 1 core', 'ratio'])
plt.show()
When you use Pool.map() you're essentially telling it to split the passed iterable into jobs over all available sub-processes (which is one in your case) - the larger the iterable, the more 'jobs' are created on the first call. That's what initially adds a huge (trumped only by the process creation itself), albeit linear, overhead.
Since sub-processes do not share memory, all changing data on POSIX systems (due to forking), and all data (even static data) on Windows, needs to be pickled on one end and unpickled on the other. Plus it needs time to clear out the process stack for the next job, and there is an overhead in system thread switching (which is out of your control - you'd have to mess with the system's scheduler to reduce it).
For simple/quick tasks a single process will always trump multiprocessing.
UPDATE - As I was saying above, the additional overhead comes from the fact that for any data exchange between processes Python transparently does a pickling/unpickling routine. Since the list you return from the serial_series() function grows in size over time, so does the performance penalty for pickling/unpickling. Here's a simple demonstration of it based on your code:
import math
import pickle
import sys
import time

# multi-platform precision timer
get_timer = time.clock if sys.platform == "win32" else time.time

def foo(x):  # logic/computation function
    return sum([math.log(1 + i*x) for i in range(10)])

def serial_series(max_range):  # main sub-process function
    return [foo(i) for i in range(max_range)]

def serial_series_slave(max_range):  # subprocess interface
    return pickle.dumps(serial_series(pickle.loads(max_range)))

def serial_series_master(max_range):  # main process interface
    return pickle.loads(serial_series_slave(pickle.dumps(max_range)))

tasks = [1 + i ** 2 * 1000 for i in range(1, 20)]
simulated_times = []
for task in tasks:
    print("Simulated task size: {}".format(task))
    start = get_timer()
    res = serial_series_master(task)
    simulated_times.append((task, get_timer() - start))
At the end, simulated_times will contain something like:
[(1001, 0.010015994115533963), (4001, 0.03402641167313844), (9001, 0.06755546622419131),
(16001, 0.1252664260421834), (25001, 0.18815836740279515), (36001, 0.28339434475444325),
(49001, 0.3757235840503601), (64001, 0.4813749807557435), (81001, 0.6115452710446636),
(100001, 0.7573718332506543), (121001, 0.9228750064147522), (144001, 1.0909038813527427),
(169001, 1.3017281342479343), (196001, 1.4830192955746764), (225001, 1.7117389965616931),
(256001, 1.9392146632682739), (289001, 2.19192682050668), (324001, 2.4497541011649187),
(361001, 2.7481495578097466)]
showing clear greater-than-linear processing time increase as the list grows bigger. This is what essentially happens with multiprocessing - if your sub-process function didn't return anything it would end up considerably faster.
If you have a large amount of data you need to share among processes, I'd suggest you use some in-memory database (like Redis) and have your sub-processes connect to it to store/retrieve data.
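A minimal sketch of that idea, assuming a Redis server on localhost and the third-party redis package (the key names and the computation are illustrative, not from the question):
import pickle
import multiprocessing
import redis

def compute(task_id):
    r = redis.Redis(host="localhost", port=6379)        # each worker opens its own connection
    result = [i * i for i in range(task_id)]            # placeholder computation
    r.set("result:%d" % task_id, pickle.dumps(result))  # store the bulky data in Redis
    return task_id                                       # only a tiny value travels back through the pool

if __name__ == "__main__":
    with multiprocessing.Pool(processes=4) as pool:
        done = pool.map(compute, range(1, 6))
    r = redis.Redis(host="localhost", port=6379)
    for task_id in done:
        data = pickle.loads(r.get("result:%d" % task_id))
        print(task_id, len(data))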