Overhead of python multiprocessing initialization is worse than benefits - python

I want to use a trie search with Python 3.7 to match a string against a set of given words.
The trie search algorithm itself is quite fast, but I also want to use all the cores my CPU has. Let's assume my PC has 8 cores and I want to use 7 of them.
So I split my word database into 7 equally sized lists and built a trie from each one. (That's the basic idea for parallelizing the code.)
However, when I call Process() from the multiprocessing module, the Process().start() method can take a couple of seconds on the real database (the search itself takes only microseconds).
To be honest, I'm not yet a professional programmer, so I have probably made some major mistake in the code. Can anyone see why starting the process is so slow?
Please note that I tested the script with a much bigger database than the trie below. I also tested the script starting only 1 process each time, and that was also significantly slower.
I wanted to provide less code, but I think it's useful to be able to run the problem. I can also provide additional info if needed.
import string
import sys
import time
from multiprocessing import Process, Manager
from itertools import combinations_with_replacement


class TrieNode:
    def __init__(self):
        self.isString = False
        self.children = {}

    def insertString(self, word, root):
        currentNode = root
        for char in word:
            if char not in currentNode.children:
                currentNode.children[char] = TrieNode()
            currentNode = currentNode.children[char]
        currentNode.isString = True

    def findStrings(self, prefix, node, results):
        # Append the result when an end node is found
        if node.isString:
            results.append(prefix)
        for char in node.children:
            self.findStrings(prefix + char, node.children[char], results)

    def findSubStrings(self, start_prefix, root, results):
        currentNode = root
        for char in start_prefix:
            # Stop the loop when the prefix or its children are missing
            if char not in currentNode.children:
                break
            # Otherwise move on to the child node
            else:
                currentNode = currentNode.children[char]
        # Use findStrings recursively to find the end nodes
        self.findStrings(start_prefix, currentNode, results)
        return results


def gen_word_list(num_words, min_word_len=4, max_word_len=10):
    wordList = []
    total_words = 0
    for long_word in combinations_with_replacement(string.ascii_lowercase, max_word_len):
        wordList.append(long_word)
        total_words += 1
        if total_words >= num_words:
            break
        for cut_length in range(1, max_word_len-min_word_len+1):
            wordList.append(long_word[:-cut_length])
            total_words += 1
            if total_words >= num_words:
                break
    return wordList


if __name__ == '__main__':
    # Sample word list
    wordList = gen_word_list(1.5 * 10**5)

    # Configs
    try:
        n_cores = int(sys.argv[-1] or 7)
    except ValueError:
        n_cores = 7

    # Repetitions to do in order to estimate the runtime of a single run
    num_repeats = 20
    real_num_repeats = n_cores * num_repeats

    # Creating Trie
    root = TrieNode()

    # Adding words
    for word in wordList:
        root.insertString(word, root)

    # Extending trie to use it on multiple cores at once
    multiroot = [root] * n_cores

    # Measure time
    print('Single process ...')
    t_0 = time.time()
    for i in range(real_num_repeats):
        r = []
        root.findSubStrings('he', root, r)
    single_proc_time = (time.time()-t_0)
    print(single_proc_time/real_num_repeats)

    # using multicore to speed up the process
    man = Manager()

    # Loop to test the multicore Solution
    # (Less repetitions are done to compare the timings to the single-core solution)
    print('\nMultiprocess ...')
    t_00 = time.time()
    p_init_time = 0
    procs_append_time = 0
    p_start_time = 0
    for i in range(num_repeats):
        # Create Share-able list
        res = man.list()
        procs = []
        for i in range(n_cores):
            t_0 = time.time()
            p = Process(target=multiroot[i].findSubStrings, args=('a', multiroot[i], res))
            t_1 = time.time()
            p_init_time += t_1 - t_0
            procs.append(p)
            t_2 = time.time()
            procs_append_time += t_2 - t_1
            p.start()
            p_start_time += time.time() - t_2
        for p in procs:
            p.join()
    multi_proc_time = time.time() - t_00
    print(multi_proc_time / real_num_repeats)

    init_overhead = p_init_time / single_proc_time
    append_overhead = procs_append_time / single_proc_time
    start_overhead = p_start_time / single_proc_time
    total_overhead = (multi_proc_time - single_proc_time) / single_proc_time
    print(f"Process(...) overhead: {init_overhead:.1%}")
    print(f"procs.append(p) overhead: {append_overhead:.1%}")
    print(f"p.start() overhead: {start_overhead:.1%}")
    print(f"Total overhead: {total_overhead:.1%}")
Single process ...
0.007229958261762347
Multiprocess ...
0.7615800397736686
Process(...) overhead: 0.9%
procs.append(p) overhead: 0.0%
p.start() overhead: 8.2%
Total overhead: 10573.8%

General idea
There are many things to consider, and most of them are already described in Multiprocessing > Programming guidelines. The most important thing to remember is that you are actually working with multiple processes, so there are 3 (or 4) ways in which variables are handled:
1. Synchronized wrappers over ctypes shared-state variables (like multiprocessing.Value). The actual variable is always "one object" in memory, and by default the wrapper uses locking to set/get the real value.
2. Proxies (like Manager().list()). These variables are similar to shared-state variables, but they live in a special "server process", and every operation on them sends pickled values between the manager process and the active process:
   - results.append(x) pickles x in the active process that makes the call and sends it to the manager process, where it is unpickled.
   - Any other access to results (like len(results) or iterating over results) involves the same pickling/sending/unpickling round trip.
   So proxies are generally much slower than any other approach for ordinary variables, and in many cases using a manager for "local" parallelization gives worse performance than even a single-process run. But a manager server can be used remotely, so it is reasonable to use one when you want to parallelize the work across workers distributed over multiple machines.
3. Objects available when the subprocess is created. With the "fork" start method, all objects that exist when the subprocess is created are still available in it and are "not shared", so changing them only changes them locally for that subprocess. But until they are changed, the processes really do share the memory of each such object, so:
   - If they are used read-only, nothing is copied or communicated.
   - If they are changed, they are copied inside the subprocess and the copy is changed. This is called copy-on-write (COW). Please note that making a new reference to an object (e.g. assigning a variable to reference it, or appending it to a list) increases the object's reference count, and that counts as "a change".
Behavior may also vary depending on the start method: e.g. with the "spawn"/"forkserver" methods, mutable global variables are not really "the same objects", and the value seen by a subprocess may differ from the one in the parent process.
So the initial values of multiroot[i] (used in Process(target=..., args=(..., multiroot[i], ...))) are shared, but:
- If you are not using the 'fork' start method (and by default Windows is not using it), then all the args are pickled at least once for each subprocess. So start may take a long time if multiroot[i].children is huge.
- Even if you are using fork: initially multiroot[i] seems to be shared and not copied, but I'm not sure what happens when variables are assigned inside the findSubStrings method (e.g. currentNode = ...). Maybe that causes copy-on-write (COW), so that the whole TrieNode instance is copied.
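To get a feeling for the pickling cost mentioned above without starting any process, one can pickle a trie by hand and time it. This is only a rough sketch: the TrieNode class is a minimal stand-in for the one in the question, and build_trie and pickled_size are hypothetical helper names introduced for illustration. With a non-fork start method, each Process(...).start() pays at least this serialization cost for its arguments.

```python
import multiprocessing as mp
import pickle
import time


class TrieNode:
    """Minimal stand-in for the TrieNode class from the question."""

    def __init__(self):
        self.isString = False
        self.children = {}


def build_trie(words):
    """Hypothetical helper: build a trie from an iterable of words."""
    root = TrieNode()
    for word in words:
        node = root
        for char in word:
            node = node.children.setdefault(char, TrieNode())
        node.isString = True
    return root


def pickled_size(obj):
    """Return (number of bytes, seconds) needed to pickle obj once."""
    start = time.perf_counter()
    payload = pickle.dumps(obj)
    return len(payload), time.perf_counter() - start


if __name__ == "__main__":
    # Shows which start method this platform uses by default.
    print("start method:", mp.get_start_method())
    for n in (100, 1000, 10000):
        words = [format(i, "06d") for i in range(n)]
        size, seconds = pickled_size(build_trie(words))
        print(f"{n} words: {size} bytes, {seconds:.4f} s to pickle")
```

The pickled size grows with the number of nodes, which is consistent with start() getting slower as the database gets bigger.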
What can be done to improve the situation:
- If you are using the fork start method, make sure that the "database" objects (TrieNode instances) are truly read-only and don't even have methods with variable assignments in them. For example, you can move findSubStrings to another class, and make sure to make all the instance.insertString calls before starting subprocesses.
- You are using a man.list() instance as the results argument to findSubStrings. This means that a different "wrapper" is created for each subprocess, and every results.append(prefix) call pickles prefix and then sends it to the server process. If you are using a Pool with a limited number of processes, that's not a big deal. If you are spawning a huge number of subprocesses, it might affect performance. I also think that by default they all use locking, so concurrent appends might be relatively slow. If the order of items in results does not matter (I'm not experienced with prefix trees and don't remember the theory behind them), then you can avoid any overhead related to concurrent results.append entirely:
  - Create a new results list inside the findSubStrings method. Don't use res = man.list() at all.
  - To get the "final" results: iterate over every result object returned by pool.apply_async(), get the results, and merge them.
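The "return a local list and merge" idea can be sketched roughly as follows. The shard data, find_in_shard and merge are hypothetical names for illustration, and a plain startswith filter stands in for the trie lookup; the full example further down uses the real trie.

```python
from itertools import chain
from multiprocessing import Pool

# Hypothetical stand-in data: one word list per worker.
WORD_SHARDS = [
    ["help", "helper", "hello"],
    ["helmet", "heap", "world"],
]


def find_in_shard(prefix, shard_index):
    """Search one shard and return a plain local list (no Manager proxy)."""
    return [w for w in WORD_SHARDS[shard_index] if w.startswith(prefix)]


def merge(parts):
    """Flatten the per-worker lists into one result list."""
    return list(chain.from_iterable(parts))


if __name__ == "__main__":
    with Pool(processes=len(WORD_SHARDS)) as pool:
        parts = pool.starmap(
            find_in_shard,
            [("hel", index) for index in range(len(WORD_SHARDS))],
        )
    print(merge(parts))
```

Each worker pickles its result list once when it returns, instead of once per append as with a manager list.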
Using weak references
Using currentNode = root in findSubStrings will result in COW of root. That's why weak references (currentNodeRef = weakref.ref(root)) can give a little extra benefit.
Example
import string
import sys
import time
import weakref
from copy import deepcopy
from multiprocessing import Pool
from itertools import combinations_with_replacement


class TrieNode:
    def __init__(self):
        self.isString = False
        self.children = {}

    def insertString(self, word, root):
        current_node = root
        for char in word:
            if char not in current_node.children:
                current_node.children[char] = TrieNode()
            current_node = current_node.children[char]
        current_node.isString = True


# findStrings: not a method of TrieNode anymore, and works with reference to node.
def findStrings(prefix, node_ref, results):
    # Append the result when an end node is found
    if node_ref().isString:
        results.append(prefix)
    for char in node_ref().children:
        findStrings(prefix + char, weakref.ref(node_ref().children[char]), results)


# findSubStrings: not a method of TrieNode anymore, and works with reference to node.
def findSubStrings(start_prefix, node_ref, results=None):
    if results is None:
        results = []
    current_node_ref = node_ref
    for char in start_prefix:
        # Stop the loop when the prefix or its children are missing
        if char not in current_node_ref().children:
            break
        # Otherwise move on to the child node
        else:
            current_node_ref = weakref.ref(current_node_ref().children[char])
    # Use findStrings recursively to find the end nodes
    findStrings(start_prefix, current_node_ref, results)
    return results


def gen_word_list(num_words, min_word_len=4, max_word_len=10):
    wordList = []
    total_words = 0
    for long_word in combinations_with_replacement(string.ascii_lowercase, max_word_len):
        wordList.append(long_word)
        total_words += 1
        if total_words >= num_words:
            break
        for cut_length in range(1, max_word_len-min_word_len+1):
            wordList.append(long_word[:-cut_length])
            total_words += 1
            if total_words >= num_words:
                break
    return wordList


if __name__ == '__main__':
    # Sample word list
    wordList = gen_word_list(1.5 * 10**5)

    # Configs
    try:
        n_cores = int(sys.argv[-1] or 7)
    except ValueError:
        n_cores = 7

    # Repetitions to do in order to estimate the runtime of a single run
    real_num_repeats = 420
    simulated_num_repeats = real_num_repeats // n_cores

    # Creating Trie
    root = TrieNode()

    # Adding words
    for word in wordList:
        root.insertString(word, root)

    # Create tries for subprocesses:
    multiroot = [deepcopy(root) for _ in range(n_cores)]
    # NOTE: actually all subprocesses can use the same `root`, but let's copy them to simulate
    # that we are using different tries when splitting job to sub-jobs

    # localFindSubStrings: defined after `multiroot`, so `multiroot` can be used as "shared" variable
    def localFindSubStrings(start_prefix, root_index=None, results=None):
        if root_index is None:
            root_ref = weakref.ref(root)
        else:
            root_ref = weakref.ref(multiroot[root_index])
        return findSubStrings(start_prefix, root_ref, results)

    # Measure time
    print('Single process ...')
    single_proc_num_results = None
    t_0 = time.time()
    for i in range(real_num_repeats):
        iteration_results = localFindSubStrings('help')
        if single_proc_num_results is None:
            single_proc_num_results = len(iteration_results)
    single_proc_time = (time.time()-t_0)
    print(single_proc_time/real_num_repeats)

    # Loop to test the multicore Solution
    # (Less repetitions are done to compare the timings to the single-core solution)
    print('\nMultiprocess ...')
    p_init_time = 0
    apply_async_time = 0
    results_join_time = 0

    # Should processes be joined between repeats (simulate single job on multiple cores)
    # or not (simulate multiple jobs running simultaneously)
    PARALLEL_REPEATS = True

    if PARALLEL_REPEATS:
        t_0 = time.time()
        pool = Pool(processes=n_cores)
        t_1 = time.time()
        p_init_time += t_1 - t_0
        async_results = []

    final_results = []
    t_00 = time.time()

    for repeat_num in range(simulated_num_repeats):
        final_result = []
        final_results.append(final_result)

        if not PARALLEL_REPEATS:
            t_0 = time.time()
            pool = Pool(processes=n_cores)
            t_1 = time.time()
            p_init_time += t_1 - t_0
            async_results = []
        else:
            t_1 = time.time()

        async_results.append(
            (
                final_result,
                pool.starmap_async(
                    localFindSubStrings,
                    [('help', core_num) for core_num in range(n_cores)],
                )
            )
        )
        t_2 = time.time()
        apply_async_time += t_2 - t_1

        if not PARALLEL_REPEATS:
            for _, a_res in async_results:
                for result_part in a_res.get():
                    t_3 = time.time()
                    final_result.extend(result_part)
                    results_join_time += time.time() - t_3
            pool.close()
            pool.join()

    if PARALLEL_REPEATS:
        for final_result, a_res in async_results:
            for result_part in a_res.get():
                t_3 = time.time()
                final_result.extend(result_part)
                results_join_time += time.time() - t_3
        pool.close()
        pool.join()

    multi_proc_time = time.time() - t_00

    # Work is not really parallelized, instead it's just 'duplicated' over cores,
    # and so we divide using `real_num_repeats` (not `simulated_num_repeats`)
    print(multi_proc_time / real_num_repeats)

    init_overhead = p_init_time / single_proc_time
    apply_async_overhead = apply_async_time / single_proc_time
    results_join_percent = results_join_time / single_proc_time
    total_overhead = (multi_proc_time - single_proc_time) / single_proc_time
    print(f"Pool(...) overhead: {init_overhead:.1%}")
    print(f"pool.starmap_async(...) overhead: {apply_async_overhead:.1%}")
    print(f"Results join time percent: {results_join_percent:.1%}")
    print(f"Total overhead: {total_overhead:.1%}")

    for iteration_results in final_results:
        num_results = len(iteration_results) / n_cores
        if num_results != single_proc_num_results:
            raise AssertionError(f'length of results should not change! {num_results} != {single_proc_num_results}')
NOTES:
PARALLEL_REPEATS=True simulates running multiple jobs (for example, each job would be started for a different prefix, but in the example I use the same prefix to have a consistent "load" for each run), with each job "parallelized" over all cores.
PARALLEL_REPEATS=False simulates running a single job parallelized over all cores, and it is slower than the single-process solution.
It seems that parallelism only pays off when each worker in the pool is issued apply_async more than once.
Example output:
Single process ...
0.007109369550432477
Multiprocess ...
0.002928720201764788
Pool(...) overhead: 1.3%
pool.apply_async(...) overhead: 1.5%
Results join time percent: 1.8%
Total overhead: -58.8%

First, I want to thank everyone who participated, as every answer contributed to the solution.
As the first comments pointed out, creating a new process every time means that Python has to move the needed data into that process. This can take a couple of seconds and leads to an undesired delay.
What finally solved it for me was creating the processes (one per core) only once, using the Process class of the multiprocessing library, during the startup of the program.
You can then communicate with each process using the Pipe class of the same module.
I found the ping-pong example here really helpful: https://www.youtube.com/watch?v=s1SkCYMnfbY&t=900s
It is still not optimal, as multiple pipes trying to talk to the process at the same time cause the process to crash.
However, I should be able to solve this issue using queues. If someone is interested in the solution, feel free to ask.
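For readers wondering what the queue-based variant might look like, here is a minimal sketch, not the actual solution from this post. It assumes one long-lived worker per shard that receives its word list once at start-up and then answers prefix queries sent over a multiprocessing.Queue. The names search, worker and STOP are hypothetical, and a plain startswith filter stands in for the trie.

```python
import multiprocessing as mp

STOP = None  # sentinel that tells a worker to exit


def search(words, prefix):
    """Hypothetical stand-in for the trie lookup: plain prefix filter."""
    return [w for w in words if w.startswith(prefix)]


def worker(words, tasks, results):
    # `words` is transferred only once, when the process starts.
    for prefix in iter(tasks.get, STOP):
        results.put(search(words, prefix))


if __name__ == "__main__":
    shards = [["help", "hello"], ["helmet", "world"]]
    results = mp.Queue()
    task_queues = [mp.Queue() for _ in shards]
    procs = [
        mp.Process(target=worker, args=(shard, tasks, results), daemon=True)
        for shard, tasks in zip(shards, task_queues)
    ]
    for p in procs:
        p.start()

    # Many queries reuse the same processes, so start-up cost is paid once.
    for prefix in ("hel", "wo"):
        for tasks in task_queues:
            tasks.put(prefix)
        found = []
        for _ in procs:
            found.extend(results.get(timeout=30))
        print(prefix, sorted(found))

    for tasks in task_queues:
        tasks.put(STOP)
    for p in procs:
        p.join()
```

Each query now only costs pickling a short prefix and the matching words, not the whole database.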

Related

Function that multiprocesses another function

I'm performing analyses of time series from simulations. Basically, the same tasks are done for every time step. As there is a very high number of time steps, and as the analysis of each of them is independent, I wanted to create a function that can multiprocess another function. The latter will take arguments and return a result.
Using a shared dictionary and the library concurrent.futures, I managed to write this:
import concurrent.futures as Cfut
import multiprocessing as mlp


def multiprocess_loop_grouped(function, param_list, group_size, Nworkers, *args):
    # function : function that is running in parallel
    # param_list : list of items
    # group_size : size of the groups
    # Nworkers : number of group/items running in the same time
    # **param_fixed : passing parameters
    manager = mlp.Manager()
    dic = manager.dict()
    executor = Cfut.ProcessPoolExecutor(Nworkers)
    futures = [executor.submit(function, param, dic, *args)
               for param in grouper(param_list, group_size)]
    Cfut.wait(futures)
    return [dic[i] for i in sorted(dic.keys())]
Typically, I can use it like this :
def read_file(files, dictionnary):
    for file in files:
        i = int(file[4:9])
        #print(str(i))
        if 'bz2' in file:
            os.system('bunzip2 ' + file)
            file = file[:-4]
        dictionnary[i] = np.loadtxt(file)
        os.system('bzip2 ' + file)

Map = np.array(multiprocess_loop_grouped(read_file, list_alti, Group_size, N_thread))
or like this :
def autocorr(x):
    result = np.correlate(x, x, mode='full')
    return result[result.size//2:]

def find_lambda_finger(indexes, dic, Deviation):
    for i in indexes :
        #print(str(i))
        # Beach = Deviation[i,:] - np.mean(Deviation[i,:])
        dic[i] = Anls.find_first_max(autocorr(Deviation[i,:]), valmax = True)

args = [Deviation]
Temp = Rescal.multiprocess_loop_grouped(find_lambda_finger, range(Nalti), Group_size, N_thread, *args)
Basically, it works. But it does not work well. Sometimes it crashes. Sometimes it actually launches a number of Python processes equal to Nworkers, and sometimes only 2 or 3 of them are running at a time even though I specified Nworkers = 15.
For example, a typical error I get is described in the following topic I raised: Calling matplotlib AFTER multiprocessing sometimes results in error : main thread not in main loop
What is the more Pythonic way to achieve what I want? How can I improve my control over this function? How can I better control the number of running Python processes?
One of the basic concepts of Python multiprocessing is using queues. It works quite well when you have an input list that can be iterated and that does not need to be altered by the sub-processes. It also gives you good control over all the processes, because you spawn the number you want, and you can let them idle or stop them.
It is also a lot easier to debug. Sharing data explicitly is usually an approach that is much more difficult to set up correctly.
Queues can hold any picklable object. So you can fill them with file path strings for reading files, plain numbers for doing calculations, or even images for drawing.
In your case a layout could look like this:
import multiprocessing as mp
import numpy as np
import itertools as it


def worker1(in_queue, out_queue):
    # blocks while nothing is available, stops when 'STOP' is seen
    for a in iter(in_queue.get, 'STOP'):
        result = None  # placeholder: do something with `a` here
        out_queue.put({a: result})  # return your result linked to the input


def worker2(in_queue, out_queue):
    for a in iter(in_queue.get, 'STOP'):
        result = None  # placeholder: do something different with `a` here
        out_queue.put({a: result})  # return your result linked to the input


def multiprocess_loop_grouped(function, param_list, group_size, Nworkers, *args):
    # your final result
    result = {}

    in_queue = mp.Queue()
    out_queue = mp.Queue()

    # fill your input
    for a in param_list:
        in_queue.put(a)
    # stop command at end of input
    for n in range(Nworkers):
        in_queue.put('STOP')

    # setup your worker process doing task as specified
    process = [mp.Process(target=function,
                          args=(in_queue, out_queue), daemon=True) for x in range(Nworkers)]

    # run processes
    for p in process:
        p.start()

    # collect your results from the calculations
    # (done before join(): a process that has put items on a queue may not
    # exit until they have been consumed, so joining first can deadlock)
    for a in param_list:
        result.update(out_queue.get())

    # wait for processes to finish
    for p in process:
        p.join()

    return result


temp = multiprocess_loop_grouped(worker1, param_list, group_size, Nworkers, *args)
map = multiprocess_loop_grouped(worker2, param_list, group_size, Nworkers, *args)
It can be made a bit more dynamic if you are afraid that your queues will run out of memory. Then you need to fill and empty the queues while the processes are running. See this example here.
Final words: it is not more Pythonic as you requested. But it is easier to understand for a newbie ;-)

Python Threading inconsistent execution time

I'm using the threading library to speed up calculating each point's neighborhood in a point cloud, by calling the function CalculateAllPointsNeighbors shown at the bottom of the post.
The function receives a search radius, a maximum number of neighbors and a number of threads to split the work across. No changes are made to any of the points, and each point stores data in its own np.ndarray cell, accessed by its own index.
The following function times how long it takes N threads to finish calculating all the points' neighborhoods:
def TimeFuncThreads(classObj, uptothreads):
    listTimers = []
    startNum = 1
    EndNum = uptothreads + 1
    for i in range(startNum, EndNum):
        print("Current Number of Threads to Test: ", i)
        tempT = time.time()
        classObj.CalculateAllPointsNeighbors(searchRadius=0.05, maxNN=25, maxThreads=i)
        tempT = time.time() - tempT
        listTimers.append(tempT)
    PlotXY(np.arange(startNum, EndNum), listTimers)
The problem is that I've been getting very different results on each run. Here are the plots from 5 subsequent runs of the function TimeFuncThreads. The X axis is the number of threads, Y is the runtime. First, they look totally random. Second, there is no significant speed-up.
Now I'm confused: am I using the threading library wrong, and what explains the behavior I'm getting?
The function that handles the threading and the function that is being called from each thread:
def CalculateAllPointsNeighbors(self, searchRadius=0.20, maxNN=50, maxThreads=8):
    threadsList = []
    pointsIndices = np.arange(self.numberOfPoints)
    splitIndices = np.array_split(pointsIndices, maxThreads)
    for i in range(maxThreads):
        threadsList.append(threading.Thread(target=self.GetPointsNeighborsByID,
                                            args=(splitIndices[i], searchRadius, maxNN)))
    [t.start() for t in threadsList]
    [t.join() for t in threadsList]

def GetPointsNeighborsByID(self, idx, searchRadius=0.05, maxNN=20):
    if isinstance(idx, int):
        idx = [idx]
    for currentPointIndex in idx:
        currentPoint = self.pointsOpen3D.points[currentPointIndex]
        pointNeighborhoodObject = self.GetPointNeighborsByCoordinates(currentPoint, searchRadius, maxNN)
        self.pointsNeighborsArray[currentPointIndex] = pointNeighborhoodObject
        self.__RotatePointNeighborhood(currentPointIndex)
It pains me to be the one to introduce you to the Python GIL (Global Interpreter Lock). It is a very nice feature that makes parallelism using threads in Python a nightmare.
If you really want to improve your code's speed, you should be looking at the multiprocessing module.
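As a rough illustration of that suggestion, the index-splitting scheme from the question could be moved from threads to processes along these lines. This is a sketch under assumptions: process_chunk is a hypothetical CPU-bound stand-in for the per-point neighbor search, split_indices is a made-up helper that mimics what np.array_split does for the index array, and results are returned to the parent instead of being written into a shared array, because processes do not share memory.

```python
from concurrent.futures import ProcessPoolExecutor


def split_indices(n_items, n_parts):
    """Split range(n_items) into n_parts nearly equal contiguous chunks."""
    base, extra = divmod(n_items, n_parts)
    chunks, start = [], 0
    for part in range(n_parts):
        stop = start + base + (1 if part < extra else 0)
        chunks.append(range(start, stop))
        start = stop
    return chunks


def process_chunk(chunk):
    """Hypothetical CPU-bound stand-in for the per-point neighbor search."""
    return [(i, sum(j * j for j in range(i % 100))) for i in chunk]


if __name__ == "__main__":
    chunks = split_indices(1000, 4)
    with ProcessPoolExecutor(max_workers=4) as executor:
        # Each chunk is computed in a separate process with its own GIL.
        results = [pair for part in executor.map(process_chunk, chunks)
                   for pair in part]
    print(len(results))
```

Whether this beats the threaded version depends on how much data has to be pickled between the processes, so it is worth timing on the real point cloud.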

Python 3 multiprocessing on 1 core gives overhead that grows with workload

I am testing the parallel capabilities of Python 3, which I intend to use in my code. I observed unexpectedly slow behaviour, so I boiled my code down to the following proof of principle. Let's calculate a simple logarithmic series, once serially and once in parallel using 1 core. One would imagine that the timing for these two examples would be the same, except for a small overhead associated with initializing and closing the multiprocessing.Pool class. However, what I observe is that the overhead grows linearly with problem size, and thus the parallel solution on 1 core is significantly worse relative to the serial solution even for large inputs. Please tell me if I am doing something wrong.
import time
import numpy as np
import multiprocessing
import matplotlib.pyplot as plt

def foo(x):
    return sum([np.log(1 + i*x) for i in range(10)])

def serial_series(rangeMax):
    return [foo(x) for x in range(rangeMax)]

def parallel_series_1core(rangeMax):
    pool = multiprocessing.Pool(processes=1)
    rez = pool.map(foo, tuple(range(rangeMax)))
    pool.terminate()
    pool.join()
    return rez

nTask = [1 + i ** 2 * 1000 for i in range(1, 2)]
nTimeSerial = []
nTimeParallel = []

for taskSize in nTask:
    print('TaskSize', taskSize)
    start = time.time()
    rez = serial_series(taskSize)
    end = time.time()
    nTimeSerial.append(end - start)
    start = time.time()
    rez = parallel_series_1core(taskSize)
    end = time.time()
    nTimeParallel.append(end - start)

plt.plot(nTask, nTimeSerial)
plt.plot(nTask, nTimeParallel)
plt.legend(['serial', 'parallel 1 core'])
plt.show()
Edit:
It was commented that the overhead may be due to creating multiple jobs. Here is a modification of the parallel function that should explicitly create only 1 job. I still observe linear growth of the overhead.
def parallel_series_1core(rangeMax):
    pool = multiprocessing.Pool(processes=1)
    rez = pool.map(serial_series, [rangeMax])
    pool.terminate()
    pool.join()
    return rez
Edit 2: Once more, the exact code that produces linear growth. It can be tested with a print statement inside the serial_series function that it is only called once for each call of parallel_series_1core.
import time
import numpy as np
import multiprocessing
import matplotlib.pyplot as plt

def foo(x):
    return sum([np.log(1 + i*x) for i in range(10)])

def serial_series(rangeMax):
    return [foo(i) for i in range(rangeMax)]

def parallel_series_1core(rangeMax):
    pool = multiprocessing.Pool(processes=1)
    rez = pool.map(serial_series, [rangeMax])
    pool.terminate()
    pool.join()
    return rez

nTask = [1 + i ** 2 * 1000 for i in range(1, 20)]
nTimeSerial = []
nTimeParallel = []

for taskSize in nTask:
    print('TaskSize', taskSize)
    start = time.time()
    rez1 = serial_series(taskSize)
    end = time.time()
    nTimeSerial.append(end - start)
    start = time.time()
    rez2 = parallel_series_1core(taskSize)
    end = time.time()
    nTimeParallel.append(end - start)

plt.plot(nTask, nTimeSerial)
plt.plot(nTask, nTimeParallel)
plt.plot(nTask, [i / j for i,j in zip(nTimeParallel, nTimeSerial)])
plt.legend(['serial', 'parallel 1 core', 'ratio'])
plt.show()
When you use Pool.map() you're essentially telling it to split the passed iterable into jobs over all available sub-processes (which is one in your case): the larger the iterable, the more 'jobs' are created on the first call. That's what initially adds a huge (trumped only by the process creation itself), albeit linear, overhead.
Since sub-processes do not share memory, data has to be pickled on one end and unpickled on the other: on POSIX systems (thanks to forking) this applies to all changing data, and on Windows to all data, even static data. On top of that, it takes time to clear out the process stack for the next job, and there is overhead from system thread switching (that's out of your control; you'd have to mess with the system's scheduler to reduce that one).
For simple/quick tasks a single process will usually beat multiprocessing.
UPDATE - As I was saying above, the additional overhead comes from the fact that for any data exchange between processes Python transparently performs a pickling/unpickling routine. Since the list you return from the serial_series() function grows with the task size, so does the performance penalty for pickling/unpickling. Here's a simple demonstration of it based on your code:
import math
import pickle
import time

# multi-platform precision timer
# (time.clock was removed in Python 3.8; time.perf_counter works everywhere)
get_timer = time.perf_counter

def foo(x):  # logic/computation function
    return sum([math.log(1 + i*x) for i in range(10)])

def serial_series(max_range):  # main sub-process function
    return [foo(i) for i in range(max_range)]

def serial_series_slave(max_range):  # subprocess interface
    return pickle.dumps(serial_series(pickle.loads(max_range)))

def serial_series_master(max_range):  # main process interface
    return pickle.loads(serial_series_slave(pickle.dumps(max_range)))

tasks = [1 + i ** 2 * 1000 for i in range(1, 20)]
simulated_times = []
for task in tasks:
    print("Simulated task size: {}".format(task))
    start = get_timer()
    res = serial_series_master(task)
    simulated_times.append((task, get_timer() - start))
At the end, simulated_times will contain something like:
[(1001, 0.010015994115533963), (4001, 0.03402641167313844), (9001, 0.06755546622419131),
(16001, 0.1252664260421834), (25001, 0.18815836740279515), (36001, 0.28339434475444325),
(49001, 0.3757235840503601), (64001, 0.4813749807557435), (81001, 0.6115452710446636),
(100001, 0.7573718332506543), (121001, 0.9228750064147522), (144001, 1.0909038813527427),
(169001, 1.3017281342479343), (196001, 1.4830192955746764), (225001, 1.7117389965616931),
(256001, 1.9392146632682739), (289001, 2.19192682050668), (324001, 2.4497541011649187),
(361001, 2.7481495578097466)]
showing processing time that clearly keeps growing (roughly linearly, judging by these numbers) as the list gets bigger. This is essentially what happens with multiprocessing: if your sub-process function didn't return anything, it would end up considerably faster.
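One way to see how much data travels back from the sub-process is to compare the pickled size of the full result list with that of a reduced result. This is only an illustrative sketch: serial_series_sum is a hypothetical variant that returns a single number instead of the whole list, which is an option only if the caller does not need the individual values.

```python
import math
import pickle


def foo(x):
    return sum([math.log(1 + i * x) for i in range(10)])


def serial_series(max_range):
    """Returns the whole list: everything must be pickled on the way back."""
    return [foo(i) for i in range(max_range)]


def serial_series_sum(max_range):
    """Hypothetical variant: reduce inside the worker, return one float."""
    return sum(foo(i) for i in range(max_range))


if __name__ == "__main__":
    for n in (1001, 4001, 9001):
        full = len(pickle.dumps(serial_series(n)))
        reduced = len(pickle.dumps(serial_series_sum(n)))
        print(f"n={n}: full list {full} bytes, reduced result {reduced} bytes")
```

The size of the pickled list grows with n, while the reduced result stays at a few bytes.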
If you have a large amount of data you need to share among processes, I'd suggest using an in-memory database (like Redis) and having your sub-processes connect to it to store/retrieve data.

Python Multiprocessing speed issue

I have a nested for loop of the form
while x<lat2[0]:
    while y>lat3[1]:
        if (is_inside_nepal([x,y])):
            print("inside")
        else:
            print("not")
        y = y - (1/150.0)
    y = lat2[1]
    x = x + (1/150.0)
#here lat2[0] represents a large number
Now this normally takes around 50s for executing.
And I have changed this loop to a multiprocessing code.
def v1find_coordinates(q):
    while not(q.empty()):
        x1 = q.get()
        x2 = x1 + incfactor
        while x1<x2:
            def func(x1):
                while y>lat3[1]:
                    if (is_inside([x1,y])):
                        print x1,y,"inside"
                    else:
                        print x1,y,"not inside"
                    y = y - (1/150.0)
            func(x1)
            y = lat2[1]
            x1 = x1 + (1/150.0)

incfactor = 0.7
xvalues = drange(x,lat2[0],incfactor)
#this drange function is to get list with increment factor as decimal
cores = mp.cpu_count()
q = Queue()
for i in xvalues:
    q.put(i)
for i in range(0,cores):
    p = Process(target = v1find_coordinates,args=(q,) )
    p.start()
    p.Daemon = True
    processes.append(p)
for i in processes:
    print ("now joining")
    i.join()
This multiprocessing code also takes around 50s to execute, so there is no difference in time between the two.
I have also tried using pools, and I have tuned the chunk size. I have googled and searched through other Stack Overflow questions, but I can't find any satisfying answer.
The only explanation I could find was that the time spent on process management cancels out the gain, making both results the same. If this is the reason, how can I get multiprocessing to produce faster results?
Would implementing this in C and calling it from Python give faster results?
I am not expecting drastic improvements, but common sense says that running on 4 cores should be a lot faster than running on 1 core. Yet I am getting similar results. Any kind of help would be appreciated.
You seem to be using a thread Queue (from Queue import Queue). This does not work as expected, because Process uses fork() and clones the entire Queue into each worker process, so every worker processes its own full copy of the items.
Use:
from multiprocessing import Queue

How to utilize all cores with python multiprocessing

I have been fiddling with Python's multiprocessing functionality for upwards of an hour now, trying to parallelize a rather complex graph traversal function using multiprocessing.Process and multiprocessing.Manager:
import networkx as nx
import csv
import time
from operator import itemgetter
import os
import multiprocessing as mp

cutoff = 1
exclusionlist = ["cpd:C00024"]
DG = nx.read_gml("KeggComplete.gml", relabel=True)
for exclusion in exclusionlist:
    DG.remove_node(exclusion)

# checks if 'memorizedPaths exists, and if not, creates it
fn = os.path.join(os.path.dirname(__file__),
                  'memorizedPaths' + str(cutoff+1))
if not os.path.exists(fn):
    os.makedirs(fn)

manager = mp.Manager()
memorizedPaths = manager.dict()
filepaths = manager.dict()
degreelist = sorted(DG.degree_iter(),
                    key=itemgetter(1),
                    reverse=True)

def _all_simple_paths_graph(item, DG, cutoff, memorizedPaths, filepaths):
    source = item[0]
    uniqueTreePaths = []
    if cutoff < 1:
        return
    visited = [source]
    stack = [iter(DG[source])]
    while stack:
        children = stack[-1]
        child = next(children, None)
        if child is None:
            stack.pop()
            visited.pop()
        elif child in memorizedPaths:
            for path in memorizedPaths[child]:
                newPath = (tuple(visited) + tuple(path))
                if ((len(newPath) <= cutoff) and
                        (len(set(visited) & set(path)) == 0)):
                    uniqueTreePaths.append(newPath)
            continue
        elif len(visited) < cutoff:
            if child not in visited:
                visited.append(child)
                stack.append(iter(DG[child]))
                if visited not in uniqueTreePaths:
                    uniqueTreePaths.append(tuple(visited))
        else: # len(visited) == cutoff:
            if ((visited not in uniqueTreePaths) and
                    (child not in visited)):
                uniqueTreePaths.append(tuple(visited + [child]))
            stack.pop()
            visited.pop()
    # writes the absolute path of the node path file into the hash table
    filepaths[source] = str(fn) + "/" + str(source) + "path.txt"
    with open (filepaths[source], "wb") as csvfile2:
        writer = csv.writer(csvfile2, delimiter=" ", quotechar="|")
        for path in uniqueTreePaths:
            writer.writerow(path)
    memorizedPaths[source] = uniqueTreePaths

############################################################################
if __name__ == '__main__':
    start = time.clock()
    for item in degreelist:
        test = mp.Process(target=_all_simple_paths_graph,
                          args=(DG, cutoff, item, memorizedPaths, filepaths))
        test.start()
        test.join()
    end = time.clock()
    print (end-start)
Currently, through luck and magic, it works (sort of). My problem is that I'm only using 12 of my 24 cores.
Can someone explain why this might be the case? Perhaps my code isn't the best multiprocessing solution, or is it a feature of my architecture (Intel Xeon CPU E5-2640 @ 2.50GHz x18, running Ubuntu 13.04 x64)?
EDIT:
I managed to get:
p = mp.Pool()
for item in degreelist:
    p.apply_async(_all_simple_paths_graph,
                  args=(DG, cutoff, item, memorizedPaths, filepaths))
p.close()
p.join()
Working, however, it's VERY SLOW! So I assume I'm using the wrong function for the job. Hopefully it helps clarify exactly what I'm trying to accomplish!
EDIT2: .map attempt:
from functools import partial

partialfunc = partial(_all_simple_paths_graph,
                      DG=DG,
                      cutoff=cutoff,
                      memorizedPaths=memorizedPaths,
                      filepaths=filepaths)
p = mp.Pool()
for item in processList:
    processVar = p.map(partialfunc, xrange(len(processList)))
p.close()
p.join()
It works, but it is slower than single-core. Time to optimize!
Too much piling up here to address in comments, so, where mp is multiprocessing:
mp.cpu_count() should return the number of processors. But test it. Some platforms are funky, and this info isn't always easy to get. Python does the best it can.
If you start 24 processes, they'll do exactly what you tell them to do ;-) Looks like mp.Pool() would be most convenient for you. You pass the number of processes you want to create to its constructor. mp.Pool(processes=None) will use mp.cpu_count() for the number of processes.
Then you can use, for example, .imap_unordered(...) on your Pool instance to spread your degreelist across processes. Or maybe some other Pool method would work better for you - experiment.
If you can't bash the problem into Pool's view of the world, you could instead create an mp.Queue to create a work queue, .put()'ing nodes (or slices of nodes, to reduce overhead) to work on in the main program, and write the workers to .get() work items off that queue. Ask if you need examples. Note that you need to put sentinel values (one per process) on the queue, after all the "real" work items, so that worker processes can test for the sentinel to know when they're done.
FYI, I like queues because they're more explicit. Many others like Pools better because they're more magical ;-)
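The work-queue approach described above might be sketched roughly as follows. This is an illustration under stated assumptions, not the asker's code: process_node stands in for the real per-node work, and SENTINEL, chunked, worker and run are hypothetical names. Nodes are sent in slices to cut per-item overhead, and one sentinel per worker is queued after all the real work.

```python
import multiprocessing as mp

SENTINEL = None  # hypothetical end-of-work marker, one per worker


def process_node(node):
    # Placeholder for the real per-node work (an assumption).
    return node * node


def chunked(items, size):
    # Split the work into slices to reduce queue traffic.
    return [items[i:i + size] for i in range(0, len(items), size)]


def worker(task_q, result_q):
    # iter(callable, sentinel) keeps calling task_q.get()
    # until it returns SENTINEL.
    for chunk in iter(task_q.get, SENTINEL):
        result_q.put([process_node(n) for n in chunk])


def run(nodes, nprocs, chunk_size):
    task_q, result_q = mp.Queue(), mp.Queue()
    chunks = chunked(nodes, chunk_size)
    workers = [mp.Process(target=worker, args=(task_q, result_q))
               for _ in range(nprocs)]
    for w in workers:
        w.start()
    for c in chunks:
        task_q.put(c)
    for _ in workers:
        task_q.put(SENTINEL)  # after all the "real" work items
    results = []
    for _ in chunks:
        # Drain results before joining the workers.
        results.extend(result_q.get())
    for w in workers:
        w.join()
    return results


if __name__ == "__main__":
    out = run(list(range(1000)), nprocs=4, chunk_size=100)
    print(len(out))
```

Results arrive in whatever order the workers finish, so the output list is not sorted by input position.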
Pool Example
Here's an executable prototype for you. This shows one way to use imap_unordered with Pool and chunksize that doesn't require changing any function signatures. Of course you'll have to plug in your real code ;-) Note that the init_worker approach allows passing "most of" the arguments only once per worker process, not once for every item in your degreelist. Cutting the amount of inter-process communication can be crucial for speed.
import multiprocessing as mp

def init_worker(mps, fps, cut):
    global memorizedPaths, filepaths, cutoff
    global DG

    print "process initializing", mp.current_process()
    memorizedPaths, filepaths, cutoff = mps, fps, cut
    DG = 1##nx.read_gml("KeggComplete.gml", relabel = True)

def work(item):
    _all_simple_paths_graph(DG, cutoff, item, memorizedPaths, filepaths)

def _all_simple_paths_graph(DG, cutoff, item, memorizedPaths, filepaths):
    pass # print "doing " + str(item)

if __name__ == "__main__":
    m = mp.Manager()
    memorizedPaths = m.dict()
    filepaths = m.dict()
    cutoff = 1 ##
    # use all available CPUs
    p = mp.Pool(initializer=init_worker, initargs=(memorizedPaths,
                                                   filepaths,
                                                   cutoff))
    degreelist = range(100000) ##
    for _ in p.imap_unordered(work, degreelist, chunksize=500):
        pass
    p.close()
    p.join()
I strongly advise running this exactly as-is, so you can see that it's blazing fast. Then add things to it a bit at a time, to see how that affects the time. For example, just adding
memorizedPaths[item] = item
to _all_simple_paths_graph() slows it down enormously. Why? Because the dict gets bigger and bigger with each addition, and this process-safe dict has to be synchronized (under the covers) among all the processes. The unit of synchronization is "the entire dict" - there's no internal structure the mp machinery can exploit to do incremental updates to the shared dict.
If you can't afford this expense, then you can't use a Manager.dict() for this. Opportunities for cleverness abound ;-)
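One possible direction, offered only as a sketch and not as the answer's own method: have each worker return its result and let the parent assemble an ordinary dict, so nothing is written to shared state during the run. Here work is a placeholder for the real per-node computation and collect is a hypothetical helper. The trade-off is that workers can no longer look up paths memorized by other workers while they run, which the original algorithm relies on.

```python
import multiprocessing as mp


def work(item):
    # Placeholder for the real per-node computation (an assumption).
    # The result is returned to the parent instead of being written
    # into a shared Manager dict.
    return item, item * 2


def collect(pairs):
    # Build an ordinary dict in the parent from (key, value) pairs.
    return dict(pairs)


if __name__ == "__main__":
    p = mp.Pool()
    memorized = collect(p.imap_unordered(work, range(100000), chunksize=500))
    p.close()
    p.join()
    print(len(memorized))
```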
