I want to make a Dask Delayed flow which includes CPU and GPU tasks. GPU tasks can only run on GPU workers, and a GPU worker only has one GPU and can only handle one GPU task at a time.
Unfortunately, I see no way to specify worker resources in the Delayed API.
Here is common code:
client = Client(resources={'GPU': 1})
#delayed
def fcpu(x, y):
sleep(1)
return x + y
#delayed
def fgpu(x, y):
sleep(1)
return x + y
Here is the flow written in pure Delayed. This code will not behave properly because it doesn't know about the GPU resource.
# STEP ONE: two parallel CPU tasks
a = fcpu(1, 1)
b = fcpu(10, 10)
# STEP TWO: two GPU tasks
c = fgpu(a, b) # Requires 1 GPU
d = fgpu(a, b) # Requires 1 GPU
# STEP THREE: final CPU task
e = fcpu(c, d)
%time e.compute() # 3 seconds
This is the best solution I could come up with. It combines Delayed syntax with Client.compute() futures. It seems to behave correctly, but it is very ugly.
# STEP ONE: two parallel CPU tasks
a = fcpu(1, 1)
b = fcpu(10, 10)
a_future, b_future = client.compute([a, b]) # Wo DON'T want a resource limit
# STEP TWO: two GPU tasks - only resources to run one at a time
c = fgpu(a_future, b_future)
d = fgpu(a_future, b_future)
c_future, d_future = client.compute([c, d], resources={'GPU': 1})
# STEP THREE: final CPU task
e = fcpu(c_future, d_future)
res = e.compute()
Is there a better way to do this?
Maybe an approach similar to what is described in https://jobqueue.dask.org/en/latest/examples.html It is a case of processing on one GPU machine or a machine with SSD.
def step_1_w_single_GPU(data):
return "Step 1 done for: %s" % data
def step_2_w_local_IO(data):
return "Step 2 done for: %s" % data
stage_1 = [delayed(step_1_w_single_GPU)(i) for i in range(10)]
stage_2 = [delayed(step_2_w_local_IO)(s2) for s2 in stage_1]
result_stage_2 = client.compute(stage_2,
resources={tuple(stage_1): {'GPU': 1},
tuple(stage_2): {'ssdGB': 100}})
This is possible with annotations, see the example in docs:
x = dd.read_csv(...)
with dask.annotate(resources={'GPU': 1}):
y = x.map_partitions(func1)
z = y.map_partitions(func2)
z.compute(optimize_graph=False)
As noted in the graph, such annotations can be lost during optimization, hence the kwarg optimize_graph=False.
Related
How can I improve the performance of the networkx function local_bridges https://networkx.org/documentation/stable//reference/algorithms/generated/networkx.algorithms.bridges.local_bridges.html#networkx.algorithms.bridges.local_bridges
I have experimented using pypy - but so far am still stuck on consuming the generator on a single core. My graph has 300k edges. An example:
# construct the nx Graph:
import networkx as nx
# construct an undirected graph here - this is just a dummy graph
G = nx.cycle_graph(300000)
# fast - as it only returns an generator/iterator
lb = nx.local_bridges(G)
# individual item is also fast
%%time
next(lb)
CPU times: user 1.01 s, sys: 11 ms, total: 1.02 s
Wall time: 1.02 s
# computing all the values is very slow.
lb_list = list(lb)
How can I consume this iterator in parallel to utilize all processor cores? The current naive implementation is only using a single core!
My naive multi-threaded first try is:
import multiprocessing as mp
lb = nx.local_bridges(G)
pool = mp.Pool()
lb_list = list(pool.map((), lb))
However, I do not want to apply a specific function - () rather only get the next element from the iterator in parallel.
Related:
python or dask parallel generator?
edit
I guess it boils down how to parallelize:
lb_res = []
lb = nx.local_bridges(G)
for node in range(1, len(G) +1):
lb_res.append(next(lb))
lb_res
Naively using multiprocessing obviously fails:
# from multiprocessing import Pool
# https://stackoverflow.com/questions/41385708/multiprocessing-example-giving-attributeerror
from multiprocess import Pool
lb_res = []
lb = nx.local_bridges(G)
def my_function(thing):
return next(thing)
with Pool(5) as p:
parallel_result = p.map(my_function, range(1, len(G) +1))
parallel_result
But it is unclear to me how I can pass my generator as the argument to the map function - and fully consume the generator.
edit 2
For this particular problem, it turns out that the bottleneck is the shortest path computation for the with_span=True parameter. When disabled, it is decently fast.
When calculating the span is desired I would suggest cugraph with a fast implementation of SSSP on the GPU. Still, the iteration over the set of edges does not happen in parallel and should be improved further.
However, to learn more, I would be interested in understanding how to parallelize the consumption from a generator in python.
You can't consume a generator in parallel, every non-trivial generator's next state is determined by its current state. You have to call next() sequentially.
From https://github.com/networkx/networkx/blob/master/networkx/algorithms/bridges.py#L162 this is how the function is implemented
for u, v in G.edges:
if not (set(G[u]) & set(G[v])):
yield u, v
So you could parallelize it using something like this, but then you would have to incur the penalty of merging those individual lists using something like multiprocessing.Manager. I think it would just make the whole thing much slower, but you can time it yourself.
def process_edge(e):
u, v = e
lb_list = []
if not (set(G[u]) & set(G[v])):
lb_list.append((u,v))
with Pool(os.cpu_count()) as pool:
pool.map(process_edge, G.edges)
Another way is to split the graph into ranges of vertices and process them concurrently.
def process_nodes(nodes):
lb_list = []
for u in nodes:
for v in G[u]:
if not (set(G[u]) & set(G[v])):
lb_list.append((u,v))
with Pool(os.cpu_count()) as pool:
pool.map(process_nodes, np.array_split(list(range(G.number_of_nodes())),
os.cpu_count()))
Maybe you could also check if any better algorithms exist for this problem. Or find a faster library that's implemented in C.
The issue
I am trying to optimise some calculations which lend themselves to so-called embarrassingly parallel calculations, but I am finding that using python's multiprocessing package actually slows things down.
My question is: am I doing something wrong, or is there an intrinsic reason why parallelisation actually slows things down? Is it because I am using numba? Would other packages like joblib or dak make much of a difference?
There are loads of similar questions, in which the answer is always that the overhead costs more than the time savings, but all those questions tend to revolve around very simple functions, whereas I would have expected something with nested loops to lend itself better to parallelisation. I have also not found comparisons among joblib, multiprocessing and dask.
My function
I have a function which takes a one-dimensional numpy array as argument of shape n, and outputs a numpy array of shape (n x t), where each row is independent, i.e. row 0 of the output depends only on item 0 of the input, row 1 on item 1, etc. Something like this:
The underlying calculation is optimised with numba , which speeds things up by various orders of magnitude.
Toy example - results
I cannot share the exact code, so I have come up with a toy example. The calculation defined in my_fun_numba is actually irrelevant, it's just some very banal number crunching to keep the CPU busy.
With the toy example, the results on my PC are these, and they are very similar to what I get with my actual code.
As you can see, splitting the input array into different chunks and sending each of them to multiprocessing.pool actually slows things down vs just using numba on a single core.
What I have tried
I have tried various combinations of the cache and nogil options in the numba.jit decorator, but the difference is minimal.
I have profiled the code (not the timeit.Timer part, just a single run) with PyCharm and, if I understand the output correctly, it seems most of the time is spent waiting for the pool.
Sorted by time:
Sorted by own time:
Toy example - the code
import numpy as np
import pandas as pd
import multiprocessing
from multiprocessing import Pool
import numba
import timeit
#numba.jit(nopython = True, nogil = True, cache = True)
def my_fun_numba(x):
dim2 = 10
out = np.empty((len(x), dim2))
n = len(x)
for r in range(n):
for c in range(dim2):
out[r,c] = np.cos(x[r]) ** 2 + np.sin(x[r]) ** 2
return out
def my_fun_non_numba(x):
dim2 = 10
out = np.empty((len(x), dim2))
n = len(x)
for r in range(n):
for c in range(dim2):
out[r,c] = np.cos(x[r]) ** 2 + np.sin(x[r]) ** 2
return out
def my_func_parallel(inp, func, cpus = None):
if cpus == None:
cpus = max(1, multiprocessing.cpu_count() - 1)
else:
cpus = cpus
inp_split = np.array_split(inp,cpus)
pool = Pool(cpus)
out = np.vstack(pool.map(func, inp_split) )
pool.close()
pool.join()
return out
if __name__ == "__main__":
inputs = np.array([100,10e3,1e6] ).astype(int)
res = pd.DataFrame(index = inputs, columns =['no paral, no numba','no paral, numba','numba 6 cores','numba 12 cores'])
r = 3
n = 1
for i in inputs:
my_arg = np.arange(0,i)
res.loc[i, 'no paral, no numba'] = min(
timeit.Timer("my_fun_non_numba(my_arg)", globals=globals()).repeat(repeat=r, number=n)
)
res.loc[i, 'no paral, numba'] = min(
timeit.Timer("my_fun_numba(my_arg)", globals=globals()).repeat(repeat=r, number=n)
)
res.loc[i, 'numba 6 cores'] = min(
timeit.Timer("my_func_parallel(my_arg, my_fun_numba, cpus = 6)", globals=globals()).repeat(repeat=r, number=n)
)
res.loc[i, 'numba 12 cores'] = min(
timeit.Timer("my_func_parallel(my_arg, my_fun_numba, cpus = 12)", globals=globals()).repeat(repeat=r, number=n)
)
The code down below is a contrived example that simulates an actual problem I have that uses multiprocessing to speed up the code. The code is run on Windows 10 64-bit OS, python 3.7.5, and ipython 7.9.0
the transformation functions(these functions will be used to transform arrays in main())
from itertools import product
from functools import partial
from numba import njit, prange
import multiprocessing as mp
import numpy as np
#njit(parallel= True)
def transform_array_c(data, n):
ar_len= len(data)
sec_max1= np.empty(ar_len, dtype = data.dtype)
sec_max2= np.empty(ar_len, dtype = data.dtype)
for i in prange(n-1):
sec_max1[i]= np.nan
for sec in prange(ar_len//n):
s2_max= data[n*sec+ n-1]
s1_max= data[n*sec+ n]
for i in range(n-1,-1,-1):
if data[n*sec+i] > s2_max:
s2_max= data[n*sec+i]
sec_max2[n*sec+i]= s2_max
sec_max1[n*sec+ n-1]= sec_max2[n*sec]
for i in range(n-1):
if n*sec+n+i < ar_len:
if data[n*sec+n+i] > s1_max:
s1_max= data[n*sec+n+i]
sec_max1[n*sec+n+i]= max(s1_max, sec_max2[n*sec+i+1])
else:
break
return sec_max1
#njit(error_model= 'numpy', cache= True)
def rt_mean_sq_dev(array1, array2, n):
msd_temp = np.empty(array1.shape[0])
K = array2[n-1]
rs_x= array1[0] - K
rs_xsq = rs_x *rs_x
msd_temp[0] = np.nan
for i in range(1,n):
rs_x += array1[i] - K
rs_xsq += np.square(array1[i] - K)
msd_temp[i] = np.nan
y_i = array2[n-1] - K
msd_temp[n-1] = np.sqrt(max(y_i*y_i + (rs_xsq - 2*y_i*rs_x)/n, 0))
for i in range(n, array1.shape[0]):
rs_x = array1[i] - array1[i-n]+ rs_x
rs_xsq = np.square(array1[i] - K) - np.square(array1[i-n] - K) + rs_xsq
y_i = array2[i] - K
msd_temp[i] = np.sqrt(max(y_i*y_i + (rs_xsq - 2*y_i*rs_x)/n, 0))
return msd_temp
#njit(cache= True)
def transform_array_a(data, n):
result = np.empty(data.shape[0], dtype= data.dtype)
alpharev = 1. - 2 / (n + 1)
alpharev_exp = alpharev
e = data[0]
w = 1.
if n == 2: result[0] = e
else:result[0] = np.nan
for i in range(1, data.shape[0]):
w += alpharev_exp
e = e*alpharev + data[i]
if i > n -3:result[i] = e / w
else:result[i] = np.nan
if alpharev_exp > 3e-307:alpharev_exp*= alpharev
else:alpharev_exp=0.
return result
The multiprocessing part
def func(tup, data): #<-------------the function to be run among all
a_temp= a[tup[2][0]]
idx1 = a_temp > a[tup[2][1]]
idx2= a_temp < b[(tup[2][1], tup[1][1])]
c_final = c[tup[0][1]][idx1 | idx2]
data_final= data[idx1 | idx2]
return (tup[0][0], tup[1][0], *tup[2]), c_final[-1] - data_final[-1]
def setup(a_dict, b_dict, c_dict): #initialize the shared dictionaries
global a,b,c
a,b,c = a_dict, b_dict, c_dict
def main(a_arr, b_arr, c_arr, common_len):
np.random.seed(0)
data_array= np.random.normal(loc= 24004, scale=500, size= common_len)
a_size = a_arr[-1] + 1
b_size = len(b_arr)
c_size = len(c_arr)
loop_combo = product(enumerate(c_arr),
enumerate(b_arr),
(n_tup for n_tup in product(np.arange(1,a_arr[-1]), a_arr) if n_tup[1] > n_tup[0])
)
result = np.zeros((c_size, b_size, a_size -1 ,a_size), dtype = np.float32)
###################################################
#This part simulates the heavy-computation in the actual problem
a= {}
b= {}
c= {}
for i in range(1, a_arr[-1]+1):
a[i]= transform_array_a(data_array, i)
if i in a_arr:
for j in b_arr:
b[(i,j)]= rt_mean_sq_dev(data_array, a[i], i)/data_array *j
for i in c_arr:
c[i]= transform_array_c(data_array, i)
###################################################
with mp.Pool(processes= mp.cpu_count() - 1,
initializer= setup,
initargs= [a,b,c]
) as pool:
mp_res= pool.imap_unordered(partial(func, data= data_array),
loop_combo
)
for item in mp_res:
result[item[0]] =item[1]
return result
if __name__ == '__main__':
mp.freeze_support()
a_arr= np.arange(2,44,2)
b_arr= np.arange(0.4,0.8, 0.20)
c_arr= np.arange(2,42,10)
common_len= 440000
final_res= main(a_arr, b_arr, c_arr, common_len)
For performance reasons, multiple shared "read only" dictionaries are used among all processes to reduce the redundant computations(in the actual problem, the total computation time is reduced by 40% after using shared dictionaries among all the processes). However, The ram usage becomes absurdly higher after using shared dictionaries in my actual problem; memory usage in my 6C/12T Windows computer goes from (8.2GB peak, 5.0GB idle) to (23.9GB peak, 5.0GB idle), a little too high of a cost to pay in order to gain 40% speed up.
Is the high ram usage unavoidable when using multiple shared data among processes is a must? What can be done to my code in order to make it as fast as possible while using as low memory as possible?
Thank you in advance
Note: I tried using imap_unordered() instead of map because I heard it is supposed to reduce the memory usage when the input iterable is large, but I honestly can't see an improvement in the ram usage. Maybe I have done something wrong here?
EDIT: Due to the feedback in the answers, I have already changed the heavy computation part of the code such that it looks less dummy and resembles the computation in the actual problem.
High Memory Usage when manipulating shared dictionaries in python multiprocessing run in Windows
It is fair to demystify a bit the problem, before we move into details - there are no shared dictionaries in the original code, the less they get manipulated ( yes, each of the a,b,c did get "assigned" to a reference to the dict_a, dict_b, dict_c yet none of them is shared, but just get replicated as the multiprocessing does in Windows-class O/S-es. No writes "into" dict-s ( just non-destructive reads-from either of their replicas )
Similarly, the np.memmap()-s are possible to put some part of the originally proposed data onto disk-space ( at a cost of doing so + bearing some ( latency-masked ) random-reads latency of ~ 10 [ms] instead of ~ 0.5 [ns] if smart-aligned vectorised memory-patterns were designed into the performance hot-spot ) yet no dramatic change-of-paradigm ought be expected here, as the "external iterator" almost avoids any smart-aligned cache re-uses
Q : What can be done to my code in order to make it as fast as possible while using as low memory as possible?
The first sin was in using an 8B-int64 to store one plain Bbit ( no Qbits here yet ~ All salutes to Burnaby Quantum R&D Teams )
for i in c_arr: # ~~ np.arange( 2, 42, 10 )
np.random.seed( i ) # ~ yields a deterministic state
c[i] = np.random.poisson( size = common_len ) # ~ 440.000 int64-s with {0|1}
This took 6 (processes) x 440000 x 8B ~ 0.021 GB "smuggled" in all copies of dictionary c, whereas each and every such value is deterministically known and could be generated ALAP inside a respective target process, by just knowing the value of i ( indeed no need to pre-generate and many-times replicate ~ 0.021 GB of data )
So far, the Windows-class O/S lack an os.fork() and thus do a python full-copy ( yes, RAM ..., yes, TIME ) of as many replicated python-interpreter sessions ( plus importing the main module ) as was requested, in multiprocessing for process-based separation ( doing that for avoiding a GIL-lock ordered, pure-[SERIAL], code execution )
The Best Next Step:re-factor the codefor both efficiency and performance
The best next step - refactor the code, so as to minimise a "shallow" ( and expensive ) use of the 6-processes but "externally"-commanded by a central iterator ( the loop_combo "dictator" with ~ 18522 items to repeat the call to a "remotely-dispatched" func( tup, data ) so as to fetch a simple "DMA-tuple"-( (x,y,z), value ) to store one value into a central process result-float32-array ).
Try to increase the computing "density" - so try to re-factor the code by a divide-and-conquer manner ( i.e., that each of the mp.pool-processes computes in one smooth block some remarkably sized, dedicated sub-space of the parameter-space covered ( here iteratively "from ouside" ) and may easily reduce the returned blocks of results. Performance will only improve by doing this ( best without any form of expensive sharing ).
This re-factoring will avoid parameter pickle/unpickle-costs ( add-on overheads - both the one-time ( in passing the unique parameter-set values ) and the repetitive ( in about a ~ 18522-times executed repetitive memory-allocation, buildup and pickle/unpickle-costs of an np.arange( 440000 ) due to a poor call-signature design / engineering )
All these steps will improve your processing efficiency and reduce the there unnecessary RAM-allocations.
Since I moved from python3.5 to 3.6 the Parallel computation using joblib is not reducing the computation time.
Here are the librairies installed versions:
- python: 3.6.3
- joblib: 0.11
- numpy: 1.14.0
Based on a very well known example, I give below a sample code to reproduce the problem:
import time
import numpy as np
from joblib import Parallel, delayed
def square_int(i):
return i * i
ndata = 1000000
ti = time.time()
results = []
for i in range(ndata):
results.append(square_int(i))
duration = np.round(time.time() - ti,4)
print(f"standard computation: {duration} s" )
for njobs in [1,2,3,4] :
ti = time.time()
results = []
results = Parallel(n_jobs=njobs, backend="multiprocessing")\
(delayed(square_int)(i) for i in range(ndata))
duration = np.round(time.time() - ti,4)
print(f"{njobs} jobs computation: {duration} s" )
I got the following ouput:
standard computation: 0.2672 s
1 jobs computation: 352.3113 s
2 jobs computation: 6.9662 s
3 jobs computation: 7.2556 s
4 jobs computation: 7.097 s
While I am increasing by a factor of 10 the number of ndata and removing the 1 core computation, I get those results:
standard computation: 2.4739 s
2 jobs computation: 77.8861 s
3 jobs computation: 79.9909 s
4 jobs computation: 83.1523 s
Does anyone have an idea in which direction I should investigate ?
I think the primary reason is that your overhead from parallel beats the benefits. In another word, your square_int is too simple to earn any performance improvement via parallel. The square_int is so simple that passing input and output between processes may take more time than executing the function square_int.
I modified your code by creating a square_int_batch function. It reduced the computation time a lot, though it is still more than the serial implementation.
import time
import numpy as np
from joblib import Parallel, delayed
def square_int(i):
return i * i
def square_int_batch(a,b):
results=[]
for i in range(a,b):
results.append(square_int(i))
return results
ndata = 1000000
ti = time.time()
results = []
for i in range(ndata):
results.append(square_int(i))
# results = [square_int(i) for i in range(ndata)]
duration = np.round(time.time() - ti,4)
print(f"standard computation: {duration} s" )
batch_num = 3
batch_size=int(ndata/batch_num)
for njobs in [2,3,4] :
ti = time.time()
results = []
a = list(range(ndata))
# results = Parallel(n_jobs=njobs, )(delayed(square_int)(i) for i in range(ndata))
# results = Parallel(n_jobs=njobs, backend="multiprocessing")(delayed(
results = Parallel(n_jobs=njobs)(delayed(
square_int_batch)(i*batch_size,(i+1)*batch_size) for i in range(batch_num))
duration = np.round(time.time() - ti,4)
print(f"{njobs} jobs computation: {duration} s" )
And the computation timings are
standard computation: 0.3184 s
2 jobs computation: 0.5079 s
3 jobs computation: 0.6466 s
4 jobs computation: 0.4836 s
A few other suggestions that will help reduce the time.
Use list comprehension results = [square_int(i) for i in range(ndata)] to replace for loop in your specific case, it is faster. I tested.
Set batch_num to a reasonable size. The larger this value is, the more overhead. It started to get significantly slower when batch_num exceed 1000 in my case.
I used the default backend loky instead of multiprocessing. It is slightly faster, at least in my case.
From a few other SO questions, I read that the multiprocessing is good for cpu-heavy tasks, for which I don't have an official definition. You can explore that yourself.
I am completely new to python or any such programming language. I have some experience with Mathematica. I have a mathematical problem which though Mathematica solves with her own 'Parallelize' methods but leaves the system quite exhausted after using all the cores! I can barely use the machine during the run. Hence, I was looking for some coding alternative and found python kind of easy to learn and implement. So without further ado, let me tell you the mathematical problem and issues with my python code. As the full code is too long, let me give an outline.
1. Numericall solve a differential equation of the form y''(t) + f(t)y(t)=0, to get y(t) for some range, say C <= t <= D
2.Next, Interpolate the numerical result for some desired range to get the function: w(t), say for A <= t <= B
3. Using w(t), to solve another differential equation of the form z''(t) + [ a + b W(t)] z(t) =0 for some range of a and b, for which I am using the loop.
4. Deine F = 1 + sol1[157], to make a list like {a, b, F}. So let me give a prototype loop as this take most of the computation time.
for q in np.linspace(0.0, 4.0, 100):
for a in np.linspace(-2.0, 7.0, 100):
print('Solving for q = {}, a = {}'.format(q,a))
sol1 = odeint(fun, [1, 0], t, args=( a, q))[..., 0]
print(t[157])
F = 1 + sol1[157]
f1.write("{} {} {} \n".format(q, a, F))
f1.close()
Now, the real loop takes about 4 hrs and 30 minutes to complete (With some built-in functional form of w(t), it takes about 2 minute). When, I applied (without properly understanding what it does and how!) numba/autojit before the definition of fun in my code, the run time significantly improved and takes about 2 hrs and 30 minute. Also, writing two loops as itertools/product further reduces the run time by about 2 minutes only! However, Mathematica, when I let her use all the 4 cores, finishes the task within 30 minutes.
So, is there a way to improve the runtime in python?
To speed up python, you have three options:
deal with specific bottlenecks in the program (as suggested in #LutzL's comment)
try to speed up the code by compiling it into C using cython (or including C code using weave or similar techniques). Since the time-consuming computations in your case are not in python code proper but in scipy modules (at least I believe they are), this would not help you very much here.
implement multiprocessing as you suggested in your original question. This will speed up your code to up to X (slightly less than) times faster if you have X cores. Unfortunately this is rather complicated in python.
Implementing multiprocessing - example using the prototype loop from the original question
I assume that the computations you do inside the nested loops in your prototype code are actually independent from one another. Since your prototype code is incomplete, I am not sure this is the case, however. Otherwise it will, of course, not work. I will give an example using not your differential equation problem for the fun function but a prototype of the same signature (input and output variables).
import numpy as np
import scipy.integrate
import multiprocessing as mp
def fun(y, t, b, c):
# replace this function with whatever function you want to work with
# (this one is the example function from the scipy docs for odeint)
theta, omega = y
dydt = [omega, -b*omega - c*np.sin(theta)]
return dydt
#definitions of work thread and write thread functions
def run_thread(input_queue, output_queue):
# run threads will pull tasks from the input_queue, push results into output_queue
while True:
try:
queueitem = input_queue.get(block = False)
if len(queueitem) == 3:
a, q, t = queueitem
sol1 = scipy.integrate.odeint(fun, [1, 0], t, args=( a, q))[..., 0]
F = 1 + sol1[157]
output_queue.put((q, a, F))
except Exception as e:
print(str(e))
print("Queue exhausted, terminating")
break
def write_thread(queue):
# write thread will pull results from output_queue, write them to outputfile.txt
f1 = open("outputfile.txt", "w")
while True:
try:
queueitem = queue.get(block = False)
if queueitem[0] == "TERMINATE":
f1.close()
break
else:
q, a, F = queueitem
print("{} {} {} \n".format(q, a, F))
f1.write("{} {} {} \n".format(q, a, F))
except:
# necessary since it will throw an error whenever output_queue is empty
pass
# define time point sequence
t = np.linspace(0, 10, 201)
# prepare input and output Queues
mpM = mp.Manager()
input_queue = mpM.Queue()
output_queue = mpM.Queue()
# prepare tasks, collect them in input_queue
for q in np.linspace(0.0, 4.0, 100):
for a in np.linspace(-2.0, 7.0, 100):
# Your computations as commented here will now happen in run_threads as defined above and created below
# print('Solving for q = {}, a = {}'.format(q,a))
# sol1 = scipy.integrate.odeint(fun, [1, 0], t, args=( a, q))[..., 0]
# print(t[157])
# F = 1 + sol1[157]
input_tupel = (a, q, t)
input_queue.put(input_tupel)
# create threads
thread_number = mp.cpu_count()
procs_list = [mp.Process(target = run_thread , args = (input_queue, output_queue)) for i in range(thread_number)]
write_proc = mp.Process(target = write_thread, args = (output_queue,))
# start threads
for proc in procs_list:
proc.start()
write_proc.start()
# wait for run_threads to finish
for proc in procs_list:
proc.join()
# terminate write_thread
output_queue.put(("TERMINATE",))
write_proc.join()
Explanation
We define the individual problems (or rather their parameters) before commencing computation; we collect them in an input Queue.
We define a function (run_thread) that is run in the threads. This function computes individual problems until there are none left in the input Queue; it pushes the results into an output Queue.
We start as many such threads as we have CPUs.
We start an additional thread (write_thread) for collecting the results from the output queue and writing them into a file.
Caveats
For smaller problems, you can run multiprocessing without Queues. However, if the number of individual computations is large, you will exceed the maximum number of threads the kernel will allow you after which the kernel kills your program.
There are differences between different operating systems for how multiprocessing works. The example above will work on Linux (perhaps also on other Unix like systems such as Mac and BSD), not on Windows. The reason is that Windows does not have a fork() system call. (I do not have access to a Windows, can therefore not try to implement it for Windows.)