The following for loop is part of an iterative simulation process and is the main bottleneck in terms of computational time:
import numpy as np

class Simulation(object):
    def __init__(self, n_int):
        self.n_int = n_int

    def loop(self):
        for itr in range(self.n_int):
            # some preceding code which updates rows_list and diff with every itr
            cols_red_list = []
            rows_list = list(range(2500))  # row idx for diff where a negative element is known to appear
            diff = np.random.uniform(-1.323, 3.780, (2500, 300))  # np.random.uniform is just used as a toy example
            for row in rows_list:
                col = next(idx for idx, val in enumerate(diff[row, :]) if val < 0)
                cols_red_list.append(col)
            # some subsequent code which uses the cols_red_list data

sim1 = Simulation(n_int=10)
sim1.loop()
Hence, I tried to parallelize it using the multiprocessing package, in the hope of reducing the computation time:
import numpy as np
from multiprocessing import Pool, cpu_count
from functools import partial

def crossings(row, diff):
    return next(idx for idx, val in enumerate(diff[row, :]) if val < 0)

class Simulation(object):
    def __init__(self, n_int):
        self.n_int = n_int

    def loop(self):
        for itr in range(self.n_int):
            # some preceding code which updates rows_list and diff with every itr
            rows_list = list(range(2500))
            diff = np.random.uniform(-1, 1, (2500, 300))
            if __name__ == '__main__':
                num_of_workers = cpu_count()
                print('number of CPUs : ', num_of_workers)
                pool = Pool(num_of_workers)
                cols_red_list = pool.map(partial(crossings, diff=diff), rows_list)
                pool.close()
                print(len(cols_red_list))
            # some subsequent code which uses the cols_red_list data

sim1 = Simulation(n_int=10)
sim1.loop()
Unfortunately, the parallelization turns out to be much slower compared to the sequential piece of code.
Hence my question: Did I use the multiprocessing package properly in that particular example? Are there alternative ways to parallelize the above-mentioned for loop?
Disclaimer: As you're trying to reduce the runtime of your code through parallelisation, this doesn't strictly answer your question but it might still be a good learning opportunity.
As a golden rule, before moving to multiprocessing to improve performance (execution time), one should first optimise the single-threaded case.
Your
rows_list = list(range(2500))
generates the numbers 0 to 2499 (that's the range) and stores them in memory (the list), which takes time both to allocate the required memory and to do the actual writes. You then only use these predictable values once each, by reading them from memory (which also takes time), in a predictable order:
for row in rows_list:
This is particularly relevant to the runtime of your loop function as you do it repeatedly (for itr in range(n_int):).
Instead, consider generating each number only when you need it, without an intermediate store (which conceptually removes any need to access RAM):
for row in range(2500):
Secondly, on top of sharing the same issue (unnecessary accesses to memory), the following:
diff = np.random.uniform(-1, 1, (2500, 300))
# ...
col = next(idx for idx, val in enumerate(diff[row,:]) if val < 0)
seems to me to be optimisable at the level of math (or logic).
What you're trying to do is get a random variable (that col index) by defining it as "the first time I encounter a random variable in [-1;1] that is lower than 0". But notice that figuring out if a random variable with a uniform distribution over [-α;α] is negative, is the same as having a random variable over {0,1} (i.e. a bool).
Therefore, you're now working with bools instead of floats, and you don't even have to do the comparison (val < 0) as you already have a bool. This potentially makes the code much faster. Using the same idea as for rows_list, you can generate that bool only when you need it, testing it until it is True (or False; pick one, it obviously doesn't matter). By doing so, you only generate as many random bools as you need, not more and not less (BTW, what happens in your code if none of the 300 elements in a row is negative? ;) ):
import itertools
import random

for _ in range(n_int):
    cols_red_list = []
    for row in range(2500):
        col = next(i for i in itertools.count() if random.getrandbits(1))
        cols_red_list.append(col)
or, with list comprehension:
from itertools import count
from random import getrandbits

cols_red_list = [next(i for i in count() if getrandbits(1))
                 for _ in range(2500)]
I'm sure that, through proper statistical analysis, you can even express that col random variable as a non-uniform variable over [0; limit[, allowing you to compute it much faster.
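To make that concrete, here is a minimal sketch of that last idea, assuming the toy uniform(-1, 1) data where each element is negative with probability 0.5: the 0-based index of the first negative element then follows a (shifted) geometric distribution, so the whole cols_red_list can be drawn in one vectorized call. The probability p and the handling of rows with no negative element within 300 columns are assumptions you would need to adapt:

import numpy as np

# P(value < 0); 0.5 for uniform(-1, 1). For the original uniform(-1.323, 3.780)
# data it would be 1.323 / (1.323 + 3.780) instead.
p = 0.5

# np.random.geometric counts trials starting at 1, so subtract 1 for a 0-based index.
# Note: values may exceed 299, whereas the original array only has 300 columns,
# so a real implementation would have to decide how to treat such rows.
cols_red_list = (np.random.geometric(p=p, size=2500) - 1).tolist()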
Please test the performance of an "optimized" version of your single-threaded implementation first. If the runtime is still not acceptable, you should then look into multithreading.
multiprocessing uses system processes (not threads!) for parallelization, which require expensive IPC (inter-process communication) to share data.
This bites you in two spots:
diff = np.random.uniform(-1, 1, (2500, 300)) creates a large matrix which is expensive to pickle/copy to another process
rows_list = list(range(2500)) creates a smaller list, but the same applies here.
To avoid this expensive IPC, you have one and a half choices:
If on a POSIX-compliant system, initialize your variables on the module level, that way each process gets a quick-and-dirty copy of the required data. This is not scalable as it requires POSIX, weird architecture (you probably don't want to put everything on the module level), and doesn't support sharing changes to that data.
Use shared memory. This mostly supports only primitive data types, but mp.Array should cover your needs.
The second problem is that setting up a pool is expensive, as num_cpu processes need to be started. Your workload is small enough to be negligible compared to this overhead. A good practice is to only create one pool and reuse it.
Here is a quick-and-dirty example of the POSIX only solution:
import numpy as np
from multiprocessing import Pool, cpu_count
from functools import partial

n_int = 10
rows_list = np.array(range(2500))
diff = np.random.uniform(-1, 1, (2500, 300))

def crossings(row, diff):
    return next(idx for idx, val in enumerate(diff[row, :]) if val < 0)

def workload(_):
    cols_red_list = [crossings(row, diff) for row in rows_list]
    print(len(cols_red_list))

class Simulation(object):
    def loop(self):
        num_of_workers = cpu_count()
        with Pool(num_of_workers) as pool:
            pool.map(workload, range(10))
            pool.close()

sim1 = Simulation()
sim1.loop()
For me (and my two cores) this is roughly twice as fast as the sequential version.
Update with shared memory:
import numpy as np
from multiprocessing import Pool, cpu_count, Array
from functools import partial

n_int = 10
ROW_COUNT = 2500

### WORKER

diff = None
result = None

def init_worker(*args):
    global diff, result
    (diff, result) = args

def crossings(i):
    result[i] = next(idx for idx, val in enumerate(diff[i*300:(i+1)*300]) if val < 0)

### MAIN

class Simulation():
    def loop(self):
        num_of_workers = cpu_count()

        diff = Array('d', range(ROW_COUNT*300), lock=False)
        result = Array('i', ROW_COUNT, lock=False)

        # Shared memory needs to be passed when workers are spawned
        pool = Pool(num_of_workers, initializer=init_worker, initargs=(diff, result))

        for i in range(n_int):
            # SLOW, I assume you use a different source of values anyway.
            diff[:] = np.random.uniform(-1, 1, ROW_COUNT*300)

            pool.map(partial(crossings), range(ROW_COUNT))
            print(len(result))
        pool.close()

sim1 = Simulation()
sim1.loop()
A few notes:
Shared memory needs to be set up at worker creation, so it's global anyway.
This still isn't faster than the sequential version, but that's mainly because np.random.uniform's output needs to be copied entirely into shared memory. I assume those are just values for testing, and in reality you'd fill it differently anyway.
I only pass indices to the worker, and use them to read and write values to the shared memory.
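One possible refinement of the shared-memory version (a sketch only, not benchmarked here): because Array(..., lock=False) exposes a raw buffer, it can be wrapped in a NumPy view with np.frombuffer, which makes both filling the shared memory and scanning a row much cheaper than element-wise access:

import numpy as np
from multiprocessing import Array

ROW_COUNT, COL_COUNT = 2500, 300

diff = Array('d', ROW_COUNT * COL_COUNT, lock=False)
# NumPy view onto the same shared buffer; no data is copied
diff_np = np.frombuffer(diff, dtype=np.float64).reshape(ROW_COUNT, COL_COUNT)

# Filling through the view is a single vectorized write into shared memory
diff_np[:] = np.random.uniform(-1, 1, (ROW_COUNT, COL_COUNT))

# Inside a worker, the first negative column of a row can also be found without
# a Python-level loop (argmax returns 0 for rows with no negative value,
# which would need separate handling)
first_negative = (diff_np < 0).argmax(axis=1)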
Related
Hey guys, I have a script that compares each possible pair of users and checks how similar their text is:
dictionary = {
    t.id: (
        t.text,
        t.set,
        t.compare_string
    )
    for t in dataframe.itertuples()
}

highly_similar = []

for a, b in itertools.combinations(dictionary.items(), 2):
    if a[1][2] == b[1][2] and not a[1][1].isdisjoint(b[1][1]):
        similarity_score = fuzz.ratio(a[1][0], b[1][0])
        if (similarity_score >= 95 and len(a[1][0]) >= 10) or similarity_score == 100:
            highly_similar.append([a[0], b[0], a[1][0], b[1][0], similarity_score])
This script takes around 15 minutes to run. The dataframe contains 120k users, so comparing each possible combination takes quite a bit of time; if I just write pass in the for loop, it takes 2 minutes to loop through all values.
I tried using filter() and map() for the if statements and fuzzy score but the performance was worse. I tried improving the script as much as I could but I don't know how I can improve this further.
Would really appreciate some help!
It is slightly complicated to reason about the data since you have not attached it, but we can see multiple places that might provide an improvement:
First, let's rewrite the code in a way which is easier to reason about than using the indices:
dictionary = {
    t.id: (
        t.text,
        t.set,
        t.compare_string
    )
    for t in dataframe.itertuples()
}

highly_similar = []

for a, b in itertools.combinations(dictionary.items(), 2):
    a_id, (a_text, a_set, a_compare_string) = a
    b_id, (b_text, b_set, b_compare_string) = b
    if (a_compare_string == b_compare_string
            and not a_set.isdisjoint(b_set)):
        similarity_score = fuzz.ratio(a_text, b_text)
        if ((similarity_score >= 95 and len(a_text) >= 10)
                or similarity_score == 100):
            highly_similar.append(
                [a_id, b_id, a_text, b_text, similarity_score])
You seem to only care about pairs having the same compare_string value. Therefore, and assuming this is not something that all pairs share, we can key by that value to cover far fewer pairs.
To put some numbers on it, let's say you have 120K inputs and 1K items for each distinct compare_string value. Then instead of covering 120K * 120K ≈ 14 * 10^9 combinations, you would have 120 bins of size 1K (where in each bin we'd need to check all pairs), i.e. 120 * 1K * 1K = 120 * 10^6, which is about 1000 times faster. And it would be even faster if each bin had fewer than 1K elements.
import collections

# Create a dictionary from compare_string to all items
# with the same compare_string
items_by_compare_string = collections.defaultdict(list)
for item in dictionary.items():
    compare_string = item[1][2]
    items_by_compare_string[compare_string].append(item)

# Iterate over each group of items that have the same
# compare string
for item_group in items_by_compare_string.values():
    # Check pairs only within that group
    for a, b in itertools.combinations(item_group, 2):
        a_id, (a_text, a_set, _) = a
        b_id, (b_text, b_set, _) = b
        # No need to compare the compare_strings!
        if not a_set.isdisjoint(b_set):
            similarity_score = fuzz.ratio(a_text, b_text)
            if ((similarity_score >= 95 and len(a_text) >= 10)
                    or similarity_score == 100):
                highly_similar.append(
                    [a_id, b_id, a_text, b_text, similarity_score])
But, what if we want more speed? Let's look at the remaining operations:
We have a check to find if two sets share at least one item
This seems like an obvious candidate for optimization if we have any knowledge about these sets (to allow us to determine which pairs are even relevant to compare)
Without additional knowledge, and just looking at every pair and trying to speed this up, I doubt we can do much - this is probably highly optimized using internal details of Python sets, so I don't think it's likely that we can optimize it further
We have a fuzz.ratio computation, which is an external function that I'm going to assume is heavy
If you are using this from the FuzzyWuzzy package, make sure to install python-Levenshtein to get the speedups detailed here
We have some comparisons which we are unlikely to be able to speed up
We might be able to cache the length of a_text by nesting the two loops, but that's negligible
We have appends to a list, which runs on average ("amortized") constant time per operation, so we can't really speed that up
Therefore, I don't think we can reasonably suggest any more speedups without additional knowledge. If we know something about the sets that can help optimize which pairs are relevant we might be able to speed things up further, but I think this is about it.
EDIT: As pointed out in other answers, you can obviously run the code in multi-threading. I assumed you were looking for an algorithmic change that would possibly reduce the number of operations significantly, instead of just splitting these over more CPUs.
Essentially, from the Python programming side, I see two things that can improve your processing time:
multithreading and vectorized operations
From the fuzzy-score side, here is a list of tips you can use to improve your processing time (open in a new private/anonymous tab to avoid the paywall):
https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536
Using multiple threads, you can speed up your operation by up to N times, where N is the number of threads on your CPU. You can check it with:
import multiprocessing
multiprocessing.cpu_count()
Using vectorized operations you can parallel process your operations in low level with SIMD (single instruction / multiple data) operations, or with gpu tensor operations (like those in tensorflow/pytorch).
Here is a small comparison of results for each case:
import numpy as np
import time

A = [np.random.rand(512) for i in range(2000)]
B = [np.random.rand(512) for i in range(2000)]

high_similarity = []

def measure(i, j, a, b, high_similarity):
    d = ((a - b) ** 2).sum()
    if d > 12:
        high_similarity.append((i, j, d))

start_single_thread = time.time()
for i in range(len(A)):
    for j in range(len(B)):
        if i < j:
            measure(i, j, A[i], B[j], high_similarity)
finish_single_thread = time.time()
print("single thread time:", finish_single_thread - start_single_thread)
out[0] single thread time: 147.64517450332642
Running on multiple threads:
from threading import Thread

high_similarity = []

def measure(a=None, b=None, high_similarity=None):
    d = ((a - b) ** 2).sum()
    if d > 12:
        high_similarity.append(d)

start_multi_thread = time.time()
for i in range(len(A)):
    for j in range(len(B)):
        if i < j:
            thread = Thread(target=measure, kwargs={'a': A[i], 'b': B[j], 'high_similarity': high_similarity})
            thread.start()
            thread.join()
finish_multi_thread = time.time()
print("time to run on multi threads:", finish_multi_thread - start_multi_thread)
out[1] time to run on multi-threads: 11.946279764175415
A_array = np.array(A)
B_array = np.array(B)

start_vectorized = time.time()
for i in range(len(A_array)):
    # vectorized distance operation over all aligned rows at once
    dists = ((A_array - B_array) ** 2).sum(axis=1)
    high_similarity += dists[dists > 12].tolist()
    # rotate B_array by one row so that every pair gets compared eventually
    aux = B_array[-1]
    B_array = np.delete(B_array, -1, axis=0)
    B_array = np.insert(B_array, 0, aux, axis=0)
finish_vectorized = time.time()
print("time to run vectorized operations:", finish_vectorized - start_vectorized)
Note that you can't guarantee any order of execution, so you will also need to store the index of each result. The snippet of code is just to illustrate that you can use parallel processing, but I highly recommend using a pool of threads and dividing your dataset into N subsets, one for each worker, then joining the final results (instead of creating a thread for each function call like I did).
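A hedged sketch of that last recommendation, using concurrent.futures.ThreadPoolExecutor and splitting the index range into one chunk per worker; the names (measure_chunk, N_WORKERS) are illustrative, and the actual speedup depends on how much of the work NumPy can run with the GIL released:

import numpy as np
from concurrent.futures import ThreadPoolExecutor

A = np.random.rand(2000, 512)
B = np.random.rand(2000, 512)
N_WORKERS = 4

def measure_chunk(rows):
    # Each worker handles a contiguous block of i indices and returns (i, j, d) hits,
    # so the indices of the results are preserved.
    hits = []
    for i in rows:
        d = ((A[i] - B[i + 1:]) ** 2).sum(axis=1)  # compare A[i] against all later rows of B
        for offset in np.nonzero(d > 12)[0]:
            hits.append((i, i + 1 + offset, d[offset]))
    return hits

chunks = np.array_split(np.arange(len(A)), N_WORKERS)
with ThreadPoolExecutor(max_workers=N_WORKERS) as executor:
    high_similarity = [hit for chunk_hits in executor.map(measure_chunk, chunks)
                       for hit in chunk_hits]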
I am currently generating a nested dictionary that saves some arrays by using a nested for loop. Unfortunately, it takes quite some time; I realized that the server I am working on has a few cores available, so I was wondering if Python's multiprocessing library could be helpful to speed up the creation of the dictionary.
The nested for loop looks something like this (the actual computation is heavier and more complex):
import numpy as np

data_dict = {}
for s in range(1, 5):
    data_dict[s] = {}
    for d in range(1, 5):
        if s * d > 4:
            data_dict[s][d] = np.zeros((s, d))
        else:
            data_dict[s][d] = np.ones((s, d))
So this is what I tried:
from multiprocessing import Pool
import numpy as np

data_dict = {}

def process():
    #sci=fits.open('{}.fits'.format(name))
    for s in range(1, 5):
        data_dict[s] = {}
        for d in range(1, 5):
            if s * d > 4:
                data_dict[s][d] = np.zeros((s, d))
            else:
                data_dict[s][d] = np.ones((s, d))

if __name__ == '__main__':
    pool = Pool()  # Create a multiprocessing Pool
    pool.map(process)
But pool.map (the last line) seems to require an iterable, and I'm not sure what to pass for it.
In my opinion, the real problem is what kind of processing is needed to compute the entries of the dictionary and how many entries there are.
The kind of processing is essential to understand whether multiprocessing can significantly speed up the creation of the dictionary. If your computation is I/O bound, you should use multithreading, while if it's CPU bound you should use multiprocessing. You can find more about this here.
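As a rough illustration of that rule of thumb (a sketch with placeholder tasks, not code from the question):

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def io_bound_task(i):
    time.sleep(0.1)  # stands in for waiting on disk/network
    return i

def cpu_bound_task(i):
    return sum(k * k for k in range(10**6))  # pure-Python CPU work, held back by the GIL in threads

if __name__ == '__main__':
    # I/O bound -> threads: workers spend their time waiting, so the GIL is not the bottleneck
    with ThreadPoolExecutor(max_workers=8) as pool:
        io_results = list(pool.map(io_bound_task, range(16)))

    # CPU bound -> processes: each worker gets its own interpreter and its own GIL
    with ProcessPoolExecutor(max_workers=4) as pool:
        cpu_results = list(pool.map(cpu_bound_task, range(16)))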
Assuming that the value of each entry can be computed independently and that this computation is CPU bound, let's benchmark the difference between single process and multiprocess implementation (based on multiprocessing library).
The following code is used to test the two approaches in some scenarios, varying the complexity of the computation needed for each entry and the number of entries (for the multiprocess implementation, 7 processes were used).
import timeit
import numpy as np

def some_fun(s, d, n=1):
    """A function with an adaptable complexity"""
    a = s * np.ones(np.random.randint(1, 10, (2,))) / (d + 1)
    for _ in range(n):
        a += np.random.random(a.shape)
    return a

# Code to create dictionary with only one process
setup_simple = "from __main__ import some_fun, n_first_level, n_second_level, complexity"
code_simple = """
data_dict = {}
for s in range(n_first_level):
    data_dict[s] = {}
    for d in range(n_second_level):
        data_dict[s][d] = some_fun(s, d, n=complexity)
"""

# Code to create a dictionary with multiprocessing: we are going to use all the available cores except 1
setup_mp = """import numpy as np
import multiprocessing as mp
import itertools
from functools import partial
from __main__ import some_fun, n_first_level, n_second_level, complexity

n_processes = mp.cpu_count() - 1
# Uncomment if you want to know how many concurrent processes are you going to use
# print(f'{n_processes} concurrent processes')
"""
code_mp = """
with mp.Pool(processes=n_processes) as pool:
    dict_values = pool.starmap(partial(some_fun, n=complexity), itertools.product(range(n_first_level), range(n_second_level)))
data_dict = {
    k: dict(zip(range(n_second_level), dict_values[k * n_second_level: (k + 1) * n_second_level]))
    for k in range(n_first_level)
}
"""

# Time the code with different settings
print('Execution time on 10 repetitions: mean [std]')
for label, complexity, n_first_level, n_second_level in (
        ("TRIVIAL FUNCTION", 0, 10, 10),
        ("TRIVIAL FUNCTION", 0, 500, 500),
        ("SIMPLE FUNCTION", 5, 500, 500),
        ("COMPLEX FUNCTION", 50, 100, 100),
        ("HEAVY FUNCTION", 1000, 10, 10),
):
    print(f'\n{label}, {n_first_level * n_second_level} dictionary entries')
    for l, t in (
            ('Single process', timeit.repeat(stmt=code_simple, setup=setup_simple, number=1, repeat=10)),
            ('Multiprocess', timeit.repeat(stmt=code_mp, setup=setup_mp, number=1, repeat=10)),
    ):
        print(f'\t{l}: {np.mean(t):.3e} [{np.std(t):.3e}] seconds')
These are the results:
Execution time on 10 repetitions: mean [std]

TRIVIAL FUNCTION, 100 dictionary entries
    Single process: 7.752e-04 [7.494e-05] seconds
    Multiprocess: 1.163e-01 [2.024e-03] seconds

TRIVIAL FUNCTION, 250000 dictionary entries
    Single process: 7.077e+00 [7.098e-01] seconds
    Multiprocess: 1.383e+00 [7.752e-02] seconds

SIMPLE FUNCTION, 250000 dictionary entries
    Single process: 1.405e+01 [1.422e+00] seconds
    Multiprocess: 2.858e+00 [5.742e-01] seconds

COMPLEX FUNCTION, 10000 dictionary entries
    Single process: 1.557e+00 [4.330e-02] seconds
    Multiprocess: 5.383e-01 [5.330e-02] seconds

HEAVY FUNCTION, 100 dictionary entries
    Single process: 3.181e-01 [5.026e-03] seconds
    Multiprocess: 1.171e-01 [2.494e-03] seconds
As you can see, assuming that you have a CPU-bound computation, the multiprocess approach achieves better results in most scenarios. Only if you have a very light computation for each entry and/or a very limited number of entries should the single-process approach be preferred.
On the other hand, the improvement provided by multiprocessing comes with a cost: for example, if your computation for each entry uses a significant amount of memory, you could incur an OutOfMemory error, meaning that you have to improve your code and make it more complex to avoid it, finding the right balance between memory occupation and decrease in execution time. If you look around, there are a lot of questions asking how to solve memory issues caused by a non-optimal use of multiprocessing. In other words, this means that your code will be less easy to read and maintain.
To sum up, you should judge whether the improvement in execution time is worth it, even if it is possible.
I am using the threading library to accelerate the calculation of each point's neighborhood in a point cloud, by calling the function CalculateAllPointsNeighbors shown at the bottom of the post.
The function receives a search radius, a maximum number of neighbors, and a number of threads to split the work over. No changes are made to any of the points, and each point stores its data in its own np.ndarray cell, accessed by its own index.
The following function times how long it takes N number of threads to finish calculating all points neighborhoods:
def TimeFuncThreads(classObj, uptothreads):
    listTimers = []
    startNum = 1
    EndNum = uptothreads + 1
    for i in range(startNum, EndNum):
        print("Current Number of Threads to Test: ", i)
        tempT = time.time()
        classObj.CalculateAllPointsNeighbors(searchRadius=0.05, maxNN=25, maxThreads=i)
        tempT = time.time() - tempT
        listTimers.append(tempT)

    PlotXY(np.arange(startNum, EndNum), listTimers)
The problem is, I've been getting very different results on each run. Here are the plots from 5 consecutive runs of the function TimeFuncThreads (the X axis is the number of threads, Y is the runtime). First, they look totally random, and second, there is no significant acceleration.
I'm now confused about whether I'm using the threading library wrong, and about what this behavior I'm seeing actually is.
The function that handles the threading and the function that is being called from each thread:
def CalculateAllPointsNeighbors(self, searchRadius=0.20, maxNN=50, maxThreads=8):
    threadsList = []
    pointsIndices = np.arange(self.numberOfPoints)
    splitIndices = np.array_split(pointsIndices, maxThreads)
    for i in range(maxThreads):
        threadsList.append(threading.Thread(target=self.GetPointsNeighborsByID,
                                            args=(splitIndices[i], searchRadius, maxNN)))
    [t.start() for t in threadsList]
    [t.join() for t in threadsList]

def GetPointsNeighborsByID(self, idx, searchRadius=0.05, maxNN=20):
    if isinstance(idx, int):
        idx = [idx]
    for currentPointIndex in idx:
        currentPoint = self.pointsOpen3D.points[currentPointIndex]
        pointNeighborhoodObject = self.GetPointNeighborsByCoordinates(currentPoint, searchRadius, maxNN)
        self.pointsNeighborsArray[currentPointIndex] = pointNeighborhoodObject
        self.__RotatePointNeighborhood(currentPointIndex)
It pains me to be the one to introduce you to the Python GIL (Global Interpreter Lock). It is a very nice feature that makes parallelism using threads in Python a nightmare.
If you really want to improve your code's speed, you should be looking at the multiprocessing module instead.
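The class internals above (Open3D point containers, private methods) won't pickle easily, so a direct port is not shown; but as a hedged sketch of the general pattern, the work could be expressed as a module-level function over plain arrays and split into one index chunk per process (find_neighbors below is an illustrative brute-force stand-in, not part of the original code):

import numpy as np
from multiprocessing import Pool, cpu_count

def find_neighbors(args):
    # Illustrative worker: receives a chunk of point indices plus the raw points,
    # and returns (index, neighbor_indices) pairs for that chunk.
    indices, points, search_radius, max_nn = args
    out = []
    for i in indices:
        d = np.linalg.norm(points - points[i], axis=1)
        nearest = np.argsort(d)[1:max_nn + 1]          # skip the point itself
        out.append((i, nearest[d[nearest] <= search_radius]))
    return out

if __name__ == '__main__':
    points = np.random.rand(10000, 3)
    chunks = np.array_split(np.arange(len(points)), cpu_count())
    with Pool(cpu_count()) as pool:
        results = pool.map(find_neighbors,
                           [(c, points, 0.05, 25) for c in chunks])
    # Flatten back into one index -> neighbors mapping
    neighbors_by_index = {i: n for chunk in results for i, n in chunk}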
I am testing the parallel capabilities of Python3, which I intend to use in my code. I observe unexpectedly slow behaviour, and so I boil down my code to the following proof of principle. Let's calculate a simple logarithmic series. Let's do it serial, and in parallel using 1 core. One would imagine that the timing for these two examples would be the same, except for a small overhead associated with initializing and closing the multiprocessing.Pool class. However, what I observe is that the overhead grows linearly with problem size, and thus the parallel solution on 1 core is significantly worse relative to the serial solution even for large inputs. Please tell me if I am doing something wrong
import time
import numpy as np
import multiprocessing
import matplotlib.pyplot as plt

def foo(x):
    return sum([np.log(1 + i*x) for i in range(10)])

def serial_series(rangeMax):
    return [foo(x) for x in range(rangeMax)]

def parallel_series_1core(rangeMax):
    pool = multiprocessing.Pool(processes=1)
    rez = pool.map(foo, tuple(range(rangeMax)))
    pool.terminate()
    pool.join()
    return rez

nTask = [1 + i ** 2 * 1000 for i in range(1, 2)]
nTimeSerial = []
nTimeParallel = []
for taskSize in nTask:
    print('TaskSize', taskSize)
    start = time.time()
    rez = serial_series(taskSize)
    end = time.time()
    nTimeSerial.append(end - start)

    start = time.time()
    rez = parallel_series_1core(taskSize)
    end = time.time()
    nTimeParallel.append(end - start)

plt.plot(nTask, nTimeSerial)
plt.plot(nTask, nTimeParallel)
plt.legend(['serial', 'parallel 1 core'])
plt.show()
Edit:
It was commented that the overhead may be due to creating multiple jobs. Here is a modification of the parallel function that should explicitly make only 1 job. I still observe linear growth of the overhead
def parallel_series_1core(rangeMax):
    pool = multiprocessing.Pool(processes=1)
    rez = pool.map(serial_series, [rangeMax])
    pool.terminate()
    pool.join()
    return rez
Edit 2: Once more, the exact code that produces linear growth. It can be verified, with a print statement inside the serial_series function, that it is only called once for each call of parallel_series_1core.
import time
import numpy as np
import multiprocessing
import matplotlib.pyplot as plt

def foo(x):
    return sum([np.log(1 + i*x) for i in range(10)])

def serial_series(rangeMax):
    return [foo(i) for i in range(rangeMax)]

def parallel_series_1core(rangeMax):
    pool = multiprocessing.Pool(processes=1)
    rez = pool.map(serial_series, [rangeMax])
    pool.terminate()
    pool.join()
    return rez

nTask = [1 + i ** 2 * 1000 for i in range(1, 20)]
nTimeSerial = []
nTimeParallel = []
for taskSize in nTask:
    print('TaskSize', taskSize)
    start = time.time()
    rez1 = serial_series(taskSize)
    end = time.time()
    nTimeSerial.append(end - start)

    start = time.time()
    rez2 = parallel_series_1core(taskSize)
    end = time.time()
    nTimeParallel.append(end - start)

plt.plot(nTask, nTimeSerial)
plt.plot(nTask, nTimeParallel)
plt.plot(nTask, [i / j for i, j in zip(nTimeParallel, nTimeSerial)])
plt.legend(['serial', 'parallel 1 core', 'ratio'])
plt.show()
When you use Pool.map() you're essentially telling it to split the passed iterable into jobs over all available sub-processes (which is one in your case) - the larger the iterable, the more 'jobs' are created on the first call. That's what initially adds a huge, albeit linear, overhead (trumped only by the process creation itself).
Since sub-processes do not share memory, data needs to be pickled on one end and unpickled on the other - all changing data on POSIX systems (due to forking), and all data (even static) on Windows. Plus, it takes time to clear out the process stack for the next job, and there is overhead from system thread switching (which is out of your control - you'd have to mess with the system's scheduler to reduce it).
For simple/quick tasks a single process will always trump multiprocessing.
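As a side note (not a fix for the pickling overhead discussed below), Pool.map also accepts a chunksize argument, so you can control how the iterable is split into jobs instead of relying on the default heuristic. A minimal sketch based on the question's foo:

import numpy as np
from multiprocessing import Pool

def foo(x):
    return sum([np.log(1 + i * x) for i in range(10)])

if __name__ == '__main__':
    with Pool(processes=1) as pool:
        # A large chunksize means fewer, bigger jobs: less per-job dispatch
        # and pickling overhead, at the cost of coarser load balancing.
        rez = pool.map(foo, range(10000), chunksize=1000)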
UPDATE - As I was saying above, the additional overhead comes from the fact that for any data exchange between processes Python transparently does pickling/unpickling routine. Since the list you return from the serial_series() function grows in size over time, so does the performance penalty for pickling/unpickling. Here's a simple demonstration of it based on your code:
import math
import pickle
import sys
import time

# multi-platform precision timer
get_timer = time.clock if sys.platform == "win32" else time.time

def foo(x):  # logic/computation function
    return sum([math.log(1 + i*x) for i in range(10)])

def serial_series(max_range):  # main sub-process function
    return [foo(i) for i in range(max_range)]

def serial_series_slave(max_range):  # subprocess interface
    return pickle.dumps(serial_series(pickle.loads(max_range)))

def serial_series_master(max_range):  # main process interface
    return pickle.loads(serial_series_slave(pickle.dumps(max_range)))

tasks = [1 + i ** 2 * 1000 for i in range(1, 20)]
simulated_times = []
for task in tasks:
    print("Simulated task size: {}".format(task))
    start = get_timer()
    res = serial_series_master(task)
    simulated_times.append((task, get_timer() - start))
At the end, simulated_times will contain something like:
[(1001, 0.010015994115533963), (4001, 0.03402641167313844), (9001, 0.06755546622419131),
(16001, 0.1252664260421834), (25001, 0.18815836740279515), (36001, 0.28339434475444325),
(49001, 0.3757235840503601), (64001, 0.4813749807557435), (81001, 0.6115452710446636),
(100001, 0.7573718332506543), (121001, 0.9228750064147522), (144001, 1.0909038813527427),
(169001, 1.3017281342479343), (196001, 1.4830192955746764), (225001, 1.7117389965616931),
(256001, 1.9392146632682739), (289001, 2.19192682050668), (324001, 2.4497541011649187),
(361001, 2.7481495578097466)]
showing clear greater-than-linear processing time increase as the list grows bigger. This is what essentially happens with multiprocessing - if your sub-process function didn't return anything it would end up considerably faster.
If you have a large amount of data you need to share among processes, I'd suggest you to use some in-memory database (like Redis) and have your sub-processes connect to it to store/retrieve data.
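A rough sketch of that idea with the redis client (this assumes a Redis server running on localhost:6379 and uses an arbitrary key name; it is not taken from the question's code):

import pickle
import numpy as np
import redis

r = redis.Redis(host='localhost', port=6379)

# Main process: store the large object once instead of pickling it into every task
diff = np.random.uniform(-1, 1, (2500, 300))
r.set('diff', pickle.dumps(diff))

# In each worker process: fetch it once (e.g. from a Pool initializer) and reuse it
def init_worker():
    global diff
    diff = pickle.loads(redis.Redis(host='localhost', port=6379).get('diff'))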
I need to accumulate the results of many trees for some query that outputs a large result. Since all trees can be handled independently it is embarrassingly parallel, except for the fact that the results need to be summed and I cannot store the intermediate results for all trees in memory. Below is a simple example of code for the problem which saves all the intermediate results in memory (of course the functions are never the same in the real problem, since that would be doing duplicated work).
import numpy as np
from joblib import Parallel, delayed

functions = [[abs, np.round] for i in range(500)]  # Dummy functions
functions = [function for sublist in functions for function in sublist]
X = np.random.normal(size=(5, 5))  # Dummy data

def helper_function(function, X=X):
    return function(X)

results = Parallel(n_jobs=-1,)(
    map(delayed(helper_function), [functions[i] for i in range(1000)]))

results_out = np.zeros(results[0].shape)
for result in results:
    results_out += result
A solution could be the following modification:
import numpy as np
from joblib import Parallel, delayed

functions = [[abs, np.round] for i in range(500)]  # Dummy functions
functions = [function for sublist in functions for function in sublist]
X = np.random.normal(size=(5, 5))  # Dummy data

results_out = np.zeros(X.shape)  # shape of each intermediate result

def helper_function(function, X=X, results=results_out):
    result = function(X)
    results += result

Parallel(n_jobs=-1,)(
    map(delayed(helper_function), [functions[i] for i in range(1000)]))
But this might cause races, so it is not optimal.
Do you have any suggestions for performing this without storing the intermediate results, while still keeping it parallel?
The answer is given in the documentation of joblib.
from math import sqrt
from joblib import Parallel, delayed

with Parallel(n_jobs=2) as parallel:
    accumulator = 0.
    n_iter = 0
    while accumulator < 1000:
        results = parallel(delayed(sqrt)(accumulator + i ** 2)
                           for i in range(5))
        accumulator += sum(results)  # synchronization barrier
        n_iter += 1
You can do the calculation in chunks and reduce the chunk as you are about to run out of memory.
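A minimal sketch of that chunking idea, based on the dummy setup from the question (the chunk size of 100 is arbitrary; pick whatever number of intermediate results fits in memory):

import numpy as np
from joblib import Parallel, delayed

functions = [f for pair in [[abs, np.round] for _ in range(500)] for f in pair]
X = np.random.normal(size=(5, 5))

def helper_function(function, X=X):
    return function(X)

CHUNK_SIZE = 100
results_out = np.zeros(X.shape)

with Parallel(n_jobs=-1) as parallel:
    for start in range(0, len(functions), CHUNK_SIZE):
        chunk = functions[start:start + CHUNK_SIZE]
        # Only CHUNK_SIZE intermediate results are ever held in memory at once
        chunk_results = parallel(delayed(helper_function)(f) for f in chunk)
        results_out += sum(chunk_results)  # reduce (synchronization barrier)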