I have the following code structure that I would like to speed up with multi-threading, as every computation is independent. For every point in a 2D grid, I call a compute_at_point() function and get the result.
How can I parallelize this with threads? I know the computers that will be used have at least 4 cores. Code below:
for i in range(0, grid_rows):
    for j in range(0, grid_cols):
        grid_point = input_grid[i, j]
        res_at_point = compute_at_point(input_grid, grid_point)
        output_grid[i, j] = res_at_point
Where input_grid and output_grid have the same shape, and every computation is independent.
Using multiprocessing.Pool and functools.partial, we can do the following:
import numpy as np
from functools import partial
from multiprocessing import Pool
input_grid = np.random.random((3,4))
grid_rows = input_grid.shape[0]
grid_cols = input_grid.shape[1]
def compute_at_point(input_grid, grid_point):
    return np.mean(input_grid) + grid_point

with Pool(processes=4) as p:
    output_grid = p.map(partial(compute_at_point, input_grid), input_grid.flatten())

output_grid = np.array(output_grid).reshape(grid_rows, grid_cols)
The simplest way to solve multiprocessing problems is to have one iterable and let the multiprocessing library iterate over it. In this instance, we could pass the input_grid indices, or just flatten the array and iterate over its values.
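For example, here is a sketch of the index-based variant (reusing compute_at_point, input_grid, grid_rows, and grid_cols from the example above; compute_at_index is a helper introduced here only for illustration):

from itertools import product

def compute_at_index(input_grid, idx):
    i, j = idx
    return compute_at_point(input_grid, input_grid[i, j])

with Pool(processes=4) as p:
    flat = p.map(partial(compute_at_index, input_grid),
                 product(range(grid_rows), range(grid_cols)))
output_grid = np.array(flat).reshape(grid_rows, grid_cols)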
There is this stackoverflow post that really nicely shows a way to calculate the proximity matrix of a RandomForestClassifier().
Proximity Matrix in sklearn.ensemble.RandomForestClassifier
Nevertheless, the for-loop in that script is quite slow if you have a large dataframe. I tried to parallelize this for-loop, but unsuccessfully. I only get 'None' as an output.
How can I parallelize this for-loop in Spyder 4 running Python 3.8.5 on Windows 10?
proxMat = 1*np.equal.outer(a, a)
for i in range(1, nTrees):
    a = terminals[:,i]
    proxMat += 1*np.equal.outer(a, a)
Here you want to perform a reduce operation, so parallelization is not obvious.
You did not specify how you tried to parallelize the loop.
A simple way to parallelize:
import multiprocessing
pool = multiprocessing.Pool(processes=4)
def get_outer(i):
    return np.equal.outer(terminals[:,i], terminals[:,i])

todo = list(range(1, nTrees))
results = pool.map(get_outer, todo)

proxMat = 1*np.equal.outer(a, a)
for res in results:
    proxMat += res
I'm not sure this one would help, but possibly you'd have fewer pickling problems:
import multiprocessing
pool = multiprocessing.Pool(processes=4)
def get_outer(t):
    return np.equal.outer(t, t)

# This part might be costly!
terms = [terminals[:,i] for i in range(1, nTrees)]
results = pool.map(get_outer, terms)

proxMat = 1*np.equal.outer(a, a)
for res in results:
    proxMat += res
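As a side note, the final accumulation over results can also be collapsed into a single NumPy call (a small sketch, using the same results list and initial a as above):

proxMat = 1*np.equal.outer(a, a) + np.sum(results, axis=0)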
I'm trying to repeatedly run a function that requires a few positional arguments and involves random number generation (to generate many samples of a distribution). For a MWE, I think this captures everything:
import numpy as np
import multiprocessing as mup
from functools import partial
def rarr(xsize, ysize, k):
    return np.random.rand(xsize, ysize)

def clever_array(nsamp, xsize=100, ysize=100, ncores=None):
    np.random.seed()
    if ncores is None:
        p = mup.Pool()
    else:
        p = mup.Pool(ncores)
    out = p.map_async(partial(rarr, xsize, ysize), range(nsamp))
    p.close()
    return np.array(out.get())
Note that the final positional argument for rarr() is just a dummy variable, since I am using map_async(), which requires an iterable. Now if I run %timeit clever_array(500, ncores = 1) I get 208 ms, whereas %timeit clever_array(500, ncores = 5) takes 149 ms. So there is definitely some kind of parallelism happening (the speedup isn't terribly impressive for this MWE but is decent in my real code).
However, I'm wondering a few things -- is there a more natural implementation other than the dummy variable for rarr() passed as an iterable to map_async to run this many times? Is there any obvious way to pass the xsize and ysize args to rarr() other than partial()? And is there any way to ensure different results from the different cores other than initializing a different random.seed() every time?
Thanks for any help!
Typically when we use multiprocessing we would expect different results from each invocation of a function, therefore it doesn't quite make sense to call the same function many times. In order to ensure the randomness of the sampling output, it is best to separate the random state (seed) from the function itself. The best approach as recommended by the numpy official doc is to use a np.random.Generator object, created via np.random.default_rng([seed]). With that we can modify your code to
import numpy as np
import multiprocessing as mup
from functools import partial
def rarr(xsize, ysize, rng):
    return rng.random((xsize, ysize))

def clever_array(nsamp, xsize=100, ysize=100, ncores=None):
    if ncores is None:
        p = mup.Pool()
    else:
        p = mup.Pool(ncores)
    out = p.map_async(partial(rarr, xsize, ysize), map(np.random.default_rng, range(nsamp)))
    p.close()
    return np.array(out.get())
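If you want explicitly independent streams rather than generators seeded 0..nsamp-1, the numpy docs recommend spawning child seeds from a single SeedSequence; a minimal sketch of that variant (reusing rarr from above; the seed parameter is an addition for illustration):

def clever_array(nsamp, xsize=100, ysize=100, ncores=None, seed=None):
    # one child seed per sample gives statistically independent streams
    child_seeds = np.random.SeedSequence(seed).spawn(nsamp)
    rngs = [np.random.default_rng(s) for s in child_seeds]
    with mup.Pool(ncores) as p:
        out = p.map_async(partial(rarr, xsize, ysize), rngs)
        return np.array(out.get())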
The following for loop is part of an iterative simulation process and is the main bottleneck regarding computational time:
import numpy as np
class Simulation(object):
    def __init__(self, n_int):
        self.n_int = n_int
    def loop(self):
        for itr in range(self.n_int):
            # some preceding code which updates rows_list and diff with every itr
            cols_red_list = []
            rows_list = list(range(2500))  # row idx for diff where negative element is known to appear
            diff = np.random.uniform(-1.323, 3.780, (2500, 300))  # np.random.uniform is just used as toy example
            for row in rows_list:
                col = next(idx for idx, val in enumerate(diff[row,:]) if val < 0)
                cols_red_list.append(col)
            # some subsequent code which uses the cols_red_list data

sim1 = Simulation(n_int=10)
sim1.loop()
Hence, I tried to parallelize it with the multiprocessing package, in the hope of reducing the computation time:
import numpy as np
from multiprocessing import Pool, cpu_count
from functools import partial
def crossings(row, diff):
    return next(idx for idx, val in enumerate(diff[row,:]) if val < 0)

class Simulation(object):
    def __init__(self, n_int):
        self.n_int = n_int
    def loop(self):
        for itr in range(self.n_int):
            # some preceding code which updates rows_list and diff with every itr
            rows_list = list(range(2500))
            diff = np.random.uniform(-1, 1, (2500, 300))
            if __name__ == '__main__':
                num_of_workers = cpu_count()
                print('number of CPUs : ', num_of_workers)
                pool = Pool(num_of_workers)
                cols_red_list = pool.map(partial(crossings, diff=diff), rows_list)
                pool.close()
                print(len(cols_red_list))
            # some subsequent code which uses the cols_red_list data

sim1 = Simulation(n_int=10)
sim1.loop()
Unfortunately, the parallelization turns out to be much slower than the sequential piece of code.
Hence my question: Did I use the multiprocessing package properly in that particular example? Are there alternative ways to parallelize the above-mentioned for loop?
Disclaimer: As you're trying to reduce the runtime of your code through parallelisation, this doesn't strictly answer your question but it might still be a good learning opportunity.
As a golden rule, before moving to multiprocessing to improve performance (execution time), one should first optimise the single-threaded case.
Your
rows_list = list(range(2500))
generates the numbers 0 to 2499 (that's the range) and stores them in memory (the list), which takes time both to allocate the required memory and to do the actual writes. You then use each of these predictable values only once, reading them back from memory (which also takes time), in a predictable order:
for row in rows_list:
This is particularly relevant to the runtime of your loop function as you do it repeatedly (for itr in range(n_int):).
Instead, consider generating the number only when you need it, without an intermediate store (which conceptually removes any need to access RAM):
for row in range(2500):
Secondly, on top of sharing the same issue (unnecessary accesses to memory), the following:
diff = np.random.uniform(-1, 1, (2500, 300))
# ...
col = next(idx for idx, val in enumerate(diff[row,:]) if val < 0)
seems to me to be optimisable at the level of math (or logic).
What you're trying to do is get a random variable (that col index) by defining it as "the first time I encounter a random variable in [-1;1] that is lower than 0". But notice that figuring out if a random variable with a uniform distribution over [-α;α] is negative, is the same as having a random variable over {0,1} (i.e. a bool).
Therefore, you're now working with bools instead of floats, and you don't even have to do the comparison (val < 0) as you already have a bool. This potentially makes the code much faster. Using the same idea as for rows_list, you can generate that bool only when you need it, testing until it is True (or False, choose one; it obviously doesn't matter). By doing so, you only generate as many random bools as you need, not more and not less (BTW, what happens in your code if none of the 300 elements in the row is negative? ;) ):
import itertools
import random

for _ in range(n_int):
    cols_red_list = []
    for row in range(2500):
        col = next(i for i in itertools.count() if random.getrandbits(1))
        cols_red_list.append(col)
or, with list comprehension:
cols_red_list = [next(i for i in itertools.count() if random.getrandbits(1))
                 for _ in range(2500)]
I'm sure that, through proper statistical analysis, you can even express that col random variable as a non-uniform variable over [0; limit[, allowing you to compute it much faster.
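Indeed, under the simplified 50/50 assumption above, col follows a geometric distribution, so the whole batch can be drawn in one vectorized call (a sketch; with the original uniform(-1.323, 3.780) bounds the per-element probability of a negative value would be 1.323/5.103 rather than 0.5):

import numpy as np

# 0-based index of the first "negative" draw; numpy's geometric() counts
# trials starting at 1, hence the -1.
cols_red_list = np.random.geometric(p=0.5, size=2500) - 1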
Please test the performance of an "optimized" version of your single-threaded implementation first. If the runtime is still not acceptable, you should then look into parallelisation.
multiprocessing uses system processes (not threads!) for parallelization, which require expensive IPC (inter-process communication) to share data.
This bites you in two spots:
diff = np.random.uniform(-1, 1, (2500, 300)) creates a large matrix which is expensive to pickle/copy to another process
rows_list = list(range(2500)) creates a smaller list, but the same applies here.
To avoid this expensive IPC, you have one and a half choices:
If on a POSIX-compliant system, initialize your variables on the module level, that way each process gets a quick-and-dirty copy of the required data. This is not scalable as it requires POSIX, weird architecture (you probably don't want to put everything on the module level), and doesn't support sharing changes to that data.
Use shared memory. This mostly supports only primitive data types, but mp.Array should cover your needs.
The second problem is that setting up a pool is expensive, as num_cpu processes need to be started. Your workload is small enough to be negligible compared to this overhead. A good practice is to only create one pool and reuse it.
Here is a quick-and-dirty example of the POSIX only solution:
import numpy as np
from multiprocessing import Pool, cpu_count
from functools import partial
n_int = 10
rows_list = np.array(range(2500))
diff = np.random.uniform(-1, 1, (2500, 300))
def crossings(row, diff):
    return next(idx for idx, val in enumerate(diff[row,:]) if val < 0)

def workload(_):
    cols_red_list = [crossings(row, diff) for row in rows_list]
    print(len(cols_red_list))

class Simulation(object):
    def loop(self):
        num_of_workers = cpu_count()
        with Pool(num_of_workers) as pool:
            pool.map(workload, range(10))
            pool.close()

sim1 = Simulation()
sim1.loop()
For me (and my two cores) this is roughly twice as fast as the sequential version.
Update with shared memory:
import numpy as np
from multiprocessing import Pool, cpu_count, Array
from functools import partial
n_int = 10
ROW_COUNT = 2500
### WORKER
diff = None
result = None
def init_worker(*args):
    global diff, result
    (diff, result) = args

def crossings(i):
    result[i] = next(idx for idx, val in enumerate(diff[i*300:(i+1)*300]) if val < 0)

### MAIN
class Simulation():
    def loop(self):
        num_of_workers = cpu_count()
        diff = Array('d', range(ROW_COUNT*300), lock=False)
        result = Array('i', ROW_COUNT, lock=False)
        # Shared memory needs to be passed when workers are spawned
        pool = Pool(num_of_workers, initializer=init_worker, initargs=(diff, result))
        for i in range(n_int):
            # SLOW, I assume you use a different source of values anyway.
            diff[:] = np.random.uniform(-1, 1, ROW_COUNT*300)
            pool.map(partial(crossings), range(ROW_COUNT))
            print(len(result))
        pool.close()

sim1 = Simulation()
sim1.loop()
A few notes:
Shared memory needs to be set up at worker creation, so it's global anyway.
This still isn't faster than the sequential version, but that's mainly because the np.random.uniform output needs to be copied entirely into shared memory. I assume those are just values for testing, and in reality you'd fill it differently anyway.
I only pass indices to the worker, and use them to read and write values to the shared memory.
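A possible refinement (a sketch, not part of the answer above): because the Arrays are created with lock=False they expose a plain buffer, so each worker can wrap them once in NumPy views and use vectorized operations per row instead of a Python generator:

import numpy as np

def init_worker(shared_diff, shared_result):
    global diff_view, result_view
    # assumes C double == float64 and C int == int32 on your platform
    diff_view = np.frombuffer(shared_diff, dtype=np.float64).reshape(ROW_COUNT, 300)
    result_view = np.frombuffer(shared_result, dtype=np.int32)

def crossings(i):
    # np.argmax returns the position of the first True (0 if no value is negative)
    result_view[i] = np.argmax(diff_view[i] < 0)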
I need to apply the same function onto every row in a numpy array and store the result again in a numpy array.
# states will contain results of function applied to a row in array
states = np.empty_like(array)
for i, ar in enumerate(array):
    states[i] = function(ar, *args)
# do some other stuff on states
function does some non-trivial filtering of my data and returns an array whose values reflect where the conditions are True and where they are False. function can be either pure Python or Cython-compiled. The filtering operations on the rows are complicated and can depend on previous values in the row, which means I can't operate on the whole array in an element-by-element fashion.
Is there a way to do something like this in dask for example?
Dask solution
You could do this with dask.array by chunking the array by row, calling map_blocks, then computing the result:
import dask.array as da

ar = ...
x = da.from_array(ar, chunks=(1, ar.shape[1]))
x = x.map_blocks(function, *args)
states = x.compute()
By default this will use threads; you can switch to processes with the scheduler keyword:
states = x.compute(scheduler='processes')
Pool solution
However, dask is probably overkill for an embarrassingly parallel computation like this; you could get by with a thread pool:
from multiprocessing.pool import ThreadPool
pool = ThreadPool()
ar = ...
states = np.empty_like(ar)

def f(i):
    states[i] = function(ar[i], *args)

pool.map(f, range(len(ar)))
And you could switch to processes with the following change
from multiprocessing import Pool
pool = Pool()
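A caveat with that switch: with processes, each worker gets its own copy of states, so the in-place writes from f are lost. A minimal sketch that collects return values in the parent instead (assuming function and ar are defined at module level so they can be pickled):

from multiprocessing import Pool
import numpy as np

def g(i):
    return function(ar[i], *args)

if __name__ == '__main__':
    with Pool() as pool:
        # build the output array from the returned rows
        states = np.array(pool.map(g, range(len(ar))))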
Turn your function into a universal function: http://docs.scipy.org/doc/numpy/reference/ufuncs.html.
Then: states = function(array, *args).
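A hedged sketch of that idea: writing a true compiled ufunc takes C, but np.vectorize with a gufunc-style signature gives the same row-wise calling convention (it still loops in Python under the hood, so it is about convenience rather than raw speed). The function body below is a stand-in for illustration only:

import numpy as np

def function(row):
    # hypothetical row-wise operation that depends on previous values in the row
    return np.cumsum(row)

# '(n)->(n)' applies `function` to each 1-D row of a 2-D input
row_func = np.vectorize(function, signature='(n)->(n)')

array = np.random.rand(5, 4)
states = row_func(array)   # same shape as `array`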
I'm looking for a simple process-based parallel map for python, that is, a function
parmap(function,[data])
that would run function on each element of [data] on a different process (well, on a different core, but AFAIK, the only way to run stuff on different cores in python is to start multiple interpreters), and return a list of results.
Does something like this exist? I would like something simple, so a simple module would be nice. Of course, if no such thing exists, I will settle for a big library :-/
It seems like what you need is the map method of multiprocessing.Pool():
map(func, iterable[, chunksize])
A parallel equivalent of the map() built-in function (it supports only
one iterable argument though). It blocks till the result is ready.
This method chops the iterable into a number of chunks which it submits to the
process pool as separate tasks. The (approximate) size of these chunks can be
specified by setting chunksize to a positive integer.
For example, if you wanted to map this function:
def f(x):
    return x**2
to range(10), you could do it using the built-in map() function:
map(f, range(10))
or using a multiprocessing.Pool() object's method map():
import multiprocessing
pool = multiprocessing.Pool()
print(pool.map(f, range(10)))
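One hedged addition: on platforms that use the spawn start method (notably Windows, mentioned in one of the questions above), the pool creation should be guarded so that child processes don't re-execute it on import:

import multiprocessing

def f(x):
    return x**2

if __name__ == '__main__':
    pool = multiprocessing.Pool()
    print(pool.map(f, range(10)))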
This can be done elegantly with Ray, a system that allows you to easily parallelize and distribute your Python code.
To parallelize your example, you'd need to define your map function with the @ray.remote decorator, and then invoke it with .remote. This ensures that every instance of the remote function will be executed in a different process.
import time
import ray
ray.init()
# Define the function you want to apply map on, as remote function.
@ray.remote
def f(x):
    # Do some work...
    time.sleep(1)
    return x*x

# Define a helper parmap(f, list) function.
# This function executes a copy of f() on each element in "list".
# Each copy of f() runs in a different process.
# Note f.remote(x) returns a future of its result (i.e.,
# an identifier of the result) rather than the result itself.
def parmap(f, list):
    return [f.remote(x) for x in list]
# Call parmap() on a list consisting of first 5 integers.
result_ids = parmap(f, range(1, 6))
# Get the results
results = ray.get(result_ids)
print(results)
This will print:
[1, 4, 9, 16, 25]
and it will finish in approximately len(list)/p seconds (rounded up to the nearest integer), where p is the number of cores on your machine. Assuming a machine with 2 cores, our example will execute in ceil(5/2), i.e., approximately 3 seconds.
There are a number of advantages of using Ray over the multiprocessing module. In particular, the same code will run on a single machine as well as on a cluster of machines. For more advantages of Ray see this related post.
Python 3's Pool class has a map() method, and that's all you need to parallelize map:
from multiprocessing import Pool
with Pool() as P:
    xtransList = P.map(some_func, a_list)
with Pool() as P creates a process pool that will execute each item in the list in parallel. You can provide the number of cores:
with Pool(processes=4) as P:
For those who are looking for a Python equivalent of R's mclapply(), here is my implementation. It is an improvement on the following two examples:
"Parallelize Pandas map() or apply()", as mentioned by #Rafael
Valero.
How to apply map to functions with multiple arguments.
It can be applied to map functions with a single argument or multiple arguments.
import numpy as np, pandas as pd
from scipy import sparse
import functools, multiprocessing
from multiprocessing import Pool
num_cores = multiprocessing.cpu_count()
def parallelize_dataframe(df, func, U=None, V=None):
    #blockSize = 5000
    num_partitions = 5 # int( np.ceil(df.shape[0]*(1.0/blockSize)) )
    blocks = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    if V is not None and U is not None:
        # apply func with multiple arguments to dataframe (i.e. involves multiple columns)
        df = pd.concat(pool.map(functools.partial(func, U=U, V=V), blocks))
    else:
        # apply func with one argument to dataframe (i.e. involves single column)
        df = pd.concat(pool.map(func, blocks))
    pool.close()
    pool.join()
    return df

def square(x):
    return x**2

def test_func(data):
    print("Process working on: ", data.shape)
    data["squareV"] = data["testV"].apply(square)
    return data

def vecProd(row, U, V):
    return np.sum( np.multiply(U[int(row["obsI"]),:], V[int(row["obsJ"]),:]) )

def mProd_func(data, U, V):
    data["predV"] = data.apply( lambda row: vecProd(row, U, V), axis=1 )
    return data

def generate_simulated_data():
    N, D, nnz, K = [302, 184, 5000, 5]
    I = np.random.choice(N, size=nnz, replace=True)
    J = np.random.choice(D, size=nnz, replace=True)
    vals = np.random.sample(nnz)
    sparseY = sparse.csc_matrix((vals, (I, J)), shape=[N, D])
    # Generate parameters U and V which could be used to reconstruct the matrix Y
    U = np.random.sample(N*K).reshape([N,K])
    V = np.random.sample(D*K).reshape([D,K])
    return sparseY, U, V

def main():
    Y, U, V = generate_simulated_data()
    # find row, column indices and observed values for sparse matrix Y
    (testI, testJ, testV) = sparse.find(Y)
    colNames = ["obsI", "obsJ", "testV", "predV", "squareV"]
    dtypes = {"obsI":int, "obsJ":int, "testV":float, "predV":float, "squareV": float}
    obsValDF = pd.DataFrame(np.zeros((len(testV), len(colNames))), columns=colNames)
    obsValDF["obsI"] = testI
    obsValDF["obsJ"] = testJ
    obsValDF["testV"] = testV
    obsValDF = obsValDF.astype(dtype=dtypes)
    print("Y.shape: {!s}, #obsVals: {}, obsValDF.shape: {!s}".format(Y.shape, len(testV), obsValDF.shape))
    # calculate the square of testVals
    obsValDF = parallelize_dataframe(obsValDF, test_func)
    # reconstruct prediction of testVals using parameters U and V
    obsValDF = parallelize_dataframe(obsValDF, mProd_func, U, V)
    print("obsValDF.shape after reconstruction: {!s}".format(obsValDF.shape))
    print("First 5 elements of obsValDF:\n", obsValDF.iloc[:5,:])

if __name__ == '__main__':
    main()
I know this is an old post, but just in case, I wrote a tool to make this super, super easy called parmapper (I actually call it parmap in my use but the name was taken).
It handles a lot of the setup and teardown of processes and adds tons of features. In rough order of importance:
Can take lambda and other unpickleable functions
Can apply starmap and other similar call methods, making it very easy to use directly.
Can split amongst both threads and/or processes
Includes features such as progress bars
It does incur a small cost but for most uses, that is negligible.
I hope you find it useful.
(Note: It, like map in Python 3+, returns an iterable so if you expect all results to pass through it immediately, use list())