Block-wise array writing with Python multiprocessing - python

I know there are a lot of topics around similar problems (like How do I make processes able to write in an array of the main program?, Multiprocessing - Shared Array or Multiprocessing a loop of a function that writes to an array in python), but I just don't get it... so sorry for asking again.
I need to do some stuff with a huge array and want to speed up things by splitting it into blocks and running my function on those blocks, with each block being run in its own process. Problem is: the blocks are "cut" from one array and the result shall then be written into a new, common array. This is what I did so far (minimum working example; don't mind the array-shaping, this is necessary for my real-world case):
import numpy as np
import multiprocessing as mp
def calcArray(array, blocksize, n_cores=1):
in_shape = (array.shape[0] * array.shape[1], array.shape[2])
input_array = array[:, :, :array.shape[2]].reshape(in_shape)
result_array = np.zeros(array.shape)
# blockwise loop
pix_count = array.size
for position in range(0, pix_count, blocksize):
if position + blocksize < array.shape[0] * array.shape[1]:
num = blocksize
else:
num = pix_count - position
result_part = input_array[position:position + num, :] * 2
result_array[position:position + num] = result_part
# finalize result
final_result = result_array.reshape(array.shape)
return final_result
if __name__ == '__main__':
start = time.time()
img = np.ones((4000, 4000, 4))
result = calcArray(img, blocksize=100, n_cores=4)
print 'Input:\n', img
print '\nOutput:\n', result
How can I now implement multiprocessing in way that I set a number of cores and then calcArray assigns processes to each block until n_cores is reached?
With the much appreciated help of #Blownhither Ma, the code now looks like this:
import time, datetime
import numpy as np
from multiprocessing import Pool
def calculate(array):
return array * 2
if __name__ == '__main__':
start = time.time()
CORES = 4
BLOCKSIZE = 100
ARRAY = np.ones((4000, 4000, 4))
pool = Pool(processes=CORES)
in_shape = (ARRAY.shape[0] * ARRAY.shape[1], ARRAY.shape[2])
input_array = ARRAY[:, :, :ARRAY.shape[2]].reshape(in_shape)
result_array = np.zeros(input_array.shape)
# do it
pix_count = ARRAY.size
handles = []
for position in range(0, pix_count, BLOCKSIZE):
if position + BLOCKSIZE < ARRAY.shape[0] * ARRAY.shape[1]:
num = BLOCKSIZE
else:
num = pix_count - position
### OLD APPROACH WITH NO PARALLELIZATION ###
# part = calculate(input_array[position:position + num, :])
# result_array[position:position + num] = part
### NEW APPROACH WITH PARALLELIZATION ###
handle = pool.apply_async(func=calculate, args=(input_array[position:position + num, :],))
handles.append(handle)
# finalize result
### OLD APPROACH WITH NO PARALLELIZATION ###
# final_result = result_array.reshape(ARRAY.shape)
### NEW APPROACH WITH PARALLELIZATION ###
final_result = [h.get() for h in handles]
final_result = np.concatenate(final_result, axis=0)
print 'Done!\nDuration (hh:mm:ss): {duration}'.format(duration=datetime.timedelta(seconds=time.time() - start))
The code runs and really starts the number processes I assigned, but takes much much longer than the old approach with just using the loop "as-is" (3 sec compared to 1 min). There must be something missing here.

The core function is pool.apply_async and handler.get.
I have been recently working on the same functions and find it useful to make a standard utility function. balanced_parallel applies function fn on matrix a in a parallel manner silently. assigned_parallel explicitly apply function on each element.
i. The way I split array is np.array_split. You may use block scheme instead.
ii. I use concat rather than assign to a empty matrix when collecting result. There is no shared memory.
from multiprocessing import cpu_count, Pool
def balanced_parallel(fn, a, processes=None, timeout=None):
""" apply fn on slice of a, return concatenated result """
if processes is None:
processes = cpu_count()
print('Parallel:\tstarting {} processes on input with shape {}'.format(processes, a.shape))
results = assigned_parallel(fn, np.array_split(a, processes), timeout=timeout, verbose=False)
return np.concatenate(results, 0)
def assigned_parallel(fn, l, processes=None, timeout=None, verbose=True):
""" apply fn on each element of l, return list of results """
if processes is None:
processes = min(cpu_count(), len(l))
pool = Pool(processes=processes)
if verbose:
print('Parallel:\tstarting {} processes on {} elements'.format(processes, len(l)))
# add jobs to the pool
handler = [pool.apply_async(fn, args=x if isinstance(x, tuple) else (x, )) for x in l]
# pool running, join all results
results = [handler[i].get(timeout=timeout) for i in range(len(handler))]
pool.close()
return results
In your case, fn would be
def _fn(matrix_part): return matrix_part * 2
result = balanced_parallel(_fn, img)
Follow-up:
Your loop should look like this to make parallelization happen.
handles = []
for position in range(0, pix_count, BLOCKSIZE):
if position + BLOCKSIZE < ARRAY.shape[0] * ARRAY.shape[1]:
num = BLOCKSIZE
else:
num = pix_count - position
handle = pool.apply_async(func=calculate, args=(input_array[position:position + num, :], ))
handles.append(handle)
# multiple handlers exist at this moment!! Don't `.get()` yet
results = [h.get() for h in handles]
results = np.concatenate(results, axis=0)

Related

Can't use Python multiprocessing with large amount of calculations

I have to speed up my current code to do around 10^6 operations in a feasible time. Before I used multiprocessing in my the actual document I tried to do it in a mock case. Following is my attempt:
def chunkIt(seq, num):
avg = len(seq) / float(num)
out = []
last = 0.0
while last < len(seq):
out.append(seq[int(last):int(last + avg)])
last += avg
return out
def do_something(List):
# in real case this function takes about 0.5 seconds to finish for each
iteration
turn = []
for e in List:
turn.append((e[0]**2, e[1]**2,e[2]**2))
return turn
t1 = time.time()
List = []
#in the real case these 20's can go as high as 150
for i in range(1,20-2):
for k in range(i+1,20-1):
for j in range(k+1,20):
List.append((i,k,j))
t3 = time.time()
test = []
List = chunkIt(List,3)
if __name__ == '__main__':
with concurrent.futures.ProcessPoolExecutor() as executor:
results = executor.map(do_something,List)
for result in results:
test.append(result)
test= np.array(test)
t2 = time.time()
T = t2-t1
T2 = t3-t1
However, when I increase the size of my "List" my computer tires to use all of my RAM and CPU and freezes. I even cut my "List" into 3 pieces so it will only use 3 of my cores. However, nothing changed. Also, when I tried to use it on a smaller data set I noticed the code ran much slower than when it ran on a single core.
I am still very new to multiprocessing in Python, am I doing something wrong. I would appreciate it if you could help me.
To reduce memory usage, I suggest you use instead the multiprocessing module and specifically the imap method method (or imap_unordered method). Unlike the map method of either multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor, the iterable argument is processed lazily. What this means is that if you use a generator function or generator expression for the iterable argument, you do not need to create the complete list of arguments in memory; as a processor in the pool become free and ready to execute more tasks, the generator will be called upon to generate the next argument for the imap call.
By default a chunksize value of 1 is used, which can be inefficient for a large iterable size. When using map and the default value of None for the chunksize argument, the pool will look at the length of the iterable first converting it to a list if necessary and then compute what it deems to be an efficient chunksize based on that length and the size of the pool. When using imap or imap_unordered, converting the iterable to a list would defeat the whole purpose of using that method. But if you know what that size would be (more or less) if the iterable were converted to a list, then there is no reason not to apply the same chunksize calculation the map method would have, and that is what is done below.
The following benchmarks perform the same processing first as a single process and then using multiprocessing using imap where each invocation of do_something on my desktop takes approximately .5 seconds. do_something now has been modified to just process a single i, k, j tuple as there is no longer any need to break up anything into smaller lists:
from multiprocessing import Pool, cpu_count
import time
def half_second():
HALF_SECOND_ITERATIONS = 10_000_000
sum = 0
for _ in range(HALF_SECOND_ITERATIONS):
sum += 1
return sum
def do_something(tpl):
# in real case this function takes about 0.5 seconds to finish for each iteration
half_second() # on my desktop
return tpl[0]**2, tpl[1]**2, tpl[2]**2
"""
def generate_tpls():
for i in range(1, 20-2):
for k in range(i+1, 20-1):
for j in range(k+1, 20):
yield i, k, j
"""
# Use smaller number of tuples so we finish in a reasonable amount of time:
def generate_tpls():
# 64 tuples:
for i in range(1, 5):
for k in range(1, 5):
for j in range(1, 5):
yield i, k, j
def benchmark1():
""" single processing """
t = time.time()
for tpl in generate_tpls():
result = do_something(tpl)
print('benchmark1 time:', time.time() - t)
def compute_chunksize(iterable_size, pool_size):
""" This is more-or-less the function used by the Pool.map method """
chunksize, remainder = divmod(iterable_size, 4 * pool_size)
if remainder:
chunksize += 1
return chunksize
def benchmark2():
""" multiprocssing """
t = time.time()
pool_size = cpu_count() # 8 logical cores (4 physical cores)
N_TUPLES = 64 # number of tuples that will be generated
pool = Pool(pool_size)
chunksize = compute_chunksize(N_TUPLES, pool_size)
for result in pool.imap(do_something, generate_tpls(), chunksize=chunksize):
pass
print('benchmark2 time:', time.time() - t)
if __name__ == '__main__':
benchmark1()
benchmark2()
Prints:
benchmark1 time: 32.261038303375244
benchmark2 time: 8.174998044967651
The nested For loops creating the array before the main definition appears to be the problem. Moving that part to underneath the main definition clears up any memory problems.
def chunkIt(seq, num):
avg = len(seq) / float(num)
out = []
last = 0.0
while last < len(seq):
out.append(seq[int(last):int(last + avg)])
last += avg
return out
def do_something(List):
# in real case this function takes about 0.5 seconds to finish for each
iteration
turn = []
for e in List:
turn.append((e[0]**2, e[1]**2,e[2]**2))
return turn
if __name__ == '__main__':
t1 = time.time()
List = []
#in the real case these 20's can go as high as 150
for i in range(1,20-2):
for k in range(i+1,20-1):
for j in range(k+1,20):
List.append((i,k,j))
t3 = time.time()
test = []
List = chunkIt(List,3)
with concurrent.futures.ProcessPoolExecutor() as executor:
results = executor.map(do_something,List)
for result in results:
test.append(result)
test= np.array(test)
t2 = time.time()
T = t2-t1
T2 = t3-t1

How to make 6 calculation as fast as possible based on one datastream?

I have one stream of data who is coming very fast, and when a new data arrive, I would like to make 6 different calculation based on it.
I would like to make those calculation as fast as possible so I can update as soon as I receive new data.
The data can arrive as fast as milliseconds so my calculation must be very fast.
So the best thing I was thinking of was to make those calculations on 6 different Threads at the same time.
I never used threads before so I don't know where to place it.
This is the code who describe my problem
What can I do from here?
import numpy as np
import time
np.random.seed(0)
def calculation_1(data, multiplicator):
r = np.log(data * (multiplicator+1))
return r
start = time.time()
for ii in range(1000000):
data_stream_main = [np.random.uniform(0, 2.0), np.random.uniform(10, 1000.0), np.random.uniform(0, 0.01)]
# calculation that has to be done together
calc_1 = calculation_1(data=data_stream_main[0], multiplicator=2)
calc_2 = calculation_1(data=data_stream_main[0], multiplicator=3)
calc_3 = calculation_1(data=data_stream_main[1], multiplicator=2)
calc_4 = calculation_1(data=data_stream_main[1], multiplicator=3)
calc_5 = calculation_1(data=data_stream_main[2], multiplicator=2)
calc_6 = calculation_1(data=data_stream_main[2], multiplicator=3)
print(calc_1)
print(calc_2)
print(calc_3)
print(calc_4)
print(calc_5)
print(calc_6)
print("total time:", time.time() - start)
You can use either class multiprocessing.pool.Pool or concurrent.futures.ProcessPoolExecutor to create a multiprocessing pool of 6 processes to which you can submit your 6 tasks in your loop to execute in parallel and await the results. The following example uses multiprocessing.pool.Pool.
But, the result will be very disappointing.
The problem is that (1) There is overhead in initially creating the 6 processes and (2) overhead in queueing up each task to execute in the different address space that the subprocesses live. This means that for multiprocessing to be advantageous, your worker function, calculation_1 in this case, needs to be a less-trivial, longer-running, more-CPU-intensive function. If you were to add to your worker function the following "do-nothing", CPU-intensive loop ...
cnt = 0
for i in range(100000):
cnt += 1
... then the following multiprocessing code would run several times more quickly. As is, stick with what you have.
import numpy as np
import multiprocessing as mp
import time
def calculation_1(data, multiplicator):
r = np.log(data * (multiplicator+1))
"""
cnt = 0
for i in range(100000):
cnt += 1
"""
return r
# required for Windows and other platforms that use spawn for creating new processes:
if __name__ == '__main__':
np.random.seed(0)
# no point in using more processes than processors:
n_processors = min(6, mp.cpu_count())
pool = mp.Pool(n_processors)
start = time.time()
for ii in range(1000000):
data_stream_main = [np.random.uniform(0, 2.0), np.random.uniform(10, 1000.0), np.random.uniform(0, 0.01)]
# calculation that has to be done together
# submit tasks:
result_1 = pool.apply_async(calculation_1, (data_stream_main[0], 2))
result_2 = pool.apply_async(calculation_1, (data_stream_main[0], 3))
result_3 = pool.apply_async(calculation_1, (data_stream_main[1], 2))
result_4 = pool.apply_async(calculation_1, (data_stream_main[1], 3))
result_5 = pool.apply_async(calculation_1, (data_stream_main[2], 2))
result_6 = pool.apply_async(calculation_1, (data_stream_main[2], 3))
# wait for results:
calc_1 = result_1.get()
calc_2 = result_2.get()
calc_3 = result_3.get()
calc_4 = result_4.get()
calc_5 = result_5.get()
calc_6 = result_6.get()
print(calc_1)
print(calc_2)
print(calc_3)
print(calc_4)
print(calc_5)
print(calc_6)
print("total time:", time.time() - start)
You could factorize the calculation by separating the log(data) from the log(multiplicator).
Given that np.log(data * (multiplicator+1)) is the same as np.log(data) + np.log(multiplicator+1), you can compute and store the 2 possible values of np.log(multiplicator+1) in global variables, then only compute log(data) once per index (thus saving 50%) on that part.
# global variables and calculation function:
multiplicator2 = np.log(3)
multiplicator3 = np.log(4)
def calculation_1(data):
logData = np.log(data)
return logData + multiplicator2, logData + multiplicator3
# in the loop:...
calc_1,calc_2 = calculation_1(data_stream_main[0])
calc_3,calc_4 = calculation_1(data_stream_main[1])
calc_5,calc_6 = calculation_1(data_stream_main[2])
If you can afford to buffer several rows of data into a numpy matrix before outputing the result, you may get some performance improvement by using numpy's parallelism to perform the calculation on the whole matrix (or chunk) and output the result in chunks instead of one row at a time. Separating reception of the data from computation and output is where the use of threads may provide a benefit.
For example:
start = time.time()
chunk = []
multiplicators = np.array([2,2,2,3,3,3])
for ii in range(1000000):
data_stream_main = [np.random.uniform(0, 2.0), np.random.uniform(10, 1000.0), np.random.uniform(0, 0.01)]
chunk.append(data_stream_main*2)
if len(chunk)< 1000: continue
# process 1000 lines at a time and output results
calcs = np.log(np.array(chunk)*multiplicators)
calc_1,calc_4,calc_2,calc_5,calc_3,calc6 = calcs[-1,:]
chunk = [] # reset chunk
print("total time:", time.time() - start) # 2.7 (compared to 6.6)

Using Multithreading or Multiprocessing to improve computational speed

I am iterating through very large file size [mesh]. Since the iteration is independent, I would like to split my mesh into smaller sizes and run them all at the same time in order to lower computation time. Below is a sample code. For example, if mesh is of length=50000, I would like to divide the mesh into 100 and run fun for each mesh/100 at the same time.
import numpy as np
def fnc(data, mesh):
d = []
for i, dummy_val in enumerate(mesh):
d.append(np.sqrt((data[:, 0]-mesh[i, 0])**2.0 + (data[:, 1]-mesh[i, 1])**2.0))
return d
interpolate = fnc(mydata, mymesh)
I would like to know to achieve this using multiprocessing or multithreading as I'm unable to reconcile it with the execution of my loop.
This will give you the general idea. I couldn't test this since I do not have your data. The default constructor for ProcessPoolExecutor will use the number of processors on your computer. But since that determines the level of multiprocessing you can have, it will probably be more efficient to set the N_CHUNKS parameter to the number of simultaneous processes you can support. That is, if you have a processing pool size of 6, then it is better to just divide your array into 6 large chunks and have 6 processes do the work rather than breaking it up into smaller pieces where processes will have to wait to run. So you should probably specify a specific max_workers number to the ProcessPoolExecutor not greater than the number of processors you have and set N_CHUNKS to the same value.
from concurrent.futures import ProcessPoolExecutor, as_completed
import numpy as np
def fnc(data, mesh):
d = []
for i, dummy_val in enumerate(mesh):
d.append(np.sqrt((data[:, 0]-mesh[i, 0])**2.0 + (data[:, 1]-mesh[i, 1])**2.0))
return d
def main(data, mesh):
#N_CHUNKS = 100
N_CHUNKS = 6 # assuming you have 6 processors; see max_workers parameter
n = len(mesh)
assert n != 0
if n <= N_CHUNKS:
N_CHUNKS = 1
chunk_size = n
last_chunk_size = n
else:
chunk_size = n // N_CHUNKS
last_chunk_size = n - chunk_size * (N_CHUNKS - 1)
with ProcessPoolExcutor(max_workers=N_CHUNKS) as executor: # assuming you have 6 processors
the_futures = {}
start = 0
for i in range(N_CHUNKS - 1):
future = executor.submit(fnc, data, mesh[start:start+chunk_size]) # pass slice
the_futures[future] = (start, start+chunk_size) # map future to request parameters
start += chunk_size
if last_chunk_size:
future = executor.submit(fnc, data, mesh[start:n]) # pass slice
the_futures[future] = (start, start+n)
for future in as_completed(the_futures):
(start, end) = the_futures[future] # the original range
d = future.result() # do something with the results
if __name__ == '__main__':
# the call to main must be done in a block governed by if __name__ == '__main__' or you will get into a recursive
# loop where each subprocess calls main again
main(data, mesh)

Python parallelised correlation slower than single process correlation

I wanted to parallelize df.corr() using multiprocessing module in Python. I'm taking one column and computing correlation values with rest all columns in one process and second column with rest other columns in another process. I'm continuing in this fashion to fill the upper traingle of correlation matrix by stacking up the result rows from all the processes.
I took sample data of shape (678461, 210) and tried my parallelized method and df.corr() and got running time of 214.40s and 42.64s respectively. So, my parallelized method is taking more time.
Is there a way to improve this?
import multiprocessing as mp
import pandas as pd
import numpy as np
from time import *
def _correlation(args):
i, mat, mask = args
ac = mat[i]
arr = []
for j in range(len(mat)):
if i > j:
continue
bc = mat[j]
valid = mask[i] & mask[j]
if valid.sum() < 1:
c = NA
elif i == j:
c = 1.
elif not valid.all():
c = np.corrcoef(ac[valid], bc[valid])[0, 1]
else:
c = np.corrcoef(ac, bc)[0, 1]
arr.append((j, c))
return arr
def correlation_multi(df):
numeric_df = df._get_numeric_data()
cols = numeric_df.columns
mat = numeric_df.values
mat = pd.core.common._ensure_float64(mat).T
K = len(cols)
correl = np.empty((K, K), dtype=float)
mask = np.isfinite(mat)
pool = mp.Pool(processes=4)
ret_list = pool.map(_correlation, [(i, mat, mask) for i in range(len(mat))])
for i, arr in enumerate(ret_list):
for l in arr:
j = l[0]
c = l[1]
correl[i, j] = c
correl[j, i] = c
return pd.DataFrame(correl, index = cols, columns = cols)
if __name__ == '__main__':
noise = pd.DataFrame(np.random.randint(0,100,size=(100000, 50)))
noise2 = pd.DataFrame(np.random.randint(100,200,size=(100000, 50)))
df = pd.concat([noise, noise2], axis=1)
#Single process correlation
start = time()
s = df.corr()
print('Time taken: ',time()-start)
#Multi process correlation
start = time()
s1 = correlation_multi(df)
print('Time taken: ',time()-start)
The results from _correlation have to be moved from the worker processes to the process running the Pool via interprocess communication.
This means that the return data is pickled, sent to the other process, unpickled and added to the result list.
This takes time and is by nature a sequential process.
And map processes the returns in the order they were sent, IIRC. So if one iteration takes relatively long, other results might be stuck waiting. You could try using imap_unordered which yields results as soon as they arrive.

Python multiprocessing: losing counts after pool.join()?

I'm trying to solve this problem where I store the locations and counts of substrings of a given length. As the strings can be long (genome sequences), I'm trying to use multiple processes to speed it up. While the program runs, the variables that store the object seem to lose all information once the threads end.
import numpy
import multiprocessing
from multiprocessing.managers import BaseManager, DictProxy
from collections import defaultdict, namedtuple, Counter
from functools import partial
import ctypes as c
class MyManager(BaseManager):
pass
MyManager.register('defaultdict', defaultdict, DictProxy)
def gc_count(seq):
return int(100 * ((seq.upper().count('G') + seq.upper().count('C') + 0.0) / len(seq)))
def getreads(length, table, counts, genome):
genome_len = len(genome)
for start in range(0,genome_len):
gc = gc_count(genome[start:start+length])
table[ (length, gc) ].append( (start) )
counts[length,gc] +=1
if __name__ == "__main__":
g = 'ACTACGACTACGACTACGCATCAGCACATACGCATACGCATCAACGACTACGCATACGACCATCAGATCACGACATCAGCATCAGCATCACAGCATCAGCATCAGCACTACAGCATCAGCATCAGCATCAG'
genome_len = len(g)
mgr = MyManager()
mgr.start()
m = mgr.defaultdict(list)
mp_arr = multiprocessing.Array(c.c_double, 10*101)
arr = numpy.frombuffer(mp_arr.get_obj())
count = arr.reshape(10,101)
pool = multiprocessing.Pool(9)
partial_getreads = partial(getreads, table=m, counts=count, genome=g)
pool.map(partial_getreads, range(1, 10))
pool.close()
pool.join()
for i in range(1, 10):
for j in range(0,101):
print count[i,j]
for i in range(1, 10):
for j in range(0,101):
print len(m[(i,j)])
The loops at the end will only print out 0.0 for each element in count and 0 for each list in m, so somehow I'm losing all the counts. If i print the counts within the getreads(...) function, I can see that the values are being increased. Conversely, printing len(table[ (length, gc) ]) in getreads(...) or len(m[(i,j)]) in the main body only results in 0.
You could also formulate your problem as a map-reduce problem, by which you would avoid sharing the data among multiple processes (i guess it would speed up the computation). You would just need to return the resulting table and counts from the function (map) and combine the results from all the processes (reduce).
Going back to your original question...
At the bottom of Managers there is a relevant note about
modifications to mutable values or items in dict and list. Basically, you
need to re-assign the modified object to the container proxy.
l = table[ (length, gc) ]
l.append( (start) )
table[ (length, gc) ] = l
There is also a relevant Stackoverflow post about combining pool map with Array.
Taking both into account you can do something like:
def getreads(length, table, genome):
genome_len = len(genome)
arr = numpy.frombuffer(mp_arr.get_obj())
counts = arr.reshape(10,101)
for start in range(0,genome_len):
gc = gc_count(genome[start:start+length])
l = table[ (length, gc) ]
l.append( (start) )
table[ (length, gc) ] = l
counts[length,gc] +=1
if __name__ == "__main__":
g = 'ACTACGACTACGACTACGCATCAGCACATACGCATACGCATCAACGACTACGCATACGACCATCAGATCACGACATCAGCATCAGCATCACAGCATCAGCATCAGCACTACAGCATCAGCATCAGCATCAG'
genome_len = len(g)
mgr = MyManager()
mgr.start()
m = mgr.defaultdict(list)
mp_arr = multiprocessing.Array(c.c_double, 10*101)
arr = numpy.frombuffer(mp_arr.get_obj())
count = arr.reshape(10,101)
pool = multiprocessing.Pool(9)
partial_getreads = partial(getreads, table=m, genome=g)
pool.map(partial_getreads, range(1, 10))
pool.close()
pool.join()
arr = numpy.frombuffer(mp_arr.get_obj())
count = arr.reshape(10,101)

Categories

Resources