I have a generator that returns me a certain string, how can I use it together with this code?
from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(4)
results = pool.map(my_function, my_array)
Instead of the array that is passed above, I use a generator, since the number of values does not fit in the array, and I do not need to store them.
Additional question:
In this case, the transition to the next iteration should be carried out inside the function?
Essentially the same question that was posed here.
The essence is that multiprocessing will convert any iterable without a __len__ method into a list.
There is an open issue to add support for generators but for now, you're SOL.
If your array is too big to fit into memory, consider reading it in in chunks, processing it, and dumping the results to disk in chunks. Without more context, I can't really provide a more concrete solution.
UPDATE:
Thanks for posting your code. My first question, is it absolutely necessary to use multiprocessing? Depending on what my_function does, you may see no benefit to using a ThreadPool as python is famously limited by the GIL so any CPU bound worker function wouldn't speed up. In this case, maybe a ProcessPool would be better. Otherwise, you are probably better off just running results = map(my_function, generator).
Of course, if you don't have the memory to load the input data, it is unlikely you will have the memory to store the results.
Secondly, you can improve your generator by using itertools
Try:
import itertools
import string
letters = string.ascii_lowercase
cod = itertools.permutations(letters, 6)
def my_function(x):
return x
def dump_results_to_disk(results, outfile):
with open(outfile, 'w') as fp:
for result in results:
fp.write(str(result) + '\n')
def process_in_chunks(generator, chunk_size=50):
accumulator = []
chunk_number = 1
for item in generator:
if len(accumulator) < chunk_size:
accumulator.append(item)
else:
results = list(map(my_function, accumulator))
dump_results_to_disk(results, "results" + str(chunk_number) + '.txt')
chunk_number += 1
accumulator = []
dump_results_to_disk(results, "results" + str(chunk_number))
process_in_chunks(cod)
Obviously, change my_function() to whatever your worker function is and maybe you want to do something instead of dumping to disk. You can scale chunk_size to however many entries can fit in memory. If you don't have the disk space or the memory for the results, then there's really no way for you process the data in aggregate
Related
I don't know how to parallelise a code in Python that takes each line of a FASTA file and makes some statistics, like compute GC content, of it. Do you have some tips or libraries that will help me to decrease the time spent in execution?
I've tried to use os.fork(), but it gives me more execution time than the sequential code. Probably is due to I don't know very well how to give each child a different sequence.
#Computing GC Content
from Bio import SeqIO
with open('chr1.fa', 'r') as f:
records = list (SeqIO.parse(f,'fasta'))
GC_for_sequence=[]
for i in records:
GC=0
for j in i:
if j in "GC":
GC+=1
GC_for_sequence.append(GC/len(i))
print(GC_for_sequence)
The expected execution would be: Each process takes one sequence, and they do the statistics in parallel.
a few notes on your existing code to start with:
I'd suggest not doing: list (SeqIO.parse(…)) as that will pause execution until all sequences have been loaded in memory, you're much better off (memory and total execution time) just leaving it as an iterator and consuming elements off to workers as needed
looping over each character is pretty slow, using str.count is going to be much faster
putting this together, you can do:
from Bio import SeqIO
with open('chr1.fa') as fd:
gc_for_sequence=[]
for seq in SeqIO.parse(fd, 'fasta'):
gc = sum(seq.seq.count(base) for base in "GC")
gc_for_sequence.append(gc / len(seq))
if this still isn't fast enough, then you can use the multiprocessing module like:
from Bio import SeqIO
from multiprocessing import Pool
def sequence_gc_prop(seq):
return sum(seq.count(base) for base in "GC") / len(seq)
with open('chr1.fa') as fd, Pool() as pool:
gc_for_sequence = pool.map(
sequence_gc_prop,
(seq.seq for seq in SeqIO.parse(fd, 'fasta')),
chunksize=1000,
)
comments from Lukasz mostly apply. other non-obvious stuff:
the weird seq.seq for seq in… stuff is to make sure that we're not Pickling unnecessary data
I'm setting chunksize to quite a large value because the function should be quick, hence we want to give children a reasonable amount of work to do so the parent process doesn't spend all its time orchestrating things
Here's one idea with standard multiprocessing module:
from multiprocessing import Pool
import numpy as np
no_cores_to_use = 4
GC_for_sequence = [np.random.rand(100) for x in range(10)]
with Pool(no_cores_to_use) as pool:
result = pool.map(np.average, GC_for_sequence)
print(result)
In the code I used numpy module to simulate a list with some content. pool.map takes function that you want to use on your data as first argument and data list as second. The function you can easily define yourself. By default, it should take a single argument. If you want to pass more, then use functools.partial.
[EDIT] Here's an example much closer to your problem:
from multiprocessing import Pool
import numpy as np
records = ['ACTGTCGCAGC' for x in range(10)]
no_cores_to_use = 4
def count(sequence):
count = sequence.count('GC')
return count
with Pool(no_cores_to_use) as pool:
result = pool.map(count, records)
print(sum(result))
I'm working on implementing a randomized algorithm in python. Since this involves doing the same thing many (say N) times, it rather naturally parallelizes and I would like to make use of that. More specifically, I want to distribute the N iterations on all of the cores of my CPU. The problem in question involves computing the maximum of something and is thus something where every worker could compute their own maximum and then only report that one back to the parent process, which then only needs to figure out the global maximum out of those few local maxima.
Somewhat surprisingly, this does not seem to be an intended use-case of the multiprocessing module, but I'm not entirely sure how else to go about it. After some research I came up with the following solution (toy problem to find the maximum in a list that is structurally the same as my actual one):
import random
import multiprocessing
l = []
N = 100
numCores = multiprocessing.cpu_count()
# globals for every worker
mySendPipe = None
myRecPipe = None
def doWork():
pipes = zip(*[multiprocessing.Pipe() for i in range(numCores)])
pool = multiprocessing.Pool(numCores, initializeWorker, (pipes,))
pool.map(findMax, range(N))
results = []
# collate results
for p in pipes[0]:
if p.poll():
results.append(p.recv())
print(results)
return max(results)
def initializeWorker(pipes):
global mySendPipe, myRecPipe
# ID of a worker process; they are consistently named PoolWorker-i
myID = int(multiprocessing.current_process().name.split("-")[1])-1
# Modulo: When starting a second pool for the second iteration of doWork() they are named with IDs 5-8.
mySendPipe = pipes[1][myID%numCores]
myRecPipe = pipes[0][myID%numCores]
def findMax(count):
myMax = 0
if myRecPipe.poll():
myMax = myRecPipe.recv()
value = random.choice(l)
if myMax < value:
myMax = value
mySendPipe.send(myMax)
l = range(1, 1001)
random.shuffle(l)
max1 = doWork()
l = range(1001, 2001)
random.shuffle(l)
max2 = doWork()
return (max1, max2)
This works, sort of, but I've got a problem with it. Namely, using pipes to store the intermediate results feels rather silly (and is probably slow). But it also has the real problem, that I can't send arbitrarily large things through the pipe, and my application unfortunately sometimes exceeds this size (and deadlocks).
So, what I'd really like is a function analogue to the initializer that I can call once for every worker on the pool to return their local results to the parent process. I could not find such functionality, but maybe someone here has an idea?
A few final notes:
I use a global variable for the input because in my application the input is very large and I don't want to copy it to every process. Since the processes never write to it, I believe it should never be copied (or am I wrong there?). I'm open to suggestions to do this differently, but mind that I need to run this on changing inputs (sequentially though, just like in the example above).
I'd like to avoid using the Manager-class, since (by my understanding) it introduces synchronisation and locks, which in this problem should be completely unnecessary.
The only other similar question I could find is Python's multiprocessing and memory, but they wish to actually process the individual results of the workers, whereas I do not want the workers to return N things, but to instead only run a total of N times and return only their local best results.
I'm using Python 2.7.15.
tl;dr: Is there a way to use local memory for every worker process in a multiprocessing pool, so that every worker can compute a local optimum and the parent process only needs to worry about figuring out which one of those is best?
You might be overthinking this a little.
By making your worker-functions (in this case findMax) actually return a value instead of communicating it, you can store the result from calling pool.map() - it is just a parallel variant of map, after all! It will map a function over a list of inputs and return the list of results of that function call.
The simplest example illustrating my point follows your "distributed max" example:
import multiprocessing
# [0,1,2,3,4,5,6,7,8]
x = range(9)
# split the list into 3 chunks
# [(0, 1, 2), (3, 4, 5), (6, 7, 8)]
input = zip(*[iter(x)]*3)
pool = multiprocessing.Pool(2)
# compute the max of each chunk:
# max((0,1,2)) == 2
# max((3,4,5)) == 5
# ...
res = pool.map(max, input)
print(res)
This returns [2, 5, 8].
Be aware that there is some light magic going on: I use the built-in max() function which expects iterables as input. Now, if I would only pool.map over a plain list of integers, say, range(9), that would result in calls to max(0), max(1) etc. - not very useful, huh? Instead, I partition my list into chunks, so effectively, when mapping, we now map over a list of tuples, thus feeding a tuple to max on each call.
So perhaps you have to:
return a value from your worker func
think about how you structure your input domain so that you feed meaningful chunks to each worker
PS: You wrote a great first question! Thank you, it was a pleasure reading it :) Welcome to StackOverflow!
During testing I find out in the following, MP method run a bit slower
def eat_time(j):
result = []
for j in range(10**4):
a = 0
for i in range(1000):
a += 101
result.append(a)
return result
if __name__ == '__main__':
#MP method
t = time.time()
pool = Pool()
result = []
data = pool.map(eat_time, [i for i in range(5)])
for d in data:
result += d
print(time.time()-t) #11s for my computer
#Normal method
t = time.time()
integers = []
for i in range(5):
integers += eat_time(i)
print(time.time()-t) #8s for my computer
However, if I don't require it to aggregate the data by changing eat_time() to
def eat_time(j):
result = []
for j in range(10**4):
a = 0
for i in range(1000):
a += 101
#result.append(a)
return result
The MP time is much faster and now for my computer just run 3s, while normal method still take 8s. (As expected)
It looks strange to me as result is declared individually in method, I don't expect appending completely ruin the MP.
May I know is there a correct way to do this? And why MP is slower when append involved?
Edited for comment
Thx for #torek and #akhavro clarify the point.
Yes, I understand creating process take times, that's why the problem raised.
Actually the original code put the for-loop outside and call the simple method again and again, it is a bit faster over normal method in significantly many task (my case more than 10**6 calls).
Therefore I change to put code inside and make the method a bit more complicated. By moving for j in range(10**4): this line into eat_time().
But it seems making the code complicated causes communication lag due to larger data size.
So, probably the answer is no way to solve it.
It is not append that causes your slowness but returning the result with appended elements. You can test it by changing your code to do the append but return only the first few elements of your result. Now it should work much faster again.
When you return your result from a Pool worker, this is in practice implemented as a queue from multiprocessing. It works but it is not a miracle performer, definitely much slower than just manipulating in-memory structures. When you return a lot of data, the queue needs to transmit a lot.
There is no easy workaround. You could try shared memory but I do not personally like it due to added complexity. The better way would be to redesign your application so that it does not need to transmit a lot of data between processes. For example, would it be possible to process data in your worker further so that you do not need to return it all but only a processed subset?
I am trying to parse many files found in a directory, however using multiprocessing slows my program.
# Calling my parsing function from Client.
L = getParsedFiles('/home/tony/Lab/slicedFiles') <--- 1000 .txt files found here.
combined ~100MB
Following this example from python documentation:
from multiprocessing import Pool
def f(x):
return x*x
if __name__ == '__main__':
p = Pool(5)
print(p.map(f, [1, 2, 3]))
I've written this piece of code:
from multiprocessing import Pool
from api.ttypes import *
import gc
import os
def _parse(pathToFile):
myList = []
with open(pathToFile) as f:
for line in f:
s = line.split()
x, y = [int(v) for v in s]
obj = CoresetPoint(x, y)
gc.disable()
myList.append(obj)
gc.enable()
return Points(myList)
def getParsedFiles(pathToFile):
myList = []
p = Pool(2)
for filename in os.listdir(pathToFile):
if filename.endswith(".txt"):
myList.append(filename)
return p.map(_pars, , myList)
I followed the example, put all the names of the files that end with a .txt in a list, then created Pools, and mapped them to my function. Then I want to return a list of objects. Each object holds the parsed data of a file. However it amazes me that I got the following results:
#Pool 32 ---> ~162(s)
#Pool 16 ---> ~150(s)
#Pool 12 ---> ~142(s)
#Pool 2 ---> ~130(s)
Graph:
Machine specification:
62.8 GiB RAM
Intel® Core™ i7-6850K CPU # 3.60GHz × 12
What am I missing here ?
Thanks in advance!
Looks like you're I/O bound:
In computer science, I/O bound refers to a condition in which the time it takes to complete a computation is determined principally by the period spent waiting for input/output operations to be completed. This is the opposite of a task being CPU bound. This circumstance arises when the rate at which data is requested is slower than the rate it is consumed or, in other words, more time is spent requesting data than processing it.
You probably need to have your main thread do the reading and add the data to the pool when a subprocess becomes available. This will be different to using map.
As you are processing a line at a time, and the inputs are split, you can use fileinput to iterate over lines of multiple files, and map to a function processing lines instead of files:
Passing one line at a time might be too slow, so we can ask map to pass chunks, and can adjust until we find a sweet-spot. Our function parses chunks of lines:
def _parse_coreset_points(lines):
return Points([_parse_coreset_point(line) for line in lines])
def _parse_coreset_point(line):
s = line.split()
x, y = [int(v) for v in s]
return CoresetPoint(x, y)
And our main function:
import fileinput
def getParsedFiles(directory):
pool = Pool(2)
txts = [filename for filename in os.listdir(directory):
if filename.endswith(".txt")]
return pool.imap(_parse_coreset_points, fileinput.input(txts), chunksize=100)
In general it is never a good idea to read from the same physical (spinning) hard disk from different threads simultaneously, because every switch causes an extra delay of around 10ms to position the read head of the hard disk (would be different on SSD).
As #peter-wood already said, it is better to have one thread reading in the data, and have other threads processing that data.
Also, to really test the difference, I think you should do the test with some bigger files. For example: current hard disks should be able to read around 100MB/sec. So reading the data of a 100kB file in one go would take 1ms, while positioning the read head to the beginning of that file would take 10ms.
On the other hand, looking at your numbers (assuming those are for a single loop) it is hard to believe that being I/O bound is the only problem here. Total data is 100MB, which should take 1 second to read from disk plus some overhead, but your program takes 130 seconds. I don't know if that number is with the files cold on disk, or an average of multiple tests where the data is already cached by the OS (with 62 GB or RAM all that data should be cached the second time) - it would be interesting to see both numbers.
So there has to be something else. Let's take a closer look at your loop:
for line in f:
s = line.split()
x, y = [int(v) for v in s]
obj = CoresetPoint(x, y)
gc.disable()
myList.append(obj)
gc.enable()
While I don't know Python, my guess would be that the gc calls are the problem here. They are called for every line read from disk. I don't know how expensive those calls are (or what if gc.enable() triggers a garbage collection for example) and why they would be needed around append(obj) only, but there might be other problems because this is multithreading:
Assuming the gc object is global (i.e. not thread local) you could have something like this:
thread 1 : gc.disable()
# switch to thread 2
thread 2 : gc.disable()
thread 2 : myList.append(obj)
thread 2 : gc.enable()
# gc now enabled!
# switch back to thread 1 (or one of the other threads)
thread 1 : myList.append(obj)
thread 1 : gc.enable()
And if the number of threads <= number of cores, there wouldn't even be any switching, they would all be calling this at the same time.
Also, if the gc object is thread safe (it would be worse if it isn't) it would have to do some locking in order to safely alter it's internal state, which would force all other threads to wait.
For example, gc.disable() would look something like this:
def disable()
lock() # all other threads are blocked for gc calls now
alter internal data
unlock()
And because gc.disable() and gc.enable() are called in a tight loop, this will hurt performance when using multiple threads.
So it would be better to remove those calls, or place them at the beginning and end of your program if they are really needed (or only disable gc at the beginning, no need to do gc right before quitting the program).
Depending on the way Python copies or moves objects, it might also be slightly better to use myList.append(CoresetPoint(x, y)).
So it would be interesting to test the same on one 100MB file with one thread and without the gc calls.
If the processing takes longer than the reading (i.e. not I/O bound), use one thread to read the data in a buffer (should take 1 or 2 seconds on one 100MB file if not already cached), and multiple threads to process the data (but still without those gc calls in that tight loop).
You don't have to split the data into multiple files in order to be able to use threads. Just let them process different parts of the same file (even with the 14GB file).
A copy-paste snippet, for people who come from Google and don't like reading
Example is for json reading, just replace __single_json_loader with another file type to work with that.
from multiprocessing import Pool
from typing import Callable, Any, Iterable
import os
import json
def parallel_file_read(existing_file_paths: Iterable[str], map_lambda: Callable[[str], Any]):
result = {p: None for p in existing_file_paths}
pool = Pool()
for i, (temp_result, path) in enumerate(zip(pool.imap(map_lambda, existing_file_paths), result.keys())):
result[path] = temp_result
pool.close()
pool.join()
return result
def __single_json_loader(f_path: str):
with open(f_path, "r") as f:
return json.load(f)
def parallel_json_read(existing_file_paths: Iterable[str]):
combined_result = parallel_file_read(existing_file_paths, __single_json_loader)
return combined_result
And usage
if __name__ == "__main__":
def main():
directory_path = r"/path/to/my/file/directory"
assert os.path.isdir(directory_path)
d: os.DirEntry
all_files_names = [f for f in os.listdir(directory_path)]
all_files_paths = [os.path.join(directory_path, f_name) for f_name in all_files_names]
assert(all(os.path.isfile(p) for p in all_files_paths))
combined_result = parallel_json_read(all_files_paths)
main()
Very straight forward to replace a json reader with any other reader, and you're done.
The upshot of the below is that I have an embarrassingly parallel for loop that I am trying to thread. There's a bit of rigamarole to explain the problem, but despite all the verbosity, I think this should be a rather trivial problem that the multiprocessing module is designed to solve easily.
I have a large length-N array of k distinct functions, and a length-N array of abcissa. Thanks to the clever solution provided by #senderle described in Efficient algorithm for evaluating a 1-d array of functions on a same-length 1d numpy array, I have a fast numpy-based algorithm that I can use to evaluate the functions at the abcissa to return a length-N array of ordinates:
def apply_indexed_fast(abcissa, func_indices, func_table):
""" Returns the output of an array of functions evaluated at a set of input points
if the indices of the table storing the required functions are known.
Parameters
----------
func_table : array_like
Length k array of function objects
abcissa : array_like
Length Npts array of points at which to evaluate the functions.
func_indices : array_like
Length Npts array providing the indices to use to choose which function
operates on each abcissa element. Thus func_indices is an array of integers
ranging between 0 and k-1.
Returns
-------
out : array_like
Length Npts array giving the evaluation of the appropriate function on each
abcissa element.
"""
func_argsort = func_indices.argsort()
func_ranges = list(np.searchsorted(func_indices[func_argsort], range(len(func_table))))
func_ranges.append(None)
out = np.zeros_like(abcissa)
for i in range(len(func_table)):
f = func_table[i]
start = func_ranges[i]
end = func_ranges[i+1]
ix = func_argsort[start:end]
out[ix] = f(abcissa[ix])
return out
What I'm now trying to do is use multiprocessing to parallelize the for loop inside this function. Before describing my approach, for clarity I'll briefly sketch how the algorithm #senderle developed works. If you can read the above code and understand it immediately, just skip the next paragraph of text.
First we find the array of indices that sorts the input func_indices, which we use to define the length-k func_ranges array of integers. The integer entries of func_ranges control the function that gets applied to the appropriate sub-array of the input abcissa, which works as follows. Let f be the i^th function in the input func_table. Then the slice of the input abcissa to which we should apply the function f is slice(func_ranges[i], func_ranges[i+1]). So once func_ranges is calculated, we can just run a simple for loop over the input func_table and successively apply each function object to the appropriate slice, filling our output array. See the code below for a minimal example of this algorithm in action.
def trivial_functional(i):
def f(x):
return i*x
return f
k = 250
func_table = np.array([trivial_functional(j) for j in range(k)])
Npts = 1e6
abcissa = np.random.random(Npts)
func_indices = np.random.random_integers(0,len(func_table)-1,Npts)
result = apply_indexed_fast(abcissa, func_indices, func_table)
So my goal now is to use multiprocessing to parallelize this calculation. I thought this would be straightforward using my usual trick for threading embarrassingly parallel for loops. But my attempt below raises an exception that I do not understand.
from multiprocessing import Pool, cpu_count
def apply_indexed_parallelized(abcissa, func_indices, func_table):
func_argsort = func_indices.argsort()
func_ranges = list(np.searchsorted(func_indices[func_argsort], range(len(func_table))))
func_ranges.append(None)
out = np.zeros_like(abcissa)
num_cores = cpu_count()
pool = Pool(num_cores)
def apply_funci(i):
f = func_table[i]
start = func_ranges[i]
end = func_ranges[i+1]
ix = func_argsort[start:end]
out[ix] = f(abcissa[ix])
pool.map(apply_funci, range(len(func_table)))
pool.close()
return out
result = apply_indexed_parallelized(abcissa, func_indices, func_table)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
I have seen this elsewhere on SO: Multiprocessing: How to use Pool.map on a function defined in a class?. One by one, I have tried each method proposed there; in all cases, I get a "too many files open" error because the threads were never closed, or the adapted algorithm simply hangs. This seems like there should be a simple solution since this is nothing more than threading an embarrassingly parallel for loop.
Warning/Caveat:
You may not want to apply multiprocessing to this problem. You'll find that relatively simple operations on large arrays, the problems will be memory bound with numpy. The bottleneck is moving data from RAM to the CPU caches. The CPU is starved for data, so throwing more CPUs at the problem doesn't help much. Furthermore, your current approach will pickle and make a copy of the entire array for every item in your input sequence, which adds lots of overhead.
There are plenty of cases where numpy + multiprocessing is very effective, but you need to make sure you're working with a CPU-bound problem. Ideally, it's a CPU-bound problem with relatively small inputs and outputs to alleviate the overhead of pickling the input and output. For many of the problems that numpy is most often used for, that's not the case.
Two Problems with Your Current Approach
On to answering your question:
Your immediate error is due to passing in a function that's not accessible from the global scope (i.e. a function defined inside a function).
However, you have another issue. You're treating the numpy arrays as though they're shared memory that can be modified by each process. Instead, when using multiprocessing the original array will be pickled (effectively making a copy) and passed to each process independently. The original array will never be modified.
Avoiding the PicklingError
As a minimal example to reproduce your error, consider the following:
import multiprocessing
def apply_parallel(input_sequence):
def func(x):
pass
pool = multiprocessing.Pool()
pool.map(func, input_sequence)
pool.close()
foo = range(100)
apply_parallel(foo)
This will result in:
PicklingError: Can't pickle <type 'function'>: attribute lookup
__builtin__.function failed
Of course, in this simple example, we could simply move the function definition back into the __main__ namespace. However, in yours, you need it to refer to data that's passed in. Let's look at an example that's a bit closer to what you're doing:
import numpy as np
import multiprocessing
def parallel_rolling_mean(data, window):
data = np.pad(data, window, mode='edge')
ind = np.arange(len(data)) + window
def func(i):
return data[i-window:i+window+1].mean()
pool = multiprocessing.Pool()
result = pool.map(func, ind)
pool.close()
return result
foo = np.random.rand(20).cumsum()
result = parallel_rolling_mean(foo, 10)
There are multiple ways you could handle this, but a common approach is something like:
import numpy as np
import multiprocessing
class RollingMean(object):
def __init__(self, data, window):
self.data = np.pad(data, window, mode='edge')
self.window = window
def __call__(self, i):
start = i - self.window
stop = i + self.window + 1
return self.data[start:stop].mean()
def parallel_rolling_mean(data, window):
func = RollingMean(data, window)
ind = np.arange(len(data)) + window
pool = multiprocessing.Pool()
result = pool.map(func, ind)
pool.close()
return result
foo = np.random.rand(20).cumsum()
result = parallel_rolling_mean(foo, 10)
Great! It works!
However, if you scale this up to large arrays, you'll soon find that it will either run very slow (which you can speed up by increasing chunksize in the pool.map call) or you'll quickly run out of RAM (once you increase the chunksize).
multiprocessing pickles the input so that it can be passed to separate and independent python processes. This means you're making a copy of the entire array for every i you operate on.
We'll come back to this point in a bit...
multiprocessing Does not share memory between processes
The multiprocessing module works by pickling the inputs and passing them to independent processes. This means that if you modify something in one process the other process won't see the modification.
However, multiprocessing also provides primitives that live in shared memory and can be accessed and modified by child processes. There are a few different ways of adapting numpy arrays to use a shared memory multiprocessing.Array. However, I'd recommend avoiding those at first (read up on false sharing if you're not familiar with it). There are cases where it's very useful, but it's typically to conserve memory, not to improve performance.
Therefore, it's best to do all modifications to a large array in a single process (this is also a very useful pattern for general IO). It doesn't have to be the "main" process, but it's easiest to think about that way.
As an example, let's say we wanted to have our parallel_rolling_mean function take an output array to store things in. A useful pattern is something similar to the following. Notice the use of iterators and modifying the output only in the main process:
import numpy as np
import multiprocessing
def parallel_rolling_mean(data, window, output):
def windows(data, window):
padded = np.pad(data, window, mode='edge')
for i in xrange(len(data)):
yield padded[i:i + 2*window + 1]
pool = multiprocessing.Pool()
results = pool.imap(np.mean, windows(data, window))
for i, result in enumerate(results):
output[i] = result
pool.close()
foo = np.random.rand(20000000).cumsum()
output = np.zeros_like(foo)
parallel_rolling_mean(foo, 10, output)
print output
Hopefully that example helps clarify things a bit.
chunksize and performance
One quick note on performance: If we scale this up, it will get very slow very quickly. If you look at a system monitor (e.g. top/htop), you may notice that your cores are sitting idle most of the time.
By default, the master process pickles each input for each process and passes it in immediately and then waits until they're finished to pickle the next input. In many cases, this means that the master process works, then sits idle while the worker processes are busy, then the worker processes sit idle while the master process is pickling the next input.
The key is to increase the chunksize parameter. This will cause pool.imap to "pre-pickle" the specified number of inputs for each process. Basically, the master thread can stay busy pickling inputs and the worker processes can stay busy processing. The downside is that you're using more memory. If each input uses up a large amount of RAM, this can be a bad idea. If it doesn't, though, this can dramatically speed things up.
As a quick example:
import numpy as np
import multiprocessing
def parallel_rolling_mean(data, window, output):
def windows(data, window):
padded = np.pad(data, window, mode='edge')
for i in xrange(len(data)):
yield padded[i:i + 2*window + 1]
pool = multiprocessing.Pool()
results = pool.imap(np.mean, windows(data, window), chunksize=1000)
for i, result in enumerate(results):
output[i] = result
pool.close()
foo = np.random.rand(2000000).cumsum()
output = np.zeros_like(foo)
parallel_rolling_mean(foo, 10, output)
print output
With chunksize=1000, it takes 21 seconds to process a 2-million element array:
python ~/parallel_rolling_mean.py 83.53s user 1.12s system 401% cpu 21.087 total
But with chunksize=1 (the default) it takes about eight times as long (2 minutes, 41 seconds).
python ~/parallel_rolling_mean.py 358.26s user 53.40s system 246% cpu 2:47.09 total
In fact, with the default chunksize, it's actually far worse than a single-process implementation of the same thing, which takes only 45 seconds:
python ~/sequential_rolling_mean.py 45.11s user 0.06s system 99% cpu 45.187 total