Issues with parallelizing processing of numpy array - python

I am having an issue with my attempt at speeding up my program. In the serial Python version of my code, I compute the value of a function f(x), which returns a float, for sliding windows of a NumPy array, as shown below:
import numpy as np

a = np.array([i for i in range(1, 10000000)])  # Some data here
N = 100
result = []
for i in range(N, len(a)):
    result.append(f(a[i - N:i]))
Since the NumPy array is really large and the runtime of f(x) is high, I've tried to apply multiprocessing to speed up my code. Through my research, I found that charm4py might be a great solution and it has a Pool feature, which breaks up an array into chunks and distributes work between spawned processes. I've implemented charm4py's multiprocessing example and then translated it to my case:
# Split an array into subarrays for sequential processing (takes only 5 seconds)
a = np.array([a[i - N:i] for i in range(N, len(a))])
result = charm.pool.map(f, a, chunksize=512, ncores=-1)
# I'm running this code through "charmrun +p18 example.py"
The issue I've encountered is that the code ran a lot slower, despite being executed on a more powerful instance (18 physical cores vs 6 physical cores).
I expected to see roughly a 3x improvement, but it didn't happen. While searching for solutions I learned that there can be significant overhead from expensive deserialization and from spinning up new processes, but I am not sure if this is the case here.
I would really appreciate any feedback or suggestions on how one can implement fast parallel processing of a NumPy array, assuming that the function f(x) is not vectorized, takes a pretty long time to compute, and internally makes a large number of individual calls that cannot be parallelized.
Thank you!

It sounds like you're trying to parallelize this operation with either Charm or Ray (it's not clear how you would use both together).
If you choose to use Ray, and your data is a numpy array, you can take advantage of zero-copy reads to avoid any deserialization overhead.
You may want to optimize your sliding window function a bit, but it will likely look like this:
@ray.remote
def apply_rolling(f, arr, start, end, window_size):
    results_arr = []
    for i in range(start, end - window_size):
        results_arr.append(f(arr[i : i + window_size]))
    return np.array(results_arr)
Note that this structure lets us call f multiple times within a single task (aka batching).
To use our function:
# Some small setup
big_arr = np.arange(10000000)
big_arr_ref = ray.put(big_arr)
batch_size = len(big_arr) // int(ray.available_resources()["CPU"])
window_size = 100

# Kick off our tasks
result_refs = []
for i in range(0, len(big_arr), batch_size):
    end_point = min(i + batch_size, len(big_arr))
    ref = apply_rolling.remote(f, big_arr_ref, i, end_point, window_size)
    result_refs.append(ref)

# Handle the results
flattened = []
for section in ray.get(result_refs):
    flattened.extend(section)
I'm sure you'll want to customize this code, but here are 2 important and nice properties that you'll likely want to maintain.
Batching: We're utilizing batching to avoid starting too many tasks. In any system, parallelizing incurs overhead, so we always want to be careful and make sure we don't start too many tasks. Furthermore, we are calculating batch_size = len(big_arr) // ray.available_resources()["CPU"] to make sure we use exactly the same number of batches as we have CPUs.
Shared memory: Since Ray's object store supports zero copy reads from numpy arrays, calling ray.get or reading from a numpy array is pretty much free (on a single machine where there are no network costs). There is some overhead in serializing/calling ray.put though, so this approach only calls put (the expensive operation) once, and ray.get (which is implicitly called) many times.
Tip: Be careful when passing arrays as parameters directly into remote functions. It will call ray.put multiple times, even if you pass the same object.
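As a minimal sketch of that tip (reusing the apply_rolling task and f from the snippet above), the only detail that matters is whether you hand the task the ray.put reference or the raw array:

import numpy as np
import ray

ray.init()

big_arr = np.arange(10000000)
big_arr_ref = ray.put(big_arr)  # serialize into the object store exactly once

# Good: every task reads the same shared, zero-copy object.
good_ref = apply_rolling.remote(f, big_arr_ref, 0, 100000, 100)

# Avoid: passing the raw array makes Ray serialize it again for each call.
bad_ref = apply_rolling.remote(f, big_arr, 0, 100000, 100)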

Here's an example based off of your code snippet that uses Ray to parallelize the array computations.
Note that the best way to do this will depend on what your function f looks like.
import numpy as np
import ray
import time

ray.init()

N = 100000
a = np.arange(10**7)
a_id = ray.put(a)

@ray.remote
def f(array, index):
    # Do processing
    time.sleep(0.2)
    return 1

result_ids = []
for i in range(len(a) // N):
    result_ids.append(f.remote(a_id, i))

results = ray.get(result_ids)

Related

Multiprocessing with Pool doesn't use all available CPU

I have some code that needs to work through a huge set of matrices (all 5 by 10 binary matrices in reduced row echelon form, with no zero rows) and either accept or reject each matrix depending on whether it satisfies some conditions. Since there's a lot to get through, I'm trying to use multiprocessing to speed things up. Here's roughly what my code currently looks like:
import multiprocessing as mp
import numpy as np

def check_valid(matrix):
    # Perform some checks and things
    if all_checks_passed:
        return matrix.copy()
    return None

subgroups = []

with mp.Pool() as pool:
    subgroups_iter = pool.imap(
        check_valid,
        get_rref_matrices(5),
        chunksize=1000
    )
    for item in subgroups_iter:
        if item is not None:
            subgroups.append(item)
get_rref_matrices is a generator function that recursively finds all the rref matrices (I'm not sure if this function is causing any issues). The full code for this function is here, if that's of interest.
When I run the program, it seems to be very slow (hardly any faster than a single process) and the CPU usage is only about 10%. I've previously run code that has maxed out my CPU, so I'm stumped as to why this code isn't running faster.

iterating through a huge loop efficiently using python

I have 100000 images and I need to get the vectors for each image
imageVectors = []
for i in range(100000):
    fileName = "Images/" + str(i) + '.jpg'
    imageVectors.append(getvector(fileName).reshape((1, 2048)))
cPickle.dump(imageVectors, open('imageVectors.pkl', "w+b"), cPickle.HIGHEST_PROTOCOL)
getvector is a function that takes one image at a time and needs about 1 second to process it. So basically my problem reduces to:
for i in range(100000):
    A = callFunction(i)  # a complex function that takes 1 sec for each call
The things that I have already tried (only pseudo-code is given here):
1) Using numpy vectorizer:

def callFunction1(i):
    return callFunction2(i)

vfunc = np.vectorize(callFunction1)
imageVectors = vfunc(list(range(100000)))
2) Using python map:

def callFunction1(i):
    return callFunction2(i)

imageVectors = map(callFunction1, list(range(100000)))
3) Using python multiprocessing:

import multiprocessing
try:
    cpus = multiprocessing.cpu_count()
except NotImplementedError:
    cpus = 4   # arbitrary default

pool = multiprocessing.Pool(processes=cpus)
result = pool.map(callFunction, xrange(100000000))
4) Using multiprocessing in a different way:

from multiprocessing import Process, Queue

q = Queue()
N = 100000000

p1 = Process(target=callFunction, args=(N/4, q))
p1.start()
p2 = Process(target=callFunction, args=(N/4, q))
p2.start()
p3 = Process(target=callFunction, args=(N/4, q))
p3.start()
p4 = Process(target=callFunction, args=(N/4, q))
p4.start()

results = []
for i in range(4):
    results.append(q.get(True))

p1.join()
p2.join()
p3.join()
p4.join()
All of the above methods take an immensely long time. Is there a more efficient way, so that I can work on many elements simultaneously instead of sequentially, or speed this up in some other fashion?
The time is mainly taken by the getvector function itself. As a workaround, I have split my data into 8 batches and run the same program on different parts of the loop, as eight separate instances of Python on an octa-core VM in Google Cloud. Could anyone suggest whether map-reduce or using GPUs with PyCUDA might be a good option?
The multiprocessing.Pool solution is a good one, in the sense that it uses all your cores. So it should be approximately N times faster than using plain old map, where N is the number of cores you have.
BTW, you can skip determining the amount of cores. By default multiprocessing.Pool uses as many processes as your CPU has cores.
Instead of a plain map (which blocks until everything has been processed), I would suggest using imap_unordered. This is an iterator that will start returning results as soon as they become available, so your parent process can start further processing on them right away. If ordering is important, you might want to return a tuple (number, array) to identify each result.
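As a rough sketch of that suggestion (assuming the getvector function and file layout from the question), it might look like this:

import multiprocessing

def worker(i):
    fileName = "Images/" + str(i) + '.jpg'
    # Return the index along with the vector so order can be restored later.
    return i, getvector(fileName).reshape((1, 2048))

if __name__ == '__main__':
    imageVectors = [None] * 100000
    with multiprocessing.Pool() as pool:
        # imap_unordered yields results as soon as any worker finishes.
        for i, vec in pool.imap_unordered(worker, range(100000), chunksize=100):
            imageVectors[i] = vec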
Your function returns a numpy array of 2048 values, which I assume are numpy.float64. Using the standard mapping functions, the results are transported back to the parent process using IPC. On a 4-core machine that results in 4 IPC transports per second of 2048*8 = 16384 bytes each, so 65536 bytes/second. That doesn't sound too bad. But I don't know how much overhead the IPC (which involves pickling and Queues) will incur.
In case the overhead is large, you might want to create a shared memory area to store the results in. You would need approximately 1.5 GiB to store 100000 results of 2048 8-byte floats. That is a sizeable amount of memory, but not impractical for current machines.
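As a sketch of that idea (assuming a Unix "fork" start method so child processes inherit the buffer, and the getvector function from the question), the workers could write straight into a shared array and only send indices back:

import multiprocessing as mp
import numpy as np

N_IMAGES = 100000
VEC_LEN = 2048

# ~1.5 GiB shared buffer, wrapped in a numpy view (no copy is made).
shared_buf = mp.RawArray('d', N_IMAGES * VEC_LEN)
results = np.frombuffer(shared_buf, dtype=np.float64).reshape(N_IMAGES, VEC_LEN)

def worker(i):
    fileName = "Images/" + str(i) + '.jpg'
    results[i] = getvector(fileName).ravel()  # write directly into shared memory
    return i                                  # only a small int travels over IPC

if __name__ == '__main__':
    with mp.Pool() as pool:
        for _ in pool.imap_unordered(worker, range(N_IMAGES), chunksize=100):
            pass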
For 100000 images, 4 cores, and each image taking around one second, your program's running time would be on the order of 7 hours (100000 seconds of work spread over 4 cores).
Your most important task for optimization is to look into reducing the runtime of the getvector function. For example, would it work just as well if you halved the dimensions of the images? Assuming that the runtime scales linearly with the number of pixels, that should cut the runtime to about 0.25 s.

System running out of memory when Python multiprocessing Pool is used?

I am trying to parallelize my code to find the similarity matrix using the multiprocessing module in Python. It works fine when I use a small np.ndarray with 10 X 15 elements. But when I scale my np.ndarray to 3613 X 7040 elements, the system runs out of memory.
Below is my code.
import multiprocessing
from multiprocessing import Pool

## Importing jaccard_similarity_score
from sklearn.metrics import jaccard_similarity_score

# Function for finding the similarities between two np arrays
def similarityMetric(a, b):
    return (jaccard_similarity_score(a, b))

## Below functions are used for parallelizing the script

# Auxiliary function to make it work
def product_helper1(args):
    return (similarityMetric(*args))

def parallel_product1(list_a, list_b):
    # spawn given number of processes
    p = Pool(8)
    # set each matching item into a tuple
    job_args = getArguments(list_a, list_b)
    # map to pool
    results = p.map(product_helper1, job_args)
    p.close()
    p.join()
    return (results)

## getArguments function is used to get the combined list
def getArguments(list_a, list_b):
    arguments = []
    for i in list_a:
        for j in list_b:
            item = (i, j)
            arguments.append(item)
    return (arguments)
Now when I run the code below, the system runs out of memory and hangs. I am passing two numpy.ndarrays, testMatrix1 and testMatrix2, which are of size (3613, 7040):
resultantMatrix = parallel_product1(testMatrix1,testMatrix2)
I am new to using this module in Python and trying to understand where I am going wrong. Any help is appreciated.
Odds are, the problem is just combinatoric explosion. You're trying to realize all the pairs in the main process up front, rather than generating them lazily, so you're using a huge amount of memory. Assuming the ndarrays contain double values, which become Python floats, then the memory usage of the list returned by getArguments is roughly the cost of a tuple and two floats per pair, or about:
3613 * 7040 * (sys.getsizeof((0., 0.)) + sys.getsizeof(0.) * 2)
On my 64 bit Linux system, that means ~2.65 GB of RAM on Py3, or ~2.85 GB on Py2, before the workers even do anything.
If you can process the data in a streaming fashion using a generator, so arguments are produced lazily and discarded when no longer needed, you could probably reduce memory usage dramatically:
import itertools

def parallel_product1(list_a, list_b):
    # spawn given number of processes
    p = Pool(8)
    # set each matching item into a tuple
    # Returns a generator that lazily produces the tuples
    job_args = itertools.product(list_a, list_b)
    # map to pool
    results = p.map(product_helper1, job_args)
    p.close()
    p.join()
    return (results)
This still requires all the results to fit in memory; if product_helper returns floats, then the expected memory usage for the result list on a 64-bit machine would still be around 0.75 GB or so, which is pretty large. If you can process the results in a streaming fashion, iterating the results of p.imap, or even better p.imap_unordered (the latter returns results as computed, not in the order the generator produced the arguments), and writing them to disk or otherwise ensuring they're released from memory quickly, you would save a lot of memory. The following just prints them out, but writing them to a file in some reingestable format would also be reasonable.
def parallel_product1(list_a, list_b):
    # spawn given number of processes
    p = Pool(8)
    # set each matching item into a tuple
    # Returns a generator that lazily produces the tuples
    job_args = itertools.product(list_a, list_b)
    # map to pool
    for result in p.imap_unordered(product_helper1, job_args):
        print(result)
    p.close()
    p.join()
The map method sends all data to the workers via inter-process communication. As currently done, this consumes a huge amount of resources, because you're sending two full rows of the matrices to a worker for every single pair.
What I would suggest is to modify getArguments to make a list of tuples of indices into the matrices that need to be combined. That's only two numbers that have to be sent to the worker process, instead of two rows of a matrix! Each worker then knows which rows in the matrix to use.
Load the two matrices and store them in global variables before calling map. This way every worker has access to them. And as long as they're not modified in the workers, the OS's virtual memory manager will not copy identical memory pages, keeping memory usage down.
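A minimal sketch of those two suggestions combined (assuming testMatrix1, testMatrix2, and similarityMetric are defined at module level before the pool is created, and that the start method is fork so the globals are shared copy-on-write):

import itertools
from multiprocessing import Pool

def similarity_by_index(idx):
    i, j = idx
    # Look the rows up locally instead of shipping them over IPC.
    return similarityMetric(testMatrix1[i], testMatrix2[j])

if __name__ == '__main__':
    index_pairs = itertools.product(range(len(testMatrix1)), range(len(testMatrix2)))
    with Pool(8) as p:
        for result in p.imap_unordered(similarity_by_index, index_pairs, chunksize=1000):
            # Stream each score to disk or aggregate it here instead of keeping a giant list.
            pass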

parallelized algorithm for evaluating a 1-d array of functions on a same-length 1d numpy array

The upshot of the below is that I have an embarrassingly parallel for loop that I am trying to thread. There's a bit of rigamarole to explain the problem, but despite all the verbosity, I think this should be a rather trivial problem that the multiprocessing module is designed to solve easily.
I have a large length-N array of k distinct functions, and a length-N array of abcissa. Thanks to the clever solution provided by @senderle described in Efficient algorithm for evaluating a 1-d array of functions on a same-length 1d numpy array, I have a fast numpy-based algorithm that I can use to evaluate the functions at the abcissa to return a length-N array of ordinates:
def apply_indexed_fast(abcissa, func_indices, func_table):
    """ Returns the output of an array of functions evaluated at a set of input points
    if the indices of the table storing the required functions are known.

    Parameters
    ----------
    func_table : array_like
        Length k array of function objects

    abcissa : array_like
        Length Npts array of points at which to evaluate the functions.

    func_indices : array_like
        Length Npts array providing the indices to use to choose which function
        operates on each abcissa element. Thus func_indices is an array of integers
        ranging between 0 and k-1.

    Returns
    -------
    out : array_like
        Length Npts array giving the evaluation of the appropriate function on each
        abcissa element.
    """
    func_argsort = func_indices.argsort()
    func_ranges = list(np.searchsorted(func_indices[func_argsort], range(len(func_table))))
    func_ranges.append(None)
    out = np.zeros_like(abcissa)

    for i in range(len(func_table)):
        f = func_table[i]
        start = func_ranges[i]
        end = func_ranges[i+1]
        ix = func_argsort[start:end]
        out[ix] = f(abcissa[ix])

    return out
What I'm now trying to do is use multiprocessing to parallelize the for loop inside this function. Before describing my approach, for clarity I'll briefly sketch how the algorithm @senderle developed works. If you can read the above code and understand it immediately, just skip the next paragraph of text.
First we find the array of indices that sorts the input func_indices, which we use to define the length-k func_ranges array of integers. The integer entries of func_ranges control the function that gets applied to the appropriate sub-array of the input abcissa, which works as follows. Let f be the i^th function in the input func_table. Then the slice of the input abcissa to which we should apply the function f is slice(func_ranges[i], func_ranges[i+1]). So once func_ranges is calculated, we can just run a simple for loop over the input func_table and successively apply each function object to the appropriate slice, filling our output array. See the code below for a minimal example of this algorithm in action.
def trivial_functional(i):
    def f(x):
        return i*x
    return f

k = 250
func_table = np.array([trivial_functional(j) for j in range(k)])

Npts = int(1e6)
abcissa = np.random.random(Npts)
func_indices = np.random.random_integers(0, len(func_table)-1, Npts)

result = apply_indexed_fast(abcissa, func_indices, func_table)
So my goal now is to use multiprocessing to parallelize this calculation. I thought this would be straightforward using my usual trick for threading embarrassingly parallel for loops. But my attempt below raises an exception that I do not understand.
from multiprocessing import Pool, cpu_count

def apply_indexed_parallelized(abcissa, func_indices, func_table):
    func_argsort = func_indices.argsort()
    func_ranges = list(np.searchsorted(func_indices[func_argsort], range(len(func_table))))
    func_ranges.append(None)
    out = np.zeros_like(abcissa)

    num_cores = cpu_count()
    pool = Pool(num_cores)

    def apply_funci(i):
        f = func_table[i]
        start = func_ranges[i]
        end = func_ranges[i+1]
        ix = func_argsort[start:end]
        out[ix] = f(abcissa[ix])

    pool.map(apply_funci, range(len(func_table)))
    pool.close()

    return out

result = apply_indexed_parallelized(abcissa, func_indices, func_table)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
I have seen this elsewhere on SO: Multiprocessing: How to use Pool.map on a function defined in a class?. One by one, I have tried each method proposed there; in all cases, I get a "too many files open" error because the threads were never closed, or the adapted algorithm simply hangs. This seems like there should be a simple solution since this is nothing more than threading an embarrassingly parallel for loop.
Warning/Caveat:
You may not want to apply multiprocessing to this problem. You'll find that for relatively simple operations on large arrays, the problem will be memory-bound with numpy. The bottleneck is moving data from RAM to the CPU caches. The CPU is starved for data, so throwing more CPUs at the problem doesn't help much. Furthermore, your current approach will pickle and make a copy of the entire array for every item in your input sequence, which adds lots of overhead.
There are plenty of cases where numpy + multiprocessing is very effective, but you need to make sure you're working with a CPU-bound problem. Ideally, it's a CPU-bound problem with relatively small inputs and outputs to alleviate the overhead of pickling the input and output. For many of the problems that numpy is most often used for, that's not the case.
Two Problems with Your Current Approach
On to answering your question:
Your immediate error is due to passing in a function that's not accessible from the global scope (i.e. a function defined inside a function).
However, you have another issue. You're treating the numpy arrays as though they're shared memory that can be modified by each process. Instead, when using multiprocessing the original array will be pickled (effectively making a copy) and passed to each process independently. The original array will never be modified.
Avoiding the PicklingError
As a minimal example to reproduce your error, consider the following:
import multiprocessing

def apply_parallel(input_sequence):
    def func(x):
        pass
    pool = multiprocessing.Pool()
    pool.map(func, input_sequence)
    pool.close()

foo = range(100)
apply_parallel(foo)
This will result in:
PicklingError: Can't pickle <type 'function'>: attribute lookup
__builtin__.function failed
Of course, in this simple example, we could simply move the function definition back into the __main__ namespace. However, in yours, you need it to refer to data that's passed in. Let's look at an example that's a bit closer to what you're doing:
import numpy as np
import multiprocessing

def parallel_rolling_mean(data, window):
    data = np.pad(data, window, mode='edge')
    ind = np.arange(len(data)) + window

    def func(i):
        return data[i-window:i+window+1].mean()

    pool = multiprocessing.Pool()
    result = pool.map(func, ind)
    pool.close()
    return result

foo = np.random.rand(20).cumsum()
result = parallel_rolling_mean(foo, 10)
This fails with the same PicklingError, because func is still defined inside another function and can't be pickled. There are multiple ways you could handle this, but a common approach is something like:
import numpy as np
import multiprocessing

class RollingMean(object):
    def __init__(self, data, window):
        self.data = np.pad(data, window, mode='edge')
        self.window = window

    def __call__(self, i):
        start = i - self.window
        stop = i + self.window + 1
        return self.data[start:stop].mean()

def parallel_rolling_mean(data, window):
    func = RollingMean(data, window)
    ind = np.arange(len(data)) + window

    pool = multiprocessing.Pool()
    result = pool.map(func, ind)
    pool.close()

    return result

foo = np.random.rand(20).cumsum()
result = parallel_rolling_mean(foo, 10)
Great! It works!
However, if you scale this up to large arrays, you'll soon find that it will either run very slow (which you can speed up by increasing chunksize in the pool.map call) or you'll quickly run out of RAM (once you increase the chunksize).
multiprocessing pickles the input so that it can be passed to separate and independent python processes. This means you're making a copy of the entire array for every i you operate on.
We'll come back to this point in a bit...
multiprocessing Does not share memory between processes
The multiprocessing module works by pickling the inputs and passing them to independent processes. This means that if you modify something in one process the other process won't see the modification.
However, multiprocessing also provides primitives that live in shared memory and can be accessed and modified by child processes. There are a few different ways of adapting numpy arrays to use a shared memory multiprocessing.Array. However, I'd recommend avoiding those at first (read up on false sharing if you're not familiar with it). There are cases where it's very useful, but it's typically to conserve memory, not to improve performance.
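For reference only, a minimal sketch of one such adaptation (assuming a fork start method so children inherit the buffer); as noted above, it mainly saves memory rather than time:

import multiprocessing
import numpy as np

# A raw shared buffer of 1000 doubles, viewed as a numpy array without copying.
shared = multiprocessing.Array('d', 1000, lock=False)
view = np.frombuffer(shared, dtype=np.float64)
view[:] = 0.0   # writes here are visible to forked child processes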
Therefore, it's best to do all modifications to a large array in a single process (this is also a very useful pattern for general IO). It doesn't have to be the "main" process, but it's easiest to think about that way.
As an example, let's say we wanted to have our parallel_rolling_mean function take an output array to store things in. A useful pattern is something similar to the following. Notice the use of iterators and modifying the output only in the main process:
import numpy as np
import multiprocessing

def parallel_rolling_mean(data, window, output):
    def windows(data, window):
        padded = np.pad(data, window, mode='edge')
        for i in xrange(len(data)):
            yield padded[i:i + 2*window + 1]

    pool = multiprocessing.Pool()
    results = pool.imap(np.mean, windows(data, window))
    for i, result in enumerate(results):
        output[i] = result
    pool.close()

foo = np.random.rand(20000000).cumsum()
output = np.zeros_like(foo)
parallel_rolling_mean(foo, 10, output)
print output
Hopefully that example helps clarify things a bit.
chunksize and performance
One quick note on performance: If we scale this up, it will get very slow very quickly. If you look at a system monitor (e.g. top/htop), you may notice that your cores are sitting idle most of the time.
By default, the master process pickles each input and passes it to a worker immediately, then waits until that worker is finished before pickling the next input. In many cases, this means that the master process works, then sits idle while the worker processes are busy, then the worker processes sit idle while the master process is pickling the next input.
The key is to increase the chunksize parameter. This will cause pool.imap to "pre-pickle" the specified number of inputs for each process. Basically, the master thread can stay busy pickling inputs and the worker processes can stay busy processing. The downside is that you're using more memory. If each input uses up a large amount of RAM, this can be a bad idea. If it doesn't, though, this can dramatically speed things up.
As a quick example:
import numpy as np
import multiprocessing

def parallel_rolling_mean(data, window, output):
    def windows(data, window):
        padded = np.pad(data, window, mode='edge')
        for i in xrange(len(data)):
            yield padded[i:i + 2*window + 1]

    pool = multiprocessing.Pool()
    results = pool.imap(np.mean, windows(data, window), chunksize=1000)
    for i, result in enumerate(results):
        output[i] = result
    pool.close()

foo = np.random.rand(2000000).cumsum()
output = np.zeros_like(foo)
parallel_rolling_mean(foo, 10, output)
print output
With chunksize=1000, it takes 21 seconds to process a 2-million element array:
python ~/parallel_rolling_mean.py 83.53s user 1.12s system 401% cpu 21.087 total
But with chunksize=1 (the default) it takes about eight times as long (2 minutes, 41 seconds).
python ~/parallel_rolling_mean.py 358.26s user 53.40s system 246% cpu 2:47.09 total
In fact, with the default chunksize, it's actually far worse than a single-process implementation of the same thing, which takes only 45 seconds:
python ~/sequential_rolling_mean.py 45.11s user 0.06s system 99% cpu 45.187 total

Inefficient multiprocessing of numpy-based calculations

I'm trying to parallelize some calculations that use numpy with the help of Python's multiprocessing module. Consider this simplified example:
import time
import numpy
from multiprocessing import Pool

def test_func(i):
    a = numpy.random.normal(size=1000000)
    b = numpy.random.normal(size=1000000)

    for i in range(2000):
        a = a + b
        b = a - b
        a = a - b

    return 1

t1 = time.time()
test_func(0)
single_time = time.time() - t1
print("Single time:", single_time)

n_par = 4
pool = Pool()

t1 = time.time()
results_async = [
    pool.apply_async(test_func, [i])
    for i in range(n_par)]
results = [r.get() for r in results_async]
multicore_time = time.time() - t1

print("Multicore time:", multicore_time)
print("Efficiency:", single_time / multicore_time)
When I execute it, the multicore_time is roughly equal to single_time * n_par, while I would expect it to be close to single_time. Indeed, if I replace numpy calculations with just time.sleep(10), this is what I get — perfect efficiency. But for some reason it does not work with numpy. Can this be solved, or is it some internal limitation of numpy?
Some additional info which may be useful:
I'm using OSX 10.9.5, Python 3.4.2 and the CPU is Core i7 with (as reported by the system info) 4 cores (although the above program only takes 50% of CPU time in total, so the system info may not be taking into account hyperthreading).
when I run this I see n_par processes in top working at 100% CPU
if I replace numpy array operations with a loop and per-index operations, the efficiency rises significantly (to about 75% for n_par = 4).
It looks like the test function you're using is memory-bound. That means that the run time you're seeing is limited by how fast the computer can pull the arrays from memory into cache. For example, the line a = a + b is actually using 3 arrays: a, b, and a new array that will replace a. These three arrays are about 8 MB each (1e6 floats * 8 bytes per float). I believe the different i7s have something like 3 MB - 8 MB of shared L3 cache, so you cannot fit all 3 arrays in cache at once. Your CPU adds the floats faster than the arrays can be loaded into cache, so most of the time is spent waiting on the arrays to be read from memory. Because the cache is shared between the cores, you don't see any speedup by spreading the work onto multiple cores.
Memory bound operations are an issue for numpy in general and the only way I know to deal with them is to use something like cython or numba.
One easy thing that should bump efficiency up is to do in-place array operations, if possible -- so add(a, b, a) will not create a new array, while a = a + b will. If your for loop over numpy arrays could be rewritten as vector operations, that should be more efficient as well. Another possibility would be to use numpy.ctypeslib to enable shared memory numpy arrays (see: https://stackoverflow.com/a/5550156/2379433).
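For instance, the hot loop from the question could be rewritten roughly like this (a sketch, not benchmarked here):

import numpy as np

def test_func_inplace(i):
    a = np.random.normal(size=1000000)
    b = np.random.normal(size=1000000)
    for _ in range(2000):
        np.add(a, b, out=a)        # a = a + b without allocating a temporary
        np.subtract(a, b, out=b)   # b = a - b
        np.subtract(a, b, out=a)   # a = a - b
    return 1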
I have been programming numerical methods for mathematics and ran into the same problem: I wasn't seeing any speed-up for a supposedly CPU-bound problem. It turns out my problem was hitting the CPU cache memory limit.
I have been using Intel PCM (Intel® Performance Counter Monitor) to see how the cpu cache memory was behaving (displaying it inside Linux ksysguard). I also disabled 2 of my processors to have clearer results (2 are active).
Here is what I have found out with this code:
import time
import numpy as np
import multiprocessing as mp

def somethinglong(b):
    n = 200000
    m = 5000
    shared = np.arange(n)
    for i in np.arange(m):
        0.01*shared

pool = mp.Pool(2)
jobs = [() for i in range(8)]

for i in range(5):
    timei = time.time()
    pool.map(somethinglong, jobs, chunksize=1)
    # for job in jobs:
    #     somethinglong(job)
    print(time.time()-timei)
Example that doesn't reach the cache memory limit:
n=10000
m=100000
Sequential execution: 15s
2 processor pool no cache memory limit: 8s
It can be seen that there are no cache misses (all cache hits), therefore the speed-up is almost perfect: 15/8.
[Image: PCM output showing cache hits for the 2-process pool]
Example that reaches the cache memory limit:
n=200000
m=5000
Sequential execution: 14s
2 processor pool cache memory limit: 14s
In this case, I increased the size of the vector we operate on (and decreased the loop size, to keep execution times reasonable). In this case we can see that memory gets full and the processes always miss the cache. Therefore we don't get any speedup: 14/14.
[Image: PCM output showing cache misses for the 2-process pool]
Observation: assigning an operation to a variable (aux = 0.01*shared) also uses cache memory and can make the problem memory-bound (without increasing any vector size).
