Inefficient multiprocessing of numpy-based calculations

I'm trying to parallelize some calculations that use numpy with the help of Python's multiprocessing module. Consider this simplified example:
import time
import numpy
from multiprocessing import Pool
def test_func(i):
    a = numpy.random.normal(size=1000000)
    b = numpy.random.normal(size=1000000)
    for i in range(2000):
        a = a + b
        b = a - b
        a = a - b
    return 1
t1 = time.time()
test_func(0)
single_time = time.time() - t1
print("Single time:", single_time)
n_par = 4
pool = Pool()
t1 = time.time()
results_async = [
    pool.apply_async(test_func, [i])
    for i in range(n_par)]
results = [r.get() for r in results_async]
multicore_time = time.time() - t1
print("Multicore time:", multicore_time)
print("Efficiency:", single_time / multicore_time)
When I execute it, the multicore_time is roughly equal to single_time * n_par, while I would expect it to be close to single_time. Indeed, if I replace numpy calculations with just time.sleep(10), this is what I get — perfect efficiency. But for some reason it does not work with numpy. Can this be solved, or is it some internal limitation of numpy?
Some additional info which may be useful:
- I'm using OS X 10.9.5, Python 3.4.2, and the CPU is a Core i7 with (as reported by the system info) 4 cores (although the above program only takes 50% of CPU time in total, so the system info may not be taking hyperthreading into account).
- When I run this, I see n_par processes in top working at 100% CPU.
- If I replace numpy array operations with a loop and per-index operations, the efficiency rises significantly (to about 75% for n_par = 4).

It looks like the test function you're using is memory bound. That means the run time you're seeing is limited by how fast the computer can pull the arrays from memory into cache. For example, the line a = a + b actually uses 3 arrays: a, b and a new array that will replace a. These three arrays are about 8MB each (1e6 floats * 8 bytes per float). I believe the different i7s have something like 3MB - 8MB of shared L3 cache, so you cannot fit all 3 arrays in cache at once. Your CPU adds the floats faster than the arrays can be loaded into cache, so most of the time is spent waiting on the arrays to be read from memory. Because the cache is shared between the cores, you don't see any speedup by spreading the work onto multiple cores.
Memory-bound operations are an issue for numpy in general, and the only way I know to deal with them is to use something like Cython or Numba.
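To make the arithmetic above concrete, here is a small sketch of the working-set estimate. The 8 MiB L3 size is an assumed figure taken from the upper end of the range quoted above, not a measurement of the asker's machine, and working_set_bytes is a name made up for this example.

```python
# Rough working-set estimate for `a = a + b` on float64 arrays.
N_ELEMENTS = 1_000_000        # array size used in the question
BYTES_PER_FLOAT64 = 8
ARRAYS_TOUCHED = 3            # a, b, and the freshly allocated result

ASSUMED_L3_BYTES = 8 * 1024 * 1024   # assumption: 8 MiB of shared L3


def working_set_bytes(n_elements, n_arrays=ARRAYS_TOUCHED):
    """Bytes touched by one element-wise operation over n_arrays arrays."""
    return n_elements * BYTES_PER_FLOAT64 * n_arrays


if __name__ == "__main__":
    for n_procs in (1, 4):
        total = n_procs * working_set_bytes(N_ELEMENTS)
        fits = total <= ASSUMED_L3_BYTES
        print(f"{n_procs} process(es): {total / 2**20:.1f} MiB, fits in L3: {fits}")
```

Even a single process already needs more than the assumed cache, and every extra process multiplies the demand on the same shared L3.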

One easy thing that should bump efficiency up is to do in-place array operations where possible: numpy.add(a, b, out=a) will not create a new array, while a = a + b will. If your for loop over numpy arrays could be rewritten as vector operations, that should be more efficient as well. Another possibility would be to use numpy.ctypeslib to enable shared-memory numpy arrays (see: https://stackoverflow.com/a/5550156/2379433).
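A minimal sketch of the in-place variant, assuming the same swap-by-arithmetic loop body as in the question. The copies a0 and b0 exist only so the result can be checked, and how much this helps will depend on the machine.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=1_000_000)
b = rng.normal(size=1_000_000)
a0, b0 = a.copy(), b.copy()   # kept only to check the result afterwards

a_buffer = a                  # remember which array object `a` refers to

# Same three steps as in the question, but writing into existing arrays.
np.add(a, b, out=a)           # a = a + b, without a temporary array
np.subtract(a, b, out=b)      # b = a - b
np.subtract(a, b, out=a)      # a = a - b

print(a is a_buffer)                             # name still bound to the same array
print(np.allclose(a, b0), np.allclose(b, a0))    # values were swapped
```

With out= each step touches two arrays instead of three, which lowers the memory traffic but does not remove it.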

I have been programming numerical methods for mathematics and had the same problem: I wasn't seeing any speed-up for a supposedly CPU-bound problem. It turns out my problem was hitting the CPU cache size limit.
I have been using Intel PCM (Intel® Performance Counter Monitor) to see how the CPU cache was behaving (displaying it inside Linux ksysguard). I also disabled 2 of my processors to get clearer results (2 are active).
Here is what I found with this code:
import time
import multiprocessing as mp
import numpy as np
def somethinglong(b):
    n = 200000
    m = 5000
    shared = np.arange(n)
    for i in np.arange(m):
        0.01 * shared
pool = mp.Pool(2)
jobs = [() for i in range(8)]
for i in range(5):
    timei = time.time()
    pool.map(somethinglong, jobs, chunksize=1)
    # Sequential alternative:
    # for job in jobs:
    #     somethinglong(job)
    print(time.time() - timei)
Example that doesn't reach the cache memory limit:
n=10000
m=100000
Sequential execution: 15s
2 processor pool no cache memory limit: 8s
It can be seen that there are no cache misses (all cache hits), so the speed-up is almost perfect: 15/8.
[Figure: memory cache hits, 2-process pool]
Example that reaches the cache memory limit:
n=200000
m=5000
Sequential execution: 14s
2 processor pool cache memory limit: 14s
In this case, I increased the size of the vector we operate on (and decreased the loop count, to keep execution times reasonable). Now the cache fills up and the processes always miss it, so there is no speedup: 14/14.
[Figure: memory cache misses, 2-process pool]
Observation: assigning an operation to a variable (aux = 0.01*shared) also uses the cache memory and can bound the problem by memory (without increasing any vector size).

Related

Dask delayed performance issues

I'm relatively new to Dask. I'm trying to parallelize a "custom" function that doesn't use Dask containers. I would just like to speed up the computation. But my results are that when I try parallelizing with dask.delayed, it has significantly worse performance than running the serial version. Here is a minimal implementation demonstrating the issue (the code I actually want to do this with is significantly more involved :) )
import dask, time
def mysum(rng):
    # CPU intensive
    z = 0
    for i in rng:
        z += i
    return z
# serial
b = time.time(); zz = mysum(range(1, 1_000_000_000)); t = time.time() - b
print(f'time to run in serial {t}')
# parallel
ms_parallel = dask.delayed(mysum)
ss = []
ncores = 10
m = 100_000_000
for i in range(ncores):
    lower = m*i
    upper = (i+1) * m
    r = range(lower, upper)
    s = ms_parallel(r)
    ss.append(s)
j = dask.delayed(ss)
b = time.time(); yy = j.compute(); t = time.time() - b
print(f'time to run in parallel {t}')
Typical results are:
time to run in serial 55.682398080825806
time to run in parallel 135.2043571472168
It seems I'm missing something basic here.

You are running a pure CPU-bound computation in threads, which is the default scheduler for dask.delayed. Because of Python's Global Interpreter Lock (GIL), only one thread is actually running at a time. In short, you are only adding overhead to your original computation, due to thread switching and task execution.
To actually get faster for this workload, you should use dask-distributed. Just adding
import dask.distributed
client = dask.distributed.Client(threads_per_worker=1)
at the start of your script may well give you a decent speed up, since this invokes a certain number of processes, each with their own GIL. This scheduler becomes the default one just by creating it.
EDIT: ignore the following, I see you are already doing it :). Leaving here for others, unless people want it gone ...The second problem, for dask, is the sheer number of tasks. For any task execution system, there is an overhead associated with each task (actually, this is higher for distributed than the default threads scheduler). You could get around it by computing batches of function calls per task. This is, in practice, what dask.array and dask.dataframe do: they operate on largeish pieces of the overall problem, such that the overhead becomes small compared to the useful CPU execution time.
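As a rough illustration of the two points above (separate processes, each with its own GIL, and a few large tasks rather than many small ones), here is a standard-library sketch. It uses concurrent.futures rather than Dask, and split_range, partial_sum and parallel_sum are names made up for this example.

```python
from concurrent.futures import ProcessPoolExecutor


def split_range(start, stop, n_chunks):
    """Split [start, stop) into n_chunks contiguous (lo, hi) pairs."""
    step, extra = divmod(stop - start, n_chunks)
    bounds = []
    lo = start
    for k in range(n_chunks):
        # The first `extra` chunks get one additional element each.
        hi = lo + step + (1 if k < extra else 0)
        bounds.append((lo, hi))
        lo = hi
    return bounds


def partial_sum(bounds):
    """CPU-bound work on one chunk; meant to run in a worker process."""
    lo, hi = bounds
    total = 0
    for i in range(lo, hi):
        total += i
    return total


def parallel_sum(stop, n_workers=4):
    """One large task per worker, so per-task overhead stays small."""
    chunks = split_range(0, stop, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(partial_sum, chunks))


if __name__ == "__main__":
    n = 2_000_000   # kept small so the demo finishes quickly
    print(parallel_sum(n) == sum(range(n)))
```

Each worker receives only a pair of integers and returns a single integer, so almost no time is spent moving data between processes.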

Strange behaviour during multiprocess calls to numpy conjugate

The attached script evaluates the numpy.conjugate routine for varying numbers of parallel processes on differently sized matrices and records the corresponding run times.
The matrix shape only varies in its first dimension (from 1,64,64 to 256,64,64). Conjugation calls are always made on 1,64,64 sub-matrices to ensure that the parts being worked on fit into the L2 cache on my system (256 KB per core; the L3 cache in my case is 25 MB). Running the script yields the following diagram (with slightly different axis labels and colors).
As you can see, starting from a shape of around 100,64,64 the runtime depends on the number of parallel processes used.
What could be the cause of this ?
Or why is the dependence on the number of processes for matrices below (100,64,64) so low?
My main goal is to find a modification to this script such that the runtime becomes as independent as possible from the number of processes for matrices 'a' of arbitrary size.
In case of 20 Processes:
all 'a' matrices take at most: 20 * 16 * 256 * 64 * 64 Byte = 320MB
all 'b' sub matrices take at most: 20 * 16 * 1 * 64 * 64 Byte = 1.25MB
So all sub matrices fit simultaneously in L3 cache as well as individually in the L2 cache per core of my CPU.
I used only physical cores (no hyper-threading) for these tests.
Here is the script:
from multiprocessing import Process, Queue
import time
import numpy as np
import os
from matplotlib import pyplot as plt
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'
def f(q,size):
    a = np.random.rand(size,64,64) + 1.j*np.random.rand(size,64,64)
    start = time.time()
    n=a.shape[0]
    for i in range(20):
        for b in a:
            b.conj()
    duration = time.time()-start
    q.put(duration)
def speed_test(number_of_processes=1,size=1):
    number_of_processes = number_of_processes
    process_list=[]
    queue = Queue()
    #Start processes
    for p_id in range(number_of_processes):
        p = Process(target=f,args=(queue,size))
        process_list.append(p)
        p.start()
    #Wait until all processes are finished
    for p in process_list:
        p.join()
    output = []
    while queue.qsize() != 0:
        output.append(queue.get())
    return np.mean(output)
if __name__ == '__main__':
    processes=np.arange(1,20,3)
    data=[[] for i in processes]
    for p_id,p in enumerate(processes):
        for size_0 in range(1,257):
            data[p_id].append(speed_test(number_of_processes=p,size=size_0))
    fig,ax = plt.subplots()
    for d in data:
        ax.plot(d)
    ax.set_xlabel('Matrix Size: 1-256,64,64')
    ax.set_ylabel('Runtime in seconds')
    fig.savefig('result.png')

The problem is due to at least a combination of two complex effects: cache-thrashing and frequency-scaling. I can reproduce the effect on my 6 core i5-9600KF processor.
Cache thrashing
The biggest effect comes from a cache-thrashing issue. It can easily be tracked by looking at the RAM throughput: it is 4 GiB/s for 1 process and 20 GiB/s for 6 processes. The read throughput is similar to the write throughput, so the traffic is symmetric. My RAM can reach up to ~40 GiB/s, but usually only ~32 GiB/s for mixed read/write patterns. This means the RAM pressure is pretty high. Such a situation typically occurs in two cases:
- an array is read from and written back to RAM because the caches are not big enough;
- many accesses to different locations in memory are made, but they map to the same cache lines in the L3.
At first glance, the first case is much more likely here since the arrays are contiguous and pretty big (the other effect unfortunately also happens, see below). In fact, the main problem is that the array a is too big to fit in the L3. When size is >128, a takes more than 128*64*64*8*2 = 8 MiB per process. Actually, a is built from two arrays that must be read, so the space needed in cache is 3 times bigger than that, i.e. >24 MiB per process. All processes allocate the same amount of memory, so the more processes there are, the bigger the cumulative space taken by a. When the cumulative space is bigger than the cache, the processor needs to write data to RAM and read it back, which is slow.
In fact, it is even worse: processes are not fully synchronized, so one process can evict data needed by others while filling a.
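The footprint arithmetic can be written down as a short sketch. The 9 MiB L3 figure below is an assumption used only for illustration, and array_mib and cumulative_mib are hypothetical helper names.

```python
BYTES_PER_COMPLEX128 = 16   # 8 bytes real part + 8 bytes imaginary part


def array_mib(size, rows=64, cols=64):
    """Size in MiB of one complex128 array of shape (size, rows, cols)."""
    return size * rows * cols * BYTES_PER_COMPLEX128 / 2**20


def cumulative_mib(size, n_processes):
    """Total MiB held by `a` across all processes (each allocates its own copy)."""
    return n_processes * array_mib(size)


if __name__ == "__main__":
    assumed_l3_mib = 9   # assumption for illustration, not a measured value
    for n in (1, 3, 6):
        total = cumulative_mib(128, n)
        print(f"{n} process(es): {total:.0f} MiB, exceeds L3: {total > assumed_l3_mib}")
```

The same helper reproduces the 320 MB quoted in the question for 20 processes at size 256.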
Furthermore, b.conj() creates a new array that may not be allocated at the same memory location every time, so the processor also needs to write data back. This effect depends on the low-level allocator being used. One can use the out parameter to fix this problem. That being said, the problem was not significant on my machine (using out was 2% faster with 6 processes and equally fast with 1 process).
In short, more processes access more data, and the total amount of data does not fit in the CPU caches, which decreases performance since arrays need to be reloaded over and over.
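For reference, a minimal sketch of the out variant mentioned above, assuming a single preallocated scratch buffer is acceptable for the workload. The array here is deliberately tiny, and scratch is a name chosen for this example.

```python
import numpy as np

a = np.random.rand(4, 64, 64) + 1.0j * np.random.rand(4, 64, 64)

# One scratch buffer, allocated once and reused for every sub-matrix.
scratch = np.empty_like(a[0])

for b in a:
    np.conjugate(b, out=scratch)   # writes into scratch instead of allocating

# After the loop, scratch holds the conjugate of the last sub-matrix.
print(np.array_equal(scratch, a[-1].conj()))
```

As noted above, the gain from this was small in practice, so it is worth measuring before relying on it.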
Frequency scaling
Modern processors use frequency scaling (like turbo boost) to make (mostly) sequential applications faster, but they cannot use the same high frequency for all cores when all of them are computing, because processors have a limited power budget. This results in lower theoretical scalability. All processes here do the same work, so N processes running on N cores take more time than 1 process running on 1 core, since each core runs at a lower frequency.
When 1 process is created, two cores are operating at 4550-4600 MHz (and others are at 3700 MHz) while when 6 processes are running, all cores operate at 4300 MHz. This is enough to explain a difference of up to 7% on my machine.
You can hardly control the turbo frequency, but you can either disable it completely or pin the frequency so that the minimum and maximum frequencies are both set to the base frequency. Note that the processor is free to use a much lower frequency in pathological cases (i.e. throttling, when a critical temperature is reached). I do see improved behavior when tweaking frequencies (7~10% better in practice).
Other effects
When the number of processes is equal to the number of cores, the OS does more context switches of the processes than if one core is left free for other tasks. Context switches slightly decrease the performance of the process. This is especially true when all cores are allocated, because it is harder for the OS scheduler to avoid unnecessary migrations. This usually happens on PCs with many running processes, but not much on computing machines. This overhead is about 5-10% on my machine.
Note that the number of processes should not exceed the number of cores (not hyper-threads). Beyond this limit, performance is hardly predictable and many complex overheads appear (mainly scheduling issues).

I'll accept Jérôme's answer.
For the interested reader who might ask:
Why are you subdividing your big numpy array and only working on sub-matrices?
The answer is that it's faster!
Let's consider a complex matrix 'a' which is 128 MB big (too big to fit in cache).
For a single process one can quickly check that in
import numpy as np
import timeit
a=np.random.rand(8192,32,32)+1.j*np.random.rand(8192,32,32)
print(timeit.timeit('a.conj()',number=100,globals=globals()))
print(timeit.timeit('for i in range(0,8192,8): a[i:i+8].conj()',number = 100 ,globals = globals()))
the second timeit which iterates over 128kB sub-matrices finishes faster than the first (if 128kB is somewhere between your L1 and L2 cache sizes).
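The block size of 8 rows used above follows from the target size in bytes. A small sketch of that calculation, where rows_per_block and block_slices are hypothetical helpers and the 128 KiB target is the figure from the text:

```python
BYTES_PER_COMPLEX128 = 16


def rows_per_block(target_bytes, inner_shape=(32, 32)):
    """Number of leading-axis rows whose data fits in target_bytes."""
    bytes_per_row = BYTES_PER_COMPLEX128
    for dim in inner_shape:
        bytes_per_row *= dim
    return max(1, target_bytes // bytes_per_row)


def block_slices(n_rows, block_rows):
    """Yield slices that cover range(n_rows) in steps of block_rows."""
    for start in range(0, n_rows, block_rows):
        yield slice(start, min(start + block_rows, n_rows))


if __name__ == "__main__":
    block = rows_per_block(128 * 1024)   # 128 KiB target, as in the text
    print(block, len(list(block_slices(8192, block))))
```

Each slice can then be used as a[s].conj() in place of the hand-written range(0, 8192, 8) loop.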
In the following plots I show computation time vs. sub-matrix size, computed on two test machines. There are two plots for each test case, covering the sub-matrix size ranges 16kB - 1024kB (in 16kB steps) and 0.5MB - 64MB (in 0.5MB steps) respectively.
Machine I: 2 * Xeon E5-2640 v3 (L1i=L1d=32KB, L2=256KB, L3=20MB, 10 cores)
Machine II: 2 * Xeon E5-2640 v4 (L1i=L1d=32KB, L2=256KB, L3=50MB, 20 cores)
The sub-matrix size for which the calculation completes the quickest (64KB) is suspiciously exactly the size of the combined L1 cache of the two CPUs on each of the test machines.
At the value of the combined L2 cache (512KB) nothing special happens.
As soon as the combined sub-matrix size of all processes running in parallel exceeds the L3 cache of one of the available CPUs, the computation time starts to increase rapidly (e.g. Machine I, 19 processes, at ~1MB; Machine II, 37 processes, at ~1.3MB).
Here is the script for the plots:
from multiprocessing import Process, Queue
import time
import numpy as np
import timeit
from matplotlib import pyplot as plt
import os
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'
m_shape = (8192,32,32)
def f(q,size):
    a = np.random.rand(*m_shape) + 1.j*np.random.rand(*m_shape)
    start = time.time()
    n=a.shape[0]
    for i in range(0,n,size):
        a[i:i+size].conj()
    duration = time.time()-start
    q.put(duration)
def speed_test(number_of_processes=1,size=1):
    number_of_processes = number_of_processes
    process_list=[]
    queue = Queue()
    #Start processes
    for p_id in range(number_of_processes):
        p = Process(target=f,args=(queue,size))
        process_list.append(p)
        p.start()
    #Wait until all processes are finished
    for p in process_list:
        p.join()
    output = []
    while queue.qsize() != 0:
        output.append(queue.get())
    return np.mean(output)
if __name__ == '__main__':
    processes=np.arange(1,20,3)
    data=[[] for i in processes]
    ## L1 L2 cache data range
    sub_matrix_sizes = list(range(1,64,1))
    ## L3 cache data range
    #sub_matrix_sizes = list(range(32,4098,32))
    #sub_matrix_sizes.append(8192)
    for p_id,p in enumerate(processes):
        for size_0 in sub_matrix_sizes:
            data[p_id].append(speed_test(number_of_processes=p,size=size_0))
        print('{} of {} finished.'.format(p_id+1,len(processes)))
    data = np.array(data)
    sub_size_in_kb = np.array(sub_matrix_sizes)*np.dtype(complex).itemsize*np.prod(m_shape[1:])/1024
    sub_size_in_mb = sub_size_in_kb/1024
    fig,ax = plt.subplots()
    for d in data:
        ax.plot(sub_size_in_kb,d)
    ax.set_xlabel('Matrix Size in KB')
    #ax.set_xlabel('Matrix Size in MB')
    ax.set_ylabel('Runtime in seconds')
    fig.savefig('result.png')
    print('done.')

Issues with parallelizing processing of numpy array

I am having an issue with my attempt in speeding up the computation of my program. In the serialized python version of my code, I'm computing the values of a function f(x), which returns a float, for sliding windows of the NumPy array as can be seen below:
a = np.array([i for i in range(1, 10000000)]) # Some data here
N = 100
result = []
for i in range(N, len(a)):
    result.append(f(a[i - N:i]))
Since the NumPy array is really large and f(x) runtime is high, I've tried to apply multiprocessing to speed up my code. Through my research, I found that charm4py might be a great solution and it has a Pool feature, which breaks up an array in chunks and distributes work between spawned processes. I've implemented charm4py's multiprocessing example and then, translated it to my case:
# Split an array into subarrays for sequential processing (takes only 5 seconds)
a = np.array([a[i - N:i] for i in range(N, len(a))])
result = charm.pool.map(f, a, chunksize=512, ncores=-1)
# I'm running this code through "charmrun +p18 example.py"
The issue that I've encountered is that code was running a lot slower, despite being executed on a more powerful instance (18 physical cores vs 6 physical cores).
I've expected to see ~3x improvement, but it didn't happen. While searching for solutions I've learned that there is some overhead for expensive deserialization/spinning up new processes, but I am not sure if this is the case.
I would really appreciate any feedback or suggestions on how one can implement fast parallel processing of a NumPy array (assuming that function f(x) is not vectorized, takes a pretty long time to compute, and internally makes a large number of specific/individual calls that cannot be parallelized)?
Thank you!

It sounds like you're trying to parallelize this operation with either Charm or Ray (it's not clear how you would use both together).
If you choose to use Ray, and your data is a numpy array, you can take advantage of zero-copy reads to avoid any deserialization overhead.
You may want to optimize your sliding window function a bit, but it will likely look like this:
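import numpy as np
import ray
@ray.remote
def apply_rolling(f, arr, start, end, window_size):
    results_arr = []
    # Clip so that the last window stays inside the array.
    stop = min(end, len(arr) - window_size + 1)
    for i in range(start, stop):
        results_arr.append(f(arr[i : i + window_size]))
    return np.array(results_arr)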
note that this structure lets us call f multiple times within a single task (aka batching).
To use our function:
# Some small setup
big_arr = np.arange(10000000)
big_arr_ref = ray.put(big_arr)
batch_size = len(big_arr) // int(ray.available_resources()["CPU"])
window_size = 100
# Kick off our tasks
result_refs = []
for i in range(0, len(big_arr), batch_size):
    end_point = min(i + batch_size, len(big_arr))
    ref = apply_rolling.remote(f, big_arr_ref, i, end_point, window_size)
    result_refs.append(ref)
# Handle the results
flattened = []
for section in ray.get(result_refs):
    flattened.extend(section)
I'm sure you'll want to customize this code, but here are 2 important and nice properties that you'll likely want to maintain.
Batching: We're utilizing batching to avoid starting too many tasks. In any system, parallelizing incurs overhead, so we always want to be careful and make sure we don't start too many tasks. Furthermore, we derive batch_size from the number of CPUs Ray reports as available, so that we use about as many batches as we have CPUs.
Shared memory: Since Ray's object store supports zero copy reads from numpy arrays, calling ray.get or reading from a numpy array is pretty much free (on a single machine where there are no network costs). There is some overhead in serializing/calling ray.put though, so this approach only calls put (the expensive operation) once, and ray.get (which is implicitly called) many times.
Tip: Be careful when passing arrays as parameters directly into remote functions. It will call ray.put multiple times, even if you pass the same object.
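The batching idea can be checked without Ray. Below is a plain-Python sketch that splits the window start indices into contiguous batches and confirms that the batches together reproduce a single sequential pass. window_batches and apply_rolling_local are hypothetical names, and apply_rolling_local only stands in for one remote task.

```python
def window_batches(n_items, window_size, n_batches):
    """Split all valid window start indices into contiguous (start, stop) batches."""
    n_windows = n_items - window_size + 1
    if n_windows <= 0:
        return []
    batch = -(-n_windows // n_batches)   # ceiling division
    return [(s, min(s + batch, n_windows)) for s in range(0, n_windows, batch)]


def apply_rolling_local(f, data, start, stop, window_size):
    """Local stand-in for one remote task: apply f to each window in the batch."""
    return [f(data[i:i + window_size]) for i in range(start, stop)]


if __name__ == "__main__":
    data = list(range(20))
    results = []
    for start, stop in window_batches(len(data), window_size=5, n_batches=3):
        results.extend(apply_rolling_local(sum, data, start, stop, 5))
    # Same answer as a single sequential pass over every window.
    print(results == [sum(data[i:i + 5]) for i in range(16)])
```

Batching over window start positions rather than over raw array offsets means windows that straddle a batch boundary are still computed exactly once.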

Here's an example based off of your code snippet that uses Ray to parallelize the array computations.
Note that the best way to do this will depend on what your function f looks like.
import numpy as np
import ray
import time
ray.init()
N = 100000
a = np.arange(10**7)
a_id = ray.put(a)
@ray.remote
def f(array, index):
    # Do processing
    time.sleep(0.2)
    return 1
result_ids = []
for i in range(len(a) // N):
    result_ids.append(f.remote(a_id, i))
results = ray.get(result_ids)

Why is multiprocessing slower here?

I am trying to speed up some code with multiprocessing in Python, but I cannot understand one point. Assume I have the following dumb function:
import time
from multiprocessing.pool import Pool
def foo(_):
    for _ in range(100000000):
        a = 3
When I run this code without using multiprocessing (see the code below) on my laptop (Intel - 8 cores cpu) time taken is ~2.31 seconds.
t1 = time.time()
foo(1)
print(f"Without multiprocessing {time.time() - t1}")
Instead, when I run this code by using Python multiprocessing library (see the code below) time taken is ~6.0 seconds.
pool = Pool(8)
t1 = time.time()
pool.map(foo, range(8))
print(f"Sample multiprocessing {time.time() - t1}")
To the best of my knowledge, when using multiprocessing there is some time overhead, mainly caused by the need to spawn the new processes and to copy the memory state. However, this should happen just once, when the processes are initially spawned at the very beginning, and should not be that large.
So what am I missing here? Is there something wrong in my reasoning?
Edit: I think it is better to be more explicit on my question. What I expected here was the multiprocessed code to be slightly slower than the sequential one. It is true that I don't split the whole work across the 8 cores, but I am using 8 cores in parallel to do the same job (hence in an ideal world the processing time should more or less stay the same). Considering the overhead of spawning new processes, I expected a total increase in time of some (not too big) percentage, but not of a ~2.60x increase as I got here.

Well, multiprocessing can't possibly make this faster: you're not dividing the work across 8 processes, you're asking each of 8 processes to do the entire thing. Each process will take at least as long as your code doing it just once without using multiprocessing.
So if multiprocessing weren't helping at all, you'd expect it to take about 8 times as long (it's doing 8x the work!) as your single-processor run. But you said it's not taking 2.31 * 8 ~= 18.5 seconds, but "only" about 6. So you're getting better than a factor of 3 speedup.
Why not more than that? Can't guess from here. That will depend on how many physical cores your machine has, and how much other stuff you're running at the same time. Each process will be 100% CPU-bound for this specific function, so the number of "logical" cores is pretty much irrelevant - there's scant opportunity for processor hyper-threading to help. So I'm guessing you have 4 physical cores.
On my box
Sample timing on my box, which has 8 logical cores but only 4 physical cores, and otherwise left the box pretty quiet:
Without multiprocessing 2.468580484390259
Sample multiprocessing 4.78624415397644
As above, none of that surprises me. In fact, I was a little surprised (but pleasantly) at how effectively the program used up the machine's true capacity.
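The comparison above can be expressed as a one-line formula: total work done divided by wall-clock time. A small sketch using only the timings quoted in this thread (effective_speedup is a name made up here):

```python
def effective_speedup(t_single, t_parallel, n_jobs):
    """How much faster n_jobs finished than running them back to back."""
    return n_jobs * t_single / t_parallel


if __name__ == "__main__":
    # Timings from the question (8 jobs on a pool of 8).
    print(round(effective_speedup(2.31, 6.0, 8), 2))
    # Timings from the answer's 4-physical-core machine.
    print(round(effective_speedup(2.468580484390259, 4.78624415397644, 8), 2))
```

A value close to the number of physical cores suggests the machine is being used about as well as it can be for this workload.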

@TimPeters already answered that you are actually just running the job 8 times across the 8 Pool subprocesses, so it is slower, not faster.
That answers the issue but does not really answer your underlying question. It is clear from your surprise at this result that you were expecting the single job to somehow be automatically split up and run in parts across the 8 Pool processes. That is not the way it works. You have to tell it how to split up the work.
Different kinds of jobs need to be subdivided in different ways, but to continue with your example you might do something like this:
import time
from multiprocessing.pool import Pool
def foo(_):
    for _ in range(100000000):
        a = 3
def foo2(job_desc):
    start, stop = job_desc
    print(f"{start}, {stop}")
    for _ in range(start, stop):
        a = 3
def main():
    t1 = time.time()
    foo(1)
    print(f"Without multiprocessing {time.time() - t1}")
    pool_size = 8
    pool = Pool(pool_size)
    t1 = time.time()
    top_num = 100000000
    size = top_num // pool_size
    job_desc_list = [[size * j, size * (j+1)] for j in range(pool_size)]
    # this is in case the upper bound is not a multiple of pool_size
    job_desc_list[-1][-1] = top_num
    pool.map(foo2, job_desc_list)
    print(f"Sample multiprocessing {time.time() - t1}")
if __name__ == "__main__":
    main()
Which results in:
Without multiprocessing 3.080709171295166
0, 12500000
12500000, 25000000
25000000, 37500000
37500000, 50000000
50000000, 62500000
62500000, 75000000
75000000, 87500000
87500000, 100000000
Sample multiprocessing 1.5312283039093018
As this shows, splitting the job up does allow it to take less time. The speedup will depend on the number of CPUs. In a CPU-bound job you should try to limit the pool size to the number of CPUs. My laptop has plenty more CPUs, but some of the benefit is lost to the overhead. If the jobs were longer, this should look more useful.

iterating through a huge loop efficiently using python

I have 100000 images and I need to get the vectors for each image
imageVectors = []
for i in range(100000):
    fileName = "Images/" + str(i) + '.jpg'
    imageVectors.append(getvector(fileName).reshape((1,2048)))
cPickle.dump( imageVectors, open( 'imageVectors.pkl', "w+b" ), cPickle.HIGHEST_PROTOCOL )
getvector is a function that takes 1 image at a time and takes about 1 second to process it. So, basically my problem reduces to
for i in range(100000):
    A = callFunction(i)  # a complex function that takes 1 sec for each call
The things that I have already tried are (only the pseudo-code is given here):
1) Using numpy vectorizer:
def callFunction1(i):
    return callFunction2(i)
vfunc = np.vectorize(callFunction1)
imageVectors = vfunc(list(range(100000)))
2) Using python map:
def callFunction1(i):
    return callFunction2(i)
imageVectors = map(callFunction1, list(range(100000)))
3) Using python multiprocessing:
import multiprocessing
try:
    cpus = multiprocessing.cpu_count()
except NotImplementedError:
    cpus = 4 # arbitrary default
pool = multiprocessing.Pool(processes=cpus)
result = pool.map(callFunction, xrange(100000000))
4) Using multiprocessing in a different way:
from multiprocessing import Process, Queue
q = Queue()
N = 100000000
p1 = Process(target=callFunction, args=(N/4,q))
p1.start()
p2 = Process(target=callFunction, args=(N/4,q))
p2.start()
p3 = Process(target=callFunction, args=(N/4,q))
p3.start()
p4 = Process(target=callFunction, args=(N/4,q))
p4.start()
results = []
for i in range(4):
    results.append(q.get(True))
p1.join()
p2.join()
p3.join()
p4.join()
All the above methods take an enormous amount of time. Is there a more efficient way, so that I can loop through many elements simultaneously instead of sequentially, or do it faster in some other way?
The time is mainly taken by the getvector function itself. As a workaround, I have split my data into 8 batches and am running the same program on different parts of the loop, as eight separate instances of Python on an octa-core VM in Google Cloud. Could anyone suggest whether map-reduce or using GPUs via PyCUDA may be a good option?

The multiprocessing.Pool solution is a good one, in the sense that it uses all your cores. So it should be approximately N times faster than using plain old map, where N is the number of cores you have.
BTW, you can skip determining the amount of cores. By default multiprocessing.Pool uses as many processes as your CPU has cores.
Instead of a plain map (which blocks until everything has been processed), I would suggest using imap_unordered. This is an iterator that will start returning results as soon as they become available so your parent process can start further processing if any. If ordering is important, you might want to return a tuple (number, array) to identify the result.
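A minimal sketch of the imap_unordered pattern with (number, array) tuples, as suggested above. get_vector_stub is a hypothetical stand-in for the real getvector and returns a short dummy list; collect is likewise a name invented for this example.

```python
from multiprocessing import Pool


def get_vector_stub(index):
    """Hypothetical stand-in for getvector(): returns (index, result)."""
    return index, [index * 0.5] * 4


def collect(pairs, n_items):
    """Place (index, result) pairs, arriving in any order, into a list."""
    out = [None] * n_items
    for index, value in pairs:
        out[index] = value
    return out


if __name__ == "__main__":
    n_items = 16
    with Pool() as pool:   # defaults to one worker per CPU
        pairs = pool.imap_unordered(get_vector_stub, range(n_items), chunksize=2)
        # Consume the iterator while the pool is still alive.
        vectors = collect(pairs, n_items)
    print(len(vectors), vectors[3])
```

Because each result carries its own index, the parent can process results as they arrive and still reassemble them in the original order.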
Your function returns a numpy array of 2048 values, which I assume are numpy.float64. Using the standard mapping functions will transport the results back to the parent process using IPC. On a 4-core machine that will result in 4 IPC transports of 2048*8 = 16384 bytes each second, so 65536 bytes/second. That doesn't sound too bad. But I don't know how much overhead the IPC (which involves pickling and Queues) will incur.
In case the overhead is large, you might want to create a shared memory area to store the results in. You would need approximately 1.5 GiB to store 100000 results of 2048 8-byte floats. That is a sizeable amount of memory, but not impractical for current machines.
For 100000 images and 4 cores, with each image taking around one second, your program's running time would be about 7 hours (100000 / 4 = 25000 seconds).
Your most important optimization task would be to reduce the runtime of the getvector function. For example, would it run just as well if you halved the width and height of the images? Assuming that the runtime scales linearly with the number of pixels, that should cut the runtime to 0.25 s.
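The back-of-the-envelope numbers in this answer can be reproduced with a few lines. Everything below is derived from figures stated in the thread; the float64 element size is the answer's own assumption, and perfect scaling across cores is assumed for the run-time estimate.

```python
N_IMAGES = 100_000
VECTOR_LEN = 2048
BYTES_PER_FLOAT64 = 8      # assumes float64 results, as the answer does
SECONDS_PER_IMAGE = 1.0
N_CORES = 4

bytes_per_result = VECTOR_LEN * BYTES_PER_FLOAT64
total_result_bytes = N_IMAGES * bytes_per_result
ipc_bytes_per_second = N_CORES * bytes_per_result / SECONDS_PER_IMAGE
wall_clock_hours = N_IMAGES * SECONDS_PER_IMAGE / N_CORES / 3600

print(f"per result: {bytes_per_result} bytes")
print(f"all results: {total_result_bytes / 2**30:.2f} GiB")
print(f"IPC traffic: {ipc_bytes_per_second:.0f} bytes/s")
print(f"ideal run time on {N_CORES} cores: {wall_clock_hours:.1f} h")
```

Changing SECONDS_PER_IMAGE to 0.25 shows directly how much a faster getvector would shorten the total run.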
