Strange behaviour during multiprocess calls to numpy conjugate - python

The attached script evaluates the numpy.conjugate routine for varying numbers of parallel processes on differently sized matrices and records the corresponding run times.
The matrix shape only varies in its first dimension (from 1,64,64 to 256,64,64). Conjugation calls are always made on 1,64,64 sub-matrices to ensure that the parts being worked on fit into the L2 cache on my system (256 KB per core; the L3 cache in my case is 25 MB). Running the script yields the following diagram (with slightly different axis labels and colors).
As you can see, starting from a shape of around 100,64,64 the runtime depends on the number of parallel processes used.
What could be the cause of this?
Or why is the dependence on the number of processes for matrices below (100,64,64) so low?
My main goal is to find a modification to this script such that the runtime becomes as independent as possible from the number of processes for matrices 'a' of arbitrary size.
In the case of 20 processes:
all 'a' matrices take at most: 20 * 16 * 256 * 64 * 64 bytes = 320 MB
all 'b' sub-matrices take at most: 20 * 16 * 1 * 64 * 64 bytes = 1.25 MB
So all sub-matrices fit simultaneously into the L3 cache, as well as individually into the L2 cache per core of my CPU.
I only used physical cores, no hyper-threading, for these tests.
Here is the script:
import os
# Limit BLAS threading; must be set before numpy is imported to take effect.
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'

from multiprocessing import Process, Queue
import time
import numpy as np
from matplotlib import pyplot as plt


def f(q, size):
    a = np.random.rand(size, 64, 64) + 1.j * np.random.rand(size, 64, 64)
    start = time.time()
    for i in range(20):
        for b in a:
            b.conj()
    duration = time.time() - start
    q.put(duration)


def speed_test(number_of_processes=1, size=1):
    process_list = []
    queue = Queue()
    # Start processes
    for p_id in range(number_of_processes):
        p = Process(target=f, args=(queue, size))
        process_list.append(p)
        p.start()
    # Wait until all processes are finished
    for p in process_list:
        p.join()
    output = []
    while queue.qsize() != 0:
        output.append(queue.get())
    return np.mean(output)


if __name__ == '__main__':
    processes = np.arange(1, 20, 3)
    data = [[] for i in processes]
    for p_id, p in enumerate(processes):
        for size_0 in range(1, 257):
            data[p_id].append(speed_test(number_of_processes=p, size=size_0))
    fig, ax = plt.subplots()
    for d in data:
        ax.plot(d)
    ax.set_xlabel('Matrix Size: 1-256,64,64')
    ax.set_ylabel('Runtime in seconds')
    fig.savefig('result.png')

The problem is due to at least a combination of two complex effects: cache thrashing and frequency scaling. I can reproduce the effect on my 6-core i5-9600KF processor.
Cache thrashing
The biggest effect comes from a cache-thrashing issue. It can easily be tracked by looking at the RAM throughput. Indeed, it is 4 GiB/s for 1 process and 20 GiB/s for 6 processes. The read throughput is similar to the write throughput, so the traffic is symmetric. My RAM is able to reach up to ~40 GiB/s, but usually only ~32 GiB/s for mixed read/write patterns. This means the RAM pressure is pretty big. Such a use case typically occurs in two situations:
an array is read/written back from/to RAM because the caches are not big enough;
many accesses to different locations in memory are made, but they map to the same cache lines in the L3.
At first glance, the first case is much more likely to happen here since the arrays are contiguous and pretty big (the other effect unfortunately also happens, see below). In fact, the main problem is that the a array is too big to fit in the L3. Indeed, when size is >128, a takes more than 128*64*64*8*2 = 8 MiB per process. Actually, a is built from two arrays that must also be read, so the space needed in cache is three times bigger than that: i.e. >24 MiB per process. All processes allocate the same amount of memory, so the more processes there are, the bigger the cumulative space taken by a. When the cumulative space is bigger than the cache, the processor needs to write data to RAM and read it back, which is slow.
In fact, this is even worse: the processes are not fully synchronized, so some processes can flush data needed by others while filling a.
Furthermore, b.conj() creates a new array that may not be allocated at the same memory location every time, so the processor also needs to write data back. This effect depends on the low-level allocator being used. One can use the out parameter to fix this problem. That being said, the problem was not significant on my machine (using out was 2% faster with 6 processes and equally fast with 1 process).
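As a minimal sketch of the out variant mentioned above (the shapes match the question's script; the reusable buffer name is my own):

import numpy as np

size = 256
a = np.random.rand(size, 64, 64) + 1.j * np.random.rand(size, 64, 64)

# Reuse one pre-allocated buffer instead of letting b.conj() allocate a
# fresh output array on every iteration.
buf = np.empty((64, 64), dtype=complex)
for b in a:
    np.conjugate(b, out=buf)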
Put shortly, more processes access more data, and the global amount of data does not fit in the CPU caches, which decreases performance since the arrays need to be reloaded over and over.
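To put the numbers above together, here is a back-of-the-envelope check (the 9 MiB L3 size is an assumption for an i5-9600KF-class part; adjust it for your CPU):

size, n_proc = 256, 6
bytes_per_complex = 16                        # complex128
a_bytes = size * 64 * 64 * bytes_per_complex  # the 'a' array itself
working_set = 3 * a_bytes                     # 'a' plus the two rand() inputs
l3_bytes = 9 * 1024**2                        # assumed 9 MiB shared L3
print(f"per process: {working_set / 2**20:.0f} MiB, "
      f"all processes: {n_proc * working_set / 2**20:.0f} MiB, "
      f"L3: {l3_bytes / 2**20:.0f} MiB")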
Frequency scaling
Modern processors use frequency scaling (like turbo boost) to make (quite) sequential applications faster, but they cannot use the same frequency for all cores when all of them are doing computations, because processors have a limited power budget. This results in lower theoretical scalability: since all processes do the same work, N processes running on N cores take more time than 1 process running on 1 core.
When 1 process is created, two cores operate at 4550-4600 MHz (and the others at 3700 MHz), while when 6 processes are running, all cores operate at 4300 MHz. This is enough to explain a difference of up to 7% on my machine.
You can hardly control the turbo frequency, but you can either disable it completely or control the frequency so that the minimum and maximum frequencies are both set to the base frequency. Note that the processor is free to use a much lower frequency in pathological cases (i.e. throttling, when a critical temperature is reached). I do see improved behaviour when tweaking frequencies (7~10% better in practice).
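A Linux-only sketch of pinning the frequency via the cpufreq sysfs interface (needs root; the target value and the presence of these writable files are assumptions about your system and cpufreq driver):

import glob

target_khz = "3700000"  # e.g. the CPU's base frequency, in kHz (assumed value)

for policy in glob.glob("/sys/devices/system/cpu/cpu*/cpufreq"):
    # Set min and max to the same value so every run uses the same clock.
    for knob in ("scaling_min_freq", "scaling_max_freq"):
        with open(f"{policy}/{knob}", "w") as f:
            f.write(target_khz)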
Other effects
When the number of processes is equal to the number of cores, the OS does more context switches of the processes than if one core is left free for other tasks. Context switches decrease the performance of the processes a bit. This is especially true when all cores are allocated, because it is harder for the OS scheduler to avoid unnecessary migrations. This usually happens on PCs with many running processes but not so much on computing machines. This overhead is about 5-10% on my machine.
Note that the number of processes should not exceed the number of cores (not hyper-threads). Beyond this limit, the performance is hardly predictable and many complex overheads appear (mainly scheduling issues).
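A small way to check both counts from Python (the psutil dependency is an assumption; the standard library only reports logical CPUs):

import os

print("logical CPUs (incl. hyper-threads):", os.cpu_count())

try:
    import psutil
    print("physical cores:", psutil.cpu_count(logical=False))
except ImportError:
    pass  # psutil not installed; only the logical count is available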

I'll accept Jérôme's answer.
For the interested reader who might ask:
Why are you subdividing your big numpy array and only working on sub-matrices?
The answer is that it's faster!
Let's consider a complex matrix 'a' which is 128 MB big (too big to fit in cache).
For a single process one can quickly check that in
import numpy as np
import timeit
a=np.random.rand(8192,32,32)+1.j*np.random.rand(8192,32,32)
print(timeit.timeit('a.conj()',number=100,globals=globals()))
print(timeit.timeit('for i in range(0,8192,8): a[i:i+8].conj()',number = 100 ,globals = globals()))
the second timeit call, which iterates over 128 kB sub-matrices, finishes faster than the first (if 128 kB is somewhere between your L1 and L2 cache sizes).
In the following plots I'll show computation time vs. sub-matrix size, computed on two test machines. There are two plots for each test case, covering the sub-matrix size ranges 16 kB - 1024 kB (in 16 kB steps) and 0.5 MB - 64 MB (in 0.5 MB steps), respectively.
Machine I: 2 * Xeon E5-2640 v3 (L1i=L1d=32 KB, L2=256 KB, L3=20 MB, 10 cores)
Machine II: 2 * Xeon E5-2640 v4 (L1i=L1d=32 KB, L2=256 KB, L3=50 MB, 20 cores)
The sub-matrix size for which the calculation completes the quickest (64 KB) is suspiciously exactly the size of the combined L1 cache of the two CPUs on each of the test machines.
At the value of the combined L2 cache (512 KB) nothing special happens.
As soon as the combined sub-matrix size of all parallel running processes exceeds the L3 cache of one of the available CPUs, the computation time starts to increase rapidly (e.g. Machine I, 19 processes, at ~1 MB; Machine II, 37 processes, at ~1.3 MB).
Here is the script for the plots:
from multiprocessing import Process, Queue
import time
import os
# Limit BLAS threading; must be set before numpy is imported to take effect.
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'

import numpy as np
from matplotlib import pyplot as plt

m_shape = (8192, 32, 32)


def f(q, size):
    a = np.random.rand(*m_shape) + 1.j * np.random.rand(*m_shape)
    start = time.time()
    n = a.shape[0]
    for i in range(0, n, size):
        a[i:i + size].conj()
    duration = time.time() - start
    q.put(duration)


def speed_test(number_of_processes=1, size=1):
    process_list = []
    queue = Queue()
    # Start processes
    for p_id in range(number_of_processes):
        p = Process(target=f, args=(queue, size))
        process_list.append(p)
        p.start()
    # Wait until all processes are finished
    for p in process_list:
        p.join()
    output = []
    while queue.qsize() != 0:
        output.append(queue.get())
    return np.mean(output)


if __name__ == '__main__':
    processes = np.arange(1, 20, 3)
    data = [[] for i in processes]
    ## L1/L2 cache data range
    sub_matrix_sizes = list(range(1, 64, 1))
    ## L3 cache data range
    #sub_matrix_sizes = list(range(32, 4098, 32))
    #sub_matrix_sizes.append(8192)
    for p_id, p in enumerate(processes):
        for size_0 in sub_matrix_sizes:
            data[p_id].append(speed_test(number_of_processes=p, size=size_0))
        print('{} of {} finished.'.format(p_id + 1, len(processes)))

    data = np.array(data)
    sub_size_in_kb = np.array(sub_matrix_sizes) * np.dtype(complex).itemsize * np.prod(m_shape[1:]) / 1024
    sub_size_in_mb = sub_size_in_kb / 1024

    fig, ax = plt.subplots()
    for d in data:
        ax.plot(sub_size_in_kb, d)
    ax.set_xlabel('Matrix Size in KB')
    #ax.set_xlabel('Matrix Size in MB')
    ax.set_ylabel('Runtime in seconds')
    fig.savefig('result.png')
    print('done.')

Related

Python Pool Multiprocessing Poor CPU Usage

I have a bunch of independent N-body sims I want to run in parallel in Python. The walltime for individual sims is going to vary dramatically depending on the parameters of the bodies in the sims. It seemed like the best way to do this would be to build a pool of processes with the multiprocessing module, give them the sim jobs with the starmap() function, and have them save the results to separate files based on the process ID. However, I've been getting awful parallel performance. There is no speedup between 2 and 4 processes (I have 4 CPUs on my laptop) and the unix time utility seems to think that the CPU usage percentage is ~150%, which is terrible. Below is my code:
import rebound
import numpy as np
import multiprocessing as mp


def two_orbits_one_pool(orbit1, orbit2):
    #######################################
    print('process number', mp.current_process().name)
    #######################################
    # build simulation
    sim = rebound.Simulation()
    # add sun
    sim.add(m=1.)
    # add two overlapping orbits
    sim.add(primary=sim.particles[0], m=orbit1['m'], a=orbit1['a'], e=orbit1['e'], inc=orbit1['i'],
            pomega=orbit1['lop'], Omega=orbit1['lan'], M=orbit1['M'])
    sim.add(primary=sim.particles[0], m=orbit2['m'], a=orbit2['a'], e=orbit2['e'], inc=orbit2['i'],
            pomega=orbit2['lop'], Omega=orbit2['lan'], M=orbit2['M'])
    sim.move_to_com()
    # integrate for 10 orbits of orbit1
    P = 2.*np.pi * np.sqrt(orbit1['a']**3)
    sim.automateSimulationArchive("archive-{}.bin".format(mp.current_process().name), interval=P)
    sim.integrate(10.*P)


if __name__ == "__main__":
    # orbit definitions
    N_M = 10
    N_lop = 10
    m = 1e-6
    a, e = 1., 0.3
    inc, lop, lan = 0., 0., 0.
    M = np.linspace(0., 2*np.pi, endpoint=False, num=N_M)
    dlop = np.linspace(0., 0.05, num=N_lop)
    # orbit dictionaries
    args = []
    for i in range(dlop.shape[0]):
        for j in range(M.shape[0]):
            for k in range(M.shape[0]):
                args.append(({'m': m, 'a': a, 'e': e, 'i': inc,
                              'lop': lop, 'lan': lan, 'M': M[j]},
                             {'m': m, 'a': a, 'e': e, 'i': inc,
                              'lop': lop + dlop[i], 'lan': lan, 'M': M[k]}))
    # fill the pool with orbit jobs
    with mp.Pool() as pool:
        pool.starmap(two_orbits_one_pool, args)
Could someone explain why this is performing so poorly? I'm much more used to OpenMP and MPI; I'm not that familiar with parallel programming in Python. Overall, I've been quite disappointed in the multiprocessing module. I think maybe I should try using the numba module instead?
EDIT:
In response to Roland Smith's answer, I profiled the integration and save time for my code. Here is a strip plot showing the results. As you can see, both Roland Smith's and J_H's suggestions were true. There is a subset of initial conditions that result in extremely long integration times due to close encounters between the bodies. However, in general, the save time was about 5 times longer than the integration time. The job suffers from stragglers and is disk I/O bound.
If there is no discernable speedup, then probably your code is not CPU-bound.
In general, writing to a disk (even an SSD) is much slower than running code on the CPU.
If several worker processes are writing significant amounts of data to disk, that might be the bottleneck.
To diagnose the problem, you have to measure.
You should separate the calculations from the saving of the data; e.g. run sim.integrate() followed by sim.simulationarchive_snapshot() 10 times, and sandwich each of those calls between time.monotonic() calls. Then return the average time of the integration step and the snapshot steps as shown below.
import time

def two_orbits_one_pool(orbit1, orbit2):
    #######################################
    print('process number', mp.current_process().name)
    #######################################
    # build simulation
    sim = rebound.Simulation()
    # add sun
    sim.add(m=1.)
    # add two overlapping orbits
    sim.add(primary=sim.particles[0], m=orbit1['m'], a=orbit1['a'], e=orbit1['e'], inc=orbit1['i'],
            pomega=orbit1['lop'], Omega=orbit1['lan'], M=orbit1['M'])
    sim.add(primary=sim.particles[0], m=orbit2['m'], a=orbit2['a'], e=orbit2['e'], inc=orbit2['i'],
            pomega=orbit2['lop'], Omega=orbit2['lan'], M=orbit2['M'])
    sim.move_to_com()
    # integrate for 10 orbits of orbit1
    P = 2.*np.pi * np.sqrt(orbit1['a']**3)
    arname = "archive-{}.bin".format(mp.current_process().name)
    itime, stime = 0.0, 0.0
    for k in range(10):
        start = time.monotonic()
        sim.integrate(P)
        itime += time.monotonic() - start
        start = time.monotonic()
        sim.simulationarchive_snapshot(arname)
        stime += time.monotonic() - start
    return (mp.current_process().name, itime/10, stime/10)


# Run the calculations
with mp.Pool() as pool:
    data = pool.starmap(two_orbits_one_pool, args)

# Print the times that it took.
for name, itime, stime in data:
    print(f"worker {name}: itime {itime} s, stime {stime} s")
That should tell you what the bottleneck is.
Possible solutions if writing to disk is the bottleneck:
Use an SSD to store the simulation results.
Use a RAM-disk to store the simulation results. (Although compared to an SSD not a huge performance boost.)
Check if you can tune your OS for maximum write performance.
Edit1: Given your measurement result, the obvious performance improvement is to save less often.
Another option that might be worth looking at is staggering the writes. That only makes sense if there is significant overlap between the writes from different processes, and if those concurrent writes can saturate the disk I/O subsystem. So you'd have to measure that first.
If there is overlap, create a Lock object in the parent process. Then acquire the lock before (explicitly) saving, and release it after. This won't work with automateSimulationArchive.
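A minimal, self-contained sketch of that idea, with stand-in bodies instead of the rebound calls (passing the Lock through a Pool initializer is one way to make it visible inside the workers):

import multiprocessing as mp
import time

save_lock = None  # set in every worker by init_worker()

def init_worker(lock):
    # Make the Lock created in the parent visible inside each pool worker.
    global save_lock
    save_lock = lock

def worker(task_id):
    time.sleep(0.1)  # stand-in for the CPU-bound integration step
    with save_lock:
        # Only one worker writes at a time, so the writes are staggered.
        with open(f"archive-{task_id}.bin", "wb") as f:
            f.write(b"\0" * 1024)  # stand-in for the snapshot

if __name__ == "__main__":
    lock = mp.Lock()
    with mp.Pool(initializer=init_worker, initargs=(lock,)) as pool:
        pool.map(worker, range(8))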
A last option is to write your own save function using mmap. Using mmap is somewhat clunky compared to normal file handling in Python. But it can significantly improve performance. However I am unsure that the gains justify the effort in this case.
The straggler effect can have a big impact on such jobs.
straggler effect
Suppose you have N tasks for N cores,
and each task has a different duration.
Order by duration to find min_time and max_time.
All N cores will be busy up through min_time,
but then they go idle, one by one.
Just before max_time, only a single "straggler" core is being used.
predictions
If you can make a decent guess about task duration beforehand,
use that to sort them in descending order.
For T tasks > N cores, schedule the long tasks first.
Then N tasks run for a while, the shortest of those completes,
and the now-idle core picks up a task of "medium" duration.
By the time we get to the T-th task, each core has a random
amount of work still to do, and we're scheduling a "short" task.
So cores are mostly busy doing useful work, right up till near the end.
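A small, self-contained sketch of the longest-first idea, with sleep() stand-ins for the sims (the duration estimate would be whatever proxy you can compute cheaply beforehand):

import multiprocessing as mp
import random
import time

def run_task(duration):
    time.sleep(duration)  # stand-in for one N-body integration
    return duration

if __name__ == "__main__":
    durations = [random.uniform(0.01, 0.5) for _ in range(64)]
    # Longest-first scheduling: the would-be stragglers start immediately
    # and the short tasks fill the idle cores near the end of the run.
    longest_first = sorted(durations, reverse=True)
    with mp.Pool() as pool:
        pool.map(run_task, longest_first, chunksize=1)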
logging
If you cannot make a useful duration estimate a priori,
at least record the start times and durations.
Use that to figure out whether stragglers are causing you grief,
or if it's something else like L3 cache thrashing.
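And a sketch of the logging fallback, again with a sleep() stand-in: a long tail of late-starting, long-running tasks points at stragglers, while uniformly slow tasks point at something else.

import multiprocessing as mp
import time

def logged(task):
    task_id, duration = task
    start = time.monotonic()
    time.sleep(duration)  # stand-in for the real simulation
    return task_id, start, time.monotonic() - start

if __name__ == "__main__":
    tasks = [(i, 0.05 * (i % 7 + 1)) for i in range(32)]
    with mp.Pool() as pool:
        rows = pool.map(logged, tasks, chunksize=1)
    t0 = min(start for _, start, _ in rows)
    for task_id, start, dur in sorted(rows, key=lambda r: r[1]):
        print(f"task {task_id}: started at +{start - t0:.2f}s, ran {dur:.2f}s")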

Diagnosing Python Multiprocessing Bottleneck

I am coding a phase retrieval algorithm and am currently stuck with what I think is a bad multiprocessing speedup, given the problem.
The algorithm itself is composed of an iterative sequence of float operations on a numpy matrix whose size is on the order of 10-100 MB.
No IO operations are done while the algorithm is running.
Parallelization just amounts to running several of these iterative procedures in parallel using multiprocessing.Process.
I tested the program on a node with 40 physical CPU cores (80 threads) and 250 GB RAM.
Given that there is no communication between the processes and no I/O calls, I expected a multiprocessing speedup somewhere between 40 and 80 times on this node (was this naive?).
However, the best I could achieve was a speedup of about 20 using 40 parallel processes.
To diagnose this, I used cProfile on a random process which is part of a run with 1, 40 and 70 parallel executions.
It turns out that the relative amount of time spent in each sub-part of the iterative procedure stays roughly constant for all three tests, but each individual operation takes much longer. This goes to the point where a simple call to numpy.conjugate takes 4 to 5 times longer in the 70-parallel-processes case compared to the single-process case. Here are some SnakeViz diagrams:
Clearly I am running into some kind of bottleneck, but what is it and how can I diagnose it further?
RAM space is not a problem; the 250 GB are more than enough.
Could RAM bandwidth be an issue? How do I check that?
The CPU-core usage of a single process is close to 100% in all cases. With 80 available "cores" this results in roughly 1%, 50% and 90% overall CPU usage for the 1, 40 and 70 parallel process cases.
EDIT
Here is a minimum working example just using numpy.conjugate()
from multiprocessing import Process, Queue
import time
import os
# Limit BLAS threading; must be set before numpy is imported to take effect.
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'

import numpy as np


def f(q):
    a = np.random.rand(64, 254, 128) + 1.j * np.random.rand(64, 254, 128)
    start = time.time()
    for i in range(10):
        a.conj()
    duration = time.time() - start
    q.put(duration)


def speed_test(number_of_processes=1):
    process_list = []
    queue = Queue()
    # Start processes
    for p_id in range(number_of_processes):
        p = Process(target=f, args=(queue,))
        process_list.append(p)
        p.start()
    # Wait until all processes are finished
    for p in process_list:
        p.join()
    output = []
    while queue.qsize() != 0:
        output.append(queue.get())
    return np.mean(output)


if __name__ == '__main__':
    p1 = speed_test(number_of_processes=1)
    p40 = speed_test(number_of_processes=40)
    p70 = speed_test(number_of_processes=70)
    print('\n 1 process took {} seconds\n 40 processes took {} seconds\n 70 processes took {} seconds'.format(p1, p40, p70))
The numpy.conjugate() part of this code runs about 5 times slower when executed in parallel on 70 cores compared to executing it on a single core. I would have expected a much smaller runtime difference; what is causing this?
EDIT 2
On a suggestion from Ahmed AEK, here is a plot of the timings for different array sizes.
And for bigger matrices:
Thanks for reading the post!
If I can provide further data just let me know.

Multiprocessing and multithreading in Python

I have a Python program which 1) reads a very large file from disk (~95% of the time) and then 2) processes it and provides a relatively small output (~5% of the time). This program is to be run on terabytes of files.
Now I am looking to optimize this program by utilizing multiprocessing and multithreading. The platform I am running on is a virtual machine with 4 processors.
I plan to have a scheduler process which will execute 4 processes (same as the number of processors), and each process should have some threads, since most of the work is I/O. Each thread will process one file and will report the result to the main thread, which in turn will report it back to the scheduler process via IPC. The scheduler can queue these and eventually write them to disk in an ordered manner.
So I am wondering: how does one decide the number of processes and threads to create for such a scenario? Is there a mathematical way to figure out the best mix?
Thank you
I think I would arrange it the inverse of what you are doing. That is, I would create a thread pool of a certain size that would be responsible for producing the results. The tasks that get submitted to this pool would be passed a processor pool as an argument, which the worker thread could use for submitting the CPU-bound portions of the work. In other words, the thread pool workers would primarily be doing all the disk-related operations and handing off to the processor pool any CPU-intensive work.
The size of the processor pool should just be the number of processors you have in your environment. It's difficult to give a precise size for the thread pool; it depends on how many concurrent disk operations it can handle before the law of diminishing returns comes into play. It also depends on your memory: the larger the pool, the greater the memory resources that will be taken, especially if entire files have to be read into memory for processing. So, you may have to experiment with this value. The code below outlines these ideas. What you gain from the thread pool is a greater overlapping of I/O operations than you would achieve if you just used a small processor pool:
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from functools import partial
import os


def cpu_bound_function(arg1, arg2):
    ...
    return some_result


def io_bound_function(process_pool_executor, file_name):
    with open(file_name, 'r') as f:
        # Do disk related operations:
        ...  # code omitted
        # Now we have to do a CPU-intensive operation:
        future = process_pool_executor.submit(cpu_bound_function, arg1, arg2)
        result = future.result()  # get result
    return result


file_list = [file_1, file_2, file_n]
N_FILES = len(file_list)
MAX_THREADS = 50  # depends on your configuration on how well the I/O can be overlapped
N_THREADS = min(N_FILES, MAX_THREADS)  # no point in creating more threads than required
N_PROCESSES = os.cpu_count()  # use the number of processors you have

with ThreadPoolExecutor(N_THREADS) as thread_pool_executor:
    with ProcessPoolExecutor(N_PROCESSES) as process_pool_executor:
        results = thread_pool_executor.map(partial(io_bound_function, process_pool_executor), file_list)
Important Note:
Another far simpler approach is to just have a single processor pool whose size is greater than the number of CPU processors you have, for example, 25. The worker processes will do both I/O and CPU operations. Even though you have more processes than CPUs, many of the processes will be in a wait state waiting for I/O to complete, allowing CPU-intensive work to run.
The downside to this approach is that the overhead of creating N processes is far greater than the overhead of creating N threads plus a small number of processes. However, as the running time of the tasks submitted to the pool grows, this increased overhead becomes a smaller and smaller percentage of the total run time. So, if your tasks are not trivial, this could be a reasonably performant simplification.
Update: Benchmarks of Both Approaches
I did some benchmarks against the two approaches processing 24 files whose sizes were approximately 10,000KB (actually, these were just 3 different files processed 8 times each, so there might have been some caching done):
Method 1 (thread pool + processor pool)
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from functools import partial
import os
from math import sqrt
import timeit


def cpu_bound_function(b):
    sum = 0.0
    for x in b:
        sum += sqrt(float(x))
    return sum


def io_bound_function(process_pool_executor, file_name):
    with open(file_name, 'rb') as f:
        b = f.read()
    future = process_pool_executor.submit(cpu_bound_function, b)
    result = future.result()  # get result
    return result


def main():
    file_list = ['/download/httpd-2.4.16-win32-VC14.zip'] * 8 + \
                ['/download/curlmanager-1.0.6-x64.exe'] * 8 + \
                ['/download/Element_v2.8.0_UserManual_RevA.pdf'] * 8
    N_FILES = len(file_list)
    MAX_THREADS = 50  # depends on your configuration on how well the I/O can be overlapped
    N_THREADS = min(N_FILES, MAX_THREADS)  # no point in creating more threads than required
    N_PROCESSES = os.cpu_count()  # use the number of processors you have
    with ThreadPoolExecutor(N_THREADS) as thread_pool_executor:
        with ProcessPoolExecutor(N_PROCESSES) as process_pool_executor:
            results = list(thread_pool_executor.map(partial(io_bound_function, process_pool_executor), file_list))
    print(results)


if __name__ == '__main__':
    print(timeit.timeit(stmt='main()', number=1, globals=globals()))
Method 2 (processor pool only)
from concurrent.futures import ProcessPoolExecutor
from math import sqrt
import timeit


def cpu_bound_function(b):
    sum = 0.0
    for x in b:
        sum += sqrt(float(x))
    return sum


def io_bound_function(file_name):
    with open(file_name, 'rb') as f:
        b = f.read()
    result = cpu_bound_function(b)
    return result


def main():
    file_list = ['/download/httpd-2.4.16-win32-VC14.zip'] * 8 + \
                ['/download/curlmanager-1.0.6-x64.exe'] * 8 + \
                ['/download/Element_v2.8.0_UserManual_RevA.pdf'] * 8
    N_FILES = len(file_list)
    MAX_PROCESSES = 50  # depends on your configuration on how well the I/O can be overlapped
    N_PROCESSES = min(N_FILES, MAX_PROCESSES)  # no point in creating more processes than required
    with ProcessPoolExecutor(N_PROCESSES) as process_pool_executor:
        results = list(process_pool_executor.map(io_bound_function, file_list))
    print(results)


if __name__ == '__main__':
    print(timeit.timeit(stmt='main()', number=1, globals=globals()))
Results:
(I have 8 cores)
Thread Pool + Processor Pool: 13.5 seconds
Processor Pool Alone: 13.3 seconds
Conclusion: I would try the simpler approach first of just using a processor pool for everything. Now the tricky bit is deciding what the maximum number of processes to create should be, which was part of your original question and has a simple answer when all the pool is doing is CPU-intensive computations. If the number of files you are reading is not too many, then the point is moot; you can have one process per file. But if you have hundreds of files, you will not want to have hundreds of processes in your pool (there is also an upper limit to how many processes you can create, and again there are those nasty memory constraints). There is just no way I can give you an exact number. If you do have a large number of files, start with a smallish pool size and keep incrementing until you get no further benefit (of course, you probably do not want to be processing more files than some maximum number for these tests, or you will be running forever just deciding on a good pool size for the real run).
For parallel processing:
I saw this question, and quoting the accepted answer:
In practice, it can be difficult to find the optimal number of threads and even that number will likely vary each time you run the program. So, theoretically, the optimal number of threads will be the number of cores you have on your machine. If your cores are "hyper threaded" (as Intel calls it) it can run 2 threads on each core. Then, in that case, the optimal number of threads is double the number of cores on your machine.
For multiprocessing:
Someone asked a similar question here, and the accepted answer said this:
If all of your threads/processes are indeed CPU-bound, you should run as many processes as the CPU reports cores. Due to HyperThreading, each physical CPU cores may be able to present multiple virtual cores. Call multiprocessing.cpu_count to get the number of virtual cores.
If only p of 1 of your threads is CPU-bound, you can adjust that number by multiplying by p. For example, if half your processes are CPU-bound (p = 0.5) and you have two CPUs with 4 cores each and 2x HyperThreading, you should start 0.5 * 2 * 4 * 2 = 8 processes.
The key here is understanding what machine you are using; from that, you can choose a nearly optimal number of threads/processes to split the execution of your code. And I said nearly optimal because it will vary a little bit every time you run your script, so it is difficult to predict this optimal number from a mathematical point of view.
For your specific situation, if your machine has 4 cores, I would recommend that you create at most 4 threads, and then split them:
1 for the main thread.
3 for file reading and processing.
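As a tiny illustration of the rule of thumb quoted above (the helper name is hypothetical; p is the fraction of time each worker is actually CPU-bound):

import os

def suggested_workers(p, logical_cores=None):
    # Scale the logical core count by the CPU-bound fraction p.
    logical_cores = logical_cores or os.cpu_count()
    return max(1, round(p * logical_cores))

# Example from the quote: p = 0.5, 2 CPUs x 4 cores x 2 (HyperThreading) = 16
print(suggested_workers(0.5, logical_cores=16))  # -> 8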
Using multiple processes to speed up I/O performance may not be a good idea; check this and the sample code below it to see whether it is helpful.
One idea could be to have one thread only reading the file (if I understood well, there is only one file) and pushing the independent parts (e.g. rows) into a queue as messages.
The messages can then be processed by 4 threads. In this way, you can optimize the load between the processors.
For a strongly I/O-bound process (like what you are describing), you do not necessarily need multithreading or multiprocessing: you could also use more advanced I/O primitives from your OS.
For example, on Linux you can submit read requests to the kernel along with a suitably sized mutable buffer and be notified when the buffer is filled. This can be done using the AIO API, for which I've written a pure-Python binding: python-libaio (libaio on PyPI), or with the more recent io_uring API, for which there seems to be a CFFI Python binding (liburing on PyPI) (I have used neither io_uring nor this Python binding).
This removes the complexity of parallel processing at your level, may reduce the number of OS/userland context switches (reducing the cpu time even further), and lets the OS know more about what you are trying to do, giving it the opportunity of scheduling the IO more efficiently (in a virtualised environment I would not be surprised if it reduced the number of data copies, although I have not tried it myself).
Of course, the downside is that your program will be more tightly bound to the OS you are executing it on, requiring more effort to get it to run on another one.

iterating through a huge loop efficiently using python

I have 100000 images and I need to get a vector for each image:
imageVectors = []
for i in range(100000):
    fileName = "Images/" + str(i) + '.jpg'
    imageVectors.append(getvector(fileName).reshape((1, 2048)))
cPickle.dump(imageVectors, open('imageVectors.pkl', "w+b"), cPickle.HIGHEST_PROTOCOL)
getvector is a function that takes 1 image at a time and needs about 1 second to process it. So, basically, my problem reduces to
for i in range(100000):
    A = callFunction(i)  # a complex function that takes 1 sec per call
The things that I have already tried are (only pseudo-code is given here):
1) Using numpy vectorize:
def callFunction1(i):
    return callFunction2(i)

vfunc = np.vectorize(callFunction1)
imageVectors = vfunc(list(range(100000)))
2)Using python map:
def callFunction1(i):
    return callFunction2(i)

imageVectors = map(callFunction1, list(range(100000)))
3) Using python multiprocessing:
import multiprocessing
try:
    cpus = multiprocessing.cpu_count()
except NotImplementedError:
    cpus = 4  # arbitrary default

pool = multiprocessing.Pool(processes=cpus)
result = pool.map(callFunction, xrange(100000000))
4) Using multiprocessing in a different way:
from multiprocessing import Process, Queue

q = Queue()
N = 100000000
p1 = Process(target=callFunction, args=(N/4, q))
p1.start()
p2 = Process(target=callFunction, args=(N/4, q))
p2.start()
p3 = Process(target=callFunction, args=(N/4, q))
p3.start()
p4 = Process(target=callFunction, args=(N/4, q))
p4.start()

results = []
for i in range(4):
    results.append(q.get(True))

p1.join()
p2.join()
p3.join()
p4.join()
All of the above methods take an immensely long time. Is there any way more efficient than this, so that I can loop over many elements simultaneously instead of sequentially, or speed it up in any other way?
The time is mainly taken by the getvector function itself. As a workaround, I have split my data into 8 different batches, run the same program for different parts of the loop, and run eight separate instances of Python on an octa-core VM in Google Cloud. Could anyone suggest whether map-reduce or using GPUs via PyCUDA may be a good option?
The multiprocessing.Pool solution is a good one, in the sense that it uses all your cores. So it should be approximately N times faster than using plain old map, where N is the number of cores you have.
BTW, you can skip determining the number of cores. By default multiprocessing.Pool uses as many processes as your CPU has cores.
Instead of a plain map (which blocks until everything has been processed), I would suggest using imap_unordered. This is an iterator that will start returning results as soon as they become available, so your parent process can start further processing right away. If ordering is important, you might want to return a tuple (number, array) to identify the result.
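A minimal sketch of that pattern, assuming the getvector function from the question is defined or importable in the same module:

import multiprocessing as mp

def compute_vector(i):
    # getvector() is the question's feature extractor; returning the index
    # lets the parent restore the original order.
    file_name = "Images/" + str(i) + ".jpg"
    return i, getvector(file_name).reshape((1, 2048))

if __name__ == "__main__":
    image_vectors = [None] * 100000
    with mp.Pool() as pool:  # defaults to os.cpu_count() workers
        for i, vec in pool.imap_unordered(compute_vector, range(100000), chunksize=16):
            image_vectors[i] = vec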
Your function returns a numpy array of 2048 values, which I assume are numpy.float64. Using the standard mapping functions will transport the results back to the parent process using IPC. On a 4-core machine that will result in 4 IPC transports of 2048*8 = 16384 bytes, so 65536 bytes/second. That doesn't sound too bad. But I don't know how much overhead the IPC (which involves pickling and queues) will incur.
In case the overhead is large, you might want to create a shared memory area to store the results in. You would need approximately 1.5 GiB to store 100000 results of 2048 8-byte floats. That is a sizeable amount of memory, but not impractical for current machines.
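A sketch of the shared-memory variant using multiprocessing.shared_memory (Python 3.8+; again getvector is assumed to be available, and the chunk size is arbitrary):

from multiprocessing import Pool, shared_memory
import numpy as np

N_IMAGES, VEC_LEN = 100000, 2048

def worker(args):
    shm_name, i = args
    shm = shared_memory.SharedMemory(name=shm_name)
    vectors = np.ndarray((N_IMAGES, VEC_LEN), dtype=np.float64, buffer=shm.buf)
    # Write the result straight into the shared block; nothing is pickled back.
    vectors[i] = getvector("Images/" + str(i) + ".jpg").reshape(VEC_LEN)
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=N_IMAGES * VEC_LEN * 8)
    try:
        with Pool() as pool:
            pool.map(worker, [(shm.name, i) for i in range(N_IMAGES)], chunksize=16)
        vectors = np.ndarray((N_IMAGES, VEC_LEN), dtype=np.float64, buffer=shm.buf).copy()
    finally:
        shm.close()
        shm.unlink()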
For 100000 images and 4 cores and each image taking around one second, your program's running time would be in the order of 8 hours.
Your most important task for optimization would be to look into reducing the runtime of the getvector function. For example, would it work just as well if you reduced the size of the images by half? Assuming that the runtime scales linearly with the number of pixels, that should cut the runtime to 0.25 s.

Inefficient multiprocessing of numpy-based calculations

I'm trying to parallelize some calculations that use numpy with the help of Python's multiprocessing module. Consider this simplified example:
import time
import numpy
from multiprocessing import Pool


def test_func(i):
    a = numpy.random.normal(size=1000000)
    b = numpy.random.normal(size=1000000)
    for i in range(2000):
        a = a + b
        b = a - b
        a = a - b
    return 1


t1 = time.time()
test_func(0)
single_time = time.time() - t1
print("Single time:", single_time)

n_par = 4
pool = Pool()
t1 = time.time()
results_async = [
    pool.apply_async(test_func, [i])
    for i in range(n_par)]
results = [r.get() for r in results_async]
multicore_time = time.time() - t1

print("Multicore time:", multicore_time)
print("Efficiency:", single_time / multicore_time)
When I execute it, the multicore_time is roughly equal to single_time * n_par, while I would expect it to be close to single_time. Indeed, if I replace numpy calculations with just time.sleep(10), this is what I get — perfect efficiency. But for some reason it does not work with numpy. Can this be solved, or is it some internal limitation of numpy?
Some additional info which may be useful:
I'm using OS X 10.9.5, Python 3.4.2, and the CPU is a Core i7 with (as reported by the system info) 4 cores (although the above program only takes 50% of the total CPU time, so the system info may not be taking hyperthreading into account).
when I run this I see n_par processes in top working at 100% CPU
if I replace numpy array operations with a loop and per-index operations, the efficiency rises significantly (to about 75% for n_par = 4).
It looks like the test function you're using is memory-bound. That means that the run time you're seeing is limited by how fast the computer can pull the arrays from memory into the cache. For example, the line a = a + b actually uses 3 arrays: a, b, and a new array that will replace a. These three arrays are about 8 MB each (1e6 floats * 8 bytes per float). I believe the different i7s have something like 3 MB - 8 MB of shared L3 cache, so you cannot fit all 3 arrays in cache at once. Your CPU adds the floats faster than the arrays can be loaded into cache, so most of the time is spent waiting for the arrays to be read from memory. Because the cache is shared between the cores, you don't see any speedup by spreading the work onto multiple cores.
Memory-bound operations are an issue for numpy in general, and the only way I know to deal with them is to use something like cython or numba.
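As a minimal sketch of the numba route (assuming numba is installed): the element-wise loop below allocates no temporaries, so the working set shrinks from three arrays to two.

import numpy as np
from numba import njit

@njit(cache=True)
def swap_loop(a, b, iterations):
    # Element-wise updates: no temporary arrays are created per iteration.
    for _ in range(iterations):
        for i in range(a.shape[0]):
            a[i] = a[i] + b[i]
            b[i] = a[i] - b[i]
            a[i] = a[i] - b[i]

a = np.random.normal(size=1000000)
b = np.random.normal(size=1000000)
swap_loop(a, b, 2000)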
One easy thing that should bump efficiency up is to do in-place array operations, if possible: add(a, b, a) will not create a new array, while a = a + b will. If your for loop over numpy arrays can be rewritten as vector operations, that should be more efficient as well. Another possibility is to use numpy.ctypeslib to enable shared-memory numpy arrays (see: https://stackoverflow.com/a/5550156/2379433).
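A sketch of what the in-place rewrite of test_func from the question might look like (same arithmetic, but with explicit out= targets so no 8 MB temporaries are allocated per iteration):

import numpy as np

def test_func_inplace(i):
    a = np.random.normal(size=1000000)
    b = np.random.normal(size=1000000)
    for _ in range(2000):
        np.add(a, b, out=a)       # a = a + b, in place
        np.subtract(a, b, out=b)  # b = a - b, in place
        np.subtract(a, b, out=a)  # a = a - b, in place
    return 1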
I have been programming numerical methods for mathematics and had the same problem: I wasn't seeing any speed-up for a supposedly CPU-bound problem. It turned out my problem was hitting the CPU cache memory limit.
I have been using Intel PCM (Intel® Performance Counter Monitor) to see how the CPU cache was behaving (displaying it inside Linux ksysguard). I also disabled 2 of my processors to have clearer results (2 remained active).
Here is what I found out with this code:
import time
import numpy as np
import multiprocessing as mp


def somethinglong(b):
    n = 200000
    m = 5000
    shared = np.arange(n)
    for i in np.arange(m):
        0.01 * shared


pool = mp.Pool(2)
jobs = [() for i in range(8)]
for i in range(5):
    timei = time.time()
    pool.map(somethinglong, jobs, chunksize=1)
    #for job in jobs:
    #    somethinglong(job)
    print(time.time() - timei)
Example that doesn't reach the cache memory limit:
n=10000
m=100000
Sequential execution: 15s
2 processor pool no cache memory limit: 8s
It can be seen that there are no cache misses (all cache hits), therefore the speed-up is almost perfect: 15/8.
(Plot: memory cache hits, 2-process pool)
Example that reaches the cache memory limit:
n=200000
m=5000
Sequential execution: 14s
2 processor pool cache memory limit: 14s
In this case, I increased the size of the vector we operate on (and decreased the loop size to keep execution times reasonable). Here we can see that the memory gets full and the processes always miss the cache, so there is no speedup: 14/14.
(Plot: memory cache misses, 2-process pool)
Observation: assigning the result of an operation to a variable (aux = 0.01*shared) also uses cache memory and can make the problem memory-bound (without increasing any vector size).
