How does scikit-learn handle multiple n_jobs arguments? - python

I have made a pipeline in scikit-learn that looks as follows:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV

estimators2 = [
    ('tfidf', TfidfVectorizer(tokenizer=lambda string: string.split())),
    ('clf', SGDClassifier(n_jobs=13, early_stopping=True, class_weight='balanced'))
]
parameters2 = {
    'tfidf__min_df': np.arange(10, 30, 10),
    'tfidf__max_df': np.arange(0.75, 0.9, 0.05),
    'tfidf__ngram_range': [(1, 1), (2, 2), (3, 3)],
    'clf__alpha': (1e-2, 1e-3)
}
p2 = Pipeline(estimators2)
grid2 = RandomizedSearchCV(p2, param_distributions=parameters2,
                           scoring='balanced_accuracy', n_iter=20, cv=3,
                           n_jobs=13, pre_dispatch='n_jobs')
In this pipeline, the argument n_jobs appears twice. How is it handled by scikit-learn?

Q : How are they handled by scikit-learn?
So, let's start with the documentation, as it stood in 2019/Q4:
n_jobs : int or None, optional ( default = None )
Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
WHAT ALTERNATIVES DO WE HAVE HERE & WHICH IS THE BEST ONE ?
A)VOID parallelism at all
B)LOCK CPU instead of enhancing the performance
C)USTOM setup CPU-core mapping for maximum performance
D)ISTRIBUTE workloads across a cluster of Dask-nodes
OPTION A : A)VOID parallelism at all
So, one can explicitly avoid parallelism when processing a hierarchical composition like RandomizedSearchCV( Pipeline( [ (…), (…, SGDClassifier( n_jobs = 13, … ) ) ] ), …, n_jobs = 13, … ) for CPU-intensive, yet MEM-bound tasks, by explicitly using the "threading"-backend inside a context-manager:
from sklearn.utils import parallel_backend   # also available as joblib.parallel_backend

with parallel_backend( 'threading' ):
    grid2.fit( … )
Here, however many threads get instantiated, they all wait for the one, central GIL-lock. Only one executes at a time while the rest wait. This is the well-known GIL-lock-introduced re-[SERIAL]-isation of any thread-based code-execution into an even more expensive, just interleaved, pure-[SERIAL] python code execution. Except for the cases (not this one) where threading serves as a latency-masking trick ( making better use of the time an I/O-bound task spends in NOPs, waiting for an I/O-operation to finish and yield its result(s) ), this will not make your python-based ML-pipeline any faster, but the very opposite.
OPTION B : B)LOCK CPU instead of enhancing the performance
One may choose a better, less prohibitive backend - "multiprocessing", or in more recent joblib-releases "loky" - where the GIL-lock does not cause us such trouble ( at the cost of instantiating n_jobs-many python-process replicas, each with its own internal and unavoidable GIL-lock, which at least no longer competes against the process's other threads for a time-slice of work ). Yet the story does not end here. This option is the typical case where multiple levels of n_jobs-aware processing appear inside the same pipeline: each level fights (at the O/S-scheduler level) for a share of CPU-core time inside a time-sharing run of more processes than there are CPU-cores. The result? Processes get spawned, yet have to wait in a queue for their turn (if there are more of them than the number of cores permitted for the user - check not only the number of cores, but also the CPU-core-affinity settings enforced by the O/S for the given user's/process's effective rights, which on a tightly managed system can be far lower than the number of physical (or virtualisation-emulated) CPU-cores), losing time and losing CPU-cache pre-fetched blocks of data ( so paying again and again the expensive RAM-fetches of ~300~350 ns each, instead of re-using the pre-fetched (and already paid-for) data from L1/L2/L3-cache at a cost of about 0.5 ns (!) ).
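A minimal sketch of that backend selection ( the 13 is just the value from the question; X_train / y_train stand for whatever data the search is fitted on ):

from sklearn.utils import parallel_backend

with parallel_backend( 'loky', n_jobs = 13 ):   # or 'multiprocessing' on older joblib releases
    grid2.fit( X_train, y_train )               # X_train / y_train assumed from the caller's code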
OPTION C : C)USTOM setup CPU-core mapping for maximum performance
A good engineering practice is to carefully map CPU-cores for processing.
Given the right backend is in place, one has to decide where the performance bottleneck is. Here, most probably ( barring option D, equipped with a huge cluster of strong, fat-RAM machines ), one will prefer to make each and every SGDClassifier.fit() as fast as possible ( spending the n_jobs-specified process-instances on the most expensive sub-task - the training ), rather than having "more" RandomizedSearchCV()-initiated "toys" on the playground, suffocated by lack of RAM and CPU-cache inefficiencies.
Your code will always, even behind the curtain of not knowing all the details, have to "obey" running not on all of the CPU-cores, but only on those that any such multiprocessing-requested sub-process is permitted to harness, the number of which is no higher than len( os.sched_getaffinity( 0 ) ). If interested in the details, a minimal check is sketched below.
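A minimal sketch of such a check ( note: os.sched_getaffinity() is available on Linux-type O/S only; the printout is just illustrative ):

import os

usable = len( os.sched_getaffinity( 0 ) )   # CPU-cores this very process is permitted to use
print( "permitted to use", usable, "of", os.cpu_count(), "CPU-cores reported by the O/S" )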
In general, good planning and profiling practice will help attain the best reasonably-achievable mapping of the n_jobs-instantiated processes onto the available set of CPU-cores. No magic here: common sense, a process-monitor and benchmarking/timing of run-times will help us polish this competence.
OPTION D : D)ISTRIBUTE workloads across a cluster of Dask-nodes
Where possible, using Dask-module enabled distributed-computing nodes, one may set:
with parallel_backend( 'dask' ):
    grid2.fit( … )
which will harness all the Dask-cluster computing resources to get the "heavy" task completed smarter than is possible with just localhost-CPU/RAM resources. Ultimately, this is the maximum level of concurrent-processing possible inside the python-ecosystem in its current, as-is state.
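A minimal sketch of wiring that up ( the scheduler address is hypothetical; creating the dask.distributed Client is what makes the 'dask' backend available to joblib ):

from dask.distributed import Client
from sklearn.utils import parallel_backend

client = Client( 'tcp://10.0.0.1:8786' )   # hypothetical address of a running Dask scheduler
with parallel_backend( 'dask' ):
    grid2.fit( X_train, y_train )          # X_train / y_train stand for the caller's data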

You can try using accelerated implementations of the algorithms that introduce their own internal threading - such as scikit-learn-intelex - https://github.com/intel/scikit-learn-intelex
scikit-learn-intelex is based on TBB for parallelization and can utilize the entire system for its accelerated algorithms. So you would be able to go without joblib parallelization while still getting better performance.
First, install the package:
pip install scikit-learn-intelex
And then add to your python script, before importing any scikit-learn estimators (otherwise the patching will not take effect):
from sklearnex import patch_sklearn
patch_sklearn()

Related


How do you write (and run) a correct micro-benchmark in Java?
I'm looking for some code samples and comments illustrating various things to think about.
Example: Should the benchmark measure time/iteration or iterations/time, and why?
Related: Is stopwatch benchmarking acceptable?
Tips about writing micro benchmarks from the creators of Java HotSpot:
Rule 0: Read a reputable paper on JVMs and micro-benchmarking. A good one is Brian Goetz, 2005. Do not expect too much from micro-benchmarks; they measure only a limited range of JVM performance characteristics.
Rule 1: Always include a warmup phase which runs your test kernel all the way through, enough to trigger all initializations and compilations before timing phase(s). (Fewer iterations is OK on the warmup phase. The rule of thumb is several tens of thousands of inner loop iterations.)
Rule 2: Always run with -XX:+PrintCompilation, -verbose:gc, etc., so you can verify that the compiler and other parts of the JVM are not doing unexpected work during your timing phase.
Rule 2.1: Print messages at the beginning and end of timing and warmup phases, so you can verify that there is no output from Rule 2 during the timing phase.
Rule 3: Be aware of the difference between -client and -server, and OSR and regular compilations. The -XX:+PrintCompilation flag reports OSR compilations with an at-sign to denote the non-initial entry point, for example: Trouble$1::run @ 2 (41 bytes). Prefer server to client, and regular to OSR, if you are after best performance.
Rule 4: Be aware of initialization effects. Do not print for the first time during your timing phase, since printing loads and initializes classes. Do not load new classes outside of the warmup phase (or final reporting phase), unless you are testing class loading specifically (and in that case load only the test classes). Rule 2 is your first line of defense against such effects.
Rule 5: Be aware of deoptimization and recompilation effects. Do not take any code path for the first time in the timing phase, because the compiler may junk and recompile the code, based on an earlier optimistic assumption that the path was not going to be used at all. Rule 2 is your first line of defense against such effects.
Rule 6: Use appropriate tools to read the compiler's mind, and expect to be surprised by the code it produces. Inspect the code yourself before forming theories about what makes something faster or slower.
Rule 7: Reduce noise in your measurements. Run your benchmark on a quiet machine, and run it several times, discarding outliers. Use -Xbatch to serialize the compiler with the application, and consider setting -XX:CICompilerCount=1 to prevent the compiler from running in parallel with itself. Try your best to reduce GC overhead: set -Xmx (large enough) equal to -Xms, and use UseEpsilonGC if it is available.
Rule 8: Use a library for your benchmark as it is probably more efficient and was already debugged for this sole purpose. Such as JMH, Caliper or Bill and Paul's Excellent UCSD Benchmarks for Java.
I know this question has been marked as answered but I wanted to mention two libraries that help us to write micro benchmarks
Caliper from Google
Getting started tutorials
http://codingjunkie.net/micro-benchmarking-with-caliper/
http://vertexlabs.co.uk/blog/caliper
JMH from OpenJDK
Getting started tutorials
Avoiding Benchmarking Pitfalls on the JVM
Using JMH for Java Microbenchmarking
Introduction to JMH
Important things for Java benchmarks are:
Warm up the JIT first by running the code several times before timing it
Make sure you run it for long enough to be able to measure the results in seconds or (better) tens of seconds
While you shouldn't call System.gc() between iterations, it's a good idea to run it between tests, so that each test will hopefully get a "clean" memory space to work with. (Yes, gc() is more of a hint than a guarantee, but in my experience it's very likely that it really will garbage collect.)
I like to display iterations and time, and a score of time/iteration which can be scaled such that the "best" algorithm gets a score of 1.0 and others are scored in a relative fashion. This means you can run all algorithms for a longish time, varying both number of iterations and time, but still getting comparable results.
I'm just in the process of blogging about the design of a benchmarking framework in .NET. I've got a couple of earlier posts which may be able to give you some ideas - not everything will be appropriate, of course, but some of it may be.
jmh is a recent addition to OpenJDK and has been written by some performance engineers from Oracle. Certainly worth a look.
The jmh is a Java harness for building, running, and analysing nano/micro/macro benchmarks written in Java and other languages targeting the JVM.
Very interesting pieces of information buried in the sample tests comments.
See also:
Avoiding Benchmarking Pitfalls on the JVM
Discussion on the main strengths of jmh.
Should the benchmark measure time/iteration or iterations/time, and why?
It depends on what you are trying to test.
If you are interested in latency, use time/iteration and if you are interested in throughput, use iterations/time.
Make sure you somehow use results which are computed in benchmarked code. Otherwise your code can be optimized away.
If you are trying to compare two algorithms, do at least two benchmarks for each, alternating the order, i.e.:
for (int i = 0; i < n; i++) alg1();
for (int i = 0; i < n; i++) alg2();
for (int i = 0; i < n; i++) alg2();
for (int i = 0; i < n; i++) alg1();
I have found noticeable differences (sometimes 5-10%) in the runtime of the same algorithm in different passes.
Also, make sure that n is very large, so that the runtime of each loop is at the very least 10 seconds or so. The more iterations, the more significant figures in your benchmark time and the more reliable that data is.
There are many possible pitfalls for writing micro-benchmarks in Java.
First: You have to account for all sorts of events that take a more or less random amount of time: garbage collection, caching effects (of the OS for files and of the CPU for memory), IO etc.
Second: You cannot trust the accuracy of the measured times for very short intervals.
Third: The JVM optimizes your code while executing. So different runs in the same JVM-instance will become faster and faster.
My recommendations: Make your benchmark run for some seconds; that is more reliable than a runtime measured over milliseconds. Warm up the JVM (meaning: run the benchmark at least once without measuring, so that the JVM can apply its optimizations). Run your benchmark multiple times (maybe 5 times) and take the median value. Run every micro-benchmark in a new JVM-instance (start a new java process for every benchmark), otherwise optimization effects of the JVM can influence tests run later. Don't execute things that weren't executed in the warmup-phase (as this could trigger class-loading and recompilation).
It should also be noted that it can be important to analyze the results of the micro benchmark when comparing different implementations. Therefore a significance test should be made.
This is because implementation A might be faster during most of the runs of the benchmark than implementation B. But A might also have a higher spread, so the measured performance benefit of A won't be of any significance when compared with B.
So it is also important to write and run a micro benchmark correctly, but also to analyze it correctly.
To add to the other excellent advice, I'd also be mindful of the following:
For some CPUs (e.g. the Intel Core i5 range with TurboBoost), the temperature (and the number of cores currently in use, as well as their utilisation percentage) affects the clock speed. Since CPUs are dynamically clocked, this can affect your results. For example, if you have a single-threaded application, the maximum clock speed (with TurboBoost) is higher than for an application using all cores. This can therefore interfere with comparisons of single- and multi-threaded performance on some systems. Bear in mind that the temperature and voltages also affect how long the Turbo frequency is maintained.
Perhaps a more fundamentally important aspect that you have direct control over: make sure you're measuring the right thing! For example, if you're using System.nanoTime() to benchmark a particular bit of code, place the timing calls so that you avoid measuring things you aren't interested in. For example, don't do:
long startTime = System.nanoTime();
//code here...
System.out.println("Code took "+(System.nanoTime()-startTime)+"nano seconds");
Problem is you're not immediately getting the end time when the code has finished. Instead, try the following:
final long endTime, startTime = System.nanoTime();
//code here...
endTime = System.nanoTime();
System.out.println("Code took "+(endTime-startTime)+"nano seconds");
Java Micro Benchmark ( http://opt.sourceforge.net/ ) - control tasks required to determine the comparative performance characteristics of a computer system on different platforms. Can be used to guide optimization decisions and to compare different Java implementations.

How to properly implement python multiprocessing for expensive image/video tasks?

I'm running on a pretty basic quad-core machine where multiprocessing.cpu_count() = 8 with something like:
from itertools import repeat
from multiprocessing import Pool

def expensive_function(list_of_values, some_param, another_param):
    do_some_python_pillow_tasks()
    do_some_ffmpeg_tasks()

if __name__ == '__main__':
    values = [
        ['a', 'b', 'c'],
        ['x', 'y', 'z'],
        # ...
        # there can be MANY items in this list, let's say 1000
    ]
    pool = Pool(processes=len(values))
    pool.starmap(
        expensive_function,
        zip(values, repeat('yada yada yada'), repeat('hello world')),
    )
    pool.close()
None of the 1,000 tasks will have problems with each other, in theory they can all be run at the same time.
Using multiprocessing.Pool definitely helps speed up the total duration, but am I using multiprocessing to the best of its ability? Are you supposed to pass the total number of tasks (1000) to Pool(processes=?) or the number of CPUs (8)?
Ultimately I want all (potentially 1000) tasks to complete as fast as possible. This may be a stupid question, but can you utilize the GPU to help speed up processing?
Using multiprocessing.Pool definitely helps speed up the total duration, but am I using multiprocessing to the best of its ability? Are you supposed to pass the total number of tasks (1000) to Pool(processes=?) or the number of CPUs (8)?
Pool creates many CPython processes, and processes is the number of workers to create. Creating about 1000 processes is really not a good idea, since creating a process is expensive. I advise you to keep the default parameter (or to check whether using 4 processes is better in your case).
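For reference, a minimal sketch of that advice - reusing expensive_function and the values list from the question ( the chunksize value is only an illustrative guess ):

from itertools import repeat
from multiprocessing import Pool

if __name__ == '__main__':
    # Pool() with no argument defaults to os.cpu_count() workers
    with Pool() as pool:
        pool.starmap(
            expensive_function,
            zip(values, repeat('yada yada yada'), repeat('hello world')),
            chunksize=16,   # illustrative: batches tasks to reduce dispatch overhead
        )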
This may be a stupid question, but can you utilize the GPU to help speed up processing?
No. You cannot use it transparently. You need to rewrite your code to use it, and this is generally pretty hard. However, ffmpeg may use it already. If so, running these tasks in parallel will certainly not be much faster (it can actually even be slower), since the GPU is a shared resource and the multiple processes will compete for its use (GPU tasks are always massively parallel in practice).
Q : " ... am I using multiprocessing to the best of it's ability ?"
A :Well, that actually does not matter here at all.Congratulations!You happened to enjoy a such seldom use-case, where the so called embarrasingly parallel process-orchestration may save most of otherwise present problems.
Incidentally, nothing new - this very same use-case and reasoning were successfully used by Peter Jackson's VFX-team for the "Lord of the Rings" frame-by-frame video-rendering & video-postprocessing & final LASER-deposition of each frame onto colour-film, on his computing power-plant setup in New Zealand. His factory was full of Silicon Graphics workstations ( no Python reported to have been there ), yet the workflow-orchestration principle was the same ...
Python multithreading is irrelevant here, as it keeps all threads standing in a queue, each waiting one after another for its turn to acquire the one-&-only-one central Python GIL-lock, so using it is rather an anti-pattern if you wish to gain processing speed here.
Python multiprocessing is inappropriate here, even for as small a number as 4 or 8 worker-processes ( the less so for the ~1k promoted above ), as it
first spends some ( in the further context almost negligible ) [TIME]- and [SPACE]-domain costs on each spawning of a new, independent Python-interpreter process, copied full-scale, i.e. with all its internal state & all its data-structures ( expect RAM-/SWAP-thrashing whenever your host's physical memory gets over-saturated with that many copies of the same things, and the O/S virtual-memory management-service starts - concurrently with running your "useful" work - to orchestrate memory SWAP-ins / SWAP-outs whenever a just-O/S-scheduled process needs to fetch data that cannot fit/stay in-RAM: the data is then no longer N x 100 [ns] away from the CPU, but Q x 10,000,000 [ns] away on-HDD - yes, you read that correctly, suddenly many orders of magnitude slower just to re-read your "own" data that got accidentally swapped away, plus the CPU becomes less available for your processing, as it also has to perform all the introduced SWAP-I/O work. Nasty, isn't it? Yet that is not all that hurts you ... ),
next ( and repeated for each of the 1,000 cases ... ) you will have to pay ( CPU-wise + MEM-I/O-wise + O/S-IPC-wise ) another awful penalty, here for moving data ( parameters ) from the "main" Python-interpreter process to the "spawned" Python-interpreter process: DATA-SERialise ( at CPU + MEM-I/O add-on costs ) + DATA-move ( at O/S-IPC-service add-on costs - yes, DATA-size matters, again ) + DATA-DESerialise ( again at CPU + MEM-I/O add-on costs ), all of that just to make the DATA ( parameters ) somehow appear "inside" the other Python-interpreter, whose GIL-lock will not compete with your central and other Python-interpreters ( which is fine, yet on top of this awfully gigantic sum of add-on costs? Not so nice-looking once we understand the details, is it? )
What can be done instead?
a) split the list ( values ) of independent values, as posted above, into say 4 parts ( quad-core, 2 hardware-threads each, CPU ), and
b) let the embarrassingly parallel (independent) problem get solved in a pure-[SERIAL] fashion, by 4 Python processes, each one launched fully independently, on its respective quarter of the list ( values ) - as sketched below.
There will be zero add-on cost for doing so, there will be zero add-on SER/DES penalty for the 1000+ tasks' data distribution and results' recollection, and there will be a reasonable distribution of the workload over the CPU-cores ( thermal throttling will appear for all of them, as the CPU-core temperatures may and will grow - so no magic, only sufficient CPU-cooling can save us here anyway ).
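A minimal sketch of that split-and-run idea - it assumes values and expensive_function are defined in the same script, as in the question, and takes the chunk index as a ( hypothetical ) command-line argument; the script is launched 4 times, once per chunk:

import sys

def run_chunk(chunk):
    # pure-[SERIAL] processing of this process's own quarter of the work
    for list_of_values in chunk:
        expensive_function(list_of_values, 'yada yada yada', 'hello world')

if __name__ == '__main__':
    N_CHUNKS = 4
    idx = int(sys.argv[1])              # 0..3, one per independently launched process
    run_chunk(values[idx::N_CHUNKS])    # every 4th item, starting at idx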
One may also test whether the PIL.Image processing could get faster if using OpenCV with numpy.ndarray() smart-vectorised processing tricks, yet that is another Level-of-Detail of boosting performance, to visit once we have prevented the gigantic overhead costs reminded above.
Except for using a magic wand, there is no other magic possible on the Python-interpreter side here.

Populate a matrix in parallel

I routinely need to populate a matrix A[i,j] by evaluating a function between pairs of vectors. As the computation of every i,j-pair is independent of the others, I want to parallelize this:
import numpy as np

A = np.zeros((n, n))
for i in range(n):
    for j in range(i+1, n):
        A[i, j] = function(X[i], X[j])
How this computation could be elegantly parallelized via joblib or other widely used library?
Q : "How this computation could be elegantly parallelized via joblib or other widely used library?"
If using joblib, the main python interpreter will spawn other, GIL-lock-independent copies of itself ( yes, a huge memory-I/O cost to copy the whole python interpreter state, including all data-structures, on O/S Windows; a somewhat less horrible initial latency hit on linux-type O/S ), yet the worse is still to come - any "remote" modification of the spawned/distributed replicas of the original data has to somehow make it back to the main python-process ( yes, huge memory-I/O + cache-(de)coherency hardware workloads ( plus per-core L1-data cache-efficiency almost surely devastated ) ).
So this trick does not easily pay for its own add-on costs, unless the function() computation is indeed many times more expensive than the costs of process-instantiation + process-to-process data interchange ( SER/DES on the way "there" ( think pickle.dumps() memory-allocation + pickling compression/decompression costs ) + SER/DES on the way "back" + the actual p2p-communication latencies (costs) of moving the pickled data elements ).
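For concreteness, the straightforward joblib version under discussion might look like the sketch below ( function, X and n are assumed from the question ); every delayed() call ships its arguments to a worker and ships the scalar result back - exactly where the SER/DES add-on costs above get paid:

from joblib import Parallel, delayed
import numpy as np

pairs   = [(i, j) for i in range(n) for j in range(i + 1, n)]
results = Parallel(n_jobs=-1)(delayed(function)(X[i], X[j]) for i, j in pairs)

A = np.zeros((n, n))
for (i, j), value in zip(pairs, results):
    A[i, j] = value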
Is There Any Better Way Forwards?
We have all surely heard about numpy and its smart, vectorised processing. Many thousands of man*years of top-level HPC experience were put into numpy's smart, data-I/O-vectorised processing.
So, in most cases, if you try to redesign the function( scalarA, scalarB ), returning a single scalarResult to be stored into the externally 2D-looped A[i,j], into an in-place modifying function( vectorX_data, matrixA_results ), and let its inner code do both the i,j-looping over matrixA_results.shape[0] and the actual computation, the results may get astonishingly faster, provided the numpy-code can harness the smart CPU-vector instructions, which pay less than 0.5 [ns] L1_data access latency, compared to as much as 300 ~ 380 [ns] RAM access latency ( if the memory-I/O channel were free and permitting an unenqueued data transfer from the slow & far RAM-memory, not even mentioning the somewhat latency-masked 10,000,000+ [ns] access-costs of using a numpy.memmap()-file-based data proxy ).
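A minimal sketch of such an in-place, vectorised redesign - assuming, purely for illustration, that function( X[i], X[j] ) is a squared Euclidean distance:

import numpy as np

def fill_A_vectorised(X, A):
    # broadcasting produces all pairwise differences at once - no Python-level i,j loop
    diff = X[:, None, :] - X[None, :, :]        # shape (n, n, d)
    D = np.einsum('ijk,ijk->ij', diff, diff)    # all squared distances, shape (n, n)
    A[:] = np.triu(D, k=1)                      # keep only j > i, as in the original loop

n, d = 1000, 16
X = np.random.rand(n, d)
A = np.zeros((n, n))
fill_A_vectorised(X, A)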
If you have never visited the domain of numpy-tricks with smart-vectorised processing, do not hesitate to read as many posts as possible from a true master of this domain, guru @Divakar - all respect to them!

Python multiprocessing performance only improves with the square root of the number of cores used

I am attempting to implement multiprocessing in Python (Windows Server 2012) and am having trouble achieving the degree of performance improvement that I expect. In particular, for a set of tasks which are almost entirely independent, I would expect a linear improvement with additional cores.
I understand that--especially on Windows--there is overhead involved in opening new processes [1], and that many quirks of the underlying code can get in the way of a clean trend. But in theory the trend should ultimately still be close to linear for a fully parallelized task [2]; or perhaps logistic if I were dealing with a partially serial task [3].
However, when I run multiprocessing.Pool on a prime-checking test function (code below), I get a nearly perfect square-root relationship up to N_cores=36 (the number of physical cores on my server) before the expected performance hit when I get into the additional logical cores.
Here is a plot of my performance test results :
( "Normalized Performance" is [ a run time with 1 CPU-core ] divided by [ a run time with N CPU-cores ] ).
Is it normal to have this dramatic diminishing of returns with multiprocessing? Or am I missing something with my implementation?
import numpy as np
from multiprocessing import Pool, cpu_count, Manager
import math as m
from functools import partial
from time import time

def check_prime(num):
    # Assert positive integer value
    if num != m.floor(num) or num < 1:
        print("Input must be a positive integer")
        return None
    # Check divisibility for all possible factors
    prime = True
    for i in range(2, num):
        if num % i == 0: prime = False
    return prime

def cp_worker(num, L):
    prime = check_prime(num)
    L.append((num, prime))

def mp_primes(omag, mp=cpu_count()):
    with Manager() as manager:
        np.random.seed(0)
        numlist = np.random.randint(10**omag, 10**(omag+1), 100)
        L = manager.list()
        cp_worker_ptl = partial(cp_worker, L=L)
        try:
            pool = Pool(processes=mp)
            list(pool.imap(cp_worker_ptl, numlist))
        except Exception as e:
            print(e)
        finally:
            pool.close()  # no more tasks
            pool.join()
        return L

if __name__ == '__main__':
    rt = []
    for i in range(cpu_count()):
        t0 = time()
        mp_result = mp_primes(6, mp=i+1)
        t1 = time()
        rt.append(t1 - t0)
        print("Using %i core(s), run time is %.2fs" % (i+1, rt[-1]))
Note: I am aware that for this task it would likely be more efficient to implement multithreading, but the actual script for which this one is a simplified analog is incompatible with Python multithreading due to GIL.
@KellanM deserved [+1] for the quantitative performance monitoring
am I missing something with my implementation?
Yes, you abstract from all add-on costs of the process-management.
While you have expressed an expectation of " a linear improvement with additional cores. ", this would hardly appear in practice for several reasons ( even the hype of communism failed to deliver anything for free ).
Gene AMDAHL formulated the initial law of diminishing returns.
A more recent, re-formulated version also took into account the effects of process-management {setup|terminate} add-on overhead costs and tried to cope with the atomicity-of-processing ( given that large work-package payloads cannot easily get re-located / re-distributed over the available pool of free CPU-cores in most common programming systems ( except for some indeed specific micro-scheduling art, like that demonstrated in Semantic Designs' PARLANSE or LLNL's SISAL, shown so colourfully in the past ) ).
A best next step?
If indeed interested in this domain, one may always experimentally measure and compare the real costs of process management ( plus data-flow costs, plus memory-allocation costs, ... up until the process-termination and results re-assembly in the main process ), so as to quantitatively and fairly record and evaluate the add-on costs / benefit ratio of using more CPU-cores ( which will, in python, re-instantiate the whole python-interpreter state, including all its memory-state, before a first useful operation gets carried out in the first spawned and set-up process ).
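A minimal sketch of such a measurement - it times nothing but the spawn / dispatch / join of N otherwise idle worker-processes, so the printed figures show the pure process-management add-on costs ( the numbers will of course vary per platform ):

from multiprocessing import Pool
from time import time

def noop(_):
    return None                       # no useful work: we time only the management overhead

if __name__ == '__main__':
    for n in (1, 2, 4, 8, 16):
        t0 = time()
        with Pool(processes=n) as pool:
            pool.map(noop, range(n))
        print("%2i workers spawned, dispatched and joined in %.3f [s]" % (n, time() - t0))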
Underperformance ( in the former case below ), if not disastrous effects ( in the latter case below ), of an ill-engineered resources-mapping policy - be it an "under-booking" of resources from the pool of CPU-cores, or an "over-booking" of resources from the pool of RAM-space - are discussed also here.
The link to the re-formulated Amdahl's Law above will help you evaluate the point of diminishing returns, so as not to pay more than you will ever receive.
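To make that evaluation concrete, one simple way to model it ( a sketch only - the parallel fraction and the per-process add-on costs below are hypothetical placeholders, to be replaced by measured values ):

def overhead_strict_speedup(p, N, T1, setup_s, xfer_s):
    # p       : parallelisable fraction of the original single-core run-time T1 [s]
    # setup_s : add-on cost to spawn / terminate one worker process [s]
    # xfer_s  : add-on cost of moving parameters / results for the whole job [s]
    serial_part   = (1.0 - p) * T1
    parallel_part = p * T1 / N
    return T1 / (serial_part + parallel_part + N * setup_s + xfer_s)

for N in (1, 2, 4, 8, 16, 36):
    print(N, round(overhead_strict_speedup(p=0.95, N=N, T1=100.0, setup_s=0.5, xfer_s=2.0), 2))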
Hoefinger and Haunschmid's experiments may serve as good practical evidence of how a growing number of processing-nodes ( be it a local, O/S-managed CPU-core, or a node of a NUMA distributed architecture ) will start decreasing the resulting performance,
where the Point of diminishing returns ( demonstrated in the overhead-agnostic Amdahl's Law )
will actually start to become the Point after which you pay more than you receive.
Good luck on this interesting field!
Last, but not least,
NUMA / non-locality issues get their voice heard in the discussion of scaling for HPC-grade tuned ( in-Cache / in-RAM computing ) strategies, and may - as a side-effect - help detect flaws ( as reported by @eryksun above ). One may feel free to review one's platform's actual NUMA-topology using the lstopo tool, to see the abstraction that the operating system is trying to work with once it schedules the "just"-[CONCURRENT] task execution over such a NUMA-resources-topology.

Parallel application in python becomes much slower when using mpi rather than multiprocessing module

Lately I've observed a weird effect when I measured performance of my parallel application using the multiprocessing module and mpi4py as communication tools.
The application performs evolutionary algorithms on sets of data. Most operations are done sequentially with the exception of evaluation. After all evolutionary operators are applied all individuals need to receive new fitness values, which is done during the evaluation. Basically it's just a mathematical calculation performed on a list of floats (python ones). Before the evaluation a data set is scattered either by the mpi's scatter or python's Pool.map, then comes the parallel evaluation and later the data comes back through the mpi's gather or again the Pool.map mechanism.
My benchmark platform is a virtual machine (virtualbox) running Ubuntu 11.10 with Open MPI 1.4.3 on Core i7 (4/8 cores), 8 GB of RAM and an SSD drive.
What I find to be truly surprising is that I acquire a nice speed-up, however depending on a communication tool, after a certain threshold of processes, the performance becomes worse. It can be illustrated by the pictures below.
[Plots omitted - y axis: processing time, x axis: number of processes, colours: size of each individual (number of floats).
1) Using multiprocessing module - Pool.map
2) Using mpi - Scatter/Gather
3) Both plots on top of each other]
At first I was thinking that it's hyperthreading's fault, because for large data sets it becomes slower after reaching 4 processes (the 4 physical cores). However, it should then also be visible in the multiprocessing case, and it's not. My other guess is that mpi communication methods are much less efficient than python's own, but I find that hard to believe.
Does anyone have any explanation for these results?
ADDED:
I'm starting to believe that it's Hyperthreading's fault after all. I tested my code on a machine with a core i5 (2/4 cores) and the performance is worse with 3 or more processes. The only explanation that comes to my mind is that the i7 I'm using doesn't have enough resources (cache?) to compute the evaluation concurrently with Hyperthreading, and needs to schedule more than 4 processes to run on the 4 physical cores.
However, what's interesting is that, when I use mpi, htop shows complete utilization of all 8 logical cores, which should suggest that the above statement is incorrect. On the other hand, when I use Pool.map it doesn't completely utilize all cores. It uses one or 2 to the maximum and the rest only partially; again, no idea why it behaves this way. Tomorrow I will attach a screenshot showing this behaviour.
I'm not doing anything fancy in the code, it's really straightforward (I'm not withholding the entire code because it's secret, but because it needs additional libraries like DEAP to be installed. If someone is really interested in the problem and ready to install DEAP, I can prepare a short example). The code for MPI is a little bit different, because it can't deal with a population container (which inherits from list). There is some overhead of course, but nothing major. Apart from the code I show below, the rest is the same.
Pool.map:
def eval_population(func, pop):
    for ind in pop:
        ind.fitness.values = func(ind)
    return pop

# ...
self.pool = Pool(8)
# ...

for iter_ in xrange(nr_of_generations):
    # ...
    self.pool.map(evaluate, pop)  # evaluate is really an eval_population alias with a certain function assigned to its first argument.
    # ...
MPI - Scatter/Gather
def divide_list(lst, n):
    return [lst[i::n] for i in xrange(n)]

def chain_list(lst):
    return list(chain.from_iterable(lst))

def evaluate_individuals_in_groups(func, rank, individuals):
    comm = MPI.COMM_WORLD
    size = MPI.COMM_WORLD.Get_size()
    packages = None
    if not rank:
        packages = divide_list(individuals, size)
    ind_for_eval = comm.scatter(packages)
    eval_population(func, ind_for_eval)
    pop_with_fit = comm.gather(ind_for_eval)
    if not rank:
        pop_with_fit = chain_list(pop_with_fit)
        for index, elem in enumerate(pop_with_fit):
            individuals[index] = elem

for iter_ in xrange(nr_of_generations):
    # ...
    evaluate_individuals_in_groups(self.func, self.rank, pop)
    # ...
ADDED 2:
As I mentioned earlier I made some tests on my i5 machine (2/4 cores) and here is the result:
I also found a machine with 2 xeons (2x 6/12 cores) and repeated the benchmark:
Now I have 3 examples of the same behaviour. When I run my computation in more processes than physical cores it starts getting worse. I believe it's because the processes on the same physical core can't be executed concurrently because of the lack of resources.
MPI is actually designed for inter-node communication, i.e. to talk to other machines over the network.
Using MPI on the same node can result in a big overhead for every message that has to be sent, compared to e.g. threading.
mpi4py makes a copy for every message, since it's targeted at distributed-memory usage.
If your OpenMPI is not configured to use shared memory for intra-node communication, each message will be sent through the kernel's tcp stack, and back, to get delivered to the other process, which again adds some overhead.
If you only intend to do computations within the same machine, there is no need to use mpi here.
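To make the copy-overhead point concrete, a minimal sketch ( names and sizes are only illustrative ): mpi4py's lowercase scatter() pickles generic Python objects - an extra serialise/copy/deserialise per message - while the uppercase Scatter() can move numpy buffers directly. Run with e.g. mpirun -np 4 python thisscript.py.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# pickled path: each sub-list is serialised on the root, sent, then deserialised
data = [list(range(1000)) for _ in range(size)] if rank == 0 else None
chunk = comm.scatter(data, root=0)

# buffer path: no pickling, the receive buffer is filled in place
sendbuf = np.arange(size * 1000, dtype='d').reshape(size, 1000) if rank == 0 else None
recvbuf = np.empty(1000, dtype='d')
comm.Scatter(sendbuf, recvbuf, root=0)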
Some of this is discussed in this thread.
Update
The ipc-benchmark project tries to make some sense out of how different communication types perform on different systems (multicore, multiprocessor, shared memory) - and especially how this influences virtualized machines!
I recommend running the ipc-benchmark on the virtualized machine, and post the results.
If they look anything like this benchmark, it can give you a big insight into the difference between tcp, sockets and pipes.
