How to properly implement Python multiprocessing for expensive image/video tasks?

I'm running on a pretty basic quad-core machine where multiprocessing.cpu_count() returns 8, with something like:
from itertools import repeat
from multiprocessing import Pool

def expensive_function(list_of_values, some_param, another_param):
    do_some_python_pillow_tasks()
    do_some_ffmpeg_tasks()

if __name__ == '__main__':
    values = [
        ['a', 'b', 'c'],
        ['x', 'y', 'z'],
        # ...
        # there can be MANY items in this list, let's say 1000
    ]
    pool = Pool(processes=len(values))
    pool.starmap(
        expensive_function,
        zip(values, repeat('yada yada yada'), repeat('hello world')),
    )
    pool.close()
None of the 1,000 tasks will have problems with each other; in theory they can all be run at the same time.
Using multiprocessing.Pool definitely helps speed up the total duration, but am I using multiprocessing to the best of its ability? Are you supposed to pass in the total number of tasks (1000) to Pool(processes=?) or the number of CPUs (8)?
Ultimately I want all (potentially 1000) tasks to complete as fast as possible. This may be a stupid question, but can you utilize the GPU to help speed up processing?

Using multiprocessing.Pool definitely helps speed up the total duration, but am I using multiprocessing to the best of its ability? Are you supposed to pass in the total number of tasks (1000) to Pool(processes=?) or the number of CPUs (8)?
Pool creates multiple CPython processes, and processes is the number of workers to create. Creating about 1,000 processes is really not a good idea, since creating a process is expensive. I advise you to leave the parameter at its default (or to check whether using 4 processes is better in your case).
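A minimal sketch of that advice, re-using the function signature from the question and leaving the pool size at its default (the chunksize value is only an illustrative assumption):

from itertools import repeat
from multiprocessing import Pool

def expensive_function(list_of_values, some_param, another_param):
    pass  # placeholder for the Pillow / ffmpeg work from the question

if __name__ == '__main__':
    values = [['a', 'b', 'c'], ['x', 'y', 'z']]   # ... up to ~1000 items
    # Pool() with no argument uses os.cpu_count() workers (8 here); the ~1000
    # tasks are queued and handed to those workers as they become free.
    with Pool() as pool:
        pool.starmap(
            expensive_function,
            zip(values, repeat('yada yada yada'), repeat('hello world')),
            chunksize=16,   # illustrative value; batching cuts per-task IPC overhead
        )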
This may be a stupid question, but can you utilize the GPU to help speed up processing?
No, you cannot use it transparently. You need to rewrite your code to use it, and this is generally pretty hard. However, ffmpeg may use it already. If so, running this task in parallel will certainly not be much faster (it can actually even be slower), since the GPU is a shared resource and the multiple processes will compete for its use (GPU tasks are already massively parallel in practice).
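If you want to check whether your local ffmpeg build exposes any hardware acceleration at all, one quick way (a sketch, assuming ffmpeg is on the PATH) is to ask the binary itself; ffmpeg -hwaccels lists the methods it was compiled with:

import subprocess

# Lists the hardware-acceleration methods compiled into the local ffmpeg binary
# (e.g. cuda, vaapi, videotoolbox); an empty list means CPU-only decoding/encoding.
out = subprocess.run(['ffmpeg', '-hide_banner', '-hwaccels'],
                     capture_output=True, text=True, check=True)
print(out.stdout)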

Q : " ... am I using multiprocessing to the best of it's ability ?"
A : Well, that actually does not matter here at all. Congratulations! You happened to enjoy such a rare use-case, where the so-called embarrassingly parallel process-orchestration saves you from most of the otherwise present problems.
Incidentally, nothing new here: exactly the same use-case reasoning was successfully used by Peter Jackson's VFX team for the "Lord of the Rings" computing power-plant setup in New Zealand - frame-by-frame video rendering, video post-processing and the final laser deposition of each frame onto colour-film stock. His factory was full of Silicon Graphics workstations (no Python reported to have been there), yet the workflow-orchestration principle was the same...
Python multithreading is irrelevant here, as it keeps all threads standing in a queue, waiting one after another for their turn to acquire the one-and-only central Python GIL-lock, so using it is an anti-pattern if you wish to gain processing speed here.
Python multiprocessing is inappropriate here, even for as small a number as 4 or 8 worker-processes (the less so for the ~1k promoted above), as it:
first spends [TIME]-domain and [SPACE]-domain costs on each spawning of a new, independent Python-interpreter process, copied full-scale, i.e. with all of its internal state and all of its data structures (expect RAM-/SWAP-thrashing whenever your host's physical memory gets over-saturated with that many copies of the same things and the virtual-memory management service of the O/S starts, concurrently to running your "useful" work, to orchestrate memory SWAP-ins / SWAP-outs, as the just-O/S-scheduled process needs to fetch data that cannot fit or stay in RAM, so the data is suddenly not N x 100 [ns] far from the CPU, but Q x 10,000,000 [ns] far on the HDD - yes, you read that correctly, many orders of magnitude slower just to re-read your "own" data that was accidentally swapped away, plus the CPU becomes less available for your processing, as it also has to perform all of the introduced SWAP-I/O work. Nasty, isn't it? Yet it is not all that hurts you...),
next (and repeated for each of the 1,000 cases...) you will have to pay (CPU-wise + MEM-I/O-wise + O/S-IPC-wise) another awful penalty, here for moving data (parameters) from the "main" Python-interpreter process into the "spawned" Python-interpreter process: DATA-SERialisation (at CPU + MEM-I/O add-on costs) + DATA-moving (O/S-IPC-service add-on costs; yes, DATA size matters, again) + DATA-DESerialisation (again at CPU + MEM-I/O add-on costs), all of that just to make the DATA (parameters) somehow appear "inside" the other Python interpreter, whose GIL-lock will not compete with your central and other Python interpreters (which is fine, yet at this awfully gigantic sum of add-on costs? Not so nice-looking once we understand the details, is it?)
What can be done instead?
a) split the list (values) of independent values, as posted above, into say 4 parts (quad-core CPU, 2 hardware threads each), and
b) let the embarrassingly parallel (independent) problem get solved in a pure-[SERIAL] fashion by 4 Python processes, each one launched fully independently, on its respective quarter of the list (values) - see the sketch below.
There will be zero add-on cost for doing so, there will be zero add-on SER/DES penalty for the 1000+ tasks' data distribution and the results' re-collection, and there will be a reasonably distributed CPU-core workload (thermal throttling will appear for all cores as the CPU-core temperatures grow, so nothing but sufficient CPU cooling can save us here anyway).
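A minimal sketch of steps a) and b), split into two hypothetical files (launcher.py and worker.py are illustrative names, and the values / expensive_function placeholders stand in for the question's real data and work); each worker slices out its own share of the list, so no per-task parameter pickling or result collection happens at all:

# --- launcher.py (hypothetical file name) ---
import subprocess
import sys

N_PARTS = 4                                # quad-core CPU, as in the question

procs = [subprocess.Popen([sys.executable, 'worker.py', str(part), str(N_PARTS)])
         for part in range(N_PARTS)]
for p in procs:                            # wait for all four independent interpreters
    p.wait()

# --- worker.py (hypothetical file name) ---
import sys

def expensive_function(list_of_values, some_param, another_param):
    pass                                   # placeholder for the Pillow / ffmpeg work

part, n_parts = int(sys.argv[1]), int(sys.argv[2])
values = [['a', 'b', 'c']] * 1000          # placeholder for the real ~1000 items
for item in values[part::n_parts]:         # this process's own share, no IPC at all
    expensive_function(item, 'yada yada yada', 'hello world')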
One may also test whether the PIL.Image processing could get faster by using OpenCV with smart numpy.ndarray() vectorised-processing tricks, yet that is another level of detail of boosting performance, once we have prevented the gigantic overhead costs reminded of above.
Short of using a magic wand, there is no other magic possible on the Python interpreter here.

Related

Populate a matrix in parallel

I routinely need to populate matrices A[i,j] by evaluating a function between pairs of vectors. As the computation of every i,j-pair is independent of the others, I want to parallelize this:
A = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        A[i, j] = function(X[i], X[j])
How could this computation be elegantly parallelized via joblib or another widely used library?
Q : "How could this computation be elegantly parallelized via joblib or another widely used library?"
If using joblib, the main Python interpreter will spawn other, GIL-lock-independent copies of itself (yes, huge memory-I/O to copy the entire Python interpreter state, including all data structures, on O/S Windows; a somewhat less horrible initial latency hit on Linux-type O/S), yet the worst is still to come - any "remote" modification of the spawned/distributed replicas of the original data has to somehow make it back to the main Python process (yes, huge memory-I/O + cache-(de)coherency hardware workloads, plus per-core L1-data cache efficiency almost certainly devastated).
So this trick does not easily pay for its own add-on costs, unless the function() computation is indeed many times more expensive than the costs of process instantiation + process-to-process data interchange (SER/DES on the way "there" - think of the pickle.dumps() memory allocation + pickling costs - plus SER/DES on the way "back", plus the actual p2p-communication latencies to move the pickled data elements).
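A quick way to check whether the work is heavy enough is to time one SER/DES round-trip of the data against one call of function(); a small sketch (array shapes are only illustrative):

import pickle, time
import numpy as np

X = np.random.rand(1000, 128)           # illustrative shapes only
t0 = time.perf_counter()
blob = pickle.dumps(X, protocol=pickle.HIGHEST_PROTOCOL)
_ = pickle.loads(blob)
t1 = time.perf_counter()
print("one SER/DES round-trip of X: %.6f s for %d bytes" % (t1 - t0, len(blob)))
# Compare this (plus process start-up) against one call of function(X[i], X[j]);
# unless the function call is many times larger, distributing it will not pay off.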
One might like more reads on this here and here and here.
Is There Any Better Way Forward?
We have all surely heard about numpy and smart numpy-vectorised processing. Many thousands of man-years of top-level HPC experience were put into numpy's smart, data-I/O-vectorised processing.
So in most cases, if you redesign the function( scalarA, scalarB ), returning a single scalarResult to be stored into an externally 2D-looped A[i,j], into an in-place modifying function( vectorX_data, matrixA_results ), and let the inner code do both the i,j-looping over matrixA_results.shape[0] and the actual computing, the results may get astonishingly faster - if the numpy code can harness the smart CPU vector instructions, which pay less than 0.5 [ns] L1-data access latency compared to as much as 300 ~ 380 [ns] RAM access latency (assuming the memory-I/O channel is free and permits un-enqueued data transfer from the slow and far RAM, not even mentioning the somewhat latency-masked 10,000,000+ [ns] access costs of using the numpy.memmap()-file-based data proxy).
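A sketch of that redesign, assuming purely for illustration that function(a, b) is a squared Euclidean distance and is therefore expressible in numpy terms; the double loop collapses into one broadcast expression:

import numpy as np

def fill_A_vectorised(X):
    # all-pairs computation in one shot; diff has shape (n, n, d)
    diff = X[:, None, :] - X[None, :, :]
    A = np.einsum('ijk,ijk->ij', diff, diff)   # squared Euclidean distances
    return np.triu(A, k=1)                     # keep the upper triangle, as in the loop

X = np.random.rand(500, 16)
A = fill_A_vectorised(X)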
If you have never visited the domain of numpy tricks with smart-vectorised processing, do not hesitate to read as many posts as possible from a true master of this domain, guru #Divakar - all respect to them!

How does scikit-learn handle multiple n_jobs arguments?

I have made a pipeline in scikit-learn that looks as follows:
estimators2 = [
    ('tfidf', TfidfVectorizer(tokenizer=lambda string: string.split())),
    ('clf', SGDClassifier(n_jobs=13, early_stopping=True, class_weight='balanced'))
]

parameters2 = {
    'tfidf__min_df': np.arange(10, 30, 10),
    'tfidf__max_df': np.arange(0.75, 0.9, 0.05),
    'tfidf__ngram_range': [(1, 1), (2, 2), (3, 3)],
    'clf__alpha': (1e-2, 1e-3)
}

p2 = Pipeline(estimators2)
grid2 = RandomizedSearchCV(p2, param_distributions=parameters2,
                           scoring='balanced_accuracy', n_iter=20, cv=3,
                           n_jobs=13, pre_dispatch='n_jobs')
In this pipeline the argument n_jobs appears twice. How are the two handled by scikit-learn?
Q : How are they handled by scikit-learn?
So, let's start with the documentation, as of 2019/Q4:
n_jobs : int or None, optional ( default = None )
Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
WHAT ALTERNATIVES DO WE HAVE HERE & WHICH IS THE BEST ONE ?
A)VOID parallelism at all
B)LOCK CPU instead of enhancing the performance
C)USTOM setup CPU-core mapping for maximum performance
D)ISTRIBUTE workloads across a cluster of Dask-nodes
OPTION A : A)VOID parallelism at all
So, one can explicitly avoid parallelism for processing a hierarchical composition of RandomizedSearchCV( Pipeline( [ ( … ), ( …, SGDClassifier( n_jobs = 13, … ) ), ] ), …, n_jobs = 13, … ) - for CPU-intensive, yet MEM-bound processing of the tasks - by explicitly using the "threading" backend inside a context constructor:
with parallel_backend( 'threading' ): # also ref.'d via sklearn.utils.parallel_backend
grid2.fit( … )
Here, however many threads get instantiated, all of them wait for the one central GIL-lock. All wait, but one executes. This is the known GIL-lock-introduced re-[SERIAL]-isation of any thread-based code execution into an even more expensive, merely interleaved, pure-[SERIAL] Python code execution. Except for cases (not this one) that are principally a latency-masking trick (making slightly better use of the time I/O-bound tasks spend in NOPs, waiting for an I/O operation to finish and yield results), this will not make your Python-based ML pipeline any faster - rather the very opposite.
OPTION B : B)LOCK CPU instead of enhancing the performance
One may choose a better, less prohibitive backend - "multiprocessing", or in more recent joblib releases also "loky" - where the GIL-lock does not cause trouble (at the cost of instantiating n_jobs-many Python process replicas, each one having its own internal and unavoidable GIL-lock, now at least not competing against its own other threads for a time-slice of work by first grabbing the GIL-lock). Yet the story does not end here. This option is the typical case where multiple levels of n_jobs-aware processing appear inside the same pipeline - each of them fighting (at the O/S-scheduler level) to get a piece of CPU-core time-slice inside a time-sharing run of more processes than there are CPU-cores. The result? Processes get spawned, yet have to wait in a queue for their turn (if there are more of them than the number of cores permitted for the user - check not only the number of cores, but also the permitted CPU-core-affinity settings enforced by the O/S for the given user's/process's effective rights, which on a tightly managed system can be far less than the number of physical (or virtualisation-emulated) CPU-cores), losing time, losing CPU-cache pre-fetched blocks of data (so paying the expensive ~300~350 ns RAM-fetches again and again, instead of re-using the pre-fetched (and already paid-for) data from the L1/L2/L3-cache at a cost of about 0.5 ns(!)).
OPTION C : C)USTOM setup CPU-core mapping for maximum performance
A good engineering practice is to carefully map CPU-cores for processing.
Given the right backend is in place, one has to decide where the performance bottleneck is - here, most probably (with a chance of an exception if option D is available, equipped with a huge cluster of strong, fat-RAM machines), one will prefer to make each and every SGDClassifier.fit() faster (spending the n_jobs-specified process instances on the most expensive sub-task - the training), rather than having "more" RandomizedSearchCV()-initiated "toys" on the playground, all suffocated by lack of RAM and CPU-cache inefficiencies.
Your code will always, even behind the curtain of not knowing all details, have to "obey" running not on all of the CPU-cores, but only on those that any such multiprocessing-requested sub-process is permitted to harness,
the number of which is no higher than len( os.sched_getaffinity( 0 ) ). If interested in details, read and use the code here.
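A minimal sketch of that check (os.sched_getaffinity() is available on Linux; elsewhere a fall-back to os.cpu_count() is assumed):

import os

try:
    usable_cores = len(os.sched_getaffinity(0))   # cores this process may actually use
except AttributeError:                            # e.g. on Windows / macOS
    usable_cores = os.cpu_count()

n_jobs = max(1, usable_cores)                     # a sane upper bound for n_jobs
print("permitted CPU-cores:", usable_cores)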
In general, good planning and profiling practice will help attain the best reasonably achievable mapping of n_jobs-instantiated processes onto the available set of CPU-cores. There is no magic; common sense, a process monitor and benchmarking/timing of run-times will help us polish this competence.
OPTION D : D)ISTRIBUTE workloads across a cluster of Dask-nodes
Where possible, using Dask-module enabled distributed-computing nodes, one may set:
with parallel_backend( 'dask' ):
grid2.fit( … )
which will harness all of the Dask cluster's computing resources to get the "heavy" task completed smarter than is possible with just the localhost CPU/RAM resources. Ultimately, this is the maximum level of concurrent processing possible inside the Python ecosystem in its current, as-is state.
You can try using accelerated implementations of algorithms that introduce their own internal threading - such as scikit-learn-intelex - https://github.com/intel/scikit-learn-intelex
scikit-learn-intelex is based on TBB for parallelization and can utilize the entire system for accelerated algorithms. So you would be able to go without joblib parallelization while still getting better performance.
First, install the package:
pip install scikit-learn-intelex
Then add to your Python script:
from sklearnex import patch_sklearn
patch_sklearn()

How many processes should I create for a multi-threaded CPU in a computationally intensive scenario?

I have a 32-core, 64-thread CPU for executing a scientific computation task. How many processes should I create?
To be noted: my program is computationally intensive, involving lots of matrix computations based on NumPy. Currently I use the default Python process pool to execute this task, which will create 64 processes. Will it perform better or worse than with 32 processes?
I'm not really sure that Python is suited for computationally intensive multi-threading scenarios, due to the Global Interpreter Lock (GIL). Basically, you should use multi-threading in Python only for IO-bound tasks. I'm not sure whether this applies to NumPy, since, if I recall correctly, the heavy part is written in C++.
If you're looking for alternatives you could use the Apache Spark framework to distribute the work across multiple machines. I think that even if you run your code in local mode (i.e. on your machine) with 8/16 workers you could get some performance boost.
EDIT: I'm sorry, I just read on the GIL page that I linked that it doesn't apply to NumPy. I still think that this is not really the best tool you can use, since effective multi-threaded programming is quite hard to get right, and there are some other nuances that you can read about in the link.
It's impossible to give you an answer, as it will depend on your exact problem and code, but potentially also on your hardware.
Basically, the approach for multiprocessing is to split the work into X parts, distribute them to the processes, let each process work, and then merge the results.
Now you need to know whether you can effectively split the work into 64 parts while keeping each part at roughly the same amount of work (if one part takes 90% of the time and you can't split it, it's useless to have more than 2 processes, as you will always be waiting for that first one).
If you can do that, and splitting and merging the work/results doesn't take too long (remember that this is supplementary work, so it takes extra time), then it can be worthwhile to use more processes.
It is also possible to speed up your code by using fewer processes, if you spend too much time splitting/merging the work/results (sometimes the speed-up obtained by using more processes is negative).
Also remember that on some architectures the memory cache is shared among cores, which can badly affect multiprocessing performance.
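Because NumPy's BLAS backend may itself spawn threads, a practical way to settle the 32-vs-64 question is to pin each worker to one BLAS thread and time a representative batch at a few pool sizes. A sketch under those assumptions (the eigvals call is only a stand-in for your real matrix work):

import os
os.environ["OMP_NUM_THREADS"] = "1"        # keep each worker single-threaded in BLAS
import time
from multiprocessing import Pool
import numpy as np

def work(seed):
    rng = np.random.default_rng(seed)
    a = rng.random((200, 200))
    return np.linalg.eigvals(a).real.sum()  # stand-in for the real computation

if __name__ == "__main__":
    for nproc in (16, 32, 64):
        t0 = time.perf_counter()
        with Pool(nproc) as pool:
            pool.map(work, range(256))
        print(f"{nproc:3d} processes: {time.perf_counter() - t0:.2f} s")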

Python multiprocessing performance only improves with the square root of the number of cores used

I am attempting to implement multiprocessing in Python (Windows Server 2012) and am having trouble achieving the degree of performance improvement that I expect. In particular, for a set of tasks which are almost entirely independent, I would expect a linear improvement with additional cores.
I understand that--especially on Windows--there is overhead involved in opening new processes [1], and that many quirks of the underlying code can get in the way of a clean trend. But in theory the trend should ultimately still be close to linear for a fully parallelized task [2]; or perhaps logistic if I were dealing with a partially serial task [3].
However, when I run multiprocessing.Pool on a prime-checking test function (code below), I get a nearly perfect square-root relationship up to N_cores=36 (the number of physical cores on my server) before the expected performance hit when I get into the additional logical cores.
Here is a plot of my performance test results :
( "Normalized Performance" is [ a run time with 1 CPU-core ] divided by [ a run time with N CPU-cores ] ).
Is it normal to have this dramatic diminishing of returns with multiprocessing? Or am I missing something with my implementation?
import numpy as np
from multiprocessing import Pool, cpu_count, Manager
import math as m
from functools import partial
from time import time

def check_prime(num):
    # Assert positive integer value
    if num != m.floor(num) or num < 1:
        print("Input must be a positive integer")
        return None
    # Check divisibility for all possible factors
    prime = True
    for i in range(2, num):
        if num % i == 0: prime = False
    return prime

def cp_worker(num, L):
    prime = check_prime(num)
    L.append((num, prime))

def mp_primes(omag, mp=cpu_count()):
    with Manager() as manager:
        np.random.seed(0)
        numlist = np.random.randint(10**omag, 10**(omag+1), 100)
        L = manager.list()
        cp_worker_ptl = partial(cp_worker, L=L)
        try:
            pool = Pool(processes=mp)
            list(pool.imap(cp_worker_ptl, numlist))
        except Exception as e:
            print(e)
        finally:
            pool.close()  # no more tasks
            pool.join()
        return L

if __name__ == '__main__':
    rt = []
    for i in range(cpu_count()):
        t0 = time()
        mp_result = mp_primes(6, mp=i+1)
        t1 = time()
        rt.append(t1 - t0)
        print("Using %i core(s), run time is %.2fs" % (i+1, rt[-1]))
Note: I am aware that for this task it would likely be more efficient to implement multithreading, but the actual script for which this one is a simplified analog is incompatible with Python multithreading due to GIL.
#KellanM deserved [+1] for quantitative performance monitoring
am I missing something with my implementation?
Yes, you are abstracting away all the add-on costs of process management.
While you have expressed an expectation of "a linear improvement with additional cores", this will hardly ever appear in practice, for several reasons (even the hype of communism failed to deliver anything for free).
Gene AMDAHL formulated the initial law of diminishing returns.
A more recent, re-formulated version also took into account the effects of process-management {setup|terminate} add-on overhead costs and tried to cope with the atomicity of processing (large work-package payloads cannot easily be re-located / re-distributed over the available pool of free CPU-cores in most common programming systems, except for some indeed specific micro-scheduling art, like the one demonstrated in Semantic Designs' PARLANSE or LLNL's SISAL, shown so colourfully in the past).
A best next step?
If indeed interested in this domain, one may always experimentally measure and compare the real costs of process management (plus the data-flow costs, plus the memory-allocation costs, ... up until the process termination and the results re-assembly in the main process), so as to quantitatively and fairly record and evaluate the add-on cost / benefit ratio of using more CPU-cores (which, in Python, re-instantiates the whole Python-interpreter state, including all of its memory state, before a first useful operation gets carried out in the first spawned-and-set-up process).
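A minimal sketch of such a measurement, comparing a pure-[SERIAL] run of a placeholder payload against Pool-distributed runs that necessarily include the spawn and SER/DES add-on costs:

import time
from multiprocessing import Pool

def payload(n):
    return sum(i * i for i in range(n))    # placeholder for the real work

if __name__ == "__main__":
    N, SIZE = 1000, 10_000

    t0 = time.perf_counter()
    _ = [payload(SIZE) for _ in range(N)]              # pure-[SERIAL] baseline
    t_serial = time.perf_counter() - t0

    for nproc in (2, 4, 8):
        t0 = time.perf_counter()
        with Pool(nproc) as pool:
            pool.map(payload, [SIZE] * N)              # includes spawn + SER/DES costs
        print(f"{nproc} workers: {time.perf_counter() - t0:.3f} s "
              f"(serial baseline {t_serial:.3f} s)")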
Under-performance (in the former case below), if not disastrous effects (in the latter case below), of either kind of ill-engineered resources-mapping policy - be it an "under-booking" of resources from the pool of CPU-cores, or an "over-booking" of resources from the pool of RAM space - are discussed also here.
The link to the re-formulated Amdahl's Law above will help you evaluate the point of diminishing returns, so as not to pay more than you will ever receive.
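For orientation, the classic Amdahl's Law gives the speedup S(N) = 1 / ( (1 - p) + p / N ) for a parallel fraction p and N workers; the sketch below additionally adds an illustrative per-worker setup-overhead term (the 0.002 value is an arbitrary assumption) just to show how the curve bends back down:

def speedup(p, N, overhead=0.0):
    # Classic Amdahl's Law, extended with an illustrative per-worker add-on
    # overhead, expressed as a fraction of the single-core run time.
    return 1.0 / ((1.0 - p) + p / N + overhead * N)

for N in (1, 2, 4, 8, 16, 36, 72):
    print(N, round(speedup(p=0.95, N=N, overhead=0.002), 2))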
The Hoefinger and Haunschmid experiments may serve as good practical evidence of how a growing number of processing nodes (be it a locally O/S-managed CPU-core or a NUMA distributed-architecture node) will start to decrease the resulting performance,
where the point of diminishing returns (demonstrated in the overhead-agnostic Amdahl's Law)
actually starts to become the point after which you pay more than you receive.
Good luck on this interesting field!
Last, but not least,
NUMA / non-locality issues get their voice heard in the discussion of scaling for HPC-grade-tuned (in-cache / in-RAM) computing strategies and may, as a side-effect, help detect flaws (as reported by #eryksun above). One may feel free to review one's platform's actual NUMA topology with the lstopo tool, to see the abstraction that the operating system tries to work with when scheduling the "just"-[CONCURRENT] task execution over such a NUMA-resources topology.

Parallel application in python becomes much slower when using mpi rather than multiprocessing module

Lately I've observed a weird effect when I measured performance of my parallel application using the multiprocessing module and mpi4py as communication tools.
The application performs evolutionary algorithms on sets of data. Most operations are done sequentially with the exception of evaluation. After all evolutionary operators are applied all individuals need to receive new fitness values, which is done during the evaluation. Basically it's just a mathematical calculation performed on a list of floats (python ones). Before the evaluation a data set is scattered either by the mpi's scatter or python's Pool.map, then comes the parallel evaluation and later the data comes back through the mpi's gather or again the Pool.map mechanism.
My benchmark platform is a virtual machine (virtualbox) running Ubuntu 11.10 with Open MPI 1.4.3 on Core i7 (4/8 cores), 8 GB of RAM and an SSD drive.
What I find truly surprising is that I get a nice speed-up; however, depending on the communication tool, after a certain threshold of processes the performance becomes worse. This is illustrated by the pictures below.
y axis - processing time
x axis - nr of processes
colours - size of each individual (nr of floats)
1) Using multiprocessing module - Pool.map
2) Using mpi - Scatter/Gather
3) Both pictures on top of each other
At first I thought it was hyperthreading's fault, because for large data sets it becomes slower after reaching 4 processes (4 physical cores). However, that should also be visible in the multiprocessing case, and it isn't. My other guess is that the mpi communication methods are much less effective than the Python ones, but I find that hard to believe.
Does anyone have any explanation for these results?
ADDED:
I'm starting to believe that it's hyperthreading's fault after all. I tested my code on a machine with a Core i5 (2/4 cores) and the performance is worse with 3 or more processes. The only explanation that comes to my mind is that the i7 I'm using doesn't have enough resources (cache?) to compute the evaluation concurrently with hyperthreading, and has to schedule more than 4 processes onto the 4 physical cores.
What's interesting, however, is that when I use mpi, htop shows complete utilization of all 8 logical cores, which would suggest that the above statement is incorrect. On the other hand, when I use Pool.map it doesn't completely utilize all cores; it uses one or two to the maximum and the rest only partially - again, no idea why it behaves this way. Tomorrow I will attach a screenshot showing this behaviour.
I'm not doing anything fancy in the code; it's really straightforward (I'm not posting the entire code, not because it's secret, but because it needs additional libraries like DEAP to be installed; if someone is really interested in the problem and ready to install DEAP, I can prepare a short example). The code for MPI is a little bit different, because it can't deal with the population container (which inherits from list). There is some overhead of course, but nothing major. Apart from the code shown below, the rest is the same.
Pool.map:
def eval_population(func, pop):
    for ind in pop:
        ind.fitness.values = func(ind)
    return pop

# ...
self.pool = Pool(8)
# ...

for iter_ in xrange(nr_of_generations):
    # ...
    # evaluate is really an eval_population alias with a certain
    # function assigned to its first argument.
    self.pool.map(evaluate, pop)
    # ...
MPI - Scatter/Gather
def divide_list(lst, n):
    return [lst[i::n] for i in xrange(n)]

def chain_list(lst):
    return list(chain.from_iterable(lst))

def evaluate_individuals_in_groups(func, rank, individuals):
    comm = MPI.COMM_WORLD
    size = MPI.COMM_WORLD.Get_size()
    packages = None
    if not rank:
        packages = divide_list(individuals, size)
    ind_for_eval = comm.scatter(packages)
    eval_population(func, ind_for_eval)
    pop_with_fit = comm.gather(ind_for_eval)
    if not rank:
        pop_with_fit = chain_list(pop_with_fit)
        for index, elem in enumerate(pop_with_fit):
            individuals[index] = elem

for iter_ in xrange(nr_of_generations):
    # ...
    evaluate_individuals_in_groups(self.func, self.rank, pop)
    # ...
ADDED 2:
As I mentioned earlier I made some tests on my i5 machine (2/4 cores) and here is the result:
I also found a machine with 2 xeons (2x 6/12 cores) and repeated the benchmark:
Now I have 3 examples of the same behaviour. When I run my computation in more processes than physical cores it starts getting worse. I believe it's because the processes on the same physical core can't be executed concurrently because of the lack of resources.
MPI is actually designed for inter-node communication, i.e. to talk to other machines over the network.
Using MPI on the same node can result in a big overhead for every message that has to be sent, when compared to e.g. threading.
mpi4py makes a copy for every message, since it's targeted at distributed memory usage.
If your OpenMPI is not configured to use shared memory for intra-node communication, the message will be sent through the kernel's TCP stack, and back, to get delivered to the other process, which again adds some overhead.
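If one does stay with mpi4py on a single node, part of the per-message cost can be cut by using the uppercase, buffer-based calls on contiguous NumPy arrays instead of the lowercase, pickle-based scatter/gather. A minimal sketch, assuming (purely for illustration) that the per-individual data can be packed into a float array:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

CHUNK = 1000                                    # illustrative: floats per rank
sendbuf = None
if rank == 0:
    sendbuf = np.random.rand(size, CHUNK)       # whole population packed as floats

recvbuf = np.empty(CHUNK, dtype='d')
comm.Scatter(sendbuf, recvbuf, root=0)          # buffer-based: no pickling of Python objects

recvbuf *= 2.0                                  # placeholder for the real evaluation

result = np.empty((size, CHUNK), dtype='d') if rank == 0 else None
comm.Gather(recvbuf, result, root=0)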
If you only intend to do computations within the same machine, there is no need to use mpi here.
Some of this is discussed in this thread.
Update
The ipc-benchmark project tries to make some sense of how different communication types perform on different systems (multicore, multiprocessor, shared memory), and especially how this influences virtualized machines!
I recommend running the ipc-benchmark on the virtualized machine and posting the results.
If they look anything like this benchmark, it can give you great insight into the difference between tcp, sockets and pipes.
