Launching multiple C processes in Python

Launching multiple C processes in Python - python

I have two programs, one written in C and one written in Python. I want to pass a few arguments to C program from Python and do it many times in parallel, because I have about 1 million of such C calls.
Essentially I did like this:
from subprocess import check_call
import multiprocessing as mp
from itertools import combinations
def run_parallel(f1, f2):
check_call(f"./c_compiled {f1} {f2} &", cwd='.', shell=True)
if __name__ == '__main__':
pairs = combinations(fns, 2)
pool = mp.Pool(processes=32)
pool.starmap(run_parallel, pairs)
pool.close()
However, sometimes I get the following errors (though the main process is still running)
/bin/sh: fork: retry: No child processes
Moreover, sometimes the whole program in Python fails with
BlockingIOError: [Errno 11] Resource temporarily unavailable
I found while it's still running I can see a lot of processes spawned for my user (around 500), while I have at most 512 available.
This does not happen all the time (depending on the arguments) but it often does. How I can avoid these problems?

I'd wager you're running up against a process/file descriptor/... limit there.
You can "save" one process per invocation by not using shell=True:
check_call(["./c_compiled", f1, f2], cwd='.')
But it'd be better still to make that C code callable from Python instead of creating processes to do so. By far the easiest way to interface "random" C code with Python is Cython.

"many times in parallel" you can certainly do, for reasonable values of "many", but "about 1 million of such C calls" all running at the same time on the same individual machine is almost surely out of the question.
You can lighten the load by running the jobs without interposing a shell, as discussed in #AKX's answer, but that's not enough to bring your objective into range. Better would be to queue up the jobs so as to run only a few at a time -- once you reach that number of jobs, start a new one only when a previous one has finished. The exact number you should try to keep running concurrently depends on your machine and on the details of the computation, but something around the number of CPU cores might be a good first guess.
Note in particular that it is counterproductive to have more jobs at any one time than the machine has resources to run concurrently. If your processes do little or no I/O then the number of cores in your machine puts a cap on that, for only the processes that are scheduled on a core at any given time (at most one per core) will make progress while the others wait. Switching among many processes so as to attempt to avoid starving any of them will add overhead. If your processes do a lot of I/O then they will probably spend a fair proportion of their time blocked on I/O, and therefore not (directly) requiring a core, but in this case your I/O devices may well create a bottleneck, which might prove even worse than the limitation from number of cores.

Related

Get the number of physical cores with Python [duplicate]

I have a function foo which consumes a lot of memory and which I would like to run several instances of in parallel.
Suppose I have a CPU with 4 physical cores, each with two logical cores.
My system has enough memory to accommodate 4 instances of foo in parallel but not 8. Moreover, since 4 of these 8 cores are logical ones anyway, I also do not expect using all 8 cores will provide much gains above and beyond using the 4 physical ones only.
So I want to run foo on the 4 physical cores only. In other words, I would like to ensure that doing multiprocessing.Pool(4) (4 being the maximum number of concurrent run of the function I can accommodate on this machine due to memory limitations) dispatches the job to the four physical cores (and not, for example, to a combo of two physical cores and their two logical offsprings).
How to do that in python?
Edit:
I earlier used a code example from multiprocessing but I am library agnostic ,so to avoid confusion, I removed that.

I know the topic is quite old now, but as it still appears as the first answer when typing 'multiprocessing logical core' in google... I feel like I have to give an additional answer because I can see that it would be possible for people in 2018 (or even later..) to get easily confused here (some answers are indeed a little bit confusing)
I can see no better place than here to warn readers about some of the answers above, so sorry for bringing the topic back to life.
--> TO COUNT THE CPUs (LOGICAL/PHYSICAL) USE THE PSUTIL MODULE
For a 4 physical core / 8 thread i7 for ex it will return
import psutil
psutil.cpu_count(logical = False)
4
psutil.cpu_count(logical = True)
8
As simple as that.
There you won't have to worry about the OS, the platform, the hardware itself or whatever. I am convinced it is much better than multiprocessing.cpu_count() which can sometimes give weird results, from my own experience at least.
--> TO USE N PHYSICAL CORES (up to your choice) USE THE MULTIPROCESSING MODULE DESCRIBED BY YUGI
Just count how many physical processes you have, launch a multiprocessing.Pool of 4 workers.
Or you can also try to use the joblib.Parallel() function
joblib in 2018 is not part of the standard distribution of python, but is just a wrapper of the multiprocessing module as described by Yugi.
--> MOST OF THE TIME, DON'T USE MORE CORES THAN AVAILABLE (unless you have benchmarked a very specific code and proved it was worth it)
Misinformation abounds that "the OS will handle things if you specify more cores than are available". It is absolutely 100% false. If you use more cores than available, you will face huge performance drops. The exception would be if the worker processes are IO bound. Because the OS scheduler will try its best to work on every task with the same attention, switching regularly from one to another, and depending on the OS, it can spend up to 100% of its working time to just switching between processes, which would be disastrous.
Don't just trust me: try it, benchmark it, you will see how clear it is.
IS IT POSSIBLE TO DECIDE WHETHER THE CODE WILL BE EXECUTED ON LOGICAL OR PHYSICAL CORE?
If you are asking this question, this means you don't understand the way physical and logical cores are designed, so maybe you should check a little bit more about a processor's architecture.
If you want to run on core 3 rather than core 1 for example, Well I guess there are indeed some solutions, but available only if you know how to code an OS's kernel and scheduler, which I think is not the case if you're asking this question.
If you launch 4 CPU-intensive processes on a 4 physical / 8 logical processor, the scheduler will attribute each of your processes to 1 distinct physical core (and 4 logical core will remain not/poorly used). But on a 4 logical / 8 thread proc, if the processing units are (0,1) (1,2) (2,3) (4,5) (5,6) (6,7), then it makes no difference if the process is executed on 0 or 1 : it is the same processing unit.
From my knowledge at least (but an expert could confirm, maybe it differs from very specific hardware specifications also) I think there is no or very little difference between executing a code on 0 or 1. In the processing unit (0,1), I am not sure that 0 is the logical whereas 1 is the physical, or vice-versa. From my understanding (which can be wrong), both are processors from the same processing unit, and they just share their cache memory / access to the hardware (RAM included), and 0 is not more a physical unit than 1.
More than that you should let the OS decide. Because the OS scheduler can take advantage of a hardware logical-core turbo boost that exist on some platforms (ex i7, i5, i3...), something else that you have no power over, and that could be truly helpful to you.
If you launch 5 CPU-intensive tasks on a 4 physical / 8 logical core, the behaviour will be chaotic, almost unpredictable, mostly dependent on your hardware and OS. The scheduler will try its best. Almost every time, you will face really bad performance.
Let's presume for a moment that we are still talking about a 4(8) classical architecture: Because the scheduler tries its best (and therefore often switches the attributions), depending on the process you are executing, it could be even worse to launch on 5 logical cores than on 8 logical cores (where at least he knows everything will be used at 100% anyway, so lost for lost he won't try much to avoid it, won't switch too often, and therefore won't lose too much time by switching).
It is 99% sure however (but benchmark it on your hardware to be sure) that almost any multiprocessing program will run slower if you use more physical core than available.
A lot of things can intervene... The program, the hardware, the state of the OS, the scheduler it uses, the fruit you ate this morning, your sister's name... In case you doubt about something, just benchmark it, there is no other easy way to see whether you are losing performances or not. Sometimes informatics can be really weird.
--> MOST OF THE TIME, ADDITIONAL LOGICAL CORES ARE INDEED USELESS IN PYTHON (but not always)
There are 2 main ways of doing really parallel tasks in python.
multiprocessing (cannot take advantage of logical cores)
multithreading (can take advantage of logical cores)
For example to run 4 tasks in parallel
--> multiprocessing will create 4 different python interpreter. For each of them you have to start a python interpreter, define the rights of reading/writing, define the environment, allocate a lot of memory, etc. Let's say it as it is: You will start a whole new program instance from 0. It can take a huge amount of time, so you have to be sure that this new program will work long enough so that it is worth it.
If your program has enough work (let's say, a few seconds of work at least), then because the OS allocates CPU-consuming processes on different physical cores, it works, and you can gain a lot of performances, which is great. And because the OS almost always allows processes to communicate between them (although it is slow) they can even exchange (a little bit of) data.
--> multithreading is different. Within your python interpreter, it will just create a small amount of memory that many CPU will be available to share, and work on it at the same time. It is WAY much quicker to spawn (where spawning a new process on an old computer can take many seconds sometimes, spawning a thread is done within a ridiculously small fraction of time). You don't create new processes, but "threads" which are much lighter.
Threads can share memory between threads very quickly, because they literally work together on the same memory (while it has to be copied/exchanged when working with different processes).
BUT: WHY CANNOT WE USE MULTITHREADING IN MOST SITUATIONS ? IT LOOKS VERY CONVENIENT ?
There is a very BIG limitation in python: Only one python line can be executed at a time in a python interpreter, which is called the GIL (Global Interpreter Lock). So most of the time, you will even LOSE performances by using multithreading, because different threads will have to wait to access to the same resource. For pure computational processing (with no IO), multithreading is USELESS and even WORSE if your code is pure python. However, if your threads involve any waiting for IO, multithreading can be very beneficial.
--> WHY SHOULDN'T I USE LOGICAL CORES WHEN USING MULTIPROCESSING ?
Logical cores don't have their own memory access. They can only work on the memory access and on the cache of its hosting physical processor. For example it is very likely (and often used indeed) that the logical and the physical core of a same processing unit both use the same C/C++ function on different emplacements of the cache memory at the same time. Making the treatment hugely faster indeed.
But... these are C/C++ functions ! Python is a big C/C++ wrapper, that needs much more memory and CPU than its equivalent C++ code. It is very likely in 2018 that, whatever you want to do, 2 big python processes will need much, much more memory and cache reading/writing than what a single physical+logical unit can afford, and much more that what the equivalent C/C++ truly-multithreaded code would consume. This once again, would almost always cause performances to drop. Remember that every variable that is not available in the processor's cache, will take x1000 time to read in the memory. If your cache is already completely full for 1 single python process, guess what will happened if you force 2 processes to use it: They will use it one at the time, and switch permanently, causing data to be stupidly flushed and re-read every time it switches. When the data is being read or written from memory, you might think that your CPU "is" working but it's not. It's waiting for the data ! By doing nothing.
--> HOW CAN YOU TAKE ADVANTAGE OF LOGICAL CORES THEN ?
Like I said there is no true multithreading (so no true usage of logical cores) in default python, because of the global interpreter lock. You can force the GIL to be removed during some parts of the program, but I think it would be a wise advise that you don't touch to it if you don't know exactly what you are doing.
Removing the GIL definitely has been a subject of a lot of research (see the experimental PyPy or Cython projects that both try to do so).
For now, no real solution exists for it, as it is a much more complex problem than it seems.
There is, I admit, another solution that can work:
Code your function in C
Wrap it in python with ctype
Use the python multithreading module to call your wrapped C function
This will work 100%, and you will be able to use all the logical cores, in python, with multithreading, and for real. The GIL won't bother you, because you won't be executing true python functions, but C functions instead.
For example, some libraries like Numpy can work on all available threads, because they are coded in C. But if you come to this point, I always thought it could be wise to think about doing your program in C/C++ directly because it is a consideration very far from the original pythonic spirit.
**--> DON'T ALWAYS USE ALL AVAILABLE PHYSICAL CORES **
I often see people be like "Ok I have 8 physical core, so I will take 8 core for my job". It often works, but sometimes turns out to be a poor idea, especially if your job needs a lot of I/O.
Try with N-1 cores (once again, especially for highly I/O-demanding tasks), and you will see that 100% of time, on per-task/average, single tasks will always run faster on N-1 core. Indeed, your computer makes a lot of different things: USB, mouse, keyboard, network, Hard drive, etc... Even on a working station, periodical tasks are performed anytime in the background that you have no idea about. If you don't let 1 physical core to manage those tasks, your calculation will be regularly interrupted (flushed out from the memory / replaced back in memory) which can also lead to performance issues.
You might think "Well, background tasks will use only 5% of CPU-time so there is 95% left". But it's not the case.
The processor handles one task at a time. And every time it switches, a considerably high amount of time is wasted to place everything back at its place in the memory cache/registries. Then, if for some weird reason the OS scheduler does this switching too often (something you have no control on), all of this computing time is lost forever and there's nothing you can do about it.
If (and it sometimes happen) for some unknown reason this scheduler problem impacts the performances of not 1 but 30 tasks, it can result in really intriguing situations where working on 29/30 physical core can be significantly faster than on 30/30
MORE CPU IS NOT ALWAYS THE BEST
It is very frequent, when you use a multiprocessing.Pool, to use a multiprocessing.Queue or manager queue, shared between processes, to allow some basic communication between them. Sometimes (I must have said 100 times but I repeat it), in an hardware-dependent manner, it can occur (but you should benchmark it for your specific application, your code implementation and your hardware) that using more CPU might create a bottleneck when you make processes communicate / synchronize. In those specific cases, it could be interesting to run on a lower CPU number, or even try to deport the synchronization task on a faster processor (here I'm talking about scientific intensive calculation ran on a cluster of course). As multiprocessing is often meant to be used on clusters, you have to notice that clusters often are underclocked in frequency for energy-saving purposes. Because of that, single-core performances can be really bad (balanced by a way-much higher number of CPUs), making the problem even worse when you scale your code from your local computer (few cores, high single-core performance) to a cluster (lot of cores, lower single-core performance), because your code bottleneck according to the single_core_perf/nb_cpu ratio, making it sometimes really annoying
Everyone has the temptation to use as many CPU as possible. But benchmark for those cases is mandatory.
The typical case (in data science for ex) is to have N processes running in parallel and you want to summarize the results in one file. Because you cannot wait the job to be done, you do it through a specific writer process. The writer will write in the outputfile everything that is pushed in his multiprocessing.Queue (single-core and hard-drive limited process). The N processes fill the multiprocessing.Queue.
It is easy then to imagine that if you have 31 CPU writing informations to one really slow CPU, then your performances will drop (and possibly something will crash if you overcome the system's capability to handle temporary data)
--> Take home message
Use psutil to count logical/physical processors, rather than multiprocessing.cpu_count() or whatsoever
Multiprocessing can only work on physical core (or at least benchmark it to prove it is not true in your case)
Multithreading will work on logical core BUT you will have to code and wrap your functions in C, or remove the global lock interpreter (and every time you do so, one kitten atrociously dies somewhere in the world)
If you are trying to run multithreading on pure python code, you will have huge performance drops, so you should 99% of the time use multiprocessing instead
Unless your processes/threads are having long pauses that you can exploit, never use more core than available, and benchmark properly if you want to try
If your task is I/O intensive, you should let 1 physical core to handle the I/O, and if you have enough physical core, it will be worth it. For multiprocessing implementations it needs to use N-1 physical core. For a classical 2-way multithreading, it means to use N-2 logical core.
If you have need for more performances, try PyPy (not production ready) or Cython, or even to code it in C
Last but not least, and the most important of all: If you are really seeking for performance, you should absolutely, always, always benchmark, and not guess anything. Benchmark often reveal strange platform/hardware/driver very specific behaviour that you would have no idea about.

Note: This approach doesn't work on windows and it is tested only on linux.
Using multiprocessing.Process:
Assigning a physical core to each process is quite easy when using Process(). You can create a for loop that iterates trough each core and assigns the new process to the new core using taskset -p [mask] [pid] :
import multiprocessing
import os
def foo():
return
if __name__ == "__main__" :
for process_idx in range(multiprocessing.cpu_count()):
p = multiprocessing.Process(target=foo)
os.system("taskset -p -c %d %d" % (process_idx % multiprocessing.cpu_count(), os.getpid()))
p.start()
I have 32 cores on my workstation so I'll put partial results here:
pid 520811's current affinity list: 0-31
pid 520811's new affinity list: 0
pid 520811's current affinity list: 0
pid 520811's new affinity list: 1
pid 520811's current affinity list: 1
pid 520811's new affinity list: 2
pid 520811's current affinity list: 2
pid 520811's new affinity list: 3
pid 520811's current affinity list: 3
pid 520811's new affinity list: 4
pid 520811's current affinity list: 4
pid 520811's new affinity list: 5
...
As you see, the previous and new affinity of each process here. The first one is for all cores (0-31) and is then assigned to core 0, second process is by default assigned to core0 and then its affinity is changed to the next core (1), and so forth.
Using multiprocessing.Pool:
Warning: This approach needs tweaking the pool.py module since there is no way that I know of that you can extract the pid from the Pool(). Also this changes have been tested on python 2.7 and multiprocessing.__version__ = '0.70a1'.
In Pool.py, find the line where the _task_handler_start() method is being called. In the next line, you can assign the process in the pool to each "physical" core using (I put the import os here so that the reader doesn't forget to import it):
import os
for worker in range(len(self._pool)):
p = self._pool[worker]
os.system("taskset -p -c %d %d" % (worker % cpu_count(), p.pid))
and you're done. Test:
import multiprocessing
def foo(i):
return
if __name__ == "__main__" :
pool = multiprocessing.Pool(multiprocessing.cpu_count())
pool.map(foo,'iterable here')
result:
pid 524730's current affinity list: 0-31
pid 524730's new affinity list: 0
pid 524731's current affinity list: 0-31
pid 524731's new affinity list: 1
pid 524732's current affinity list: 0-31
pid 524732's new affinity list: 2
pid 524733's current affinity list: 0-31
pid 524733's new affinity list: 3
pid 524734's current affinity list: 0-31
pid 524734's new affinity list: 4
pid 524735's current affinity list: 0-31
pid 524735's new affinity list: 5
...
Note that this modification to pool.py assign the jobs to the cores round-robinly. So if you assign more jobs than the cpu-cores, you will end up having multiple of them on the same core.
EDIT:
What OP is looking for is to have a pool() that is capable of staring the pool on specific cores. For this more tweaks on multiprocessing are needed (undo the above-mentioned changes first).
Warning:
Don't try to copy-paste the function definitions and function calls. Only copy paste the part that is supposed to be added after self._worker_handler.start() (you'll see it below). Note that my multiprocessing.__version__ tells me the version is '0.70a1', but it doesn't matter as long as you just add what you need to add:
multiprocessing's pool.py:
add a cores_idx = None argument to __init__() definition. In my version it looks like this after adding it:
def __init__(self, processes=None, initializer=None, initargs=(),
maxtasksperchild=None,cores_idx=None)
also you should add the following code after self._worker_handler.start():
if not cores_idx is None:
import os
for worker in range(len(self._pool)):
p = self._pool[worker]
os.system("taskset -p -c %d %d" % (cores_idx[worker % (len(cores_idx))], p.pid))
multiprocessing's __init__.py:
Add a cores_idx=None argument to definition of the Pool() in as well as the other Pool() function call in the the return part. In my version it looks like:
def Pool(processes=None, initializer=None, initargs=(), maxtasksperchild=None,cores_idx=None):
'''
Returns a process pool object
'''
from multiprocessing.pool import Pool
return Pool(processes, initializer, initargs, maxtasksperchild,cores_idx)
And you're done. The following example runs a pool of 5 workers on cores 0 and 2 only:
import multiprocessing
def foo(i):
return
if __name__ == "__main__":
pool = multiprocessing.Pool(processes=5,cores_idx=[0,2])
pool.map(foo,'iterable here')
result:
pid 705235's current affinity list: 0-31
pid 705235's new affinity list: 0
pid 705236's current affinity list: 0-31
pid 705236's new affinity list: 2
pid 705237's current affinity list: 0-31
pid 705237's new affinity list: 0
pid 705238's current affinity list: 0-31
pid 705238's new affinity list: 2
pid 705239's current affinity list: 0-31
pid 705239's new affinity list: 0
Of course you can still have the usual functionality of the multiprocessing.Poll() as well by removing the cores_idx argument.

I found a solution that doesn't involve changing the source code of a python module. It uses the approach suggested here. One can check that only
the physical cores are active after running that script by doing:
lscpu
in the bash returns:
CPU(s): 8
On-line CPU(s) list: 0,2,4,6
Off-line CPU(s) list: 1,3,5,7
Thread(s) per core: 1
[One can run the script linked above from within python]. In any case, after running the script above, typing these commands in python:
import multiprocessing
multiprocessing.cpu_count()
returns 4.

Multiprocessing with Multithreading? How do I make this more efficient?

I have an interesting problem on my hands. I have access to a 128 CPU ec2 instance. I need to run a program that accepts a 10 million row csv, and sends a request to a DB for each row in that csv to augment the existing data in the csv. In order to speed this up, I use:
executor = concurrent.futures.ProcessPoolExecutor(len(chunks))
futures = [executor.submit(<func_name>, chnk) for chnk in chunks]
successes = concurrent.futures.wait(futures)
I chunk up the 10 million row csv into 128 portions and then use futures to spin up 128 processes (+1 for the main one, so total 129). Each process takes a chunk, and retrieves the records for its chunk and spits the output into a file. At the end of the process, I merge all the files together and voila.
I have a few questions about this.
is this the most efficient way to do this?
by creating 128 subprocesses, am I really using the 128 CPUs of the machine?
would multithreading be better/more efficient?
can I multithread on each CPU?
advice on what to read up on?
Thanks in advance!

Is this most efficient?
Hard to tell without profiling. There's always a bottleneck somewhere. For example if you are cpu limited, and the algorithm can't be made more efficient, that's probably a hard limit. If you're storage bandwidth limited, and you're already using efficient read/write caching (typically handled by the OS or by low level drivers), that's probably a hard limit.
Are all cores of the machine actually used?
(Assuming python is running on a single physical machine, and you mean individual cores of one cpu) Yes, python's mp.Process creates a new OS level process with a single thread which is then assigned to execute for a given amount of time on a physical core by the OS's scheduler. Scheduling algorithms are typically quite good, so if you have an equal number of busy threads as logical cores, the OS will keep all the cores busy.
Would threads be better?
Not likely. Python is not thread safe, so it must only allow a single thread per process run at a time. There are specific exceptions to this when a function is written in c or c++, and calls the python macro: Py_BEGIN_ALLOW_THREADS though this is not extremely common. If most of your time is spent in such functions, threads will actually be allowed to run concurrently, and will have less overhead compared to processes. Threads also share memory, making passing results back after completion easier (threads can simply modify some global state rather than passing results via a queue or similar).
multithreading on each CPU?
Again, I think what you probably have is a single CPU with 128 cores.. The OS scheduler decides which threads should run on each core at any given time. Unless the threads are releasing the GIL, only one thread from each process can run at a time. For example running 128 processes each with 8 threads would result in 1024 threads, but still only 128 of them could ever run at a time, so the extra threads would only add overhead.
what to read up on?
When you want to make code fast, you need to be profiling. Profiling for parallel processing is more challenging, and profiling for a remote / virtualized computer can sometimes be challenging as well. It is not always obvious what is making a particular piece of code slow, and the only way to be sure is to test it. Also look into the tools you're using. I'm specifically thinking about the database you're using, because most database software has had a great deal of work put into optimization, but you must use it in the correct way to get the most speed out of it. Batched requests come to mind rather than accessing a single row at a time.

Thread are not happening at the same time?

I have a program that fetches data via an API. I created a function that only takes the target data as an argument and with a for-loop I run this method 10 times.
The programm takes quite some time to display the data because the next function call only happens when the function before has done its work.
I want to use Threads to make it all happen quicker. However, I'm confused. On realpython.org I read this:
A thread is a separate flow of execution. This means that your program will have two things happening at once. But for most Python 3 implementations the different threads do not actually execute at the same time: they merely appear to. It’s tempting to think of threading as having two (or more) different processors running on your program, each one doing an independent task at the same time. That’s almost right. The threads may be running on different processors, but they will only be running one at a time.
First they say: "This means that your program will have two things happening at once" and then they say "but they will only be running one at a time". So my threads are not done simultaneously?
I want to make a decision on whether to use Threads or Multiprocessing but I can't figure it out.
Can somebody help?

With both Threads or Multiprocessing you must assume that execution of your program could jump from one thread/process to another randomly. The difference is that with Threads, code is never really executed at the same time. That means there is always only one CPU core doing your work. With Multiprocessing, your code runs on multiple cores at the same time. So only Multiprocessing will solve your computation N times faster with N processes. (There will be some overhead of course.) If you are not doing any heavy computation, but need to create the illusion of things running in parallel, use threads. This is especially useful for GUIs.
The confusing part is that IO (copying files or loading something from the web for example) is not CPU bound, as it does not require a lot of CPU instructions to happen. So always use threads for this. To understand it a bit more, you should realise that when a thread is waiting for an IO operation to finish, it is actually in a blocked state. This allows other threads to run. So if you use threads to fetch data the first thread will start loading it and then block. This makes room for the the second thread to do the same and so on. When one of the threads has the data ready, it will unblock, run the rest of its code and finish.
(Note that when multiple threads are running they can pause randomly and give room for other threads to run for a while and then carry on. (See first sentence of this answer.))
Generally always use threads unless you need to do something CPU heavy in parallel. Multiprocessing has a lot of limitations when it comes to how it works internally and using it is more complicated and heavy.
This only applies to some implementations of Python tough, for example the most commonly used "official" implementation, CPython. In other languages or less common Python implementations threads are often able to execute instructions on multiple cores at the same time.

python multiprocessing Pool not always using all workers

The problem:
When sending 1000 tasks to apply_async, they run in parallel on all 48 CPUs, but then sometimes fewer and fewer CPUs run, until only one CPU left is running, and only when the last one finishes its task, then all the CPUs continue running again each with a new task. It shouldn't need to wait for any "task batch" like this..
My (simplified) code:
from multiprocessing import Pool
pool = Pool(47)
tasks = [pool.apply_async(json2features, (j,)) for j in jsons]
feats = [t.get() for t in tasks]
jsons = [...] is a list of about 1000 JSONs already loaded to memory and parsed to objects.
json2features(json) does some CPU-heavy work on a json, and returns an array of numbers.
This function may take between 1 second and 15 minutes to run, and because of this I sort the jsons using a heuristic, s.t. hopefully the longest tasks are first in the list, and thus start first.
The json2features function also prints when a task is finished and how long it took. It all runs on an ubuntu server with 48 cores and like I said above, it starts out great, using all 47 cores. Then as the tasks get completed, fewer and fewer cores run, which at first sounds perfectly ok, where it not because after the last core is finished (when I see its print to stdout), all CPUs start running again on new tasks, meaning it wasn't really the end of the list. It may do the same thing again, and then again for the actual end of the list.
Sometimes it can be using just one core for 5 minutes, and when the task is finally done, it starts using all cores again, on new tasks. (So it's not stuck on some IPC overhead)
There are no repeated jsons, nor any dependencies between them (it's all static, fresh-from-disk data, no references etc..), nor any dependency between json2features calls (no global state or anything) except for them using the same terminal for their print.
I was suspicious that the problem was that a worker doesn't get released until get is called on its result, so I tried the following code:
from multiprocessing import Pool
pool = Pool(47)
tasks = [pool.apply_async(print, (i,)) for i in range(1000)]
# feats = [t.get() for t in tasks]
And it does print all 1000 numbers, even though get isn't called.
I have ran out of ideas right now what the problem might be.
Is this really the normal behavior of Pool?
Thanks a lot!

The multiprocessing.Pool relies on a single os.pipe to deliver the tasks to the workers.
Usually on Unix, the default pipe size range from 4 to 64 Kb in size. If the JSONs you are delivering are large in size, you might get the pipe clogged at any given point in time.
This means that, while one of the workers is busy reading the large JSON from the pipe, all the other workers will starve.
It is generally a bad practice to share large data via IPC as it leads to bad performance. This is even underlined in the multiprocessing programming guidelines.
Avoid shared state
As far as possible one should try to avoid shifting large amounts of data between processes.
Instead of reading the JSON files in the main process, just send the workers their file names and let them open and read the content. You will surely notice an improvement in performance because you are moving the JSON loading phase in the concurrent domain as well.
Note that the same is true also for the results. A single os.pipe is used to return the results to the main process as well. If one or more workers clog the results pipe then you will get all the processes waiting for the main one to drain it. Large results should be written to files as well. You can then leverage multithreading on the main process to quickly read back the results from the files.

How can multiple calculations be launched in parallel, while stopping them all when the first one returns? [Python]

How can multiple calculations be launched in parallel, while stopping them all when the first one returns?
The application I have in mind is the following: there are multiple ways of calculating a certain value; each method takes a different amount of time depending on the function parameters; by launching calculations in parallel, the fastest calculation would automatically be "selected" each time, and the other calculations would be stopped.
Now, there are some "details" that make this question more difficult:
The parameters of the function to be calculated include functions (that are calculated from data points; they are not top-level module functions). In fact, the calculation is the convolution of two functions. I'm not sure how such function parameters could be passed to a subprocess (they are not pickeable).
I do not have access to all calculation codes: some calculations are done internally by Scipy (probably via Fortran or C code). I'm not sure whether threads offer something similar to the termination signals that can be sent to processes.
Is this something that Python can do relatively easily?

I would look at the multiprocessing module if you haven't already. It offers a way of offloading tasks to separate processes whilst providing you with a simple, threading like interface.
It provides the same kinds of primatives as you get in the threading module, for example, worker pools and queues for passing messages between your tasks, but it allows you to sidestep the issue of the GIL since your tasks actually run in separate processes.
The actual semantics of what you want are quite specific so I don't think there is a routine that fits the bill out-of-the-box, but you can surely knock one up.
Note: if you want to pass functions around, they cannot be bound functions since these are not pickleable, which is a requirement for sharing data between your tasks.

Because of the global interpreter lock you would be hard pressed to get any speedup this way. In reality even multithreaded programs in Python only run on one core. Thus, you would just be doing N processes at 1/N times the speed. Even if one finished in half the time of the others you would still lose time in the big picture.

Processes can be started and killed trivially.
You can do this.
import subprocess
watch = []
for s in ( "process1.py", "process2.py", "process3.py" ):
sp = subprocess.Popen( s )
watch.append( sp )
Now you're simply waiting for one of those to finish. When one finishes, kill the others.
import time
winner= None
while winner is None:
time.sleep(10)
for w in watch:
if w.poll() is not None:
winner= w
break
for w in watch:
if w.poll() is None: w.kill()
These are processes -- not threads. No GIL considerations. Make the operating system schedule them; that's what it does best.
Further, each process is simply a script that simply solves the problem using one of your alternative algorithms. They're completely independent and stand-alone. Simple to design, build and test.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.