To my understanding, a thread is a unit under a process. So if I use the multi-threading library in python, it would create the threads under the main process (correct me if im wrong since im still learning). But is there a way to create threads under a different process or child process? So is it possible to multithread in a process since a process has its own shared memory. Lets say an example, i have an application which needs to run in parallel with 3 process. In each process, i want it to run concurrently and share the same memory space. If let's say this is possible, does this mean i need to have a threading code inside my function so that when i run the function with a different process, it will create its own thread?
P.s: I know the gil locks a thread in a process but what im curious is it even possible for a process to create its own thread.
Also its not specifically for python. I just want to know in general about this
Try not to confuse threads and processes. In python, a process is effectively a separate program with its own copy of the python interpreter (at least on platforms that use method spawn to create new processes, such as Window). These are created with the multiprocessing library.
A process can have one or more threads. These share the same memory and can share global variables. These are created with the threading library.
Its perfectly acceptable to create a separate process, and have that process create several threads (although it may be harder to manage as the program grows in size).
As you mentioned the GIL, it does not affect process as they each have their own GIL. Threads within a process are affected by the GIL but they do drop the lock at various points which allows your threading.Thread code to effectively run "concurrently".
But is there a way to create threads under a different process or child process?
Yes
In each process, I want it to run concurrently and share the same memory space.
If you are using separate processes, they do not share the same memory. You need to use an object like a multiprocessing.Queue to transfer data between the processes or shared memory structures such as multiprocessing.Array.
does this mean I need to have a threading code inside my function so that when I run the function with a different process, it will create its own thread?
Yes
Related
In python, I have to fetch crypto data from binance every minute and do some calculations. For fetching data I have two functions func_a() and func_b(). They both do the same thing but in wildly different manner. Sometimes func_a is faster and sometimes func_b is faster. I want to run both the functions in parallel, if any of the function returns result to me faster, I want to kill the other one and move on (because they both are going to bring the same result).
How can I achieve this in python? Please mind that I do not want to replace these functions or their mechanics.
Python threads aren't very suitable for this purpose for two reasons:
The Python GIL means that if you spawn two CPU-bound threads, each of the two threads will run at half its normal speed (because only one thread is actually running at any given instant; the other is waiting to acquire the interpreter lock)
There is no reliable way to unilaterally kill a thread, because if you do that, any resources it had allocated will be leaked, causing major problems.
If you really want to be able to cancel a function-in-progress, then, you have two options:
Modify the function to periodically check a "please_quit" boolean variable (or whatever) and return immediately if that boolean's state has changed to True. Then your main thread can set the please_quit variable and then call join() on the thread, and rest assured that the thread will quit ASAP. (This does require that you have the ability to modify the function's implementation)
Spawn child processes instead of child threads. A child process takes more resources to launch, but it can run truly in parallel (since it has its own separate Python interpreter) and it is safe (usually) to unilaterally kill it, because the OS will automatically clean up all of the process's held resources when the process is killed.
(I don't know if this should be asked here at SO, or one of the other stackexchange)
When doing heav I/O bound tasks e.g API-calls or database fetching, I wonder, if Python only uses one process for multithreading, i.e can we create even more threads by combining multiprocessing and multithreading, like the pseudo-code below
for process in Processes:
for thread in threads:
fetch_api_resuls(thread)
or does Python do this automatically?
I do not think there would be any point doing this: spinning up a new process has a relatively high cost, and spinning up a new thread has a pretty high cost. Serialising tasks to those threads or processes costs again, and synchronising state costs...again.
What I would do if I had two sets of problems:
I/O bound problems (e.g. fetching data over a network)
CPU bound problems related to those I/O bound problems
is to combine multiprocessing with asyncio. This has a much lower overhead---we only have one thread, and we pay for the scheduler only (but no serialisation), doesn't involve spinning up a gazillion processes (each of which uses around as much virtual memory as the parent process) or threads (each of which still uses a fair chunk of memory).
However, I would not use asyncio within the multiprocessing threads---I'd use asyncio in the main thread, and offload cpu-intensive tasks to a pool of worker threads when needed.
I suspect you probably can use threading inside multiprocessing, but it is very unlikely to bring you any speed boost.
I was learning about multi-processing and multi-threading.
From what I understand, threads run on the same core, so I was wondering if I create multiple processes inside a child thread will they be limited to that single core too?
I'm using python, so this is a question about that specific language but I would like to know if it is the same thing with other languages?
I'm not a pyhton expert but I expect this is like in other languages, because it's an OS feature in general.
Process
A process is executed by the OS and owns one thread which will be executed. This is in general your programm. You can start more threads inside your process to do some heavy calculations or whatever you have to do.
But they belong to the process.
Thread
One or more threads are owned by a process and execution will be distributed across all cores.
Now to your question
When you create a given number of threads these threads should in general be distributed across all your cores. They're not limited to the core who's executing the phyton interpreter.
Even when you create a subprocess from your phyton code the process can and should run on other cores.
You can read more about the gernal concept here:
Preemptive multitasking
There're some libraries in different languages who abstract a thread to something often called a Task or something else.
For these special cases it's possible that they're just running inside the thread they were created in.
For example. In the DotNet world there's a Thread and a Task. Often people are misusing the term thread when they're talking about a Task, which in general runns inside the thread it was created.
Every program is represented through one process. A process is the execution context one or multiple threads operate in. All threads in one process share the same tranche of virtual memory assigned to the process.
Python (refering to CPython, e.g. Jython and IronPython have no GIL) is special because it has the global interpreter lock (GIL), which prevents threaded python code from being run on multiple cores in parallel. Only code releasing the GIL can operate truely parallel (I/O operations and some C-extensions like numpy). That's why you will have to use the multiprocessing module for cpu-bound python-code you need to run in parallel. Processes startet with the multiprocessing module then will run it's own python interpreter instance so you can process code truely parallel.
Note that even a single threaded python-application can run on different cores, not in parallel but sequentially, in case the OS re-schedules execution to another core after a context switch took place.
Back to your question:
if I create multiple processes inside a child thread will they be limited to that single core too?
You don't create processes inside a thread, you spawn new independent python-processes with the same limitations of the original python process and on which cores threads of the new processes will execute is up to the OS (...as long you don't manipulate the core-affinity of a process, but let's not go there).
I have created and application. In this application I use multiprocessing library. In that application I do spin two processes (instances of the same class) to consume data from Kafka and put into Python Queue.
This is the library I used:
Python multiprocessing
Q1. Is it concurrency or is it parallelism?
Q2. Is it multithreading or is it multiprocessing?
Q3. How does Python maps Processes to CPUs? (does this question make sense?)
I understand in order to speak about multithreading I need to use separate / multiple CPUs (so separate threads are mapped to separate CPU threads).
I understand in order to speak about multiprocessing I need to use separate memory space for both processes? Is it correct?
I assume if I spin two processes within one Application instance => we talk about concurrency.
If I spin multiple instances of above application then we would talk about parallelism? (multiple CPUs, separate memory spaces used)?
I see that Python library defines it as follows: Python multiprocessing library
The multiprocessing package offers both local and remote concurrency
...
Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine.
...
A prime example of this is the Pool object which offers a convenient means of parallelizing the execution of a function across multiple input values, distributing the input data across processes (data parallelism).
First, separate threads are not mapped to separate CPU-s. That's optional, and in python due to the GIL, all threads in a process will run on the same CPU
1) It's both concurrency, in that the order of execution is not set, and parallelism, since the multiprocessing package can run on multiple processors, bypassing the GIL limitations.
2) Since the threading package is another story, then it's definitely multiprocessing
3) I may be speaking out of line, but python , IMO does NOT map processes to CPU-s, it leaves this detail to the OS
Q1: It is at least concurrency, can be parallelism as well (terms intended as defined in the answer to this question). Clearly, if you have one processor only, true parallellism cannot be achieved, becuse only one process can use the CPU at a single time. In that case, however, the muliprocessing library still allows you to define multiple tasks, that run in separate processes. It will be the OS's scheduler to decide which process runs when.
Q2: Multiprocessing (...which is kind of implied by the library name). Due to the Global Interpreter Lock present in most Python interpreter implementations, parallelism with threads is impossible. Multiprocessing offers a threading-like interface that makes use of processes under the hood.
Q3: It doesn't. Python spawns processes, the OS scheduler decided who runs where and when. There are some ways to execute processes on specific CPUs, but this is not the default behaviour of multiprocessing (and I'm not aware of any way to force the library to pin processes to CPUs).
The way I understand it, another Python instance is kicked off with multiprocessing. If so,
a. Is a Python instance started for each multiprocessing process?
b. If one process is working on say, an in-memory database table and another process is working on another in-memory database table, how does Python manage the memory allocation for the 2 processes?
c. Wrt b), is the memory allocation persistant between invocations ie. if the first process is used continuously but the 2nd process is used infrequently, is the in-memory table re-created between process calls to it?
(a) yes
(b) Python pretty much doesn't manage it, the OS does
(c) Yes, if the second process exits then its resources are freed, regardless of the persistence of the first process. You can in principle use shared objects to allow the second process to use something that the first process arranges will persist. How that plays with the specific example of the "something" being a database table is another matter.
Running extra Python processes with multiprocessing is a lot like running extra Python (or for that matter Java) processes with subprocess. The difference is that multiprocessing gives you a suite of ways to communicate between the processes. It doesn't change the basic lifecycle and resource-handling of processes by the OS.