Memory management with Python multiprocessing

The way I understand it, another Python instance is kicked off with multiprocessing. If so,
a. Is a Python instance started for each multiprocessing process?
b. If one process is working on, say, an in-memory database table and another process is working on a different in-memory database table, how does Python manage the memory allocation for the two processes?
c. With regard to (b), is the memory allocation persistent between invocations, i.e. if the first process is used continuously but the second process is used infrequently, is the in-memory table re-created between process calls to it?

(a) Yes
(b) Python pretty much doesn't manage it; the OS does
(c) Yes, if the second process exits then its resources are freed, regardless of the persistence of the first process. You can in principle use shared objects to allow the second process to use something that the first process arranges will persist. How that plays with the specific example of the "something" being a database table is another matter.
Running extra Python processes with multiprocessing is a lot like running extra Python (or for that matter Java) processes with subprocess. The difference is that multiprocessing gives you a suite of ways to communicate between the processes. It doesn't change the basic lifecycle and resource-handling of processes by the OS.
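For illustration, here is a minimal sketch (hypothetical names, standard library only) of the shared-object idea from (c): a Manager process hosts an object that can outlive the worker processes that touch it, while each worker's own memory is freed by the OS when it exits.

import multiprocessing as mp

def worker(shared, key):
    # "shared" is a proxy; reads and writes go to the manager process
    shared[key] = key.upper()

if __name__ == "__main__":
    with mp.Manager() as manager:
        table = manager.dict()                    # lives in the manager process
        p1 = mp.Process(target=worker, args=(table, "a"))
        p2 = mp.Process(target=worker, args=(table, "b"))
        p1.start(); p2.start(); p1.join(); p2.join()
        print(dict(table))                        # {'a': 'A', 'b': 'B'} survives both workers exiting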

Related

How to properly use multiprocessing module with Django?

I have a Python 3.8+ program using Django and PostgreSQL which requires multiple threads or processes. I cannot use threads, since the GIL restricts them to running one at a time within a single process, which results in awful performance (especially since most of the threads are CPU-bound).
So the obvious solution was to use the multiprocessing module. But I've encountered several problems:
When using spawn to generate new processes, I get the "Apps aren't loaded yet" error when the new process imports the Django models. This is because the new process doesn't have the database connection given to the main process by python manage.py runserver. I circumvented it by using fork instead of spawn (as advised here) so the connections are copied to the other processes, but I feel like this is not the best solution, and there should be a clean way to start new processes with the necessary connections.
When several of the processes access the database simultaneously, sometimes incorrect results are returned (some even from the wrong models/relations), which crashes the program. This can happen during the initial startup when fetching data, but also while the program is running. I tried to use ISOLATION LEVEL SERIALIZABLE (as advised here) by adding it to the options in the database settings, but that didn't work.
A possible solution might be using custom locks that are given to every process, but that doesn't feel like a good solution either.
So in general, the question is: Is there a good and clean way to use multiprocessing in Django without these issues? A way for new processes to have database connections without needing to rely on fork, and for all processes to access the database without race conditions that sometimes produce incorrect results like this?
One important thing: I don't use a Pool since the processes aren't running the same simple task. The processes are each running different specific tasks, share data via multiprocessing Signals, Queues, Values and Namespaces (shared memory) and new processes can be triggered by user interaction (websockets).
I've tried to look into Celery, since this has been recommended in a lot of questions about Django and multiprocessing, but I wouldn't know how to use something like that in this project structure, with the specific processes that need to be created at specific points and the data that gets transferred over the Queues, Signals, Values and Namespaces in the existing project.
Thank you for reading; any help is appreciated!
With every new process, a setup function calling django.setup() is run first, before executing the real function. My hope was that this way, every process would create an independent connection to the database so that the current system could work.
Yes - you can do that with initializer, as explained in my other answer from yesteryear.
However, it still throws errors like django.db.utils.OperationalError: lost synchronization with server: got message type "1", length 976434746
That means you're using the fork start method for subprocesses, and any database connections and their state have been forked into the subprocesses too; they will be out of sync when used by multiple processes.
You'll need to close them:
import django
from concurrent.futures import ProcessPoolExecutor

def subprocess_setup():
    django.setup()
    # close any connections inherited from the parent so each worker opens its own
    from django.db import connections
    for conn in connections.all():
        conn.close()

with ProcessPoolExecutor(max_workers=5, initializer=subprocess_setup) as executor:
    ...  # submit tasks to the executor here
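If you would rather not rely on fork at all, a rough sketch of the same pool using the spawn start method could look like the following (the model and task names here are hypothetical, and DJANGO_SETTINGS_MODULE must be configured as usual). Under spawn, nothing is inherited, so django.setup() in the initializer is what loads the apps:

import multiprocessing as mp
import django
from concurrent.futures import ProcessPoolExecutor

def subprocess_setup():
    django.setup()                        # required under spawn: apps aren't loaded yet

def my_task(pk):                          # hypothetical task
    from myapp.models import Thing        # hypothetical model; import after setup()
    return Thing.objects.filter(pk=pk).exists()

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    with ProcessPoolExecutor(max_workers=5, mp_context=ctx,
                             initializer=subprocess_setup) as executor:
        print(executor.submit(my_task, 1).result())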

Creating a thread inside a child process

To my understanding, a thread is a unit under a process. So if I use the multithreading library in Python, it would create the threads under the main process (correct me if I'm wrong, since I'm still learning). But is there a way to create threads under a different process or child process? That is, is it possible to multithread within a process, since a process has its own shared memory? Let's say, for example, I have an application which needs to run in parallel with 3 processes. In each process, I want it to run concurrently and share the same memory space. If this is possible, does this mean I need to have threading code inside my function, so that when I run the function with a different process, it will create its own threads?
P.S.: I know the GIL locks a thread in a process, but what I'm curious about is whether it's even possible for a process to create its own threads.
Also, this isn't specific to Python; I just want to know how this works in general.
Try not to confuse threads and processes. In Python, a process is effectively a separate program with its own copy of the Python interpreter (at least on platforms that use the spawn start method to create new processes, such as Windows). These are created with the multiprocessing library.
A process can have one or more threads. These share the same memory and can share global variables. These are created with the threading library.
It's perfectly acceptable to create a separate process and have that process create several threads (although it may become harder to manage as the program grows in size).
As you mentioned the GIL: it does not affect processes, as each process has its own GIL. Threads within a process are affected by the GIL, but they drop the lock at various points, which allows your threading.Thread code to effectively run "concurrently".
But is there a way to create threads under a different process or child process?
Yes
In each process, I want it to run concurrently and share the same memory space.
If you are using separate processes, they do not share the same memory. You need to use an object like a multiprocessing.Queue to transfer data between the processes or shared memory structures such as multiprocessing.Array.
does this mean I need to have a threading code inside my function so that when I run the function with a different process, it will create its own thread?
Yes
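A minimal sketch (hypothetical names) of exactly that: each child process runs a target function that starts its own threads, and those threads share the memory of that child process, not the parent's.

import multiprocessing as mp
import threading

def thread_work(results, i):
    results[i] = i * i                    # threads within one process share this list

def process_work():
    results = [None] * 3
    threads = [threading.Thread(target=thread_work, args=(results, i)) for i in range(3)]
    for t in threads: t.start()
    for t in threads: t.join()
    print("child process computed", results)

if __name__ == "__main__":
    procs = [mp.Process(target=process_work) for _ in range(3)]
    for p in procs: p.start()
    for p in procs: p.join()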

multiprocessing fork() vs spawn()

I was reading the descriptions of the two from the Python docs:
spawn
The parent process starts a fresh Python interpreter process. The child process will only inherit those resources necessary to run the process object's run() method. In particular, unnecessary file descriptors and handles from the parent process will not be inherited. Starting a process using this method is rather slow compared to using fork or forkserver.
[Available on Unix and Windows. The default on Windows and macOS.]
fork
The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic.
[Available on Unix only. The default on Unix.]
And my question is:
is it that fork is much quicker because it does not try to identify which resources to copy?
is it that, since fork duplicates everything, it would "waste" much more resources compared to spawn()?
There's a tradeoff between 3 multiprocessing start methods:
fork is faster because it does a copy-on-write of the parent process's entire virtual memory including the initialized Python interpreter, loaded modules, and constructed objects in memory.
But fork does not copy the parent process's threads. Thus locks that were held by other threads in the parent process end up stuck in the child with no owning thread to unlock them, ready to cause a deadlock when code tries to acquire any of them. Also, any native library with forked threads will be in a broken state.
The copied Python modules and objects might be useful or they might needlessly bloat every forked child process.
The child process also "inherits" OS resources like open file descriptors and open network ports. Those can also lead to problems but Python works around some of them.
So fork is fast, unsafe, and maybe bloated.
However these safety problems might not cause trouble depending on what the child process does.
spawn starts a Python child process from scratch without the parent process's memory, file descriptors, threads, etc. Technically, spawn forks a duplicate of the current process, then the child immediately calls exec to replace itself with a fresh Python, then asks Python to load the target module and run the target callable.
So spawn is safe, compact, and slower since Python has to load, initialize itself, read files, load and initialize modules, etc.
However it might not be noticeably slower compared to the work that the child process does.
forkserver forks a duplicate of the current Python process that trims down to approximately a fresh Python process. This becomes the "fork server" process. Then each time you start a child process, it asks the fork server to fork a child and run its target callable.
Those child processes all start out compact and without stuck locks.
forkserver is more complicated and not well documented. Bojan Nikolic's blog post explains more about forkserver and its secret set_forkserver_preload() method to preload some modules. Be wary of using an undocumented method, esp. before the bug fix in Python 3.7.0.
So forkserver is fast, compact, and safe, but it's more complicated and not well documented.
[The docs aren't great on all this so I've combined info from multiple sources and made some inferences. Do comment on any mistakes.]
is it that fork is much quicker because it does not try to identify which resources to copy?
Yes, it's much quicker. The kernel can clone the whole process, and only modified memory pages get copied (a whole page at a time). Piping resources to a new process and booting the interpreter from scratch is not necessary.
is it that, since fork duplicates everything, it would "waste" much more resources compared to spawn()?
Fork on modern kernels does only "copy-on-write" and it only affects memory-pages which actually change. The caveat is that "write" already encompasses merely iterating over an object in CPython. That's because the reference-count for the object gets incremented.
If you have long-running processes with lots of small objects in use, this can mean you waste more memory than with spawn. Anecdotally, I recall Facebook claiming to have reduced memory usage considerably by switching from "fork" to "spawn" for their Python processes.
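For reference, a minimal sketch of choosing a start method explicitly rather than relying on the platform default (fork on Linux, spawn on Windows and recent macOS):

import multiprocessing as mp

def child(tag):
    print("hello from a child started with", tag)

if __name__ == "__main__":
    for method in ("spawn", "forkserver", "fork"):    # availability varies by OS
        try:
            ctx = mp.get_context(method)
        except ValueError:
            continue                                  # method not supported on this platform
        p = ctx.Process(target=child, args=(method,))
        p.start()
        p.join()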

Multiprocessing or os.fork, os.exec?

I am using the multiprocessing module to fork child processes. Since on forking the child process gets the address space of the parent process, I am getting the same logger for parent and child. I want to clear the child process's address space of any values carried over from the parent. I learned that multiprocessing does fork() at a lower level, but not exec(). I want to know whether it is good to use multiprocessing in my situation, or should I go for the os.fork() and os.exec() combination, or is there any other solution?
Thanks.
Since multiprocessing is running a function from your program as if it were a thread function, it definitely needs a full copy of your process' state. That means doing fork().
Using a higher-level interface provided by multiprocessing is generally better. At least you should not care about the fork() return code yourself.
os.fork() is a lower level function providing less service out-of-the-box, though you certainly can use it for anything multiprocessing is used for... at the cost of partial reimplementation of multiprocessing code. So, I think, multiprocessing should be ok for you.
However, if your process's memory footprint is too large to duplicate (or if you have other reasons to avoid forking -- open connections to databases, open log files etc.), you may have to make the function you want to run in a new process a separate Python program. Then you can run it using subprocess, pass parameters to its stdin, capture its stdout and parse the output to get results.
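A rough sketch of that subprocess approach, assuming a hypothetical worker.py that reads its parameters from stdin and prints its result to stdout:

import subprocess
import sys

proc = subprocess.run(
    [sys.executable, "worker.py"],        # fresh interpreter; nothing inherited from the parent's memory
    input="param1 param2\n",              # pass parameters via stdin
    capture_output=True,
    text=True,
)
result = proc.stdout.strip()              # parse stdout to get the results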
UPD: the os.exec...() family of functions is hard to use for most purposes, since it replaces your process with the spawned one (if you run the same program that is currently running, it will restart from the very beginning, not keeping any in-memory data). However, if you really do not need to continue the parent process's execution, exec() may be of some use.
From my personal experience: os.fork() is used very often to create daemon processes on Unix; I often use subprocess (with communication through stdin/stdout); I have almost never used multiprocessing; and not a single time in my life have I needed os.exec...().
You can just rebind the logger in the child process to its own. I don't know about other OSes, but on Linux forking doesn't duplicate the entire memory footprint (as Ellioh mentioned); it uses copy-on-write. So until you change something in the child process, it stays in the memory scope of the parent process. For instance, you can fork 100 child processes (that don't write into memory, only read) and check the overall memory usage. It will not be parent_memory_usage * 100, but much less.
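A minimal sketch (the logger name is made up) of rebinding the logger in the child, so it stops using handlers that were configured in the parent and copied across by fork:

import logging
import multiprocessing as mp

def child_main():
    logger = logging.getLogger("worker")
    logger.handlers.clear()               # drop handlers copied over from the parent
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("[child %(process)d] %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info("logging with the child's own handler")

if __name__ == "__main__":
    p = mp.Process(target=child_main)
    p.start()
    p.join()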

Tasks queue process in python

Task is:
I have a task queue stored in a DB. It grows. I need to solve tasks with a Python script when I have the resources for it. I see two ways:
A Python script working all the time. But I don't like it (reason: possible memory leak).
A Python script called by cron that does a small part of the task. But I need to solve the problem of keeping only one active working script in memory (to prevent the count of active scripts from growing). What is the best solution to implement this in Python?
Any ideas to solve this problem at all?
You can use a lockfile to prevent multiple instances of the cron-started script from running at once. See the answers to an earlier question, "Python: module for creating PID-based lockfile". This is really just good practice in general for anything you need to make sure won't have multiple instances running, so you should look into it even if you do have the script running constantly, which I do suggest.
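A minimal sketch of the lockfile idea using only the standard library (Unix-only, since it relies on fcntl; the path is made up): take a non-blocking exclusive lock on a well-known file and exit if another instance already holds it.

import fcntl
import os
import sys

lock_file = open("/tmp/taskrunner.lock", "w")     # hypothetical lockfile path
try:
    fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
    sys.exit("another instance is already running")
lock_file.write(str(os.getpid()))
lock_file.flush()
# ... do this invocation's chunk of work here ...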
For most things, it shouldn't be too hard to avoid memory leaks, but if you're having a lot of trouble with it (I sometimes do with complex third-party web frameworks, for example), I would suggest instead writing the script with a small, carefully-designed main loop that monitors the database for new jobs, and then uses the multiprocessing module to fork off new processes to complete each task.
When a task is complete, the child process can exit, immediately freeing any memory that isn't properly garbage collected, and the main loop should be simple enough that you can avoid any memory leaks.
This also offers the advantage that you can run multiple tasks in parallel if your system has more than one CPU core, or if your tasks spend a lot of time waiting for I/O.
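A minimal sketch of that pattern (fetch_next_task and run_task are hypothetical placeholders): a small main loop that hands each job to a short-lived child process, so the task's memory is returned to the OS when the child exits.

import multiprocessing as mp
import time

def fetch_next_task():
    return None                           # hypothetical: replace with a query against the task table

def run_task(task):
    pass                                  # hypothetical: do the real work here

def main_loop():
    while True:
        task = fetch_next_task()
        if task is None:
            time.sleep(5)                 # nothing queued; poll again later
            continue
        p = mp.Process(target=run_task, args=(task,))
        p.start()
        p.join()                          # or keep a small pool of concurrent workers

if __name__ == "__main__":
    main_loop()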
This is a bit of a vague question. One thing you should remember is that it is very difficult to leak memory in Python, because of the automatic garbage collection. croning a Python script to handle the queue isn't very nice, although it would work fine.
I would use method 1; if you need more power you could make a small Python process that monitors the DB queue and starts new processes to handle the tasks.
I'd suggest using Celery, an asynchronous task queuing system which I use myself.
It may seem a bit heavy for your use case, but it makes it easy to expand later by adding more worker resources if/when needed.
