How to properly use multiprocessing module with Django? - python

I have a Python 3.8+ program using Django and PostgreSQL which requires multiple threads or processes. I cannot use threads, since the GIL only lets one thread execute Python bytecode at a time, which results in awful performance (especially since most of the threads are CPU bound).
So the obvious solution was to use the multiprocessing module. But I've encountered several problems:
When using spawn to generate new processes, I get the "Apps aren't loaded yet" error when the new process imports the Django models, because the spawned process starts with a fresh interpreter and doesn't get the Django setup and database connections that python manage.py runserver gives the main process. I circumvented it by using fork instead of spawn (as advised here) so the connections are copied to the other processes, but I feel this is not the best solution and there should be a clean way to start new processes with the necessary connections.
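(For reference, a minimal sketch of how a spawned worker can bootstrap Django itself instead of relying on fork; the settings module, app and model names below are placeholders, not taken from the actual project:)

import multiprocessing as mp
import os

def worker(pk):
    # A spawned process starts with a fresh interpreter, so Django has to be
    # configured here before any models are imported.
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")
    import django
    django.setup()

    from myapp.models import Item  # hypothetical model
    print(Item.objects.filter(pk=pk).exists())

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=worker, args=(1,))
    p.start()
    p.join()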
When several of the processes access the database simultaneously, I sometimes get wrong results back (occasionally even from the wrong models/relations), which crashes the program. This can happen during the initial startup when fetching data, but also while the program is running. I tried to use ISOLATION LEVEL SERIALIZABLE (as advised here) by adding it to the OPTIONS in the database settings, but that didn't work.
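(For reference, the isolation-level option mentioned above is usually set like this; a sketch assuming the psycopg2 backend, with a placeholder database name:)

import psycopg2.extensions

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "mydb",
        "OPTIONS": {
            # Ask PostgreSQL for SERIALIZABLE transactions on new connections.
            "isolation_level": psycopg2.extensions.ISOLATION_LEVEL_SERIALIZABLE,
        },
    }
}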
A possible solution might be using custom locks that are given to every process but that doesn't feel like a good solution as well.
So, in general, the question is: is there a good and clean way to use multiprocessing in Django without these issues? A way for new processes to get the necessary database connections without relying on fork, and for all processes to access the database without race conditions that sometimes produce wrong results like this?
One important thing: I don't use a Pool since the processes aren't running the same simple task. The processes are each running different specific tasks, share data via multiprocessing Signals, Queues, Values and Namespaces (shared memory) and new processes can be triggered by user interaction (websockets).
I've tried to look into Celery since this has been recommended on a lot of questions about Django and multiprocessing but I wouldn't know how to use something like that in the project structure with the specific different processes that need to be created at specific points and the data that gets transferred over the Queues, Signals, Values and Namespaces in the existing project.
Thank you for reading; any help is appreciated!

With every new process, a setup function calling django.setup() is called before executing the real function. My hope was that, this way, every process would create an independent connection to the database so that the current system could work.
Yes, you can do that with initializer, as explained in my other answer from yesteryear.
However, it still throws errors like django.db.utils.OperationalError: lost synchronization with server: got message type "1", length 976434746
That means you're using the fork start method for subprocesses, so any database connections and their state have been forked into the subprocesses too, and they end up out of sync when used by multiple processes.
You'll need to close them:
import django
from concurrent.futures import ProcessPoolExecutor

def subprocess_setup():
    django.setup()
    from django.db import connections
    for conn in connections.all():
        conn.close()

with ProcessPoolExecutor(max_workers=5, initializer=subprocess_setup) as executor:
    ...
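A task submitted to this pool can then use the ORM as usual; the first query in each worker opens a fresh connection because the stale forked ones were closed by subprocess_setup. A small hedged usage sketch, with query_count as a hypothetical task:

def query_count(model_label):
    # Runs inside a worker process; the first ORM call opens a new connection.
    from django.apps import apps
    return apps.get_model(model_label).objects.count()

with ProcessPoolExecutor(max_workers=5, initializer=subprocess_setup) as executor:
    future = executor.submit(query_count, "auth.User")
    print(future.result())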

Related

Sharing DB client among multiple processes in Python?

My Python application uses concurrent.futures.ProcessPoolExecutor with 5 workers, and each process makes multiple database queries.
Between giving each process its own DB client and making all processes share a single client, which choice is considered safer and more conventional?
Short answer: Give each process (that needs it) its own db client.
Long answer: What problem are you trying to solve?
Sharing a DB client between processes basically doesn't happen; you'd have to have the one process which does have the DB client proxy the queries from the others, using more-or-less your own protocol. That can have benefits, if that protocol is specific to your application, but it will add complexity: you'll now have two different kinds of workers in your program, rather than just one kind, plus the protocol between them. You'd want to make sure that the benefits outweigh the additional complexity.
Sharing a DB client between threads is usually possible; you'd have to check the documentation to see which objects and operations are "thread-safe". However, since your application is otherwise CPU-heavy, threading is not suitable, due to Python limitations (the GIL).
At the same time, there's little cost to having a DB client in each process; you will in any case need some sort of client, it might as well be the direct one.
There isn't going to be much more IO, since that's mostly based on the total number of queries and amount of data, regardless of whether that comes from one process or gets spread among several. The only additional IO will be in the login, and that's not much.
If you're running out of connections at the database, you can either tune/upgrade your database for more connections, or use a separate off-the-shelf "connection pooler" to share them; that's likely to be much better than trying to implement a connection pooler from scratch.
More generally, and this applies well beyond this particular question, it's often better to combine several off-the-shelf pieces in a straightforward way, than it is to try to put together a custom complex piece that does the whole thing all at once.
So, what problem are you trying to solve?
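For illustration, a minimal sketch of the "own client per process" approach, assuming psycopg2; the DSN, table and query are placeholders:

from concurrent.futures import ProcessPoolExecutor
import psycopg2

_conn = None  # one connection per worker process

def init_worker(dsn):
    # Runs once in each worker process and opens that process's own client.
    global _conn
    _conn = psycopg2.connect(dsn)

def run_query(user_id):
    cur = _conn.cursor()
    try:
        cur.execute("SELECT count(*) FROM orders WHERE user_id = %s", (user_id,))
        return cur.fetchone()[0]
    finally:
        cur.close()

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=5, initializer=init_worker,
                             initargs=("dbname=mydb user=me",)) as ex:
        print(list(ex.map(run_query, range(10))))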
It is better to use a multithreading or asynchronous approach instead of multiprocessing because it will consume fewer resources. That way you could use a single DB connection, but I would recommend creating a separate session for each worker or coroutine to avoid exceptions or locking problems.

When using a database that is supposedly thread-safe, do I need to synchronize my own threads?

I'm writing a Python application that uses a RethinkDB database. I have three worker threads that need to run and possibly access the database at the same time. I know how to synchronize threads in Python, but my question is: do I need to? If RethinkDB claims to be thread-safe, which is implied on this page giving advice on how to speed things up, can I pass the concurrency issues off to the database?
RethinkDB definitely works when accessed concurrently from multiple threads or clients. The Python driver should work fine on multiple threads as long as you open a separate connection for each thread.
You still need logic to handle concurrent writes to the same key.
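A hedged sketch of the one-connection-per-thread pattern, assuming the classic rethinkdb driver interface (r.connect / .run(conn)); the table name and documents are placeholders:

import threading
import rethinkdb as r  # classic driver interface (pre-2.4 style)

def worker(doc):
    # One connection per thread, as the driver documentation recommends.
    conn = r.connect("localhost", 28015, db="test")
    try:
        r.table("events").insert(doc).run(conn)
    finally:
        conn.close()

threads = [threading.Thread(target=worker, args=({"n": i},)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()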

Architecture of multi-threaded program using database

I've got a fairly simple Python program as outlined below:
It has 2 threads plus the main thread. One of the threads collects some data and puts it on a Queue.
The second thread takes stuff off the queue and logs it. Right now it's just printing out the stuff from the queue, but I'm working on adding it to a local MySQL database.
This is a process that needs to run for a long time (at least a few months).
How should I deal with the database connection? Create it in main, then pass it to the logging thread, or create it directly in the logging thread? And how do I handle unexpected situations with the DB connection (interrupted, MySQL server crashes, etc) in a robust manner?
How should I deal with the database connection? Create it in main, then pass it to the logging thread, or create it directly in the logging thread?
I would configure your logging component with a class (or factory) that creates the connection, and let the logging component request a connection from it. This is called dependency injection, and it makes life easier in terms of testing, e.g. you can mock it out later.
If the logging component created the connections itself, then testing the logging component in a standalone fashion would be difficult. By injecting a component that handles these, you can make a mock that returns dummies upon request, or one that provides connection pooling (and so on).
How you handle database issues robustly depends upon what you want to happen. First, make your database interactions transactional (and consequently atomic). Then: do you want your logger component to bring your system to a halt while it retries a write? Do you want it to buffer writes up and retry out-of-band (i.e. on another thread)? Is it mission-critical to write this data, or can you afford to lose it (e.g. abandon a bad write)? I've not provided any specific answers here, since there are so many options depending upon your requirements; the above details a few possible options.
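A hedged sketch of both ideas together: the logging worker is handed a connect() callable (dependency injection) and retries transactional writes when the connection drops. The connect factory, table and SQL are placeholders for whatever the real project uses:

import threading
import time

class LogWriter(threading.Thread):
    def __init__(self, q, connect, retry_delay=5.0):
        super().__init__(daemon=True)
        self.q = q
        self.connect = connect          # injected factory; easy to mock in tests
        self.retry_delay = retry_delay

    def run(self):
        conn = self.connect()
        while True:
            record = self.q.get()
            while True:
                try:
                    cur = conn.cursor()
                    cur.execute("INSERT INTO log (payload) VALUES (%s)", (record,))
                    conn.commit()       # each write is its own transaction
                    cur.close()
                    break
                except Exception:
                    # Connection interrupted or server crashed: back off, reconnect, retry.
                    time.sleep(self.retry_delay)
                    conn = self.connect()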

How to fork a process in python/django?

This is more of a Python general question however in a context of django.
For now I have this view in Django which has to process a lot of data. It usually takes the server (nginx proxying to the Django backend) a couple of minutes to do it, and sometimes the server times out. I don't want to increase the time-out in nginx. I realize that if I can fork a process in Python in the Django view, so that the forked (child) process does all the data crunching independently of the view, then the view can return the response to the user immediately (therefore never timing out) and the child process can continue running in the background, finishing up all the calculations.
So here is the question:
How can I fork an independent process in python (and if possible for the python code to be in the same file)? And if possible how can I assign a unix process priority level to it?
I looked at some of the ways of forking a process in python and it seems there are a few options. Which one is the best appropriate for this scenario?
Thank you.
The 'best practice' answer is to use a queue manager, typically RabbitMQ or any backend handled by Django-celery.
Still, there are a few lighter options that do spawn a new thread; what these options usually lack is some way to track progress or to keep the number of threads under control.
Check Django-utils to see if it's enough; if not, go for Celery.
If you really want to fork and set priority, you can use os.fork and os.nice, but I think the multiprocessing module or Celery would be more applicable here.
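A rough sketch of the os.fork / os.nice route mentioned above; crunch_numbers is a placeholder for the real work, and note that anything inherited from the web worker (database connections, sockets) should not be reused in the child:

import os

def run_in_background(crunch_numbers, *args):
    pid = os.fork()
    if pid > 0:
        return pid              # parent (the view) returns immediately
    # Child process: lower its scheduling priority, do the work, then exit
    # without running the parent's cleanup handlers.
    os.nice(10)
    try:
        crunch_numbers(*args)
    finally:
        os._exit(0)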

How to create a thread-safe singleton in python

I would like to hold running threads in my Django application. Since I cannot do so in the model or in the session, I thought of holding them in a singleton. I've been checking this out for a while and haven't really found a good how-to for this.
Does anyone know how to create a thread-safe singleton in python?
EDIT:
More specifically, what I want to do is implement some kind of "anytime algorithm": when a user presses a button, a response is returned and a new computation begins (a new thread). I want this thread to run until the user presses the button again, and then my app will return the best solution it managed to find. To do that, I need to save the thread object somewhere - I thought of storing it in the session, which apparently I cannot do.
The bottom line is: I have a fat computation I want to do on the server side, in different threads, while the user is using my site.
Unless you have a very good reason, you should execute the long-running work in a different process altogether, and use Celery to execute it:
Celery is an open source asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well.
The execution units, called tasks, are executed concurrently on one or more worker nodes using multiprocessing, Eventlet or gevent. Tasks can execute asynchronously (in the background) or synchronously (wait until ready).
Celery guide for djangonauts: http://django-celery.readthedocs.org/en/latest/getting-started/first-steps-with-django.html
For singletons and sharing data between tasks/threads, again, unless you have a good reason, you should use the db layer (aka, models) with some caution regarding db locks and refreshing stale data.
Update: regarding your use case, define a Computation model, with a status field. When a user starts a computation, an instance is created, and a task will start to run. The task will monitor the status field (check db once in a while). When a user clicks the button again, a view will change the status to user requested to stop, causing the task to terminate.
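A hedged sketch of that Computation-model idea; the field names, task name and polling interval are assumptions rather than anything from the original answer:

import time

from celery import shared_task
from django.db import models

class Computation(models.Model):
    STATUS_CHOICES = [
        ("running", "running"),
        ("stop_requested", "stop_requested"),
        ("done", "done"),
    ]
    status = models.CharField(max_length=20, choices=STATUS_CHOICES, default="running")
    best_solution = models.TextField(blank=True)

@shared_task
def anytime_search(computation_id):
    comp = Computation.objects.get(pk=computation_id)
    while True:
        comp.refresh_from_db(fields=["status"])
        if comp.status == "stop_requested":
            break
        # ... improve comp.best_solution incrementally here ...
        time.sleep(1)
    comp.status = "done"
    comp.save(update_fields=["status", "best_solution"])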
If you want asynchronous code in a web application then you're taking the wrong approach. You should run background tasks with a specialist task queue like Celery: http://celeryproject.org/
The biggest problem you have is web server architecture. Unless you go against the recommended Django web server configuration and use a worker thread MPM, you will have no way to track your thread handles between requests as each request typically occupies its own process. This is how Apache normally works: http://httpd.apache.org/docs/2.0/mod/prefork.html
EDIT:
In light of your edit I think you might learn more by creating a custom solution that does this:
Maintain start/stop state in the database
Create a new program that runs as a daemon
Periodically check the start/stop state and begin or end work accordingly
There's no need for multithreading here unless you need to create a new process for each user. If so, things get more complicated and using Celery will make your life much easier.
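For completeness, a minimal sketch of that daemon approach; the settings module and Job model are placeholders, and the script would normally be kept alive by systemd, supervisor, or similar:

import os
import time

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")
import django
django.setup()

from myapp.models import Job  # hypothetical model holding the start/stop state

def main(poll_seconds=5):
    while True:
        for job in Job.objects.filter(state="start_requested"):
            job.state = "running"
            job.save(update_fields=["state"])
            # ... do the heavy work here, then record the result ...
            job.state = "done"
            job.save(update_fields=["state"])
        time.sleep(poll_seconds)

if __name__ == "__main__":
    main()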
