I'm programming a bit of server code, and the MQTT side of it runs in its own thread using the threading module. That works great with no issues, but now I'm wondering how to proceed.
I have two MariaDB databases, one local and one remote (there is a good and niche reason for this), and I'm writing a class which handles the databases. This class will start new threads of classes that submit the data to their respective databases. If certain conditions are true, it starts a new thread to push the data to one database; if they are false, the data goes to the other database. The MQTT thread has an instance of the "Database handler" class and passes data to it by calling different functions within the class.
Will this work to allow one thread to concentrate on MQTT tasks while another does the database work? There are other threads as well. I've just never combined databases and threads before, so I'd like an opinion or any information that would help me out from more seasoned programmers.
Writing code that is "thread safe" can be tricky. I doubt if the Python connector to MySQL is thread safe; there is very little need for it.
MySQL is quite happy to have multiple connections to it from clients. But they must be separate connections, not the same connection running in separate threads.
Very few projects need multi-threaded access to the database. Do you have a particular need? If so let's hear about it, and discuss the 'right' way to do it.
For now, each of your threads that needs to talk to the database should create its own connection. Generally, such a connection can be created soon after starting the thread (or process) and kept open until close to the end of the thread. That is, normally you should have only one connection per thread.
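To make the "one connection per thread" advice concrete, here is a minimal sketch. It is only an illustration under stated assumptions: sqlite3 from the standard library stands in for the MariaDB connector so the example runs anywhere, and the names db_worker, route, and the readings table are made up for this sketch, not taken from the question. The MQTT thread would only put messages on a queue, and each database thread opens its own connection and keeps it until it exits.

```python
import os
import queue
import sqlite3
import tempfile
import threading

STOP = object()  # sentinel that tells a worker thread to exit


def db_worker(db_path, jobs):
    """Own exactly one connection for the lifetime of this thread."""
    conn = sqlite3.connect(db_path)  # created inside the thread, never shared
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS readings (topic TEXT, payload TEXT)")
        while True:
            item = jobs.get()
            if item is STOP:
                break
            conn.execute("INSERT INTO readings VALUES (?, ?)", item)
            conn.commit()
    finally:
        conn.close()


def route(message, use_local, local_jobs, remote_jobs):
    """Called from the MQTT thread: it only enqueues, it never touches a connection."""
    (local_jobs if use_local else remote_jobs).put(message)


def count_rows(db_path):
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
    finally:
        conn.close()


def demo_per_thread_connections():
    with tempfile.TemporaryDirectory() as tmp:
        local_path = os.path.join(tmp, "local.db")    # stand-in for the local database
        remote_path = os.path.join(tmp, "remote.db")  # stand-in for the remote database
        local_jobs, remote_jobs = queue.Queue(), queue.Queue()
        workers = [
            threading.Thread(target=db_worker, args=(local_path, local_jobs)),
            threading.Thread(target=db_worker, args=(remote_path, remote_jobs)),
        ]
        for worker in workers:
            worker.start()
        route(("sensors/temp", "21.5"), True, local_jobs, remote_jobs)
        route(("sensors/temp", "22.0"), False, local_jobs, remote_jobs)
        route(("sensors/hum", "40"), True, local_jobs, remote_jobs)
        for jobs in (local_jobs, remote_jobs):
            jobs.put(STOP)
        for worker in workers:
            worker.join()
        return count_rows(local_path), count_rows(remote_path)
```

With a real MariaDB connector the shape should stay the same: replace the sqlite3.connect call with that connector's connect call, still made inside the worker thread.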
Related
I have a Python 3.8+ program using Django and PostgreSQL which requires multiple threads or processes. I cannot use threads since the GIL effectively restricts them to a single core, which results in awful performance (especially since most of the threads are CPU bound).
So the obvious solution was to use the multiprocessing module. But I've encountered several problems:
When using spawn to generate new processes, I get the "Apps aren't loaded yet" error when the new process imports the Django models. This is because the new process doesn't have the database connection given to the main process by python manage.py runserver. I circumvented it by using fork instead of spawn (like advised here) so the connections are copied to the other processes but I feel like this is not the best solution and there should be a clean way to start new processes with the necessary connections.
When several of the processes access the database simultaneously, wrong results are sometimes returned (some even from the wrong models / relations), which crashes the program. This can happen during the initial startup when fetching data, but also while the program is running. I tried to use ISOLATION LEVEL SERIALIZABLE (like advised here) by adding it to the options in the database settings, but that didn't work.
A possible solution might be custom locks that are given to every process, but that doesn't feel like a good solution either.
So in general, the question is: Is there a good and clean way to use multiprocessing in Django without these issues? A way that new processes have the database connections without needing to rely on fork and that all processes can just access the database without having any race conditions sometimes producing false results like this?
One important thing: I don't use a Pool since the processes aren't running the same simple task. The processes are each running different specific tasks, share data via multiprocessing Signals, Queues, Values and Namespaces (shared memory) and new processes can be triggered by user interaction (websockets).
I've tried to look into Celery since this has been recommended on a lot of questions about Django and multiprocessing but I wouldn't know how to use something like that in the project structure with the specific different processes that need to be created at specific points and the data that gets transferred over the Queues, Signals, Values and Namespaces in the existing project.
Thank you for reading; any help is appreciated!
With every new process, a setup function calling django.setup() is called first, before executing the real function. My hope was that this way, every process would create an independent connection to the database so that the current system could work.
Yes - you can do that with initializer,
as explained in my other answer from yesteryear.
However, it still throws errors like django.db.utils.OperationalError: lost synchronization with server: got message type "1", length 976434746
That means you're using the fork start method for subprocesses, and any database connections and their state have been forked into the subprocesses too, so they will be out of sync when used by multiple processes.
You'll need to close them:
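import django
from concurrent.futures import ProcessPoolExecutor

def subprocess_setup():
    django.setup()
    from django.db import connections
    # Close connections inherited from the parent; each worker reopens its own on demand.
    for conn in connections.all():
        conn.close()

with ProcessPoolExecutor(max_workers=5, initializer=subprocess_setup) as executor:
    ...  # submit work to the executor here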
I'm writing a Python application that uses a Rethink database. I have three worker threads that need to run and possibly access the database at the same time. I know how to synchronize threads in Python, but my question is: do I need to? If Rethink claims to be thread-safe, which is implied on this page giving advice on how to speed things up, can I pass the concurrency issues off to the database?
RethinkDB definitely works when accessed concurrently from multiple threads or clients. The Python driver should work fine on multiple threads as long as you open a separate connection for each thread.
You still need logic to handle concurrent writes to the same key.
I'm not sure if I'm understanding the use case for DB connection pools (eg: psycopg2.pool and mysql.connector.pooling) in python. It seems to me that parallelism is usually achieved in python using a multi-process rather than a multi-thread approach because of the GIL, and that in the multi-process case these pools are not very useful since each process will initialize its own pool and will only have a single thread running at a time. Is this correct? Is there any strategy for sharing a DB connection pool when using multiple processes, and if not is the usefulness of pooling limited to multi-threaded python applications or are there other scenarios where you would use them?
Keith,
You're on the right track. As mentioned in the S.O. post "Accessing a MySQL connection pool from Python multiprocessing":
Making a separate pool for each process is redundant and opens up way
too many connections.
Check out the other S.O. post, "What is the best solution for database connection pooling in python?", which contains a sample pooling solution in Python. That post also discusses the limitations of DB pooling if your application were to become multi-threaded:
Making your own connection pool is a BAD idea if your app ever decides to start using
multi-threading. Making a connection pool for a multi-threaded application is much
more complicated than one for a single-threaded application. You can use something
like PySQLPool in that case.
In terms of implementing DB pooling in Python, as mentioned in "Application vs Database Resident Connection Pool," if your database supports it, the best implementation would involve:
Let connection pool be maintained and managed by database itself
(example: Oracle's DRCP) and calling modules just ask connections from the connection
broker described by Oracle DRCP.
Please let me know if you have any questions!
I've got a fairly simple Python program as outlined below:
It has 2 threads plus the main thread. One of the threads collects some data and puts it on a Queue.
The second thread takes stuff off the queue and logs it. Right now it's just printing out the stuff from the queue, but I'm working on adding it to a local MySQL database.
This is a process that needs to run for a long time (at least a few months).
How should I deal with the database connection? Create it in main, then pass it to the logging thread, or create it directly in the logging thread? And how do I handle unexpected situations with the DB connection (interrupted, MySQL server crashes, etc) in a robust manner?
How should I deal with the database connection? Create it in main,
then pass it to the logging thread, or create it directly in the
logging thread?
I would perhaps configure your logging component with the class that creates the connection and let your logging component request it. This is called dependency injection, and it makes life easier in terms of testing, e.g. you can mock this out later.
If the logging component created the connections itself, then testing the logging component in a standalone fashion would be difficult. By injecting a component that handles these, you can make a mock that returns dummies upon request, or one that provides connection pooling (and so on).
How you handle database issues robustly depends upon what you want to happen. Firstly, make your database interactions transactional (and consequently atomic). Now, do you want your logger component to bring your system to a halt whilst it retries a write? Do you want it to buffer writes up and try out-of-band (i.e. on another thread)? Is it mission critical to write this, or can you afford to lose data (e.g. abandon a bad write)? I've not provided any specific answers here, since there are so many options depending upon your requirements. The above details a few possible options.
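As one possible shape for the injection and retry ideas above, here is a hedged sketch rather than a definitive design. The class name DbLogger, the log table, and the retry parameters are all assumptions made for illustration, and sqlite3 stands in for MySQL so the example is runnable. The logging thread receives a connection factory instead of a connection, writes each row in a transaction, drops and recreates the connection when a write fails, and records rows it had to abandon.

```python
import os
import queue
import sqlite3
import tempfile
import threading
import time

STOP = object()  # sentinel that tells the logger thread to exit


class DbLogger(threading.Thread):
    """Drains a queue into a database, reconnecting when a write fails."""

    def __init__(self, connect, jobs, max_retries=3, retry_delay=0.01):
        super().__init__(daemon=True)
        self._connect = connect          # injected factory that returns a new connection
        self._jobs = jobs
        self._max_retries = max_retries
        self._retry_delay = retry_delay
        self.dropped = []                # rows abandoned after all retries failed

    def run(self):
        conn = None
        while True:
            row = self._jobs.get()
            if row is STOP:
                break
            for attempt in range(self._max_retries):
                try:
                    if conn is None:
                        conn = self._connect()  # (re)connect inside this thread
                    with conn:  # sqlite3: commit on success, roll back on error
                        conn.execute("INSERT INTO log (message) VALUES (?)", (row,))
                    break
                except sqlite3.Error:
                    # Treat the connection as unusable, discard it, back off, retry.
                    if conn is not None:
                        conn.close()
                    conn = None
                    time.sleep(self._retry_delay * (attempt + 1))
            else:
                self.dropped.append(row)  # policy choice: lose the row, keep running
        if conn is not None:
            conn.close()


def demo_logger():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "log.db")
        setup = sqlite3.connect(path)
        setup.execute("CREATE TABLE log (message TEXT)")
        setup.commit()
        setup.close()

        calls = {"n": 0}

        def flaky_connect():
            # Test double: the first connection attempt simulates an outage.
            calls["n"] += 1
            if calls["n"] == 1:
                raise sqlite3.OperationalError("simulated outage")
            return sqlite3.connect(path)

        jobs = queue.Queue()
        logger = DbLogger(flaky_connect, jobs)
        logger.start()
        for message in ("a", "b"):
            jobs.put(message)
        jobs.put(STOP)
        logger.join()

        check = sqlite3.connect(path)
        rows = [r[0] for r in check.execute("SELECT message FROM log ORDER BY rowid")]
        check.close()
        return rows, logger.dropped
```

Because the factory is injected, the flaky_connect test double can simulate an outage without a real server. Whether to block, buffer, or drop on repeated failure is still a requirements decision; this sketch drops.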
I'm quite new to python threading/network programming, but have an assignment involving both of the above.
One of the requirements of the assignment is that for each new request, I spawn a new thread, but I need to both send and receive at the same time to the browser.
I'm currently using the asyncore library in Python to catch each request, but as I said, I need to spawn a thread for each request, and I was wondering whether using both threads and asyncore is overkill, or the correct way to do it.
Any advice would be appreciated.
Thanks
EDIT:
I'm writing a Proxy Server, and not sure if my client is persistent. My client is my browser (using firefox for simplicity)
It seems to reconnect for each request. My problem is that if I open a tab with http://www.google.com in it, and http://www.stackoverflow.com in it, I only get one request at a time from each tab, instead of multiple requests from google, and from SO.
I answered a question that sounds amazingly similar to yours, where someone had a homework assignment to create a client-server setup, with each connection being handled in a new thread: https://stackoverflow.com/a/9522339/496445
The general idea is that you have a main server loop constantly looking for a new connection to come in. When it does, you hand it off to a thread which will then do its own monitoring for new communication.
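A rough sketch of that loop, not the code from the linked answer: the function names are made up here, and a plain echo stands in for real proxy handling. The server thread blocks in accept() and hands every new connection to its own handler thread.

```python
import socket
import threading


def handle_client(conn, addr):
    """Runs in its own thread and owns this one connection until it closes."""
    with conn:
        conn.settimeout(None)   # make sure the per-client socket is blocking
        while True:
            data = conn.recv(4096)
            if not data:        # empty read: the client closed the connection
                break
            conn.sendall(data)  # echo stands in for real request handling


def serve(server_sock, stop_event):
    """Main loop: wait in accept(), then hand each connection to a new thread."""
    server_sock.settimeout(0.2)  # wake up periodically to check stop_event
    while not stop_event.is_set():
        try:
            conn, addr = server_sock.accept()
        except socket.timeout:
            continue
        threading.Thread(target=handle_client, args=(conn, addr), daemon=True).start()


def demo_threaded_echo():
    server_sock = socket.create_server(("127.0.0.1", 0))  # port 0: OS picks a free port
    port = server_sock.getsockname()[1]
    stop_event = threading.Event()
    server = threading.Thread(target=serve, args=(server_sock, stop_event))
    server.start()
    try:
        with socket.create_connection(("127.0.0.1", port)) as client:
            client.sendall(b"hello")
            reply = client.recv(4096)
    finally:
        stop_event.set()
        server.join()
        server_sock.close()
    return reply
```

The timeout on the listening socket is only there so the demo can shut down cleanly; a long-running server could simply block in accept().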
An extra bit about asyncore vs threading
From the asyncore docs:
There are only two ways to have a program on a single processor do
“more than one thing at a time.” Multi-threaded programming is the
simplest and most popular way to do it, but there is another very
different technique, that lets you have nearly all the advantages of
multi-threading, without actually using multiple threads. It’s really
only practical if your program is largely I/O bound. If your program
is processor bound, then pre-emptive scheduled threads are probably
what you really need. Network servers are rarely processor bound,
however.
As this quote suggests, using asyncore and threading should be for the most part mutually exclusive options. My link above is an example of the threading approach, where the server loop (either in a separate thread or the main one) does a blocking call to accept a new client. And when it gets one, it spawns a thread which will then continue to handle the communication, and the server goes back into a blocking call again.
In the pattern of using asyncore, you would instead use its async loop which will in turn call your own registered callbacks for various activity that occurs. There is no threading here, but rather a polling of all the open file handles for activity. You get the sense of doing things all concurrently, but under the hood it is scheduling everything serially.
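For comparison, here is a small sketch of that polling style. One caveat: asyncore was removed from the standard library in Python 3.12, so this uses the selectors module instead, which is a swap on my part and not what the question used. The callback names are hypothetical, and socket pairs stand in for two browser connections so everything runs in a single thread.

```python
import selectors
import socket


def run_loop(sel, should_stop):
    """Single-threaded event loop: poll registered sockets, call back on activity."""
    while not should_stop():
        for key, _events in sel.select(timeout=0.1):
            callback = key.data          # the callback registered for this socket
            callback(sel, key.fileobj)


def make_handler(log):
    def on_readable(sel, sock):
        data = sock.recv(4096)
        if data:
            log.append(data)
            sock.sendall(data.upper())   # placeholder for real request handling
        else:
            sel.unregister(sock)         # peer closed: stop watching this socket
            sock.close()
    return on_readable


def demo_event_loop():
    sel = selectors.DefaultSelector()
    log = []
    # Two connected socket pairs stand in for two client connections.
    pairs = [socket.socketpair() for _ in range(2)]
    for server_end, _client_end in pairs:
        server_end.setblocking(False)
        sel.register(server_end, selectors.EVENT_READ, make_handler(log))
    pairs[0][1].sendall(b"first")
    pairs[1][1].sendall(b"second")
    run_loop(sel, should_stop=lambda: len(log) >= 2)
    replies = [client_end.recv(4096) for _server_end, client_end in pairs]
    for server_end, client_end in pairs:
        sel.unregister(server_end)
        server_end.close()
        client_end.close()
    sel.close()
    return sorted(log), replies
```

Both "connections" are served by the same thread, one callback at a time, which is the serial scheduling described above.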