I've got a fairly simple Python program as outlined below:
It has 2 threads plus the main thread. One of the threads collects some data and puts it on a Queue.
The second thread takes stuff off the queue and logs it. Right now it's just printing out the stuff from the queue, but I'm working on adding it to a local MySQL database.
This is a process that needs to run for a long time (at least a few months).
How should I deal with the database connection? Create it in main, then pass it to the logging thread, or create it directly in the logging thread? And how do I handle unexpected situations with the DB connection (interrupted, MySQL server crashes, etc) in a robust manner?
How should I deal with the database connection? Create it in main,
then pass it to the logging thread, or create it directly in the
logging thread?
I would perhaps configure your logging component with the class that creates the connection and let your logging component request it. This is called dependency injection, and makes life easier in terms of testing e.g. you can mock this out later.
If the logging component created the connections itself, then testing the logging component in a standalone fashion would be difficult. By injecting a component that handles these, you can make a mock that returns dummies upon request, or one that provides connection pooling (and so on).
How you handle database issues robustly depends upon what you want to happen. Firstly make your database interactions transactional (and consequently atomic). Now, do you want your logger component to bring your system to a halt whilst it retries a write. Do you want it to buffer writes up and try out-of-band (i.e. on another thread) ? Is it mission critical to write this or can you afford to lose data (e.g. abandon a bad write). I've not provided any specific answers here, since there are so many options depending upon your requirements. The above details a few possible options.
Related
I'm having a python 3.8+ program using Django and Postgresql which requires multiple threads or processes. I cannot use threads since the GLI will restrict them to a single process which results in an awful performance (especially since most of the threads are CPU bound).
So the obvious solution was to use the multiprocessing module. But I've encountered several problems:
When using spawn to generate new processes, I get the "Apps aren't loaded yet" error when the new process imports the Django models. This is because the new process doesn't have the database connection given to the main process by python manage.py runserver. I circumvented it by using fork instead of spawn (like advised here) so the connections are copied to the other processes but I feel like this is not the best solution and there should be a clean way to start new processes with the necessary connections.
When several of the processes simultaneously access the database, sometimes false results are given back (partly even from wrong models / relations) which crashes the program. This can happen in the initial startup when fetching data but also when the program is running. I tried to use ISOLATION LEVEL SERIALIZABLE (like advised here) by adding it in the options in the database settings but that didn't work.
A possible solution might be using custom locks that are given to every process but that doesn't feel like a good solution as well.
So in general, the question is: Is there a good and clean way to use multiprocessing in Django without these issues? A way that new processes have the database connections without needing to rely on fork and that all processes can just access the database without having any race conditions sometimes producing false results like this?
One important thing: I don't use a Pool since the processes aren't running the same simple task. The processes are each running different specific tasks, share data via multiprocessing Signals, Queues, Values and Namespaces (shared memory) and new processes can be triggered by user interaction (websockets).
I've tried to look into Celery since this has been recommended on a lot of questions about Django and multiprocessing but I wouldn't know how to use something like that in the project structure with the specific different processes that need to be created at specific points and the data that gets transferred over the Queues, Signals, Values and Namespaces in the existing project.
Thank you for reading; any help is appreciated!
With every new process, a setup function calling Django.setup() is first called before executing the real function. My hope was that with this way, every process would create an independent connection to the database so that the current system could work.
Yes - you can do that with initializer,
as explained in my other answer from yesteryear.
However, it still throws errors like django.db.utils.OperationalError: lost synchronization with server: got message type "1", length 976434746
That means you're using the fork start method for subprocesses, and any database connections and their state has been forked into the subprocesses too, and they will be out of sync when used by multiple processes.
You'll need to close them:
def subprocess_setup():
django.setup()
from django.db import connections
for conn in connections.all():
conn.close()
with ProcessPoolExecutor(max_workers=5, initializer=subprocess_setup) as executor:
My python application uses concurrent.futures.ProcessPoolExecutor with 5 workers and each process makes multiple database queries.
Between the choice of giving each process its own db client, or alternatively , making all process to share a single client, which is considered more safe and conventional?
Short answer: Give each process (that needs it) its own db client.
Long answer: What problem are you trying to solve?
Sharing a DB client between processes basically doesn't happen; you'd have to have the one process which does have the DB client proxy the queries from the others, using more-or-less your own protocol. That can have benefits, if that protocol is specific to your application, but it will add complexity: you'll now have two different kinds of workers in your program, rather than just one kind, plus the protocol between them. You'd want to make sure that the benefits outweigh the additional complexity.
Sharing a DB client between threads is usually possible; you'd have to check the documentation to see which objects and operations are "thread-safe". However, since your application is otherwise CPU-heavy, threading is not suitable, due to Python limitations (the GIL).
At the same time, there's little cost to having a DB client in each process; you will in any case need some sort of client, it might as well be the direct one.
There isn't going to be much more IO, since that's mostly based on the total number of queries and amount of data, regardless of whether that comes from one process or gets spread among several. The only additional IO will be in the login, and that's not much.
If you're running out of connections at the database, you can either tune/upgrade your database for more connections, or use a separate off-the-shelf "connection pooler" to share them; that's likely to be much better than trying to implement a connection pooler from scratch.
More generally, and this applies well beyond this particular question, it's often better to combine several off-the-shelf pieces in a straightforward way, than it is to try to put together a custom complex piece that does the whole thing all at once.
So, what problem are you trying to solve?
It is better to use multithreading or asynchronous approach instead of multiprocessing because it will consume fewer resources. That way you could use a single db connection, but I would recommend creating a separate session for each worker or coroutine to avoid some exceptions or problems with locking.
I'm programming a bit of server code and the MQTT side of it runs in it's own thread using the threading module which works great and no issues but now I'm wondering how to proceed.
I have two MariaDB databases, one of them is local and the other is remote (There is a good and niche reason for this.) and I'm writing a class which handles the databases. This class will start new threads of classes that submits the data to their respected databases. If conditions are true, then it tells the data to start a new thread to push data to one database, if they are false, the data will go to the other database. The MQTT thread has a instance of the "Database handler" class and passes data to it through different calling functions within the class.
Will this work to allow a thread to concentrate on MQTT tasks while another does the database work? There are other threads as well, I've just never combined databases and threads before so I'd like an opinion or any information that would help me out from more seasoned programmers.
Writing code that is "thread safe" can be tricky. I doubt if the Python connector to MySQL is thread safe; there is very little need for it.
MySQL is quite happy to have multiple connections to it from clients. But they must be separate connections, not the same connection running in separate threads.
Very few projects need multi-threaded access to the database. Do you have a particular need? If so let's hear about it, and discuss the 'right' way to do it.
For now, each of your threads that needs to talk to the database should create its own connection. Generally, such a connection can be created soon after starting the thread (or process) and kept open until close to the end of the thread. That is, normally you should have only one connection per thread.
I've been scratching my head trying to figure out if this is possible.
I have a server program running with about 30 different socket connections to it from all over the country. I need to update this server program now and although the client devices will automatically reconnect, its not totally reliable.
I was wondering if there is a way of saving the socket object to a file? then load it back up when the server restarts? or forcefully keeping a socket open even after the program stops. This way the clients never disconnect at all.
Could really do with hot swappable code here really!
Solution 1.
It can be done with some process magic, at least under linux (although I do believe similar windows api exists). First of all note that sockets cannot be stored in a file. These objects are temporary by their nature. But you can keep them in a separate process. Have a look at this:
Can I open a socket and pass it to another process in Linux
So one way to accomplish this is the following:
Create a "keeper" process at some point (make sure that the process is not a child of the main process so that it stays alive when the main process is gone)
Send all sockets to the keeper process via sendmsg() with SCM_RIGHTS
Shutdown the main process
Do whatever update you have to
Fire the main process
Retrieve sockets from the keeper process
Shutdown the keeper process
However this solution is quite difficult to maintain. You have two separate processes, it is unclear which is the master and which is a slave. So you would probably need another master process at the top. Things get nasty very quickly, not to mention security issues.
Solution 2.
Reloading modules as suggested by #gavinb might be a solution. Note however that in practice this often breaks the app. You never know what those modules do under the hood unless you know the code of every single Python file you use. Plus it imposes some restrictions on modules, i.e. they have to be reloadable. For example some modules use inline caching which makes reloading difficult.
Also once a module is loaded in a different module it keeps a reference to that module. So you not only have to reload it but also update references in every other module that loaded it earlier. The maintanance costs raise very quickly unless you thought about it at the begining of the project (so that every import is encapsulated for easy reload). And bugs caused by two different versions of a module running in the same process are (I imagine, never been in this situation though) extremely difficult to find.
Anyway I would avoid that.
Solution 3.
So this is XY problem. Instead of saving sockets how about you put a proxy in front of the main server? IMO this is the safest and at the same time simpliest solution. The proxy will communicate with the main server (for example over unix domain sockets) and will buffer the data and automatically reconnect to the main server once it is available again. Perhaps you can even reuse some existing tech, e.g. nginx.
No, the sockets are special file handles that belong to the process. If you close the process, the runtime will force close any open files/sockets. This is not Python specific; it is just how operating systems manage resources.
Now what you can do however is dynamically reload one or more modules while keeping the process active. It might take some careful management when you have open sockets, but in theory it should be possible. So yes, hot swappable code is actually supported by Python.
Do some reading and research on "dynamic reloading". The importlib module in Python 3 provides the reload function which is used to:
Reload a previously imported module. The argument must be a module object, so it must have been successfully imported before. This is useful if you have edited the module source file using an external editor and want to try out the new version without leaving the Python interpreter.
I think your critical question is how to hot reload.
And as mentioned by #gavinb, you can import importlib and then use importlib.reload(module) to reload a module dynamically.
Be careful, the parameter of reload(param) must be a module.
With Windows named pipes, what is the proper way to use the CreateNamedPipe, ConnectNamedPipe, DisconnectNamedPipe, and CloseHandle calls?
I am making a server app which is connecting to a client app which connects and disconnects to the pipe multiple times across a session.
When my writes fail because the client disconnected, should I call DisconnectNamedPipe, CloseHandle, or nothing on my handle.
Then, to accept a new connection, should I call CreateNamedPipe and then ConnectNamedPipe, or just ConnectNamedPipe?
I would very much like an explanation of the different states my pipe can be in as a result of these calls, because I have not found this elsewhere.
Additional info:
Language: Python using the win32pipe,win32file and win32api libraries.
Pipe settings: WAIT, no overlap, bytestream.
It is good practice to call DisconnectNamedPipe then CloseHandle, although CloseHandle should clean everything up.
The MSDN documentation is a little vague and their server example is pretty basic. As to whether you reuse pipe handles, it seems that it is your own choice. Documentation for DisconnectNamedPipe seems to indicate that you can re-use a pipe handle for a new client by calling ConnectNamedPipe again on that handle after disconnecting. The role of ConnectNamedPipe seems to be to assign a connecting client to a handle.
Make sure you are cleaning up pipes though as MSDN states the following
Every time a named pipe is created, the system creates the inbound and/or outbound buffers using nonpaged pool, which is the physical memory used by the kernel. The number of pipe instances (as well as objects such as threads and processes) that you can create is limited by the available nonpaged pool. Each read or write request requires space in the buffer for the read or write data, plus additional space for the internal data structures.
I'd also bare the above in mind if you are creating/destroying a lot of pipes. My guess that it would be better to operate a pool of pipe handles if there are many clients and have some grow/shrink mechanism to the pool.
I have managed to achieve what I wanted. I call CreateNamedPipe and CloseHandle exactly once per session, and I call DisconnectNamedPipe when my write fails, followed by another ConnectNamedPipe.
The trick is to only call DisconnectNamedPipe when the pipe was actually connected. I called it every time I tried to connect "just to be sure" and it gave me strange errors.
See also djgandy's answer for more information about pipes.