Multiprocessing or Multithreading for plugin architecture in Python - python

I'm trying to implement a plugin architecture in Python.
I've started writing it using the Threading module where each plugin is a thread which I invoke using the Thread.start() method (since all plugins subclass BasePlugin which subclasses Thread). However I've just come across the multiprocessing module.
I'm currently wondering if I should switch to the multiprocessing module and share data using shared memory / Pipes etc...
I'd like to get other's opinions on this.
The plugin architecture I've been working on works as follows:
An event is received by the Plugin Manager. The Plugin Manager checks for all the plugins who've subscribed to that type of event. It activates them and sends them the event object (since it holds additional information). If one of the plugins is already active there is no need to spawn it (just send the event object to it).
In addition there are a few resources which belong only to one plugin at any point in time. Each plugin can request the resource (I'm not worrying about any race condition here since there won't be that many plugins active at once).

Threads share memory with the primary process and each other. For example you can have a list that is available to all threads. An item appended to a list can be seen by other threads. But you have to be careful. You have to understand which operations on data structures are thread safe and which are not. What happens to the behaviour of your program when two threads are checking for the existence of a key in a dictionary and then writing to it?
Multiple processes do not share memory. The new process that you start gets a copy of the memory at the point where it was spawned.
Threads use less resources. But can be hard to reason about. On the other hand communication between processes is tricky. And you can't just access an arbitrary Python data structure. Which it sounds like you want to be able to do.
A badly written plugin, if it was in a thread, could crash your whole program. Whereas if it was in a separate process this wouldn't happen. Maybe that's a consideration?

Related

Python portable interprocess Semaphore/Event

I'm creating a website using Flask. My WSGI server, Gunicorn, spawns multiple processes.
I have some cross-process objects (notably files) that I want to constrain access to within these processes, and raise events when they are modified.
The choice is normally to use system-wide mutexes/semaphores and events.
However, I can't find a portable (Windows/Mac/Linux) solution for these on Python.
The multiprocessing module (see this question), as far as I can tell, only works for processes spawned by the multiprocessing module itself, which these are not.
There are POSIX semaphores also, but these only work on Linux.
Does anyone know of a more general solution?
I have been researching this for a while, and the closest I could find is the python file-locking library fasteners:
It works quite well in all platforms. The problem it only implements system mutex, but not semaphore like counting. I have implementing my own counting in a locked file with an integer counter and active waiting, but this is still fragile and will leave the system in bad state if one of the process crashes and doesn't update the count properly.

Python 3 Sockets - Can I keep a socket open while stopping and re-running a program?

I've been scratching my head trying to figure out if this is possible.
I have a server program running with about 30 different socket connections to it from all over the country. I need to update this server program now and although the client devices will automatically reconnect, its not totally reliable.
I was wondering if there is a way of saving the socket object to a file? then load it back up when the server restarts? or forcefully keeping a socket open even after the program stops. This way the clients never disconnect at all.
Could really do with hot swappable code here really!
Solution 1.
It can be done with some process magic, at least under linux (although I do believe similar windows api exists). First of all note that sockets cannot be stored in a file. These objects are temporary by their nature. But you can keep them in a separate process. Have a look at this:
Can I open a socket and pass it to another process in Linux
So one way to accomplish this is the following:
Create a "keeper" process at some point (make sure that the process is not a child of the main process so that it stays alive when the main process is gone)
Send all sockets to the keeper process via sendmsg() with SCM_RIGHTS
Shutdown the main process
Do whatever update you have to
Fire the main process
Retrieve sockets from the keeper process
Shutdown the keeper process
However this solution is quite difficult to maintain. You have two separate processes, it is unclear which is the master and which is a slave. So you would probably need another master process at the top. Things get nasty very quickly, not to mention security issues.
Solution 2.
Reloading modules as suggested by #gavinb might be a solution. Note however that in practice this often breaks the app. You never know what those modules do under the hood unless you know the code of every single Python file you use. Plus it imposes some restrictions on modules, i.e. they have to be reloadable. For example some modules use inline caching which makes reloading difficult.
Also once a module is loaded in a different module it keeps a reference to that module. So you not only have to reload it but also update references in every other module that loaded it earlier. The maintanance costs raise very quickly unless you thought about it at the begining of the project (so that every import is encapsulated for easy reload). And bugs caused by two different versions of a module running in the same process are (I imagine, never been in this situation though) extremely difficult to find.
Anyway I would avoid that.
Solution 3.
So this is XY problem. Instead of saving sockets how about you put a proxy in front of the main server? IMO this is the safest and at the same time simpliest solution. The proxy will communicate with the main server (for example over unix domain sockets) and will buffer the data and automatically reconnect to the main server once it is available again. Perhaps you can even reuse some existing tech, e.g. nginx.
No, the sockets are special file handles that belong to the process. If you close the process, the runtime will force close any open files/sockets. This is not Python specific; it is just how operating systems manage resources.
Now what you can do however is dynamically reload one or more modules while keeping the process active. It might take some careful management when you have open sockets, but in theory it should be possible. So yes, hot swappable code is actually supported by Python.
Do some reading and research on "dynamic reloading". The importlib module in Python 3 provides the reload function which is used to:
Reload a previously imported module. The argument must be a module object, so it must have been successfully imported before. This is useful if you have edited the module source file using an external editor and want to try out the new version without leaving the Python interpreter.
I think your critical question is how to hot reload.
And as mentioned by #gavinb, you can import importlib and then use importlib.reload(module) to reload a module dynamically.
Be careful, the parameter of reload(param) must be a module.

Controlling scheduling priority of python threads?

I've written a script that uses two thread pools of ten threads each to pull in data from an API. The thread pool implements this code on ActiveState. Each thread pool is monitoring a Redis database via PubSub for new entries. When a new entry is published, python passes the data to a function that uses python's Subprocess.POpen to execute a PHP shell to do the actual work of calling the API.
This system of launching PHP shells is necessary for functionality with my PHP web app, so launching PHP shells with Python can't be avoided.
This script will only be running on Linux servers.
How do I control the niceness (scheduling priority) of the application's threads?
Edit:
It seems controlling scheduling priority for individual threads in Python isn't possible. Is there a python solution, or at the very least a UNIX command I can run along with my script, to control the priority?
Edit 2:
Well I didn't end up finding a python way to handle it. I'm just running my script with nice now like this:
nice -n 19 python MyScript.py
I believe that threading priority is not controllable in python due to how they are implemented using a global interpreter lock (GIL). Having said that, even if you could give one thread more CPU processing priority, the python implementation that hands around the GIL would not be aware of this as it handed around the GIL. If you were able to increase niceness in a single thread in your pool (say it is doing a more important job) you would need to use your own implementation of locks to give the higher priority thread access to the GIL more often.
A google search returns this article which I believe is similar to what you are asking
Explains why it doesnt work
http://www.velocityreviews.com/forums/t329441-threading-priority.html
Explains the workaround I was suggesting
http://bytes.com/topic/python/answers/645966-setting-thread-priorities
The python threading-docs mention explicitly that there is no support for setting thread-priorities:
The design of this module is loosely based on Java’s threading model. However, where Java makes locks and condition variables basic behavior of every object, they are separate objects in Python. Python’s Thread class supports a subset of the behavior of Java’s Thread class; currently, there are no priorities, no thread groups, and threads cannot be destroyed, stopped, suspended, resumed, or interrupted. The static methods of Java’s Thread class, when implemented, are mapped to module-level functions.
It doesn't work, but I tried:
getting the parent pid and priority
launching threads using concurrent.futures.ThreadPoolExecutor
using ctypes to get the (linux) thread id from within the thread(works)
using the tid with os.setpriority(os.PRIO_PROCESS,tid,parent_priority+1)
calling pool.shutdown() from the parent.
Even with liberal sprinkling of os.sched_yield(), the child threads never actually run past the setpriority().
Reading man pages, it seems threads don't have the capability to change (even their) scheduling priority; you have to do something with "capabilities" to give the thread the "CAP_SYS_NICE" capability. Running the process with root permissions didn't help either; child threads still don't run.
I know, a lot of time has passed, but I recently came across this question, and I thought it would be useful to add another option.
Have a look at threading2, which is a drop-in replacement and extension for the default threading module, with support – sort of – for priority and affinity.
I was wondering if this answer at another related question might be useful in this scenario? (link)
As you are already using Subprocess.POpen to launch your PHP script, it strikes me that you can use "preexec_fn" and either a predefined function, or a lambda function (as demonstrated in the above linked answer) to set the nice level of each launched PHP thread?

Python multiple threads accessing same file

I have two threads, one which writes to a file, and another which periodically
moves the file to a different location. The writes always calls open before writing a message, and calls close after writing the message. The mover uses shutil.move to do the move.
I see that after the first move is done, the writer cannot write to the file anymore, i.e. the size of the file is always 0 after the first move. Am I doing something wrong?
Locking is a possible solution, but I prefer the general architecture of having each external resource (including a file) dealt with by a single, separate thread. Other threads send work requests to the dedicated thread on a Queue.Queue instance (and provide a separate queue of their own as part of the work request's parameters if they need result back), the dedicated thread spends most of its time waiting on a .get on that queue and whenever it gets a requests goes on and executes it (and returns results on the passed-in queue if needed).
I've provided detailed examples of this approach e.g. in "Python in a Nutshell". Python's Queue is intrinsically thread-safe and simplifies your life enormously.
Among the advantages of this architecture is that it translates smoothly to multiprocessing if and when you decide to switch some work to a separate process instead of a separate thread (e.g. to take advantage of multiple cores) -- multiprocessing provides its own workalike Queue type to make such a transition smooth as silk;-).
When two threads access the same resources, weird things happen. To avoid that, always lock the resource. Python has the convenient threading.Lock for that, as well as some other tools (see documentation of the threading module).
Check out http://www.evanfosmark.com/2009/01/cross-platform-file-locking-support-in-python/
You can use a simple lock with his code, as written by Evan Fosmark in an older StackOverflow question:
from filelock import FileLock
with FileLock("myfile.txt"):
# work with the file as it is now locked
print("Lock acquired.")
One of the more elegant libraries I've ever seen.

Threading in a PyQt application: Use Qt threads or Python threads?

I'm writing a GUI application that regularly retrieves data through a web connection. Since this retrieval takes a while, this causes the UI to be unresponsive during the retrieval process (it cannot be split into smaller parts). This is why I'd like to outsource the web connection to a separate worker thread.
[Yes, I know, now I have two problems.]
Anyway, the application uses PyQt4, so I'd like to know what the better choice is: Use Qt's threads or use the Python threading module? What are advantages / disadvantages of each? Or do you have a totally different suggestion?
Edit (re bounty): While the solution in my particular case will probably be using a non-blocking network request like Jeff Ober and Lukáš Lalinský suggested (so basically leaving the concurrency problems to the networking implementation), I'd still like a more in-depth answer to the general question:
What are advantages and disadvantages of using PyQt4's (i.e. Qt's) threads over native Python threads (from the threading module)?
Edit 2: Thanks all for you answers. Although there's no 100% agreement, there seems to be widespread consensus that the answer is "use Qt", since the advantage of that is integration with the rest of the library, while causing no real disadvantages.
For anyone looking to choose between the two threading implementations, I highly recommend they read all the answers provided here, including the PyQt mailing list thread that abbot links to.
There were several answers I considered for the bounty; in the end I chose abbot's for the very relevant external reference; it was, however, a close call.
Thanks again.
This was discussed not too long ago in PyQt mailing list. Quoting Giovanni Bajo's comments on the subject:
It's mostly the same. The main difference is that QThreads are better
integrated with Qt (asynchrnous signals/slots, event loop, etc.).
Also, you can't use Qt from a Python thread (you can't for instance
post event to the main thread through QApplication.postEvent): you
need a QThread for that to work.
A general rule of thumb might be to use QThreads if you're going to interact somehow with Qt, and use Python threads otherwise.
And some earlier comment on this subject from PyQt's author: "they are both wrappers around the same native thread implementations". And both implementations use GIL in the same way.
Python's threads will be simpler and safer, and since it is for an I/O-based application, they are able to bypass the GIL. That said, have you considered non-blocking I/O using Twisted or non-blocking sockets/select?
EDIT: more on threads
Python threads
Python's threads are system threads. However, Python uses a global interpreter lock (GIL) to ensure that the interpreter is only ever executing a certain size block of byte-code instructions at a time. Luckily, Python releases the GIL during input/output operations, making threads useful for simulating non-blocking I/O.
Important caveat: This can be misleading, since the number of byte-code instructions does not correspond to the number of lines in a program. Even a single assignment may not be atomic in Python, so a mutex lock is necessary for any block of code that must be executed atomically, even with the GIL.
QT threads
When Python hands off control to a 3rd party compiled module, it releases the GIL. It becomes the responsibility of the module to ensure atomicity where required. When control is passed back, Python will use the GIL. This can make using 3rd party libraries in conjunction with threads confusing. It is even more difficult to use an external threading library because it adds uncertainty as to where and when control is in the hands of the module vs the interpreter.
QT threads operate with the GIL released. QT threads are able to execute QT library code (and other compiled module code that does not acquire the GIL) concurrently. However, the Python code executed within the context of a QT thread still acquires the GIL, and now you have to manage two sets of logic for locking your code.
In the end, both QT threads and Python threads are wrappers around system threads. Python threads are marginally safer to use, since those parts that are not written in Python (implicitly using the GIL) use the GIL in any case (although the caveat above still applies.)
Non-blocking I/O
Threads add extraordinarily complexity to your application. Especially when dealing with the already complex interaction between the Python interpreter and compiled module code. While many find event-based programming difficult to follow, event-based, non-blocking I/O is often much less difficult to reason about than threads.
With asynchronous I/O, you can always be sure that, for each open descriptor, the path of execution is consistent and orderly. There are, obviously, issues that must be addressed, such as what to do when code depending on one open channel further depends on the results of code to be called when another open channel returns data.
One nice solution for event-based, non-blocking I/O is the new Diesel library. It is restricted to Linux at the moment, but it is extraordinarily fast and quite elegant.
It is also worth your time to learn pyevent, a wrapper around the wonderful libevent library, which provides a basic framework for event-based programming using the fastest available method for your system (determined at compile time).
The advantage of QThread is that it's integrated with the rest of the Qt library. That is, thread-aware methods in Qt will need to know in which thread they run, and to move objects between threads, you will need to use QThread. Another useful feature is running your own event loop in a thread.
If you are accessing a HTTP server, you should consider QNetworkAccessManager.
I asked myself the same question when I was working to PyTalk.
If you are using Qt, you need to use QThread to be able to use the Qt framework and expecially the signal/slot system.
With the signal/slot engine, you will be able to talk from a thread to another and with every part of your project.
Moreover, there is not very performance question about this choice since both are a C++ bindings.
Here is my experience of PyQt and thread.
I encourage you to use QThread.
Jeff has some good points. Only one main thread can do any GUI updates. If you do need to update the GUI from within the thread, Qt-4's queued connection signals make it easy to send data across threads and will automatically be invoked if you're using QThread; I'm not sure if they will be if you're using Python threads, although it's easy to add a parameter to connect().
I can't really recommend either, but I can try describing differences between CPython and Qt threads.
First of all, CPython threads do not run concurrently, at least not Python code. Yes, they do create system threads for each Python thread, however only the thread currently holding Global Interpreter Lock is allowed to run (C extensions and FFI code might bypass it, but Python bytecode is not executed while thread doesn't hold GIL).
On the other hand, we have Qt threads, which are basically common layer over system threads, don't have Global Interpreter Lock, and thus are capable of running concurrently. I'm not sure how PyQt deals with it, however unless your Qt threads call Python code, they should be able to run concurrently (bar various extra locks that might be implemented in various structures).
For extra fine-tuning, you can modify the amount of bytecode instructions that are interpreted before switching ownership of GIL - lower values mean more context switching (and possibly higher responsiveness) but lower performance per individual thread (context switches have their cost - if you try switching every few instructions it doesn't help speed.)
Hope it helps with your problems :)
I can't comment on the exact differences between Python and PyQt threads, but I've been doing what you're attempting to do using QThread, QNetworkAcessManager and making sure to call QApplication.processEvents() while the thread is alive. If GUI responsiveness is really the issue you're trying to solve, the later will help.

Categories

Resources