I'm currently using CherryPy 3.2.2 and am having an issue where my ThreadPool does not grow or shrink at all. Looking through the source of wsgiserver2.py, I see two methods in the ThreadPool class, 'grow' and 'shrink'. If you download the entire repo and search for those two methods to see where they are called, they are not called anywhere. Perhaps they are being invoked in some other way that is foreign to me, but I would like to know whether this is an oversight or I'm just looking in the wrong places.
Note: I'm setting the values (thread_pool and thread_pool_max) correctly before start is called on the Server, from the ServerAdapter, so it's not that.
Thanks for all your help.
pcarl
You're correct. Neither ThreadPool.shrink nor ThreadPool.grow is called anywhere in the CherryPy flow, and thread_pool_max has no effect unless you call these two methods explicitly.
Normally CherryPy will lazily instantiate thread workers up to thread_pool and will stop there.
If you're sure that you need a big thread pool, which causes serious memory overhead for your application, you can subclass cherrypy.process.plugins.Monitor to watch the thread queue size or some other metric and grow and shrink the pool accordingly. Luckily, there's already one out there.
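For illustration, here is a minimal sketch of such a Monitor subclass. It assumes the running pool is reachable at cherrypy.server.httpserver.requests (as in CherryPy 3.2's wsgiserver2) and pokes at the private _queue and _threads attributes, which may differ between versions:

import cherrypy
from cherrypy.process.plugins import Monitor

class PoolScaler(Monitor):
    def __init__(self, bus, frequency=10):
        Monitor.__init__(self, bus, self.scale, frequency, name='PoolScaler')

    def scale(self):
        # Assumption: the live ThreadPool lives here in CherryPy 3.2.
        pool = getattr(cherrypy.server.httpserver, 'requests', None)
        if pool is None:
            return
        backlog = pool._queue.qsize()   # connections waiting for a worker
        workers = len(pool._threads)    # current number of worker threads
        if backlog and workers < pool.max:
            pool.grow(min(backlog, pool.max - workers))
        elif backlog == 0 and workers > pool.min:
            pool.shrink(workers - pool.min)

PoolScaler(cherrypy.engine).subscribe()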
Related
Python provides four different synchronization mechanisms in the threading module: Event, Condition, Lock (RLock), and Semaphore.
I understand they can be used to synchronize access of shared resources/critical sections between threads. But I am not quite sure when to use which.
Can they be used interchangeably? Or are some of them 'higher level', using others as building blocks? If so, which ones are built on which?
It would be great if someone can illustrate with some examples.
This article probably contains all the information you need. The question is indeed very broad, but let me try to explain how I use each as an example:
Event - Use it when you need threads to communicate that a certain state has been reached, so they can work together in sync. I use it mostly for the initiation process of two threads, where one depends on the other.
Example: A client has a threaded manager, and its __init__() needs to know the manager is done instantiating some attributes before it can move on.
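A minimal sketch of that initiation pattern (the names are illustrative, not from the question):

import threading

class Manager(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        self.ready = threading.Event()

    def run(self):
        self.connection = object()  # stand-in for expensive attribute setup
        self.ready.set()            # tell waiters the attributes now exist
        # ... rest of the manager's work ...

manager = Manager()
manager.start()
manager.ready.wait()   # the client blocks here until setup is done
print('manager is ready, safe to use manager.connection')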
Lock/RLock - Use it when you are working with a shared resource and want to make sure no other thread is reading from or writing to it. I'd argue that while locking before writing is mandatory, locking before reading can be optional; still, it is good to make sure that while you are reading or writing, no other thread is modifying the resource at the same time. RLock can be acquired multiple times by its owner, and release() must be called as many times as acquire() was before another thread can acquire it.
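A short sketch of both, just to illustrate the acquire/release semantics described above:

import threading

lock = threading.Lock()
shared = []

def append_item(item):
    with lock:             # no other thread reads or writes meanwhile
        shared.append(item)

rlock = threading.RLock()

def outer():
    with rlock:            # the owner acquires once...
        inner()

def inner():
    with rlock:            # ...and again; a plain Lock would deadlock here
        pass

append_item(1)
outer()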
I haven't used Condition that much, and frankly I've never had to use Semaphore, so this answer has room for editing and improvement.
I'm trying to implement a plugin architecture in Python.
I've started writing it using the threading module, where each plugin is a thread that I invoke using the Thread.start() method (since all plugins subclass BasePlugin, which subclasses Thread). However, I've just come across the multiprocessing module.
I'm currently wondering if I should switch to the multiprocessing module and share data using shared memory / Pipes etc...
I'd like to get others' opinions on this.
The plugin architecture I've been working on works as follows:
An event is received by the Plugin Manager. The Plugin Manager checks which plugins have subscribed to that type of event, activates them, and sends them the event object (since it holds additional information). If one of the plugins is already active, there is no need to spawn it; just send the event object to it.
In addition there are a few resources which belong only to one plugin at any point in time. Each plugin can request the resource (I'm not worrying about any race condition here since there won't be that many plugins active at once).
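To make the flow concrete, here is a rough sketch of that dispatch under the threaded design (all names here are illustrative):

import queue
import threading
import time

class BasePlugin(threading.Thread):
    subscriptions = ()                 # event type names this plugin wants

    def __init__(self):
        threading.Thread.__init__(self, daemon=True)
        self.events = queue.Queue()

    def run(self):
        while True:
            event = self.events.get()  # block until the manager sends one
            self.handle(event)

    def handle(self, event):
        raise NotImplementedError

class PluginManager:
    def __init__(self, plugins):
        self.plugins = plugins

    def dispatch(self, event):
        for plugin in self.plugins:
            if type(event).__name__ in plugin.subscriptions:
                if not plugin.is_alive():   # spawn only if not already active
                    plugin.start()
                plugin.events.put(event)    # just hand over the event object

class LogPlugin(BasePlugin):
    subscriptions = ('str',)
    def handle(self, event):
        print('got event:', event)

manager = PluginManager([LogPlugin()])
manager.dispatch('saved')
time.sleep(0.1)   # give the plugin thread a moment to process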
Threads share memory with the primary process and each other. For example you can have a list that is available to all threads. An item appended to a list can be seen by other threads. But you have to be careful. You have to understand which operations on data structures are thread safe and which are not. What happens to the behaviour of your program when two threads are checking for the existence of a key in a dictionary and then writing to it?
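For example, this check-then-write on a dict is not atomic; a minimal illustration with the usual fix:

import threading

counts = {}
lock = threading.Lock()

def tally(key):
    # Unsafe: another thread can run between the membership check
    # and the write, losing an update:
    #   if key not in counts:
    #       counts[key] = 0
    #   counts[key] += 1
    with lock:  # makes the check and the write one atomic step
        counts[key] = counts.get(key, 0) + 1

threads = [threading.Thread(target=tally, args=('hits',)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counts)   # always {'hits': 8} with the lock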
Multiple processes do not share memory. The new process that you start gets a copy of the memory at the point where it was spawned.
Threads use fewer resources, but they can be hard to reason about. On the other hand, communication between processes is tricky, and you can't just access an arbitrary Python data structure, which it sounds like you want to be able to do.
A badly written plugin, if it was in a thread, could crash your whole program. Whereas if it was in a separate process this wouldn't happen. Maybe that's a consideration?
I'm writing a web UI for data analysis tasks.
Here's the way it's supposed to work:
After a user specifies parameters like the dataset and learning rate, I create a new task record; then an executor for this task is started asynchronously (the executor may take a long time to run), and the user is redirected to some other page.
After searching for an async library for Python, I started with eventlet. Here's what I wrote in a Flask view function:
db.save(task)
eventlet.spawn(executor, task)
return redirect("/show_tasks")
With the code above, the executor didn't execute at all.
What may be the problem of my code? Or maybe I should try something else?
While you have been given direct solutions, I will try to answer your first question and explain why your code does not work as expected.
Disclosure: I currently maintain Eventlet. This answer will contain a number of simplifications to fit into a reasonable size.
Brief introduction to cooperative multithreading
There are two ways to do multithreading, and Eventlet takes the cooperative approach. At its core is the Greenlet library, which basically allows you to create independent "execution contexts". One can think of such a context as the frozen state of all local variables plus a pointer to the next instruction. Basically, multithreading = contexts + scheduler. Greenlet provides the contexts, so we need a scheduler: something that decides which context should occupy the CPU right now. It turns out that making those decisions requires running some code too, which means a separate context (green thread); this special green thread is called a Hub in the Eventlet code base. The scheduler maintains an ordered set of contexts that need to run ASAP (the run queue) and a set of contexts that are waiting for something (e.g. network IO or a time-limited sleep) to finish.
But since we are doing cooperative multitasking, one context will execute indefinitely unless it explicitly yields to another. That would be a very sad style of programming, and also by definition incompatible with existing libraries (pointing at they-know-who); so what Eventlet does is provide green versions of common modules, changed in such a way that they switch to the Hub instead of blocking everything. Then some time may be spent in other green threads or in the Hub's wait-for-external-events implementation, in which case the Hub switches back to the green thread that originated the event, and it continues execution.
End. Now back to your problem.
What eventlet.spawn actually does: it creates a new execution context; basically, it allocates an object in memory. It also tells the scheduler to put this context into the run queue, so at the first possible moment the Hub will switch to the newly spawned function. Your code does not provide such a moment. There is no place where you explicitly give up execution to other green threads; in Eventlet this is usually done via eventlet.sleep(). And since you don't use green versions of common modules, there is no chance to yield implicitly while other code waits. The most appropriate (if not the only) place would be your WSGI server's accept loop: it should give other green threads a chance to run while waiting for the next request. The eventlet.monkey_patch() mentioned in another answer is just a convenient way to replace all (or a subset of) common modules with their corresponding green versions.
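A minimal way to see this: after spawning, yield explicitly once so the Hub can run the new green thread (eventlet.sleep(0) yields without actually sleeping):

import eventlet

def executor(task):
    print('executing', task)

eventlet.spawn(executor, 'task-1')
print('spawned, but nothing ran yet')
eventlet.sleep(0)   # give up the CPU; the Hub now runs executor()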
Unwanted opinion on overall design
In a separate section, to skip easily. If you are building error-resistant software, you usually want to limit execution time for spawned threads (including but not limited to "green" ones) and processes, and at least report (log) or react to their unhandled errors. In the provided code, your spawned green thread may technically run in the next moment or five minutes later (again, because nobody yields the CPU), or fail with an unhandled exception. Luckily, Eventlet provides solutions for both problems: Timeout and with_timeout() limit waiting time (remember, if a thread does not yield, you can't possibly limit it), and GreenThread.link() catches all exceptions. It may be tempting (it was for me) to reraise exceptions in the "main" code, and link() allows that easily, but consider that the exceptions would be raised from sleep and IO calls: the places where you yield to the Hub. This may produce some really counter-intuitive tracebacks.
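A small sketch of both tools together (the 2-second job and 5-second limit are arbitrary):

import eventlet
from eventlet.timeout import Timeout

def work():
    eventlet.sleep(2)        # a cooperative yield point
    return 42

gt = eventlet.spawn(work)

def on_done(green_thread):
    # wait() returns the result, or re-raises an unhandled exception
    try:
        print('result:', green_thread.wait())
    except Exception as exc:
        print('worker failed:', exc)

gt.link(on_done)

with Timeout(5, False):      # False: exit the block silently on timeout
    gt.wait()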
You'll need to patch some system libraries in order to make eventlet work. Here is a minimal working example (also as gist):
#!/usr/bin/env python
import eventlet
eventlet.monkey_patch()  # patch stdlib (time, socket, ...) before anything else imports it

import time

from flask import Flask

app = Flask(__name__)
app.debug = True

def background():
    """ do something in the background """
    print('[background] working in the background...')
    time.sleep(2)  # green after monkey_patch, so this yields to other green threads
    print('[background] done.')
    return 42

def callback(gt, *args, **kwargs):
    """ this function is called when results are available """
    result = gt.wait()
    print("[cb] %s" % result)

@app.route('/')
def index():
    greenth = eventlet.spawn(background)
    greenth.link(callback)
    return "Hello World"

if __name__ == '__main__':
    app.run()
More on that:
http://eventlet.net/doc/patching.html#monkey-patch
One of the challenges of writing a library like Eventlet is that the built-in networking libraries don’t natively support the sort of cooperative yielding that we need.
Eventlet may indeed be suitable for your purposes, but it doesn't just fit in with any old application; Eventlet requires that it be in control of all your application's I/O.
You may be able to get away with either
Starting Eventlet's main loop in another thread, or even
Not using Eventlet and just spawning your task in another thread.
Celery may be another option.
What is the best way to continuously repeat the execution of a given function at a fixed interval while being able to terminate the executor (thread or process) immediately?
Basically I know two approaches:
use multiprocessing and a function with an infinite loop and time.sleep at the end. Processing is terminated with process.terminate() in any state.
use threading and constantly recreate timers at the end of the thread function. Processing is terminated by timer.cancel() while sleeping.
(Both "in any state" and "while sleeping" are fine, even though the latter may not be immediate.) The problem is that I have to use both multiprocessing and threading, because the latter appears not to work on ARM (some fuzzy interaction of the Python interpreter and vim; outside of vim everything is fine; I was using the second approach there, have not tried threading+loop, and no code is currently left), and the former spawns way too many processes which I would like not to see unless really required. This leads to the problem of having to code two different approaches, while threading with a loop would be just a few more imports providing drop-in replacements for all the multiprocessing stuff wrapped in if/else (except that there is no thread.terminate()). Is there some better way to do the job?
The code currently in use is here (currently with a loop for both jobs), but I do not think it will be very useful for answering the question.
Update: The reason why I am using this solution is functions that display file status (and some other things like the branch) for version control systems in the vim statusline. These statuses must be updated, but updating them immediately cannot be done without using hooks, and I have no idea how to set hooks temporarily and remove them on vim quit without possibly spoiling the user's configuration. Thus the standard solution is a cache expiring after N seconds. But when the cache expires I need to do an expensive shell call, and the delay is noticeable, the more so the heavier the IO load is. What I am implementing now is updating values for viewed buffers every N seconds in a separate process, so the delays bother that process and not me. Threads are also likely to work because the GIL does not affect calls to external programs.
I'm not clear on why a single long-lived thread that loops infinitely over the tasks wouldn't work for you, or why you end up with many processes in the multiprocessing option.
My immediate reaction would have been a single thread with a queue to feed it things to do. But I may be misunderstanding the problem.
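For instance, a single thread that repeats a function at a fixed interval and can be cancelled promptly even mid-sleep (a sketch; the interval and task are placeholders):

import threading

def repeat_every(interval, func, stop):
    # Event.wait doubles as a cancellable sleep: it returns True
    # the moment stop is set, ending the loop immediately.
    while not stop.wait(interval):
        func()

stop = threading.Event()
worker = threading.Thread(target=repeat_every,
                          args=(5.0, lambda: print('refresh status'), stop))
worker.start()
# ... later, terminate promptly; no thread.terminate() needed:
stop.set()
worker.join()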
I do not know how to do it simply and/or cleanly in Python, but I was wondering if you couldn't take advantage of an existing system scheduler, e.g. crontab on *nix systems.
There is a Python API for it, and it might satisfy your needs.
I have two threads: one writes to a file, and another periodically moves the file to a different location. The writer always calls open before writing a message and calls close after writing it. The mover uses shutil.move to do the move.
I see that after the first move is done, the writer cannot write to the file anymore, i.e. the size of the file is always 0 after the first move. Am I doing something wrong?
Locking is a possible solution, but I prefer the general architecture of having each external resource (including a file) dealt with by a single, separate thread. Other threads send work requests to the dedicated thread on a Queue.Queue instance (providing a separate queue of their own as part of the work request's parameters if they need results back); the dedicated thread spends most of its time waiting on a .get on that queue, and whenever it gets a request it executes it (returning results on the passed-in queue if needed).
I've provided detailed examples of this approach e.g. in "Python in a Nutshell". Python's Queue is intrinsically thread-safe and simplifies your life enormously.
Among the advantages of this architecture is that it translates smoothly to multiprocessing if and when you decide to switch some work to a separate process instead of a separate thread (e.g. to take advantage of multiple cores) -- multiprocessing provides its own workalike Queue type to make such a transition smooth as silk;-).
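A bare-bones sketch of that dedicated-thread pattern for the writer (using Python 3's queue module; the mover would submit its move request to the same queue so the two operations can never interleave):

import queue
import threading

def file_worker(requests):
    while True:
        item = requests.get()          # block until someone sends work
        if item is None:               # sentinel: shut the worker down
            break
        path, message = item
        with open(path, 'a') as f:     # only this thread ever touches the file
            f.write(message + '\n')

requests = queue.Queue()
worker = threading.Thread(target=file_worker, args=(requests,))
worker.start()

requests.put(('out.log', 'hello'))     # any thread can enqueue safely
requests.put(('out.log', 'world'))
requests.put(None)
worker.join()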
When two threads access the same resource, weird things happen. To avoid that, always lock the resource. Python has the convenient threading.Lock for that, as well as some other tools (see the documentation of the threading module).
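Applied to this question, a sketch where the writer and the mover share one lock so a move can never happen mid-write:

import shutil
import threading

file_lock = threading.Lock()

def write_message(path, message):
    with file_lock:                # the mover cannot run while we hold this
        with open(path, 'a') as f:
            f.write(message + '\n')

def move_file(src, dst):
    with file_lock:                # the writer cannot run while we hold this
        shutil.move(src, dst)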
Check out http://www.evanfosmark.com/2009/01/cross-platform-file-locking-support-in-python/
You can use a simple lock with his code, as written by Evan Fosmark in an older StackOverflow question:
from filelock import FileLock

with FileLock("myfile.txt"):
    # work with the file as it is now locked
    print("Lock acquired.")
One of the more elegant libraries I've ever seen.