Heavy I/O and Python multiprocessing/multithreading

I am designing a little piece of software which involves:
Fetching a resource on the internet,
Some user interaction (quick editing of the resource),
Some processing.
I would like to do this with many resources (they are all given in a list). Each is independent of the others. Since the editing part is quite tedious, I would like to make life easier for the user (probably me), so that he does not have to wait for the download of each resource. For simplicity, we forget the third task here.
My idea was to use the threading or multiprocessing module. Some thread (say thread 1) would do the "download" in advance while another (say thread 2) one would interact with the user on an already downloaded resource.
Question: How can I make sure that thread 1 stays at least ahead_min resources ahead of thread 2, and at most ahead_max (ahead_max > ahead_min), at all times?
I typically would need something similar to Queue.Queue(ahead_max) (or multiprocessing.Queue(ahead_max)), except that once ahead_max is attained, insertion blocks until there are at most ahead_min elements left in the queue (whereas a standard bounded Queue unblocks as soon as a single slot frees up; see http://docs.python.org/library/queue.html#module-Queue). Popping should likewise block until at least ahead_min+1 elements are in the queue (at the end of the sequence of resources I can then insert some dummy objects to ensure even the last resource is processed).
Any idea? If you think of any simpler alternative, please share!

In this case I would suggest subclassing Queue and implementing your own logic. This should be an easy task, as the implementation of the Queue class is already in Python.
You can use this as a template:

from queue import Queue

class MyQueue(Queue):
    def put(self, item, block=True, timeout=None):
        ...

    def get(self, block=True, timeout=None):
        ...
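To illustrate the hysteresis behavior the question asks for, here is a minimal sketch built directly on threading.Condition rather than on Queue itself (the names ahead_min and ahead_max come from the question; the class name and structure are mine, not a tested implementation):

import threading
from collections import deque

class HysteresisQueue:
    def __init__(self, ahead_min, ahead_max):
        assert ahead_max > ahead_min >= 0
        self.ahead_min = ahead_min
        self.ahead_max = ahead_max
        self._items = deque()
        self._cond = threading.Condition()
        self._draining = False  # True while put() is being held back

    def put(self, item):
        with self._cond:
            # Once ahead_max is reached, refuse further insertions
            # until consumption brings the level down to ahead_min.
            while self._draining:
                self._cond.wait()
            self._items.append(item)
            if len(self._items) >= self.ahead_max:
                self._draining = True
            self._cond.notify_all()

    def get(self):
        with self._cond:
            while not self._items:
                self._cond.wait()
            item = self._items.popleft()
            if self._draining and len(self._items) <= self.ahead_min:
                self._draining = False
            self._cond.notify_all()
            return item

The symmetric rule (blocking get() until at least ahead_min+1 elements are queued) can be added the same way, with a second flag plus the dummy end-of-sequence objects mentioned in the question to flush the tail.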

First of all, it seems that threading is preferable to multiprocessing in this case, because your task seems to be more I/O-bound than CPU-bound. Then, yes, make use of queues to set up the communication between the different "modules". If the default pop behavior is not enough for you, you can play with Queue.qsize() and implement your own logic.

Related

"Published" value accessible across processes in python

I'm writing software in python (3.7) that involves one main GUI thread, and multiple child processes that are each operating as state machines.
I'd like the child processes to publish their current state machine state name so the main GUI thread can check on what state the state machines are in.
I want to find a way to do this such that if the main process and the child process were trying to read/write to the state variable at the same time, the main process would immediately (with no locking/waiting) get a slightly out-of-date state, and the child process would immediately (with no locking/waiting) write the current state to the state variable.
Basically, I want to make sure the child process doesn't get any latency/jitter due to simultaneous access of the state variable, and I don't care if the GUI gets a slightly outdated value.
I looked into:

using a queue.Queue with a maxsize of 1, but the behavior of queue.Queue is to block if the queue runs out of space. It would work for my purposes if it behaved like a collections.deque and silently made the oldest value walk the plank when a new one came in with no available space.

using a multiprocessing.Value, but from the documentation it sounds like you need to acquire a lock to access or write the value, and that's what I want to avoid: no locking/blocking for simultaneous reads/writes. It says something vague about how, if you don't use the lock, it won't be 'process-safe', but I don't really know what that means. What bad things would happen exactly without using a lock?
What's the best way to accomplish this? Thanks!
For some reason, I had forgotten that it's possible to put into a queue in a non-blocking way!
The solution I found is to use a multiprocessing.Queue with maxsize=1, and use non-blocking writes on the producer (child process) side. Here's a short version of what I did:
Initializing in parent process:
import multiprocessing as mp
import queue
publishedValue = mp.Queue(maxsize=1)
In the repeatedly scheduled GUI function ("consumer"):

try:
    # Attempt to get an updated published value
    latestState = publishedValue.get(block=False)
except queue.Empty:
    # No new published value available
    pass
In the child "producer" process:

try:
    # Clear the current value in case the GUI hasn't already consumed it
    publishedValue.get(block=False)
except queue.Empty:
    # Published value has already been consumed, no problem
    pass
try:
    # Publish the new value (newState is whatever the state machine just produced)
    publishedValue.put(newState, block=False)
except queue.Full:
    # Can't publish the value right now; try again on the next cycle
    pass
Note that this does require that the child process can repeatedly attempt to re-publish the value if it gets blocked, otherwise the consumer might completely miss a published value (as opposed to simply getting it a bit late).
I think this may be possible in a bit more concise way (and probably with less overhead) with non-blocking writes to a multiprocessing.Value object instead of a queue, but the docs don't make it obvious (to me) how to do that.
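For completeness, the Value-based variant might look something like the sketch below. Passing lock=False makes multiprocessing.Value return a raw ctypes object in shared memory with no lock at all; Python then makes no atomicity promise (presumably what the 'not process-safe' warning refers to), though for a single small integer a torn read or write is not a practical concern on common platforms. Treat this as an assumption-laden sketch, not a guarantee:

import multiprocessing as mp

# Created in the parent before spawning children, then passed to them.
state = mp.Value('i', 0, lock=False)   # raw shared int, no lock wrapper

# Child ("producer"): write without acquiring anything.
state.value = 3        # e.g. an integer code for the current FSM state

# GUI ("consumer"): read without acquiring anything; may be slightly stale.
current = state.value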
Hope this helps someone.

When to use event/condition/lock/semaphore in python's threading module?

Python provides four different synchronization mechanisms in the threading module: Event/Condition/Lock(RLock)/Semaphore.
I understand they can be used to synchronize access of shared resources/critical sections between threads. But I am not quite sure when to use which.
Can they be used interchangeably? Or are some of them 'higher level', using others as building blocks? If so, which ones are built on which?
It would be great if someone can illustrate with some examples.
This article probably contains all the information you need. The question is indeed very broad, but let me try to explain how I use each as an example:
Event - Use it when you need threads to communicate that a certain state was reached, so they can both work together in sync. I use it mostly for the initialization of two threads where one depends on the other.
Example: A client has a threaded manager, and its __init__() needs to know the manager is done instantiating some attributes before it can move on.
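A minimal sketch of that handshake (the Manager class and the sleep stand in for real setup work; the names are illustrative):

import threading
import time

ready = threading.Event()

class Manager(threading.Thread):
    def run(self):
        time.sleep(0.5)      # stand-in for instantiating some attributes
        self.state = "ready"
        ready.set()          # tell dependents that setup is done

manager = Manager()
manager.start()

# Client side: block until the manager signals, then it is safe to proceed.
ready.wait()
print(manager.state)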
Lock/RLock - Use it when you are working with a shared resource and you want to make sure no other thread is reading/writing to it. Although I'd argue that while locking before writing is mandatory, locking before reading could be optional. But it is good to make sure that while you are reading/writing, no other thread is modifying it at the same time. RLock can be acquired multiple times by its owner, and release() must be called the same number of times acquire() was before another thread trying to acquire it can succeed.
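Reentrancy is what makes the following sketch safe; with a plain Lock, the inner acquire would deadlock:

import threading

rlock = threading.RLock()
shared = []

def append_one(item):
    with rlock:              # re-acquired by the thread that already holds it
        shared.append(item)

def append_two(a, b):
    with rlock:              # outer acquire
        append_one(a)        # inner call takes the same RLock again
        append_one(b)

append_two(1, 2)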
I haven't used Condition that much, and frankly I've never had to use Semaphore, so this answer has room for editing and improvement.

CherryPy ThreadPool not growing and/or shrinking

I'm currently using CherryPy 3.2.2 and am having an issue where my ThreadPool does not grow or shrink at all. Looking through the source of wsgiserver2.py, I see two functions in the ThreadPool class, 'grow' and 'shrink'. If you download the entire repo and search for calls to those two functions, there are none. Perhaps they are being invoked in some other way that is foreign to me, but I would like to know whether this is an oversight or I'm just looking in the wrong places.
Note: I'm setting the values (thread_pool and thread_pool_max) correctly before start is called on the Server, from the ServerAdapter, so its not that.
Thanks for all your help.
pcarl
You're correct. Neither ThreadPool.shrink nor ThreadPool.grow is called anywhere in the CherryPy flow, and thread_pool_max has no effect unless you call these two methods explicitly.
Normally CherryPy will lazily instantiate thread workers up to thread_pool and will stop there.
If you're sure you need a thread pool so big that it causes serious memory overhead for your application, you can subclass cherrypy.process.plugins.Monitor to watch the thread queue size (or some other metric) and grow and shrink the pool accordingly. Luckily there's already one out there.
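If you roll your own instead, a rough sketch might look like this. The thresholds are arbitrary, and the qsize/idle attributes should be checked against your exact wsgiserver2 version, so treat it as an outline rather than tested code:

import cherrypy
from cherrypy.process.plugins import Monitor

class ThreadPoolResizer(Monitor):
    def __init__(self, bus, frequency=30, step=5):
        self.step = step
        Monitor.__init__(self, bus, self.resize, frequency)

    def resize(self):
        server = cherrypy.server.httpserver
        if server is None:
            return                        # server not started yet
        pool = server.requests            # the ThreadPool instance
        if pool.qsize > 0:                # requests waiting for a worker
            pool.grow(self.step)
        elif pool.idle > self.step:       # more idle workers than needed
            pool.shrink(self.step)

ThreadPoolResizer(cherrypy.engine).subscribe()   # before engine.start()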

Atomic operations between main thread and subthread in python

I have a list in my python program that gets new items on certain occasions (It's a message-queue consumer). Then I have a thread that every few minutes checks to see if there's anything in the list, and if there is then I want to do an action on each item and then empty the list.
Now my problem: should I use locks to ensure that the action in the subthread is atomic, and does this ensure that the main thread can't alter the list while I'm going through the list?
Or should I instead use some kind of flag?
Pseudocode to make my problem clearer.
Subthread:

def run(self):
    while 1:
        if get_main_thread_list() is not empty:
            do_operations()
            empty_the_list()
        sleep(30)

Main thread:

list = []

def on_event(item):
    list.add(item)

def main():
    start_thread()
    start_listening_to_events()
I hope this makes my problem clearer, and any links to resources or comments are obviously welcome!
PS: I'm well aware that I might just not grasp threaded programming well enough for this question; if you believe so, please take some time to explain what's wrong with my reasoning.
should I use locks to ensure that the action in the subthread is atomic, and does this ensure that the main thread can't alter the list while I'm going through the list?
Yes. And yes, provided you implement it correctly.
Or should I instead use some kind of flag?
"some kind of flag" == lock, so you'd better use threading locks.
Important: It looks to me like you're trying to reimplement the queue module from the stdlib; you might want to take a look at it.
Besides having a bunch of interesting features, it is also thread-safe:
The queue module implements multi-producer, multi-consumer queues. It is especially useful in threaded programming when information must be exchanged safely between multiple threads. The Queue class in this module implements all the required locking semantics.
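As a sketch of that queue-based approach applied to your pseudocode (do_operations is yours; everything else is standard library):

import queue
import threading

q = queue.Queue()          # thread-safe, no explicit locks needed

def worker():
    while True:
        items = [q.get()]            # block until something arrives
        while True:                  # then drain whatever else is queued
            try:
                items.append(q.get_nowait())
            except queue.Empty:
                break
        do_operations(items)         # from your pseudocode

def on_event(item):                  # main thread: just put, no locking
    q.put(item)

threading.Thread(target=worker, daemon=True).start()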

Python multiple threads accessing same file

I have two threads, one which writes to a file, and another which periodically moves the file to a different location. The writer always calls open before writing a message and close after writing it. The mover uses shutil.move to do the move.
I see that after the first move is done, the writer cannot write to the file anymore, i.e. the size of the file is always 0 after the first move. Am I doing something wrong?
Locking is a possible solution, but I prefer the general architecture of having each external resource (including a file) dealt with by a single, separate thread. Other threads send work requests to the dedicated thread on a Queue.Queue instance (and provide a separate queue of their own as part of the work request's parameters if they need results back). The dedicated thread spends most of its time waiting on a .get on that queue, and whenever it gets a request it goes on and executes it (and returns results on the passed-in queue if needed).
I've provided detailed examples of this approach e.g. in "Python in a Nutshell". Python's Queue is intrinsically thread-safe and simplifies your life enormously.
Among the advantages of this architecture is that it translates smoothly to multiprocessing if and when you decide to switch some work to a separate process instead of a separate thread (e.g. to take advantage of multiple cores) -- multiprocessing provides its own workalike Queue type to make such a transition smooth as silk;-).
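A minimal sketch of that architecture for the writer/mover case (the message format and file names are illustrative):

import queue
import shutil
import threading

file_q = queue.Queue()

def file_worker(path):
    # Sole owner of the file: every write and move happens on this thread.
    while True:
        kind, payload, reply_q = file_q.get()
        if kind == "write":
            with open(path, "a") as f:
                f.write(payload)
        elif kind == "move":
            shutil.move(path, payload)
        if reply_q is not None:
            reply_q.put("done")      # per-request reply queue, as described

threading.Thread(target=file_worker, args=("log.txt",), daemon=True).start()

# Any thread may request work; only the worker ever touches the file.
file_q.put(("write", "hello\n", None))
file_q.put(("move", "archive/log.txt", None))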
When two threads access the same resources, weird things happen. To avoid that, always lock the resource. Python has the convenient threading.Lock for that, as well as some other tools (see documentation of the threading module).
Check out http://www.evanfosmark.com/2009/01/cross-platform-file-locking-support-in-python/
You can use a simple lock with his code, as written by Evan Fosmark in an older StackOverflow question:
from filelock import FileLock

with FileLock("myfile.txt"):
    # work with the file as it is now locked
    print("Lock acquired.")
One of the more elegant libraries I've ever seen.
