I'm writing software in python (3.7) that involves one main GUI thread, and multiple child processes that are each operating as state machines.
I'd like the child processes to publish their current state machine state name so the main GUI thread can check on what state the state machines are in.
I want to find a way to do this such that if the main process and the child process were trying to read/write to the state variable at the same time, the main process would immediately (with no locking/waiting) get a slightly out-of-date state, and the child process would immediately (with no locking/waiting) write the current state to the state variable.
Basically, I want to make sure the child process doesn't get any latency/jitter due to simultaneous access of the state variable, and I don't care if the GUI gets a slightly outdated value.
I looked into:
using a queue.Queue with a maxsize of 1, but the behavior of queue.Queue is to block if the queue runs out of space - it would work for my purposes if it behaved like a collections.deque and silently made the oldest value walk the plank when a new one came in with no space available.
using a multiprocessing.Value, but from the documentation it sounds like you need to acquire a lock to access or write the value, and that's exactly what I want to avoid - no locking/blocking for simultaneous reads/writes. It says something vague about how, if you don't use the lock, it won't be 'process-safe', but I don't really know what that means - what bad things would happen exactly without using a lock?
What's the best way to accomplish this? Thanks!
For some reason, I had forgotten that it's possible to put into a queue in a non-blocking way!
The solution I found is to use a multiprocessing.Queue with maxsize=1, and use non-blocking writes on the producer (child process) side. Here's a short version of what I did:
Initializing in parent process:
import multiprocessing as mp
import queue
publishedValue = mp.Queue(maxsize=1)
In repeatedly scheduled GUI function ("consumer"):
try:
    # Attempt to get an updated published value
    latestState = publishedValue.get(block=False)
except queue.Empty:
    # No new published value available; keep using the last one
    pass
In child "producer" process:
try:
    # Clear the current value in case the GUI hasn't already consumed it
    publishedValue.get(block=False)
except queue.Empty:
    # Published value has already been consumed, no problem
    pass
try:
    # Publish the new value
    publishedValue.put(newState, block=False)
except queue.Full:
    # Couldn't publish right now (the queue is momentarily full); try again next cycle
    pass
Note that this does require that the child process repeatedly attempts to re-publish the value if a put fails; otherwise the consumer could miss a published value entirely (as opposed to simply getting it a bit late).
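Putting the pieces together, here is a minimal, self-contained sketch of the pattern; child_worker, latestState, and the sleep intervals are illustrative only, not part of the real application:
import multiprocessing as mp
import queue
import time

def child_worker(publishedValue):
    # Hypothetical producer: step through a few states, publishing each one
    for state in ("INIT", "RUNNING", "DONE"):
        try:
            # Clear the previous value in case the GUI hasn't consumed it yet
            publishedValue.get(block=False)
        except queue.Empty:
            pass
        try:
            # Publish the new state without blocking
            publishedValue.put(state, block=False)
        except queue.Full:
            # Couldn't publish this time; the next state change will try again
            pass
        time.sleep(0.5)

if __name__ == '__main__':
    publishedValue = mp.Queue(maxsize=1)
    child = mp.Process(target=child_worker, args=(publishedValue,))
    child.start()
    # Stand-in for the repeatedly scheduled GUI poll
    for _ in range(10):
        try:
            latestState = publishedValue.get(block=False)
            print("child state:", latestState)
        except queue.Empty:
            pass
        time.sleep(0.2)
    child.join()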
I think this may be possible in a bit more concise way (and probably with less overhead) with non-blocking writes to a multiprocessing.Value object instead of a queue, but the docs don't make it obvious (to me) how to do that.
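For what it's worth, here is a sketch of what that might look like with a lock-free multiprocessing.Value, assuming the state can be encoded as a small integer (an index into a list of state names). The docs do warn that access without the lock is not 'process-safe', so treat this as an experiment rather than a guarantee:
import multiprocessing as mp

STATE_NAMES = ['IDLE', 'RUNNING', 'DONE']    # hypothetical state encoding

# lock=False returns a raw shared value with no synchronization at all
stateCode = mp.Value('i', 0, lock=False)

# In the child process this would be a plain assignment, no acquire/release
stateCode.value = 1

# In the GUI process this is a plain read; it may be slightly stale, which is fine here
print(STATE_NAMES[stateCode.value])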
Hope this helps someone.
Related
I have a function where certain data is being processed, and if the data meets a certain criteria, it's to be handled separate while the rest of the data is being processed.
As an arbitrary example if I'm scraping a web page and collecting all the attributes of an element, one of the elements is a form and just so happens to be hidden, I want to handle it separate, while the rest of the elements can continue being processed:
def get_hidden_forms(element_att):
    if element_att == 'hidden':
        os.fork()
        # handle this separately
    else:
        # continue handling any elements that are not hidden
    # join both processes
Can this be done with os.fork() or is it intended for another purpose?
I know that os.fork() copies everything about the object, but I could just change values before forking, as stated in this post.
fork basically creates a clone of the process calling it with a new address space and new PID.
From that point on, both processes continue running from the instruction after the fork() call. You normally inspect its return value and decide what the appropriate action is. If it returns an int greater than 0, that is the PID of the child process and you know you are in the parent, so you continue the parent's work. If it is equal to 0, you are in the child process and should do the child's work. A return value less than 0 would mean fork failed; Python handles that for you and raises OSError, which you should handle (you are still in, and there only is, the parent).
Now, the absolute minimum you need to take care of when forking a child process is to also wait() for it and reap its return code properly, otherwise you will (at least temporarily) create zombies. In practice that means you may want to implement a SIGCHLD handler to reap your process' children as they finish executing.
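Not the full SIGCHLD machinery, but a minimal sketch of the return-value handling and reaping described above; handle_hidden_form and process_other_elements are hypothetical placeholders for your own code:
import os

def handle_hidden_form():
    print("child: handling the hidden form")        # hypothetical separate work

def process_other_elements():
    print("parent: handling the other elements")    # hypothetical main work

pid = os.fork()
if pid == 0:
    # Child: do the separate work, then exit without running the parent's cleanup
    handle_hidden_form()
    os._exit(0)
else:
    # Parent: carry on with the rest of the work...
    process_other_elements()
    # ...and reap the child so it does not linger as a zombie
    os.waitpid(pid, 0)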
In theory you could use it the way you've described, but it may be a bit too "low level" (and uncomfortable) for that; it would probably be easier to write and to read/understand if you put the work you want handled separately into a dedicated function and used multiprocessing to run that extra work in a separate process.
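For illustration, a sketch of that multiprocessing version; handle_hidden_form is a hypothetical function holding the "separate" work:
from multiprocessing import Process

def handle_hidden_form(element_att):
    print("handling hidden element separately:", element_att)   # hypothetical work

def get_hidden_forms(element_att):
    if element_att == 'hidden':
        child = Process(target=handle_hidden_form, args=(element_att,))
        child.start()
        return child   # caller can join() it later, mirroring "join both processes"
    print("handling visible element:", element_att)              # normal processing

if __name__ == '__main__':
    pending = get_hidden_forms('hidden')
    get_hidden_forms('visible')
    if pending is not None:
        pending.join()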
I have a program that uses threads to start another thread once a certain threshold is reached. Right now the second thread is being started multiple times. I implemented a lock but I don't think I did it right.
for i in range(max_threads):
    t1 = Thread(target=grab_queue)
    t1.start()
in grab_queue, I have:
...
rows.append(resultJson)
if len(rows.value()) >= 250:
    with Lock():
        row_thread = Thread(target=insert_rows, kwargs={'rows': rows.value()})
        row_thread.start()
        rows.reset()
Which starts another thread to process the list of rows. I would like to make sure that as soon as it hits the if condition, the other threads wont run in order to make sure that extra threads to process the list of rows aren't started.
Your lock is covering the wrong portion of the code. You have a race condition between the check for the size of rows, and the portion of the code where you reset the rows. Given that the lock is taken only after the size check, two threads could easily both decide that the array has grown too large, and only then would the lock kick in to serialize the resetting of the array. "Serialize" in this case means that the task would still be performed twice, once by each thread, but it would happen in succession rather than in parallel.
The correct code could look like this:
rows.append(resultJson)
with grow_lock:
    if len(rows.value()) >= 250:
        row_thread = Thread(target=insert_rows, kwargs={'rows': rows.value()})
        row_thread.start()
        rows.reset()
There is another issue with the code as shown in the question: if Lock() refers to threading.Lock, it is creating and locking a new lock on each invocation, and in each thread! A lock protects a resource shared among threads, and to perform that function, the lock must itself be shared. To fix the problem, instantiate the lock once and pass it to the thread's target function.
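A minimal sketch of the "one shared lock" idea; max_threads and the worker body are just placeholders:
from threading import Thread, Lock

grow_lock = Lock()   # created once, so every worker uses the same lock object
max_threads = 4      # hypothetical value

def grab_queue(lock):
    # Only the critical section matters for this sketch
    with lock:
        print("only one thread at a time runs this block")

threads = [Thread(target=grab_queue, args=(grow_lock,)) for _ in range(max_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()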
Taking a step back, your code implements a custom thread pool. Getting that right and covering all the corner cases takes a lot of work, testing, and debugging. There are production-tested modules specialized for that purpose, such as the multiprocessing module shipped with Python (which supports both process and thread pools), and it is a good idea to get acquainted with them before reimplementing their functionality. See, for example, this article for an accessible introduction to multiprocessing-based thread pools.
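As a rough illustration of the pool approach, multiprocessing.dummy provides a Pool with the multiprocessing interface but backed by threads; insert_rows and the batches below are hypothetical stand-ins:
from multiprocessing.dummy import Pool as ThreadPool   # thread-backed Pool

def insert_rows(rows):
    print("inserting", len(rows), "rows")   # stand-in for the real database insert

# Hypothetical pre-batched work: four batches of 250 rows each
batches = [[{"id": i}] * 250 for i in range(4)]

with ThreadPool(processes=4) as pool:
    pool.map(insert_rows, batches)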
I've got 2 threads:
A worker thread, that loops looking for input from an ssh socket
A manager thread, that processes stuff from the worker thread
They use a Queue to communicate - as stuff comes in, the worker places it on the Queue if it's important, and the manager takes it off to process.
However, I'd like the manager to also know the last time anything came in - whether important or not.
My thought was that the worker could set an integer (say), and the manager could read it. But there doesn't seem to be a threading primitive that supports this.
Is it safe for the manager to just read the worker's instance variables, providing it doesn't write to them? Or will that give some shared memory issues? Is there some way I can share this state without putting all the junk stuff in the Queue?
Is it safe for the manager to just read the worker's instance variables, providing it doesn't write to them?
Yes, this is safe in CPython. Because of the GIL, it's impossible for one thread to be reading the value of a variable while another thread is in the process of writing it. Each of those operations is a single bytecode instruction, which makes them atomic - the GIL is held for the entire instruction, so no other thread can be executing at the same time. One has to happen either before or after the other. You'll only run into issues if you have two different threads trying to do non-atomic operations on the same object (incrementing the integer, for example). If that were the case, you'd need a threading.Lock() shared between the two threads to synchronize access to the integer.
Do note that the behavior of bytecode (and even the existence of the GIL) is considered an implementation detail, and is therefore subject to change:
CPython implementation detail: Bytecode is an implementation detail of the CPython interpreter! No guarantees are made that bytecode will not be added, removed, or changed between versions of Python.
So, if you want to be absolutely safe across all versions and implementations of Python, use a Lock, even though it's not actually necessary right now (and in reality, probably won't ever be) in CPython.
Using a Lock to synchronize access to a variable is very straightforward:
import threading

lock = threading.Lock()
Thread 1:
with lock:
    print(shared_int)  # Some read operation
# Lock is released once we leave the with block
Thread 2:
with lock:
    shared_int = 55  # Some write operation
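Applied to the original question (the worker recording when it last saw any input), a sketch could look like this; the attribute name last_seen and the sleep intervals are illustrative only:
import threading
import time

class Worker(threading.Thread):
    def __init__(self):
        super().__init__(daemon=True)
        self.last_seen = time.monotonic()   # the manager only ever reads this

    def run(self):
        while True:
            # ... read from the ssh socket here, put important items on the Queue ...
            self.last_seen = time.monotonic()   # single assignment: atomic in CPython
            time.sleep(1)

worker = Worker()
worker.start()
time.sleep(2.5)
idle_for = time.monotonic() - worker.last_seen   # manager-side read, no lock
print("worker last saw input %.1f seconds ago" % idle_for)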
How to correctly fork a child process in twisted that does not use anything from twisted (but uses data from the parent process) (e.g. to process a “snapshot” of some data from the parent process and write it to file, without blocking)?
It seems that if I do anything like a clean shutdown in the child process after os.fork(), it closes some of the sockets/descriptors in the parent process; the only way I see to avoid that is to call os.kill(os.getpid(), signal.SIGKILL), which does seem like a bad idea (though not directly problematic).
(Additionally, if a dict is changed in the parent process, can it change in the child process too? A quick test shows that it doesn't. OS/kernels are Debian stable / sid.)
IReactorProcess.spawnProcess (usually available as from twisted.internet import reactor; reactor.spawnProcess) can spawn a process running any available executable on your system. The subprocess does not need to use Twisted, or, indeed, even be in Python.
Do not call os.fork yourself. As you've discovered, it has lots of very peculiar interactions with process state, that spawnProcess will manage for you.
Among the problems with os.fork are:
Forking copies your current process state, but doesn't copy the state of threads. This means that any thread in the middle of modifying some global state will leave things half-broken, possibly holding some locks which will never be released. Don't run any threads in your application? Have you audited every library you use, every one of its dependencies, to ensure that none of them have ever or will ever use a background thread for anything?
You might think you're only touching certain areas of your application's memory, but thanks to Python's reference counting, any object you even peripherally look at (or that is present on the stack) may have its reference count incremented or decremented. Incrementing or decrementing a refcount is a write operation, which means the whole memory page (not just that one object) gets copied into your process. So forked processes in Python tend to accumulate a much larger copied set than, say, forked C programs.
Many libraries, famously all of the libraries that make up the systems on macOS and iOS, cannot handle fork() correctly and will simply crash your program if you attempt to use them after fork but before exec.
There's a flag for telling file descriptors to close on exec - but no such flag to have them close on fork. So any files (including log files, and again, any background temp files opened by libraries you might not even be aware of) can get silently corrupted or truncated if you don't manage access to them carefully.
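For reference, a minimal spawnProcess sketch might look something like this; write_snapshot.py is a hypothetical helper script that reads the snapshot from stdin and writes it to a file, and the interpreter path is just an example:
from twisted.internet import protocol, reactor

class SnapshotWriter(protocol.ProcessProtocol):
    def __init__(self, snapshot):
        self.snapshot = snapshot

    def connectionMade(self):
        # Hand the child a serialized snapshot of the parent's data via stdin
        self.transport.write(self.snapshot)
        self.transport.closeStdin()

    def processEnded(self, reason):
        print("snapshot child finished:", reason.value)
        reactor.stop()

snapshot = b"serialized parent data"          # hypothetical snapshot
reactor.spawnProcess(
    SnapshotWriter(snapshot),
    "/usr/bin/python3",                       # example interpreter path
    args=["python3", "write_snapshot.py"],    # hypothetical helper script
    env=None,                                 # inherit the parent's environment
)
reactor.run()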
So I have the following program:
https://github.com/eWizardII/homobabel/blob/master/Experimental/demo_async_falcon.py
However, when it's run I only get two active threads running. How can I make it so that more threads run? I have tried things like urlv2 = birdofprey(ip2) where ip2 = str(host+1), but that just ends up sending the same thing to two threads. Any help would be appreciated.
Thanks,
An active count of 2 means that you have one of your designed threads (birdofprey) and the main thread. This is because you use a lock, so the second birdofprey thread waits for the first, and so on. I didn't dig deeper into the algorithm, but it seems that you don't need to lock the birdofprey threads, since they don't share any data (I could be wrong). If they do share data, you should synchronize access to the shared data only, rather than lock the whole body of run.
Update upon comment
remove the locks (if there is no shared data; storage_i is not shared data);
in the for loop, create the threads, start them, and append them to a list;
make a second loop over the list of threads, call join() on each, and collect the information you need (see the sketch below).
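A sketch of the create-then-join pattern described above; birdofprey_work and the host list are hypothetical stand-ins for the real scanner:
from threading import Thread

def birdofprey_work(host):
    print("scanning", host)   # stand-in for the real birdofprey.run() body

hosts = ["192.168.0.%d" % i for i in range(1, 6)]   # hypothetical targets

threads = []
for host in hosts:            # first loop: create and start every thread
    t = Thread(target=birdofprey_work, args=(host,))
    t.start()
    threads.append(t)

for t in threads:             # second loop: wait for all of them to finish
    t.join()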
On line 75, urlv.join() blocks until the thread finishes. So you actually create one thread, wait until it's done, and only then start the next. The other thread is the main thread.
I think the problem is that you need to pull urlv.join() out of the for loop. Right now, because of the join, you're waiting for your new thread to complete before starting the next one.
But for general readability, maintainability, etc., you might want to consider using Python's Queue class to set up a work queue and have a pool of worker threads pulling from it.
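A rough sketch of that work-queue idea, with made-up host strings as the work items:
import queue
import threading

def worker(work_queue):
    while True:
        host = work_queue.get()
        if host is None:                  # sentinel: time to shut down
            work_queue.task_done()
            break
        print("processing", host)         # stand-in for the real per-host work
        work_queue.task_done()

work_queue = queue.Queue()
workers = [threading.Thread(target=worker, args=(work_queue,)) for _ in range(4)]
for w in workers:
    w.start()

for host in ["192.168.0.%d" % i for i in range(1, 11)]:   # hypothetical work items
    work_queue.put(host)

work_queue.join()                         # wait until every item has been processed
for _ in workers:
    work_queue.put(None)                  # one sentinel per worker
for w in workers:
    w.join()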