I have a function that processes data; if a piece of data meets a certain criterion, it should be handled separately while the rest of the data continues to be processed.
As an arbitrary example: if I'm scraping a web page and collecting all the attributes of its elements, and one of the elements is a form that happens to be hidden, I want to handle it separately while the rest of the elements continue being processed:
def get_hidden_forms(element_att):
    if element_att == 'hidden':
        os.fork()
        # handle this separately
    else:
        # continue handling any elements that are not hidden
        pass
    # join both processes
Can this be done with os.fork() or is it intended for another purpose?
I know that os.fork() copies everything about the object, but I could just change values before forking, as stated in this post.
fork basically creates a clone of the process calling it with a new address space and new PID.
From that point on, both processes continue running from the instruction right after the fork() call. To tell them apart, you normally inspect its return value and decide what the appropriate action is. If it returns an int greater than 0, that is the PID of the child process and you know you are in the parent, so you continue with the parent's work. If it returns 0, you are in the child process and should do the child's work. A value less than 0 means fork has failed; Python handles that and raises OSError, which you should handle (you're still in the parent, and the parent is the only process there is).
Now the absolute minimum you need to take care of after forking a child process is to also wait() for it and reap its return code properly, otherwise you will (at least temporarily) create zombies. That effectively means you may want to install a SIGCHLD handler to reap your children's remains as they finish executing.
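To make that concrete, here is a minimal sketch of the whole pattern (return-value check plus reaping); it only works on Unix-like systems, and handle_hidden, handle_normal and the attribute list are placeholders, not something from the question:

import os

def handle_hidden(att):      # placeholder for the work done separately
    print('child %d handling %r' % (os.getpid(), att))

def handle_normal(att):      # placeholder for the normal processing path
    print('parent handling %r' % att)

child_pids = []
for att in ['visible', 'hidden', 'visible']:
    if att == 'hidden':
        pid = os.fork()
        if pid == 0:
            # pid == 0 means we are in the child: do the separate work,
            # then _exit so the child never falls back into the parent's loop
            handle_hidden(att)
            os._exit(0)
        # pid > 0 means we are in the parent and pid is the child's PID
        child_pids.append(pid)
    else:
        handle_normal(att)

# reap every child so no zombies are left behind
for pid in child_pids:
    os.waitpid(pid, 0)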
In theory you could use it the way you've described, but it may be a bit too low-level (and uncomfortable) for that. It would probably be easier to write and to read/understand if you put the work you want handled separately into dedicated code and used multiprocessing to run that extra work in separate processes.
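A rough sketch of that multiprocessing alternative might look like this (again, the handler names and the attribute list are made up for illustration):

import multiprocessing as mp

def handle_hidden(att):          # placeholder: the work you want done separately
    print('handling hidden element:', att)

def handle_normal(att):          # placeholder: the normal processing path
    print('handling element:', att)

if __name__ == '__main__':
    workers = []
    for att in ['visible', 'hidden', 'visible']:
        if att == 'hidden':
            # hand the special case to a separate process
            p = mp.Process(target=handle_hidden, args=(att,))
            p.start()
            workers.append(p)
        else:
            handle_normal(att)

    # "join both processes": wait for every worker before exiting
    for p in workers:
        p.join()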
I'm writing software in python (3.7) that involves one main GUI thread, and multiple child processes that are each operating as state machines.
I'd like the child processes to publish their current state machine state name so the main GUI thread can check on what state the state machines are in.
I want to find a way to do this such that if the main process and the child process were trying to read/write to the state variable at the same time, the main process would immediately (with no locking/waiting) get a slightly out-of-date state, and the child process would immediately (with no locking/waiting) write the current state to the state variable.
Basically, I want to make sure the child process doesn't get any latency/jitter due to simultaneous access of the state variable, and I don't care if the GUI gets a slightly outdated value.
I looked into:
using a queue.Queue with a maxsize of 1, but the behavior of queue.Queue is to block if the queue runs out of space; it would work for my purposes if it behaved like a collections.deque and silently made the oldest value walk the plank when a new one came in with no space available.
using a multiprocessing.Value, but from the documentation it sounds like you need to acquire a lock to access or write the value, and that's exactly what I want to avoid: no locking/blocking for simultaneous reads/writes. It says something vague about how, if you don't use the lock, it won't be 'process-safe', but I don't really know what that means. What bad things would happen exactly without using a lock?
What's the best way to accomplish this? Thanks!
For some reason, I had forgotten that it's possible to put into a queue in a non-blocking way!
The solution I found is to use a multiprocessing.Queue with maxsize=1, and use non-blocking writes on the producer (child process) side. Here's a short version of what I did:
Initializing in parent process:
import multiprocessing as mp
import queue
publishedValue = mp.Queue(maxsize=1)
In repeatedly scheduled GUI function ("consumer"):
try:
    # Attempt to get an updated published value
    currentState = publishedValue.get(block=False)
except queue.Empty:
    # No new published value available; keep using the previous one
    pass
In child "producer" process:
try:
    # Clear the current value in case the GUI hasn't consumed it yet
    publishedValue.get(block=False)
except queue.Empty:
    # Published value has already been consumed, no problem
    pass
try:
    # Publish the new value (newState stands in for whatever the producer wants to share)
    publishedValue.put(newState, block=False)
except queue.Full:
    # Queue is still full (the GUI may have grabbed the slot between our get and put);
    # skip this update and try again next time
    pass
Note that this does require that the child process can repeatedly attempt to re-publish the value if it gets blocked, otherwise the consumer might completely miss a published value (as opposed to simply getting it a bit late).
I think this may be possible in a bit more concise way (and probably with less overhead) with non-blocking writes to a multiprocessing.Value object instead of a queue, but the docs don't make it obvious (to me) how to do that.
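For reference, here is a sketch of what I believe that Value-based approach might look like (untested here): passing lock=False means no lock is ever taken, and the trade-off is exactly the one described above, namely that the GUI may read a slightly stale integer.

import multiprocessing as mp
import time

def producer(state):
    # child: overwrite the shared state as often as it likes, no lock taken
    for s in range(100):
        state.value = s
        time.sleep(0.01)

if __name__ == '__main__':
    # lock=False gives a raw shared ctypes int with no associated lock;
    # the docs warn this is not process-safe, which is acceptable here
    state = mp.Value('i', 0, lock=False)
    child = mp.Process(target=producer, args=(state,))
    child.start()

    for _ in range(5):
        # parent/GUI: read whatever happens to be there; may be slightly out of date
        print('current state:', state.value)
        time.sleep(0.2)

    child.join()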
Hope this helps someone.
I have a python server that eventually needs a background process to perform an action.
It creates a child process that should be able to last more than its parent. But it shouldn't create such a child process if one is already running (which can happen if a previous parent process created it).
I can think of a couple of different approaches to solve this problem:
Check all current running processes before creating the new one: Cross-platform way to get PIDs by process name in python
Write a file when the child process starts, delete it when it's done. Check the file before creating a child process.
But none of them seem to perfectly fit my needs. Solution (1) doesn't work well if the child process is a fork of its parent. Solution (2) is ugly and looks prone to failure.
It would be great if I could provide a fixed PID or name at process creation, so I could always look for the process in the system in a fixed way and be certain whether it is running or not. But I haven't found a way to do this.
"It creates a child process that should be able to last more than its parent." Don't.
Have a longer lived service process create the child for you. Talk to this service over a Unix domain socket. It then can be used to pass file descriptors to the child. The service can also trivially ensure that it only ever has a single child.
This is the pattern that can be used to eliminate the need for children that outlive their parents.
Using command names makes it trivial to mount a denial of service by just creating a process with the same name that does nothing. Using PID files is ambiguous because of PID reuse. Only a supervisor that waits on its children can properly restart them when they exit or ensure that they are running.
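A minimal sketch of such a supervisor, assuming a Unix platform; the socket path and the worker command are made up for illustration:

import os
import socket
import subprocess

SOCK_PATH = '/tmp/my_service.sock'      # made-up path
CHILD_CMD = ['python3', 'worker.py']    # made-up command for the background worker

def serve():
    try:
        os.unlink(SOCK_PATH)            # remove a stale socket from a previous run
    except FileNotFoundError:
        pass

    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(SOCK_PATH)
    server.listen(1)

    child = None
    while True:
        conn, _ = server.accept()
        with conn:
            # start the worker only if it is not already running
            if child is None or child.poll() is not None:
                child = subprocess.Popen(CHILD_CMD)
            conn.sendall(str(child.pid).encode())   # tell the client which PID is active

if __name__ == '__main__':
    serve()

Your server then connects to SOCK_PATH whenever it needs the background action, and the supervisor guarantees there is never more than one worker.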
(Language is Python 3)
I am writing a program with the multiprocessing module and using Pool. I need a variable that is shared between all of the processes. The parent process will initialize this variable and pass it as an argument to p.map(), and I want the child processes to change it.
The reason is that the first part of each child process's work should be done in parallel (computational work that doesn't need any other process's data), but the second part needs to be done in order, one process after another, because the processes are writing to a file and the contents of that file should be in order. I want each process to wait until the others are done before moving on.
I will record the "progress" of the entire program with the variable; e.g. when the first process is done writing to the file, it will increment the variable by one, and that will be the signal for the next process in line to begin writing. But I need some sort of waituntil() function to make each process wait until the Value variable indicates that it is its "turn" to write to the file.
Here are my two problems:
I need a variable that the child processes can edit, and the child processes can actually get the value of that variable. What type of variable should I use? Should I use Value, Manager, or something else?
I need the processes to wait until the variable described above equals a certain value, signaling that it is their turn to write to the file. Is there any sort of waituntil() function that I can use?
What you are looking for is called Synchronization.
There are multitudes of different synchronization primitives to choose from.
You should never attempt to write synchronization primitives on your own, as it is non-trivial to do correctly!
In your case either an Event or a Condition might be suitable.
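For the "write to the file in order" part, a sketch using a shared Value plus a Condition could look like this (shown with Process rather than Pool for brevity; the worker function, the turn counter and the output path are made up):

import multiprocessing as mp

def worker(index, turn, cond, path):
    # ... the parallel computation would happen here ...
    with cond:
        # wait until the shared counter says it is this worker's turn
        cond.wait_for(lambda: turn.value == index)
        with open(path, 'a') as f:
            f.write('result from worker %d\n' % index)
        turn.value += 1          # hand the turn to the next worker
        cond.notify_all()

if __name__ == '__main__':
    turn = mp.Value('i', 0)
    cond = mp.Condition()
    procs = [mp.Process(target=worker, args=(i, turn, cond, 'out.txt'))
             for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

cond.wait_for() plays the role of the waituntil() you asked about: it blocks until the predicate becomes true, re-checking whenever another process calls notify_all().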
I have a program that uses threads to start another thread once a certain threshold is reached. Right now the second thread is being started multiple times. I implemented a lock but I don't think I did it right.
for i in range(max_threads):
    t1 = Thread(target=grab_queue)
    t1.start()
in grab_queue, I have:
...
rows.append(resultJson)
if len(rows.value()) >= 250:
    with Lock():
        row_thread = Thread(target=insert_rows, kwargs={'rows': rows.value()})
        row_thread.start()
        rows.reset()
This starts another thread to process the list of rows. I would like to make sure that as soon as one thread hits the if condition, the other threads won't also enter it, so that extra threads to process the list of rows aren't started.
Your lock is covering the wrong portion of the code. You have a race condition between the check for the size of rows, and the portion of the code where you reset the rows. Given that the lock is taken only after the size check, two threads could easily both decide that the array has grown too large, and only then would the lock kick in to serialize the resetting of the array. "Serialize" in this case means that the task would still be performed twice, once by each thread, but it would happen in succession rather than in parallel.
The correct code could look like this:
rows.append(resultJson)
with grow_lock:
    if len(rows.value()) >= 250:
        row_thread = Thread(target=insert_rows, kwargs={'rows': rows.value()})
        row_thread.start()
        rows.reset()
There is another issue with the code as shown in the question: if Lock() refers to threading.Lock, it is creating and locking a new lock on each invocation, and in each thread! A lock protects a resource shared among threads, and to perform that function, the lock must itself be shared. To fix the problem, instantiate the lock once and pass it to the thread's target function.
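Concretely, that could look something like the following sketch (the number of workers is hard-coded here, and the body of grab_queue is elided because it depends on your rows object):

from threading import Thread, Lock

grow_lock = Lock()                  # created exactly once, shared by all workers

def grab_queue(grow_lock):
    ...                             # worker body from the question, using this shared lock

threads = [Thread(target=grab_queue, args=(grow_lock,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()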
Taking a step back, your code implements a custom thread pool. Getting that right and covering all the corner cases takes a lot of work, testing, and debugging. There are production-tested modules specialized for that purpose, such as the multiprocessing module shipped with Python (which supports both process and thread pools), and it is a good idea to get acquainted with them before reimplementing their functionality. See, for example, this article for an accessible introduction to multiprocessing-based thread pools.
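As a rough illustration of the pooled alternative, here is a sketch using multiprocessing.pool.ThreadPool; note that it restructures the problem into batching the rows up front rather than streaming them, and insert_rows and the stand-in data are placeholders:

from multiprocessing.pool import ThreadPool

def insert_rows(rows):
    # placeholder for whatever actually persists a batch of rows
    print('inserting %d rows' % len(rows))

def batches(items, size=250):
    # split the collected rows into chunks of at most `size`
    for i in range(0, len(items), size):
        yield items[i:i + size]

if __name__ == '__main__':
    all_rows = [{'n': i} for i in range(1000)]        # stand-in data
    with ThreadPool(processes=4) as pool:
        pool.map(insert_rows, batches(all_rows))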
When trying to put a large ndarray to a Queue in a Process, I encounter the following problem:
First, here is the code:
import numpy
import multiprocessing
from ctypes import c_bool
import time
def run(acquisition_running, data_queue):
    while acquisition_running.value:
        length = 65536
        data = numpy.ndarray(length, dtype='float')
        data_queue.put(data)
        time.sleep(0.1)

if __name__ == '__main__':
    acquisition_running = multiprocessing.Value(c_bool)
    data_queue = multiprocessing.Queue()
    process = multiprocessing.Process(
        target=run, args=(acquisition_running, data_queue))

    acquisition_running.value = True
    process.start()

    time.sleep(1)
    acquisition_running.value = False
    process.join()
    print('Finished')

    number_items = 0
    while not data_queue.empty():
        data_item = data_queue.get()
        number_items += 1
    print(number_items)
If I use length=10 or so, everything works fine. I get 9 items transmitted through the Queue.
If I increase to length=1000, process.join() blocks on my computer, although the function run() is already done. If I comment out the process.join() line, I see that only 2 items were put into the Queue, so apparently putting data into the Queue became very slow.
My plan is actually to transport 4 ndarrays, each with length 65536. For a Thread this worked very fast (<1 ms). Is there a way to improve the speed of transmitting data between processes?
I used Python 3.4 on a Windows machine, but I get the same behavior with Python 3.4 on Linux.
"Is there a way to improve speed of transmitting data for processes?"
Surely, given the right problem to solve. Currently, you are just filling a buffer without emptying it simultaneously. Congratulations, you have just built yourself a so-called deadlock. The corresponding quote from the documentation is:
Bear in mind that a process that has put items in a queue will wait before terminating until all the buffered items are fed by the "feeder" thread to the underlying pipe.
But, let's approach this slowly. First of all, "speed" is not your problem! I understand that you are just experimenting with Python's multiprocessing. The most important insight when reading your code is that the flow of communication between parent and child and especially the event handling does not really make sense. If you have a real-world problem that you are trying to solve, you definitely cannot solve it this way. If you do not have a real-world problem, then you first need to come up with a good one before you should start writing code ;-). Eventually, you will need to understand the communication primitives an operating system provides for inter-process communication.
Explanation for what you are observing:
Your child process generates about 10 * length * size(float) bytes of data (considering the fact that your child process can perform about 10 iterations while your parent sleeps about 1 s before it sets acquisition_running to False). While your parent process sleeps, the child puts that amount of data into a queue. You need to appreciate that a queue is a complex construct. You do not need to understand every bit of it. But one thing really is important: a queue for inter-process communication clearly uses some kind of buffer* that sits between parent and child. Buffers usually have a limited size. You are writing to this buffer from within the child without simultaneously reading from it in the parent. That is, the buffer contents steadily grow while the parent is just sleeping. By increasing length you run into the situation where the queue buffer is full and the child process cannot write to it anymore. However, the child process cannot terminate before it has written all data. At the same time, the parent process waits for the child to terminate.
You see? One entity waits for the other. The parent waits for the child to terminate and the child waits for the parent to make some space. Such a situation is called deadlock. It cannot resolve itself.
Regarding the details, the buffer situation is a little more complex than described above. Your child process has spawned an additional thread that tries to push the buffered data through a pipe to the parent. Actually, the buffer of this pipe is the limiting entity. It is defined by the operating system and, at least on Linux, is usually not larger than 65536 Bytes.
The essential part is, in other words: the parent does not read from the pipe before the child finishes attempting to write to the pipe. In every meaningful scenario where pipes are used, reading and writing happen in a rather simultaneous fashion so that one process can quickly react to input provided by another process. You are doing the exact opposite: you put your parent to sleep and therefore render it unresponsive to input from the child, resulting in a deadlock situation.
(*) "When a process first puts an item on the queue a feeder thread is started which transfers objects from a buffer into the pipe", from https://docs.python.org/2/library/multiprocessing.html
If you have really big arrays, you might want to only pass their pickled state -- or a better alternative might be to use multiprocessing.Array or multiprocessing.sharedctypes.RawArray to make a shared memory array (for the latter, see http://briansimulator.org/sharing-numpy-arrays-between-processes/). You have to worry about conflicts, as you'll have an array that's not bound by the GIL -- and needs locks. However, you only need to send array indices to access the shared array data.
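A rough sketch of that shared-memory approach (the names and the fixed size are made up for illustration): only a small Event travels through the normal IPC machinery, while the bulk data stays in a shared buffer that both processes view through numpy without copying.

import multiprocessing as mp
from multiprocessing.sharedctypes import RawArray
import numpy

LENGTH = 65536

def producer(shared, ready):
    # view the shared buffer as a numpy array -- no copy is made
    buf = numpy.frombuffer(shared, dtype=numpy.float64)
    buf[:] = numpy.random.random(LENGTH)    # fill the shared memory in place
    ready.set()                             # signal the parent that data is ready

if __name__ == '__main__':
    shared = RawArray('d', LENGTH)          # 65536 float64 slots in shared memory
    ready = mp.Event()
    p = mp.Process(target=producer, args=(shared, ready))
    p.start()

    ready.wait()                            # only this tiny signal crosses the pipe
    data = numpy.frombuffer(shared, dtype=numpy.float64)
    print(data[:5])
    p.join()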
One thing you could do to resolve that issue, in tandem with the excellent answer from JPG, is to unload your Queue between the two processes.
So do this instead:
process.start()
data_item = data_queue.get()
process.join()
While this does not fully replicate the behavior in the code (counting the number of items), you get the idea ;)
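If you also want to keep the item counting, a sketch that drains the queue while the child is still alive could look like this (it reuses process, data_queue and acquisition_running from the question's code, and assumes an extra import queue for the Empty exception):

import queue                                  # for the queue.Empty exception

process.start()

time.sleep(1)
acquisition_running.value = False

number_items = 0
# keep reading while the child is alive, so its feeder thread can flush
# the queued arrays into the pipe instead of blocking forever
while process.is_alive():
    try:
        data_item = data_queue.get(timeout=0.1)
        number_items += 1
    except queue.Empty:
        pass

# the child has exited, so whatever is left is already sitting in the pipe
while not data_queue.empty():
    data_item = data_queue.get()
    number_items += 1

process.join()
print(number_items)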
Convert the array/list to a string with str(your_array) before putting it on the queue:
q.put(str(your_array))