I have a problem with creating a parallel program using multiprocessing. AFAIK when I start a new process using this module (multiprocessing) I should call "os.wait()" or "childProcess.join()" to get its exit status. But placing those calls in my program can end up blocking the main process if something happens to the child process (e.g. the child process hangs).
The problem is that if I don't do that, my child processes become zombies (listed as something like "python <defunct>" in the top listing).
Is there any way to avoid waiting for child processes to end, to avoid creating zombie processes, and/or to keep the main process from having to care so much about its child processes?
Though ars' answer should solve your immediate issues, you might consider looking at celery: http://ask.github.com/celery/index.html. It's a relatively developer-friendly approach to accomplishing these goals and more.
You may have to provide more information or actual code to figure this out. Have you been through the documentation, in particular the sections labeled "Warning"? For example, you may be facing something like this:
Warning: As mentioned above, if a child process has put items on a queue (and it has not used JoinableQueue.cancel_join_thread()), then that process will not terminate until all buffered items have been flushed to the pipe.
This means that if you try joining that process you may get a deadlock unless you are sure that all items which have been put on the queue have been consumed. Similarly, if the child process is non-daemonic then the parent process may hang on exit when it tries to join all its non-daemonic children.
Note that a queue created using a manager does not have this issue. See Programming guidelines.
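For example, a minimal sketch of the safe ordering (consume the queue before joining; the item count and names are arbitrary, not from the question):

from multiprocessing import Process, Queue

def worker(q):
    for i in range(1000):
        q.put(i)                                  # items are buffered and flushed to a pipe in the background

if __name__ == "__main__":
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()
    results = [q.get() for _ in range(1000)]      # consume everything the child put
    p.join()                                      # safe now: nothing is left buffered in the child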
I have a python server that eventually needs a background process to perform an action.
It creates a child process that should be able to outlive its parent. But it shouldn't create such a child process if one is already running (this can happen if a previous parent process created it).
I can think of a couple of different approaches to solve this problem:
Check all current running processes before creating the new one: Cross-platform way to get PIDs by process name in python
Write a file when the child process starts, delete it when it's done. Check the file before creating a child process.
But neither of them seems to fit my needs perfectly. Solution (1) doesn't work well if the child process is a fork of its parent. Solution (2) is ugly and looks prone to failure.
It would be great if I could assign a fixed PID or name at process creation, so I could always look the process up in the system in a predictable way and be certain whether it is running or not. But I haven't found a way to do this.
"It creates a child process that should be able to last more than its parent." Don't.
Have a longer-lived service process create the child for you. Talk to this service over a Unix domain socket. It can then be used to pass file descriptors to the child. The service can also trivially ensure that it only ever has a single child.
This is the pattern that can be used to eliminate the need for children that outlive their parents.
Using command names makes it trivial to mount a DoS by just creating a do-nothing process with the same name. Using PID files is ambiguous because of PID reuse. Only a supervisor that waits on its children can reliably restart them when they exit, or guarantee that they are running at all.
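A hedged sketch of that idea, leaving out the Unix domain socket part: a small supervisor that always owns exactly one child, waits on it, and restarts it when it exits (run_worker and RESTART_DELAY are made-up names, not part of the answer):

import time
from multiprocessing import Process

RESTART_DELAY = 1.0

def run_worker():
    time.sleep(5)                     # stand-in for the real background work

if __name__ == "__main__":
    while True:
        child = Process(target=run_worker)
        child.start()
        child.join()                  # the supervisor always waits on its child,
        time.sleep(RESTART_DELAY)     # so there is never a zombie and never a duplicate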
While using the multiprocessing module in Python, is there a way to prevent the CPU from switching away to another process for a certain time?
I have ~50 different child processes spawned to retrieve data from a database (one process per table in the DB), and after querying and filtering the data, I try to write the output to an Excel file.
Since all the processes go through similar steps, they all reach the writing stage at similar times, and since I am writing to a single file, I have a lock that prevents multiple processes from writing to it at once.
The problem is that the writing seems to take very long compared to when I wrote the same amount of data from a single process (slower by at least 10x).
I am guessing one of the reasons could be that while writing, the CPU keeps switching to the other processes, which are all stuck at the mutex lock, only to come back to the one process that is actually active. I am guessing this context switching is a significant waste of time, since there are a lot of processes to switch back and forth between.
I was wondering if there was a way to lock a process such that, for a certain part of the code, no context switching between processes happens.
Or any other suggestions to speed up this process?
Don't use locking and don't write from multiple processes; let the child processes return the output to the parent (e.g. via standard output), and have the parent wait for the processes to finish before reading it. I'm not 100% sure about the multiprocessing API, but you could have the parent process sleep, wait for a SIGCHLD, and only then read data from an exited child's standard output and write it to your output file.
This way only one process is writing to the file and you don't need any busy looping or whatever. It will be much simpler and much more efficient.
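For instance, a minimal sketch of the single-writer idea using a multiprocessing Pool rather than SIGCHLD (query_table, the table names, and the output file are assumptions, not from the question):

from multiprocessing import Pool

def query_table(table_name):
    # Stand-in for querying and filtering one table.
    return table_name, "filtered rows for " + table_name

if __name__ == "__main__":
    tables = ["table_%d" % i for i in range(50)]
    with Pool(processes=8) as pool:
        results = pool.map(query_table, tables)    # children only compute, never write
    with open("output.txt", "w") as out:           # exactly one writer: the parent
        for name, rows in results:
            out.write("%s: %s\n" % (name, rows))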
You can raise the priority of your process (go to Task Manager, right-click on the process and raise its priority). However, the OS will context-switch no matter what; your process has no better claim on the OS than the other processes.
I am using a Python process to run one of my functions like so:
Process1 = Process(target=someFunction)
Process1.start()
Now that function has no looping or anything; it just does its thing and then ends. Does the Process die with it, or do I always need to add a:
Process1.terminate()
Afterwards?
The child process will exit by itself - the Process1.terminate() is unnecessary in that regard. Avoiding terminate() matters especially if any shared resources are used between the child and parent process. From the Python documentation:
Avoid terminating processes
Using the Process.terminate method to stop a process is liable to cause any shared resources (such as locks, semaphores, pipes and queues) currently being used by the process to become broken or unavailable to other processes.
Therefore it is probably best to only consider using Process.terminate on processes which never use any shared resources.
However, if you want the parent process to wait for the child process to finish (perhaps the child process is modifying something that the parent will access afterwards), then you'll want to use Process1.join() to block the parent process from continuing until the child process completes. This is generally good practice when using child processes, to avoid zombie processes or orphaned children.
No, as per the documentation it only sends a SIGTERM or TerminateProcess() to the process in question. If it has already exited then there is nothing to terminate.
However, it is always good practice to use exit codes in your subprocesses:
import sys
sys.exit(1)
And then check the exit code once you know the process has terminated:
if Process1.exitcode:   # exitcode is an attribute, not a method; non-zero means failure
    errorHandle()
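Putting the two pieces together, a minimal runnable sketch (someFunction and errorHandle are just stand-ins for the names used above):

import sys
from multiprocessing import Process

def someFunction():
    sys.exit(1)                     # stand-in for work that signals failure

def errorHandle():
    print("child reported an error")

if __name__ == "__main__":
    Process1 = Process(target=someFunction)
    Process1.start()
    Process1.join()                 # exitcode stays None until the child has terminated
    if Process1.exitcode:           # non-zero means the child exited with an error
        errorHandle()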
I am using the multiprocessing module to fork child processes. Since on forking the child process gets the address space of the parent process, I am getting the same logger for parent and child. I want to clear the child process's address space of any values carried over from the parent. I have learned that multiprocessing does a fork() at a lower level, but not exec(). I want to know whether it is good to use multiprocessing in my situation, or whether I should go for an os.fork() and os.exec() combination, or is there any other solution?
Thanks.
Since multiprocessing is running a function from your program as if it were a thread function, it definitely needs a full copy of your process' state. That means doing fork().
Using a higher-level interface provided by multiprocessing is generally better. At least you should not care about the fork() return code yourself.
os.fork() is a lower-level function providing less service out of the box, though you certainly can use it for anything multiprocessing is used for... at the cost of partially reimplementing multiprocessing's code. So, I think, multiprocessing should be OK for you.
However, if your process's memory footprint is too large to duplicate (or if you have other reasons to avoid forking -- open connections to databases, open log files etc.), you may have to turn the function you want to run in a new process into a separate Python program. Then you can run it using subprocess, pass parameters via its stdin, capture its stdout and parse the output to get results.
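A hedged sketch of that subprocess approach; worker.py and the line-based stdin/stdout protocol are assumptions, not something from the question:

import subprocess

# Run the work in a completely separate interpreter so nothing is inherited.
proc = subprocess.Popen(
    ["python", "worker.py"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)
out, _ = proc.communicate(input=b"param1 param2\n")   # pass parameters via stdin
result = out.decode().strip()                          # parse the child's stdout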
UPD: the os.exec...() family of functions is hard to use for most purposes, since it replaces your process with the spawned one (if you exec the same program that is already running, it will restart from the very beginning, keeping no in-memory data). However, if you really do not need to continue the parent process's execution, exec() may be of some use.
From my personal experience: os.fork() is used very often to create daemon processes on Unix; I often use subprocess (communicating through stdin/stdout); I have almost never used multiprocessing; and not once in my life have I needed os.exec...().
You can just rebind the logger in the child process to its own. I don't know about other OSes, but on Linux forking doesn't duplicate the entire memory footprint (as Ellioh mentioned); it uses copy-on-write. So until you change something in the child process, the data stays in the parent process's memory pages. For instance, you can fork 100 child processes (that only read memory, never write to it) and check the overall memory usage; it will not be parent_memory_usage * 100, but much less.
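A minimal sketch of rebinding the logger in the child (the logger name and log file are assumptions):

import logging
from multiprocessing import Process

def worker():
    # Drop the handlers inherited from the parent and attach the child's own.
    logger = logging.getLogger("app")
    logger.handlers = []
    logger.addHandler(logging.FileHandler("child.log"))
    logger.setLevel(logging.INFO)
    logger.info("logging with the child's own handler")

if __name__ == "__main__":
    p = Process(target=worker)
    p.start()
    p.join()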
I want to create some worker processes and if they crash due to an exception, I would like them to respawn. Aside from the is_alive method in the multiprocessing module, I can't seem to find a way to do this.
This would require me to iterate over all the running processes (after a sleep) and check if they are alive. This is essentially a busy loop, so I was wondering if there was a better solution that will wake up my program in the event that any one of my worker processes has crashed. Once it wakes up, I would like to log the exception that crashed the worker and launch another process.
Polling to see if the child processes are alive should work fine, since it's a low-overhead check and you don't need to check that often.
The first answer to this (similar) question has a Python code example: Multi-server monitor/auto restarter in python
You can wrap your worker processes in try/except blocks where the except pushes a message onto a pipe before raising. Of course, polling isn't really worse than this and it's simpler.
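A minimal sketch of that idea (the worker body and the names here are assumptions, not from the question):

import traceback
from multiprocessing import Process, Pipe

def worker(conn):
    try:
        raise RuntimeError("something went wrong")   # stand-in for the real work
    except Exception:
        conn.send(traceback.format_exc())            # tell the parent what happened
        raise                                        # still let the worker die

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    p = Process(target=worker, args=(child_conn,))
    p.start()
    p.join()
    if parent_conn.poll():
        print("worker crashed:\n" + parent_conn.recv())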
If you're on a unix-like system, your main program can be notified of dead children by installing a signal handler. Look up your operating system's documentation on signal(), especially SIGCHLD. I'm afraid I don't remember whether Windows covers SIGCHLD with its very limited POSIX signal support.
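On such systems, a hedged sketch might look like the following; SIGCHLD is used only as a wake-up and multiprocessing's own is_alive()/exitcode does the bookkeeping (the worker body is a placeholder, and production code would still have to handle the race where a child dies just before signal.pause() is called):

import signal
import time
from multiprocessing import Process

def worker():
    time.sleep(5)                      # stand-in for real work that might crash

def on_sigchld(signum, frame):
    pass                               # the handler only interrupts signal.pause()

if __name__ == "__main__":
    signal.signal(signal.SIGCHLD, on_sigchld)
    workers = [Process(target=worker) for _ in range(4)]
    for p in workers:
        p.start()
    while True:
        signal.pause()                 # sleep until some child changes state
        for i, p in enumerate(workers):
            if not p.is_alive():
                p.join()               # reap it so it does not become a zombie
                print("worker exited with code %s, respawning" % p.exitcode)
                workers[i] = Process(target=worker)
                workers[i].start()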