how to terminate a process at the end of its target function? - python

I am trying to refine a very large JSON data set. To do that, I split the file into many subparts (with the Unix split command) and assign each part to a process so that it can be fetched and refined independently.
Each process has its own input file, which corresponds to a subset of the main dataset.
Here is what my code looks like:
import multiprocessing as mp

def my_target(input_file, output_file):
    ...
    # some code
    ...
    # Is it possible to end the process here?
    # end of the function

worker_count = mp.cpu_count()
processes = [mp.Process(target=my_target, args=(input_file, output_file))
             for _ in range(worker_count)]
for p in processes:
    p.start()
It is very likely that the processes won't finish at the same time, hence my question: is it possible to terminate a process when it reaches the last line of its target function my_target()?
I suppose that letting processes sit idle after they have finished their tasks could slow down the remaining processes, no?
Any recommendations?

I suggest you check this question, which is related to what you might need: how to terminate a process using Python's multiprocessing.
A process ends on its own once its target function returns; what you do need to take care of is the "zombie process" problem: if a child process has ended but has not been joined, it lingers as a zombie.
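A minimal sketch of that pattern (the file pairs below are hypothetical placeholders, and my_target stands in for the refining function from the question): each worker exits on its own when its target returns, and joining it afterwards reaps the finished child.

import multiprocessing as mp

def my_target(input_file, output_file):
    pass  # placeholder for the refining step from the question

if __name__ == '__main__':
    file_pairs = [('part_00.json', 'out_00.json'),
                  ('part_01.json', 'out_01.json')]  # hypothetical file names
    processes = [mp.Process(target=my_target, args=pair) for pair in file_pairs]
    for p in processes:
        p.start()   # each process exits on its own once my_target returns
    for p in processes:
        p.join()    # joining reaps the finished child so it does not linger as a zombie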

Related

How to run multiple process simultaneously?

I have a loop with a highly time-consuming process in each iteration. Instead of waiting for each process to complete before moving to the next iteration, is it possible to start the process and just move to the next iteration without waiting for it to complete?
Example: given a text, the script should try to find the matching links from the Web and files from the local disk. Both simply return a list of links or paths.
for proc in (web_search, file_search):
    results = proc(text)
    yield from results
The solution I have in mind is to use a timer while doing the job: if the time exceeds the waiting limit, the process is moved to a "tray" and asked to keep working from there while I go to the next iteration and repeat the same. After my loop is over, I collect the results from the processes moved to the tray.
For simple cases, where the objective is just to let each task run simultaneously, we can use the Thread class from the threading module.
We can tackle the issue like this: we run each task in a Thread and ask it to put its result in a list or some other collection. The code is given below:
from threading import Thread

results = []

def add_to_collection(proc, args, collection):
    '''proc is the function, args are the arguments to pass to it.
    collection is our container (here it is the list results) for
    collecting results.'''
    result = proc(*args)
    collection.append(result)
    print("Completed:", proc)

# Now we do our time consuming tasks
for proc in (web_search, file_search):
    t = Thread(target=add_to_collection, args=(proc, (), results))
    # We assume proc takes no arguments
    t.start()
For complex tasks, as mentioned in the comments, it's better to go with multiprocessing.pool.Pool.
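For illustration, a minimal sketch of the Pool variant (the web_search/file_search bodies below are hypothetical stand-ins; only the Pool wiring matters):

from multiprocessing import Pool

def web_search(text):
    return ['http://example.com/' + text]   # hypothetical stand-in

def file_search(text):
    return ['/tmp/' + text + '.txt']        # hypothetical stand-in

def run_search(task):
    proc, text = task
    return proc(text)

if __name__ == '__main__':
    text = 'query'
    tasks = [(web_search, text), (file_search, text)]
    with Pool(processes=2) as pool:
        results = pool.map(run_search, tasks)   # blocks until both searches are done
    print(results)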

python multiprocessing stuck (maybe reading csv)

I am trying to learn how to use multiprocessing and I am having a problem.
I am trying to run this code:
import multiprocessing as mp
import random
import string

random.seed(123)

# Define an output queue
output = mp.Queue()

# Define an example function
def rand_string(length, output):
    """ Generates a random string of numbers, lower- and uppercase chars. """
    rand_str = ''.join(random.choice(
                            string.ascii_lowercase
                            + string.ascii_uppercase
                            + string.digits)
                       for i in range(length))
    output.put(rand_str)

# Set up a list of processes that we want to run
processes = [mp.Process(target=rand_string, args=(5, output)) for x in range(4)]

# Run processes
for p in processes:
    p.start()

# Exit the completed processes
for p in processes:
    p.join()

# Get process results from the output queue
results = [output.get() for p in processes]

print(results)
From here
The code in itself runs properly, but when I replace rand_string with my function (which reads a bunch of CSV files into Pandas DataFrames), the code never ends.
The function is this:
import pandas as pd

def readMyCSV(clFile):
    aClTable = pd.read_csv(clFile)
    # I do some processing here, but at the end the
    # function returns a Pandas DataFrame
    return aClTable
Then I wrap the function so that it allows for a Queue in the arguments:
def readMyCSVParWrap(clFile, outputq):
    outputq.put(readMyCSV(clFile))
and I build the processes with:
processes = [mp.Process(target=readMyCSVParWrap, args=(singleFile, output)) for singleFile in allFiles[:5]]
If I do so, the code never stops running, and the results are never printed.
If I put only the clFile string in the output queue, e.g.:
outputq.put((clFile))
the results are printed properly (just a list of clFiles).
When I look at htop, I see 5 processes being spawned, but they do not use any CPU.
Lastly, the readMyCSV function works properly if I run it by itself (it returns a Pandas DataFrame).
Is there anything I am doing wrong?
I am running this in a Jupyter notebook, maybe that is an issue?
It seems the join statements on your processes are causing a deadlock. The processes can't terminate because they wait until the items they put on the queue are consumed, but in your code this happens only after the joining.
Joining processes that use queues
Bear in mind that a process that has put items in a queue will wait before terminating until all the buffered items are fed by the “feeder” thread to the underlying pipe. (The child process can call the Queue.cancel_join_thread method of the queue to avoid this behaviour.)
This means that whenever you use a queue you need to make sure that all items which have been put on the queue will eventually be removed before the process is joined. Otherwise you cannot be sure that processes which have put items on the queue will terminate. Remember also that non-daemonic processes will be joined automatically.
docs
The docs further suggest swapping the lines with queue.get and join, or simply removing the join.
Also important:
Make sure that the main module can be safely imported by a new Python interpreter without causing unintended side effects (such as starting a new process)... protect the "entry point" of the program by using if __name__ == '__main__':. ibid
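A minimal sketch of that fix (the CSV file names below are hypothetical, and the wrapper is inlined for self-containment): get the results from the queue before joining.

import multiprocessing as mp
import pandas as pd

def readMyCSVParWrap(clFile, outputq):
    outputq.put(pd.read_csv(clFile))   # same idea as the question's wrapper

if __name__ == '__main__':
    allFiles = ['a.csv', 'b.csv', 'c.csv']   # hypothetical file names
    output = mp.Queue()
    processes = [mp.Process(target=readMyCSVParWrap, args=(f, output))
                 for f in allFiles]
    for p in processes:
        p.start()

    # Drain the queue *before* joining, so each child's feeder thread can flush
    # its buffered DataFrame to the pipe and the child is free to exit.
    results = [output.get() for p in processes]

    for p in processes:
        p.join()

    print(len(results), "DataFrames collected")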

how to start multiple jobs in python and communicate with the main job

I am a novice user of python multithreading/multiprocessing, so please bear with me.
I would like to solve the following problem and I need some help/suggestions in this regard.
Let me describe in brief:
I would like to start a Python script which does something sequentially at the beginning.
After the sequential part is over, I would like to start some jobs in parallel.
Assume that there are four parallel jobs I want to start.
I would also like to start these jobs on some other machines using "lsf" on the computing cluster. My initial script is also running on an "lsf" machine.
The four jobs which I start on four machines will perform two logical steps, A and B, one after the other.
When a job starts, it begins with logical step A and finishes it.
After every one of the four jobs has finished step A, it should notify the first job which started them. In other words, the main job that started them waits for confirmation from these four jobs.
Once the main job receives confirmation from all four jobs, it should notify them to do logical step B.
Logical step B will automatically terminate the jobs after finishing the task.
The main job waits for all jobs to finish, and later on it should continue with the sequential part.
An example scenario would be:
Python script running on an “lsf” machine in the cluster starts four "tcl shells" on four “lsf” machines.
In each tcl shell, a script is sourced to do the logical step A.
Once step A is done, they should somehow inform the Python script, which is waiting for the acknowledgement.
Once the acknowledgement is received from all four, the Python script informs them to do logical step B.
Logical step B is also a script which is sourced in their tcl shell; this script will also close the tcl shell at the end.
Meanwhile, the Python script is waiting for all four jobs to finish.
After all four jobs are finished, it should continue with the sequential part again and finish later on.
Here are my questions:
I am confused about whether I should use multithreading or multiprocessing. Which one suits this better?
In fact, what is the difference between these two? I read about them but wasn't able to reach a conclusion.
What is the Python GIL? I also read somewhere that at any one point in time only one thread will execute.
I need some explanation here. It gives me the impression that I can't use threads.
Any suggestions on how I could solve my problem systematically and in a more Pythonic way?
I am looking for some verbal step by step explanation and some pointers to read on each step.
Once the concepts are clear, I would like to code it myself.
Thanks in advance.
In addition to roganjosh's answer, I would include some signaling to start step B after A has finished:
import multiprocessing as mp
import time
import random
import sys

def func_A(process_number, queue, proceed):
    print("Process {} has been created".format(process_number))
    print("Process {} has ended step A".format(process_number))
    sys.stdout.flush()
    queue.put((process_number, "done"))
    proceed.wait()  # wait for the signal to do the second part
    print("Process {} has ended step B".format(process_number))
    sys.stdout.flush()

def multiproc_master():
    queue = mp.Queue()
    proceed = mp.Event()
    processes = [mp.Process(target=func_A, args=(x, queue, proceed)) for x in range(4)]
    for p in processes:
        p.start()
    # block=True waits until there is something available
    results = [queue.get(block=True) for p in processes]
    proceed.set()  # set continue-flag
    for p in processes:  # wait for all to finish (also on Windows)
        p.join()
    return results

if __name__ == '__main__':
    split_jobs = multiproc_master()
    print(split_jobs)
1) From the options you listed in your question, you should probably use multiprocessing in this case to leverage multiple CPU cores and compute things in parallel.
2) Going further from point 1: the Global Interpreter Lock (GIL) means that only one thread can actually execute code at any one time.
A simple example for multithreading that pops up often here is having a prompt for user input for, say, an answer to a maths problem. In the background, they want a timer to keep incrementing at one second intervals to register how long the person took to respond. Without multithreading, the program would block whilst waiting for user input and the counter would not increment. In this case, you could have the counter and the input prompt run on different threads so that they appear to be running at the same time. In reality, both threads are sharing the same CPU resource and are constantly passing an object backwards and forwards (the GIL) to grant them individual access to the CPU. This is hopeless if you want to properly process things in parallel. (Note: In reality, you'd just record the time before and after the prompt and calculate the difference rather than bothering with threads.)
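For illustration only, a minimal threading sketch of that prompt-plus-counter idea (the ticker/stop_event names are mine, not from the answer): the counter thread keeps printing while input() blocks the main thread.

import threading
import time

stop_event = threading.Event()

def ticker():
    # Print an elapsed-seconds counter until the main thread signals us to stop.
    start = time.time()
    while not stop_event.is_set():
        print("elapsed: {:.0f}s".format(time.time() - start))
        time.sleep(1)

t = threading.Thread(target=ticker, daemon=True)
t.start()
answer = input("What is 7 * 8? ")
stop_event.set()   # stop the ticker once the user has answered
print("You answered:", answer)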
3) I have made a really simple example using multiprocessing. In this case, I spawn 4 processes that compute the sum of squares for a randomly chosen range. These processes do not have a shared GIL and therefore execute independently unlike multithreading. In this example, you can see that all processes start and end at slightly different times, but we can aggregate the results of the processes into a single queue object. The parent process will wait for all 4 child processes to return their computations before moving on. You could then repeat the code for func_B (not included in the code).
import multiprocessing as mp
import time
import random
import sys

def func_A(process_number, queue):
    start = time.time()
    print("Process {} has started at {}".format(process_number, start))
    sys.stdout.flush()
    my_calc = sum([x**2 for x in range(random.randint(1000000, 3000000))])
    end = time.time()
    print("Process {} has ended at {}".format(process_number, end))
    sys.stdout.flush()
    queue.put((process_number, my_calc))

def multiproc_master():
    queue = mp.Queue()
    processes = [mp.Process(target=func_A, args=(x, queue)) for x in range(4)]
    for p in processes:
        p.start()
    # Uncomment the below if you run on Linux (Windows and Linux treat multiprocessing
    # differently, as Windows lacks os.fork())
    #for p in processes:
    #    p.join()
    results = [queue.get() for p in processes]
    return results

if __name__ == '__main__':
    split_jobs = multiproc_master()
    print(split_jobs)

Python's multiprocessing is not creating tasks in parallel

I am learning about multiprocessing in Python using the multiprocessing library. For that purpose, I tried to create a program that divides a big file into several smaller chunks. So, first I read all the data from that file, and then create worker tasks that each take a segment of the data from that input file and write that segment into a file. I expect to have as many parallel processes running as the number of segments, but that does not happen. I see at most two tasks, and the program terminates after that. What mistake am I making? The code is given below.
import multiprocessing

def worker(segment, x):
    fname = getFileName(x)
    writeToFile(segment, fname)

if __name__ == '__main__':
    with open(fname) as f:
        lines = f.readlines()
    jobs = []
    for x in range(0, numberOfSegments):
        segment = getSegment(x, lines)
        jobs.append(multiprocessing.Process(target=worker, args=(segment, x)))
        jobs[len(jobs)-1].start()
        for p in jobs:
            p.join()
Process gives you one additional process (which, together with your main process, gives you two). The call to join at the end of each loop iteration will wait for that process to finish before the next one starts. If you insist on using Process, you'll need to store the created processes (probably in a list) and join every process in a loop after your current loop.
You want the Pool class from multiprocessing (https://docs.python.org/2/library/multiprocessing.html#module-multiprocessing.pool)
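A rough sketch of the Pool version of this task (it assumes the question's getSegment, getFileName and writeToFile helpers and numberOfSegments exist; only the Pool wiring is new):

import multiprocessing

def worker(task):
    segment, x = task
    writeToFile(segment, getFileName(x))   # helpers assumed from the question

if __name__ == '__main__':
    with open(fname) as f:                 # fname as in the question
        lines = f.readlines()
    tasks = [(getSegment(x, lines), x) for x in range(numberOfSegments)]
    with multiprocessing.Pool() as pool:   # one worker process per CPU core by default
        pool.map(worker, tasks)            # blocks until every segment has been written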

Control Number of Processes in Python using multiprocessing

I would like to control the number of Processes spawned while using the multiprocessing package.
Say I only want three processes active at the same time. The only way I know how to do this is:
import multiprocessing
import queue

def worker(arg):
    ## Do stuff
    return returnvalue

argument = [1, 2, 3, 4, 5, 6]
aliveprocesses = 0
jobs = queue.Queue()
for arg in argument:
    while jobs.qsize() > 2:
        jobs.get().join()
    p = multiprocessing.Process(target=worker, args=(arg,))
    jobs.put(p)
    p.start()
Basically I only know how to monitor one process at a time using the Process.join() function. I monitor the oldest process until it is done and then create a new process. For my program the oldest process should finish before the others, on average. But who knows? Maybe another process finishes first and I would have no way of knowing.
The only alternative I can think of is something like this:
import multiprocessing
import time

def worker(arg):
    ## Do stuff
    return returnvalue

argument = [1, 2, 3, 4, 5, 6]
aliveprocesses = 0
jobs = set()
for arg in argument:
    while aliveprocesses > 2:
        for j in jobs:
            if not j.is_alive():
                aliveprocesses -= 1
                break
        time.sleep(1)
    p = multiprocessing.Process(target=worker, args=(arg,))
    jobs.add(p)
    p.start()
    aliveprocesses += 1
In the above snippet you check all of the processes to see if they are still alive. If they are all still alive, you sleep for a bit and then check again, until there is a dead process, after which you spawn a new process. The problem here is that, from what I understand, the time.sleep() function is not a particularly efficient way to wait for a process to end.
Ideally I would like a function "superjoin()", like Process.join(), except that it takes a set of Process objects and returns as soon as one Process within the set returns. And superjoin() would not itself use the time.sleep() function, i.e. the waiting is not just being "passed the buck".
Since you seem to have a single (parallel) task, instead of managing processes individually, you should use the higher-level multiprocessing.Pool, which makes managing the number of processes easier.
You can't join a pool, but you have blocking calls (such as Pool.map) that perform this kind of task.
If you need finer-grained control, you may want to adapt Pool's source code.
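For reference, a minimal sketch of the Pool approach with at most three concurrent workers (the worker body is a hypothetical stand-in for "Do stuff"):

import multiprocessing

def worker(arg):
    return arg * arg   # hypothetical stand-in for "Do stuff"

if __name__ == '__main__':
    arguments = [1, 2, 3, 4, 5, 6]
    with multiprocessing.Pool(processes=3) as pool:
        returnvalues = pool.map(worker, arguments)   # never more than 3 worker processes at once
    print(returnvalues)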
