python : wait until a function is finished - python

My script is composed of several functions that execute relatively long bash commands in terms of execution time (retrieve the list of suid binaries, retrieve multiple files, etc).
I have linked these functions to execute them via multiprocessing, as shown below:
def main_f():
queue=Queue()
proc10= multiprocessing.Process(target=verify_defenses)
proc2 = multiprocessing.Process(target=useful_software)
proc3 = multiprocessing.Process(target=process)
proc4 = multiprocessing.Process(target=start_suid)
proc6 =multiprocessing.Process(target=path)
proc7 =multiprocessing.Process(target=writable_directories)
proc8 =multiprocessing.Process(target=authorized_exec)
proc9 =multiprocessing.Process(target=crontab)
proc1 =multiprocessing.Process(target=enum_users)
proc11 = multiprocessing.Process(target=network_interface)
proc12 =multiprocessing.Process(target=check_write)
proc13 = multiprocessing.Process(target=search_files)
proc_tab = [proc1, proc2,.... ("speed" tasks)]
for p in proc_tab:
p.start()
for p in proc_tab:
p.join()
//"slow" tasks
proc4.start()
proc4.join()
proc_x.start()
proc_x.join(), etc
my problem is that multiprocessing "executes" the functions too quickly for them to have time to do their job (the function that returns suid returns nothing because of this lack of time for example): what can I do to ask multiprocessing to wait until my function has finished its execution before moving on to the next one?

Related

Parallel instances to run an external serial software

I am attempting to write a python function that uses multiprocessing to execute multiple instances of a external serial software. Using the code below I am able to "start" all of the runs of the external software, but it appears that only 1 instance of the software is actually computing (i.e., log files were generated for every run but only one instance is updating the log). One issue with the external program that I am using is that it only accepts one specific input file name. My original solution to this is to start the Processes 1 by 1 while waiting 10 seconds in between, so that I can update the information of this input file after each process instance has started. In otherwords, I write a new input file each instance starts.
Overall, what I would like to do is allocate X # of CPUs (via a scheduler) to my python program and then have my python program start X # of serial instances of the external software, so that all instances run in parallel. I should also not return to the main python program until all instances of the external software have completed. I appreciate any helpful insights! I am relatively new to this type of parallelism, so I apologize if my question has an exceptionally simple solution that I don't see.
from multiprocessing import Process
from subprocess import call, Popen, PIPE
import time
def run_parallel_serial(n_loops=None, name=None, setting=None):
processes = []
for i in range(1, n_loops+1, 1):
prepare_software_input(n_loop=i, info=name)
processes.append(Process(target=run_software))
count = 0
for p in processes:
update_input(settings[count])
p.start()
time.sleep(10)
count += 1
for p in processes:
p.join()
print("Done running in parallel!")
return
The uprepare_software_input and update_input functions are used to write and overwrite the input file used by the external software. The run_software function uses the subprocess module as the following
def run_software():
p = Popen([PROGRAM_EXEC, SOFTWARE_INPUT_FILE], stdin=PIPE, stdout=PIPE, stderr=PIPE).communicate()
del(p)
return
p.s. I am not married to multiprocessing. If there is an alternative strategy with a different module, I am happy to switch.

How to run multiple process simultaneously?

I have a loop with highly time-consuming process and instead of waiting for each process to complete to move to next iteration, is it possible to run the process and just move to next iteration without waiting for it to complete?
Example : Given a text, the script should try to find the matching links from the Web and files from the local disk. Both return simply a list of links or paths.
for proc in (web_search, file_search):
results = proc(text)
yield from results
What I have as a solution is, using a timer while doing the job. And if the time exceeds the waiting time, the process should be moved to a tray and asked to work from there. Now I will go to next iteration and repeat the same. After my loop is over, I will collect the results from the process moved to the tray.
For simple cases, where the objective is to let each process run simultaneously, we can use Thread of threading module.
So we can tackle the issue like this, we make each process as a Thread and ask it to put its results in a list or some other collection. The code is given below:
from threading import Thread
results = []
def add_to_collection(proc, args, collection):
'''proc is the function, args are the arguments to pass to it.
collection is our container (here it is the list results) for
collecting results.'''
result = proc(*args)
collection.append(result)
print("Completed":, proc)
# Now we do our time consuming tasks
for proc in (web_search, file_search):
t = Thread(target=add_to_collection, args=(proc, ()))
# We assume proc takes no arguments
t.start()
For complex tasks, as mentioned in comments, its better to go with multiprocessing.pool.Pool.

How many threads are running in this python example?

I want to learn how to run a function in multithreading using python. In other words, I have a long list of arguments that I want to send to a function that might take time to finish. I want my program to loop over the arguments and call the functions on parallel (no need to wait until the function finishes fromt he forst argument).
I found this example code from here:
import Queue
import threading
import urllib2
# called by each thread
def get_url(q, url):
q.put(urllib2.urlopen(url).read())
theurls = ["http://google.com", "http://yahoo.com"]
q = Queue.Queue()
for u in theurls:
t = threading.Thread(target=get_url, args = (q,u))
t.daemon = True
t.start()
s = q.get()
print s
My question are:
1) I normally know that I have to specify a number of threads that I want my program to run in parallel. There is no specific number of threads in the code above.
2) The number of threads is something that varies from device to device (depends on the processor, memory, etc.). Since this code does not specify any number of threads, how the program knows the right number of threads to run concurrently?
The threads are being created in the for loop. The for loop gets executed twice since there are two elements in theurls . This also answers your other two questions. Thus you end up with two threads in the program
plus main loop thread
Total 3

Running Python on multiple cores

I have created a (rather large) program that takes quite a long time to finish, and I started looking into ways to speed up the program.
I found that if I open task manager while the program is running only one core is being used.
After some research, I found this website:
Why does multiprocessing use only a single core after I import numpy? which gives a solution of os.system("taskset -p 0xff %d" % os.getpid()),
however this doesn't work for me, and my program continues to run on a single core.
I then found this:
is python capable of running on multiple cores?,
which pointed towards using multiprocessing.
So after looking into multiprocessing, I came across this documentary on how to use it https://docs.python.org/3/library/multiprocessing.html#examples
I tried the code:
from multiprocessing import Process
def f(name):
print('hello', name)
if __name__ == '__main__':
p = Process(target=f, args=('bob',))
p.start()
p.join()
a = input("Finished")
After running the code (not in IDLE) It said this:
Finished
hello bob
Finished
Note: after it said Finished the first time I pressed enter
So after this I am now even more confused and I have two questions
First: It still doesn't run with multiple cores (I have an 8 core Intel i7)
Second: Why does it input "Finished" before its even run the if statement code (and it's not even finished yet!)
To answer your second question first, "Finished" is printed to the terminal because a = input("Finished") is outside of your if __name__ == '__main__': code block. It is thus a module level constant which gets assigned when the module is first loaded and will execute before any code in the module runs.
To answer the first question, you only created one process which you run and then wait to complete before continuing. This gives you zero benefits of multiprocessing and incurs overhead of creating the new process.
Because you want to create several processes, you need to create a pool via a collection of some sort (e.g. a python list) and then start all of the processes.
In practice, you need to be concerned with more than the number of processors (such as the amount of available memory, the ability to restart workers that crash, etc.). However, here is a simple example that completes your task above.
import datetime as dt
from multiprocessing import Process, current_process
import sys
def f(name):
print('{}: hello {} from {}'.format(
dt.datetime.now(), name, current_process().name))
sys.stdout.flush()
if __name__ == '__main__':
worker_count = 8
worker_pool = []
for _ in range(worker_count):
p = Process(target=f, args=('bob',))
p.start()
worker_pool.append(p)
for p in worker_pool:
p.join() # Wait for all of the workers to finish.
# Allow time to view results before program terminates.
a = input("Finished") # raw_input(...) in Python 2.
Also note that if you join workers immediately after starting them, you are waiting for each worker to complete its task before starting the next worker. This is generally undesirable unless the ordering of the tasks must be sequential.
Typically Wrong
worker_1.start()
worker_1.join()
worker_2.start() # Must wait for worker_1 to complete before starting worker_2.
worker_2.join()
Usually Desired
worker_1.start()
worker_2.start() # Start all workers.
worker_1.join()
worker_2.join() # Wait for all workers to finish.
For more information, please refer to the following links:
https://docs.python.org/3/library/multiprocessing.html
Dead simple example of using Multiprocessing Queue, Pool and Locking
https://pymotw.com/2/multiprocessing/basics.html
https://pymotw.com/2/multiprocessing/communication.html
https://pymotw.com/2/multiprocessing/mapreduce.html

how to start multiple jobs in python and communicate with the main job

I am a novice user of python multithreading/multiprocessing, so please bear with me.
I would like to solve the following problem and I need some help/suggestions in this regard.
Let me describe in brief:
I would like to start a python script which does something in the
beginning sequentially.
After the sequential part is over, I would like to start some jobs
in parallel.
Assume that there are four parallel jobs I want to start.
I would like to also start these jobs on some other machines using "lsf" on the computing cluster.My initial script is also running on a ” lsf”
machine.
The four jobs which I started on four machines will perform two logical steps A and B---one after the other.
When a job started initially, they start with logical step A and finish it.
After every job (4jobs) has finished the Step A; they should notify the first job which started these. In other words, the main job which started is waiting for the confirmation from these four jobs.
Once the main job receives confirmation from these four jobs; it should notify all the four jobs to do the logical step B.
Logical step B will automatically terminate the jobs after finishing the task.
Main job is waiting for the all jobs to finish and later on it should continue with the sequential part.
An example scenario would be:
Python script running on an “lsf” machine in the cluster starts four "tcl shells" on four “lsf” machines.
In each tcl shell, a script is sourced to do the logical step A.
Once the step A is done, somehow they should inform the python script which is waiting for the acknowledgement.
Once the acknowledgement is received from all the four, python script inform them to do the logical step B.
Logical step B is also a script which is sourced in their tcl shell; this script will also close the tcl shell at the end.
Meanwhile, python script is waiting for all the four jobs to finish.
After all four jobs are finished; it should continue with the sequential part again and finish later on.
Here are my questions:
I am confused about---should I use multithreading/multiprocessing. Which one suits better?
In fact what is the difference between these two? I read about these but I wasn't able to conclude.
What is python GIL? I also read somewhere at any one point in time only one thread will execute.
I need some explanation here. It gives me an impression that I can't use threads.
Any suggestions on how could I solve my problem systematically and in a more pythonic way.
I am looking for some verbal step by step explanation and some pointers to read on each step.
Once the concepts are clear, I would like to code it myself.
Thanks in advance.
In addition to roganjosh's answer, I would include some signaling to start the step B after A has finished:
import multiprocessing as mp
import time
import random
import sys
def func_A(process_number, queue, proceed):
print "Process {} has started been created".format(process_number)
print "Process {} has ended step A".format(process_number)
sys.stdout.flush()
queue.put((process_number, "done"))
proceed.wait() #wait for the signal to do the second part
print "Process {} has ended step B".format(process_number)
sys.stdout.flush()
def multiproc_master():
queue = mp.Queue()
proceed = mp.Event()
processes = [mp.Process(target=func_A, args=(x, queue)) for x in range(4)]
for p in processes:
p.start()
#block = True waits until there is something available
results = [queue.get(block=True) for p in processes]
proceed.set() #set continue-flag
for p in processes: #wait for all to finish (also in windows)
p.join()
return results
if __name__ == '__main__':
split_jobs = multiproc_master()
print split_jobs
1) From the options you listed in your question, you should probably use multiprocessing in this case to leverage multiple CPU cores and compute things in parallel.
2) Going further from point 1: the Global Interpreter Lock (GIL) means that only one thread can actually execute code at any one time.
A simple example for multithreading that pops up often here is having a prompt for user input for, say, an answer to a maths problem. In the background, they want a timer to keep incrementing at one second intervals to register how long the person took to respond. Without multithreading, the program would block whilst waiting for user input and the counter would not increment. In this case, you could have the counter and the input prompt run on different threads so that they appear to be running at the same time. In reality, both threads are sharing the same CPU resource and are constantly passing an object backwards and forwards (the GIL) to grant them individual access to the CPU. This is hopeless if you want to properly process things in parallel. (Note: In reality, you'd just record the time before and after the prompt and calculate the difference rather than bothering with threads.)
3) I have made a really simple example using multiprocessing. In this case, I spawn 4 processes that compute the sum of squares for a randomly chosen range. These processes do not have a shared GIL and therefore execute independently unlike multithreading. In this example, you can see that all processes start and end at slightly different times, but we can aggregate the results of the processes into a single queue object. The parent process will wait for all 4 child processes to return their computations before moving on. You could then repeat the code for func_B (not included in the code).
import multiprocessing as mp
import time
import random
import sys
def func_A(process_number, queue):
start = time.time()
print "Process {} has started at {}".format(process_number, start)
sys.stdout.flush()
my_calc = sum([x**2 for x in xrange(random.randint(1000000, 3000000))])
end = time.time()
print "Process {} has ended at {}".format(process_number, end)
sys.stdout.flush()
queue.put((process_number, my_calc))
def multiproc_master():
queue = mp.Queue()
processes = [mp.Process(target=func_A, args=(x, queue)) for x in xrange(4)]
for p in processes:
p.start()
# Unhash the below if you run on Linux (Windows and Linux treat multiprocessing
# differently as Windows lacks os.fork())
#for p in processes:
# p.join()
results = [queue.get() for p in processes]
return results
if __name__ == '__main__':
split_jobs = multiproc_master()
print split_jobs

Categories

Resources