Can Python version message be suppressed for child processes (Windows)? - python

I am using multiprocessing to calculate a large mass of data; i.e. I periodically spawn a process so that the total number of processes is equal to the number of CPU's on my machine.
I periodically print out the progress of the entire calculation... but this is inconveniently interspersed with Python's welcome messages from each child!
To be clear, this is a Windows specific problem due to how multiprocessing is handled.
E.g.
> python -q my_script.py
Python Version: 3.7.7 on Windows
Then many subsequent duplicates of the same version message print; one for each child process.
How can I suppress these?
I understand that if you run Python on the command line with a -q flag, it suppresses the welcome message; though I don't know how to translate that into my script.
EDIT:
I tried to include the interpreter flag -q like so:
multiprocessing.set_executable(sys.executable + ' -q')
Yet to no avail. I receive a FileNotFoundError which tells me I cannot pass options this way due to how they check arguments.
Anyways, here is the relevant section of code (It's an entire function):
def _parallelize(self, buffer, func, cpus):
## Number of Parallel Processes ##
cpus_max = mp.cpu_count()
cpus = min(cpus_max, cpus) if cpus else int(0.75*cpus_max)
## Total Processes to-do ##
N = ceil(self.SampleLength / DATA_MAX) # Number of Child Processes
print("N: ", N)
q = mp.Queue() # Child Process results Queue
## Initialize each CPU w/ a Process ##
for p in range(min(cpus, N)):
mp.Process(target=func, args=(p, q)).start()
## Collect Validation & Start Remaining Processes ##
for p in tqdm(range(N)):
n, data = q.get() # Collects a Result
i = n * DATA_MAX # Shifts to Proper Interval
buffer[i:i + len(data)] = data # Writes to open HDF5 file
if p < N - cpus: # Starts a new Process
mp.Process(target=func, args=(p + cpus, q)).start()
SECOND EDIT:
I should probably mention that I'm doing everything within an anaconda environment.

The message is printed on interactive startup.
A spawned process does inherit some flags from the child process.
But looking at the code in multiprocessing it does not seem possible to change these parameters from within the program.
So the easiest way to get rid of the messages should be to add the -q option to the original python invocation that starts your program.
I have confirmed that the -q flag is inherited.
So that should suppress the message for the original process and the children that it spawns.
Edit:
If you look at the implementation of set_executable, you will see that you cannot add or change arguments that way. :-(
Edit2:
You wrote:
I'm doing everything within an anaconda environment.
You you mean a virtual environment, or some kind of fancy IDE like spyder?
If you ever have a Python problem, first try reproducing it in plain CPython, running from the command line. IDE's and fancy environments like anaconda sometimes do weird things when running Python.

Related

Parallel instances to run an external serial software

I am attempting to write a python function that uses multiprocessing to execute multiple instances of a external serial software. Using the code below I am able to "start" all of the runs of the external software, but it appears that only 1 instance of the software is actually computing (i.e., log files were generated for every run but only one instance is updating the log). One issue with the external program that I am using is that it only accepts one specific input file name. My original solution to this is to start the Processes 1 by 1 while waiting 10 seconds in between, so that I can update the information of this input file after each process instance has started. In otherwords, I write a new input file each instance starts.
Overall, what I would like to do is allocate X # of CPUs (via a scheduler) to my python program and then have my python program start X # of serial instances of the external software, so that all instances run in parallel. I should also not return to the main python program until all instances of the external software have completed. I appreciate any helpful insights! I am relatively new to this type of parallelism, so I apologize if my question has an exceptionally simple solution that I don't see.
from multiprocessing import Process
from subprocess import call, Popen, PIPE
import time
def run_parallel_serial(n_loops=None, name=None, setting=None):
processes = []
for i in range(1, n_loops+1, 1):
prepare_software_input(n_loop=i, info=name)
processes.append(Process(target=run_software))
count = 0
for p in processes:
update_input(settings[count])
p.start()
time.sleep(10)
count += 1
for p in processes:
p.join()
print("Done running in parallel!")
return
The uprepare_software_input and update_input functions are used to write and overwrite the input file used by the external software. The run_software function uses the subprocess module as the following
def run_software():
p = Popen([PROGRAM_EXEC, SOFTWARE_INPUT_FILE], stdin=PIPE, stdout=PIPE, stderr=PIPE).communicate()
del(p)
return
p.s. I am not married to multiprocessing. If there is an alternative strategy with a different module, I am happy to switch.

Mulitprocessing pool for function with no arguments/iterable?

I'm running Python 2.7 on the GCE platform to do calculations. The GCE instances boot, install various packages, copy 80 Gb of data from a storage bucket and runs a "workermaster.py" script with nohangup. The workermaster runs on an infinite loop which checks a task-queue bucket for tasks. When the task bucket isn't empty it picks a random file (task) and passes work to a calculation module. If there is nothing to do the workermaster sleeps for a number of seconds and checks the task-list again. The workermaster runs continuously until the instance is terminated (or something breaks!).
Currently this works quite well, but my problem is that my code only runs instances with a single CPU. If I want to scale up calculations I have to create many identical single-CPU instances and this means there is a large cost overhead for creating many 80 Gb disks and transferring the data to them each time, even though the calculation is only "reading" one small portion of the data for any particular calculation. I want to make everything more efficient and cost effective by making my workermaster capable of using multiple CPUs, but after reading many tutorials and other questions on SO I'm completely confused.
I thought I could just turn the important part of my workermaster code into a function, and then create a pool of processes that "call" it using the multiprocessing module. Once the workermaster loop is running on each CPU, the processes do not need to interact with each other or depend on each other in any way, they just happen to be running on the same instance. The workermaster prints out information about where it is in the calculation and I'm also confused about how it will be possible to tell the "print" statements from each process apart, but I guess that's a few steps from where I am now! My problems/confusion are that:
1) My workermaster "def" doesn't return any value because it just starts an infinite loop, where as every web example seems to have something in the format myresult = pool.map(.....); and
2) My workermaster "def" doesn't need any arguments/inputs - it just runs, whereas the examples of multiprocessing that I have seen on SO and on the Python Docs seem to have iterables.
In case it is important, the simplified version of the workermaster code is:
# module imports are here
# filepath definitions go here
def workermaster():
while True:
tasklist = cloudstoragefunctions.getbucketfiles('<my-task-queue-bucket')
if tasklist:
tasknumber = random.randint(2, len(tasklist))
assignedtask = tasklist[tasknumber]
print 'Assigned task is now: ' + assignedtask
subprocess.call('gsutil -q cp gs://<my-task-queue-bucket>/' + assignedtask + ' "' + taskfilepath + assignedtask + '"', shell=True)
tasktype = assignedtask.split('#')[0]
if tasktype == 'Calculation':
currentcalcid = assignedtask.split('#')[1]
currentfilenumber = assignedtask.split('#')[2].replace('part', '')
currentstartfile = assignedtask.split('#
currentendfile = assignedtask.split('#')[4].replace('.csv', '')
calcmodule.docalc(currentcalcid, currentfilenumber, currentstartfile, currentendfile)
elif tasktype == 'Analysis':
#set up and run analysis module, etc.
print ' Operation completed!'
os.remove(taskfilepath + assignedtask)
else:
print 'There are no tasks to be processed. Going to sleep...'
time.sleep(30)
Im trying to "call" the function multiple times using the multiprocessing module. I think I need to use the "pool" method, so I've tried this:
import multiprocessing
if __name__ == "__main__":
p = multiprocessing.Pool()
pool_output = p.map(workermaster, [])
My understanding from the docs is that the __name__ line is there only as a workaround for doing multiprocessing in Windows (which I am doing for development, but GCE is on Linux). The p = multiprocessing.Pool() line is creating a pool of workers equal to the number of system CPUs as no argument is specified. It the number of CPUs was 1 then I would expect the code to behave as it does before I attempted to use multiprocessing. The last line is the one that I don't understand. I thought that it was telling each of the processors in the pool that the "target" (thing to run) is workermaster. From the docs there appears to be a compulsory argument which is an iterable, but I don't really understand what this is in my case, as workermaster doesn't take any arguments. I've tried passing it an empty list, empty string, empty brackets (tuple?) and it doesn't do anything.
Please would it be possible for someone help me out? There are lots of discussions about using multiprocessing and this thread Mulitprocess Pools with different functions and this one python code with mulitprocessing only spawns one process each time seem to be close to what I am doing but still have iterables as arguments. If there is anything critical that I have left out please advise and I will modify my post - thank you to anyone who can help!
Pool() is useful if you want to run the same function with different argumetns.
If you want to run function only once then use normal Process().
If you want to run the same function 2 times then you can manually create 2 Process().
If you want to use Pool() to run function 2 times then add list with 2 arguments (even if you don't need arguments) because it is information for Pool() to run it 2 times.
But if you run function 2 times with the same folder then it may run 2 times the same task. if you will run 5 times then it may run 5 times the same task. I don't know if it is needed.
As for Ctrl+C I found on Stackoverflow Catch Ctrl+C / SIGINT and exit multiprocesses gracefully in python but I don't know if it resolves your problem.

How to make python run multiple processes/threads on different cores?

I have a python script that needs to run thousands of commands in the Linux command line in the fastest time possible. So I want to split each command out into a different thread/process, running them separately, and then spread out the load between all of my machine's cores.
I've tried using the threading and multiprocessing modules, and also subprocess.Popen, but I can't seem to force python to run it's processes and therefore the commands on different cores.
Can anybody help and/or recommend the best way for me to do this?
Here is my current code:
while True:
# grab next section of data
data = file.read(bytes_to_read)
cmd_stream += data
# remove dead processes
if len(processes) >= max_processes:
while (processes) and (processes[0].poll() is not None):
del processes[0]
while len(processes) < max_processes:
# get new command, break if none
new_cmd, cmd_stream = get_next_cmd(cmd_stream)
if not new_cmd:
break
new_cmds = turn_args_into_cmd(new_cmd)
for args in new_cmds:
new_process = subprocess.Popen(args)
processes.append(new_process)
I've also tried the following code in replace of the last section:
p = Process(target=run_cmd, args=(new_cmd,))
processes.append(p)
p.start()

how to start multiple jobs in python and communicate with the main job

I am a novice user of python multithreading/multiprocessing, so please bear with me.
I would like to solve the following problem and I need some help/suggestions in this regard.
Let me describe in brief:
I would like to start a python script which does something in the
beginning sequentially.
After the sequential part is over, I would like to start some jobs
in parallel.
Assume that there are four parallel jobs I want to start.
I would like to also start these jobs on some other machines using "lsf" on the computing cluster.My initial script is also running on a ” lsf”
machine.
The four jobs which I started on four machines will perform two logical steps A and B---one after the other.
When a job started initially, they start with logical step A and finish it.
After every job (4jobs) has finished the Step A; they should notify the first job which started these. In other words, the main job which started is waiting for the confirmation from these four jobs.
Once the main job receives confirmation from these four jobs; it should notify all the four jobs to do the logical step B.
Logical step B will automatically terminate the jobs after finishing the task.
Main job is waiting for the all jobs to finish and later on it should continue with the sequential part.
An example scenario would be:
Python script running on an “lsf” machine in the cluster starts four "tcl shells" on four “lsf” machines.
In each tcl shell, a script is sourced to do the logical step A.
Once the step A is done, somehow they should inform the python script which is waiting for the acknowledgement.
Once the acknowledgement is received from all the four, python script inform them to do the logical step B.
Logical step B is also a script which is sourced in their tcl shell; this script will also close the tcl shell at the end.
Meanwhile, python script is waiting for all the four jobs to finish.
After all four jobs are finished; it should continue with the sequential part again and finish later on.
Here are my questions:
I am confused about---should I use multithreading/multiprocessing. Which one suits better?
In fact what is the difference between these two? I read about these but I wasn't able to conclude.
What is python GIL? I also read somewhere at any one point in time only one thread will execute.
I need some explanation here. It gives me an impression that I can't use threads.
Any suggestions on how could I solve my problem systematically and in a more pythonic way.
I am looking for some verbal step by step explanation and some pointers to read on each step.
Once the concepts are clear, I would like to code it myself.
Thanks in advance.
In addition to roganjosh's answer, I would include some signaling to start the step B after A has finished:
import multiprocessing as mp
import time
import random
import sys
def func_A(process_number, queue, proceed):
print "Process {} has started been created".format(process_number)
print "Process {} has ended step A".format(process_number)
sys.stdout.flush()
queue.put((process_number, "done"))
proceed.wait() #wait for the signal to do the second part
print "Process {} has ended step B".format(process_number)
sys.stdout.flush()
def multiproc_master():
queue = mp.Queue()
proceed = mp.Event()
processes = [mp.Process(target=func_A, args=(x, queue)) for x in range(4)]
for p in processes:
p.start()
#block = True waits until there is something available
results = [queue.get(block=True) for p in processes]
proceed.set() #set continue-flag
for p in processes: #wait for all to finish (also in windows)
p.join()
return results
if __name__ == '__main__':
split_jobs = multiproc_master()
print split_jobs
1) From the options you listed in your question, you should probably use multiprocessing in this case to leverage multiple CPU cores and compute things in parallel.
2) Going further from point 1: the Global Interpreter Lock (GIL) means that only one thread can actually execute code at any one time.
A simple example for multithreading that pops up often here is having a prompt for user input for, say, an answer to a maths problem. In the background, they want a timer to keep incrementing at one second intervals to register how long the person took to respond. Without multithreading, the program would block whilst waiting for user input and the counter would not increment. In this case, you could have the counter and the input prompt run on different threads so that they appear to be running at the same time. In reality, both threads are sharing the same CPU resource and are constantly passing an object backwards and forwards (the GIL) to grant them individual access to the CPU. This is hopeless if you want to properly process things in parallel. (Note: In reality, you'd just record the time before and after the prompt and calculate the difference rather than bothering with threads.)
3) I have made a really simple example using multiprocessing. In this case, I spawn 4 processes that compute the sum of squares for a randomly chosen range. These processes do not have a shared GIL and therefore execute independently unlike multithreading. In this example, you can see that all processes start and end at slightly different times, but we can aggregate the results of the processes into a single queue object. The parent process will wait for all 4 child processes to return their computations before moving on. You could then repeat the code for func_B (not included in the code).
import multiprocessing as mp
import time
import random
import sys
def func_A(process_number, queue):
start = time.time()
print "Process {} has started at {}".format(process_number, start)
sys.stdout.flush()
my_calc = sum([x**2 for x in xrange(random.randint(1000000, 3000000))])
end = time.time()
print "Process {} has ended at {}".format(process_number, end)
sys.stdout.flush()
queue.put((process_number, my_calc))
def multiproc_master():
queue = mp.Queue()
processes = [mp.Process(target=func_A, args=(x, queue)) for x in xrange(4)]
for p in processes:
p.start()
# Unhash the below if you run on Linux (Windows and Linux treat multiprocessing
# differently as Windows lacks os.fork())
#for p in processes:
# p.join()
results = [queue.get() for p in processes]
return results
if __name__ == '__main__':
split_jobs = multiproc_master()
print split_jobs

python multiprocess fails to start

Here is my code for a simple multiprocessing task in python
from multiprocessing import Process
def myfunc(num):
tmp = num * num
print 'squared O/P will be ', tmp
return(tmp)
a = [ i**3 for i in range(5)] ## just defining a list
task = [Process(target = myfunc, args = (i,)) for i in a] ## creating processes
for each in task : each.start() # starting processes <------ problem line
for each in task : each.join() # waiting all to finish up
When I run this code, it hangs at certain point, so to identify it I ran it line by line in python shell and found that when I call 'each.start()' The shell pops out a dialogue box as:
" The program is still running , do you want to kill it? '
and I select 'yes' the shell closes.
When I replace Process with 'threading.Thread' the same code runs but with this nonsense output:
Squared Squared Squared Squared Squared 0 1491625
36496481
Is there any help in this regard ? thank in advance
To run my python codes I use Idlex IDE and I start it from terminal.
I have Intel Xeon Processor with 4 cores / 8 Threads, and 8GB RAM
With a little thought I finally found the problem.
This is happening because in Python, the float and int objects are not 'thread-safe', meaning the memory allocated to calculate any function's value by one thread/process can be overwritten by another and hence they show absurd values. This is called a race condition.
To solve this problem, use deque() from the collections module or, even better, use the 'Lock' facility. deque() works with arrays but it's meant for arrays of the same kind (much like MATLAB arrays) and is thread/process safe. 'Lock' avoids race conditions.
So the edit would be :
def myfunc(num):
lock.acquire()
.......some code .....
.......some code......
lock.release()
That's all.
But one problem still persists and that is with the multiprocessing module. Even after calling 'lock', the problem mentioned in the question remains.
Save the code above into a .py file and then run it in a gnome-terminal with
python myfile.py
Where "myfile.py" is the filename you saved to.
I would assume that the IDE you are using is confused somehow by Process()

Categories

Resources