How to structure code that distributes jobs to threads/nodes in Python?


I have python code that takes a bunch of tasks and distributes them to either different threads or different nodes on a cluster. I always end up writing a main script driver.py, that takes two command line arguments: --run-all and --run-task. The first is just a wrapper that iterates through all tasks and then calls driver.py --run-task with each task passed as argument. Example:
== driver.py ==
# Determine the current script
DRIVER = os.path.abspath(__file__)
(opts, args) = parser.parse_args()
if opts.run_all is not None:
    # Run all tasks
    for task in opts.run_all.split(","):
        # Call driver.py again with a specific task
        cmd = "python %s --run-task %s" % (DRIVER, task)
        # Execute on system
        distribute_cmd(cmd)
elif opts.run_task is not None:
    # Run on an individual task
    # code here for processing a task...
The user would then call:
$ driver.py --run-all task1,task2,task3,task4
And each task would get distributed.
The function distribute_cmd takes a shell executable command and sends in a system-specific way to either a node or a thread. The reason driver.py has to find its own name and call itself is because distribute_cmd needs an executable shell command; it cannot take a function name for example.
This consideration led me to this design: a driver script with two modes that has to call itself. This has two complications: (1) the script has to find its own path via __file__, and (2) when making this into a Python package, it's unclear where driver.py should go. It's meant to be an executable script, but if I put it in setup.py's scripts=, then I will have to find out where the scripts live (see "correct way to find scripts directory from setup.py in Python distutils?"). This does not seem to be a good solution.
What's an alternative design to this? Keep in mind that the distribution of tasks has to result in an executable command that can be passed as a string to distribute_cmd. Thanks.

What you are looking for is a library that already does exactly what you need, e.g. Fabric or Celery.
If you were not using nodes, I would suggest using multiprocessing.
This is a slightly similar question to this one.
To be able to execute remotely, you need one of the following:
ssh access to the box, in which case you can use Fabric to send your commands.
a server (SocketServer, a TCP server, or anything that will accept connections).
an agent, or client, that will wait for data. If you are using an agent, you may as well use a broker for your messages. Celery does some of the plumbing for you: one end puts messages on the queue while the other end gets messages from the queue. If the message is a command to execute, then the agent can make an os.system() call or call subprocess.Popen().
celery example:
import os
from celery import Celery
celery = Celery('tasks', broker='amqp://guest@localhost//')
@celery.task
def run_command(command):
    return os.system(command)
You will then need a worker that binds on the queue and waits for tasks to execute. More info in the documentation.
fabric example:
the code:
from fabric.api import run
def exec_remotely(command):
    run(command)
the invocation:
$ fab exec_remotely:command='ls -lh'
More info in the documentation.
batch system case:
To go back to the question...
distribute_cmd is something that would call bsub somescript.sh.
You need to find __file__ only because you are going to re-execute the same script with other parameters.
Because of the above, you might have a problem providing a correct distutils script.
Let's question this design.
Why do you need to use the same script?
Can your driver write scripts then call bsub?
Can you use temporary files?
Do all the nodes actually share a filesystem?
How do you know __file__ is going to exist on the node?
example:
TASK_CODE = {
    'TASK1': '''#!/usr/bin/env python
#... actual code for task1 goes here ...
''',
    'TASK2': '''#!/usr/bin/env python
#... actual code for task2 goes here ...
'''}
# driver portion
(opts, args) = parser.parse_args()
if opts.run_all is not None:
    for task in opts.run_all.split(","):
        task_path = '/tmp/taskfile_%s' % task
        with open(task_path, 'w') as task_file:
            task_file.write(TASK_CODE[task])
        # the file must be executable before it can be run as a command
        os.chmod(task_path, 0o755)
        # note: should probably do better error handling.
        distribute_cmd(task_path)
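Another option that avoids both __file__ and the scripts-directory lookup is to make the driver an importable module and build the command with python -m. This is only a sketch under assumptions: mypkg.driver is a hypothetical module name, build_task_command is a made-up helper, and the package is assumed to be installed on every node. The result is still a plain string that distribute_cmd could take.

```python
import shlex
import sys


def build_task_command(task, module="mypkg.driver"):
    """Return a shell command string that runs one task via ``python -m``.

    ``mypkg.driver`` is a hypothetical module name; substitute the real one.
    """
    parts = [sys.executable, "-m", module, "--run-task", task]
    # Quote each piece so paths or task names with spaces survive the shell.
    return " ".join(shlex.quote(part) for part in parts)


if __name__ == "__main__":
    for task in "task1,task2".split(","):
        print(build_task_command(task))
```

Because the module is located through the import system, the driver no longer needs to know where it lives on disk.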

Related

Can't get python subprocess.Popen() to start another python script in the background

I am in a bit of a pickle here. I have a Python script (gather.py) that gathers information from an .xml file and uploads it into a database in an infinite loop that sleeps for 60 seconds; by the way, all of this is local. I am using Flask to run a webpage that will later pull information from the database, but at the moment all it does is display a sample page (main.py). I want main.py to start gather.py as a background process that won't prevent Flask from starting. I tried importing gather.py, but it blocks indefinitely and Flask won't start. After Googling for a while, it seems that the best option is to use a task queue (Celery) and a message broker (RabbitMQ) to take care of this. This is fine if the application were to do a lot of stuff in the background, but I only need it to do 1 or 2 things. So I did more digging and found posts stating that subprocess.Popen() could do the job. I tried using it and I don't think it failed, since it didn't raise any errors, but the database is empty. I confirmed that both gather.py and main.py work independently. I tried running the following code in IDLE:
subprocess.Popen([sys.executable, 'path\to\gather.py'], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
and got this in return:
<subprocess.Popen object at 0x049A1CF0>
Now, I don't know what this means, I tried using .value and .attrib but understandably I get this:
AttributeError: 'Popen' object has no attribute 'value'
and
AttributeError: 'Popen' object has no attribute 'attrib'
Then I read on a StackOverflow post that stdout=subprocess.PIPE would cause the program to halt so, in a 'just in case' moment, I ran:
subprocess.Popen([sys.executable, 'path\to\gather.py'], stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT)
and got this in return:
<subprocess.Popen object at 0x034A77D0>
Throughout this process the database tables have remained empty. I am new to the subprocess module, but all of this seems to check out and I can't figure out why it is not running gather.py. Is it because it has an infinite loop? If there is a better option, please let me know.
Python version: 3.4.4
PS. IDK if it'll matter but I am running a portable version of Python (PortableApps) on a Windows 10 PC. This is why I included sys.executable inside subprocess.Popen().

Solution 1 (all in python script):
Try using Thread and Queue.
I do this:
from flask import Flask
from flask import request
import json
from queue import Queue  # on Python 2 this module is named Queue
from threading import Thread
import time
def task(q):
    q.put(0)
    t = time.time()
    while True:
        time.sleep(1)
        q.put(time.time() - t)
queue = Queue()
worker = Thread(target=task, args=(queue,))
worker.start()
app = Flask(__name__)
@app.route('/')
def message_from_queue():
    msg = "Running: calculate %f seconds" % queue.get()
    return msg
if __name__ == '__main__':
    app.run(host='0.0.0.0')
If you run this code, each request to '/' returns a value calculated by task in the background. You may need to block until the task produces a value, but there isn't enough information in the question to tell. Of course, you will need to refactor your gather.py so that a queue can be passed to it.
Solution 2 (using a system script):
For Windows, create a .bat file and run both scripts from there:
@echo off
start python "path\to\gather.py"
set FLASK_APP=app.py
flask run
This will start gather.py and then start the Flask server. If you use start /min python "path\to\gather.py", gather.py will run in a minimized window.

subprocess.Popen cannot launch a .py file directly on Windows (without shell=True), because a .py file is a script rather than an executable. The first item of the argument list has to be a real executable, such as the Python interpreter itself (for example sys.executable), followed by the script path.
You can also use:
os.system('python_file_path.py')
but os.system waits for the script to finish, so it won't be a background process.
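One possible reason a Popen launch looks successful while nothing happens is that the child starts and then exits at once; getting a Popen object back only means the process was created. The following is a minimal sketch, assuming Python 3, that sends the child's output to a log file and checks poll() shortly after the start. start_background and the log file are hypothetical names, and the inline -c command merely stands in for gather.py.

```python
import os
import subprocess
import sys
import tempfile
import time


def start_background(args, log_path):
    """Start ``args`` without waiting and append its output to ``log_path``."""
    log_file = open(log_path, "ab")
    proc = subprocess.Popen(args, stdout=log_file, stderr=subprocess.STDOUT)
    log_file.close()  # the child keeps its own copy of the handle
    return proc


if __name__ == "__main__":
    fd, log_path = tempfile.mkstemp(suffix=".log")
    os.close(fd)
    # Stand-in for the real script: a child that fails immediately.
    proc = start_background([sys.executable, "-c", "import missing_module"], log_path)
    time.sleep(1)
    if proc.poll() is not None:  # not None means the child has already exited
        print("child exited early with code", proc.returncode)
        with open(log_path) as fh:
            print(fh.read())
```

If the log shows a traceback, the problem is inside the child script (for example a wrong path or working directory) rather than in Popen itself.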

Launch a single python script as different processes differing by command line arguments

I have a Python script that takes command line arguments. The way I get the command line arguments is by reading a mongo database. I need to iterate over the mongo query and launch a different process of the single script for each set of command line arguments from the mongo query.
The key is, I need the launched processes to meet two constraints:
they are separate processes that share nothing
when killing the processes, I need to be able to kill them all easily.
I think the command killall -9 script.py would work and satisfies the second constraint.
Edit 1
From the answer below, the launcher.py program looks like this
def main():
    symbolPreDict = initializeGetMongoAllSymbols()
    keys = sorted(symbolPreDict.keys())
    for symbol in keys:
        # Display key.
        print(symbol)
        command = ['python', 'mc.py', '-s', str(symbol)]
        print command
        subprocess.call(command)
if __name__ == '__main__':
    main()
The problem is that mc.py has a call that blocks
receiver = multicast.MulticastUDPReceiver ("192.168.0.2", symbolMCIPAddrStr, symbolMCPort )
while True:
    try:
        b = MD()
        data = receiver.read() # This blocks
        ...
    except Exception, e:
        print str(e)
When I run the launcher, it just executes one of the mc.py (there are at least 39). How do I modify the launcher program to say "run the launched script in background" so that the script returns to the launcher to launch more scripts?
Edit 2
The problem is solved by replacing subprocess.call(command) with subprocess.Popen(command)
One thing I noticed, though: if I run ps ax | grep mc.py, the PIDs all seem to be different. I don't think I care, since I can kill them all pretty easily with killall.
[Correction] kill them with pkill -f xxx.py

There are several options for launching scripts from a script. The easiest are probably to use the subprocess or os modules.
I have done this several times to launch things to separate nodes on a cluster. Using os it might look something like this:
import os
for i in range(len(operations)):
    os.system("python myScript.py {:} {:} > out.log".format(arg1,arg2))
using killall you should have no problem terminating processes spawned this way.
Another option is to use subprocess which has got a wide range of features and is much more flexible than os.system. An example might look like:
import subprocess
for i in range(len(operations)):
    command = ['python','myScript.py','arg1','arg2']
    subprocess.call(command)
In both of these methods, the processes are independent and share nothing other than a parent PID.
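Note that os.system and subprocess.call both wait for the command to finish, so the loops above run the scripts one after another. A minimal sketch of a non-blocking variant with subprocess.Popen follows, assuming Python 3; launch_all, stop_all and the inline child script are hypothetical stand-ins for the real launcher and mc.py.

```python
import subprocess
import sys

# Stand-in for ['python', 'mc.py', '-s', symbol]: a tiny inline child script.
CHILD_CODE = "import sys; print('handling', sys.argv[1])"


def launch_all(symbols):
    """Start one child per symbol without waiting; return the Popen handles."""
    return [
        subprocess.Popen([sys.executable, "-c", CHILD_CODE, str(symbol)])
        for symbol in symbols
    ]


def stop_all(processes):
    """Terminate children that are still running, then reap every child."""
    for proc in processes:
        if proc.poll() is None:  # None means the child has not exited yet
            proc.terminate()
    return [proc.wait() for proc in processes]


if __name__ == "__main__":
    children = launch_all(["AAA", "BBB", "CCC"])
    print([child.pid for child in children])
    print(stop_all(children))
```

Keeping the Popen handles in a list means the launcher can stop exactly the processes it started, without matching on process names.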

How to launch a couple of python scripts from a first python script and then terminate them all at once?

I have a function in a python script which should launch another python script multiple times, I am assuming this can be done like this(Script is just my imagination of how this would work.)
iterations = input("Enter the number of processes to run")
for x in range(0, iterations):
    subprocess.call("python3 /path/to/the/script.py", shell=True)
but, I also need to pass over some defined variables into the other script, for example, if
x = 1
in the first script, then, I need x to have the same value in the second script without defining it there, I have NO idea how to do that.
And then also killing them, I have read about some method using PIDs, but don't those change every time?
Most of the methods I found on Google looked overly complex and what I want to do is really simple. Can anyone guide me in the right direction as to what to use and how I should go at accomplishing it?

I have a function in a python script which should launch another python script multiple times, I am assuming this can be done like this(Script is just my imagination of how this would work.)
Here is the subprocess manual page which contains everything I will be talking about
https://docs.python.org/2/library/subprocess.html
One way to call one script from another is to use subprocess.Popen,
something along these lines:
import subprocess
for i in range(0,100):
    ret = subprocess.Popen("python3 /path/to/the/script.py",stdout=subprocess.PIPE,stderr=subprocess.PIPE,shell=True)
You can use the return value from Popen to make the call synchronous using the communicate method.
out,err = ret.communicate()
This blocks the calling script until the subprocess finishes.
I also need to pass over some defined variables into the other script??
There are multiple ways to do this.
1. Pass parameters to the called script and parse them using OptionParser or sys.argv
In the called script, have something like
from optparse import OptionParser
parser = OptionParser()
parser.add_option("-x","--variable",action="store_true",dest="xvalue",default=False)
(options,args) = parser.parse_args()
if options.xvalue == True:
    ###do something
    pass
In the calling script, use subprocess as
ret = subprocess.Popen("python3 /path/to/the/script.py -x",stdout=subprocess.PIPE,stderr=subprocess.PIPE,shell=True)
Note the addition of the -x parameter.
You can also use argparse:
https://docs.python.org/2/library/argparse.html#module-argparse
2. Pass the subprocess an environment variable which can be used to configure it. This is fast, but it only works one way, i.e. from the parent process to the child process.
In the called script:
import os
x = int(os.environ['xvalue'])
In the calling script, set the environment variable before starting the subprocess:
import os
x = 1
os.environ['xvalue'] = str(x)
3. Use sockets, pipes, or some other IPC method
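For the environment-variable option, a variation is to leave the parent's own os.environ untouched and hand the child a modified copy through the env argument of the subprocess functions. This is a minimal sketch assuming Python 3; run_child_with_x is a hypothetical helper and the inline child script stands in for the called script.

```python
import os
import subprocess
import sys

# Stand-in for the called script: it reads xvalue from its environment.
CHILD_CODE = "import os; print(int(os.environ['xvalue']) + 1)"


def run_child_with_x(x):
    """Run the child with xvalue set only in the child's environment."""
    child_env = dict(os.environ, xvalue=str(x))  # copy, then add our variable
    output = subprocess.check_output([sys.executable, "-c", CHILD_CODE], env=child_env)
    return int(output.decode().strip())


if __name__ == "__main__":
    print(run_child_with_x(1))
```

Each child can then receive a different value without the launches interfering with one another.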
And then also killing them, I have read about some method using PIDs, but don't those change every time?
Again, you can use subprocess to hold the process id and terminate it.
This will give you the process id:
ret.pid
You can then use .terminate() to terminate the process if it is running:
ret.terminate()
To check whether the process is running, use the poll method of the Popen object. I would suggest checking before you terminate the process:
ret.poll()
poll returns None if the process is still running.

If you just need to pass some values to the second script, and you need to run it
by means of the subprocess module, then you may simply pass the variables as command line arguments:
for x in range(0, iterations):
    subprocess.call('python3 /path/to/second_script.py -x=%s'%x, shell=True)
And receive the -x=1 via the sys.argv list inside second_script.py (for example using the argparse module).
On the other hand, if you need to exchange something between the two scripts dynamically (while both are running), you can use the pipe mechanism or, even better, the multiprocessing module (which requires some changes in your current code); it would make communicating with the second script and controlling it (terminating it) much cleaner.
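The receiving side of the command line approach could look roughly like this inside second_script.py. It is a sketch, assuming Python 3; parse_x is a hypothetical helper name, and the list passed in the demo only simulates a command line.

```python
import argparse


def parse_x(argv=None):
    """Parse the -x option; argv=None means "use sys.argv[1:]"."""
    parser = argparse.ArgumentParser(description="hypothetical second_script.py")
    parser.add_argument("-x", type=int, required=True, help="value passed by the caller")
    return parser.parse_args(argv).x


if __name__ == "__main__":
    # Passing a list here only simulates a command line for the demo;
    # the real script would call parse_x() with no argument.
    print(parse_x(["-x", "1"]))
```

With type=int the value arrives as an integer, so the second script does not have to convert the string itself.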

You can pass variables to subprocesses via the command line, environment variables or passing data in on stdin. Command line is easy for simple strings that aren't too long and don't themselves have shell meta characters in them. The target script would pull them from sys.argv.
script.py:
import sys
import os
import time
x = sys.argv[1]
print(os.getpid(), "processing", x)
time.sleep(240)
subprocess.Popen starts child processes but doesn't wait for them to complete. You could start all of the children, put their Popen objects in a list and finish with them later.
import subprocess
import sys
import time
iterations = int(input("Enter the number of processes to run"))
processes = []
for x in range(0, iterations):
    processes.append(subprocess.Popen([sys.executable, "/path/to/the/script.py", str(x)]))
time.sleep(10)
for proc in processes:
    # poll() returns None while the child is still running
    if proc.poll() is None:
        proc.terminate()
for proc in processes:
    returncode = proc.wait()

Unable to open a Python subprocess in Web2py (SIGABRT)

I've got an Apache2/web2py server running using the wsgi handler functionality. Within one of the controllers, I am trying to run an external executable to perform some processing on 2 files.
My approach to this is to use the subprocess module to kick off the executable. I have simplified the code to a bare-bones implementation with little success.
from subprocess import *
p = Popen(("echo", "Hello"), shell=False)
ret = p.wait()
print "Process ended with status %s" % ret
When running the above code on its own (create new file and running via python command line), it works exactly as expected.
However, as soon as I place the exact same code into my web2py controller, the external process stops working. Instead of the process returning with code 0 as is expected in the above example, it always returns -6 and "Hello" is not printed to stdout.
After doing some digging, I found that negative results from p.wait() implies that a signal caused the process to end abnormally. And, according to some docs I found, -6 corresponds to the SIGABRT signal.
I would have expected this signal to be a result of some poorly executed code in my child process. However, since this is only running echo (and since it works outside of web2py) I have my doubts that the child process is signalling itself.
Is there some web2py limitation/configuration that causes Popen() requests to always fail? If so, how can I modify my logic so that the controller (or whatever) is actually able to spawn this external process?
EDIT: Looks like web2py applications may not like the subprocess module. According to a reply to a message in the web2py email group:
"You should not use subprocess in a web2py application (if you really need too, look into the admin/controllers/shell.py) but you can use it in a web2py program running from shell (web2py.py -R myprogram.py)."
I will be checking out some options based on the note here and see if any solution presents itself.

In the end, the best I was able to come up with involved setting up a simple XML RPC server and call the functions from that:
my_server.py
#my_server.py
from SimpleXMLRPCServer import SimpleXMLRPCServer, SimpleXMLRPCRequestHandler
from subprocess import *
def echo_fn():
    p = Popen(("echo", "hello"), shell=False)
    ret = p.wait()
    print "Process ended with status %s" % ret
    return True # RPC Server doesn't like to return None
def main():
    # stock request handler; swap in a custom subclass if one is needed
    server = SimpleXMLRPCServer(("localhost", 12345), SimpleXMLRPCRequestHandler)
    server.register_function(echo_fn, "echo_fn")
    while True:
        server.handle_request()
if __name__ == "__main__":
    main()
web2py_controller.py
#web2py_controller.py
import xmlrpclib
def run_echo():
    proc_srvr = xmlrpclib.ServerProxy("http://localhost:12345")
    proc_srvr.echo_fn()
I'll be honest, I'm neither a Python nor a SimpleXMLRPCServer guru, so the overall code may not be up to best-practice standards. However, going this route did allow me to, in effect, call a subprocess from a controller in web2py.
(Note, this was a quick and dirty simplification of the code that I have in my project. I have not validated it is in a working state, so it may require some tweaks.)
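For readers on Python 3, where these modules are named xmlrpc.server and xmlrpc.client, the same idea can be sketched as below. This is not the author's code: start_server is a hypothetical helper, the server runs in a background thread only so that the demo is self-contained, port 0 (let the OS pick) replaces the fixed 12345, and a child Python process stands in for the external executable.

```python
import subprocess
import sys
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def echo_fn():
    """Run a small child process and return its exit status."""
    # Stand-in for the external executable: a child Python that prints "hello".
    return subprocess.call([sys.executable, "-c", "print('hello')"])


def start_server(host="127.0.0.1", port=0):
    """Serve echo_fn in a background thread; port 0 lets the OS pick a free port."""
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_function(echo_fn, "echo_fn")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    server = start_server()
    port = server.server_address[1]
    proxy = ServerProxy("http://127.0.0.1:%d" % port)
    print("exit status:", proxy.echo_fn())
    server.shutdown()
    server.server_close()
```

In the web2py setup described above, the server part would run as its own long-lived process and only the ServerProxy call would live in the controller.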

How to get environment from a subprocess?

I want to call a process via a Python program; however, this process needs some specific environment variables that are set by another process. How can I get the first process's environment variables to pass them to the second?
This is what the program looks like:
import subprocess
subprocess.call(['proc1']) # this sets env. variables for proc2
subprocess.call(['proc2']) # this must have env. variables set by proc1 to work
but the two processes don't share the same environment. Note that these programs aren't mine (the first is a big and ugly .bat file and the second is proprietary software), so I can't modify them (OK, I can extract all that I need from the .bat, but it's very cumbersome).
N.B.: I am using Windows, but I prefer a cross-platform solution (but my problem wouldn't happen on a Unix-like ...)

Here's an example of how you can extract environment variables from a batch or cmd file without creating a wrapper script. Enjoy.
from __future__ import print_function
import sys
import subprocess
import itertools
def validate_pair(ob):
    try:
        if not (len(ob) == 2):
            print("Unexpected result:", ob, file=sys.stderr)
            raise ValueError
    except:
        return False
    return True
def consume(iter):
    try:
        while True: next(iter)
    except StopIteration:
        pass
def get_environment_from_batch_command(env_cmd, initial=None):
    """
    Take a command (either a single command or list of arguments)
    and return the environment created after running that command.
    Note that the command must be a batch file or .cmd file, or the
    changes to the environment will not be captured.
    If initial is supplied, it is used as the initial environment passed
    to the child process.
    """
    if not isinstance(env_cmd, (list, tuple)):
        env_cmd = [env_cmd]
    # construct the command that will alter the environment
    env_cmd = subprocess.list2cmdline(env_cmd)
    # create a tag so we can tell in the output when the proc is done
    tag = 'Done running command'
    # construct a cmd.exe command to accomplish this
    cmd = 'cmd.exe /s /c "{env_cmd} && echo "{tag}" && set"'.format(**vars())
    # launch the process; universal_newlines=True makes stdout yield text, not bytes
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, env=initial, universal_newlines=True)
    # parse the output sent to stdout
    lines = proc.stdout
    # consume whatever output occurs until the tag is reached
    consume(itertools.takewhile(lambda l: tag not in l, lines))
    # define a way to handle each KEY=VALUE line
    handle_line = lambda l: l.rstrip().split('=',1)
    # parse key/values into pairs
    pairs = map(handle_line, lines)
    # make sure the pairs are valid
    valid_pairs = filter(validate_pair, pairs)
    # construct a dictionary of the pairs
    result = dict(valid_pairs)
    # let the process finish
    proc.communicate()
    return result
So to answer your question, you would create a .py file that does the following:
env = get_environment_from_batch_command('proc1')
subprocess.Popen('proc2', env=env)

As you say, processes don't share the environment - so what you literally ask is not possible, not only in Python, but with any programming language.
What you can do is to put the environment variables in a file, or in a pipe, and either
have the parent process read them, and pass them to proc2 before proc2 is created, or
have proc2 read them, and set them locally
The latter would require cooperation from proc2; the former requires that the variables become known before proc2 is started.
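The first variant (the parent reads the variables and passes them on) might be sketched as follows. It assumes the first program can be made to emit KEY=VALUE lines, which is an assumption and not something the original programs are known to do; parse_env_lines and run_with_extra_env are hypothetical helpers, and a child Python process stands in for proc2.

```python
import os
import subprocess
import sys


def parse_env_lines(text):
    """Turn KEY=VALUE lines into a dict, skipping lines without '='."""
    env = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key:
            env[key] = value
    return env


def run_with_extra_env(args, extra_env):
    """Run args with the parent's environment plus extra_env; return its stdout."""
    merged = dict(os.environ)
    merged.update(extra_env)
    return subprocess.check_output(args, env=merged).decode()


if __name__ == "__main__":
    # Stand-in for proc1's output; a real proc1 would have to print or write these lines.
    proc1_output = "LICENSE_DIR=/opt/lic\nMODE=fast=1\n"
    extra = parse_env_lines(proc1_output)
    # Stand-in for proc2: a child Python that echoes one of the variables.
    child = [sys.executable, "-c", "import os; print(os.environ['MODE'])"]
    print(run_with_extra_env(child, extra))
```

Splitting only at the first "=" keeps values that themselves contain "=" intact.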

Since you're apparently on Windows, you need a Windows answer.
Create a wrapper batch file, e.g. "run_program.bat", and run both programs:
@echo off
call proc1.bat
proc2
The wrapper calls proc1.bat, which sets its environment variables. Both programs run in the same interpreter (cmd.exe instance), so the variables proc1.bat sets will still be set when proc2 is executed.
Not terribly pretty, but it'll work.
(Unix people, you can do the same thing in a bash script: "source file.sh".)

You can use Process in psutil to get the environment variables of a process.
If you want to implement it yourself, you can refer to psutil's internal implementation, which has separate code for each supported operating system.
Currently supported operating systems are:
AIX
FreeBSD, OpenBSD, NetBSD
Linux
macOS
Sun Solaris
Windows
E.g. on Linux, you can find the environment variables of pid 7877 in the file /proc/7877/environ; just open it in rt mode to read it.
Of course, the best way to do this is:
import os
from typing import Dict
from psutil import Process
process = Process(pid=os.getpid())
process_env: Dict = process.environ()
print(process_env)
You can find the implementations for other platforms in the source code.
Hope I can help you.
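For the Linux /proc route mentioned above, the file holds KEY=VALUE entries separated by NUL bytes. A minimal sketch without psutil, assuming Python 3 and read permission on the file; parse_environ_blob and read_proc_environ are hypothetical helper names, and the file generally reflects the environment the process was started with.

```python
import os


def parse_environ_blob(raw):
    """Split the NUL-separated KEY=VALUE bytes used by /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if sep and key:
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def read_proc_environ(pid):
    """Linux only: read the environment recorded for another process."""
    with open("/proc/%d/environ" % pid, "rb") as fh:
        return parse_environ_blob(fh.read())


if __name__ == "__main__":
    sample = b"HOME=/home/demo\0LANG=C.UTF-8\0"
    print(parse_environ_blob(sample))
    try:
        print(len(read_proc_environ(os.getpid())), "variables for this process")
    except OSError:
        print("/proc is not available on this platform")
```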

The Python standard module multiprocessing has a Queue system that lets you pass picklable objects between processes. Processes can also exchange messages (pickled objects) using os.pipe. Remember that resources (e.g. database connections) and handles (e.g. file handles) can't be pickled.
You may find this link interesting:
Communication between processes with multiprocessing
The PyMOTW article about multiprocessing is also worth mentioning:
multiprocessing Basics
sorry for my spelling

Two things spring to mind: (1) make the processes share the same environment, by combining them somehow into the same process, or (2) have the first process produce output that contains the relevant environment variables, that way Python can read it and construct the environment for the second process. I think (though I'm not 100% sure) that there isn't any way to get the environment from a subprocess as you're hoping to do.

Environment is inherited from the parent process. Set the environment you need in the main script, not a subprocess (child).
