I have a script task.py that I am trying to invoke. It seems there are two ways to do that. One is to use the subprocess API while the other is to use Python's import mechanism.
task.py

def call_task():
    print("task in progress...")
    return "something"

print("calling task..")
out = call_task()
print("output of the executed task::", out)
Now, we have two approaches to invoke the above task.py python script.
Approach 1
import task

print("invoke call-task")
out = task.call_task()
print("output::", out)
Approach 2
import shlex
import subprocess

proc = subprocess.Popen(shlex.split("python task.py"), stdout=subprocess.PIPE)
out, err = proc.communicate()
print("output::", out)
Although both approaches work, which approach is more pythonic?
Running a separate Python process from Python is frequently an antipattern. There are situations where you specifically want two Python instances (for example, if the module you want to use requires its own signal handling), but in the absence of factors which force the other choice, import is generally vastly preferable. It wins on usability (you get to call the functions inside the package in an order different from its main flow, and have more fine-grained control over the internals) and on performance (starting a separate process is almost always a bad idea if you can avoid it).
While "The subprocess module allows you to spawn new processes" which executes the code within your task.py,
importing will result in the original process executing your code.
Other than that, it should be identical.
You can read more about it in the Python Subprocess Docs
As I've seen, it's rather unusual to execute Python code in an extra subprocess.
It might occasionally pay off performance-wise, but the more Pythonic way would be importing, I guess.
Related
I have recently come across a few posts on Stack Overflow saying that subprocess is much better than os.system, but I am having difficulty finding the exact advantages.
Some examples of things I have run into:
https://docs.python.org/3/library/os.html#os.system
"The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this function."
No idea in what ways it is more powerful, though. I know subprocess is easier to use in many ways, but is it actually more powerful somehow?
Another example is:
https://stackoverflow.com/a/89243/3339122
The advantage of subprocess vs system is that it is more flexible (you can get the stdout, stderr, the "real" status code, better error handling, etc...).
That post has 2600+ votes. Again, I could not find any elaboration on what was meant by better error handling or a "real" status code.
Top comment on that post is:
Can't see why you'd use os.system even for quick/dirty/one-time. subprocess seems so much better.
Again, I understand that it makes some things slightly easier, but I can hardly see why, for example:
subprocess.call("netsh interface set interface \"Wi-Fi\" enable", shell=True)
is any better than
os.system("netsh interface set interface \"Wi-Fi\" enabled")
Can anyone explain some reasons it is so much better?
First of all, you are cutting out the middleman; subprocess.call by default avoids spawning a shell that examines your command, and directly spawns the requested process. This is important because, besides the efficiency side of the matter, you don't have much control over the default shell behavior, and it actually typically works against you regarding escaping.
In particular, do not do this:
subprocess.call('netsh interface set interface "Wi-Fi" enable')
since
If passing a single string, either shell must be True (see below) or else the string must simply name the program to be executed without specifying any arguments.
Instead, you'll do:
subprocess.call(["netsh", "interface", "set", "interface", "Wi-Fi", "enable"])
Notice that here all the escaping nightmares are gone. subprocess handles escaping (if the OS wants arguments as a single string - such as Windows) or passes the separated arguments straight to the relevant syscall (execvp on UNIX).
Compare this with having to handle the escaping yourself, especially in a cross-platform way (cmd doesn't escape in the same way as POSIX sh), and especially with the shell in the middle messing with your stuff (trust me, you don't want to know what an unholy mess it is to provide 100% safe escaping for your command when calling cmd /k).
Also, when using subprocess without the shell in the middle, you are sure you are getting correct return codes. If there's a failure launching the process you get a Python exception; if you get a return code, it's actually the return code of the launched program. With os.system you have no way to know whether the return code comes from the launched command (the usual case if the shell manages to launch it) or is some error from the shell (if it didn't manage to launch it).
Besides argument splitting/escaping and return codes, you have far better control over the launched process. Even with subprocess.call (the most basic utility function built on top of the subprocess machinery) you can redirect stdin, stdout and stderr, possibly communicating with the launched process. check_call is similar but avoids the risk of ignoring a failure exit code. check_output covers the common use case of check_call plus capturing all the program output into a string variable.
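For instance, a minimal sketch of those two helpers (ls is just a stand-in for whatever command you actually run):

import subprocess

# check_call raises CalledProcessError on a nonzero exit code,
# so a failure can't be silently ignored.
subprocess.check_call(["ls", "-l"])

# check_output does the same, and additionally captures the program's stdout.
listing = subprocess.check_output(["ls", "-l"])
print(listing.decode())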
Once you get past call & friends (which block just as os.system does), there are far more powerful facilities - in particular, the Popen object allows you to work with the launched process asynchronously. You can start it, possibly talk with it through the redirected streams, check whether it is still running from time to time while doing other stuff, wait for it to complete, send signals to it and kill it - all of which goes well beyond the mere synchronous "start the process with default stdin/stdout/stderr through the shell and wait for it to finish" that os.system provides.
So, to sum it up, with subprocess:
even at the most basic level (call & friends), you:
avoid escaping problems by passing a Python list of arguments;
avoid the shell messing with your command line;
either you have an exception or the true exit code of the process you launched; no confusion about program/shell exit code;
have the possibility to capture stdout and in general redirect the standard streams;
when you use Popen:
you aren't restricted to a synchronous interface, but can actually do other stuff while the subprocess runs;
you can control the subprocess (check if it is running, communicate with it, kill it).
Given that subprocess does way more than os.system can do - and in a safer, more flexible (if you need it) way - there's just no reason to use system instead.
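As a rough sketch of the asynchronous usage described above (sleep 5 is just a stand-in for a long-running program):

import subprocess
import time

# Start the process without blocking.
proc = subprocess.Popen(["sleep", "5"])

# Do other work while it runs, polling occasionally.
while proc.poll() is None:   # poll() returns None while the child is still running
    print("still running, doing other work...")
    time.sleep(1)

print("exit code:", proc.returncode)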
There are many reasons, but the main reason is mentioned directly in the docstring:
>>> os.system.__doc__
'Execute the command in a subshell.'
For almost all cases where you need a subprocess, it is undesirable to spawn a subshell. It is unnecessary and wasteful, it adds an extra layer of complexity, and it introduces several new vulnerabilities and failure modes. Using the subprocess module cuts out the middleman.
I have a Python program from which I spawn a sub-program to process some files without holding up the main program. I'm currently using bash for the sub-program, started with a command and two parameters like this:
result = os.system('sub-program.sh file.txt file.txt &')
That works fine, but I (eventually!) realised that I could use Python for the sub-program, which would be far preferable, so I have converted it. The simplest way of spawning it might be:
result = os.system('python3 sub-program.py file.txt file.txt &')
Some research has shown several more sophisticated alternatives, but I have the impression that the latest and most approved method is this one:
subprocess.Popen(["python3", "-u", "sub-program.py"])
Am I correct in thinking that that is the most appropriate way of doing it? Would anyone recommend a different method and why? Simple would be good as I'm a bit of a Python novice.
If this is the recommended method, I can probably work out what the "-u" does and how to add the parameters for myself.
Optional extras:
Send a message back from the sub-program to the main program.
Make the sub-program quit when the main program does.
Yes, using subprocess is the recommended way to go according to the documentation:
The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this function.
However, subprocess.Popen may not be what you're looking for. As opposed to os.system, it creates a Popen object that corresponds to the subprocess, and you'll have to call wait on it explicitly to wait for its completion, e.g.:
proc = subprocess.Popen(["python3", "-u", "sub-program.py"])
do_something()
res = proc.wait()
If you want to just run a program and wait for completion you should probably use subprocess.run (or maybe subprocess.call, subprocess.check_call or subprocess.check_output) instead.
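For example, a minimal sketch with subprocess.run, reusing the command line from the question (note that capture_output requires Python 3.7+):

import subprocess

# run() blocks until the child exits and returns a CompletedProcess.
result = subprocess.run(
    ["python3", "-u", "sub-program.py", "file.txt", "file.txt"],
    capture_output=True,  # capture stdout/stderr instead of inheriting them
    text=True,            # decode output as str rather than bytes
)
print("return code:", result.returncode)
print("output:", result.stdout)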
Thanks skyking!
With
import subprocess
at the beginning of the main program, this does what I want:
with open('output.txt', 'w') as f:
    # parameter1 and parameter2 are variables holding the two arguments
    subprocess.Popen(["python3", "spawned.py", parameter1, parameter2], stdout=f)
The first line opens a file for the output of the sub-program started on the second line. The square brackets contain the command line for the sub-program: the interpreter, the script name, and its two parameters. The parameters are available in the sub-program as sys.argv[1] and sys.argv[2]. After that come the subprocess keyword arguments - stdout=f sends the sub-program's output to the text file opened above.
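For completeness, a minimal sketch of what the sub-program side might look like (spawned.py is just the hypothetical name used above):

# spawned.py - hypothetical sub-program
import sys

# sys.argv[0] is the script name; the two parameters follow.
param1, param2 = sys.argv[1], sys.argv[2]
print("processing", param1, "and", param2)   # goes to output.txt via stdout=f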
Is there any particular reason it has to be another program entirely? Why not just spawn another process which runs one of the functions defined within your script?
I suggest that you read up on multiprocessing. Python has a module just for that: https://docs.python.org/dev/library/multiprocessing.html
There you can find info on spawning new processes, communicating between them, and synchronizing them.
Be warned, though, that if you really want to speed up your file processing you'll want processes rather than threads: because of Python's Global Interpreter Lock, threads can actually slow down CPU-bound work, which is confusing at first.
Also check out this page: https://pymotw.com/2/multiprocessing/basics.html
It has some code samples that will help you out a lot.
Don't forget this guard in your script:
if __name__ == '__main__':
It is very important ;)
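A minimal sketch of what that might look like, with the guard in place (process_file is a made-up stand-in for your file-processing function):

import multiprocessing

def process_file(path):
    # hypothetical unit of work - replace with your real processing
    print("processing", path)

if __name__ == '__main__':
    # Without this guard, spawning platforms (e.g. Windows) re-import the
    # module in the child and would try to start processes recursively.
    p = multiprocessing.Process(target=process_file, args=("file.txt",))
    p.start()
    p.join()   # wait for the child process to finish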
This is a summary of my code:
# import whatever

def createFolder():
    # someCode
    var1 = Gdrive.createFolder(name)
    return var1

def main():
    # someCode
    var2 = createFolder()
    return var2

if __name__ == "__main__":
    print main()
One way in which I managed to return a value to a bash variable was printing what was returned from main(). Another way is just printing the variable anywhere in the script.
Is there any way to return it in a more pythonic way?
The script is called this way:
folder=$(python create_folder.py "string_as_arg")
A more pythonic way would be to avoid bash and write the whole lot in python.
You can't expect bash to have a Pythonic way of getting values from another process - its way is the bash way.
bash and python run in different processes, and inter-process communication (IPC) must go via the kernel. There are many IPC mechanisms, but bash does not support them all (shared memory, for example). The lowest common denominator here is bash, so you must use what bash supports, not what python has (python supports practically everything).
Without shared memory, it is not a simple thing to write to variables of another process - let alone another language. Debuggers do it, but they are written specifically for the host language.
The mechanism you use from bash is to capture the stdout of the child process, so python must print. Under the covers this uses an anonymous pipe. You could use a named pipe (also known as a fifo) instead, which python would open as a normal file and write to it. But it wouldn't buy you much.
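For the curious, a minimal sketch of the named-pipe variant (the path /tmp/result.fifo is made up; the bash side would first run mkfifo /tmp/result.fifo and then read folder < /tmp/result.fifo):

import os

fifo_path = "/tmp/result.fifo"   # hypothetical path agreed on with the bash script

# Opening a fifo for writing blocks until a reader (the bash script) opens it.
with open(fifo_path, "w") as fifo:
    fifo.write("my_new_folder\n")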
If you were working in bash then you could simply do:
export var="value"
However, there is no such equivalent in Python. Changes made via os.environ only affect the Python process itself and any children it spawns; they cannot modify the parent shell's environment after the program finishes. Your best bet is to do exactly what you are already doing.
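A minimal sketch to illustrate (set_var.py is a hypothetical file name):

import os

# Takes effect for this process and any children it spawns...
os.environ["var"] = "value"

# ...but after this script exits, the parent shell is untouched:
#   $ python3 set_var.py; echo "$var"
#   (prints an empty line)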
You can try to set an environment variable from within the python code and read it outside, in the bash script. This approach looks elegant to me, but it is definitely not the "perfect solution" or the only solution. If you like this approach, this thread might be useful: How to set environment variables in Python
There are other ways, very similar to what you have done. Check also this thread: store return value of a Python script in a bash script
Just use sys.exit(), i.e.:
import sys

[...]

if __name__ == "__main__":
    sys.exit(main())
I'm working on a Python 2.7 project doing a fair amount of I/O; processes are launched via the subprocess module, directories are created via os.makedirs, files are copied via shutil.copy2 and more.
Now I'd like to add a "dry run" mode, i.e. a mode in which the program doesn't actually do any I/O. Is there an easy way to do this, knowing that basically all my I/O goes through the three modules os, shutil and subprocess?
Two approaches I considered so far:
Write wrapper functions for all the things I'd like to silence, e.g. mymakedirs which just forwards to os.makedirs. All wrapper functions check a global flag and do nothing if requested. Unfortunately this means not only writing a lot of wrapper functions but also touching a lot of existing code.
Write proxy modules like myshutil which consult a global flag and, depending on it, either do from shutil import * or provide stubs. The only downsides I can see: how can I easily tell which stubs to write (can I see which functions of a module are called?), and I'd need to make a slight modification to all client code so that e.g. import shutil becomes import myshutil.
The second idea seems the best to me so far, but I wonder: is there another, even nicer technique to proxy an existing module with as little modification to existing code as possible?
In solution 1, you don't need to rewrite your code: you can monkeypatch os to intercept the calls:
>>> import os
>>> def mymkdir(*args):
...     print "mkdir", args
...
>>> os.mkdir = mymkdir  # monkey patching os
>>> os.mkdir("toto")
mkdir ('toto',)
You can probably even swap out the entire module, with something like os = myos, though I haven't worked out a concrete solution for that.
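A rough sketch of how the monkeypatching approach could be wired to a global dry-run flag (DRY_RUN and install_dry_run are made-up names for illustration):

import os
import shutil
import subprocess

DRY_RUN = True   # hypothetical global flag

def _stub(name):
    def stub(*args, **kwargs):
        print("[dry-run] %s%r" % (name, args))
    return stub

def install_dry_run():
    # Replace the I/O entry points with logging stubs.
    os.makedirs = _stub("os.makedirs")
    shutil.copy2 = _stub("shutil.copy2")
    subprocess.call = _stub("subprocess.call")

if DRY_RUN:
    install_dry_run()

os.makedirs("some/dir")   # prints "[dry-run] os.makedirs('some/dir',)" instead of creating it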
The use case is as follows:
I have a script that runs a series of non-python executables to reduce (pulsar) data. I currently use subprocess.Popen(..., shell=True) and then the communicate function of subprocess to capture the standard out and standard error from the non-python executables, and I log the captured output using the python logging module.
The problem is that just one core of the possible 8 gets used most of the time.
I want to spawn multiple processes, each doing a part of the data set in parallel, and I want to keep track of progress. It is a script / program to analyze data from a low-frequency radio telescope (LOFAR). The easier it is to install / manage and test, the better.
I was about to build code to manage all this, but I'm sure it must already exist in some easy library form.
The subprocess module can start multiple processes for you just fine, and keep track of them. The problem, though, is reading the output from each process without blocking any other processes. Depending on the platform there are several ways of doing this: using the select module to see which process has data to be read, setting the output pipes non-blocking using the fcntl module, or using threads to read each process's data (which subprocess.Popen.communicate itself uses on Windows, since it doesn't have the other two options). In each case the devil is in the details, though.
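For example, a rough sketch of the thread-per-process approach (the reduce_step commands are placeholders):

import logging
import subprocess
import threading

logging.basicConfig(level=logging.INFO)

def run_and_log(cmd):
    # One thread per child: a blocking read here doesn't stall the others.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    for line in proc.stdout:
        logging.info("%s: %s", cmd[0], line.decode().rstrip())
    proc.wait()

# Hypothetical commands - substitute the real reduction executables.
commands = [["reduce_step", "chunk1.dat"], ["reduce_step", "chunk2.dat"]]
threads = [threading.Thread(target=run_and_log, args=(cmd,)) for cmd in commands]
for t in threads:
    t.start()
for t in threads:
    t.join()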
Something that handles all this for you is Twisted, which can spawn as many processes as you want, and can call your callbacks with the data they produce (as well as other situations.)
Maybe Celery will serve your needs.
If I understand correctly what you are doing, I might suggest a slightly different approach: establish a single unit of work as a function, and then layer the parallel processing on top of that. For example:
Wrap the current functionality (calling subprocess and capturing output) into a single function. Have the function create a result object that can be returned; alternatively, the function could write out to files as you see fit.
Create an iterable (list, etc.) that contains an input for each chunk of data for step 1.
Create a multiprocessing Pool and then capitalize on its map() functionality to execute your function from step 1 for each of the items in step 2. See the python multiprocessing docs for details.
You could also use a worker/queue model. The key, I think, is to encapsulate the current subprocess/output-capture code into a function that does the work for a single chunk of data (whatever that is). Layering on the parallel processing piece is then quite straightforward using any of several techniques, only a couple of which are described here.
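Putting those steps together, a minimal sketch (reduce_step and the chunk file names are hypothetical):

import multiprocessing
import subprocess

def reduce_chunk(chunk_path):
    # Step 1: one unit of work - run the (hypothetical) executable and
    # capture its output.
    proc = subprocess.Popen(["reduce_step", chunk_path],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return chunk_path, proc.returncode, out

if __name__ == "__main__":
    chunks = ["chunk1.dat", "chunk2.dat", "chunk3.dat"]   # step 2: the inputs
    pool = multiprocessing.Pool(processes=8)              # step 3: one worker per core
    for path, code, out in pool.map(reduce_chunk, chunks):
        print(path, "finished with exit code", code)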