I'm a programming newbie and I'm trying to understand how stdin, stdout, and stderr work. As I understand it, stdout and stderr are two different places where we can direct output from programs. I guess I don't understand what's the point of having a second "stream" of output just for errors with stderr? Why not have errors on the regular stdout? What does having errors on stderr allow me to do (basically why is stderr useful)?
There are two "points" to supporting distinct stdout and stderr streams:
When you are writing applications that can be chained together (e.g. using pipelines) you don't want the "normal" output to get mixed up with errors, warnings, debug info and other "chit chat". Mixing them in the same stream would make life difficult for the next program in the chain / pipeline.
Example:
$ cat some-file | grep not
$ echo $?
If the cat command did not write its error messages to stderr, then grep would see a "file not found" message whenever "some-file" did not exist. It would then (incorrectly) match on the "not" in that message and set the pipeline's return code incorrectly. Constructing pipelines that cope with this sort of thing would be hellishly difficult.
Separate stdout and stderr streams have been supported in (at least) UNIX and UNIX-like systems since ... umm ... the 1970s. And they are part of the POSIX standard. If a new programming language's runtime libraries did not support this, it would be considered crippled; i.e., unsuitable for writing production quality applications.
(In the history of programming languages, Python is still relatively new.)
However, nobody is forcing you to write your applications to use stderr for its intended purpose. (Well ... maybe your future co-workers will :-) )
In UNIX (and Linux, and other POSIX-compatible systems) programs are often combined with pipes, so that one program takes the output of another as input. If normal output and error information were mixed, every program would need to know how to treat diagnostic info from its upstream data producer differently from normal data. In practice, that is impossible given the huge number of possible program combinations.
By writing error information to stderr, each program makes it possible for the user to get this info without needing to filter it out of the data stream intended to be read by the next program in the pipe.
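For illustration, here is a minimal Python sketch (the filter logic and the messages are made up) of a program that keeps its data and its diagnostics on separate streams:
import sys

def main():
    for line in sys.stdin:
        line = line.rstrip("\n")
        if not line:
            # diagnostics go to stderr, so they never pollute the data stream
            print("skipping empty line", file=sys.stderr)
            continue
        # actual results go to stdout, ready to be piped into the next program
        print(line.upper())

if __name__ == "__main__":
    main()
Piped as cat some-file | python filter.py | grep SOMETHING, the next program in the chain only ever sees the uppercased data; the "skipping empty line" chatter still reaches your terminal via stderr.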
Related
I have recently come across a few posts on Stack Overflow saying that subprocess is much better than os.system; however, I am having difficulty finding the exact advantages.
Some examples of things I have run into:
https://docs.python.org/3/library/os.html#os.system
"The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this function."
I have no idea in what ways it is more powerful, though. I know subprocess is easier to use in many ways, but is it actually more powerful in some way?
Another example is:
https://stackoverflow.com/a/89243/3339122
The advantage of subprocess vs system is that it is more flexible (you can get the stdout, stderr, the "real" status code, better error handling, etc...).
This post has 2600+ votes. Again, I could not find any elaboration on what was meant by better error handling or a real status code.
Top comment on that post is:
Can't see why you'd use os.system even for quick/dirty/one-time. subprocess seems so much better.
Again, I understand it makes some things slightly easier, but I can hardly understand why, for example:
subprocess.call("netsh interface set interface \"Wi-Fi\" enable", shell=True)
is any better than
os.system("netsh interface set interface \"Wi-Fi\" enabled")
Can anyone explain some reasons it is so much better?
First of all, you are cutting out the middleman; subprocess.call by default avoids spawning a shell that examines your command, and directly spawns the requested process. This is important because, besides the efficiency side of the matter, you don't have much control over the default shell behavior, and it actually typically works against you regarding escaping.
In particular, do not do this:
subprocess.call('netsh interface set interface "Wi-Fi" enable')
since
If passing a single string, either shell must be True (see below) or else the string must simply name the program to be executed without specifying any arguments.
Instead, you'll do:
subprocess.call(["netsh", "interface", "set", "interface", "Wi-Fi", "enable"])
Notice that here all the escaping nightmares are gone. subprocess handles escaping (if the OS wants arguments as a single string - such as Windows) or passes the separated arguments straight to the relevant syscall (execvp on UNIX).
Compare this with having to handle the escaping yourself, especially in a cross-platform way (cmd doesn't escape in the same way as POSIX sh), especially with the shell in the middle messing with your stuff (trust me, you don't want to know what an unholy mess it is to provide 100% safe escaping for your command when calling cmd /k).
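For instance, an argument containing spaces and quotes needs no manual quoting at all when passed as a list element. A small sketch (the argument value is arbitrary, and the Python interpreter itself is used as the child so the snippet runs anywhere):
import subprocess
import sys

awkward = 'a value with spaces and "quotes"'

# the argument reaches the child process exactly as written,
# with no quoting gymnastics and no shell interpreting it
subprocess.call([sys.executable, "-c", "import sys; print(sys.argv[1])", awkward])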
Also, when using subprocess without the shell in the middle you are sure you are getting correct return codes. If there's a failure launching the process, you get a Python exception; if you get a return code, it's actually the return code of the launched program. With os.system you have no way to know whether the return code comes from the launched command (which is generally the case if the shell manages to launch it) or is some error from the shell (if it didn't manage to launch it).
Besides arguments splitting/escaping and return code, you have way better control over the launched process. Even with subprocess.call (which is the most basic utility function over subprocess functionalities) you can redirect stdin, stdout and stderr, possibly communicating with the launched process. check_call is similar and it avoids the risk of ignoring a failure exit code. check_output covers the common use case of check_call + capturing all the program output into a string variable.
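A quick sketch of those two helpers (the child commands are only illustrative; the Python interpreter itself is used as the child so the snippet runs anywhere):
import subprocess
import sys

# check_call raises CalledProcessError on a non-zero exit code,
# so a failure cannot be silently ignored
subprocess.check_call([sys.executable, "-c", "print('side effect only')"])

# check_output additionally captures everything the child wrote to stdout
out = subprocess.check_output([sys.executable, "-c", "print('captured')"])
print(repr(out))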
Once you get past call & friends (which block, just as os.system does), there are way more powerful functionalities - in particular, the Popen object allows you to work with the launched process asynchronously. You can start it, possibly talk with it through the redirected streams, check from time to time whether it is still running while doing other stuff, wait for it to complete, send signals to it, and kill it - all things well beyond the mere synchronous "start the process with default stdin/stdout/stderr through the shell and wait for it to finish" that os.system provides.
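A rough sketch of that asynchronous style (the sleeping child is just a stand-in for a real long-running program):
import subprocess
import sys
import time

# launch a slow child without blocking the parent
child = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(2); print('done')"],
    stdout=subprocess.PIPE,
)

# do other work while the child is still running
while child.poll() is None:      # poll() returns None until the child exits
    print("still waiting...")
    time.sleep(0.5)

out, _ = child.communicate()     # collect the remaining output
print(out)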
So, to sum it up, with subprocess:
even at the most basic level (call & friends), you:
avoid escaping problems by passing a Python list of arguments;
avoid the shell messing with your command line;
either you have an exception or the true exit code of the process you launched; no confusion about program/shell exit code;
have the possibility to capture stdout and in general redirect the standard streams;
when you use Popen:
you aren't restricted to a synchronous interface, but you can actually do other stuff while the subprocess runs;
you can control the subprocess (check if it is running, communicate with it, kill it).
Given that subprocess does way more than os.system can do - and in a safer, more flexible (if you need it) way - there's just no reason to use system instead.
There are many reasons, but the main reason is mentioned directly in the docstring:
>>> os.system.__doc__
'Execute the command in a subshell.'
For almost all cases where you need a subprocess, it is undesirable to spawn a subshell. It is unnecessary and wasteful, it adds an extra layer of complexity, and it introduces several new vulnerabilities and failure modes. Using the subprocess module cuts out the middleman.
Why did Guido (or whoever else) decide to make python --version print to stderr rather than stdout? Just curious what the use case is that makes standard error more appropriate than standard out.
Python 3.4 was modified to print the version to stdout, which is the expected behavior. This is tracked as a bug here: http://bugs.python.org/issue18338. The comments on the bug report indicate that while stdout is the reasonable choice, changing older versions would break backward compatibility, so Python 2.7.9 is left largely unchanged, because so much relies on the existing behavior.
Hope that helps!
Many programs would just use stdout and not care but I would prefer stderr on principle. In short, I believe stdout is for the product of successful execution of a program while stderr is for any messages meant for the user. Calculated values go to stdout while errors, stack traces, help, version and usage messages are meant for the user and should go to stderr.
I use this question to decide which output stream is appropriate: Is this message meant for the consumer of the main product of this program (whether that's the human user or another program or whatever), or is it strictly for the human user of this program?
Also, it looks like Java uses stderr for version messages as well, by the way: Why does 'java -version' go to stderr?
I'm writing a pygtk application in Python 2.7.5 that requires some heavy mathematical calculations, so I need to do these calculations in an external pypy (which doesn't support gtk) for efficiency and plot the results in the main program as they are produced.
Since the output of the calculations is potentially infinite and I want to show it as it is produced, I cannot use subprocess.Popen.communicate(input).
I am able to do non-blocking reads of the output (via fcntl), but I am not able to effectively send the input (or, at any rate, something else is going wrong that I don't see). For example, the following code:
import subprocess
# start pypy subprocess
pypy = subprocess.Popen(['pypy', '-u'], bufsize=0, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
# send input to pypy
pypy.stdin.write('import sys\nprint "hello"\nsys.stdout.flush()\n')
pypy.stdin.flush()
# read output from pypy
pypy.stdout.flush()
print pypy.stdout.readline()
will get stuck on the last line. What is weird to me is that if I substitute 'pypy' with 'cat' it will work, and if I substitute the input/output lines with
print pypy.communicate(input='import sys\nprint "hello"\nsys.stdout.flush()\n')[0]
it will also work (but that does not fit with what I want to do). I thought it was a problem of buffering, but I tried several ways of avoiding it (including writing to stderr and so on) with no luck. I also tried sending pypy the command to print in a while True loop, also with no luck (which makes me think it is not a problem with output buffering but maybe with input buffering).
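(A minimal sketch of the obvious workaround, under the assumption that the interpreter defers execution until it sees EOF on a piped stdin; closing stdin after writing would provide that EOF, but of course it rules out sending further input, which is exactly what I need to keep doing:)
pypy.stdin.write('import sys\nprint "hello"\nsys.stdout.flush()\n')
pypy.stdin.close()   # signal EOF so the interpreter starts executing the input
print pypy.stdout.readline()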
Currently I am redirecting a script to a log file with the following command:
python /usr/home/scripts/myscript.py 2>&1 | tee /usr/home/logs/mylogfile.log
This seems to work, but it does not write to the file as soon as there is a print statement. Rather, it waits until there is a group of lines that it can write. I want the console and the log file to be written to simultaneously. How can this be done with output redirection? Note that running the script on the console prints everything when it should, though doing a tail -f on the logfile is not smooth, since it writes about 50 lines at a time. Any suggestions?
It sounds like the buffering kicks in because stdout stops being a terminal once it is piped into tee (Python block-buffers its output when it isn't attached to a tty); as you say, it outputs as expected when written straight to the console.
You could look at this post for potential ways to undo that buffering: https://unix.stackexchange.com/questions/25372/turn-off-buffering-in-pipe
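For instance, the buffering can be switched off when launching the script (both of these are standard CPython mechanisms, shown against your original command):
python -u /usr/home/scripts/myscript.py 2>&1 | tee /usr/home/logs/mylogfile.log
PYTHONUNBUFFERED=1 python /usr/home/scripts/myscript.py 2>&1 | tee /usr/home/logs/mylogfile.log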
But I would recommend doing it entirely within Python, so you have more direct control, and instead of printing to stdout, use the logging module.
This would allow additional flexibility in terms of multiple logging levels, the ability to add multiple sources to the logging object centrally (i.e. stdout and a file -- and one which rotates with size if you'd like with logging.handlers.RotatingFileHandler) and you wouldn't be subject to the external buffering of the shell.
More info: https://docs.python.org/2/howto/logging.html
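A minimal sketch of that setup (the logger name, format, and rotation size are placeholders; the file path is the one from your command):
import logging
import logging.handlers
import sys

logger = logging.getLogger("myscript")
logger.setLevel(logging.DEBUG)

# everything still goes to the console...
console = logging.StreamHandler(sys.stdout)

# ...and to a file that rolls over at about 1 MB, keeping 5 backups
filelog = logging.handlers.RotatingFileHandler(
    "/usr/home/logs/mylogfile.log", maxBytes=1000000, backupCount=5)

formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
console.setFormatter(formatter)
filelog.setFormatter(formatter)

logger.addHandler(console)
logger.addHandler(filelog)

logger.info("this line reaches the console and the log file immediately")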
For simple debugging in a complex project, is there a reason to use the Python logger instead of print? What about other use cases? Is there an accepted best use case for each (especially when you're only looking for stdout)?
I've always heard that this is a "best practice" but I haven't been able to figure out why.
The logging package has a lot of useful features:
Easy to see where and when (even what line no.) a logging call is being made from.
You can log to files, sockets, pretty much anything, all at the same time.
You can differentiate your logging based on severity.
Print doesn't have any of these.
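As a small sketch of those features (the format string and file name are just examples):
import logging

# one call gives timestamps, severity, file name and line number,
# and sends everything to a file instead of cluttering stdout
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(filename)s:%(lineno)d %(message)s",
    filename="app.log",
)

log = logging.getLogger(__name__)
log.debug("fine-grained detail, easy to switch off later by raising the level")
log.warning("something looks wrong")   # different severity, same one-line call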
Also, if your project is meant to be imported by other Python tools, it's bad practice for your package to print things to stdout, since the user likely won't know where the print messages are coming from. With logging, users of your package can choose whether or not they want to propagate logging messages from your tool.
One of the biggest advantages of proper logging is that you can categorize messages and turn them on or off depending on what you need. For example, it might be useful to turn on debugging level messages for a certain part of the project, but tone it down for other parts, so as not to be taken over by information overload and to easily concentrate on the task for which you need logging.
Also, logs are configurable. You can easily filter them, send them to files, format them, add timestamps, and any other things you might need on a global basis. Print statements are not easily managed.
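For instance, a sketch of turning debug output on for just one part of a project (the module names are made up):
import logging

logging.basicConfig(level=logging.WARNING)   # keep everything quiet by default

# crank up only the part you are currently working on
logging.getLogger("myproject.parser").setLevel(logging.DEBUG)

logging.getLogger("myproject.network").debug("suppressed: still at WARNING")
logging.getLogger("myproject.parser").debug("shown while debugging the parser")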
Print statements are sort of the worst of both worlds, combining the negative aspects of an online debugger with those of diagnostic instrumentation. You have to modify the program, but you don't get more useful code out of it.
An online debugger allows you to inspect the state of a running program. But the nice thing about a real debugger is that you don't have to modify the source, neither before nor after the debugging session; you just load the program into the debugger, tell the debugger where you want to look, and you're all set.
Instrumenting the application might take some work up front, modifying the source code in some way, but the resulting diagnostic output can have enormous amounts of detail, and can be turned on or off to a very specific degree. The python logging module can show not just the message logged, but also the file and function that called it, a traceback if there was one, the actual time the message was emitted, and so on. More than that: diagnostic instrumentation need never be removed. It's just as valid and useful when the program is finished and in production as it was the day it was added, but it can have its output stuck in a log file where it's not likely to annoy anyone, or the log level can be turned down to keep all but the most urgent messages out.
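As an illustrative sketch of instrumentation that can stay in the code permanently (the file name, format, and function are arbitrary):
import logging

logging.basicConfig(
    filename="diagnostics.log",   # out of the way, not on the console
    level=logging.INFO,           # turn down to WARNING in production
    format="%(asctime)s %(funcName)s:%(lineno)d %(levelname)s %(message)s",
)
log = logging.getLogger(__name__)

def divide(a, b):
    log.debug("dividing %r by %r", a, b)   # silent unless the level allows it
    try:
        return a / b
    except ZeroDivisionError:
        log.exception("division failed")   # logs the message plus the full traceback
        return None

divide(1, 0)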
Anticipating the need for a debugger is really no harder than using ipython while you're testing and becoming familiar with the commands it uses to control the built-in pdb debugger.
When you find yourself thinking that a print statement might be easier than using pdb (as it often is), you'll find that using a logger leaves your program in a much easier-to-work-on state than if you use and later remove print statements.
I have my editor configured to highlight print statements as syntax errors, and logging statements as comments, since that's about how I regard them.
In brief, the advantages of using a logging library outweigh print for the reasons below:
Control what’s emitted
Define what types of information you want to include in your logs
Configure how it looks when it’s emitted
Most importantly, set the destination for your logs
In detail, segmenting log events by severity level is a good way to sift through which log messages may be most relevant at a given time. A log event's severity level also gives you an indication of how worried you should be when you see a particular message: for instance, log events can be divided into debug, info, warning, error, and critical levels. Timing can be everything when you're trying to understand what went wrong with an application. You want to know the answers to questions like:
“Was this happening before or after my database connection died?”
“Exactly when did that request come in?”
Furthermore, it is easy to see where a log event occurred, through the line number and the filename or method name, and even which thread it came from.
Here's a functional logging library for Python named loguru.
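For completeness, a tiny sketch of what that looks like with loguru (the file name and rotation size are arbitrary):
from loguru import logger

# one call adds a file sink with rotation; console output works out of the box
logger.add("app.log", rotation="10 MB", level="INFO")

logger.debug("goes to the default console sink only")
logger.info("goes to the console and is written to app.log")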
If you use logging then the person responsible for deployment can configure the logger to send it to a custom location, with custom information. If you only print, then that's all they get.
Logging essentially creates a searchable plain text database of print outputs with other meta data (timestamp, loglevel, line number, process etc.).
This is pure gold: I can run egrep over the log file after the Python script has run.
I can tune my egrep pattern search to pick exactly what I am interested in and ignore the rest. This reduction of cognitive load and freedom to pick my egrep pattern later on by trial and error is the key benefit for me.
tail -f mylogfile.log | egrep "key_word1|key_word2"
Now throw in other cool things that print can't do (sending to a socket, setting debug levels, logrotate, adding metadata, etc.), and you have every reason to prefer logging over plain print statements.
I tend to use print statements because they're lazy and easy, and adding logging needs some boilerplate code. But hey, we have yasnippets (emacs), ultisnips (vim), and other templating tools, so why give up logging for plain print statements!?
I would add to all the other advantages mentioned that the print function, in its standard configuration, is buffered: the flush may happen only once the output buffer fills up (or when the program exits).
This is true for any program launched from a non-interactive shell (CodeBuild or GitLab CI, for instance) or whose output is redirected.
If for any reason the program is killed (kill -9, hard reset of the computer, …), you may be missing some lines of output if you used print for them.
However, the logging library flushes its stream after every call, so log messages written to stderr and stdout appear immediately.