For simple debugging in a complex project is there a reason to use the python logger instead of print? What about other use-cases? Is there an accepted best use-case for each (especially when you're only looking for stdout)?
I've always heard that this is a "best practice" but I haven't been able to figure out why.
The logging package has a lot of useful features:
Easy to see where and when (even what line no.) a logging call is being made from.
You can log to files, sockets, pretty much anything, all at the same time.
You can differentiate your logging based on severity.
Print doesn't have any of these.
Also, if your project is meant to be imported by other python tools, it's bad practice for your package to print things to stdout, since the user likely won't know where the print messages are coming from. With logging, users of your package can choose whether or not they want to propogate logging messages from your tool or not.
One of the biggest advantages of proper logging is that you can categorize messages and turn them on or off depending on what you need. For example, it might be useful to turn on debugging level messages for a certain part of the project, but tone it down for other parts, so as not to be taken over by information overload and to easily concentrate on the task for which you need logging.
Also, logs are configurable. You can easily filter them, send them to files, format them, add timestamps, and any other things you might need on a global basis. Print statements are not easily managed.
Print statements are sort of the worst of both worlds, combining the negative aspects of an online debugger with diagnostic instrumentation. You have to modify the program but you don't get more, useful code from it.
An online debugger allows you to inspect the state of a running program; But the nice thing about a real debugger is that you don't have to modify the source; neither before nor after the debugging session; You just load the program into the debugger, tell the debugger where you want to look, and you're all set.
Instrumenting the application might take some work up front, modifying the source code in some way, but the resulting diagnostic output can have enormous amounts of detail, and can be turned on or off to a very specific degree. The python logging module can show not just the message logged, but also the file and function that called it, a traceback if there was one, the actual time that the message was emitted, and so on. More than that; diagnostic instrumentation need never be removed; It's just as valid and useful when the program is finished and in production as it was the day it was added; but it can have it's output stuck in a log file where it's not likely to annoy anyone, or the log level can be turned down to keep all but the most urgent messages out.
anticipating the need or use for a debugger is really no harder than using ipython while you're testing, and becoming familiar with the commands it uses to control the built in pdb debugger.
When you find yourself thinking that a print statement might be easier than using pdb (as it often is), You'll find that using a logger pulls your program in a much easier to work on state than if you use and later remove print statements.
I have my editor configured to highlight print statements as syntax errors, and logging statements as comments, since that's about how I regard them.
In brief, the advantages of using logging libraries do outweigh print as below reasons:
Control what’s emitted
Define what types of information you want to include in your logs
Configure how it looks when it’s emitted
Most importantly, set the destination for your logs
In detail, segmenting log events by severity level is a good way to sift through which log messages may be most relevant at a given time. A log event’s severity level also gives you an indication of how worried you should be when you see a particular message. For instance, dividing logging type to debug, info, warning, critical, and error. Timing can be everything when you’re trying to understand what went wrong with an application. You want to know the answers to questions like:
“Was this happening before or after my database connection died?”
“Exactly when did that request come in?”
Furthermore, it is easy to see where a log has occurred through line number and filename or method name even in which thread.
Here's a functional logging library for Python named loguru.
If you use logging then the person responsible for deployment can configure the logger to send it to a custom location, with custom information. If you only print, then that's all they get.
Logging essentially creates a searchable plain text database of print outputs with other meta data (timestamp, loglevel, line number, process etc.).
This is pure gold, I can run egrep over the log file after the python script has run.
I can tune my egrep pattern search to pick exactly what I am interested in and ignore the rest. This reduction of cognitive load and freedom to pick my egrep pattern later on by trial and error is the key benefit for me.
tail -f mylogfile.log | egrep "key_word1|key_word2"
Now throw in other cool things that print can't do (sending to socket, setting debug levels, logrotate, adding meta data etc.), you have every reason to prefer logging over plain print statements.
I tend to use print statements because it's lazy and easy, adding logging needs some boiler plate code, hey we have yasnippets (emacs) and ultisnips (vim) and other templating tools, so why give up logging for plain print statements!?
I would add to all other mentionned advantages that the print function in standard configuration is buffered. The flush may occure only at the end of the current block (the one where the print is).
This is true for any program launched in a non interactive shell (codebuild, gitlab-ci for instance) or whose output is redirected.
If for any reason the program is killed (kill -9, hard reset of the computer, …), you may be missing some line of logs if you used print for the same.
However, the logging library will ensure to flush the logs printed to stderr and stdout immediately at any call.
Related
I would like users of my program to be able to define custom scripts in python without breaking the program. I am looking at something like this:
def call(script):
code = "access modification code" + script
exec(code)
Where "access modification code" defines a scope such that script can only access variables it instantiates itself. Is it possible to do this or something with similar functionality, such as creating a new python environment with its own scope and then receiving output from it?
Thank you for your time :)
Clarification Edit
"I want to prevent both active attacks and accidental interaction with the program variables outside the users script (hence hiding all globals). The user scripts are intended to be small and inputted as text. The return of the user script needs to be immediate, as though it were native to the program."
There's two separate problems you want to prevent in this scenario:
Prevent an outside attacker from running arbitrary code in the context of the OS user executing your program. This means preventing arbitrary code execution and privilege escalation.
Preventing a legitimate user of your program from changing the program behavior in unintended ways.
For the first problem, you need to make sure that the source of the python code you execute is the user. You shouldn't accept input from a socket, or from a file that other users can write to. You need to make sure, somehow, that it was the user who is running the program that provided the input.
Actual solutions will depend on your OS, but you may want to consider setting up restrictive file permissions, if you're storing the code in a file.
Don't ignore or downplay this problem, or your users will fall victim to virus/malware/hackers thanks to your program.
The way you solve the second problem depends on what exactly constitutes intended behaviour in your program. If you're happy with outputting simple data structures, you can run your user-inputted code in a separate process, and pass the result over a pipe in a serialization format such as JSON or YAML.
Here's a very simple example:
#remember to set restrictive file permissions on this file. This is OS-dependent
USER_CODE_FILE="/home/user/user_code_file.py"
#absolute path to python binary (executable)
PYTHON_PATH="/usr/bin/python"
import subprocess
import json
user_code= '''
import json
my_data= {"a":[1,2,3]}
print json.dumps(my_data)
'''
with open(USER_CODE_FILE,"wb") as f:
f.write(user_code)
user_result_str= subprocess.check_output([PYTHON_PATH, USER_CODE_FILE])
user_result= json.loads( user_result_str )
print user_result
This is a fairly simple solution, and it has significant overhead. Don't use this if you need to run the user code many times in a short period of time.
Ultimately, this solution is only effective against unsophisticated attackers (users). Generally speaking, there's no way to protect any user process from the user itself - nor would it make much sense.
If you really want more assurance, and want to mitigate the first problem too, you should run the process as a separate ("guest") user. This, again, is OS-dependent.
Finally a warning: avoid exec and eval to the best of your abilities. They don't protect either your program or the user against the inputted code. There's a lot of information about this on the web, just search for "python secure eval"
Actual Problem (Crash Log Generation)
Is there a python module that could help me produce meaningful crash logs? Or a good way to go about producing them?
I want my crash logs to contain:
All variables within the current execution stack
All global variables
Parameters at invocation (i.e. flags, their values, etc)
Is this something I'd just be better off writing myself?
Context (Not particularly relevant)
I have a program that is used by a large number of people within my company that I am responsible for supporting. Unfortunately it doesn't always work correctly (about 1 out of 1000 times) and I am having difficulty tracking the bug down. I think that having solid crash logs would really help here so that my users could just submit those rather than making vague phone calls for help.
You can look at how django generates its error pages: https://github.com/django/django/blob/master/django/views/debug.py#L59
In non-debug mode these are mailed to the list of admins in the settings file. It's super handy!
The python logging module has everything you need already.
I am working on a website, hosted on DreamHost, using Python. For a while, I was using their default setup, which runs Python scripts using CGI. It worked fine, but I was worried that if I get a lot of traffic, it would run slow and use a lot of memory, so I switched it over to FastCGI using this module.
Overall, it still works fine, but there is one major annoyance: I can't seem to be able to see anything that gets written to the standard error stream. If anything goes wrong, my usual source of useful clues for what to do about it no longer works. Before, I used to see stuff sent to standard error in my Apache error log. Now, it just seems to disappear.
I tried making a test script, and writing strings using sys.stderr.write (from various places), and environ["wsgi.errors"].write (from within my app, where environ is the first parameter passed to the app by the WSGI/FastCGI wrapper). Either way, I couldn't find them. Does anyone know why, or how to access this data?
Keep in mind that this is my first time ever using FastCGI, so please let me know if I am making a bad choice by using this fcgi module.
If something in your system is capturing file-descriptor two (the "real" stderr), you can assign sys.stderr to any open, writeable file object, or to a file-like object (it basically just needs to implement write) -- including a cStdIO.StdIO instance, whose value you can get at any time (before it's closed) with a call to its .getvalue() method.
To capture any uncaught exception just before it terminates your code, assign to sys.excepthook a function of yours in which you get the information and emit it in any way of your choice; or, to get and emit anything that was written to sys.stderr even without an exception (if that's what you want -- I'm not sure, from your question), use atexit to
register your grab-info-and-emit-it function.
My question is simple: what to write into a log.
Are there any conventions?
What do I have to put in?
Since my app has to be released, I'd like to have friendly logs, which could be read by most people without asking what it is.
I already have some ideas, like a timestamp, a unique identifier for each function/method, etc..
I'd like to have several log levels, like tracing/debugging, informations, errors/warnings.
Do you use some pre-formatted log resources?
Thank you
It's quite pleasant, and already implemented.
Read this: http://docs.python.org/library/logging.html
Edit
"easy to parse, read," are generally contradictory features. English -- easy to read, hard to parse. XML -- easy to parse, hard to read. There is no format that achieves easy-to-read and easy-to-parse. Some log formats are "not horrible to read and not impossible to parse".
Some folks create multiple handlers so that one log has several formats: a summary for people to read and an XML version for automated parsing.
Here are some suggestions for content:
timestamp
message
log message type (such as error, warning, trace, debug)
thread id ( so you can make sense of the log file from a multi threaded application)
Best practices for implementation:
Put a mutex around the write method so that you can be sure that each write is thread safe and will make sense.
Send 1 message at at a time to the log file, and specify the type of log message each time. Then you can set what type of logging you want to take on program startup.
Use no buffering on the file, or flush often in case there is a program crash.
Edit: I just noticed the question was tagged with Python, so please see S. Lott's answer before mine. It may be enough for your needs.
Since you tagged your question python, I refer you to this question as well.
As for the content, Brian proposal is good. Remember however to add the program name if you are using a shared log.
The important thing about a logfile is "greppability". Try to provide all your info in a single line, with proper string identifiers that are unique (also in radix) for a particular condition. As for the timestamp, use the ISO-8601 standard which sorts nicely.
A good idea is to look at log analysis software. Unless you plan to write your own, you will probably want to exploit an existing log analysis package such as Analog. If that is the case, you will probably want to generate a log output that is similar enough to the formats that it accepts. It will allow you to create nice charts and graphs with minimal effort!
In my opinion, best approach is to use existing logging libraries such as log4j (or it's variations for other languages). It gives you control on how your messages are formatted and you can change it without ever touching your code. It follows best practices, robust and used by millions of users. Of course, you can write you own logging framework, but that would be very odd thing to do, unless you need something very specific. In any case, walk though their documentation and see how log statements are presented there.
Check out log4py - Python ported version of log4j, I think there are several implementations for Python out there..
Python's logging library are thread-safe for single process threads.
timeStamp i.e. DateTime YYYY/MM/DD:HH:mm:ss:ms
User
Thread ID
Function Name
Message/Error Message/Success Message/Function Trace
Have this in XML format and you can then easily write a parser for it.
<log>
<logEntry DebugLevel="0|1|2|3|4|5....">
<TimeStamp format="YYYY/MM/DD:HH:mm:ss:ms" value="2009/04/22:14:12:33:120" />
<ThreadId value="" />
<FunctionName value="" />
<Message type="Error|Success|Failure|Status|Trace">
Your message goes here
</Message>
</logEntry>
</log>
I would like to enable students to submit python code solutions to a few simple python problems. My applicatoin will be running in GAE. How can I limit the risk from malicios code that is sumitted? I realize that this is a hard problem and I have read related Stackoverflow and other posts on the subject. I am curious if the restrictions aleady in place in the GAE environment make it simpler to limit damage that untrusted code could inflict. Is it possible to simply scan the submitted code for a few restricted keywords (exec, import, etc.) and then ensure the code only runs for less than a fixed amount of time, or is it still difficult to sandbox untrusted code even in the resticted GAE environment? For example:
# Import and execute untrusted code in GAE
untrustedCode = """#Untrusted code from students."""
class TestSpace(object):pass
testspace = TestSpace()
try:
#Check the untrusted code somehow and throw and exception.
except:
print "Code attempted to import or access network"
try:
# exec code in a new namespace (Thanks Alex Martelli)
# limit runtime somehow
exec untrustedCode in vars(testspace)
except:
print "Code took more than x seconds to run"
#mjv's smiley comment is actually spot-on: make sure the submitter IS identified and associated with the code in question (which presumably is going to be sent to a task queue), and log any diagnostics caused by an individual's submissions.
Beyond that, you can indeed prepare a test-space that's more restrictive (thanks for the acknowledgment;-) including a special 'builtin' that has all you want the students to be able to use and redefines __import__ &c. That, plus a token pass to forbid exec, eval, import, __subclasses__, __bases__, __mro__, ..., gets you closer. A totally secure sandbox in a GAE environment however is a real challenge, unless you can whitelist a tiny subset of the language that the students are allowed.
So I would suggest a layered approach: the sandbox GAE app in which the students upload and execute their code has essentially no persistent layer to worry about; rather, it "persists" by sending urlfetch requests to ANOTHER app, which never runs any untrusted code and is able to vet each request very critically. Default-denial with whitelisting is still the holy grail, but with such an extra layer for security you may be able to afford a default-acceptance with blacklisting...
You really can't sandbox Python code inside App Engine with any degree of certainty. Alex's idea of logging who's running what is a good one, but if the user manages to break out of the sandbox, they can erase the event logs. The only place this information would be safe is in the per-request logging, since users can't erase that.
For a good example of what a rathole trying to sandbox Python turns into, see this post. For Guido's take on securing Python, see this post.
There are another couple of options: If you're free to choose the language, you could run Rhino (a Javascript interpreter) on the Java runtime; Rhino is nicely sandboxed. You may also be able to use Jython; I don't know if it's practical to sandbox it, but it seems likely.
Alex's suggestion of using a separate app is also a good one. This is pretty much the approach that shell.appspot.com takes: It can't prevent you from doing malicious things, but the app itself stores nothing of value, so there's no harm if you do.
Here's an idea. Instead of running the code server-side, run it client-side with Skuplt:
http://www.skulpt.org/
This is both safer, and easier to implement.