Python MultiProcess, Logging, Various Classes - python

I am currently trying to log to a single file from multiple processes but I am having a lot of trouble with it. I have spent countless hours looking online -- Stack Overflow and Google -- but have come up with nothing concrete.
I have read: How should I log while using multiprocessing in Python?
I've been trying to use zzzeek's code but it does not write to the file for me. I don't have a specific way I'm doing it -- I've just been trying every way I can.
Have any of you got it to work and have sample code, or do you have an alternative way of doing it? I need to log multiple processes to the same file. I would also like to log any errors from various classes to the same file. I would, however, be satisfied with simply getting the multiprocessing one to work.
Thanks

Look at these posts:
Using logging with multiprocessing
Improved QueueHandler, QueueListener: dealing with handlers that block
logutils: Using recent logging features with older Python versions

Here's some sample code that works with zzzeek's handler:
import logging, multiprocessing, os

# MultiProcessingLog is zzzeek's handler from the linked answer
mtlog = MultiProcessingLog('foo.log', 'a', 0, 0)
logging.getLogger().addHandler(mtlog)

def task(_):
    logging.error('Hi from {}'.format(os.getpid()))

p = multiprocessing.Pool()
p.map(task, range(4))
Here's my running it:
$ python mtlog.py
$ cat foo.log
Hi from 6179
Hi from 6180
Hi from 6181
Hi from 6182
In fact, any trivial test I come up with works just fine. So clearly, you're doing something wrong, probably the same thing, in all of your attempts.
My first guess is that you're trying to use it on Windows. As Noah Yetter's comment says:
Unfortunately this approach doesn't work on Windows. From docs.python.org/library/multiprocessing.html 16.6.2.12 "Note that on Windows child processes will only inherit the level of the parent process’s logger – any other customization of the logger will not be inherited." Subprocesses won't inherit the handler, and you can't pass it explicitly because it's not pickleable.
Although zzzeek replies that he thinks it'll work, I'm 90% sure he's wrong:
I think that only refers to the logger that's hardwired into the multiprocessing module. This recipe isn't making any usage of that nor should it care about propagation of loglevels, it just shuttles data from child to parent using normal multiprocessing channels.
That's exactly backward. Propagation of log levels does work; propagation of addHandler does not.
To make this work, you'd need to pass the queue explicitly to the children, and build the child-side logger out of that.
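For example, here's a minimal sketch of that explicit-queue approach using the stdlib QueueHandler/QueueListener (Python 3 only) rather than zzzeek's recipe; the file name and pool size are arbitrary:
import logging
import logging.handlers
import multiprocessing
import os

def worker_init(queue):
    # runs in each child: route all of its logging through the shared queue
    root = logging.getLogger()
    root.handlers[:] = [logging.handlers.QueueHandler(queue)]
    root.setLevel(logging.DEBUG)

def task(_):
    logging.error('Hi from %s', os.getpid())

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    # the listener runs in the parent and does the actual file writing
    listener = logging.handlers.QueueListener(queue, logging.FileHandler('foo.log'))
    listener.start()
    with multiprocessing.Pool(4, worker_init, (queue,)) as pool:
        pool.map(task, range(4))
    listener.stop()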

Related

Python3 curses code not working in pipeline

I am writing a python script that I want to use in a unix pipeline. My goal is to write to the screen using curses (which should only be seen by the person running the command, not the pipe), and then write the "return value" to stdout at the end so it can continue down the pipeline, something along the lines of ./myscript.py | consumer_script
This was failing in mysterious ways until I found this. The suggested solution was to use newterm instead of initscr.
My problem is that I am using Python, and from what I could find in the documentation, newterm doesn't exist. All I was able to find was a single reference to newterm, and it didn't come with a link.
Could someone please either point me towards the Python newterm, or suggest another way of working with pipes and curses?
I think you're making this more complicated than it needs to be... the simple answer is to write the curses stream to a handle other than stdout. If that works for you, stderr is the obvious choice. In short, anything that gets written to stdout goes into the pipeline, and if you don't want it there, you need a different handle.
Check out this thread for ways to write to stderr in python:
How to print to stderr in Python?
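If it helps, here's a rough sketch of the stderr idea; it assumes stderr is still attached to the terminal (it usually is in a pipeline), and the fd juggling is just one way to do it:
import curses
import os

pipe_fd = os.dup(1)          # keep the pipeline's end of stdout
os.dup2(2, 1)                # fd 1 now points at the terminal, via stderr

stdscr = curses.initscr()    # curses writes to fd 1, i.e. the terminal
try:
    stdscr.addstr(0, 0, 'Interactive output -- seen on the terminal only')
    stdscr.refresh()
    stdscr.getch()
finally:
    curses.endwin()
    os.dup2(pipe_fd, 1)      # restore the pipe as fd 1
    os.close(pipe_fd)

print('result for the next command in the pipeline')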

Preferred method for controlled logging of Python debug messages?

I'm writing a large hardware simulation library in Python3. For logging, I use the Python3 Logging module.
For controlling debug messages with method-level granularity, I learned "on the street" (ok, here at StackOverflow) to create sub-loggers within each method I wanted to log from:
sub_logger = logging.getLogger(__name__).getChild("new_sublogger_name")
sub_logger.setLevel(logging.DEBUG)
# Sample debug message
sub_logger.debug("This is a debug message...")
By changing the call to setLevel(), the user is able to enable/disable debugging messages on a per-method basis.
Now the Boss Man don't like this approach. He's advocating a single point at which all logging messages in the library can be enabled/disabled with the same method-level granularity. (This was to be accomplished by writing our own Python logging library, BTW.)
Not wanting to re-invent the logging wheel, I proposed to instead continue to use the Python Logging library, but instead use Filters to allow single-point control of logging messages.
Having not used Python Logging Filters very often, is there a consensus on using Filters vs Sublogger.setLevel() for this application? What are the pros/cons of each method?
I'm quite used to setLevel() after using it for a while, but that may be coloring my objectivity. I DO NOT, however, wish to waste everyone's time writing another Python logging library.
I think the existing logging module does what you want. The trick is to separate the place where you call setLevel() (a configuration operation) from the places where you call getChild() (ongoing logging operations).
import logging

logger = logging.getLogger('mod1')

def fctn1():
    logger.getChild('fctn1').debug('I am chatty')
    # do stuff (notice, no setLevel)

def fctn2():
    logger.getChild('fctn2').debug('I am even more chatty')
    # do stuff (notice, no setLevel)
Notice there was no setLevel() there, which makes sense. Why call setLevel() every time? And since when does a method know what logging level the user wants?
You set your logging levels in a configuration step at the beginning of the program. You can do it with the dictionary based configuration, a python module that does a bunch of setLevel() calls or even something you cook up with ini files or whatever. But basically it boils down to:
def config_logger():
    logging.getLogger('abc.def').setLevel(logging.INFO)
    logging.getLogger('mod1').setLevel(logging.WARN)
    logging.getLogger('mod1.fctn1').setLevel(logging.DEBUG)
    # etc...
Now, if you want to get fancy with filters, you can use them to inspect the stack frame and pull the method name out for you. But that gets more complicated.
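For what it's worth, a filter doesn't even need to walk the stack itself: the LogRecord already carries the calling function's name in funcName. Here's a rough sketch of the single-point-of-control idea (the DEBUG_FUNCTIONS set is a made-up name):
import logging

DEBUG_FUNCTIONS = {'fctn1'}              # functions allowed to emit DEBUG messages

class MethodLevelFilter(logging.Filter):
    def filter(self, record):
        if record.levelno > logging.DEBUG:
            return True                   # never suppress INFO and above here
        return record.funcName in DEBUG_FUNCTIONS

handler = logging.StreamHandler()
# attach the filter to the handler so it also sees records propagated from child loggers
handler.addFilter(MethodLevelFilter())

root = logging.getLogger()
root.setLevel(logging.DEBUG)
root.addHandler(handler)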

In python, why use logging instead of print?

For simple debugging in a complex project is there a reason to use the python logger instead of print? What about other use-cases? Is there an accepted best use-case for each (especially when you're only looking for stdout)?
I've always heard that this is a "best practice" but I haven't been able to figure out why.
The logging package has a lot of useful features:
Easy to see where and when (even what line no.) a logging call is being made from.
You can log to files, sockets, pretty much anything, all at the same time.
You can differentiate your logging based on severity.
Print doesn't have any of these.
Also, if your project is meant to be imported by other python tools, it's bad practice for your package to print things to stdout, since the user likely won't know where the print messages are coming from. With logging, users of your package can choose whether or not they want to propagate logging messages from your tool.
One of the biggest advantages of proper logging is that you can categorize messages and turn them on or off depending on what you need. For example, it might be useful to turn on debugging level messages for a certain part of the project, but tone it down for other parts, so as not to be taken over by information overload and to easily concentrate on the task for which you need logging.
Also, logs are configurable. You can easily filter them, send them to files, format them, add timestamps, and any other things you might need on a global basis. Print statements are not easily managed.
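As a quick sketch (the logger names are made up), the same logging calls can be redirected, formatted, and silenced purely through configuration, without touching the call sites:
import logging

logging.basicConfig(
    filename='app.log',
    format='%(asctime)s %(levelname)s %(name)s: %(message)s',
    level=logging.INFO,
)
logging.getLogger('myproject.parser').setLevel(logging.DEBUG)     # chatty where needed
logging.getLogger('myproject.network').setLevel(logging.WARNING)  # quiet elsewhere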
Print statements are sort of the worst of both worlds, combining the negative aspects of an online debugger with diagnostic instrumentation. You have to modify the program but you don't get more, useful code from it.
An online debugger allows you to inspect the state of a running program, and the nice thing about a real debugger is that you don't have to modify the source, neither before nor after the debugging session: you just load the program into the debugger, tell it where you want to look, and you're all set.
Instrumenting the application might take some work up front, modifying the source code in some way, but the resulting diagnostic output can have enormous amounts of detail, and can be turned on or off to a very specific degree. The python logging module can show not just the message logged, but also the file and function that called it, a traceback if there was one, the actual time that the message was emitted, and so on. More than that, diagnostic instrumentation need never be removed; it's just as valid and useful when the program is finished and in production as it was the day it was added, but it can have its output sent to a log file where it's not likely to annoy anyone, or the log level can be turned down to keep all but the most urgent messages out.
Anticipating the need for a debugger is really no harder than using IPython while you're testing and becoming familiar with the commands it uses to control the built-in pdb debugger.
When you find yourself thinking that a print statement might be easier than using pdb (as it often is), you'll find that using a logger leaves your program in a much easier state to work on than if you add and later remove print statements.
I have my editor configured to highlight print statements as syntax errors, and logging statements as comments, since that's about how I regard them.
In brief, the advantages of using a logging library outweigh print for the following reasons:
Control what’s emitted
Define what types of information you want to include in your logs
Configure how it looks when it’s emitted
Most importantly, set the destination for your logs
In detail, segmenting log events by severity level is a good way to sift through which log messages may be most relevant at a given time. A log event’s severity level also gives you an indication of how worried you should be when you see a particular message: for instance, dividing log messages into debug, info, warning, error, and critical levels. Timing can be everything when you’re trying to understand what went wrong with an application. You want to know the answers to questions like:
“Was this happening before or after my database connection died?”
“Exactly when did that request come in?”
Furthermore, it is easy to see where a log event occurred: the line number and filename, the method name, and even which thread it came from.
Here's a functional logging library for Python named loguru.
If you use logging then the person responsible for deployment can configure the logger to send it to a custom location, with custom information. If you only print, then that's all they get.
Logging essentially creates a searchable plain-text database of print outputs with other metadata (timestamp, log level, line number, process, etc.).
This is pure gold, I can run egrep over the log file after the python script has run.
I can tune my egrep pattern search to pick exactly what I am interested in and ignore the rest. This reduction of cognitive load and freedom to pick my egrep pattern later on by trial and error is the key benefit for me.
tail -f mylogfile.log | egrep "key_word1|key_word2"
Now throw in other cool things that print can't do (sending to a socket, setting debug levels, logrotate, adding metadata, etc.), and you have every reason to prefer logging over plain print statements.
I tend to use print statements because they're lazy and easy, and adding logging needs some boilerplate code; but hey, we have yasnippet (Emacs), UltiSnips (Vim), and other templating tools, so why give up logging for plain print statements!?
I would add to all the other advantages mentioned that the print function in its standard configuration is buffered: the flush may happen only well after the call itself.
This is true for any program launched from a non-interactive shell (CodeBuild or GitLab CI, for instance) or whose output is redirected.
If for any reason the program is killed (kill -9, hard reset of the computer, …), you may be missing some lines of output if you used print for them.
However, the logging library will flush the records written to stderr and stdout immediately on every call.

python queue concurrency process management

The use case is as follows :
I have a script that runs a series of non-Python executables to reduce (pulsar) data. Right now I use subprocess.Popen(..., shell=True) and then the communicate function of subprocess to capture the standard output and standard error from the non-Python executables, and I log the captured output using the Python logging module.
The problem is: just one core of the possible 8 gets used most of the time.
I want to spawn multiple processes, each doing a part of the data set in parallel, and I want to keep track of progress. It is a script / program to analyze data from a low-frequency radio telescope (LOFAR). The easier to install / manage and test, the better.
I was about to build code to manage all this, but I'm sure it must already exist in some easy library form.
The subprocess module can start multiple processes for you just fine, and keep track of them. The problem, though, is reading the output from each process without blocking any other processes. Depending on the platform there are several ways of doing this: using the select module to see which process has data to be read, setting the output pipes non-blocking using the fcntl module, or using threads to read each process's data (which subprocess.Popen.communicate itself uses on Windows, because it doesn't have the other two options). In each case the devil is in the details, though.
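Here's a rough POSIX-only sketch of the select option: watch the stdout pipes of several children and read from whichever has data ready (the 'worker' command is a made-up stand-in for the non-Python executable):
import os
import select
import subprocess

procs = [subprocess.Popen(['worker', str(i)], stdout=subprocess.PIPE)
         for i in range(4)]
pipes = {p.stdout.fileno(): p for p in procs}

while pipes:
    ready, _, _ = select.select(list(pipes), [], [])
    for fd in ready:
        data = os.read(fd, 4096)
        if data:
            print('pid %d: %s' % (pipes[fd].pid, data.decode().rstrip()))
        else:                          # empty read means the child closed its pipe
            pipes[fd].wait()
            del pipes[fd]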
Something that handles all this for you is Twisted, which can spawn as many processes as you want, and can call your callbacks with the data they produce (as well as other situations.)
Maybe Celery will serve your needs.
If I understand correctly what you are doing, I might suggest a slightly different approach. Try establishing a single unit of work as a function and then layer on the parallel processing after that. For example:
Wrap the current functionality (calling subprocess and capturing output) into a single function. Have the function create a result object that can be returned; alternatively, the function could write out to files as you see fit.
Create an iterable (list, etc.) that contains an input for each chunk of data for step 1.
Create a multiprocessing Pool and then capitalize on its map() functionality to execute your function from step 1 for each of the items in step 2. See the python multiprocessing docs for details.
You could also use a worker/Queue model. The key, I think, is to encapsulate the current subprocess/output capture stuff into a function that does the work for a single chunk of data (whatever that is). Layering on the parallel processing piece is then quite straightforward using any of several techniques, only a couple of which were described here.
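A rough sketch of the Pool-based version, assuming a made-up external command called reduce_tool and a list of data chunks:
import subprocess
from multiprocessing import Pool

def reduce_chunk(chunk_path):
    # run the external reduction tool on one chunk and capture its output
    proc = subprocess.Popen(['reduce_tool', chunk_path],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return chunk_path, proc.returncode, out, err

if __name__ == '__main__':
    chunks = ['chunk_%02d.dat' % i for i in range(8)]
    with Pool() as pool:               # one worker per core by default
        for path, rc, out, err in pool.imap_unordered(reduce_chunk, chunks):
            print('%s finished with exit code %d' % (path, rc))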

Dealing with external processes

I've been working on a gui app that needs to manage external processes. Working with external processes leads to a lot of issues that can make a programmer's life difficult. I feel like maintenance on this app is taking an unacceptably long time. I've been trying to list the things that make working with external processes difficult so that I can come up with ways of mitigating the pain. This kind of turned into a rant, which I thought I'd post here in order to get some feedback and to provide some guidance to anybody thinking about sailing into these very murky waters. Here's what I've got so far:
Output from the child can get mixed up with output from the parent. This can make both outputs misleading and hard to read. It can be hard to tell what came from where. It becomes harder to figure out what's going on when things are asynchronous. Here's a contrived example:
import textwrap, os, time
from subprocess import Popen

test_path = 'test_file.py'
with open(test_path, 'w') as file:
    file.write(textwrap.dedent('''
        import time
        for i in range(3):
            print 'Hello %i' % i
            time.sleep(1)'''))
proc = Popen('python -B "%s"' % test_path)
for i in range(3):
    print 'Hello %i' % i
    time.sleep(1)
os.remove(test_path)
Output:
Hello 0
Hello 0
Hello 1
Hello 1
Hello 2
Hello 2
I guess I could have the child process write its output to a file. But it can be annoying to have to open up a file every time I want to see the result of a print statement.
If I have code for the child process I could add a label, something like print 'child: Hello %i', but it can be annoying to do that for every print. And it adds some noise to the output. And of course I can't do it if I don't have access to the code.
I could manually manage the process output. But then you open up a huge can of worms with threads and polling and stuff like that.
A simple solution is to treat processes like synchronous functions, that is, no further code executes until the process completes. In other words, make the process block. But that doesn't work if you're building a gui app. Which brings me to the next problem...
Blocking processes cause the gui to become unresponsive.
import textwrap, sys, os
from subprocess import Popen
from PyQt4.QtGui import *
from PyQt4.QtCore import *

test_path = 'test_file.py'
with open(test_path, 'w') as file:
    file.write(textwrap.dedent('''
        import time
        for i in range(3):
            print 'Hello %i' % i
            time.sleep(1)'''))

app = QApplication(sys.argv)
button = QPushButton('Launch process')

def launch_proc():
    # Can't move the window until process completes
    proc = Popen('python -B "%s"' % test_path)
    proc.communicate()

button.connect(button, SIGNAL('clicked()'), launch_proc)
button.show()
app.exec_()
os.remove(test_path)
Qt provides a process wrapper of its own called QProcess which can help with this. You can connect functions to signals to capture output relatively easily. This is what I'm currently using. But I'm finding that all these signals behave suspiciously like goto statements and can lead to spaghetti code. I think I want to get sort-of blocking behavior by having the 'finished' signal from QProcess call a function containing all the code that comes after the process call. I think that should work but I'm still a bit fuzzy on the details...
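Something along these lines is roughly what I mean; the script path is the same throwaway file as above, this runs inside the existing QApplication event loop, and it uses PyQt4's new-style signal connections:
from PyQt4.QtCore import QProcess

proc = QProcess()

def show_output():
    print 'child:', str(proc.readAllStandardOutput()).rstrip()

def after_process(exit_code, exit_status=None):
    # everything that used to come after the blocking communicate() call goes here
    print 'process finished with exit code', exit_code

proc.readyReadStandardOutput.connect(show_output)
proc.finished.connect(after_process)
proc.start('python', ['-B', 'test_file.py'])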
Stack traces get interrupted when you go from the child process back to the parent process. If a normal function screws up, you get a nice complete stack trace with filenames and line numbers. If a subprocess screws up, you'll be lucky if you get any output at all. You end up having to do a lot more detective work every time something goes wrong.
Speaking of which, output has a way of disappearing when dealing with external processes. For example, if you run something via the Windows 'cmd' command, the console will pop up, execute the code, and then disappear before you have a chance to see the output. You have to pass the /k flag to make it stick around. Similar issues seem to crop up all the time.
I suppose both problems 3 and 4 have the same root cause: no exception handling. Exception handling is meant to be used with functions, it doesn't work with processes. Maybe there's some way to get something like exception handling for processes? I guess that's what stderr is for? But dealing with two different streams can be annoying in itself. Maybe I should look into this more...
Processes can hang and stick around in the background without you realizing it. So you end up yelling at your computer cuz it's going so slow until you finally bring up your task manager and see 30 instances of the same process hanging out in the background.
Also, hanging background processes can interfere with other instances of the process in various fun ways, such as causing permissions errors by holding a handle to a file or something like that.
It seems like an easy solution to this would be to have the parent process kill the child process on exit if the child process didn't close itself. But if the parent process crashes, cleanup code might not get called and the child can be left hanging.
Also, if the parent waits for the child to complete, and the child is in an infinite loop or something, you can end up with two hanging processes.
This problem can tie in to problem 2 for extra fun, causing your gui to stop responding entirely and force you to kill everything with the task manager.
F***ing quotes
Parameters often need to be passed to processes. This is a headache in itself. Especially if you're dealing with file paths. Say... 'C:/My Documents/whatever/'. If you don't have quotes, the string will often be split at the space and interpreted as two arguments. If you need nested quotes you can use ' and ". But if you need to use more than two layers of quotes, you have to do some nasty escaping, for example: "cmd /k 'python \'path 1\' \'path 2\''".
A good solution to this problem is passing parameters as a list rather than as a single string. Subprocess allows you to do this.
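For example (the paths are made up), there's no quoting layer at all when you pass a list:
import subprocess

subprocess.call(['python', 'C:/My Documents/path 1.py', 'C:/My Documents/path 2.py'])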
Can't easily return data from a subprocess.
You can use stdout of course. But what if you want to throw a print in there for debugging purposes? That's gonna screw up the parent if it's expecting output formatted a certain way. In functions you can print one string and return another and everything works just fine.
Obscure command-line flags and a crappy terminal based help system.
These are problems I often run into when using os level apps. Like the /k flag I mentioned, for holding a cmd window open: whose idea was that? Unix apps don't tend to be much friendlier in this regard. Hopefully you can use Google or StackOverflow to find the answer you need. But if not, you've got a lot of boring reading and frustrating trial and error to do.
External factors.
This one's kind of fuzzy. But when you leave the relatively sheltered harbor of your own scripts to deal with external processes, you find yourself having to deal with the "outside world" to a much greater extent. And that's a scary place. All sorts of things can go wrong. Just to give a random example: the cwd in which a process is run can modify its behavior.
There are probably other issues, but those are the ones I've written down so far. Any other snags you'd like to add? Any suggestions for dealing with these problems?
Check out the subprocess module. It should help with output separation. I don't see any way around either separate output streams or some kind of output tagging in a single stream.
The hanging process problem is difficult as well. The only solution I have been able to come up with is to put a timer on the external process, and kill it if it does not return in the allotted time. Crude and nasty, and if anyone else has a good solution, I would love to hear it so I can use it too.
One thing you could do to help deal with the problem of completely un-managed shutdown is to keep a directory of pid files. Whenever you kick off an external process, write a file into your pid file directory with a name that is the pid for the process. Erase the pid file when you know the process has exited cleanly. You can use the stuff in the pid directory to help cleanup on crashes or re-starts.
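A rough sketch of the pid-file bookkeeping (the directory name is made up):
import os
import subprocess

PID_DIR = 'pids'
if not os.path.isdir(PID_DIR):
    os.makedirs(PID_DIR)

def launch(args):
    # start an external process and record its pid as an empty file
    proc = subprocess.Popen(args)
    open(os.path.join(PID_DIR, str(proc.pid)), 'w').close()
    return proc

def finished(proc):
    # call once the process has exited cleanly to erase its pid file
    path = os.path.join(PID_DIR, str(proc.pid))
    if os.path.exists(path):
        os.remove(path)

# on startup, anything still listed here belonged to a run that crashed or was
# killed, so it can be checked and cleaned up
stale_pids = os.listdir(PID_DIR)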
This may not provide any satisfying or useful answers, but maybe it's a start.
