I have a site running in Django. The frontend is lighttpd, which uses FastCGI to host Django.
I start my fcgi processes as follows:
python2.6 /<snip>/manage.py runfcgi maxrequests=10 host=127.0.0.1 port=8000 pidfile=django.pid
For logging, I have a RotatingFileHandler defined as follows:
file_handler = RotatingFileHandler(filename, maxBytes=10*1024*1024, backupCount=5, encoding='utf-8')
The logging is working. However, it looks like the files are rotating before they even reach 10 KB, let alone 10 MB. My guess is that each fcgi instance is only handling 10 requests and then re-spawning. Each respawn of fcgi creates a new file. I have confirmed that fcgi is starting up under a new process id every so often (hard to tell the exact timing, but under a minute).
Is there any way to get around this issue? I would like all fcgi instances to log to one file until it reaches the size limit, at which point a log file rotation would take place.
As Alex stated, logging is thread-safe, but the standard handlers cannot be safely used to log from multiple processes into a single file.
ConcurrentLogHandler uses file locking to allow for logging from within multiple processes.
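For example, a minimal sketch, assuming the ConcurrentLogHandler package is installed (pip install ConcurrentLogHandler) and that its handler takes the same constructor arguments as the stdlib RotatingFileHandler:

# Drop-in replacement for RotatingFileHandler: writes and rollovers are
# serialised with file locking, so several processes can share one file.
from cloghandler import ConcurrentRotatingFileHandler

file_handler = ConcurrentRotatingFileHandler(
    filename, maxBytes=10 * 1024 * 1024, backupCount=5, encoding='utf-8')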
In your shoes I'd switch to a TimedRotatingFileHandler -- I'm surprised that the size-based rotating file handler is giving this problem (as it should be impervious to what processes are producing the log entries), but the timed version (though not controlled on exactly the parameter you prefer) should solve it. Or, write your own, more solid, rotating file handler (you can take a lot from the standard library sources) that ensures varying processes are not a problem (as they should never be).
As you appear to be using the default file opening mode of append ("a") rather than write ("w"), if a process re-spawns it should append to the existing file, then rollover when the size limit is reached. So I am not sure that what you are seeing is caused by re-spawning CGI processes. (This of course assumes that the filename remains the same when the process re-spawns).
Although the logging package is thread-safe, it does not handle concurrent access to the same file from multiple processes - because there is no standard way to do it in the stdlib. My normal advice is to set up a separate daemon process which implements a socket server and logs events received across it to file - the other processes then just implement a SocketHandler to communicate with the logging daemon. Then all events will get serialised to disk properly. The Python documentation contains a working socket server which could serve as a basis for this need.
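For illustration, the client side of that setup can be as small as the following sketch, assuming the socket server from the logging cookbook is running on the same host on its default port:

import logging
import logging.handlers

# Each worker process only talks to the logging daemon over a socket;
# the daemon is the single process that writes (and rotates) the file.
socket_handler = logging.handlers.SocketHandler(
    'localhost', logging.handlers.DEFAULT_TCP_LOGGING_PORT)
root = logging.getLogger()
root.setLevel(logging.DEBUG)
root.addHandler(socket_handler)

logging.info('this record is pickled and sent to the logging daemon')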
I work on Linux.
I have a function which calculates some data and returns results.
This function runs in parallel, in many copies, on many workers on the local network.
I pass the function to a program which manages the tasks for the workers.
I need logging inside that function.
I set up logging_manager.py with a proper configuration in it - a RotatingFileHandler (3 MB per file, 99 log files max). The rotating file handler keeps the total log size stable - I can't get more than about 300 MB of logs.
But when I set up logging at the beginning of the function, it creates a new root logger for every parallel execution. Over time every logger opens its own file, which eventually exceeds the maximum number of processes the machine can handle, and then it crashes. The huge amount of logging also makes my log files messy, because every logger tries to log to the same file but they don't know about each other.
This situation slows me down. I don't like it.
If I set the logger to log to a new file every time, I will quickly run out of space on my machine and will need to clean it up by hand or set up a service to do it for me - I don't want that.
I tried to run the logger as a separate process and manage the logs with a Queue, but then again, I need to code it inside that function, so the result will be the same - a new process for every file. The loggers don't know about each other.
So my question is:
How can I tell a process that a similar logger process already exists, and force it to connect to the existing process's queue instead of creating a new, separate process with a separate queue?
This would be something like a centralized Python logging service for the machine, which would handle all my parallel functions and properly manage the log messages in the files.
Is it even possible? Or do I need to find different solution?
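Something like the following is the shape I have in mind - a rough sketch with hypothetical names, assuming Python 3.2+ (for QueueHandler) and that the workers run on one machine and can somehow be handed a shared queue:

import logging
import logging.handlers
import multiprocessing

def listener_process(queue):
    # The only process that touches the file: one RotatingFileHandler,
    # so the 3 MB / 99 file limits are enforced in exactly one place.
    handler = logging.handlers.RotatingFileHandler(
        'central.log', maxBytes=3 * 1024 * 1024, backupCount=99)
    while True:
        record = queue.get()
        if record is None:           # sentinel: shut the listener down
            break
        handler.emit(record)

def worker_function(queue):
    # Workers never open the file; they only put records on the queue.
    logger = logging.getLogger('worker')
    logger.addHandler(logging.handlers.QueueHandler(queue))
    logger.setLevel(logging.INFO)
    logger.info('a result from this worker')

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    multiprocessing.Process(target=listener_process, args=(queue,)).start()
    # ... hand worker_function and the queue to the task manager ...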
### Start 4 subprocesses ###
server = tornado.httpserver.HTTPServer(app)
server.bind(8000)
server.start(4) # 4 subprocesses
### Logger using TimedRotatingFileHandler within each app ###
timefilehandler = logging.handlers.TimedRotatingFileHandler(
    filename=os.path.join(dirname, logname + '.log'),
    when='MIDNIGHT',
    interval=1,
    encoding='utf-8'
)
Using tornado with multiple subprocesses and a logger resulted in multiple log files suffixed like the following (when using the file name as the logger name):
service_0.log
service_1.log
service_2.log
service_3.log
Is it possible to have all the subprocesses write to one place in tornado? Or would it be better to use some log aggregation tool to handle the hassle, since it is quite inconvenient to check the logs one by one? Any ideas? Thanks in advance.
You can't have different (sub)processes write to a single file - if you want to solve that you should use a log aggregator where different tornado servers log to a common endpoint (either in the cloud or locally). If you're not inclined to use a third party solution you can write one in Tornado.
Look into https://docs.python.org/3/library/logging.handlers.html to see if there's anything you like.
Or you can grep 4 files at the same time.
p.s. IIRC using subprocesses is not recommended for production, so I would suggest you run 4 processes with different ports and use the port in the log name as well.
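A rough sketch of what I mean, reusing the names from your snippet (dirname, logname, app) and assuming the port is passed to each process on its command line:

import logging.handlers
import os
import tornado.httpserver
import tornado.ioloop

port = 8000   # one value per process, e.g. parsed from sys.argv

# The port goes into the log file name, so each single-process server
# rotates its own file and nothing is shared between processes.
timefilehandler = logging.handlers.TimedRotatingFileHandler(
    filename=os.path.join(dirname, '%s_%d.log' % (logname, port)),
    when='MIDNIGHT',
    interval=1,
    encoding='utf-8'
)
logging.getLogger().addHandler(timefilehandler)

server = tornado.httpserver.HTTPServer(app)
server.listen(port)      # single process: no server.start(4) fork
tornado.ioloop.IOLoop.current().start()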
I have a possibly long-running program that currently has 4 processes, but could be configured to have more. I have researched logging from multiple processes using python's logging and am using the SocketHandler approach discussed here. I never had any problems having a single logger (no sockets), but from what I read I was told it would fail eventually and unexpectedly. As far as I understand, it's unknown what will happen when you try to write to the same file at the same time. My code essentially does the following:
import logging
import os
import sys

log = logging.getLogger(__name__)

def monitor(...):
    # Spawn child processes with os.fork()
    # os.wait() and act accordingly

def main():
    log_server_pid = os.fork()
    if log_server_pid == 0:
        # Create a LogRecordSocketServer (daemon)
        ...
        sys.exit(0)
    # Add SocketHandler to root logger
    ...
    monitor(<configuration stuff>)

if __name__ == "__main__":
    main()
So my questions are: Do I need to create a new log object after each os.fork()? What happens to the existing global log object?
With doing things the way I am, am I even getting around the problem that I'm trying to avoid (multiple open files/sockets)? Will this fail and why will it fail (I'd like to be able to tell if future similar implementations will fail)?
Also, in what way does the "normal" (one log= expression) method of logging to one file from multiple processes fail? Does it raise an IOError/OSError? Or does it just not completely write data to the file?
If someone could provide an answer or links to help me out, that would be great. Thanks.
FYI:
I am testing on Mac OS X Lion and the code will probably end up running on a CentOS 6 VM on a Windows machine (if that matters). Whatever solution I use does not need to work on Windows, but should work on a Unix based system.
UPDATE: This question has started to move away from logging-specific behavior and is more in the realm of what Linux does with file descriptors during forks. I pulled out one of my college textbooks and it seems that if you open a file in append mode from two processes (not before a fork), they will both be able to write to the file properly as long as your writes don't exceed the actual kernel buffer (although line buffering might need to be used; still not sure on that one). This creates 2 file table entries and one v-node table entry. Opening a file then forking isn't supposed to work, but it seems to as long as you don't exceed the kernel buffer as before (I've done it in a previous program).
So I guess, if you want platform independent multi-processing logging you use sockets and create a new SocketHandler after each fork to be safe as Vinay suggested below (that should work everywhere). For me, since I have strong control over what OS my software is being run on, I think I'm going to go with one global log object with a FileHandler (opens in append mode by default and line buffered on most OSs). The documentation for open says "A negative buffering means to use the system default, which is usually line buffered for tty devices and fully buffered for other files. If omitted, the system default is used." or I could just create my own logging stream to be sure of line buffering. And just to be clear, I'm ok with:
# Process A
a_file.write("A\n")
a_file.write("A\n")
# Process B
a_file.write("B\n")
producing...
A\n
B\n
A\n
as long as it doesn't produce...
AB\n
\n
A\n
Vinay (or anyone else), how wrong am I? Let me know. Thanks for any more clarity/sureness you can provide.
Do I need to create a new log object after each os.fork()? What happens to the existing global log object?
AFAIK the global log object remains pointing to the same logger in parent and child processes. So you shouldn't need to create a new one. However, I think you should create and add the SocketHandler after the fork() in monitor(), so that the socket server has four distinct connections, one to each child process. If you don't do this, then the child processes forked off in monitor() will inherit the SocketHandler and its socket handle from their parent, but I don't know for sure that it will misbehave. The behaviour is likely to be OS-dependent and you may be lucky on OSX.
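Roughly this shape, building on the pseudocode in the question (the number of children and the server address are assumptions; the important part is that addHandler happens in the child, after the fork):

import logging
import logging.handlers
import os

def monitor(num_children=4):
    for _ in range(num_children):
        pid = os.fork()
        if pid == 0:
            # Child: open its own connection to the logging server so the
            # server sees one distinct socket per child process.
            handler = logging.handlers.SocketHandler(
                'localhost', logging.handlers.DEFAULT_TCP_LOGGING_PORT)
            logging.getLogger().addHandler(handler)
            # ... do the child's work ...
            os._exit(0)
    # Parent: os.wait() and act accordingly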
With doing things the way I am, am I even getting around the problem that I'm trying to avoid (multiple open files/sockets)? Will this fail and why will it fail (I'd like to be able to tell if future similar implementations will fail)?
I wouldn't expect failure if you create the socket connection to the socket server after the last fork() as I suggested above, but I am not sure the behaviour is well-defined in any other case. You refer to multiple open files but I see no reference to opening files in your pseudocode snippet, just opening sockets.
Also, in what way does the "normal" (one log= expression) method of logging to one file from multiple processes fail? Does it raise an IOError/OSError? Or does it just not completely write data to the file?
I think the behaviour is not well-defined, but one would expect failure modes to present as interspersed log messages from different processes in the file, e.g.
Process A writes first part of its message
Process B writes its message
Process A writes second part of its message
Update: If you use a FileHandler in the way you described in your comment, things will not be so good, due to the scenario I've described above: processes A and B both start out pointing at the end of the file (because of the append mode), but thereafter things can get out of sync because (e.g. on a multiprocessor, but even potentially on a uniprocessor) one process can preempt another and write to the shared file handle before the other process has finished doing so.
Requirements
We have several servers (20-50) - Solaris 10 and Linux (SLES) - running a mix of different applications, each generating a bunch of log events into textfiles. We need to capture these to a separate monitoring box, where we can do analysis/reporting/alerts.
Current Approach
Currently, we use SSH with a remote "tail -f" to stream the logfiles from the servers onto the monitoring box. However, this is somewhat brittle.
New Approach
I'd like to replace this with RabbitMQ. The servers would publish their log events into this, and each monitoring script/app could then subscribe to the appropriate queue.
Ideally, we'd like the applications themselves to dump events directly into the RabbitMQ queue.
However, assuming that's not an option in the short term (we may not have source for all the apps), we need a way to basically "tail -f" the logfiles from disk. I'm most comfortable in Python, so I was looking at a Pythonic way of doing that - the consensus seems to be to just use a loop with readline() and sleep() to emulate "tail -f".
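For reference, the usual shape of that readline()/sleep() loop is something like this minimal sketch (it does not handle rotation or truncation of the file being followed):

import time

def follow(path):
    # Emulate "tail -f": yield new lines as they are appended to the file.
    with open(path) as f:
        f.seek(0, 2)               # start at the current end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)    # nothing new yet, back off briefly
                continue
            yield line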
Questions
Is there an easier way of "tail -f"-ing a whole bunch of textfiles directly onto a RabbitMQ stream? Something built-in, or an extension we could leverage? Any other tips/advice here?
If we do write a Python wrapper to capture all the logfiles and publish them - I'd ideally like a single Python script to concurrently handle all the logfiles, rather than manually spinning up a separate instance for each logfile. How should we tackle this? Are there considerations in terms of performance, CPU usage, throughput, concurrency etc.?
We need to subscribe to the queues, and then possibly dump the events back to disk and reconstruct the original logfiles. Any tips/advice on this? And we'd also like a single Python script we could startup to handle reconstructing all of the logfiles - rather than 50 separate instances of the same script - is that easily achievable?
Cheers,
Victor
PS: We did have a look at Facebook's Scribe, as well as Flume, and both seem a little heavyweight for our needs.
You seem to be describing centralized syslog with rabbitmq as the transport.
If you could live with syslog, take a look at syslog-ng. Otherwise, you might save some time by using parts of logstash ( http://logstash.net/ ).
If possible, you can make the application publish the events asynchronously to RabbitMQ instead of writing them to log files. I have done this in Java.
But sometimes it is not possible to make the app log the way you want.
1) You can write a file tailer in Python which publishes to AMQP. I don't know of anything which plugs in a file as the input to RabbitMQ. Have a look at http://code.activestate.com/recipes/436477-filetailpy/ and http://www.perlmonks.org/?node_id=735039 for tailing files.
2) You can create a Python daemon which can tail all the given files, either as separate processes or in a round-robin fashion.
3) A similar approach to 2 can help you solve this. You can probably have a single queue for each log file.
If you are talking about application logging (as opposed to e.g. access logs such as Apache webserver logs), you can use a handler for stdlib logging which writes to AMQP middleware.
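For example, a bare-bones handler along these lines - a sketch assuming the pika client and a RabbitMQ broker on localhost; the queue name and routing are placeholders:

import logging
import pika

class AMQPHandler(logging.Handler):
    # Minimal handler: each formatted log record is published as one message.
    def __init__(self, queue='logs', host='localhost'):
        logging.Handler.__init__(self)
        self.queue = queue
        self.connection = pika.BlockingConnection(
            pika.ConnectionParameters(host=host))
        self.channel = self.connection.channel()
        self.channel.queue_declare(queue=queue)

    def emit(self, record):
        self.channel.basic_publish(
            exchange='', routing_key=self.queue, body=self.format(record))

logging.getLogger().addHandler(AMQPHandler())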
I have a python-based daemon that provides a REST-like interface over HTTP to some command line tools. The general nature of the tool is to take in a request, perform some command-line action, store a pickled data structure to disk, and return some data to the caller. There's a secondary thread spawned on daemon startup that looks at that pickled data on disk periodically and does some cleanup based on what's in the data.
This works just fine if the disk where the pickled data resides happens to be local disk on a Linux machine. If you switch to NFS-mounted disk the daemon starts life just fine, but over time the NFS-mounted share "disappears" and the daemon can no longer tell where it is on disk with calls like os.getcwd(). You'll start to see it log errors like:
2011-07-13 09:19:36,238 INFO Retrieved submit directory '/tech/condor_logs/submit'
2011-07-13 09:19:36,239 DEBUG CondorAgent.post_submit.do_submit(): handler.path: /condor/submit?queue=Q2%40scheduler
2011-07-13 09:19:36,239 DEBUG CondorAgent.post_submit.do_submit(): submitting from temporary submission directory '/tech/condor_logs/submit/tmpoF8YXk'
2011-07-13 09:19:36,240 ERROR Caught un-handled exception: [Errno 2] No such file or directory
2011-07-13 09:19:36,241 INFO submitter - - [13/Jul/2011 09:19:36] "POST /condor/submit?queue=Q2%40scheduler HTTP/1.1" 500 -
The un-handled exception resolves to the daemon being unable to see the disk any more. Any attempts to figure out the daemon's current working directory with os.getcwd() at this point will fail. Even trying to change to the root of the NFS mount, /tech, will fail. All the while the logger.logging.* methods are happily writing out log and debug messages to a log file located on the NFS-mounted share at /tech/condor_logs/logs/CondorAgentLog.
The disk is most definitely still available. There are other, C++-based daemons reading and writing on this share with a much higher frequency than the Python-based daemon at the same time.
I've come to an impasse debugging this problem. Since it works on local disk, the general structure of the code must be good, right? There's something about NFS-mounted shares and my code that is incompatible, but I can't tell what it might be.
Are there special considerations one must implement when dealing with a long-running Python daemon that will be reading and writing frequently to an NFS-mounted file share?
If anyone wants to see the code, the portion that handles the HTTP request and writes the pickled object to disk is on GitHub here. And the portion that the sub-thread uses to do periodic cleanup of stuff from disk by reading the pickled objects is here.
I have the answer to my problem, and it had nothing to do with the fact that I was doing file I/O on an NFS share. It turns out the problem just showed up faster if the I/O was over an NFS mount versus local disk.
A key piece of information is that the code was running threaded via the SocketServer.ThreadingMixIn and HTTPServer classes.
My handler code was doing something close to the following:
base_dir = getBaseDirFromConfigFile()
current_dir = os.getcwd()
temporary_dir = tempfile.mkdtemp(dir=base_dir)
os.chdir(temporary_dir)
doSomething()
os.chdir(current_dir)
cleanUp(temporary_dir)
That's the flow, more or less.
The problem wasn't that the I/O was being done on NFS. The problem was that os.getcwd() isn't thread-local; it's a process global. So as one thread issued a chdir() to move to the temporary space it had just created under base_dir, the next thread calling os.getcwd() would get the other thread's temporary_dir instead of the static base directory where the HTTP server was started.
There's some other people reporting similar issues here and here.
The solution was to get rid of the chdir() and getcwd() calls: start up and stay in one directory, and access everything else through absolute paths.
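Roughly, the fixed flow looks like this (the working_dir argument is hypothetical - the point is that the location is passed explicitly instead of being implied by the process-wide current directory):

base_dir = getBaseDirFromConfigFile()
temporary_dir = tempfile.mkdtemp(dir=base_dir)

# Everything uses absolute paths; the current working directory is never
# changed, so concurrent threads cannot step on each other.
doSomething(working_dir=temporary_dir)
cleanUp(temporary_dir)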
The NFS vs. local file stuff threw me for a loop. It turns out my block:
os.chdir(temporary_dir)
doSomething()
os.chdir(current_dir)
cleanUp(temporary_dir)
was running much slower when the filesystem was NFS versus local. It made the problem occur much sooner because it increased the chances that one thread was still in doSomething() while another thread was running the current_dir = os.getcwd() part of the code block. On local disk the threads moved through the entire code block so quickly they rarely intersected like that. But, give it enough time (about a week), and the problem would crop up when using local disk.
So a big lesson learned on thread safe operations in Python!
To answer the question literally, yes there are some gotchas with NFS. E.g.:
NFS is not cache coherent, so if several clients are accessing a file they might get stale data.
In particular, you cannot rely on O_APPEND to atomically append to files.
Depending on the NFS server, O_CREAT|O_EXCL might not work properly (it does work properly on modern Linux, at least).
Especially older NFS servers have deficient or completely non-working locking support. Even on more modern servers, lock recovery can be a problem after server and/or client reboot. NFSv4, a stateful protocol, ought to be more robust here than older protocol versions.
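As an aside, the create-or-fail idiom behind that third point is just this minimal sketch (whether the create is truly atomic across clients depends on the NFS version and server, as noted above):

import os

def acquire_lockfile(path):
    # O_CREAT | O_EXCL fails if the file already exists; on a well-behaved
    # NFS setup this create-or-fail step is atomic across clients.
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except OSError:
        return None                # someone else holds the lock
    os.write(fd, str(os.getpid()).encode())
    os.close(fd)
    return path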
All this being said, it sounds like your problem isn't really related to any of the above. In my experience, the Condor daemons will at some point, depending on the configuration, clean up files left over from jobs that have finished. My guess would be to look for the suspect there.