Generate a consolidated log from multiple remote processes - python

I have a script that spawns another script on several hosts on a network. These scripts generate output that I want to capture. The only option I can see right now is this:
Log each process's output in a separate file (like 20130308.hostname.log, etc.).
Is there a way to generate a consolidated log out of all the processes? By consolidated I mean something like this:
host1:
outputline1
outputline2
host2:
outputline1
outputline2
outputline3
host3:
...
I want to be able to open one file and check up on what happened on a particular host.

If I understood your question correctly, you want to have several processes log into one file. There are better options than the one you suggest; please check the Python logging cookbook. The 'multiple modules' and 'multiple handlers' sections should solve your issue.
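As a rough sketch of the cookbook's network-logging approach (hostname, port and logger naming below are assumptions, and the listener side is the cookbook's "Sending and receiving logging events across a network" receiver), each remote process could push its records to a central listener that writes everything to one file:

import logging
import logging.handlers
import socket

CENTRAL_HOST = 'monitor.example.com'  # hypothetical central log host

handler = logging.handlers.SocketHandler(
    CENTRAL_HOST, logging.handlers.DEFAULT_TCP_LOGGING_PORT)

# Name the logger after the host so the consolidated log shows where each line came from.
logger = logging.getLogger(socket.gethostname())
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info('outputline1')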

You can use a centralized logging system like syslog to aggregate all logging messages into a central log. Python's logging module provides a SysLogHandler for that.
For aggregating individual logfiles into one logfile manually, you will have to write a script that performs the aggregation and merging.
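For example (a minimal sketch; the central hostname is made up and a syslog daemon is assumed to accept remote messages on the default UDP port 514), each remote process could do:

import logging
import logging.handlers

handler = logging.handlers.SysLogHandler(address=('loghost.example.com', 514))
handler.setFormatter(logging.Formatter('%(name)s: %(message)s'))

logger = logging.getLogger('host1')  # identifies the originating host in the central log
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info('outputline1')  # ends up in the central syslog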

Related

Query Python3 script to get stats about the script

I have a script that continually runs and accepts data (for those that are familiar, and if it helps, it is connected to EMDR - https://eve-market-data-relay.readthedocs.org).
Inside the script I have debugging built in so that I can see how much data is currently in the queue for the threads to process; however, this only prints to the console. What I would like to do is either run the same script with an additional option, or a totally different script, that would return the current queue count without having to enable debug.
Is there a way to do this? Could someone please point me in the direction of the documentation/libraries that I need to research?
There are many ways to solve this; two that come to mind:
You can write the queue count to a k/v store (like memcache or redis) and then have another script read it and take whatever other actions are required (see the sketch after this list).
You can create a specific logger for your informational output (like the queue length) and set it to log somewhere else other than the console. For example, you could use it to send you an email or log to an external service, etc. See the logging cookbook for examples.
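A rough sketch of the first option, assuming a local redis instance and the redis-py package (the key name is made up):

import redis

r = redis.Redis(host='localhost', port=6379)

# Inside the EMDR consumer: report the queue size periodically from the main loop.
def report_queue_size(queue):
    r.set('emdr:queue_size', queue.qsize())

# In a separate script: read the last reported count without touching
# the running consumer or enabling debug output.
def read_queue_size():
    return r.get('emdr:queue_size')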

Logging on Hadoop

I am trying to run a map reduce job, but I am unable to find my log files when I run it. I am using a hadoop streaming job to perform map reduce, and I am using Python. I am using Python's logging module to log messages. When I run this on a file using the "cat" command, the log file is created.
cat file | ./mapper.py
But when I run this job via hadoop, I am unable to find the log file.
import os,logging
logging.basicConfig(filename="myApp.log", level=logging.INFO)
logging.info("app start")
##
##logic with log messages
##
logging.info("app complete")
But I cannot find the myApp.log file anywhere. Is the log data stored anywhere, or does Hadoop ignore the application logging completely? I have searched for my log items in the userlogs folder too, but it doesn't look like they are there.
I work with vast amounts of data where random items do not make it to the next stage; this is a very big issue on our side, so I am trying to find a way to use logging to debug my application.
Any help is appreciated.
I believe you are logging to stdout? If so, you should definitely log to stderr instead, or create your own custom stream.
With hadoop-streaming, stdout is the stream dedicated to passing key/value pairs between mappers/reducers and to outputting results, so you should not log anything to it.
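For instance, a sketch of the same script with logging redirected to stderr, which keeps the log output out of the key/value stream:

import sys
import logging

# Send log records to stderr so they do not pollute the stdout stream
# that Hadoop streaming uses for key/value pairs.
logging.basicConfig(stream=sys.stderr, level=logging.INFO)

logging.info("app start")
##
## logic with log messages
##
logging.info("app complete")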

How to open new log file for certain files in twisted

I'm running a python twisted application with lots of different services, and the log file for that application is pretty crowded with all kinds of output. So, to better see what is going on in one specific service, I would like to log messages for that service only to a different logfile. But I can't figure out how to do this.
For my application, I am using a shell script run.sh that calls twistd as follows:
twistd --logfile /var/log/whatever/path/mylogfile.log -y myapplication.py
The file myapplication.py launches all services in the application, one of which is the service I am interested in. That service has all its code in file myservice.py.
So, is there any way to specify a new log file just for my service? Do I do this in myapplication.py when I launch the service, or do I do it with some python code in myservice.py?
Having seen systems that use more than one log file, I would strongly urge you not to go in this direction.
Guy's answer sounds like it is more in the right direction. To go into even more detail, though, consider using a structured log format such as the one provided by structlog (which includes Twisted integration).
Once entries in your log file are structured you will have a chance of building tools that work with them. The example Guy gave of using grep to find the events related to the service you're concerned with is a step in this direction. If you go further and say that each log event will be (for example) a JSON-encoded object, then you can parse each line and apply arbitrarily complex filtering logic to the resulting objects.
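For example, if every line of the log is a JSON-encoded event (the 'service' field name below is an assumption about how events get tagged), filtering becomes a few lines of Python:

import json

with open('mylogfile.log') as f:
    for line in f:
        event = json.loads(line)
        # keep only events from the service we care about
        if event.get('service') == 'myservice':
            print(line.rstrip())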
Consider the following two options:
Depending on the log line format, when viewing/tailing the log you can do something like:
tail -f mylogfile.log | grep <something unique like your service name?>
Configure Twisted to use Python standard logging and tunnel log messages there; see 'Using the standard library logging module'.
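A sketch of the second option (the logger name and file path are assumptions): start a PythonLoggingObserver so Twisted's events flow into the stdlib logging module, where ordinary handlers and filters can then decide where lines go.

import logging
from twisted.python import log

# Route all Twisted log events into stdlib logging under the 'twisted' logger.
logging.basicConfig(filename='/var/log/whatever/path/mylogfile.log',
                    level=logging.INFO)

observer = log.PythonLoggingObserver(loggerName='twisted')
observer.start()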
It appears you could create a t.p.l.LogPublisher (twisted.python.log.LogPublisher) for each service and attach a FileLogObserver to it to do the actual writing into a file.
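Something along these lines in myservice.py (untested sketch; the file path is made up):

from twisted.python import log

# A dedicated publisher for this service, with its own file observer.
service_log = log.LogPublisher()
observer = log.FileLogObserver(open('/var/log/whatever/path/myservice.log', 'a'))
service_log.addObserver(observer.emit)

# Within the service, log through the dedicated publisher instead of log.msg:
service_log.msg('something happened in myservice')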

Streaming text logfiles into RabbitMQ, then reconstructing at other end?

Requirements
We have several servers (20-50) - Solaris 10 and Linux (SLES) - running a mix of different applications, each generating a bunch of log events into textfiles. We need to capture these to a separate monitoring box, where we can do analysis/reporting/alerts.
Current Approach
Currently, we use SSH with a remote "tail -f" to stream the logfiles from the servers onto the monitoring box. However, this is somewhat brittle.
New Approach
I'd like to replace this with RabbitMQ. The servers would publish their log events into this, and each monitoring script/app could then subscribe to the appropriate queue.
Ideally, we'd like the applications themselves to dump events directly into the RabbitMQ queue.
However, assuming that's not an option in the short term (we may not have source for all the apps), we need a way to basically "tail -f" the logfiles from disk. I'm most comfortable in Python, so I was looking at a Pythonic way of doing that - the consensus seems to be to just use a loop with readline() and sleep() to emulate "tail -f".
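For reference, that readline()/sleep() idiom is roughly this (sketch only; the path is an example):

import time

def follow(path):
    # emulate "tail -f": start at the end of the file and poll for new lines
    with open(path) as f:
        f.seek(0, 2)
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line.rstrip('\n')

for line in follow('/var/log/myapp.log'):
    print(line)  # this is where each line would be published to RabbitMQ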
Questions
Is there an easier way to "tail -f" a whole bunch of textfiles directly onto a RabbitMQ stream? Something inbuilt, or an extension we could leverage? Any other tips/advice here?
If we do write a Python wrapper to capture all the logfiles and publish them - I'd ideally like a single Python script to concurrently handle all the logfiles, rather than manually spinning up a separate instance for each logfile. How should we tackle this? Are there considerations in terms of performance, CPU usage, throughput, concurrency etc.?
We need to subscribe to the queues, and then possibly dump the events back to disk and reconstruct the original logfiles. Any tips/advice on this? And we'd also like a single Python script we could startup to handle reconstructing all of the logfiles - rather than 50 separate instances of the same script - is that easily achievable?
Cheers,
Victor
PS: We did have a look at Facebook's Scribe, as well as Flume, and both seem a little heavyweight for our needs.
You seem to be describing centralized syslog with rabbitmq as the transport.
If you could live with syslog, take a look at syslog-ng. Otherwise, you might save some time by using parts of logstash (http://logstash.net/).
If it is possible, you can make the application publish the events asynchronously to RabbitMQ instead of writing them to log files. I have done this in Java.
But sometimes it is not possible to make the app log the way you want.
1) You can write a file tailer in Python which publishes to AMQP (see the sketch after this list). I don't know of anything that plugs in a file as the input to RabbitMQ. Have a look at http://code.activestate.com/recipes/436477-filetailpy/ and http://www.perlmonks.org/?node_id=735039 for tailing files.
2) You can create a Python daemon which tails all the given files, either as separate processes or in a round-robin fashion.
3) A similar approach to 2 can help you solve this. You can probably have a single queue for each log file.
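A rough sketch combining 1 and 2: a single script that round-robins over several files and publishes each new line to a fanout exchange using the pika client (host, exchange and file names are made up, and pika's exchange_declare keyword differs between versions):

import time
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.exchange_declare(exchange='logs', exchange_type='fanout')

paths = ['/var/log/app1.log', '/var/log/app2.log']  # hypothetical logfiles
files = {}
for p in paths:
    f = open(p)
    f.seek(0, 2)  # start at the end of each file, like tail -f
    files[p] = f

while True:
    got_line = False
    for path, f in files.items():
        line = f.readline()
        if line:
            got_line = True
            # the routing key carries the source file so consumers can
            # rebuild per-file logs on the other end
            channel.basic_publish(exchange='logs', routing_key=path,
                                  body=line.rstrip('\n'))
    if not got_line:
        time.sleep(0.5)  # nothing new on any file; back off briefly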
If you are talking about application logging (as opposed to e.g. access logs such as Apache webserver logs), you can use a handler for stdlib logging which writes to AMQP middleware.
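A minimal sketch of such a handler using pika (exchange name and connection details are assumptions, and a real version would need reconnect/thread-safety handling):

import logging
import pika

class AMQPHandler(logging.Handler):
    # Publishes each formatted record to a RabbitMQ fanout exchange.
    def __init__(self, host='localhost', exchange='app-logs'):
        logging.Handler.__init__(self)
        self.exchange = exchange
        self.connection = pika.BlockingConnection(pika.ConnectionParameters(host))
        self.channel = self.connection.channel()
        self.channel.exchange_declare(exchange=exchange, exchange_type='fanout')

    def emit(self, record):
        self.channel.basic_publish(exchange=self.exchange, routing_key='',
                                   body=self.format(record))

logging.getLogger().addHandler(AMQPHandler())
logging.getLogger().warning('application event')  # ends up on the exchange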

How to add contextual information to log lines from multiprocessing workers?

I have a pool of worker processes (using multiprocessing.Pool) and want to log from these to a single log file. I am aware of logging servers, syslog, etc. but they all seem to require some changes to how my app is installed, monitored, logs processed etc. which I would like to avoid.
I am using CPython 2.6 on Linux.
Finally I stumbled into a solution which almost works for me. The basic idea is that you start a log listener process, set up a queue between it and the worker processes, and the workers log into the queue (using QueueHandler), and the listener then formats and serializes the log lines to a file.
This is all working so far according to the solution linked above.
But then I wanted to have the workers log some contextual information, for example a job token, for every log line. In the pool.apply_async() call I can pass in the contextual info I want to be logged. Note that I am only interested in the contextual information while the worker is doing the specific job; when it is idle there should not be any contextual information if the worker wants to log something. So basically the log listener has a log format specified as something like:
"%(job_token)s %(process)d %(asctime)s %(msg)"
and the workers are supposed to provide job_token as contextual info in the log record (the other format specifiers are standard).
I have looked at custom log filters. With a custom filter I can create a filter when the job starts and apply it to the root logger, but I am using 3rd party modules which create their own loggers (typically at module import time), and my custom filter is not applied to them.
Is there a way to make this work in the above setup? Or is there some alternative way to make this work (remember that I would still prefer a single log file, no separate log servers, job-specific contextual information for worker log lines)?
Filters can be applied to handlers as well as loggers - so you can just apply the filter to your QueueHandler. If this handler is attached to the root logger in your processes, then any logging by third party modules should also be handled by the handler, so you should get the context in those logged events, too.
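A minimal sketch of that (the filter class and token names are made up; on CPython 2.6 QueueHandler comes from the cookbook code or the logutils package rather than logging.handlers):

import logging

class JobTokenFilter(logging.Filter):
    # Stamps every record passing through the handler with the current job token.
    def __init__(self, job_token):
        logging.Filter.__init__(self)
        self.job_token = job_token

    def filter(self, record):
        record.job_token = self.job_token
        return True

# In the worker, when a job starts:
#   token_filter = JobTokenFilter(token)
#   queue_handler.addFilter(token_filter)
# and when it finishes:
#   queue_handler.removeFilter(token_filter)  # or swap in a filter with a blank
#   token so the listener's format string still finds the job_token attribute.
# Because the handler sits on the root logger, records from third-party
# loggers pass through it too and pick up the job_token.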
