I am trying to run a MapReduce job, but I cannot find my log files when the job runs. I am using a Hadoop streaming job written in Python, and I use Python's logging module to log messages. When I run the mapper locally by piping a file into it with "cat", the log file is created:
cat file | ./mapper.py
But when I run this job via Hadoop, I cannot find the log file.
import os, logging
logging.basicConfig(filename="myApp.log", level=logging.INFO)
logging.info("app start")
##
##logic with log messages
##
logging.info("app complete")
But I cannot find the myApp.log file anywhere. Is the log data stored somewhere, or does Hadoop ignore application logging completely? I have searched for my log entries in the userlogs folder too, but they don't appear to be there.
I work with vast amounts of data where random items do not make it to the next stage, which is a very big issue on our side, so I am trying to find a way to use logging to debug my application.
Any help is appreciated.
I believe that you are logging to stdout? If so, you should definitely log to stderr instead, or create your own custom stream.
With hadoop-streaming, stdout is the stream dedicated to passing key/value pairs between mappers/reducers and to emitting results, so you should not log anything to it.
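A minimal sketch of a streaming mapper that logs to stderr (the tab-separated key/value handling below is illustrative, not taken from the original mapper):
#!/usr/bin/env python
# Minimal sketch: log to stderr so stdout stays free for key/value output.
import sys
import logging

logging.basicConfig(stream=sys.stderr, level=logging.INFO)

logging.info("mapper start")
for line in sys.stdin:
    line = line.strip()
    if not line:
        logging.warning("skipping empty input line")
        continue
    # stdout is reserved for the key/value pairs Hadoop collects.
    print("%s\t%s" % (line, 1))
logging.info("mapper complete")
Anything written to stderr should then show up in the per-attempt stderr files under the userlogs directory on the task nodes.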
I use Locust on several machines (https://locust.io/). Each --master and --slave node started with the --logfile option writes a log to its own directory.
Is it possible to make them write a common log to a single file?
Collecting and analyzing the logs from all machines every time is very inconvenient.
I don't know exactly what it is that you're analyzing from the logs (so this might or might not be the answer you're looking for), but you can use Locust's --csv command line option (and possibly also --csv-full-history), to have the master node continuously write the aggregated request statistics and failures in CSV format to files.
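As a rough sketch of consuming that output (the file name loadtest_stats.csv assumes --csv=loadtest and a recent Locust release; older versions use different suffixes, and the column names should be checked against your own CSV header):
import csv

# Read the aggregated per-request statistics written by the master node.
with open("loadtest_stats.csv", newline="") as f:
    for row in csv.DictReader(f):
        # "Name" and "Request Count" are typical columns, but verify them
        # against the header of the files your Locust version produces.
        print(row.get("Name"), row.get("Request Count"))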
I have a script that continually runs and accepts data (For those that are familiar and if it helps, it is connected to EMDR - https://eve-market-data-relay.readthedocs.org).
Inside the script I have debugging built in so that I can see how much data is currently in the queue for the threads to process; however, this is built to just print to the console. What I would like to do is be able to either run the same script with an additional option, or run a totally different script, that would return the current queue count without having to enable debug.
Is there a way to do this? Could someone please point me in the direction of the documentation/libraries that I need to research?
There are many ways to solve this; two that come to mind:
You can write the queue count to a k/v store (like memcached or redis) and then have another script read that for you and perform whatever other actions are required.
You can create a specific logger for your informational output (like the queue length) and set it to log somewhere else other than the console. For example, you could use it to send you an email or log to an external service, etc. See the logging cookbook for examples.
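For the second option, a minimal sketch of a dedicated logger that writes the queue length somewhere other than the console (the logger name, file path, and queue count below are illustrative):
import logging
from logging.handlers import RotatingFileHandler

# Dedicated logger for informational metrics such as the queue length.
queue_log = logging.getLogger("emdr.queue")
queue_log.setLevel(logging.INFO)
queue_log.propagate = False  # keep these records out of the console/root logger

handler = RotatingFileHandler("queue_stats.log", maxBytes=1000000, backupCount=3)
handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
queue_log.addHandler(handler)

# Somewhere in the worker loop:
queue_log.info("queue length: %d", 42)  # replace 42 with the real queue count
Another process (or a person) can then tail or parse queue_stats.log without the debug flag ever being enabled.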
I am trying to set up logging when using IPython parallel. Specifically, I would like to redirect log messages from the engines to the client. So, rather than each of the engines logging individually to their own log files, as in "IPython.parallel - can I write my own log into the engine logs?", I am looking for something like "How should I log while using multiprocessing in Python?"
Based on reviewing the IPython code base, I have the impression that the way to do this would be to register a zmq.log.handlers.PUBHandler with the logging module (see the documentation in iploggerapp.py). I have tried this in various ways, but none seem to work. I also tried to register a logger via IPython.parallel.util.connect_engine_logger, but this also does not appear to do anything.
Update
I have made some progress on this problem. If I specify c.IPEngineApp.log_url in ipengine_config, then the logger of the IPython application has the appropriate EnginePUBHandler. I checked this via
%%px
from IPython.config import Application
log = Application.instance().log
print(log.handlers)
which showed that the application logger has an EnginePUBHandler for each engine. Next, I can start the iplogger app in a separate terminal and see the log messages from each engine.
However, what I would like to achieve is to see these log messages in the notebook, rather than in a separate terminal. I have tried starting iplogger from within the notebook via a system call, but this crashes.
I'm running a Python Twisted application with lots of different services, and the log file for that application is pretty crowded with all kinds of output. So, to better see what is going on in one specific service, I would like to log messages for that service only to a different logfile. But I can't figure out how to do this.
For my application, I am using a shell script run.sh that calls twistd as follows:
twistd --logfile /var/log/whatever/path/mylogfile.log -y myapplication.py
The file myapplication.py launches all services in the application, one of which is the service I am interested in. That service has all its code in file myservice.py.
So, is there any way to specify a new log file just for my service? Do I do this in myapplication.py when I launch the service, or do I do it with some Python code in myservice.py?
Having seen systems that use more than one log file, I would strongly urge you not to go in this direction.
Guy's answer sounds like it is more in the right direction. To go into even more detail, though, consider using a structured log format such as the one provided by structlog (which includes Twisted integration).
Once entries in your log file are structured, you will have a chance of building tools that work with them. The example Guy gave of using grep to find the events related to the service you're concerned with is a step in this direction. If you go further and say that each log event will be (for example) a JSON-encoded object, then you can parse each line and apply arbitrarily complex filtering logic to the resulting objects.
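As a minimal sketch of that idea with structlog (the processor list is trimmed down; the service name and event fields are made up):
import structlog

# Render every event as one JSON object per line so log files can be parsed
# and filtered programmatically instead of grepped.
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)

log = structlog.get_logger(service="myservice")
log.info("user_connected", user_id=42)
# prints something like:
# {"service": "myservice", "user_id": 42, "timestamp": "...", "event": "user_connected"}
Because each line is valid JSON, filtering out one service's events becomes a json.loads loop rather than a fragile grep pattern.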
Consider the following two options:
Depending on the log line format, when viewing / tailing the log do something like:
tail -f mylogfile.log | grep <something unique like your service name?>
Configure Twisted to use Python's standard logging module and tunnel log messages there; see "Using the standard library logging module" in the Twisted docs (a sketch follows below).
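A rough sketch of the second option (the logger name and log file are placeholders); this forwards all Twisted log events into the stdlib logging module, where ordinary handlers and filters can separate out the service you care about:
import logging
from twisted.python.log import PythonLoggingObserver

logging.basicConfig(filename="myservice.log", level=logging.INFO)

# Forward Twisted log events to the stdlib logger named "myservice".
observer = PythonLoggingObserver(loggerName="myservice")
observer.start()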
It appears you could create a t.p.l.LogPublisher for each service and attach a FileLogObserver to it for the actual writing into a file.
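A hedged sketch of that approach (the file path is a placeholder, and this is not lifted from the Twisted documentation):
from twisted.python.log import FileLogObserver, LogPublisher

# A publisher owned by one service, with its own observer writing to a
# dedicated file; other services keep using the global twistd log file.
service_log = LogPublisher()
observer = FileLogObserver(open("/var/log/whatever/path/myservice.log", "a"))
service_log.addObserver(observer.emit)

# Inside myservice.py:
service_log.msg("service started")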
I have a script that spawns another script on several hosts on a network. These scripts generate output that I want to capture, so the only option I can see right now is this:
Log each process' output in a separate file (like 20130308.hostname.log, etc).
Is there a way to generate a consolidated log out of all the processes? By consolidated I mean something like this:
host1:
outputline1
outputline2
host2:
outputline1
outputline2
outputline3
host3:
...
I want to be able to open one file and check up on what happened on a particular host.
If I understood your question correctly, you want to have several processes log into one file. There are better options than the one you suggest; please check the Python logging cookbook. The 'Multiple modules' and 'Multiple handlers' sections should solve your issue.
You can use a centralized logging system like syslog for aggregating all log messages into a central log. Python's logging module provides a SysLogHandler for that.
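A minimal sketch, assuming a reachable syslog server called loghost (the host name and port are placeholders):
import logging
from logging.handlers import SysLogHandler

# Every host runs the same setup, so all records end up on the central server.
logger = logging.getLogger("myscript")
logger.setLevel(logging.INFO)
logger.addHandler(SysLogHandler(address=("loghost", 514)))

logger.info("processed item")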
For aggregating individual logfiles into one logfile manually, you will have to write a script that performs the aggregation and merging.
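As a rough sketch of that script, assuming per-host files named like 20130308.hostname.log as in the question:
import glob

# Merge per-host log files (e.g. 20130308.host1.log) into one consolidated
# file, grouped by host as in the desired output above.
with open("consolidated.log", "w") as out:
    for path in sorted(glob.glob("20130308.*.log")):
        host = path.split(".")[1]  # assumes the date.hostname.log naming scheme
        out.write("%s:\n" % host)
        with open(path) as f:
            for line in f:
                out.write("    " + line)
        out.write("\n")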