Airflow 1.9 over verbose logging

After upgrading from version 1.7.1.3, I noticed more verbose log messages in Airflow tasks. To be precise, on Airflow 1.9 the output has the following format when I run a BashOperator task:
[2018-05-17 16:43:08,104] {base_task_runner.py:98} INFO - Subtask: [2018-05-17 16:43:08,104] {bash_operator.py:101} INFO - <SCRIPT LOGS HERE>
On 1.7.1.3, the messages had the following format:
[2018-05-17 16:10:02,615] {bash_operator.py:77} INFO - <SCRIPT LOGS HERE>
Is there any way to return to the previous level of log detail (from v1.7.1.3) on Airflow 1.9, i.e. to not display the base_task_runner part, through configuration?
I have tried to modify the log format in airflow.cfg:
# Logging class
# Specify the class that will specify the logging configuration
# This class has to be on the python classpath
# logging_config_class = my.path.default_local_settings.LOGGING_CONFIG
logging_config_class =
# Log format
log_format = [%%(asctime)s] {%%(filename)s:%%(lineno)d} %%(levelname)s - %%(message)s
simple_log_format = %%(asctime)s %%(levelname)s - %%(message)s
Namely, I tried removing asctime from log_format, but that removed the timestamps from both base_task_runner and bash_operator. Maybe simple_log_format could solve this? What is the difference between the log_format and simple_log_format variables?
I also haven't set up a logging config class. I had the impression that it is mainly used for pushing logs to remote storage. Do I still need it if I store my logs locally?
Thanks

I think this is not possible because some calling structures have changed between versions, if I am not mistaken.
In 1.9, a task call always runs as a Subtask, and base_task_runner.py re-logs each line of its output. Since this means a different hierarchy, the log structure is also affected.
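If the nested format cannot be switched off in Airflow itself, one possible workaround is to post-process the log lines outside Airflow. The following is a minimal sketch, not an Airflow feature: the helper name strip_subtask_prefix is hypothetical, and the regular expression is based only on the sample line shown in the question.

```python
import re

# Outer wrapper as it appears in the question's Airflow 1.9 sample:
# "[timestamp] {base_task_runner.py:NN} LEVEL - Subtask: "
# (assumption: other deployments may format this prefix differently)
WRAPPER = re.compile(
    r"^\[[^\]]+\] \{base_task_runner\.py:\d+\} [A-Z]+ - Subtask: "
)


def strip_subtask_prefix(line: str) -> str:
    """Return the line without the outer base_task_runner wrapper, if present."""
    return WRAPPER.sub("", line, count=1)


new_style = (
    "[2018-05-17 16:43:08,104] {base_task_runner.py:98} INFO - Subtask: "
    "[2018-05-17 16:43:08,104] {bash_operator.py:101} INFO - <SCRIPT LOGS HERE>"
)
old_style = "[2018-05-17 16:10:02,615] {bash_operator.py:77} INFO - <SCRIPT LOGS HERE>"

print(strip_subtask_prefix(new_style))
print(strip_subtask_prefix(old_style))  # lines without the wrapper pass through unchanged
```

This only changes how the logs are read afterwards; the files written by Airflow keep the nested format.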

Related

Greengrass V2 generates extra `stdout` logs

I deploy Greengrass components to my EC2 instance. The deployed components generate logs that wrap around my Python log output.
What is causing this "wrapping", and how can I remove it?
For example, in the line below, the parts in bold wrap the original Python log.
The part in emphasis is generated by my Python log formatter.
2022-12-13T23:59:56.926Z [INFO] (Copier) com.bolt-data.iot.RulesEngineCore: stdout. [2022-12-13 23:59:56,925][DEBUG ][iot-ipc] checking redis pub-sub health (io_thread[140047617824320]:pub_sub_redis:_connection_health_nanny_task:61).
{scriptName=services.com.bolt-data.iot.RulesEngineCore.lifecycle.Run, serviceName=com.bolt-data.iot.RulesEngineCore, currentState=RUNNING}
The following is my python log formatter.
formatter = logging.Formatter(
    fmt="[%(asctime)s][%(levelname)-7s][%(name)s] %(message)s (%(threadName)s[%(thread)d]:%(module)s:%(funcName)s:%(lineno)d)"
)
# TODO: when we're running in the lambda function, don't stream to stdout
_handler = logging.StreamHandler(stream=stdout)
_handler.setLevel(get_level_from_environment())
_handler.setFormatter(formatter)

By default, Greengrass Nucleus captures the stdout and stderr streams of the processes it manages, including custom components. It then writes each log line with the prefix and suffix you have highlighted in bold. This cannot be changed. You can switch the format from TEXT to JSON, which makes the log easier for a machine to parse (see logging.format in greengrass-nucleus-component-configuration).
If you want a log containing only what your application generates, configure the logger to write to a file. You can write the file in the work folder of the component or in another location, such as /var/log:
import logging
from logging.handlers import RotatingFileHandler
logger = logging.getLogger(__name__)
_handler = RotatingFileHandler("mylog.log")
# additional _handler configuration
logger.addHandler(_handler)
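As a fuller illustration of the file-based approach, here is a minimal sketch that reuses the formatter from the question with a RotatingFileHandler. The log path, the rotation limits (maxBytes, backupCount) and the logger name are assumptions chosen for the example, not values from Greengrass.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler

# Hypothetical location; in a real component this could be its work folder.
log_path = os.path.join(tempfile.mkdtemp(), "mylog.log")

formatter = logging.Formatter(
    fmt="[%(asctime)s][%(levelname)-7s][%(name)s] %(message)s "
        "(%(threadName)s[%(thread)d]:%(module)s:%(funcName)s:%(lineno)d)"
)

# Rotate at roughly 1 MB and keep three old files (illustrative values).
file_handler = RotatingFileHandler(log_path, maxBytes=1_000_000, backupCount=3)
file_handler.setLevel(logging.DEBUG)
file_handler.setFormatter(formatter)

logger = logging.getLogger("iot-ipc")
logger.setLevel(logging.DEBUG)
logger.addHandler(file_handler)
logger.propagate = False  # keep these records away from any root handlers

logger.debug("checking redis pub-sub health")

with open(log_path) as f:
    written = f.read()
print(written)
```

The file then contains only the application's own lines, without the prefix and suffix that Nucleus adds to captured stdout.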

Unable to use logging library for kafka logs

I'm relatively new to confluent-kafka python and I am having a hard time figuring out how to send the logs from the kafka library to the program logger.
I am using a SerializingProducer and, according to the docs here: https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#serde-producer I should be able to pass a logger as config entry.
Something like:
logger = logging.getLogger()
producer_config = {
    "bootstrap.servers" : "https....",
    ...,
    "logger" : logger
}
However, this does not seem to work as intended. The only log visible is the one produced directly by the program. Removing the line
"logger" : logger
allows the kafka library to correctly show the logs in the console.
My producer looks as follows:
log = logging.getLogger(__name__)
class MyProducer(confluent_kafka.SerializingProducer):
    def __init__(self, topic, producer_config, source_folder) -> None:
        producer_config["logger"] = log
        super(MyProducer, self).__init__(producer_config)
        log.info(f"Producer initialized with the following conf::\n{producer_config}")
Any idea why this happens and how to fix it?
Other information available:
the kafka library logs to stderr
passing a Handler directly rather than a Logger does not fix the problem, but rather duplicates the output of the program
Thanks in advance.

Python `logging` module + YAML config file

I have found a lot of documentation and tutorials such as the official logging config docs, the official logging cookbook, and this nice tutorial by Fang.
Each of them has gotten me close to an answer, but not quite. My question is this:
When using Config Files, how can I use a logger with 2 separate handlers at 2 separate levels?
To clarify, here is an example of my YAML file:
---
version: 1
handlers:
  debug_console:
    class: logging.StreamHandler
    level: DEBUG
    .
    .
    .
  info_file_handler:
    class: logging.handlers.RotatingFileHandler
    level: INFO
    .
    .
    .
loggers:
  dev:
    handlers: [debug_console, info_file_handler]
  test:
    handlers: [info_file_handler]
root:
  handlers: [info_file_handler]
I want to have two ways to run the logger, where one way (dev) is more verbose than the other. Moreover, when running the dev logger, I want it to have two different levels for the two different handlers.
This is a snippet of the code to try to launch the logger:
with open('logging.yaml', 'r') as f:
    log_cfg = yaml.safe_load(f.read())
logging.config.dictConfig(log_cfg)
my_logger = logging.getLogger('dev')
The dictConfig line above works correctly. I say this because when I get to the code which asks to log to the console, I will see dev as the name when the log prints out. (I have edited the yaml, but it contains %(name)s in the format.)
But there is something wrong with my_logger. Even though it is tied to the name of dev, none of the rest of the attributes seem to have been set. Specifically, I see:
>>> my_logger
<Logger dev_model (WARNING)>
I don't know the logging module well enough to understand where the problem is. What I want is:
When I activate the 'dev' logger, I want to launch 2 handlers, one which is at the DEBUG level and writes to console, the other which is at the INFO level and writes to a file.
How can this be done?

If I understand the question correctly, the problem is that the logger itself has a log level, not just its handlers. Your config sets no level on the dev logger, so it inherits its effective level from the root logger, which defaults to WARNING; that is the level shown on your logger. If a message has a lower severity than the logger's level, it never reaches the handlers.
So try setting the logger's level to DEBUG. info_file_handler will still ignore any messages below its own level.
As for this part:
none of the rest of the attributes seem to have been set.
What happens there is that the logger's repr() is called to render it as a string, and that representation is not guaranteed to show all of the object's attributes.

Such a long question... too long for me to follow completely.
But I think you misunderstand how handlers work. The logger itself doesn't output anything; the handlers do.
Say you set DEBUG on the dev logger: it will pass records >= DEBUG to all of its handlers. The debug_console handler will then process records >= DEBUG, but info_file_handler will only process records >= INFO. Setting DEBUG on the dev logger won't make info_file_handler output records < INFO. So you can have two separate levels: one >= DEBUG that goes to the console and another >= INFO that goes to the file.
I am presuming I understood you correctly...
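To make the two answers above concrete, here is a minimal sketch using an in-code dict instead of the YAML file. The handler names match the question; the formatter, the temporary file path, and the use of FileHandler instead of RotatingFileHandler are assumptions made to keep the example short.

```python
import logging
import logging.config
import os
import tempfile

# Hypothetical file location for the example.
log_path = os.path.join(tempfile.mkdtemp(), "dev.log")

LOG_CFG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "plain": {"format": "%(name)s %(levelname)s %(message)s"},
    },
    "handlers": {
        "debug_console": {
            "class": "logging.StreamHandler",
            "level": "DEBUG",
            "formatter": "plain",
            "stream": "ext://sys.stdout",
        },
        "info_file_handler": {
            "class": "logging.FileHandler",
            "level": "INFO",
            "formatter": "plain",
            "filename": log_path,
        },
    },
    "loggers": {
        # The logger's own level must be at least as verbose as its most
        # verbose handler, otherwise DEBUG records are dropped before
        # any handler sees them.
        "dev": {
            "level": "DEBUG",
            "handlers": ["debug_console", "info_file_handler"],
            "propagate": False,
        },
    },
}

logging.config.dictConfig(LOG_CFG)
dev = logging.getLogger("dev")

dev.debug("only on the console")
dev.info("on the console and in the file")

with open(log_path) as f:
    file_contents = f.read()
```

The same structure carries over to YAML: the key addition is a level entry under the dev logger.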

robot framework info message not printing

My Python code generates a log file using the logging framework, and all INFO messages are captured in that file. I integrated my program with Robot Framework, and now the log file is not generated. Instead, the INFO messages are printed in log.html. I understand this is because Robot's existing logger is being called, so INFO messages are directed to log.html. I don't want that behavior to change, but I still want the user-defined log file to be generated separately with just the INFO level messages.
How can I achieve this?

Python Code --> Logging Library --> "Log File"
RobotFramework --> Python Code --> Logging Library --> by default "log.html"
When you run the Python code directly, it lets you set the log file name.
But when you run it through Robot Framework, the output file defaults to log.html (since Robot internally uses the same logging library that you are using), so your logging setup is overridden by Robot Framework's.
That is why you see the messages in log.html instead of your file.
You can also refer to "Robot Framework not creating file or writing to it".
Hope it helps!

The issue has been fixed now; it was a very minor one. I am still analyzing it further and will update when I am clear on the exact cause.
This was the module that I used:
import logging
def call_logger(logger_name, logFile):
    level = logging.INFO
    l = logging.getLogger(logger_name)
    # Attach the handlers only once per logger name
    if not getattr(l, 'handler_set', None):
        formatter = logging.Formatter('%(asctime)s : %(message)s')
        fileHandler = logging.FileHandler(logFile, mode = 'a')
        fileHandler.setFormatter(formatter)
        streamHandler = logging.StreamHandler()
        streamHandler.setFormatter(formatter)
        l.setLevel(level)
        l.addHandler(fileHandler)
        l.addHandler(streamHandler)
        l.handler_set = True
When I changed the parameter "logFile" to a different name "log_file" it worked.
Looks like "logFile" was a built in robot keyword.
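The question asks for a file with "just the INFO level messages". If that is read literally (INFO only, no WARNING or ERROR), a handler filter is one way to get it. This is a minimal sketch under that assumption; the function name make_info_file_logger and the file path are hypothetical, and it has not been tried inside Robot Framework.

```python
import logging
import os
import tempfile


def make_info_file_logger(logger_name, log_file):
    """Return a named logger whose file handler keeps INFO records only."""
    logger = logging.getLogger(logger_name)
    # Same guard as in the answer above: attach the handler only once.
    if not getattr(logger, "handler_set", False):
        handler = logging.FileHandler(log_file, mode="a")
        handler.setFormatter(logging.Formatter("%(asctime)s : %(message)s"))
        # A callable filter: the record is kept only if it is exactly INFO.
        handler.addFilter(lambda record: record.levelno == logging.INFO)
        logger.setLevel(logging.INFO)
        logger.addHandler(handler)
        logger.handler_set = True
    return logger


log_file = os.path.join(tempfile.mkdtemp(), "run.log")
log = make_info_file_logger("demo", log_file)
log.info("kept in the file")
log.warning("dropped by the filter")

with open(log_file) as f:
    contents = f.read()
print(contents)
```

Unlike the function in the answer, this version returns the logger so the caller does not need a second getLogger call.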

How to turn off INFO logging in Spark?

I installed Spark using the AWS EC2 guide. I can launch the program fine using the bin/pyspark script to get to the Spark prompt, and I can also complete the Quick Start guide successfully.
However, I cannot for the life of me figure out how to stop all of the verbose INFO logging after each command.
I have tried nearly every possible scenario in the code below (commenting out, setting to OFF) within my log4j.properties file, both in the conf folder of the directory where I launch the application and on each node, and nothing has any effect. I still get the INFO log statements printed after executing each statement.
I am very confused about how this is supposed to work.
# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
Here is my full classpath when I use SPARK_PRINT_LAUNCH_COMMAND:
Spark Command:
/Library/Java/JavaVirtualMachines/jdk1.8.0_05.jdk/Contents/Home/bin/java
-cp :/root/spark-1.0.1-bin-hadoop2/conf:/root/spark-1.0.1-bin-hadoop2/conf:/root/spark-1.0.1-bin-hadoop2/lib/spark-assembly-1.0.1-hadoop2.2.0.jar:/root/spark-1.0.1-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/root/spark-1.0.1-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/root/spark-1.0.1-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar
-XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit spark-shell --class
org.apache.spark.repl.Main
contents of spark-env.sh:
#!/usr/bin/env bash
# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.
# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program
# - SPARK_CLASSPATH=/root/spark-1.0.1-bin-hadoop2/conf/
# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_LIBRARY, to point to your libmesos.so if you use Mesos
# Options read in YARN client mode
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_EXECUTOR_INSTANCES, Number of workers to start (Default: 2)
# - SPARK_EXECUTOR_CORES, Number of cores for the workers (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Master (e.g. 1000M, 2G) (Default: 512 Mb)
# - SPARK_YARN_APP_NAME, The name of your application (Default: Spark)
# - SPARK_YARN_QUEUE, The hadoop queue to use for allocation requests (Default: ‘default’)
# - SPARK_YARN_DIST_FILES, Comma separated list of files to be distributed with the job.
# - SPARK_YARN_DIST_ARCHIVES, Comma separated list of archives to be distributed with the job.
# Options for the daemons used in the standalone deploy mode:
# - SPARK_MASTER_IP, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_INSTANCES, to set the number of worker processes per node
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers
export SPARK_SUBMIT_CLASSPATH="$FWDIR/conf"

Just execute this command in the spark directory:
cp conf/log4j.properties.template conf/log4j.properties
Edit log4j.properties:
# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
On the first line, replace:
log4j.rootCategory=INFO, console
with:
log4j.rootCategory=WARN, console
Save and restart your shell. It works for me for Spark 1.1.0 and Spark 1.5.1 on OS X.
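The manual edit above can also be scripted. The following is a minimal sketch that rewrites only the log4j.rootCategory line in the text of a properties file; the helper name set_root_level is hypothetical, and reading and writing the actual conf/log4j.properties file is left to the caller.

```python
import re


def set_root_level(properties_text: str, level: str = "WARN") -> str:
    """Replace the level on the log4j.rootCategory line, keeping the appenders."""
    pattern = re.compile(r"^(log4j\.rootCategory=)\w+", flags=re.MULTILINE)
    return pattern.sub(lambda match: match.group(1) + level, properties_text)


template = (
    "# Set everything to be logged to the console\n"
    "log4j.rootCategory=INFO, console\n"
    "log4j.logger.org.eclipse.jetty=WARN\n"
)

updated = set_root_level(template)
print(updated)
```

Only the root category changes; the per-logger lines such as org.eclipse.jetty are left as they are.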

In Spark 2.0 you can also configure it dynamically for your application using setLogLevel:
from pyspark.sql import SparkSession
spark = SparkSession.builder.\
master('local').\
appName('foo').\
getOrCreate()
spark.sparkContext.setLogLevel('WARN')
In the pyspark console, a default spark session will already be available.

Inspired by the pyspark/tests.py I did
def quiet_logs(sc):
    logger = sc._jvm.org.apache.log4j
    logger.LogManager.getLogger("org").setLevel(logger.Level.ERROR)
    logger.LogManager.getLogger("akka").setLevel(logger.Level.ERROR)
Calling this just after creating SparkContext reduced stderr lines logged for my test from 2647 to 163. However creating the SparkContext itself logs 163, up to
15/08/25 10:14:16 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
and it's not clear to me how to adjust those programmatically.

Edit your conf/log4j.properties file and change the following line:
log4j.rootCategory=INFO, console
to
log4j.rootCategory=ERROR, console
Another approach would be to start spark-shell and type in the following:
import org.apache.log4j.Logger
import org.apache.log4j.Level
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
You won't see any logs after that.

>>> log4j = sc._jvm.org.apache.log4j
>>> log4j.LogManager.getRootLogger().setLevel(log4j.Level.ERROR)

For PySpark, you can also set the log level in your scripts with sc.setLogLevel("FATAL"). From the docs:
Control our logLevel. This overrides any user-defined log settings. Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN

You can use setLogLevel
val spark = SparkSession
.builder()
.config("spark.master", "local[1]")
.appName("TestLog")
.getOrCreate()
spark.sparkContext.setLogLevel("WARN")

This may be due to how Spark computes its classpath. My hunch is that Hadoop's log4j.properties file is appearing ahead of Spark's on the classpath, preventing your changes from taking effect.
If you run
SPARK_PRINT_LAUNCH_COMMAND=1 bin/spark-shell
then Spark will print the full classpath used to launch the shell; in my case, I see
Spark Command: /usr/lib/jvm/java/bin/java -cp :::/root/ephemeral-hdfs/conf:/root/spark/conf:/root/spark/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/root/spark/lib/datanucleus-api-jdo-3.2.1.jar:/root/spark/lib/datanucleus-core-3.2.2.jar:/root/spark/lib/datanucleus-rdbms-3.2.1.jar -XX:MaxPermSize=128m -Djava.library.path=:/root/ephemeral-hdfs/lib/native/ -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit spark-shell --class org.apache.spark.repl.Main
where /root/ephemeral-hdfs/conf is at the head of the classpath.
I've opened an issue [SPARK-2913] to fix this in the next release (I should have a patch out soon).
In the meantime, here's a couple of workarounds:
Add export SPARK_SUBMIT_CLASSPATH="$FWDIR/conf" to spark-env.sh.
Delete (or rename) /root/ephemeral-hdfs/conf/log4j.properties.
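To check which file wins on a given classpath, the lookup order can be approximated outside the JVM. This is a rough sketch, not how log4j itself resolves resources: it only looks at directory entries, ignores jars, and skips empty entries. The function name first_log4j_properties and the temporary directories are hypothetical.

```python
import os
import tempfile
from typing import Optional


def first_log4j_properties(classpath: str) -> Optional[str]:
    """Return the first log4j.properties found among the classpath directories."""
    for entry in classpath.split(os.pathsep):
        if not entry:
            continue  # empty entries (like the leading ":::" above) are skipped here
        candidate = os.path.join(entry, "log4j.properties")
        if os.path.isdir(entry) and os.path.isfile(candidate):
            return candidate
    return None


# Two stand-in conf directories, each with its own log4j.properties.
hadoop_conf = tempfile.mkdtemp(prefix="hadoop-conf-")
spark_conf = tempfile.mkdtemp(prefix="spark-conf-")
for directory in (hadoop_conf, spark_conf):
    with open(os.path.join(directory, "log4j.properties"), "w") as f:
        f.write("log4j.rootCategory=INFO, console\n")

# The Hadoop conf directory comes first, as in the launch command above.
classpath = os.pathsep.join(["", hadoop_conf, spark_conf])
winner = first_log4j_properties(classpath)
print(winner)
```

If the printed path is not the file being edited, changes to Spark's own conf/log4j.properties would have no effect, which matches the symptom in the question.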

Simply add the parameter below to your spark-submit command:
--conf "spark.driver.extraJavaOptions=-Dlog4jspark.root.logger=WARN,console"
This overrides the system value only for that job. Check the exact property name (log4jspark.root.logger here) in your log4j.properties file.
Hope this helps, cheers!

Spark 1.6.2:
log4j = sc._jvm.org.apache.log4j
log4j.LogManager.getRootLogger().setLevel(log4j.Level.ERROR)
Spark 2.x:
spark.sparkContext.setLogLevel('WARN')
(spark being the SparkSession)
Alternatively, the older method:
Rename conf/log4j.properties.template to conf/log4j.properties in Spark Dir.
In the log4j.properties, change log4j.rootCategory=INFO, console to log4j.rootCategory=WARN, console
Available log levels, from least to most verbose:
OFF - no logging
FATAL - only fatal errors (very little data)
ERROR - log only in case of errors
WARN - log only in case of warnings or errors
INFO - the default
DEBUG - detailed steps (plus everything listed above)
TRACE - finer-grained than DEBUG (a lot of data)
ALL - all data
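The ordering of these levels decides which messages get through. The following small sketch models that rule in plain Python; it is only an illustration of the threshold logic, and the names LEVELS and is_logged are hypothetical.

```python
# log4j levels ordered from the most restrictive threshold to the least.
LEVELS = ["OFF", "FATAL", "ERROR", "WARN", "INFO", "DEBUG", "TRACE", "ALL"]


def is_logged(message_level: str, threshold: str) -> bool:
    """True if a message at message_level passes a logger set to threshold."""
    if message_level in ("OFF", "ALL"):
        raise ValueError("OFF and ALL are thresholds, not message levels")
    return LEVELS.index(message_level) <= LEVELS.index(threshold)


# With the root logger at WARN, INFO chatter is dropped but errors remain.
print(is_logged("INFO", "WARN"))
print(is_logged("ERROR", "WARN"))
```

This is why switching the root category from INFO to WARN silences the INFO lines while still showing warnings and errors.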

Programmatic way
spark.sparkContext.setLogLevel("WARN")
Available Options
ERROR
WARN
INFO

I used this with Amazon EC2 with 1 master and 2 slaves and Spark 1.2.1.
# Step 1. Change config file on the master node
nano /root/ephemeral-hdfs/conf/log4j.properties
# Before
hadoop.root.logger=INFO,console
# After
hadoop.root.logger=WARN,console
# Step 2. Replicate this change to slaves
~/spark-ec2/copy-dir /root/ephemeral-hdfs/conf/

The snippets below are for Scala users.
Option 1:
Add the snippet below at the file level:
import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.WARN)
Option 2:
Note: this applies to every application that uses the Spark session.
import org.apache.spark.sql.SparkSession
private[this] implicit val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("WARN")
Option 3:
Note: add this configuration to your log4j.properties (for example /etc/spark/conf/log4j.properties, where Spark is installed, or a log4j.properties at your project folder level). Since the change is made at the module level, it applies to the whole application.
log4j.rootCategory=ERROR, console
IMHO, Option 1 is the wise choice since it can be switched off at the file level.

The way I do it:
In the directory where I run the spark-submit script, do
$ cp /etc/spark/conf/log4j.properties .
$ nano log4j.properties
Change INFO to whatever level of logging you want, and then run your spark-submit.

If you want to keep using logging (the logging facility for Python), you can try splitting the configuration for your application from the one for Spark:
LoggerManager()
logger = logging.getLogger(__name__)
loggerSpark = logging.getLogger('py4j')
loggerSpark.setLevel('WARNING')

You can also set it programmatically at the beginning of your program:
Logger.getLogger("org").setLevel(Level.WARN)
