How to turn off INFO logging in Spark?

How to turn off INFO logging in Spark? - python

I installed Spark using the AWS EC2 guide and I can launch the program fine using the bin/pyspark script to get to the spark prompt and can also do the Quick Start quide successfully.
However, I cannot for the life of me figure out how to stop all of the verbose INFO logging after each command.
I have tried nearly every possible scenario in the below code (commenting out, setting to OFF) within my log4j.properties file in the conf folder in where I launch the application from as well as on each node and nothing is doing anything. I still get the logging INFO statements printing after executing each statement.
I am very confused with how this is supposed to work.
#Set everything to be logged to the console log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
Here is my full classpath when I use SPARK_PRINT_LAUNCH_COMMAND:
Spark Command:
/Library/Java/JavaVirtualMachines/jdk1.8.0_05.jdk/Contents/Home/bin/java
-cp :/root/spark-1.0.1-bin-hadoop2/conf:/root/spark-1.0.1-bin-hadoop2/conf:/root/spark-1.0.1-bin-hadoop2/lib/spark-assembly-1.0.1-hadoop2.2.0.jar:/root/spark-1.0.1-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/root/spark-1.0.1-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/root/spark-1.0.1-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar
-XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit spark-shell --class
org.apache.spark.repl.Main
contents of spark-env.sh:
#!/usr/bin/env bash
# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.
# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program
# - SPARK_CLASSPATH=/root/spark-1.0.1-bin-hadoop2/conf/
# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_LIBRARY, to point to your libmesos.so if you use Mesos
# Options read in YARN client mode
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_EXECUTOR_INSTANCES, Number of workers to start (Default: 2)
# - SPARK_EXECUTOR_CORES, Number of cores for the workers (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Master (e.g. 1000M, 2G) (Default: 512 Mb)
# - SPARK_YARN_APP_NAME, The name of your application (Default: Spark)
# - SPARK_YARN_QUEUE, The hadoop queue to use for allocation requests (Default: ‘default’)
# - SPARK_YARN_DIST_FILES, Comma separated list of files to be distributed with the job.
# - SPARK_YARN_DIST_ARCHIVES, Comma separated list of archives to be distributed with the job.
# Options for the daemons used in the standalone deploy mode:
# - SPARK_MASTER_IP, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_INSTANCES, to set the number of worker processes per node
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers
export SPARK_SUBMIT_CLASSPATH="$FWDIR/conf"

Just execute this command in the spark directory:
cp conf/log4j.properties.template conf/log4j.properties
Edit log4j.properties:
# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
Replace at the first line:
log4j.rootCategory=INFO, console
by:
log4j.rootCategory=WARN, console
Save and restart your shell. It works for me for Spark 1.1.0 and Spark 1.5.1 on OS X.

In Spark 2.0 you can also configure it dynamically for your application using setLogLevel:
from pyspark.sql import SparkSession
spark = SparkSession.builder.\
master('local').\
appName('foo').\
getOrCreate()
spark.sparkContext.setLogLevel('WARN')
In the pyspark console, a default spark session will already be available.

Inspired by the pyspark/tests.py I did
def quiet_logs(sc):
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org"). setLevel( logger.Level.ERROR )
logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR )
Calling this just after creating SparkContext reduced stderr lines logged for my test from 2647 to 163. However creating the SparkContext itself logs 163, up to
15/08/25 10:14:16 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
and it's not clear to me how to adjust those programmatically.

Edit your conf/log4j.properties file and Change the following line:
log4j.rootCategory=INFO, console
to
log4j.rootCategory=ERROR, console
Another approach would be to :
Fireup spark-shell and type in the following:
import org.apache.log4j.Logger
import org.apache.log4j.Level
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
You won't see any logs after that.

>>> log4j = sc._jvm.org.apache.log4j
>>> log4j.LogManager.getRootLogger().setLevel(log4j.Level.ERROR)

For PySpark, you can also set the log level in your scripts with sc.setLogLevel("FATAL"). From the docs:
Control our logLevel. This overrides any user-defined log settings. Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN

You can use setLogLevel
val spark = SparkSession
.builder()
.config("spark.master", "local[1]")
.appName("TestLog")
.getOrCreate()
spark.sparkContext.setLogLevel("WARN")

This may be due to how Spark computes its classpath. My hunch is that Hadoop's log4j.properties file is appearing ahead of Spark's on the classpath, preventing your changes from taking effect.
If you run
SPARK_PRINT_LAUNCH_COMMAND=1 bin/spark-shell
then Spark will print the full classpath used to launch the shell; in my case, I see
Spark Command: /usr/lib/jvm/java/bin/java -cp :::/root/ephemeral-hdfs/conf:/root/spark/conf:/root/spark/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/root/spark/lib/datanucleus-api-jdo-3.2.1.jar:/root/spark/lib/datanucleus-core-3.2.2.jar:/root/spark/lib/datanucleus-rdbms-3.2.1.jar -XX:MaxPermSize=128m -Djava.library.path=:/root/ephemeral-hdfs/lib/native/ -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit spark-shell --class org.apache.spark.repl.Main
where /root/ephemeral-hdfs/conf is at the head of the classpath.
I've opened an issue [SPARK-2913] to fix this in the next release (I should have a patch out soon).
In the meantime, here's a couple of workarounds:
Add export SPARK_SUBMIT_CLASSPATH="$FWDIR/conf" to spark-env.sh.
Delete (or rename) /root/ephemeral-hdfs/conf/log4j.properties.

Simply add below param to your spark-submit command
--conf "spark.driver.extraJavaOptions=-Dlog4jspark.root.logger=WARN,console"
This overrides system value temporarily only for that job. Check exact property name (log4jspark.root.logger here) from log4j.properties file.
Hope this helps, cheers!

Spark 1.6.2:
log4j = sc._jvm.org.apache.log4j
log4j.LogManager.getRootLogger().setLevel(log4j.Level.ERROR)
Spark 2.x:
spark.sparkContext.setLogLevel('WARN')
(spark being the SparkSession)
Alternatively the old methods,
Rename conf/log4j.properties.template to conf/log4j.properties in Spark Dir.
In the log4j.properties, change log4j.rootCategory=INFO, console to log4j.rootCategory=WARN, console
Different log levels available:
OFF (most specific, no logging)
FATAL (most specific, little data)
ERROR - Log only in case of Errors
WARN - Log only in case of Warnings or Errors
INFO (Default)
DEBUG - Log details steps (and all logs stated above)
TRACE (least specific, a lot of data)
ALL (least specific, all data)

Programmatic way
spark.sparkContext.setLogLevel("WARN")
Available Options
ERROR
WARN
INFO

I used this with Amazon EC2 with 1 master and 2 slaves and Spark 1.2.1.
# Step 1. Change config file on the master node
nano /root/ephemeral-hdfs/conf/log4j.properties
# Before
hadoop.root.logger=INFO,console
# After
hadoop.root.logger=WARN,console
# Step 2. Replicate this change to slaves
~/spark-ec2/copy-dir /root/ephemeral-hdfs/conf/

This below code snippet for scala users :
Option 1 :
Below snippet you can add at the file level
import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.WARN)
Option 2 :
Note : which will be applicable for all the application which is using
spark session.
import org.apache.spark.sql.SparkSession
private[this] implicit val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("WARN")
Option 3 :
Note : This configuration should be added to your log4j.properties.. (could be like /etc/spark/conf/log4j.properties (where the spark installation is there) or your project folder level log4j.properties)
since you are changing at module level. This will be applicable for all the application.
log4j.rootCategory=ERROR, console
IMHO, Option 1 is wise way since it can be switched off at file level.

The way I do it is:
in the location I run the spark-submit script do
$ cp /etc/spark/conf/log4j.properties .
$ nano log4j.properties
change INFO to what ever level of logging you want and then run your spark-submit

I you want to keep using the logging (Logging facility for Python) you can try splitting configurations for your application and for Spark:
LoggerManager()
logger = logging.getLogger(__name__)
loggerSpark = logging.getLogger('py4j')
loggerSpark.setLevel('WARNING')

You can also set it like this programmatically, At the beginning of your program.
Logger.getLogger("org").setLevel(Level.WARN)

Related

How can I change the logfile naming for the cluster scheduler in snakemake?

I am running job on a PBS TORQUE cluster and want to customize my log scripts for rules repeated for many files.
The default naming scheme is for each script for each rule snakejob.{rulename}.{id}.sh.o26730731, e.g. snakejob.all.7.sh.o26730731, where only the ending varies for different files (as they are executed one after another). This comes from the script snakemake creates for submission to the cluster.
I can specify a common log-directory to qsub using the -e or -o options.
I know that profiles exist or that one could use wildcards, something like (I have to test that):
snakemake --jobs 10 --cluster "qsub -o logs/{wildcards.file} -e logs/{wildcards.file}"
Alternatively the naming of the script temporarily saved by snakemake under .snakemake/tmp<hash> could be altered to achieve unique naming of logs per file.
I tried to set the log-directory in the rule, but this did not work when I specified a directory (missing .log):
rule target:
input:
# mockfile approach: https://stackoverflow.com/a/53751654/9684872
# replace? https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#directories-as-outputs
file = expand(os.path.join(config['DATADIR'], "{file}", "{file}.txt"), file=FILES)
rule execute:
log:
#dir = os.path.join(config['DATADIR'], "{file}") # Building DAG is stuck in endless loop
dir = os.path.join(config['DATADIR'], "{file}.log") # works
params:
logdir = os.path.join(config['DATADIR'], "{file}") #works
So what is your approach or what would you suggest how this could be solved best to have logs identified by the {file} wildcard?

PySpark 2.x: Programmatically adding Maven JAR Coordinates to Spark

The following is my PySpark startup snippet, which is pretty reliable (I've been using it a long time). Today I added the two Maven Coordinates shown in the spark.jars.packages option (effectively "plugging" in Kafka support). Now that normally triggers dependency downloads (performed by Spark automatically):
import sys, os, multiprocessing
from pyspark.sql import DataFrame, DataFrameStatFunctions, DataFrameNaFunctions
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import functions as sFn
from pyspark.sql.types import *
from pyspark.sql.types import Row
# ------------------------------------------
# Note: Row() in .../pyspark/sql/types.py
# isn't included in '__all__' list(), so
# we must import it by name here.
# ------------------------------------------
num_cpus = multiprocessing.cpu_count() # Number of CPUs for SPARK Local mode.
os.environ.pop('SPARK_MASTER_HOST', None) # Since we're using pip/pySpark these three ENVs
os.environ.pop('SPARK_MASTER_POST', None) # aren't needed; and we ensure pySpark doesn't
os.environ.pop('SPARK_HOME', None) # get confused by them, should they be set.
os.environ.pop('PYTHONSTARTUP', None) # Just in case pySpark 2.x attempts to read this.
os.environ['PYSPARK_PYTHON'] = sys.executable # Make SPARK Workers use same Python as Master.
os.environ['JAVA_HOME'] = '/usr/lib/jvm/jre' # Oracle JAVA for our pip/python3/pySpark 2.4 (CDH's JRE won't work).
JARS_IVY_REPO = '/home/jdoe/SPARK.JARS.REPO.d/'
# ======================================================================
# Maven Coordinates for JARs (and their dependencies) needed to plug
# extra functionality into Spark 2.x (e.g. Kafka SQL and Streaming)
# A one-time internet connection is necessary for Spark to autimatically
# download JARs specified by the coordinates (and dependencies).
# ======================================================================
spark_jars_packages = ','.join(['org.apache.spark:spark-streaming-kafka-0-10_2.11:2.4.0',
'org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0',])
# ======================================================================
spark_conf = SparkConf()
spark_conf.setAll([('spark.master', 'local[{}]'.format(num_cpus)),
('spark.app.name', 'myApp'),
('spark.submit.deployMode', 'client'),
('spark.ui.showConsoleProgress', 'true'),
('spark.eventLog.enabled', 'false'),
('spark.logConf', 'false'),
('spark.jars.repositories', 'file:/' + JARS_IVY_REPO),
('spark.jars.ivy', JARS_IVY_REPO),
('spark.jars.packages', spark_jars_packages), ])
spark_sesn = SparkSession.builder.config(conf = spark_conf).getOrCreate()
spark_ctxt = spark_sesn.sparkContext
spark_reader = spark_sesn.read
spark_streamReader = spark_sesn.readStream
spark_ctxt.setLogLevel("WARN")
However the plugins aren't downloading and/or loading when I run the snippet (e.g. ./python -i init_spark.py), as they should.
This mechanism used to work, but then stopped. What am I missing?
Thank you in advance!

This is the kind of post where the QUESTION will be worth more than the ANSWER, because the code above works but isn't anywhere to be found in Spark 2.x documentation or examples.
The above is how I've programmatically added functionality to Spark 2.x by way of Maven Coordinates. I had this working but then it stopped working. Why?
When I ran the above code in a jupyter notebook, the notebook had -- behind the scenes -- already run that identical code snippet by way of my PYTHONSTARTUP script. That PYTHONSTARTUP script has the same code as the above, but omits the maven coordinates (by intent).
Here, then, is how this subtle problem emerges:
spark_sesn = SparkSession.builder.config(conf = spark_conf).getOrCreate()
Because a Spark Session already existed, the above statement simply reused that existing session (.getOrCreate()), which did not have the jars/libraries loaded (again, because my PYTHONSTARTUP script intentionally omits them). This is why it is a good idea to put print statements in PYTHONSTARTUP scripts (which are otherwise silent).
In the end, I simply forgot to do this: $ unset PYTHONSTARTUP before starting the JupyterLab / Notebook daemon.
I hope the Question helps others because that's how to programmatically add functionality to Spark 2.x (in this case Kafka). Note that you'll need an internet connection for the one-time download of the specified jars and recursive dependencies from Maven Central.

Export environment variables at runtime with airflow

I am currently converting workflows that were implemented in bash scripts before to Airflow DAGs. In the bash scripts, I was just exporting the variables at run time with
export HADOOP_CONF_DIR="/etc/hadoop/conf"
Now I'd like to do the same in Airflow, but haven't found a solution for this yet. The one workaround I found was setting the variables with os.environ[VAR_NAME]='some_text' outside of any method or operator, but that means they get exported the moment the script gets loaded, not at run time.
Now when I try to call os.environ[VAR_NAME] = 'some_text' in a function that gets called by a PythonOperator, it does not work. My code looks like this
def set_env():
os.environ['HADOOP_CONF_DIR'] = "/etc/hadoop/conf"
os.environ['PATH'] = "somePath:" + os.environ['PATH']
os.environ['SPARK_HOME'] = "pathToSparkHome"
os.environ['PYTHONPATH'] = "somePythonPath"
os.environ['PYSPARK_PYTHON'] = os.popen('which python').read().strip()
os.environ['PYSPARK_DRIVER_PYTHON'] = os.popen('which python').read().strip()
set_env_operator = PythonOperator(
task_id='set_env_vars_NOT_WORKING',
python_callable=set_env,
dag=dag)
Now when my SparkSubmitOperator gets executed, I get the exception:
Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
My use case where this is relevant is that I have SparkSubmitOperator, where I submit jobs to YARN, therefore either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment. Setting them in my .bashrc or any other config is sadly not possible for me, which is why I need to set them at runtime.
Preferably I'd like to set them in an Operator before executing the SparkSubmitOperator, but if there was the possibility to pass them as arguments to the SparkSubmitOperator, that would be at least something.

From what I can see in the spark submit operator you can pass in environment variables to spark-submit as a dictionary.
:param env_vars: Environment variables for spark-submit. It
supports yarn and k8s mode too.
:type env_vars: dict
Have you tried this?

Airflow 1.9 over verbose logging

After upgrading from version 1.7.1.3 I noticed more verbose logging messages in Airflow tasks. To be more precise, my current airflow 1.9 output message has following format when I am running bash bash operator task:
[2018-05-17 16:43:08,104] {base_task_runner.py:98} INFO - Subtask: [2018-05-17 16:43:08,104] {bash_operator.py:101} INFO - <SCRIPT LOGS HERE>
While on 1.7.1.3 the messages had following format:
[2018-05-17 16:10:02,615] {bash_operator.py:77} INFO - <SCRIPT LOGS HERE>
Is there any way to return to previous level of log details (from v. 1.7.1.3) on airflow 1.9, i.e. Not display base_task_runner logs in config?
I have tried to modify log format in airflow.cfg
# Logging class
# Specify the class that will specify the logging configuration
# This class has to be on the python classpath
# logging_config_class = my.path.default_local_settings.LOGGING_CONFIG
logging_config_class =
# Log format
log_format = [%%(asctime)s] {%%(filename)s:%%(lineno)d} %%(levelname)s - %%(message)s
simple_log_format = %%(asctime)s %%(levelname)s - %%(message)s
namely I tried to modify remove asctime from log_format, but that was removing timestamps from both base_task_runner and bash_operator. Maybe simple_log_format could solve this? What is the difference between log_format and simple_log_format variables?
I also haven't set up logging config class. I've got an impression that was mainly used for pushing the logs remotely do I still need it if I store my logs locally?
Thanks

I think this is not possible because some calling structures have changed between versions, if I am not mistaken.
Task calls will always be a Subtask. Since this means a different hierarchy, the log structure is also affected.

Python - send KDE knotify message with cron job on linux?

I'm trying to send a notification to KDE's knotify from a cron job. The code below works fine but when I run it as a cron job the notification doesnt appear.
#!/usr/bin/python2
import dbus
import gobject
album = "album"
artist = "artist"
title = "title"
knotify = dbus.SessionBus().get_object("org.kde.knotify", "/Notify")
knotify.event("warning", "kde", [], title, u"by %s from %s" % (artist, album), [], [], 0, 0, dbus_interface="org.kde.KNotify")
Anyone know how I can run this as a cron job?

You need to supply an environment variable called DBUS_SESSION_BUS_ADDRESS.
You can get the value from a running kde session.
$ echo $DBUS_SESSION_BUS_ADDRESS
unix:abstract=/tmp/dbus-iHb7INjMEc,guid=d46013545434477a1b7a6b27512d573c
In your kde startup (autostart module in configuration), create a script entry to run after your environment starts up. Output this environment variable value to a temp file in your home directory and then you can set the environment variable within your cron job or python script from the temp file.
#!/bin/bash
echo $DBUS_SESSION_BUS_ADDRESS > $HOME/tmp/kde_dbus.session
As of 2019 KDE5, it still works but is slightly different results:
$ echo $DBUS_SESSION_BUS_ADDRESS
unix:path=/run/user/1863/bus
To test it, you can do the following:
$ qdbus org.freedesktop.ScreenSaver /ScreenSaver SimulateUserActivity
You may need to use qdbus-qt5 if you still have the old kde4 binaries installed along with kde5. You can determine which one you should use with the following:
export QDBUS_CMD=$(which qdbus-qt5 2> /dev/null || which qdbus || exit 1)
I run this with a sleep statement when I want to prevent my screensaver from engaging and it works. I run it remotely from another computer beside my main one.
For those who want to know how I lock and unlock the remote screensaver, it's a different command...
loginctl lock-session 1
or
loginctl unlock-session 1
That is assuming that your session is the first one. You can add scripts to the KDE notification events for screensaver start and stop. Hope this information helps someone who wants to synchronize their screen savers across more than one computer.
I know this is long answer, but I wanted to provide an example for you to test with and a practical use case where I use it today.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.