I am running a simple hello world Python script on AWS EMR with Spark on YARN.
Looking at the logs, even though the Spark application succeeds, the overall job is marked as failed by the YARN resource manager.
The logs for the Spark application show success, and "Hello world" is printed to stdout. (See pastebin for the application logs)
The logs for the node manager show no issue or error. (See pastebin for the node manager logs)
The logs for the resource manager on the master host show that it marks the application as FAILED even though the application appears to complete successfully. There is no apparent reason for the failure in the log! (See pastebin for the resource manager logs)
I have checked all the logs and cannot figure out the root cause. What could be the issue, and how can I debug further?
Your logs contain the following statement:
ERROR ApplicationMaster: SparkContext did not initialize after waiting for 100000 ms. Please check earlier log output for errors. Failing the application.
This typically arises if you are setting .master() in the SparkSession builder (e.g. to local[*]) while the job is submitted to YARN in cluster mode.
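A minimal sketch of the fix, assuming the failure really is a hardcoded master: in YARN cluster mode the ApplicationMaster launches your script and waits for the SparkContext to register with YARN, so a builder pinned to local[*] leaves it waiting until the 100000 ms timeout. Drop .master() from the code and let spark-submit supply it (e.g. spark-submit --master yarn --deploy-mode cluster hello.py):

from pyspark.sql import SparkSession

# Do not call .master(...) here when running under YARN;
# let spark-submit decide (e.g. --master yarn --deploy-mode cluster).
spark = SparkSession.builder.appName("hello-world").getOrCreate()
print("Hello world")
spark.stop()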
I'm using Google Cloud Run. I run a container with a simple Flask + gunicorn app that starts a heavy computation.
Sometimes it fails with:
Application exec likely failed
terminated: Application failed to start: not available
I'm 100% confident it's not related to Google Cloud Run timeouts or Flask + gunicorn timeouts.
I've added gunicorn hooks: worker_exit, worker_abort, worker_int, on_exit. None of these hooks are invoked.
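A minimal sketch of the hook setup described above, assuming a gunicorn.conf.py (the messages are placeholders):

# gunicorn.conf.py -- server hooks that should fire when a worker dies
import sys

def worker_int(worker):
    print("worker_int: worker got SIGINT/SIGQUIT", file=sys.stderr)

def worker_abort(worker):
    print("worker_abort: worker got SIGABRT (e.g. on timeout)", file=sys.stderr)

def worker_exit(server, worker):
    print("worker_exit: a worker exited", file=sys.stderr)

def on_exit(server):
    print("on_exit: master is shutting down", file=sys.stderr)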
Exactly the same operation works well locally; I can reproduce the failure only on Cloud Run.
It seems like something crashes on Cloud Run and just kills my Python process completely.
Is there any way to debug this?
Maybe I can stream tail -f /var/log/{messages,kernel,dmesg,syslog} somehow in parallel with the application logs?
The idea is to understand what kills the app.
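A rough sketch of that tailing idea, assuming the listed files actually exist inside the Cloud Run container (they may not): follow each file from a background thread and echo the lines to stdout, where the platform log collector picks them up.

import subprocess
import threading

def tail_to_stdout(path):
    # Follow the file and forward every line to stdout (collected by Cloud Run logging).
    proc = subprocess.Popen(["tail", "-F", path], stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        print(f"[{path}] {line}", end="", flush=True)

for p in ("/var/log/messages", "/var/log/syslog"):
    threading.Thread(target=tail_to_stdout, args=(p,), daemon=True).start()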
UPD:
I've managed to get a bit more logs:
[INFO] Handling signal: term
Caught SIGTERM signal.Caught SIGTERM signal.
What is the right way to find out what (and why) sends SIGTERM to my Python process?
I would suggest setting up Cloud Logging for your Cloud Run instance. You can do so by following this documentation, which shows how to attach Cloud Logging to the Python root logger. This will give you more control over the logs that appear for your Cloud Run application.
Setting Up Cloud Logging for Python
Also, setting up Cloud Logging should allow Cloud Run to automatically pick up any logs under the /var/log directory as well as any syslogs (/dev/log).
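A minimal sketch of that root-logger setup, assuming the google-cloud-logging client library is installed and the service account is allowed to write logs:

import logging

import google.cloud.logging

# Attach a Cloud Logging handler to the Python root logger.
client = google.cloud.logging.Client()
client.setup_logging()

logging.info("This line should now show up in Cloud Logging.")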
Hope this helps! Let me know if you need further assistance.
I have a script run by Python 3.7.9 in an Ubuntu 18.04 Docker container.
At some point the Python interpreter is killed. This is likely caused by exceeding a resource limit.
Using docker logs ${container_id} I only get the stderr from inside the container, but I am also interested in what kind of resource was exceeded, so that I can give useful feedback to development.
Is this automatically logged at a system level (Linux, Docker)?
If this is not the case, how can I log this?
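Not part of the original question, but as a hedged illustration of where this information usually ends up: if the kernel OOM killer terminated the interpreter, Docker records that in the container state, which can be read for example with the Docker SDK for Python (assuming docker-py is installed; the host's kernel log / dmesg is the other common place to check).

import docker  # pip install docker

client = docker.from_env()
container = client.containers.get("container_id")  # placeholder container ID

state = container.attrs["State"]
print("OOMKilled:", state.get("OOMKilled"))  # True if the kernel OOM killer hit the container
print("ExitCode:", state.get("ExitCode"))    # 137 usually means the process was SIGKILLed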
I am trying to run a sample Python program on my Spark cluster. The cluster consists of a master and two workers. Nevertheless, when I try to run my sample code, it complains:
$ spark-submit --master spark://sparkmaster:7077 --deploy-mode cluster test01.py
Exception in thread "main" org.apache.spark.SparkException: Cluster deploy mode is currently not supported for python applications on standalone clusters.
What does it mean? Is my cluster standalone? Even if it consists of 3 computers, is it still standalone? How can I make it non-standalone so that I can run Python programs in cluster mode?
If I just do
spark-submit test01.py
it crashes with the error
21/03/30 11:07:27 WARN Utils: Service 'sparkDriver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
21/03/30 11:07:27 WARN Utils: Service 'sparkDriver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
21/03/30 11:07:27 ERROR SparkContext: Error initializing SparkContext.
java.net.BindException: Cannot assign requested address: Service 'sparkDriver' failed after 16 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'sparkDriver' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:461)
at sun.nio.ch.Net.bind(Net.java:453)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:222)
at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134)
at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:550)
at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506)
at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491)
at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:248)
at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
I wrote test01.py the following way:
from pyspark.sql import SparkSession

logFile = "README.md"  # Should be some file on your system

spark = SparkSession.builder \
    .appName("SimpleApp") \
    .config("spark.driver.bindAddress", "127.0.0.1") \
    .getOrCreate()

logData = spark.read.text(logFile).cache()
numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

spark.stop()
and it worked. Unfortunately, there are no traces of this run on the Spark master page.
Was it actually running distributed at all?
Hey, nothing's wrong with your configuration. As the error indicates, this is simply a limitation of Apache Spark.
Well, for Spark to run it needs resources. In standalone mode you start the workers and the Spark master yourself, and the persistence layer can be anything - HDFS, a plain file system, Cassandra, etc. In YARN mode you are asking the YARN/Hadoop cluster to manage the resource allocation and bookkeeping.
When you set the master to local[2] you ask Spark to use 2 cores and to run the driver and the workers in the same JVM. In local mode all Spark job-related tasks run in the same JVM.
So in standalone mode you are defining "containers" for the workers and the Spark master to run on your machines (so you can have 2 workers, and your tasks can be distributed across the JVMs of those two workers), whereas in local mode you are just running everything in the same JVM on your local machine.
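Coming back to the original question, a hedged sketch of one workaround (not stated in the answer above): since cluster deploy mode is not supported for Python on a standalone cluster, the driver has to run in client mode, for example by pointing the session at the standalone master directly; spark.driver.host here is a placeholder for an address the workers can reach back to.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("SimpleApp")
    .master("spark://sparkmaster:7077")           # standalone master from the question
    .config("spark.driver.host", "driver-host")   # placeholder: an address reachable from the workers
    .getOrCreate()
)

print(spark.sparkContext.master)  # should show the standalone master, not local[*]
spark.stop()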
I am running a Django REST API on Google Kubernetes Engine, using Gunicorn for my WSGI server. When my application encounters a 500 server error, the Python stack trace is not showing up in the GCP Logging console (in the "GKE Container" resource). In a different Django project of mine which uses Daphne for the ASGI/WSGI server, this traceback is being properly logged. Even more strange, the Gunicorn application has properly logged errors before, just a few weeks ago. Those errors appear in the Error Reporting console as well.
To be clear, this is the type of information that I would like to see in the GCP logs:
Internal Server Error: /v1/user/errant-endpoint
Traceback (most recent call last):
File "/path/to/python3.6/site-packages/django/core/handlers/exception.py", line 34, in inner
response = get_response(request)
...
File "/path/to/project/file.py", line 176, in my_file
print(test)
NameError: name 'test' is not defined
For the Gunicorn project, some of the Python tracebacks are logged, like this one when Gunicorn is started:
/usr/local/lib/python3.6/site-packages/django/db/models/fields/__init__.py:1421: RuntimeWarning: DateTimeField User.last_login received a naive datetime (2019-02-04 05:49:47.530648) while time zone support is active.
With 500 errors, however, only the HTTP info is logged:
[04/Feb/2019:06:03:58 +0000] "POST /v1/errant-endpoint HTTP/1.1" 500
I've looked at the resources for setting up Stackdriver Logging and Stackdriver Error Reporting, but neither of these seems to apply, because 1) that only appears to work when you want to explicitly log your errors (as in client.report_exception()), and 2) Error Reporting has caught previous errors without my setting anything up, so it seems to be possible without installing those client libraries.
There are so many variables at play here that I'm not sure where to start. I may not have provided enough information here in order to properly diagnose this (docker setup, kubernetes configuration, etc.), but I figured I may be misunderstanding something fundamental about this process, and that someone could be so kind as to enlighten me.
EDIT:
I found in the GKE documentation how to ensure that the Stackdriver logging is enabled for my cluster. Still no luck.
UPDATE (partial solution):
I set DEBUG = True in settings.py, which prompted the errors to start being logged in Stackdriver Logging and Error Reporting. I've gone ahead and done this for my canary environment, but it's less than ideal, since it exposes some of my backend code. Still not sure why this works for my other GCP project without running in debug mode.
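Not from the original post, but a hedged sketch of a common alternative to DEBUG = True: configure Django's LOGGING so that django.request errors (the 500 tracebacks) are written to stderr, which the GKE container log agent already collects. The logger and handler names follow Django's standard logging configuration; everything else is an assumption about this particular setup.

# settings.py (sketch)
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",  # writes to stderr, picked up by the GKE container logs
        },
    },
    "loggers": {
        "django.request": {  # receives unhandled-exception (500) reports with the traceback
            "handlers": ["console"],
            "level": "ERROR",
            "propagate": False,
        },
    },
}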
To debug a bug I'm seeing on Heroku but not on my local machine, I'm trying to do step-through debugging.
The typical import pdb; pdb.set_trace() approach doesn't work with Heroku since you don't have access to a console connected to your app, but apparently you can use rpdb, a "remote" version of pdb.
So I've installed rpdb and added import rpdb; rpdb.set_trace() at the appropriate spot. When I make a request that hits the rpdb line, the app hangs as expected and I see the following in my Heroku log:
pdb is running on 3d0c9fdd-c18a-4cc2-8466-da6671a72cbc:4444
OK, so how do I connect to the pdb that is running? I've tried heroku run nc 3d0c9fdd-c18a-4cc2-8466-da6671a72cbc 4444 to connect to the named host from within Heroku's system, but that just immediately exits with status 1 and no error message.
So my specific question is: how do I now connect to this remote pdb?
The general related question is: is this even the right way for this sort of interactive debugging of an app running on Heroku? Is there a better way?
NOTE RE CELERY: I've now also tried a similar approach with Celery, to no avail. The default host that Celery's rdb (a remote pdb wrapper) uses is localhost, which you can't reach when it's on Heroku. I've tried setting the CELERY_RDB_HOST environment variable to the domain of the website hosted on Heroku, but that gives a "Cannot assign requested address" error. So it's the same basic issue - how do I connect to the remote pdb instance that's running on Heroku?
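For context, a minimal sketch of the rpdb call in question; the addr/port keywords reflect my understanding of rpdb's set_trace signature (it binds to 127.0.0.1:4444 by default), and binding to 0.0.0.0 is an assumption, not a verified Heroku solution.

import rpdb

# Bind the remote debugger explicitly instead of relying on the 127.0.0.1:4444 default.
# Something outside the dyno still has to be able to reach this port.
rpdb.set_trace(addr="0.0.0.0", port=4444)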
In answer to your second question, I do it differently depending on the type of error (browser-side, backend, or view). For backend and view testing (unit tests), will something like this work for you?
$ heroku run --app=your-app "python manage.py shell --settings=settings.production"
Then debug away within ipython:
>>> %run -d script_to_run_unittests.py
Even if you aren't running a django app you could just run the debugger as a command line option to ipython so that any python errors will drop you to the debugger:
$ heroku run --app=your-app "ipython --pdb"
Front-end testing is a whole different ballgame, where you should look into tools like Selenium. I think there's also a "salad" test suite module that makes front-end tests easier to write. Writing a test that breaks is the first step in debugging (or so I'm told ;).
If the bug looks simple, you can always do the old "print and run" with something like
import logging
logger = logging.getLogger(__file__)
logger.warn('here be bugs')
and review your log files with getsentry.com or an equivalent monitoring tool or just:
heroku logs --tail