Now that Kubernetes is integrated directly with Spark in 2.3, my spark-submit from the console executes correctly against a Kubernetes master without any Spark master pods running; Spark handles all the k8s details:
spark-submit \
--deploy-mode cluster \
--class com.app.myApp \
--master k8s://https://myCluster.com \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.app.name=myApp \
--conf spark.executor.instances=10 \
--conf spark.kubernetes.container.image=myImage \
local:///myJar.jar
What I am trying to do is run a spark-submit via AWS Lambda against my k8s cluster. Previously I submitted jobs to the Spark master's REST API directly (without Kubernetes):
request = requests.Request(
    'POST',
    "http://<master-ip>:6066/v1/submissions/create",
    data=json.dumps(parameters))
prepared = request.prepare()
session = requests.Session()
response = session.send(prepared)
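(For reference, the parameters dict for that endpoint is typically shaped roughly like the sketch below; the jar path, class name and Spark version are placeholders for illustration.)
# Illustrative payload shape for the standalone master's REST submission endpoint;
# every value below is a placeholder.
parameters = {
    "action": "CreateSubmissionRequest",
    "appResource": "file:/path/to/myJar.jar",
    "mainClass": "com.app.myApp",
    "appArgs": [],
    "clientSparkVersion": "2.3.0",
    "environmentVariables": {"SPARK_ENV_LOADED": "1"},
    "sparkProperties": {
        "spark.app.name": "myApp",
        "spark.master": "spark://<master-ip>:7077",
        "spark.submit.deployMode": "cluster"
    }
}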
That worked. Now I want to integrate Kubernetes and do it similarly, where I submit an API request to my Kubernetes cluster from Python and have Spark handle all the k8s details, ideally something like:
request = requests.Request(
    'POST',
    "k8s://https://myK8scluster.com:443",
    data=json.dumps(parameters))
Is it possible in the Spark 2.3/Kubernetes integration?
I'm afraid that is impossible for Spark 2.3, if you are using native Kubernetes support.
Based on the description in the deployment instructions, the submission process consists of several steps:
Spark creates a Spark driver running within a Kubernetes pod.
The driver creates executors which are also running within Kubernetes pods and connects to them, and executes application code.
When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.
So, in fact, you have nowhere to submit a job to until you start a submission process, which launches the first Spark pod (the driver) for you. And after the application completes, everything is terminated.
Because running a fat container on AWS Lambda is not the best solution, and because there is no way to run commands in the container itself (it is possible, but only with a hack; here is a blueprint about executing Bash inside an AWS Lambda), the simplest way is to write a small custom service that runs on a machine outside of AWS Lambda and provides a REST interface between your application and the spark-submit utility. I don't see any other way to do it without pain.
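As an illustration of that last idea, below is a minimal sketch of such a service: a tiny Flask app that shells out to spark-submit against the Kubernetes master. The /submit route, the JSON field names, and the image and jar values are all hypothetical, not part of any Spark or Kubernetes API.
# Hypothetical wrapper service: exposes a REST endpoint that runs spark-submit
# with the Kubernetes master. Route and field names are made up for this sketch.
import subprocess
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/submit", methods=["POST"])
def submit():
    body = request.get_json()
    cmd = [
        "spark-submit",
        "--deploy-mode", "cluster",
        "--master", "k8s://https://myCluster.com",
        "--class", body["main_class"],
        "--conf", "spark.kubernetes.authenticate.driver.serviceAccountName=spark",
        "--conf", "spark.kubernetes.container.image=" + body["image"],
        body["app_resource"],  # e.g. local:///myJar.jar
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return jsonify({"returncode": result.returncode,
                    "stdout": result.stdout,
                    "stderr": result.stderr})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
Your Lambda function could then POST to this service the same way it used to POST to the standalone master's REST endpoint.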
Related
I have a PySpark application that I am currently running in local mode. Going forward, this needs to be deployed in production, and the PySpark job needs to be triggered from the client's end.
What changes do I need to make on my end so that it can be triggered from the client's end?
I am using Spark 3.2 and Python 3.6.
Currently I am executing this job by firing the spark-submit command from the same server where Spark is installed.
spark-submit --jars /app/some_path/lib/db.jar,/app/some_path/lib/thirdparthy.jar spark_job.py
1. First, I think I need to specify the jars in my SparkSession:
spark = SparkSession.builder \
    .master("local[*]") \
    .appName('test1') \
    .config("spark.jars", "/app/some_path/lib/db.jar,/app/some_path/lib/thirdparthy.jar") \
    .getOrCreate()
This didn't work and errors out with "java.lang.ClassNotFoundException".
I think it is not able to find the jar, though it works fine when I pass it from spark-submit.
What is the right way of passing jars to the Spark session?
2. I don't have a Spark cluster. It is more of an R&D project that processes a small dataset, so I don't think I need a cluster yet. How shall I run spark-submit from a remote server at the client's end?
Shall I change the master from local to my server's IP, where Spark is installed?
3. How can the client trigger this job? Shall I write a Python script with a subprocess triggering this spark-submit and give it to the client, so that they can execute this Python file at a specific time from their workflow manager tool?
import subprocess

spark_submit_str = 'spark-submit --master "server-ip" spark_job.py'
process = subprocess.Popen(spark_submit_str, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                           universal_newlines=True, shell=True)
stdout, stderr = process.communicate()
if process.returncode != 0:
    print(stderr)
print(stdout)
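A slightly safer variant would pass the arguments as a list (avoiding shell=True and the nested-quote problem) and forward the same --jars flag used today. This is only a sketch; the master URL and jar paths are placeholders:
# Sketch of a client-side trigger script; the master URL and jar paths are placeholders.
import subprocess

cmd = [
    "spark-submit",
    "--master", "spark://<server-ip>:7077",
    "--jars", "/app/some_path/lib/db.jar,/app/some_path/lib/thirdparthy.jar",
    "spark_job.py",
]
result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode != 0:
    print(result.stderr)
print(result.stdout)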
I have 3 systemd services I created that run a Python script and pass a key argument, because the scripts download datasets that require that key. When I enable the service it works just fine, but if I reboot the EC2 instance it's on, it fails on launch with:
(code=exited, status=255)
I want it to run on launch because I have a workflow that uses EventBridge to trigger a Lambda function that turns on the instance at a specific time to download the dataset to S3 and begin the ETL process. Why would the service run as intended with sudo systemctl start service-name.service but fail on startup?
Hmmm, this will depend on how you're running the EC2. Basically, there are two ways.
Via Cloud-init / EC2 userdata
You can specify if the script will be executed on the first boot (when the EC2 is created), every (re)boot, etc.
You can check the official docs for that:
Cloud-init: Event and Updates
AWS - Run commands on your Linux instance at launch
How can I utilize user data to automatically run a script with every restart of my Amazon EC2 Linux instance?
Via Linux systemd
You can use the example below (just remove the comments, and add or adjust the Requires and After lines if needed).
## doc here: https://man7.org/linux/man-pages/man5/systemd.unit.5.html#[UNIT]_SECTION_OPTIONS
[Unit]
Description=Startup script
# Requires=my-other-service.target
# After=network.target my-other-service.target
## doc here: https://man7.org/linux/man-pages/man5/systemd.service.5.html#OPTIONS
[Service]
Type=oneshot
ExecStart=/opt/myscripts/startup.sh
## doc here: https://man7.org/linux/man-pages/man5/systemd.unit.5.html#[INSTALL]_SECTION_OPTIONS
[Install]
WantedBy=multi-user.target
Context
I am running Apache Airflow and trying to run a sample Docker container using Airflow's DockerOperator. I am testing using docker-compose and deploying to Kubernetes (EKS). Whenever I run my task, I receive the error: ERROR - Error while fetching server API version. The error happens both on docker-compose and on EKS (Kubernetes).
I guess your Airflow Docker container is trying to launch a worker on the same Docker machine where it is running. To do so, you need to give Airflow's container special permissions and, as you said, access to the Docker socket. This is called Docker In Docker (DIND). There is more than one way to do it. In this tutorial there are 3 different ways explained. It also depends on where those containers are run: Kubernetes, Docker machines, external services (like GitLab or GitHub), etc.
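If you go the socket-mounting route, for example, the operator can be pointed at the shared socket explicitly. The sketch below is illustrative only: the DAG id, task id and image are placeholders, and the import path assumes the apache-airflow-providers-docker package is installed.
# Sketch: DockerOperator talking to the Docker socket mounted into the Airflow
# container (e.g. -v /var/run/docker.sock:/var/run/docker.sock in docker-compose).
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(dag_id="docker_sample", start_date=datetime(2023, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    run_container = DockerOperator(
        task_id="run_sample_container",
        image="hello-world",                      # placeholder image
        docker_url="unix://var/run/docker.sock",  # the socket shared with the host
    )
On Kubernetes (EKS) the setup is different, which is why the linked tutorial covers several approaches.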
I've written a streaming Google Dataflow pipeline in Python using the Beam SDK. There's documentation about how I run this locally and set the --runner flag to run it on Dataflow.
I'm now trying to automate the deployment of this to a CI pipeline (bitbucket pipelines but not really relevant). There is documentation on how to 'run' a pipeline, but not really 'deploy' it. The commands I've tested with look like:
python -m dataflow --runner "DataflowRunner" \
--jobName "<jobName>" \
--topic "<pub-sub-topic>" \
--project "<project>" \
--dataset "<dataset>" \
--worker_machine_type "n1-standard-2" \
--temp_location "gs://<bucket-name>/tmp/"
This will run the job, but because it's streaming it will never return. It also internally manages the packaging and pushing to a bucket. I know if I kill that process it keeps running, but setting that up on a CI server in a way where I can detect whether the process actually succeeded or I just killed it after some timeout is difficult.
This seems ridiculous and like I'm missing something obvious, but how do I package and run this module on dataflow in a way I can reliably know it deployed from a CI pipeline?
So yes, it was something dumb.
Basically when you use the
with beam.Pipeline(options=options) as p:
syntax, under the hood it's calling wait_until_finish. So the wait was being invoked without me realizing, causing it to hang around forever. Refactoring to remove the context manager fixes the problem.
To expand on jamielennox's answer.
When running on the direct runner in your local development environment, you want to see the pipeline run indefinitely, perhaps only cancelling it manually with Ctrl-C after a while.
When deploying the pipeline to run on GCP's Dataflow, you want your script to deploy the job and end.
runner_name = pipeline_options.get_all_options().get('runner')

if runner_name == 'DirectRunner':
    with beam.Pipeline(options=pipeline_options) as pipeline:
        _my_setup_pipeline(config, pipeline, subscription_full_name)
elif runner_name == 'DataflowRunner':
    pipeline = beam.Pipeline(options=pipeline_options)
    _my_setup_pipeline(config, pipeline, subscription_full_name)
    pipeline.run()
else:
    raise Exception(f'Unknown runner: {runner_name}')
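For completeness, the pipeline_options object in that snippet could be built from the command-line flags shown earlier. This is just a sketch; the streaming flag mirrors the streaming job in the question.
# Sketch: building the pipeline_options object used in the snippet above.
import sys

from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

pipeline_options = PipelineOptions(sys.argv[1:])  # picks up --runner, --project, etc.
pipeline_options.view_as(StandardOptions).streaming = True  # streaming pipeline, as in the question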
Use case:
I'm behind a firewall, and I have a remote Spark cluster I can access; however, those machines cannot connect directly to me.
As the Spark docs state, it is necessary for the workers to be able to reach the driver program:
Because the driver schedules tasks on the cluster, it should be run
close to the worker nodes, preferably on the same local area network.
If you’d like to send requests to the cluster remotely, it’s better to
open an RPC to the driver and have it submit operations from nearby
than to run a driver far away from the worker nodes.
The suggested solution is to have a server process running on the cluster, listening for RPCs, and have it execute the Spark driver program locally.
Does such a program already exist? Such a process should manage one or more RPCs, returning exceptions and handling logs.
Also, in that case, is it my local program or the Spark driver that has to create the SparkContext?
Note:
I have a standalone cluster
Solution 1:
A simple way would be to use cluster mode (similar to --deploy-mode cluster) for the standalone cluster; however, the docs say:
Currently, standalone mode does not support cluster mode for Python
applications.
Just a few options:
Connect to a cluster node using ssh, start a screen session, submit the Spark application, and come back later to check the results.
Deploy middleware like Job Server, Livy or Mist on your cluster, and use it for submissions.
Deploy notebook (Zeppelin, Toree) on your cluster and submit applications from the notebook.
Set a fixed spark.driver.port and ssh-forward all connections through one of the cluster nodes, using its IP as spark.driver.bindAddress (see the sketch below).
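For that last option, here is a rough sketch of what the driver-side configuration could look like, assuming the fixed ports are forwarded over ssh through one of the cluster nodes. The IP and port values are placeholders, and the exact combination of spark.driver.host and spark.driver.bindAddress depends on how the tunnel is set up.
# Sketch only: pin the driver ports so they can be ssh-forwarded through a
# cluster node. The IP and port numbers below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://<cluster-node-ip>:7077")
    .config("spark.driver.port", "40000")               # fixed driver RPC port
    .config("spark.driver.blockManager.port", "40001")  # fixed block manager port
    .config("spark.driver.host", "<cluster-node-ip>")   # address the workers connect back to
    .config("spark.driver.bindAddress", "0.0.0.0")      # local address the driver binds to
    .getOrCreate()
)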