How to submit a pyspark job to remote cluster from windows client? - python

We are using a remote Spark cluster with YARN (on Hortonworks). Developers want to use Spyder on Windows to implement Spark applications. Connecting to the cluster over SSH and using an IPython notebook or Jupyter works well. Is there any other way to communicate with the Spark cluster from Windows?
Question 1: I am struggling to submit a Spark job (written in Python) from a Windows machine that has no Spark installed. Could anyone help me with this? Specifically, how should I phrase the command line to submit the job?
We can ssh to a YARN node in the cluster, in case that is relevant to a solution. The Windows client can also be pinged from the cluster.
Question 2: What do we need on the client side, e.g. Spark libraries, if we want to debug in an environment like this?
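Since SSH access to a YARN node is available, one possible way to phrase the submission is to copy the script to that node and run spark-submit there, so the Windows client needs no Spark installation. The sketch below only illustrates that idea: the user, host name and remote directory are hypothetical placeholders, and it assumes an OpenSSH client (scp and ssh) is available on the Windows PATH.

```python
import ntpath
import subprocess


def build_remote_submit(user, host, local_script, remote_dir="/tmp"):
    """Return (scp_cmd, ssh_cmd) argument lists for a YARN cluster-mode submit."""
    # ntpath.basename handles both "\\" and "/" separators.
    script_name = ntpath.basename(local_script)
    remote_script = "{}/{}".format(remote_dir, script_name)
    scp_cmd = ["scp", local_script, "{}@{}:{}".format(user, host, remote_script)]
    ssh_cmd = [
        "ssh", "{}@{}".format(user, host),
        "spark-submit", "--master", "yarn", "--deploy-mode", "cluster",
        remote_script,
    ]
    return scp_cmd, ssh_cmd


def submit(user, host, local_script):
    """Copy the script to the YARN node, then run spark-submit on that node."""
    scp_cmd, ssh_cmd = build_remote_submit(user, host, local_script)
    subprocess.run(scp_cmd, check=True)
    return subprocess.run(ssh_cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE, universal_newlines=True)
```

Calling submit("dev", "yarn-node", "my_job.py") would run the job entirely on the cluster; the client only sees the output of spark-submit.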

Related

Triggering spark-submit in local mode from a remote server

I have a PySpark application that I am currently running in local mode. Going forward, it needs to be deployed in production, and the PySpark job needs to be triggered from the client's end.
What changes do I need to make on my end so that it can be triggered from the client's end?
I am using Spark 3.2 and Python 3.6.
Currently I execute this job by running the spark-submit command on the same server where Spark is installed.
spark-submit --jars /app/some_path/lib/db.jar,/app/some_path/lib/thirdparthy.jar spark_job.py
1. First, I think I need to specify the jars in my Spark session:
spark = SparkSession.builder.master("local[*]").appName('test1').config("spark.jars", "/app/some_path/lib/db.jar,/app/some_path/lib/thirdparthy.jar").getOrCreate()
This didn't work and fails with "java.lang.ClassNotFoundException".
I think it is not able to find the jar, though it works fine when I pass it from spark-submit.
What is the right way of passing jars in a Spark session?
2. I don't have a Spark cluster. This is more of an R&D project that processes a small dataset, so I don't think I need a cluster yet. How should I run spark-submit from a remote server at the client's end?
Should I change the master from local to the IP of my server where Spark is installed?
3. How can the client trigger this job? Should I write a Python script with a subprocess that triggers this spark-submit and give it to the client, so that they can execute this Python file at a specific time from their workflow manager tool?
import subprocess

# server-ip is a placeholder for the master address
spark_submit_str = "spark-submit --master server-ip spark_job.py"
process = subprocess.Popen(spark_submit_str, stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True, shell=True)
stdout, stderr = process.communicate()
if process.returncode != 0:
    print(stderr)
print(stdout)
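A variant of the wrapper above that passes the arguments as a list instead of a shell string, which sidesteps the quoting problem. This is only a sketch under assumptions: the helper names are hypothetical, the script has to run on the server where Spark is installed, and the master stays local[*] because there is no cluster for --master to point at.

```python
import subprocess

# Jar paths taken from the spark-submit command in the question.
JARS = ["/app/some_path/lib/db.jar", "/app/some_path/lib/thirdparthy.jar"]


def build_submit_cmd(script, jars=(), master="local[*]"):
    """Build the spark-submit argument list; no shell quoting is needed."""
    cmd = ["spark-submit", "--master", master]
    if jars:
        # --jars expects a single comma-separated value.
        cmd += ["--jars", ",".join(jars)]
    cmd.append(script)
    return cmd


def run_job(script, jars=()):
    """Run spark-submit, print its output and return the exit code."""
    result = subprocess.run(build_submit_cmd(script, jars),
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            universal_newlines=True)
    if result.returncode != 0:
        print(result.stderr)
    print(result.stdout)
    return result.returncode
```

A workflow manager could then call run_job("spark_job.py", JARS) and treat a non-zero return value as a failed run.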

How to submit a pyspark job by using spark submit?

I am using Spark 2.4.3. Is this command enough to submit a job?
spark-submit accum.py /home/karthi/accm.txt
Where should I run this command?
Yes, if you want to submit a Spark job with a Python module, you have to run spark-submit module.py.
Spark is a distributed framework, so when you submit a job, you 'send' the job to a cluster. But you can also easily run it on your own machine with the same command (local mode).
You can find examples in Spark official documentation: https://spark.apache.org/docs/2.4.3/submitting-applications.html
NOTE: In order to run spark-submit, you have two choices:
Go to /path/to/spark/bin and run ./spark-submit /path/to/module.py
Or add the following to .bashrc and use spark-submit from anywhere:
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
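When spark-submit is launched from a script rather than typed by hand, the same lookup can be done in code. A minimal sketch, assuming the SPARK_HOME convention from the export lines above; find_spark_submit is a hypothetical helper name.

```python
import os
import shutil


def find_spark_submit(env=None):
    """Return a path to spark-submit, preferring $SPARK_HOME/bin over PATH."""
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME")
    if spark_home:
        candidate = os.path.join(spark_home, "bin", "spark-submit")
        if os.path.isfile(candidate):
            return candidate
    # Fall back to a PATH search; returns None if nothing is found.
    return shutil.which("spark-submit", path=env.get("PATH"))
```

If the function returns None, neither SPARK_HOME nor PATH leads to a Spark installation on that machine.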

Set Deploy-mode to cluster for pyspark from jupyter

I've installed a Cloudera CDH cluster with Spark 2 on 7 hosts (2 masters, 4 workers and 1 edge node).
I installed a Jupyter server on the edge node and want pyspark to run in cluster mode. I run this in a notebook:
os.environ['PYSPARK_SUBMIT_ARGS']='--master yarn --deploy-mode=cluster pyspark-shell'
It gives me "Error: Cluster deploy mode is not applicable to Spark shells."
Can someone help me with this?
Thanks
The answer here is that you can't, because behind the scenes the configured Jupyter launches a pyspark shell session, which you can't run in cluster mode.
One solution I can think of for your problem is
Livy + sparkmagic + Jupyter
where Livy can run in YARN mode and serve job requests as REST calls, with sparkmagic residing on Jupyter.
You can follow the link below for more info on this:
https://blog.chezo.uno/livy-jupyter-notebook-sparkmagic-powerful-easy-notebook-for-data-scientist-a8b72345ea2d
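For jobs that do not need an interactive notebook, Livy also accepts whole scripts through its REST batch endpoint (POST /batches). A minimal sketch, assuming a Livy server reachable on its default port 8998; the host name and the HDFS path in the usage note are hypothetical, and the script must already sit at a location the cluster can read.

```python
import json
import urllib.request


def build_batch_payload(file, args=(), conf=None):
    """Build the JSON body for a Livy batch; only 'file' is required."""
    payload = {"file": file}
    if args:
        payload["args"] = list(args)
    if conf:
        payload["conf"] = dict(conf)
    return payload


def submit_batch(livy_url, payload):
    """POST the batch to Livy and return the decoded JSON response."""
    request = urllib.request.Request(
        livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, submit_batch("http://livy-host:8998", build_batch_payload("hdfs:///jobs/my_job.py")) would hand the script to Livy, which then launches it on YARN on the caller's behalf.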
Major update.
I have succeeded in deploying a JupyterHub with CDH 5.13; it works without any problems.
One thing to pay attention to is to install Python 3 as the default language; with Python 2, multiple jobs will fail because of incompatibility with the Cloudera package.

connect local python script to remote spark master

I am using Python 2.7 with a Spark standalone cluster.
When I start the master on the same machine that runs the Python script, it works smoothly.
When I start the master on a remote machine and try to start a Spark context on the local machine to access the remote Spark master, nothing happens and I get a message saying that the task did not get any resources.
When I access the master's UI, I see the job, but nothing happens with it; it's just there.
How do I access a remote Spark master via a local Python script?
Thanks.
EDIT:
I read that in order to do this I need to run the cluster in cluster mode (not client mode), and I found that standalone mode currently does not support this for Python applications.
Ideas?
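Client mode against a remote standalone master can still work from a local script, provided the workers can open connections back to the driver machine; when they cannot, one common symptom is a job that sits in the master's UI without resources. A sketch under that assumption: the host names are placeholders, build_remote_conf and create_context are hypothetical helpers, and pyspark must be installed on the local machine.

```python
def build_remote_conf(master_host, driver_host, master_port=7077):
    """Spark settings for a local driver talking to a remote standalone master."""
    return {
        "spark.master": "spark://{}:{}".format(master_host, master_port),
        # Address the workers use to reach the driver; must be routable from the cluster.
        "spark.driver.host": driver_host,
        # Listen on all local interfaces.
        "spark.driver.bindAddress": "0.0.0.0",
    }


def create_context(app_name, conf):
    """Create a SparkContext from the settings above (needs pyspark installed)."""
    from pyspark import SparkConf, SparkContext
    spark_conf = SparkConf().setAppName(app_name).setAll(list(conf.items()))
    return SparkContext(conf=spark_conf)
```

Firewalls between the workers and the driver machine would still have to allow the connections, which this sketch cannot check.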

Spark streaming and kafka integration

I'm using Kafka and Spark Streaming for a project written in Python. I want to send data from a Kafka producer to my streaming program. It works smoothly when I execute the following command with the dependencies specified:
./spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0 ./kafkastreaming.py
Is there any way I can specify the dependencies and run the streaming code directly (i.e. without using spark-submit, or using spark-submit but without specifying the dependencies)?
I tried specifying the dependencies in spark-defaults.conf in Spark's conf directory.
The specified dependencies were:
1. org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0
2. org.apache.spark:spark-streaming-kafka-0-8-assembly:2.1.1
NOTE: I followed the netcat example in the Spark Streaming guide at
https://spark.apache.org/docs/latest/streaming-programming-guide.html
and it worked without the spark-submit command, hence I want to know if I can do the same with Kafka and Spark Streaming.
Put your additional dependencies into the "jars" folder of your Spark distribution, then stop and start Spark again. This way, the dependencies will be resolved at runtime without any additional option on your command line.
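Another option, when the script is started with plain python instead of spark-submit, is the PYSPARK_SUBMIT_ARGS environment variable, set before the SparkContext is created. A sketch using the package coordinate from the question; set_submit_args is a hypothetical helper, and it assumes pyspark is importable and that the machine can download the package.

```python
import os

# Package coordinate copied from the spark-submit command in the question.
KAFKA_PACKAGE = "org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0"


def set_submit_args(packages):
    """Pass --packages to the JVM that pyspark launches; the value must end with pyspark-shell."""
    args = "--packages {} pyspark-shell".format(",".join(packages))
    os.environ["PYSPARK_SUBMIT_ARGS"] = args
    return args


set_submit_args([KAFKA_PACKAGE])
# Create the SparkContext / StreamingContext only after this point,
# because the variable is read when the JVM is started.
```

With this at the top of kafkastreaming.py, the script could be started as python kafkastreaming.py, at the cost of hard-coding the dependency in the code.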
