Triggering spark-submit in local mode from a remote server - python

I have a pyspark application that I am currently running in local mode. Going forward, this needs to be deployed in production, and the pyspark job needs to be triggered from the client's end.
What changes do I need to make on my end so that it can be triggered from the client's end?
I am using Spark 3.2 and Python 3.6.
Currently I am executing this job by firing the spark-submit command from the same server where Spark is installed.
spark-submit --jars /app/some_path/lib/db.jar,/app/some_path/lib/thirdparthy.jar spark_job.py
1. First, I think I need to specify the jars in my SparkSession:
spark = (SparkSession.builder
         .master("local[*]")
         .appName('test1')
         .config("spark.jars", "/app/some_path/lib/db.jar,/app/some_path/lib/thirdparthy.jar")
         .getOrCreate())
This didn't work and errored out with "java.lang.ClassNotFoundException".
I think it is not able to find the jars, though it works fine when I pass them from spark-submit.
What is the right way of passing jars in a SparkSession? (A possible alternative is sketched below the questions.)
2. I don't have a Spark cluster. This is more of an R&D project that processes a small dataset, so I don't think I need a cluster yet. How shall I run spark-submit from a remote server at the client's end?
Shall I change the master from local to the IP of my server where Spark is installed?
3. How can the client trigger this job? Shall I write a Python script that triggers this spark-submit via a subprocess and give it to the client, so that they can execute this Python file at a specific time from their workflow manager tool?
import subprocess

# "<server-ip>" is a placeholder for the server where Spark is installed.
spark_submit_str = "spark-submit --master <server-ip> spark_job.py"
process = subprocess.Popen(spark_submit_str, stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE, universal_newlines=True, shell=True)
stdout, stderr = process.communicate()
if process.returncode != 0:
    print(stderr)
print(stdout)
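For question 1, another possibility (an untested sketch, reusing the same jar paths as above) is to pass the jars through the PYSPARK_SUBMIT_ARGS environment variable before the SparkSession, and therefore its JVM, is created:

import os

# Must be set before pyspark starts its JVM; "pyspark-shell" has to come last.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars /app/some_path/lib/db.jar,/app/some_path/lib/thirdparthy.jar pyspark-shell"
)

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("test1").getOrCreate()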

Related

Spark submit (2.3) on kubernetes cluster from Python

So now that k8s is integrated directly with Spark in 2.3, my spark-submit from the console executes correctly on a Kubernetes master without any Spark master pods running; Spark handles all the k8s details:
spark-submit \
--deploy-mode cluster \
--class com.app.myApp \
--master k8s://https://myCluster.com \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.app.name=myApp \
--conf spark.executor.instances=10 \
--conf spark.kubernetes.container.image=myImage \
local:///myJar.jar
What I am trying to do is run a spark-submit via AWS Lambda to my k8s cluster. Previously I used the command via the Spark master REST API directly (without Kubernetes):
request = requests.Request(
    'POST',
    "http://<master-ip>:6066/v1/submissions/create",
    data=json.dumps(parameters))
prepared = request.prepare()
session = requests.Session()
response = session.send(prepared)
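For context, the parameters payload followed the standalone master's (hidden) REST submission format; a rough, version-dependent sketch with placeholder values:

# Rough sketch of the submission payload; all values are placeholders.
parameters = {
    "action": "CreateSubmissionRequest",
    "appResource": "file:/path/to/myApp.jar",
    "mainClass": "com.app.myApp",
    "appArgs": [],
    "clientSparkVersion": "2.3.0",
    "environmentVariables": {"SPARK_ENV_LOADED": "1"},
    "sparkProperties": {
        "spark.app.name": "myApp",
        "spark.master": "spark://<master-ip>:7077",
        "spark.submit.deployMode": "cluster",
        "spark.jars": "file:/path/to/myApp.jar",
    },
}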
And it worked. Now I want to integrate Kubernetes and do it similarly, where I submit an API request to my Kubernetes cluster from Python and have Spark handle all the k8s details, ideally something like:
request = requests.Request(
    'POST',
    "k8s://https://myK8scluster.com:443",
    data=json.dumps(parameters))
Is it possible in the Spark 2.3/Kubernetes integration?
I'm afraid that is impossible for Spark 2.3 if you are using native Kubernetes support.
Based on the description in the deployment instructions, the submission process consists of several steps:
Spark creates a Spark driver running within a Kubernetes pod.
The driver creates executors which are also running within Kubernetes pods and connects to them, and executes application code.
When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.
So, in fact, you have nowhere to submit a job from until you start a submission process, which will launch the first Spark pod (the driver) for you. And after the application completes, everything is terminated.
Because running a fat container on AWS Lambda is not the best solution, and also because there is no way to run arbitrary commands in the Lambda container itself (it is possible, but only with a hack; there is a blueprint about executing Bash inside an AWS Lambda), the simplest way is to write a small custom service that runs on a machine outside of AWS Lambda and provides a REST interface between your application and the spark-submit utility. I don't see any other way to do it without pain.
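A minimal sketch of such a service, using only the Python standard library; the spark-submit arguments, image name and port are placeholders that would need to match your setup:

import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

class SubmitHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body sent by the caller (e.g. the Lambda function).
        length = int(self.headers.get("Content-Length", 0))
        params = json.loads(self.rfile.read(length) or b"{}")
        # Build a spark-submit command; all values below are placeholders.
        cmd = [
            "spark-submit",
            "--master", "k8s://https://myCluster.com",
            "--deploy-mode", "cluster",
            "--conf", "spark.kubernetes.container.image=myImage",
            params.get("appResource", "local:///myJar.jar"),
        ]
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE, universal_newlines=True)
        out, err = proc.communicate()
        self.send_response(200 if proc.returncode == 0 else 500)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps({"returncode": proc.returncode,
                                     "stderr": err[-2000:]}).encode())

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), SubmitHandler).serve_forever()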

connect local python script to remote spark master

I am using Python 2.7 with a Spark standalone cluster.
When I start the master on the same machine running the Python script, it works smoothly.
When I start the master on a remote machine and try to start a Spark context on the local machine to access the remote Spark master, nothing happens and I get a message saying that the task did not get any resources.
When I access the master's UI, I see the job, but nothing happens with it; it's just there.
How do I access a remote Spark master via a local Python script?
Thanks.
EDIT:
I read that in order to do this I need to run the cluster in cluster mode (not client mode), and I found that currently standalone mode does not support this for Python applications.
Ideas?
In order to do this I need to run the cluster in cluster mode (not client mode), and I found that currently standalone mode does not support this for Python applications.
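For what it's worth, standalone does support client mode for Python, and a connection attempt would look roughly like the sketch below; the host names are placeholders, and the workers must be able to reach back to the driver machine:

from pyspark import SparkConf, SparkContext

# Sketch only: placeholder host names. In client mode the driver runs locally,
# so the standalone workers must be able to connect back to this machine.
conf = (SparkConf()
        .setMaster("spark://<master-host>:7077")
        .setAppName("remote-master-test")
        .set("spark.driver.host", "<local-ip-reachable-from-cluster>"))
sc = SparkContext(conf=conf)
print(sc.parallelize(range(10)).sum())  # simple smoke test
sc.stop()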

Spark streaming and kafka integration

I'm using Kafka and Spark Streaming for a project programmed in Python. I want to send data from a Kafka producer to my streaming program. It works smoothly when I execute the following command with the dependencies specified:
./spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0 ./kafkastreaming.py
Is there any way I can specify the dependencies and run the streaming code directly (i.e. without using spark-submit, or using spark-submit but without specifying the dependencies)?
I tried specifying the dependencies in spark-defaults.conf in the conf dir of Spark.
The specified dependencies were:
1. org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0
2. org.apache.spark:spark-streaming-kafka-0-8-assembly:2.1.1
NOTE - I referred to the Spark Streaming guide using netcat from
https://spark.apache.org/docs/latest/streaming-programming-guide.html
and it worked without using the spark-submit command, hence I want to know if I can do the same with Kafka and Spark Streaming.
Provide your additional dependencies in the "jars" folder of your Spark distribution, then stop and start Spark again. This way, dependencies will be resolved at runtime without adding any additional option on your command line.
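Another route that is commonly reported to work (a sketch, with placeholder ZooKeeper host and topic) is to put the --packages option into PYSPARK_SUBMIT_ARGS before pyspark is imported, so that plain python kafkastreaming.py resolves the Kafka dependency itself:

import os

# Must be set before pyspark launches its JVM; "pyspark-shell" comes last.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0 pyspark-shell"
)

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafkastreaming")
ssc = StreamingContext(sc, 5)  # 5-second batches

# Placeholder ZooKeeper quorum, consumer group and topic.
stream = KafkaUtils.createStream(ssc, "<zookeeper-host>:2181",
                                 "my-consumer-group", {"my-topic": 1})
stream.pprint()
ssc.start()
ssc.awaitTermination()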

How to submit a pyspark job to a remote cluster from a Windows client?

We are using a remote Spark cluster with YARN (in Hortonworks). Developers want to use Spyder to implement Spark applications on Windows. SSHing to the cluster and using an IPython notebook or Jupyter works well. Is there any other way to communicate with the Spark cluster from Windows?
Question 1: I got a headache with submitting a Spark job (written in Python) from Windows, which has no Spark installed. Could anyone help me out of this? Specifically, how do I phrase the command line to submit the job?
We can ssh to a YARN node in the cluster, in case this is relevant to a solution. The Windows client is also ping-able from the cluster.
Question 2: What do we need to have on the client side, e.g. Spark libraries, if we want to debug in an environment like this?

Execute a Hadoop job on a remote server and get its results from a Python web service

I have a Hadoop job packaged in a jar file that I can execute on a server using the command line, storing the results in the HDFS of that server.
Now, I need to create a web service in Python (Tornado) that must execute the Hadoop job and get the results to present them to the user. The web service is hosted on another server.
I googled a lot for how to call the job from outside the server in a Python script, but unfortunately did not find answers.
Does anyone have a solution for this?
Thanks
One option could be to install the Hadoop binaries on your web service server using the same configuration as in your Hadoop cluster. You will require that to be able to talk with the cluster. You don't need to launch any Hadoop daemons there. At least configure HADOOP_HOME, HADOOP_CONFIG_DIR, HADOOP_LIBS and set the PATH environment variable properly.
You need the binaries because you will use them to submit the job, and the configurations to tell the Hadoop client where the cluster is (the namenode and the resourcemanager).
Then in Python you can execute the hadoop jar command using subprocess: https://docs.python.org/2/library/subprocess.html
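For example, a rough sketch (the jar path, class name and HDFS paths are placeholders):

import subprocess

# Assumes the Hadoop client binaries are on PATH and the environment variables
# mentioned above point at the cluster configuration.
cmd = ["hadoop", "jar", "/path/to/my-job.jar", "com.example.MyJob",
       "hdfs:///input/path", "hdfs:///output/path"]
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                        universal_newlines=True)
out, err = proc.communicate()
if proc.returncode != 0:
    raise RuntimeError("hadoop jar failed: %s" % err)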
You can configure the job to notify your server when the job has finished using a callback: https://hadoopi.wordpress.com/2013/09/18/hadoop-get-a-callback-on-mapreduce-job-completion/
And finally you could read the results from HDFS using WebHDFS (the HDFS web API) or a Python HDFS package such as: https://pypi.python.org/pypi/hdfs/
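For instance, with the hdfs package, reading an output file could look roughly like this (the namenode address, user and path are placeholders):

from hdfs import InsecureClient  # pip install hdfs

# Placeholder namenode WebHDFS address, user and output path.
client = InsecureClient("http://<namenode-host>:50070", user="hadoop")
with client.read("/user/hadoop/job-output/part-r-00000", encoding="utf-8") as reader:
    content = reader.read()
print(content)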
