Spark Streaming and Kafka integration - Python

I'm using Kafka and Spark Streaming for a project programmed in Python. I want to send data from a Kafka producer to my streaming program. It works smoothly when I execute the following command with the dependencies specified:
./spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0 ./kafkastreaming.py
Is there any way I can specify the dependencies and run the streaming code directly (i.e. without using spark-submit, or using spark-submit but without specifying the dependencies)?
I tried specifying the dependencies in spark-defaults.conf in the conf directory of Spark.
The specified dependencies were:
1. org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0
2. org.apache.spark:spark-streaming-kafka-0-8-assembly:2.1.1
NOTE - I referred to the Spark Streaming guide (the netcat example) at
https://spark.apache.org/docs/latest/streaming-programming-guide.html
and it worked without using the spark-submit command, hence I want to know if I can do the same with Kafka and Spark Streaming.

Place your additional dependencies into the "jars" folder of your Spark distribution, then stop and start Spark again. This way, the dependencies will be resolved at runtime without adding any additional option to your command line.
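Alternatively, if you want to run the script with a plain Python interpreter instead of spark-submit, a common approach is to put the same --packages flag into the PYSPARK_SUBMIT_ARGS environment variable before pyspark is imported. A minimal sketch, using the package coordinate from the command above:

```python
import os

# Must be set BEFORE `import pyspark`: the flag is read when the JVM
# backing the SparkContext is launched, so setting it later has no effect.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0 "
    "pyspark-shell"
)

# Only after this, import pyspark and build the streaming context as usual.
```

The trailing "pyspark-shell" token tells PySpark which entry point to launch; without it the arguments are ignored.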

Related

Triggering spark-submit in local mode from a remote server

I have a PySpark application that I am currently running in local mode. Going forward, this needs to be deployed in production, and the PySpark job needs to be triggered from the client's end.
What changes do I need to make on my end so that it can be triggered from the client's end?
I am using Spark 3.2 and Python 3.6.
Currently I am executing this job by firing the spark-submit command from the same server where Spark is installed.
spark-submit --jars /app/some_path/lib/db.jar,/app/some_path/lib/thirdparthy.jar spark_job.py
1. First, I think I need to specify the jars in my SparkSession:
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("test1")
    .config("spark.jars", "/app/some_path/lib/db.jar,/app/some_path/lib/thirdparthy.jar")
    .getOrCreate()
)
This didn't work and errored out with "java.lang.ClassNotFoundException".
I think it is not able to find the jar, though it works fine when I pass it from spark-submit.
What is the right way of passing jars in a Spark session?
2. I don't have a Spark cluster. It is more like an R&D project which processes a small dataset, so I don't think I need a cluster yet. How shall I run spark-submit from a remote server at the client's end?
Shall I change the master from local to the IP of the server where Spark is installed?
3. How can the client trigger this job? Shall I write a Python script with a subprocess triggering this spark-submit and give it to the client, so that they can execute this Python file at a specific time from their workflow manager tool?
import subprocess

spark_submit_str = "spark-submit --master server-ip spark_job.py"
process = subprocess.Popen(
    spark_submit_str,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    universal_newlines=True,
    shell=True,
)
stdout, stderr = process.communicate()
if process.returncode != 0:
    print(stderr)
print(stdout)
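As an aside, passing the command to subprocess as an argument list avoids shell quoting pitfalls entirely (no shell=True needed). A sketch, where the master URL and script name are placeholders for whatever the client environment provides:

```python
import shutil
import subprocess

def run_cmd(cmd):
    """Run a command given as an argument list and
    return (returncode, stdout, stderr)."""
    completed = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return completed.returncode, completed.stdout, completed.stderr

# Placeholder master URL and script; guarded so the sketch is a no-op
# on hosts where spark-submit is not on the PATH.
if shutil.which("spark-submit"):
    code, out, err = run_cmd(
        ["spark-submit", "--master", "spark://server-ip:7077", "spark_job.py"]
    )
    if code != 0:
        print(err)
    print(out)
```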

Two separate images to run Spark in client mode using Kubernetes, Python with Apache Spark 3.2.0?

I deployed Apache Spark 3.2.0 using this script run from a distribution folder for Python:
./bin/docker-image-tool.sh -r <repo> -t my-tag -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
I can create a container under K8s using spark-submit just fine. My goal is to run spark-submit configured for client mode vs. local mode, and I expect additional containers will be created for the executors.
Does the image I created allow for this, or do I need to create a second image (without the -p option) using the docker-image-tool and configure it within a different container?
It turns out that only one image is needed if you're running PySpark. In client mode, the code spawns the executors and workers for you, and they run once you issue a spark-submit command. A big improvement over Spark version 2.4!
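For reference, a client-mode submission against Kubernetes might look roughly like this. This is a sketch, not verified against your cluster; the API-server address, executor count, and image name are placeholders:

```
./bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<port> \
  --deploy-mode client \
  --conf spark.kubernetes.container.image=<repo>/spark-py:my-tag \
  --conf spark.executor.instances=2 \
  local:///opt/spark/examples/src/main/python/pi.py
```

Note that in client mode the driver (your spark-submit process) must be network-reachable from the executor pods, which typically means setting spark.driver.host appropriately.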

How to submit a pyspark job by using spark submit?

I am using Spark version 2.4.3. Is this command enough to submit a job?
spark-submit accum.py /home/karthi/accm.txt
Where do I submit this command?
Yes, if you want to submit a Spark job with a Python module, you have to run spark-submit module.py.
Spark is a distributed framework, so when you submit a job you 'send' it to a cluster. But you can also easily run it on your own machine with the same command (standalone mode).
You can find examples in Spark official documentation: https://spark.apache.org/docs/2.4.3/submitting-applications.html
NOTE: In order to run spark-submit, you have two choices:
Go to /path/to/spark/bin and run spark-submit /path/to/module.py
Or add the following to .bashrc and use spark-submit anywhere:
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
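Since the question submits accum.py, a minimal module of that shape might look like this. This is a sketch only, as the real accum.py isn't shown; the pyspark import is guarded so the file can be inspected on a machine without Spark installed:

```python
import sys

try:
    from pyspark import SparkContext  # available when run via spark-submit
except ImportError:  # lets the sketch load where Spark isn't installed
    SparkContext = None

def count_lines(path):
    # Use an accumulator to count lines across all partitions.
    sc = SparkContext(appName="accum")
    acc = sc.accumulator(0)
    sc.textFile(path).foreach(lambda _line: acc.add(1))
    total = acc.value
    sc.stop()
    return total

if __name__ == "__main__" and SparkContext is not None:
    print(count_lines(sys.argv[1]))
```

Submitted exactly as in the question: spark-submit accum.py /home/karthi/accm.txt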

Execute a Hadoop Job in remote server and get its results from python webservice

I have a Hadoop job packaged in a jar file that I can execute on a server using the command line, storing the results in the HDFS of that server.
Now, I need to create a web service in Python (Tornado) that must execute the Hadoop job and fetch the results to present them to the user. The web service is hosted on another server.
I googled a lot for how to call the job from outside the server in a Python script, but unfortunately did not find answers.
Does anyone have a solution for this?
Thanks
One option could be to install the Hadoop binaries on your webservice server, using the same configuration as in your Hadoop cluster. You will require that to be able to talk to the cluster. You don't need to launch any Hadoop daemons there. At a minimum, configure HADOOP_HOME, HADOOP_CONF_DIR, HADOOP_LIBS, and set the PATH environment variable properly.
You need the binaries because you will use them to submit the job, and the configuration to tell the Hadoop client where the cluster is (the namenode and the resourcemanager).
Then, in Python, you can execute the hadoop jar command using subprocess: https://docs.python.org/2/library/subprocess.html
You can configure the job to notify your server when it has finished, using a callback: https://hadoopi.wordpress.com/2013/09/18/hadoop-get-a-callback-on-mapreduce-job-completion/
And finally, you could read the results from HDFS using WebHDFS (the HDFS web API) or some Python HDFS package like: https://pypi.python.org/pypi/hdfs/
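Putting the subprocess part together, the submission step might be sketched as below. The jar path, driver class, and HDFS paths are placeholders, and the sketch assumes the Hadoop client binaries are installed as described above:

```python
import shutil
import subprocess

def hadoop_jar_cmd(jar_path, main_class, *args):
    """Build the `hadoop jar` command line for a packaged job."""
    return ["hadoop", "jar", jar_path, main_class, *args]

# Guarded so the sketch is a no-op on hosts without the hadoop binaries.
if shutil.which("hadoop"):
    completed = subprocess.run(
        hadoop_jar_cmd("/path/to/job.jar", "com.example.MyJob",
                       "/input", "/output"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if completed.returncode != 0:
        print(completed.stderr)
```

A Tornado handler could run this in an executor thread so the event loop isn't blocked while the job runs.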

Apache Spark Python to Scala translation

If I got it right, Apache YARN receives the Application Master and Node Manager as JAR files. They are executed as Java processes on the nodes of the YARN cluster.
When I write a Spark program using Python, does it get compiled into a JAR somehow?
If not, how come Spark is able to execute Python logic on the YARN cluster nodes?
The PySpark driver program uses Py4J (http://py4j.sourceforge.net/) to launch a JVM and create a Spark context. Spark RDD operations written in Python are mapped to operations on PythonRDD.
On the remote workers, PythonRDD launches sub-processes which run Python. The data and code are passed from the remote worker's JVM to its Python sub-process using pipes.
Therefore, it is necessary for your YARN nodes to have Python installed for this to work.
The Python code is not compiled to a JAR, but is distributed around the cluster by Spark. To make this possible, user functions written in Python are pickled using the following code: https://github.com/apache/spark/blob/master/python/pyspark/cloudpickle.py
Source: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
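The pickling step can be illustrated with the standard library's pickle (Spark actually uses the cloudpickle variant linked above, which can also serialize lambdas and closures that plain pickle rejects):

```python
import pickle

def add_one(x):
    return x + 1

# The driver serializes the user function into bytes...
payload = pickle.dumps(add_one)

# ...and a worker process deserializes and applies it to its partition.
restored = pickle.loads(payload)
print([restored(v) for v in [1, 2, 3]])  # → [2, 3, 4]
```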
