How to submit a PySpark job using spark-submit? - python

I am using Spark version 2.4.3. Is this command enough to submit a job?
spark-submit accum.py /home/karthi/accm.txt
Where do I run this command?

Yes. If you want to submit a Spark job written as a Python module, you run spark-submit module.py.
Spark is a distributed framework, so when you submit a job you 'send' it to a cluster. But you can also easily run it on your own machine with the same command (standalone mode).
You can find examples in Spark official documentation: https://spark.apache.org/docs/2.4.3/submitting-applications.html
NOTE: In order to run spark-submit, you have two choices:
Go to /path/to/spark/bin and run spark-submit /path/to/module.py
Or add the following to .bashrc and use spark-submit from anywhere:
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
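For example, here is a rough sketch of pointing spark-submit at your own machine versus a YARN cluster (the master URL, deploy mode, and the HDFS path are assumptions; adjust them to your setup):
spark-submit --master local[*] accum.py /home/karthi/accm.txt
spark-submit --master yarn --deploy-mode cluster accum.py hdfs:///data/accm.txt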

Related

How to run a non-spark code on databricks cluster?

I am able to pull data via Databricks Connect and run Spark jobs perfectly. My question is how to run non-Spark or native Python code on the remote cluster. I am not sharing the code due to confidentiality.
When you're using Databricks Connect, your local machine is the driver of your Spark job, so non-Spark code will always be executed on your local machine. If you want to execute it remotely, you need to package it as a wheel/egg, or upload the Python files onto DBFS (for example, via databricks-cli), and then execute your code as a Databricks job (for example, using the Runs Submit call of the Jobs REST API, or by creating a job with databricks-cli and using databricks jobs run-now to execute it).
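A rough sketch of the upload-and-run route with databricks-cli, assuming a job has already been created that points at the uploaded file (the script name, DBFS path, and job ID are hypothetical):
databricks fs cp my_task.py dbfs:/scripts/my_task.py
databricks jobs run-now --job-id 42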

How to submit a pyFlink job to remote Kubernetes session cluster?

Currently, I have a running Flink Kubernetes session cluster (Flink version 1.13.2) and I can access the web UI via port-forward. I can also submit the WordCount jar example from my local environment with the command ./bin/flink run -m localhost:8081 examples/batch/WordCount.jar.
But when I try to submit the PyFlink example with the command ./bin/flink run -m localhost:8081 -py examples/python/table/batch/word_count.py, the job freezes and the log says that it is waiting for the results.
I tried many things, including creating a virtualenv, passing pyClientExecutable and pyexec, and syncing the local and remote Python versions, but none of them worked.
What am I missing? How can I submit the Python example to the remote session cluster?
Note: when I submit the PyFlink word_count example from inside the JobManager pod, it runs without any problem.
I don't have Flink 1.13 on hand; however, the same example in Flink 1.15 has a comment reminding you to remove the .wait() call when submitting to a remote cluster.
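For illustration, here is a minimal PyFlink Table API sketch of that pattern (the table definition and inserted values are hypothetical, not taken from the Flink example):
from pyflink.table import EnvironmentSettings, TableEnvironment

# set up a batch TableEnvironment and a simple print sink
env_settings = EnvironmentSettings.new_instance().in_batch_mode().build()
t_env = TableEnvironment.create(env_settings)
t_env.execute_sql("CREATE TABLE print_sink (word STRING, cnt BIGINT) WITH ('connector' = 'print')")

# blocking variant: .wait() keeps the client waiting for the job result,
# which is fine locally but matches the 'waiting for results' hang against a remote session cluster
t_env.execute_sql("INSERT INTO print_sink VALUES ('hello', 1)").wait()

# detached variant: drop .wait() and let the remote cluster run the job
t_env.execute_sql("INSERT INTO print_sink VALUES ('hello', 1)")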

Spark streaming and kafka integration

I'm using Kafka and Spark Streaming for a project programmed in Python. I want to send data from a Kafka producer to my streaming program. It works smoothly when I execute the following command with the dependencies specified:
./spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0 ./kafkastreaming.py
Is there any way I can specify the dependencies and run the streaming code directly (i.e., without using spark-submit, or using spark-submit but without specifying the dependencies)?
I tried specifying the dependencies in spark-defaults.conf in the conf directory of Spark.
The specified dependencies were:
1. org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0
2. org.apache.spark:spark-streaming-kafka-0-8-assembly:2.1.1
NOTE: I followed the Spark Streaming guide (using netcat) from
https://spark.apache.org/docs/latest/streaming-programming-guide.html
and it worked without using the spark-submit command, hence I want to know if I can do the same with Kafka and Spark Streaming.
Put your additional dependency JARs into the "jars" folder of your Spark distribution, then stop and start Spark again. This way, the dependencies will be resolved at runtime without adding any additional options to your command line.
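As a rough sketch of both configuration-side options (the jar file name and versions are assumptions and must match your Spark/Scala build):
# option 1: copy the Kafka assembly jar into the distribution's jars folder
cp spark-streaming-kafka-0-8-assembly_2.11-2.1.0.jar $SPARK_HOME/jars/
# option 2: declare the package in conf/spark-defaults.conf so it is resolved automatically
spark.jars.packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0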

How to submit a pyspark job to remote cluster from windows client?

We are using a remote Spark cluster with YARN (on Hortonworks). Developers want to use Spyder to implement Spark applications on Windows. SSHing to the cluster and using an IPython notebook or Jupyter works well. Is there any other way to communicate with the Spark cluster from Windows?
Question 1: I am struggling with submitting a Spark job (written in Python) from a Windows machine that has no Spark installed. Could anyone help me out with this? Specifically, how should I phrase the command line to submit the job?
We can SSH to a YARN node in the cluster, in case that is relevant to a solution. The Windows client is also pingable from the cluster.
Question 2: What do we need on the client side (e.g., Spark libraries) if we want to debug in an environment like this?

Execute a Hadoop Job in remote server and get its results from python webservice

I have a Hadoop job packaged in a jar file that I can execute on a server using the command line, storing the results in that server's HDFS.
Now, I need to create a web service in Python (Tornado) that must execute the Hadoop job and fetch the results to present them to the user. The web service is hosted on another server.
I googled a lot about calling the job from outside the server in a Python script, but unfortunately did not find answers.
Does anyone have a solution for this?
Thanks
One option could be to install the Hadoop binaries on your web service server, using the same configuration as in your Hadoop cluster. You need that to be able to talk to the cluster. You don't need to launch any Hadoop daemons there. At a minimum, configure HADOOP_HOME, HADOOP_CONF_DIR, and HADOOP_LIBS, and set the PATH environment variable properly.
You need the binaries because you will use them to submit the job, and the configuration files to tell the Hadoop client where the cluster is (the NameNode and the ResourceManager).
Then, in Python, you can execute the hadoop jar command using subprocess: https://docs.python.org/2/library/subprocess.html
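A minimal sketch of that step, assuming Python 3, the hadoop binary on PATH, and a hypothetical jar, main class, and input/output paths:
import subprocess

# submit the packaged MapReduce job; check=True raises CalledProcessError if it fails
subprocess.run(
    ["hadoop", "jar", "/opt/jobs/my-job.jar", "com.example.MyJob",
     "/input/data", "/output/results"],
    check=True,
)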
You can configure the job to notify your server when the job has finished using a callback: https://hadoopi.wordpress.com/2013/09/18/hadoop-get-a-callback-on-mapreduce-job-completion/
And finally, you can read the results from HDFS using WebHDFS (the HDFS web API) or a Python HDFS package such as: https://pypi.python.org/pypi/hdfs/
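A rough sketch of reading the output with that hdfs package (the WebHDFS URL, user, and output path are assumptions):
from hdfs import InsecureClient

# connect to the NameNode's WebHDFS endpoint and read one result file
client = InsecureClient("http://namenode-host:9870", user="hadoop")
with client.read("/output/results/part-r-00000", encoding="utf-8") as reader:
    results = reader.read()
print(results)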
