I've installed a Cloudera CDH cluster with Spark2 on 7 hosts (2 masters, 4 workers and 1 edge node).
I installed a Jupyter server on the edge node. I want to set PySpark to run in cluster mode, so I run this in a notebook:
os.environ['PYSPARK_SUBMIT_ARGS']='--master yarn --deploy-mode=cluster pyspark-shell'
It gives me "Error: Cluster deploy mode is not applicable to Spark shells."
Can someone help me with this?
Thanks
The answer here is that you can't, because Jupyter, as configured, launches a PySpark shell session behind the scenes, and a shell cannot run in cluster mode.
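For reference, the same environment variable with client deploy mode does not hit this error, since the driver (and the shell) then stays on the edge node while only the executors run on YARN. A minimal sketch:

import os

# Client mode keeps the driver and the shell on the edge node;
# only the executors are launched on the YARN cluster.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--master yarn --deploy-mode client pyspark-shell'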
One solution I can think of for your problem is Livy + sparkmagic + Jupyter, where Livy runs on YARN and serves job requests as REST calls, and sparkmagic resides in Jupyter.
You can follow the link below for more info on this:
https://blog.chezo.uno/livy-jupyter-notebook-sparkmagic-powerful-easy-notebook-for-data-scientist-a8b72345ea2d
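As a rough illustration of the Livy route (a sketch only; the host name is a placeholder and it assumes the requests package plus a Livy server running on the cluster):

import json, requests

livy_url = 'http://livy-host:8998'   # placeholder: your Livy endpoint
headers = {'Content-Type': 'application/json'}

# Create a PySpark session on the cluster (Livy runs it under YARN).
session = requests.post(livy_url + '/sessions',
                        data=json.dumps({'kind': 'pyspark'}),
                        headers=headers).json()

# Submit a statement to that session as a plain REST call.
stmt = {'code': 'sc.parallelize(range(100)).count()'}
requests.post('{}/sessions/{}/statements'.format(livy_url, session['id']),
              data=json.dumps(stmt), headers=headers)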
Major update.
I have succeeded in deploying a JupyterHub with CDH 5.13, and it works without any problems.
One thing to pay attention to is to install Python 3 as the default language; with Python 2, multiple jobs will fail because of incompatibilities with the Cloudera packages.
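For what it's worth, one way to pin the interpreter to Python 3 (a sketch, not part of the original setup; the path is a placeholder for whatever your nodes actually have) is via the standard PySpark environment variables:

import os

# Make both the driver and the executors use the same Python 3 interpreter.
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/usr/bin/python3'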
Related
I wrote PySpark code to deal with some Spark SQL data.
Last month it worked perfectly when I ran spark-submit --master local[25]; from the top command I could see 25 Python threads.
However, nothing has changed, yet today spark-submit only creates one thread. I wonder what kind of things could cause such a problem.
This is on an Ubuntu server on AWS which has 16 CPU cores. The Spark version is 2.2.1 and Python is 3.6.
Just found the problem: there is another user running his own Spark task on the same instance, which is occupying resources.
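For anyone hitting something similar, a quick way to confirm how many slots local[N] actually gives the driver (a sketch; in local mode defaultParallelism should match the N you requested):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster('local[25]').setAppName('parallelism-check')
sc = SparkContext(conf=conf)

# In local[25] this should print 25; a much smaller number would suggest
# the master or configuration is being overridden somewhere.
print(sc.defaultParallelism)
sc.stop()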
I have recently been reading about deep learning with TensorFlow. I successfully installed TensorFlow on Ubuntu 16 in a Python environment on my laptop and validated the installation at the end by following the procedure on the TensorFlow site.
In order to take full advantage of more processing capacity, I have obtained an account on a cluster running CentOS (a Linux-based operating system) that my school operates for its students for workloads like simulation. The cluster comprises about 16 nodes with 6 processors each. This is all I know about the cluster environment.
Note: this is not a GPU-based cluster.
The admin has given me a user ID and password along with the host IP address to access the cluster from any system. I installed software called Xmanager Enterprise 5 under the Windows operating system, and I can access my account now.
What I need to know is:
I am unable to install anything without using sudo, and if I use sudo it shows me a privilege error message. How can I install the required packages on the cluster machine so that I can run my program?
Or is there another way, such as installing everything locally on my Ubuntu laptop and then porting it to the cluster?
In fact, I want to know more about this but do not know where to start. Please guide me.
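One common workaround on clusters where you have no sudo rights (an assumption on my part, not something confirmed by the cluster admin) is a per-user install into your home directory, which can even be driven from Python itself:

import subprocess, sys

# --user installs into ~/.local instead of the system site-packages,
# so no root privileges are needed (the package name is just an example).
subprocess.check_call([sys.executable, '-m', 'pip',
                       'install', '--user', 'tensorflow'])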
I am using Python 2.7 with a Spark standalone cluster.
When I start the master on the same machine that runs the Python script, it works smoothly.
When I start the master on a remote machine and try to start a Spark context on the local machine to access the remote Spark master, nothing happens and I get a message saying that the task did not get any resources.
When I access the master's UI, I see the job, but nothing happens with it; it's just there.
How do I access a remote Spark master via a local Python script?
Thanks.
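For context, this is roughly what the local script does when pointing at the remote master (a sketch; the host name is a placeholder, and spark.cores.max / spark.executor.memory are the first settings to check when the UI reports that no resources were granted):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster('spark://remote-master-host:7077')  # placeholder host
        .setAppName('remote-standalone-test')
        .set('spark.cores.max', '2')           # keep the request small enough
        .set('spark.executor.memory', '1g'))   # to fit on the available workers

sc = SparkContext(conf=conf)
print(sc.parallelize(range(10)).count())
sc.stop()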
EDIT:
I read that in order to do this I need to run the application in cluster mode (not client mode), and I found that standalone mode currently does not support this for Python applications.
Ideas?
In order to do this you need to run the application in cluster mode (not client mode), and, as I found here, standalone mode currently does not support this for Python applications.
We are using a remote Spark cluster with YARN (on Hortonworks). Developers want to use Spyder to implement Spark applications on Windows. SSHing to the cluster and using an IPython notebook or Jupyter works well. Is there any other way to communicate with the Spark cluster from Windows?
Question 1: I am struggling to submit a Spark job (written in Python) from a Windows machine which has no Spark installed. Can anyone help me with this? Specifically, how do I phrase the command line to submit the job?
We can SSH to a YARN node in the cluster, in case that is relevant to a solution. The Windows client is also pingable from the cluster.
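One way to phrase this without any Spark install on the Windows side (an assumption: it relies on Apache Livy being available on the cluster; the host and file paths below are placeholders) is to POST the job as a Livy batch:

import json, requests

livy_url = 'http://livy-host:8998'  # placeholder: Livy endpoint on the cluster
payload = {
    'file': 'hdfs:///user/dev/my_app.py',  # placeholder: your PySpark script on HDFS
    'args': ['--some-arg', 'value'],
}

resp = requests.post(livy_url + '/batches',
                     data=json.dumps(payload),
                     headers={'Content-Type': 'application/json'})
print(resp.json())  # contains the batch id and its current state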
Question 2: What do we need to have on the client side (e.g. Spark libraries) if we want to debug in an environment like this?
I have just started learning Spark and have been using R & Python in Jupyter notebooks at my company.
Spark and Jupyter are both installed locally on my computer and function perfectly fine individually.
Instead of creating a .py script for PySpark in cmd every single time, could I connect it to my Jupyter notebook and run the scripts there? I have seen many posts on how to achieve this on Linux and Mac, but sadly I have to stick with Windows 7 in this case.
Thanks!
Will
You could use the Sandbox from Hortonworks (http://hortonworks.com/downloads/#sandbox) and run your code in Apache Zeppelin.
No setup necessary: install VirtualBox and run the sandbox, then access Zeppelin and Ambari via your host (Windows) browser and you are good to go to run your %pyspark code. Zeppelin has a look and feel like Jupyter.
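If staying with Jupyter on Windows 7 is a hard requirement, another route people commonly use (not part of the Sandbox answer above, and assuming Spark, Java and winutils.exe are already set up locally with SPARK_HOME pointing at the Spark install) is the findspark package:

import findspark
findspark.init()  # adds the local Spark install (via SPARK_HOME) to sys.path

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master('local[*]')
         .appName('jupyter-local')
         .getOrCreate())

print(spark.range(10).count())  # quick sanity check: should print 10
spark.stop()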