How to run non-Spark code on a Databricks cluster? - python

I am able to pull data via Databricks Connect and run Spark jobs perfectly. My question is how to run non-Spark or native Python code on the remote cluster. I'm not sharing the code due to confidentiality.

When you're using Databricks Connect, your local machine is the driver of your Spark job, so non-Spark code will always be executed on your local machine. If you want to execute it remotely, you need to package it as a wheel/egg, or upload the Python files to DBFS (for example, via databricks-cli), and then execute your code as a Databricks job (for example, using the Runs Submit command of the Jobs REST API, or by creating a job with databricks-cli and using databricks jobs run-now to execute it).
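For illustration, here is a rough sketch of the second approach, submitting a one-time run through the Jobs REST API from Python. The workspace URL, token, DBFS path, runtime version, and node type below are placeholders, and the exact field names should be checked against the Jobs API version your workspace uses:

import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder workspace URL
TOKEN = "<personal-access-token>"                         # placeholder token

# Submit a one-time run that executes a Python file previously uploaded to DBFS
payload = {
    "run_name": "native-python-job",
    "new_cluster": {
        "spark_version": "7.3.x-scala2.12",   # example runtime, adjust to your workspace
        "node_type_id": "i3.xlarge",          # example node type, adjust to your cloud
        "num_workers": 1,
    },
    "spark_python_task": {"python_file": "dbfs:/scripts/my_script.py"},
}

resp = requests.post(
    HOST + "/api/2.0/jobs/runs/submit",
    headers={"Authorization": "Bearer " + TOKEN},
    json=payload,
)
resp.raise_for_status()
print("Started run:", resp.json()["run_id"])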

Related

Running script using local Python and packages but executing certain commands on remote server

I know how to run a Python script made locally on a remote server and have seen a lot of questions in that regard. But I am in a situation where I cannot install python packages on the remote server I am accessing. Specifically, I need to use pypostal, which requires libpostal to be installed and I cannot do so. Moreover, I need pyspark to play with Hive tables.
Therefore, I need the script to run locally, where I can manage my packages and everything executes fine, but certain commands need to access the server in order to grab data. For example, using pyspark to get Hive tables on the server into a local dataframe. Essentially, I need all the Python to be executed using my local distribution with my local packages but perform its actions on the remote server.
I have looked into things like paramiko. But as far as I can work out, it is just an SSH client, which would use the Python distro on the remote server and not my local one. Though perhaps I don't understand how to use it properly.
I am running python 3.6 on Ubuntu 18.04 using WSL. The packages I am using are pandas, numpy, pyspark, and postal (subsequently libpostal).
TLDR;
Is it possible to run a script locally and have parts of it execute remotely, while still using my local Python? If there are other possible solutions, I would be grateful.
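A minimal sketch of the pattern the question describes, assuming the remote Hive metastore and the underlying data are reachable from the local machine; the thrift URI, database, and table names are placeholders:

from pyspark.sql import SparkSession

# Local driver, local Python packages; only the metastore/data access is remote.
spark = (
    SparkSession.builder
    .appName("local-driver-remote-hive")
    .config("hive.metastore.uris", "thrift://remote-server:9083")  # placeholder URI
    .enableHiveSupport()
    .getOrCreate()
)

# Pull a Hive table into a local pandas DataFrame, then post-process with local packages
df = spark.sql("SELECT * FROM some_db.some_table LIMIT 1000").toPandas()
print(df.head())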

Apache Spark ALS algorithm

I want to run a movie recommendation app based on the ALS algorithm on Apache Spark using Python.
I'm using Spark 2.2.0 with Hadoop 2.7.
I have one master and two workers.
When I run the app using this command:
spark-submit --master spark://192.168.190.132:7077 --total-executor-cores 8 --executor-memory 2g engine.py
I get an error saying the ratings.csv file doesn't exist (I checked the address; everything is correct).
error picture below
https://i.stack.imgur.com/dgK2Q.jpg
But when I use this command
spark-submit app.py, it works but fails after a while.
I'm not using HDFS; I load the dataset locally.
Do I need to copy datasets to all worker nodes?
You need to upload the dataset to HDFS if you want to run on a standalone Spark cluster: a local file path is resolved on each executor's machine, so a file that exists only on the driver's machine won't be found by the workers (you can check the worker nodes in the web UI). Use hdfs dfs -put to upload the files to HDFS.
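A minimal ALS sketch following that suggestion, assuming the dataset has been uploaded to HDFS; the NameNode host/port, path, and column names (MovieLens-style) are placeholders:

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("movie-recs").getOrCreate()

# Read ratings from HDFS so every executor can reach the file
ratings = spark.read.csv(
    "hdfs://192.168.190.132:9000/data/ratings.csv",  # placeholder NameNode URL and path
    header=True,
    inferSchema=True,
)

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating", coldStartStrategy="drop")
model = als.fit(ratings)
model.recommendForAllUsers(10).show(truncate=False)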

Windows task scheduler issue python script automation

I'm running a python script using the task scheduler. The script gets data from a database using pandas and stores them in csv.gz files.
It runs properly, but there is a difference in data size between when Task Scheduler runs it automatically and when it is run manually via the Task Scheduler Run button. The manual run gets all the data, whereas the automated run gets less data.
It runs the same script; I've checked multiple times but am unable to identify the issue.
PS:
Using Windows Server 2008 and pymssql to connect to the database.

connect local python script to remote spark master

I am using Python 2.7 with a Spark standalone cluster.
When I start the master on the same machine that runs the Python script, it works smoothly.
When I start the master on a remote machine and try to start a SparkContext on the local machine to access the remote Spark master, nothing happens and I get a message saying that the task did not get any resources.
When I access the master's UI, I see the job, but nothing happens with it; it's just there.
How do I access a remote Spark master via a local Python script?
Thanks.
EDIT:
I read that in order to do this I need to run the cluster in cluster mode (not client mode), and I found that standalone mode does not currently support this for Python applications.
Ideas?
In order to do this you would need to run the application in cluster mode (not client mode), and currently standalone mode does not support cluster deploy mode for Python applications.
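A minimal client-mode sketch of connecting a local driver to a remote standalone master, assuming the workers can route back to your local machine; the IPs and port are placeholders, and spark.driver.host must be an address the workers can actually reach (the "did not get any resources" message often means they cannot connect back to the driver):

from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("remote-master-test")
    .setMaster("spark://192.168.1.10:7077")     # placeholder: remote master URL
    .set("spark.driver.host", "192.168.1.20")   # placeholder: local IP reachable by the workers
    .set("spark.driver.port", "5555")           # optional fixed port to open in the firewall
)
sc = SparkContext(conf=conf)
print(sc.parallelize(range(100)).sum())
sc.stop()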

Execute a Hadoop Job in remote server and get its results from python webservice

I have a Hadoop job packaged in a jar file that I can execute on a server from the command line, storing the results in that server's HDFS.
Now I need to create a web service in Python (Tornado) that must execute the Hadoop job and fetch the results to present them to the user. The web service is hosted on another server.
I googled a lot for how to call the job from outside the server in a Python script but unfortunately did not find answers.
Anyone have a solution for this?
Thanks
One option could be to install the Hadoop binaries on your web service server using the same configuration as in your Hadoop cluster. You will need that to be able to talk to the cluster. You don't need to launch any Hadoop daemon there. At least configure HADOOP_HOME, HADOOP_CONF_DIR, and the Hadoop libraries, and set the PATH environment variable properly.
You need the binaries because you will use them to submit the job, and the configuration to tell the Hadoop client where the cluster is (the NameNode and the ResourceManager).
Then in python you can execute the hadoop jar command using subprocess: https://docs.python.org/2/library/subprocess.html
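A rough subprocess sketch of that step; the jar location, main class, and input/output paths are placeholders:

import subprocess

# Submit the packaged job through the locally installed Hadoop client
subprocess.check_call([
    "hadoop", "jar", "/opt/jobs/my-job.jar", "com.example.MyJob",  # placeholder jar and class
    "/input/data", "/output/results",                              # placeholder HDFS paths
])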
You can configure the job to notify your server when the job has finished using a callback: https://hadoopi.wordpress.com/2013/09/18/hadoop-get-a-callback-on-mapreduce-job-completion/
And finally, you can read the results from HDFS using WebHDFS (the HDFS REST API) or a Python HDFS package like: https://pypi.python.org/pypi/hdfs/
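And a short sketch of reading the output back with that hdfs package; the WebHDFS endpoint, user, and output path are placeholders:

from hdfs import InsecureClient

client = InsecureClient("http://namenode:50070", user="hadoop")  # placeholder WebHDFS endpoint

# Stream one of the job's output files back into the web service
with client.read("/output/results/part-r-00000", encoding="utf-8") as reader:
    results = reader.read()
print(results[:500])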
