Running Python scripts on a Databricks cluster

Is it possible to run an arbitrary Python script written in PyCharm on my Azure Databricks cluster?
Databricks offers databricks-connect, but it turned out to be useful only for Spark jobs.
More specifically, I'd like to use networkx to analyse graphs so huge that my local machine is unable to work with them.
I'm not sure if it's possible at all...
Thanks in advance!
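For context on why databricks-connect only helps with Spark jobs: it gives you a remote SparkSession, so only Spark operations execute on the cluster, while plain-Python libraries such as networkx still run on your local machine. A minimal sketch of that split, assuming databricks-connect is already configured (the DBFS path and the src/dst column names are hypothetical):

```python
# Assumes `databricks-connect configure` has been run; only Spark calls go to the cluster.
from pyspark.sql import SparkSession
import networkx as nx

spark = SparkSession.builder.getOrCreate()   # remote cluster via databricks-connect

# Runs on the Databricks cluster:
edges = spark.read.csv("/mnt/data/edges.csv", header=True).collect()

# Runs locally -- the graph is built and analysed on this machine, not on the cluster:
g = nx.Graph([(row["src"], row["dst"]) for row in edges])
print(nx.number_connected_components(g))
```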

Related

How do I create a standalone Jupyter Lab server using Pyinstaller or similar?

I would like to create a self-contained .exe file that launches a JupyterLab server as an IDE on a physical server which doesn't have Python installed itself.
The idea is to deploy it as part of an ETL workflow tool, so that it can be used to view notebooks that will contain the ETL steps in a relatively easily digestible format (the notebooks will be used as pipelines via papermill and scrapbook - not really relevant here).
While I can use Pyinstaller to bundle JupyterLab as a package, there isn't a way to launch it on the Pythonless server (that I can see), and I can't figure out a way to do it using Python code alone.
Is it possible to package JupyterLab this way so that I can run the .exe on the server and then connect to 127.0.0.1:8888 on the server to view a notebook?
I have tried using the link below as a starting point, but I think I'm missing something, as no server seems to start using this code alone, and I'm not sure how I would execute this via a Tornado server etc.:
https://gist.github.com/bollwyvl/bd56b58ba0a078534272043327c52bd1
I would really appreciate any ideas, help, or somebody to tell me why this idea is impossible madness!
Thanks!
Phil.
P.S. I should add that Docker isn't an option here :( I've done this before using Docker and it's extremely easy.
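For reference, JupyterLab can be started from plain Python code via its application class, which is roughly what an entry script frozen with PyInstaller would need to do. A minimal sketch, with the caveat that whether PyInstaller actually bundles all of JupyterLab's data files and entry points is exactly the open question here; the host and port values are assumptions:

```python
# launch_lab.py - candidate entry script to freeze with PyInstaller (sketch only).
from jupyterlab.labapp import LabApp

if __name__ == "__main__":
    # Start a local JupyterLab server without opening a browser.
    LabApp.launch_instance(argv=[
        "--ip=127.0.0.1",
        "--port=8888",
        "--no-browser",
    ])
```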

Advice on Jupyter/Voila CLOUD setup

I currently have my Anaconda setup on my local machine. I use standard Jupyter Notebooks (Python 3) and Voila - additionally I access Dropbox, where I have stored my data. It all works nicely; however, I am thinking about moving to the cloud so I can do my analyses anywhere.
I have tried Microsoft Azure and Google.
Somehow I think Azure is the better choice for me. Do you have any advice on what else I should try, or on how best to set up Azure? (I'm struggling - where is Voila...)
Thanks for all the hints.

Unable to execute scala code on Azure DataBricks cluster

I am trying to set up a development environment for Databricks, so my developers can write code using the VS Code IDE (or some other IDE) and execute it against the Databricks cluster.
So I went through the Databricks Connect documentation and did the setup as suggested in the document:
https://docs.databricks.com/dev-tools/databricks-connect.html#overview
After the setup I am able to execute Python code on the Azure Databricks cluster, but not Scala code.
While running the setup I found that it says "Skipping scala command test on windows"; I am not sure whether I am missing some configuration here.
Please suggest how to resolve this issue.
This is not an error, just a statement that the databricks-connect test skips testing Scala code on Windows. You can still execute code from your local machine on the cluster using databricks-connect; you need to add the jars from the databricks-connect get-jar-dir directory to your project structure in the IDE, as described in the steps in this documentation: https://learn.microsoft.com/en-us/azure/databricks/dev-tools/databricks-connect#intellij-scala-or-java
Also note that when using Azure Databricks you enter a generic Databricks host along with your workspace id (org-id) when you run databricks-connect configure,
e.g. https://westeurope.azuredatabricks.net/?o=xxxx instead of https://adb-xxxx.yz.azuredatabricks.net
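Independently of the IDE setup, a quick way to confirm that the databricks-connect configuration itself reaches the cluster is a minimal Python smoke test; this assumes databricks-connect configure (and databricks-connect test) has already been completed, and the same host and org-id settings are reused by the Scala/Java setup:

```python
# Minimal smoke test for a databricks-connect configuration (sketch).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # connects to the remote Databricks cluster
print(spark.range(100).count())              # the count executes on the cluster, not locally
```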

Azure Databricks Python Job

I have a requirement to parse a lot of small unstructured files in near real-time inside Azure and load the parsed data into a SQL database. I chose Python (because I don't think a Spark cluster or big data would suit, considering the volume of the source files and their size), and the parsing logic has already been written. I am looking to schedule this Python script in one of the following ways using Azure PaaS:
1. Azure Data Factory
2. Azure Databricks
3. Both 1 + 2
May I ask what the implications are of running a Python notebook activity from Azure Data Factory pointing at Azure Databricks? Would I be able to fully leverage the potential of the cluster (driver and workers)?
Also, please advise whether you think the script has to be converted to PySpark to meet my use-case requirement of running in Azure Databricks. The only hesitation here is that the files are only KBs in size and they are unstructured.
If the script is pure Python then it would only run on the driver node of the Databricks cluster, making it very expensive (and slow, due to cluster startup times).
You could rewrite it as PySpark, but if the data volumes are as low as you say then this is still expensive and slow. The smallest cluster will consume two VMs, each with 4 cores.
I would look at using Azure Functions instead. Python is now an option: https://learn.microsoft.com/en-us/azure/python/tutorial-vs-code-serverless-python-01
Azure Functions also have great integration with Azure Data Factory so your workflow would still work.
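To make the Azure Functions suggestion concrete, here is a minimal sketch of a blob-triggered Python function using the v2 programming model. The container name, the storage connection setting, and the trivial pipe-delimited parse are placeholder assumptions; the real (already written) parsing logic and the SQL load (e.g. via pyodbc) would replace them:

```python
# function_app.py - sketch of a blob-triggered Azure Function (Python v2 model).
import logging

import azure.functions as func

app = func.FunctionApp()

@app.blob_trigger(arg_name="blob",
                  path="incoming-files/{name}",      # hypothetical container name
                  connection="AzureWebJobsStorage")
def parse_small_file(blob: func.InputStream) -> None:
    # Placeholder parse: split each line into fields; the existing parsing logic goes here.
    text = blob.read().decode("utf-8")
    rows = [line.split("|") for line in text.splitlines() if line]
    logging.info("Parsed %s into %d rows", blob.name, len(rows))
    # Writing the rows to the SQL database (e.g. with pyodbc) would follow here.
```

Each file that lands in the container triggers one small, short-lived execution, which fits the many-small-files workload better than spinning up a cluster.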

Creating azure ml experiments merely using python notebook within azure ml studio

I wonder if it is possible to design ML experiments without using the drag-and-drop functionality (which is very nice, btw). I want to use Python code in a notebook (within Azure ML Studio) to access the algorithms in the studio (e.g., the Matchbox recommender, regression models, etc.) and design experiments. Is this possible?
I appreciate any information and suggestion!
The algorithms available as modules in Azure ML Studio cannot currently be called directly as code from Python.
That being said: you can attempt to publish the outputs of the out-of-the-box algorithms as web services, which can be consumed by Python code in the Azure ML Studio notebooks. You can also create your own algorithms and use them as custom Python or R modules.
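For the web-service route, the usual pattern from a Studio notebook is a plain HTTPS call to the published endpoint. A minimal sketch, where the endpoint URL, API key, input name and column schema are all placeholders for whatever the published service actually expects:

```python
# Sketch: consuming a published Azure ML Studio (classic) web service from a notebook.
import json

import requests

url = ("https://ussouthcentral.services.azureml.net/workspaces/<workspace-id>"
       "/services/<service-id>/execute?api-version=2.0&details=true")
api_key = "<web-service-api-key>"

payload = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["user", "item"],   # placeholder schema
            "Values": [["123", "456"]],
        }
    },
    "GlobalParameters": {},
}

response = requests.post(
    url,
    headers={"Authorization": "Bearer " + api_key,
             "Content-Type": "application/json"},
    data=json.dumps(payload),
)
print(response.json())
```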
