How to refer to pyspark.dbutils when developing code on a local machine - python

I'm new to PySpark and am asking about the best design pattern/practice:
I'm developing a library that should run both on a local machine and on Databricks.
I'm currently working on loading secrets. If the code runs on Databricks, I should load secrets using dbutils.secrets.get, while if the code runs on a local machine, I should use dotenv.load_dotenv.
Question:
How can I create/refer to the dbutils variable (which is readily provided on a Databricks instance)? pyspark doesn't have such a module... even if I import SparkSession, I still need DBUtils, which is not found in a local pyspark installation.
My current solution: if I identify that the code runs on Databricks, I create dbutils with:
dbutils = globals()['dbutils']

Just follow the approach described in the documentation for databricks-connect: wrap the instantiation of dbutils into a function call that behaves differently depending on whether you're on Databricks or not:
def get_secret(scope, key):
    # assumes an existing SparkSession named spark
    if spark.conf.get("spark.databricks.service.client.enabled") == "true":
        from pyspark.dbutils import DBUtils
        return DBUtils(spark).secrets.get(scope, key)
    else:
        # get local secret, e.g. from environment variables populated by dotenv.load_dotenv()
        import os
        return os.environ.get(key)
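A call site then looks the same in both environments, for example (the scope and key names below are just placeholders):

# Resolves via dbutils.secrets on Databricks, falls back to the local environment otherwise.
password = get_secret("my-scope", "db-password")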

Related

Unable to execute scala code on Azure DataBricks cluster

I am trying to set up a development environment for Databricks, so my developers can write code using the VS Code IDE (or some other IDE) and execute it against the Databricks cluster.
So I went through the documentation for Databricks Connect and did the setup as suggested in the document:
https://docs.databricks.com/dev-tools/databricks-connect.html#overview
After the setup I am able to execute Python code on the Azure Databricks cluster, but not Scala code.
While running the setup I found that it says Skipping scala command test on windows; I am not sure whether I am missing some configuration here.
Please suggest how to resolve this issue.
This is not an error, just a statement saying that databricks-connect test skips testing Scala code on Windows. You can still execute code from your local machine on the cluster using databricks-connect; you need to add the jars from the databricks-connect get-jar-dir directory to your project structure in the IDE, as described in these documentation steps: https://learn.microsoft.com/en-us/azure/databricks/dev-tools/databricks-connect#intellij-scala-or-java
Also note that when using Azure Databricks you enter a generic Databricks host along with your workspace id (org-id) when you execute databricks-connect configure,
e.g. https://westeurope.azuredatabricks.net/?o=xxxx instead of https://adb-xxxx.yz.azuredatabricks.net
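Once databricks-connect configure and databricks-connect test succeed, a quick way to confirm from Python that local code really runs against the cluster is a minimal smoke test like the sketch below (it assumes only the connect configuration from the previous steps):

from pyspark.sql import SparkSession

# With databricks-connect configured, getOrCreate() returns a session backed by the remote cluster.
spark = SparkSession.builder.getOrCreate()
print(spark.range(100).count())  # should print 100, computed on the Databricks cluster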

Gremlin operation using python in Azure functions

I was trying to create an Azure function using Python (HTTP trigger) to fetch data from a Gremlin graph.
I used
from gremlin_python.driver import client as clientDriver
to import the libraries, and it was working fine locally.
When I deployed the same code to the Azure portal and ran it, I got a 500 internal error.
After trying some changes, I could see that the "from gremlin_python.driver import client as clientDriver" import statement is not working (when I remove this piece, the code works).
When we run the code in VS Code, we create a virtual env and install the gremlin packages, so it works locally but not in the Azure portal.
Could someone help me resolve this issue?
For this problem, we need to make sure the requirements.txt is correct. And if you only import the module with the line
from gremlin_python.driver import client as clientDriver
you need to add another line to import the gremlin_python.driver module explicitly:
import gremlin_python.driver
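For reference, a minimal sketch of an HTTP-triggered function that uses gremlinpython is shown below; the Cosmos DB endpoint, database, collection, and key are placeholders, and gremlinpython must be listed in requirements.txt so it gets installed during deployment:

import azure.functions as func
from gremlin_python.driver import client as clientDriver, serializer

def main(req: func.HttpRequest) -> func.HttpResponse:
    # Placeholder connection details -- replace with your own Gremlin endpoint and credentials.
    gremlin_client = clientDriver.Client(
        "wss://<your-account>.gremlin.cosmos.azure.com:443/",
        "g",
        username="/dbs/<database>/colls/<collection>",
        password="<primary-key>",
        message_serializer=serializer.GraphSONSerializersV2d0(),
    )
    try:
        vertices = gremlin_client.submit("g.V().limit(1)").all().result()
        return func.HttpResponse(str(vertices), status_code=200)
    finally:
        gremlin_client.close()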
Hope it helps~

How can I read a local file from an R or Python script in Azure Machine Learning Studio?

I need to read a csv file, which is saved on my local computer, from code within an "Execute R/Python Script" module in an Azure Machine Learning Studio experiment. I am not supposed to upload the data in the usual way, i.e. from Datasets -> New -> Load from local file or with an Import Data module; I must do it with code. In principle this is not possible, neither from an experiment nor from a notebook, and in fact I always get an error. But I'm confused because the documentation about the Execute Python Script module says (among other things):
Limitations
The Execute Python Script currently has the following limitations:
Sandboxed execution. The Python runtime is currently sandboxed and, as a result, does not allow access to the network or to the local file system in a persistent manner. All files saved locally are isolated and deleted once the module finishes. The Python code cannot access most directories on the machine it runs on, the exception being the current directory and its subdirectories.
According to the highlighted text, it should be possible to access and load a file from the current directory, using for instance the pandas function read_csv. But actually no. Is there some trick to accomplish this?
Thanks.
You need to remember that Azure ML Studio is an online tool, and it's not running any code on your local machine.
All the work is being done in the cloud, including running the Execute Python Script, and this is what the text you've highlighted refers to: the directories and subdirectories of the cloud machine running your machine learning experiment, and not your own, local, computer.
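For completeness, if the data can be uploaded as a dataset in the workspace and connected to the module's input port, it arrives as a pandas DataFrame; a minimal sketch of the Execute Python Script entry point looks like this:

import pandas as pd

def azureml_main(dataframe1=None, dataframe2=None):
    # dataframe1 is the dataset connected to the module's first input port.
    print(dataframe1.head())
    # Return a tuple of DataFrames; the first element feeds the module's output port.
    return dataframe1,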

Azure timer trigger function using Python

I am writing an Azure timer trigger using Python 3.x. I've got one such function running. I think I know how to do it: create one from JS, then delete index.js and create a run.py. But this time, when I run my Python function, I always get an error saying "No such file: index.js". I don't see any bond between the function and the index.js file.
Any thoughts?
We can add the Python function from the Azure portal directly. If you want to create a timer trigger function, we can change the trigger type.
The following are my detailed steps to create a Python timer trigger function:
1. Create an Azure Function App.
2. Add a Python function.
3. Change the HttpTrigger to a timer trigger:
a. delete the HttpTrigger and HTTP output
b. add the timer trigger
4. Add the test code and test it from the Azure portal.
The default version is 2.7.8. If you want to use Python 3.x, you could follow this tutorial to update the Python version.
5. Update the Python version:
a. Install the extension for the Azure Function App
b. Add a Handler Mappings entry so as to use Python 3.x via FastCGI
6. Test it from the Azure portal.
I followed the tutorial in the comment and reproduced your issue on my side even though I refreshed the portal.
However, after waiting for some time, it worked. I suspect it's due to caching.
I suggest creating the Python Azure Function on Kudu directly. Just create run.py and function.json in a new folder instead of changing the JS template, as sketched below.
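A minimal sketch of that layout, assuming a function.json in the same folder that declares a timerTrigger binding (for example with a schedule of "0 */5 * * * *"), could be a run.py like this:

# run.py -- body of a classic script-style Python timer function.
# The timer binding itself lives in function.json next to this file.
import datetime

print("Python timer trigger fired at {}Z".format(datetime.datetime.utcnow().isoformat()))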
Hope it helps you.
In my case, run.py is recognized and run after I restart Azure Functions from the portal:
Azure Functions > Overview > Restart

Run cleanup script on AWS Oracle RDS from AWS Lambda

I'm using Apex to deploy Lambda functions in AWS. I need to write a Lambda function which runs a cleanup script on an Oracle RDS instance in my AWS VPC. Oracle has a very nice Python library called cx_Oracle, but I'm having some problems using it in a Lambda function (running on Python 2.7). My first step was to try to run the Oracle-described test code as follows:
from __future__ import print_function
import json
import boto3
import boto3.ec2
import os
import cx_Oracle

def handle(event, context):
    con = cx_Oracle.connect('username/password@my.oracle.rds:1521/orcl')
    print(str(con.version))
    con.close()
When I try to run this piece of test code, I get the following response:
Unable to import module 'main': /var/task/cx_Oracle.so: invalid ELF header
Google has told me that this error is caused because the cx_Oracle library is not a complete Oracle implementation for Python; rather, it requires the SQL*Plus client to be pre-installed, and the cx_Oracle library references components installed as part of SQL*Plus.
Obviously pre-installing SQL*Plus might be difficult.
Apex has the hooks {} functionality, which would allow me to pre-build things, but I'm having trouble finding documentation showing what happens to those artefacts and how that works. In theory I could download the libraries into a Nexus repository or an S3 bucket, and then in my hooks {} declaration I could add them to the zip file. I could then try to install them as part of the Python script. However, I have a few problems with this:
1. How are the 'built' artefacts accessed inside the Lambda function? Can they be? Have I misunderstood this?
2. Does a Python 2.7 Lambda function have enough access rights to the operating system of the host container to be able to install a library?
3. If the answer to question 2 is no, is there another way to write a Lambda function to run some SQL against an Oracle RDS instance?
