How do I use a variable defined in the EMR cluster's Python instance when I run code on the managed Jupyter notebook instance using %%local?
Specifically, I want to use matplotlib as shown in this question, and display a plot from a dataframe generated using spark.sql(). Using %%sql lets me easily use data results in %%local, but I would still need to pass parameters to %%sql from the EMR Python instance.
Example:
In [1]: parameter = 'Hello parameter'
In [2]: %%local
        print(parameter)
I keep getting an error that my variable is not defined.
I found two workarounds:
Use %%spark -o df to return SQL query results to a dataframe that can be used with %%local, like in this answer (see the sketch after this list)
Do all query building, execution and any data processing like normal without using any %% magic commands, then write the final data to a temporary table in my database using df.createOrReplaceTempView("temp_table_name"). Then use a simple query to retrieve the final data with %%sql -q -o df and SELECT * FROM temp_table_name
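As a minimal sketch of the first workaround (the table and column names are placeholders, and the two cells below are separate notebook cells):
%%spark -o plot_df
# -o copies plot_df back to the local kernel as a pandas DataFrame after the cell runs
plot_df = spark.sql("SELECT category, COUNT(*) AS cnt FROM my_table GROUP BY category")

%%local
# this cell runs on the notebook instance, where matplotlib is available
import matplotlib.pyplot as plt
plot_df.plot(kind='bar', x='category', y='cnt')
plt.show()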
Related
How can I call functions defined in a different Synapse notebook after running the notebook with mssparkutils.notebook.run()?
example:
#parameters
value = "test"
from notebookutils import mssparkutils
mssparkutils.notebook.run("function definitions", 60, {"param": value})
df = load_cosmos_data() #defined in 'function definitions' notebook
This fails with: NameError: name 'load_cosmos_data' is not defined
I can use the functions with the %run command, but I need to be able to pass the parameter through to the function definitions notebook. %run doesn't allow me to pass a variable as a parameter.
After going through this official Microsoft documentation:
When referencing another notebook, after exiting the referenced
notebook with exit() or without it, the source notebook script will
continue to execute, and the two become different notebooks with no
relationship between them. We can't access any variable from that
notebook, and the same applies to the functions of that notebook.
In general programming languages as well, we can't access a function's local variables after it returns. That is only possible when we return that variable.
Unfortunately, the exit() method doesn’t support returning values other than strings from the referenced notebook.
From the above code, it seems you need to access the dataframe returned by the function load_cosmos_data() in the referenced notebook. You can do that using temporary views.
Please follow the demonstration below:
In the referenced notebook, call the function, store the returned dataframe in a variable, and create a temporary view from it. You can then load that temporary view as a dataframe in the source notebook.
Function Notebook:
Code:
from notebookutils import mssparkutils
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

def load_data():
    data2 = [(24, "Rakesh", "Govindula"),
             (16, "Virat", "Kohli")]
    schema = StructType([
        StructField("id", IntegerType(), True),
        StructField("firstname", StringType(), True),
        StructField("lastname", StringType(), True)
    ])
    df = spark.createDataFrame(data=data2, schema=schema)
    return df

df2 = load_data()
df2.show()
df2.createOrReplaceTempView("dataframeview")
mssparkutils.notebook.exit("dataframeview")
Source Notebook:
Code:
value="test"
from notebookutils import mssparkutils
view_name=mssparkutils.notebook.run("/function_notebook", 60, {"param": value})
df=spark.sql("select * from {0}".format(view_name))
df.show()
With this approach you can pass the parameter through to the function notebook and access the dataframe returned from the function as well.
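If the function notebook itself also needs the value passed as param, it can be picked up through a parameters cell; a minimal sketch (the default value is an assumption, and the cell has to be marked as the parameters cell in Synapse):
# in the function notebook, in the cell toggled as the parameters cell
param = "default"   # overridden by {"param": value} passed from mssparkutils.notebook.run()

# later cells in the function notebook can then use it, for example:
print(param)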
Please go through this SO thread if you face any issues when returning values from a Synapse notebook.
I am accessing an InterSystems Caché 2017.1.xx instance through a Python process to get various attributes about the database in order to monitor it.
One of the items I want to monitor is license usage. I wrote an ObjectScript script in a Terminal window to access license usage by user:
s Rset=##class(%ResultSet).%New("%SYSTEM.License.UserListAll")
s r=Rset.Execute()
s ncol=Rset.GetColumnCount()
While (Rset.Next()) {f i=1:1:ncol w !,Rset.GetData(i)}
But I have been unable to determine how to convert this script into a Python equivalent. I am using the intersys.pythonbind3 module for connecting to and accessing the Caché instance. I have been able to create Python functions that access most everything else in the instance, but I cannot figure out how to translate this one piece of data into Python (3.7).
The following should work (based on the documentation):
query = intersys.pythonbind.query(database)
query.prepare_class("%SYSTEM.License", "UserListAll")
query.execute()
# Fetch each row in the result set, and print the
# name and value of each column in a row:
while 1:
    cols = query.fetch([None])
    if len(cols) == 0:
        break
    print(str(cols[0]))
Also, note that InterSystems IRIS, the successor to Caché, now has Python as an embedded language. See more in the docs.
It turns out the noted query "UserListAll" is not defined correctly in the library (it is not marked as SqlProc). So resolving this issue would require an ObjectScript class with the query, and the use of a result set or similar in Python to get the results. So I am marking this as resolved.
Not sure which Python interface you're using for Cache/IRIS, but this Open Source 3rd party one is worth investigating for the kind of things you're trying to do:
https://github.com/chrisemunt/mg_python
When running a Databricks notebook as a job, you can specify job or run parameters that can be used within the code of the notebook. However, it wasn't clear from the documentation how you actually fetch them. I'd like to be able to get all the parameters as well as the job id and run id.
Job/run parameters
When the notebook is run as a job, any job parameters can be fetched as a dictionary using the dbutils package that Databricks automatically provides and imports. Here's the code:
run_parameters = dbutils.notebook.entry_point.getCurrentBindings()
If the job parameters were {"foo": "bar"}, then the result of the code above gives you the dict {'foo': 'bar'}. Note that Databricks only allows job parameter mappings of str to str, so keys and values will always be strings.
Note that if the notebook is run interactively (not as a job), then the dict will be empty. The getCurrentBindings() method also appears to work for getting any active widget values for the notebook (when run interactively).
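For illustration, a small sketch of pulling a single value out of the bindings (the key "foo" and the fallback value are made up for this example):
run_parameters = dbutils.notebook.entry_point.getCurrentBindings()
# convert the bindings to a plain Python dict of str -> str
params = dict(run_parameters)
foo = params.get("foo", "fallback_value")  # hypothetical parameter name and fallback
print(foo)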
Getting the jobId and runId
To get the jobId and runId, you can get a context JSON from dbutils that contains that information (adapted from the Databricks forum):
import json
context_str = dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson()
context = json.loads(context_str)
run_id_obj = context.get('currentRunId', {})
run_id = run_id_obj.get('id', None) if run_id_obj else None
job_id = context.get('tags', {}).get('jobId', None)
So within the context object, the path of keys for runId is currentRunId > id and the path of keys to jobId is tags > jobId.
Nowadays you can easily get the parameters from a job through the widget API. This is pretty well described in the official documentation from Databricks. Below, I'll elaborate on the steps you have to take to get there; it is fairly easy.
Create or use an existing notebook that has to accept some parameters. We want to know the job_id and run_id, and let's also add two user-defined parameters environment and animal.
# Get parameters from job
job_id = dbutils.widgets.get("job_id")
run_id = dbutils.widgets.get("run_id")
environment = dbutils.widgets.get("environment")
animal = dbutils.widgets.get("animal")
print(job_id)
print(run_id)
print(environment)
print(animal)
Now let's go to Workflows > Jobs to create a parameterised job. Make sure you select the correct notebook and specify the parameters for the job at the bottom. According to the documentation, we need to use curly brackets for the parameter values of job_id and run_id. For the other parameters, we can pick a value ourselves.
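For example, the job's parameter mapping could look roughly like this, assuming the {{job_id}} and {{run_id}} parameter variables described in the documentation (the environment and animal values are simply the ones used in this example):
job_id       {{job_id}}
run_id       {{run_id}}
environment  dev
animal       squirrel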
Note: the reason why you are not allowed to get the job_id and run_id directly from the notebook is security (as you can see from the stack trace when you try to access the attributes of the context). Within a notebook you are in a different context; those parameters live at a "higher" context.
Run the job and observe that it outputs something like:
dev
squirrel
137355915119346
7492
Command took 0.09 seconds
You can even set default parameters in the notebook itself, that will be used if you run the notebook or if the notebook is triggered from a job without parameters. This makes testing easier, and allows you to default certain values.
# Adding widgets to a notebook
dbutils.widgets.text("environment", "tst")
dbutils.widgets.text("animal", "turtle")
# Removing widgets from a notebook
dbutils.widgets.remove("environment")
dbutils.widgets.remove("animal")
# Or removing all widgets from a notebook
dbutils.widgets.removeAll()
And last but not least, I tested this on different cluster types; so far I have found no limitations. My current settings are:
spark.databricks.cluster.profile serverless
spark.databricks.passthrough.enabled true
spark.databricks.pyspark.enableProcessIsolation true
spark.databricks.repl.allowedLanguages python,sql
As the title says, I want to know if/how you can set a Databricks widget input using a variable instead of a hard-coded value. I have two notebooks. One needs to apply a filter to some values. The other needs to run some code, then optionally (as dictated by another widget) apply that same filter.
Here's some example code (modified for simplicity/privacy).
In Notebook2 we have:
start = dbutils.widgets.get("startDate")
filter_condition = None
if start:
    filter_condition = f"GeneratedDate >= '{start}'"
foo = important_function(filter_condition)
%run ./Notebook1 $run_training="True" $num_trials=100 $filter_string=filter_condition
where I want filter_condition to be the above-defined variable and not a string.
In Notebook1, there's some code like:
if run_training == "True":
    bar = optimize_model(datasets, grid, int(num_trials))
elif run_training == "False":
    baz = apply_filter(filter_string)
else:
    # Throw error
    raise ValueError("run_training must be 'True' or 'False'")
You're probably looking for the notebook.run function of the databricks utilities package, rather than the %run command:
dbutils.notebook.run(path='./Notebook1',
                     timeout_seconds=300,
                     arguments={'run_training': 'True',
                                'num_trials': '100',   # argument values are passed as strings; Notebook1 already does int(num_trials)
                                'filter_string': filter_condition})
The notebook will be run as an "ephemeral" job. Note that the notebook will run in a separate notebook environment, so any variables etc. created there will not be brought back into the notebook you ran it from. Your input arguments come through as widget variables, which can be accessed using:
num_trials = dbutils.widgets.get('num_trials')
etc. I think you are already doing that though.
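If you also need something back from Notebook1, the only supported return channel is a string passed to dbutils.notebook.exit, which dbutils.notebook.run hands back to the caller; a rough sketch (the JSON payload is just an assumption):
# in Notebook1 (the child notebook), as the last step:
import json
dbutils.notebook.exit(json.dumps({"status": "ok"}))

# in the calling notebook:
import json
result_str = dbutils.notebook.run('./Notebook1', 300,
                                  {'run_training': 'False',
                                   'filter_string': filter_condition})
result = json.loads(result_str)
print(result["status"])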
Using native Python code in SQL UDFs in MonetDB is really powerful. But debugging such UDFs could benefit from more support. In particular, if I use the old-fashioned print('debugging info'), it disappears into a big black void.
create function dummy()
returns string
language python{
print('Entering the dummy UDF')
return 'hello';
};
How can I retrieve this information from the server or the MonetDB client?
I was debugging some Python UDF last week :)
Step 1: first make sure your Python code at least works in a Python interpreter.
Step 2: in a Python UDF, write your debugging info to a file, e.g.:
f = open('/tmp/debug.out', 'w')
f.write('my debugging info\n')
f.close()
This isn't ideal, but it works. Also, I used this to export the parameter values of my Python UDF. In this way, I can run the body of my Python UDF in a Python interpreter with the exact data I receive from MonetDB.
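For example, a minimal sketch of dumping a UDF's input column to disk so the UDF body can be replayed in a normal interpreter (the file path is an assumption; MonetDB passes input columns in as NumPy arrays, which pickle handles fine):
# inside the Python UDF body:
import pickle
with open('/tmp/udf_input.pkl', 'wb') as f:
    pickle.dump(column, f)   # save the exact input MonetDB handed to the UDF

# later, in a standalone Python interpreter:
import pickle
with open('/tmp/udf_input.pkl', 'rb') as f:
    column = pickle.load(f)
# now step through the UDF body line by line with the real data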
In case someone is still interested in this problem.
There are two novel ways of debugging MonetDB's Python UDFs.
1) Using the Python client pymonetdb (https://github.com/gijzelaerr/pymonetdb).
You can install it through pip:
pip install pymonetdb
To use it, consider the following setup: a table that holds integers and a UDF that computes the mean absolute deviation of a given column.
CREATE TABLE integers(i INTEGER);
INSERT INTO integers VALUES (1), (3), (6), (8), (10);
CREATE OR REPLACE FUNCTION mean_deviation(column INTEGER)
RETURNS DOUBLE LANGUAGE PYTHON {
    mean = 0.0
    for i in range(0, len(column)):
        mean += column[i]
    mean = mean / len(column)
    distance = 0.0
    for i in range(0, len(column)):
        # absolute deviation from the mean
        distance += abs(column[i] - mean)
    deviation = distance / len(column)
    return deviation;
};
To debug your function from the terminal (i.e., with pdb), you just need to open a database connection using pymonetdb.connect(), get a cursor object from the connection, and through the cursor object call the debug() function, passing as parameters the SQL statement you want to examine and the name of the UDF you wish to debug.
import pymonetdb
conn = pymonetdb.connect(database='demo') #Open Database connection
c = conn.cursor()
sql = 'select mean_deviation(i) from integers;'
c.debug(sql, 'mean_deviation') #Console Debugging
There is an optional sampling step that only transfers a uniform random sample of the data instead of the full input data set. If you wish to sample, you just need to pass the number of elements you want from the sampling (e.g., c.debug(sql, 'mean_deviation', 10) if you want a subset of 10 elements).
2) Using a POC plugin for PyCharm called devudf, which you can install through the plugin page of PyCharm or directly from the JetBrains page: https://plugins.jetbrains.com/plugin/12063-devudf. It adds an option to the main menu called "UDF Development" and allows you to directly import and export UDFs between your database and PyCharm, and enjoy the IDE's debugging capabilities.