Databricks widgets - python

My question is about widgets to pass parameters in databricks. I am using widgets in one notebook to set parameters. Then, I am running this initial notebook from other notebooks. I want the chosen parameters to be pulled in.
For example, in Notebook 1 I have:
dbutils.widgets.dropdown("start_year", "2011", [str(x) for x in range(2008, 2021)], "Earliest year")
start_year=dbutils.widgets.get("start_year")
print("The start_year is " + dbutils.widgets.get("start_year"))
The print statement correctly prints whatever year the user has selected.
In Notebook 2, which runs Notebook 1 using %run, it will only print the default year, in this case 2011, no matter what is selected. What am I doing wrong? Thanks!

From the widget docs (all the way at the bottom):
If you run a notebook that contains widgets, the specified notebook is run with the widget’s default values. You can also pass in values to widgets. For example:
%run /path/to/notebook $X="10" $Y="1"
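Applied to the question, that means Notebook 2 should pass the chosen year explicitly when it runs Notebook 1, rather than relying on widget state left behind in the other notebook. A sketch (the path and year are placeholders):
%run ./Notebook1 $start_year="2015"
Notebook 1 will then see "2015" from dbutils.widgets.get("start_year") instead of the default "2011".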
Have you considered setting "notebook 1" to store the value, at least temporarily, in dbfs rather than in a variable? Then you can retrieve whatever was set in "notebook 1".
i.e. Assuming the user must visit and set the parameter in Notebook 1
Notebook 1
start_year=dbutils.widgets.get("start_year")
myTmpTxt = "The start_year is " + dbutils.widgets.get("start_year")
print(myTmpTxt)
with open('/dbfs/mnt/tmp/tmpStoreYear.txt', 'w') as savedYear:
    savedYear.write(myTmpTxt)
Notebook 2
openQuery = open('/dbfs/mnt/tmp/tmpStoreYear.txt', 'r')
print(openQuery.read())
Also, there is an option to specify widget parameters using dbutils.notebook.run (docs).
However, it still runs as a separate job, and the values won't be imported into your "notebook 2".
Hope this helps!

Related

How do you get the run parameters and runId within Databricks notebook?

When running a Databricks notebook as a job, you can specify job or run parameters that can be used within the code of the notebook. However, it wasn't clear from the documentation how to actually fetch them. I'd like to be able to get all the parameters as well as the job id and run id.
Job/run parameters
When the notebook is run as a job, then any job parameters can be fetched as a dictionary using the dbutils package that Databricks automatically provides and imports. Here's the code:
run_parameters = dbutils.notebook.entry_point.getCurrentBindings()
If the job parameters were {"foo": "bar"}, then the result of the code above gives you the dict {'foo': 'bar'}. Note that Databricks only allows job parameter mappings of str to str, so keys and values will always be strings.
Note that if the notebook is run interactively (not as a job), then the dict will be empty. The getCurrentBindings() method also appears to work for getting any active widget values for the notebook (when run interactively).
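If you want a plain Python dict to work with, here is a small sketch, assuming the bindings object can be consumed like a Python mapping of str to str:
run_parameters = dict(dbutils.notebook.entry_point.getCurrentBindings())
print(run_parameters.get("foo"))  # 'bar' if the job was launched with {"foo": "bar"}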
Getting the jobId and runId
To get the jobId and runId you can get a context JSON from dbutils that contains that information (adapted from the Databricks forum):
import json
context_str = dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson()
context = json.loads(context_str)
run_id_obj = context.get('currentRunId', {})
run_id = run_id_obj.get('id', None) if run_id_obj else None
job_id = context.get('tags', {}).get('jobId', None)
So within the context object, the path of keys for runId is currentRunId > id and the path of keys to jobId is tags > jobId.
Nowadays you can easily get the parameters from a job through the widget API. This is pretty well described in the official documentation from Databricks. Below, I'll elaborate on the steps you have to take to get there; it is fairly easy.
Create or use an existing notebook that has to accept some parameters. We want to know the job_id and run_id, and let's also add two user-defined parameters environment and animal.
# Get parameters from job
job_id = dbutils.widgets.get("job_id")
run_id = dbutils.widgets.get("run_id")
environment = dbutils.widgets.get("environment")
animal = dbutils.widgets.get("animal")
print(job_id)
print(run_id)
print(environment)
print(animal)
Now let's go to Workflows > Jobs to create a parameterised job. Make sure you select the correct notebook and specify the parameters for the job at the bottom. According to the documentation, we need to use curly brackets for the parameter values of job_id and run_id. For the other parameters, we can pick a value ourselves.
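For illustration, the job's parameter mapping could look roughly like this; the exact curly-bracket templating syntax for the ID values may differ between Databricks versions, so check the documentation for your workspace:
job_id: {{job_id}}
run_id: {{run_id}}
environment: dev
animal: squirrel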
Note: The reason you are not allowed to get the job_id and run_id directly from the notebook context is security (as you can see from the stack trace when you try to access the attributes of the context). Within a notebook you are in a different context; those parameters live at a "higher" context.
Run the job and observe that it outputs something like:
dev
squirrel
137355915119346
7492
Command took 0.09 seconds
You can even set default parameters in the notebook itself, that will be used if you run the notebook or if the notebook is triggered from a job without parameters. This makes testing easier, and allows you to default certain values.
# Adding widgets to a notebook
dbutils.widgets.text("environment", "tst")
dbutils.widgets.text("animal", "turtle")
# Removing widgets from a notebook
dbutils.widgets.remove("environment")
dbutils.widgets.remove("animal")
# Or removing all widgets from a notebook
dbutils.widgets.removeAll()
And last but not least, I tested this on different cluster types and so far have found no limitations. My current settings are:
spark.databricks.cluster.profile serverless
spark.databricks.passthrough.enabled true
spark.databricks.pyspark.enableProcessIsolation true
spark.databricks.repl.allowedLanguages python,sql

Google Colab: How to show value of assignments?

I am working on this Python notebook in Google Colab:
https://github.com/AllenDowney/ModSimPy/blob/master/notebooks/chap01.ipynb
I had to change the configuration line because the one stated in the original was erroring out:
# Configure Jupyter to display the assigned value after an assignment
# Line commented below because errors out
# %config InteractiveShell.ast_node_interactivity='last_expr_or_assign'
# Edit solution given below
%config InteractiveShell.ast_node_interactivity='last_expr'
However, I think the original statement was meant to show values of assignments (if I'm not mistaken), so that when I run the following cell in the notebook, I should see an output:
meter = UNITS.meter
second = UNITS.second
a = 9.8 * meter / second**2
If so, how can I make the notebook on Google Colab show the output of assignments?
The short answer is: you cannot show the output of assignments in Colab.
Your confusion comes from how Google Colab works. The original script is meant to run in IPython. But Colab is not a regular IPython. When you run IPython shell, your %config InteractiveShell.ast_node_interactivity options are (citing documentation)
‘all’, ‘last’, ‘last_expr’, ‘last_expr_or_assign’ or ‘none’, specifying which nodes should be run interactively (displaying output from expressions). ‘last_expr’ will run the last node interactively only if it is an expression (i.e. expressions in loops or other blocks are not displayed). ‘last_expr_or_assign’ will run the last expression or the last assignment. Other values for this parameter will raise a ValueError.
all will display all the variables, but not the assignments, for example
x = 5
x
y = 7
y
Out[]:
5
7
The differences between the options become more significant when you want to display variables in the loop.
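A quick sketch of that difference, assuming a regular IPython shell:
%config InteractiveShell.ast_node_interactivity='all'
for i in range(3):
    i * 10  # with 'all' each bare expression is echoed (0, 10, 20); with 'last_expr' the loop body displays nothing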
In Colab your options are restricted to ['all', 'last', 'last_expr', 'none']. If you select all, the result for the above cell will be
Out[]:
5
7
Summarizing all that, there is no way of showing the result of an assignment in Colab. Your only option (AFAIK) is to add the variable you want to see to the end of the cell where it is assigned (which is similar to a regular print):
meter = UNITS.meter
second = UNITS.second
a = 9.8 * meter / second**2
a
Google Colab has not yet been upgraded to the latest IPython version; if you explicitly upgrade with
!pip install -U ipython
then last_expr_or_assign will work.
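After the upgrade (and a runtime restart, which Colab typically needs before the new IPython takes effect), the notebook's original configuration line should work again. A minimal check:
%config InteractiveShell.ast_node_interactivity='last_expr_or_assign'
a = 9.8  # the assigned value should now be displayed automatically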
Well the easy way is to just wrap your values in print statements like:
print(meter)
print(second)
print(a)
However, if you want to do it the Jupyter way, it looks like the answer is:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Found the above from this link: https://stackoverflow.com/a/36835741/7086982
As mentioned by others, it doesn't work because last_expr_or_assign was introduced in IPython 6.1, and Colab is using v5.x. Upgrading the IPython version on Colab might cause some instability (warning shown by Colab):
WARNING: Upgrading ipython, ipykernel, tornado, prompt-toolkit or pyzmq can
cause your runtime to repeatedly crash or behave in unexpected ways and is not
recommended
One other solution is to use an extension such as ipydex, which provides magic comments (like ##:, ##:T, ##:S) that cause either the return value or the right-hand side of an assignment on a line to be displayed.
!pip install ipydex
%load_ext ipydex.displaytools
e.g.
a = 4
c = 5 ##:
output
c := 5
---

Setting a widget input using a variable in Databricks

Like the question says, I want to know if/how you can set a Databricks widget input using a variable instead of a hard-coded value. I have two notebooks. One needs to apply a filter to some values. The other needs to run some code, then optionally (as dictated by another widget) apply that same filter.
Here's some example code (modified for simplicity/privacy).
In Notebook2 we have:
start = dbutils.widgets.get("startDate")
filter_condition = None
if start:
    filter_condition = f"GeneratedDate >= '{start}'"
foo = important_function(filter_condition)
%run ./Notebook1 $run_training="True" $num_trials=100 $filter_string=filter_condition
where I want filter_condition to be the above-defined variable and not a string.
In Notebook1, there's some code like:
if run_training == "True":
    bar = optimize_model(datasets, grid, int(num_trials))
elif run_training == "False":
    baz = apply_filter(filter_string)
else:
    raise ValueError("run_training must be 'True' or 'False'")  # Throw error
You're probably looking for the notebook.run function of the databricks utilities package, rather than the %run command:
dbutils.notebook.run(path='./Notebook1',
                     timeout_seconds=300,
                     arguments={'run_training': 'True',
                                'num_trials': 100,
                                'filter_string': filter_condition})
The notebook will be run as an "ephemeral" job. Note that the notebook will run in a separate notebook environment, so any variables etc. created will not be brought back into the notebook you ran it from. Your input arguments come through as widget variables, which can be accessed using:
num_trials = dbutils.widgets.get('num_trials')
etc. I think you are already doing that though.
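If you do need something back from Notebook1, the usual pattern is to have it call dbutils.notebook.exit with a string (for example JSON), which dbutils.notebook.run then returns to the caller. A rough sketch; the returned payload here is purely illustrative:
# At the end of Notebook1:
import json
dbutils.notebook.exit(json.dumps({"rows_filtered": 42}))

# In the calling notebook:
result = dbutils.notebook.run('./Notebook1', 300, {'run_training': 'False', 'filter_string': filter_condition})
print(json.loads(result))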

How to programmatically create several new cells in a Jupyter notebook

I want to programmatically create several cells in a Jupyter notebook.
With this function I can create one cell
def create_new_cell(contents):
    from IPython.core.getipython import get_ipython
    shell = get_ipython()
    shell.set_next_input(contents, replace=False)
But if I try to call it several times, for instance, from a for loop, like so
for x in ['a', 'b', 'c']:
    create_new_cell(x)
It will only create one cell with the last item in the list. I've tried to find if there's a "flush" function or something similar but did not succeed.
Does anyone know how to properly write several cells programmatically?
I dug a bit more into the code of shell.payload_manager and found out that the current implementation of set_next_input does not pass the single argument to shell.payload_manager.write_payload, so it defaults to single=True. That prevents the notebook from creating several cells, since the payloads all have the same source ('set_next_input', in this case) and each one replaces the previous.
That being said, the following function works. It's basically the code from set_next_input, but calling write_payload with the single parameter set to False.
def create_new_cell(contents):
    from IPython.core.getipython import get_ipython
    shell = get_ipython()
    payload = dict(
        source='set_next_input',
        text=contents,
        replace=False,
    )
    shell.payload_manager.write_payload(payload, single=False)
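With this version, the loop from the question queues a separate payload per call, so each item gets its own cell:
for x in ['a', 'b', 'c']:
    create_new_cell(x)  # creates three cells, one per item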
Hope this helps someone out there ;)
from IPython.display import display, Javascript

def add_cell(text, type='code', direct='above'):
    text = text.replace('\n', '\\n').replace("\"", "\\\"").replace("'", "\\'")
    display(Javascript('''
        var cell = IPython.notebook.insert_cell_{}("{}")
        cell.set_text("{}")
    '''.format(direct, type, text)));
for i in range(3):
    add_cell(f'# heading{i}', 'markdown')
    add_cell(f'code {i}')
The code above will add three markdown heading cells and three code cells to the notebook.

How do you escape a dash in Jython/Websphere?

I have a Jython script that is used to set up a JDBC datasource on a WebSphere 7.0 server. I need to set several properties on that datasource. I am using this code, which works, unless the value is '-'.
def setCustomProperty(datasource, name, value):
    parms = ['-propertyName', name, '-propertyValue', value]
    AdminTask.setResourceProperty(datasource, parms)
I need to set the dateSeparator property on my datasource to just that - a dash. When I run this script with setCustomProperty(ds, 'dateSeparator', '-') I get an exception that says, "Invalid property: ". I figured out that it thinks that the dash means that another parameter/argument pair is expected.
Is there any way to get AdminTask to accept a dash?
NOTE: I can't set it via AdminConfig because I cannot find a way to get the id of the right property (I have multiple datasources).
Here is a solution that uses AdminConfig so that you can set the property value to the dash -. The solution accounts for multiple data sources, finding the correct one by specifying the appropriate scope (i.e. the server, but this could be modified if your datasource exists within a different scope) and then finding the datasource by name. The solution also accounts for modifying the existing "dateSeparator" property if it exists, or it creates it if it doesn't.
The code doesn't look terribly elegant, but I think it should solve your problem:
def setDataSourceProperty(cell, node, server, ds, propName, propVal):
    scopes = AdminConfig.getid("/Cell:%s/Node:%s/Server:%s/" % (cell, node, server)).splitlines()
    datasources = AdminConfig.list("DataSource", scopes[0]).splitlines()
    for datasource in datasources:
        if AdminConfig.showAttribute(datasource, "name") == ds:
            propertySet = AdminConfig.list("J2EEResourcePropertySet", datasource).splitlines()
            customProp = [["name", propName], ["value", propVal]]
            for property in AdminConfig.list("J2EEResourceProperty", propertySet[0]).splitlines():
                if AdminConfig.showAttribute(property, "name") == propName:
                    AdminConfig.modify(property, customProp)
                    return
            AdminConfig.create("J2EEResourceProperty", propertySet[0], customProp)

if (__name__ == "__main__"):
    setDataSourceProperty("myCell01", "myNode01", "myServer", "myDataSource", "dateSeparator", "-")
    AdminConfig.save()
Please see the Management Console preferences settings. You can perform what you are attempting now in the console and see the Jython equivalent that the Management Console creates for its own use; then just copy it.
@Schemetrical's solution worked for me. Just giving another example with JVM args.
Not commenting on the actual answer because I don't have enough reputation.
server_name = 'server1'
AdminTask.setGenericJVMArguments('[ -serverName %s -genericJvmArguments "-agentlib:getClasses" ]' % (server_name))
Try passing the parameters as a single String instead of an array, using double quotes to surround the values that start with a dash sign.
Example:
AdminTask.setVariable('-variableName JDK_PARAMS -variableValue "-Xlp -Xscm250M" -variableDescription "-Yes -I -can -now -use -dashes -everywhere :-)" -scope Cell=MyCell')
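Applied to the original question, a minimal sketch of the same idea might look like the code below. Whether AdminTask.setResourceProperty accepts the single-string form and preserves the quoted dash depends on the command's parser, so treat this as an assumption to verify in your environment:
# Sketch only: single-string parameter form with the dash value double-quoted.
def setCustomProperty(datasource, name, value):
    AdminTask.setResourceProperty(datasource, '-propertyName %s -propertyValue "%s"' % (name, value))

setCustomProperty(ds, 'dateSeparator', '-')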
