How can I call functions defined in a different Synapse notebook after running the notebook with mssparkutils.notebook.run()?
example:
#parameters
value = "test"
from notebookutils import mssparkutils
mssparkutils.notebook.run("function definitions", 60, {"param": value})
df = load_cosmos_data() #defined in 'function definitions' notebook
This fails with: NameError: name 'load_cosmos_data' is not defined
I can use the functions with the %run command, but I need to be able to pass the parameter through to the function definitions notebook. %run doesn't allow me to pass a variable as a parameter.
According to the official Microsoft documentation, when you reference
another notebook, the source notebook resumes once the referenced
notebook finishes, whether or not it calls exit(). The two are treated
as different notebooks with no relationship between them, so the source
notebook can't access any variable from the referenced notebook, and
the same applies to its functions.
The same holds in programming languages generally: a function's local variables can't be accessed after it returns unless the function returns them.
Unfortunately, the exit() method doesn’t support returning values other than strings from the referenced notebook.
Assuming you need the dataframe returned by load_cosmos_data() in the referenced notebook, you can get it through a temporary view.
Please follow the demonstration below:
In the referenced notebook, call the function, store the returned dataframe in a variable, and create a temporary view from it. In the source notebook, you can then read that temporary view back into a dataframe.
Function Notebook:
Code:
from notebookutils import mssparkutils
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

def load_data():
    # Build a small example dataframe
    data2 = [(24, "Rakesh", "Govindula"),
             (16, "Virat", "Kohli")]
    schema = StructType([
        StructField("id", IntegerType(), True),
        StructField("firstname", StringType(), True),
        StructField("lastname", StringType(), True)
    ])
    df = spark.createDataFrame(data=data2, schema=schema)
    return df

df2 = load_data()
df2.show()
# Register the dataframe as a temporary view and return the view's name
df2.createOrReplaceTempView("dataframeview")
mssparkutils.notebook.exit("dataframeview")
Source Notebook:
Code:
value="test"
from notebookutils import mssparkutils
view_name=mssparkutils.notebook.run("/function_notebook", 60, {"param": value})
df=spark.sql("select * from {0}".format(view_name))
df.show()
With this approach you can pass the parameter through to the function notebook and also access the dataframe returned by the function.
Please go through this SO Thread if you face any issues when returning values from synapse notebook.
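Because exit() only hands back a string, a small structured result (for example the view name plus a row count) could be serialized to JSON in the referenced notebook and parsed again in the source notebook. The sketch below is plain Python: the helper names pack_exit_value and unpack_exit_value and the payload keys are hypothetical, and the mssparkutils calls appear only in comments as assumed usage.

```python
import json


def pack_exit_value(view_name, row_count):
    """Serialize a small result to a JSON string (hypothetical helper)."""
    return json.dumps({"view_name": view_name, "row_count": row_count})


def unpack_exit_value(exit_value):
    """Parse the string returned by the referenced notebook back into a dict."""
    return json.loads(exit_value)


# Referenced notebook (assumed usage):
#   mssparkutils.notebook.exit(pack_exit_value("dataframeview", df2.count()))
# Source notebook (assumed usage):
#   result = unpack_exit_value(
#       mssparkutils.notebook.run("/function_notebook", 60, {"param": value}))
payload = pack_exit_value("dataframeview", 2)
result = unpack_exit_value(payload)
print(result["view_name"], result["row_count"])
```

The dataframe itself still travels through the temporary view; only its name and any small metadata go through the exit string.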
When running a Databricks notebook as a job, you can specify job or run parameters that can be used within the code of the notebook. However, it wasn't clear from documentation how you actually fetch them. I'd like to be able to get all the parameters as well as job id and run id.
Job/run parameters
When the notebook is run as a job, then any job parameters can be fetched as a dictionary using the dbutils package that Databricks automatically provides and imports. Here's the code:
run_parameters = dbutils.notebook.entry_point.getCurrentBindings()
If the job parameters were {"foo": "bar"}, then the result of the code above gives you the dict {'foo': 'bar'}. Note that Databricks only allows job parameter mappings of str to str, so keys and values will always be strings.
Note that if the notebook is run interactively (not as a job), then the dict will be empty. The getCurrentBindings() method also appears to work for getting any active widget values for the notebook (when run interactively).
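Since every value arrives as a string, numeric or boolean parameters need an explicit conversion. The following is a minimal sketch in plain Python, with a hard-coded dict standing in for the bindings; the helper names coerce_parameters and parse_bool and the keys max_rows and dry_run are hypothetical.

```python
def parse_bool(text):
    """Interpret common string spellings of a boolean."""
    normalized = text.strip().lower()
    if normalized in ("true", "1", "yes"):
        return True
    if normalized in ("false", "0", "no"):
        return False
    raise ValueError(f"Not a boolean string: {text!r}")


def coerce_parameters(raw, converters):
    """Apply a converter per key; keys without a converter stay strings."""
    return {key: converters.get(key, str)(value) for key, value in raw.items()}


# Stand-in for the str-to-str mapping a job would supply (made-up values).
raw_parameters = {"foo": "bar", "max_rows": "100", "dry_run": "false"}
typed = coerce_parameters(raw_parameters, {"max_rows": int, "dry_run": parse_bool})
print(typed)
```

A dedicated parse_bool is used because bool("false") is True in Python: any non-empty string is truthy.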
Getting the jobId and runId
To get the jobId and runId you can get a context json from dbutils that contains that information. (Adapted from databricks forum):
import json
context_str = dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson()
context = json.loads(context_str)
run_id_obj = context.get('currentRunId', {})
run_id = run_id_obj.get('id', None) if run_id_obj else None
job_id = context.get('tags', {}).get('jobId', None)
So within the context object, the path of keys for runId is currentRunId > id and the path of keys to jobId is tags > jobId.
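The key paths above can be wrapped in a small helper and tried outside Databricks. This is only a sketch: extract_job_and_run_id is a hypothetical name, and the two JSON payloads are made up to mimic the key paths described here, not captured from a real context.

```python
import json


def extract_job_and_run_id(context_str):
    """Return (job_id, run_id) from a context JSON string, or None where absent."""
    context = json.loads(context_str)
    run_id_obj = context.get("currentRunId") or {}
    run_id = run_id_obj.get("id")
    job_id = (context.get("tags") or {}).get("jobId")
    return job_id, run_id


# Made-up payloads that only imitate the structure described above.
job_context = json.dumps({"currentRunId": {"id": 7492}, "tags": {"jobId": "1234"}})
interactive_context = json.dumps({"currentRunId": None, "tags": {}})

print(extract_job_and_run_id(job_context))
print(extract_job_and_run_id(interactive_context))
```

Inside a notebook, the string passed in would be the context_str obtained from dbutils as shown earlier.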
Nowadays you can easily get the parameters from a job through the widget API. This is pretty well described in the official documentation from Databricks. Below, I'll walk through the steps you have to take; it is fairly easy.
Create or use an existing notebook that has to accept some parameters. We want to know the job_id and run_id, and let's also add two user-defined parameters environment and animal.
# Get parameters from job
job_id = dbutils.widgets.get("job_id")
run_id = dbutils.widgets.get("run_id")
environment = dbutils.widgets.get("environment")
animal = dbutils.widgets.get("animal")
print(job_id)
print(run_id)
print(environment)
print(animal)
Now let's go to Workflows > Jobs to create a parameterised job. Make sure you select the correct notebook and specify the parameters for the job at the bottom. According to the documentation, the values for job_id and run_id must be written in double curly brackets ({{job_id}} and {{run_id}}). For the other parameters, we can pick a value ourselves.
Note: You cannot get the job_id and run_id directly from the notebook for security reasons (as you can see from the stack trace when you try to access the attributes of the context). A notebook runs in its own context, and those parameters live in a "higher" context.
Run the job and observe that it outputs something like:
dev
squirrel
137355915119346
7492
Command took 0.09 seconds
You can even set default parameters in the notebook itself. They are used if you run the notebook interactively or if the notebook is triggered from a job without parameters. This makes testing easier and lets you default certain values.
# Adding widgets to a notebook
dbutils.widgets.text("environment", "tst")
dbutils.widgets.text("animal", "turtle")
# Removing widgets from a notebook
dbutils.widgets.remove("environment")
dbutils.widgets.remove("animal")
# Or removing all widgets from a notebook
dbutils.widgets.removeAll()
Last but not least, I tested this on different cluster types and so far have found no limitations. My current settings are:
spark.databricks.cluster.profile serverless
spark.databricks.passthrough.enabled true
spark.databricks.pyspark.enableProcessIsolation true
spark.databricks.repl.allowedLanguages python,sql
Like the question says, I want to know if/how you can set a Databricks widget input using a variable instead of a hard-coded value. I have two notebooks. One needs to apply a filter to some values. The other needs to run some code, then optionally (as dictated by another widget) apply that same filter.
Here's some example code (modified for simplicity/privacy).
In Notebook2 we have:
start = dbutils.widgets.get("startDate")
filter_condition = None
if start:
    filter_condition = f"GeneratedDate >= '{start}'"
foo = important_function(filter_condition)
%run ./Notebook1 $run_training="True" $num_trials=100 $filter_string=filter_condition
where I want filter_condition to be the above-defined variable and not a string.
In Notebook1, there's some code like:
if run_training == "True":
    bar = optimize_model(datasets, grid, int(num_trials))
elif run_training == "False":
    baz = apply_filter(filter_string)
else:
    raise ValueError(f"Unexpected run_training value: {run_training}")
You're probably looking for the notebook.run function of the databricks utilities package, rather than the %run command:
dbutils.notebook.run(path='./Notebook1',
                     timeout_seconds=300,
                     arguments={'run_training': 'True',
                                'num_trials': '100',  # argument values are passed as strings
                                'filter_string': filter_condition})
The notebook will be run as an "ephemeral" job. Note that it runs in a separate notebook environment, so any variables it creates will not be brought back into the notebook you ran it from. Your input arguments come through as widget values, which can be accessed using:
num_trials = dbutils.widgets.get('num_trials')
etc. I think you are already doing that though.
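Databricks documents notebook arguments as string values, and filter_condition in Notebook2 is None when no start date is set, so it may help to normalise the arguments before the call. The sketch below assumes a hypothetical helper to_notebook_arguments that converts every value to a string and skips None values; the filter value is an invented example, and the dbutils call appears only as a comment.

```python
def to_notebook_arguments(values):
    """Stringify argument values and skip those that are None (hypothetical helper)."""
    return {key: str(value) for key, value in values.items() if value is not None}


filter_condition = "GeneratedDate >= '2021-01-01'"  # example value, not from the question
arguments = to_notebook_arguments(
    {"run_training": True, "num_trials": 100, "filter_string": filter_condition}
)
print(arguments)

# Assumed usage inside Databricks:
#   dbutils.notebook.run("./Notebook1", 300, arguments)
```

If a key is skipped because its value is None, the called notebook would need a widget default for it, for example one created with dbutils.widgets.text.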
How do I use a variable defined in the EMR cluster's Python instance when I run code on the managed Jupyter notebook instance using %%local?
Specifically, I want to use matplotlib as shown in this question and display a plot from a dataframe generated using spark.sql(). Using %%sql lets me easily use data results in %%local, but I would still need to pass parameters to %%sql from the EMR Python instance.
Example:
ln[1]: parameter = 'Hello parameter'
ln[2]: %%local
print(parameter)
I keep getting an error that my variable is not defined.
I found two workarounds:
1. Use %%spark -o df to return SQL query results to a dataframe that can be used with %%local, like in this answer.
2. Do all query building, execution and any data processing like normal without using any %% magic commands, then write the final data to a temporary table in my database using df.createOrReplaceTempView("temp_table_name"). Then use a simple query to retrieve the final data with %%sql -q -o df and SELECT * FROM temp_table_name.
I want to programmatically create several cells in a Jupyter notebook.
With this function I can create one cell
def create_new_cell(contents):
    from IPython.core.getipython import get_ipython
    shell = get_ipython()
    shell.set_next_input(contents, replace=False)
But if I try to call it several times, for instance, from a for loop, like so
for x in ['a', 'b', 'c']:
    create_new_cell(x)
It will only create one cell with the last item in the list. I've tried to find if there's a "flush" function or something similar but did not succeed.
Does anyone know how to properly write several cells programmatically?
I dug a bit more in the code of the shell.payload_manager and found out that in the current implementation of the set_next_input it does not pass the single argument to the shell.payload_manager.write_payload function. That prevents the notebook from creating several cells, since they all have the same source (the set_next_input function, in this case).
That being said, the following function works. It's basically the code from write_payload function setting the single parameter to False.
def create_new_cell(contents):
    from IPython.core.getipython import get_ipython
    shell = get_ipython()
    payload = dict(
        source='set_next_input',
        text=contents,
        replace=False,
    )
    shell.payload_manager.write_payload(payload, single=False)
Hope this helps someone out there ;)
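To see why only the last cell survived, here is a toy model of the behaviour described above. ToyPayloadManager is a simplified stand-in written for illustration, not IPython's actual class; it assumes that with single=True a new payload replaces an earlier one from the same source.

```python
class ToyPayloadManager:
    """Simplified stand-in for a payload queue; not IPython's actual class."""

    def __init__(self):
        self.payloads = []

    def write_payload(self, data, single=True):
        # With single=True, a payload replaces an earlier one from the same source.
        if single and "source" in data:
            for index, existing in enumerate(self.payloads):
                if existing.get("source") == data["source"]:
                    self.payloads[index] = data
                    return
        self.payloads.append(data)


def queued_texts(single):
    """Queue three payloads from the same source and return the surviving texts."""
    manager = ToyPayloadManager()
    for text in ["a", "b", "c"]:
        payload = {"source": "set_next_input", "text": text, "replace": False}
        manager.write_payload(payload, single=single)
    return [p["text"] for p in manager.payloads]


print(queued_texts(single=True))
print(queued_texts(single=False))
```

In this model the first call keeps only the payload for 'c', while the second keeps all three, which matches the one-cell versus several-cells behaviour reported above.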
from IPython.display import display, Javascript

def add_cell(text, type='code', direct='above'):
    text = text.replace('\n', '\\n').replace("\"", "\\\"").replace("'", "\\'")
    display(Javascript('''
    var cell = IPython.notebook.insert_cell_{}("{}")
    cell.set_text("{}")
    '''.format(direct, type, text)))

for i in range(3):
    add_cell(f'# heading{i}', 'markdown')
    add_cell(f'code {i}')
The code above adds six cells: three Markdown heading cells and three code cells.
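The manual replace chain in add_cell does not cover every character (a backslash in the text, for instance). One alternative, offered here as an assumption rather than as the answer's method, is to let json.dumps produce the JavaScript string literal. build_insert_cell_js is a hypothetical helper that only builds the script text; passing the result to display(Javascript(...)) would work as in add_cell.

```python
import json


def build_insert_cell_js(text, cell_type="code", direction="above"):
    """Build the JavaScript snippet as a string; json.dumps handles the quoting."""
    if direction not in ("above", "below"):
        raise ValueError("direction must be 'above' or 'below'")
    return (
        f"var cell = IPython.notebook.insert_cell_{direction}({json.dumps(cell_type)});\n"
        f"cell.set_text({json.dumps(text)});"
    )


# Cell text containing double quotes, a single quote, a newline and a backslash.
cell_text = 'print("it\'s ok")\nx = "a\\b"'
script = build_insert_cell_js(cell_text, "code")
print(script)
```

The newline inside the cell text ends up as an escaped sequence within the literal, so the generated script stays two statements long.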
Q: How to change SparkContext property spark.sql.pivotMaxValues in jupyter PySpark session
I made the following code change to increase spark.sql.pivotMaxValues. Sadly, it had no effect: the same error appeared after restarting Jupyter and running the code again.
from pyspark import SparkConf, SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix
import numpy as np
try:
    #conf = SparkConf().setMaster('local').setAppName('autoencoder_recommender_wide_user_record_maker') # original
    #conf = SparkConf().setMaster('local').setAppName('autoencoder_recommender_wide_user_record_maker').set("spark.sql.pivotMaxValues", "99999")
    conf = SparkConf().setMaster('local').setAppName('autoencoder_recommender_wide_user_record_maker').set("spark.sql.pivotMaxValues", 99999)
    sc = SparkContext(conf=conf)
except:
    print("Variables sc and conf are now defined. Everything is OK and ready to run.")
<... (other code) ...>
df = sess.read.csv(in_filename, header=False, mode="DROPMALFORMED", schema=csv_schema)
ct = df.crosstab('username', 'itemname')
Spark error message that was thrown on my crosstab line of code:
IllegalArgumentException: "requirement failed: The number of distinct values for itemname, can't exceed 1e4. Currently 16467"
I expect I'm not actually setting the config variable that I was trying to set, so what is a way to get that value actually set, programmatically if possible? Thanks.
References:
Finally, you may be interested to know that there is a maximum number
of values for the pivot column if none are specified. This is mainly
to catch mistakes and avoid OOM situations. The config key is
spark.sql.pivotMaxValues and its default is 10,000.
Source: https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
I would prefer to change the config variable upwards, since I have written the crosstab code already which works great on smaller datasets. If it turns out there truly is no way to change this config variable then my backup plans are, in order:
1. relational right outer join to implement my own Spark crosstab with higher capacity than was provided by databricks
2. scipy dense vectors with handmade unique combinations calculation code using dictionaries
kernel.json
This configuration file should be distributed together with jupyter
~/.ipython/kernels/pyspark/kernel.json
It contains the Spark configuration, including the variable PYSPARK_SUBMIT_ARGS, the list of arguments that will be passed to the spark-submit script.
You can try adding --conf spark.sql.pivotMaxValues=99999 to this variable in that file.
PS
There are also cases where people are trying to override this variable programmatically. You can give it a try too...
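A programmatic override could look like the sketch below, which sets PYSPARK_SUBMIT_ARGS from Python before the first SparkContext is created in the kernel. The helper name add_spark_conf_to_submit_args is hypothetical; the sketch assumes the usual convention that the variable ends with pyspark-shell when PySpark is launched from a plain Python process.

```python
import os


def add_spark_conf_to_submit_args(key, value, env=os.environ):
    """Prepend a --conf option to PYSPARK_SUBMIT_ARGS, keeping 'pyspark-shell' last."""
    current = env.get("PYSPARK_SUBMIT_ARGS", "pyspark-shell")
    env["PYSPARK_SUBMIT_ARGS"] = f"--conf {key}={value} {current}"
    return env["PYSPARK_SUBMIT_ARGS"]


# Must run before the first SparkContext is created in this kernel.
print(add_spark_conf_to_submit_args("spark.sql.pivotMaxValues", 99999))
```

Whether this lifts the crosstab limit is worth verifying separately: the "1e4" in the error message may come from a limit inside crosstab itself rather than from this setting.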