How to inject the load version information into a Kedro node? (Python)

I need to run a Kedro (v0.17.4) pipeline with a node that is supposed to process data with different logic depending on the load version of the input.
As a simple and crude example, assume there is a catalog.yml file with this entry:
test_data_set:
  type: pandas.CSVDataSet
  filepath: data/01_raw/test.csv
  versioned: true
and that there are multiple versions of test.csv (say '1' and '2'). I want to use the catalog from the config file and run the following node/pipeline:
from kedro.config import ConfigLoader
from kedro.io import DataCatalog
from kedro.pipeline import Pipeline, node
from kedro.runner import SequentialRunner

conf_loader = ConfigLoader(['conf/base'])
conf_catalog = conf_loader.get('catalog*', 'catalog/**')
io = DataCatalog.from_config(conf_catalog)

def my_node(my_data_set):
    # if version_of_my_data_set == '1':  # how to do this?
    #     print("do something with version 1")
    # ... do something else
    return

my_pipeline = Pipeline([node(func=my_node, inputs="test_data_set", outputs=None, name="process_versioned_data")])
SequentialRunner().run(my_pipeline, catalog=io)
I understand that runtime parameters or the load version are supposed to be separated from the logic in a node by design, but in my specific case it would still be useful to find a way to do this.
In general the pipeline will be executed via the API, but also via the command line with the --load-version flag.
Solutions that I have considered but discarded:
store the load version somehow in the Kedro session and access it within the node via "get_current_session" (how? a rough hook-based sketch of this idea is shown after this list)
add load_version as a required input parameter for the node (would probably break compatibility with some upstream pipeline)
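To illustrate the first idea, here is a minimal, unverified sketch using a project hook; whether run_params actually exposes the user-specified load versions is an assumption to check against the Kedro 0.17 hook specification:
# hooks.py -- a sketch only, not a confirmed Kedro pattern
from kedro.framework.hooks import hook_impl

class LoadVersionHook:
    load_versions = {}  # hypothetical module-level store that a node could import and read

    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        # Assumption: run_params carries the load versions requested by the user,
        # e.g. {"test_data_set": "1"} when --load-version is passed.
        LoadVersionHook.load_versions = run_params.get("load_versions") or {}
The node body could then read LoadVersionHook.load_versions.get("test_data_set") instead of receiving the version as an input.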
In short:
Is there a good way to pass the information about the user-specified load version of a dataset to a Kedro node?

Related

Converting Intersystems cache objectscript into a python function

I am accessing an InterSystems Caché 2017.1.xx instance through a Python process to get various attributes about the database in order to monitor it.
One of the items I want to monitor is license usage. I wrote an ObjectScript script in a Terminal window to access license usage by user:
s Rset=##class(%ResultSet).%New("%SYSTEM.License.UserListAll")
s r=Rset.Execute()
s ncol=Rset.GetColumnCount()
While (Rset.Next()) {f i=1:1:ncol w !,Rset.GetData(i)}
But I have been unable to determine how to convert this script into a Python equivalent. I am using the intersys.pythonbind3 module for connecting to and accessing the Caché instance. I have been able to create Python functions that access most everything else in the instance, but this one piece of data I cannot figure out how to translate to Python (3.7).
The following should work (based on the documentation):
query = intersys.pythonbind.query(database)
query.prepare_class("%SYSTEM.License", "UserListAll")
query.execute()
# Fetch each row in the result set, and print the
# name and value of each column in a row:
while 1:
    cols = query.fetch([None])
    if len(cols) == 0:
        break
    print(str(cols[0]))
Also, notice that InterSystems IRIS, the successor to Caché, now has Python as an embedded language. See more in the docs.
It turns out the noted query "UserListAll" is not defined correctly in the library (it is not an SqlProc), so resolving this would require an ObjectScript wrapper around the query and the use of a %ResultSet (or similar) from Python to get the results. So I am marking this as resolved.
Not sure which Python interface you're using for Caché/IRIS, but this open-source third-party one is worth investigating for the kind of things you're trying to do:
https://github.com/chrisemunt/mg_python

How do you get the run parameters and runId within Databricks notebook?

When running a Databricks notebook as a job, you can specify job or run parameters that can be used within the code of the notebook. However, it wasn't clear from the documentation how you actually fetch them. I'd like to be able to get all the parameters as well as the job id and run id.
Job/run parameters
When the notebook is run as a job, then any job parameters can be fetched as a dictionary using the dbutils package that Databricks automatically provides and imports. Here's the code:
run_parameters = dbutils.notebook.entry_point.getCurrentBindings()
If the job parameters were {"foo": "bar"}, then the result of the code above gives you the dict {'foo': 'bar'}. Note that Databricks only allows job parameter mappings of str to str, so keys and values will always be strings.
Note that if the notebook is run interactively (not as a job), then the dict will be empty. The getCurrentBindings() method also appears to work for getting any active widget values for the notebook (when run interactively).
Getting the jobId and runId
To get the jobId and runId you can get a context JSON from dbutils that contains that information (adapted from the Databricks forum):
import json
context_str = dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson()
context = json.loads(context_str)
run_id_obj = context.get('currentRunId', {})
run_id = run_id_obj.get('id', None) if run_id_obj else None
job_id = context.get('tags', {}).get('jobId', None)
So within the context object, the path of keys for runId is currentRunId > id and the path of keys to jobId is tags > jobId.
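For reference, the parsed context therefore has roughly this shape (an abridged, hypothetical example with placeholder values; the actual tags vary by workspace and runtime):
# context = {
#     "currentRunId": {"id": 12345},
#     "tags": {"jobId": "67890", ...},
#     ...
# }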
Nowadays you can easily get the parameters from a job through the widget API. This is described pretty well in the official documentation from Databricks. Below, I'll elaborate on the steps you have to take to get there; it is fairly easy.
Create or use an existing notebook that has to accept some parameters. We want to know the job_id and run_id, and let's also add two user-defined parameters environment and animal.
# Get parameters from job
job_id = dbutils.widgets.get("job_id")
run_id = dbutils.widgets.get("run_id")
environment = dbutils.widgets.get("environment")
animal = dbutils.widgets.get("animal")
print(job_id)
print(run_id)
print(environment)
print(animal)
Now let's go to Workflows > Jobs to create a parameterised job. Make sure you select the correct notebook and specify the parameters for the job at the bottom. According to the documentation, we need to use curly brackets for the parameter values of job_id and run_id. For the other parameters, we can pick a value ourselves.
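For reference, the parameter mapping for such a job could look roughly like this (written here as a Python dict; {{job_id}} and {{run_id}} were the documented variable references at the time of writing, so verify against the current docs):
# Hypothetical "Parameters" section of the job definition
job_parameters = {
    "job_id": "{{job_id}}",
    "run_id": "{{run_id}}",
    "environment": "dev",
    "animal": "squirrel",
}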
Note: the reason why you are not allowed to get the job_id and run_id directly from the notebook is security (as you can see from the stack trace when you try to access the attributes of the context). Within a notebook you are in a different context; those parameters live at a "higher" context.
Run the job and observe that it outputs something like:
137355915119346
7492
dev
squirrel
Command took 0.09 seconds
You can even set default parameters in the notebook itself, that will be used if you run the notebook or if the notebook is triggered from a job without parameters. This makes testing easier, and allows you to default certain values.
# Adding widgets to a notebook
dbutils.widgets.text("environment", "tst")
dbutils.widgets.text("animal", "turtle")
# Removing widgets from a notebook
dbutils.widgets.remove("environment")
dbutils.widgets.remove("animal")
# Or removing all widgets from a notebook
dbutils.widgets.removeAll()
And last but not least, I tested this on different cluster types; so far I have found no limitations. My current settings are:
spark.databricks.cluster.profile serverless
spark.databricks.passthrough.enabled true
spark.databricks.pyspark.enableProcessIsolation true
spark.databricks.repl.allowedLanguages python,sql

Change Logdir of Ray RLlib Training instead of ~/ray_results

I'm using Ray & RLlib to train RL agents on an Ubuntu system. Tensorboard is used to monitor the training progress by pointing it to ~/ray_results where all the log files for all runs are stored. Ray Tune is not being used.
For example, on starting a new Ray/RLlib training run, a new directory will be created at
~/ray_results/DQN_ray_custom_env_2020-06-07_05-26-32djwxfdu1
To visualize the training progress, we need to start Tensorboard using
tensorboard --logdir=~/ray_results
Question: Is it possible to configure Ray/RLlib to change the output directory of the log files from ~/ray_results to another location?
Additionally, instead of logging to a directory named something like DQN_ray_custom_env_2020-06-07_05-26-32djwxfdu1, can this directory name be set by ourselves?
Failed Attempt: Tried setting
os.environ['TUNE_RESULT_DIR'] = '~/another_dir'
before running ray.init(), but the result log files were still being written to ~/ray_results.
Without using Tune, you can change the logdir using RLlib's Trainer. The Trainer class takes an optional logger_creator if you want to specify where to save the log (see here).
A concrete example:
Define your customized logger creator (you can simply modify from the default one):
import os
import tempfile
from datetime import datetime

from ray.tune.logger import UnifiedLogger

def custom_log_creator(custom_path, custom_str):
    timestr = datetime.today().strftime("%Y-%m-%d_%H-%M-%S")
    logdir_prefix = "{}_{}".format(custom_str, timestr)

    def logger_creator(config):
        if not os.path.exists(custom_path):
            os.makedirs(custom_path)
        logdir = tempfile.mkdtemp(prefix=logdir_prefix, dir=custom_path)
        return UnifiedLogger(config, logdir, loggers=None)

    return logger_creator
Pass this logger_creator to the trainer, and start training:
from ray.rllib.agents.ppo import PPOTrainer

trainer = PPOTrainer(config=config, env='CartPole-v0',
                     logger_creator=custom_log_creator(os.path.expanduser("~/another_ray_results/subdir"), 'custom_dir'))
for i in range(ITER_NUM):
    result = trainer.train()
You will find the training results (i.e., TensorBoard events file, params, model, ...) saved under "~/another_ray_results/subdir" with your specified naming convention.
Is it possible to configure Ray/RLlib to change the output directory of the log files from ~/ray_results to another location?
There is currently no way to configure this using the RLlib CLI tool (rllib).
If you're okay with Python API, then, as described in documentation, local_dir parameter of tune.run is responsible for specifying output directory, default is ~/ray_results.
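For example, a minimal sketch using the Python API (the trainable name and config here are placeholders):
from ray import tune

tune.run(
    "DQN",
    config={"env": "CartPole-v0"},
    local_dir="~/another_dir",  # results land here instead of ~/ray_results
)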
Additionally, instead of logging to a directory named something like DQN_ray_custom_env_2020-06-07_05-26-32djwxfdu1, can this directory name be set by ourselves?
This is governed by trial_name_creator parameter of tune.run. It must be a function that accepts trial object and formats it into a string like so:
def trial_name_id(trial):
    return f"{trial.trainable_name}_{trial.trial_id}"

tune.run(..., trial_name_creator=trial_name_id)
Just for anyone who bumps into this problem with Ray Tune.
You can specify local_dir for run_config within tune.Tuner:
# This logs to 2 different trial folders:
# ./results/test_experiment/trial_name_1 and ./results/test_experiment/trial_name_2
# Only trial_name is autogenerated.
tuner = tune.Tuner(
    trainable,
    tune_config=tune.TuneConfig(num_samples=2),
    run_config=air.RunConfig(local_dir="./results", name="test_experiment"),
)
results = tuner.fit()
Please see this link for more info.

CANoe: How to select and start test cases from XML Test Module from Python using CANoe COM interface?

currently I am able to:
start CANoe application
load a CANoe configuration file
load a test setup file
def load_test_setup(self, canoe_test_setup_file: str = None) -> None:
    logger.info(
        f'Loading CANoe test setup file <{canoe_test_setup_file}>.')
    if self.measurement.Running:
        logger.info(
            'Simulation is currently running, so a new test setup could '
            'not be loaded!')
        return
    self.test_setup.TestEnvironments.Add(canoe_test_setup_file)
    test_environment = self.test_setup.TestEnvironments.Item(1)
    logger.info(f'Loaded test environment is <{test_environment.Name}>.')
How can I access the XML Test Module loaded with the test setup (tse) file and select tests to be executed?
The second-to-last line in your snippet is most probably causing the issue.
I have been trying to fix this issue for quite some time now and finally found the solution.
Somehow, when you execute the line self.test_setup.TestEnvironments.Item(1), win32com creates an object of type TestSetupItem, which doesn't have the necessary properties or methods to access the test cases. Instead we want to access objects of the collection types TestSetupFolders or TestModules. win32com creates an object of TestSetupItem type even though I have a single XML Test Module (called AutomationTestSeq) in the Test Environment, as you can see here.
There are three possible solutions that I found.
Manually clearing the generated cache before each run.
Using win32com.client.DispatchWithEvents or win32com.client.gencache.EnsureDispatch generates a bunch of python files that describe CANoe's object model.
If you had used either of those before, TestEnvironments.Item(1) will always return TestSetupItem instead of the more appropriate type objects.
To remove the cache you need to delete the C:\Users\{username}\AppData\Local\Temp\gen_py\{python version} folder.
Doing this every time is of course not very practical.
Force win32com to always use dynamic dispatch.
You can do this by using:
canoe = win32com.client.dynamic.Dispatch("CANoe.Application")
Any objects you create using canoe from now on, will be dynamically dispatched.
Forcing dynamic dispatch is easier than manually clearing the cache folder every time, and it has always given me good results. But doing this will not give you any insight into the objects: you won't be able to see the acceptable properties and methods for the objects.
Typecast TestSetupItem to TestSetupFolders or TestModules.
This has the risk that if you typecast incorrectly, you will get unexpected results, but it has worked well for me so far.
In short: win32.CastTo(test_env, "ITestEnvironment2"). This will ensure that you are using the recommended object hierarchy as per CANoe technical reference.
Note that you will also have to typecast TestSequenceItem to TestCase to be able to access test case verdict and enable/disable test cases.
Below is a decent example script.
"""Execute XML Test Cases without a pass verdict"""
from time import sleep
import win32com.client as win32

class CanoeEvents:
    """Empty event sink; DispatchWithEvents requires an event class."""
    pass

CANoe = win32.DispatchWithEvents("CANoe.Application", CanoeEvents)
CANoe.Open("canoe.cfg")
test_env = CANoe.Configuration.TestSetup.TestEnvironments.Item('Test Environment')
# Cast required since test_env is originally of type <ITestEnvironment>
test_env = win32.CastTo(test_env, "ITestEnvironment2")
# Get the XML TestModule (type <TSTestModule>) in the test setup
test_module = test_env.TestModules.Item('AutomationTestSeq')
# {.Sequence} property returns a collection of <TestCases> or <TestGroup>
# or <TestSequenceItem> which is more generic
seq = test_module.Sequence
for i in range(1, seq.Count + 1):
    # Cast from <ITestSequenceItem> to <ITestCase> to access {.Verdict}
    # and the {.Enabled} property
    tc = win32.CastTo(seq.Item(i), "ITestCase")
    if tc.Verdict != 1:  # Verdict 1 is pass
        tc.Enabled = True
        print(f"Enabling Test Case {tc.Ident} with verdict {tc.Verdict}")
    else:
        tc.Enabled = False
        print(f"Disabling Test Case {tc.Ident} since it has already passed")

CANoe.Measurement.Start()
sleep(5)  # Sleep because measurement start is not instantaneous
test_module.Start()
sleep(1)
Just continue what you have done.
The TestEnvironment contains the TestModules. Each TestModule contains a TestSequence which in turn contains the TestCases.
Keep in mind that you cannot start individual TestCases, only the whole TestModule. But you can enable and disable individual TestCases before execution by using the COM API.
(typing this off the top of my head, might not work 100%)
test_module = test_environment.TestModules.Item(1)  # or 2, or whatever
test_sequence = test_module.Sequence
for i in range(1, test_sequence.Count + 1):
    test_case = test_sequence.Item(i)
    if ...:
        test_case.Enabled = False  # or True
test_module.Start()
You have to keep in mind that a TestSequence can also contain other TestSequences (i.e. a TestGroup). This depends on how your TestModule is set up. If so, you have to take care of that in your loop and descend into these TestGroups while searching for your TestCase of interest; a rough sketch of such a descent follows.
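A rough, unverified sketch of such a recursive descent (it reuses only the calls shown above; how a TestGroup exposes its children can differ between CANoe versions, so treat the fallback branch as an assumption):
import win32com.client as win32

def enable_unpassed_cases(sequence):
    # Walk a test sequence and enable only the test cases without a pass verdict,
    # descending into nested groups/sequences along the way.
    for i in range(1, sequence.Count + 1):
        item = sequence.Item(i)
        try:
            tc = win32.CastTo(item, "ITestCase")  # as in the script above
            tc.Enabled = (tc.Verdict != 1)  # Verdict 1 is pass
        except Exception:
            # Assumption: the item is a TestGroup/sequence that can be iterated
            # the same way; adjust the cast/property for your CANoe version.
            enable_unpassed_cases(item)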

How to run a function on all Spark workers before processing data in PySpark?

I'm running a Spark Streaming task in a cluster using YARN. Each node in the cluster runs multiple spark workers. Before the streaming starts I want to execute a "setup" function on all workers on all nodes in the cluster.
The streaming task classifies incoming messages as spam or not spam, but before it can do that it needs to download the latest pre-trained models from HDFS to local disk, like this pseudo code example:
def fetch_models():
    if hadoop.version > local.version:
        hadoop.download()
I've seen the following examples here on SO:
sc.parallelize().map(fetch_models)
But in Spark 1.6 parallelize() requires some data to be used, like this crude work-around I'm doing now:
sc.parallelize(range(1, 1000)).map(fetch_models)
Just to be fairly sure that the function is run on ALL workers I set the range to 1000. I also don't know exactly how many workers are in the cluster when running.
I've read the programming documentation and googled relentlessly but I can't seem to find any way to actually just distribute anything to all workers without any data.
After this initialization phase is done, the streaming task is as usual, operating on incoming data from Kafka.
The way I'm using the models is by running a function similar to this:
spark_partitions = config.get(ConfigKeys.SPARK_PARTITIONS)
stream.union(*create_kafka_streams())\
    .repartition(spark_partitions)\
    .foreachRDD(lambda rdd: rdd.foreachPartition(lambda partition: spam.on_partition(config, partition)))
Theoretically I could check whether or not the models are up to date in the on_partition function, though it would be really wasteful to do this on each batch. I'd like to do it before Spark starts retrieving batches from Kafka, since the downloading from HDFS can take a couple of minutes...
UPDATE:
To be clear: it's not an issue on how to distribute the files or how to load them, it's about how to run an arbitrary method on all workers without operating on any data.
To clarify what actually loading models means currently:
def on_partition(config, partition):
    if not MyClassifier.is_loaded():
        MyClassifier.load_models(config)
    handle_partition(config, partition)
While MyClassifier is something like this:
class MyClassifier:
    clf = None

    @staticmethod
    def is_loaded():
        return MyClassifier.clf is not None

    @staticmethod
    def load_models(config):
        MyClassifier.clf = load_from_file(config)
Static methods, since PySpark doesn't seem to be able to serialize classes with non-static methods (the state of the class is irrelevant to another worker). Here we only have to call load_models() once, and on all future batches MyClassifier.clf will already be set. This is something that should really not be done for each batch; it's a one-time thing. The same goes for downloading the files from HDFS using fetch_models().
If all you want is to distribute a file between worker machines, the simplest approach is the SparkFiles mechanism:
some_path = ... # local file, a file in DFS, an HTTP, HTTPS or FTP URI.
sc.addFile(some_path)
and retrieve it on the workers using SparkFiles.get and standard IO tools:
from pyspark import SparkFiles

with open(SparkFiles.get(some_path)) as fw:
    ...  # Do something
If you want to make sure that the model is actually loaded, the simplest approach is to load it on module import. Assuming config can be used to retrieve the model path:
model.py:
from pyspark import SparkFiles

config = ...

class MyClassifier:
    clf = None

    @staticmethod
    def is_loaded():
        return MyClassifier.clf is not None

    @staticmethod
    def load_models(config):
        path = SparkFiles.get(config.get("model_file"))
        MyClassifier.clf = load_from_file(path)

# Executed once per interpreter
MyClassifier.load_models(config)
main.py:
from pyspark import SparkContext
config = ...
sc = SparkContext("local", "foo")
# Executed before StreamingContext starts
sc.addFile(config.get("model_file"))
sc.addPyFile("model.py")
import model
ssc = ...
stream = ...
stream.map(model.MyClassifier.do_something).pprint()
ssc.start()
ssc.awaitTermination()
This is a typical use case for Spark's broadcast variables. Let's say fetch_models returns the models rather than saving them locally; you would then do something like:
bc_models = sc.broadcast(fetch_models())
spark_partitions = config.get(ConfigKeys.SPARK_PARTITIONS)
stream.union(*create_kafka_streams())\
    .repartition(spark_partitions)\
    .foreachRDD(lambda rdd: rdd.foreachPartition(lambda partition: spam.on_partition(config, partition, bc_models.value)))
This does assume that your models fit in memory, on the driver and the executors.
You may be worried that broadcasting the models from the single driver to all the executors is inefficient, but broadcasting uses 'efficient broadcast algorithms' that can significantly outperform distribution through HDFS, according to this analysis.
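For completeness, a sketch of how on_partition might consume the broadcast value on the executors (spam, handle_record and the per-record call are placeholders carried over from the question, not a real API):
def on_partition(config, partition, models):
    # `models` is bc_models.value, resolved on the executor inside foreachPartition
    for record in partition:
        handle_record(config, models, record)  # hypothetical per-record spam check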
