How to specify dask client via environment variable - python

How can I instruct dask to use a distributed Client as the scheduler, externally from the code, e.g. via an environment variable?
The motivation is to take advantage of one of the key features of dask - namely the transparency of going from a single machine to a distributed cluster. However, there seems to be one little thing obscuring this transparency - the need to register a Client via code.
I can set the named schedulers (e.g. "synchronous" and "processes") via the config (file/env var) as instructed here, but how do I use the same mechanism with a distributed one?
Ideally, I would like to set something like:
DASK_SCHEDULER=distributed(scheduler_file=...)
as an environment variable, which would be the equivalent of running client = Client(scheduler_file=...) within Python code.
This would then mean the EXACT same code can be run in different environments (local and distributed).

One way to do it would be to pass the scheduler as an argument, for example using argparse.
Thus you could run python my_script.py <ip:port>, where you pass either the distributed scheduler's address or <127.0.0.1:port> for a local one.
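As an illustration only, a minimal sketch of that argparse approach might look like this (the script name and help text are placeholders, not part of the original question):
# my_script.py - pass the scheduler address on the command line
import argparse
from dask.distributed import Client

parser = argparse.ArgumentParser()
parser.add_argument("scheduler", help="scheduler address, e.g. 127.0.0.1:8786 or the cluster's ip:port")
args = parser.parse_args()

client = Client(args.scheduler)  # dask now routes work through this scheduler
# ...the rest of the code stays identical for local and distributed runs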

Changing AWS Lambda environment variables when running test

I've got a few small Python functions running on AWS that post to Twitter. I'm a novice when it comes to Lambda, knowing only enough to get the functions running.
The functions have environment variables set in Lambda with various bits of configuration, such as post frequency and the secret data for the twitter application. These are read into the python script directly.
It's all triggered by an EventBridge cron job that runs every hour.
I want to create a test event that will allow me to invoke the function manually, but I'd like to be able to change the post frequency variable when running it this way.
Is there a simple way to change environment variables when running a test event?
That is very much possible, and there are multiple ways to do it. One is to use the AWS CLI's aws lambda update-function-configuration: https://docs.aws.amazon.com/cli/latest/reference/lambda/update-function-configuration.html
Alternatively, depending on the programming language you prefer, you can use an AWS SDK that has a similar method; you can find an example with the JS SDK in this doc: https://docs.aws.amazon.com/sdk-for-javascript/v3/developer-guide/javascript_lambda_code_examples.html
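As a rough sketch of the SDK route in Python with boto3 (the function name and variable name here are made up for illustration): note that the Environment field replaces the whole variable set, so the existing values are fetched and merged first.
import boto3

lam = boto3.client("lambda")
# fetch the current variables so the update does not wipe the other settings
env = lam.get_function_configuration(FunctionName="twitter-poster")["Environment"]["Variables"]
env["POST_FREQUENCY"] = "1"  # hypothetical override for the manual test run
lam.update_function_configuration(
    FunctionName="twitter-poster",
    Environment={"Variables": env},
)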

How can I run a python script at the end of a specific DBT model, using the post_hook param within config()?

I would like to be able to run a python script only at the end of a specific DBT model.
My idea is to use post_hook parameter from config() function of that specific model.
Is there a way to do this?
You cannot do this today. dbt does not provide a Python runtime.
Depending on how you deploy dbt, you could use fal for this (either open source or cloud): https://fal.ai/, or another (heavier) orchestrator, like Airflow, Dagster, or Prefect.
You should also know that there is an active Discussion about External Nodes and/or executable exposures that would solve for this use case: https://github.com/dbt-labs/dbt-core/discussions/5073
dbt is also planning to release Python-language models in the near future, but that is unlikely to solve this use case; that Python will be executed in your warehouse environment, and may or may not be able to make arbitrary web requests (e.g., Snowpark is really just dataframe-style Python that gets transpiled to SQL).
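As a sketch of the orchestrator route mentioned above (not something dbt itself provides), an Airflow DAG could run the one model and then the script as a downstream task; the model name, script path, and schedule below are placeholders:
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("dbt_model_then_script", start_date=datetime(2023, 1, 1), schedule_interval="@daily", catchup=False) as dag:
    run_model = BashOperator(
        task_id="run_my_model",
        bash_command="dbt run --select my_model",  # run only the one model
    )
    post_script = BashOperator(
        task_id="after_my_model",
        bash_command="python /path/to/after_my_model.py",  # the script to run afterwards
    )
    run_model >> post_script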

Automatically register new prefect flows?

Is there a mechanism to automatically register flows/new flows if a local agent is running, without having to manually run e.g. flow.register(...) on each one?
In Airflow, I believe there is a process that regularly scans the specified Airflow home folder for any files with dag in the name, then searches them for DAG objects. If it finds them, it loads them so they are accessible through the UI without having to manually 'register' them.
Does something similar exist for Prefect? For example, if I just created the following file test_flow.py, without necessarily running it or adding flow.run_agent(), is there a way for it to just be magically registered and accessible through the UI :) - simply by existing in the proper place?
# prefect_home_folder/test_flow.py
import prefect
from prefect import task, Flow

@task
def hello_task():
    logger = prefect.context.get("logger")
    logger.info("Hello, Cloud!")

flow = Flow("hello-flow", tasks=[hello_task])
flow.register(project_name='main')
I could write a script with similar behavior to the Airflow process, scanning a folder and registering flows at regular intervals, but I wonder if that is a bit hacky or if there is a better solution and I'm just thinking too much in terms of Airflow?
Great question (and awesome username!) - in short, I suggest you are thinking too much in terms of Airflow. There are a few reasons this is not currently available in Prefect:
explicit is better than implicit
Prefect flows are not constrained to live in one place and are not constrained to share a runtime environment; this makes automatically discovering a flow and re-serializing it complicated to do from a single agent process (which is not required to share the same runtime environment as the flows it submits)
agents are better thought of as being parametrized by deployment infrastructure, not flow storage
Ideally for production workflows you'd use a CI/CD process so that anytime you make a code change an automatic job is triggered that re-registers the flow. A few comments that may be helpful:
you don't actually need to re-register the flow for every possible code change; for example, if you changed the message that your hello_task logs in your example, you could simply re-save the flow to its original location (what this looks like depends on the type of storage you use). Ultimately you only need to re-register if any of the metadata about your flow changes (retry settings, task names, dependency relationships, etc.)
you can use flow.register("My Project", idempotency_key=flow.serialized_hash()) to automatically capture this; this pattern will only register a new version if the flow's backend representation changes in some way
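To make that last point concrete, a registration script run from CI/CD might look roughly like this (Prefect 1.x API; the folder name, project name, and the assumption that each module builds a flow object without registering it itself are all illustrative):
import importlib.util
from pathlib import Path

for path in Path("prefect_home_folder").glob("*_flow.py"):
    spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # assumes the module defines `flow` but does not call flow.register() at import time
    module.flow.register(
        project_name="main",
        idempotency_key=module.flow.serialized_hash(),  # only creates a new version if the flow's metadata changed
    )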

opencensus exporter - one global or per thread?

I am using OpenCensus to do some monitoring on a gRPC server with 10 workers. My question is whether, when making a Tracer, the exporter for the tracer should be local or global.
This is the server:
server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
Do I do:
tracer_module.Tracer(sampler=always_on.AlwaysOnSampler(), exporter=GLOBAL_EXPORTER)
where:
GLOBAL_EXPORTER = stackdriver_exporter.StackdriverExporter(transport=BackgroundThreadTransport)
Or do I do:
tracer_module.Tracer(sampler=always_on.AlwaysOnSampler(), exporter=stackdriver_exporter.StackdriverExporter(transport=BackgroundThreadTransport))
I have tried both and they work. The former uses a global exporter, which should be more efficient (I would think), but the aggregation seems a bit odd (one call is 'aggregated' with another). On the other hand, the second way creates a new exporter (which is short-lived, since it exists only for that call) and does seem to export correctly. The question is more about what is more correct from a system perspective, i.e. for the second option, does creating stackdriver_exporter.StackdriverExporter(transport=BackgroundThreadTransport) invalidate a different exporter (which was created in a different thread)?
You should use a global exporter. It was not intended for a new export thread to be created for every Tracer. There should be one background thread running which handles all exporting to StackDriver.
As for the aggregation, it shouldn't be aggregating all the spans together. That may be a bug in the StackDriver UI (there are a number of known issues).
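A minimal sketch of that shared-exporter pattern, reusing the names from the question (the import paths below follow the older OpenCensus API and may differ between versions; the handler function is illustrative):
from opencensus.trace import tracer as tracer_module
from opencensus.trace.samplers import always_on
from opencensus.trace.exporters import stackdriver_exporter
from opencensus.trace.exporters.transports.background_thread import BackgroundThreadTransport

# created once at module import time; one background thread handles exporting for the whole process
GLOBAL_EXPORTER = stackdriver_exporter.StackdriverExporter(transport=BackgroundThreadTransport)

def handle_request(request):
    # a fresh Tracer per call is fine; the exporter (and its export thread) is shared
    tracer = tracer_module.Tracer(sampler=always_on.AlwaysOnSampler(), exporter=GLOBAL_EXPORTER)
    with tracer.span(name="handle_request"):
        ...  # actual per-request work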

How to leverage the world-size parameter for DistributedDataParallel in Pytorch example for multiple GPUs?

I am running this Pytorch example on a g2.2xlarge AWS machine. So, when I run time python imageNet.py ImageNet2, it runs well with the following timing:
real 3m16.253s
user 1m50.376s
sys 1m0.872s
However, when I add the world-size parameter, it gets stuck and does not execute anything. The command is as follows: time python imageNet.py --world-size 2 ImageNet2
So, how do I leverage the DistributedDataParallel functionality with the world-size parameter in this script? The world-size parameter is nothing but the number of distributed processes.
Do I spin up another similar instance for this purpose? If yes, then how does the script recognize that instance? Do I need to add some parameters like the instance's IP or something?
The world-size argument is the number of nodes in your distributed training, so if you set the world size to 2 you need to run the same command with a different rank on the other node. If you just want to increase the number of GPUs on a single node, you need to change ngpus_per_node instead. Take a look at the multiple-node example in this Readme.
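To illustrate what the world-size/rank pair means at the torch.distributed level (this shows the underlying call rather than the example script's exact flags; the address and port are placeholders), each node runs the same script, points at the rank-0 node, and passes its own rank:
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",                      # or "gloo" on CPU-only machines
    init_method="tcp://10.0.0.1:23456",  # address of the rank-0 node (placeholder)
    world_size=2,                        # total number of participating processes
    rank=0,                              # 0 on the first node, 1 on the second
)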
