I am trying to collect application-specific Prometheus metrics in Django for functions that are called by django-background-tasks.
In my application models.py file, I am first adding a custom metric with:
from prometheus_client import Summary

my_task_metric = Summary("my_task_metric", "My task metric")
Then, I am adding this to my function to capture the timestamp at which this function was last run successfully:
@background()
def my_function():
    # my function code here
    # collecting the metric (seconds since the Unix epoch)
    my_task_metric.observe((datetime.now().replace(tzinfo=timezone.utc) - datetime(1970, 1, 1).replace(tzinfo=timezone.utc)).total_seconds())
When I bring up Django, the metric is created and accessible in /metrics. However, after this function is run, the value for sum is 0 as if the metric is not observed. Am I missing something?
Or is there a better way to monitor django-background-tasks with Prometheus? I have tried using the model of django-background-tasks but I found it a bit cumbersome.
Since django-background-tasks runs the tasks in a separate worker process (the process_tasks command), metrics observed there never show up in the web process's /metrics. I ended up creating a decorator leveraging the Prometheus Pushgateway feature, pushing a fresh timestamp after each successful run:
from functools import wraps
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def push_metric_to_prometheus(function):
    @wraps(function)
    def wrapper(*args, **kwargs):
        result = function(*args, **kwargs)  # run the task; push only on success
        registry = CollectorRegistry()
        Gauge(f'{function.__name__}_last_successful_run', f'Last time {function.__name__} successfully finished',
              registry=registry).set_to_current_time()
        push_to_gateway('bolero.club:9091', job='batchA', registry=registry)
        return result
    return wrapper
and then on my function (the order of the decorators is important):
@background()
@push_metric_to_prometheus
def my_function():
    # my function code here
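A side note on consuming this: since each push overwrites the gauge with the current time, a PromQL expression along the lines of time() - my_function_last_successful_run > 3600 (the metric name follows from function.__name__) should let you alert when the task has not succeeded recently; treat that as a sketch rather than a tested rule.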
I'm currently discovering Prefect and I'm trying to deploy it to schedule workflows, but I struggle a bit to understand how to access some data. Here is my problem: I create a deployment and run it via the Python API, and I need the ID of the flow run it creates (to cancel it, among other things that happen outside of the flow).
When I run without any scheduling I can access the data I need (the flow run UUID), but I kind of want the scheduling part. It may be because the run_deployment function is asynchronous but as I am nowhere near being an expert in Python I don't know for sure (well that, and the fact that my code never exits after calling the main() function).
Here is what my code looks like:
from prefect import flow, task
from prefect.deployments import Deployment, run_deployment
from datetime import datetime, date, time, timezone

# Import the flow:
from script import my_flow

# Configure the deployment:
deployment_name = "my_deployment"

# Create the deployment for the flow:
deployment = Deployment.build_from_flow(
    flow = my_flow,
    name = deployment_name,
    version = 1,
    work_queue_name = "my_queue",
)
deployment.apply()
def main():
    # Schedule a flow run based on the deployment:
    response = run_deployment(
        name = "my_flow/" + deployment_name,
        parameters = {my_param},
        scheduled_time = dateutil.parser.isoparse(scheduledDate),
        flow_run_name = "my_run",
    )
    print(response)

if __name__ == "__main__":
    main()
    exit()
I searched a bit and saw in that post that it was possible to print the flow run ID as it was executed, but in my case I need it before the execution.
Is there any way to get that data (using the Python API)? Or to set the flow run ID myself? (I've already thoroughly checked the docs; I'm quite sure this is not possible.)
Thanks a lot for your time!
Gauthier
As of 2.7.12 - released the same day you posted your question - you can create names for flows programmatically. Does that get you what you need?
Both tasks and flows now expose a mechanism for customizing the names of runs! This new keyword argument (flow_run_name for flows, task_run_name for tasks) accepts a string that will be used to create a run name for each run of the function. The most basic usage is as follows:
from datetime import datetime
from prefect import flow, task

@task(task_run_name="custom-static-name")
def my_task(name):
    print(f"hi {name}")

@flow(flow_run_name="custom-but-fixed-name")
def my_flow(name: str, date: datetime):
    return my_task(name)

my_flow(name="marvin", date=datetime.utcnow())  # example arguments; the flow signature requires them
This is great, but doesn’t help distinguish between multiple runs of the same task or flow. In order to make these names dynamic, you can template them using the parameter names of the task or flow function, using all of the basic rules of Python string formatting as follows:
from datetime import datetime
from prefect import flow, task

@task(task_run_name="{name}")
def my_task(name):
    print(f"hi {name}")

@flow(flow_run_name="{name}-on-{date:%A}")
def my_flow(name: str, date: datetime):
    return my_task(name)

my_flow(name="marvin", date=datetime.utcnow())  # e.g. produces a run name like "marvin-on-Thursday"
See the docs or https://github.com/PrefectHQ/prefect/pull/8378 for more details.
run_deployment returns a flow run object - which you named response in your code.
If you want to get the ID before the flow run is actually executed, you just have to set timeout=0, so that run_deployment will return immediately after submission.
You only have to do:
flow_run = run_deployment(
    name = "my_flow/" + deployment_name,
    parameters = {my_param},
    scheduled_time = dateutil.parser.isoparse(scheduledDate),
    flow_run_name = "my_run",
    timeout = 0,
)
print(flow_run.id)
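And since the stated goal was to cancel the run later from outside the flow, here is a minimal sketch of how the returned ID could be used for that; it assumes a Prefect 2.7-era environment where prefect.client.orchestration.get_client and prefect.states.Cancelled are importable as shown:
import asyncio

from prefect.client.orchestration import get_client
from prefect.states import Cancelled

async def cancel_flow_run(flow_run_id):
    # Open an API client and force the scheduled run into a Cancelled state
    async with get_client() as client:
        await client.set_flow_run_state(flow_run_id=flow_run_id, state=Cancelled(), force=True)

asyncio.run(cancel_flow_run(flow_run.id))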
I am very new to Airflow and I am trying to create a DAG based on the below requirement.
Task 1 - Run a BigQuery query to get a value which I need to push to the 2nd task in the DAG.
Task 2 - Use the value from the above query, run another query, and export the data into a Google Cloud bucket.
I have read other answers related to this and I understand we cannot use xcom_pull or xcom_push in a BigQuery operator in Airflow. So what I am doing is using a PythonOperator where I can use Jinja template variables by setting "provide_context=True".
Below is a snippet of my code, just task 1, where I want to do "task_instance.xcom_push" in order to see the value in the Airflow XCom logs.
def get_bq_operator(dag, task_id, configuration, table_params=None, trigger_rule='all_success'):
    bq_operator = BigQueryInsertJobOperator(
        task_id=task_id,
        configuration=configuration,
        gcp_conn_id=gcp_connection_id,
        dag=dag,
        params=table_params,
        trigger_rule=trigger_rule,
        task_instance.xcom_push(key='yr_wk', value=yr_wk),  # <-- the failing line
    )
    return bq_operator

def get_bq_wm_yr_wk():
    get_bq_operator(dag, app_name, bigquery_util.get_bq_job_configuration(
        bq_query,
        query_params=None))

get_wm_yr_wk = PythonOperator(task_id='get_wm_yr_wk',
                              python_callable=get_bq_wm_yr_wk,
                              provide_context=True,
                              on_failure_callback=failure_callback,
                              on_retry_callback=failure_callback,
                              dag=dag)
"bq_query" is the one I am passing the sql file which has my query and the query returns the value of yr_wk which I need to use in my 2nd task.
The highlighted task_instance.xcom_push(key='yr_wk', value=yr_wk), in get_bq_operator is failing and the errror i am getting is as below
raise KeyError(f'Variable {key} does not exist')
KeyError: 'Variable ei_migration_hour does not exist'
If I comment out that line, the DAG runs fine. However, how do I validate the value of yr_wk? I want to push it so that I can view the value in the logs.
I do not fully understand your code :), but if you want to do something with the results of a BigQuery query, then a far better way to approach it is to use BigQueryHook in your Python callable.
Operators in Airflow are usually thin wrappers around hooks that provide a "complete" task (for example, you can use one to run an update operation), but if you want to do something with the result and you are already going through a PythonOperator, it is far better to use the hook directly, as you avoid all the assumptions that operators make in their execute method.
In your case it should be something like the snippet below (I am using the new TaskFlow syntax here, which is the preferred way to do this kind of operation; see https://airflow.apache.org/docs/apache-airflow/stable/tutorial_taskflow_api.html for the tutorial on the TaskFlow API. Especially in Airflow 2 it became the de-facto default way of writing tasks):
@task(.....)
def my_task():
    hook = BigQueryHook(....)  # initialize it with the right parameters
    result = hook.run(sql='YOUR_QUERY', ...)  # add other necessary params
    processed_result = process_result(result)  # do something with the result
    return processed_result
This way you do not even have to run xcom_push (the TaskFlow API will do it for you automatically), and other tasks will be able to use the value by just doing:
@task
def next_task(input):
    pass
And then:
result = my_task()
next_task(result)
Then all the xcom push/pull will be handled for you automatically via TaskFlow.
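To make that concrete end to end, here is a minimal sketch of the two-task pattern; the connection ID, dataset/table, and column names are placeholders, and it assumes the BigQueryHook from the Google provider package plus the Airflow 2 @dag/@task decorators:
from datetime import datetime

from airflow.decorators import dag, task
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook

@dag(schedule_interval=None, start_date=datetime(2022, 1, 1), catchup=False)
def bq_taskflow_example():

    @task
    def get_yr_wk():
        # The hook runs the query; TaskFlow pushes the return value to XCom
        hook = BigQueryHook(gcp_conn_id="my_gcp_conn", use_legacy_sql=False)
        records = hook.get_records("SELECT yr_wk FROM my_dataset.my_table LIMIT 1")
        return records[0][0]

    @task
    def export_data(yr_wk):
        # The XCom value arrives as a plain function argument
        print(f"Exporting data for {yr_wk}")

    export_data(get_yr_wk())

dag = bq_taskflow_example()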
Once an MLflow run is finished, external scripts can access its parameters and metrics using the Python MLflow client and the mlflow.get_run(run_id) method, but the Run object returned by get_run seems to be read-only.
Specifically, .log_param, .log_metric, or .log_artifact cannot be used on the object returned by get_run, raising errors like these:
AttributeError: 'Run' object has no attribute 'log_param'
If we attempt to run any of the .log_* methods on mlflow directly, it logs them to a new run with an auto-generated run ID in the Default experiment.
Example:
import mlflow

final_model_mlflow_run = mlflow.get_run(final_model_mlflow_run_id)

with mlflow.ActiveRun(run=final_model_mlflow_run) as myrun:
    # this read operation uses the correct run
    run_id = myrun.info.run_id
    print(run_id)

    # this write operation writes to a new run
    # (with an auto-generated random run ID)
    # in the "Default" experiment (with exp. ID of 0)
    mlflow.log_param("test3", "This is a test")
Note that the above problem exists regardless of the Run status (.info.status can be both "FINISHED" or "RUNNING", without making any difference).
I wonder if this read-only behavior is by design (given that immutable modeling runs improve experiment reproducibility)? I can appreciate that, but it also goes against code modularity if everything has to be done within a single monolith like the with mlflow.start_run() context...
As was pointed out to me by Hans Bambel, and as is documented here, mlflow.start_run (in contrast to mlflow.ActiveRun) accepts the run_id parameter of an existing run.
Here's an example tested to work in v1.13 through v1.19 - as you can see, one can even overwrite an existing metric to correct a mistake:
with mlflow.start_run(run_id=final_model_mlflow_run_id):
    # print(mlflow.active_run().info)
    mlflow.log_param("start_run_test", "This is a test")
    mlflow.log_metric("start_run_test", 1.23)
    mlflow.log_metric("start_run_test", 1.33)
    mlflow.log_artifact("/home/jovyan/_tmp/formula-features-20201103.json", "start_run_test")
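If the with-block still feels too monolithic, another option is the low-level client, which writes to an existing run by its ID from anywhere in the code; a minimal sketch (the parameter and metric names below are just examples):
from mlflow.tracking import MlflowClient

client = MlflowClient()
# Write directly to the existing run by its ID; no active-run context is needed
client.log_param(final_model_mlflow_run_id, "client_test", "This is a test")
client.log_metric(final_model_mlflow_run_id, "client_test", 1.23)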
I have been developing some custom resources via Lambda from CloudFormation (CF) and have been looking at using the custom resource helper. It started off OK, but then the CF stack took ages to create or delete. When I checked the CloudWatch logs I noticed there was an error after running the create or delete functions in my Lambda.
[7cfecd7b-69df-4408-ab12-a764a8bf674e][2021-02-07 12:41:12,535][ERROR] send(..) failed executing requests.put(..):
Formatting field not found in record: 'requestid'
I noticed some others had this issue but no resolution. I have used the generic code from the link below; my custom code works and completes, but it looks like it fails when passing the response back to CF. I looked through crhelper.py, and the only reference I can find to 'requestid' is this:
logfmt = '[%(requestid)s][%(asctime)s][%(levelname)s] %(message)s \n'
mainlogger.handlers[0].setFormatter(logging.Formatter(logfmt))
return logging.LoggerAdapter(mainlogger, {'requestid': event['RequestId']})
Reference
To understand the error that you're having, we would need to look at a reproducible code example of what you're doing. Take into consideration that every time you have some kind of error on a custom resource operation, it may take ages to finish, as you have noticed.
But there is a good alternative to the original custom resource helper that you're using, and, in my experience, it works very well while keeping the code much simpler (thanks to a good level of abstraction) and following best practices: the custom resource helper framework, as explained on this AWS blog.
You can find more details about the implementation on GitHub here.
Basically, after downloading the needed dependencies and loading them into your Lambda (this depends on how you handle custom Lambda dependencies), you can manage your custom resource operations like this:
from crhelper import CfnResource
import logging

logger = logging.getLogger(__name__)

# Initialise the helper
helper = CfnResource()

try:
    ## put here your initial code for every operation
    pass
except Exception as e:
    helper.init_failure(e)

@helper.create
def create(event, context):
    logger.info("Got Create")
    print('Here we are creating some cool stuff')

@helper.update
def update(event, context):
    logger.info("Got Update")
    print('Here you update the things you want')

@helper.delete
def delete(event, context):
    logger.info("Got Delete")
    print('Here you handle the delete operation')

# this will call the defined function according to the
# cloudformation operation that is running (create, update or delete)
def handler(event, context):
    helper(event, context)
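One follow-up worth knowing: if your custom resource has to return values to the stack (for Fn::GetAtt), crhelper supports that through the helper.Data mapping and the return value of the create function; a small sketch under that assumption:
@helper.create
def create(event, context):
    logger.info("Got Create")
    # Values placed in helper.Data are returned to CloudFormation
    # and can be read in the template via Fn::GetAtt
    helper.Data["MyAttribute"] = "some-value"
    # The return value is used as the resource's PhysicalResourceId
    return "my-custom-resource-id"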
I'm running a Spark Streaming task in a cluster using YARN. Each node in the cluster runs multiple Spark workers. Before the streaming starts I want to execute a "setup" function on all workers on all nodes in the cluster.
The streaming task classifies incoming messages as spam or not spam, but before it can do that it needs to download the latest pre-trained models from HDFS to local disk, like this pseudo code example:
def fetch_models():
    if hadoop.version > local.version:
        hadoop.download()
I've seen the following examples here on SO:
sc.parallelize().map(fetch_models)
But in Spark 1.6 parallelize() requires some data to operate on, like this shitty work-around I'm doing now:
sc.parallelize(range(1, 1000)).map(fetch_models)
Just to be fairly sure that the function is run on ALL workers I set the range to 1000. I also don't exactly know how many workers are in the cluster when running.
I've read the programming documentation and googled relentlessly but I can't seem to find any way to actually just distribute anything to all workers without any data.
After this initialization phase is done, the streaming task is as usual, operating on incoming data from Kafka.
The way I'm using the models is by running a function similar to this:
spark_partitions = config.get(ConfigKeys.SPARK_PARTITIONS)
stream.union(*create_kafka_streams())\
    .repartition(spark_partitions)\
    .foreachRDD(lambda rdd: rdd.foreachPartition(lambda partition: spam.on_partition(config, partition)))
Theoretically I could check whether or not the models are up to date in the on_partition function, though it would be really wasteful to do this on each batch. I'd like to do it before Spark starts retrieving batches from Kafka, since the downloading from HDFS can take a couple of minutes...
UPDATE:
To be clear: it's not an issue on how to distribute the files or how to load them, it's about how to run an arbitrary method on all workers without operating on any data.
To clarify what actually loading models means currently:
def on_partition(config, partition):
    if not MyClassifier.is_loaded():
        MyClassifier.load_models(config)
    handle_partition(config, partition)
While MyClassifier is something like this:
class MyClassifier:
    clf = None

    @staticmethod
    def is_loaded():
        return MyClassifier.clf is not None

    @staticmethod
    def load_models(config):
        MyClassifier.clf = load_from_file(config)
Static methods, since PySpark doesn't seem to be able to serialize classes with non-static methods (the state of the class is irrelevant in relation to another worker). Here we only have to call load_models() once, and on all future batches MyClassifier.clf will be set. This is something that should really not be done for each batch; it's a one-time thing. Same with downloading the files from HDFS using fetch_models().
If all you want is to distribute a file between worker machines, the simplest approach is to use the SparkFiles mechanism:
some_path = ... # local file, a file in DFS, an HTTP, HTTPS or FTP URI.
sc.addFile(some_path)
and retrieve it on the workers using SparkFiles.get and standard IO tools:
from pyspark import SparkFiles

with open(SparkFiles.get(some_path)) as fw:
    ...  # Do something
If you want to make sure that the model is actually loaded, the simplest approach is to load it on module import. Assuming config can be used to retrieve the model path:
model.py:
from pyspark import SparkFiles

config = ...

class MyClassifier:
    clf = None

    @staticmethod
    def is_loaded():
        return MyClassifier.clf is not None

    @staticmethod
    def load_models(config):
        path = SparkFiles.get(config.get("model_file"))
        MyClassifier.clf = load_from_file(path)

# Executed once per interpreter
MyClassifier.load_models(config)
main.py:
from pyspark import SparkContext

config = ...

sc = SparkContext("local", "foo")

# Executed before StreamingContext starts
sc.addFile(config.get("model_file"))
sc.addPyFile("model.py")

import model

ssc = ...
stream = ...

stream.map(model.MyClassifier.do_something).pprint()

ssc.start()
ssc.awaitTermination()
This is a typical use case for Spark's broadcast variables. Let's say fetch_models returns the models rather than saving them locally; you would do something like:
bc_models = sc.broadcast(fetch_models())

spark_partitions = config.get(ConfigKeys.SPARK_PARTITIONS)
stream.union(*create_kafka_streams())\
    .repartition(spark_partitions)\
    .foreachRDD(lambda rdd: rdd.foreachPartition(lambda partition: spam.on_partition(config, partition, bc_models.value)))
This does assume that your models fit in memory, on the driver and the executors.
You may be worried that broadcasting the models from the single driver to all the executors is inefficient, but it uses 'efficient broadcast algorithms' that can significantly outperform distribution through HDFS, according to this analysis.
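For completeness, a minimal sketch of what spam.on_partition could look like once it receives the broadcast value; the classify call is hypothetical and stands in for whatever API your loaded models actually expose:
def on_partition(config, partition, models):
    # models is bc_models.value: already deserialized on this executor,
    # so no HDFS download or load_from_file call is needed here
    for message in partition:
        is_spam = models.classify(message)  # hypothetical classifier API
        print(message, is_spam)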