Monitor Multiple Airflow instances - python

I currently have a task to monitor completely independent Airflow instances running across multiple customer servers. All of them have similar DAGs. I have to combine metrics from all these instances and monitor them. Can anyone please suggest an approach for this?
I tried to use Prometheus with multiple statsd exporters and Grafana dashboards, but I can't seem to get it working.

The best approach is to use tags to differentiate between the metrics of the different instances. Unfortunately, Airflow doesn't push tags for Prometheus; it does so only for Datadog, and soon (starting from 2.6.0) for InfluxDB.
If you can use one of these services, you can use the same statsd_prefix and add statsd_datadog_tags to separate the different Airflow instances:
# instance 1:
statsd_datadog_tags = instance:1
# instance 2:
statsd_datadog_tags = instance:2
Then you will be able to put the metrics on the same monitor/dashboard, and group and filter them by instance name.
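For reference, a minimal sketch of what the [metrics] section of airflow.cfg could look like on each server (option names assume Airflow 2.x; check the configuration reference for your version):
[metrics]
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow
statsd_datadog_enabled = True
# use instance:2 on the second server, and so on
statsd_datadog_tags = instance:1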
Also, in 2.6.0 we will support removing the variables from the metric names and sending them as tags, so you will be able to create one dashboard for all your DAGs/tasks and group by tags to monitor them better.

Related

Performing PromQL queries on prometheus metrics in a text file

I have a custom Prometheus exporter that is meant for extreme "hermeticity" - to operate at all times, even when there is no network connection, for a spectrum of reasons.
Normally, I have a main Prometheus instance that scrapes nodes with this exporter, but when the network goes out, the team added functionality to the exporter to dump the metrics to a text file periodically, in order not to lose any crucial data.
Now, I have many hours of metrics from several nodes in some text files, and I want to be able to query them. I checked whether the prometheus_client package in Python had any way to query them, but the closest thing I found was to parse the text-formatted metrics into gauge/counter objects in Python, and if I wished to query them I would have to implement something myself.
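For context, the parsing route I found looks roughly like this (a sketch using prometheus_client's text parser; the file path is hypothetical):
from prometheus_client.parser import text_string_to_metric_families

# Read one of the dumped text-format files
with open("metrics_dump.txt") as f:
    text = f.read()

# Each metric family exposes its samples as (name, labels, value, ...) entries
for family in text_string_to_metric_families(text):
    for sample in family.samples:
        print(sample.name, sample.labels, sample.value)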
I've searched for available solutions, but the only way to query Prometheus I found was through its API, which would require me to push the metrics into the main Prometheus instance.
I don't have direct access to the main Prometheus instance, thus I can't make a quick script to push metrics into it.
Finally, my question is: How can I perform PromQL queries on Prometheus text-formatted metrics in a text file? Is there an available solution, or do we have to implement something similar ourselves?

Persist data to DynamoDB in an Alexa-hosted app

Are there any really good articles breaking down how to persist data into DynamoDB from Alexa? I can't seem to find a good article that breaks down, step by step, how to persist a slot value into DynamoDB. I see the Alexa docs here about implementing the code in Python, but that seems to be only part of what I'm looking for.
There's really no comprehensive breakdown of this, like this tutorial that persists data to S3. I would like to find something similar for DynamoDB. If there's an answer from a previous question that has covered it, let me know and I can mark this as a duplicate.
You can just use a tutorial that uses Python and AWS Lambda.
Like this one:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GettingStarted.Python.html
The Amazon article is more about the development kit, which can give you some nice features to store persistent attributes for a user.
So usually I have a persistent store for users (game scores, ..., last use of the skill, whatever) and additional data in another table.
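As a rough sketch of that pattern with boto3 (table and attribute names are placeholders):
import boto3

# Assumes a table named "UserScores" with partition key "user_id" already exists
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("UserScores")

# Store a slot value (or any per-user attribute)
table.put_item(Item={"user_id": "amzn1.ask.account.XXXX", "game_score": 42})

# Read it back later
item = table.get_item(Key={"user_id": "amzn1.ask.account.XXXX"}).get("Item")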
The persistence adapter has an interface spec that abstracts away most of the details operationally. You should be able to change persistence adapters by initializing one that meets the spec, and in the initialization there may be some different configuration options. But the way you put things in and get them out should remain functionally the same.
You can find the configuration options for S3 and DynamoDB here: https://developer.amazon.com/en-US/docs/alexa/alexa-skills-kit-sdk-for-python/manage-attributes.html
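For illustration, a minimal sketch of initializing a DynamoDB-backed adapter (assuming the ask-sdk-dynamodb adapter package; the table name is a placeholder):
import boto3
from ask_sdk_core.skill_builder import CustomSkillBuilder
from ask_sdk_dynamodb.adapter import DynamoDbAdapter
# from ask_sdk_s3.adapter import S3Adapter  # the S3 alternative

# Handlers that go through the attributes manager don't need to change;
# only this initialization does.
persistence_adapter = DynamoDbAdapter(
    table_name="MySkillAttributes",  # placeholder
    create_table=True,
    dynamodb_resource=boto3.resource("dynamodb"),
)
sb = CustomSkillBuilder(persistence_adapter=persistence_adapter)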
I have written a "local persistence adapter" in JavaScript to let me store values in flat files at localhost instead of on S3 when I'm doing local dev/debug. Swapping the two out (depending on environment) is all handled at adapter initialization. My handlers that use the attributes manager don't change.

Extracting data continuously from RDS MySQL schemas in parallel

I have got a requirement to extract data from an Amazon Aurora RDS instance and load it into S3 to build a data lake for analytics purposes. There are multiple schemas/databases in one instance, and each schema has a similar set of tables. I need to pull selective columns from these tables for all schemas in parallel. This should happen in near real time, capturing the DML operations periodically.
The question may arise of using dedicated services provided by AWS, like Data Migration or a Copy activity. But I can't use them, since the plan is to keep the solution cloud-platform independent, as it could be hosted on Azure down the line.
I was thinking Apache Spark could be used for this, but I learned it doesn't support JDBC as a source in Structured Streaming. I read about multi-threading and multiprocessing techniques in Python for this, but I have to assess whether they are suitable (the idea is to run the code as daemon threads, each thread fetching data from the tables of a single schema in the background, running continuously in defined cycles, say every 5 minutes). Data synchronization between the RDS tables and S3 is also a crucial aspect to consider.
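Roughly what I had in mind for the threading idea (a sketch only; fetch_and_write_to_s3 is a placeholder for the per-schema extract-and-upload logic):
import threading
import time

SCHEMAS = ["customer_a", "customer_b", "customer_c"]  # placeholder schema names
CYCLE_SECONDS = 300  # run every 5 minutes

def fetch_and_write_to_s3(schema):
    # Placeholder: query the selective columns for this schema and upload to S3
    pass

def poll_schema(schema):
    # Each daemon thread keeps extracting its schema on a fixed cycle
    while True:
        fetch_and_write_to_s3(schema)
        time.sleep(CYCLE_SECONDS)

for schema in SCHEMAS:
    threading.Thread(target=poll_schema, args=(schema,), daemon=True).start()

# Keep the main process alive so the daemon threads keep cycling
while True:
    time.sleep(60)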
To say more about the data in the source tables: they have an auto-increment ID field, but the IDs are not sequential and may be missing a few numbers in between where rows were removed due to the inactivity of the corresponding entity, say customers. It is not necessary to pull the entire record; only a few columns, predefined in the configuration, are pulled. The solution must be reliable, sustainable, and automatable.
Now, I'm unsure which approach to use and how to implement the solution once decided. Hence, I seek the help of people who have dealt with this or know of any solution to this problem. I'm happy to provide more info if it is required to get to the right solution. Any help on this would be greatly appreciated.

How to Remove All Session Objects after H2O AutoML?

I am trying to create an ML application in which a front end takes user information and data, cleans it, and passes it to h2o AutoML for modeling, then recovers and visualizes the results. Since the back end will be a stand-alone / always-on service that gets called many times, I want to ensure that all objects created in each session are removed, so that h2o doesn't get cluttered and run out of resources. The problem is that many objects are being created, and I am unsure how to identify/track them, so that I can remove them before disconnecting each session.
Note that I would like the ability to run more than one analysis concurrently, which means I cannot just call remove_all(), since this may remove objects still needed by another session. Instead, it seems I need a list of session objects, which I can pass to the remove() method. Does anyone know how to generate this list?
Here's a simple example:
import h2o
from h2o.automl import H2OAutoML
import pandas as pd

h2o.init()  # connect to (or start) the local H2O cluster
df = pd.read_csv(r"C:\iris.csv")
my_frame = h2o.H2OFrame(df, destination_frame="my_frame")
aml = H2OAutoML(max_runtime_secs=100)
aml.train(y='class', training_frame=my_frame)
Looking in the Flow UI shows that this simple example generated 5 new frames, and 74 models. Is there a session ID tag or something similar that I can use to identify these separately from any objects created in another session, so I can remove them?
The recommended way to clean only your work is to use h2o.remove(aml).
This will delete the automl instance on the backend and cascade to all the submodels and attached objects like metrics.
It won't delete the frames that you provided though (e.g. training_frame).
You can use h2o.ls() to list the H2O objects. Then you can use h2o.remove('YOUR_key') to remove ones you don't want to keep.
For example:
# Create a frame of objects
h_objects = h2o.ls()
# Filter for the keys of one AutoML session
filtered_objects = h_objects[h_objects['key'].str.contains('AutoML_YYYYMMDD_xxxxxx')]
for key in filtered_objects['key']:
    h2o.remove(key)
Alternatively, you can remove all AutoML objects using the filter below instead.
filtered_objects = h_objects[h_objects['key'].str.lower().str.contains('automl')]

How to send a query or stored procedure execution request to a specific location/region of cosmosdb?

I'm trying to multi-thread some tasks using Cosmos DB to optimize ETL time. Using the Python API (though I could do something in REST if required), I can't find how, if I have a stored procedure to call twice for two partition keys, I could send the calls to two different regions (namely 'West Europe' and 'Central France').
I defined those as PreferredLocations in the connection policy, but I don't know how to include in a query the instruction to route it to a specific location.
The only place you could specify that would be the options object of the request. However, there is nothing there related to regions.
What you can do is initialize multiple clients with a different order of preferred locations and spread the load across regions that way.
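A sketch of that idea with the Python SDK (assuming azure-cosmos 4.x, where preferred_locations is accepted as a keyword argument at client construction; endpoint, key, and region names are placeholders):
from azure.cosmos import CosmosClient

ENDPOINT = "https://<your-account>.documents.azure.com:443/"
KEY = "<your-key>"

# One client per preferred-region ordering; dispatch each stored-procedure call
# (or query) through whichever client should try its first region first.
client_a = CosmosClient(ENDPOINT, credential=KEY,
                        preferred_locations=["West Europe", "France Central"])
client_b = CosmosClient(ENDPOINT, credential=KEY,
                        preferred_locations=["France Central", "West Europe"])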
However, unless your apps are deployed in those different regions and the latency is lower, there is no point in doing so, since Cosmos DB will be able to cope with all the requests in a single region as long as you have the RUs needed.
