I have a custom Prometheus exporter that is meant for extreme "hermeticity": it has to operate at all times, even when there is no network connection, for a variety of reasons.
Normally, a main Prometheus instance scrapes the nodes running this exporter. For the case when the network goes out, the team added functionality to the exporter to periodically dump the metrics to a text file, so that no crucial data is lost.
Now I have many hours of metrics from several nodes in text files, and I want to be able to query them. I checked whether the prometheus_client package in Python had any way to query them, but the closest thing I found was parsing the text-formatted metrics into gauge/counter objects in Python; if I wanted to query them, I would have to implement something myself.
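For reference, the parsing route I found looks roughly like this (just a sketch, and the dump file name is made up):

from prometheus_client.parser import text_string_to_metric_families

# Parse one of the dumped text files into metric families.
with open('node1_metrics_dump.txt') as f:  # hypothetical dump file
    for family in text_string_to_metric_families(f.read()):
        for sample in family.samples:
            # Any "querying" on top of these samples would have to be hand-rolled.
            print(sample.name, sample.labels, sample.value)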
I've searched for available solutions, but the only way to query Prometheus I found was through its API, which would require me to push the metrics into the main Prometheus instance first.
I don't have direct access to the main Prometheus instance, so I can't just write a quick script to push the metrics into it.
Finally, my question is: How can I perform PromQL queries on Prometheus text-formatted metrics in a text file? Is there an available solution, or do we have to implement something similar ourselves?
I currently have a task to monitor completely independent Airflow instances running on multiple customer servers. All of them have similar DAGs. I have to combine the metrics from all these instances and monitor them. Can anyone please suggest an approach for this?
I tried using Prometheus with multiple statsd exporters and Grafana dashboards, but I can't seem to get it working.
The best approach is to use tags to differentiate the metrics of the different instances. Unfortunately, Airflow doesn't push tags for Prometheus; it only does so for Datadog, and will soon (starting from 2.6.0) for InfluxDB.
If you can use one of these services, you can use the same stats prefix and add statsd tags to distinguish the different Airflow instances:
# instance 1:
statsd-datadog-tags=instance:1
# instance 2:
statsd-datadog-tags=instance:2
Then you will be able to put the metrics on the same monitor/dashboard, and group and filter them by instance name.
Also, in 2.6.0 we will support removing the variables from the metric names and sending them as tags, so you will be able to create one dashboard for all your DAGs/tasks and group by tags to monitor them better.
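To illustrate what the tags buy you: below is roughly what a tagged DogStatsD metric looks like when emitted from Python. This is only a sketch of the mechanism; Airflow emits its own metrics once the Datadog statsd options are enabled, and the metric name here is illustrative.

from datadog import initialize, statsd

initialize(statsd_host='localhost', statsd_port=8125)

# Same metric name from two instances, distinguished only by the tag,
# so a dashboard can group and filter on 'instance'.
statsd.increment('airflow.dag_run.success', tags=['instance:1'])
statsd.increment('airflow.dag_run.success', tags=['instance:2'])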
Are there any really good articles breaking down how to persist data into DynamoDB from Alexa? I can't seem to find a good article that breaks down, step by step, how to persist a slot value into DynamoDB. The Alexa docs here cover implementing the code in Python, but that seems to be only part of what I'm looking for.
There's really no comprehensive breakdown of this, like this tutorial, which persists data to S3. I would like to find something similar for DynamoDB. If a previous question has already answered this, let me know and I can mark it as a duplicate.
You can just use a tutorial which uses Python and AWS Lambda.
Like this one:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GettingStarted.Python.html
The Amazon article is more about the development kit, which gives you some nice features to store persistent attributes for a user.
So usually I have a persistent store for users (game scores, ..., last use of the skill, whatever) and additional data in another table.
The persistence adapter has an interface spec that abstracts away most of the details operationally. You should be able to change persistence adapters by initializing one that meets the spec, and in the initialization there may be some different configuration options. But the way you put things in and get them out should remain functionally the same.
You can find the configuration options for S3 and DynamoDB here: https://developer.amazon.com/en-US/docs/alexa/alexa-skills-kit-sdk-for-python/manage-attributes.html
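For illustration, initializing the DynamoDB adapter and handing it to the skill builder might look roughly like this; the table name is hypothetical, and the full set of constructor options is in the docs linked above:

import boto3
from ask_sdk_dynamodb.adapter import DynamoDbAdapter
from ask_sdk_core.skill_builder import CustomSkillBuilder

# create_table=True asks the adapter to create the table on first use
# (this needs the appropriate IAM permissions).
dynamodb_adapter = DynamoDbAdapter(
    table_name='MySkillAttributes',   # hypothetical table name
    create_table=True,
    dynamodb_resource=boto3.resource('dynamodb'))

sb = CustomSkillBuilder(persistence_adapter=dynamodb_adapter)

# Inside an intent handler, persisting a slot value would then be roughly:
#   attrs = handler_input.attributes_manager.persistent_attributes
#   attrs['favorite_color'] = slot_value
#   handler_input.attributes_manager.save_persistent_attributes()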
I have written a "local persistence adapter" in JavaScript to let me store values in flat files at localhost instead of on S3 when I'm doing local dev/debug. Swapping the two out (depending on environment) is all handled at adapter initialization. My handlers that use the attributes manager don't change.
Assume this scenario:
We analyze the data, train some machine learning models using whatever tool we have at hand, and save those models. This is done in Python, using the Apache Spark Python shell and API. We know Apache Spark is good at batch processing, hence a good choice for the above scenario.
Now, going into production, for each request we need to return a response that also depends on the output of the trained model. This is, I assume, what people call stream processing, and Apache Flink is usually recommended for it. But how would you use the same models, trained with the tools available in Python, in a Flink pipeline?
The micro-batch mode of Spark wouldn't work here, since we really need to respond to each request, and not in batches.
I've also seen some libraries that try to do machine learning in Flink, but they don't satisfy the needs of people who have diverse tooling in Python rather than Scala, and who may not even be familiar with Scala.
So the question is, how do people approach this problem?
This question is related, but not a duplicate, since the author there explicitly mentions using Spark MLlib. That library runs on the JVM and has more potential to be ported to other JVM-based platforms. But here the question is how people would approach it if, let's say, they were using scikit-learn, GPy, or whatever other method/package they use.
I needed a way to create a custom Transformer for an ML Pipeline and have that custom object saved/loaded along with the rest of the pipeline. This led me to digging into the rather ugly depths of Spark model serialisation/deserialisation. In short, it looks like all the Spark ML models have two components, metadata and model data, where the model data is whatever parameters were learned during .fit(). The metadata is saved in a directory called metadata under the model save dir and, as far as I can tell, is JSON, so that shouldn't be an issue. The model parameters themselves seem to be saved as a parquet file in the save dir. This is the implementation for saving an LDA model:
override protected def saveImpl(path: String): Unit = {
  DefaultParamsWriter.saveMetadata(instance, path, sc)
  val oldModel = instance.oldLocalModel
  val data = Data(instance.vocabSize, oldModel.topicsMatrix, oldModel.docConcentration,
    oldModel.topicConcentration, oldModel.gammaShape)
  val dataPath = new Path(path, "data").toString
  sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
}
Notice the sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath) on the last line. The good news is that you could load the saved file into your web server, and if the server is in Java/Scala you'll just need to keep the Spark jars on the classpath.
If, however, you're using, say, Python for the web server, you could use a parquet library for Python (https://github.com/jcrobak/parquet-python). The bad news is that some or all of the objects in the parquet file are going to be binary Java dumps, so you can't actually read them in Python. A few options come to mind: use Jython (meh), or use Py4J to load the objects; this is what PySpark uses to communicate with the JVM, so it could actually work. I wouldn't expect it to be exactly straightforward though.
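For illustration, here is a minimal sketch of peeking at the saved model data from Python, using pyarrow rather than the parquet-python library linked above; the model path is made up, and as noted, some columns may turn out to be opaque Java-serialized blobs:

import pyarrow.parquet as pq

# Hypothetical path produced by model.save('/models/lda')
table = pq.read_table('/models/lda/data')
print(table.schema)      # names/types of the learned parameters
df = table.to_pandas()   # plain numeric columns come out fine;
print(df.head())         # vector/matrix columns may not be usable directly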
Or from the linked question use jpmml-spark and hope for the best.
Have a look at MLeap.
We have had some success externalizing the model learned on Spark into separate services that provide predictions on new incoming data. We externalized the LDA topic modelling pipeline, albeit in Scala. But they do have Python support, so it's worth looking at.
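As a rough illustration of the externalized-service idea (not MLeap itself), a minimal per-request prediction service in Python might look like the sketch below. It assumes a scikit-learn model saved with joblib; the file name and route are hypothetical, and the streaming pipeline (Flink or otherwise) would simply call this service for each incoming event.

from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load('model.joblib')  # hypothetical path to the trained model

@app.route('/predict', methods=['POST'])
def predict():
    # Expects a JSON body like {"features": [[1.0, 2.0, 3.0]]}
    features = request.get_json()['features']
    prediction = model.predict(features).tolist()
    return jsonify({'prediction': prediction})

if __name__ == '__main__':
    app.run(port=8080)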
Recently the Cloud Dataflow Python SDK was made available, and I decided to use it. Unfortunately, support for reading from Cloud Datastore is yet to come, so I have to fall back on writing a custom source so that I can utilize the benefits of dynamic splitting, progress estimation, etc., as promised. I studied the documentation thoroughly, but I am unable to put the pieces together to speed up my entire process.
To be more clear, my first approach was:
querying Cloud Datastore
creating a ParDo function and passing the returned query to it
But with this approach, it took 13 minutes to iterate over 200k entries.
So I decided to write a custom source that would read the entities efficiently, but I'm unable to achieve that due to my lack of understanding of how to put the pieces together. Can anyone please help me with how to create a custom source for reading from Datastore?
Edited:
For the first approach, the link to my gist is:
https://gist.github.com/shriyanka/cbf30bbfbf277deed4bac0c526cf01f1
Thank you.
In the code you provided, the access to Datastore happens before the pipeline is even constructed:
query = client.query(kind='User').fetch()
This executes the whole query and reads all entities before the Beam SDK gets involved at all.
More precisely, fetch() returns a lazy iterable over the query results, and they get iterated over when you construct the pipeline, at beam.Create(query) - but, once again, this happens in your main program, before the pipeline starts. Most likely, this is what's taking 13 minutes, rather than the pipeline itself (but please feel free to provide a job ID so we can take a deeper look). You can verify this by making a small change to your code:
query = list(client.query(kind='User').fetch())
However, I think your intention was to both read and process the entities in parallel.
For Cloud Datastore in particular, the custom source API is not the best choice to do that. The reason is that the underlying Cloud Datastore API itself does not currently provide the properties necessary to implement the custom source "goodies" such as progress estimation and dynamic splitting, because its querying API is very generic (unlike, say, Cloud Bigtable, which always returns results ordered by key, so e.g. you can estimate progress by looking at the current key).
We are currently rewriting the Java Cloud Datastore connector to use a different approach, which uses a ParDo to split the query and a ParDo to read each of the sub-queries. Please see this pull request for details.
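For reference, the same split-then-read pattern in the Python SDK would look roughly like this. This is only a sketch: the 'shard' property used to split the query is hypothetical (the real connector splits on key ranges), and error handling and batching are omitted.

import apache_beam as beam
from google.cloud import datastore

KIND = 'User'  # kind taken from the query in the question

class ReadShard(beam.DoFn):
    # Reads one sub-query; the runner processes shards in parallel.
    def process(self, shard):
        client = datastore.Client()
        query = client.query(kind=KIND)
        # Hypothetical split: assumes entities carry a numeric 'shard'
        # property that partitions them evenly.
        query.add_filter('shard', '=', shard)
        for entity in query.fetch():
            yield entity

with beam.Pipeline() as p:
    entities = (p
                | 'SplitQuery' >> beam.Create(list(range(16)))
                | 'ReadEntities' >> beam.ParDo(ReadShard()))
    # ... downstream processing of 'entities' goes here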
I would like to take data from the Facebook Graph API and analyze it to find out roughly how close one person is to another. I am attempting to use the Pylons framework with SqlAlchemy (right now it is attached to a SQLite database) to store information from the Graph API so that I can make it available to my other applications via a RESTful web service. I am wondering what would be the best approach to analyzing the data.
For example, should I create objects analogous to the nodes and edges in the Graph API (users, posts, statuses, etc.) and analyze them, then store only the aftermath of that analysis in the database, perhaps the UIDs of each node and its connections to other nodes? Or should I store even less, and only have a database of the users and their close friends? Or should I go through step by step and store each of the objects via the ORM mapper in the database and make the analysis from the database after having filled it?
What sorts of concerns go into the designing of a database in situations like this? How should objects relate/map to the model? Where should the analysis be taking place during the whole process of grabbing data and storing it?
I'd store as much as possible; dump everything you can, and try to maintain the relationships between nodes so you can traverse/analyze them later. This gives you the opportunity to analyze your data set as much as you want, over and over, and to try different things.
If you want to use SQLAlchemy, you could use a simple self-referential relationship: http://www.sqlalchemy.org/docs/05/mappers.html#adjacency-list-relationships. That way you can maintain the connections between objects and traverse them easily.
You should also think about using MongoDB. It's pretty nice for this sort of thing: you can pretty much just dump the JSON responses you get from Facebook into MongoDB, and it also has a great Python client. Here are the MongoDB docs on storing a tree in MongoDB: http://www.mongodb.org/display/DOCS/Trees+in+MongoDB. There are a couple of approaches there that make sense.
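As a rough sketch of the self-referential (adjacency list) idea mentioned above, assuming SQLAlchemy 1.4+ and purely hypothetical table/column names:

from sqlalchemy import Column, Integer, String, ForeignKey, create_engine
from sqlalchemy.orm import declarative_base, relationship, backref, Session

Base = declarative_base()

class Node(Base):
    # Hypothetical node table: one row per Graph API object (user, post, ...),
    # with a self-referential foreign key to model connections as a tree.
    __tablename__ = 'nodes'
    id = Column(Integer, primary_key=True)
    fb_uid = Column(String, unique=True)
    parent_id = Column(Integer, ForeignKey('nodes.id'))
    children = relationship('Node', backref=backref('parent', remote_side=[id]))

engine = create_engine('sqlite:///graph.db')
Base.metadata.create_all(engine)

with Session(engine) as session:
    root = Node(fb_uid='100001')
    friend = Node(fb_uid='100002')
    root.children.append(friend)   # traverse later via node.children / node.parent
    session.add(root)
    session.commit()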