How to find notebooks attached to a cluster in Databricks using API? - python

I run an overnight job terminating all running clusters in Azure Databricks. As each cluster might be used by multiple people, I want to find out programmatically which notebooks are attached to each running cluster.
I use the Python databricks-api package (https://github.com/crflynn/databricks-api), but I'm not against using the REST API directly if necessary.
dbx_env.cluster.get_cluster(cluster_id)

There is no explicit API for that, so it's not so straightforward. One possible approach would be to analyze the audit log for attachNotebook and detachNotebook events and decide whether a cluster is in use. But this method may not be reliable, as events appear with a delay, plus you need a job that analyzes the audit log.
A simpler solution would be to enforce an auto-termination time on all interactive clusters - that way they are terminated automatically when nobody uses them. You can either:
enforce that through cluster policies
have a script that goes through the list of clusters and checks the auto-termination time, setting it to something like 30 or 60 minutes (see the sketch after this list)
monitor create & edit events in the audit log, and correct clusters that have no, or a very high, auto-termination time
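For the script option, a minimal sketch using the databricks-api package from the question; the workspace host, token placeholder and the 60-minute limit are assumptions, not an established policy:

# list interactive clusters and flag those with no (or too high an) auto-termination time
from databricks_api import DatabricksAPI

dbx_env = DatabricksAPI(host="https://<workspace>.azuredatabricks.net", token="<personal-access-token>")

MAX_AUTOTERMINATION_MINUTES = 60

for cluster in dbx_env.cluster.list_clusters().get("clusters", []):
    if cluster.get("cluster_source") == "JOB":
        continue  # job clusters terminate on their own; only interactive clusters matter here
    minutes = cluster.get("autotermination_minutes", 0)
    if minutes == 0 or minutes > MAX_AUTOTERMINATION_MINUTES:
        print(f"{cluster['cluster_id']} ({cluster.get('cluster_name')}): autotermination_minutes={minutes}")
        # to fix it, call dbx_env.cluster.edit_cluster(...) with the full cluster spec
        # and autotermination_minutes=MAX_AUTOTERMINATION_MINUTES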

Related

Run & scale simple python scripts on Google Cloud Platform

I have a simple Python script, and I would like to run thousands of instances of it on GCP (at the same time). The script is triggered by the $Universe scheduler with something like "python main.py --date '2022_01'".
What architecture and technology do I have to use to achieve this?
PS: I cannot drop $Universe, but I'm open to suggestions to use other technologies.
My solution:
I already have a $Universe server running all the time.
Create a Pub/Sub topic
Create a permanent Compute Engine instance that listens to Pub/Sub all the time
$Universe sends thousands of events to Pub/Sub
The Compute Engine instance triggers the creation of a Python Docker image on another Compute Engine instance
Scale the creation of the Docker images (I don't know how to do this)
Is it a good architecture?
How to scale this kind of process?
Thank you :)
It can be very difficult to discuss architecture and design questions, as they usually depend heavily on the context, scope, functional and non-functional requirements, cost, available skills and knowledge, and so on...
Personally, I would prefer to stay with an entirely serverless approach if possible.
For example, use Cloud Scheduler (serverless cron jobs) to send messages to a Pub/Sub topic, with a Cloud Function (or something else) on the other side that is triggered by each message.
Whether it should be a Cloud Function or something else, and what exactly it should do and how, depends on your case.
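As an illustration of that shape, a minimal sketch of a Pub/Sub-triggered Cloud Function using the 1st-gen background-function signature; the function name, topic and payload format are assumptions:

# main.py - Cloud Function triggered by a Pub/Sub message
# deployable with e.g.: gcloud functions deploy run_job --runtime=python310 --trigger-topic=jobs
import base64


def run_job(event, context):
    # the Pub/Sub message body arrives base64-encoded in event["data"]
    payload = base64.b64decode(event["data"]).decode("utf-8") if "data" in event else ""
    # e.g. payload could carry the --date value that $Universe would otherwise pass on the CLI
    print(f"Processing job for: {payload}")
    # ... run the existing main.py logic here ...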
As I understand it, you will have a lot of simultaneous calls to custom Python code triggered by an orchestrator ($Universe), and you want to run it on the GCP platform.
Like #al-dann, I would go with a serverless approach in order to reduce the cost.
As I also understand it, Pub/Sub does not seem necessary; you could easily trigger the function from any HTTP call and avoid Pub/Sub.
Pub/Sub is only needed to get some guarantee (at-least-once processing), but you can have the same behaviour if $Universe validates the HTTP request for every call (look at the HTTP response code & body and retry if they don't match the expected result).
If you want exactly-once processing, you will need more tooling; you are close to event streaming (which could be a good use case, as I also understand it). In that case, staying fully on GCP, I would go with Pub/Sub & Dataflow, which can guarantee exactly-once processing, or with Kafka & Kafka Streams or Flink.
If at-least-once processing is fine for you, I would go with the HTTP version, which I think will be simpler to maintain. You have 3 serverless options for that case:
App Engine standard: scales to 0, you pay for CPU usage; it can be more affordable than the function below if the requests are constrained to a short period (a few hours per day, since the same hardware will process many requests)
Cloud Functions: you pay per request (+ CPU, memory, network, ...) and don't have to think about anything other than code, but the code runs on a proprietary solution.
Cloud Run: my preferred one, since it has the same pricing as Cloud Functions but you gain portability; the application is a simple Docker image that you can move easily (to Kubernetes, Compute Engine, ...) and you can change the execution engine depending on cost (if the load changes between the study and the real world). A minimal handler sketch follows below.
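A rough sketch of such an HTTP service, suitable for Cloud Run, which only requires listening on $PORT; the /run endpoint is an assumption, and the date parameter mirrors the question's CLI call:

# app.py - minimal HTTP entry point that Cloud Run (or App Engine) could host
import os

from flask import Flask, request

app = Flask(__name__)


@app.route("/run", methods=["POST"])
def run():
    date = request.args.get("date", "")  # e.g. "2022_01", as in "python main.py --date '2022_01'"
    # ... run the same logic as main.py here ...
    return f"done for {date}", 200  # a non-200 response lets $Universe retry the call


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))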

Is Jenkins the right tool for real time streaming pipeline?

Say I would like to scrape JSON-format information from a URL that refreshes every 2 minutes. I need to run my pipeline (written in Python) every 2 minutes to capture the data without missing anything. In this case the pipeline is real-time processing.
Currently, I'm using Jenkins to run the pipeline every 2 minutes, but I don't think this is correct practice: Jenkins is meant for CI/CD pipelines, and I doubt mine is a CI/CD pipeline. Even though I know there is a Jenkins pipeline plugin, I still think using the plugin is conceptually incorrect.
So, what is the best data processing tool in this case? At this point, no data transformation is needed. I believe that in the future, as the process gets more complex, transformation will be required. Just FYI, the data will be pumped into Azure Blob Storage.
Jenkins is meant for CI/CD pipelines and not for scheduling jobs. So yes, you are right, Jenkins isn't the right tool for what you want to achieve.
I suppose you are using Microsoft Azure.
If you are using Databricks, use a Spark Streaming application and create the job.
https://learn.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/jobs
The other approach is to use Azure Data Factory with a Schedule Trigger for the job. As you want to run every 2 minutes, I'm not sure what the cost involved will be.
As mentioned, Jenkins probably should not be used as a distributed cron system... Let it do what it's good at - CI.
You can boot any VM and create a cron script, or you can use a scheduler like Airflow.
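If you go the Airflow route, a minimal sketch of a DAG that runs a Python task every 2 minutes; the scrape_and_upload callable and its body are placeholders:

# dags/scrape_every_2_minutes.py - Airflow 2.x sketch
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def scrape_and_upload():
    # placeholder: fetch the JSON from the URL and write it to Azure Blob Storage
    pass


with DAG(
    dag_id="scrape_every_2_minutes",
    start_date=datetime(2022, 1, 1),
    schedule_interval=timedelta(minutes=2),
    catchup=False,
) as dag:
    PythonOperator(task_id="scrape_and_upload", python_callable=scrape_and_upload)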
FWIW, "every 2 minutes" is not "real time"
Just FYI, the data will pump into azure blob storage.
If you are set on using Kafka (via Event Hub), you could use Kafka Connect to scrape the site, send data into a Kafka topic, then use a sink connector to send data to WASB

Create a notification system using Python Flask for an IoT application

I have created a Flask-based IoT application where the devices send data regularly via a REST API, and that data is stored in a DB.
I want to develop a notification system that sends a notification to the mobile app whenever the threshold for a particular device is exceeded.
The threshold and the time window for each device are stored in the DB.
Example:
if the temperature of device x over the last 5 minutes is greater than 30 °C, then send a notification to the user.
What would be the best approach to solve this using Python ?
Currently I am using Celery Beat, running a task every 1 second that reads the device data and the threshold configured by the user from the database and, based on the value, sends the notification to the app via PyFCM.
I don't feel this method would be scalable in the long run.
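For context, a minimal sketch of the kind of Celery Beat schedule described above; the broker URL, module name and task body are placeholders:

# tasks.py - the current polling approach: a beat-scheduled task every second
from celery import Celery

app = Celery("iot", broker="redis://localhost:6379/0")  # broker URL is an assumption

app.conf.beat_schedule = {
    "check-thresholds": {
        "task": "tasks.check_thresholds",
        "schedule": 1.0,  # seconds, i.e. "every 1 sec" as in the question
    },
}


@app.task
def check_thresholds():
    # placeholder: read device readings + thresholds from the DB,
    # compare the configured time window, and push via PyFCM when exceeded
    pass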
That is not a particularly "python" or "flask" thing you're asking about.
That's an architectural thing. And, after that, a design thing.
So, architecturally, typical IoT data are time series of measurement values tagged with attributes that distinguish them by origin, group, flavor, you name it.
As the time series are established (received from devices & stored in a DB), you typically want to process them with statistical methods to build functions ready for analysis.
That, in a nutshell, is the architectural disposition.
Now, having all that, we can consider what the best design solution could be.
Ideally, that would be a tool suited for storing time series & equipped with statistical analysis tools as well - in the perfect case, having notification adapters/transport too (or it could be a bunch of separate tools that can easily be bound together).
Fortunately, these already exist: look up time series databases.
My personal preference is InfluxDB, but others exist, like Prometheus, the recently launched AWS service, etc. Google them all.
Virtually all of them are equipped with statistical analysis tools of different kinds (in InfluxDB, for instance, you can do it with the query language right away, plus there's a powerful stream/batch processor that goes even further).
Most of them come with notification instruments as well (e.g. the aforementioned Influx stream processor has built-in notification actions on statistical events).
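For example, the "device x above 30 °C over the last 5 minutes" rule could be evaluated with a single query instead of per-second polling; a sketch using the InfluxDB 2.x Python client, where the bucket, measurement and tag names are assumptions:

# evaluate the threshold rule against a time series database
from influxdb_client import InfluxDBClient

client = InfluxDBClient(url="http://localhost:8086", token="<token>", org="<org>")

flux = '''
from(bucket: "iot")
  |> range(start: -5m)
  |> filter(fn: (r) => r._measurement == "temperature" and r.device == "x")
  |> mean()
'''

for table in client.query_api().query(flux):
    for record in table.records:
        if record.get_value() > 30:
            # placeholder: trigger the PyFCM push notification here
            print("threshold exceeded:", record.get_value())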
So if you're looking for a long-term/scalable solution (what's the expected scale, BTW?) - that's one of the best options you've got. It will take some effort/refactoring though.
And of course you can always implement it with the tools you've got.
Your "polling workers" solution seems perfectly fine for now; next you can go for statistical libraries specific to your platform (not familiar with Python, sorry, and your DB isn't known either - so I can't give specific advice on which to use) and you'll be fine.

AWS: containerised serverless solutions

This is more of an architectural question. If I should be asking this question elsewhere, please let me know and I shall.
I have a use case where I need to run the same Python script (which could be long-running) multiple times based on demand, passing different parameters to each run. The trigger for running the script is external, and it should be able to pass a STRING parameter to the script along with the resource configuration each run requires. We plan to start using AWS by next month, and I was going through the various options I have. AWS Batch and Fargate both seem like feasible options given my requirements of not having to manage the infrastructure, dynamic spawning of jobs, and management of jobs via the Python SDK.
The only problem is that neither of these services is available in India, and I need to have my processing servers physically in India. What options do I have? Auto-scaling and Python SDK management (creation and deletion of tasks) are my main requirements (preferably containerized).
Why are you restricted to India? Often restrictions are due to data residency, in which case you can just store your data on Indian servers (S3, DynamoDB, etc.) and then run your 'program' in another AWS region.
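A rough sketch of that split with boto3, keeping the data in the Mumbai region while submitting the container job in a region where AWS Batch is available; the bucket, queue and job definition names are placeholders:

# data stays in ap-south-1 (Mumbai); the Batch job runs elsewhere
import boto3

s3 = boto3.client("s3", region_name="ap-south-1")
s3.put_object(Bucket="my-indian-data-bucket", Key="inputs/job-42.json", Body=b"{}")

batch = boto3.client("batch", region_name="ap-southeast-1")  # any region with AWS Batch
batch.submit_job(
    jobName="my-script-run-42",
    jobQueue="my-job-queue",
    jobDefinition="my-python-script:1",
    containerOverrides={
        "environment": [{"name": "PARAM", "value": "some-string-parameter"}],
    },
)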

HttpIO - Consuming external resources in a Dataflow transform

I'm trying to write a custom Source using the Python Dataflow SDK to read JSON data in parallel from a REST endpoint.
E.g. for a given set of IDs, I need to retrieve the data from:
https://foo.com/api/results/1
https://foo.com/api/results/2
...
https://foo.com/api/results/{maxID}
The key features I need are monitoring & rate limiting: even though I need parallelism (either thread/process-based or using async/coroutines), I also need to make sure my job stays "polite" towards the API endpoint - effectively avoiding an involuntary DDoS.
Using psq, I should be able to implement some kind of rate-limit mechanism, but then I'd lose the ability to monitor progress & ETA using the Dataflow Service Monitoring.
It seems that, although they work well together, monitoring isn't unified between Google Cloud Dataflow and Google Cloud Pub/Sub (which uses Google Stackdriver Monitoring).
How should I go about building a massively parallel HTTP consumer workflow that implements rate limiting and has web-based monitoring?
Dataflow does not currently have a built-in way of doing global rate-limiting, but you can use the Source API to do this. The key concept is that each split of a Source will be processed by a single thread at most, so you can implement local rate limiting separately for each split.
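To illustrate the per-split idea without the full Source API boilerplate, a sketch of a local limiter used by the code that reads one split; the requests-per-second value is an assumption, and with N splits the effective global rate is roughly N times the per-split rate:

# each split is read by a single thread, so a local limiter per split is enough
import time

import requests


class LocalRateLimiter:
    """Allows at most `rate_per_sec` calls per second within one split/thread."""

    def __init__(self, rate_per_sec):
        self.min_interval = 1.0 / rate_per_sec
        self.last_call = 0.0

    def wait(self):
        elapsed = time.monotonic() - self.last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_call = time.monotonic()


def read_split(id_range, rate_per_sec=2):
    limiter = LocalRateLimiter(rate_per_sec)
    for result_id in id_range:
        limiter.wait()
        yield requests.get(f"https://foo.com/api/results/{result_id}").json()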
This solution does not use Pub/Sub at all, so you can exclusively use the Dataflow Monitoring UI. If you want to set up alerts based on particular events in your pipeline, you could do something like this
