On-Premise MLOps Pipeline Stack - Python

My goal is to build an MLOps pipeline that is 100% independent of cloud services like AWS, GCP, and Azure. I have a project for a client with a production factory and would like to build a camera-based object-tracking ML service for them. I want to run this pipeline on my own server (an on-premise machine). I am really confused about which stack I should use; I keep ending up with solutions built around cloud components. It would be great to get advice on which components I can use, preferably open source.

Assuming your main objective is a 100% cloud-free MLOps pipeline, you can build one mostly with open-source tech. All of the following can be installed on-prem, without cloud services.
For Training: You can use whatever you want. I'd recommend PyTorch because it plays nicer with some of the following suggestions, but TensorFlow is also a popular choice.
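As an illustration (not part of the original answer), a minimal PyTorch sketch for the camera-based detection side might look like the following; the torchvision model choice and the dummy frame are placeholders:

import torch
import torchvision

# Load a pretrained detector as a starting point (placeholder choice).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

# Pretend this is one camera frame as a CHW float tensor scaled to [0, 1].
frame = torch.rand(3, 480, 640)

with torch.no_grad():
    detections = model([frame])[0]  # dict with 'boxes', 'labels', 'scores'
print(detections["boxes"].shape)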
For CI/CD: if this is going to run on-prem and you plan to retrain the model with production data, or need to trigger updates to your deployment on each code change, you can use Jenkins (open source) or CircleCI (commercial).
For Model Packaging: Chassis (open source) is the only project I am aware of for generically turning AI/ML model files into something that can run on your intended hardware. It basically takes an AI/ML model file as input and produces a Docker image as output. It's open source and supports Intel and ARM, on both CPU and GPU. The website is here: http://www.chassis.ml and the git repo is here: https://github.com/modzy/chassis
For Deployment: Chassis model containers are automatically built with internal gRPC servers and can be deployed locally as Docker containers. If you just want to stream a single source of data through them, the SDK has methods for doing that. If you want something that accepts multiple streams or auto-scales to the available resources on your infrastructure, you'll need a Kubernetes cluster with a deployment solution like Modzy or KServe. Chassis containers work out of the box with either.
KServe (https://github.com/kserve/kserve) is free, but basically just gives you a centralized processing platform hosting a bunch of copies of your running model. It doesn't allow later triage of the model's processing history.
Modzy (https://www.modzy.com/) is commercial, but also adds in all the RBAC, job history preservation, auditing, etc. Modzy also has an edge deployment feature if you want to manage your models centrally but run them in a distributed manner on the camera hardware instead of on a centralized server.
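As a rough illustration of what serving through KServe looks like (not from the original answer): once a model is deployed as an InferenceService, clients typically call its V1 REST endpoint. The host name and model name below are hypothetical.

import requests

# Hypothetical endpoint for an InferenceService named "tracker".
url = "http://tracker.default.example.com/v1/models/tracker:predict"

payload = {"instances": [[0.1, 0.2, 0.3]]}  # placeholder input
resp = requests.post(url, json=payload, timeout=10)
print(resp.json())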

As per your requirement for an on-prem solution, you can go ahead with Kubeflow (see the sketch after this list). Also use the following:
default storage class: nfs-provisioner
on-prem load balancing: MetalLB
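For illustration only, a minimal Kubeflow Pipelines sketch (assumes the kfp v1 SDK and an on-prem Kubeflow endpoint; the image name and host below are hypothetical):

import kfp
from kfp import dsl

def train_op():
    # Hypothetical training image built for the object-tracking project.
    return dsl.ContainerOp(
        name="train",
        image="registry.local/object-tracking-train:latest",
        command=["python", "train.py"],
    )

@dsl.pipeline(name="object-tracking", description="Train the tracking model on-prem")
def pipeline():
    train_op()

if __name__ == "__main__":
    client = kfp.Client(host="http://kubeflow.local/pipeline")  # hypothetical host
    client.create_run_from_pipeline_func(pipeline, arguments={})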

Related

Debug and deploy a featurizer (data processor for model inference) for a SageMaker endpoint

I am looking at this example to implement processing of incoming raw data for a SageMaker endpoint prior to model inference/scoring. This is all great, but I have two questions:
How can one debug this (e.g. can I invoke the endpoint without it being exposed as a RESTful API and then use SageMaker Debugger)?
SageMaker can be used "remotely", e.g. via VS Code. Can such a script be uploaded programmatically?
Thanks.
SageMaker Debugger only monitors training jobs:
https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html
I don't think you can use it on endpoints.
The script you have provided is used for both training and inference. The container used by the estimator decides which functions to run, so it is not possible to debug the script directly. But what are you debugging in the code: the training part or the inference part?
When creating the estimator we need to give either the entry_point or the source directory. If you use "entry_point", the value should be a relative path to the file; if you use "source_dir", you should be able to give an S3 path. So before running the estimator, you can programmatically tar the files, upload the archive to S3, and then use the S3 path in the estimator.
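For example, a rough sketch of that last step (the file names, bucket, and prefix are placeholders, not from the original answer):

import tarfile
import boto3

# Pack the training/inference code into a tarball.
with tarfile.open("sourcedir.tar.gz", "w:gz") as tar:
    tar.add("train.py")
    tar.add("featurizer.py")

# Upload the archive to S3.
s3 = boto3.client("s3")
s3.upload_file("sourcedir.tar.gz", "my-bucket", "code/sourcedir.tar.gz")

# This S3 URI can then be passed as source_dir when constructing the estimator.
source_dir = "s3://my-bucket/code/sourcedir.tar.gz"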

SageMaker experiments store

I just started using AWS SageMaker for running and maintaining models and experiments. I wanted to know whether there is any persistent layer for SageMaker from which I can get data about my experiments/models instead of looking in SageMaker Studio. Does SageMaker save the experiments, or their data such as S3 locations, in any table, something like ModelDB?
SageMaker Studio uses the SageMaker API to pull all of the data it's displaying. Essentially there's no secret API getting invoked here.
Quite a bit of what's being displayed with respect to experiments comes from Search results, with the rest coming from either List* or Describe* calls. Studio takes the results from the Search request and displays them in the table format you're seeing. When searching over the resource ExperimentTrialComponent, results that have a source (such as a training job) are enhanced with the original source's data ([result]::SourceDetail::TrainingJob) if supported (work is ongoing to add additional source detail resource types).
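For example, a minimal sketch of pulling the same experiment data yourself through boto3's search API (the experiment name in the filter is a placeholder):

import boto3

sm = boto3.client("sagemaker")

# Search trial components that belong to a given experiment.
resp = sm.search(
    Resource="ExperimentTrialComponent",
    SearchExpression={
        "Filters": [
            {"Name": "Parents.ExperimentName", "Operator": "Equals", "Value": "my-experiment"}
        ]
    },
)
for result in resp["Results"]:
    tc = result["TrialComponent"]
    print(tc["TrialComponentName"], tc.get("Metrics", []))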
All of the metadata related to resources in SageMaker is available via the APIs; there is no other location (such as S3) holding that data.
As of this time there is no effort that I'm aware of to determine whether support for SageMaker could be added to ModelDB. Given that ModelDB appears to assume it is talking to a relational database, it seems unlikely to be doable. (I only read the overview very quickly, so this might be inaccurate.)

How can I load a third-party ML model into a serverless function

I have built an ML model (using the sklearn module), and I want to serve its predictions via AWS API Gateway + a Lambda function.
My problems are:
I can't install sklearn + numpy etc. because of the Lambda size limitations (the bundle is greater than 140 MB).
Maybe this is a silly question, but do you know if there are better ways to do this task?
I've tried this tutorial in order to reduce the bundle size, but it raises an exception because of the --use-wheel flag.
https://serverlesscode.com/post/scikitlearn-with-amazon-linux-container/
import os
import pickle
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket(os.environ['BUCKET'])
model_bytes = bucket.Object(os.environ['MODEL_NAME']).get()['Body'].read()
model = pickle.loads(model_bytes)
prediction = model.predict(z_features)[0]
Where z_features are my features after applying a scaler.
Just figured it out!
The solution is basically built on top of AWS Lambda Layers.
I created an sklearn layer that contains only the relevant compiled libraries.
Then I ran sls package to build a bundle containing those files together with my own handler.py code.
The last step was to run
sls deploy --package .serverless
Hope it'll be helpful to others.
If you simply want to serve your sklearn model, you could skip the hassle of setting up a lambda function and tinkering with the API Gateway - just upload your model as a pkl file to FlashAI.io which will serve your model automatically for free. It handles high traffic environments and unlimited inference requests. For sklearn models specifically, just check out the user guide and within 5 minutes you'll have your model available as an API.
Disclaimer: I'm the author of this service

How to overcome TrainingException when training a large model with Azure Machine Learning service?

I'm training a large-ish model and am trying to use the Azure Machine Learning service in Azure Notebooks for that purpose.
I thus create an Estimator to train locally:
from azureml.train.estimator import Estimator

estimator = Estimator(source_directory='./source_dir',
                      compute_target='local',
                      entry_script='train.py')
(my train.py should load and train starting from a large word vector file).
When running with
run = experiment.submit(config=estimator)
I get
TrainingException:
====================================================================
While attempting to take snapshot of
/data/home/username/notebooks/source_dir Your total
snapshot size exceeds the limit of 300.0 MB. Please see
http://aka.ms/aml-largefiles on how to work with large files.
====================================================================
The link provided in the error is likely broken.
Contents in my ./source_dir indeed exceed 300 MB.
How can I solve this?
You can place the training files outside source_dir so that they don't get uploaded as part of submitting the experiment, and then upload them separately to the data store (which is basically using the Azure storage associated with your workspace). All you need to do then is reference the training files from train.py.
See the Train model tutorial for an example of how to upload data to the data store and then access it from the training file.
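For instance, a rough sketch with the azureml SDK (the folder names and the script argument are placeholders, not taken from the tutorial):

from azureml.core import Workspace
from azureml.train.estimator import Estimator

ws = Workspace.from_config()
ds = ws.get_default_datastore()

# Upload the large word-vector files separately, outside source_dir.
ds.upload(src_dir='./word_vectors', target_path='word_vectors', overwrite=True)

# Hand the datastore path to train.py instead of shipping the files in the snapshot.
estimator = Estimator(source_directory='./source_dir',
                      compute_target='local',
                      entry_script='train.py',
                      script_params={'--data-folder': ds.path('word_vectors').as_download()})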
After reading the GitHub issue "Encounter total Snapshot size 300MB while start logging" and the official document "Manage and request quotas for Azure resources" for the Azure ML service, I think it's an issue that will need some time for Azure to fix.
Meanwhile, I recommend trying to migrate the current work to Azure Databricks: upload your dataset and code and run them in an Azure Databricks notebook, which is hosted on an HDInsight Spark cluster, without worrying about memory or storage limits. You can refer to these samples for Azure ML on Azure Databricks.

Discovering peer instances in Azure Virtual Machine Scale Set

Problem: Given N instances launched as part of a VMSS, I would like my application code on each Azure instance to discover the IP addresses of the other peer instances. How do I do this?
The overall intent is to cluster the instances so as to provide active-passive HA or keep the configuration in sync.
It seems there is some support for REST-API-based querying: https://learn.microsoft.com/en-us/rest/api/virtualmachinescalesets/
I would like to know other ways to do it, e.g. the Python SDK or the instance metadata URL.
The REST API you mentioned has a Python SDK, the "azure-mgmt-compute" client:
https://learn.microsoft.com/python/api/azure.mgmt.compute.compute.computemanagementclient
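As a rough sketch (not from the original answer), the companion "azure-mgmt-network" client can list the private IPs of all instances in the scale set; the credentials, resource group, and scale set name below are placeholders:

from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.network import NetworkManagementClient

# Placeholder credentials and identifiers.
credentials = ServicePrincipalCredentials(client_id='...', secret='...', tenant='...')
subscription_id = '...'

network_client = NetworkManagementClient(credentials, subscription_id)

# List the NICs attached to every VM in the scale set and print their private IPs.
nics = network_client.network_interfaces.list_virtual_machine_scale_set_network_interfaces(
    'my-resource-group', 'my-vmss')
for nic in nics:
    for ip_config in nic.ip_configurations:
        print(ip_config.private_ip_address)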
One way to do this would be to use instance metadata. Right now instance metadata only shows information about the VM it's running on, e.g.
curl -H Metadata:true "http://169.254.169.254/metadata/instance/compute?api-version=2017-03-01"
{"compute":
{"location":"westcentralus","name":"imdsvmss_0","offer":"UbuntuServer","osType":"Linux","platformFaultDomain":"0","platformUpdateDomain":"0",
"publisher":"Canonical","sku":"16.04-LTS","version":"16.04.201703300","vmId":"e850e4fa-0fcf-423b-9aed-6095228c0bfc","vmSize":"Standard_D1_V2"},
"network":{"interface":[{"ipv4":{"ipaddress":[{"ipaddress":"10.0.0.4","publicip":"52.161.25.104"}],"subnet":[{"address":"10.0.0.0","dnsservers":[],"prefix":"24"}]},
"ipv6":{"ipaddress":[]},"mac":"000D3AF8BECE"}]}}
You could do something like have each VM send the info to a listener on VM#0, or to an external service, or you could combine this with Azure Files and have each VM output to a common share. There's an Azure template proof of concept which outputs information from each VM to an Azure File share: https://github.com/Azure/azure-quickstart-templates/tree/master/201-vmss-azure-files-linux - every VM has a mountpoint which contains info written by every VM.
