I just started using AWS SageMaker for running and maintaining models and experiments. I just wanted to know: is there any persistent layer for SageMaker from which I can get the data of my experiments/models instead of looking into SageMaker Studio? Does SageMaker save the experiments or their data (like S3 locations) in any table, something like ModelDB?
SageMaker Studio uses the SageMaker API to pull all of the data it's displaying. Essentially, there's no secret API getting invoked here.
Quite a bit of what's displayed for experiments comes from Search results, with the rest coming from either List* or Describe* calls. Studio takes the results of the Search request and displays them in the table format you're seeing. Search results over the resource ExperimentTrialComponent that have a source (such as a training job) are enriched with the original source's data ([result]::SourceDetail::TrainingJob) if supported (work is ongoing to add additional source detail resource types).
All of the metadata related to resources in SageMaker is available via the APIs; there is no other location (in the cloud), such as S3, for that data.
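For example, the same experiment metadata Studio shows can be pulled with a few boto3 calls. A minimal sketch (the experiment name is a placeholder, and the exact fields returned depend on what the trial component's source is):

import boto3

sm = boto3.client("sagemaker")

# The same hierarchy Studio lists: experiments -> trials -> trial components
for exp in sm.list_experiments()["ExperimentSummaries"]:
    print(exp["ExperimentName"])

# Parameters, metrics and input/output artifacts (e.g. S3 locations)
# for the components of one experiment; "my-experiment" is a placeholder
components = sm.list_trial_components(ExperimentName="my-experiment")
for summary in components["TrialComponentSummaries"]:
    detail = sm.describe_trial_component(
        TrialComponentName=summary["TrialComponentName"]
    )
    print(detail.get("Parameters"), detail.get("OutputArtifacts"))

# The Search API is what Studio's experiments table is built on
results = sm.search(Resource="ExperimentTrialComponent", MaxResults=50)
print(len(results["Results"]))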
As of this time, I'm not aware of any effort to determine whether support for SageMaker could be added to ModelDB. Given that ModelDB appears to assume it's talking to a relational database, it seems unlikely to be doable. (I only skimmed the overview, so this might be inaccurate.)
Related
My goal is to build an MLOps pipeline that is 100% independent of cloud services like AWS, GCP, and Azure. I have a project for a client in a production factory and would like to build a camera-based object tracking ML service for them. I want to build this pipeline on my own server (an on-premise computer). I am really confused about which stacks I should use; I keep ending up with a cloud-component-based solution. It would be great to get some advice on which components I can use, preferably open source.
Assuming your main objective is to build a 100% no-cloud MLOps pipeline, you can do that with mostly open-source tech. All of the following can be installed on-prem / without cloud services.
For Training: You can use whatever you want. I'd recommend PyTorch because it plays more nicely with some of the following suggestions, but TensorFlow is also a popular choice.
For CI/CD: If this is going to be on-prem and you are going to retrain the model with production data / need to trigger updates to your deployment with each code update, you can use Jenkins (open source) or CircleCI (commercial).
For Model Packaging: Chassis (open source) is the only project I am aware of for generically turning AI/ML model files into something useful that can be run on your intended hardware. It basically takes an AI/ML model file as input and creates a Docker image as its output. It's open source and supports Intel, ARM, CPU, and GPU. The website is here: http://www.chassis.ml and the git repo is here: https://github.com/modzy/chassis
For Deployment: Chassis model containers are automatically built with internal gRPC servers and can be deployed locally as Docker containers. If you just want to stream a single source of data through them, the SDK has methods for doing that. If you want something that accepts multiple streams or auto-scales to the available resources of your infrastructure, you'll need a Kubernetes cluster with a deployment solution like Modzy or KServe. Chassis containers work out of the box with either.
KServe (https://github.com/kserve/kserve) is free, but basically just gives you a centralized processing platform hosting a bunch of copies of your running model. It doesn't allow later triage of the model's processing history.
Modzy (https://www.modzy.com/) is commercial, but also adds in all the RBAC, job history preservation, auditing, etc. Modzy also has an edge deployment feature if you want to manage your models centrally but run them in a distributed manner on the camera hardware instead of on a centralized server.
As per your requirement for an on-prem solution, you may go ahead with Kubeflow. Also use the following:
default storage class: nfs-provisioner
on-prem load balancing: MetalLB
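If you go the Kubeflow route, a pipeline is just Python code compiled to a manifest that you upload to the on-prem cluster. A minimal sketch using the kfp v1 SDK (the image names and step scripts are placeholders for your own):

import kfp
from kfp import dsl

@dsl.pipeline(name="object-tracking-train", description="On-prem training pipeline")
def training_pipeline():
    # Each step runs as a container on the on-prem Kubernetes cluster;
    # the images below are placeholders for images you build yourself.
    prep = dsl.ContainerOp(
        name="prepare-data",
        image="registry.local/prepare-data:latest",
        command=["python", "prepare.py"],
    )
    train = dsl.ContainerOp(
        name="train-model",
        image="registry.local/train:latest",
        command=["python", "train.py"],
    )
    train.after(prep)

if __name__ == "__main__":
    kfp.compiler.Compiler().compile(training_pipeline, "pipeline.yaml")

The compiled pipeline.yaml can then be uploaded through the Kubeflow Pipelines UI or API on your cluster.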
In the API docs for kedro.io and kedro.contrib.io I could not find any info about how to read/write data from/to network attached storage, such as a FritzBox NAS.
So I'm a little rusty on network attached storage, but:
If you can mount your network attached storage onto your OS and access it like a regular folder, then it's just a matter of providing the right filepath when writing the config for a given catalog entry. See for example: Using Python, how can I access a shared folder on windows network?
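For example, assuming the NAS share is already mounted by the OS (say as drive Z: on Windows or under /mnt/nas on Linux), a stock dataset such as pandas CSVDataSet can point straight at it; the paths below are placeholders:

from kedro.extras.datasets.pandas import CSVDataSet

# The OS handles the SMB mount; kedro just sees an ordinary path.
data_set = CSVDataSet(filepath="Z:/logs/measurements.csv")  # or "/mnt/nas/logs/measurements.csv"
df = data_set.load()
data_set.save(df)

The equivalent filepath would go into the catalog entry in conf/base/catalog.yml.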
Otherwise, if accessing the network attached storage requires anything special, you might want to create a custom dataset that uses a Python library for interfacing with your network attached storage. Something like pysmb comes to mind.
The custom dataset could borrow heavily from the logic in existing kedro.io or kedro.extras.datasets datasets, but would replace the filepath/fsspec handling code with pysmb instead.
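A rough skeleton of such a custom dataset, assuming pysmb and a CSV payload (the share name, path, server, and credentials are all placeholders):

import io

import pandas as pd
from kedro.io import AbstractDataSet
from smb.SMBConnection import SMBConnection

class SMBCSVDataSet(AbstractDataSet):
    """Reads/writes a CSV file on an SMB share (e.g. a FritzBox NAS) via pysmb."""

    def __init__(self, share, path, server, username, password):
        self._share = share          # e.g. "FRITZ.NAS"
        self._path = path            # e.g. "/logs/measurements.csv"
        self._server = server        # NAS hostname or IP
        self._username = username
        self._password = password

    def _connect(self):
        conn = SMBConnection(self._username, self._password, "kedro-client", self._server)
        conn.connect(self._server, 139)
        return conn

    def _load(self) -> pd.DataFrame:
        conn = self._connect()
        buffer = io.BytesIO()
        conn.retrieveFile(self._share, self._path, buffer)
        conn.close()
        buffer.seek(0)
        return pd.read_csv(buffer)

    def _save(self, data: pd.DataFrame) -> None:
        conn = self._connect()
        buffer = io.BytesIO(data.to_csv(index=False).encode("utf-8"))
        conn.storeFile(self._share, self._path, buffer)
        conn.close()

    def _describe(self):
        return dict(share=self._share, path=self._path, server=self._server)

Credentials would normally come from kedro's credentials.yml rather than being hard-coded in the catalog.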
I'm training a large-ish model and am trying to use the Azure Machine Learning service in Azure Notebooks for the purpose.
I thus create an Estimator to train locally:
from azureml.train.estimator import Estimator
estimator = Estimator(source_directory='./source_dir',
compute_target='local',
entry_script='train.py')
(my train.py should load and train starting from a large word vector file).
When running with
run = experiment.submit(config=estimator)
I get
TrainingException:
====================================================================
While attempting to take snapshot of
/data/home/username/notebooks/source_dir Your total
snapshot size exceeds the limit of 300.0 MB. Please see
http://aka.ms/aml-largefiles on how to work with large files.
====================================================================
The link provided in the error is likely broken.
Contents in my ./source_dir indeed exceed 300 MB.
How can I solve this?
You can place the training files outside source_dir so that they don't get uploaded as part of submitting the experiment, and then upload them separately to the datastore (which basically uses the Azure storage associated with your workspace). All you need to do then is reference the training files from train.py.
See the Train model tutorial for an example of how to upload data to the data store and then access it from the training file.
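A minimal sketch of that pattern with the v1 SDK (the folder names and the --data-folder argument are placeholders; train.py would read the argument via argparse):

from azureml.core import Workspace
from azureml.train.estimator import Estimator

ws = Workspace.from_config()
ds = ws.get_default_datastore()

# Upload the large word vector file(s) once, from a folder outside source_dir
ds.upload(src_dir='./embeddings', target_path='embeddings', overwrite=True)

# source_dir now contains only code, so the snapshot stays under the limit;
# train.py receives the datastore path as a script argument
estimator = Estimator(source_directory='./source_dir',
                      compute_target='local',
                      entry_script='train.py',
                      script_params={'--data-folder': ds.path('embeddings').as_mount()})

For a local compute target, as_download() may be the safer choice than as_mount().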
After reading the GitHub issue "Encounter total Snapshot size 300MB while start logging" and the official document Manage and request quotas for Azure resources for the Azure ML service, I think this is an open issue that will need some time for Azure to fix.
Meanwhile, I'd recommend that you try migrating the current work to another service, Azure Databricks: upload your dataset and code and run them in an Azure Databricks notebook, which is hosted on an HDInsight Spark cluster, without any worry about memory or storage limits. You can refer to these samples for Azure ML on Azure Databricks.
I have a log file being stored in Amazon S3 every 10 minutes. I am trying to access weeks and months worth of these log files and read them into Python.
I have used boto to open and read every key and append all the logs together, but it's way too slow. I am looking for an alternative solution. Do you have any suggestions?
There is no functionality on Amazon S3 to combine or manipulate files.
I would recommend using the AWS Command-Line Interface (CLI) to synchronize files to a local directory using the aws s3 sync command. This can copy files in parallel and supports multi-part transfer for large files.
Running that command regularly can bring down a copy of the files, then your app can combine the files rather quickly.
If you do this from an Amazon EC2 instance, there is no charge for data transfer. If you download to a computer via the Internet, then Data Transfer charges apply.
Your first problem is that your naive solution is probably only using a single connection and isn't making full use of your network bandwidth. You can try to roll your own multi-threading support, but it's probably better to experiment with existing clients that already do this (s4cmd, aws-cli, s3gof3r).
Once you're making full use of your bandwidth, there are then some further tricks you can use to boost your transfer speed to S3.
Tip 1 of this SumoLogic article has some good info on these first two areas of optimization.
Also, note that you'll need to modify your key layout if you hope to consistently get above 100 requests per second.
Given that a year's worth of this log file is only ~50k objects, a multi-connection client on a fast EC2 instance should be workable. However, if that's not cutting it, the next step up is EMR. For instance, you can use S3DistCp to concatenate your log chunks into larger objects that should be faster to pull down. (Or see this AWS Big Data blog post for some crazy over-engineering.) Alternatively, you can do your log processing in EMR with something like mrjob.
Finally, there's also Amazon's new Athena product that allows you to query data stored in S3 and may be appropriate for your needs.
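If you'd rather stay in Python than shell out to a separate client, here is a rough sketch of the multi-connection approach using boto3 and a thread pool (the bucket name, prefix, and worker count are placeholders):

import boto3
from concurrent.futures import ThreadPoolExecutor

BUCKET = "my-log-bucket"   # placeholder
PREFIX = "logs/2016/"      # placeholder

s3 = boto3.client("s3")

def list_keys(bucket, prefix):
    # Page through all objects under the prefix
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]

def fetch(key):
    return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()

# Download many small log objects concurrently instead of one at a time
with ThreadPoolExecutor(max_workers=32) as pool:
    chunks = list(pool.map(fetch, list_keys(BUCKET, PREFIX)))

logs = b"".join(chunks)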
I am new to AWS and am trying to port a python-based image processing application to the cloud. Our application scenario is similar to the Batch Processing scenario described here
[media.amazonwebservices.com/architecturecenter/AWS_ac_ra_batch_03.pdf]
Specifically, the steps involved are:
1. Receive a large number of images (>1000) and one CSV file containing image metadata.
2. Parse the CSV file and create a database (using DynamoDB).
3. Push images to the cloud (using S3), and push messages of the form (bucketname, keyname) to an input queue (using SQS).
4. "Pop" messages from the input queue.
5. Fetch the appropriate image data from S3, and metadata from DynamoDB.
6. Do the processing.
7. Update the corresponding entry for that image in DynamoDB.
8. Save results to S3.
9. Save a message in an output queue (SQS), which feeds the next part of the pipeline.
Steps 4-9 would involve the use of EC2 instances.
From the boto documentation and tutorials online, I have understood how to incorporate S3, SQS, and DynamoDB into the pipeline. However, I am unclear on how exactly to proceed with the EC2 part. I tried looking at some example implementations online, but couldn't figure out what the EC2 machines should do to make our batch image processing application work. The two approaches I have come across are:
1. Use a BOOTSTRAP_SCRIPT with an infinite loop that constantly polls the input queue and processes messages if available. This is what I think is being done in the Django-PDF example on the AWS blog: http://aws.amazon.com/articles/Python/3998
2. Use boto.services to take care of all the details of reading messages, retrieving and storing files in S3, writing messages, etc. This is what is used in the Monster Muck Mash-up example: http://aws.amazon.com/articles/Python/691
Which of the above methods is preferred for batch processing applications, or is there a better way? Also, for each of the above, how do I incorporate an Auto Scaling group to manage the EC2 machines based on the load in the input queue?
Any help in this regard would be really appreciated. Thank you.
You should write an application (using Python and Boto, for example) that will do the SQS polling and interact with S3 and DynamoDB.
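A rough outline of such a worker using boto3, the current SDK (the queue URLs, bucket layout, and DynamoDB table name are placeholders):

import json

import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("image-metadata")  # placeholder table name

INPUT_QUEUE = "https://sqs.us-east-1.amazonaws.com/123456789012/input"    # placeholder
OUTPUT_QUEUE = "https://sqs.us-east-1.amazonaws.com/123456789012/output"  # placeholder

while True:
    # Long polling: wait up to 20 s for a message instead of hammering the API
    resp = sqs.receive_message(QueueUrl=INPUT_QUEUE, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        job = json.loads(msg["Body"])  # assumed shape: {"bucket": ..., "key": ...}
        s3.download_file(job["bucket"], job["key"], "/tmp/input-image")

        # ... run the image processing here, writing /tmp/result ...

        table.update_item(
            Key={"key": job["key"]},
            UpdateExpression="SET #s = :s",
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={":s": "processed"},
        )
        s3.upload_file("/tmp/result", job["bucket"], job["key"] + ".result")
        sqs.send_message(QueueUrl=OUTPUT_QUEUE, MessageBody=msg["Body"])
        sqs.delete_message(QueueUrl=INPUT_QUEUE, ReceiptHandle=msg["ReceiptHandle"])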
This application must be installed at boot time on the EC2 instance. Several options are available (CloudFormation, Chef, cloud-init with user data, or a custom AMI), but I would suggest you start with user data, as described here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html
You must also ensure your instances have the proper privileges to talk to S3, SQS, and DynamoDB. You must create IAM permissions for this, then attach the permissions to a role and the role to your instance. The detailed procedure is available in the docs at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html