How do I batch process my R machine learning model? - python

I have an R script for a random forest classification model. I connected my R instance to my Redshift (Postgres) database with RPostgreSQL, then ran a series of data transformation and modeling commands using the caret library.
My model assigns a score to every user. I've been running the model manually, uploading the output to S3, and then transferring it to Redshift with \copy.
I'd like to automate the process. All the queries stay the same every time, as does the R script that builds the model, but the underlying data changes, so I want to make sure the model is calculated from the most up-to-date data before rewriting the scores table.
I was going to use AWS Lambda with a scheduled cron trigger, but the model takes more than 300 seconds to run. Instead, I'm thinking about this architecture:
1- Write a Python script that triggers the R job with a curl-style request. The request hits a REST API exposed through plumber, causing the R script to download the new data and build the model.
2- Have the R API return the new scores as a JSON object, then use the Python script to upload the scores to S3 and \copy them over to Redshift.
3- Set up a cron job on AWS Elastic Beanstalk to trigger this every Monday.
Could that work? Can I use Elastic Beanstalk to run a Python script automatically and, if so, does that mean I need an EC2 instance running 24/7?
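Roughly, I imagine steps 1 and 2 looking something like this (the plumber URL, the JSON shape, and the bucket/key below are placeholders, not my real setup):

import csv
import io

import boto3
import requests

# 1- Trigger the R job exposed through the plumber API and wait for the scores.
resp = requests.get("http://my-r-host:8000/run-model", timeout=60 * 30)
resp.raise_for_status()
scores = resp.json()  # e.g. [{"user_id": 1, "score": 0.87}, ...]

# 2- Write the scores to an in-memory CSV and upload it to S3 for a later \copy / COPY.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["user_id", "score"])
writer.writeheader()
writer.writerows(scores)

boto3.client("s3").put_object(
    Bucket="my-scores-bucket",
    Key="scores/latest.csv",
    Body=buf.getvalue().encode("utf-8"),
)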

Related

How to Trigger an On-Demand Scheduled Query Using a Cloud Function?

What I basically want is for my on-demand scheduled query to run when a new file lands in my Google Cloud Storage bucket. This query will load the CSV file into a temporary table, perform some transformation/cleaning, and then append to a table.
Just to try to get the first part running, my on-demand scheduled query looks like this; the idea is that it will pick up the CSV files from the bucket and dump them into a table.
LOAD DATA INTO `spreadsheep-20220603.Case_Studies.loading_test`
FROM FILES (
  format = 'CSV',
  uris = ['gs://triggered_upload/*.csv']
);
I was in the process of setting up a Google Cloud Function that triggers when a file lands in the storage bucket. That part seems to be fine, but I haven't had any luck working out how the function should trigger the scheduled query.
Any idea what bit of Python code is needed in the function to trigger the query?
It seems to me that it's not really a scheduled query you want at all. You don't want one to run at regular intervals, you want to run a query in response to a certain event.
Now, you've rigged up a cloud function to execute some code whenever a new file is added to a bucket. What this cloud function needs is the BigQuery Python client library. Here's an example of how it's used.
All that remains is to wrap this code in an appropriate function and specify dependencies and permissions using the Cloud Functions Python framework. Here is a guide on how to do that.
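For illustration, a minimal sketch of such a function using the BigQuery Python client; it assumes a Cloud Storage-triggered (background) function, reuses the dataset and table from the question, and simply runs the same LOAD DATA statement for whichever file arrived:

# requirements.txt should include: google-cloud-bigquery
from google.cloud import bigquery

def run_load_query(event, context):
    # Triggered when a file lands in the bucket; 'event' carries the bucket
    # and object name.
    client = bigquery.Client()
    query = f"""
        LOAD DATA INTO `spreadsheep-20220603.Case_Studies.loading_test`
        FROM FILES (
          format = 'CSV',
          uris = ['gs://{event['bucket']}/{event['name']}']
        );
    """
    client.query(query).result()  # wait for the load job to finish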

Debug and deploy a featurizer (data processor for model inference) for a SageMaker endpoint

I am looking at this example to implement the processing of incoming raw data for a SageMaker endpoint prior to model inference/scoring. This is all great, but I have 2 questions:
How can one debug this (e.g. can I invoke the endpoint without it being exposed as a RESTful API and then use SageMaker Debugger)?
SageMaker can be used "remotely", e.g. via VS Code. Can such a script be uploaded programmatically?
Thanks.
SageMaker Debugger is only for monitoring training jobs.
https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html
I don't think you can use it on endpoints.
The script that you have provided is used for both training and inference. The container used by the estimator will take care of which functions to run, so it is not possible to debug the script directly. But what are you debugging in the code? The training part or the inference part?
While creating the estimator we need to give either the entry_point or the source_dir. If you are using "entry_point", the value should be a relative path to the file; if you are using "source_dir", you should be able to give an S3 path. So before running the estimator, you can programmatically tar the files, upload the archive to S3, and then use that S3 path in the estimator.
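A rough sketch of that flow; the bucket, role, file names, and the choice of the SKLearn framework estimator are all assumptions, not from the question:

import tarfile

import boto3
from sagemaker.sklearn.estimator import SKLearn

# Package the training/inference scripts the way SageMaker expects (a tar.gz).
with tarfile.open("sourcedir.tar.gz", "w:gz") as tar:
    tar.add("train.py")
    tar.add("featurizer.py")

# Upload the archive to S3 programmatically.
boto3.client("s3").upload_file(
    "sourcedir.tar.gz", "my-sagemaker-bucket", "code/sourcedir.tar.gz"
)

# Point source_dir at the S3 archive and entry_point at the script inside it.
estimator = SKLearn(
    entry_point="train.py",
    source_dir="s3://my-sagemaker-bucket/code/sourcedir.tar.gz",
    framework_version="0.23-1",
    role="arn:aws:iam::123456789012:role/my-sagemaker-execution-role",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
estimator.fit()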

Custom sagemaker container for training, write forecast to AWS RDS, on a daily basis

I have three main processes to perform using Amazon SageMaker.
Using my own training Python script (not a SageMaker container with a built-in algorithm) [Train.py]
-> For this, I have referred to this link:
Bring own algorithm to AWS sagemaker
and it seems that we can bring our own training script to the SageMaker-managed training setup, and model artifacts can be uploaded to S3, etc.
Note: I am using a LightGBM model for training.
Writing the forecast to an AWS RDS DB:
-> There is no need to deploy the model and create an endpoint, because training will happen every day and will produce the forecast as soon as it completes. (I need to generate the forecast in train.py itself.)
-> The challenge is how to write the forecast to the AWS RDS DB from the train.py script (given that the script runs in a private VPC).
Scheduling this process as a daily job:
--> I have gone through AWS Step Functions and it seems to be the way to trigger the daily training and write the forecast to RDS.
--> The challenge is how to use Step Functions with a time-based trigger rather than an event-based one.
Any suggestions on how to do this? Any best practices to follow? Thank you in advance.
The way to trigger Step Functions on a schedule is by using CloudWatch Events (a sort of cron). Check out this tutorial: https://docs.aws.amazon.com/step-functions/latest/dg/tutorial-cloudwatch-events-target.html
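A minimal sketch of setting that schedule up with boto3; the rule name, cron expression, state machine ARN, and role ARN are placeholders:

import boto3

events = boto3.client("events")

# Fire at 06:00 UTC every day.
events.put_rule(
    Name="daily-forecast-training",
    ScheduleExpression="cron(0 6 * * ? *)",
    State="ENABLED",
)

# Point the rule at the Step Functions state machine; the role must allow
# events.amazonaws.com to call states:StartExecution.
events.put_targets(
    Rule="daily-forecast-training",
    Targets=[
        {
            "Id": "start-forecast-state-machine",
            "Arn": "arn:aws:states:us-east-1:123456789012:stateMachine:dailyForecast",
            "RoleArn": "arn:aws:iam::123456789012:role/events-invoke-stepfunctions",
        }
    ],
)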
Don't write to the RDS from your Python code! It is better to write the output to S3 and then "copy" the files from S3 into the RDS. Decoupling these batches will make for a more reliable and scalable process. You can trigger the bulk copy into the RDS when the files are written to S3, or at a later time when your DB is not too busy.
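A sketch of what the end of train.py could look like under that approach; the bucket and key are placeholders, and forecast_df stands in for the predictions the LightGBM model produces:

import io

import boto3
import pandas as pd

# Stand-in for the forecast produced after training.
forecast_df = pd.DataFrame({"item_id": [1, 2], "forecast": [10.5, 7.2]})

# Write the forecast to S3 rather than straight into RDS. If the training job
# runs in a private VPC, it needs an S3 VPC endpoint (or NAT) for this call.
csv_buffer = io.StringIO()
forecast_df.to_csv(csv_buffer, index=False)

boto3.client("s3").put_object(
    Bucket="my-forecast-bucket",
    Key="forecasts/forecast-latest.csv",
    Body=csv_buffer.getvalue().encode("utf-8"),
)
# A separate step (Lambda, or a scheduled SQL job) can later bulk-load this
# file into RDS.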

Deploying python machine learning model using Azure Functions

I developed a machine learning model locally and wanted to deploy it as a web service using Azure Functions.
First, the model was serialized with the pickle module and then uploaded into the blob storage associated with the Azure Function (Python language, HTTP-trigger) instance.
Then, after installing all required modules, the following code was used to make a prediction on sample data:
import json
import os
import pickle

# Read the POST payload that the Azure Functions (v1) Python worker exposes
# through the 'req' file.
postreqdata = json.loads(open(os.environ['req']).read())

# Load the pickled model from the function's wwwroot directory.
with open('D:/home/site/wwwroot/functionName/model_test.pkl', 'rb') as txt:
    mod = txt.read()
model = pickle.loads(mod)

# Build the feature vector from the request and score it.
scs = [[postreqdata['var_1'], postreqdata['var_2']]]
prediction = model.predict_proba(scs)[0][1]

# Write the prediction to the 'res' file that becomes the HTTP response.
response = open(os.environ['res'], 'w')
response.write(str(prediction))
response.close()
where predict_proba is a method of the trained model and the scs variable is defined to extract the particular values (the model's input variables) from the POST request.
The whole code works fine, predictions are sent as a response, and the values are correct, but execution after sending a request takes 150 seconds (locally it's under 1 s). After measuring which part of the code takes so long, it turns out to be the pickle.loads(mod) call.
Do you have any idea why it takes such a huge amount of time? The model is very small (a few kB).
Thanks
When a call is made to AML, the first call must warm up the container. By default a web service has 20 containers. Each container is cold, and a cold container can cause a large (30 sec) delay.
I suggest referring to this thread, Azure Machine Learning Request Response latency, and this article about Azure ML performance.
Hope it helps you.

AWS Redshift Data Processing

I'm working with a small company currently that stores all of their app data in an AWS Redshift cluster. I have been tasked with doing some data processing and machine learning on the data in that Redshift cluster.
The first task I need to do requires some basic transforming of existing data in that cluster into some new tables based on some fairly simple SQL logic. In an MSSQL environment, I would simply put all the logic into a parameterized stored procedure and schedule it via SQL Server Agent Jobs. However, sprocs don't appear to be a thing in Redshift. How would I go about creating a SQL job and scheduling it to run nightly (for example) in an AWS environment?
The other task I have involves developing a machine learning model (in Python) and scoring records in that Redshift database. What's the best way to host my Python logic and do the data processing if the plan is to pull data from that Redshift cluster, score it, and then insert it into a new table on the same cluster? It seems like I could spin up an EC2 instance, host my Python scripts there, do the processing there as well, and schedule the scripts to run via cron?
I see tons of AWS (and non-AWS) products that look like they might be relevant (AWS Glue/Data Pipeline/EMR), but there's so many that I'm a little overwhelmed. Thanks in advance for the assistance!
ETL
Amazon Redshift does not support stored procedures. Also, I should point out that stored procedures are generally a bad thing because you are putting logic into a storage layer, which makes it very hard to migrate to other solutions in the future. (I know of many Oracle customers who have locked themselves into never being able to change technologies!)
You should run your ETL logic external to Redshift, simply using Redshift as a database. This could be as simple as running a script that uses psql to call Redshift, such as:
`psql <authentication stuff> -c 'insert into z select a, b from x'`
(Use psql v8, upon which Redshift was based.)
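A minimal sketch of wrapping that call in a Python script that cron (or any other scheduler) can run nightly; the host, database, user, and SQL are placeholders, and the password is assumed to come from ~/.pgpass or the PGPASSWORD environment variable:

import subprocess

TRANSFORM_SQL = "insert into z select a, b from x"

subprocess.run(
    [
        "psql",
        "-h", "my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        "-p", "5439",
        "-d", "mydb",
        "-U", "etl_user",
        "-c", TRANSFORM_SQL,
    ],
    check=True,
)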
Alternatively, you could use more sophisticated ETL tools such as AWS Glue (not currently in every Region) or 3rd-party tools such as Bryte.
Machine Learning
Yes, you could run code on an EC2 instance. If it is small, you could use AWS Lambda (maximum 5 minutes run-time). Many ML users like using Spark on Amazon EMR. It depends upon the technology stack you require.
Amazon CloudWatch Events can schedule Lambda functions, which could then launch EC2 instances that do your processing and then self-terminate.
Lots of options, indeed!
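A sketch of that pattern's Lambda handler; the AMI, instance profile, and script location are placeholders:

import boto3

USER_DATA = """#!/bin/bash
aws s3 cp s3://my-ml-bucket/score_users.py /home/ec2-user/score_users.py
python3 /home/ec2-user/score_users.py
shutdown -h now   # with the setting below, shutdown terminates the instance
"""

def handler(event, context):
    # Launch a short-lived instance that runs the scoring script from its
    # user data and then shuts itself down (and therefore terminates).
    boto3.client("ec2").run_instances(
        ImageId="ami-0123456789abcdef0",
        InstanceType="m5.large",
        MinCount=1,
        MaxCount=1,
        UserData=USER_DATA,
        InstanceInitiatedShutdownBehavior="terminate",
        IamInstanceProfile={"Name": "redshift-scoring-instance-profile"},
    )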
The 2 options for running ETL on Redshift:
1. Create some "create table as" type SQL, which will take your source tables as input and generate your target (transformed) table.
2. Do the transformation outside of the database using an ETL tool, for example EMR or Glue.
Generally, in an MPP environment such as Redshift, the best practice is to push the ETL to the powerful database (i.e. option 1).
Only consider taking the ETL outside of Redshift (option 2) where SQL is not the ideal tool for the transformation, or the transformation is likely to take a huge amount of compute resource.
There is no inbuilt scheduling or orchestration tool. Apache Airflow is a good option if you need something more full-featured than cron jobs.
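A minimal sketch of an Airflow (2.x-style) DAG taking the option-1 route, running a "create table as" transform nightly through psql; the connection details and SQL are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="redshift_nightly_transform",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 3 * * *",  # 03:00 every night
    catchup=False,
) as dag:
    run_transform = BashOperator(
        task_id="run_transform",
        bash_command=(
            "psql -h my-cluster.abc123.us-east-1.redshift.amazonaws.com "
            "-p 5439 -d mydb -U etl_user "
            "-c 'drop table if exists target_new; "
            "create table target_new as select a, b from x;'"
        ),
    )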
Basic transforming of existing data
It seems you are a Python developer (as you said you are developing a Python-based ML model), so you can do the transformation by following the steps below:
You can use boto3 (https://aws.amazon.com/sdk-for-python/) to talk to Redshift from any workstation on your LAN (make sure your IP has the proper privileges).
You can write your own Python functions that mimic stored procedures; inside these functions you can put your transformation logic. A sketch of such a function is shown after this list.
Alternatively, you can create functions using Python in Redshift as well that will act like stored procedures. See more here (https://aws.amazon.com/blogs/big-data/introduction-to-python-udfs-in-amazon-redshift/).
Finally, you can use the Windows Task Scheduler / a cron job to schedule your Python scripts with parameters, like a SQL Server Agent job does.
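A minimal sketch of such a "stored procedure in Python"; it uses psycopg2 for the SQL itself (boto3 manages AWS resources, while a Postgres driver runs the queries), and the host, credentials, and statements are placeholders:

import psycopg2

def rebuild_scores_table():
    # Acts like a stored procedure: connect, run the transformation, commit.
    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="mydb",
        user="etl_user",
        password="********",
    )
    try:
        with conn.cursor() as cur:
            cur.execute("truncate table user_scores_staging;")
            cur.execute(
                "insert into user_scores_staging "
                "select user_id, score from transformed_scores;"
            )
        conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    rebuild_scores_table()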
Best way to host my python logic
It seems to me you are reading some data from Redshift, then creating test and training sets, and finally getting some predicted results (records). If so:
Host the script on any of your servers (on your LAN) and connect to Redshift using boto3. If a large number of rows needs to be transferred over the internet, then an EC2 instance in the same region is an option. Start the instance on an ad-hoc basis, complete your job, and stop it; it will be cost-effective. You can do this through the AWS SDK (I have done it with the .NET SDK, and boto3 has the same support).
If your result sets are relatively small, you can save them directly into the target Redshift table.
If the result sets are larger, save them to CSV (there are several Python libraries for this) and load the rows into a staging table using the COPY command if you need any intermediate calculation; if not, load them directly into the target table. A sketch of this CSV-then-COPY path is shown below.
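A rough sketch of the CSV-then-COPY path; the bucket, table names, connection details, and IAM role are placeholders, and 'scored' stands in for the model output:

import boto3
import pandas as pd
import psycopg2

scored = pd.DataFrame({"user_id": [1, 2], "score": [0.91, 0.34]})
scored.to_csv("/tmp/scores.csv", index=False)

# Push the file to S3 so Redshift can bulk-load it.
boto3.client("s3").upload_file("/tmp/scores.csv", "my-ml-bucket", "scores/scores.csv")

# Issue the COPY into a staging (or target) table.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="mydb", user="etl_user", password="********",
)
with conn, conn.cursor() as cur:
    cur.execute(
        "copy staging_scores from 's3://my-ml-bucket/scores/scores.csv' "
        "iam_role 'arn:aws:iam::123456789012:role/redshift-copy-role' "
        "format as csv ignoreheader 1;"
    )
conn.close()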
Hope this helps.
