I am trying to replicate the below example for churn prediction.
https://towardsdatascience.com/a-practical-guide-to-mlops-in-aws-sagemaker-part-i-1d28003f565
preprocessing.py has to import sagemaker, but it throws ModuleNotFoundError when I run the pipeline. The same sagemaker package is also imported in pipeline.py, where it works fine. Please let me know how to install packages in the Studio environment, including the syntax. I tried pip and conda install in a cell of another ipynb file, but it only shows a "Requirement already satisfied" message because the package is already installed there.
So probably the first thing to understand here is that the steps in a SageMaker pipeline don't actually run inside of SageMaker Studio, but in containerized jobs.
What I think you're seeing is that the SageMaker Python SDK (which is open-source and published on PyPI as sagemaker) is present in your Studio notebook kernel where you set up the pipeline, but missing from the processing job that runs preprocessing.py.
I see the pipeline uses a ScriptProcessor based on the XGBoost v1.0-1 image (image_uri and script_eval in pipeline.py), so it looks like this particular image doesn't have sagemaker installed by default.
In fact, preprocessing.py only seems to be using the library to look up the name of the SageMaker default bucket. You could achieve the same result with only boto3 (which should already be installed) as follows:
import boto3

# Reconstruct the default bucket name ("sagemaker-{region}-{account_id}") without the SDK
account_id = boto3.client("sts").get_caller_identity()["Account"]
region = boto3.Session().region_name
trans_bucket = f"sagemaker-{region}-{account_id}"
If you really needed to install extra libraries to use with your processing jobs, I would suggest checking out FrameworkProcessor (which can install sagemaker from a requirements.txt file you provide) instead of ScriptProcessor - but watch out that there have been some bug reports when using FrameworkProcessor and Pipelines together.
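As a rough sketch of what that could look like (the framework choice, role lookup, instance settings, and src/ folder here are my assumptions, not taken from the article):
from sagemaker import get_execution_role
from sagemaker.processing import FrameworkProcessor
from sagemaker.sklearn.estimator import SKLearn

# FrameworkProcessor bundles source_dir and pip-installs whatever is listed in
# source_dir/requirements.txt (e.g. a line that just says "sagemaker")
# before running the script inside the processing container.
processor = FrameworkProcessor(
    estimator_cls=SKLearn,            # assumed: base the job on the sklearn framework image
    framework_version="0.23-1",
    role=get_execution_role(),
    instance_type="ml.m5.xlarge",     # assumed instance settings
    instance_count=1,
)

processor.run(
    code="preprocessing.py",
    source_dir="src",                 # assumed folder containing the script and requirements.txt
)
In the pipeline itself you would wrap this in a ProcessingStep rather than calling run() directly, which is exactly where the FrameworkProcessor/Pipelines bug reports mentioned above tend to come up.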
If FrameworkProcessor isn't working, you could instead build your own container image FROM the pre-provided one and pip install sagemaker in the Dockerfile. You would upload this customized image to Amazon ECR and then reference it in your pipeline instead of the standard XGBoost one.
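On the pipeline side, switching to your customized image is then just a matter of changing the image_uri the ScriptProcessor points at (a sketch; the ECR URI and instance settings are placeholders, not values from the article):
from sagemaker import get_execution_role
from sagemaker.processing import ScriptProcessor

# Placeholder URI for the image you built FROM the XGBoost image and pushed to ECR
custom_image_uri = "123456789012.dkr.ecr.us-east-1.amazonaws.com/xgboost-with-sagemaker:latest"

script_processor = ScriptProcessor(
    image_uri=custom_image_uri,
    command=["python3"],
    role=get_execution_role(),
    instance_type="ml.m5.xlarge",   # placeholder instance settings
    instance_count=1,
)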
I'm using the azure automl python sdk to download and save a model then reload it. I get the following error:
anaconda3\envs\automl_21\lib\site-packages\sklearn\base.py:318: UserWarning: Trying to unpickle estimator Pipeline from version 0.22.1 when using version 0.22.2.post1. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
How can I ensure that the versions match?
My Microsoft contact says -
"For this, their best bet is probably to see what the training env was pinned to and install those same pins. They can get that env by running child_run.get_environment() and then pip install all the pkgs listed in there with the pins listed there."
A useful code snippet.
from azureml.train.automl.run import AutoMLRun

# `experiment` is an existing azureml.core Experiment object
for run in experiment.get_runs():
    tags_dictionary = run.get_tags()
    best_run = AutoMLRun(experiment, tags_dictionary['automl_best_child_run_id'])
    env = best_run.get_environment()
    # Print the pinned conda/pip dependencies used for training
    print(env.python.conda_dependencies.serialize_to_string())
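If you then want to recreate that training environment locally, one option (just a sketch, reusing the env object from the loop above and assuming serialize_to_string() gives you the usual conda_dependencies.yml content) is to dump it to a file that conda can consume:
# Write the pinned dependency spec to a conda environment file, then rebuild it
# locally with: conda env create -f automl_training_env.yml
with open("automl_training_env.yml", "w") as f:
    f.write(env.python.conda_dependencies.serialize_to_string())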
I have a simple Lambda function which uses the numpy library.
I have set up a virtual environment locally, and my code is able to find and use the library there.
I tried to use an AWS Lambda layer: I zipped the venv folder and uploaded it to a layer,
then attached the correct layer and version to my function,
but the function is still not able to find the library.
Following is the code which works fine on local -
import numpy as np

def main(event, context):
    a = np.array([1, 2, 3])
    print("Your numpy array:")
    print(a)
Following is the venv structure which I zipped and uploaded -
I get the following error -
{
"errorMessage": "Unable to import module 'handler': No module named 'numpy'",
"errorType": "Runtime.ImportModuleError"
}
My Lambda deployment looks like this -
I'm trying to follow this -
https://towardsdatascience.com/introduction-to-amazon-lambda-layers-and-boto3-using-python3-39bd390add17
I've seen that a few libraries like numpy and pandas don't work in Lambda when installed with pip on a local machine (their compiled components end up built for the wrong platform). I have had success using the .whl package files for these libraries to create the Lambda layer. Refer to the steps below:
NOTE: These steps set up the libraries specific to the Python 3.7 runtime. If using any other version, you would need to download the .whl files corresponding to that Python version.
Create an EC2 instance using Amazon Linux AMI and SSH into this instance. We should create our layer in Amazon Linux AMI as the Lambda Python 3.7 runtime runs on this operating system (doc).
Make sure this instance has Python3 and "pip" tool installed.
Download the numpy .whl file for the cp37 Python version and the manylinux1_x86_64 OS by executing the below command:
$ wget https://files.pythonhosted.org/packages/d6/c6/58e517e8b1fb192725cfa23c01c2e60e4e6699314ee9684a1c5f5c9b27e1/numpy-1.18.5-cp37-cp37m-manylinux1_x86_64.whl
Skip to the next step if you're not using pandas. Download the pandas .whl file for the cp37 Python version and the manylinux1_x86_64 OS by executing the below command:
$ wget https://files.pythonhosted.org/packages/a4/5f/1b6e0efab4bfb738478919d40b0e3e1a06e3d9996da45eb62a77e9a090d9/pandas-1.0.4-cp37-cp37m-manylinux1_x86_64.whl
Next, we will create a directory named "python" and unzip these files into that directory:
$ mkdir python
$ unzip pandas-1.0.4-cp37-cp37m-manylinux1_x86_64.whl -d python/
$ unzip numpy-1.18.5-cp37-cp37m-manylinux1_x86_64.whl -d python/
We also need to download the "pytz" library (a pandas dependency) so that the numpy and pandas imports succeed:
$ pip3 install -t python/ pytz
Next, we would remove the “*.dist-info” files from our package directory to reduce the size of the resulting layer.
$ cd python
$ sudo rm -rf *.dist-info
At this point, the python/ directory contains all the libraries we need to run pandas and numpy.
Zip the current "python" directory and upload it to your S3 bucket. Ensure that the libraries are present in the hierarchy as given here.
$ cd ..
$ zip -r lambda-layer.zip python/
$ aws s3 cp lambda-layer.zip s3://YOURBUCKETNAME
The "lambda-layer.zip" file can then be used to create a new layer from the Lambda console.
Based on the AWS Lambda layers doc, https://docs.aws.amazon.com/lambda/latest/dg/configuration-layers.html, your zip package for the layer must have this structure:
my_layer.zip
| python/numpy
| python/numpy-***.dist-info
So what you have to do is create a folder named python, put the contents of site-packages inside it, and then zip up that python folder. I tried this out with a simple package and it seems to work fine.
Also keep in mind that some packages require C/C++ compilation, and for that to work you must install and package them on a machine with an architecture similar to Lambda's. Usually that means doing the install and packaging on an EC2 instance running Amazon Linux.
That's a bit of a misleading question, because you did not mention that you use Serverless. I found that out by going through the snapshot of your project structure. It means you are probably using the Serverless Framework to deploy your project to AWS.
Actually, there are multiple ways you can arrange a Lambda layer. Let's have a look at each of them.
Native AWS
Once you navigate to Add a layer, you will find 3 options: AWS Layers, Custom Layers, and Specify an ARN.
Specify an ARN
Some people have already done all the work for you: Klayers.
So, you need numpy. Within your Lambda function, navigate to Layers --> Add a layer --> out of the 3 options, choose Specify an ARN and put in this value: arn:aws:lambda:eu-west-1:770693421928:layer:Klayers-python38-numpy:12.
That will solve your problem and you will be able to import and use numpy.
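If you'd rather attach that ARN from code than from the console, a boto3 call like this should do it (the function name is a placeholder; the ARN is the one quoted above):
import boto3

lambda_client = boto3.client("lambda")

# Attach the public Klayers numpy layer to an existing function
lambda_client.update_function_configuration(
    FunctionName="my-numpy-function",  # placeholder: your Lambda function's name
    Layers=["arn:aws:lambda:eu-west-1:770693421928:layer:Klayers-python38-numpy:12"],
)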
Custom Layers
Choose a layer from a list of layers created by your AWS account or organization.
For custom layers, the way of implementing them can differ based on your deployment requirements.
If you are allowed to do things manually, you should have a look at the following Medium article. I assume it will help you!
AWS Layers
As for the AWS pre-built layers, everything is simple.
Layers provided by AWS that are compatible with your function's runtime.
The available layers differ between runtimes.
For me, the list is: Perl5, SciPy, AppConfig Extension.
Serverless
With Serverless, things are much easier, because you can define your layers directly alongside the lambda definition in the serverless.yml file. How exactly to define them can differ as well.
Examples can be found at: How to publish and use AWS Lambda Layers with the Serverless Framework
If you will have any questions, feel free to expand the discussion.
Cheers!
So my team at work had put together a simple Python service and had a Dockerfile that performs some small tasks like pip installing dependencies, creating a user, rewiring some directories, etc.
Everything worked fine locally -- we were able to build and run a docker image locally on my laptop (I have a MacBook Pro, running High Sierra if that is relevant).
We attempted to build the project on Openshift and kept getting a "Generic Build error" which told us to check the logs.
The log line it was failing on was
RUN pip3 install pandas flask-restful xgboost scikit-learn nltk scipy
and there was no related error listed with it. It literally just stopped at that point.
Specifically, it was breaking out when it got to installing xgboost. We removed the xgboost part and the entire Dockerfile ran fine and the build completed successfully so we know it was definitely just that one install causing the issue. Has anyone else encountered this and know why it happens?
We are pretty certain it wasn't any kind of memory or storage issue as the container had plenty of room for how small the service was.
Note: We were able to eventually use a coworker's template image to get our project up with the needed dependencies, just curious why this was happening in case we run into a similar issue in the future
Not sure if this is an option for you, but when I need to build from a Dockerfile, I usually use a hosted service like Quay.io or DockerHub.
I've been working recently on deploying a machine learning model as a web service. I used Azure Machine Learning Studio to create my own Workspace ID and Authorization Token. Then, I trained a LogisticRegressionCV model from sklearn.linear_model locally on my machine (using Python 2.7.13), and with the code snippet below I wanted to publish my model as a web service:
import numpy as np
from azureml import services

@services.publish('workspaceID', 'authorization_token')
@services.types(var_1=float, var_2=float)
@services.returns(int)
def predicting(var_1, var_2):
    # `model` is the locally trained classifier; score a single-row feature array
    input = np.array([var_1, var_2]).reshape(1, -1)
    return model.predict_proba(input)[0][1]
where the input variable holds the data to be scored and the model variable contains the trained classifier. Then, after defining the above function, I want to make a prediction on a sample input vector:
predicting.service(1.21, 1.34)
However, the following error occurs:
RuntimeError: Error 0085: The following error occurred during script
evaluation, please view the output log for more information:
And the most important message in log is:
AttributeError: 'module' object has no attribute 'LogisticRegressionCV'
The error is strange to me because when I was using the normal sklearn.linear_model.LogisticRegression everything was fine. I was able to make predictions by sending POST requests to the created endpoint, so I guess sklearn worked correctly.
After changing to LogisticRegressionCV it does not.
Therefore I wanted to update sklearn in my workspace.
Do you have any ideas how to do it? Or, the more general question: how do I install any Python module on Azure Machine Learning Studio in a way that lets me use the predict functions of any model I developed locally?
Thanks
For anyone who came across this question like I did in hopes of installing modules in AzureML notebooks: it seems the current environments sit on Conda on the compute instance, so it's now as simple as executing
!conda env list
# conda environments:
#
base * /anaconda
azureml_py36 /anaconda/envs/azureml_py36
!conda install -n azureml_py36 -y <packages>
from within the notebook environment, or do pretty much the same without the ! in the terminal environment.
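If you'd rather use pip, a common notebook pattern (a sketch; the package name is just an example) is to point pip at the interpreter of the running kernel so the install lands in the active environment:
import sys

# Install into whichever environment this notebook kernel is running in
!{sys.executable} -m pip install lightgbm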
For installing a Python module on Azure ML Studio, there is a Technical Notes section in the official document Execute Python Script which introduces it.
The general steps are as below.
Create a Python project via virtualenv and activate it.
Install all the packages you want via pip in that virtual Python environment.
Package all files and directories under the path Lib\site-packages of your project as a zip file (see the sketch after these steps).
Upload the zip package into your Azure ML Workspace as a dataset.
Follow the official document to import the Python module in your Execute Python Script.
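A minimal sketch of the packaging step, assuming a Windows-style virtualenv layout and a hypothetical environment folder name (adjust the paths to your project):
import zipfile
from pathlib import Path

# Bundle everything under the virtualenv's Lib\site-packages into a zip
site_packages = Path("my_project_env/Lib/site-packages")  # hypothetical path
with zipfile.ZipFile("python_modules.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for item in site_packages.rglob("*"):
        if item.is_file():
            # Store files relative to site-packages so imports resolve after unzipping
            zf.write(item, item.relative_to(site_packages))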
For more details, you can refer to the similar SO thread Updating pandas to version 0.19 in Azure ML Studio, which even shows how to update the version of the Python packages installed by Azure.
Hope it helps.
I struggled with the same issue: error 0085
I was able to resolve it by using Azure ML code example available from their library:
Deployment of AzureML Web Services from Python Notebooks
can be found at https://gallery.cortanaintelligence.com/Notebook/Deployment-of-AzureML-Web-Services-from-Python-Notebooks-4
I won't copy the whole code here, but I used it exactly as is and it worked with the Boston dataset. Then I used it with my own dataset, and I no longer got error 0085. I haven't tracked down the original error yet, but it's most likely due to some misbehaving character or indent. Hope this helps.
I am using Google Cloud to train a neural network on the cloud like in the following example:
https://cloud.google.com/blog/big-data/2016/12/how-to-classify-images-with-tensorflow-using-google-cloud-machine-learning-and-cloud-dataflow
To start, I set the following environment variables:
PROJECT_ID=$(gcloud config list project --format "value(core.project)")
BUCKET_NAME=${PROJECT_ID}-mlengine
I then uploaded my training and evaluation data, two CSV files named eval_set.csv and train_set.csv, to Google Cloud Storage with the following command:
gsutil cp -r data gs://$BUCKET_NAME
I then verified that these two CSV files were in the polar-terminal-160506-mlengine/data directory in my Google Cloud Storage.
I then did the following environment variable assignments:
# Assign appropriate values.
PROJECT=$(gcloud config list project --format "value(core.project)")
JOB_ID="flowers_${USER}_$(date +%Y%m%d_%H%M%S)"
GCS_PATH="${BUCKET}/${USER}/${JOB_ID}"
DICT_FILE=gs://cloud-ml-data/img/flower_photos/dict.txt
Before trying to preprocess my evaluation data like so:
# Preprocess the eval set.
python trainer/preprocess.py \
--input_dict "$DICT_FILE" \
--input_path "gs://cloud-ml-data/img/flower_photos/eval_set.csv" \
--output_path "${GCS_PATH}/preproc/eval" \
--cloud
Sadly, this runs for a bit and then crashes outputting the following error:
ValueError: Unable to get the Filesystem for path gs://polar-terminal-160506-mlengine/data/eval_set.csv
This doesn't seem possible as I have confirmed with my eyes via my Google Cloud Storage console that eval_set.csv is stored at this location. Is this perhaps a permissions issue or something I am not seeing?
Edit:
I have found that the cause of this runtime error is a certain line in the trainer/preprocess.py file. The line is this one:
read_input_source = beam.io.ReadFromText(
    opt.input_path, strip_trailing_newlines=True)
Seems like a pretty good clue but I am still not really sure what is going on. When I google "beam.io.ReadFromText ValueError: Unable to get the Filesystem for path" nothing relevant at all appears which is a bit odd. Thoughts?
It looks like your apache-beam library installation might be incomplete.
Try pip install apache-beam[gcp].
The gcp extra allows Apache Beam to access files stored on Google Cloud Storage.
The Apache Beam package is available on PyPI.
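A quick way to verify the extra is installed (a minimal sketch; the bucket path is a placeholder) is to read a gs:// file with a small local pipeline:
import apache_beam as beam

# With apache-beam[gcp] installed, the gs:// filesystem is registered and this
# local (DirectRunner) pipeline can read straight from Cloud Storage.
with beam.Pipeline() as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://YOUR_BUCKET/data/eval_set.csv")
     | "Print" >> beam.Map(print))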
Just as Jean-Christophe described, I believe your installation is incomplete.
The apache-beam package on its own doesn't include everything needed to read/write from GCP. To get all of that, as well as the runner for deploying your pipeline to Cloud Dataflow (the DataflowRunner), you'll need to install it via pip.
pip install google-cloud-dataflow
This is how I was able to resolve the same issue.
Try pip install apache_beam[gcp]. This will help you.