Can't generate XGBoost training report in SageMaker, only profiler_report - python

I am trying to generate the XGBoost training report to see feature importances; however, the following code only generates the profiler report.
import os

import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.debugger import Rule, rule_configs
from sagemaker.inputs import TrainingInput

# Debugger rule that should produce the XGBoost training report
rules = [
    Rule.sagemaker(rule_configs.create_xgboost_report())
]

# Define IAM role
role = get_execution_role()
prefix = 'sagemaker/models'
my_region = boto3.session.Session().region_name

# this line automatically looks for the XGBoost image URI and builds an XGBoost container.
xgboost_container = sagemaker.image_uris.retrieve("xgboost", my_region, "latest")

bucket_name = 'binary-base'
s3 = boto3.resource('s3')
try:
    if my_region == 'us-east-1':
        s3.create_bucket(Bucket=bucket_name)
    else:
        s3.create_bucket(Bucket=bucket_name,
                         CreateBucketConfiguration={'LocationConstraint': my_region})
    print('S3 bucket created successfully')
except Exception as e:
    print('S3 error: ', e)

boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'train/train.csv')).upload_file('../Data/Base_Model_Data_No_Labels/train.csv')
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'validation/val.csv')).upload_file('../Data/Base_Model_Data_No_Labels/val.csv')
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'test/test.csv')).upload_file('../Data/Base_Model_Data/test.csv')

# Training inputs used by fit() below (assumed; the original snippet did not define them)
s3_input_train = TrainingInput(s3_data='s3://{}/{}/train'.format(bucket_name, prefix), content_type='csv')
s3_input_val = TrainingInput(s3_data='s3://{}/{}/validation'.format(bucket_name, prefix), content_type='csv')

sess = sagemaker.Session()
xgb = sagemaker.estimator.Estimator(xgboost_container,
                                    role,
                                    volume_size=5,
                                    instance_count=1,
                                    instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket_name, prefix),
                                    sagemaker_session=sess,
                                    rules=rules)
xgb.set_hyperparameters(objective='binary:logistic',
                        num_round=100,
                        scale_pos_weight=8.5)
xgb.fit({'train': s3_input_train, 'validation': s3_input_val}, wait=True)
When checking the output path via:
rule_output_path = xgb.output_path + "/" + xgb.latest_training_job.job_name + "/rule-output"
! aws s3 ls {rule_output_path} --recursive
we only see the profiler report generated.
What am I doing wrong or missing? I want to generate the XGBoost training report so I can see its feature importances.
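For reference, once the rule has actually run, the training report typically appears under a CreateXgboostReport/ subfolder of the rule output path and can be copied down for viewing; a sketch (the folder and file names below are what the Debugger rule usually emits, not something shown by the listing above):
! aws s3 cp {rule_output_path}/CreateXgboostReport/ ./CreateXgboostReport/ --recursive
# The feature-importance section is inside xgboost_report.html / xgboost_report.ipynb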

Related

Deploying a databricks model as a scoring webservice failed in Azure Machine Learning

I am working through Azure Databricks lab 04 (Integrating Azure Databricks and Azure Machine Learning) -> 2. Deploying Models in Azure Machine Learning. The idea is to 1) train a model, 2) deploy that model to an Azure Container Instance (ACI) in AML, and 3) make predictions via HTTPS. However, I get an error when deploying the model.
The full code from the notebook is displayed at the bottom or can be found here: https://adb-4934989010098757.17.azuredatabricks.net/?o=4934989010098757#notebook/4364513836468644/command/4364513836468645 .
I run the actual model deployment in the following way:
aci_service_name='nyc-taxi-service'
service = Model.deploy(workspace=ws,
                       name=aci_service_name,
                       models=[registered_model],
                       inference_config=inference_config,
                       deployment_config=aci_config,
                       overwrite=True)
service.wait_for_deployment(show_output=True)
print(service.state)
After running the model deployment, the cell runs for over 25 minutes and breaks when checking the status of the inference endpoint. It gives the following error:
Service deployment polling reached non-successful terminal state, current service state: Failed
"code": "AciDeploymentFailed",
"statusCode": 400,
"message": "Aci Deployment failed with exception: Your container application crashed. This may be caused by errors in your scoring file's init() function."
The scoring script looks like this:
script_dir = 'scripts'
dbutils.fs.mkdirs(script_dir)
script_dir_path = os.path.join('/dbfs', script_dir)
print("Script directory path:", script_dir_path)
%%writefile $script_dir_path/score.py
import json
import numpy as np
import pandas as pd
import sklearn
import joblib
from azureml.core.model import Model
columns = ['passengerCount', 'tripDistance', 'hour_of_day', 'day_of_week',
           'month_num', 'normalizeHolidayName', 'isPaidTimeOff', 'snowDepth',
           'precipTime', 'precipDepth', 'temperature']

def init():
    global model
    model_path = Model.get_model_path('nyc-taxi-fare')
    model = joblib.load(model_path)
    print('model loaded')

def run(input_json):
    # Get predictions and explanations for each data point
    inputs = json.loads(input_json)
    data_df = pd.DataFrame(np.array(inputs).reshape(-1, len(columns)), columns=columns)
    # Make prediction
    predictions = model.predict(data_df)
    # You can return any data type as long as it is JSON-serializable
    return {'predictions': predictions.tolist()}
Does anyone know how I could fix this? Thanks in advance!
The full code is displayed below:
**Required Libraries**:
* `azureml-sdk[databricks]` via PyPI
* `sklearn-pandas==2.1.0` via PyPI
* `azureml-mlflow` via PyPI
import os
import numpy as np
import pandas as pd
import pickle
import sklearn
import joblib
import math
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn_pandas import DataFrameMapper
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import matplotlib
import matplotlib.pyplot as plt
import azureml
from azureml.core import Workspace, Experiment, Run
from azureml.core.model import Model
print('The azureml.core version is {}'.format(azureml.core.VERSION))
%md
### Connect to the AML workspace
%md
In the following cell, be sure to set the values for `subscription_id`, `resource_group`, and `workspace_name` as directed by the comments. Please note, you can copy the subscription ID and resource group name from the **Overview** page on the blade for the Azure ML workspace in the Azure portal.
#Provide the Subscription ID of your existing Azure subscription
subscription_id = " ..... "
#Replace the name below with the name of your resource group
resource_group = "RG_1"
#Replace the name below with the name of your Azure Machine Learning workspace
workspace_name = "aml-ws"
print("subscription_id:", subscription_id)
print("resource_group:", resource_group)
print("workspace_name:", workspace_name)
%md
**Important Note**: You will be prompted to login in the text that is output below the cell. Be sure to navigate to the URL displayed and enter the code that is provided. Once you have entered the code, return to this notebook and wait for the output to read `Workspace configuration succeeded`.
*Also note that the sign-on link and code only appear the first time in a session. If an authenticated session is already established, you won't be prompted to enter the code and authenticate when creating an instance of the Workspace.*
ws = Workspace(subscription_id, resource_group, workspace_name)
print(ws)
print('Workspace region:', ws.location)
print('Workspace configuration succeeded')
%md
### Load the training data
In this notebook, we will be using a subset of NYC Taxi & Limousine Commission - green taxi trip records available from [Azure Open Datasets]( https://azure.microsoft.com/en-us/services/open-datasets/). The data is enriched with holiday and weather data. Each row of the table represents a taxi ride that includes columns such as number of passengers, trip distance, datetime information, holiday and weather information, and the taxi fare for the trip.
Run the following cell to load the table into a Spark dataframe and review the dataframe.
dataset = spark.sql("select * from nyc_taxi_1_csv").toPandas()
display(dataset)
%md
### Use MLflow with Azure Machine Learning for Model Training
In the subsequent cells you will learn to do the following:
- Set up MLflow tracking URI so as to use Azure ML
- Create MLflow experiment – this will create a corresponding experiment in Azure ML Workspace
- Train a model on Azure Databricks cluster while logging metrics and artifacts using MLflow
- Save the trained model to Databricks File System (DBFS)
import mlflow

mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
experiment_name = 'MLflow-AML-Exercise'
mlflow.set_experiment(experiment_name)

print("Training model...")
output_folder = 'outputs'
model_file_name = 'nyc-taxi.pkl'
dbutils.fs.mkdirs(output_folder)
model_file_path = os.path.join('/dbfs', output_folder, model_file_name)

with mlflow.start_run() as run:
    df = dataset.dropna(subset=['totalAmount'])
    x_df = df.drop(['totalAmount'], axis=1)
    y_df = df['totalAmount']
    X_train, X_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, random_state=0)

    numerical = ['passengerCount', 'tripDistance', 'snowDepth', 'precipTime', 'precipDepth', 'temperature']
    categorical = ['hour_of_day', 'day_of_week', 'month_num', 'normalizeHolidayName', 'isPaidTimeOff']

    numeric_transformations = [([f], Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())])) for f in numerical]
    categorical_transformations = [([f], OneHotEncoder(handle_unknown='ignore', sparse=False)) for f in categorical]
    transformations = numeric_transformations + categorical_transformations

    clf = Pipeline(steps=[('preprocessor', DataFrameMapper(transformations, df_out=True)),
                          ('regressor', GradientBoostingRegressor())])
    clf.fit(X_train, y_train)

    y_predict = clf.predict(X_test)
    y_actual = y_test.values.flatten().tolist()
    rmse = math.sqrt(mean_squared_error(y_actual, y_predict))
    mlflow.log_metric('rmse', rmse)
    mae = mean_absolute_error(y_actual, y_predict)
    mlflow.log_metric('mae', mae)
    r2 = r2_score(y_actual, y_predict)
    mlflow.log_metric('R2 score', r2)

    plt.figure(figsize=(10, 10))
    plt.scatter(y_actual, y_predict, c='crimson')
    plt.yscale('log')
    plt.xscale('log')
    p1 = max(max(y_predict), max(y_actual))
    p2 = min(min(y_predict), min(y_actual))
    plt.plot([p1, p2], [p1, p2], 'b-')
    plt.xlabel('True Values', fontsize=15)
    plt.ylabel('Predictions', fontsize=15)
    plt.axis('equal')
    results_graph = os.path.join('/dbfs', output_folder, 'results.png')
    plt.savefig(results_graph)
    mlflow.log_artifact(results_graph)

    joblib.dump(clf, open(model_file_path, 'wb'))
    mlflow.log_artifact(model_file_path)
%md
Run the cell below to list the experiment run in Azure Machine Learning Workspace that you just completed.
aml_run = list(ws.experiments[experiment_name].get_runs())[0]
aml_run
%md
## Exercise 1: Register a databricks-trained model in AML
Azure Machine Learning provides a Model Registry that acts like a version controlled repository for each of your trained models. To version a model, you use the SDK as follows. Run the following cell to register the model with Azure Machine Learning.
model_name = 'nyc-taxi-fare'
model_description = 'Model to predict taxi fares in NYC.'
model_tags = {"Type": "GradientBoostingRegressor",
              "Run ID": aml_run.id,
              "Metrics": aml_run.get_metrics()}

registered_model = Model.register(model_path=model_file_path,  # Path to the saved model file
                                  model_name=model_name,
                                  tags=model_tags,
                                  description=model_description,
                                  workspace=ws)
print(registered_model)
%md
## Exercise 2: Deploy a service that uses the model
%md
### Create the scoring script
script_dir = 'scripts'
dbutils.fs.mkdirs(script_dir)
script_dir_path = os.path.join('/dbfs', script_dir)
print("Script directory path:", script_dir_path)
%%writefile $script_dir_path/score.py
import json
import numpy as np
import pandas as pd
import sklearn
import joblib
from azureml.core.model import Model
columns = ['passengerCount', 'tripDistance', 'hour_of_day', 'day_of_week',
           'month_num', 'normalizeHolidayName', 'isPaidTimeOff', 'snowDepth',
           'precipTime', 'precipDepth', 'temperature']

def init():
    global model
    model_path = Model.get_model_path('nyc-taxi-fare')
    model = joblib.load(model_path)
    print('model loaded')

def run(input_json):
    # Get predictions and explanations for each data point
    inputs = json.loads(input_json)
    data_df = pd.DataFrame(np.array(inputs).reshape(-1, len(columns)), columns=columns)
    # Make prediction
    predictions = model.predict(data_df)
    # You can return any data type as long as it is JSON-serializable
    return {'predictions': predictions.tolist()}
%md
### Create the deployment environment
from azureml.core import Environment
from azureml.core.environment import CondaDependencies
my_env_name="nyc-taxi-env"
myenv = Environment.get(workspace=ws, name='AzureML-Minimal').clone(my_env_name)
conda_dep = CondaDependencies()
conda_dep.add_pip_package("numpy==1.18.1")
conda_dep.add_pip_package("pandas==1.1.5")
conda_dep.add_pip_package("joblib==0.14.1")
conda_dep.add_pip_package("scikit-learn==0.24.1")
conda_dep.add_pip_package("sklearn-pandas==2.1.0")
conda_dep.add_pip_package("azure-ml-api-sdk")
myenv.python.conda_dependencies=conda_dep
print("Review the deployment environment.")
myenv
%md
### Create the inference configuration
from azureml.core.model import InferenceConfig
inference_config = InferenceConfig(entry_script='score.py', source_directory=script_dir_path, environment=myenv)
print("InferenceConfig created.")
%md
### Create the deployment configuration
In this exercise we will use the Azure Container Instance (ACI) to deploy the model
from azureml.core.webservice import AciWebservice, Webservice
description = 'NYC Taxi Fare Predictor Service'
aci_config = AciWebservice.deploy_configuration(
    cpu_cores=3,
    memory_gb=15,
    location='eastus',
    description=description,
    auth_enabled=True,
    tags={'name': 'ACI container',
          'model_name': registered_model.name,
          'model_version': registered_model.version}
)
print("AciWebservice deployment configuration created.")
%md
### Deploy the model as a scoring webservice
Please note that it can take **10-15 minutes** for the deployment to complete.
aci_service_name='nyc-taxi-service'
service = Model.deploy(workspace=ws,
                       name=aci_service_name,
                       models=[registered_model],
                       inference_config=inference_config,
                       deployment_config=aci_config,
                       overwrite=True)
service.wait_for_deployment(show_output=True)
print(service.state)
%md
## Exercise 3: Consume the deployed service
%md
**Review the webservice endpoint URL and API key**
api_key, _ = service.get_keys()
print("Deployed ACI test Webservice: {} \nWebservice Uri: {} \nWebservice API Key: {}".
format(service.name, service.scoring_uri, api_key))
%md
**Prepare test data**
# ['passengerCount', 'tripDistance', 'hour_of_day', 'day_of_week', 'month_num',
#  'normalizeHolidayName', 'isPaidTimeOff', 'snowDepth', 'precipTime', 'precipDepth', 'temperature']
data1 = [2, 5, 9, 4, 5, 'Memorial Day', True, 0, 0.0, 0.0, 65]
data2 = [[3, 10, 15, 4, 7, 'None', False, 0, 2.0, 1.0, 80],
         [2, 5, 9, 4, 5, 'Memorial Day', True, 0, 0.0, 0.0, 65]]
print("Test data prepared.")
dataset.head()
%md
### Consume the deployed webservice over HTTP
import requests
import json
headers = {'Content-Type':'application/json', 'Authorization':('Bearer '+ api_key)}
response = requests.post(service.scoring_uri, json.dumps(data1), headers=headers)
print('Predictions for data1')
print(response.text)
print("")
response = requests.post(service.scoring_uri, json.dumps(data2), headers=headers)
print('Predictions for data2')
print(response.text)
%md
### Clean-up
When you are done with the exercise, delete the deployed webservice by running the cell below.
service.delete()
print("Deployed webservice deleted.")

'mlflow' has no attribute 'last_active_run'

I am running MLflow for the first time, with the tracking server on port 5000.
While testing MLflow I hit the problem that mlflow has no attribute last_active_run, yet the code is an example provided by MLflow itself.
The link is here: mlflow
What is the problem, and how should I change the code?
shell
wget https://raw.githubusercontent.com/mlflow/mlflow/master/examples/sklearn_autolog/utils.py
wget https://raw.githubusercontent.com/mlflow/mlflow/master/examples/sklearn_autolog/pipeline.py
pipeline.py
from pprint import pprint
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import mlflow
from utils import fetch_logged_data
def main():
    # enable autologging
    mlflow.sklearn.autolog()

    # prepare training data
    X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
    y = np.dot(X, np.array([1, 2])) + 3

    # train a model
    pipe = Pipeline([("scaler", StandardScaler()), ("lr", LinearRegression())])
    pipe.fit(X, y)
    run_id = mlflow.last_active_run().info.run_id
    print("Logged data and model in run: {}".format(run_id))

    # show logged data
    for key, data in fetch_logged_data(run_id).items():
        print("\n---------- logged {} ----------".format(key))
        pprint(data)


if __name__ == "__main__":
    main()
utils.py
import mlflow
from mlflow.tracking import MlflowClient
def yield_artifacts(run_id, path=None):
    """Yield all artifacts in the specified run."""
    client = MlflowClient()
    for item in client.list_artifacts(run_id, path):
        if item.is_dir:
            yield from yield_artifacts(run_id, item.path)
        else:
            yield item.path


def fetch_logged_data(run_id):
    """Fetch params, metrics, tags, and artifacts in the specified run."""
    client = MlflowClient()
    data = client.get_run(run_id).data
    # Exclude system tags: https://www.mlflow.org/docs/latest/tracking.html#system-tags
    tags = {k: v for k, v in data.tags.items() if not k.startswith("mlflow.")}
    artifacts = list(yield_artifacts(run_id))
    return {
        "params": data.params,
        "metrics": data.metrics,
        "tags": tags,
        "artifacts": artifacts,
    }
Error message
INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '8cc3f4e03b4e417b95a64f1a9a41be63', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
Traceback (most recent call last):
File "/Users/taein/Desktop/mlflow/pipeline.py", line 33, in <module>
main()
File "/Users/taein/Desktop/mlflow/pipeline.py", line 23, in main
run_id = mlflow.last_active_run().info.run_id
AttributeError: module 'mlflow' has no attribute 'last_active_run'
Thanks for your help.
It's because of the mlflow version you mentioned in the comments: the mlflow.last_active_run() API was introduced in mlflow 1.25.0. So you should either upgrade mlflow or use the earlier version of the example code, available here:
wget https://raw.githubusercontent.com/mlflow/mlflow/5e2cb3baef544b00a972dff9dd6fb764be20510b/examples/sklearn_autolog/utils.py
wget https://raw.githubusercontent.com/mlflow/mlflow/5e2cb3baef544b00a972dff9dd6fb764be20510b/examples/sklearn_autolog/pipeline.py
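If you go the upgrade route, a quick sanity check before re-running the example (assuming pip as the package manager; the assert is just illustrative):
# Upgrade first if needed:  pip install --upgrade "mlflow>=1.25.0"
import mlflow

print(mlflow.__version__)
# last_active_run() only exists from mlflow 1.25.0 onwards
assert hasattr(mlflow, "last_active_run")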

cannot load pickle files for xgboost images of version > 1.2-2 in sagemaker - UnpicklingError

I can train an XGBoost model using SageMaker images like so:
import os

import boto3
import sagemaker
from sagemaker.inputs import TrainingInput

folder = r"C:\Somewhere"
os.chdir(folder)

s3_prefix = 'some_model'
s3_bucket_name = 'the_bucket'
train_file_name = 'train.csv'
val_file_name = 'val.csv'
role_arn = 'arn:aws:iam::482777693429:role/bla_instance_role'
region_name = boto3.Session().region_name

s3_input_train = TrainingInput(s3_data='s3://{}/{}/{}'.format(s3_bucket_name, s3_prefix, train_file_name), content_type='csv')
s3_input_val = TrainingInput(s3_data='s3://{}/{}/{}'.format(s3_bucket_name, s3_prefix, val_file_name), content_type='csv')
print(type(s3_input_train))

hyperparameters = {
    "max_depth": "13",
    "eta": "0.15",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "objective": "reg:squarederror",
    "num_round": "50"}

output_path = 's3://{}/{}/output'.format(s3_bucket_name, s3_prefix)

# 1.5-1
# 1.3-1
estimator = sagemaker.estimator.Estimator(image_uri=sagemaker.image_uris.retrieve("xgboost", region_name, "1.2-2"),
                                          hyperparameters=hyperparameters,
                                          role=role_arn,
                                          instance_count=1,
                                          instance_type='ml.m5.2xlarge',
                                          # instance_type='local',
                                          volume_size=1,  # 1 GB
                                          output_path=output_path)
estimator.fit({'train': s3_input_train, 'validation': s3_input_val})
This works for all of the versions 1.2-2, 1.3-1, and 1.5-1. Unfortunately, the following code only works for version 1.2-2:
import os
import pickle as pkl
import tarfile

import boto3
import pandas as pd
import xgboost as xgb

folder = r"C:\Somewhere"
os.chdir(folder)

s3_prefix = 'some_model'
s3_bucket_name = 'the_bucket'
model_path = 'output/sagemaker-xgboost-2022-04-30-10-52-29-877/output/model.tar.gz'

session = boto3.Session(profile_name='default')
session.resource('s3').Bucket(s3_bucket_name).download_file('{}/{}'.format(s3_prefix, model_path), 'model.tar.gz')

t = tarfile.open('model.tar.gz', 'r:gz')
t.extractall()

model_file_name = 'xgboost-model'
with open(model_file_name, "rb") as input_file:
    e = pkl.load(input_file)
Otherwise I get:
_pickle.UnpicklingError: unpickling stack underflow
Am I missing something? Is my pickle-loading code wrong?
The xgboost version where I run the pickle code is 1.6.0.
I found the solution here. I will leave it in case someone comes across the same issue.
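In short, a sketch of the fix, on the assumption that the 1.3-1 and newer SageMaker XGBoost images save the model in XGBoost's native binary format (via Booster.save_model) rather than pickling it, so the artifact must be loaded with Booster.load_model:
import tarfile

import xgboost as xgb

# Extract the artifact as before
with tarfile.open('model.tar.gz', 'r:gz') as t:
    t.extractall()

# For images newer than 1.2-2 the file is a native XGBoost binary,
# not a pickle, so use Booster.load_model instead of pickle.load
booster = xgb.Booster()
booster.load_model('xgboost-model')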

Amazon SageMaker how to predict new data

I built the model in Amazon SageMaker; the code is attached below.
Now I would like to be able to upload new data to S3 and get predictions based on this model without having to retrain it every time.
sess = sagemaker.Session()
bucket = "innogy-bda-germany-dev-landing-dc3-retailpl"
prefix = "sagemaker/xgboost-upsell"
role = get_execution_role()

container = sagemaker.image_uris.retrieve("xgboost", boto3.Session().region_name, "latest")
display(container)

train_path = 's3://innogy-bda-germany-dev-landing-dc3-retailpl/UPSELL/LIST/train.csv'
test_path = 's3://innogy-bda-germany-dev-landing-dc3-retailpl/UPSELL/LIST/validation.csv'

s3_input_train = sagemaker.TrainingInput(s3_data=train_path, content_type='csv')
s3_input_validation = sagemaker.TrainingInput(s3_data=test_path, content_type='csv')  # was s3_input_test; fit() below expects s3_input_validation

xgb = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type="ml.m5.4xlarge",
    output_path="s3://innogy-bda-germany-dev-landing-dc3-retailpl/UPSELL/LIST/output",
    sagemaker_session=sess,
)

xgb.set_hyperparameters(
    alpha=1.340343927865692,
    colsample_bytree=0.525162855476281,
    eta=0.06451533130134757,
    gamma=0.9683995477068462,
    max_depth=10,
    min_child_weight=3.851108988963441,
    num_round=987,
    subsample=0.8725573749114485,
    silent=0,
    objective="binary:logistic",
    early_stopping_rounds=50,
)

xgb.fit({"train": s3_input_train, "validation": s3_input_validation})
I am asking for a code example: how do I now load this model from S3 in a new notebook and use it to predict new data?
Additionally, I wonder why you don't throw away the target variable when using the built-in XGBoost model in SageMaker, since when making a prediction on a new set I will not know the target.
train_data, validation_data, test_data = np.split(
    df_smote.sample(frac=1, random_state=1729),
    [int(0.7 * len(df_smote)), int(0.9 * len(df_smote))],
)
You will need to follow the steps outlined here to deploy your model to an EC2 instance, so you can do batch or on-demand predictions.
https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-model-deployment.html
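As a rough sketch (not taken from the linked page; the model_data path, endpoint instance type, and sample row are placeholders you would replace with your own), you can re-create the trained model from its S3 artifact in a new notebook and deploy it for on-demand predictions:
import sagemaker
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer

sess = sagemaker.Session()

# Re-attach the trained model from its S3 artifact (path is hypothetical;
# copy the real one from your training job's output location).
model = Model(
    image_uri=container,  # same XGBoost container used for training
    model_data="s3://innogy-bda-germany-dev-landing-dc3-retailpl/UPSELL/LIST/output/<training-job-name>/output/model.tar.gz",
    role=role,
    predictor_cls=Predictor,  # so deploy() returns a Predictor we can call
    sagemaker_session=sess,
)

# Real-time endpoint for on-demand predictions
# (for batch scoring you could instead use model.transformer(...) and transformer.transform(...))
predictor = model.deploy(initial_instance_count=1,
                         instance_type="ml.m5.xlarge",
                         serializer=CSVSerializer())
print(predictor.predict([0.5, 1.2, 3.4]))  # one row of features, CSV-serialized
predictor.delete_endpoint()  # clean up when done
On the second question: the built-in XGBoost container expects the target as the first column of the training and validation CSVs, but at inference time you send feature columns only, which is why the target is kept during training and absent from prediction requests.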

Azure ML output from pipeline

I am trying to construct a pipeline in Microsoft Azure Machine Learning with (for now) a simple Python script as its single step.
The problem is that I cannot find my output.
In my Notebooks section I have the following two notebooks:
1) script called "test.ipynb"
# azureml-core of version 1.0.72 or higher is required
from azureml.core import Workspace, Dataset, Datastore
import pandas as pd
import numpy as np
import datetime
import math

# Upload datasets
subscription_id = 'myid'
resource_group = 'myrg'
workspace_name = 'mywn'
workspace = Workspace(subscription_id, resource_group, workspace_name)

dataset_zre = Dataset.get_by_name(workspace, name='file1')
dataset_SLA = Dataset.get_by_name(workspace, name='file2')
df_zre = dataset_zre.to_pandas_dataframe()
df_SLA = dataset_SLA.to_pandas_dataframe()
result = pd.concat([df_SLA, df_zre], sort=True)

result.to_csv(path_or_buf="/mnt/azmnt/code/Users/aniello.spiezia/outputs/output.csv", index=False)
def_data_store = workspace.get_default_datastore()
def_data_store.upload(src_dir='/mnt/azmnt/code/Users/aniello.spiezia/outputs',
                      target_path='/mnt/azmnt/code/Users/aniello.spiezia/outputs',
                      overwrite=True)
print("\nFinished!")
# End of the file
2) pipeline code called "pipeline.ipynb"
import os
import pandas as pd
import json
import azureml.core
from azureml.core import Workspace, Run, Experiment, Datastore
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.runconfig import CondaDependencies, RunConfiguration
from azureml.core.runconfig import DEFAULT_CPU_IMAGE
from azureml.telemetry import set_diagnostics_collection
from azureml.pipeline.steps import PythonScriptStep
from azureml.pipeline.core import Pipeline, PipelineData, StepSequence

print("SDK Version:", azureml.core.VERSION)

###############################
ws = Workspace.from_config()
print('Workspace name: ' + ws.name,
      'Subscription id: ' + ws.subscription_id,
      'Resource group: ' + ws.resource_group, sep='\n')

experiment_name = 'aml-pipeline-cicd'  # choose a name for experiment
project_folder = '.'  # project folder
experiment = Experiment(ws, experiment_name)
print("Location:", ws.location)
set_diagnostics_collection(send_diagnostics=True)

###############################
cd = CondaDependencies.create(pip_packages=["azureml-sdk==1.0.17", "azureml-train-automl==1.0.17", "pyculiarity", "pytictoc", "cryptography==2.5", "pandas"])
amlcompute_run_config = RunConfiguration(framework="python", conda_dependencies=cd)
amlcompute_run_config.environment.docker.enabled = False
amlcompute_run_config.environment.docker.base_image = DEFAULT_CPU_IMAGE
amlcompute_run_config.environment.spark.precache_packages = False

###############################
aml_compute_target = "aml-compute"
try:
    aml_compute = AmlCompute(ws, aml_compute_target)
    print("found existing compute target.")
except:
    print("creating new compute target")
    provisioning_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_V2",
                                                                idle_seconds_before_scaledown=1800,
                                                                min_nodes=0,
                                                                max_nodes=4)
    aml_compute = ComputeTarget.create(ws, aml_compute_target, provisioning_config)
    aml_compute.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
print("Azure Machine Learning Compute attached")

###############################
def_data_store = ws.get_default_datastore()
def_blob_store = Datastore(ws, "workspaceblobstore")
print("Blobstore's name: {}".format(def_blob_store.name))

# Naming the intermediate data as anomaly data and assigning it to a variable
output_data = PipelineData("output_data", datastore=def_blob_store)
print("output_data object created")

step = PythonScriptStep(name="test",
                        script_name="test.ipynb",
                        compute_target=aml_compute,
                        source_directory=project_folder,
                        allow_reuse=True,
                        runconfig=amlcompute_run_config)
print("Step created.")

###############################
steps = [step]
print("Step lists created")
pipeline = Pipeline(workspace=ws, steps=steps)
print("Pipeline is built")
pipeline.validate()
print("Pipeline validation complete")
pipeline_run = experiment.submit(pipeline)
print("Pipeline is submitted for execution")
pipeline_run.wait_for_completion(show_output=False)
print("Pipeline run completed")

###############################
def_data_store.download(target_path='.',
                        prefix='outputs',
                        show_progress=True,
                        overwrite=True)

model_fname = 'output.csv'
model_path = os.path.join("outputs", model_fname)

# Upload the model file explicitly into artifacts (for CI/CD)
pipeline_run.upload_file(name=model_path, path_or_stream=model_path)
print('Uploaded the model {} to experiment {}'.format(model_fname, pipeline_run.experiment.name))
This gives me the following error:
Pipeline run completed
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-22-a8a523969bb3> in <module>
111
112 # Upload the model file explicitly into artifacts (for CI/CD)
--> 113 pipeline_run.upload_file(name = model_path, path_or_stream = model_path)
114 print('Uploaded the model {} to experiment {}'.format(model_fname, pipeline_run.experiment.name))
115
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/azureml/core/run.py in wrapped(self, *args, **kwargs)
47 "therefore, the {} cannot upload files, or log file backed metrics.".format(
48 self, self.__class__.__name__))
---> 49 return func(self, *args, **kwargs)
50 return wrapped
51
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/azureml/core/run.py in upload_file(self, name, path_or_stream)
1749 :rtype: azure.storage.blob.models.ResourceProperties
1750 """
-> 1751 return self._client.artifacts.upload_artifact(path_or_stream, RUN_ORIGIN, self._container, name)
1752
1753 #_check_for_data_container_id
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/azureml/_restclient/artifacts_client.py in upload_artifact(self, artifact, *args, **kwargs)
108 if isinstance(artifact, str):
109 self._logger.debug("Uploading path artifact")
--> 110 return self.upload_artifact_from_path(artifact, *args, **kwargs)
111 elif isinstance(artifact, IOBase):
112 self._logger.debug("Uploading io artifact")
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/azureml/_restclient/artifacts_client.py in upload_artifact_from_path(self, path, *args, **kwargs)
100 path = os.path.normpath(path)
101 path = os.path.abspath(path)
--> 102 with open(path, "rb") as stream:
103 return self.upload_artifact_from_stream(stream, *args, **kwargs)
104
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/azmnt/code/Users/aniello.spiezia/outputs/output.csv'
Do you know what the problem could be?
In particular, I am interested in saving the output file called "output.csv" somewhere.
The best way to do this depends a bit on how you want to process the output.csv file after the run has completed. But in general, you can just write your csv to the ./outputs folder:
# azureml-core of version 1.0.72 or higher is required
from azureml.core import Workspace, Dataset, Datastore
import os
import pandas as pd
import numpy as np
import datetime
import math

# Upload datasets
subscription_id = 'myid'
resource_group = 'myrg'
workspace_name = 'mywn'
workspace = Workspace(subscription_id, resource_group, workspace_name)

dataset_zre = Dataset.get_by_name(workspace, name='file1')
dataset_SLA = Dataset.get_by_name(workspace, name='file2')
df_zre = dataset_zre.to_pandas_dataframe()
df_SLA = dataset_SLA.to_pandas_dataframe()
result = pd.concat([df_SLA, df_zre], sort=True)

if not os.path.isdir('outputs'):
    os.mkdir('outputs')
result.to_csv('outputs/output.csv', index=False)
print("\nFinished!")
# End of the file
After the run has completed, AzureML will upload the contents of the outputs directory to the run history, so no need to datastore.upload().
Afterwards, you can see the file in http://ml.azure.com when you navigate to the run like my model.pt file below:
See here for some information on the ./outputs and ./logs folders: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-save-write-experiment-files#where-to-write-files
If you actually want to create another DataSet as a result of your Run, please see this post here: Azure Machine Learning Service - dataset API question
In Daniel's example above, you would need to download the output from the run rather than from the datastore in your pipeline.ipynb code: instead of calling def_data_store.download(), you would call pipeline_run.download_file('outputs/output.csv', '.').
Another option is to output your data using PipelineData. PipelineData represents a named piece of output of a pipeline step and is useful if you want to connect multiple steps together with inputs and outputs. With PipelineData, you pass the PipelineData object into PythonScriptStep when you declare your step (as part of arguments=[] and outputs=[]), and then have your script read the output path from the command-line arguments; see the sketch below.
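A minimal sketch of that wiring (the script name train.py and the --output_dir argument name are illustrative, not from the question):
# In pipeline.ipynb: declare the named output and hand it to the step
output_data = PipelineData("output_data", datastore=def_blob_store)
step = PythonScriptStep(name="test",
                        script_name="train.py",
                        arguments=["--output_dir", output_data],
                        outputs=[output_data],
                        compute_target=aml_compute,
                        source_directory=project_folder,
                        runconfig=amlcompute_run_config)
Inside the step script, the mounted output path then arrives on the command line:
# In train.py
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--output_dir")
args = parser.parse_args()

# ... build the `result` dataframe as in test.ipynb ...
os.makedirs(args.output_dir, exist_ok=True)
result.to_csv(os.path.join(args.output_dir, "output.csv"), index=False)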
This notebook has examples of using PipelineData within a pipeline and downloading the outputs: https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-with-data-dependency-steps.ipynb
And this blog post has details about how to handle this within your script (parsing the command-line arguments, creating the output directory, and writing the output file): https://blog.x5ff.xyz/blog/ai-azureml-python-data-pipelines/
