I am able to submit jobs to Azure ML services using a compute cluster. It works well, and the autoscaling combined with good flexibility for custom environments seems to be exactly what I need. However, so far all these jobs seem to use only one compute node of the cluster. Ideally I would like to use multiple nodes for a computation, but all the methods that I see rely on rather deep integration with Azure ML services.
My modelling case is a bit atypical. From previous experiments I identified a group of architectures (pipelines of preprocessing steps + estimators in Scikit-learn) that worked well.
Hyperparameter tuning for one of these estimators can be performed reasonably fast (couple of minutes) with RandomizedSearchCV. So it seems less effective to parallelize this step.
Now I want to tune and train this entire list of architectures.
This should be very easy to parallelize, since all architectures can be trained independently.
Ideally I would like something like (in pseudocode)
tuned = AzurePool.map(tune_model, [model1, model2,...])
However, I could not find any resources on how I could achieve this with an Azure ML Compute cluster.
An acceptable alternative would come in the form of a plug-and-play substitute for sklearn's CV-tuning methods, similar to the ones provided in dask or spark.
There are a number of ways you could tackle this with AzureML. The simplest would be to just launch a number of jobs using the AzureML Python SDK (the underlying example is taken from here)
from azureml.train.sklearn import SKLearn

runs = []

for kernel in ['linear', 'rbf', 'poly', 'sigmoid']:
    for penalty in [0.5, 1, 1.5]:
        print('submitting run for kernel', kernel, 'penalty', penalty)
        script_params = {
            '--kernel': kernel,
            '--penalty': penalty,
        }
        estimator = SKLearn(source_directory=project_folder,
                            script_params=script_params,
                            compute_target=compute_target,
                            entry_script='train_iris.py',
                            pip_packages=['joblib==0.13.2'])
        runs.append(experiment.submit(estimator))
The above requires you to factor your training out into a script (or a set of scripts in a folder) along with the Python packages required. The above estimator is a convenience wrapper for using Scikit-learn. There are also estimators for TensorFlow, PyTorch, Chainer and a generic one (azureml.train.estimator.Estimator) -- they all differ in the Python packages and base Docker image they use.
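To make that concrete, here is a minimal sketch of what such an entry script could look like (the actual train_iris.py is not shown here, so the dataset, model, and the logged metric name 'Accuracy' are my assumptions):

import argparse
from azureml.core import Run
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

# Read the script_params passed by the estimator as command-line arguments.
parser = argparse.ArgumentParser()
parser.add_argument('--kernel', type=str, default='rbf')
parser.add_argument('--penalty', type=float, default=1.0)
args = parser.parse_args()

# Train and evaluate; dataset and model here are purely illustrative.
X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel=args.kernel, C=args.penalty)
accuracy = cross_val_score(clf, X, y, cv=5).mean()

# Log a metric on the run so it can be compared across jobs (and used by HyperDrive below).
Run.get_context().log('Accuracy', float(accuracy))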
A second option, if you are actually tuning parameters, is to use the HyperDrive service like so (using the same SKLearn Estimator as above):
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.parameter_expressions import choice
estimator = SKLearn(source_directory=project_folder,
                    script_params=script_params,
                    compute_target=compute_target,
                    entry_script='train_iris.py',
                    pip_packages=['joblib==0.13.2'])

param_sampling = RandomParameterSampling({
    "--kernel": choice('linear', 'rbf', 'poly', 'sigmoid'),
    "--penalty": choice(0.5, 1, 1.5)
})

hyperdrive_run_config = HyperDriveConfig(estimator=estimator,
                                         hyperparameter_sampling=param_sampling,
                                         primary_metric_name='Accuracy',
                                         primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                         max_total_runs=12,
                                         max_concurrent_runs=4)
hyperdrive_run = experiment.submit(hyperdrive_run_config)
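As a follow-up usage sketch (not part of the original snippet): once the sweep completes, the best child run and its metric can be retrieved from the parent run.

hyperdrive_run.wait_for_completion(show_output=True)
best_run = hyperdrive_run.get_best_run_by_primary_metric()
print(best_run.get_metrics()['Accuracy'])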
Or you could use DASK to schedule the work, as you were mentioning. Here is a sample of how to set up DASK on an AzureML Compute Cluster so you can do interactive work on it: https://github.com/danielsc/azureml-and-dask
There's also a ParallelTaskConfiguration class with a worker_count_per_node setting, which defaults to 1.
Related
I am looking over the docs in XGBoost, but I am not understanding 1) if there are any differences between using xgboost.fit() vs. xgboost.train(), and 2) if there are any advantages/disadvantages using one over the other?
I think the only one I've identified so far is that you can specify more params with the train() function, but I'm not entirely sold that you cannot specify those same params somewhere within the fit() function as well.
Taking most points from this related answer on the use of DMatrix in xgboost:
The XGBoost Python package allows choosing between two APIs. The Scikit-Learn API has objects XGBRegressor and XGBClassifier trained via calling fit.
XGBoost's own Learning API has xgboost.train.
You could probably specify most models with either of the two choices.
It's mostly a matter of personal preference. The scikit-learn API makes it easy to utilize some of the tools available in scikit-learn (model selection, pipelines, etc.).
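As a rough illustration of the two routes (a sketch on synthetic data, using standard XGBoost parameter names):

import numpy as np
import xgboost as xgb
from xgboost import XGBClassifier

X = np.random.rand(200, 5)
y = np.random.randint(0, 2, size=200)

# Scikit-Learn API: an estimator object with fit(), which slots into
# pipelines, GridSearchCV, etc.
sk_model = XGBClassifier(n_estimators=50, max_depth=3, learning_rate=0.1)
sk_model.fit(X, y)

# Learning API: parameters as a dict plus train() on a DMatrix.
dtrain = xgb.DMatrix(X, label=y)
params = {'max_depth': 3, 'eta': 0.1, 'objective': 'binary:logistic'}
booster = xgb.train(params, dtrain, num_boost_round=50)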
We want to tune a SageMaker PipelineModel with a HyperparameterTuner (or something similar) where several components of the pipeline have associated hyperparameters. Both components in our case are realized via SageMaker containers for ML algorithms.
model = PipelineModel(..., models = [ our_model, xgb_model ])
deploy = Estimator(image_uri = model, ...)
...
tuner = HyperparameterTuner(deploy, .... tune_parameters, ....)
tuner.fit(...)
Now there is, of course, the problem of how to distribute the tune_parameters to the pipeline steps during tuning.
In scikit-learn this is achieved by specially naming the tuning parameters <StepName>__<ParameterName>.
I don't see a way to achieve something similar with SageMaker, though. Also, searching for those two keywords brings up the same question here, but that is not really what we want to do.
Any suggestions on how to achieve this?
If both the models need to be jointly optimized, you could run a SageMaker HPO job in script mode and define both the models in the script. Or you could run two HPO jobs, optimize each model, and then create the Pipeline Model. There is no native support for doing an HPO job on a PipelineModel.
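A minimal sketch of the second approach (two separate HPO jobs, then assembling the PipelineModel). The estimators, metric name, parameter ranges, data channels, and role below are placeholders and assumptions, not something from the original answer:

from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter
from sagemaker.pipeline import PipelineModel

xgb_tuner = HyperparameterTuner(
    estimator=xgb_estimator,                      # Estimator for the XGBoost step (placeholder)
    objective_metric_name='validation:rmse',
    objective_type='Minimize',
    hyperparameter_ranges={
        'eta': ContinuousParameter(0.01, 0.3),
        'max_depth': IntegerParameter(3, 10),
    },
    max_jobs=20,
    max_parallel_jobs=4,
)
xgb_tuner.fit({'train': train_s3, 'validation': val_s3})  # placeholder S3 inputs

# Repeat the same pattern for the other component, then build the pipeline
# from the best model of each tuning job.
best_xgb_model = xgb_tuner.best_estimator().create_model()
pipeline_model = PipelineModel(name='stacked-pipeline',
                               role=role,
                               models=[our_best_model, best_xgb_model])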
I work at AWS and my opinions are my own.
I am testing some algorithms in TensorFlow Federated (TFF). In this regard, I would like to test and compare them on the same federated dataset with different "levels" of data heterogeneity, i.e. non-IIDness.
Hence, I would like to know whether there is any way to control and tune the "level" of non-IIDness in a specific federated dataset, in an automatic or semi-automatic fashion, e.g. by means of TFF APIs or just traditional TF API (maybe inside the Dataset utils).
To be more practical: for instance, the EMNIST federated dataset provided by TFF has 3383 clients, each of them holding their own handwritten characters. However, these local datasets seem to be quite balanced in terms of the number of local examples and of the represented classes (all classes are, more or less, represented locally).
If I would like to have a federated dataset (e.g., starting from TFF's EMNIST one) that is:
Pathologically non-IID, for example with clients that hold only one class out of N classes (still referring to a classification task). Is this the purpose of tff.simulation.datasets.build_single_label_dataset (documentation here)? If so, how should I use it starting from a federated dataset such as the ones already provided by TFF?;
Unbalanced in terms of the number of local examples (e.g., one client has 10 examples, another one has 100 examples);
Or both of the above;
how should I proceed inside the TFF framework to prepare a federated dataset with those characteristics?
Should I do all of this by hand? Or do any of you have advice on how to automate this process?
An additional question: in the paper "Measuring the Effects of Non-Identical Data Distribution for Federated Visual Classification" by Hsu et al., they exploit the Dirichlet distribution to synthesize a population of non-identical clients, and they use a concentration parameter to control the identicalness among clients. This seems like an easy-to-tune way to produce datasets with different levels of heterogeneity. Any advice about how to implement this strategy (or a similar one) inside the TFF framework, or just in TensorFlow (Python) considering a simple dataset such as EMNIST, would be very useful too.
Thank you a lot.
For Federated Learning simulations, it's quite reasonable to set up the client datasets in Python, in the experiment driver, to achieve the desired distributions. At a high level, TFF handles modeling data location ("placements" in the type system) and computation logic. Re-mixing/generating a simulation dataset is not quite core to the library, though there are helpful libraries as you've found. Doing this directly in Python by manipulating the tf.data.Dataset and then "pushing" the client datasets into a TFF computation seems straightforward.
Label non-IID
Yes, tff.simulation.datasets.build_single_label_dataset is intended for this purpose.
It takes a tf.data.Dataset and essentially filters out all examples that don't match the desired_label value for the label_key (assuming the dataset yields dict-like structures).
For EMNIST, to create a dataset of all the ones (regardless of user), this could be achieved by:
train_data, _ = tff.simulation.datasets.emnist.load_data()
ones = tff.simulation.datasets.build_single_label_dataset(
    train_data.create_tf_dataset_from_all_clients(),
    label_key='label', desired_label=1)
print(ones.element_spec)
>>> OrderedDict([('label', TensorSpec(shape=(), dtype=tf.int32, name=None)), ('pixels', TensorSpec(shape=(28, 28), dtype=tf.float32, name=None))])
print(next(iter(ones))['label'])
>>> tf.Tensor(1, shape=(), dtype=int32)
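If you want to keep the client boundaries instead of pooling all users, a possible variant (my assumption, relying on ClientData.preprocess being available in your TFF version) is to apply the same filter per client:

# Each client keeps only its examples with the desired label.
ones_per_client = train_data.preprocess(
    lambda ds: tff.simulation.datasets.build_single_label_dataset(
        ds, label_key='label', desired_label=1))
first_client = ones_per_client.client_ids[0]
# Should print a label-1 tensor, assuming this client has any '1' examples.
print(next(iter(ones_per_client.create_tf_dataset_for_client(first_client)))['label'])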
Data imbalance
A combination of tf.data.Dataset.repeat and tf.data.Dataset.take can be used to create data imbalances.
train_data, _ = tff.simulation.datasets.emnist.load_data()
datasets = [train_data.create_tf_dataset_for_client(id) for id in train_data.client_ids[:2]]
print([tf.data.experimental.cardinality(ds).numpy() for ds in datasets])
>>> [93, 109]
datasets[0] = datasets[0].repeat(5)
datasets[1] = datasets[1].take(5)
print([tf.data.experimental.cardinality(ds).numpy() for ds in datasets])
>>> [465, 5]
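Regarding the Dirichlet approach from Hsu et al.: nothing TFF-specific is needed. Below is a rough sketch of one way to synthesize clients with a concentration parameter alpha; this is entirely my own assumption (illustrative client count and alpha, and it pools the full EMNIST training split into memory), not code from the paper or from TFF.

import numpy as np
import tensorflow as tf
import tensorflow_federated as tff

train_data, _ = tff.simulation.datasets.emnist.load_data()
examples = list(train_data.create_tf_dataset_from_all_clients().as_numpy_iterator())
labels = np.array([ex['label'] for ex in examples])

num_classes = 10     # the default EMNIST split is digits only
num_clients = 20     # illustrative
alpha = 0.1          # smaller alpha -> more heterogeneous label distributions

rng = np.random.default_rng(0)
client_indices = [[] for _ in range(num_clients)]
for c in range(num_classes):
    idx = np.flatnonzero(labels == c)
    rng.shuffle(idx)
    # Split this class's examples across clients with Dirichlet-drawn proportions.
    proportions = rng.dirichlet(alpha * np.ones(num_clients))
    cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
    for client_id, part in enumerate(np.split(idx, cuts)):
        client_indices[client_id].extend(part.tolist())

# One tf.data.Dataset per synthetic client, ready to be passed into a TFF computation.
client_datasets = [
    tf.data.Dataset.from_tensor_slices({
        'pixels': np.stack([examples[i]['pixels'] for i in idx]),
        'label': labels[idx],
    })
    for idx in client_indices if idx
]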
I want to deploy a stacked model to Azure Machine Learning Service. The architecture of the solution consists of three models and one meta-model.
Data is a time-series data.
I'd like the model to automatically re-train based on some schedule. I'd also like to re-tune hyperparameters during each re-training.
AML Service offers HyperDriveStep class that can be used in the pipeline for automatic hyperparameter optimization.
Is it possible - and if so, how to do it - to use HyperDriveStep with time-series CV?
I checked the documentation, but haven't found a satisfying answer.
AzureML HyperDrive is a black box optimizer, meaning that it will just run your code with different parameter combinations based on the configuration you chose. At the same time, it supports Random and Bayesian sampling and has different policies for early stopping (see here for relevant docs and here for an example -- HyperDrive is towards the end of the notebook).
The only thing your model/script/training needs to adhere to is that it can be launched from a script that takes --param style parameters. As long as that holds, you could optimize the parameters for each of your models individually and then tune the meta-model, or you could tune them all in one run. It will mainly depend on the size of the parameter space and the amount of compute you want to use (or pay for).
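For the time-series part specifically, here is one possible sketch (my own assumption of how this could look, with load_timeseries() as a placeholder): run the time-series CV inside the training script and log the aggregated score under the name you pass as primary_metric_name.

import argparse
import numpy as np
from azureml.core import Run
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import TimeSeriesSplit

parser = argparse.ArgumentParser()
parser.add_argument('--learning_rate', type=float, default=0.1)
parser.add_argument('--max_depth', type=int, default=3)
args = parser.parse_args()

X, y = load_timeseries()  # placeholder: load your time-ordered features and target

# Expanding-window CV that respects temporal order.
scores = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = GradientBoostingRegressor(learning_rate=args.learning_rate,
                                      max_depth=args.max_depth)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

# HyperDrive reads this value when primary_metric_name='r2_timeseries_cv'.
Run.get_context().log('r2_timeseries_cv', float(np.mean(scores)))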
I have an account on a computing cluster that runs on Linux. I'm using scikit-learn to train a Random Forest classifier with 1000 trees on a very large dataset. I tried to use all the cores of the computing cluster by running the following code:
clf = RandomForestClassifier(n_estimators=1000, n_jobs=-1)
clf.fit(data, Y)
However, when I run the code I see that only 1.2% of the CPUs are being used! So why isn't it using all the cores that exist? And how can I solve this?
Edit: I saw that my problem could be relevant to the one in this link, but I wasn't able to understand the solution. https://github.com/scikit-learn/scikit-learn/issues/1053
This might not be the root of your problem (n_jobs=-1 should automatically detect and use all cores on your master node), but scikit-learn will only run in parallel on the cores of a single machine in your cluster. By default it will NOT run on cores from different machines in your cluster, as that would require knowing the architecture of your cluster and communicating over the network, which scikit-learn doesn't know how to do since it varies from cluster to cluster.
For this you will have to use a solution like IPython Parallel. See the excellent tutorial by Olivier Grisel if you want to use the full power of your cluster.
I recommend that you update sklearn to the latest version, try your code locally (ideally under the same OS and sklearn version), and debug the scaling behavior and CPU utilization by setting n_jobs=1,2,3... and benchmarking the fit. For example, if n_jobs=1 doesn't achieve high utilization on one core in the cluster but does on your local PC, that would indicate a problem with the cluster and not with the code. Sometimes the top command behaves differently on a cluster; you should check this with the admin.
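A small benchmarking sketch along those lines (synthetic data; sizes and n_jobs values are illustrative):

import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=50000, n_features=50, random_state=0)

for n_jobs in [1, 2, 4, -1]:
    clf = RandomForestClassifier(n_estimators=100, n_jobs=n_jobs, random_state=0)
    start = time.perf_counter()
    clf.fit(X, y)
    print('n_jobs=%s: %.1fs' % (n_jobs, time.perf_counter() - start))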