Final Edit: this problem occurred because the target column contained integers that were supposed to represent categories, so the model was doing a regression. Once I converted them into factors using .asfactor(), the confusion matrix method detailed in the answer below worked.
I am trying to run a confusion matrix on my Random Forest Model (my_model), but the documentation has been less than helpful. From here it says the command is h2o.confusionMatrix(my_model) but there is no such thing in 3.0.
Here are the steps to fit the model:
from h2o.estimators.random_forest import H2ORandomForestEstimator
data_h = h2o.H2OFrame(data)
train, valid = data_h.split_frame(ratios=[.7], seed = 1234)
my_model = H2ORandomForestEstimator(model_id = "rf_h", ntrees = 400,
max_depth = 30, nfolds = 8, seed = 25)
my_model.train(x = features, y = target, training_frame=train)
pred = my_model.predict(valid)
I have tried the following:
my_model.confusion_matrix()
AttributeError: type object 'H2ORandomForestEstimator' has no attribute
'confusion_matrix'
Gotten from this example.
I have attempted to use tab completion to find out what it might be and have tried:
h2o.model.confusion_matrix(my_model)
TypeError: 'module' object is not callable
and
h2o.model.ConfusionMatrix(my_model)
which outputs simply all the model diagnostics and then the error:
H2OTypeError: Argument `cm` should be a list, got H2ORandomForestEstimator
Finally,
h2o.model.ConfusionMatrix(pred)
Which gives the same error as above.
Not sure what to do here, how can I view the results of the confusion matrix of the model?
Edit: Added more code to the beginning of the question for Context
Please see the documentation for the full parameter list. For your convenience, here is the signature: confusion_matrix(metrics=None, thresholds=None, train=False, valid=False, xval=False).
Here is a working example of how to use the method:
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
h2o.init()
# import the cars dataset:
# this dataset is used to classify whether or not a car is economical based on
# the car's displacement, power, weight, and acceleration, and the year it was made
cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
# convert response column to a factor
cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
# set the predictor names and the response column name
predictors = ["displacement","power","weight","acceleration","year"]
response = "economy_20mpg"
# split into train and validation sets
train, valid = cars.split_frame(ratios = [.8], seed = 1234)
# try using the binomial_double_trees (boolean parameter):
# Initialize and train a DRF
cars_drf = H2ORandomForestEstimator(binomial_double_trees = False, seed = 1234)
cars_drf.train(x = predictors, y = response, training_frame = train, validation_frame = valid)
cars_drf.confusion_matrix()
# or specify the validation frame
cars_drf.confusion_matrix(valid=True)
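For intuition about what confusion_matrix() reports, and why the response has to be categorical, here is a minimal plain-Python sketch. It is not H2O's implementation, and the labels are made up for illustration.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

# Hypothetical labels, purely for illustration
actual = ["0", "0", "1", "1", "1", "0"]
predicted = ["0", "1", "1", "1", "0", "0"]

counts = confusion_counts(actual, predicted)
labels = sorted(set(actual) | set(predicted))

# Rows are actual classes, columns are predicted classes
for a in labels:
    print(a, [counts[(a, p)] for p in labels])
```

With a numeric response the model predicts real numbers rather than class labels, so there are no discrete pairs to count. That is consistent with the fix in the question's final edit: convert the response with asfactor() first.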
pros_gbm = H2OGradientBoostingEstimator(nfolds=0,seed=1234, keep_cross_validation_predictions = False, ntrees=1000, max_depth=3, learn_rate=0.01, distribution='multinomial')
pros_gbm.train(x=predictors, y=target, training_frame=hf_train, validation_frame = hf_test)
pros_gbm.predict(hf_test)
Currently I am predicting my test data as above, but how can I get predictions from the nth tree (out of 1000 trees) of this model? Is there an option in predict for that, or is there another way?
You can get the predicted probabilities (cumulative for each tree) using staged_predict_proba() and the leaf node assignments from predict_leaf_node_assignment(). Here is an example:
import h2o
from h2o.estimators import H2OGradientBoostingEstimator
h2o.init()
# Import the prostate dataset into H2O:
prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
# Set the predictors and response; set the factors:
prostate["CAPSULE"] = prostate["CAPSULE"].asfactor()
predictors = ["ID","AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"]
response = "CAPSULE"
# Build and train the model:
pros_gbm = H2OGradientBoostingEstimator(nfolds=5,
seed=1111,
keep_cross_validation_predictions = True)
pros_gbm.train(x=predictors, y=response, training_frame=prostate)
print(pros_gbm.predict_leaf_node_assignment(prostate[:1, :]))
print(pros_gbm.staged_predict_proba(prostate[:1, :]))
You can also check out the Tree Class if you want details (leaf/split info) for each tree.
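To make "cumulative for each tree" concrete: for a binary response, the prediction after n trees can be thought of as an initial score plus the sum of the first n trees' shrunken contributions, passed through the logistic function. The sketch below uses made-up per-tree contributions and hypothetical names; it illustrates the idea only and is not H2O's code. A multinomial model would use a softmax over per-class scores instead.

```python
import math
from itertools import accumulate

def staged_probabilities(init_score, tree_contributions, learn_rate):
    """Probability of the positive class after 1, 2, ..., n trees."""
    # Running sum of the shrunken per-tree contributions on the logit scale
    running = accumulate(learn_rate * c for c in tree_contributions)
    return [1.0 / (1.0 + math.exp(-(init_score + s))) for s in running]

# Hypothetical raw outputs of four trees for a single row
contributions = [0.8, 0.5, -0.2, 0.3]
stages = staged_probabilities(init_score=0.0,
                              tree_contributions=contributions,
                              learn_rate=0.1)

# stages[n - 1] is the prediction that uses only the first n trees
print(stages)
```

Reading off entry n - 1 of the staged output is therefore the way to see what the model would predict if it stopped after n trees.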
If we have a hierarchical model with data from different sites as different groups in the model, how do we predict on new groups (new sites that we haven't seen before)?
e.g. using the following logistic regression model:
from pymc3 import Model, sample, Normal, HalfCauchy,Bernoulli
import theano.tensor as tt
with Model() as varying_slope:
    mu_beta = Normal('mu_beta', mu=0., sd=1e5)
    sigma_beta = HalfCauchy('sigma_beta', 5)
    a = Normal('a', mu=0., sd=1e5)
    betas = Normal('b',mu=mu_beta,sd=sigma_beta,shape=(n_features,n_site))
    y_hat = a + tt.dot(X_shared,betas[:,site_shared])
    y_like = Bernoulli('y_like', logit_p=y_hat, observed=train_y)
After we fit this model, we can predict on new data (i.e. sample from the posterior predictive) from a specific site using:
site_to_predict = 1
samples = 500
x = tt.matrix('X',dtype='float64')
new_site = tt.vector('new_site',dtype='int32')
n_samples = tt.iscalar('n_samples')
x.tag.test_value = np.empty(shape=(1,X.shape[1]))
new_site.tag.test_value = np.empty(shape=(1,1))
_sample_proba = approx.sample_node(varying_slope.y_like.distribution.p,
size=n_samples,
more_replacements={X_shared: x,site_shared:new_site})
sample_proba = theano.function([x,new_site,n_samples], _sample_proba)
pred_test = sample_proba(test_X.reshape(1,-1),np.array(site_to_predict).reshape(-1),samples)
but what is the correct way to sample from the posterior predictive distribution if we have a new unseen site ?
I'm just copying my answer from the pymc discourse thread if someone by chance runs into this question or another one like it here.
First of all, beware of the centered hierarchical parametrization you are using; it may lead to divergences and difficulties while fitting.
That being said, your model looks more or less like a GLM with shared prior random variates mu_beta and sigma_beta across features and sites. Once you get a posterior distribution over those two, your predictions should look something like
y_hat = a + dot(X_shared, Normal(mu=mu_beta, sigma=sigma_beta))
y_like = Bernoulli('y_like', logit_p=y_hat)
So, we will aim to get that.
The way we usually recommend doing out-of-sample posterior predictive checks is with theano.shared variables. I’ll use a different approach, inspired by the functional API that is the core design idea for pymc4. There are many differences between pymc3 and the skeleton of pymc4 that I won't go into, but one thing I started to use more is factory functions to get the Model instances. Instead of trying to define things inside the model with theano.shared variables, I just create a new model with the new data and draw posterior predictive samples from it. I just recently posted about this here.
The idea is to create the model with the training data and sample from it to get a trace. Then you have to extract from the trace the hierarchical part that is shared with the unseen site: mu_beta, sigma_beta and a. Finally, you create a new model using the new data of the test site, and sample from the posterior predictive using a list of dictionaries that hold the mu_beta, sigma_beta and a part of the training trace. Here’s a self-contained example:
import numpy as np
import pymc3 as pm
from theano import tensor as tt
from matplotlib import pyplot as plt
def model_factory(X, y, site_shared, n_site, n_features=None):
    if n_features is None:
        n_features = X.shape[-1]
    with pm.Model() as model:
        mu_beta = pm.Normal('mu_beta', mu=0., sd=1)
        sigma_beta = pm.HalfCauchy('sigma_beta', 5)
        a = pm.Normal('a', mu=0., sd=1)
        # Non-centered parametrization: betas = mu_beta + sigma_beta * b
        b = pm.Normal('b', mu=0, sd=1, shape=(n_features, n_site))
        betas = mu_beta + sigma_beta * b
        y_hat = a + tt.dot(X, betas[:, site_shared])
        pm.Bernoulli('y_like', logit_p=y_hat, observed=y)
    return model
# First I generate some training X data
n_features = 10
ntrain_site = 5
ntrain_obs = 100
ntest_site = 1
ntest_obs = 1
train_X = np.random.randn(ntrain_obs, n_features)
train_site_shared = np.random.randint(ntrain_site, size=ntrain_obs)
new_site_X = np.random.randn(ntest_obs, n_features)
test_site_shared = np.zeros(ntest_obs, dtype=np.int32)
# Now I generate the training and test y data with a sample from the prior
with model_factory(X=train_X,
                   y=np.empty(ntrain_obs, dtype=np.int32),
                   site_shared=train_site_shared,
                   n_site=ntrain_site) as train_y_generator:
    train_Y = pm.sample_prior_predictive(1, vars=['y_like'])['y_like'][0]
with model_factory(X=new_site_X,
                   y=np.empty(ntest_obs, dtype=np.int32),
                   site_shared=test_site_shared,
                   n_site=ntest_site) as test_y_generator:
    new_site_Y = pm.sample_prior_predictive(1, vars=['y_like'])['y_like'][0]
# The previous part is just to get some toy data to fit
# Now comes the important parts. First training
with model_factory(X=train_X,
                   y=train_Y,
                   site_shared=train_site_shared,
                   n_site=ntrain_site) as train_model:
    train_trace = pm.sample()
# Second comes the hold out data posterior predictive
with model_factory(X=new_site_X,
                   y=new_site_Y,
                   site_shared=test_site_shared,
                   n_site=ntrain_site) as test_model:
    # We first have to extract the learnt global effect from the train_trace
    df = pm.trace_to_dataframe(train_trace,
                               varnames=['mu_beta', 'sigma_beta', 'a'],
                               include_transformed=True)
    # We have to supply the samples kwarg because it cannot be inferred if the
    # input trace is not a MultiTrace instance
    ppc = pm.sample_posterior_predictive(trace=df.to_dict('records'),
                                         samples=len(df))
plt.figure()
plt.hist(ppc['y_like'], 30)
plt.axvline(new_site_Y, linestyle='--', color='r')
The posterior predictive I get looks like this:
Of course, I don’t know what kind of data to concretely put as your X_shared, site_shared or train_y, so I just made up some nonsense toy data at the beginning of the code. You should replace that with your actual data.
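Stripped of the pymc3 machinery, the core idea can be sketched in plain Python: for every posterior draw of the global parameters, draw fresh coefficients for the unseen site from Normal(mu_beta, sigma_beta), then simulate the outcome. The posterior draws and feature vector below are made up for illustration; in practice the draws would come from the training trace.

```python
import math
import random

def predict_new_site(global_draws, x_new, rng):
    """Simulate one outcome per posterior draw of (mu_beta, sigma_beta, a)."""
    outcomes = []
    for draw in global_draws:
        # The unseen site has no fitted coefficients, so draw them from the
        # population-level distribution implied by this posterior sample
        betas = [rng.gauss(draw["mu_beta"], draw["sigma_beta"]) for _ in x_new]
        logit = draw["a"] + sum(x * b for x, b in zip(x_new, betas))
        p = 1.0 / (1.0 + math.exp(-logit))
        outcomes.append(1 if rng.random() < p else 0)
    return outcomes

rng = random.Random(0)
# Hypothetical posterior draws standing in for rows of the training trace
global_draws = [
    {"mu_beta": rng.gauss(0.5, 0.1),
     "sigma_beta": abs(rng.gauss(0.3, 0.05)),
     "a": rng.gauss(-0.2, 0.1)}
    for _ in range(1000)
]
x_new = [1.0, -0.5, 2.0]  # made-up feature vector for the new site

samples = predict_new_site(global_draws, x_new, rng)
print(sum(samples) / len(samples))  # estimated P(y = 1) at the new site
```

The spread of these simulated outcomes reflects both the uncertainty in the global parameters and the site-to-site variation, which is what a prediction for a never-seen site should account for.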
When using:
"keep_cross_validation_predictions": True
"keep_cross_validation_fold_assignment": True
in H2O's XGBoost Estimator, I am not able to map these cross validated probabilities back to the original dataset. There is one documentation example for R but not for Python (combining holdout predictions).
Any leads on how to do this in Python?
The cross-validated predictions are stored in two different places -- once as a list of length k (for k-folds) in model.cross_validation_predictions(), and another as an H2O Frame with the CV preds in the same order as the original training rows in model.cross_validation_holdout_predictions(). The latter is usually what people want (we added this later, that's why there are two versions).
Yes, unfortunately the R example to get this frame in the "Cross-validation" section of the H2O User Guide does not have a Python version (ticket to fix that). In the keep_cross_validation_predictions argument documentation, it only shows one of the two locations.
Here's an updated example using XGBoost and showing both types of CV predictions:
import h2o
from h2o.estimators.xgboost import H2OXGBoostEstimator
h2o.init()
# Import a sample binary outcome training set into H2O
train = h2o.import_file("http://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)
# For binary classification, response should be a factor
train[y] = train[y].asfactor()
# try using the `keep_cross_validation_predictions` (boolean parameter):
# first initialize your estimator, set nfolds parameter
xgb = H2OXGBoostEstimator(keep_cross_validation_predictions = True, nfolds = 5, seed = 1)
# then train your model
xgb.train(x = x, y = y, training_frame = train)
# print the cross-validation predictions as a list
xgb.cross_validation_predictions()
# print the cross-validation predictions as an H2OFrame
xgb.cross_validation_holdout_predictions()
The CV holdout predictions frame looks like this:
predict p0 p1
--------- --------- --------
1 0.396057 0.603943
1 0.149905 0.850095
1 0.0407018 0.959298
1 0.140991 0.859009
0 0.67361 0.32639
0 0.865698 0.134302
1 0.12927 0.87073
1 0.0549603 0.94504
1 0.162544 0.837456
1 0.105603 0.894397
[10000 rows x 3 columns]
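If you only have the per-fold list and a fold assignment column, the holdout predictions can be rebuilt by taking, for each row, the prediction from the fold that held that row out. The plain-Python sketch below uses made-up numbers and assumes each per-fold entry carries one value per training row; that layout is an assumption for illustration, so check it against your actual frames.

```python
def combine_holdout_predictions(fold_assignment, per_fold_predictions):
    """Pick, for each row, the prediction made by the fold that held it out."""
    return [
        per_fold_predictions[fold][row]
        for row, fold in enumerate(fold_assignment)
    ]

# Hypothetical data: 6 training rows, 3 folds
fold_assignment = [0, 2, 1, 0, 1, 2]
per_fold_predictions = [
    [0.9, 0.0, 0.0, 0.2, 0.0, 0.0],  # fold 0 held out rows 0 and 3
    [0.0, 0.0, 0.7, 0.0, 0.4, 0.0],  # fold 1 held out rows 2 and 4
    [0.0, 0.6, 0.0, 0.0, 0.0, 0.1],  # fold 2 held out rows 1 and 5
]

holdout = combine_holdout_predictions(fold_assignment, per_fold_predictions)
print(holdout)  # [0.9, 0.6, 0.7, 0.2, 0.4, 0.1]
```

Because the result is in the original row order, it can be bound column-wise to the training data, which is the mapping the question asks for.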
For Python there is an example of this on GBM, and it should be exactly the same for XGB. According to that page, you should be able to do something like this:
model = H2OXGBoostEstimator(keep_cross_validation_predictions = True)
model.train(x = predictors, y = response, training_frame = train)
cv_predictions = model.cross_validation_predictions()
When training a model in h2o v3.10 using the python h2o library, I am seeing an error when trying to set one_hot_explicit as a choice for the categorical_encoding parameter.
encoding = "enum"
gbm = H2OGradientBoostingEstimator(
categorical_encoding = encoding)
gbm.train(x, y,train_h2o_df,test_h2o_df)
Works fine and the model uses enum categorical_encoding, but when:
encoding = "one_hot_explicit"
or
encoding = "OneHotExplicit"
the following error is raised:
gbm Model Build progress: | (failed)
....
OSError: Job with key $03017f00000132d4ffffffff$_bde8fcb4777df7e0be1199bf590a47f9 failed with an exception: java.lang.AssertionError
stacktrace:
java.lang.AssertionError
at hex.ModelBuilder.init(ModelBuilder.java:958)
at hex.tree.SharedTree.init(SharedTree.java:78)
at hex.tree.gbm.GBM.init(GBM.java:57)
at hex.tree.SharedTree$Driver.computeImpl(SharedTree.java:159)
at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:169)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1203)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Is there some dependency I'm missing or is this a bug?
Your encoding choice should work, though you may want to update to the latest stable release of H2O. Here is a code snippet that works; run it and check whether it works for you. If it does, you can try to pinpoint the difference between your previous code and the example below.
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()
# import the airlines dataset:
# This dataset is used to classify whether a flight will be delayed 'YES' or not "NO"
# original data can be found at http://www.transtats.bts.gov/
airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
# convert columns to factors
airlines["Year"]= airlines["Year"].asfactor()
airlines["Month"]= airlines["Month"].asfactor()
airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
# set the predictor names and the response column name
predictors = ["Origin", "Dest", "Year", "DayOfWeek", "Month", "Distance"]
response = "IsDepDelayed"
# split into train and validation sets
train, valid= airlines.split_frame(ratios = [.8], seed = 1234)
# try using the `categorical_encoding` parameter:
encoding = "one_hot_explicit"
# initialize the estimator
airlines_gbm = H2OGradientBoostingEstimator(categorical_encoding = encoding, seed =1234)
# then train the model
airlines_gbm.train(x = predictors, y = response, training_frame = train, validation_frame = valid)
# print the auc for the validation set
airlines_gbm.auc(valid=True)
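For reference, one-hot explicit encoding expands each categorical column into indicator columns: as I understand the H2O documentation, N+1 columns for a feature with N levels, the extra one covering missing values. The plain-Python sketch below illustrates the transformation only; it is not H2O's implementation, and the column names are made up.

```python
def one_hot_explicit(values):
    """Expand a categorical column into one indicator column per level,
    plus a trailing indicator for missing values (None)."""
    levels = sorted({v for v in values if v is not None})
    names = [f"level.{lvl}" for lvl in levels] + ["level.missing"]
    rows = []
    for v in values:
        row = [1 if v == lvl else 0 for lvl in levels]
        row.append(1 if v is None else 0)
        rows.append(row)
    return names, rows

# Hypothetical airport codes with one missing entry
names, rows = one_hot_explicit(["SFO", "JFK", None, "SFO"])
print(names)
for row in rows:
    print(row)
```

With high-cardinality columns such as Origin and Dest, this expansion can produce a large number of columns, which is worth keeping in mind when comparing it with the default enum encoding.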
I'm using PySpark 2.0 for a Kaggle competition. I'd like to know how a model (RandomForest) behaves depending on different parameters. ParamGridBuilder() allows specifying different values for a single parameter, and then performs (I guess) a Cartesian product of the entire set of parameters. Assuming my DataFrame is already defined:
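from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
rdc = RandomForestClassifier()
pipeline = Pipeline(stages=STAGES + [rdc])
# Parentheses allow the chained calls to span several lines
paramGrid = (ParamGridBuilder().addGrid(rdc.maxDepth, [3, 10, 20])
             .addGrid(rdc.minInfoGain, [0.01, 0.001])
             .addGrid(rdc.numTrees, [5, 10, 20, 30])
             .build())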
evaluator = MulticlassClassificationEvaluator()
valid = TrainValidationSplit(estimator=pipeline,
estimatorParamMaps=paramGrid,
evaluator=evaluator,
trainRatio=0.50)
model = valid.fit(df)
result = model.bestModel.transform(df)
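The Cartesian product guessed at above can be sketched with the standard library. This is plain Python using the grid values from the question, not Spark code; plain strings stand in for the Spark Param objects, and the order of the list Spark actually builds may differ.

```python
from itertools import product

grid = {
    "maxDepth": [3, 10, 20],
    "minInfoGain": [0.01, 0.001],
    "numTrees": [5, 10, 20, 30],
}

# One dict per combination, in the same spirit as the list build() returns
param_maps = [dict(zip(grid, combo)) for combo in product(*grid.values())]

print(len(param_maps))  # 3 * 2 * 4 = 24
print(param_maps[0])
```

So this grid means 24 fits for a TrainValidationSplit, and 24 times the number of folds for a CrossValidator.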
OK, so now I'm able to retrieve simple information with a handmade function:
def evaluate(result):
    predictionAndLabels = result.select("prediction", "label")
    metrics = ["f1","weightedPrecision","weightedRecall","accuracy"]
    for m in metrics:
        evaluator = MulticlassClassificationEvaluator(metricName=m)
        print(str(m) + ": " + str(evaluator.evaluate(predictionAndLabels)))
Now I want several things:
What are the parameters of the best model? This post partially answers the question: How to extract model hyper-parameters from spark.ml in PySpark?
What are the parameters of all models?
What are the results (recall, accuracy, etc.) of each model? I only found print(model.validationMetrics), which displays (it seems) a list containing the accuracy of each model, but I can't tell which model each value refers to.
If I can retrieve all this information, I should be able to display graphs and bar charts, and work as I do with pandas and sklearn.
Spark 2.4+
SPARK-21088 CrossValidator, TrainValidationSplit should collect all models when fitting - adds support for collecting submodels.
By default this behavior is disabled, but can be controlled using CollectSubModels Param (setCollectSubModels).
valid = TrainValidationSplit(
estimator=pipeline,
estimatorParamMaps=paramGrid,
evaluator=evaluator,
collectSubModels=True)
model = valid.fit(df)
model.subModels
Spark < 2.4
Long story short you simply cannot get parameters for all models because, similarly to CrossValidator, TrainValidationSplitModel retains only the best model. These classes are designed for semi-automated model selection not exploration or experiments.
What are the parameters of all models?
While you cannot retrieve the actual models, validationMetrics correspond to the input Params, so you should be able to simply zip both:
from typing import Dict, Tuple, List, Any
from pyspark.ml.param import Param
from pyspark.ml.tuning import TrainValidationSplitModel
EvalParam = List[Tuple[float, Dict[Param, Any]]]
def get_metrics_and_params(model: TrainValidationSplitModel) -> EvalParam:
    return list(zip(model.validationMetrics, model.getEstimatorParamMaps()))
to get some idea of the relationship between metrics and parameters.
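Once metrics and parameter maps are paired, ranking them is straightforward. The sketch below uses made-up values standing in for model.validationMetrics and model.getEstimatorParamMaps(), with plain dicts instead of Spark Param keys, and it assumes a larger metric is better (as with f1 or accuracy).

```python
# Hypothetical stand-ins for model.validationMetrics and
# model.getEstimatorParamMaps(); plain dicts replace Spark Param keys
validation_metrics = [0.71, 0.78, 0.74]
param_maps = [
    {"maxDepth": 3, "numTrees": 5},
    {"maxDepth": 10, "numTrees": 20},
    {"maxDepth": 20, "numTrees": 30},
]

# Pair each metric with its parameters and sort best-first
ranked = sorted(zip(validation_metrics, param_maps),
                key=lambda pair: pair[0], reverse=True)

for metric, params in ranked:
    print(metric, params)

best_metric, best_params = ranked[0]
```

With real Spark param maps, the keys are Param objects; mapping each key to its name attribute first gives rows that load cleanly into a pandas DataFrame for plotting.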
If you need more information you should use Pipeline Params. This preserves all models, which can then be used for further processing:
models = pipeline.fit(df, params=paramGrid)
It will generate a list of the PipelineModels corresponding to the params argument:
zip(models, paramGrid)
I think I've found a way to do this. I wrote a function that specifically pulls out hyperparameters for a logistic regression that has two parameters, created with a CrossValidator:
def hyperparameter_getter(model_obj,cv_fold = 5.0):
    enet_list = []
    reg_list = []
    ## Get metrics
    metrics = model_obj.avgMetrics
    assert type(metrics) is list
    assert len(metrics) > 0
    ## Get the paramMap element
    # list(...) keeps the indexing working on Python 3, where keys() is a view
    param_keys = list(model_obj._paramMap.keys())
    for x in range(len(param_keys)):
        if param_keys[x].name=='estimatorParamMaps':
            param_map_key = param_keys[x]
    params = model_obj._paramMap[param_map_key]
    for i in range(len(params)):
        for k in params[i].keys():
            if k.name =='elasticNetParam':
                enet_list.append(params[i][k])
            if k.name =='regParam':
                reg_list.append(params[i][k])
    results_df = pd.DataFrame({'metrics':metrics,
                               'elasticNetParam': enet_list,
                               'regParam':reg_list})
    # Because of [SPARK-16831][PYTHON]
    # It only sums across folds, doesn't average
    spark_version = [int(x) for x in sc.version.split('.')]
    if spark_version[0] <= 2:
        if spark_version[1] < 1:
            results_df.metrics = 1.0*results_df['metrics'] / cv_fold
    return results_df