How to predict a H2O GBM model for nth tree? - python

pros_gbm = H2OGradientBoostingEstimator(nfolds=0,seed=1234, keep_cross_validation_predictions = False, ntrees=1000, max_depth=3, learn_rate=0.01, distribution='multinomial')
pros_gbm.train(x=predictors, y=target, training_frame=hf_train, validation_frame = hf_test)
Currently, I predict on my test data with the model trained above, but how can I get predictions for the nth tree (out of 1000 trees) of this model? Is there an option in "predict" for that, or is there any other way?

You can get the predicted probabilities (cumulative for each tree) using staged_predict_proba() and the leaf node assignments from predict_leaf_node_assignment(). Here is an example:
import h2o
from h2o.estimators import H2OGradientBoostingEstimator
h2o.init()
# Import the prostate dataset into H2O:
prostate = h2o.import_file("")
# Set the predictors and response; set the factors:
prostate["CAPSULE"] = prostate["CAPSULE"].asfactor()
predictors = ["ID","AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"]
response = "CAPSULE"
# Build and train the model:
pros_gbm = H2OGradientBoostingEstimator(nfolds=5,
                                        keep_cross_validation_predictions=True)
pros_gbm.train(x=predictors, y=response, training_frame=prostate)
print(pros_gbm.predict_leaf_node_assignment(prostate[:1, :]))
print(pros_gbm.staged_predict_proba(prostate[:1, :]))
You can also check out the Tree Class if you want details (leaf/split info) for each tree.
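To make "cumulative for each tree" concrete, here is a conceptual sketch in plain NumPy (not H2O code) for a binary model: the staged probability after tree n comes from the sum of the first n tree outputs. The function name, the starting logit and the per-tree numbers are all made up for illustration.

```python
import numpy as np

def staged_probabilities(initial_logit, tree_contributions):
    """Class-1 probabilities after each tree of a binary boosted model.

    tree_contributions has shape (n_rows, n_trees); column t holds the
    (already learning-rate-scaled) output of tree t for every row.
    """
    # Running sum over trees gives the logit after 1, 2, ..., n_trees trees
    staged_logits = initial_logit + np.cumsum(tree_contributions, axis=1)
    return 1.0 / (1.0 + np.exp(-staged_logits))

# Hypothetical per-tree outputs for 2 rows and 4 trees
contributions = np.array([[0.2, 0.1, -0.05, 0.3],
                          [-0.4, -0.1, 0.2, -0.3]])
staged = staged_probabilities(initial_logit=0.0, tree_contributions=contributions)

n = 3  # use only the first n trees
prob_after_n_trees = staged[:, n - 1]
```

Under this view, "predicting for the nth tree" means picking the nth staged column rather than the last one.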


How to balance training set in python?

I'm trying to fit a baseline model to my data set, but the data set is imbalanced: only 11% of the records belong to the positive category. When I split the data without any resampling, the recall for positive records is very low. I want to balance the training data (0.5 negative, 0.5 positive) without balancing the test data. Does anyone know how to do that?
#splitting train and test data
train,test = train_test_split(coupon,test_size = 0.3,random_state = 100)
##separating dependent and independent variables
cols = [i for i in coupon.columns if i not in target_col]
train_X = train[cols]
train_Y = train[target_col]
test_X = test[cols]
test_Y = test[target_col]
#Function attributes
#dataframe - processed dataframe
#Algorithm - Algorithm used
#training_x - predictor variables dataframe(training)
#testing_x - predictor variables dataframe(testing)
#training_y - target variable(training)
#testing_y - target variable(testing)
#cf - ["coefficients","features"](coefficients for logistic
#regression,features for tree based models)
#threshold_plot - if True returns threshold plot for model
def coupon_use_prediction(algorithm,training_x,testing_x,
                          training_y,testing_y,cols,cf,threshold_plot) :
    predictions = algorithm.predict(testing_x)
    probabilities = algorithm.predict_proba(testing_x)
    if cf == "coefficients" :
        coefficients = pd.DataFrame(algorithm.coef_.ravel())
    elif cf == "features" :
        coefficients = pd.DataFrame(algorithm.feature_importances_)
    column_df = pd.DataFrame(cols)
    coef_sumry = (pd.merge(coefficients,column_df,left_index= True,
                           right_index= True, how = "left"))
    coef_sumry.columns = ["coefficients","features"]
    coef_sumry = coef_sumry.sort_values(by = "coefficients",ascending = False)
    print (algorithm)
    print ("\n Classification report : \n",classification_report(testing_y,predictions))
    print ("Accuracy Score : ",accuracy_score(testing_y,predictions))
You have two ways of balancing data: upsampling or downsampling.
Upsampling: duplicating records of the under-represented class.
Downsampling: keeping only a sample of the over-represented class.
Upsampling is fairly easy to do.
For downsampling you can use sklearn.utils.resample and pass the number of samples you want to keep.
Please note that, as @paritosh-singh mentioned, changing the distribution may not be the only solution. There are machine learning algorithms that can:
- support imbalanced data
- take the class distribution into account through a built-in weighting option
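A minimal sketch of the downsampling route, assuming binary 0/1 labels. The helper name downsample_majority and the toy arrays are made up for illustration; only sklearn.utils.resample comes from the answer above.

```python
import numpy as np
from sklearn.utils import resample

def downsample_majority(X, y, random_state=0):
    """Return a balanced copy of (X, y) for binary 0/1 labels by sampling
    the larger class down to the size of the smaller one."""
    X, y = np.asarray(X), np.asarray(y)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    minority, majority = sorted([pos_idx, neg_idx], key=len)
    # Sample majority row indices without replacement
    majority_kept = resample(majority, replace=False,
                             n_samples=len(minority),
                             random_state=random_state)
    keep = np.concatenate([minority, majority_kept])
    return X[keep], y[keep]

# Toy training set with roughly 11% positives (made-up data)
rng = np.random.RandomState(100)
train_X_demo = rng.randn(1000, 3)
train_Y_demo = (rng.rand(1000) < 0.11).astype(int)

balanced_X, balanced_Y = downsample_majority(train_X_demo, train_Y_demo)
```

Apply this to the training split only and leave the test split as it is, so the evaluation still reflects the real class distribution. For the built-in weighting route, many scikit-learn classifiers (LogisticRegression, for example) accept class_weight="balanced".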

How do we predict on new unseen groups in a hierarchical model in PyMC3?

If we have a hierarchical model with data from different sites as different groups in the model, how do we predict on new groups (new sites that we haven't seen before)?
e.g. using the following logistic regression model:
from pymc3 import Model, sample, Normal, HalfCauchy,Bernoulli
import theano.tensor as tt
with Model() as varying_slope:
    mu_beta = Normal('mu_beta', mu=0., sd=1e5)
    sigma_beta = HalfCauchy('sigma_beta', 5)
    a = Normal('a', mu=0., sd=1e5)
    betas = Normal('b',mu=mu_beta,sd=sigma_beta,shape=(n_features,n_site))
    y_hat = a + tt.dot(X_shared,betas[:,site_shared])
    y_like = Bernoulli('y_like', logit_p=y_hat, observed=train_y)
After we fit this model, we can predict on new data (i.e. sample from the posterior predictive) from a specific site using:
site_to_predict = 1
samples = 500
x = tt.matrix('X',dtype='float64')
new_site = tt.vector('new_site',dtype='int32')
n_samples = tt.iscalar('n_samples')
x.tag.test_value = np.empty(shape=(1,X.shape[1]))
new_site.tag.test_value = np.empty(shape=(1,1))
_sample_proba = approx.sample_node(varying_slope.y_like.distribution.p,
more_replacements={X_shared: x,site_shared:new_site})
sample_proba = theano.function([x,new_site,n_samples], _sample_proba)
pred_test = sample_proba(test_X.reshape(1,-1),np.array(site_to_predict).reshape(-1),samples)
but what is the correct way to sample from the posterior predictive distribution if we have a new unseen site ?
I'm just copying my answer from the pymc discourse thread if someone by chance runs into this question or another one like it here.
First of all, beware of the centered hierarchical parametrization you are using; it may lead to divergences and difficulties while fitting.
That being said, your model looks more or less like a GLM with shared prior random variates mu_beta and sigma_beta across features and sites. Once you get a posterior distribution over those two, your predictions should look something like
y_hat = a + dot(X_shared, Normal(mu=mu_beta, sigma=sigma_beta))
y_like = Bernoulli('y_like', logit_p=y_hat)
So, we will aim to get that.
The way in which we always recommend doing out-of-sample posterior predictive checks is to use theano.shared variables. I'll use a different approach, inspired by the functional API that is the core design idea for pymc4. There are many differences between pymc3 and the skeleton of pymc4 that I won't go into, but one thing that I started to use more is factory functions to get the Model instances. Instead of trying to define things inside the model with theano.shared variables, I just create a new model with the new data and draw posterior predictive samples from it. I just recently posted about this here.
The idea is to create the model with the training data and sample from it to get a trace. Then you have to extract from the trace the hierarchical part which is shared with the unseen site: mu_beta, sigma_beta and a. Finally, you create a new model using the new data of the test site, and sample from the posterior predictive using a list of dictionaries that hold the mu_beta, sigma_beta and a part of the training trace. Here's a self-contained example:
import numpy as np
import pymc3 as pm
from theano import tensor as tt
from matplotlib import pyplot as plt
def model_factory(X, y, site_shared, n_site, n_features=None):
    if n_features is None:
        n_features = X.shape[-1]
    with pm.Model() as model:
        mu_beta = pm.Normal('mu_beta', mu=0., sd=1)
        sigma_beta = pm.HalfCauchy('sigma_beta', 5)
        a = pm.Normal('a', mu=0., sd=1)
        # Non-centered parametrization: betas = mu_beta + sigma_beta * b
        b = pm.Normal('b', mu=0, sd=1, shape=(n_features, n_site))
        betas = mu_beta + sigma_beta * b
        y_hat = a + tt.dot(X, betas[:, site_shared])
        pm.Bernoulli('y_like', logit_p=y_hat, observed=y)
    return model
# First I generate some training X data
n_features = 10
ntrain_site = 5
ntrain_obs = 100
ntest_site = 1
ntest_obs = 1
train_X = np.random.randn(ntrain_obs, n_features)
train_site_shared = np.random.randint(ntrain_site, size=ntrain_obs)
new_site_X = np.random.randn(ntest_obs, n_features)
test_site_shared = np.zeros(ntest_obs, dtype=np.int32)
# Now I generate the training and test y data with a sample from the prior
with model_factory(X=train_X,
                   y=np.empty(ntrain_obs, dtype=np.int32),
                   site_shared=train_site_shared,
                   n_site=ntrain_site) as train_y_generator:
    train_Y = pm.sample_prior_predictive(1, vars=['y_like'])['y_like'][0]
with model_factory(X=new_site_X,
                   y=np.empty(ntest_obs, dtype=np.int32),
                   site_shared=test_site_shared,
                   n_site=ntest_site) as test_y_generator:
    new_site_Y = pm.sample_prior_predictive(1, vars=['y_like'])['y_like'][0]
# The previous part is just to get some toy data to fit
# Now comes the important parts. First training
with model_factory(X=train_X,
                   y=train_Y,
                   site_shared=train_site_shared,
                   n_site=ntrain_site) as train_model:
    train_trace = pm.sample()
# Second comes the hold out data posterior predictive
with model_factory(X=new_site_X,
                   y=new_site_Y,
                   site_shared=test_site_shared,
                   n_site=ntrain_site) as test_model:
    # We first have to extract the learnt global effect from the train_trace
    df = pm.trace_to_dataframe(train_trace,
                               varnames=['mu_beta', 'sigma_beta', 'a'])
    # We have to supply the samples kwarg because it cannot be inferred if the
    # input trace is not a MultiTrace instance
    ppc = pm.sample_posterior_predictive(trace=df.to_dict('records'),
                                         samples=len(df))
plt.hist(ppc['y_like'], 30)
plt.axvline(new_site_Y, linestyle='--', color='r')
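The posterior predictive I get looks like this:
[Figure: histogram of the posterior predictive samples of y_like, with new_site_Y marked by a dashed red vertical line]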
Of course, I don’t know what kind of data to concretely put as your X_shared, site_shared or train_y, so I just made up some nonsense toy data at the beginning of the code, you should replace that with your actual data.
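The same idea can be written out by hand in NumPy, which may make the logic easier to follow: for every posterior draw of mu_beta, sigma_beta and a, draw fresh coefficients for the unseen site from Normal(mu_beta, sigma_beta) and then sample the Bernoulli outcome. This is only a sketch. The function name predict_new_site is hypothetical, and the "posterior draws" below are random placeholders standing in for arrays such as train_trace['mu_beta'].

```python
import numpy as np

def predict_new_site(X_new, mu_beta, sigma_beta, a, rng):
    """Posterior predictive samples for a site absent from the training data.

    mu_beta, sigma_beta and a are 1-D arrays with one entry per posterior draw.
    Returns 0/1 outcomes with shape (n_draws, n_obs).
    """
    n_draws = len(mu_beta)
    n_features = X_new.shape[1]
    # No site-specific posterior exists, so sample the new site's
    # coefficients from the population-level distribution
    betas_new = rng.normal(loc=mu_beta[:, None], scale=sigma_beta[:, None],
                           size=(n_draws, n_features))
    logits = a[:, None] + betas_new @ X_new.T  # shape (n_draws, n_obs)
    p = 1.0 / (1.0 + np.exp(-logits))
    return rng.binomial(1, p)

rng = np.random.RandomState(0)
n_draws = 500
# Placeholder draws; in practice take these from the training trace
mu_beta_draws = rng.normal(0.5, 0.1, size=n_draws)
sigma_beta_draws = np.abs(rng.normal(1.0, 0.1, size=n_draws))
a_draws = rng.normal(0.0, 0.2, size=n_draws)
X_new = rng.randn(3, 10)  # 3 observations from the unseen site

ppc_new_site = predict_new_site(X_new, mu_beta_draws, sigma_beta_draws, a_draws, rng)
prob_estimate = ppc_new_site.mean(axis=0)  # predictive P(y=1) per observation
```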

Confusion Matrix on H2O

Final Edit: this problem occurred because the target column held integers that were supposed to represent categories, so H2O was fitting a regression. Once I converted them into factors using .asfactor(), the confusion matrix method detailed in the answer below worked.
I am trying to run a confusion matrix on my Random Forest Model (my_model), but the documentation has been less than helpful. From here it says the command is h2o.confusionMatrix(my_model) but there is no such thing in 3.0.
Here are the steps to fit the model:
from h2o.estimators.random_forest import H2ORandomForestEstimator
data_h = h2o.H2OFrame(data)
train, valid = data_h.split_frame(ratios=[.7], seed = 1234)
my_model = H2ORandomForestEstimator(model_id = "rf_h", ntrees = 400,
max_depth = 30, nfolds = 8, seed = 25)
my_model.train(x = features, y = target, training_frame=train)
pred = rf_h.predict(valid)
I have tried the following:
AttributeError: type object 'H2ORandomForestEstimator' has no attribute
Gotten from this example.
I have attempted to use tab completion to find out what it might be and have tried:
TypeError: 'module' object is not callable
which outputs simply all the model diagnostics and then the error:
H2OTypeError: Argument `cm` should be a list, got H2ORandomForestEstimator
Which gives the same error as above.
Not sure what to do here, how can I view the results of the confusion matrix of the model?
Edit: Added more code to the beginning of the question for Context
Please see the documentation for the full parameter list. For your convenience, here is the signature: confusion_matrix(metrics=None, thresholds=None, train=False, valid=False, xval=False).
Here is a working example of how to use the method:
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
# import the cars dataset:
# this dataset is used to classify whether or not a car is economical based on
# the car's displacement, power, weight, and acceleration, and the year it was made
cars = h2o.import_file("")
# convert response column to a factor
cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
# set the predictor names and the response column name
predictors = ["displacement","power","weight","acceleration","year"]
response = "economy_20mpg"
# split into train and validation sets
train, valid = cars.split_frame(ratios = [.8], seed = 1234)
# try using the binomial_double_trees (boolean parameter):
# Initialize and train a DRF
cars_drf = H2ORandomForestEstimator(binomial_double_trees = False, seed = 1234)
cars_drf.train(x = predictors, y = response, training_frame = train, validation_frame = valid)
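# print the confusion matrix for the training data
cars_drf.confusion_matrix()
# or specify the validation frame
cars_drf.confusion_matrix(valid=True)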
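For intuition about what the thresholds argument controls, here is a small NumPy sketch of a binary confusion matrix computed from class-1 probabilities at a chosen threshold. It is not H2O's implementation, and the function name and data are made up. It also shows why the response has to be categorical: the table only makes sense when actual values are class labels.

```python
import numpy as np

def binary_confusion_matrix(actual, p1, threshold=0.5):
    """2x2 table with rows = actual class and columns = predicted class."""
    actual = np.asarray(actual)
    predicted = (np.asarray(p1) >= threshold).astype(int)
    cm = np.zeros((2, 2), dtype=int)
    for a, p in zip(actual, predicted):
        cm[a, p] += 1
    return cm

actual = [0, 0, 1, 1, 1, 0]
p1 = [0.1, 0.6, 0.8, 0.4, 0.9, 0.2]  # predicted probability of class 1

cm_default = binary_confusion_matrix(actual, p1, threshold=0.5)
cm_lower = binary_confusion_matrix(actual, p1, threshold=0.3)
```

Lowering the threshold moves rows from the "predicted 0" column to the "predicted 1" column, trading false negatives for false positives.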

H2o Python: Combining XGB Holdout Predictions

When using:
"keep_cross_validation_predictions": True
"keep_cross_validation_fold_assignment": True
in H2O's XGBoost Estimator, I am not able to map these cross validated probabilities back to the original dataset. There is one documentation example for R but not for Python (combining holdout predictions).
Any leads on how to do this in Python?
The cross-validated predictions are stored in two different places -- once as a list of length k (for k-folds) in model.cross_validation_predictions(), and another as an H2O Frame with the CV preds in the same order as the original training rows in model.cross_validation_holdout_predictions(). The latter is usually what people want (we added this later, that's why there are two versions).
Yes, unfortunately the R example to get this frame in the "Cross-validation" section of the H2O User Guide does not have a Python version (ticket to fix that). In the keep_cross_validation_predictions argument documentation, it only shows one of the two locations.
Here's an updated example using XGBoost and showing both types of CV predictions:
import h2o
from h2o.estimators.xgboost import H2OXGBoostEstimator
# Import a sample binary outcome training set into H2O
train = h2o.import_file("")
# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)
# For binary classification, response should be a factor
train[y] = train[y].asfactor()
# try using the `keep_cross_validation_predictions` (boolean parameter):
# first initialize your estimator, set nfolds parameter
xgb = H2OXGBoostEstimator(keep_cross_validation_predictions = True, nfolds = 5, seed = 1)
# then train your model
xgb.train(x = x, y = y, training_frame = train)
# print the cross-validation predictions as a list
print(xgb.cross_validation_predictions())
# print the cross-validation predictions as an H2OFrame
print(xgb.cross_validation_holdout_predictions())
The CV pred frame of predictions looks like this:
predict p0 p1
--------- --------- --------
1 0.396057 0.603943
1 0.149905 0.850095
1 0.0407018 0.959298
1 0.140991 0.859009
0 0.67361 0.32639
0 0.865698 0.134302
1 0.12927 0.87073
1 0.0549603 0.94504
1 0.162544 0.837456
1 0.105603 0.894397
[10000 rows x 3 columns]
For Python there is an example of this on GBM, and it should be exactly the same for XGB. According to that page, you should be able to do something like this:
model = H2OXGBoostEstimator(keep_cross_validation_predictions = True)
model.train(x = predictors, y = response, training_frame = train)
cv_predictions = model.cross_validation_predictions()
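If you ever need to do the mapping by hand, the logic is a scatter by fold assignment. The sketch below is plain NumPy with made-up inputs and a hypothetical function name; it assumes each fold's predictions are stored only for its held-out rows, in their original relative order. It does not depend on how H2O lays out its per-fold frames.

```python
import numpy as np

def combine_holdout_predictions(fold_assignment, fold_predictions):
    """Scatter per-fold holdout predictions back into training-row order.

    fold_assignment: length n_rows, the fold index of each training row.
    fold_predictions: list where entry k holds the predictions for the
    rows held out in fold k, in their original relative order.
    """
    fold_assignment = np.asarray(fold_assignment)
    combined = np.empty(len(fold_assignment), dtype=float)
    for k, preds in enumerate(fold_predictions):
        combined[fold_assignment == k] = preds
    return combined

fold_assignment = [0, 1, 0, 2, 1, 2]
fold_predictions = [np.array([0.9, 0.2]),  # rows 0 and 2
                    np.array([0.4, 0.7]),  # rows 1 and 4
                    np.array([0.1, 0.6])]  # rows 3 and 5
holdout_p1 = combine_holdout_predictions(fold_assignment, fold_predictions)
```

The result has one prediction per training row, so it can be placed next to the original dataset as a new column.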

Wrong classification outputs with sklearn GMM classifier

I'm building a basic speaker recognizer with the GMM toolkit from sklearn. I have 3 classes, for each class I have a classifier. In the testing stage, the GMM for the speaker with the highest probability should be selected and the program should return the predicted class for each test sample. I want to vary the number of mixture components and set n_components=4 in this example code.
If I use 4 mixture components the output of my classifier will either be 0, 1, 2 or 3. If I use 3 mixture components, it will be 0, 1 or 2. I have the feeling that the classifier returns the predicted mixture component instead of the whole GMM. But I want it to predict the class: 1, 2 or 3.
Here is my code:
import numpy as np
from sklearn.mixture import GMM
#set path
class_names = [1,2,3]
covs = ['spherical', 'diag', 'tied', 'full']
training_data = {1: np.loadtxt(path+"/"), 2: np.loadtxt(path+"/"), 3: np.loadtxt(path+"/")}
print "Training models"
models = {}
for c in class_names:
    # make a GMM for each of the classes in class_names
    models[c] = dict((covar_type,GMM(n_components=4,
                      covariance_type=covar_type, init_params='wmc',n_init=1, n_iter=20))
                     for covar_type in covs)
for cov in covs:
    for c in class_names:
        models[c][cov].fit(training_data[c])
#define test set
test01 = np.loadtxt(path+"/")
test02 = np.loadtxt(path+"/")
test03 = np.loadtxt(path+"/")
testing_data = {1: test01, 2: test02, 3: test03}
probs = {}
print "Calculating Probabilities"
for c in class_names:
    probs[c] = {}
    for cov in covs:
        probs[c][cov] = {}
        for p in class_names:
            probs[c][cov] = models[p][cov].predict(testing_data[c])
for c in class_names:
    print c
    for cov in covs:
        print " ",cov,
        for p in class_names:
            print p, probs,
Is my assumption from above correct or do I have a logical error in my code?
Is there a way to solve this in sklearn?
Thanks in advance for your help!
In your code, the first time the keys of the models dict are covariance types and the second time the keys are class names. I misread your code, sorry.
Edit: if you want the per-sample likelihood of the data under a fitted GMM model, you should use the score_samples method. The predict method does not return probabilities but component assignments instead.
Also, GMM is by default an unsupervised model. If you want to build a supervised model out of a bunch of GMM models, you should probably wrap them in an estimator class that implements the fit / predict API, so that you can estimate its accuracy via cross-validation and adjust the hyperparameter values by grid search. Pull request #2468 is implementing something like this. If it's merged in time it might get included in the next scikit-learn release (0.15, which should come out in early 2014).
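A minimal sketch of such a wrapper, not the code from the pull request. It uses GaussianMixture, which replaced GMM in later scikit-learn versions and whose score_samples returns the per-sample log-likelihood. The class name GMMClassifier and the toy data are made up, and equal class priors are assumed.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class GMMClassifier:
    """One GaussianMixture per class; predict the class whose mixture
    gives each sample the highest log-likelihood."""

    def __init__(self, n_components=4, covariance_type="diag", random_state=0):
        self.n_components = n_components
        self.covariance_type = covariance_type
        self.random_state = random_state

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.classes_ = np.unique(y)
        # Fit a separate mixture on the samples of each class
        self.models_ = [
            GaussianMixture(n_components=self.n_components,
                            covariance_type=self.covariance_type,
                            random_state=self.random_state).fit(X[y == c])
            for c in self.classes_
        ]
        return self

    def predict(self, X):
        # One column of per-sample log-likelihoods per class model
        log_likelihoods = np.column_stack(
            [m.score_samples(X) for m in self.models_])
        return self.classes_[np.argmax(log_likelihoods, axis=1)]

# Made-up, well-separated data for classes 1, 2 and 3
rng = np.random.RandomState(0)
centers = {1: -10.0, 2: 0.0, 3: 10.0}
X_train = np.vstack([rng.randn(200, 2) + centers[c] for c in (1, 2, 3)])
y_train = np.repeat([1, 2, 3], 200)
X_test = np.vstack([rng.randn(20, 2) + centers[c] for c in (1, 2, 3)])
y_test = np.repeat([1, 2, 3], 20)

clf = GMMClassifier(n_components=4).fit(X_train, y_train)
predicted_classes = clf.predict(X_test)
```

The output is always one of the class labels, whatever n_components is, because the argmax runs over class models and not over mixture components. To use it with cross_val_score or GridSearchCV you would additionally inherit from sklearn.base.BaseEstimator and ClassifierMixin.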

