I'm getting drastically different F1 scores with the same input data with scikit-learn and caret. Here's how I'm running a GBM model for each.
scikit-learn (F1 is default output)
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.cross_validation import StratifiedKFold, cross_val_score

est = GradientBoostingClassifier(n_estimators = 4000, learning_rate = 0.1, max_depth = 5, max_features = 'log2', random_state = 0)
cv = StratifiedKFold(y = labels, n_folds = 10, shuffle = True, random_state = 0)
scores = cross_val_score(est, data, labels, scoring = 'f1', cv = cv, n_jobs = -1)
caret (F1 must be defined and called):
library(caret)
library(MLmetrics)  # provides F1_Score()

f1 <- function(data, lev = NULL, model = NULL) {
  f1_val <- F1_Score(y_pred = data$pred, y_true = data$obs, positive = lev[1])
  c("F1" = f1_val)
}
set.seed(0)
gbm <- train(label ~ .,
             data = data,
             method = "gbm",
             trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3,
                                      summaryFunction = f1, classProbs = TRUE),
             metric = "F1",
             verbose = FALSE)
From the above code, I get an F1 score of ~0.8 using scikit-learn and ~0.25 using caret. A small difference might be attributed to algorithm differences, but I must be doing something wrong with the caret modeling to get the massive difference I'm seeing here. I'd prefer not to post my data set, so hopefully the issue can be diagnosed from the code. Any help would be much appreciated.
GBT is an ensemble of decision trees. The difference comes from:
The number of decision trees in the ensemble (n_estimators = 4000 vs. n.trees = 100).
The shape (breadth, depth) of individual decision trees (max_depth = 5 vs. interaction.depth = 1).
Currently, you're comparing the F1 score of a 100 MB GradientBoostingClassifier object with a 100 kB gbm object - one GBT model contains literally thousands of times more information than the other.
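To make the comparison meaningful, the two configurations first need to be aligned. A minimal sketch (using the gbm-side values quoted above, roughly 100 trees of depth 1, and the data and labels from the question) is to shrink the scikit-learn model accordingly and re-run the same cross-validation:

# re-run the scikit-learn CV with a model sized like the caret/gbm one
# (n_estimators ~ n.trees = 100, max_depth ~ interaction.depth = 1)
est_small = GradientBoostingClassifier(n_estimators=100, max_depth=1,
                                       learning_rate=0.1, random_state=0)
scores_small = cross_val_score(est_small, data, labels, scoring='f1',
                               cv=10, n_jobs=-1)
print(scores_small.mean())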
You may wish to export both models to the standardized PMML representation using sklearn2pmml and r2pmml packages, and look inside the resulting PMML files (plain text, so can be opened in any text editor) to better grasp their internal structure.
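For the scikit-learn side, a minimal export sketch (assuming the sklearn2pmml package is installed and a Java runtime is available; est, data and labels are from the question above) could look like this:

from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

# wrap the estimator in a PMMLPipeline and write it out as plain-text PMML
pmml_pipeline = PMMLPipeline([("gbm", est)])
pmml_pipeline.fit(data, labels)
sklearn2pmml(pmml_pipeline, "gbm_sklearn.pmml")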
I am trying to put a FeatureUnion of a PCA, IncrementalPCA and FastICA into a pipeline with a RandomForestClassifier and search for the optimal parameters of the forest with a HalvingGridSearchCV.
Excerpts from the code look like this:
for n_components in range(20, 80, 10):
    # all decomposers use the same parameters
    decomposer_pars = {
        'n_components': n_components,
        'whiten': True,
    }
    # define the list of decomposers
    pipe_preprocessing = [
        ('pca', PCA(**decomposer_pars)),
        ('fastica', FastICA(**decomposer_pars)),
        ('incpca', IncrementalPCA(**decomposer_pars))
    ]
    # define clf
    clf = RandomForestClassifier(n_estimators=50, ...)
    # model
    pipe_model = Pipeline(steps=[
        ('rf', clf)
    ])
    # join to parallel feature union
    pipe_preprocessing = FeatureUnion(pipe_preprocessing)
    # full pipeline preprocessing + model
    pipe = Pipeline(steps=[('preprocessing', pipe_preprocessing), *pipe_model.steps])
    # halving gridsearch with crossvalidation
    sh = HalvingGridSearchCV(estimator = pipe,
                             param_grid = {
                                 'rf__min_weight_fraction_leaf' : [0, 0.001, 0.01, 0.1],
                                 'rf__min_samples_split' : [0.001, 0.01, 0.1],
                                 'rf__max_features' : [3, 5],
                                 'rf__min_impurity_decrease' : [0, 0.001, 0.01],
                             },
                             cv = cv,  # NOTE: see description below
                             factor = 2,
                             scoring = make_scorer(accuracy_score),
                             resource = 'n_samples',
                             min_resources = 375,
                             max_resources = 3000,
                             aggressive_elimination = False,
                             refit = False,
                             return_train_score = False,
                             n_jobs = n_jobs,
                             verbose = 0,
                             error_score = 'raise')
    res = sh.fit(X_train.values, y_train.reindex(X_train.index).values)
Notes:
The generator cv is custom written and generates training / validation folds of size 2794 / 279, respectively. The generator should result in n_splits=24 folds.
The overall training matrix X_train has a shape (69844, 80).
The classifier clf is simply an instance of RandomForestClassifier with n_estimators=50.
Execution of this code throws this error:
ValueError: n_components=20 must be between 0 and min(n_samples, n_features)=15 with svd_solver='full'
It's clear that the number of PCA components cannot be larger than either the number of features or the number of samples. What I don't understand is why I get this error here. The training folds that I feed in are of shape (2794, 80), so the error above should only occur for n_components > min(n_samples, n_features) = 80. I do not understand why the data is interpreted as having min(n_samples, n_features)=15. When I set n_components < 15, the code works.
I don't understand what I am doing wrong here. In my understanding, FeatureUnion applies the three decomposers independently to the input training data, and each should (internally) return a block of the feature matrix with shape (2794, n_components). Thus, the transformed feature matrix would be (2794, 3 * n_components) and subsequent fitting of the clf should work fine.
I tried increasing the size of the validation folds (although this does not make sense in theory). Did not change anything.
Also, I increased the size of the train folds to 9978. Still the same error.
HOWEVER, increasing min_resources in HalvingGridSearchCV to 1000 does resolve the issue and the code runs up to n_components=40. Then, again, the same error.
Obviously, min_resources is limiting n_samples. But, the smallest value possible in my code above is 375, which would still result in folds of shape (375,80), such that the error should not occur for any value of n_components that I scan over.
Thus, min_resources seems to work differently than I understand it. How exactly does min_resources affect the size of the internal training folds?
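For what it's worth, when a run does complete (e.g. with a small n_components), the successive-halving schedule can be inspected through documented HalvingGridSearchCV attributes, which show how many samples each iteration actually used. A diagnostic sketch:

# after a successful sh.fit(...):
print(sh.min_resources_, sh.max_resources_)  # resource bounds actually applied
print(sh.n_resources_)    # number of samples used at each halving iteration
print(sh.n_candidates_)   # number of parameter candidates at each iteration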
Thank you!
EDIT
I manually performed the transformation with the FeatureUnion for all values of n_components, and it works fine. This suggests that the problem must be caused by min_resources in HalvingGridSearchCV. I still have not found a solution for that.
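For reference, a minimal sketch of that manual check (X_fold here is a placeholder for one training fold of shape (2794, 80)):

from sklearn.decomposition import PCA, FastICA, IncrementalPCA
from sklearn.pipeline import FeatureUnion

n_components = 40  # a value that fails inside HalvingGridSearchCV
union = FeatureUnion([
    ('pca', PCA(n_components=n_components, whiten=True)),
    ('fastica', FastICA(n_components=n_components, whiten=True)),
    ('incpca', IncrementalPCA(n_components=n_components, whiten=True)),
])
X_transformed = union.fit_transform(X_fold)
print(X_transformed.shape)  # expected: (2794, 3 * n_components)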
pros_gbm = H2OGradientBoostingEstimator(nfolds=0,seed=1234, keep_cross_validation_predictions = False, ntrees=1000, max_depth=3, learn_rate=0.01, distribution='multinomial')
pros_gbm.train(x=predictors, y=target, training_frame=hf_train, validation_frame = hf_test)
pros_gbm.predict(hf_test)
Currently, I am predicting my test data as above, but how can I predict my test data for the nth tree (out of the 1000 trees) of this model? Is there any option in predict for that, or is there another way?
You can get the predicted probabilities (cumulative after each tree) using staged_predict_proba() and the leaf node assignments from predict_leaf_node_assignment(). Here is an example:
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

# start the local H2O cluster (needed before importing data)
h2o.init()

# Import the prostate dataset into H2O:
prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
# Set the predictors and response; set the factors:
prostate["CAPSULE"] = prostate["CAPSULE"].asfactor()
predictors = ["ID","AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"]
response = "CAPSULE"
# Build and train the model:
pros_gbm = H2OGradientBoostingEstimator(nfolds=5,
                                        seed=1111,
                                        keep_cross_validation_predictions=True)
pros_gbm.train(x=predictors, y=response, training_frame=prostate)
print(pros_gbm.predict_leaf_node_assignment(prostate[:1, :]))
print(pros_gbm.staged_predict_proba(prostate[:1, :]))
You can also check out the Tree Class if you want details (leaf/split info) for each tree.
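For example, a minimal sketch of pulling out a single tree with the Tree class (tree_class is only needed for multi-class models, where you pass the class label):

from h2o.tree import H2OTree

# inspect the first tree of the binomial model trained above
tree = H2OTree(model=pros_gbm, tree_number=0, tree_class=None)
print(tree.root_node)      # root split of that tree
print(tree.left_children)  # per-node structure arrays (children, thresholds, ...)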
I'm trying to apply a baseline model to my data set, but the data set is imbalanced and only 11% of the data belongs to the positive category. When I split the data without sampling, the recall for positive records is very low. I want to balance the training data (0.5 negative / 0.5 positive) without balancing the testing data. Does anyone know how to do that?
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
import pandas as pd

#splitting train and test data
train,test = train_test_split(coupon,test_size = 0.3,random_state = 100)
##separating dependent and independent variables
cols = [i for i in coupon.columns if i not in target_col]
train_X = train[cols]
train_Y = train[target_col]
test_X = test[cols]
test_Y = test[target_col]
#Function attributes
#dataframe - processed dataframe
#Algorithm - Algorithm used
#training_x - predictor variables dataframe(training)
#testing_x - predictor variables dataframe(testing)
#training_y - target variable(training)
#testing_y - target variable(testing)
#cf - ["coefficients","features"](coefficients for logistic
#regression, features for tree based models)
#threshold_plot - if True returns threshold plot for model
def coupon_use_prediction(algorithm, training_x, testing_x,
                          training_y, testing_y, cols, cf, threshold_plot):
    #model
    algorithm.fit(training_x, training_y)
    predictions = algorithm.predict(testing_x)
    probabilities = algorithm.predict_proba(testing_x)
    #coeffs
    if cf == "coefficients":
        coefficients = pd.DataFrame(algorithm.coef_.ravel())
    elif cf == "features":
        coefficients = pd.DataFrame(algorithm.feature_importances_)
    column_df = pd.DataFrame(cols)
    coef_sumry = (pd.merge(coefficients, column_df, left_index=True,
                           right_index=True, how="left"))
    coef_sumry.columns = ["coefficients", "features"]
    coef_sumry = coef_sumry.sort_values(by="coefficients", ascending=False)
    print(algorithm)
    print("\n Classification report : \n", classification_report(testing_y, predictions))
    print("Accuracy Score : ", accuracy_score(testing_y, predictions))
You have two ways of balancing data: up-sampling or down-sampling.
Up-sampling: duplication of the under-represented class.
Down-sampling: sampling a subset of the over-represented class.
Up-sampling is pretty easy to do by hand. For down-sampling you can use sklearn.utils.resample and provide the number of samples you want to keep, for example:
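A minimal down-sampling sketch (assuming target_col is the name of the label column, the positive class is coded as 1 and the negative class as 0; only the training split is touched, so the test distribution stays unchanged):

from sklearn.utils import resample
import pandas as pd

negatives = train[train[target_col] == 0]
positives = train[train[target_col] == 1]

# down-sample the negatives to the number of positives (without replacement)
negatives_down = resample(negatives, replace=False,
                          n_samples=len(positives), random_state=100)

train_balanced = pd.concat([negatives_down, positives])
train_X = train_balanced[cols]
train_Y = train_balanced[target_col]
# test_X / test_Y are left as they are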
Please note that, as #paritosh-singh mentioned, changing the distribution may not be the only solution. There are machine learning algorithms that can:
- support imbalanced data
- already have a built-in weighting option to take the data distribution into account (see the sketch below)
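For instance, several scikit-learn classifiers (LogisticRegression, RandomForestClassifier, SVC, ...) accept a class_weight argument, which reweights the loss instead of resampling the rows. A sketch:

from sklearn.ensemble import RandomForestClassifier

# "balanced" weights each class inversely proportional to its frequency,
# so the 11% positive class gets a correspondingly larger weight
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=100)
clf.fit(train_X, train_Y.values.ravel())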
If we have a hierarchical model with data from different sites as different groups in the model, how do we predict on new groups (new sites that we haven't seen before)?
e.g. using the following logistic regression model:
from pymc3 import Model, sample, Normal, HalfCauchy, Bernoulli
import theano.tensor as tt

with Model() as varying_slope:
    mu_beta = Normal('mu_beta', mu=0., sd=1e5)
    sigma_beta = HalfCauchy('sigma_beta', 5)
    a = Normal('a', mu=0., sd=1e5)
    betas = Normal('b', mu=mu_beta, sd=sigma_beta, shape=(n_features, n_site))
    y_hat = a + tt.dot(X_shared, betas[:, site_shared])
    y_like = Bernoulli('y_like', logit_p=y_hat, observed=train_y)
After we fit this model, we can predict on new data (i.e. sample from the posterior predictive) from a specific site using:
site_to_predict = 1
samples = 500

x = tt.matrix('X', dtype='float64')
new_site = tt.vector('new_site', dtype='int32')
n_samples = tt.iscalar('n_samples')
x.tag.test_value = np.empty(shape=(1, X.shape[1]))
new_site.tag.test_value = np.empty(shape=(1, 1))

_sample_proba = approx.sample_node(varying_slope.y_like.distribution.p,
                                   size=n_samples,
                                   more_replacements={X_shared: x, site_shared: new_site})
sample_proba = theano.function([x, new_site, n_samples], _sample_proba)

pred_test = sample_proba(test_X.reshape(1, -1), np.array(site_to_predict).reshape(-1), samples)
But what is the correct way to sample from the posterior predictive distribution if we have a new, unseen site?
I'm just copying my answer from the pymc discourse thread in case someone runs into this question or another one like it here.
First of all, beware of the centered hierarchical parametrization you are using; it may lead to divergences and difficulties while fitting.
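A minimal sketch of the non-centered alternative (it is also what the model_factory in the example further down uses): draw a standard-normal offset and scale it, instead of drawing betas directly from Normal(mu_beta, sigma_beta). The name b_offset is purely illustrative:

# centered form from the question (can cause divergences):
# betas = Normal('b', mu=mu_beta, sd=sigma_beta, shape=(n_features, n_site))

# non-centered equivalent:
b_offset = pm.Normal('b_offset', mu=0., sd=1., shape=(n_features, n_site))
betas = pm.Deterministic('b', mu_beta + sigma_beta * b_offset)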
That being said, your model looks more or less like a GLM with shared prior random variates mu_beta and sigma_beta across features and sites. Once you get a posterior distribution over those two, your predictions should look something like
y_hat = a + dot(X_shared, Normal(mu=mu_beta, sigma=sigma_beta))
y_like = Bernoulli('y_like', logit_p=y_hat)
So, we will aim to get that.
The way we usually recommend doing out-of-sample posterior predictive checks is to use theano.shared's. Here I'll use a different approach, inspired by the functional API that is the core design idea for pymc4. There are many differences between pymc3 and the skeleton of pymc4 that I won't go into, but one thing I started to use more is factory functions to get Model instances. Instead of trying to define things inside the model with theano.shared's, I just create a new model with the new data and draw posterior predictive samples from it. I recently posted about this here.
The idea is to create the model with the training data and sample from it to get a trace. Then you have to extract from the trace the hierarchical part that is shared with the unseen site: mu_beta, sigma_beta and a. Finally, you create a new model using the data of the test site, and sample from the posterior predictive using a list of dictionaries that holds the mu_beta, sigma_beta and a parts of the training trace. Here's a self-contained example:
import numpy as np
import pymc3 as pm
from theano import tensor as tt
from matplotlib import pyplot as plt


def model_factory(X, y, site_shared, n_site, n_features=None):
    if n_features is None:
        n_features = X.shape[-1]
    with pm.Model() as model:
        mu_beta = pm.Normal('mu_beta', mu=0., sd=1)
        sigma_beta = pm.HalfCauchy('sigma_beta', 5)
        a = pm.Normal('a', mu=0., sd=1)
        b = pm.Normal('b', mu=0, sd=1, shape=(n_features, n_site))
        betas = mu_beta + sigma_beta * b
        y_hat = a + tt.dot(X, betas[:, site_shared])
        pm.Bernoulli('y_like', logit_p=y_hat, observed=y)
    return model


# First I generate some training X data
n_features = 10
ntrain_site = 5
ntrain_obs = 100
ntest_site = 1
ntest_obs = 1
train_X = np.random.randn(ntrain_obs, n_features)
train_site_shared = np.random.randint(ntrain_site, size=ntrain_obs)
new_site_X = np.random.randn(ntest_obs, n_features)
test_site_shared = np.zeros(ntest_obs, dtype=np.int32)

# Now I generate the training and test y data with a sample from the prior
with model_factory(X=train_X,
                   y=np.empty(ntrain_obs, dtype=np.int32),
                   site_shared=train_site_shared,
                   n_site=ntrain_site) as train_y_generator:
    train_Y = pm.sample_prior_predictive(1, vars=['y_like'])['y_like'][0]
with model_factory(X=new_site_X,
                   y=np.empty(ntest_obs, dtype=np.int32),
                   site_shared=test_site_shared,
                   n_site=ntest_site) as test_y_generator:
    new_site_Y = pm.sample_prior_predictive(1, vars=['y_like'])['y_like'][0]

# The previous part is just to get some toy data to fit
# Now comes the important parts. First training
with model_factory(X=train_X,
                   y=train_Y,
                   site_shared=train_site_shared,
                   n_site=ntrain_site) as train_model:
    train_trace = pm.sample()

# Second comes the hold out data posterior predictive
with model_factory(X=new_site_X,
                   y=new_site_Y,
                   site_shared=test_site_shared,
                   n_site=ntrain_site) as test_model:
    # We first have to extract the learnt global effect from the train_trace
    df = pm.trace_to_dataframe(train_trace,
                               varnames=['mu_beta', 'sigma_beta', 'a'],
                               include_transformed=True)
    # We have to supply the samples kwarg because it cannot be inferred if the
    # input trace is not a MultiTrace instance
    ppc = pm.sample_posterior_predictive(trace=df.to_dict('records'),
                                         samples=len(df))

plt.figure()
plt.hist(ppc['y_like'], 30)
plt.axvline(new_site_Y, linestyle='--', color='r')
The posterior predictive I get looks like this (plot not reproduced here).
Of course, I don't know what data you would concretely put in as your X_shared, site_shared or train_y, so I just made up some nonsense toy data at the beginning of the code; you should replace that with your actual data.
When using:
"keep_cross_validation_predictions": True
"keep_cross_validation_fold_assignment": True
in H2O's XGBoost Estimator, I am not able to map these cross-validated probabilities back to the original dataset. There is a documentation example for R but not for Python (combining holdout predictions).
Any leads on how to do this in Python?
The cross-validated predictions are stored in two different places -- once as a list of length k (for k-folds) in model.cross_validation_predictions(), and another as an H2O Frame with the CV preds in the same order as the original training rows in model.cross_validation_holdout_predictions(). The latter is usually what people want (we added this later, that's why there are two versions).
Yes, unfortunately the R example that shows how to get this frame in the "Cross-validation" section of the H2O User Guide does not have a Python version (there is a ticket to fix that). The documentation for the keep_cross_validation_predictions argument also only shows one of the two locations.
Here's an updated example using XGBoost and showing both types of CV predictions:
import h2o
from h2o.estimators.xgboost import H2OXGBoostEstimator
h2o.init()
# Import a sample binary outcome training set into H2O
train = h2o.import_file("http://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)
# For binary classification, response should be a factor
train[y] = train[y].asfactor()
# try using the `keep_cross_validation_predictions` (boolean parameter):
# first initialize your estimator, set nfolds parameter
xgb = H2OXGBoostEstimator(keep_cross_validation_predictions = True, nfolds = 5, seed = 1)
# then train your model
xgb.train(x = x, y = y, training_frame = train)
# print the cross-validation predictions as a list
xgb.cross_validation_predictions()
# print the cross-validation predictions as an H2OFrame
xgb.cross_validation_holdout_predictions()
The CV pred frame of predictions looks like this:
Out[57]:
predict p0 p1
--------- --------- --------
1 0.396057 0.603943
1 0.149905 0.850095
1 0.0407018 0.959298
1 0.140991 0.859009
0 0.67361 0.32639
0 0.865698 0.134302
1 0.12927 0.87073
1 0.0549603 0.94504
1 0.162544 0.837456
1 0.105603 0.894397
[10000 rows x 3 columns]
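Since cross_validation_holdout_predictions() is row-aligned with the original training frame, mapping the probabilities back to the original dataset is just a column bind (a sketch using the xgb model and train frame from above):

# attach the cross-validated probabilities to the original training rows
cv_preds = xgb.cross_validation_holdout_predictions()
train_with_cv_preds = train.cbind(cv_preds)
train_with_cv_preds.head()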
For Python there is an example of this for GBM, and it should be exactly the same for XGBoost. According to that page, you should be able to do something like this:
model = H2OXGBoostEstimator(keep_cross_validation_predictions = True)
model.train(x = predictors, y = response, training_frame = train)
cv_predictions = model.cross_validation_predictions()