I am using H2O (Python) and experimenting with H2OGridSearch over alpha values for a GLM (H2OGeneralizedLinearEstimator), with lambda_search=True and k-fold cross-validation.
How can I get the best model's lambda value?
EDIT: Fully reproducible example
Data:
34.40 17:1 73:1 127:1 265:1 912:1 1162:1 1512:1 1556:1 1632:1 1738:1
205.10 127:1 138:1 338:1 347:1 883:1 912:1 1120:1 1122:1 1512:1
7.75 66:1 127:1 347:1 602:1 1422:1 1512:1 1535:1 1738:1
8.85 127:1 608:1 906:1 979:1 1077:1 1512:1 1738:1
51.80 127:1 347:1 608:1 766:1 912:1 928:1 952:1 1034:1 1512:1 1610:1 1738:1
110.00 127:1 229:1 347:1 602:1 608:1 1171:1 1512:1 1718:1
8.90 66:1 127:1 205:1 347:1 490:1 589:1 912:1 1016:1 1512:1
Call this file h2o_example.svmlight
Then run:
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init()
h2o_data = h2o.import_file("h2o_example.svmlight")
cols = h2o_data.columns[1:]
hyper_parameters = {"alpha": [0.0, 0.01, 0.99, 1.0]}
grid = H2OGridSearch(H2OGeneralizedLinearEstimator(family="gamma", link="log", lambda_search=True, nfolds=2, intercept=True, standardize=False),
                     hyper_params=hyper_parameters)
grid.train(y="C1", x=cols, training_frame=h2o_data)
grid_table = grid.get_grid(sort_by="r2", decreasing=True)
best = grid_table.models[0]
best.actual_params["lambda"]
best.actual_params["alpha"]
The last two commands fail, giving me an error:
TypeError: 'property' object has no attribute '__getitem__'
Apparently, I am using lambda_search the wrong way. How can I get a single alpha and lambda value for the best model according to my criterion?
Final EDIT
There are multiple ways of getting lambda (shown below), but here are two concise ones. (Note: fully reproducible code is at the bottom.)
If you have lambda_search = True, you can look at the model summary table under the lambda_search column and see what value is set for lambda.min, which is your best lambda:
model.summary()['lambda_search']
which will produce a list with a string similar to:
['nlambda = 100, lambda.max = 12.733, lambda.min = 0.05261, lambda.1se = -1.0']
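If you need the number itself rather than the string, you can parse it out. A minimal sketch, assuming the summary string has exactly the format shown above:

import re

summary_str = model.summary()['lambda_search'][0]
# pull the lambda.min value out of the summary string
lambda_min = float(re.search(r"lambda\.min = ([^,]+)", summary_str).group(1))
print(lambda_min)  # 0.05261 for the example string above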
If you don't use lambda search and don't set a lambda value (or if you do set one), you can also use the summary table:
model.summary()['regularization']
The output looks like:
['Elastic Net (alpha = 0.5, lambda = 0.01289 )']
Other options:
Look at the actual parameters of the model:
best.actual_params['lambda']
best.actual_params['alpha']
where best is your best model from the grid search results.
First EDIT
To get the best model, you can do:
grid_table = grid.get_grid(sort_by='r2', decreasing=True)
best = grid_table.models[0]
Then you can use:
best.actual_params['lambda']
Fully reproducible example
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
h2o.init()
# import the airlines dataset:
# This dataset is used to classify whether a flight will be delayed 'YES' or not "NO"
# original data can be found at http://www.transtats.bts.gov/
airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
# convert columns to factors
airlines["Year"]= airlines["Year"].asfactor()
airlines["Month"]= airlines["Month"].asfactor()
airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
airlines["Cancelled"] = airlines["Cancelled"].asfactor()
airlines['FlightNum'] = airlines['FlightNum'].asfactor()
# set the predictor names and the response column name
predictors = ["Origin", "Dest", "Year", "UniqueCarrier", "DayOfWeek", "Month", "Distance", "FlightNum"]
response = "IsDepDelayed"
# split into train and validation sets
train, valid= airlines.split_frame(ratios = [.8])
# try using the `lambda_` parameter:
# initialize your estimator
airlines_glm = H2OGeneralizedLinearEstimator(family = 'binomial', lambda_ = .0001)
# then train your model
airlines_glm.train(x = predictors, y = response, training_frame = train, validation_frame = valid)
# print the auc for the validation data
print(airlines_glm.auc(valid=True))
# Example of values to grid over for `lambda`
# import Grid Search
from h2o.grid.grid_search import H2OGridSearch
# select the values for lambda_ to grid over
hyper_params = {'lambda': [1, 0.5, 0.1, 0.01, 0.001, 0.0001, 0.00001, 0]}
# this example uses cartesian grid search because the search space is small
# and we want to see the performance of all models. For a larger search space use
# random grid search instead: {'strategy': "RandomDiscrete"}
# initialize the glm estimator
airlines_glm_2 = H2OGeneralizedLinearEstimator(family = 'binomial')
# build grid search with previously made GLM and hyperparameters
grid = H2OGridSearch(model = airlines_glm_2, hyper_params = hyper_params,
search_criteria = {'strategy': "Cartesian"})
# train using the grid
grid.train(x = predictors, y = response, training_frame = train, validation_frame = valid)
# sort the grid models by decreasing AUC
grid_table = grid.get_grid(sort_by = 'auc', decreasing = True)
print(grid_table)
best = grid_table.models[0]
print(best.actual_params['lambda'])
I am not sure why the following does not work:
best = grid_table.models[0]
best.actual_params["lambda"]
best.actual_params["alpha"]
It may be an issue with h2o, but if you change the above to the following you should be able to at least access those parameters:
best = grid.models[x]
best.actual_params["lambda"]
best.actual_params["alpha"]
Note that I have changed 0 to x because you need to take note of which model performs best according to your error criterion, since the contents of grid may not be sorted by that criterion. This requires you to look at grid_table, take note of the model_id, and see how the models are stored in grid.
Then you should be able to at least reference lambda and alpha. However, when you run a grid search on alpha and turn on lambda_search, best.actual_params["lambda"] will return the full list of lambdas that were searched over. You could still reference it as Lauren has suggested, but I typically like to see everything in a table, so I would suggest turning lambda_search off and adding lambda to the hyperparameters you search over.
import h2o
import numpy as np
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.grid.grid_search import H2OGridSearch

lambda_search_range = list(np.linspace(0, 1, 100))
h2o_data = h2o.import_file("h2o_example.svmlight")
cols = h2o_data.columns[1:]
hyper_parameters = {"alpha": [0.0, 0.01, 0.99, 1.0],
                    "lambda": lambda_search_range}
grid = H2OGridSearch(H2OGeneralizedLinearEstimator(family="gamma",
                                                   link="log", lambda_search=False, nfolds=2,
                                                   intercept=True, standardize=False),
                     hyper_params=hyper_parameters)
grid.train(y="C1", x=cols, training_frame=h2o_data)
grid_table = grid.get_grid(sort_by="r2", decreasing=True)
param_dict = grid_table.get_hyperparams_dict(grid_table.model_ids[0])
param_dict should be a dictionary that contains the alpha and lambda values for your best model according to the error criteria you specified.
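For example (hypothetical output; the actual values depend on your data and folds):

print(param_dict)
# e.g. {'alpha': 0.99, 'lambda': 0.31313131313131315}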
Related
I've seen that it's common practice to delete input features that demonstrate co-linearity (and leave only one of them).
However, I've just completed a course covering how a linear regression model gives different weights to different features, and I thought that the model might do better than us by assigning a low weight to less useful features instead of us deleting them entirely.
To try to resolve this doubt myself, I created a small dataset resembling an x-squared function and applied two linear regression models using Python:
A model that keeps only the x_squared feature
A model that keeps both the x and x_squared features
The results suggest that we shouldn't delete features and should instead let the model decide the best weights. However, I would like to ask the community whether the rationale of my exercise is right, and whether you've encountered this doubt elsewhere.
Here's my code to generate the dataset:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Generate the data
all_Y = [10, 3, 1.5, 0.5, 1, 5, 8]
all_X = range(-3, 4)
all_X_2 = np.square(all_X)
# Store the data into a dictionary
data_dic = {"x": all_X, "x_2": all_X_2, "y": all_Y}
# Generate a dataframe
df = pd.DataFrame(data=data_dic)
# Display the dataframe
display(df)
which produces a seven-row dataframe with columns x, x_2, and y.
and this is the code to generate the ML models:
# Create the lists to iterate over
ids = [1, 2]
features = [["x_2"], ["x", "x_2"]]
titles = ["$x^{2}$", "$x$ and $x^{2}$"]
colors = ["blue", "green"]
# Initiate figure
fig = plt.figure(figsize=(15,5))
# Iterate over the necessary lists to plot results
for i, model, title, color in zip(ids, features, titles, colors):
    # Initiate model, fit and make predictions
    lr = LinearRegression()
    lr.fit(df[model], df["y"])
    predicted = lr.predict(df[model])
    # Calculate mean squared error of the model
    mse = mean_squared_error(all_Y, predicted)
    # Create a subplot for each model
    plt.subplot(1, 2, i)
    plt.plot(df["x"], predicted, c=color, label="f(" + title + ")")
    plt.scatter(df["x"], df["y"], c="red", label="y")
    plt.title("Linear regression using " + title + " --- MSE: " + str(round(mse, 3)))
    plt.legend()
# Display results
plt.show()
which generates the two plots, one per model; judging by the MSE in the titles, the model using both x and x^2 fits better.
What do you think about this issue? This difference in mean squared error can be highly important in certain contexts.
x and x^2 are not linearly related, which is why deleting one of them does not help the model. The general guidance for regression is to delete features that are highly collinear with each other (which also means highly correlated).
So x^2 and y are highly correlated, and you are trying to predict y with x^2? A high correlation between a predictor variable and the response variable is usually a good thing - and since x and y are practically uncorrelated, including x is likely to "dilute" your model and thereby worsen its performance.
(Multi-)collinearity between the predictor variables themselves would be more problematic.
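To make this concrete, here is a quick check on the toy dataset from the question (a sketch assuming the data listed above):

import numpy as np

x = np.arange(-3, 4)
x_2 = np.square(x)
y = np.array([10, 3, 1.5, 0.5, 1, 5, 8])

# x is nearly uncorrelated with y, while x^2 is strongly correlated with it
print(np.corrcoef(x, y)[0, 1])    # roughly -0.05
print(np.corrcoef(x_2, y)[0, 1])  # roughly 0.97

So x^2 carries almost all the signal on its own, and adding x mainly lets the model absorb the slight asymmetry in y.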
My code works, but I am repeating a block several times just to vary the polynomial degree. I assume this can and should be looped for quicker iteration, but I'm not sure how to do it. Prior to the code below I generate the train/test split, which I keep for plotting.
After several iterations, I use np.vstack on the y_predictions to create a single array.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
### degree 1 ###
poly1 = PolynomialFeatures(degree=1)
x1_poly = poly1.fit_transform(X_train)
linreg1 = LinearRegression().fit(x1_poly, y_train)
pred_1 = poly1.transform(x_prediction_data)
y1_poly_pred = linreg1.predict(pred_1)

### degree 3 ###
poly3 = PolynomialFeatures(degree=3)
x3_poly = poly3.fit_transform(X_train)
linreg3 = LinearRegression().fit(x3_poly, y_train)
pred_3 = poly3.transform(x_prediction_data)
y3_poly_pred = linreg3.predict(pred_3)

### etc. ... will make several others for degree = 6, 9 ...
I would recommend collecting your answers in a dictionary, but I created a list for simplicity.
The code iterates over i, the degree of your polynomial: it trains a model for each degree, makes predictions, then collects them.
prediction_collector = []
for i in [1, 3, 6, 9]:
    poly = PolynomialFeatures(degree=i)
    x_poly = poly.fit_transform(X_train)
    linreg = LinearRegression().fit(x_poly, y_train)
    pred = poly.transform(x_prediction_data)
    y_poly_pred = linreg.predict(pred)
    # collect the predictions after each iteration/increase of degree
    prediction_collector.append(y_poly_pred)
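Then, as mentioned in the question, you can stack the collected predictions into a single array (assuming each y_poly_pred has the same length):

import numpy as np

predictions = np.vstack(prediction_collector)  # shape: (4, n_prediction_points)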
I am trying to estimate a model in TensorFlow using NUTS by providing it with a likelihood function. I have checked that the likelihood function returns reasonable values. I am following the setup here for setting up NUTS:
https://rlhick.people.wm.edu/posts/custom-likes-tensorflow.html
and some of the examples here for setting up priors, etc.:
https://github.com/tensorflow/probability/blob/master/tensorflow_probability/examples/jupyter_notebooks/Multilevel_Modeling_Primer.ipynb
My code is in a colab notebook here:
https://drive.google.com/file/d/1L9JQPLO57g3OhxaRCB29do2m808ZUeex/view?usp=sharing
I get the error: OperatorNotAllowedInGraphError: iterating over tf.Tensor is not allowed: AutoGraph did not convert this function. Try decorating it directly with @tf.function. This is my first time using TensorFlow and I am quite lost interpreting this error. It would also be ideal if I could pass the starting parameter values as a single input (the example I am working from doesn't do it, but I assume it is possible).
Update
It looks like I had to change the position of the @tf.function decorator. The sampler now runs, but it gives me the same value for all samples for each of the parameters. Is it a requirement that I pass a joint distribution through the log_prob() function? I am clearly missing something. I can run the likelihood through BFGS optimization and get reasonable results (I've estimated the model via maximum likelihood with fixed parameters in other software). It looks like I need to define the function to return a joint distribution and call log_prob(). I can do this if I set it up as a logistic regression (the logit choice model is logistically distributed in differences), but then I lose the standard closed form.
My function is as follows:
# Assumed aliases from the notebook: tf = tensorflow, tfp = tensorflow_probability,
# tfd = tfp.distributions, tfm = tf.math
@tf.function
def mmnl_log_prob(init_mu_b_time, init_sigma_b_time, init_a_car, init_a_train, init_b_cost, init_scale):
    # Create priors for hyperparameters
    mu_b_time = tfd.Sample(tfd.Normal(loc=init_mu_b_time, scale=init_scale), sample_shape=1).sample()
    # HalfCauchy distributions are too wide for logit discrete choice
    sigma_b_time = tfd.Sample(tfd.Normal(loc=init_sigma_b_time, scale=init_scale), sample_shape=1).sample()
    # Create priors for parameters
    a_car = tfd.Sample(tfd.Normal(loc=init_a_car, scale=init_scale), sample_shape=1).sample()
    a_train = tfd.Sample(tfd.Normal(loc=init_a_train, scale=init_scale), sample_shape=1).sample()
    # a_sm = tfd.Sample(tfd.Normal(loc=init_a_sm, scale=init_scale), sample_shape=1).sample()
    b_cost = tfd.Sample(tfd.Normal(loc=init_b_cost, scale=init_scale), sample_shape=1).sample()
    # Define a heterogeneous random parameter model with MultivariateNormalDiag()
    # Use MultivariateNormalDiagPlusLowRank() to define nests, etc.
    b_time = tfd.Sample(tfd.MultivariateNormalDiag(  # b_time
        loc=mu_b_time,
        scale_diag=sigma_b_time), sample_shape=num_idx).sample()
    # Definition of the utility functions
    V1 = a_train + tfm.multiply(b_time, TRAIN_TT_SCALED) + b_cost * TRAIN_COST_SCALED
    V2 = tfm.multiply(b_time, SM_TT_SCALED) + b_cost * SM_COST_SCALED
    V3 = a_car + tfm.multiply(b_time, CAR_TT_SCALED) + b_cost * CAR_CO_SCALED
    print("Vs", V1, V2, V3)
    # Definition of the loglikelihood
    eV1 = tfm.multiply(tfm.exp(V1), TRAIN_AV_SP)
    eV2 = tfm.multiply(tfm.exp(V2), SM_AV_SP)
    eV3 = tfm.multiply(tfm.exp(V3), CAR_AV_SP)
    eVD = eV1 + eV2 + eV3
    print("eVs", eV1, eV2, eV3, eVD)
    l1 = tfm.multiply(tfm.truediv(eV1, eVD), tf.cast(tfm.equal(CHOICE, 1), tf.float32))
    l2 = tfm.multiply(tfm.truediv(eV2, eVD), tf.cast(tfm.equal(CHOICE, 2), tf.float32))
    l3 = tfm.multiply(tfm.truediv(eV3, eVD), tf.cast(tfm.equal(CHOICE, 3), tf.float32))
    ll = tfm.reduce_sum(tfm.log(l1 + l2 + l3))
    print("ll", ll)
    return ll
The function is called as follows:
nuts_samples = 1000
nuts_burnin = 500
chains = 4
## Initial step size
init_step_size = .3
init = [0., 0., 0., 0., 0., .5]
##
## NUTS (using inner step size averaging step)
##
@tf.function
def nuts_sampler(init):
    nuts_kernel = tfp.mcmc.NoUTurnSampler(
        target_log_prob_fn=mmnl_log_prob,
        step_size=init_step_size,
    )
    adapt_nuts_kernel = tfp.mcmc.DualAveragingStepSizeAdaptation(
        inner_kernel=nuts_kernel,
        num_adaptation_steps=nuts_burnin,
        step_size_getter_fn=lambda pkr: pkr.step_size,
        log_accept_prob_getter_fn=lambda pkr: pkr.log_accept_ratio,
        step_size_setter_fn=lambda pkr, new_step_size: pkr._replace(step_size=new_step_size)
    )
    samples_nuts_, stats_nuts_ = tfp.mcmc.sample_chain(
        num_results=nuts_samples,
        current_state=init,
        kernel=adapt_nuts_kernel,
        num_burnin_steps=100,
        parallel_iterations=5)
    return samples_nuts_, stats_nuts_

samples_nuts, stats_nuts = nuts_sampler(init)
I have an answer to my question! It is simply a matter of different nomenclature. I need to define my model as a softmax function, which I knew was what I would call a "logit model", but it just wasn't clicking for me. The following blog post gave me the epiphany:
http://khakieconomics.github.io/2019/03/17/Putting-it-all-together.html
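The key point is that the closed-form choice probabilities in my likelihood (eV1/eVD and so on) are exactly a softmax over the utilities. A minimal numpy sketch of that equivalence, with hypothetical utility values standing in for V1, V2, V3:

import numpy as np

# hypothetical utilities: 5 observations x 3 alternatives (columns V1, V2, V3)
rng = np.random.default_rng(0)
V = rng.normal(size=(5, 3))

# closed-form logit probabilities, as in the likelihood function above
eV = np.exp(V)
p_closed_form = eV / eV.sum(axis=1, keepdims=True)

# softmax formulation (numerically stabilized) gives the same result
def softmax(v):
    e = np.exp(v - v.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

print(np.allclose(p_closed_form, softmax(V)))  # True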
I am working on a mortality prediction (binary outcome) problem with "base mortality probability" as my offset in an XGBoost model.
I have used the gbtree booster and the binary:logistic objective function. In my data I have multiple observations/records with the same X values but different offset values.
As per my understanding (please correct me if wrong), XGBoost under the binary:logistic setup tries to fit a model of the form log(p/(1-p)) = offset + F(x), where F(x) is optimized (for a specific loss function) using splits on various X values.
Thus, when the X values are exactly the same, I can get F(x) by taking the predicted output (with outputmargin = TRUE) and subtracting the offset. However, when I got the output, it turned out that with this approach I am getting different values of F(x) for the same set of X. I believe the offset is handled internally in XGBoost differently from my understanding. Can anyone explain the method/mathematical formulation of handling the offset?
I am specifically interested in extracting the value of F(x) (as this is additional information the model is providing) by adjusting the model prediction from the offset values.
Here are the sample codes:
library(xgboost)
x1 = runif(1000)
y1 = as.numeric(runif(1000)>.8)
y2 = as.numeric(runif(1000)>.8)
off1 = runif(1000)
off2 = runif(1000)
#stacking the data to have same X values
x= c(x1,x1)
y = c(y1,y2)
off = c(off1,off2)
length(unique(off)) # shows unique 2000 values
length(unique(x)) # shows unique 1000 values, i.e. each X is repeated once (as expected)
fulldata = cbind.data.frame(x,y,off)
train_dMtrix = xgb.DMatrix(data = as.matrix(x),
                           label = y,
                           base_margin = off)
params_list = list(booster = "gblinear", objective = "binary:logistic",
                   eta = 0.05, max_depth = 4, min_child_weight = 10, eval_metric = 'logloss')
set.seed(100)
xgbmodel = xgb.train(params = params_list, data = train_dMtrix, nrounds=100, callbacks = list(cb.gblinear.history()))
# Getting the prediction in link format
fulldata$Predicted_link = predict(xgbmodel, train_dMtrix, outputmargin = TRUE)
# Assuming Predicted_link = offset + F(x), calculating F(x) for each values of X
fulldata$F_x = fulldata$Predicted_link - fulldata$off
# As per my understanding, since the F(X) in purely independent of offset,
# the model predictions of F_x (not the predicted probability) should be exactly same for same values of x,
# irrespective of the corresponding offsets. Given I have 1000 distinct X values, I'm expecting 1000 distinct F_x values
length(unique(fulldata$F_x)) # shows almost 2000 unique values, which is contrary to my expectation.
I'm new to the hyperopt package.
Now I want to optimize my LDA model, which is implemented in gensim. The LDA model is optimized to maximize the silhouette score over the training data.
Now, my question is: how do I pass training data (numpy.ndarray) to the objective function that is called from hyperopt?
I looked at tutorials and some example code. They set the training data as a global variable. But in my situation, it's difficult to set the training data as a global variable as they do.
I wrote the following code to optimize LDA with hyperopt. I'm stuck on how to pass the training data to gensim_objective_function, because gensim_lda_optimaze will be called from inside a larger system.
How can I do that?
# I want to pass training data to this function!
# gensim_lda_tuning_training_corpus, gensim_lda_tuning_num_topic, gensim_lda_tuning_word2id are what I want to pass
def gensim_objective_function(arg_dict):
    from .gensim_lda import evaluate_clustering
    from .gensim_lda import call_lda_single
    from .gensim_lda import get_topics_ids
    alpha = arg_dict['alpha']
    eta = arg_dict['eta']
    iteration = arg_dict['iteration']
    gamma_threshold = arg_dict['gamma_threshold']
    minimum_probability = arg_dict['minimum_probability']
    passes = arg_dict['passes']
    # train LDA model
    lda_model, gensim_corpus = call_lda_single(matrix=gensim_lda_tuning_training_corpus,
                                               num_topics=gensim_lda_tuning_num_topic,
                                               word2id_dict=gensim_lda_tuning_word2id,
                                               alpha=alpha, eta=eta,
                                               iteration=iteration,
                                               gamma_threshold=gamma_threshold,
                                               minimum_probability=minimum_probability,
                                               passes=passes)
    topic_ids = get_topics_ids(trained_lda_model=lda_model, gensim_corpus=gensim_corpus)
    labels = [t[0] for t in topic_ids]
    # get silhouette score with the extracted labels
    evaluation_score = evaluate_clustering(feature_matrix=gensim_lda_tuning_training_corpus,
                                           labels=numpy.array(labels))
    return -1 * evaluation_score

def gensim_lda_optimaze(feature_matrix, num_topics, word2id_dict):
    assert isinstance(feature_matrix, (ndarray, csr_matrix))
    assert isinstance(num_topics, int)
    assert isinstance(word2id_dict, dict)
    parameter_space = {
        'alpha': hp.loguniform("alpha", numpy.log(0.1), numpy.log(1)),
        'eta': hp.loguniform("eta", numpy.log(0.1), numpy.log(1)),
        'iteration': 100,
        'gamma_threshold': 0.001,
        'minimum_probability': 0.01,
        'passes': 10
    }
    trials = Trials()
    best = fmin(
        gensim_objective_function,
        parameter_space,
        algo=tpe.suggest,
        max_evals=100,
        trials=trials
    )
    return best
You can always use partial in Python.
from functools import partial

def foo(params, data):
    return params, data

goo = partial(foo, data=[1, 2, 3])
print(goo('ala'))
gives
('ala', [1, 2, 3])
In other words, you make a proxy function, which has data loaded as a given parameter and you ask hyperopt to optimize this new function, with data already set.
Thus, in your case, you change gensim_objective_function into something accepting all your params:
def RAW_gensim_objective_function(arg_dict, gensim_lda_tuning_training_corpus,
                                  gensim_lda_tuning_num_topic,
                                  gensim_lda_tuning_word2id):
and create the actual function to optimize by passing your data in a different part of the code:
gensim_objective_function = partial(RAW_gensim_objective_function,
                                    gensim_lda_tuning_training_corpus=YOUR_CORPUS,
                                    gensim_lda_tuning_num_topic=YOUR_NUM_TOPICS,
                                    gensim_lda_tuning_word2id=YOUR_IDs)
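Then hyperopt optimizes the new function exactly as before, with the data already baked in. For example (a sketch assuming the parameter_space and hyperopt imports from the question):

from hyperopt import fmin, tpe, Trials

trials = Trials()
best = fmin(
    gensim_objective_function,  # the partial created above
    parameter_space,
    algo=tpe.suggest,
    max_evals=100,
    trials=trials
)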