I'm new to the hyperopt package.
I want to optimize my LDA model, which is implemented in gensim. The model is tuned to maximize the silhouette score over the training data.
My question is: how do I pass the training data (a numpy.ndarray) to the objective function that hyperopt calls?
I looked at tutorials and some example code. They set the training data as a global variable, but in my situation that is difficult to do.
I wrote the following code to optimize LDA with hyperopt. I'm stuck on how to pass the training data to gensim_objective_function, because gensim_lda_optimaze is going to be called from inside a larger system.
How can I do that?
# I want to pass training data to this function!
# gensim_lda_tuning_training_corpus, gensim_lda_tuning_num_topic, gensim_lda_tuning_word2id is what I wanna pass
def gensim_objective_function(arg_dict):
    from .gensim_lda import evaluate_clustering
    from .gensim_lda import call_lda_single
    from .gensim_lda import get_topics_ids

    alpha = arg_dict['alpha']
    eta = arg_dict['eta']
    iteration = arg_dict['iteration']
    gamma_threshold = arg_dict['gamma_threshold']
    minimum_probability = arg_dict['minimum_probability']
    passes = arg_dict['passes']
    # train LDA model
    lda_model, gensim_corpus = call_lda_single(matrix=gensim_lda_tuning_training_corpus,
                                               num_topics=gensim_lda_tuning_num_topic,
                                               word2id_dict=gensim_lda_tuning_word2id,
                                               alpha=alpha, eta=eta,
                                               iteration=iteration,
                                               gamma_threshold=gamma_threshold,
                                               minimum_probability=minimum_probability,
                                               passes=passes)
    topic_ids = get_topics_ids(trained_lda_model=lda_model, gensim_corpus=gensim_corpus)
    labels = [t[0] for t in topic_ids]
    # get silhouette score with extracted label
    evaluation_score = evaluate_clustering(feature_matrix=gensim_lda_tuning_training_corpus, labels=numpy.array(labels))
    return -1 * evaluation_score


def gensim_lda_optimaze(feature_matrix, num_topics, word2id_dict):
    assert isinstance(feature_matrix, (ndarray, csr_matrix))
    assert isinstance(num_topics, int)
    assert isinstance(word2id_dict, dict)
    parameter_space = {
        'alpha': hp.loguniform("alpha", numpy.log(0.1), numpy.log(1)),
        'eta': hp.loguniform("eta", numpy.log(0.1), numpy.log(1)),
        'iteration': 100,
        'gamma_threshold': 0.001,
        'minimum_probability': 0.01,
        'passes': 10
    }
    trials = Trials()
    best = fmin(
        gensim_objective_function,
        parameter_space,
        algo=tpe.suggest,
        max_evals=100,
        trials=trials
    )
    return best
You can always use functools.partial in Python.
from functools import partial

def foo(params, data):
    return params, data

goo = partial(foo, data=[1, 2, 3])
print(goo('ala'))
gives
('ala', [1, 2, 3])
In other words, you create a proxy function that has the data already bound as an argument, and you ask hyperopt to optimize this new function.
So in your case, change gensim_objective_function into something that accepts all of your inputs as parameters:
def RAW_gensim_objective_function(arg_dict, gensim_lda_tuning_training_corpus,
                                  gensim_lda_tuning_num_topic,
                                  gensim_lda_tuning_word2id):
    ...  # same body as before, now reading the data from the arguments
and create the actual function to optimize by passing in your data in a different part of the code:
gensim_objective_function = partial(RAW_gensim_objective_function,
                                    gensim_lda_tuning_training_corpus=YOUR_CORPUS,
                                    gensim_lda_tuning_num_topic=YOUR_NUM_TOPICS,
                                    gensim_lda_tuning_word2id=YOUR_IDs)
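To see the whole pattern end to end without globals, here is a minimal sketch. It assumes nothing about gensim: raw_objective is a made-up toy loss, and toy_fmin is a hypothetical random-search stand-in for hyperopt's fmin, used only so the example runs with the standard library. The point it illustrates is that the optimizer only ever passes the parameter dict, while the data is bound inside the calling function.

```python
import random
from functools import partial


def raw_objective(arg_dict, corpus, num_topics):
    """Toy objective: tuned parameters arrive in arg_dict, the data via keywords."""
    # Placeholder loss standing in for the negated silhouette score.
    target = sum(corpus) / (len(corpus) * num_topics)
    return (arg_dict["alpha"] - target) ** 2


def toy_fmin(fn, sample_space, max_evals, seed=0):
    """Hypothetical stand-in for hyperopt.fmin: plain random search."""
    rng = random.Random(seed)
    best_args, best_loss = None, float("inf")
    for _ in range(max_evals):
        args = {name: rng.uniform(low, high)
                for name, (low, high) in sample_space.items()}
        loss = fn(args)  # the optimizer only ever passes the parameter dict
        if loss < best_loss:
            best_args, best_loss = args, loss
    return best_args, best_loss


def optimize(corpus, num_topics):
    # Bind the data here, inside the caller, so no global variable is needed.
    objective = partial(raw_objective, corpus=corpus, num_topics=num_topics)
    return toy_fmin(objective, {"alpha": (0.1, 1.0)}, max_evals=200)


best_args, best_loss = optimize(corpus=[1, 2, 3], num_topics=4)
print(best_args, best_loss)
```

With the real library, the same partial object would be passed as the first argument of fmin in place of gensim_objective_function.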
Related
The fitness output is always 0 while my genetic algorithm is training. That would be normal at first, but it stays at 0 for the entire duration of the training and does not improve at all. I have set up the training so that the genetic algorithm trains on small increments of data at a time instead of the entire training data array. I am doing this because I want the genetic algorithm to train on the most recent data last. This is a simplified version of what I am trying...
def trainModels(coin):
    global endSliceTraining, startSliceTraining  # These are also used in the fitness function
    torch_ga = pygad.torchga.TorchGA(model=NeuralNetworkModel, num_solutions=10)
    while endSliceTraining < len(trainingData):
        ga_instance = pygad.GA(num_generations=10,
                               num_parents_mating=2,
                               initial_population=torch_ga.population_weights,
                               fitness_func=fitness_func,
                               parent_selection_type="sss",
                               mutation_type="random",
                               mutation_by_replacement=True)
        ga_instance.run()
        solution, solution_fitness, solution_idx = ga_instance.best_solution()
        startSliceTraining += 1
        endSliceTraining += 1


def fitness_func(solution, solution_idx):
    global startSliceTraining, endSliceTraining
    stats = numpy.array(trainingData)
    statsSlice = stats[startSliceTraining:endSliceTraining]
    for stats in statsSlice:
        action = [0,0,0]
        stats = torch.tensor(stats, dtype=torch.float)
        prediction = pygad.torchga.predict(model=NeuralNetworks[currentCoinIndex], solution=solution, data=stats)
        move = torch.argmax(prediction).item()
        action[move] = 1
        """
        I deleted the complicated part here. This area was not the problem. I hope what's above in this function is understandable
        """
    return ...
I'm thinking that maybe my parameters in pygad.GA are wrong, or that for some reason the neural network is not being transferred over to be used with the next set of data, but I don't know.
Help would be appreciated, thank you!
I am trying to estimate a model in tensorflow using NUTS by providing it a likelihood function. I have checked the likelihood function is returning reasonable values. I am following the setup here for setting up NUTS:
https://rlhick.people.wm.edu/posts/custom-likes-tensorflow.html
and some of the examples here for setting up priors, etc.:
https://github.com/tensorflow/probability/blob/master/tensorflow_probability/examples/jupyter_notebooks/Multilevel_Modeling_Primer.ipynb
My code is in a colab notebook here:
https://drive.google.com/file/d/1L9JQPLO57g3OhxaRCB29do2m808ZUeex/view?usp=sharing
I get the error: OperatorNotAllowedInGraphError: iterating over tf.Tensor is not allowed: AutoGraph did not convert this function. Try decorating it directly with @tf.function. This is my first time using tensorflow and I am quite lost interpreting this error. It would also be ideal if I could pass the starting parameter values as a single input (the example I am working from doesn't do it, but I assume it is possible).
Update
It looks like I had to change the position of the @tf.function decorator. The sampler now runs, but it gives me the same value for all samples for each of the parameters. Is it a requirement that I pass a joint distribution through the log_prob() function? I am clearly missing something. I can run the likelihood through bfgs optimization and get reasonable results (I've estimated the model via maximum likelihood with fixed parameters in other software). It looks like I need to define the function to return a joint distribution and call log_prob(). I can do this if I set it up as a logistic regression (logit choice model is logistically distributed in differences). However, I lose the standard closed form.
My function is as follows:
@tf.function
def mmnl_log_prob(init_mu_b_time,init_sigma_b_time,init_a_car,init_a_train,init_b_cost,init_scale):
    # Create priors for hyperparameters
    mu_b_time = tfd.Sample(tfd.Normal(loc=init_mu_b_time, scale=init_scale),sample_shape=1).sample()
    # HalfCauchy distributions are too wide for logit discrete choice
    sigma_b_time = tfd.Sample(tfd.Normal(loc=init_sigma_b_time, scale=init_scale),sample_shape=1).sample()
    # Create priors for parameters
    a_car = tfd.Sample(tfd.Normal(loc=init_a_car, scale=init_scale),sample_shape=1).sample()
    a_train = tfd.Sample(tfd.Normal(loc=init_a_train, scale=init_scale),sample_shape=1).sample()
    # a_sm = tfd.Sample(tfd.Normal(loc=init_a_sm, scale=init_scale),sample_shape=1).sample()
    b_cost = tfd.Sample(tfd.Normal(loc=init_b_cost, scale=init_scale),sample_shape=1).sample()
    # Define a heterogeneous random parameter model with MultivariateNormalDiag()
    # Use MultivariateNormalDiagPlusLowRank() to define nests, etc.
    b_time = tfd.Sample(tfd.MultivariateNormalDiag( # b_time
        loc=mu_b_time,
        scale_diag=sigma_b_time),sample_shape=num_idx).sample()
    # Definition of the utility functions
    V1 = a_train + tfm.multiply(b_time,TRAIN_TT_SCALED) + b_cost * TRAIN_COST_SCALED
    V2 = tfm.multiply(b_time,SM_TT_SCALED) + b_cost * SM_COST_SCALED
    V3 = a_car + tfm.multiply(b_time,CAR_TT_SCALED) + b_cost * CAR_CO_SCALED
    print("Vs",V1,V2,V3)
    # Definition of loglikelihood
    eV1 = tfm.multiply(tfm.exp(V1),TRAIN_AV_SP)
    eV2 = tfm.multiply(tfm.exp(V2),SM_AV_SP)
    eV3 = tfm.multiply(tfm.exp(V3),CAR_AV_SP)
    eVD = eV1 + eV2 + eV3
    print("eVs",eV1,eV2,eV3,eVD)
    l1 = tfm.multiply(tfm.truediv(eV1,eVD),tf.cast(tfm.equal(CHOICE,1),tf.float32))
    l2 = tfm.multiply(tfm.truediv(eV2,eVD),tf.cast(tfm.equal(CHOICE,2),tf.float32))
    l3 = tfm.multiply(tfm.truediv(eV3,eVD),tf.cast(tfm.equal(CHOICE,3),tf.float32))
    ll = tfm.reduce_sum(tfm.log(l1+l2+l3))
    print("ll",ll)
    return ll
The function is called as follows:
nuts_samples = 1000
nuts_burnin = 500
chains = 4
## Initial step size
init_step_size=.3
init = [0.,0.,0.,0.,0.,.5]
##
## NUTS (using inner step size averaging step)
##
@tf.function
def nuts_sampler(init):
    nuts_kernel = tfp.mcmc.NoUTurnSampler(
        target_log_prob_fn=mmnl_log_prob,
        step_size=init_step_size,
    )
    adapt_nuts_kernel = tfp.mcmc.DualAveragingStepSizeAdaptation(
        inner_kernel=nuts_kernel,
        num_adaptation_steps=nuts_burnin,
        step_size_getter_fn=lambda pkr: pkr.step_size,
        log_accept_prob_getter_fn=lambda pkr: pkr.log_accept_ratio,
        step_size_setter_fn=lambda pkr, new_step_size: pkr._replace(step_size=new_step_size)
    )
    samples_nuts_, stats_nuts_ = tfp.mcmc.sample_chain(
        num_results=nuts_samples,
        current_state=init,
        kernel=adapt_nuts_kernel,
        num_burnin_steps=100,
        parallel_iterations=5)
    return samples_nuts_, stats_nuts_

samples_nuts, stats_nuts = nuts_sampler(init)
I have an answer to my question! It is simply a matter of different nomenclature. I need to define my model as a softmax function, which I knew was what I would call a "logit model", but it just wasn't clicking for me. The following blog post gave me the epiphany:
http://khakieconomics.github.io/2019/03/17/Putting-it-all-together.html
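To make the "logit model is a softmax" point concrete, here is a small standard-library sketch. It is not the TensorFlow Probability code from the question: the utilities are made-up numbers, the availability terms (TRAIN_AV_SP and so on) are left out, and both function names are illustrative. It only shows that exp(V_i) / sum_j exp(V_j) is the softmax of the utilities and that the log likelihood is the sum of the log probabilities of the chosen alternatives.

```python
import math


def choice_probabilities(utilities):
    """Softmax over utilities: P(i) = exp(V_i) / sum_j exp(V_j)."""
    v_max = max(utilities)  # subtract the max for numerical stability
    exp_v = [math.exp(v - v_max) for v in utilities]
    total = sum(exp_v)
    return [e / total for e in exp_v]


def log_likelihood(utilities_per_obs, choices):
    """Sum of log P(chosen alternative); choices are 1-based, like CHOICE."""
    ll = 0.0
    for utilities, choice in zip(utilities_per_obs, choices):
        ll += math.log(choice_probabilities(utilities)[choice - 1])
    return ll


# Made-up utilities (V1, V2, V3) for two observations.
utilities_per_obs = [[0.2, 0.0, -0.5], [1.0, 0.3, 0.1]]
choices = [1, 3]

probs = choice_probabilities(utilities_per_obs[0])
print(probs, sum(probs))
print(log_likelihood(utilities_per_obs, choices))
```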
I'm trying to fit some data using lmfit. Originally, I used the Model class which worked well based on this example/tutorial:
https://lmfit.github.io/lmfit-py/model.html
But then I wanted to add some parameter constraints to the model, so I looked at this tutorial:
https://lmfit.github.io/lmfit-py/parameters.html
However, I've got some problems in combining the two classes such that they work nicely together.
Either it complains about the fit function missing parameters or getting invalid parameters (that's the case in the example I'll post) or I get a model which doesn't actually take the parameters I specified.
I can actually solve the problem using one of these different approaches:
1. Pass the parameters using model.make_params(...), but I would like to define them individually
2. I could use the Minimizer instead of Model, but I would like to understand why they are implemented so differently even though I would expect them to be very similar (except that they work on different types of input)
Any help / explanations would be really appreciated. :)
This code is based on the example on the Parameters tutorial page.
What I did here is to modify the example such that the Model class is used instead of the Minimizer class, which in principle should work, but I'm somehow doing it the wrong way.
As comparison, the original example is here (scroll to the bottom):
https://lmfit.github.io/lmfit-py/parameters.html
# <examples/doc_parameters_basic.py>
import numpy as np
from lmfit import Model, Parameters
# create data to be fitted
x = np.linspace(0, 15, 301)
data = (5. * np.sin(2*x - 0.1) * np.exp(-x*x*0.025) +
np.random.normal(size=len(x), scale=0.2))
# define objective function: returns the array to be minimized
def fcn2min(params, x):
    """Model a decaying sine wave and subtract data."""
    amp = params['amp']
    shift = params['shift']
    omega = params['omega']
    decay = params['decay']
    model = amp * np.sin(x*omega + shift) * np.exp(-x*x*decay)
    return model
# create a set of Parameters
params = Parameters()
params.add('amp', value=10, min=0)
params.add('decay', value=0.1)
params.add('shift', value=0.0, min=-np.pi/2., max=np.pi/2)
params.add('omega', value=3.0)
# do fit, here with leastsq model
fitmodel = Model(fcn2min)
result = fitmodel.fit(data, params, x=x)
# calculate final result
final = result.fit_report()
# try to plot results
try:
    import matplotlib.pyplot as plt
    plt.plot(x, data, 'k+')
    plt.plot(x, final, 'r')
    plt.show()
except ImportError:
    pass
# <end of examples/doc_parameters_basic.py>
But using it like this, I get an error
ValueError: Invalid independent variable name ('params') for function fcn2min
I tried:
Specifying all parameters as function arguments, e.g.
def fcn2min(x, amp, shift, omega, decay):
But in this case I end up with the model / function parameters not being connected and I get a 'fit' which does nothing.
Now I tried stuff like specifying:
fitmodel = Model(fcn2min, independent_vars=['x'], param_names=params)
But in this case I get
Invalid parameter name ('amp') for function fcn2min
Also tried something like this:
params = fitmodel.make_params()
params.add('amp', value=10, min=0)
...
But in this case I also don't get parameters which are connected to the model, which can be seen from the output of
fitmodel.param_names
which returns an empty list.
You are confusing a model function as wrapped by the Model class for curve-fitting with an objective function for general purpose minimization with minimize or leastsq. An objective function will have a signature like:
def objective(params, *args):
where params is a lmfit.Parameters instance that you would have to unpack within the function and *args are optional arguments. This function will return the array -- the objective -- that will be minimized in the least squares sense.
In contrast, a model function used to create a curve-fitting Model will have a signature like
def modelfunc(x, par1, par2, par3, ..., **kws):
where x is the independent variable and par1 ... will contain the values for the model parameters. The model function will return an array that models the data.
There are lots of examples using objective functions and using model functions at
https://github.com/lmfit/lmfit-py/tree/master/examples. To model your data, you would do
# define MODEL **not objective** function: returns the array to model the data
def sinedecay(x, amp, shift, omega, decay):
    """Model a decaying sine wave"""
    return amp * np.sin(x*omega + shift) * np.exp(-x*x*decay)
# create model:
smodel = Model(sinedecay)
# create a set of Parameters
params = smodel.make_params(amp=10, decay=0.1, shift=0, omega=3)
params.set('amp', min=0)
params.set('shift', min=-np.pi/2., max=np.pi/2)
# do fit
result = smodel.fit(data, params, x=x)
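The difference between the two signatures is easy to see by inspecting the argument names, which is what Model works from when it picks the independent variable and the parameter names. The sketch below uses only the standard library; split_arguments is an illustrative helper, not lmfit's actual code, and math stands in for numpy.

```python
import inspect
import math


def fcn2min(params, x):
    """Objective-style signature: a parameter container first, then the data."""
    return params["amp"] * math.sin(x * params["omega"])


def sinedecay(x, amp, shift, omega, decay):
    """Model-style signature: independent variable first, then one argument per parameter."""
    return amp * math.sin(x * omega + shift) * math.exp(-x * x * decay)


def split_arguments(func):
    """Illustrative helper (not lmfit's code): first argument vs. the remaining ones."""
    names = list(inspect.signature(func).parameters)
    return names[0], names[1:]


print(split_arguments(fcn2min))    # ('params', ['x'])
print(split_arguments(sinedecay))  # ('x', ['amp', 'shift', 'omega', 'decay'])
```

Read this way, the objective-style function offers 'params' as its independent variable, which is consistent with the "Invalid independent variable name ('params')" error in the question.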
I am using H2O (Python) where I am playing with H2OGridSearch for alpha values of a GLM (H2OGeneralizedLinearEstimator), also using lambda_search=True using k-fold cross-validation.
How can I get the best model's lambda value?
EDIT: Fully reproducible example
Data:
34.40 17:1 73:1 127:1 265:1 912:1 1162:1 1512:1 1556:1 1632:1 1738:1
205.10 127:1 138:1 338:1 347:1 883:1 912:1 1120:1 1122:1 1512:1
7.75 66:1 127:1 347:1 602:1 1422:1 1512:1 1535:1 1738:1
8.85 127:1 608:1 906:1 979:1 1077:1 1512:1 1738:1
51.80 127:1 347:1 608:1 766:1 912:1 928:1 952:1 1034:1 1512:1 1610:1 1738:1
110.00 127:1 229:1 347:1 602:1 608:1 1171:1 1512:1 1718:1
8.90 66:1 127:1 205:1 347:1 490:1 589:1 912:1 1016:1 1512:1
Call this file h2o_example.svmlight
Then run:
h2o_data = h2o.import_file("h2o_example.svmlight")
cols = h2o_data.columns[1:]
hyper_parameters = {"alpha": [0.0, 0.01, 0.99, 1.0]}
grid = H2OGridSearch(H2OGeneralizedLinearEstimator(family="gamma", link="log", lambda_search=True, nfolds=2, intercept=True, standardize=False),
hyper_params=hyper_parameters)
grid.train(y="C1", x=cols, training_frame=h2o_data)
grid_table = grid.get_grid(sort_by="r2", decreasing=True)
best = grid_table.models[0]
best.actual_params["lambda"]
best.actual_params["alpha"]
The last two commands fail, giving me an error:
TypeError: 'property' object has no attribute '__getitem__'
Apparently, I am using lambda_search in a wrong way. How can I get a single alpha and lambda value for the best model according to my criterion?
Final EDIT
There are multiple ways of getting lambda (shown below), but here are two concise ones. (Note: fully reproducible code is at the bottom.)
If you have lambda_search = True, you can look at the model summary table under the lambda_search column and see what value is set for lambda.min, which is your best lambda:
model.summary()['lambda_search']
which will produce a list with a string similar to:
['nlambda = 100, lambda.max = 12.733, lambda.min = 0.05261, lambda.1se = -1.0']
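Since the summary gives the values as one string, a small parser can pull lambda.min out as a number. This is a sketch that assumes the string has exactly the "key = value, key = value" shape shown above; parse_lambda_search is a hypothetical helper name, not part of h2o.

```python
import re

# String copied from the summary output shown above (first element of the list).
lambda_search_info = "nlambda = 100, lambda.max = 12.733, lambda.min = 0.05261, lambda.1se = -1.0"


def parse_lambda_search(text):
    """Turn 'key = value, key = value' into a dict of floats."""
    pairs = re.findall(r"([\w.]+)\s*=\s*(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)", text)
    return {key: float(value) for key, value in pairs}


parsed = parse_lambda_search(lambda_search_info)
print(parsed["lambda.min"])  # 0.05261
```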
If you don't use lambda search and don't set a lambda value (or do set it), you can also use the summary table:
model.summary()['regularization']
output looks like:
['Elastic Net (alpha = 0.5, lambda = 0.01289 )']
Other options:
look at the actual parameters of the model:
best.actual_params['lambda']
best.actual_params['alpha']
where best was your best model in the grid search results
First EDIT
To get the best model, you can do:
grid_table = grid.get_grid(sort_by='r2', decreasing=True)
best = grid_table.models[0]
Then you can use:
best.actual_params['lambda']
Fully reproducible example
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
h2o.init()
# import the airlines dataset:
# This dataset is used to classify whether a flight will be delayed 'YES' or not "NO"
# original data can be found at http://www.transtats.bts.gov/
airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
# convert columns to factors
airlines["Year"]= airlines["Year"].asfactor()
airlines["Month"]= airlines["Month"].asfactor()
airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
airlines["Cancelled"] = airlines["Cancelled"].asfactor()
airlines['FlightNum'] = airlines['FlightNum'].asfactor()
# set the predictor names and the response column name
predictors = ["Origin", "Dest", "Year", "UniqueCarrier", "DayOfWeek", "Month", "Distance", "FlightNum"]
response = "IsDepDelayed"
# split into train and validation sets
train, valid= airlines.split_frame(ratios = [.8])
# try using the `lambda_` parameter:
# initialize your estimator
airlines_glm = H2OGeneralizedLinearEstimator(family = 'binomial', lambda_ = .0001)
# then train your model
airlines_glm.train(x = predictors, y = response, training_frame = train, validation_frame = valid)
# print the auc for the validation data
print(airlines_glm.auc(valid=True))
# Example of values to grid over for `lambda`
# import Grid Search
from h2o.grid.grid_search import H2OGridSearch
# select the values for lambda_ to grid over
hyper_params = {'lambda': [1, 0.5, 0.1, 0.01, 0.001, 0.0001, 0.00001, 0]}
# this example uses cartesian grid search because the search space is small
# and we want to see the performance of all models. For a larger search space use
# random grid search instead: {'strategy': "RandomDiscrete"}
# initialize the glm estimator
airlines_glm_2 = H2OGeneralizedLinearEstimator(family = 'binomial')
# build grid search with previously made GLM and hyperparameters
grid = H2OGridSearch(model = airlines_glm_2, hyper_params = hyper_params,
search_criteria = {'strategy': "Cartesian"})
# train using the grid
grid.train(x = predictors, y = response, training_frame = train, validation_frame = valid)
# sort the grid models by decreasing AUC
grid_table = grid.get_grid(sort_by = 'auc', decreasing = True)
print(grid_table)
best = grid_table.models[0]
print(best.actual_params['lambda'])
I am not sure why the following does not work
best = grid_table.models[0]
best.actual_params["lambda"]
best.actual_params["alpha"]
It may be an issue with h2o, but if you change the above to the following you should be able to at least access those parameters:
best = grid.models[x]
best.actual_params["lambda"]
best.actual_params["alpha"]
Note that I have changed 0 to x because you need to take note of which model performs best according to your error criterion: the contents of grid may not be sorted by it. This requires looking at grid_table, noting the model_id, and checking how the models are stored in grid.
Then you should be able to at least reference lambda and alpha. However, when you run a grid search on alpha and turn on the search for lambda through the lambda_search property, best.actual_params["lambda"] will return the full list of lambdas that were searched over. You could still reference it using what Lauren has suggested, but I typically like to see everything in a table, so I would suggest turning lambda_search off and adding lambda to the hyperparameters you search over.
import numpy as np
lambda_search_range = list(np.linspace(0,1,100))
h2o_data = h2o.import_file("h2o_example.svmlight")
cols = h2o_data.columns[1:]
hyper_parameters = {"alpha": [0.0, 0.01, 0.99, 1.0],
"lambda": lambda_search_range}
grid = H2OGridSearch(H2OGeneralizedLinearEstimator(family="gamma",
link="log", lambda_search=False, nfolds=2,
intercept=True, standardize=False), hyper_params=hyper_parameters)
grid.train(y="C1", x=cols, training_frame=h2o_data)
grid_table = grid.get_grid(sort_by="r2", decreasing=True)
param_dict = grid_table.get_hyperparams_dict(grid_table.model_ids[0])
param_dict should be a dictionary that contains the alpha and lambda values for your best model according to the error criteria you specified.
This is perhaps a silly question.
I'm trying to fit data to a very strange PDF using MCMC evaluation in PyMC. For this example I just want to figure out how to fit to a normal distribution where I manually input the normal PDF. My code is:
import random
import numpy as np
import pymc as mc

data = []
for count in range(1000):
    data.append(random.gauss(-200, 15))
mean = mc.Uniform('mean', lower=min(data), upper=max(data))
std_dev = mc.Uniform('std_dev', lower=0, upper=50)
# @mc.potential
# def density(x = data, mu = mean, sigma = std_dev):
#     return (1./(sigma*np.sqrt(2*np.pi))*np.exp(-((x-mu)**2/(2*sigma**2))))
mc.Normal('process', mu=mean, tau=1./std_dev**2, value=data, observed=True)
model = mc.MCMC([mean,std_dev])
model.sample(iter=5000)
print("!")
print(model.stats()['mean']['mean'])
print(model.stats()['std_dev']['mean'])
The examples I've found all use something like mc.Normal, or mc.Poisson or whatnot, but I want to fit to the commented out density function.
Any help would be appreciated.
An easy way is to use the stochastic decorator:
import pymc as mc
import numpy as np
data = np.random.normal(-200,15,size=1000)
mean = mc.Uniform('mean', lower=min(data), upper=max(data))
std_dev = mc.Uniform('std_dev', lower=0, upper=50)
@mc.stochastic(observed=True)
def custom_stochastic(value=data, mean=mean, std_dev=std_dev):
    return np.sum(-np.log(std_dev) - 0.5*np.log(2) -
                  0.5*np.log(np.pi) -
                  (value-mean)**2 / (2*(std_dev**2)))

model = mc.MCMC([mean,std_dev,custom_stochastic])
model.sample(iter=5000)
print("!")
print(model.stats()['mean']['mean'])
print(model.stats()['std_dev']['mean'])
Note that my custom_stochastic function returns the log likelihood, not the likelihood, and that it is the log likelihood for the entire sample.
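As a quick sanity check of that formula, the sketch below compares the term-by-term log likelihood with the log of the product of the individual normal densities. It uses only the standard library (statistics.NormalDist and math.prod, so Python 3.8 or later) rather than PyMC, and normal_log_likelihood is an illustrative name, not a PyMC function.

```python
import math
import random
from statistics import NormalDist


def normal_log_likelihood(values, mean, std_dev):
    """Log likelihood of the whole sample, term by term as in custom_stochastic."""
    return sum(
        -math.log(std_dev) - 0.5 * math.log(2) - 0.5 * math.log(math.pi)
        - (v - mean) ** 2 / (2 * std_dev ** 2)
        for v in values
    )


rng = random.Random(0)
data = [rng.gauss(-200, 15) for _ in range(5)]

# Reference: log of the product of the individual densities.
reference = math.log(math.prod(NormalDist(-200, 15).pdf(v) for v in data))

print(normal_log_likelihood(data, -200, 15), reference)
```

The two numbers agree up to floating-point error. Summing logs is also the safer form for large samples, where the product of densities would underflow.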
There are a few other ways to create custom stochastic nodes. This doc gives more details, and this gist contains an example using pymc.Stochastic to create a node with a kernel density estimator.