Is there a way to predict how long it will take to run a classifier from scikit-learn based on the parameters and dataset? I know, pretty meta, right?
Some classifiers/parameter combinations are quite fast, and some take so long that I eventually just kill the process. I'd like a way to estimate in advance how long it will take.
Alternatively, I'd accept some pointers on how to set common parameters to reduce the run time.
There are specific classes of classifiers and regressors that directly report the remaining time or the progress of the algorithm (number of iterations etc.). Most of this can be turned on by passing verbose=2 (or any number > 1) to the constructor of the individual model. Note: this behavior is as of sklearn 0.14; earlier versions have slightly different verbose output (still useful, though).
The best examples of this are ensemble.RandomForestClassifier and ensemble.GradientBoostingClassifier, which print the number of trees built so far and the remaining time.
clf = ensemble.GradientBoostingClassifier(verbose=3)
clf.fit(X, y)
Out:
      Iter       Train Loss   Remaining Time
         1           0.0769            0.10s
...
Or
clf = ensemble.RandomForestClassifier(verbose=3)
clf.fit(X, y)
Out:
building tree 1 of 100
...
This progress information is fairly useful to estimate the total time.
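For example, one rough way (my own sketch, not a built-in sklearn feature) is to fit a probe model with only a few trees, time it, and extrapolate linearly to the number of trees you actually want:

import time
import numpy as np
from sklearn import ensemble

X, y = np.random.rand(10000, 20), np.random.randint(0, 2, 10000)  # placeholder data

# Fit a probe model with only a few trees and time it
probe = ensemble.RandomForestClassifier(n_estimators=10)
start = time.time()
probe.fit(X, y)
per_tree = (time.time() - start) / probe.n_estimators

# Extrapolate (assumes the cost per tree is roughly constant, which is only an approximation)
n_target = 500
print("Estimated fit time for %d trees: %.1f s" % (n_target, per_tree * n_target))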
Then there are other models like SVMs that print the number of optimization iterations completed, but do not directly report the remaining time.
clf = svm.SVC(verbose=2)
clf.fit(X, y)
Out:
*
optimization finished, #iter = 1
obj = -1.802585, rho = 0.000000
nSV = 2, nBSV = 2
...
Models like linear models don't provide such diagnostic information as far as I know.
Check this thread to know more about what the verbosity levels mean: scikit-learn fit remaining time
If you are using IPython, consider using the built-in magic commands %time and %timeit:
%time - Time execution of a Python statement or expression. The CPU and wall clock times are printed, and the value of the expression (if any) is returned. Note that under Win32, system time is always reported as 0, since it can not be measured.
%timeit - Time execution of a Python statement or expression using the timeit module.
Example:
In [4]: %timeit NMF(n_components=16, tol=1e-2).fit(X)
1 loops, best of 3: 1.7 s per loop
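If even a single full fit is too slow to time, one rough approach (my own sketch, not from the scikit-learn docs) is to time fits on growing subsamples and see how the runtime scales before committing to the full dataset:

import time
import numpy as np
from sklearn.svm import SVC

X, y = np.random.rand(20000, 30), np.random.randint(0, 2, 20000)  # placeholder data

for frac in (0.01, 0.02, 0.05, 0.1):
    n = int(frac * len(X))
    idx = np.random.choice(len(X), n, replace=False)
    start = time.time()
    SVC().fit(X[idx], y[idx])
    print("n=%6d: %.2f s" % (n, time.time() - start))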
References:
https://ipython.readthedocs.io/en/stable/interactive/magics.html
http://scikit-learn.org/stable/developers/performance.html
We're actually working on a package that gives runtime estimates of scikit-learn fits.
You would basically run it right before running algo.fit(X, y) to get an estimate of the runtime.
Here's a simple use case:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from scitime import Estimator

estimator = Estimator()
rf = RandomForestRegressor()
X, y = np.random.rand(100000, 10), np.random.rand(100000, 1)
# Run the estimation
estimation, lower_bound, upper_bound = estimator.time(rf, X, y)
Feel free to take a look!
I applied this tutorial https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/23_Time-Series-Prediction.ipynb (on a different dataset). The tutorial did not compute the mean squared error for the individual outputs, so I added the following line in the comparison function:
mean_squared_error(signal_true,signal_pred)
but the loss and MSE from the prediction were different from the loss and MSE from model.evaluate on the test data. The errors from model.evaluate (loss, mae, mse) on the test set:
[0.013499056920409203, 0.07980187237262726, 0.013792216777801514]
The errors for the individual targets (outputs):
Target0 0.167851388666284
Target1 0.6068108648555771
Target2 0.1710370357827747
Target3 2.747463225418181
Target4 1.7965991690103074
Target5 0.9065426398192563
I think it might be a problem in training the model, but I could not find where exactly. I would really appreciate your help.
Thanks.
There are a number of reasons that you can have differences between the loss for training and evaluation.
Certain ops, such as batch normalization, are disabled during prediction; this can make a big difference with certain architectures, although it generally isn't supposed to if you're using batch norm correctly.
MSE for training is averaged over the entire epoch, while evaluation only happens on the latest "best" version of the model.
It could be due to differences in the datasets if the split isn't random.
You may be using different metrics without realizing it.
I'm not sure exactly what problem you're running into, but it can be caused by a lot of different things and it's often difficult to debug.
I had the same problem and found a solution. Hopefully this is the same problem you encountered.
It turns out that model.predict doesn't return predictions in the same order as generator.labels, and that is why the MSE was much larger when I attempted to calculate it manually (using the scikit-learn metric function).
>>> model.evaluate(valid_generator, return_dict=True)['mean_squared_error']
13.17293930053711
>>> mean_squared_error(valid_generator.labels, model.predict(valid_generator)[:,0])
91.1225401637833
My quick and dirty solution:
valid_generator.reset()  # Necessary for starting from first batch
all_labels = []
all_pred = []
for i in range(len(valid_generator)):  # Necessary for avoiding infinite loop
    x = next(valid_generator)
    pred_i = model.predict(x[0])[:, 0]
    labels_i = x[1]
    all_labels.append(labels_i)
    all_pred.append(pred_i)
    print(np.shape(pred_i), np.shape(labels_i))

cat_labels = np.concatenate(all_labels)
cat_pred = np.concatenate(all_pred)
The result:
>>> mean_squared_error(cat_labels, cat_pred)
13.172956865002352
This can be done much more elegantly, but was enough for me to confirm my hypothesis of the problem and regain some sanity.
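A cleaner alternative, if you control how the generator is built (e.g. a Keras ImageDataGenerator flow), is usually to construct the validation generator with shuffle=False, so that prediction order matches generator.labels. A minimal sketch under that assumption, reusing the names from above:

# Assumes valid_generator was constructed with shuffle=False, e.g.
#   valid_generator = datagen.flow_from_dataframe(valid_df, ..., shuffle=False)
# so that the order of predictions matches generator.labels.
preds = model.predict(valid_generator)[:, 0]
print(mean_squared_error(valid_generator.labels, preds))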
My goal is to perform a parameter estimation (model calibration) using PyGMO. My model will be an external "black box" model (C code) outputting the objective function J to be minimized (J in this case will be the "Normalized Root Mean Square Error" (NRMSE) between model outputs and measured data). To speed up the optimization (calibration) I would like to run my models/simulations on multiple cores/threads in parallel. Therefore I would like to use a batch fitness evaluator (bfe) in PyGMO. I prepared a minimal example using a simple problem class, but in pure Python (no external model) and using the Rosenbrock problem:
#!/usr/bin/env python
# coding: utf-8

import numpy as np
from fmpy import read_model_description, extract, simulate_fmu, freeLibrary
from fmpy.fmi2 import FMU2Slave
import pygmo as pg
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
from matplotlib import cm
import time

#-------------------------------------------------------
def main():
    # Optimization
    # Define problem
    class my_problem:
        def __init__(self, dim):
            self.dim = dim

        def fitness(self, x):
            J = np.zeros((1,))
            for i in range(len(x) - 1):
                J[0] += 100.*(x[i + 1] - x[i]**2)**2 + (1. - x[i])**2
            return J

        def get_bounds(self):
            return (np.full((self.dim,), -5.), np.full((self.dim,), 10.))

        def get_name(self):
            return "My implementation of the Rosenbrock problem"

        def get_extra_info(self):
            return "\nDimensions: " + str(self.dim)

        def batch_fitness(self, dvs):
            J = [123] * len(dvs)
            return J

    prob = pg.problem(my_problem(30))
    print('\n----------------------------------------------')
    print('\nProblem description: \n')
    print(prob)

    #-------------------------------------------------------
    dvs = pg.batch_random_decision_vector(prob, 1)
    print('\n----------------------------------------------')
    print('\nBatch fitness evaluation:')
    print('\ndvs length:' + str(len(dvs)))
    print('\ndvs:')
    print(dvs)
    udbfe = pg.default_bfe()
    b = pg.bfe(udbfe=udbfe)
    print('\nudbfe:')
    print(udbfe)
    print('\nbfe:')
    print(b)
    fvs = b(prob, dvs)
    print(fvs)

    #-------------------------------------------------------
    pop_size = 50
    gen_size = 1000
    algo = pg.algorithm(pg.sade(gen=gen_size))  # self-adaptive Differential Evolution (sade, jDE variant)
    algo.set_verbosity(int(gen_size/10))        # log a line every gen_size/10 generations
    print('\n----------------------------------------------')
    print('\nOptimization:')
    start = time.time()
    pop = pg.population(prob, size=pop_size)    # the initial population
    pop = algo.evolve(pop)                      # the actual optimization process
    best_fitness = pop.get_f()[pop.best_idx()]  # best individual in the population
    print('\n----------------------------------------------')
    print('\nResult:')
    print('\nBest fitness: ', best_fitness)
    best_parameterset = pop.get_x()[pop.best_idx()]
    print('\nBest parameter set: ', best_parameterset)
    print('\nTime elapsed for optimization: ', time.time() - start, ' seconds\n')

if __name__ == '__main__':
    main()
When I try to run this code I get the following error:
Exception has occurred: ValueError
function: bfe_check_output_fvs
where: C:\projects\pagmo2\src\detail\bfe_impl.cpp, 103
what: An invalid result was produced by a batch fitness evaluation: the number of produced fitness vectors, 30, differs from the number of input decision vectors, 1
By deleting or commenting out these two lines:
fvs = b(prob, dvs)
print(fvs)
the script can be run without errors.
My questions:
How do I use the batch fitness evaluation? (I know this is a new capability of PyGMO and they are still working on the documentation...) Can anybody give a minimal example of how to implement this?
Is this the right way to go to speed up my model calibration problem? Or should I use islands and archipelagos? If I understood correctly, the islands in an archipelago do not communicate with each other, right? So if one performs e.g. a Particle Swarm Optimization and wants to evaluate several objective function calls simultaneously (in parallel), is the batch fitness evaluator the right choice?
Do I need to care about archipelagos and islands in this example? What exactly are they meant for? Is it worth running several optimizations with different initial x (input to the objective function) and then taking the best solution? Is this a common approach in optimization with GAs?
I am very new to the field of optimization and PyGMO, so thanks for helping!
Is this the right way to go to speed up my model calibration problem? Or should I use islands and archipelagos? If I understood correctly, the islands in an archipelago do not communicate with each other, right? So if one performs e.g. a Particle Swarm Optimization and wants to evaluate several objective function calls simultaneously (in parallel), is the batch fitness evaluator the right choice?
There are 2 modes of parallelization in pagmo, the island model (i.e., coarse-grained parallelization) and the BFE machinery (i.e., fine-grained parallelization).
The island model works on any problem/algorithm combination, and it is based on the idea that multiple optimisations are run in parallel while exchanging information to accelerate the global convergence to a solution.
The BFE machinery, instead, parallelizes a single optimisation, and it requires explicit support in the solver to work. Currently in pagmo only a handful of solvers are able to take advantage of the BFE machinery. The BFE machinery can also be used to parallelise the initialisation of a population of individuals, which can be useful if your fitness function is particularly heavyweight.
Which parallelisation method is best for you depends on the properties of your problem. In my experience, users tend to prefer the BFE machinery (fine-grained parallelisation) if the fitness function is very heavy (e.g., it takes minutes or more to compute), because in such a situation fitness evaluations are so costly that in order to take advantage of the island model one would have to wait too long. The BFE is also in some sense easier to understand because you don't have to delve into the details of archipelagos, topologies, etc. On the other hand, the BFE works only with certain solvers (although we are trying to extend BFE support to other solvers as time goes by).
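For reference, here is a minimal sketch of the island-model side (assuming pygmo 2.x and the prob/algo objects from the example above), so you can compare the two approaches:

# Coarse-grained parallelization: 8 islands, each evolving its own
# population of 20 individuals in a separate process.
archi = pg.archipelago(n=8, algo=algo, prob=prob, pop_size=20)
archi.evolve()                    # launch the evolutions asynchronously
archi.wait()                      # wait for all islands to finish
print(archi.get_champions_f())    # best fitness found on each island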
How do I use the batch fitness evaluation? (I know this is a new capability of PyGMO and they are still working on the documentation...) Can anybody give a minimal example of how to implement this?
One way of using the BFE is what you did in your example, i.e., via the implementation of a batch_fitness() method in your problem. However, my suggestion would be to comment out the batch_fitness() method and try using one of the general-purpose batch fitness evaluators provided with pagmo. The easiest thing to do is to just default-construct an instance of the bfe class and then pass it to one of the algorithms that can use the BFE machinery. One such algorithm is nspso:
https://esa.github.io/pygmo2/algorithms.html#pygmo.nspso
So, something like this:
b = pg.bfe() # Construct a default BFE
uda = pg.nspso(gen = gen_size) # Construct the algorithm
uda.set_bfe(b) # Tell the UDA to use the BFE machinery
algo = pg.algorithm(uda) # Construct a pg.algorithm from the UDA
new_pop = algo.evolve(pop) # Evolve the population
This should use multiple processes to evaluate your fitness functions in parallel within the loop of the nspso algorithm.
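As mentioned above, the same bfe object can also be used to parallelize the fitness evaluations of the initial population; a small sketch, reusing prob, pop_size and b from the snippets above:

# Evaluate the fitness of the initial individuals in parallel via the BFE
pop = pg.population(prob, size=pop_size, b=b)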
If you need more help, please come over to our public users/devs chat room, where you should get assistance rather quickly (normally):
https://gitter.im/pagmo2/Lobby
Suppose you are training a custom tf.estimator.Estimator with tf.estimator.train_and_evaluate using a validation dataset in a setup similar to that of @simlmx's:
classifier = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir=model_dir,
    params=params)

train_spec = tf.estimator.TrainSpec(
    input_fn=training_data_input_fn,
)

eval_spec = tf.estimator.EvalSpec(
    input_fn=validation_data_input_fn,
)

tf.estimator.train_and_evaluate(
    classifier,
    train_spec,
    eval_spec
)
Often, one uses a validation dataset to cut off training to prevent over-fitting when the loss continues to improve for the training dataset but not for the validation dataset.
Currently the tf.estimator.EvalSpec allows one to specify after how many steps (defaults to 100) to evaluate the model.
How can one (if possible, without using tf.contrib functions) specify that training should terminate after n evaluation calls (n * steps) in which the evaluation loss does not improve, and then save the "best" model / checkpoint (as determined by the validation dataset) to a unique file name (e.g. best_validation.checkpoint)?
I understand your confusion now. The documentation for stop_if_no_decrease_hook states (emphasis mine):
max_steps_without_decrease: int, maximum number of training steps with no decrease in the given metric.
eval_dir: If set, directory containing summary files with eval metrics. By default, estimator.eval_dir() will be used.
Looking through the code of the hook (version 1.11), though, you find:
def stop_if_no_metric_improvement_fn():
    """Returns `True` if metric does not improve within max steps."""
    eval_results = read_eval_metrics(eval_dir)  #<<<<<<<<<<<<<<<<<<<<<<<
    best_val = None
    best_val_step = None
    for step, metrics in eval_results.items():  #<<<<<<<<<<<<<<<<<<<<<<<
        if step < min_steps:
            continue
        val = metrics[metric_name]
        if best_val is None or is_lhs_better(val, best_val):
            best_val = val
            best_val_step = step
        if step - best_val_step >= max_steps_without_improvement:  #<<<<<
            tf_logging.info(
                'No %s in metric "%s" for %s steps, which is greater than or equal '
                'to max steps (%s) configured for early stopping.',
                increase_or_decrease, metric_name, step - best_val_step,
                max_steps_without_improvement)
            return True
    return False
What the code does is load the evaluation results (produced according to your EvalSpec parameters) and extract, for each evaluation record, the metric values and the global_step (or whichever other custom step you use to count).
This is the source of the training steps part of the docs: early stopping is not triggered by a count of non-improving evaluations, but by the number of training steps elapsed since the best evaluation (which IMHO is a bit counter-intuitive).
So, to recap: Yes, the early-stopping hook uses the evaluation results to decide when it's time to cut the training, but you need to pass in the number of training steps you want to monitor and keep in mind how many evaluations will happen in that number of steps.
Examples with numbers to hopefully clarify more
Let's assume you're training indefinitely long, with an evaluation every 1k steps. The specifics of how the evaluation runs are not relevant, as long as it runs every 1k steps and produces a metric we want to monitor.
If you set the hook as hook = tf.contrib.estimator.stop_if_no_decrease_hook(my_estimator, 'my_metric_to_monitor', 10000) the hook will consider the evaluations happening in a range of 10k steps.
Since you're running 1 eval every 1k steps, this boils down to early-stopping if there's a sequence of 10 consecutive evals without any improvement.
If then you decide to rerun with evals every 2k steps, the hook will only consider a sequence of 5 consecutive evals without improvement.
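For completeness, here is a sketch of how the hook is typically wired into train_and_evaluate in TF 1.x; the names match the snippets above, and the metric name and step counts are assumptions on my part, so adapt them as needed:

early_stopping = tf.contrib.estimator.stop_if_no_decrease_hook(
    classifier,
    metric_name='loss',
    max_steps_without_decrease=10000,
    min_steps=1000)

train_spec = tf.estimator.TrainSpec(
    input_fn=training_data_input_fn,
    hooks=[early_stopping])

tf.estimator.train_and_evaluate(classifier, train_spec, eval_spec)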
Keeping the best model
First of all, an important note: this has nothing to do with early stopping. The issue of keeping a copy of the best model throughout training and the issue of stopping training once performance starts degrading are completely unrelated.
Keeping the best model can be done very easily by defining a tf.estimator.BestExporter in your EvalSpec (snippet taken from the link):
serving_input_receiver_fn = ...  # define your serving_input_receiver_fn
exporter = tf.estimator.BestExporter(
    name="best_exporter",
    serving_input_receiver_fn=serving_input_receiver_fn,
    exports_to_keep=5)  # this will keep the 5 best checkpoints

eval_spec = [tf.estimator.EvalSpec(
    input_fn=eval_input_fn,
    steps=100,
    exporters=exporter,
    start_delay_secs=0,
    throttle_secs=5)]
If you don't know how to define the serving_input_fn, have a look here.
This allows you to keep the overall best 5 models you obtained, stored as SavedModels (which is the preferred way to store models at the moment).
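If you later want to load one of the exported SavedModels for inference, something along these lines should work in TF 1.x (the export path and feature key below are hypothetical; BestExporter writes timestamped subdirectories under <model_dir>/export/<exporter name>/):

from tensorflow.contrib import predictor

# Point at the export you want to serve
predict_fn = predictor.from_saved_model("model_dir/export/best_exporter/1556789012")
predictions = predict_fn({"input": input_batch})  # key depends on your serving_input_receiver_fn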
I am trying to run a tsne analysis on a square distance matrix. These are the commands I am using.
model = TSNE(n_components=2, perplexity=32, verbose=10, n_iter=1000, metric="precomputed")
embeddings = model.fit_transform(D)
This is the output I receive: [screenshot of the verbose output from the TSNE fit]
It looks like the program is running through 75 iterations and then calling it good and quitting. When I plot the data from the tsne it's basically just a single dense blob. Why is the program quitting early and how can I make it run longer?
It's quitting because the exit-condition is reached.
Interpreting the log, the exit condition is probably a metric on the gradient, called the gradient norm here. If needed, check out the basics of gradient descent to understand the intuition. As every iteration takes a step towards the negative of the gradient, tiny gradients will not do much to the objective (and will be interpreted as: we found a local/global minimum).
It looks like (still interpreting your log only):
if np.linalg.norm(gradient) < 1e-4:
return solution
There is no merit in doing further iterations for this parameterization of the optimization problem. The solution won't get better (in terms of minimization).
You can only try other parameters (resulting in other optimization-problems).
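If you still want it to run longer, you can loosen the stopping criteria through the TSNE constructor; a sketch, assuming a reasonably recent scikit-learn (check your version's docs for the exact parameter names and defaults):

model = TSNE(n_components=2, perplexity=32, verbose=10, metric="precomputed",
             n_iter=5000,                   # upper bound on iterations
             n_iter_without_progress=1000,  # be more patient before giving up
             min_grad_norm=1e-7)            # stricter gradient-norm exit threshold
embeddings = model.fit_transform(D)

Keep in mind, though, that as argued above, more iterations on the same problem will mostly reproduce the same solution; changing perplexity or the learning rate is more likely to change the embedding.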
I have a hierarchical logit that has observations over time. Following Carter 2010, I have included a time, time^2, and time^3 term. The model mixes using Metropolis or NUTS before I add the time variables. HamiltonianMC fails. NUTS and Metropolis also work with time. But NUTS and Metropolis fail with time^2 and time^3, but they fail differently and in a puzzling way. However, unlike in other models that fail for more obvious model specification reasons, ADVI still gives an estimate, (and ELBO is not inf).
NUTS either stalls early (last time it made it to 60 iterations), or progresses too quickly and returns an empty traceplot with an error about varnames.
Metropolis errors out immediately with a dimension mismatch error. It looks like the one in this github error, but I'm using a Bernoulli outcome, not a negative binomial. The end of the error looks like: ValueError: Input dimension mis-match. (input[0].shape[0] = 1, input[4].shape[0] = 18)
I get an empty trace when I try HamiltonianMC. It returns the starting values with no mixing
ADVI gives me a mean and a standard deviation.
I increased the ADVI iterations by a lot. It gave pretty close to the same starting points and NUTS still failed.
I double checked that the fix for the github issue is in place in the version of pymc3 I'm running. It is.
My intuition is that this has something to do with how huge the time^2 and time^3 variables get, since I'm looking over a large time-frame. Time^3 starts at 0 and goes to 64,000.
Here is what I've tried for sampling so far. Note that I have small sample sizes while testing, since it takes so long to run (if it finishes at all) and I'm just trying to get it to sample at all. Once I find one that works, I'll up the number of iterations.
with my_model:
    mu, sds, elbo = pm.variational.advi(n=500000, learning_rate=1e-1)
    print(mu['mu_b'])
    step = pm.NUTS(scaling=my_model.dict_to_array(sds)**2,
                   is_cov=True)
    my_trace = pm.sample(500,
                         step=step,
                         start=mu,
                         tune=100)
I've also done the above with tune=1000
I've also tried Metropolis and HamiltonianMC.
with my_model:
    my_trace = pm.sample(5000, step=pm.Metropolis())

with my_model:
    my_trace = pm.sample(5000, step=pm.HamiltonianMC())
Questions:
Is my intuition about the size and spread of the time variables reasonable?
Are there ways to sample square and cubed terms more effectively? I've searched, but can you perhaps point me to a resource on this so I can learn more about it?
What's up with Metropolis and the dimension mismatch error?
What's up with the empty trace plots for NUTS? Usually when it stalls, the trace up until the stall works.
Are there alternative ways to handle time that might be easier to sample?
I haven't posted a toy model, because it's hard to replicate without the data. I'll add a toy model once I replicate with simulated data. But the actual model is below:
with pm.Model() as my_model:
    mu_b = pm.Flat('mu_b')
    sig_b = pm.HalfCauchy('sig_b', beta=2.5)
    b_raw = pm.Normal('b_raw', mu=0, sd=1, shape=n_groups)
    b = pm.Deterministic('b', mu_b + sig_b*b_raw)
    t1 = pm.Normal('t1', mu=0, sd=100**2, shape=1)
    t2 = pm.Normal('t2', mu=0, sd=100**2, shape=1)
    t3 = pm.Normal('t3', mu=0, sd=100**2, shape=1)
    est = (b[data.group.values] * data.x.values) +\
          (t1*data.t.values) +\
          (t2*data.t2.values) +\
          (t3*data.t3.values)
    y = pm.Bernoulli('y', p=tt.nnet.sigmoid(est), observed=data.y)
BREAKTHROUGH 1: Metropolis error
Weird syntax issue. Theano seemed to be confused about a model with both constant and random effects. I created a constant column in the data equal to 0, data['c'] = 0, and used it as an index for the time, time^2 and time^3 effects, as follows:
est = (b[data.group.values] * data.x.values) +\
      (t1[data.c.values]*data.t.values) +\
      (t2[data.c.values]*data.t2.values) +\
      (t3[data.c.values]*data.t3.values)
I don't think this is the whole issue, but it's a step in the right direction. I bet this is why my asymmetric specification didn't work, and if so, I suspect it may sample better.
UPDATE: It sampled! Will now try some of the suggestions for making it easier on the sampler, including using the specification suggested here. But at least it's working!
Without having the dataset to play with it is hard to give a definite answer, but here is my best guess:
To me, it is a bit unexpected to hear about the third-order polynomial in there. I haven't read the paper, so I can't really comment on it, but I think this might be the reason for your problems. Even very small values for t3 will have a huge influence on the predictor. To keep this reasonable, I'd try to change the parametrization a bit: first make sure that your predictor is centered (something like data['t'] = data['t'] - data['t'].mean(), and after that define data.t2 and data.t3). Then try to set a more reasonable prior on t2 and t3. They should be pretty small, so maybe try something like
t1 = pm.Normal('t1',mu=0,sd=1,shape=1)
t2 = pm.Normal('t2',mu=0,sd=1,shape=1)
t2 = t2 / 100
t3 = pm.Normal('t3',mu=0,sd=1,shape=1)
t3 = t3 / 1000
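And for the centering step mentioned above, something along these lines (the column names mirror the data.t / data.t2 / data.t3 used in the model):

# Center (and optionally rescale) time before building the polynomial terms,
# so that t^2 and t^3 stay in a numerically reasonable range.
data['t'] = data['t'] - data['t'].mean()
data['t'] = data['t'] / data['t'].std()   # optional: put t on unit scale
data['t2'] = data['t'] ** 2
data['t3'] = data['t'] ** 3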
If you want to look at other models, you could try to model your predictor as a GaussianRandomWalk or a Gaussian Process.
Updating pymc3 to the latest release candidate should also help; the sampler was improved quite a bit.
Update: I just noticed you don't have an intercept term in your model. Unless there is a good reason for that, you probably want to add
intercept = pm.Flat('intercept')
est = (intercept
       + b[..] * data.x
       + ...)