Python Sklearn - Multidimensional Parameter Optimization

I am not too convinced by the parameter optimization classes sklearn provides; GridSearchCV (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) just loops over the parameters I pass in via param_grid, and RandomizedSearchCV (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) seems like a we-gave-up approach to me.
One of my approaches was to reuse the idea of adaptive integration of a function from numerical mathematics: in every iteration, reduce the search space of a parameter until a certain error threshold is reached. The biggest advantage of this method is that the error also shrinks in each iteration.
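Roughly, the idea looks like this for a single parameter (an untested sketch, not my actual code; the interval, shrink factor, scorer and model are just placeholders):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

low, high = 1e-3, 1e3                      # initial search interval for C
for _ in range(5):                         # shrink the interval a few times
    grid = np.logspace(np.log10(low), np.log10(high), 5)
    scores = [cross_val_score(LogisticRegression(C=c, max_iter=1000),
                              X, y, cv=3, scoring="roc_auc").mean()
              for c in grid]
    best = grid[int(np.argmax(scores))]
    low, high = best / 3.0, best * 3.0     # re-centre a narrower interval around the best value
print("refined C:", best)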
The problem is that certain parameter values may stay unexplored if you treat the different scores (precision, ROC-AUC, etc.) as functions. I still got good results in one case, but not so good in others.
What would be a good mathematical approach to optimize parameter values more efficiently than GridSearchCV and RandomizedSearchCV?

Related

How does optimization happen in the parameters of the algorithms in sci-kit learn library?

When machine learning is viewed mathematically, we have cost functions that we minimize to reduce prediction error, and we keep optimizing the parameters of the equations used in a particular algorithm.
I wonder where this optimization happens in the scikit-learn library.
As far as I know there is no single function for doing this job; rather, there are a bunch of algorithms exposed as functions.
Can someone please tell me how to optimize those parameters in scikit-learn? Is there a way to do it in the library, or is it only meant for learning purposes?
I looked at the library's logistic regression code but got nothing out of it.
Any effort is appreciated.
I got it.
GridSearchCV is the answer; that's what I was looking for.
I think it allows us to choose the values of alpha, C and the number of iterations, so it does not let us alter the weights directly, and I think that's fine; that's how we would assign values to those parameters after carrying out the same process independently.
This article helped me to understand it well.
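For reference, a minimal GridSearchCV sketch along those lines (the dataset, model and parameter values are just placeholders, not a recommendation):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.01, 0.1, 1, 10], "max_iter": [200, 500]}
search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)   # chosen hyperparameters
print(search.best_score_)    # mean cross-validated score of the best setting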

Differences between MiniBatchKMeans.fit and MiniBatchKMeans.partial_fit

I am interested in sklearn.cluster.MiniBatchKMeans as a way to use huge datasets. Anyway I am a bit confused about the difference between MiniBatchKMeans.partial_fit() and MiniBatchKMeans.fit().
Documentation about fit() states that:
Compute the centroids on X by chunking it into mini-batches.
while documentation about partial_fit() states that:
Update k means estimate on a single mini-batch X.
So, as I understand it, fit() splits the dataset into chunks of data with which it trains the k-means (I guess the batch_size argument of MiniBatchKMeans() refers to this), while partial_fit() uses all the data passed to it to update the centres. The term "update" may seem ambiguous as to whether an initial training (using fit()) should have been performed first, but judging from the example in the documentation this is not necessary (I can also use partial_fit() right from the start).
Is it true that partial_fit() will use all the data passed to it regardless of its size, or is the data size bound by the batch_size passed to the MiniBatchKMeans constructor? Also, if batch_size is set greater than the actual data size, is the result the same as the standard k-means algorithm? (I guess efficiency could differ in the latter case due to the different architectures.)
TL;DR
partial_fit is for online clustering, whereas fit is for offline clustering; however, I think MiniBatchKMeans's partial_fit method is a little rough.
Long explanation
I dug through old PRs from the repo and found this one; it seems to be the first commit of this implementation, and it mentions that this algorithm can implement the partial_fit method as an online clustering method (following the online API discussion).
So, like the BIRCH implementation, this algorithm uses fit for one-time offline clustering and partial_fit for online clustering.
However, I did some tests comparing the ARI of the resulting labels when using fit on the entire dataset versus using partial_fit and fit on chunks, and didn't seem to get anywhere, since the ARI results were very low (~0.5); and after changing the initialization, fit on chunks apparently beat partial_fit, which doesn't make sense. You can find my notebook here.
So my guess, based on this response in the PR:
I believe that this branch can and should be merged.
The online fitting API (partial_fit) probably needs to mature, but I
think that it is a bit out of scope of this work. The core
contribution, that is a mini-batch K-means, is nice, and does seem to
speed up things.
is that the implementation hasn't changed much since that PR and the partial_fit method is still a little rough. The implementation from 2011 and the current one (compared from the release tag) have changed, but both of them call the function _mini_batch_step once in partial_fit (without verbose info) and multiple times in fit (with verbose info).
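To make the two usages concrete, here is a small sketch (the dataset and chunking are illustrative):

import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10000, centers=5, random_state=0)

# Offline: fit() chunks X internally into mini-batches of size batch_size.
offline = MiniBatchKMeans(n_clusters=5, batch_size=256, random_state=0).fit(X)

# Online: each partial_fit() call treats whatever data it receives as one
# mini-batch and updates the current centres.
online = MiniBatchKMeans(n_clusters=5, random_state=0)
for chunk in np.array_split(X, 20):
    online.partial_fit(chunk)

print(offline.cluster_centers_)
print(online.cluster_centers_)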

Cross-validation in LightGBM

How are we supposed to use the dictionary output from lightgbm.cv to improve our predictions?
Here's an example - we train our cv model using the code below:
cv_mod = lgb.cv(params,
                d_train,
                500,
                nfold=10,
                early_stopping_rounds=25,
                stratified=True)
How can we use the parameters found at the best iteration of the above code to predict an output? In this case, cv_mod has no "predict" method like lightgbm.train, and the dictionary output from lightgbm.cv throws an error when used in lightgbm.train.predict(..., pred_parameters = cv_mod).
Am I missing an important transformation step?
In general, the purpose of CV is NOT to do hyperparameter optimisation; it is to evaluate the performance of the model-building procedure.
A basic train/test split is conceptually identical to a 1-fold CV (with a custom size of the split, in contrast to the 1/K test size in k-fold CV). The advantage of doing more splits (i.e. k>1 CV) is to get more information about the estimate of the generalisation error: you get the error estimate plus its statistical uncertainty. There is an excellent discussion on CrossValidated (start with the links added to the question, which cover the same question formulated in a different way). It covers nested cross-validation and is absolutely not straightforward, but if you wrap your head around the concept in general, it will help you in various non-trivial situations. The idea to take away is: the purpose of CV is to evaluate the performance of the model-building procedure.
Keeping that idea in mind, how does one approach hyperparameter estimation in general (not only in LightGBM)?
You want to train a model with a set of parameters on some data and evaluate each variant of the model on an independent (validation) set, then choose the best parameters as the variant that gives the best value of your chosen evaluation metric.
This can be done with a simple train/test split. But the evaluated performance, and thus the choice of optimal model parameters, might just be a fluctuation of that particular split.
Thus, you can evaluate each of those models in a more statistically robust way by averaging the evaluation over several train/test splits, i.e. k-fold CV.
Then you can go a step further and keep an additional hold-out set that was separated before the hyperparameter optimisation started; this way you can evaluate the chosen best model on that set to measure the final generalisation error. You can go even further and, instead of having a single test sample, use an outer CV loop, which brings us to nested cross-validation.
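In sklearn terms, that nesting can be sketched as follows (the model and grid are illustrative, not a recommendation): the inner GridSearchCV picks the hyperparameters, while the outer cross_val_score estimates the generalisation error of the whole procedure.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3)   # inner loop: tuning
outer_scores = cross_val_score(inner, X, y, cv=5)        # outer loop: evaluation
print(outer_scores.mean(), outer_scores.std())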
Technically, lightgbm.cv() only allows you to evaluate performance on a k-fold split with fixed model parameters. For hyperparameter tuning you will need to run it in a loop, providing different parameters and recording the averaged performance, and then choose the best parameter set after the loop is complete. This interface is different from sklearn, which provides you with complete functionality to do hyperparameter optimisation in a CV loop. Personally, I would recommend using the sklearn API of lightgbm. It is just a wrapper around the native lightgbm.train() functionality, thus it is not slower. But it allows you to use the full stack of the sklearn toolkit, which makes your life MUCH easier.
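A hedged sketch of that route (the dataset and parameter values are placeholders):

from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, random_state=0)

param_grid = {"num_leaves": [15, 31, 63], "learning_rate": [0.05, 0.1]}
search = GridSearchCV(LGBMClassifier(n_estimators=200), param_grid,
                      cv=5, scoring="roc_auc")
search.fit(X, y)

best_model = search.best_estimator_      # refit on all of X with the best params
probs = best_model.predict_proba(X)[:, 1]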
If you're happy with your CV results, you just use those parameters to call the 'lightgbm.train' method. Like #pho said, CV is usually just for param tuning. You don't use the actual CV object for predictions.
You should use CV for parameter optimization.
If your model performs well on all folds, use these parameters to train on the whole training set.
Then evaluate that model on the external test set.
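For example (a minimal sketch; the parameters below are placeholders for whatever your CV run settled on):

import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

params = {"objective": "binary", "learning_rate": 0.1, "num_leaves": 31}
booster = lgb.train(params, lgb.Dataset(X_train, label=y_train), num_boost_round=200)

# evaluate the final model on the external test set
print(roc_auc_score(y_test, booster.predict(X_test)))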

NN: outputting a probability density function instead of a single value

This might sound silly but I'm just wondering about the possibility of modifying a neural network to obtain a probability density function rather than a single value when you are trying to predict a scalar. I know that when you are trying to classify images or words you can get a probability for each class, so I'm thinking there might be a way to do something similar with a continuous value and plot it. (Similar to the posterior plot with bayesian optimisation)
Such details could be interesting when deploying a model for prediction and could provide more flexibility than a single value.
Does anyone know a way to obtain such an output?
Thanks!
OK, so I found a solution to this issue, though it adds a lot of overhead.
Initially I thought a Keras callback could be of use, and it did provide the flexibility I wanted (i.e. run only on the test data, or only on a subset, and not for every test), but it seems that callbacks are only given summary data from the logs.
So the first step was to create a custom metric that does the same calculation as any metric with the two arrays (the true values and the predicted values) and, once those calculations are done, outputs them to a file for later use, roughly along the lines sketched below.
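A rough, untested sketch of that dumping metric, assuming tf.keras (the file name and format are arbitrary):

import numpy as np
import tensorflow as tf

def dumping_metric(y_true, y_pred):
    def _dump(yt, yp):
        # append the raw (true, predicted) pairs to disk for later analysis
        with open("batch_predictions.txt", "ab") as f:
            np.savetxt(f, np.column_stack([yt.numpy(), yp.numpy()]))
        return np.float32(0.0)
    # piggy-back on the metric mechanism; the returned value itself is meaningless
    return tf.py_function(_dump, [y_true, y_pred], tf.float32)

# model.compile(optimizer="adam", loss="mse", metrics=[dumping_metric])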
Then, once we have a way to gather all the data for every sample, the next step was to implement a method that could give a good measure of error. I'm currently implementing a handful of methods, but the most fitting one seems to be Bayesian bootstrapping (user lmc2179 has a great Python implementation). I also implemented ensemble methods and Gaussian processes as alternatives or additional metrics, along with some other Bayesian methods.
I'll try to find out whether there are internals in Keras that are set during the training and testing phases, to see if I can set a trigger for my metric. The main issue with using all the data is that you get a lot of unreliable data points at the start, since the network is not yet optimized. Some data filtering could be useful to remove a good number of those points and improve the results of the error predictors.
I'll update if I find anything interesting.

Alternative to scipy.optimize.minimize constrained optimization?

I am interested in finding optimized parameters of a model (by minimizing the difference between the model's output and a known value). The parameters I am interested in have bounds, and they are also constrained by an inequality that looks like 1 - sum(x_par) >= 0, where x_par is a list of some of the parameters out of the total parameter list. I have used scipy.optimize.minimize on this problem with different methods (such as COBYLA and SLSQP), but the fitting performance of this function is quite poor and the error is generally above 50%.
I have noticed that scipy.optimize.curve_fit and scipy.optimize.differential_evolution work very well in terms of fitting the given values, but these functions do not allow constraints on parameters. I am looking for an alternative in Python that allows constraining parameters and does a better job of fitting the given curve/values than scipy.optimize.minimize.
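For reference, the kind of setup I mean looks roughly like this (the objective, data and parameter split below are placeholders, not my actual model):

import numpy as np
from scipy.optimize import minimize

def objective(params):
    # placeholder: squared error between a toy model and a known target value
    model_output = params[0] * 2.0 + params[1] ** 2 + params[2]
    return (model_output - 3.0) ** 2

bounds = [(0, 1), (0, 1), (0, 5)]                       # per-parameter bounds
constraints = [{"type": "ineq",                         # 1 - sum(x_par) >= 0
                "fun": lambda p: 1.0 - np.sum(p[:2])}]  # x_par = first two params

result = minimize(objective, x0=[0.3, 0.3, 1.0], method="SLSQP",
                  bounds=bounds, constraints=constraints)
print(result.x, result.fun)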
You might find lmfit useful. This module is a wrapper around many of the scipy.optimize routines (including leastsq, differential_evolution, and most of the scalar minimizers) that replaces all variables with Parameter objects which can be fixed or free, have bounds applied, or be constrained as mathematical expressions of other Parameters, all independent of the method used to solve the minimization problem. There is also a Model class to support many curve-fitting problems, and support for improved analysis of confidence intervals for parameters.
With some care, inequality constraints can be applied, as is discussed briefly at http://lmfit.github.io/lmfit-py/constraints.html#using-inequality-constraints.
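For what it's worth, a small untested sketch of how such a setup might look with lmfit (the model, the data, and the slack-parameter trick for the inequality are all illustrative):

import numpy as np
from lmfit import Parameters, minimize

def residual(params, x, data):
    v = params.valuesdict()
    model = v["a"] * x + v["b"] * np.exp(-x) + v["c"]
    return data - model

params = Parameters()
params.add("a", value=0.3, min=0, max=1)
params.add("slack", value=0.2, min=0)        # slack >= 0
params.add("b", expr="1 - a - slack")        # forces a + b <= 1
params.add("c", value=1.0)

x = np.linspace(0, 5, 50)
data = 0.4 * x + 0.5 * np.exp(-x) + 1.2      # synthetic "known values"

out = minimize(residual, params, args=(x, data))
print({name: p.value for name, p in out.params.items()})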
