Can someone explain why the random_state parameter affects the model so much?
I have a RandomForestClassifier model and want to set the random_state (for reproducibility purposes), but depending on the value I use I get very different values of my overall evaluation metric (F1 score).
For example, I fitted the same model with 100 different random_state values, and after training and testing the smallest F1 was 0.64516129 and the largest 0.808823529. That is a huge difference.
This behaviour also seems to make it very hard to compare two models.
Thoughts?
If the random_state affects your results, it means that your model has high variance. In the case of Random Forest, this simply means that the forest you use is too small and you should increase the number of trees (which, due to bagging, reduces variance). In scikit-learn this is controlled by the n_estimators parameter in the constructor.
Why does this happen? Each ML method tries to minimize the error, which from a mathematical perspective can usually be decomposed into bias and variance [+ noise] (see the bias-variance dilemma/tradeoff). Bias is simply how far from the true values your model ends up in expectation; this part of the error usually comes from prior assumptions, such as using a linear model for a nonlinear problem. Variance is how much your results differ when you train on different subsets of the data (or use different hyperparameters; in the case of randomized methods, the random seed is such a hyperparameter). Hyperparameters are set by us, while parameters are learnt by the model itself during training. Finally, noise is the irreducible error coming from the problem itself (or the data representation).
Thus, in your case you simply encountered a model with high variance. Decision trees are well known for their extremely high variance (and small bias), so to reduce variance Breiman proposed the specific bagging method known today as Random Forest. The larger the forest, the stronger the variance reduction. In particular, a forest with 1 tree has huge variance, while a forest of 1000 trees is nearly deterministic for problems of moderate size.
To sum up, what can you do?
Increase the number of trees - this has to work, and is a well understood and justified method (a minimal sketch follows this list).
Treat random_seed as a hyperparameter during your evaluation, because that is exactly what it is - a piece of meta-knowledge you need to fix beforehand if you do not wish to increase the size of the forest.
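Here is a minimal sketch of the first point, assuming your features X and binary labels y are already loaded (hypothetical names): with a bigger forest the spread of F1 across random_state values should shrink noticeably.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# X, y assumed to be your features and binary labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for n_trees in (10, 1000):
    scores = []
    for seed in range(20):
        clf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
        clf.fit(X_train, y_train)
        scores.append(f1_score(y_test, clf.predict(X_test)))
    print(n_trees, "trees: F1 spread =", max(scores) - min(scores))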
Related
For an exploratory semester project, I am trying to predict the outcome value of a quality control measurement using various measurements made during production. For the project I was testing different algorithms (LinearRegression, RandomForestRegressor, GradientBoostingRegressor, ...). I generally get rather low r2-values (around 0.3), which is probably due to the scattering of the feature values and not my real problem here.
Initially I have around 100 features, which I am trying to reduce using RFE with LinearRegression() as the estimator. Cross-validation indicates I should keep only 60 features. However, when I do so, the R2 value increases for some models. How is that possible? I was under the impression that adding variables to the model always increases R2, and thus that reducing the number of variables should lead to lower R2 values.
Can anyone comment on this or provide an explanation?
Thanks in advance.
It depends on whether you are using the testing or the training data to measure R2. R2 measures how much of the variance of the data your model captures. So if you increase the number of predictors, you are correct that you do a better job of predicting exactly where the training data lie, and thus your training R2 should increase (and conversely for decreasing the number of predictors).
However, if you increase the number of predictors too much, you can overfit the training data. This means the variance of the model is artificially high, so your predictions on the test set will begin to suffer. Therefore, by reducing the number of predictors you may actually do a better job of predicting the test-set data, and thus your test R2 can increase.
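A rough sketch of how to see this yourself, assuming X and y hold your production measurements and target (hypothetical names): compare train and test R2 while varying how many features RFE keeps.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (100, 80, 60, 40):
    selector = RFE(LinearRegression(), n_features_to_select=k).fit(X_train, y_train)
    model = LinearRegression().fit(selector.transform(X_train), y_train)
    print(k, "features",
          "- train R2:", model.score(selector.transform(X_train), y_train),
          "- test R2:", model.score(selector.transform(X_test), y_test))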
I have a university project and I'm given a dataset in which almost all features have a very weak correlation with the target (only 1 feature has a moderate correlation). Its distribution is not normal either. I already tried a simple linear regression model, which underfitted; then I applied a simple random forest regressor, which overfitted; and when I tuned the random forest regressor with RandomizedSearchCV it took far too long. Is there any way to get a decent model from a not-so-good dataset without underfitting or overfitting, or is it just not possible at all?
Well, to be blunt, if you could fit a model without underfitting or overfitting you would have solved AI completely.
Some suggestions, though:
Overfitting on random forests
Personally, I'd go down this route first, since you mention that your data is not strongly correlated. It's typically easier to fix overfitting than underfitting, so that helps too. (A short sketch of these settings follows the list.)
Try looking at your tree outputs. If you are using Python, scikit-learn's export_graphviz can be helpful.
Try reducing the maximum depth of the trees.
Try increasing the minimum number of samples a node must have in order to split (or similarly, the minimum number of samples a leaf should have).
Try increasing the number of trees in the RF.
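As a hedged sketch, the three knobs above map to these scikit-learn parameters; the actual values are placeholders to tune on your own data:

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=500,      # more trees -> lower variance
    max_depth=8,           # shallower trees overfit less
    min_samples_split=10,  # a node needs at least 10 samples to be split
    min_samples_leaf=5,    # every leaf keeps at least 5 samples
    random_state=0,
)
rf.fit(X_train, y_train)   # X_train, y_train assumed to exist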
Underfitting on linear regression
Add more features. If you have variables a, b, ... etc., adding their polynomial features, i.e. a^2, a^3 ... b^2, b^3 ... etc., may help. If you add enough polynomial features you should be able to overfit - although a good fit on the training set (RMSE) does not necessarily mean a good fit on the test set. (A sketch follows this list.)
Try plotting some of the variables against the value to predict (y). Perhaps you may be able to see a non-linear pattern (i.e. a logarithmic relationship).
Do you know anything about the data? Perhaps a variable that is the multiple, or the division between two variables may be a good indicator.
If you are regularizing your regression (or if the software is applying regularization automatically), try reducing the regularization parameter.
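A minimal sketch of the polynomial-features idea, paired with a regularized linear model so you can also dial the regularization strength up or down; the degree and alpha values are placeholders:

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # adds a^2, a*b, b^2, ...
    StandardScaler(),
    Ridge(alpha=1.0),  # lower alpha = weaker regularization if you still underfit
)
model.fit(X_train, y_train)        # X_train, y_train assumed to exist
print("test R2:", model.score(X_test, y_test))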
When I train a classification model using lightgbm, I usually use validation set and early stopping to determine the number of iterations.
Now I want to combine training and validation set to train a model (so I have more training examples), and use the model to predict the test data, should I change the number of iterations derived from the validation process?
Thanks!
As you said in your comment, this is not comparable to the number of epochs in deep learning, because deep learning training is usually stochastic.
With LGBM, all parameters and features being equal, adding 10% to 15% more training points should leave the trees looking alike: since you have more information your split values will be better, but it is unlikely to change your model drastically (this is less true if you use parameters such as bagging_fraction, or if the added points come from a different distribution).
I have seen people multiply the number of iterations by 1.1 (can't find my sources, sorry). Intuitively it makes sense to add some trees, since you are potentially adding information. Experimentally this value worked well, but the optimal value will depend on your model and data.
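A sketch of that workflow (not a verified recipe), assuming NumPy arrays and a recent lightgbm where early stopping is passed as a callback: find the best iteration on the validation split, then retrain on train + validation with a ~1.1x factor.

import numpy as np
import lightgbm as lgb

model = lgb.LGBMClassifier(n_estimators=5000, learning_rate=0.05)
model.fit(X_train, y_train,
          eval_set=[(X_val, y_val)],
          callbacks=[lgb.early_stopping(stopping_rounds=50)])
best_iter = model.best_iteration_

final = lgb.LGBMClassifier(n_estimators=int(np.ceil(1.1 * best_iter)), learning_rate=0.05)
final.fit(np.vstack([X_train, X_val]), np.concatenate([y_train, y_val]))
predictions = final.predict(X_test)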
For a similar problem in deep learning with Keras, I do it by using an early stopper and cross-validation with train and validation data, letting the model optimize itself against the validation data during training.
After each training run I test the model on the test data and examine the mean accuracies. In the meantime, after each training run I save the stopped_epoch from the EarlyStopping callback. If the CV scores are satisfactory, I take the mean of the stopped epochs, do a full training (including all the data I have) for that mean number of epochs, and save the model.
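A rough sketch of that procedure, assuming a hypothetical build_model() helper that returns a compiled Keras model and NumPy arrays X, y:

import numpy as np
from sklearn.model_selection import KFold
from tensorflow.keras.callbacks import EarlyStopping

stopped_epochs = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    es = EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True)
    model = build_model()          # hypothetical helper returning a compiled model
    model.fit(X[train_idx], y[train_idx],
              validation_data=(X[val_idx], y[val_idx]),
              epochs=500, callbacks=[es], verbose=0)
    stopped_epochs.append(es.stopped_epoch)

final_epochs = max(1, int(np.mean(stopped_epochs)))
final_model = build_model()
final_model.fit(X, y, epochs=final_epochs, verbose=0)   # full training on all data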
I'm not aware of a well-established rule of thumb for such an estimate. As Florian has pointed out, people sometimes rescale the number of iterations obtained from early stopping by a factor. If I remember correctly, the factor typically assumes a linear dependence of the optimal number of trees on the data size, i.e. in 10-fold CV this would be a rescaling factor of 1.1. But there is no solid justification for this. As Florian also pointed out, the dependence around the optimum is typically reasonably flat, so a few trees more or less will not have a dramatic effect.
Two suggestions:
Do k-fold validation instead of a single train-validation split. This will allow you to evaluate how stable the estimate of the optimal number of trees is. If it fluctuates a lot between folds, do not rely on such an estimate :) (see the sketch after this list)
Fix the size of the validation sample and re-train your model with early stopping on a gradually increasing training set. This will allow you to evaluate how the number of trees depends on the sample size and extrapolate it to the full sample size.
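A sketch of the first suggestion, assuming NumPy arrays X, y and a recent lightgbm: record the best iteration found in each fold and look at how much it varies.

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import KFold

best_iters = []
for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    m = lgb.LGBMClassifier(n_estimators=5000, learning_rate=0.05)
    m.fit(X[tr], y[tr], eval_set=[(X[va], y[va])],
          callbacks=[lgb.early_stopping(stopping_rounds=50)])
    best_iters.append(m.best_iteration_)

print("best iterations per fold:", best_iters,
      "mean:", np.mean(best_iters), "std:", np.std(best_iters))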
I'm creating a model to perform Logistic regression on a dataset using Python. This is my code:
from sklearn import linear_model
my_classifier2 = linear_model.LogisticRegression(solver='lbfgs', max_iter=10000)
Now, according to the sklearn doc page, max_iter is the maximum number of iterations taken for the solvers to converge. How do I specifically state that I need 'N' iterations?
Any kind of help would be really appreciated.
I'm not sure, but do you want to know the optimal number of iterations for your model? If so, you are better off using GridSearchCV, which can tune hyperparameters like max_iter (a complete sketch follows the steps below).
Briefly,
Split your data into two groups, train/test data, with train_test_split or KFold, which can be imported from sklearn
Set your parameter grid, for instance para = [{'max_iter': [1, 10, 100, 1000]}]
Create the search instance, for example clf = GridSearchCV(LogisticRegression(), param_grid=para, cv=5, scoring='f1') (use a classification metric here, not 'r2')
Fit it on the train data like this: clf.fit(x_train, y_train)
You can also fetch the best number of iterations with RandomizedSearchCV or BayesianOptimization.
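Putting the steps together as a runnable sketch (X and y stand in for your data; the grid values are just examples):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

para = [{'max_iter': [100, 1000, 10000]}]
clf = GridSearchCV(LogisticRegression(solver='lbfgs'), param_grid=para, cv=5, scoring='f1')
clf.fit(x_train, y_train)
print(clf.best_params_, clf.score(x_test, y_test))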
Regarding the GridSearchCV of the max_iter parameter: the fitted LogisticRegression models have an attribute n_iter_, so you can discover the exact max_iter needed for a given sample size and set of features:
n_iter_: ndarray of shape (n_classes,) or (1, )
Actual number of iterations for all classes. If binary or multinomial, it
returns only 1 element. For liblinear solver, only the maximum number of
iteration across all classes is given.
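For example (a quick check, assuming x_train and y_train already exist), you can see how many iterations lbfgs actually needed:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(solver='lbfgs', max_iter=10000).fit(x_train, y_train)
print(clf.n_iter_)   # e.g. array([153]) would mean max_iter=10000 was far more than needed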
Scanning very short intervals, like 1 by 1, is a waste of resources that could be spent on more important LogisticRegression fit parameters, such as the choice of solver itself, its regularization penalty, and the inverse regularization strength C, all of which contribute to faster convergence within a given max_iter.
Setting a very high max_iter can also be a waste of resources if you haven't previously done at least minimal feature preprocessing: feature scaling, and perhaps imputation, outlier clipping and dimensionality reduction (e.g. PCA).
Things can get worse: a tuned max_iter may be fine for a given sample size but not for a bigger one, for instance if you are building a cross-validated learning curve, which by the way is imperative for optimal machine learning.
It gets even worse if you increase the sample size in a pipeline that generates feature vectors, such as n-grams (NLP): more rows will generate more (sparse) features for the LogisticRegression classification.
I think it's important to check whether the different solvers converge or not for a given sample size, set of generated features and max_iter.
Methods that help achieve faster convergence, and may eventually remove the need to increase max_iter, are (a minimal pipeline sketch follows below):
Feature scaling
Dimensionality Reduction (e.g. PCA) of scaled features
There's a nice sklearn example demonstrating the importance of feature scaling.
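As a hedged sketch of those two aids combined in one Pipeline (n_components and max_iter are placeholder values; x_train, y_train are assumed to exist):

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=0.95)),   # keep enough components for 95% of the variance
    ('logreg', LogisticRegression(solver='lbfgs', max_iter=1000)),
])
pipe.fit(x_train, y_train)
print(pipe.named_steps['logreg'].n_iter_)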
I have a set of data in a .tsv file available here. I have written several classifiers to decide whether a given website is ephemeral or evergreen.
Now, I want to make them better. I know from speaking with people that my classifier is 'overfitting' the data; what I am looking for is a solid way to prove this so that the next time I write a classifier I will be able to run a test and see if I am overfitting or underfitting.
What is the best way of doing this? I am open to all suggestion!
I've spent literally weeks googling this topic and found no canonical or trusted ways to do this effectively, so any response will be appreciated. I will be putting a bounty on this question.
Edit:
Let's assume my classifier spits out a .tsv containing:
the website UID<tab>the likelihood it is to be ephemeral or evergreen, 0 being ephemeral, 1 being evergreen<tab>whether the page is ephemeral or evergreen
The simplest way to check your classifier's "efficiency" is to perform a cross-validation (a scikit-learn equivalent is sketched after the steps):
Take your data, let's call it X
Split X into K batches of equal sizes
For each i=1 to K:
Train your classifier on all batches but i'th
Test on i'th
Return the average result
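The same procedure expressed with scikit-learn's cross_val_score (a sketch; swap in your own classifier and scoring metric, and X, y are assumed to be loaded):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

scores = cross_val_score(RandomForestClassifier(n_estimators=300, random_state=0),
                         X, y, cv=5, scoring='f1')
print(scores.mean(), scores.std())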
One more important aspect: if your classifier uses any parameters, constants, thresholds, etc. which are not trained but rather given by the user, you cannot just select the ones giving the best results in the above procedure. This has to be somehow automated inside the "Train your classifier on all batches but i'th" step. In other words, you cannot use the testing data to fit any parameters of your model. Once this is done, there are four possible outcomes (a quick way to measure them is sketched after the list):
Training error is low but much lower than testing error - overfitting
Both errors are low - ok
Both errors are high - underfitting
Training error is high but testing is low - error in implementation or very small dataset
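One way to obtain the two numbers behind those four outcomes, assuming X, y and any scikit-learn classifier, is to compare the mean training score with the mean validation score across folds:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

clf = RandomForestClassifier(n_estimators=300, random_state=0)
res = cross_validate(clf, X, y, cv=5, scoring='accuracy', return_train_score=True)
print("train accuracy:", res['train_score'].mean(),
      "validation accuracy:", res['test_score'].mean())
# large gap -> overfitting; both low -> underfitting; both high -> looks ok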
There are many ways that people try to handle overfitting:
Cross-validation, you might also see it mentioned as x-validation
see lejlot's post for details
choose a simpler model
linear classifiers have a high bias because the model must be linear, but correspondingly low variance in the optimal solution. This means that you wouldn't expect to see much difference in the final model across different large random training samples.
Regularization is a common practice to combat overfitting.
It is generally done by adding a term to the minimization objective.
Typically this term is the sum of squares of the model's weights, because it is easy to differentiate.
Generally there is a constant C associated with the regularization term. Tuning this constant will increase or decrease the effect of regularization. A high weight applied to regularization generally helps with overfitting. C should always be greater than or equal to zero. (Note: some packages, such as scikit-learn, apply 1/C as the regularization weight; in that case, the closer C gets to zero, the greater the weight applied to regularization. A sketch of tuning C in scikit-learn appears at the end of this answer.)
Regardless of the specifics, regularization works by reducing the variance of the model, biasing it towards solutions with small weights (a small value of the regularization term).
Finally, boosting is a method of training that mysteriously/magically does not overfit. Not sure if anyone has discovered why, but it is a process of combining high-bias, low-variance simple learners into a high-variance, low-bias model. It's pretty slick.
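As for the regularization point above, here is a hedged sketch of tuning the constant in scikit-learn's LogisticRegression, where C is the inverse of the regularization weight (so smaller C means stronger regularization); the grid values are placeholders:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(LogisticRegression(penalty='l2', solver='lbfgs', max_iter=1000),
                    {'C': [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)        # X, y assumed to be your features and labels
print(grid.best_params_)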