Decision tree overfitting test - Python

I am currently working with data that is easy to overfit, so I made a function that tests the roc_auc score for each depth, since I read in the sklearn docs that max_depth is usually the reason a tree overfits. But I am not sure if my reasoning is correct; here is a picture of my result:
I was also trying to use a post-pruning method, but my graph looked quite different from others I found on the internet, so I am not sure what it tells me.

The term you are looking for is cross-validation. The basic idea is simple: you split your dataset into training and validation (or test) sets. Then you train a model on the training set and test it on the validation set. If your model is overfitted, it will perform well on the training set but poorly on the validation set. In that case it is best to decrease model complexity or add so-called regularization (e.g. tree pruning). Perhaps the simplest way to perform cross-validation in scikit-learn is to use the cross_val_score function as described here.
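Applied to the decision-tree question above, a minimal sketch might look like this (X and y are placeholders for your feature matrix and labels, not names from the original post):

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# X, y: your feature matrix and labels (placeholders)
for depth in range(1, 21):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(tree, X, y, cv=5, scoring="roc_auc")
    print(depth, scores.mean(), scores.std())

If the cross-validated score stops improving (or starts dropping) beyond some depth, deeper trees are most likely overfitting.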
Note 1: In some contexts (e.g. in neural networks) there are both validation and test sets (in addition to the train set). I won't go into the details here, but don't be confused by these terms in different contexts.
Note 2: Cross-validation is such a standard thing that it even gave its name to another StackExchange site, Cross Validated, where you may get more answers about statistics. Another, perhaps even more appropriate, site has a self-explanatory name: Data Science.

Related

Confusion around the SKLearn GridSearchCV scoring parameter and using train test split

I'm a little bit confused about how GridSearchCV works with Train Test Split.
As far as I know, a paper that created models for the dataset I'm using evaluated them with roc-auc.
I'm trying to replicate what this paper did, at least as well as I can. From reading a few other posts here, I've gathered that running GridSearchCV on the entire dataset is prone to overfitting, so we should split the data into a training partition and a testing partition. Then, we should run GridSearchCV on the training partition with whatever model and parameters, fit it, and then get a score using the test part of the dataset we set aside.
Now where I'm confused is with GridSearchCV: as far as I understand, it gives us scores for each of the folds that the data is split into when doing the search for parameters, and using best_score_ we can pull the best of these scores. I don't understand what the scores represent and why you can pass in a scoring parameter to begin with, since the job of GridSearchCV is to find the best possible parameters anyway. (Perhaps I'm making a poor assumption here, but I'm assuming that there is an objective best set of parameters, regardless of scoring method.) What I figured was that I would find the best parameters with GridSearchCV, then use those parameters to create and fit a model, and finally use that model and the partition I saved for testing to score it with the roc-auc metric.
So in the end, does it matter (if at all) what scoring method I pass into GridSearchCV, since it will always look for the best set of parameters anyway, which I will then use to compute my final score on the testing partition?
This document may help.
Here you see that the scoring parameter allows you to use various metrics, such as roc_auc. See here for all of scikit-learn's metrics.
Optimizing over different metrics results in different optimal parameters. Just think about optimizing precision versus recall: optimizing precision leads to fewer false positives, while optimizing recall leads to fewer false negatives.
Also, in GridSearchCV the CV stands for cross-validated. Train/test splitting happens inside this function; it's taken care of for you. You only have to provide the splitter as an argument to GridSearchCV, for example cv=StratifiedKFold(n_splits=5, shuffle=True).
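A rough sketch of that workflow, assuming a binary-classification setting; X, y, the tree model and the parameter grid are placeholders, not taken from the question:

from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# hold out a test partition first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 4, 6, 8, None]},
                      scoring="roc_auc",
                      cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)   # best mean cross-validated roc_auc

# final evaluation on the held-out partition, with the same metric
y_score = search.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, y_score))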

Splitting the data in machine learning

In machine learning, why do we need to split the data? And why do we usually set the test size to 0.3 or 0.2? Can we set the size to 1? And why is 85% considered good accuracy?
PS. I'm a beginner please be easy on me ^^
So why do we split the data? We need some way to determine how well our classifier performs. One way to do this is to use the same data for training and testing. If we do this, though, we have no way of telling if our model is overfitting, i.e. just memorising features of our training set instead of learning some underlying representation. Instead, we hold out some data for testing that our model never sees and can be evaluated on. Sometimes we even split data into 3 parts: training, validation (a test set used while we're still choosing the parameters of our model), and testing (for the tuned model).
The test size is just the fraction of our data in the test set. If you set your test size to 1, that's your entire dataset, and there's nothing left to train on.
Why is 85% good for accuracy? It's just a heuristic. Say you're doing a binary classification task: always picking the most frequent class already gives you at least 50% accuracy. We might also be skeptical if a model has 100% accuracy, because humans don't do that well and your training data probably isn't 100% accurate, so 100% accuracy might indicate something is going wrong, like an overfit model.
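For example, holding out 30% of the data with scikit-learn's train_test_split (X and y are placeholders for your features and labels):

from sklearn.model_selection import train_test_split

# hold out 30% of the data; the model never sees it during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)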
This is a very basic question. I'd recommend you do some courses before asking questions here, as literally every good course on machine learning will explain this. I recommend Andrew Ng's courses; you can find them on Coursera, but the videos are also on YouTube.
To answer your question more directly, you use a train set to teach your model how to classify data, and you test your model with a "dev" or "test" set to see how well it has learned on new unseen data.
Think of it like teaching a kid to add by telling them 1+4=5, 5+3=8 and 2+7=9. If you then ask the kid "what's 2+6" (something they haven't seen before), you find out whether they have learned to add and not just memorised the examples you gave.
This video in particular answers the question of what size test/train sets to use.
https://www.youtube.com/watch?v=1waHlpKiNyY

Difference between training sets and validation sets?

I'm learning about machine learning, and I've often come across people separating their data into a 'training set' and a 'validation set.' I could never figure out why people never just used all of the data for training and then just used it again for validation. Is there a reason for this that I'm missing?
Think of it like this: you are going to take an exam and you are practicing hard with your practice materials. You don't know what you are going to be asked in the exam, right?
On the other hand, if you practice with the exam itself, when you take the exam you will know all the answers, so you don't even have to bother studying.
That's the case for your model: if you train it on both the train set and the test set, it will know all the answers beforehand. You need to give it something it has not seen, so it can show you whether it has actually learned anything.
Basically, you want the model to be trained using the train dataset; then, to check whether the hyper-parameter tuning is done right, you evaluate it on a separate portion of the data.
If this tuning were done against the test data directly, the chance of over-fitting to it would be high. To avoid that, you tune on the validation dataset and only measure your model's final performance against the test dataset.
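One common way to get all three sets is two consecutive splits with scikit-learn's train_test_split; the proportions below are only an example, and X and y are placeholders:

from sklearn.model_selection import train_test_split

# first carve off a test set that is only touched for the final evaluation
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# then split the remainder into train and validation for hyper-parameter tuning
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)
# result: roughly 60% train, 20% validation, 20% test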

Cross-validation in LightGBM

How are we supposed to use the dictionary output from lightgbm.cv to improve our predictions?
Here's an example - we train our cv model using the code below:
import lightgbm as lgb

# d_train is a lgb.Dataset built from the training data
cv_mod = lgb.cv(params,
                d_train,
                500,                        # num_boost_round
                nfold=10,
                early_stopping_rounds=25,
                stratified=True)
How can we use the parameters found from the best iteration of the above code to predict an output? In this case, cv_mod has no "predict" method like lightgbm.train, and the dictionary output from lightgbm.cv throws an error when used in lightgbm.train.predict(..., pred_parameters = cv_mod).
Am I missing an important transformation step?
In general, the purpose of CV is NOT to do hyperparameter optimisation. The purpose is to evaluate the performance of the model-building procedure.
A basic train/test split is conceptually identical to a 1-fold CV (with a custom split size, in contrast to the 1/K test size in k-fold CV). The advantage of doing more splits (i.e. k > 1 CV) is to get more information about the estimate of the generalisation error: you get the error estimate plus its statistical uncertainty. There is an excellent discussion on CrossValidated (start with the links added to the question, which cover the same question formulated in a different way). It covers nested cross-validation and is absolutely not straightforward, but if you wrap your head around the concept in general, it will help you in various non-trivial situations. The idea you have to take away is: the purpose of CV is to evaluate the performance of the model-building procedure.
Keeping that idea in mind, how does one approach hyperparameter estimation in general (not only in LightGBM)?
You want to train a model with a set of parameters on some data and evaluate each variation of the model on an independent (validation) set. Then you intend to choose the best parameters by choosing the variant that gives the best evaluation metric of your choice.
This can be done with a simple train/test split. But evaluated performance, and thus the choice of the optimal model parameters, might be just a fluctuation on a particular split.
Thus, you can evaluate each of those models in a more statistically robust way by averaging the evaluation over several train/test splits, i.e. k-fold CV.
Then you can go a step further and say that you have an additional hold-out set, separated before the hyperparameter optimisation was started. This way you can evaluate the chosen best model on that set to measure the final generalisation error. However, you can go even one step further and, instead of having a single test set, have an outer CV loop, which brings us to nested cross-validation.
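In scikit-learn terms, nested cross-validation boils down to a hyperparameter search (inner loop) wrapped in an outer evaluation loop; a minimal sketch, with estimator, param_grid, X and y as placeholders:

from sklearn.model_selection import GridSearchCV, cross_val_score

# inner loop: hyperparameter search (estimator and param_grid are placeholders)
inner_search = GridSearchCV(estimator, param_grid, cv=5)
# outer loop: estimates the generalisation error of the whole tuning procedure
outer_scores = cross_val_score(inner_search, X, y, cv=5)
print(outer_scores.mean(), outer_scores.std())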
Technically, lightgbm.cv() only allows you to evaluate performance on a k-fold split with fixed model parameters. For hyper-parameter tuning you will need to run it in a loop, providing different parameters and recording the averaged performance, so that you can choose the best parameter set after the loop is complete. This interface is different from sklearn, which provides you with complete functionality to do hyperparameter optimisation in a CV loop. Personally, I would recommend using the sklearn API of lightgbm. It is just a wrapper around the native lightgbm.train() functionality, so it is not slower, but it lets you use the full stack of sklearn tools, which makes your life MUCH easier.
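A sketch of that route, using the sklearn wrapper LGBMClassifier inside GridSearchCV; the parameter values and the names X_train, y_train are only illustrative, not from the question:

import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(lgb.LGBMClassifier(n_estimators=500),
                      param_grid={"num_leaves": [31, 63],
                                  "learning_rate": [0.05, 0.1]},
                      scoring="roc_auc",
                      cv=5)
search.fit(X_train, y_train)                 # placeholder training data
print(search.best_params_, search.best_score_)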
If you're happy with your CV results, you just use those parameters to call the 'lightgbm.train' method. Like #pho said, CV is usually just for param tuning. You don't use the actual CV object for predictions.
You should use CV for parameter optimization.
If your model performs well on all folds use these parameters to train on the whole training set.
Then evaluate that model on the external test set.
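With the native API that could look roughly like this, reusing the tuned params and the best iteration count found during CV (best_num_rounds and X_test are placeholder names):

import lightgbm as lgb

# d_train: lgb.Dataset built from the full training data, params: the tuned parameter dict
# best_num_rounds: the best iteration found during CV (placeholder name)
booster = lgb.train(params, d_train, num_boost_round=best_num_rounds)
y_pred = booster.predict(X_test)             # predictions to evaluate on the external test set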

How can the output of a model be displayed?

I am performing a machine learning task wherein I am using logistic regression for topic classification.
If this is my code:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

model = LogisticRegression()
model = model.fit(mat_tmp, label_tmp)            # mat_tmp: training features, label_tmp: training labels
y_train_pred = model.predict(mat_tmp_test)       # predictions for the test matrix
print(metrics.accuracy_score(label_tmp_test, y_train_pred))
Is there a way I can see what exactly is happening inside the model? For example, a working illustration of what my model is doing, like displaying 2-3 documents and how they are being classified?
In order to be fully aware of what is happening in your model, you must first take some time to study the logistic regression algorithm (eg. from lecture notes or Wikipedia). As with other supervised techniques, logistic regression has hyper-parameters and parameters. Hyper-parameters basically specify how your algorithm runs, which you must provide at initialisation (ie. before it sees any data). For example, you could have prior information about the distribution of classes, which then would be a hyper-parameter. Parameters are "learnt" from your data.
Once you understand the algorithm, the interesting question will be what the parameters of your model are (recall that these are learnt from the data). By visiting the documentation, you find in the attributes section that this classifier has 3 learned attributes, which you can access by their field names.
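For example, sticking to the variable names from the question above (this assumes model has already been fitted as shown there):

# learned parameters of the fitted model
print(model.classes_)      # class labels
print(model.coef_)         # one weight per feature (per class in the multi-class case)
print(model.intercept_)    # bias term(s)

# how a few individual test documents are classified
for i in range(3):
    probs = model.predict_proba(mat_tmp_test[i:i+1])[0]
    pred = model.predict(mat_tmp_test[i:i+1])[0]
    print("document", i, "predicted:", pred, "class probabilities:", probs)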
If you are not interested in such details, but only want to assess the accuracy of your classifier, a useful technique is cross-validation. You split your labeled data into k equally sized subsets and train your classifier using k-1 of them. Then you evaluate the trained classifier on the remaining subset and calculate the accuracy (i.e. what proportion of the data could be predicted properly). This method has its drawbacks, but proves to be very useful in general.
