Splitting the data in machine learning - python

In machine learning, why do we need to split the data? And why do we set the test size to 0.3 or 0.2 (usually)? Can we set the size to 1? And also, why is 85% considered good for accuracy?
PS. I'm a beginner please be easy on me ^^

So why do we split the data? We need some way to determine how well our classifier performs. One way to do this is to use the same data for training and testing. If we do this, though, we have no way of telling whether our model is overfitting, i.e. just memorizing features of our training set instead of learning some underlying representation. Instead, we hold out some data for testing that our model never sees during training and can be evaluated on. Sometimes we even split the data into 3 parts: training, validation (a test set used while we're still choosing the parameters of our model), and testing (for the tuned model).
The test size is just the fraction of our data that goes into the test set. If you set your test size to 1, that's your entire dataset, and there's nothing left to train on.
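For concreteness, here is a minimal sketch of the scikit-learn call those 0.2 / 0.3 test sizes usually refer to (I'm assuming train_test_split, since the question doesn't name the function); setting test_size to 1.0 would leave nothing to train on, as described above.

```python
# Minimal train/test split sketch, assuming scikit-learn's train_test_split.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # illustrative dataset

# Hold out 20% of the rows for testing; the remaining 80% is used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_train), len(X_test))  # 120 training rows, 30 test rows
```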
Why is 85% good for accuracy? It's just a heuristic. Say you're doing a binary classification task: always picking the more frequent class already gives you at least 50% accuracy, so raw accuracy only means something relative to that kind of baseline. We might also be skeptical if a model has 100% accuracy, because humans don't do that well and your training data probably isn't 100% accurate, so 100% accuracy might indicate something is going wrong, like an overfit model.
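To make the "most frequent class" baseline concrete, here is a hedged sketch using scikit-learn's DummyClassifier (the dataset is just an illustrative choice, not from the question):

```python
# Baseline accuracy from always predicting the most frequent class.
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("baseline accuracy:", baseline.score(X_test, y_test))

# Any real model should beat this number, which is why a figure like 85%
# only means something relative to such a baseline.
```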

This is a very basic question. I'd recommend you do some courses before asking questions here, as literally every good course on machine learning will explain this. I recommend Andrew Ng's courses; you can find them on Coursera, and the videos are also on YouTube.
To answer your question more directly, you use a train set to teach your model how to classify data, and you test your model with a "dev" or "test" set to see how well it has learned on new unseen data.
Think of it like teaching a kid to add by telling them 1+4=5, 5+3=8 and 2+7=9. If you then ask the kid a sum they haven't seen, like "what's 2+6", you find out whether they have actually learned to add, rather than just memorised the examples you gave.
This video in particular answers the question of what size test/train sets to use.
https://www.youtube.com/watch?v=1waHlpKiNyY

Related

Is it necessary to mitigate class imbalance problem in multiclass text classification?

I am performing multi-class text classification using BERT in Python. The dataset that I am using for retraining my model is highly imbalanced. Now, I am very clear that class imbalance leads to a poor model and that one should balance the training set by undersampling, oversampling, etc. before model training.
However, it is also a fact that the distribution of the training set should be similar to the distribution of the production data.
Now, if I am sure that the data thrown at me in the production environment will also be imbalanced, i.e., that the samples to be classified will be more likely to belong to some classes than to others, should I balance my training set?
OR
Should I keep the training set as it is, since I know that its distribution is similar to the distribution of the data that I will encounter in production?
Please give me some ideas, or provide some blogs or papers for understanding this problem.
Class imbalance is not a problem by itself; the problem is that having too few minority-class samples makes it harder to describe that class's statistical distribution, which is especially true for high-dimensional data (and BERT embeddings have 768 dimensions, IIRC).
Additionally, the logistic function tends to underestimate the probability of rare events (see e.g. https://gking.harvard.edu/files/gking/files/0s.pdf for the mechanics), which can be offset by selecting a classification threshold as well as by resampling.
There are quite a few discussions on Cross Validated regarding this (like https://stats.stackexchange.com/questions/357466). TL;DR:
While having too few samples of a class may degrade prediction quality, resampling is not guaranteed to give an overall improvement; at the least, there is no universal recipe for a perfect resampling proportion, so you'll have to test it out for yourself.
However, real-life tasks often weigh classification errors unequally: resampling may help improve a certain class's metrics at the cost of overall accuracy. The same applies to classification-threshold selection, however (sketched below).
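As a concrete illustration of the threshold-selection point (this sketch is mine, not from the linked discussions; the synthetic data and logistic-regression model are placeholders), one common recipe is to pick the threshold that maximises F1 on a validation set instead of using the default 0.5:

```python
# Tuning a classification threshold on a validation set (illustrative sketch).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (~95% / 5%) just for illustration.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_val)[:, 1]  # predicted probability of the rare class

# Pick the threshold that maximises F1 on the validation set.
precision, recall, thresholds = precision_recall_curve(y_val, proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best_threshold = thresholds[np.argmax(f1[:-1])]  # last P/R point has no threshold

y_pred = (proba >= best_threshold).astype(int)
```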
This depends on the goal of your classification:
Do you want a high probability that a random sample is classified correctly? -> Do not balance your training set.
Do you want a high probability that a random sample from a rare class is classified correctly? -> Balance your training set, or apply class weighting during training that increases the weights of the rare classes (see the sketch after this answer).
For example, in web applications seen by clients it is important that most samples are classified correctly, disregarding rare classes, whereas in the case of anomaly detection/classification it is very important that the rare classes are classified correctly.
Keep in mind that a model trained on a highly imbalanced dataset tends to always predict the majority class, so increasing the number (or the weights) of rare-class samples can be a good idea, even without perfectly balancing the training set.
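To illustrate the weighting option mentioned above (my own sketch on synthetic data, assuming a scikit-learn-style classifier):

```python
# Class weighting instead of resampling (illustrative sketch).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Option 1: let the library derive weights inversely proportional to class frequency.
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)

# Option 2: compute the weights explicitly, e.g. to pass them to another framework.
weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), weights)))  # the rare class gets the larger weight
```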
P(label | sample) is not the same as P(label).
P(label | sample) is your training goal.
In the case of gradient-based learning with mini-batches on models with a large parameter space, rare labels have a small footprint on the model training. So your model fits P(label).
To avoid fitting to P(label), you can balance the batches.
Across the batches of an epoch, the data then looks like it has an up-sampled minority class. The goal is to get a better loss function whose gradients move the parameters toward a better classification goal.
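A minimal sketch of what "balance batches" can look like in practice, assuming a PyTorch-style data pipeline (the answer itself doesn't name a framework, and the tensors below are placeholders):

```python
# Balancing mini-batches with a weighted sampler (illustrative PyTorch sketch).
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

X = torch.randn(1000, 16)                                  # placeholder features
y = torch.cat([torch.zeros(950), torch.ones(50)]).long()   # 95% / 5% labels

# Per-sample weight = 1 / frequency of that sample's class.
class_counts = torch.bincount(y).float()
sample_weights = 1.0 / class_counts[y]

sampler = WeightedRandomSampler(sample_weights, num_samples=len(y), replacement=True)
loader = DataLoader(TensorDataset(X, y), batch_size=32, sampler=sampler)

# Each batch drawn from `loader` now contains roughly balanced classes, so over
# an epoch the data looks like an up-sampled minority class.
```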
UPDATE
I don't have any proof to show here, and it is perhaps not an entirely accurate statement. With enough training data (relative to the complexity of the features) and enough training steps, you may not need balancing. But most language tasks are quite complex and there is not enough data for training; that was the situation I imagined in the statements above.

Decision tree overfit test

I am currently working with data that is easy to overfit, so I made a function that tests the roc_auc score for each depth, as I read in the sklearn documentation that max_depth is usually the reason a tree overfits. But I am not sure if my thinking is correct. Here is a picture of my result:
I was also trying to use a post-pruning method, but my graph looked quite different from others I found on the internet, so I am not sure what it tells me.
The term you are looking for is cross-validation. The basic idea is simple: you split your dataset into training and validation (or test) sets. Then you train a model on the training set and test it on the validation set. If your model is overfitted, it will perform well on the training set but poorly on the validation set. In this case it's best to decrease model complexity or add so-called regularization (e.g. tree pruning). Perhaps the simplest way to perform cross-validation in scikit-learn is to use the cross_val_score function as described here.
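As a sketch of what that looks like for the depth question above (assuming scikit-learn; the synthetic data is just a stand-in for your own X and y):

```python
# Cross-validated ROC AUC for several tree depths (illustrative sketch).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(tree, X, y, cv=5, scoring="roc_auc")
    print(f"max_depth={depth:2d}  mean ROC AUC={scores.mean():.3f} (+/- {scores.std():.3f})")

# If the cross-validated score stops improving (or drops) as depth grows while the
# training score keeps rising, deeper trees are overfitting.
```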
Note 1: In some contexts (e.g. in neural networks) there are both validation and test sets (in addition to the train set). I won't go into the details here, but don't be confused by these terms in different contexts.
Note 2: Cross-validation is such a standard thing that it even gave its name to another Stack Exchange site, Cross Validated, where you may get more answers about statistics. Another, and perhaps even more appropriate, site has a self-explanatory name: Data Science.

Knowing best model in Neural Network

I trained a model using a CNN.
The results are the accuracy and loss curves (plots not reproduced here).
I read on fast.ai that the experts say a good model has a val_loss slightly greater than the training loss.
My model differs on these points, so can I take this model as good, or do I need to train it again?
Thank you
It seems that maybe your validation set is "too easy" for the CNN you trained, or doesn't represent your problem well enough. It's difficult to say with the amount of information you provided.
I would say your validation set is not properly chosen. You could also try using cross-validation to get more insights.
First of all, as already pointed out, I'd check your data setup: how many data points do the training and test (validation) sets contain? And have they been chosen properly?
Additionally, this is an indicator that your model might be underfitting the training data. Generally, you are looking for the "sweet spot" in the trade-off between predicting your training data well and not overfitting (i.e. doing badly on the test data):
If accuracy on the training data is not high, then you might be underfitting (left-hand side of the green dashed line). Doing worse on the training data than on the test data is an indication of that (although this case is not shown in the plot). Therefore, I'd check what happens to your accuracy if you increase the model complexity and fit it better to the training data (moving towards the 'sweet spot' in the illustrative graph).
In contrast, if your test-data accuracy were low, you might be in an overfitting situation (right-hand side of the green dashed line).
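For completeness, a minimal sketch of that train-vs-test comparison (the model and data here are my own placeholders, not the asker's CNN):

```python
# Comparing training and validation accuracy to diagnose under/overfitting.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print(f"train={model.score(X_train, y_train):.3f}  val={model.score(X_val, y_val):.3f}")
# train accuracy low (or below validation) -> likely underfitting: increase capacity.
# train accuracy far above validation      -> likely overfitting: reduce capacity or regularise.
```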

Machine Learning - Classification or Clustering

I am new to machine learning and have a problem I want to solve, and I'd like to see if anyone has ideas on what type of algorithm would be best to use. I am not looking for code, but rather a process.
Problem: I am classifying people into 2 categories: high risk and low risk. (This is a very basic starting point, and I will expand as I learn how to classify in more detail.)
Each person has 11 variables I am looking at, and each variable has a binary value (0 for no, 1 for yes). The variables are things like married, gun_owner, home_owner, etc. So I gather each person can have 2^11 = 2048 different combinations of these variables.
I have a data set that has this information and then the result (whether or not they committed a crime). I figured this data would be used for training and then the algorithm can make predictions on high risk individuals.
Does anyone have any ideas for what would be the best algorithm? Since there are so many variables, I am having trouble figuring out what may work best.
This is a binary classification problem, with each input a binary string of length 11. There are many algorithms for this problem. The simplest one is the naive Bayes model (https://en.wikipedia.org/wiki/Naive_Bayes_classifier). You could also try some linear classifiers such as logistic regression or SVM. Both work well for linearly separable data and binary classification.
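Although the asker wants a process rather than code, a tiny sketch may make the suggested workflow concrete (the random data below stands in for the real dataset, and the models follow the suggestions above):

```python
# Trying naive Bayes and logistic regression on 11 binary features (illustrative).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 11))   # 11 yes/no variables per person
y = rng.integers(0, 2, size=500)         # 0 = low risk, 1 = high risk (labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (BernoulliNB(), LogisticRegression(max_iter=1000)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```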
It seems like you want to classify people based on a few features. It looks like a simple binary classification problem. However, it is not very clear whether the data you have is labelled or not.
So the first question is: in your dataset, do you know which person is 'high risk' and which person is 'low risk'? If you have that information, you can use a whole lot of machine learning models for this classification task.
However, if the labels ('high risk' or 'low risk') are not present, you cannot do that. Then you have to think about some unsupervised learning methods (clustering). Hope this answers your question.

Difference between training sets and validation sets?

I'm learning about machine learning, and I've often come across people separating their data into a 'training set' and a 'validation set.' I could never figure out why people never just used all of the data for training and then just used it again for validation. Is there a reason for this that I'm missing?
Think of it like this: you are going to take an exam, and you are practicing hard with your practice materials. You don't know what you are going to be asked in the exam, right?
On the other hand, if you practice with the exam itself, then when you take the exam you will know all the answers, so you don't even have to bother studying.
That's the case for your model: if you train your model on both the train set and the test set, your model will know all the answers beforehand. You need to give it something it has not seen, so you can check whether it has actually learned anything.
Basically, you want the model to be trained using the train dataset; then, in order to test whether the hyper-parameter tuning was done right, you test it out on a held-out portion of the dataset.
If this were done on the test data directly, the chance of over-fitting to the test set would be high. To avoid this, you tune against the validation dataset and only measure your model's final performance against the test dataset.
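A minimal sketch of that three-way split, assuming scikit-learn (the 60/20/20 proportions are just an illustrative choice):

```python
# Train / validation / test split via two calls to train_test_split.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve off 20% as the final test set ...
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# ... then split the remainder into training and validation sets (0.25 * 0.8 = 0.2).
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Fit on X_train, tune hyper-parameters against X_val, and touch X_test only once at the end.
```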
