I'm learning about machine learning, and I've often come across people separating their data into a 'training set' and a 'validation set.' I could never figure out why people don't just use all of the data for training and then reuse it for validation. Is there a reason for this that I'm missing?
Think of it like this: you are going to take an exam, and you practice hard with your practice materials. You don't know what you are going to be asked in the exam, right?
On the other hand, if you practice with the exam itself, you will already know all the answers when you take it, so you don't even have to bother studying.
That's the case for your model: if you train it on both the train set and the test set, it will already know all the answers beforehand. You need to give it something it has not seen, so that it has to actually deduce the answers for you.
Basically, you want the model to be trained on the train dataset only. To check whether the hyper-parameter tuning is done right, you then evaluate it on a separate portion of the dataset.
If this were done on the test data directly, the chance of over-fitting to the test set would be high. To avoid this, you tune against the validation set and only measure the final model's performance on the test set.
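For illustration, here is a minimal sketch of that workflow in scikit-learn. The dataset, the classifier, the C values, and the 60/20/20 proportions are arbitrary choices for the example, not a prescription:

# Minimal sketch of a train/validation/test workflow (illustrative only).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split off a held-out test set, then carve a validation set out of
# the remaining data (roughly 60% train / 20% validation / 20% test).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

# Tune the hyper-parameter against the validation set only.
best_score, best_C = -1.0, None
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score, best_C = score, C

# Touch the test set once, at the very end, with the chosen model.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
print("validation accuracy:", best_score)
print("test accuracy:", final_model.score(X_test, y_test))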
I've divided the data into three sets.
After this, I'm trying to train an SVM on it like this:
I am not sure if it is the correct way of doing it or not, as the difference between the scores I am getting and the ones mentioned in the papers is huge.
The end goal I want to achieve is to get accuracy, precision, and recall for all three sets.
If you are getting unexpectedly high scores, it may be due to an unnoticed data leak. It all depends on what your data looks like. For example, by default train_test_split has shuffle=True; if the data is time-sensitive, that means the model has already peeked into the future, hence the high accuracy.
Try with:
train_test_split(..., shuffle=False)
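As a rough sketch of what that could look like end to end, with the metrics you want on all three sets. The synthetic data and the 60/20/20 chronological split here are assumptions; swap in your own X, y and your papers' exact split:

# Sketch: unshuffled (chronological) train/validation/test split for
# time-sensitive data, then accuracy/precision/recall on all three sets.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Replace this synthetic data with your own features X and labels y.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Two unshuffled splits: roughly 60% train, 20% validation, 20% test, in time order.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, shuffle=False)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, shuffle=False)

model = SVC(kernel="rbf").fit(X_train, y_train)

for name, X_part, y_part in [("train", X_train, y_train),
                             ("validation", X_val, y_val),
                             ("test", X_test, y_test)]:
    pred = model.predict(X_part)
    print(name,
          accuracy_score(y_part, pred),
          precision_score(y_part, pred, average="weighted", zero_division=0),
          recall_score(y_part, pred, average="weighted", zero_division=0))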
I am working on wastewater data. The data is collected every 5 minutes. This is the sample data.
The thresholds of the individual parameters are provided. My question is: what kind of models should I go for to classify the water as usable or not usable, and also to output the anomaly that makes it unusable (if possible, since it is a combination of the variables)? The yes/no column does not exist yet and will be provided to me.
The other question I have is: how do I keep the model running, given that the data is collected every 5 minutes?
Your data and use case seem like a good fit for a decision tree classifier. Decision trees are easy to train and interpret (which is one of your requirements, since you want to know why a given sample was classified as usable or not usable), do not require large amounts of labeled data, can be trained and used for prediction on most hardware, and are well suited to structured data with no missing values and low dimensionality. They also work well without normalizing your variables.
Scikit learn is super mature and easy to use, so you should be able to get something working without too much trouble.
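For instance, a minimal sketch with scikit-learn could look like the following. The file name, the column names ("ph", "turbidity", "ammonia"), and the "usable" label are hypothetical placeholders; substitute your real parameters once the yes/no column is available:

# Sketch of a decision tree classifier for the wastewater readings.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("wastewater.csv")          # placeholder for your 5-minute readings
features = ["ph", "turbidity", "ammonia"]   # hypothetical parameter columns
X, y = df[features], df["usable"]           # "usable" = the yes/no column to come

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

print("test accuracy:", tree.score(X_test, y_test))
# The printed rules show which parameter thresholds drove each decision,
# which gives a rough answer to *why* a sample was flagged as not usable.
print(export_text(tree, feature_names=features))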
As regards timing, I'm not sure how you or your employee will be taking samples, so I can't say for certain. If you will be receiving and reading samples at that rate, using your model to label the data should not be a problem, but I'm not sure I've understood your situation correctly.
Note that Stack Overflow is aimed at questions of the form "here's my code, how do I fix this?", and not so much at general questions such as this one. There are other Stack Exchange sites specifically dedicated to statistics and data science. If you don't find what you need here, maybe you can try those other sites!
In machine learning, why do we need to split the data? And why do we set the test size to 0.3 or 0.2 (usually)? Can we set the size to 1? And also, why is 85% considered good for accuracy?
PS. I'm a beginner please be easy on me ^^
So why do we split the data? We need some way to determine how well our classifier performs. One way to do this is to use the same data for training and testing. If we do this, though, we have no way of telling if our model is overfitting, where our model just memorizes features of our training set instead of learning some underlying representation. Instead we hold out some data for testing that our model never sees and can be evaluated on. Sometimes we even split the data into 3 parts - training, validation (a test set while we're still choosing the parameters of our model), and testing (for the tuned model).
The test size is just the fraction of our data in the test set. If you set your test size to 1, that's your entire dataset, and there's nothing left to train on.
Why is 85% good for accuracy? It's just a heuristic. Say you're doing a binary classification task: always predicting the most frequent class already gives you at least 50% accuracy. At the other end, we might be skeptical of a model with 100% accuracy, because humans don't do that well, and your training labels probably aren't 100% accurate either, so 100% accuracy might indicate something is going wrong, like an overfit model.
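To make that concrete, here is a small sketch on a toy dataset (the dataset and the logistic regression model are arbitrary choices) comparing a real classifier against the "always predict the most frequent class" baseline:

# Sketch: real model vs. most-frequent-class baseline on a held-out test set.
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# test_size=0.2 keeps 80% for training; test_size=1.0 would leave nothing to train on.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

print("baseline accuracy:", baseline.score(X_test, y_test))  # roughly the majority-class share
print("model accuracy:   ", model.score(X_test, y_test))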
This is a very basic question. I'd recommend you do some courses before asking questions here, as literally every good course on machine learning will explain this. I recommend Andrew Ng's courses; you can find them on Coursera, but the videos are also on YouTube.
To answer your question more directly, you use a train set to teach your model how to classify data, and you test your model with a "dev" or "test" set to see how well it has learned on new unseen data.
Think of it like teaching a kid to add by telling them 1+4=5, 5+3=8, and 2+7=9. If you then ask the kid "what's 2+6?", something you never showed them, you find out whether they have actually learned to add and not just memorised the examples that you gave.
This video in particular answers the question of what size test/train sets to use.
https://www.youtube.com/watch?v=1waHlpKiNyY
I am currently working with data that is easy to overfit, so I made a function that tests the roc_auc score for each depth, since I read in the sklearn docs that max_depth is usually the reason a tree overfits. But I am not sure if my thinking is correct here; here is a picture of my result:
I was also trying to use a post-pruning method, but my graph looked quite different from others I found on the internet, so I am not sure what it tells me.
The term you are looking for is cross-validation. The basic idea is simple: you split your dataset into training and validation (or testing) sets. Then you train a model on the training set and test it on the validation set. If your model is overfitted, it will perform well on the training set but poorly on the validation set. In this case it's best to decrease model complexity or add so-called regularization (e.g. tree pruning). Perhaps the simplest way to perform cross-validation in SciKit Learn is to use the cross_val_score function as described here.
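As a rough sketch of what that could look like for your case (tree classifier, ROC AUC scoring per depth as in your question; the synthetic dataset is just a stand-in for your own data):

# Sketch: 5-fold cross-validated roc_auc for a range of max_depth values.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in; replace with your own X and y.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(tree, X, y, cv=5, scoring="roc_auc")
    print("max_depth=%d  mean AUC=%.3f  std=%.3f" % (depth, scores.mean(), scores.std()))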
Note 1: In some contexts (e.g. in neural networks) there are both validation and test sets (in addition to the training set). I won't go into detail here, but don't be confused by these terms in different contexts.
Note 2: Cross-validation is such a standard thing that it even gave its name to another StackExchange site - Cross Validated, where you may get more answers about statistics. Another, and perhaps even more appropriate, site has a self-explanatory name - Data Science.
I trained a model using a CNN.
The results are the two plots below, Accuracy and Loss:
I read on fast.ai that, according to the experts, a good model has a val_loss slightly greater than the training loss.
My model is different in this respect. So, can I consider this model good, or do I need to train it again?
Thank you
It seems that maybe your validation set is "too easy" for the CNN you trained, or doesn't represent your problem well enough. It's difficult to say with the amount of information you provided.
I would say your validation set is not properly chosen; or you could try using cross-validation to get more insight.
First of all, as already pointed out, I'd check your data setup: how many data points do the training and the test (validation) data sets contain? And have they been properly chosen?
Additionally, this is an indicator that your model might be underfitting the training data. Generally, you are looking for the "sweet spot" in the trade-off between predicting your training data well and not overfitting (i.e. doing badly on the test data):
If accuracy on the training data is not high, you might be underfitting (left-hand side of the green dashed line). Doing worse on the training data than on the test data is an indication of that (although this case is not shown in the plot). Therefore, I'd check what happens to your accuracy if you increase the model complexity and fit the training data better (moving towards the 'sweet spot' in the illustrative graph).
In contrast, if your test data accuracy were low, you might be in an overfitting situation (right-hand side of the green dashed line).
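To make the "sweet spot" idea concrete, here is a generic scikit-learn sketch (not your CNN setup, and on synthetic data) that prints training vs. cross-validated accuracy as model complexity grows; low training accuracy suggests underfitting, a widening gap suggests overfitting:

# Sketch: training vs. validation accuracy as model complexity increases.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Larger gamma = more complex decision boundary for an RBF SVM.
param_range = np.logspace(-4, 1, 6)
train_scores, val_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=param_range,
    cv=5, scoring="accuracy")

for g, tr, va in zip(param_range, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print("gamma=%g  train=%.3f  validation=%.3f" % (g, tr, va))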