Text classification - is it overfitting? How can I prove?

Text classification - is it overfitting? How can I prove? - python

I have a multi classification problem and my data involves sequence of letters. It is a labelled data (used label encoder to encode string labels to numeric). There could be partial strings for the same class. May strings match but some could be just slightly different.
I am preparing my data with k-mer and countvectoriser (fitted on train data and transformed train and test data). With the combination of kmer size and ngram sizes, the dimension (feature size) varies between 8000+ to 35000+. I do not think that there is test information leak at the training of the model.
I fit different algorithms on the train data and test to review the generalisation. The test scores (accuracy, f1-score, precision and recall) are coming pretty high (more than 99%). Even though this is testing, do you think the model could be overfitting due to high dimensionality (curse of dimensionality)? I understand that if training score is high and generalises poorly then its overfitting but here the test scores are very high. This is not models as different algorithms giving similar results, its certainly about the data.
If I apply PCA to get 10 components which covers 99% variance, the test score on testing is high too. If I use selectkfeatures to select just about 10 best features, then the scores come down.
Really looking for your thoughts on how I can prove that this is not overfitting? Should I always go for reduced features size (through selection or pca) with such high dimension size? Thanks.
Regards,
Vijay

If your test score is high, then below are the possibilities
Overlap in test and train data: This can happen if you have duplicate records and while splitting one fall into train and other into test
Data Leak: If the class label information is some how encoded in the features. This can be easily verified: if train score are almost 100% even with basic models. Check this resource for understand what is a data leak.
You really have succeeded in building a good model
I suggest check the above 2 possibilities first and then try out K-fold cross validation.

Related

Checking model overfit of doc2vec with infer_vector()

my aim is to create document embeddings from the column df["text"] as a first step and then as a second step plug them along with other variables into a XGBoost Regressor model in order to make predictions. This works very well for the train_df.
I am currently trying to evaluate my trained Doc2Vec model by inferring vectors with infer_vector() on the unseen test_df and then again make predictions with it.However, the results are super bad. I got a very large error (RMSE).
I assume, this means that Doc2Vec is massively overfitting?
I am actually not sure if this is the correct way to evaluate my doc2vec model (by infer_vector)?
What to do to prevent doc2vec from overfitting?
Please find my code below for infering vectors from a model:
vectors_test=[]
for i in range(0, len(test_df)):
vecs=model.infer_vector(tokenize(test_df["text"][i]))
vectors_test.append(vecs)
vectors_test= pd.DataFrame(vectors_test)
test_df = pd.concat([test_df, vectors_test], axis=1)
I then make predictions with my XGBoost model:
np.random.seed(0)
test_df= test_df.reindex(np.random.permutation(test_df.index))
y = test_df['target'].values
X = test_df.drop(['target'], axis=1).values
y_pred = mod.predict(X)
pred = pd.DataFrame()
pred["Prediction"] = y_pred
rmse = np.sqrt(mean_squared_error(y,y_pred))
print(rmse)
Please see also the training of my doc2vec model:
doc_tag = train_df.apply(lambda train_df: TaggedDocument(words=tokenize(train_df["text"]), tags= [train_df.Tag]), axis = 1)
# initializing model, building a vocabulary
model = Doc2Vec(dm=0, vector_size=200, min_count=1, window=10, workers= cores)
model.build_vocab([x for x in tqdm(doc_tag.values)])
# train model for 5 epochs
for epoch in range(5):
model.train(utils.shuffle([x for x in tqdm(doc_tag.values)]), total_examples=len(doc_tag.values), epochs=1)

Without knowing what your XGBoost model is being trained to predict, or more about the type/quantity of your training data for certain steps, it's hard to speculate why one particular set of inputs are performing poorly. (For example, it could equally be the XGBoost model's data, parameters, or training that's mismatched to the task.)
But, some observations:
You generally shouldn't be calling train() multiple times in your own loop. See My Doc2Vec code, after many loops of training, isn't giving good results. What might be wrong? for discussion of common problems here. (Yours isn't quite as stark, but the learning-rate isn't being handled properly in your 5 separate train()s - indeed there should even be some error in your log output.)
Similarly: it's often a bad idea to use a min_count so small as 1 in these kinds of models: such rare words, without enough varied examples to be truly understood, just inject idiosyncratic noise which dilutes the influence of other, surrounding tokens which are meaningful.
Most published work trains a Doc2Vec model for 10-20 epochs – you're only using 5. (And, for smaller datasets or smaller texts, often even more epochs help.) Inference will also default to the epochs configured when the model was created – here only 5 – but more epochs are often beneficial.
It's unclear the size of your training texts and their unique vocabulary, but Doc2Vec overfitting will be most likely if the model is relatively large – in terms of vector_size or total surviving vocabulary – compared to the training data. Then, the model has lots of opportunity to essentially 'memorize' idiosyncracies of the training set, instead of more-generalizable patterns that will still be useful for out-of-training data. (For example, min_count=1, if it's preserving many singleton words which appear in only one text each, gives the model lots of "nooks and crannies" in which to improve its training target results in ways unlikely to help on other examples.) If your training data is "small", you likely need to use a smaller vector_size and a larger min_count to avoid overfitting, and then perhaps more epochs to ensure adequate training.
infer_vector essentially ignores any words not in its vocabulary - so you should take a look at some of the specific texts in the set performing poorly, and check whether most of their words are present, or not. But note also: as Doc2Vec is an unsupervised method, a plausible case can be made for training it to learn textual patterns on all available data, including the texts in your 'test' set. Then, it is more likely to have some word data, top at least the min_count threshold, for words across all examples. (Of course the actual supervised predictor itself can only be fairly evaluated on test examples whose desired answers weren't provided during the predictor's training. But it still can receive its features from an unsupervised step that used all text data.)
a crude check of a Doc2Vec model for overfitting or other training problems (but not overall quality) is to re-infer doc-vectors from the same texts it was trained on, and checking the model's set of bulk-trained vectors (model.docvecs) for the nearest-neighbors to these re-inferred vectors. If the re-inferred vector's nearest neighbor isn't usually the same text's bulk-trained vector – or if more generally, re-inferring the same text multiple times doesn't yield vectors that are 'close' to each other – then something about the model training or inference is deficient: overfitting, or undertraining, or insufficient data, or unwise parameters.

Knowing best model in Neural Network

I trained a model using CNN,
The results are,
Accuracy
Loss
I read from fast.ai, the experts say, a good model has the val_loss is slightly greater than the loss.
My model is different in points, So, Can I take this model as good or I need to train again...
Thank you

it seems that maybe your validation set is "too easy" for the CNN you trained or doesn't represent your problem enough. it's difficult to say witch the amount of information you provided.
I would say, your validation set is not properly chosen. or you could try using crossvalidation to get more insights.

First of all as already pointed out I'd check your data setup: How many data points do the training and the test (validation) data sets contain? And have they been properly chosen?
Additionally, this is an indicator that your model might be underfitting the training data. Generally, you are looking for the "sweet spot" regarding the trade off between predicting your training data well and not overfitting (i.e. doing bad on the test data):
If accuracy on the training data is not high then you might be underfitting (left hand side of the green dashed line). And doing worse on the training data than on the test data is an indication for that (although this case is not being shown in the plot). Therefore, I'd check what happens to your accuracy if you increase the model complexity and fit it better to the training data (moving towards the 'sweet spot' in the illustrative graph).
In contrast if your test data accuracy was low you might be in an overfitting situation (right hand side the green dashed line).

How can I properly split imbalanced dataset to train and test set?

I have a flight delay dataset and try to split the set to train and test set before sampling. On-time cases are about 80% of total data and delayed cases are about 20% of that.
Normally in machine learning ratio of train and test set size is 8:2. But the data is too imbalanced. So considering extreme case, most of train data are on-time cases and most of test data are delayed cases and accuracy will be poor.
So my question is How can I properly split imbalanced dataset to train and test set??

Probably just by playing with ratio of train and test you might not get the correct prediction and results.
if you are working on imbalanced dataset, you should try re-sampling technique to get better results. In case of imbalanced datasets the classifier always "predicts" the most common class without performing any analysis of the features.
Also use different metric for performance measurement such as F1 Score etc in case of imbalanced data set
Please go through the below link, it will give you more clarity.
What is the correct procedure to split the Data sets for classification problem?
Cleveland heart disease dataset - can’t describe the class

Start from 50/50 and go on changing the sets as 60/40, 70/30, 80/20, 90/10. declare all the results and come to some conclusion. In one of my work on Flight delays prediction project, I used 60/40 database and got 86.8 % accuracy using MLP NN.

There are two approaches that you can take.
A simple one: no preprocessing of the dataset but careful sampling of the dataset so that both classes are represented in the same proportion in the test and train subsets. You can do it by splitting by class first and then randomly sampling from both sets.
import sklearn
XclassA = dataX[0] # TODO: change to split by class
XclassB = dataX[1]
YclassA = dataY[0]
YclassB = dataY[1]
XclassA_train, XclassA_test, YclassA_train, YclassA_test = sklearn.model_selection.train_test_split(XclassA, YclassA, test_size=0.2, random_state=42)
XclassB_train, XclassB_test, YclassB_train, YclassB_test = sklearn.model_selection.train_test_split(XclassB, YclassB, test_size=0.2, random_state=42)
Xclass_train = XclassA_train + XclassB_train
Yclass_train = YclassA_train + YclassB_train
A more involved, and arguably better one, you can try first to balance your dataset. For that you can use one of many techniques (under-, over-sampling, SMOTE, AdaSYN, Tomek links, etc.). I recommend you review the methods of imbalanced-learn package. Having done balancing you can use the ordinary test/train split using typical methods without any additional intermediary steps.
The second approach is better not only from the perspective of splitting the data but also from the speed and even ability to train a model (which for heavily imbalanced datasets is not guaranteed to work).

Imbalanced learning problem - out of sample vs validation

I am training on three classes with one dominant majority class of about 80% and the other two even. I am able to train a model using undersampling / oversampling techniques to get validation accuracy of 67% which would already be quite good for my purposes. The issue is that this performance is only present on the balanced validation data, once I test on out of sample with imbalanced data it seems to have picked up a bias towards even class predictions. I have also tried using weighted loss functions but also no joy on out of sample. Is there a good way to ensure the validation performance translates over? I have tried using auroc to validate the model successfully but again the strong performance is only present in the balanced validation data.
Methods of resampling I have tried: SMOTE oversampling and random undersampling.

If I understood correctly, may be you are looking for performance measurement and better classification results on imbalance datasets.
Alone measuring the performance using accuracy in case of imbalanced datasets usually high and misleading and minority class could be totally ignored Instead use f1-score, precision/recall score.
For my project work on imbalanced datasets, I have used SMOTE sampling methods along with the K-Fold cross validation.
Cross validation technique assures that model gets the correct patterns from the data, and it is not getting up too much noise.
References :
What is the correct procedure to split the Data sets for classification problem?

Identifying overfitting in a cross validated SVM when tuning parameters

I have an rbf SVM that I'm tuning with gridsearchcv. How do I tell if my good results are actually good results or whether they are overfitting?

Overfitting is generally associated with high variance, meaning that the model parameters that would result from being fitted to some realized data set have a high variance from data set to data set. You collected some data, fit some model, got some parameters ... you do it again and get new data and now your parameters are totally different.
One consequence of this is that in the presence of overfitting, usually the training error (the error from re-running the model directly on the data used to train it) will be very low, or at least low in contrast to the test error (running the model on some previously unused test data).
One diagnostic that is suggested by Andrew Ng is to separate some of your data into a testing set. Ideally this should have been done from the very beginning, so that happening to see the model fit results inclusive of this data would never have the chance to impact your decision. But you can also do it after the fact as long as you explain so in your model discussion.
With the test data, you want to compute the same error or loss score that you compute on the training data. If training error is very low, but testing error is unacceptably high, you probably have overfitting.
Further, you can vary the size of your test data and generate a diagnostic graph. Let's say that you randomly sample 5% of your data, then 10%, then 15% ... on up to 30%. This will give you six different data points showing the resulting training error and testing error.
As you increase the training set size (decrease testing set size), the shape of the two curves can give some insight.
The test error will be decreasing and the training error will be increasing. The two curves should flatten out and converge with some gap between them.
If that gap is large, you are likely dealing with overfitting, and it suggests to use a large training set and to try to collect more data if possible.
If the gap is small, or if the training error itself is already too large, it suggests model bias is the problem, and you should consider a different model class all together.
Note that in the above setting, you can also substitute a k-fold cross validation for the test set approach. Then, to generate a similar diagnostic curve, you should vary the number of folds (hence varying the size of the test sets). For a given value of k, then for each subset used for testing, the other (k-1) subsets are used for training error, and averaged over each way of assigning the folds. This gives you both a training error and testing error metric for a given choice of k. As k becomes larger, the training set sizes becomes bigger (for example, if k=10, then training errors are reported on 90% of the data) so again you can see how the scores vary as a function of training set size.
The downside is that CV scores are already expensive to compute, and repeated CV for many different values of k makes it even worse.
One other cause of overfitting can be too large of a feature space. In that case, you can try to look at importance scores of each of your features. If you prune out some of the least important features and then re-do the above overfitting diagnostic and observe improvement, it's also some evidence that the problem is overfitting and you may want to use a simpler set of features or a different model class.
On the other hand, if you still have high bias, it suggests the opposite: your model doesn't have enough feature space to adequately account for the variability of the data, so instead you may want to augment the model with even more features.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.