My aim is to create document embeddings from the column df["text"] as a first step, and then as a second step plug them, along with other variables, into an XGBoost regressor to make predictions. This works very well for train_df.
I am currently trying to evaluate my trained Doc2Vec model by inferring vectors with infer_vector() on the unseen test_df and then making predictions with them. However, the results are very bad: I get a very large error (RMSE).
I assume this means that Doc2Vec is massively overfitting?
I am actually not sure whether inferring vectors with infer_vector() is the correct way to evaluate my Doc2Vec model.
What can I do to prevent Doc2Vec from overfitting?
Please find my code below for inferring vectors from the model:
vectors_test = []
for i in range(0, len(test_df)):
    vecs = model.infer_vector(tokenize(test_df["text"][i]))
    vectors_test.append(vecs)
vectors_test = pd.DataFrame(vectors_test)
test_df = pd.concat([test_df, vectors_test], axis=1)
I then make predictions with my XGBoost model:
np.random.seed(0)
test_df= test_df.reindex(np.random.permutation(test_df.index))
y = test_df['target'].values
X = test_df.drop(['target'], axis=1).values
y_pred = mod.predict(X)
pred = pd.DataFrame()
pred["Prediction"] = y_pred
rmse = np.sqrt(mean_squared_error(y,y_pred))
print(rmse)
Please see also the training of my doc2vec model:
doc_tag = train_df.apply(lambda train_df: TaggedDocument(words=tokenize(train_df["text"]), tags= [train_df.Tag]), axis = 1)
# initializing model, building a vocabulary
model = Doc2Vec(dm=0, vector_size=200, min_count=1, window=10, workers= cores)
model.build_vocab([x for x in tqdm(doc_tag.values)])
# train model for 5 epochs
for epoch in range(5):
    model.train(utils.shuffle([x for x in tqdm(doc_tag.values)]), total_examples=len(doc_tag.values), epochs=1)
Without knowing what your XGBoost model is being trained to predict, or more about the type/quantity of your training data for certain steps, it's hard to speculate why one particular set of inputs is performing poorly. (For example, it could equally be the XGBoost model's data, parameters, or training that's mismatched to the task.)
But, some observations:
You generally shouldn't be calling train() multiple times in your own loop. See "My Doc2Vec code, after many loops of training, isn't giving good results. What might be wrong?" for discussion of common problems here. (Yours isn't quite as stark, but the learning-rate isn't being handled properly in your 5 separate train() calls - indeed there should even be some error in your log output.)
Similarly: it's often a bad idea to use a min_count as small as 1 in these kinds of models: such rare words, without enough varied examples to be truly understood, just inject idiosyncratic noise which dilutes the influence of other, surrounding tokens which are meaningful.
Most published work trains a Doc2Vec model for 10-20 epochs – you're only using 5. (And, for smaller datasets or smaller texts, often even more epochs help.) Inference will also default to the epochs configured when the model was created – here only 5 – but more epochs are often beneficial.
The size of your training texts and their unique vocabulary is unclear, but Doc2Vec overfitting will be most likely if the model is relatively large – in terms of vector_size or total surviving vocabulary – compared to the training data. Then, the model has lots of opportunity to essentially 'memorize' idiosyncrasies of the training set, instead of more-generalizable patterns that will still be useful for out-of-training data. (For example, min_count=1, if it's preserving many singleton words which appear in only one text each, gives the model lots of "nooks and crannies" in which to improve its training target results in ways unlikely to help on other examples.) If your training data is "small", you likely need to use a smaller vector_size and a larger min_count to avoid overfitting, and then perhaps more epochs to ensure adequate training.
infer_vector essentially ignores any words not in its vocabulary - so you should take a look at some of the specific texts in the set performing poorly, and check whether most of their words are present, or not. But note also: as Doc2Vec is an unsupervised method, a plausible case can be made for training it to learn textual patterns on all available data, including the texts in your 'test' set. Then, it is more likely to have some word data, enough to meet the min_count threshold, for words across all examples. (Of course the actual supervised predictor itself can only be fairly evaluated on test examples whose desired answers weren't provided during the predictor's training. But it still can receive its features from an unsupervised step that used all text data.)
A crude check of a Doc2Vec model for overfitting or other training problems (but not overall quality) is to re-infer doc-vectors from the same texts it was trained on, and check the model's set of bulk-trained vectors (model.docvecs) for the nearest-neighbors to these re-inferred vectors. If the re-inferred vector's nearest neighbor isn't usually the same text's bulk-trained vector – or if more generally, re-inferring the same text multiple times doesn't yield vectors that are 'close' to each other – then something about the model training or inference is deficient: overfitting, or undertraining, or insufficient data, or unwise parameters.
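The two checks described above (how much of a text survives the vocabulary filter, and whether re-inferred vectors land next to their own bulk-trained vectors) can be sketched without any library. This is only an illustration under stated assumptions: `vocab_coverage` and `self_recall` are hypothetical helper names, and the vectors are assumed to be available as plain dicts keyed by document tag. With a real model, the vocabulary would come from the trained model (for instance the keys of `model.wv.key_to_index` in recent gensim versions) and the re-inferred vectors from `infer_vector()`.

```python
import math


def vocab_coverage(tokens, vocab):
    """Fraction of tokens that the model's vocabulary knows."""
    if not tokens:
        return 0.0
    known = sum(1 for tok in tokens if tok in vocab)
    return known / len(tokens)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def self_recall(trained, inferred):
    """Share of documents whose re-inferred vector is closest to its own trained vector.

    trained:  dict tag -> bulk-trained vector
    inferred: dict tag -> vector re-inferred from the same text
    """
    if not inferred:
        return 0.0
    hits = 0
    for tag, vec in inferred.items():
        nearest = max(trained, key=lambda t: cosine(trained[t], vec))
        if nearest == tag:
            hits += 1
    return hits / len(inferred)


# Toy data: hypothetical 2-d vectors standing in for real doc-vectors.
trained = {"doc0": [1.0, 0.0], "doc1": [0.0, 1.0], "doc2": [-1.0, 0.0]}
inferred = {"doc0": [0.9, 0.1], "doc1": [0.2, 0.8], "doc2": [0.1, 1.0]}

coverage = vocab_coverage(["good", "movie", "zzyzx", "plot"], {"good", "movie", "plot"})
recall = self_recall(trained, inferred)
print(coverage, recall)
```

A self-recall close to 1.0 is what a healthy model would be expected to show on its own training texts; a low value, or low coverage on the poorly predicted test texts, points at the training or vocabulary problems listed above.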
Related
I am working on a Sentiment Analysis problem. I am using Gensim's Word2Vec to vectorize my data in the following way:
# PREPROCESSING THE DATA
# SPLITTING THE DATA
from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y = train_test_split(x, y, test_size = 0.2, random_state = 69, stratify = y)
train_x2 = train_x['review'].to_list()
test_x2 = test_x['review'].to_list()
# CONVERT TRAIN DATA INTO A NESTED LIST, AS WORD2VEC EXPECTS A LIST OF TOKEN LISTS
train_x3 = [nltk.word_tokenize(k) for k in train_x2]
test_x3 = [nltk.word_tokenize(k) for k in test_x2]
# TRAIN THE MODEL ON TRAIN SET
from gensim.models import Word2Vec
model = Word2Vec(train_x3, min_count = 1)
key_index = model.wv.key_to_index
# MAKE A DICT
we_dict = {word:model.wv[word] for word in key_index}
# CONVERT TO DATAFRAME
import pandas as pd
new = pd.DataFrame.from_dict(we_dict)
The new dataframe is the vectorized form of the train data. Now how do I do the same process for the test data? I can't pass the whole corpus (train+test) to the Word2Vec instance as it might lead to data leakage. Should I simply pass the test list to another instance of the model as:
model = Word2Vec(test_x3, min_count = 1)
I don't think this would be the correct way. Any help is appreciated!
PS: I am not using the pretrained word2vec in an LSTM model. What I am doing is training the Word2Vec on the data that I have and then feeding it to an ML algorithm like RF or LGBM. Hence I need to vectorize the test data separately.
Note that because word2vec is an unsupervised algorithm, it can sometimes be defensible to use all available texts to train it. That includes texts with known labels that you're withholding from other supervised-classification steps as test/validation records.
You just make sure the labels themselves aren't in the training data, but still use the bulk unlabeled text for further unsupervised improvement of the raw word-vectors. Those vectors, influenced by all the input text (but none of the known-answer labels) are then used for enhanced feature-modeling of the texts, as input to later supervised label-aware steps.
(Whether this is OK for your project may depend on what future performance you want your various accuracy/etc. evaluation measures to reasonably estimate. Is it new situations where everything must always be trained from scratch, and where relevant raw text and labels are both scarce as training data? Or situations where the corpus always grows, and text is always plentiful even if labels are expensive to acquire, or where any actual deployed classifiers will be able to leverage other unlabeled texts before committing to a prediction?)
But note also: word-vectors are only comparison-compatible with each other when trained together, into a shared space. (Or, made compatible via other less-common post-training alignment steps.) There's no single right place for any word's vector, just a good relative position, with regard to everything trained in the same session – which used randomization in both initialization and training, so even repeated runs on the same training data can yield end models of approximately-equivalent usefulness with wildly-different word-coordinates.
So, when withholding your test-set texts from initial word2vec training, you should never train a separate word2vec model on just the test texts, but rather use the frozen word2vec model from the training data to look up vectors for the words in the test texts.
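Reusing the frozen training-time word-vectors for the test texts can be sketched as follows. Averaging the word-vectors of each text is an assumption made here for illustration (the answer does not prescribe how to pool them), `mean_vector` is a hypothetical helper, and the tiny `we_dict` stands in for the word-to-vector dict built from the trained model in the question.

```python
def mean_vector(tokens, we_dict, dim):
    """Average the vectors of in-vocabulary tokens; all zeros if none are known."""
    known = [we_dict[tok] for tok in tokens if tok in we_dict]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical 3-d word-vectors, as if learned from the training texts only.
we_dict = {
    "great": [1.0, 0.0, 2.0],
    "movie": [0.0, 1.0, 0.0],
    "boring": [-1.0, 0.0, 1.0],
}

train_tokens = [["great", "movie"], ["boring", "movie"]]
test_tokens = [["great", "unseenword"], ["unseenword"]]

# The same frozen lookup is applied to both splits; nothing is re-trained on test data.
train_features = [mean_vector(doc, we_dict, 3) for doc in train_tokens]
test_features = [mean_vector(doc, we_dict, 3) for doc in test_tokens]
print(test_features)
```

Because both splits go through the same lookup, the resulting features live in one shared space, which is what a downstream classifier such as RF or LGBM needs. Words seen only in the test texts simply contribute nothing.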
Separately: min_count=1 is almost always a bad idea for word2vec models, & if you're tempted to do so, you may have far too little data for such a data-hungry algorithm to show its true value. (If using it on the datasets where it really shines, you should be more often raising that threshold above its default – discarding more rare words – than lowering it to save every rare, hard-to-model-well word.)
I am using a pre-trained language model for binary classification. I fine-tune the model by training on data from my downstream task. The results are good: almost 98% F-measure.
However, when I remove a specific group of similar sentences from the training data and add them to my test data, the classifier fails to predict their class. For example, in a sentiment analysis task:
"I love the movie more specifically the acting was great"
I removed from the training data all sentences containing the words "more specifically" and, surprisingly, in the test set they were all misclassified, so the precision decreased by a huge amount.
Any ideas on how I can further fine-tune/improve my model to work better on text unseen in training, to avoid the problem I described above? (Of course without training the model on sentences containing the words "more specifically".)
Note: I observed the same performance regardless of the language model in use (BERT, RoBERTa etc).
I think you might have the problem of overfitting, i.e. your model is focussing on too specific features for it to generalize well. This can lead to results such as yours, where some obscure part of the sentence is the main factor for a correct classification.
There are multiple ways to solve this (see here). One of which is cross-validation where you rotate your validation set, another is to use dropout layers, yet another is to not let your model train too long, which can also lead to overfitting.
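Of those remedies, not training for too long is the easiest to sketch in isolation, in the form of early stopping on a validation loss. The snippet below is a generic illustration that is not tied to any particular fine-tuning library: `best_epoch` is a hypothetical helper and the validation losses are made-up numbers.

```python
def best_epoch(val_losses, patience=2):
    """Return the index of the epoch whose weights should be kept.

    Training is considered stopped once the validation loss has failed to
    improve for `patience` consecutive epochs.
    """
    best_idx = 0
    best_loss = float("inf")
    waited = 0
    for idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss, waited = idx, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_idx


# Made-up validation losses: improvement stalls after the third epoch (index 2).
losses = [0.60, 0.45, 0.40, 0.42, 0.47, 0.30]
stop_at = best_epoch(losses, patience=2)
print(stop_at)
```

In a real training loop the losses would arrive one epoch at a time, and the checkpoint from the returned epoch would be the one to keep.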
I have a multi-class classification problem and my data consists of sequences of letters. The data is labelled (I used a label encoder to encode the string labels as numbers). There could be partial strings for the same class. Many strings match, but some could be just slightly different.
I am preparing my data with k-mer and countvectoriser (fitted on train data and transformed train and test data). With the combination of kmer size and ngram sizes, the dimension (feature size) varies between 8000+ to 35000+. I do not think that there is test information leak at the training of the model.
I fit different algorithms on the train data and evaluate on the test data to review generalisation. The test scores (accuracy, F1-score, precision and recall) are coming out very high (more than 99%). Even though these are test scores, do you think the model could be overfitting due to high dimensionality (curse of dimensionality)? I understand that if the training score is high and the model generalises poorly then it's overfitting, but here the test scores are very high. This is not down to the model, as different algorithms give similar results; it's certainly about the data.
If I apply PCA to get 10 components which covers 99% variance, the test score on testing is high too. If I use selectkfeatures to select just about 10 best features, then the scores come down.
Really looking for your thoughts on how I can prove that this is not overfitting? Should I always go for reduced features size (through selection or pca) with such high dimension size? Thanks.
Regards,
Vijay
If your test score is high, then the possibilities are:
Overlap between test and train data: this can happen if you have duplicate records and, while splitting, one falls into train and the other into test.
Data leak: the class label information is somehow encoded in the features. This can be easily verified: suspect a leak if train scores are almost 100% even with basic models. Check this resource to understand what a data leak is.
You really have succeeded in building a good model.
I suggest checking the first two possibilities above and then trying out K-fold cross-validation.
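The first possibility can be checked directly by looking for records that occur in both splits. A minimal sketch, assuming the samples are hashable values such as strings (`shared_records` is a hypothetical helper name and the sequences are toy data):

```python
def shared_records(train_samples, test_samples):
    """Return the set of samples that appear in both the train and test split."""
    return set(train_samples) & set(test_samples)


# Toy sequences standing in for the real data.
train_samples = ["ACGT", "TTGA", "ACGT", "GGCC"]
test_samples = ["TTGA", "CCAA"]

overlap = shared_records(train_samples, test_samples)
# Share of test records that the model has already seen verbatim during training.
leak_rate = len([s for s in test_samples if s in overlap]) / len(test_samples)
print(overlap, leak_rate)
```

A high share of test records that also occur in the training split would explain very high test scores without the model having generalised.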
I have a flight delay dataset and try to split the set to train and test set before sampling. On-time cases are about 80% of total data and delayed cases are about 20% of that.
Normally in machine learning the ratio of train to test set size is 8:2. But the data is too imbalanced. So, considering an extreme case, most of the train data could be on-time cases and most of the test data delayed cases, and accuracy would be poor.
So my question is How can I properly split imbalanced dataset to train and test set??
Just playing with the ratio of train and test will probably not give you correct predictions and results.
If you are working on an imbalanced dataset, you should try a re-sampling technique to get better results. With imbalanced datasets, a classifier can end up always "predicting" the most common class without performing any analysis of the features.
Also use a different metric for performance measurement, such as F1 score, in the case of an imbalanced dataset.
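Why accuracy is misleading here can be shown with a few lines of plain Python. The sketch below uses made-up labels with the 80/20 ratio from the question and a degenerate classifier that always predicts the majority class; `accuracy` and `f1_for_class` are hypothetical helpers written out for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_for_class(y_true, y_pred, positive):
    """F1 for one class: harmonic mean of precision and recall (0.0 if undefined)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# 80% on-time (0) and 20% delayed (1), as in the question.
y_true = [0] * 80 + [1] * 20
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
f1_delayed = f1_for_class(y_true, y_pred, positive=1)
print(acc, f1_delayed)
```

The accuracy looks respectable even though not a single delayed flight is detected, which the F1 score for the delayed class makes visible.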
Please go through the below link, it will give you more clarity.
What is the correct procedure to split the Data sets for classification problem?
Cleveland heart disease dataset - can’t describe the class
Start from 50/50 and go on changing the split to 60/40, 70/30, 80/20, 90/10. Report all the results and draw a conclusion from them. In one of my projects on flight delay prediction, I used a 60/40 split and got 86.8% accuracy using an MLP NN.
There are two approaches that you can take.
A simple one: no preprocessing of the dataset but careful sampling of the dataset so that both classes are represented in the same proportion in the test and train subsets. You can do it by splitting by class first and then randomly sampling from both sets.
from sklearn.model_selection import train_test_split
XclassA = dataX[0]  # TODO: change to split by class
XclassB = dataX[1]
YclassA = dataY[0]
YclassB = dataY[1]
# Split each class separately so that both subsets keep the original class proportions
XclassA_train, XclassA_test, YclassA_train, YclassA_test = train_test_split(XclassA, YclassA, test_size=0.2, random_state=42)
XclassB_train, XclassB_test, YclassB_train, YclassB_test = train_test_split(XclassB, YclassB, test_size=0.2, random_state=42)
# "+" concatenates Python lists; for NumPy arrays use np.concatenate instead
Xclass_train = XclassA_train + XclassB_train
Yclass_train = YclassA_train + YclassB_train
Xclass_test = XclassA_test + XclassB_test
Yclass_test = YclassA_test + YclassB_test
A more involved, and arguably better, one: first try to balance your dataset. For that you can use one of many techniques (under-sampling, over-sampling, SMOTE, ADASYN, Tomek links, etc.). I recommend you review the methods of the imbalanced-learn package. Having done the balancing, you can use the ordinary test/train split using typical methods without any additional intermediary steps.
The second approach is better not only from the perspective of splitting the data but also from the speed and even ability to train a model (which for heavily imbalanced datasets is not guaranteed to work).
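The first approach above (split per class, then recombine) can be sketched without any library as follows. This is an illustration only: `stratified_split` is a hypothetical helper, and the data is a toy list with the 80/20 class ratio from the question.

```python
import random
from collections import defaultdict


def stratified_split(X, y, test_size=0.2, seed=42):
    """Split X and y so that each class keeps roughly the same share in both subsets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label, indices in by_class.items():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    # Shuffle again so the subsets are not grouped by class.
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Toy data with the 80/20 class ratio from the question (0 = on time, 1 = delayed).
X = list(range(100))
y = [0] * 80 + [1] * 20
X_train, X_test, y_train, y_test = stratified_split(X, y)
print(len(y_train), sum(y_train), len(y_test), sum(y_test))
```

With scikit-learn, passing `stratify=y` to `train_test_split` achieves the same effect in a single call.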
I have tried to incrementally train a word2vec model produced by gensim. But I found that the vocabulary size doesn't increase; only the word2vec model weights are updated. But I need to update both the vocabulary and the model size.
#Load data
sentences = []
....................
#Training
model = Word2Vec(sentences, size=100)
model.save("modelbygensim.txt")
model.save_word2vec_format("modelbygensim_text.txt")
#Incremental Training
model = Word2Vec.load('modelbygensim.txt')
model.train(sentences)
model.save("modelbygensim_incremental.txt")
model.save_word2vec_format("modelbygensim_text_incremental.txt")
By default, gensim Word2Vec only does vocabulary-discovery once. It will happen when you supply a corpus like your sentences to the initial constructor (which does an automatic vocabulary-scan and train), or alternatively when you call build_vocab(). While you can continue to call train(), no new words will be recognized.
There is support (that I would consider experimental) for calling build_vocab() with new text examples, and an update=True parameter, to expand the vocabulary. While this would let further train() calls train both old-and-new words, there are many caveats:
such sequential training may not lead to models as good, or as self-consistent, as providing all examples interleaved. (For example, the continued training may drift words learned-from-later-batches arbitrarily far from words/word-senses in earlier batches that are not re-presented.)
such calls to train() should use one of the optional parameters to give an accurate estimate of the new batch size (in words or examples) so that learning-rate decay and progress-logging is done properly
the core algorithm and underlying theories aren't based on such batching, and multiple restarts of the learning-rate from high-to-low, so the interpretation of results – and relative strength/balance of resulting vectors - isn't as well-grounded
If at all possible, combine all your examples into one corpus, and do one large vocabulary-discovery then training.