Creating train/test data for a Word2Vec model - python

I am trying to create a W2V model and then generate train and test data to be used for my model. My question is: how can I generate test data after I am done creating a W2V model with my train data?

Word2Vec is considered an 'unsupervised' algorithm, so at least during its training, it is not typical to hold back any 'test' data for later evaluation.
A Word2Vec model is usually then evaluated on how well it helps some other process - such as the analogy-solving highlighted by the original paper. In gensim, the method evaluate_word_analogies() can repeat that process. But note: word-vectors that perform best on word-analogies may not be best for other purposes, like classification or info-retrieval. It's always best to evaluate & tune your word-vectors in a repeatable way that's related to your actual underlying use.
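For example, a minimal sketch of running that analogy evaluation in gensim 4.x, assuming you already have a trained Word2Vec model named model (questions-words.txt is the analogy test set bundled with gensim):
from gensim.test.utils import datapath
# `model` is assumed to be an already-trained gensim Word2Vec model.
score, sections = model.wv.evaluate_word_analogies(datapath('questions-words.txt'))
print(f"Overall analogy accuracy: {score:.3f}")
# Per-section breakdown: each entry lists the correct/incorrect analogy tuples.
for section in sections:
    n_correct = len(section['correct'])
    n_total = n_correct + len(section['incorrect'])
    if n_total:
        print(section['section'], n_correct / n_total)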
(If you're using the Word2Vec model's outputs - word-vectors specific to your domain – as part of a larger system, where some steps should be evaluated with held-back data, the decision of whether to train the Word2Vec component on all data could go either way, depending on other considerations.)

Related

LightGBM: train() vs update() vs refit()

I'm implementing LightGBM (Python) into a continuous learning pipeline. My goal is to train an initial model and update the model (e.g. every day) with newly available data.
Most examples load an already trained model and apply train() once again:
updated_model = lightgbm.train(params=last_model_params, train_set=new_data, init_model = last_model)
However, I'm wondering if this is actually the correct way to approach continuous learning within the LightGBM library, since the amount of fitted trees (num_trees()) grows by n_estimators with every application of train(). To my understanding, a model update should take an initial model definition (under a given set of model parameters) and refine it without ever growing the amount of trees/size of the model definition.
I find the documentation regarding train(), update() and refit() not particularly helpful. What would be considered the right approach to implement continuous learning with LightGBM?
In lightgbm (the Python package for LightGBM), these entrypoints you've mentioned do have different purposes.
The main lightgbm model object is a Booster. A fitted Booster is produced by training on input data. Given an initial trained Booster...
Booster.refit() does not change the structure of an already-trained model. It just updates the leaf counts and leaf values based on the new data. It will not add any trees to the model.
Booster.update() will perform exactly 1 additional round of gradient boosting on an existing Booster. It will add at most 1 tree to the model.
train() with an init_model will perform gradient boosting for num_iterations additional rounds. It also allows for lots of other functionality, like custom callbacks (e.g. to change the learning rate from iteration-to-iteration) and early stopping (to stop adding trees if performance on a validation set fails to improve). It will add up to num_iterations trees to the model.
What would be considered the right approach to implement continuous learning with LightGBM?
There are trade-offs involved in this choice and no one of these is the globally "right" way to achieve the goal "modify an existing model based on newly-arrived data".
Booster.refit() is the only one of these approaches that meets your definition of "refine [the model] without ever growing the amount of trees/size of the model definition". But it could lead to drastic changes in the predictions produced by the model, especially if the batch of newly-arrived data is much smaller than the original training data, or if the distribution of the target is very different.
Booster.update() is the simplest interface for this, but a single iteration might not be enough to get most of the information from the newly-arrived data into the model. For example, if you're using fairly shallow trees (say, num_leaves=7) and a very small learning rate, even newly-arrived data that is very different from the original training data might not change the model's predictions by much.
train(init_model=previous_model) is the most flexible and powerful option, but it also introduces more parameters and choices. If you choose to use train(init_model=previous_model), pay attention to parameters num_iterations and learning_rate. Lower values of these parameters will decrease the impact of newly-arrived data on the trained model; higher values will allow a larger change to the model. Finding the right balance between those is a concern for your evaluation framework.
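As a rough sketch of what the three entrypoints look like in code (the data and parameter values here are hypothetical, and this is illustrative rather than a drop-in recipe):
import numpy as np
import lightgbm as lgb
# Hypothetical arrays standing in for the original batch and a newly-arrived batch.
X_old, y_old = np.random.rand(1000, 10), np.random.rand(1000)
X_new, y_new = np.random.rand(200, 10), np.random.rand(200)
params = {"objective": "regression", "learning_rate": 0.05, "num_leaves": 31, "verbose": -1}
old_set = lgb.Dataset(X_old, label=y_old)
booster = lgb.train(params, old_set, num_boost_round=100)  # initial model
# refit(): same tree structure, leaf values re-estimated from the new data (no new trees).
refitted = booster.refit(X_new, y_new)
# update(): exactly one more boosting round; building the new Dataset against the
# old one keeps the feature binning compatible.
new_set = lgb.Dataset(X_new, label=y_new, reference=old_set)
booster.update(train_set=new_set)
# train(init_model=...): up to num_boost_round additional trees fitted on the new data.
continued = lgb.train(params, lgb.Dataset(X_new, label=y_new),
                      num_boost_round=10, init_model=booster)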

How to fit Word2Vec on test data?

I am working on a Sentiment Analysis problem. I am using Gensim's Word2Vec to vectorize my data in the following way:
# PREPROCESSING THE DATA
# SPLITTING THE DATA
from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y = train_test_split(x, y, test_size = 0.2, random_state = 69, stratify = y)
train_x2 = train_x['review'].to_list()
test_x2 = test_x['review'].to_list()
# CONVERT TRAIN DATA INTO NESTED LISTS, AS WORD2VEC EXPECTS A LIST OF TOKEN LISTS
train_x3 = [nltk.word_tokenize(k) for k in train_x2]
test_x3 = [nltk.word_tokenize(k) for k in test_x2]
# TRAIN THE MODEL ON TRAIN SET
from gensim.models import Word2Vec
model = Word2Vec(train_x3, min_count = 1)
key_index = model.wv.key_to_index
# MAKE A DICT
we_dict = {word:model.wv[word] for word in key_index}
# CONVERT TO DATAFRAME
import pandas as pd
new = pd.DataFrame.from_dict(we_dict)
The new dataframe is the vectorized form of the train data. Now how do I do the same process for the test data? I can't pass the whole corpus (train+test) to the Word2Vec instance as it might lead to data leakage. Should I simply pass the test list to another instance of the model as:
model = Word2Vec(test_x3, min_count = 1)
I don't think this would be the correct way. Any help is appreciated!
PS: I am not using the pretrained word2vec in an LSTM model. What I am doing is training the Word2Vec on the data that I have and then feeding it to an ML algorithm like RF or LGBM. Hence I need to vectorize the test data separately.
Note that because word2vec is an unsupervised algorithm, it can sometimes be defensible to use all available texts to train it. That includes texts with known labels that you're withholding from other supervised-classification steps as test/validation records.
You just make sure the labels themselves aren't in the training data, but still use the bulk unlabeled text for further unsupervised improvement of the raw word-vectors. Those vectors, influenced by all the input text (but none of the known-answer labels) are then used for enhanced feature-modeling of the texts, as input to later supervised label-aware steps.
(Whether this is OK for your project may depend on what future performance you want your various accuracy/etc. evaluation measures to reasonably estimate. Is it new situations where everything must always be trained from scratch, and where relevant raw text and labels as training data are both scarce? Or situations where the corpus always grows, & text is always plentiful even if labels are expensive to acquire, or where any actual deployed classifiers will be able to leverage other unlabeled texts before committing to a prediction?)
But note also: word-vectors are only comparison-compatible with each other when trained together, into a shared space. (Or, made compatible via other less-common post-training alignment steps.) There's no single right place for any word's vector, just a good relative position with regard to everything trained in the same session – which used randomization in both initialization & training, so even repeated runs on the same training data can yield end models of approximately-equivalent usefulness with wildly-different word-coordinates.
So, when withholding your test-set texts from the initial word2vec training, you should not then train a separate word2vec model on just the test texts; rather, use the frozen word2vec model from the training data to vectorize them.
Separately: min_count=1 is almost always a bad idea for word2vec models, & if you're tempted to do so, you may have far too little data for such a data-hungry algorithm to show its true value. (If using it on the datasets where it really shines, you should be more often raising that threshold above its default – discarding more rare words – than lowering it to save every rare, hard-to-model-well word.)
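As a sketch of that last point, applied to the code above: keep the single model trained on train_x3, and build fixed-length features for both splits from it, skipping any test-set word that falls outside the model's vocabulary (averaging a document's word-vectors is just one common choice of representation, not the only one):
import numpy as np
def doc_vector(tokens, w2v_model):
    # Average the vectors of in-vocabulary tokens; out-of-vocabulary tokens are skipped.
    vecs = [w2v_model.wv[t] for t in tokens if t in w2v_model.wv.key_to_index]
    if not vecs:
        return np.zeros(w2v_model.vector_size)
    return np.mean(vecs, axis=0)
# One frozen model (trained on train_x3 only) vectorizes both splits.
train_features = np.vstack([doc_vector(doc, model) for doc in train_x3])
test_features = np.vstack([doc_vector(doc, model) for doc in test_x3])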

How to improve language model ex: BERT on unseen text in training?

So I am using a pre-trained language model for binary classification. I fine-tune the model by training on data from my downstream task. The results are good, almost 98% F-measure.
However, when I remove a specific similar sentence from the training data and add it to my test data, the classifier fails to predict the class of that sentence. For example, in a sentiment analysis task:
"I love the movie more specifically the acting was great"
I removed from training all sentences containing the words "more specifically", and surprisingly they were all misclassified in the test set, so the precision decreased by a huge amount.
Any ideas on how I can further fine-tune/improve my model so it works better on text unseen during training, to avoid the problem I described above? (Of course, without feeding the model sentences containing the words "more specifically".)
Note: I observed the same performance regardless of the language model in use (BERT, RoBERTa etc).
I think you might have a problem with overfitting, i.e. your model is focusing on features that are too specific for it to generalize well. This can lead to results such as yours, where some obscure part of the sentence is the main factor for a correct classification.
There are multiple ways to solve this (see here). One is cross-validation, where you rotate your validation set; another is to use dropout layers; yet another is not to let your model train for too long, which can also lead to overfitting.
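For instance, if you're fine-tuning with Hugging Face's Trainer, a minimal early-stopping sketch might look like the following; model, train_ds, val_ds and compute_metrics (which must return an "f1" entry) are placeholders from your own setup, and on recent transformers versions the evaluation_strategy argument is named eval_strategy instead:
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback
args = TrainingArguments(
    output_dir="checkpoints",
    evaluation_strategy="epoch",   # evaluate on the validation set every epoch
    save_strategy="epoch",
    load_best_model_at_end=True,   # required for early stopping
    metric_for_best_model="f1",
    num_train_epochs=10,
)
trainer = Trainer(
    model=model,                   # placeholder: your classification model
    args=args,
    train_dataset=train_ds,        # placeholder datasets
    eval_dataset=val_ds,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],  # stop if F1 stalls for 2 evals
)
trainer.train()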

Is it appropriate to train W2V model on entire corpus?

I have a corpus of free-text medical narratives, which I am going to use for a classification task, right now with about 4200 records.
To begin, I wish to create word embeddings using w2v, but I have a question about a train-test split for this task.
When I train the w2v model, is it appropriate to use all of the data for the model creation? Or should I only use the train data for creating the model?
Really, my question sort of comes down to: do I take the whole dataset, create the w2v model, transform the narratives with the model, and then split, or should I split, create w2v, and then transform the two sets independently?
Thanks!
EDIT
I found an internal project at my place of work which was built by a vendor; they create the split, and create the w2v model on ONLY the train data, then transform the two sets independently in different jobs; so it's the latter of the two options that I specified above. This is what I thought would be the case, as I wouldn't want to contaminate the w2v model with any of the test data.
The answer to most questions like these in NLP is "try both" :-)
Contamination of test vs train data is not relevant or a problem in generating word vectors. That is a relevant issue in the model you use the vectors with. I found performance to be better with whole corpus vectors in my use cases.
Word vectors improve in quality with more data. If you don't use the test corpus, you will need a method for initializing out-of-vocabulary vectors, and an understanding of the impact they may have on your model performance.
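For example, one simple (and fairly blunt) such method is a fixed fallback vector for words the train-only model has never seen; the zero-vector choice below is an assumption, with a mean-of-all-vectors or a small random vector being other common options:
import numpy as np
def word_vector(word, w2v_model):
    # In-vocabulary words get their trained vector; out-of-vocabulary words get
    # a zero vector so downstream feature shapes stay consistent.
    if word in w2v_model.wv.key_to_index:
        return w2v_model.wv[word]
    return np.zeros(w2v_model.vector_size)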

Computing TF-IDF on the whole dataset or only on training data?

In chapter seven of the book "TensorFlow Machine Learning Cookbook", the author, while pre-processing data, uses scikit-learn's fit_transform function to get the tf-idf features of the text for training. The author gives all the text data to the function before separating it into train and test. Is this the correct approach, or should we separate the data first and then perform fit_transform on train and transform on test?
According to the documentation of scikit-learn, fit() is used in order to
Learn vocabulary and idf from training set.
On the other hand, fit_transform() is used in order to
Learn vocabulary and idf, return term-document matrix.
while transform()
Transforms documents to document-term matrix.
On the training set you need to apply both fit() and transform() (or just fit_transform() that essentially joins both operations) however, on the testing set you only need to transform() the testing instances (i.e. the documents).
Remember that training sets are used for learning purposes (learning is achieved through fit()) while the testing set is used in order to evaluate whether the trained model can generalise well to new, unseen data points.
For more details you can refer to the article fit() vs transform() vs fit_transform()
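In code, that rule boils down to the following pattern (a minimal sketch with a hypothetical toy corpus):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
# Toy placeholder corpus and labels.
texts = ["good movie", "bad acting", "great plot", "terrible film",
         "loved it", "awful script", "wonderful cast", "boring story"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]
train_texts, test_texts, train_y, test_y = train_test_split(
    texts, labels, test_size=0.25, random_state=0)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learn vocabulary + idf from train only
X_test = vectorizer.transform(test_texts)        # reuse that vocabulary/idf on test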
The author gives all text data to the function before separating it into train and test. Is this the correct approach, or should we separate the data first and then perform tf-idf fit_transform on train and transform on test?
I would consider this as already leaking some information about the test set into the training set.
I tend to always follow the rule that, before any pre-processing, the first thing to do is to separate the data and create a hold-out set.
As we are talking about text data, we have to make sure that the model is trained only on the vocabulary of the training set, because when we deploy a model in real life it will encounter words that it has never seen before; we have to do the validation on the test set keeping that in mind.
We have to make sure that the new words in the test set are not a part of the vocabulary of the model.
Hence we have to use fit_transform on the training data and transform on the test data.
If you think about doing cross validation, then you can use this logic across all the folds.
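If you do use cross validation, one way to keep that rule intact in every fold is to wrap the vectorizer and the classifier in a scikit-learn Pipeline, so fit_transform is re-run on each fold's training portion only (a sketch with a toy corpus; the classifier choice is arbitrary):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
# Toy placeholder corpus; within each fold the pipeline refits the vectorizer
# on that fold's training documents only, then transforms the held-out fold.
texts = ["good movie", "bad acting", "great plot", "terrible film"] * 5
labels = [1, 0, 1, 0] * 5
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, texts, labels, cv=5)
print(scores.mean())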
