How to fit Word2Vec on test data?


I am working on a Sentiment Analysis problem. I am using Gensim's Word2Vec to vectorize my data in the following way:
# PREPROCESSING THE DATA
# SPLITTING THE DATA
from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y = train_test_split(x, y, test_size = 0.2, random_state = 69, stratify = y)
train_x2 = train_x['review'].to_list()
test_x2 = test_x['review'].to_list()
# CONVERT TRAIN DATA INTO A NESTED LIST, AS WORD2VEC EXPECTS A LIST OF TOKEN LISTS
import nltk
train_x3 = [nltk.word_tokenize(k) for k in train_x2]
test_x3 = [nltk.word_tokenize(k) for k in test_x2]
# TRAIN THE MODEL ON TRAIN SET
from gensim.models import Word2Vec
model = Word2Vec(train_x3, min_count = 1)
key_index = model.wv.key_to_index
# MAKE A DICT
we_dict = {word:model.wv[word] for word in key_index}
# CONVERT TO DATAFRAME
import pandas as pd
new = pd.DataFrame.from_dict(we_dict)
The new dataframe is the vectorized form of the train data. Now how do I do the same process for the test data? I can't pass the whole corpus (train+test) to the Word2Vec instance as it might lead to data leakage. Should I simply pass the test list to another instance of the model as:
model = Word2Vec(test_x3, min_count = 1)
I don't think this would be the correct way. Any help is appreciated!
PS: I am not using the pretrained word2vec in an LSTM model. What I am doing is training the Word2Vec on the data that I have and then feeding it to a ML algorithm like RF or LGBM. Hence I need to vectorize the test data separately.

Note that because word2vec is an unsupervised algorithm, it can sometimes be defensible to use all available texts to train it. That includes texts with known labels that you're withholding from other supervised-classification steps as test/validation records.
You just make sure the labels themselves aren't in the training data, but still use the bulk unlabeled text for further unsupervised improvement of the raw word-vectors. Those vectors, influenced by all the input text (but none of the known-answer labels) are then used for enhanced feature-modeling of the texts, as input to later supervised label-aware steps.
(Whether this is OK for your project may depend on what future performance you want your accuracy and other evaluation measures to reasonably estimate. Is it new situations where everything must always be trained from scratch, and where relevant raw text and labels are both scarce as training data? Or situations where the corpus always grows and text is always plentiful even if labels are expensive to acquire, or where any deployed classifier will be able to leverage other unlabeled texts before committing to a prediction?)
But note also: word-vectors are only comparison-compatible with each other when trained together, into a shared space. (Or, made compatible via other less-common post-training alignment steps.) There's no single right place for any word's vector, just a good relative position with regard to everything trained in the same session. That session uses randomization in both initialization and training, so even repeated runs on the same training data can yield end models of approximately-equivalent usefulness with wildly-different word-coordinates.
So, when withholding your test-set texts from initial word2vec training, you should never train a separate word2vec model on just the test texts; rather, use the frozen word2vec model from the training data to vectorize them.
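As a rough illustration of reusing one frozen set of word-vectors for both splits, here is a minimal sketch. It assumes each text is turned into a single feature vector by averaging its word-vectors, which is only one of several possible pooling choices and is not something the question specifies. The `average_vector` helper and the tiny `train_vectors` dict are hypothetical stand-ins; the dict plays the role of the word-vectors learned from the training texts only.

```python
def average_vector(tokens, vectors, dim):
    """Average the vectors of in-vocabulary tokens; all zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

# Hypothetical stand-in for word-vectors trained on the training texts only.
train_vectors = {
    "good": [1.0, 0.0],
    "bad": [0.0, 1.0],
    "movie": [0.5, 0.5],
}

train_docs = [["good", "movie"], ["bad", "movie"]]
test_docs = [["good", "film"], ["unseen", "words"]]

# The same frozen lookup is used for both splits; test words that were never
# seen in training are skipped rather than learned from the test set.
train_features = [average_vector(d, train_vectors, 2) for d in train_docs]
test_features = [average_vector(d, train_vectors, 2) for d in test_docs]
```

With a real gensim model, the same function should work if `model.wv` and `model.vector_size` are passed in place of the dict and the dimension, since `model.wv` supports both `in` checks and lookup by word.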
Separately: min_count=1 is almost always a bad idea for word2vec models, & if you're tempted to do so, you may have far too little data for such a data-hungry algorithm to show its true value. (If using it on the datasets where it really shines, you should be more often raising that threshold above its default – discarding more rare words – than lowering it to save every rare, hard-to-model-well word.)
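To get a feel for what a given threshold would discard before training anything, one can count token frequencies directly. The sketch below uses a hypothetical helper, `surviving_vocab`, and a made-up toy corpus; it only mimics the frequency cut-off, not anything else Word2Vec does.

```python
from collections import Counter

def surviving_vocab(tokenized_docs, min_count):
    """Return the set of words that occur at least `min_count` times."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return {word for word, count in counts.items() if count >= min_count}

# Made-up toy corpus; real corpora need far more text than this.
docs = [
    ["great", "movie", "great", "cast"],
    ["terrible", "movie", "weak", "plot"],
    ["great", "plot", "movie"],
]

for threshold in (1, 2, 5):
    kept = surviving_vocab(docs, threshold)
    print(threshold, len(kept), sorted(kept))
```

If nearly everything disappears at gensim's default of `min_count=5`, that is usually a sign the corpus is too small for the algorithm rather than a reason to lower the threshold.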

Related

Checking model overfit of doc2vec with infer_vector()

My aim is to create document embeddings from the column df["text"] as a first step, and then as a second step plug them along with other variables into an XGBoost Regressor model in order to make predictions. This works very well for the train_df.
I am currently trying to evaluate my trained Doc2Vec model by inferring vectors with infer_vector() on the unseen test_df and then again make predictions with it. However, the results are super bad. I got a very large error (RMSE).
I assume, this means that Doc2Vec is massively overfitting?
I am actually not sure if this is the correct way to evaluate my doc2vec model (by infer_vector)?
What to do to prevent doc2vec from overfitting?
Please find my code below for inferring vectors from a model:
vectors_test = []
for i in range(len(test_df)):
    vecs = model.infer_vector(tokenize(test_df["text"][i]))
    vectors_test.append(vecs)
vectors_test = pd.DataFrame(vectors_test)
test_df = pd.concat([test_df, vectors_test], axis=1)
I then make predictions with my XGBoost model:
np.random.seed(0)
test_df= test_df.reindex(np.random.permutation(test_df.index))
y = test_df['target'].values
X = test_df.drop(['target'], axis=1).values
y_pred = mod.predict(X)
pred = pd.DataFrame()
pred["Prediction"] = y_pred
rmse = np.sqrt(mean_squared_error(y,y_pred))
print(rmse)
Please see also the training of my doc2vec model:
doc_tag = train_df.apply(lambda train_df: TaggedDocument(words=tokenize(train_df["text"]), tags= [train_df.Tag]), axis = 1)
# initializing model, building a vocabulary
model = Doc2Vec(dm=0, vector_size=200, min_count=1, window=10, workers= cores)
model.build_vocab([x for x in tqdm(doc_tag.values)])
# train model for 5 epochs
for epoch in range(5):
    model.train(utils.shuffle([x for x in tqdm(doc_tag.values)]), total_examples=len(doc_tag.values), epochs=1)

Without knowing what your XGBoost model is being trained to predict, or more about the type/quantity of your training data for certain steps, it's hard to speculate why one particular set of inputs are performing poorly. (For example, it could equally be the XGBoost model's data, parameters, or training that's mismatched to the task.)
But, some observations:
You generally shouldn't be calling train() multiple times in your own loop. See My Doc2Vec code, after many loops of training, isn't giving good results. What might be wrong? for discussion of common problems here. (Yours isn't quite as stark, but the learning-rate isn't being handled properly in your 5 separate train()s - indeed there should even be some error in your log output.)
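To see why repeated calls mishandle the learning rate, here is a simplified picture of a linear decay schedule. It is not gensim's actual implementation: `linear_alpha`, the step counts, and the start/end values of 0.025 and 0.0001 are illustrative assumptions. A single call covering all epochs decays once, while five one-epoch calls restart the decay each time.

```python
def linear_alpha(step, total_steps, start=0.025, end=0.0001):
    """Learning rate at `step` out of `total_steps`, decaying linearly."""
    progress = step / total_steps
    return start - (start - end) * progress

STEPS_PER_EPOCH = 4  # illustrative; real training has many more steps
EPOCHS = 5

# One call covering all epochs: a single smooth decay from start to end.
single_call = [
    linear_alpha(step, STEPS_PER_EPOCH * EPOCHS)
    for step in range(STEPS_PER_EPOCH * EPOCHS)
]

# Five separate one-epoch calls: the rate jumps back up on every call.
repeated_calls = [
    linear_alpha(step, STEPS_PER_EPOCH)
    for _ in range(EPOCHS)
    for step in range(STEPS_PER_EPOCH)
]

print([round(a, 4) for a in single_call])
print([round(a, 4) for a in repeated_calls])
```

The second list shows a sawtooth rather than a steady decline, which is the pattern the separate train() calls in the question would produce under these assumptions.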
Similarly: it's often a bad idea to use a min_count so small as 1 in these kinds of models: such rare words, without enough varied examples to be truly understood, just inject idiosyncratic noise which dilutes the influence of other, surrounding tokens which are meaningful.
Most published work trains a Doc2Vec model for 10-20 epochs – you're only using 5. (And, for smaller datasets or smaller texts, often even more epochs help.) Inference will also default to the epochs configured when the model was created – here only 5 – but more epochs are often beneficial.
It's unclear the size of your training texts and their unique vocabulary, but Doc2Vec overfitting will be most likely if the model is relatively large (in terms of vector_size or total surviving vocabulary) compared to the training data. Then, the model has lots of opportunity to essentially 'memorize' idiosyncrasies of the training set, instead of more-generalizable patterns that will still be useful for out-of-training data. (For example, min_count=1, if it's preserving many singleton words which appear in only one text each, gives the model lots of "nooks and crannies" in which to improve its training target results in ways unlikely to help on other examples.) If your training data is "small", you likely need to use a smaller vector_size and a larger min_count to avoid overfitting, and then perhaps more epochs to ensure adequate training.
infer_vector essentially ignores any words not in its vocabulary, so you should take a look at some of the specific texts in the set performing poorly, and check whether most of their words are present or not. But note also: as Doc2Vec is an unsupervised method, a plausible case can be made for training it to learn textual patterns on all available data, including the texts in your 'test' set. Then, it is more likely to have some word data, up to at least the min_count threshold, for words across all examples. (Of course the actual supervised predictor itself can only be fairly evaluated on test examples whose desired answers weren't provided during the predictor's training. But it can still receive its features from an unsupervised step that used all text data.)
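A quick way to run the vocabulary check described above is to measure, for each poorly-performing text, what fraction of its tokens the model knows. This is a minimal sketch: `vocab_coverage` is a hypothetical helper, and the small `vocab` set is a stand-in for the trained model's vocabulary (with gensim 4.x, `model.wv.key_to_index` should be usable in its place, since it supports `in` checks).

```python
def vocab_coverage(tokens, vocab):
    """Fraction of tokens that are present in the model vocabulary."""
    if not tokens:
        return 0.0
    known = sum(1 for tok in tokens if tok in vocab)
    return known / len(tokens)

# Stand-in for the trained model's vocabulary.
vocab = {"the", "model", "predicts", "prices"}

test_texts = {
    "mostly known": ["the", "model", "predicts", "rents"],
    "mostly unknown": ["quarterly", "rents", "rose", "the"],
}

for name, tokens in test_texts.items():
    print(name, vocab_coverage(tokens, vocab))
```

Texts with low coverage are being inferred from only a handful of words, so poor doc-vectors for them would not be surprising.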
A crude check of a Doc2Vec model for overfitting or other training problems (but not overall quality) is to re-infer doc-vectors from the same texts it was trained on, and check the model's set of bulk-trained vectors (model.docvecs) for the nearest-neighbors to these re-inferred vectors. If the re-inferred vector's nearest neighbor isn't usually the same text's bulk-trained vector, or if more generally re-inferring the same text multiple times doesn't yield vectors that are 'close' to each other, then something about the model training or inference is deficient: overfitting, or undertraining, or insufficient data, or unwise parameters.
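The re-inference check can be sketched without any library by comparing two sets of vectors with cosine similarity. Everything here is illustrative: `self_match_rate` is a hypothetical helper, and the two small dicts stand in for the bulk-trained doc-vectors and for the vectors you would get by calling infer_vector() on the same training texts.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def self_match_rate(trained, reinferred):
    """Share of re-inferred vectors whose nearest trained vector has the same tag."""
    hits = 0
    for tag, vec in reinferred.items():
        nearest = max(trained, key=lambda other: cosine(vec, trained[other]))
        hits += nearest == tag
    return hits / len(reinferred)

# Stand-ins: bulk-trained vectors, and vectors re-inferred from the same texts.
trained = {"doc0": [1.0, 0.0], "doc1": [0.0, 1.0], "doc2": [-1.0, 0.0]}
reinferred = {"doc0": [0.9, 0.1], "doc1": [0.1, 0.9], "doc2": [0.2, 0.8]}

rate = self_match_rate(trained, reinferred)
print(f"self-match rate: {rate:.2f}")
```

In this toy data the third document lands nearest to a different document, so the rate is below 1.0. On a real model, a low rate over a sample of training texts would point to one of the training or inference problems listed above.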

Creating train,test data for Word2Vec model

I am trying to create a W2V model and then generate train and test data to be used for my model. My question is: how can I generate test data after I am done creating a W2V model with my train data?

Word2Vec is considered an 'unsupervised' algorithm, so at least during its training, it is not typical to hold back any 'test' data for later evaluation.
A Word2Vec model is usually then evaluated on how well it helps some other process, such as the analogy-solving highlighted by the original paper. In gensim, the method evaluate_word_analogies() can repeat that process. But note: word-vectors that perform best on word-analogies may not be best for other purposes, like classification or info-retrieval. It's always best to evaluate & tune your word-vectors in a repeatable way that's related to your actual underlying use.
(If you're using the Word2Vec model's outputs - word-vectors specific to your domain – as part of a larger system, where some steps should be evaluated with held-back data, the decision of whether to train the Word2Vec component on all data could go either way, depending on other considerations.)

Incremental Word2Vec Model Training in gensim

I have tried to incrementally train a word2vec model produced by gensim. But I found that the vocabulary size doesn't increase; only the word2vec model weights are updated. I need to update both the vocabulary and the model size.
#Load data
sentences = []
....................
#Training
model = Word2Vec(sentences, size=100)
model.save("modelbygensim.txt")
model.save_word2vec_format("modelbygensim_text.txt")
#Incremental Training
model = Word2Vec.load('modelbygensim.txt')
model.train(sentences)
model.save("modelbygensim_incremental.txt")
model.save_word2vec_format("modelbygensim_text_incremental.txt")

By default, gensim Word2Vec only does vocabulary-discovery once. It will happen when you supply a corpus like your sentences to the initial constructor (which does an automatic vocabulary-scan and train), or alternatively when you call build_vocab(). While you can continue to call train(), no new words will be recognized.
There is support (that I would consider experimental) for calling build_vocab() with new text examples, and an update=True parameter, to expand the vocabulary. While this would let further train() calls train both old-and-new words, there are many caveats:
such sequential training may not lead to models as good, or as self-consistent, as providing all examples interleaved. (For example, the continued training may drift words learned-from-later-batches arbitrarily far from words/word-senses in earlier batches that are not re-presented.)
such calls to train() should use one of the optional parameters to give an accurate estimate of the new batch size (in words or examples) so that learning-rate decay and progress-logging is done properly
the core algorithm and underlying theories aren't based on such batching, and multiple restarts of the learning-rate from high-to-low, so the interpretation of results – and relative strength/balance of resulting vectors - isn't as well-grounded
If at all possible, combine all your examples into one corpus, and do one large vocabulary-discovery then training.

Retrieve list of training features names from classifier

Is there a way to retrieve the list of feature names used for training of a classifier, once it has been trained with the fit method? I would like to get this information before applying to unseen data.
The data used for training is a pandas DataFrame and in my case, the classifier is a RandomForestClassifier.

I have a solution which works but is not very elegant. This is an old post with no existing solutions so I suppose there are not any.
Create and fit your model. For example
model = GradientBoostingRegressor(**params)
model.fit(X_train, y_train)
Then you can add an attribute which is the 'feature_names' since you know them at training time
model.feature_names = list(X_train.columns.values)
I typically then put the model into a binary file to pass it around but you can ignore this
joblib.dump(model, filename)
loaded_model = joblib.load(filename)
Then you can get the feature names back from the model to use them when you predict
f_names = loaded_model.feature_names
loaded_model.predict(X_pred[f_names])

Based on the documentation and previous experience, there is no way to get a list of the features considered in at least one of the splits.
Is your concern that you do not want to use all your features for prediction, just the ones actually used for training? In this case I suggest listing the feature_importances_ after fitting and eliminating the features that do not seem relevant. Then train a new model with only the relevant features and use those features for prediction as well.

You don't need to know which features were selected for the training. Just make sure to give, during the prediction step, to the fitted classifier the same features you used during the learning phase.
The Random Forest Classifier will only use the features on which it makes its splits. Those will be the same as those learnt during the first phase. Others won't be considered.
If the shape of your test data is not the same as the training data it will throw an error, even if the test data contains all the features used for the splits of your decision trees.
What's more, since Random Forests make random selection of features for your decision trees (called estimators in sklearn) all the features are likely to be used at least once.
However, if you want to know the features used, you can just call the attributes n_features_ and feature_importances_ on your classifier once fitted.
You can look here to see how you can retrieve the names of the most important features you used.

You can extract feature names from a trained XGBoost model as follows:
model.get_booster().feature_names

How to get ordered list of labels after fitting sklearn

train_index, test_index = next(iter(ShuffleSplit(821, train_size=0.2, test_size=0.80, random_state=42)))
print train_index, len(train_index)
print test_index, len(test_index)
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(features, labels, train_size=0.33, random_state=42)
clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test, labels_test)
print pred, len(pred)
A few questions from this code:
Why do I need the cross_validation.train_test_split line in order to fit and predict with my classifier? (I am not doing any preprocessing on my data except for stopword removal I have already done)
Do the test and train indexes correspond to the classified & predicted labels? My goal is to get all my labels, in their original order, after fitting and predicting them. My features and labels used for training and testing are from a pandas dataframe (two columns), and I need the predicted labels, in order, so that I can feed them back into the pandas dataframe.
Is there a way to predict the labels for the whole set, and not just the test set?

tl;dr
Because your decision tree classifier has to be trained before it can predict anything. It's not a magic algorithm. It has to be shown examples of what to do before it can work out what to do on other things.
cross_validation.train_test_split() facilitates this by splitting your data into a test and training dataset in such a way that you can analyse how well it performed later on. Without this, you have no way of assessing how well your decision tree classifier actually performed.
You can create your own testing and training data without train_test_split() (and I suspect that was what you were trying to do with ShuffleSplit()), but you will need at least some training data.
test_index and train_index have nothing to do with your data. Full stop. They come from a randomly generated process that is completely unrelated to what train_test_split() does.
The purpose of ShuffleSplit() is to give you the indices to partition your data into training and test yourself. train_test_split() will instead choose its own indices and partition based on those indices. You should use one or the other, and use it sensibly.
Yes. You can always just call
pred = clf.predict(features) or pred = clf.predict(features_test + features_train)
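To make the ordering point concrete, here is a small library-free sketch of splitting by index and then writing test predictions back into their original row positions. The `split_indices` helper, the toy data, and the threshold-based `predict` function (standing in for an already-fitted `clf.predict`) are all hypothetical.

```python
import random

def split_indices(n_rows, test_fraction, seed):
    """Shuffle row numbers and return (train_idx, test_idx)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]

def predict(rows):
    """Toy stand-in for an already-fitted clf.predict."""
    return [int(row[0] > 0.5) for row in rows]

features = [[0.1], [0.9], [0.2], [0.8], [0.7], [0.3]]
labels = [0, 1, 0, 1, 1, 0]

train_idx, test_idx = split_indices(len(features), test_fraction=0.5, seed=42)

# Predictions come back in the order of the rows passed in, so pairing them
# with test_idx puts each one back at its original row position.
predicted = [None] * len(features)
test_rows = [features[i] for i in test_idx]
for i, label in zip(test_idx, predict(test_rows)):
    predicted[i] = label

# Predicting on the full, unshuffled feature list keeps the original order.
predicted_all = predict(features)
print(predicted)
print(predicted_all)
```

Rows left as `None` are the training rows; the indices are what let you line the predictions up with the original dataframe afterwards.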
The Full Story
You need cross_validation if you want to do this right. The whole purpose of cross-validation is to avoid overfit.
Basically, if you run your model on both the training and the testing data, then your model is going to perform really well on the training set (because, well, that's what you trained it on) and that's going to skew your overall metrics of how well your model will perform on real data.
It's a lot like asking a student to perform in an exam and then in real life: if you want to know whether your student learned from the process of preparing for an exam, you don't give him another exam, you ask him to demonstrate his skills in the real world dealing with unknown and complex data.
If you want to know if your model will be useful, then you want to cross-validate. Wikipedia puts it best:
In a prediction problem, a model is usually given a dataset of known
data on which training is run (training dataset), and a dataset of
unknown data (or first seen data) against which the model is tested
(testing dataset).
The goal of cross validation is to define a
dataset to "test" the model in the training phase (i.e., the
validation dataset), in order to limit problems like overfitting, give
an insight on how the model will generalize to an independent dataset
(i.e., an unknown dataset, for instance from a real problem), etc.
cross_validation.train_test_split doesn't do anything except split the dataset into training and testing data for you.
But perhaps you don't care about metrics, and that's fine. The question then becomes: is it possible to run a decision tree classifier without a training dataset?
The answer is no. Decision tree classifiers are supervised algorithms: they need to be trained on data before they can generalise their model to new results. If you don't give them any data to train on, it will be unable to do anything with any data you feed it in predict.
Finally, while it is perfectly possible to get the labels for the whole set (see tl;dr) , it is a really bad idea if you actually care about whether or not you're getting sensible results.
You already have the labels for the testing and training data. You don't need another column that includes prediction on the testing data, because they'll either come out to be identical or close enough to identical.
I can't think of a single meaningful reason to get back predicted results for your training data short of trying to optimise how it's performing on your training data. If that's what you are trying to do, then do that. What you are doing right now is definitely not that, and I encourage you to think strongly about what your reasons are for blindly inserting numbers into your table without due cause to believe they actually mean something.
There are ways to improve this: get back an accuracy metric, for example, or try to do k-fold cross-validation to model accuracy, or look at log-loss or AUC or any one of a number of metrics to gauge whether or not your model is performing well.
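For reference, the mechanics of k-fold splitting and an accuracy score are simple enough to sketch by hand. This is an illustration under simplifying assumptions, not a replacement for sklearn's utilities: the `k_fold_indices` and `accuracy` helpers are hypothetical, and rows are assumed to be shuffled already, since the folds are taken as contiguous blocks.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs; each row is in exactly one test fold."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, test_idx
        start += size

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

folds = list(k_fold_indices(7, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

In a real run, you would fit a fresh classifier on each fold's training rows, score it on that fold's test rows, and average the scores.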

Using both ShuffleSplit and train_test_split is redundant. You do not even appear to be using the indices returned by ShuffleSplit.
An example of how to use split indices is below; here they come from train_test_split applied to the row numbers rather than from ShuffleSplit. X and y are np.array. X is number of instances by number of features. y contains the labels of each row.
train_inds, test_inds = train_test_split(range(len(y)),test_size=0.33, random_state=42)
X_train, y_train = X[train_inds], y[train_inds]
X_test , y_test = X[test_inds] , y[test_inds]
You should not test on your training data! But if you want to see what happens just do
pred = clf.predict(features_train)
Also you do not need to pass the labels to predict. You should be using
score = metrics.accuracy_score(y_test, pred)
