Doc2vec and word2vec with negative sampling

Doc2vec and word2vec with negative sampling - python

My current doc2vec code is as follows.
# Train doc2vec model
model = doc2vec.Doc2Vec(docs, size = 100, window = 300, min_count = 1, workers = 4, iter = 20)
I also have a word2vec code as below.
# Train word2vec model
model = word2vec.Word2Vec(sentences, size=300, sample = 1e-3, sg=1, iter = 20)
I am interested in using both DM and DBOW in doc2vec AND both Skip-gram and CBOW in word2vec.
In Gensim I found the below mentioned sentence:
"Produce word vectors with deep learning via word2vec’s “skip-gram and CBOW models”, using either hierarchical softmax or negative sampling"
Thus, I am confused either to use hierarchical softmax or negative sampling. Please let me know what are the differences in these two methods.
Also, I am interested in knowing what are the parameters that need to be changed to use hierarchical softmax AND/OR negative sampling with respect to dm, DBOW, Skip-gram and CBOW?
P.s. my application is a recommendation system :)

Skip-gram or CBOW are different ways to choose the input contexts for the neural-network. Skip-gram picks one nearby word, then supplies it as input to try to predict a target word; CBOW averages together a bunch of nearby words, then supplies that average as input to try to predict a target word.
DBOW is most similar to skip-gram, in that a single paragraph-vector for a whole text is used to predict individual target words, regardless of distance and without any averaging. It can mix well with simultaneous skip-gram training, where in addition to using the single paragraph-vector, individual nearby word-vectors are also used. The gensim option dbow_words=1 will add skip-gram training to a DBOW dm=0 training.
DM is most similar to CBOW: the paragraph-vector is averaged together with a number of surrounding words to try to predict a target word.
So in Word2Vec, you must choose between skip-gram (sg=1) and CBOW (sg=0) – they can't be mixed. In Doc2Vec, you must choose between DBOW (dm=0) and DM (dm=1) - they can't be mixed. But you can, when doing Doc2Vec DBOW, also add skip-gram word-training (with dbow_words=1).
The choice between hierarchical-softmax and negative-sampling is separate and independent of the above choices. It determines how target-word predictions are read from the neural-network.
With negative-sampling, every possible prediction is assigned a single output-node of the network. In order to improve what prediction a particular input context creates, it checks the output-nodes for the 'correct' word (of the current training example excerpt of the corpus), and for N other 'wrong' words (that don't match the current training example). It then nudges the network's internal weights and the input-vectors to make the 'correct' word output node activation a little stronger, and the N 'wrong' word output node activations a little weaker. (This is called a 'sparse' approach, because it avoids having to calculate every output node, which is very expensive in large vocabularies, instead just calculation N+1 nodes and ignoring the rest.)
You could set negative-sampling with 2 negative-examples with the parameter negative=2 (in Word2Vec or Doc2Vec, with any kind of input-context mode). The default mode, if no negative specified, is negative=5, following the default in the original Google word2vec.c code.
With hierarchical-softmax, instead of every preictable word having its own output node, some pattern of multiple output-node activations is interpreted to mean specific words. Which nodes should be closer to 1.0 or 0.0 in order to represent a word is matter of the word's encoding, which is calculated so that common words have short encodings (involving just a few nodes), while rare words will have longer encodings (involving more nodes). Again, this serves to save calculation time: to check if an input-context is driving just the right set of nodes to the right values to predict the 'correct' word (for the current training-example), just a few nodes need to be checked, and nudged, instead of the whole set.
You enable hierarchical-softmax in gensim with the argument hs=1. By default, it is not used.
You should generally disable negative-sampling, by supplying negative=0, if enabling hierarchical-softmax – typically one or the other will perform better for a given amount of CPU-time/RAM.
(However, following the architecture of the original Google word2vec.c code, it is possible but not recommended to have them both active at once, for example negative=5, hs=1. This will result in a larger, slower model, which might appear to perform better since you're giving it more RAM/time to train, but it's likely that giving equivalent RAM/time to just one or the other would be better.)
Hierarchical-softmax tends to get slower with larger vocabularies (because the average number of nodes involved in each training-example grows); negative-sampling does not (because it's always N+1 nodes). Projects with larger corpuses tend to trend towards preferring negative-sampling.

Related

Can I use a different corpus for fasttext build_vocab than train in Gensim Fasttext?

I am curious to know if there are any implications of using a different source while calling the build_vocab and train of Gensim FastText model. Will this impact the contextual representation of the word embedding?
My intention for doing this is that there is a specific set of words I am interested to get the vector representation for and when calling model.wv.most_similar. I only want words defined in this vocab list to get returned rather than all possible words in the training corpus. I would use the result of this to decide if I want to group those words to be relevant to each other based on similarity threshold.
Following is the code snippet that I am using, appreciate your thoughts if there are any concerns or implication with this approach.
vocab.txt contains a list of unique words of interest
corpus.txt contains full conversation text (i.e. chat messages) where each line represents a paragraph/sentence per chat
A follow up question to this is what values should I set for total_examples & total_words during training in this case?
from gensim.models.fasttext import FastText
model = FastText(min_count=1, vector_size=300,)
corpus_path = f'data/{client}-corpus.txt'
vocab_path = f'data/{client}-vocab.txt'
# Unsure if below counts should be based on the training corpus or vocab
corpus_count = get_lines_count(corpus_path)
total_words = get_words_count(corpus_path)
# build the vocabulary
model.build_vocab(corpus_file=vocab_path)
# train the model
model.train(corpus_file=corpus.corpus_path, epochs=100,
total_examples=corpus_count, total_words=total_words,
)
# save the model
model.save(f'models/gensim-fastext-model-{client}')

Incase someone has similar question, I'll paste the reply I got when asking this question in the Gensim Disussion Group for reference:
You can try it, but I wouldn't expect it to work well for most
purposes.
The build_vocab() call establishes the known vocabulary of the
model, & caches some stats about the corpus.
If you then supply another corpus – & especially one with more words
– then:
You'll want your train() parameters to reflect the actual size of your training corpus. You'll want to provide a true total_examples and total_words count that are accurate for the training-corpus.
Every word in the training corpus that's not in the know vocabulary is ignored completely, as if it wasn't even there. So you might as
well filter your corpus down to just the words-of-interest first, then
use that same filtered corpus for both steps. Will the example texts
still make sense? Will that be enough data to train meaningful,
generalizable word-vectors for just the words-of-interest, alongside
other words-of-interest, without the full texts? (You could look at
your pref-filtered corpus to get a sense of that.) I'm not sure - it
could depend on how severely trimming to just the words-of-interest
changed the corpus. In particular, to train high-dimensional dense
vectors – as with vector_size=300 – you need a lot of varied data.
Such pre-trimming might thin the corpus so much as to make the
word-vectors for the words-of-interest far less useful.
You could certainly try it both ways – pre-filtered to just your
words-of-interest, or with the full original corpus – and see which
works better on downstream evaluations.
More generally, if the concern is training time with the full corpus,
there are likely other ways to get an adequate model in an acceptable
amount of time.
If using corpus_file mode, you can increase workers to equal the
local CPU core count for a nearly-linear speedup from number of cores.
(In traditional corpus_iterable mode, max throughput is usually
somewhere in the 6-12 workers threads, as long as you ahve that many
cores.)
min_count=1 is usually a bad idea for these algorithms: they tend to
train faster, in less memory, leaving better vectors for the remaining
words when you discard the lowest-frequency words, as the default
min_count=5 does. (It's possible FastText can eke a little bit of
benefit out of lower-frequency words via their contribution to
character-n-gram-training, but I'd only ever lower the default
min_count if I could confirm it was actually improving relevant
results.
If your corpus is so large that training time is a concern, often a
more-aggressive (smaller) sample parameter value not only speeds
training (by dropping many redundant high-frequency words), but ofthen
improves final word-vector quality for downstream purposes as well (by
letting the rarer words have relatively more influence on the model in
the absense of the downsampled words).
And again if the corpus is so large that training time is a concern,
than epochs=100 is likely overkill. I believe the GoogleNews
vectors were trained using only 3 passes – over a gigantic corpus. A
sufficiently large & varied corpus, with plenty of examples of all
words all throughout, could potentially train in 1 pass – because each
word-vector can then get more total training-updates than many epochs
with a small corpus. (In general larger epochs values are more often
used when the corpus is thin, to eke out something – not on a corpus
so large you're considering non-standard shortcuts to speed the
steps.)
-- Gordon

How to fill in the blank using bidirectional RNN and pytorch?

I am trying to fill in the blank using a bidirectional RNN and pytorch.
The input will be like: The dog is _____, but we are happy he is okay.
The output will be like:
1. hyper (Perplexity score here)
2. sad (Perplexity score here)
3. scared (Perplexity score here)
I discovered this idea here: https://medium.com/#plusepsilon/the-bidirectional-language-model-1f3961d1fb27
import torch, torch.nn as nn
from torch.autograd import Variable
text = ['BOS', 'How', 'are', 'you', 'EOS']
seq_len = len(text)
batch_size = 1
embedding_size = 1
hidden_size = 1
output_size = 1
random_input = Variable(
torch.FloatTensor(seq_len, batch_size, embedding_size).normal_(), requires_grad=False)
bi_rnn = torch.nn.RNN(
input_size=embedding_size, hidden_size=hidden_size, num_layers=1, batch_first=False, bidirectional=True)
bi_output, bi_hidden = bi_rnn(random_input)
# stagger
forward_output, backward_output = bi_output[:-2, :, :hidden_size], bi_output[2:, :, hidden_size:]
staggered_output = torch.cat((forward_output, backward_output), dim=-1)
linear = nn.Linear(hidden_size * 2, output_size)
# only predict on words
labels = random_input[1:-1]
# for language models, use cross-entropy :)
loss = nn.MSELoss()
output = loss(linear(staggered_output), labels)
I am trying to reimplement the code above found at the bottom of the blog post. I am new to pytorch and nlp, and can't understand what the input and output to the code is.
Question about the input: I am guessing the input are the few words that are given. Why does one need beginning of sentence and end of sentence tags in this case? Why don't I see the input being a corpus on which the model is trained like other classic NLP problems? I would like to use the Enron email corpus to train the RNN.
Question about the output: I see the output is a tensor. My understanding is the tensor is a vector, so maybe a word vector in this case. How can you use the tensor to output the words themselves?

As this question is rather open-ended I will start from the last parts, moving towards the more general answer to the main question posed in the title.
Quick note: as pointed in the comments by #Qusai Alothman, you should find a better resource on the topic, this one is rather sparse when it comes to necessary informations.
Additional note: full code for the process described in the last section would take way too much space to provide as an exact answer, it would be more of a blog post. I will highlight possible steps one should take to create such a network with helpful links as we go along.
Final note: If there is anything dumb down there below (or you would like to expand the answer in any way or form, please do correct me/add info by posting a comment below).
Question about the input
Input here is generated from the random normal distribution and has no connection to the actual words. It is supposed to represent word embeddings, e.g. representation of words as numbers carrying semantic (this is important!) meaning (sometimes depending on the context as well (see one of the current State Of The Art approaches, e.g. BERT)).
Shape of the input
In your example it is provided as:
seq_len, batch_size, embedding_size,
where
seq_len - means length of a single sentence (varies across your
dataset), we will get to it later.
batch_size - how many sentences
should be processed in one step of forward pass (in case of
PyTorch it is the forward method of class inheriting from
torch.nn.Module)
embedding_size - vector with which one word is represented (it
might range from the usual 100/300 using word2vec up to 4096 or
so using the more recent approaches like the BERT mentioned
above)
In this case it's all hard-coded of size one, which is not really useful for a newcomer, it only outlines the idea that way.
Why does one need beginning of sentence and end of sentence tags in this case?
Correct me if I'm wrong, but you don't need it if your input is separated into sentences. It is used if you provide multiple sentences to the model, and want to indicate unambiguously the beginning and end of each (used with models which depend on the previous/next sentences, it seems to not be the case here). Those are encoded by special tokens (the ones which are not present in the entire corpus), so neural network "could learn" they represent end and beginning of sentence (one special token for this approach would be enough).
If you were to use serious dataset, I would advise to split your text using libraries like spaCy or nltk (the first one is a pleasure to use IMO), they do a really good job for this task.
You dataset might be already splitted into sentences, in those cases you are kind of ready to go.
Why don't I see the input being a corpus on which the model is trained like other classic NLP problems?
I don't recall models being trained on the corpuses as is, e.g. using strings. Usually those are represented by floating-points numbers using:
Simple approaches, e.g. Bag Of
Words or
TF-IDF
More sophisticated ones, which provide some information about word
relationships (e.g. king is more semantically related to queen
than to a, say, banana). Those were already linked above, some
other noticeable might be
GloVe or
ELMo and tons of other creative
approaches.
Question about the output
One should output indices into embeddings, which in turn correspond to words represented by a vector (more sophisticated approach mentioned above).
Each row in such embedding represents a unique word and it's respective columns are their unique representations (in PyTorch, first index might be reserved for the words for which a representation is unknown [if using pretrained embeddings], you may also delete those words, or represent them as aj average of sentence/document, there are some other viable approaches as well).
Loss provided in the example
# for language models, use cross-entropy :)
loss = nn.MSELoss()
For this task it makes no sense, as Mean Squared Error is a regression metric, not a classification one.
We want to use one for classification, so softmax should be used for multiclass case (we should be outputting numbers spanning [0, N], where N is the number of unique words in our corpus).
PyTorch's CrossEntropyLoss already takes logits (output of last layer without activation like softmax) and returns loss value for each example. I would advise this approach as it's numerically stable (and I like it as the most minimal one).
I am trying to fill in the blank using a bidirectional RNN and pytorch
This is a long one, I will only highlight steps I would undertake in order to create a model whose idea represents the one outlined in the post.
Basic preparation of dataset
You may use the one you mentioned above or start with something easier like 20 newsgroups from scikit-learn.
First steps should be roughly this:
scrape the metadata (if any) from your dataset (those might be HTML tags, some headers etc.)
split your text into sentences using a pre-made library (mentioned above)
Next, you would like to create your target (e.g. words to be filled) in each sentence.
Each word should be replaced by a special token (say <target-token>) and moved to target.
Example:
sentence: Neural networks can do some stuff.
would give us the following sentences and it's respective targets:
sentence: <target-token> networks can do some stuff. target: Neural
sentence: Neural <target-token> can do some stuff. target: networks
sentence: Neural networks <target-token> do some stuff. target: can
sentence: Neural networks can <target-token> some stuff. target: do
sentence: Neural networks can do <target-token> stuff. target: some
sentence: Neural networks can do some <target-token>. target: some
sentence: Neural networks can do some stuff <target-token> target: .
You should adjust this approach to the problem at hand by correcting typos if there are any, tokenizing, lemmatizing and others, experiment!
Embeddings
Each word in each sentence should be replaced by an integer, which in turn points to it embedding.
I would advise you to use a pre-trained one. spaCy provides word vectors, but another interesting approach I would highly recommend is in the open source library flair.
You may train your own, but it would take a lot of time + a lot of data for unsupervised training, and I think it is way beyond the scope of this question.
Data batching
One should use PyTorch's torch.utils.data.Dataset and torch.utils.data.DataLoader.
In my case, a good idea is was to provide custom collate_fn to DataLoader, which is responsible for creating padded batches of data (or represented as torch.nn.utils.rnn.PackedSequence already).
Important: currently, you have to sort the batch by length (word-wise) and keep the indices able to "unsort" the batch into it's original form, you should remember that during implementation. You may use torch.sort for that task. In future versions of PyTorch, there is a chance, one might not have to do that, see this issue.
Oh, and remember to shuffle your dataset using DataLoader, while we're at it.
Model
You should create a proper model by inheriting from torch.nn.Module. I would advise you to create a more general model, where you can provide PyTorch's cells (like GRU, LSTM or RNN), multilayered and bidirectional (as is described in the post).
Something along those lines when it comes to model construction:
import torch
class Filler(torch.nn.Module):
def __init__(self, cell, embedding_words_count: int):
self.cell = cell
# We want to output vector of N
self.linear = torch.nn.Linear(self.cell.hidden_size, embedding_words_count)
def forward(self, batch):
# Assuming batch was properly prepared before passing into the network
output, _ = self.cell(batch)
# Batch shape[0] is the length of longest already padded sequence
# Batch shape[1] is the length of batch, e.g. 32
# Here we create a view, which allows us to concatenate bidirectional layers in general manner
output = output.view(
batch.shape[0],
batch.shape[1],
2 if self.cell.bidirectional else 1,
self.cell.hidden_size,
)
# Here outputs of bidirectional RNNs are summed, you may concatenate it
# It makes up for an easier implementation, and is another often used approach
summed_bidirectional_output = output.sum(dim=2)
# Linear layer needs batch first, we have to permute it.
# You may also try with batch_first=True in self.cell and prepare your batch that way
# In such case no need to permute dimensions
linear_input = summed_bidirectional_output.permute(1, 0, 2)
return self.linear(embedding_words_count)
As you can see, information about shapes can be obtained in a general fashion. Such approach will allow you to create a model with how many layers you want, bidirectional or not (batch_first argument is problematic, but you can get around it too in a general way, left it out for improved clarity), see below:
model = Filler(
torch.nn.GRU(
# Size of your embeddings, for BERT it could be 4096, for spaCy's word2vec 300
input_size=300,
hidden_size=100,
num_layers=3,
batch_first=False,
dropout=0.4,
bidirectional=True,
),
# How many unique words are there in your dataset
embedding_words_count=10000,
)
You may pass torch.nn.Embedding into your model (if pretrained and already filled), create it from numpy matrix or plethora of other approaches, it's highly dependent how your structure your code exactly. Still, please, make your code more general, do not hardcode shapes unless it's totally necessary (usually it's not).
Remember it's only a showcase, you will have to tune and fix it on your own.
This implementation returns logits and no softmax layer is used. If you wish to calculate perplexity, you may have to add it in order to obtain a correct probability distribution across all possible vectors.
BTW: Here is some info on concatenation of bidirectional output of RNN.
Model training
I would highly recommend PyTorch ignite as it's quite customizable, you can log a lot of info using it, perform validation and abstract cluttering parts like for loops in training.
Oh, and split your model, training and others into separate modules, don't put everything into one unreadable file.
Final notes
This is the outline of how I would approach this problem, you may have more fun using attention networks instead of merely using the last output layer as in this example, though you shouldn't start with that.
And please check PyTorch's 1.0 documentation and do not follow blindly tutorials or blog posts you see online as they might be out of date really fast and quality of the code varies enormously. For example torch.autograd.Variable is deprecated as can be seen in the link.

How does doc2vec.infer_vector combine across words?

I trained a doc2vec model using train(..) with default settings. That worked, but now I'm wondering how infer_vector combines across input words, is it just the average of the individual word vectors?
model.random.seed(0)
model.infer_vector(['cat', 'hat'])
model.random.seed(0)
model.infer_vector(['cat'])
model.infer_vector(['hat']) #doesn't average up to the ['cat', 'hat'] vector
model.random.seed(0)
model.infer_vector(['hat'])
model.infer_vector(['cat']) #doesn't average up to the ['cat', 'hat'] vector
Those don't add up, so I'm wondering what I'm misunderstanding.

infer_vector() doesn't combine the vectors for your given tokens – and in some modes doesn't consider those tokens' vectors at all.
Rather, it considers the entire Doc2Vec model as being frozen against internal changes, and then assumes the tokens you've provided are an example text, with a previously untrained tag. Let's call this implied but unnamed tag X.
Using a training-like process, it tries to find a good vector for X. That is, it starts with a random vector (as it did for all tags in original training), then sees how well that vector as model-input predicts the text's words (by checking the model neural-network's predictions for input X). Then via incremental gradient descent it makes that candidate vector for X better and better at predicting the text's words.
After enough such inference-training, the vector will be about as good (given the rest of the frozen model) as it possibly can be at predicting the text's words. So even though you're providing that text as an "input" to the method, inside the model, what you've provided is used to pick target "outputs" of the algorithm for optimization.
Note that:
tiny examples (like one or a few words) aren't likely to give very meaningful results – they are sharp-edged corner cases, and the essential value of these sorts of dense embedded representations usually arises from the marginal balancing of many word-influences
it will probably help to do far more training-inference cycles than the infer_vector() default steps=5 – some have reported tens or hundreds of steps work best for them, and it may be especially valuable to use more steps with short texts
it may also help to use a starting alpha for inference more like that used in bulk training (alpha=0.025), rather than the infer_vector() default (alpha=0.1)

How should I interpret "size" parameter in Doc2Vec function of gensim?

I am using Doc2Vec function of gensim in Python to convert a document to a vector.
An example of usage
model = Doc2Vec(documents, size=100, window=8, min_count=5, workers=4)
How should I interpret the size parameter. I know that if I set size = 100, the length of output vector will be 100, but what does it mean? For instance, if I increase size to 200, what is the difference?

Word2Vec captures distributed representation of a word which essentially means, multiple neurons capture a single concept (concept can be word meaning/sentiment/part of speech etc.), and also a single neuron contributes to multiple concepts.
These concepts are automatically learnt and not pre-defined, hence you can think of them as latent/hidden. Also for the same reason, the word vectors can be used for multiple applications.
More is the size parameter, more will be the capacity of your neural network to represent these concepts, but more data will be required to train these vectors (as they are initialised randomly). In absence of sufficient number of sentences/computing power, its better to keep the size small.
Doc2Vec follows slightly different neural network architecture as compared to Word2Vec, but the meaning of size is analogous.

The difference is the detail, that the model can capture. Generally, the more dimensions you give Word2Vec, the better the model - up to a certain point.
Normally the size is between 100-300. You always have to consider that more dimensions also mean, that more memory is needed.

Training a Machine Learning predictor

I have been trying to build a prediction model using a user’s data. Model’s input is documents’ metadata (date published, title etc) and document label is that user’s preference (like/dislike). I would like to ask some questions that I have come across hoping for some answers:
There are way more liked documents than disliked. I read somewhere that if somebody train’s a model using way more inputs of one label than the other this affects the performance in a bad way (model tends to classify everything to the label/outcome that has the majority of inputs
Is there possible to have input to a ML algorithm e.g logistic regression be hybrid in terms of numbers and words and how that could be done, sth like:
input = [18,23,1,0,’cryptography’] with label = [‘Like’]
Also can we use a vector ( that represents a word, using tfidf etc) as an input feature (e.g. 50-dimensions vector) ?
In order to construct a prediction model using textual data the only way to do so is by deriving a dictionary out of every word mentioned in our documents and then construct a binary input that will dictate if a term is mentioned or not? Using such a version though we lose the weight of the term in the collection right?
Can we use something as a word2vec vector as a single input in a supervised learning model?
Thank you for your time.

You either need to under-sample the bigger class (take a small random sample to match the size of the smaller class), over-sample the smaller class (bootstrap sample), or use an algorithm that supports unbalanced data - and for that you'll need to read the documentation.
You need to turn your words into a word vector. Columns are all the unique words in your corpus. Rows are the documents. Cell values are one of: whether the word appears in the document, the number of times it appears, the relative frequency of its appearance, or its TFIDF score. You can then have these columns along with your other non-word columns.
Now you probably have more columns than rows, meaning you'll get a singularity with matrix-based algorithms, in which case you need something like SVM or Naive Bayes.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.