Inner workings of Gensim Word2Vec - python

I have a couple of questions regarding Gensim's Word2Vec model.
The first is: what happens if I set it to train for 0 epochs? Does it just create the random vectors and call it done, so they would have to be random every time, correct?
The second concerns the wv object, which the doc page describes as follows:
This object essentially contains the mapping between words and embeddings.
After training, it can be used directly to query those embeddings in various ways.
See the module level docstring for examples.
But that is not clear to me, so allow me to explain: I have my own word vectors, which I substitute in with
word2vecObject.wv['word'] = my_own
Then I call the train method with those replacement word vectors. But I would like to know which part I am replacing: is it the input-to-hidden weight layer or the hidden-to-output one? This is to check whether it can be called pre-training or not. Any help? Thank you.

I've not tried the nonsense parameter epochs=0, but it might behave as you expect. (Have you tried it and seen otherwise?)
However, if your real goal is to be able to tamper with the model after initialization, but before training, the usual way to do that is to not supply any corpus when constructing the model instance, and instead manually do the two followup steps, .build_vocab() & .train(), in your own code - inserting extra steps between the two. (For even finer-grained control, you can examine the source of .build_vocab() & its helper methods, and simply ensure you do all those necessary things, with your own extra steps interleaved.)
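A minimal sketch of that two-step flow, assuming a toy corpus of tokenized sentences (parameter names follow recent gensim releases; older versions use size= instead of vector_size=):
from gensim.models import Word2Vec

# Toy corpus; any iterable of token lists works.
corpus = [["hello", "world"], ["another", "sentence"]]

# No corpus at construction time, so nothing is trained yet.
model = Word2Vec(vector_size=100, min_count=1)

model.build_vocab(corpus)   # step 1: scan the vocabulary, allocate weights

# ... tamper with the initialized-but-untrained model here ...

model.train(
    corpus,
    total_examples=model.corpus_count,  # required when calling train() yourself
    epochs=model.epochs,
)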
The "word vectors" in the .wv property of type KeyedVectors are essentially the "input projection layer" of the model: the data which converts a single word into a vector_size-dimensional dense embedding. (You can think of the keys – word token strings – as being somewhat like a one-hot word-encoding.)
So, assigning into that structure only changes that "input projection vector", which is the "word vector" usually collected from the model. If you need to tamper with the hidden-to-output weights, you need to look at the model's .syn1neg (or .syn1 for HS mode) property.
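To make the distinction concrete, here is a rough sketch of the two weight sets involved, continuing from the sketch above and assuming a model in the default negative-sampling mode (attribute names from recent gensim versions):
import numpy as np

# Input-projection layer ("word vectors") - this is what assigning into .wv touches.
print(model.wv.vectors.shape)   # (vocab_size, vector_size)

# Hidden-to-output weights for negative-sampling mode - untouched by .wv assignment.
print(model.syn1neg.shape)      # also (vocab_size, vector_size)

# Replacing one word's input vector, as in the question:
my_own = np.random.rand(model.vector_size).astype(np.float32)
model.wv['word'] = my_own       # only the input-projection row for 'word' changes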

Facebook fasttext bin model preprocessing

I have downloaded a .bin FastText model and loaded it as follows:
ft = fasttext.load_model('/content/drive/MyDrive/dataset/cc.en.300.bin')
How can I do preprocessing and normalization on the cc.en.300.bin model?
I want to do lemmatization, stopword removal and other operations.
Your question doesn't really make sense, given how FastText models are usually used, on a couple levels:
A pre-trained FastText model, like cc.en.300.bin, no longer has any original text inside it, as would normally be the input for preprocessing/normalization. It is the end result of someone else's training on a corpus they already prepared for the FastText model. Essentially, you're stuck with their choices of tokenization, normalization, & other preprocessing.
Because FastText models learn from the same kind of word-morphology (including roots/stems/alternate-forms) that's removed by stemming/lemmatization, generally the training texts used for FastText training aren't preprocessed in that way. And, you wouldn't perform any such transformation on word-tokens you want to look-up in such a pre-trained model.
The only way I can imagine your question representing a real need is if you already have some other texts that have been destructively preprocessed – such as by replacing words with their stems/lemmas – and now you want to look up those words in this model.
If that's your true need:
You should try to go back to texts that weren't destructively changed, and look up the original (non-lemmatized) words in the FastText model. That's the usual way to use FastText, and what I'd expect to work best. (For example, looking up each of 'walking', 'walked', 'walks', etc. will likely give better vectors than reducing them all to the lemma 'walk' and only looking that up; see the short lookup sketch after this list.) If you can't recover the original words...
You could try just looking up the lemmas/etc you have directly. The FastText model will give you its vector for that root word, synthesizing a guess-vector if necessary from the word-fragments it knows. That might work fine. Or...
You could conceivably iterate over all the model's known words, and map them to the alternate/normalized words of your preprocessing scheme. This would potentially map N known words to 1 new 'reduced' word. Then, create a new model where that single reduced word, in your new scheme, has just one vector – perhaps some sort of (weighted) average of all the other words it might have been, before the reductive canonicalization. But, I have a hard time imagining any situation where this extra complexity would offer any advantage over either (1) above, using FastText as intended by looking up the words that are already in the model, or (2) above, just settling for the vector the model already gives for your reduced word. So really, don't do anything like this unless you've got some good reason.
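As a short illustration of options (1) and (2) – just looking words up, lemmatized or not – using the fasttext library and the model path from the question:
import fasttext

ft = fasttext.load_model('/content/drive/MyDrive/dataset/cc.en.300.bin')

# Option (1): look up the original, unlemmatized surface forms directly.
vec_walking = ft.get_word_vector('walking')
vec_walked = ft.get_word_vector('walked')

# Option (2): look up a lemma; FastText synthesizes a vector from character
# n-grams even if the exact token is rare or out-of-vocabulary.
vec_walk = ft.get_word_vector('walk')

print(vec_walking.shape)   # (300,) for this 300-dimensional model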

How to fill in the blank using bidirectional RNN and pytorch?

I am trying to fill in the blank using a bidirectional RNN and pytorch.
The input will be like: The dog is _____, but we are happy he is okay.
The output will be like:
1. hyper (Perplexity score here)
2. sad (Perplexity score here)
3. scared (Perplexity score here)
I discovered this idea here: https://medium.com/@plusepsilon/the-bidirectional-language-model-1f3961d1fb27
import torch, torch.nn as nn
from torch.autograd import Variable
text = ['BOS', 'How', 'are', 'you', 'EOS']
seq_len = len(text)
batch_size = 1
embedding_size = 1
hidden_size = 1
output_size = 1
random_input = Variable(
    torch.FloatTensor(seq_len, batch_size, embedding_size).normal_(), requires_grad=False)
bi_rnn = torch.nn.RNN(
    input_size=embedding_size, hidden_size=hidden_size, num_layers=1, batch_first=False, bidirectional=True)
bi_output, bi_hidden = bi_rnn(random_input)
# stagger
forward_output, backward_output = bi_output[:-2, :, :hidden_size], bi_output[2:, :, hidden_size:]
staggered_output = torch.cat((forward_output, backward_output), dim=-1)
linear = nn.Linear(hidden_size * 2, output_size)
# only predict on words
labels = random_input[1:-1]
# for language models, use cross-entropy :)
loss = nn.MSELoss()
output = loss(linear(staggered_output), labels)
I am trying to reimplement the code above, found at the bottom of the blog post. I am new to pytorch and nlp, and can't understand what the input and output of the code are.
Question about the input: I am guessing the input are the few words that are given. Why does one need beginning of sentence and end of sentence tags in this case? Why don't I see the input being a corpus on which the model is trained like other classic NLP problems? I would like to use the Enron email corpus to train the RNN.
Question about the output: I see the output is a tensor. My understanding is the tensor is a vector, so maybe a word vector in this case. How can you use the tensor to output the words themselves?
As this question is rather open-ended I will start from the last parts, moving towards the more general answer to the main question posed in the title.
Quick note: as pointed out in the comments by @Qusai Alothman, you should find a better resource on the topic; this one is rather sparse when it comes to the necessary information.
Additional note: full code for the process described in the last section would take way too much space to provide as an exact answer, it would be more of a blog post. I will highlight possible steps one should take to create such a network with helpful links as we go along.
Final note: if there is anything dumb down below, or you would like to expand the answer in any way or form, please do correct me/add info by posting a comment below.
Question about the input
The input here is generated from a random normal distribution and has no connection to actual words. It is supposed to represent word embeddings, i.e. representations of words as numbers that carry semantic meaning (this is important!), sometimes depending on the context as well (see one of the current state-of-the-art approaches, e.g. BERT).
Shape of the input
In your example it is provided as:
seq_len, batch_size, embedding_size,
where
seq_len - means length of a single sentence (varies across your dataset), we will get to it later.
batch_size - how many sentences should be processed in one step of the forward pass (in case of PyTorch it is the forward method of a class inheriting from torch.nn.Module)
embedding_size - the size of the vector with which one word is represented (it might range from the usual 100/300 using word2vec up to 4096 or so using more recent approaches like the BERT mentioned above)
In this case everything is hard-coded to size one, which is not really useful for a newcomer; it only outlines the idea that way.
Why does one need beginning of sentence and end of sentence tags in this case?
Correct me if I'm wrong, but you don't need them if your input is already separated into sentences. They are used if you provide multiple sentences to the model and want to indicate unambiguously the beginning and end of each (used with models which depend on the previous/next sentences; that does not seem to be the case here). They are encoded by special tokens (ones which are not present anywhere else in the corpus), so the neural network "can learn" that they represent the end and beginning of a sentence (one special token for this approach would be enough).
If you were to use a serious dataset, I would advise splitting your text into sentences using libraries like spaCy or nltk (the first one is a pleasure to use IMO); they do a really good job at this task.
Your dataset might already be split into sentences, in which case you are pretty much ready to go.
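For example, a minimal sentence-splitting sketch with spaCy (assuming the small English model en_core_web_sm is installed):
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Neural networks can do some stuff. We split this text into sentences.")
sentences = [sent.text for sent in doc.sents]
print(sentences)   # ['Neural networks can do some stuff.', 'We split this text into sentences.']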
Why don't I see the input being a corpus on which the model is trained like other classic NLP problems?
I don't recall models being trained on the corpora as is, e.g. using raw strings. Usually those are represented by floating-point numbers using:
Simple approaches, e.g. Bag Of Words or TF-IDF
More sophisticated ones, which provide some information about word relationships (e.g. king is more semantically related to queen than to, say, banana). Those were already linked above; some other notable ones might be GloVe or ELMo, and tons of other creative approaches.
Question about the output
One should output indices into the embeddings, which in turn correspond to words represented by vectors (the more sophisticated approach mentioned above).
Each row in such an embedding represents a unique word, and its columns are that word's unique representation (in PyTorch, the first index might be reserved for words whose representation is unknown [if using pretrained embeddings]; you may also delete those words, or represent them as an average of the sentence/document; there are some other viable approaches as well).
Loss provided in the example
# for language models, use cross-entropy :)
loss = nn.MSELoss()
For this task it makes no sense, as Mean Squared Error is a regression metric, not a classification one.
We want one suited for classification, so softmax should be used for the multiclass case (our targets are class indices in [0, N), where N is the number of unique words in our corpus).
PyTorch's CrossEntropyLoss already takes logits (output of last layer without activation like softmax) and returns loss value for each example. I would advise this approach as it's numerically stable (and I like it as the most minimal one).
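A minimal sketch of that loss on made-up logits and targets (the vocabulary size and batch size below are arbitrary placeholders):
import torch
import torch.nn as nn

vocab_size = 10000   # hypothetical number of unique words
batch_size = 32

# Raw, unnormalized scores ("logits") straight from the last linear layer.
logits = torch.randn(batch_size, vocab_size)
# Target word indices in [0, vocab_size - 1].
targets = torch.randint(0, vocab_size, (batch_size,))

criterion = nn.CrossEntropyLoss()   # applies log-softmax internally
loss = criterion(logits, targets)
print(loss.item())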
I am trying to fill in the blank using a bidirectional RNN and pytorch
This is a long one, I will only highlight steps I would undertake in order to create a model whose idea represents the one outlined in the post.
Basic preparation of dataset
You may use the one you mentioned above or start with something easier like 20 newsgroups from scikit-learn.
First steps should be roughly this:
strip the metadata (if any) from your dataset (that might be HTML tags, some headers, etc.)
split your text into sentences using a pre-made library (mentioned above)
Next, you would like to create your target (i.e. the words to be filled in) for each sentence.
Each word should be replaced by a special token (say <target-token>) and moved to target.
Example:
sentence: Neural networks can do some stuff.
would give us the following sentences and their respective targets:
sentence: <target-token> networks can do some stuff. target: Neural
sentence: Neural <target-token> can do some stuff. target: networks
sentence: Neural networks <target-token> do some stuff. target: can
sentence: Neural networks can <target-token> some stuff. target: do
sentence: Neural networks can do <target-token> stuff. target: some
sentence: Neural networks can do some <target-token>. target: stuff
sentence: Neural networks can do some stuff <target-token> target: .
You should adjust this approach to the problem at hand (correcting typos if there are any, tokenizing, lemmatizing and so on), so experiment!
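A small sketch of that target-creation step on the example sentence above (the helper name and mask token are just placeholders):
def make_targets(tokens, mask_token="<target-token>"):
    # Yield one (masked sentence, target word) pair per token.
    for i, target in enumerate(tokens):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        yield " ".join(masked), target

sentence = ["Neural", "networks", "can", "do", "some", "stuff", "."]
for masked, target in make_targets(sentence):
    print(f"sentence: {masked}  target: {target}")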
Embeddings
Each word in each sentence should be replaced by an integer, which in turn points to its embedding.
I would advise you to use a pre-trained one. spaCy provides word vectors, but another interesting approach I would highly recommend is in the open source library flair.
You may train your own, but it would take a lot of time + a lot of data for unsupervised training, and I think it is way beyond the scope of this question.
Data batching
One should use PyTorch's torch.utils.data.Dataset and torch.utils.data.DataLoader.
In my case, a good idea was to provide a custom collate_fn to the DataLoader, which is responsible for creating padded batches of data (or batches already represented as torch.nn.utils.rnn.PackedSequence).
Important: currently, you have to sort the batch by length (word-wise) and keep the indices needed to "unsort" the batch into its original form; remember that during implementation. You may use torch.sort for that task. In future versions of PyTorch there is a chance one might not have to do that, see this issue.
Oh, and remember to shuffle your dataset using DataLoader, while we're at it.
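As a rough sketch of such a collate_fn, assuming the dataset yields (token-index tensor, target index) pairs and that 0 is the padding index (my_dataset below is a hypothetical torch.utils.data.Dataset):
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def collate_fn(batch):
    # batch is a list of (token_index_tensor, target_index) pairs.
    sequences, targets = zip(*batch)
    lengths = torch.tensor([len(s) for s in sequences])
    # Sort by length (descending) and keep the indices so the batch can be "unsorted" later.
    lengths, sort_idx = lengths.sort(descending=True)
    sequences = [sequences[i] for i in sort_idx]
    targets = torch.tensor([targets[i] for i in sort_idx])
    # Pad to the longest sequence in the batch; shape (max_len, batch_size).
    padded = pad_sequence(sequences, batch_first=False, padding_value=0)
    return padded, lengths, targets, sort_idx

# loader = DataLoader(my_dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)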
Model
You should create a proper model by inheriting from torch.nn.Module. I would advise you to create a more general model, where you can provide PyTorch's cells (like GRU, LSTM or RNN), multilayered and bidirectional (as is described in the post).
Something along those lines when it comes to model construction:
import torch


class Filler(torch.nn.Module):
    def __init__(self, cell, embedding_words_count: int):
        super().__init__()
        self.cell = cell
        # We want to output a vector of N scores (one per word in the vocabulary)
        self.linear = torch.nn.Linear(self.cell.hidden_size, embedding_words_count)

    def forward(self, batch):
        # Assuming batch was properly prepared before passing into the network
        output, _ = self.cell(batch)
        # batch.shape[0] is the length of the longest already-padded sequence
        # batch.shape[1] is the batch size, e.g. 32
        # Here we create a view, which allows us to handle bidirectional layers in a general manner
        output = output.view(
            batch.shape[0],
            batch.shape[1],
            2 if self.cell.bidirectional else 1,
            self.cell.hidden_size,
        )
        # Here the outputs of the bidirectional RNN are summed; you may concatenate them instead
        # Summing makes for an easier implementation, and is another often-used approach
        summed_bidirectional_output = output.sum(dim=2)
        # The linear layer needs batch first, so we have to permute the dimensions.
        # You may also try batch_first=True in self.cell and prepare your batch that way;
        # in such a case there is no need to permute dimensions.
        linear_input = summed_bidirectional_output.permute(1, 0, 2)
        return self.linear(linear_input)
As you can see, information about shapes can be obtained in a general fashion. Such an approach will allow you to create a model with as many layers as you want, bidirectional or not (the batch_first argument is problematic, but you can get around it in a general way too; I left it out for improved clarity), see below:
model = Filler(
    torch.nn.GRU(
        # Size of your embeddings, for BERT it could be 4096, for spaCy's word2vec 300
        input_size=300,
        hidden_size=100,
        num_layers=3,
        batch_first=False,
        dropout=0.4,
        bidirectional=True,
    ),
    # How many unique words there are in your dataset
    embedding_words_count=10000,
)
You may pass a torch.nn.Embedding into your model (if pretrained and already filled), create it from a numpy matrix, or use a plethora of other approaches; it's highly dependent on how exactly you structure your code. Still, please make your code more general, and do not hardcode shapes unless it's totally necessary (it usually isn't).
Remember it's only a showcase, you will have to tune and fix it on your own.
This implementation returns logits, and no softmax layer is used. If you wish to calculate perplexity, you may have to add one in order to obtain a correct probability distribution across all possible words.
BTW: Here is some info on concatenation of bidirectional output of RNN.
Model training
I would highly recommend PyTorch Ignite, as it's quite customizable: you can log a lot of info using it, perform validation, and abstract away cluttering parts like the for loops in training.
Oh, and split your model, training and others into separate modules, don't put everything into one unreadable file.
Final notes
This is the outline of how I would approach this problem. You may have more fun using attention networks instead of merely using the last output layer as in this example, though you shouldn't start with that.
And please check PyTorch's 1.0 documentation, and do not blindly follow tutorials or blog posts you see online, as they can go out of date really fast and the quality of the code varies enormously. For example, torch.autograd.Variable is deprecated, as can be seen in the link.

Can I build the vocabulary twice with gensim word2vec or doc2vec?

I have two different corpora, and what I want is to train the model with both. To do that, I thought it could be something like this:
model.build_vocab(sentencesCorpus1)
model.build_vocab(sentencesCorpus2)
Would it be right?
No: each time you call build_vocab(corpus), like that, it creates a fresh vocabulary from scratch – discarding any prior vocabulary.
You can provide an optional argument to build_vocab(), update=True, which tries to add to the existing vocabulary. However:
it wasn't designed/tested with Doc2Vec in mind, and as of right now (February 2018), using it with Doc2Vec is unlikely to work and often causes memory-fault crashes. (See https://github.com/RaRe-Technologies/gensim/issues/1019.)
it's still best to train() with all available data together - any sort of multiple-calls to train(), with differing data subsets each time, introduces other murky tradeoffs in model quality/correctness that are easy to get wrong. (And, when calling train(), be sure to provide correct values for its required parameters – the practices shown in most online examples are typically only correct for the case where build_vocab() was called once, with exactly the same texts as later calling train().)
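A minimal sketch contrasting the two approaches, reusing the corpus names from the question (parameter names follow recent gensim; older versions use size= instead of vector_size=, and the Doc2Vec caveat above still applies to update=True):
from gensim.models import Word2Vec

sentencesCorpus1 = [["first", "corpus", "sentence"]]
sentencesCorpus2 = [["second", "corpus", "sentence"]]

model = Word2Vec(vector_size=100, min_count=1)

# Preferred: one vocabulary scan and one training pass over all data together.
combined = sentencesCorpus1 + sentencesCorpus2
model.build_vocab(combined)
model.train(combined, total_examples=model.corpus_count, epochs=model.epochs)

# The incremental path (Word2Vec only, with the caveats above):
# model.build_vocab(sentencesCorpus2, update=True)
# model.train(sentencesCorpus2, total_examples=len(sentencesCorpus2), epochs=model.epochs)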

How does doc2vec.infer_vector combine across words?

I trained a doc2vec model using train(..) with default settings. That worked, but now I'm wondering how infer_vector combines across input words: is it just the average of the individual word vectors?
model.random.seed(0)
model.infer_vector(['cat', 'hat'])
model.random.seed(0)
model.infer_vector(['cat'])
model.infer_vector(['hat']) #doesn't average up to the ['cat', 'hat'] vector
model.random.seed(0)
model.infer_vector(['hat'])
model.infer_vector(['cat']) #doesn't average up to the ['cat', 'hat'] vector
Those don't add up, so I'm wondering what I'm misunderstanding.
infer_vector() doesn't combine the vectors for your given tokens – and in some modes doesn't consider those tokens' vectors at all.
Rather, it considers the entire Doc2Vec model as being frozen against internal changes, and then assumes the tokens you've provided are an example text, with a previously untrained tag. Let's call this implied but unnamed tag X.
Using a training-like process, it tries to find a good vector for X. That is, it starts with a random vector (as it did for all tags in original training), then sees how well that vector as model-input predicts the text's words (by checking the model neural-network's predictions for input X). Then via incremental gradient descent it makes that candidate vector for X better and better at predicting the text's words.
After enough such inference-training, the vector will be about as good (given the rest of the frozen model) as it possibly can be at predicting the text's words. So even though you're providing that text as an "input" to the method, inside the model, what you've provided is used to pick target "outputs" of the algorithm for optimization.
Note that:
tiny examples (like one or a few words) aren't likely to give very meaningful results – they are sharp-edged corner cases, and the essential value of these sorts of dense embedded representations usually arises from the marginal balancing of many word-influences
it will probably help to do far more training-inference cycles than the infer_vector() default steps=5 – some have reported tens or hundreds of steps work best for them, and it may be especially valuable to use more steps with short texts
it may also help to use a starting alpha for inference more like that used in bulk training (alpha=0.025), rather than the infer_vector() default (alpha=0.1)
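For example, a sketch of inference with more steps and a bulk-training-like starting alpha (newer gensim versions rename the steps parameter to epochs):
# Assuming `model` is the trained Doc2Vec instance from the question:
vec = model.infer_vector(
    ['cat', 'hat'],
    alpha=0.025,   # starting alpha closer to the bulk-training default
    steps=100,     # many more inference passes than the default 5
)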

implementing a perceptron classifier

Hi, I'm pretty new to Python and to NLP. I need to implement a perceptron classifier. I searched through some websites but didn't find enough information. For now I have a number of documents which I grouped according to category (sports, entertainment, etc.). I also have a list of the most used words in these documents along with their frequencies. On a particular website it was stated that I must have some sort of a decision function accepting arguments x and w. x apparently is some sort of vector (I don't know what w is). But I don't know how to use the information I have to build the perceptron algorithm and how to use it to classify my documents. Have you got any ideas? Thanks :)
What a perceptron looks like
From the outside, a perceptron is a function that takes n arguments (i.e. an n-dimensional vector) and produces m outputs (i.e. an m-dimensional vector).
On the inside, a perceptron consists of layers of neurons, such that each neuron in a layer receives input from all neurons of the previous layer and uses that input to calculate a single output. The first layer consists of n neurons and it receives the input. The last layer consists of m neurons and holds the output after the perceptron has finished processing the input.
How the output is calculated from the input
Each connection from a neuron i to a neuron j has a weight w(i,j) (I'll explain later where they come from). The total input of a neuron p of the second layer is the sum of the weighted output of the neurons from the first layer. So
total_input(p) = Σ(output(k) * w(k,p))
where k runs over all neurons of the first layer. The activation of a neuron is calculated from the total input of the neuron by applying an activation function. An often used activation function is the Fermi function, so
activation(p) = 1/(1+exp(-total_input(p))).
The output of a neuron is calculated from the activation of the neuron by applying an output function. An often used output function is the identity f(x) = x (and indeed some authors see the output function as part of the activation function). I will just assume that
output(p) = activation(p)
When the output of all neurons of the second layer is calculated, use that output to calculate the output of the third layer. Iterate until you reach the output layer.
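As a small numeric sketch of those formulas for a single layer (the outputs and weights below are made up):
import numpy as np

def fermi(x):
    # Logistic (Fermi) activation: 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def layer_output(prev_output, weights):
    # prev_output: outputs of the previous layer, shape (k,)
    # weights: w(k, p) connecting neuron k to neuron p, shape (k, m)
    total_input = prev_output @ weights    # total_input(p) = sum over k of output(k) * w(k, p)
    return fermi(total_input)              # output(p) = activation(p)

prev = np.array([0.2, 0.7, 0.1])           # outputs of a 3-neuron layer
w = np.random.rand(3, 2)                   # weights into a 2-neuron layer
print(layer_output(prev, w))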
Where the weights come from
At first the weights are chosen randomly. Then you select some examples (for which you know the desired output). Feed each example to the perceptron and calculate the error, i.e. how far off the actual output is from the desired output. Use that error to update the weights. One of the fastest algorithms for calculating the new weights is Resilient Propagation.
How to construct a Perceptron
Some questions you need to address are
What are the relevant characteristics of the documents and how can they be encoded into an n-dimensional vector?
Which examples should be chosen to adjust the weights?
How shall the output be interpreted to classify a document? Example: A single output that yields the most likely class versus a vector that assigns probabilities to each class.
How many hidden layers are needed and how large should they be? I recommend starting with one hidden layer with n neurons.
The first and second points are very critical to the quality of the classifier. The perceptron might classify the examples correctly but fail on new documents. You will probably have to experiment. To determine the quality of the classifier, choose two sets of examples; one for training, one for validation. Unfortunately I cannot give you more detailed hints to answering these questions due to lack of practical experience.
I think that trying to solve an NLP problem with a Neural Network when you're not familiar with either might be a step too far. That you're doing it in a new language is the least of your worries.
I'll link you to the Neural Computation module slides that get taught at my university. You'll want the slides from session 1 and session 2 in week 2. Right at the bottom of the page is a link to how to implement a neural network in C. With a few modifications you should be able to port it to Python. You should note that it details how to implement a multilayer perceptron. You only need to implement a single-layer perceptron, so ignore anything that talks about hidden layers.
A quick explanation of x and w. Both x and w are vectors. x is the input vector. x contains normalised frequencies for each word you are concerned about. w contains weights for each word you are concerned with. The perceptron works by multiplying the input frequency for each word by its respective weight and summing them up. It passes the result to a function (typically a sigmoid function) that turns the result into a value between 0 and 1. 1 means the perceptron is positive that the inputs are an instance of the class it represents and 0 means it is sure that the inputs really aren't an example of its class.
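A tiny sketch of that decision function (the frequencies and weights below are made up):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decision(x, w, b=0.0):
    # x: normalized word frequencies; w: one weight per word; b: bias.
    return sigmoid(np.dot(x, w) + b)   # near 1 -> in the class, near 0 -> not

x = np.array([0.10, 0.00, 0.35])   # hypothetical normalized frequencies
w = np.array([1.5, -2.0, 0.8])     # hypothetical learned weights
print(decision(x, w))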
With NLP you typically learn about the bag of words model first, before moving on to other, more complex, models. With a neural network, hopefully, it will learn its own model. The problem with this is that the neural network will not give you much of an understanding of NLP, other than documents can be classified by the words they contain, and that usually the number and type of words in a document contains most of the information you need to classify a document -- context and grammar do not add much extra detail.
Anyway, I hope that gives a better place from which to start your project. If you're still stuck on a particular part then ask again and I'll do my best to help.
You should take a look at this survey paper on text classification by Fabrizio Sebastiani. It tells you all of the best ways to do text classification.
Now, I'm not going to bother you to read the whole thing, but there's one table near the end, where he compares how lots of different people's techniques stack up on lots of different test corpora. Find it, pick the best one (the best perceptron one, if your assignment is specifically to learn how to do this with a perceptron), and read the paper he cites that describes that method in detail.
You now know how to construct a good topical text classifier.
Turning the algorithm that Oswald gave you (and that you posted in your other question) into code is a Small Matter of Programming (TM). And if you encounter unfamiliar terms like TF-IDF while you're working, ask your teacher to help you by explaining those terms.
Multilayer perceptrons (a specific neural-net architecture for the general classification problem) are now available for Python from the GraphLab folks:
https://dato.com/products/create/docs/generated/graphlab.deeplearning.MultiLayerPerceptrons.html#graphlab.deeplearning.MultiLayerPerceptrons
I had a try at implementing something similar the other day. I made some code to recognize english looking text vs non-english. I hadn't done AI or statistics in many years, so it was a bit of a shotgun attempt.
My code is here (don't want to bloat the post): http://cnippit.com/content/perceptron-statistically-recognizing-english
Inputs:
I take a text file, split it up into tri-grams (eg "abcdef" => ["abc", "bcd", "cde", "def"])
I calculate the relative frequencies of each, and feed that as the inputs to the perceptron (so there are 26^3 inputs)
Despite me not really knowing what I was doing, it seems to work fairly well. The success depends quite heavily on the training data though. I was getting poor results until I trained it on more French/Spanish/German text etc.
It's a very small example though, with lots of "lucky guesses" at values (eg. initial weights, bias, threshold, etc.).
Multiple classes:
If you have multiple classes you want to distinguish between (i.e. not as simple as "is A or NOT-A"), then one approach is to use a perceptron for each class. E.g. one for sport, one for news, etc.
Train the sport-perceptron on data grouped as either sport or NOT-sport. Similarly for news or NOT-news, etc.
When classifying new data, you pass your input to all perceptrons, and whichever one returns true (or "fires"), then that's the class the data belongs to.
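A rough sketch of that one-perceptron-per-class scheme (the class names, weights and input vector below are all made up):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight vector per class, learned separately on "class vs NOT-class" data.
perceptrons = {
    "sport": np.array([2.0, -0.5, 0.1]),
    "news": np.array([-1.0, 1.8, 0.3]),
}

def classify(x):
    # Feed the input to every class perceptron and keep the one that fires strongest.
    scores = {label: sigmoid(np.dot(x, w)) for label, w in perceptrons.items()}
    return max(scores, key=scores.get)

x = np.array([0.4, 0.1, 0.2])   # hypothetical feature vector (e.g. trigram frequencies)
print(classify(x))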
I used this approach way back in university, where we used a set of perceptrons for recognizing handwritten characters. It's simple and worked pretty effectively (>98% accuracy if I recall correctly).
