Gensim Word2Vec Training Data


I am currently trying to train my own word2vec model with my own training data and I am utterly confused about the training data preprocessing.
I ran a short script over my text that lemmatizes and lower-cases the words, so that a sentence (in German) like:
"Er hat heute zwei Birnen gegessen."
ends up in my training data as:
[er, haben, heute, zwei, birne, essen]
Translated into English, the sentence:
He ate two pears today.
results in:
[he, eat, two, pear, today]
Now the problem is: I haven't seen anyone do this to their training data. The words are kept in their original case and are not lemmatized, and I absolutely don't get how this works, especially for German, where verbs have so many inflections. Should I just leave them that way? I don't understand how it can work without lemmatization, since gensim doesn't even know which language it is being trained on, right?
So in short: Should I do lemmatization and/or lowercasing or just leave every word as it is?
Thanks a lot!

The answer depends on what you are going to use the embeddings for; however, they are usually trained on word forms. Word embeddings are typically pre-trained on very large datasets and cover vocabularies of up to 500k words, which typically covers most words in most forms, even in languages with much richer morphology than German.
You might also want to use FastText (also available in Gensim) instead of Word2Vec. FastText considers character n-gram statistics for the embeddings, so it generalizes better to regularities in morphology.
But in general, the data preprocessing always depends on how you plan to use the embeddings. If you want to do a quantitative diachronic study on how the meaning of words shifted over the 20th century, then embedding lemmas is a good idea. If you work with a low-resource language that has a good lemmatizer, it also might be a good idea. If you plan to use the embeddings as input for a downstream NLP model, then you should probably use word forms and/or already pre-trained embeddings.
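To make the point about character n-grams more concrete, here is a minimal sketch of the kind of subword units FastText-style models collect statistics over. The helper names `char_ngrams` and `ngram_overlap` and the 3-to-5 n-gram range are assumptions chosen for illustration; this is not FastText's or Gensim's API, only a picture of why inflected forms end up sharing information.

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Return the set of character n-grams of `word`, with boundary markers.

    Hypothetical helper for illustration; not part of gensim or FastText.
    """
    marked = f"<{word}>"  # '<' and '>' mark the start and end of the word
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams


def ngram_overlap(a, b):
    """Jaccard overlap between the n-gram sets of two word forms."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Two inflected forms of the same German noun share many n-grams,
# while an unrelated word shares few or none.
print(ngram_overlap("birne", "birnen"))
print(ngram_overlap("birne", "heute"))
```

Because "birne" and "birnen" share subword units, a model built on such units can relate them without any lemmatization step, which is the reason the lower-casing and lemmatizing in the question is optional rather than required.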

Related

NLP: check if a detected sentence is a complete sentence

In my NLP project I built my own model to identify sentences in a PDF document. Now I would like to check whether my extracted sentences are complete sentences. During my research I have already come across this question, but the solutions presented there allow quite a few false positives. Does anyone perhaps have a tip on how I can check whether a sentence is a complete sentence?

This is a non-trivial problem, so no approach will work in each and every case. You should also consider that whatever parser you use might merge or split sentences which in the original document were complete sentences, but after they are parsed are not any more.
As a general alternative to purely rule-based approaches, you could use a model which was trained on the CoLA (Corpus of Linguistic Acceptability) task. These models try to classify sentences/documents into the classes "linguistically acceptable" and "linguistically unacceptable".
On Hugging Face's model hub there are several pretrained transformer models for this; see for example this inference API for one which is a fine-tuned version of Facebook's RoBERTa model:
[Screenshots: a correct sentence and an incorrect sentence]
You should have a look at how the model was trained when it comes to bullet points/self-standing half sentences etc. though, as some scores might be surprising at first glance.
You might want to combine the model's results with a rule-based approach, for example: "The sentence is acceptable if the score is 0.95 or higher AND the sentence has at least 4 words AND ends with ., ? or !". You can then see what sentences your combination of model and rules produces, and keep modifying the rules until the results are to your satisfaction.
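The combined rule quoted above could be sketched roughly like this. The function name `is_complete_sentence` is hypothetical, and the acceptability score is assumed to come from whatever CoLA-style classifier you choose; no model is called here, and the scores in the usage lines are made-up placeholders.

```python
def is_complete_sentence(sentence, acceptability_score,
                         min_score=0.95, min_words=4):
    """Combine a classifier score with simple surface rules.

    `acceptability_score` is assumed to be the probability of the
    "acceptable" class from some CoLA-style model (not computed here).
    """
    text = sentence.strip()
    has_enough_words = len(text.split()) >= min_words
    ends_properly = text.endswith((".", "?", "!"))
    return (acceptability_score >= min_score
            and has_enough_words
            and ends_properly)


# Placeholder scores, not real model output.
print(is_complete_sentence("The model was trained on news text.", 0.98))
print(is_complete_sentence("Trained on news", 0.97))
```

Keeping the thresholds as parameters makes it easy to tune the rules against a sample of your own extracted sentences.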

Is it possible to use Google BERT to calculate similarity between two textual documents?

Is it possible to use Google BERT for calculating similarity between two textual documents? As I understand it, BERT's input is supposed to be sentences of limited size. Some works use BERT for similarity calculation for sentences, like:
https://github.com/AndriyMulyar/semantic-text-similarity
https://github.com/beekbin/bert-cosine-sim
Is there an implementation of BERT that works with large documents (thousands of words) instead of sentences as input?

BERT is not trained only to determine whether one sentence follows another. Sentence-pair judgments of that kind are just ONE part of GLUE, and there are a myriad more tasks. ALL of the GLUE tasks (and SuperGLUE) are getting knocked out of the park by ALBERT.
BERT (and ALBERT for that matter) is the absolute state of the art in Natural Language Understanding. Doc2Vec doesn't come close. BERT is not a bag-of-words method. It's a bidirectional, attention-based encoder built on the Transformer, the architecture introduced in the Google Brain paper Attention Is All You Need. Also see this visual breakdown of the Transformer model.
This is a fundamentally new way of looking at natural language which doesn't use RNNs or LSTMs or tf-idf or any of that stuff. We aren't relying on static word or document vectors anymore. GloVe (Global Vectors for Word Representation) with LSTMs is old. Doc2Vec is old.
BERT is really, really powerful. See SuperGLUE, which just came out. Scroll to the bottom and look at how hard those tasks are. THAT is where NLP is at.
Okay, so now that we have dispensed with the idea that tf-idf is state of the art: you want to take documents and look at their similarity? I would use ALBERT on Databricks in two layers:
First, perform either extractive or abstractive summarization to reduce each document down to a summary, e.g. with https://pypi.org/project/bert-extractive-summarizer/ (notice how big those documents of text are).
Then, in a separate step, take the summaries and do the STS-B task from page 3 of GLUE.
Now, we are talking about absolutely bleeding-edge technology here (ALBERT came out in just the last few months). You will need to be extremely proficient to get through this, but it CAN be done, and I believe in you!!

BERT is a sentence representation model. It is trained to predict words in a sentence and to decide if two sentences follow each other in a document, i.e., strictly on the sentence level. Moreover, BERT requires quadratic memory with respect to the input length which would not be feasible with documents.
It is quite common practice to average word embeddings to get a sentence representation. You can try the same thing with BERT and average the [CLS] vectors from BERT over sentences in a document.
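As a rough sketch of the averaging idea, the snippet below mean-pools per-sentence vectors into one document vector and compares two documents by cosine similarity. The three-dimensional vectors are toy placeholders; in practice each one would be the [CLS] vector produced by a BERT encoder for one sentence, which is assumed here and not computed. The helper names are illustrative only.

```python
import math


def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors component-wise."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two dense vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy stand-ins for per-sentence [CLS] vectors (one inner list per sentence).
doc_a = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
doc_b = [[1.0, 1.0, 2.0]]

doc_a_vec = mean_vector(doc_a)
doc_b_vec = mean_vector(doc_b)
print(cosine(doc_a_vec, doc_b_vec))
```

Averaging discards sentence order and gives every sentence equal weight, so it is best treated as a cheap baseline to compare against other document representations.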
There are some document-level embeddings. For instance doc2vec is a commonly used option.
As far as I know, at the document level, frequency-based vectors such as tf-idf (with a good implementation in scikit-learn) are still close to state of the art, so I would not hesitate to use them. At the very least, it is worth trying them to see how they compare to embeddings.
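For a feel of what a tf-idf document comparison involves, here is a small standard-library sketch. It uses raw term counts and one common smoothed idf variant, log((1 + N) / (1 + df)) + 1; the function names and the toy documents are assumptions for illustration, and a real project would more likely rely on scikit-learn's implementation as suggested above.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse tf-idf vectors (dicts) for a list of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


def cosine_sparse(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


docs = [
    "the cat sat on the mat".split(),
    "a cat lay on a mat".split(),
    "stock prices fell sharply today".split(),
]
vecs = tfidf_vectors(docs)
print(cosine_sparse(vecs[0], vecs[1]))  # documents sharing "cat", "on", "mat"
print(cosine_sparse(vecs[0], vecs[2]))  # documents with no shared terms
```

Unlike BERT, this scales to documents of any length, since each document is reduced to term weights regardless of how many words it contains.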

To add to @Jindřich's answer: BERT is meant to find missing words in a sentence and to predict the next sentence. Word-embedding-based doc2vec is still a good way to measure similarity between documents. If you want to delve deeper into why the best model is not always the best choice for a given use case, give this post a read; it clearly explains why not every state-of-the-art model is suitable for a task.

Yes. You would just do each part independently. For summarization you hardly need to do much: just search PyPI for "summarize" and you will find several packages. You don't even need to train anything. For sentence-to-sentence similarity there is a fairly complex method for computing the loss, but it's spelled out on the GLUE website. It's considered part of the challenge (meeting the metric). Determining that distance (STS) is non-trivial, and I think they call it "coherence", but I'm not sure.

Does gensim Doc2Vec distinguish between the same sentence with positive and negative context?

While learning Doc2Vec library, I got stuck on the following question.
Does gensim Doc2Vec distinguish between the same sentence with positive and negative context?
For Example:
Sentence A: "I love Machine Learning"
Sentence B: "I do not love Machine Learning"
If I train on sentences A and B with doc2vec and compute the cosine similarity between their vectors:
Will the model be able to distinguish the sentences and give a cosine similarity much less than 1, or even negative?
Or will the model place both sentences very close together in vector space and give a cosine similarity close to 1, since almost all the words are the same except for the negation (do not)?
Also, if I train only on sentence A and then infer a vector for sentence B, will both vectors be close to each other in vector space?
I would request the NLP community and Doc2Vec experts for helping me out in understanding this.
Thanks in Advance !!

Inherently, all that the 'Paragraph Vector' algorithm behind gensim Doc2Vec does is find a vector that (together with a neural-network) is good at predicting the words that appear in a text. So yes, texts with almost-identical words will have very close vectors. (There's no syntactic understanding that certain words, in certain places, have a big reversing-effect.)
However, even such vectors may be ok (though not state-of-the-art) at sentiment analysis. One of the ways the original 'Paragraph Vectors' paper evaluated the vector usability was estimating the sentiment of short movie reviews. (These were longer than a single sentence – into the hundreds of words.) When training a classifier on the doc-vectors, the classifier did a pretty good job, and better than other baseline techniques, at estimating the negativity/positivity of reviews.
Your single, tiny, contrived sentences could be harder – they're short with just a couple words' difference, so the vectors will be very close. But those different words (especially 'not') are often very indicative of sentiment – so the tiny difference might be enough to shift the vector from the 'positive' regions to the 'negative' regions.
So you'd have to try it, with a real training corpus of tens of thousands of varied text examples (because this technique doesn't work well on toy-sized datasets) and a post-vectorization classifier step.
Note also that in pure Doc2Vec, adding known labels (like 'positive' or 'negative') during training (alongside or instead of any unique document-ID based tags) can sometimes help the resulting vector-space be more sensitive to the distinction you want. And, other variant techniques like 'FastText' or 'StarSpace' more directly integrate known-labels into the vectorization in a way that might help.
The best results on short sentences, though, would probably take into account the relative ordering of words and grammatical parsing. You can see a demo of such a more-advanced technique at a page from Stanford's NLP research group:
http://nlp.stanford.edu:8080/sentiment/rntnDemo.html
Though look in the comments there for various examples of hard cases that it still struggles with.
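To see why near-identical wording leads to near-identical vectors, here is a deliberately simplified illustration using plain word counts instead of Doc2Vec. It is only an analogy under that assumption: Doc2Vec learns dense vectors rather than counts, but both are driven by which words appear in the text, not by syntax or negation scope.

```python
import math
from collections import Counter


def bow_cosine(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    dot = sum(c * counts_b[w] for w, c in counts_a.items())
    norm = (math.sqrt(sum(c * c for c in counts_a.values()))
            * math.sqrt(sum(c * c for c in counts_b.values())))
    return dot / norm if norm else 0.0


sim = bow_cosine("I love Machine Learning", "I do not love Machine Learning")
print(round(sim, 3))  # 0.816: high, despite the opposite meaning
```

The two sentences from the question differ only in "do not", so a word-presence view rates them as very similar. This is why a classifier trained on top of the vectors, or a method that models word order, is needed to pick up the negation.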

Use TextBlob to get the sentiment (polarity) of each sentence, after first tokenizing the text into sentences with an NLP library.

How to calculate the similarity of English words that do not appear in WordNet?

A common practice in natural language processing is to calculate the similarity between two words using WordNet. I start my question with the following Python code:
from nltk.corpus import wordnet
sport = wordnet.synsets("sport")[0]
badminton = wordnet.synsets("badminton")[0]
print(sport.wup_similarity(badminton))
We will get 0.8421
Now what if I look up "haha" and "lol" as follows:
haha = wordnet.synsets("haha")
lol = wordnet.synsets("lol")
print(haha)
print(lol)
We will get
[]
[]
Then we cannot consider the similarity between them. What can we do in this case?

You can create a semantic space from co-occurrence matrices using a tool like DISSECT (DIStributional SEmantics Composition Toolkit),
and then you are set to measure semantic similarity between words or phrases (if you compose words).
In your case, for haha and lol, you'll need to collect those co-occurrences.
Another thing to try is word2vec.
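The co-occurrence idea can be sketched in a few lines without any toolkit. This is not DISSECT's API, just a minimal standard-library illustration with a made-up three-sentence corpus and an assumed context window of two words; a usable semantic space would need a large corpus in which "haha" and "lol" actually occur.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """For each word, count the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two Counter-based sparse vectors."""
    dot = sum(c * v[t] for t, c in u.items())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


# Toy corpus, invented for illustration only.
corpus = [
    "haha that was so funny".split(),
    "lol that was so funny".split(),
    "the meeting starts at noon".split(),
]
vecs = cooccurrence_vectors(corpus)
print(cosine(vecs["haha"], vecs["lol"]))      # same contexts
print(cosine(vecs["haha"], vecs["meeting"]))  # no shared contexts
```

Words that WordNet does not know can still be compared this way, because similarity comes from the contexts they appear in rather than from a hand-built lexicon.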

There are two other possible approaches:
CBOW (continuous bag of words): predicts a word from its surrounding context words.
Skip-gram: the reverse of CBOW; it predicts the surrounding context words from a word.
Look at this: https://www.quora.com/What-are-the-continuous-bag-of-words-and-skip-gram-architectures-in-laymans-terms
These models are well presented here: https://www.tensorflow.org/tutorials/word2vec. Gensim is also a good Python library for this kind of thing.
Try looking for TensorFlow solutions, for example this: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/word2vec/word2vec_basic.py
Or read about word2vec: https://en.wikipedia.org/wiki/Word2vec

You can use other frameworks. I also tried NLTK but finally landed on spaCy (spacy.io), a very fast and functional framework. Words have a method called 'similarity' which compares them to other words (it also works for sentences, documents, etc.). It is implemented using word vectors. I don't actually know how big their vocabulary is or how it copes when a word is unknown, but it might be worth a try.
I was also playing a little bit with this one:
https://radimrehurek.com/gensim/models/word2vec.html
There, in two lines, you can load Google's big pre-trained word2vec model (the project ports Google's word2vec C++ library to Python), which is accessible here:
https://docs.google.com/file/d/0B7XkCwpI5KDYaDBDQm1tZGNDRHc/edit

There are different models for measuring similarity, such as word2vec or glove, but you seem to be looking more for a corpus which includes social, informal phrases like 'lol'.
However, I'm going to bring up word2vec because it leads to what I think is an answer to your question.
The foundational concept of word2vec (and other word embedding models like glove) is the representation of words in a vector space which incorporates relationships between words. This lends itself very well to measuring similarity, since vectors have lots of established math to draw from. You can read more about the technical details of word2vec in the original paper, but I quite like this blog post because it is well-written and concise.
Again, since word2vec is just a model, you need to pair it with the right training set to get the kind of scope you seek. There are some pre-trained models floating around on the web, such as this bunch. The training set is really what allows you to query a larger variety of terms, rather than the model.
You can certainly use those pre-trained models if they have social phrases like the ones you're seeking. However, if you don't see a model that has been trained on a suitable corpus, you could easily train a model yourself. I suggest Twitter or Wikipedia for corpora (training sets), and the implementation of word2vec in gensim as a word embedding model.

Which tokenizer is better to use with NLTK?

I have started learning nltk and following this tutorial. First we use the built-in tokenizer by using sent_tokenize and later we use PunktSentenceTokenizer. The tutorial mentions that PunktSentenceTokenizer is capable of unsupervised machine learning.
So does that mean it is better than the default one? Or what is the standard of comparison among various tokenizers?

Looking at the source code for sent_tokenize() reveals that this method currently uses the pre-trained punkt tokenizer, so it is the equivalent to PunktSentenceTokenizer. Whether or not you will need to retrain your tokenizer depends on the nature of the text you are working with. If it is nothing too exotic, like newspaper articles, then you will likely find the pre-trained tokenizer to be sufficient. Tokenizing boils down to a categorization task, and thus different tokenizers could be compared by using the typical metrics such as precision, recall, f-score etc. on labelled data.
The punkt tokenizer is based on the work published in the following paper:
http://www.mitpressjournals.org/doi/abs/10.1162/coli.2006.32.4.485#.V2ouLXUrLeQ
It is fundamentally a heuristic-based approach geared to disambiguating sentence boundaries from abbreviations, the bane of sentence tokenization. Calling it a heuristic approach is not meant to be disparaging. I have used the built-in sentence tokenizer before and it worked fine for what I was doing. Of course, my task did not really depend on accurate sentence tokenizing. Or rather, I was able to throw enough data at it that it did not really matter.
Here is an example of a question on SO where a user found the pre-trained tokenizer lacking, and needed to train a new one:
How to tweak the NLTK sentence tokenizer
The text in question was Moby Dick, and the odd sentence structure was tripping up the tokenizer. Some examples of where you might need to train your own tokenizer are social media (e.g. twitter) or technical literature with lots of strange abbreviations not encountered by the pre-trained tokenizer.
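The comparison by precision, recall and F-score mentioned earlier in this answer could look roughly like the sketch below. It treats each tokenizer's output as a set of sentence-end positions and scores it against hand-labelled positions. The function name and the offsets are hypothetical; obtaining the offsets from a real tokenizer and a gold annotation is assumed and not shown.

```python
def boundary_scores(predicted, gold):
    """Precision, recall and F1 over sets of sentence-boundary offsets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical character offsets at which sentences end.
gold_boundaries = [18, 42, 77, 103]
predicted_boundaries = [18, 30, 42, 103]  # spurious split at 30, miss at 77

precision, recall, f1 = boundary_scores(predicted_boundaries, gold_boundaries)
print(precision, recall, f1)
```

Running the same scoring over the output of the pre-trained tokenizer and of a retrained one, on the same labelled sample, gives a like-for-like basis for deciding whether retraining is worth it.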

Sentences and words are often manually tokenized. There exist various corpora that deal with POS tagging for words according to the sentence contexts. PunktSentenceTokenizer is employed when your data (sentences and words) needs to be trained to achieve a uniform understanding of how the words should be tagged contextually. A data scientist could manually annotate word tags for a whole bunch of sentences and then tell the machine to learn them (supervised learning). However, PunktSentenceTokenizer employs ML algorithms to learn these tags on its own (unsupervised). You just choose which data it trains on.
Depending on the data you are working with, the results of sent_tokenize and consequently word_tokenize may not be that different from PunktSentenceTokenizer. Choosing between tokenizers is left up to the data scientist, but the standard is always a comparison against manually annotated tags (because they are the most correct tags).
