I am trying to find out which pre-trained model includes common phrases found in news. I thought GoogleNews-vectors-negative300.bin would be a comprehensive one, but it turns out it does not even include deep_learning, machine_learning, social_network, or social_responsibility. What pre-trained model would include such words and phrases that often occur in news and public reports?
import gensim
# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
model.similarity('deep_learning', 'machine_learning')
# raises KeyError, because neither token is in the model's vocabulary
These are MWEs (multi-word expressions) that are unlikely to be included as single tokens. You could approximate them by averaging the vectors of the individual words that make up each MWE.
The trade-offs between the different ways of combining the component vectors, and the results they give, are discussed in: word2vec - what is best? add, concatenate or average word vectors?
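As a hedged sketch of that averaging idea, assuming model is the GoogleNews KeyedVectors loaded as in the question:

import numpy as np

def phrase_vector(phrase, kv):
    # Average the vectors of the phrase's component words, skipping any out-of-vocabulary ones.
    words = [w for w in phrase.split('_') if w in kv]
    if not words:
        raise KeyError(f"no component of {phrase!r} is in the vocabulary")
    return np.mean([kv[w] for w in words], axis=0)

def phrase_similarity(p1, p2, kv):
    # Cosine similarity between the two averaged phrase vectors.
    v1, v2 = phrase_vector(p1, kv), phrase_vector(p2, kv)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

print(phrase_similarity('deep_learning', 'machine_learning', model))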
The GoogleNews vectors were trained by Google, circa 2012-2013, on a large internal corpus of news articles.
Further, the promotion of individual words into multiword phrases seems to have been done using a purely-statistical co-occurrence analysis (similar to that implemented by gensim's Phrases class) - so it often won't match human-level perception of entities/concepts, missing some word combos and over-combining others.
So, concepts that were obscure (or not even yet coined!) then, or seldom covered in news articles, will be missing or underrepresented.
Training your own vectors on text from your own domain of interest is often best, for both coverage, and to ensure the vectors reflect word/phrase senses that are dominant in your texts – not general news or reference materials.
Related
I would like to create word embeddings that take context into account, so the vector of the word Jaguar [animal] would be different from the word Jaguar [car brand].
As you know, word2vec gives only one representation for a given word, and I would like to take already-pretrained embeddings and enrich them with context. So far I've tried a simple approach of averaging the vector of the word with the vector of a category word, for example like this.
Now I would like to try to create and train a neural network that would take entire sentences, e.g.
Jaguar F-PACE is a great SUV sports car.
Among cats, only tigers and lions are bigger than jaguars.
It would then be trained on a text-classification task (I have a dataset with several categories like animals, cars, etc.), but the result I'm after is new representations for the word jaguar in its different contexts, i.e., two different embeddings.
Does anyone have any idea how I could create such a network? I don't hide that I'm a beginner and have no idea how to go about it.
If you've already been able to perform sense-disambiguation outside word2vec, then you can change the word-tokens to reflect your external judgement. For example, change some appearances of the token 'jaguar' to 'jaguar*car' and others to 'jaguar*animal'. Proceeding with normal word2vec training will then get your two different tokens two different word-vectors.
If you're hoping for the training to discover these itself, as ~Erwan mentioned in a comment, that seems like an open research question, without a standard or off-the-shelf solution that a beginner could drop-in.
I'd once seen a paper (around the time of the original word2vec papers, but I can't find the link now) that tried to do this in a word2vec-compatible way by first proceeding with traditional, polysemy-oblivious training. Then, for every appearance of a word X, model its surrounding context via some combination of the word-vectors of neighbors within a certain number of positions. (That in itself is very similar to the preparation of a context-vector in the CBOW mode of word2vec.) Perform some clustering on that collection-of-all-contexts to come up with some idea of alternate senses, each associated with one cluster. Then, in a follow-up pass on the original corpus, replace word-tokens with those that also reflect their nearby-context cluster. (E.g., 'jaguar' might be replaced with 'jaguar*1', 'jaguar*2', etc. based on which discrete cluster its context suggested.) Then, repeat (or continue) word2vec training to get sense-specific word-vectors. Of course, the devil would be in the details of how contexts are defined, how clusters are deduced, and in tough edge cases (where the text's author may themselves be deploying multiple senses).
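For concreteness, here is a rough, hedged sketch of that cluster-then-relabel pipeline using gensim and scikit-learn. It is not the paper's exact method; the toy corpus, window size, and number of senses are illustrative assumptions.

import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Placeholder toy corpus; replace with your real tokenized corpus.
sentences = [
    ['the', 'jaguar', 'prowled', 'through', 'the', 'dense', 'jungle'],
    ['jaguar', 'unveiled', 'a', 'new', 'electric', 'suv', 'model'],
    ['the', 'jaguar', 'hunted', 'deer', 'near', 'the', 'river'],
    ['the', 'jaguar', 'dealership', 'sells', 'luxury', 'sports', 'cars'],
]
target, window, n_senses = 'jaguar', 5, 2

# Pass 1: ordinary, polysemy-oblivious training.
base = Word2Vec(sentences, vector_size=100, window=window, min_count=1, epochs=50)

# For each occurrence of the target, average the neighbors' vectors into a context vector.
contexts, locations = [], []
for s_idx, sent in enumerate(sentences):
    for w_idx, tok in enumerate(sent):
        if tok != target:
            continue
        neighbors = sent[max(0, w_idx - window):w_idx] + sent[w_idx + 1:w_idx + 1 + window]
        vecs = [base.wv[n] for n in neighbors if n in base.wv]
        if vecs:
            contexts.append(np.mean(vecs, axis=0))
            locations.append((s_idx, w_idx))

# Cluster the contexts into putative senses, then relabel tokens as 'jaguar*0', 'jaguar*1', ...
labels = KMeans(n_clusters=n_senses, n_init=10).fit_predict(np.array(contexts))
for (s_idx, w_idx), sense in zip(locations, labels):
    sentences[s_idx][w_idx] = f'{target}*{sense}'

# Pass 2: retrain to get sense-specific vectors.
sense_model = Word2Vec(sentences, vector_size=100, window=window, min_count=1, epochs=50)
print([w for w in sense_model.wv.index_to_key if w.startswith(target)])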
Some other interesting efforts to model or deduce polysemy in word2vec models:
"Linear Algebraic Structure of Word Meanings"
"A Simple Approach to Learn Polysemous Word Embeddings"
But per above, I've not seen these sorts of techniques widely implemented/adopted in a form that's easy to drop-in to another project.
Is it possible to use Google BERT to calculate similarity between two textual documents? As I understand it, BERT's input is supposed to be limited-size sentences. Some works use BERT for similarity calculation between sentences, for example:
https://github.com/AndriyMulyar/semantic-text-similarity
https://github.com/beekbin/bert-cosine-sim
Is there an implementation of BERT that can take large documents (thousands of words) as input instead of sentences?
BERT is not limited to determining whether one sentence follows another; that is just one task among a myriad. ALL of the GLUE tasks (and SuperGLUE) are getting knocked out of the park by ALBERT.
BERT (and ALBERT for that matter) is the absolute state of the art in Natural Language Understanding. Doc2Vec doesn't come close. BERT is not a bag-of-words method. It's a bi-directional, attention-based encoder built on the Transformer, the architecture from the Google Brain paper Attention Is All You Need. Also see this visual breakdown of the Transformer model.
This is a fundamentally new way of looking at natural language which doesn't use RNNs, LSTMs, tf-idf or any of that. We aren't relying on static word or document vectors anymore. GloVe (Global Vectors for Word Representation) and LSTM-based models are old. Doc2Vec is old.
BERT is really powerful - like, pass-the-Turing-test-easily powerful. Take a look at SuperGLUE, which just came out: scroll to the bottom and look at how insane those tasks are. THAT is where NLP is at.
Okay so now that we have dispensed with the idea that tf-idf is state of the art - you want to take documents and look at their similarity? I would use ALBERT on Databricks in two layers:
Perform either extractive or abstractive summarization: https://pypi.org/project/bert-extractive-summarizer/ (notice how big those documents of text are) and reduce each document down to a summary.
In a separate step, take each summary and do the STS-B task from page 3 of the GLUE paper (a rough sketch of both steps follows below).
Now, we are talking about absolutely bleeding edge technology here (Albert came out in just the last few months). You will need to be extremely proficient to get through this but it CAN be done, and I believe in you!!
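As a hedged sketch of those two layers (using my own stand-ins, not necessarily what the answerer had in mind: bert-extractive-summarizer for step 1, and sentence-transformers as a convenient substitute for an STS-B-style model in step 2):

from summarizer import Summarizer                              # pip install bert-extractive-summarizer
from sentence_transformers import SentenceTransformer, util    # pip install sentence-transformers

# Hypothetical input files containing long documents.
doc_a = open('report_a.txt').read()
doc_b = open('report_b.txt').read()

# Layer 1: extractive summarization to shrink each document.
summarize = Summarizer()
summary_a = summarize(doc_a, ratio=0.2)
summary_b = summarize(doc_b, ratio=0.2)

# Layer 2: embed the summaries and score their semantic similarity
# (sentence-transformers stands in here for a model fine-tuned on STS-B).
sts_model = SentenceTransformer('all-MiniLM-L6-v2')
emb_a, emb_b = sts_model.encode([summary_a, summary_b], convert_to_tensor=True)
print(util.cos_sim(emb_a, emb_b).item())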
BERT is a sentence representation model. It is trained to predict words in a sentence and to decide if two sentences follow each other in a document, i.e., strictly on the sentence level. Moreover, BERT requires memory quadratic in the input length, which would not be feasible for whole documents.
It is quite common practice to average word embeddings to get a sentence representation. You can try the same thing with BERT and average the [CLS] vectors from BERT over sentences in a document.
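As a hedged sketch of that suggestion (my own illustration, not the answerer's code), one could do it with Hugging Face transformers roughly like this:

import torch
from nltk.tokenize import sent_tokenize                      # nltk.download('punkt') may be needed
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
bert = AutoModel.from_pretrained('bert-base-uncased')

def document_vector(text):
    # Average BERT's [CLS] vector over all sentences in the document.
    cls_vectors = []
    for sentence in sent_tokenize(text):
        inputs = tokenizer(sentence, return_tensors='pt', truncation=True, max_length=512)
        with torch.no_grad():
            outputs = bert(**inputs)
        cls_vectors.append(outputs.last_hidden_state[:, 0, :])   # [CLS] is the first token
    return torch.mean(torch.cat(cls_vectors, dim=0), dim=0)

doc_vec = document_vector("First sentence of the document. Second sentence. And so on.")
print(doc_vec.shape)   # torch.Size([768]) for bert-base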
There are some document-level embeddings; for instance, doc2vec is a commonly used option.
As far as I know, at the document level, frequency-based vectors such as tf-idf (which has a good implementation in scikit-learn) are still close to the state of the art, so I would not hesitate to use them. Or at least it is worth trying them to see how they compare to embeddings.
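A quick baseline along those lines, using scikit-learn's tf-idf plus cosine similarity between documents (toy documents for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Deep learning has transformed natural language processing.",
    "Neural networks and deep learning reshape how we process language.",
    "The stock market fell sharply on Friday.",
]
tfidf = TfidfVectorizer(stop_words='english')
matrix = tfidf.fit_transform(docs)
print(cosine_similarity(matrix))   # pairwise document similarities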
To add to @jindřich's answer, BERT is trained to find missing words in a sentence and to predict the next sentence. Word-embedding-based doc2vec is still a good way to measure similarity between documents. If you want to delve deeper into why the current best model isn't always the best choice for a use case, give this post a read, where it explains clearly why not every state-of-the-art model is suitable for every task.
Ya, you would just do each part independently. For summarization you hardly need to do much; just look on PyPI for "summarize" and you'll find several packages, no training needed. For sentence-to-sentence similarity there is a fairly involved method for computing the loss, but it's spelled out on the GLUE website. It's considered part of the challenge (meeting the metric). Determining that distance (STS) is non-trivial; I think they call it "coherence", but I'm not sure.
While learning the Doc2Vec library, I got stuck on the following question.
Does gensim Doc2Vec distinguish between the same sentence with positive and negative context?
For Example:
Sentence A: "I love Machine Learning"
Sentence B: "I do not love Machine Learning"
If I train on sentences A and B with doc2vec and compute the cosine similarity between their vectors:
Will the model be able to distinguish the sentences and give a cosine similarity much less than 1, or even negative?
Or will the model place both sentences very close together in vector space and give a cosine similarity close to 1, since almost all the words are the same except for the negation ("do not")?
Also, if I train only on sentence A and try to infer sentence B, will both vectors be close to each other in vector space?
I would request the NLP community and Doc2Vec experts for helping me out in understanding this.
Thanks in Advance !!
Inherently, all that the 'Paragraph Vector' algorithm behind gensim Doc2Vec does is find a vector that (together with a neural-network) is good at predicting the words that appear in a text. So yes, texts with almost-identical words will have very close vectors. (There's no syntactic understanding that certain words, in certain places, have a big reversing-effect.)
However, even such vectors may be ok (though not state-of-the-art) at sentiment analysis. One of the ways the original 'Paragraph Vectors' paper evaluated the vector usability was estimating the sentiment of short movie reviews. (These were longer than a single sentence – into the hundreds of words.) When training a classifier on the doc-vectors, the classifier did a pretty good job, and better than other baseline techniques, at estimating the negativity/positivity of reviews.
Your single, tiny, contrived sentences could be harder – they're short with just a couple words' difference, so the vectors will be very close. But those different words (especially 'not') are often very indicative of sentiment – so the tiny difference might be enough to shift the vector from the 'positive' regions to the 'negative' regions.
So you'd have to try it, with a real training corpus of tens of thousands of varied text examples (because this technique doesn't work well on toy-sized datasets) and a post-vectorization classifier step.
Note also that in pure Doc2Vec, adding known labels (like 'positive' or 'negative') during training (alongside or instead of any unique document-ID based tags) can sometimes help the resulting vector-space be more sensitive to the distinction you want. And, other variant techniques like 'FastText' or 'StarSpace' more directly integrate known-labels into the vectorization in a way that might help.
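A minimal, hedged sketch of that plain Doc2Vec workflow with label tags, using gensim (the toy corpus and parameters are illustrative only; real use needs a far larger corpus):

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus for illustration; real training needs thousands of varied examples.
corpus = [
    (["i", "love", "machine", "learning"], "positive"),
    (["i", "do", "not", "love", "machine", "learning"], "negative"),
    (["this", "library", "is", "wonderful"], "positive"),
    (["this", "library", "is", "awful"], "negative"),
]
# Each document gets a unique ID tag plus its known sentiment label as an extra tag.
tagged = [TaggedDocument(words=words, tags=[f"doc_{i}", label])
          for i, (words, label) in enumerate(corpus)]

model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

vec_a = model.infer_vector(["i", "love", "machine", "learning"])
vec_b = model.infer_vector(["i", "do", "not", "love", "machine", "learning"])
cosine = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(cosine)   # expect a value close to 1 on such near-identical toy sentences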
The best results on short sentences, though, would probably take into account the relative ordering of words and grammatical parsing. You can see a demo of such a more-advanced technique at a page from Stanford's NLP research group:
http://nlp.stanford.edu:8080/sentiment/rntnDemo.html
Though look in the comments there for various examples of hard cases that it still struggles with.
Use TextBlob to get the sentiment and polarity for each sentence, after tokenizing the sentences with an NLP tokenizer.
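For instance, a minimal TextBlob example (my own illustration of the suggestion above):

from textblob import TextBlob   # pip install textblob

for text in ["I love Machine Learning", "I do not love Machine Learning"]:
    polarity = TextBlob(text).sentiment.polarity   # in [-1.0, 1.0]; negative means negative sentiment
    print(text, "->", polarity)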
A common natural language processing task is to calculate the similarity between two words using WordNet. I'll start my question with the following Python code:
from nltk.corpus import wordnet
sport = wordnet.synsets("sport")[0]
badminton = wordnet.synsets("badminton")[0]
print(sport.wup_similarity(badminton))
We will get 0.8421
Now what if I look for "haha" and "lol" as following:
haha = wordnet.synsets("haha")
lol = wordnet.synsets("lol")
print(haha)
print(lol)
We will get
[]
[]
Then we cannot compute the similarity between them. What can we do in this case?
You can create a semantic space from co-occurrence matrices using a tool like DISSECT (DIStributional SEmantics Composition Toolkit),
and then you are set to measure semantic similarity between words or phrases (if you compose words).
In your case, for haha and lol, you'll need to collect those co-occurrences yourself.
Another thing to try is word2vec.
There are two other possible approaches:
CBOW: the continuous bag-of-words model
skip-gram: the reverse of CBOW (it predicts context words from the target word, rather than the target from its context)
Look at this: https://www.quora.com/What-are-the-continuous-bag-of-words-and-skip-gram-architectures-in-laymans-terms
These models are well presented here: https://www.tensorflow.org/tutorials/word2vec, and gensim is also a good Python library for this kind of thing (a small gensim example of both architectures follows below).
Try the TensorFlow implementations, for example this one: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/word2vec/word2vec_basic.py
Or read up on word2vec: https://en.wikipedia.org/wiki/Word2vec
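As promised above, a small, hedged gensim illustration of the two architectures (the toy corpus is only for demonstration):

from gensim.models import Word2Vec

# Toy corpus for demonstration only.
sentences = [
    ["haha", "that", "meme", "is", "hilarious"],
    ["lol", "that", "joke", "is", "hilarious"],
    ["the", "report", "was", "published", "today"],
]

cbow = Word2Vec(sentences, sg=0, vector_size=50, window=3, min_count=1, epochs=50)       # CBOW: predict word from context
skipgram = Word2Vec(sentences, sg=1, vector_size=50, window=3, min_count=1, epochs=50)   # skip-gram: predict context from word

print(cbow.wv.similarity("haha", "lol"))
print(skipgram.wv.similarity("haha", "lol"))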
You can use other frameworks. I also tried NLTK, but finally landed on spaCy (spacy.io), a very fast and functional framework. It has a 'similarity' method for words that compares them to other words (and it also works for sentences, docs, etc.). It is implemented using word vectors. I don't actually know how big its vocabulary is or how it copes when a word is unknown, but it might be worth a try.
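A minimal example of that spaCy approach (assuming a model with vectors, such as en_core_web_md, is installed):

import spacy

# Requires a model with word vectors, e.g. `python -m spacy download en_core_web_md`.
nlp = spacy.load("en_core_web_md")
haha, lol = nlp("haha"), nlp("lol")
print(haha.similarity(lol))

# Tokens missing from the vector table fall back to an all-zeros vector; you can check:
print(haha[0].has_vector, haha[0].is_oov)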
I was also playing a little bit with this one:
https://radimrehurek.com/gensim/models/word2vec.html
where in two lines you can load Google's big word2vec model (this project ports Google's word2vec C++ library into Python), accessible here:
https://docs.google.com/file/d/0B7XkCwpI5KDYaDBDQm1tZGNDRHc/edit
There are different models for measuring similarity, such as word2vec or glove, but you seem to be looking more for a corpus which includes social, informal phrases like 'lol'.
However, I'm going to bring up word2vec because it leads to what I think is an answer to your question.
The foundational concept of word2vec (and other word embedding models like glove) is the representation of words in a vector space which incorporates relationships between words. This lends itself very well to measuring similarity, since vectors have lots of established math to draw from. You can read more about the technical details of word2vec in the original paper, but I quite like this blog post because it is well-written and concise.
Again, since word2vec is just a model, you need to pair it with the right training set to get the kind of scope you seek. There are some pre-trained models floating around on the web, such as this bunch. The training set is really what allows you to query a larger variety of terms, rather than the model.
You can certainly use those pre-trained models if they have social phrases like the ones you're seeking. However, if you don't see a model that has been trained on a suitable corpus, you could easily train a model yourself. I suggest Twitter or Wikipedia for corpora (training sets), and the implementation of word2vec in gensim as a word embedding model.
I have started learning nltk and am following this tutorial. First we use the built-in tokenizer sent_tokenize, and later we use PunktSentenceTokenizer. The tutorial mentions that PunktSentenceTokenizer is capable of unsupervised machine learning.
So does that mean it is better than the default one? Or what is the standard of comparison among various tokenizers?
Looking at the source code for sent_tokenize() reveals that this method currently uses the pre-trained punkt tokenizer, so it is equivalent to PunktSentenceTokenizer. Whether or not you need to retrain your tokenizer depends on the nature of the text you are working with. If it is nothing too exotic, like newspaper articles, you will likely find the pre-trained tokenizer sufficient. Tokenization boils down to a categorization task, so different tokenizers can be compared using the typical metrics such as precision, recall, and F-score on labelled data.
The punkt tokenizer is based on the work published in the following paper:
http://www.mitpressjournals.org/doi/abs/10.1162/coli.2006.32.4.485#.V2ouLXUrLeQ
It is fundamentally a heuristic-based approach geared to disambiguating sentence boundaries from abbreviations - the bane of sentence tokenization. Calling it a heuristic approach is not meant to be disparaging. I have used the built-in sentence tokenizer before and it worked fine for what I was doing; of course, my task did not really depend on accurate sentence tokenizing. Or rather, I was able to throw enough data at it that it did not really matter.
Here is an example of a question on SO where a user found the pre-trained tokenizer lacking, and needed to train a new one:
How to tweak the NLTK sentence tokenizer
The text in question was Moby Dick, and the odd sentence structure was tripping up the tokenizer. Some examples of where you might need to train your own tokenizer are social media (e.g. twitter) or technical literature with lots of strange abbreviations not encountered by the pre-trained tokenizer.
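If you do need to train your own, a hedged sketch with NLTK's PunktTrainer might look like this (domain_text.txt is a hypothetical file of raw text from your domain):

import nltk
from nltk.tokenize import sent_tokenize
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

nltk.download('punkt')   # needed once for the pre-trained model behind sent_tokenize

sample = "Dr. Smith met Mr. Jones at 5 p.m. They discussed the Pequod. Call me Ishmael."

# The pre-trained tokenizer (what sent_tokenize uses under the hood):
print(sent_tokenize(sample))

# Training a custom tokenizer on raw text from your own domain:
with open('domain_text.txt') as f:        # hypothetical file of domain text
    raw = f.read()
trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True        # learn abbreviations/collocations more aggressively
trainer.train(raw)
custom_tokenizer = PunktSentenceTokenizer(trainer.get_params())
print(custom_tokenizer.tokenize(sample))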
Sentences and words are often manually tokenized. There exist various corpora that handle POS tagging of words according to sentence context. PunktSentenceTokenizer is used when your data (sentences and words) needs to be trained on to reach a uniform understanding of how words should be tagged contextually. A data scientist could manually annotate word tags for a whole bunch of sentences and then tell the machine to learn them (supervised learning); PunktSentenceTokenizer, however, uses ML algorithms to learn these tags on its own (unsupervised). You just choose which data it trains on.
Depending on the data you are working with, the results of sent_tokenize, and consequently word_tokenize, may not be that different from PunktSentenceTokenizer. Choosing between tokenizers is left up to the data scientist, but the standard is always compared against manually annotated tags (because they are the most correct).