While learning the Doc2Vec library, I got stuck on the following question.
Does gensim Doc2Vec distinguish between the same sentence with positive and negative context?
For Example:
Sentence A: "I love Machine Learning"
Sentence B: "I do not love Machine Learning"
If I train sentence A and B with doc2vec and find cosine similarity between their vectors:
Will the model be able to distinguish the sentences and give a cosine similarity much less than 1, or even negative?
Or will the model place both sentences very close together in vector space and give a cosine similarity close to 1, since almost all the words are the same except for the negation ("do not")?
Also, if I train only on sentence A and then infer a vector for sentence B, will both vectors be close to each other in vector space?
I would appreciate help from the NLP community and Doc2Vec experts in understanding this.
Thanks in advance!
Inherently, all that the 'Paragraph Vector' algorithm behind gensim Doc2Vec does is find a vector that (together with a neural-network) is good at predicting the words that appear in a text. So yes, texts with almost-identical words will have very close vectors. (There's no syntactic understanding that certain words, in certain places, have a big reversing-effect.)
However, even such vectors may be OK (though not state-of-the-art) at sentiment analysis. One of the ways the original 'Paragraph Vectors' paper evaluated the usefulness of the vectors was by estimating the sentiment of short movie reviews. (These were longer than a single sentence – into the hundreds of words.) When training a classifier on the doc-vectors, the classifier did a pretty good job, and better than other baseline techniques, at estimating the negativity/positivity of reviews.
Your single, tiny, contrived sentences could be harder – they're short with just a couple words' difference, so the vectors will be very close. But those different words (especially 'not') are often very indicative of sentiment – so the tiny difference might be enough to shift the vector from the 'positive' regions to the 'negative' regions.
So you'd have to try it, with a real training corpus of tens of thousands of varied text examples (because this technique doesn't work well on toy-sized datasets) and a post-vectorization classifier step.
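For example, a minimal sketch of such an experiment, assuming gensim 4.x; the toy corpus and hyperparameters below are placeholders, and a real test would need tens of thousands of varied documents:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess
import numpy as np

raw_docs = [
    "I love Machine Learning",
    "I do not love Machine Learning",
    # ... many more varied documents in a realistic corpus ...
]
corpus = [TaggedDocument(simple_preprocess(doc), [i]) for i, doc in enumerate(raw_docs)]

model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# Infer vectors for the two sentences and compare them with cosine similarity.
vec_a = model.infer_vector(simple_preprocess("I love Machine Learning"))
vec_b = model.infer_vector(simple_preprocess("I do not love Machine Learning"))
print(np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))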
Note also that in pure Doc2Vec, adding known labels (like 'positive' or 'negative') during training (alongside or instead of any unique document-ID based tags) can sometimes help the resulting vector-space be more sensitive to the distinction you want. And, other variant techniques like 'FastText' or 'StarSpace' more directly integrate known-labels into the vectorization in a way that might help.
The best results on short sentences, though, would probably take into account the relative ordering of words and grammatical parsing. You can see a demo of such a more-advanced technique at a page from Stanford's NLP research group:
http://nlp.stanford.edu:8080/sentiment/rntnDemo.html
Though look in the comments there for various examples of hard cases that it still struggles with.
Use TextBlob to get the sentiment and polarity of each sentence, after tokenizing the sentences with an NLP tokenizer.
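A minimal sketch of that suggestion (TextBlob's default analyzer returns a polarity score in [-1, 1]; the example sentences are from the question):
from textblob import TextBlob

for sentence in ["I love Machine Learning", "I do not love Machine Learning"]:
    blob = TextBlob(sentence)
    # polarity < 0 suggests negative sentiment, > 0 positive
    print(sentence, "->", blob.sentiment.polarity)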
I am currently trying to train my own word2vec model with my own training data and I am utterly confused about the training data preprocessing.
I ran a short script over my text that lemmatizes and also lower-cases the words, so that in the end my training data from a sentence (in German) like:
"Er hat heute zwei Birnen gegessen."
the following comes out:
[er, haben, heute, zwei, birne, essen]
translated into English:
He ate two pears today.
results in:
[he, eat, two, pear, today]
Now the problem is: I haven't seen anyone do this to their training data. The words are usually kept with their original casing and not lemmatized, and I don't understand how that works. Especially in German there are so many inflections of verbs. Should I just leave them that way? I don't see how skipping lemmatization can work, since gensim doesn't even know which language it is being trained on, right?
So in short: Should I do lemmatization and/or lowercasing or just leave every word as it is?
Thanks a lot!
The answer depends on what you are going to use the embeddings for; however, they are usually trained on surface word forms. Word embeddings are typically pre-trained on very large datasets and cover vocabularies of up to 500k words, which usually covers most words in most forms, even in languages with much richer morphology than German.
You might also want to use FastText (bindings for gensim exist) instead of Word2Vec. FastText considers character n-gram statistics for the embeddings, so it generalizes better over regularities in morphology.
But in general, the data preprocessing always depends on how you plan to use the embeddings. If you want to do a quantitative diachronic study on how the meanings of words shifted over the 20th century, then embedding lemmas is a good idea. If you work with a low-resource language that has a good lemmatizer, it also might be a good idea. If you plan to use the embeddings as input to a downstream NLP model, then you should probably use word forms and/or use already pre-trained embeddings.
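To illustrate the FastText option, here is a minimal gensim sketch; the corpus and parameters are placeholders, not recommendations:
from gensim.models import FastText

sentences = [
    ["er", "hat", "heute", "zwei", "birnen", "gegessen"],
    # ... the rest of your tokenized corpus ...
]

model = FastText(sentences, vector_size=100, window=5, min_count=1, epochs=10)

# Thanks to character n-grams, even an unseen inflected form gets a vector.
print(model.wv["gegessenes"][:5])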
I am trying to find a pre-trained model that includes common phrases from news, and I thought GoogleNews-vectors-negative300.bin would be a comprehensive one, but it turns out it does not even include deep_learning, machine_learning, social_network, or social_responsibility. Which pre-trained model would include such phrases that often occur in news and public reports?
import gensim
# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
model.similarity('deep_learning', 'machine_learning')
These are MWEs (multi-word expressions) that are unlikely to be included. You could theoretically model them by taking an average of the vectors obtained for each of the words comprising an MWE.
For a discussion of the different operations for combining the component vectors (add, concatenate, or average) and the results they give, see: word2vec - what is best? add, concatenate or average word vectors?
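A rough sketch of the averaging idea, assuming gensim 4.x and that the GoogleNews model is already loaded as model, as in the question's code (older gensim versions expose the vocabulary differently):
import numpy as np

def mwe_vector(model, phrase):
    # Average the vectors of the in-vocabulary components of the phrase.
    words = [w for w in phrase.split("_") if w in model.key_to_index]
    if not words:
        raise KeyError("no component of %r is in the vocabulary" % phrase)
    return np.mean([model[w] for w in words], axis=0)

vec_dl = mwe_vector(model, "deep_learning")
vec_ml = mwe_vector(model, "machine_learning")
print(np.dot(vec_dl, vec_ml) / (np.linalg.norm(vec_dl) * np.linalg.norm(vec_ml)))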
The GoogleNews vectors were trained by Google, circa 2012-2013, on a large internal corpus of news articles.
Further, the promotion of individual words into multi-word phrases seems to have been done using a purely statistical co-occurrence analysis (similar to that implemented by gensim's Phrases class), so it often won't match human-level perception of entities/concepts, missing some word combinations and over-combining others.
So, concepts that were obscure (or not even yet coined!) then, or seldom covered in news articles, will be missing or underrepresented.
Training your own vectors on text from your own domain of interest is often best, for both coverage, and to ensure the vectors reflect word/phrase senses that are dominant in your texts – not general news or reference materials.
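If you go that route, a minimal sketch of statistical phrase detection with gensim's Phrases class followed by Word2Vec training might look like this; the corpus and thresholds are placeholders to tune on real data:
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

sentences = [
    ["deep", "learning", "is", "a", "subfield", "of", "machine", "learning"],
    # ... many more tokenized sentences from your own domain ...
]

phrases = Phrases(sentences, min_count=1, threshold=1)    # tune on real data
bigram = Phraser(phrases)
phrased_sentences = [bigram[s] for s in sentences]        # e.g. ["deep_learning", ...]

model = Word2Vec(phrased_sentences, vector_size=100, min_count=1, epochs=10)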
Is it possible to use Google BERT to calculate similarity between two textual documents? As I understand it, BERT's input is supposed to be sentences of limited size. Some works use BERT for similarity calculation between sentences, such as:
https://github.com/AndriyMulyar/semantic-text-similarity
https://github.com/beekbin/bert-cosine-sim
Is there an implementation of BERT that can take large documents (documents with thousands of words) as input, instead of sentences?
Determining whether one sentence follows another is just ONE of the GLUE tasks, and there are a myriad more. ALL of the GLUE tasks (and SuperGLUE) are getting knocked out of the park by ALBERT.
BERT (and ALBERT, for that matter) is the absolute state of the art in Natural Language Understanding. Doc2Vec doesn't come close. BERT is not a bag-of-words method. It's a bidirectional attention-based encoder built on the Transformer, the incarnation of the Google Brain paper Attention Is All You Need. Also see this visual breakdown of the Transformer model.
This is a fundamentally new way of looking at natural language which doesn't use RNNs or LSTMs or tf-idf or any of that stuff. We aren't turning words or docs into vectors anymore. GloVe (Global Vectors for Word Representation) with LSTMs is old. Doc2Vec is old.
BERT is reeeeeallly powerful - like, pass-the-Turing-test-easily powerful. Take a look at SuperGLUE, which just came out; scroll to the bottom and look at how insane those tasks are. THAT is where NLP is at.
Okay so now that we have dispensed with the idea that tf-idf is state of the art - you want to take documents and look at their similarity? I would use ALBERT on Databricks in two layers:
Perform either extractive or abstractive summarization: https://pypi.org/project/bert-extractive-summarizer/ (NOTICE HOW BIG THOSE DOCUMENTS OF TEXT ARE) and reduce your document down to a summary.
In a separate step, take each summary and do the STS-B task from page 3 of GLUE.
Now, we are talking about absolutely bleeding edge technology here (Albert came out in just the last few months). You will need to be extremely proficient to get through this but it CAN be done, and I believe in you!!
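For step 1, a rough sketch using the bert-extractive-summarizer package linked above (pip install bert-extractive-summarizer); the long_document variable is a placeholder:
from summarizer import Summarizer

long_document = "..."                        # thousands of words of text
model = Summarizer()
summary = model(long_document, ratio=0.2)    # keep roughly 20% of the sentences
print(summary)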
BERT is a sentence representation model. It is trained to predict words in a sentence and to decide if two sentences follow each other in a document, i.e., strictly on the sentence level. Moreover, BERT requires quadratic memory with respect to the input length which would not be feasible with documents.
It is quite common practice to average word embeddings to get a sentence representation. You can try the same thing with BERT and average the [CLS] vectors from BERT over sentences in a document.
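A hedged sketch of that averaging idea using the Hugging Face transformers library (not part of the original suggestion; the naive period-based sentence splitting is an assumption):
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def doc_vector(document):
    # Encode each sentence, take its [CLS] vector, and average over the document.
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    cls_vectors = []
    with torch.no_grad():
        for sent in sentences:
            inputs = tokenizer(sent, return_tensors="pt", truncation=True, max_length=512)
            outputs = bert(**inputs)
            cls_vectors.append(outputs.last_hidden_state[:, 0, :])   # [CLS] token
    return torch.mean(torch.cat(cls_vectors, dim=0), dim=0)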
There are some document-level embeddings. For instance doc2vec is a commonly used option.
As far as I know, at the document level, frequency-based vectors such as tf-idf (with a good implementation in scikit-learn) are still close to the state of the art, so I would not hesitate to use them. Or at least it is worth trying to see how they compare to embeddings.
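A minimal sketch of that tf-idf baseline with scikit-learn, comparing whole documents via cosine similarity (the documents are placeholders):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["first long document ...", "second long document ..."]   # placeholders
tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(docs)
print(cosine_similarity(matrix[0], matrix[1]))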
To add to Jindřich's answer, BERT is meant to find missing words in a sentence and predict the next sentence. The word-embedding-based doc2vec is still a good way to measure similarity between docs. If you want to delve deeper into why the best model can't always be the best choice for a use case, give this post a read, where it clearly explains why not every state-of-the-art model is suitable for every task.
Ya. You would just do each part independently. For summarization you hardly need to do much. Just look on PyPI for summarize and you have several packages. You don't even need to train. Now, for sentence-to-sentence similarity there is a fairly complex method for getting the loss, but it's spelled out on the GLUE website. It's considered part of the challenge (meeting the metric). Determining that distance (STS) is non-trivial and I think they call it "coherence", but I'm not sure.
Background: I have been evaluating a variety of text classification methods on my dataset, including feature vectors derived from word counts and TF-IDF, run through various classifiers. My dataset is very small (about 2300 sentences and about 5 classes), and since the above approaches treat different words as completely separate features, I would like to use a word-vector approach to classification. I have used pretrained word vectors with a shallow NN with little success.
Problem: I am looking for an alternative method of using word vectors to classify my sentences and have thought of taking the word vectors for a sentence, combining them into a single vector, then taking the centroid of each class of sentence vectors - classification would then happen via a distance measure between a new sentence and the centroid.
How can I combine the word vectors into a "sentence vector" given my small dataset?
A great feature of word2vec vectors is that you can perform simple operations on them. One common way of getting from words to sentences is to simply take the average of the word vectors for all the words in your sentence.
Since your sample data is small, I'd use a pretrained embedding from gensim-data, retrain it using your own sample, and at the end use a simpler classifier like logistic regression.
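A hedged sketch combining both suggestions: pre-trained vectors via gensim's downloader, sentence vectors by averaging, and a logistic-regression classifier. The model name, sentences, and labels are placeholders:
import numpy as np
import gensim.downloader as api
from sklearn.linear_model import LogisticRegression

wv = api.load("glove-wiki-gigaword-100")    # any pre-trained KeyedVectors

def sentence_vector(sentence):
    # Average the vectors of the in-vocabulary words; zero vector if none match.
    words = [w for w in sentence.lower().split() if w in wv.key_to_index]
    if not words:
        return np.zeros(wv.vector_size)
    return np.mean([wv[w] for w in words], axis=0)

sentences = ["an example sentence", "another training sentence"]   # your ~2300 sentences
labels = [0, 1]                                                    # your ~5 classes

X = np.array([sentence_vector(s) for s in sentences])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict([sentence_vector("a new sentence to classify")]))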
To Nathan's point, if you want to classify documents, Doc2Vec is a great extension of Word2Vec that reduces a lot of the steps. With a few iterations, you can actually achieve really good results. Here is a great implementation of Doc2Vec.
Basically you need to know where to split your sentences first, then you can use a doc2vec model for those sentences.
https://radimrehurek.com/gensim/models/doc2vec.html
Determine where your sentence boundaries are
Model sentence splitting
Train Doc2Vec model on sentences
Input sentence vectors to NN model
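A rough sketch of those steps (sentence splitting via NLTK's punkt tokenizer and a logistic-regression classifier are assumptions here; the original steps leave the splitter and the final model open):
import nltk
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

nltk.download("punkt")

raw_text = "First sentence of the corpus. Second sentence. And so on."   # placeholder
sentences = nltk.sent_tokenize(raw_text)                        # steps 1-2: split
tagged = [TaggedDocument(s.lower().split(), [i]) for i, s in enumerate(sentences)]

d2v = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)   # step 3: train

X = [d2v.dv[i] for i in range(len(tagged))]                     # step 4: sentence vectors
y = [0, 1, 0]                                                   # placeholder class labels
clf = LogisticRegression(max_iter=1000).fit(X, y)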
I've done this with limited success. Your corpus is small, but you can always try it out and then test/validate/evaluate!
Good luck
I would use gensim's implementation of Paragraph Vector, Doc2Vec, for this. I've just written an article describing how to use it to classify movie reviews, which might help you!
I'm currently trying to classify tweets using the Naive Bayes classifier in NLTK. I'm classifying tweets related to particular stock symbols, using the '$' prefix (e.g. $AAPL). I've been basing my Python script off of this blog post: Twitter Sentiment Analysis using Python and NLTK. So far, I've been getting reasonably good results. However, I feel there is much, much room for improvement.
In my word-feature selection method, I decided to implement the tf-idf algorithm to select the most informative words. After having done this though, I felt that the results weren't that impressive.
I then implemented the technique from the following blog post: Text Classification Sentiment Analysis Eliminate Low Information Features. The results were very similar to the ones obtained with the tf-idf algorithm, which led me to inspect my classifier's 'Most Informative Features' list more thoroughly. That's when I realized I had a bigger problem:
Tweets and real language don't use the same grammar and wording. In normal text, many articles and verbs can be singled out using tf-idf or stopwords. However, in a tweet corpus, some extremely uninformative words, such as 'the', 'and', 'is', etc., occur just as often as words that are crucial to categorizing text correctly. I can't just remove all words that have fewer than 3 letters, because some uninformative features are bigger than that, and some informative ones are smaller.
If I could, I would like to not have to use stopwords, because of the need to frequently update the list. However, if that's my only option, I guess I'll have to go with it.
So, to summarize my question, does anyone know how to truly get the most informative words in the specific source that is a Tweet?
EDIT: I'm trying to classify into three groups: positive, negative, and neutral. Also, I was wondering, for TF-IDF, should I only be cutting off the words with the low scores, or also some with the higher scores? In each case, what percentage of the vocabulary of the text source would you exclude from the feature selection process?
The blog post you link to describes the show_most_informative_features method, but the NaiveBayesClassifier also has a most_informative_features method that returns the features rather than just printing them. You could simply set a cutoff based on your training set: features like "the", "and", and other unimportant words would be at the bottom of the list in terms of informativeness.
It's true that this approach could be subject to overfitting (some features would be much more important in your training set than in your test set), but that would be true of anything that filters features based on your training set.
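A minimal sketch of that cutoff idea, assuming classifier is an already-trained NLTK NaiveBayesClassifier and your feature names are words (the cutoff of 1000 is arbitrary):
# Keep only the feature names that make the informativeness cutoff.
informative = classifier.most_informative_features(1000)    # list of (feature, value) pairs
keep = {feature for feature, value in informative}

def filter_features(feature_dict):
    # Drop features that did not make the cutoff before training/classifying.
    return {f: v for f, v in feature_dict.items() if f in keep}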