Doc2Vec: Differentiate Sentence and Document - python

I am just playing around with Doc2Vec from gensim, analysing a Stack Exchange dump to measure the semantic similarity of questions and identify duplicates.
The tutorial on Doc2Vec-Tutorial seems to describe the input as tagged sentences.
But the original paper: Doc2Vec-Paper claims that the method can be used to infer fixed length vectors of paragraphs/documents.
Can someone explain the difference between a sentence and a document in this context, and how I would go about inferring paragraph vectors?
Since a question can sometimes span multiple sentences, I thought that during training I would give sentences arising from the same question the same tags, but then how would I use infer_vector on unseen questions?
And this notebook: Doc2Vec-Notebook
seems to be training vectors on both TRAIN and TEST docs. Can someone explain the rationale behind this, and should I do the same?

Gensim's Doc2Vec expects you to provide text examples of the same object-shape as the example TaggedDocument class: having both a words and a tags property.
The words are an ordered sequence of string tokens of the text – they might be a single sentence worth, or a paragraph, or a long document, it's up to you.
The tags are a list of tags to be learned from the text – such as plain ints, or string-tokens, that somehow serve to name the corresponding texts. In the original 'Paragraph Vectors' paper, they were just unique IDs for each text – such as integers monotonically increasing from 0. (So the first TaggedDocument might have a tags of just [0], the next [1], etc.)
The algorithm just works on chunks of text, without any idea of what a sentence/paragraph/document etc might be. (Just consider them all 'documents' for the purpose of Doc2Vec, with you deciding what's the right kind of 'document' from your corpus.) It's even common for the tokenization to retain punctuation, such as the periods between sentences, as standalone tokens.
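As a rough illustration of the shape described above, here is a minimal sketch that treats each whole question (however many sentences it has) as one 'document' with one unique integer tag. The namedtuple is only a standard-library stand-in with the same words/tags shape as gensim's TaggedDocument; the sample questions and the naive tokenize helper are assumptions made up for the example.

```python
from collections import namedtuple

# Stand-in with the same shape as gensim's TaggedDocument: a words and a tags property.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

# Hypothetical questions; each one may span several sentences.
questions = [
    "How do I sort a list? I tried sorted() but it returns a copy.",
    "What is a lambda? Is it just an anonymous function?",
]


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


# One document per question, tagged with a unique integer ID.
corpus = [
    TaggedDocument(words=tokenize(question), tags=[index])
    for index, question in enumerate(questions)
]

# An unseen question is tokenized the same way; it needs no tag for inference.
unseen_words = tokenize("How can I sort a dictionary? By value, ideally.")
```

With a corpus of this shape there is no need to give separate sentences a shared tag: the whole question is a single text, and the token list for an unseen question is what would be handed to infer_vector().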
Inference occurs via the infer_vector() method, which takes a mandatory parameter doc_words, which should be a list-of-string-tokens just like those that were supplied as text words during training.
You don't supply any tags on inferred text: Doc2Vec just gives you back a raw vector that, within the relationships learned by the model, fits the text well. (That is: the vector is good at predicting the text's words, in the same way that the vectors and internal model weights learned during bulk training were good at predicting the training texts' words.)
Note that many have found better results from inference by increasing the optional steps parameter (named epochs in more recent gensim releases), and possibly by decreasing the inference starting alpha to be more like the bulk-training starting alpha, 0.025 to 0.05.
The doc2vec-IMDB demo notebook tries to reproduce one of the experiments from the original Paragraph Vectors paper, so it's following what's described there, and a demo script that one of the authors (Mikolov) once released. Since 'test' documents (without their target-labels/known-sentiments) may still be available, at training time, to help improve the text-modelling, it can be reasonable to include their raw texts during the unsupervised Doc2Vec training. (Their known-labels are not used when training the classifier which uses the doc-vectors.)
(Note that at the moment, February 2017, the doc2vec-IMDB demo notebook is a little out-of-date compared to the current gensim Doc2Vec defaults & best-practices – in particular the models aren't given the right explicit iter=1 value to make the later manual loop-and-train() do just the right number of training passes.)


Contextual word embeddings from pretrained word2vec vectors

I would like to create word embeddings that take context into account, so the vector of the word Jaguar [animal] would be different from the word Jaguar [car brand].
As you know, word2vec only gives one representation for a given word, and I would like to take already pretrained embeddings and enrich them with context. So far I've tried a simple way with taking an average vector of the word and category word, for example like this.
Now I would like to try to create and train a neural network that would take entire sentences, e.g.
Jaguar F-PACE is a great SUV sports car.
Among cats, only tigers and lions are bigger than jaguars.
And then it would undertake the task of text classification (I have a dataset with several categories like animals, cars, etc.), but the result would be new representations for the word jaguar, but in different contexts, so two different embeddings.
Does anyone have any idea how I could create such a network? I don't hide that I'm a beginner and have no idea how to go about it.
If you've already been able to perform sense-disambiguation outside word2vec, then you can change the word-tokens to reflect your external judgement. For example, change some appearances of the token 'jaguar' to 'jaguar*car' and others to 'jaguar*animal'. Proceeding with normal word2vec training will then get your two different tokens two different word-vectors.
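A minimal sketch of that retagging step, assuming the sense of each sentence is already known from some external judgement. The tag_senses helper and the labelled sample sentences are hypothetical, made up for the example.

```python
def tag_senses(tokens, target, sense):
    """Replace every occurrence of `target` with a sense-specific token."""
    return [f"{target}*{sense}" if token == target else token for token in tokens]


# Hypothetical pre-tokenized sentences, each paired with an externally decided sense.
labelled_sentences = [
    (["jaguar", "f-pace", "is", "a", "great", "suv"], "car"),
    (["only", "tigers", "and", "lions", "are", "bigger", "than", "a", "jaguar"], "animal"),
]

retagged = [
    tag_senses(tokens, "jaguar", sense) for tokens, sense in labelled_sentences
]
```

The retagged token lists can then be used as the training corpus in place of the originals, so each sense gets its own vocabulary entry.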
If you're hoping for the training to discover these itself, as ~Erwan mentioned in a comment, that seems like an open research question, without a standard or off-the-shelf solution that a beginner could drop-in.
I'd once seen a paper (around the time of the original word2vec papers, but can't find the link now) that tried to do this in a word2vec-compatible way by 1st proceeding with traditional polysemy-oblivious training. Then, for every appearance of a word X, model its surrounding context via some combination of the word-vectors of neighbors within a certain number of positions. (That in itself is very similar to the preparation of a context-vector in the CBOW mode of word2vec.) Perform some clustering on that collection-of-all-contexts to come up with some idea of alternate senses – each associated with one cluster. Then, in a followup pass on the original corpus, replace word-tokens with those that also reflect their nearby-context cluster. (EG: 'jaguar' might be replaced with 'jaguar*1', 'jaguar*2', etc based on which discrete cluster its context suggested.) Then, repeat (or continue) word2vec training to get sense-specific word-vectors. Of course, the devil would be in the details of how contexts are defined, how clusters are deduced, and tough edge-cases (where potentially the text's author is themselves deploying the multiple senses).
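To make the steps above concrete, here is a toy sketch and not the method of the paper mentioned. It assumes made-up 2-d word vectors, averages the neighbours of each 'jaguar' occurrence into a context vector, splits the contexts into two clusters with a tiny hand-rolled 2-means, and rewrites the token accordingly. All names, vectors and the choice of two clusters are assumptions for illustration.

```python
# Hypothetical 2-d word vectors from a first, polysemy-oblivious training pass.
word_vectors = {
    "car": (1.0, 0.0), "suv": (0.9, 0.1), "engine": (0.8, 0.2),
    "cat": (0.0, 1.0), "tiger": (0.1, 0.9), "jungle": (0.2, 0.8),
}
sentences = [
    ["jaguar", "suv", "engine"],
    ["tiger", "jaguar", "jungle"],
    ["car", "jaguar", "engine"],
    ["cat", "jaguar", "tiger"],
]


def context_vector(tokens, index, window=2):
    """Average the vectors of known neighbours within `window` positions."""
    start, stop = max(0, index - window), min(len(tokens), index + window + 1)
    neighbours = [
        word_vectors[tokens[i]]
        for i in range(start, stop)
        if i != index and tokens[i] in word_vectors
    ]
    if not neighbours:
        return (0.0, 0.0)
    return tuple(sum(column) / len(neighbours) for column in zip(*neighbours))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(points, iterations=10):
    """Tiny 2-means: start from the first point and the point farthest from it."""
    centroids = [points[0], max(points, key=lambda p: sq_dist(p, points[0]))]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min((0, 1), key=lambda k: sq_dist(p, centroids[k])) for p in points]
        for k in (0, 1):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centroids[k] = tuple(sum(column) / len(members) for column in zip(*members))
    return labels


# Locate every 'jaguar', cluster its contexts, then rewrite the token per cluster.
occurrences = [
    (s, t) for s, sentence in enumerate(sentences)
    for t, token in enumerate(sentence) if token == "jaguar"
]
contexts = [context_vector(sentences[s], t) for s, t in occurrences]
labels = two_means(contexts)
retagged = [list(sentence) for sentence in sentences]
for (s, t), label in zip(occurrences, labels):
    retagged[s][t] = f"jaguar*{label}"
```

On this toy data the car-like contexts end up in one cluster and the animal-like contexts in the other; real corpora would need a proper clustering method and a principled way to pick the number of senses.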
Some other interesting efforts to model or deduce polysemy in word2vec models:
"Linear Algebraic Structure of Word Meanings"
"A Simple Approach to Learn Polysemous Word Embeddings"
But per above, I've not seen these sorts of techniques widely implemented/adopted in a form that's easy to drop-in to another project.

Train Spacy on unlabeled text corpus to extract "important phrases"

I'm looking to find a way to extract "important phrases" from text documents. Was hoping to do this using Spacy, but there is one caveat: my data contains mostly product information and therefore the important phrases are different from what they would be in natural spoken language. For this reason, I would like to train spacy on my own corpus, but the only info I can find is for training spacy using labeled data.
Does anyone know if what I want to do is possible?
If you are looking for a scheme to weight phrases according to "Importance" without any labeled data, you can try using TF-IDF.
For this answer, I will refer to terms - these can be phrases or words. It just represents a single entity of text.
A Brief Look at TF-IDF
TF-IDF stands for (Term Frequency) x (Inverse Document Frequency).
It is a measure of how often a term appears in a single document vs. how often that term appears across the entire corpus of documents.
It is commonly used as a statistical measure to determine how important terms are in a corpus.
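As a small worked example of the definition above, here is a sketch using the plain textbook form, tf × log(N / df). Libraries typically use smoothed or normalised variants, so their numbers will differ. The toy product documents and the tf_idf helper are assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical tokenized product descriptions.
docs = [
    ["usb", "cable", "fast", "charging", "usb"],
    ["hdmi", "cable", "gold", "plated"],
    ["fast", "charging", "wall", "adapter"],
]


def tf_idf(term, doc, docs):
    """Plain TF-IDF: relative frequency in `doc` times log(N / document frequency)."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf


usb_score = tf_idf("usb", docs[0], docs)      # frequent here, rare elsewhere
cable_score = tf_idf("cable", docs[0], docs)  # appears in two of three documents
```

Here "usb" scores higher than "cable" in the first document, because it occurs twice there and in no other document.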
For a longer, but readable explanation of it, check out the wiki:
Code Implementation
Check out Scikit-Learn's TfidfVectorizer.
This has a fit_transform function that takes raw text as input and outputs the appropriate TF-IDF weights for words and/or n-grams.
If you prefer to do your own tokenization with spaCy, or only include doc.noun_chunks and doc.ents that satisfy len(span) >= 2 (i.e. phrases), there is a little hack for the TfidfVectorizer.
To use your own tokenization, do the following:
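from sklearn.feature_extraction.text import TfidfVectorizer
# Identity analyzer: each document is already a list of tokens/phrases
dummy = lambda x: x
vectorizer = TfidfVectorizer(analyzer=dummy)
tfidf = vectorizer.fit_transform(list_of_tokenized_docs)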
This overrides the default tokenization and lets you use your own list of tokens.
From there you can find the terms that have the highest average TF-IDF score across all documents, and consider those as Important. You can try using those as input to the PhraseMatcher:
Or you can find some way to use these to automatically label documents. If you can locate them in your documents after determining they are important, you can then add an appropriate label and use that as training data to some training pipeline.
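The "highest average TF-IDF score across all documents" step mentioned above could be sketched roughly as follows. It assumes the scores have already been computed and collected into one dict per document; the per_doc_scores values and the top_terms helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-document TF-IDF scores (term -> score), one dict per document.
per_doc_scores = [
    {"usb cable": 0.60, "fast charging": 0.30},
    {"hdmi cable": 0.70, "gold plated": 0.40},
    {"fast charging": 0.50, "wall adapter": 0.45},
]


def top_terms(per_doc_scores, n):
    """Rank terms by their score averaged over ALL documents (absent counts as 0)."""
    totals = defaultdict(float)
    for scores in per_doc_scores:
        for term, score in scores.items():
            totals[term] += score
    averages = {term: total / len(per_doc_scores) for term, total in totals.items()}
    return sorted(averages, key=averages.get, reverse=True)[:n]


important = top_terms(per_doc_scores, 2)
```

Averaging over all documents favours terms that score well in several documents; taking the maximum per term instead would favour terms that are very distinctive in a single document, which may suit some product catalogues better.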
If you want exact phrases to be recognised, you can compile a list of those phrases and use spaCy's PhraseMatcher component to recognise them later.
The only catch is that it will recognise only the exact phrases supplied to it. This is in contrast to NER, which can recognise additional phrases based on the training provided; PhraseMatcher will only recognise the ones you give it.

Is it possible to use Google BERT to calculate similarity between two textual documents?

Is it possible to use Google BERT for calculating similarity between two textual documents? As I understand it, BERT's input is supposed to be sentences of limited size. Some works use BERT for similarity calculation for sentences like:
Is there an implementation of BERT that takes large documents (with thousands of words) as input instead of sentences?
BERT is not trained to determine if one sentence follows another. That is just ONE of the GLUE tasks and there are a myriad more. ALL of the GLUE tasks (and superglue) are getting knocked out of the park by ALBERT.
BERT (and Albert for that matter) is the absolute state of the art in Natural Language Understanding. Doc2Vec doesn't come close. BERT is not a bag-of-words method. It's a bi-directional attention based encoder built on the Transformer which is the incarnation of the Google Brain paper Attention is All you Need. Also see this Visual breakdown of the Transformer model.
This is a fundamentally new way of looking at natural language which doesn't use RNN's or LSTMs or tf-idf or any of that stuff. We aren't turning words or docs into vectors anymore. GloVes: Global Vectors for Word Representations with LSTMs are old. Doc2Vec is old.
BERT is reeeeeallly powerful - like, pass the Turing test easily powerful. Take a look at
See superGLUE which just came out. Scroll to the bottom and look at how insane those tasks are. THAT is where NLP is at.
Okay so now that we have dispensed with the idea that tf-idf is state of the art - you want to take documents and look at their similarity? I would use ALBERT on Databricks in two layers:
Perform either Extractive or Abstractive summarization (NOTICE HOW BIG THOSE DOCUMENTS OF TEXT ARE) and reduce your document down to a summary.
In a separate step, take each summary and do the STS-B task from Page 3 GLUE
Now, we are talking about absolutely bleeding edge technology here (Albert came out in just the last few months). You will need to be extremely proficient to get through this but it CAN be done, and I believe in you!!
BERT is a sentence representation model. It is trained to predict words in a sentence and to decide if two sentences follow each other in a document, i.e., strictly on the sentence level. Moreover, BERT requires quadratic memory with respect to the input length which would not be feasible with documents.
It is quite common practice to average word embeddings to get a sentence representation. You can try the same thing with BERT and average the [CLS] vectors from BERT over sentences in a document.
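The averaging idea can be sketched without any model at all, assuming one vector per sentence has already been obtained. The 3-d vectors below are made-up stand-ins for per-sentence [CLS] outputs, and mean_vector and cosine are hypothetical helper names.

```python
import math


def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical per-sentence vectors for three documents (two sentences each).
doc_a = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
doc_b = [[0.85, 0.15, 0.05], [0.9, 0.0, 0.1]]
doc_c = [[0.0, 0.1, 0.9], [0.1, 0.0, 0.8]]

sim_ab = cosine(mean_vector(doc_a), mean_vector(doc_b))
sim_ac = cosine(mean_vector(doc_a), mean_vector(doc_c))
```

In this toy setup doc_a comes out closer to doc_b than to doc_c. Whether averaged sentence vectors from a real model separate real documents this cleanly is something to check empirically.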
There are some document-level embeddings. For instance doc2vec is a commonly used option.
As far as I know, at the document level, frequency-based vectors such as tf-idf (with a good implementation in scikit-learn) are still close to state of the art, so I would not hesitate using it. Or at least it is worth trying to see how it compares to embeddings.
To add to @jindřich's answer, BERT is meant to find missing words in a sentence and predict the next sentence. Word-embedding-based doc2vec is still a good way to measure similarity between docs. If you want to delve deeper into why the best model isn't always the best choice for a use case, give this post a read, where it clearly explains why not every state-of-the-art model is suitable for a task.
Ya. You would just do each part independently. For summarization you hardly need to do much. Just look on pypi for summarize and you have several packages. Don't even need to train. Now for sentence to sentence similarity there is a fairly complex method for getting loss but it's spelled out in the GLUE website. It's considered part of the challenge (meeting the metric). Determining that distance (sts) is non trivial and I think they call it "coherence" but I'm not sure.

Training Data Set in NLTK Python

I am working on Python NLTK tagging, and my input text is non-Hindi.
In order to tokenize my input text it must first be trained.
My question is how to train the data?
I have this line of code, as suggested to me here on Stack Overflow.
train_data = indian.tagged_sents('hindi.pos')
How about non-Hindi input data?
The short answer is: Training a tagger requires a tagged corpus.
Assigning part of speech tags must be done according to some existing model.
Unfortunately, unlike some problems like finding sentence boundaries, there is no way to choose them out of thin air. There are some experimental approaches that try to assign parts of speech using parallel texts and machine-translation alignment algorithms, but all real POS taggers must be trained on text that has been tagged already.
Evidently you don't have a tagged corpus for your unnamed language, so you'll need to find or create one if you want to build a tagger. Creating a tagged corpus is a major undertaking, since you'll need a lot of training materials to get any sort of decent performance. There may be ways to "bootstrap" a tagged corpus (put together a poor-quality tagger that will make it easier to retag the results by hand), but all that depends on your situation.
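To show why a tagged corpus is the starting point, here is a toy most-frequent-tag tagger written with only the standard library. It is a sketch of the general idea rather than NLTK's implementation; the three hand-tagged English sentences and the helper names are made up for the example.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged training sentences: lists of (word, tag) pairs.
tagged_sents = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]


def train_unigram_tagger(tagged_sents):
    """Learn, for each word, the tag it carried most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sents:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(model, tokens, default=None):
    """Tag each token; words never seen in training get `default`."""
    return [(token, model.get(token, default)) for token in tokens]


model = train_unigram_tagger(tagged_sents)
result = tag(model, ["the", "cat", "barks", "loudly"])
```

The unseen word "loudly" gets no tag at all, which is the point made above: the tagger knows only what the tagged training text has shown it.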

Tweet Classifier Feature-Selection NLTK

I'm currently trying to classify Tweets using the Naive Bayes classifier in NLTK. I'm classifying tweets related to particular stock symbols, using the '$' prefix (eg: $AAPL). I've been basing my Python script off of this blog post: Twitter Sentiment Analysis using Python and NLTK. So far, I've been getting reasonably good results. However, I feel there is much, much room for improvement.
In my word-feature selection method, I decided to implement the tf-idf algorithm to select the most informative words. After having done this though, I felt that the results weren't that impressive.
I then implemented the technique on the following blog: Text Classification Sentiment Analysis Eliminate Low Information Features. The results were very similar to the ones obtained with the tf-idf algorithm, which led me to inspect my classifier's 'Most Informative Features' list more thoroughly. That's when I realized I had a bigger problem:
Tweets and real language don't use the same grammar and wording. In a normal text, many articles and verbs can be singled out using tf-idf or stopwords. However, in a tweet corpus, some extremely uninformative words, such as 'the', 'and', 'is', etc., occur just as much as words that are crucial to categorizing text correctly. I can't just remove all words that have less than 3 letters, because some uninformative features are bigger than that, and some informative ones are smaller.
If I could, I would like to not have to use stopwords, because of the need to frequently update the list. However, if that's my only option, I guess I'll have to go with it.
So, to summarize my question, does anyone know how to truly get the most informative words in the specific source that is a Tweet?
EDIT: I'm trying to classify into three groups: positive, negative, and neutral. Also, I was wondering, for TF-IDF, should I only be cutting off the words with the low scores, or also some with the higher scores? In each case, what percentage of the vocabulary of the text source would you exclude from the feature selection process?
The blog post you link to describes the show_most_informative_features method, but the NaiveBayesClassifier also has a most_informative_features method that returns the features rather than just printing them. You could simply set a cutoff based on your training set: features like "the", "and" and other unimportant features would be at the bottom of the list in terms of informativeness.
It's true that this approach could be subject to overfitting (some features would be much more important in your training set than in your test set), but that would be true of anything that filters features based on your training set.
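A rough sketch of the cutoff idea, assuming the ranked list has already been obtained. The ranked_features list below is a hand-written stand-in for the (feature name, feature value) pairs that most_informative_features returns; the feature names, the cutoff of 3 and the keep_top_features helper are all hypothetical.

```python
# Hypothetical ranked (feature name, feature value) pairs, most informative first.
ranked_features = [
    ("contains(bullish)", True),
    ("contains(sell)", True),
    ("contains(earnings)", True),
    ("contains(the)", True),
    ("contains(and)", True),
]


def keep_top_features(featureset, ranked_features, cutoff):
    """Drop every feature whose name is not among the top `cutoff` ranked names."""
    keep = {name for name, _ in ranked_features[:cutoff]}
    return {name: value for name, value in featureset.items() if name in keep}


tweet_features = {
    "contains(bullish)": True,
    "contains(the)": True,
    "contains(sell)": False,
}
reduced = keep_top_features(tweet_features, ranked_features, cutoff=3)
```

The reduced feature dicts would then be used to retrain the classifier. Choosing the cutoff on held-out data rather than on the training set alone is one way to limit the overfitting risk mentioned above.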

