Calculate accuracy of word2vec model in Python

I want to get the accuracy from the Doc2Vec model implemented in Python.
I saw in the official documentation that there is a method to get the accuracy, which takes as parameter a file. What should be the content of that input file?
I tried to put 4-tuple as documentation says, but I get all the patterns misclassified.

There's no simple measurement of a Doc2Vec model's accuracy – you'd need an evaluation method that's custom to your corpus and project goals.
The accuracy() method on Word2Vec, also inherited by Doc2Vec, does a very narrow kind of analogy-testing, using word-vectors only, because the same method was used in the original word2vec paper and original Google word2vec.c toolkit. You can see the test-files they used, questions-words.txt and questions-phrases.txt, in a Github mirror of the Google word2vec-toolkit.
Since some Doc2Vec modes generate word-vectors, you could do this sort of analogy test on those Doc2Vec models – but it doesn't check the document-vectors at all, and a model that does well on those word-analogies might not be best for whatever your downstream document task is.
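To make the file format concrete, here is a minimal sketch of that kind of analogy test in plain Python. The vectors are made up for illustration, and the helper names (solve, analogy_accuracy) are hypothetical, not gensim's API. The questions text follows the same layout as questions-words.txt: lines starting with ":" name a section, and every other line holds four words "a b c expected".

```python
import math

# Toy 2-d word vectors, invented for illustration (not from a trained model).
VECTORS = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "paris": [0.0, 3.0],
    "france": [0.5, 4.0],
}

# Same layout as questions-words.txt: ": section" headers, then 4 words per line.
QUESTIONS = """\
: toy-section
man woman king queen
man king woman queen
"""


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve(a, b, c):
    """Return the word nearest to (b - a + c), excluding the three inputs."""
    target = [vb - va + vc
              for va, vb, vc in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = (w for w in VECTORS if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))


def analogy_accuracy(text):
    """Fraction of 4-word lines whose fourth word is predicted correctly."""
    correct = total = 0
    for line in text.splitlines():
        line = line.strip().lower()
        if not line or line.startswith(":"):
            continue  # skip blank lines and section headers
        a, b, c, expected = line.split()
        total += 1
        correct += solve(a, b, c) == expected
    return correct / total


accuracy = analogy_accuracy(QUESTIONS)
```

A real evaluation differs in details (for example, handling of words missing from the vocabulary), so treat this only as an outline of what the 4-word lines mean.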

Related

How to retrofit a fasttext model?

I have read in various research papers that one can retrofit a fastText model to improve its accuracy (https://github.com/mfaruqui/retrofitting). However, I am having trouble working out how to implement it.
The code at the GitHub link above takes one vector file, retrofits it, and outputs another vector file, which I can load using the gensim library. However, since the result is only a vector file, it is no longer a model and it cannot produce vectors for OOV (out-of-vocabulary) words. This makes it pointless. Is there a way to retrain the model somehow so it has better accuracy?

As far as I understand from reading the paper and browsing the repository, the proposed methodology only improves the quality of the vectors (.vec) given as input.
As you can read here, fastText's ability to represent out-of-vocabulary words is inherent in the .bin model (which contains the vectors for all the n-grams).
As you too may have understood, there is no out-of-the-box way to retrofit a fastText model, using the proposed methodology.
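A rough sketch of why the subword table matters: fastText composes a vector for an unseen word from the vectors of its character n-grams, wrapped in "<" and ">" boundary markers. The n-gram table below is hypothetical and tiny, and the real library hashes n-grams into buckets rather than storing them by name, so this only illustrates the idea.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of '<word>', in the style fastText uses for subwords."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


# Hypothetical subword table standing in for what a .bin model stores.
NGRAM_VECTORS = {
    "<ca": [1.0, 0.0],
    "cat": [0.0, 1.0],
    "ats": [1.0, 1.0],
    "ts>": [0.0, 2.0],
}


def oov_vector(word, table, dim=2):
    """Average the vectors of the word's n-grams that the table knows."""
    known = [table[g] for g in char_ngrams(word) if g in table]
    if not known:
        return None  # nothing to compose from, as with a plain .vec file
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


with_subwords = oov_vector("cats", NGRAM_VECTORS)   # composed from 4 n-grams
without_subwords = oov_vector("cats", {})           # no subword table at all
```

A retrofitted .vec file corresponds to the second call: it has whole-word vectors only, so there is nothing to build an out-of-vocabulary vector from.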

Is it possible to use Google BERT to calculate similarity between two textual documents?

Is it possible to use Google BERT for calculating similarity between two textual documents? As I understand it, BERT's input is supposed to be sentences of limited size. Some works use BERT for similarity calculation between sentences, for example:
https://github.com/AndriyMulyar/semantic-text-similarity
https://github.com/beekbin/bert-cosine-sim
Is there an implementation of BERT that works with large documents (thousands of words) instead of sentences as inputs?

BERT is not trained to determine if one sentence follows another. That is just ONE of the GLUE tasks and there are a myriad more. ALL of the GLUE tasks (and superglue) are getting knocked out of the park by ALBERT.
BERT (and Albert for that matter) is the absolute state of the art in Natural Language Understanding. Doc2Vec doesn't come close. BERT is not a bag-of-words method. It's a bi-directional attention based encoder built on the Transformer which is the incarnation of the Google Brain paper Attention is All you Need. Also see this Visual breakdown of the Transformer model.
This is a fundamentally new way of looking at natural language which doesn't use RNNs or LSTMs or tf-idf or any of that stuff. We aren't turning words or docs into vectors anymore. GloVe (Global Vectors for Word Representation) with LSTMs is old. Doc2Vec is old.
BERT is reeeeeallly powerful - like, pass the Turing test easily powerful.
See SuperGLUE, which just came out. Scroll to the bottom and look at how insane those tasks are. THAT is where NLP is at.
Okay so now that we have dispensed with the idea that tf-idf is state of the art - you want to take documents and look at their similarity? I would use ALBERT on Databricks in two layers:
Perform either extractive or abstractive summarization: https://pypi.org/project/bert-extractive-summarizer/ (NOTICE HOW BIG THOSE DOCUMENTS OF TEXT ARE) and reduce your document down to a summary.
In a separate step, take each summary and do the STS-B task from page 3 of GLUE.
Now, we are talking about absolutely bleeding edge technology here (Albert came out in just the last few months). You will need to be extremely proficient to get through this but it CAN be done, and I believe in you!!

BERT is a sentence representation model. It is trained to predict words in a sentence and to decide if two sentences follow each other in a document, i.e., strictly on the sentence level. Moreover, BERT requires quadratic memory with respect to the input length which would not be feasible with documents.
It is quite common practice to average word embeddings to get a sentence representation. You can try the same thing with BERT and average the [CLS] vectors from BERT over sentences in a document.
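The averaging idea can be sketched without any model at all. The sentence vectors below are made-up 3-d stand-ins for the per-sentence [CLS] vectors you would get from BERT, and mean_pool is a hypothetical helper name.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# One inner list per sentence; values invented for illustration.
doc_a_sentences = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
doc_b_sentences = [[1.0, 1.0, 2.0]]

doc_a = mean_pool(doc_a_sentences)  # one vector for the whole document
doc_b = mean_pool(doc_b_sentences)
similarity = cosine(doc_a, doc_b)
```

Because each sentence is encoded separately, the quadratic memory cost applies per sentence rather than to the whole document.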
There are some document-level embeddings. For instance doc2vec is a commonly used option.
As far as I know, at the document level, frequency-based vectors such as tf-idf (with a good implementation in scikit-learn) are still close to state of the art, so I would not hesitate using it. Or at least it is worth trying to see how it compares to embeddings.
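For a sense of what the tf-idf baseline computes, here is a small pure-Python sketch. It is a toy stand-in, not scikit-learn's implementation (in practice you would likely reach for its TfidfVectorizer); the whitespace tokenizer and the smoothed idf formula used here are assumptions chosen for brevity.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf, so terms present in every document keep a small weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1
           for t, df in doc_freq.items()}
    return [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock markets fell sharply today",
]
vecs = tfidf_vectors(docs)
sim_close = cosine(vecs[0], vecs[1])  # documents sharing several terms
sim_far = cosine(vecs[0], vecs[2])    # documents sharing no terms
```

Running the same comparison with embedding-based vectors on your own documents is the simplest way to see which representation works better for your task.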

To add to @jindřich's answer, BERT is meant to find missing words in a sentence and to predict the next sentence. Word-embedding-based doc2vec is still a good way to measure similarity between docs. If you want to delve deeper into why the best-scoring model is not always the best choice for a use case, give this post a read; it clearly explains why not every state-of-the-art model is suitable for a task.

Ya. You would just do each part independently. For summarization you hardly need to do much. Just look on PyPI for "summarize" and you'll find several packages. You don't even need to train. Now, for sentence-to-sentence similarity there is a fairly complex method for getting the loss, but it's spelled out on the GLUE website. It's considered part of the challenge (meeting the metric). Determining that distance (STS) is non-trivial, and I think they call it "coherence", but I'm not sure.

Creating train,test data for Word2Vec model

I am trying to create a W2V model and then generate train and test data to be used for my model. My question is: how can I generate test data once I have created a W2V model with my train data?

Word2Vec is considered an 'unsupervised' algorithm, so at least during its training, it is not typical to hold back any 'test' data for later evaluation.
A Word2Vec model is usually then evaluated on how well it helps some other process - such as the analogy-solving highlighted by the original paper. In gensim, the method evaluate_word_analogies() can repeat that process. But note: word-vectors that perform best on word-analogies may not be best for other purposes, like classification or info-retrieval. It's always best to evaluate & tune your word-vectors in a repeatable way that's related to your actual underlying use.
(If you're using the Word2Vec model's outputs - word-vectors specific to your domain – as part of a larger system, where some steps should be evaluated with held-back data, the decision of whether to train the Word2Vec component on all data could go either way, depending on other considerations.)

How to predict output word for KeyedVectors word2vec?

The gensim.models.Word2Vec class has a method predict_output_word(). I am now using a pretrained model, but it was saved as a gensim.models.KeyedVectors instance. Does that class have an analogous method? Or how can I get an instance of gensim.models.Word2Vec from gensim.models.KeyedVectors?
I know about most_similar(), but that does something different.

A KeyedVectors instance is only the words and vectors themselves, not the full model, including internal weights that were important for training (and the internal predictions made during training).
So, a KeyedVectors object lacks the state required to make predictions, and thus also the method. (Note also that the method is relatively expensive to run, only works with negative-sampling models, and doesn't give results that are weighted quite the same as the 'sparse' semi-predictions made internally during training. The point of Word2Vec isn't really accurate neighbor predictions, but using attempts-at-such-predictions to bootstrap an arrangement of vectors that has other useful properties.)
If you're training the word-vectors yourself, you should save the full model if you need full-model functionality later.
There's no way to turn a KeyedVectors instance, or mere set-of-word-vectors, into a full Word2Vec model.
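To illustrate what "the state required to make predictions" means, here is a toy sketch of a prediction step. It only loosely mirrors what a negative-sampling model does, all numbers are invented, and predict_output_word_sketch is a hypothetical name, not gensim's method. The point is that it needs a second set of weights (the output layer) in addition to the word-vectors.

```python
import math

# Input word-vectors: this is all that a set-of-word-vectors provides.
WORD_VECTORS = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "car": [0.0, 1.0],
}

# Output-layer weights: kept only by the full model (values invented here).
OUTPUT_WEIGHTS = {
    "cat": [2.0, 0.0],
    "dog": [1.5, 0.0],
    "car": [0.0, 2.0],
}


def predict_output_word_sketch(context, topn=2):
    """Rank candidate center words for a list of context words."""
    vecs = [WORD_VECTORS[w] for w in context if w in WORD_VECTORS]
    # Hidden layer: the mean of the context words' input vectors.
    hidden = [sum(column) / len(vecs) for column in zip(*vecs)]
    # Score every candidate against its OUTPUT weights, then normalize.
    scores = {
        word: math.exp(sum(h * o for h, o in zip(hidden, out)))
        for word, out in OUTPUT_WEIGHTS.items()
    }
    total = sum(scores.values())
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(word, score / total) for word, score in ranked[:topn]]


predictions = predict_output_word_sketch(["cat", "dog"], topn=3)
```

Without OUTPUT_WEIGHTS the scoring step has nothing to multiply against, which is the situation with a saved set of word-vectors alone.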

Scikit-learn model parameters unavailable? If so what ML workbench alternative?

I am doing machine learning using scikit-learn as recommended in this question. To my surprise, it does not appear to provide access to the actual models it trains. For example, if I create an SVM, linear classifier or even a decision tree, it doesn't seem to provide a way for me to see the parameters selected for the actual trained model.
Seeing the actual model is useful if the model is being created partly to get a clearer picture of what features it is using (e.g., decision trees). Seeing the model is also a significant issue if one wants to use Python to train the model and some other code to actually implement it.
Am I missing something in scikit-learn, or is there some way to get at this in scikit-learn? If not, what is a good free machine learning workbench, not necessarily in Python, in which models are transparently available?

The fitted model parameters are stored directly as attributes on the model instance. There is a specific naming convention for those fitted parameters: they all end with a trailing underscore as opposed to user-provided constructor parameters (a.k.a. hyperparameters) which don't.
The type of the fitted attributes is algorithm-dependent. For instance, for a kernel Support Vector Machine you will have arrays for the support vectors, dual coefficients and intercepts, while for random forests and extremely randomized trees you will have a collection of binary trees (internally represented in memory as contiguous numpy arrays for performance reasons: a structure-of-arrays representation).
See the Attributes section of the docstring of each model for more details, for instance for SVC:
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
For tree-based models you also have a helper function, export_graphviz, to generate a Graphviz export of the learned trees:
http://scikit-learn.org/stable/modules/tree.html#classification
To find the importance of features in forest models, have a look at the fitted feature_importances_ attribute (older releases required the compute_importances parameter); see the following examples for instance:
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#example-ensemble-plot-forest-importances-py
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances_faces.html#example-ensemble-plot-forest-importances-faces-py
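As a quick illustration of the trailing-underscore convention, here is a short sketch on the bundled iris data. It assumes a reasonably recent scikit-learn; export_text is a plain-text alternative to the Graphviz export and is not mentioned above.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target

# Kernel SVM: the fitted parameters live on the estimator itself.
svm = SVC(kernel="rbf").fit(X, y)
fitted_names = sorted(
    name for name in vars(svm)
    if name.endswith("_") and not name.startswith("_")
)
print(fitted_names)                  # names of the fitted attributes
print(svm.support_vectors_.shape)    # (number of support vectors, 4 features)
print(svm.dual_coef_.shape)
print(svm.intercept_)

# Decision tree: feature importances and a readable dump of the rules.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(dict(zip(iris.feature_names, tree.feature_importances_)))
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

The arrays printed here are what you would export if you wanted to reimplement the prediction step in another language.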
