How can we use an ANN to find similar documents? I know it's a silly question, but I am new to the NLP field.
I have built a model using kNN and a bag-of-words approach to solve my problem. With it I can retrieve n documents (along with their closeness scores) that are somewhat similar to the input, but now I want to implement the same thing using an ANN and I have no idea how.
Thanks in advance for any help or suggestions.
You can use "word embeddings" - technique, that presents words in the dense vector representation. To find similar documents as the vectors, you can simply use cosine similarity.
There is an example of how to build a word2vec model using TensorFlow, and another example of how to use the embedding layer from Keras.
To obtain embeddings for your language, either train them yourself on a corpus of your choice (one that is large enough, e.g. Wikipedia) or download pre-trained embeddings (for Python there are plenty of sources of embeddings trained with, or loadable by, the gensim module, which is the de facto standard for word2vec in Python).
You can also use GloVe (using glove-python) or FastText word embeddings.
If you are interested, you can find more detailed descriptions of embeddings, with code examples and the source papers.
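For concreteness, here is a minimal sketch of the averaged-word-vector approach (not taken from the answer above; it assumes gensim 4.x, where the parameter is vector_size rather than size, and the toy corpus is a placeholder): each document becomes the mean of its word vectors, and documents are ranked by cosine similarity.

import numpy as np
from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens (use a real, large corpus in practice).
docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["a", "dog", "barked", "at", "the", "cat"],
        ["stock", "prices", "rose", "sharply", "today"]]

# Train a small skip-gram word2vec model on the corpus.
model = Word2Vec(docs, vector_size=50, min_count=1, sg=1, epochs=50)

def doc_vector(tokens):
    # Average the vectors of the in-vocabulary tokens.
    vectors = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vectors, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = doc_vector(["a", "cat", "sat", "on", "a", "mat"])
ranked = sorted(((i, cosine(query, doc_vector(d))) for i, d in enumerate(docs)),
                key=lambda x: -x[1])
print(ranked)  # most similar documents first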
Have a look at the paper https://arxiv.org/pdf/1805.10685.pdf, which gives you an overall idea.
Check this link for more references: https://github.com/Hironsan/awesome-embedding-models
I have read in various research papers that one can retrofit a fastText model to improve its accuracy (https://github.com/mfaruqui/retrofitting). However, I am having trouble figuring out how to implement it.
The GitHub repository above takes one vector file, retrofits it, and outputs another vector file. I can load that file using the gensim library. However, since it is only a vector file, it is no longer a full model and will not predict OOV (out-of-vocabulary) words, which makes it pointless for my use case. Is there a way to retrain the model somehow so it has better accuracy?
As far as I understand from reading the paper and browsing the repository, the proposed methodology only improves the quality of the vectors (.vec) given as input.
As you can read here, fastText's ability to represent out-of-vocabulary words is inherent in the .bin model (which contains the vectors for all the n-grams).
So, as you may have gathered, there is no out-of-the-box way to retrofit a full fastText model using the proposed methodology.
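As an illustration (a hedged sketch, not part of the original answer; the file names are placeholders and the gensim calls assume release 3.8 or later), the retrofitted .vec file loads as plain word vectors with no subword information, while the original .bin model keeps its OOV ability:

from gensim.models import KeyedVectors
from gensim.models.fasttext import load_facebook_model

# The retrofitted output is a plain vector file: only known words are available.
vecs = KeyedVectors.load_word2vec_format("retrofitted.vec", binary=False)
# vecs["unseenword"]  # would raise KeyError for an out-of-vocabulary word

# The original fastText .bin model keeps the character n-gram matrices,
# so it can still synthesize vectors for unseen words.
ft = load_facebook_model("original_model.bin")
print(ft.wv["unseenword"])  # composed from subword n-grams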
Is it possible to use Google BERT to calculate the similarity between two textual documents? As I understand it, BERT's input is supposed to be sentences of limited length. Some works use BERT to calculate similarity between sentences, for example:
https://github.com/AndriyMulyar/semantic-text-similarity
https://github.com/beekbin/bert-cosine-sim
Is there an implementation of BERT that takes large documents (documents with thousands of words) as input instead of sentences?
BERT is not trained to determine if one sentence follows another. That is just ONE of the GLUE tasks, and there are a myriad more. ALL of the GLUE tasks (and SuperGLUE) are getting knocked out of the park by ALBERT.
BERT (and ALBERT, for that matter) is the absolute state of the art in natural language understanding. Doc2Vec doesn't come close. BERT is not a bag-of-words method. It's a bidirectional, attention-based encoder built on the Transformer, the architecture from the Google Brain paper Attention Is All You Need. Also see this visual breakdown of the Transformer model.
This is a fundamentally new way of looking at natural language that doesn't use RNNs, LSTMs, tf-idf, or any of that stuff. We aren't turning words or docs into vectors anymore. GloVe (Global Vectors for Word Representation) and LSTMs are old. Doc2Vec is old.
BERT is reeeeeallly powerful - like, pass-the-Turing-test-easily powerful. Take a look at SuperGLUE, which just came out. Scroll to the bottom and look at how insane those tasks are. THAT is where NLP is at.
Okay, so now that we have dispensed with the idea that tf-idf is state of the art: you want to take documents and look at their similarity? I would use ALBERT on Databricks in two steps:
Perform either extractive or abstractive summarization: https://pypi.org/project/bert-extractive-summarizer/ (notice how big those documents of text are) and reduce your document down to a summary.
In a separate step, take each summary and do the STS-B (Semantic Textual Similarity Benchmark) task from page 3 of the GLUE paper.
Now, we are talking about absolutely bleeding-edge technology here (ALBERT came out just in the last few months). You will need to be extremely proficient to get through this, but it CAN be done, and I believe in you!!
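As a sketch of step 1 (assuming the bert-extractive-summarizer package linked above; the input file and the ratio value are placeholders):

from summarizer import Summarizer  # pip install bert-extractive-summarizer

long_document = open("my_long_document.txt").read()  # thousands of words
model = Summarizer()
summary = model(long_document, ratio=0.1)  # keep roughly 10% of the sentences
print(summary)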
BERT is a sentence representation model. It is trained to predict words in a sentence and to decide if two sentences follow each other in a document, i.e., strictly on the sentence level. Moreover, BERT requires memory quadratic in the input length, which would not be feasible for whole documents.
It is quite common practice to average word embeddings to get a sentence representation. You can try the same thing with BERT and average the [CLS] vectors from BERT over sentences in a document.
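A hedged sketch of that averaging idea, assuming the Hugging Face transformers package and the bert-base-uncased checkpoint (the answer itself does not prescribe a specific library):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def document_vector(sentences):
    # Encode each sentence separately, keep its [CLS] vector, then average.
    cls_vectors = []
    for sent in sentences:
        inputs = tokenizer(sent, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            output = model(**inputs)
        cls_vectors.append(output.last_hidden_state[:, 0, :])  # [CLS] is token 0
    return torch.cat(cls_vectors).mean(dim=0)

doc1 = document_vector(["First sentence of one document.", "Its second sentence."])
doc2 = document_vector(["First sentence of another document.", "More text here."])
print(torch.nn.functional.cosine_similarity(doc1, doc2, dim=0))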
There are some document-level embeddings. For instance, doc2vec is a commonly used option.
As far as I know, at the document level, frequency-based vectors such as tf-idf (with a good implementation in scikit-learn) are still close to state of the art, so I would not hesitate using it. Or at least it is worth trying to see how it compares to embeddings.
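For reference, a minimal sketch of that tf-idf baseline with scikit-learn (the documents are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["first long document ...", "second long document ...", "something unrelated"]
tfidf = TfidfVectorizer().fit_transform(documents)
print(cosine_similarity(tfidf[0], tfidf))  # similarity of document 0 to all documents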
To add to @jindřich's answer: BERT is meant to find missing words in a sentence and to predict the next sentence. The word-embedding-based doc2vec is still a good way to measure similarity between docs. If you want to delve deeper into why the best model overall isn't always the best choice for a given use case, give this post a read; it clearly explains why not every state-of-the-art model is suitable for every task.
Ya, you would just do each part independently. For summarization you hardly need to do much - just look on PyPI for "summarize" and you have several packages; you don't even need to train. Now, for sentence-to-sentence similarity there is a fairly complex method for computing the loss, but it's spelled out on the GLUE website. It's considered part of the challenge (meeting the metric). Determining that distance (STS) is non-trivial, and I think they call it "coherence", but I'm not sure.
I am using gensim version 0.12.4 and have trained two separate word embedding models on the same text with the same parameters. After training, I calculate the Pearson correlation between word occurrence frequency and vector length. One model I saved using save_word2vec_format(fname, binary=True) and then loaded using load_word2vec_format; the other I saved using model.save(fname) and then loaded using Word2Vec.load(). I understand that the word2vec algorithm is non-deterministic, so the results will vary; however, the difference in the correlation between the two models is quite drastic. Which method should I be using in this instance?
EDIT: this was intended as a comment. Don't know how to change it now, sorry
"correlation between the word occurrence-frequency and vector-length" - I don't quite follow: aren't all your vectors the same length? Or are you not referring to the embedding vectors?
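For anyone comparing the two save/load paths the question describes, here is a rough sketch of what each one persists (method locations as in recent gensim releases; in 0.12.x they lived directly on the Word2Vec object, and the corpus is a placeholder):

from gensim.models import Word2Vec, KeyedVectors

sentences = [["some", "example", "sentence"], ["another", "example", "sentence"]]
model = Word2Vec(sentences, vector_size=100, min_count=1)

# Path 1: word2vec binary format - stores only the word vectors.
model.wv.save_word2vec_format("vectors.bin", binary=True)
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# Path 2: native gensim format - stores the full model (output weights, vocab statistics).
model.save("model.gensim")
full_model = Word2Vec.load("model.gensim")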
I am using gensim word2vec package in python.
I would like to retrieve the W and W' weight matrices that have been learned during skip-gram training.
It seems to me that model.syn0 gives me the first one but I am not sure how I can get the other one. Any idea?
I would actually love to find exhaustive documentation of the model's accessible attributes, because the official documentation does not seem to be precise (for instance, syn0 is not described as an attribute).
model.wv.syn0 contains the input embedding matrix. The output embedding is stored in model.syn1 when the model is trained with hierarchical softmax (hs=1), or in model.syn1neg when it uses negative sampling (negative>0). That's it! When neither hierarchical softmax nor negative sampling is enabled, Word2Vec uses the single weight matrix model.wv.syn0 for training.
See also a related discussion here.
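A small sketch of inspecting those matrices (the toy corpus and parameters are placeholders; recent gensim releases expose the input matrix as model.wv.vectors rather than model.wv.syn0):

from gensim.models import Word2Vec

sentences = [["the", "quick", "brown", "fox"], ["jumped", "over", "the", "lazy", "dog"]]

# Skip-gram with hierarchical softmax, so the output matrix lives in model.syn1.
model = Word2Vec(sentences, vector_size=10, min_count=1, sg=1, hs=1, negative=0)

print(model.wv.vectors.shape)  # input embeddings W (model.wv.syn0 in older gensim)
print(model.syn1.shape)        # output embeddings W' (hierarchical softmax)
# With negative sampling (negative>0) the output matrix would be model.syn1neg instead.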
A common natural language processing practice is to calculate the similarity between two words using WordNet. I start my question with the following Python code:
from nltk.corpus import wordnet
sport = wordnet.synsets("sport")[0]
badminton = wordnet.synsets("badminton")[0]
print(sport.wup_similarity(badminton))
We will get 0.8421
Now what if I look for "haha" and "lol" as following:
haha = wordnet.synsets("haha")
lol = wordnet.synsets("lol")
print(haha)
print(lol)
We will get
[]
[]
So we cannot compute the similarity between them. What can we do in this case?
You can create a semantic space from co-occurrence matrices using a tool like DISSECT (DIStributional SEmantics Composition Toolkit), and then you are set to measure semantic similarity between words or phrases (if you compose words).
In your case, for "haha" and "lol", you'll need to collect those co-occurrences from a suitable corpus.
Another thing to try is word2vec.
There are two other possible approaches:
CBOW: the continuous bag-of-words model
Skip-gram model: roughly the reverse of CBOW (it predicts the context words from the target word rather than the target from its context)
Look at this: https://www.quora.com/What-are-the-continuous-bag-of-words-and-skip-gram-architectures-in-laymans-terms
These models are well presented here: https://www.tensorflow.org/tutorials/word2vec; gensim is also a good Python library for this kind of thing (see the sketch after this answer).
Try looking at TensorFlow solutions, for example this one: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/word2vec/word2vec_basic.py
Or read about word2vec itself: https://en.wikipedia.org/wiki/Word2vec
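A minimal gensim sketch contrasting the two architectures (toy sentences and parameters are placeholders; in gensim, sg=0 selects CBOW and sg=1 selects skip-gram):

from gensim.models import Word2Vec

sentences = [["haha", "that", "joke", "was", "funny"],
             ["lol", "what", "a", "funny", "joke"],
             ["haha", "lol", "so", "funny"]]

cbow = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)      # CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)  # skip-gram

print(cbow.wv.similarity("haha", "lol"))
print(skipgram.wv.similarity("haha", "lol"))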
You can use other frameworks. I also tried NLTK, but finally landed on spaCy (spacy.io), a very fast and functional framework. There is a method for words called 'similarity' which compares a word to other words (it also works for sentences, docs, etc.). It is implemented using word2vec. Actually, I don't know how big their vocabulary is or how it handles unknown words, but it might be worth a try.
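A quick sketch with spaCy, assuming a model that ships with word vectors is installed (e.g. en_core_web_md; the small models have no real vectors):

import spacy

nlp = spacy.load("en_core_web_md")  # python -m spacy download en_core_web_md
haha = nlp("haha")
lol = nlp("lol")
print(haha.similarity(lol))  # also works between tokens, spans and whole docs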
I was also playing a little bit with this one:
https://radimrehurek.com/gensim/models/word2vec.html
Where, in two lines, you can load Google's big word2vec model (this project ports Google's word2vec C++ library into Python), available here:
https://docs.google.com/file/d/0B7XkCwpI5KDYaDBDQm1tZGNDRHc/edit
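Those two lines look roughly like this (a sketch assuming a recent gensim release and the downloaded GoogleNews file; older releases expose the loader as Word2Vec.load_word2vec_format):

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
print(model.similarity("haha", "lol"))  # works only if both words are in the vocabulary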
There are different models for measuring similarity, such as word2vec or GloVe, but you seem to be looking more for a corpus that includes social, informal phrases like 'lol'.
However, I'm going to bring up word2vec because it leads to what I think is an answer to your question.
The foundational concept of word2vec (and other word embedding models like glove) is the representation of words in a vector space which incorporates relationships between words. This lends itself very well to measuring similarity, since vectors have lots of established math to draw from. You can read more about the technical details of word2vec in the original paper, but I quite like this blog post because it is well-written and concise.
Again, since word2vec is just a model, you need to pair it with the right training set to get the kind of scope you seek. There are some pre-trained models floating around on the web, such as this bunch. The training set is really what allows you to query a larger variety of terms, rather than the model.
You can certainly use those pre-trained models if they have social phrases like the ones you're seeking. However, if you don't see a model that has been trained on a suitable corpus, you could easily train a model yourself. I suggest Twitter or Wikipedia for corpora (training sets), and the implementation of word2vec in gensim as a word embedding model.