Find most similar words for OOV word

Find most similar words for OOV word - python

I am looking for the most similar words for out-of-vocab OOV words using gensim. Something like this:
def get_word_vec(self, model, word):
try:
if word not in model.wv.vocab:
mostSimWord = model.wv.similar_by_word(word)
print(mostSimWord)
else:
print( word )
except Exception as ex:
print(ex)
Is there are way to achieve this task? Options other than gensim also welcomed.

If you train a FastText model instead of a Word2Vec model, it inherently learns vectors for word-fragments (of configurable size ranges) in addition to full words.
In languages like English & many others (but not all), unknown words are often typos, alternate forms, or related in terms of roots and suffixes to knwon words. Thus, having vectors for subwords, then using those to tally up a good guess vector for an unknown word, can work well enough to be worth trying – better than ignoring such words, or using a totally random or origin-point vector.
There's no built-in way to try to extract such relationships from an existing set of word-vectors that isn't FastText/subword-based – but it'd be theoretically possible. You could compute edit distances to, or counts-of-shared-subwords with, all known words, & create a guess-vector by weighted combination of the N-closest words. (This might work really well with typos & rarer alternate spellings, but not as much for truly-absent novel words.)

Related

improve gensim most_similar() return values by using wordnet hypernyms

import gensim.downloader as api
glove = api.load('glove-wiki-gigaword-200')
I first ran this code to download the pre-trained model.
glove.most_similar(positive=['sushi', 'uae'], negative=['japan'])
would then result in:
[('nahyan', 0.5181387066841125),
('caviar', 0.4778318405151367),
('paella', 0.4497394263744354),
('nahayan', 0.44313961267471313),
('zayed', 0.4321245849132538),
('omani', 0.4285220503807068),
('seafood', 0.4279175102710724),
('saif', 0.426000714302063),
('dirham', 0.4214130640029907),
('sashimi', 0.4165934920310974)]
and in this example, we can see that the method failed to capture the 'type' or 'category' of the query. 'zayed', 'nahyan' are not actually of 'type' food and rather they represent person name.
The approach suggested by my professor is to use wordnet hypernyms to find the 'type'.
With much research, the closest solution I found is to somehow incorporate
lowest_common_hypernyms() that will give the lowest common hypernym between two synsets and use it to filter the results of most_similar().
I am not sure if my idea make sense and would like the community feedback on this.
My idea is compute the hypernym of, e.g. 'sushi' and the hypernyms of all the similar words returned by most_similar() and only choose the word with 'longest' lowest common hypernym path. I expect this should return the word that best matches the 'type'
Not sure if it makes sense...

Does your proposed approach give adequate results when you try it?
That's the only test of whether the idea makes sense.
Word2vec is generally oblivious to the all the variations of category that a lexicon like WordNet can provide – all the words that are similar to another word, in any aspect, will be neighbors. Even words that people consider opposites – like 'hot' and 'cold' – will be often be fairly close to each other, in some direction in the coordinate space, as they are similar in what they describe and what contexts they're used in. (They can be drop-in replacements for each other.)
Word2vec is also fairly oblivious to polysemy in its standard formulation.
Some other things worth trying might be:
if you need only answers of a certain type, mix-in some measurement ranking candidate answers by their closeness to a word either describing that type ('food') or representing multiple examples (say an average vector for many food-names you'd know to be good answers)
choose another vector-set, or train your own. There's no universal "goodness" for word-vectors: their quality for certain tasks will vary based on their training data & parameters. Vectors trained on something broader than Wikipedia (your named vector file), or some text corpus more focused on your domain-of-interest – say, food criticism – might do better on some tasks. Changing training parameters can also change which kinds of similarity are most emphasized in the resulting vectors. For example, some observers have noticed small context-windows tend to put words that are direct drop-in replacements for each other closer-together, while larger context-windows bring words from the same domains-of-use, even if not drop-in replacements of the same 'type', closer. (It sounds like your current need might be best served with a model trained with smaller windows.)

Nahyan is from the UAE - it seems to be part of the name of all three presidents. So you seem to be getting what you ask for. If you want more foods, add "food" to your positive query, and maybe "people" to your negative query?
Another approach is to post-filter your results to remove anything that isn't a food. Or is a person. (WordNet won't be much help, as it is nowhere near comprehensive on foods, and even less so on people; Wikidata is likely to be more useful.)
By the way, if you find the common hypernym of sushi and UAE it will probably be the top-level entity in wordnet. So that will give you no filtering.

Alternatives to NER taggers for long, heterogeneous phrases?

I am looking for ideas/thoughts on the following problem:
I am working with food ingredient data such as: milk, sugar, eggs, flour, may contain nuts
From such piece of text I want to be able to identify and extract phrases like may contain nuts, to preprocess them separately
These kinds of phrases can change quite a lot in terms of length and content. I thought of using NER taggers, but I don't know if they will do the job correctly as they are mainly used for identifying single-word entities...
Any ideas on what to use as a phrase-entity-recognition system? Also which package would you use? Cheers

IMHO NER (or model-based entity extraction in general) alone is a poor choice of methodology for this particular problem as it requires LOTS of manual annotation to do it right. Instead I suggest using Word2Vec (https://radimrehurek.com/gensim/models/word2vec.html) with phrasing (https://radimrehurek.com/gensim/models/phrases.html).
The idea is to have an unsupervised model containing phrases and their similarities which can then queried using some seed words to list all possible ingredients (e.g. "cat" produces similar words like "dog" or "rat"). Next step would be either to create dictionaries containing the ingredient words & phrases or try clustering the vocabulary of the model using cosine similarity between each word/phrase pair.
Now if you want to take things further you can always match your created dictionaries/clusters back to the corpus the W2V model was trained on and then train a custom entity recognition model using those matches as you now have annotated examples.

I believe this is a Multiword-Expression problem.
There are a few ways you can try to solve this:
Build a named entity recognition model (NER)
Search with Regex for a fixed set of known phrases
Chunking tokens with POS tags
Find collocations of tokens
Let's look at each of these
Build a named entity recognition model (NER)
Named Entity Recognition labels known spans of tokens an a entity type
For each input token you have to label it as part of a known named entity.
Eddy N PERSON
Bonte N PERSON
is V O
woordvoerder N O
van Prep O
diezelfde Pron O
Hogeschool N ORG
. Punc O
This is costly and requires a lot of time for labelling.
It is probably not a good choice for your task.
Search with Regex
This is not a bad idea, using some known phrases you could easily search input texts with minimal word boundaries for context.
import re
re.findall(r"\bmay contain nuts\b", text)
This would require you knowing all phrases you want to search for up front, and might not be possible.
Chunking tokens with POS tags
This could be a good intermediate step but could give many false positives.
You could do this my knowing the sequences of POS tags you expect
may MD
contain VB
nuts NNS
Then you could use chunking with the known tag sequence (MD, VB, NNS).
The problem is that you may not know this, and would have to capture many use cases. It will also capture many sequences which you wont want to capture (False Positive)
Find collocations of tokens
This is probably the best way, as it seems you are looking for a highly common sequences of words (tokens) in a corpus.
You can do this using:
Word2Vec Phrases
NLTK Collocations
Both do the same thing, they look for statistically common sequences of tokens which occur in a corpus.
That can then be used to extract the same collocation phrases from new texts.

It looks like your ingredient list is easy to split into a list. In that case you don't really need a sequence tagger; I wouldn't treat this problem as phrase extraction or NER. What I would do is train a classifier on different items in the list to label them as "food" or "non-food". You should be able to start with rules and train a basic classifier using anything really.
Before training a model, an even simpler step would be to run each list item through a PoS tagger (say spaCy), and if there's a verb you can guess that it's not a food item.

Whats a good way to match text to sets of keywords (NLP)

I'm trying to match an input text (e.g. a headline of a news article) to sets of keywords, s.t. the best-matching set can be selected.
Let's assume, I have some sets of keywords:
[['democracy', 'votes', 'democrats'], ['health', 'corona', 'vaccine', 'pandemic'], ['security', 'police', 'demonstration']]
and as input the (hypothetical) headline: New Pfizer vaccine might beat COVID-19 pandemic in the next few months.. Obviously, it fits well to the second set of keywords.
Exact matching words is one way to do it, but more complex situations might arise, for which it might make sense to use base forms of words (e.g. duck instead of ducks, or run instead of running) to enhance the algorithm. Now we're talking NLP already.
I experimented with Spacy word and document embeddings (example) to determine similarity between a headline and each set of keywords. Is it a good idea to calculate document similarity between a full sentence and a limited number of keywords? Are there other ways?
Related: What NLP tools to use to match phrases having similar meaning or semantics

There is not one correct solution for such a task. you have to try what fits your problem!
Possible ways to solve your problem I can think of:
Matching: either exact or more elaborated such as lemma/stemming, or Levensthein.
Embedding Similarity: I guess word similarity would outperform document-keywords similarity, but again, just experiment with it.
Classification: Your problem seems to be a classic classification problem, which each set being one class. If you don't have enough labeled training data, you could try active-learning.

classification of documents considering the order of words

I'm trying to classify a list of documents. I'm using CountVectorizer and TfidfVectorizer to vectorize the documents before the classification. The results are good but I think that they could be better if we will consider not only the existence of specific words in the document but also the order of these words. I know that it is possible to consider also pairs and triples of words but I'm looking for something more inclusive.

Believe it or not, but bag of words approaches work quite well on a wide range of text datasets. You've already thought of bi-grams or tri-grams. Let's say you had 10-grams. You have information about the order of your words, but it turns out there are rarely more than one instance of each 10-gram, so there would be few examples for your classification model to learn from. You could try some other custom feature engineering based on the text, but it would be a good amount of work that rarely help much. There are other successful approaches in Natural Language Processing, especially in the last few years, but they usually focus on more than word ordering.

How to automatically label a cluster of words using semantics?

The context is : I already have clusters of words (phrases actually) resulting from kmeans applied to internet search queries and using common urls in the results of the search engine as a distance (co-occurrence of urls rather than words if I simplify a lot).
I would like to automatically label the clusters using semantics, in other words I'd like to extract the main concept surrounding a group of phrases considered together.
For example - sorry for the subject of my example - if I have the following bunch of queries : ['my husband attacked me','he was arrested by the police','the trial is still going on','my husband can go to jail for harrassing me ?','free lawyer']
My study deals with domestic violence, but clearly this cluster is focused on the legal aspect of the problem so the label could be "legal" for example.
I am new to NPL but I have to precise that I don't want to extract words using POS tagging (or at least this is not the expected final outcome but maybe a necessary preliminary step).
I read about Wordnet for sense desambiguation and I think that might be a good track, but I don't want to calculate similarity between two queries (since the clusters are the input) nor obtain the definition of one selected word thanks to the context provided by the whole bunch of words (which word to select in this case ?). I want to use the whole bunch of words to provide a context (maybe using synsets or categorization with the xml structure of the wordnet) and then summarize the context in one or few words.
Any ideas ? I can use R or python, I read a little about nltk but I don't find a way to use it in my context.

Your best bet is probably is to label the clusters manually, especially if there are few of them. This a difficult problem even for humans to solve, because you might need a domain expert. Anyone claiming they could do that automatically and reliably (except in some very limited domains) is probably running a startup and trying to get your business.
Also, going through the clusters yourself will have benefits. 1) you may discover you had the wrong number of clusters (k parameter) or that there was too much junk in the input to begin with. 2) you will gain qualitative insight into what is being talked about and what topic there are in the data (which you probably can't know before looking at the data). Therefore, label manually if qualitative insight is what you are after. If you need quantitative result too, you could then train a classifier on the manually labelled topics to 1) predict topics for the rest of the clusters, or 2) for future use, if you repeat the clustering, get new data, ...

When we talk about semantics in this area we mean Statistical Semantics. The statistical or distributional semantics is very different from other definitions of semantics which has logic and reasoning behind it. Statistical semantics is based on Distributional Hypothesis, which considers context as meaning aspect of words and phrases. Meaning in very abstract and general sense in different litterers is called topics. There are several unsupervised methods for modelling topics, such as LDA or even word2vec, which basically provide word similarity metric or suggest a list of similar words for a document as another context. Usually when you have these unsupervised clusters, you need a domain expert to tell the meaning of each cluster.
However, for several reasons you might accept low accuracy assignment of a word as the general topic (or as in your words "global semantic") to a list of phrases. If this is the case, I would suggest to take a look at Word Sense Disambiguation tasks which look for coarse grained word senses. For WordNet, it might be called supersense tagging task.
This paper worth to take a look: More or less supervised supersense tagging of Twitter
And about your question about choosing words from current phrases, there is also an active question about "converting phrase to vectors", my answer to that question in word2vec fashion might be useful:
How can a sentence or a document be converted to a vector?
I can add more related papers later if it comes to my mind.

The paper Automatic Labelling of Topic Models explains the author's approach to this problem. To provide an overview I can tell you that they generate some label candidates using the information retrieved from Wikipedia and Google, and once they have the list of candidates in place they rank those candidates to find the best label.
I think the code is not available online, but I have not looked for it.

The package chowmein claims to do this in python using the algorithm outlined in Automatic Labeling of Multinomial Topic Models.

One possible approach, which the below papers suggest is identifying the set of keywords from the cluster, getting all the synonyms and then finding the hypernyms for each synonym.
The idea is to get a more abstract meaning for the cluster by using the hypernym.
Example: A word cluster containing words dog and wolf should not be labelled with either word but as canids. They achieve it using synonymy and hypernymy.
Cluster Labeling by Word Embeddings
and WordNet’s Hypernymy
Automated Text Clustering and Labeling using Hypernyms

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.