Error in extracting phrases using Gensim

Error in extracting phrases using Gensim - python

I am trying to get the bigrams in the sentences using Phrases in Gensim as follows.
from gensim.models import Phrases
from gensim.models.phrases import Phraser
documents = ["the mayor of new york was there", "machine learning can be useful sometimes","new york mayor was present"]
sentence_stream = [doc.split(" ") for doc in documents]
#print(sentence_stream)
bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
bigram_phraser = Phraser(bigram)
for sent in sentence_stream:
tokens_ = bigram_phraser[sent]
print(tokens_)
Even though it catches "new", "york" as "new york", it does not catch "machine", learning as "machine learning"
However, in the example shown in Gensim Website they were able to catch the words "machine", "learning" as "machine learning".
Please let me know how to get "machine learning" as a bigram in the above example

The technique used by gensim Phrases is purely based on statistics of co-occurrences: how often words appear together, versus alone, in a formula also affected by min_count and compared against the threshold value.
It is only because your training set has 'new' and 'york' occur alongside each other twice, while other words (like 'machine' and 'learning') only occur alongside each other once, that 'new_york' becomes a bigram, and other pairings do not. What's more, even if you did find a combination of min_count and threshold that would promote 'machine_learning' to a bigram, it would also pair together every other bigram-that-appears-once – which is probably not what you want.
Really, to get good results from these statistical techniques, you need lots of varied, realistic data. (Toy-sized examples may superficially succeed, or fail, for superficial toy-sized reasons.)
Even then, they will tend to miss combinations a person would consider reasonable, and make combinations a person wouldn't. Why? Because our minds have much more sophisticated ways (including grammar and real-world knowledge) for deciding when clumps of words represent a single concept.
So even with more better data, be prepared for nonsensical n-grams. Tune or judge the model on whether it is overall improving on your goal, not any single point or ad-hoc check of matching your own sensibility.
(Regarding the referenced gensim documentation comment, I'm pretty sure that if you try Phrases on just the two sentences listed there, it won't find any of the desired phrases – not 'new_york' or 'machine_learning'. As a figurative example, the ellipses ... imply the training set is larger, and the results indicate that the extra unshown texts are important. It's just because of the 3rd sentence you've added to your code that 'new_york' is detected. If you added similar examples to make 'machine_learning' look more like a statistically-outlying pairing, your code could promote 'machine_learning', too.)

It is probably below your threshold?
Try using more data.

Related

improve gensim most_similar() return values by using wordnet hypernyms

import gensim.downloader as api
glove = api.load('glove-wiki-gigaword-200')
I first ran this code to download the pre-trained model.
glove.most_similar(positive=['sushi', 'uae'], negative=['japan'])
would then result in:
[('nahyan', 0.5181387066841125),
('caviar', 0.4778318405151367),
('paella', 0.4497394263744354),
('nahayan', 0.44313961267471313),
('zayed', 0.4321245849132538),
('omani', 0.4285220503807068),
('seafood', 0.4279175102710724),
('saif', 0.426000714302063),
('dirham', 0.4214130640029907),
('sashimi', 0.4165934920310974)]
and in this example, we can see that the method failed to capture the 'type' or 'category' of the query. 'zayed', 'nahyan' are not actually of 'type' food and rather they represent person name.
The approach suggested by my professor is to use wordnet hypernyms to find the 'type'.
With much research, the closest solution I found is to somehow incorporate
lowest_common_hypernyms() that will give the lowest common hypernym between two synsets and use it to filter the results of most_similar().
I am not sure if my idea make sense and would like the community feedback on this.
My idea is compute the hypernym of, e.g. 'sushi' and the hypernyms of all the similar words returned by most_similar() and only choose the word with 'longest' lowest common hypernym path. I expect this should return the word that best matches the 'type'
Not sure if it makes sense...

Does your proposed approach give adequate results when you try it?
That's the only test of whether the idea makes sense.
Word2vec is generally oblivious to the all the variations of category that a lexicon like WordNet can provide – all the words that are similar to another word, in any aspect, will be neighbors. Even words that people consider opposites – like 'hot' and 'cold' – will be often be fairly close to each other, in some direction in the coordinate space, as they are similar in what they describe and what contexts they're used in. (They can be drop-in replacements for each other.)
Word2vec is also fairly oblivious to polysemy in its standard formulation.
Some other things worth trying might be:
if you need only answers of a certain type, mix-in some measurement ranking candidate answers by their closeness to a word either describing that type ('food') or representing multiple examples (say an average vector for many food-names you'd know to be good answers)
choose another vector-set, or train your own. There's no universal "goodness" for word-vectors: their quality for certain tasks will vary based on their training data & parameters. Vectors trained on something broader than Wikipedia (your named vector file), or some text corpus more focused on your domain-of-interest – say, food criticism – might do better on some tasks. Changing training parameters can also change which kinds of similarity are most emphasized in the resulting vectors. For example, some observers have noticed small context-windows tend to put words that are direct drop-in replacements for each other closer-together, while larger context-windows bring words from the same domains-of-use, even if not drop-in replacements of the same 'type', closer. (It sounds like your current need might be best served with a model trained with smaller windows.)

Nahyan is from the UAE - it seems to be part of the name of all three presidents. So you seem to be getting what you ask for. If you want more foods, add "food" to your positive query, and maybe "people" to your negative query?
Another approach is to post-filter your results to remove anything that isn't a food. Or is a person. (WordNet won't be much help, as it is nowhere near comprehensive on foods, and even less so on people; Wikidata is likely to be more useful.)
By the way, if you find the common hypernym of sushi and UAE it will probably be the top-level entity in wordnet. So that will give you no filtering.

Use word2vec to expand a glossary in order to classify texts

I have a database containing about 3 million texts (tweets). I put clean texts (removing stop words, tags...) in a list of lists of tokens called sentences (so it contains a list of tokens for each text).
After these steps, if I write
model = Word2Vec(sentences, min_count=1)
I obtain a vocabulary of about 400,000 words.
I have also a list of words (belonging to the same topic, in this case: economics) called terms. I found that 7% of the texts contain at least one of these words (so we can say that 7% of total tweets talk about economics).
My goal is to expand the list terms in order to retrieve more texts belonging to the economic topic.
Then I use
results = model.most_similar(terms, topn=5000)
to find, within the list of lists of tokens sentences, the words most similar to those contained in terms.
Finally if I create the data frame
df = pd.DataFrame(results, columns=['key', 'similarity'])
I get something like that:
key similarity
word1 0.795432
word2 0.787954
word3 0.778942
... ...
Now I think I have two possibilities to define the expanded glossary:
I take the first N words (what should be the value of N?);
I look at the suggested words one by one and decide which one to include in the expanded glossary based on my knowledge (does this word really belong to the economic glossary?)
How should I proceed in a case like this?

There's no general answer for what the cutoff should be, or how much you should use your own manual judgement versus cruder (but fast/automatic) processes. Those are inherently decisions which will be heavily influenced by your data, model quality, & goals – so you have to try different approaches & see what works there.
If you had a goal for what percentage of the original corpus you want to take – say, 14% instead of 7% – you could go as deeply into the ranked candidate list of 'similar words' as necessary to hit that 14% target.
Note that when you retrieve model.most_similar(terms), you are asking the model to 1st average all words in terms together, then return words close to that one average point. To the extent your seed set of terms is tightly around the idea of economics, that might find words close to that generic average idea – but might not find other interesting words, such as close sysnonyms of your seed words that you just hadn't thought of. For that, you might want to get not 5000 neighbors for one generic average point, but (say) 3 neighbors for every individual term. To the extent the 'shape' of the topic isn't a perfect sphere around someplace in the word-vector-space, but rather some lumpy complex volume, that might better reflect your intent.
Instead of using your judgement of the candidate words standing alone to decide whether a word is economics-related, you could instead look at the texts that a word uniquely brings in. That is, for new word X, look at the N texts that contain that word. How many, when applying your full judgement to their full text, deserve to be in your 'economics' subset? Only if it's above some threshold T would you want to move X into your glossary.
But such an exercise may just highlight: using a simple glossary – "for any of these hand-picked N words, every text mentioning at least 1 word is in" – is a fairly crude way of assessing a text's topic. There are other ways to approach the goal of "pick a relevant subset" in an automated way.
For example, you could view your task as that of training a text binary classifier to classify texts as 'economics' or 'not-economics'.
In such a case, you'd start with some training data - a set of example documents that are already labeled 'economics' or 'not-economics', perhaps via individual manual review, or perhaps via some crude bootstrapping (like labeling all texts with some set of glossary words as 'economics', & all others 'not-economics'). Then you'd draw from the full range of potential text-preprocessing, text-feature-extracton, & classification options to train & evaluate classifiers that make that judgement for you. Then you'd evaluate/tune those – a process wich might also improve your training data, as you add new definitively 'economics' or 'not-economics' texts – & eventually settle on one that works well.
Alternatively, you could use some other richer topic-modeling methods (LDA, word2vec-derived Doc2Vec, deeper neural models etc) for modeling the whole dataset, then from some seed-set of definite-'economics' texts, expand outward from them – finding nearest-examples to known-good documents, either auto-including them or hand-reviewing them.
Separately: min_count=1 is almost always a mistake in word2vec & related algorihtms, which do better if you discard words so rare they lack the variety of multiple usage examples the algorithm needs to generate good word-vectors.

Whats a good way to match text to sets of keywords (NLP)

I'm trying to match an input text (e.g. a headline of a news article) to sets of keywords, s.t. the best-matching set can be selected.
Let's assume, I have some sets of keywords:
[['democracy', 'votes', 'democrats'], ['health', 'corona', 'vaccine', 'pandemic'], ['security', 'police', 'demonstration']]
and as input the (hypothetical) headline: New Pfizer vaccine might beat COVID-19 pandemic in the next few months.. Obviously, it fits well to the second set of keywords.
Exact matching words is one way to do it, but more complex situations might arise, for which it might make sense to use base forms of words (e.g. duck instead of ducks, or run instead of running) to enhance the algorithm. Now we're talking NLP already.
I experimented with Spacy word and document embeddings (example) to determine similarity between a headline and each set of keywords. Is it a good idea to calculate document similarity between a full sentence and a limited number of keywords? Are there other ways?
Related: What NLP tools to use to match phrases having similar meaning or semantics

There is not one correct solution for such a task. you have to try what fits your problem!
Possible ways to solve your problem I can think of:
Matching: either exact or more elaborated such as lemma/stemming, or Levensthein.
Embedding Similarity: I guess word similarity would outperform document-keywords similarity, but again, just experiment with it.
Classification: Your problem seems to be a classic classification problem, which each set being one class. If you don't have enough labeled training data, you could try active-learning.

R/python: build model from training sentences

What I'm trying to achieve:
I have been looking for an approach for a long while now but I'm not able to find an (effective) way to this:
build a model from example sentences while taking word order and synonyms into account.
map a sentence against this model and get a similarity score (thus a score indicating how much this sentence fits the model, in other words fits the sentences which were used to train the model)
What I tried:
Python: nltk in combination with gensim (as far as I could code and read it was only capable to use word similarity (but not taking order into
account).
R: used tm to build a TermDocumentMatrix which looked really promising but was not able to map anything to this matrix. Further this TermDocumentMatrix seems to take the order into account but misses the synonyms (I think).
I know the lemmatization didn't go that well hahah :)
Question:
Is there any way to do achieve the steps described above using either R or Python? A simple sample code would be great (or references to a good tutorial)

There are many ways to do what you described above, and it will of course take lots of testing to find an optimized solution. But here is some helpful functionality to help solve this using python/nltk.
build a model from example sentences while taking word order and
synonyms into account.
1. Tokenization
In this step you will want to break down individual sentences into a list of words.
Sample code:
import nltk
tokenized_sentence = nltk.word_tokenize('this is my test sentence')
print(tokenized_sentence)
['this', 'is', 'my', 'test', 'sentence']
2. Finding synonyms for each word.
Sample code:
from nltk.corpus import wordnet as wn
synset_list = wn.synsets('motorcar')
print(synset_list)
[Synset('car.n.01')]
Feel free to research synsets if you are unfamiliar, but for now just know the above returns a list, so multiple synsets are possibly returned.
From the synset you can get a list of synonyms.
Sample code:
print( wn.synset('car.n.01').lemma_names() )
['car', 'auto', 'automobile', 'machine', 'motorcar']
Great, now you are able to convert your sentence into a list of words, and you're able to find synonyms for all words in your sentences (while retaining the order of your sentence). Also, you may want to consider removing stopwords and stemming your tokens, so feel free to look up those concepts if you think it would be helpful.
You will of course need to write the code to do this for all sentences, and store the data in some data structure, but that is probably outside the scope of this question.
map a sentence against this model and get a similarity score (thus a
score indicating how much this sentence fits the model, in other words
fits the sentences which were used to train the model)
This is difficult to answer since the possibilities to do this are endless, but here are a few examples of how you could approach it.
If you're interested in binary classification you could do something as simple as, Have I seen this sentence of variation of this sentence before (variation being same sentence but words replaced by their synonyms)? If so, score is 1, else score is 0. This would work, but may not be what you want.
Another example, store each sentence along with synonyms in a python dictionary and calculate score depending on how far down the dictionary you can align the new sentence.
Example:
training_sentence1 = 'This is my awesome sentence'
training_sentence2 = 'This is not awesome'
And here is a sample data structure on how you would store those 2 sentences:
my_dictionary = {
'this': {
'is':{
'my':{
'awesome': {
'sentence':{}
}
},
'not':{
'awesome':{}
}
}
}
}
Then you could write a function that traverses that data structure for each new sentence, and depending how deep it gets, give it a higher score.
Conclusion:
The above two examples are just some possible ways to approach the similarity problem. There are countless articles/whitepapers about computing semantic similarity between text, so my advice would be just explore many options.
I purposely excluded supervised classification models, since you never mentioned having access to labelled training data, but of course that route is possible if you do have a gold standard data source.

How to automatically label a cluster of words using semantics?

The context is : I already have clusters of words (phrases actually) resulting from kmeans applied to internet search queries and using common urls in the results of the search engine as a distance (co-occurrence of urls rather than words if I simplify a lot).
I would like to automatically label the clusters using semantics, in other words I'd like to extract the main concept surrounding a group of phrases considered together.
For example - sorry for the subject of my example - if I have the following bunch of queries : ['my husband attacked me','he was arrested by the police','the trial is still going on','my husband can go to jail for harrassing me ?','free lawyer']
My study deals with domestic violence, but clearly this cluster is focused on the legal aspect of the problem so the label could be "legal" for example.
I am new to NPL but I have to precise that I don't want to extract words using POS tagging (or at least this is not the expected final outcome but maybe a necessary preliminary step).
I read about Wordnet for sense desambiguation and I think that might be a good track, but I don't want to calculate similarity between two queries (since the clusters are the input) nor obtain the definition of one selected word thanks to the context provided by the whole bunch of words (which word to select in this case ?). I want to use the whole bunch of words to provide a context (maybe using synsets or categorization with the xml structure of the wordnet) and then summarize the context in one or few words.
Any ideas ? I can use R or python, I read a little about nltk but I don't find a way to use it in my context.

Your best bet is probably is to label the clusters manually, especially if there are few of them. This a difficult problem even for humans to solve, because you might need a domain expert. Anyone claiming they could do that automatically and reliably (except in some very limited domains) is probably running a startup and trying to get your business.
Also, going through the clusters yourself will have benefits. 1) you may discover you had the wrong number of clusters (k parameter) or that there was too much junk in the input to begin with. 2) you will gain qualitative insight into what is being talked about and what topic there are in the data (which you probably can't know before looking at the data). Therefore, label manually if qualitative insight is what you are after. If you need quantitative result too, you could then train a classifier on the manually labelled topics to 1) predict topics for the rest of the clusters, or 2) for future use, if you repeat the clustering, get new data, ...

When we talk about semantics in this area we mean Statistical Semantics. The statistical or distributional semantics is very different from other definitions of semantics which has logic and reasoning behind it. Statistical semantics is based on Distributional Hypothesis, which considers context as meaning aspect of words and phrases. Meaning in very abstract and general sense in different litterers is called topics. There are several unsupervised methods for modelling topics, such as LDA or even word2vec, which basically provide word similarity metric or suggest a list of similar words for a document as another context. Usually when you have these unsupervised clusters, you need a domain expert to tell the meaning of each cluster.
However, for several reasons you might accept low accuracy assignment of a word as the general topic (or as in your words "global semantic") to a list of phrases. If this is the case, I would suggest to take a look at Word Sense Disambiguation tasks which look for coarse grained word senses. For WordNet, it might be called supersense tagging task.
This paper worth to take a look: More or less supervised supersense tagging of Twitter
And about your question about choosing words from current phrases, there is also an active question about "converting phrase to vectors", my answer to that question in word2vec fashion might be useful:
How can a sentence or a document be converted to a vector?
I can add more related papers later if it comes to my mind.

The paper Automatic Labelling of Topic Models explains the author's approach to this problem. To provide an overview I can tell you that they generate some label candidates using the information retrieved from Wikipedia and Google, and once they have the list of candidates in place they rank those candidates to find the best label.
I think the code is not available online, but I have not looked for it.

The package chowmein claims to do this in python using the algorithm outlined in Automatic Labeling of Multinomial Topic Models.

One possible approach, which the below papers suggest is identifying the set of keywords from the cluster, getting all the synonyms and then finding the hypernyms for each synonym.
The idea is to get a more abstract meaning for the cluster by using the hypernym.
Example: A word cluster containing words dog and wolf should not be labelled with either word but as canids. They achieve it using synonymy and hypernymy.
Cluster Labeling by Word Embeddings
and WordNet’s Hypernymy
Automated Text Clustering and Labeling using Hypernyms

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.