Given a set of words V, I would like to group the synonymous words in V together. I am wondering if there is any built-in function in NLTK and WordNet that takes V as input and automatically clusters the words based on synonymy.
I already know how to extract the synonyms of each word, but that is not what I am looking for. If I do so, the problem becomes complicated when the synonym sets intersect each other, or are subsets/supersets of each other, which requires writing a function to resolve the conflicts.
As an example, let's consider
V = ["good","constipate","bad","nice","defective","right","respectable","powerful"]
What I want to get as output is:
[('constipate'), ('nice'), ('bad', 'defective'), ('good', 'powerful', 'respectable', 'right')]
Now, depending on the size/number of the clusters, some sets might split into several sets or combine together. Here, I only care about the words in V and their synonyms within V.
Yes, there is a way to do this using nltk and wordnet. Following is an example using the built-in synsets, looking up the synonyms of 'book':
import nltk
from nltk.corpus import wordnet

synonyms = []
for syn in wordnet.synsets('book'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
The resulting synonyms for 'book' are:
print(synonyms)
>>['book', 'book', 'volume', 'record', 'record_book', 'book', 'script', 'book', 'playscript', 'ledger', 'leger', 'account_book', 'book_of_account', 'book', 'book', 'book', 'rule_book', 'Koran', 'Quran', "al-Qur'an", 'Book', 'Bible', 'Christian_Bible', ..]
The length of synonyms:
len(synonyms)
>>38
Note: Some synonyms are verb forms, and many synonyms are just different usages of 'book'. If, instead, we take the set of synonyms, there are fewer unique words, as shown in the following code:
len(set(synonyms))
>>25
The set of unique synonyms is:
{'record', 'Quran', 'Holy_Scripture', 'Koran', 'Good_Book', 'playscript', 'book', 'Word_of_God', 'hold', 'Holy_Writ', 'script', 'leger', 'book_of_account', 'Scripture', 'ledger', 'reserve', 'volume', 'record_book', "al-Qur'an", 'Christian_Bible', 'Word', 'rule_book', 'Bible', 'Book', 'account_book'}
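To go from the per-word synonym lookup to the grouping asked about, here is a minimal sketch of one possible approach (the synonyms_in_v helper and the merge loop are my own illustration, not an NLTK built-in; the exact clusters depend on WordNet's lemma overlaps and may not match the sample output exactly):

from nltk.corpus import wordnet

V = ["good", "constipate", "bad", "nice", "defective", "right", "respectable", "powerful"]
vocab = set(V)

def synonyms_in_v(word, vocab):
    # Illustrative helper (not an NLTK built-in): lemma names of all synsets
    # of `word`, restricted to words in `vocab`.
    found = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            if lemma.name() in vocab:
                found.add(lemma.name())
    return found

clusters = []
for word in V:
    related = synonyms_in_v(word, vocab) | {word}
    # Merge every existing cluster that shares a word with the new group.
    overlapping = [c for c in clusters if c & related]
    for c in overlapping:
        related |= c
        clusters.remove(c)
    clusters.append(related)

print(clusters)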
Is there a way for Gensim to generate strictly the bigrams, trigrams in a list of words?
I can successfully generate the unigrams, bigrams, and trigrams, but I would like to extract only the bigrams and trigrams.
For example, in the list below:
words = [['the', 'mayor', 'of', 'new', 'york', 'was', 'there'],["i","love","new","york"],["new","york","is","great"]]
I use
bigram = gensim.models.Phrases(words, min_count=1, threshold=1)
bigram_mod = gensim.models.phrases.Phraser(bigram)
words_bigram = [bigram_mod[doc] for doc in words]
This creates a list of unigrams and bigrams as follows:
[['the', 'mayor', 'of', 'new_york', 'was', 'there'],
['i', 'love', 'new_york'],
['new_york', 'is', 'great']]
My question is, is there a way (other than regular expressions) to extract strictly the bigrams, so that in this example only "new_york" would be a result?
It's not a built-in option of the gensim Phrases functionality.
If we can assume none of your original unigrams had the '_' character in them, a step to select only tokens with a '_' shouldn't be too expensive (and doesn't need full regular expressions). For example, your last line could be:
words_bigram = [ [token for token in bigram_mod[doc] if '_' in token] for doc in words ]
(You could change the joining character if for some reason there were underscores in your unigrams, and you didn't want those confused with Phrases-combined bigrams.)
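For example, a sketch of changing the joining character via the delimiter parameter (assuming gensim 4.x, where delimiter is a plain string; gensim 3.x expects bytes, e.g. delimiter=b'-'):

import gensim

# Join detected phrases with '-' instead of the default '_', so pre-existing
# underscores in unigrams are not mistaken for combined bigrams.
# (delimiter is a str in gensim 4.x, bytes in gensim 3.x.)
bigram = gensim.models.Phrases(words, min_count=1, threshold=1, delimiter='-')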
If none of that is good enough, you could potentially look at the code in gensim which actually scores & combines unigrams into bigrams...
https://github.com/RaRe-Technologies/gensim/blob/fbc7d0952f1461fb5de3f6423318ae33d87524e3/gensim/models/phrases.py#L300
...and either extend that module with your extra needed option, or mimic its behavior outside the class in your own code.
How to get similar words using wordnet, and not only the synonyms using synsets and their lemmas?
For example, search for "happy" on the WordNet online tool (http://wordnetweb.princeton.edu/). For the first synset there is only one synonym (happy), but if you click on it (on the S: link) you get additional words under "see also" and "similar to", like "cheerful".
How do I get these words and what are they called in wordnet terminology? I am using python with nltk and can only get the synsets and lemmas at best (excluding the hypernyms etc.)
"also_sees()" and "similar_tos()".
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets("happy")[0].also_sees()
[Synset('cheerful.a.01'), Synset('contented.a.01'), Synset('elated.a.01'), Synset('euphoric.a.01'), Synset('felicitous.a.01'), Synset('glad.a.01'), Synset('joyful.a.01'), Synset('joyous.a.01')]
>>> wn.synsets("happy")[0].similar_tos()
[Synset('blessed.s.06'), Synset('blissful.s.01'), Synset('bright.s.09'), Synset('golden.s.02'), Synset('laughing.s.01')]
If you want to see the full list of what a WordNet synset can do, try the "dir()" command. (It'll be full of objects you probably don't want, so I stripped out the underscored ones below.)
>>> [func for func in dir(wn.synsets("happy")[0]) if func[0] != "_"]
['acyclic_tree', 'also_sees', 'attributes', 'causes', 'closure', 'common_hypernyms', 'definition', 'entailments', 'examples', 'frame_ids', 'hypernym_distances', 'hypernym_paths', 'hypernyms', 'hyponyms', 'in_region_domains', 'in_topic_domains', 'in_usage_domains', 'instance_hypernyms', 'instance_hyponyms', 'jcn_similarity', 'lch_similarity', 'lemma_names', 'lemmas', 'lexname', 'lin_similarity', 'lowest_common_hypernyms', 'max_depth', 'member_holonyms', 'member_meronyms', 'min_depth', 'mst', 'name', 'offset', 'part_holonyms', 'part_meronyms', 'path_similarity', 'pos', 'region_domains', 'res_similarity', 'root_hypernyms', 'shortest_path_distance', 'similar_tos', 'substance_holonyms', 'substance_meronyms', 'topic_domains', 'tree', 'usage_domains', 'verb_groups', 'wup_similarity']
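To get the surface words themselves rather than Synset objects, you can combine those calls with lemma_names(); gathering everything into one sorted list is just my own convenience here, not a single WordNet query:

from nltk.corpus import wordnet as wn

syn = wn.synsets("happy")[0]
# Collect the words from the "see also" and "similar to" synsets.
related_words = sorted({lemma
                        for s in syn.also_sees() + syn.similar_tos()
                        for lemma in s.lemma_names()})
print(related_words)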
I have a list called dictionary1. I use the following code to get sparse count matrices of texts:
cv1 = sklearn.feature_extraction.text.CountVectorizer(stop_words=None)
cv1.fit_transform(dictionary1)
I notice however that
list(set(dictionary1)-set(cv1.get_feature_names()))
results in ['i']. So "i" is in my dictionary but CountVectorizer ignores it (presumably some default setting discards one-char words). In the documentation I could not find such an option. Can someone point me to the problem? Indeed I would like to keep "i" in my analysis, as it could refer to more personal language.
A working workaround is passing the dictionary as the vocabulary directly (actually I don't know why I did not do that in the first place), i.e.:
cv1 = sklearn.feature_extraction.text.CountVectorizer(stop_words=[], vocabulary=dictionary1)
cv1._validate_vocabulary()
list(set(dictionary1)-set(cv1.get_feature_names())) then returns [].
In my original post, I should have mentioned that dictionary1 already is a list of unique tokens.
The default configuration tokenizes the string by extracting words of at least 2 letters.
See the sklearn documentation for more details about its vectorizers.
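For reference, that two-letter minimum comes from CountVectorizer's default token_pattern, r"(?u)\b\w\w+\b". A minimal sketch of relaxing it so single-character tokens such as "i" are kept (an alternative to swapping in a custom tokenizer):

from sklearn.feature_extraction.text import CountVectorizer

# The default pattern requires two or more word characters; dropping one \w
# keeps single-character tokens like "i".
cv1 = CountVectorizer(token_pattern=r"(?u)\b\w+\b")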
In your case, you should pass a different tokenizer, not a different analyzer. For example, you can use TweetTokenizer from the nltk library:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TweetTokenizer
corpus = [...some_texts...]
tk = TweetTokenizer()
vectorizer = CountVectorizer(tokenizer=tk.tokenize)
x = vectorizer.fit_transform(corpus)
For example, if corpus is defined as below, you would get:
corpus = ['I love ragdolls',
'I received a cat',
'I take it as my best friend']
vectorizer.get_feature_names()
> ['a', 'as', 'best', 'cat', 'friend', 'i', 'it', 'love', 'my', 'ragdolls', 'received', 'take']
I would like to do some text analysis on job descriptions and was going to use nltk. I can build a dictionary and remove the stopwords, which is part of what I want. However in addition to the single words and their frequencies I would like to keep meaningful 'word groups' and count them as well.
For example, in job descriptions containing 'machine learning' I don't want to consider 'machine' and 'learning' separately but retain the word group in my dictionary if it frequently occurs together. What is the most efficient method to do that? (I think I won't need to go beyond word groups containing 2 or 3 words.) And: at which point should I do the stopword removal?
Here is an example:
text = ('As a Data Scientist, you will focus on machine '
        'learning and Natural Language Processing')
The dictionary I would like to have is:
dict = ['data scientist', 'machine learning', 'natural language processing',
        'data', 'scientist', 'focus', 'machine', 'learning', 'natural',
        'language', 'processing']
Sounds like what you want to do is use collocations from nltk.
Tokenize your multi-word expressions into tuples, then put them in a set for easy lookup. The easiest way is to use nltk.ngrams which allows you to iterate directly over the ngrams in your text. Since your sample data includes a trigram, here's a search for n up to 3.
import nltk

raw_keywords = ['data scientist', 'machine learning', 'natural language processing',
                'data', 'scientist', 'focus', 'machine', 'learning', 'natural',
                'language', 'processing']
keywords = set(tuple(term.split()) for term in raw_keywords)

tokens = nltk.word_tokenize(text.lower())

# Scan the text once for each ngram size.
for n in 1, 2, 3:
    for ngram in nltk.ngrams(tokens, n):
        if ngram in keywords:
            print(ngram)
If you have huge amounts of text, you could check whether you'll get a speed-up by iterating over maximal ngrams only (with the option pad_right=True to avoid missing small ngram sizes). The number of lookups is the same both ways, so I doubt it will make much difference, except in the order of returned results.
for ngram in nltk.ngrams(tokens, n, pad_right=True):
    for k in range(n):
        if ngram[:k+1] in keywords:
            print(ngram[:k+1])
As for stopword removal: If you remove them, you'll produce ngrams where there were none before, e.g., "sewing machine and learning center" will match "machine learning" after stopword removal. You'll have to decide if this is something you want, or not. If it were me I would remove punctuation before the keyword scan, but leave the stopwords in place.
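A minimal sketch of that punctuation step, using an .isalnum() filter (just one possible choice):

import nltk

# Drop punctuation-only tokens before the keyword scan, but keep the stopwords.
# Note that .isalnum() also drops hyphenated tokens such as 'state-of-the-art'.
tokens = [tok for tok in nltk.word_tokenize(text.lower()) if tok.isalnum()]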
Thanks #Batman, I played around a bit with collocations and ended up only needing a couple of lines of code. (Obviously meaningful_text should be a lot longer to find actual collocations.)
import nltk
from nltk import word_tokenize
from nltk.collocations import *

meaningful_text = ('As a Data Scientist, you will focus on machine '
                   'learning and Natural Language Processing')

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(word_tokenize(meaningful_text))
scored = finder.score_ngrams(bigram_measures.raw_freq)
sorted(scored, key=lambda s: s[1], reverse=True)
Like the title says, how can I check whether two POS tags are in the same category?
For example,
go -> VB
goes -> VBZ
These two words are both verbs. Or,
bag -> NN
bags -> NNS
These two are both nouns.
So my question is whether there exists any function in NLTK to check if two given tags are in the same category.
Let's take the simple case first: Your corpus is tagged with the Brown tagset (that's what it looks like), and you'd be happy with the simple tags defined in the nltk's "universal" tagset: ., ADJ, ADP, ADV, CONJ, DET, NOUN, NUM, PRON, PRT, VERB, X, where the dot stands for "punctuation". In this case, simply load the nltk's map and use it with your data:
import nltk

tagmap = nltk.tag.mapping.tagset_mapping("en-brown", "universal")

if tagmap[tag1] == tagmap[tag2]:
    print("The two words have the same part of speech")
If that's not your use case, you'll need to manually decide on a mapping from each individual tag to the simplified category you want to assign it to. If you are working with the Brown corpus tagset, you can look up the tags and their meanings, or list them from within Python like this:
nltk.help.brown_tagset()
Study your tags and define a dictionary that maps each POS tag to your chosen category; people sometimes find it useful to just group Brown corpus tags by their first two letters, putting together "NN", "NN$", "NNS-HL", etc. You could create this particular mapping automatically like this:
from nltk.corpus import brown

alltags = set(t for w, t in brown.tagged_words())
tagmap = {t: t[:2] for t in alltags}  # map each tag to its first two letters
Then you can customize this map according to your needs; e.g., to put all punctuation tags together in the category ".":
for tag in tagmap:
    if not tag.isalpha():
        tagmap[tag] = "."
Once your tagmap is to your liking, use it like the one I imported from the nltk.
Finally, you might find it convenient to retag your entire corpus in one go, so that you can simply compare the assigned tags. If corpus is a list of tagged sentences in the format of the nltk's <corpus>.tagged_sents() command (so not a corpus reader object), you can retag everything like this:
newcorpus = []
for sent in corpus:
    newcorpus.append([(w, tagmap[t]) for w, t in sent])
Not sure if this is what you are looking for, but you can tag with a universal tagset:
from pprint import pprint
from collections import defaultdict
from nltk import pos_tag
from nltk.tokenize import sent_tokenize, word_tokenize

s = "I go. He goes. This bag is brown. These bags are brown."

d = defaultdict(list)
for sent in sent_tokenize(s):
    text = word_tokenize(sent)
    for value, tag in pos_tag(text, tagset='universal'):
        d[tag].append(value)

pprint(dict(d))
Prints:
{'.': ['.', '.', '.', '.'],
'ADJ': ['brown'],
'DET': ['This', 'These'],
'NOUN': ['bag', 'bags'],
'PRON': ['I', 'He'],
'VERB': ['go', 'goes', 'is', 'brown', 'are']}
Note how bag and bags fall into the NOUN category and go and goes into VERB.