How do I plot the 50 least frequent words?
Maybe I am overcomplicating this. Here's how I get the words:
distr = nltk.FreqDist(word for word in items)
words = distr.keys()
seldomwords = words[:50]
How do I plot this now?
With the plot function of FreqDist I get all or only the x most frequent words.
I tried something like:
distr.plot(:50)
But that's syntactically incorrect.
It's a little roundabout, but the simplest way is to:
first extract the least common items from the FreqDist,
then feed those items back into a new FreqDist object,
and finally call FreqDist.plot() on the new FreqDist.
[Code]:
>>> from nltk import FreqDist
>>> fd = FreqDist(list('aaabbbbbcccccdddddddd'))
>>> last_two = FreqDist(dict(fd.most_common()[-2:]))
>>> last_two.plot()
[out]: (plot of the two least common items)
[Code]:
>>> from nltk import FreqDist
>>> fd = FreqDist(list('aaabbbbbcccccdddddddd'))
>>> last_two = FreqDist(dict(fd.most_common()[-2:]))
>>> last_two.plot()
>>> last_three = FreqDist(dict(fd.most_common()[-3:]))
>>> last_three.plot()
[out]: (plot of the three least common items)
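Alternatively, if you'd rather not build a second FreqDist, here is a small matplotlib sketch of my own that plots the tail of the distribution directly (reusing fd from above; with this toy data there are fewer than 50 items, so the slice simply takes them all):
[Code]:
>>> import matplotlib.pyplot as plt
>>>
>>> least_common = fd.most_common()[-50:]   # the 50 least frequent items
>>> labels, counts = zip(*least_common)
>>> plt.plot(range(len(labels)), counts)
>>> plt.xticks(range(len(labels)), labels, rotation='vertical')
>>> plt.show()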
Related
I have a TF-IDF vocabulary that I already get from gensim or TfidfVectorizer. Is there any specific metric or method to drop the tail of the TF-IDF vocabulary? I mean the tail of the Zipf diagram. How do I visualize it?
I would like to see how accuracy changes when I drop words from the vocabulary. For instance, I have a vocabulary of 175,000 words.
These are a lot of questions, so I will try to answer them one by one. First of all, TF-IDF produces a term-document matrix, so you could use the min_df=5 argument to drop words that appear in fewer than 5 documents. However, I don't think that's what you want, since you mentioned Zipf's law.
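For reference, here is a minimal sketch of the min_df filter (my own illustration; corpus stands for any list of raw text documents, such as the one built in the snippet below):
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>>
>>> # ignore terms that appear in fewer than 5 documents
>>> vec = TfidfVectorizer(min_df=5)
>>> X = vec.fit_transform(corpus)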
So, the following snippet reads a small dataset of SMS messages and counts the unique words in the dataset.
>>> from nltk.tokenize.casual import TweetTokenizer
>>> from collections import Counter
>>> words = []
>>> corpus = []
>>> tknzr = TweetTokenizer()
>>> with open('SMSSpamCollection.txt', 'r') as fin:
... for line in fin.readlines():
... label, text = line.split('\t')
... text = text.lower()
... words.extend(tknzr.tokenize(text))
... corpus.append(text)
>>>
>>> most_common = Counter(words).most_common()
>>> most_common[:10]
[('.', 5004), ('i', 2351), ('to', 2252), ('you', 2149), (',', 1942), ('?', 1540), ('a', 1447), ('!', 1388), ('the', 1338), ('...', 1219)]
Let's get the words ordered by their occurrence in the dataset:
>>> words, counts = zip(*most_common)
>>> len(words)
9232
How to visualize Zipf's graph?
You can visualize the Zipf graph using the Counter object. In the following code, we will visualize the 100 most common words; with more than that, the graph won't be readable:
>>> import matplotlib.pyplot as plt
>>>
>>> plt.bar(words[:100], counts[:100])
>>> plt.xticks(rotation='vertical')
>>> plt.show()
Which will produce the following graph:
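Since the question mentions the Zipf diagram specifically, you can also plot rank against frequency on log-log axes (my own addition, reusing counts from above); an approximately straight line is the Zipf signature:
>>> plt.plot(range(1, len(counts) + 1), counts)
>>> plt.xscale('log')
>>> plt.yscale('log')
>>> plt.xlabel('rank')
>>> plt.ylabel('frequency')
>>> plt.show()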
Is there any specific metric or method to drop tails of TF IDF vocabulary?
Now, you can use the words list to remove some words. In the following example, I'm going to remove the 500 least frequent words when creating the TfidfVectorizer:
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>>
>>> vec = TfidfVectorizer(vocabulary=words[:-500]) #removing last 500 words
>>> vec.fit_transform(corpus)
>>> # make sure everything is as expected
>>> len(vec.vocabulary_)
8732
As we can see, the TfidfVectorizer vocabulary now contains 8732 words instead of the full 9232.
Here I removed the 500 least frequent words; you can repeat this with different numbers and test the accuracy. Hope this answers your question.
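If you want to automate that comparison, a rough sketch could look like the following (my own; the training step is left as a placeholder comment since it depends on your classifier and labels):
>>> for tail in (500, 1000, 2000, 5000):
...     vec = TfidfVectorizer(vocabulary=words[:-tail])
...     X = vec.fit_transform(corpus)
...     # train/evaluate your classifier on X here and record the accuracy
...     print(tail, X.shape)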
In NLTK, you can easily compute the counts for the words in a text, say, by doing
from nltk.probability import FreqDist
fd = FreqDist([word for word in text.split()])
where text is a string.
Now, you can plot the distribution as
fd.plot()
and that will give you a nice line plot with the counts for each word. In the docs there is no mention of a way to plot the relative frequencies instead, which you can get from fd.freq(x).
Any straightforward way to plot the normalised counts, without taking the data into other data structures, normalising and plotting separately?
You can update fd[word] with fd[word] / total
from nltk.probability import FreqDist
text = "This is an example . This is test . example is for freq dist ."
fd = FreqDist([word for word in text.split()])
total = fd.N()
for word in fd:
    fd[word] /= float(total)
fd.plot()
NOTE: you will lose the original FreqDist counts.
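If you want to keep the original counts, one option (a small sketch of my own) is to normalize a copy instead:
from nltk.probability import FreqDist

text = "This is an example . This is test . example is for freq dist ."
fd = FreqDist(text.split())
fd_norm = FreqDist(fd)        # copy the raw counts into a new FreqDist
total = float(fd_norm.N())
for word in fd_norm:
    fd_norm[word] /= total    # normalize only the copy
fd_norm.plot()                # fd still holds the original counts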
Pardon the lack of documentation. In NLTK, FreqDist gives you the raw counts (i.e. frequencies of words) in a text, while ProbDist gives you the probabilities of a word given a text.
For more information, you have to do some code reading: https://github.com/nltk/nltk/blob/develop/nltk/probability.py
The specific lines that do the normalization come from https://github.com/nltk/nltk/blob/develop/nltk/probability.py#L598
So to get a normalized ProbDist, you can do the following:
>>> from nltk.corpus import brown
>>> from nltk.probability import FreqDist
>>> from nltk.probability import DictionaryProbDist
>>> brown_freqdist = FreqDist(brown.words())
# Cast the frequency distribution into probabilities
>>> brown_probdist = DictionaryProbDist(brown_freqdist)
# Something strange in NLTK to note though:
# when asking for probabilities from a ProbDist without
# normalization, it looks like it returns the count instead...
>>> brown_freqdist['said']
1943
>>> brown_probdist.prob('said')
1943
>>> brown_probdist.logprob('said')
10.924070185585345
>>> brown_probdist = DictionaryProbDist(brown_freqdist, normalize=True)
>>> brown_probdist.logprob('said')
-9.223104921442907
>>> brown_probdist.prob('said')
0.0016732805599763002
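As a side note of mine, nltk.probability also provides MLEProbDist, which normalizes by construction, so you get the same probability without passing normalize=True:
>>> from nltk.probability import MLEProbDist
>>> brown_mle = MLEProbDist(brown_freqdist)
>>> brown_mle.prob('said')
0.0016732805599763002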
I'm trying to do the NLTK exercises but I can't do this one: "Which nouns are more common in their plural form, rather than their singular form? (Only consider regular plurals, formed with the -s suffix.)" I spent a day thinking about this and trying things, but I just can't get it.
Thank you.
Take a corpus and do a count:
>>> from collections import Counter
>>> from nltk.corpus import brown
>>> texts = brown.words()[:10000]
>>> word_counts = Counter(texts)
>>> word_counts['dollar']
5
>>> word_counts['dollars']
15
But do note that counting only surface strings can sometimes be misleading, e.g.
>>> texts = brown.words()[:10000]
>>> word_counts = Counter(texts)
>>> word_counts['hits']
14
>>> word_counts['hit']
34
>>> word_counts['needs']
14
>>> word_counts['need']
30
POS-sensitive counts (see types vs. tokens):
>>> texts = brown.tagged_words()[:10000]
>>> word_counts = Counter(texts)
>>> word_counts[('need', 'NN')]
6
>>> word_counts[('needs', 'NNS')]
3
>>> word_counts[('hit', 'NN')]
0
>>> word_counts[('hits', 'NNS')]
0
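Putting the POS-sensitive counts to work, one possible sketch for the exercise (my own, assuming the default Brown tagset, where 'NN' marks singular and 'NNS' marks regular plural nouns) is:
>>> from collections import Counter
>>> from nltk.corpus import brown
>>>
>>> tag_counts = Counter(brown.tagged_words())
>>> singulars = {w for w, tag in tag_counts if tag == 'NN'}
>>> plural_heavy = sorted(w for w in singulars
...                       if tag_counts[(w + 's', 'NNS')] > tag_counts[(w, 'NN')])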
Let's reverse engineer a little. The Brown corpus is nice because it comes tokenized and tagged in NLTK, but if you want to use your own corpus, then you have to consider the following:
Which corpus to use? How to tokenize? How to POS-tag?
What are you counting? Types or tokens?
How to handle POS ambiguity? How to differentiate Nouns from non-Nouns?
Finally, consider this:
Is there really a way to find out whether plural or singular is more common for a word in language? Or will it always be relative to the corpus you chose to analyze?
Are there cases where plural or singular don't exists for certain nouns? (Most probably the answer is yes).
Here brw is a list of words.
from collections import Counter

counter = Counter(brw)
plurals = []
for word in brw:
    if word[-1] != 's':
        plural = counter[word + 's']
        singular = counter[word]
        if plural > singular:
            plurals.append(word + 's')
plurals is the output list, containing only the plurals (with repeats, meh). If I wrap it in set(), they won't be repeated. Is this right?
I'm trying to write a script that will look through my corpus, which contains 93,000 txt files, and find one frequency distribution of the trigrams across all of them (so not separate frequency distributions per file but a single frequency distribution for the entire corpus). I've gotten it to produce the frequency distribution for a single file in the corpus, but I don't have the skills to get any further. Here's the code:
import nltk
import re
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk import FreqDist
corpus_root = '/Users/jolijttamanaha/Python/CRspeeches'
speeches = PlaintextCorpusReader(corpus_root, '.*\.txt')
print "Finished importing corpus"
f = speeches.open('Mr. THOMPSON of Pennsylvania.2010-12-07.2014sep17_at_233337.txt')
raw = f.read()
tokens = nltk.word_tokenize(raw)
tgs = nltk.trigrams(tokens)
fdist = nltk.FreqDist(tgs)
for k, v in fdist.items():
    print k, v
Thank you in advance for your help.
Once you define your speeches corpus with PlaintextCorpusReader as you have, you can get trigrams for the entire corpus very simply:
fdist = nltk.FreqDist(nltk.trigrams(speeches.words()))
But this has an undesirable glitch: it forms trigrams that span from the end of one file to the start of the next. Such trigrams do not represent tokens that could actually follow each other in a text; they are completely accidental. What you really want is to combine the trigram counts from each individual file, which you can get like this:
fdist = nltk.FreqDist() # Empty distribution
for filename in speeches.fileids():
    fdist.update(nltk.trigrams(speeches.words(filename)))
Your fdist now contains the cumulative statistics, which you can examine in the various available ways. E.g.,
fdist.tabulate(10)
For the pre-packaged corpora API, instead of using corpus.raw(), you can also try corpus.words(), e.g.
>>> from nltk.util import ngrams
>>> from nltk.corpus import brown
>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]
>>> trigrams = ngrams(brown.words(), 3)
>>> for i in trigrams:
...     print i
As #alexis pointed out, the code above should also work for custom corpora loaded with PlaintextCorpusReader, see http://www.nltk.org/_modules/nltk/corpus/reader/plaintext.html
I can get the sense offset from a Princeton WordNet sense as marked in the NLTK corpus library:
[in]:'dog.n.01'
>>> from nltk.corpus import wordnet as wn
>>> ss = wn.synset('dog.n.01')
>>> offset = str(ss.offset).zfill(8)+"-"+ss.pos
>>> print offset
[out]:'02084071-n'
That offset is similar to the convention used in http://casta-net.jp/~kuribayashi/cgi-bin/wn-multi.cgi?synset=02084071-n&lang=eng
How can I do the reverse without looping through the whole WordNet corpus? I.e.:
[in]: '02084071-n'
[out]: 'dog.n.01' or Synset('dog.n.01')
I could do this, but it's far too slow and goes through too many redundant cycles:
[in]: '02084071-n'
in_offset, in_pos = "02084071-n".split("-")
from nltk.corpus import wordnet as wn
nltk_ss = [i for i in wn.all_synsets() if i.offset == int(in_offset) and i.pos == in_pos][0]
print nltk_ss
[out]: Synset('dog.n.01')
Unfortunately, you cannot do a reverse lookup without iterating over the corpus at least once (as you have shown). The only thing I can suggest is to keep the mapping in a dictionary if you are going to look up synsets by offset multiple times.
>>> senseIdToSynset = {s.offset:s for s in wn.all_synsets()}
>>> senseIdToSynset[2084071]
Synset('dog.n.01')
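One caveat from me: offsets are only unique within a part of speech, so it can be safer to key the dictionary on (offset, pos) as well (on newer NLTK versions, offset and pos are methods, i.e. s.offset() and s.pos()):
>>> offsetPosToSynset = {(s.offset, s.pos): s for s in wn.all_synsets()}
>>> offsetPosToSynset[(2084071, 'n')]
Synset('dog.n.01')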