In NLTK, you can easily compute the counts for the words in a text, say, by doing
from nltk.probability import FreqDist
fd = FreqDist([word for word in text.split()])
where text is a string.
Now, you can plot the distribution as
fd.plot()
and that will give you a nice line plot with the counts for each word. The docs don't mention any way to plot the relative frequencies instead, which you can get from fd.freq(x).
Any straightforward way to plot the normalised counts, without taking the data into other data structures, normalising and plotting separately?
You can update each fd[word] to fd[word] / total:
from nltk.probability import FreqDist
text = "This is an example . This is test . example is for freq dist ."
fd = FreqDist([word for word in text.split()])
total = fd.N()
for word in fd:
    fd[word] /= float(total)
fd.plot()
NOTE: You will lose the original FreqDist values.
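If you want to keep the original counts, a minimal sketch of an alternative (it relies on FreqDist accepting a dict, which it does since it subclasses collections.Counter) is to build a second FreqDist from the normalised values and plot that instead:
from nltk.probability import FreqDist

text = "This is an example . This is test . example is for freq dist ."
fd = FreqDist(text.split())
total = fd.N()

# build a separate FreqDist holding relative frequencies; fd keeps the raw counts
fd_norm = FreqDist({word: count / float(total) for word, count in fd.items()})
fd_norm.plot()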
Pardon the lack of documentation. In nltk, FreqDist provides you with the raw counts (i.e. the frequencies of words) in the text, but ProbDist provides you with the probabilities of a word given a text.
For more information, you have to do some code reading: https://github.com/nltk/nltk/blob/develop/nltk/probability.py
The specific lines that do the normalization come from https://github.com/nltk/nltk/blob/develop/nltk/probability.py#L598
So to get a normalized ProbDist, you can do the following:
>>> from nltk.corpus import brown
>>> from nltk.probability import FreqDist
>>> from nltk.probability import DictionaryProbDist
>>> brown_freqdist = FreqDist(brown.words())
# Cast the frequency distribution into probabilities
>>> brown_probdist = DictionaryProbDist(brown_freqdist)
# Something strange in NLTK to note though:
# when asking for probabilities from a ProbDist without
# normalization, it looks like it returns the count instead...
>>> brown_freqdist['said']
1943
>>> brown_probdist.prob('said')
1943
>>> brown_probdist.logprob('said')
10.924070185585345
>>> brown_probdist = DictionaryProbDist(brown_freqdist, normalize=True)
>>> brown_probdist.logprob('said')
-9.223104921442907
>>> brown_probdist.prob('said')
0.0016732805599763002
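To tie this back to the plotting question, here is a minimal sketch (assuming matplotlib is installed, which FreqDist.plot() needs anyway) that plots the probabilities of the 30 most frequent words from the normalized ProbDist:
>>> import matplotlib.pyplot as plt
>>> top = [w for w, _ in brown_freqdist.most_common(30)]
>>> plt.plot([brown_probdist.prob(w) for w in top])
>>> plt.xticks(range(len(top)), top, rotation='vertical')
>>> plt.show()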
I have a TF-IDF vocabulary that I already got from gensim or TfidfVectorizer. Is there any specific metric or method to drop the tails of the TF-IDF vocabulary? By tails I mean the tail of the Zipf diagram. And how do I visualize it?
I would like to see how accuracy changes when I drop words from the vocabulary. For instance, I have a vocabulary of 175,000 words.
These are a lot of questions, so I will try to answer them one by one. First of all, TF-IDF is a word-document matrix. So, you can use the min_df=5 argument to ignore words that appear in fewer than 5 documents. However, I don't think that's what you wanted, since you mentioned Zipf's law.
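For completeness, a minimal sketch of that min_df filter (corpus here is just a placeholder for your list of document strings):
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>>
>>> # ignore words that appear in fewer than 5 documents
>>> vec = TfidfVectorizer(min_df=5)
>>> X = vec.fit_transform(corpus)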
So, the following code reads a small dataset of SMS messages (SMSSpamCollection.txt) and counts the unique words in it.
>>> from nltk.tokenize.casual import TweetTokenizer
>>> from collections import Counter
>>> words = []
>>> corpus = []
>>> tknzr = TweetTokenizer()
>>> with open('SMSSpamCollection.txt', 'r') as fin:
...     for line in fin.readlines():
...         label, text = line.split('\t')
...         text = text.lower()
...         words.extend(tknzr.tokenize(text))
...         corpus.append(text)
>>>
>>> most_common = Counter(words).most_common()
>>> most_common[:10]
[('.', 5004), ('i', 2351), ('to', 2252), ('you', 2149), (',', 1942), ('?', 1540), ('a', 1447), ('!', 1388), ('the', 1338), ('...', 1219)]
Let's get the words ordered by their occurrence in the dataset:
>>> words, counts = zip(*most_common)
>>> len(words)
9232
How to visualize Zipf's graph?
You can visualize the Zipf graph using the Counter object. In the following code, we will plot only the 100 most common words; with more than that, the graph won't be readable:
>>> import matplotlib.pyplot as plt
>>>
>>> plt.bar(words[:100], counts[:100])
>>> plt.xticks(rotation='vertical')
>>> plt.show()
Which will produce the following graph:
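As a side note, Zipf's law is often easier to see on a log-log plot of rank against count; here is a minimal sketch using the same counts tuple from above:
>>> plt.loglog(range(1, len(counts) + 1), counts)
>>> plt.xlabel('rank')
>>> plt.ylabel('count')
>>> plt.show()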
Is there any specific metric or method to drop tails of TF IDF vocabulary?
Now, you can use the words list to remove some words. In the following example, I'm going to remove the least frequent 500 words when creating the TfidfVectorizer:
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>>
>>> vec = TfidfVectorizer(vocabulary=words[:-500]) #removing last 500 words
>>> vec.fit_transform(corpus)
>>> # make sure everything is as expected
>>> len(vec.vocabulary_)
8732
As we can see, the TfidfVectorizer vocabulary now contains 8,732 words instead of the full 9,232. Here I removed the 500 least frequent words; you can repeat this with different cutoffs and test the accuracy, as sketched below. Hope this answers your question.
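A minimal sketch of that loop (the classifier training step is just a placeholder comment, since it depends on your model):
>>> for cutoff in [500, 1000, 2000, 5000]:
...     vec = TfidfVectorizer(vocabulary=words[:-cutoff])  # drop the `cutoff` least frequent words
...     X = vec.fit_transform(corpus)
...     print(cutoff, X.shape)
...     # train and evaluate your classifier on X here, then compare accuracies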
I would like to find the most relevant words over a set of documents.
I would like to run a TF-IDF algorithm over 3 documents and get back a CSV file containing each word and its frequency.
After that, I will keep only the words with a high value and use them.
I found this implementation that does what I need https://github.com/mccurdyc/tf-idf/.
I call that jar using the subprocess library, but there is a huge problem in that code: it makes a lot of mistakes when analyzing words. It merges some words, and it has problems with ' and - (I think). I am using it on the text of 3 books (Harry Potter) and, for example, I am getting words such as hermiones, hermionell, riddlehermione, thinghermione instead of just hermione in the CSV file.
Am I doing something wrong? Can you give me a working implementation of the TF-IDF algorithm? Is there a Python library that does that?
Here is an implementation of the Tf-idf algorithm using scikit-learn.
Before applying it, you can word_tokenize() and stem your words.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
def tokenize(text):
    stemmer = PorterStemmer()
    tokens = word_tokenize(text)
    stems = [stemmer.stem(item) for item in tokens]
    return stems
# your corpus
text = ["This is your first text book", "This is the third text for analysis", "This is another text"]
# word tokenize and stem
text = [" ".join(tokenize(txt.lower())) for txt in text]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(text).todense()
# transform the matrix to a pandas df
matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names())
# sum tf-idf scores over all documents (axis=0), giving one value per word
top_words = matrix.sum(axis=0).sort_values(ascending=False)
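Since the question asks for a CSV file, a short usage sketch (the file name is just an example):
# inspect the most relevant words and write everything to a CSV file
print(top_words.head(20))
top_words.to_csv("tfidf_scores.csv")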
How do I plot the 50 least frequent words?
Maybe I am overcomplicating this. Here's how I get the words:
distr = nltk.FreqDist(word for word in items)
words = distr.keys()
seldomwords = words[:50]
How do I plot this now?
With the plot function of FreqDist I get all or only the x most frequent words.
I tried something like:
distr.plot(:50)
But that's syntactically incorrect.
It's sort of strange, but the simplest way is to:
first, extract the least common items from the FreqDist,
then feed those items into a new FreqDist object,
and finally call FreqDist.plot() on that new FreqDist.
[Code]:
>>> from nltk import FreqDist
>>> fd = FreqDist(list('aaabbbbbcccccdddddddd'))
>>> last_two = FreqDist(dict(fd.most_common()[-2:]))
>>> last_two.plot()
[out]:
[Code]:
>>> from nltk import FreqDist
>>> fd = FreqDist(list('aaabbbbbcccccdddddddd'))
>>> last_two = FreqDist(dict(fd.most_common()[-2:]))
>>> last_two.plot()
>>> last_three = FreqDist(dict(fd.most_common()[-3:]))
>>> last_three.plot()
[out]:
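Applied to your original question, the same trick plots the 50 least frequent words (assuming your distr has at least 50 samples):
>>> least_50 = FreqDist(dict(distr.most_common()[-50:]))
>>> least_50.plot()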
I'm trying to write a script that will look through my corpus, which contains 93,000 txt files, and find the frequency distribution of the trigrams present across all of them (so not separate frequency distributions, but one frequency distribution for the entire corpus). I've gotten it to compute the frequency distribution for a single file in the corpus, but I don't have the skills to get any further. Here's the code:
import nltk
import re
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk import FreqDist
corpus_root = '/Users/jolijttamanaha/Python/CRspeeches'
speeches = PlaintextCorpusReader(corpus_root, r'.*\.txt')
print "Finished importing corpus"
f = speeches.open('Mr. THOMPSON of Pennsylvania.2010-12-07.2014sep17_at_233337.txt')
raw = f.read()
tokens = nltk.word_tokenize(raw)
tgs = nltk.trigrams(tokens)
fdist = nltk.FreqDist(tgs)
for k, v in fdist.items():
    print k, v
Thank you in advance for your help.
Once you define your speeches corpus with PlaintextCorpusReader as you have, you can get trigrams for the entire corpus very simply:
fdist = nltk.FreqDist(nltk.trigrams(speeches.words()))
But this has an undesirable glitch: it forms trigrams that span from the end of one file to the beginning of the next. Such trigrams do not represent tokens that could follow each other in a text; they are completely accidental. What you really want is to combine the trigram counts from each individual file, which you can get like this:
fdist = nltk.FreqDist() # Empty distribution
for filename in speeches.fileids():
    fdist.update(nltk.trigrams(speeches.words(filename)))
Your fdist now contains the cumulative statistics, which you can examine in the various available ways. E.g.,
fdist.tabulate(10)
For the pre-packaged corpora API, instead of using corpus.raw(), you can also try corpus.words(), e.g.
>>> from nltk.util import ngrams
>>> from nltk.corpus import brown
>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]
>>> trigrams = ngrams(brown.words(), 3)
>>> for i in trigrams:
...     print i
As @alexis pointed out, the code above should also work for custom corpora loaded with PlaintextCorpusReader; see http://www.nltk.org/_modules/nltk/corpus/reader/plaintext.html
I can get the sense offset from a Princeton WordNet sense as marked in the NLTK corpus library:
[in]:'dog.n.01'
>>> from nltk.corpus import wordnet as wn
>>> ss = wn.synset('dog.n.01')
>>> offset = str(ss.offset).zfill(8)+"-"+ss.pos
>>> print offset
[out]:'02084071-n'
That offset is similar to the convention used in http://casta-net.jp/~kuribayashi/cgi-bin/wn-multi.cgi?synset=02084071-n&lang=eng
How can I do the reverse without looping through the whole WordNet corpus? That is, where:
[in]: '02084071-n'
[out]: 'dog.n.01' or Synset('dog.n.01')
I could do this, but it's far too slow and goes through many redundant cycles:
[in]: '02084071-n'
in_offset, in_pos = "02084071-n".split("-")
from nltk.corpus import wordnet as wn
nltk_ss = [i for i in wn.all_synsets() if i.offset == int(in_offset) and i.pos == in_pos][0]
print nltk_ss
[out]: Synset('dog.n.01')
Unfortunately, you cannot do a reverse lookup without iterating over the corpus at least once (as you have shown). The only thing I can suggest is to keep the mapping in a dictionary if you are going to look up synsets by offset multiple times.
>>> senseIdToSynset = {s.offset:s for s in wn.all_synsets()}
>>> senseIdToSynset[2084071]
Synset('dog.n.01')
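If your input is the full '02084071-n' style string, you can key the cache on that form directly. A sketch follows; note that in newer NLTK versions offset and pos are methods, so you need s.offset() and s.pos(), while older versions expose them as attributes as in your code. Newer NLTK releases may also provide wn.synset_from_pos_and_offset(pos, offset) as a direct lookup, if your version has it.
>>> # key the cache by the same '%08d-%s' offset string used above
>>> offset_to_synset = {"%08d-%s" % (s.offset(), s.pos()): s for s in wn.all_synsets()}
>>> offset_to_synset['02084071-n']
Synset('dog.n.01')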