I am trying to train a classifier on a corpus of my own documents. My documents are structured in the same way as the original movie_reviews corpus data: 1K positive text files in a folder 'pos' and 1K negative text files in a folder 'neg'. Each text file contains 25 lines of tweets, which have been cleaned: URLs, usernames, capital letters, and punctuation removed.
How can I adjust this code to use my own text data instead of the movie_reviews?
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from collections import defaultdict
import numpy as np
# define the split of % training / % test
SPLIT = 0.8
def word_feats(words):
    return dict([(word, True) for word in words])
posids = movie_reviews.fileids('pos')
negids = movie_reviews.fileids('neg')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]
cutoff = int(len(posfeats) * SPLIT)
trainfeats = negfeats[:cutoff] + posfeats[:cutoff]
testfeats = negfeats[cutoff:] + posfeats[cutoff:]
print 'Train on %d instances\nTest on %d instances' % (len(trainfeats),len(testfeats))
classifier = NaiveBayesClassifier.train(trainfeats)
print 'Accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
classifier.show_most_informative_features()
You can log in as the root user and open this file:
/usr/local/lib/python2.7/dist-packages/nltk/corpus/__init__.py
In this file you can find the existing movie_reviews corpus loaded using LazyCorpusLoader:
movie_reviews = LazyCorpusLoader(
    'movie_reviews', CategorizedPlaintextCorpusReader,
    r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*')
Then try adding something similar to this:
My_Movie = LazyCorpusLoader(
    'My_Movie', CategorizedPlaintextCorpusReader,
    r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*')
Here My_Movie is the name you have chosen for your own movie-review corpus.
Once everything is done, save and exit.
Finally, place your corpus in the nltk_data directory where the movie_reviews corpus is located.
Then try performing this:
from nltk.corpus import My_Movie # Your newly created corpus
Hope this works.
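Alternatively, if you would rather not edit NLTK's installed files, a minimal sketch is to point a CategorizedPlaintextCorpusReader directly at your own folder (here 'my_tweets/' is a hypothetical path containing the pos/ and neg/ subfolders) and keep the rest of the question's script unchanged:
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
# 'my_tweets/' is a placeholder path; it should contain pos/*.txt and neg/*.txt
my_corpus = CategorizedPlaintextCorpusReader(
    'my_tweets/',
    r'(?!\.).*\.txt',
    cat_pattern=r'(neg|pos)/.*')
posids = my_corpus.fileids('pos')
negids = my_corpus.fileids('neg')
# The rest of the original code works once movie_reviews is replaced by my_corpus:
negfeats = [(word_feats(my_corpus.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(my_corpus.words(fileids=[f])), 'pos') for f in posids]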
I have a text which has many sentences. How can I use nltk.ngrams to process it?
This is my code:
import nltk
from nltk import ngrams

sequence = nltk.tokenize.word_tokenize(raw)  # raw holds the full text
bigram = ngrams(sequence, 2)
freq_dist = nltk.FreqDist(bigram)
prob_dist = nltk.MLEProbDist(freq_dist)
number_of_bigrams = freq_dist.N()
However, the above code assumes that all sentences form one sequence. But the sentences are separate, and I assume the last word of one sentence is unrelated to the first word of the next. How can I create bigrams for such a text? I also need prob_dist and number_of_bigrams, which are based on freq_dist.
There are similar questions, like What are ngram counts and how to implement using nltk?, but they are mostly about a single sequence of words.
You can use the new nltk.lm module. Here's an example, first get some data and tokenize it:
import os
import requests
import io #codecs
from nltk import word_tokenize, sent_tokenize
# Text version of https://kilgarriff.co.uk/Publications/2005-K-lineer.pdf
if os.path.isfile('language-never-random.txt'):
    with io.open('language-never-random.txt', encoding='utf8') as fin:
        text = fin.read()
else:
    url = "https://gist.githubusercontent.com/alvations/53b01e4076573fea47c6057120bb017a/raw/b01ff96a5f76848450e648f35da6497ca9454e4a/language-never-random.txt"
    text = requests.get(url).content.decode('utf8')
    with io.open('language-never-random.txt', 'w', encoding='utf8') as fout:
        fout.write(text)
# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(text)]
Then the language modelling:
# Preprocess the tokenized text for 3-grams language modelling
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE
n = 3
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)
model = MLE(n) # Let's train a trigram maximum likelihood estimation model.
model.fit(train_data, padded_sents)
To get the counts:
model.counts['language'] # i.e. Count('language')
model.counts[['language']]['is'] # i.e. Count('is'|'language')
model.counts[['language', 'is']]['never'] # i.e. Count('never'|'language is')
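To relate this to the question's freq_dist.N(), the same counter also exposes totals; a small sketch using the model fitted above:
print(model.counts[['language']].N())  # how many observed bigrams start with 'language'
print(model.counts[2].N())             # total number of bigrams seen in training (padding symbols included)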
To get the probabilities:
model.score('is', 'language'.split()) # P('is'|'language')
model.score('never', 'language is'.split()) # P('never'|'language is')
There are some kinks on the Kaggle platform when loading the notebook, but this notebook should give a good overview of the nltk.lm module: https://www.kaggle.com/alvations/n-gram-language-model-with-nltk
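If you would rather stay with a plain FreqDist as in the question, a simpler sketch (ordinary counting, not the nltk.lm way) is to build the bigrams sentence by sentence so that no bigram crosses a sentence boundary:
import nltk
from nltk import ngrams
# Count bigrams within each sentence only, so the last word of one sentence
# is never paired with the first word of the next.
freq_dist = nltk.FreqDist()
for sent in nltk.sent_tokenize(raw):      # raw is the question's original text
    tokens = nltk.word_tokenize(sent)
    freq_dist.update(ngrams(tokens, 2))
prob_dist = nltk.MLEProbDist(freq_dist)
number_of_bigrams = freq_dist.N()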
I have been following the tutorial at https://radimrehurek.com/gensim/tut2.html. I have come across the error "UnpicklingError was unhandled by user code: invalid load key, '%'". How do I clear that error? I have looked at other questions on this and included the klepto package, but the error persists. I am using Anaconda 2. This is the code:
import logging
import xml.etree.cElementTree
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
level=logging.INFO)
import os
import klepto
from gensim import corpora
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
for document in documents]
# remove words that appear only once
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1] for text in texts]
from pprint import pprint # pretty-printer
pprint(texts)
dictionary = corpora.Dictionary(texts)
dictionary.save_as_text('/tmp/deerwester.dict') # store the dictionary, for future reference
print(dictionary)
print(dictionary.token2id)
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec) # the word "interaction" does not appear in the dictionary and is ignored
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('/tmp/deerwester.dict', corpus) # store to disk, for later use
for c in corpus:
    print(c)
class MyCorpus(object):
    def __iter__(self):
        for line in open('/datasets/mycorpus.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())
corpus_memory_friendly = MyCorpus() # doesn't load the corpus into memory!
print(corpus_memory_friendly)
for vector in corpus_memory_friendly:  # load one vector into memory at a time
    print(vector)
from six import iteritems
# collect statistics about all tokens
dictionary = corpora.Dictionary(line.lower().split() for line in open('/datasets/mycorpus.txt'))
# remove stop words and words that appear only once
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist
if stopword in dictionary.token2id]
once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]
# remove stop words and words that appear only once
dictionary.filter_tokens(stop_ids + once_ids)
# remove gaps in id sequence after words that were removed
dictionary.compactify()
print(dictionary)
# create a toy corpus of 2 documents, as a plain Python list
corpus = [[(1, 0.5)], []] # make one document empty, for the heck of it
corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)
corpora.SvmLightCorpus.serialize('/tmp/corpus.svmlight', corpus)
corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)
corpora.LowCorpus.serialize('/tmp/corpus.low', corpus)
corpus = corpora.MmCorpus('/tmp/corpus.mm')
print(corpus)
# one way of printing a corpus: load it entirely into memory
print(list(corpus)) # calling list() will convert any sequence to a plain Python list
# another way of doing it: print one document at a time, making use of the streaming interface
for doc in corpus:
print(doc)
corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)
import gensim
import numpy as np
numpy_matrix = np.random.randint(10, size=[5,2])
corpus = gensim.matutils.Dense2Corpus(numpy_matrix)
numpy_matrix_dense = gensim.matutils.corpus2dense(corpus, num_terms=10)
import scipy.sparse
scipy_sparse_matrix = scipy.sparse.random(5,2)
corpus = gensim.matutils.Sparse2Corpus(scipy_sparse_matrix)
scipy_csc_matrix = gensim.matutils.corpus2csc(corpus)
from gensim import corpora, models, similarities
if (os.path.exists("/tmp/deerwester.dict")):
    dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')
    corpus = corpora.MmCorpus('/tmp/deerwester.mm')
    print("Used files generated from first tutorial")
else:
    print("Please run first tutorial to generate data set")
tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model
doc_bow = [(0, 1), (1, 1)]
print(tfidf[doc_bow]) # step 2 -- use the model to transform vectors
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2) # initialize an LSI transformation
corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi
lsi.print_topics(2)
for doc in corpus_lsi:  # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
    print(doc)
lsi.save('/tmp/model.lsi') # same for tfidf, lda, ...
lsi = models.LsiModel.load('/tmp/model.lsi')
model = models.TfidfModel(corpus, normalize=True)
model = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300)
model.add_documents(another_tfidf_corpus) # now LSI has been trained on tfidf_corpus + another_tfidf_corpus
lsi_vec = model[tfidf_vec] # convert some new document into the LSI space, without affecting the model
model.add_documents(more_documents) # tfidf_corpus + another_tfidf_corpus + more_documents
lsi_vec = model[tfidf_vec]
model = models.RpModel(tfidf_corpus, num_topics=500)
model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)
model = models.HdpModel(corpus, id2word=dictionary)
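A hedged note on the error itself: "invalid load key, '%'" usually means a pickle loader was pointed at a file that is not a pickle. In the code above, the Matrix Market corpus is serialized to '/tmp/deerwester.dict' (Matrix Market files start with '%%MatrixMarket', hence the '%'), and the dictionary is saved with save_as_text() but later loaded with Dictionary.load(), which expects a pickled object. A minimal sketch of matching save/load pairs:
from gensim import corpora
# Save the dictionary and the corpus to separate files, with matching load calls.
dictionary.save('/tmp/deerwester.dict')                        # pickle-based save ...
dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')   # ... pairs with Dictionary.load()
corpora.MmCorpus.serialize('/tmp/deerwester.mm', corpus)       # Matrix Market file ...
corpus = corpora.MmCorpus('/tmp/deerwester.mm')                # ... pairs with MmCorpus()
# If you prefer the text format, pair save_as_text() with load_from_text() instead:
# dictionary.save_as_text('/tmp/deerwester_dict.txt')
# dictionary = corpora.Dictionary.load_from_text('/tmp/deerwester_dict.txt')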
I would like to analyze my first deep learning model using Python, and in order to do so I first have to split my corpus (8807 articles) into sentences. My corpus is built as follows:
## Libraries to download
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
import json
import nltk
import re
import pandas
appended_data = []
#for i in range(20014,2016):
# df0 = pandas.DataFrame([json.loads(l) for l in open('SDM_%d.json' % i)])
# appended_data.append(df0)
for i in range(2005,2016):
    if i > 2013:
        df0 = pandas.DataFrame([json.loads(l) for l in open('SDM_%d.json' % i)])
        appended_data.append(df0)
    df1 = pandas.DataFrame([json.loads(l) for l in open('Scot_%d.json' % i)])
    df2 = pandas.DataFrame([json.loads(l) for l in open('APJ_%d.json' % i)])
    df3 = pandas.DataFrame([json.loads(l) for l in open('TH500_%d.json' % i)])
    df4 = pandas.DataFrame([json.loads(l) for l in open('DRSM_%d.json' % i)])
    appended_data.append(df1)
    appended_data.append(df2)
    appended_data.append(df3)
    appended_data.append(df4)
appended_data = pandas.concat(appended_data)
# doc_set = df1.body
doc_set = appended_data.body
I am trying to use the function Word2Vec.load_word2vec_format from the gensim.models library, but I first have to split my corpus (doc_set) into sentences.
from gensim.models import word2vec
model = Word2Vec.load_word2vec_format(doc_set, binary=False)
Any recommendations?
cheers
So, Gensim's Word2Vec requires this format for its training input: sentences = [['first', 'sentence'], ['second', 'sentence']].
I assume your documents contain more than one sentence. You should first split them into sentences; you can do that with nltk (you might need to download the punkt model first). Then tokenize each sentence and put everything together in a list.
import itertools
import nltk
import gensim

# nltk.download('punkt')  # uncomment if the punkt model has not been downloaded yet
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
sentenized = doc_set.apply(sent_detector.tokenize)  # doc_set is already the 'body' Series
sentences = itertools.chain.from_iterable(sentenized.tolist())  # just to flatten
result = []
for sent in sentences:
    result += [nltk.word_tokenize(sent)]
model = gensim.models.Word2Vec(result)
Unfortunately I am not good enough with Pandas to perform all the operations in a "pandastic" way.
Pay a lot of attention to the parameters of Word2Vec; picking them right can make a huge difference.
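For illustration, a minimal sketch of the parameters that usually matter most (names as in gensim 3.x; in gensim 4+ size was renamed to vector_size and iter to epochs):
from gensim.models import Word2Vec
# result is the list of tokenized sentences built above.
model = Word2Vec(
    result,
    size=100,      # dimensionality of the word vectors
    window=5,      # context window size
    min_count=5,   # ignore words rarer than this
    sg=1,          # 1 = skip-gram, 0 = CBOW
    workers=4)     # parallel training threads
model.save('my_word2vec.model')  # hypothetical output path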
I have this little chunk of code I found here:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
def word_feats(words):
    return dict([(word, True) for word in words])
negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]
negcutoff = len(negfeats)*3/4
poscutoff = len(posfeats)*3/4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print 'train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats))
classifier = NaiveBayesClassifier.train(trainfeats)
print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
classifier.show_most_informative_features()
But how can I classify a random word that might be in the corpus?
classifier.classify('magnificent')
This doesn't work. Does it need some kind of object?
Thank you very much.
EDIT: Thanks to @unutbu's feedback, some digging here, and reading the comments on the original post, the following yields 'pos' or 'neg' for this code (this one is a 'pos'):
print(classifier.classify(word_feats(['magnificent'])))
and this yields the probability that the word is classified as 'neg':
print(classifier.prob_classify(word_feats(['magnificent'])).prob('neg'))
print(classifier.classify(word_feats(['magnificent'])))
yields
pos
The classifier.classify method does not operate on individual words per se; it classifies based on a dict of features. In this example, word_feats maps a sentence (a list of words) to a dict of features.
Here is another example (from the NLTK book) which uses the NaiveBayesClassifier. By comparing what is similar and different between that example and the one you posted, you may get a better perspective of how it can be used.
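For example, to classify a whole sentence rather than a single word, you can tokenize it and pass the resulting word list through word_feats; a small sketch using the classifier trained above (the sentence is a made-up example):
import nltk
sentence = "This movie was absolutely magnificent"      # hypothetical input
tokens = nltk.word_tokenize(sentence.lower())
feats = word_feats(tokens)                               # same feature extractor used for training
print(classifier.classify(feats))                        # 'pos' or 'neg'
print(classifier.prob_classify(feats).prob('pos'))       # probability of the 'pos' label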
I reckon that often the answer to a question like this is to go and read the documentation, but I went through the NLTK book and it doesn't give the answer. I'm kind of new to Python.
I have a bunch of .txt files and I want to be able to use the corpus functions that NLTK provides for the corpus nltk_data.
I've tried PlaintextCorpusReader but I couldn't get further than:
>>>import nltk
>>>from nltk.corpus import PlaintextCorpusReader
>>>corpus_root = './'
>>>newcorpus = PlaintextCorpusReader(corpus_root, '.*')
>>>newcorpus.words()
How do I segment the newcorpus sentences using punkt? I tried the punkt functions, but they don't seem to accept a PlaintextCorpusReader object.
Can you also lead me to how I can write the segmented data into text files?
After some years of figuring out how it works, here's the updated tutorial on how to create an NLTK corpus from a directory of text files.
The main idea is to make use of the nltk.corpus.reader package. If you have a directory of text files in English, it's best to use the PlaintextCorpusReader.
If you have a directory that looks like this:
newcorpus/
file1.txt
file2.txt
...
Simply use these lines of code and you can get a corpus:
import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
corpusdir = 'newcorpus/' # Directory of corpus.
newcorpus = PlaintextCorpusReader(corpusdir, '.*')
NOTE: The PlaintextCorpusReader will use the default nltk.tokenize.sent_tokenize() and nltk.tokenize.word_tokenize() to split your texts into sentences and words. These functions are built for English, so they may NOT work for all languages.
Here's the full code with creation of test textfiles and how to create a corpus with NLTK and how to access the corpus at different levels:
import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
# Let's create a corpus with 2 texts in different textfile.
txt1 = """This is a foo bar sentence.\nAnd this is the first txtfile in the corpus."""
txt2 = """Are you a foo bar? Yes I am. Possibly, everyone is.\n"""
corpus = [txt1,txt2]
# Make new dir for the corpus.
corpusdir = 'newcorpus/'
if not os.path.isdir(corpusdir):
    os.mkdir(corpusdir)
# Output the files into the directory.
filename = 0
for text in corpus:
    filename += 1
    with open(corpusdir+str(filename)+'.txt','w') as fout:
        print>>fout, text
# Check that our corpus does exist and the files are correct.
assert os.path.isdir(corpusdir)
for infile, text in zip(sorted(os.listdir(corpusdir)), corpus):
    assert open(corpusdir+infile, 'r').read().strip() == text.strip()
# Create a new corpus by specifying the parameters
# (1) directory of the new corpus
# (2) the fileids of the corpus
# NOTE: in this case the fileids are simply the filenames.
newcorpus = PlaintextCorpusReader('newcorpus/', '.*')
# Access each file in the corpus.
for infile in sorted(newcorpus.fileids()):
    print infile  # The fileids of each file.
    with newcorpus.open(infile) as fin:  # Opens the file.
        print fin.read().strip()  # Prints the content of the file
    print
# Access the plaintext; outputs pure string/basestring.
print newcorpus.raw().strip()
print
# Access paragraphs in the corpus. (list of list of list of strings)
# NOTE: NLTK automatically calls nltk.tokenize.sent_tokenize and
# nltk.tokenize.word_tokenize.
#
# Each element in the outermost list is a paragraph, and
# Each paragraph contains sentence(s), and
# Each sentence contains token(s)
print newcorpus.paras()
print
# To access paragraphs of a specific fileid.
print newcorpus.paras(newcorpus.fileids()[0])
# Access sentences in the corpus. (list of list of strings)
# NOTE: That the texts are flattened into sentences that contains tokens.
print newcorpus.sents()
print
# To access sentences of a specific fileid.
print newcorpus.sents(newcorpus.fileids()[0])
# Access just tokens/words in the corpus. (list of strings)
print newcorpus.words()
# To access tokens of a specific fileid.
print newcorpus.words(newcorpus.fileids()[0])
Finally, to read a directory of texts and create an NLTK corpus in another language, you must first ensure that you have python-callable word tokenization and sentence tokenization modules that take string/basestring input and produce output like this:
>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> txt1 = """This is a foo bar sentence.\nAnd this is the first txtfile in the corpus."""
>>> sent_tokenize(txt1)
['This is a foo bar sentence.', 'And this is the first txtfile in the corpus.']
>>> word_tokenize(sent_tokenize(txt1)[0])
['This', 'is', 'a', 'foo', 'bar', 'sentence', '.']
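As for writing the segmented data into text files, one simple sketch (the output filename is arbitrary) is to write one tokenized sentence from .sents() per line:
import io
# Write one segmented sentence per line; 'segmented_sentences.txt' is just a name chosen here.
with io.open('segmented_sentences.txt', 'w', encoding='utf8') as fout:
    for sent in newcorpus.sents():
        fout.write(u' '.join(sent) + u'\n')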
I think the PlaintextCorpusReader already segments the input with a punkt tokenizer, at least if your input language is English.
PlaintextCorpusReader's constructor:
def __init__(self, root, fileids,
             word_tokenizer=WordPunctTokenizer(),
             sent_tokenizer=nltk.data.LazyLoader(
                 'tokenizers/punkt/english.pickle'),
             para_block_reader=read_blankline_block,
             encoding='utf8'):
You can pass the reader a word and sentence tokenizer, but for the latter the default already is nltk.data.LazyLoader('tokenizers/punkt/english.pickle').
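So for another language you can construct the reader with your own tokenizers; a sketch (the Spanish punkt model named here ships with the punkt data package, and WhitespaceTokenizer is just one possible word tokenizer):
import nltk
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.tokenize import WhitespaceTokenizer
# A reader with non-default sentence and word tokenizers.
custom_corpus = PlaintextCorpusReader(
    'newcorpus/', '.*',
    word_tokenizer=WhitespaceTokenizer(),
    sent_tokenizer=nltk.data.LazyLoader('tokenizers/punkt/spanish.pickle'))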
For a single string, a tokenizer would be used as follows (explained here, see section 5 for punkt tokenizer).
>>> import nltk.data
>>> text = """
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries. And sometimes sentences
... can start with non-capitalized words. i is a good variable
... name.
... """
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> tokenizer.tokenize(text.strip())
>>> import nltk
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = './'
>>> newcorpus = PlaintextCorpusReader(corpus_root, '.*')
"""
if the ./ dir contains the file my_corpus.txt, then you
can view all the words in it by doing this:
"""
>>> newcorpus.words('my_corpus.txt')
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
filecontent1 = "This is a cow"
filecontent2 = "This is a Dog"
corpusdir = 'nltk_data/'
with open(corpusdir + 'content1.txt', 'w') as text_file:
    text_file.write(filecontent1)
with open(corpusdir + 'content2.txt', 'w') as text_file:
    text_file.write(filecontent2)
text_corpus = PlaintextCorpusReader(corpusdir, ["content1.txt", "content2.txt"])
no_of_words_corpus1 = len(text_corpus.words("content1.txt"))
print(no_of_words_corpus1)
no_of_unique_words_corpus1 = len(set(text_corpus.words("content1.txt")))
no_of_words_corpus2 = len(text_corpus.words("content2.txt"))
no_of_unique_words_corpus2 = len(set(text_corpus.words("content2.txt")))