Creating a new corpus with NLTK - python

I figured the answer to a question like this is usually to go and read the documentation, but I went through the NLTK book and it doesn't give the answer. I'm fairly new to Python.
I have a bunch of .txt files and I want to be able to use the corpus functions that NLTK provides for its built-in corpora in nltk_data.
I've tried PlaintextCorpusReader but I couldn't get further than:
>>>import nltk
>>>from nltk.corpus import PlaintextCorpusReader
>>>corpus_root = './'
>>>newcorpus = PlaintextCorpusReader(corpus_root, '.*')
>>>newcorpus.words()
How do I segment the sentences in newcorpus using punkt? I tried the punkt functions, but they don't seem to accept a PlaintextCorpusReader object.
Can you also show me how I can write the segmented data out to text files?

After some years of figuring out how it works, here's an updated tutorial on how to create an NLTK corpus from a directory of textfiles.
The main idea is to make use of the nltk.corpus.reader package. If you have a directory of textfiles in English, it's best to use the PlaintextCorpusReader.
If you have a directory that looks like this:
newcorpus/
file1.txt
file2.txt
...
Simply use these lines of code and you can get a corpus:
import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
corpusdir = 'newcorpus/' # Directory of corpus.
newcorpus = PlaintextCorpusReader(corpusdir, '.*')
NOTE: the PlaintextCorpusReader will use the default nltk.tokenize.sent_tokenize() and nltk.tokenize.word_tokenize() to split your texts into sentences and words. These functions are built for English, so they may NOT work for all languages.
Here's the full code with creation of test textfiles and how to create a corpus with NLTK and how to access the corpus at different levels:
import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Let's create a corpus with 2 texts in different textfiles.
txt1 = """This is a foo bar sentence.\nAnd this is the first txtfile in the corpus."""
txt2 = """Are you a foo bar? Yes I am. Possibly, everyone is.\n"""
corpus = [txt1, txt2]

# Make a new dir for the corpus.
corpusdir = 'newcorpus/'
if not os.path.isdir(corpusdir):
    os.mkdir(corpusdir)

# Output the files into the directory.
filename = 0
for text in corpus:
    filename += 1
    with open(corpusdir + str(filename) + '.txt', 'w') as fout:
        fout.write(text)

# Check that our corpus does exist and the files are correct.
assert os.path.isdir(corpusdir)
for infile, text in zip(sorted(os.listdir(corpusdir)), corpus):
    with open(corpusdir + infile, 'r') as fin:
        assert fin.read().strip() == text.strip()

# Create a new corpus by specifying the parameters:
# (1) directory of the new corpus
# (2) the fileids of the corpus
# NOTE: in this case the fileids are simply the filenames.
newcorpus = PlaintextCorpusReader('newcorpus/', '.*')

# Access each file in the corpus.
for infile in sorted(newcorpus.fileids()):
    print(infile)                        # The fileid of each file.
    with newcorpus.open(infile) as fin:  # Opens the file.
        print(fin.read().strip())        # Prints the content of the file.
print()

# Access the plaintext; outputs a pure string.
print(newcorpus.raw().strip())
print()

# Access paragraphs in the corpus. (list of list of list of strings)
# NOTE: NLTK automatically calls nltk.tokenize.sent_tokenize and
#       nltk.tokenize.word_tokenize.
#
# Each element in the outermost list is a paragraph, and
# each paragraph contains sentence(s), and
# each sentence contains token(s).
print(newcorpus.paras())
print()

# To access paragraphs of a specific fileid.
print(newcorpus.paras(newcorpus.fileids()[0]))

# Access sentences in the corpus. (list of list of strings)
# NOTE: the texts are flattened into sentences that contain tokens.
print(newcorpus.sents())
print()

# To access sentences of a specific fileid.
print(newcorpus.sents(newcorpus.fileids()[0]))

# Access just the tokens/words in the corpus. (list of strings)
print(newcorpus.words())

# To access tokens of a specific fileid.
print(newcorpus.words(newcorpus.fileids()[0]))
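The original question also asked how to write the segmented data back out to textfiles. A minimal sketch, reusing newcorpus and corpusdir from above (the .sents output filenames are made up here):
# Write one sentence per line, one output file per corpus file.
for infile in sorted(newcorpus.fileids()):
    with open(corpusdir + infile + '.sents', 'w') as fout:
        for sent in newcorpus.sents(infile):
            fout.write(' '.join(sent) + '\n')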
Finally, to read a directory of texts and create an NLTK corpus in other languages, you must first ensure that you have python-callable word tokenization and sentence tokenization functions that take string input and produce output like this:
>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> txt1 = """This is a foo bar sentence.\nAnd this is the first txtfile in the corpus."""
>>> sent_tokenize(txt1)
['This is a foo bar sentence.', 'And this is the first txtfile in the corpus.']
>>> word_tokenize(sent_tokenize(txt1)[0])
['This', 'is', 'a', 'foo', 'bar', 'sentence', '.']
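If the default English tokenizers are not appropriate, you can pass your own tokenizer objects when constructing the reader. A minimal sketch, assuming the punkt models have been downloaded with nltk.download('punkt') and that a simple whitespace word tokenizer is good enough for your language (the German punkt model is used here purely as an example):
import nltk
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.tokenize import WhitespaceTokenizer

# Any object with a .tokenize(str) -> list method can fill either slot.
newcorpus = PlaintextCorpusReader(
    'newcorpus/', '.*',
    word_tokenizer=WhitespaceTokenizer(),
    sent_tokenizer=nltk.data.LazyLoader('tokenizers/punkt/german.pickle'))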

I think the PlaintextCorpusReader already segments the input with a punkt tokenizer, at least if your input language is English.
PlaintextCorpusReader's constructor:
def __init__(self, root, fileids,
             word_tokenizer=WordPunctTokenizer(),
             sent_tokenizer=nltk.data.LazyLoader(
                 'tokenizers/punkt/english.pickle'),
             para_block_reader=read_blankline_block,
             encoding='utf8'):
You can pass the reader a word and sentence tokenizer, but for the latter the default already is nltk.data.LazyLoader('tokenizers/punkt/english.pickle').
For a single string, a tokenizer would be used as follows (see the punkt tokenizer section of the NLTK documentation):
>>> import nltk.data
>>> text = """
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries. And sometimes sentences
... can start with non-capitalized words. i is a good variable
... name.
... """
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> tokenizer.tokenize(text.strip())

>>> import nltk
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = './'
>>> newcorpus = PlaintextCorpusReader(corpus_root, '.*')
"""
If the ./ dir contains the file my_corpus.txt, then you
can view, say, all the words in it by doing this:
"""
>>> newcorpus.words('my_corpus.txt')

from nltk.corpus.reader.plaintext import PlaintextCorpusReader

filecontent1 = "This is a cow"
filecontent2 = "This is a Dog"

# Write the two files into an existing directory.
corpusdir = 'nltk_data/'
with open(corpusdir + 'content1.txt', 'w') as text_file:
    text_file.write(filecontent1)
with open(corpusdir + 'content2.txt', 'w') as text_file:
    text_file.write(filecontent2)

# Build the corpus from the two explicit fileids.
text_corpus = PlaintextCorpusReader(corpusdir, ["content1.txt", "content2.txt"])

no_of_words_corpus1 = len(text_corpus.words("content1.txt"))
print(no_of_words_corpus1)
no_of_unique_words_corpus1 = len(set(text_corpus.words("content1.txt")))

no_of_words_corpus2 = len(text_corpus.words("content2.txt"))
no_of_unique_words_corpus2 = len(set(text_corpus.words("content2.txt")))
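As a small follow-up (my addition, not part of the original answer), the same reader can feed a frequency distribution over the whole corpus:
from nltk import FreqDist

# Token frequencies across both files.
fdist = FreqDist(text_corpus.words())
print(fdist.most_common(5))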

Related

How to get the probability of bigrams in a text of sentences?

I have a text which has many sentences. How can I use nltk.ngrams to process it?
This is my code:
import nltk
from nltk.util import ngrams

# raw is the input text.
sequence = nltk.tokenize.word_tokenize(raw)
bigram = ngrams(sequence, 2)
freq_dist = nltk.FreqDist(bigram)
prob_dist = nltk.MLEProbDist(freq_dist)
number_of_bigrams = freq_dist.N()
However, the above code assumes that all sentences form one sequence. But the sentences are separate, and I guess the last word of one sentence is unrelated to the first word of the next sentence. How can I create bigrams for such a text? I also need prob_dist and number_of_bigrams, which are based on freq_dist.
There are similar questions, like What are ngram counts and how to implement using nltk?, but they are mostly about a single sequence of words.
You can use the new nltk.lm module. Here's an example, first get some data and tokenize it:
import os
import io
import requests
from nltk import word_tokenize, sent_tokenize

# Text version of https://kilgarriff.co.uk/Publications/2005-K-lineer.pdf
if os.path.isfile('language-never-random.txt'):
    with io.open('language-never-random.txt', encoding='utf8') as fin:
        text = fin.read()
else:
    url = "https://gist.githubusercontent.com/alvations/53b01e4076573fea47c6057120bb017a/raw/b01ff96a5f76848450e648f35da6497ca9454e4a/language-never-random.txt"
    text = requests.get(url).content.decode('utf8')
    with io.open('language-never-random.txt', 'w', encoding='utf8') as fout:
        fout.write(text)

# Tokenize the text into a list of lower-cased, word-tokenized sentences.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(text)]
Then the language modelling:
# Preprocess the tokenized text for 3-gram language modelling.
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE

n = 3
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

model = MLE(n)  # Let's train a 3-gram maximum likelihood estimation model.
model.fit(train_data, padded_sents)
To get the counts:
model.counts['language'] # i.e. Count('language')
model.counts[['language']]['is'] # i.e. Count('is'|'language')
model.counts[['language', 'is']]['never'] # i.e. Count('never'|'language is')
To get the probabilities:
model.score('is', 'language'.split()) # P('is'|'language')
model.score('never', 'language is'.split()) # P('never'|'language is')
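The fitted model also supports text generation and perplexity. A short sketch (my addition; output will vary with the training data):
# Generate 10 tokens, continuing from a two-word seed context.
print(model.generate(10, text_seed=['language', 'is'], random_seed=42))

# Perplexity of a padded, tokenized sentence under the trained model.
from nltk.lm.preprocessing import padded_everygrams
test_sent = ['language', 'is', 'never', 'random', '.']
print(model.perplexity(padded_everygrams(n, test_sent)))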
There are some kinks on the Kaggle platform when loading the notebook, but this notebook should give a good overview of the nltk.lm module: https://www.kaggle.com/alvations/n-gram-language-model-with-nltk

Tokenization and lemmatization for TF-IDF use for a bunch of txt files using the NLTK library

I'm doing text analysis of Italian text (tokenization, lemmatization) for future use of TF-IDF techniques and for constructing clusters based on that. For preprocessing I use NLTK, and for a single text file everything works fine:
import string
import nltk
from nltk.stem.wordnet import WordNetLemmatizer

it_stop_words = nltk.corpus.stopwords.words('italian')
lmtzr = WordNetLemmatizer()

with open('3003.txt', 'r', encoding="latin-1") as myfile:
    data = myfile.read()

word_tokenized_list = nltk.tokenize.word_tokenize(data)
word_tokenized_no_punct = [str.lower(x) for x in word_tokenized_list if x not in string.punctuation]
word_tokenized_no_punct_no_sw = [x for x in word_tokenized_no_punct if x not in it_stop_words]
word_tokenized_no_punct_no_sw_no_apostrophe = [x.split("'") for x in word_tokenized_no_punct_no_sw]
word_tokenized_no_punct_no_sw_no_apostrophe = [y for x in word_tokenized_no_punct_no_sw_no_apostrophe for y in x]
word_tokenize_list_no_punct_lc_no_stowords_lemmatized = [lmtzr.lemmatize(x) for x in word_tokenized_no_punct_no_sw_no_apostrophe]
But the problem is that I need to perform the above on a bunch of .txt files in a folder. For that I'm trying to use PlaintextCorpusReader():
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

corpusdir = 'reports/'
# The second argument is a regular expression matched against the filenames.
newcorpus = PlaintextCorpusReader(corpusdir, r'.*\.txt')
Basically I cannot just apply newcorpus to the previous functions because it's an object and not a string. So my questions are:
How should the functions look (or how should I change the ones above, written for a single file) to do tokenization and lemmatization for a corpus of files (using PlaintextCorpusReader())?
How would the TF-IDF approach (the standard sklearn vectorizer = TfidfVectorizer()) look with PlaintextCorpusReader()?
Many thanks!
I think your question can be answered by reading: this question, this other one and the TfidfVectorizer docs. For completeness, I wrapped up the answers below:
First, you want to get the files ids, by the first question you can get them as follows:
ids = newcorpus.fileids()
Then, based on the second question, you can retrieve the documents' words, sentences or paragraphs:
doc_words = []
doc_sents = []
doc_paras = []

for id_ in ids:
    # Get words
    doc_words.append(newcorpus.words(id_))
    # Get sentences
    doc_sents.append(newcorpus.sents(id_))
    # Get paragraphs
    doc_paras.append(newcorpus.paras(id_))
Now, on the ith position of doc_words, doc_sents and doc_paras you have all words, sentences and paragraphs (respectively) for every document in the corpus.
For tf-idf you probably just want the words. Since TfidfVectorizer.fit expects an iterable that yields str, unicode or file objects, you need to either transform your documents (arrays of tokenized words) into a single string, or use a similar approach to this one. The latter solution uses a dummy tokenizer to deal directly with arrays of words.
You can also pass your own tokenizer to TfidVectorizer and use PlaintextCorpusReader simply for file reading.
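A minimal sketch of both routes, assuming the newcorpus reader from the question and scikit-learn installed (the variable names are mine):
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

ids = newcorpus.fileids()

# Route 1: pre-tokenized documents with a pass-through analyzer.
docs_tokens = [list(newcorpus.words(id_)) for id_ in ids]
vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)
X = vectorizer.fit_transform(docs_tokens)

# Route 2: use the reader only for raw text and tokenize inside sklearn.
docs_raw = [newcorpus.raw(id_) for id_ in ids]
vectorizer = TfidfVectorizer(tokenizer=nltk.word_tokenize, lowercase=True)
X = vectorizer.fit_transform(docs_raw)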

Create information content corpora to be used by WordNet from a custom dump

I am using the Brown corpus file ic-brown.dat for calculating the information content of a word using the WordNet NLTK library, but the results are not looking good. I was wondering how I can build my own custom.dat (information content file).
custom_ic = wordnet_ic.ic('custom.dat')
In (...)/nltk_data/corpora/wordnet_ic/ you will find IC-compute.sh, which contains some calls to Perl scripts that generate the IC .dat files from a given corpus. I found the instructions tricky, and I do not have the required Perl scripts, so I decided to create a Python script by analyzing the .dat file structure and the wordnet.ic() function.
You can compute your own IC counts by calling the wordnet.ic() function over a corpus reader object. In fact, you only need an object with a words() function that returns all the words in the corpus. For more details check the ic function (lines 1729 to 1789) in the file ..../nltk/corpus/reader/wordnet.py.
For example, for the XML version of the BNC corpus (2007):
reader_bnc = nltk.corpus.reader.BNCCorpusReader(root='../Corpus/2554/2554/download/Texts/', fileids=r'[A-K]/\w*/\w*\.xml')
bnc_ic = wn.ic(reader_bnc, False, 0.0)
To generate the .dat file I created the following functions:
def is_root(synset_x):
    if synset_x.root_hypernyms()[0] == synset_x:
        return True
    return False

def generate_ic_file(IC, output_filename):
    """Dump the IC counts into output_filename.
    The expected format of IC is a dict
    {'v': defaultdict, 'n': defaultdict, 'a': defaultdict, 'r': defaultdict}."""
    with codecs.open(output_filename, 'w', encoding='utf-8') as fid:
        # Hash code of WordNet 3.0
        fid.write("wnver::eOS9lXC6GvMWznF1wkZofDdtbBU" + "\n")
        # We only store nouns and verbs because those are the only POS tags
        # supported by the wordnet.ic() function.
        for tag_type in ['v', 'n']:  # IC:
            for key, value in IC[tag_type].items():
                if key != 0:
                    synset_x = wn.of2ss(of="{:08d}".format(key) + tag_type)
                    if is_root(synset_x):
                        fid.write(str(key) + tag_type + " " + str(value) + " ROOT\n")
                    else:
                        fid.write(str(key) + tag_type + " " + str(value) + "\n")
    print("Done")

generate_ic_file(bnc_ic, "../custom.dat")
Then, just call the function:
custom_ic = wordnet_ic.ic('../custom.dat')
The imports needed are:
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic
import codecs
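As a quick sanity check of the generated file (my own addition; the synset names are just examples), the custom IC counts can be plugged into the WordNet similarity measures:
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

custom_ic = wordnet_ic.ic('../custom.dat')

# Resnik similarity between two noun synsets under the custom IC counts.
dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')
print(dog.res_similarity(cat, custom_ic))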

Train customized corpus, with NLTK for Python

I'm trying to train a classifier on my own corpus. My documents are structured in the same way as the original movie_reviews corpus data, so 1K positive text files in folder 'pos' and 1K negative text files in folder 'neg'. Each textfile contains 25 lines of tweets, which are cleaned, as in: urls, usernames, capital letters and punctuation removed.
How can I adjust this code to use my own text data instead of the movie_reviews?
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from collections import defaultdict
import numpy as np

# define the split of % training / % test
SPLIT = 0.8

def word_feats(words):
    return dict([(word, True) for word in words])

posids = movie_reviews.fileids('pos')
negids = movie_reviews.fileids('neg')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

cutoff = int(len(posfeats) * SPLIT)

trainfeats = negfeats[:cutoff] + posfeats[:cutoff]
testfeats = negfeats[cutoff:] + posfeats[cutoff:]

print('Train on %d instances\nTest on %d instances' % (len(trainfeats), len(testfeats)))

classifier = NaiveBayesClassifier.train(trainfeats)
print('Accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
classifier.show_most_informative_features()
You can log in as a root user and open this file:
/usr/local/lib/python2.7/dist-packages/nltk/corpus/__init__.py
In this file you can find the already existing movie_reviews corpus, loaded using LazyCorpusLoader:
movie_reviews = LazyCorpusLoader(
    'movie_reviews', CategorizedPlaintextCorpusReader,
    r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*')
Then try adding something similar to this:
My_Movie = LazyCorpusLoader(
    'My_Movie', CategorizedPlaintextCorpusReader,
    r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*')
Where My_Movie is the name you have chosen for your movie reviews corpus.
Once everything is done, save and exit.
Finally, place your corpus in the nltk_data directory where you can find the movie_reviews corpus.
Try performing this:
from nltk.corpus import My_Movie # your newly created corpus
Hope this will work.
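Alternatively, instead of editing NLTK's installed sources, you can build the categorized reader directly in your own script. A minimal sketch (not part of the original answer), assuming a My_Movie/ directory with pos/ and neg/ subfolders of .txt files:
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Point the reader at your own directory; categories come from the subfolder names.
my_movie = CategorizedPlaintextCorpusReader(
    'My_Movie/',
    r'(?!\.).*\.txt',
    cat_pattern=r'(neg|pos)/.*')

posids = my_movie.fileids('pos')
negids = my_movie.fileids('neg')
print(len(posids), len(negids))
The rest of the question's code should then work by swapping movie_reviews for my_movie.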

How to extract the contextual words of a token in Python

Actually I want to extract the contextual words of a specific word. For this purpose I can use n-grams in Python, but the drawback is that they slide the window by one, while I only need the contextual words of a specific word. E.g. my file is like this:
IL-2
gene
expression
and
NF-kappa
B
activation
through
CD28
requires
reactive
oxygen
production
by
5-lipoxygenase
.
meaning there is one token on every line. Now I want to extract the surrounding words of each token, e.g. 'through' and 'requires' are the surrounding words of "CD28". I wrote some Python code but it did not work and generates the error ValueError: list.index(x): x not in list.
My code is:
import re
import nltk

file = open("C:/Python26/test.txt")
contents = file.read()
tokens = nltk.word_tokenize(contents)

f = open("trigram.txt", 'w')
for l in tokens:
    print(tokens[l], tokens[l + 1])
f.close()
First of all, list.index(x) returns the index in the list of the first item whose value is x.
>>> ["foo", "bar", "baz"].index('bar')
1
In your code, the variable 'word' is populated using a range of integers, not the actual contents, so we can't directly use 'word' in the list.index() function.
>>> print(lines.index(1))
ValueError: 1 is not in list
Change your code like this:
file = "C:/Python26/tokens.txt"
f = open("trigram.txt", 'w')

with open(file, 'r') as rf:
    lines = rf.readlines()

for word in range(1, len(lines) - 1):
    f.write(lines[word - 1].strip() + "\t" + lines[word].strip() + "\t" + lines[word + 1].strip())
f.close()
I don't really understand what you want to do, but I'll do my best.
If you want to process words with Python, there is a library called NLTK, which stands for Natural Language Toolkit.
You may need to tokenize a sentence or a document.
import nltk

def tokenize_query(query):
    return nltk.word_tokenize(query)

f = open('C:/Python26/tokens.txt')
raw = f.read()
tokenize_query(raw)
We can also read a file one line at a time using a for loop:
f = open('C:/Python26/tokens.txt', 'rU')
for line in f:
    print(line.strip())
r means 'read' and U means 'universal', if you are wondering.
strip() is just cutting '\n' from the text.
The context may be provided by wordnet and all its functions.
I guess you should use synsets with the word's pos (part of speech).
A synset is sort of a synonyms list in a semantic way.
NLTK can provide you with some other nice features like sentiment analysis and similarity between synsets.
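As a rough illustration of the synset idea (this snippet is mine, not part of the original answer; it assumes the WordNet data has been downloaded with nltk.download('wordnet')):
from nltk.corpus import wordnet as wn

# Each synset groups near-synonyms for one sense of the word.
for syn in wn.synsets('production', pos=wn.NOUN)[:3]:
    print(syn.name(), '-', syn.definition())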
file="C:/Python26/tokens.txt";
f=open("trigram.txt",'w');
with open(file,'r') as rf:
lines = rf.readlines();
for word in range(1,len(lines)-1):
f.write(lines[word-1].strip()+"\t"+lines[word].strip()+"\t"+lines[word+1].strip())
f.write("\n")
f.close()
This code also gives the same result:
import nltk
from nltk.util import ngrams
from nltk import word_tokenize

file = open("C:/Python26/tokens.txt")
contents = file.read()
tokens = nltk.word_tokenize(contents)

f_tri = open("trigram.txt", 'w')
trigram = ngrams(tokens, 3)
for t in trigram:
    f_tri.write(str(t) + "\n")
f_tri.close()
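To get back to the original question (the neighbours of one specific token such as "CD28"), you can filter the trigrams by their middle element instead of writing them all out. A small sketch reusing the tokens variable from above:
from nltk.util import ngrams

# The generator above is exhausted after the loop, so rebuild the trigrams
# and keep only those whose middle token is the word of interest.
target = 'CD28'
contexts = [(left, right) for left, mid, right in ngrams(tokens, 3) if mid == target]
print(contexts)  # e.g. [('through', 'requires')] for the sample file above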
