How can I read the english.pickle file from the nltk module? - python

I am trying to figure out why I can't read the contents of the english.pickle file downloaded from the nltk module.
I first downloaded the nltk data using this code:
import nltk
nltk.download('punkt')
I then looked inside the punkt folder in my home directory and found the english.pickle file. I used the following code to read the file in Python:
import pickle
with open('english.pickle', 'rb') as file:
    x = pickle.load(file)
It all seemed fine; however, when I inspect the variable x (which should be storing the unpickled data), I am unable to retrieve the data from it as I would from any other pickled file.
Instead I am only getting the object name and the id:
<nltk.tokenize.punkt.PunktParameters at 0x7f86cf6c0cd0>
The problem is that I need to access the content of the file, and I can't iterate through it because the object is not iterable.
Has anyone encountered the same problem?

You have downloaded the punkt tokenizer, for which the documentation says:
This tokenizer divides a text into a list of sentences by using an
unsupervised algorithm to build a model for abbreviation words,
collocations, and words that start sentences. It must be trained on a
large collection of plaintext in the target language before it can be
used.
After this:
with open('english.pickle', 'rb') as file:
    x = pickle.load(file)
You should have an nltk.tokenize.punkt.PunktSentenceTokenizer object. You can call methods on that object to perform tokenization. E.g.:
>>> x.tokenize('This is a test. I like apples. The cow is blue.')
['This is a test.', 'I like apples.', 'The cow is blue.']
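As an aside, you don't have to open the pickle by hand at all. Once the punkt data has been downloaded, NLTK's data loader can fetch the same tokenizer by its resource path; a minimal sketch, assuming the data went to the default location on nltk's data path:

import nltk
nltk.download('punkt')  # downloads the punkt data if not already present

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
print(tokenizer.tokenize('This is a test. I like apples.'))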

Related

How to handle big textual data to create WordCloud?

I have a huge amount of textual data from which I need to create a word cloud. I am using a Python library named word_cloud to create the word cloud, and it is quite configurable. The problem is that my textual data is really huge, so even a high-end computer is not able to complete the task, even after running for hours.
The data is initially stored in MongoDB. Due to cursor issues while reading the data into a Python list, I have exported the whole dataset to a plain text file - simply a txt file that is 304 MB.
So the question I am looking to answer is: how can I handle this huge textual data? The word_cloud library needs a string parameter that contains the whole data, separated by ' ', in order to create the word cloud.
p.s. Python version: 3.7.1
p.s. word_cloud is an open source Word Cloud generator for Python which is available on GitHub: https://github.com/amueller/word_cloud
You don't need to load the whole file into memory.
from wordcloud import WordCloud
from collections import Counter

wc = WordCloud()
counts_all = Counter()
with open('path/to/file.txt', 'r') as f:
    for line in f:  # here you could also iterate over the MongoDB cursor instead
        counts_line = wc.process_text(line)
        counts_all.update(counts_line)

wc.generate_from_frequencies(counts_all)
wc.to_file('/tmp/wc.png')
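If you would rather skip the intermediate text file, the same incremental counting can be driven straight from the MongoDB cursor. A rough sketch with pymongo, where the connection string, database, collection, and 'text' field names are placeholders for whatever your data actually uses:

from collections import Counter
from pymongo import MongoClient
from wordcloud import WordCloud

wc = WordCloud()
counts_all = Counter()
client = MongoClient('mongodb://localhost:27017')  # hypothetical connection string
cursor = client['mydb']['mycollection'].find({}, {'text': 1})  # hypothetical db/collection/field names
for doc in cursor:
    counts_all.update(wc.process_text(doc['text']))

wc.generate_from_frequencies(counts_all)
wc.to_file('/tmp/wc.png')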

After training my own classifier with nltk, how do I load it in textblob?

The built-in classifier in textblob is pretty dumb. It's trained on movie reviews, so I created a huge set of examples in my context (57,000 stories, categorized as positive or negative) and then trained it using nltk. I tried using textblob to train it but it always failed:
with open('train.json', 'r') as fp:
    cl = NaiveBayesClassifier(fp, format="json")
That would run for hours and end in a memory error.
I looked at the source and found it was just using nltk and wrapping that, so I used that instead, and it worked.
The structure nltk needs for the training set is a list of tuples, where the first element of each tuple is a Counter of the words in the text and their frequency of appearance, and the second element is 'pos' or 'neg' for sentiment.
>>> train_set = [(Counter(i["text"].split()),i["label"]) for i in data[200:]]
>>> test_set = [(Counter(i["text"].split()),i["label"]) for i in data[:200]] # withholding 200 examples for testing later
>>> cl = nltk.NaiveBayesClassifier.train(train_set) # <-- this is the same thing textblob was using
>>> print("Classifier accuracy percent:",(nltk.classify.accuracy(cl, test_set))*100)
('Classifier accuracy percent:', 66.5)
>>> cl.show_most_informative_features(75)
Then I pickled it.
with open('storybayes.pickle', 'wb') as f:
    pickle.dump(cl, f)
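Reloading it later is just the mirror image (standard pickle usage):

import pickle

with open('storybayes.pickle', 'rb') as f:
    cl = pickle.load(f)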
Now... I took this pickled file and reopened it to get the nltk classifier (<class 'nltk.classify.naivebayes.NaiveBayesClassifier'>) -- and tried to feed it into textblob. Instead of
from textblob.classifiers import NaiveBayesClassifier
blob = TextBlob("I love this library", analyzer=NaiveBayesAnalyzer())
I tried:
blob = TextBlob("I love this library", analyzer=myclassifier)
Traceback (most recent call last):
  File "<pyshell#116>", line 1, in <module>
    blob = TextBlob("I love this library", analyzer=cl4)
  File "C:\python\lib\site-packages\textblob\blob.py", line 369, in __init__
    parser, classifier)
  File "C:\python\lib\site-packages\textblob\blob.py", line 323, in _initialize_models
    BaseSentimentAnalyzer, BaseBlob.analyzer)
  File "C:\python\lib\site-packages\textblob\blob.py", line 305, in _validated_param
    .format(name=name, cls=base_class_name))
ValueError: analyzer must be an instance of BaseSentimentAnalyzer
What now? I looked at the source, and both are classes, but they are not exactly the same.
I wasn't able to establish for certain that an nltk corpus cannot work with textblob, and that would surprise me, since textblob imports all of the nltk functions in its source code and is basically a wrapper.
But what I did conclude after many hours of testing is that nltk offers a better built-in sentiment corpus called "vader" that outperformed all of my trained models.
import nltk
nltk.download('vader_lexicon') # do this once: grab the trained model from the web
from nltk.sentiment.vader import SentimentIntensityAnalyzer
Analyzer = SentimentIntensityAnalyzer()
Analyzer.polarity_scores("I find your lack of faith disturbing.")
{'neg': 0.491, 'neu': 0.263, 'pos': 0.246, 'compound': -0.4215}
CONCLUSION: NEGATIVE
The vader_lexicon and the nltk code do a lot more parsing of negation language in sentences in order to negate positive words. For example, when Darth Vader says "lack of faith", that flips the sentiment to its opposite.
I explained it here, with examples of the better results:
https://chewychunks.wordpress.com/2018/06/19/sentiment-analysis-discovering-the-best-way-to-sort-positive-and-negative-feedback/
That replaces this textblob implementation:
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
TextBlob("I find your lack of faith disturbing.", analyzer=NaiveBayesAnalyzer())
{'neg': 0.182, 'pos': 0.817, 'combined': 0.635}
CONCLUSION: POSITIVE
The vader nltk classifier also has additional documentation here on using it for sentiment analysis: http://www.nltk.org/howto/sentiment.html
TextBlob always crashed my computer with as few as 5,000 examples.
Going over the error message, it seems like the analyzer must inherit from the abstract class BaseSentimentAnalyzer. As mentioned in the docs here, this class must implement the analyze(text) function. However, while checking the docs of NLTK's implementation, I could not find this method in its main documentation here or in its parent class ClassifierI here. Hence, I believe these two implementations cannot be combined, unless you implement a new analyze function in NLTK's implementation to make it compatible with TextBlob's.
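If you do want to bridge the two, one conceivable route is a thin adapter that subclasses textblob's BaseSentimentAnalyzer and delegates to the pickled nltk classifier. This is only a sketch under the assumption that the classifier was trained on Counter-of-words features as above; the NLTKAnalyzer name is made up, and it returns the raw 'pos'/'neg' label rather than textblob's usual sentiment namedtuple:

from collections import Counter
from textblob import TextBlob
from textblob.sentiments import BaseSentimentAnalyzer

class NLTKAnalyzer(BaseSentimentAnalyzer):  # hypothetical adapter, not part of textblob or nltk
    def __init__(self, nltk_classifier):
        super().__init__()
        self.nltk_classifier = nltk_classifier

    def analyze(self, text):
        # featurize the text the same way the classifier was trained above
        return self.nltk_classifier.classify(Counter(text.split()))

blob = TextBlob("I love this library", analyzer=NLTKAnalyzer(cl))
print(blob.sentiment)  # the analyzer now passes the BaseSentimentAnalyzer isinstance check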
Another more forward-looking solution is to use spaCy to build the model instead of textblob or nltk. This is new to me, but seems a lot easier to use and more powerful:
https://spacy.io/usage/spacy-101#section-lightning-tour
"spaCy is the Ruby of Rails of natural language processing."
import spacy
import random

nlp = spacy.load('en')  # loads the trained starter model here
train_data = [("Uber blew through $1 million", {'entities': [(0, 4, 'ORG')]})]  # better model stuff

with nlp.disable_pipes(*[pipe for pipe in nlp.pipe_names if pipe != 'ner']):
    optimizer = nlp.begin_training()
    for i in range(10):
        random.shuffle(train_data)
        for text, annotations in train_data:
            nlp.update([text], [annotations], sgd=optimizer)
nlp.to_disk('/model')
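Once saved, the trained pipeline should load back like any other spaCy model directory (a short usage sketch, reusing the '/model' path from above):

nlp2 = spacy.load('/model')
doc = nlp2("Uber blew through $1 million")
print([(ent.text, ent.label_) for ent in doc.ents])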

Word2Vec vocabulary not defined error

I am new to Python and word2vec and keep getting a "you must first build vocabulary before training the model" error. What is wrong with my code?
Here is my code:
file_object=open("SupremeCourt.txt","w")
from gensim.models import word2vec
data = word2vec.Text8Corpus('SupremeCourt.txt')
model = word2vec.Word2Vec(data, size=200)
out=model.most_similar()
print(out[1])
print(out[2])
I can see a couple of things wrong in your code: the file is opened in write mode, and most_similar() is called without the word for which you want the most similar words.
I would suggest loading a pretrained model such as the Google News vectors into gensim, or building your own word2vec model, so that you won't get the error.
The usage of most_similar in gensim is out = model.most_similar("word-name")
file_object=open("SupremeCourt.txt","r")
from gensim.models import word2vec
data = word2vec.Text8Corpus('SupremeCourt.txt')
model = word2vec.Word2Vec(data, size=200)#use google news vectors here
out=model.most_similar("word")
print(out)
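If you go the pretrained route instead, the usual pattern in gensim is to load the vectors as KeyedVectors. A sketch assuming you have separately downloaded the GoogleNews-vectors-negative300.bin file (the path is a placeholder):

from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
print(kv.most_similar("court"))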
You're opening that file in write mode with this line:
file_object = open("SupremeCourt.txt", "w")
By doing this, you're erasing the contents of your file, so that when you try to pass the file to the model for training, there is no data to read. That's why that error is thrown.
Remove that line (and also restore your file contents), and it'll work.

loading files to categorized plain text corpus

I am using Ubuntu, and as part of my assignment I'm doing text sentiment analysis. I am making a training set to classify text using a NaiveBayes classifier. I have many files containing sentences, saved as sent1.txt, sent2.txt, ..., and a file called label.txt.
label.txt contains
sent1.txt:pos
sent2.txt:pos
...
sent15.txt:neg
sent16.txt:neg
All the sent files and the label.txt file are stored in /home/abha. I tried this:
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
reader = CategorizedPlaintextCorpusReader('.', r'.*\.txt', cat_file='cats/cats.txt')
Please tell me: what should my third argument be?
I'm having such silly issues with where to store the label.txt file and the sent files.
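For reference, one common way to wire this up is to parse label.txt yourself and hand the reader a cat_map dictionary instead of a cat_file. This is a sketch only, assuming the files live in /home/abha and that label.txt contains the fileid:category lines shown above:

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# build a {fileid: [category]} mapping from label.txt
cat_map = {}
with open('/home/abha/label.txt') as labels:
    for line in labels:
        fileid, _, category = line.strip().partition(':')
        if fileid:
            cat_map[fileid] = [category]

reader = CategorizedPlaintextCorpusReader('/home/abha', r'sent.*\.txt', cat_map=cat_map)
print(reader.fileids(categories='pos'))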

How to build an IMS open source corpus workbench and NLTK readable corpus?

Currently I have a bunch of .txt files. Within each .txt file, each sentence is separated by a newline. How do I change it to the IMS CWB format so that it's readable by CWB? And also to the nltk format?
Can someone point me to a how-to page for that, or is there a guide page? I've tried reading through the manual but I don't really understand it: www.cwb.sourceforge.net/files/CWB_Encoding_Tutorial.pdf
Does it mean I create a data and registry directory, and then I run the cwb-encode command and it will all be converted to a vrt file? Does it convert one file at a time? How do I script it to run through multiple files in a directory?
It's easy to produce cwb's "verticalized" format from an NLTK-readable corpus:
from nltk.corpus import brown

out = open('corpus.vrt', 'w')
for sentence in brown.sents():
    print('<s>', file=out)
    for word in sentence:
        print(word, file=out)
    print('</s>', file=out)
out.close()
From there, you can follow the instructions on the CWB website.
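For the original .txt files (one sentence per line), the same verticalization can be scripted over a whole directory before running cwb-encode. A rough sketch that simply splits tokens on whitespace, which is naive but shows the shape of the output; the input directory is a placeholder:

import glob

with open('corpus.vrt', 'w') as out:
    for path in sorted(glob.glob('/path/to/txtfiles/*.txt')):
        with open(path) as f:
            for line in f:
                words = line.split()  # naive whitespace tokenization
                if not words:
                    continue
                out.write('<s>\n')
                for word in words:
                    out.write(word + '\n')
                out.write('</s>\n')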
