I am working on a project and I would like to use Latent Dirichlet Allocation to extract topics from a large number of articles.
My code is this:
import gensim
import csv
import json
import glob
from gensim import corpora, models
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from time import gmtime, strftime
tokenizer = RegexpTokenizer(r'\w+')
cachedStopWords = set(stopwords.words("english"))
body = []
processed = []
with open('/…/file.json') as j:
    data = json.load(j)

for i in range(0, len(data)):
    body.append(data[i]['text'].lower())

for entry in body:
    row = tokenizer.tokenize(entry)
    processed.append([word for word in row if word not in cachedStopWords])
dictionary = corpora.Dictionary(processed)
corpus = [dictionary.doc2bow(text) for text in processed]
lda = gensim.models.ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50, update_every=1, passes=1)
topics = lda.show_topics(num_topics=50, num_words=8)
other_doc = "After being jailed for life in 1964, Nelson Mandela became a worldwide symbol of resistance to apartheid. But his opposition to racism began many years before."
print lda[other_doc]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 714, in __getitem__
    gamma, _ = self.inference([bow])
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 361, in inference
    ids = [id for id, _ in doc]
ValueError: need more than 1 value to unpack
I also tried to use LdaMulticore in 3 different ways:
lda = gensim.models.LdaMulticore(corpus, id2word=dictionary, num_topics=100, workers=3)
lda = gensim.models.ldamodel.LdaMulticore(corpus, id2word=dictionary, num_topics=100, workers=3)
lda = models.LdaMulticore(corpus, id2word=dictionary, num_topics=100, workers=3)
And every time I got this error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'LdaMulticore'
Any ideas?
Thank you in advance.
You have to convert the new document into the bag-of-words vector space the model expects first.
http://radimrehurek.com/gensim/tut3.html#similarity-interface
vec_bow = dictionary.doc2bow(other_doc.lower().split())
vec_lda = lda[vec_bow] # convert the query to the LDA topic space
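For the query to share the model's vocabulary, it may help to preprocess it exactly like the training documents. A minimal sketch, reusing the tokenizer and stop-word set from the question together with the dictionary and lda objects already built:

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
cachedStopWords = set(stopwords.words("english"))

# Preprocess the query the same way as the training documents
tokens = [w for w in tokenizer.tokenize(other_doc.lower()) if w not in cachedStopWords]

vec_bow = dictionary.doc2bow(tokens)  # bag-of-words vector in the model's vocabulary
print(lda[vec_bow])                   # list of (topic_id, probability) pairs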
I realize this is old, but I just had this same problem. You are probably pointing to an older version of Gensim. You have to make sure you're using version >= 0.10.2.
Update with "easy_install -U gensim" and then make sure your IDE is seeing the updated library.
OK, so I'm trying to create a program that tells me how positive or negative each line of the paulryan.txt file is. I'm using the opinion_lexicon, and the file object is of type '_io.TextIOWrapper'.
Is there something I can use instead of .words?
Another, less important problem: any ideas how to make my WHOLE paulryan.txt file lowercase while keeping it tokenized by line? I'm thinking it won't give me an accurate positive or negative score if I don't make the whole thing lowercase, because the opinion_lexicon only contains lowercase words.
import nltk
from nltk.corpus import opinion_lexicon
from nltk.tokenize.simple import (LineTokenizer, line_tokenize)
poswords = set(opinion_lexicon.words("positive-words.txt"))
negwords = set(opinion_lexicon.words("negative-words.txt"))
f=open("paulryan.txt", "rU")
raw = f.read()
token= nltk.line_tokenize(raw)
print(token)
def finddemons():
    for x in token:
        y = token.words()
        percpos = len([w for w in token if w in poswords]) / len(y)
        percneg = len([w for w in token if w in negwords]) / len(y)
        print(x, "pos:", round(percpos, 3), "neg:", round(percneg, 3))

finddemons()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in finddemons
AttributeError: 'list' object has no attribute 'words'
I suggest you read the file line by line. Then, use word_tokenize:
for line in f:
    tokens = word_tokenize(line)
You are right about lowercasing the text before searching in the lexicon:
for line in f:
    tokens = word_tokenize(line.lower())
You could even try to lemmatize the tokens using WordNet, because the opinion lexicon is not that rich in vocabulary, especially if you work with tweets, where words often appear in different forms.
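Putting the suggestions together, a minimal sketch (the lemmatization step is optional; the file name and lexicon calls are the ones from the question):

import nltk
from nltk.corpus import opinion_lexicon
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

poswords = set(opinion_lexicon.words("positive-words.txt"))
negwords = set(opinion_lexicon.words("negative-words.txt"))
lemmatizer = WordNetLemmatizer()

with open("paulryan.txt") as f:
    for line in f:
        # lowercase, tokenize and (optionally) lemmatize each line
        tokens = [lemmatizer.lemmatize(w) for w in word_tokenize(line.lower())]
        if not tokens:
            continue
        percpos = len([w for w in tokens if w in poswords]) / len(tokens)
        percneg = len([w for w in tokens if w in negwords]) / len(tokens)
        print(line.strip(), "pos:", round(percpos, 3), "neg:", round(percneg, 3))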
I'm doing sentiment analysis using the Naive Bayes classifier of NLTK. I'm just inserting a CSV file that contains words and their labels as a training set, and not testing it yet. I'm finding the sentiment of each sentence and then the average of the sentiments of all sentences at the end. My file contains words in the format:
good,0.6
amazing,0.95
great,0.8
awesome,0.95
love,0.7
like,0.5
better,0.4
beautiful,0.6
bad,-0.6
worst,-0.9
hate,-0.8
sad,-0.4
disappointing,-0.6
angry,-0.7
happy,0.7
But the classifier doesn't get trained on the file, and the error shown below comes up. Here's my Python code:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from nltk.classify.api import ClassifierI
operators=set(('not','never','no'))
stop_words=set(stopwords.words("english"))-operators
text="this restaurant is good but i hate it ."
sent=0.0
x=0
text2=""
xyz=[]
dot=0
if "but" in text:
i=text.find("but")
text=text[:i]+"."+text[i+3:]
if "whereas" in text:
i=text.find("whereas")
text=text[:i]+"."+text[i+7:]
if "while" in text:
i=text.find("while")
text=text[:i]+"."+text[i+5:]
a=open('C:/Users/User/train_words.csv','r')
for w in text.split():
if w in stop_words:
continue
else:
text2=text2+" "+w
print (text2)
cl=nltk.NaiveBayesClassifier.train(a)
xyz=sent_tokenize(text2)
print(xyz)
for s in xyz:
x=x+1
print(s)
if "not" in s or "n't" in s:
print(float(cl.classify(s))*-1)
sent=sent+(float(cl.classify(s))*-1)
else:
print(cl.classify(s))
sent=sent+float(cl.classify(s))
print("sentiment of the overall document:",sent/x)
error:
runfile('C:/Users/User/Documents/untitled1.py', wdir='C:/Users/User/Documents')
restaurant good . hate .
Traceback (most recent call last):
  File "<ipython-input-8-d03fac6844c7>", line 1, in <module>
    runfile('C:/Users/User/Documents/untitled1.py', wdir='C:/Users/User/Documents')
  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
    execfile(filename, namespace)
  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/User/Documents/untitled1.py", line 37, in <module>
    cl = nltk.NaiveBayesClassifier.train(a)
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\classify\naivebayes.py", line 194, in train
    for featureset, label in labeled_featuresets:
ValueError: too many values to unpack (expected 2)
If I am not wrong, train() takes a list of tuples and you are providing a file object.
Instead of this
a = open('C:/Users/User/train_words.csv','r')
Try this
a = open('C:/Users/User/train_words.csv','r').read() # this is a string
a_list = a.split('\n')
a_list_of_tuple = [tuple(x.split(',')) for x in a_list]
and pass a_list_of_tuple variable to train()
Hope this will help :)
From the doc:
def train(cls, labeled_featuresets, estimator=ELEProbDist):
"""
:param labeled_featuresets: A list of classified featuresets,
i.e., a list of tuples ``(featureset, label)``.
"""
So you can write something similar:
feature_set = [tuple(line.strip().split(',')) for line in open('filename')]
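Note that in NLTK a featureset is a dict, so each CSV row still has to be turned into a ({feature: value}, label) pair before train() will accept it. A minimal sketch under that assumption, using the file path from the question:

import nltk

labeled_featuresets = []
with open('C:/Users/User/train_words.csv') as f:
    for line in f:
        word, label = line.strip().split(',')
        # featureset is a dict of feature name -> value; label is the class for it
        labeled_featuresets.append(({'word': word}, label))

cl = nltk.NaiveBayesClassifier.train(labeled_featuresets)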
I'm trying to do some sentiment analysis of a new movie from Twitter using the NLTK toolkit. I've followed the NLTK 'movie_reviews' example and I've built my own CategorizedPlaintextCorpusReader object. The problem arises when I call nltk.classify.util.accuracy(classifier, testfeats). Here is the code:
import os
import glob
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
def word_feats(words):
    return dict([(word, True) for word in words])
negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]
trainfeats = negfeats + posfeats
# Building a custom Corpus Reader
tweets = nltk.corpus.reader.CategorizedPlaintextCorpusReader('./tweets', r'.*\.txt', cat_pattern=r'(.*)\.txt')
tweetsids = tweets.fileids()
testfeats = [(word_feats(tweets.words(fileids=[f]))) for f in tweetsids]
print 'Training the classifier'
classifier = NaiveBayesClassifier.train(trainfeats)
for tweet in tweetsids:
    print tweet + ' : ' + classifier.classify(word_feats(tweets.words(tweetsids)))
classifier.show_most_informative_features()
print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
It all seems to work fine until it gets to the last line. That's when I get the error:
>>> nltk.classify.util.accuracy(classifier, testfeats)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/nltk/classify/util.py", line 87, in accuracy
    results = classifier.classify_many([fs for (fs,l) in gold])
ValueError: too many values to unpack
Does anybody see anything wrong within the code?
Thanks.
The error message
File "/usr/lib/python2.7/dist-packages/nltk/classify/util.py", line 87, in accuracy
results = classifier.classify_many([fs for (fs,l) in gold])
ValueError: too many values to unpack
arises because items in gold cannot be unpacked into a 2-tuple, (fs,l):
[fs for (fs,l) in gold] # <-- The ValueError is raised here
It is the same error you would get if gold equaled [(1,2,3)], since the 3-tuple (1,2,3) cannot be unpacked into a 2-tuple (fs,l):
In [74]: [fs for (fs,l) in [(1,2)]]
Out[74]: [1]
In [73]: [fs for (fs,l) in [(1,2,3)]]
ValueError: too many values to unpack
gold might be buried inside the implementation of nltk.classify.util.accuracy, but this hints that one of your inputs, classifier or testfeats, has the wrong "shape".
There is no problem with classifier, since calling accuracy(classifier, trainfeats) works:
In [61]: print 'accuracy:', nltk.classify.util.accuracy(classifier, trainfeats)
accuracy: 0.9675
The problem must be in testfeats.
Compare trainfeats with testfeats.
trainfeats[0] is a 2-tuple containing a dict and a classification:
In [63]: trainfeats[0]
Out[63]:
({u'!': True,
u'"': True,
u'&': True,
...
u'years': True,
u'you': True,
u'your': True},
'neg') # <--- Notice the classification, 'neg'
but testfeats[0] is just a dict, word_feats(tweets.words(fileids=[f])):
testfeats = [(word_feats(tweets.words(fileids=[f]))) for f in tweetsids]
So to fix this you would need to define testfeats to look more like trainfeats -- each dict returned by word_feats must be paired with a classification.
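A minimal sketch of what that could look like. The gold_labels mapping below is hypothetical: with cat_pattern=r'(.*)\.txt' the reader's categories are just the file names, so the true 'pos'/'neg' labels have to come from somewhere else (manual annotation, for example):

# Hypothetical gold labels keyed by tweet file id -- replace with your real annotations
gold_labels = {f: 'pos' for f in tweetsids}

testfeats = [(word_feats(tweets.words(fileids=[f])), gold_labels[f]) for f in tweetsids]

print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)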
I have a corpus with around 20,000 documents and I have to train that data set for topic modelling using LDA.
import logging, gensim
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
id2word = gensim.corpora.Dictionary('questions.dict')
mm = gensim.corpora.MmCorpus('questions.mm')
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=0, chunksize=3000, passes=20)
lda.print_topics(20)
Whenever I run this program I come across this error:
2013-04-28 09:57:09,750 : INFO : adding document #0 to Dictionary(0 unique tokens)
2013-04-28 09:57:09,759 : INFO : built Dictionary(11 unique tokens) from 14 documents (total 14 corpus positions)
2013-04-28 09:57:09,785 : INFO : loaded corpus index from questions.mm.index
2013-04-28 09:57:09,790 : INFO : initializing corpus reader from questions.mm
2013-04-28 09:57:09,796 : INFO : accepted corpus with 19188 documents, 15791 features, 106222 non-zero entries
2013-04-28 09:57:09,802 : INFO : using serial LDA version on this node
2013-04-28 09:57:09,808 : INFO : running batch LDA training, 100 topics, 20 passes over the supplied corpus of 19188 documents, updating model once every 19188 documents
2013-04-28 09:57:10,267 : INFO : PROGRESS: iteration 0, at document #3000/19188
Traceback (most recent call last):
  File "C:/Users/Animesh/Desktop/NLP/topicmodel/lda.py", line 10, in <module>
    lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=0, chunksize=3000, passes=20)
  File "C:\Python27\lib\site-packages\gensim-0.8.6-py2.7.egg\gensim\models\ldamodel.py", line 265, in __init__
    self.update(corpus)
  File "C:\Python27\lib\site-packages\gensim-0.8.6-py2.7.egg\gensim\models\ldamodel.py", line 445, in update
    self.do_estep(chunk, other)
  File "C:\Python27\lib\site-packages\gensim-0.8.6-py2.7.egg\gensim\models\ldamodel.py", line 365, in do_estep
    gamma, sstats = self.inference(chunk, collect_sstats=True)
  File "C:\Python27\lib\site-packages\gensim-0.8.6-py2.7.egg\gensim\models\ldamodel.py", line 318, in inference
    expElogbetad = self.expElogbeta[:, ids]
IndexError: index (11) out of range (0<=index<10) in dimension 1
I even tried to change the parameter values in the LdaModel call, but I always get the same error!
What should be done?
It seems that your dictionary (id2word) is not correctly matched up with your corpus object (mm).
For whatever reason, id2word (the mapping of word tokens to wordids) only contains 11 tokens
2013-04-28 09:57:09,759 : INFO : built Dictionary(11 unique tokens) from 14 documents (total 14 corpus positions)
Your corpus contains 15791 features, so when it looks for a feature with id > 10, it fails. ids in
expElogbetad = self.expElogbeta[:, ids]
is a list of all the word ids in a particular document.
I'd rerun the creation of the corpus and dictionary:
$ python -m gensim.scripts.make_wiki
(from the gensim LDA tutorial).
The logging output for the created dictionary should indicate far more than 11 tokens, I believe. I've run into a similar problem myself.
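One more thing worth checking beyond re-creating the files: if questions.dict was written with dictionary.save(), it has to be loaded with Dictionary.load() rather than passed to the Dictionary() constructor, which builds a new dictionary from whatever iterable it is given. A minimal sketch, assuming questions.dict and questions.mm were produced from the same preprocessed corpus:

import gensim

id2word = gensim.corpora.Dictionary.load('questions.dict')  # load, don't rebuild
mm = gensim.corpora.MmCorpus('questions.mm')

lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100,
                                      update_every=0, chunksize=3000, passes=20)
lda.print_topics(20)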
I'm looking to understand why using stemming and stop-word removal leads to worse results in my Naive Bayes classifier.
I have two files, positive and negative reviews, both of which have around 200 lines but many words, possibly 5000 words per line.
I have the following code that creates a bag of words; then I create two feature sets for training and testing and run them against the NLTK classifier:
word_features = list(all_words.keys())[:15000]
testing_set = featuresets[10000:]
training_set = featuresets[:10000]
nbclassifier = nltk.NaiveBayesClassifier.train(training_set)
print((nltk.classify.accuracy(nbclassifier, testing_set))*100)
nbclassifier.show_most_informative_features(30)
This produces around 45000 words and has an accuracy of 85%.
I've looked at adding stemming (PorterStemmer) and removing stop words in my training data, but when I run the classifier again I now get 205 words and 0% accuracy. While testing other classifiers, the script generates errors:
Traceback (most recent call last):
  File "foo.py", line 108, in <module>
    print((nltk.classify.accuracy(MNB_classifier, testing_set))*100)
  File "/Library/Python/2.7/site-packages/nltk/classify/util.py", line 87, in accuracy
    results = classifier.classify_many([fs for (fs, l) in gold])
  File "/Library/Python/2.7/site-packages/nltk/classify/scikitlearn.py", line 83, in classify_many
    X = self._vectorizer.transform(featuresets)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 293, in transform
    return self._transform(X, fitting=False)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 184, in _transform
    raise ValueError("Sample sequence X is empty.")
ValueError: Sample sequence X is empty.
I don't understand why adding stemming and/or removing stop words breaks the classifier.
Adding stemming or removing stop words should not cause this issue by itself. I think you have a problem further up in your code, due to how you read the file. When I was following sentdex's tutorial on YouTube, I came across this same error. I was stuck for the past hour, but I finally got it. If you follow his code, you get this:
short_pos = open("short_reviews/positive.txt", "r").read()
short_neg = open("short_reviews/negative.txt", "r").read()
documents = []
for r in short_pos.split('\n'):
documents.append( (r, 'pos' ))
for r in short_neg.split('\n'):
documents.append( (r, 'neg' ))
all_words = []
short_pos_words = word_tokenize(short_pos)
short_neg_words = word_tokenize(short_neg)
for w in short_pos_words:
all_words.append(w.lower())
for w in short_neg_words:
all_words.append(w.lower())
all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:5000]
I kept running into this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 6056: invalid start byte.
You get this error because there are non-UTF-8 characters in the files provided. I was able to get around the error by changing the code to this:
fname = 'short_reviews/positive.txt'
with open(fname, 'r', encoding='utf-16') as f:
    for line in f:
        pos_lines.append(line)
Unfortunately, then I started getting this error:
UnicodeError: UTF-16 stream does not start with BOM
I forget how, but I made this error go away too. Then I started getting the same error as your original question:
ValueError: Sample sequence X is empty.
When I printed the length of featuresets, I saw it was only 2.
print("Feature sets list length : ", len(featuresets))
After digging on this site, I found these two questions:
Delete every non utf-8 symbols froms string
'str' object has no attribute 'decode' in Python3
The first one didn't really help, but the second one solved my problem (Note: I'm using python-3).
I'm not one for one-liners, but this worked for me:
pos_lines = [line.rstrip('\n') for line in open('short_reviews/positive.txt', 'r', encoding='ISO-8859-1')]
I will update my github repo later this week with the full code for the nlp tutorial if you'd like to see the complete solution. I realize this answer probably comes 2 years too late, but hopefully it helps.
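For reference, a minimal sketch of the reading step with an explicit encoding (the file names follow sentdex's tutorial, and whether ISO-8859-1 is really the right encoding for your copies of the files is an assumption):

documents = []

# ISO-8859-1 maps every byte to a character, so decoding never raises;
# it may still garble characters if the files actually use a different encoding.
with open('short_reviews/positive.txt', 'r', encoding='ISO-8859-1') as f:
    for line in f:
        documents.append((line.rstrip('\n'), 'pos'))

with open('short_reviews/negative.txt', 'r', encoding='ISO-8859-1') as f:
    for line in f:
        documents.append((line.rstrip('\n'), 'neg'))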