UnpicklingError was unhandled by user code : invalid load key,'%' - python

I have referred the website https://radimrehurek.com/gensim/tut2.html. I have come across the error UnpicklingError was unhandled by user code : invalid load key,'%'. How do I clear that error? I had referred the other queries and included the klepto package but still that error persists. I am using anacoanda2. This is the code:-
import logging
import xml.etree.cElementTree
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
level=logging.INFO)
import os
import klepto
from gensim import corpora
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
for document in documents]
# remove words that appear only once
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
for token in text:
frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1] for text in texts]
from pprint import pprint # pretty-printer
pprint(texts)
dictionary = corpora.Dictionary(texts)
dictionary.save_as_text('/tmp/deerwester.dict') # store the dictionary, for future reference
print(dictionary)
print(dictionary.token2id)
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec) # the word "interaction" does not appear in the dictionary and is ignored
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('/tmp/deerwester.dict', corpus) # store to disk, for later use
for c in corpus:
print(c)
class MyCorpus(object):
def __iter__(self):
for line in open('/datasets/mycorpus.txt'):
# assume there's one document per line, tokens separated by whitespace
yield dictionary.doc2bow(line.lower().split())
corpus_memory_friendly = MyCorpus() # doesn't load the corpus into memory!
print(corpus_memory_friendly)
for vector in corpus_memory_friendly: # load one vector into memory at a time
print(vector)
from six import iteritems
# collect statistics about all tokens
dictionary = corpora.Dictionary(line.lower().split() for line in open('/datasets/mycorpus.txt'))
# remove stop words and words that appear only once
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist
if stopword in dictionary.token2id]
once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]
# remove stop words and words that appear only once
dictionary.filter_tokens(stop_ids + once_ids)
# remove gaps in id sequence after words that were removed
dictionary.compactify()
print(dictionary)
# create a toy corpus of 2 documents, as a plain Python list
corpus = [[(1, 0.5)], []] # make one document empty, for the heck of it
corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)
corpora.SvmLightCorpus.serialize('/tmp/corpus.svmlight', corpus)
corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)
corpora.LowCorpus.serialize('/tmp/corpus.low', corpus)
corpus = corpora.MmCorpus('/tmp/corpus.mm')
print(corpus)
# one way of printing a corpus: load it entirely into memory
print(list(corpus)) # calling list() will convert any sequence to a plain Python list
# another way of doing it: print one document at a time, making use of the streaming interface
for doc in corpus:
print(doc)
corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)
import gensim
import numpy as np
numpy_matrix = np.random.randint(10, size=[5,2])
corpus = gensim.matutils.Dense2Corpus(numpy_matrix)
numpy_matrix_dense = gensim.matutils.corpus2dense(corpus, num_terms=10)
import scipy.sparse
scipy_sparse_matrix = scipy.sparse.random(5,2)
corpus = gensim.matutils.Sparse2Corpus(scipy_sparse_matrix)
scipy_csc_matrix = gensim.matutils.corpus2csc(corpus)
from gensim import corpora, models, similarities
if (os.path.exists("/tmp/deerwester.dict")):
dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')
corpus = corpora.MmCorpus('/tmp/deerwester.mm')
print("Used files generated from first tutorial")
else:
print("Please run first tutorial to generate data set")
tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model
doc_bow = [(0, 1), (1, 1)]
print(tfidf[doc_bow]) # step 2 -- use the model to transform vectors
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
print(doc)
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2) # initialize an LSI transformation
corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi
lsi.print_topics(2)
for doc in corpus_lsi: # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
print(doc)
lsi.save('/tmp/model.lsi') # same for tfidf, lda, ...
lsi = models.LsiModel.load('/tmp/model.lsi')
model = models.TfidfModel(corpus, normalize=True)
model = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300)
model.add_documents(another_tfidf_corpus) # now LSI has been trained on tfidf_corpus + another_tfidf_corpus
lsi_vec = model[tfidf_vec] # convert some new document into the LSI space, without affecting the model
model.add_documents(more_documents) # tfidf_corpus + another_tfidf_corpus + more_documents
lsi_vec = model[tfidf_vec]
model = models.RpModel(tfidf_corpus, num_topics=500)
model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)
model = models.HdpModel(corpus, id2word=dictionary)

Related

How to get the probability of bigrams in a text of sentences?

I have a text which has many sentences. How can I use nltk.ngrams to process it?
This is my code:
sequence = nltk.tokenize.word_tokenize(raw)
bigram = ngrams(sequence,2)
freq_dist = nltk.FreqDist(bigram)
prob_dist = nltk.MLEProbDist(freq_dist)
number_of_bigrams = freq_dist.N()
However, the above code supposes that all sentences are one sequence. But, sentences are separated, and I guess the last word of one sentence is unrelated to the start word of another sentence. How can I create a bigram for such a text? I need also prob_dist and number_of_bigrams which are based on the `freq_dist.
There are similar questions like this What are ngram counts and how to implement using nltk? but they are mostly about a sequence of words.
You can use the new nltk.lm module. Here's an example, first get some data and tokenize it:
import os
import requests
import io #codecs
from nltk import word_tokenize, sent_tokenize
# Text version of https://kilgarriff.co.uk/Publications/2005-K-lineer.pdf
if os.path.isfile('language-never-random.txt'):
with io.open('language-never-random.txt', encoding='utf8') as fin:
text = fin.read()
else:
url = "https://gist.githubusercontent.com/alvations/53b01e4076573fea47c6057120bb017a/raw/b01ff96a5f76848450e648f35da6497ca9454e4a/language-never-random.txt"
text = requests.get(url).content.decode('utf8')
with io.open('language-never-random.txt', 'w', encoding='utf8') as fout:
fout.write(text)
# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
for sent in sent_tokenize(text)]
Then the language modelling:
# Preprocess the tokenized text for 3-grams language modelling
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE
n = 3
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)
model = MLE(n) # Lets train a 3-grams maximum likelihood estimation model.
model.fit(train_data, padded_sents)
To get the counts:
model.counts['language'] # i.e. Count('language')
model.counts[['language']]['is'] # i.e. Count('is'|'language')
model.counts[['language', 'is']]['never'] # i.e. Count('never'|'language is')
To get the probabilities:
model.score('is', 'language'.split()) # P('is'|'language')
model.score('never', 'language is'.split()) # P('never'|'language is')
There's some kinks on the Kaggle platform when loading the notebook but at some point this notebook should give a good overview of the nltk.lm module https://www.kaggle.com/alvations/n-gram-language-model-with-nltk

cosine-similarity between consecutive pairs using whole articles in JSON file

I would like to calculate the cosine similarity for the consecutive pairs of articles in a JSON file. So far I manage to do it but.... I just realize that when transforming the tfidf of each article I am not using the terms from all articles available in the file but only those from each pair. Here is the code that I am using which provides the cosine-similarity coefficient of each consecutive pair of articles.
import json
import nltk
with open('SDM_2015.json') as f:
data = [json.loads(line) for line in f]
## Loading the packages needed:
import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer
## Defining our functions to filter the data
# Short for stemming each word (common root)
stemmer = nltk.stem.porter.PorterStemmer()
# Short for removing puctuations etc
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
## First function that creates the tokens
def stem_tokens(tokens):
return [stemmer.stem(item) for item in tokens]
## Function that incorporating the first function, converts all words into lower letters and removes puctuations maps (previously specified)
def normalize(text):
return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))
## Lastly, a super function is created that contains all the previous ones plus stopwords removal
vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')
## Calculation one by one of the cosine similatrity
def foo(x, y):
tfidf = vectorizer.fit_transform([x, y])
return ((tfidf * tfidf.T).A)[0,1]
my_funcs = {}
for i in range(len(data) - 1):
x = data[i]['body']
y = data[i+1]['body']
foo.func_name = "cosine_sim%d" % i
my_funcs["cosine_sim%d" % i] = foo
print(foo(x,y))
Any idea of how to develop the cosine-similarity using the whole terms of all articles available in the JSON file rather than only those of each pair?
Kind regards,
Andres
I think, based on our discussion above, you need to change the foo function and everything below. See the code below. Note that I haven't actually run this, since I don't have your data and no sample lines are provided.
## Loading the packages needed:
import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine
import json
from sklearn.metrics.pairwise import cosine_similarity
with open('SDM_2015.json') as f:
data = [json.loads(line) for line in f]
## Defining our functions to filter the data
# Short for stemming each word (common root)
stemmer = nltk.stem.porter.PorterStemmer()
# Short for removing puctuations etc
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
## First function that creates the tokens
def stem_tokens(tokens):
return [stemmer.stem(item) for item in tokens]
## Function that incorporating the first function, converts all words into lower letters and removes puctuations maps (previously specified)
def normalize(text):
return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))
## tfidf
vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')
tfidf_data = vectorizer.fit_transform(data)
#cosine dists
similarity matrix = cosine_similarity(tfidf_data)

Train customized corpus, with NLTK for Python

I try to train a corpus with my own documents. My documents are structured in the same way as the original movie_reviews corpus data, so 1K positive text files in folder 'pos' and 1K negative text files in folder 'neg'. Each textfile contains 25 lines of tweets, which are cleaned, as in: urls, usernames, capital letters, punctuation removed.
How can I adjust this code to use my own text data instead of the movie_reviews?
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from collections import defaultdict
import numpy as np
# define the split of % training / % test
SPLIT = 0.8
def word_feats(words):
return dict([(word, True) for word in words])
posids = movie_reviews.fileids('pos')
negids = movie_reviews.fileids('neg')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]
cutoff = int(len(posfeats) * SPLIT)
trainfeats = negfeats[:cutoff] + posfeats[:cutoff]
testfeats = negfeats[cutoff:] + posfeats[cutoff:]
print 'Train on %d instances\nTest on %d instances' % (len(trainfeats),len(testfeats))
classifier = NaiveBayesClassifier.train(trainfeats)
print 'Accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
classifier.show_most_informative_features()
You can login as a root user and change you directory path to this:
/usr/local/lib/python2.7/dist-packages/nltk/corpus/__init__.py
In this document you can find already existing movie_reviews corpora loaded using LazyCorpusLoader:
movie_reviews = LazyCorpusLoader(
'movie_reviews', CategorizedPlaintextCorpusReader,
r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*')
Then try adding some thing similar to this:
My_Movie = LazyCorpusLoader(
'My_Movie', CategorizedPlaintextCorpusReader,
r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*')
Where My_Movie is the name which you have created for your movie reviews.
Once Everything is done save and exit.
Finally place you corpus in nltk directory where you can find the movie_review corpus.
Try performing this:
from nltk.corpus import My_Movie # Newly created you own corpus
Hope this will work.

python gensim: indices array has non-integer dtype (float64)

I am using this gensim tutorial to find similarities between texts. Here is the code
from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
'''
documents = ["Human machine interface for lab abc computer applications",
"bags loose tea water second ingredient tastes water",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey",
"red cow butter oil"]
'''
documents = ["Human machine interface for lab abc computer applications",
"bags loose tea water second ingredient tastes water"]
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
for document in documents]
# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once]
for text in texts]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
#print corpus
tfidf = models.TfidfModel(corpus)
#print tfidf
corpus_tfidf = tfidf[corpus]
#print corpus_tfidf
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
lsi.print_topics(1)
lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2)
lda.print_topics(1)
corpora.MmCorpus.serialize('dict.mm', corpus)
corpus = corpora.MmCorpus('dict.mm')
#print corpus
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
doc = "human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]
#print vec_lsi
index = similarities.MatrixSimilarity(lsi[corpus])
index.save('dict.index')
index = similarities.MatrixSimilarity.load('dict.index')
sims = index[vec_lsi]
#print list(enumerate(sims))
sims = sorted(enumerate(sims),key=lambda item: -item[1])
for sim in sims:
print documents[sim[0]], " ==> ", sim[1]
There are two documents here. One has 10 texts and another has 2. One is commented out. If I use the first documents list everything goes fine and generates meaningful output. If I use the second document list(having 2 texts) an error occured. Here is it
/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:122: UserWarning: indices array has non-integer dtype (float64)
% self.indices.dtype.name )
What is the reason behind this error and how can I fix it?
I am using a 64bit machine.
It could be caused by the fact that your second list will be [[], ['water']] by the time that you have removed singletons, trying to do matrix operations on matrices with dimensions of 0 and 1 could cause all sorts of issues.
Having a play with your code:
>>> corpus = [dictionary.doc2bow(text) for text in texts]
>>> corpus
[[], [(0, 2)]]
>>> tfidf = models.TfidfModel(corpus)
2013-07-21 09:23:31,415 : INFO : collecting document frequencies
2013-07-21 09:23:31,415 : INFO : PROGRESS: processing document #0
2013-07-21 09:23:31,415 : INFO : calculating IDF weights for 2 documents and 1 features (1 matrix non-zeros)
>>> corpus = [[(1,)], [(0,2)]]
>>> tfidf = models.TfidfModel(corpus)
2013-07-21 09:24:16,452 : INFO : collecting document frequencies
2013-07-21 09:24:16,452 : INFO : PROGRESS: processing document #0
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/gensim/models/tfidfmodel.py", line 96, in __init__
self.initialize(corpus)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/tfidfmodel.py", line 119, in initialize
for termid, _ in bow:
ValueError: need more than 1 value to unpack
>>> corpus = [[(1,3)], [(0,2)]]
>>> tfidf = models.TfidfModel(corpus)
2013-07-21 09:24:26,892 : INFO : collecting document frequencies
2013-07-21 09:24:26,892 : INFO : PROGRESS: processing document #0
2013-07-21 09:24:26,892 : INFO : calculating IDF weights for 2 documents and 2 features (2 matrix non-zeros)
>>>
As I said above you need to ensure that corpus does not have any empty lists before calling models.TfidfModel(corpus) on it.
That's not an error, it's a warning. You can ignore it.
Your query document doc is empty in the second case, which causes the warning. You still get the correct answer though anyway (=an empty vec_lsi).

n-grams with Naive Bayes classifier

Im new to python and need help!
i was practicing with python NLTK text classification.
Here is the code example i am practicing on
http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/
Ive tried this one
from nltk import bigrams
from nltk.probability import ELEProbDist, FreqDist
from nltk import NaiveBayesClassifier
from collections import defaultdict
train_samples = {}
with file ('positive.txt', 'rt') as f:
for line in f.readlines():
train_samples[line]='pos'
with file ('negative.txt', 'rt') as d:
for line in d.readlines():
train_samples[line]='neg'
f=open("test.txt", "r")
test_samples=f.readlines()
def bigramReturner(text):
tweetString = text.lower()
bigramFeatureVector = {}
for item in bigrams(tweetString.split()):
bigramFeatureVector.append(' '.join(item))
return bigramFeatureVector
def get_labeled_features(samples):
word_freqs = {}
for text, label in train_samples.items():
tokens = text.split()
for token in tokens:
if token not in word_freqs:
word_freqs[token] = {'pos': 0, 'neg': 0}
word_freqs[token][label] += 1
return word_freqs
def get_label_probdist(labeled_features):
label_fd = FreqDist()
for item,counts in labeled_features.items():
for label in ['neg','pos']:
if counts[label] > 0:
label_fd.inc(label)
label_probdist = ELEProbDist(label_fd)
return label_probdist
def get_feature_probdist(labeled_features):
feature_freqdist = defaultdict(FreqDist)
feature_values = defaultdict(set)
num_samples = len(train_samples) / 2
for token, counts in labeled_features.items():
for label in ['neg','pos']:
feature_freqdist[label, token].inc(True, count=counts[label])
feature_freqdist[label, token].inc(None, num_samples - counts[label])
feature_values[token].add(None)
feature_values[token].add(True)
for item in feature_freqdist.items():
print item[0],item[1]
feature_probdist = {}
for ((label, fname), freqdist) in feature_freqdist.items():
probdist = ELEProbDist(freqdist, bins=len(feature_values[fname]))
feature_probdist[label,fname] = probdist
return feature_probdist
labeled_features = get_labeled_features(train_samples)
label_probdist = get_label_probdist(labeled_features)
feature_probdist = get_feature_probdist(labeled_features)
classifier = NaiveBayesClassifier(label_probdist, feature_probdist)
for sample in test_samples:
print "%s | %s" % (sample, classifier.classify(bigramReturner(sample)))
but getting this error, why?
Traceback (most recent call last):
File "C:\python\naive_test.py", line 76, in <module>
print "%s | %s" % (sample, classifier.classify(bigramReturner(sample)))
File "C:\python\naive_test.py", line 23, in bigramReturner
bigramFeatureVector.append(' '.join(item))
AttributeError: 'dict' object has no attribute 'append'
A bigram feature vector follows the exact same principals as a unigram feature vector. So, just like the tutorial you mentioned you will have to check if a bigram feature is present in any of the documents you will use.
As for the bigram features and how to extract them, I have written the code bellow for it. You can simply adopt them to change the variable "tweets" in the tutorial.
import nltk
text = "Hi, I want to get the bigram list of this string"
for item in nltk.bigrams (text.split()): print ' '.join(item)
Instead of printing them you can simply append them to the "tweets" list and you are good to go! I hope this would be helpful enough. Otherwise, let me know if you still have problems.
Please note that in applications like sentiment analysis some researchers tend to tokenize the words and remove the punctuation and some others don't. From experince I know that if you don't remove punctuations, Naive bayes works almost the same, however an SVM would have a decreased accuracy rate. You might need to play around with this stuff and decide what works better on your dataset.
Edit 1:
There is a book named "Natural language processing with Python" which I can recommend it to you. It contains examples of bigrams as well as some exercises. However, I think you can even solve this case without it. The idea behind selecting bigrams a features is that we want to know the probabilty that word A would appear in our corpus followed by the word B. So, for example in the sentence
"I drive a truck"
the word unigram features would be each of those 4 words while the word bigram features would be:
["I drive", "drive a", "a truck"]
Now you want to use those 3 as your features. So the code function bellow puts all bigrams of a string in a list named bigramFeatureVector.
def bigramReturner (tweetString):
tweetString = tweetString.lower()
tweetString = removePunctuation (tweetString)
bigramFeatureVector = []
for item in nltk.bigrams(tweetString.split()):
bigramFeatureVector.append(' '.join(item))
return bigramFeatureVector
Note that you have to write your own removePunctuation function. What you get as output of the above function is the bigram feature vector. You will treat it exactly the same way the unigram feature vectors are treated in the tutorial you mentioned.

Categories

Resources