NLTK naive Bayes classifier memory issue - Python

My first post here!
I have problems using the nltk NaiveBayesClassifier. I have a training set of 7000 items. Each training item has a description of 2 or 3 words and a code. I would like to use the code as the class label and each word of the description as a feature.
An example:
"My name is Obama", 001
...
Training set = {[feature['My']=True, feature['name']=True, feature['is']=True, feature['Obama']=True], 001}
Unfortunately, using this approach, the training procedure NaiveBayesClassifier.train uses up to 3 GB of RAM.
What's wrong with my approach?
Thank you!
from nltk import classify
from nltk.classify import NaiveBayesClassifier

def document_features(document):  # feature extractor
    document = set(document)
    return dict((w, True) for w in document)
...
words = set()
entries = []
train_set = []
train_length = 2000
readfile = open("atcname.pl", 'r')
t = readfile.readline()
while t != "":
    t = t.split("'")
    code = t[0]  # class
    desc = t[1]  # description
    s = set(desc.split())  # words of the description (s was undefined in the original post)
    words = words.union(s)  # update dictionary with the new words in the description
    entries.append((s, code))
    t = readfile.readline()
readfile.close()
train_set = classify.util.apply_features(document_features, entries[:train_length])
classifier = NaiveBayesClassifier.train(train_set)  # Training

Use nltk.classify.apply_features, which returns an object that acts like a list but does not store all the feature sets in memory.
from nltk.classify import apply_features
More information and an example here
You are loading the whole file into memory anyway; you will need some form of lazy loading that reads the data on an as-needed basis.
Consider looking into this
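A minimal sketch of that lazy-loading idea, assuming the file name and the '-delimited line format from the question: a generator parses one line at a time, so only one labeled feature set is materialized before training consumes it (NaiveBayesClassifier.train iterates over the pairs in a single pass, so a generator is enough here).
from nltk import NaiveBayesClassifier

def labeled_featuresets(path, limit=2000):
    # yield (featureset, label) pairs one line at a time
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= limit:
                break
            parts = line.split("'")
            code, desc = parts[0], parts[1]
            # one boolean feature per distinct word, as in the question
            yield (dict((w, True) for w in set(desc.split())), code)

classifier = NaiveBayesClassifier.train(labeled_featuresets("atcname.pl"))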

Related

How do I use gensim to vectorize these words in my dataframe so I can perform clustering on them?

I am trying to do a clustering analysis (preferably k-means) of poetry words in a pandas dataframe. First I am trying to vectorize the words using Word2Vec from the gensim package. However, the vectors just come out as 0s, so my code fails to translate the words into vectors, and as a result the clustering doesn't work. Here is my code:
import gensim
import numpy as np

# create a gensim model
model = gensim.models.Word2Vec(vector_size=100)
# copy original pandas dataframe with poems
data = poems.copy(deep=True)
# get data ready for kmeans clustering
final_data = []  # empty list
for i, row in data.iterrows():
    poem_vectorized = []
    poem = row['Main_text']
    poem_all_words = poem.split(sep=" ")
    for poem_w in poem_all_words:  # iterate through list of words
        try:
            poem_vectorized.append(list(model.wv[poem_w]))
        except Exception as e:
            pass
    try:
        poem_vectorized = np.asarray(poem_vectorized)
        poem_vectorized_mean = list(np.mean(poem_vectorized, axis=0))
    except Exception as e:
        poem_vectorized_mean = list(np.zeros(100))
        pass
    try:
        len(poem_vectorized_mean)
    except:
        poem_vectorized_mean = list(np.zeros(100))
    temp_row = np.asarray(poem_vectorized_mean)
    final_data.append(temp_row)
X = np.asarray(final_data)
print(X)
At closer inspection of:
poem_vectorized.append(list(model.wv[poem_w]))
the problem seems to be that this lookup fails for every word, so no vectors are ever appended.
If I understand it correctly, you want to use an existing model to get the semantic embeddings of the tokens and then cluster the words, right?
Because of the way you set the model up, you are preparing a new model for training but never feed it any training data or train it, so the model doesn't know any words and always throws a KeyError when calling model.wv[poem_w].
Use gensim.downloader to load an existing model (check out their repository for a list of all available models):
import gensim.downloader as api
import numpy as np
import pandas
poems = pandas.DataFrame({"Main_text": ["This is a sample poem.", "This is another sample poem."]})
model = api.load("glove-wiki-gigaword-100")
Then use it to retrieve the vectors for all words the model knows:
final_data = []
for poem in poems['Main_text']:
    poem_all_words = poem.split()
    poem_vectorized = []
    for poem_w in poem_all_words:
        if poem_w in model:
            poem_vectorized.append(model[poem_w])
    poem_vectorized_mean = np.mean(poem_vectorized, axis=0)
    final_data.append(poem_vectorized_mean)
Or as a list comprehension:
final_data = []
for poem in poems['Main_text']:
    poem_vectorized_mean = np.mean([model[poem_w] for poem_w in poem.split() if poem_w in model], axis=0)
    final_data.append(poem_vectorized_mean)
Both of which will give you:
X = np.asarray(final_data)
print(X)
> [[-3.74696642e-01 3.73661995e-01 4.09943342e-01 -2.07784668e-01
...
-1.85739681e-01 -7.07386672e-01 3.31366658e-01 3.31600010e-01]
[-3.29973340e-01 4.13213342e-01 5.26199996e-01 -2.29261339e-01
...
-1.25366330e-01 -5.87253332e-01 2.80240029e-01 2.56700337e-01]]
Note that calling np.mean() on an empty list does not produce a usable vector (it yields NaN with a RuntimeWarning), so you might want to guard against poems that are empty or whose words are all unknown to the model.
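A minimal sketch of such a guard, assuming a zero vector is an acceptable fallback for those poems:
final_data = []
for poem in poems['Main_text']:
    vectors = [model[w] for w in poem.split() if w in model]
    if vectors:
        final_data.append(np.mean(vectors, axis=0))
    else:
        # empty poem, or no word known to the model
        final_data.append(np.zeros(model.vector_size))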

Training a FastText model

I want to train a FastText model in Python using the "gensim" library. First, I should tokenize each sentence into its words, converting each sentence to a list of words. Then each list is appended to a final list, so at the end I have a nested list containing all the tokenized sentences:
import nltk

word_punctuation_tokenizer = nltk.WordPunctTokenizer()
word_tokenized_corpus = []
for line in open('sentences.txt'):
    new = line.strip()
    new = word_punctuation_tokenizer.tokenize(new)
    if len(new) != 0:
        word_tokenized_corpus.append(new)
Then the model should be built as follows:
from gensim.models import FastText

embedding_size = 60
window_size = 40
min_word = 5
down_sampling = 1e-2
ft_model = FastText(word_tokenized_corpus,
                    size=embedding_size,
                    window=window_size,
                    min_count=min_word,
                    sample=down_sampling,
                    sg=1,
                    iter=100)
However, the number of sentences in "word_tokenized_corpus" is very large and the program can't handle it. Is it possible to train the model by giving it each tokenized sentence one at a time, such as the following?
for line in open('sentences.txt'):
    new = line.strip()
    new = word_punctuation_tokenizer.tokenize(new)
    if len(new) != 0:
        ft_model = FastText(new,
                            size=embedding_size,
                            window=window_size,
                            min_count=min_word,
                            sample=down_sampling,
                            sg=1,
                            iter=100)
Does this make any difference to the final results? Is it possible to train the model without having to build such a large list and keep it in memory?
Since the volume of the data is very high, it is better to convert the text file into a COR file and then read it in the following way:
from gensim.test.utils import datapath
corpus_file = datapath('sentences.cor')
As for the next step:
model = FastText(size=embedding_size,
                 window=window_size,
                 min_count=min_word,
                 sample=down_sampling,
                 sg=1,
                 iter=100)
model.build_vocab(corpus_file=corpus_file)
total_words = model.corpus_total_words
model.train(corpus_file=corpus_file, total_words=total_words, epochs=5)
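As a side note, a sketch assuming gensim >= 3.6: the FastText constructor itself also accepts corpus_file, which streams the sentences from disk (one whitespace-tokenized sentence per line) without ever building the nested list in memory.
from gensim.models import FastText

model = FastText(corpus_file=corpus_file,
                 size=embedding_size,
                 window=window_size,
                 min_count=min_word,
                 sample=down_sampling,
                 sg=1,
                 iter=100)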
If you want to use the fasttext library's own API instead, here is how you can do it:
import fasttext

path = "path/to/all/the/texts/in/a/single/txt/files.txt"
training_param = {
    'ws': window_size,
    'minCount': min_word,
    'dim': embedding_size,
    't': down_sampling,
    'epoch': 5,
    'seed': 0
}
# for all the parameters: https://fasttext.cc/docs/en/options.html
model = fasttext.train_unsupervised(path, **training_param)
model.save_model("embeddings_300_fr.bin")
The advantages of using the fasttext API are that (1) it is implemented in C++ with a Python wrapper (much faster than Gensim, and multithreaded) and (2) it handles reading the text better. It is also possible to use it directly from the command line.

Creating features function for further classification in python

I have read a description of how to apply random forest regression here. In this example the authors use the following code to create the features:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer="word", max_features=5000)
train_data_features = vectorizer.fit_transform(clean_train_reviews)
train_data_features = train_data_features.toarray()
I am thinking of combining several possibilities as features and turning them on and off, and I don't know how to do that.
What I have so far is a class where I can turn the features on and off and see whether they add anything (for example, all unigrams and the 20 most frequent unigrams; later it could be the 10 most frequent adjectives, or tf-idf). But for now I don't understand how to combine them.
The code looks like this, and it is the function part where I am lost (the kind of function I have would replicate what they do in the tutorial, but it doesn't seem to be really helpful the way I do it):
class FeatureGen:  # for example, feat = FeatureGen(unigrams=False) creates a feature set without the turned-off feature
    def __init__(self, unigrams=True, unigrams_freq=True):
        self.unigrams = unigrams
        self.unigrams_freq = unigrams_freq

    def get_features(self, input):
        vectorizer = CountVectorizer(analyzer="word", max_features=5000)
        tokens = input["token"]
        if self.unigrams:
            train_data_features = vectorizer.fit_transform(tokens)
        return train_data_features
What should I do to add one more feature possibility, like "contains the 10 most frequent words"?
if self.unigrams:
    train_data_features = vectorizer.fit_transform(tokens)
if self.unigrams_freq:
    # something else
return features  # and this should be a combination somehow
Looks like you need np.hstack. However, you need each feature array to have one row per training case.
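A minimal sketch of that combination, built on the question's own class (the max_features values are illustrative): each enabled feature becomes a dense block with one row per document, and np.hstack concatenates the blocks column-wise.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

class FeatureGen:
    def __init__(self, unigrams=True, unigrams_freq=True):
        self.unigrams = unigrams
        self.unigrams_freq = unigrams_freq

    def get_features(self, tokens):
        blocks = []  # one (n_docs, n_features) array per enabled feature
        if self.unigrams:
            vec = CountVectorizer(analyzer="word", max_features=5000)
            blocks.append(vec.fit_transform(tokens).toarray())
        if self.unigrams_freq:
            # counts restricted to the 10 most frequent words
            vec_freq = CountVectorizer(analyzer="word", max_features=10)
            blocks.append(vec_freq.fit_transform(tokens).toarray())
        return np.hstack(blocks)  # rows stay aligned, one per training case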

Speed up classification task on sklearn/Machine Learning with pickle?

I already have a trained classifier that I load through pickle.
My main doubt is whether there is anything that can speed up the classification task. It is taking almost one minute per text (feature extraction plus classification); is that normal? Should I move to multi-threading?
Here are some code fragments that show the overall flow:
import pickle
import re
import time

for item in items:
    review = ''.join(item['review_body'])
    review_features = getReviewFeatures(review)
    normalized_predicted_rating = getPredictedRating(review_features)
    item_processed['rating'] = str(round(float(normalized_predicted_rating), 1))

def getReviewFeatures(review, verbose=True):
    text_tokens = tokenize(review)
    polarity = getTextPolarity(review)
    subjectivity = getTextSubjectivity(review)
    taggs = getTaggs(text_tokens)
    bigrams = processBigram(taggs)
    freqBigram = countBigramFreq(bigrams)
    sort_bi = sortMostCommun(freqBigram)
    adjectives = getAdjectives(taggs)
    freqAdjectives = countFreqAdjectives(adjectives)
    sort_adjectives = sortMostCommun(freqAdjectives)
    word_features_adj = list(sort_adjectives)
    word_features = list(sort_bi)
    features = {}
    for bigram, freq in word_features:
        features['contains(%s)' % unicode(bigram).encode('utf-8')] = True
        features["count({})".format(unicode(bigram).encode('utf-8'))] = freq
    for word, freq in word_features_adj:
        features['contains(%s)' % unicode(word).encode('utf-8')] = True
        features["count({})".format(unicode(word).encode('utf-8'))] = freq
    features["polarity"] = polarity
    features["subjectivity"] = subjectivity
    if verbose:
        print "Get review features..."
    return features

def getPredictedRating(review_features, verbose=True):
    start_time = time.time()
    # note: the pickled classifier is re-loaded from disk on every call
    classifier = pickle.load(open("LinearSVC5.pickle", "rb"))
    p_rating = classifier.classify(review_features)  # in the form of "# star"
    predicted_rating = int(re.findall(r'\d+', p_rating)[0])
    best_rating = 5
    worst_rating = 1
    normalized_predicted_rating = round(float(predicted_rating) * 10.0 / ((float(best_rating) - float(worst_rating)) + float(worst_rating)))
    if verbose:
        print "Get predicted rating..."
        print "ML_RATING: ", normalized_predicted_rating
        print("---Took %s seconds to predict rating for the review---" % (time.time() - start_time))
    return normalized_predicted_rating
NLTK is a great tool and a good starting point for natural language processing, but it's sometimes not very useful if speed is important, as the authors implicitly said:
NLTK has been called β€œa wonderful tool for teaching, and working in, computational linguistics using Python,” and β€œan amazing library to play with natural language.”
So if your problem only lies in the speed of the toolkit's classifier, you have to use another resource or write the classifier yourself.
Scikit might be helpful for you if you want to use a classifier which is probably faster.
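A hedged sketch of that route, reusing names from the question (the training data variables here are hypothetical): scikit-learn's DictVectorizer turns dict feature sets like those built by getReviewFeatures() into numeric arrays, and a LinearSVC pipeline is typically much faster to apply, especially if it is unpickled once rather than on every call.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# train once on dict feature sets and their labels (hypothetical variables)
pipeline = Pipeline([
    ("vectorizer", DictVectorizer()),
    ("classifier", LinearSVC()),
])
pipeline.fit(train_featuresets, train_labels)

# at prediction time, call the pipeline directly; load any pickled model
# once, outside the per-review loop, instead of inside it
predicted = pipeline.predict([review_features])[0]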
It seems that you use a dictionary to build the feature vector; I strongly suspect that the problem is there.
The proper way would be to use a numpy ndarray, with examples on rows and features on columns. So, something like:
import numpy as np

# let's suppose 6 different features = 6-dimensional vector
feats = np.zeros((1, 6))  # one example (row), six features (columns)
# column 0 contains polarity, column 1 subjectivity, and so on..
feats[:, 0] = polarity
feats[:, 1] = subjectivity
# ....
classifier.classify(feats)
Of course, you must use the same data structure and respect the same convention during training.

Doc2vec: How to get document vectors

How to get document vectors of two text documents using Doc2vec?
I am new to this, so it would be helpful if someone could point me in the right direction or to a tutorial.
I am using gensim.
doc1 = ["This is a sentence", "This is another sentence"]
documents1 = [doc.strip().split(" ") for doc in doc1]
model = doc2vec.Doc2Vec(documents1, size=100, window=300, min_count=10, workers=4)
I get
AttributeError: 'list' object has no attribute 'words'
whenever I run this.
If you want to train a Doc2Vec model, your data set needs to contain lists of words (similar to the Word2Vec format) and tags (document ids). It can also contain some additional info (see https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb for more information).
# Import libraries
from gensim.models import doc2vec
from collections import namedtuple
# Load data
doc1 = ["This is a sentence", "This is another sentence"]
# Transform data (you can add more data preprocessing steps)
docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for i, text in enumerate(doc1):
    words = text.lower().split()
    tags = [i]
    docs.append(analyzedDocument(words, tags))
# Train model (set min_count = 1, if you want the model to work with the provided example data set)
model = doc2vec.Doc2Vec(docs, size = 100, window = 300, min_count = 1, workers = 4)
# Get the vectors
model.docvecs[0]
model.docvecs[1]
UPDATE (how to train in epochs):
This example became outdated, so I deleted it. For more information on training in epochs, see this answer or @gojomo's comment.
Gensim was updated: the syntax of LabeledSentence no longer contains labels, there are now tags - see the documentation for LabeledSentence at https://radimrehurek.com/gensim/models/doc2vec.html
However, @bee2502 was right with
docvec = model.docvecs[99]
It will show the 100th vector's value for the trained model; it works with both integers and strings.
doc1 = ["This is a sentence", "This is another sentence"]
documents = [doc.strip().split(" ") for doc in doc1]
model = doc2vec.Doc2Vec(documents, size=100, window=300, min_count=10, workers=4)
I got AttributeError: 'list' object has no attribute 'words' because the input documents to Doc2Vec() were not in the correct LabeledSentence format.
I hope the example below helps you understand the format.
documents = LabeledSentence(words=[u'some', u'words', u'here'], labels=[u'SENT_1'])
More details are here : http://rare-technologies.com/doc2vec-tutorial/
However, I solved the problem by taking the input data from a file using TaggedLineDocument().
File format: one document = one line = one TaggedDocument object.
Words are expected to be already preprocessed and separated by whitespace; tags are constructed automatically from the document line number.
sentences = doc2vec.TaggedLineDocument(file_path)
model = doc2vec.Doc2Vec(sentences, size=100, window=300, min_count=10, workers=4)
To get a document vector:
You can use docvecs. More details here: https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.TaggedDocument
docvec = model.docvecs[99]
where 99 is the id of the document whose vector we want. If the tags are in integer format (the default if you load using TaggedLineDocument()), use the integer id directly, as I did. If the tags are in string format, use "SENT_99". This is similar to Word2vec.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(doc1)]
model = Doc2Vec(documents)  # plus the other parameters
This should work fine. You need to tag your documents to train a doc2vec model.
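Putting the pieces together, a runnable sketch assuming gensim 4.x (where the size parameter became vector_size and model.docvecs became model.dv): train on tagged documents, then read trained vectors by tag or infer a vector for unseen text.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

doc1 = ["This is a sentence", "This is another sentence"]
documents = [TaggedDocument(doc.lower().split(), [i]) for i, doc in enumerate(doc1)]

# min_count=1 so the tiny example corpus is not filtered away
model = Doc2Vec(documents, vector_size=100, window=5, min_count=1, workers=4)

print(model.dv[0])  # trained vector of the first document, by its integer tag
print(model.infer_vector("this is an unseen sentence".split()))  # unseen text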
