I'm using the MaxEnt classifier from the Python NLTK library. For my dataset, I have many possible labels, and as expected, MaxEnt returns just one label. I have trained my dataset and get about 80% accuracy. I've also tested my model on unknown data items, and the results are good.
However, for any given unknown input, I want to be able to print/display a ranking of all the possible labels based on whatever internal criterion MaxEnt used to pick the winner, such as confidence/probability. For example, suppose I had a, b, c as possible labels and I call MaxEnt.classify(input); currently I get one label, let's say c. However, I want to be able to view something like a (0.9), b (0.7), c (0.92), so I can see why c was selected, and possibly choose multiple labels based on those scores.
Apologies for my fuzzy terminology; I'm fairly new to NLP and machine learning.
Solution
Based on the accepted answer, here's a skeleton code example to demonstrate what I wanted and how it can be achieved. More classifier examples can be found on the NLTK website.
import nltk
contents = read_data('mydataset.csv')
data_set = [(feature_sets(input), label) for (label, input) in contents] # User-defined feature_sets() function
train_set, test_set = data_set[:1000], data_set[1000:]
labels = set(label for (input, label) in train_set) # Unique labels seen in training
maxent = nltk.MaxentClassifier.train(train_set)
maxent.classify(feature_sets(new_input)) # Returns one label
multi_label = maxent.prob_classify(feature_sets(new_input)) # Returns a DictionaryProbDist object
for label in labels:
    print(label, multi_label.prob(label))
Try prob_classify(input).
It returns a probability distribution (dictionary-like) with a probability for each label; see the docs.
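For instance, here is a minimal sketch that turns the distribution returned by prob_classify into the ranking described in the question (maxent and feature_sets are from the snippet above):
dist = maxent.prob_classify(feature_sets(new_input))
# Sort labels by their estimated probability, highest first.
ranking = sorted(((label, dist.prob(label)) for label in dist.samples()),
                 key=lambda pair: pair[1], reverse=True)
for label, prob in ranking:
    print("%s (%.3f)" % (label, prob))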
Related
I have used fastText's train_supervised utility to train a classification model, following their tutorial at https://fasttext.cc/docs/en/supervised-tutorial.html.
model = fasttext.train_supervised(input='train.txt', autotuneValidationFile='validation.txt', autotuneDuration=600)
After I have the model, how can I explore which best parameters were chosen for it? In sklearn, after a search has found a set of best parameters, we can always check their values, but I could not find any documentation explaining this for fastText.
I also used this trained model to make predictions on my data:
model.predict(test_df.iloc[2, 1])
It returns the label with a probability, like this:
(('__label__2',), array([0.92334366]))
I'm wondering: if I have 5 labels, is it possible to get the probability for every label each time I make a prediction on a text?
For the above test_df text, for example, I would like to get something like:
model.predict(test_df.iloc[2, 1])
(('__label__2',), array([0.92334366])),(('__label__1',), array([0.82334366])),
(('__label__3',), array([0.52333333])),(('__label__0',), array([0.07000000])),
(('__label__4',), array([0.00002000]))
I could not find anything about what to change to get such prediction results.
Any suggestions?
Thanks.
As you can see in the documentation, when using the predict method you should specify the k parameter to get the top-k predicted classes.
model.predict("Why not put knives in the dishwasher?", k=5)
OUTPUT:
((u'__label__food-safety', u'__label__baking', u'__label__equipment',
u'__label__substitutions', u'__label__bread'), array([0.0857 , 0.0657,
0.0454, 0.0333, 0.0333]))
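Applied to the question's own call (test_df is from the question above), a brief sketch of asking for the top 5 labels:
# Ask fastText for the top 5 labels and their probabilities for one text.
labels, probs = model.predict(test_df.iloc[2, 1], k=5)
for label, prob in zip(labels, probs):
    print(label, prob)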
I created a Python script for training and inferring test document vectors using doc2vec.
My problem is that when I try to find the most similar entries for a phrase, for example ("the word"), it only shows me a list of the most similar words. It doesn't show a list of the most similar phrases/documents.
Am I missing something in my code?
#python example to infer document vectors from trained doc2vec model
import gensim.models as g
import codecs
#parameters
model="toy_data/model.bin"
test_docs="toy_data/test_docs.txt"
output_file="toy_data/test_vectors.txt"
#inference hyper-parameters
start_alpha=0.01
infer_epoch=1000
#load model
m = g.Doc2Vec.load(model)
test_docs = [ x.strip().split() for x in codecs.open(test_docs, "r", "utf-8").readlines() ]
#infer test vectors
output = open(output_file, "w")
for d in test_docs:
    output.write(" ".join([str(x) for x in m.infer_vector(d, alpha=start_alpha, steps=infer_epoch)]) + "\n")
output.flush()
output.close()
m.most_similar('the word'.split())
I get this list:
[('refutations', 0.9990279078483582),
('volume', 0.9989271759986877),
('italic', 0.9988381266593933),
('syllogisms', 0.998751699924469),
('power', 0.9987285137176514),
('alibamu', 0.9985184669494629),
("''", 0.99847412109375),
('roman', 0.9984466433525085),
('soil', 0.9984269738197327),
('plants', 0.9984176754951477)]
The Doc2Vec model collects its doc-vectors for later lookup or search in a property .docvecs. To get doc-vector results, you would perform a most_similar() on that property. If your Doc2Vec instance is held in a variable d2v_model, and doc_id holds one of the known doc-tags from training, that might be:
d2v_model.docvecs.most_similar(doc_id)
If you were inferring a vector for a new document, and looking up training docs similar to that inferred vector, your code might be like:
new_dv = d2v_model.infer_vector('some new document'.split())
d2v_model.docvecs.most_similar(positive=[new_dv])
(The Doc2Vec model class is derived from the very-similar Word2Vec class, and thus inherits a most_similar() which by default consults just the internal word-vectors. Those word-vectors might be useful, in some Doc2Vec modes, or random – but it's best to use either d2v_model.wv.most_similar() or d2v_model.docvecs.most_similar() to be clear.)
Basic Doc2Vec examples, like the doc2vec-lee.ipynb notebook installed with gensim in its docs/notebooks directory, demonstrate this usage.
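Tying this back to the question's own script, a minimal sketch (assuming a gensim 3.x-era model like the one loaded above, where the document vectors live in .docvecs):
import gensim.models as g

m = g.Doc2Vec.load("toy_data/model.bin")

# Infer a vector for the new phrase/document...
new_vec = m.infer_vector("the word".split(), alpha=0.01, steps=1000)

# ...and look up the most similar *training documents* (doc-tags), not words.
for doc_tag, similarity in m.docvecs.most_similar(positive=[new_vec], topn=10):
    print(doc_tag, similarity)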
I want to build a model that can classify news articles into specific categories. The way I imagine it, I would put all the selected training articles into their labelled categories, then use word2vec for training and generate a model. I wonder, is that possible?
I have tried a small example to build the vocabulary in gensim, but it keeps telling me that the word doesn't exist in the vocabulary. I'm confused.
from collections import Counter
from gensim.models import Word2Vec

randomTxt = 'loop is good. loop infinity is not good. they are good at some point.'
x = randomTxt.split() # This finds words in the document
a = Counter(x)
print(x)
w1 = 'so'
model1 = Word2Vec(randomTxt, min_count=0)
print(model1.wv['loop'])
I wonder if anyone has an idea, or knows how to build this from a dataset from the beginning, and could help me with it? Or maybe a pointer to some good documentation.
I have read this docs: https://radimrehurek.com/gensim/models/word2vec.html
but when I follow it as above, it keeps telling me that 'loop' doesn't exist in the vocabulary word2vec built.
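For reference, gensim's Word2Vec expects an iterable of tokenized sentences (a list of lists of tokens), not a raw string; passing a raw string makes it iterate over characters, which is why 'loop' ends up missing from the vocabulary. A minimal sketch of the expected input format (the sentence splitting here is just an assumption for illustration):
from gensim.models import Word2Vec

randomTxt = 'loop is good. loop infinity is not good. they are good at some point.'

# Tokenize into a list of sentences, each a list of word tokens.
sentences = [s.split() for s in randomTxt.split('.') if s.strip()]

model1 = Word2Vec(sentences, min_count=0)
print(model1.wv['loop'])  # now 'loop' is in the vocabulary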
I have created a model for text classification using Python. I use a CountVectorizer, and it results in a document-term matrix of 2034 rows and 4063 columns (unique words). I saved the model to use on new test data. My new test data is:
test_data = ['Love', 'python', 'every','time']
But the problem is that when I convert the above test data tokens into a feature vector, it differs in shape, because the model expects a vector of length 4063. I know how to solve this by taking the vocabulary of the CountVectorizer, searching for each token of the test data, and putting it at the corresponding index. But is there an easier way to handle this problem in scikit-learn itself?
You should not fit a new CountVectorizer on the test data; you should use the one you fit on the training data and call transform(test_data) on it.
You have two ways to solve this:
1. You can use the same CountVectorizer that you used for your train features, like this:
cv = CountVectorizer()  # with your desired parameters
X_train = cv.fit_transform(train_data)
X_test = cv.transform(test_data)
2. You can also create another CountVectorizer, if you really want to (not advisable, since you would be wasting space and you'd still want to use the same parameters for your CV), and use the same features:
cv_train = CountVectorizer()  # with your desired parameters
X_train = cv_train.fit_transform(train_data)
cv_test = CountVectorizer(vocabulary=cv_train.get_feature_names())  # plus the same desired parameters
X_test = cv_test.fit_transform(test_data)
Try using:
test_features = cv.inverse_transform(test_data)  # cv is the fitted CountVectorizer
This should return what you wish for.
I added .toarray() to the whole command in order to see the results as a matrix.
So you should write:
X_test_analyst = pipeline.named_steps['count_vectorizer'].transform(X_test).toarray()
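For context, a minimal sketch of what such a pipeline might look like (the step names, the classifier, and the train_data/train_labels/X_test variables are assumptions for illustration, not from the original answer):
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical pipeline whose first step is named 'count_vectorizer'.
pipeline = Pipeline([
    ('count_vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB()),
])
pipeline.fit(train_data, train_labels)

# Inspect the vectorised test data exactly as described above.
X_test_analyst = pipeline.named_steps['count_vectorizer'].transform(X_test).toarray()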
I'm very late to this discussion, but I just want to leave something for people who come here from a search engine.
Sorry for my bad English.
;)
As mentioned by @Andreas Mueller, you shouldn't create a new CountVectorizer with your new data(set). You can picture what the count vectorizer does as building a 2D array (or think of it as an Excel table): every column is a unique word, every row represents a document (or sentence), and the value at (i, j) is the frequency of the j-th word in the i-th sentence.
If you make a new CountVectorizer using your new data, the unique words will probably (if not certainly) be different. When you call model.predict on this data, it will report some sort of error telling you the dimensions are not correct.
What I did in my code is the following:
If you train your model in a different .py / .ipynb file, you can use import pickle followed by the dump function to save your fitted count vectorizer. You can follow the details in this post.
If you train your model in the same .py / .ipynb file, you can directly follow what @Andreas Mueller said.
code:
import pickle as pk
pk.dump(vectorizer, open(r'/relative path', 'wb'))
pk.dump(pca, open(r'/relative path', 'wb'))
# ...
# When you want to use them again:
import pickle as pk
vectoriser = pk.load(open(r'/relative path', 'rb'))
pca = pk.load(open(r'/relative path', 'rb'))
# ...
Side note:
If I remember correctly, you can also export a class or other objects using pickle, but when you do so, make sure the class is already defined when you load the object. I'm not sure if this matters in this case, but I still import PCA and CountVectorizer before I call the pk.load function.
I'm just a beginner at coding, so please test my code before using it in your project.
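As a hedged follow-up to the pickle sketch above (variable names are taken from that snippet; the test sentence is made up), using the reloaded objects on new data might look like this:
# New raw test data (hypothetical).
test_data = ['Love python every time']

# Reuse the unpickled vectorizer so the feature space matches training.
X_new = vectoriser.transform(test_data)

# Apply the saved PCA the same way it was applied at training time.
X_new_reduced = pca.transform(X_new.toarray())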
I have many short sentences with an associated label (that I want to predict). For example:
This is a short sentence, label1
For each of the short sentences, I am extracting all of the nouns and adjectives and using them as features for a naive bayes classifier. So for the sentence above:
short, sentence
are the two words that would be extracted.
How many features?
In total I have 260,000 observations (a 50 MB CSV file). From this I am able to extract about 7,103 unique nouns and adjectives. But using them all is just impractical, so I want to use the top N most frequently occurring words as features.
If I use the top 50 words my code runs just fine. Here is a snippet:
featuresets = [(document_features(i), i.label) for i in y]
train_set, test_set = featuresets[testTrainCutoff:], featuresets[:testTrainCutoff]
classifier = nltk.NaiveBayesClassifier.train(train_set)
Problem
If I use the top 100 words though, I get a MemoryError when trying to build featuresets. From another question on Stack Overflow, I found a potential solution, which was to "use nltk.classify.apply_features which returns an object that acts like a list but does not store all the feature sets in memory" (source).
So I updated my code:
train_set, test_set = y[testTrainCutoff:], y[:testTrainCutoff]
train_set, test_set = apply_features(document_features,train_set), apply_features(document_features,test_set)
classifier = nltk.NaiveBayesClassifier.train(train_set)
However, now I get the following error:
File "C:\Python27\lib\site-packages\nltk\classify\naivebayes.py", line 191, in train
for featureset, label in labeled_featuresets:
ValueError: too many values to unpack
As a side note: I am running this on a machine with 40 GB of RAM, so I find it odd that I am getting these errors.
I saw the same issue, and for me it was not related to system resources. Here is the solution for my case.
I was populating my feature object as a dict of features:
features = {<features>}
instead of as a tuple containing a dict and a classification value:
features = ({<features>},classification)
No matter how small I made the feature array, it always gave me
ValueError: too many values to unpack
with this incorrect data structure. Once I changed to a tuple, everything worked fine. Hope this helps.
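As an illustration, here is a minimal sketch of the structure NLTK's trainer expects (the sample data and the document_features helper are made up for the example):
import nltk
from nltk.classify import apply_features

# Hypothetical labelled data: (raw_text, label) pairs, as in the question.
labeled_sentences = [
    ("This is a short sentence", "label1"),
    ("Another brief example", "label2"),
]

def document_features(sentence):
    # Toy feature extractor: mark each token as present.
    return {token: True for token in sentence.split()}

# apply_features lazily yields (feature_dict, label) tuples,
# which is exactly what NaiveBayesClassifier.train unpacks.
train_set = apply_features(document_features, labeled_sentences, labeled=True)
classifier = nltk.NaiveBayesClassifier.train(train_set)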