I'm trying to train an NLTK classifier for sentiment analysis and then save the classifier using pickle.
The freshly trained classifier works fine. However, if I load a saved classifier, it outputs either 'positive' or 'negative' for ALL examples.
I'm saving the classifier using
classifier = nltk.NaiveBayesClassifier.train(training_set)
classifier.classify(words_in_tweet)
f = open('classifier.pickle', 'wb')
pickle.dump(classifier, f)
f.close()
and loading the classifier using
f = open('classifier.pickle', 'rb')
classifier = pickle.load(f)
f.close()
classifier.classify(words_in_tweet)
I'm not getting any errors.
Any idea what the problem could be, or how to debug this correctly?
The most likely place a pickled classifier can go wrong is the feature extraction function: it is what generates the feature vectors the classifier works with.
The NaiveBayesClassifier expects feature vectors for both training and classification; your code looks as if you passed the raw words to the classifier instead (but presumably only after unpickling, otherwise you wouldn't get different behavior before and after unpickling). You should store the feature extraction code in a separate file and import it in both the training and the classifying (or testing) script.
I doubt this applies to the OP, but some NLTK classifiers take the feature extraction function as an argument to the constructor. When you have separate scripts for training and classifying, it can be tricky to ensure that the unpickled classifier successfully finds the same function. This is because of the way pickle works: pickling only saves data, not code. To get it to work, just put the extraction function in a separate file (module) that your scripts import. If you put it in the "main" script, pickle.load will look for it in the wrong place.
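As a minimal sketch of that layout (labeled_tweets is a hypothetical placeholder for your training data; the module and file names are arbitrary), the extractor lives in its own module and both scripts import it:
# features.py -- shared by the training and classifying scripts
def extract_features(words):
    # the exact same transformation must be applied at train and test time
    return {word: True for word in words}
# train.py
import pickle
import nltk
from features import extract_features
training_set = [(extract_features(words), label) for words, label in labeled_tweets]
classifier = nltk.NaiveBayesClassifier.train(training_set)
with open('classifier.pickle', 'wb') as f:
    pickle.dump(classifier, f)
# classify.py
import pickle
from features import extract_features
with open('classifier.pickle', 'rb') as f:
    classifier = pickle.load(f)
print(classifier.classify(extract_features(words_in_tweet)))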
Related
I was working on a text classification problem with Keras. I tried to test the model I created, but I can't use the TfidfVectorizer on the test data.
from keras.models import model_from_json

with open('model_architecture.json', 'r') as f:
    model = model_from_json(f.read())
model.load_weights('model_weights.h5')
After loading the model, I prepared a test list to use.
test_data = ["sentence1", "sentence2", "sentence3"]
No problem so far.
But...
tf = TfidfVectorizer(binary=True)
train = tf.fit_transform(test_data)
test = tf.transform(test_data)
print(model.predict_classes(test))
ValueError: Error when checking input: expected dense_1_input to have shape (11103,) but got array with shape (92,)
This is the error I get.
And I also tried
tf = TfidfVectorizer(binary=True)
test = tf.transform(test_data)
sklearn.exceptions.NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted.
but I received this error instead; I learned that fit() must be called before transform() can be used.
But I still can't test the model I trained.
You need to encode your test data using the exact same TfidfVectorizer object you fit and used to transform the original training data, back when you originally trained the model. If you fit a different TfidfVectorizer to your test data, the encoding (including the vocabulary length) will be completely different and it will not work. This difference in vocabulary length is the proximate cause of the error you're seeing. However, even if you do get the dimensions to match purely by chance, it still won't work, because the model was trained with an encoding that maps "cat" to 42, or whatever, while you're testing it with an encoding that maps "cat" to 13 or something. You'd basically be feeding it scrambled nonsense. There really is no alternative but to go and get the original TfidfVectorizer, or at least to fit a TfidfVectorizer to the exact same documents with the exact same configuration. If that's not possible, you'll simply have to train a new model, and this time remember to save the TfidfVectorizer as well.
Normally the fitted preprocessing objects are saved to a pickle file via pickle.dump() during the initial training and loaded with pickle.load() for testing and production, similar to what you did for model_architecture.json and model_weights.h5. It is also convenient to put everything together into an sklearn Pipeline so you only have to pickle one object, but I'm not sure how that works together with a Keras model.
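As a hedged sketch of that workflow (train_texts is a placeholder for your original training documents, and the pickle filename is arbitrary):
# at training time
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(binary=True)
X_train = tf.fit_transform(train_texts)  # fit only on the training documents
with open('tfidf.pickle', 'wb') as f:
    pickle.dump(tf, f)
# at test time
with open('tfidf.pickle', 'rb') as f:
    tf = pickle.load(f)
test = tf.transform(test_data)  # transform only; never refit on test data
print(model.predict_classes(test))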
I am building a basic NLP program using nltk and sklearn. I have a large dataset in a database and I am wondering what the best way to train the classifier is.
Is it advisable to download the training data in chunks and pass each chunk to the classifier? Is that even possible, or would I be overwriting what was learned from the previous chunk?
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

while True:
    training_set, proceed = download_chunk()  # pseudo
    trained = SklearnClassifier(MultinomialNB()).train(training_set)
    if not proceed:
        break
How is this normally done? I want to avoid keeping the database connection open for too long.
The way you're doing it right now will actually just overwrite the classifier for each chunk in your training data as you're creating a new SklearnClassifier object each time. What you need to do is instantiate the SklearnClassifier prior to getting into the training loop. However, looking at the code here, it appears that the NLTK SklearnClassifier uses the fit method of the underlying Sklearn model. This means that you can't actually update a model once it is trained. What you need to do is instantiate the Sklearn model directly and use the partial_fit method. Something like this should work:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()  # must instantiate the classifier outside of the loop or it will just get overwritten
while True:
    training_set, proceed = download_chunk()  # pseudo
    X_chunk, y_chunk = training_set  # pseudo: feature matrix and labels for this chunk
    # the full list of class labels (all_classes here) must be supplied on the first call to partial_fit
    clf.partial_fit(X_chunk, y_chunk, classes=all_classes)
    if not proceed:
        break
At the end, you'll have a MultinomialNB() classifier that has been trained on each chunk of your data.
Typically, if the whole dataset will fit in memory, it is somewhat more performant to just download the whole thing and call fit once (in which case you could actually use the nltk SklearnClassifier). See the notes about the partial_fit method here. However, if you are unable to fit the entire set in memory, it is certainly common practice to train on chunks of the data. You can do this by making several calls to the database or by extracting all of the information from the database, placing it in a CSV on your hard drive, and reading chunks of it from there.
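For the CSV route, here is a sketch using pandas (the filename, chunk size, and label column name are placeholders; all_classes must hold the full list of labels up front, because partial_fit needs it on the first call):
import pandas as pd
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
for chunk in pd.read_csv('training_data.csv', chunksize=10000):
    X_chunk = chunk.drop(columns='label').values
    y_chunk = chunk['label'].values
    clf.partial_fit(X_chunk, y_chunk, classes=all_classes)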
Note
If you're using a shared database with other users, the DBAs may prefer you to extract all of it at once, as this would (probably) take up fewer DB resources than making several separate, smaller calls to the database.
I am relatively new to logistic regression using scikit-learn in Python. After reading some topics and viewing some demos, I decided to dive in myself.
So, basically, I am trying to predict the conversion rate of customers based on some features. The outcome is either Active (1) or Not active (0). I tried KNN and logistic regression. With KNN I get an average accuracy of 0.893 and with logistic regression 0.994. The latter seems very high; is that even realistic/possible?
Anyway: suppose that my model is indeed very accurate. I would now like to import a new dataset with the same feature columns and predict its conversions (they end this month). In the case above I used cross_val_score to get the accuracy scores.
Do I now need to import the new set and somehow fit it to this model (not training it again; I just want to use it)?
Can someone please inform me how I can proceed? If additional info is needed, please comment on that.
Thanks in advance!
For the statistics question: yes, it can happen; either your data has little noise, or it's the scenario Clock Slave mentioned in the comments.
For the import of the classifier, you can pickle it (save it as a binary file with the pickle module) and then just load it whenever you need it and use the clf.predict() method on the new data.
import pickle

# do the classification and name the fitted object clf
with open('clf.pickle', 'wb') as file:
    pickle.dump(clf, file, pickle.HIGHEST_PROTOCOL)
And then later you can load it
import pickle

with open('clf.pickle', 'rb') as file:
    clf = pickle.load(file)
# now predict on the new dataframe df
pred = clf.predict(df.values)
Besides pickle, joblib can be used as well.
from sklearn.linear_model import LogisticRegression
from sklearn.externals import joblib

# assume X and Y are already defined
model = LogisticRegression()
model.fit(X, Y)

# save the model to disk
filename = 'finalized_model.sav'
joblib.dump(model, filename)

# load the model from disk
loaded_model = joblib.load(filename)
result = loaded_model.score(X_test, Y_test)
I have trained a classifier model using RapidMiner after trying a lot of algorithms, and evaluated it on my dataset.
I also exported the model from RapidMiner as XML and pkl files, but I can't read it in my Python program (scikit-learn).
Is there any way to import RapidMiner classifier/model in a python program and use it to predict or classify new data in my end application?
Practically, I would say no - just train your model in sklearn from the beginning if that's where you want it.
Your RapidMiner model is some kind of object. The two formats you are exporting as are just storage methods. Sklearn models are a different kind of object. You can't directly save one and load it into the other. A similar example would be to ask if you can take an airplane engine and load it into a train.
To do what you're asking, you'll need to take the underlying data that your classifier saved, find the format, and then figure out a way to get it in the same format as a sklearn classifier. This is dependent on what type of classifier you have. For example, if you're using a bayesian model, you could somehow capture the prior probabilities and then use those, but this isn't trivial.
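To make that concrete, here is a hedged sketch for the multinomial Naive Bayes case. It assumes you have already managed to extract the log priors and per-class log likelihoods from the RapidMiner export into arrays (class_labels, class_log_prior, and feature_log_prob are placeholders for whatever you recover), and it transplants them into an sklearn estimator instead of calling fit():
import numpy as np
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
# set the learned parameters directly, bypassing fit()
clf.classes_ = np.array(class_labels)
clf.class_log_prior_ = np.array(class_log_prior)
clf.feature_log_prob_ = np.array(feature_log_prob)  # shape (n_classes, n_features)
pred = clf.predict(X_new)  # X_new must use the same feature encoding RapidMiner used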
You could use the PMML extension for RapidMiner to export your model.
For Python there is, for example, the augustus library, which can work with PMML files.
In text mining/classification, when a vectorizer is used to transform text into numerical features, TfidfVectorizer(...).fit_transform(text) or TfidfVectorizer(...).fit(text) is used during training. In testing, it is supposed to reuse that training information and just transform the data according to the training fit.
In the general case, the test run(s) are completely separate from the training run. But the test run needs some information about the fit obtained during the training stage, otherwise the transformation fails with the error sklearn.utils.validation.NotFittedError: idf vector is not fitted. It's not just a dictionary; it's something else.
What should be saved after training is done, to make the test stage pass smoothly?
In other words, train and test are separated in time and space; how do I make the test work, utilizing the training results?
A deeper question would be what 'fit' means in the scikit-learn context, but that's probably out of scope.
In the test phase you should use the same fitted objects you used in the training phase; that way you can reuse the model parameters that were derived during training. Here is an example below.
First, give a name to your vectorizer and to your predictive algorithm (it is NB in this case):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

vectorizer = TfidfVectorizer()
classifier = MultinomialNB()
Then, use these names to vectorize and predict your data
trainingdata_counts = vectorizer.fit_transform(trainingdata.values)
classifier.fit(trainingdata_counts, trainingdatalabels)
testdata_counts = vectorizer.transform(testdata.values)
predictions = classifier.predict(testdata_counts)
This way, your code will be able to process the training and test phases consistently.
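If your training and test runs are separate scripts, a minimal sketch of the same idea is to persist the fitted vectorizer alongside the classifier and reload both at test time (the pickle filenames here are arbitrary):
# at the end of the training script
import pickle
with open('vectorizer.pickle', 'wb') as f:
    pickle.dump(vectorizer, f)
with open('classifier.pickle', 'wb') as f:
    pickle.dump(classifier, f)
# at the start of the test script
import pickle
with open('vectorizer.pickle', 'rb') as f:
    vectorizer = pickle.load(f)
with open('classifier.pickle', 'rb') as f:
    classifier = pickle.load(f)
testdata_counts = vectorizer.transform(testdata.values)  # transform only, no fitting
predictions = classifier.predict(testdata_counts)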