I was working on a text classification problem with Keras. When I tried to test the model I created, I could not get the TfidfVectorizer to work for classifying the test data.
from keras.models import model_from_json

# Load the saved architecture and weights
with open('model_architecture.json', 'r') as f:
    model = model_from_json(f.read())
model.load_weights('model_weights.h5')
After loading the model, I prepared a test list to use.
test_data=["sentence1","sentence2","sentence3"]
No problem so far
But..
tf=TfidfVectorizer(binary=True)
train=tf.fit_transform(test_data)
test=tf.transform(test_data)
print(model.predict_classes(test))
ValueError: Error when checking input: expected dense_1_input to have shape (11103,) but got array with shape (92,)
That is the error I get.
And I also tried
tf=TfidfVectorizer(binary=True)
test=tf.transform(test_data)
sklearn.exceptions.NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted.
but I received this error instead. I learned that fit() must be called before transform() can be used.
But I still can't test the model I trained.
You need to encode your test data using the exact same TfidfVectorizer object you fit on and used to transform the original training data, back when you originally trained the model. If you fit a different TfidfVectorizer to your test data, the encoding (including the vocabulary length) will be completely different and it will not work. This difference in vocabulary length is the proximate cause of the error you're seeing.
However, even if you did get the dimensions to match purely by chance, it still wouldn't work, because the model was trained with an encoding that maps "cat" to 42, or whatever, while you're testing it with an encoding that maps "cat" to 13 or something. You'd basically be feeding it scrambled nonsense.
There really is no alternative but to go and get the original TfidfVectorizer, or at least to fit a TfidfVectorizer to the exact same documents with the exact same configuration. If that is not possible, then you'll simply have to train a new model, and this time remember to save the TfidfVectorizer as well.
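To illustrate with a minimal sketch (train_texts here is a hypothetical name for whatever documents you originally trained on):
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(binary=True)

# Fit ONCE on the original training documents; this fixes the vocabulary
# (11103 terms in your case) and the idf weights.
X_train = tf.fit_transform(train_texts)

# ... build and train the Keras model on X_train ...

# At test time, only transform with the SAME object -- never fit again.
X_test = tf.transform(test_data)
print(model.predict_classes(X_test))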
Normally the fitted preprocessing objects are saved to a pickle file via pickle.dump() during the initial training, and loaded with pickle.load() for testing and production, similar to what you did for model_architecture.json and model_weights.h5. It is also convenient to put everything together into an sklearn Pipeline so you only have to pickle one object, but I'm not sure how that works together with the Keras model.
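A minimal sketch of that pattern, assuming the fitted vectorizer is called tf (the filename is arbitrary):
import pickle

# During the original training run, right after fitting the vectorizer:
with open('tfidf_vectorizer.pickle', 'wb') as f:
    pickle.dump(tf, f)

# Later, in the test/production run:
with open('tfidf_vectorizer.pickle', 'rb') as f:
    tf = pickle.load(f)

test = tf.transform(test_data)  # transform only, no fitting here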
Related
I have a TF model that's trained for sentiment analysis. After I compile and fit the model, save it, and then load it in another notebook session, locally or in Colab, the model's predictions are nowhere near the accuracy I got during training. I assume it is because of the tokenizer. I defined the tokenizer, called the fit_on_texts method on the training data, converted the texts to sequences, applied the padding, and passed the output to the model. When I have to predict on new text, I have to do the same process again, but this time I define a different tokenizer, and that yields random results.
This is how I create my tokenizer:
from keras.preprocessing import text, sequence

tokenizer = text.Tokenizer()
tokenizer.fit_on_texts(texts)  # texts: the list of training strings
text_tokenized = tokenizer.texts_to_sequences(texts)
X_text = sequence.pad_sequences(text_tokenized, maxlen=MAXLEN)
A solution I tried is pickling the main tokenizer object, loading it back in other sessions, and using it on the text to be predicted. That works. But I was wondering if there's something more convenient. Maybe adding the main tokenizer as a layer in the model itself, if that's possible? I tried researching a solution but came up short.
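For reference, the pickling approach I tried looks roughly like this (the filename and new_texts are just placeholders):
import pickle
from keras.preprocessing import sequence

# At the end of training, after fit_on_texts:
with open('tokenizer.pickle', 'wb') as f:
    pickle.dump(tokenizer, f)

# In the other session, before predicting on new text:
with open('tokenizer.pickle', 'rb') as f:
    tokenizer = pickle.load(f)

new_tokenized = tokenizer.texts_to_sequences(new_texts)
X_new = sequence.pad_sequences(new_tokenized, maxlen=MAXLEN)
predictions = model.predict(X_new)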
I am working on a small project that requires using different classification models on BoW data.
I understand that the train and test data must be different to get the model's true accuracy.
For model.score() to work correctly, I need to give it test data and labels with the same dimensions as the initial data. But the test data has different dimensions, so I do it like this:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
traindata_bow = vectorizer.fit_transform(traindata)
testdata_bow = vectorizer.transform(testdata)
Now, the test data has the same dimensions as the initial train data.
Now on to my question:
Test data has its own set of dimensions/"characteristics".
So by transforming it with the vectorizer fitted on the training data, aren't we losing some of the test data's characteristics?
I am asking because my model's accuracy ends up being in the 99.9% range and I worry something gets calculated incorrectly (Though my dataset is quite easy)
For example, after the code above:
traindata_bow.shape is (35918, 34319) and
testdata_bow.shape is (8980, 34319)
But if I run:
testdata_bow = vectorizer.fit_transform(testdata), I get
testdata_bow.shape is (8980, 20806)
So is there any data loss (or even partial merging with the train data) in the transform stage?
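For illustration, here is a toy version of what I'm describing (made-up sentences):
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
train_bow = vectorizer.fit_transform(["the cat sat", "the dog ran"])
test_bow = vectorizer.transform(["the cat flew"])

print(train_bow.shape)  # (2, 5) -- the columns are the training vocabulary: cat, dog, ran, sat, the
print(test_bow.shape)   # (1, 5) -- same columns; "flew" has no column in the training vocabulary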
I'm new to machine learning and I created a small project using a CountVectorizer model. I split my data 80%-20%: 80% for training the model and 20% for testing it. My model works properly on the 20% test data, but can I also use it to test on a different dataset that is similar to the training dataset?
I am using joblib to dump and load my model.
from joblib import dump, load
dump(pipe, filename)
loaded_model = load(filename)
My question is: how do I directly test my model using a different dataset?
Yes, you can use the model to test similar datasets.
However, you must keep in mind the preprocessing steps the model expects.
When you trained your model, it was trained on a particular dimensionality, and the input would have been an A×B matrix. When you have a new test sentence or a new dataset, you must first apply the same preprocessing; otherwise it will throw dimension-mismatch errors.
Example:
Suppose you have the following CountVectorizer object:
cv = CountVectorizer()
Then you must first fit it on your training dataset, say:
X = dataframe['text_column_name']
X = cv.fit_transform(X) # Fit the Data
Once this is done, whenever you have a new sentence, say
test_sentence = "this is a test sentence"
then you must use the cv object in the following manner
model_input = cv.transform([test_sentence]).toarray()
and then you can make predictions:
model.predict(model_input)
This method must be followed even if you wish to test a new dataset which is in a data frame or some other file format.
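For example, if the new data is in a dataframe, a minimal sketch might look like this (the file and column names are just placeholders):
import pandas as pd

# Hypothetical new dataset loaded into a dataframe.
new_df = pd.read_csv('new_dataset.csv')

# Reuse the SAME fitted cv object: transform only, never fit_transform.
model_input = cv.transform(new_df['text_column_name']).toarray()
predictions = model.predict(model_input)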
I'm trying to train a NLTK classifier for sentiment analysis and then save the classifier using pickle.
The freshly trained classifier works fine. However, if I load a saved classifier, it will output either 'positive' or 'negative' for ALL examples.
I'm saving the classifier using
import pickle
import nltk

classifier = nltk.NaiveBayesClassifier.train(training_set)
classifier.classify(words_in_tweet)

f = open('classifier.pickle', 'wb')
pickle.dump(classifier, f)
f.close()
and loading the classifier using
f = open('classifier.pickle', 'rb')
classifier = pickle.load(f)
f.close()
classifier.classify(words_in_tweet)
I'm not getting any errors.
Any idea what the problem could be, or how to debug this correctly?
The most likely place a pickled classifier can go wrong is with the feature extraction function. This must be used to generate the feature vectors that the classifier works with.
The NaiveBayesClassifier expects feature vectors for both training and classification; your code looks as if you passed the raw words to the classifier instead (but presumably only after unpickling, otherwise you wouldn't get different behavior before and after unpickling). You should store the feature extraction code in a separate file, and import it in both the training and the classifying (or testing) script.
I doubt this applies to the OP, but some NLTK classifiers take the feature extraction function as an argument to the constructor. When you have separate scripts for training and classifying, it can be tricky to ensure that the unpickled classifier successfully finds the same function. This is because of the way pickle works: pickling only saves data, not code. To get it to work, just put the extraction function in a separate file (module) that your scripts import. If you put it in the "main" script, pickle.load will look for it in the wrong place.
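A minimal sketch of that layout, with hypothetical module and function names:
# features.py -- lives in its own module so both scripts can import it
def extract_features(words_in_tweet):
    """Turn a list of tokens into the feature dict the classifier expects."""
    return {word: True for word in words_in_tweet}
Both the training script and the classifying script then do from features import extract_features and pass extract_features(words_in_tweet) to train() and classify(), so the exact same code builds the feature vectors in both places.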
In text mining/classification, when a vectorizer is used to transform text into numerical features, TfidfVectorizer(...).fit_transform(text) or TfidfVectorizer(...).fit(text) is used during training. Testing is supposed to reuse the information from training and just transform the data according to the training fit.
In the general case the test run(s) are completely separate from the training run, but they need some information about the fit obtained during training; otherwise the transformation fails with the error sklearn.utils.validation.NotFittedError: idf vector is not fitted. It's not just a dictionary; it's something else.
What should be saved after the training is done to make the test stage pass smoothly?
In other words, training and testing are separated in time and space; how do I make testing work using the training results?
A deeper question would be what 'fit' means in the scikit-learn context, but that's probably out of scope.
In the test phase you should use the same vectorizer and classifier objects that you used in the training phase. In this way you will be able to use the model parameters that were derived during training. Here is an example below.
First give a name to your vectorizer and to your predictive algorithm (it is NB, Naive Bayes, in this case):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
vectorizer = TfidfVectorizer()
classifier = MultinomialNB()
Then, use these names to vectorize and predict your data
trainingdata_counts = vectorizer.fit_transform(trainingdata.values)
classifier.fit(trainingdata_counts, trainingdatalabels)
testdata_counts = vectorizer.transform(testdata.values)
predictions=classifier.predict(testdata_counts)
This way, your code will be able to handle the training and test phases consistently.
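If the training and test runs really are separate processes, the fitted objects themselves have to be persisted between them; a minimal sketch with pickle (filenames are arbitrary):
import pickle

# End of the training run: persist the fitted vectorizer and classifier.
with open('vectorizer.pickle', 'wb') as f:
    pickle.dump(vectorizer, f)
with open('classifier.pickle', 'wb') as f:
    pickle.dump(classifier, f)

# Start of the test run: load them back, then transform (never fit) the test data.
with open('vectorizer.pickle', 'rb') as f:
    vectorizer = pickle.load(f)
with open('classifier.pickle', 'rb') as f:
    classifier = pickle.load(f)

testdata_counts = vectorizer.transform(testdata.values)
predictions = classifier.predict(testdata_counts)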