I'm new to machine learning and I'm creating a small project using a CountVectorizer model. I split my data 80%/20%: 80% for training the model and 20% for testing it. My model runs properly on the 20% test data, but can I use it to test on a different dataset that is similar to the training dataset?
I am using joblib to dump and load my model.
from joblib import dump, load
dump(pipe, 'filename')
loaded_model = load('filename')
My question is: how do I directly test my model on a different dataset?
Yes, you can use the model to test similar datasets.
However, you must keep in mind the preprocessing step according to the model.
When you trained your model, it was trained on a particular dimension: the size of the input would have been an A×B matrix. When you have a new test sentence or a new dataset, you must first apply the same preprocessing; otherwise it will throw dimension-mismatch errors.
Example:
suppose you have the following CountVectorizer object:
cv = CountVectorizer()
then you must first fit it on your training dataset, say:
X = dataframe['text_column_name']
X = cv.fit_transform(X) # Fit the Data
Once this is done, whenever you have a new sentence, say
test_sentence = "this is a test sentence"
then you must use the cv object in the following manner
model_input = cv.transform([test_sentence]).toarray()
and then you can make predictions:
model.predict(model_input)
This method must be followed even if you wish to test a new dataset that is in a dataframe or some other file format.
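Since you mentioned joblib, here is a minimal sketch of how the whole flow could look end to end. It assumes dataframe is your labeled training data and wraps the vectorizer and a classifier in a single Pipeline so both are saved together; the classifier choice (MultinomialNB), the column names, and the file names are placeholders, not something from your code:
from joblib import dump, load
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
import pandas as pd

# Fit the vectorizer and classifier together so both are persisted in one object
pipe = Pipeline([("cv", CountVectorizer()), ("clf", MultinomialNB())])
pipe.fit(dataframe["text_column_name"], dataframe["label_column_name"])
dump(pipe, "my_model.joblib")

# Later, on a different but similarly formatted dataset
loaded_model = load("my_model.joblib")
new_data = pd.read_csv("new_data.csv")
predictions = loaded_model.predict(new_data["text_column_name"])
Because the pipeline contains the fitted CountVectorizer, the loaded model applies the same transform to the new text before predicting.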
Related
I am working on a Sentiment Analysis problem. I am using Gensim's Word2Vec to vectorize my data in the following way:
# PREPROCESSING THE DATA
# SPLITTING THE DATA
from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y = train_test_split(x, y, test_size = 0.2, random_state = 69, stratify = y)
train_x2 = train_x['review'].to_list()
test_x2 = test_x['review'].to_list()
# CONVERT TRAIN DATA INTO A NESTED LIST, AS WORD2VEC EXPECTS A LIST OF LISTS OF TOKENS
train_x3 = [nltk.word_tokenize(k) for k in train_x2]
test_x3 = [nltk.word_tokenize(k) for k in test_x2]
# TRAIN THE MODEL ON TRAIN SET
from gensim.models import Word2Vec
model = Word2Vec(train_x3, min_count = 1)
key_index = model.wv.key_to_index
# MAKE A DICT
we_dict = {word:model.wv[word] for word in key_index}
# CONVERT TO DATAFRAME
import pandas as pd
new = pd.DataFrame.from_dict(we_dict)
The new dataframe is the vectorized form of the train data. Now how do I do the same process for the test data? I can't pass the whole corpus (train+test) to the Word2Vec instance as it might lead to data leakage. Should I simply pass the test list to another instance of the model as:
model = Word2Vec(test_x3, min_count = 1)
I don't think this would be the correct way. Any help is appreciated!
PS: I am not using the pretrained word2vec in an LSTM model. What I am doing is training the Word2Vec on the data that I have and then feeding it to a ML algorithm like RF or LGBM. Hence I need to vectorize the test data separately.
Note that because word2vec is an unsupervised algorithm, it can sometimes be defensible to use all available texts to train it. That includes texts with known labels that you're withholding from other supervised-classification steps as test/validation records.
You just make sure the labels themselves aren't in the training data, but still use the bulk unlabeled text for further unsupervised improvement of the raw word-vectors. Those vectors, influenced by all the input text (but none of the known-answer labels) are then used for enhanced feature-modeling of the texts, as input to later supervised label-aware steps.
(Whether this is OK for your project may depend on what future performance you want your various accuracy/etc evaluation measures to reasonably estimate. Is it new situations where everything always must be trained from scratch, and where relevant raw text and labels as training data are both scarce? Or situations where the corpus always grows, & text is always plentiful even if labels are expensive to acquire, or where any actual deployed classifiers will be able to leverage other unlabeled texts before committing to a prediction?)
But note also: word-vectors are only comparison-compatible with each other when trained together, into a shared space. (Or, made compatible via other less-common post-training alignment steps.) There's no single right place for any word's vector, just a good relative position, with regard to everything trained in the same session – which used randomization in both initialization & training, so even repeated runs on the same training data can yield end models of approximately-equivalent usefulness with wildly-different word-coordinates.
So, when withholding your test-set texts from the initial word2vec training, you might alternatively never train a separate word2vec model on just the test texts, but rather reuse the frozen word2vec model fit on the training data to vectorize them, as in the sketch below.
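A rough sketch of that, reusing model, train_x3 and test_x3 from your code, and assuming plain averaging of word-vectors as the per-document feature (one common choice among several, not the only one):
import numpy as np

def avg_vector(tokens, w2v_model):
    # Average the vectors of in-vocabulary tokens; out-of-vocabulary words are skipped
    vecs = [w2v_model.wv[w] for w in tokens if w in w2v_model.wv]
    if not vecs:
        return np.zeros(w2v_model.vector_size)
    return np.mean(vecs, axis=0)

# Word2Vec was fit only on train_x3; the same frozen model vectorizes the test texts
train_features = np.vstack([avg_vector(tokens, model) for tokens in train_x3])
test_features = np.vstack([avg_vector(tokens, model) for tokens in test_x3])
# train_features / test_features can then be fed to RF, LGBM, etc.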
Separately: min_count=1 is almost always a bad idea for word2vec models, & if you're tempted to do so, you may have far too little data for such a data-hungry algorithm to show its true value. (If using it on the datasets where it really shines, you should be more often raising that threshold above its default – discarding more rare words – than lowering it to save every rare, hard-to-model-well word.)
I'm trying to get into machine learning, and I've been following this tutorial:
https://www.analyticsvidhya.com/blog/2021/05/classification-algorithms-in-python-heart-attack-prediction-and-analysis/
Near the end, we split the dataset into training and testing using train_test_split
x = data3.drop("output", axis=1)
y = data3["output"]
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3)
That is, we use the same dataset for training and testing, 70% for training and 30% for testing.
But how can I use another dataset to test my model?
One scenario came to mind: "You trained your model on 250 patients; now test it against these 3 patients that we have, so we can see their chances of having a heart attack."
How can I, instead of splitting the data, use another csv/dataframe as a test? Assume this test data has the same format as the training data, just fewer rows.
train_test_split(x, y, test_size=0.3) only divides the current data into a training and a testing set. After training the model on the training data, you can use any other data for testing too; the function is just for splitting what you already have. You only have to make sure the attributes and their types are the same as in the training data. If you have to test on 3 patients, all you have to do is pass the patients' data into the model.predict() function as a dataframe or an array, depending on the data.
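For example, a small sketch of that last step (the file name new_patients.csv is made up; the columns have to match the training features, and model stands for whatever classifier you already fitted on x_train / y_train):
import pandas as pd

# Load the new patients and keep the columns in the same order as the training data
new_patients = pd.read_csv("new_patients.csv")
new_patients = new_patients[x_train.columns]

predictions = model.predict(new_patients)
probabilities = model.predict_proba(new_patients)  # chance of heart attack, if the classifier supports it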
Just as you load one dataframe from a file:
data1 = pd.read_csv("heart.csv")
If you had two separate data files, you'd load them into separate dataframes and skip the train_test_split step:
train_df = pd.read_csv("heart_train.csv")
test_df = pd.read_csv("heart_test.csv")
Since the two dataframes are already separate, you just have to make sure you do the same cleaning and pre-processing steps on both of them, including removing the target variable (y) from the features.
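For example, using the output column from the question, the split into features and target could look like this:
x_train = train_df.drop("output", axis=1)
y_train = train_df["output"]
x_test = test_df.drop("output", axis=1)
y_test = test_df["output"]
# then fit and evaluate as before, e.g. model.fit(x_train, y_train) and model.score(x_test, y_test)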
I am working on a small project that requires using different classification models on BoW data.
I understand that the train and test data must be different to get the model's true accuracy.
For model.score() to work correctly, I need to give it test data and labels with the same dimensions as the initial data. But the test data has different dimensions, so I do it like this:
vectorizer = CountVectorizer()
traindata_bow = vectorizer.fit_transform(traindata)
testdata_bow = vectorizer.transform(testdata)
Now, the test data has the same dimensions as the initial train data.
Now on to my question:
Test data has its own set of dimensions/"characteristics".
So by transforming it using the vectorizer, aren't we losing some of the test data's characteristics?
I am asking because my model's accuracy ends up being in the 99.9% range, and I worry that something is getting calculated incorrectly (though my dataset is quite easy).
For example, after the code above:
traindata_bow.shape is (35918, 34319) and
testdata_bow.shape is (8980, 34319)
But if I run:
testdata_bow = vectorizer.fit_transform(testdata)
I get:
testdata_bow.shape is (8980, 20806)
So is there any data loss (or even partial merging with the train data) in the transform stage?
The issue:
I am confused as to why we transform our test data using the CountVectorizer fitted on our train data for bag of words classification.
Why would we not create a new CountVectorizer, fit it to the test data, and have the classifier predict on that instead?
Looking here: How to standardize the bag of words for train and test?
Ripped from the answer:
LabeledWords=pd.DataFrame(columns=['word','label'])
LabeledWords.append({'word':'Church','label':'Religion'} )
vectorizer = CountVectorizer()
Xtrain,yTrain=vectorizer.fit_transform(LabeledWords['word']).toarray(),vectorizer.fit_transform(LabeledWords['label']).toarray()
forest = RandomForestClassifier(n_estimators = 100)
clf=forest.fit(Xtrain,yTrain)
for each_word,label in Preprocessed_list:
test_featuresX.append(vectorizer.transform(each_word).toarray())
test_featuresY.append(label.toarray())
clf.score(test_featuresX,test_featuresY)
We can see the user created a CountVectorizer and fit it to the training data, then fit the classifier to the vectorized training data. Afterwards the user just transformed the test data using the CountVectorizer that was fit on the train data, and fed this into the classifier. Why is that?
What I am trying to accomplish:
I am trying to implement bag of visual words. It uses the same concept, but I am unsure how I should create my train and test sets for classification.
You want your test data to go through the same pipeline as your train data to make the difference between test and train as similar as possible to the difference between the real world and your model. The whole point to having test and train splits in your data is to help validate that your model generalizes, and allowing extra data from the test set to leak into the model trained on your train set prevents you from getting an accurate picture of that generalization ability.
Also, as @Juanpa.arrivillaga said, text processing opens cans of worms on top of the standard rules of thumb for predictive analytics. By using two different count vectorizers, you would have a model (in this case a random forest) trained to expect the first coordinate to correspond to some word like "apple", and then you would feed it some word like "grape". Any success you might have in such a scenario would be purely accidental.
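A tiny illustration of that mismatch (the sentences are made up, just to show that two separately fitted vectorizers assign different meanings to the same column positions):
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["i like apple pie", "apple juice is sweet"]
test_texts = ["grape juice is sweet"]

cv_train = CountVectorizer().fit(train_texts)
cv_test = CountVectorizer().fit(test_texts)

print(cv_train.vocabulary_)  # column indices defined by the training vocabulary
print(cv_test.vocabulary_)   # a different, incompatible set of columns

# Correct usage: reuse the train-fitted vectorizer so test columns line up with training
X_test = cv_train.transform(test_texts)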
In text mining/classification, when a vectorizer is used to transform text into numerical features, TfidfVectorizer(...).fit_transform(text) or TfidfVectorizer(...).fit(text) is used during training. At test time, it is supposed to reuse the information from training and just transform the data according to the training fit.
In the general case the test run(s) are completely separate from the train run. But the transform needs some information about the fit obtained during the training stage, otherwise it fails with the error sklearn.utils.validation.NotFittedError: idf vector is not fitted. It's not just a dictionary; it's something else.
What should be saved after the training is done, to make the test stage pass smoothly?
In other words, train and test are separated in time and space; how do I make the test work, utilizing the training results?
A deeper question would be what 'fit' means in the scikit-learn context, but that's probably out of scope.
In the test phase you should use the same fitted objects (the same variable names) as you used in the training phase. That way you will be able to use the model parameters that were derived during training. Here is an example below.
First, give a name to your vectorizer and to your predictive algorithm (it is NB in this case):
vectorizer = TfidfVectorizer()
classifier = MultinomialNB()
Then use these names to vectorize and predict your data:
trainingdata_counts = vectorizer.fit_transform(trainingdata.values)
classifier.fit(trainingdata_counts, trainingdatalabels)
testdata_counts = vectorizer.transform(testdata.values)
predictions=classifier.predict(testdata_counts)
This way, your code will be able to process the training and test phases consistently.
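If the train and test runs really are separate processes (separated in time and space, as the question puts it), one sketch of how to carry the fit over is to persist both fitted objects, for example with joblib as in the first question above (the file names here are placeholders):
from joblib import dump, load

# At the end of the training run: save the fitted vectorizer and classifier
dump(vectorizer, "vectorizer.joblib")
dump(classifier, "classifier.joblib")

# In the later, separate test run: reload and only transform/predict, never refit
vectorizer = load("vectorizer.joblib")
classifier = load("classifier.joblib")
testdata_counts = vectorizer.transform(testdata.values)
predictions = classifier.predict(testdata_counts)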