CountVectorizer test data loss?

CountVectorizer test data loss? - python

I am working on a small project that requires using different classification models on BoW data.
I understand that the train and test data must be different to get the model's true accuracy.
For model.score() to work correctly, I need to give it test data and labels in the same dimensions as the initial data. But the test data is in different dimensions, so I do it like this:
vectorizer = CountVectorizer()
traindata_bow = vectorizer.fit_transform(traindata)
testdata_bow = vectorizer.transform(testdata)
Now, the test data has the same dimensions as the initial train data.
Now on to my question:
Test data has its own set of dimensions/"characteristics".
So by transforming it using the vectorizer are we not losing any of the test data's characteristics?
I am asking because my model's accuracy ends up being in the 99.9% range and I worry something gets calculated incorrectly (Though my dataset is quite easy)
For example, after the code above:
traindata_bow.shape is (35918, 34319) and
testdata_bow.shape is (8980, 34319)
But if I run:
testdata_bow = vectorizer.fit_transform(testdata) i get
testdata_bow.shape is (8980, 20806)
So is there any data loss (or even partial merging with the train data) in the transform stage?

Related

How to fit Word2Vec on test data?

I am working on a Sentiment Analysis problem. I am using Gensim's Word2Vec to vectorize my data in the following way:
# PREPROCESSING THE DATA
# SPLITTING THE DATA
from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y = train_test_split(x, y, test_size = 0.2, random_state = 69, stratify = y)
train_x2 = train_x['review'].to_list()
test_x2 = test_x['review'].to_list()
# CONVERT TRIAN DATA INTO NESTED LIST AS WORD2VEC EXPECTS A LIST OF LIST TOKENS
train_x3 = [nltk.word_tokenize(k) for k in train_x2]
test_x3 = [nltk.word_tokenize(k) for k in test_x2]
# TRAIN THE MODEL ON TRAIN SET
from gensim.models import Word2Vec
model = Word2Vec(train_x3, min_count = 1)
key_index = model.wv.key_to_index
# MAKE A DICT
we_dict = {word:model.wv[word] for word in key_index}
# CONVERT TO DATAFRAME
import pandas as pd
new = pd.DataFrame.from_dict(we_dict)
The new dataframe is the vectorized form of the train data. Now how do I do the same process for the test data? I can't pass the whole corpus (train+test) to the Word2Vec instance as it might lead to data leakage. Should I simply pass the test list to another instance of the model as:
model = Word2Vec(test_x3, min_count = 1)
I dont think so this would be the correct way. Any help is appreciated!
PS: I am not using the pretrained word2vec in an LSTM model. What I am doing is training the Word2Vec on the data that I have and then feeding it to a ML algorithm like RF or LGBM. Hence I need to vectorize the test data separately.

Note that because word2vec is an unsupervised algorithm, it can sometimes be defensible to use all available texts to train it. That includes texts with known labels that you're witthiolding from other supervised-classification steps as test/validation records.
You just make sure the labels themselves aren't in the training data, but still use the bulk unlabeled text for further unsupervised improvement of the raw word-vectors. Those vectors, influenced by all the input text (but none of the known-answer labels) are then used for enhanced feature-modeling of the texts, as input to later supervised label-aware steps.
(Whether this is Ok for your project may depend on what future performance you want your various accuracy/etc evaluation measures to be reasonably estimate. Is it new situations where everything always must be trained from scratch, and where relevant raw text and labels as training data are both scarce? Or situations where the corpus always grows, & text is always plentiful even if labels are expensive to acquite, or where any actual deployed classifiers will be able to leverage other unlabeled texts before committing to a prediction?)
But note also: word-vectors are only comparison-compatible with each other when trained together, into a shared space. (Or, made compatible via other less-common post-training alginment steps.) There's no single right place for any word's vector, just a good relative position, with regard to everything trained in the same session – which used randomization in both initialization, & training, so even repeated runs on the same training data can yield end models of approximately-equivalent usefulness with wildly-different word-coordinates.
So, when withholding your test-set texts from initial word2vec training, you might alternatives never train a separate word2vec model on just the test texts, but rather use the frozen word2vec model from training data.
Separately: min_count=1 is almost always a bad idea for word2vec models, & if you're tempted to do so, you may have far too little data for such a data-hungry algorithm to show its true value. (If using it on the datasets where it really shines, you should be more often raising that threshold above its default – discarding more rare words – than lowering it to save every rare, hard-to-model-well word.)

Apply machine learning model to another dataset

I'm trying to get into machine learning, and I've been following this tutorial:
https://www.analyticsvidhya.com/blog/2021/05/classification-algorithms-in-python-heart-attack-prediction-and-analysis/
Near the end, we split the dataset into training and testing using train_test_split
x = data3.drop("output", axis=1)
y = data3["output"]
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3)
That is, we use the same dataset for training and testing, 70% for training and 30% for testing.
But how can I use another dataset to test my model ?
One scenario came to mind: "You trained your model in 250 patients, now test it against these 3 patients that we have, so we can see the chances of them having a heart attack".
How can I, instead of splitting the data, use another csv/dataframe as a test ? Assuming this test data has the same format as the train, just fewer rows.

train_test_split(x,y,test_size=0.3) only divides data into training and testing set. After training the model on training data, you can use your other data for testing too. This function is mainly for splitting current data and you can use any data for testing purposes. You just have to make sure the attributes and the type are same as the training data. If you have to test on 3 patients, all you have to do is to pass the patients data into model.predict() function as a dataframe or an array depends on data.

Just as you load one dataframe from a file:
data1 = pd.read_csv("heart.csv")
If you had two separate data files, you'd load them into separate data files and skip the train_test_split step.
train_df = pd.read_csv("heart_train.csv")
test_df = pd.read_csv("heart_test.csv")
Since the two dataframes are already separate, you just have to make sure you do any cleaning and pre-processing steps on both of them, including removal of the target variable (y).

how can i test my model using different dataset in machine learning

i m new in machine learning and i am create a one small project using CountVectorizer model. i am split my data to 80% -20%. 80% for training model and 20% for testing model. my model work properly run on 20% test data but can i used to test my model on different data set that is similar to training data set?
i am using joblib for dump and load my model.
from joblib import dump, load
dump(pipe, filename)
loaded_model = load('filename')
my question is how i directly test my model using different dataset?

Yes, you can use the model to test similar datasets.
However, you must keep in mind the preprocessing step according to the model.
When you trained your model, it was trained on a particular dimension and the size of input would have been AxB matric. When you have a new test sentence or new dataset, you must first do the same preprocessing, otherwise, it will throw dimension mismatch errors.
Example:
suppose you have the following count vectorizer object
cv = CountVectorizer()
then you must first fit it on your training dataset, for say
X = dataframe['text_column_name']
X = cv.fit_transform(X) # Fit the Data
Once this is done, whenever you have a new sentence, say
test_sentence = "this is a test sentence"
then you must use the cv object in the following manner
model_input = cv.transform([test_sentence]).toarray()
and then you can make predictions:
model.predict(model_input)
This method must be followed even if you wish to test a new dataset which is in a data frame or some other file format.

How to get ordered list of labels after fitting sklearn

train_index, test_index = next(iter(ShuffleSplit(821, train_size=0.2, test_size=0.80, random_state=42)))
print train_index, len(train_index)
print test_index, len(test_index)
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(features, labels, train_size=0.33, random_state=42)
clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test, labels_test)
print pred, len(pred)
A few questions from this code:
Why do I need the cross_validation.train_test_split line in order to fit and predict with my classifier? (I am not doing any preprocessing on my data except for stopword removal I have already done)
Do the test and train indexes correspond to the classified & predicted labels? My goal is to get all my labels, in their original order, after fitting and predicting them. My features and labels used for training and testing are from a pandas dataframe (two columns), and I need the predicted labels, in order, so that I can feed them back into the pandas dataframe.
Is there a way to predict the labels for the whole set, and not just the test set?

tl;dr
Because your decision tree classifier has to be trained before it can predict anything. It's not a magic algorithm. It has to be shown examples of what to do before it can work out what to do on other things.
cross_validation.test_train_split() facilitates this by splitting your data into a test and training dataset in such a way that you can analyse how well it performed later on. Without this, you have no way of assessing how well your decision tree classifier actually performed.
You can create your own testing and training data without test_train_split() (and I suspect that was what you were trying to do with ShuffleSplit()), but you will need at least some training data.
test_index and train_index have nothing to do with your data. Full stop. They come from a randomly generated process that is completely unrelated to what test_train_split() does.
The purpose of ShuffleSplit() is to give you the indices to partition your data into training and test yourself. test_train_split() will instead choose their own indices and partition based on those indices. You should either use one or the other and sensibly.
Yes. You can always just call
pred = clf.predict(features) or pred = clf.predict(features_test + features_train)
The Full Story
You need cross_validation if you want to do this right. The whole purpose of cross-validation is to avoid overfit.
Basically, if you run your model on both the training and the testing data, then your model is going to perform really well on the training set (because, well, that's what you trained it on) and that's going to skew your overall metrics of how well your model will perform on real data.
It's a lot like asking a student to perform in an exam and then in real life: if you want to know whether your student learned from the process of preparing for an exam, you don't give him another exam, you ask him to demonstrate his skills in the real world dealing with unknown and complex data.
If you want to know if your model will be useful, then you want to cross-validate. Wikipedia puts it best:
In a prediction problem, a model is usually given a dataset of known
data on which training is run (training dataset), and a dataset of
unknown data (or first seen data) against which the model is tested
(testing dataset).
The goal of cross validation is to define a
dataset to "test" the model in the training phase (i.e., the
validation dataset), in order to limit problems like overfitting, give
an insight on how the model will generalize to an independent dataset
(i.e., an unknown dataset, for instance from a real problem), etc.
cross_validation.train_test_split doesn't do anything except split the dataset into training and testing data for you.
But perhaps you don't care about metrics, and that's fine. The question then becomes: is it possible to run a decision tree classifier without a training dataset?
The answer is no. Decision tree classifiers are supervised algorithms: they need to be trained on data before they can generalise their model to new results. If you don't give them any data to train on, it will be unable to do anything with any data you feed it in predict.
Finally, while it is perfectly possible to get the labels for the whole set (see tl;dr) , it is a really bad idea if you actually care about whether or not you're getting sensible results.
You already have the labels for the testing and training data. You don't need another column that includes prediction on the testing data, because they'll either come out to be identical or close enough to identical.
I can't think of a single meaningful reason to get back predicted results for your training data short of trying to optimise how it's performing on your training data. If that's what you are trying to do, then do that. What you are doing right now is definitely not that, and I encourage you to think strongly about what your reasons are for blindly inserting numbers into your table without due cause to believe they actually mean something.
There are ways to improve this: get back an accuracy metric, for example, or try to do k-fold cross-validation to model accuracy, or look at log-loss or AUC or any one of number of metrics to gauge whether or not your model is performing well.

Using both ShuffleSplit and train_test_split is redundant. You do not even appear to be using the indices returned by ShuffleSplit.
An example of how to use the indices return by ShuffleSplit is below. X and y are np.array. X is number of instances by number of features. y contains the labels of each row.
train_inds, test_inds = train_test_split(range(len(y)),test_size=0.33, random_state=42)
X_train, y_train = X[train_inds], y[train_inds]
X_test , y_test = X[test_inds] , y[test_inds]
You should not test on your training data! But if you want to see what happens just do
pred = clf.predict(features_train)
Also you do not need to pass the labels to predict. You should be using
score = metrics.accuracy_score(y_test, pred)

Vectorizer where or how fit information is stored?

in text mining/classification when a vectorizer is used to transform a text into numerical features, in the training TfidfVectorizer(...).fit_transform(text) or TfidfVectorizer(...).fit(text) is used. In testing it supposes to utilize former training info and just transform the data following the training fit.
In general case the test run(s) is completely separate from train run. But it needs some info regarding the fit obtained during the training stage otherwise the transformation fails with error sklearn.utils.validation.NotFittedError: idf vector is not fitted . It's not just a dictionary, it's something else.
What should be saved after the training is done, to make the test stage passing smoothly?
In other words train and test are separated in time and space, how to make test working, utilizing training results?
Deeper question would be what 'fit' means in scikit-learn context, but it's probably out of scope

In the test phase you should use the same model names as you used in the trainings phase. In this way you will be able to use the model parameters which are derived in the training phase. Here is an example below;
First give a name to your vectorizer and to your predictive algoritym (It is NB in this case)
vectorizer = TfidVectorizer()
classifier = MultinomialNB()
Then, use these names to vectorize and predict your data
trainingdata_counts = vectorizer.fit_transform(trainingdata.values)
classifier.fit(trainingdata_counts, trainingdatalabels)
testdata_counts = vectorizer.transform(testdata.values)
predictions=classifier.predict(testdata_counts)
By this way, your code will be able to process training and the test phases continuously.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.