Generate Test data using TfIdfVectorizer

I have split my data into train and test parts. My data table has a 'text' column and ten other columns holding numerical features. I used TfidfVectorizer on the training data to generate the term matrix and combined it with the numerical features to create the training dataframe.
tfidf_vectorizer=TfidfVectorizer(use_idf=True, max_features=5000, max_df=0.95)
tfidf_vectorizer_train = tfidf_vectorizer.fit_transform(X_train['text'].values)
df1_tfidf_train = pd.DataFrame(tfidf_vectorizer_train.toarray(), columns=tfidf_vectorizer.get_feature_names())
df2_train = df_main_ques.iloc[train_index][traffic_metrics]#to collect numerical features
df_combined_train = pd.concat([df1_tfidf_train, df2_train], axis=1)
To calculate the tf-idf scores for the test part, I need to reuse what was learned from the training data set. I am not sure how to generate the test data part.
Related posts:
[1] Append tfidf to pandas dataframe: discusses only the creation of the training dataset.
[2] How does TfidfVectorizer compute scores on test data: discusses the test data part, but it is not clear how to generate the test dataframe that contains both terms and numerical features.

You can use the transform method of the already fitted vectorizer on your test data. This reuses the vocabulary and IDF weights learned from the training set to generate the TF-IDF scores for the test set:
tfidf_vectorizer_test = tfidf_vectorizer.transform(X_test['text'].values)

Related

ValueError: X has 2 features, but SVC is expecting 472082 features as input

I am loading Linear SVM model and then predicting new data using the stored trained SVM Model. I used TFIDF while training such as:
vector = TfidfVectorizer(ngram_range=(1, 3)).fit(data['text'])
When I apply new data, I get an error at prediction time:
ValueError: X has 2 features, but SVC is expecting 472082 features as input.
Code for the Prediction of new data
Linear_SVC_classifier = joblib.load("/content/drive/MyDrive/dataset/Classifers/Linear_SVC_classifier.sav")
test_data = input("Enter Data for Testing: ")
newly_testing_data = vector.transform(test_data)
SVM_Prediction_NewData = Linear_SVC_classifier.predict(newly_testing_data)
I want to predict new data using the stored SVM model without re-applying TF-IDF to the training data each time I pass data to the model. When I use new data for prediction, the prediction line raises the error. Is there any way to fix this?

The problem is due to your creation of a new TfidfVectorizer by fitting it on the test dataset. As the classifier has been trained on a matrix generated by the TfidfVectorizer fitted on the training dataset, it expects the test dataset to have the exact same dimensions.
In order to do so, you need to transform your test dataset with the same vectorizer that was used during training rather than initialize a new one based on the test set.
The vectorizer fitted on the train set can be pickled and stored for later use to avoid any re-fitting at inference time.
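A minimal sketch of that workflow, assuming joblib is used for storage as in the question. The training texts, labels, and file names are placeholders, and a temporary directory stands in for the real model folder. Note that transform() expects an iterable of documents, so a single input string is wrapped in a list.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical training data.
texts = ["good movie", "great film", "bad movie", "terrible film"]
labels = [1, 1, 0, 0]

# Training time: fit the vectorizer and the classifier, then store both.
vectorizer = TfidfVectorizer(ngram_range=(1, 3)).fit(texts)
classifier = LinearSVC().fit(vectorizer.transform(texts), labels)

model_dir = tempfile.mkdtemp()  # placeholder location
vectorizer_path = os.path.join(model_dir, "vectorizer.sav")
classifier_path = os.path.join(model_dir, "classifier.sav")
joblib.dump(vectorizer, vectorizer_path)
joblib.dump(classifier, classifier_path)

# Inference time: load both objects and call transform() only, never fit().
loaded_vectorizer = joblib.load(vectorizer_path)
loaded_classifier = joblib.load(classifier_path)

new_text = "great movie"
# Wrap the single string in a list so it is treated as one document.
features = loaded_vectorizer.transform([new_text])
prediction = loaded_classifier.predict(features)
```

Because the loaded vectorizer carries the training vocabulary, `features` has the same number of columns the classifier was trained on.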

adding more data to Support Vector Classifier training

I am using LinearSVC() from scikit-learn to classify texts into a maximum of seven labels, so it is a multilabel classification problem. I am training on a small amount of data and testing it. Now I want to add more data (retrieved from a pool based on a criterion) to the fitted model and evaluate on the same test set. How can this be done?
Question:
Is it necessary to merge the previous data set with the new data set, preprocess everything, and then retrain to see whether performance improves with the old + new data?
My code so far is below:
def preprocess(data, x, y):
    global Xfeatures
    global y_train
    global labels
    porter = PorterStemmer()
    multilabel = MultiLabelBinarizer()
    y_train = multilabel.fit_transform(data[y])
    print("\nLabels are now binarized\n")
    data[multilabel.classes_] = y_train
    labels = multilabel.classes_
    print(labels)
    data[x].apply(lambda x: nt.TextFrame(x).noise_scan())
    print("\nEnglish stop words were extracted\n")
    data[x].apply(lambda x: nt.TextExtractor(x).extract_stopwords())
    corpus = data[x].apply(nfx.remove_stopwords)
    corpus = data[x].apply(lambda x: porter.stem(x))
    tfidf = TfidfVectorizer()
    Xfeatures = tfidf.fit_transform(corpus).toarray()
    print('\nThe text is now vectorized\n')
    return Xfeatures, y_train

Xfeatures, y_train = preprocess(df1, 'corpus', 'zero_level_name')
Xfeatures_train = Xfeatures[:300]
y_train_features = y_train[:300]
X_test = Xfeatures[300:400]
y_test = y_train[300:400]
X_pool = Xfeatures[400:]
y_pool = y_train[400:]

def model(modelo, tipo):
    svc = modelo
    clf = tipo(svc)
    clf.fit(Xfeatures_train, y_train_features)
    clf_predictions = clf.predict(X_test)
    return clf_predictions

preds_pool = model(LinearSVC(class_weight='balanced'), OneVsRestClassifier)

It depends on your previous dataset. If it was already a good representation of the problem at hand, adding more data will not improve your model's performance by much, so you can simply test with the new data.
However, it is also possible that your initial dataset was not representative enough, in which case more data can increase classification accuracy. Then it is better to merge all the data and preprocess it again, because preprocessing generally involves parameters that are computed on the dataset as a whole, for example TF-IDF (which you use) or a mean, both of which are sensitive to the dataset at hand.
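One way to compare the two situations is to refit the whole preprocessing-plus-classifier chain once on the original training texts and once on training plus pool texts, scoring both on the same held-out test set. The sketch below uses invented texts and tags and a hypothetical helper `fit_and_score`; it leaves out the stemming and stop-word steps from the question. Putting TfidfVectorizer inside a pipeline means the vocabulary is always learned from the fitting data only, never from the test set.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical data: initial training set, extra pool, fixed test set.
train_texts = ["cheap flights to rome", "football match tonight", "rome football stadium"]
train_tags = [["travel"], ["sport"], ["travel", "sport"]]
pool_texts = ["hotel deals in paris", "tennis final results"]
pool_tags = [["travel"], ["sport"]]
test_texts = ["paris hotel", "tennis match"]
test_tags = [["travel"], ["sport"]]

# Fix the label set up front so every split is binarized identically.
binarizer = MultiLabelBinarizer(classes=["sport", "travel"])
y_train = binarizer.fit_transform(train_tags)
y_pool = binarizer.transform(pool_tags)
y_test = binarizer.transform(test_tags)


def fit_and_score(texts, y):
    """Fit a fresh pipeline on (texts, y) and score it on the fixed test set."""
    model = make_pipeline(
        TfidfVectorizer(),
        OneVsRestClassifier(LinearSVC(class_weight="balanced")),
    )
    model.fit(texts, y)
    predictions = model.predict(test_texts)
    return f1_score(y_test, predictions, average="micro", zero_division=0)


score_before = fit_and_score(train_texts, y_train)
score_after = fit_and_score(train_texts + pool_texts, np.vstack([y_train, y_pool]))
```

Comparing `score_before` with `score_after` shows whether the pool data helped on this particular test set; on a toy example like this the numbers carry no real meaning.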

Improving classification by using clustering as a feature

I'm trying to improve my classification results by clustering and using the cluster assignments as another feature (or using them alone instead of all other features, not sure yet).
So let's say that I'm using an unsupervised algorithm, GMM:
gmm = GaussianMixture(n_components=4, random_state=RSEED)
gmm.fit(X_train)
pred_labels = gmm.predict(X_test)
I trained the model with training data and predicted the clusters by the test data.
Now I want to use a classifier (KNN for example) and use the clustered data within it. So I tried:
# Define the model and parameters
knn = KNeighborsClassifier()
parameters = {'n_neighbors': [3, 5, 7],
              'leaf_size': [1, 3, 5],
              'algorithm': ['auto', 'kd_tree'],
              'n_jobs': [-1]}
# Fit the model
model_gmm_knn = GridSearchCV(knn, param_grid=parameters)
model_gmm_knn.fit(pred_labels.reshape(-1, 1), Y_train)
model_gmm_knn.best_params_
But I'm getting:
ValueError: Found input variables with inconsistent numbers of samples: [418, 891]
Train and test do not have the same dimensions.
So how can I implement such an approach?

Your method is not correct - you are attempting to use as a single feature the cluster labels of your test data pred_labels, in order to fit a classifier with your training labels Y_train. Even in the huge coincidental case that the dimensions of these datasets were the same (hence not giving a dimension mismatch error, as here), this is conceptually wrong and does not actually make any sense.
What you actually want to do is:
1. Fit a GMM with your training data.
2. Use this fitted GMM to get cluster labels for both your training and test data.
3. Append the cluster labels as a new feature in both datasets.
4. Fit your classifier with this "enhanced" training data.
All in all, and assuming that your X_train and X_test are pandas dataframes, here is the procedure:
import pandas as pd
gmm.fit(X_train)
cluster_train = gmm.predict(X_train)
cluster_test = gmm.predict(X_test)
X_train['cluster_label'] = pd.Series(cluster_train, index=X_train.index)
X_test['cluster_label'] = pd.Series(cluster_test, index=X_test.index)
model_gmm_knn.fit(X_train, Y_train)
Notice that you should not fit your clustering model with your test data, only with your training data; otherwise you have data leakage similar to the one encountered when using the test set for feature selection, and your results will be both invalid and misleading.
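If the features are NumPy arrays rather than dataframes, the same procedure can be sketched as follows. The data here is synthetic (generated with make_blobs) and the hyperparameters are arbitrary choices for illustration, not tuned values.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real dataset.
X, y = make_blobs(n_samples=200, centers=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the clustering model on the training data only.
gmm = GaussianMixture(n_components=4, random_state=0).fit(X_train)

# Append the predicted cluster label as an extra column to both splits.
X_train_aug = np.column_stack([X_train, gmm.predict(X_train)])
X_test_aug = np.column_stack([X_test, gmm.predict(X_test)])

# Train and evaluate the classifier on the augmented features.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train_aug, y_train)
accuracy = knn.score(X_test_aug, y_test)
```

Cluster labels are arbitrary integers, so for a distance-based classifier such as KNN it may be worth one-hot encoding them; that is a design choice outside the scope of this sketch.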

How to print clusters of SVM in python

I want to classify the rows of a column using an SVM. I can find plenty of content online that produces graphs or prints prediction accuracy, but I cannot find a way to print the cluster (class) assigned to each row. The example below explains what I am trying to do:
I have a dataframe to be used as the training dataset
import pandas as pd
train_data = {'Serial': [1,2,3,4,5,6,7,8,9,10],
              'Text': ['Dog is a faithful animal','cat are not reliable','Tortoise can live a long life',
                       'camel stores water in its hump','horse are used as means of transport','pen is a powerful weapon',
                       'stop when the signal is red','oxygen is a life gas','chocolates are bad for health','lets grab a cup of coffee'],
              'classification':['Animal','Animal','Animal','Animal','Animal','Thing','Thing','Miscellenous','Thing','Thing']
              }
df = pd.DataFrame(train_data, columns = ['Serial', 'Text', 'classification'])
print (df)
I want to predict whether each text row is talking about Animal, Thing, or Miscellaneous. The test data I want to pass is
test_data = {'Serial': [1,2,3,4,5],
'Text': ['Is this your dog?','Lets talk about the problem','You have a cat eye',
'Donot forget to take the camel ride when u goto dessert','Plants give us O2']
}
df = pd.DataFrame(test_data, columns = ['Serial', 'Text'])
The expected result is an additional column 'Classification' created in the test dataframe with values ['Animal','Miscellenous','Animal','Animal','Miscellenous']

Here is the solution to your problem:
# import tfidf-vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# import support vector classifier
from sklearn.svm import SVC
import pandas as pd
train_data = {'Serial': [1,2,3,4,5,6,7,8,9,10],
'Text': ['Dog is a faithful animal','cat are not reliable','Tortoise can live a long life',
'camel stores water in its hump','horse are used as means of transport','pen is a powerful weapon',
'stop when the signal is red','oxygen is a life gas','chocolates are bad for health','lets grab a cup of coffee'],
'classification':['Animal','Animal','Animal','Animal','Animal','Thing','Thing','Miscellenous','Thing','Thing']
}
train_df = pd.DataFrame(train_data, columns = ['Serial', 'Text', 'classification'])
display(train_df)
test_data = {'Serial': [1,2,3,4,5],
'Text': ['Is this your dog?','Lets talk about the problem','You have a cat eye',
'Donot forget to take the camel ride when u goto dessert','Plants give us O2']
}
test_df = pd.DataFrame(test_data, columns = ['Serial', 'Text'])
display(test_df)
# Load training data (text) from the dataframe and form to a list containing all the entries
training_data = train_df['Text'].tolist()
# Load training labels from the dataframe and form to a list as well
training_labels = train_df['classification'].tolist()
# Load testing data from the dataframe and form a list
testing_data = test_df['Text'].tolist()
# Get a tfidf vectorizer to process the text into vectors
vectorizer = TfidfVectorizer()
# Fit the tfidf-vectorizer to training data and transform the training text into vectors
X_train = vectorizer.fit_transform(training_data)
# Transform the testing text into vectors
X_test = vectorizer.transform(testing_data)
# Get the SVC classifier
clf = SVC()
# Train the SVC with the training data (data points and labels)
clf.fit(X_train, training_labels)
# Predict the test samples
print(clf.predict(X_test))
# Add classification results to test dataframe
test_df['Classification'] = clf.predict(X_test)
# Display test dataframe
display(test_df)
As an explanation for the approach:
You have your training data and want to use it to train an SVM and then predict labels for the test data.
That means you need to extract the training data and labels for each data point (so for each phrase, you need to know whether it's an animal or a thing etc.) and then set up and train an SVM. Here, I used the implementation from scikit-learn.
Moreover, you can't train the SVM with raw text data, because it requires numerical values. This means you need to transform the text data into numbers. This is "feature extraction from text", and one of the common approaches is term frequency-inverse document frequency (TF-IDF).
Now you can use a vector representation of each phrase coupled with a label for it to train the SVM and then use it to classify the test data :)
In short the steps are:
1. Extract data points and labels from training
2. Extract data points from testing
3. Set up SVM classifier
4. Set up TF-IDF vectorizer and fit it to training data
5. Transform training data and testing data with tf-idf vectorizer
6. Train the SVM classifier
7. Classify test data with trained classifier
I hope this helps!

When scaling the data, why does the train dataset use 'fit' and 'transform', but the test dataset only 'transform'?

import numpy as np
from random import seed, randint
from sklearn.preprocessing import StandardScaler

SAMPLE_COUNT = 5000
TEST_COUNT = 20000
seed(0)
sample = list()
test_sample = list()
# Randomly sample training and test lines while streaming through the file
for index, line in enumerate(open('covtype.data')):
    if index < SAMPLE_COUNT:
        sample.append(line)
    else:
        r = randint(0, index)
        if r < SAMPLE_COUNT:
            sample[r] = line
        else:
            k = randint(0, index)
            if k < TEST_COUNT:
                if len(test_sample) < TEST_COUNT:
                    test_sample.append(line)
                else:
                    test_sample[k] = line
# Parse each sampled line into a list of floats (list() is needed on Python 3)
for n, line in enumerate(sample):
    sample[n] = list(map(float, line.strip().split(',')))
y = np.array(sample)[:, -1]
scaling = StandardScaler()
X = scaling.fit_transform(np.array(sample)[:, :-1])  # here fit and transform are used
for n, line in enumerate(test_sample):
    test_sample[n] = list(map(float, line.strip().split(',')))
yt = np.array(test_sample)[:, -1]
Xt = scaling.transform(np.array(test_sample)[:, :-1])  # why is only transform used here?
As the comments ask, why is only transform used for Xt, without fit?

We use fit_transform() on the train data so that we learn the scaling parameters from the train data and scale the train data at the same time.
We only use transform() on the test data because we apply the scaling parameters learned on the train data to the test data.
This is the standard procedure for scaling. You always learn your scaling parameters on the train set and then use them on the test set. Here is an article that explains it very well: https://sebastianraschka.com/faq/docs/scale-training-test.html
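As a small illustration with made-up numbers, the parameters that StandardScaler learns during fit are stored in its `mean_` and `scale_` attributes, and transform simply applies them:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: one feature, three training rows, two test rows.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.0], [4.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean_ and scale_ from train
X_test_scaled = scaler.transform(X_test)        # applies the same mean_ and scale_

# The same arithmetic done by hand with the stored training statistics.
manual_test = (X_test - scaler.mean_) / scaler.scale_
```

Here `scaler.mean_` is the training mean (2.0), so the test value 2.0 maps to 0 even though the mean of the test set itself is 3.0.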

We have two datasets : The training and the test dataset. Imagine we have just 2 features :
'x1' and 'x2'.
Now consider this (A very hypothetical example):
A sample in the training data has values: 'x1' = 100 and 'x2' = 200
When scaled, 'x1' gets a value of 0.1 and 'x2' a value of 0.1 too. The response variable value is 100 for this. These have been calculated w.r.t only the training data's mean and std.
A sample in the test data has the values : 'x1' = 50 and 'x2' = 100. When scaled according to the test data values, 'x1' = 0.1 and 'x2' = 0.1. This means that our function will predict response variable value of 100 for this sample too. But this is wrong. It shouldn't be 100. It should be predicting something else because the not-scaled values of the features of the 2 samples mentioned above are different and thus point to different response values. We will know what the correct prediction is only when we scale it according to the training data because those are the values that our linear regression function has learned.
I have tried to explain the intuition behind this logic below:
We decide to scale both the features in the training dataset before fitting the linear regression function. When we scale the features of the training dataset, all 'x1' values get adjusted according to the mean and the standard deviation of the 'x1' values across the training samples. The same thing happens for the 'x2' feature.
This essentially means that every feature has been transformed into a new number based on just the training data. It's as if every feature value has been given a relative position, relative to the mean and std of just the training data. So every sample's new 'x1' and 'x2' values depend on the mean and the std of the training data only.
Now what happens when we fit the linear regression function is that it learns the parameters (i.e, learns to predict the response values) based on the scaled features of our training dataset. That means that it is learning to predict based on those particular means and standard deviations of 'x1' and 'x2' of the different samples in the training dataset. So the value of the predictions depends on the:
* learned parameters, which in turn depend on the
* values of the features of the training data (which have been scaled). And because of the scaling, the training data's features depend on the
* training data's mean and std.
If we now fit StandardScaler() to the test data, the test data's 'x1' and 'x2' will have their own mean and std. This means that the new values of both features will be relative only to the data in the test set and thus will have no connection whatsoever to the training data. It's almost as if a random value has been subtracted from them and they have been divided by another random value, so their new values do not convey how they are related to the training data.
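The effect can be reproduced with a few made-up numbers. In this sketch the test values are exactly half the training values, yet scaling the test set with its own statistics makes the two sets look identical to the model:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data; the test values are half the training values.
X_train = np.array([[100.0], [200.0], [300.0]])
X_test = np.array([[50.0], [100.0], [150.0]])

train_scaled = StandardScaler().fit_transform(X_train)

# Wrong: the test set is scaled with its own mean and std.
wrong = StandardScaler().fit_transform(X_test)

# Right: the test set is scaled with the training mean and std.
right = StandardScaler().fit(X_train).transform(X_test)
```

With the wrong approach, the raw value 50 in the test set receives the same scaled value as the raw value 100 in the training set. With the right approach, the raw value 100 maps to the same scaled value in both sets.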

Any transformation you do to the data must be done by the parameters generated by the training data.
Simply put, fit() creates a model that extracts the various parameters from your training samples, to be used for the necessary transformation later on. transform(), on the other hand, performs the actual transformation on the data itself, returning a standardized or scaled form.
fit_transform() is just a more convenient way of performing fit() and transform() consecutively.
The important thing here is that when you divide your dataset into train and test sets, what you are trying to achieve is to simulate a real-world application. In a real-world scenario you will only have training data; you will develop a model on it and predict unseen instances of similar data.
If you transform the entire data with fit_transform() and then split it into train and test, you violate that simulation and do the transformation according to the unseen examples as well. This will inevitably result in an optimistic model, as you have already partly prepared your model with the statistics of the unseen samples.
If you split the data into train and test and apply fit_transform() to both, you will also be mistaken, as the transformation of the train data will be based on the train split's statistics only and the transformation of the test data on the test split's statistics only.
The right way to do this preprocessing is to fit any transformer on the train data only and then apply the transformation to the test data. Only then can you be sure that your resulting model represents a real-world solution.
Following this, it actually doesn't matter whether you
fit(train) then transform(train) then transform(test) OR
fit_transform(train) then transform(test)
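A quick check of that equivalence, using StandardScaler and random data as an arbitrary example:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 3))
X_test = rng.normal(size=(5, 3))

# Option 1: fit(train), then transform(train), then transform(test).
scaler_a = StandardScaler().fit(X_train)
train_a = scaler_a.transform(X_train)
test_a = scaler_a.transform(X_test)

# Option 2: fit_transform(train), then transform(test).
scaler_b = StandardScaler()
train_b = scaler_b.fit_transform(X_train)
test_b = scaler_b.transform(X_test)

same_result = np.allclose(train_a, train_b) and np.allclose(test_a, test_b)
```

Both options learn the same parameters from the training data, so `same_result` is True.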

fit() computes the parameters needed for the transformation, and transform() scales the data into the standard format for the model.
fit_transform() is a combination of the two that does the above work efficiently.
Since fit_transform() has already computed and stored the parameters while transforming the training data, only the transformation of the testing data is left. Therefore only transform() is used on the test data instead of fit_transform().

There could be two approaches:
1st approach: fit and transform the train data, and only transform the test data
2nd approach: fit and transform the whole set (train + test)
Think about how the model will handle scaling when it goes live: when new data arrives, it will behave just like the unseen test data in your backtest.
In the 1st case, new data will just be transformed with the existing scaler, and the scaled values from your backtest remain unchanged.
But in the 2nd case, when new data comes you will need to fit and transform the whole dataset again. That means the scaled backtest values will no longer be the same, and you then need to re-train the model. If this task can be done quickly then I guess it is OK,
but the 1st case requires less work.
And if there are big differences between the scaling in train and test, then the data is probably non-stationary and ML is probably not a good idea.

When the transformer is an imputer, fit() and transform() are the two methods used to handle the missing values in the dataset. The missing values can be filled by computing the mean or the median of the data and filling the empty places with it.
fit() is used to calculate the mean or the median.
transform() is used to fill in missing values with the calculated mean or median.
fit_transform() performs the above two tasks in a single step.
fit_transform() is used for the training data. For the validation set only transform() is required, since you don't want to change the way you handle missing values when it comes to the validation set; by doing so you may take your model by surprise, and it may fail to perform as expected.
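A minimal sketch of this with scikit-learn's SimpleImputer and made-up values; the mean strategy is chosen here only as an example:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy single-feature data with missing entries.
X_train = np.array([[1.0], [np.nan], [3.0]])
X_valid = np.array([[np.nan], [10.0]])

imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train)  # learns the training mean and fills with it
X_valid_filled = imputer.transform(X_valid)      # fills with the same training mean
```

The learned value is stored in `imputer.statistics_`. The missing validation entry is filled with the training mean of 2.0, not with anything computed from the validation set.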

We use fit() or fit_transform() to learn (to train the model) on the train data set. transform() can then be used with the trained model on the test data set.

fit_transform() - learn the scaling parameters (train data)
transform() - apply those learned scaling parameters (test data)
ss = StandardScaler()
X_train = ss.fit_transform(X_train)  # learns the scaling parameters from the training data and scales it
X_test = ss.transform(X_test)  # uses the learned parameters to transform the test data
