Updating the feature names in scikit-learn's TfidfVectorizer - python

I am trying out this code
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

train_data = ["football is the sport", "gravity is the movie", "education is imporatant"]
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')

print("Applying first train data")
X_train = vectorizer.fit_transform(train_data)
print(vectorizer.get_feature_names())

print("\n\nApplying second train data")
train_data = ["cricket", "Transformers is a film", "AIMS is a college"]
X_train = vectorizer.transform(train_data)
print(vectorizer.get_feature_names())

print("\n\nApplying fit transform onto second train data")
X_train = vectorizer.fit_transform(train_data)
print(vectorizer.get_feature_names())
The output for this one is
Applying first train data
[u'education', u'football', u'gravity', u'imporatant', u'movie', u'sport']
Applying second train data
[u'education', u'football', u'gravity', u'imporatant', u'movie', u'sport']
Applying fit transform onto second train data
[u'aims', u'college', u'cricket', u'film', u'transformers']
I gave the first set of data to the vectorizer using fit_transform, so it gave me feature names like [u'education', u'football', u'gravity', u'imporatant', u'movie', u'sport']. After that I applied another train set to the same vectorizer, but it gave me the same feature names because I didn't use fit or fit_transform. I want to know how to update the features of a vectorizer without overwriting the previous ones. If I use fit_transform again, the previous features get overwritten. So I want to update the feature list of the vectorizer to get something like [u'education', u'football', u'gravity', u'imporatant', u'movie', u'sport', u'aims', u'college', u'cricket', u'film', u'transformers']. How can I get that?

In sklearn terminology, this is called a partial fit and you can't do it with a TfidfVectorizer. There are two ways around this:
Concatenate the two training sets and re-vectorize;
use a HashingVectorizer, which supports partial fitting. However, it does not have a get_feature_names method, because it hashes features, so the original tokens aren't kept. On the plus side, it is much more memory efficient. (A sketch of this approach follows the output below.)
Example of the first approach:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
train_data1 = ["football is the sport", "gravity is the movie", "education is important"]
vectorizer = TfidfVectorizer(stop_words='english')
print("Applying first train data")
X_train = vectorizer.fit_transform(train_data1)
print(vectorizer.get_feature_names())
print("\n\nApplying second train data")
train_data2 = ["cricket", "Transformers is a film", "AIMS is a college"]
X_train = vectorizer.transform(train_data2)
print(vectorizer.get_feature_names())
print("\n\nApplying fit transform onto second train data")
X_train = vectorizer.fit_transform(train_data1 + train_data2)
print(vectorizer.get_feature_names())
Output:
Applying first train data
['education', 'football', 'gravity', 'important', 'movie', 'sport']
Applying second train data
['education', 'football', 'gravity', 'important', 'movie', 'sport']
Applying fit transform onto second train data
['aims', 'college', 'cricket', 'education', 'film', 'football', 'gravity', 'important', 'movie', 'sport', 'transformers']
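For completeness, here is a minimal sketch of the second approach (an illustration, not code from the original answer). HashingVectorizer is stateless: every batch is hashed into the same fixed-size space, so there is no vocabulary to update or overwrite:
from sklearn.feature_extraction.text import HashingVectorizer

# Each batch is hashed independently into the same fixed-size space
hasher = HashingVectorizer(stop_words='english', n_features=2**10)
X_batch1 = hasher.transform(["football is the sport", "gravity is the movie"])
X_batch2 = hasher.transform(["Transformers is a film", "AIMS is a college"])
print(X_batch1.shape, X_batch2.shape)  # both (2, 1024): dimensions stay consistent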

I found this question while googling the same issue that the OP raised. As mbatchkarov said, scikit-learn's TfidfVectorizer doesn't natively support partial fitting.
HashingVectorizer is usually a great alternative, but it really depends on your use case. Specifically, if you care a lot about representing infrequent terms precisely, then hash collisions will hurt performance.
So I went ahead and wrote my own implementation of "partial_fit" for both TfidfVectorizer and CountVectorizer (see here). Hope it's useful for other people reaching this post. Note that this kind of partial fitting does change the dimension of the vectorizer's output, since the whole point is to update the vocabulary (so take this into account when using it in a pipeline).
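To illustrate the idea only (this is not the linked implementation), a vocabulary-growing partial fit for CountVectorizer could look roughly like the sketch below; partial_fit_vocab is a hypothetical helper, and a TfidfVectorizer version would additionally need to track document frequencies and recompute idf_:
from sklearn.feature_extraction.text import CountVectorizer

def partial_fit_vocab(vectorizer, new_docs):
    # Hypothetical helper: append tokens from new_docs to the fitted
    # vocabulary_, so the output dimension grows with each new token.
    analyzer = vectorizer.build_analyzer()
    vocab = vectorizer.vocabulary_  # token -> column index
    for doc in new_docs:
        for token in analyzer(doc):
            if token not in vocab:
                vocab[token] = len(vocab)

vec = CountVectorizer(stop_words='english')
vec.fit(["football is the sport"])
partial_fit_vocab(vec, ["cricket is a sport"])
print(sorted(vec.vocabulary_))  # ['cricket', 'football', 'sport']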

Related

adding more data to Support Vector Classifier training

I am using the LinearSVC() available in scikit-learn to classify texts into a maximum of 7 labels. So, it is a multilabel classification problem. I am training on a small amount of data and testing it. Now, I want to add more data (retrieved from a pool based on a criterion) to the fitted model and evaluate on the same test set. How can this be done?
Question:
Is it necessary to merge the previous data set with the new data set, get everything preprocessed, and then retrain to see whether the performance improves with the old + new data?
My code so far is below:
# (imports inferred from the usage below)
import neattext as nt
import neattext.functions as nfx
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

def preprocess(data, x, y):
    global Xfeatures
    global y_train
    global labels
    porter = PorterStemmer()
    multilabel = MultiLabelBinarizer()
    y_train = multilabel.fit_transform(data[y])
    print("\nLabels are now binarized\n")
    data[multilabel.classes_] = y_train
    labels = multilabel.classes_
    print(labels)
    data[x].apply(lambda x: nt.TextFrame(x).noise_scan())
    print("\nEnglish stop words were extracted\n")
    data[x].apply(lambda x: nt.TextExtractor(x).extract_stopwords())
    corpus = data[x].apply(nfx.remove_stopwords)
    corpus = data[x].apply(lambda x: porter.stem(x))
    tfidf = TfidfVectorizer()
    Xfeatures = tfidf.fit_transform(corpus).toarray()
    print('\nThe text is now vectorized\n')
    return Xfeatures, y_train

Xfeatures, y_train = preprocess(df1, 'corpus', 'zero_level_name')

Xfeatures_train = Xfeatures[:300]
y_train_features = y_train[:300]
X_test = Xfeatures[300:400]
y_test = y_train[300:400]
X_pool = Xfeatures[400:]
y_pool = y_train[400:]

def model(modelo, tipo):
    svc = modelo
    clf = tipo(svc)
    clf.fit(Xfeatures_train, y_train_features)
    clf_predictions = clf.predict(X_test)
    return clf_predictions

preds_pool = model(LinearSVC(class_weight='balanced'), OneVsRestClassifier)
It depends on how representative your previous dataset was. If it was already a good representation of the problem at hand, then adding more data will not increase your model's performance by a large margin, so you can just test with the new data.
However, it is also possible that your initial dataset was not representative enough, and therefore your classification accuracy increases with more data. In that case it is better to include all the data and preprocess it together, because preprocessing generally includes parameters that are computed on the dataset as a whole; e.g., the TF-IDF statistics in your code are sensitive to the dataset at hand.
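A minimal sketch of the merge-and-refit approach, assuming the new rows arrive in a DataFrame df_new with the same columns as df1 (df_new is a hypothetical name):
import pandas as pd

# Concatenate old and new data, then rerun the same preprocessing so
# the TF-IDF statistics are computed over the combined corpus.
df_all = pd.concat([df1, df_new], ignore_index=True)
Xfeatures, y_train = preprocess(df_all, 'corpus', 'zero_level_name')
# Re-split and retrain as before, keeping the same held-out test set
# so the old and new models can be compared fairly.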

Mismatch dimension error when trying to predict using naive bayes

I am struggling with a dimension error when I try to predict using a naive Bayes classifier.
The data consists of a column of sentences and a column of sentiments (i.e. labels). I want to use a naive Bayes classifier to predict the sentiment of each sentence.
I start off by separating out the testing, training, and validation data sets:
import pandas as pd
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfVectorizer,
                                             TfidfTransformer)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_selection import SelectKBest, chi2

training_set, sentence_split_further, training_set_sentiments, sentiments_split_further = train_test_split(
    sentence_data.Sentence, sentence_data.Sentiment,
    test_size=.5, train_size=.5, random_state=1)
testing_set, validation_set, testing_set_sentiments, validation_set_sentiments = train_test_split(
    sentence_split_further, sentiments_split_further,
    test_size=.5, train_size=.5, random_state=1)
Then I create a feature matrix, apply TF-IDF, and prune to the best k words. I did all of this in a function I created called feature_selection_vector:
tfidf_testing_feature_matrix=feature_selection_vector(testing_set,testing_set_sentiments)
tfidf_validation_feature_matrix=feature_selection_vector(validation_set,validation_set_sentiments)
Here is the code for the feature_selection_vector function
def feature_selection_vector(sentence_data, sentiments):
    # creates the feature vector and calculates tf-idf
    vectorizer = CountVectorizer(analyzer='word',
                                 token_pattern=r'\b[a-zA-Z]{3,}\b',
                                 ngram_range=(1, 1))
    count_vectorized = vectorizer.fit_transform(sentence_data)
    tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
    vectorized = tfidf_transformer.fit_transform(count_vectorized)
    vector = pd.DataFrame(vectorized.toarray(),
                          index=['sentence ' + str(i)
                                 for i in range(1, 1 + len(sentence_data))],
                          columns=vectorizer.get_feature_names())
    selector = SelectKBest(chi2, k=1000)
    selector.fit(vector, sentiments)
    return vector
Now I want to fit the Naive Bayes Classifier with training data and then use the model to predict using testing data.
naive_bayes = MultinomialNB()
naive_bayes.fit(tfidf_training_feature_matrix,training_set_sentiments)
NBC_tfidf_sentiment_predicted=naive_bayes.predict(tfidf_testing_feature_matrix)
However I keep getting this error:
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 892 is different from 348)
The two sizes it is complaining about are the number of columns of the training set (892) and the number of columns of the testing set (348).
You cannot use fit_transform to get features for the validation and test sets, as you do here (through your feature_selection_vector() function).
fit_transform is used only once, with the training data; for the validation and test sets, a simple transform should be used instead, reusing the CountVectorizer and TfidfTransformer that have already been fitted to the training data.
In your code, both the CountVectorizer and the TfidfTransformer are fitted again on the validation and test data, leading to a different number of features, and eventually to the error you report.
For more details, see What is the difference between fit_transform and transform in sklearn countvectorizer?
You should seriously consider wrapping up all the stages in a pipeline.
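For illustration, a minimal sketch of such a pipeline, reusing the imports and variable names from the question (the step names are arbitrary):
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('vect', CountVectorizer(analyzer='word',
                             token_pattern=r'\b[a-zA-Z]{3,}\b')),
    ('tfidf', TfidfTransformer(smooth_idf=True, use_idf=True)),
    # k must not exceed the number of features produced upstream
    ('select', SelectKBest(chi2, k=1000)),
    ('nb', MultinomialNB()),
])

# fit learns the vocabulary, idf weights and feature selection from the
# training data only; predict applies the already-fitted transformations.
pipe.fit(training_set, training_set_sentiments)
predictions = pipe.predict(testing_set)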

ValueError in while predict where Test data is having different shape of word vector [duplicate]

This question already has an answer here: Testing text classification ML model with new data fails (1 answer). Closed 2 years ago.
Below is the code I am trying for a text classification model:
from sklearn.feature_extraction.text import TfidfVectorizer
ifidf_vectorizer = TfidfVectorizer()
X_train_tfidf = ifidf_vectorizer.fit_transform(X_train)
X_train_tfidf.shape
(3, 16)
from sklearn.svm import LinearSVC
clf = LinearSVC()
clf.fit(X_train_tfidf,y_train)
So far, only the training set has been vectorized into a full vocabulary. In order to perform analysis on the test set, I need to submit it to the same procedure.
So I did:
X_test_tfidf = ifidf_vectorizer.fit_transform(X_test)
X_test_tfidf.shape
(2, 12)
And finally, when trying to predict, it shows this error:
predictions = clf.predict(X_test_tfidf)
ValueError: X has 12 features per sample; expecting 16
But when I use a Pipeline (from sklearn.pipeline import Pipeline), it works fine.
Can't I code it the way I was trying?
The error comes from calling fit_transform on the test data. You fit_transform the training data and only transform the test data:
# change this
X_test_tfidf = ifidf_vectorizer.fit_transform(X_test)
X_test_tfidf.shape
(2, 12)
# to
X_test_tfidf = ifidf_vectorizer.transform(X_test)
X_test_tfidf.shape
Reasons:
When you call fit, the vectorizer learns its vocabulary (and idf weights) from the data; transform then uses that learned state to map documents to vectors. You use the train data to learn the vectorization, then apply it to both train and test with transform.
If you do a fit_transform on the test data, you throw away the vectorization learned from the training data and replace it with one learned from the test data. Given that your test data differs from (and is smaller than) your train data, you get two different vectorizations, hence the shape mismatch.
A Better Way
The best way to do what you do is using Pipelines which will make your flow easy to understand
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
clf = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer()),
    ('model', LinearSVC()),
])
# train
clf.fit(X_train,y_train)
# predict
clf.predict(X_test)
This is easier, as the transformations are taken care of for you. You don't have to worry about fit_transform when fitting the model, or transform when predicting or scoring.
You can access the steps independently if you wish, with:
clf.named_steps['vectorizer']  # or 'model'
Under the hood, when you call clf.fit, your data passes through your vectorizer using fit_transform and then on to the model. When you predict or score, your data passes through your vectorizer with transform before reaching your model.
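For example, after clf.fit(X_train, y_train) you can inspect the fitted vectorizer through its step name (a small usage sketch based on the pipeline above):
vocab = clf.named_steps['vectorizer'].vocabulary_
print(len(vocab))  # number of features the fitted model expects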
Your code fails because you are refitting the vectorizer with .fit_transform() on the test set X_test. However, you should only transform the data with the already fitted vectorizer:
X_test_tfidf = ifidf_vectorizer.transform(X_test)
Now it should work as expected. You fit the ifidf_vectorizer only on X_train and transform all data according to it. This ensures that the same vocabulary is used and that you get outputs of the same shape.

How to print clusters of SVM in python

I want to classify rows of a column using the SVM clustering method. I can find lots of content on the net that produces graphs or prints prediction accuracy, but I cannot find a way to print my clusters. The example below will better explain what I am trying to do:
I have a dataframe to be used as the training dataset:
import pandas as pd
train_data = {'Serial': [1,2,3,4,5,6,7,8,9,10],
              'Text': ['Dog is a faithful animal','cat are not reliable','Tortoise can live a long life',
                       'camel stores water in its hump','horse are used as means of transport','pen is a powerful weapon',
                       'stop when the signal is red','oxygen is a life gas','chocolates are bad for health','lets grab a cup of coffee'],
              'classification':['Animal','Animal','Animal','Animal','Animal','Thing','Thing','Miscellenous','Thing','Thing']
             }
df = pd.DataFrame(train_data, columns=['Serial', 'Text', 'classification'])
print(df)
I want to predict whether the text row is talking about Animal, Thing, or Miscellenous. The test data I want to pass is:
test_data = {'Serial': [1,2,3,4,5],
             'Text': ['Is this your dog?','Lets talk about the problem','You have a cat eye',
                      'Donot forget to take the camel ride when u goto dessert','Plants give us O2']
            }
df = pd.DataFrame(test_data, columns=['Serial', 'Text'])
The expected result is an additional column 'Classification' created in the test dataframe with the values ['Animal','Miscellenous','Animal','Animal','Miscellenous'].
Here is the solution to your problem:
# import tfidf-vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# import support vector classifier
from sklearn.svm import SVC
import pandas as pd
train_data = {'Serial': [1,2,3,4,5,6,7,8,9,10],
              'Text': ['Dog is a faithful animal','cat are not reliable','Tortoise can live a long life',
                       'camel stores water in its hump','horse are used as means of transport','pen is a powerful weapon',
                       'stop when the signal is red','oxygen is a life gas','chocolates are bad for health','lets grab a cup of coffee'],
              'classification':['Animal','Animal','Animal','Animal','Animal','Thing','Thing','Miscellenous','Thing','Thing']
             }
train_df = pd.DataFrame(train_data, columns=['Serial', 'Text', 'classification'])
display(train_df)

test_data = {'Serial': [1,2,3,4,5],
             'Text': ['Is this your dog?','Lets talk about the problem','You have a cat eye',
                      'Donot forget to take the camel ride when u goto dessert','Plants give us O2']
            }
test_df = pd.DataFrame(test_data, columns=['Serial', 'Text'])
display(test_df)
# Load training data (text) from the dataframe and form to a list containing all the entries
training_data = train_df['Text'].tolist()
# Load training labels from the dataframe and form to a list as well
training_labels = train_df['classification'].tolist()
# Load testing data from the dataframe and form a list
testing_data = test_df['Text'].tolist()
# Get a tfidf vectorizer to process the text into vectors
vectorizer = TfidfVectorizer()
# Fit the tfidf-vectorizer to training data and transform the training text into vectors
X_train = vectorizer.fit_transform(training_data)
# Transform the testing text into vectors
X_test = vectorizer.transform(testing_data)
# Get the SVC classifier
clf = SVC()
# Train the SVC with the training data (data points and labels)
clf.fit(X_train, training_labels)
# Predict the test samples
print(clf.predict(X_test))
# Add classification results to test dataframe
test_df['Classification'] = clf.predict(X_test)
# Display test dataframe
display(test_df)
As an explanation for the approach:
You have your training data and want to use it to train an SVM and then predict labels for the test data.
That means you need to extract the training data points and labels (so for each phrase, you need to know whether it's an animal or a thing etc.) and then set up and train an SVM. Here, I used the implementation from scikit-learn.
Moreover, you can't just train the SVM on raw text data, because it requires numerical values. This means you need to transform the text data into numbers, known as "feature extraction from text"; a common approach for this is the Term Frequency-Inverse Document Frequency (TF-IDF) concept.
Now you can use a vector representation of each phrase coupled with a label for it to train the SVM and then use it to classify the test data :)
In short the steps are:
Extract data points and labels from training
Extract data points from testing
Set up SVM classifier
Set up TF-IDF vectorizer and fit it to training data
Transform training data and testing data with tf-idf vectorizer
Train the SVM classifier
Classify test data with trained classifier
I hope this helps!

How does tfidf transform test data after being fitted to train data?

I am using the following code:
pipeline = Pipeline([('vect',
                      TfidfVectorizer(ngram_range=(1, 2),
                                      stop_words="english",
                                      sublinear_tf=True,
                                      use_idf=True,
                                      norm='l2')),
                     ('reduce_dim',
                      SelectPercentile(f_classif, percentile=90)),
                     ('clf',
                      SVC(kernel='linear', C=1.0,
                          probability=True, max_iter=70000,
                          class_weight='balanced'))])
model = pipeline.fit(X_train,y_train)
model.predict(X_test)
x=vectorizer.fit_transform(X_train_text)
y=vectorizer.transform(X_test_text)
As per my understanding, pipeline.fit() fits the tf-idf step on the train data, and when model.predict() is called on X_test, it only does a tf-idf transformation based on the fitted train data.
Since tf-idf works by computing frequencies of words in each document and across the corpus, I am wondering what happens underneath in the .fit_transform and .transform functions.
1) Something pretty close to your question you can find here: What is the difference between TfidfVectorizer.fit_transfrom and tfidf.transform?
2) The tf-idf transformation is done inside fit_transform. predict here doesn't correspond to the tf-idf vectorizer, as it doesn't have such a function; it is a method of SVC.
Here is the basic documentation of fit() and fit_transform().
Your understanding of the workings is correct. When fitting, the parameters of the tf-idf vectorizer (its vocabulary and idf weights) are learned from the training data. These parameters are stored and used later to just transform the testing data.
Training data - fit_transform()
Testing data - transform()
If you want to look at the inner workings, you should have a look at the source code.
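To make this concrete, here is a small illustration (with made-up two-document data) of what fit learns and what transform reuses:
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()
vec.fit(["the cat sat", "the dog sat"])  # learns vocabulary_ and idf_
print(vec.vocabulary_)  # token -> column index, fixed after fit
print(vec.idf_)         # idf weights computed from the training corpus only
# transform counts only tokens already in vocabulary_ and multiplies by the
# stored idf_; words unseen during fit (here "ate") are simply ignored.
print(vec.transform(["the cat ate"]).toarray())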
