Ttfidfvectorizer adjust test by training frequencies in pipeline during cross-validation - python

I have a text classification task based on documents, where I expect the classes are related to word frequencies. Because of the specific nature of my application, where I have a corpus that will grow over time and want to classify new documents as they arrive, I have used FeatureHasher rather than the existing TFidfVectorizer (which both vectorizes and does adjustment), since the vocabulary size can grow with new documents.
As discussed here for instance (https://stats.stackexchange.com/questions/154660/tfidfvectorizer-should-it-be-used-on-train-only-or-traintest), it seems correct to me that the term frequencies when doing TFIDF should be calculated relative to the train set only, then used to rescale the test set, rather than first doing rescaling on the entire corpus and then splitting. This is because using the test dataset for frequency calculations is violating the principle that you shouldn't use this information.
Let's assume you start with a matrix X of raw term frequencies (not adjusted yet) and y, a vector of classes. The typical order that many code examples show is:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
vec = TfidfTransformer()
#rescale X by its own frequencies, then split
X = vec.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
#...now fit a model
but the correct thing should be the following:
vec = TfidfTransformer()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
#store rescaling based on X_train frequencies alone
vec.fit(X_train)
#resacale each (transform) by the same model
X_train = vec.transform(X_train)
X_test = vec.transform(X_test)
#...now fit a model
Okay, now the main question: I want to conduct some kind of cross-validation, perhaps with GridSearchCV, where I can feed it a set of potential model parameters and conduct several splits of the data for each one. The typical way to do this is to build a model pipeline and then feed it into the cross-validation utility. Since pipelines are kind of black boxes that are hard to view the details of, I just wanted to verify whether, if TfidfTransformer is included as a step in the pipeline, that it does the adjustment correctly, as I've mentioned above, by conducting the adjustment on the training data of each split.

Related

How to use multiclassification model to make predicitions in entire dataframe

I have trained multiclassification models in my training and test sets and have achieved good results with SVC. Now, I want to use the model o make predictions in my entire dataframe, but when I get the following error: ValueError: X has 36976 features, but SVC is expecting 8989 features as input.
My dataframe has two columns: one with the categories (which I manually labeled for around 1/5 of the dataframe) and the text columns with all the texts (including those that have not been labeled).
data={'categories':['1','NaN','3', 'NaN'], 'documents':['Paragraph 1.\nParagraph 2.\nParagraph 3.', 'Paragraph 1.\nParagraph 2.', 'Paragraph 1.\nParagraph 2.\nParagraph 3.\nParagraph 4.', ''Paragraph 1.\nParagraph 2.']}
df=pd.DataFrame(data)
First, I drop the rows with Nan values in the 'categories' column. I then, create the document term matrix, define the 'y', and split into training and test sets.
tf = CountVectorizer(tokenizer=word_tokenize)
X = tf.fit_transform(df['documents'])
y = df['categories']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
Second, I run the SVC model getting good results:
from sklearn.svm import SVC
svm = SVC(C=0.1, class_weight='balanced', kernel='linear', probability=True)
model = svm.fit(X_train, y_train)
print('accuracy:', model.score(X_test, y_test))
y_pred = model.predict(X_test)
print(metrics.classification_report(y_test, y_pred))
Finally, I try to apply the the SVC model to predict the categories of the entire column 'documents' of my dataframe. To do so, I create the document term matrix of the entire column 'documents' and then apply the model:
tf_entire_df = CountVectorizer(tokenizer=word_tokenize)
X_entire_df = tf_entire_df.fit_transform(df['documents'])
y_pred_entire_df = model.predict(X_entire_df)
Bu then I get the error that my X_entire_df has more features than the SVC model is expecting as input. I magine that this is because now I am trying to apply the model to the whole column documents, but I do know how to fix this.
I would appreciate your help!
These issues usually comes from the fact that you are feeding the model with unknown or unseen data (more/less features than the one used for training).
I would strongly suggest you to use sklearn.pipeline and create a pipeline to include preprocessing (CountVectorizer) and your machine learning model (SVC) in a single object.
From experience, this helps a lot to avoid tedious complex preprocessing fitting issues.

How do I split a dataset into training and testing whilst retaining the proportions of binary data (i.e some drugs work some don't)?

I have a dataset of drugs, associated chemical features and whether they are "responsive" or "Unresponsive". I need to ensure that once I split the dataset into test and train they both have the same proportion of responsive:unresponsive. I know how to randomly split the data where training is 80% and test is 20%. Not sure how to do the stratified sampling necessary here, is this what I'm meant to use - https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html?
The train_test_split function already has one parameters that allows you keeping the proportion of y. The parameter is stratify; and is defined in the documentation as "If not None, data is split in a stratified fashion, using this as the class labels".
An example of code would be:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

different score when using train_test_split before vs after SMOTETomek

I'm trying to classify a text to a 6 different classes.
Since I'm having an imbalanced dataset, I'm also using SMOTETomek method that should synthetically balance the dataset with additional artificial samples.
I've noticed a huge score difference when applying it via pipeline vs 'Step by step" where the only difference is (I believe) the place I'm using train_test_split
Here are my features and labels:
for curr_features, label in self.training_data:
features.append(curr_features)
labels.append(label)
algorithms = [
linear_model.SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None),
naive_bayes.MultinomialNB(),
naive_bayes.BernoulliNB(),
tree.DecisionTreeClassifier(max_depth=1000),
tree.ExtraTreeClassifier(),
ensemble.ExtraTreesClassifier(),
svm.LinearSVC(),
neighbors.NearestCentroid(),
ensemble.RandomForestClassifier(),
linear_model.RidgeClassifier(),
]
Using Pipeline:
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)
# Provide Report for all algorithms
score_dict = {}
for algorithm in algorithms:
model = Pipeline([
('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('smote', SMOTETomek()),
('classifier', algorithm)
])
model.fit(X_train, y_train)
# Score
score = model.score(X_test, y_test)
score_dict[model] = int(score * 100)
sorted_score_dict = {k: v for k, v in sorted(score_dict.items(), key=lambda item: item[1])}
for classifier, score in sorted_score_dict.items():
print(f'{classifier.__class__.__name__}: score is {score}%')
Using Step by Step:
vectorizer = CountVectorizer()
transformer = TfidfTransformer()
cv = vectorizer.fit_transform(features)
text_tf = transformer.fit_transform(cv).toarray()
smt = SMOTETomek()
X_smt, y_smt = smt.fit_resample(text_tf, labels)
X_train, X_test, y_train, y_test = train_test_split(X_smt, y_smt, test_size=0.2, random_state=0)
self.test_classifiers(X_train, X_test, y_train, y_test, algorithms)
def test_classifiers(self, X_train, X_test, y_train, y_test, classifiers_list):
score_dict = {}
for model in classifiers_list:
model.fit(X_train, y_train)
# Score
score = model.score(X_test, y_test)
score_dict[model] = int(score * 100)
print()
print("SCORE:")
sorted_score_dict = {k: v for k, v in sorted(score_dict.items(), key=lambda item: item[1])}
for model, score in sorted_score_dict.items():
print(f'{model.__class__.__name__}: score is {score}%')
I'm getting (for the best classifier model) around 65% using pipeline vs 90% using step by step.
Not sure what am I missing.
There is nothing wrong with your code by itself. But your step-by-step approach is using bad practice in Machine Learning theory:
Do not resample your testing data
In your step-by-step approach, you resample all of the data first and then split them into train and test sets. This will lead to an overestimation of model performance because you have altered the original distribution of classes in your test set and it is not representative of the original problem anymore.
What you should do instead is to leave the testing data in its original distribution in order to get a valid approximation of how your model will perform on the original data, which is representing the situation in production. Therefore, your approach with the pipeline is the way to go.
As a side note: you could think about shifting the whole data preparation (vectorization and resampling) out of your fitting and testing loop as you probably want to compare the model performance against the same data anyway. Then you would only have to run these steps once and your code executes faster.
The correct approach in such cases is described in detail in own answer in the Data Science SE thread Why you shouldn't upsample before cross validation (although the answer is about CV, the rationale is identical for the train/test split case as well). In short, any resampling method (SMOTE included) should be applied only to the training data and not to the validation or test ones.
Given that, your Pipeline approach here is correct: you apply SMOTE only to your training data after splitting, and, according to the documentation of the imblearn pipeline:
The samplers are only applied during fit.
So, no SMOTE is actually applied to your test data during model.score, which is exactly as it should be.
Your step-by-step approach, on the other hand, is wrong on many levels, and SMOTE is only one of them; all these preprocessing steps should be applied after the train/test split, and fitted only on the training portion of your data, which is not the case here, thus the results are invalid (no wonder they look "better"). For a general discussion (and a practical demonstration) of how & why such preprocessing should be applied only to the training data, see my (2) answers in Should Feature Selection be done before Train-Test Split or after? (again, the discussion there is about feature selection, but it is applicable to such feature engineering tasks like count vectorizer and TF-IDF transformation as well).

Are TF-IDF and BoW techniques incompatible?

I have studied the difference between TF-IDF and BoW methods but I have a big doubt about it. I thought that the two methods could be combined, I will explain better. I have a csv file (MY_DATA) with thousands of comments from a social network, I would like to use this dataset to create my BoW for the creation of a classification model of the sentiment of comments (the sentiment of comments is the other variable of MY_DATA and is of three types: positive, negative and neutral)
tf = TfidfVectorizer()
text_tf = tf.fit_transform(MY_DATA['comments'])
X_train, X_test, y_train, y_test = train_test_split(text_tf, MY_DATA['sentiment'], test_size=0.2)
#Classification model Multinomial Naive Bayes
clf = MultinomialNB().fit(X_train, y_train)
predicted = clf.predict(X_test)
Now that you have seen my script I would like to know if I am using the TF-IDF method correctly. How could I apply the BoW method in my case? Do the two methods inevitably remain incompatible?

Does fitting a sklearn Linear Regression classifier multiple times add data points or just replace them?

X = np.array(df.drop([label], 1))
X_lately = X[-forecast_out:]
X = X[:-forecast_out]
df.dropna(inplace=True)
y = np.array(df[label])
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2)
linReg.fit(X_train, y_train)
I've been fitting my linear regression classifier over and over again with data from different spreadsheets under the assumption that every time I fit the same model with a new spreadsheet, it is adding points and making the model more robust.
Was this assumption correct? Or am I just wiping the model every time I fit it?
If so, is there a way for me to fit my model multiple times for this 'cumulative' type effect?
Linear regression is a batch (aka. offline) training method, you can't add knowledge with new patterns. So, sklearn is re-fitting the whole model. The only way to add data is to append the new patterns to your original training X, Y matrices and re-fit.
You're almost certainly wiping your mode land starting from scratch. To do what you want, you need to append the additional data to the bottom of your data frame and re-fit using that.

Categories

Resources