adding more data to Support Vector Classifier training

adding more data to Support Vector Classifier training - python

I am using the LinearSVC() available on scikit learn to classify texts into a max of 7 seven labels. So, it is a multilabel classification problem. I am training on a small amount of data and testing it. Now, I want to add more data (retrieved from a pool based on a criteria) to the fitted model and evaluate on the same test set. How can this be done?
Question:
It is necessary to merge the previous data set with the new data set, get everything preprocessed and then retrain to see if the performance improve with the old + new data?
My code so far is below:
def preprocess(data, x, y):
global Xfeatures
global y_train
global labels
porter = PorterStemmer()
multilabel=MultiLabelBinarizer()
y_train=multilabel.fit_transform(data[y])
print("\nLabels are now binarized\n")
data[multilabel.classes_] = y_train
labels = multilabel.classes_
print(labels)
data[x].apply(lambda x:nt.TextFrame(x).noise_scan())
print("\English stop words were extracted\n")
data[x].apply(lambda x:nt.TextExtractor(x).extract_stopwords())
corpus = data[x].apply(nfx.remove_stopwords)
corpus = data[x].apply(lambda x: porter.stem(x))
tfidf = TfidfVectorizer()
Xfeatures = tfidf.fit_transform(corpus).toarray()
print('\nThe text is now vectorized\n')
return Xfeatures, y_train
Xfeatures, y_train = preprocess(df1, 'corpus', 'zero_level_name')
Xfeatures_train=Xfeatures[:300]
y_train_features = y_train[:300]
X_test=Xfeatures[300:400]
y_test=y_train[300:400]
X_pool=Xfeatures[400:]
y_pool=y_train[400:]
def model(modelo, tipo):
svc= modelo
clf = tipo(svc)
clf.fit(Xfeatures_train,y_train_features)
clf_predictions = clf.predict(X_test)
return clf_predictions
preds_pool = model(LinearSVC(class_weight='balanced'), OneVsRestClassifier)

It depends on how your previous dataset was. If your previous dataset was a well representation of your problem at hand, then adding more data will not increase your model performance by a large. So you can just test with the new data.
However, it is also possible that your initial dataset was not representative enough, and therefore with more data your classification accuracy increases. So in that case it is better to include all the data and preprocess it. Because preprocessing generally includes parameters that are computed on the dataset as whole. e.g., I can see you have TFIDF, or mean which is sensitive to the dataset at hand.

Related

Can I use StandardScaler() on whole data set, or should I calculate on train and test sets separately?

I'm developing a SVR for ~100 continuous features and a continuous label.
For scaling the data, I wrote:
#Read in
df = pd.read_csv(data_path,sep='\t')
features = df.iloc[:,1:-1] #100 features
target = df.iloc[:,-1] #The label
names = df.iloc[:,0] #Column names
#Scale features
scaler = StandardScaler()
scaled_df = scaler.fit_transform(features)
# rename columns (since now its an np array)
features.columns = df_columns
So now I have a scaled data frame, and my next step was to split into train and test, and then develop a model (SVR):
X_train, X_test, y_train, y_test = train_test_split(scaled_df, target, test_size=0.2)
model = SVR()
...and then I fit the model to the data.
But I noticed other people don't fit the StandardScaler() to the whole data frame, but they split the dataframe into train and test first, and then apply StandardScaler() to each separately.
Is there a difference between whether you apply the StandardScaler to the whole data frame, or train and test separately?

The previous answer says that you should separate the training and testing set when scaling, otherwise the testing one might bias the transformation of the training one. This is half correct and half wrong.
If you do the transformation separately, then it might well be that the training set will get scaled to wrong proportions (e.g. if it comes from a narrow continuous time range, thus taking on a subset of the values range). You will end up having wrong values for the variables of the test set.
In general, what you must do is scale on the training set and transfer the scale over to the testing set. This is done by using the methods fit and transform separately, as seen in the documentation.

You need to apply StandardScaler to the training set to prevent the distribution of the test set leaking into the model. If you fit the scaler on the full dataset before splitting, the test set information is used to transform the training set and use it to train the model.

Improving classification by using clustering as a feature

I'm trying to improve my classification results by doing clustering and use the clustered data as another feature (or use it alone instead of all other features - not sure yet).
So let's say that I'm using unsupervised algorithm - GMM:
gmm = GaussianMixture(n_components=4, random_state=RSEED)
gmm.fit(X_train)
pred_labels = gmm.predict(X_test)
I trained the model with training data and predicted the clusters by the test data.
Now I want to use a classifier (KNN for example) and use the clustered data within it. So I tried:
#define the model and parameters
knn = KNeighborsClassifier()
parameters = {'n_neighbors':[3,5,7],
'leaf_size':[1,3,5],
'algorithm':['auto', 'kd_tree'],
'n_jobs':[-1]}
#Fit the model
model_gmm_knn = GridSearchCV(knn, param_grid=parameters)
model_gmm_knn.fit(pred_labels.reshape(-1, 1),Y_train)
model_gmm_knn.best_params_
But I'm getting:
ValueError: Found input variables with inconsistent numbers of samples: [418, 891]
Train and Test are not with same dimension.
So how can I implement such approach?

Your method is not correct - you are attempting to use as a single feature the cluster labels of your test data pred_labels, in order to fit a classifier with your training labels Y_train. Even in the huge coincidental case that the dimensions of these datasets were the same (hence not giving a dimension mismatch error, as here), this is conceptually wrong and does not actually make any sense.
What you actually want to do is:
Fit a GMM with your training data
Use this fitted GMM to get cluster labels for both your training and test data.
Append the cluster labels as a new feature in both datasets
Fit your classifier with this "enhanced" training data.
All in all, and assuming that your X_train and X_test are pandas dataframes, here is the procedure:
import pandas as pd
gmm.fit(X_train)
cluster_train = gmm.predict(X_train)
cluster_test = gmm.predict(X_test)
X_train['cluster_label'] = pd.Series(cluster_train, index=X_train.index)
X_test['cluster_label'] = pd.Series(cluster_test, index=X_test.index)
model_gmm_knn.fit(X_train, Y_train)
Notice that you should not fit your clustering model with your test data - only with your training ones, otherwise you have data leakage similar to the one encountered when using the test set for feature selection, and your results will be both invalid and misleading .

decision tree algorithm on feature set

I'm trying to predict the no.of updates('sys_mod_count')based on the text description('eng')
I have predefined the 'sys_mod_count' into two classes if >=17 as 1; <17 as 0.
But I want to remove this condition as this value is not available at decision time in real world.
I'm thinking to do this in Decision tree/ Random forest method to train the classifier on feature set.
def train_model(classifier, feature_vector_train, label, feature_vector_valid, is_neural_net=False):
# fit the training dataset on the classifier
classifier.fit(feature_vector_train, label)
# predict the labels on validation dataset
predictions = classifier.predict(feature_vector_valid)
# return metrics.accuracy_score(predictions, valid_y)
return predictions
import pandas as pd
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
df_3 =pd.read_csv('processedData.csv', sep=";")
st_new = df_3[['sys_mod_count','eng','ger']]
st_new['updates_binary'] = st_new['sys_mod_count'].apply(lambda x: 1 if x >= 17 else 0)
st_org = st_new[['eng','updates_binary']]
st_org = st_org.dropna(axis=0, subset=['eng']) #Determine if column 'eng'contain missing values are removed
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(st_org['eng'], st_org['updates_binary'],stratify=st_org['updates_binary'],test_size=0.20)
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(st_org['eng'])
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)
# Naive Bayes on Word Level TF IDF Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf, train_y, xvalid_tfidf)
print ("NB, WordLevel TF-IDF: ", metrics.accuracy_score(accuracy, valid_y))

This seems to be a threshold setting problem - you would like to set a threshold at which a certain classification is made. No supervised classifier can set the threshold for you because if it does not have any training data with binary classes, then you cannot train the cvlassifier, and to create training data, you need to set the threshold to begin with. It's a chicken and egg problem.
If you have some way of identifying which binary label is correct, then you can vary the threshold and measure errors similar to how it's suggested here. Then you can either run a Classifier on your binary labels based on the threshold or a Regressor on sys_mod_count and convert to binary based on the identified threshold.
The above approach does not work if you have no way to identify what the correct binary label should be. Then, the problem you are trying to solve is creating some boundary between points based on the value of your sys_mod_count variable. This is unsupervised learning. So, techniques like clustering will be helpful here. You can cluster your data into two clusters based on the distance of points from each other, and then label each cluster, which becomes your binary label.

using saved sklearn model to make prediction

I have a saved logistic regression model which I trained with training data and saved using joblib. I am trying to load this model in a different script, pass it new data and make a prediction based on the new data.
I am getting the following error "sklearn.exceptions.NotFittedError: CountVectorizer - Vocabulary wasn't fitted." Do I need to fit the data again ? I would have thought that the point of being able to save the model would be to not have to do this.
The code I am using is below excluding the data cleaning section. Any help to get the prediction to work would be appreciated.
new_df = pd.DataFrame(latest_tweets,columns=['text'])
new_df.to_csv('new_tweet.csv',encoding='utf-8')
csv = 'new_tweet.csv'
latest_df = pd.read_csv(csv)
latest_df.dropna(inplace=True)
latest_df.reset_index(drop=True,inplace=True)
new_x = latest_df.text
loaded_model = joblib.load("finalized_mode.sav")
tfidf_transformer = TfidfTransformer()
cvec = CountVectorizer()
x_val_vec = cvec.transform(new_x)
X_val_tfidf = tfidf_transformer.transform(x_val_vec)
result = loaded_model.predict(X_val_tfidf)
print (result)

Your training part have 3 parts which are fitting the data:
CountVectorizer: Learns the vocabulary of the training data and returns counts
TfidfTransformer: Learns the counts of the vocabulary from previous part, and returns tfidf
LogisticRegression: Learns the coefficients for features for optimum classification performance.
Since each part is learning something about the data and using it to output the transformed data, you need to have all 3 parts while testing on new data. But you are only saving the lr with joblib, so the other two are lost and with it is lost the training data vocabulary and count.
Now in your testing part, you are initializing new CountVectorizer and TfidfTransformer, and calling fit() (fit_transform()), which will learn the vocabulary only from this new data. So the words will be less than the training words. But then you loaded the previously saved LR model, which expects the data according to features like training data. Hence this error:
ValueError: X has 130 features per sample; expecting 223086
What you need to do is this:
During training:
filename = 'finalized_model.sav'
joblib.dump(lr, filename)
filename = 'finalized_countvectorizer.sav'
joblib.dump(cvec, filename)
filename = 'finalized_tfidftransformer.sav'
joblib.dump(tfidf_transformer, filename)
During testing
loaded_model = joblib.load("finalized_model.sav")
loaded_cvec = joblib.load("finalized_countvectorizer.sav")
loaded_tfidf_transformer = joblib.load("finalized_tfidftransformer.sav")
# Observe that I only use transform(), not fit_transform()
x_val_vec = loaded_cvec.transform(new_x)
X_val_tfidf = loaded_tfidf_transformer.transform(x_val_vec)
result = loaded_model.predict(X_val_tfidf)
Now you wont get that error.
Recommendation:
You should use TfidfVectorizer in place of both CountVectorizer and TfidfTransformer, so that you dont have to use two objects all the time.
And along with that you should use Pipeline to combine the two steps:- TfidfVectorizer and LogisticRegression, so that you only have to use a single object (which is easier to save and load and generic handling).
So edit the training part like this:
tfidf_vectorizer = TfidfVectorizer()
lr = LogisticRegression()
tfidf_lr_pipe = Pipeline([('tfidf', tfidf_vectorizer), ('lr', lr)])
# Internally your X_train will be automatically converted to tfidf
# and that will be passed to lr
tfidf_lr_pipe.fit(X_train, y_train)
# Similarly here only transform() will be called internally for tfidfvectorizer
# And that data will be passed to lr.predict()
y_preds = tfidf_lr_pipe.predict(x_test)
# Now you can save this pipeline alone (which will save all its internal parts)
filename = 'finalized_model.sav'
joblib.dump(tfidf_lr_pipe, filename)
During testing, do this:
loaded_pipe = joblib.load("finalized_model.sav")
result = loaded_model.predict(new_x)

You have not fit the CountVectorizer.
You should do like this..
cvec = CountVectorizer()
x_val_vec = cvec.fit_transform(new_x)
Similarly, TfidTransformer must be used like this..
X_val_tfidf = tfidf_transformer.fit_transform(x_val_vec)

Python -- SciKit -- Text Feature Extraction of Classifer

I have to classify articles into my custom categories. So I chose MultinomialNB from SciKit. I am doing supervised learning. So I have an editor who look at the articles daily and then tag them. Once they are tagged I include them into my Learning model and so on. Below is the code to get an idea what i am doing and using. (I am not including any import lines because I am just trying to give you an idea of what I am doing) (Reference)
corpus = (train_set)
vectorizer = HashingVectorizer(stop_words='english', non_negative=True)
x = vectorizer.transform(corpus)
x_array = x.toarray()
data_array = np.array(x_array)
cat_set = list(cat_set)
cat_array = np.array(cat_set)
filename = '/home/ubuntu/Classifier/Intelligence-MultinomialNB.pkl'
if(not os.path.exists(filename)):
classifier.partial_fit(data_array,cat_array,classes)
print "Saving Classifier"
joblib.dump(classifier, filename, compress=9)
else:
print "Loading Classifier"
classifier = joblib.load(filename)
classifier.partial_fit(data_array,cat_array)
print "Saving Classifier"
joblib.dump(classifier, filename, compress=9)
Now I have a Classifier ready after custom tagging and it works well with new articles and work like a charm. Now the requirement has arisen to get most frequent words against each category. In short I have to extract feature from the learned model. By looking into documentation I only found out how to extract text features at the time of learning.
But once learned and I only have the model file (.pkl) is it possible to load that classifier and extract features from it?
Will it be possible to get the most frequent terms against each class or category?

You can access the features by using the feature_count_ property. This will tell you how many times a particular feature occurred. For example:
# Imports
import numpy as np
from sklearn.naive_bayes import MultinomialNB
# Data
X = np.random.randint(3, size=(3, 10))
X2 = np.random.randint(3, size=(3, 10))
y = np.array([1, 2, 3])
# Initial fit
clf = MultinomialNB()
clf.fit(X, y)
# Check to see that the stored features are equal to the input features
print np.all(clf.feature_count_ == X)
# Modify fit with new data
clf.partial_fit(X2, y)
# Check to see that the stored features represents both sets of input
print np.all(clf.feature_count_ == (X + X2))
In the above example, we can see that the feature_count_ property is nothing more than a running sum of the number of features for each class. Using this, you can go backwards from your classifier model to your features, to determine the frequency of your features. Unfortunately, your problem is more complex, you now need to go back one more step, as your features are not simply words.
This is where the bad news comes - you used a HashingVectorizer feature extractor. If you refer to the docs:
there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.
So even though we know the frequency of the features, we can't translate those features back to words. Had you used a different type of feature extractor (perhaps the one referenced on that same page, CountVectorizer) the situation would be different entirely.
In short - You can extract the features from the model and determine their frequency by class, but you can't convert those features back to words.
To obtain the functionality you desire you would need to start over using a reversible mapping function (a feature extractor that allows you to encode words into features and decode features back into words).

I would suggest using the code below. you just need to load the pickel object and transform the test data using the same vectorizer. try just TFIDF vectorizer in case you face problems.
clf = joblib.load("'/home/ubuntu/Classifier/Intelligence-MultinomialNB.pkl'")
# you need to read the test sample
# type (data_test) list of list
X_test = vectorizer.transform(data_test)
print "pickel model loaded"
print clf
pred = clf.predict(X_test)
print ("prediction done")
for p in enumerate(pred):
print p

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

adding more data to Support Vector Classifier training - python

Related

Can I use StandardScaler() on whole data set, or should I calculate on train and test sets separately?

Improving classification by using clustering as a feature

decision tree algorithm on feature set

using saved sklearn model to make prediction

Python -- SciKit -- Text Feature Extraction of Classifer

Categories

Resources