scikit-learn predict() new data - python

I have a question regarding the predict() function from scikit-learn. I'm trying to validate my trained classifier by testing it with data that is not in the training data and that also has a different label. So I basically want the classifier's output to be: 'new data is not predictable'. How do I implement that?
Right now the classifier just tries to predict one of the trained labels for the new data, even though the new data has totally different labels. Could you help me out?
My classifier pipeline:
text_clf_NB = Pipeline([('vect', vects_NB),
                        ('tfidf', tf_idf_NB),
                        ('clf', classifier('NB'))  # choose classifier
                        ])
My prediction:
pred_NB = text_clf_NB.fit(X_train, Y_train).predict(X_others)
X_others has a new case with a non-trained label, and I want the classifier to notice that this case is not similar to the trained cases, rather than just predicting whichever trained label is most likely for it.

OK, I solved the problem like this for now. Is this a reasonable way to do it?
# prediction probabilities (fit once, then reuse the fitted pipeline)
text_clf_NB.fit(X_train, Y_train)
pred_NB = text_clf_NB.predict_proba(X_test)
# prediction depending on probs
new_pred = []
for case, probs in enumerate(pred_NB):
    if max(probs) >= 0.995:
        # confident enough: keep the label predicted by the fitted pipeline
        pred = text_clf_NB.predict([X_test[case]])
    else:
        pred = ["No sufficient similarity to trained cases"]
    new_pred.extend(pred)
new_pred
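The thresholding idea above can be sketched without the per-case loop over predict(). This is only a minimal illustration under stated assumptions: the toy texts, the label names, the REJECT string and the 0.6 threshold are all made up for the example, and MultinomialNB stands in for whatever classifier('NB') returns in the original pipeline. Naive Bayes probabilities are often poorly calibrated, so a threshold like this is a heuristic rather than a real novelty detector.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, hypothetical and for illustration only.
X_train = [
    "the cat sat on the mat", "dogs and cats are pets",
    "my cat chases the dog", "stocks fell on the market",
    "the market rallied as stocks rose", "investors sold stocks and bonds",
]
y_train = ["pets", "pets", "pets", "finance", "finance", "finance"]
X_new = ["the dog and the cat", "quantum chromodynamics lecture notes"]

REJECT = "No sufficient similarity to trained cases"
THRESHOLD = 0.6  # assumption: should be tuned on held-out data

clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
clf.fit(X_train, y_train)  # fit once

proba = clf.predict_proba(X_new)          # shape (n_samples, n_classes)
best = proba.argmax(axis=1)               # column index of the most likely class
confident = proba.max(axis=1) >= THRESHOLD

# Keep the predicted label only where the top probability clears the threshold.
labels = [clf.classes_[i] if ok else REJECT for i, ok in zip(best, confident)]
print(labels)
```

The second text shares no vocabulary with the training set, so its feature vector is all zeros and the model falls back to the class priors, which fall below the threshold here.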

Related

adding more data to Support Vector Classifier training

I am using the LinearSVC() available in scikit-learn to classify texts into a maximum of 7 labels, so it is a multilabel classification problem. I am training on a small amount of data and testing it. Now, I want to add more data (retrieved from a pool based on a criterion) to the fitted model and evaluate on the same test set. How can this be done?
Question:
Is it necessary to merge the previous data set with the new data set, preprocess everything again and then retrain to see whether the performance improves with the old + new data?
My code so far is below:
def preprocess(data, x, y):
    global Xfeatures
    global y_train
    global labels
    porter = PorterStemmer()
    multilabel = MultiLabelBinarizer()
    y_train = multilabel.fit_transform(data[y])
    print("\nLabels are now binarized\n")
    data[multilabel.classes_] = y_train
    labels = multilabel.classes_
    print(labels)
    data[x].apply(lambda x: nt.TextFrame(x).noise_scan())
    print("\nEnglish stop words were extracted\n")
    data[x].apply(lambda x: nt.TextExtractor(x).extract_stopwords())
    corpus = data[x].apply(nfx.remove_stopwords)
    corpus = data[x].apply(lambda x: porter.stem(x))
    tfidf = TfidfVectorizer()
    Xfeatures = tfidf.fit_transform(corpus).toarray()
    print('\nThe text is now vectorized\n')
    return Xfeatures, y_train
Xfeatures, y_train = preprocess(df1, 'corpus', 'zero_level_name')
Xfeatures_train = Xfeatures[:300]
y_train_features = y_train[:300]
X_test = Xfeatures[300:400]
y_test = y_train[300:400]
X_pool = Xfeatures[400:]
y_pool = y_train[400:]
def model(modelo, tipo):
    svc = modelo
    clf = tipo(svc)
    clf.fit(Xfeatures_train, y_train_features)
    clf_predictions = clf.predict(X_test)
    return clf_predictions
preds_pool = model(LinearSVC(class_weight='balanced'), OneVsRestClassifier)
It depends on what your previous dataset was like. If it was a good representation of the problem at hand, then adding more data will not increase your model's performance by much, so you can just test with the new data.
However, it is also possible that your initial dataset was not representative enough, in which case more data will increase your classification accuracy. In that case it is better to include all the data and preprocess it again, because preprocessing generally involves parameters that are computed on the dataset as a whole, e.g. TF-IDF (which I can see you use) or a mean, both of which are sensitive to the dataset at hand.
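A minimal sketch of that comparison, assuming toy texts and tag names invented for the example: the vectorizer lives inside a pipeline, so refitting on old + new data recomputes the TF-IDF statistics on the combined training texts, while the test set is only ever transformed. The helper fit_model is hypothetical, not part of the question's code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy data, hypothetical and for illustration only.
train_texts = ["cheap flights to rome", "football match tonight", "rome football derby"]
train_tags = [["travel"], ["sport"], ["travel", "sport"]]
pool_texts = ["hotel deals in paris", "tennis final results"]
pool_tags = [["travel"], ["sport"]]
test_texts = ["paris hotel booking", "football results"]

# Fix the label space once so both models produce the same output columns.
mlb = MultiLabelBinarizer()
mlb.fit(train_tags + pool_tags)

def fit_model(texts, tags):
    """Fit TF-IDF and a one-vs-rest LinearSVC on the given training texts only."""
    model = make_pipeline(
        TfidfVectorizer(),
        OneVsRestClassifier(LinearSVC(class_weight="balanced")),
    )
    model.fit(texts, mlb.transform(tags))
    return model

baseline = fit_model(train_texts, train_tags)
augmented = fit_model(train_texts + pool_texts, train_tags + pool_tags)

# Both models are evaluated on the same, untouched test set.
pred_base = baseline.predict(test_texts)
pred_aug = augmented.predict(test_texts)
print(mlb.classes_, pred_base, pred_aug, sep="\n")
```

Keeping the vectorizer inside the pipeline also avoids fitting TF-IDF on the test and pool rows before splitting, which is what happens when the whole data frame is vectorized up front.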

Improving classification by using clustering as a feature

I'm trying to improve my classification results by clustering the data and using the cluster assignments as another feature (or using them alone instead of all the other features; I'm not sure yet).
So let's say that I'm using an unsupervised algorithm, GMM:
gmm = GaussianMixture(n_components=4, random_state=RSEED)
gmm.fit(X_train)
pred_labels = gmm.predict(X_test)
I fitted the model on the training data and predicted the clusters for the test data.
Now I want to use a classifier (KNN, for example) and feed the cluster assignments into it. So I tried:
# define the model and parameters
knn = KNeighborsClassifier()
parameters = {'n_neighbors': [3, 5, 7],
              'leaf_size': [1, 3, 5],
              'algorithm': ['auto', 'kd_tree'],
              'n_jobs': [-1]}
# Fit the model
model_gmm_knn = GridSearchCV(knn, param_grid=parameters)
model_gmm_knn.fit(pred_labels.reshape(-1, 1), Y_train)
model_gmm_knn.best_params_
But I'm getting:
ValueError: Found input variables with inconsistent numbers of samples: [418, 891]
The train and test sets do not have the same number of samples.
So how can I implement such an approach?
Your method is not correct: you are attempting to use the cluster labels of your test data (pred_labels) as a single feature in order to fit a classifier against your training labels Y_train. Even in the highly coincidental case that these two datasets had the same number of samples (and so did not raise a dimension mismatch error, as here), this would be conceptually wrong and would not make any sense.
What you actually want to do is:
1. Fit a GMM with your training data.
2. Use this fitted GMM to get cluster labels for both your training and test data.
3. Append the cluster labels as a new feature in both datasets.
4. Fit your classifier with this "enhanced" training data.
All in all, and assuming that your X_train and X_test are pandas dataframes, here is the procedure:
import pandas as pd
gmm.fit(X_train)
cluster_train = gmm.predict(X_train)
cluster_test = gmm.predict(X_test)
X_train['cluster_label'] = pd.Series(cluster_train, index=X_train.index)
X_test['cluster_label'] = pd.Series(cluster_test, index=X_test.index)
model_gmm_knn.fit(X_train, Y_train)
Notice that you should not fit your clustering model with your test data, only with your training data; otherwise you have data leakage similar to that encountered when using the test set for feature selection, and your results will be both invalid and misleading.
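For readers without the original data, the same procedure can be sketched end to end on synthetic NumPy arrays. This is a hedged illustration: make_classification, the split sizes and the reduced parameter grid are assumptions chosen only to keep the example small and runnable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real data (assumption).
X, y = make_classification(n_samples=300, n_features=6, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 1. Fit the GMM on the training data only.
gmm = GaussianMixture(n_components=4, random_state=0)
gmm.fit(X_train)

# 2-3. Predict cluster labels for both sets and append them as an extra column.
X_train_aug = np.column_stack([X_train, gmm.predict(X_train)])
X_test_aug = np.column_stack([X_test, gmm.predict(X_test)])

# 4. Fit the classifier on the augmented training data.
search = GridSearchCV(KNeighborsClassifier(), param_grid={"n_neighbors": [3, 5, 7]}, cv=3)
search.fit(X_train_aug, y_train)

test_accuracy = search.score(X_test_aug, y_test)
print(search.best_params_, test_accuracy)
```

Two caveats apply to this sketch. The cluster label is a nominal value, so a distance-based model such as KNN may behave better with a one-hot encoding of it. Also, because the GMM is fitted before the grid search, the cross-validation folds inside GridSearchCV all see a clustering fitted on the full training set; wrapping both steps in a pipeline with a custom transformer would avoid that.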

I can't get my test accuracy to increase in a sentiment analysis

I'm not sure if this is the right place, but my test accuracy is always at about .40 while I can get my training set accuracy to 1.0. I'm trying to do a sentiment analysis of tweets about Trump, and I have annotated each tweet with a positive, negative or neutral polarity. I want to be able to predict the polarity of new data based on my model. I've tried different models, but the SVM seems to give me the highest test accuracy. I'm unsure why my model's accuracy is so low, and would appreciate any help or direction.
trump = pd.read_csv("trump_data.csv", delimiter=";")
# drop all nan values
trump = trump.dropna()
trump = trump.rename(columns={"polarity,,,": "polarity"})
# print(trump.columns)
def tokenize(text):
    ps = PorterStemmer()
    return [ps.stem(w.lower()) for w in word_tokenize(text)]
X = trump.text
y = trump.polarity
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)
svm = Pipeline([('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'),
                                               tokenizer=tokenize)),
                ('svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                                      random_state=42, max_iter=5, tol=None))])
svm.fit(X_train, y_train)
model = svm.score(X_test, y_test)
print("The svm Test Classification Accuracy is:", model)
print("The svm training set accuracy is : {}".format(naive.score(X_train, y_train)))
y_pred = svm.predict(X)
This is an example of one of the strings in the text column of the dataset:
".#repbilljohnson congress must step up and overturn president trump’s discriminatory #eo banning #immigrants & #refugees #oxfam4refugees"
Why are you using naive.score? I assume it's a copy-paste mistake. Here are a few steps you can follow.
Make sure you have enough data points and clean them. Cleaning the dataset is an unavoidable step in data science.
Make use of parameters like ngram_range, max_df, min_df and max_features while featurizing the text with either TfidfVectorizer or CountVectorizer. You may also try embeddings using Word2Vec.
Do hyperparameter tuning on alpha, penalty and other variables using GridSearchCV or RandomizedSearchCV. Make sure you are doing the cross-validation correctly. Refer to the documentation for more info.
If the dataset is imbalanced, then try using other metrics like log-loss, precision, recall, f1-score, etc.
Make sure your model is neither overfitted nor underfitted by checking the train error and the test error.
Other than SVM, also try traditional models like Logistic Regression, Naive Bayes, Random Forest, etc. If you have a large number of data points, then you may try deep learning models.
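The vectorizer and classifier tuning steps above could be combined in one grid search over a pipeline, roughly as follows. This is a sketch under assumptions: the twelve toy tweets, the grid values and the f1_macro scoring choice are invented for illustration and are not taken from the question's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data, hypothetical and for illustration only.
texts = [
    "great speech today loved it", "what a wonderful great rally",
    "loved the new policy great news", "wonderful news today very happy",
    "terrible decision very bad", "bad policy awful speech",
    "awful rally terrible news", "very bad day terrible",
    "the rally is on monday", "speech scheduled for today",
    "policy vote is next week", "news conference at noon",
]
labels = ["positive"] * 4 + ["negative"] * 4 + ["neutral"] * 4

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SGDClassifier(loss="hinge", random_state=42)),
])

# Step names are prefixed to parameter names with a double underscore.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__max_df": [0.8, 1.0],
    "clf__alpha": [1e-4, 1e-3],
}

# The vectorizer is refitted inside every fold, so no test-fold text leaks in.
search = GridSearchCV(pipe, param_grid, cv=3, scoring="f1_macro")
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

Comparing search.best_score_ with the training score of search.best_estimator_ gives a quick read on overfitting.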
It turned out I needed to clean the polarity column, as it had values such as "positive,", "positive,," and "positive,,,", which were being registered as different classes, so I just removed all "," from the column.
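A minimal sketch of that cleanup with pandas, assuming a small made-up frame in place of the real CSV:

```python
import pandas as pd

# Stand-in for the real data; the stray commas mimic the problem described above.
trump = pd.DataFrame(
    {"polarity": ["positive,", "positive,,", "positive,,,", "negative", "neutral,"]}
)

n_before = trump["polarity"].nunique()

# Remove every literal comma, then trim any leftover whitespace.
trump["polarity"] = trump["polarity"].str.replace(",", "", regex=False).str.strip()

n_after = trump["polarity"].nunique()
print(n_before, n_after)
print(trump["polarity"].value_counts())
```

Checking value_counts() on the label column before training is a cheap way to catch this kind of problem early.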

Sklearn - How to predict probability for all target labels

I have a data set with a target variable that can have 7 different labels. Each sample in my training set has only one label for the target variable.
For each sample, I want to calculate the probability for each of the target labels. So my prediction would consist of 7 probabilities for each row.
On the sklearn website I read about multi-label classification, but this doesn't seem to be what I want.
I tried the following code, but this only gives me one classification per sample.
from sklearn.multiclass import OneVsRestClassifier
clf = OneVsRestClassifier(DecisionTreeClassifier())
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
Does anyone have some advice on this? Thanks!
You can do that by simply removing the OneVsRestClassifier and using the predict_proba method of the DecisionTreeClassifier. You can do the following:
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
pred = clf.predict_proba(X_test)
This will give you a probability for each of your 7 possible classes.
Hope that helps!
You can try using scikit-multilearn - an extension of sklearn that handles multilabel classification. If your labels are not overly correlated you can train one classifier per label and get all predictions - try (after pip install scikit-multilearn):
from skmultilearn.problem_transform import BinaryRelevance
classifier = BinaryRelevance(classifier = DecisionTreeClassifier())
# train
classifier.fit(X_train, y_train)
# predict
predictions = classifier.predict(X_test)
Predictions will contain a sparse matrix of size (n_samples, n_labels); in your case n_labels = 7. Each column contains the prediction for one label across all samples.
In case your labels are correlated you might need more sophisticated methods for multi-label classification.
Disclaimer: I'm the author of scikit-multilearn, feel free to ask more questions.
If you insist on using the OneVsRestClassifier, then you could also call predict_proba(X_test), as it is supported by OneVsRestClassifier as well.
For example:
from sklearn.multiclass import OneVsRestClassifier
clf = OneVsRestClassifier(DecisionTreeClassifier())
clf.fit(X_train, y_train)
pred = clf.predict_proba(X_test)
The order of the labels for which you get the result can be found in:
clf.classes_
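To make the link between the probability columns and clf.classes_ concrete, here is a small sketch on synthetic data. The dataset from make_classification and the max_depth value are assumptions for the example; the depth limit is there because a fully grown tree tends to have pure leaves and so returns only 0 or 1.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic 7-class problem (assumption, for illustration only).
X, y = make_classification(
    n_samples=350, n_features=10, n_informative=6,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)
X_train, y_train = X[:300], y[:300]
X_test = X[300:]

clf = DecisionTreeClassifier(max_depth=4, random_state=0)
clf.fit(X_train, y_train)

# Column j of predict_proba corresponds to clf.classes_[j].
proba = pd.DataFrame(clf.predict_proba(X_test), columns=clf.classes_)
print(proba.head())
```

Each row holds one probability per class, and the row sums to 1.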

Scikit Learn - ValueError: X has 26879 features per sample; expecting 7087

I am doing feature selection by first training LogisticRegression with L1 penalty and then using the reduced feature set to re-train the model using L2 penalty. Now, when I try to predict test data, the transform() done on it results in a different dimensional array. I am confused as to how to re-size the test data to be able to predict.
Appreciate any help. Thank you.
vectorizer = CountVectorizer()
output = vectorizer.fit_transform(train_data)
output_test = vectorizer.transform(test_data)
logistic = LogisticRegression(penalty = "l1")
logistic.fit(output, train_labels)
predictions = logistic.predict(output_test)
logistic = LogisticRegression(penalty = "l2", C = i + 1)
output = logistic.fit_transform(output, train_labels)
predictions = logistic.predict(output_test)
The following error message is shown resulting from the last predict line. Original number of features is 26879:
ValueError: X has 26879 features per sample; expecting 7087
There seem to be a couple of things wrong here.
Firstly, I suggest you give different names to the two logistic models, as you need both to make a prediction.
In your code, you never call the transform of the l1 logistic regression, so the code does not do what you say you want to do.
What you should be doing is
l1_logreg = LogisticRegression(penalty="l1")
l1_logreg.fit(output, train_labels)
# keep only the features with non-zero l1 coefficients
out_reduced = l1_logreg.transform(output)
out_reduced_test = l1_logreg.transform(output_test)
l2_logreg = LogisticRegression(penalty="l2")
l2_logreg.fit(out_reduced, train_labels)
predictions = l2_logreg.predict(out_reduced_test)
or
pipe = make_pipeline(CountVectorizer(), LogisticRegression(penalty="l1"),
                     LogisticRegression(penalty="l2"))
pipe.fit(train_data, train_labels)
predictions = pipe.predict(test_data)
FYI I wouldn't expect that to work better than just doing l2 logreg. Also you could try SGDClassifier(penalty="elasticnet").
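On recent scikit-learn releases the transform() method of estimators such as LogisticRegression no longer exists (it was deprecated in 0.17 and removed in 0.19), so the same idea is usually written with SelectFromModel. The following is a sketch under assumptions: the toy documents and labels, the liblinear solver and C=10.0 are chosen only to make the example runnable.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data, hypothetical and for illustration only.
train_data = [
    "win money now", "free money offer", "win a free prize now", "cheap offer win prize",
    "meeting at noon tomorrow", "see you at the meeting",
    "lunch tomorrow at noon", "project meeting notes",
]
train_labels = ["spam"] * 4 + ["ham"] * 4
test_data = ["free prize money", "notes from the meeting"]

pipe = make_pipeline(
    CountVectorizer(),
    # The l1 model selects the features with non-zero coefficients ...
    SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=10.0)),
    # ... and the l2 model is trained on that reduced feature set.
    LogisticRegression(penalty="l2"),
)
pipe.fit(train_data, train_labels)

# The pipeline applies the same vectorizer and feature mask to the test data.
predictions = pipe.predict(test_data)
n_selected = int(pipe.named_steps["selectfrommodel"].get_support().sum())
print(predictions, n_selected)
```

Because the selection step is part of the pipeline, the training and test matrices always end up with the same number of columns, which is exactly what the ValueError in the question was complaining about.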
