Check skills of a classifier in scikit learn - python

After training a classifier, I tried passing it a few sentences to check whether it classifies them correctly.
During that test the results do not look right.
I suspect some of my variables are not set up correctly.
Explanation
I have a dataframe called df that looks like this:
                                                news         type
0  From: mathew <mathew@mantis.co.uk>\n Subject: ...  alt.atheism
1  From: mathew <mathew@mantis.co.uk>\n Subject: ...    alt.space
2  From: I3150101@dbstu1.rz.tu-bs.de (Benedikt Ro...     alt.tech
...
# each row in the news column is a document
# each row in the type column is the category of that document
Preprocessing:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn import metrics
vectorizer = TfidfVectorizer(stop_words='english')
vectors = vectorizer.fit_transform(df.news)
clf = SVC(C=10, gamma=1, kernel='rbf')
clf.fit(vectors, df.type)
vectors_test = vectorizer.transform(df_test.news)
pred = clf.predict(vectors_test)
Attempt to check how some sentences are classified
texts = ["The space shuttle is made in 2018",
         "stars are shining",
         "galaxy"]
text_features = vectorizer.transform(texts)
predictions = clf.predict(text_features)
for text, predicted in zip(texts, predictions):
    print('"{}"'.format(text))
    print(" - Predicted as: '{}'".format(df.type[pred]))
    print("")
The problem is that it returns this:
"The space shuttle is made in 2018"
- Predicted as: 'alt.atheism NaN
alt.atheism NaN
alt.atheism NaN
alt.atheism NaN
alt.atheism NaN
What do you think?
EDIT
Example
This is roughly how it should look:
>>> docs_new = ['God is love', 'OpenGL on the GPU is fast']
>>> X_new_counts = count_vect.transform(docs_new)
>>> X_new_tfidf = tfidf_transformer.transform(X_new_counts)
>>> predicted = clf.predict(X_new_tfidf)
>>> for doc, category in zip(docs_new, predicted):
... print('%r => %s' % (doc, twenty_train.target_names[category]))
...
'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics

As you mentioned in the comments, you have around 700 samples. To test how well your classifier works, you should always split your data into training and test samples, for example 500 samples as training data and 200 to test your classifier. You should then only use your training samples for training and your test samples for testing. Test data created by hand as you did is not necessarily meaningful. sklearn comes with a handy function to separate data into training and test sets:
# separate training and test data, 20% of your data is selected as test data
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.2)

# fit the vectorizer on the training documents only, then train the classifier
vectors = vectorizer.fit_transform(df_train.news)
clf = SVC(C=10, gamma=1, kernel='rbf')
clf.fit(vectors, df_train.type)

# test classifier on the test set
vectors_test = vectorizer.transform(df_test.news)
pred = clf.predict(vectors_test)

# print per-class precision, recall and f1-score of your classifier
from sklearn.metrics import classification_report
print(classification_report(df_test.type, pred))
This will give you a hint of how good your classifier actually is. If you think it is not good enough, you should try another classifier, for example logistic regression. Or you could convert your data to all lower-case letters and see if that helps the accuracy.
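For example, a minimal sketch of swapping in logistic regression on the same split (assuming df_train and df_test from above; note that TfidfVectorizer already lower-cases the text by default):
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

# lowercase=True is the default and covers the lower-casing suggestion
vectorizer = TfidfVectorizer(stop_words='english', lowercase=True)
vectors = vectorizer.fit_transform(df_train.news)

clf = LogisticRegression(max_iter=1000)
clf.fit(vectors, df_train.type)

pred = clf.predict(vectorizer.transform(df_test.news))
print(classification_report(df_test.type, pred))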
Edit:
You can also write your predictions back to your test dataframe:
df_test['Predicted'] = pred
df_test.head()
This will help you to see a pattern. Is actually everything predicted as alt.atheism, as your example suggests?
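If you want to see that pattern at a glance, one option (a sketch, assuming the Predicted column added above) is to cross-tabulate true against predicted labels:
import pandas as pd

# rows are the true labels, columns the predicted labels
print(pd.crosstab(df_test.type, df_test.Predicted,
                  rownames=['true'], colnames=['predicted']))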

The data with which you train your classifier is significantly different from the phrases you test it on. As you mentioned in your comment on my first answer, you get an accuracy of more than 90%, which is pretty good. But you taught your classifier to classify mailing list items, which are long documents with e-mail addresses in them. Your phrases, such as "The space shuttle is made in 2018", are pretty short and do not contain e-mail addresses. It's possible that your classifier uses those e-mail addresses to classify the documents, which would explain the good results. You can test whether that is really the case by removing the e-mail addresses from the data before training.
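A quick way to try that (a sketch, assuming df.news holds the raw documents; the regex is only a rough e-mail pattern, not a full validator):
import re
from sklearn.model_selection import train_test_split

# strip e-mail addresses before splitting and training; \S+@\S+ is a rough pattern
df['news_clean'] = df.news.str.replace(r'\S+@\S+', ' ', regex=True)

df_train, df_test = train_test_split(df, test_size=0.2)
vectors = vectorizer.fit_transform(df_train.news_clean)
clf.fit(vectors, df_train.type)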

Related

Why is the accuracy of this Random Forest sentiment classification so low?

I want to use RandomForestClassifier for sentiment classification. The x contains text data as strings, so I used LabelEncoder to convert the strings. The y contains numeric data. My code is this:
import pandas as pd
import numpy as np
from sklearn.model_selection import *
from sklearn.ensemble import *
from sklearn import *
from sklearn.preprocessing import LabelEncoder
data = pd.read_csv('data.csv')
x = data['Reviews']
y = data['Ratings']
le = LabelEncoder()
x_encoded = le.fit_transform(x)
x_train, x_test, y_train, y_test = train_test_split(x_encoded,y, test_size = 0.2)
x_train = x_train.reshape(-1,1)
x_test = x_test.reshape(-1,1)
clf = RandomForestClassifier(n_estimators=100)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
Then I printed out the accuracy like below:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
And here's the output:
Accuracy: 0.5975
I have read that Random Forests have high accuracy because of the number of decision trees participating in the process. But I think the accuracy is much lower than it should be. I have looked for similar questions on Stack Overflow, but I couldn't find a solution for my problem.
Is there any problem in my code using Random Forest library? Or is there any exceptions of cases when using Random forest?
It is not a problem with Random Forests or the library; it is rather a problem with how you transform your text input into a feature vector.
What LabelEncoder does is: given some labels like ["a", "b", "c"], it transforms those labels into numeric values between 0 and n-1, with n being the number of distinct input labels. However, I assume Reviews contains free text rather than pure labels. This means that all your reviews (unless 100% identical) are transformed into different labels, and the classifier can only do essentially random things with that input. You therefore need something different to transform your textual input into numeric features that a Random Forest can work with.
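A tiny illustration of that behaviour (a sketch with made-up review strings):
from sklearn.preprocessing import LabelEncoder

reviews = ["great product", "great product!", "terrible, broke after a day"]
print(LabelEncoder().fit_transform(reviews))
# -> [0 1 2]: every distinct string gets an arbitrary integer, so the encoding
#    carries no information about the content of the review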
As a simple start, you can try something like TF-IDF or a simple count vectorizer. Both are available in sklearn (https://scikit-learn.org/stable/modules/feature_extraction.html, section "Text feature extraction"). There are more sophisticated ways of transforming text into numeric vectors, but this should be a good start for understanding what has to happen conceptually.
A last important note: fit those vectorizers only on the training set, not on the full dataset. Otherwise you might leak information from the test set into training. A good way of enforcing this is to build a sklearn pipeline that consists of a feature transformation step followed by the classifier.
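A minimal sketch of such a pipeline, assuming the x and y columns from the question:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# the vectorizer is fitted on x_train only when the pipeline is fitted,
# so no information from the test set leaks into training
pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', RandomForestClassifier(n_estimators=100)),
])
pipe.fit(x_train, y_train)
print("Accuracy:", accuracy_score(y_test, pipe.predict(x_test)))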

How to deal with dataset that contains both discrete and continuous data

I was training a model with 8 features that allow us to predict the probability of a room being sold.
Region: The region the room belongs to (an integer, taking values between 1 and 10)
Date: The date of stay (an integer between 1-365; here we consider only one-day requests)
Weekday: Day of week (an integer between 1-7)
Apartment: Whether the room is a whole apartment (1) or just a room (0)
#beds: The number of beds in the room (an integer between 1-4)
Review: Average review of the seller (a continuous variable between 1 and 5)
Pic Quality: Quality of the picture of the room (a continuous variable between 0 and 1)
Price: The historic posted price of the room (a continuous variable)
Accept: Whether this post gets accepted (someone took it, 1) or not (0) in the end
Column Accept is the "y". Hence, this is a binary classification.
We plotted the data; some of the features were skewed, so we applied a power transform.
We tried a neural network, ExtraTrees, XGBoost, Gradient boost, Random forest. They all gave about 0.77 AUC. However, when we tried them on the test set, the AUC dropped to 0.55 with a precision of 27%.
I am not sure where it went wrong, but my thinking was that the reason may be the mixing of discrete and continuous data, especially since some of the features are either 0 or 1.
Can anyone help?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
df_train = pd.read_csv('case2_training.csv')
X, y = df_train.iloc[:, 1:-1], df_train.iloc[:, -1]
y = y.astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer()
transform_list = ['Pic Quality', 'Review', 'Price']
X_train[transform_list] = pt.fit_transform(X_train[transform_list])
X_test[transform_list] = pt.transform(X_test[transform_list])
for i in transform_list:
    df = X_train[i]
    ax = df.plot.hist()
    ax.set_title(i)
    plt.show()
# Normalization
sc = MinMaxScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
X_train = X_train.astype(np.float32)
X_test = X_test.astype(np.float32)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=123, n_estimators=50)
clf.fit(X_train,y_train)
yhat = clf.predict_proba(X_test)
# AUC metric
train_accuracy = roc_auc_score(y_test, yhat[:,-1])
print("AUC",train_accuracy)
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(random_state=123, n_estimators=50)
clf.fit(X_train,y_train)
yhat = clf.predict_proba(X_test)
# AUC metric
train_accuracy = roc_auc_score(y_test, yhat[:,-1])
print("AUC",train_accuracy)
from torch import nn
from skorch import NeuralNetBinaryClassifier
import torch
model = nn.Sequential(
    nn.Linear(8, 64),
    nn.BatchNorm1d(64),
    nn.GELU(),
    nn.Linear(64, 32),
    nn.BatchNorm1d(32),
    nn.GELU(),
    nn.Linear(32, 16),
    nn.BatchNorm1d(16),
    nn.GELU(),
    nn.Linear(16, 1),
    # nn.Sigmoid()
)
net = NeuralNetBinaryClassifier(
    model,
    max_epochs=100,
    lr=0.1,
    optimizer=torch.optim.Adam,
    # shuffle training data on each epoch
    iterator_train__shuffle=True,
)
net.fit(X_train, y_train)
from xgboost.sklearn import XGBClassifier
clf = XGBClassifier(silent=0,
                    learning_rate=0.01,
                    min_child_weight=1,
                    max_depth=6,
                    objective='binary:logistic',
                    n_estimators=500,
                    seed=1000)
clf.fit(X_train,y_train)
yhat = clf.predict_proba(X_test)
# AUC metric
train_accuracy = roc_auc_score(y_test, yhat[:,-1])
print("AUC",train_accuracy)
Here is an attachment of a screenshot of the data.
Sample data
This is the fundamental first step of Data Analytics. You need to do two things here:
Data understanding - do the data fields in their current format make sense (data types, value ranges, etc.)?
Data preparation - what should you do to update these data fields before passing them to your model? Which inputs do you think will be useful for your model, and which will provide little benefit? Are there outliers you need to consider or handle?
A good book if you're starting in the field of data analytics is Fundamentals of Machine Learning for Predictive Data Analytics (I have no affiliation with this book).
Looking at your dataset there's a couple of things you could try to see how it influences your prediction results:
Unless the region order is actually ranked in importance/value, I would change this to a one-hot encoded feature, which you can do in sklearn (see the sketch after this list). Otherwise you run the risk of your model thinking that regions with a higher number (say 10) are more important than regions with a lower value (say 1).
You could attempt to normalise certain features if they are on a much larger scale than some of your other data fields (see "Why Data Normalization is necessary for Machine Learning models").
Consider looking at the Kaggle competition House Prices: Advanced Regression Techniques. It's doing a similar thing to what you're attempting to do, and it might have some pointers for how you should approach the problem in the Notebooks and Discussion tabs.
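As a sketch of the one-hot idea, assuming the column names from the question's feature list (Region and Weekday are treated as unordered categories here):
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# one-hot encode the unordered categorical columns, pass the rest through unchanged
preprocess = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), ['Region', 'Weekday'])],
    remainder='passthrough')

X_train_enc = preprocess.fit_transform(X_train)
X_test_enc = preprocess.transform(X_test)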
Without deeply exploring all the data you are using it is hard to say for certain what is causing the drop in accuracy (or AUC) when moving from your training set to the testing set. It is unlikely to be caused by the mixed discrete/continuous data.
The drop just suggests that your models are over-fitting to your training data (and therefore not transferring well). This could be caused by too many learned parameters given the amount of data you have, which is more often a problem with neural networks than with some of the other methods you mentioned. Or the problem could be with the way the data was split into training/testing: if the two distributions differ in some way that is perhaps not obvious, then you wouldn't expect the testing performance to be as good. If it were me, I'd look carefully at how the data was split into training/testing (assuming you have a reasonably large set of data). You may try repeating your experiments with a number of random training/testing splits (search for k-fold cross validation if you're not familiar with it).
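A sketch of that idea with scikit-learn's k-fold helpers, assuming X and y as loaded in the question:
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
clf = RandomForestClassifier(n_estimators=50, random_state=123)

# AUC on each held-out fold; a large spread suggests the single split was unlucky
scores = cross_val_score(clf, X, y, cv=cv, scoring='roc_auc')
print(scores, scores.mean())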
Your model is overfit. Try a simpler model first and use lower parameter values. For tree-based classifiers, scaling does not have any impact on the model.
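For example, a deliberately constrained Random Forest (a sketch; the exact values are only starting points to tune from):
from sklearn.ensemble import RandomForestClassifier

# shallow trees and a minimum leaf size act as regularisation against overfitting
clf = RandomForestClassifier(n_estimators=100,
                             max_depth=5,
                             min_samples_leaf=20,
                             random_state=123)
clf.fit(X_train, y_train)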

I can't get my test accuracy to increase in a sentiment analysis

I'm not sure if this is the right place, but my test accuracy is always around 0.40 while I can get my training set accuracy to 1.0. I'm trying to do a sentiment analysis of tweets about Trump; I have annotated each tweet with a positive, negative or neutral polarity. I want to be able to predict the polarity of new data based on my model. I've tried different models, but the SVM seems to give me the highest test accuracy. I'm unsure why my model's accuracy is so low and would appreciate any help or direction.
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

trump = pd.read_csv("trump_data.csv", delimiter=";")

# drop all nan values
trump = trump.dropna()
trump = trump.rename(columns={"polarity,,,": "polarity"})
# print(trump.columns)

def tokenize(text):
    ps = PorterStemmer()
    return [ps.stem(w.lower()) for w in word_tokenize(text)]

X = trump.text
y = trump.polarity
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

svm = Pipeline([('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'),
                                               tokenizer=tokenize)),
                ('svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                                      random_state=42, max_iter=5, tol=None))])
svm.fit(X_train, y_train)

model = svm.score(X_test, y_test)
print("The svm Test Classification Accuracy is:", model)
print("The svm training set accuracy is : {}".format(naive.score(X_train, y_train)))
y_pred = svm.predict(X)
This is an example of one of the strings in the text column of the dataset
".#repbilljohnson congress must step up and overturn president trump’s discriminatory #eo banning #immigrants & #refugees #oxfam4refugees"
Data set
Why are you using naive.score? I assume it's a copy-paste mistake. Here are a few steps you can follow.
Make sure you have enough data points, and clean the data. Cleaning the dataset is an unavoidable step in data science.
Make use of parameters like ngram_range, max_df, min_df and max_features when featurizing the text with either TfidfVectorizer or CountVectorizer. You may also try embeddings using Word2Vec.
Do hyperparameter tuning on alpha, penalty and other variables using GridSearchCV or RandomizedSearchCV, and make sure you are cross-validating correctly; refer to the documentation for more info (a sketch follows this list).
If the dataset is imbalanced, then try using other metrics like log-loss, precision, recall, f1-score, etc. Refer to this for more info.
Make sure your model is neither overfitted nor underfitted by checking the train error and test error.
Other than SVM, also try traditional models like Logistic Regression, Naive Bayes, Random Forest, etc. If you have a large number of data points, then you may try Deep Learning models.
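A minimal sketch of such a tuning run over the pipeline from the question (the step names 'vectorizer' and 'svm' come from the code above; the grid values are only examples):
from sklearn.model_selection import GridSearchCV

param_grid = {
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'vectorizer__min_df': [1, 3, 5],
    'svm__alpha': [1e-4, 1e-3, 1e-2],
    'svm__penalty': ['l2', 'elasticnet'],
}

# 5-fold cross-validation on the training set only
search = GridSearchCV(svm, param_grid, cv=5, scoring='f1_macro', n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)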
Turns out I needed to clean the polarity column, as it had values such as "positive,", "positive,," and "positive,,," which were being treated as different classes, so I just removed all "," from the column.

How to print clusters of SVM in python

I want to classify rows of a column using the SVM clustering method. I can find plenty of content on the net that produces graphs or prints prediction accuracy, but I cannot find a way to print my clusters. The example below will better explain what I am trying to do:
I have a dataframe to be used as the training dataset:
import pandas as pd

train_data = {'Serial': [1,2,3,4,5,6,7,8,9,10],
              'Text': ['Dog is a faithful animal','cat are not reliable','Tortoise can live a long life',
                       'camel stores water in its hump','horse are used as means of transport','pen is a powerful weapon',
                       'stop when the signal is red','oxygen is a life gas','chocolates are bad for health','lets grab a cup of coffee'],
              'classification': ['Animal','Animal','Animal','Animal','Animal','Thing','Thing','Miscellenous','Thing','Thing']
              }
df = pd.DataFrame(train_data, columns = ['Serial', 'Text', 'classification'])
print(df)
I want to predict whether a text row is talking about an Animal, a Thing or is Miscellenous. The test data I want to pass is:
test_data = {'Serial': [1,2,3,4,5],
             'Text': ['Is this your dog?','Lets talk about the problem','You have a cat eye',
                      'Donot forget to take the camel ride when u goto dessert','Plants give us O2']
             }
df = pd.DataFrame(test_data, columns = ['Serial', 'Text'])
Expected result is an additional column 'Classification' getting created in the test dataframe with values ['Animal','Miscellenous','Animal','Animal','Miscellenous']
Here is the solution to your problem:
# import tfidf-vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# import support vector classifier
from sklearn.svm import SVC
import pandas as pd
train_data = {'Serial': [1,2,3,4,5,6,7,8,9,10],
              'Text': ['Dog is a faithful animal','cat are not reliable','Tortoise can live a long life',
                       'camel stores water in its hump','horse are used as means of transport','pen is a powerful weapon',
                       'stop when the signal is red','oxygen is a life gas','chocolates are bad for health','lets grab a cup of coffee'],
              'classification': ['Animal','Animal','Animal','Animal','Animal','Thing','Thing','Miscellenous','Thing','Thing']
              }
train_df = pd.DataFrame(train_data, columns = ['Serial', 'Text', 'classification'])
display(train_df)
test_data = {'Serial': [1,2,3,4,5],
             'Text': ['Is this your dog?','Lets talk about the problem','You have a cat eye',
                      'Donot forget to take the camel ride when u goto dessert','Plants give us O2']
             }
test_df = pd.DataFrame(test_data, columns = ['Serial', 'Text'])
display(test_df)
# Load training data (text) from the dataframe and form to a list containing all the entries
training_data = train_df['Text'].tolist()
# Load training labels from the dataframe and form to a list as well
training_labels = train_df['classification'].tolist()
# Load testing data from the dataframe and form a list
testing_data = test_df['Text'].tolist()
# Get a tfidf vectorizer to process the text into vectors
vectorizer = TfidfVectorizer()
# Fit the tfidf-vectorizer to training data and transform the training text into vectors
X_train = vectorizer.fit_transform(training_data)
# Transform the testing text into vectors
X_test = vectorizer.transform(testing_data)
# Get the SVC classifier
clf = SVC()
# Train the SVC with the training data (data points and labels)
clf.fit(X_train, training_labels)
# Predict the test samples
print(clf.predict(X_test))
# Add classification results to test dataframe
test_df['Classification'] = clf.predict(X_test)
# Display test dataframe
display(test_df)
As an explanation for the approach:
You have your training data and want to use it to train a SVM and then predict the test data with labels.
That means you need to extract the training data and labels for each data point (so for each phrase, you need to know whether it's an animal, a thing, etc.) and then you need to set up and train an SVM. Here, I used the implementation from scikit-learn.
Moreover, you can't just train the SVM with raw text data, because it requires numerical values. This means you need to transform the text data into numbers. This is "feature extraction from text", and one of the common approaches for it is the term frequency-inverse document frequency (TF-IDF) concept.
Now you can use a vector representation of each phrase coupled with a label for it to train the SVM and then use it to classify the test data :)
In short the steps are:
Extract data points and labels from training
Extract data points from testing
Set up SVM classifier
Set up TF-IDF vectorizer and fit it to training data
Transform training data and testing data with tf-idf vectorizer
Train the SVM classifier
Classify test data with trained classifier
I hope this helps!

sklearn grid search f1_score does not match f1_score function

I've been experimenting with the sklearn grid search and pipeline functionality and have noticed that the f1_score returned does not match the f1_score I generate using hard coded parameters. Looking for help understanding why this may be.
Data background: two column .csv file
customer comment (string), category tag (string)
Using the out-of-the-box sklearn bag-of-words approach with no pre-processing of the text, just CountVectorizer.
Hard coded model...
#get .csv data into dataFrame
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score

data_file = 'comment_data_basic.csv'
data = pd.read_csv(data_file, header=0, quoting=3)

#remove data without 'web issue' or 'product related' tag
data = data.drop(data[(data.tag != 'WEB ISSUES') & (data.tag != 'PRODUCT RELATED')].index)

#split dataFrame into two series
comment_data = data['comment']
tag_data = data['tag']

#split data into test and train samples
comment_train, comment_test, tag_train, tag_test = train_test_split(
    comment_data, tag_data, test_size=0.33)
#build count vectorizer
vectorizer = CountVectorizer(min_df=.002,analyzer='word',stop_words='english',strip_accents='unicode')
vectorizer.fit(comment_data)
#vectorize features and convert to array
comment_train_features = vectorizer.transform(comment_train).toarray()
comment_test_features = vectorizer.transform(comment_test).toarray()
#train LinearSVM Model
lin_svm = LinearSVC()
lin_svm = lin_svm.fit(comment_train_features,tag_train)
#make predictions
lin_svm_predicted_tags = lin_svm.predict(comment_test_features)
#score models
lin_svm_score = round(f1_score(tag_test,lin_svm_predicted_tags,average='macro'),3)
lin_svm_accur = round(accuracy_score(tag_test,lin_svm_predicted_tags),3)
lin_svm_prec = round(precision_score(tag_test,lin_svm_predicted_tags,average='macro'),3)
lin_svm_recall = round(recall_score(tag_test,lin_svm_predicted_tags,average='macro'),3)
#write out scores
print('Model f1Score Accuracy Precision Recall')
print('------ ------- -------- --------- ------')
print('LinSVM {f1:.3f} {ac:.3f} {pr:.3f} {re:.3f} '.format(f1=lin_svm_score,ac=lin_svm_accur,pr=lin_svm_prec,re=lin_svm_recall))
The f1_score output is generally around 0.86 (depending on random seed value)
Now if I basically reconstruct the same output with grid search and pipeline...
#get .csv data into dataFrame
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

data_file = 'comment_data_basic.csv'
data = pd.read_csv(data_file, header=0, quoting=3)

#remove data without 'web issue' or 'product related' tag
data = data.drop(data[(data.tag != 'WEB ISSUES') & (data.tag != 'PRODUCT RELATED')].index)

#build processing pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LinearSVC()),])

#define parameters to be used in gridsearch
parameters = {
    #'vect__min_df': (.001,.002,.003,.004,.005),
    'vect__analyzer': ('word',),
    'vect__stop_words': ('english', None),
    'vect__strip_accents': ('unicode',),
    #'clf__C': (1,10,100,1000),
}

if __name__ == '__main__':
    grid_search = GridSearchCV(pipeline, parameters, scoring='f1_macro', n_jobs=1)
    grid_search.fit(data['comment'], data['tag'])
    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_params = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_params[param_name]))
The returned f1_score is closer to 0.73, with all model parameters the same. My understanding is that grid search applies a cross-validation approach internally, but my guess is that the difference comes from whatever approach it uses compared to my use of train_test_split in the original code. However, a drop from 0.83 -> 0.73 feels large to me, and I would like to be confident in my results.
Any insight would be greatly appreciated.
In the code you provided, you are not setting the random_state parameter of the LinearSVC model, so even with the same hyperparameters you are unlikely to reproduce an exact duplicate of the best estimator from your GridSearchCV. However, that is a minor point compared to what is really going on.
The grid search is being cross-validated, in your case using 3 folds of the data. The best_score_ that you see is the score of the estimator that performed best on average across all of the held-out folds, and it may not be the estimator with the best score on your particular train/test split. It is possible that, given the split you used, a different estimator would score higher; but if you were to generate a handful of different splits and score the estimators on each of the test sets, on average the best_estimator_ would come out on top.
The idea is that by cross-validating you choose an estimator that is more resilient to changes in the data that are not necessarily represented in a single train/test split. So the more splits you use, the better your model should perform on new, unseen data. In this case "better" may not mean that it produces a more accurate result every time, but that, given the variations present in the existing data, the model does a better job of encompassing these variations and on average produces more accurate results in the long run, as long as new unseen data falls within what was seen in the training data.
If you want to see more info about how the estimators performed within splits, take a look at grid_search.cv_results_ for a better picture of what happened step by step through the process.
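A quick way to inspect that (a sketch, assuming a fitted grid_search with the default 3 folds mentioned above):
import pandas as pd

# one row per parameter combination, with the score on every CV fold
results = pd.DataFrame(grid_search.cv_results_)
print(results[['params', 'split0_test_score', 'split1_test_score',
               'split2_test_score', 'mean_test_score', 'std_test_score']])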
