Problem Statement - Classify a product review
Classes - Travel, Hotel, Cars, Electronics, Food, Movies
I am approaching this as a standard text classification problem. The feature set is prepared using gensim's default Doc2Vec model, and for classification I am using one-vs-rest Logistic Regression from sklearn.
For every class I feed 10,000 reviews to Doc2Vec (I am following this Doc2Vec tutorial), so the model learns a vector for each sentence. Of the resulting vectors, 80% from each class are given to LogisticRegression for training and 20% for testing. The accuracy of the classifier is 98%, but on unseen data the accuracy is just 17%. Also, a PCA of all the sentence vectors, plotted in 2D, results in one dense cluster. What I conclude from the plot is that the data is inseparable, but then how did the classifier reach 98% accuracy? And why is the accuracy on unseen data so low? How can I evaluate/validate my results?
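For concreteness, here is a minimal sketch of this kind of pipeline (not my exact code): reviews and labels are placeholder names for the tokenized review texts and their class names, and the hyperparameters are only illustrative.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# reviews: list of token lists, labels: list of class names (placeholders)
documents = [TaggedDocument(words, [i]) for i, words in enumerate(reviews)]

d2v = Doc2Vec(vector_size=100, min_count=2, epochs=20)
d2v.build_vocab(documents)
d2v.train(documents, total_examples=d2v.corpus_count, epochs=d2v.epochs)

# Infer a vector per review so training and unseen reviews share the same code path
X = [d2v.infer_vector(words) for words in reviews]

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=42)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))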
I have a question regarding the process of doing late fusion between a linear SVM and a neural network (NN).
From my research, I found that I should concatenate the clf.predict_proba output of the SVM with the model.predict output of the NN and then train a new model on that; however, these scores are for the test data, and I cannot figure out what to do with the training data.
In other words, I would train the new model on the concatenated probability scores of the test data from my two models (SVM and NN) and then test this new model with the same concatenated data, and I'm not really sure that this is right.
Can you please give me some insight into whether this is correct?
After a lot of searching and research I found the solution:
The solution is to train and test a new classifier (in my case another neural network) on the concatenated probability scores obtained from the two classifiers, the linear SVM and the neural network, for both data sets (training and test).
An example of late fusion of three linear SVMs was implemented in Python and can be found at the following link:
https://github.com/JMalhotra7/Learning-image-by-parts-using-early-and-late-fusion-of-auto-encoder-features
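A minimal sketch of this scheme, using scikit-learn's MLPClassifier in place of the actual Keras network (an assumption) and placeholder arrays X_train, X_test, y_train, y_test:

import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Base models, trained on the original training features
svm = SVC(kernel="linear", probability=True).fit(X_train, y_train)
nn = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500).fit(X_train, y_train)

# Probability scores from both base models, for training and test data
train_scores = np.hstack([svm.predict_proba(X_train), nn.predict_proba(X_train)])
test_scores = np.hstack([svm.predict_proba(X_test), nn.predict_proba(X_test)])

# Fusion model: trained on the training-set scores, evaluated on the test-set scores
fusion = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000).fit(train_scores, y_train)
print("late-fusion accuracy:", fusion.score(test_scores, y_test))

In practice the training-set scores are often produced with cross-validation (as scikit-learn's StackingClassifier does), so the fusion model does not just learn from the base models' fit to their own training data.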
What will happen if I use the same training data and validation data for my machine learning classifier?
If the training data and the validation data are the same, the trained classifier will have a high accuracy, because it has already seen the data. That is why we use train-test splits: we take 60-70% of the data to train the classifier, and then run the classifier against the remaining 30-40%, the validation data, which the classifier has not seen yet. This helps measure the accuracy of the classifier and its behavior, such as overfitting or underfitting, before it faces a real test set with no labels.
We create multiple models and then use the validation data to see which one performed best. We also use the validation data to tune the complexity of our model to the right level. If you use your training data as your validation data, you will achieve incredibly high levels of success (your misclassification rate or mean squared error will be tiny), but when you apply the model to real data that isn't from your training set, your model will do very poorly. This is called OVERFITTING to the training data.
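A minimal sketch of this effect, using make_classification only so the example is self-contained:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# A deep, unregularized tree memorizes the training data
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("accuracy on the training data:", model.score(X_train, y_train))  # essentially perfect
print("accuracy on held-out data:   ", model.score(X_val, y_val))       # noticeably lower

Scoring the model on X_train tells you almost nothing about generalization; only the held-out score does.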
Basically, nothing useful happens. You are just validating your model's performance on the same data it was trained on, which doesn't yield anything different or informative. It is like teaching someone to recognize an apple and then asking them to recognize that very same apple to see how well they learned.
Why is a validation set used then? In short, the training and validation sets are assumed to be generated from the same distribution, so a model trained on the training set should perform almost equally well on examples from the validation set that it has not seen before.
Generally, we divide the data into training and validation sets to prevent overfitting. To explain it, think of a model that classifies whether an image shows a human, and a dataset containing 1,000 human images. If you train your model on all the images in that dataset and then validate it on the same dataset, your accuracy will be around 99%. However, when you give the model an image from a different dataset to classify, the accuracy will be much lower. For this example, generalization means training a model that looks for the basic stick-figure shape to decide whether something is human, instead of looking for one specific blond man. That is why we split the dataset into training and validation sets: to generalize the model and prevent overfitting.
TLDR;
If you use the same dataset for training and validation then:
training_accuracy = testing_accuracy
Your testing_accuracy will be the same as your training_accuracy if you use the training dataset as the validation dataset. Therefore you will NOT be able to tell whether your model has overfit.
Let's talk about datasets and evaluation metrics. Here is some terminology (reference) -
Datasets:
Training dataset: The data used to fit the model.
Validation dataset: The data used to validate the generalization ability of the model, or for early stopping, during the training process. In most cases, this is the same as the test dataset.
Evaluations:
Training accuracy: The accuracy you achieve when comparing predictions and actuals from the training data itself.
Testing accuracy: The accuracy you achieve when comparing predictions and actuals from the testing/validation data.
With the training_accuracy, you get a sense of how well a model fits your data, and the testing_accuracy tells you how well that model generalizes. If training_accuracy is low, your model has underfitted and you may need a better model (better features, a different architecture, etc.) for the given problem. If training_accuracy is high but testing_accuracy is low, your model fits the data well but does not generalize to unseen data. This is overfitting.
Note: In practice, it is better to have an overfit model and regularize it heavily than to work with an underfit model.
Another important thing you need to understand is that training a model (fit) and running inference with a model (predict / score) are two separate tasks. Therefore, when you use the training dataset as the validation dataset, you are still training the model on that same training dataset, but at inference time you are also scoring the training dataset, which gives you the same accuracy as the training_accuracy.
You will therefore not come to know whether you have overfit, BUT that doesn't mean you will get 99% accuracy, as the other answer suggests! You may still underfit and get an extremely low accuracy.
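A minimal sketch of this fit/predict distinction, again on synthetic data: when the "validation" set is the training set the two scores coincide, but a low score still exposes underfitting.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# flip_y adds label noise, so even the training accuracy stays well below 1.0
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2,
                           flip_y=0.3, random_state=0)

model = LogisticRegression().fit(X, y)   # training (fit)
preds = model.predict(X)                 # inference (predict) on the same data

training_accuracy = accuracy_score(y, preds)
testing_accuracy = accuracy_score(y, preds)   # "validation" set == training set

print(training_accuracy == testing_accuracy)  # True: the two numbers are identical
print(training_accuracy)                      # still low if the model underfits the data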
I'm using Keras and Python to train an MLP Sequential model for classification of two classes. My training data has 247 features, with 17 samples of class 1 and 922 samples of class 2. I use the Borderline-SMOTE oversampling algorithm to balance the dataset, and cross-validation with k=4 to assess precision and recall. While training on each fold I plot the loss curves for training and validation to estimate whether the model is under- or overfitted.
I trained a model with 3 hidden layers and reached 95% precision and 71% recall. The loss plot for each fold doesn't look overfitted (I'm not allowed by Stack Overflow to post the image). But the evaluation of this model is worse than that of a model whose training precision and recall are worse.
Is this overfitting? And how can I detect it before evaluation?
Thanks in advance!
You can't detect overfitting simply from the learning curve. The definition of overfitting is when your model does extremely well on the training set, and poorly on your evaluation set, which is exactly what you're reporting.
In this case, I suspect the main problem is the unbalanced dataset. You can verify the spread of both classes in each of your sets (training, validation folds and test set) and see how your model performs on the minority class.
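A minimal sketch of that check, using scikit-learn utilities around the Keras model (an assumption): X and y are placeholders for the 247-feature matrix and the 0/1 label array, and build_model() stands in for the Keras MLP.

import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Class counts per fold: with 17 vs. 922 samples, each validation
    # fold only contains a handful of minority-class examples
    print(f"fold {fold}: train counts {np.bincount(y[train_idx])}, "
          f"val counts {np.bincount(y[val_idx])}")

    model = build_model()  # placeholder for the Keras MLP described above
    model.fit(X[train_idx], y[train_idx], epochs=50, verbose=0)
    preds = (model.predict(X[val_idx]) > 0.5).astype(int).ravel()

    # Per-class precision/recall makes minority-class behaviour visible
    print(classification_report(y[val_idx], preds, digits=3))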
I have a sample of approximately 10,000 tweets that I want to classify into the categories "relevant" and "not relevant". I am using Python's scikit-learn for this model. I manually coded 1,000 tweets as "relevant" or "not relevant". Then, I ran an SVM model using 80% of the manually coded data as training data and the rest as test data. I obtained good results (prediction accuracy ~0.90), but to avoid overfitting I decided to use cross-validation on all 1,000 manually coded tweets.
Below is my code after already obtaining the tf-idf matrix for the tweets in my sample. "target" is an array listing whether the tweet was marked as "relevant" or "not relevant".
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
clf = SGDClassifier()
scores = cross_val_score(clf, X_tfidf, target, cv=10)
predicted = cross_val_predict(clf, X_tfidf, target, cv=10)
With this code, I was able to get predictions of what classes the 1,000 tweets belonged to, and I could compare that against my manual coding.
I'm stuck on what to do next in order to use my model to classify the other ~9,000 tweets that I did not manually code. I was thinking of using cross_val_predict again, but I'm not sure what to put in the third argument, since the class is exactly what I'm trying to predict.
Thanks for all your help in advance!
cross_val_predict is not a method for obtaining predictions from a final model. Cross-validation is a technique for model selection/evaluation, not for training the model you will actually use. cross_val_predict is a very specific function (it gives you the predictions made by the many models trained during the cross-validation procedure). For actual model building you are supposed to use fit to train your model and predict to get predictions. No cross-validation is involved there; as said before, cross-validation is for model selection (choosing your classifier, hyperparameters, etc.), not for training the final model.
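A minimal sketch of those final steps; X_tfidf_unlabeled is an assumed name for the tf-idf matrix of the ~9,000 uncoded tweets, produced with the same fitted vectorizer as X_tfidf.

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

clf = SGDClassifier()

# Cross-validation only estimates how well this configuration generalizes
scores = cross_val_score(clf, X_tfidf, target, cv=10)
print("estimated accuracy:", scores.mean())

# The final model is fit on all 1,000 labeled tweets...
clf.fit(X_tfidf, target)

# ...and then used to label the remaining ~9,000 tweets
predicted_unlabeled = clf.predict(X_tfidf_unlabeled)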
I want to perform text classification using topic-modeling information as features fed to an SVM classifier. So I was wondering how it is possible to generate topic-model features by performing LDA on both the training and test partitions of the dataset, since the corpus changes between the two partitions.
Am I making a wrong assumption?
Could you provide an example of how to do this using scikit-learn?
Your assumption is right. You train your LDA on the training data only, and then transform both the training and the testing data with that trained model.
So you'll have something like this:
from sklearn.decomposition import LatentDirichletAllocation as LDA
lda = LDA(n_components=10)  # called n_topics in older scikit-learn versions; add other hyperparameters as needed
lda.fit(training_data)  # fit the topic model on the training partition only
training_features = lda.transform(training_data)  # transform both partitions...
testing_features = lda.transform(testing_data)    # ...with the same fitted model
If I were you, I would concatenate the LDA features with bag-of-words features, using numpy.hstack, or scipy.sparse.hstack if your BoW features are sparse.
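For instance, a minimal sketch of that concatenation, assuming training_data / testing_data are lists of raw texts, with a CountVectorizer for the bag-of-words part and placeholder label arrays train_labels / test_labels:

from scipy.sparse import hstack
from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Bag-of-words features, fitted on the training texts only
bow = CountVectorizer()
X_train_bow = bow.fit_transform(training_data)
X_test_bow = bow.transform(testing_data)

# LDA topic proportions derived from the same bag-of-words matrices
lda = LDA(n_components=10, random_state=0)
X_train_lda = lda.fit_transform(X_train_bow)
X_test_lda = lda.transform(X_test_bow)

# Sparse BoW features and dense LDA features, stacked side by side
X_train = hstack([X_train_bow, X_train_lda])
X_test = hstack([X_test_bow, X_test_lda])

svm = LinearSVC().fit(X_train, train_labels)  # train_labels / test_labels are placeholders
print("accuracy:", svm.score(X_test, test_labels))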