I am new to NLP and it is a bit confusing to me.
I am trying to do a text classification with SVC on my dataset.
I have an imbalanced dataset of 6 classes.
The texts are news articles belonging to the classes health, sport, culture, economy, science and web.
I am using TF-IDF for vectorization.
The preprocessing steps: lower-casing all texts and removing stop-words. Since my texts are in German, I did not use lemmatization.
My first try:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

train, test = train_test_split(df, test_size=0.2, random_state=42)
X_train = train['text']
y_train = train['category']
X_test = test['text']
y_test = test['category']

# Linear SVC:
text_clf_lsvc = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
text_clf_lsvc.fit(X_train, y_train)
predictions = text_clf_lsvc.predict(X_test)
My accuracy score was 93%.
Then I decided to reduce the dimensionality, so on my second try I added TruncatedSVD:
from sklearn.decomposition import TruncatedSVD

# Linear SVC with TruncatedSVD:
text_clf_lsvc = Pipeline([('tfidf', TfidfVectorizer()), ('svd', TruncatedSVD()),
                          ('clf', LinearSVC())])
text_clf_lsvc.fit(X_train, y_train)
predictions = text_clf_lsvc.predict(X_test)
My accuracy score dropped to 34%.
My questions:
1. How can I improve my model if I want to stick to TF-IDF and SVC for classification?
2. What else can I do if I want to get a good classification?
The best way to improve accuracy, given that you want to stick with this configuration, is through hyperparameter tuning, or by introducing additional components such as feature selection.
Hyperparameter tuning
Most machine learning algorithms and parts of a machine learning pipeline have several parameters you can change. For example, the TfidfVectorizer has different ngram ranges, different analysis levels, different tokenizers, and many more parameters to vary. Most of these will affect your performance. So, what you can do is systematically vary these parameters (and those of your SVC), while monitoring your accuracy on a development set (i.e., not the test data!). Instead of a fixed development set, cross-validation is typically used in these kinds of settings.
The best way to do this in sklearn is through a RandomizedSearchCV (see here for details). This class automatically cross-validates and searches through the possible options you pre-specify by randomly sampling from the option set for a fixed number of iterations. By applying this technique on your training data, you will automatically find models that perform better for your given training data and your options. Ideally, these models would also perform better on your test data. Fair warning: cross-validated search techniques can take a while to run.
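For instance, a minimal sketch of such a randomized search over the TF-IDF + LinearSVC pipeline from your question could look like the following (the parameter ranges, n_iter and cv values are arbitrary examples, and X_train/y_train are your training split):

from scipy.stats import loguniform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

# Options to sample from; the ranges here are only examples and should be adapted.
param_dist = {
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'tfidf__min_df': [1, 2, 5],
    'tfidf__sublinear_tf': [True, False],
    'clf__C': loguniform(1e-2, 1e2),
}

search = RandomizedSearchCV(pipe, param_dist, n_iter=30, cv=5,
                            scoring='accuracy', random_state=42)
search.fit(X_train, y_train)  # tune on the training data only
print(search.best_params_, search.best_score_)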
Feature Selection
In addition to grid search, another way to improve performance is through feature selection. Feature selection typically consists of a statistical test that determines which features best explain variance in the target of the task you are trying to solve. The feature selection methods in sklearn are detailed here.
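As a rough illustration, a feature selection step can be added directly to the pipeline from your question; SelectKBest with the chi2 test works on the non-negative TF-IDF features, and the value of k here is arbitrary (it must not exceed the vocabulary size and should itself be tuned in cross-validation):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

text_clf_fs = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('select', SelectKBest(chi2, k=5000)),  # k is an arbitrary example value
    ('clf', LinearSVC()),
])

# Evaluate in cross-validation on the training data, not on the test set.
print(cross_val_score(text_clf_fs, X_train, y_train, cv=5).mean())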
By far the most important bit here is that the performance of anything you add to your model should be verified on an independent development set, or in cross-validation. Leave your test data alone.
I'm working on data where I'm trying different classification algorithms to see which one performs best as a baseline model. The code for that is as follows:
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Trying out different classifiers and selecting the best
## Create list of classifiers we're going to loop through
classifiers = [
    KNeighborsClassifier(),
    SVC(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier()
]
classifier_names = [
    'kNN',
    'SVC',
    'DecisionTree',
    'RandomForest',
    'AdaBoost',
    'GradientBoosting'
]
model_scores = []

## Looping through the classifiers ('preprocessor' is defined earlier in my code)
for classifier, name in zip(classifiers, classifier_names):
    pipe = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('selector', SelectKBest(k=len(X.columns))),
        ('classifier', classifier)])
    score = cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
    model_scores.append(score)
    print("Model score for {}: {}".format(name, score))
The output is:
Model score for kNN: 0.7472524440239673
Model score for SVC: 0.7896621728161464
Model score for DecisionTree: 0.7302148734267939
Model score for RandomForest: 0.779058799919727
Model score for AdaBoost: 0.7949635904933918
Model score for GradientBoosting: 0.7930712637252372
Turns out the best model is the AdaBoostClassifier(). I would normally pick the best baseline model and perform a GridSearchCV on it to further improve its baseline performance.
However, what if, hypothetically speaking, the model that performed best as a baseline (in this case AdaBoost) only improves by 1% through hyperparameter tuning, while a model that initially performed worse (for example SVC()) has more "potential" to improve through hyperparameter tuning (i.e., would improve by, say, 4%) and would, after tuning, end up being the better model?
Is there a way to know this "potential" beforehand, without performing a GridSearch for all classifiers?
No, there is no way to know for 100% certain before hyperparameter tuning which classifier will end up performing best on any given problem. However, in practice what Kaggle competitions have shown on tabular data classification problems (as opposed to text- or image-based ones) is that in almost every case a gradient-boosted decision tree model (like XGBoost or LightGBM) works best. Given this, it's likely that GradientBoosting will perform better after hyperparameter tuning, since it belongs to the same family of gradient-boosted tree models.
What you are doing in the above code is simply using the default values of all hyperparameters, and for algorithms that are more sensitive to hyperparameter tuning this is not necessarily indicative of the final (fine-tuned) performance, as you've suggested.
Yes, there are ways like univariate, bivariate and multivariate analysis to look at the data and then decide which model to start with as a baseline.
You can also use the scikit-learn machine learning map to choose the right estimator:
https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
I have a flight delay dataset and I am trying to split it into train and test sets before sampling. On-time cases are about 80% of the total data and delayed cases are about 20%.
Normally in machine learning the ratio of train to test set size is 8:2. But my data is quite imbalanced, so in the extreme case most of the train data could be on-time cases and most of the test data delayed cases, and accuracy would be poor.
So my question is: how can I properly split an imbalanced dataset into train and test sets?
Probably just by playing with the ratio of train and test you will not get correct predictions and results.
If you are working on an imbalanced dataset, you should try re-sampling techniques to get better results. With imbalanced datasets, a classifier tends to "predict" the most common class without performing any real analysis of the features.
Also use a different metric for performance measurement, such as the F1 score, in the case of an imbalanced dataset.
Please go through the links below, they will give you more clarity.
What is the correct procedure to split the Data sets for classification problem?
Cleveland heart disease dataset - can’t describe the class
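To illustrate the metric suggestion above, a minimal sketch (assuming y_test and predictions come from your own split and model, and that 'delayed' is the name of the minority class in your labels):

from sklearn.metrics import classification_report, f1_score

print(f1_score(y_test, predictions, pos_label='delayed'))  # F1 for the minority class
print(classification_report(y_test, predictions))          # per-class precision/recall/F1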
Start from 50/50 and go on changing the split to 60/40, 70/30, 80/20, 90/10. Record all the results and come to a conclusion. In one of my projects on flight delay prediction, I used a 60/40 split and got 86.8% accuracy using an MLP neural network.
There are two approaches that you can take.
A simple one: no preprocessing of the dataset, but careful sampling so that both classes are represented in the same proportion in the test and train subsets. You can do it by splitting by class first and then randomly sampling from both sets.
from sklearn.model_selection import train_test_split

XclassA = dataX[0]  # TODO: change to split by class
XclassB = dataX[1]
YclassA = dataY[0]
YclassB = dataY[1]

XclassA_train, XclassA_test, YclassA_train, YclassA_test = train_test_split(
    XclassA, YclassA, test_size=0.2, random_state=42)
XclassB_train, XclassB_test, YclassB_train, YclassB_test = train_test_split(
    XclassB, YclassB, test_size=0.2, random_state=42)

Xclass_train = XclassA_train + XclassB_train
Yclass_train = YclassA_train + YclassB_train
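Alternatively, scikit-learn can do this class-proportional split for you via the stratify argument of train_test_split; a minimal sketch, assuming dataX and dataY here hold the full feature matrix and label vector rather than the per-class pieces above:

from sklearn.model_selection import train_test_split

# stratify=dataY keeps the on-time/delayed proportions identical in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    dataX, dataY, test_size=0.2, random_state=42, stratify=dataY)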
A more involved, and arguably better, one: you can first try to balance your dataset. For that you can use one of many techniques (under-sampling, over-sampling, SMOTE, ADASYN, Tomek links, etc.). I recommend you review the methods of the imbalanced-learn package. Having done the balancing, you can use an ordinary test/train split using typical methods without any additional intermediary steps.
The second approach is better not only from the perspective of splitting the data but also from the speed and even ability to train a model (which for heavily imbalanced datasets is not guaranteed to work).
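As a small sketch of the second approach, using RandomUnderSampler from imbalanced-learn (any of the techniques listed above would slot in the same way); note that, as a later answer stresses, resampling is normally applied to the training split only:

from imblearn.under_sampling import RandomUnderSampler

# X_train, y_train come from an ordinary train/test split; the test set keeps
# its original class distribution. Recent imbalanced-learn versions use
# fit_resample (older releases called it fit_sample).
rus = RandomUnderSampler(random_state=42)
X_train_bal, y_train_bal = rus.fit_resample(X_train, y_train)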
To understand what isolation forest really does, I did a sample project using 8 features, as follows.
import numpy as np
from sklearn.ensemble import IsolationForest

# features
df_selected = df[["feature1", "feature2", "feature3", "feature4",
                  "feature5", "feature6", "feature7", "feature8"]]
X = np.array(df_selected)

# isolation forest (the behaviour parameter has been removed in recent
# scikit-learn versions; "new" is now the only behaviour)
clf = IsolationForest(max_samples='auto', random_state=42, contamination=.01)
clf.fit(X)
y_pred_train = clf.predict(X)
print(np.where(y_pred_train == -1)[0])
Now, I want to identify what are the outlier documents using isolation forest. For that I trained a doc2vec model using gensim. Now for each of my document in the dataset I have a 300-dimensional vector.
My question is can I straight away use the document vectors in isolation forest as X in the above code to detect outliers? Or do I need to reduce the dimensionality of the vectors before applying them to isolation forest?
I am happy to provide more details if needed.
You can straight away use predict() on the document vectors to detect outliers, unless you plan on removing some variables that would then not be considered by the trained model.
In general, I would say to do a correlation analysis and remove the variables that are highly correlated with each other (the logic being that if two variables are highly correlated, they carry essentially the same information, and keeping both would double-count it and bias the model).
Feel free to dispute this or state your considerations; the above is really just my opinion on how to approach the problem.
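A rough sketch of the correlation check suggested above, assuming the features sit in a pandas frame like df_selected from your snippet (the 0.9 threshold is an arbitrary choice):

import numpy as np

corr = df_selected.corr().abs()  # absolute pairwise correlations
# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Drop one feature from every pair whose correlation exceeds the threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df_reduced = df_selected.drop(columns=to_drop)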
I'm currently working on a problem which compares the performance of three different machine learning algorithms on the same data-set. I divided the data-set into 70/30 training/testing sets and then performed a grid search for the best parameters of each algorithm using GridSearchCV and X_train, y_train.
First question: am I supposed to perform the grid search on the training set, or is it supposed to be on the whole data-set?
Second question: I know that GridSearchCV uses K-fold in its implementation. Does it mean that I performed cross-validation if I used the same X_train, y_train for all three algorithms I compare in the GridSearchCV?
Any answer would be appreciated, thank you.
All estimators in scikit-learn whose name ends with CV perform cross-validation.
But you need to keep a separate test set for measuring the performance.
So you need to split your whole data to train and test. Forget about this test data for a while.
And then pass only this train data to the grid search. GridSearchCV will split this train data further into train and validation folds to tune the hyper-parameters passed to it, and finally fit the model on the whole train data with the best found parameters.
Now you need to test this model on the test data you kept aside in the beginning. This will give you a near real-world estimate of the model's performance.
If you use the whole data into GridSearchCV, then there would be leakage of test data into parameter tuning and then the final model may not perform that well on newer unseen data.
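A minimal sketch of that procedure (the estimator and parameter grid are just placeholders):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# 1. Hold out the test set first and leave it alone during tuning.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Grid search (with its internal k-fold CV) on the training data only.
param_grid = {'n_estimators': [100, 300], 'max_depth': [None, 10, 20]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                      cv=5, scoring='accuracy')
search.fit(X_train, y_train)  # refits the best model on all of X_train

# 3. One final evaluation on the untouched test set.
print(search.score(X_test, y_test))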
You can look at my other answers which describe the GridSearch in more detail:
Model help using Scikit-learn when using GridSearch
scikit-learn GridSearchCV with multiple repetitions
Yes, GridSearchCV performs cross-validation. If I understand the concept correctly, you want to keep part of your data set unseen by the model in order to test it.
So you train your models against train data set and test them on a testing data set.
Here I was doing almost the same - you might want to check it...
I'm using scikit-learn in my Python program in order to perform some machine-learning operations. The problem is that my data-set has severe imbalance issues.
Is anyone familiar with a solution for imbalance in scikit-learn or in Python in general? In Java there's the SMOTE mechanism. Is there something parallel in Python?
There is a new one here:
https://github.com/scikit-learn-contrib/imbalanced-learn
It contains many algorithms in the following categories, including SMOTE (see the usage sketch after the list):
Under-sampling the majority class(es).
Over-sampling the minority class.
Combining over- and under-sampling.
Create ensemble balanced sets.
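A minimal usage sketch (SMOTE shown here; the other samplers follow the same fit_resample pattern, and resampling should be applied to the training split only):

from imblearn.over_sampling import SMOTE

# Recent imbalanced-learn versions use fit_resample (older ones called it fit_sample).
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)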
In scikit-learn there are some imbalance correction techniques, which vary depending on which learning algorithm you are using.
Some of them, like SVM or logistic regression, have the class_weight parameter. If you instantiate an SVC with this parameter set to 'balanced', it will weight each class example proportionally to the inverse of its frequency.
Unfortunately, there isn't a preprocessor tool with this purpose.
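A minimal sketch of that class_weight option (X_train/y_train stand for your own training data):

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# 'balanced' weights each class inversely proportional to its frequency
svc_clf = SVC(class_weight='balanced').fit(X_train, y_train)
logreg_clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X_train, y_train)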
I found another library here which implements undersampling and also multiple oversampling techniques, including several SMOTE implementations and one that uses SVM:
A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
Since others have listed links to the very popular imbalanced-learn library, I'll give an overview of how to properly use it, along with some links.
https://imbalanced-learn.org/en/stable/generated/imblearn.under_sampling.RandomUnderSampler.html
https://imbalanced-learn.org/en/stable/generated/imblearn.over_sampling.RandomOverSampler.html
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html
https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/over-sampling/plot_comparison_over_sampling.html#sphx-glr-auto-examples-over-sampling-plot-comparison-over-sampling-py
https://imbalanced-learn.org/en/stable/combine.html
Some common over-sampling and under-sampling techniques in imbalanced-learn are imblearn.over_sampling.RandomOverSampler, imblearn.under_sampling.RandomUnderSampler, and imblearn.over_sampling.SMOTE. For these classes there is a parameter (sampling_strategy, called ratio in older versions) that allows the user to change the sampling ratio.
For example, in SMOTE, to change the ratio you would input a dictionary, and each target value must be greater than or equal to that class's current number of samples (since SMOTE is an over-sampling technique and can only add samples). The reason I have found SMOTE to be a better fit for model performance is probably because with RandomOverSampler you are duplicating rows, which means the model can start to memorize the data rather than generalize to new data. SMOTE uses the k-nearest-neighbors algorithm to create "similar" data points for the under-represented classes.
It is not good practice to blindly use SMOTE with the ratio set to its default (even class balance), because the model may overfit one or more of the minority classes (even though SMOTE uses nearest neighbors to make "similar" observations). In the same way that you tune the hyperparameters of an ML model, you should tune the hyperparameters of the SMOTE algorithm, such as the ratio and/or the number of neighbors (k_neighbors). Below is a working example of how to properly use SMOTE.
NOTE: It is vital that you do not use SMOTE on the full data set. You MUST use SMOTE on the training set only (after you split). Then validate on your val/test sets and see if your SMOTE model outperformed your other model(s). If you do not do this, there will be data leakage and your model will essentially be cheating.
from collections import Counter

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier

# Target number of samples per class; recent imbalanced-learn versions call
# this parameter sampling_strategy (older releases used ratio).
sm = SMOTE(random_state=0, n_jobs=8,
           sampling_strategy={'class1': 100, 'class2': 100, 'class3': 80,
                              'class4': 60, 'class5': 90})

### Train test split
X_train, X_val, y_train, y_val = train_test_split(X, y)

### Scale the data before applying SMOTE
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)

### Resample X_train_scaled (fit_resample was called fit_sample in older versions)
X_train_resampled, y_train_resampled = sm.fit_resample(X_train_scaled, y_train)
print('Original dataset shape:', Counter(y_train))
print('Resampled dataset shape:', Counter(y_train_resampled))

### Train a model on the resampled training data
### (newer xgboost versions take early_stopping_rounds in the constructor instead)
xgbc_smote = XGBClassifier(n_jobs=8).fit(X_train_resampled, y_train_resampled,
                                         eval_set=[(X_val_scaled, y_val)],
                                         early_stopping_rounds=10)

### Evaluate the model (macro F1, since this is a multi-class problem)
print('\ntrain\n')
print(accuracy_score(y_train, xgbc_smote.predict(np.array(X_train_scaled))))
print(f1_score(y_train, xgbc_smote.predict(np.array(X_train_scaled)), average='macro'))
print('\nval\n')
print(accuracy_score(y_val, xgbc_smote.predict(np.array(X_val_scaled))))
print(f1_score(y_val, xgbc_smote.predict(np.array(X_val_scaled)), average='macro'))
SMOTE is not built into scikit-learn, but there are implementations available online nevertheless.
Edit: the discussion with a SMOTE implementation on GMane that I originally linked to appears to no longer be available. The code is preserved here.
The newer answer below, by nos, is also quite good.