I have a dataset with 450,000 data points, 12 features and a binary label (0 or 1). I am using Python's imblearn library because the dataset is imbalanced (ratio 1:50, class 1 is the minority). I am using EasyEnsembleClassifier as the classifier. My problem is that I get high recall but very low precision, as the classification report below shows (90% recall, 8% precision, 14% F1 score).
Here is my code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from dask_ml.preprocessing import RobustScaler
from imblearn.ensemble import EasyEnsembleClassifier
from sklearn.metrics import classification_report, f1_score, accuracy_score, precision_score, confusion_matrix
from sklearn import metrics
df = pd.read_csv(...)
X = df[['features...']]
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = RobustScaler()
X_train = scaler.fit_transform(X_train)
clf = EasyEnsembleClassifier(n_estimators=50, n_jobs=-1, sampling_strategy = 1.0)
clf.fit(X_train, y_train)
X_test = scaler.transform(X_test)
y_pred = clf.predict(X_test)
------code for report------
.............
Output:
Classification Report
I tried different scalers, namely MinMaxScaler and StandardScaler. I tried changing the train-test split ratio and different parameters of EasyEnsembleClassifier. I also tried BalancedRandomForestClassifier from the same library, but the results are the same. Changing the number of estimators in the classifier doesn't change the result either.
What is the reason for these results? What can I do to improve precision without damaging recall? It looks like I am doing something wrong in my code or missing an important concept.
Edit:
I still couldn't figure out the true cause of my problem, but since no one has answered my question, here are some ideas about what could be behind this odd model, in case someone else runs into a similar problem:
Most probably my dataset is poorly labeled. It is possible that the model cannot distinguish the classes because they are very much alike. I will try to generate some synthetic data and train my model again.
I have not tested this, but some features may be harming the model. I need to inspect them visually to find out whether there is correlation between features and remove some of them. However, I doubt this is the problem, because boosting classifiers should handle it automatically by weighting each feature.
Also, 12 features may not be enough in my case; I may need more. It is not easy to generate more features for my dataset, but I will think about it.
Finally, maybe undersampling is not suited to my dataset. I will give oversampling techniques or SMOTE a shot if I feel desperate enough.
You could try other ensemble methods for class-imbalance learning. SMOTEBoost is one such method: it combines boosting with data sampling, essentially injecting the SMOTE technique at each boosting iteration.
This article could be of interest to you.
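As a rough check on the numbers reported in the question, the arithmetic below works out what false positive rate they imply. It is only a sketch: it assumes "1:50" means 50 majority rows for every minority row, and the variable names are made up for illustration.

```python
# Figures taken from the question: recall, precision and the class ratio.
recall = 0.90
precision = 0.08
negatives_per_positive = 50  # assumption: "1:50" means 50 majority rows per minority row

# precision = TP / (TP + FP), so each true positive comes with
# (1 - precision) / precision false positives.
false_pos_per_true_pos = (1 - precision) / precision

# False positive rate = FP / negatives, worked out per positive example.
false_positive_rate = recall * false_pos_per_true_pos / negatives_per_positive

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"implied false positive rate: {false_positive_rate:.3f}")
print(f"implied F1 score: {f1:.3f}")
```

Under these assumptions the model flags roughly one negative in five as positive. With 50 negatives per positive, that is enough to swamp the true positives, which is why precision stays low even though recall is high.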
Related
I am struggling to understand why, as a common practice, a cross-validation step is applied to a model that has not been trained yet. An example of what I mean can be found here. A piece of the code is pasted below:
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# create dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# prepare the cross-validation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# create model
model = LogisticRegression()
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
Questions:
What would be the purpose of the cross-validation at that point?
Does some training procedure take place on any part of that code?
How does RepeatedKFold contribute to tackling an unbalanced dataset (let's assume that this is the case)?
Thanks in advance!
cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
According to the documentation, cross_val_score fits the model using the given cross-validation technique.
In the code above, "model" contains the model that will be fit, and "cv" describes the cross-validation method that cross_val_score will use to build the training and CV sets and evaluate the model.
In other words, those are just definitions; the actual training and CV happen inside the cross_val_score function.
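To make the point above concrete, here is a rough sketch of what cross_val_score does internally for this example: it clones the unfitted model, fits the clone on each training fold and scores it on the held-out fold. This is an illustration, not scikit-learn's actual implementation; max_iter=1000 is added only to avoid convergence warnings, and names such as manual_scores are made up.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=1)
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
model = LogisticRegression(max_iter=1000)  # still unfitted at this point

# Manual version: one fresh fit per (train, test) split.
manual_scores = []
for train_idx, test_idx in cv.split(X, y):
    fold_model = clone(model)                   # unfitted copy with the same settings
    fold_model.fit(X[train_idx], y[train_idx])  # training happens here
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))  # accuracy
manual_scores = np.array(manual_scores)

# Library version: the same loop, hidden inside cross_val_score.
library_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)

print('manual : %.3f (%.3f)' % (manual_scores.mean(), manual_scores.std()))
print('library: %.3f (%.3f)' % (library_scores.mean(), library_scores.std()))
```

The original model object is never fitted by cross_val_score; only the per-fold clones are, so you still need to call model.fit yourself if you want a final model.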
How does RepeatedKFold contribute to tackling an unbalanced dataset (let's assume that this is the case)?
K-fold CV generally doesn't tackle an unbalanced dataset; it just ensures that the result is not biased by the choice of the training/CV datasets.
Repeated k-fold cross-validation provides a way to improve the estimated performance of a machine learning model. This involves simply repeating the cross-validation procedure multiple times and reporting the mean result across all folds from all runs. This mean result is expected to be a more accurate estimate of the true unknown underlying mean performance of the model on the dataset, as calculated using the standard error.
If you want to tackle an unbalanced dataset, you have to use a better metric than accuracy, such as 'balanced_accuracy' or 'roc_auc', and make sure that both the training and CV datasets contain both positive and negative cases.
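One way to follow that advice is sketched below. It swaps RepeatedKFold for RepeatedStratifiedKFold so every fold keeps roughly the same class proportions, and reports several metrics side by side. The synthetic 95/5 dataset and all parameter values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# Synthetic imbalanced data: roughly 95% negatives, 5% positives (illustrative only).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           weights=[0.95, 0.05], random_state=1)

# Stratified folds preserve the class ratio, so every fold contains both classes.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
model = LogisticRegression(max_iter=1000)

metrics = ['accuracy', 'balanced_accuracy', 'roc_auc']
results = cross_validate(model, X, y, cv=cv, scoring=metrics)

for name in metrics:
    scores = results['test_' + name]
    print('%s: %.3f (%.3f)' % (name, scores.mean(), scores.std()))
```

On data like this, accuracy tends to look flattering because predicting the majority class is already right most of the time, which is why the other two metrics are usually more informative.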
I have a Gaussian naive Bayes algorithm running against a dataset. What I need is to get the feature importance (the impact of each feature) on the target class.
Here's my code:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(inputs, target, test_size=0.2)
gaussian_nb = GaussianNB()
gaussian_nb.fit(X_train, y_train)
gaussian_nb.score(X_test, y_test)*100
And I tried:
importance = gaussian_nb.coefs_ # and even tried coef_
and it gives an error:
AttributeError: 'GaussianNB' object has no attribute 'coefs_'
Can someone please help me?
The GaussianNB does not offer an intrinsic method to evaluate feature importances. Naïve Bayes methods work by determining the conditional and unconditional probabilities associated with the features and predict the class with the highest probability. Thus, there are no coefficients computed or associated with the features you used to train the model (compare with its documentation).
That being said, there are methods that you can apply post-hoc to analyze the model after it has been trained. One of these methods is the Permutation Importance and it, conveniently, has also been implemented in scikit-learn. With the code you provided as a base, you would use permutation_importance the following way:
from sklearn.inspection import permutation_importance
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(inputs, target, test_size=0.2)
gaussian_nb = GaussianNB()
gaussian_nb.fit(X_train, y_train)
imps = permutation_importance(gaussian_nb, X_test, y_test)
print(imps.importances_mean)
Note that permutation importance is dataset dependent, so you have to pass a dataset to obtain the values. This can be either the same data you used to train the model, i.e. X_train and y_train, or a hold-out set that you saved for evaluation, like X_test and y_test. The latter is the better choice if you care about how the model generalizes.
If you want to know more about Permutation Importance as a method and how it works, then the user guide provided by scikit-learn is definitely a good start.
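For a version that runs without the asker's inputs and target, here is a small sketch on synthetic data. It also sets n_repeats and random_state and prints the standard deviation next to each mean, which helps judge whether a difference between features is noise. The dataset and parameter values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the asker's data: 6 features, 3 of them informative.
X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = GaussianNB().fit(X_train, y_train)

# Shuffle each feature 10 times on the hold-out set and record the drop in accuracy.
imps = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

# Rank features from most to least important.
ranking = imps.importances_mean.argsort()[::-1]
for idx in ranking:
    print(f"feature {idx}: {imps.importances_mean[idx]:.3f} "
          f"+/- {imps.importances_std[idx]:.3f}")
```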
If you have a look at the documentation, Naive Bayes does not have these attributes for feature importance. You can read the learned class priors from the class_prior_ attribute (get_params only returns the constructor settings), but that tells you nothing about individual features. If you need to understand feature importance, a good solution would be to do that analysis on something like a decision tree and then fit GaussianNB using the most important features.
I am practicing simple regression models as an intro to machine learning. I have reviewed a few sample models for multiple regression, which is, I believe, an extension of linear regression, but with more than 1 feature. From the examples I have seen, the syntax is the same for linear regression and multiple regression. I get this error when running the code below:
ValueError: x and y must be the same size.
Why do I get this error, and how can I fix it?
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
df = pd.read_csv(r"C:\Users\****\Desktop\data.csv")
#x.shape =(20640, 2), y=(20640,)
X = df[['total_rooms', 'median_income']]
y = df['median_house_value']
X_test, y_test, X_train, y_train = train_test_split(X, y, test_size=.2, random_state=0)
reg = LinearRegression()
reg.fit(X_train, y_train)
Am I missing a step? Thanks for your time.
You have a mistake in your train_test_split - the order of results matters; the correct usage is:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0)
Check the documentation.
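A quick way to see why the order matters is to print the shapes of what train_test_split returns. The sketch below uses zero-filled arrays with the shapes given in the question as stand-ins for the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the shapes from the question: X is (20640, 2), y is (20640,).
X = np.zeros((20640, 2))
y = np.zeros(20640)

# The function always returns: X_train, X_test, y_train, y_test (in that order).
parts = train_test_split(X, y, test_size=0.2, random_state=0)
shapes = [p.shape for p in parts]
print(shapes)

# Unpacking in the order used in the question binds the names to the wrong arrays:
X_test, y_test, X_train, y_train = parts
print(X_train.shape, y_train.shape)  # the "X_train" name now holds a 1-D slice of y
```

With the names swapped like this, reg.fit receives arrays of different lengths, so the fit cannot succeed.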
You don't have to do anything you don't want to do :-). But generally speaking, you're going to want to handle things like multi-collinearity somehow -- but that doesn't necessarily mean dimensionality reduction.
What's the shape of your data? If you have, say 20 features, but 10k observations, there should be no need for dimensionality reduction (at least not in a first pass).
But if you have, say, 1k features and 10k observations, then an unsupervised dimensionality reduction step before the learner would be a good idea.
You might want to first try some regularization (see https://web.stanford.edu/~hastie/ElemStatLearn/ -- you can download the book for free from there).
So for instance, try using the ElasticNet class instead of the LinearRegression class. It's pretty much the same thing, but with a penalty on the $L_1$ and $L_2$ norms of the weights. This tends to help with generalization.
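A minimal sketch of that swap is shown below, on synthetic data. The alpha and l1_ratio values are arbitrary starting points rather than recommendations; in practice they would be tuned, for example with ElasticNetCV.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression problem: 20 features, only 5 of which carry signal.
X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Plain least squares, no penalty on the weights.
ols = LinearRegression().fit(X_train, y_train)

# Same linear model, with a combined L1 + L2 penalty on the weights.
# alpha sets the overall penalty strength, l1_ratio the mix between L1 and L2.
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X_train, y_train)

print("OLS        R^2 on test:", ols.score(X_test, y_test))
print("ElasticNet R^2 on test:", enet.score(X_test, y_test))
print("weights set exactly to zero by ElasticNet:", int((enet.coef_ == 0).sum()))
```

The two classes share the same fit/predict/score interface, so the change is a one-line replacement in existing code.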
Without knowing much more about your particular problem, it's difficult to say anything else.
I have a question regarding preprocessing data for GridSearchCV, mainly pertaining to scaling.
So what I hope to achieve is:
Perform scaling (e.g. StandardScaler()) on training data during GridSearchCV, and not on the entire set which consists of both training data and testing data.
If I use a Pipeline, for example:
clf = make_pipeline(StandardScaler(), GridSearchCV(KerasRegressor(), param_grid = [....], cv=10, refit=True))
I do not get the ability to choose whether the StandardScaler() is applied on the training set only during each fold -> I believe the scaling is done for the entire set for each fold.
Could I seek your advice please?
Thank you.
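One common arrangement that gives the behaviour described above is to nest things the other way round: put the scaler and the estimator in a Pipeline and pass that pipeline to GridSearchCV. The scaler is then re-fitted on the training part of every fold. The sketch below uses Ridge as a stand-in for KerasRegressor, and the data and grid values are made up.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The pipeline is the estimator being searched, so StandardScaler is fitted
# only on the training portion of each CV fold.
pipe = make_pipeline(StandardScaler(), Ridge())

# Parameters are addressed as <step name>__<parameter>; make_pipeline names
# each step after its lowercased class name.
param_grid = {"ridge__alpha": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=10, refit=True)
search.fit(X_train, y_train)  # the test set is never seen by the scaler

print(search.best_params_)
print("R^2 on held-out test set:", search.score(X_test, y_test))
```

With refit=True the best pipeline is re-fitted on all of X_train, and search.score or search.predict then applies that scaler to new data automatically.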
I'm using scikit-learn in my Python program in order to perform some machine-learning operations. The problem is that my data-set has severe imbalance issues.
Is anyone familiar with a solution for imbalance in scikit-learn or in Python in general? In Java there's the SMOTE mechanism. Is there an equivalent in Python?
There is a new one here
https://github.com/scikit-learn-contrib/imbalanced-learn
It contains many algorithms in the following categories, including SMOTE
Under-sampling the majority class(es).
Over-sampling the minority class.
Combining over- and under-sampling.
Create ensemble balanced sets.
In scikit-learn there are some imbalance-correction techniques, which vary depending on the learning algorithm you are using.
Some of them, like SVM or logistic regression, have the class_weight parameter. If you instantiate an SVC with this parameter set to 'balanced', it will weight each class's examples in inverse proportion to that class's frequency.
Unfortunately, there isn't a preprocessor tool with this purpose.
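To see what 'balanced' does in practice, the sketch below computes the weights with compute_class_weight and compares them with the formula from the scikit-learn documentation, n_samples / (n_classes * np.bincount(y)). The synthetic 90/10 dataset is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced data: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The documented 'balanced' heuristic: rarer classes get larger weights.
expected = len(y) / (len(classes) * np.bincount(y))

print("class counts :", np.bincount(y))
print("class weights:", weights)

# Passing class_weight='balanced' applies the same weights inside the classifier.
clf = SVC(class_weight="balanced").fit(X, y)
```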
I found one other library here which implements undersampling and also multiple oversampling techniques including multiple SMOTE implementations and another which uses SVM:
A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
Since others have listed links to the very popular imbalanced-learn library I'll give an overview about how to properly use it along with some links.
https://imbalanced-learn.org/en/stable/generated/imblearn.under_sampling.RandomUnderSampler.html
https://imbalanced-learn.org/en/stable/generated/imblearn.over_sampling.RandomOverSampler.html
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html
https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/over-sampling/plot_comparison_over_sampling.html#sphx-glr-auto-examples-over-sampling-plot-comparison-over-sampling-py
https://imbalanced-learn.org/en/stable/combine.html
Some common over-sampling and under-sampling techniques in imbalanced-learn are imblearn.over_sampling.RandomOverSampler, imblearn.under_sampling.RandomUnderSampler, and imblearn.over_sampling.SMOTE. Each of these classes has a parameter that lets the user change the sampling ratio (sampling_strategy in current releases, ratio in older ones).
For example, in SMOTE, to change the ratio you would pass a dictionary mapping each class to the number of samples it should have after resampling; each value must be greater than or equal to that class's current count (since SMOTE is an over-sampling technique). I have found SMOTE to be a better fit for model performance, probably because RandomOverSampler duplicates rows, which means the model can start to memorize the data rather than generalize to new data. SMOTE instead uses the k-nearest-neighbors algorithm to create synthetic points that are "similar" to those of the under-represented classes.
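The core idea behind those "similar" points can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbours, and place a new point somewhere on the line segment between them. The function below is a simplified illustration with made-up names, not imbalanced-learn's implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(X_minority, n_new, k_neighbors=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k_neighbors + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)
    # Pick a random base sample, then one of its k real neighbours (skipping column 0).
    base = rng.integers(0, len(X_minority), size=n_new)
    chosen = neighbor_idx[base, rng.integers(1, k_neighbors + 1, size=n_new)]
    # Move a random fraction of the way from the base sample to the neighbour.
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])

# Hypothetical minority class: 20 points in 2 dimensions.
rng = np.random.default_rng(42)
X_minority = rng.normal(size=(20, 2))

X_new = smote_like_samples(X_minority, n_new=30)
print(X_new.shape)
```

Because every synthetic point lies between two existing minority samples, none of them is an exact duplicate of a row, unlike with random over-sampling.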
It is not good practice to use SMOTE blindly with the ratio left at its default (even class balance), because the model may overfit one or more of the minority classes (even though SMOTE uses nearest neighbors to make "similar" observations). Just as you tune the hyperparameters of an ML model, you should tune the hyperparameters of the SMOTE algorithm, such as the sampling ratio and/or the number of nearest neighbors. Below is an example of how to use SMOTE properly.
NOTE: It is vital that you do not use SMOTE on the full data set. You MUST use SMOTE on the training set only (after you split). Then validate on your val/test sets and see whether your SMOTE model outperformed your other model(s). If you do not do this, there will be data leakage and your model is essentially cheating.
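from collections import Counter
import warnings
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier
warnings.filterwarnings(action='ignore', category=DeprecationWarning)
# X and y are assumed to be defined already. The dictionary keys must match the labels in y,
# and each value is the number of samples that class should have after resampling.
# Recent xgboost releases expect integer class labels, so encode y first if it holds strings.
# (Older imbalanced-learn releases called this parameter `ratio`.)
sm = SMOTE(random_state=0, sampling_strategy={'class1':100, 'class2':100, 'class3':80, 'class4':60, 'class5':90})
### Train test split
X_train, X_val, y_train, y_val = train_test_split(X, y)
### Scale the data before applying SMOTE
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)
### Resample X_train_scaled (fit_resample replaces the older fit_sample)
X_train_resampled, y_train_resampled = sm.fit_resample(X_train_scaled, y_train)
print('Original dataset shape:', Counter(y_train))
print('Resampled dataset shape:', Counter(y_train_resampled))
### Train a model on the resampled training data
### (early_stopping_rounds is a constructor argument in recent xgboost releases)
xgbc_smote = XGBClassifier(n_jobs=8, early_stopping_rounds=10).fit(
    X_train_resampled, y_train_resampled,
    eval_set=[(X_val_scaled, y_val)])
### Evaluate the model on the original (not resampled) data
### f1_score needs an explicit average for multi-class targets; arguments are (y_true, y_pred)
print('\ntrain\n')
print(accuracy_score(y_train, xgbc_smote.predict(np.array(X_train_scaled))))
print(f1_score(y_train, xgbc_smote.predict(np.array(X_train_scaled)), average='macro'))
print('\nval\n')
print(accuracy_score(y_val, xgbc_smote.predict(np.array(X_val_scaled))))
print(f1_score(y_val, xgbc_smote.predict(np.array(X_val_scaled)), average='macro'))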
SMOTE is not a builtin in scikit-learn, but there are implementations available online nevertheless.
Edit: The discussion with a SMOTE implementation on GMane that I originally
linked to, appears to be no longer available. The code is preserved here.
The newer answer below, by #nos, is also quite good.