I am using the LogisticRegression algorithm. It works fine, except that it takes a long time to finish, so I decided to use the multiprocessing feature (n_jobs=-1) as described at https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html, but there was no change in performance.
Here is my code:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

mdl = LogisticRegression(n_jobs=-1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
mdl.fit(X_train, y_train)
y_pred = mdl.predict(X_test)
How can I make multiprocessing actually speed up LogisticRegression?
Are you doing multiclass classification?
If your data does not have more than two classes, setting the n_jobs argument is virtually useless: n_jobs only parallelizes over classes when multi_class='ovr', so for a binary problem there is nothing to parallelize.
To improve speed try feature engineering to reduce the number of features.
You could also try changing the solver. Here's what the documentation says:
"For small datasets, ‘liblinear’ (used to be the
default) is a good choice, whereas ‘sag’ and
‘saga’ are faster for large ones.
For multiclass problems, only ‘newton-cg’,
‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial
loss; ‘liblinear’ is limited to one-versus-rest
schemes."
There are also some parameters like tol you could try changing.
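As a rough sketch of those two suggestions (assuming a reasonably large dataset; the values here are illustrative, not tuned), switching the solver and loosening the tolerance could look like this:

from sklearn.linear_model import LogisticRegression

# 'saga' is usually faster on large datasets; a looser tol and a capped
# max_iter trade a little precision in the optimum for speed.
mdl = LogisticRegression(solver='saga', tol=1e-3, max_iter=200, n_jobs=-1)
mdl.fit(X_train, y_train)

Note that 'sag' and 'saga' converge fastest when the features are on roughly the same scale, so standardizing X beforehand can also help.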
Finally, if nothing works, use another model.
I have a question about comparing classification algorithms.
I am doing a project on hyperparameter tuning and classification model comparison for a dataset.
The goal is to find the best-fitting model, with the best hyperparameters, for my dataset.
For example: I have 2 classification models (SVM and Random Forest); my dataset has 1000 rows and 10 columns (9 columns are features and the last column is the label).
First of all, I split the dataset into 2 portions (80-20): 800 rows for training and 200 rows for testing. After that, I use grid search with CV = 10 to tune the hyperparameters of these 2 models (SVM and Random Forest) on the training set. Once the hyperparameters are identified for each model, I use them to compute the accuracy_score on the training and testing sets again, in order to find out which model is the best one for my data (conditions: accuracy_score on the training set < accuracy_score on the testing set (not overfitting), and whichever model has the higher accuracy_score on the testing set is the best model).
However, SVM shows an accuracy_score of 100 on the training set and 83.56 on the testing set, which means SVM with tuned hyperparameters is overfitting. On the other hand, Random Forest shows an accuracy_score of 72.36 on the training set and 81.23 on the testing set. It is clear that the testing accuracy_score of SVM is higher than that of Random Forest, but SVM is overfitting.
I have some questions as below:
- Is my method correct when I compare accuracy_score on the training and testing sets as above, instead of using cross-validation? (If I should use cross-validation, how do I do it?)
- It is clear that the SVM above is overfitting, but its testing accuracy_score is higher than that of Random Forest. Could I conclude that SVM is the best model in this case?
Thank you!
It's good that you've done quite a bit of analysis to investigate the best model. However, I would suggest elaborating on your investigation a bit. Since you're searching for the best model for your data, accuracy alone is not a good evaluation metric. You should also evaluate your models on precision, recall, ROC, sensitivity, specificity, etc. Find out whether your data is imbalanced (if it is, there are techniques to work around that). After evaluating all those metrics you can come to a decision.
For the training-testing part, you're on the right track, with only one issue (which is quite severe): every time you test your model on the test set while still tuning it, you inject a kind of bias. So I would say make 3 partitions of your data and use cross-validation (sklearn has what you need for this) on your training set. After cross-validation, you can use another partition, the validation set, to test the generalization power of your model (its performance on unseen data), and you may change some parameters after that. Only once you've come to a conclusion and tuned everything you needed to should you use your test set. No matter what the results on the test set are, don't change the model after that, as those scores represent the true capability of your model.
You can create 3 partitions of your data in the following way, for example:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_blobs
# Dummy dataset for example purpose
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, cluster_std=6.0)
# first partition i.e. "train-set" and "test-set"
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.9, random_state=123)
# second partition, we're splitting the "train-set" into 2 sets, thus creating a new partition of "train-set" and "validation-set"
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, train_size=0.9, random_state=123)
print(X_train.shape, X_test.shape, X_val.shape) # output : ((810, 2), (100, 2), (90, 2))
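And for the cross-validation step on the training partition, a minimal sketch (using an SVC here purely as a stand-in for whichever models you are comparing):

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# 10-fold CV on the training partition only; the validation and test
# partitions remain untouched until the very end.
cv_scores = cross_val_score(SVC(), X_train, y_train, cv=10)
print(cv_scores.mean(), cv_scores.std())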
I would suggest splitting your data into three sets, rather than two:
Training
Validation
Testing
Training is used to train the model, as you have been doing. The validation set is used to evaluate the performance of a model trained with a given set of hyperparameters. The optimal set of hyperparameters is then used to generate predictions on the test set, which wasn't part of either training or hyperparameter selection. You can then compare performance on the test set between your classifiers.
The large decrease in performance on your SVM model on your validation dataset does suggest overfitting, though it is common for a classifier to perform better on the training dataset than an evaluation or test dataset.
For your second question: yes, your SVM would be overfitting, although in most machine-learning cases the training set's accuracy does not really matter; it is much more important to look at the testing set's accuracy. It is not unusual to have a higher training accuracy than testing accuracy, so I suggest not focusing on the overfitting and looking only at the testing accuracy. With the information provided, yes, you could say that the SVM is the best model in your case.
For your first question, you are already doing a type of cross-validation and it is an acceptable way to evaluate the model.
This might be a useful article for you to read
I'm creating a classifier that takes vectorized book text as input and as output predicts whether the book is "good" or "bad".
I have 40 books, 27 good and 13 bad. I split each book into 5 records (5 ten-page segments) to increase the amount of data, so 200 records total.
Ultimately, I'll fit the model on all the books and use it to predict unlabeled books.
What's the best way to estimate the accuracy my model's going to have? I'll also use this estimate for model comparison, tuning, etc.
The two options I'm thinking of:
Run a loop to test-train split the model X times and look at the accuracy for each split
use cross-validation (GroupKFold specifically so that the 5 records for each book are kept together, since if not that would be major leakage)
I want to estimate the accuracy within a small margin of error as quickly as possible. Repeated train-test splits are slower, since even when I stratify by label (choosing 8 good books and 4 bad books for test) the accuracy for a particular model can vary from 0.6 to 0.8, so I'd have to run a lot to get an accurate estimate.
CV, on the other hand, is giving me the same score every time I run it, and seems to line up relatively well with the average accuracies of the models after 100 train-test splits (within 1-1.5%).
CV is much faster, so I'd prefer to use it. Does CV make sense to use here? I'm currently using 5-fold (so it's choosing 8 holdout books each run, or 40 holdout records total).
Also, should CV be giving the exact same accuracy every time I run it? (and exact same list of accuracies in the same order, for that matter). I'm shuffling my corpus before putting X, y, and groups into the cross_val_score. Would a ShuffleSplit be preferable? Here's my code:
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GroupKFold, cross_val_score

for i in range(0, 5):
    dfcopy = df.copy()
    dfcopy = dfcopy.sample(frac=1).reset_index(drop=True)  # shuffle rows
    X, y = dfcopy.text, dfcopy.label
    groups = dfcopy.title.tolist()  # one group per book, so its segments stay together

    model = MultinomialNB()
    name = 'LR'
    pipe = Pipeline([('cleaner', clean_transformer()),  # custom transformer defined elsewhere
                     ('vectorizer', bow_vector),        # custom vectorizer defined elsewhere
                     ('classifier', model)])

    score = cross_val_score(estimator=pipe, X=X, y=y, groups=groups, cv=GroupKFold())
    print(score)
    print(np.mean(score))
Finally, should I be using stratification? My thought was that I should since I effectively have 40 items to be split between train and test, so the test set (chosen randomly) could reasonably end up being all/mostly good or all/mostly bad, and I didn't think that would be a good test set for representing accuracy.
I will try to go in order:
What's the best way to estimate the accuracy my model's going to have? I'll also use this estimate for model comparison, tuning, etc.
CV is much faster, so I'd prefer to use it. Does CV make sense to use here?
If your folds are very similar to each other, there will be no big difference between N-fold CV and repeated train-test splits.
Should CV be giving the exact same accuracy every time I run it?
It depends on two factors: the hyperparameters and the data used. MultinomialNB has very little room for improvement through its hyperparameters, so it comes down to the distribution of the CV folds. In your case GroupKFold does not shuffle (it has no randomness), so given the same 40 groups it builds the same folds every run, which is why you see identical scores.
Would a ShuffleSplit be preferable?
ShuffleSplit might make some difference but do not expect huge differences.
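If you do want shuffled splits while still keeping the five segments of each book together, one option (my suggestion, not something from your original setup) is GroupShuffleSplit rather than plain ShuffleSplit:

from sklearn.model_selection import GroupShuffleSplit, cross_val_score

# Each random split keeps whole books (groups) on one side only,
# so segments of the same book never leak between train and test.
cv = GroupShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
scores = cross_val_score(pipe, X, y, groups=groups, cv=cv)
print(scores)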
As I see it, at least in my experience, the big step up you could make is to stop using MultinomialNB, which, although a good baseline, will not deliver crazy good results, and start using something a little more sophisticated, like SGDClassifier, Random Forest, Perceptron, you name it. With scikit-learn it is rather easy to switch from one classification algorithm to another, thanks to the very good work in standardising the calls and data structures used.
Therefore your model would become:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
One more thing that might be helpful is using a train/test/validation split and hyperparameter optimisation, such as grid search; the setup might take you a couple of hours but it will certainly pay off.
If you decide to use train/test/validate, scikit-learn has you covered with the train_test_split function:
X, y = df.text, df.label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)
If you decide to use gridsearch for hyperparameter optimisation, you will need to:
(1) define your set of possible parameters
grid_1 = {
"n_estimators": [100,200,500],
"criterion": ["gini", "entropy"],
"max_features": ['sqrt','log2',0.2,0.5,0.8],
"max_depth": [3,4,6,10],
"min_samples_split": [2, 5, 20,50]
}
(2) launch the grid search optimisation
from sklearn.model_selection import GridSearchCV

model = RandomForestClassifier()
grid_search = GridSearchCV(model, grid_1, n_jobs=-1, cv=5)
grid_search.fit(X_train, y_train)
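Once the search has finished, you can inspect the winning combination and, if you keep a separate validation set as above, score the refitted model on it, for example:

# Best hyperparameter combination and its mean cross-validated score
print(grid_search.best_params_)
print(grid_search.best_score_)

# With the default refit=True, best_estimator_ is already retrained on the
# whole training set, so it can be scored directly on held-out data
print(grid_search.best_estimator_.score(X_val, y_val))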
Grid search is a pretty simple optimisation technique, but it will be very helpful in delivering better results. If you want to deepen your understanding of this topic and further enhance your code, you can find example code using more sophisticated hyperparameter optimisation strategies, like TPE, here
Finally, your dataset seems to be pretty small; if you are experiencing long waiting times between one training run and the next, I would suggest considering writing a little cache system in order to cut loading and processing times. You can find example code using a little cache system here
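I can't speak for the linked example, but if you end up wrapping your preprocessing in a scikit-learn Pipeline, one simple caching option is its built-in memory argument backed by joblib (the vectorizer and classifier below are only placeholders for your own steps):

from joblib import Memory
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Fitted transformers are cached on disk, so repeated fits during a grid
# search skip re-running identical preprocessing steps.
memory = Memory(location='./pipeline_cache', verbose=0)
pipe = Pipeline([('vectorizer', TfidfVectorizer()),
                 ('classifier', RandomForestClassifier())],
                memory=memory)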
I am practicing simple regression models as an intro to machine learning. I have reviewed a few sample models for multiple regression, which is, I believe, an extension of linear regression, but with more than 1 feature. From the examples I have seen, the syntax is the same for linear regression and multiple regression. I get this error when running the code below:
ValueError: x and y must be the same size.
Why do I get this error, and how can I fix it?
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
df = pd.read_csv(r"C:\Users\****\Desktop\data.csv")
#x.shape =(20640, 2), y=(20640,)
X = df[['total_rooms', 'median_income']]
y = df['median_house_value']
X_test, y_test, X_train, y_train = train_test_split(X, y, test_size=.2, random_state=0)
reg = LinearRegression()
reg.fit(X_train, y_train)
Am I missing a step? Thanks for your time.
You have a mistake in your train_test_split - the order of results matters; the correct usage is:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0)
Check the documentation.
You don't have to do anything you don't want to do :-). But generally speaking, you're going to want to handle things like multi-collinearity somehow -- but that doesn't necessarily mean dimensionality reduction.
What's the shape of your data? If you have, say 20 features, but 10k observations, there should be no need for dimensionality reduction (at least not in a first pass).
But if you have, say, 1k features and 10k observations, then you'd be well suited for an unsupervised dimensionality reduction step before the learner.
You might want to first try some regularization (see https://web.stanford.edu/~hastie/ElemStatLearn/ -- you can download the book for free from there).
So for instance, try using the ElasticNet class instead of the LinearRegression class. It's pretty much the same thing, but with a penalty on the $L_1$ and $L_2$ norms of the weights. This tends to help with generalization.
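A minimal sketch of that swap (the penalty settings are placeholders you would normally tune, e.g. with ElasticNetCV or a grid search; X_train/X_test are assumed to come from your usual split):

from sklearn.linear_model import ElasticNet

# alpha scales the overall penalty; l1_ratio balances the L1 vs L2 terms
reg = ElasticNet(alpha=0.1, l1_ratio=0.5)
reg.fit(X_train, y_train)
print(reg.score(X_test, y_test))  # R^2 on held-out data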
Without knowing much more about your particular problem, it's difficult to say anything else.
It seems basic, but I can't see the difference and the advantages or disadvantages between the following 2 ways:
first way:
kf = KFold(n_splits=2)
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf.fit(X_train, y_train)
    clf.score(X_test, y_test)
second way:
cross_val_score(clf, X, y, cv=2)
It seems that the 2 ways do the same thing, and the second one is shorter (one line).
What am I missing ?
What are the differences and advantages or disadvantages for each way ?
Arguably, the best way to see such differences is to experiment, although here the situation is rather easy to discern:
clf.score is computed inside the loop but never stored; hence, after the loop finishes you only have the score from the last validation fold, and everything computed in the previous k-1 folds is forgotten.
cross_val_score, on the other hand, returns the scores from all k folds. It is generally preferable, but it lacks a shuffle option (and shuffling is almost always advisable), so you either need to shuffle the data manually first, as shown here, or use it with cv=KFold(n_splits=k, shuffle=True).
A disadvantage of the for loop + kfold method is that it is run serially, while the CV procedure in cross_val_score can be parallelized in multiple cores with the n_jobs argument.
A limitation of cross_val_score is that it cannot be used with multiple metrics, but even in that case you can use cross_validate, as shown in this thread; it is not necessary to resort to for + kfold.
The use of kfold in a for loop gives additional flexibility for cases where neither cross_val_score nor cross_validate may be adequate, for example using the scikit-learn wrapper for Keras while still getting all the metrics returned by native Keras during training, as shown here; or if you want to permanently store the different folds in separate variables/files, as shown here.
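To make the shuffle/parallelize and multi-metric points concrete, a small sketch using the clf, X and y from your snippet:

from sklearn.model_selection import KFold, cross_val_score, cross_validate

# Shuffled folds, evaluated in parallel on all available cores
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, n_jobs=-1)
print(scores.mean())

# cross_validate additionally supports several metrics at once
results = cross_validate(clf, X, y, cv=cv, n_jobs=-1,
                         scoring=['accuracy', 'f1_macro'])
print(results['test_accuracy'].mean(), results['test_f1_macro'].mean())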
In short:
if you just want the scores for a single metric, stick to cross_val_score (shuffle first and parallelize).
if you want multiple metrics, use cross_validate (again, shuffle first and parallelize).
if you need a greater degree of control or monitor of the whole CV process, revert to using kfold in a for loop accordingly.
I'm using scikit-learn in my Python program in order to perform some machine-learning operations. The problem is that my data-set has severe imbalance issues.
Is anyone familiar with a solution for imbalance in scikit-learn or in Python in general? In Java there's the SMOTE mechanism. Is there something similar in Python?
There is a newer package for this:
https://github.com/scikit-learn-contrib/imbalanced-learn
It contains many algorithms, including SMOTE, in the following categories:
Under-sampling the majority class(es).
Over-sampling the minority class.
Combining over- and under-sampling.
Create ensemble balanced sets.
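A minimal usage sketch (recent imbalanced-learn releases use fit_resample; older ones had fit_sample), assuming you have already split off a test set:

from collections import Counter
from imblearn.over_sampling import SMOTE

# Resample the training data only, so no synthetic points leak into the test set
X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X_train, y_train)
print(Counter(y_train), Counter(y_resampled))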
In scikit-learn there are some imbalance-correction techniques, which vary according to which learning algorithm you are using.
Some of them, like SVM or logistic regression, have the class_weight parameter. If you instantiate an SVC with this parameter set to 'balanced', it will weight each class example proportionally to the inverse of its frequency.
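For example (a sketch, assuming the usual X_train/y_train):

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Each class is weighted inversely proportionally to its frequency in y_train
svc = SVC(class_weight='balanced').fit(X_train, y_train)
logreg = LogisticRegression(class_weight='balanced').fit(X_train, y_train)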
Unfortunately, there isn't a preprocessor tool in scikit-learn itself for this purpose.
I found one other library here which implements undersampling and also multiple oversampling techniques including multiple SMOTE implementations and another which uses SVM:
A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
Since others have listed links to the very popular imbalanced-learn library, I'll give an overview of how to use it properly, along with some links.
https://imbalanced-learn.org/en/stable/generated/imblearn.under_sampling.RandomUnderSampler.html
https://imbalanced-learn.org/en/stable/generated/imblearn.over_sampling.RandomOverSampler.html
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html
https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/over-sampling/plot_comparison_over_sampling.html#sphx-glr-auto-examples-over-sampling-plot-comparison-over-sampling-py
https://imbalanced-learn.org/en/stable/combine.html
Some common over-sampling and under-sampling techniques in imbalanced-learn are imblearn.over_sampling.RandomOverSampler, imblearn.under_sampling.RandomUnderSampler, and imblearn.over_sampling.SMOTE. These classes take a parameter that lets the user change the sampling ratio.
For example, in SMOTE, to change the ratio you pass a dictionary of target counts per class; since SMOTE is an over-sampling technique, each target count must be at least the current number of samples in that class. The reason I have found SMOTE to give better model performance is probably that RandomOverSampler duplicates rows, which means the model can start to memorize the data rather than generalize to new data, whereas SMOTE uses the k-nearest-neighbours algorithm to create "similar" data points for the minority classes.
It is not good practice to blindly use SMOTE with its default ratio (an even class balance), because the model may overfit one or more of the minority classes (even though SMOTE uses nearest neighbours to make "similar" observations). In the same way that you tune the hyperparameters of an ML model, you should tune the hyperparameters of the SMOTE algorithm, such as the ratio and/or k (the number of neighbours). Below is a working example of how to use SMOTE properly.
NOTE: It is vital that you do not apply SMOTE to the full data set. You MUST apply SMOTE to the training set only (after you split). Then validate on your val/test sets and see whether your SMOTE model outperformed your other model(s). If you do not do this, there will be data leakage and your model will essentially be cheating.
from collections import Counter
import warnings

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score, f1_score
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

warnings.filterwarnings(action='ignore', category=DeprecationWarning)

### Target counts per class; each value must be at least that class's current count.
### 'class1' ... 'class5' are placeholder labels: replace them with your own.
### (Older imbalanced-learn versions call this parameter ratio instead of sampling_strategy.)
sm = SMOTE(random_state=0, n_jobs=8,
           sampling_strategy={'class1': 100, 'class2': 100, 'class3': 80,
                              'class4': 60, 'class5': 90})

### Train test split
X_train, X_val, y_train, y_val = train_test_split(X, y)

### Scale the data before applying SMOTE
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)

### Resample X_train_scaled (older imbalanced-learn versions use fit_sample)
X_train_resampled, y_train_resampled = sm.fit_resample(X_train_scaled, y_train)
print('Original dataset shape:', Counter(y_train))
print('Resampled dataset shape:', Counter(y_train_resampled))

### Train a model on the resampled training data
### (newer xgboost versions take early_stopping_rounds in the constructor instead)
xgbc_smote = XGBClassifier(n_jobs=8).fit(X_train_resampled, y_train_resampled,
                                         eval_set=[(X_val_scaled, y_val)],
                                         early_stopping_rounds=10)

### Evaluate the model (f1_score needs an average setting for multiclass targets)
print('\ntrain\n')
print(accuracy_score(y_train, xgbc_smote.predict(np.array(X_train_scaled))))
print(f1_score(y_train, xgbc_smote.predict(np.array(X_train_scaled)), average='weighted'))
print('\nval\n')
print(accuracy_score(y_val, xgbc_smote.predict(np.array(X_val_scaled))))
print(f1_score(y_val, xgbc_smote.predict(np.array(X_val_scaled)), average='weighted'))
SMOTE is not built into scikit-learn, but there are implementations available online nevertheless.
Edit: The discussion with a SMOTE implementation on GMane that I originally
linked to, appears to be no longer available. The code is preserved here.
The newer answer below, by #nos, is also quite good.