Python - SkLearn Logistic Regression: One-by-one train instance


My training set is too large to load into memory, so I can't simply run this code:
model = LogisticRegression()
model = model.fit(train_set_df, y_label_df)
So I am looking for a way to train my sklearn LogisticRegression model by passing instances one by one, to avoid loading all the training data into memory. Thanks.

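You are looking for the partial_fit method. LogisticRegression does not support it, but SGDClassifier does, and with a logistic loss (loss="log_loss" in recent scikit-learn releases) it fits a logistic regression model by stochastic gradient descent. MultinomialNB (or any other Naive Bayes estimator) also supports partial_fit.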

Related

Fit a scikit-learn model in parallel?

Is it possible to fit a scikit-learn model in parallel? Something along the lines of
model.fit(X, y, n_jobs=20)

It really depends on the model you are trying to fit. Estimators that can be parallelized usually take an n_jobs parameter when you initialize the model; see the glossary entry on n_jobs. For example, for a random forest:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_jobs=10)
If it is an ensemble method, it makes sense to parallelize because the member models can be fitted separately (see the help page for ensemble methods). LogisticRegression() also has an n_jobs option, but I honestly don't know how much it speeds up the fitting process, if that's your bottleneck. See also this post.
For other methods such as elastic net, linear regression or SVM, I don't think there is a parallelization option.
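As a small sketch of what n_jobs changes and what it does not: with a fixed random_state, a random forest fitted on one worker and one fitted on several workers should give the same model, only the wall-clock time differs. The synthetic data set and the choice of two workers are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Same seed, different number of workers.
serial = RandomForestClassifier(n_estimators=50, n_jobs=1, random_state=0).fit(X, y)
parallel = RandomForestClassifier(n_estimators=50, n_jobs=2, random_state=0).fit(X, y)

# The trees are seeded before the work is distributed, so the two forests agree.
same_model = np.allclose(serial.predict_proba(X), parallel.predict_proba(X))
```

n_jobs=-1 asks for all available cores; whether that is actually faster depends on the data size and the estimator.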

GridSearchCV and prediction errors analysis (scikit-learn)

I'd like to manually analyse the errors that my ML model (whichever it is) makes, comparing its predictions with the labels. From my understanding, this should be done on instances of the validation set, not the training set.
I trained my model through GridSearchCV and extracted best_estimator_, the model that performed best during cross-validation and was then retrained on the entire dataset.
Therefore, my question is: how can I get predictions on a validation set to compare with the labels (without touching the test set), if my best model is retrained on the whole training set?
One solution would be to split the training set further before performing the GridSearchCV, but I guess there must be a better solution, for example getting the predictions on the validation sets during the cross-validation. Is there a way to get these predictions for the best estimator?
Thank you!

You can compute a validation curve with the model that you obtained from GridSearchCV. Read the documentation here. You just need to define the array of values for the hyperparameter you want to inspect and a scoring function. Here is an example (param_name and param_range must be passed by keyword in recent scikit-learn releases):
train_scores, valid_scores = validation_curve(model, X_train, y_train, param_name="alpha", param_range=np.logspace(-7, 3, 3), cv=5, scoring="accuracy")

I understood my conceptual error; I'll post it here since it may help other ML beginners like me!
The solution that should work is to use cross_val_predict, splitting the folds in the same way as in GridSearchCV. In fact, cross_val_predict retrains the model on each fold and does not use the previously trained model! So the result is the same as getting the predictions on the validation sets during GridSearchCV.
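A minimal sketch of this idea, assuming a LogisticRegression model, a small grid over C, and synthetic data (all placeholders for your own setup). The key point is that the same splitter object, with a fixed random_state, is passed to both GridSearchCV and cross_val_predict so the folds match:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict

# Synthetic training data, purely for illustration.
X_train, y_train = make_classification(n_samples=200, n_features=8, random_state=0)

# One splitter, reused below, so both steps see identical folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}, cv=cv)
search.fit(X_train, y_train)

# cross_val_predict clones best_estimator_ and refits it on each training fold,
# so every prediction comes from a model that never saw that instance.
val_pred = cross_val_predict(search.best_estimator_, X_train, y_train, cv=cv)

# Indices of the misclassified training instances, ready for manual inspection.
error_idx = np.flatnonzero(val_pred != y_train)
```

X_train[error_idx] and y_train[error_idx] then give the instances to look at, without touching the test set.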

How to retrain pipeline with different data in Scikit-learn?

I train a machine learning model with pipelines and K-fold cross-validation, using Python and sklearn, on a subset of all my historical data (holding out a test set), along the following lines:
pipeline = Pipeline([("combiner", PolynomialFeatures()),
                     ("dimred", PCA()),
                     ("classifier", RandomForestClassifier())])
parameters = [...]
CV = GridSearchCV(pipeline, parameters, cv=5, scoring="f1_weighted", refit=True, n_jobs=-1)
CV.fit(train_X, train_y)
So far, so good. However, at the end, I want to retrain the winning pipeline hyperparameter combination on my full X and y, without any cross-validation. How can I do this? Simply calling CV.fit(X, y) again would redo the whole search with CV, which is obviously unnecessary. I could also parse CV.get_params() for the best hyperparameter combination and build up the pipeline again accordingly, but this somehow seems clumsy and unprofessional...

The answer to your question is in the GridSearchCV documentation. See the Attributes section: best_estimator_ is where the best model is stored, so you can access it from there once fitting is done. You can use it directly as CV.best_estimator_, make a new reference to it, or pickle it for later using joblib, i.e.:
import joblib
joblib.dump(CV.best_estimator_, 'my_pipeline.pkl')
Later you can load your model for further work:
import joblib
my_pipeline = joblib.load('my_pipeline.pkl')
If you do not need the model but only its hyperparameters, you can access those from the best_params_ attribute, i.e.:
CV.best_params_
which is a dictionary of the best settings that you can use to construct a new pipeline.
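One way to retrain the winning configuration on the full data, sketched under assumptions: sklearn.base.clone copies an estimator's hyperparameters without its fitted state, so a clone of best_estimator_ can be fitted on the full X and y with no further search. The synthetic data and the one-parameter grid are hypothetical placeholders for the elided parameters in the question.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic "full" data and a training subset, purely for illustration.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)

pipeline = Pipeline([("combiner", PolynomialFeatures()),
                     ("dimred", PCA()),
                     ("classifier", RandomForestClassifier(n_estimators=20, random_state=0))])
parameters = {"dimred__n_components": [2, 4]}  # hypothetical grid

CV = GridSearchCV(pipeline, parameters, cv=3, scoring="f1_weighted", refit=True)
CV.fit(train_X, train_y)

# clone() keeps the winning hyperparameters but drops the fitted state,
# so this is a single fit on the full data with no cross-validation.
final_model = clone(CV.best_estimator_).fit(X, y)
```

An equivalent route is clone(pipeline).set_params(**CV.best_params_).fit(X, y), which avoids rebuilding the pipeline by hand.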

How does LassoCV in scikit-learn partition data?

I am performing linear regression using the Lasso method in sklearn.
According to their guidance, and what I have seen elsewhere, instead of simply conducting cross-validation on all of the training data, it is advised to split it up into more traditional training set / validation set partitions.
The Lasso is thus trained on the training set, and the hyperparameter alpha is then tuned on the basis of cross-validation results on the validation set. Finally, the accepted model is used on the test set to give a realistic view of how it will perform in reality. Separating the concerns out here is a preventative measure against overfitting.
Actual Question
Does LassoCV conform to the above protocol, or does it just somehow train the model parameters and hyperparameters on the same data and/or during the same rounds of CV?
Thanks.

If you use sklearn.model_selection.cross_val_score (sklearn.cross_validation in old releases, a module that has since been removed) with a sklearn.linear_model.LassoCV object, then you are performing nested cross-validation. cross_val_score will divide your data into train and test sets according to how you specify the folds (which can be done with objects such as sklearn.model_selection.KFold). The train set will be passed to the LassoCV, which itself performs another splitting of the data in order to choose the right penalty. This, it seems, corresponds to the setting you are seeking.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold, cross_val_score
X = np.random.randn(20, 10)
y = np.random.randn(len(X))
cv_outer = KFold(n_splits=5)  # outer folds, used only for evaluation
lasso = LassoCV(cv=3)  # cv=3 makes a KFold inner splitting with 3 folds
scores = cross_val_score(lasso, X, y, cv=cv_outer)
Answer: no, LassoCV will not do all the work for you; you have to use it in conjunction with cross_val_score to obtain what you want. This is also a reasonable way of implementing such objects, since we may only be interested in fitting a hyperparameter-optimized LassoCV without necessarily evaluating it directly on another set of held-out data.

Imbalance in scikit-learn

I'm using scikit-learn in my Python program in order to perform some machine-learning operations. The problem is that my data-set has severe imbalance issues.
Is anyone familiar with a solution for imbalance in scikit-learn or in Python in general? In Java there's the SMOTE mechanism. Is there something similar in Python?

There is a newer library here:
https://github.com/scikit-learn-contrib/imbalanced-learn
It contains many algorithms, including SMOTE, in the following categories:
Under-sampling the majority class(es).
Over-sampling the minority class.
Combining over- and under-sampling.
Create ensemble balanced sets.

scikit-learn has some imbalance correction techniques, which vary according to the learning algorithm you are using.
Some of them, like SVM or logistic regression, have a class_weight parameter. If you instantiate an SVC with this parameter set to 'balanced', it will weight each class in inverse proportion to its frequency.
Unfortunately, there isn't a preprocessing tool for this purpose.
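To make the 'balanced' weighting concrete, here is a small worked sketch on a made-up label vector with 90 examples of class 0 and 10 of class 1. It uses scikit-learn's compute_class_weight helper and checks it against the formula n_samples / (n_classes * count_per_class):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels: 90 of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
classes = np.unique(y)

# Weights scikit-learn uses for class_weight="balanced".
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same numbers by hand: n_samples / (n_classes * count_per_class).
manual = len(y) / (len(classes) * np.bincount(y))
```

Here class 0 gets weight 100 / (2 * 90), roughly 0.56, and class 1 gets 100 / (2 * 10) = 5.0, so each error on the rare class counts nine times as much as one on the common class.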

I found one other library here which implements undersampling and also multiple oversampling techniques including multiple SMOTE implementations and another which uses SVM:
A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning

Since others have listed links to the very popular imbalanced-learn library, I'll give an overview of how to use it properly, along with some links.
https://imbalanced-learn.org/en/stable/generated/imblearn.under_sampling.RandomUnderSampler.html
https://imbalanced-learn.org/en/stable/generated/imblearn.over_sampling.RandomOverSampler.html
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html
https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/over-sampling/plot_comparison_over_sampling.html#sphx-glr-auto-examples-over-sampling-plot-comparison-over-sampling-py
https://imbalanced-learn.org/en/stable/combine.html
Some common over-sampling and under-sampling techniques in imbalanced-learn are imblearn.over_sampling.RandomOverSampler, imblearn.under_sampling.RandomUnderSampler, and imblearn.over_sampling.SMOTE. These classes have a handy parameter that lets the user change the sampling ratio (called ratio in older releases and sampling_strategy in current ones).
For example, in SMOTE, to change the ratio you pass a dictionary mapping each class to the number of samples you want after resampling, and each value must be greater than or equal to that class's current count (since SMOTE is an over-sampling technique). The reason I have found SMOTE to be a better fit for model performance is probably that RandomOverSampler duplicates rows, which means the model can start to memorize the data rather than generalize to new data. SMOTE uses the K-Nearest-Neighbors algorithm to make new data points "similar" to the existing minority-class ones.
It is not good practice to blindly use SMOTE with the ratio left at its default (even class balance), because the model may overfit one or more of the minority classes (even though SMOTE uses nearest neighbors to make "similar" observations). In the same way that you tune the hyperparameters of an ML model, you should tune the hyperparameters of the SMOTE algorithm, such as the ratio and/or the number of neighbors. Below is an example of how to properly use SMOTE.
NOTE: It is vital that you do not use SMOTE on the full data set. You MUST use SMOTE on the training set only (after you split). Then validate on your val/test sets and see if your SMOTE model outperformed your other model(s). If you do not do this, there will be data leakage and your model is essentially cheating.
from collections import Counter
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier
import warnings
warnings.filterwarnings(action='ignore', category=DeprecationWarning)
# X and y are your own features and labels. The class names and target counts
# below are placeholders; the keys must match the labels that appear in y.
# (Recent xgboost versions expect integer class labels, so encode string labels first.)
# sampling_strategy was called ratio in older imbalanced-learn releases.
sm = SMOTE(random_state=0, sampling_strategy={'class1':100, 'class2':100, 'class3':80, 'class4':60, 'class5':90})
### Train test split
X_train, X_val, y_train, y_val = train_test_split(X, y)
### Scale the data before applying SMOTE
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)
### Resample X_train_scaled (fit_resample was called fit_sample in older releases)
X_train_resampled, y_train_resampled = sm.fit_resample(X_train_scaled, y_train)
print('Original dataset shape:', Counter(y_train))
print('Resampled dataset shape:', Counter(y_train_resampled))
### Train a model on the resampled training data
# Recent xgboost versions take early_stopping_rounds in the constructor rather than in fit()
xgbc_smote = XGBClassifier(n_jobs=8, early_stopping_rounds=10).fit(
    X_train_resampled, y_train_resampled,
    eval_set=[(X_val_scaled, y_val)])
### Evaluate the model on the original (not resampled) data
# Metrics take y_true first; multiclass f1_score needs an explicit average
print('\ntrain\n')
print(accuracy_score(y_train, xgbc_smote.predict(X_train_scaled)))
print(f1_score(y_train, xgbc_smote.predict(X_train_scaled), average='weighted'))
print('\nval\n')
print(accuracy_score(y_val, xgbc_smote.predict(X_val_scaled)))
print(f1_score(y_val, xgbc_smote.predict(X_val_scaled), average='weighted'))

SMOTE is not built into scikit-learn, but there are implementations available online nevertheless.
Edit: The discussion with a SMOTE implementation on GMane that I originally linked to appears to be no longer available. The code is preserved here.
The newer answer below, by @nos, is also quite good.
