What is .linear_model in sklearn.linear_model - python

I want to know the meaning of .linear_model in the following code:
from sklearn.linear_model import LogisticRegression
My understanding is that sklearn is the library/module (both have the same meaning) and LogisticRegression is a class inside this module.
But I don't understand what .linear_model means.

linear_model is a module. sklearn is a package. A package is basically a module that contains other modules.
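A quick way to check this yourself is to inspect the objects involved. This is a minimal sketch, assuming scikit-learn is installed; the exact private module name printed on the last line can differ between versions.

```python
import inspect

import sklearn
import sklearn.linear_model
from sklearn.linear_model import LogisticRegression

# Both sklearn and sklearn.linear_model are module objects.
print(inspect.ismodule(sklearn))
print(inspect.ismodule(sklearn.linear_model))

# A package is a module that has a __path__ attribute (it can contain submodules).
print(hasattr(sklearn, "__path__"))

# LogisticRegression is a class that lives somewhere under sklearn.linear_model.
print(inspect.isclass(LogisticRegression))
print(LogisticRegression.__module__)
```

So the dotted name sklearn.linear_model is simply "the linear_model module inside the sklearn package".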

linear_model is a submodule of the sklearn package (not a class). It contains classes and functions for performing machine learning with linear models.
The term linear model implies that the model is specified as a linear combination of features. Based on training data, the learning process computes one weight for each feature to form a model that can predict or estimate the target value.
It includes linear regression and classification, ridge regression and classification, Lasso, multi-task Lasso, and others.
Check the sklearn documentation for further details.
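To make the "linear combination of features" idea concrete, here is a small sketch using synthetic data (the data and the true weights are made up for illustration). It checks that the prediction of a fitted LinearRegression is just the features multiplied by the learned weights, plus the intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 50 samples, 3 features, known weights and intercept.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0

model = LinearRegression().fit(X, y)

# One learned weight per feature, plus an intercept.
print(model.coef_, model.intercept_)

# The prediction is the linear combination X @ coef_ + intercept_.
manual = X @ model.coef_ + model.intercept_
print(np.allclose(manual, model.predict(X)))
```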

Related

Sklearn regression with clustered data

I'm trying to run a multinomial LogisticRegression in sklearn with a clustered dataset (that is, there is more than one observation for each individual, where only some features change and others remain constant per individual).
I am aware that in statsmodels it is possible to account for this in the following way:
mnl = MNLogit(x,y).fit(cov_type="cluster", cov_kwds={"groups": cluster_groups})
Is there a way to replicate this with the sklearn package instead?
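To run multinomial logistic regression in sklearn, you can use the LogisticRegression class and set the parameter multi_class to "multinomial". (In recent scikit-learn releases the multinomial formulation is already the default for multiclass problems and the multi_class parameter is deprecated.) Note that this only fits the model: LogisticRegression does not report standard errors, so it has no equivalent of the cluster-robust covariance option in statsmodels.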
Reference: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
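As a rough illustration of the fitting part only, here is a minimal sketch on the iris dataset (chosen purely as a stand-in multiclass dataset, it is not clustered data). It assumes a scikit-learn version where the default settings already use the multinomial formulation for multiclass targets, so multi_class is not passed explicitly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# With the default lbfgs solver and three classes, this is a multinomial fit.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# One row of coefficients per class, one column per feature.
print(clf.coef_.shape)

# Class probabilities for the first five samples; each row sums to 1.
proba = clf.predict_proba(X[:5])
print(proba.sum(axis=1))
```

This gives point estimates and predictions, but nothing comparable to cov_type="cluster"; for inference on clustered data statsmodels remains the more natural tool.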

Sklearn RANSAC without intercept

I am trying to fit a linear model without intercept (forcing the intercept to 0) using sklearn's RANSAC: RANdom SAmple Consensus algorithm. In LinearRegression one can easily set fit_intercept=False. However, this option does not seem to exist in RANSAC's list of possible parameters. Is this functionality not implemented? How should one do it? What are alternatives to sklearn's RANSAC to objectively select inliers and outliers, that allow setting the intercept to 0?
The implementation should look like this, but it raises an error:
from sklearn.linear_model import RANSACRegressor
ransac_regressor = RANSACRegressor(fit_intercept=False)
RANSACRegressor is a wrapper around another regressor that fits it using random sample consensus, so you can simply pass it a base estimator created with fit_intercept=False. The parameter is named estimator in scikit-learn 1.1 and later; older releases call it base_estimator:
from sklearn.linear_model import RANSACRegressor, LinearRegression
ransac_lm = RANSACRegressor(estimator=LinearRegression(fit_intercept=False))
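Here is a small end-to-end sketch on synthetic data (a line through the origin with a few injected outliers, all made up for illustration). The wrapped estimator is passed positionally so the snippet does not depend on whether the parameter is called estimator or base_estimator in your version.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

# Synthetic data: y = 3 * x with small noise, plus 10 gross outliers.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + rng.normal(scale=0.1, size=100)
y[:10] += 50.0

# The wrapped regressor is forced through the origin.
ransac = RANSACRegressor(LinearRegression(fit_intercept=False), random_state=0)
ransac.fit(X, y)

# estimator_ is the final model refit on the inliers only.
print(ransac.estimator_.intercept_)
print(ransac.estimator_.coef_)

# Boolean mask of the samples that RANSAC classified as inliers.
print(ransac.inlier_mask_.sum())
```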

How to apply Leave one out cross validation with logistic regression and find the values of Coefficents?

I have written code that performs logistic regression with leave-one-out cross-validation. I need to know the values of the logistic regression coefficients. But the coef_ attribute is only available after the model has been fitted with the fit function, and since I performed cross-validation I never called fit to train the model myself.
Here is the code:
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression
reg = LogisticRegression()
loo = LeaveOneOut()
scores = cross_val_score(reg, train1, labels, cv=loo)
print(scores)
print(scores.mean())
coef = reg.coef_  # fails: cross_val_score fits clones, so reg itself is never fitted
I want to know the coefficient values for my features in train1, but as I have not used the fit method, how can I get the values of these coefficients?
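One possible approach (a sketch, not taken from the original thread) is to use cross_validate with return_estimator=True, which keeps the fitted model from every fold, and to fit a final model on all the data if a single set of coefficients is wanted. The synthetic dataset below stands in for train1 and labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_validate

# Stand-in data for train1 / labels.
X, y = make_classification(n_samples=30, n_features=4, random_state=0)

# return_estimator=True keeps the model fitted on each training fold.
cv_results = cross_validate(
    LogisticRegression(), X, y, cv=LeaveOneOut(), return_estimator=True
)
print(cv_results["test_score"].mean())

# One coefficient vector per fold (30 folds, 4 features).
fold_coefs = np.array([est.coef_.ravel() for est in cv_results["estimator"]])
print(fold_coefs.mean(axis=0))

# Cross-validation only estimates performance; for a single set of
# coefficients, fit one model on the full data.
final_model = LogisticRegression().fit(X, y)
print(final_model.coef_)
```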

What is the difference between the python scikitlearn NearestNeighbors and KNeighbors classifiers

Trying to get started with Python's SciKitLearn library but got stuck on what the difference is between the NearestNeighbors classifier and the KNeighbors classifier. It seems that the arguments are similar but not identical...
NearestNeighbors
KNeighbors
NearestNeighbors is used for unsupervised learning, KNeighborsClassifier for supervised learning. See the documentation. You use the unsupervised version when, for example, you want to find nearest neighbors between two datasets; you use the supervised version when you want to classify a sample based on the classes of its nearest neighbors in the dataset.
The NearestNeighbors class does not have .predict or .predict_proba methods for predicting the label of a test sample. KNeighborsClassifier, which is a class for supervised learning, has .predict and .predict_proba methods that return the predicted label and the class probabilities of a test sample.
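The difference is easy to see on a toy one-dimensional dataset (the numbers below are made up for illustration): NearestNeighbors only answers "which training points are closest?", while KNeighborsClassifier uses the labels of those neighbors to predict.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])
query = [[1.2]]

# Unsupervised: fit on X only, then ask for the closest training points.
nn = NearestNeighbors(n_neighbors=2).fit(X)
distances, indices = nn.kneighbors(query)
print(indices)     # row indices into X of the two nearest points
print(distances)   # their distances to the query

# NearestNeighbors has no predict method.
print(hasattr(nn, "predict"))

# Supervised: fit on X and y, then predict a label from the neighbors' labels.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict(query))
print(knn.predict_proba(query))
```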

Imbalance in scikit-learn

I'm using scikit-learn in my Python program in order to perform some machine-learning operations. The problem is that my data-set has severe imbalance issues.
Is anyone familiar with a solution for imbalance in scikit-learn or in Python in general? In Java there's the SMOTE mechanism. Is there an equivalent in Python?
There is a newer library here:
https://github.com/scikit-learn-contrib/imbalanced-learn
It contains many algorithms, including SMOTE, in the following categories:
Under-sampling the majority class(es).
Over-sampling the minority class.
Combining over- and under-sampling.
Create ensemble balanced sets.
In scikit-learn there are some imbalance correction techniques, which vary according to which learning algorithm you are using.
Some of them, like SVM or logistic regression, have the class_weight parameter. If you instantiate an SVC with this parameter set to 'balanced', it will weight each class in inverse proportion to its frequency.
Unfortunately, there isn't a preprocessor tool with this purpose.
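To see what 'balanced' does in practice, here is a minimal sketch on a made-up 90/10 dataset. It uses compute_class_weight, the scikit-learn utility that computes the same weights, n_samples / (n_classes * count_per_class).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced data: 90 samples of class 0, 10 of class 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(90, 2)),
               rng.normal(2.0, 1.0, size=(10, 2))])
y = np.array([0] * 90 + [1] * 10)

# The weights that class_weight='balanced' corresponds to.
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
print(weights)  # the rare class gets the larger weight

# Errors on the minority class are penalized more heavily during training.
clf = SVC(class_weight="balanced").fit(X, y)
print(clf.predict(X[-5:]))
```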
I found one other library here which implements undersampling and also multiple oversampling techniques including multiple SMOTE implementations and another which uses SVM:
A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
Since others have listed links to the very popular imbalanced-learn library I'll give an overview about how to properly use it along with some links.
https://imbalanced-learn.org/en/stable/generated/imblearn.under_sampling.RandomUnderSampler.html
https://imbalanced-learn.org/en/stable/generated/imblearn.over_sampling.RandomOverSampler.html
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html
https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/over-sampling/plot_comparison_over_sampling.html#sphx-glr-auto-examples-over-sampling-plot-comparison-over-sampling-py
https://imbalanced-learn.org/en/stable/combine.html
Some common over-sampling and under-sampling techniques in imbalanced-learn are imblearn.over_sampling.RandomOverSampler, imblearn.under_sampling.RandomUnderSampler, and imblearn.over_sampling.SMOTE. These classes share a parameter (sampling_strategy, named ratio in older releases) that allows the user to change the sampling ratio.
For example, in SMOTE, to change the ratio you would pass a dictionary mapping each class to its target count, and each value must be greater than or equal to the current number of samples in that class (since SMOTE is an over-sampling technique). The reason I have found SMOTE to be a better fit for model performance is probably that with RandomOverSampler you are duplicating rows, which means the model can start to memorize the data rather than generalize to new data. SMOTE uses the k-nearest-neighbors algorithm to create new data points that are "similar" to the existing minority-class samples.
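The interpolation idea can be sketched in a few lines. This is a simplified illustration and not imbalanced-learn's actual implementation; the function name smote_like_samples and its parameters are hypothetical. Each synthetic point is placed at a random position on the segment between a minority sample and one of its k nearest minority neighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Hypothetical helper: create n_new synthetic minority samples."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbors because each point is its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)

    # Pick a random base sample and one of its k true neighbors (columns 1..k).
    base = rng.integers(0, len(X_minority), size=n_new)
    picked = neighbor_idx[base, rng.integers(1, k + 1, size=n_new)]

    # Move a random fraction of the way from the base sample to the neighbor.
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])


X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_min, n_new=6)
print(synthetic)
```

Because the new points are interpolated rather than copied, they are not exact duplicates of existing rows, which is the contrast with RandomOverSampler described above.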
It is not good practice to blindly use SMOTE with the ratio left at its default (even class balance), because the model may overfit one or more of the minority classes (even though SMOTE uses nearest neighbors to make "similar" observations). In the same way that you tune the hyperparameters of an ML model, you should tune the hyperparameters of the SMOTE algorithm, such as the sampling ratio and/or the number of neighbors (k_neighbors). Below is an example of how to properly use SMOTE.
NOTE: It is vital that you do not use SMOTE on the full data set. You MUST use SMOTE on the training set only (after you split). Then validate on your val/test sets and see if your SMOTE model outperformed your other model(s). If you do not do this, there will be data leakage and your model is essentially cheating.
from collections import Counter
import warnings
from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier
warnings.filterwarnings(action='ignore', category=DeprecationWarning)
# X and y are assumed to be your feature matrix and label vector.
# Keys are the class labels as they appear in y; values are the target counts after resampling.
# Older imbalanced-learn releases call this parameter `ratio`.
sm = SMOTE(random_state=0, sampling_strategy={'class1':100, 'class2':100, 'class3':80, 'class4':60, 'class5':90})
### Train test split
X_train, X_val, y_train, y_val = train_test_split(X, y)
### Scale the data before applying SMOTE
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)
### Resample X_train_scaled (fit_resample was named fit_sample in older releases)
X_train_resampled, y_train_resampled = sm.fit_resample(X_train_scaled, y_train)
print('Original dataset shape:', Counter(y_train))
print('Resampled dataset shape:', Counter(y_train_resampled))
### Train a model on the resampled training data only
# Recent xgboost releases take early_stopping_rounds in the constructor rather than in fit()
# and expect integer-encoded class labels.
xgbc_smote = XGBClassifier(n_jobs=8, early_stopping_rounds=10).fit(
    X_train_resampled, y_train_resampled,
    eval_set=[(X_val_scaled, y_val)])
### Evaluate the model on the original (not resampled) data
print('\ntrain\n')
print(accuracy_score(y_train, xgbc_smote.predict(X_train_scaled)))
# average='macro' is needed because this is a multiclass problem
print(f1_score(y_train, xgbc_smote.predict(X_train_scaled), average='macro'))
print('\nval\n')
print(accuracy_score(y_val, xgbc_smote.predict(X_val_scaled)))
print(f1_score(y_val, xgbc_smote.predict(X_val_scaled), average='macro'))
SMOTE is not built into scikit-learn, but there are implementations available online nevertheless.
Edit: The discussion with a SMOTE implementation on GMane that I originally linked to appears to be no longer available. The code is preserved here.
The newer answer below, by @nos, is also quite good.
