Variation in output of Logistic Regression when using SMOTE - python

I am working on a logistic regression problem with an imbalanced target variable. To correct the imbalance I am using SMOTE (Synthetic Minority Oversampling Technique), but each time I run my regression model I get different numbers in my confusion matrix. I have set the random_state parameter when invoking both SMOTE and LogisticRegression, still to no avail. My features are also identical in every iteration. I once obtained a recall of 0.81 and an AUC of 0.916, but I cannot reproduce those values. On some runs the counts of false positives and false negatives shoot up sharply, indicating a very poor classifier.
Please guide me on what I am doing wrong here; the code snippet is below.
# Feature Selection
features = [ 'FEMALE','MALE','SINGLE','UNDER_WEIGHT','OBESE','PROFESSION_ANYS',
'PROFESSION_PROF_UNKNOWN']
# Set X and Y Variables
X5 = dataframe[features]
# Target variable
Y5 = dataframe['PURCHASE']
# Split first, then oversample only the training data with SMOTE
from imblearn.over_sampling import SMOTE
os = SMOTE(random_state = 4)
X5_train, X5_test, Y5_train, Y5_test = train_test_split(X5,Y5, test_size=0.20)
os_data_X5, os_data_Y5 = os.fit_sample(X5_train, Y5_train)  # renamed fit_resample in newer imblearn
columns = X5_train.columns
os_data_X5 = pd.DataFrame(data = os_data_X5, columns = columns )
os_data_Y5 = pd.DataFrame(data = os_data_Y5, columns = ['PURCHASE'])
# Instantiate Logistic Regression model (using the default parameters)
logreg_5 = LogisticRegression(random_state = 4, penalty='l1', class_weight = 'balanced')
# Fit the model with train data
logreg_5.fit(os_data_X5,os_data_Y5)
# Make predictions on test data set
Y5_pred = logreg_5.predict(X5_test)
# Make Confusion Matrix to compare results against actual values
cnf_matrix = metrics.confusion_matrix(Y5_test, Y5_pred)
cnf_matrix
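One likely culprit, as a hedged guess: train_test_split above is called without a random_state, so each run evaluates the model on a different test split even though SMOTE and LogisticRegression are seeded. A minimal sketch pinning all three sources of randomness (same variables as above; solver='liblinear' added because newer scikit-learn defaults to lbfgs, which does not support penalty='l1'):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

# Seed the split too, and stratify so both classes appear in the test set
X5_train, X5_test, Y5_train, Y5_test = train_test_split(
    X5, Y5, test_size=0.20, random_state=4, stratify=Y5)
os = SMOTE(random_state=4)
os_data_X5, os_data_Y5 = os.fit_resample(X5_train, Y5_train)
logreg_5 = LogisticRegression(random_state=4, penalty='l1',
                              solver='liblinear', class_weight='balanced')
logreg_5.fit(os_data_X5, os_data_Y5)
Y5_pred = logreg_5.predict(X5_test)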

Related

How do I calculate the MSE during walk-forward optimization?

I am trying to predict the variable "ec" using features with a time lag of one period, with several models. To see which model (OLS, Ridge, Lasso, or ARIMAX) fits the data best, I use a walk-forward approach (expanding window) and want to compute the mean squared error for each model. (I am providing the code for my OLS model as an example.) Although my code seems to work, I am not sure whether my MSE calculation is correct: as the code below shows, I save the MSE of each loop (each combination of training and test set) in a list (ols_mse_list) and then compute the "overall" MSE as the average of that list. Is that the correct way? I am slightly confused, as I couldn't find proper instructions on how to calculate the MSE during the optimization process.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Separate the predictors and label
X_bss = data_bss[data_bss.columns[~data_bss.columns.isin(
    ["gdp_lag1", "pop_lag1", "exp_lag1", "sp_500_lag1", "temp_lag1"])]]
y = data_bss["ec"]

# expanding_window is a custom walk-forward splitter defined elsewhere
tscv = expanding_window(initial=350, horizon=12, period=1)
for train_index, test_index in tscv.split(X_bss):
    print("Train:", train_index)
    print("Test :", test_index)

ols = LinearRegression()
ols_mse_list = []
ols_mean_mse = []
# Loop through the splits. Run a linear regression for each split.
for i, (train_index, test_index) in enumerate(tscv.split(data_bss)):
    X_train = data_bss[["gdp_lag1", "pop_lag1", "exp_lag1", "sp_500_lag1", "temp_lag1"]].iloc[train_index]
    y_train = data_bss[["ec"]].iloc[train_index]
    X_test = data_bss[["gdp_lag1", "pop_lag1", "exp_lag1", "sp_500_lag1", "temp_lag1"]].iloc[test_index]
    y_test = data_bss[["ec"]].iloc[test_index]
    ols.fit(X_train, y_train)
    ols_mse = mean_squared_error(y_test, ols.predict(X_test))
    ols_mse_list.append(ols_mse)
# Overall MSE: the average of the per-split MSEs
ols_mean_mse.append(np.mean(ols_mse_list))
print("OLS MSE:", ols_mean_mse)

The intercept is half of the real value in logistic regression

For a scientific study, I need to analyze a traditional logistic regression using Python and scikit-learn. After fitting my regression model with penalty='none', I get the correct coefficients, but the intercept is half of the real value. My code is mostly as follows:
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_excel("data.xlsx")
train, test = train_test_split(df, train_size = 0.8, random_state = 42)
train = train.drop(["Unnamed: 0"], axis = 1)
test = test.drop(["Unnamed: 0"], axis = 1)
x_train = train.drop(["GRUP"], axis = 1)
x_train = sm.add_constant(x_train)
y_train = train["GRUP"]
x_test = test.drop(["GRUP"], axis = 1)
x_test = sm.add_constant(x_test)
y_test = test["GRUP"]
model = sm.Logit(y_train, x_train).fit()
model.summary()
log = LogisticRegression(penalty = "none")
log.fit(x_train, y_train)
log.intercept_
With statsmodels I get the intercept (constant) 28.7140, but with scikit-learn I get 14.35698738. The other coefficients are the same. I verified the results in SPSS, and the first value is the correct one. I don't want to use statsmodels only for logistic regression. Could you please help?
PS: Without an intercept the model works fine.
The issue here is that in the code you posted you add a constant term (a column of 1's) to x_train with x_train = sm.add_constant(x_train). Then you pass that same x_train object to sklearn's LogisticRegression(), whose fit_intercept= parameter defaults to True. At that stage you end up with two identical constant columns, so the model is no longer identified, and the solver splits the true intercept evenly between the two, which is why you see exactly half the statsmodels value.
So you should either turn off fit_intercept= in the sklearn code, or leave fit_intercept=True but use the x_train array without the added constant term.
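A minimal sketch of both options (same x_train/y_train as in the question; 'const' is the column name sm.add_constant creates):
from sklearn.linear_model import LogisticRegression

# Option 1: keep the add_constant column and disable sklearn's own intercept;
# the intercept is then the coefficient on the 'const' column
log = LogisticRegression(penalty="none", fit_intercept=False)
log.fit(x_train, y_train)

# Option 2: drop the constant column and let sklearn fit the intercept itself
log = LogisticRegression(penalty="none")
log.fit(x_train.drop(columns="const"), y_train)
log.intercept_  # now matches the statsmodels constant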

Making predictions using all labels in multilabel text classification

I'm currently working on a multilabel text classification problem in which I have 4 labels, represented as 4 dummy variables. I have tried several ways of transforming the data into a form suitable for multilabel classification.
Right now I'm running with pipelines, but as far as I can see this doesn't fit one model with all labels included; it builds one model per label. Do you agree with this?
I have tried MultiLabelBinarizer and LabelBinarizer, but with no luck.
Do you have a tip on how to solve this in a way that makes one model include all the labels, taking the different label combinations into account?
A subset of the data and my code are here:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
# Import data
df = import_data("product_data")
# Define dataframe to only include relevant columns
df = df.loc[:,['text','TV','Internet','Mobil','Fastnet']]
# Define dataframe with labels
df_labels = df.loc[:,['TV','Internet','Mobil','Fastnet']]
# Sum the number of labels per text
sum_column = df["TV"] + df["Internet"] + df["Mobil"] + df["Fastnet"]
df["label_sum"] = sum_column
# Remove texts with no labels
df.drop(df[df['label_sum'] == 0].index, inplace = True)
# Split dataset
train, test = train_test_split(df, random_state=42, test_size=0.2, shuffle=True)
X_train = train.text
X_test = test.text
categories = ['TV','Internet','Mobil','Fastnet']
# Model
LogReg_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='word', max_df=0.20)),
    ('clf', LogisticRegression(solver='lbfgs', multi_class='ovr',
                               class_weight='balanced', n_jobs=-1)),
])
for category in categories:
    print('... Processing {}'.format(category))
    LogReg_pipeline.fit(X_train, train[category])
    prediction = LogReg_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))
https://www.transfernow.net/dl/20210921NbWDt3eo
Code Analysis
The scikit-learn LogisticRegression classifier with OVR (one-vs-rest) can only predict a single output/label at a time. Since you fit the pipeline on one label at a time, you produce one trained model per label. The algorithm is the same for every model, but each is trained against a different target.
Multi-Output Regressor
Multi-output estimators accept multiple independent labels and generate one prediction for each target.
The output should be the same as what you have now, but you only need to maintain a single model and train it once.
To use this approach, wrap your LR model in a MultiOutputRegressor (for classifiers, scikit-learn also offers the analogous MultiOutputClassifier; see the sketch after combine_data below).
Here is a good tutorial on multi-output regression models.
from sklearn.multioutput import MultiOutputRegressor

model = LogisticRegression(solver='lbfgs', multi_class='ovr',
                           class_weight='balanced', n_jobs=-1)
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='word', max_df=0.20)),
    ('clf', MultiOutputRegressor(model))])
# Fit on the training rows' labels; df_labels still contains rows that were
# dropped or sent to the test set, so its index no longer matches X_train
preds = pipeline.fit(X_train, train[categories]).predict(X_test)
df_preds = combine_data(X=X_test, Y=preds, y_cols=categories)
combine_data() merges all data into a single DataFrame for convenience:
def combine_data(X, Y, y_cols):
    """X is a DataFrame, Y is a numpy array, y_cols is a list."""
    df_out = pd.DataFrame(Y, columns=y_cols)
    df_out.index = X.index
    return pd.concat([X, df_out], axis=1).sort_index()
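As a variant sketch beyond the original answer: scikit-learn also provides MultiOutputClassifier, the classifier analogue of MultiOutputRegressor, which fits one LogisticRegression per label inside a single model object and predicts all four label columns at once:
from sklearn.multioutput import MultiOutputClassifier

clf_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='word', max_df=0.20)),
    ('clf', MultiOutputClassifier(
        LogisticRegression(solver='lbfgs', class_weight='balanced'), n_jobs=-1)),
])
preds = clf_pipeline.fit(X_train, train[categories]).predict(X_test)
df_preds = combine_data(X=X_test, Y=preds, y_cols=categories)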
Multinomial Logistic Regression
To use a LogisticRegression classifier on all labels at once, set multi_class='multinomial'.
The softmax function is used to find the predicted probability of a sample belonging to a class.
You'll need to reverse the one-hot encoding on the labels to get back a single categorical variable (answer here). If you have the original label from before the one-hot encoding, use that.
Here is a good tutorial on multinomial logistic regression.
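For the decoding step, a hedged one-liner (assuming the df_train frame below carries the same dummy columns as categories, and that each row has exactly one label, which multinomial regression requires; multilabel rows would lose information here):
# Collapse the one-hot label columns back into one categorical column;
# idxmax returns the name of the column holding the 1 in each row
df_train["text_source"] = df_train[categories].idxmax(axis=1)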
label_col = ["text_source"]
clf = LogisticRegression(multi_class='multinomial', solver='lbfgs')
model = clf.fit(df_train[input_cols], df_train[label_col])
# Generate a table of probabilities, one column per class
probs = model.predict_proba(X_test)
df_probs = combine_data(X=X_test, Y=probs, y_cols=model.classes_)
# Predict the class for each sample, i.e. the one with the highest probability
preds = model.predict(X_test)
df_preds = combine_data(X=X_test, Y=preds, y_cols=label_col)

Why does LinearRegression with polynomial features of degree 1 give different results?

I have a regression dataset: (X_train_scaled, y_train) for training and (X_val_scaled, y_val) for validation. The inputs were scaled using StandardScaler.
I create a linear regression model using sklearn.linear_model.LinearRegression as follows:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
linear_reg = LinearRegression()
linear_reg.fit(X_train_scaled, y_train)
y_pred_train = linear_reg.predict(X_train_scaled)
y_pred_val = linear_reg.predict(X_val_scaled)
r2_train = r2_score(y_train, y_pred_train)
r2_val = r2_score(y_val, y_pred_val)
print('r2_train', r2_train)
print('r2_val', r2_val)
After that I do the same, but using polynomial features with degree = 1 (which are just the original features plus an additional column of ones, i.e. x^0, which I ignore).
from sklearn.preprocessing import PolynomialFeatures
pf = PolynomialFeatures(1)
X_train_poly = pf.fit_transform(X_train_scaled)[:, 1:] # ignore first col
X_val_poly = pf.transform(X_val_scaled)[:, 1:] # ignore first col
linear_reg = LinearRegression()
linear_reg.fit(X_train_poly, y_train)
y_pred_train = linear_reg.predict(X_train_poly)
y_pred_val = linear_reg.predict(X_val_poly)
r2_train = r2_score(y_train, y_pred_train)
r2_val = r2_score(y_val, y_pred_val)
print('r2_train', r2_train)
print('r2_val', r2_val)
However, I get different results. The first code gives me the following outputs:
r2_train 0.7409525513417043
r2_val 0.7239859358973735
whereas the second code gives this output:
r2_train 0.7410093370149977
r2_val 0.7241725658840452
Why are the outputs different although the dataset and model are the same?
To prove the datasets are the same, I tried the following code:
print(X_train_scaled.shape, X_train_poly.shape)
print(X_val_scaled.shape, X_val_poly.shape)
print((X_train_poly != X_train_scaled).sum())
print((X_val_poly != X_val_scaled).sum())
which has the output:
(802, 9) (802, 9)
(268, 9) (268, 9)
0
0
which indicates that the two datasets are identical.
Also, I use LinearRegression in both cases, which uses the OLS algorithm and has no random operations at all, so it should perform the same calculations on the same data. However, I get different results.
Does anyone have an idea about the reason?
Sklearn's LinearRegression fits with ordinary least squares, which is deterministic, so the discrepancy has to be introduced on the data side by PolynomialFeatures. Its transform() documentation notes:
Prefer CSR over CSC for sparse input (for speed), but CSC is required if the degree is 4 or higher. If the degree is less than 4 and the input format is CSC, it will be converted to CSR, have its polynomial features generated, then converted back to CSC.
(see: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)
Even though the transformed values compare equal elementwise, transform() hands the solver a newly built array, and tiny floating-point truncation/approximation differences in the least-squares computation on that copy are enough to produce small discrepancies like yours.
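A quick hedged check of the round-off explanation (same variables as in the question): the coefficient differences between the two fits should be tiny, around machine precision, confirming the effect is numerical rather than statistical.
import numpy as np
from sklearn.linear_model import LinearRegression

lr_a = LinearRegression().fit(X_train_scaled, y_train)
lr_b = LinearRegression().fit(X_train_poly, y_train)
print(np.max(np.abs(lr_a.coef_ - lr_b.coef_)))  # tiny, but not exactly 0
print(abs(lr_a.intercept_ - lr_b.intercept_))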

Unexpected R^2 loss value in cross_val_score

I am dealing with a regression dataset, and I wish to fit a particular model to it after evaluating various models' performance. I used cross_val_score from sklearn.model_selection for this purpose. After I chose 'r2' as the scoring parameter, I got highly negative values for some of my models.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import (LinearRegression, Ridge, Lasso, ElasticNet,
                                  Lars, LassoLars, OrthogonalMatchingPursuit,
                                  BayesianRidge, HuberRegressor,
                                  RANSACRegressor, SGDRegressor)
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (RandomForestRegressor, ExtraTreesRegressor,
                              AdaBoostRegressor, GradientBoostingRegressor)
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR, NuSVR, LinearSVR

demo = pd.read_csv('demo.csv')
X_train = demo.iloc[0:1460, :]
Y_train = pd.read_csv('train.csv').loc[:, 'SalePrice':'SalePrice']
X_test = demo.iloc[1460:, :]

regressors = [
    LinearRegression(), Ridge(), Lasso(), ElasticNet(), Lars(), LassoLars(),
    OrthogonalMatchingPursuit(), BayesianRidge(), HuberRegressor(),
    RANSACRegressor(), SGDRegressor(), GaussianProcessRegressor(),
    DecisionTreeRegressor(), RandomForestRegressor(), ExtraTreesRegressor(),
    AdaBoostRegressor(), GradientBoostingRegressor(), KernelRidge(),
    SVR(), NuSVR(), LinearSVR(),
]

cv_results = []
for regressor in regressors:
    cv_results.append(cross_val_score(regressor, X=X_train, y=Y_train,
                                      scoring='r2', verbose=True, cv=10))
After the code above is run, cv_results is a list of float64 arrays; each array contains ten 'r2' values (due to cv = 10).
I open the first array and notice that for this particular model, some of the 'r2' values are extremely negative.
Since 'r2' values are supposed to be between 0 and 1, why are there very large negative values?
Here's the thing: R^2 values don't actually need to be in [0, 1].
Essentially, R^2 has a baseline of 0: a score of 0 means that your model does no better and no worse than simply predicting the mean of the response variable. In OLS with an intercept term, this implies that R^2 is in [0, 1] on the training data; on held-out folds, as in cross-validation, any model can score below 0 whenever it predicts worse than the test fold's mean.
However, for other models this is not true in general; for instance, if you fix the intercept in a linear regression model, you could end up doing far worse than just predicting the mean of your response.
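A tiny sketch of how R^2 drops below 0 (a hedged example using only sklearn.metrics.r2_score): a model that predicts far from the mean does worse than the mean baseline, which is exactly what a negative score encodes.
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0]
print(r2_score(y_true, [2.0, 2.0, 2.0]))     # 0.0: exactly the mean-of-y baseline
print(r2_score(y_true, [10.0, 10.0, 10.0]))  # -96.0: far worse than the mean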
