For a scientific study, I need to run a traditional logistic regression using Python and scikit-learn. After fitting my model with penalty='none', I get the correct coefficients, but the intercept is half of the real value. My code is mostly as follows:
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_excel("data.xlsx")
train, test = train_test_split(df, train_size=0.8, random_state=42)
train = train.drop(["Unnamed: 0"], axis=1)
test = test.drop(["Unnamed: 0"], axis=1)

x_train = train.drop(["GRUP"], axis=1)
x_train = sm.add_constant(x_train)   # adds a 'const' column of 1's
y_train = train["GRUP"]
x_test = test.drop(["GRUP"], axis=1)
x_test = sm.add_constant(x_test)
y_test = test["GRUP"]

# statsmodels fit
model = sm.Logit(y_train, x_train).fit()
model.summary()

# scikit-learn fit on the same design matrix
log = LogisticRegression(penalty="none")
log.fit(x_train, y_train)
log.intercept_
With statsmodels I get the intercept (constant) 28.7140, but with scikit-learn I get 14.35698738. The other coefficients are the same. I verified it in SPSS, and the first value is the correct one. I don't want to use statsmodels just for logistic regression. Could you please help?
PS: Without the intercept, the model works fine.
The issue is that in the code you posted you add a constant term (a column of 1's) to x_train with x_train = sm.add_constant(x_train). You then pass that same x_train to sklearn's LogisticRegression, whose fit_intercept parameter defaults to True. At that stage you end up with two constant terms, and the intercept gets split across them, which causes the discrepancy in your estimates.
So you should either set fit_intercept=False in the sklearn code, or leave fit_intercept=True and use the x_train array without the added constant column.
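For example, a minimal sketch of both fixes (assuming the same x_train/y_train as in your snippet, where add_constant named the extra column 'const'):

# Option 1: keep the statsmodels-style design matrix and disable sklearn's own intercept
log = LogisticRegression(penalty="none", fit_intercept=False)
log.fit(x_train, y_train)
print(dict(zip(x_train.columns, log.coef_[0])))  # the 'const' coefficient is the intercept

# Option 2: drop the added constant and let sklearn fit the intercept itself
log = LogisticRegression(penalty="none")
log.fit(x_train.drop(columns="const"), y_train)
print(log.intercept_)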
I am trying to predict the variable "ec" from features with a time lag of one period, using several models. To see which model (I am comparing OLS, Ridge, Lasso and ARIMAX) fits the data best, I use a walk-forward approach (expanding window) and want to calculate the mean squared error for each model. (I am providing the code for my OLS model as an example.) Although my code seems to work, I am not sure whether my MSE calculation is correct: as you can see in the code below, I save the MSE of each loop iteration (each combination of training and test set) in a list (ols_mse_list) and then calculate the "overall" MSE as the average of that list. Is that the correct way? I am slightly confused, as I couldn't find proper instructions on how to calculate the MSE during the optimization process...
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Separate the predictors and label
X_bss = data_bss[data_bss.columns[~data_bss.columns.isin(
    ["gdp_lag1", "pop_lag1", "exp_lag1", "sp_500_lag1", "temp_lag1"])]]
y = data_bss["ec"]

# expanding_window is a custom expanding-window splitter defined elsewhere
tscv = expanding_window(initial=350, horizon=12, period=1)

# Inspect the splits
for train_index, test_index in tscv.split(X_bss):
    print("Train:", train_index)
    print("Test :", test_index)

ols = LinearRegression()
ols_mse_list = []
ols_mean_mse = []

# Loop through the splits. Run a linear regression for each split.
for i, (train_index, test_index) in enumerate(tscv.split(data_bss)):
    X_train = data_bss[["gdp_lag1", "pop_lag1", "exp_lag1", "sp_500_lag1", "temp_lag1"]].iloc[train_index]
    y_train = data_bss[["ec"]].iloc[train_index]
    X_test = data_bss[["gdp_lag1", "pop_lag1", "exp_lag1", "sp_500_lag1", "temp_lag1"]].iloc[test_index]
    y_test = data_bss[["ec"]].iloc[test_index]
    ols.fit(X_train, y_train)
    ols_mse = mean_squared_error(y_test, ols.predict(X_test))
    ols_mse_list.append(ols_mse)
    ols_mean_mse.append(np.mean(ols_mse_list))

print("OLS MSE:", ols_mean_mse)
I am training a model with the LR algorithm on the white-wine dataset (https://archive.ics.uci.edu/ml/datasets/wine). After training the model in Python, I printed out model.coef_ to see the importance of each feature, and noticed that "residual sugar" is assigned quite a large weight (1.3). However, looking at the correlation matrix (image below), the correlation coefficient between this independent feature (residual sugar) and the dependent feature is pretty low compared to the other independent features. So I wonder whether the assigned weights are not the right way to judge how important a feature is, and if not, how I should evaluate feature importance. Below is my code as well; if anything is wrong, please help me correct it, as I am new to this area.
import pandas as pd
import seaborn
import matplotlib.pyplot as plt
from sqlalchemy import create_engine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

engine = create_engine("mysql+mysqlconnector://root:21041996@localhost/mydatabase")
con = engine.connect()
dataframe = pd.read_sql('select * from wine_quality', con)
df = dataframe[dataframe['type'] == 'white']

# Correlation heatmap
seaborn.heatmap(df.corr(), annot=True)
plt.show()

y = df['"quality"']
x = df.drop(columns=['type', '"quality"'])
x = x.to_numpy()
y = y.to_numpy()

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Binarise the target: "good" wine if quality > 6
y_train = y_train > 6
y_test = y_test > 6

model = LogisticRegression(solver='liblinear', max_iter=2000)
model.fit(X_train, y_train)
print(model.score(X_train, y_train))
print(model.score(X_test, y_test))
print(model.coef_)
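For reference, here is a small sketch of how I pair each coefficient with its column name (this assumes the column order of x before to_numpy() is exactly what the model saw, and uses the same '"quality"' column name as above):

# Pair each coefficient (learned on standardised features) with its feature name
feature_names = df.drop(columns=['type', '"quality"']).columns
for name, coef in zip(feature_names, model.coef_[0]):
    print(name, coef)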
This is somewhat of a follow up to my previous question about evaluating my scikit Gaussian process regressor. I am very new to GPRs and I think that I may be making a methodological mistake in how I am using training vs testing data.
Essentially, I'm wondering what the difference is between specifying the training data by splitting the input into test and training sets, like this:
import numpy as np
from sklearn import gaussian_process
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel
from sklearn.model_selection import train_test_split

X = np.atleast_2d(some_data).T
Y = np.atleast_2d(other_data).T

X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                    test_size=0.33,
                                                    random_state=0)

kernel = ConstantKernel() + Matern() + WhiteKernel(noise_level=1)
gp = gaussian_process.GaussianProcessRegressor(
    alpha=1e-10,
    copy_X_train=True,
    kernel=kernel,
    n_restarts_optimizer=10,
    normalize_y=False,
    random_state=None)
gp.fit(X_train, y_train)

score = gp.score(X_test, y_test)
print(score)

x_pred = np.atleast_2d(np.linspace(0, 10, 1000)).T
y_pred, sigma = gp.predict(x_pred, return_std=True)
versus using the full data set for training, like this:
X = np.atleast_2d(some_data).T
Y = np.atleast_2d(other_data).T

kernel = ConstantKernel() + Matern() + WhiteKernel(noise_level=1)
gp = gaussian_process.GaussianProcessRegressor(
    alpha=1e-10,
    copy_X_train=True,
    kernel=kernel,
    n_restarts_optimizer=10,
    normalize_y=False,
    random_state=None)
gp.fit(X, Y)

score = gp.score(X, Y)
print(score)

x_pred = np.atleast_2d(np.linspace(0, 10, 1000)).T
y_pred, sigma = gp.predict(x_pred, return_std=True)
Is one of these options going to result in incorrect predictions?
You split off the training data from the test data to evaluate your model, because otherwise you have no idea whether you are overfitting the data. For example, just put some data in Excel and plot it with a smooth line. Technically, that spline function from Excel is a perfect model, but it is useless for predicting new values.
In your example, the predictions are over a uniform space so you can visualize what your model thinks the underlying function is. But that is useless for understanding how well the model generalizes. Sometimes you can get very high accuracy (> 95%) on training data and less than chance on testing data, which means the model is overfitting.
In addition to plotting a uniform prediction space to visualize the model, you should also predict values from the test set and then look at accuracy metrics for both the testing and the training data.
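Concretely, with your first (train/test split) version, that could look something like the following sketch, reusing the gp you already fitted on X_train:

# Compare the fit on seen vs. held-out data
train_score = gp.score(X_train, y_train)   # R^2 on the data the model was fit on
test_score = gp.score(X_test, y_test)      # R^2 on unseen data
print("train R^2:", train_score)
print("test  R^2:", test_score)
# A high train score with a much lower test score is the usual sign of overfitting.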
I am dealing with a regression dataset, and I wish to fit a particular model to it after evaluating the performance of various models. I used cross_val_score from sklearn.model_selection for this purpose. After choosing 'r2' as the scoring parameter, I got highly negative values for some of my models.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import (LinearRegression, Ridge, Lasso, ElasticNet, Lars, LassoLars,
                                  OrthogonalMatchingPursuit, BayesianRidge, HuberRegressor,
                                  RANSACRegressor, SGDRegressor)
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (RandomForestRegressor, ExtraTreesRegressor,
                              AdaBoostRegressor, GradientBoostingRegressor)
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR, NuSVR, LinearSVR

demo = pd.read_csv('demo.csv')
X_train = demo.iloc[0:1460, :]
Y_train = pd.read_csv('train.csv').loc[:, 'SalePrice':'SalePrice']
X_test = demo.iloc[1460:, :]

# Candidate models to compare
regressors = []
regressors.append(LinearRegression())
regressors.append(Ridge())
regressors.append(Lasso())
regressors.append(ElasticNet())
regressors.append(Lars())
regressors.append(LassoLars())
regressors.append(OrthogonalMatchingPursuit())
regressors.append(BayesianRidge())
regressors.append(HuberRegressor())
regressors.append(RANSACRegressor())
regressors.append(SGDRegressor())
regressors.append(GaussianProcessRegressor())
regressors.append(DecisionTreeRegressor())
regressors.append(RandomForestRegressor())
regressors.append(ExtraTreesRegressor())
regressors.append(AdaBoostRegressor())
regressors.append(GradientBoostingRegressor())
regressors.append(KernelRidge())
regressors.append(SVR())
regressors.append(NuSVR())
regressors.append(LinearSVR())

# 10-fold cross-validated R^2 for each model
cv_results = []
for regressor in regressors:
    cv_results.append(cross_val_score(regressor, X=X_train, y=Y_train, scoring='r2', verbose=True, cv=10))
After the above code is run, cv_results is a list of float64 arrays, each containing 10 'r2' values (one per fold, because cv = 10).
Looking at the first array, I notice that for this particular model some of the 'r2' values are extremely negative.
Since 'r2' values should be between 0 and 1, why are there very large negative values?
Here's the thing: R^2 values don't actually need to be in [0, 1].
Essentially, R^2 has a baseline of 0, in the sense that 0 means your model does no better and no worse than simply predicting the mean of the response variable. In OLS with an intercept term, this implies that R^2 lies in [0, 1].
However, for other models this is not true in general; for instance, if you fix the intercept in a linear regression model, you can end up doing far worse than just predicting the mean of your response.
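To make this concrete, R^2 is computed as 1 - SS_res/SS_tot, where SS_tot is the squared error of always predicting the mean of y; any model whose squared error exceeds that baseline gets a negative score. A tiny made-up example:

from sklearn.metrics import r2_score
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([4.0, 3.0, 2.0, 1.0])   # far worse than always predicting the mean (2.5)
print(r2_score(y_true, y_pred))           # -3.0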
I am comparing the performance of two programs that use KerasRegressor with Scikit-Learn's StandardScaler: one program with a Scikit-Learn Pipeline and one without.
Program 1:
from keras.wrappers.scikit_learn import KerasRegressor   # Keras 2.x scikit-learn wrapper
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

estimators = []
estimators.append(('standardise', StandardScaler()))
estimators.append(('multiLayerPerceptron', KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)))
pipeline = Pipeline(estimators)
log = pipeline.fit(X_train, Y_train)
Y_deep = pipeline.predict(X_test)
Program 2:
scale = StandardScaler()
X_train = scale.fit_transform(X_train)
X_test = scale.fit_transform(X_test)
model_np = KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)
log = model_np.fit(X_train, Y_train)
Y_deep = model_np.predict(X_test)
My problem is that Program 1 achieves an R2 score of 0.98 (average of 3 trials), while Program 2 only achieves an R2 score of 0.84 (average of 3 trials). Can anyone explain the difference between these two programs?
In the second case, you are calling StandardScaler.fit_transform() on both X_train and X_test. That is wrong usage.
You should call fit_transform() on X_train and then call only transform() on X_test, because that is what the Pipeline does.
The Pipeline, as the documentation states, will:
fit():
Fit all the transforms one after the other and transform the data,
then fit the transformed data using the final estimator
predict():
Apply transforms to the data, and predict with the final estimator
So you see, it will only apply transform() to the test data, not fit_transform().
To elaborate on my point, your code should be:
scale = StandardScaler()
X_train = scale.fit_transform(X_train)
#This is the change
X_test = scale.transform(X_test)
model_np = KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)
log = model_np.fit(X_train, Y_train)
Y_deep = model_np.predict(X_test)
Calling fit() or fit_transform() on the test data wrongly scales it to a different scale than the one used on the train data, and that is a source of the change in predictions.
Edit: To answer the question in the comments:
See, fit_transform() is just a shortcut for doing fit() and then transform(). For StandardScaler, fit() simply learns the mean and standard deviation of the data (it returns the fitted scaler, not transformed data), and transform() then applies what was learned to return new, scaled data.
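In other words, for a StandardScaler the two versions below produce the same scaled array (a minimal, self-contained sketch with toy data):

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])   # toy data, just for illustration

scale = StandardScaler()

# One call
X_scaled_a = scale.fit_transform(X_train)

# Equivalent two-step version
scale.fit(X_train)                      # learns the mean and standard deviation only
X_scaled_b = scale.transform(X_train)   # applies them to produce the scaled data

print(np.allclose(X_scaled_a, X_scaled_b))   # True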
So what you are saying leads to the two scenarios below:
Scenario 1: Wrong
1) X_scaled = scaler.fit_transform(X)
2) Divide the X_scaled into X_scaled_train, X_scaled_test and run your model.
No need to scale again.
Scenario 2: Wrong (basically equal to Scenario 1, with the scaling and splitting operations reversed)
1) Divide the X into X_train, X_test
2) scale.fit_transform(X) [# You are not using the returned value, only fitting the data, so equivalent to scale.fit(X)]
3.a) X_train_scaled = scale.transform(X_train) #[Equals X_scaled_train in scenario 1]
3.b) X_test_scaled = scale.transform(X_test) #[Equals X_scaled_test in scenario 1]
You can try either of these scenarios, and they may even increase the measured performance of your model.
But there is one very important thing missing in them: when you scale the whole data set and then split it into train and test, you are assuming that you already know the test (unseen) data, which will not be true in real-world cases, and it will give you results that do not match real-world performance, because in the real world all of the available data is training data. It may also lead to over-fitting, because the model already has some information about the test data.
So when evaluating the performance of machine learning models, it is recommended that you set the test data aside before performing any operations on the data. Because it is our unseen data, we know nothing about it. The ideal order of operations is therefore the one I answered with, i.e.:
1) Divide X into X_train and X_test (same for y)
2) X_train_scaled = scale.fit_transform(X_train) [#Learn the mean and SD of train data]
3) X_test_scaled = scale.transform(X_test) [#Use the mean and SD learned in step2 to convert test data]
4) Use the X_train_scaled for training the model and X_test_scaled in evaluation.
Hope it makes sense to you.