How do I add a regression table (a table that includes the t-statistic, p-value, R^2, etc.)? I've attached an image of some of my code.
sklearn is aimed at predictive modeling, so you don’t get the regression table that you are used to. An alternative in Python to sklearn is statsmodels.
See also How to get a regression summary in Python scikit like R does?
I have read about two ways: statsmodels and classification_report.
STATSMODELS
import statsmodels.api as sm
# add_constant appends an intercept column; X holds the features, y the target
X = sm.add_constant(X)
results = sm.OLS(y, X).fit()
print(results.summary())
CLASSIFICATION REPORT
from sklearn.metrics import classification_report
y_preds = reg.predict(X_test)
# classification_report expects class labels, so it applies to classifiers only, not regressors
print(classification_report(y_test, y_preds))
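If you would rather stay with sklearn, the table can be assembled by hand from the fitted coefficients. The following is only a minimal sketch, not a built-in sklearn feature: the synthetic data, the variable names, and the use of scipy.stats for the p-values are my own assumptions, and the standard errors are the classical OLS ones (they assume homoskedastic errors).

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only (true intercept 1.0, slopes 2.0 and -0.5)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)

n, k = X.shape
design = np.column_stack([np.ones(n), X])            # intercept column + features
beta = np.concatenate([[model.intercept_], model.coef_])

residuals = y - design @ beta
dof = n - k - 1                                      # residual degrees of freedom
sigma2 = residuals @ residuals / dof                 # estimated error variance
cov = sigma2 * np.linalg.inv(design.T @ design)      # classical OLS covariance
std_err = np.sqrt(np.diag(cov))

t_stats = beta / std_err
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)      # two-sided p-values
r_squared = model.score(X, y)                        # R^2 on the training data

for name, b, se, t, p in zip(["const", "x1", "x2"], beta, std_err, t_stats, p_values):
    print(f"{name:>6} coef={b: .4f} se={se:.4f} t={t: .2f} p={p:.4g}")
print(f"R^2 = {r_squared:.4f}")
```

For anything beyond a quick check, results.summary() from statsmodels is probably the safer route, since it also handles robust covariance options.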
Related
I tried using pycaret for a machine learning project and got very high accuracies. When I tried to verify these using my sklearn code I found that I could not get the same numbers. Here is an example where I reproduce this issue on the public poker dataset from pycaret:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pycaret.classification import *
from pycaret.datasets import get_data
data = get_data('poker')
grid = setup(data=data, target='CLASS', fold_shuffle=True, session_id=2)
dt = create_model('dt')
This gives an accuracy of about 57% using 10-fold cross-validation. When I try to reproduce this number using sklearn on the same dataset with the same model, I get only 49%. Does anyone understand where this difference comes from?
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score
X = data.drop('CLASS', axis = 1)
y = data['CLASS']
y_pred_cv = cross_val_predict(dt, X, y, cv=10)
accuracy_score(y, y_pred_cv)
0.4911698233964679
I think the difference could be due to how your CV folds are being randomized. Did you set the same seed (2) in sklearn? Is the shuffle parameter of KFold set the same way?
I had some trouble validating the results from PyCaret myself. I see two options you can try to validate the results:
1. Is your data correlated in some way (for example, ordered)? You are using sklearn.model_selection.cross_val_predict with cv=10, which means that (stratified) k-fold cross-validation generates your folds. Either way, the splitter is instantiated with shuffle=False. If your data is correlated, this may explain the higher accuracy that you observe with PyCaret, where you set fold_shuffle=True. Try setting shuffle=True in sklearn as well.
2. PyCaret by default makes a 70%/30% train/test split. If you use its create_model method, the cross-validation is done on the train set only. In your validation you use 100% of the data. This might alter the results a bit, but I doubt it explains the gap that you observe.
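The two points above can be sketched in plain sklearn as follows. This is a rough approximation under stated assumptions, not a reproduction of PyCaret's internals: the synthetic data stands in for the poker dataset, and whether PyCaret stratifies its split or uses exactly this splitter is something I have not verified.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real dataset (assumption for illustration)
X, y = make_classification(n_samples=1000, n_features=10, random_state=2)

# Point 2: hold out 30% first and cross-validate on the 70% train set only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=2
)

# Point 1: shuffle the folds explicitly, with the same seed as session_id
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=2)

model = DecisionTreeClassifier(random_state=2)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
print(scores.mean())
```

Passing an integer such as cv=10 gives unshuffled folds, so the splitter object has to be passed explicitly to get shuffling.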
The parameters could be the same, but did you reproduce all the feature engineering done inside setup (feature selection, collinearity, normalisation, etc.)?
I'm trying to run a multinomial LogisticRegression in sklearn with a clustered dataset (that is, there is more than one observation for each individual, where only some features change and others remain constant per individual).
I am aware in statsmodels it is possible to account for this the following way:
# MNLogit takes the response (endog) first, then the features (exog)
mnl = MNLogit(y, x).fit(cov_type="cluster", cov_kwds={"groups": cluster_groups})
Is there a way to replicate this with the sklearn package instead?
To run multinomial logistic regression in sklearn, you can use the LogisticRegression class and set the parameter multi_class to "multinomial".
Reference: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
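For what it's worth, here is a minimal sketch of a multinomial fit on the iris data (my choice of dataset, not from the question). As far as I know, in scikit-learn 1.5 and later the multi_class parameter is deprecated, and the default lbfgs solver already fits a multinomial model when there are three or more classes, so the sketch omits it. Note that sklearn does not report standard errors at all, so this does not replicate cov_type="cluster"; for cluster-robust inference, statsmodels remains the tool.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Three-class example data bundled with sklearn (assumption for illustration)
X, y = load_iris(return_X_y=True)

# Default lbfgs solver; max_iter raised to avoid convergence warnings
clf = LogisticRegression(max_iter=1000).fit(X, y)

print(clf.coef_.shape)            # one row of coefficients per class
print(clf.predict_proba(X[:3]))   # class probabilities, each row sums to 1
```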
I want to know what .linear_model means in the following code:
from sklearn.linear_model import LogisticRegression
My understanding is sklearn is the library/module (both have same meaning) and LogisticRegression is the class inside this module.
But I'm not able to understand what .linear_model means?
linear_model is a module. sklearn is a package. A package is basically a module that contains other modules.
linear_model is a submodule of the sklearn package, not a class. It contains the classes and functions for machine learning with linear models.
The term linear model implies that the model is specified as a linear combination of features. Based on training data, the learning process computes one weight for each feature to form a model that can predict or estimate the target value.
It includes: linear regression and classification, Ridge regression and classification, Lasso, Multi-task Lasso, etc.
Check the sklearn doc for further details.
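The package/module/class distinction can be checked directly from Python. A small sketch using only the standard inspect module:

```python
import inspect

import sklearn
import sklearn.linear_model
from sklearn.linear_model import LogisticRegression

# sklearn is a package: a module that has a __path__ and contains other modules
print(inspect.ismodule(sklearn), hasattr(sklearn, "__path__"))

# linear_model is a module living inside the sklearn package
print(inspect.ismodule(sklearn.linear_model))

# LogisticRegression is a class exposed by that module
print(inspect.isclass(LogisticRegression))
```

So the dot in sklearn.linear_model is ordinary attribute access on the package, the same as in os.path.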
I have written code that performs logistic regression with leave-one-out cross-validation. I need to know the values of the logistic regression coefficients, but the attribute model.coef_ works only after the model has been fitted with the fit function. Since I performed cross-validation, I have not called fit to train the model.
Here is the code:
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression
reg=LogisticRegression()
loo=LeaveOneOut()
scores=cross_val_score(reg,train1,labels,cv=loo)
print(scores)
print(scores.mean())
# cross_val_score fits clones of reg, so reg itself is still unfitted and this line raises an error
coef = reg.coef_
I want to know the coefficient values for my features in train1, but as I have not used the fit method, how can I get the values of these coefficients?
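One possible approach, sketched here on synthetic data (the dataset and variable names are my assumptions, not the asker's train1 and labels): cross_validate with return_estimator=True keeps the model fitted on each fold, so the per-fold coefficients can be read from those. For a single set of coefficients, the usual practice is to fit once on all the data after cross-validation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_validate

# Synthetic binary data for illustration only
X, y = make_classification(n_samples=40, n_features=4, random_state=0)

# return_estimator=True keeps the fitted clone from every fold
cv_results = cross_validate(
    LogisticRegression(), X, y, cv=LeaveOneOut(), return_estimator=True
)
print(cv_results["test_score"].mean())

# One row of coefficients per leave-one-out fold
fold_coefs = np.array([est.coef_.ravel() for est in cv_results["estimator"]])
print(fold_coefs.mean(axis=0))

# A single final model fitted on all the data
final_model = LogisticRegression().fit(X, y)
print(final_model.coef_)
```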
I am studying LASSO in Python with sklearn, but something goes wrong when I run the code on a classification data set: I obtain only one result after 10-fold cross-validation.
Y is a binary label with values 1 and 2.
import numpy as np
from sklearn.linear_model import LassoCV, Lasso
from sklearn.model_selection import cross_val_score
lasso = Lasso().fit(X,Y)
accs=cross_val_score(lasso, X, Y, scoring=None, cv=10)
print('The results:',accs)
I expect to get ten different results after cross-validation with lasso in Python.
LASSO is a regression method. Supervised learning comes in two types, classification and regression, and a binary label calls for a classifier. Perhaps you should try random forest classification instead.
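If the L1 penalty itself is what you are after, the closest classification analogue I know of is L1-penalized logistic regression. A minimal sketch on synthetic data (the dataset, C=1.0, and the liblinear solver are my assumptions; newer scikit-learn releases may spell the penalty option differently):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary data for illustration only
X, Y = make_classification(n_samples=200, n_features=10, random_state=0)
Y = Y + 1  # relabel classes as 1 and 2, as in the question

# L1-penalized logistic regression: a classifier with lasso-style sparsity
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)

# cv=10 returns one accuracy score per fold
accs = cross_val_score(clf, X, Y, scoring="accuracy", cv=10)
print("The results:", accs)
```

There is no need to call fit before cross_val_score; it fits a fresh clone of the estimator on each fold.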