I am training a logistic regression (LR) model on the white-wine dataset (https://archive.ics.uci.edu/ml/datasets/wine). After training the model in Python, I printed out model.coef_ to see the importance of each feature, and noticed that "residual sugar" is assigned a fairly large weight (1.3). However, when looking at the correlation matrix (image below), the correlation coefficient between this independent feature (residual sugar) and the dependent feature is quite low compared to the other independent features. So I wonder whether the assigned weights are really a factor to consider when judging how important a feature is, and if not, how I should evaluate feature importance. Below is my code as well; if anything is wrong, please help me correct it, as I am new to this area.
from sqlalchemy import create_engine
import pandas as pd
import seaborn
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

engine = create_engine("mysql+mysqlconnector://root:21041996@localhost/mydatabase")
con = engine.connect()
dataframe= pd.read_sql('select * from wine_quality',con)
df = dataframe[dataframe['type']=='white']
seaborn.heatmap(df.corr(),annot= True)
plt.show()
y = df['"quality"']
x = df.drop(columns=['type','"quality"'])
x = x.to_numpy()
y=y.to_numpy()
X_train, X_test, y_train, y_test = train_test_split(x,y,test_size=0.2)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_train = y_train>6
y_test = y_test>6
model = LogisticRegression(solver='liblinear',max_iter=2000)
model.fit(X_train,y_train)
print(model.score(X_train,y_train))
print(model.score(X_test,y_test))
print(model.coef_)
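To make the coefficients easier to interpret, they can be printed next to the column names; a minimal sketch, assuming the column order of x before conversion to NumPy is unchanged:

# Pair each standardized-feature coefficient with its column name
# (assumes the same columns, in the same order, as x above)
feature_names = df.drop(columns=['type','"quality"']).columns
for name, coef in zip(feature_names, model.coef_[0]):
    print(f"{name}: {coef:.3f}")

Keep in mind that a logistic-regression weight measures a feature's contribution while holding the other (possibly correlated) features fixed, so it does not have to agree with that feature's marginal correlation with the target.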
I'm trying to perform LassoCV feature selection on my miRNA expression dataset. After finding the 100 best features (miRNAs in this case), I want to build some classification models (SVM, RF, KNN, etc.) for prediction using those 100 miRNAs. I can use the following code on my data without any problems if I don't do a train-test split.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import SelectFromModel
feature_names = df.columns[0:2565]
clf = LassoCV().fit(X, y)
importance = np.abs(clf.coef_)
idx_third = importance.argsort()[-3]
threshold = importance[idx_third] + 0.01
idx_features = (-importance).argsort()[:100]
name_features = np.array(feature_names)[idx_features]
print('Selected features: {}'.format(name_features))
sfm = SelectFromModel(clf, threshold=threshold)
sfm.fit(X, y)
X = sfm.transform(X)
But my goal is to select the features after the split, and I think I'm having trouble transforming X_train and X_test after applying LassoCV. Here's the code after train_test_split:
clf = LassoCV().fit(X_train, y_train)
importance = np.abs(clf.coef_)
idx_third = importance.argsort()[-3]
threshold = importance[idx_third] + 0.01
idx_features = (-importance).argsort()[:100]
name_features = np.array(feature_names)[idx_features]
print('Selected features: {}'.format(name_features))
sfm = SelectFromModel(clf, threshold=threshold)
sfm.fit(X_train, y_train)
and the output:
Selected features: ['MIMAT0019071' 'MIMAT0019947' 'MIMAT0005951' 'MIMAT0025458'
'MIMAT0019710' 'MIMAT0005880' 'MIMAT0004810' 'MIMAT0026481'
'MIMAT0016904' 'MIMAT0003340' 'MIMAT0016851' 'MIMAT0019033'
'MIMAT0004508' 'MIMAT0024615' 'MIMAT0022478' 'MIMAT0019004'
'MIMAT0004948' 'MIMAT0005898' 'MIMAT0000064' 'MIMAT0015087'
'MIMAT0005942' 'MIMAT0004602' 'MIMAT0027666' 'MIMAT0003250'
'MIMAT0022289' 'MIMAT0005866' 'MIMAT0004903' 'MIMAT0004592'
'MIMAT0021040' 'MIMAT0003237' 'MIMAT0018954' 'MIMAT0019858'
'MIMAT0003270' 'MIMAT0030416' 'MIMAT0019361' 'MIMAT0018083'
'MIMAT0000440' 'MIMAT0018070' 'MIMAT0016863' 'MIMAT0015066'
'MIMAT0027576' 'MIMAT0017997' 'MIMAT0000421' 'MIMAT0003165'
'MIMAT0027587' 'MIMAT0004603' 'MIMAT0003330' 'MIMAT0019948'
'MIMAT0004978' 'MIMAT0018951' 'MIMAT0016872' 'MIMAT0019203'
'MIMAT0015005' 'MIMAT0003319' 'MIMAT0003316' 'MIMAT0022265'
'MIMAT0011159' 'MIMAT0016898' 'MIMAT0003240' 'MIMAT0004925'
'MIMAT0027580' 'MIMAT0019067' 'MIMAT0018121' 'MIMAT0028112'
'MIMAT0019714' 'MIMAT0000685' 'MIMAT0019742' 'MIMAT0027627'
'MIMAT0003277' 'MIMAT0019737' 'MIMAT0003284' 'MIMAT0020925'
'MIMAT0022929' 'MIMAT0022938' 'MIMAT0020924' 'MIMAT0020603'
'MIMAT0020602' 'MIMAT0020956' 'MIMAT0020601' 'MIMAT0020600'
'MIMAT0022719' 'MIMAT0020300' 'MIMAT0022939' 'MIMAT0022940'
'MIMAT0019984' 'MIMAT0019983' 'MIMAT0019982' 'MIMAT0019981'
'MIMAT0019980' 'MIMAT0019979' 'MIMAT0019978' 'MIMAT0019977'
'MIMAT0019976' 'MIMAT0022941' 'MIMAT0020541' 'MIMAT0019985'
'MIMAT0020958' 'MIMAT0019975' 'MIMAT0021036' 'MIMAT0021037']
SelectFromModel(estimator=LassoCV(), threshold=0.041810456987634005)
So, no problems until here, and we can see the 100 miRNAs to be selected. I then try to select these features by applying sfm.transform to the split data like this:
X_train = sfm.transform(X_train)
X_test = sfm.transform(X_test)
But when I check the X_train.shape and X_test.shape the output is like this:
((164, 0), (55, 0))
So, of course when I try to train my model:
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)
it gives me this error:
ValueError: Found array with 0 feature(s) (shape=(164, 0)) while a minimum of 1 is required.
I'm new to machine learning, especially the feature selection part. If anyone can tell me how to develop models with the selected features in this particular case, I would greatly appreciate it.
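For illustration only, one way to keep exactly the 100 top-ranked features from the training fit and apply the same selection to both splits is to index the columns directly with the idx_features array computed above (a sketch, assuming X_train and X_test are NumPy arrays):

# Column indices of the 100 largest |coef_|, computed on the training split only
X_train_sel = X_train[:, idx_features]
X_test_sel = X_test[:, idx_features]
print(X_train_sel.shape, X_test_sel.shape)  # should now end in 100 columns

Alternatively, in recent scikit-learn versions SelectFromModel(clf, max_features=100, threshold=-np.inf) selects by rank rather than by an absolute cutoff; the threshold computed in the question (third-largest coefficient + 0.01) can exclude every feature, which is why the transform returned zero columns.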
For a scientific study, I need to analyze a traditional logistic regression using Python and scikit-learn. After fitting my regression model with penalty='none', I get the correct coefficients, but the intercept is half of the real value. My code is mostly as follows:
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_excel("data.xlsx")
train, test = train_test_split(df, train_size = 0.8, random_state = 42)
train = train.drop(["Unnamed: 0"], axis = 1)
test = test.drop(["Unnamed: 0"], axis = 1)
x_train = train.drop(["GRUP"], axis = 1)
x_train = sm.add_constant(x_train)
y_train = train["GRUP"]
x_test = test.drop(["GRUP"], axis = 1)
x_test = sm.add_constant(x_test)
y_test = test["GRUP"]
model = sm.Logit(y_train, x_train).fit()
model.summary()
log = LogisticRegression(penalty = "none")
log.fit(x_train, y_train)
log.intercept_
With statsmodels I get the intercept (constant) 28.7140, but with scikit-learn I get 14.35698738. The other coefficients are the same. I verified it in SPSS and the first one is the correct value. I don't want to have to use statsmodels just for the logistic regression. Could you please help?
PS: Without the intercept, the model works fine.
The issue here is that in the code you posted you add a constant term (a column of 1's) to x_train with x_train = sm.add_constant(x_train). Then you pass that same x_train object to sklearn's LogisticRegression(), where the default value of fit_intercept= is True. So, at that stage, you end up with two constant terms, and the overall constant gets shared between them: sklearn's intercept_ only reports part of it (here, half), which is the discrepancy you see.
So, you should either turn off fit_intercept= in the sklearn code, or leave fit_intercept=True but use the x_train array without the added constant term.
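A minimal sketch of these two options, reusing the variable names from the question (this assumes the constant column added by sm.add_constant keeps its default name 'const'):

# Option 1: keep the manually added constant and switch off sklearn's own intercept
log = LogisticRegression(penalty="none", fit_intercept=False)
log.fit(x_train, y_train)
# the coefficient of the 'const' column now plays the role of the intercept

# Option 2: let sklearn fit the intercept, and drop the manually added constant
log = LogisticRegression(penalty="none")
log.fit(x_train.drop(columns=["const"]), y_train)
print(log.intercept_)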
I'm new to this field and I'm currently working with gene expression data. I have to do a classification where my data are counts in matrix form. The features are the genes and the samples to classify are the patients (7 types of cancer and healthy donors). The book from which I'm replicating the experiment says the following:
For the multi-class SVM classification algorithm, a One-Versus-One (OVO) approach was used. To cross validate the algorithm for all samples in the training cohort, the SVM algorithm was trained by all samples in the training cohort minus one, while the remaining sample was used for (blind) classification. This process was repeated for all samples until each sample was predicted once (leave-one-out cross-validation [LOOCV] procedure).
Now I know how to use LOOCV in Python, and I know how to use OVO from looking online, but I don't get what is meant to be done here. I made an attempt and the results came out quite similar, but I'm pretty sure I'm making a horrible mistake somewhere. Please don't flame me, I need help. Here below is my interpretation (I copied this from the internet and added OVO instead of only SVM):
from sklearn.model_selection import LeaveOneOut
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

# Function for training
def loocv(train_X, train_y):
    # define X and y
    X = train_X
    y = train_y
    # define LOOCV
    loo = LeaveOneOut()
    loo.get_n_splits(X)
    # define true and predicted lists
    y_true, y_pred = [], []
    # run
    for train_index, test_index in loo.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        model = SVC(kernel='linear', random_state=0)
        ovo_classifier = OneVsOneClassifier(model)
        ovo_classifier.fit(X_train, y_train)
        yhat = ovo_classifier.predict(X_test)
        y_true.append(y_test[0])
        y_pred.append(yhat[0])
    return y_true, y_pred, ovo_classifier
Validation:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
y_true,y_pred,model = loocv(X_train,y_train)
pred_y = model.predict(X_test)
training_accuracy = accuracy_score(y_true,y_pred)
accuracy = accuracy_score(y_test,pred_y)
print(accuracy)
print(training_accuracy)
Results:
0.6918604651162791
0.6658291457286432
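For reference, the procedure quoted from the book (train on all samples minus one, predict the left-out sample, repeat for every sample) can be written compactly with cross_val_predict; this is only a sketch of one reading of that text, not a guaranteed reproduction of the book's protocol:

from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# One prediction per training sample, each made by an OVO SVM trained on all the others
ovo = OneVsOneClassifier(SVC(kernel='linear', random_state=0))
loocv_pred = cross_val_predict(ovo, X_train, y_train, cv=LeaveOneOut())
print(accuracy_score(y_train, loocv_pred))

A model evaluated on the 30% hold-out would then typically be refit on all of X_train, rather than taken from the last LOOCV fold as the loocv function above does.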
Question: How can I find out which features the output coefficients belong to, without manually tracking the order of the features fed into the linear regression?
I have a dataset with the following features.
usertype contains Subscriber and Customer.
I split the data with train_test_split.
feature = ['age','usertype','gender']
X = citibike_dropped[feature]
y = citibike_dropped['tripduration']
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=123)
I used an sklearn Pipeline to preprocess the data and fit a linear regression:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

ct = ColumnTransformer(
[('ohe',OneHotEncoder(handle_unknown = 'ignore'),['usertype']),
('scaler',MinMaxScaler(),['age'])],
remainder = 'passthrough')
lr = LinearRegression()
Input = [('transformer',ct),('clf',lr)]
pipe = Pipeline(Input)
I check the coefficients after fitting the pipeline with X_train and y_train:
pipe.fit(X_train,y_train);
pipe.named_steps['clf'].coef_
OUTPUT
array([ 0. , 499.85347478, 177.64720307])
How can I find out which features the above coefficients belong to?
Perhaps you might have a look here or here to find the decision line or the area that belongs to the feature.
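One possible way to line the coefficients up with the transformed column names, assuming scikit-learn 1.0 or newer (where ColumnTransformer exposes get_feature_names_out):

# Names of the columns produced by the ColumnTransformer, in the same order the coefficients use
feature_names = pipe.named_steps['transformer'].get_feature_names_out()
coefs = pipe.named_steps['clf'].coef_
for name, coef in zip(feature_names, coefs):
    print(f"{name}: {coef:.2f}")

One-hot encoded columns appear under prefixed names like 'ohe__usertype_Subscriber', so each coefficient can be matched back to its source column.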
I have built this model for the Concrete Compressive Strength dataset.
# X, Y, n, scaler and ols are assumed to be defined earlier (ols is a LinearRegression instance)
new_X = pd.concat((X**(i+1) for i in range(n)), axis=1)
colnames = []
for j in range(8*n):
    colnames.append(j)
new_X.columns = colnames

X_train_new, X_test_new, Y_train_new, Y_test_new = train_test_split(new_X, Y, test_size=0.3, shuffle=False)

X_train_new_scaled = scaler.fit_transform(X_train_new)
X_test_new_scaled = scaler.transform(X_test_new)

ols.fit(X_train_new_scaled, Y_train_new)
poly_train_pred = ols.predict(X_train_new_scaled)
poly_test_pred = ols.predict(X_test_new_scaled)

poly_mae_train_error = metrics.mean_absolute_error(Y_train_new, poly_train_pred)
poly_mae_test_error = metrics.mean_absolute_error(Y_test_new, poly_test_pred)
print(poly_mae_train_error, "\n", poly_mae_test_error)
8 is the number of features.
ols is a LinearRegression instance that I created earlier.
Is it OK that the test error of this model is higher than that of the simple linear model?
Also, how do I implement the function test_poly_regression(X_train, y_train, X_test, y_test, n=2) to run this model for multiple values of n and check the errors?
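For the last part, here is a minimal sketch of what such a function could look like, assuming the polynomial features are built by concatenating element-wise powers exactly as above (the helper names and defaults are illustrative, not from the original post):

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

def test_poly_regression(X_train, y_train, X_test, y_test, n=2):
    # Build "degree-n" features by stacking element-wise powers of the columns,
    # mirroring the pd.concat construction used above
    def expand(X):
        out = pd.concat([X**(i+1) for i in range(n)], axis=1)
        out.columns = range(out.shape[1])
        return out

    X_train_poly, X_test_poly = expand(X_train), expand(X_test)

    # Scale using statistics from the training split only
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train_poly)
    X_test_scaled = scaler.transform(X_test_poly)

    ols = LinearRegression().fit(X_train_scaled, y_train)
    train_err = metrics.mean_absolute_error(y_train, ols.predict(X_train_scaled))
    test_err = metrics.mean_absolute_error(y_test, ols.predict(X_test_scaled))
    return train_err, test_err

# Example usage: compare errors for several degrees
# for n in range(1, 5):
#     print(n, test_poly_regression(X_train, y_train, X_test, y_test, n=n))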