Question: How can I find out which feature each of the output coefficients belongs to, without manually tracking the order of the features fed into the linear regression?
I have a dataset with the following features.
usertype contains Subscriber and Customer.
I train_test_split the data:
from sklearn.model_selection import train_test_split

feature = ['age', 'usertype', 'gender']
X = citibike_dropped[feature]
y = citibike_dropped['tripduration']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
I used an sklearn Pipeline to preprocess the data and fit a linear regression:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

ct = ColumnTransformer(
    [('ohe', OneHotEncoder(handle_unknown='ignore'), ['usertype']),
     ('scaler', MinMaxScaler(), ['age'])],
    remainder='passthrough')
lr = LinearRegression()
Input = [('transformer', ct), ('clf', lr)]
pipe = Pipeline(Input)
I check the coefficients after fitting the pipeline with X_train and y_train:
pipe.fit(X_train, y_train)
pipe.named_steps['clf'].coef_
Output:
array([ 0. , 499.85347478, 177.64720307])
How can I find out which feature each of the above coefficients belongs to?
Perhaps you could have a look here or here to find the decision line or the area that belongs to each feature.
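Separately from those links, one way to map the coefficients back to feature names is to ask the fitted ColumnTransformer for the names of the columns it produced. A minimal sketch, assuming scikit-learn 1.0 or newer (where get_feature_names_out is available) and the pipe object from the question:

# Column names produced by the ColumnTransformer, in the order the
# LinearRegression step receives them (one-hot columns, scaled age,
# then the passthrough remainder).
feature_names = pipe.named_steps['transformer'].get_feature_names_out()

# Pair each coefficient with its feature name.
for name, coef in zip(feature_names, pipe.named_steps['clf'].coef_):
    print(name, coef)

On older scikit-learn versions the equivalent call is get_feature_names() on the ColumnTransformer, though its handling of passthrough columns is more limited.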
Background information
I fit a classifier on my training data. When testing my fitted best estimator, I predict the probabilities for one of the classes. I order both my X_test and my y_test by these probabilities in descending order.
Question
I want to understand which features were important (and to what extent) for the classifier to predict only the 500 predictions with the highest probability as a whole, not for each prediction individually. Is the following code correct for this purpose?
y_test_probas = clf.predict_proba(X_test)[:, 1]
explainer = shap.Explainer(clf, X_train) # <-- here I put the X which the classifier was trained on?
top_n_indices = np.argsort(y_test_probas)[-500:]
shap_values = explainer(X_test.iloc[top_n_indices]) # <-- here I put the X I want the SHAP values for?
shap.plots.bar(shap_values)
Unfortunately, the shap documentation (bar plot) does not cover this case. Two things are different there:
They use the data the classifier was trained on (I want to use the data the classifier is tested on)
They use the whole X and not part of it (I want to use only part of the data)
Minimal reproducible example
import numpy as np
import pandas as pd
import shap
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Load the Titanic Survival dataset
data = pd.read_csv("https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv")
# Preprocess the data
data = data.drop(["Name"], axis=1)
data = data.dropna()
data["Sex"] = (data["Sex"] == "male").astype(int)
# Split the data into predictors (X) and response variable (y)
X = data.drop("Survived", axis=1)
y = data["Survived"]
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit a logistic regression classifier
clf = LogisticRegression().fit(X_train, y_train)
# Get the predicted class probabilities for the positive class
y_test_probas = clf.predict_proba(X_test)[:, 1]
# Select the indices of the top 500 test samples with the highest predicted probability of the positive class
top_n_indices = np.argsort(y_test_probas)[-500:]
# Initialize the Explainer object with the classifier and the training set
explainer = shap.Explainer(clf, X_train)
# Compute the SHAP values for the top 500 test samples
shap_values = explainer(X_test.iloc[top_n_indices, :])
# Plot the bar plot of the computed SHAP values
shap.plots.bar(shap_values)
I don't want to know how the classifier decides all the predictions, only the predictions with the highest probability. Is that code suitable to answer this question? If not, what would suitable code look like?
For a scientific study, I need to analyze a traditional logistic regression using Python and scikit-learn. After fitting my regression model with penalty='none', I get the correct coefficients, but the intercept is half of the real value. My code is mostly as follows:
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_excel("data.xlsx")
train, test = train_test_split(df, train_size=0.8, random_state=42)
train = train.drop(["Unnamed: 0"], axis=1)
test = test.drop(["Unnamed: 0"], axis=1)
x_train = train.drop(["GRUP"], axis=1)
x_train = sm.add_constant(x_train)
y_train = train["GRUP"]
x_test = test.drop(["GRUP"], axis=1)
x_test = sm.add_constant(x_test)
y_test = test["GRUP"]
model = sm.Logit(y_train, x_train).fit()
model.summary()
log = LogisticRegression(penalty="none")
log.fit(x_train, y_train)
log.intercept_
With statsmodels I get the intercept (constant) 28.7140, but with scikit-learn I get 14.35698738. The other coefficients are the same. I verified it in SPSS, and the first one is the correct value. I don't want to use statsmodels just for the logistic regression. Could you please help?
PS: Without the intercept, the model works fine.
The issue here is that in the code you posted you add a constant term (a column of 1's) to x_train with x_train = sm.add_constant(x_train). Then you pass that same x_train object to sklearn's LogisticRegression(), whose default value of fit_intercept= is True. So, at that stage, you end up with two constant terms, and the estimated intercept is effectively split between the explicit constant column and the fitted intercept, which is why you see roughly half of the statsmodels value.
So, you should either turn off fit_intercept= in the sklearn code, or leave fit_intercept=True but use the x_train array without the added constant term.
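A minimal sketch of both fixes, reusing the x_train / y_train objects from the question (this assumes the constant column has statsmodels' default name 'const'):

from sklearn.linear_model import LogisticRegression

# Option 1: keep the explicit constant column and tell sklearn not to add
# its own intercept; the intercept is then the coefficient of 'const'.
log = LogisticRegression(penalty="none", fit_intercept=False)
log.fit(x_train, y_train)
print(dict(zip(x_train.columns, log.coef_[0])))

# Option 2: drop the added constant column and let sklearn fit the intercept.
log = LogisticRegression(penalty="none")
log.fit(x_train.drop(columns=["const"]), y_train)
print(log.intercept_, log.coef_)

Either way there is exactly one constant term in the model, so the intercept should now match the statsmodels and SPSS value.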
I am doing a machine learning project in Python and am trying all the models on my data.
I am really confused about the following, for both classification and regression:
Do I have to apply normalization (e.g. Z-score / standard-deviation scaling) to the whole dataset and then split off the features (X) and output (y), like this:
import pandas as pd

def normalize(df):
    from sklearn.preprocessing import MaxAbsScaler
    scaler = MaxAbsScaler()
    scaler.fit(df)
    scaled = scaler.transform(df)
    scaled_df = pd.DataFrame(scaled, columns=df.columns)
    return scaled_df

data = normalize(data)
X = data.drop(columns=['col'])
y = data['col']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Or do I only have to apply it to the features (X)?
X = data.drop(columns=['col'])
y = data['col']

def normalize(df):
    from sklearn.preprocessing import MaxAbsScaler
    scaler = MaxAbsScaler()
    scaler.fit(df)
    scaled = scaler.transform(df)
    scaled_df = pd.DataFrame(scaled, columns=df.columns)
    return scaled_df

X = normalize(X)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
TL;DR: apply normalization to the input data, but not to the output.
Normalization is both algorithm-dependent and feature-based.
Some algorithms do not require any normalization (like decision trees).
Applying normalization to the dataset: you should normalize per feature, but across all examples in the dataset, if you have more than one feature.
For example, say you have two features X and Y, where X is always a decimal in the range [0, 10] and Y is in the range [100K, 1M]. If you compare normalizing X and Y separately (per feature) with normalizing their combined, pooled values, you will see how the values of feature X become insignificant in the combined case.
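A minimal sketch of that effect, using MaxAbsScaler from the question (the toy numbers are made up purely for illustration):

import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Two features on very different scales.
data = np.array([[2.0, 200_000.0],
                 [5.0, 500_000.0],
                 [9.0, 900_000.0]])

# Per-feature scaling (what MaxAbsScaler does): each column is divided by
# its own maximum absolute value, so the first feature keeps a usable spread.
print(MaxAbsScaler().fit_transform(data))   # first column: 0.22, 0.56, 1.0

# Pooled scaling: dividing everything by the single global maximum squashes
# the first feature to values near zero, making it effectively insignificant.
print(data / np.abs(data).max())            # first column: ~2e-6 to 1e-5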
For the output (labels):
Generally, there is no need to normalize the output or labels for regression or classification tasks. But make sure to fit the normalization on the training data only, and apply that same fitted transformation at both training and inference time (a small sketch of this pattern follows below).
If the task is classification, the common approach is to just encode the classes as numbers (if you have the classes dog and cat, you assign 0 to one and 1 to the other).
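The sketch mentioned above: fit the scaler on the training features only and reuse it for the test/inference data, assuming an existing X_train / X_test split and the MaxAbsScaler from the question:

from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()

# Fit the scaler on the training features only...
X_train_scaled = scaler.fit_transform(X_train)

# ...and reuse the already-fitted scaler for test / inference data, so both
# splits are transformed with the same statistics.
X_test_scaled = scaler.transform(X_test)

# y_train / y_test are left untouched.

Putting the scaler and the model into a Pipeline gives the same behaviour automatically, since the pipeline only fits the scaler on whatever data is passed to fit().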
I am training a logistic regression model on the white-wine dataset (https://archive.ics.uci.edu/ml/datasets/wine). After training the model in Python, I printed out model.coef_ to see the level of importance of each feature, and noticed that "residual sugar" is assigned quite a large weight (1.3). However, looking at the correlation matrix (image below), the correlation coefficient between this independent feature (residual sugar) and the dependent feature is pretty low compared to the other independent features. So I wonder whether the assigned weights are not a factor to consider when judging how important a feature is, and if not, how I should evaluate whether a feature is important. Below is also my code; if anything is wrong, please help me correct it, as I am new to this area.
import pandas as pd
import seaborn
import matplotlib.pyplot as plt
from sqlalchemy import create_engine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

engine = create_engine("mysql+mysqlconnector://root:21041996@localhost/mydatabase")
con = engine.connect()
dataframe = pd.read_sql('select * from wine_quality', con)
df = dataframe[dataframe['type'] == 'white']
seaborn.heatmap(df.corr(), annot=True)
plt.show()
y = df['"quality"']
x = df.drop(columns=['type', '"quality"'])
x = x.to_numpy()
y = y.to_numpy()
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_train = y_train > 6
y_test = y_test > 6
model = LogisticRegression(solver='liblinear', max_iter=2000)
model.fit(X_train, y_train)
print(model.score(X_train, y_train))
print(model.score(X_test, y_test))
print(model.coef_)
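Not part of the original post, but one way to see which weight belongs to which feature is to keep the column names from the DataFrame before the .to_numpy() conversion and pair them with model.coef_. A minimal sketch (the feature_names variable is an assumption; capture it right after x = df.drop(columns=['type', '"quality"'])):

import pandas as pd

# feature_names = x.columns   # captured before x is converted to a NumPy array

# Pair each coefficient with its feature name and order by absolute magnitude.
# Because the features were standardized, the magnitudes are comparable, but
# they measure conditional effects given the other features, not the simple
# pairwise correlation with the target.
weights = pd.Series(model.coef_[0], index=feature_names)
print(weights.reindex(weights.abs().sort_values(ascending=False).index))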
I'm currently working on a multilabel text classification problem in which I have 4 labels, represented as 4 dummy variables. I have tried out several ways to transform the data into a form that is suitable for multilabel classification (MLC).
Right now I'm running with pipelines, but as far as I can see, this doesn't fit a model with all labels included, but rather makes 1 model per label - do you agree with this?
I have tried to use MultiLabelBinarizer and LabelBinarizer, but with no luck.
Do you have a tip on how I can solve this problem in a way that makes the model include all the labels in one model, taking into account the different label combinations?
A subset of the data and my code is here:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
# Import data
df = import_data("product_data")
# Define dataframe to only include relevant columns
df = df.loc[:,['text','TV','Internet','Mobil','Fastnet']]
# Define dataframe with labels
df_labels = df.loc[:,['TV','Internet','Mobil','Fastnet']]
# Sum the number of labels per text
sum_column = df["TV"] + df["Internet"] + df["Mobil"] + df["Fastnet"]
df["label_sum"] = sum_column
# Remove texts with no labels
df.drop(df[df['label_sum'] == 0].index, inplace = True)
# Split dataset
train, test = train_test_split(df, random_state=42, test_size=0.2, shuffle=True)
X_train = train.text
X_test = test.text
categories = ['TV','Internet','Mobil','Fastnet']
# Model
LogReg_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='word', max_df=0.20)),
    ('clf', LogisticRegression(solver='lbfgs', multi_class='ovr', class_weight='balanced', n_jobs=-1)),
])
for category in categories:
    print('... Processing {}'.format(category))
    LogReg_pipeline.fit(X_train, train[category])
    prediction = LogReg_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))
https://www.transfernow.net/dl/20210921NbWDt3eo
Code Analysis
The scikit-learn LogisticRegression classifier using OVR (one-vs-rest) can only predict a single output/label at a time. Since you are training the model in the pipeline on multiple labels one at a time, you will produce one trained model per label. The algorithm itself will be the same for all models, but you would have trained them differently.
Multi-Output Regressor
Multi-output regressors can accept multiple independent labels and generate one prediction for each target.
The output should be the same as what you have, but you only need to maintain a single model and train it once.
To use this approach, wrap your LR model in a MultiOutputRegressor.
Here is a good tutorial on multi-output regression models.
from sklearn.multioutput import MultiOutputRegressor

model = LogisticRegression(solver='lbfgs', multi_class='ovr', class_weight='balanced', n_jobs=-1)
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='word', max_df=0.20)),
    ('clf', MultiOutputRegressor(model))])

# Fit on the training split's text and its label columns so X and Y have
# matching rows, then predict all four labels at once for the test split.
preds = pipeline.fit(X_train, train[categories]).predict(X_test)
df_preds = combine_data(X=X_test, Y=preds, y_cols=categories)
combine_data() merges all data into a single DataFrame for convenience:
import pandas as pd

def combine_data(X, Y, y_cols):
    """ X is a DataFrame (or Series), Y is a np array, y_cols is a list of column names """
    df_out = pd.DataFrame(Y, columns=y_cols)
    df_out.index = X.index
    return pd.concat([X, df_out], axis=1).sort_index()
Multinomial Logistic Regression
To use a LogisticRegression classifier on all labels at once, set multi_class='multinomial'.
The softmax function is used to find the predicted probability of a sample belonging to a class.
You'll need to reverse the one-hot encoding on the label to get back the categorical variable (answer here). If you have the original label before one-hot encoding, use that.
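A minimal sketch of that reversal using idxmax, assuming the df_labels frame from the question holds the one-hot columns; note this is only meaningful for rows with exactly one positive label, which is what a single-output multinomial model expects:

# Collapse the one-hot label columns back into one categorical column.
# idxmax returns the name of the column holding the (first) 1 in each row.
single_label = df_labels.idxmax(axis=1)
print(single_label.head())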
Here is a good tutorial on multinomial logistic regression.
label_col=["text_source"]
clf = LogisticRegression(multi_class='multinomial', solver='lbfgs')
model = clf.fit(df_train[input_cols], df_train[label_col])
# Generate a table of probabilities for each class
probs = model.predict_proba(X_test)
df_probs = combine_data(X=X_test, Y=probs, y_cols=label_col)
# Predict the class for a sample, i.e. the one with the highest probability
preds = model.predict(X_test)
df_preds = combine_data(X=X_test, Y=preds, y_cols=label_col)