Feature importance using GridSearchCV for logistic regression - Python

I've trained a logistic regression model like this:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
reg = LogisticRegression(random_state=40)
cvreg = GridSearchCV(reg, param_grid={'C': [0.05, 0.1, 0.5],
                                      'penalty': ['none', 'l1', 'l2'],
                                      'solver': ['saga']},
                     cv=5)
cvreg.fit(X_train, y_train)
Now, to show the feature importances, I've tried the code below, but the coefficient names don't appear in the plot:
from matplotlib import pyplot
importance = cvreg.best_estimator_.coef_[0]
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()
Obviously, the plot isn't very informative. How do I add the names of the coefficients to the x-axis?
The coefficients are:
cvreg.best_estimator_.coef_
array([[1.10303023e+00, 7.48816905e-01, 4.27705027e-04, 6.01404570e-01]])

The coefficients correspond to the columns of X_train, so pass in the X_train names instead of range(len(importance)).
Assuming X_train is a pandas dataframe:
import matplotlib.pyplot as plt
features = X_train.columns
importance = cvreg.best_estimator_.coef_[0]
plt.bar(features, importance)
plt.show()
Note that if X_train is just a numpy array without column names, you will have to define the features list based on your own data dictionary.
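If X_train is a numpy array, a minimal sketch could look like this (the feature names below are placeholders, not from the original question):
# Hypothetical feature names -- replace with the actual column order of X_train
features = ['feat_1', 'feat_2', 'feat_3', 'feat_4']
importance = cvreg.best_estimator_.coef_[0]
plt.bar(features, importance)
plt.show()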

Related

How to correctly plot regression output with the right datetime index on the x-axis in matplotlib?

I have air pollution time series data for which I need to make a forward-period estimation. To do so, I used scikit-learn's RandomForestRegressor to make the predictions, and I want to visualize the output, but I'm having trouble getting the x-axis to show the right time index. Can anyone suggest how I can get a better visualization for my regression approach below? Is there a better way to make this happen?
my attempt
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
url = "https://gist.githubusercontent.com/jerry-shad/36912907ba8660e11cd27be0d3e30639/raw/424f0891dc46d96cd5f867f3d2697777ac984f68/pollution.csv"
df = pd.read_csv(url, parse_dates=['date'])
df.date = pd.DatetimeIndex(df.date)
# df.sort_values(by='date').reset_index(drop=True)
df.drop(columns=['Unnamed: 0'], inplace=True)
resultsDict={}
predictionsDict={}
split_date ='2017-12-01'
df_training = df.loc[df.date <= split_date]
df_test = df.loc[df.date > split_date]
## exclude pollution_index columns from training and testing data
df_tr = df_training.drop(['pollution_index'],axis=1)
df_te = df_test.drop(['pollution_index'],axis=1)
## scaling features
scaler = StandardScaler()
scaler.fit(df_tr)
X_train = scaler.transform(df_tr)
y_train = df_training['pollution_index']
X_test = scaler.transform(df_te)
y_test = df_test['pollution_index']
X_train_df = pd.DataFrame(X_train,columns=df_tr.columns)
X_test_df = pd.DataFrame(X_test,columns=df_te.columns)
reg = RandomForestRegressor(max_depth=2, random_state=0)
reg.fit(X_train, y_train)
yhat = reg.predict(X_test)
resultsDict['Randomforest'] = evaluate(df_test['pollution_index'], yhat)  ## evaluate() is a user-defined metric helper
predictionsDict['Randomforest'] = yhat
## print out prediction from RandomForest
print(predictionsDict['Randomforest'])
plt.plot(df_test['pollution_index'].values , label='Original')
plt.plot(yhat,color='red',label='predicted')
plt.legend()
output of current attempt
[Plot from the above attempt omitted; the x-axis shows integer positions instead of dates.]
In this attempt, I ran the regression with RandomForestRegressor and tried to make a simple plot, but the plot doesn't show time on the x-axis. Why? Does anyone know how to make this right? Any thoughts? Thanks
desired plot
Ideally, after training the model, I want to make a forward-period estimation, and this is the kind of plot I want to produce from my attempt above (desired plot image omitted).
Can anyone suggest the right way to visualize the regression output? Any thoughts?
You will need to provide the dates explicitly to matplotlib.pyplot.plot().
plt.plot(df_test['date'], df_test['pollution_index'].values, label='Original')
plt.plot(df_test['date'], yhat, color='red', label='predicted')
You can also use the matplotlib-based plotting function from pandas:
df_test['yhat'] = yhat
df_test.plot(x='date',y=['pollution_index','yhat'])
It automatically adds axis labels and a legend.
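If the date tick labels overlap, matplotlib can rotate them for you; a small sketch, assuming the same df_test and yhat as above:
fig, ax = plt.subplots()
ax.plot(df_test['date'], df_test['pollution_index'].values, label='Original')
ax.plot(df_test['date'], yhat, color='red', label='predicted')
ax.legend()
fig.autofmt_xdate()  # rotates and right-aligns the date tick labels
plt.show()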

After training a Linear Regression model using scikit-learn, how to make predictions for new data points that are not in the original data set?

I am learning linear regression and wrote the code below using scikit-learn. After making the in-sample predictions, how do I make a prediction for a new data point that is not in my original data set?
In this data set you are given the salaries of people according to their work experience.
For example, the predicted salary for a person with 15 years of work experience should be [167005.32889087].
[Image of the data set omitted.]
Here is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
data = pd.read_csv('project_1_dataset.csv')
X = data.iloc[:,0].values.reshape(-1,1)
Y = data.iloc[:,1].values.reshape(-1,1)
linear_regressor = LinearRegression()
linear_regressor.fit(X,Y)
Y_pred = linear_regressor.predict(X)
plt.scatter(X,Y)
plt.plot(X, Y_pred, color = 'red')
plt.show()
After fitting and training your model with your existing dataset (i.e. after linear_regressor.fit(X,Y)), you can make predictions on new instances in the same way:
new_prediction = linear_regressor.predict(new_data)
print(new_prediction)
where new_data is your new data point.
If you want to make predictions for particular new data points, the above is enough. If your new data points belong to another dataframe, replace new_data with the dataframe containing the new instances to be predicted.
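For instance, to reproduce the 15-years-of-experience example from the question (note that predict() expects a 2D array, one row per sample):
new_data = np.array([[15]])  # one sample with one feature: 15 years of experience
print(linear_regressor.predict(new_data))  # should print roughly [[167005.32889087]]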

Identify feature names from a pandas dataframe

I have a piece of code as follows:
# transforming data to best 20 features
from sklearn.feature_selection import SelectKBest, chi2
import matplotlib.pyplot as plt
fs = SelectKBest(score_func=chi2, k=20)
fs.fit(X_train, y_train)
X_train = fs.transform(X_train)
X_test = fs.transform(X_test)
# what are scores for the features
for i in range(len(fs.scores_)):
    print('Feature %d: %f' % (i, fs.scores_[i]))
# plot the scores
plt.bar([i for i in range(len(fs.scores_))], fs.scores_)
plt.show()
The plot gives me the output indicated in this picture (omitted). I'm wondering how I can identify the actual feature names of these features instead of "1-20"? I tried get_support(), but it gives an error since my data is in array format (I used train_test_split).
The scores are in the same order as the columns of the X_train array, so to get your feature names you should extract them before converting X_train to a numpy array. If you are using a pandas dataframe named df, what you can do is:
for i in range(len(fs.scores_)):
    print(f'Feature {df.columns[i]}: {fs.scores_[i]}')
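To list only the k=20 selected features rather than the scores for every column, get_support() also works once the original column names are available; a sketch, assuming df holds the feature columns that went into train_test_split:
mask = fs.get_support()      # boolean mask over the original columns
selected = df.columns[mask]  # names of the 20 selected features
print(selected)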

How to use scikit learn inverse_transform with new values

I have a set of data on which I have used scikit-learn's PCA. I scaled the data with StandardScaler() before performing PCA.
variance_to_retain = 0.99
np_scaled = StandardScaler().fit_transform(df_data)
pca = PCA(n_components=variance_to_retain)
np_pca = pca.fit_transform(np_scaled)
# make dataframe of scaled data
# put column names on scaled data for use later
df_scaled = pd.DataFrame(np_scaled, columns=df_data.columns)
num_components = len(pca.explained_variance_ratio_)
cum_variance_explained = np.cumsum(pca.explained_variance_ratio_)
eigenvalues = pca.explained_variance_
eigenvectors = pca.components_
I then ran K-Means clustering on the scaled dataset. I can plot the cluster centers just fine in scaled space.
My question is: how do I transform the locations of the centers back into the original data space? I know that StandardScaler.fit_transform() makes the data have zero mean and unit variance. But with the new points of shape (num_clusters, num_features), can I use inverse_transform(centers) to get the centers transformed back into the range and offset of the original data?
Thanks, David
You can get cluster_centers_ from a fitted KMeans and just push that into your pca.inverse_transform.
Here's an example:
import numpy as np
from sklearn import decomposition
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
iris = datasets.load_iris()
X = iris.data
y = iris.target
scal = StandardScaler()
X_t = scal.fit_transform(X)
pca = decomposition.PCA(n_components=3)
pca.fit(X_t)
X_t = pca.transform(X_t)
clf = KMeans(n_clusters=3)
clf.fit(X_t)
scal.inverse_transform(pca.inverse_transform(clf.cluster_centers_))
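The last line gives the three cluster centers back in the original (unscaled) iris feature units, i.e. an array of shape (3, 4).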
Note that sklearn has multiple ways to do the fit/transform. You can do StandardScaler().fit_transform(X), but then you lose the scaler object, so you can't reuse it, nor can you use it to invert the transformation.
Alternatively, you can do scal = StandardScaler() followed by scal.fit(X) and then scal.transform(X),
OR you can do scal.fit_transform(X), which combines the fit/transform step.
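A minimal sketch of why keeping the fitted scaler matters (reusing the iris X from the example above):
scal = StandardScaler()
X_t = scal.fit_transform(X)           # the fitted scaler stays available in scal
X_back = scal.inverse_transform(X_t)  # so the scaling can be undone later
print(np.allclose(X, X_back))         # True, up to floating-point error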
Here I am using SVR to fit the data. Before that, I scale the values, and to get the prediction in the original units I use the inverse_transform function:
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
#Creating two scaler objects for the independent and dependent variables
ss_X = StandardScaler()
ss_y = StandardScaler()
X = ss_X.fit_transform(X)
y = ss_y.fit_transform(y.reshape(-1,1)).ravel()
#Creating a model object and fitting the data
reg = SVR(kernel='rbf')
reg.fit(X,y)
#To make a prediction:
#first transform the input to the scaled space,
#then inverse transform the result to see the original value
#(recent scikit-learn versions expect a 2D array in inverse_transform)
scaled_pred = reg.predict(ss_X.transform(np.array([[6.5]])))
ss_y.inverse_transform(scaled_pred.reshape(-1,1))

mlextend plot_decision_regions with model fit on Pandas DataFrame?

I'm a big fan of mlxtend's plot_decision_regions function, (http://rasbt.github.io/mlxtend/#examples , https://stackoverflow.com/a/43298736/1870832)
It accepts an X(just two columns at a time), y, and (fitted) classifier clf object, and then provides a pretty awesome visualization of the relationship between model predictions, true y-values, and a pair of independent variables.
A couple restrictions:
X and y have to be numpy arrays, and clf needs to have a predict() method. Fair enough. My problem is that in my case, the classifier clf object I would like to visualize has already been fitted on a Pandas DataFrame...
import numpy as np
import pandas as pd
import xgboost as xgb
import matplotlib
matplotlib.use('Agg')
from mlxtend.plotting import plot_decision_regions
import matplotlib.pyplot as plt
# Create arbitrary dataset for example
df = pd.DataFrame({'Planned_End': np.random.uniform(low=-5, high=5, size=50),
                   'Actual_End': np.random.uniform(low=-1, high=1, size=50),
                   # np.random.random_integers is removed in recent NumPy; randint's high is exclusive
                   'Late': np.random.randint(low=0, high=3, size=50)}
                  )
# Fit a Classifier to the data
# This classifier is fit on the data as a Pandas DataFrame
X = df[['Planned_End', 'Actual_End']]
y = df['Late']
clf = xgb.XGBClassifier()
clf.fit(X, y)
So now when I try to use plot_decision_regions passing X/y as numpy arrays...
# Plot Decision Region using mlxtend's awesome plotting function
plot_decision_regions(X=X.values,
                      y=y.values,
                      clf=clf,
                      legend=2)
I (understandably) get an error that the model can't find the column names of the dataset it was trained on:
ValueError: feature_names mismatch: ['Planned_End', 'Actual_End'] ['f0', 'f1']
expected Planned_End, Actual_End in input data
training data did not have the following fields: f1, f0
In my actual case, it would be a big deal to avoid training our model on Pandas DataFrames. Is there a way to still produce decision_regions plots for a classifier trained on a Pandas DataFrame?
Try to change:
X = df[['Planned_End', 'Actual_End']].values
y = df['Late'].values
and proceed to:
clf = xgb.XGBClassifier()
clf.fit(X, y)
plot_decision_regions(X=X,
                      y=y,
                      clf=clf,
                      legend=2)
OR fit & plot using X.values and y.values
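If retraining on numpy arrays is not an option, one possible workaround (a sketch, not an mlxtend feature) is a thin wrapper whose predict() rebuilds the DataFrame with the original column names before delegating to the fitted classifier, since plot_decision_regions only requires a predict() method:
# Hypothetical wrapper around a classifier fitted on a DataFrame
class DataFrameClassifierWrapper:
    def __init__(self, clf, columns):
        self.clf = clf
        self.columns = columns
    def predict(self, X_array):
        # restore the column names the model was trained with
        return self.clf.predict(pd.DataFrame(X_array, columns=self.columns))

wrapped = DataFrameClassifierWrapper(clf, X.columns)
plot_decision_regions(X=X.values, y=y.values, clf=wrapped, legend=2)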
