Python - Scikit-learn: find variable importance for categorical variables

I'm trying to use scikit-learn in Python for a couple of different classification problems (RF, GBM, etc.). In addition to building models and making predictions, I'd like to see variable importance. I know there is a way to get the importances:
importances = clf.feature_importances_
print(importances)
but how do I get something more refined that ties each importance to its variable name (like summary(gbm) or varImp(randomForest) in R), especially if it's a categorical variable with multiple levels?

The variable importance (or feature importance) is calculated for all the features that you are fitting your model to. This pseudo code gives you an idea of how variable names and importance can be related:
import pandas as pd
train = pd.read_csv("train.csv")
cols = ['hour', 'season', 'holiday', 'workingday', 'weather', 'temp', 'windspeed']
clf = YourClassifiers()  # e.g. RandomForestClassifier(), GradientBoostingClassifier(), ...
clf.fit(train[cols], train.targets)  # targets/labels
print(len(clf.feature_importances_))
print(len(cols))
You will see that the lengths of the two lists being printed are the same - you can essentially map the lists together or manipulate them how you wish. If you'd like to show variable importance nicely in a plot, you could use this:
import numpy as np
import matplotlib.pyplot as plt
plt.figure(figsize=(6 * 1.618, 6))
index = np.arange(len(cols))
plt.bar(index, clf.feature_importances_, color='black', alpha=0.5)
plt.xlabel('features')
plt.ylabel('importance')
plt.title('Feature importance')
plt.xticks(index, cols, rotation=45)  # label each bar with its feature name
plt.tight_layout()
plt.show()
If you are fitting all columns rather than just the few selected in the cols variable, you can get the column/feature/variable names of your data with train.columns.values and then pair that list with the importance list in the same way.
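As for the categorical part of the question: scikit-learn reports one importance per input column, so if a categorical variable was one-hot encoded (e.g. with pd.get_dummies), each level gets its own importance. A common convention - not something scikit-learn does for you - is to sum the importances of a variable's dummy columns back into a single number. A minimal sketch, where train, cols, cat_cols and clf are placeholders for your own data, column lists and model:
import pandas as pd

# cat_cols lists the categorical columns inside cols; both are placeholders here.
X = pd.get_dummies(train[cols], columns=cat_cols)
clf.fit(X, train.targets)

# Importance per (dummy) column, keyed by column name
imp = pd.Series(clf.feature_importances_, index=X.columns)

# Collapse dummy-level importances back to the original categorical variable
def original_name(col):
    for c in cat_cols:
        if col.startswith(c + "_"):
            return c
    return col

print(imp.groupby(original_name).sum().sort_values(ascending=False))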

Related

Should features that correlate be deleted from ML models?

I've seen that it's common practice to delete input features that demonstrate collinearity (and keep only one of them).
However, I've just completed a course explaining that a linear regression model assigns different weights to different features, and I wondered whether the model might do better than us by giving a low weight to less useful features instead of deleting them completely.
To try to resolve this doubt myself, I've created a small dataset resembling an x-squared function and fitted two linear regression models using Python:
A model that keeps only the x_squared feature
A model that keeps both the x and x_squared features
The results suggest that we shouldn't delete features and should instead let the model decide the best weights. However, I would like to ask the community whether the rationale of my exercise is sound, and whether you've come across this question elsewhere.
Here's my code to generate the dataset:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Generate the data
all_Y = [10, 3, 1.5, 0.5, 1, 5, 8]
all_X = range(-3, 4)
all_X_2 = np.square(all_X)
# Store the data into a dictionary
data_dic = {"x": all_X, "x_2": all_X_2, "y": all_Y}
# Generate a dataframe
df = pd.DataFrame(data=data_dic)
# Display the dataframe
display(df)
which produces a seven-row dataframe with columns x, x_2 and y,
and this is the code to generate the ML models:
# Create the lists to iterate over
ids = [1, 2]
features = [["x_2"], ["x", "x_2"]]
titles = ["$x^{2}$", "$x$ and $x^{2}$"]
colors = ["blue", "green"]
# Initiate figure
fig = plt.figure(figsize=(15,5))
# Iterate over the necessary lists to plot results
for i, model, title, color in zip(ids, features, titles, colors):
    # Initiate model, fit and make predictions
    lr = LinearRegression()
    lr.fit(df[model], df["y"])
    predicted = lr.predict(df[model])
    # Calculate mean squared error of the model
    mse = mean_squared_error(all_Y, predicted)
    # Create a subplot for each model
    plt.subplot(1, 2, i)
    plt.plot(df["x"], predicted, c=color, label="f(" + title + ")")
    plt.scatter(df["x"], df["y"], c="red", label="y")
    plt.title("Linear regression using " + title + " --- MSE: " + str(round(mse, 3)))
    plt.legend()
# Display results
plt.show()
which generates the two regression plots; the model using both x and x² achieves the lower MSE.
What do you think about this? A difference in mean squared error like this can matter a great deal in certain contexts.
Because x and x² are not linearly related, deleting one of them does not help the model. The general guidance for regression is to delete features that are highly collinear (i.e. highly correlated) with each other.
So x² and y are highly correlated and you are trying to predict y with x²? A high correlation between a predictor variable and the response variable is usually a good thing - and since x and y are practically uncorrelated, including x is likely to "dilute" your model and give worse performance.
(Multi-)collinearity between the predictor variables themselves would be more problematic.
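To make that concrete, here is a quick check (my own sketch, not part of the original answers) of the pairwise Pearson correlations on the question's dataset:
import numpy as np

# x and y are nearly uncorrelated, while x^2 and y are strongly correlated.
x = np.arange(-3, 4)
x2 = x ** 2
y = np.array([10, 3, 1.5, 0.5, 1, 5, 8])

print(np.corrcoef(x, y)[0, 1])    # small in magnitude
print(np.corrcoef(x2, y)[0, 1])   # close to 1
print(np.corrcoef(x, x2)[0, 1])   # x and x^2 are uncorrelated on this symmetric grid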

Fixed effect regression model in Python

I have a book dataset. I want to make a fixed effect regression model.
I want fixed effects for year, month, day and book_genre in my model, so that I take out the effects of the same books appearing in multiple observations. I want to do this in Python. My variables are:
Variables I want fixed effects for: year, month, day and book_genre.
Other variables in the model: Read_or_not (categorical); ne_factor, x1, x2, x3, x4, x5 (numerical).
Response variable: Y
I used this code but I get the error "DataFrame input must have a MultiIndex with 2 levels".
I would highly appreciate any help on how I can fix my code to fit a fixed effects regression.
I also attach a png of the dataset to show the variables:
import pandas as pd
import numpy as np
from linearmodels import PanelOLS

df = pd.read_csv('all_a.csv')

# Set the index for fixed effects
data = df.set_index(['year', 'month', 'day', 'book_genre'])
data = df.dropna(subset=['book_id', 'year', 'month', 'day', 'Read_or_not', 'ne_factor',
                         'Y', 'book_genre', 'X1', 'X2', 'X3', 'X4', 'X5'])

# Regression
FE = PanelOLS(data.attention_data_score, data['Y'],
              entity_effects=True,
              time_effects=True)

# Result
result = FE.fit(cov_type='clustered',
                cluster_entity=True,
                cluster_time=True)
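For context on the error message: PanelOLS in linearmodels expects the data to carry exactly a two-level MultiIndex of (entity, time). A minimal sketch of that shape - hypothetical, using book_id as the entity and an assembled date as the time dimension, not a drop-in fix for the code above:
import pandas as pd
from linearmodels import PanelOLS

df = pd.read_csv('all_a.csv')
# Build a single time column from year/month/day, then a 2-level (entity, time) index
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
panel = df.set_index(['book_id', 'date'])

# entity_effects absorbs book-level fixed effects, time_effects absorbs date-level
# fixed effects; other fixed effects (e.g. book_genre) would have to enter the
# exogenous matrix as dummy variables. Column names x1..x5 are assumed from the question.
mod = PanelOLS(panel['Y'], panel[['x1', 'x2', 'x3', 'x4', 'x5']],
               entity_effects=True, time_effects=True)
res = mod.fit(cov_type='clustered', cluster_entity=True)
print(res)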

How can I drop low-correlated features

I am writing preprocessing code for my LSTM training. My csv contains more than 30 variables. After applying some EDA techniques, I found that about half of the features can be dropped without any effect on training.
Right now I am dropping such features manually using pandas.
I want to write code that can drop such features automatically.
I wrote the following code to visualize the heat map and correlations:
# I am making a class, so this part is from preprocessing.
# self.data is a DataFrame which contains all the csv data
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

def calculateCorrelationByPearson(self):
    columns = self.data.columns
    plt.figure(figsize=(12, 8))
    sns.heatmap(data=self.data.corr(method='pearson'), annot=True, fmt='.2f',
                linewidths=0.5, cmap='Blues')
    plt.show()
    for column in columns:
        corr = stats.spearmanr(self.data['total'], self.data[column])
        print(f'{column} - corr coefficient:{corr[0]}, p-value:{corr[1]}')
This gives me a good view of my features and their relationships with each other.
Now I want to drop columns which are not important.
Let's say correlation less than 0.4.
How can I apply this logic in to my code?
Here is an approach to remove variables with a correlation coef value below some threshold:
import pandas as pd
from scipy.stats import spearmanr
data = pd.DataFrame([{"A":1, "B":2, "C":3},{"A":2, "B":3, "C":1},{"A":3, "B":4, "C":0},{"A":4, "B":4, "C":1},{"A":5, "B":6, "C":2}])
targetVar = "A"
corr_threshold = 0.4
corr = spearmanr(data)
corrSeries = pd.Series(corr[0][:,0], index=data.columns) #Series with column names and their correlation coefficients
corrSeries = corrSeries[(corrSeries.index != targetVar) & (corrSeries > corr_threshold)] #apply the threshold
vars_to_keep = list(corrSeries.index.values) #list of variables to keep
vars_to_keep.append(targetVar) #add the target variable back in
data2 = data[vars_to_keep]
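The same thresholding can also be written with pandas alone, which may slot into the asker's class more directly. A sketch, assuming a DataFrame data and the target column name 'total' from the question:
import pandas as pd

def drop_low_correlated(data: pd.DataFrame, target: str = "total",
                        threshold: float = 0.4) -> pd.DataFrame:
    # Spearman correlation of every column with the target
    corr_with_target = data.corr(method="spearman")[target]
    # Keep only the columns whose absolute correlation clears the threshold
    keep = corr_with_target[corr_with_target.abs() >= threshold].index
    return data[keep]

# Usage inside the asker's class would be roughly:
# self.data = drop_low_correlated(self.data, target="total", threshold=0.4)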

Scatterplot: different colour & annotation for each observation depending on the untransformed dataset (PCA, matplotlib, Python)

I'm implementing a PCA on the following data (provided in code). I choose 2 components, which gives me output of the form [x1, y1], [x2, y2], etc.
I then want to plot these two PCs (a), but colour-coded according to the letter in the untransformed data (data), i.e. observation [x1, y1] is originally labelled "A", so I want it to be a different colour from the observations labelled "B" and "C". I think a dictionary is appropriate, but I'm not sure how to link the original dataset to the new PCA variables.
I also want to annotate these points (from a) with the names in the original set (data), i.e. [x1, y1] would be annotated with "John".
Any help is greatly appreciated.
# load packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
# load data
data = np.array([["John","A",1,2,1,3,4,6],
["Julie","A",3,1,2,2,2,4],
["James","B",2,4,1,1,2,5],
["Jemma","C",3,5,1,2,3,2],
["Jet","B",1,3,2,1,1,3],
["Jane","A",2,4,2,1,3,4]])
# feature array & scale
y = data[:, [2, 3, 4, 5, 6, 7]].astype(float)  # the array holds strings, so cast the numeric columns
z = scale(y)
# PCA
pca = PCA(n_components=6)
pca.fit(z)
# scree plot
var = pca.explained_variance_ratio_
var1 = np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)
#print(var1)
#plt.plot(var1)
#plt.show()
# PCA w/ 2 components
pca = PCA(n_components=2)
pca.fit(z)
a = pca.fit_transform(z)
# colour map **HELP**
#colours = {"A":"red", "B":"green", "C":"blue"}
# annotation **HELP**
# scatter plot
plt.scatter(a[:,0],a[:,1])
plt.show()
EDIT:
colour problem SOLVED
annotation problem NEED HELP:
names = [rows[0] for rows in data]
plt.scatter(a[:,0], a[:,1], c=point_colours)
plt.annotate(names, (a[:,0], a[:,1]))
same problem when coding as:
for i in names:
    plt.annotate(names, (a[:,0], a[:,1]))
Although print(names) outputs the names I want to annotate, nothing shows up on the plot. I have tried using both names and str(names) in the annotate parameters but keep getting
TypeError: only length-1 arrays can be converted to Python scalars
and then the graph is output without labels.
any ideas?
Something like:
point_colors = [colours[row[1]] for row in data]
plt.scatter(a[:,0], a[:,1], c=point_colors)
This creates a list with the colour for each point.
For the annotation:
for i, row in enumerate(data):
    xy = (a[:, 0][i], a[:, 1][i])
    name = row[0]
    plt.annotate(name, xy=xy)
You should offset xy slightly so the text doesn't overlap the point.
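One way to do that offset (my own suggestion, using matplotlib's offset-points coordinates rather than anything from the answer above), reusing data, a and point_colors defined earlier:
import matplotlib.pyplot as plt

# Annotate each point with its name, shifted a few points away from the marker
names = [row[0] for row in data]
plt.scatter(a[:, 0], a[:, 1], c=point_colors)
for name, x, y in zip(names, a[:, 0], a[:, 1]):
    plt.annotate(name, xy=(x, y), xytext=(5, 5), textcoords='offset points')
plt.show()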

Mapping column names to random forest feature importances

I am trying to plot feature importances for a random forest model and map each importance back to the original variable. I've managed to create a plot that shows the importances and uses the original variable names as labels, but right now it lists the variable names in the order they appear in the dataset (not by order of importance). How do I order them by feature importance? Thanks!
My code is:
importances = brf.feature_importances_
std = np.std([tree.feature_importances_ for tree in brf.estimators_], axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")
for f in range(x_dummies.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure(figsize=(8, 8))
plt.title("Feature importances")
plt.bar(range(x_train.shape[1]), importances[indices],
        color="r", yerr=std[indices], align="center")
feature_names = x_dummies.columns
plt.xticks(range(x_dummies.shape[1]), feature_names)
plt.xticks(rotation=90)
plt.xlim([-1, x_dummies.shape[1]])
plt.show()
A sort of generic solution would be to throw the features/importances into a dataframe and sort them before plotting:
import pandas as pd
%matplotlib inline
# do code to support model
# "data" is the X dataframe and model is the SKlearn object

feats = {}  # a dict to hold feature_name: feature_importance
for feature, importance in zip(data.columns, model.feature_importances_):
    feats[feature] = importance  # add the name/value pair

importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
importances.sort_values(by='Gini-importance').plot(kind='bar', rot=45)
It's simple; I plotted it like this:
feat_importances = pd.Series(extraTree.feature_importances_, index=X.columns)
feat_importances.nlargest(15).plot(kind='barh')
plt.title("Top 15 important features")
plt.show()
I use a similar solution to Sam:
import pandas as pd
important_features = pd.Series(data=brf.feature_importances_,index=x_dummies.columns)
important_features.sort_values(ascending=False,inplace=True)
I usually just print the list with print(important_features), but to plot it you could always use Series.plot.
Another simple way to get a sorted list
importances = list(zip(xgb_classifier.feature_importances_, df.columns))
importances.sort(reverse=True)
The following line adds a visualization if it's needed:
pd.DataFrame(importances, index=[x for (_,x) in importances]).plot(kind = 'bar')
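If you also want the per-tree standard deviations from the question to stay aligned with the sorted bars, a small sketch reusing brf and x_dummies from the question (variable names assumed from there) could look like this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Sort the importances and keep the per-tree standard deviations in the same order
importances = pd.Series(brf.feature_importances_, index=x_dummies.columns)
std = pd.Series(np.std([tree.feature_importances_ for tree in brf.estimators_], axis=0),
                index=x_dummies.columns)

order = importances.sort_values(ascending=False).index
importances[order].plot(kind='bar', yerr=std[order], color='r', figsize=(8, 8))
plt.title("Feature importances (sorted)")
plt.tight_layout()
plt.show()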
