How to choose the Chi Squared threshold in feature selection

How to choose the Chi Squared threshold in feature selection - python

About this:
NLP in Python: Obtain word names from SelectKBest after vectorizing
I found this code:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2
THRESHOLD_CHI = 5 # or whatever you like. You may try with
# for threshold_chi in [1,2,3,4,5,6,7,8,9,10] if you prefer
# and measure the f1 scores
X = df['text']
y = df['labels']
cv = CountVectorizer()
cv_sparse_matrix = cv.fit_transform(X)
cv_dense_matrix = cv_sparse_matrix.todense()
chi2_stat, pval = chi2(cv_dense_matrix, y)
chi2_reshaped = chi2_stat.reshape(1,-1)
which_ones_to_keep = chi2_reshaped > THRESHOLD_CHI
which_ones_to_keep = np.repeat(which_ones_to_keep ,axis=0,repeats=which_ones_to_keep.shape[1])
This code computes the chi squared test and should keep the best features within a chosen threshold.
My question is how to choose a theshold for the chi squared test scores?

Chi square does not have a specific range of outcome, so it's hard to determine a threshold beforehand. Usually what you can do is to sort the variables depending on their p values, the logic is that lower p values are better, because they imply a higher correlation between features and the target variable (we want to discard features that are independent, i.e. not predictors of the target variable). In this case you have anyway to decide how many features to keep, and that is a hyper parameter that you can tune manually or even better by using a grid search.
Be aware that you can avoid to perform the selection manually, sklearn implement already a function SelectKBest to select the best k features based on chi square, you can use it as follow:
from sklearn.feature_selection import SelectKBest, chi2
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
But if for any reason you want to rely solely on the raw chi2 value, you could calculate the minimum and maximum values among the variables, and then divide the interval in n steps to test trough a grid search.

Related

How can I find the respective P-values for a multiple linear regression using the linear model from sklearn?

So, I'm trying to develop a ml model for multiple linear regression that predicts the Y given n number of X variables. So far, my model can read in a data set and give the predicted value with a coefficient of determination as well as the respective coefficients for a 1-unit increase in X. The only issues are:
I can't get the p-value for the life of me, it says most of the time the data isn't shaped right due to it being 5 columns and 1329 rows. When I do get an output, they're just incorrect, I know because I did the regression in analysis toolpak in excel.
Is there a way to make the model recursive so that it recognizes the highest pvalue above .05 and calls itself again without said value until it hits the base case. Which would be something like
While dependent_v[pvalue] > .05:
Also what would be the best visualization method to show my data?
Thank you for any and all that help, I'm just starting to delve into machine learning on my own and want to hone my skills before an upcoming data science internship in the summer.
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import linear_model
def multipleReg():
dfreg = pd.read_csv("dfreg.csv")
#Setting dependent variables
dependent_v = ['Large_size', 'Mid_Level', 'Senior_Level', 'Exec_Level', 'Company_Location']
#Setting independent variable
independent_v = 'Salary_In_USD'
X = dfreg[dependent_v] #Drawing dependent variables from dataframe
y = dfreg[independent_v] #Drawing independent variable from dataframe
reg = linear_model.LinearRegression() #Initializing regression model
reg.fit(X.values,y) #Fitting appropriate values
predicted_sal = reg.predict([[1,0,1,0,0]]) #Prediction using 2 dimensions in array
percent_rscore = (reg.score(X.values,y)*100) #Model coefficient of determination
print('\n')
print("The predicted salary is:", predicted_sal)
print("The Coefficient of deterimination is:", "{:,.2f}%".format(percent_rscore))
#Printing coefficents of dependent variables(How much Y increases due to 1
#unit increase in X)
print("The corresponding coefficients for the dependent variables are:", reg.coef_)

As far as i know sklearn doesn't return p values, is better using the statsmodels library.
But if you need to use sklearn anyway, you can find various solutions here:
Find p-value (significance) in scikit-learn LinearRegression

How to give more importance to some features in sklearn Isolation Forest

I am using sklearn isolation forest for an anomaly detection task. Isolation forest consists of iTrees. As this paper describes, the nodes of the iTrees are split in the following way:
We select any feature (uniformly) randomly and perform a split on a random value of that feature.
But I want to give more weight to some features than the others. So instead of selecting the features with equal probability, I want to draw some features with a higher probability (giving more weight to those features) and other features with a lower probability.
How can I do that? From the source code it seems I have to change the function _generate_bagging_indices in _bagging.py, but not sure.

You can achieve this without changing the source code. Instead, you can tweak your input data by duplicating the features you wish to increase the weight for. If you have a feature appearing twice, the trees will use it twice to split your data, which in practice will mean the same as having doubled the weight of the feature.
In addition to this, you can also choose to reduce the amount of features used by your isolation forest in each tree. This is controlled by the argument max_features. The default value of 1.0 ensures that every feature will be used for each tree. By reducing it, more trees will be trained without the less frequent features in your input.
Illustration
Load Data
from sklearn.ensemble import IsolationForest
import pandas as pd
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
data = load_iris()
X = data.data
df = pd.DataFrame(X, columns=data.feature_names)
Default settings
IF = IsolationForest()
IF.fit(df)
preds = IF.predict(df)
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=preds)
plt.title("Default settings")
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.show()
Weighted Settings
df1 = df.copy()
weight_feature = 10
for i in range(weight_feature):
df1["duplicated_" + str(i)] = df1["sepal length (cm)"]
IF1 = IsolationForest(max_features=0.3)
IF1.fit(df1)
preds1 = IF1.predict(df1)
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=preds1)
plt.title("Weighted settings")
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.show()
As you can see visually, the second option has used the X-axis more intensively to determine which are the outliers.

Shap statistics

I used shap to determine the feature importance for multiple regression with correlated features.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
import shap
boston = load_boston()
regr = pd.DataFrame(boston.data)
regr.columns = boston.feature_names
regr['MEDV'] = boston.target
X = regr.drop('MEDV', axis = 1)
Y = regr['MEDV']
fit = LinearRegression().fit(X, Y)
explainer = shap.LinearExplainer(fit, X, feature_dependence = 'independent')
# I used 'independent' because the result is consistent with the ordinary
# shapely values where `correlated' is not
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X, plot_type = 'bar')
shap offers a chart to get the shap values. Is there also a statistic available? I am interested in the exact shap values. I read the Github repository and the documentation but I found nothing regarding this topic.

When we look at shap_values we see that it contains some positive and negative numbers, and its dimensions equal the dimensions of boston dataset. Linear regression is a ML algorithm, which calculates optimal y = wx + b, where y is MEDV, x is feature vector and w is a vector of weights. In my opinion, shap_values stores wx - a matrix with the value of the each feauture multiplyed by the vector of weights calclulated by linear regression.
So to calculate wanted statistics, I first extracted absolute values and then averaged over them. The order is important! Next I used initial column names and sorted from biggest effect to smallest one. With this, I hope I have answered your question!:)
from matplotlib import pyplot as plt
#rataining only the size of effect
shap_values_abs = np.absolute(shap_values)
#dividing to get good numbers
means_norm = shap_values_abs.mean(axis = 0)/1e-15
#sorting values and names
idx = np.argsort(means_norm)
means = np.array(means_norm)[idx]
names = np.array(boston.feature_names)[idx]
#plotting
plt.figure(figsize=(10,10))
plt.barh(names, means)

DecisionTreeClassifier on multiple levels

I am trying to classify objects that have multiple levels. The best way I can explain it is with an example:
I can do this:
from sklearn import tree
features = ['Hip Hop','Boston'],['Metal', 'Cleveland'],['Gospel','Ohio'],['Grindcore','Agusta']]
labels = [1,0,0,0]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)
But I want to do this:
from sklearn import tree
features = ['Hip Hop','Boston',['Run DMC','Kanye West']],['Metal', 'Cleveland',['Guns n roses','Poison']],['Gospel','Ohio',['Christmania','I Dream of Jesus']],['Grindcore','Agusta', ['Pig Destroyer', 'Carcas', 'Cannibal Corpse']]
labels = [1,0,0,0]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)
clf.predict_proba(<blah blah>)
I am trying to assign a probability that a person will enjoy a band based on their location, favorite genre, and other bands they like.

You have a simple solution: just turn each band into a binary feature (you can use MultiLabelBinarizer or something similar). Your X matrix just before feeding it into a tree will look like this:
You could create such a matrix with this code:
import pandas as pd
features = [['Hip Hop','Boston',['Run DMC','Kanye West']],
['Metal', 'Cleveland',['Guns n roses','Poison']],
['Gospel','Ohio',['Christmania','I Dream of Jesus']],
['Grindcore','Agusta', ['Pig Destroyer', 'Carcas', 'Cannibal Corpse']]]
df = pd.DataFrame([{**{f[0]:1, f[1]:1}, **{k:1 for k in f[2]}} for f in features]).fillna(0)
If the number of bands is low, binary encoding will suffice. But if there are too many bands, you might want to reduce dimensionality. You can accomplish it with the following steps:
Create the user-bands count matrix, like above
(Optionally) normalize it e.g. with tf-idf
Apply a matrix decomposition algorithm to it to extract the "latent features" from the matrix.
Feed the latent features to your decision tree (or any other estimator).
If the number of bands is large, but you have too few observations, even matrix decomposition may not help much. If it is the case, the only advice is to use simpler features, e.g. replace the groups with their corresponding genres.

PCA on sklearn - how to interpret pca.components_

I ran PCA on a data frame with 10 features using this simple code:
pca = PCA()
fit = pca.fit(dfPca)
The result of pca.explained_variance_ratio_ shows:
array([ 5.01173322e-01, 2.98421951e-01, 1.00968655e-01,
4.28813755e-02, 2.46887288e-02, 1.40976609e-02,
1.24905823e-02, 3.43255532e-03, 1.84516942e-03,
4.50314168e-16])
I believe that means that the first PC explains 52% of the variance, the second component explains 29% and so on...
What I dont undestand is the output of pca.components_. If I do the following:
df = pd.DataFrame(pca.components_, columns=list(dfPca.columns))
I get the data frame bellow where each line is a principal component.
What I'd like to understand is how to interpret that table. I know that if I square all the features on each component and sum them I get 1, but what does the -0.56 on PC1 mean? Dos it tell something about "Feature E" since it is the highest magnitude on a component that explains 52% of the variance?
Thanks

Terminology: First of all, the results of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed variable values corresponding to a particular data point), and loadings (the weight by which each standardized original variable should be multiplied to get the component score).
PART1: I explain how to check the importance of the features and how to plot a biplot.
PART2: I explain how to check the importance of the features and how to save them into a pandas dataframe using the feature names.
Summary in an article: Python compact guide: https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e?source=friends_link&sk=65bf5440e444c24aff192fedf9f8b64f
PART 1:
In your case, the value -0.56 for Feature E is the score of this feature on the PC1. This value tells us 'how much' the feature influences the PC (in our case the PC1).
So the higher the value in absolute value, the higher the influence on the principal component.
After performing the PCA analysis, people usually plot the known 'biplot' to see the transformed features in the N dimensions (2 in our case) and the original variables (features).
I wrote a function to plot this.
Example using iris data:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
iris = datasets.load_iris()
X = iris.data
y = iris.target
#In general it is a good idea to scale the data
scaler = StandardScaler()
scaler.fit(X)
X=scaler.transform(X)
pca = PCA()
pca.fit(X,y)
x_new = pca.transform(X)
def myplot(score,coeff,labels=None):
xs = score[:,0]
ys = score[:,1]
n = coeff.shape[0]
plt.scatter(xs ,ys, c = y) #without scaling
for i in range(n):
plt.arrow(0, 0, coeff[i,0], coeff[i,1],color = 'r',alpha = 0.5)
if labels is None:
plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, "Var"+str(i+1), color = 'g', ha = 'center', va = 'center')
else:
plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')
plt.xlabel("PC{}".format(1))
plt.ylabel("PC{}".format(2))
plt.grid()
#Call the function.
myplot(x_new[:,0:2], pca.components_)
plt.show()
Results
PART 2:
The important features are the ones that influence more the components and thus, have a large absolute value on the component.
TO get the most important features on the PCs with names and save them into a pandas dataframe use this:
from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
np.random.seed(0)
# 10 samples with 5 features
train_features = np.random.rand(10,5)
model = PCA(n_components=2).fit(train_features)
X_pc = model.transform(train_features)
# number of components
n_pcs= model.components_.shape[0]
# get the index of the most important feature on EACH component
# LIST COMPREHENSION HERE
most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]
initial_feature_names = ['a','b','c','d','e']
# get the names
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]
# LIST COMPREHENSION HERE AGAIN
dic = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}
# build the dataframe
df = pd.DataFrame(dic.items())
This prints:
0 1
0 PC0 e
1 PC1 d
So on the PC1 the feature named e is the most important and on PC2 the d.
Summary in an article: Python compact guide: https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e?source=friends_link&sk=65bf5440e444c24aff192fedf9f8b64f

Basic Idea
The Principle Component breakdown by features that you have there basically tells you the "direction" each principle component points to in terms of the direction of the features.
In each principle component, features that have a greater absolute weight "pull" the principle component more to that feature's direction.
For example, we can say that in PC1, since Feature A, Feature B, Feature I, and Feature J have relatively low weights (in absolute value), PC1 is not as much pointing in the direction of these features in the feature space. PC1 will be pointing most to the direction of Feature E relative to other directions.
Visualization in Lower Dimensions
For a visualization of this, look at the following figures taken from here and here:
The following shows an example of running PCA on correlated data.
We can visually see that both eigenvectors derived from PCA are being "pulled" in both the Feature 1 and Feature 2 directions. Thus, if we were to make a principle component breakdown table like you made, we would expect to see some weightage from both Feature 1 and Feature 2 explaining PC1 and PC2.
Next, we have an example with uncorrelated data.
Let us call the green principle component as PC1 and the pink one as PC2. It's clear that PC1 is not pulled in the direction of feature x', and as isn't PC2 in the direction of feature y'.
Thus, in our table, we must have a weightage of 0 for feature x' in PC1 and a weightage of 0 for feature y' in PC2.
I hope this gives an idea of what you're seeing in your table.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.