Scikit learn categorical features ranking

My data contains a lot of categorical features, for example age, color, size, race, gender and so on.
The problem is that scikit-learn does not let us mark features as factors the way R does, so we have to convert the categorical data into dummy columns. For example:
color  size
green  M
red    L
blue   XL
is converted to
color_blue  color_green  color_red  size_L  size_M  size_XL
0.0         1.0          0.0        0.0     1.0     0.0
0.0         0.0          1.0        1.0     0.0     0.0
1.0         0.0          0.0        0.0     0.0     1.0
However, I would like to rank the features as color or size, not as color_blue or size_M.
Is there any way to do that? Or can I sum the ranking scores of the related dummy columns?
(e.g. the score for the color column would be the sum of the green, blue and red scores)
Note that I use ExtraTreesClassifier to calculate the ranking scores.
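The summing idea described in the question can be sketched as follows. This is only a minimal illustration under assumptions: the data and the labels y are made up, and the "__" separator is a hypothetical choice that lets each dummy column be mapped back to its original column name.

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Made-up toy data; y is an invented binary label for illustration only.
df = pd.DataFrame({
    "color": ["green", "red", "blue", "green", "red", "blue", "green", "red"],
    "size": ["M", "L", "XL", "L", "M", "XL", "M", "L"],
})
y = [0, 1, 1, 0, 1, 1, 0, 1]

# A separator that does not occur in the column names (an assumption),
# so "color__blue" can be split back into "color".
prefix_sep = "__"
X = pd.get_dummies(df, prefix_sep=prefix_sep)

model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)

# One importance per dummy column ...
per_dummy = pd.Series(model.feature_importances_, index=X.columns)
# ... summed per original column.
origin = per_dummy.index.str.split(prefix_sep).str[0]
per_feature = per_dummy.groupby(origin).sum().sort_values(ascending=False)
print(per_feature)
```

Summing is one possible aggregation; it tends to favour columns with many categories, so the resulting ranking should be read with that in mind.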

Related

Understanding FeatureHasher, collisions and vector size trade-off

I'm preprocessing my data before implementing a machine learning model. Some of the features are with high cardinality, like country and language.
Since encoding those features as one-hot-vector can produce sparse data, I've decided to look into the hashing trick and used python's category_encoders like so:
from category_encoders.hashing import HashingEncoder
ce_hash = HashingEncoder(cols = ['country'])
encoded = ce_hash.fit_transform(df.country)
encoded['country'] = df.country
encoded.head()
When looking at the result, I can see the collisions
   col_0  col_1  col_2  col_3  col_4  col_5  col_6  col_7  country
0      0      0      1      0      0      0      0      0  US  <━┓
1      0      1      0      0      0      0      0      0  CA    ┃ US and SE collide
2      0      0      1      0      0      0      0      0  SE  <━┛
3      0      0      0      0      0      0      1      0  JP
Further investigation led me to this Kaggle article. The hashing example there includes both X and y.
What is the purpose of y, does it help to fight the collision problem?
Should I add more columns to the encoder and encode more than one feature together (for example country and language)?
I would appreciate an explanation of how to encode such categories using the hashing trick.
Update:
Based on the comments I got from @CoMartel, I've looked at sklearn's FeatureHasher and written the following code to hash the country column:
from sklearn.feature_extraction import FeatureHasher
h = FeatureHasher(n_features=10,input_type='string')
f = h.transform(df.country)
df1 = pd.DataFrame(f.toarray())
df1['country'] = df.country
df1.head()
And got the following output:
0 1 2 3 4 5 6 7 8 9 country
0 -1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -1.0 0.0 US
1 -1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -1.0 0.0 US
2 -1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -1.0 0.0 US
3 0.0 -1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 CA
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 -1.0 0.0 SE
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 JP
6 -1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 AU
7 -1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 AU
8 -1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 DK
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 -1.0 0.0 SE
Is that the way to use the library in order to encode high-cardinality categorical values?
Why are some values negative?
How would you choose the "right" n_features value?
How can I check the collisions ratio?

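Is that the way to use the library in order to encode high-cardinality categorical values?
Almost. With input_type='string', FeatureHasher expects each sample to be an iterable of strings, so a bare string such as 'US' is hashed character by character. That is why every two-letter code in your output has two non-zero entries (US and SE share the column for 'S'), and most likely why JP is all zeros: its two letters landed in the same column with opposite signs. Wrap each value in a list, e.g. h.transform([[c] for c in df.country]), so that each country is hashed as a single token. Newer scikit-learn releases reject bare-string samples with an error.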
You can think about the hashing trick as a "reduced size one-hot encoding with a small risk of collision, that you won't need to use if you can tolerate the original feature dimension".
This idea was introduced by Kilian Weinberger et al. Their paper analyzes the algorithm both theoretically and empirically.
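As a hedged sketch of how a single categorical column could be passed to FeatureHasher: with input_type="string" each sample is treated as an iterable of string tokens, so every country is wrapped in a one-element list here. The toy column and the choice of n_features=16 are assumptions made only for illustration.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Toy data standing in for the real country column.
df = pd.DataFrame({"country": ["US", "US", "CA", "SE", "JP", "AU", "DK"]})

hasher = FeatureHasher(n_features=16, input_type="string")

# Each sample must be an iterable of tokens; ["US"] is one token,
# whereas a bare "US" would be iterated as the characters "U" and "S".
hashed = hasher.transform([[c] for c in df["country"]])

df_hashed = pd.DataFrame(hashed.toarray())
df_hashed["country"] = df["country"].values
print(df_hashed)
```

With one token per sample, every row has exactly one non-zero entry (+1 or -1), and identical countries always land in the same column.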
Why are some values negative?
The sign is not there to avoid collisions but to soften their effect: a signed hash function is used. Each string is first hashed with the usual hash function and the result is taken modulo n_features to get an index in [0, n_features) (as a toy example, a string could be converted to a number by summing the ASCII values of its characters). Then a second, single-bit hash function produces +1 or -1, and that value is added at the index given by the first hash. When two values collide, their contributions tend to cancel out rather than pile up. In scikit-learn this behaviour is controlled by alternate_sign (True by default); pass alternate_sign=False to get only non-negative values.
A simplified Python sketch (zlib.crc32 is only a stand-in for the real hash functions):
import zlib
import numpy as np

def hash_trick(features, n_features):
    res = np.zeros(n_features)  # one output vector, created once
    for f in features:
        h = zlib.crc32(f.encode("utf-8"))  # stand-in for the usual hash function
        index = h % n_features  # modulo gives the index in [0, n_features)
        if (h >> 31) & 1:  # stand-in for the single-bit hash that picks the sign
            res[index] += 1
        else:
            res[index] -= 1  # <--- this is what makes values negative
    return res
How would you choose the "right" n_features value?
As a rule of thumb: by the pigeonhole principle, if you hash more distinct categories than n_features, a collision is certain. So n_features must be at least the number of distinct categories (in your case, the number of different countries) for a collision-free mapping to be possible at all. That is only the best case, though, where every category happens to land in its own slot; a hash function gives no such guarantee, and collisions usually appear well before the table is full. So higher is better, but how high? See next.
How can I check the collisions ratio?
If we ignore the second single-bit hash function, the problem is reduced to something called "Birthday problem for Hashing".
This is a big topic. For a comprehensive introduction to this problem, I recommend you read this, and for some detailed math, I recommend this answer.
In a nutshell: hashing k distinct values into N slots leaves no collision with probability of roughly exp(-k^2 / (2N)). For k = sqrt(N) that is exp(-1/2) = 60.65%, so there is approximately a 39.35% chance of at least one collision.
So, as a rule of thumb, if we have X countries, there is about a 40% chance of at least one collision when the hash output range (i.e. the n_features parameter) is X^2. In other words, there is a 40% chance of a collision if the number of countries in your example equals square_root(n_features). Beyond that point, each doubling of n_features roughly halves the chance of a collision. (Personally, if it is not for security purposes but just a plain conversion from strings to numbers, I don't think it is worth going too high.)
Side note for curious readers: for a large enough hash output size (e.g. 256 bits), the chance that an attacker finds (or exploits) a collision is practically zero (from a security perspective).
Regarding the y parameter: as you were already told in a comment, it is there only for API compatibility and is not used (scikit-learn follows the same convention in many of its other transformers).
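The rule of thumb above can be checked numerically. The sketch below is an illustration under assumptions: collision_probability uses the birthday-problem approximation 1 - exp(-k^2 / (2N)), and observed_collisions is a hypothetical helper (not part of scikit-learn) that hashes each distinct category once and counts how many column slots were lost to collisions.

```python
import math
from sklearn.feature_extraction import FeatureHasher

def collision_probability(n_categories, n_features):
    """Approximate P(at least one collision) via the birthday problem."""
    return 1.0 - math.exp(-n_categories ** 2 / (2.0 * n_features))

def observed_collisions(categories, n_features):
    """Distinct categories minus the number of distinct columns they occupy."""
    unique = sorted(set(categories))
    # alternate_sign=False keeps every hashed value at +1, so no entry cancels out.
    hasher = FeatureHasher(n_features=n_features, input_type="string",
                           alternate_sign=False)
    matrix = hasher.transform([[c] for c in unique])
    used_columns = set(matrix.nonzero()[1])
    return len(unique) - len(used_columns)

countries = ["US", "CA", "SE", "JP", "AU", "DK"]
for n in (8, 64, 1024):
    print(n, round(collision_probability(len(countries), n), 3),
          observed_collisions(countries, n))
```

The first number is what the approximation predicts for a random hash, the second is what actually happened for this particular set of strings, so the two need not agree for a small sample.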

During handling of the above exception, another exception occurred when using SHAP to interpret keras neural network model

The x_train looks like this (22 features):
total_amount reward difficulty duration discount bogo mobile social web income ... male other_gender age_under25 age_25_to_35 age_35_to_45 age_45_to_55 age_55_to_65 age_65_to_75 age_75_to_85 age_85_to_105
0 0.006311 0.2 0.50 1.000000 1.0 0.0 1.0 1.0 1.0 0.355556 ... 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
1 0.015595 0.2 0.50 1.000000 1.0 0.0 1.0 1.0 1.0 0.977778 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
The labels are 0 and 1; it's a binary classification problem. Here's the code for building the model, and I was following this page to implement SHAP:
# use SHAP
deep_explainer = shap.DeepExplainer(nn_model_2, x_train[:100])
# explain the first 10 predictions
# explaining each prediction requires 2 * background dataset size runs
shap_values = deep_explainer.shap_values(x_train)
This gave me error:
KeyError: 0
During handling of the above exception, another exception occurred
I have no idea what this message is complaining about. I tried SHAP with an XGBoost and a Logistic Regression model and both worked fine. I'm new to Keras and SHAP; can someone have a look and tell me how I can solve it? Many thanks.

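I think SHAP's DeepExplainer expects NumPy arrays, while x_train is a pandas DataFrame. When the DataFrame is indexed by position like a NumPy array, as in x_train[0], pandas looks for a column labelled 0, which does not exist, hence KeyError: 0. Try passing the underlying arrays instead:
deep_explainer = shap.DeepExplainer(nn_model_2, x_train[:100].values)
shap_values = deep_explainer.shap_values(x_train.values)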
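The difference between the two kinds of indexing can be reproduced with pandas alone. The tiny frame below is made up and only borrows two column names from the question; it is a sketch of the suspected cause, not of SHAP's internals.

```python
import pandas as pd

# Made-up frame with string column labels, like x_train in the question.
x_train = pd.DataFrame({"reward": [0.2, 0.2], "income": [0.36, 0.98]})

# Integer indexing on a DataFrame is a column lookup by label.
try:
    x_train[0]
    raised = False
except KeyError:
    raised = True

# On the underlying NumPy array the same expression selects the first row.
first_row = x_train.values[0]
print(raised, first_row)
```

This is why converting with .values (or .to_numpy()) before handing the data to code that indexes by position avoids the KeyError.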

Find out which features are in my components after PCA

I performed a PCA of my data. The data looks like the following:
df
Out[60]:
Drd1_exp1 Drd1_exp2 Drd1_exp3 ... M7_pppp M7_puuu Brain_Region
0 -1.0 -1.0 -1.0 ... 0.0 0.0 BaGr
3 -1.0 -1.0 -1.0 ... 0.0 0.0 BaGr
4 -1.0 -1.0 -1.0 ... 0.0 0.0 BaGr
... ... ... ... ... ... ...
150475 -1.0 -1.0 -1.0 ... 0.0 0.0 BaGr
150478 -1.0 -1.0 -1.0 ... 0.0 0.0 BaGr
150479 -1.0 -1.0 -1.0 ... 0.0 0.0 BaGr
I now used every column up to 'Brain_Region' as features. I also standardized them.
These features are different experiments, that give me information about a 3D image of a brain.
I'll show you my code:
from sklearn.preprocessing import StandardScaler
x = df.loc[:, listend1].values
y= df.loc[:, 'Brain_Region'].values
x = StandardScaler().fit_transform(x)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data=principalComponents,
                           columns=['principal component 1', 'principal component 2'])
finalDf = pd.concat([principalDf, df[['Brain_Region']]], axis = 1)
I then plotted finalDF:
My question now is: how can I find out which features contribute to my components, and how can I interpret the data?

You can use pca.components_ (note the trailing underscore).
It has shape (n_components, n_features), in your case (2, n_features). Each row is one principal component, i.e. a direction of maximum variance in the data, and each entry is the weight of the corresponding original feature in that component: the larger its absolute value, the more that feature contributes. You will have something like this:
[[0.522 0.26 0.58 0.56],
[0.37 0.92 0.02 0.06]]
implying that for the first component (first row) the first, third and last features have a higher importance, while for the second component only the second feature is important.
Have a look at the sklearn PCA attributes description or at this post.
By the way, you can also use a Random Forest Classifier including the labels, and after the training you can explore the feature importance, e.g. this post.
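To make the components easier to read, they can be put into a DataFrame labelled with the original feature names. The following is a minimal sketch on synthetic data; the column names exp_a, exp_b and exp_c are invented for illustration and stand in for the real experiment columns.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: exp_a and exp_b are strongly correlated, exp_c is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
data = pd.DataFrame({
    "exp_a": base + 0.1 * rng.normal(size=200),
    "exp_b": base + 0.1 * rng.normal(size=200),
    "exp_c": rng.normal(size=200),
})

x = StandardScaler().fit_transform(data)
pca = PCA(n_components=2).fit(x)

# Rows: original features, columns: principal components.
loadings = pd.DataFrame(pca.components_.T, index=data.columns,
                        columns=["PC1", "PC2"])

# Features ordered by the absolute size of their weight in the first component.
top_pc1 = loadings["PC1"].abs().sort_values(ascending=False)
print(loadings)
print(top_pc1)
```

The sign of a weight only indicates direction along the component; for ranking contributions the absolute value is what matters.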

One hot encoding error python machine learning

I am working with categorical variables in machine learning. Here is a sample of my data:
age,gender,height,class,label
25,m,43,A,0
35,f,45,B,1
12,m,36,C,0
14,f,42,A,0
There are two categorical variables, gender and class. I have used the LabelEncoder technique.
My code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
df=pd.read_csv('test.csv')
X=df.drop(['label'],1)
y=np.array(df['label'])
data=X.iloc[:,:].values
lben = LabelEncoder()
data[:,1] = lben.fit_transform(data[:,1])
data[:,3] = lben.fit_transform(data[:,3])
onehotencoder = OneHotEncoder(categorical_features=[1])
data = onehotencoder.fit_transform(data).toarray()
onehotencoder = OneHotEncoder(categorical_features=[3])
data = onehotencoder.fit_transform(data).toarray()
print(data.shape)
np.savetxt('data.csv',data,fmt='%s')
The data.csv looks like this:
0.0 0.0 1.0 0.0 0.0 1.0 25.0 0.0
0.0 0.0 0.0 1.0 1.0 0.0 35.0 1.0
1.0 0.0 0.0 0.0 0.0 1.0 12.0 2.0
0.0 1.0 0.0 0.0 1.0 0.0 14.0 0.0
I am unable to understand why the columns look like this, i.e. where is the value of the 'height' column? Also, data.shape is (4,8) instead of (4,7) (gender represented by 2 columns, class by 3, plus the 'age' and 'height' features).

Are you sure that you need to use LabelEncoder+OneHotEncoder? There is a much simpler method (it does not allow more advanced procedures, but so far you seem to be working on the basics):
import pandas as pd
import numpy as np
df=pd.read_csv('test.csv')
X=df.drop(['label'],axis=1)
y=np.array(df['label'])
data = pd.get_dummies(X)
The problem with the current code is that after you have done the first OHE:
onehotencoder = OneHotEncoder(categorical_features=[1])
data = onehotencoder.fit_transform(data).toarray()
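the columns get shifted: the one-hot columns are placed first, so the layout becomes gender (2 columns), age, height, class. Column 3 is then in fact the original height column instead of the label-encoded class column. So change the second encoder to use column 4 and you will get what you want.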
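Note that the categorical_features argument used above was removed from OneHotEncoder in later scikit-learn releases. A hedged sketch of the same encoding with ColumnTransformer, which selects columns by name and therefore avoids the shifting-index problem, could look like this (the frame reproduces the sample data from the question instead of reading test.csv):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Sample data from the question, without the label column.
X = pd.DataFrame({
    "age": [25, 35, 12, 14],
    "gender": ["m", "f", "m", "f"],
    "height": [43, 45, 36, 42],
    "class": ["A", "B", "C", "A"],
})

# One-hot encode gender and class by name; pass age and height through unchanged.
# sparse_threshold=0 asks for a dense array as the result.
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(), ["gender", "class"])],
    remainder="passthrough",
    sparse_threshold=0,
)
data = ct.fit_transform(X)
print(data.shape)
print(data)
```

The encoded columns come first (gender, then class), followed by the untouched age and height columns, which gives the expected 7 columns.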

Stacked Bar Plot-Starting with NonNumerical Items

I am trying to create a stacked bar graph to show how the launch vehicles of satellites have changed over time. I'd like the x axis to be the year of the launch and the y axis to be the number of satellites launched on each vehicle, where each section of the bar has a different color that represents the launch vehicle. I am struggling to come up with a way to do this because my Launch Vehicle column is non-numerical. I looked into the groupby function as well as value_counts but can't seem to get them to do what I am looking for.

You have to reorganize your data to use DataFrame.plot in the desired way:
import pandas as pd
import matplotlib.pyplot as plt
# test data
df = pd.DataFrame({'Launch Vehicle':["Soyuz 2.1a",'Ariane 5 ECA','Falcon 9','Long March','Falcon 9', 'Atlas 3','Atlas 3'],
'Year of Launch': [2016,2014,2016,1997,2015,2004,2004]})
# group by year and launch vehicle, then unstack to get a pivot table of counts
# fillna(0) puts zero where a vehicle had no launch in a given year
df2 = df.groupby(['Year of Launch','Launch Vehicle'])['Year of Launch'].count().unstack('Launch Vehicle').fillna(0)
print(df2)
# plot the data
df2.plot(kind='bar', stacked=True, rot=1)
plt.show()
Output of df2:
Launch Vehicle Ariane 5 ECA Atlas 3 Falcon 9 Long March Soyuz 2.1a
Year of Launch
1997 0.0 0.0 0.0 1.0 0.0
2004 0.0 2.0 0.0 0.0 0.0
2014 1.0 0.0 0.0 0.0 0.0
2015 0.0 0.0 1.0 0.0 0.0
2016 0.0 0.0 1.0 0.0 1.0
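As a possible alternative to the groupby/unstack/fillna chain, pandas.crosstab builds the same year-by-vehicle count table in one call and keeps the counts as integers. This is only a sketch using the same test data as above.

```python
import pandas as pd

# Same test data as in the answer above.
df = pd.DataFrame({
    "Launch Vehicle": ["Soyuz 2.1a", "Ariane 5 ECA", "Falcon 9", "Long March",
                       "Falcon 9", "Atlas 3", "Atlas 3"],
    "Year of Launch": [2016, 2014, 2016, 1997, 2015, 2004, 2004],
})

# Rows: years, columns: launch vehicles, values: number of launches.
counts = pd.crosstab(df["Year of Launch"], df["Launch Vehicle"])
print(counts)
```

The resulting table can be plotted in the same way, e.g. counts.plot(kind='bar', stacked=True) followed by plt.show().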
