DecisionTreeClassifier on multiple levels - python

I am trying to classify objects that have multiple levels. The best way I can explain it is with an example:
I can do this:
from sklearn import tree
features = [['Hip Hop','Boston'],['Metal', 'Cleveland'],['Gospel','Ohio'],['Grindcore','Agusta']]
labels = [1,0,0,0]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)
But I want to do this:
from sklearn import tree
features = [['Hip Hop','Boston',['Run DMC','Kanye West']],['Metal', 'Cleveland',['Guns n roses','Poison']],['Gospel','Ohio',['Christmania','I Dream of Jesus']],['Grindcore','Agusta', ['Pig Destroyer', 'Carcas', 'Cannibal Corpse']]]
labels = [1,0,0,0]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)
clf.predict_proba(<blah blah>)
I am trying to assign a probability that a person will enjoy a band based on their location, favorite genre, and other bands they like.

You have a simple solution: just turn each band into a binary feature (you can use MultiLabelBinarizer or something similar). Your X matrix, just before feeding it into the tree, will then be a table of 0/1 indicator columns: one column per genre, city, and band, with a 1 wherever that value applies to the person.
You could create such a matrix with this code:
import pandas as pd
features = [['Hip Hop','Boston',['Run DMC','Kanye West']],
            ['Metal', 'Cleveland',['Guns n roses','Poison']],
            ['Gospel','Ohio',['Christmania','I Dream of Jesus']],
            ['Grindcore','Agusta', ['Pig Destroyer', 'Carcas', 'Cannibal Corpse']]]
# one indicator column per genre, city, and band; missing values become 0
df = pd.DataFrame([{**{f[0]: 1, f[1]: 1}, **{k: 1 for k in f[2]}} for f in features]).fillna(0)
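If you prefer to stay inside sklearn, here is a minimal sketch of the same encoding with MultiLabelBinarizer; it treats each row as one bag of tags (genre, city, and all liked bands), which is an assumption on my part about how you want to combine the columns:
from sklearn.preprocessing import MultiLabelBinarizer
# flatten each row into a single bag of tags
bags = [[f[0], f[1], *f[2]] for f in features]
mlb = MultiLabelBinarizer()
X = mlb.fit_transform(bags)   # binary matrix, one column per tag
print(mlb.classes_)           # column order of X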
If the number of bands is low, binary encoding will suffice. But if there are too many bands, you might want to reduce dimensionality. You can accomplish it with the following steps:
Create the user-bands count matrix, like above
(Optionally) normalize it e.g. with tf-idf
Apply a matrix decomposition algorithm to it to extract the "latent features" from the matrix.
Feed the latent features to your decision tree (or any other estimator).
If the number of bands is large but you have too few observations, even matrix decomposition may not help much. If that is the case, the best advice is to use simpler features, e.g. replace the bands with their corresponding genres.
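A minimal sketch of those steps, assuming the user-band indicator dataframe df from the snippet above; TruncatedSVD and the number of latent components are just illustrative choices, not the only option:
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.tree import DecisionTreeClassifier
tfidf = TfidfTransformer()                  # optional tf-idf normalization
X_norm = tfidf.fit_transform(df.values)
svd = TruncatedSVD(n_components=2)          # extract "latent features"
X_latent = svd.fit_transform(X_norm)
clf = DecisionTreeClassifier()
clf.fit(X_latent, labels)                   # labels = [1,0,0,0] from the question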

Related

How to give more importance to some features in sklearn Isolation Forest

I am using sklearn isolation forest for an anomaly detection task. Isolation forest consists of iTrees. As this paper describes, the nodes of the iTrees are split in the following way:
We select any feature (uniformly) randomly and perform a split on a random value of that feature.
But I want to give more weight to some features than the others. So instead of selecting the features with equal probability, I want to draw some features with a higher probability (giving more weight to those features) and other features with a lower probability.
How can I do that? From the source code it seems I would have to change the function _generate_bagging_indices in _bagging.py, but I am not sure.
You can achieve this without changing the source code. Instead, you can tweak your input data by duplicating the features you wish to give more weight to. If a feature appears twice, the trees will use it twice to split your data, which in practice amounts to doubling the weight of that feature.
In addition, you can reduce the number of features used by each tree in the isolation forest via the max_features argument. The default value of 1.0 means every feature is used for every tree. By reducing it, each tree is trained on a random subset of the columns, so the features you did not duplicate will be missing from more of the trees.
Illustration
Load Data
from sklearn.ensemble import IsolationForest
import pandas as pd
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
data = load_iris()
X = data.data
df = pd.DataFrame(X, columns=data.feature_names)
Default settings
IF = IsolationForest()
IF.fit(df)
preds = IF.predict(df)
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=preds)
plt.title("Default settings")
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.show()
Weighted Settings
df1 = df.copy()
weight_feature = 10
for i in range(weight_feature):
    df1["duplicated_" + str(i)] = df1["sepal length (cm)"]
IF1 = IsolationForest(max_features=0.3)
IF1.fit(df1)
preds1 = IF1.predict(df1)
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=preds1)
plt.title("Weighted settings")
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.show()
As you can see in the plots, the second run relies much more heavily on the x-axis (the duplicated sepal length feature) when deciding which points are outliers.

How to choose the Chi Squared threshold in feature selection

About this:
NLP in Python: Obtain word names from SelectKBest after vectorizing
I found this code:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2
THRESHOLD_CHI = 5 # or whatever you like. You may try with
# for threshold_chi in [1,2,3,4,5,6,7,8,9,10] if you prefer
# and measure the f1 scores
X = df['text']
y = df['labels']
cv = CountVectorizer()
cv_sparse_matrix = cv.fit_transform(X)
cv_dense_matrix = cv_sparse_matrix.todense()
chi2_stat, pval = chi2(cv_dense_matrix, y)
chi2_reshaped = chi2_stat.reshape(1,-1)
which_ones_to_keep = chi2_reshaped > THRESHOLD_CHI
which_ones_to_keep = np.repeat(which_ones_to_keep, axis=0, repeats=which_ones_to_keep.shape[1])
This code computes the chi squared test and should keep the best features within a chosen threshold.
My question is: how do I choose a threshold for the chi-squared test scores?
The chi-squared statistic does not have a fixed range, so it is hard to pick a threshold beforehand. What you can usually do instead is sort the variables by their p-values: lower p-values are better, because they indicate a stronger dependence between the feature and the target variable (we want to discard features that are independent of the target, i.e. not predictive of it). You still have to decide how many features to keep, and that number is a hyperparameter you can tune manually or, even better, with a grid search.
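For example, a minimal sketch of that p-value ranking, reusing cv, chi2_stat and pval from the code above (get_feature_names_out is available in recent sklearn versions; older ones use get_feature_names):
import numpy as np
feature_names = np.array(cv.get_feature_names_out())
order = np.argsort(pval)              # lowest (best) p-values first
for name, p in zip(feature_names[order][:20], pval[order][:20]):
    print(name, p)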
Be aware that you can avoid doing the selection manually: sklearn already implements SelectKBest to select the best k features based on the chi-squared score. You can use it as follows:
from sklearn.feature_selection import SelectKBest, chi2
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
But if for any reason you want to rely solely on the raw chi-squared value, you could compute its minimum and maximum over the variables and then divide that interval into n steps to test through a grid search.
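In practice, a minimal sketch of tuning k with a grid search over a pipeline; the classifier, scoring metric and value grid are just illustrative choices:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("select", SelectKBest(chi2)),
    ("clf", LogisticRegression(max_iter=1000)),
])
grid = GridSearchCV(pipe, {"select__k": [10, 50, 100, 500]}, scoring="f1_macro", cv=5)
grid.fit(X, y)                        # X = df['text'], y = df['labels'] as in the question
print(grid.best_params_)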

Feature agglomeration: How to retrieve the features that make up the clusters?

I am using scikit-learn's FeatureAgglomeration to run a hierarchical clustering procedure on the features rather than on the observations.
This is my code:
from sklearn import cluster
import pandas as pd
#load the data
df = pd.read_csv('C:/Documents/data.csv')
agglo = cluster.FeatureAgglomeration(n_clusters=5)
agglo.fit(df)
df_reduced = agglo.transform(df)
My original df had shape (990, 15); after feature agglomeration, df_reduced has shape (990, 5).
How do I now find out how the original 15 features have been clustered together? In other words, which original features from df make up each of the 5 new features in df_reduced?
How the features within each cluster are combined during transform is determined by how you set up the hierarchical clustering. The reduced feature set simply consists of the n_clusters cluster centers, which are n_samples-dimensional vectors. For certain applications you might want to compute the centers manually with a different definition of the center (e.g. the median instead of the mean, to reduce the influence of outliers):
import numpy as np
n_features = 15
n_clusters = 5
feature_identifier = np.arange(n_features)
feature_groups = [feature_identifier[agglo.labels_ == i] for i in range(n_clusters)]
# mean over the features in each group -> one n_samples-long column per cluster
new_features = [df.loc[:, df.columns[group]].mean(axis=1) for group in feature_groups]
Don't forget to standardize the features beforehand (for example with sklearn's StandardScaler). Otherwise you are grouping the scales of the quantities rather than clustering similar behavior.
Hope that helps!
Haven't tested the code. Let me know if there are problems.
After fitting the clusterer, agglo.labels_ is an array that tells you which cluster in the reduced dataset each original feature belongs to.
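For example, a minimal sketch that scales the data first and then prints which original columns ended up in each cluster (the column names of course depend on your CSV):
import pandas as pd
from sklearn import cluster
from sklearn.preprocessing import StandardScaler
df_scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
agglo = cluster.FeatureAgglomeration(n_clusters=5)
agglo.fit(df_scaled)
for i in range(5):
    print("cluster", i, ":", list(df.columns[agglo.labels_ == i]))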

kmeans clustering with dataframe (scipy)

I would like to run k-means clustering with more than 3 features. I've tried it with two features and am wondering how to provide more than 3 features to sklearn.cluster.KMeans.
Here's my code and the dataframe from which I'd like to select features. I have multiple dataframes as input and have to provide them as features.
# currently two features are selected
# I'd like to combine more than 3 features and provide them to dataset
df_features = pd.merge(df_max[['id', 'max']],
                       df_var[['id', 'variance']], on='id', how='left')
cols = list(df_features.loc[:,'max':'variance'])
X = df_features.as_matrix(columns=cols)
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
centroid = kmeans.cluster_centers_
labels = kmeans.labels_
colors = ["g.","r.","c."]
for i in range(len(X)):
    print("coordinate:", X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=10)
plt.scatter(centroid[:, 0], centroid[:, 1], marker="x", s=150, linewidths=5, zorder=10)
plt.show()
Generally you wouldn't want id to be a feature because, unless you have good reason to believe otherwise, it does not correlate with anything.
As long as you feed a valid matrix X to kmeans.fit(X), it will run the k-means algorithm for you regardless of the number of features in X. However, if you have a huge number of features, it may take longer to finish.
The problem is then how to construct X. As you have shown in your example, you can simply merge dataframes, select the wanted columns, and extract the feature matrix with an .as_matrix() call (.to_numpy() in recent pandas versions). If you have more dataframes and columns, you just merge more and select more, as sketched below.
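A minimal sketch of that idea with a third, hypothetical dataframe (df_mean and its columns are made up for illustration):
from sklearn.cluster import KMeans
# chain merges on 'id' to collect as many feature columns as you need
df_features = (df_max[['id', 'max']]
               .merge(df_var[['id', 'variance']], on='id', how='left')
               .merge(df_mean[['id', 'mean']], on='id', how='left'))   # df_mean is hypothetical
X = df_features[['max', 'variance', 'mean']].to_numpy()                # any number of feature columns
kmeans = KMeans(n_clusters=3).fit(X)
print(kmeans.labels_)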
Feature selection and dimensionality reduction may come in handy once you have more than enough features in your dataset. Read more about them when you have time.
P.S. Why scipy in the title?

How to get the top N frequent words in each cluster? Sklearn

I have a text corpus that contains 1000+ articles, each on a separate line. I used hierarchical clustering with sklearn in Python to produce clusters of related articles. This is the code I used to do the clustering:
Note: X is a sparse 2D matrix with rows corresponding to documents and columns corresponding to terms
# Agglomerative Clustering
from sklearn.cluster import AgglomerativeClustering
model = AgglomerativeClustering(affinity="euclidean", linkage="complete", n_clusters=3)
model.fit(X.toarray())
clustering = model.labels_
print (clustering)
I specify the number of clusters (3) at which to cut off the tree to get a flat clustering, like k-means.
My question is: how do I get the top N most frequent words in each cluster, so that I can suggest a topic for each cluster?
Thanks
One option is to convert X from a sparse matrix to a pandas dataframe. The rows will still correspond to documents and the columns to words. If you have a list of your vocabulary in the order of your array's columns (used as your_word_list below), you could try something like this:
import pandas as pd
X = pd.DataFrame(X.toarray(), columns=your_word_list) # columns argument is optional
X['Cluster'] = clustering # Add column corresponding to cluster number
word_frequencies_by_cluster = X.groupby('Cluster').sum()
# To get a sorted list for a numbered cluster, in this case 1
print(word_frequencies_by_cluster.loc[1, :].sort_values(ascending=False))
As a side note, you may want to look into algorithms (e.g. LDA) and distance metrics (cosine) that are more commonly used for natural language processing. If you are looking to extract topics, there is a nice sklearn tutorial on topic modeling.
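If you go the topic-modeling route instead, here is a minimal sketch with sklearn's LatentDirichletAllocation; documents, the number of topics, and the top-word count are illustrative assumptions:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words="english")
counts = cv.fit_transform(documents)             # documents = your list of articles
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(counts)
words = cv.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:10]             # indices of the 10 highest-weight words
    print("topic", topic_idx, ":", [words[i] for i in top])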
