Hierarchical clustering from confusion matrix with Python

Based on the following answer, I tried to code hierarchical class clustering from a confusion matrix. A confusion matrix is used to evaluate the results of a classification problem and isn't symmetric: each row represents the instances in an actual class. Here is an example of a confusion matrix where you can read that 25% of the samples of the 'zero' class are predicted as class 'six'.
I tried to modify the code with:
import scipy.cluster.hierarchy as sch

conf_mat = 1 - conf_mat  # 1.0 means dissimilarity
Y = sch.linkage(conf_mat, method='ward')  # note: 'warp' is not a valid method name
But I got wrong results.
How should I organize my data in order to apply the clustering?
The following should give me the rearranged order of the clusters, right?
ind = sch.fcluster(Y, 0, 'distance')

I'm not sure I understand WHY you are doing this, but, based on the comment which you posted above, it seems that you'd like to cluster 10 objects ('zero', 'one', ..., 'nine') by comparing their values in your confusion matrix, generated by some other algorithm.
I would like the clusters to maximize the classification results: if
one class is mainly recognized as another one, then both classes should
be fused.
... So, looking at your data, object 'eight' and object 'nine' might be in the same cluster because they both have mostly low values and one relatively high value in the 'eight' column.
To do this, you can treat each of the 10 objects as having 10 arbitrary properties; then this is a standard setup. Perhaps Euclidean distance is appropriate to determine the distance between objects; you would know best.
It sounds like you'd like to do some hierarchical clustering, which you can do with scipy.cluster.hierarchy; example below.
Example
I didn't want to type up your data by hand, so I just randomly generated a matrix. To avoid confusion I'm calling the objects 'zero' ... 'nine' (spelled out) and I'm using numerals '0' through '9' as the objects' properties.
          0     1     2     3     4     5     6     7     8     9
zero   0.37  0.27  0.23  0.92  0.86  0.62  0.08  0.95  0.35  0.69
one    0.24  0.23  0.70  0.39  0.52  0.03  0.14  0.00  0.53  0.10
two    0.78  0.12  0.85  0.79  0.32  0.90  0.78  0.07  0.07  0.62
...
nine   0.15  0.39  0.27  0.93  0.12  0.14  0.34  0.11  0.72  0.52
So this is my "confusion matrix".
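For reference, a sketch of how such a labeled frame could be generated (this is my assumption of the setup; the random values will not reproduce the table above):
import numpy as np
import pandas as pd

# ten spelled-out object names as the index, numerals 0-9 as columns
labels = ['zero', 'one', 'two', 'three', 'four',
          'five', 'six', 'seven', 'eight', 'nine']
np.random.seed(0)  # arbitrary seed, for reproducibility
data = pd.DataFrame(np.random.random((10, 10)).round(2),
                    index=labels, columns=range(10))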
Hierarchical clustering with SciPy. I'm using Euclidean distance, and the single-link agglomerative method.
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

# condensed vector of pairwise Euclidean distances between the rows
Y = pdist(data.to_numpy(), metric='euclidean')
# single-link agglomerative clustering on those distances
Z = hierarchy.linkage(Y, method='single')
ax = hierarchy.dendrogram(Z, show_contracted=True, labels=data.index.tolist())
[I put my matrix in a dataframe so I could add labels to columns and indices. That's why I'm using the pandas calls data.to_numpy() to get the raw data and data.index.tolist() to set the labels.]
This gives: [dendrogram image]
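On the follow-up about fcluster: it does return flat cluster labels, but it takes the linkage matrix (Z here), not the distance vector, and a threshold of 0 would leave every object in its own cluster. A minimal sketch, assuming the Z and data from above (the threshold 0.5 is an arbitrary choice, not a recommendation):
from scipy.cluster.hierarchy import fcluster

# flat cluster label per object: merges are kept while the linkage
# distance stays at or below the threshold t
ind = fcluster(Z, t=0.5, criterion='distance')
print(dict(zip(data.index.tolist(), ind)))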

Related

Default positive class in multiclass sklearn classification

I am working on a churn classification with 3 classes (0, 1, 2) but want to optimize classes 0 and 1 for recall. Does that mean sklearn needs to take classes 0 and 1 to be the positive classes? How can I explicitly specify for which class I want to optimise recall? If that is not possible, should I consider renaming the classes in ascending order so that 1 and 2 are positive by default?
              precision    recall  f1-score   support

           0       0.71      0.18      0.28      2611
           1       0.57      0.54      0.56      5872
           2       0.70      0.88      0.78      8913

    accuracy                           0.66     17396
   macro avg       0.66      0.53      0.54     17396
weighted avg       0.66      0.66      0.63     17396
Here is the code I am using for reference (although I need more of an understanding of how to optimize recall for only classes 0 and 1 here):
param_test1 = {'learning_rate': (0.05, 0.1), 'max_depth': (3, 5)}
estimator = GridSearchCV(
    estimator=GradientBoostingClassifier(loss='deviance', subsample=0.8,
                                         random_state=10, n_estimators=200),
    param_grid=param_test1, cv=2, refit='recall_score')
estimator.fit(df[predictors], df[target])
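In multiclass sklearn there is no single "positive class" to set, and renaming the classes will not change anything. One approach (a sketch, not the only way) is a custom scorer that computes recall only over the classes of interest via recall_score's labels argument; df, predictors, and target are assumed from the question's code:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, recall_score

# scorer: mean recall over classes 0 and 1 only, ignoring class 2
recall_0_1 = make_scorer(recall_score, labels=[0, 1], average='macro')

param_test1 = {'learning_rate': (0.05, 0.1), 'max_depth': (3, 5)}
search = GridSearchCV(
    estimator=GradientBoostingClassifier(subsample=0.8, random_state=10,
                                         n_estimators=200),
    param_grid=param_test1, cv=2, scoring=recall_0_1)
search.fit(df[predictors], df[target])
With a single scorer, GridSearchCV refits on the best parameters by default, so no refit argument is needed.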

How to assign new observations to clusters using a distance matrix and k-medoids?

I have a dataframe that holds the Word Mover's Distance between each pair of documents in my dataframe. I am running k-medoids on this to generate clusters.
     1     2     3     4     5
1  0.00  0.05  0.07  0.04  0.05
2  0.05  0.00  0.06  0.04  0.05
3  0.07  0.06  0.00  0.06  0.06
4  0.04  0.04  0.06  0.00  0.04
5  0.05  0.05  0.06  0.04  0.00
kmed = KMedoids(n_clusters=3, random_state=123, method='pam').fit(distance)
After running on this initial matrix and generating clusters, I want to add new points to be clustered. After adding a new document to the distance matrix I end up with:
     1     2     3     4     5     6
1  0.00  0.05  0.07  0.04  0.05  0.12
2  0.05  0.00  0.06  0.04  0.05  0.21
3  0.07  0.06  0.00  0.06  0.06  0.01
4  0.04  0.04  0.06  0.00  0.04  0.05
5  0.05  0.05  0.06  0.04  0.00  0.12
6  0.12  0.21  0.01  0.05  0.12  0.00
I have tried using kmed.predict on the new row.
kmed.predict(new_distance.loc[-1: ])
However, this gives me an error of incompatible dimensions X.shape[1] == 6 while Y.shape[1] == 5.
How can I use this distance of the new document to determine which cluster it should be a part of? Is this even possible, or do I have to recompute clusters every time? Thanks!
The source code for k-medoids says the following:
def transform(self, X):
    """Transforms X to cluster-distance space.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape (n_query, n_features), \
            or (n_query, n_indexed) if metric == 'precomputed'
        Data to transform.
    """
I assume that you use the precomputed metric (because you compute the distances outside the classifier), so in your case n_query is the number of new documents and n_indexed is the number of documents on which the fit method was called.
In your particular case, when you fit the model on 5 documents and then want to classify the 6th one, the X for prediction should have shape (1, 5), which can be computed as
kmed.predict(new_distance.iloc[[-1], :-1])  # last row: distances to the 5 fitted documents
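To make those shapes concrete, here is a small self-contained sketch with made-up points (assuming scikit-learn-extra's KMedoids and the precomputed metric described above):
import numpy as np
from sklearn_extra.cluster import KMedoids

rng = np.random.default_rng(0)
pts = rng.random((5, 2))  # 5 hypothetical documents in a 2-D space
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)  # 5x5 distance matrix

kmed = KMedoids(n_clusters=3, metric='precomputed', method='pam').fit(D)

new = rng.random((1, 2))  # a 6th document
d_new = np.linalg.norm(new - pts, axis=-1).reshape(1, -1)  # shape (1, 5)
print(kmed.predict(d_new))  # cluster assignment for the new document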
Here is my attempt; we must recompute the distance between the new point and the old ones each time. Note that KMedoids needs metric='precomputed' to treat the matrix as distances rather than features.
import pandas as pd
import numpy as np
from sklearn_extra.cluster import KMedoids
from sklearn.metrics import pairwise_distances

# dummy data for the trial: two 2-D points
df = pd.DataFrame({0: [0, 1], 1: [1, 2]})
# pairwise distance matrix of the training points
distance = pairwise_distances(df.values, df.values)
# fit the model on the precomputed distances
kmed = KMedoids(n_clusters=2, random_state=123, method='pam',
                metric='precomputed').fit(distance)
# distances from the new point to the two fitted points, shape (1, 2)
new_point = [2, 3]
distance = pairwise_distances(np.array(new_point).reshape(1, -1), df.values)
print(distance)
# predict the cluster of the new point from those distances
print(kmed.predict(distance))

Getting meaningful results from pandas.describe()

I called describe on one column of a dataframe and ended up with the following output:
count 1.048575e+06
mean 8.232821e+01
std 2.859016e+02
min 0.000000e+00
25% 3.000000e+00
50% 1.400000e+01
75% 6.000000e+01
max 8.599700e+04
What parameter do I pass to get meaningful integer values? What I mean is, when I check the SQL count it's about 43 million, and all the other values are also different. Can someone help me understand what this conversion means, and how do I get floats rounded to 2 decimal places? I'm new to Pandas.
You can directly use round() and pass the number of decimals you want as an argument:
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# setting the seed to create the dataframe
np.random.seed(25)
# Creating a 5 * 4 dataframe
df = pd.DataFrame(np.random.random([5, 4]), columns =["A", "B", "C", "D"])
# rounding describe
df.describe().round(2)
A B C D
count 5.00 5.00 5.00 5.00
mean 0.52 0.47 0.38 0.42
std 0.21 0.23 0.19 0.29
min 0.33 0.12 0.16 0.11
25% 0.41 0.37 0.28 0.19
50% 0.45 0.58 0.37 0.44
75% 0.56 0.59 0.40 0.52
max 0.87 0.70 0.68 0.84
DOCS
There are two ways to control the float output of pandas: set a global display option, or format a single result with apply.
# Option 1: set a global display format for all floats
pd.set_option('display.float_format', lambda x: '%.5f' % x)
# Option 2: format only this one result
df['X'].describe().apply("{0:.5f}".format)
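A self-contained sketch of both options (the Series here is made-up stand-in data; pd.reset_option restores the default afterwards):
import numpy as np
import pandas as pd

s = pd.Series(np.random.random(1_000_000) * 100)

# Option 1: change the global display format (affects all float output)
pd.set_option('display.float_format', '{:.2f}'.format)
print(s.describe())  # count now prints as 1000000.00, not 1.000000e+06
pd.reset_option('display.float_format')

# Option 2: format only this result
print(s.describe().apply('{:.2f}'.format))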

How can I repercentage a cell from several data points in Python Pandas?

I've been browsing different questions here on Stack Exchange but haven't figured out how to do what I need in Pandas. I think it'll ultimately be pretty simple!
I'm doing a task where a dataset has a bunch of products, and each product has a row for each of the stores it's located in. So, Product A will have individual lines for food, drugstore, Target, Walmart, etc. Then, its availability and the importance of that outlet are multiplied, and I need to repercentage that result to equal 100%.
Right now I'm doing it manually in Excel/Google Sheets, but that's annoying and tedious. I can tell how to get the sum total of column E per Product by using Groupby, but I can't figure out how to then make that number appear for each product so that each figure from column E can be divided into it.
Anyone have suggestions? [Link to an example of what the dataset looks like]
To get the sum to show up for every product, you want .transform('sum').
In one line:
df['Repercentaged'] = df.groupby('Product').Multiplied.transform(lambda x: x/x.sum())
But if you want to keep the Sum Column...
import pandas as pd
df['Sum'] = df.groupby('Product').Multiplied.transform('sum')
# Location Multiplied Product Sum
#0 Food 0.09 A 0.88
#1 Drugstore 0.21 A 0.88
#2 Walmart 0.35 A 0.88
#3 Target 0.23 A 0.88
#4 Food 0.13 B 0.73
#5 Drugstore 0.13 B 0.73
#6 Walmart 0.25 B 0.73
#7 Target 0.22 B 0.73
df['Repercentaged'] = df['Multiplied']/df['Sum']
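For a runnable end-to-end version, here is a sketch that rebuilds the small frame from the printed output above (the numbers are taken from that example):
import pandas as pd

df = pd.DataFrame({
    'Location': ['Food', 'Drugstore', 'Walmart', 'Target'] * 2,
    'Multiplied': [0.09, 0.21, 0.35, 0.23, 0.13, 0.13, 0.25, 0.22],
    'Product': ['A'] * 4 + ['B'] * 4,
})

# broadcast each product's total back onto its rows, then divide
df['Sum'] = df.groupby('Product').Multiplied.transform('sum')
df['Repercentaged'] = df['Multiplied'] / df['Sum']
print(df)  # each product's Repercentaged column now sums to 1.0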

Precision of sklearn.metrics classification_report

I would like to know if it is possible to get more digits after the decimal point with classification_report from sklearn (scikit-learn).
At the moment it looks like this:
             precision    recall  f1-score   support

          1       0.61      0.73      0.67     71194
          2       0.64      0.33      0.43     13877
          3       0.56      0.59      0.57     61591
          4       0.64      0.51      0.57     13187
          5       0.66      0.69      0.67     57530
          6       0.54      0.06      0.11      2391
          7       0.54      0.40      0.46     30223
avg / total       0.60      0.60      0.60    249993
I don't think it is possible with that method, but maybe someone has had the same idea.
I know that sklearn.metrics.precision_score exists, though the classification_report is such a nice way to display all the results at once.
Not possible according to the source code: see lines 819 and 830, where the format strings are hardcoded to %0.2f. If you really want it, just change it in your local file sklearn/metrics/metrics.py. Better yet, add an argument to classification_report that takes the desired precision and uses it, and submit your patch to the project!
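(One update worth noting: newer scikit-learn releases added a digits parameter to classification_report, so patching the source is no longer necessary. For example:)
from sklearn.metrics import classification_report

# toy labels, just to demonstrate the formatting
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]

# digits controls the number of decimals in the report (default is 2)
print(classification_report(y_true, y_pred, digits=4))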
