I am trying to understand and use the spectral clustering from sklearn.
Let us say we have X matrix input and we create a spectral clustering object as follows:
clustering = SpectralClustering(n_clusters=2,
                                assign_labels="discretize",
                                random_state=0)
Then, we call a fit_predict using the spectral cluster object.
clusters = clustering.fit_predict(X)
What confuses me is: when is 'the affinity matrix for X using the selected affinity' created? According to the documentation, the
fit_predict() method 'Performs clustering on X and returns cluster labels.' But it doesn't explicitly say that it also computes 'the affinity matrix for X using the selected affinity' before clustering.
I appreciate any help or tips.
As already implied in another answer, fit_predict is just a convenience method for returning the cluster labels. According to the documentation, fit
Creates an affinity matrix for X using the selected affinity, then applies spectral clustering to this affinity matrix.
while fit_predict
Performs clustering on X and returns cluster labels.
Here, Performs clustering on X should be understood as what is described for fit, i.e. Creates an affinity matrix [...].
It is not difficult to verify that calling fit_predict is equivalent to getting the labels_ attribute from the object after fit; using some dummy data, we have
from sklearn.cluster import SpectralClustering
import numpy as np
X = np.array([[1, 2], [1, 4], [10, 0],
              [10, 2], [10, 4], [1, 0]])
# 1st way - use fit and get the labels_
clustering = SpectralClustering(n_clusters=2,
                                assign_labels="discretize",
                                random_state=0)
clustering.fit(X)
clustering.labels_
# array([1, 1, 0, 0, 0, 1])
# 2nd way - using fit_predict
clustering2 = SpectralClustering(n_clusters=2,
                                 assign_labels="discretize",
                                 random_state=0)
clustering2.fit_predict(X)
# array([1, 1, 0, 0, 0, 1])
np.array_equal(clustering.labels_, clustering2.fit_predict(X))
# True
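As a side note, you can also check directly that fit has built the affinity matrix: after fitting, it is exposed as the affinity_matrix_ attribute (assuming a reasonably recent scikit-learn version):
clustering.affinity_matrix_.shape
# (6, 6) - one entry per pair of the 6 samples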
Looking at the source code of fit_predict(), it seems that it's just a convenience method: it literally calls fit() and returns the labels from the object.
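For illustration, here is a minimal sketch of what fit_predict boils down to (simplified; not the verbatim scikit-learn source, which lives in the ClusterMixin base class):
def fit_predict(self, X, y=None):
    # fit() creates the affinity matrix and runs spectral clustering on it
    self.fit(X)
    # ... and the labels computed during fit are simply returned
    return self.labels_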
It's my understanding that confusion matrices should show the TRUE classes in the columns and the PREDICTED classes in the rows. Therefore the sum of the columns should be equal to the value_counts() of the TRUE series.
I have provided an example here:
from sklearn.metrics import confusion_matrix
pred = [0, 0, 0, 1]
true = [1, 1, 1, 1]
confusion_matrix(true, pred)
Why does this give me the following output? Surely it should be the transpose of that?
array([[0, 0],
[3, 1]], dtype=int64)
The confusion probably arises because sklearn follows a different convention for the axes of the confusion matrix than the Wikipedia article does. So, to answer your question: it gives you the output in that specific format because sklearn expects you to read it in a specific way.
Here are the two different ways of writing the confusion matrix:
sklearn's way of reading/writing the confusion matrix: true labels in rows, and predicted labels in columns.
Wikipedia's way: the opposite of sklearn's, i.e. predicted labels in rows and true labels in columns.
In other words, scikit-learn's confusion matrix is simply the transpose of the Wikipedia one.
Reference: https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py
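To make the convention concrete, here is how to read the matrix from the question; transposing it recovers the Wikipedia orientation (a small sketch on the question's data):
from sklearn.metrics import confusion_matrix

pred = [0, 0, 0, 1]
true = [1, 1, 1, 1]
cm = confusion_matrix(true, pred)
# cm[i, j] counts samples with true label i that were predicted as label j:
# cm[1, 0] == 3 -> three samples of true class 1 predicted as class 0
# cm[1, 1] == 1 -> one sample of true class 1 predicted as class 1
cm_wikipedia = cm.T  # predicted labels in rows, true labels in columns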
It is possible to do what you wish using sklearn; just adapt the code below as appropriate (here predict and y_test stand for your predicted and true labels):
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1, figsize=(7, 4))
# Swapping the argument order (predictions first, true labels second)
# transposes the matrix, matching the relabeled axes below.
ConfusionMatrixDisplay(confusion_matrix(predict, y_test, labels=[1, 0]),
                       display_labels=[1, 0]).plot(values_format=".0f", ax=ax)
ax.set_xlabel("True Label")
ax.set_ylabel("Predicted Label")
plt.show()
Is there a way to extract the mapping procedure in sklearn.manifold.TSNE in python so that you can map new data into the reduced dimensional space?
Importantly, I mean without having to retrain on the new data as well.
For example say you trained a TSNE map as follows:
import numpy as np
from sklearn.manifold import TSNE
X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
X_embedded = TSNE(n_components=2).fit_transform(X)
As seen in the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
Can you extract the transformation so that you can map new data into the same space:
Y = np.array([[0, 0.8, 0.8], [0.1, 0, 1], [1.2, 0.2, 1], [1, 1.1, 1]])
Any help on this matter would be greatly appreciated!
tSNE is a non-linear, non-parametric embedding.
So there is no "closed form" way of updating it with new points. Even worse: adding new points may require existing points to move.
Because of this, making tSNE apply to new data would require substantial changes to the method, and it wouldn't be the original tSNE anymore.
Parametric t-SNE does have the option of embedding test data, but it is not available in sklearn (see the reference issue). Having said that, we should mention that it is implemented elsewhere.
I wrote a very simple scikit-learn decision tree to implement XOR:
from sklearn import tree
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
Y = [0, 0, 1, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
print(clf.predict([0,1]))
print(clf.predict([0,0]))
print(clf.predict([1,1]))
print(clf.predict([1,0]))
The predict calls generate warnings like this:
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17
and will raise ValueError in 0.19. Reshape your data either using
X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1)
if it contains a single sample.
I don't have a clear idea of what needs to change, or why. Please enlighten me!
Thank you in advance!
The input to clf.predict should be a 2D array. Thus, instead of writing
print(clf.predict([0,1]))
you need to write
print(clf.predict([[0,1]]))
The method operates on matrices (2D arrays), rather than vectors (1D arrays). As a convenience, the older code accepted a vector as a 1xN matrix. This led to usage errors as some users forgot which way a vector was oriented (1xN vs Nx1).
The suggestion tells you how to reshape your vector to the proper matrix shape. For constant vectors, just write them as matrices:
clf.predict([[0, 1]])
The "other direction" (wrong for this application) would be
clf.predict([[0], [1]])
As the warning message points out, you have a single sample to test. Thus you could either use reshape or fix the calls as follows:
from sklearn import tree
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
Y = [0, 0, 1, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
print(clf.predict([[0, 1]]))
print(clf.predict([[0, 0]]))
print(clf.predict([[1, 1]]))
print(clf.predict([[1, 0]]))
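Alternatively, following the reshape suggestion from the warning itself (a small sketch using numpy; clf is the classifier fitted above):
import numpy as np

sample = np.array([0, 1])                  # a single sample as a 1D vector
print(clf.predict(sample.reshape(1, -1)))  # reshaped to a 1x2 matrix -> array([1])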
I am trying to use the t-SNE algorithm in scikit-learn:
import numpy as np
from sklearn.manifold import TSNE
X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
model = TSNE(n_components=2, random_state=0)
np.set_printoptions(suppress=True)
model.fit_transform(X)
Output:
array([[ 0.00017599,  0.00003993],   #1
       [ 0.00009891,  0.00021913],
       [ 0.00018554, -0.00009357],
       [ 0.00009528, -0.00001407]])  #2
After that, I try to add some points whose coordinates are exactly the same as in the first array X to the existing model:
Y = np.array([[0, 0, 0], [1, 1, 1]])
model.fit_transform(Y)
Output:
array([[ 0.00017882,  0.00004002],   #1
       [ 0.00009546,  0.00022409]])  #2
But the coordinates in the second array are not equal to the first and last coordinates from the first array.
I understand that this is the expected behaviour, but how can I add new points to the model and get the same output coordinates for the same input coordinates?
Also, I still need to be able to find the closest points even after appending new points.
Quoting the author of t-SNE from here: https://lvdmaaten.github.io/tsne/
Once I have a t-SNE map, how can I embed incoming test points in that map?
t-SNE learns a non-parametric mapping, which means that it does not learn an explicit function that maps data from the input space to the map. Therefore, it is not possible to embed test points in an existing map (although you could re-run t-SNE on the full dataset). A potential approach to deal with this would be to train a multivariate regressor to predict the map location from the input data. Alternatively, you could also make such a regressor minimize the t-SNE loss directly, which is what I did in this paper.
Also, this answer on stats.stackexchange.com contains further ideas, as well as a link to a very nice and very fast recent Python implementation of t-SNE, https://github.com/pavlin-policar/openTSNE, which allows embedding of new points out of the box, and a link to https://github.com/berenslab/rna-seq-tsne/.
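For completeness, here is a minimal sketch of embedding new points with openTSNE (assuming the openTSNE package is installed; the parameter values are illustrative, and perplexity must be kept small for such a tiny toy dataset):
import numpy as np
from openTSNE import TSNE

X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
Y = np.array([[0, 0, 0], [1, 1, 1]])

embedding = TSNE(n_components=2, perplexity=2, random_state=0).fit(X)
Y_embedded = embedding.transform(Y)  # maps new points into the existing space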
In the new scikit-learn version, there is a new function called apply() in gradient boosting. I'm really confused about it.
Does it work like the GBDT + LR method that Facebook has used?
If so, how can we make it work like GBDT + LR?
From the scikit-learn documentation:
apply(X) Apply trees in the ensemble to X, return leaf indices
This function takes input data X, and each data point (x) in it is passed through each decision tree in the ensemble. After application, data point x has associated with it the leaf it ends up at in each tree. This leaf has its associated classes (1 if binary).
apply(X) returns the above information, which is of the form [n_samples, n_estimators, n_classes].
Thus, the apply(X) function doesn't really have much to do with the Gradient Boosted Decision Tree + Logistic Regression (GBDT+LR) classification and feature-transform method. It is a function for applying data to an existing classification model.
I'm sorry if I have misunderstood you in any way; a few grammar/syntax errors in your question made it harder to decipher.
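To see that shape concretely, here is a quick check on toy data (the numbers are illustrative):
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=50, random_state=0)
gb = GradientBoostingClassifier(n_estimators=5, random_state=0).fit(X, y)
print(gb.apply(X).shape)  # (50, 5, 1) -> [n_samples, n_estimators, n_classes]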
apply(X) returns the raw indices of tree leaves. I think you need to transform the discrete indices into a one-hot encoding, and then you can perform the LR step.
For example, apply(X) might return
[
    [[1], [2], [3], [4]],
    [[2], [3], [4], [5]],
    [[3], [4], [5], [6]]
]
where n_samples=3, n_estimators=4, and n_classes=1.
You must first know the node count of each tree used in the GBM classifier. As we know, GBM uses sklearn's decision tree regressor internally; according to the decision tree regressor's apply function documentation, we get:
X_leaves : array_like, shape = [n_samples,]
For each datapoint x in X, return the index of the leaf x
ends up in. Leaves are numbered within
[0; self.tree_.node_count), possibly with gaps in the
numbering.
As a result, you need to pad zeros into the other indices. Taking the above example, if the first tree has tree_.node_count = 5, then the first column of the three samples should be transformed into:
[
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0]
]
Process the other columns correspondingly, and then you get what you want. Hope it helps!
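Putting it all together, here is a hedged sketch of the GBDT + LR pipeline described above; instead of padding zeros by hand, it uses sklearn's OneHotEncoder to one-hot the leaf indices (variable names and parameters are illustrative):
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=200, random_state=0)
X_train, y_train, X_test, y_test = X[:100], y[:100], X[100:], y[100:]

gbdt = GradientBoostingClassifier(n_estimators=10, random_state=0)
gbdt.fit(X_train, y_train)

# apply() returns leaf indices of shape [n_samples, n_estimators, n_classes];
# for binary classification n_classes is 1, so drop that axis
leaves_train = gbdt.apply(X_train)[:, :, 0]
leaves_test = gbdt.apply(X_test)[:, :, 0]

# one-hot encode each tree's leaf index, then fit logistic regression on it
enc = OneHotEncoder(handle_unknown="ignore")
lr = LogisticRegression(max_iter=1000)
lr.fit(enc.fit_transform(leaves_train), y_train)
print(lr.score(enc.transform(leaves_test), y_test))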