What can the apply() function in scikit-learn do? - python

In the new version of scikit-learn, there is a new function called apply() in gradient boosting, and I'm really confused about it.
Does it work like the GBDT + LR method that Facebook has used?
If it does, how can we make it work like GBDT + LR?

From the scikit-learn documentation:
apply(X) Apply trees in the ensemble to X, return leaf indices
This function takes input data X, and each data point x in it is passed through every decision tree in the ensemble. Afterwards, x has associated with it the leaf it ends up at in each tree. The last dimension of the result indexes the classes (it has size 1 for binary classification).
apply(X) returns this information as an array of shape [n_samples, n_estimators, n_classes].
Thus, the apply(X) function doesn't really have much to do with the Gradient Boosted Decision Tree + Logistic Regression (GBDT+LR) classification and feature-transform method. It is a function for applying data to an existing classification model.
I'm sorry if I have misunderstood you in any way, though a few grammar/syntax errors in your question made it harder to decipher.

apply(X) returns the raw indices of the tree leaves. I think you need to transform these discrete indices into a one-hot encoding, and then you can perform the LR step.
For example, apply(X) would return
[
[[1], [2], [3], [4]],
[[2], [3], [4], [5]],
[[3], [4], [5], [6]]
]
where n_samples = 3, n_estimators = 4, and n_classes = 1.
You must first know the number of nodes in each tree used in the GBM classifier. GBM uses scikit-learn's decision tree regressor, and according to the documentation of that regressor's apply function, we get:
X_leaves : array_like, shape = [n_samples,]
For each datapoint x in X, return the index of the leaf x
ends up in. Leaves are numbered within
[0; self.tree_.node_count), possibly with gaps in the
numbering.
As a result, you need to pad the other indices with zeros. Taking the above example, if the first tree has tree_.node_count = 5, then the first column of the three samples should be transformed into:
[
[0, 1, 0, 0, 0],
[0, 0, 1, 0, 0],
[0, 0, 0, 1, 0]
]
Process the other columns correspondingly and you will get what you want. Hope this helps!
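The one-hot step described above can be sketched end to end as follows. This is a minimal illustration, not the original poster's code: the synthetic data from make_classification, the choice of n_estimators=10, the 50/50 split, and the helper name predict_proba_gbdt_lr are all assumptions made for the example. It uses OneHotEncoder rather than manual zero padding, which yields an equivalent encoding without needing tree_.node_count.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Synthetic binary classification data (assumption: stands in for real data).
X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Fit the trees and the linear model on disjoint halves to limit overfitting.
X_gbdt, X_lr, y_gbdt, y_lr = train_test_split(
    X, y, test_size=0.5, random_state=0
)

gbdt = GradientBoostingClassifier(n_estimators=10, random_state=0)
gbdt.fit(X_gbdt, y_gbdt)

# apply() returns shape (n_samples, n_estimators, n_classes);
# for binary classification the last axis has size 1, so drop it.
leaves = gbdt.apply(X_lr)[:, :, 0]

# One-hot encode the leaf index of every tree; leaves not seen while
# fitting the encoder are encoded as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
leaf_features = encoder.fit_transform(leaves)

lr = LogisticRegression(max_iter=1000)
lr.fit(leaf_features, y_lr)


def predict_proba_gbdt_lr(X_new):
    """Hypothetical helper: probability of class 1 from the GBDT + LR stack."""
    new_leaves = gbdt.apply(X_new)[:, :, 0]
    return lr.predict_proba(encoder.transform(new_leaves))[:, 1]


probs = predict_proba_gbdt_lr(X[:5])
```

Each row of leaf_features contains exactly one active entry per tree, which is the same structure as the padded matrix shown in the answer.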

Related

Sklearn: fit features as lists into a model

I want to build a classification model where I want to enter the features as lists. Is it possible to enter the features as lists?
e.g. X=[[0:300], [0:300], [0:300], [0:300]]
I have 38 features and every single feature consists of a list with 300 values.
Is it possible to enter the features as lists?
Yes.
But they will wind up as an array (perhaps within a dataframe) by the time .fit() sees them, for example:
>>> np.array([range(3), range(3)])
array([[0, 1, 2],
       [0, 1, 2]])
or the equivalent np.array([list(range(3)), list(range(3))]).
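As a rough sketch of how this could look for the 38-feature case: the example below assumes that each of the 300 values in a feature list belongs to a different sample, which the question does not state. The random data, the labels, and the choice of LogisticRegression are placeholders. Since scikit-learn expects shape (n_samples, n_features), the list of feature lists is transposed first.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data: 38 features, each given as a list of 300 values
# (assumption: one value per sample).
features = [rng.normal(size=300).tolist() for _ in range(38)]
labels = rng.integers(0, 2, size=300).tolist()

# Transpose from (n_features, n_samples) to (n_samples, n_features).
X = np.array(features).T

clf = LogisticRegression(max_iter=1000).fit(X, labels)

# Plain nested lists are accepted at prediction time as well.
predictions = clf.predict(X[:3].tolist())
```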

Re-calculate similarity ranking/sorting without re-sorting

I have some code which calculates the nearest neighbors amongst some vectors (values).
However, the values of these vectors are dependent on weights. Each column of the vectors has a different weight at every iteration.
Just for the sake of the example, in the code below I try to find, each time, the nearest neighbor of the last vector (values[3]).
That's a very simplified version of my code:
from sklearn.neighbors import NearestNeighbors
knn = NearestNeighbors(n_neighbors=1)
values = [
[2, 5, 1],
[4, 2, 3],
[1, 5, 2],
[4, 5, 4]
]
weights = [
[1, 3, 1],
[0.5, 2, 1],
[3, 1, 2]
]
# weights set No1
new_values = []
for line in values:
    new_values.append([a*b for a,b in zip(line,weights[0])])
knn.fit(new_values)
# kneighbors expects a 2D array, so the query point is wrapped in a list
print(knn.kneighbors([new_values[3]]))
# weights set No2
new_values = []
for line in values:
    new_values.append([a*b for a,b in zip(line,weights[1])])
knn.fit(new_values)
print(knn.kneighbors([new_values[3]]))
# weights set No3
new_values = []
for line in values:
    new_values.append([a*b for a,b in zip(line,weights[2])])
knn.fit(new_values)
print(knn.kneighbors([new_values[3]]))
(Obviously I could use a for loop over the different weight sets, but I just wanted to point out the repetition.)
My question is, is there any way that I can avoid using the KNN 3 times but just use it once at the beginning to do the initial similarity ranking/sorting and then just do some re-calculations?
In different words, is there any way to reduce the computation complexity of this code in terms of calling the KNN fewer times?
PS
I know that there are KNN implementations which are much faster than the ScikitLearn one but that's not really the point; the point is more on using KNN just once instead of N=3 times or something like that.
Assuming "calling the KNN fewer times" means the number of times the KNN is fit, yes, it's possible. If it means the number of times kneighbors is invoked, that might be difficult, because relative distances aren't preserved under affine transformations.
This solution runs in O(wk log n) time compared to the original O(wn) time, with w being the number of weight sets.
What you're doing is:
1. taking the input points
2. scaling their dimensions (projecting the input points into a new coordinate space)
3. building a KNN model from the scaled inputs
4. classifying the target based on the scaled inputs.
However, consider:
1. taking the input points
2. building a KNN model from the unscaled inputs
3. inverse scaling the target point (projecting the target into the original coordinate space)
4. classifying the inverse-scaled target based on the inputs.
With this process, steps 1 and 2 can be reused for each target point. Weights with value 0 will require special handling.
This would look something like:
from sklearn.neighbors import NearestNeighbors
knn = NearestNeighbors(n_neighbors=1, algorithm="kd_tree")
values = [
[2, 5, 1],
[4, 2, 3],
[1, 5, 2],
[4, 5, 4]
]
weights = [
[1, 3, 1],
[0.5, 2, 1],
[3, 1, 2]
]
targets = [
[4, 15, 4], # values[3] * weights[0]
[2.0, 10, 4], # values[3] * weights[1]
[12, 5, 8] # values[3] * weights[2]
]
knn.fit(values)
# weights set No1
print(knn.kneighbors([[a/b for a, b in zip(targets[0], weights[0])]]))
# weights set No2
print(knn.kneighbors([[a/b for a, b in zip(targets[1], weights[1])]]))
# weights set No3
print(knn.kneighbors([[a/b for a, b in zip(targets[2], weights[2])]]))
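Because each column is scaled by a different weight, the nearest neighbor can change from one weight set to the next, so it can be worth checking any single-fit shortcut against a direct computation. The following is a brute-force NumPy reference, not part of the original answer and not a KNN index: it computes the weighted Euclidean distances directly, with no fitting at all. The helper name weighted_nearest is hypothetical, and the target itself is excluded so that the result is a genuine neighbor.

```python
import numpy as np

values = np.array([
    [2, 5, 1],
    [4, 2, 3],
    [1, 5, 2],
    [4, 5, 4],
], dtype=float)
weights = np.array([
    [1, 3, 1],
    [0.5, 2, 1],
    [3, 1, 2],
], dtype=float)
target_idx = 3  # find the neighbor of the last vector, as in the question


def weighted_nearest(points, w, idx):
    """Hypothetical helper: index of the nearest point to points[idx]
    under per-column weights w, excluding points[idx] itself."""
    # Scaling the differences by w equals scaling both points by w.
    diffs = (points - points[idx]) * w
    dists = np.linalg.norm(diffs, axis=1)
    dists[idx] = np.inf  # never return the target as its own neighbor
    return int(np.argmin(dists))


nearest = [weighted_nearest(values, w, target_idx) for w in weights]
print(nearest)
```

This costs one pass over the points per weight set, which is fine for small inputs and gives a ground truth to compare against.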

What is the difference between fit() and fit_predict() in SpectralClustering

I am trying to understand and use the spectral clustering from sklearn.
Let us say we have X matrix input and we create a spectral clustering object as follows:
clustering = SpectralClustering(n_clusters=2,
assign_labels="discretize",
random_state=0)
Then, we call a fit_predict using the spectral cluster object.
clusters = clustering.fit_predict(X)
What confuses me is when 'the affinity matrix for X using the selected affinity' is created. As per the documentation, the fit_predict() method 'Performs clustering on X and returns cluster labels.' But it doesn't explicitly say that it also computes 'the affinity matrix for X using the selected affinity' before clustering.
I appreciate any help or tips.
As already implied in another answer, fit_predict is just a convenience method in order to return the cluster labels. According to the documentation, fit
Creates an affinity matrix for X using the selected affinity, then applies spectral clustering to this affinity matrix.
while fit_predict
Performs clustering on X and returns cluster labels.
Here, Performs clustering on X should be understood as what is described for fit, i.e. Creates an affinity matrix [...].
It is not difficult to verify that calling fit_predict is equivalent to getting the labels_ attribute from the object after fit; using some dummy data, we have
from sklearn.cluster import SpectralClustering
import numpy as np
X = np.array([[1, 2], [1, 4], [10, 0],
[10, 2], [10, 4], [1, 0]])
# 1st way - use fit and get the labels_
clustering = SpectralClustering(n_clusters=2,
assign_labels="discretize",
random_state=0)
clustering.fit(X)
clustering.labels_
# array([1, 1, 0, 0, 0, 1])
# 2nd way - using fit_predict
clustering2 = SpectralClustering(n_clusters=2,
assign_labels="discretize",
random_state=0)
clustering2.fit_predict(X)
# array([1, 1, 0, 0, 0, 1])
np.array_equal(clustering.labels_, clustering2.fit_predict(X))
# True
Looking at the source code of fit_predict(), it seems that it's just a convenience method: it literally just calls fit() and returns the labels from the object.
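One way to confirm that the affinity matrix is built during fit_predict is to inspect the fitted estimator's affinity_matrix_ attribute afterwards. A minimal sketch, reusing the dummy data from the answer above and the default affinity:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

X = np.array([[1, 2], [1, 4], [10, 0],
              [10, 2], [10, 4], [1, 0]])

clustering = SpectralClustering(n_clusters=2,
                                assign_labels="discretize",
                                random_state=0)
labels = clustering.fit_predict(X)

# The affinity matrix is available after fit_predict, one row and
# one column per sample.
affinity = clustering.affinity_matrix_
print(affinity.shape)

# fit_predict returns the same labels that are stored on the estimator.
print(np.array_equal(labels, clustering.labels_))
```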

Adding new points to the t-SNE model

I am trying to use the t-SNE algorithm in scikit-learn:
import numpy as np
from sklearn.manifold import TSNE
X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
model = TSNE(n_components=2, random_state=0)
np.set_printoptions(suppress=True)
model.fit_transform(X)
Output:
array([[ 0.00017599, 0.00003993], #1
[ 0.00009891, 0.00021913],
[ 0.00018554, -0.00009357],
[ 0.00009528, -0.00001407]]) #2
After that I try to add some points with the coordinates exactly like in the first array X to the existing model:
Y = np.array([[0, 0, 0], [1, 1, 1]])
model.fit_transform(Y)
Output:
array([[ 0.00017882, 0.00004002], #1
[ 0.00009546, 0.00022409]]) #2
But the coords in the second array are not equal to the first and last coords from the first array.
I understand that this is the correct behaviour, but how can I add new coords to the model and get the same output coords for the same input coords?
Also, I still need to be able to get the closest points even after appending new points.
Quoting the author of t-SNE from here: https://lvdmaaten.github.io/tsne/
Once I have a t-SNE map, how can I embed incoming test points in that map?
t-SNE learns a non-parametric mapping, which means that it does not learn an explicit function that maps data from the input space to the map. Therefore, it is not possible to embed test points in an existing map (although you could re-run t-SNE on the full dataset). A potential approach to deal with this would be to train a multivariate regressor to predict the map location from the input data. Alternatively, you could also make such a regressor minimize the t-SNE loss directly, which is what I did in this paper.
Also, this answer on stats.stackexchange.com contains ideas and a link to
a very nice and very fast recent Python implementation of t-SNE https://github.com/pavlin-policar/openTSNE that allows embedding of new points out of the box
and a link to https://github.com/berenslab/rna-seq-tsne/.
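The regressor idea from the quote can be sketched with scikit-learn alone. This is only an illustration under stated assumptions: the random data, the perplexity value, and the choice of a 1-nearest-neighbor regressor are mine, and this is not the parametric method from the author's paper. With a single neighbor, an input point that was part of the original data is mapped back to exactly its original map location, and an unseen point is placed at the location of its closest training point.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))  # placeholder data

# Perplexity must be smaller than the number of samples.
embedding = TSNE(n_components=2, perplexity=10,
                 random_state=0).fit_transform(X)

# Learn a mapping from input space to the existing map.
regressor = KNeighborsRegressor(n_neighbors=1).fit(X, embedding)

# Points that were already in X land on their original map coordinates.
Y = X[[0, -1]]
Y_embedded = regressor.predict(Y)
```

A smoother regressor would interpolate between map locations instead, at the cost of no longer reproducing the original coordinates exactly.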

Balanced Error Rate as metric function

I am trying to solve a binary classification problem with the sequential model from Keras and have to meet a given Balanced Error Rate (BER).
So I thought it would be a good idea to use the BER instead of accuracy as a metric.
My custom metric implementation for BER looks like this:
def balanced_error_rate(y_true, y_pred):
    labels = theano.shared(np.asmatrix([[0, 1]], dtype='int8'))
    label_matrix = K.repeat_elements(labels, K.shape(y_true)[0], axis=1)
    true_matrix = K.repeat_elements(y_true, K.shape(labels)[0], axis=1)
    pred_matrix = K.repeat_elements(K.round(y_pred), K.shape(labels)[0], axis=1)
    class_lens = K.sum(K.equal(label_matrix, true_matrix), axis=1)
    return K.sum(K.sum(class_lens - K.sum(K.equal(label_matrix, K.not_equal(true_matrix,pred_matrix)), axis=1), axis=0)/class_lens, axis=0)/2
The idea is to create a matrix from the available labels and compare it to the input data (then sum the ones) to get the number of elements of this label....
My problem is that:
> K.shape(y_true)
Shape.0
> Typeinfo:
> type(y_true)
<class 'theano.tensor.var.TensorVariable'>
> type(K.shape(y_true))
<class 'theano.tensor.var.TensorVariable'>
...and I can't find out why.
I am now looking for:
A way to get the array dimensions / an explanation why shape acts like it does / the reason why y_true seems to have 0 dimensions
or
A method to create a tensor matrix with a given width/height by repeating a given row/column vector.
or
A smarter solution to calculate the BER using tensor functions.
A way to get the array dimensions / an explanation why shape acts like it does / the reason why y_true seems to have 0 dimensions
The deal with print and abstraction libraries like Theano is that you usually do not get the values but a representation of the value. So if you do
print(foo.shape)
You won't get the actual shape but a representation of the operation that is done at runtime. Since this is all computed on an external device the computation is not run immediately but only after creating a function with appropriate inputs (or calling foo.shape.eval()).
Another way to print the value is to use theano.printing.Print when using the value, e.g.:
shape = theano.printing.Print('shape of foo')(foo.shape)
# use shape (not foo.shape!)
A method to create a tensor matrix with a given width/height by repeating a given row/column vector.
See theano.tensor.repeat for that. Example in numpy (usage is quite similar):
>>> x
array([[1, 2, 3]])
>>> x.repeat(3, axis=0)
array([[1, 2, 3],
[1, 2, 3],
[1, 2, 3]])
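As for a simpler way to compute the BER: for binary labels it is the average of the two per-class error rates, so no label matrix is needed. The sketch below is a plain NumPy reference rather than a Keras metric, written under the assumptions that labels are 0/1, that predictions are thresholded at 0.5, and that both classes occur in the batch. The same mean-over-mask idea should carry over to backend tensor operations, but that translation is not shown here.

```python
import numpy as np


def balanced_error_rate(y_true, y_pred, threshold=0.5):
    """Average of the false negative rate and the false positive rate.

    Assumes binary 0/1 labels and that both classes are present.
    """
    positives = np.asarray(y_true).astype(bool)
    predicted = np.asarray(y_pred) >= threshold
    # Fraction of true positives that were predicted negative.
    fn_rate = np.mean(~predicted[positives])
    # Fraction of true negatives that were predicted positive.
    fp_rate = np.mean(predicted[~positives])
    return 0.5 * (fn_rate + fp_rate)


# Three negatives (one misclassified) and one positive (correct).
ber = balanced_error_rate([0, 0, 0, 1], [0.1, 0.2, 0.8, 0.9])
print(ber)
```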
