Sklearn: fit features as lists into a model - python

I want to build a classification model where I want to enter the features as lists. Is it possible to enter the features as lists?
e.g. X = [[0:300], [0:300], [0:300], [0:300]]
I have 38 features and every single feature consists of a list with 300 values.

Is it possible to enter the features as lists?
Yes.
But they will wind up as an array (perhaps within a DataFrame) by the time .fit() sees them, for example:
>>> np.array([range(3), range(3)])
array([[0, 1, 2],
       [0, 1, 2]])
or the equivalent np.array([list(range(3)), list(range(3))]).
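To make that concrete, here is a minimal hedged sketch: the 38-features-of-300-values shape comes from the question, while the sample count, the random data, and the classifier choice are stand-ins for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

n_samples = 50  # stand-in value; not from the question
# each sample: 38 feature lists of 300 values, flattened to one row of 38*300 columns
X_lists = [[list(np.random.rand(300)) for _ in range(38)] for _ in range(n_samples)]
X = np.array(X_lists).reshape(n_samples, -1)   # the lists become a (50, 11400) array
y = np.random.randint(0, 2, size=n_samples)    # stand-in binary labels

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)  # .fit() sees a plain 2D array, however the features started out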

Randomly choose index based on condition in numpy

Let's say I have a 2D numpy array with 0 and 1 as values. I want to randomly pick an index that contains a 1. Is there an efficient way to do this using numpy?
I achieved it in pure Python, but it's too slow.
Example input:
[[0, 1], [1, 0]]
output:
(0, 1)
EDIT:
For clarification: my function takes a 2D numpy array with values belonging to {0, 1}. The output should be a tuple (a 2D index) of a randomly (uniformly) picked value from the given array that is equal to 1.
EDIT2:
Using Paul H's suggestion, I came up with this:
nonzero = np.nonzero(a)
return random.choice(list(zip(*nonzero)))
But it only works with Python's random.choice, not numpy's. Is there a way to optimise it further?
It's easier to get all the non-zero coordinates and sample from there:
xs,ys = np.where([[0, 1], [1, 0]])
# randomly pick a number:
idx = np.random.choice(np.arange(len(xs)))
# output:
out = xs[idx], ys[idx]
You may try argwhere and permutation:
a = np.array([[0, 1], [1, 0]])
b = np.argwhere(a)
tuple(np.random.permutation(b)[0])
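One more all-numpy variant, as a sketch: sample a flat index among the nonzero entries, then convert it back to a 2D coordinate.
import numpy as np

a = np.array([[0, 1], [1, 0]])
flat = np.flatnonzero(a)               # flat indices of the entries equal to 1
pick = np.random.choice(flat)          # uniform pick among them
out = np.unravel_index(pick, a.shape)  # back to a (row, col) tuple
print(out)                             # e.g. (0, 1)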

What is the difference between these two lines of code? ML arrays

So I started learning ML, and I need to code in Python. I am following a tutorial on backpropagation, in which I am presented with the problem of rewriting an algorithm to use matrix multiplication so that it runs faster. The following code is from a function that updates the biases and weights. I needed to change it so that, instead of processing each input-output pair one at a time, it processes the entire input and output matrices at once. The mini-batch is a list with 10 elements. Each element in the list is a pair of arrays: the input, a 784x1 matrix, and the output, a 10x1 matrix. I tried to group the inputs into a 10x784x1 array and then convert it to a 784x10 array. I did it in two ways, as shown in the following code:
# for clarity - batch[0] is the first item in each element of mini_batch, which is, as noted, the input array.
# mini_batch[0][0] is the input array of the first element in mini_batch, a 784x1 array as mentioned earlier
inputs3 = np.array([batch[0] for batch in mini_batch]).reshape(len(mini_batch[0][0]), len(mini_batch))
inputs2 = np.array([batch[0].ravel() for batch in mini_batch]).transpose()
Both inputs3 and inputs2 are 784x10 arrays, but for some reason they are not equal. I don't understand why, so I would really appreciate it if someone could explain the difference.
>>> A = np.array([[1,2,3],[4,5,6]])
>>> A.reshape(3,2)
array([[1, 2],
       [3, 4],
       [5, 6]])
>>> A.transpose()
array([[1, 4],
       [2, 5],
       [3, 6]])
From this short example you can see that A.transpose() != A.reshape(3,2).
Imagine a blank matrix with dimensions 3x2. A.reshape(3,2) reads values from A (a 2x3 matrix) left to right, starting from the top row, and stores them in the blank matrix in that order. A.transpose(), on the other hand, swaps the axes, so row i of A becomes column i of the result. The two operations therefore place the same values in different positions.
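Applied to the shapes from the question, a small sketch of the difference (random stand-in data plays the role of the 784x1 inputs):
import numpy as np

mini_batch = [(np.random.rand(784, 1), np.random.rand(10, 1)) for _ in range(10)]

inputs3 = np.array([b[0] for b in mini_batch]).reshape(784, 10)     # re-reads row-major
inputs2 = np.array([b[0].ravel() for b in mini_batch]).transpose()  # swaps the axes

print(np.array_equal(inputs3, inputs2))                         # False
print(np.array_equal(inputs2[:, 0], mini_batch[0][0].ravel()))  # True: column j is input j
Only the transpose version puts each input vector in its own column, which is what the batched update needs.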

Adding new points to the t-SNE model

I am trying to use the t-SNE algorithm from scikit-learn:
import numpy as np
from sklearn.manifold import TSNE
X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
model = TSNE(n_components=2, random_state=0)
np.set_printoptions(suppress=True)
model.fit_transform(X)
Output:
array([[ 0.00017599,  0.00003993],   #1
       [ 0.00009891,  0.00021913],
       [ 0.00018554, -0.00009357],
       [ 0.00009528, -0.00001407]])  #2
After that I try to add some points, with coordinates exactly as in the first array X, to the existing model:
Y = np.array([[0, 0, 0], [1, 1, 1]])
model.fit_transform(Y)
Output:
array([[ 0.00017882,  0.00004002],   #1
       [ 0.00009546,  0.00022409]])  #2
But the coordinates in the second array are not equal to the first and last coordinates of the first array.
I understand that this is the expected behaviour, but how can I add new points to the model and get the same output coordinates for the same input coordinates?
Also, I still need to find the closest points even after appending new points.
Quoting the author of t-SNE from here: https://lvdmaaten.github.io/tsne/
Once I have a t-SNE map, how can I embed incoming test points in that map?
t-SNE learns a non-parametric mapping, which means that it does not learn an explicit function that maps data from the input space to the map. Therefore, it is not possible to embed test points in an existing map (although you could re-run t-SNE on the full dataset). A potential approach to deal with this would be to train a multivariate regressor to predict the map location from the input data. Alternatively, you could also make such a regressor minimize the t-SNE loss directly, which is what I did in this paper.
Also, this answer on stats.stackexchange.com contains further ideas, a link to a very nice and very fast recent Python implementation of t-SNE, https://github.com/pavlin-policar/openTSNE, that allows embedding of new points out of the box, and a link to https://github.com/berenslab/rna-seq-tsne/.
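As a hedged sketch of that out-of-the-box embedding, using the fit/transform API documented in the linked openTSNE repository (the data and shapes here are stand-ins):
import numpy as np
from openTSNE import TSNE

X = np.random.rand(100, 10)  # training data (stand-in)
Y = np.random.rand(5, 10)    # new points to place in the existing map

embedding = TSNE(n_components=2, random_state=0).fit(X)
new_points = embedding.transform(Y)  # embeds Y into the map learned from X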

Balanced Error Rate as metric function

I am trying to solve a binary classification problem with the sequential model from Keras and have to meet a given Balanced Error Rate (BER). So I thought it would be a good idea to use the BER instead of accuracy as a metric.
My custom metric implementation for BER looks like this:
def balanced_error_rate(y_true, y_pred):
    labels = theano.shared(np.asmatrix([[0, 1]], dtype='int8'))
    label_matrix = K.repeat_elements(labels, K.shape(y_true)[0], axis=1)
    true_matrix = K.repeat_elements(y_true, K.shape(labels)[0], axis=1)
    pred_matrix = K.repeat_elements(K.round(y_pred), K.shape(labels)[0], axis=1)
    class_lens = K.sum(K.equal(label_matrix, true_matrix), axis=1)
    return K.sum(K.sum(class_lens - K.sum(K.equal(label_matrix, K.not_equal(true_matrix, pred_matrix)), axis=1), axis=0)/class_lens, axis=0)/2
The idea is to create a matrix from the available labels and compare it to the input data (then sum the ones from the comparison) to get the number of elements of each label.
My problem is that:
> K.shape(y_true)
Shape.0

Type info:
> type(y_true)
<class 'theano.tensor.var.TensorVariable'>
> type(K.shape(y_true))
<class 'theano.tensor.var.TensorVariable'>
...and I can't find out why.
I am now looking for:
A way to get the array dimensions / an explanation of why shape acts like it does / the reason why y_true seems to have 0 dimensions
or
A method to create a tensor matrix with a given width/height by repeating a given row/column vector.
or
A smarter solution to calculate the BER using tensor functions.
A way to get the array dimensions / an explanation of why shape acts like it does / the reason why y_true seems to have 0 dimensions
The deal with printing in abstraction libraries like Theano is that you usually do not get the values but a representation of the value. So if you do
print(foo.shape)
you won't get the actual shape but a representation of the operation that is performed at runtime. Since this may all be computed on an external device, the computation is not run immediately, but only after creating a function with appropriate inputs (or after calling foo.shape.eval()).
Another way to print the value is to use theano.printing.Print when using the value, e.g.:
shape = theano.printing.Print('shape of foo')(foo.shape)
# use shape (not foo.shape!)
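A minimal runnable sketch of that pattern (the variable names are illustrative):
import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')
shape = theano.printing.Print('shape of x')(x.shape)
f = theano.function([x], shape)  # the print happens only when f is actually run
f(np.ones((2, 3)))               # prints the runtime shape, e.g. [2 3]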
A method to create a tensor matrix with a given width/height by repeating a given row/column vector.
See theano.tensor.repeat for that. Example in numpy (usage is quite similar):
>>> x
array([[1, 2, 3]])
>>> x.repeat(3, axis=0)
array([[1, 2, 3],
       [1, 2, 3],
       [1, 2, 3]])
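As for the third point, a smarter way to calculate the BER with tensor functions: here is a minimal sketch using Keras backend ops, assuming binary labels in {0, 1} and a sigmoid output. It uses BER = 1 - (sensitivity + specificity)/2, with K.epsilon() guarding against division by zero when a class is absent from a batch.
from keras import backend as K

def balanced_error_rate(y_true, y_pred):
    y_pred = K.round(y_pred)                    # threshold probabilities at 0.5
    true_pos = K.sum(y_true * y_pred)
    true_neg = K.sum((1 - y_true) * (1 - y_pred))
    pos = K.sum(y_true)                         # number of positive samples
    neg = K.sum(1 - y_true)                     # number of negative samples
    sensitivity = true_pos / (pos + K.epsilon())
    specificity = true_neg / (neg + K.epsilon())
    return 1 - (sensitivity + specificity) / 2  # BER = 1 - balanced accuracy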

What can the function apply() in scikit-learn do?

In the new version of scikit-learn, there is a new function called apply() in gradient boosting. I'm really confused about it.
Is it like the GBDT + LR method that Facebook has used?
If so, how can we make it work like GBDT + LR?
From the scikit-learn documentation:
apply(X): Apply trees in the ensemble to X, return leaf indices.
This function takes input data X, and each data point (x) in it is applied to each tree in the ensemble. After application, data point x has associated with it the leaf it ends up in for each decision tree. This leaf has its associated classes (1 if binary).
apply(X) returns the above information, which is of the form [n_samples, n_estimators, n_classes].
Thus, the apply(X) function doesn't really have much to do with the Gradient Boosted Decision Tree + Logistic Regression (GBDT + LR) classification and feature-transform method. It is a function for applying data to an existing classification model.
I'm sorry if I have misunderstood you in any way, though a few grammar/syntax errors in your question made it harder to decipher.
apply(X) returns raw indices of tree leaves. I think you need to transform the discrete indices into one-hot-encoded form, and then you can perform the LR step.
For example, apply(X) would return
[
    [[1], [2], [3], [4]],
    [[2], [3], [4], [5]],
    [[3], [4], [5], [6]]
]
where n_samples = 3, n_estimators=4, and n_classes=1.
You must first know the node count of each tree used in the GBM classifier. As we know, GBM uses sklearn's decision tree regressor under the hood; according to the apply function of the sklearn decision tree regressor, we get:
X_leaves : array_like, shape = [n_samples,]
    For each datapoint x in X, return the index of the leaf x ends up in.
    Leaves are numbered within [0; self.tree_.node_count), possibly with
    gaps in the numbering.
As a result, you need to pad zeros at the other indices. Taking the above example, if the first tree has tree_.node_count = 5, then the first column of the three samples should be transformed into:
[
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0]
]
Process the other columns correspondingly, and you will get what you want. Hope it helps!
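Putting both answers together, a hedged end-to-end sketch of the GBDT + LR pipeline: sklearn's OneHotEncoder stands in for the manual zero-padding above, and the dataset and hyperparameters are illustrative, not from the original post.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbdt = GradientBoostingClassifier(n_estimators=10, random_state=0)
gbdt.fit(X_train, y_train)

# apply() returns leaf indices of shape [n_samples, n_estimators, n_classes];
# for binary classification n_classes is 1, so flatten the trailing axis.
leaves_train = gbdt.apply(X_train).reshape(len(X_train), -1)
leaves_test = gbdt.apply(X_test).reshape(len(X_test), -1)

# One-hot encode the leaf indices, then fit the logistic regression on top.
encoder = OneHotEncoder(handle_unknown='ignore')
lr = LogisticRegression(max_iter=1000)
lr.fit(encoder.fit_transform(leaves_train), y_train)
print(lr.score(encoder.transform(leaves_test), y_test))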
