How to perform Kernel Density Estimation in Tensorflow - python

I'm trying to write a Kernel Density Estimation algorithm in Tensorflow.
When fitting the KDE model, I am iterating through all the data in the current batch and, for each, I am creating a kernel using the tensorflow.contrib.distributions.MultivariateNormalDiag object:
self.kernels = [MultivariateNormalDiag(loc=data, scale=bandwidth) for data in X]
Later, when trying to predict the likelihood of a data point with respect to the model fitted above, for each data point I am evaluating, I am summing together the probability given by each of the kernels above:
tf.reduce_sum([kernel._prob(X) for kernel in self.kernels], axis=0)
This approach only works when X is a numpy array, as TF doesn't let you iterate over a Tensor. My question is whether or not there is a way to make the algorithm above work with X as a tf.Tensor or tf.Variable?
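For reference, a minimal sketch of the per-kernel approach described above; it uses TensorFlow Probability's tfd.MultivariateNormalDiag as a stand-in for the old tf.contrib.distributions class, and the fitted data and bandwidth below are just placeholders:
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

# Placeholder fitted data and bandwidth (not from the original post).
X_fit = np.random.randn(100, 2)        # points the KDE is fitted on
bandwidth = np.array([0.5, 0.5])

# One kernel per fitted point, as in the question; X_fit must be a numpy array
# here because a Python list comprehension cannot iterate over a tf.Tensor.
kernels = [tfd.MultivariateNormalDiag(loc=x, scale_diag=bandwidth) for x in X_fit]

def kde_prob(X_query):
    # Average the per-kernel densities at the query points.
    return tf.reduce_sum([k.prob(X_query) for k in kernels], axis=0) / len(kernels)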

One answer I found tackles fitting the KDE and predicting the probabilities in one fell swoop. The implementation is a bit hacky, though.
def fit_predict(self, data):
    return tf.map_fn(
        lambda x: tf.div(
            tf.reduce_sum(
                tf.map_fn(lambda x_i: self.kernel_dist(x_i, self.bandwidth).prob(x),
                          self.fit_X)),
            tf.multiply(tf.cast(data.shape[0], dtype=tf.float64), self.bandwidth[0])),
        self.X)
The first tf.map_fn iterates through the data for which we are calculating the likelihood, summing together the probabilities from each of the individual kernels.
The second tf.map_fn iterates through all the data that we use to fit our model, and creates a tf.contrib.distributions.Distribution (here this is parameterized by kernel_dist).
self.X and self.fit_X are placeholders that are created when initializing the KernelDensity object.
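A fully vectorized alternative, sketched below, avoids both the Python loop and tf.map_fn by building a single batched distribution and letting broadcasting evaluate every kernel at every query point. This is not from the original answer; it assumes TensorFlow Probability's tfd.MultivariateNormalDiag, with fit_X of shape (n, d) and x_query of shape (m, d):
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

def kde_log_prob(x_query, fit_X, bandwidth):
    # One distribution with batch shape (n,): one kernel per fitted point.
    kernels = tfd.MultivariateNormalDiag(loc=fit_X, scale_diag=bandwidth)
    # Broadcast the queries against the batch: the result has shape (m, n).
    log_probs = kernels.log_prob(x_query[:, tf.newaxis, :])
    # Average the kernel densities in log-space for numerical stability.
    n = tf.cast(tf.shape(fit_X)[0], x_query.dtype)
    return tf.reduce_logsumexp(log_probs, axis=1) - tf.math.log(n)
Because nothing iterates in Python here, both x_query and fit_X can be ordinary tf.Tensor or tf.Variable objects.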

Related

Python PLSRegression: obtaining the latent variable scores using loadings

In sklearn.cross_decomposition.PLSRegression, we can obtain the latent variable scores of the X array from x_scores_.
I would like to extract the loadings to calculate the latent variable scores for a new array W. Intuitively, what I would do is: scores = W*loadings (matrix multiplication). I tried this using x_loadings_, x_weights_, and x_rotations_ as the loadings, as I could not figure out which array was the right one (there is little info on the scikit-learn website). I also tried to standardize W (subtracting the mean and dividing by the standard deviation of X) before multiplying by the loadings. But none of these worked (when I tried it on the X array itself, I could not reproduce the scores in the x_scores_ array).
Any help with this?
Actually, I just had to better understand the fit() and transform() methods of scikit-learn. I need to use transform(W) to obtain the latent variable scores of the W array:
1. fit(): learns the model parameters from the training data.
2. transform(): uses the parameters learned by fit() to transform a particular dataset.
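A minimal sketch of that workflow (the arrays and sizes below are just illustrative, not from the original post):
import numpy as np
from sklearn.cross_decomposition import PLSRegression

X = np.random.randn(50, 10)   # training predictors
Y = np.random.randn(50, 1)    # training response
W = np.random.randn(5, 10)    # new data to project

pls = PLSRegression(n_components=2)
pls.fit(X, Y)                  # fit() learns the centering/scaling and the rotations

train_scores = pls.x_scores_   # latent variable scores of the training data
new_scores = pls.transform(W)  # transform() applies the fitted parameters to W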

Linear Discriminant Analysis transform function

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

x = data.values
y = target.values
lda = LDA(solver='eigen', shrinkage='auto', n_components=2)
df_lda = lda.fit(x, y).transform(x)
df_lda.shape
This is a small part of the code. I am trying to reduce the dimensionality to the most discriminative directions. To my understanding, the transform() function projects the data so as to maximize class separation and should return an array of shape (n_samples, n_components).
But my df_lda is of shape (614, 1).
What am I missing here? Or is my data not linearly separable?
For K distinct classes in target.values there are at most K-1 components in the transformed data (without further dimensionality reduction). Since you only have two classes in your data set, there is only one transformed component, and you cannot get more components than that.
I suppose it might be helpful for sklearn to issue a warning when you request more components than are available.
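A small sketch illustrating the K-1 limit on synthetic data (not the asker's dataset):
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

rng = np.random.default_rng(0)
x = rng.normal(size=(614, 10))

y2 = rng.integers(0, 2, size=614)         # two classes  -> at most 1 component
lda2 = LDA(solver='eigen', shrinkage='auto', n_components=1)
print(lda2.fit(x, y2).transform(x).shape)   # (614, 1)

y3 = rng.integers(0, 3, size=614)         # three classes -> up to 2 components
lda3 = LDA(solver='eigen', shrinkage='auto', n_components=2)
print(lda3.fit(x, y3).transform(x).shape)   # (614, 2)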

Efficient computation of layer Jacobians in Theano

I want to take a closer look at the Jacobians of each layer in a fully connected neural network, i.e. ∂y/∂x, where x is the input vector to the layer (the activations of the previous layer) and y is its output vector (the activations of this layer).
In an online learning scheme, this could be easily done as follows:
import theano
import theano.tensor as T
import numpy as np
x = T.vector('x')
w = theano.shared(np.random.randn(10, 5))
y = T.tanh(T.dot(w, x))
# computation of Jacobian
j = T.jacobian(y, x)
When learning on batches, you need an additional scan to get the Jacobian for each sample:
x = T.matrix('x')
...
# computation of the per-sample Jacobians
j, updates = theano.scan(lambda i, a, b: T.jacobian(b[i], a)[:, i],
                         sequences=T.arange(y.shape[0]),
                         non_sequences=[x, y])
This works perfectly well for toy examples, but when learning a network with multiple layers with 1000 hidden units and for thousands of samples, this approach leads to a massive slowdown of the computations. (The idea behind indexing the result of the Jacobian can be found in this question)
The thing is that I believe there is no need for this explicit Jacobian computation when we are already computing the derivative of the loss. After all, the gradient of the loss with regard to e.g. the inputs of the network, can be decomposed as
∂L(y, y_L)/∂x = ∂L(y, y_L)/∂y_L · ∂y_L/∂y_(L-1) · ∂y_(L-1)/∂y_(L-2) · ... · ∂y_2/∂y_1 · ∂y_1/∂x
i.e. the gradient of the loss w.r.t. x is the product of the Jacobians of the individual layers (L being the number of layers here).
My question is thus whether (and how) it is possible to avoid the extra computation and reuse this decomposition. I assume it should be possible, because automatic differentiation is essentially an application of the chain rule (as far as I understand it). However, I can't find anything that backs this idea up. Any suggestions, hints or pointers?
T.jacobian is very inefficient because it uses scan internally. If you plan to multiply the Jacobian matrix by something, you should use T.Lop or T.Rop for left / right multiplication respectively. A "smart" Jacobian does not currently exist in Theano's gradient module; you have to hand-craft it if you want an optimized Jacobian.
Instead of using T.scan, use a batched Op such as T.batched_dot when possible. T.scan always results in a CPU loop.
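As a sketch of the left-multiplication route (not from the original answer): T.Lop computes a vector-Jacobian product directly, which is exactly what backpropagating ∂L/∂y through a layer needs, so the full Jacobian never has to be materialized. Reusing the variables from the earlier snippet:
import theano
import theano.tensor as T
import numpy as np

x = T.vector('x')
w = theano.shared(np.random.randn(10, 5))
y = T.tanh(T.dot(w, x))

v = T.vector('v')        # the row vector to multiply from the left, e.g. dL/dy
vjp = T.Lop(y, x, v)     # v^T * (dy/dx), computed without ever building dy/dx

f = theano.function([x, v], vjp)
print(f(np.random.randn(5), np.random.randn(10)))   # shape (5,)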

Ridge regression using stochastic gradient descent in Python

I am trying to implement a solution to Ridge regression in Python using Stochastic gradient descent as the solver. My code for SGD is as follows:
import numpy as np
import pandas as pd
from random import shuffle

def fit(self, X, Y):
    # Convert to a data frame in case X is a numpy matrix
    X = pd.DataFrame(X)
    # Prepend a column of 1s to the data for the intercept
    X.insert(0, 'intercept', np.array([1.0] * X.shape[0]))
    # Find dimensions of the training data
    m, d = X.shape
    # Initialize weights randomly
    beta = self.initializeRandomWeights(d)
    beta_prev = None
    epochs = 0
    prev_error = None
    while beta_prev is None or epochs < self.nb_epochs:
        print("## Epoch: " + str(epochs))
        indices = list(range(m))
        shuffle(indices)
        for i in indices:  # Pick training examples in a randomly shuffled order
            beta_prev = beta
            xi = X.iloc[i]
            errori = sum(beta * xi) - Y[i]  # Error[i] = sum(beta*x) - y = error of the ith training example
            gradient_vector = xi * errori + self.l * beta_prev
            beta = beta_prev - self.alpha * gradient_vector
        epochs += 1
The data I'm testing this on is not normalized, and my implementation always ends up with all the weights being infinity, even though I initialize the weight vector to low values. Only when I set the learning rate alpha to a very small value (~1e-8) does the algorithm end up with valid values for the weight vector.
My understanding is that normalizing/scaling the input features only helps reduce convergence time. But the algorithm as a whole should not fail to converge just because the features are not normalized. Is my understanding correct?
You can check in scikit-learn's Stochastic Gradient Descent documentation that one of the disadvantages of the algorithm is that it is sensitive to feature scaling. In general, gradient-based optimization algorithms converge faster on normalized data.
Also, normalization is advantageous for regression methods.
The updates to the coefficients during each step depend on the ranges of each feature. In addition, the regularization term is affected heavily by large feature values.
SGD may converge without data normalization, but that depends on the data at hand. Therefore, your assumption is not correct.
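As an illustration of this point (a sketch, not part of the original answer): standardizing the features first, e.g. with scikit-learn's StandardScaler, and running ridge regression via SGDRegressor(penalty='l2') usually lets you keep a sensible learning rate even when the raw features live on wildly different scales.
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Features on wildly different scales (purely synthetic data).
X = rng.normal(size=(1000, 5)) * np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])
y = X @ np.array([1.0, 0.5, 0.1, 0.01, 0.001]) + rng.normal(size=1000)

X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature

# SGDRegressor with an L2 penalty is ridge regression solved by SGD.
model = SGDRegressor(penalty='l2', alpha=0.01, eta0=0.01, max_iter=1000)
model.fit(X_scaled, y)
print(model.coef_)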
Your assumption is not correct.
It's hard to answer this because there are so many different methods/environments, but I will try to mention some points.
Normalization
When a method is not scale-invariant (and I think no form of linear regression is), you really should normalize your data.
I take it that you are only skipping this for debugging / analysis purposes.
Normalizing your data is not only relevant for convergence time; the results will differ too (think about the effect within the loss function: large values contribute much more loss than small ones)!
Convergence
There is probably much to tell about convergence of many methods on normalized/non-normalized data, but your case is special:
SGD's convergence theory only guarantees convergence to some local minimum (= the global minimum in your convex optimization problem) for certain choices of hyper-parameters (learning rate and learning schedule/decay).
Even optimizing normalized data can fail with SGD when those params are bad!
This is one of the most important downsides of SGD: its dependence on hyper-parameters.
As SGD is based on gradients and step sizes, non-normalized data can have a huge negative effect on achieving this convergence!
In order for SGD to converge in linear regression, the step size should be smaller than 2/s, where s is the largest singular value of the matrix (see the "Convergence and stability in the mean" section in https://en.m.wikipedia.org/wiki/Least_mean_squares_filter); in the case of ridge regression it should be less than 2*(1+p/s^2)/s, where p is the ridge penalty.
Normalizing the rows of the matrix (or the gradients) changes the loss function to give each sample an equal weight, and it changes the singular values of the matrix such that you can choose a step size near 1 (see the NLMS section in https://en.m.wikipedia.org/wiki/Least_mean_squares_filter). Depending on your data it might require smaller step sizes or allow for larger ones; it all depends on whether the normalization increases or decreases the largest singular value of the matrix.
Note that when deciding whether or not to normalize the rows, you shouldn't just think about the convergence rate (which is determined by the ratio between the largest and smallest singular values) or stability in the mean, but also about how it changes the loss function and whether that fits your needs. Sometimes it makes sense to normalize, but sometimes (for example, when you want to give different importance to different samples, or when you think that a larger signal energy means a better SNR) it does not.
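A small numeric sketch of that last point (purely illustrative, not from the original answer): comparing the largest singular value of a design matrix before and after row normalization shows how the admissible step size shifts.
import numpy as np

rng = np.random.default_rng(0)
# Synthetic design matrix with one large-scale feature.
X = rng.normal(size=(200, 5)) * np.array([1.0, 1.0, 1.0, 1.0, 50.0])

s_max = np.linalg.svd(X, compute_uv=False)[0]
print("largest singular value, raw matrix: ", s_max)

X_rows = X / np.linalg.norm(X, axis=1, keepdims=True)   # give every sample unit norm
s_max_rows = np.linalg.svd(X_rows, compute_uv=False)[0]
print("largest singular value, row-normed: ", s_max_rows)

# A smaller largest singular value permits a larger stable step size
# (the answer above cites a 2/s-type bound for the LMS setting).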

Dimension of data before and after performing PCA

I'm attempting kaggle.com's digit recognizer competition using Python and scikit-learn.
After removing labels from the training data, I add each row in CSV into a list like this:
for row in csv:
    train_data.append(np.array(np.int64(row)))
I do the same for the test data.
I pre-process this data with PCA in order to perform dimension reduction (and feature extraction?):
import numpy as np
from sklearn import decomposition

def preprocess(train_data, test_data, pca_components=100):
    # convert to matrix
    train_data = np.mat(train_data)
    # fit PCA on the training data, then reduce both train and test data
    pca = decomposition.PCA(n_components=pca_components).fit(train_data)
    X_train = pca.transform(train_data)
    X_test = pca.transform(test_data)
    return (X_train, X_test)
I then create a kNN classifier and fit it with the X_train data and make predictions using the X_test data.
Using this method I can get around 97% accuracy.
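For context, a minimal sketch of that classification step (the train_labels variable here is hypothetical; it stands for the labels stripped off earlier):
from sklearn.neighbors import KNeighborsClassifier

# X_train and X_test come from preprocess() above; train_labels is assumed
# to hold the digit labels that were removed from the training CSV.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, train_labels)
predictions = knn.predict(X_test)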
My question is about the dimensionality of the data before and after PCA is performed
What are the dimensions of train_data and X_train?
How does the number of components influence the dimensionality of the output? Are they the same thing?
TL;DR: Yes, the number of the desired PCA components is the dimensionality of the output data (after the transformation).
The PCA algorithm finds the eigenvectors of the data's covariance matrix. What are eigenvectors? Nobody knows, and nobody cares (just kidding!). What's important is that the first eigenvector is a vector parallel to the direction along which the data has the largest variance (intuitively: spread). The second one denotes the second-best direction in terms of the maximum spread, and so on. Another important fact is that these vectors are orthogonal to each other, so they form a basis.
The pca_components parameter tells the algorithm how many of those best basis vectors you are interested in. So, if you pass 100, it means you want the 100 basis vectors that describe (a statistician would say: explain) most of the variance of your data.
The transform function transforms (srsly? ;)) the data from the original basis to the basis formed by the chosen PCA components (in this example, the first, best 100 vectors). You can visualize this as a cloud of points being rotated and having some of its dimensions ignored. As correctly pointed out by Jaime in the comments, this is equivalent to projecting the data onto the new basis.
For the 3D case, if you wanted to get a basis formed of the first 2 eigenvectors, then again, the 3D point cloud would be first rotated, so the most variance would be parallel to the coordinate axes. Then, the axis where the variance is smallest is being discarded, leaving you with 2D data.
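For the digit-recognizer data specifically, a quick sketch of the shapes involved (assuming 42,000 training rows of 784 pixels, which is the layout of that Kaggle dataset):
import numpy as np
from sklearn import decomposition

train_data = np.random.rand(42000, 784)            # stand-in for the 28x28 pixel rows
pca = decomposition.PCA(n_components=100).fit(train_data)
X_train = pca.transform(train_data)

print(train_data.shape)   # (42000, 784)  -> original dimensionality
print(X_train.shape)      # (42000, 100)  -> one column per retained PCA component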
