Dimension of data before and after performing PCA

I'm attempting kaggle.com's digit recognizer competition using Python and scikit-learn.
After removing labels from the training data, I add each row in CSV into a list like this:
for row in csv:
    train_data.append(np.array(np.int64(row)))
I do the same for the test data.
I pre-process this data with PCA in order to perform dimension reduction (and feature extraction?):
def preprocess(train_data, test_data, pca_components=100):
    # convert to matrix
    train_data = np.mat(train_data)
    # reduce both train and test data
    pca = decomposition.PCA(n_components=pca_components).fit(train_data)
    X_train = pca.transform(train_data)
    X_test = pca.transform(test_data)
    return (X_train, X_test)
I then create a kNN classifier and fit it with the X_train data and make predictions using the X_test data.
Using this method I can get around 97% accuracy.
My question is about the dimensionality of the data before and after PCA is performed:
What are the dimensions of train_data and X_train?
How does the number of components influence the dimensionality of the output? Are they the same thing?

TL;DR: Yes, the number of desired PCA components is the dimensionality of the output data (after the transformation).
The PCA algorithm finds the eigenvectors of the data's covariance matrix. What are eigenvectors? Nobody knows, and nobody cares (just kidding!). What's important is that the first eigenvector is a vector parallel to the direction along which the data has the largest variance (intuitively: spread). The second one denotes the second-best direction in terms of the maximum spread, and so on. Another important fact is that these vectors are orthogonal to each other, so they form a basis.
The pca_components parameter (passed to PCA as n_components) tells the algorithm how many of the best basis vectors you are interested in. So, if you pass 100, it means you want the 100 basis vectors that describe (a statistician would say: explain) most of the variance of your data.
The transform function transforms (srsly?;)) the data from the original basis to the basis formed by the chosen PCA components (in this example, the best 100 vectors). You can visualize this as a cloud of points being rotated and having some of its dimensions ignored. As correctly pointed out by Jaime in the comments, this is equivalent to projecting the data onto the new basis.
For the 3D case, if you wanted a basis formed of the first 2 eigenvectors, then again the 3D point cloud would first be rotated so that the directions of greatest variance line up with the coordinate axes. Then the axis along which the variance is smallest is discarded, leaving you with 2D data.
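To make the shapes concrete, here is a minimal NumPy-only sketch of the rotate-and-discard idea described above. It is not scikit-learn's implementation; the helper name pca_project and the synthetic data (784 features, chosen to resemble flattened 28x28 digit images) are assumptions for illustration only.

```python
import numpy as np

def pca_project(train, test, n_components):
    """Project both sets onto the top principal directions of `train`."""
    mean = train.mean(axis=0)
    # Rows of vt are the eigenvectors of the covariance matrix of `train`,
    # ordered from largest to smallest variance.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    basis = vt[:n_components]            # shape (n_components, n_features)
    return (train - mean) @ basis.T, (test - mean) @ basis.T

rng = np.random.default_rng(0)
train_data = rng.normal(size=(500, 784))  # synthetic stand-in, not the Kaggle data
test_data = rng.normal(size=(200, 784))

X_train, X_test = pca_project(train_data, test_data, n_components=100)
print(train_data.shape, X_train.shape, X_test.shape)
```

The number of rows (samples) is unchanged; only the number of columns drops from the original feature count to the requested number of components.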

Related

TimeSeries K-means clustering for multi-dimensional data

I'm using tslearn's TimeSeriesKMeans to cluster my dataset, which has shape (3000, 300, 8). However, the documentation only covers the case where the dataset has shape (n_samples, timesteps, 1), i.e. a single feature. Can anybody help me understand whether I can perform clustering with more than one feature per time step?
I'm using "DTW" as my distance metric.

I used TimeSeriesKMeans from the tslearn.clustering module. As you mentioned, the only example available in the tslearn documentation uses one-dimensional input. However, it is very common to work with time series data that has more dimensions. For instance, in my case, I was clustering human motion, where each sample was 30 frames with 135 joint key points per frame. Therefore, my data had shape (number_of_samples, number_of_frames, features).
In order to use tslearn's TimeSeriesKMeans, you need to input an ndarray of shape (n_samples, m_time_steps (sequence length), k_features (k dimensions)).
If you take a look at the documentation, the fit function's parameters are as follows:
fit(X, y=None)
    Compute k-means clustering.
    Parameters:
        X : array-like of shape=(n_ts, sz, d)
            Time series dataset.
        y
            Ignored
The point is, your input data should be an ndarray of shape (n_samples, seq_length, n_features); otherwise, it won't work. For example, at first my data was a list of length n_samples, and each element in that list had shape (seq_length, features). It didn't work until I converted it to an ndarray of shape (n_samples, seq_length, features).
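The conversion from a list of per-sample arrays to a single 3D ndarray can be sketched with plain NumPy as follows. The list name series_list and the sizes are hypothetical, and the sketch assumes every series has the same length and feature count.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical input: a list of 5 series, each with 30 time steps and 8 features.
series_list = [rng.normal(size=(30, 8)) for _ in range(5)]

# np.stack adds a new leading axis; it requires all elements to share one shape.
X = np.stack(series_list)
print(X.shape)  # (5, 30, 8)
```

An array shaped like X is what the (n_ts, sz, d) parameter description above refers to. Series of unequal length would need padding or resampling first, which this sketch does not cover.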

Python PLSRegression : obtaining the latent variables scores using loadings

In sklearn.cross_decomposition.PLSRegression, we can obtain the latent variables scores from the X array using x_scores_.
I would like to extract the loadings to calculate the latent variables scores for a new array W. Intuitively, what I would do is: scores = W*loadings (matrix multiplication). I tried this using x_loadings_, x_weights_, and x_rotations_ in turn as the loadings, as I could not figure out which array was the right one (there is little info on the sklearn website). I also tried to standardize W (subtracting the mean and dividing by the standard deviation of X) before multiplying by the loadings. But none of these works (I tested with the X array itself and could not reproduce the scores in the x_scores_ array).
Any help with this?

Actually, I just had to better understand the fit() and transform() methods of sklearn. I need to use transform(W) to obtain the latent variables scores of the W array:
1. fit(): learns the model parameters from the training data
2. transform(): uses the parameters learned by fit() to transform a particular dataset

How to perform Kernel Density Estimation in Tensorflow

I'm trying to write a Kernel Density Estimation algorithm in Tensorflow.
When fitting the KDE model, I am iterating through all the data in the current batch and, for each, I am creating a kernel using the tensorflow.contrib.distributions.MultivariateNormalDiag object:
self.kernels = [MultivariateNormalDiag(loc=data, scale=bandwidth) for data in X]
Later, when trying to predict the likelihood of a data point with respect to the model fitted above, for each data point I am evaluating, I am summing together the probability given by each of the kernels above:
tf.reduce_sum([kernel._prob(X) for kernel in self.kernels], axis=0)
This approach only works when X is a numpy array, as TF doesn't let you iterate over a Tensor. My question is whether or not there is a way to make the algorithm above work with X as a tf.Tensor or tf.Variable?

One answer that I found for this problem tackles the problem of fitting the KDE and predicting the probabilities in one fell swoop. The implementation is a bit hacky, though.
def fit_predict(self, data):
    return tf.map_fn(
        lambda x: tf.div(
            tf.reduce_sum(tf.map_fn(
                lambda x_i: self.kernel_dist(x_i, self.bandwidth).prob(x),
                self.fit_X)),
            tf.multiply(tf.cast(data.shape[0], dtype=tf.float64),
                        self.bandwidth[0])),
        self.X)
The first tf.map_fn iterates through the data for which we are calculating the likelihood, summing together the probabilities from each of the individual kernels.
The second tf.map_fn iterates through all the data that we use to fit our model, and creates a tf.contrib.distributions.Distribution (here this is parameterized by kernel_dist).
self.X and self.fit_X are placeholders that are created when initializing the KernelDensity object.
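For comparison, the same sum over kernels can be written without any explicit iteration by using broadcasting. The sketch below is in NumPy rather than TensorFlow and assumes an isotropic Gaussian kernel with a scalar bandwidth and the standard Gaussian normalization, which differs from the normalization in the code above. The function name gaussian_kde is hypothetical, and whether the same pattern carries over to TensorFlow tensor operations is an assumption I have not verified here.

```python
import numpy as np

def gaussian_kde(fit_X, query_X, bandwidth):
    """Average of isotropic Gaussian kernels centred on each row of fit_X."""
    d = fit_X.shape[1]
    # diff[i, j] = query_X[i] - fit_X[j], shape (n_query, n_fit, d)
    diff = query_X[:, None, :] - fit_X[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)            # shape (n_query, n_fit)
    norm = (2.0 * np.pi * bandwidth ** 2) ** (d / 2.0)
    # Mean over the fitted points gives one density value per query point.
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2)).mean(axis=1) / norm

fit_X = np.array([[0.0], [1.0]])     # toy data used to fit the model
query_X = np.array([[0.0], [0.5]])   # points whose density we evaluate
density = gaussian_kde(fit_X, query_X, bandwidth=1.0)
print(density)
```

The trade-off is memory: the intermediate array has n_query * n_fit * d elements, so large batches may need to be processed in chunks.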

Linear Discriminant Analysis transform function

x = data.values
y = target.values
lda = LDA(solver='eigen', shrinkage='auto',n_components=2)
df_lda = lda.fit(x,y).transform(x)
df_lda.shape
This is a small part of the code. I am trying to reduce the dimensionality to the most discriminative directions. To my understanding, the transform() function projects the data to maximize class separation for my data set and should return an array of shape (n_samples, n_components).
But my df_lda has shape (614, 1).
What am I missing here? Or is my data not linearly separable?

For K distinct classes in target.values, there are at most K-1 components in the transformed data (without further dimensionality reduction). Since you only have two classes in your data set, there is only one transformed component, so you cannot get more components than that.
I suppose it might be helpful for sklearn to issue a warning when you request more components than are available.
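One way to see where the K-1 limit comes from is to look at the between-class scatter matrix, whose rank bounds the number of discriminant directions. The sketch below uses synthetic data with 614 rows to mirror the question; the helper names and the random labels are assumptions for illustration and are not part of sklearn.

```python
import numpy as np

def between_class_scatter(X, y):
    """Sum over classes of n_c * (mean_c - mean)(mean_c - mean)^T."""
    overall_mean = X.mean(axis=0)
    S_b = np.zeros((X.shape[1], X.shape[1]))
    for label in np.unique(y):
        X_c = X[y == label]
        diff = (X_c.mean(axis=0) - overall_mean)[:, None]
        S_b += len(X_c) * (diff @ diff.T)
    return S_b

def numerical_rank(M, rel_tol=1e-8):
    """Count singular values that are not negligible relative to the largest."""
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

rng = np.random.default_rng(0)
X = rng.normal(size=(614, 5))              # synthetic features
y_two = rng.integers(0, 2, size=614)       # two classes
y_three = rng.integers(0, 3, size=614)     # three classes

rank_two = numerical_rank(between_class_scatter(X, y_two))
rank_three = numerical_rank(between_class_scatter(X, y_three))
print(rank_two, rank_three)
```

With two classes the class means and the overall mean lie on a single line, so only one direction can separate them, which matches the (614, 1) output in the question.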

sklearn LogisticRegression classifier performance varies with same element values but different hash-range sparse matrix

I was trying to train an LR classifier on a text dataset. Unlike the common scenario where text is fed directly to a TF-IDF vectorizer, each original text line was first transformed into a dictionary like {a:0.1, phrase:0.5, in:0.3, line:0.8}, in which the weights were computed by some specific rules and some words were omitted. So, in order to feed these dictionaries to the LR classifier, I chose FeatureHasher to do the hashing trick. However, I found that the LR classifier ran extremely slowly when the n_features param of FeatureHasher grew large, say 10^8.
But as far as I know, neither the memory cost nor the computation cost of a sparse matrix should grow with its dimensions while the number of non-zero elements is fixed. For example, suppose we have a two-element sparse vector [coordinate:(1,2), value:(3,4)] whose original dimension is 10. If we change the hash range to 20, we get [(3,7), (3,4)]. There is no difference in storing these two vectors, and if we calculate the distance to another sparse vector, we only need to traverse lists with a fixed number of elements, so the computation cost is fixed.
I think there must be something wrong with my understanding, or I have missed something about sklearn's LR classifier. I hope someone can correct me, thanks!
