from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

x = data.values
y = target.values
lda = LDA(solver='eigen', shrinkage='auto', n_components=2)
df_lda = lda.fit(x, y).transform(x)
df_lda.shape
This is a small part of the code. I am trying to reduce the dimensionality to the most discriminative directions. To my understanding, transform() projects the data so as to maximize class separation and should return an array of shape (n_samples, n_components).
But my df_lda has shape (614, 1).
What am I missing here? Or is my data not linearly separable?
For K distinct classes in target.values, the transformed data has at most K-1 components (before any further dimensionality reduction). Since your data set contains only two classes, there is only one transformed component, and you cannot get more components than that.
I suppose it might be helpful for sklearn to issue a warning when you request more components than are available.
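For reference, here is a minimal sketch of capping n_components at K-1, using the iris data set as a stand-in (it has three classes, so at most two components):
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

X, y = load_iris(return_X_y=True)
n_classes = len(np.unique(y))
n_components = min(2, n_classes - 1)  # LDA allows at most K-1 components

lda = LDA(solver='eigen', shrinkage='auto', n_components=n_components)
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)  # (150, 2) here; with only two classes it would be (n_samples, 1)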
I'm using tslearn's TimeSeriesKMeans to cluster my dataset, which has shape (3000, 300, 8). However, the documentation only discusses the case where the dataset has shape (n_samples, timesteps, 1), i.e. a single feature. Can anybody help me understand whether I can perform clustering with a higher number of features?
I'm using "DTW" as my distance metric.
I used TimeSeriesKMeans from the tslearn.clustering library. As you mentioned, the only example in the tslearn documentation uses one-dimensional input. However, it is very common to work with time series data of higher dimensionality. For instance, in my case I was clustering human motion, which was 30 frames of 135 joint key points per frame, so my data had shape (number_of_samples, number_of_frames, features).
To use tslearn's TimeSeriesKMeans, you need to pass in an ndarray of shape (n_samples, m_time_steps (sequence_length), k_features (k_dimensions)).
If you take a look at the documentation, the fit function's parameters are as follows:
fit(X, y=None): Compute k-means clustering.

Parameters:
    X : array-like of shape=(n_ts, sz, d)
        Time series dataset.
    y : Ignored
The point is that your input data must be an ndarray of shape (n_samples, seq_length, n_features); otherwise, it won't work. For example, at first my data was a list of length n_samples in which each element had shape (seq_length, n_features). It wouldn't work until I converted it to an ndarray of shape (n_samples, seq_length, n_features).
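As a minimal sketch with random data of the same shape as in the question (the parameter values are placeholders, and DTW on 3000 series of length 300 can be quite slow):
import numpy as np
from tslearn.clustering import TimeSeriesKMeans

X = np.random.rand(3000, 300, 8)  # (n_samples, seq_length, n_features)

# If your data is a list of (seq_length, n_features) arrays, stack it first:
# X = np.stack(list_of_series)

model = TimeSeriesKMeans(n_clusters=5, metric="dtw", max_iter=10, random_state=0)
labels = model.fit_predict(X)
print(labels.shape)  # (3000,)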
In sklearn.cross_decomposition.PLSRegression, we can obtain the latent variable scores for the X array using x_scores_.
I would like to extract the loadings so I can calculate the latent variable scores for a new array W. Intuitively, what I would do is: scores = W * loadings (matrix multiplication). I tried this using x_loadings_, x_weights_, and x_rotations_ as the loadings, as I could not figure out which array is the right one (there is little info on the sklearn website). I also tried standardizing W (subtracting the mean and dividing by the standard deviation of X) before multiplying by the loadings. But none of this works (when I try it with the X array itself, I cannot reproduce the scores in the x_scores_ array).
Any help with this?
Actually, I just had to better understand the fit() and transform() methods of sklearn. I need to use transform(W) to obtain the latent variable scores of the W array:
1. fit(): generates the learning model parameters from the training data
2. transform(): uses the parameters generated by fit() to transform a particular dataset
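A minimal sketch with random data (the shapes are arbitrary and only illustrate the call pattern):
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.RandomState(0)
X = rng.rand(100, 10)
Y = rng.rand(100, 2)
W = rng.rand(20, 10)  # new samples with the same number of features as X

pls = PLSRegression(n_components=3).fit(X, Y)

print(np.allclose(pls.transform(X), pls.x_scores_))  # transform() reproduces the training scores
W_scores = pls.transform(W)  # scores of the new array W in the same latent space
print(W_scores.shape)  # (20, 3)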
I am trying to use scikit-learn's GaussianRandomProjection with my dataset, which has shape 1599 x 11, as follows:
from sklearn import random_projection

transformer = random_projection.GaussianRandomProjection()
X_new = transformer.fit_transform(wine_data.values[:, :11])
As I do this, I get an error that says:
ValueError: eps=0.100000 and n_samples=1599 lead to a
target dimension of 6323 which is larger than the original
space with n_features=1
I do not understand the error. What exactly does it mean? How could I use GaussianRandomProjection to reduce data dimensionality?
Here's a direct quotation from the official scikit-learn documentation for GaussianRandomProjection, regarding its n_components parameter:
Dimensionality of the target projection space.
n_components can be automatically adjusted according to the number of
samples in the dataset and the bound given by the
Johnson-Lindenstrauss lemma. In that case the quality of the embedding
is controlled by the eps parameter.
It should be noted that Johnson-Lindenstrauss lemma can yield very
conservative estimates of the required number of components as it
makes no assumption on the structure of the dataset.
It seems that in your case the estimator wants to produce a 6323-dimensional projected target after "reducing" the dimensionality. This is obviously not what you want, since you intended to reduce the dimensionality rather than increase it. I suggest you first decide on the desired output dimensionality (e.g. 8) and then test whether the model behaves as expected.
from sklearn.random_projection import GaussianRandomProjection

transformer = GaussianRandomProjection(n_components=8)  # set your desired output dimensionality
X_new = transformer.fit_transform(wine_data.values[:, :11])
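If you are curious where the 6323 in the error message comes from, you can evaluate the Johnson-Lindenstrauss bound directly:
from sklearn.random_projection import johnson_lindenstrauss_min_dim

# With the default eps=0.1 and 1599 samples, the bound asks for roughly 6323 components,
# far more than the 11 original features -- hence the ValueError.
print(johnson_lindenstrauss_min_dim(n_samples=1599, eps=0.1))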
Good luck
I'm trying to write a Kernel Density Estimation algorithm in Tensorflow.
When fitting the KDE model, I am iterating through all the data in the current batch and, for each, I am creating a kernel using the tensorflow.contrib.distributions.MultivariateNormalDiag object:
self.kernels = [MultivariateNormalDiag(loc=data, scale=bandwidth) for data in X]
Later, when predicting the likelihood of a data point with respect to the model fitted above, I sum, for each data point being evaluated, the probabilities given by each of the kernels above:
tf.reduce_sum([kernel._prob(X) for kernel in self.kernels], axis=0)
This approach only works when X is a NumPy array, as TF doesn't let you iterate over a Tensor. Is there a way to make the algorithm above work with X as a tf.Tensor or tf.Variable?
One answer I found tackles fitting the KDE and predicting the probabilities in one fell swoop. The implementation is a bit hacky, though.
def fit_predict(self, data):
    return tf.map_fn(
        lambda x: tf.div(
            tf.reduce_sum(
                tf.map_fn(lambda x_i: self.kernel_dist(x_i, self.bandwidth).prob(x),
                          self.fit_X)),
            tf.multiply(tf.cast(data.shape[0], dtype=tf.float64), self.bandwidth[0])),
        self.X)
The first tf.map_fn iterates through the data for which we are calculating the likelihood, summing together the probabilities from each of the individual kernels.
The second tf.map_fn iterates through all the data that we use to fit our model, and creates a tf.contrib.distributions.Distribution (here this is parameterized by kernel_dist).
self.X and self.fit_X are placeholders that are created when initializing the KernelDensity object.
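For completeness, here is a rough sketch of how those placeholders and kernel_dist might be wired up at initialization (TF 1.x / tf.contrib API; n_features and the exact MultivariateNormalDiag argument names are assumptions and may vary between versions):
import tensorflow as tf
from tensorflow.contrib.distributions import MultivariateNormalDiag

class KernelDensity:
    def __init__(self, n_features, bandwidth):
        # per-dimension bandwidth used as the diagonal scale of each Gaussian kernel
        self.bandwidth = tf.constant([bandwidth] * n_features, dtype=tf.float64)
        # training data the kernels are centred on, and the query points
        self.fit_X = tf.placeholder(tf.float64, shape=[None, n_features])
        self.X = tf.placeholder(tf.float64, shape=[None, n_features])
        # one Gaussian kernel per training point
        self.kernel_dist = lambda loc, scale: MultivariateNormalDiag(loc=loc, scale_diag=scale)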
I'm attempting kaggle.com's digit recognizer competition using Python and scikit-learn.
After removing the labels from the training data, I add each row of the CSV to a list like this:
for row in csv:
    train_data.append(np.array(np.int64(row)))
I do the same for the test data.
I pre-process this data with PCA in order to perform dimension reduction (and feature extraction?):
def preprocess(train_data, test_data, pca_components=100):
    # convert to matrix
    train_data = np.mat(train_data)

    # reduce both train and test data
    pca = decomposition.PCA(n_components=pca_components).fit(train_data)
    X_train = pca.transform(train_data)
    X_test = pca.transform(test_data)

    return (X_train, X_test)
I then create a kNN classifier and fit it with the X_train data and make predictions using the X_test data.
Using this method I can get around 97% accuracy.
My question is about the dimensionality of the data before and after PCA is performed:
What are the dimensions of train_data and X_train?
How does the number of components influence the dimensionality of the output? Are they the same thing?
TL;DR: Yes, the number of desired PCA components is the dimensionality of the output data (after the transformation).
The PCA algorithm finds the eigenvectors of the data's covariance matrix. What are eigenvectors? Nobody knows, and nobody cares (just kidding!). What's important is that the first eigenvector is a vector parallel to the direction along which the data has the largest variance (intuitively: spread). The second one denotes the second-best direction in terms of the maximum spread, and so on. Another important fact is that these vectors are orthogonal to each other, so they form a basis.
The pca_components parameter tells the algorithm how many of the best basis vectors you are interested in. So, if you pass 100, it means you want the 100 basis vectors that describe (a statistician would say: explain) most of the variance of your data.
The transform function transforms (srsly? ;)) the data from the original basis to the basis formed by the chosen PCA components (in this example, the first 100 best vectors). You can visualize this as a cloud of points being rotated and then having some of its dimensions ignored. As correctly pointed out by Jaime in the comments, this is equivalent to projecting the data onto the new basis.
For the 3D case, if you wanted a basis formed of the first 2 eigenvectors, the 3D point cloud would first be rotated so that most of the variance lies parallel to the coordinate axes. Then the axis along which the variance is smallest is discarded, leaving you with 2D data.
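A small sketch with random data standing in for the digit images (784 pixels per image in the MNIST-style data; the numbers here are only illustrative):
import numpy as np
from sklearn import decomposition

rng = np.random.RandomState(0)
train_data = rng.rand(1000, 784)  # stand-in for the real Kaggle training matrix

pca = decomposition.PCA(n_components=100).fit(train_data)
X_train = pca.transform(train_data)

print(train_data.shape)  # (1000, 784) -- original dimensionality
print(X_train.shape)     # (1000, 100) -- one column per PCA component
print(pca.explained_variance_ratio_.sum())  # fraction of variance kept by the 100 components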