I'm having trouble with the supervised classification method I am using on my data.
Suppose we train our algorithm on a dataset (N=70) after reducing the dimensionality from 100 to 2 using the LDA dimensionality reduction method.
Now we would like to predict the class of a 71st sample, whose class is completely unknown to us. It still has 100 features, however, so we have to reduce its dimensions as well.
That seems easy at first glance: I can reuse the transformation fitted on the training data. For example, in Python:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=2)
flda = lda.fit(X, Y)       # fit the projection on the training data
X_lda = flda.transform(X)  # project the training data from 100 to 2 dimensions
clf.fit(X_lda, Y)          # train the classifier on the reduced data
I have stored the fitted transformation from the training data. X_p is my single new sample, so when I use flda again, the same fitted projection is applied:
X_p = flda.transform(X_p.reshape(1, -1))
However, it doesn't predict correctly! To test this, I took my original N=70 data, removed one sample (so now N=69), retrained, and used the 70th sample as the test sample. Again, it didn't predict correctly.
When I compared the projections of the original data (N=70) with the new ones (N=69), I saw that every single number had changed! Unless I am missing something (I hope I am, and that you can tell me what it is), LDA dimensionality reduction is not applicable to real machine learning applications, because a single data point can change everything.
As a note, the plot of the reduced data doesn't change even though all the numbers change significantly, which means the relative locations of the points are preserved.
Do you know how LDA dimensionality reduction is used in real machine learning applications? What must I do to test one sample in the following order (sketched in code below):
Reduce dimensions to 2 for training data
Reduce dimensions to 2 for test data
Predict!
without using the same transformation characteristics?
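For reference, here is a minimal runnable sketch of the three steps listed above, using synthetic stand-in data and an illustrative k-NN classifier; it reuses the fitted projection for the test sample, which is exactly the point in question:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(70, 100))       # stand-in training data
Y = rng.integers(0, 3, size=70)      # stand-in labels (3 classes)
X_p = rng.normal(size=(1, 100))      # stand-in unseen sample

flda = LinearDiscriminantAnalysis(n_components=2).fit(X, Y)
X_lda = flda.transform(X)            # step 1: reduce the training data
X_p_lda = flda.transform(X_p)        # step 2: reduce the test sample (same fit)
clf = KNeighborsClassifier().fit(X_lda, Y)
print(clf.predict(X_p_lda))          # step 3: predict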
Related
I have a deep learning model which extracts features from the original time series data, then uses PCA to reduce the dimensionality to 2D, then performs clustering using a GMM. I am planning to use the cluster assignments to label a class of signals that I am interested in finding in the original data. However, I'm having trouble wrapping my head around how to do that, since as I understand it I have lost information after doing PCA. Is this possible, and if so, how would I go about it?
I first start with 3 columns of data, each with length 1780800. They are then reshaped to an array of size (108, 3, 16800) to be fed into the model.
The model I am referring to is the one from this research paper: https://www.nature.com/articles/s41467-020-17841-x.
It does not seem right to use PCA on raw time points as feature values. What you need is a feature extractor that first converts each time series into a Euclidean feature space (look into feature extraction for time series). Then you can use the basic clustering tools in sklearn and visualize the result with t-SNE to check whether it makes sense. You need to validate each step before moving on to the next one.
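As a rough illustration only, here is a minimal sketch of that pipeline with a few hand-crafted statistics standing in for a proper feature extractor (a dedicated library such as tsfresh could replace that step); the shapes follow the question, while the random data, cluster count, and perplexity are placeholders:

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.manifold import TSNE

series = np.random.randn(108, 3, 16800)            # shape from the question

# Convert each (3 x 16800) series to a fixed-length Euclidean feature vector.
feats = np.concatenate([
    series.mean(axis=2),                           # per-channel mean
    series.std(axis=2),                            # per-channel std
    np.abs(np.diff(series, axis=2)).mean(axis=2),  # mean absolute change
], axis=1)                                         # -> (108, 9)

labels = GaussianMixture(n_components=3).fit_predict(feats)
emb = TSNE(n_components=2, perplexity=30).fit_transform(feats)
# Plot emb colored by labels to check whether the clusters make sense.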
I am using LinearDiscriminantAnalysis from scikit-learn to perform class prediction on a dataset. The problem is that the performance of LDA is not consistent. In general, when I increase the number of features used for training, I expect performance to increase as well; likewise for the number of samples, the more samples the better LDA should perform.
However, in my case there seems to be a sweet spot where LDA performs poorly, depending on the number of samples and features used for training. More precisely, LDA performs poorly when the number of features equals the number of samples. I don't think this is specific to my dataset. I am not sure exactly what the issue is, but I have an extensive example that recreates these results.
Here is an image of the LDA performance results that I am talking about.
The dataset I use has shape 400 x 75 x 400 (trials x time x features), where the trials are the different samples. Each time, I shuffle the trial indices of the dataset, pick trials for the train set, and do the same for the test set. Finally, I take the mean across time (the second axis) and feed the resulting (trials x features) matrix into the LDA to compute the score on the test set. The test set is always 50 trials. A minimal sketch of this procedure is shown below.
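This sketch uses random stand-in data and hypothetical binary labels, purely to make the described procedure concrete:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

data = np.random.randn(400, 75, 400)        # trials x time x features (stand-in)
y = np.random.randint(0, 2, size=400)       # hypothetical binary labels

idx = np.random.permutation(400)            # shuffle the trial indices
train_idx, test_idx = idx[:-50], idx[-50:]  # test set is always 50 trials

X_train = data[train_idx].mean(axis=1)      # mean across time -> trials x features
X_test = data[test_idx].mean(axis=1)

lda = LinearDiscriminantAnalysis()
print(lda.fit(X_train, y[train_idx]).score(X_test, y[test_idx]))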
A detailed Jupyter notebook with comments and the data I use can be found here: https://github.com/esigalas/LDA-test. My environment uses sklearn 1.1.1 and numpy 1.22.4.
I am not sure whether there is an issue with LDA itself (which would be worth opening an issue on GitHub) or something wrong with how I handle the dataset, but this behavior of LDA looks wrong.
Any comment/help is welcome. Thanks in advance!
I'm trying to solve a regression problem using a Python Keras CNN (TensorFlow backend), where I try to predict a single y-value from an 8-channel satellite image (23x45 pixels) fetched from Google Earth Engine via their Python API. I currently have 280 images, which I augment to 2500 images using flipping and random noise. The data is normalized and standardized, and I have removed outliers and all-zero images.
I've tested numerous CNN architectures, for example:
Conv2D(4, (4, 3)), MaxPooling2D((2, 2)), Dense(50), Dropout(0.4), Dense(30), Dropout(0.4), Dense(1)
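Expanded into a runnable Keras sketch (the Flatten layer and the input shape of (23, 45, 8) are assumptions based on the question; activations are left at their linear defaults, which the answer below also remarks on):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential([
    Conv2D(4, (4, 3), input_shape=(23, 45, 8)),  # activations default to linear
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(50),
    Dropout(0.4),
    Dense(30),
    Dropout(0.4),
    Dense(1),                                    # single regression output
])
model.compile(optimizer="adam", loss="mse")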
This results in odd behaviour: the predicted values fall mainly into two distinct groups or clusters, each with very little variance, while the true values have much higher variance. See the image below.
I have chosen not to publish any code snippets as my question is more of a general nature. What might lead to such clustered predictions? Are there any good common tricks to improve the results?
I've also tried to solve the problem with a plain neural network and with regression tools from scikit-learn, flattening each image to one long array (length 23x45x8 = 8280), as sketched below. That doesn't produce clustering, although the accuracy is still quite low; I assume that is due to insufficient or inappropriate data.
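A minimal sketch of that flattening step, where images is a hypothetical array of shape (n, 23, 45, 8):

# Flatten each (23, 45, 8) image to a single 8280-long vector for sklearn.
X_flat = images.reshape(len(images), -1)   # -> (n, 8280)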
[Image: truth (x) plotted against prediction (y), showing that the predictions are heavily clustered]
Your model is quite simple; it may not be able to extract features properly, so my guess is that it is underfitting. Also, your dropout is 40% in two layers, which is quite high for such a small network, and it looks like you are using linear activations.
The number of samples can also drive grouped predictions; the group with the majority of samples tends to be chosen.
Finally, I noticed that some of your truth values are greater than 1 or less than 0: you should normalize the targets properly and use a matching output activation function. A sketch of these adjustments follows.
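This is a hedged sketch of those adjustments; the filter counts, kernel sizes, stand-in targets, and the MinMaxScaler are illustrative choices, not the answerer's exact prescription:

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

y = np.random.rand(2500) * 1.4 - 0.2          # stand-in targets outside [0, 1]
y_scaled = MinMaxScaler().fit_transform(y.reshape(-1, 1))  # scale into [0, 1]

model = Sequential([
    Conv2D(16, (3, 3), activation="relu", input_shape=(23, 45, 8)),
    MaxPooling2D((2, 2)),
    Conv2D(32, (3, 3), activation="relu"),
    Flatten(),
    Dense(50, activation="relu"),
    Dropout(0.2),                             # lighter dropout
    Dense(1, activation="sigmoid"),           # matches targets scaled into [0, 1]
])
model.compile(optimizer="adam", loss="mse")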
I'm trying to understand how to use Stellargraph's EdgeSplitter class. In particular, the documentation's example for training a link prediction model based on Node2Vec splits the graph into the following parts:
[Image: distribution of samples across the train, validation, and test sets]
Following the examples in the documentation, you first sample 10% of the links of the full graph to obtain the test set:
# Define an edge splitter on the original graph:
edge_splitter_test = EdgeSplitter(graph)

# Randomly sample a fraction p=0.1 of all positive links, and the same number
# of negative links, from graph, and obtain the reduced graph graph_test with
# the sampled links removed:
graph_test, examples_test, labels_test = edge_splitter_test.train_test_split(
    p=0.1, method="global"
)
As far as I understand from the docs, graph_test is the original graph with the test links removed. Then you perform the same operation to get the training set:
# Do the same process to compute a training subset from within the test graph:
edge_splitter_train = EdgeSplitter(graph_test)
graph_train, examples, labels = edge_splitter_train.train_test_split(
    p=0.1, method="global"
)
Following the previous logic, graph_train corresponds to graph_test with the training links removed.
Further down in the code, my understanding is that we use graph_train to train the embedding and the training samples (examples, labels) to train the classifier. So I have several questions here:
Why are we using disjoint sets of training data to train different parts of the model? Shouldn't we train both the embedding and the classifier on the full training set of links?
Why is the test set so big? Wouldn't it be better to have most samples in the training set?
What is the correct way of using the EdgeSplitter class?
Thank you in advance for your help!
Why disjoint sets:
This may or may not matter depending on the embedding algorithm.
The risk with edges that are seen both by the embedding algorithm and by the classifier as targets is that the embedding algorithm may encode non-generalizable features.
For example, one feature of the embedding could theoretically be the node id, and other features could encode the entire neighborhood of the node. By combining two nodes' embeddings into a link vector in some contrived way, or by using a multilayer model, one could therefore create a binary feature which is 1 if the two nodes were connected during embedding training and 0 otherwise.
In that case the classifier might just learn to use this trivial feature, which is not present (i.e. has value 0) in the test data.
The above would not happen in a real scenario, but more subtle features could have the same effect to a lesser degree.
In the end, this only risks degrading model selection.
That is, the first split is there to make the test reliable; the second split is there to improve model selection. You can therefore omit the second split if you wish, as sketched below.
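Put as a sketch (assuming a StellarGraph object named graph, as in the question):

from stellargraph.data import EdgeSplitter

# First split: hold out reliable test edges from the original graph.
graph_test, edges_test, labels_test = EdgeSplitter(graph).train_test_split(
    p=0.1, method="global"
)

# Optional second split: hold out edges from graph_test for model selection;
# the embedding is then trained on graph_train only.
graph_train, edges_sel, labels_sel = EdgeSplitter(graph_test).train_test_split(
    p=0.1, method="global"
)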
Why test set so big:
You are likely to get a higher score with a bigger train set. As long as the experiment is repeated with different splits and the variance is under control, it should be fine to increase the train size.
What is the correct way to use EdgeSplitter:
I don't know what 'correct' means here; I think graph splitting is still an active research field.
I've been experimenting in Python with sklearn's BayesianGaussianMixture (and with GaussianMixture, which shows the same issue).
I fit the model on a number of points drawn from a distribution, then tested it on a held-out data set (some points from the same distribution, some from outside it).
Something like:
from sklearn.mixture import BayesianGaussianMixture

X_train = ...  # 70x321 matrix of training points
X_in = ...     # 20x321 matrix of held-out points from the same distribution
X_out = ...    # 20x321 matrix of points drawn from a different distribution

model = BayesianGaussianMixture(n_components=1)
model.fit(X_train)
print(model.score_samples(X_in).mean())
print(model.score_samples(X_out).mean())
outputs:
-1334380148.57
-2953544628.45
The score_samples method returns a per-sample log likelihood of the given data, and the "in" samples are much more likely than the "out" samples, as expected. I'm just wondering why the absolute values are so large?
The documentation for score_samples states "Compute the weighted log probabilities for each sample" - but I'm unclear what the weights are based on.
Do I need to scale my input first? Is my input dimensionality too high? Do I need to do some additional parameter tuning? Or am I just misunderstanding what the method returns?
The weights are based on the mixture weights.
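Concretely, for a fitted mixture with weights π_k, means μ_k, and covariances Σ_k, score_samples returns, for each sample x, the log of the weighted sum of the component densities: log p(x) = log Σ_k π_k N(x | μ_k, Σ_k).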
Do I need to scale my input first?
This is usually not a bad idea, but I can't say without knowing more about your data.
Is my input dimensionality too high?
Given the amount of data you are fitting, it actually does seem too high. Remember the curse of dimensionality: you have very few rows of data and 321 features, roughly a 1:4.5 sample-to-feature ratio; that's not really going to work in practice.
Do I need to do some additional parameter tuning? Or am I just misunderstanding what the method returns?
Your outputs are log-probabilities that are very negative. If you exponentiate such a large negative number, you get a probability very close to zero, so your results actually make sense from that perspective. You may want to check the log-probability in regions where you know the probability of belonging to that component is higher. You may also want to check the covariances of each component to make sure you don't have a degenerate solution, which is quite likely given the amount of data and the dimensionality in this case. Before any of that, you may want to get more data or see whether you can reduce the number of dimensions.
I forgot to mention a rather important point: the output is a (log) density, not a probability mass, so it is not bounded the way a probability would be; keep that in mind too.
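Here is a hedged sketch of the suggestions above (standardize, reduce the dimensionality, then fit); the data is a synthetic stand-in with the shapes from the question, and the choice of 10 PCA components is an illustrative guess:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
X_train = rng.normal(size=(70, 321))         # stand-in for the real data
X_in = rng.normal(size=(20, 321))
X_out = rng.normal(loc=3.0, size=(20, 321))  # drawn from a shifted distribution

scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=10).fit(scaler.transform(X_train))

def reduce(X):
    # Apply the same scaling and projection fitted on the training data.
    return pca.transform(scaler.transform(X))

model = BayesianGaussianMixture(n_components=1).fit(reduce(X_train))
print(model.score_samples(reduce(X_in)).mean())
print(model.score_samples(reduce(X_out)).mean())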