CNN regression results in two distinct (incorrect) predictions - python

I'm trying to solve a regression problem with a Keras CNN in Python (TensorFlow as the backend), where I predict a single y-value from an 8-channel satellite image (23x45 pixels) fetched from Google Earth Engine using their Python API. I currently have 280 images, which I augment to 2500 images using flipping and random noise. The data is normalized and standardized, and I have removed outliers and images containing only zeros.
I've tested numerous CNN architectures, for example:
Conv2D(4, (4, 3)), MaxPooling2D((2, 2)), Dense(50), Dropout(0.4), Dense(30), Dropout(0.4), Dense(1)
This results in odd behaviour: the predicted values fall mainly into two distinct groups or clusters, each with very little variance, while the true values have much higher variance. See the image below.
I have chosen not to publish any code snippets as my question is more of a general nature. What might lead to such clustered predictions? Are there any good common tricks to improve the results?
I've tried to solve the problem using a normal neural network and regression tools from SciKit-Learn, by flattening the images to one long array (length 23x45x8 = 8280). That doesn't result in clustering, although the accuracy is still quite low. I assume that is due to insufficient or inappropriate data.
[Plot: truth (x) vs. prediction (y), showing heavily clustered predictions]

Your model is quite simple; it probably cannot extract features properly, so my guess is that it is underfitting. Also, a dropout of 40% in two layers is quite high for such a small network, and it appears you are using a linear activation throughout.
The number of samples can also contribute to grouped predictions: values that dominate the training set tend to be favoured.
I have also noticed that some of your truth values are greater than 1 or less than 0. You need to normalize properly and use an appropriate output activation.
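For illustration, a lighter-dropout variant with a linear regression output might look like this (a sketch only, not the asker's exact model; all layer sizes are assumptions, and the targets are assumed to be rescaled to a fixed range beforehand):

    from tensorflow import keras
    from tensorflow.keras import layers

    # Sketch: 23x45 images with 8 channels in, one scalar out.
    model = keras.Sequential([
        layers.Conv2D(16, (3, 3), activation="relu", input_shape=(23, 45, 8)),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.Flatten(),
        layers.Dense(50, activation="relu"),
        layers.Dropout(0.2),  # lighter dropout for a small network
        layers.Dense(30, activation="relu"),
        layers.Dense(1),      # linear output for regression
    ])
    model.compile(optimizer="adam", loss="mse")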

Related

Model doesn't learn from data

We have a dataset with ~40000 data points, each having 160 features. We know nothing about what each feature represents, but they are integers from 0 to 5, most probably rankings of some kind. Our task is to take a subset of those features, let's say (40000, 30), and predict the initial (40000, 160) data. In other words, we need to create a model that takes 30 features as input and outputs the full set of 160 features.
[Image: an example of the dataset (https://i.stack.imgur.com/Ko6nR.png)]
What we have done so far: we trained an ANN with the following architecture:
30->200->150->163
We calculate an accuracy score by rounding the prediction (for example, a prediction of 3.6 for a true value of 4 rounds to 4, 4 == 4, so it counts as correct).
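In code, that metric is essentially the following (a sketch; y_true and y_pred stand for the target and prediction arrays):

    import numpy as np

    def rounded_accuracy(y_true, y_pred):
        # A prediction counts as correct if it matches the target
        # after rounding to the nearest integer (e.g. 3.6 -> 4).
        return np.mean(np.round(y_pred) == y_true)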
We got ~52% accuracy and nothing makes it go higher.
So, the problem is a multi-output regression problem, predicted from 30 discrete numeric features. Normalization was done both with min-max scaling and with standardization (the target is also normalized). For the model, we tried different numbers of layers with different capacities, tried batch norm, different activations (ReLU is used now; no activation on the output layer), different losses (MSE currently), and different optimizers (Adam currently). Both Keras and PyTorch were used, in case something was wrong with the PyTorch implementation.
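For reference, the Keras version of that setup is essentially the following (a sketch; the exact layer details varied across our experiments):

    from tensorflow import keras
    from tensorflow.keras import layers

    # 30 input features -> 200 -> 150 -> 163 outputs, as described above.
    model = keras.Sequential([
        layers.Dense(200, activation="relu", input_shape=(30,)),
        layers.Dense(150, activation="relu"),
        layers.Dense(163),  # no activation on the output layer
    ])
    model.compile(optimizer="adam", loss="mse")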
So the accuracy remains at 50-52%. One straightforward observation: when we increase the model capacity (the number of parameters), the model should become more prone to overfitting. Yet even after increasing the capacity very substantially, we couldn't make the model overfit the data. We also tried using the features separately (for example, predicting one feature from another) with nothing useful, and we tried predicting 1 feature from the other 159, but again got ~52% or even less.
What I understand and can conclude from this: there is no relationship between those ratings, and most of them can't predict the others. What do you think about this case?

What's the reason the weights of my NN model don't change much?

I am training a neural network model, and it fits the training data well: the training loss decreases steadily and everything looks fine. However, when I inspect the weights of my model, I find that they have not changed much since random initialization (I didn't use any pretrained weights; all weights are initialized by PyTorch's defaults). Every dimension of the weights changed by only about 1%, while the accuracy on the training data climbed from 50% to 90%.
What could account for this phenomenon? Is the dimensionality of the weights too high, meaning I need to reduce the size of my model, or are there other possible explanations?
I understand this is quite a broad question, but I think it's impractical to show my model and analyze it mathematically here, so I just want to know the general / common causes of this behaviour.
There are almost always many local optima in a problem like this, so especially in high-dimensional feature spaces you can't say in advance which optimum your model parameters will settle into. One important point: because the weights are real-valued, for every optimum there are infinitely many weight sets that reach it; what matters is the proportion of the weights to each other, since you are trying to minimize the cost, not to find one unique set of weights with zero loss on every sample. Every time you train, you may get a different result depending on the initial weights.
When the weights all change by almost the same small ratio, it suggests your features are highly correlated (i.e. redundant). And since you are reaching very high accuracy with only a small change in the weights, the only other explanation I can think of is that your classes are far apart from each other. Try removing features one at a time, retrain, and check the results; if accuracy stays good, keep removing features until you (hopefully) reach a 2- or 3-dimensional space where you can plot your data, see how the points are distributed, and make some sense of it.
EDIT: A better approach is to use PCA for dimensionality reduction instead of removing features one by one.
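A minimal scikit-learn sketch of that PCA reduction (X stands for your feature matrix and is an assumed placeholder):

    from sklearn.decomposition import PCA

    # Reduce X (n_samples, n_features) to 2 dimensions for plotting.
    pca = PCA(n_components=2)
    X_2d = pca.fit_transform(X)
    print(pca.explained_variance_ratio_)  # variance retained per component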

What is the difference in interpretation of the "probability" returned by a kNN or a DNN algorithm

I have two datasets, each defined by the same two parameters. If you plot them on a scatter plot, there is some overlap. I'd like to classify them, but also get a probability that a given point is in one dataset or another. So in the overlap region, I would never expect the probability to be 100%.
I've implemented this using python's scikit-learn package and the kNN algorithm, KNeighborsClassifier. It looks pretty good! When I use predict_proba to return the probability, it looks like what I would expect!
So then I tried doing the same thing with TensorFlow and the DNNClassifier, mostly as a learning exercise for myself. When I evaluate the test samples, I use predict_proba to return the probabilities, but the distribution of probabilities looks much different from the kNN approach. It looks like the DNNClassifier really tries to drive the probabilities to 1 or 0, rather than somewhere in between for the overlapping region.
I've not posted code here because my question is more basic: can I interpret the probabilities returned by these two approaches in the same way, or is there a fundamental difference between them?
Thanks!
Yes. Provided you used sigmoid or softmax for prediction, you should be getting values that are reasonable to interpret as probabilities (DNNClassifier uses softmax, as far as I know).
That said, you didn't give us any details on the models. Depending on their complexity and the training parameters, you might be getting more overfitting.
If you are seeing extreme (0 or 1) values for the overlapping area, it's probably overfitting; use a test/validation set to keep a check on it.
From what you are describing, a very simple model should do: try less depth and fewer parameters.
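The mechanics also differ: kNN "probabilities" are just neighbor fractions, so they rarely saturate the way a softmax output can. A sketch (X_train, y_train, X_test are assumed placeholders):

    from sklearn.neighbors import KNeighborsClassifier

    knn = KNeighborsClassifier(n_neighbors=15)
    knn.fit(X_train, y_train)

    # Each row is the fraction of the 15 nearest neighbors in each class,
    # so values are multiples of 1/15 and almost never exactly 0 or 1.
    print(knn.predict_proba(X_test))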

What algorithm to choose for binary image classification

Let's say I have two arrays in the dataset:
1) The first is the array of class labels (0, 1): [0,1,0,1,1,1,0.....]
2) The second array consists of greyscale image vectors with 2500 elements each (numbers from 0 to 300). These numbers are the pixels of 50x50 px images: [[13 160 239 192 219 199 4 60..][....][....][....][....]]
The size of this dataset is quite significant (~12000 elements).
I am trying to build a very basic binary classifier that gives reasonable results. Let's say I want a supervised method that is not deep learning.
Is that suitable in this case? I've already tried sklearn's SVM with various parameters, but the outcome is unacceptably inaccurate and consists mainly of 1s: [1,1,1,1,1,0,1,1,1,....]
What is the right approach? Isn't the size of the dataset enough to get a good result with a supervised algorithm?
You should probably post this on Cross Validated.
But as a direct answer: you should probably look into sequence-to-sequence learners, as it has become clear to you that SVM is not the ideal solution for this.
You could look into Markov models for sequential learning if you don't want to go the deep learning route; however, neural networks have a very good track record with image classification problems.
Ideally, for sequential learning you should look into Long Short-Term Memory recurrent neural networks, and for your current dataset see whether pre-training on an existing data corpus (say, CIFAR-10) may help.
So my recommendation is to give TensorFlow a try with a high-level library such as Keras/SKFlow.
Neural networks are just another tool in your machine learning repertoire, and you might as well give them a real chance.
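If you do try Keras, a minimal CNN for this data could look like the following (a sketch only; it assumes X holds the images reshaped to (n, 50, 50, 1) and scaled to [0, 1], and y the 0/1 labels):

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        layers.Conv2D(16, (3, 3), activation="relu", input_shape=(50, 50, 1)),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # probability of class 1
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    # model.fit(X, y, epochs=10, validation_split=0.2)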
An Edit to address your comment:
Your issue there is not a lack of data for the SVM; an SVM will work well on a small dataset, since it is easier for it to fit (or overfit) a separating hyperplane. As you increase the dimensionality of your data, keep in mind that separating it with a hyperplane becomes increasingly difficult (look up the curse of dimensionality). However, if you are set on doing it this way, try some dimensionality reduction such as PCA. Although here you're bound to find another face-off with neural networks, since Kohonen self-organizing maps do this task beautifully, you could project your data into a lower dimension, thereby allowing the SVM to separate it with greater accuracy, as in the sketch below. I still stand by saying you may be using the incorrect approach.
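A concrete sketch of that PCA route (the component count and class_weight setting are assumptions; X_train is the flattened (n, 2500) image matrix, y_train the labels):

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.svm import SVC

    # Scale pixels, project to 50 components, then fit an RBF SVM.
    # class_weight="balanced" may help if it keeps predicting mostly 1s.
    clf = make_pipeline(
        StandardScaler(),
        PCA(n_components=50),
        SVC(kernel="rbf", class_weight="balanced"),
    )
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))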

Accuracy difference on normalization in KNN

I trained my model with the KNN classification algorithm and was getting around 97% accuracy. However, I later noticed that I had forgotten to normalise my data; after normalising it and retraining, I now get an accuracy of only 87%. What could be the reason? And should I stick with the unnormalised data, or switch to the normalised version?
To answer your question, you first need to understand how KNN works. Here is a simple diagram:
[Diagram: red and blue points scattered on a plane, with a '?' marking the point to be classified]
Suppose the ? is the point you are trying to classify as either red or blue. For this case, let's assume you haven't normalized any of the data. As you can clearly see, the ? is closer to more red dots than blue dots, so this point would be classified as red. Let's also assume the correct label is red, so this is a correct match!
Now, to discuss normalization. Normalization is a way of taking data that is slightly dissimilar and giving it a common scale (in your case, think of it as making the features more comparable). Assume in the above example that you normalize the ?'s features and its y value consequently becomes smaller. This would place the question mark below its current position, surrounded by more blue dots. Your algorithm would then label it as blue, and it would be incorrect. Ouch!
Now to answer your question. Sorry, but there is no single answer! Sometimes normalizing data removes important feature differences, causing accuracy to go down. Other times, it helps to eliminate noise in your features that causes incorrect classifications. Also, just because accuracy goes up on the dataset you are currently working with doesn't mean you will get the same results on a different dataset.
Long story short: instead of trying to label normalization as good or bad, consider the feature inputs you are using for classification, determine which ones are important to your model, and make sure differences in those features are reflected accurately in your classification model. Best of luck!
That's a pretty good question, and the result is unexpected at first glance, because normalization usually helps a KNN classifier. Good KNN performance generally requires preprocessing the data so that all variables are similarly scaled and centered; otherwise KNN is often dominated inappropriately by scaling factors.
In this case the opposite effect is seen: KNN seemingly gets WORSE with scaling.
However, what you may be witnessing is overfitting. The KNN may be overfit, which is to say it memorized the training data very well but does not work well at all on new data. The first model might have memorized more of the data due to some characteristic of that data, but that's not a good thing. You need to check your prediction accuracy on a different set of data than the one you trained on, a so-called validation or test set.
Then you will know whether the KNN accuracy is OK or not.
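A minimal way to run that check with scikit-learn (a sketch; X and y are assumed placeholders):

    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

    # Accuracy on held-out data, raw features.
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
    print("raw:", knn.score(X_te, y_te))

    # Accuracy on held-out data, scaled features (scaler fit on training data only).
    scaler = StandardScaler().fit(X_tr)
    knn = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(X_tr), y_tr)
    print("scaled:", knn.score(scaler.transform(X_te), y_te))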
Look into learning curve analysis in the context of machine learning, and read up on bias and variance; it's a deeper subject than can be covered here. The best, cheapest, and fastest sources of instruction on this topic are videos on the web by the following instructors:
Andrew Ng, in the online Coursera course Machine Learning
Tibshirani and Hastie, in the online Stanford course Statistical Learning
If you use normalized feature vectors, the distances between your data points are likely to differ from those with unnormalized features, particularly when the ranges of the features differ. Since kNN typically uses Euclidean distance to find the k nearest points to any given point, using normalized features may select a different set of k neighbors than the ones chosen with unnormalized features; hence the difference in accuracy.
