Train a cascaded network with a classifier in between

Train a cascaded network with a classifier in between - python

I am attempting to do a two-fold task. The input is an image and based on the input I want to pick another image from a set of images (classification task) and then use both the images to obtain an output tensor. Clearly, I can train both the models separately if I know the ground truth of which image I should pick from that set. But, I only have the output tensor ground truth.
The problem, as it appears to me, is that if we employ a classification layer, the gradients will not be differentiable anymore. How do I deal with this problem? Is there literature which uses this kind of architecture for any application? TIA
More details: I have multiple images of an object/scene and I want to use two of those images for some kind of reconstruction problem. To maximize the performance of reconstruction, I want to smartly choose the second image if I am given the first image. For eg., I have three images A, B, C and using AC gives the best result. I need a model which given A predicts C and then using AC I can achieve the reconstruction. Is the task clear now? I do not have ground truth which says AC is better than AB. Is the task clear now?

So basically, you want to do a classification task followed by a reconstruction task.
Here is what I suggest (I do not pretend this the absolute best solution, but it's how I would approach this problem) :
You can create a single task that does Classification--> Reconstruction with a single loss. Let's still separate this network in two and call net_class the part that does classification , and net_reconstruct the part performing reconstruction.
Let's say your classification network predicts {'B': 0.1, 'C': 0.9). Instead of using only image 'C' for reconstruction, I would feed both pairs (A-B and A-C) to the second network and compute a reconstruction loss L (I'm not an expert in reconstruction, but I guess there are some classical losses in this).
Therefore, you would compute two losses L(A-B) and L(A-C).
My total loss would be 0.1 * L(A-B) + 0.9 L(A-C). This way, you would train net_class to choose the pairing that minimizes the reconstruction loss and you would still train net_reconstruct to minimize both losses, and the loss is continuous (and therefore, differentiable according to AI experts ;) ).
The idea behind this loss is three-fold :
1 - Improving the reconstructor makes the loss go down (since both L(A-B) and L(A-C) would decrease. therefore, this loss should make your reconstructor converge towards something you want.
2 - Let's imagine your reconstructor is pretty much trained (L(A-B) and L(A-C) are relatively low). Then, your classifier has an incentive to predict the class which has the lowest reconstruction loss.
3 - Now, your reconstructor and your classifier will train at the same time. You can expect, at the end of the training, to have a classifier that would output pretty much binary results (like 0.998 vs 0.002). AAt that point, your reconstructor will almost only train on the scene associated with the 0.998 ouput. This should not be a problem, since, if I understood correctly your problem, you want to perform the reconstruction part only for the top classified scene.
Note that this method also works if you're not performing deep learning for the reconstruction part.
If you want some inspiration on this kind of topic, I recommend you read some blog posts about GANs (Generative Adversarial Networks). They use the same two stage - one loss trick (with some slight differences of course, but the ideas are very close).
Good luck !

Related

CNN regression results in two distinct (incorrect) predictions

I'm trying to solve a regression problem using a Python Keras CNN (Tensorflow as the backbone), where I try to predict a single y-value based on an 8-dimensional satellite image (23x45 pixels) that I have fetched from Google Earth Engine using their Python API. I currently have 280 images that I augment to get 2500 images using flipping and random noise. The data is normalized & standardized and I have removed outliers and images with only zeros.
I've tested numerous CNN-architecture, for example, this:
(Convolution2D(4,4,3),MaxPooling2D(2,2),dense(50),dropout(0.4),dense(30),dropout(0.4),dense(1)
This results in a weird behaviour where the predicted value is in mainly two distinct groups or clusters (where each group has very little variance). The true value has a much higher variance. See image below.
I have chosen not to publish any code snippets as my question is more of a general nature. What might lead to such clustered predictions? Are there any good common tricks to improve the results?
I've tried to solve the problem using a normal neural network and regression tools from SciKit-Learn, by flattening the images to one long array (length 23x45x8 = 8280). That doesn't result in clustering, although the accuracy is still quite low. I assume that is due to insufficient or inappropriate data.
Plotted Truth (x) vs Prediction (y) which shows that the prediction is heavily clustered

your model is quite simple, it cannot even properly extract feature, so i guess it is under fit. and your dropout is 40% in 2 layers, which quite high for such small network. you also have linear activation, it seems that way.
and yes number of sample can also have effect on group prediction, mostly class with majority of samples is chosen.
i have also noticed some of your truth values are greater than 1 and less than 0. you have to normalize properly and use proper activation function.

Getting some sort of Math Formula from a Machine Learning trained model

I already asked this question here: Can Convolutional Neural Networks (CNN) be represented by a Mathematical formula? but I feel that I was not clear enough and also the proposed idea did not work for me.
Let's say that using my computer, I train a certain machine learning algorithm (i.e. naive bayes, decision tree, linear regression, and others). So I already have a trained model which I can give a input value and it returns the result of the prediction (i.e. 1 or 0).
Now, let's say that I still want to give an input and get a predicted output. However, at this time I would like that my input value to be, for example, multiplied by some sort of mathematical formula, weights, or matrix that represents my "trained model".
In other words, I would like that my trained model "transformed" in some sort of formula which I can give an input and get the predicted number.
The reason why I want to do this is because I wanna train a big dataset and use complex prediction model. And use this trained prediciton model in simpler hardwares such as a PIC32 microcontroler. The PIC32 Microntroler would not train the machine learning or store all inputs. Instead, the microcontroler would simple read from the system certain numbers, apply a math formula or some sort of matrix multiplication and give me the predicted output. With that, I can use "fancy" neural networks in much simpler devices that can easily operate math formulas.

If I read this properly, you want a generally continuous function in many variables to replace a CNN. The central point of a CNN existing in a world with ANNs ("normal" neural networks) is that in includes irruptive transformations: non-linearities, discontinuities, etc. that enable the CNN to develop recognitions and relationships that simple linear combinations -- such as matrix multiplication -- cannot handle.
If you want to understand this better, I recommend that you choose an introduction to Deep Learning and CNNs in whatever presentation mode fits your learning styles.

Essentially, every machine learning algorithm is a parameterized formula, with your trained model being the learned parameters that are applied to the input.
So what you're actually asking is to simplify arbitrary computations to, more or less, a matrix multiplication. I'm afraid that's mathematically impossible. If you ever do come up with a solution to this, make sure to share it - you'd instantly become famous, most likely rich, and put a hell of a lot of researchers out of business. If you can't train a matrix multiplication to get the accuracy you want from the start, what makes you think you can boil down arbitrary "complex prediction models" to such simple computations?

What's the reason for the weights of my NN model don't change a lot?

I am training a neural network model, and my model fits the training data well. The training loss decreases stably. Everything works fine. However, when I output the weights of my model, I found that it didn't change too much since random initialization (I didn't use any pretrained weights. All weights are initialized by default in PyTorch). All dimension of the weights only changed about 1%, while the accuracy on training data climbed from 50% to 90%.
What could account for this phenomenon? Is the dimension of weights too high and I need to reduce the size of my model? Or is there any other possible explanations?
I understand this is a quite broad question, but I think it's impractical for me to show my model and analyze it mathematically here. So I just want to know what could be the general / common cause for this problem.

There are almost always many local optimal points in a problem so one thing you can't say specially in high dimensional feature spaces is which optimal point your model parameters will fit into. one important point here is that for every set of weights that you are computing for your model to find a optimal point, because of real value weights, there are infinite set of weights for that optimal point, the proportion of weights to each other is the only thing that matters, because you are trying to minimize the cost, not finding a unique set of weights with loss of 0 for every sample. every time you train you may get different result based on initial weights. when weights change very closely with almost same ratio to each others this means your features are highly correlated(i.e. redundant) and since you are getting very high accuracy just with a little bit of change in weights, only thing i can think of is that your data set classes are far away from each other. try to remove features one at a time, train and see results if accuracy was good continue to remove another one till you hopefully reach to a 3 or 2 dimensional space which you can plot your data and visualize it to see how data points are distributed and make some sense out of this.
EDIT: Better approach is to use PCA for dimensionality reduction instead of removing one by one

What is the key feature in MNIST Dataset that is used to classify images

I was recently learning about neural networks and came across MNIST data set. i understood that a sigmoid cost function is used to reduce the loss. Also, weights and biases gets adjusted and an optimum weights and biases are found after the training. the thing i did not understand is, on what basis the images are classified. For example, to classify whether a patient has cancer or not, data like age, location, etc., becomes features. in MNIST dataset, i did not find any of that. Am i missing something here. Please help me with this

First of all the Network pipeline consists of 3 main parts:
Input Manipulation:
Parameters that effect the finding of minimum:
Parameters like your descission function in your interpretation
layer (often fully connected layer)
In contrast to your regular machine learning pipeline where you have to extract features manually a CNN uses filters. (Filters like in edge detection or viola and jones).
If a filter runs across the images and is convolved with pixels it Produces an output.
This output is then interpreted by a neuron. If the output is above a threshold it is considered as valid (Step function counts 1 if valid or in case of Sigmoid it has a value on the sigmoid function).
The next steps are the same as before.
This is progressed until the interpretation layer (often softmax). This layer interprets your computation (if the filters are good adapted to your problem you will get a good predicted label) which means you have a low difference between (y_guess - y_true_label).
Now you can see that for the guess of y we have multiplied the input x with many weights w and also used functions on it. This can be seen like a chain rule in analysis.
To get better results the effect of a single weight on the input must be known. Therefore, you use Backpropagation which is a derivative of the Error with respect to all w. The Trick is that you can reuse derivatives which is more or less Backpropagation and it becomes easier since you can use Matrix vector notation.
If you have your gradient, you can use the normal concept of minimization where you walk along the steepest descent. (There are also many other gradient methods like adagrad or adam etc).
The steps will repeat until convergence or until you reach the maximum epochs.
So the answer is: THE COMPUTED WEIGHTS (FILTERS) ARE THE KEY TO DETECT NUMBERS AND DIGITS :)

What algorithm to chose for binary image classification

Lets say I have two arrays in dataset:
1) The first one is array classified as (0,1) - [0,1,0,1,1,1,0.....]
2) And the second array costists of grey scale image vectors with 2500 elements in each(numbers from 0 to 300). These numbers are pixels from 50*50px images. - [[13 160 239 192 219 199 4 60..][....][....][....][....]]
The size of this dataset is quite significant (~12000 elements).
I am trying to build bery basic binary classificator which will give appropriate results. Lets say I wanna choose non deep learning but some supervised method.
Is it suitable in this case? I've already tried SVM of sklearn with various parameters. But the outcome is inappropriately inacurate and consists mainly of 1: [1,1,1,1,1,0,1,1,1,....]
What is the right approach? Isnt a size of dataset enough to get a nice result with supervised algorithm?

You should probably post this on cross-validated:
But as a direct answer you should probably look into sequence to sequence learners as it has been clear to you SVM is not the ideal solution for this.
You should look into Markov models for sequential learning if you dont wanna go the deep learning route, however, Neural Networks have a very good track record with image classification problems.
Ideally for a Sequential learning you should try to look into Long Short Term Memory Recurrent Neural Networks, and for your current dataset see if pre-training it on an existing data corpus (Say CIFAR-10) may help.
So my recomendation is give Tensorflow a try with a high level library such as Keras/SKFlow.
Neural Networks are just another tool in your machine learning repertoire and you might aswell give them a real chance.
An Edit to address your comment:
Your issue there is not a lack of data for SVM,
the SVM will work well, for a small dataset, as it will be easier for it to overfit/fit a separating hyperplane on this dataset.
As you increase your data dimensionality, keep in mind that separating it using a separating hyperplane becomes increasingly difficult[look at the curse of dimensionality].
However if you are set on doing it this way, try some dimensionality reduction
such as PCA.
Although here you're bound to find another fence-off with Neural Networks,
since the Kohonen Self Organizing Maps do this task beautifully, you could attempt to
project your data in a lower dimension therefore allowing the SVM to separate it with greater accuracy.
I still have to stand by saying you may be using the incorrect approach.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.