Train a feed forward neural network indirectly

I am faced with this problem:
I have to build an FFNN that has to approximate an unknown function f:R^2 -> R^2. The data in my possession to check the net is a one-dimensional R vector. I know the function g:R^2->R that will map the output of the net into the space of my data. So I would use the neural network as a filter against bias in the data. But I am faced with two problems:
Firstly, how can I train my network in this way?
Secondly, I am thinking about adding an extra hidden layer that maps R^2->R and lets the net train itself to find the correct maps and then remove the extra layer. Would this algorithm be correct? Namely, would the output be the same that I was looking for?

Your idea of an additional layer is good, but the weights in that layer have to be fixed, because the layer must implement the known function g. In practice, you have to compute the partial derivatives of your R^2->R mapping with respect to its two inputs and use them to propagate the error back through the network during training. Be aware that stacking extra layers can aggravate the well-known "vanishing gradient problem", which held back the development of neural networks for many years.
In short, you can either compute the partial derivatives manually and, given the expected output in R, feed the resulting "backpropagated" errors to the network that learns the R^2->R^2 mapping, or, as you said, add an extra layer and train normally while keeping its weights constant (which will require some changes in the implementation).
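The first option can be sketched in plain NumPy. This is only a minimal illustration under assumptions that are not in the question: the map g(y) = y_0 * y_1, the toy data, the network size and the learning rate are all hypothetical placeholders. The point is the chain rule step, where the error measured in R is multiplied by the partial derivatives of g before it is propagated into the R^2->R^2 network.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(y):
    # Hypothetical known map R^2 -> R (assumption for illustration only).
    return y[:, 0] * y[:, 1]

def grad_g(y):
    # Partial derivatives of g with respect to the two network outputs.
    return np.stack([y[:, 1], y[:, 0]], axis=1)

# Toy data: inputs in R^2 and scalar observations d = g(f_true(x)).
X = rng.uniform(-1.0, 1.0, size=(256, 2))
f_true = np.stack([X[:, 0] + 1.0, X[:, 1] - 1.0], axis=1)
d = g(f_true)

# Network f: R^2 -> R^2 with one tanh hidden layer.
W1 = rng.normal(0.0, 0.5, size=(2, 8))
b1 = np.zeros(8)
W2 = rng.normal(0.0, 0.5, size=(8, 2))
b2 = np.zeros(2)

def forward(inputs):
    h = np.tanh(inputs @ W1 + b1)
    return h, h @ W2 + b2

def loss():
    _, y = forward(X)
    return float(np.mean((g(y) - d) ** 2))

initial_loss = loss()
lr = 0.01
for _ in range(2000):
    h, y = forward(X)
    err = g(y) - d                                   # error in the data space R
    dy = (2.0 / len(X)) * err[:, None] * grad_g(y)   # chain rule through the fixed g
    dW2 = h.T @ dy
    db2 = dy.sum(axis=0)
    dh = (dy @ W2.T) * (1.0 - h ** 2)                # derivative of tanh
    dW1 = X.T @ dh
    db1 = dh.sum(axis=0)
    W2 -= lr * dW2
    b2 -= lr * db2
    W1 -= lr * dW1
    b1 -= lr * db1
final_loss = loss()
```

Note that a low loss in R does not guarantee that the network has recovered the intended f, because many different R^2 outputs can map to the same value under g.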

Related

Neural network and the law of large numbers

I am struggling to implement the following function in python, which holds by the law of large numbers:
where ANN stands for artificial neural network.
I have created a sample from which I have several subsamples. I want to feed one subsample at a time, increasingly, to train a neural network. That implies I will have a neural network for each subsample:
ANN(X_t,N,\theta_1,1)+ANN(X_t,N,\theta_2,2)+....
And each needs to be incorporated in a sum.
However, I have no idea how to implement this, since I would need to store not the values but the neural network itself after each computation. Are there any references on how to solve a problem of this kind? I have looked at the recurrent neural networks implemented in Python, namely the LSTM, but that does not "store" each neural network; furthermore, it selects the variables that are more meaningful across time.
Thanks in advance.

By invoking (artificial) neural networks and the Central Limit Theorem you step into quite a few concepts. Let me try to elaborate on these concepts before trying to suggest a solution.
First, the fact that
$\frac{1}{n}\sum_{i=1}^{n} X_i \to E[X]$ as $n \to \infty$
holds P-almost surely for a family of random variables X_{1},X_{2},... that are iid (independent and identically distributed) like the random variable X is called the Strong Law of Large Numbers (LLN). In contrast, the Central Limit Theorem (CLT) refers to the limiting distribution (as the name suggests), which is Gaussian. Both theorems require proper scaling, namely $1/n$ for the LLN
and $1/\sqrt{n}$ (applied to the centered sum) for the CLT, respectively. Both theorems allow approximation through a finite sum of up to J summands, which is what you attempt. However, equality is lost and approximate equality, i.e. ≈, is appropriate. Moreover, there is no normalization in your summation, which will cause the term to diverge. Note that the limits hold for certain functions being applied to X. You assume that this function is ANN(X_t, N, Θ, j).
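The effect of the missing normalization can be checked numerically. The following small sketch uses iid Uniform(0, 1) draws as a stand-in for the summands (an assumption chosen only because the expectation 0.5 is known); it compares the raw sum with the sum scaled by 1/n.

```python
import random

random.seed(0)
n = 100_000

# iid samples from Uniform(0, 1), whose expectation is 0.5.
samples = [random.random() for _ in range(n)]

raw_sum = sum(samples)      # without scaling: grows roughly like n * E[X]
sample_mean = raw_sum / n   # with the LLN scaling 1/n: approaches E[X]
```

The raw sum keeps growing as n increases, while the scaled version settles near the expectation.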
Second, the (artificial) neural network. Like any statistical model, a neural network takes in data input X, hyperparameters that determine the network architecture (e.g. depth and size of the involved layers) that might be N in your case, and a parameter vector Θ. The latter is only obtained after the model has been trained on data. In turn, I'd interpret your function
def ANN(X_t, N, Θ)
as the inference function that compiles a previously trained neural network by combining the hyperparameter value N with the parameter vector Θ and applies it to the current data input X_{t}. However, you don't clarify what the input j is. j and Θ_j seem to suggest a recurrent neural network (RNN). An LSTM is a special type of RNN. However, it is unclear what the inputs actually are, as you leave this vague. RNNs are used on speech, text, and numeric time-series data. This is further complicated by the fact that X_{t} appears on the left-hand side in the expectation and on the right-hand side as the input to the neural network.
Finally, the suggested solution. If the ANNs are in fact independent and you meant to write E(Y), then your equation vaguely describes ensemble learning. There, several neural networks (of the same architecture) are trained on the same dataset and their predictions are averaged (not summed) to gain a more accurate prediction of the expectation of Y. If, on the other hand, you do describe RNNs, the equation above for E(X) vaguely describes a convergence of non-independent random variables, as X_{t+1} and Θ_{t+1} depend on the previous X_t's and Θ_t's. Intuitively, you try to show that the output of an RNN converges to some numeric value when applied iteratively. Mathematically speaking, there are LLN-like results for non-iid random variables, but they impose other very specific assumptions, e.g. on the type of dependence.
Regarding storing neural networks. You can implement your own ANN program which is a lot of work (as it requires training and inference functions). Virtually every deep learning framework in Python allows storing/loading a parameter vector Θ which would allow you to implement your procedure regardless of what mathematical meaning you'd like to derive from it. In keras, for example, a model can be saved via
model.save(PARAMETER_PATH)
and later re-loaded via
keras.models.load_model(PARAMETER_PATH)
(see the reference). Similar methods exist for PyTorch, another very popular deep learning framework in Python.
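The overall procedure (train one model per subsample, store each one, then average the predictions) might look like the sketch below. To keep it dependency-free, a least-squares line fit stands in for neural network training and `pickle` stands in for the framework's save/load calls; `fit_line`, `predict` and the toy data are hypothetical names and values, not part of the question.

```python
import pickle
import random

random.seed(0)

def fit_line(xs, ys):
    """Stand-in for training: least-squares fit of y = a * x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return {"a": a, "b": my - a * mx}

def predict(params, x):
    """Stand-in for inference with a stored parameter set."""
    return params["a"] * x + params["b"]

# Toy sample split into three subsamples of 100 points each.
xs = [random.uniform(0.0, 1.0) for _ in range(300)]
ys = [2.0 * x + 1.0 + random.gauss(0.0, 0.1) for x in xs]

stored = []
for j in range(3):
    lo, hi = j * 100, (j + 1) * 100
    params = fit_line(xs[lo:hi], ys[lo:hi])
    stored.append(pickle.dumps(params))   # keep the model itself, not its outputs

# Later: restore every model and average (not sum) the predictions.
restored = [pickle.loads(blob) for blob in stored]
x_new = 0.5
ensemble_prediction = sum(predict(p, x_new) for p in restored) / len(restored)
```

With Keras, the `pickle` calls would be replaced by the save and load calls shown above, using one path per subsample.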

How to use TF Adam to optimize multiple outputs from the same neural-network; but with a varying linear transformation?

This is a tough question to ask, as I don't have code examples.
I have a task where the same neural network predicts a time-series characteristic (say, a single sigmoid activation) for a single time period ahead. So you have several areas with information at time t, say A(t), B(t), C(t). This same neural network will take in some past information and give you various outputs Y(A;t), Y(B;t), Y(C;t).
You want to take the time-based outputs of this neural network for various inputs, and to each output apply a time-varying but known linear transformation. At this point you have a collection of outputs depending on the network parameters, the time-dependent inputs and the linear transformations.
After this, you want to collect a number of these outputs over a period of a month. You then compute a value X based on all of these outputs (after their linear transformation) for that month.
I now want to use Adam to optimize the weights of this neural network (the same network is used each time). I have something that depends on the outputs for a lot of different fixed neural network inputs, and on a lot of different fixed linear transformations.
So I am not sure how to use TensorFlow's Adam to change the weights, or whether you can simply use model.output or predict in a custom version of the optimizer and have it understand that this is a function of the weights.
Sorry if this is hard to understand. I am finding it hard to explain, and I don't have code for it, as that would make the question redundant.

Does the squared error depend on the number of hidden layers?

I wanted to know if the squared error also depends on the number of hidden layers and the number of neurons in each hidden layer. I've created a neural network with one hidden layer, but I can't reach a small squared error, so maybe the function is not convex? Can I optimize the weights by adding more hidden layers?

The more neurons (e.g. layers) you add to your model, the better you can approximate arbitrary functions. If your loss on your training data is not decreasing any further you are underfitting. This can be solved by making the model more complex, i.e. adding more trainable parameters. But you have to be careful, that you do not overdo it and end up overfitting.

Though this is not a programming question, I'll try my best to answer it here.
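The squared error, i.e. the 'loss' of your neural network, depends on your neural network prediction and the ground truth. By definition it is convex as a function of the prediction. As a function of the weights of a network with hidden layers, however, it is generally not convex, so training can get stuck in poor regions.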
The reasons that you're not getting low losses could be:
You're not normalizing your inputs. For example, if you got a series of house prices as input, which is around 500k to 1m, and you didn't normalize them, your prediction will be the linear combination of the prices, which is about the same order of magnitude, then pass through the activation function. This could result in large losses.
You're not initializing your weights and biases correctly. Similar to above, you could have large weights/biases which lead to large prediction values.
You didn't choose the proper output activation function. When you're doing classification, your labels are generally one-hot encoded, so the output activation should limit the prediction to [0,1] or similar, which means relu won't be a proper option. Also, you don't want sigmoid as the output activation for regression problems whose targets fall outside its range.
Your labels are not predictable or have too much noise. Or maybe your network is not complex enough to capture important patterns; in that case you could try adding more layers and more nodes per layer.
Your learning rate is too small, which leads to slow convergence.
That's all I have in mind. You probably need more work to find out the reason to your problem.
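As an illustration of the first point, inputs such as house prices can be standardized before training. This is a minimal sketch; the price values are made up and `standardize` is a hypothetical helper, not a library function.

```python
import statistics

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values], mean, std

# Made-up house prices in the 500k to 1m range.
prices = [500_000, 650_000, 720_000, 880_000, 1_000_000]
scaled, mean, std = standardize(prices)
```

The returned `mean` and `std` should be stored and reused to transform any new inputs at prediction time.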

What is the key feature in MNIST Dataset that is used to classify images

I was recently learning about neural networks and came across the MNIST data set. I understood that a cost function is minimized to reduce the loss, with sigmoid activations in the network. Also, the weights and biases get adjusted, and optimal weights and biases are found after training. The thing I did not understand is on what basis the images are classified. For example, to classify whether a patient has cancer or not, data like age, location, etc. become features. In the MNIST dataset, I did not find any of that. Am I missing something here? Please help me with this.

First of all, the network pipeline consists of 3 main parts:
Input manipulation
Parameters that affect the finding of the minimum
Parameters like your decision function in your interpretation layer (often a fully connected layer)
In contrast to a regular machine learning pipeline, where you have to extract features manually, a CNN uses filters (filters like those in edge detection or Viola-Jones).
When a filter runs across the image and is convolved with the pixels, it produces an output.
This output is then interpreted by a neuron. If the output is above a threshold, it is considered valid (a step function counts 1 if valid; in the case of a sigmoid, it takes a value on the sigmoid function).
The next steps are the same as before.
This continues until the interpretation layer (often softmax). This layer interprets your computation: if the filters are well adapted to your problem, you will get a good predicted label, which means you have a small difference (y_guess - y_true_label).
Now you can see that for the guess of y we have multiplied the input x with many weights w and also applied functions to it. This can be seen as a chain of composed functions, to which the chain rule from analysis applies.
To get better results, the effect of a single weight on the output must be known. Therefore, you use backpropagation, which computes the derivative of the error with respect to every weight w. The trick is that you can reuse derivatives, which is more or less what backpropagation is, and it becomes easier since you can use matrix-vector notation.
Once you have your gradient, you can use the normal concept of minimization, where you walk along the direction of steepest descent. (There are also many other gradient methods, like Adagrad or Adam.)
The steps will repeat until convergence or until you reach the maximum epochs.
So the answer is: THE COMPUTED WEIGHTS (FILTERS) ARE THE KEY TO DETECT NUMBERS AND DIGITS :)
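To make the role of a filter concrete, here is a small NumPy sketch in which a hand-written vertical-edge filter slides over a toy image. In a trained CNN the filter values would be learned rather than written by hand; the image, the kernel and the helper `conv2d_valid` are illustrative assumptions.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding (cross-correlation,
    which is what CNN frameworks typically compute as 'convolution')."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 5x6 image: dark on the left, bright on the right.
image = np.zeros((5, 6))
image[:, 3:] = 1.0

# Hand-written filter that responds to a dark-to-bright vertical edge.
kernel = np.array([[-1.0, 1.0]])

response = conv2d_valid(image, kernel)        # non-zero only where the edge is
activated = 1.0 / (1.0 + np.exp(-response))   # sigmoid applied to the response
```

The response is non-zero only in the column where the brightness changes, which is the kind of learned feature that replaces hand-crafted inputs such as age or location.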

How to decide activation function in neural network

I am using a feedforward, multilayer neural network trained with backpropagation, and I am using a sigmoid function as an activation function, which has a range of -1 to 1. But the minimum error is not going below 5.8 and I want it to be much lower; you can see the output after 100000 iterations.
I think this is because my output range goes above 1, while the sigmoid function's range is only -1 to 1. Can anybody suggest how I can overcome this problem, as my desired output range is 0 to 2.5? Please suggest which activation function will be best for this range.

If you are seeking to reduce output error, there are a couple of things to look at before tweaking a node's activation function.
First, do you have a bias node? Bias nodes have several implications, but - most relevant to this discussion - they allow the network output to be translated to the desired output range. As this reference states:
The use of biases in a neural network increases the capacity of the network to solve problems by allowing the hyperplanes that separate individual classes to be offset for superior positioning.
This post provides a very good discussion:
Role of Bias in Neural Networks.
This one is good, too: Why the BIAS is necessary in ANN? Should we have separate BIAS for each layer?
Second method: it often helps to normalize your inputs and outputs. As you note, your sigmoid offers a range of +/- 1. This small range can be problematic when trying to learn functions that have a range of 0 to 1000 (for example). To aid learning, it's common to scale and translate inputs to accommodate the node activation functions. In this example, one might divide the range by 500, yielding a 0 to 2 range, and then subtract 1 from this range. In this manner, the inputs have been normalized to a range of -1 to 1, which better fits the activation function. Note that network output should be denormalized: first, add +1 to the output, then multiply by 500.
In your case, you might consider scaling the target values by 0.8 (so that 0 to 2.5 becomes 0 to 2), then subtracting 1 from the result. You would then add 1 to the network output, and then multiply by 1.25 to recover the desired range. Note that this method may be easiest to accomplish since it does not directly change your network topology like the addition of bias would.
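The scaling described above can be written as a pair of helper functions. This is a small sketch; the function names and the `y_max` parameter are hypothetical, and the factors 0.8 and 1.25 arise as 2/2.5 and 2.5/2.

```python
def normalize_target(y, y_max=2.5):
    """Map a value from [0, y_max] to [-1, 1]: scale by 2 / y_max, then subtract 1."""
    return y * (2.0 / y_max) - 1.0

def denormalize_output(z, y_max=2.5):
    """Inverse map from [-1, 1] back to [0, y_max]: add 1, then multiply by y_max / 2."""
    return (z + 1.0) * (y_max / 2.0)
```

The network is trained against `normalize_target(y)`, and `denormalize_output` is applied to its predictions afterwards.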
Finally, have you experimented with changing the number of hidden nodes? Although I believe the first two options are better candidates for improving performance, you might give this one a try. (Just as a point of reference, I can't recall an instance in which modifying the activation function's shape improved network response more than option 1 and 2.)
Here are some good discussions of hidden layer/node configuration:
multi-layer perceptron (MLP) architecture: criteria for choosing number of hidden layers and size of the hidden layer?
How to choose number of hidden layers and nodes in neural network?
24 inputs make your problem a high-dimensional one. Ensure that your training dataset adequately covers the input state space, and ensure that your test data and training data are drawn from similarly representative populations. (Take a look at the "cross-validation" discussions when training neural networks.)

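The vanilla sigmoid function is:
import math
def sigmoid(x):
    # Logistic sigmoid: output lies in (0, 1)
    return 1 / (1 + math.exp(-x))
You could transform that to:
def mySigmoid(x):
    # Scaled sigmoid: output lies in (0, 2.5)
    return 2.5 / (1 + math.exp(-x))
in order to cover the output range that you want.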
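If the activation is rescaled like this, the derivative used during backpropagation has to be rescaled as well. A minimal sketch follows, with hypothetical function names and the scale 2.5 taken from the desired output range:

```python
import math

def scaled_sigmoid(x, scale=2.5):
    """Logistic sigmoid stretched to the range (0, scale)."""
    return scale / (1.0 + math.exp(-x))

def scaled_sigmoid_derivative(x, scale=2.5):
    """Derivative of scaled_sigmoid with respect to x: scale * s * (1 - s)."""
    s = 1.0 / (1.0 + math.exp(-x))
    return scale * s * (1.0 - s)
```

Using the unscaled derivative s * (1 - s) with the scaled activation would make the weight updates too small by a factor of 2.5.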
