Why is my neural network not working? - python

background
I have created a neural network with a configurable number of inputs, hidden layers, neurons per hidden layer, and outputs. When I use it for handwriting recognition on the Kaggle dataset (a 76 MB text file of 28x28 matrices of 0-255 values for handwritten digits), the results show that something must be wrong somewhere. In this case, I am using 784 inputs (one per pixel of the 28x28 image), 1 hidden layer of 15 neurons, and an output layer of 10 neurons.
Output guesses are a vector like this [0,0,0,1,0,0,0,0,0,0] - which would mean it's guessing a 3. This is based on this http://neuralnetworksanddeeplearning.com/chap1.html#a_simple_network_to_classify_handwritten_digits
(same principles and set up)
I am assuming my problem is somewhere within the back propagation - and because my program has a completely flexible network size in all dimensions (layers, length of layers, etc), my algorithm for back propagating is quite complex - and based on the chain rule explained here https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
Where essentially, the total error for each output is calculated with respect to each weight, and for hidden layers, the sum of the weight changes in previous layers are used.
When using a learning rate of 0.5, e_total starts at 2.252, gets to 0.4462 within a minute, and then within 5 mins gets no lower than 0.2.
This makes me think something must be working. But when I print the desired outputs and the output guesses, they rarely match, even after 5 mins of iteration/learning. I would hope to see results like this
output layer: [0.05226,0.0262,0.03262,0.0002, 0.1352, 0.99935, 0.00, etc]
output desired: [0,0,0,0,0,1,0, etc]
(all < 0.1 except the correct guess value should be > 0.9)
but instead i get things like
output layer: [0.15826,0.0262,0.33262,0.0002, 0.1352, 0.0635, 0.00, etc]
output desired: [0,1,0,0,0,0,0, etc]
(no value anywhere near 0.9, so no clear classification, let alone an accurate one.)
I even added a line of code to output 'correct' when the guess value and desired value match - and even though, as I said, the e_total decreases, 'correct' was always happening about 1 in 10 times - which is no better than random!
I have tried different hidden layer lengths and all sorts of different learning rates - but no good.
I've given more information in the comments, which may help.
UPDATE:
As recommended, I have used my system to try to learn the XOR function - with 2 inputs, 1 hidden layer of 2 neurons, and 1 output.
This means the desired_list is now a single-element array, either [1] or [0]. Output values seem to be random, > 0.5 and < 0.7, with no clear relation to the desired output. Just to confirm, I have manually tested my feed forward and back prop many times, and they definitely work as explained in the tutorials I've linked.

You used one hidden layer in this example. Error backpropagation is capable of training one or two hidden layers correctly. In the comments you say the weights are initialized from the interval 0 to 1. I once tried to recognize a paper in a picture and got miserable results with weights initialized from the interval 0 to 1. When I changed this to -1 to 1, the results were excellent.
OK. Your parameters:
784 15 10
Parameters of a deep network doing the same task:
784 500 500 2000 10
Error backpropagation is capable of doing this task. Try using two hidden layers with more neurons.
For example, something like this:
784 2000 1500 10 :)
Also, you should normalize the input from 0-255 to 0-1.
Update:
XOR training takes a long time. Please try 100 000 epochs. If the results are still bad, something is wrong in the BP implementation. Please initialize the weights from -1 to 1 for the XOR problem, and make sure the hidden and output units have a bias.
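The three suggestions above (inputs scaled to 0-1, weights drawn from -1 to 1, a bias per unit) could be sketched roughly as follows. This is only an illustration: the function names, the fixed seed, and the use of the largest output as the guessed digit are assumptions, not the asker's code.

```python
import random

def normalize_pixels(pixels):
    """Scale raw 0-255 pixel values into the 0-1 range."""
    return [p / 255.0 for p in pixels]

def init_weights(n_inputs, n_neurons, rng):
    """One row of weights per neuron, each drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
            for _ in range(n_neurons)]

def predicted_digit(outputs):
    """Index of the largest output, i.e. the digit the network guesses."""
    return max(range(len(outputs)), key=lambda i: outputs[i])

rng = random.Random(0)                      # fixed seed, only for repeatability
hidden_weights = init_weights(784, 15, rng)
hidden_biases = [rng.uniform(-1.0, 1.0) for _ in range(15)]  # one bias per hidden unit
pixels = normalize_pixels([0, 128, 255])    # toy values, not a real image
guess = predicted_digit([0.15826, 0.0262, 0.33262, 0.0002, 0.1352, 0.0635, 0.0])
```

Comparing predicted_digit(outputs) with the position of the 1 in the desired vector gives a hit rate that does not depend on any single output crossing a fixed cut-off.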

You don't need to reinvent the wheel...
You may use the pybrain module, which provides optimized "Supervised Learning" features like Back-Propagation, R-Prop, etc...
(it also has unsupervised learning, reinforcement learning and black-box optimization algorithm features)
You may find here an example of how to use the pybrain module to do OCR with a 10×9 input array (just adapt it to your 28x28 need)
If you definitely want to reinvent the wheel... you may inspect the pybrain source code (because the back prop version of pybrain works) in order to explain/double-check why your version of the code is not working.
As NN debugging is a difficult task, you may also publish more code and share any resources which relate to your code ...
Regards

Related

Binary vs non-binary hidden layer output

I'm trying to understand exactly how to implement a basic neural network in Python that will use genetic algorithms for unsupervised learning, and have run into a small problem that the literature I've been able to pull up hasn't solved.
Let's say I have an input of 2 values that are passed to a 3-neuron hidden layer with all weights/biases applied. After I determine whether it fired, what exactly do I now send? Do I send the output from my sigmoid or do I send a full stop/start? In other words, is my output into hidden layer 2 going to be binary or non-binary?
Can anyone explain this, with the reasoning behind why we choose one or the other?
This really depends on your network design, but there is no such restriction that inputs have to be binary. In fact that will not be a case you face often. For output layer, the type of output can be easily and clearly determined, eg. if you have something like a classifier that classifies this answer is spam or not, then the output (of 'a single neuron' output layer) will be binary. If you have a neural network to recognise handwritten digits, then probably it's better to have a 10 neuron output layer, each giving probability of the input image being one of the digits [0, 9].
For other layers (hidden and input), the output can be anything, most of the time it won't be binary.
EDIT:
I think I misunderstood your question a bit, and also you probably aren't talking about Fuzzy Neural Networks.
So if you are not considering those (in most cases), when you say a neuron has fired, you mean its output is 1 (binary high), and 0 otherwise, so yes it's binary.
Do I send the output from my sigmoid or do I send a full stop/start
The way sigmoid function is used in neural networks (with weights) it attempts to make the computation output a binary result, so basically both the options mean the same. There is a difference, but usually NNs try to avoid that region where sigmoid (or related neuron) outputs some value which can not be approximated to 0 or 1 nicely. Weights of inputs of that neuron are moved so that the neuron gives a clear 0 or 1.
Also note that, while it is good to know sigmoid (and tanh), for practical purposes ReLU, Leaky ReLU, or maxout are better choices.
Suggested: http://cs231n.github.io/neural-networks-1/
Also you can find lectures (videos and notes) by Andrew Ng, Andrej Karpathy etc helpful.
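To make the sigmoid versus start/stop distinction concrete, here is a rough sketch of one fully connected layer with the activations named above. In most sigmoid implementations the value handed to the next layer is the real-valued activation itself; a well-trained neuron simply tends to sit near 0 or 1. The layer sizes, weights, and the leaky slope of 0.01 are illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):   # slope is an assumed, commonly used default
    return x if x > 0 else slope * x

def layer_forward(inputs, weights, biases, activation):
    """Return the real-valued activations of one fully connected layer."""
    return [activation(sum(w * v for w, v in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

# 2 inputs feeding a 3-neuron hidden layer; the weights are made up.
weights = [[0.5, -0.5], [1.0, 1.0], [-1.0, 0.0]]
biases = [0.0, 0.0, 0.0]
hidden = layer_forward([1.0, 0.0], weights, biases, sigmoid)
```

The list hidden is what the second hidden layer would receive: three numbers strictly between 0 and 1, not three on/off flags.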

GAN Generator output layer is [way] out of target range - potentially due to loss func

The TL;DR is that my G(z) (z is normal distributed noise) is way out of the expected range. Picture - see comparison of real against generated data.
(It's also a terrible model, being just a few dense layers - but I can sort that later!)
I've thrown s**t at the wall for the past week to see what sticks and now it's time to ask for an informed opinion! :)
Has anyone else encountered this problem before? Suggested fixes?
What is your generic loss function for your regressive Generator (with a categorical discriminator)?
As it's a regression problem (with target data normalised between 0 & 1) my G output has a linear activation. However, I'm trying to follow Soumith's GAN Hacks to get it working, where we're conflicting over the activation - he says it should be tanh. One could assume with my expected target range I could choose sigmoid, but as I say, I infer this should be linear as a regressive problem to solve. Opinions?
Plus there isn't much clarity on the loss function. He says:
In GAN papers, the loss function to optimize G is min (log 1-D), but in practice folks practically use max log D
Does this mean one should aim to minimise the error log(1-D), or pick the minimum of log(1-D)? Similarly with max log(D). (I'm using class definitions of true/false, so a maximum and minimum can be chosen from the two classes.)
In the Goodfellow et al. 2014 paper on GANs, I can see how some of the loss function for the Discriminator resolves to 0 when applied to the Generator. However, I'm obviously missing something, as the equation states I should take log(D(x)), which will sooner or later hit a NaN.
Clarification about my confused state is thanked many times in advance.
Update:
This tutorial on O'Reilly says
Take a look at the last line of our discriminator: there's no softmax or sigmoid layer at the end. GANs can fail if their discriminators "saturate," or become confident enough to return exactly 0 when they're given a generated image; that leaves the discriminator without a useful gradient to descend.
Which sounds a little like my problem, especially as I was using Softmax. Using no activation doesn't solve the problem, but does make my GAN compete in interesting new ways!
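For what it's worth, "min" and "max" in the quoted hack refer to the optimisation objective, averaged over a batch, and not to picking the smallest or largest of several values. A minimal sketch of both generator objectives follows, where d_fake stands for the discriminator's probabilities D(G(z)) that generated samples are real. The clipping constant EPS is an assumption added here so that log never sees exactly 0; it is not something the paper or the hacks list prescribes.

```python
import math

EPS = 1e-7  # assumed clipping constant, keeps log() away from 0

def clip(p):
    return min(max(p, EPS), 1.0 - EPS)

def generator_loss_saturating(d_fake):
    """Paper form: the generator minimises the mean of log(1 - D(G(z)))."""
    return sum(math.log(1.0 - clip(p)) for p in d_fake) / len(d_fake)

def generator_loss_non_saturating(d_fake):
    """Practical form: maximise the mean of log D(G(z)),
    written as minimising its negative."""
    return -sum(math.log(clip(p)) for p in d_fake) / len(d_fake)

# Even fully confident discriminator outputs give finite losses.
extreme = [0.0, 1.0]
loss_a = generator_loss_saturating(extreme)
loss_b = generator_loss_non_saturating(extreme)
```

Both losses fall as the discriminator is fooled more often; the second one keeps a useful gradient when D(G(z)) is close to 0, which is the usual argument for preferring it.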

How to decide activation function in neural network

I am using a feedforward, backpropagation, multilayer neural network with a sigmoid function as the activation function, which has a range of -1 to 1. But the minimum error is not going below 5.8 and I want it much lower; you can see the output after 100000 iterations.
I think this is because my output range goes above 1, and the sigmoid function's range is only -1 to 1. Can anybody suggest how I can overcome this problem, as my desired output range is 0 to 2.5? Please suggest which activation function will be best for this range.
If you are seeking to reduce output error, there are a couple of things to look at before tweaking a node's activation function.
First, do you have a bias node? Bias nodes have several implications, but - most relevant to this discussion - they allow the network output to be translated to the desired output range. As this reference states:
The use of biases in a neural network increases the capacity of the network to solve problems by allowing the hyperplanes that separate individual classes to be offset for superior positioning.
This post provides a very good discussion:
Role of Bias in Neural Networks.
This one is good, too: Why the BIAS is necessary in ANN? Should we have separate BIAS for each layer?
Second method: it often helps to normalize your inputs and outputs. As you note, your sigmoid offers a range of +/- 1. This small range can be problematic when trying to learn functions that have a range of 0 to 1000 (for example). To aid learning, it's common to scale and translate inputs to accommodate the node activation functions. In this example, one might divide the range by 500, yielding a 0 to 2 range, and then subtract 1 from this range. In this manner, the inputs have been normalized to a range of -1 to 1, which better fits the activation function. Note that network output should be denormalized: first, add +1 to the output, then multiply by 500.
In your case, you might consider scaling the desired outputs by 0.8 (mapping 0 to 2.5 onto 0 to 2), then subtracting 1 from the result. You would then add 1 to the network output, and then multiply by 1.25 to recover the desired range. Note that this method may be the easiest to accomplish since it does not directly change your network topology like the addition of a bias would.
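The scale-and-shift arithmetic above can be written down as a pair of helper functions. This is a sketch under the assumption that the values being mapped are the desired outputs in the range 0 to 2.5; the function names and the y_max parameter are made up for illustration.

```python
def normalize_target(y, y_max=2.5):
    """Map [0, y_max] onto [-1, 1]: scale to [0, 2], then subtract 1."""
    return y * (2.0 / y_max) - 1.0

def denormalize_output(o, y_max=2.5):
    """Inverse mapping: add 1, then rescale back to [0, y_max]."""
    return (o + 1.0) * (y_max / 2.0)

low = normalize_target(0.0)
high = normalize_target(2.5)
round_trip = denormalize_output(normalize_target(1.3))
```

With y_max = 2.5 the scale factor 2.0 / y_max is the 0.8 mentioned above, and y_max / 2.0 is the 1.25 used to recover the original range.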
Finally, have you experimented with changing the number of hidden nodes? Although I believe the first two options are better candidates for improving performance, you might give this one a try. (Just as a point of reference, I can't recall an instance in which modifying the activation function's shape improved network response more than option 1 and 2.)
Here are some good discussions of hidden layer/node configuration:
multi-layer perceptron (MLP) architecture: criteria for choosing number of hidden layers and size of the hidden layer?
How to choose number of hidden layers and nodes in neural network?
24 inputs make your problem a high-dimensional one. Ensure that your training dataset adequately covers the input state space, and ensure that your test data and training data are drawn from similarly representative populations. (Take a look at the "cross-validation" discussions when training neural networks).
The vanilla sigmoid function is:
    import math

    def sigmoid(x):
        return 1 / (1 + math.exp(-x))
You could transform that to:
    def mySigmoid(x):
        # Same curve, stretched so the output covers 0 to 2.5
        return 2.5 / (1 + math.exp(-x))
in order to get the output range that you want.

Neural Network Becomes Unruly with Large Layers

This is a higher-level question about the performance of a neural network. The issue I'm having is that with larger numbers of neurons per layer, the network has frequent rounds of complete stupidity. They are not consistent; it seems that the probability of general success vs failure is about 50/50 when layers get larger than 60 neurons (always 3 layers).
I tested this by teaching the same function to networks with input and hidden layers of sizes from 10-200. The success rate is either 0-1% or 90+%, but nothing in between. To help visualize this, I graphed it. Failures is a total count of incorrect responses on 200 data sets after 5k training iterations.
I think it's also important to note that the numbers at which the network succeeds or fails change for each run of the experiment. The only possible culprit I've come up with is local minima (but don't let this influence your answers, I'm new to this, and initial attempts to minimize the chance of local minima seem to have no effect).
So, the ultimate question is, what could cause this behavior? Why is this thing so wildly inconsistent?
The Python code is on Github and the code that generated this graph is the testHugeNetwork method in test.py (line 172). If any specific parts of the network algorithm would be helpful I'm glad to post relevant snippets.
My guess is that your network is oscillating heavily across a jagged error surface. Trying a lower learning rate might help. But first of all, there are a few things you can do to better understand what your network is doing:
plot the output error over training epochs. This will show you when in the training process things go wrong.
have a graphical representation (an image) of your weight matrices and of your outputs. Makes it much easier to spot irregularities.
A major problem with ANN training is saturation of the sigmoid function. Towards the asymptotes of both the logistic function and tanh, the derivative is close to 0; numerically it probably even is zero. As a result, the network will learn only very slowly or not at all. This problem occurs when the inputs to the sigmoid are too big. Here's what you can do about it:
initialize your weights in proportion to the number of inputs a neuron receives. Standard literature suggests drawing them from a distribution with mean = 0 and standard deviation 1/sqrt(m), where m is the number of input connections.
scale your teachers so that they lie where the network can learn the most; that is, where the activation function is the steepest: the maximum of the first derivative. For tanh you can alternatively scale the function to f(x) = 1.7159 * tanh(2/3 * x) and keep the teachers at [-1, 1]. However, don't forget to adjust the derivative to f'(x) = 2/3 * 1.7159 * (1 - tanh^2(2/3 * x)).
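The two remedies above could look like this in code. It is only a sketch: a Gaussian is assumed for the zero-mean distribution with standard deviation 1/sqrt(m), and the function names and seed are invented for the example.

```python
import math
import random

def init_neuron_weights(m, rng):
    """m incoming weights, zero mean, standard deviation 1/sqrt(m)."""
    std = 1.0 / math.sqrt(m)
    return [rng.gauss(0.0, std) for _ in range(m)]

def scaled_tanh(x):
    """f(x) = 1.7159 * tanh(2/3 * x)"""
    return 1.7159 * math.tanh(2.0 / 3.0 * x)

def scaled_tanh_derivative(x):
    """f'(x) = 2/3 * 1.7159 * (1 - tanh^2(2/3 * x))"""
    t = math.tanh(2.0 / 3.0 * x)
    return 2.0 / 3.0 * 1.7159 * (1.0 - t * t)

rng = random.Random(0)                 # fixed seed, only for repeatability
weights = init_neuron_weights(200, rng)
```

The constants are chosen so that scaled_tanh(1) is very close to 1 and scaled_tanh(-1) very close to -1, which is why teachers at [-1, 1] stay away from the flat, saturated ends of the curve.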
Let me know if you need additional clarification.

implementing a perceptron classifier

Hi, I'm pretty new to Python and to NLP. I need to implement a perceptron classifier. I searched through some websites but didn't find enough information. For now I have a number of documents which I grouped according to category (sports, entertainment etc). I also have a list of the most used words in these documents along with their frequencies. One website stated that I must have some sort of decision function accepting arguments x and w. x apparently is some sort of vector (I don't know what w is). But I don't know how to use the information I have to build the perceptron algorithm, or how to use it to classify my documents. Have you got any ideas? Thanks :)
What a perceptron looks like
From the outside, a perceptron is a function that takes n arguments (i.e an n-dimensional vector) and produces m outputs (i.e. an m-dimensional vector).
On the inside, a perceptron consists of layers of neurons, such that each neuron in a layer receives input from all neurons of the previous layer and uses that input to calculate a single output. The first layer consists of n neurons and it receives the input. The last layer consists of m neurons and holds the output after the perceptron has finished processing the input.
How the output is calculated from the input
Each connection from a neuron i to a neuron j has a weight w(i,j) (I'll explain later where they come from). The total input of a neuron p of the second layer is the sum of the weighted output of the neurons from the first layer. So
total_input(p) = Σ(output(k) * w(k,p))
where k runs over all neurons of the first layer. The activation of a neuron is calculated from the total input of the neuron by applying an activation function. An often used activation function is the Fermi function, so
activation(p) = 1/(1+exp(-total_input(p))).
The output of a neuron is calculated from the activation of the neuron by applying an output function. An often used output function is the identity f(x) = x (and indeed some authors see the output function as part of the activation function). I will just assume that
output(p) = activation(p)
When the output of all neurons of the second layer is calculated, use that output to calculate the output of the third layer. Iterate until you reach the output layer.
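The layer-by-layer calculation described above can be sketched as follows, using the Fermi (logistic) function 1/(1+exp(-x)) as the activation and the identity as the output function. The nested-list layout, where layers[l][k][p] holds w(k,p) for one pair of adjacent layers, is an assumption made for this example.

```python
import math

def fermi(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, layers):
    """Propagate inputs through every weight matrix in turn.

    layers[l][k][p] is w(k, p): the weight of the connection from
    neuron k of one layer to neuron p of the next layer.
    """
    output = inputs
    for w in layers:
        n_next = len(w[0])
        total_input = [sum(output[k] * w[k][p] for k in range(len(output)))
                       for p in range(n_next)]
        output = [fermi(t) for t in total_input]   # output(p) = activation(p)
    return output

# 2 inputs -> 3 hidden neurons -> 1 output, with made-up weights.
layers = [
    [[0.5, -0.5, 1.0], [1.0, 0.5, -1.0]],
    [[1.0], [-1.0], [0.5]],
]
result = forward([1.0, 0.0], layers)
```

With all weights set to zero every neuron outputs 0.5, which is a handy sanity check for an implementation.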
Where the weights come from
At first the weights are chosen randomly. Then you select some examples (from which you know the desired output). Feed each example to the perceptron and calculate the error, i.e. how far off from the desired output is the actual output. Use that error to update the weights. One of the fastest algorithms for calculating the new weights is Resilient Propagation.
How to construct a Perceptron
Some questions you need to address are
What are the relevant characteristics of the documents and how can they be encoded into an n-dimensional vector?
Which examples should be chosen to adjust the weights?
How shall the output be interpreted to classify a document? Example: A single output that yields the most likely class versus a vector that assigns probabilities to each class.
How many hidden layers are needed and how large should they be? I recommend starting with one hidden layer with n neurons.
The first and second points are very critical to the quality of the classifier. The perceptron might classify the examples correctly but fail on new documents. You will probably have to experiment. To determine the quality of the classifier, choose two sets of examples; one for training, one for validation. Unfortunately I cannot give you more detailed hints to answering these questions due to lack of practical experience.
I think that trying to solve an NLP problem with a Neural Network when you're not familiar with either might be a step too far. That you're doing it in a new language is the least of your worries.
I'll link you to the slides of the Neural Computation module that gets taught at my university. You'll want the slides from session 1 and session 2 in week 2. Right at the bottom of the page is a link to how to implement a neural network in C. With a few modifications you should be able to port it to Python. You should note that it details how to implement a multilayer perceptron. You only need to implement a single layer perceptron, so ignore anything that talks about hidden layers.
A quick explanation of x and w. Both x and w are vectors. x is the input vector. x contains normalised frequencies for each word you are concerned about. w contains weights for each word you are concerned with. The perceptron works by multiplying the input frequency for each word by its respective weight and summing them up. It passes the result to a function (typically a sigmoid function) that turns the result into a value between 0 and 1. 1 means the perceptron is positive that the inputs are an instance of the class it represents and 0 means it is sure that the inputs really aren't an example of its class.
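As a rough illustration of x and w, the sketch below builds x from word frequencies and passes the weighted sum through a sigmoid. The vocabulary, the weights, and the function names are all hypothetical; real weights would come from training.

```python
import math

def word_frequencies(tokens, vocabulary):
    """Build x: the share of the tokens taken up by each vocabulary word."""
    total = len(tokens)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [tokens.count(word) / total for word in vocabulary]

def perceptron_output(x, w):
    """Multiply each frequency by its weight, sum, and squash with a sigmoid."""
    s = sum(xi * wi for xi, wi in zip(x, w))
    return 1.0 / (1.0 + math.exp(-s))

vocabulary = ["goal", "match", "film", "actor"]   # hypothetical word list
w_sport = [4.0, 3.0, -4.0, -3.0]                  # hypothetical trained weights
x = word_frequencies("goal match goal".split(), vocabulary)
score = perceptron_output(x, w_sport)
```

A score near 1 means the "sport" perceptron is confident the document belongs to its class; a document made up of "film" and "actor" would push the same perceptron towards 0.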
With NLP you typically learn about the bag of words model first, before moving on to other, more complex, models. With a neural network, hopefully, it will learn its own model. The problem with this is that the neural network will not give you much of an understanding of NLP, other than documents can be classified by the words they contain, and that usually the number and type of words in a document contains most of the information you need to classify a document -- context and grammar do not add much extra detail.
Anyway, I hope that gives a better place from which to start your project. If you're still stuck on a particular part then ask again and I'll do my best to help.
You should take a look at this survey paper on text classification by Fabrizio Sebastiani. It tells you all of the best ways to do text classification.
Now, I'm not going to bother you to read the whole thing, but there's one table near the end, where he compares how lots of different people's techniques stack up on lots of different test corpora. Find it, pick the best one (the best perceptron one, if your assignment is specifically to learn how to do this with a perceptron), and read the paper he cites that describes that method in detail.
You now know how to construct a good topical text classifier.
Turning the algorithm that Oswald gave you (and that you posted in your other question) into code is a Small Matter of Programming (TM). And if you encounter unfamiliar terms like TF-IDF while you're working, ask your teacher to help you by explaining those terms.
MultiLayer perceptrons (A specific NeuralNet architecture for general classification problem.) Now available for Python from the GraphLab folks:
https://dato.com/products/create/docs/generated/graphlab.deeplearning.MultiLayerPerceptrons.html#graphlab.deeplearning.MultiLayerPerceptrons
I had a try at implementing something similar the other day. I made some code to recognize english looking text vs non-english. I hadn't done AI or statistics in many years, so it was a bit of a shotgun attempt.
My code is here (don't want to bloat the post): http://cnippit.com/content/perceptron-statistically-recognizing-english
Inputs:
I take a text file and split it up into tri-grams (e.g. "abcdef" => ["abc", "bcd", "cde", "def"]).
I calculate the relative frequencies of each, and feed those as the inputs to the perceptron (so there are 26^3 inputs).
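The tri-gram step could be sketched like this. It is not the code behind the link; in particular, lower-casing the text and dropping everything except the letters a-z before splitting is an assumption made here to match the 26^3 possible inputs.

```python
from collections import Counter

def trigrams(text):
    """All overlapping 3-character windows of text, in order."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

def trigram_frequencies(text):
    """Relative frequency of each tri-gram over the letters a-z only."""
    letters = "".join(ch for ch in text.lower() if "a" <= ch <= "z")
    grams = trigrams(letters)
    if not grams:
        return {}
    counts = Counter(grams)
    return {gram: n / len(grams) for gram, n in counts.items()}

example = trigrams("abcdef")
freqs = trigram_frequencies("abc abc")
```

Tri-grams that never occur are simply absent from the dictionary, so a fixed-length input vector would fill those positions with 0.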
Despite me not really knowing what I was doing, it seems to work fairly well. The success depends quite heavily on the training data though. I was getting poor results until I trained it on more french/spanish/german text etc.
It's a very small example though, with lots of "lucky guesses" at values (eg. initial weights, bias, threshold, etc.).
Multiple classes:
If you have multiple classes you want to distinguish between (i.e. not as simple as "is A or NOT-A"), then one approach is to use a perceptron for each class. E.g. one for sport, one for news, etc.
Train the sport-perceptron on data grouped as either sport or NOT-sport. Similar for news or Not-news, etc.
When classifying new data, you pass your input to all perceptrons, and whichever one returns true (or "fires"), then that's the class the data belongs to.
I used this approach way back in university, where we used a set of perceptrons for recognizing handwritten characters. It's simple and worked pretty effectively (>98% accuracy if I recall correctly).
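The one-perceptron-per-class idea can be sketched as below. The weights are hand-picked placeholders rather than trained values, and best_class, which falls back to the largest weighted sum when several perceptrons fire or none do, is an assumption that goes beyond the description above.

```python
def weighted_sum(x, w):
    return sum(xi * wi for xi, wi in zip(x, w))

def fires(x, w, threshold=0.0):
    """A perceptron 'fires' when its weighted sum exceeds the threshold."""
    return weighted_sum(x, w) > threshold

def firing_classes(x, perceptrons):
    """Names of all classes whose perceptron fires on x."""
    return [name for name, w in perceptrons.items() if fires(x, w)]

def best_class(x, perceptrons):
    """Assumed tie-break: the class with the largest weighted sum."""
    return max(perceptrons, key=lambda name: weighted_sum(x, perceptrons[name]))

# Hypothetical weights over three features; real ones come from training
# each perceptron on "class" versus "NOT-class" data.
perceptrons = {
    "sport": [1.0, -1.0, -1.0],
    "news": [-1.0, 1.0, -1.0],
    "entertainment": [-1.0, -1.0, 1.0],
}
x = [0.7, 0.2, 0.1]
labels = firing_classes(x, perceptrons)
```

For this input only the "sport" perceptron fires; an ambiguous input such as [0.4, 0.4, 0.2] makes none of them fire, which is where a fallback like best_class becomes useful.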
