neural networks regression using pybrain

neural networks regression using pybrain - python

I need to solve a regression problem with a feed forward network and I've been trying to use PyBrain to do it. Since there are no examples of regression on pybrain's reference, I tried to adapt it's classification example for regression instead, but with no success (The classification example can be found here: http://pybrain.org/docs/tutorial/fnn.html). Following is my code:
This first function converts my data in numpy array form to a pybrain SupervisedDataset. I use the SupervisedDataset because according to pybrain's reference it is the dataset to use when the problem is regression. The parameters are an array with the feature vectors (data) and their expected output (values):
def convertDataNeuralNetwork(data, values):
fulldata = SupervisedDataSet(data.shape[1], 1)
for d, v in zip(data, values):
fulldata.addSample(d, v)
return fulldata
Next, is the function to run the regression. train_data and train_values are the train feature vectors and their expected output, test_data and test_values are the test feature vectors and their expected output:
regressionTrain = convertDataNeuralNetwork(train_data, train_values)
regressionTest = convertDataNeuralNetwork(test_data, test_values)
fnn = FeedForwardNetwork()
inLayer = LinearLayer(regressionTrain.indim)
hiddenLayer = LinearLayer(5)
outLayer = GaussianLayer(regressionTrain.outdim)
fnn.addInputModule(inLayer)
fnn.addModule(hiddenLayer)
fnn.addOutputModule(outLayer)
in_to_hidden = FullConnection(inLayer, hiddenLayer)
hidden_to_out = FullConnection(hiddenLayer, outLayer)
fnn.addConnection(in_to_hidden)
fnn.addConnection(hidden_to_out)
fnn.sortModules()
trainer = BackpropTrainer(fnn, dataset=regressionTrain, momentum=0.1, verbose=True, weightdecay=0.01)
for i in range(10):
trainer.trainEpochs(5)
res = trainer.testOnClassData(dataset=regressionTest )
print res
when I print res, all it's values are 0. I've tried to use the buildNetwork function as a shortcut to build the network, but it didn't work as well. I've also tried different kinds of layers and different number of nodes in the hidden layer, with no luck.
Does somebody have any idea of what I am doing wrong? Also, some pybrain regression examples would really help! I couldn't find any when I looked.
Thanks in advance

pybrain.tools.neuralnets.NNregression is a tool which
Learns to numerically predict the targets of a set of data, with
optional online progress plots.
so it seems like something well suited for constructing a neural network for your regression task.

I think there could be a couple of things going on here.
First, I'd recommend using a different configuration of layer activations than what you're using. In particular, for starters, try to use sigmoidal nonlinearities for the hidden layers in your network, and linear activations for the output layer. This is by far the most common setup for a typical supervised network and should help you get started.
The second thing that caught my eye is that you have a relatively large value for the weightDecay parameter in your trainer (though what constitutes "relatively large" depends on the natural scale of your input and output values). I would remove that parameter for starters, or set its value to 0. The weight decay is a regularizer that will help prevent your network from overfitting, but if you increase the value of that parameter too much, your network weights will all go to 0 very quickly (and then your network's gradient will be basically 0, so learning will halt). Only set weightDecay to a nonzero value if your performance on a validation dataset starts to decrease during training.

As originally pointed out by Ben Allison, for the network to be able to approximate arbitrary values (i.e. not necessarily in the range 0..1) it is important not to use an activation function with limited output range in the final layer. A linear activation function for example should work well.
Here is a simple regression example built from the basic elements of pybrain:
#----------
# build the dataset
#----------
from pybrain.datasets import SupervisedDataSet
import numpy, math
xvalues = numpy.linspace(0,2 * math.pi, 1001)
yvalues = 5 * numpy.sin(xvalues)
ds = SupervisedDataSet(1, 1)
for x, y in zip(xvalues, yvalues):
ds.addSample((x,), (y,))
#----------
# build the network
#----------
from pybrain.structure import SigmoidLayer, LinearLayer
from pybrain.tools.shortcuts import buildNetwork
net = buildNetwork(1,
100, # number of hidden units
1,
bias = True,
hiddenclass = SigmoidLayer,
outclass = LinearLayer
)
#----------
# train
#----------
from pybrain.supervised.trainers import BackpropTrainer
trainer = BackpropTrainer(net, ds, verbose = True)
trainer.trainUntilConvergence(maxEpochs = 100)
#----------
# evaluate
#----------
import pylab
# neural net approximation
pylab.plot(xvalues,
[ net.activate([x]) for x in xvalues ], linewidth = 2,
color = 'blue', label = 'NN output')
# target function
pylab.plot(xvalues,
yvalues, linewidth = 2, color = 'red', label = 'target')
pylab.grid()
pylab.legend()
pylab.show()
A side remark (since in your code example you have a hidden layer with linear activation functions): In any hidden layer, linear functions are not useful because:
the weights at the input side to this layer form a linear transformation
the activation function is linear
the weights at the output side to this layer form a linear transformation
which can be reduced to one single linear transformation, i.e. they corresponding layer may as well be eliminated without any reduction in the set of functions which can be approximated. An important point of neural networks is that the activation functions are non-linear in the hidden layers.

As Andre Holzner explained, hidden layer should be nonlinear. Andre's code example is great, but it doesn't work well when you have more features and not so much data. In this case because of large hidden layer we get quite good approximation, but when you are dealing with more complex data, only linear function in output layer is not enough, you should normalize features and target to be in range [0..1].

Related

Finding patterns in time series with PyTorch

I started PyTorch with image recognition. Now I want to test (very basically) with pure NumPy arrays. I struggle with getting the setup to work, so basically I have vectors with values between 0 and 1 (normalized curves). Those vectors are always of length 1500 and I want to find e.g. "high values at the beginning" or "sine wave-like function", "convex", "concave" etc. stuff like that, so just shapes of those curves.
My training set consists of many vectors with their classes; I have chosen 7 classes. The net should be trained to classify a vector into one or more of those 7 classes (not one hot).
I'm struggling with multiple issues, but first my very basic Net
class Net(nn.Module):
def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
super(Net, self).__init__()
self.hidden_dim = hidden_dim
self.layer_dim = layer_dim
self.rnn = nn.RNN(input_dim, hidden_dim, layer_dim)
self.fc = nn.Linear(self.hidden_dim, output_dim)
def forward(self, x):
h0 = torch.zeros(self.layer_dim, x.size(1), self.hidden_dim).requires_grad_()
out, h0 = self.rnn(x, h0.detach())
out = out[:, -1, :]
out = self.fc(out)
return out
network = Net(1500, 70, 20, 7)
optimizer = optim.SGD(network.parameters(), lr=learning_rate, momentum=momentum)
This is just a copy-paste from an RNN demo. Here is my first issue. Is an RNN the right choice? It is a time series, but then again it is an image recognition problem when plotting the curve.
Now, this here is an attempt to batch the data. The data object contains all training curves together with the correct classifiers.
def train(epoch):
network.train()
network.float()
batching = True
index = 0
# monitor the cummulative loss for an epoch
cummloss = []
# start batching some curves
while batching:
optimizer.zero_grad()
# here I start clustering come curves to a batch and normalize the curves
_input = []
batch_size = min(len(data)-1, index+batch_size_train) - index
for d in data[index:min(len(data)-1, index+batch_size_train)]:
y = np.array(d['data']['y'], dtype='d')
y = np.multiply(y, y.max())
y = y[0:1500]
y = np.pad(y, (0, max(1500-len(y), 0)), 'edge')
if len(_input) == 0:
_input = y
else:
_input = np.vstack((_input, y))
input = torch.from_numpy(_input).float()
input = torch.reshape(input, (1, batch_size, len(y)))
target = np.zeros((1,7))
# the correct classes have indizes, to I create a vector with 1 at the correct locations
for _index in np.array(d['classifier']):
target[0,_index-1] = 1
target = torch.from_numpy(target)
# get the result form the network
output = network(input)
# is this a good loss function?
loss = F.l1_loss(output, target)
loss.backward()
cummloss.append(loss.item())
optimizer.step()
index = index + batch_size_train
if index > len(data):
print(np.mean(cummloss))
batching = False
for e in range(1, n_epochs):
print('Epoch: ' + str(e))
train(0)
The problem I'm facing right now is, the loss doesn't change very little, even with hundreds of epochs.
Are there existing examples of this kind of problem? I didn't find any, just pure png/jpg image recognition. When I convert the curves to png then I have a little issue to train a net, I took densenet and it worked just fine but it seems to be super overkill for this simple task.

This is just a copy-paste from an RNN demo. Here is my first issue. Is an RNN the right choice?
In theory what model you choose does not matter as much as "How" you formulate your problem.
But in your case the most obvious limitation you're going to face is your sequence length: 1500. RNN store information across steps and typically runs into trouble over long sequence with vanishing or exploding gradient.
LSTM net have been developed to circumvent this limitations with memory cell, but even then in the case of long sequence it will still be limited by the amount of information stored in the cell.
You could try using a CNN network as well and think of it as an image.
Are there existing examples of this kind of problem?
I don't know but I might have some suggestions : If I understood your problem correctly, you're going from a (1500, 1) input to a (7,1) output, where 6 of the 7 positions are 0 except for the corresponding class where it's 1.
I don't see any activation function, usually when dealing with multi class you don't use the output of the dense layer to compute the loss you apply a normalizing function like softmax and then you can compute the loss.

From your description of features you have in the form of sin like structures, the closes thing that comes to mind is frequency domain. As such, if you have and input image, just transform it to the frequency domain by a Fourier transform and use that as your feature input.
Might be best to look for such projects on the internet, one such project that you might want to read the research paper or video from this group (they have some jupyter notebooks for you to try) or any similar works. They use the furrier features, that go though a multi layer perceptron (MLP).
I am not sure what exactly you want to do, but seems like a classification task, you would use RNN if you want your neural network to work with a sequence. To me it seems like the 1500 dimensions are independent, and as such can be just treated as input.
Regarding the last layer, for a classification problem it usually is a probability distribution obtained by applying softmax (if only the classification is distinct - i.e. probability sums up to 1), in which, given an input, the net gives a probability of it being from each class. If we are predicting multiple classes we are going to use sigmoid as the last layer of the neural network.
Regarding your loss, there are many losses you can try and see if they are better. Once again, for different features you have to know what exactly is the measurement of distance (a.k.a. how different 2 things are). Check out this website, or just any loss function explanations on the net.
So you should try a simple MLP on top of fourier features as a starting point, assuming that is your feature vector.

Image Recognition is different from Time-Series data. In the imaging domain your data-set might have more similarity with problems like Activity-Recognition, Video-Recognition which have temporal component. So, I'd recommend looking into some models for those.
As for the current model, I'd recommend using LSTM instead of RNN. And also for classification you need to use an activation function in your final layer. This should softmax with cross entropy based loss or sigmoid with MSE loss.
Keras has a Timedistributed model which makes it easy to handle time components. You can use a similar approach with Pytorch by applying linear layers followed by LSTM.
Look into these for better undertsanding ::
Activity Recognition : https://www.narayanacharya.com/vision/2019-12-30-Action-Recognition-Using-LSTM
https://discuss.pytorch.org/t/any-pytorch-function-can-work-as-keras-timedistributed/1346
How to implement time-distributed dense (TDD) layer in PyTorch
Activation Function ::
https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html

How can I compare weights of different Keras models?

I've saved numbers of models in .h5 format. I want to compare their characteristics such as weight.
I don't have any Idea how I can appropriately compare them specially in the form of tables and figures.
Thanks in advance.

Weight-introspection is a fairly advanced endeavor, and requires model-specific treatment. Visualizing weights is a largely technical challenge, but what you do with that information's a different matter - I'll address largely the former, but touch upon the latter.
Update: I also recommend See RNN for weights, gradients, and activations visualization.
Visualizing weights: one approach is as follows:
Retrieve weights of layer of interest. Ex: model.layers[1].get_weights()
Understand weight roles and dimensionality. Ex: LSTMs have three sets of weights: kernel, recurrent, and bias, each serving a different purpose. Within each weight matrix are gate weights - Input, Cell, Forget, Output. For Conv layers, the distinction's between filters (dim0), kernels, and strides.
Organize weight matrices for visualization in a meaningful manner per (2). Ex: for Conv, unlike for LSTM, feature-specific treatment isn't really necessary, and we can simply flatten kernel weights and bias weights and visualize them in a histogram
Select visualization method: histogram, heatmap, scatterplot, etc - for flattened data, a histogram is the best bet
Interpreting weights: a few approaches are:
Sparsity: if weight norm ("average") is low, the model is sparse. May or may not be beneficial.
Health: if too many weights are zero or near-zero, it's a sign of too many dead neurons; this can be useful for debugging, as once a layer's in such a state, it usually does not revert - so training should be restarted
Stability: if weights are changing greatly and quickly, or if there are many high-valued weights, it may indicate impaired gradient performance, remedied by e.g. gradient clipping or weight constraints
Model comparison: there isn't a way for simply looking at two weights from separate models side-by-side and deciding "this is the better one"; analyze each model separately, for example as above, then decide which one's ups outweigh downs.
The ultimate tiebreaker, however, will be validation performance - and it's also the more practical one. It goes as:
Train model for several hyperparameter configurations
Select one with best validation performance
Fine-tune that model (e.g. via further hyperparameter configs)
Weight visualization should be mainly kept as a debugging or logging tool - as, put simply, even with our best current understanding of neural networks one cannot tell how well the model will generalize just by looking at the weights.
Suggestion: also visualize layer outputs - see this answer and sample output at bottom.
Visual example:
from tensorflow.keras.layers import Input, Conv2D, Dense, Flatten
from tensorflow.keras.models import Model
ipt = Input(shape=(16, 16, 16))
x = Conv2D(12, 8, 1)(ipt)
x = Flatten()(x)
out = Dense(16)(x)
model = Model(ipt, out)
model.compile('adam', 'mse')
X = np.random.randn(10, 16, 16, 16) # toy data
Y = np.random.randn(10, 16) # toy labels
for _ in range(10):
model.train_on_batch(X, Y)
def get_weights_print_stats(layer):
W = layer.get_weights()
print(len(W))
for w in W:
print(w.shape)
return W
def hist_weights(weights, bins=500):
for weight in weights:
plt.hist(np.ndarray.flatten(weight), bins=bins)
W = get_weights_print_stats(model.layers[1])
# 2
# (8, 8, 16, 12)
# (12,)
hist_weights(W)
Conv1D outputs visualization: (source)

To compare weights of two models, I vectorize model weights (i.e., create 1D array) for each model. Then, I calculate the percent difference between respective weights and construct a histogram of these percent differences. If all values are close to zero, there is suggestion (but not proof) that the models are practically the same. This is just one approach out of many for comparing models using their weights.
As an aside, I will note that I use this method when I want some indication that my model has converged on a global, rather than local, minimum. I will train models with several different initializations. If all the initializations result in convergence to the same weights, it suggests that the minimum is a global minimum.

Why my one-filter convolutional neural network is unable to learn a simple gaussian kernel?

I was surprised that the deep learning algorithms I had implemented did not work, and I decided to create a very simple example, to understand the functioning of CNN better. Here is my attempt of constructing a small CNN for a very simple task, which provides unexpected results.
I have implemented a simple CNN with only one layer of one filter. I have created a dataset of 5000 samples, the inputs x being 256x256 simulated images, and the outputs y being the corresponding blurred images (y = signal.convolvded2d(x,gaussian_kernel,boundary='fill',mode='same')).
Thus, I would like my CNN to learn the convolutional filter which would transform the original image into its blurred version. In other words, I would like my CNN to recover the gaussian filter I used to create the blurred images. Note: As I want to 'imitate' the convolution process such as it is described in the mathematical framework, I am using a gaussian filter which has the same size as my images: 256x256.
It seems to me quite an easy task, and nonetheless, the CNN is unable to provide the results I would expect. Please find below the code of my training function and the results.
# Parameters
size_image = 256
normalization = 1
sigma = 7
n_train = 4900
ind_samples_training =np.linspace(1, n_train, n_train).astype(int)
nb_epochs = 5
minibatch_size = 5
learning_rate = np.logspace(-3,-5,nb_epochs)
tf.reset_default_graph()
tf.set_random_seed(1)
seed = 3
n_train = len(ind_samples_training)
costs = []
# Create Placeholders of the correct shape
X = tf.placeholder(tf.float64, shape=(None, size_image, size_image, 1), name = 'X')
Y_blur_true = tf.placeholder(tf.float64, shape=(None, size_image, size_image, 1), name = 'Y_true')
learning_rate_placeholder = tf.placeholder(tf.float32, shape=[])
# parameters to learn --should be an approximation of the gaussian filter
filter_to_learn = tf.get_variable('filter_to_learn',\
shape = [size_image,size_image,1,1],\
dtype = tf.float64,\
initializer = tf.contrib.layers.xavier_initializer(seed = 0),\
trainable = True)
# Forward propagation: Build the forward propagation in the tensorflow graph
Y_blur_hat = tf.nn.conv2d(X, filter_to_learn, strides = [1,1,1,1], padding = 'SAME')
# Cost function: Add cost function to tensorflow graph
cost = tf.losses.mean_squared_error(Y_blur_true,Y_blur_hat,weights=1.0)
# Backpropagation: Define the tensorflow optimizer. Use an AdamOptimizer that minimizes the cost.
opt_adam = tf.train.AdamOptimizer(learning_rate=learning_rate_placeholder)
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
optimizer = opt_adam.minimize(cost)
# Initialize all the variables globally
init = tf.global_variables_initializer()
lr = learning_rate[0]
# Start the session to compute the tensorflow graph
with tf.Session() as sess:
# Run the initialization
sess.run(init)
# Do the training loop
for epoch in range(nb_epochs):
minibatch_cost = 0.
seed = seed + 1
permutation = list(np.random.permutation(n_train))
shuffled_ind_samples = np.array(ind_samples_training)[permutation]
# Learning rate update
if learning_rate.shape[0]>1:
lr = learning_rate[epoch]
nb_minibatches = int(np.ceil(n_train/minibatch_size))
for num_minibatch in range(nb_minibatches):
# Minibatch indices
ind_minibatch = shuffled_ind_samples[num_minibatch*minibatch_size:(num_minibatch+1)*minibatch_size]
# Loading of the original image (X) and the blurred image (Y)
minibatch_X, minibatch_Y = load_dataset_blur(ind_minibatch,size_image, normalization, sigma)
_ , temp_cost, filter_learnt = sess.run([optimizer,cost,filter_to_learn],\
feed_dict = {X:minibatch_X, Y_blur_true:minibatch_Y, learning_rate_placeholder: lr})
I have run the training on 5 epochs of 4900 samples, with a batch size equal to 5. The gaussian kernel has a variance of 7^2=49.
I have tried to initialize the filter to be learnt both with the xavier initiliazer method provided by tensorflow, and with the true values of the gaussian kernel we actually would like to learn. In both cases, the filter that is learnt results too different from the true gaussian one as it can be seen on the two images available at https://github.com/megalinier/Helsinki-project.

By examining the photos it seems like the network is learning OK, as the predicted image is not so far off the true label - for better results you can tweak some hyperparams but that is not the case.
I think what you are missing is the fact that different kernels can get quite similar results since it is a convolution.
Think about it, you are multiplying some matrix with another, and then summing all the results to create a new pixel. Now if the true label sum is 10, it could be a results of 2.5 + 2.5 + 2.5 + 2.5 and -10 + 10 + 10 + 0.
What I am trying to say, is that your network could be learning just fine, but you will get a different values in the conv kernel than the filter.

I think this would better serve as a comment as it's somewhat speculative, but it's too long...
Hard to say what exactly is wrong but there could be multiple culprits here. For one, squared error provides a weak signal in the case that target and prediction are already quite similar -- and while the xavier-initalized filter looks quite bad, the predicted (filtered) image isn't too far off the target. You could experiment with other metrics such as absolute error (e.g. 1-norm instead of 2-norm).
Second, adding regularization should help, i.e. add a weight penalty to the loss function to encourage the filter values to become small where they are not needed. As it is, what I suppose happens is: The random values in the filter average out to about 0, leading to a similar "filtering" effect as if they were actually all 0. As such, the learning algorithm doesn't have much incentive to actually pull them to 0. By adding a weight penalty, you provide this incentive.
Third, it could just be Adam messing up. It is known to provide "strange" non-optimal solutions in some very simple (e.g. convex) problems. Maybe try default Gradient Descent with learning rate decay (and possibly momentum).

Nonlinear input to output mapping (undefined range) using tensorflow

I have an array of 1D input data (30,1). I m trying to map this to output data (30,1) (with noise). I have plotted the data and it is definitely non-linear and continuous.
I want to train a neural network to reproduce this mapping. I am currently trying to complete this task using tensorflow.
My problem right now is that the output data is in an undefined range (e.g. -2.74230671e+01, 1.00000000e+03, 6.34566772e+02 etc), and non-linear tensorflow activation functions seem to all between -1 and 1?
https://www.tensorflow.org/versions/r0.12/api_docs/python/nn/activation_functions_
I am rather new to tensorflow etc, so my question is, how do I approach this problem?
I thought I could mean-normalize the data, but since I don't actually know the range of the output values (possibly unbounded).
Is this possible using tensorflow functions or will I need to build my own? The approach I am using is below, where I tried different functions for tf.nn.relu:
tf_x = tf.placeholder(tf.float32, x.shape) # input x
tf_y = tf.placeholder(tf.float32, y.shape) # output y
# neural network layers
l1 = tf.layers.dense(tf_x, 50, tf.nn.relu) # tried different activation functions here
output = tf.layers.dense(l1, 1) # tried here too
loss = tf.losses.mean_squared_error(tf_y, output)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.05)
train_op = optimizer.minimize(loss)
#train
for step in range(30):
_, l, pred = sess.run([train_op, loss, output], {tf_x: x, tf_y: y})
print(sess.run(loss, feed_dict={tf_x: x, tf_y: y}))

You definitely have to normalize your data for it to work and it does not necessarily have to be in the range [-1, 1].
Take a Computer Vision (CV) problem as an example. What some papers do is simply divide by 255.0. Other papers, compute the mean and standard_deviation of each RGB channel from all the images. To normalize the images, we simply do (x-mu)/sigma over each channel.
Since your data is unbounded like what you said, then we can't simply divide by a scalar. Perhaps the best approach is to normalize based on the data statistics. Specific to your case, you could perhaps find the mean and standard_deviation of each of your 30 dimensions.
This post is more detailed and will potentially help you.

Backpropagation with Rectified Linear Units

I have written some code to implement backpropagation in a deep neural network with the logistic activation function and softmax output.
def backprop_deep(node_values, targets, weight_matrices):
delta_nodes = node_values[-1] - targets
delta_weights = delta_nodes.T.dot(node_values[-2])
weight_updates = [delta_weights]
for i in xrange(-2, -len(weight_matrices)- 1, -1):
delta_nodes = dsigmoid(node_values[i][:,:-1]) * delta_nodes.dot(weight_matrices[i+1])[:,:-1]
delta_weights = delta_nodes.T.dot(node_values[i-1])
weight_updates.insert(0, delta_weights)
return weight_updates
The code works well, but when I switched to ReLU as the activation function it stopped working. In the backprop routine I only change the derivative of the activation function:
def backprop_relu(node_values, targets, weight_matrices):
delta_nodes = node_values[-1] - targets
delta_weights = delta_nodes.T.dot(node_values[-2])
weight_updates = [delta_weights]
for i in xrange(-2, -len(weight_matrices)- 1, -1):
delta_nodes = (node_values[i]>0)[:,:-1] * delta_nodes.dot(weight_matrices[i+1])[:,:-1]
delta_weights = delta_nodes.T.dot(node_values[i-1])
weight_updates.insert(0, delta_weights)
return weight_updates
However, the network no longer learns, and the weights quickly go to zero and stay there. I am totally stumped.

Although I have determined the source of the problem, I'm going to leave this up in case it might be of benefit to someone else.
The problem was that I did not adjust the scale of the initial weights when I changed activation functions. While logistic networks learn very well when node inputs are near zero and the logistic function is approximately linear, ReLU networks learn well for moderately large inputs to nodes. The small weight initialization used in logistic networks is therefore not necessary, and in fact harmful. The behavior I was seeing was the ReLU network ignoring the features and attempting to learn the bias of the training set exclusively.
I am currently using initial weights distributed uniformly from -.5 to .5 on the MNIST dataset, and it is learning very quickly.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.