This is a classic visualization of the perceptron learning model with 1 neuron. Suppose I'd like to use 3 or 5 neurons for training: can I do that without a hidden layer? I just can't picture it in my head. Here is the code:
import numpy as np

def tanh(x):
    return (np.exp(x)-np.exp(-x))/(np.exp(x)+np.exp(-x))

def tanh_derivative(x):
    # x is expected to already be tanh(z), so this equals 1 - tanh(z)**2
    return 1-x**2

training_inputs = np.array([[0,0,0],[0,0,1],[0,1,0],[0,1,1],[1,0,0],[1,0,1],[1,1,0],[1,1,1]])
training_outputs = np.array([[1,0,0,1,0,1,1,0]]).T
# 3 inputs, 1 output
synaptic_weights = 2 * np.random.random((3,1)) - 1
print('Random weights :{}'.format(synaptic_weights))
for i in range(20000):
    input_layer = training_inputs
    # forward pass: weighted sum of the inputs, then the activation
    outputs = tanh(np.dot(input_layer, synaptic_weights))
    error = training_outputs - outputs
    weight_adjust = error * tanh_derivative(outputs)
    # accumulate the adjustment over all training samples
    synaptic_weights += np.dot(input_layer.T, weight_adjust)
print('After training Synaptic Weights: {}'.format(synaptic_weights))
print('After training Outputs :\n{}'.format(outputs))
If you have 3 neurons in the output layer, you have three outputs. This makes sense for some problems - imagine a color with RGB components.
The size of your input determines your number of input nodes; the size of your output determines your number of output nodes. Only the hidden layer sizes can be chosen freely. But any interesting network has at least one hidden layer.
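To make this concrete, here is a minimal sketch (an illustration, not the code from the question) of a network with three output neurons and no hidden layer. The only structural change from the one-neuron version is the shape of the weight matrix, (n_inputs, n_outputs) instead of (n_inputs, 1). The targets, the learning rate, and the names n_inputs, n_outputs and learning_rate are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# 8 samples with 3 binary input features each
training_inputs = np.array([[0,0,0],[0,0,1],[0,1,0],[0,1,1],
                            [1,0,0],[1,0,1],[1,1,0],[1,1,1]], dtype=float)

# Made-up targets with one column per output neuron:
# output j is simply asked to reproduce input feature j
training_targets = training_inputs.copy()          # shape (8, 3)

n_inputs = training_inputs.shape[1]                # 3 input nodes
n_outputs = training_targets.shape[1]              # 3 output neurons
learning_rate = 0.1                                # arbitrary choice

# One column of weights per output neuron
weights = 2 * rng.random((n_inputs, n_outputs)) - 1

for _ in range(2000):
    outputs = np.tanh(training_inputs @ weights)   # shape (8, 3)
    error = training_targets - outputs
    delta = error * (1 - outputs**2)               # tanh derivative from the output
    weights += learning_rate * training_inputs.T @ delta

outputs = np.tanh(training_inputs @ weights)
print(weights.shape, outputs.shape)
```

Each column of weights belongs to one output neuron, and without a hidden layer the columns are trained independently of each other.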
I have developed a trivial feed-forward neural network with PyTorch.
The neural network uses GloVe pre-trained embeddings in a frozen nn.Embedding layer.
Next, the embedding layer splits into three embeddings. Each split is a different transformation applied to the initial embedding layer. The three embeddings then feed three nn.Linear layers. Finally, I have a single output layer for a binary classification target.
The shape of the embedding tensor is [64,150,50]
-> 64: sentences in the batch,
-> 150: words per sentence,
-> 50: vector-size of a single word (pre-trained GloVe vector)
So after the transformation, the embedding layer splits into three layers with shape [64,50], where 50 = either the torch.mean(), torch.max() or torch.min() of the 150 words per sentence.
My questions are:
How can I feed the output layer from three different nn.Linear layers to predict a single target value in [0,1]?
Is this efficient and helpful for the overall predictive power of the model? Or is selecting just the average of the embeddings sufficient, so that no improvement will be observed?
The forward() method of my PyTorch model is:
def forward(self, text):
    embedded = self.embedding(text)
    if self.use_pretrained_embeddings:
        embedded_average = torch.mean(embedded, dim=1)
        embedded_max = torch.max(embedded, dim=1)[0]
        embedded_min = torch.min(embedded, dim=1)[0]
    embedded = self.flatten_layer(embedded)
    input_layer = self.input_layer(embedded_average) #each Linear layer has the same value of hidden unit
    input_layer = self.activation(input_layer)
    input_layer_max = self.input_layer(embedded_max)
    input_layer_max = self.activation(input_layer_max)
    input_layer_min = self.input_layer(embedded_min)
    input_layer_min = self.activation(input_layer_min)
    #What should I do here? to exploit the weights of the 3 hidden layers
    output_layer = self.output_layer(input_layer)
    output_layer = self.activation_output(output_layer) #Sigmoid()
    return output_layer
After the proposed answer the function is:
def forward(self, text):
    embedded = self.embedding(text)
    if self.use_pretrained_embeddings:
        embedded_average = torch.mean(embedded, dim=1)
        embedded_max = torch.max(embedded, dim=1)[0]
        embedded_min = torch.min(embedded, dim=1)[0]
    #use of average embeddings transformation
    input_layer_average = self.input_layer(embedded_average)
    input_layer_average = self.activation(input_layer_average)
    #use of max embeddings transformation
    input_layer_max = self.input_layer(embedded_max)
    input_layer_max = self.activation(input_layer_max)
    #use of min embeddings transformation
    input_layer_min = self.input_layer(embedded_min)
    input_layer_min = self.activation(input_layer_min)
    embedded = self.flatten_layer(embedded)
    input_layer = torch.concat([input_layer_average, input_layer_max, input_layer_min], dim=1)
    input_layer = self.activation(input_layer)
    print("3",input_layer.shape) #[192,1] vs [64,1] -> output layer
    if self.n_layers !=0:
        for layer in self.layers:
            input_layer = layer(input_layer)
    output_layer = self.output_layer(input_layer)
    output_layer = self.activation_output(output_layer)
    return output_layer
This generates the following error:
ValueError: Using a target size (torch.Size([64, 1])) that is different to the input size (torch.Size([192, 1])) is deprecated. Please ensure they have the same size.
This is expected, since the concatenated layer is 3x the number of sentences in the batch (64). Is there a fix that resolves it?
Regarding 1: You can use torch.concat to concatenate the outputs along the appropriate dimension, and then e.g. map them to a single output using another linear layer.
Regarding 2: You will have to try it yourself and see whether this is useful.
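As a rough illustration of point 1, the sketch below traces the shapes with NumPy standing in for PyTorch; it is not the asker's model. The hidden size of 16, the random weights and the ReLU are assumptions. In PyTorch the corresponding steps would be torch.cat(..., dim=1) followed by an nn.Linear whose in_features is 3 * hidden. It also shows one way a batch of 192 can arise: concatenating along the batch axis instead of the feature axis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shapes from the question; hidden=16 is an arbitrary choice for the sketch
batch, words, emb_dim, hidden = 64, 150, 50, 16

embedded = rng.standard_normal((batch, words, emb_dim))

# Three pooled views of each sentence, each of shape (batch, emb_dim)
pooled = [embedded.mean(axis=1), embedded.max(axis=1), embedded.min(axis=1)]

# One linear layer (weights and bias) per branch, followed by a ReLU
branch_W = [rng.standard_normal((emb_dim, hidden)) * 0.1 for _ in pooled]
branch_b = [np.zeros(hidden) for _ in pooled]
branches = [np.maximum(p @ W + b, 0.0)
            for p, W, b in zip(pooled, branch_W, branch_b)]

# Concatenate along the feature axis: the batch size stays 64
merged = np.concatenate(branches, axis=1)      # (64, 3 * hidden)

# Concatenating along the batch axis triples the batch size instead
stacked = np.concatenate(branches, axis=0)     # (192, hidden)

# The output layer maps 3 * hidden features to one logit per sentence
W_out = rng.standard_normal((3 * hidden, 1)) * 0.1
b_out = np.zeros(1)
probs = 1.0 / (1.0 + np.exp(-(merged @ W_out + b_out)))   # (64, 1)

print(merged.shape, stacked.shape, probs.shape)
```

With the feature-axis concatenation, the prediction has one row per sentence and matches a target of shape [64, 1].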
From what I understand about neural networks, you have a number of hidden layers, each consisting of X neurons. A neuron takes in a number of inputs and their respective weights, then uses an activation function (sigmoid in my case) to produce an output.
My task is to implement a network from scratch (only using numpy), with 2 hidden layers, sigmoid activation function and 500 neurons in each hidden layer. What I don't understand is, how can I implement the concept of neurons? According to this article, one neuron is when all inputs are weighted and are passed into the activation function. So do I feed in the same inputs, 500 times, with different weights each time (in the first layers, then again in the second)? I've also read this topic, where the following is said:
The neuron is nothing more than a set of inputs, a set of weights, and an activation function. The neuron translates these inputs into a single output, which can then be picked up as input for another layer of neurons later on.
So according to this, I indeed should weigh the inputs differently, 500 times, and then pass these forward to the next layer which will do the same. Am I understanding this correctly?
Here is the code I have written so far (it is very elementary, but I did not want to proceed further before clearing this up); I have no idea how to implement this:
import numpy as np

class NeuralNetwork:
    def __init__(self, data, y, neurons, hidden):
        self.input = data
        self.y = y
        self.output = np.zeros(y.shape)
        self.layers = hidden
        self.neurons = neurons
        self.weights = self.generateWeightArray()

    def generateWeightArray(self):
        weightarr = []
        #Last weight array is for inbetween hidden and output layer
        for i in range(self.layers + 1):
            weightarr.append(self.generateWeightMatrix())
        return np.asarray(weightarr)

    def generateWeightMatrix(self):
        return np.random.rand(self.input.shape[0], self.input.shape[1]-1)

    def sigmoid(self, x):
        return 1/(1+np.exp(-x))

    def dsigmoid(self, x):
        return self.sigmoid(x)*(1-self.sigmoid(x))

    def train(self):
        pass

    def run(self):
        #Since between each layer we have a matrix of weights, we can just keep going for the number of hidden
        #layers we have
        for i in range(self.layers):
            out = np.dot(self.input, self.weights[i]).transpose() #step1
            self.input = self.sigmoid(out) #step2

net = NeuralNetwork(np.array([[1,2,3,4],[3,5,1,2],[5,6,7,8]]), np.array([1,0,1]), 500, 2)
I have changed my code as follows
class NeuralNetwork:
    def __init__(self, data, y, neurons, hidden):
        self.input = data
        self.y = y
        self.output = np.zeros(y.shape)
        self.layers = hidden
        self.neurons = neurons
        self.weights_to_hidden = np.random.rand(self.neurons, self.input.shape[1])
        self.weights = self.generateWeightArray()
        self.weights_to_output = np.random.rand(self.neurons,1)

    #Generate a matrix with h+1 weight matrices, where h is the number of hidden layers (+1 for output)
    def generateWeightArray(self):
        weightarr = []
        #Last weight array is for inbetween hidden and output layer
        for i in range(self.layers):
            weightarr.append(self.generateWeightMatrix())
        return np.asarray(weightarr)

    #Generate a matrix with n columns and m rows, where n is the number of features and m is the number of neurons
    #in the layer
    def generateWeightMatrix(self):
        return np.random.rand(self.neurons, self.neurons)

    def sigmoid(self, x):
        return 1/(1+np.exp(-x))

    def dsigmoid(self, x):
        return self.sigmoid(x)*(1-self.sigmoid(x))

    def train(self):
        #2 hidden layers, then hidden -> output layer
        hidden_in = self.sigmoid(np.dot(self.input, self.weights_to_hidden.transpose()).transpose())
        print("Going into hidden layer:")
        for i in range(self.layers):
            in_hidden = self.sigmoid(np.dot(hidden_in.transpose(), self.weights[i]).transpose())
            print("After ",str(i+1), " hidden layer:")
        out = self.sigmoid(np.dot(in_hidden.transpose(), self.weights_to_output).transpose())
        print(out)

net = NeuralNetwork(np.array([[1,2,3,4],[3,5,1,2],[5,6,7,8]]), np.array([1,0,1]), 5, 2)
net.train()
And the output after running is
[[0.89405222 0.89501672 0.89717842]]
I'm not sure whether self.weights_to_output has the correct shape, though, because it is (n,1), so all features (in each record) will have the same weight, rather than having 3 weights for each row (?)
The 'neurons' (usually called 'units' these days) are the activations in each layer. These are represented as a vector with one element for each unit. You will represent a layer's activations as a 1D array. So the short answer is that the neurons are elements in 1D arrays.
Let's look at the activation of one unit in layer 3 of a deep neural network:
a_3 = sigma(w_3 @ a_2 + b_3)
a_3 will be a scalar, the activation for this unit;
sigma is the activation function (e.g. logistic function, tanh, or ReLU)
w_3 is the vector of weights for this unit (one element for each unit of layer 2)
a_2 is the vector of activations for the previous layer (one element for each unit of layer 2)
b_3 is the bias for this unit.
Note that w_3 and a_2 are both 1D arrays (vectors). The @ operator does matrix multiplication in Python 3.5+, and in this case it performs the dot product.
Now, thanks to the magic of linear algebra, it turns out we don't need to loop over each unit to compute all the activations. If we let the weight vector w_3 become a matrix, W_3 (note the capital letter), then it can represent the weights from all units in layer 2 to all units in layer 3, and b_3 becomes a vector with one bias per unit of layer 3. Then:
a_3 = sigma(W_3 @ a_2 + b_3)
Now a_3 will be a vector.
The tricky part is keeping track of all the shapes. For W @ a to work, the shapes must be compatible. For example, imagine we have a network with 2 units in layer 2 (so a_2 is a 1D array with 2 elements) and 3 units in layer 3 (so a_3 needs to have 3 elements). Now W_3 needs to be 3 × 2 and a_2 needs to be 2 × 1. Then the matrix multiply works. You can use np.reshape() and np.transpose() to achieve the shapes you need.
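Here is a small NumPy sketch of that 2-unit to 3-unit example, using the @ operator for the products. The weight, bias and activation values are made up for illustration. Note that NumPy accepts a 1D a_2 on the right of a matrix product, so no reshape is needed in this particular case.

```python
import numpy as np

def sigma(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + np.exp(-z))

# Activations of layer 2: one element per unit (2 units)
a_2 = np.array([0.5, -1.0])

# Weights into layer 3: one row per unit of layer 3, one column per unit of layer 2
W_3 = np.array([[0.1, 0.2],
                [0.3, 0.4],
                [0.5, 0.6]])           # shape (3, 2)

# One bias per unit of layer 3
b_3 = np.array([0.0, 0.1, -0.1])

# A single unit: dot product of its weight vector with a_2, plus its bias
w_3 = W_3[0]
a_3_first_unit = sigma(w_3 @ a_2 + b_3[0])   # scalar

# All units at once: matrix-vector product
a_3 = sigma(W_3 @ a_2 + b_3)                 # shape (3,)

print(a_3.shape, a_3_first_unit, a_3[0])
```

The first element of the vectorised result is the same number as the single-unit computation, which is the sense in which each "neuron" is one element of the array.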
I hope this helps... it's a lot of words.
Maybe this diagram (from this article) helps explain:
The diagram doesn't say how many records there are. There are 3 features per data instance (i.e. we have an M × 3 input matrix). The input layer is 'just' another layer, you can think of the inputs as just another set of activations. You could think of x as a_0 (the 0-th layer).
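Putting this together for a whole batch, a forward pass with two hidden layers of 500 units might look like the sketch below. It uses the sample data from the question and stores records in rows, so each layer computes a_prev @ W + b (the transpose of the W @ a convention, which works equally well). The initialisation scale of 0.01 and the variable names are assumptions; training (backpropagation) is not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# M = 3 records with 4 features each; think of X as a_0
X = np.array([[1, 2, 3, 4],
              [3, 5, 1, 2],
              [5, 6, 7, 8]], dtype=float)

n_features = X.shape[1]
n_hidden = 500      # units per hidden layer
n_out = 1           # a single output unit

# One weight matrix and one bias vector per layer (small random init, assumed)
W1 = rng.standard_normal((n_features, n_hidden)) * 0.01
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal((n_hidden, n_hidden)) * 0.01
b2 = np.zeros(n_hidden)
W3 = rng.standard_normal((n_hidden, n_out)) * 0.01
b3 = np.zeros(n_out)

# The same inputs reach all 500 units; each column of W1 holds one unit's weights
a1 = sigmoid(X @ W1 + b1)        # (3, 500): first hidden layer
a2 = sigmoid(a1 @ W2 + b2)       # (3, 500): second hidden layer
y_hat = sigmoid(a2 @ W3 + b3)    # (3, 1): one prediction per record

print(a1.shape, a2.shape, y_hat.shape)
```

The 500 neurons never appear as separate objects; they are the 500 columns of each weight matrix and the 500 entries in each row of activations.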
This 3 Blue 1 Brown video is well worth watching too:
I was trying to do a pretty simple thing: train an LSTM that takes a sequence of random numbers and outputs their sum. But after some hours without convergence, I decided to ask here which of my premises is wrong.
The idea is simple:
I generate a training set of sequences of some sequence length of random numbers and label them with the sum of them (numbers are drawn from a normal distribution)
I use an LSTM with an RMSE loss for predicting the output, the sum of these numbers, given the sequence input
Intuitively, the LSTM should learn to set the weight of the input gate to 1 (bias 0), the weights of the forget gate to 0 (bias 1), and the weight of the output gate to 1 (bias 0), and so learn to add these numbers, but it doesn't. I am pasting the code I use. I tried different learning rates, optimizers, and batch sizes, and inspected the gradients and the outputs, but I can't find the exact reason why it is failing.
Code for generating sequences:
import tensorflow as tf
import numpy as np

def generate_sequences(n_samples, seq_len):
    total_shape = n_samples*seq_len
    random_values = np.random.randn(total_shape)
    random_values = random_values.reshape(n_samples, -1)
    targets = np.sum(random_values, axis=1)
    return random_values, targets
Code for training:
n_samples = 100000
seq_len = 2
epochs = n_samples
batch_size = 1
input_shape = 1
data, targets = generate_sequences(n_samples, seq_len)
train_data = tf.data.Dataset.from_tensor_slices((data, targets))
output = tf.keras.layers.RNN(tf.keras.layers.LSTMCell(1, dtype='float64', recurrent_activation=None, activation=None), input_shape=(batch_size, seq_len, input_shape))
iterator = train_data.batch(batch_size).make_one_shot_iterator()
# lr is the learning rate (several values were tried)
optimizer = tf.train.AdamOptimizer(lr)
for i in range(epochs):
    my_inp, target = iterator.get_next()
    with tf.GradientTape(persistent=True) as tape:
        my_out = output(tf.reshape(my_inp, shape=(batch_size,seq_len,1)))
        loss = tf.sqrt(tf.reduce_sum(tf.square(target - my_out)),1)/batch_size
    grads = tape.gradient(loss, output.trainable_variables)
    optimizer.apply_gradients(zip(grads, output.trainable_variables))
I also have a conjecture that this is a theoretical problem: if we sum random values drawn from a normal distribution, the output is not confined to the [-1, 1] range, so the LSTM cannot learn it because of its tanh activations. But changing them didn't improve the performance (they are set to linear in the example code).
Set the activations to linear. I realised that tanh() squashes the values.
The problem was actually the tanh() of the gates and recurrent states, which squashed my outputs and did not allow them to grow by adding up the summands. Setting all activations to linear works fine.
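The squashing effect can be seen in a toy recurrence that is much simpler than an LSTM (no gates, no learned weights; it is only meant to illustrate the bound). The state is updated as h = f(h + x) with f either the identity or tanh. The input values and the helper name accumulate are made up for the example.

```python
import numpy as np

def accumulate(xs, activation):
    """Run h = activation(h + x) over the sequence and return the final state."""
    h = 0.0
    for x in xs:
        h = activation(h + x)
    return h

xs = np.array([0.9, 0.8, 0.7, 0.6])   # true sum is 3.0

linear_state = accumulate(xs, lambda z: z)   # identity activation
tanh_state = accumulate(xs, np.tanh)         # tanh activation

print("true sum    :", xs.sum())
print("linear state:", linear_state)
print("tanh state  :", tanh_state)
```

With the identity the state equals the sum of the inputs, while with tanh it can never leave the interval (-1, 1), however long the sequence is.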
First off, I am very new to Neural Nets and Keras.
I am trying to create a simple Neural Network using Keras where the input is a time series and the output is another time series of same length (1 dimensional vectors).
I made dummy code to create random input and output time series using a Conv1D layer. The Conv1D layer then outputs 6 different time series (because I have 6 filters) and the next layer I define to add all 6 of those outputs into one which is the output to the entire network.
import numpy as np
import tensorflow as tf
from tensorflow.python.keras.models import Model
from tensorflow.python.keras.layers import Conv1D, Input, Lambda

def summation(x):
    y = tf.reduce_sum(x, 0)
    return y

time_len = 100 # total length of time series
num_filters = 6 # number of filters/outputs to Conv1D layer
kernel_len = 10 # length of kernel (memory size of convolution)
# create random input and output time series
X = np.random.randn(time_len)
Y = np.random.randn(time_len)
# Create neural network architecture
input_layer = Input(shape = X.shape)
conv_layer = Conv1D(filters = num_filters, kernel_size = kernel_len, padding = 'same')(input_layer)
summation_layer = Lambda(summation)(conv_layer)
model = Model(inputs = input_layer, outputs = summation_layer)
model.compile(loss = 'mse', optimizer = 'adam', metrics = ['mae'])
model.fit(X,Y,epochs = 1, metrics = ['mae'])
The error I get is:
ValueError: Input 0 of layer conv1d_1 is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: [None, 100]
Looking at the Keras documentation for Conv1D, the input shape is supposed to be a 3D tensor of shape (batch, steps, channels) which I don't understand if we are working with 1 dimensional data.
Can you explain the meaning of each of the items: batch, steps, and channels? And how should I shape my 1D vectors to allow my network to run?
What is a (training) sample?
The (training) data may consist of tens, hundreds or thousands of samples. For example, each image in an image dataset like CIFAR-10 or ImageNet is a sample. As another example, for a time series dataset that consists of weather statistics recorded daily over 10 years, each training sample may be the time series of one day. If we have recorded 100 measurements during each day and each measurement consists of temperature and humidity (i.e. we have two features per measurement), then the shape of our dataset is roughly (10x365, 100, 2).
What is batch size?
The batch size is simply the number of samples that the model processes at a single time. We can set the batch size using the batch_size argument of the fit method in Keras. Common values are 16, 32, 64, 128, 256, etc. (though you must choose a number such that your machine has enough RAM to allocate the required resources).
Further, the "steps" (also called "sequence length") and "channels" (also called "feature size") are the number of measurements and the size of each measurement, respectively. For example in our weather example above, we have steps=100 and channels=2.
To resolve the issue with your code you need to define your training data (i.e. X) such that it has a shape of (num_samples, steps or time_len, channels or feat_size):
n_samples = 1000 # we have 1000 samples in our training data
n_channels = 1 # each measurement has one feature
X = np.random.randn(n_samples, time_len, n_channels)
# if you want to predict one value for each measurement
Y = np.random.randn(n_samples, time_len)
# or if you want to predict one value for each sample
Y = np.random.randn(n_samples)
One more thing is that you should pass the shape of one sample as the input shape of the model. Therefore, the input shape of Input layer must be passed like shape=X.shape[1:].
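As a quick shape check, the sketch below uses NumPy only (no Keras model is built) to show how a single 1D series of length 100 becomes a (batch, steps, channels) array, and what shape=X.shape[1:] evaluates to. The weather array is a zero-filled placeholder with the shape from the example above.

```python
import numpy as np

time_len = 100

# One 1D time series: shape (100,)
series = np.random.randn(time_len)

# Add a batch axis in front and a channel axis at the end:
# 1 sample, 100 steps, 1 channel
X = series.reshape(1, time_len, 1)

# The model's input shape describes ONE sample, so the batch axis is dropped
input_shape = X.shape[1:]

# Weather example: 10 years of days, 100 measurements per day, 2 features each
weather = np.zeros((10 * 365, 100, 2))

print(series.shape)        # (100,)
print(X.shape)             # (1, 100, 1)
print(input_shape)         # (100, 1)
print(weather.shape[1:])   # (100, 2)
```

So "1 dimensional data" in the Conv1D sense still has three axes once the batch and channel dimensions are made explicit.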
I tried to solve Experiment 3a described in the original LSTM paper with a TensorFlow LSTM and failed.
From the paper: The task is to observe and then classify input sequences. There are two classes, each occurring with probability 0.5. There is only one input line. Only the first N real-valued sequence elements convey relevant information about the class. Sequence elements at positions t > N are generated by a Gaussian with mean zero and variance 0.2.
The net architecture that he described in the paper:
"We use a 3-layer net with 1 input unit, 1 output unit, and 3 cell blocks of size 1. The output layer receives connections only from memory cells. Memory cells and gate units receive inputs from input units, memory cells and gate units, and have bias weights. Gate units and output unit are logistic sigmoid in [0; 1], h in [-1; 1], and g in [-2; 2]"
I tried to reproduce it with an LSTM with 3 hidden units for T=100 and N=3, but failed.
I used online training (i.e. updating the weights after each sequence), as described in the original paper.
The core of my code was as follows:
self.batch_size = batch_size = config.batch_size
hidden_size = 3
self._input_data = tf.placeholder(tf.float32, (1, T))
self._targets = tf.placeholder(tf.float32, [1, 1])
lstm_cell = rnn_cell.BasicLSTMCell(hidden_size , forget_bias=1.0)
cell = rnn_cell.MultiRNNCell([lstm_cell] * 1)
self._initial_state = cell.zero_state(1, tf.float32)
weights_hidden = tf.constant(1.0, shape= [config.num_features, config.n_hidden])
# prepare the input
inputs = []
for k in range(num_steps):
    nextitem = tf.matmul(tf.reshape(self._input_data[:, k], [1, 1]) , weights_hidden)
    inputs.append(nextitem)
outputs, states = rnn.rnn(cell, inputs, initial_state=self._initial_state)
# use the last output
pred = tf.sigmoid(tf.matmul(outputs[-1], tf.get_variable("weights_out", [config.n_hidden,1])) + tf.get_variable("bias_out", [1]))
self._final_state = states[-1]
self._cost = cost = tf.reduce_mean(tf.square((pred - self.targets)))
self._result = tf.abs(pred[0, 0] - self.targets[0,0])
optimizer = tf.train.GradientDescentOptimizer(learning_rate = config.learning_rate).minimize(cost)
Any idea why it couldn't learn?
My first instinct was to create 2 outputs, one for each class, but the paper specifically mentions only one output unit.
It seems that I needed forget_bias > 1.0. For long sequences the network couldn't learn with the default forget_bias; for T=50, for example, I needed forget_bias = 2.1.
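One rough way to see why a larger forget_bias could help on long sequences: the cell state is multiplied by the forget gate at every step, so what survives after T steps is roughly the product of T gate values. The sketch below assumes, as a strong simplification, that the forget gate's pre-activation is just the bias (the inputs and recurrent weights contribute nothing). retained_fraction is a hypothetical helper, not a TensorFlow function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def retained_fraction(forget_bias, steps):
    """Fraction of the initial cell state left after `steps` time steps,
    assuming the forget gate always outputs sigmoid(forget_bias)."""
    return sigmoid(forget_bias) ** steps

# Compare the two bias values mentioned above on a sequence of length 50
for bias in (1.0, 2.1):
    print("forget_bias =", bias,
          "gate =", sigmoid(bias),
          "retained after 50 steps =", retained_fraction(bias, 50))
```

Under this simplification, the information from the first few time steps decays much faster with a bias of 1.0 than with 2.1, which is consistent with the observation that longer sequences needed a larger bias.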