Keras: Making a neural network to find a number's modulus - python

I'm an experienced Python developer, but a complete newbie in machine learning. This is my first attempt to use Keras. Can you tell what I'm doing wrong?
I'm trying to make a neural network that takes a number in binary form, and outputs its modulo when dividing by 7. (My goal was to take a very simple task just to see that everything works.)
In the code below I define the network and I train it on 10,000 random numbers. Then I test it on 500 random numbers.
For some reason the accuracy that I get is around 1/7, which is the accuracy you'd expect from a completely random algorithm, i.e. my neural network isn't doing anything.
Can anyone help me figure out what's wrong?
import keras.layers
import keras.models
import numpy as np
from python_toolbox import random_tools

RADIX = 7

def _get_number(vector):
    # The vector holds the bits least significant first.
    return sum(x * 2 ** i for i, x in enumerate(vector))

def _get_mod_result(vector):
    return _get_number(vector) % RADIX

def _number_to_vector(number):
    binary_string = bin(number)[2:]
    if len(binary_string) > 20:
        raise NotImplementedError
    bits = (((0,) * (20 - len(binary_string))) +
            tuple(map(int, binary_string)))[::-1]
    assert len(bits) == 20
    return np.c_[bits]

def get_mod_result_vector(vector):
    return _number_to_vector(_get_mod_result(vector))

def main():
    model = keras.models.Sequential([
        keras.layers.Dense(units=20, activation='relu', input_dim=20),
        keras.layers.Dense(units=20, activation='relu'),
        keras.layers.Dense(units=20, activation='softmax'),
    ])
    # The optimizer here is an assumption; the loss is the one named in the answers.
    model.compile(optimizer='sgd', loss='categorical_crossentropy',
                  metrics=['accuracy'])

    data = np.random.randint(2, size=(10000, 20))
    labels = np.vstack(list(map(get_mod_result_vector, data)))
    model.fit(data, labels, epochs=10, batch_size=50)

    def predict(number):
        foo = model.predict(_number_to_vector(number))
        return _get_number(tuple(map(round, foo[0])))

    def is_correct_for_number(x):
        return bool(predict(x) == x % RADIX)

    sample = random_tools.shuffled(range(2 ** 20))[:500]
    print('Total accuracy:')
    print(sum(map(is_correct_for_number, sample)) / len(sample))
    print(f'(Accuracy of random algorithm is {1/RADIX:.2f})')

if __name__ == '__main__':
    main()

This achieves an accuracy of 99.74% and a validation accuracy of 99.69%.
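import tensorflow as tf, numpy as np

def int2bits(i, fill=20):
    return list(map(int, bin(i)[2:].zfill(fill)))

def bits2int(b):
    return sum(i*2**n for n, i in enumerate(reversed(b)))

# Data.
I = np.random.randint(0, 2**20, size=(250_000,))
X = np.array(list(map(int2bits, I)))
Y = np.array([int2bits(2**i, 7) for i in I % 7])  # one-hot label of the remainder

# Test Data.
It = np.random.randint(0, 2**20, size=(10_000,))
Xt = np.array(list(map(int2bits, It)))
Yt = np.array([int2bits(2**i, 7) for i in It % 7])

# Model.
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(1000, activation='relu'),   # hidden activation is an assumption
    tf.keras.layers.Dense(7, activation='softmax'),   # one output per remainder
])
# Optimizer and loss are assumptions; only the layer width, batch size and epochs are stated below.
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train.
model.fit(X, Y, 10_000, 100, validation_data=(Xt, Yt))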
Some take-aways:
1) You had way too little data. You were uniformly sampling points from 0 to 2**20, but only sampled 10,000, which is only about 1% of the possible vectors that the model is supposed to learn about. The point is that a lot of components (in the binary representation) would be mostly fixed at zero or one, without any opportunity for the model to learn how they function in the overall data or how they interact with other components.
2) You needed an embedding layer, that is, a layer that extends the space into some massively higher dimension so the neurons can move around more easily. This lets the learning shuffle things around better, hopefully finding the algorithm you're looking for. A single Dense(1000) seems to work.
3) I ran batches of 10_000 (just to maximize my CPU usage) for 100 epochs. I passed my validation_data to the training call so I could see how the validation set performs at each epoch (including it doesn't affect the training; it just makes it easier to see whether the model is doing well while training).
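As a quick check of the coverage figure in the first point, the fraction of the 2**20 possible inputs reached by each training-set size mentioned in this thread can be computed directly. This is only back-of-the-envelope arithmetic: it treats every sample as distinct, so it is an upper bound on the true coverage, and the helper name `coverage` is made up for illustration.

```python
TOTAL_INPUTS = 2 ** 20  # number of distinct 20-bit inputs

def coverage(n_samples, total=TOTAL_INPUTS):
    """Upper bound on the fraction of distinct inputs covered by n_samples draws."""
    return n_samples / total

for n in (10_000, 50_000, 250_000):
    print(f"{n:>7} samples -> at most {coverage(n):.1%} of all inputs")
```

Random sampling with replacement produces some duplicates, so the real coverage is a little lower than these figures.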
Thanks. :-)

After some tinkering I was able to get to a reasonably good solution using RNNs. It trains on less than 5% of all possible unique inputs and gives >90% accuracy on the random test sample. You can increase the number of epochs from 40 to 100 to make it a bit more accurate (though in some runs there is a chance the model won't converge to the right answer, and here that chance is higher than usual). I switched to the Adam optimizer here and had to increase the number of samples to 50K (10K led to overfitting for me).
Please understand that this solution is a bit tongue-in-cheek, because it relies on task-domain knowledge: our target function can be defined by a simple recurrence over the sequence of input bits (an even simpler one if you reverse the input bit sequence, although using go_backwards=True in the LSTM didn't help here).
If you reverse the order of the input bits (so that we always start with the most significant bit), then the recurrence for the target function is just F_n = G(F_{n-1}, x_n), where F_n = MOD([x_1,...,x_n], 7) and G(x, y) = MOD(2*x+y, 7). G has only 14 different inputs (7 values of x times 2 values of the bit y) and 7 possible outputs. So the model more or less has to learn the initial state plus this update function G. For the sequence starting with the least significant bit the recurrence is slightly more complicated, because it also needs to keep track of the current MOD(2**n, 7) at each step, but this difficulty doesn't seem to matter for training.
Please note that these formulas are only here to explain why an RNN works for this task. The net below is just a plain LSTM layer + softmax, with the original input bits treated as a sequence.
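To make the recurrence concrete, here is a small pure-Python sketch of both bit orders. It is not part of the network; the function names are made up for illustration, and the loop at the end just checks both versions against the built-in % operator.

```python
def mod7_msb_first(bits):
    """F_n = (2 * F_{n-1} + x_n) mod 7, reading the most significant bit first."""
    state = 0
    for bit in bits:
        state = (2 * state + bit) % 7
    return state

def mod7_lsb_first(bits):
    """Least-significant-bit-first variant: also track 2**n mod 7 at each step."""
    state, power = 0, 1
    for bit in bits:
        state = (state + bit * power) % 7
        power = (2 * power) % 7
    return state

def to_bits_msb_first(number, width=20):
    """Fixed-width list of bits, most significant bit first."""
    return [int(c) for c in format(number, f"0{width}b")]

# Both recurrences agree with % on every 12-bit number.
for n in range(2 ** 12):
    bits = to_bits_msb_first(n, width=12)
    assert mod7_msb_first(bits) == n % 7
    assert mod7_lsb_first(bits[::-1]) == n % 7
```

The most-significant-bit-first version carries a single value from 0 to 6 between steps, while the other order has to carry a pair of values, which is the extra bookkeeping mentioned above.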
Full code for the answer using RNN layer:
import keras.layers
import keras.models
import numpy as np
from python_toolbox import random_tools

RADIX = 7
FEATURE_BITS = 20  # 50,000 samples is just under 5% of the 2**20 possible inputs

def _get_number(vector):
    return sum(x * 2 ** i for i, x in enumerate(vector))

def _get_mod_result(vector):
    return _get_number(vector) % RADIX

def _number_to_vector(number):
    binary_string = bin(number)[2:]
    if len(binary_string) > FEATURE_BITS:
        raise NotImplementedError
    bits = (((0,) * (FEATURE_BITS - len(binary_string))) +
            tuple(map(int, binary_string)))[::-1]
    assert len(bits) == FEATURE_BITS
    return np.c_[bits]

def get_mod_result_vector(vector):
    # One-hot label: 1 at the index of the remainder, 0 elsewhere.
    v = np.repeat(0, 7)
    v[_get_mod_result(vector)] = 1
    return v

def main():
    model = keras.models.Sequential([
        keras.layers.Reshape((1, -1)),
        keras.layers.LSTM(units=100),  # the number of units is an assumption
        keras.layers.Dense(units=7, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])

    data = np.random.randint(2, size=(50000, FEATURE_BITS))
    labels = np.vstack(list(map(get_mod_result_vector, data)))
    model.fit(data, labels, epochs=40, batch_size=50)

    def predict(number):
        foo = model.predict(_number_to_vector(number))
        return np.argmax(foo)

    def is_correct_for_number(x):
        return bool(predict(x) == x % RADIX)

    sample = random_tools.shuffled(range(2 ** FEATURE_BITS))[:500]
    print('Total accuracy:')
    print(sum(map(is_correct_for_number, sample)) / len(sample))
    print(f'(Accuracy of random algorithm is {1/RADIX:.2f})')

if __name__ == '__main__':
    main()
I'm not sure how it happened, but the particular task you chose to check your code is extremely difficult for a NN. I think the best explanation is that NNs are not really good when features are interconnected in such a way that changing one feature always changes the value of your target output completely. One way to look at it is to consider the sets of feature vectors for which you expect a certain answer: in your case they look like unions of a very large number of parallel hyperplanes in 20-dimensional space, and for each of the 7 categories these sets of planes are "nicely" interleaved and left for the NN to distinguish.
That said, if your number of examples is large, say 10K, and the number of possible inputs is smaller, say your inputs are only 8 bits wide (so only 256 unique inputs are possible), a network should "learn" the right function quite well (by "remembering" the correct answer for every input, without generalization). In your case that doesn't happen because the code has the following bug.
Your labels were 20-dimensional vectors holding the bits of an integer from 0 to 6 (your actual desired label), so I guess you were pretty much trying to teach the NN to learn the bits of the answer as separate classifiers (with only 3 bits ever able to be non-zero). I changed that to what I assume you actually wanted: vectors of length 7 with a single value equal to 1 and the others 0 (so-called one-hot encoding, which is what keras expects for categorical_crossentropy). If you wanted to learn each bit separately, you definitely shouldn't have used a 20-way softmax in the last layer, because such an output produces probabilities over 20 classes that sum to 1 (in that case you should have trained 20, or rather 3, binary classifiers instead). Since your code didn't give keras correct input, the model you got in the end was more or less random, and with the rounding you applied it tended to output the same value for 95%-100% of inputs.
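The difference between the two label formats is easy to see side by side. The sketch below uses NumPy only; the helper names `binary_label` and `one_hot_label` are made up for illustration and are not taken from the code in this thread.

```python
import numpy as np

RADIX = 7

def binary_label(remainder, width=20):
    """Label as in the question: the remainder written as bits, least significant first."""
    return np.array([(remainder >> i) & 1 for i in range(width)])

def one_hot_label(remainder, num_classes=RADIX):
    """Label for categorical cross-entropy: a single 1 at the class index."""
    label = np.zeros(num_classes, dtype=int)
    label[remainder] = 1
    return label

remainder = 6
print(binary_label(remainder)[:4])  # first four of the 20 bits
print(one_hot_label(remainder))
```

A softmax output sums to 1, so it can match the one-hot label exactly, whereas the bit label for 6 contains two ones and can never be matched by a single probability distribution.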
The slightly changed code below trains a model that can more or less correctly guess the mod 7 answer for every number from 0 to 255 (again, it pretty much remembers the correct answer for every input). If you increase FEATURE_BITS you will see a large degradation of the results. If you actually want to train a NN to learn this task as is with 20 or more bits of input (and without supplying the NN with all possible inputs and infinite time to train), you will need to apply some task-specific feature transformations and/or some layers carefully designed to be good at exactly this task, as others already mentioned in the comments to your question.
import keras.layers
import keras.models
import numpy as np
from python_toolbox import random_tools

RADIX = 7
FEATURE_BITS = 8  # inputs 0 to 255

def _get_number(vector):
    return sum(x * 2 ** i for i, x in enumerate(vector))

def _get_mod_result(vector):
    return _get_number(vector) % RADIX

def _number_to_vector(number):
    binary_string = bin(number)[2:]
    if len(binary_string) > FEATURE_BITS:
        raise NotImplementedError
    bits = (((0,) * (FEATURE_BITS - len(binary_string))) +
            tuple(map(int, binary_string)))[::-1]
    assert len(bits) == FEATURE_BITS
    return np.c_[bits]

def get_mod_result_vector(vector):
    # One-hot label: 1 at the index of the remainder, 0 elsewhere.
    v = np.repeat(0, 7)
    v[_get_mod_result(vector)] = 1
    return v

def main():
    model = keras.models.Sequential([
        keras.layers.Dense(units=20, activation='relu', input_dim=FEATURE_BITS),
        keras.layers.Dense(units=20, activation='relu'),
        keras.layers.Dense(units=7, activation='softmax'),
    ])
    # The optimizer here is an assumption; the loss is the one named above.
    model.compile(optimizer='sgd', loss='categorical_crossentropy',
                  metrics=['accuracy'])

    data = np.random.randint(2, size=(10000, FEATURE_BITS))
    labels = np.vstack(list(map(get_mod_result_vector, data)))
    model.fit(data, labels, epochs=100, batch_size=50)

    def predict(number):
        foo = model.predict(_number_to_vector(number))
        return np.argmax(foo)

    def is_correct_for_number(x):
        return bool(predict(x) == x % RADIX)

    sample = random_tools.shuffled(range(2 ** FEATURE_BITS))[:500]
    print('Total accuracy:')
    print(sum(map(is_correct_for_number, sample)) / len(sample))
    print(f'(Accuracy of random algorithm is {1/RADIX:.2f})')

if __name__ == '__main__':
    main()


Pytorch model output is not correct (torch.float32 and torch.float64)

I have created a DNN model with Pytorch (input_dim=6, output_dim=150). Normally, if I generate a random X_in=torch.randn(6000, 6), it will return me a model_out.shape=(6000, 150), and if I calculate the Rank of model_out, it should be 150 (since my model's weight and bias are also randomly initialised).
However, you can see this is NOT TRUE with the following code:
import torch
import torch.nn as nn

torch.manual_seed(923) # for reproducible result

class MyDNN(nn.Module):
    def __init__(self):
        super(MyDNN, self).__init__()
        # layer 0:
        self.linear_0 = nn.Linear(6, 150)
        self.activ_0 = nn.Tanh()
        # layer 1:
        self.linear_1 = nn.Linear(150, 150)
        self.activ_1 = nn.Tanh()
        # layer 2:
        self.linear_2 = nn.Linear(150, 150)
        self.activ_2 = nn.Tanh()
        # layer 3:
        self.linear_3 = nn.Linear(150, 150)
        self.activ_3 = nn.Tanh()

    def forward(self, x):
        out = self.activ_0(self.linear_0(x)) # output: layer 0
        out = self.activ_1(self.linear_1(out)) # output: layer 1
        out = self.activ_2(self.linear_2(out)) # output: layer 2
        out = self.activ_3(self.linear_3(out)) # output: layer 3
        return out

model = MyDNN()
X_in = torch.randn(6000, 6, dtype=torch.float32)
with torch.no_grad():
    model_out = model(X_in)
print(f'model_out rank = {torch.linalg.matrix_rank(model_out)}')
model_out rank = 115. Apparently this is a WRONG output: there is no way the output has so many linearly dependent columns when all the inputs, weights and biases are randomly initialised!
This problem can be solved by changing the X_in dtype as well as the model dtype to float64 with the following code:
model_64 = MyDNN().to(torch.float64)  # cast the weights to double precision as well
X_in_64 = torch.randn(6000, 6, dtype=torch.float64)
with torch.no_grad():
    model_64_out = model_64(X_in_64)
print(f'model_64_out rank = {torch.linalg.matrix_rank(model_64_out)}')
model_64_out rank = 150
Here is my question:
Why does this happen? Is this really a problem of data size? I mean float32 already has a good precision. Actually when I use my own training_data, even with mini_batch_size = 10 -> output.shape = (10, 150), my Rank(output) is less than 10.
Although this problem can be solved by using double precision, this slows down the whole training process a lot (and with Mac M1 pro GPU, it only supports float32 type). Is there any other solution?
You have to realize that we are dealing with a numerical problem here: the rank of a matrix is a discrete value that is derived from, e.g., a singular value decomposition in the case of torch.linalg.matrix_rank. In this case we need to consider a threshold on the singular values: below what magnitude tol do we consider a singular value to be exactly zero?
Remember that we are dealing with floating point values, where all operations come with truncation and rounding errors. In short, there is no sense in trying to compute an exact rank.
So instead you might reconsider what kind of tolerance you use; you could, e.g., use torch.linalg.matrix_rank(..., tol=1e-6). The smaller the tolerance, the higher the expected rank.
But no matter what kind of floating point precision you use, I'd argue you will never be able to find a meaningful "exact" number for the rank; it will always be a trade-off! Therefore I'd reconsider whether you really need to compute the rank at all, or whether there is some other criterion that is better suited to numerical considerations!
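The effect of the tolerance can be reproduced on a small matrix whose singular values are known in advance. The sketch below uses NumPy's np.linalg.matrix_rank as a stand-in for the torch function (both count singular values above a threshold); the matrix size and the chosen singular values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a 50x8 matrix whose singular values are known by construction.
U, _ = np.linalg.qr(rng.standard_normal((50, 8)))  # 50x8, orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((8, 8)))   # 8x8 orthogonal matrix
singular_values = np.array([1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7])
A = (U * singular_values) @ V.T

# The reported rank is the number of singular values above the tolerance.
ranks_by_tol = {tol: int(np.linalg.matrix_rank(A, tol=tol))
                for tol in (3e-3, 3e-6, 3e-8)}
print(ranks_by_tol)

# The default tolerance scales with the machine epsilon of the dtype, so the
# same matrix is assigned a lower rank in float32 than in float64.
rank32 = int(np.linalg.matrix_rank(A.astype(np.float32)))
rank64 = int(np.linalg.matrix_rank(A))
print(rank32, rank64)
```

Nothing about the matrix changes between the calls; only the cutoff below which a singular value counts as zero does.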

What are the expected values in the input in Pytorch?

I am new to Pytorch, following the tutorials. I want to implement a regressor for a nonlinear function with 4 real inputs and 2 real outputs. I cannot find anywhere what the supposed range for the inputs and outputs is. They should go between -1 and 1? Between 0 and 1? Can it be anything?
More details
I have written the following simple parameterized model for experimenting:
import torch

class Net(torch.nn.Module):
    def __init__(self, n_inputs: int, n_outputs: int, n_hidden_layers: int, n_nodes_per_hidden_layer: int):
        super().__init__()
        n_inputs = int(n_inputs)
        n_outputs = int(n_outputs)
        n_hidden_layers = int(n_hidden_layers)
        n_nodes_per_hidden_layer = int(n_nodes_per_hidden_layer)
        if any([i<=0 for i in [n_inputs,n_outputs,n_hidden_layers,n_nodes_per_hidden_layer]]):
            raise ValueError(f'All n_inputs, n_outputs, n_hidden_layers and n_nodes_per_hidden_layer must be greater than 0.')
        self.input_layer = torch.nn.Linear(n_inputs, n_nodes_per_hidden_layer)
        self.hidden_layers = [torch.nn.Linear(n_nodes_per_hidden_layer, n_nodes_per_hidden_layer) for i in range(n_hidden_layers)]
        self.output_layer = torch.nn.Linear(n_nodes_per_hidden_layer, n_outputs)

    def forward(self, x):
        x *= .1
        activation_function = torch.nn.functional.relu
        x = activation_function(self.input_layer(x))
        for idx,layer in enumerate(self.hidden_layers):
            x = activation_function(layer(x))
        x = self.output_layer(x)
        return x
and I instantiate it in this way:
dnn = Net(
    n_inputs = 4, # Defined by the number of observables (charge of each channel now).
    n_outputs = 2, # x and y.
    n_hidden_layers = 3, # Choose yourself.
    n_nodes_per_hidden_layer = 66, # Choose yourself.
)
My input x is data that distributes in a weird way from 0 to 1, my outputs are 2 values in the range 1e-2 +- 100e-6. I have tried x -= .5 and different scalings too, but cannot make it work. I don't get any error, just it does not seem to learn what it is supposed to learn.
I know that this model should work because I have used it with similar data that distributes in a similar way but the inputs in the range 0-100e-12 using x *= 1e9 and it was performing reasonably well. I don't know why, however.
Data transformation
They should go between -1 and 1? Between 0 and 1? Can it be anything?
They can be any real valued numbers, but in general we standardize input values using mean and standard deviation (so the result has 0 mean and 1 variance) like this (for two dimensional data that you have, assuming samples are zeroth dimension and features are the first dimension):
import torch
samples, features = 128, 4
data = torch.randn(samples, features)
std, mean = torch.std_mean(data, dim=0, keepdim=True)
normalized = (data - mean) / std
In general neural networks work best with normalized input with similar ranges.
This transformation can (and should, as it makes the job of the final nn.Linear layer easier) also be applied to your regression target, since it is reversible.
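Standardizing the targets and mapping predictions back afterwards could look roughly like the sketch below. It uses NumPy so that it runs without a model; the target values are randomly generated to mimic the range described in the question (about 1e-2 with a spread of 100e-6), and the function names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical targets: two outputs near 1e-2 with a spread of about 100e-6.
targets = 1e-2 + 100e-6 * rng.standard_normal((128, 2))

# Compute the statistics per output column on the training targets only.
mean = targets.mean(axis=0, keepdims=True)
std = targets.std(axis=0, keepdims=True)

def standardize(y):
    """Map targets to zero mean and unit variance per column."""
    return (y - mean) / std

def unstandardize(z):
    """Inverse transform: map network outputs back to the original units."""
    return z * std + mean

scaled = standardize(targets)      # what the network would be trained on
restored = unstandardize(scaled)   # what you would report after prediction
```

The network is then trained against `scaled`, and `unstandardize` is applied to its predictions before comparing them with real-world values.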
Other things
Use torch.nn.MSELoss for regression
Use standard optimizer (like Adam) with default learning rates (you can fine tune it later)
Make sure your pipeline works correctly

LSTM Autoencoder problems

Autoencoder underfits timeseries reconstruction and just predicts average value.
Question Set-up:
Here is a summary of my attempt at a sequence-to-sequence autoencoder. This image was taken from this paper:
Encoder: Standard LSTM layer. Input sequence is encoded in the final hidden state.
Decoder: LSTM Cell (I think!). Reconstruct the sequence one element at a time, starting with the last element x[N].
Decoder algorithm is as follows for a sequence of length N:
Get Decoder initial hidden state hs[N]: Just use encoder final hidden state.
Reconstruct last element in the sequence: x[N] = w.dot(hs[N]) + b.
Same pattern for other elements: x[i] = w.dot(hs[i]) + b
use x[i] and hs[i] as inputs to LSTMCell to get x[i-1] and hs[i-1]
Minimum Working Example:
Here is my implementation, starting with the encoder:
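class SeqEncoderLSTM(nn.Module):
    def __init__(self, n_features, latent_size):
        super(SeqEncoderLSTM, self).__init__()
        # batch_first=True: inputs have shape (batch, seq_len, n_features)
        self.lstm = nn.LSTM(n_features, latent_size, batch_first=True)

    def forward(self, x):
        # hs is the tuple (final hidden state, final cell state)
        _, hs = self.lstm(x)
        return hs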
Decoder class:
class SeqDecoderLSTM(nn.Module):
    def __init__(self, emb_size, n_features):
        super(SeqDecoderLSTM, self).__init__()
        self.cell = nn.LSTMCell(n_features, emb_size)
        self.dense = nn.Linear(emb_size, n_features)

    def forward(self, hs_0, seq_len):
        x = torch.tensor([])
        # Final hidden and cell state from encoder
        hs_i, cs_i = hs_0
        # reconstruct first element with encoder output
        x_i = self.dense(hs_i)
        x = torch.cat([x, x_i])
        # reconstruct remaining elements
        for i in range(1, seq_len):
            hs_i, cs_i = self.cell(x_i, (hs_i, cs_i))
            x_i = self.dense(hs_i)
            x = torch.cat([x, x_i])
        return x
Bringing the two together:
class LSTMEncoderDecoder(nn.Module):
    def __init__(self, n_features, emb_size):
        super(LSTMEncoderDecoder, self).__init__()
        self.n_features = n_features
        self.hidden_size = emb_size
        self.encoder = SeqEncoderLSTM(n_features, emb_size)
        self.decoder = SeqDecoderLSTM(emb_size, n_features)

    def forward(self, x):
        seq_len = x.shape[1]
        hs = self.encoder(x)
        hs = tuple([h.squeeze(0) for h in hs])
        out = self.decoder(hs, seq_len)
        return out.unsqueeze(0)
And here's my training function:
def train_encoder(model, epochs, trainload, testload=None, criterion=nn.MSELoss(), optimizer=optim.Adam, lr=1e-6, reverse=False):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f'Training model on {device}')
    model = model.to(device)
    opt = optimizer(model.parameters(), lr)
    train_loss = []
    valid_loss = []
    for e in tqdm(range(epochs)):
        running_tl = 0
        running_vl = 0
        for x in trainload:
            # cast to float32 so the batch matches the model's parameters
            x = x.to(device).float()
            opt.zero_grad()
            x_hat = model(x)
            if reverse:
                x = torch.flip(x, [1])
            loss = criterion(x_hat, x)
            loss.backward()
            opt.step()
            running_tl += loss.item()
        if testload is not None:
            with torch.no_grad():
                for x in testload:
                    x = x.to(device).float()
                    loss = criterion(model(x), x)
                    running_vl += loss.item()
                valid_loss.append(running_vl / len(testload))
        train_loss.append(running_tl / len(trainload))
    return train_loss, valid_loss
Data:
Large dataset of events scraped from the news (ICEWS). Various categories exist that describe each event. I initially one-hot encoded these variables, expanding the data to 274 dimensions. However, in order to debug the model, I've cut it down to a single sequence that is 14 timesteps long and only contains 5 variables. Here is the sequence I'm trying to overfit:
tensor([[0.5122, 0.0360, 0.7027, 0.0721, 0.1892],
[0.5177, 0.0833, 0.6574, 0.1204, 0.1389],
[0.4643, 0.0364, 0.6242, 0.1576, 0.1818],
[0.4375, 0.0133, 0.5733, 0.1867, 0.2267],
[0.4838, 0.0625, 0.6042, 0.1771, 0.1562],
[0.4804, 0.0175, 0.6798, 0.1053, 0.1974],
[0.5030, 0.0445, 0.6712, 0.1438, 0.1404],
[0.4987, 0.0490, 0.6699, 0.1536, 0.1275],
[0.4898, 0.0388, 0.6704, 0.1330, 0.1579],
[0.4711, 0.0390, 0.5877, 0.1532, 0.2201],
[0.4627, 0.0484, 0.5269, 0.1882, 0.2366],
[0.5043, 0.0807, 0.6646, 0.1429, 0.1118],
[0.4852, 0.0606, 0.6364, 0.1515, 0.1515],
[0.5279, 0.0629, 0.6886, 0.1514, 0.0971]], dtype=torch.float64)
And here is the custom Dataset class:
class TimeseriesDataSet(Dataset):
    def __init__(self, data, window, n_features, overlap=0):
        if isinstance(data, (np.ndarray)):
            data = torch.tensor(data)
        elif isinstance(data, (pd.Series, pd.DataFrame)):
            data = torch.tensor(data.copy().to_numpy())
        else:
            raise TypeError(f"Data should be ndarray, series or dataframe. Found {type(data)}.")
        self.n_features = n_features
        # split the data into consecutive windows of length `window`
        self.seqs = torch.split(data, window)

    def __len__(self):
        return len(self.seqs)

    def __getitem__(self, idx):
        try:
            return self.seqs[idx].view(-1, self.n_features)
        except TypeError:
            raise TypeError("Dataset only accepts integer index/slices, not lists/arrays.")
The model only learns the average, no matter how complex I make the model or how long I train it.
My research:
This problem is identical to the one discussed in this question: LSTM autoencoder always returns the average of the input sequence
The problem in that case ended up being that the objective function was averaging the target timeseries before calculating loss. This was due to some broadcasting errors because the author didn't have the right sized inputs to the objective function.
In my case, I do not see this being the issue. I have checked and double checked that all of my dimensions/sizes line up. I am at a loss.
Other Things I've Tried
I've tried this with varied sequence lengths from 7 timesteps to 100 time steps.
I've tried with varied number of variables in the time series. I've tried with univariate all the way to all 274 variables that the data contains.
I've tried with various reduction parameters on the nn.MSELoss module. The paper calls for sum, but I've tried both sum and mean. No difference.
The paper calls for reconstructing the sequence in reverse order (see graphic above). I have tried this method using the flipud on the original input (after training but before calculating loss). This makes no difference.
I tried making the model more complex by adding an extra LSTM layer in the encoder.
I've tried playing with the latent space. I've tried from 50% of the input number of features to 150%.
I've tried overfitting a single sequence (provided in the Data section above).
What is causing my model to predict the average and how do I fix it?
Okay, after some debugging I think I know the reasons.
You try to predict next timestep value instead of difference between current timestep and the previous one
Your hidden_features number is too small making the model unable to fit even a single sample
Code used
Let's start with the code (model is the same):
import seaborn as sns
import matplotlib.pyplot as plt
import torch

def get_data(subtract: bool = False):
    # (1, 14, 5)
    input_tensor = torch.tensor(
        [
            [0.5122, 0.0360, 0.7027, 0.0721, 0.1892],
            [0.5177, 0.0833, 0.6574, 0.1204, 0.1389],
            [0.4643, 0.0364, 0.6242, 0.1576, 0.1818],
            [0.4375, 0.0133, 0.5733, 0.1867, 0.2267],
            [0.4838, 0.0625, 0.6042, 0.1771, 0.1562],
            [0.4804, 0.0175, 0.6798, 0.1053, 0.1974],
            [0.5030, 0.0445, 0.6712, 0.1438, 0.1404],
            [0.4987, 0.0490, 0.6699, 0.1536, 0.1275],
            [0.4898, 0.0388, 0.6704, 0.1330, 0.1579],
            [0.4711, 0.0390, 0.5877, 0.1532, 0.2201],
            [0.4627, 0.0484, 0.5269, 0.1882, 0.2366],
            [0.5043, 0.0807, 0.6646, 0.1429, 0.1118],
            [0.4852, 0.0606, 0.6364, 0.1515, 0.1515],
            [0.5279, 0.0629, 0.6886, 0.1514, 0.0971],
        ]
    ).unsqueeze(0)
    if subtract:
        # clone() so the saved first timestep is not changed by the in-place subtraction
        initial_values = input_tensor[:, 0, :].clone()
        input_tensor -= torch.roll(input_tensor, 1, 1)
        input_tensor[:, 0, :] = initial_values
    return input_tensor

if __name__ == "__main__":
    # Experiment settings; these particular values are placeholders.
    HIDDEN_SIZE = 10
    SUBTRACT = False

    input_tensor = get_data(SUBTRACT)
    model = LSTMEncoderDecoder(input_tensor.shape[-1], HIDDEN_SIZE)
    optimizer = torch.optim.Adam(model.parameters())
    criterion = torch.nn.MSELoss()
    for i in range(1000):
        outputs = model(input_tensor)
        loss = criterion(outputs, input_tensor)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(f"{i}: {loss}")
        if loss < 1e-4:
            break
    # Plotting (a minimal sketch: reconstruction, then target)
    sns.lineplot(data=outputs.detach().numpy().squeeze())
    sns.lineplot(data=input_tensor.detach().numpy().squeeze())
    plt.show()
What it does:
get_data either works on the data you provided if subtract=False or (if subtract=True) subtracts the value of the previous timestep from the current timestep
The rest of the code optimizes the model until a loss of 1e-4 is reached (so we can compare how the model's capacity and its increase help, and what happens when we use the difference of timesteps instead of the timesteps themselves)
We will only vary the HIDDEN_SIZE and SUBTRACT parameters!
Small HIDDEN_SIZE, SUBTRACT=False (baseline):
In this case we get a straight line. The model is unable to fit and grasp the phenomena presented in the data (hence the flat lines you mentioned).
1000 iterations limit reached
Small HIDDEN_SIZE, SUBTRACT=True:
Targets are now far from flat lines, but the model is unable to fit because its capacity is too small.
1000 iterations limit reached
Larger HIDDEN_SIZE, SUBTRACT=False:
It got a lot better and our target was hit after 942 steps. No more flat lines, and the model capacity seems quite fine (for this single example!)
Larger HIDDEN_SIZE, SUBTRACT=True:
Although the graph does not look that pretty, we got to the desired loss after only 215 iterations.
Usually, use the difference of timesteps instead of the timesteps themselves (or some other transformation). Otherwise the neural network will try to simply... copy the output from the previous step (as that's the easiest thing to do). Some minimum will be found this way, and getting out of it will require more capacity.
When you use the difference between timesteps there is no way to "extrapolate" the trend from the previous timestep; the neural network has to learn how the function actually varies.
Use a larger model (for the whole dataset you should try a hidden size of something like 300, I think), but you can simply tune that.
Don't use flipud. Use bidirectional LSTMs; this way you get information from both the forward and the backward pass of the LSTM (not to be confused with backprop!). This should also boost your score.
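For reference, the differencing transform and its inverse can be sketched in a few lines of NumPy. The values are the first five timesteps of the first feature from the question; keeping the first value unchanged mirrors what get_data does, and the variable names are only illustrative.

```python
import numpy as np

# First five timesteps of the first feature from the question's sequence.
series = np.array([0.5122, 0.5177, 0.4643, 0.4375, 0.4838])

# Keep the first value, then store x[i] - x[i-1] for the remaining steps.
diffs = np.diff(series, prepend=0.0)

# A cumulative sum undoes the differencing and recovers the original values.
recovered = np.cumsum(diffs)

print(diffs)
print(recovered)
```

Because the transform is invertible, a model trained on the differences can still be evaluated on the original scale by cumulatively summing its outputs.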
Okay, question 1: You are saying that for variable x in the time
series, I should train the model to learn x[i] - x[i-1] rather than
the value of x[i]? Am I correctly interpreting?
Yes, exactly. Taking the difference removes the urge of the neural network to base its predictions too much on the past timestep (by simply taking the last value and maybe changing it a little).
Question 2: You said my calculations for zero bottleneck were
incorrect. But, for example, let's say I'm using a simple dense
network as an auto encoder. Getting the right bottleneck indeed
depends on the data. But if you make the bottleneck the same size as
the input, you get the identity function.
Yes, assuming that there is no non-linearity involved, which makes things harder. In the case of LSTMs there are non-linearities; that's one point.
Another is that we are accumulating timesteps into a single encoder state. So essentially we would have to accumulate the identities of all timesteps into a single hidden state and cell state, which is highly unlikely.
One last point: depending on the length of the sequence, LSTMs are prone to forgetting some of the least relevant information (that's what they were designed to do, not just to remember everything), which makes it even more unlikely.
Is num_features * num_timesteps not a bottle neck of the same size as
the input, and therefore shouldn't it facilitate the model learning
the identity?
It is, but it assumes you have num_timesteps for each data point, which is rarely the case (though it might be here). As for the identity and why it is hard for the network to learn with non-linearities, that was answered above.
One last point about identity functions: if they were actually easy to learn, ResNet architectures would be unlikely to succeed. The network could converge to the identity and make "small fixes" to the output without them, which is not the case.
I'm curious about the statement : "always use difference of timesteps
instead of timesteps" It seem to have some normalizing effect by
bringing all the features closer together but I don't understand why
this is key ? Having a larger model seemed to be the solution and the
substract is just helping.
Key here was, indeed, increasing model capacity. Subtraction trick depends on the data really. Let's imagine an extreme situation:
We have 100 timesteps, single feature
Initial timestep value is 1000
Other timestep values vary by 1 at most
What would the neural network do (what is easiest here)? It would probably discard this change of 1 or smaller as noise and just predict 1000 for all of them (especially if some regularization is in place), as being off by 1/1000 is not much.
What if we subtract? The whole neural network loss is then in the [0, 1] range for each timestep instead of [0, 1001], hence it is more severe to be wrong.
And yes, it is connected to normalization in some sense come to think about it.
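The extreme example can be put into numbers. The sketch below generates a hypothetical series that stays within 1 of 1000 and compares how a "lazy" constant prediction scores on the raw values versus on the differences; the series is randomly generated and `relative_mse` is a made-up helper, so treat it as an illustration rather than a measurement.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical series from the example: 100 timesteps that stay within 1 of 1000.
series = 1000.0 + rng.uniform(-1.0, 1.0, size=100)
series[0] = 1000.0

def relative_mse(prediction, target):
    """Mean squared error divided by the mean squared magnitude of the target."""
    return np.mean((prediction - target) ** 2) / np.mean(target ** 2)

# Lazy model on raw values: predict the constant 1000 at every timestep.
raw_error = relative_mse(np.full_like(series, 1000.0), series)

# Lazy model on differences: predict "no change" (0) at every timestep.
diffs = np.diff(series)
diff_error = relative_mse(np.zeros_like(diffs), diffs)

print(f"raw values:  {raw_error:.2e}")
print(f"differences: {diff_error:.2e}")
```

On the raw values the constant prediction looks almost perfect, while on the differences the same laziness leaves the entire signal unexplained, which is what forces the model to learn the variation.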

Neural network versus random forest performance discrepancy

I want to run some experiments with neural networks using PyTorch, so I tried a simple one as a warm-up exercise, and I cannot quite make sense of the results.
The exercise attempts to predict the rating of 1000 TPTP problems from various statistics about the problems such as number of variables, maximum clause length etc. Data file is quite straightforward, 1000 rows, the final column is the rating, started off with some tens of input columns, with all the numbers scaled to the range 0-1, I progressively deleted features to see if the result still held, and it does, all the way down to one input column; the others are in previous versions in Git history.
I started off using separate training and test sets, but have set aside the test set for the moment, because the question about whether training performance generalizes to testing, doesn't arise until training performance has been obtained in the first place.
Simple linear regression on this data set has a mean squared error of about 0.14.
I implemented a simple feedforward neural network, code copied below, that after a couple hundred training epochs also has a mean squared error of 0.14.
So I tried changing the number of hidden layers from 1 to 2 to 3, using a few different optimizers, tweaking the learning rate, switching the activation functions from relu to tanh to a mixture of both, increasing the number of epochs to 5000, increasing the number of hidden units to 1000. At this point, it should easily have had the ability to just memorize the entire data set. (At this point I'm not concerned about overfitting. I'm just trying to get the mean squared error on training data to be something other than 0.14.) Nothing made any difference. Still 0.14. I would say it must be stuck in a local optimum, but that's not supposed to happen when you've got a couple million weights; it's supposed to be practically impossible to be in a local optimum for all parameters simultaneously. And I do get slightly different sequences of numbers on each run. But it always converges to 0.14.
Now the obvious conclusion would be that 0.14 is as good as it gets for this problem, except that it stays the same even when the network has enough memory to just memorize all the data. But the clincher is that I also tried a random forest, and the random forest has a mean squared error of 0.01 on the original data set, degrading gracefully as features are deleted, still 0.05 on the data with just one feature.
Nowhere in the lore of machine learning is it said 'random forests vastly outperform neural nets', so I'm presumably doing something wrong, but I can't see what it is. Maybe it's something as simple as just missing a flag or something you need to set in PyTorch. I would appreciate it if someone could take a look.
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
# data
df = pd.read_csv("test.csv")
# separate the output column
y_name = df.columns[-1]
y_df = df[y_name]
X_df = df.drop(y_name, axis=1)
# numpy arrays
X_ar = np.array(X_df, dtype=np.float32)
y_ar = np.array(y_df, dtype=np.float32)
# torch tensors
X_tensor = torch.from_numpy(X_ar)
y_tensor = torch.from_numpy(y_ar)
# hyperparameters
in_features = X_ar.shape[1]
hidden_size = 100
out_features = 1
epochs = 500
# model
class Net(nn.Module):
    def __init__(self, hidden_size):
        super(Net, self).__init__()
        self.L0 = nn.Linear(in_features, hidden_size)
        self.N0 = nn.ReLU()
        self.L1 = nn.Linear(hidden_size, hidden_size)
        self.N1 = nn.Tanh()
        self.L2 = nn.Linear(hidden_size, hidden_size)
        self.N2 = nn.ReLU()
        self.L3 = nn.Linear(hidden_size, 1)

    def forward(self, x):
        x = self.L0(x)
        x = self.N0(x)
        x = self.L1(x)
        x = self.N1(x)
        x = self.L2(x)
        x = self.N2(x)
        x = self.L3(x)
        return x
model = Net(hidden_size)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
# train
for epoch in range(1, epochs + 1):
    # forward
    output = model(X_tensor)
    cost = criterion(output, y_tensor)
    # backward
    optimizer.zero_grad()
    cost.backward()
    optimizer.step()
    # print progress
    if epoch % (epochs // 10) == 0:
        print(f"{epoch:6d} {cost.item():10f}")
output = model(X_tensor)
cost = criterion(output, y_tensor)
print("mean squared error:", cost.item())
Can you please print the shape of your input?
I would check these things first:
Make sure your target y has the shape (-1, 1). I don't know whether PyTorch throws an error in this case. You can use y.reshape(-1, 1) if it isn't two-dimensional.
Your learning rate is high. With Adam the default value is usually good enough, so try simply lowering it; 0.1 is a high learning rate to start with.
Place optimizer.zero_grad() on the first line inside the for loop.
Normalize/standardize your data (this is usually good for NNs).
Remove outliers from your data (my opinion: this shouldn't affect a random forest much, but it can affect NNs badly).
Use cross-validation (maybe skorch can help you here; it's a scikit-learn wrapper for PyTorch and easy to use if you know Keras).
Note that a random forest regressor, or any other regressor, can outperform neural nets in some cases. There are fields where neural nets are the heroes, such as image classification or NLP, but be aware that a simple regression algorithm can outperform them, usually when your data set is not big enough.
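To make the first few checks concrete, here is a minimal sketch on made-up toy data (the tensors `y`, `output` and `X` are hypothetical, not the asker's CSV). It shows what can happen when the target has shape (N,) and the model output has shape (N, 1): as far as I know, the PyTorch versions I'm aware of emit a warning and broadcast rather than raise, so every prediction is compared with every target. The second half applies the other suggestions (standardized inputs, Adam's default learning rate, zeroing gradients first).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy target with 4 samples, shape (4,), like y_tensor in the question.
y = torch.tensor([0.0, 1.0, 2.0, 3.0])
# A "perfect" prediction with the shape that nn.Linear(..., 1) produces: (4, 1).
output = y.reshape(-1, 1).clone()

# (4, 1) against (4,) broadcasts to (4, 4): every prediction is compared
# with every target, so even perfect predictions get a nonzero loss.
loss_mismatched = F.mse_loss(output, y)
# (4, 1) against (4, 1): each prediction is compared with its own target.
loss_matched = F.mse_loss(output, y.reshape(-1, 1))
print(loss_mismatched.item(), loss_matched.item())

# The remaining suggestions on made-up features: standardize the inputs,
# keep Adam's default learning rate, and zero the gradients first.
X = torch.randn(4, 3) * 50 + 10
X_std = (X - X.mean(dim=0)) / X.std(dim=0)
model = nn.Linear(3, 1)
optimizer = torch.optim.Adam(model.parameters())  # default lr=1e-3
for epoch in range(100):
    optimizer.zero_grad()
    cost = F.mse_loss(model(X_std), y.reshape(-1, 1))
    cost.backward()
    optimizer.step()
```

On this toy data the mismatched loss is 2.5 and the matched loss is 0.0. With broadcasting, the best a network can do is predict something close to the mean of y for every row, which would be consistent with a loss that never moves no matter how large the network is.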

Using Deep Learning to Predict Subsequence from Sequence

I have data that looks like this:
It can be viewed here and has been included in the code below.
In actuality I have ~7000 samples (rows), which are downloadable too.
The task is: given an antigen, predict the corresponding epitope.
The epitope is always an exact substring of the antigen. This is equivalent to sequence-to-sequence learning. Here is my code, which runs a recurrent neural network under Keras. It was modeled on the example.
My questions are:
Can RNN, LSTM or GRU be used to predict a subsequence as posed above?
How can I improve the accuracy of my code?
How can I modify my code so that it can run faster?
Here is my running code, which gave a very bad accuracy score.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
import sys
import json
import pandas as pd
from keras.models import Sequential
from keras.engine.training import slice_X
from keras.layers.core import Activation, RepeatVector, Dense
from keras.layers import recurrent, TimeDistributed
import numpy as np
from six.moves import range
class CharacterTable(object):
    """
    Given a set of characters:
    + Encode them to a one hot integer representation
    + Decode the one hot integer representation to their character output
    + Decode a vector of probabilities to their character output
    """
    def __init__(self, chars, maxlen):
        self.chars = sorted(set(chars))
        self.char_indices = dict((c, i) for i, c in enumerate(self.chars))
        self.indices_char = dict((i, c) for i, c in enumerate(self.chars))
        self.maxlen = maxlen

    def encode(self, C, maxlen=None):
        maxlen = maxlen if maxlen else self.maxlen
        X = np.zeros((maxlen, len(self.chars)))
        for i, c in enumerate(C):
            X[i, self.char_indices[c]] = 1
        return X

    def decode(self, X, calc_argmax=True):
        if calc_argmax:
            X = X.argmax(axis=-1)
        return ''.join(self.indices_char[x] for x in X)


class colors:
    ok = '\033[92m'
    fail = '\033[91m'
    close = '\033[0m'
# Try replacing GRU, or SimpleRNN
RNN = recurrent.LSTM
def main():
    """
    Epitope_core = answers
    Antigen = questions
    """
    epi_antigen_df ="")
    antigens = epi_antigen_df["Antigen"].tolist()
    epitopes = epi_antigen_df["Epitope Core"].tolist()
    antigens = [x[::-1] for x in antigens]
    allchars = "".join(antigens + epitopes)
    allchars = list(set(allchars))
    aa_chars = "".join(allchars)
    sys.stderr.write(aa_chars + "\n")
    max_antigen_len = len(max(antigens, key=len))
    max_epitope_len = len(max(epitopes, key=len))
    X = np.zeros((len(antigens), max_antigen_len, len(aa_chars)), dtype=np.bool)
    y = np.zeros((len(epitopes), max_epitope_len, len(aa_chars)), dtype=np.bool)
    ctable = CharacterTable(aa_chars, max_antigen_len)
    sys.stderr.write("Begin vectorization\n")
    for i, antigen in enumerate(antigens):
        X[i] = ctable.encode(antigen, maxlen=max_antigen_len)
    for i, epitope in enumerate(epitopes):
        y[i] = ctable.encode(epitope, maxlen=max_epitope_len)
    # Shuffle (X, y) in unison as the later parts of X will almost all be larger digits
    indices = np.arange(len(y))
    np.random.shuffle(indices)
    X = X[indices]
    y = y[indices]
    # Explicitly set apart 10% for validation data that we never train over
    split_at = len(X) - len(X) // 10
    (X_train, X_val) = (slice_X(X, 0, split_at), slice_X(X, split_at))
    (y_train, y_val) = (y[:split_at], y[split_at:])
    sys.stderr.write("Build model\n")
    model = Sequential()
    # "Encode" the input sequence using an RNN, producing an output of HIDDEN_SIZE
    # note: in a situation where your input sequences have a variable length,
    # use input_shape=(None, nb_feature).
    model.add(RNN(HIDDEN_SIZE, input_shape=(max_antigen_len, len(aa_chars))))
    # For the decoder's input, we repeat the encoded input for each time step
    model.add(RepeatVector(max_epitope_len))
    # The decoder RNN could be multiple layers stacked or a single layer
    for _ in range(LAYERS):
        model.add(RNN(HIDDEN_SIZE, return_sequences=True))
    # For each step of the output sequence, decide which character should be chosen
    model.add(TimeDistributed(Dense(len(aa_chars))))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    # Train the model each generation and show predictions against the validation dataset
    for iteration in range(1, 200):
        print('-' * 50)
        print('Iteration', iteration)
        model.fit(X_train, y_train, batch_size=BATCH_SIZE, nb_epoch=5,
                  validation_data=(X_val, y_val))
        # Select 10 samples from the validation set at random so we can visualize errors
        for i in range(10):
            ind = np.random.randint(0, len(X_val))
            rowX, rowy = X_val[np.array([ind])], y_val[np.array([ind])]
            preds = model.predict_classes(rowX, verbose=0)
            q = ctable.decode(rowX[0])
            correct = ctable.decode(rowy[0])
            guess = ctable.decode(preds[0], calc_argmax=False)
            # print('Q', q[::-1] if INVERT else q)
            print('T', correct)
            print(colors.ok + '☑' + colors.close if correct == guess else colors.fail + '☒' + colors.close, guess)


if __name__ == '__main__':
    main()
Can RNN, LSTM or GRU be used to predict a subsequence as posed above?
Yes, you can use any of these. LSTMs and GRUs are types of RNNs; if by RNN you mean a fully-connected RNN, these have fallen out of favor because of the vanishing gradients problem (1, 2). Because of the relatively small number of examples in your dataset, a GRU might be preferable to an LSTM due to its simpler architecture.
How can I improve the accuracy of my code?
You mentioned that training and validation error are both bad. In general, this could be due to one of several factors:
The learning rate is too low (not an issue since you're using Adam, a per-parameter adaptive learning rate algorithm)
The model is too simple for the data (not at all the issue, since you have a very complex model and a small dataset)
You have vanishing gradients (probably the issue since you have a 3-layer RNN). Try reducing the number of layers to 1 (in general, it's good to start by getting a simple model working and then increase the complexity), and also consider hyperparameter search (e.g. a 128-dimensional hidden state may be too large - try 30?).
Another option, since your epitope is a substring of your input, is to predict the start and end indices of the epitope within the antigen sequence (potentially normalized by the length of the antigen sequence) instead of predicting the substring one character at a time. This would be a regression problem with two tasks. For instance, if the antigen is FSKIAGLTVT (10 letters long) and its epitope is KIAGL (positions 3 to 7, one-based) then the input would be FSKIAGLTVT and the outputs would be 0.3 (first task) and 0.7 (second task).
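As a rough sketch of how such regression targets could be derived from the raw strings (the helper name `epitope_span_targets` is made up for illustration, and it assumes the first occurrence is the right one if the epitope appears more than once):

```python
def epitope_span_targets(antigen, epitope):
    """Return (start, end) as one-based positions divided by the antigen length."""
    start0 = antigen.find(epitope)  # zero-based index of first occurrence, -1 if absent
    if start0 < 0:
        raise ValueError("epitope is not a substring of antigen")
    start = start0 + 1           # one-based position of the first residue
    end = start0 + len(epitope)  # one-based position of the last residue
    return start / len(antigen), end / len(antigen)


targets = epitope_span_targets("FSKIAGLTVT", "KIAGL")
print(targets)  # (0.3, 0.7)
```

To turn a prediction back into a substring, you would multiply by the antigen length, round to the nearest position, and slice the antigen.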
Alternatively, if you can make all the antigens the same length (by removing the parts of your dataset with short antigens and/or chopping off the ends of long antigens, assuming you know a priori that the epitope is not near the ends), you can frame it as a classification problem with two tasks (start and end) and sequence-length classes, where you're trying to assign a probability to the epitope starting and ending at each of the positions.
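For the classification framing, the labels might look like the following sketch. The helper `span_class_targets` is hypothetical, uses zero-based positions, and again takes the first occurrence of the epitope; each returned vector has one class per antigen position.

```python
def span_class_targets(antigen, epitope):
    """One-hot start and end vectors over the antigen's positions (zero-based)."""
    start = antigen.find(epitope)  # -1 if the epitope does not occur
    if start < 0:
        raise ValueError("epitope is not a substring of antigen")
    end = start + len(epitope) - 1  # index of the epitope's last residue
    start_onehot = [1 if i == start else 0 for i in range(len(antigen))]
    end_onehot = [1 if i == end else 0 for i in range(len(antigen))]
    return start_onehot, end_onehot


start_onehot, end_onehot = span_class_targets("FSKIAGLTVT", "KIAGL")
print(start_onehot)  # [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(end_onehot)    # [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
```

A network with two softmax outputs of this length could then be trained with categorical cross-entropy on each output.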
How can I modify my code so that it can run faster?
Reducing the number of layers will speed your code up significantly. Also, GRUs will be faster than LSTMs due to their simpler architecture. However, both types of recurrent networks will be slower than, e.g. convolutional networks.
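A minimal sketch of what the slimmed-down model could look like, assuming a reasonably recent Keras: one GRU encoder layer, one GRU decoder layer, and a 30-dimensional hidden state. The sizes `max_antigen_len`, `max_epitope_len` and `n_chars` are placeholder values, not taken from your data.

```python
from keras.models import Sequential
from keras.layers import Input, GRU, RepeatVector, TimeDistributed, Dense

# Placeholder sizes (assumptions for illustration only).
max_antigen_len = 50   # longest antigen, in residues
max_epitope_len = 10   # longest epitope, in residues
n_chars = 20           # size of the amino-acid alphabet
hidden_size = 30       # smaller hidden state, as suggested above

model = Sequential([
    Input(shape=(max_antigen_len, n_chars)),
    GRU(hidden_size),                       # encoder: one layer instead of a stack
    RepeatVector(max_epitope_len),          # feed the encoding to every output step
    GRU(hidden_size, return_sequences=True),  # decoder: one layer
    TimeDistributed(Dense(n_chars, activation="softmax")),  # one character per step
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
model.summary()
```

Once this trains at all, you can add layers or units back one step at a time and watch whether validation accuracy actually improves.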
Feel free to send me an email (address in my profile) if you're interested in a collaboration.

