Recreating char level RNN for generating text

Recreating char level RNN for generating text - python

I tied to follow a book on deep learning, where there is an chapter about generating text in the style of an example. They used an char level RNN with two LSTM layers in it to generate text in the style of shakespare. But the code in the book (also online: https://github.com/DOsinga/deep_learning_cookbook/blob/master/05.1%20Generating%20Text%20in%20the%20Style%20of%20an%20Example%20Text.ipynb) is written in keras and I only use pytorch. So i tied to recreate it exactly in pytorch, with same network structure and hyperparameters.
So after recreating it and making it work without errors it trained it and it only learned to write the most common character, a space. Then i tried to overfit it on one realy simple sentence, so I had to decrease the sequence lenght to 8. This also did not work, but when decreasing the hidden size of the LSTMs to only 32 it learned it nearly perfectly.
So then I continued working on the original text and started to play arround with the hidden size, learning rate, optimizer (also tried adam) and trained it even longer. The best I could achieve were some random letters, still with a lot of spaces and somtimes something like "her", but far from readable, with still an quite high loss. I used RMSprop with lr=0.01 and a hidden size of 128 over 20000 epochs. I also tried to initialize the hidden state and cell state to zero.
The problem is, that my results are far worse than those in the book, but I did exactly the same just in pytorch. Can someone please tell me, what I should try or what I have done wrong. Any help is appreciated!
PS: Sorry for my bad english.
Here is my code with the original hyperparameters:
#hyperparameters
batch_size = 256
seq_len = 160
hidden_size = 640
layers = 2
#network structure
class RNN(nn.Module):
def __init__(self):
super().__init__()
self.lstm = nn.LSTM(len(chars),hidden_size,layers)
self.linear = nn.Linear(hidden_size,len(chars))
self.softmax = nn.Softmax(dim=2)
def forward(self,x,h,c):
x,(h,c) = self.lstm(x,(h,c))
x = self.softmax(self.linear(x))
return x,h,c
#create network, optimizer and criterion
rnn = RNN().cuda()
optimizer = torch.optim.RMSprop(rnn.parameters(),lr=0.01)
criterion = nn.CrossEntropyLoss()
#training loop
plt.ion()
losses = []
loss_sum = 0
for epoch in range(10000):
#generate input and target filled with zeros
input = numpy.zeros((seq_len,batch_size,len(chars)))
target = numpy.zeros((seq_len,batch_size))
for batch in range(batch_size):
#choose random starting index in text
start = random.randrange(len(text)-seq_len-1)
#generate sequences for that batch filled with zeros
input_seq = numpy.zeros((seq_len+1,len(chars)))
target_seq = numpy.zeros((seq_len+1))
for i,char in enumerate(text[start:start+seq_len+1]):
#convert character to index
idx = char_to_idx[char]
#set value of index to one (one-hot-encoding)
input_seq[i,idx] = 1
#set value to index (only label)
target_seq[i] = idx
#insert sequences into input and target
input[:,batch,:] = input_seq[:-1]
target[:,batch] = target_seq[1:]
#convert input and target from numpy array to pytorch tensor on gpu
input = torch.from_numpy(input).float().cuda()
target = torch.from_numpy(target).long().cuda()
#initialize hidden state and cell state to zero
h0 = torch.zeros(layers,batch_size,hidden_size).cuda()
c0 = torch.zeros(layers,batch_size,hidden_size).cuda()
#run the network on the input
output,h,c = rnn(input,h0,c0)
#calculate loss and perform gradient descent
optimizer.zero_grad()
loss = criterion(output.view(-1,len(chars)),target.view(-1))
loss.backward()
optimizer.step()
Plot of the loss with original hyperparameters:
Example of target and output after training:
Target: can bring this instrument of honour
again into his native quarter, be magnanimous in the enterprise,
and go on; I will grace the attempt for a worthy e
Output:
Plot of the loss with hidden size of 128 over 20000 epochs (best results):

I later finally found a way to achive something close to real sentences, maybe it will help someone. Here is an example result:
-I have not seen him and the prince was a signt of the streme of the sumpering of the property of th
In my case the important change was to not inizialize the hidden and cell state to zero every batch but only every epoch. For this to work I had to rewrite the batch generator, so that it produces batches following on each other.

Related

Finding patterns in time series with PyTorch

I started PyTorch with image recognition. Now I want to test (very basically) with pure NumPy arrays. I struggle with getting the setup to work, so basically I have vectors with values between 0 and 1 (normalized curves). Those vectors are always of length 1500 and I want to find e.g. "high values at the beginning" or "sine wave-like function", "convex", "concave" etc. stuff like that, so just shapes of those curves.
My training set consists of many vectors with their classes; I have chosen 7 classes. The net should be trained to classify a vector into one or more of those 7 classes (not one hot).
I'm struggling with multiple issues, but first my very basic Net
class Net(nn.Module):
def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
super(Net, self).__init__()
self.hidden_dim = hidden_dim
self.layer_dim = layer_dim
self.rnn = nn.RNN(input_dim, hidden_dim, layer_dim)
self.fc = nn.Linear(self.hidden_dim, output_dim)
def forward(self, x):
h0 = torch.zeros(self.layer_dim, x.size(1), self.hidden_dim).requires_grad_()
out, h0 = self.rnn(x, h0.detach())
out = out[:, -1, :]
out = self.fc(out)
return out
network = Net(1500, 70, 20, 7)
optimizer = optim.SGD(network.parameters(), lr=learning_rate, momentum=momentum)
This is just a copy-paste from an RNN demo. Here is my first issue. Is an RNN the right choice? It is a time series, but then again it is an image recognition problem when plotting the curve.
Now, this here is an attempt to batch the data. The data object contains all training curves together with the correct classifiers.
def train(epoch):
network.train()
network.float()
batching = True
index = 0
# monitor the cummulative loss for an epoch
cummloss = []
# start batching some curves
while batching:
optimizer.zero_grad()
# here I start clustering come curves to a batch and normalize the curves
_input = []
batch_size = min(len(data)-1, index+batch_size_train) - index
for d in data[index:min(len(data)-1, index+batch_size_train)]:
y = np.array(d['data']['y'], dtype='d')
y = np.multiply(y, y.max())
y = y[0:1500]
y = np.pad(y, (0, max(1500-len(y), 0)), 'edge')
if len(_input) == 0:
_input = y
else:
_input = np.vstack((_input, y))
input = torch.from_numpy(_input).float()
input = torch.reshape(input, (1, batch_size, len(y)))
target = np.zeros((1,7))
# the correct classes have indizes, to I create a vector with 1 at the correct locations
for _index in np.array(d['classifier']):
target[0,_index-1] = 1
target = torch.from_numpy(target)
# get the result form the network
output = network(input)
# is this a good loss function?
loss = F.l1_loss(output, target)
loss.backward()
cummloss.append(loss.item())
optimizer.step()
index = index + batch_size_train
if index > len(data):
print(np.mean(cummloss))
batching = False
for e in range(1, n_epochs):
print('Epoch: ' + str(e))
train(0)
The problem I'm facing right now is, the loss doesn't change very little, even with hundreds of epochs.
Are there existing examples of this kind of problem? I didn't find any, just pure png/jpg image recognition. When I convert the curves to png then I have a little issue to train a net, I took densenet and it worked just fine but it seems to be super overkill for this simple task.

This is just a copy-paste from an RNN demo. Here is my first issue. Is an RNN the right choice?
In theory what model you choose does not matter as much as "How" you formulate your problem.
But in your case the most obvious limitation you're going to face is your sequence length: 1500. RNN store information across steps and typically runs into trouble over long sequence with vanishing or exploding gradient.
LSTM net have been developed to circumvent this limitations with memory cell, but even then in the case of long sequence it will still be limited by the amount of information stored in the cell.
You could try using a CNN network as well and think of it as an image.
Are there existing examples of this kind of problem?
I don't know but I might have some suggestions : If I understood your problem correctly, you're going from a (1500, 1) input to a (7,1) output, where 6 of the 7 positions are 0 except for the corresponding class where it's 1.
I don't see any activation function, usually when dealing with multi class you don't use the output of the dense layer to compute the loss you apply a normalizing function like softmax and then you can compute the loss.

From your description of features you have in the form of sin like structures, the closes thing that comes to mind is frequency domain. As such, if you have and input image, just transform it to the frequency domain by a Fourier transform and use that as your feature input.
Might be best to look for such projects on the internet, one such project that you might want to read the research paper or video from this group (they have some jupyter notebooks for you to try) or any similar works. They use the furrier features, that go though a multi layer perceptron (MLP).
I am not sure what exactly you want to do, but seems like a classification task, you would use RNN if you want your neural network to work with a sequence. To me it seems like the 1500 dimensions are independent, and as such can be just treated as input.
Regarding the last layer, for a classification problem it usually is a probability distribution obtained by applying softmax (if only the classification is distinct - i.e. probability sums up to 1), in which, given an input, the net gives a probability of it being from each class. If we are predicting multiple classes we are going to use sigmoid as the last layer of the neural network.
Regarding your loss, there are many losses you can try and see if they are better. Once again, for different features you have to know what exactly is the measurement of distance (a.k.a. how different 2 things are). Check out this website, or just any loss function explanations on the net.
So you should try a simple MLP on top of fourier features as a starting point, assuming that is your feature vector.

Image Recognition is different from Time-Series data. In the imaging domain your data-set might have more similarity with problems like Activity-Recognition, Video-Recognition which have temporal component. So, I'd recommend looking into some models for those.
As for the current model, I'd recommend using LSTM instead of RNN. And also for classification you need to use an activation function in your final layer. This should softmax with cross entropy based loss or sigmoid with MSE loss.
Keras has a Timedistributed model which makes it easy to handle time components. You can use a similar approach with Pytorch by applying linear layers followed by LSTM.
Look into these for better undertsanding ::
Activity Recognition : https://www.narayanacharya.com/vision/2019-12-30-Action-Recognition-Using-LSTM
https://discuss.pytorch.org/t/any-pytorch-function-can-work-as-keras-timedistributed/1346
How to implement time-distributed dense (TDD) layer in PyTorch
Activation Function ::
https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html

How to Increase the Scope of Images a Neural Network Can Recognize?

I am working on an image recognition neural network with Pytorch. My goal is to take pictures of handwritten math equations, process them, and use the neural network to recognize each element. I've reached the point where I am able to separate every variable, number, or symbol from the equation, and everything is ready to be sent through the neural network. I've trained my network to recognize numbers quite well (this part was quite easy), but now I want to expand the scope of the neural network to recognizing letters as well as numbers. I loaded handwritten letters along with the numbers into tensors, shuffled the elements, and put them into batches. No matter how I vary my learning rate, my architecture (hidden layers and the number of neurons per layer), or my batch size I cannot get the neural network to recognize letters.
Here is my network architecture and the feed-forward function (you can see I experimented with the number of hidden layers):
class NeuralNetwork(nn.Module):
def __init__(self):
super().__init__()
inputNeurons, hiddenNeurons, outputNeurons = 784, 700, 36
# Create tensors for the weights
self.layerOne = nn.Linear(inputNeurons, hiddenNeurons)
self.layerTwo = nn.Linear(hiddenNeurons, hiddenNeurons)
self.layerThree = nn.Linear(hiddenNeurons, outputNeurons)
#self.layerFour = nn.Linear(hiddenNeurons, outputNeurons)
#self.layerFive = nn.Linear(hiddenNeurons, outputNeurons)
# Create function for Forward propagation
def Forward(self, input):
# Begin Forward propagation
input = torch.sigmoid(self.layerOne(torch.sigmoid(input)))
input = torch.sigmoid(self.layerTwo(input))
input = torch.sigmoid(self.layerThree(input))
#input = torch.sigmoid(self.layerFour(input))
#input = torch.sigmoid(self.layerFive(input))
return input
And this is the training code block (the data is shuffled in a dataloader, the ground truths are shuffled in the same order, batch size is 10, total number letter and number data points is 244800):
neuralNet = NeuralNetwork()
params = list(neuralNet.parameters())
criterion = nn.MSELoss()
print(neuralNet)
dataSet = next(iter(imageDataLoader))
groundTruth = next(iter(groundTruthsDataLoader))
for i in range(15):
for k in range(24480):
neuralNet.zero_grad()
prediction = neuralNet.Forward(dataSet)
loss = criterion(prediction, groundTruth)
loss.backward()
for layer in range(len(params)):
# Updating the weights of the neural network
params[layer].data.sub_(params[layer].grad.data * learningRate)
Thanks for the help in advance!

First thing i would recommend is writing a clean Pytorch code
For eg.
if i see your NeuralNetwork class it should have forward method (f in lower case),
so that you wont call it using prediction = neuralNet.Forward(dataSet). Reason being your hooks from neural network does not get dispatched if you use prediction = neuralNet.Forward(dataSet).
For more details refer this link
Second thing is : Since your dataset is not balance.....try to use undersampling / oversampling methods, which will be very helpful in your case.

LSTM Autoencoder problems

TLDR:
Autoencoder underfits timeseries reconstruction and just predicts average value.
Question Set-up:
Here is a summary of my attempt at a sequence-to-sequence autoencoder. This image was taken from this paper: https://arxiv.org/pdf/1607.00148.pdf
Encoder: Standard LSTM layer. Input sequence is encoded in the final hidden state.
Decoder: LSTM Cell (I think!). Reconstruct the sequence one element at a time, starting with the last element x[N].
Decoder algorithm is as follows for a sequence of length N:
Get Decoder initial hidden state hs[N]: Just use encoder final hidden state.
Reconstruct last element in the sequence: x[N]= w.dot(hs[N]) + b.
Same pattern for other elements: x[i]= w.dot(hs[i]) + b
use x[i] and hs[i] as inputs to LSTMCell to get x[i-1] and hs[i-1]
Minimum Working Example:
Here is my implementation, starting with the encoder:
class SeqEncoderLSTM(nn.Module):
def __init__(self, n_features, latent_size):
super(SeqEncoderLSTM, self).__init__()
self.lstm = nn.LSTM(
n_features,
latent_size,
batch_first=True)
def forward(self, x):
_, hs = self.lstm(x)
return hs
Decoder class:
class SeqDecoderLSTM(nn.Module):
def __init__(self, emb_size, n_features):
super(SeqDecoderLSTM, self).__init__()
self.cell = nn.LSTMCell(n_features, emb_size)
self.dense = nn.Linear(emb_size, n_features)
def forward(self, hs_0, seq_len):
x = torch.tensor([])
# Final hidden and cell state from encoder
hs_i, cs_i = hs_0
# reconstruct first element with encoder output
x_i = self.dense(hs_i)
x = torch.cat([x, x_i])
# reconstruct remaining elements
for i in range(1, seq_len):
hs_i, cs_i = self.cell(x_i, (hs_i, cs_i))
x_i = self.dense(hs_i)
x = torch.cat([x, x_i])
return x
Bringing the two together:
class LSTMEncoderDecoder(nn.Module):
def __init__(self, n_features, emb_size):
super(LSTMEncoderDecoder, self).__init__()
self.n_features = n_features
self.hidden_size = emb_size
self.encoder = SeqEncoderLSTM(n_features, emb_size)
self.decoder = SeqDecoderLSTM(emb_size, n_features)
def forward(self, x):
seq_len = x.shape[1]
hs = self.encoder(x)
hs = tuple([h.squeeze(0) for h in hs])
out = self.decoder(hs, seq_len)
return out.unsqueeze(0)
And here's my training function:
def train_encoder(model, epochs, trainload, testload=None, criterion=nn.MSELoss(), optimizer=optim.Adam, lr=1e-6, reverse=False):
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Training model on {device}')
model = model.to(device)
opt = optimizer(model.parameters(), lr)
train_loss = []
valid_loss = []
for e in tqdm(range(epochs)):
running_tl = 0
running_vl = 0
for x in trainload:
x = x.to(device).float()
opt.zero_grad()
x_hat = model(x)
if reverse:
x = torch.flip(x, [1])
loss = criterion(x_hat, x)
loss.backward()
opt.step()
running_tl += loss.item()
if testload is not None:
model.eval()
with torch.no_grad():
for x in testload:
x = x.to(device).float()
loss = criterion(model(x), x)
running_vl += loss.item()
valid_loss.append(running_vl / len(testload))
model.train()
train_loss.append(running_tl / len(trainload))
return train_loss, valid_loss
Data:
Large dataset of events scraped from the news (ICEWS). Various categories exist that describe each event. I initially one-hot encoded these variables, expanding the data to 274 dimensions. However, in order to debug the model, I've cut it down to a single sequence that is 14 timesteps long and only contains 5 variables. Here is the sequence I'm trying to overfit:
tensor([[0.5122, 0.0360, 0.7027, 0.0721, 0.1892],
[0.5177, 0.0833, 0.6574, 0.1204, 0.1389],
[0.4643, 0.0364, 0.6242, 0.1576, 0.1818],
[0.4375, 0.0133, 0.5733, 0.1867, 0.2267],
[0.4838, 0.0625, 0.6042, 0.1771, 0.1562],
[0.4804, 0.0175, 0.6798, 0.1053, 0.1974],
[0.5030, 0.0445, 0.6712, 0.1438, 0.1404],
[0.4987, 0.0490, 0.6699, 0.1536, 0.1275],
[0.4898, 0.0388, 0.6704, 0.1330, 0.1579],
[0.4711, 0.0390, 0.5877, 0.1532, 0.2201],
[0.4627, 0.0484, 0.5269, 0.1882, 0.2366],
[0.5043, 0.0807, 0.6646, 0.1429, 0.1118],
[0.4852, 0.0606, 0.6364, 0.1515, 0.1515],
[0.5279, 0.0629, 0.6886, 0.1514, 0.0971]], dtype=torch.float64)
And here is the custom Dataset class:
class TimeseriesDataSet(Dataset):
def __init__(self, data, window, n_features, overlap=0):
super().__init__()
if isinstance(data, (np.ndarray)):
data = torch.tensor(data)
elif isinstance(data, (pd.Series, pd.DataFrame)):
data = torch.tensor(data.copy().to_numpy())
else:
raise TypeError(f"Data should be ndarray, series or dataframe. Found {type(data)}.")
self.n_features = n_features
self.seqs = torch.split(data, window)
def __len__(self):
return len(self.seqs)
def __getitem__(self, idx):
try:
return self.seqs[idx].view(-1, self.n_features)
except TypeError:
raise TypeError("Dataset only accepts integer index/slices, not lists/arrays.")
Problem:
The model only learns the average, no matter how complex I make the model or now long I train it.
Predicted/Reconstruction:
Actual:
My research:
This problem is identical to the one discussed in this question: LSTM autoencoder always returns the average of the input sequence
The problem in that case ended up being that the objective function was averaging the target timeseries before calculating loss. This was due to some broadcasting errors because the author didn't have the right sized inputs to the objective function.
In my case, I do not see this being the issue. I have checked and double checked that all of my dimensions/sizes line up. I am at a loss.
Other Things I've Tried
I've tried this with varied sequence lengths from 7 timesteps to 100 time steps.
I've tried with varied number of variables in the time series. I've tried with univariate all the way to all 274 variables that the data contains.
I've tried with various reduction parameters on the nn.MSELoss module. The paper calls for sum, but I've tried both sum and mean. No difference.
The paper calls for reconstructing the sequence in reverse order (see graphic above). I have tried this method using the flipud on the original input (after training but before calculating loss). This makes no difference.
I tried making the model more complex by adding an extra LSTM layer in the encoder.
I've tried playing with the latent space. I've tried from 50% of the input number of features to 150%.
I've tried overfitting a single sequence (provided in the Data section above).
Question:
What is causing my model to predict the average and how do I fix it?

Okay, after some debugging I think I know the reasons.
TLDR
You try to predict next timestep value instead of difference between current timestep and the previous one
Your hidden_features number is too small making the model unable to fit even a single sample
Analysis
Code used
Let's start with the code (model is the same):
import seaborn as sns
import matplotlib.pyplot as plt
def get_data(subtract: bool = False):
# (1, 14, 5)
input_tensor = torch.tensor(
[
[0.5122, 0.0360, 0.7027, 0.0721, 0.1892],
[0.5177, 0.0833, 0.6574, 0.1204, 0.1389],
[0.4643, 0.0364, 0.6242, 0.1576, 0.1818],
[0.4375, 0.0133, 0.5733, 0.1867, 0.2267],
[0.4838, 0.0625, 0.6042, 0.1771, 0.1562],
[0.4804, 0.0175, 0.6798, 0.1053, 0.1974],
[0.5030, 0.0445, 0.6712, 0.1438, 0.1404],
[0.4987, 0.0490, 0.6699, 0.1536, 0.1275],
[0.4898, 0.0388, 0.6704, 0.1330, 0.1579],
[0.4711, 0.0390, 0.5877, 0.1532, 0.2201],
[0.4627, 0.0484, 0.5269, 0.1882, 0.2366],
[0.5043, 0.0807, 0.6646, 0.1429, 0.1118],
[0.4852, 0.0606, 0.6364, 0.1515, 0.1515],
[0.5279, 0.0629, 0.6886, 0.1514, 0.0971],
]
).unsqueeze(0)
if subtract:
initial_values = input_tensor[:, 0, :]
input_tensor -= torch.roll(input_tensor, 1, 1)
input_tensor[:, 0, :] = initial_values
return input_tensor
if __name__ == "__main__":
torch.manual_seed(0)
HIDDEN_SIZE = 10
SUBTRACT = False
input_tensor = get_data(SUBTRACT)
model = LSTMEncoderDecoder(input_tensor.shape[-1], HIDDEN_SIZE)
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.MSELoss()
for i in range(1000):
outputs = model(input_tensor)
loss = criterion(outputs, input_tensor)
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"{i}: {loss}")
if loss < 1e-4:
break
# Plotting
sns.lineplot(data=outputs.detach().numpy().squeeze())
sns.lineplot(data=input_tensor.detach().numpy().squeeze())
plt.show()
What it does:
get_data either works on the data your provided if subtract=False or (if subtract=True) it subtracts value of the previous timestep from the current timestep
Rest of the code optimizes the model until 1e-4 loss reached (so we can compare how model's capacity and it's increase helps and what happens when we use the difference of timesteps instead of timesteps)
We will only vary HIDDEN_SIZE and SUBTRACT parameters!
NO SUBTRACT, SMALL MODEL
HIDDEN_SIZE=5
SUBTRACT=False
In this case we get a straight line. Model is unable to fit and grasp the phenomena presented in the data (hence flat lines you mentioned).
1000 iterations limit reached
SUBTRACT, SMALL MODEL
HIDDEN_SIZE=5
SUBTRACT=True
Targets are now far from flat lines, but model is unable to fit due to too small capacity.
1000 iterations limit reached
NO SUBTRACT, LARGER MODEL
HIDDEN_SIZE=100
SUBTRACT=False
It got a lot better and our target was hit after 942 steps. No more flat lines, model capacity seems quite fine (for this single example!)
SUBTRACT, LARGER MODEL
HIDDEN_SIZE=100
SUBTRACT=True
Although the graph does not look that pretty, we got to desired loss after only 215 iterations.
Finally
Usually use difference of timesteps instead of timesteps (or some other transformation, see here for more info about that). In other cases, neural network will try to simply... copy output from the previous step (as that's the easiest thing to do). Some minima will be found this way and going out of it will require more capacity.
When you use the difference between timesteps there is no way to "extrapolate" the trend from previous timestep; neural network has to learn how the function actually varies
Use larger model (for the whole dataset you should try something like 300 I think), but you can simply tune that one.
Don't use flipud. Use bidirectional LSTMs, in this way you can get info from forward and backward pass of LSTM (not to confuse with backprop!). This also should boost your score
Questions
Okay, question 1: You are saying that for variable x in the time
series, I should train the model to learn x[i] - x[i-1] rather than
the value of x[i]? Am I correctly interpreting?
Yes, exactly. Difference removes the urge of the neural network to base it's predictions on the past timestep too much (by simply getting last value and maybe changing it a little)
Question 2: You said my calculations for zero bottleneck were
incorrect. But, for example, let's say I'm using a simple dense
network as an auto encoder. Getting the right bottleneck indeed
depends on the data. But if you make the bottleneck the same size as
the input, you get the identity function.
Yes, assuming that there is no non-linearity involved which makes the thing harder (see here for similar case). In case of LSTMs there are non-linearites, that's one point.
Another one is that we are accumulating timesteps into single encoder state. So essentially we would have to accumulate timesteps identities into a single hidden and cell states which is highly unlikely.
One last point, depending on the length of sequence, LSTMs are prone to forgetting some of the least relevant information (that's what they were designed to do, not only to remember everything), hence even more unlikely.
Is num_features * num_timesteps not a bottle neck of the same size as
the input, and therefore shouldn't it facilitate the model learning
the identity?
It is, but it assumes you have num_timesteps for each data point, which is rarely the case, might be here. About the identity and why it is hard to do with non-linearities for the network it was answered above.
One last point, about identity functions; if they were actually easy to learn, ResNets architectures would be unlikely to succeed. Network could converge to identity and make "small fixes" to the output without it, which is not the case.
I'm curious about the statement : "always use difference of timesteps
instead of timesteps" It seem to have some normalizing effect by
bringing all the features closer together but I don't understand why
this is key ? Having a larger model seemed to be the solution and the
substract is just helping.
Key here was, indeed, increasing model capacity. Subtraction trick depends on the data really. Let's imagine an extreme situation:
We have 100 timesteps, single feature
Initial timestep value is 10000
Other timestep values vary by 1 at most
What the neural network would do (what is the easiest here)? It would, probably, discard this 1 or smaller change as noise and just predict 1000 for all of them (especially if some regularization is in place), as being off by 1/1000 is not much.
What if we subtract? Whole neural network loss is in the [0, 1] margin for each timestep instead of [0, 1001], hence it is more severe to be wrong.
And yes, it is connected to normalization in some sense come to think about it.

Neural network versus random forest performance discrepancy

I want to run some experiments with neural networks using PyTorch, so I tried a simple one as a warm-up exercise, and I cannot quite make sense of the results.
The exercise attempts to predict the rating of 1000 TPTP problems from various statistics about the problems such as number of variables, maximum clause length etc. Data file https://github.com/russellw/ml/blob/master/test.csv is quite straightforward, 1000 rows, the final column is the rating, started off with some tens of input columns, with all the numbers scaled to the range 0-1, I progressively deleted features to see if the result still held, and it does, all the way down to one input column; the others are in previous versions in Git history.
I started off using separate training and test sets, but have set aside the test set for the moment, because the question about whether training performance generalizes to testing, doesn't arise until training performance has been obtained in the first place.
Simple linear regression on this data set has a mean squared error of about 0.14.
I implemented a simple feedforward neural network, code in https://github.com/russellw/ml/blob/master/test_nn.py and copied below, that after a couple hundred training epochs, also has an mean squared error of 0.14.
So I tried changing the number of hidden layers from 1 to 2 to 3, using a few different optimizers, tweaking the learning rate, switching the activation functions from relu to tanh to a mixture of both, increasing the number of epochs to 5000, increasing the number of hidden units to 1000. At this point, it should easily have had the ability to just memorize the entire data set. (At this point I'm not concerned about overfitting. I'm just trying to get the mean squared error on training data to be something other than 0.14.) Nothing made any difference. Still 0.14. I would say it must be stuck in a local optimum, but that's not supposed to happen when you've got a couple million weights; it's supposed to be practically impossible to be in a local optimum for all parameters simultaneously. And I do get slightly different sequences of numbers on each run. But it always converges to 0.14.
Now the obvious conclusion would be that 0.14 is as good as it gets for this problem, except that it stays the same even when the network has enough memory to just memorize all the data. But the clincher is that I also tried a random forest, https://github.com/russellw/ml/blob/master/test_rf.py
... and the random forest has a mean squared error of 0.01 on the original data set, degrading gracefully as features are deleted, still 0.05 on the data with just one feature.
Nowhere in the lore of machine learning is it said 'random forests vastly outperform neural nets', so I'm presumably doing something wrong, but I can't see what it is. Maybe it's something as simple as just missing a flag or something you need to set in PyTorch. I would appreciate it if someone could take a look.
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
# data
df = pd.read_csv("test.csv")
print(df)
print()
# separate the output column
y_name = df.columns[-1]
y_df = df[y_name]
X_df = df.drop(y_name, axis=1)
# numpy arrays
X_ar = np.array(X_df, dtype=np.float32)
y_ar = np.array(y_df, dtype=np.float32)
# torch tensors
X_tensor = torch.from_numpy(X_ar)
y_tensor = torch.from_numpy(y_ar)
# hyperparameters
in_features = X_ar.shape[1]
hidden_size = 100
out_features = 1
epochs = 500
# model
class Net(nn.Module):
def __init__(self, hidden_size):
super(Net, self).__init__()
self.L0 = nn.Linear(in_features, hidden_size)
self.N0 = nn.ReLU()
self.L1 = nn.Linear(hidden_size, hidden_size)
self.N1 = nn.Tanh()
self.L2 = nn.Linear(hidden_size, hidden_size)
self.N2 = nn.ReLU()
self.L3 = nn.Linear(hidden_size, 1)
def forward(self, x):
x = self.L0(x)
x = self.N0(x)
x = self.L1(x)
x = self.N1(x)
x = self.L2(x)
x = self.N2(x)
x = self.L3(x)
return x
model = Net(hidden_size)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
# train
print("training")
for epoch in range(1, epochs + 1):
# forward
output = model(X_tensor)
cost = criterion(output, y_tensor)
# backward
optimizer.zero_grad()
cost.backward()
optimizer.step()
# print progress
if epoch % (epochs // 10) == 0:
print(f"{epoch:6d} {cost.item():10f}")
print()
output = model(X_tensor)
cost = criterion(output, y_tensor)
print("mean squared error:", cost.item())

can you please print the shape of your input ?
I would say check those things first:
that your target y have the shape (-1, 1) I don't know if pytorch throws an Error in this case. you can use y.reshape(-1, 1) if it isn't 2 dim
your learning rate is high. usually when using Adam the default value is good enough or try simply to lower your learning rate. 0.1 is a high value for a learning rate to start with
place the optimizer.zero_grad at the first line inside the for loop
normalize/standardize your data ( this is usually good for NNs )
remove outliers in your data (my opinion: I think this can't affect Random forest so much but it can affect NNs badly)
use cross validation (maybe skorch can help you here. It's a scikit learn wrapper for pytorch and easy to use if you know keras)
Notice that Random forest regressor or any other regressor can outperform neural nets in some cases. There is some fields where neural nets are the heros like Image Classification or NLP but you need to be aware that a simple regression algorithm can outperform them. Usually when your data is not big enough.

How can we train a deep learning model for input of any length?

I have created a deep learning model, using keras. I am training the model on comments of different lengths. I pre-processed the input data, and padded it to equal length. While predicting, if I give an input, with length larger then that of my training inputs, model fails. What should I do? Kindly share some examples as well that will be really helpful.
latent_dim = 1024
inpt = Input(shape=(max_len,)) ### max_len is the final length
emb_layer = Embedding(vocabulary_size, 100, weights=[embedding_matrix], trainable = False)(inpt)
bdlstm1 = Bidirectional(LSTM(64, return_sequences = True))(emb_layer)
dns2 = Dense(no_labels, activation='softmax')(bdlstm1)
model = Model(inpt, dns2)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x,y)

I think that you'll find a satisfying answer here:
https://ai.stackexchange.com/questions/2008/how-can-neural-networks-deal-with-varying-input-sizes
The best (and only simple) way is to use RNN.
Good luck !

Theoretically the limitation is given by the fully-connected layer/s at the end of the deep net, having all its units the need of be feeded.
Due to this limitation, we must fix the length normally a priori (max number of words in NLP). As I've seen in your code, you're using a 100-size embedding layer, so for every sentence that doesn't reach the minimum size you have to add 100 zeros for every missing word.
For longer sentences (as is the case you are commenting on) you can do the following, but you should know that you will lose information since words that exceed will not be taken into account for the prediction.
if max_len < len(sentence.split()):
# add zero padding until reach max_len
else:
inputSentence = " ".join(sentence.split()[:max_len]) # no. words = max_len

you used LSTM whis is a kind of RNN and it requires max time steps so you must limit upper bound of text.
if new input longer than max_len
clip it to a max_len
or use sliding window and make prediction for each of them and use average of them

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.