A Classifier Network Seems to be "Forgetting" older samples

A Classifier Network Seems to be "Forgetting" older samples - python

This is a strange problem: Imagine a neural network classifier. It is a simple linear layer followed by a sigmoid activation that has an input size of 64, and an output size of 112. There also are 112 training samples, where I expect the output to be a one-hot vector. So the basic structure of a training loop is as follows, where samples is a list of integer indices:
model = nn.Sequential(nn.Linear(64,112),nn.Sequential())
loss_fn = nn.BCELoss()
optimizer = optim.AdamW(model.parameters(),lr=3e-4)
for epoch in range(500):
for input_state, index in samples:
one_hot = torch.zeros(112).float()
one_hot[index] = 1.0
optimizer.zero_grad()
prediction = model(input_state)
loss = loss_fn(prediction,one_hot)
loss.backward()
optimizer.step()
This model does not perform well, but I don't think it's a problem with the model itself, but rather how it's trained. I think that this is happening because for the most part, all of the one_hot tensor is zeros, that the model just tends to gravitate toward all of the outputs being zeros, which is what's happening. The question becomes: "How does this get solved?" I tried using the average loss with all the samples, to no avail. So what do I do?

So this is very embarrassing, but the answer actually lies in how I process my data. This is a text-input project, so I used basic python lists to create blocks of messages, but when I did this, I ended up making it so that all of the inputs the net got were the same, but the output was different every time. I solved tho s problem with the copy method.

Related

Finding patterns in time series with PyTorch

I started PyTorch with image recognition. Now I want to test (very basically) with pure NumPy arrays. I struggle with getting the setup to work, so basically I have vectors with values between 0 and 1 (normalized curves). Those vectors are always of length 1500 and I want to find e.g. "high values at the beginning" or "sine wave-like function", "convex", "concave" etc. stuff like that, so just shapes of those curves.
My training set consists of many vectors with their classes; I have chosen 7 classes. The net should be trained to classify a vector into one or more of those 7 classes (not one hot).
I'm struggling with multiple issues, but first my very basic Net
class Net(nn.Module):
def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
super(Net, self).__init__()
self.hidden_dim = hidden_dim
self.layer_dim = layer_dim
self.rnn = nn.RNN(input_dim, hidden_dim, layer_dim)
self.fc = nn.Linear(self.hidden_dim, output_dim)
def forward(self, x):
h0 = torch.zeros(self.layer_dim, x.size(1), self.hidden_dim).requires_grad_()
out, h0 = self.rnn(x, h0.detach())
out = out[:, -1, :]
out = self.fc(out)
return out
network = Net(1500, 70, 20, 7)
optimizer = optim.SGD(network.parameters(), lr=learning_rate, momentum=momentum)
This is just a copy-paste from an RNN demo. Here is my first issue. Is an RNN the right choice? It is a time series, but then again it is an image recognition problem when plotting the curve.
Now, this here is an attempt to batch the data. The data object contains all training curves together with the correct classifiers.
def train(epoch):
network.train()
network.float()
batching = True
index = 0
# monitor the cummulative loss for an epoch
cummloss = []
# start batching some curves
while batching:
optimizer.zero_grad()
# here I start clustering come curves to a batch and normalize the curves
_input = []
batch_size = min(len(data)-1, index+batch_size_train) - index
for d in data[index:min(len(data)-1, index+batch_size_train)]:
y = np.array(d['data']['y'], dtype='d')
y = np.multiply(y, y.max())
y = y[0:1500]
y = np.pad(y, (0, max(1500-len(y), 0)), 'edge')
if len(_input) == 0:
_input = y
else:
_input = np.vstack((_input, y))
input = torch.from_numpy(_input).float()
input = torch.reshape(input, (1, batch_size, len(y)))
target = np.zeros((1,7))
# the correct classes have indizes, to I create a vector with 1 at the correct locations
for _index in np.array(d['classifier']):
target[0,_index-1] = 1
target = torch.from_numpy(target)
# get the result form the network
output = network(input)
# is this a good loss function?
loss = F.l1_loss(output, target)
loss.backward()
cummloss.append(loss.item())
optimizer.step()
index = index + batch_size_train
if index > len(data):
print(np.mean(cummloss))
batching = False
for e in range(1, n_epochs):
print('Epoch: ' + str(e))
train(0)
The problem I'm facing right now is, the loss doesn't change very little, even with hundreds of epochs.
Are there existing examples of this kind of problem? I didn't find any, just pure png/jpg image recognition. When I convert the curves to png then I have a little issue to train a net, I took densenet and it worked just fine but it seems to be super overkill for this simple task.

This is just a copy-paste from an RNN demo. Here is my first issue. Is an RNN the right choice?
In theory what model you choose does not matter as much as "How" you formulate your problem.
But in your case the most obvious limitation you're going to face is your sequence length: 1500. RNN store information across steps and typically runs into trouble over long sequence with vanishing or exploding gradient.
LSTM net have been developed to circumvent this limitations with memory cell, but even then in the case of long sequence it will still be limited by the amount of information stored in the cell.
You could try using a CNN network as well and think of it as an image.
Are there existing examples of this kind of problem?
I don't know but I might have some suggestions : If I understood your problem correctly, you're going from a (1500, 1) input to a (7,1) output, where 6 of the 7 positions are 0 except for the corresponding class where it's 1.
I don't see any activation function, usually when dealing with multi class you don't use the output of the dense layer to compute the loss you apply a normalizing function like softmax and then you can compute the loss.

From your description of features you have in the form of sin like structures, the closes thing that comes to mind is frequency domain. As such, if you have and input image, just transform it to the frequency domain by a Fourier transform and use that as your feature input.
Might be best to look for such projects on the internet, one such project that you might want to read the research paper or video from this group (they have some jupyter notebooks for you to try) or any similar works. They use the furrier features, that go though a multi layer perceptron (MLP).
I am not sure what exactly you want to do, but seems like a classification task, you would use RNN if you want your neural network to work with a sequence. To me it seems like the 1500 dimensions are independent, and as such can be just treated as input.
Regarding the last layer, for a classification problem it usually is a probability distribution obtained by applying softmax (if only the classification is distinct - i.e. probability sums up to 1), in which, given an input, the net gives a probability of it being from each class. If we are predicting multiple classes we are going to use sigmoid as the last layer of the neural network.
Regarding your loss, there are many losses you can try and see if they are better. Once again, for different features you have to know what exactly is the measurement of distance (a.k.a. how different 2 things are). Check out this website, or just any loss function explanations on the net.
So you should try a simple MLP on top of fourier features as a starting point, assuming that is your feature vector.

Image Recognition is different from Time-Series data. In the imaging domain your data-set might have more similarity with problems like Activity-Recognition, Video-Recognition which have temporal component. So, I'd recommend looking into some models for those.
As for the current model, I'd recommend using LSTM instead of RNN. And also for classification you need to use an activation function in your final layer. This should softmax with cross entropy based loss or sigmoid with MSE loss.
Keras has a Timedistributed model which makes it easy to handle time components. You can use a similar approach with Pytorch by applying linear layers followed by LSTM.
Look into these for better undertsanding ::
Activity Recognition : https://www.narayanacharya.com/vision/2019-12-30-Action-Recognition-Using-LSTM
https://discuss.pytorch.org/t/any-pytorch-function-can-work-as-keras-timedistributed/1346
How to implement time-distributed dense (TDD) layer in PyTorch
Activation Function ::
https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html

how to throw some of samples within a mini-batch during training

I'm using AdamOptimizer to train a simple DNN network.(I'm using version tf1.4).
And I want to throw away some bad samples within a batch during training. Say I have 4096 samples within a batch, and I want to throw 96 samples away and only use the remaining 4000 samples to calculate loss and do backpropagation.
How can I achieve this?
The code set up is very straightforward like below:
lables = tf.reshape(labels, [batch_size, 1])
logits = tf.reshape(logits, [batch_size, 1])
loss_vector = tf.nn.sigmoid_cross_entropy_with_logits(multi_class_labels=labels,
logits=logits)
loss_scalar = tf.reduce_mean(loss_vector)
opt = tf.train.AdamOptimizer(learning_rate=learning_rate)
train_op = opt.minimize(loss_scalar, global_step=global_step)
one possible solution is to do a mask operation after loss_vector and before reduce_mean. But I'm not sure if it's the right solution and I have some questions about what's going on underhood:
in the minimize() operation, since the input parameter is a scalar, how will it know that some of input samples are masked out?
in the minimize() operation, how many times of backpropagation will happen?
during training, samples are feed in the graph one by one, how can TF know that which should be kept and which should throw away?

Approximation of funtion with multi-dimensional output using a keras neural network

As part of a project for my studies I want to try and approximate a function f:R^m -> R^n using a Keras neural network (to which I am completely new). The network seems to be learning to some (indeed unsatisfactory) point. But the predictions of the network don't resemble the expected results in the slightest.
I have two numpy-arrays containing the training-data (the m-dimensional input for the function) and the training-labels (the n-dimensional expected output of the function). I use them for training my Keras model (see below), which seems to be learning on the provided data.
inputs = Input(shape=(m,))
hidden = Dense(100, activation='sigmoid')(inputs)
hidden = Dense(80, activation='sigmoid')(hidden)
outputs = Dense(n, activation='softmax')(hidden)
opti = tf.keras.optimizers.Adam(lr=0.001)
model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=opti,
loss='poisson',
metrics=['accuracy'])
model.fit(training_data, training_labels, verbose = 2, batch_size=32, epochs=30)
When I call the evaluate-method on my model with a set of test-data and a set of test-labels, I get an apparent accuracy of more than 50%. However, when I use the predict method, the predictions of the network do not resemble the expected results in the slightest. For example, the first ten entries of the expected output are:
[0., 0.08193582, 0.13141066, 0.13495408, 0.16852582, 0.2154705 ,
0.30517559, 0.32567417, 0.34073457, 0.37453226]
whereas the first ten entries of the predicted results are:
[3.09514281e-09, 2.20849714e-03, 3.84095078e-03, 4.99367528e-03,
6.06226595e-03, 7.18442770e-03, 8.96730460e-03, 1.03423093e-02, 1.16029680e-02, 1.31887039e-02]
Does this have something to do with the metrics I use? Could the results be normalized by Keras in some intransparent way? Have I just used the wrong kind of model for the problem I want to solve? What does 'accuracy' mean anyway?
Thank you in advance for your help, I am new to neural networks and have been stuck with this issue for several days.

The problem is with this line:
outputs = Dense(n, activation='softmax')(hidden)
We use softmax activation only in a classification problem, where we need a probability distribution over the classes as an output of the network. And so softmax makes ensures that the output sums to one and non zero (which is true in your case). But I don't think the problem at hand for you is a classification task, you are just trying to predict ten continuous target varaibles, so use a linear activation function instead. So modify the above line to something like this
outputs = Dense(n, activation='linear')(hidden)

Acc decreasing to zero in LSTM Keras Training

While trying to implement an LSTM network for trajectory classification, I have been struggling to get decent classification results even for simple trajectories. Also, my training accuracy keeps fluctuating without increasing significantly, this can also be seen in tensorboard:
Training accuracy:
This is my model:
model1 = Sequential()
model1.add(LSTM(8, dropout=0.2, return_sequences=True, input_shape=(40,2)))
model1.add(LSTM(8,return_sequences=True))
model1.add(LSTM(8,return_sequences=False))
model1.add(Dense(1, activation='sigmoid'))`
and my training code:
model1.compile(optimizer='adagrad',loss='binary_crossentropy', metrics=['accuracy'])
hist1 = model1.fit(dataScatter[:,70:110,:],outputScatter,validation_split=0.25,epochs=50, batch_size=20, callbacks = [tensorboard], verbose = 2)
I think the problem is probably due to the data input and output shape, since the model itself seems to be fine. The Data input has (2000,40,2) shape and the output has (2000,1) shape.
Can anyone spot a mistake?

Try to change:
model1.add(Dense(1, activation='sigmoid'))`
to:
model1.add(TimeDistributed(Dense(1, activation='sigmoid')))
The TimeDistributed applies the same Dense layer (same weights) to the LSTMs outputs for one time step at a time.
I recommend this tutorial as well https://machinelearningmastery.com/timedistributed-layer-for-long-short-term-memory-networks-in-python/ .

I was able to increase the accuracy to 97% with a few adjustments that were data related. The main obstacle was an unbalanced dataset split for the training and validation set. Further improvements came from normalizing the input trajectories. I also increased the number of cells in the first layer.

TimeDistributed layer and return sequences etc for LSTM in Keras

Sorry I am new to RNN. I have read this post on TimeDistributed layer.
I have reshaped my data in to Keras requried [samples, time_steps, features]: [140*50*19], which means I have 140 data points, each has 50 time steps, and 19 features. My output is shaped [140*50*1]. I care more about the last data point's accuracy. This is a regression problem.
My current code is :
x = Input((None, X_train.shape[-1]) , name='input')
lstm_kwargs = { 'dropout_W': 0.25, 'return_sequences': True, 'consume_less': 'gpu'}
lstm1 = LSTM(64, name='lstm1', **lstm_kwargs)(x)
output = Dense(1, activation='relu', name='output')(lstm1)
model = Model(input=x, output=output)
sgd = SGD(lr=0.00006, momentum=0.8, decay=0, nesterov=False)
optimizer = sgd
model.compile(optimizer=optimizer, loss='mean_squared_error')
My questions are:
My case is many-to-many, so I need to use return_sequences=True? How about if I only need the last time step's prediction, it would be many-to-one. So I need to my output to be [140*1*1] and return_sequences=False?
Is there anyway to enhance my last time points accuracy if I use many-to-many? I care more about it than the other points accuracy.
I have tried to use TimeDistributed layer as
output = TimeDistributed(Dense(1, activation='relu'), name='output')(lstm1)
the performance seems to be worse than without using TimeDistributed layer. Why is this so?
I tried to use optimizer=RMSprop(lr=0.001). I thought RMSprop is supposed to stabilize the NN. But I was never able to get good result using RMSprop.
How do I choose a good lr and momentum for SGD? I have been testing on different combinations manually. Is there a cross validation method in keras?

So:
Yes - return_sequences=False makes your network to output only a last element of sequence prediction.
You could define the output slicing using the Lambda layer. Here you could find an example on how to do this. Having the output sliced you can provide the additional output where you'll feed the values of the last timestep.
From the computational point of view these two approaches are equivalent. Maybe the problem lies in randomness introduced by weight sampling.
Actually - using RMSProp as a first choice for RNNs is a rule of thumb - not a general proved law. Moreover - it is strongly adviced not to change it's parameters. So this might cause the problems. Another thing is that LSTM needs a lot of time to stabalize. Maybe you need to leave it for more epochs. Last thing - is that maybe your data could favour another activation function.
You could use a keras.sklearnWrapper.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.