I have a time-series classification problem with a dataset of 2000 data points. Each data point has 4 time series, each of length 25.
I am using the following LSTM model on this dataset.
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(10, input_shape=(25,4)))
model.add(Dense(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
The above model gave me really bad results (nearly 0.3). This is really upsetting, and now I am looking for ways to improve my results.
I think the problem is with the number of nodes I use in each layer (i.e. 10 and 32 respectively). Is there a way to identify how many nodes we need in each layer? Moreover, is it sufficient to have one LSTM layer and one Dense layer? Do you think that I can improve these layers?
I am happy to provide more details if needed.
I am doing multiclass classification with an LSTM model.
One sample is 20 frames of data, and each frame has 64 infrared signals, so each 20 × 64 matrix is converted into a 1 × 1280 vector (one sample).
There are 1280 nodes in the input layer of the LSTM.
Then I need to build the following LSTM model:
The number of nodes in the hidden layer is 640, and each hidden layer node is connected to a full connection layer with 100 backward nodes, and there is a ReLU activation layer behind the full connection layer. Finally, the softmax activation function is used to normalize the data to obtain the output. Additionally, the timesteps of the LSTM are set to 16.
Here is my attempt to build this architecture according to the instructions above:
embedding_vecor_length = 16
model_1 = Sequential()
model_1.add(Embedding(len(X_train), embedding_vecor_length, input_length=1280))
model_1.add(LSTM(640))
model_1.add(Dense(100, activation='relu'))
model_1.add(Dense(4, activation='softmax'))
model_1.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model_1.summary())
model_1.fit(X_train, y_train, epochs=10, batch_size=10)
I am very confused by the hidden layer of the LSTM and the fully connected layer. According to these instructions, should my fully connected layer be inside the LSTM block? And what does "backward nodes" mean? Also, where do we indicate the timesteps of the LSTM? Could somebody explain, please? Thank you!
The LSTM is itself a fully connected layer, which just happens to maintain a state between calls.
I'm not sure what the term "backward" really means here, but from the structure you show, the fully connected layer simply looks like a non-linear transformation of the LSTM output.
The timesteps of the LSTM are generally indicated by the shape of the input. I may be wrong, but I think in your case you actually don't want to reshape your input the way you do (by flattening the frames into one vector). You have 20 frames with 64 signals each, so really you have 20 timesteps with an input of size 64.
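To make the reshaping concrete, here is a minimal sketch in plain Python. It assumes the 1280 values were flattened frame by frame (all 64 signals of frame 0 first, then frame 1, and so on); `flat_sample` and `to_timesteps` are made-up names for illustration, and the data is a placeholder.

```python
# Hypothetical sample: 1280 values, assumed to be stored frame by frame
# (the 64 signals of frame 0 first, then those of frame 1, and so on).
N_FRAMES, N_SIGNALS = 20, 64
flat_sample = [float(i) for i in range(N_FRAMES * N_SIGNALS)]


def to_timesteps(flat, n_frames, n_signals):
    """Split a flat vector into n_frames rows of n_signals values each."""
    if len(flat) != n_frames * n_signals:
        raise ValueError("length does not match n_frames * n_signals")
    return [flat[t * n_signals:(t + 1) * n_signals] for t in range(n_frames)]


sample = to_timesteps(flat_sample, N_FRAMES, N_SIGNALS)
print(len(sample), len(sample[0]))  # timesteps, features per timestep
```

With each sample in this (timesteps, features) layout, the LSTM layer would be declared with input_shape=(20, 64), in which case the timesteps come from the data shape rather than from a separate setting.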
I am very new to this and I have several questions. I have code snippets of a neural network created in Python with Keras. The model is used for sentiment analysis. A training dataset of labeled data (sentiment = 1 or 0) was used.
Now I have several questions on how to describe the neural network.
model = Sequential()
model.add(Dense(512, input_shape=(max_words,), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(train_x, train_y,
          batch_size=32,
          epochs=5,
          verbose=1,
          validation_split=0.1,
          shuffle=True)
I am not very clear on many of the following terms so don't be too hard on me.
1: Is there anything that makes this a typical model for sentiment analysis?
2: Is it "bag of words"? (My guess is yes, since the data was pre-processed using a tokenizer)
3: Is it "convolutional"?
4: Is it deep?
5: Is it dense - How dense is it?
6: What is the reason for the density(?)-numbers: 512, 256, 2
7: How many layers does it have (input and output layer included/excluded?)
8: Is it supervised / unsupervised?
9: What is the reason behind the three different activation functions 'relu', 'sigmoid', 'softmax' in the used order?
I appreciate any help!
Categorical cross-entropy is the loss function for this neural network. Together with the softmax output layer, which returns probabilities for the different classes, it makes the model usable for sentiment analysis. In your case, you need probabilities for two possible classes (0 or 1).
I am not sure if you are using a tokenizer, since it is not apparent from the code you provided, but if you are, then yes, it is a bag-of-words model. A bag-of-words model essentially stores the words in your text together with their counts, ignoring word order.
From Wikipedia, if the following is your text:
John likes to watch movies. Mary likes movies too.
then, a BoW for this text would be:
{"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};
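The counts above can be reproduced with a few lines of standard-library Python. This is only a sketch: the regular expression is a simplistic stand-in for a real tokenizer, and the sorted `vocabulary` is one arbitrary way to fix the order of the vector entries.

```python
import re
from collections import Counter

text = "John likes to watch movies. Mary likes movies too."

# Simplistic tokenization: keep runs of letters, drop punctuation.
tokens = re.findall(r"[A-Za-z]+", text)

# The bag of words: each distinct word mapped to its number of occurrences.
bow = Counter(tokens)
print(dict(bow))

# A fixed vocabulary order turns the bag into a count vector, which is the
# kind of fixed-length input a Dense layer expects.
vocabulary = sorted(bow)
vector = [bow[word] for word in vocabulary]
print(vocabulary)
print(vector)
```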
The network architecture you are using is not convolutional. Rather, it is a feedforward model, which connects all units from one layer to all the units in the next; each unit computes a dot product between the previous layer's outputs and its own weights.
There is no one accepted definition of a network being deep. But, as a rule of thumb, if a network has more than 2 middle layers (layers excluding the input and output layer), then it can be considered as a deep network.
In the code provided above, Dense refers to the fact that all units in the first layer (512) are connected to every unit in the next layer, i.e., a total of 512x256 connections between the first layer and the second.
Yes, the connections between the 512 units in the first layer and the 256 units in the second layer, resulting in a 512x256 matrix of parameters, make it dense. But the usage of Dense here is more from an API perspective than a semantic one. Similarly, the parameter matrix between the second and third layers would be 256x2.
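The arithmetic behind those matrix sizes can be checked with a small helper. This is a sketch; `dense_params` is a made-up name, and the bias term reflects the fact that a Keras Dense layer adds one bias per unit by default.

```python
def dense_params(n_inputs, n_units, use_bias=True):
    """Number of trainable parameters in a fully connected layer."""
    weights = n_inputs * n_units          # one weight per connection
    biases = n_units if use_bias else 0   # one bias per unit
    return weights + biases


# Connections only, as counted in the text above.
print(dense_params(512, 256, use_bias=False))  # 131072
print(dense_params(256, 2, use_bias=False))    # 512

# Including biases, which is what model.summary() would count.
print(dense_params(512, 256))  # 131328
print(dense_params(256, 2))    # 514
```

The first Dense layer follows the same rule with max_words inputs, so its size depends on the vocabulary chosen during preprocessing.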
If you exclude the input layer (of size max_words) and the output layer (having 2 possible outputs, i.e., 0/1), then your network here has two hidden layers, with 512 and 256 units.
This model is supervised, since the sentiment analysis task has an output (positive or negative) associated with every input data point. You can see this output as a supervisor to the network, telling it whether a data point has a positive or negative sentiment. An unsupervised task does not have an output signal associated with the data points.
The activation functions being used here serve the purpose of providing nonlinearity to the network's computations. In a little more detail, sigmoid has the nice property that its output lies between 0 and 1 and can be interpreted as a probability. So if the network outputs 0.89 for a data point, it would mean that the model evaluates that data point to be positive with a probability of 0.89.
The usage of sigmoid here is probably for teaching purposes: ReLU activation units are generally favored over sigmoid/tanh in hidden layers because of better convergence properties, and I don't see a convincing reason to use sigmoid instead of ReLU.
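For reference, the three activation functions can be written out in plain Python. This is a minimal sketch using their standard definitions, not the Keras implementation, and the input values are arbitrary.

```python
import math


def relu(x):
    """Pass positive values through, clamp negative values to zero."""
    return max(0.0, x)


def sigmoid(x):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


print(relu(-3.0), relu(2.5))
print(sigmoid(0.0))
print(softmax([2.0, 0.0]))
```

With two classes, softmax over the scores [z, 0] gives the same first probability as sigmoid(z), which is why a two-unit softmax output and a one-unit sigmoid output are both common choices for binary sentiment.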
I am trying to use an LSTM for my time-series classification problem as follows. My dataset has about 2000 data points, and each data point has 4 time series of length 25.
model = Sequential()
model.add(LSTM(100, input_shape=(25,4)))
model.add(Dense(50))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
However, the LSTM model works very poorly and gives me very low results. While this is upsetting, I suspect the LSTM gives low results because it is unable to capture some important characteristics of the time series.
In that case, I am wondering if it is possible to give some handcrafted features along with the time series to the model. If so, please let me know how to do it.
I am happy to provide more details if needed.
EDIT:
I am wondering if it is possible to use Keras's functional API for this, so that I can use my features as a separate input.
An LSTM model takes a 3-dimensional tensor as input, with dimensions (batch-size, time-length, num-features).
To answer your question, you will have to concatenate those hand-crafted features with the four raw features that you have, maybe normalize them to bring all of them to the same scale, and pass a tensor of shape (batch-size, time-length, features+x) as input to the LSTM model.
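The concatenation along the feature axis might look like the following sketch for a single sample. It assumes the handcrafted features are per-sample summary values, so they are repeated at every timestep; features that vary over time would instead be appended step by step. The function name, the data, and the two example features are all hypothetical.

```python
def append_static_features(series, static_features):
    """Append the same per-sample features to every timestep.

    series: list of timesteps, each a list of raw values (here 25 x 4).
    static_features: list of x handcrafted values for this sample.
    Returns a new list of shape (timesteps, raw features + x).
    """
    return [list(step) + list(static_features) for step in series]


# Hypothetical sample: 25 timesteps, 4 raw series.
series = [[0.1 * t, 0.2 * t, 0.3 * t, 0.4 * t] for t in range(25)]

# Hypothetical handcrafted features: mean and range of the first series.
first = [step[0] for step in series]
handcrafted = [sum(first) / len(first), max(first) - min(first)]

augmented = append_static_features(series, handcrafted)
print(len(augmented), len(augmented[0]))  # timesteps, features + x
```

Stacking such samples gives the (batch-size, time-length, features+x) tensor, and the LSTM's input_shape would change from (25, 4) to (25, 4 + x) accordingly.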
I have a dataset of N videos. Each video is characterized by some metrics (which will be the inputs for a neural net), and my goal is to predict the score that a person will give when he or she watches the video.
The problem is that in my dataset each video was watched more than once by different subjects, so I was forced to duplicate the same metrics (inputs) once for each time the video was watched, in order to keep all the scores given by the subjects.
I built an MLP model to predict the scores. But when I calculate the RMSE, it's always higher than 0.7.
I want to know whether having a dataset like that would affect the performance of my model, and how I can deal with it.
Here is what the dataset looks like:
The first 5 columns are the inputs and the last one is the score of subjects. Note that all of them are normalized.
Here is my Model:
import numpy
from keras.models import Sequential
from keras.layers import Dense

def mlp_model():
    # create model
    model = Sequential()
    model.add(Dense(100, input_dim=5, kernel_initializer='normal', activation='relu'))
    model.add(Dense(100, kernel_initializer='normal', activation='relu'))
    model.add(Dense(100, kernel_initializer='normal', activation='relu'))
    model.add(Dense(100, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

seed = 100
numpy.random.seed(seed)
myModel = mlp_model()
myModel.fit(x=x_train, y=y_train, batch_size=10, epochs=45, validation_split=0.3, shuffle=True, callbacks=[plot_losses])
predictions = myModel.predict(x_test)
print(predictions)
Your problem statement reveals an inherent flaw in the design. As you correctly pointed out, you have no way of knowing what the user does, how she has rated other videos, and how she will rate the current video.
It would be helpful to explain what your current input values are, and whether they could differ at all. For example, a metric like "time spent watching the video" might be different for different users.
On a larger scale, try to answer the question of whether you yourself could predict the rating with a completely deterministic judgement, i.e. would it be possible for you to come up with the same answer given the same input, and consistently get the right result?
Since that is currently not the case, I would say that you should invest more time in finding a suitable approach to your problem, for example recommender systems, but that also requires you to use a lot of different input information.
Alternatively, you could try to find more input data, which specifically identifies the users, and allows you to make more suitable predictions; even then, it will be hard to base a reasonable prediction on such proxy metrics, since you might end up creating an unwanted bias in your preprocessing.
In any case, getting much better results with the current format of the input is very unlikely.
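One way to see why the duplicated inputs limit the result: a deterministic model returns a single prediction for identical inputs, and under squared error the best single prediction is the mean score for that input. The spread of the scores around that mean is therefore a floor on the RMSE for this data. The sketch below computes that floor on made-up rows; the metrics and scores are hypothetical, not taken from the question's dataset.

```python
import math
from collections import defaultdict

# Hypothetical rows: (video metrics, score given by one subject).
rows = [
    ((0.2, 0.5), 0.9),
    ((0.2, 0.5), 0.1),
    ((0.7, 0.1), 0.6),
    ((0.7, 0.1), 0.4),
]

# Group the scores of all subjects by identical input.
scores_by_input = defaultdict(list)
for metrics, score in rows:
    scores_by_input[metrics].append(score)

# Best possible deterministic prediction per input: the mean score.
mean_score = {m: sum(s) / len(s) for m, s in scores_by_input.items()}

# RMSE of that ideal predictor on the same rows.
squared_errors = [(score - mean_score[m]) ** 2 for m, score in rows]
rmse_floor = math.sqrt(sum(squared_errors) / len(rows))
print(rmse_floor)
```

Running the same calculation on the real dataset would show how much of the 0.7 RMSE comes from disagreement between subjects rather than from the model. Training on one row per video with the mean score as the target is one possible simplification, at the cost of discarding the per-subject variation.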
My convolutional network is designed to classify data patterns. I feed my CNN with 2D matrices to get the relevant class of the pattern if any. There are 4 patterns my classifier recognizes.
Data:
Every data sample is a snapshot from a chart of some sequence. The idea is inspired by this paper. Data samples are 2-dimensional data-bands, narrow matrices.
The problem:
The CNN classifies patterns with high accuracy.
However, there are some patterns that are very similar to one of the 4 classes yet those patterns do not fall under any class.
There are options to use fine-grained models or raise the threshold to the highest possible value. However, those options don't improve the model's accuracy significantly. Similar patterns are still largely classified as valid. I augmented the data by expanding the matrices with more data, which didn't work well either. The difference between similar patterns and valid patterns is visible to the naked eye. It seems that the CNN can't see subtle differences that are obvious to the human eye.
What would you recommend trying to make the model see those differences?
Thanks!
These are the last layers. The rest consists of conv and pooling layers.
model.add(Dense(output_classes,
                kernel_regularizer=regularizers.l2(regularization),
                kernel_initializer=kernel_initializer,
                bias_regularizer=regularizers.l2(regularization)))
model.add(BatchNormalization())
model.add(Activation('softmax'))
adam = optimizers.Adam(lr=0.001,
                       beta_1=0.9,
                       beta_2=0.999,
                       epsilon=1e-08,
                       decay=0.0)
print('Compiling the model...')
model.compile(optimizer=adam,
              loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()