How to input features separately to an LSTM model

I am trying to use an LSTM for my time series classification problem as follows. My dataset has about 2000 data points, and each data point consists of 4 time series of length 25.
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(100, input_shape=(25,4)))
model.add(Dense(50))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
However, the LSTM model performs very poorly and gives me very poor results. While this is upsetting, I think the LSTM gives poor results because it is unable to capture some important characteristics of the time series.
In that case, I am wondering if it is possible to give some handcrafted features along with the timeseries to the model? If so, please let me know how to do it.
I am happy to provide more details if needed.
EDIT:
I am wondering whether it is possible to use Keras's functional API for this, so that I can use my features as a separate input.

An LSTM model takes a 3-dimensional tensor as input, with dimensions (batch-size, time-length, num-features).
To answer your question, you will have to concatenate the hand-crafted features with the four raw features that you have, maybe normalize them to bring all of them to the same scale, and pass a (batch-size, time-length, features+x) tensor as input to the LSTM model.
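As a rough illustration of the concatenation described above, the sketch below uses NumPy with random stand-in data. The array names, the choice of 3 hand-crafted features per data point, and the z-score normalization are assumptions made for the example; they are not given in the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data with the shapes from the question:
# 2000 samples, 25 timesteps, 4 raw features.
X_raw = rng.normal(size=(2000, 25, 4))

# Hypothetical hand-crafted features: 3 values per sample
# (for example, summary statistics of each sample).
X_hand = rng.normal(size=(2000, 3))

# Repeat the per-sample features at every timestep so they line up with the series.
X_hand_seq = np.repeat(X_hand[:, np.newaxis, :], X_raw.shape[1], axis=1)  # (2000, 25, 3)

# Concatenate along the feature axis: (batch, time, 4 + 3).
X = np.concatenate([X_raw, X_hand_seq], axis=-1)

# Normalize each feature to zero mean and unit variance over samples and timesteps.
mean = X.mean(axis=(0, 1), keepdims=True)
std = X.std(axis=(0, 1), keepdims=True) + 1e-8  # guard against division by zero
X_norm = (X - mean) / std

print(X_norm.shape)  # (2000, 25, 7)
```

With this layout the LSTM would be declared with input_shape=(25, 7) instead of (25, 4). In practice the mean and standard deviation should be computed on the training split only and then reused for validation and test data.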

Related

Need help defining a simple neural network

I am very new to this and I have several questions. I have code snippets of a neural network created in Python with Keras. The model is used for sentiment analysis. A training dataset of labeled data (sentiment = 1 or 0) was used.
Now I have several questions on how to describe the neural network.
model = Sequential()
model.add(Dense(512, input_shape=(max_words,), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
model.fit(train_x, train_y,
batch_size=32,
epochs=5,
verbose=1,
validation_split=0.1,
shuffle=True)
I am not very clear on many of the following terms so don't be too hard on me.
1: Is there anything that makes this a typical model for sentiment analysis?
2: Is it "bag of words"? (My guess is yes, since the data was pre-processed using a tokenizer)
3: Is it "convolutional"?
4: Is it deep?
5: Is it dense - How dense is it?
6: What is the reason for the density(?)-numbers: 512, 256, 2
7: How many layers does it have (input and output layer included/excluded?)
8: Is it supervised / unsupervised?
9: What is the reason behind the three different activation functions 'relu', 'sigmoid', 'softmax' in the used order?
I appreciate any help!

Categorical cross-entropy, the loss function for this network, together with the softmax output layer, makes it usable for sentiment analysis. The softmax layer returns probabilities for the different classes, and the cross-entropy loss measures how far they are from the true labels. In your case, you need probabilities for two possible classes (0 or 1).
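To make the loss concrete, here is a minimal sketch of categorical cross-entropy for a single data point, written in plain Python. The function name and the example probabilities are made up for illustration; Keras computes the same quantity averaged over a batch.

```python
import math

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot label and predicted class probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

# One-hot label for the positive class and two hypothetical network outputs.
label = [0.0, 1.0]
confident = [0.1, 0.9]  # mostly right
unsure = [0.5, 0.5]     # no preference

loss_confident = categorical_cross_entropy(label, confident)
loss_unsure = categorical_cross_entropy(label, unsure)

print(round(loss_confident, 4))  # 0.1054
print(round(loss_unsure, 4))     # 0.6931
```

The loss shrinks as the probability assigned to the correct class approaches 1, which is what training pushes the network toward.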
I am not sure whether you are using a tokenizer, since it is not apparent from the code you provided, but if you are, then yes, it is a bag-of-words model. A bag-of-words model essentially records how many times each word occurs in your text, ignoring word order.
From Wikipedia, if the following is your text:
John likes to watch movies. Mary likes movies too.
then, a BoW for this text would be:
{"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};
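The same counts can be reproduced with a few lines of standard-library Python. The regular-expression tokenization is a simplifying assumption; a real tokenizer may also lowercase the text or handle punctuation differently.

```python
from collections import Counter
import re

text = "John likes to watch movies. Mary likes movies too."

# Split into word tokens, dropping punctuation; word order is then discarded.
tokens = re.findall(r"[A-Za-z]+", text)
bow = Counter(tokens)

print(dict(bow))
# {'John': 1, 'likes': 2, 'to': 1, 'watch': 1, 'movies': 2, 'Mary': 1, 'too': 1}
```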
The network architecture you are using is not convolutional; it is a feedforward model that connects every unit in one layer to every unit in the next, so each unit computes a dot product of the previous layer's outputs with its own weights.
There is no one accepted definition of a network being deep. But, as a rule of thumb, if a network has more than 2 middle layers (layers excluding the input and output layer), then it can be considered as a deep network.
In the code provided above, Dense refers to the fact that every unit in the first layer (512) is connected to every unit in the next layer, i.e., a total of 512x256 connections between the first layer and the second.
Yes, the connections between the 512 units in the first layer to the 256 units in the second layer resulting in a 512x256 dimensional matrix of parameters makes it dense. But the usage of Dense here is more from an API perspective rather than semantic. Similarly, the parameter matrix between the second and third layer would be 256x2 dimensional.
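The matrix sizes mentioned above can be checked with simple arithmetic. The sketch assumes each Dense layer also carries one bias per unit, which is the Keras default; the helper name dense_params is hypothetical.

```python
def dense_params(n_inputs, n_units, use_bias=True):
    """Number of trainable parameters in a fully connected layer."""
    return n_inputs * n_units + (n_units if use_bias else 0)

# The two weight matrices discussed above: 512 -> 256 and 256 -> 2.
layers = [(512, 256), (256, 2)]
for n_in, n_out in layers:
    weights = dense_params(n_in, n_out, use_bias=False)
    total = dense_params(n_in, n_out)
    print(f"{n_in} -> {n_out}: {weights} weights, {total} parameters with biases")

# 512 -> 256: 131072 weights, 131328 parameters with biases
# 256 -> 2: 512 weights, 514 parameters with biases
```

These totals are what model.summary() would be expected to report for those two layers.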
Note that the 512 units belong to the first Dense layer, not to the input: the input has max_words units. If you exclude the input and the output layer (which has 2 units, one per class, i.e., 0/1), your network has two hidden layers, with 512 and 256 units.
This model is supervised, since the sentiment analysis task has an output (positive or negative) associated with every input data point. You can see this output as a supervisor that tells the network whether a data point has a positive or negative sentiment. An unsupervised task does not have an output signal associated with the data points.
The activation functions used here provide nonlinearity to the network's computations. In a little more detail, sigmoid has the nice property that its output lies between 0 and 1 and can be interpreted as a probability. So if a sigmoid output unit gives 0.89 for a data point, the model evaluates that data point to be positive with a probability of 0.89. Softmax, used on the final layer here, plays the same role for several classes: it normalizes the outputs so that they sum to 1.
The sigmoid in the hidden layer is probably there for teaching purposes: ReLU units are generally favored over sigmoid/tanh in hidden layers because of better convergence properties, and I don't see a convincing reason to use sigmoid instead of ReLU.
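For reference, the three activation functions can be written out in a few lines of plain Python. This is a sketch of the standard definitions, not the Keras implementation.

```python
import math

def relu(x):
    """Pass positive values through unchanged, clip negatives to zero."""
    return max(0.0, x)

def sigmoid(x):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

print(relu(-2.0), relu(3.0))         # 0.0 3.0
print(round(sigmoid(0.0), 2))        # 0.5
probs = softmax([1.0, 3.0])
print([round(p, 3) for p in probs])  # [0.119, 0.881]
```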

Keras model for multiclass classification for sentiment analysis with LSTM - how can my model be improved?

I want to predict the number of stars a product gets on Amazon using Keras. I have seen other ways of doing this, but I have used the Universal Sentence Encoder with one-hot encoding (I followed a YouTube tutorial to embed the reviews). Without an LSTM layer, using the following layers:
`model.add(keras.layers.Dense(units=256,input_shape=(X_train.shape[1], ),activation='relu'))
model.add(keras.layers.Dropout(rate=0.5))
model.add(keras.layers.Dense(units=128,activation='relu'))
model.add(keras.layers.Dropout(rate=0.5))
model.add(keras.layers.Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.Adam(0.0001),metrics= ['accuracy'])`
I am able to get an accuracy of around 0.55 and a loss of 1, which isn't great. However when I reshape my X_train and X_test data to be 3D input for an LSTM layer and then put it into a model such as:
`model.add(keras.layers.Dense(units=256,input_shape=(512, 1), activation='relu'))
model.add(keras.layers.Dropout(rate=0.5))
model.add(keras.layers.Bidirectional(keras.layers.LSTM(100, dropout=0.2, recurrent_dropout=0.3)))
model.add(keras.layers.Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.Adam(0.0001),metrics= ['accuracy'])`
I get an accuracy of around 0.2 which is even worse, with a loss of close to 2.00.
I have no idea whether an LSTM is necessary as I am new to neural networks but I have been trying to do this for my project.
So my question is: should I stick with the first model without an LSTM, or is there a way of changing the second network with the LSTM to improve on its accuracy of 0.2 while keeping the embedding method I have used?
Thanks for your time!

The reason to prefer an LSTM over plain dense layers is that in language, the relationships between words matter for understanding what a sentence means. A model with only dense layers handles this poorly because it has no connections that can store such information; it predicts from the input as a whole rather than from how the words relate to one another. LSTM stands for Long Short-Term Memory: these units can remember data they have seen earlier in the sequence, which helps them relate different words in the same sentence.
As for building the model: first, use the Tokenizer from the TF library to tokenize your data and convert each text into a sequence of integers, then pad the sequences with pad_sequences. Your data is then ready. The first layer of the network should be an Embedding layer. After it you can use an LSTM (for the reasons above), a Bidirectional LSTM (which learns dependencies both left-to-right and right-to-left and often performs better than a unidirectional LSTM), or Conv1D layers (which model dependencies that fall within the filter length; this has been used successfully and is worth trying), followed by a pooling layer (GlobalMaxPooling1D) and then dense layers to produce the predictions.
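The preprocessing steps described above (tokenize, map words to integers, pad to a fixed length) can be sketched in plain Python. This is a stand-in for what a tokenizer and pad_sequences do conceptually, not the Keras implementation; the helper names build_vocab, texts_to_sequences and pad, and the sample reviews, are hypothetical.

```python
from collections import Counter

def build_vocab(texts):
    """Map each word to an integer index, most frequent first; 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def texts_to_sequences(texts, vocab):
    """Replace each known word by its index; unknown words are skipped."""
    return [[vocab[w] for w in text.lower().split() if w in vocab] for text in texts]

def pad(sequences, maxlen):
    """Left-pad with zeros (keeping at most the last maxlen tokens) so all rows have equal length."""
    return [[0] * (maxlen - len(s[-maxlen:])) + s[-maxlen:] for s in sequences]

reviews = ["great product works great", "terrible product"]
vocab = build_vocab(reviews)
padded = pad(texts_to_sequences(reviews, vocab), maxlen=5)

print(vocab)   # {'great': 1, 'product': 2, 'works': 3, 'terrible': 4}
print(padded)  # [[0, 1, 2, 3, 1], [0, 0, 0, 4, 2]]
```

An integer matrix like padded is the kind of input an Embedding layer expects, with a vocabulary size of len(vocab) + 1 to account for the padding index.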

How to identify number of nodes and layers in lstm model

I have a time series classification problem where I use a dataset of 2000 data points. Each data point has 4 time series of length 25.
I am using the following LSTM model on this dataset.
model = Sequential()
model.add(LSTM(10, input_shape=(25,4)))
model.add(Dense(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
The above model gave me really bad results (nearly 0.3). This is really upsetting and now I am looking for ways to improve my results.
I think the problem is with the number of nodes I use in each layer (i.e. 10 and 32 respectively). My question is, is there a way to identify how many nodes we need in each layer? Moreover, is it sufficient to have one LSTM layer and one Dense layer? Do you think that I can improve these layers?
I am happy to provide more details if needed.

lstm prediction result delay phenomenon

Recently I have been using an LSTM to predict time series. I'm using Keras 2.0 to construct my LSTM model. It has a structure like this:
model = Sequential()
model.add(LSTM(128, input_shape=(timesteps, 1), return_sequences=False, stateful=False))
model.add(Dropout(rate=0.1))
model.add(Dense(1))
I have tried to use this network to predict several time series including sin(t) and a real traffic flow dataset. I found that the prediction for sin is fine while the prediction for real dataset is just like shifting the last input value by one step. I don't know whether it's a prediction error or the network doesn't learn the pattern of the dataset at all. Does anyone get similar results? Are there any solutions to this annoying shift? Thanks a lot.
Here are some of my predictions:
[Image: 3 frequencies sin prediction result]
[Image: real traffic dataset prediction result]

This is simply the starting point for your network and you'll have to work through it by trying various things.
To name only a few:
Try different window lengths (timesteps fed into network)
Try adding dense layers, or multiple LSTM layers, or fewer LSTM nodes
Try different optimizers, with various learning rates
Look for additional datapoints to feed into the network
How much data do you have? You may need more to get a good prediction
Try different offsets for the Y variable, how many timesteps do you need to be able to predict out for your specific problem?
The list goes on....
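Two of the suggestions above, the window length and the offset of the Y variable, come down to how the training pairs are cut from the series. Here is a minimal NumPy sketch; the function name make_windows and the toy series are assumptions for illustration.

```python
import numpy as np

def make_windows(series, timesteps, offset=1):
    """Build (X, y) where X[i] holds `timesteps` consecutive values and
    y[i] is the value `offset` steps after the end of that window."""
    series = np.asarray(series, dtype=float)
    n = len(series) - timesteps - offset + 1
    if n <= 0:
        raise ValueError("series is too short for this window and offset")
    X = np.stack([series[i:i + timesteps] for i in range(n)])
    start = timesteps + offset - 1
    y = series[start:start + n]
    return X[..., np.newaxis], y  # X has shape (samples, timesteps, 1)

series = np.arange(10)  # toy series 0, 1, ..., 9
X, y = make_windows(series, timesteps=3, offset=2)
print(X.shape, y.shape)  # (6, 3, 1) (6,)

# Naive "repeat the last input value" baseline for comparison with a trained model.
naive_mse = np.mean((X[:, -1, 0] - y) ** 2)
print(naive_mse)
```

If a trained network's error is no better than naive_mse on real data, its predictions are effectively just the shifted input, which matches the delay described in the question.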

Convolutional network for data pattern recognition

My convolutional network is designed to classify data patterns. I feed my CNN with 2D matrices to get the relevant class of the pattern if any. There are 4 patterns my classifier recognizes.
Data:
Every data sample is a snapshot from a chart of some sequence. The idea is inspired by this paper. Data samples are 2-dimensional data-bands, narrow matrices.
The problem:
The CNN classifies patterns with high accuracy.
However, there are some patterns that are very similar to one of the 4 classes yet those patterns do not fall under any class.
There are options to use fine-grained models or raise the threshold to the highest possible value. However, those options don't improve the model's accuracy significantly. Similar patterns are still largely classified as valid. I augmented the data by expanding the matrices with more data, which didn't work well either. The naked eye can see the difference between similar patterns and valid patterns. It seems that the CNN can't see subtle differences that are obvious to the human eye.
What would you recommend trying to make the model see those differences?
Thanks!
These are the last layers. The rest of the model consists of conv and pooling layers.
model.add(Dense(output_classes,
kernel_regularizer=regularizers.l2(regularization),
kernel_initializer=kernel_initializer,
bias_regularizer=regularizers.l2(regularization)))
model.add(BatchNormalization())
model.add(Activation('softmax'))
adam = optimizers.Adam(lr=0.001,
beta_1=0.9,
beta_2=0.999,
epsilon=1e-08,
decay=0.0)
print('Compiling the model...')
model.compile(optimizer=adam,
loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
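For reference, the thresholding option mentioned above might look like the following when applied to the softmax outputs. This is only a sketch of rejecting low-confidence predictions; the function name, the threshold of 0.9, and the sample probabilities are assumptions, and as the question notes, thresholding alone did not separate the look-alike patterns.

```python
import numpy as np

def predict_with_reject(probs, threshold=0.9, reject_label=-1):
    """Return the argmax class per row, or reject_label when the top probability is below threshold."""
    probs = np.asarray(probs)
    top = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    return np.where(top >= threshold, labels, reject_label)

# Hypothetical softmax outputs for three samples over the 4 pattern classes.
probs = [
    [0.97, 0.01, 0.01, 0.01],  # confident: class 0
    [0.40, 0.35, 0.15, 0.10],  # ambiguous: rejected
    [0.05, 0.05, 0.88, 0.02],  # fairly confident, but below 0.9: rejected
]
print(predict_with_reject(probs).tolist())  # [0, -1, -1]
```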
