My convolutional network is designed to classify data patterns. I feed my CNN 2D matrices and it returns the relevant class of the pattern, if any. There are 4 patterns my classifier recognizes.
Data:
Every data sample is a snapshot from a chart of some sequence; the idea is inspired by this paper. The data samples are two-dimensional data bands, i.e., narrow matrices.
The problem:
The CNN classifies patterns with high accuracy.
However, some patterns are very similar to one of the 4 classes and yet do not fall under any class.
I could use finer-grained models or raise the decision threshold to the highest possible value, but those options don't improve the model's accuracy significantly: the similar patterns are still largely classified as valid. I also augmented the data by expanding the matrices with more data, which didn't help either. The naked eye can see the difference between the similar patterns and the valid patterns, so it seems the CNN can't see subtle differences that are obvious to a human.
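For reference, the probability-threshold rejection mentioned above presumably looks something like the following sketch; the threshold value and variable names are illustrative assumptions, not the actual code:

import numpy as np

probs = model.predict(samples)               # softmax output, shape (n_samples, 4)
threshold = 0.95                             # the "highest possible value" referred to above

pred_class = np.argmax(probs, axis=1)
confident = np.max(probs, axis=1) >= threshold
pred_class[~confident] = -1                  # -1 means "does not fall under any class"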
What would you recommend trying to make the model see those differences?
Thanks!
These are the last layers; the rest of the model consists of conv and pooling layers:
model.add(Dense(output_classes,
                kernel_regularizer=regularizers.l2(regularization),
                kernel_initializer=kernel_initializer,
                bias_regularizer=regularizers.l2(regularization)))
model.add(BatchNormalization())
model.add(Activation('softmax'))

adam = optimizers.Adam(lr=0.001,
                       beta_1=0.9,
                       beta_2=0.999,
                       epsilon=1e-08,
                       decay=0.0)

print('Compiling the model...')
model.compile(optimizer=adam,
              loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
I am very new to this and I have several questions. I have code snippets of a neural network created in Python with Keras. The model is used for sentiment analysis; a training dataset of labeled data (sentiment = 1 or 0) was used.
Now I have several questions on how to describe the neural network.
model = Sequential()
model.add(Dense(512, input_shape=(max_words,), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(train_x, train_y,
          batch_size=32,
          epochs=5,
          verbose=1,
          validation_split=0.1,
          shuffle=True)
I am not very clear on many of the following terms so don't be too hard on me.
1: Is there anything that makes this a typical model for sentiment analysis?
2: Is it "bag of words"? (My guess is yes, since the data was pre-processed using a tokenizer)
3: Is it "convolutional"?
4: Is it deep?
5: Is it dense - How dense is it?
6: What is the reason for the density(?) numbers: 512, 256, 2?
7: How many layers does it have (input and output layer included/excluded?)
8: Is it supervised / unsupervised?
9: What is the reason behind the three different activation functions 'relu', 'sigmoid', 'softmax' in the used order?
I appreciate any help!
Categorical cross-entropy, the loss function used for this neural network, is what makes it usable for sentiment analysis: the softmax output gives probabilities for the different classes, and the cross-entropy loss compares them with the true labels. In your case, you need probabilities for two possible classes (0 or 1).
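For reference, categorical cross-entropy is just the negative log-probability assigned to the true class; a tiny worked example (not from the original post):

import numpy as np

def categorical_crossentropy(y_true, y_pred):
    # y_true: one-hot label, e.g. [0, 1]; y_pred: softmax output, e.g. [0.3, 0.7]
    return -np.sum(y_true * np.log(y_pred))

print(categorical_crossentropy(np.array([0, 1]), np.array([0.3, 0.7])))  # ≈ 0.357, i.e. -log(0.7)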
I am not sure if you are using a tokenizer, since it is not apparent from the code you provided, but if you are, then yes, it is a Bag-of-words model. A Bag-of-words model essentially creates a store of the words (or word roots) in your text together with their counts.
From Wikipedia, if the following is your text:
John likes to watch movies. Mary likes movies too.
then, a BoW for this text would be:
{"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};
The network architecture you are using is not convolutional; rather, it is a feedforward model that connects all units in one layer to all units in the next, computing a dot product of the previous layer's values with the connection weights.
There is no one accepted definition of a network being deep. But, as a rule of thumb, if a network has more than 2 middle layers (layers excluding the input and output layer), then it can be considered as a deep network.
In the code provided above, Dense reflects the fact that all units in the first layer (512) are connected to every unit in the next layer, i.e., a total of 512x256 connections between the first layer and the second.
Yes, the connections between the 512 units in the first layer to the 256 units in the second layer resulting in a 512x256 dimensional matrix of parameters makes it dense. But the usage of Dense here is more from an API perspective rather than semantic. Similarly, the parameter matrix between the second and third layer would be 256x2 dimensional.
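For illustration, the parameter counts implied by those layer sizes can be worked out directly; max_words (the input dimension) is not given in the question, so an assumed value is used here:

# parameters of a Dense layer = inputs * units + units (biases)
max_words = 10000                  # assumed vocabulary size, not from the question
print(max_words * 512 + 512)       # first Dense layer: 5,120,512
print(512 * 256 + 256)             # second Dense layer: 131,328
print(256 * 2 + 2)                 # output layer: 514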
If you exclude the input layer (having 512 units) and output layer (having 2 possible outputs, i.e., 0/1), then your network here has one layer, with 256 units.
This model is supervised, since the sentiment analysis task has an output (positive or negative) associated with every input data point. You can see this output as a supervisor for the network, indicating whether a data point has a positive or negative sentiment. An unsupervised task does not have an output signal associated with the data points.
The activation functions used here serve the purpose of providing nonlinearity to the network's computations. In a little more detail, sigmoid has the nice property that its output can be interpreted as a probability: if the network outputs 0.89 for a data point, that would mean the model evaluates the data point as positive with a probability of 0.89.
The usage of sigmoid here is probably for teaching purposes: ReLU units are generally favored over sigmoid/tanh because of better convergence properties, and I don't see a convincing reason to use sigmoid instead of ReLU in that middle layer.
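If one wanted to follow that advice, the change would just be the middle layer's activation (a sketch, not the original code):

model.add(Dense(256, activation='relu'))   # instead of activation='sigmoid'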
So I want to predict the number of stars a product gets on Amazon using Keras. I have seen other ways of doing this, but I used the universal sentence encoder with one-hot encoding (I followed a YouTube tutorial to embed the reviews). Now, without using an LSTM layer and using the following layers:
model.add(keras.layers.Dense(units=256, input_shape=(X_train.shape[1],), activation='relu'))
model.add(keras.layers.Dropout(rate=0.5))
model.add(keras.layers.Dense(units=128, activation='relu'))
model.add(keras.layers.Dropout(rate=0.5))
model.add(keras.layers.Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.Adam(0.0001), metrics=['accuracy'])
I am able to get an accuracy of around 0.55 and a loss of 1, which isn't great. However, when I reshape my X_train and X_test data to be 3D input for an LSTM layer and then put it into a model such as:
model.add(keras.layers.Dense(units=256, input_shape=(512, 1), activation='relu'))
model.add(keras.layers.Dropout(rate=0.5))
model.add(keras.layers.Bidirectional(keras.layers.LSTM(100, dropout=0.2, recurrent_dropout=0.3)))
model.add(keras.layers.Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.Adam(0.0001), metrics=['accuracy'])
I get an accuracy of around 0.2 which is even worse, with a loss of close to 2.00.
I have no idea whether an LSTM is necessary as I am new to neural networks but I have been trying to do this for my project.
So I am asking: should I stick with the first model without an LSTM, or is there a way of changing the second neural network with the LSTM so that it does better than an accuracy of 0.2, whilst using the embedding methods that I have used?
Thanks for your time!
The reason you should choose an LSTM instead of normal (dense) neurons is that in language there are relationships between words, and those relationships are important for understanding what a sentence means. A model with only dense layers cannot do this very well because it has no connections that can store such information; it just predicts by looking at the whole picture, not at the connections the words have between them. Coming to LSTMs, the name stands for Long Short-Term Memory: in short, they have the capability to remember data they have seen previously, which helps them create connections between different words in the same sentence.
Coming to how you would go about creating your model: first, use a Tokenizer from the TF library to create tokens out of your data, then convert your sequences into numbers with it, then pad your data using pad_sequences. Your data is then ready. In your network, the first layer should be an Embedding layer. After it you can have LSTM layers (for the reasons explained above), Bidirectional LSTM layers (they learn dependencies both left-to-right and right-to-left, and perform better than a unidirectional LSTM), or Conv1D layers (depending on the filter size they can model dependencies lying within their filter length; they have been used for text and work, so you can try them), followed by a pooling layer (GlobalMaxPooling1D) and then dense layers to get your predictions, as in the sketch below.
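A minimal sketch of that pipeline; the vocabulary size, sequence length, layer sizes and the train_texts variable are illustrative assumptions, not values from the question:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, GlobalMaxPooling1D, Dense

max_words, max_len = 10000, 100                      # assumed vocabulary size and padded review length
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(train_texts)                  # train_texts: your raw review strings
X_train = pad_sequences(tokenizer.texts_to_sequences(train_texts), maxlen=max_len)

model = Sequential()
model.add(Embedding(max_words, 128, input_length=max_len))     # learn word vectors
model.add(Bidirectional(LSTM(64, return_sequences=True)))      # context in both directions
model.add(GlobalMaxPooling1D())                                 # pool over the time dimension
model.add(Dense(5, activation='softmax'))                       # 5 star classes
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])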
I am trying to create a Keras neural network to predict the distance on roads between two points in a city. I use Google Maps to get the travel distance and then train a neural network on it.
import pandas as pd

arr = []
for i in range(0, 100):
    arr.append(generateTwoPoints(55.901819, 37.344735, 55.589537, 37.832254))
df = pd.DataFrame(arr, columns=['p1Lat', 'p1Lon', 'p2Lat', 'p2Lon', 'distnaceInMeters', 'timeInSeconds'])
print(df)
Neural network architecture:
from keras.optimizers import SGD
sgd = SGD(lr=0.00000001)
from keras.models import Sequential
from keras.layers import Dense, Activation
model = Sequential()
model.add(Dense(100, input_dim=4 , activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='mse', optimizer='sgd', metrics=['mse'])
Then I split the data into train/test sets:
Xtrain=train[['p1Lat','p1Lon','p2Lat','p2Lon']]/100
Ytrain=train[['distnaceInMeters']]/100000
Xtest=test[['p1Lat','p1Lon','p2Lat','p2Lon']]/100
Ytest=test[['distnaceInMeters']]/100000
Then I fit the data to the model, but the loss stays the same:
history = model.fit(Xtrain, Ytrain,
                    batch_size=1,
                    epochs=1000,
                    # We pass some validation data for
                    # monitoring validation loss and metrics
                    # at the end of each epoch
                    validation_data=(Xtest, Ytest))
I later print the data:
prediction = model.predict(Xtest)
print(prediction)
print (Ytest)
But the result is practically the same for all inputs:
[[0.26150784]
[0.26171574]
[0.2617755 ]
[0.2615582 ]
[0.26173398]
[0.26166356]
[0.26185763]
[0.26188275]
[0.2614446 ]
[0.2616575 ]
[0.26175532]
[0.2615183 ]
[0.2618127 ]]
distnaceInMeters
2 0.13595
6 0.27998
7 0.48849
16 0.36553
21 0.37910
22 0.40176
33 0.09173
39 0.24542
53 0.04216
55 0.38212
62 0.39972
64 0.29153
87 0.08788
I can not find the problem. What is it? I am new to machine learning.
You are making a very elementary mistake: since you are in a regression setting, you should not use a sigmoid activation for your final layer (this is used for binary classification); change your last layer to
model.add(Dense(1,activation='linear'))
or even
model.add(Dense(1))
since, according to the docs, if you do not specify the activation argument it defaults to linear.
Various other advice offered already in the other answer and the comments may be useful (lower LR, more layers, other optimizers e.g. Adam), and you certainly need to increase your batch size; but nothing will work with the sigmoid activation function you currently use for your last layer.
Irrelevant to the issue, but in regression settings you don't need to repeat your loss function as a metric; this
model.compile(loss='mse', optimizer='sgd')
will suffice.
It would be very useful if you could post the progression of the loss and MSE (of both the training and validation/test set) throughout training. Even better, it would be best if you could visualize it as per https://machinelearningmastery.com/display-deep-learning-model-training-history-in-keras/ and post the visualization here.
In the meantime, based on the facts:
1) You say the loss isn't decreasing (I'm assuming on the training set, during training, based on your compile args).
2) You say that the prediction "accuracy" on your test set is bad.
3) My experience/intuition (not an empirical assessment) tells me that your two-layer dense model is a little too small to capture the complexity inherent in your data, i.e. your model is suffering from too high a bias: https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229
The fastest and easiest thing you can try, is to try to add both more layers and more nodes to each layer.
However, I should note that there is a lot of causal information that can affect the driving distance and driving time beyond just the distance between two coordinates, which might be the feature that your neural network will most readily extract. For example, whether you drive on a highway or side streets, traffic lights, whether the roads twist and turn or go straight... To infer all of that just from the data, you will need enormous amounts of data (examples), in my opinion. If you could add input columns with, e.g., the distance to the nearest highway from both points, you might be able to train with less data.
I would also recommend that you double-check that you are feeding as input what you think you are feeding (and its shape), and that you use some standardization function from sklearn, which might help the model learn faster and converge faster to a higher "accuracy"; see the sketch below.
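A minimal sketch of that standardization, assuming the train/test DataFrames from the question (this is not the OP's actual code):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
Xtrain = scaler.fit_transform(train[['p1Lat', 'p1Lon', 'p2Lat', 'p2Lon']])
Xtest = scaler.transform(test[['p1Lat', 'p1Lon', 'p2Lat', 'p2Lon']])   # reuse the training statistics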
If and when you post either more code or the training history I can help you more (and also how many training samples).
EDIT 1: Try changing the batch size to a larger number, preferably batch_size=32, if it fits in your memory. You can use a small batch size (such as 1) when working with an "info rich" input like an image, but when using a very "info poor" datum like 4 floats (2 coordinates), the gradient of each batch (with batch_size=1) will point in a practically random (pseudo...) direction and not necessarily get any closer to a local minimum. Only when taking the gradient on the collective loss of a larger batch (like 32, and perhaps more) will you get a gradient that points at least approximately in the direction of the local minimum and converge to a better result. Also, I suggest that you don't mess with the learning rate manually and perhaps change to an optimizer like "adam" or "RMSProp".
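Put together, those two suggestions amount to something like the following sketch (a larger batch size plus Adam with its default learning rate):

model.compile(loss='mse', optimizer='adam')
history = model.fit(Xtrain, Ytrain,
                    batch_size=32,                    # instead of batch_size=1
                    epochs=1000,
                    validation_data=(Xtest, Ytest))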
Edit 2: @Desertnaut made an excellent point that I totally missed, a correction without which your code will not work properly. He deserves the credit, so I will not include it here; please refer to his answer. Also, don't forget to raise your batch size and not "manually mess" with your learning rate; "adam", for example, will do it for you.
I am trying to use an LSTM for my time series classification problem as follows. My dataset has about 2000 data points, and each data point consists of 4 time series of length 25.
model = Sequential()
model.add(LSTM(100, input_shape=(25,4)))
model.add(Dense(50))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
However, the LSTM model works very poorly and gives me very low results. While this is upsetting, I am thinking that the LSTM gives low results because it is unable to capture some important characteristics of the time series.
In that case, I am wondering if it is possible to give some handcrafted features along with the timeseries to the model? If so, please let me know how to do it.
I am happy to provide more details if needed.
EDIT:
I am wondering if it is possible to use Keras's functional API in this regard, so that I can use my features as a separate input.
An LSTM model takes in a 3-dimensional tensor as input, with dimensions (batch_size, time_length, num_features).
To answer your question: concatenate those hand-crafted features with the four raw features that you have (maybe normalize them first to bring everything to the same scale) and pass a (batch_size, time_length, 4 + x) tensor as input to the LSTM model, as in the sketch below.
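A minimal sketch of that concatenation, assuming X has shape (n_samples, 25, 4) and handcrafted has shape (n_samples, x) with one static feature vector per data point (both names are hypothetical):

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

time_len, x = X.shape[1], handcrafted.shape[1]

# repeat the static hand-crafted features along the time axis ...
handcrafted_t = np.repeat(handcrafted[:, np.newaxis, :], time_len, axis=1)   # (n_samples, 25, x)

# ... and concatenate with the raw features along the last axis
X_combined = np.concatenate([X, handcrafted_t], axis=-1)                     # (n_samples, 25, 4 + x)

model = Sequential()
model.add(LSTM(100, input_shape=(time_len, 4 + x)))
model.add(Dense(50))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])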
I am trying to program a Keras model for audio transcription using connectionist temporal classification. Using a mostly working framewise classification model and the OCR example, I came up with the model given below, which I want to train on mapping the short-time Fourier transform of German sentences to their phonetic transcription.
My training data actually do have timing information, so I can use it to train a framewise model without CTC. The framewise prediction model, without the CTC loss, works decently (training accuracy 80%, validation accuracy 50%).
There is, however, much more potential training data available without timing information, so I really want to switch to CTC. To test this, I removed the timing from the data, increased the output size by one for the NULL class and added a CTC loss function.
This CTC model does not seem to learn. Overall, the loss is not going down (it went down from 2000 to 180 in a dozen epochs of 80 sentences each, but then it went back up to 430) and the maximum likelihood output it produces creeps around [nh for all of the sentences, which generally have around six words and transcriptions like [foːɐmʔɛsndʰaɪnəhɛndəvaʃn] – the [] are part of the sequence, representing the pause at the start and end of the audio.
I find it somewhat difficult to find good explanations of CTC in Keras, so it may be that I did something stupid. Did I mess up the model, mixing up the order of arguments somewhere? Do I need to be much more careful how I train the model, starting maybe with audio snippets with one, two or maybe three sounds each before giving the model complete sentences? In short,
How do I get this CTC model to learn?
connector = inputs
for l in [100, 100, 150]:
    lstmf, lstmb = Bidirectional(
        LSTM(
            units=l,
            dropout=0.1,
            return_sequences=True,
        ), merge_mode=None)(connector)
    connector = keras.layers.Concatenate(axis=-1)([lstmf, lstmb])

output = Dense(
    units=len(dataset.SEGMENTS)+1,
    activation=softmax)(connector)

loss_out = Lambda(
    ctc_lambda_func, output_shape=(1,),
    name='ctc')([output, labels, input_length, label_length])

ctc_model = Model(
    inputs=[inputs, labels, input_length, label_length],
    outputs=[loss_out])

ctc_model.compile(loss={'ctc': lambda y_true, y_pred: y_pred},
                  optimizer=SGD(
                      lr=0.02,
                      decay=1e-6,
                      momentum=0.9,
                      nesterov=True,
                      clipnorm=5))
ctc_lambda_func and the code to generate sequences from the predictions are from the OCR example.
It is entirely invisible from the code given here, but elsewhere the OP links to their GitHub repository. The error actually lies in the data preparation:
The data are log spectrograms. They are unnormalized, and mostly highly negative. The CTC function picks up on the general distribution of labels much faster than the LSTM layer can adapt its input bias and input weights, so all variation in the input is flattened out. The local minimum of loss might then come from epochs when the marginalized distribution of labels is not yet assumed globally.
The solution to this is to scale the input spectrograms such that they contain both positive and negative values:
for i, file in enumerate(files):
    sg = numpy.load(file.with_suffix(".npy").open("rb"))
    spectrograms[i][:len(sg)] = 2 * (sg - sg.min()) / (sg.max() - sg.min()) - 1