CTC model does not learn

I am trying to program a Keras model for audio transcription using connectionist temporal classification. Using a mostly working framewise classification model and the OCR example, I came up with the model given below, which I want to train on mapping the short-time Fourier transform of German sentences to their phonetic transcription.
My training data actually do have timing information, so I can use it to train a framewise model without CTC. The framewise prediction model, without the CTC loss, works decently (training accuracy 80%, validation accuracy 50%).
There is, however, much more potential training data available without timing information, so I really want to switch to CTC. To test this, I removed the timing from the data, increased the output size by one for the NULL (blank) class, and added a CTC loss function.
This CTC model, however, does not seem to learn. Overall the loss is not going down (it dropped from 2000 to 180 over a dozen epochs of 80 sentences each, but then climbed back up to 430), and the maximum-likelihood output it produces creeps around [nh for all of the sentences, which generally have around six words and transcriptions like [foːɐmʔɛsndʰaɪnəhɛndəvaʃn] – the [ and ] are part of the sequence, representing the pauses at the start and end of the audio.
I find it somewhat difficult to find good explanations of CTC in Keras, so it may be that I did something stupid. Did I mess up the model, mixing up the order of arguments somewhere? Do I need to be much more careful about how I train the model, perhaps starting with audio snippets of one, two or three sounds each before giving the model complete sentences? In short,
How do I get this CTC model to learn?
# imports used by this snippet
import keras
from keras.layers import Bidirectional, LSTM, Dense, Lambda
from keras.models import Model
from keras.optimizers import SGD
from keras.activations import softmax

# `inputs`, `labels`, `input_length` and `label_length` are Input layers defined
# elsewhere; `ctc_lambda_func` is taken from the Keras OCR example.
connector = inputs
for l in [100, 100, 150]:
    lstmf, lstmb = Bidirectional(
        LSTM(
            units=l,
            dropout=0.1,
            return_sequences=True,
        ), merge_mode=None)(connector)
    connector = keras.layers.Concatenate(axis=-1)([lstmf, lstmb])

output = Dense(
    units=len(dataset.SEGMENTS) + 1,  # +1 for the CTC blank/NULL class
    activation=softmax)(connector)

# The Lambda layer computes the CTC loss inside the graph, so the compiled loss
# below just passes its output through unchanged.
loss_out = Lambda(
    ctc_lambda_func, output_shape=(1,),
    name='ctc')([output, labels, input_length, label_length])

ctc_model = Model(
    inputs=[inputs, labels, input_length, label_length],
    outputs=[loss_out])

ctc_model.compile(loss={'ctc': lambda y_true, y_pred: y_pred},
                  optimizer=SGD(
                      lr=0.02,
                      decay=1e-6,
                      momentum=0.9,
                      nesterov=True,
                      clipnorm=5))
ctc_lambda_func and the code to generate sequences from the predictions are taken from the Keras OCR example.
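For reference, the wrapper in the Keras OCR example is essentially a thin call to keras.backend.ctc_batch_cost; a sketch (the frame slicing is specific to that example and may not apply to spectrogram input):

from keras import backend as K

def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args
    # The OCR example drops the first two frames because the early RNN outputs
    # there tend to be unreliable; adjust or remove as appropriate.
    y_pred = y_pred[:, 2:, :]
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)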

This is entirely invisible from the code given here, but elsewhere the OP links to their GitHub repository. The error actually lies in the data preparation:
The data are log spectrograms. They are unnormalized and mostly strongly negative. The CTC loss picks up on the overall distribution of labels much faster than the LSTM layers can adapt their input biases and weights, so all variation in the input gets flattened out. The temporary dip in the loss may then correspond to epochs in which the model has not yet fully settled on that marginal label distribution.
The solution to this is to scale the input spectrograms such that they contain both positive and negative values:
for i, file in enumerate(files):
    sg = numpy.load(file.with_suffix(".npy").open("rb"))
    # Rescale each spectrogram to the range [-1, 1]
    spectrograms[i][:len(sg)] = 2 * (sg - sg.min()) / (sg.max() - sg.min()) - 1

Related

Checking model overfit of doc2vec with infer_vector()

My aim is to create document embeddings from the column df["text"] as a first step, and then as a second step plug them, along with other variables, into an XGBoost regressor model in order to make predictions. This works very well for the train_df.
I am currently trying to evaluate my trained Doc2Vec model by inferring vectors with infer_vector() on the unseen test_df and then again making predictions with them. However, the results are very poor: I get a very large error (RMSE).
I assume this means that Doc2Vec is massively overfitting?
I am actually not sure whether this (using infer_vector) is the correct way to evaluate my Doc2Vec model.
What can I do to prevent Doc2Vec from overfitting?
Please find my code below for infering vectors from a model:
vectors_test = []
for i in range(0, len(test_df)):
    vecs = model.infer_vector(tokenize(test_df["text"][i]))
    vectors_test.append(vecs)
vectors_test = pd.DataFrame(vectors_test)
test_df = pd.concat([test_df, vectors_test], axis=1)
I then make predictions with my XGBoost model:
np.random.seed(0)
test_df= test_df.reindex(np.random.permutation(test_df.index))
y = test_df['target'].values
X = test_df.drop(['target'], axis=1).values
y_pred = mod.predict(X)
pred = pd.DataFrame()
pred["Prediction"] = y_pred
rmse = np.sqrt(mean_squared_error(y,y_pred))
print(rmse)
Please see also the training of my doc2vec model:
doc_tag = train_df.apply(lambda train_df: TaggedDocument(words=tokenize(train_df["text"]), tags=[train_df.Tag]), axis=1)

# initializing model, building a vocabulary
model = Doc2Vec(dm=0, vector_size=200, min_count=1, window=10, workers=cores)
model.build_vocab([x for x in tqdm(doc_tag.values)])

# train model for 5 epochs
for epoch in range(5):
    model.train(utils.shuffle([x for x in tqdm(doc_tag.values)]), total_examples=len(doc_tag.values), epochs=1)
Without knowing what your XGBoost model is being trained to predict, or more about the type/quantity of your training data for certain steps, it's hard to speculate about why one particular set of inputs is performing poorly. (For example, it could equally be the XGBoost model's data, parameters, or training that's mismatched to the task.)
But, some observations:
You generally shouldn't be calling train() multiple times in your own loop. See "My Doc2Vec code, after many loops of training, isn't giving good results. What might be wrong?" for a discussion of common problems here. (Yours isn't quite as stark, but the learning rate isn't being handled properly across your 5 separate train() calls; there may even be warnings about this in your log output.)
Similarly: it's often a bad idea to use a min_count as small as 1 in these kinds of models: such rare words, without enough varied examples to be truly understood, just inject idiosyncratic noise which dilutes the influence of other, surrounding tokens that are meaningful.
Most published work trains a Doc2Vec model for 10-20 epochs – you're only using 5. (And, for smaller datasets or smaller texts, often even more epochs help.) Inference will also default to the epochs configured when the model was created – here only 5 – but more epochs are often beneficial.
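Putting the above points together, a more conventional setup declares the epochs up front and calls train() exactly once; a minimal sketch, assuming the gensim 3.x+ API and the doc_tag / cores variables from the question (the min_count and epochs values are illustrative):

# larger min_count and full epoch count declared up front
model = Doc2Vec(dm=0, vector_size=200, min_count=5, window=10, epochs=20, workers=cores)
model.build_vocab(doc_tag.values)

# a single train() call; gensim then manages the learning-rate decay internally
model.train(doc_tag.values, total_examples=model.corpus_count, epochs=model.epochs)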
It's unclear the size of your training texts and their unique vocabulary, but Doc2Vec overfitting will be most likely if the model is relatively large – in terms of vector_size or total surviving vocabulary – compared to the training data. Then, the model has lots of opportunity to essentially 'memorize' idiosyncrasies of the training set, instead of more-generalizable patterns that will still be useful for out-of-training data. (For example, min_count=1, if it's preserving many singleton words which appear in only one text each, gives the model lots of "nooks and crannies" in which to improve its training target results in ways unlikely to help on other examples.) If your training data is "small", you likely need to use a smaller vector_size and a larger min_count to avoid overfitting, and then perhaps more epochs to ensure adequate training.
infer_vector essentially ignores any words not in its vocabulary - so you should take a look at some of the specific texts in the set performing poorly, and check whether most of their words are present, or not. But note also: as Doc2Vec is an unsupervised method, a plausible case can be made for training it to learn textual patterns on all available data, including the texts in your 'test' set. Then, it is more likely to have some word data, to at least the min_count threshold, for words across all examples. (Of course the actual supervised predictor itself can only be fairly evaluated on test examples whose desired answers weren't provided during the predictor's training. But it still can receive its features from an unsupervised step that used all text data.)
A crude check of a Doc2Vec model for overfitting or other training problems (but not overall quality) is to re-infer doc-vectors from the same texts it was trained on, and check the model's set of bulk-trained vectors (model.docvecs) for the nearest neighbors to these re-inferred vectors. If the re-inferred vector's nearest neighbor isn't usually the same text's bulk-trained vector – or if, more generally, re-inferring the same text multiple times doesn't yield vectors that are 'close' to each other – then something about the model training or inference is deficient: overfitting, undertraining, insufficient data, or unwise parameters.
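A rough sketch of that check, assuming the doc_tag TaggedDocuments from the question and the gensim 3.x docvecs API:

import collections

ranks = []
for doc in doc_tag.values:
    inferred = model.infer_vector(doc.words)
    # rank of the document's own bulk-trained vector among the nearest neighbours
    sims = model.docvecs.most_similar([inferred], topn=len(model.docvecs))
    rank = [tag for tag, _ in sims].index(doc.tags[0])
    ranks.append(rank)

# ideally most documents rank themselves first (rank 0)
print(collections.Counter(ranks))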

Overfitting - huge difference between training and validation accuracy

I have a dataset of 180k images for which I am trying to recognize the characters on the images (license plate recognition). All of these license plates contain seven characters and 35 characters are possible, so the output vector y has shape (7, 35). I therefore one-hot encoded every license plate label.
I applied the bottom of the EfficientNet-B0 model (https://keras.io/api/applications/efficientnet/#efficientnetb0-function) together with a customized top, which is divided into 7 branches (because of the seven characters per license plate). I used the ImageNet weights and froze the bottom layers of efnB0_model:
def create_model(input_shape=(224, 224, 3)):
    input_img = Input(shape=input_shape)
    model = efnB0_model(input_img)
    model = GlobalAveragePooling2D(name='avg_pool')(model)
    model = Dropout(0.2)(model)
    backbone = model

    branches = []
    for i in range(7):
        branches.append(backbone)
        branches[i] = Dense(360, name="branch_"+str(i)+"_Dense_16000")(branches[i])
        branches[i] = BatchNormalization()(branches[i])
        branches[i] = Activation("relu")(branches[i])
        branches[i] = Dropout(0.2)(branches[i])
        branches[i] = Dense(35, activation="softmax", name="branch_"+str(i)+"_output")(branches[i])

    output = Concatenate(axis=1)(branches)
    output = Reshape((7, 35))(output)
    model = Model(input_img, output)
    return model
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
For training and validating the model I only use 10,000 training images and 3,000 validation images, because of the size of my model and the huge amount of data, which would otherwise make training very, very slow.
I use this DataGenerator to feed batches to my model:
class DataGenerator(Sequence):

    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size

    def __len__(self):
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        batch_x = self.x[idx*self.batch_size : (idx + 1)*self.batch_size]
        batch_x = np.array([resize(imread(file_name), (224, 224)) for file_name in batch_x])
        batch_x = batch_x * 1./255
        batch_y = self.y[idx*self.batch_size : (idx + 1)*self.batch_size]
        batch_y = np.array(batch_y)
        return batch_x, batch_y
I fit the model using this code:
model.fit_generator(generator=training_generator,
                    validation_data=validation_generator,
                    steps_per_epoch=num_train_samples // 32,
                    validation_steps=num_val_samples // 32,
                    epochs=10, workers=6, use_multiprocessing=True)
Now, after several epochs of training, I observed big differences between training accuracy and validation accuracy. I think one reason for that is the small amount of data. Which other factors influence this overfitting in my model? Do you think there is something completely wrong with my code/model? Do you think the model is too big and complex, or is it maybe due to the preprocessing of the data?
Note: I already experimented with data augmentation and tried the model without transfer learning. That leads to poor results on training AND validation data. So, is there anything else I could do?
First, a disclaimer
Are you sure that this is the correct approach to follow? EfficientNet is a model created for image recognition, while your task demands the correct localization of 7 characters in one image, recognition of each one of them, and also keeping the order of the characters. Maybe an approach of detection + segmentation followed by recognition, like in this medium post, is more efficient (pun intended). Even though I think that this is very likely to be your real problem, I will try to answer your original question.
Now some general tips regarding overfitting
There's a very good guide here in the Keras documentation on how to use EfficientNet for transfer learning. I will try to summarize some tips from it here.
From your question it seems that you are not even doing the fine-tuning step, which is essential for the network to learn the task better.
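A minimal sketch of that fine-tuning step, along the lines of the Keras transfer-learning guide (it assumes the efnB0_model and model variables from the question and the TensorFlow Keras API):

import tensorflow as tf

# 1) train the new top first with the backbone frozen (as already done), then:
efnB0_model.trainable = True
for layer in efnB0_model.layers:
    # keep BatchNormalization layers frozen, as the Keras guide recommends
    if isinstance(layer, tf.keras.layers.BatchNormalization):
        layer.trainable = False

# 2) recompile with a much lower learning rate and continue training briefly
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss="categorical_crossentropy",
              metrics=["accuracy"])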
Now, after several epochs of training, I observed big differences regarding training accuracy and validation accuracy.
By several, how many epochs do you mean? From the image you put in the question, I think the second complete epoch is too soon to infer that your model is overfitting. Also, from the code (10 epochs) and from the image you posted (20 epochs), I would say to train for more epochs, like 40.
Increase the dropout. Try some configurations like 30%, 40%, 50%.
Data augmentation will in practice increase the number of samples that you have. However, you have 180K images and are only using 10K of them; data augmentation is good, but when you have more images available, try using them first. From the guide I mentioned, it seems feasible to train this model with more images on Google Colab. So, try to increase the training size. Still on the topic of DA: some transformations may be harmful to your task, like too much rotation or reflection, since you are trying to recognize numbers and letters.
Reducing the batch size to 16 may provide more regularization, which helps to fight overfitting. Speaking of regularization, try applying regularization to the dense layers that you are adding (see the sketch below).
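For instance, a sketch of adding an L2 weight penalty to the custom branch layers from create_model (the strength 1e-4 is just a starting value to tune):

from tensorflow.keras import regularizers

# inside the per-branch loop of create_model()
branches[i] = Dense(360,
                    kernel_regularizer=regularizers.l2(1e-4),
                    name="branch_"+str(i)+"_Dense_16000")(branches[i])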
EDIT:
After quickly reading the paper you linked, I reaffirm my point about the epochs, since in the paper the results are shown for 100 epochs. Also, from the charts in the paper, we cannot confirm that the author did not have overfitting too. Additionally, the changes in the Xception network are not clear at all. Changing the input layer has an impact on the dimensions of all other layers because of the way the convolution operation works, and this is not discussed in the paper. The operations performed to achieve that output dimension are not clear either. Besides what you did, I would suggest using a pooling layer to get the output dimensions that you want. Finally, the paper doesn't explain how the positioning of the plate is guaranteed. I would try to get more details about this paper that you are trying to reproduce, to be sure that you are not missing anything in your model.
I have been working on a character detection + recognition problem for an industrial application. From my experience, using only a deep CNN and dense layers to predict character classes is not the best approach to this problem. There are good research papers on the scene text recognition problem; one common approach to designing a character recognition pipeline is to have:
any deep CNN model like VGG, ResNet or EfficientNet to extract the image features.
Then add some RNN layers on top of the CNN backbone to get the character sequence from the extracted features. This is a great plus if you want to predict variable-length character sequences.
After getting the character sequence from the RNN layers, the next step is to decode it. For this you can use either a CTC-based method or an attention mechanism. Both have their own pros and cons: CTC-based methods are fast but perform a bit worse, while attention-based models give good results but are very slow. So the choice of method depends on your requirements.
The image below, from the well-known text recognition paper CRNN, gives a general idea of the above steps.
[figure: CRNN architecture – convolutional feature extraction, recurrent sequence modelling, and a transcription (CTC) layer]
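A very rough Keras sketch of that CNN → RNN → per-timestep softmax pipeline (the input size, filter counts and 36-symbol alphabet are assumptions; the CTC loss itself would be attached as in the first question above):

import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(64, 256, 1))               # grayscale plate crop
x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D((2, 2))(x)                         # -> (16, 64, 128)
x = layers.Permute((2, 1, 3))(x)                           # width becomes the time axis
x = layers.Reshape((64, 16 * 128))(x)
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
outputs = layers.Dense(36 + 1, activation="softmax")(x)    # +1 for the CTC blank
crnn = tf.keras.Model(inputs, outputs)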
For training the model, @Hemerson has given good suggestions. Try to build and train this type of model in multiple stages, and I am sure you will get better results :)
Best regards!

Why am I getting 100% accuracy using feed-forward neural networks for separate training, validation, and testing datasets in Keras?

Today I was working on a classifier to detect whether or not a mushroom was poisonous given its features. The data was in a .csv file (read into a pandas DataFrame), and the link to the data can be found at the end.
I used scikit-learn's train_test_split function to split the data into training and testing sets.
I then removed the column that specified whether or not the mushroom was poisonous from the training and testing data and assigned it to the yTrain and yTest variables as labels.
I then applied a one-hot-encoding (Using pd.get_dummies()) to the data since the parameters were categorical.
After this, I normalized the training and testing input data.
Essentially, the training and testing input data were distinct lists of one-hot-encoded parameters, and the output data was a list of ones and zeroes representing the output (one meant poisonous, zero meant edible).
I used Keras and a simple feed-forward network for this project. The network comprises three layers: a Dense layer (a Linear layer, for PyTorch users) with 300 neurons, a Dense layer with 100 neurons, and a Dense layer with two neurons, each representing the probability that the given mushroom parameters signified it was poisonous or edible. Adam was the optimizer I used, and sparse categorical crossentropy was my loss function.
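In code, the described setup looks roughly like the following sketch (the layer sizes come from the description above; xTrain is a placeholder name for the one-hot-encoded training inputs):

import tensorflow as tf

n_features = xTrain.shape[1]  # number of one-hot encoded input columns (assumed)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(300, activation="relu", input_shape=(n_features,)),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])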
I trained my network for 60 epochs. After about 5 epochs the loss was basically zero, and my accuracy was 1. After training, I was worried that my network had overfitted, so I tried it on my distinct testing data. The results were the same as the training and validation data; the accuracy was at 100% and my loss was negligible.
My validation loss at the end of 50 epochs is 2.258996e-07, my training loss is 1.998715e-07, and my testing loss was 4.732502e-09. I am really confused by this: is the loss supposed to be this low? I don't think I am overfitting, and my validation loss is only a bit higher than my training loss, so I don't think I am underfitting either.
Do any of you know the answer to this question? I am sorry if I have messed up in some silly way.
Link to dataset: https://www.kaggle.com/uciml/mushroom-classification
It seems that that Kaggle dataset is solvable, in the sense that you can create a model which gives the correct answer 100% of the time (if those results are to be believed). If you look at those results, you can see that the authors were actually able to find models which give 100% accuracy using several methods, including decision trees.
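As a quick sanity check along those lines, a plain decision tree on the same one-hot features typically reaches 100% held-out accuracy on this dataset (a sketch; it assumes the mushrooms.csv file from the linked Kaggle page, with its "class" column of 'p'/'e' labels):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("mushrooms.csv")
y = (df.pop("class") == "p").astype(int)      # 1 = poisonous, 0 = edible
X = pd.get_dummies(df)                        # one-hot encode the categorical features

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(clf.score(X_test, y_test))              # usually 1.0 on this dataset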

Predicting the percentage accuracy based on limited features

A practice problem, based on whether or not (and with what accuracy/probability) an Uber ride gets completed after being ordered, has the following features:
Available Drivers int64
Placed Time float64
Response Distance float64
Car Type int32
Day Of Week int64
Response Delay float64
Order Completion int32 [target]
My approach has been to use tf.Keras Sequential to predict the target. Here's what it looks like:
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=input_shape),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

adam_optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
binary_crossentropy_loss = tf.keras.losses.BinaryCrossentropy()

model.compile(optimizer=adam_optimizer,
              loss=binary_crossentropy_loss,
              metrics=['accuracy'])

early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=ES_PATIENCE)

history = model.fit(train_dataset, validation_data=validation_dataset, epochs=EPOCHS, verbose=2,
                    callbacks=[early_stop])
I normalize the data like this (note that train_data is a dataframe):
train_data = tf.keras.utils.normalize(train_data)
And then for predicting,
predictions = model.predict_proba(prediction_dataset, batch_size=None)
Training results:
loss: 0.3506 - accuracy: 0.8817 - val_loss: 0.3493 - val_accuracy: 0.8773
But this still gives me a poor-quality probability for the corresponding outcome. Is this the wrong approach?
What approach would you suggest for a problem like this, and am I doing it completely wrong? Are neural networks a bad idea for this solution? Thanks a lot!
As you framed the problem, this is a classic machine learning classification problem.
Given N features (independent variables), you have to predict 1 (one) dependent variable.
The way in which you constructed the neural network is good.
Since you have a binary classification problem, the sigmoid activation is the correct one.
With respect to the complexity of your model (number of layers, number of neurons per layer) it depends very much on your dataset.
If you have a comprehensive dataset with a lot of features and a lot of examples (an example is a row in the dataframe with X1, X2, X3, ..., Y, where the X are the features and Y is the dependent variable), your model can vary in complexity.
If you have a small dataset with a few features, a small model is recommended. Always begin with a small model.
If you run into the issue of underfitting (poor accuracy on the training set and also on the validation and test set), you can gradually increase the complexity of the model (add more layers, add more neurons per layer).
If you run into the issue of overfitting, implementing regularisation techniques may help (Dropout, L1/L2 Regularisation, Noise Addition, Data Augmentation).
What you have to take into consideration is that, if you have a small dataset, a classical machine learning algorithm could outperform the deep learning model. This happens because neural networks are very 'hungry': compared to classical machine learning models, they require much more data in order to work properly. You could choose SVM / kernel SVM / Random Forest / XGBoost and other similar algorithms.
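For comparison, a quick classical baseline is easy to set up; a sketch, assuming hypothetical X_train/y_train and X_val/y_val arrays built from the same features as in the question:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, log_loss

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_val)[:, 1]        # predicted probability of ride completion
print("accuracy:", accuracy_score(y_val, (proba > 0.5).astype(int)))
print("log loss:", log_loss(y_val, proba))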
EDIT!
'Whether or not and with what accuracy/probability' automatically splits the problem into two parts, not just a simple classification one.
What I would personally do is the following. Since probabilities lie between 0% and 100%, if you had a probability as a feature in your X columns (which you don't), then, depending on the number of data points (rows) you have, you could assign a label to each probability band: 1 for (0%, 25%), 2 for (25%, 50%), 3 for (50%, 75%), 4 for (75%, 100%). But that depends exclusively on having that prior probability information (if you had the probability as a feature). Then, if inference gave you label 3, you would know the approximate probability of the ride being completed.
Otherwise, you cannot frame your current problem as both a classification and a probability one.
I hope that I have given you an introductory insight. Happy coding.
If you are doing classification, you may want to look into ensemble methods (forests, boosts, etc.)
If you are calculating probability, you may want to look into probabilistic graphical models (Bayesian networks, etc.)

gensim Doc2Vec vs tensorflow Doc2Vec

I'm trying to compare my implementation of Doc2Vec (via tf) and gensim's implementation. It seems, at least visually, that the gensim one is performing better.
I ran the following code to train the gensim model, and the one below that for the TensorFlow model. My questions are as follows:
Is my tf implementation of Doc2Vec correct? Basically, is it supposed to be concatenating the word vectors and the document vector to predict the middle word in a certain context?
Does the window=5 parameter in gensim mean that I am using two words on either side to predict the middle one? Or is it 5 on either side? The thing is, there are quite a few documents that are smaller than length 10.
Any insights as to why Gensim is performing better? Is my model any different to how they implement it?
Considering that this is effectively a matrix factorization problem, why is the TF model even getting an answer? There are infinite solutions to this since it's a rank-deficient problem. <- This last question is simply a bonus.
Gensim
model = Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=10, hs=0, min_count=2, workers=cores)
model.build_vocab(corpus)

epochs = 100
for i in range(epochs):
    model.train(corpus)
TF
batch_size = 512
embedding_size = 100  # Dimension of the embedding vector.
num_sampled = 10      # Number of negative examples to sample.

graph = tf.Graph()

with graph.as_default(), tf.device('/cpu:0'):
    # Input data.
    train_word_dataset = tf.placeholder(tf.int32, shape=[batch_size])
    train_doc_dataset = tf.placeholder(tf.int32, shape=[batch_size/context_window])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size/context_window, 1])

    # The variables
    word_embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    doc_embeddings = tf.Variable(tf.random_uniform([len_docs, embedding_size], -1.0, 1.0))
    softmax_weights = tf.Variable(tf.truncated_normal([vocabulary_size, (context_window+1)*embedding_size],
                                                      stddev=1.0 / np.sqrt(embedding_size)))
    softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))

    ###########################
    # Model.
    ###########################
    # Look up embeddings for inputs and stack words side by side
    embed_words = tf.reshape(tf.nn.embedding_lookup(word_embeddings, train_word_dataset),
                             shape=[int(batch_size/context_window), -1])
    embed_docs = tf.nn.embedding_lookup(doc_embeddings, train_doc_dataset)
    embed = tf.concat(1, [embed_words, embed_docs])

    # Compute the softmax loss, using a sample of the negative labels each time.
    loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(softmax_weights, softmax_biases, embed,
                                                     train_labels, num_sampled, vocabulary_size))

    # Optimizer.
    optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
Update:
Check out the jupyter notebook here (I have both models working and tested in here). It still feels like the gensim model is performing better in this initial analysis.
Old question, but an answer would be useful for future visitors. So here are some of my thoughts.
There are some problems in the tensorflow implementation:
window is the one-side size, so window=5 would be 5*2+1 = 11 words.
Note that with PV-DM version of doc2vec, the batch_size would be the number of documents. So train_word_dataset shape would be batch_size * context_window, while train_doc_dataset and train_labels shapes would be batch_size.
More importantly, sampled_softmax_loss is not negative_sampling_loss. They are two different approximations of softmax_loss.
So for the OP's listed questions:
This implementation of doc2vec in tensorflow is working and correct in its own way, but it is different from both the gensim implementation and the paper.
window is the one-side size, as said above. If the document size is less than the context size, then the smaller one would be used.
There are many reasons why the gensim implementation is faster. First, gensim is heavily optimized; all operations are faster than naive Python operations, especially data I/O. Second, some preprocessing steps, such as min_count filtering in gensim, reduce the dataset size. More importantly, gensim uses negative_sampling_loss, which is much faster than sampled_softmax_loss; I guess this is the main reason.
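If you want the TF graph to be closer in spirit to word2vec/doc2vec negative sampling, one option in the TF1 API is tf.nn.nce_loss; a rough sketch, reusing the variables from the question's graph (NCE is not identical to word2vec's negative sampling, but it is the closest built-in):

loss = tf.reduce_mean(tf.nn.nce_loss(weights=softmax_weights,
                                     biases=softmax_biases,
                                     labels=train_labels,
                                     inputs=embed,
                                     num_sampled=num_sampled,
                                     num_classes=vocabulary_size))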
Is it easier to find somethings when there are many of them? Just kidding ;-)
It's true that there are many solutions in this non-convex optimization problem, so the model would just find a local optimum. Interestingly, in neural network, most local optima are "good enough". It has been observed that stochastic gradient descent seems to find better local optima than larger batch gradient descent, although this is still a riddle in current research.
