Debug python tensorflow issue

Debug python tensorflow issue - python

I am working on a audio set to train a neural network using tensorflow library but there is a weird issue that I can't figure out. So I am following this blog Urban Sound Classification, the only difference is that I have my own dataset.
So everything is working fine if I have small data like about 30 audio files or so but when I use the complete data my training code simply runs couple of iterations outputs cost and then that is about it, no error, exception or warning is thrown the tensorflow session simply doesn't give any further results. Let's see the code to better explanation:
with tf.Session() as sess:
sess.run(init)
for epoch in range(training_epochs):
_,cost = sess.run([optimizer,cost_function],feed_dict={X:tr_features,Y:tr_labels})
cost_history = np.append(cost_history,cost)
y_pred = sess.run(tf.argmax(y_,1),feed_dict={X: ts_features})
y_true = sess.run(tf.argmax(ts_labels,1))
print("Test accuracy: ",round(session.run(accuracy,
feed_dict={X: ts_features,Y: ts_labels}),3))
So when I run the above code for training on complete data (about 9000 files) it generates cost history for about 2 epochs and then stop generation history but the code keeps on executing like normal jus the session.run() stop outputting results. My guess is that due to some exception the session stops but how do I debug this stupid error? I have nothing to go on. Can anyone advise on this?
Note: I am not sure if this is the right forum but point me in right direction I will move the question if need be.
UPDATE 01:
So I have figured some correlation between the amount of data/learning rate and the error. Here is my understanding of what is happening. So when I was coding I used subset of my original data about 10-15 files for training and the learning rate was 0.01 and it worked well (as in it completed all it's epochs).
When I used 500 files for training it repeated the same behavior as described in original question (it would output 2 iterations and then kaboom not more outputs and no exception or error). I noticed in those iterations that cost was increasing so I tried to lower the learning rate, and viola it worked like a charm with a new learning rate of 0.001. (again all epochs ran and successfully outputted the results)
Finally when I run the training for all of my data that is about 9000 files but I observed the same behavior as previously discussed. So my question now is how much should I lower the learning rate? What is the correlation of the learning rate to the amount of data?

Related

Keras model.predict function not giving similar results as model.evaluate

I have trained an image classification model using Keras. The model after training has 95% accuracy on training data and using model.evaluate on an untouched validation data, I get ~92.8% accuracy.
But when I use model.predict function instead to get the prediction probabilities and get the predicted class with maximum probability, I get ~80% accuracy.
The complete code is available as a colab notebook on the following link - https://colab.research.google.com/drive/1RQ2KnT2sVsdCAWfpsDj_kcMZiqiwJrpc?usp=sharing
You should be able to run everything and see the difference in accuracy. The problem lies in the code blocks as shown below

To make both the accuracies from predict_generator and evaluate_generator same, you have to set the following 3 things in your functions as parameters:
shuffle = False
pickle_safe = True
workers = 1
Your program might be running on different threads and these settings make it run on the main thread.

The solution I could find so far after having posted the issue here and keras official github (without any answer for weeks) is that instead of using Keras, I used tf.keras. Most of the implementation stayed the same. And the "Shuffle" option is definitely messing up the accuracy. The lower accuracy with "Shuffle = False" is a bug in the keras implementation probably. The tf.keras implementation gives the same result in the "evaluate_generator" function. And the predict and evaluate function outputs with respect to accuracy match. I hope if other people encounter this error, they don't waste as much time as I did on the issue.

Tensorflow output to terminal is filled with equals sign

I'm currently learning Tensorflow using the fashion_mnsit dataset. I created a simple neural network with 3 layers, trained the neural net for 10 epochs and then evaluated to unseen data.
My issue arises when I run the script in the terminal(windows). It displays the progress of each epoch with the "loading bar" represented by:
"[===========>.....] "
But once the training finishes. The terminal screen is completely filled with "================" and then at the very end, the result.
Mine:
https://imgur.com/a/KUY8QjQ
Expected:
https://imgur.com/a/P3rh7yA
This is detrimental as I cannot analyze the progression of the model over the epochs.
This is using Tensorflow 2.0 on Python v3.7, 64bit, Windows 10.
Any help appreciated.

I ran into the same issue and found your question, but no answers.
I took a guess and added this to my evaluate:
results = model.evaluate(test_data, test_labels, verbose=0)
The verbose = 0 seems to have resolved the issue for me. No more equal signs.

Why does more epochs make my model worse?

Most of my code is based on this article and the issue I'm asking about is evident there, but also in my own testing. It is a sequential model with LSTM layers.
Here is a plotted prediction over real data from a model that was trained with around 20 small data sets for one epoch.
Here is another plot but this time with a model trained on more data for 10 epochs.
What causes this and how can I fix it? Also that first link I sent shows the same result at the bottom - 1 epoch does great and 3500 epochs is terrible.
Furthermore, when I run a training session for the higher data count but with only 1 epoch, I get identical results to the second plot.
What could be causing this issue?

A few questions:
Is this graph for training data or validation data?
Do you consider it better because:
The graph seems cool?
You actually have a better "loss" value?
If so, was it training loss?
Or validation loss?
Cool graph
The early graph seems interesting, indeed, but take a close look at it:
I clearly see huge predicted valleys where the expected data should be a peak
Is this really better? It sounds like a random wave that is completely out of phase, meaning that a straight line would indeed represent a better loss than this.
Take a look a the "training loss", this is what can surely tell you if your model is better or not.
If this is the case and your model isn't reaching the desired output, then you should probably make a more capable model (more layers, more units, a different method, etc.). But be aware that many datasets are simply too random to be learned, no matter how good the model.
Overfitting - Training loss gets better, but validation loss gets worse
In case you actually have a better training loss. Ok, so your model is indeed getting better.
Are you plotting training data? - Then this straight line is actually better than a wave out of phase
Are you plotting validation data?
What is happening with the validation loss? Better or worse?
If your "validation" loss is getting worse, your model is overfitting. It's memorizing the training data instead of learning generally. You need a less capable model, or a lot of "dropout".
Often, there is an optimal point where the validation loss stops going down, while the training loss keeps going down. This is the point to stop training if you're overfitting. Read about the EarlyStopping callback in keras documentation.
Bad learning rate - Training loss is going up indefinitely
If your training loss is going up, then you've got a real problem there, either a bug, a badly prepared calculation somewhere if you're using custom layers, or simply a learning rate that is too big.
Reduce the learning rate (divide it by 10, or 100), create and compile a "new" model and restart training.
Another problem?
Then you need to detail your question properly.

How long does tensorflow object detection API train.py complete training using CPU only?

I am a beginner in machine learning. Recently, I had successfully running a machine learning application using Tensorflow object detection API.
My dataset is 200 images of object with 300*300 resolution. However, the training had been running for two days and yet to be completed.
I wonder how long would it take to complete a training?? At the moment it is running at global step 9000, how many global step needed to complete the training?
P.S: the training used only CPUs

It depends on your desired accuracy and data set of course but I generally stop training when the loss value gets around 4 or less. What is your current loss value after 9000 steps?

To me this sounds like your training is not converging.
See the discussion in the comments of this question.
Basically, it is recommended that you run eval.py in parallel and check how it performs there as well.

Running Tensorflow Predictions code twice does not result same outcome

I am new to tensorflow, so please pardon my ignorance.
I have a tensorflow demo model "from an online tutorial" that should predict stockmarket prices for S&P. When I run the code I get inconsistent results everytime I run it. Training data does not change, I suppressed block shuffling , ...
But, When I run the prediction 2 times in the same run I get consistent results "i.e. use Only one training , run prediction twice".
My questions are:
Why am I getting inconsistent results?
If you are going to release such code to production , would you
just take the last time you ran this model training results? if not, then what would you do?
Does it make sense to force the model to produce consistent predictions? how would
you do that?
Here is my code location github repo

In training a neural network there is more randomness involved than just the batch shuffling. The initial weights of the layers are also randomly initialized.
Typically you would use the best model you have trained so far. To determine which model is the best you usually use some test dataset you did not use during training.
It is probably not a good sign if your performance fluctuates for different training runs. This means your result depends a lot on the random initialization. But I personally don't know about any general techniques to make learning more stable. But there probably are some.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.