I'm currently learning TensorFlow using the fashion_mnist dataset. I created a simple neural network with 3 layers, trained it for 10 epochs, and then evaluated it on unseen data.
My issue arises when I run the script in the terminal (Windows). It displays the progress of each epoch with a "loading bar" like:
"[===========>.....] "
But once the training finishes, the terminal screen is completely filled with "================", and only at the very end does the result appear.
Mine:
https://imgur.com/a/KUY8QjQ
Expected:
https://imgur.com/a/P3rh7yA
This is a problem, as I cannot analyze the model's progression over the epochs.
This is TensorFlow 2.0 on Python 3.7 (64-bit), Windows 10.
Any help appreciated.
I ran into the same issue and found your question, but no answers.
I took a guess and added this to my evaluate call:
results = model.evaluate(test_data, test_labels, verbose=0)
Setting verbose=0 seems to have resolved the issue for me. No more equals signs.
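For reference, a minimal sketch of how the verbose settings fit together, assuming the usual fashion_mnist workflow (model, train_images, etc. stand in for your own objects): verbose=0 silences evaluate(), and verbose=2 in fit() prints one summary line per epoch instead of a live progress bar, which avoids the flood of "=" characters in some Windows terminals.

# verbose=2: one summary line per epoch, no carriage-return progress bar
history = model.fit(train_images, train_labels, epochs=10, verbose=2)

# verbose=0: no output at all during evaluation
results = model.evaluate(test_images, test_labels, verbose=0)
print("Test loss, test accuracy:", results)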
I am currently working on a multi-layer 1D CNN. Recently I shifted my work over to an HPC server to train on both CPU and GPU (NVIDIA).
My code runs beautifully (albeit slowly) on my own laptop with TensorFlow 2.7.3. The HPC server I am using has a newer version of Python (3.9.0) and TensorFlow installed.
Onto my problem: the Keras callback EarlyStopping no longer works as it should on the server. If I set the patience to 5, it will only run for 5 epochs despite specifying epochs=50 in model.fit(). It seems as if the callback assumes the val_loss of the first epoch is the lowest value and counts from there.
I don't know how to fix this. On my own laptop the model would reach its lowest val_loss around epoch 15 and training would stop around epoch 20. On the server, the training time and number of epochs are not sufficient, and accuracy on the test dataset is very low (~40%).
Please help.
For some reason, reducing my batch_size from 16 to 8 in model.fit() allowed the EarlyStopping callback to work properly. I don't know why, but it works now.
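For reference, a minimal sketch of the EarlyStopping setup being discussed, assuming a standard tf.keras model (model, x_train, y_train are placeholders for your own objects); restore_best_weights rolls the model back to the epoch with the best val_loss instead of keeping the weights from the last epoch:

import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",         # quantity to watch
    patience=5,                 # stop after 5 epochs with no improvement
    restore_best_weights=True,  # keep the weights from the best epoch
)

history = model.fit(
    x_train, y_train,
    validation_split=0.2,
    epochs=50,
    batch_size=8,               # the smaller batch size mentioned above
    callbacks=[early_stop],
)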
I have trained an image classification model using Keras. After training, the model has 95% accuracy on the training data, and using model.evaluate on untouched validation data I get ~92.8% accuracy.
But when I use the model.predict function instead to get the prediction probabilities and take the class with the maximum probability, I get ~80% accuracy.
The complete code is available as a colab notebook on the following link - https://colab.research.google.com/drive/1RQ2KnT2sVsdCAWfpsDj_kcMZiqiwJrpc?usp=sharing
You should be able to run everything and see the difference in accuracy. The problem lies in the code blocks shown below.
To make the accuracies from predict_generator and evaluate_generator match, you have to set the following three parameters in those calls:
shuffle = False
pickle_safe = True
workers = 1
Your program might be running on multiple threads, and these settings force it to run on the main thread. A short sketch of where these parameters go follows below.
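A minimal sketch under the assumption of a directory-based generator like the one in the notebook (the path and image size here are placeholders): shuffle=False belongs on the generator itself, so that predictions stay aligned with the labels, while workers goes on the *_generator calls. Note that in newer Keras versions the pickle_safe argument was renamed use_multiprocessing.

from keras.preprocessing.image import ImageDataGenerator

val_gen = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    "data/validation",        # placeholder path
    target_size=(224, 224),   # placeholder image size
    batch_size=32,
    shuffle=False,            # keep order so predictions line up with labels
)

scores = model.evaluate_generator(val_gen, workers=1, use_multiprocessing=False)
probs = model.predict_generator(val_gen, workers=1, use_multiprocessing=False)
predicted_classes = probs.argmax(axis=1)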
The solution I could find so far, after having posted the issue here and on the official Keras GitHub (without any answer for weeks), is to use tf.keras instead of Keras. Most of the implementation stayed the same. The "shuffle" option is definitely what messes up the accuracy; the lower accuracy with shuffle=False is probably a bug in the standalone Keras implementation. The tf.keras implementation gives the same result in evaluate_generator, and the predict and evaluate outputs now agree on accuracy. I hope that if other people encounter this error, they don't waste as much time as I did on it.
I am working on an audio dataset to train a neural network using the TensorFlow library, but there is a weird issue that I can't figure out. I am following this blog, Urban Sound Classification; the only difference is that I have my own dataset.
Everything works fine with small data, about 30 audio files or so, but when I use the complete data my training code simply runs a couple of iterations, outputs the cost, and then that is about it: no error, exception, or warning is thrown, the TensorFlow session simply doesn't give any further results. Let's look at the code for a better explanation:
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(training_epochs):
        # one full-batch training step; optimizer, cost_function, X, Y are defined earlier
        _, cost = sess.run([optimizer, cost_function],
                           feed_dict={X: tr_features, Y: tr_labels})
        cost_history = np.append(cost_history, cost)
    # evaluate on the test split
    y_pred = sess.run(tf.argmax(y_, 1), feed_dict={X: ts_features})
    y_true = sess.run(tf.argmax(ts_labels, 1))
    print("Test accuracy: ", round(sess.run(accuracy,
                                            feed_dict={X: ts_features, Y: ts_labels}), 3))
So when I run the above code for training on the complete data (about 9000 files), it generates cost history for about 2 epochs and then stops, but the code keeps executing like normal; sess.run() just stops outputting results. My guess is that the session stops due to some exception, but how do I debug this? I have nothing to go on. Can anyone advise on this?
Note: I am not sure if this is the right forum, but point me in the right direction and I will move the question if need be.
UPDATE 01:
So I have figured out some correlation between the amount of data/learning rate and the error. Here is my understanding of what is happening. When I was coding, I used a subset of my original data (about 10-15 files) for training with a learning rate of 0.01, and it worked well (as in it completed all its epochs).
When I used 500 files for training, it repeated the same behavior as described in the original question (it would output 2 iterations and then, kaboom, no more outputs and no exception or error). I noticed in those iterations that the cost was increasing, so I tried lowering the learning rate, and voilà, it worked like a charm with a new learning rate of 0.001 (again, all epochs ran and successfully output the results).
Finally, when I ran the training on all of my data (about 9000 files), I observed the same behavior as previously discussed. So my question now is: how much should I lower the learning rate? What is the correlation between the learning rate and the amount of data?
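Not from the original post, but a minimal sketch of the kind of guard that makes this failure mode visible, reusing the tensor and variable names from the snippet above (cost_function, X, Y, tr_features, tr_labels, training_epochs): watch the cost for NaN/inf so a diverging run, which is what an increasing cost under too large a learning rate typically turns into, fails loudly instead of going silent.

import numpy as np
import tensorflow as tf

learning_rate = 0.001  # larger datasets often need a smaller step to stay stable
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost_function)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(training_epochs):
        _, cost = sess.run([optimizer, cost_function],
                           feed_dict={X: tr_features, Y: tr_labels})
        if not np.isfinite(cost):
            # the cost has blown up to NaN/inf, so stop instead of training on garbage
            print("Cost diverged at epoch", epoch, "- try a lower learning rate")
            break
        print("Epoch", epoch, "cost", cost)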
I am a beginner in machine learning. Recently, I successfully got a machine learning application running using the TensorFlow Object Detection API.
My dataset is 200 images of the object at 300x300 resolution. However, the training has been running for two days and has yet to complete.
I wonder how long it would take to complete the training. At the moment it is at global step 9000; how many global steps are needed to complete the training?
P.S.: the training uses only CPUs.
It depends on your desired accuracy and dataset, of course, but I generally stop training when the loss value gets to around 4 or less. What is your current loss value after 9000 steps?
To me this sounds like your training is not converging.
See the discussion in the comments of this question.
Basically, it is recommended that you run eval.py in parallel and check how it performs there as well.
I am new to tensorflow, so please pardon my ignorance.
I have a TensorFlow demo model (from an online tutorial) that should predict stock market prices for the S&P. When I run the code, I get inconsistent results every time I run it. The training data does not change, I suppressed block shuffling, ...
But when I run the prediction twice in the same run, I get consistent results (i.e. use only one training, run the prediction twice).
My questions are:
Why am I getting inconsistent results?
If you were going to release such code to production, would you just take the results from the last training run? If not, what would you do?
Does it make sense to force the model to produce consistent predictions? How would you do that?
Here is my code location: github repo
In training a neural network there is more randomness involved than just the batch shuffling. The initial weights of the layers are also randomly initialized.
Typically you would use the best model you have trained so far. To determine which model is the best you usually use some test dataset you did not use during training.
It is probably not a good sign if your performance fluctuates across different training runs; it means your result depends a lot on the random initialization. I personally don't know of any general techniques to make learning more stable, but there probably are some.
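On the reproducibility question, a minimal sketch (not from the original answer) of the usual first step: fix the random seeds so that weight initialization and any shuffling are repeatable from run to run. tf.random.set_seed is the TF 2.x name; in TF 1.x it is tf.set_random_seed.

import random

import numpy as np
import tensorflow as tf

SEED = 42
random.seed(SEED)         # Python's built-in RNG
np.random.seed(SEED)      # NumPy (data shuffling, some initializers)
tf.random.set_seed(SEED)  # TensorFlow ops and layer initializers

Even with fixed seeds, some GPU kernels are non-deterministic, so small run-to-run differences can remain.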