I am optimising a TensorFlow/Keras model in a Jupyter notebook.
Some of the fit runs fail to learn (NaN for both loss and accuracy).
Once NaN has occurred, if I simply re-run the cells that build and train the model, the result is always NaN again.
If I restart the kernel, the next run most likely works fine.
Is there some static/global state in TF/Keras that I need to reset without restarting the kernel?
EDIT: I routinely run this code before building the model, and it does not solve the problem.
tf.keras.backend.clear_session()  # drop Keras' cached graph/session state
tf.keras.backend.get_session().run(tf.global_variables_initializer())  # re-initialise all variables in the fresh session
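For clarity, the broader kind of reset I have in mind looks roughly like this (just a sketch using TF 1.x names, since get_session() above is TF 1.x; it may well miss whatever state actually causes the NaNs):

import numpy as np
import tensorflow as tf

tf.keras.backend.clear_session()  # as above: drop Keras' cached graph/session
tf.reset_default_graph()          # discard the old default graph entirely (TF 1.x)

# re-seed both RNGs so a fresh run does not inherit a degenerate random state
np.random.seed(0)
tf.set_random_seed(0)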
EDIT2: this is a TPU model. It's possible that setting weights on a TPU model has a bug.
I finally got ready to train my custom TensorFlow 1.15 model (OS: Ubuntu 20.04) and ran into this issue during training. At first I was only getting a new checkpoint every ~3000 steps, but I fixed that by changing the save_checkpoints_steps setting in the run_config.py file under /home/mc/anaconda3/envs/tf1/lib/python3.8/site-packages/tensorflow_estimator/python/estimator. When I start training (after the second checkpoint has been created) I get: Skip the current checkpoint eval due to throttle secs (600). No matter what I do, only my first checkpoint is evaluated, so how do I change the eval throttle secs (or something else) so that my checkpoints get evaluated properly?
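For reference, when the training script builds its own Estimator, both settings can be passed in code instead of editing run_config.py inside site-packages. Below is a minimal sketch with plain tf.estimator under TF 1.15 (the tiny dataset, model_fn and the tf115_ckpts directory are made-up placeholders, not the actual model); EvalSpec's throttle_secs defaults to 600, which is where the "throttle secs (600)" message comes from:

import numpy as np
import tensorflow as tf

def input_fn():
    # tiny stand-in dataset; a real script has its own input functions
    x = np.random.rand(100, 4).astype(np.float32)
    y = (x.sum(axis=1) > 2.0).astype(np.int32)
    return tf.data.Dataset.from_tensor_slices(({"x": x}, y)).repeat().batch(16)

def model_fn(features, labels, mode):
    # minimal linear model, only there to make the sketch runnable
    logits = tf.layers.dense(features["x"], 2)
    loss = tf.losses.sparse_softmax_cross_entropy(labels, logits)
    if mode == tf.estimator.ModeKeys.TRAIN:
        train_op = tf.train.AdamOptimizer().minimize(
            loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
    return tf.estimator.EstimatorSpec(mode, loss=loss)

run_config = tf.estimator.RunConfig(
    model_dir="tf115_ckpts",       # example directory
    save_checkpoints_steps=1000)   # checkpoint every 1000 steps instead of by wall-clock time
estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)

train_spec = tf.estimator.TrainSpec(input_fn=input_fn, max_steps=5000)
eval_spec = tf.estimator.EvalSpec(
    input_fn=input_fn,
    throttle_secs=60)              # evaluate at most once per minute instead of every 600 s

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)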
I am trying to print a model summary in TensorFlow, and I think the model is large enough that it crashes my notebook. The model is ResNet101.
The whole computer comes to a halt, memory usage goes up to 99%, and VS Code crashes. I have 16 GB of RAM, so I didn't think printing something large would actually eat all of it. Also, because the kernel crashes, all the variables are lost, such as history = model.fit(), which I need in order to fine-tune the model afterwards. Moreover, I need to print the base_model summary in order to choose which layer to fine-tune from.
Is there a way to print the summary differently, and to save the entire notebook with its variables, so I can continue working? I have checkpoints for the model weights, but I need to keep track of past epochs through history to resume training afterwards.
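One way to sidestep the crash, sketched below assuming a standard tf.keras setup (the file names resnet101_summary.txt and history.json are just examples): summary() accepts a print_fn, so the text can be streamed straight to a file instead of the notebook output, and the history.history dict can be dumped to JSON right after fit() so the per-epoch metrics survive a kernel crash:

import json
import tensorflow as tf

base_model = tf.keras.applications.ResNet101(
    weights=None, include_top=False, input_shape=(224, 224, 3))

# stream the summary to a file line by line instead of rendering it in the notebook
with open("resnet101_summary.txt", "w") as f:
    base_model.summary(print_fn=lambda line: f.write(line + "\n"))

# after training, persist the per-epoch metrics so they survive a crash:
# history = model.fit(...)
# with open("history.json", "w") as f:
#     json.dump(history.history, f)

Reading history.json back gives the same per-epoch lists that history.history held, which is enough to decide where to resume fine-tuning from.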
I have built a preliminary ML (PySpark) model with sample data on my PC (Windows) and the accuracy is around 70%. After persisting the model binary on disk, I read it from a different Jupyter notebook and the accuracy is again somewhere near 70%. But if I do the same thing on our cluster (MapR/Unix), the accuracy drops to 10-11% after reading the model binary from disk (the dataset is also exactly the same). I got the same issue with the full dataset as well (just for information).
Since the cluster runs Unix, I also tried training, persisting and testing the model in a Docker container (Unix), and there was no issue there. The problem occurs only on the cluster.
I have been scratching my head since then about what might be causing this and how to resolve it. Please help.
Edit:
It's a classification problem and I have used pyspark.ml.classification.RandomForestClassifier.
To persist the models I am simply using the standard setup:
model.write().overwrite().save(model_path)
And to load the model:
model = pyspark.ml.classification.RandomForestClassificationModel.load(model_path)
I have used StringIndexer, OneHotEncoder, etc. in the model and have also persisted them on disk in order to use them in the other Jupyter notebook (the same way as the main model).
Edit:
Python: 3.x
Spark: 2.3.1
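For reference, a minimal sketch of an alternative packaging (made-up column names, toy data and a /tmp path, not the actual job): bundling the StringIndexer/OneHotEncoder stages and the classifier into one Pipeline and persisting the fitted PipelineModel as a single unit, so the exact same fitted transformers travel with the model to the cluster:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.getOrCreate()

# toy data; the real job reads its own DataFrame
df = spark.createDataFrame(
    [("a", 1.0, 0.0), ("b", 0.0, 1.0), ("a", 1.5, 0.0), ("c", 0.2, 1.0)],
    ["category", "amount", "label"])

indexer = StringIndexer(inputCol="category", outputCol="category_idx")
encoder = OneHotEncoder(inputCol="category_idx", outputCol="category_vec")
assembler = VectorAssembler(inputCols=["category_vec", "amount"], outputCol="features")
rf = RandomForestClassifier(featuresCol="features", labelCol="label")

pipeline_model = Pipeline(stages=[indexer, encoder, assembler, rf]).fit(df)

model_path = "/tmp/rf_pipeline_model"   # example path
pipeline_model.write().overwrite().save(model_path)

# in the other notebook / on the cluster:
restored = PipelineModel.load(model_path)
restored.transform(df).select("prediction").show()

This rules out any mismatch between separately saved feature transformers and the classifier across environments.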
I'm trying to train an SSD MobileNet v2 using the TensorFlow Object Detection API with TensorFlow GPU. Training goes well and fast until the first checkpoint save (after a few hundred steps), where it gets stuck after restoring the last checkpoint. GPU usage drops and never comes back up. Sometimes Python itself crashes.
I'm running TensorFlow GPU on Windows 7 with an NVIDIA Quadro M4000 and CUDA 8.0 (the only version I managed to get working). The model is an SSD MobileNet v2 pretrained on COCO, trained with a very low batch size of 4.
The config file is the one that ships with the TensorFlow Model Zoo, with only the paths, batch size, number of classes and number of steps changed, and shuffle: true added to the training section.
I'm attaching the terminal output; this is where it gets stuck.
Has anyone experienced the same kind of problem, or does anyone have an idea why this happens?
Thanks in advance
I faced the same problem you describe. I waited a long time and found something interesting: I eventually got some evaluation results, and the training process continued after that. It seems the evaluation step simply takes a very long time; since it produces no output at the beginning, it just looks as if training is stuck. Maybe changing the parameter 'sample_1_of_n_eval_examples' will help. I'm still trying...
I have created the following Jupyter notebook. I used 200 epochs for training, but there was a power failure at epoch 110, and now I have to restart from zero. All my hard work was wasted.
Have a look at the Jupyter notebook I created.
What should I do to save the trained model and pick out the best version of it? I want to be able to restore the saved model even if there is a power failure.
How do I save and restore the model successfully? Currently I am training for 200 epochs, but I plan to move to 1000 soon.
If possible, please also suggest how I can move to Keras. I have heard it is simpler to understand compared to TensorFlow.
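On the Keras side, a minimal sketch of per-epoch checkpointing and resuming with tf.keras.callbacks.ModelCheckpoint (the toy data, the two-layer model and the checkpoints/ paths are placeholders, assuming a TF 2.x-style tf.keras; the notebook's own model and data would go in their place):

import os
import numpy as np
import tensorflow as tf

# placeholder data; substitute the notebook's real training set
x = np.random.rand(1000, 20).astype("float32")
y = (x.sum(axis=1) > 10).astype("float32")

def build_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1, activation="sigmoid")])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

os.makedirs("checkpoints", exist_ok=True)
ckpt_path = "checkpoints/model_epoch_{epoch:03d}.h5"   # example location

# save the full model (architecture + weights + optimizer state) after every epoch,
# so a power failure only loses the epoch that was in progress
ckpt_cb = tf.keras.callbacks.ModelCheckpoint(ckpt_path, save_freq="epoch")

model = build_model()
model.fit(x, y, epochs=200, callbacks=[ckpt_cb])

# after an interruption at, say, epoch 110: load the last saved file and continue,
# telling fit() which epoch to start counting from
model = tf.keras.models.load_model("checkpoints/model_epoch_110.h5")
model.fit(x, y, initial_epoch=110, epochs=200, callbacks=[ckpt_cb])

ModelCheckpoint also takes monitor and save_best_only arguments if only the best-scoring epoch should be kept.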