I have created the following Jupyter notebook. I set training to run for 200 epochs, but there was a power failure when it had reached epoch 110. Now I have to restart from zero, and all my hard work is wasted.
Have a look at the Jupyter notebook I created.
What should I do to save the trained model and keep the best version of it? I want to be able to restore the saved model even if there is another power failure.
How do I save and restore the model reliably? Currently I am training for 200 epochs, but I plan to move to 1000 soon.
If possible, please also suggest how I can move to Keras. I have heard it is simpler to understand compared to TensorFlow.
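(Not from the original notebook, but a minimal Keras-style sketch of the checkpoint-and-resume idea; build_model(), x_train and y_train are hypothetical placeholders.)

import os
import tensorflow as tf

checkpoint_dir = "checkpoints"  # hypothetical location
checkpoint_path = checkpoint_dir + "/model-{epoch:03d}.h5"
os.makedirs(checkpoint_dir, exist_ok=True)

model = build_model()  # assumed helper that returns a compiled Keras model

# Save the weights after every epoch, so a power failure costs at most the epoch in progress.
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    checkpoint_path,
    save_weights_only=True,
    save_best_only=False,
)

model.fit(x_train, y_train, epochs=200, callbacks=[checkpoint_cb])

# After a crash: rebuild the model, load the last completed checkpoint, and continue
# with initial_epoch so the run still ends at epoch 200 (or 1000 later on).
model = build_model()
model.load_weights(checkpoint_dir + "/model-110.h5")  # e.g. the last epoch that finished
model.fit(x_train, y_train, epochs=200, initial_epoch=110, callbacks=[checkpoint_cb])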
I am trying to print a model summary in TensorFlow, and I think the model is so large that it crashes my notebook. The model is ResNet101.
The whole computer comes to a halt, memory usage goes up to 99%, and VS Code crashes. I have 16 GB of RAM, so I didn't think printing something large would actually eat all of it. Also, because the kernel crashes, all the variables are lost, such as history = model.fit(), which I need in order to fine-tune the model afterwards. Moreover, I need to print the base_model summary in order to choose which layer to fine-tune from.
Is there another way to print the summary, and can I save the entire notebook with its variables so I can continue working? I have checkpoints for the model weights, but I need to keep track of past epochs through history in order to resume training afterwards.
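(Not from the original post; a sketch of two workarounds: write the summary to a file instead of the notebook output, and persist history.history to disk so past epochs survive a kernel crash. base_model and history are assumed to exist already.)

import json

# Write the very long ResNet101 summary to a text file instead of rendering it in the notebook.
with open("base_model_summary.txt", "w") as f:
    base_model.summary(print_fn=lambda line: f.write(line + "\n"))

# Persist the per-epoch metrics; cast to float in case the values are numpy scalars.
with open("history.json", "w") as f:
    json.dump({k: [float(v) for v in vals] for k, vals in history.history.items()}, f)

# Later, in a fresh kernel:
with open("history.json") as f:
    past_history = json.load(f)
print(past_history.keys())  # e.g. loss, accuracy, val_loss, ...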
I faced a strange problem trying to train a neural network using code from GitHub; it is the Hugging Face conversational model.
What happens: even when I use my own dataset for training, the result stays the same as with the original dataset. My hypothesis is that it is somehow a cache problem: the old dataset keeps getting loaded from the cache and replaces mine.
Then, when I launch an actual interactive session with the network, it works, but without my data, even if I pass the model checkpoint.
Why I suspect the cache: in this repo the author automatically downloads and caches the model in /home/joo/.cache/torch/pytorch_transformers/ if no parameter is specified on the command line.
I have created an issue on GitHub, but I am not sure whether this problem is specific to this repo or whether it is a common problem with retraining neural networks that I am facing for the first time.
https://github.com/huggingface/transfer-learning-conv-ai/issues/36
Some copy-paste from the issue:
I am still curious; I was not able to get my dataset picked up:
I added my personality to the original 200 MB JSON,
trained once more with --dataset_path ./my.json,
and invoked interact.py with the new checkpoint and path: python ./interact.py --model_checkpoint ./runs/Oct08_18-22-53_joo-tf_openai-gpt/ --dataset_path ./my.json
It reports "Gathered 18878 personalities" (not 18879, which it would be with my own included).
I changed the code in interact.py to always pick the first personality:
was: personality = random.choice(personalities)
became: personality = personalities[0]
and this first personality is not mine.
Solved: the issue is specific to this repo; the dataset path is simply hardcoded.
But there is still no answer to why it did not load my dataset the first time.
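(Not part of the original issue; a quick sanity check one could run, assuming the PERSONA-CHAT style layout where each split is a list of dialogs with a "personality" field, to confirm the extra personality is really present in the file passed to training.)

import json

with open("./my.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Count dialogs (one personality each) across all splits, e.g. train/valid;
# the total should be 18879 if the custom one was appended to the original 18878.
total = sum(len(split) for split in data.values())
print("personalities in my.json:", total)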
I have built a preliminary ML (PySpark) model with sample data on my PC (Windows), and the accuracy is around 70%. After persisting the model binary to disk, I read it from a different Jupyter notebook and the accuracy is still somewhere near 70%. But if I do the same thing on our cluster (MapR/Unix), the accuracy after reading the model binary from disk drops to 10-11%, even though the dataset is exactly the same. Just for information, I get the same issue with the full dataset as well.
Since the cluster runs a Unix OS, I tried training, persisting, and testing the model in a Docker container (Unix), and there was no issue there. The issue occurs only on the cluster.
I have been scratching my head since then about what might be causing this and how to resolve it. Please help.
Edit:
It's a classification problem and I have used pyspark.ml.classification.RandomForestClassifier.
To persist the models I am simply using the standard setup:
model.write().overwrite().save(model_path)
And to load the model:
model = pyspark.ml.classification.RandomForestClassificationModel.load(model_path)
I have used StringIndexer, OneHotEncoder, etc. in the model and have also persisted them to disk in order to use them in the other Jupyter notebook (in the same way as the main model).
Edit:
Python: 3.x
Spark: 2.3.1
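(Not from the original post; one way to rule out the feature transformers and the classifier getting out of sync between environments is to fit and persist everything as a single Pipeline. A minimal sketch for Spark 2.3 with hypothetical column names; train_df, test_df and model_path are assumed to exist.)

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

indexer = StringIndexer(inputCol="category", outputCol="category_idx", handleInvalid="keep")
encoder = OneHotEncoderEstimator(inputCols=["category_idx"], outputCols=["category_vec"])
assembler = VectorAssembler(inputCols=["category_vec", "num_feature"], outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

pipeline = Pipeline(stages=[indexer, encoder, assembler, rf])
pipeline_model = pipeline.fit(train_df)

# One artifact to save/load: the indexer mappings travel together with the classifier.
pipeline_model.write().overwrite().save(model_path)

# On the cluster:
loaded = PipelineModel.load(model_path)
predictions = loaded.transform(test_df)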
I am optimising a TensorFlow/Keras model in a Jupyter notebook.
Some of the fit runs fall into not learning (NaN for loss and accuracy).
After a NaN occurs, if I just re-run the cells that create the model and train it again, the result is always NaN.
If I restart the kernel, the next run most likely works fine.
Is there a static variable somewhere in TF/Keras that I need to reset without restarting the kernel?
EDIT: I routinely use this code before building the model, and it does not solve my problem.
tf.keras.backend.clear_session()
tf.keras.backend.get_session().run(tf.global_variables_initializer())
EDIT2: this is a TPU model. It's possible that setting weights on a TPU model has a bug.
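(Not from the original post; a fuller reset routine one might try between runs without restarting the kernel. It is a sketch for TF 1.x style code and may not address TPU-specific state.)

import gc
import numpy as np
import tensorflow as tf

def reset_tf_state(seed=42):
    """Best-effort reset of Keras/TF graph state between training runs."""
    tf.keras.backend.clear_session()   # drop the old graph and layer-name counters
    gc.collect()                       # release Python references to the old model
    np.random.seed(seed)               # re-seed NumPy so weight init is repeatable
    tf.set_random_seed(seed)           # TF 1.x graph seed (tf.random.set_seed in TF 2.x)

# Call before rebuilding the model in the next cell:
reset_tf_state()
# model = build_model(); model.fit(...)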
I am new to TensorFlow, so please pardon my ignorance.
I have a TensorFlow demo model (from an online tutorial) that should predict stock market prices for the S&P. When I run the code, I get inconsistent results every time I run it. The training data does not change, and I suppressed block shuffling, ...
But when I run the prediction twice within the same run, I get consistent results (i.e. one training run, prediction run twice).
My questions are:
Why am I getting inconsistent results?
If you were going to release such code to production, would you just take the results of the last training run? If not, what would you do?
Does it make sense to force the model to produce consistent predictions? How would you do that?
Here is my code location: github repo
In training a neural network, there is more randomness involved than just the batch shuffling: the initial weights of the layers are also randomly initialized.
Typically, you would use the best model you have trained so far. To determine which model is best, you usually use a test dataset that you did not use during training.
It is probably not a good sign if your performance fluctuates across training runs; it means your result depends heavily on the random initialization. I personally don't know of any general techniques to make learning more stable, but there probably are some.
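(Not part of the original answer; one concrete way to remove the run-to-run randomness mentioned above is to fix the random seeds before building the model. A minimal sketch in TF 1.x style; use tf.random.set_seed in TF 2.x.)

import random
import numpy as np
import tensorflow as tf

SEED = 42  # arbitrary, any fixed value works

random.seed(SEED)         # Python's built-in RNG
np.random.seed(SEED)      # NumPy, used by most preprocessing code
tf.set_random_seed(SEED)  # TF 1.x graph-level seed (tf.random.set_seed in TF 2.x)

# Build and train the model only after seeding, so weight initialization is repeatable.
# Note: some GPU ops are still non-deterministic, so results may not match bit-for-bit.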