Can't train model from checkpoint on Google Colab as session expires

Can't train model from checkpoint on Google Colab as session expires - python

I'm using Google Colab for finetuning a pre-trained model.
I successfully preprocessed a dataset and created an instance of the Seq2SeqTrainer class:
trainer = Seq2SeqTrainer(
model,
args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["validation"],
data_collator=data_collator,
tokenizer=tokenizer,
compute_metrics=compute_metrics
)
The problem is training it from last checkpoint after the session is over.
If I run trainer.train(), it runs correctly. As it takes a long time, I sometimes came back to the Colab tab after a few hours, and I know that if the session has crashed I can continue training from the last checkpoint like this: trainer.train("checkpoint-5500")
The checkpoint data does no longer exist on Google Colab if I come back too late, so even though I know the point the training has reached, I will have to start all over again.
Is there any way to solve this problem? i.e. extend the session?

To fix your problem try adding a full fixed path, for example for your google drive and saving the checkpoint-5500 to it.
Using your trainer you can set the output directory as your Google Drive path when creating an instance of the Seq2SeqTrainingArguments.
When you come back to your code, if the session is indeed over you'll just need to load your checkpoint-5500 from your google drive instead of retraining everything.
Add the following code:
from google.colab import drive
drive.mount('/content/drive')
And then after your trainer.train("checkpoint-5500") is finished (or as it's last step) save your checkpoint to your google drive.
Or if you prefer, you can add a callback inside your fit function in order to save and update after every single epoch (that was if for some reason the session is crashing before it finish you'll still have some progress saved).

Related

How do I find my trained model trained by "kaggle.com"

I trained a model use kaggle.com
The final code is:
history = model.fit(dataset, epochs=1)
model.save("/kaggle/working/039_model.h5")
print("Model saved successfully!")
The output is:
31368/31368 [==============================] - 23489s 749ms/step - loss: 1.4623
Model saved successfully!
For test, I just trained it for 1 epoch, but I cannot find my model in the /kaggle/working directory in the right side bar. Even if I click the refresh button or refresh the page.
The page picture is :
my problem picture
Thanks for your help!

The refresh button of browser refreshes the whole environment that you were working on, all variables that were set, any file that were saved in storage instance allotted to you for that particular session.
You are given an instance of a server resource, i.e., RAM, few GB of temporary storage, CPU/GPU for a certain period of time. Refreshing the browser starts completely new instance losing all local changes to that instance.
So your model even though it might be saved in the local storage of
that instance (on server), it will get deleted once that session
expires (after 20 minutes of inactivity) or when you run out of quota
or when you refreshes your page.
Solution is to don't refresh your page, look where the model is saved by exploring in side bar by clicking dropdown button and downloading the model for future use case.
Side Note: Committing your notebook will only commit, i.e., save that particular version of your code in the notebook, and will not save anything that you saved to the local storage of that instance, like in your case a saved model file.

Well, maybe I know what happend, I just find out I should save and run all.
File -> save version -> save & run all
so we can get all we want.

It is application manage, it will save on target directory except you using checkpoint callback !
savedir = 'F:\\models\\save\\FlappyBird_15'
checkpoint_path = "F:\\models\\checkpoint\\FlappyBird_15\\TF_DataSets_01.h5"
if not exists(checkpoint_dir) :
os.mkdir(checkpoint_dir)
print("Create directory: " + checkpoint_dir)
### model initialize ###
model = tf.keras.models.Sequential([
tf.keras.layers.InputLayer(input_shape=(1200, 1)),
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True, return_state=False)),
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
])
model.add(layers.Flatten())
model.add(layers.Dense(64))
model.add(layers.Dense(2))
model.summary()
########################
if exists(checkpoint_path) :
model.load_weights(checkpoint_path)
print("model load: " + checkpoint_path)
input("Press Any Key!")
model.save_weights(checkpoint_path)

How to load tf.keras model directly from cloud bucket?

I try to load tf.keras model direcly from cloud bucket but I can't see easy wat to do it.
I would like to load whole model structure not only weights.
I see 3 possible directions:
Is posssible to load keras model directly from Google cloud bucket? Command tf.keras.model.load_model('gs://my_bucket/model.h5') doesn't work
I tried to use tensorflow.python.lib.ii.file_io but I don't know how to load this as model.
I copied model to local directory by gsutil cp command but I don't know how to wait until operation will be complete. tf try to load model before download operation is complete so the errors occurs
I will be thankful for any sugestions.
Peter

Load the file from gs storage
from tensorflow.python.lib.io import file_io
model_file = file_io.FileIO('gs://mybucket/model.h5', mode='rb')
Save a temporary copy of the model locally
temp_model_location = './temp_model.h5'
temp_model_file = open(temp_model_location, 'wb')
temp_model_file.write(model_file.read())
temp_model_file.close()
model_file.close()
Load model saved locally
model = tf.keras.models.load_model(temp_model_location)

Can I pickle a tensorflow model?

Will I be able to pickle all the .meta, .data and checkpoint files of a tensorflow model? That's because I want to run a prediction on my model and if i deploy it , the files can't be on disk right? I know about the tensorflow serving but I don't really understand it. I want to be able to load the tensforflow files without accessing the drive all the time.

Using pickle is not recommended. Instead, they have created a new format called "SavedModel format" that serves this exact purpose.
See: https://www.tensorflow.org/guide/saved_model

Uploading Csv FIle Google colab

So I have a 1.2GB csv file and to upload it to google colab it is taking over an hour to upload. Is that normal? or am I doing something wrong?
Code:
from google.colab import files
uploaded = files.upload()
df = pd.read_csv(io.BytesIO(uploaded['IF 10 PERCENT.csv']), index_col=None)
Thanks.

files.upload is perhaps the slowest method to transfer data into Colab.
The fastest is syncing using Google Drive. Download the desktop sync client. Then, mount your Drive in Colab and you'll find the file there.
A middle ground that is faster than files.upload but still slower than Drive is to click the upload button in the file browser.

1.2 GB is huge dataset and if you upload this huge dataset it take time no question at all. Previously i worked on one of my project and i face this same problem. There are multiple ways to handel this problem.
Solution 1:
Try to get your dataset in google drive and start doing your project in google colab. In colab you can mount your drive and just use file path and it works.
from google.colab import files
uploaded = files.upload()
df = pd.read_csv('Enter file path')
Solution 2:
I believe that you used this dataset for a machine learning project. So for developing the initial model, your first task is to check whether your model is working or not so what you do, you just open your CSV file in Excel and copy the first 500 or 1000 thousand rows and paste into another excel sheet and make small dataset and work with that dataset. Once you find everything is working then uploads your full dataset and train your model on it.
This technique is little bit tedious because you have to take care about EDA and Feature Engineering stuff, when you upload entire 1.2 GB dataset. Apart from that everything is fine and it work.
NOTE: This techinique very helpful when your first priority is performing experiment, because loading huge dataset and then start working is very time comsuming process.

Tensorboard : No dashboards active for current dataset

I am training a neural network for object detection using Google Colab. I wanted to visualize the learning process but every time I try to access tensorboard, it shows me the following:
No dashboards are active for the current data set. Probable causes: - You haven’t written any data to your event files. - TensorBoard can’t find your event files.
I am not training the model locally and have configured my google drive account with the colab notebook for the training data so user hpabst's answer does not seem useful.
I also tried setting up tensorboard using ngrok but that gave me a similar output.
I made sure I am generating summary data in a log directory by creating a summary writer:
import tensorflow as tf
sess = tf.Session()
file_writer = tf.summary.FileWriter('/content/logs/my_log_dir/', sess.graph)
and followed that with
tensorboard = TensorBoard(log_dir="/content/logs/my_log_dir/",batch_size=32, write_graph=True, update_freq='epoch')
model.fit_generator(
train_generator,
steps_per_epoch=(train_data/BS),
epochs=EPOCHS,
validation_data=validation_generator,
validation_steps=(test_data/BS),
callbacks=[tensorboard, checkpoint])
and finally
tensorboard --logdir /content/logs/my_log_dir/
The event files are in place. The path to the log directory is also correct.

Like I said, I was getting the same- No active dashboards error using ngrok. I moved over to the SCALARS menu in the Tensorboard GUI and to the left, under the runs section, at the bottom, I found out that the path to the log directory was being shown as '/content/ log /my_log_dir' although everywhere in my code I had only mentioned the path as -'/content/ logs /my_log_dir'. Maybe setting up tensorboard using ngrok expects the files to be in the 'log' and not the 'logs' directory. I made the change and now it works just fine.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Can't train model from checkpoint on Google Colab as session expires - python

Related

How do I find my trained model trained by "kaggle.com"

How to load tf.keras model directly from cloud bucket?

Can I pickle a tensorflow model?

Uploading Csv FIle Google colab

Tensorboard : No dashboards active for current dataset

Categories

Resources