TensorBoard event files not updating in Google Drive - python

I use Google Colab to train my model, with TensorFlow 2.2.
Here is how I create the summary writer:
writer = tf.summary.create_file_writer(os.path.join(model_dir, 'logs'), max_queue=1)
Here is the code that I run each step:
with writer.as_default():
    tf.summary.scalar("train_loss", total_loss, step=num_steps)
writer.flush()
The problem is that if model_dir is just /content/model_dir, then everything saves fine. But if I save my model to a folder on Google Drive (I mount my Drive with this code:
from google.colab import drive
drive.mount('/content/gdrive2', force_remount=True)
), then the event file doesn't get updated. It is created, but it does not fill with data during training (or even right after training finishes).
As I understand it, the problem is that Google Drive doesn't notice that TensorFlow keeps appending to the event file; only once training is finished does the whole event file eventually appear. What can I do to fix this?
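One workaround often suggested for this symptom (not part of the original question): the Drive mount buffers appended writes unpredictably, so write the event files to fast local disk and periodically copy them over to Drive. A minimal sketch, where both directory paths below are placeholders for your own:
import os
import shutil
import tensorflow as tf

LOCAL_LOG_DIR = '/content/logs'                              # local disk; updates reliably
DRIVE_LOG_DIR = '/content/gdrive2/My Drive/model_dir/logs'   # placeholder Drive path

writer = tf.summary.create_file_writer(LOCAL_LOG_DIR, max_queue=1)

def sync_logs():
    # Mirror the local event files onto the mounted Drive folder.
    # dirs_exist_ok requires Python 3.8+.
    shutil.copytree(LOCAL_LOG_DIR, DRIVE_LOG_DIR, dirs_exist_ok=True)
Calling sync_logs() every few hundred steps keeps the Drive copy reasonably fresh without paying the mount's per-write latency.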

Related

Permanently saving training data in Google Colab

I have 50 GB of training data.
My Google Drive capacity was 15 GB, so I upgraded it to 200 GB and uploaded my training data to my Google Drive.
I connected to Colab, but I could not find my training data in the Colab session, so I manually uploaded it to Colab, which has 150 GB of capacity.
It says the data will be deleted when my Colab connection is closed.
Is it impossible to save training data in Colab permanently? And is Colab free for 150 GB?
Also, I see Colab provides an NVIDIA P4, which costs almost $5000. Can I use it 100%, or is only some portion (like 0.1%) shared with me (when a P4 is assigned)?
The way to do this is to mount your Google Drive into the Colab environment. Assume your files are kept under a folder named myfolder in your Google Drive. This is what I would suggest; do it before you read or write any file:
import os
from google.colab import drive
MOUNTPOINT = '/content/gdrive'
DATADIR = os.path.join(MOUNTPOINT, 'My Drive', 'myfolder')
drive.mount(MOUNTPOINT)
Then, for example, a file bigthing.zip residing under myfolder in your Google Drive will be available in Colab as path = os.path.join(DATADIR, 'bigthing.zip').
Similarly, when you save a file to a path like the above, you will find it in Google Drive under the same directory.
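For instance, a quick round trip using the DATADIR defined above (the notes.txt name is made up for illustration):
# Read a file that already lives in Drive
path = os.path.join(DATADIR, 'bigthing.zip')
print(os.path.exists(path))  # True once the mount is ready

# Write a new file back to Drive
with open(os.path.join(DATADIR, 'notes.txt'), 'w') as f:
    f.write('this persists across Colab sessions')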
In regard to the final questions: you are able to use the GPU 100%, but the restrictions are very inconsistent. Generally, you get only about 8 hours straight before you are kicked off, you must be running code to keep the connection alive, and you can only use a GPU a few times in a row before losing access for a day or so. You can pay for Colab Pro ($10/month), which gives you more access and generally better GPUs.
In my experience, before Colab Pro you could get a top GPU (Tesla P100) about 50% of the time. Now that the Pro version has started, I rarely get a P100 and get kicked off more often. So it can be a bit of a game to get regular use.
Another site that lets you do basically the same thing is https://console.paperspace.com/
They give you only 6-hour shifts on a "notebook", but you won't get kicked off before then, and I can usually get a P5000, which is generally better than what Colab gives me.
https://www.kaggle.com/ will also give you 30 hours per week, so between these services you really could get up to nearly 2 GPU-hours for every hour of the day if you planned your life around it.

TensorBoard Colab UnimplementedError File system scheme '[local]' not implemented

I am using TensorFlow with Keras to train a classifier, and I tried adding TensorBoard as a callback parameter to the fit method. I have installed TensorFlow 2.0 correctly and am also able to load TensorBoard by calling %load_ext tensorboard. I am working on Google Colab and thought I would be able to save the logs to Google Drive during training, so that I can visualize them with TensorBoard. However, when I try to fit the data to the model with the TensorBoard callback, I get this error:
File system scheme '[local]' not implemented (file: '/content/drive/My Drive/KInsekten/logs/20200409-160657/train') Encountered when executing an operation using EagerExecutor.
I initialized the TensorBoard callback like this:
logs_base_dir = "/content/drive/My Drive/KInsekten/logs/"
if not os.path.exists(logs_base_dir):
    os.mkdir(logs_base_dir)
log_dir = logs_base_dir + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensor_board = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1,
                                              write_graph=True, write_images=True)
I was facing the same issue. The problem is that a TPU cannot use the local filesystem; you have to create a separate bucket on Google Cloud Storage and configure the TPU to use it.
Below are two links from the official Google Cloud TPU documentation: the first discusses the main problem, and the second implements the actual solution.
The main problem discussed
Solution to this problem
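A minimal sketch of the Cloud Storage approach (the bucket name gs://my-tpu-logs is made up; the bucket must be writable by the TPU's service account):
import datetime
import tensorflow as tf

# Point the callback at a GCS bucket instead of a local/Drive path;
# TensorFlow understands gs:// URLs natively.
log_dir = "gs://my-tpu-logs/logs/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensor_board = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)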

When I run deep learning training code on Google Colab, do the resulting weights and biases get saved somewhere?

I am training some deep learning code from this repository on a Google Colab notebook. The training is ongoing and seems like it is going to take a day or two.
I am new to deep learning, but here is my question:
Once the Google Colab notebook has finished running the training script, does this mean that the resulting weights and biases are written to a model file somewhere (in the repository folder that I have on my Google Drive), so that I can then run the code on any test data I like at any point in the future? Or, once I close the Google Colab notebook, do I lose the weight and bias information and have to run the training script again if I want to use the neural network?
I realise that this might depend on the details of the script (again, the repository is here), but I thought that there might be a general way that these things work also.
Any help in understanding would be greatly appreciated.
No; Colab comes with no built-in checkpointing; any saving must be done by the user, so unless the repository code does so, it's up to you.
Note that the repo would need to figure out how to connect to a remote server (or to your local device) for data transfer; skimming through its train.py, there's no such thing.
How to save the model? See this SO answer; for a minimal version, the most common and reliable option is to "mount" your Google Drive onto Colab and point the save/load paths there:
from google.colab import drive
drive.mount('/content/drive') # this should trigger an authentication prompt
%cd '/content/drive/My Drive/'
# alternatively, %cd '/content/drive/My Drive/my_folder/'
Once cd'd into, for example, DL Code in your My Drive, you can simply do model.save("model0.h5"), and this will create model0.h5 in DL Code, containing the entire model architecture and its optimizer. For weights only, use model.save_weights().
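To pick the model back up in a later session, a short sketch (model0.h5 comes from above; test_data is a stand-in for your own inputs):
from tensorflow.keras.models import load_model

model = load_model("model0.h5")   # restores architecture, weights, and optimizer state
predictions = model.predict(test_data)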

How to take and restore snapshots of model training on another VM in Google Colab?

There is a 12-hour time limit for training DL models on a GPU, according to Google Colab. Other people have asked similar questions in the past, but there has been no clear answer on how to save and load models halfway through training when the 12-hour limit is exceeded, including saving the number of completed epochs and other parameters. Is there an automated script that saves the relevant parameters and resumes operations on another VM? I am a complete noob; clear-cut answers will be much appreciated.
As far as I know, there is no way to automatically reconnect to another VM when you reach the 12-hour limit, so in any case you have to reconnect manually when the time is up.
As Bob Smith points out, you can mount Google Drive in the Colab VM so that you can save and load data from there. In particular, you can periodically save model checkpoints so that you can load the most recent one whenever you connect to a new Colab VM.
Mount Drive in your Colab VM:
from google.colab import drive
drive.mount('/content/gdrive')
Create a saver in your graph:
saver = tf.train.Saver()
Periodically (e.g. every epoch) save a checkpoint in Drive:
saver.save(session, CHECKPOINT_PATH)
When you connect to a new Colab VM (because of the timeout), mount Drive again in your VM and restore the most recent checkpoint before the training phase:
saver.restore(session, CHECKPOINT_PATH)
...
# Start training with the restored model.
Take a look at the documentation to read more about tf.train.Saver.
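Putting the steps together, a hedged TF1-style sketch (CHECKPOINT_DIR, CHECKPOINT_PATH, and num_epochs are placeholders):
import tensorflow as tf  # TF 1.x API, matching tf.train.Saver above

CHECKPOINT_DIR = '/content/gdrive/My Drive/checkpoints'
CHECKPOINT_PATH = CHECKPOINT_DIR + '/model.ckpt'

saver = tf.train.Saver()
with tf.Session() as session:
    # Resume from the most recent checkpoint if one exists, else start fresh.
    latest = tf.train.latest_checkpoint(CHECKPOINT_DIR)
    if latest:
        saver.restore(session, latest)
    else:
        session.run(tf.global_variables_initializer())
    for epoch in range(num_epochs):
        ...  # run your training ops here
        saver.save(session, CHECKPOINT_PATH)  # one checkpoint per epoch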
Mount Drive, and save and load persistent data from there.
from google.colab import drive
drive.mount('/content/gdrive')
https://colab.research.google.com/notebooks/io.ipynb#scrollTo=RWSJpsyKqHjH
From Colab you can access GitHub, which makes it possible to save your model checkpoints to GitHub periodically. When a session ends, you can start another session and load the checkpoint back from your GitHub repo.
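A rough sketch of that approach with shell commands in Colab cells (the repo URL and file name are placeholders; you need a git identity and push credentials configured, and GitHub rejects files over 100 MB, so this only suits small checkpoints):
!git clone https://github.com/your-user/checkpoints-repo.git
!cp model0.h5 checkpoints-repo/
%cd checkpoints-repo
!git add model0.h5
!git commit -m "periodic checkpoint"
!git push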

Save word2vec model on google drive through colab

I have created a word2vec model using Google Colab. However, when I try to save it using the code I generally use on my computer, the file doesn't appear:
model.init_sims(replace=True)
model_name = "Twitter"
model.save()
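No answer is recorded here, but note that the snippet never passes a path to save() (gensim requires one) and nothing points it at Drive. A hedged sketch of the usual fix, assuming Drive is mounted as elsewhere on this page:
from google.colab import drive
drive.mount('/content/gdrive')

model.init_sims(replace=True)
model_name = "Twitter"
# gensim's save() needs an explicit file path; aim it at the mounted Drive.
model.save('/content/gdrive/My Drive/' + model_name + '.model')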
