Google AI Platform: read training history locally - Python

After training the model, I would like to save the history to my bucket, or to any other location that I can access later from my local machine.
When I run the code below in Google Colab, everything works fine:
import pickle

history = model.fit(training_dataset, steps_per_epoch=steps_per_epoch, epochs=EPOCHS,
                    validation_data=validation_dataset, validation_steps=validation_steps)
# model.summary()
model.save(BUCKET)
# save the history dict as a pickle
with open('/trainHistoryDict', 'wb') as file_pi:
    pickle.dump(history.history, file_pi)
I can read it back later using:
history = pickle.load(open('/trainHistoryDict', "rb"))
However, when I run the code as a job on Google Cloud AI Platform (using %%writefile), I cannot retrieve the history in Google Colab with pickle.load - I get a 'no such directory' error.
So how can I run training on AI Platform on Google Cloud and then access the history in Google Colab?
Can I save history.history in a bucket? I tried to use PACKAGE_STAGING_PATH, but it didn't work.

I found a solution: copy the pickled history file into the bucket with gsutil from inside the training job:
subprocess.Popen('gsutil cp history gs://bigdatapart2-storage/history1', shell=True, stdout=subprocess.PIPE)
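Putting the pieces together, here is a minimal sketch of the whole round trip, assuming the local filename history and the bucket path gs://bigdatapart2-storage/history1 from the snippet above, and assuming the bucket is readable from the Colab notebook:
import pickle
import subprocess

# Inside the AI Platform training job: dump the history dict to a local file...
with open('history', 'wb') as file_pi:
    pickle.dump(history.history, file_pi)

# ...then copy it into the bucket (wait() blocks until the upload finishes).
subprocess.Popen('gsutil cp history gs://bigdatapart2-storage/history1',
                 shell=True, stdout=subprocess.PIPE).wait()

# Later, in Colab: copy the file back from the bucket and unpickle it.
subprocess.Popen('gsutil cp gs://bigdatapart2-storage/history1 history1',
                 shell=True, stdout=subprocess.PIPE).wait()
with open('history1', 'rb') as file_pi:
    history_dict = pickle.load(file_pi)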

Related

Unable to load images from a Google Cloud Storage bucket in TensorFlow or Keras

I have a bucket on Google Cloud Storage that contains images for training a TensorFlow model. I'm using tensorflow_cloud to load the images stored in the bucket called stereo-train, and the full URL of the directory with the images is:
gs://stereo-train/data_scene_flow/training/dat
But when I use this path in the tf.keras.preprocessing.image_dataset_from_directory function, I get the following error in the Google Cloud Console log:
FileNotFoundError: [Errno 2] No such file or directory: 'gs://stereo-train/data_scene_flow/training/dat'
How to fix this?
Code:
GCP_BUCKET = "stereo-train"
kitti_dir = os.path.join("gs://", GCP_BUCKET, "data_scene_flow")
kitti_training_dir = os.path.join(kitti_dir, "training", "dat")
ds = tf.keras.preprocessing.image_dataset_from_directory(
    kitti_training_dir, image_size=(375, 1242), batch_size=batch_size,
    shuffle=False, label_mode=None)
Even when I use the following, it doesn't work:
filenames = np.sort(np.asarray(os.listdir(kitti_train))).tolist()
# Make a Dataset of image tensors by reading and decoding the files.
ds = list(map(lambda x: tf.io.decode_image(tf.io.read_file(kitti_train + x)), filenames))
When I use tf.io.read_file instead of the Keras function, I get the same error. How can I fix this?
If you are using Linux or macOS, you can use Cloud Storage FUSE (gcsfuse), which lets you mount your bucket locally and use it like any other file system. Follow the installation guide and then mount your bucket somewhere on your system, e.g.:
mkdir /mnt/buckets
gcsfuse stereo-train /mnt/buckets
Note that gcsfuse takes the bare bucket name, without the gs:// prefix.
Then you should be able to use the paths from the mount point in your code and load the content from the bucket in Keras.
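For illustration, a sketch of what the loading code from the question might look like against the mount point; the path below simply mirrors the bucket layout from the question and assumes the mount above succeeded, and the batch size is an arbitrary placeholder:
import tensorflow as tf

# Read the images through the gcsfuse mount instead of the gs:// URL.
kitti_training_dir = "/mnt/buckets/data_scene_flow/training/dat"
ds = tf.keras.preprocessing.image_dataset_from_directory(
    kitti_training_dir,
    image_size=(375, 1242),
    batch_size=32,        # placeholder batch size
    shuffle=False,
    label_mode=None)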

TensorBoard event files not updating in google drive

I use Google Colab to train my model, with TensorFlow 2.2.
Here is how I create the summary writer:
writer = tf.summary.create_file_writer(os.path.join(model_dir, 'logs'), max_queue=1)
Here is the code I run at each step:
with writer.as_default():
    tf.summary.scalar("train_loss", total_loss, step=num_steps)
writer.flush()
The problem is that if model_dir is just /content/model_dir, everything saves fine, but if I save my model to a folder on Google Drive (I connect to my Google Drive with this code:
from google.colab import drive
drive.mount('/content/gdrive2', force_remount=True)
), the event file doesn't get updated. It is created, but it does not fill with data during training (and not even after training is finished).
As I understand it, the problem is that Google Drive doesn't notice that TensorFlow updates the event file. But after training is finished, the whole event file is saved. What can I do to fix this?
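One possible workaround, offered as an assumption rather than a verified fix, is to keep the event files on local Colab storage and periodically copy them to the mounted Drive folder, so Drive only ever sees complete files; the paths below are placeholders:
import os
import shutil
import tensorflow as tf

local_log_dir = '/content/model_dir/logs'                     # local, fast storage
drive_log_dir = '/content/gdrive2/My Drive/model_dir/logs'    # mounted Drive folder

writer = tf.summary.create_file_writer(local_log_dir, max_queue=1)

def sync_logs_to_drive():
    # Copy the whole log directory so Drive receives whole-file writes.
    os.makedirs(drive_log_dir, exist_ok=True)
    for name in os.listdir(local_log_dir):
        shutil.copy2(os.path.join(local_log_dir, name), drive_log_dir)

# Call sync_logs_to_drive() every N steps or at the end of each epoch.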

AWS SageMaker: Create an endpoint using a trained model hosted in S3

I have followed this tutorial, which is mainly for Jupyter notebooks, and made some minimal modifications for external processing. I've created a project that can prepare my dataset locally, upload it to S3, train, and finally deploy the model predictor to the same bucket. Perfect!
So, after training the model and saving it in the S3 bucket:
ss_model.fit(inputs=data_channels, logs=True)
it failed while deploying it as an endpoint. I have found tricks to host an endpoint in many ways, but not from a model already saved in S3. In order to host, you apparently need the estimator, which is normally created with something like:
self.estimator = sagemaker.estimator.Estimator(self.training_image,
                                               role,
                                               train_instance_count=1,
                                               train_instance_type='ml.p3.2xlarge',
                                               train_volume_size=50,
                                               train_max_run=360000,
                                               output_path=output,
                                               base_job_name='ss-training',
                                               sagemaker_session=sess)
My question is: is there a way to load an estimator from a model already saved in S3 (.tar.gz)? Or, in any case, to create an endpoint without training again?
So, after digging through many pages, I found a clue here. And I finally figured out how to load the model and create the endpoint:
def create_endpoint(self):
    sess = sagemaker.Session()
    training_image = get_image_uri(sess.boto_region_name, 'semantic-segmentation', repo_version="latest")
    role = "YOUR_ROLE_ARN_WITH_SAGEMAKER_EXECUTION"
    model = "s3://BUCKET/PREFIX/.../output/model.tar.gz"
    sm_model = sagemaker.Model(model_data=model, image=training_image, role=role, sagemaker_session=sess)
    sm_model.deploy(initial_instance_count=1, instance_type='ml.p3.2xlarge')
Please do not forget to delete your endpoint after you are done with it. This is really important! Endpoints are charged for as long as they are running, not only when they are used.
I hope this helps you out too!
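As a sketch of that cleanup step, using the same sess object as above; the endpoint and endpoint-config names are placeholders for whatever names the deploy call produced (they are visible in the SageMaker console):
# Tear the endpoint down so it stops accruing charges.
sess.delete_endpoint('ss-endpoint-name')          # placeholder endpoint name
sess.delete_endpoint_config('ss-endpoint-name')   # placeholder endpoint config name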
Deploy the model using the following code:
model = sagemaker.Model(role=role,
                        model_data=model_data,                # S3 location of the model.tar.gz file
                        image_uri=image_uri,                  # the inference image URI
                        sagemaker_session=sagemaker_session,
                        name=model_name)                      # model name
model_predictor = model.deploy(initial_instance_count=1,
                               instance_type=instance_type)
Initialize the predictor:
model_predictor = sagemaker.Predictor(endpoint_name=model.endpoint_name)
Finally, predict using:
model_predictor.predict(payload)  # your payload
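As a usage sketch (the JSON serializer/deserializer and the payload shape below are assumptions for illustration; use whatever format your inference container expects):
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Attach to the existing endpoint and send a JSON payload.
model_predictor = sagemaker.Predictor(
    endpoint_name=model.endpoint_name,
    sagemaker_session=sagemaker_session,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer())

result = model_predictor.predict({"instances": [[0.1, 0.2, 0.3]]})  # example payload
print(result)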

How to take and restore snapshots of model training on another VM in Google Colab?

According to Google Colab, there is a 12-hour time limit for training DL models on a GPU. Other people have asked similar questions in the past, but there has been no clear answer on how to save and load models halfway through training when the 12-hour limit is exceeded, including saving the number of epochs completed and other parameters. Is there an automated script for me to save the relevant parameters and resume operations on another VM? I am a complete noob; clear-cut answers will be much appreciated.
As far as I know, there is no way to automatically reconnect to another VM when you reach the 12-hour limit. So in any case, you have to manually reconnect when the time is up.
As Bob Smith points out, you can mount Google Drive in Colab VM so that you can save and load data from there. In particular, you can periodically save model checkpoints so that you can load the most recent one whenever you connect to a new Colab VM.
Mount Drive in your Colab VM:
from google.colab import drive
drive.mount('/content/gdrive')
Create a saver in your graph:
saver = tf.train.Saver()
Periodically (e.g. every epoch) save a checkpoint in Drive:
saver.save(session, CHECKPOINT_PATH)
When you connect to a new Colab VM (because of the timeout), mount Drive again in your VM and restore the most recent checkpoint before the training phase:
saver.restore(session, CHECKPOINT_PATH)
...
# Start training with the restored model.
Take a look at the documentation to read more about tf.train.Saver.
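Putting those steps together, here is a minimal sketch of the save/restore loop; the Drive checkpoint folder, the stand-in variable, and the number of epochs are assumptions, and tf.compat.v1 is used because tf.train.Saver is a TF1-style API:
import os
import tensorflow as tf
from google.colab import drive

drive.mount('/content/gdrive')

# Checkpoints live on Drive so they survive the VM being recycled.
CHECKPOINT_DIR = '/content/gdrive/My Drive/checkpoints'      # assumed folder
CHECKPOINT_PATH = os.path.join(CHECKPOINT_DIR, 'model.ckpt')
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

tf.compat.v1.disable_eager_execution()
# ... build your graph here; a single variable as a stand-in ...
global_step = tf.Variable(0, name='global_step')
saver = tf.compat.v1.train.Saver()

NUM_EPOCHS = 10   # assumed
with tf.compat.v1.Session() as session:
    latest = tf.train.latest_checkpoint(CHECKPOINT_DIR)
    if latest:
        saver.restore(session, latest)                       # resume on a fresh VM
    else:
        session.run(tf.compat.v1.global_variables_initializer())

    for epoch in range(NUM_EPOCHS):
        # ... one epoch of training ...
        saver.save(session, CHECKPOINT_PATH, global_step=epoch)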
Mount Drive, and save and load persistent data from there.
from google.colab import drive
drive.mount('/content/gdrive')
https://colab.research.google.com/notebooks/io.ipynb#scrollTo=RWSJpsyKqHjH
From Colab you can access GitHub, which makes it possible to push your model checkpoints to a GitHub repository periodically. When a session ends, you can start another session and pull the checkpoints back from your GitHub repo.
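A sketch of that approach using plain git commands from Python; the repository URL, token, identity, and checkpoint folder are placeholders, and keep in mind that GitHub rejects very large files, so this only suits reasonably small checkpoints:
import subprocess

REPO = 'https://<TOKEN>@github.com/<user>/checkpoints.git'   # placeholder URL with access token
CKPT_DIR = 'checkpoints'                                     # local clone that training writes into

def run(cmd, cwd=None):
    subprocess.run(cmd, shell=True, check=True, cwd=cwd)

# Once per session: clone the repo (on a new VM this also pulls the latest checkpoints).
run(f'git clone {REPO} {CKPT_DIR}')
run('git config user.email "you@example.com"', cwd=CKPT_DIR)  # placeholder identity
run('git config user.name "Your Name"', cwd=CKPT_DIR)

# Periodically during training: commit and push the newest checkpoint files.
def push_checkpoints(message='checkpoint'):
    run('git add -A', cwd=CKPT_DIR)
    run(f'git commit -m "{message}" --allow-empty', cwd=CKPT_DIR)
    run('git push', cwd=CKPT_DIR)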

Save word2vec model on google drive through colab

I have created a word2vec model in Google Colab. However, when I try to save it using the code I generally use on my computer, the file doesn't appear:
model.init_sims(replace=True)
model_name = "Twitter"
model.save(model_name)
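A sketch of one possible approach, assuming a gensim Word2Vec model and that the goal is for the file to show up in Google Drive (the folder name is a placeholder): mount Drive and save to a path inside the mount.
import os
from google.colab import drive

drive.mount('/content/gdrive')

# Save to a path inside the mounted Drive so the file appears in Google Drive.
model_name = "Twitter"
save_path = '/content/gdrive/My Drive/word2vec/' + model_name   # placeholder folder
os.makedirs(os.path.dirname(save_path), exist_ok=True)
model.save(save_path)

# Later, load it back the same way:
# from gensim.models import Word2Vec
# model = Word2Vec.load(save_path)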
