I am using TensorFlow with Keras to train a classifier and I tried adding TensorBoard as a callback parameter to the fit method. I have installed TensorFlow 2.0 correctly and am also able to load TensorBoard by calling %load_ext tensorboard. I am working on Google Colab and thought I would be able to save the logs to Google Drive during training, so that I can visualize them with TensorBoard. However, when I try to fit the data to the model along with the TensorBoard callback, I get this error:
File system scheme '[local]' not implemented (file: '/content/drive/My Drive/KInsekten/logs/20200409-160657/train') Encountered when executing an operation using EagerExecutor.
I initialized the TensorBoard callback like this:
import os
import datetime
import tensorflow as tf

logs_base_dir = "/content/drive/My Drive/KInsekten/logs/"
if not os.path.exists(logs_base_dir):
    os.mkdir(logs_base_dir)
log_dir = logs_base_dir + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensor_board = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1,
                                              write_graph=True, write_images=True)
I was facing the same issue. The problem is that the TPU cannot use the local filesystem; you have to create a separate bucket on Google Cloud Storage and configure the TPU to use it.
Following are two links from the official Google Cloud TPU documentation: the first discusses the main problem, and the second implements the actual solution.
The main problem discussed
Solution to this problem
Related
I have trained a logistic regression model on my local machine, saved it using joblib, and tried deploying it on AWS SageMaker using the "Linear-Learner" image.
The deployment keeps running: the endpoint status always stays at "Creating" and never turns to "InService".
import time
from time import gmtime, strftime

# sm_client (a boto3 SageMaker client) and endpoint_config_name are defined earlier.
endpoint_name = "DEMO-LogisticEndpoint" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(endpoint_name)

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(create_endpoint_response["EndpointArn"])

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)
The while loop keeps executing and the status never changes.
Background: What is important to understand is that the endpoint runs a container that includes the serving software. Each container expects a certain type of model, so you need to make sure your model, and how you package it, matches what the container expects.
Two easy paths forward:
Linear-Learner is a SageMaker built-in algorithm, so a straightforward path would be to train it in the cloud; see this example. That makes it very easy to deploy.
Use scikit-learn logistic regression, which you can train locally and deploy to SageMaker using the scikit-learn container (XGBoost is another easy path); see the sketch after this list.
Otherwise, you can always go more advanced and use any custom algorithm/framework by bringing your own container. Google for existing implementations (e.g., CatBoost/SageMaker).
Please help me with this. I created a dataset and saved it to a file (note: these same steps work without any errors on a Colab GPU runtime; the failure below happens on TPU):
data = tf.data.Dataset.from_tensor_slices((features, labels))
tf.data.experimental.save(data, myfile)
When I try to load it
data = tf.data.experimental.load(myfile)
and run any function on the data, such as len(data), data.batch(16), or data.take(1), I get this error:
NotFoundError: Could not find metadata file. [Op:DatasetCardinality]
TPU config
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
# This is the TPU initialization code that has to be at the beginning.
tf.tpu.experimental.initialize_tpu_system(resolver)
Is it similar to this: [TF1.14][TPU] Can not use custom TFrecord dataset on Colab using TPU?
After some more debugging I got this error:
UnimplementedError: File system scheme '[local]' not implemented (file: './data/temp/2692738424590406024')
Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors. [Op:DatasetCardinality]
I found this explanation:
Cloud TPUs can only access data in GCS, as only the GCS file system is registered.
More info here: File system scheme '[local]' not implemented in Google Colab TPU
task: object_detection
environment: AWS SageMaker
instance type: 'ml.p2.xlarge' | num_instances = 1
Main file to be run: original
Problematic code segment from the main file:
if FLAGS.use_tpu:  # the excerpt starts inside this guard
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
        FLAGS.tpu_name)
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.experimental.TPUStrategy(resolver)
elif FLAGS.num_workers > 1:
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
else:
    strategy = tf.compat.v2.distribute.MirroredStrategy()
Problem : Can't find the proper value to be given as tpu_name argument.
My research on the problem:
According to the TensorFlow documentation for tf.distribute.cluster_resolver.TPUClusterResolver, this resolver works only on Google Cloud Platform:
This is an implementation of cluster resolvers for the Google Cloud TPU service.
TPUClusterResolver supports the following distinct environments:
Google Compute Engine
Google Kubernetes Engine
Google internal
It can be passed into tf.distribute.TPUStrategy to support TF2 training on Cloud TPUs.
But from this issue on GitHub, I found out that similar code also works on Azure.
My question:
Is there a way I can bypass this resolver and initialize my TPU in SageMaker?
Even better, is there a way to pass the name or URL of the SageMaker GPU to the resolver and initialize it from there?
Let me clarify some confusion here. TPUs are only offered on Google Cloud, and the TPUClusterResolver implementation queries GCP APIs to get the cluster config for the TPU node. So no, you can't use TPUClusterResolver with AWS SageMaker; either try it out with TPUs on GCP instead, or look for documentation on SageMaker's end about how they enable cluster resolving (if they do).
I am training some deep learning code from this repository on a Google Colab notebook. The training is ongoing and seems like it is going to take a day or two.
I am new to deep learning, but my question:
Once the Google Colab notebook has finished running the training script, does this mean that the resulting weights and biases will be written to a model somewhere (in the repository folder that I have on my Google Drive), so that I can then run the code on any test data I like at any point in the future? Or, once I close the Google Colab notebook, do I lose the weight and bias information and have to run the training script again if I want to use the neural network?
I realise that this might depend on the details of the script (again, the repository is here), but I thought that there might be a general way that these things work also.
Any help in understanding would be greatly appreciated.
No; Colab comes with no built-in checkpointing, so any saving must be done by the user; unless the repository code does it, it's up to you.
Note that the repo would need code to connect to a remote server (or to your local device) for data transfer; skimming through its train.py, there is no such thing.
How to save the model? See this SO answer; for a minimal version, the most common and reliable option is to "mount" your Google Drive onto Colab and point save/load paths to directories on it:
from google.colab import drive
drive.mount('/content/drive') # this should trigger an authentication prompt
%cd '/content/drive/My Drive/'
# alternatively, %cd '/content/drive/My Drive/my_folder/'
Once cd'd into, for example, DL Code in your My Drive, you can simply do model.save("model0.h5"), and this will create model0.h5 in DL Code, containing the entire model architecture and its optimizer state. For just the weights, use model.save_weights().
When executing the deploy code to SageMaker using the sagemaker-python-sdk, I get this error:
UnexpectedStatusException: Error hosting endpoint tensorflow-inference-eia-XXXX-XX-XX-XX-XX-XX-XXX: Failed. Reason: The image '763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference-eia:1.14-gpu' does not exist..
The code that I am using to deploy is as:
predictor = model.deploy(initial_instance_count=1,
                         instance_type='ml.p2.xlarge', accelerator_type='ml.eia1.medium')
If I remove the accelerator_type parameter, the endpoint gets deployed with no errors. Any idea why this happens? SageMaker seems to be referring to an image that doesn't exist. How do I fix this?
Also, I made sure that the version is supported, from here: https://github.com/aws/sagemaker-python-sdk#tensorflow-sagemaker-estimators. I am on TensorFlow 1.14.
Edit:
Turns out, this works:
predictor = model.deploy(initial_instance_count=1,
                         instance_type='ml.m4.xlarge', accelerator_type='ml.eia1.medium')
So, I am guessing that Elastic Inference is not available for GPU instances?
Note: none of the instances I deploy my endpoint to uses a GPU. (Please suggest some ideas if you are familiar with this or have made it work.)
Elastic Inference Accelerators (EIA) are designed to be attached to CPU endpoints, which is why there is no GPU variant of the tensorflow-inference-eia image and the ml.m4.xlarge deployment succeeds where the ml.p2.xlarge one fails.