I am playing around with TensorFlow, and today I noticed that Google has also open-sourced a Python SDK for Dataflow.
Currently, when I need to train and evaluate several networks in parallel, I either use luigi and run one model training after another, or I use Spark and perform each model training within the map step.
All of this data processing is just one part of the pipeline.
Is there, or is there planned, a way to perform a TensorFlow model training step inside a Dataflow pipeline?
Is there currently some best practice around this?
Or do I have to run each model setting within the map step?
I went through the documentation and for now it seems to be really vague, so I'm asking here if someone has some experience with this.
There is nothing planned at this time.
If you can run the TensorFlow training on a single machine (it sounds like this is what you were doing with Spark), then it should be possible to do the training within a DoFn of a Dataflow pipeline.
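As a rough sketch of the DoFn approach, each pipeline element can be one model configuration, with the DoFn training one model per element. Everything here is an assumption for illustration: train_one_model is a hypothetical stand-in that fits a toy linear model instead of calling TensorFlow, and the config keys (learning_rate, steps) are made up. The import guard only exists so the sketch also runs where Beam is not installed.

```python
try:
    import apache_beam as beam
    _DoFnBase = beam.DoFn
except ImportError:  # lets the sketch run without Beam installed
    beam = None
    _DoFnBase = object


def train_one_model(config):
    """Hypothetical stand-in for single-machine training.

    Fits y = w * x by gradient descent; real code would build and
    train a TensorFlow model from `config` here instead.
    """
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.0, 4.0, 6.0, 8.0]
    w = 0.0
    for _ in range(config["steps"]):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= config["learning_rate"] * grad
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return {"config": config, "weight": w, "loss": loss}


class TrainModelFn(_DoFnBase):
    """Trains one model per input element (one element = one config)."""

    def process(self, element):
        yield train_one_model(element)


# In a pipeline this would look roughly like:
#   with beam.Pipeline() as p:
#       (p
#        | beam.Create([{"learning_rate": 0.01, "steps": 100}])
#        | beam.ParDo(TrainModelFn())
#        | beam.Map(print))
```

Because each element is processed independently, the runner is free to spread the configurations across workers, which is the same shape as training inside a Spark map step.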
We want to tune a SageMaker PipelineModel with a HyperparameterTuner (or something similar) where several components of the pipeline have associated hyperparameters. Both components in our case are realized via SageMaker containers for ML algorithms.
model = PipelineModel(..., models=[our_model, xgb_model])
deploy = Estimator(image_uri=model, ...)
...
tuner = HyperparameterTuner(deploy, ..., tune_parameters, ...)
tuner.fit(...)
Now, there is of course the problem of how to distribute the tune_parameters to the pipeline steps during tuning.
In scikit-learn this is achieved by naming the tuning parameters <StepName>__<ParameterName>.
I don't see a way to achieve something similar with SageMaker, though. Also, searching for the two keywords brings up the same question here, but that one is not really what we want to do.
Any suggestions on how to achieve this?
If both models need to be jointly optimized, you could run a SageMaker HPO job in script mode and define both models in the script. Or you could run two HPO jobs, optimize each model separately, and then create the PipelineModel. There is no native support for running an HPO job on a PipelineModel.
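For the script-mode option, one possible way to route a flat set of tuned hyperparameters to the two models inside the script is to borrow the scikit-learn naming convention from the question. This is only a sketch: split_hyperparameters is a hypothetical helper, the parameter names are made up, and it assumes step names do not themselves contain the separator.

```python
def split_hyperparameters(flat_params, separator="__"):
    """Group '<StepName>__<ParameterName>' entries by step name."""
    grouped = {}
    for name, value in flat_params.items():
        step, sep, param = name.partition(separator)
        if not sep or not step or not param:
            raise ValueError(
                f"expected '<step>{separator}<param>', got {name!r}")
        grouped.setdefault(step, {})[param] = value
    return grouped


# Hypothetical hyperparameters as the tuner might pass them to the script.
tuned = {"our_model__dim": 8, "xgb__max_depth": 5, "xgb__eta": 0.1}
per_step = split_hyperparameters(tuned)
# per_step["our_model"] and per_step["xgb"] can then be handed to the
# code that builds each model.
```

The tuner itself would still see only flat names; the grouping happens inside the training script.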
I work at AWS and my opinions are my own.
I understand that GPT-2 is based on the transformer architecture, but where is the source code? There are limited resources and no tutorial on how to write one.
I am new to NLP. Also, if I wanted to generate novels, would it help to train the transformer on multiple novels, or just one?
I think the best way to train GPT and other transformers is by using the library https://huggingface.co/docs/transformers. They also have a course that can help you familiarize yourself with the topic: https://huggingface.co/course/
Yes, transformer models, if they are not too large, can be trained on Colab.
And yes, GPT-like models can be trained to generate novels, but only short ones (like several paragraphs), because almost all such models can work only with texts of limited length.
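Because of that length limit, a long text such as a novel is typically cut into fixed-size blocks of tokens before training. A minimal sketch of that step, assuming the text has already been tokenized into a list of integer ids; split_into_blocks and its parameters are hypothetical names, not part of any library:

```python
def split_into_blocks(token_ids, block_size, stride=None):
    """Split a long token sequence into fixed-size training blocks.

    `stride` controls how far the window moves each time; a stride
    smaller than `block_size` produces overlapping blocks. A trailing
    remainder shorter than `block_size` is dropped.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if stride is None:
        stride = block_size
    if stride <= 0:
        raise ValueError("stride must be positive")
    last_start = len(token_ids) - block_size
    return [token_ids[start:start + block_size]
            for start in range(0, last_start + 1, stride)]


# Toy example: ten "tokens", blocks of four.
blocks = split_into_blocks(list(range(10)), block_size=4)
overlapping = split_into_blocks(list(range(10)), block_size=4, stride=2)
```

Training on several novels simply means concatenating (or separately chunking) more token sequences before this step.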
Yes, it is possible, and it is better to use a GPU for training. Make sure to adjust num_train_epochs and per_device_train_batch_size (per_gpu_train_batch_size is its older, deprecated name) in TrainingArguments to keep the runtime from crashing with: RuntimeError: CUDA out of memory. The batch size is what drives memory use; num_train_epochs mainly affects how long the run takes.
Otherwise, most of the time training will use up all of the GPU memory and RAM, and the notebook will crash!
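One common way to stay within memory without giving up a larger effective batch is to combine a small per-device batch with gradient accumulation. The arithmetic can be sketched as follows; plan_batching is a hypothetical helper, while per_device_train_batch_size and gradient_accumulation_steps are the TrainingArguments parameter names the returned values are meant for.

```python
import math


def plan_batching(target_batch_size, per_device_batch_size, num_devices=1):
    """Pick accumulation steps so the effective batch reaches the target.

    effective batch = per_device_batch_size * num_devices * accumulation steps
    """
    if min(target_batch_size, per_device_batch_size, num_devices) <= 0:
        raise ValueError("all arguments must be positive")
    per_step = per_device_batch_size * num_devices
    steps = math.ceil(target_batch_size / per_step)
    return {
        "per_device_train_batch_size": per_device_batch_size,
        "gradient_accumulation_steps": steps,
        "effective_batch_size": per_step * steps,
    }


# Example: aim for a batch of 32 when only 4 samples fit on the GPU at once.
plan = plan_batching(target_batch_size=32, per_device_batch_size=4)
```

The first two values could then be passed to TrainingArguments under the same names; if the notebook still runs out of memory, lowering the per-device batch size further is the first thing to try.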
I have finished training a model. What I want to do next is hand the model over to my colleagues, who know nothing about deep learning. I just want to give them a function that they can run without installing TensorFlow or Python on their machines, or maybe with just Python (ideally I would love it to run in MATLAB). Is this doable? How can I abstract all of this away from them?
I read about deployment of models but it's all about servers and stuff, this is not what I want.
PS: assume a TF model for now.
Basically, there's ONNX, which aims to help with deploying trained models.
I do not have much experience with it, but I know it's not always a straightforward procedure.
I would like to evaluate a custom-trained Tensorflow object detection model on a new test set using Google Cloud.
I obtained the initial checkpoints from:
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md
I know that the Tensorflow object-detection API allows me to run training and evaluation simultaneously by using:
https://github.com/tensorflow/models/blob/master/research/object_detection/model_main.py
To start such a job, I submit the following ml-engine job:
gcloud ml-engine jobs submit training [JOBNAME] \
    --runtime-version 1.9 \
    --job-dir=gs://path_to_bucket/model-dir \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,pycocotools-2.0.tar.gz \
    --module-name object_detection.model_main \
    --region us-central1 \
    --config object_detection/samples/cloud/cloud.yml \
    -- \
    --model_dir=gs://path_to_bucket/model_dir \
    --pipeline_config_path=gs://path_to_bucket/data/model.config
However, after I have successfully transfer-trained a model, I would like to calculate performance metrics such as COCO mAP (http://cocodataset.org/#detection-eval) or PASCAL mAP (http://host.robots.ox.ac.uk/pascal/VOC/pubs/everingham10.pdf) on a new test dataset that has not been used before (neither during training nor during evaluation).
I have seen that there is a possibly relevant flag in model_main.py:
flags.DEFINE_string(
    'checkpoint_dir', None, 'Path to directory holding a checkpoint. If '
    '`checkpoint_dir` is provided, this binary operates in eval-only mode, '
    'writing resulting metrics to `model_dir`.')
But I don't know whether this really implies that model_main.py can be run in eval-only mode. If yes, how should I submit the ML Engine job?
Alternatively, are there any functions in the TensorFlow API that allow me to evaluate an existing output dictionary (containing bounding boxes, class labels, and scores) based on COCO and/or Pascal mAP? If there are, I could easily read in a TensorFlow record file locally, run inference, and then evaluate the output dictionary.
I know how to obtain these metrics for the evaluation dataset, which is evaluated during training in model_main.py. However, as I understand it, I should still report model performance on a new test dataset, since I compare multiple models and do some hyperparameter optimization, and thus should not report results on the evaluation dataset. Am I right? On a more general note: I really cannot understand why one would switch from separate training and evaluation (as in the legacy code) to a combined training and evaluation script.
Edit:
I found two related posts. However I do not think that the answers provided are complete:
how to check both training/eval performances in tensorflow object_detection
How to evaluate a pretrained model in Tensorflow object detection api
The latter has been written while TF's object detection API still had separate evaluation and training scripts. This is not the case anymore.
Thank you very much for any help.
If you specify checkpoint_dir and set run_once to true, then it should run evaluation exactly once on the eval dataset. I believe the metrics will be written to model_dir and should also appear in your console logs. I usually just run this on my local machine, since it is only one pass over the dataset and does not need to be a distributed job. Unfortunately, I haven't tried running this particular codepath on CMLE.
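For a local run, the eval-only invocation described above could be assembled like this. It is a sketch, not a tested recipe: build_eval_only_command is a hypothetical helper, all paths are placeholders, and it assumes the eval input reader in the pipeline config has been pointed at the new test set, since that is the data the eval pass reads.

```python
def build_eval_only_command(pipeline_config_path, model_dir, checkpoint_dir):
    """Build the argument list for a single eval-only pass of model_main.py."""
    return [
        "python", "object_detection/model_main.py",
        f"--pipeline_config_path={pipeline_config_path}",
        f"--model_dir={model_dir}",            # where metrics get written
        f"--checkpoint_dir={checkpoint_dir}",  # switches to eval-only mode
        "--run_once=True",                     # one pass instead of looping
    ]


# Placeholder paths; replace them with your own.
command = build_eval_only_command(
    pipeline_config_path="path/to/model.config",
    model_dir="path/to/eval_output",
    checkpoint_dir="path/to/trained_checkpoints",
)
# To actually run it from the research/ directory of tensorflow/models:
#   import subprocess
#   subprocess.run(command, check=True)
```

Using a separate model_dir for the test-set run keeps its metrics apart from those written during training.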
Regarding why we have a combined script... from the perspective of the Object Detection API, we were trying to write things in the tf.Estimator paradigm --- but you are right that personally I found it a bit easier when the two functionalities lived in separate binaries. If you want, you can always wrap up this functionality in another binary :)
I am a beginner in machine learning. Recently, I successfully got a machine learning application running using the TensorFlow object detection API.
My dataset is 200 images of an object at 300*300 resolution. However, the training has been running for two days and has yet to complete.
How long should it take to complete training? At the moment it is at global step 9000; how many global steps are needed to complete the training?
P.S: the training used only CPUs
It depends on your desired accuracy and dataset, of course, but I generally stop training when the loss value gets to around 4 or less. What is your current loss value after 9000 steps?
To me this sounds like your training is not converging.
See the discussion in the comments of this question.
Basically, it is recommended that you run eval.py in parallel and check how it performs there as well.
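To judge whether the loss is still improving, it can help to compare averages over recent steps rather than look at single noisy values. A minimal sketch, assuming you have collected the per-step loss values from the training logs into a list; has_plateaued, window, and min_improvement are hypothetical names and the thresholds are arbitrary:

```python
def has_plateaued(losses, window=100, min_improvement=0.01):
    """Return True if the mean loss stopped improving.

    Compares the mean of the last `window` values with the mean of the
    `window` values before that; improvement below `min_improvement`
    (as a fraction of the earlier mean) counts as a plateau.
    """
    if len(losses) < 2 * window:
        return False  # not enough history to decide
    previous = sum(losses[-2 * window:-window]) / window
    recent = sum(losses[-window:]) / window
    return (previous - recent) < min_improvement * abs(previous)


# Toy histories: one still decreasing, one flat.
still_improving = has_plateaued([10 - 0.01 * i for i in range(400)])
flat = has_plateaued([4.0] * 400)
```

A plateau in training loss alone does not say the model is good, which is why checking the evaluation metrics in parallel is still recommended.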