Error every time running H2OXGBoostEstimator - python

I randomly encounter the same error whenever I run an XGBoost model (both a normal run and a grid search). The error message says:
H2OConnectionError: Local server has died unexpectedly. RIP.
I don't know what's happening; I tried changing versions, but that didn't work. I'm currently using version 3.18.0.5. Does anyone have any idea what is going on? Thanks in advance.

The only time I've seen this happen is when H2O runs out of memory. Please check that you have enough memory -- an H2O cluster should have at least 4x as much RAM as the dataset you're trying to train a model on (data size on disk).
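If the cluster is memory-starved, one fix is to give it an explicit memory cap at startup. A minimal sketch, assuming a local cluster with 16 GB to spare (max_mem_size is a real h2o.init parameter; the file path, response column and sizes below are placeholders):

import h2o
from h2o.estimators import H2OXGBoostEstimator

# Start (or connect to) a local H2O cluster with an explicit memory cap.
# Rule of thumb from above: at least ~4x the on-disk size of the training data.
h2o.init(max_mem_size="16G")

train = h2o.import_file("train.csv")                # placeholder path
model = H2OXGBoostEstimator(ntrees=50, seed=1)
model.train(y="target", training_frame=train)       # placeholder response column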

Related

TensorFlow/Keras model __call__ gets slower and slower when run on GPU

I'm trying to implement an AlphaZero-like board game AI for the game Gomoku (Monte Carlo Tree Search in combination with a CNN that evaluates board positions).
Right now, the MCTS is implemented as a separate component.
Additionally, I have a simple TCP server written in Python that receives positions from the MCTS (in batches of around 50 to 200), converts them to Numpy arrays, passes them to the TF/Keras Model by invoking __call__, converts them back, and sends the results to the MCTS via TCP.
During training, I generate training data (around 5000 boards) by having the AI play against itself, call model.fit once, create a new dataset using the new model weights and so on.
I play multiple matches in parallel, each using its own separate Python/TF server.
Each server loads its own copy of the model (I use tf.config.experimental.set_memory_growth(gpu, True)).
The problem I encounter is that, while playing matches, inference gets slower and slower the longer the model has been loaded, until it becomes so painfully slow that I have to restart the training.
After a restart, inference times are back to normal.
This, by the way, also happens if I only play one game at a time, with just one model loaded.
I tried to mitigate the problem by restarting the Python server (and therefore the model) after each match.
This seemingly solved it until I started experiencing the same issue after a couple of training iterations.
At first I thought the reason was my less-than-ideal setup (a gaming notebook running Windows), but the problem also occurred on a Linux server at my university.
On my Windows machine, as the model got slower and slower it was also using less and less memory. This apparently did not occur on Linux.
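For reference, a minimal sketch of the inference path described above (the board shape, file path and helper names are assumptions, not taken from the question; TF2 eager execution assumed). In a long-lived inference server like this, one thing worth ruling out is repeated retracing when batch sizes vary, e.g. by pinning the input signature of the evaluation function:

import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("gomoku_cnn")     # placeholder path

# Fixed signature: variable batch size, 15x15 board with one channel (assumed shape),
# so varying batch sizes (50-200) never trigger a new trace.
@tf.function(input_signature=[tf.TensorSpec(shape=(None, 15, 15, 1), dtype=tf.float32)])
def evaluate(batch):
    return model(batch, training=False)

def handle_request(boards):
    # boards: list of positions received over TCP, converted to one NumPy batch
    batch = np.asarray(boards, dtype=np.float32)
    policy, value = evaluate(batch)                  # assumes two output heads
    return policy.numpy(), value.numpy()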
I had a similar issue, and it was due to having the wrong NVIDIA drivers installed for my card and not having CUDA installed. Hope that was helpful.
https://developer.nvidia.com/cuda-downloads

Python unable to access multiple GB of my ram?

I'm writing a machine learning project for fun, but I have run into an interesting error that I can't seem to fix. I'm using sklearn (LinearSVC, train_test_split), NumPy, and a few other small libraries like collections.
The project is a comment classifier: you put in a comment, it spits out a classification. The problem I'm running into is a MemoryError (Unable to allocate 673. MiB for an array with shape (7384, 11947) and data type float64) when doing a train_test_split to check the classifier accuracy, specifically when I call model.fit.
My program finds 11,947 unique words, and I have a large training sample (14,769), but I've never had an issue with running out of RAM before. And the thing is, I'm not running out of RAM: I have 32 GB, but the program ends up using less than 1 GB before it gives up.
Is there something obvious I'm missing?
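For what it's worth, when a MemoryError appears long before physical RAM is exhausted, two things worth checking are whether the interpreter is 64-bit and whether the feature matrix is being densified somewhere. A minimal sketch (the vectorizer and the toy data below are placeholders, not taken from the question):

import struct
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# 1. A 32-bit interpreter caps each process at a few GB, no matter how much RAM is installed.
print("Python is", struct.calcsize("P") * 8, "bit")

# 2. Keep the document-term matrix sparse; a dense 7384 x 11947 float64 array
#    is exactly what needs ~673 MiB in one contiguous block.
comments = ["good comment", "bad comment", "another good one", "another bad one"]
labels = [1, 0, 1, 0]
X = CountVectorizer().fit_transform(comments)        # SciPy sparse matrix
print(sp.issparse(X))                                # True -> it stays sparse

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, stratify=labels, random_state=0)
LinearSVC().fit(X_train, y_train)                    # LinearSVC accepts sparse input directly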

Tensorflow-GPU Object Detection API gets stuck after first saved checkpoint

I'm trying to train an SSD MobileNet v2 using the Tensorflow Object Detection API, with Tensorflow GPU. The training runs well and fast until the first checkpoint save (after a few hundred steps), where it gets stuck after restoring the last checkpoint. The GPU usage goes down and never comes back up. Sometimes Python itself crashes.
I'm running Tensorflow GPU on Windows 7, with an NVIDIA Quadro M4000, with CUDA 8.0 (the only version I managed to work with). The model is an SSD Mobilenet v2 pretrained with COCO, using a very low batch size of 4.
The config file is the same one that comes with the TensorFlow Model Zoo, of course changing the paths, batch size, number of classes and number of steps, and adding shuffle: true in the training section.
I'm adding the terminal output below; this is where it gets stuck.
Has anyone experienced the same kind of problem, or does anyone have an idea why this happens?
Thanks in advance.
I faced the same problem you describe. I waited a long time and found something interesting: I got some evaluation results, and the training process continued after that. It seems that the evaluation process simply takes a very long time; since it gives no output at the beginning, it just looks like it is stuck. Maybe changing the parameter 'sample_1_of_n_eval_examples' will help. I'm trying...

How to fix this strange error: "RuntimeError: CUDA error: out of memory"

I successfully trained the network but got this error during validation:
RuntimeError: CUDA error: out of memory
The best way is to find the process holding GPU memory and kill it:
Find the PID of the Python process with:
nvidia-smi
Copy the PID and kill it with:
sudo kill -9 <pid>
1. When you only perform validation, not training, you don't need to calculate gradients for the forward and backward passes.
In that situation, your code can be placed under torch.no_grad():

with torch.no_grad():
    ...
    net = Net()
    pred_for_validation = net(input)
    ...

The code above doesn't build the autograd graph, so no GPU memory is held for gradients.
2. If you use the += operator on tensors in your code, it can keep the computation graph (and its memory) accumulating.
In that case, you need to detach the value, e.g. with float(), as described here:
https://pytorch.org/docs/stable/notes/faq.html#my-model-reports-cuda-runtime-error-2-out-of-memory
Even though the docs suggest float(), in my case item() also worked:

entire_loss = 0.0
for i in range(100):
    one_loss = loss_function(prediction, label)
    entire_loss += one_loss.item()   # .item() returns a plain Python float, detached from the graph
3. If you use a for loop in your training code, intermediate tensors can stay alive until the entire loop ends.
So, in that case, you can explicitly delete variables after performing optimizer.step():

for one_epoch in range(100):
    ...
    optimizer.step()
    del intermediate_variable1, intermediate_variable2, ...
The error occurs because you ran out of memory on your GPU.
One way to solve it is to reduce the batch size until your code runs without this error.
I had the same issue, and this code worked for me:
import gc
gc.collect()
torch.cuda.empty_cache()
It might happen for a number of reasons, which I try to cover in the following list:
Module parameters: check the number of dimensions of your modules. Linear layers that transform a big input tensor (e.g., size 1000) into another big output tensor (e.g., size 1000) will require a weight matrix of size (1000, 1000).
RNN decoder maximum steps: if you're using an RNN decoder in your architecture, avoid looping for a large number of steps. Usually, you fix a number of decoding steps that is reasonable for your dataset.
Tensor usage: minimise the number of tensors that you create. The garbage collector won't release them until they go out of scope.
Batch size: incrementally increase your batch size until you run out of memory. It's a common trick that even famous libraries implement (see the biggest_batch_first description for the BucketIterator in AllenNLP).
In addition, I would recommend having a look at the official PyTorch documentation: https://pytorch.org/docs/stable/notes/faq.html
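As a complement to the list above, a quick way to see how much memory PyTorch is actually holding at a given point (a minimal sketch; device index 0 is an assumption, and memory_reserved needs PyTorch >= 1.4):

import torch

device = torch.device("cuda:0")
x = torch.randn(1024, 1024, device=device)           # dummy allocation for illustration

# Memory currently used by tensors vs. memory reserved by the caching allocator
print(torch.cuda.memory_allocated(device) / 1024**2, "MiB allocated")
print(torch.cuda.memory_reserved(device) / 1024**2, "MiB reserved")
print(torch.cuda.memory_summary(device))             # detailed per-pool breakdown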
I am a PyTorch user. In my case, the cause of this error message was actually not GPU memory, but a version mismatch between PyTorch and CUDA.
Check whether the cause really is GPU memory with the code below.
import torch
foo = torch.tensor([1,2,3])
foo = foo.to('cuda')
If the code above still raises an error, it is better to reinstall PyTorch to match your CUDA version. (In my case, this solved the problem.)
Pytorch install link
A similar situation can also happen with TensorFlow/Keras.
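One way to check for such a mismatch (a minimal sketch; the versions printed will of course differ on your machine):

import torch

print(torch.__version__)            # PyTorch build
print(torch.version.cuda)           # CUDA version this PyTorch build was compiled against
print(torch.cuda.is_available())    # False often hints at a driver/toolkit mismatch
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))

Compare the reported CUDA version against the driver's supported CUDA version shown by nvidia-smi.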
If you are getting this error in Google Colab, use this code:
import torch
torch.cuda.empty_cache()
In my experience, this is not a typical CUDA OOM Error caused by PyTorch trying to allocate more memory on the GPU than you currently have.
The giveaway is the distinct lack of the following text in the error message.
Tried to allocate xxx GiB (GPU Y; XXX GiB total capacity; yyy MiB already allocated; zzz GiB free; aaa MiB reserved in total by PyTorch)
In my experience, this is an Nvidia driver issue. A reboot has always solved the issue for me, but there are times when a reboot is not possible.
One alternative to rebooting is to kill all Nvidia processes and reload the drivers manually. I always refer to the unaccepted answer of this question written by Comzyh when performing the driver cycle. Hope this helps anyone trapped in this situation.
If someone arrives here because of fast.ai, the batch size of a loader such as ImageDataLoaders can be controlled via bs=N where N is the size of the batch.
My dedicated GPU is limited to 2 GB of memory; using bs=8 in the following example worked in my situation:
from fastai.vision.all import *
path = untar_data(URLs.PETS)/'images'
def is_cat(x): return x[0].isupper()
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(244), num_workers=0, bs=8)
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)
Problem solved by the following code:
import os
os.environ['CUDA_VISIBLE_DEVICES']='2, 3'
Not sure if this'll help you or not, but this is what solved the issue for me:
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
Nothing else in this thread helped.
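If exporting the variable in a shell is inconvenient, the same setting can also be applied from Python, assuming it happens before the first CUDA allocation (a sketch, not part of the answer above):

import os
# Must be set before PyTorch initializes its CUDA caching allocator.
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'

import torch
x = torch.ones(1, device='cuda')    # first CUDA allocation happens after the setting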
I faced the same issue on my computer. All you have to do is adjust your configuration file to match your machine's capabilities. It turns out my computer could only handle image sizes below 600 x 600, and when I adjusted that in the configuration file, the program ran smoothly.

How to track down the cause a SIGSEGV error in an ML Engine training job?

I am training a custom tensorflow estimator on ML engine and running into this error:
The replica master 0 exited with a non-zero status of 11(SIGSEGV)
with the only other error log being:
Command '['python3', '-m', 'train_model.train', ... ']' returned non-zero exit status -11
There is no further traceback, so this 'invalid memory reference or segmentation fault' code is all I have to go on.
This SIGSEGV error does not always occur. Some training jobs run without problems, others throw this error after 4 hours, and others after 15 minutes.
I've varied parts of the estimator training code to try to trial-and-error my way to the cause, but have had no success.
I thought the code 11 might correspond to this error code in the Google APIs, and found that a number of people have experienced OutOfSequence and OutOfRange errors when using custom metrics in the estimator EvalSpec, but I do not think that is what is causing the error here, as I use a tf.metric.
I am using the BASIC scale tier; looking at the utilisation graphs, CPU never goes above 80% and memory stays at around 25%.
I am caching the tensorflow dataset, but I also receive this error when not caching the dataset. The error occurs when running the train_and_evaluate method as well as the train method.
Is there any advice on how I can track down the root cause of this crash in the training jobs, or on what some of the common causes of this crash are? Is the solution just to use a machine with more memory?
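One generic way to get at least a Python-level traceback out of a segfault, wherever the job runs, is the standard-library faulthandler module (a sketch; putting it at the top of the train_model.train entry point is an assumption about the project layout):

import faulthandler
import sys

# Dump the Python tracebacks of all threads if the process receives SIGSEGV,
# SIGFPE, SIGABRT, SIGBUS or SIGILL. Output goes to stderr, so it should show
# up in the ML Engine job logs.
faulthandler.enable(file=sys.stderr, all_threads=True)

# ... rest of the training entry point ...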
