spyder kernel dies during training - python

I am trying to train a fairly complex GCN network on my 10 GB GPU. It runs smoothly until epoch 87, but then the Spyder kernel restarts. Is it because of a memory issue, and if so, how can I handle it?

As you mentioned, if the model is large it is good to save a checkpoint (the model state) after every epoch.
import os
import torch

## after every epoch
path = os.path.join(SAVE_DIR, 'model.pth')
torch.save(model.cpu().state_dict(), path)  # save the model parameters
model.cuda()                                # move the model back to the GPU for further training

## if the kernel terminates, load the model parameters
device = torch.device("cuda")
model = TheModelClass()
model.load_state_dict(torch.load(path))
model.to(device)
model.train()
So if anything happens during the process, you can resume from the last completed epoch.
From the information you give, it's hard to tell exactly what causes the kernel to terminate. RAM overload is less likely because you are using GPU acceleration and the PyTorch framework, but it is still possible.
Either way, the checkpointing approach above will help you.
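If you want to resume exactly where training stopped, you can also checkpoint the optimizer state and the epoch counter. A minimal sketch, assuming model, optimizer, epoch, and SAVE_DIR are already defined in your training script:

import os
import torch

checkpoint_path = os.path.join(SAVE_DIR, 'checkpoint.pth')

## after every epoch: store everything needed to resume
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
}, checkpoint_path)

## after a kernel restart: restore the model, optimizer, and epoch counter
checkpoint = torch.load(checkpoint_path, map_location='cuda')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1
model.cuda()
model.train()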

Related

base_model.summary() Crashes my notebook and VS Code - ResNet101

I am trying to print a model summary in TensorFlow, and I think the model is so large that it crashes my notebook. The model is ResNet101.
The whole computer comes to a halt, memory usage goes up to 99%, and VS Code crashes. I have 16 GB of RAM, so I didn't think printing something large would actually eat all of it. Also, because the kernel crashes, all the variables are lost, like history = model.fit(), which I need to fine-tune the model afterwards. Moreover, I need to print the base_model summary in order to choose which layer to fine-tune from.
Is there another way to print the summary, and can I save the entire notebook with its variables so I can continue working? I have checkpoints for the model weights, but I need to keep track of past epochs through history to resume training afterwards.
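For the history part specifically, one option (a sketch, not from the original thread; the model, dataset, and file names are placeholders) is to log per-epoch metrics to disk with a CSVLogger callback so they survive a kernel crash:

import tensorflow as tf

# append=True keeps rows from earlier sessions, so the log survives restarts
csv_logger = tf.keras.callbacks.CSVLogger("training_log.csv", append=True)
history = model.fit(train_ds, epochs=20, callbacks=[csv_logger])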

How can I stop model training and resume it?

I am working on object detection with autonomous datasets. I want to train my model with 10000 training images, 2000 test images, and 2000 validation images. I will use the TensorFlow Lite Model Maker for object detection.
Project link: tensorflow.org/lite/tutorials/model_maker_object_detection
After setting the batch size to 32, the training takes 50 epochs and runs for 2 days (Step 3). I can't keep my computer on for two days. I am running the project in a Jupyter notebook.
How can I stop model training and resume it later? (e.g., stop at the 10th epoch and continue one day later)
I'm sure it depends on the code you are working on. You can do that with TensorFlow; check
How to Pause / Resume Training in Tensorflow
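For a plain Keras training loop (not specific to the Model Maker API), the usual pattern is a ModelCheckpoint callback plus initial_epoch when you resume. A rough sketch with placeholder names (model, train_ds, and the checkpoint paths are assumptions):

import tensorflow as tf

ckpt_path = 'checkpoints/model_epoch_{epoch:02d}.h5'  # placeholder path pattern

# while training: save the full model at the end of every epoch
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(ckpt_path, save_weights_only=False)
model.fit(train_ds, epochs=50, callbacks=[checkpoint_cb])

# later, in a fresh session: reload and continue from where you stopped
model = tf.keras.models.load_model('checkpoints/model_epoch_10.h5')
model.fit(train_ds, epochs=50, initial_epoch=10, callbacks=[checkpoint_cb])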
Sleep mode is a better option.
It will let your PC rest for some time, and your work will resume after you log in again.

Tensorflow2.0: GPU runs out of memory during hyperparameter tuning loop

I am trying to perform some hyperparameter tuning of a convolutional neural network written in Tensorflow 2.0 with GPU extension.
My system settings are:
Windows 10 64bit
GeForce RTX2070, 8GB
Tensorflow 2.0-beta
CUDA 10.0 properly installed (I hope, deviceQuery.exe and bandwidthTest.exe passed positively)
My neural network has 75,572,574 parameters and I am training it on 3777 samples. In a single run, I have no problems training the CNN.
As next step, I wanted to tune two hyperparameters of the CNN. To this aim, I created a for loop (iterating on 20 steps), in which I build and compile every time a new model, changing the hyperparameters at every loop iteration.
The gist of the code (this is not an MWE) is the following
import tensorflow as tf
from tensorflow import keras

def build_model(input_shape, output_shape, lr=0.01, dropout=0, summary=True):
    model = keras.models.Sequential(name="CNN")
    model.add(keras.layers.Conv2D(32, (7, 7), activation='relu', input_shape=input_shape, padding="same"))
    model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.MaxPooling2D((2, 2)))
    model.add(keras.layers.Dropout(dropout))
    model.add(keras.layers.Conv2D(128, (3, 3), activation='relu', padding="same"))
    model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.MaxPooling2D((2, 2)))
    model.add(keras.layers.Dropout(dropout))
    model.add(keras.layers.Flatten())
    model.add(keras.layers.Dense(1024, activation='relu'))
    model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.Dense(output_shape, activation='linear'))
    model.build()
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss="mse",
                  metrics=[RMSE])
    if summary:
        print(model.summary())
    return model
...
for run_id in range(25):
    lr = learning_rate.max_value + (learning_rate.min_value - learning_rate.max_value) * np.random.rand(1)
    dropout = dropout.min_value + (dropout.max_value - dropout.min_value) * np.random.rand(1)
    print("%=== Run #{0}".format(run_id))
    run_dir = hparamdir + "\\run{0}".format(run_id)
    model0 = build_model(IMG_SHAPE, Ytrain.shape[1], lr=lr, dropout=dropout)
    model0_history = model0.fit(Xtrain,
                                Ytrain,
                                validation_split=0.3,
                                epochs=2,
                                verbose=2)
The problem I encountered is that, after a few (6) loops, the program halts, returning the error:
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[73728,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:Add] name: dense_12/kernel/Initializer/random_uniform/
Process finished with exit code 1.
I believe the problem is that the GPU does not release the memory in between each iteration of the for loop and, after a while, it saturates and crashes.
I have dug around and tried different solutions suggested in similar posts (post1, post2):
Trying to release the memory using the Keras backend at the end of every iteration of the for loop, with
from keras import backend as K
K.clear_session()
Trying to clear the GPU using Numba and CUDA with
from numba import cuda
cuda.select_device(0)
cuda.close()
I tried deleting the graph using del model0 but that did not work either.
I couldn't try using tf.reset_default_graph() since the programming style of TF2.0 doesn't have a default graph anymore (AFAIK) and thus I have not found a way to kill/delete a graph at runtime.
Solutions 1. and 3. returned the same out-of-memory error, while solution 2. returned the following error during the second iteration of the for loop, while building the model in the build_model() call:
2019-07-24 19:51:31.909377: F .\tensorflow/core/kernels/random_op_gpu.h:227] Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch<Distribution>, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: invalid resource handle
Process finished with exit code -1073740791 (0xC0000409)
I looked around but I don't really understand the last error; my guess is that the GPU has not been closed properly, is still occupied, or can no longer be seen by Python.
In any case, I could not find any solution to this issue, except for running the training by hand for every hyperparameter to be tested.
Does anybody have any idea how to solve this problem?
Or a workaround for hyperparameter tuning?
Should I open an issue in the TF2.0 GitHub issue tracker (it does not appear to be a TensorFlow bug per se, since they state that they deliberately avoid freeing the GPU memory to prevent fragmentation problems)?
This is due to how TF handles memory.
If you monitor your system while iteratively training TF models, you will observe a linear increase in memory consumption. Additionally, if you run watch -n 0.1 nvidia-smi you will notice that the PID of the process remains constant across iterations. TF does not fully release the memory it uses until the PID holding the memory is killed. Also, the Numba documentation notes that cuda.close() is not useful if you want to reset the GPU (though I definitely spent a while trying to make it work when I discovered it!).
The easiest solution is to iterate using the Ray Python package and something like the following:

import ray

@ray.remote(
    num_gpus=1  # or however many you want to use (e.g., 0.5, 1, 2)
)
class RayNetWrapper:
    def __init__(self, net):
        self.net = net

    def train(self):
        return self.net.train()

ray.init()
actors = [RayNetWrapper.remote(model) for _ in range(25)]
results = ray.get([actor.train.remote() for actor in actors])
You should then notice that GPU processes cycle on and off with new PIDs each time, and your system memory no longer keeps increasing. Alternatively, you can put your model-training code in a separate Python script and call it iteratively using Python's subprocess module, as sketched below. You will notice some latency when models shut down and new ones boot up, but this is expected because the GPU is restarting. Ray also has an experimental asynchronous framework that I've had some success with, which enables fractional sharing of GPUs (model size permitting).
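A minimal sketch of the subprocess alternative, assuming you move the training for a single run into its own script (train_one_run.py and its --run-id flag are hypothetical names):

import subprocess

# each run gets its own Python process, so the GPU memory is fully
# released when that process exits
for run_id in range(25):
    subprocess.run(
        ["python", "train_one_run.py", "--run-id", str(run_id)],  # hypothetical script and flag
        check=True,
    )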
You can place these lines at the top of your code to let TensorFlow allocate GPU memory gradually instead of grabbing it all at once:

import tensorflow as tf

tf.compat.v1.disable_v2_behavior()

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)  # memory growth must be set before the GPUs are initialized

That works for me.

Tensorflow-GPU Object Detection API gets stuck after first saved checkpoint

I'm trying to train a SSD mobilenet v2 using Tensorflow Object Detection API, with Tensorflow GPU. The training goes well and fast until the first checkpoint save (after some hundreds of steps), where it gets stuck after restoring the last checkpoint. The GPU usage goes down and never comes up. Sometimes Python itself crashes.
I'm running TensorFlow GPU on Windows 7, with an NVIDIA Quadro M4000 and CUDA 8.0 (the only version I managed to get working). The model is an SSD MobileNet v2 pretrained on COCO, using a very low batch size of 4.
The config file is the one that comes with the TensorFlow Model Zoo, of course with changed paths, batch size, number of classes, and number of steps, and with shuffle: true added to the training section.
I'm attaching the terminal output; this is where it gets stuck.
Has anyone experienced the same kind of problem, or does anyone have an idea why this happens?
Thanks in advance
I faced the same problem you describe. I waited a long time and noticed something interesting: I got some evaluation results, and the training process continued after that. It seems that the evaluation step takes a very long time; since it produces no output at the beginning, it just looks like the training got stuck. Maybe changing the parameter 'sample_1_of_n_eval_examples' will help. I'm trying it...

Keras with Tensorflow backend - Run predict on CPU but fit on GPU

I am using keras-rl to train my network with the D-DQN algorithm. I am running my training on the GPU with the model.fit_generator() function to allow data to be sent to the GPU while it is doing backprops. I suspect the data generation is too slow compared to the speed at which the GPU processes it.
In the generation of data, as instructed in the D-DQN algorithm, I must first predict Q-values with my models and then use these values for the backpropagation. And if the GPU is used to run these predictions, it means that they are breaking the flow of my data (I want backprops to run as often as possible).
Is there a way I can specify on which device to run specific operations? In a way that I could run the predictions on the CPU and the backprops on the GPU.
Maybe you can save the model at the end of training. Then start another Python file and set os.environ["CUDA_VISIBLE_DEVICES"] = "-1" before you import any Keras or TensorFlow stuff. Now you should be able to load the model and make predictions with your CPU.
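A rough sketch of that idea, assuming the model was saved to a file (the saved_model.h5 path and the input array are placeholders):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # hide the GPU before importing Keras/TensorFlow

import numpy as np
from tensorflow import keras

model = keras.models.load_model("saved_model.h5")  # placeholder path
x_new = np.zeros((1, 4), dtype="float32")          # placeholder input batch
predictions = model.predict(x_new)                 # runs on the CPU only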
It's hard to properly answer your question without seeing your code.
The code below shows how you can list the available devices and force TensorFlow to use a specific device.
import tensorflow as tf
from tensorflow.python.client import device_lib

def get_available_devices():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos]

get_available_devices()

with tf.device('/gpu:0'):
    pass  # do GPU stuff here

with tf.device('/cpu:0'):
    pass  # do CPU stuff here
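Applied to the question, a hedged sketch (the model, states, and targets names are placeholders, and actual placement also depends on how the Keras backend builds its graph) might look like:

with tf.device('/cpu:0'):
    q_values = model.predict(states)       # Q-value predictions on the CPU

with tf.device('/gpu:0'):
    model.fit(states, targets, verbose=0)  # backpropagation on the GPU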
