Generally, to run pip without a cache we use --no-cache-dir, like:
pip install torch --no-cache-dir
I downloaded a CNN model I want to use from GitHub.
The first two commands,
python generate_dataset.py --is_train=True --use_phase=True --chip_size=100 --patch_size=94 --dataset=soc
python generate_dataset.py --is_train=False --use_phase=True --chip_size=128 --patch_size=128 --dataset=soc
executed successfully. But while running
python train.py --config_name=config/AConvNet-SOC.json
it raises a MemoryError.
The author of the repository used 32 GB of RAM and an 11 GB GPU, but I have 8 GB of RAM and an 8 GB GPU.
Here is what I have done:
I thought of running it without a cache, like:
python train.py --config_name=config/AConvNet-SOC.json --no-cache-dir
But it throws the error below:
FATAL Flags parsing error: Unknown command line flag 'no-cache-dir' Pass --helpshort or --helpfull to see help on flags.
I think it is because the no-cache-dir argument is not defined in the script via absl.flags (--no-cache-dir is a pip flag, not a flag of the training script). Does Python support some kind of no-cache-directory mechanism?
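For reference, scripts built on absl.flags reject any flag they have not explicitly defined. A minimal sketch of how such flags are declared (the flag name here is illustrative, not taken from the repository):
from absl import app, flags

FLAGS = flags.FLAGS
# Only flags defined like this are accepted; anything else
# (e.g. --no-cache-dir) triggers "Unknown command line flag".
flags.DEFINE_string('config_name', None, 'Path to the training config JSON.')

def main(argv):
    del argv  # unused
    print(FLAGS.config_name)

if __name__ == '__main__':
    app.run(main)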
I can work around it by decreasing the number of epochs and the batch_size, but I want to run it for the full number of epochs.
Using zero_grad() in PyTorch zeroes the gradients for every minibatch, so that the GPU won't run out of memory. But it is already used in the code I am running, in _base.py. Is there any way I can leverage more of this?
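For reference, one common way to stretch limited GPU memory is gradient accumulation: run smaller minibatches but step the optimizer less often, keeping the effective batch size. A minimal sketch, assuming a standard PyTorch loop (all names below are dummy placeholders, not from the repository):
import torch
from torch import nn

# Dummy stand-ins so the sketch runs; replace with the real model/data.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
loader = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(8)]

accumulation_steps = 4  # effective batch size = 8 * 4 = 32

model.train()
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    outputs = model(inputs)
    # Scale the loss so the accumulated gradients match one large batch.
    loss = criterion(outputs, targets) / accumulation_steps
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()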
How can I resolve this?
I am trying to train a YOLOv7 model without a GPU. This is the command line I am currently using on Colab:
python train_aux.py --workers 1 --device cpu --batch-size 1 --data data/coco.yaml --img 128 128 --cfg /content/yolov7/cfg/training/yolov7-e6e.yaml --weights '' --name yolov7-e6e --hyp data/hyp.scratch.p6.yaml
For some reason I first get a warning
warnings.warn('User provided device_type of \'cuda\', but CUDA is not available. Disabling')
and then I get the error
RuntimeError: No CUDA GPUs are available
during the first epoch. I don't understand why it is trying to use CUDA when I am running it on the CPU. Am I missing some spot in the code that I have to edit to fix this? Here is the link to the GitHub repository that I am using.
I have tried installing the CUDA library in case that helped, using
!pip install cuda-python
but it didn't solve the issue.
So it looks like this issue is due to CUDA being hard-coded into the model for certain procedures. A more in-depth explanation can be found here: link. In the meantime, removing --device cpu for some reason fixed it.
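A defensive pattern that avoids this class of failure is to pick the device based on availability instead of hard-coding 'cuda'. A minimal sketch, not the repository's actual code:
import torch

# Fall back to CPU when no CUDA device is present, instead of
# assuming 'cuda' and crashing with "No CUDA GPUs are available".
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = torch.nn.Linear(10, 2).to(device)
inputs = torch.randn(4, 10, device=device)
print(model(inputs).device)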
I have ~50000 images and annotation files for training a YOLOv5 object detection model. I've trained a model with no problem using just the CPU on another computer, but it takes too long, so I need GPU training. My problem is that when I try to train with a GPU I keep getting this error:
OSError: [WinError 1455] The paging file is too small for this operation to complete
This is the command I'm executing:
python train.py --img 640 --batch 4 --epochs 100 --data myyaml.yaml --weights yolov5l.pt
CUDA and PyTorch have successfully been installed and are available. The following command installed with no errors:
pip3 install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio===0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
I've found other people online with similar issues who fixed it by changing num_workers=8 to num_workers=1. When I tried this, training started and seemed to get past the point where the "paging file is too small" error appears, but then crashed a couple of hours later. I've also increased the virtual memory (paging file size) as per this video (https://www.youtube.com/watch?v=Oh6dga-Oy10), but that didn't work either. I think it's a memory issue, because some of the times it crashes I get a low-memory warning from my computer.
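For context, each DataLoader worker is a separate process with its own memory footprint, which is presumably why lowering num_workers reduces RAM pressure. A minimal, self-contained sketch with a dummy dataset standing in for the real images:
import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    # Dummy dataset standing in for the real images/annotations.
    dataset = TensorDataset(torch.randn(100, 3, 64, 64), torch.zeros(100))
    # Each worker is a separate process holding its own loader state,
    # so fewer workers means less RAM (at the cost of loading speed).
    loader = DataLoader(dataset, batch_size=4, num_workers=1)
    for images, labels in loader:
        pass  # training step would go here

if __name__ == '__main__':  # required for worker processes on Windows
    main()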
Any help would be much appreciated.
So I've managed to fix my specific problem and thought posting the answer here might help someone else. Basically, I don't think I had enough RAM. I was using 8 GB before; I've upgraded to 32 GB and it's working fine.
As I wrote in the question above, I thought it was a memory issue, and I did get it to work on another computer using only the CPU. I also noticed a spike in RAM usage when training started. This guy also explains the importance of RAM when training deep learning models on large datasets:
https://timdettmers.com/2018/12/16/deep-learning-hardware-guide/
Hope this can help other people with the same issue.
I converted a TensorFlow Model to ONNX using this command:
python -m tf2onnx.convert --saved-model tensorflow-model-path --opset 10 --output model.onnx
The conversion was successful, and I can run inference on the CPU after installing onnxruntime.
But when I create a new environment, install onnxruntime-gpu in it, and run inference on the GPU, I get different error messages depending on the model. E.g. for MobileNet I receive:
W:onnxruntime:Default, cuda_execution_provider.cc:1498 GetCapability] CUDA kernel not supported. Fallback to CPU execution provider for Op type: Conv node name: StatefulPartitionedCall/mobilenetv2_1.00_224/Conv1/Conv2D
I tried out different opsets.
Does someone know why I am getting errors when running on the GPU?
That is not an error. That is a warning, and it is basically telling you that this particular Conv node will run on the CPU (instead of the GPU). This is most likely because the GPU backend does not yet support asymmetric padding; there is a PR in progress to mitigate the issue: https://github.com/microsoft/onnxruntime/pull/4627. Once this PR is merged, these warnings should go away and such Conv nodes will run on the GPU backend.
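You can confirm which execution providers a session actually ended up with. A minimal sketch, assuming onnxruntime-gpu is installed and model.onnx is the converted file from the question:
import onnxruntime as ort

# Request the CUDA provider with a CPU fallback; nodes the GPU
# backend cannot handle (like this Conv) fall back automatically.
session = ort.InferenceSession(
    'model.onnx',
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider'],
)
print(session.get_providers())  # providers actually in use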
So I am trying to compile TensorFlow from source (using a clone of their git repo from 2019-01-31). I installed Bazel using their shell script (https://github.com/bazelbuild/bazel/releases/download/0.21.0/bazel-0.21.0-installer-linux-x86_64.sh).
I executed ./configure in the TensorFlow source tree and accepted the default settings, except for adding my machine-specific -m options (-mavx2 -mfma) and pointing Python to the correct python3 location (/usr/bin/py3). I then ran the following command as per the TensorFlow instructions:
bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package //tensorflow:libtensorflow_framework.so //tensorflow:libtensorflow.so
Now that continues to run and run; I haven't seen it complete yet (though I am limited to letting it run for a maximum of about 10 hours). It produces a ton of INFO warnings about signed/unsigned integers and control reaching the end of non-void functions. None of these appear fatal. Compilation continues to tick along, with the two counters growing ('[N,NNN / X,XXX] 4 actions running') and files ticking through 'Compiling'.
The machine is an EC2 instance with ~16 GiB of RAM; the CPU is an 'Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz' with, I believe, 4 cores, and there is plenty of HDD space (although compilation seems to eat QUITE a bit of it, > 1 GiB).
Any ideas on what's going on here?
Unfortunately, some programs can take a long time to compile; a couple of hours of compilation is not strange for TensorFlow on your setup.
There are reports of it taking 50 minutes on a considerably faster machine.
A solution to this problem is to use the pre-compiled binaries that are available via pip; instructions can be found here: https://www.tensorflow.org/install/pip.html
Basically you can do this:
pip install tensorflow
If you require a specific older version, like 1.15, you can do this:
pip install tensorflow==1.15
For GPU support you add -gpu to the package name, like this:
pip install tensorflow-gpu
And:
pip install tensorflow-gpu==1.15
In development, I have been using the GPU-accelerated TensorFlow wheel:
https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.2.1-cp35-cp35m-linux_x86_64.whl
I am attempting to deploy my trained model, along with an application binary, to my users. I compile the application into a binary with PyInstaller (3.3.dev0+f0df2d2bb) on Python 3.5.2.
For deployment, I install the CPU version:
https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.2.1-cp35-cp35m-linux_x86_64.whl
However, upon successful compilation, I run my program and receive the infamous tensorflow cuda error:
tensorflow.python.framework.errors_impl.NotFoundError:
tensorflow/contrib/util/tensorflow/contrib/cudnn_rnn/python/ops/_cudnn_rnn_ops.so:
cannot open shared object file: No such file or directory
Why is it looking for CUDA when I've only got the CPU version installed? (Let alone the fact that I'm still on my development machine, which has CUDA, so it should find it anyway; I can use tensorflow-gpu/CUDA fine in uncompiled scripts. But this is irrelevant, because deployment machines won't have CUDA.)
My first thought was that somehow I'm importing the wrong TensorFlow, but I not only ran pip uninstall tensorflow-gpu, I also deleted the tensorflow-gpu package from /usr/local/lib/python3.5/dist-packages/.
Any ideas what could be happening? Maybe I need to start using a virtualenv...
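One quick way to check which TensorFlow build an environment actually imports, before handing it to PyInstaller. A hedged diagnostic sketch; nothing here is specific to the application:
import tensorflow as tf

# Shows which installed package the import resolved to, and whether
# that build was compiled with CUDA support.
print(tf.__version__)
print(tf.__file__)
print(tf.test.is_built_with_cuda())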