CPU utilization when using ray and torch - python

I use ray and torch in my code and set one CPU core for each ray remote actor
to compute gradients (using the torch package). But I find the CPU utilization of the actor
can sometimes go up to 300%, which seems impossible since the actor is supposed to use
only one CPU core.
I want to know whether the actor is actually using more CPU resources, since torch may open one or more
threads to compute the gradient.
My OS is Win10 and my CPU is a Ryzen 5600H. Thanks.

Ray currently does not automatically pin the actor to specific CPU cores and prevent it from using other CPU cores. So what you're seeing makes sense.
It is possible to use a library like psutil to pin the actor to a specific core and prevent it from using other cores. This can be helpful if you have many parallel tasks/actors that are all multi-threaded and competing with each other for resources (e.g., because they use pytorch or numpy).
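A minimal sketch of that idea, assuming psutil is installed; the actor name, the core index, and the torch.set_num_threads call are illustrative choices, not something Ray does for you:

import psutil
import ray
import torch

@ray.remote(num_cpus=1)
class GradientWorker:
    def __init__(self, core_id):
        # Pin this worker process to one core; otherwise torch's
        # intra-op threads may spread across all available cores.
        psutil.Process().cpu_affinity([core_id])
        # Also cap torch's own thread pool to a single thread.
        torch.set_num_threads(1)

    def compute_gradient(self):
        ...  # torch gradient computation goes here

ray.init()
worker = GradientWorker.remote(core_id=0)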

Related

Run code on GPU instead of CPU with detecto

I am using machine learning with detecto in Python. However, whenever I run it, I get a warning saying
It looks like you're training your model on a CPU. Consider switching to a GPU; otherwise,
this method can take hours upon hours or even days to finish. For more information, see
https://detecto.readthedocs.io/en/latest/usage/quickstart.html#technical-requirements
I have a GPU in the form of an Intel(R) HD Graphics 4600, but for some reason the code is running on the CPU. I have checked out the link it gives, which says
By default, Detecto will run all heavy-duty code on the GPU if it’s available and on the CPU otherwise.
It recommends using Google Colab if the computer doesn't have a GPU it can use, but I do have one and don't want to use Google Colab.
Why is it running on the CPU instead of the GPU? And how can I fix it? The part of my code where I get the warning is
losses = fitmodel(loader, Test_dataset, epochs=25, lr_step_size=5,
                  learning_rate=0.001, verbose=True)
The code does work; however, it takes ages to run, so I want to be able to run it on the GPU to save time.
The GPU that detecto is referring to would need to be a CUDA-capable Nvidia GPU, so your Intel(R) HD Graphics 4600 does not meet this criterion.
Detecto uses PyTorch internally, whose GPU support is based on CUDA. So in order to use a GPU, you would need to move to a machine that has a CUDA-capable card.
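Since Detecto sits on top of PyTorch, a quick way to confirm what it sees is to check CUDA availability directly (a small sketch, assuming torch is importable in your environment):

import torch

# Detecto falls back to the CPU whenever this is False,
# which it will be for an Intel HD Graphics 4600.
print(torch.cuda.is_available())
print(torch.cuda.device_count())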

Automatic GPU offloading in python

I have written a piece of scientific code in python, mainly using the numpy library (especially Fast Fourier Transforms), and a bit of Cython. Nothing in CUDA or anything GPU related that I am aware of. There is no graphic interface, everything runs in the terminal (I'm using WSL2 on Windows). The whole code is mostly about number crunching, nothing fancy at all.
When I run my program, I see that CPU usage is ~ 100% (to be expected of course), but GPU usage also rises, to around 5%.
Is it possible that a part of the work gets offloaded automatically to the GPU? How else can I explain this small but consistent increase in GPU usage?
Thanks for the help
No, there is no automatic offloading in Numpy, at least not with the standard Numpy implementation. Note that some specific FFT libraries can use the GPU, but the standard implementation of Numpy uses its own FFT implementation, PocketFFT (based on FFTPACK), which does not use the GPU. Cython does not perform any automatic implicit GPU offloading either. The code needs to do that explicitly/manually.
No GPU offloading is performed automatically because GPUs are not faster than CPUs for all tasks, and offloading data to the GPU is expensive, especially with small arrays (due to the relatively high latency of the PCI bus and of kernel calls in such a case). Moreover, this is hard to do efficiently even in cases where the GPU could theoretically be faster.
The 5% GPU usage is relative to the current frequency of the GPU, which is often configured to use an adaptive frequency. For example, my discrete Nv-1660S GPU is currently running at 300 MHz while it can automatically reach 1.785 GHz. Actually using 5% of a GPU running at 17% of its maximum frequency just for the 2D rendering of a terminal is entirely possible. On my machine, printing lines in a for loop at 10 FPS in a Windows terminal takes 6% of my GPU, still running at a low frequency (0-1% without running anything).
If you want to check the frequency and load of your GPU, there are plenty of tools for that, from vendor tools often installed with the driver to software like GPU-Z on Windows. For Nvidia GPUs, you can list the processes currently using your GPU with nvidia-smi (it should be rocm-smi on AMD GPUs).
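For contrast, explicit offloading would have to look something like the sketch below, using CuPy purely as an illustration of a GPU array library; nothing like this happens implicitly in NumPy or Cython:

import numpy as np
import cupy as cp

x = np.random.rand(1_000_000)

# Offloading is explicit: copy to device memory, compute, copy back.
x_gpu = cp.asarray(x)      # host -> device transfer (PCIe cost)
y_gpu = cp.fft.fft(x_gpu)  # FFT runs on the GPU
y = cp.asnumpy(y_gpu)      # device -> host transfer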

Limit Tensorflow CPU and Memory usage

I've seen several questions about GPU Memory with Tensorflow but I've installed it on a Pine64 with no GPU support.
That means I'm running it with very limited resources (CPU and RAM only) and Tensorflow seems to want it all, completely freezing my machine.
Is there a way to limit the amount of processing power and memory allocated to Tensorflow? Something similar to bazel's own --local_resources flag?
This will create a session that runs one op at a time, and only one thread per op:
import tensorflow as tf

sess = tf.Session(config=tf.ConfigProto(inter_op_parallelism_threads=1,
                                        intra_op_parallelism_threads=1))
I'm not sure about limiting memory; it seems to be allocated on demand. I've had TensorFlow freeze my machine when my network wanted 100 GB of RAM, so my solution was to make networks that need less RAM.
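If you just want a hard cap so the machine stops freezing, one OS-level option on a Linux board like the Pine64 is to limit the process's address space with the standard resource module; this is an assumption about your setup rather than a TensorFlow feature, and allocations beyond the cap fail with an error instead of being throttled:

import resource

# Cap total address space at 2 GB (adjust to your board's RAM).
limit_bytes = 2 * 1024 ** 3
resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))

import tensorflow as tf  # import after setting the limit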
For TensorFlow 2.x this has been answered in the following thread:
In Tensorflow 2.x, there is no session anymore. Directly use the config API to set the parallelism at the start of the program.
import tensorflow as tf

tf.config.threading.set_intra_op_parallelism_threads(2)
tf.config.threading.set_inter_op_parallelism_threads(2)

with tf.device('/CPU:0'):
    model = tf.keras.models.Sequential([...])  # define your layers here
https://www.tensorflow.org/api_docs/python/tf/config/threading

Python Caffe cpu & gpu mode simultaneously

Is it possible to run Caffe in both CPU and GPU mode? I have several Caffe models, but my GPU resources are limited, so that I can't put all models into GPU memory. I want to use e.g. 3 models with GPU mode and 2 models with CPU mode, but set_mode_cpu() and set_mode_gpu() commands just switch the mode for the whole library.
In my opinion, you can write multiple Python scripts, one for each task.
In each script you can choose whether to use the CPU or the GPU (and which GPU device).
Then you can run these multiple scripts at the same time.
But running multiple tasks on one GPU card will slow down the speed severely, in my experience. Good luck!
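A sketch of what each script's setup might look like; the prototxt/caffemodel paths and the GPU id are placeholders, while set_mode_cpu, set_mode_gpu, set_device, and caffe.Net are the standard pycaffe calls:

import caffe

USE_GPU = True  # set to False in the scripts that should stay on the CPU

if USE_GPU:
    caffe.set_device(0)  # GPU id
    caffe.set_mode_gpu()
else:
    caffe.set_mode_cpu()

# placeholder paths for one of your models
net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)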

Python execution speed: laptop vs desktop

I am running a program that does simple data processing:
parses text
populates dictionaries
calculates some functions over the resulting data
The program only uses CPU, RAM, and HDD:
run from Windows command line
input/output to the local hard drive
nothing displayed on or printed to screen
no networking
The same program is run on:
desktop: Windows 7, i7-930 CPU overclocked @ 3.6 GHz (with matching memory speed), Intel X-25M SSD
laptop: Windows XP, Intel Core2 Duo T9300 @ 2.5 GHz, 7200 rpm HDD
The desktop CPU frequency is 1.44 times higher and the HDD benchmark score is 4 times higher (PassMark - Disk Mark), yet I found the program runs only around 1.66 times faster on the desktop. So apparently, the CPU is the bottleneck.
It seems there's only about a 15% benefit from the i7 architecture vs. the Core2 Duo architecture (most of the performance boost is due to the raw CPU frequency). Is there anything I can do in the code to increase the benefit of the new architecture?
EDIT: forgot to mention that I use ActivePython 3.1.2 if that matters.
Increasing hardware performance in most cases automatically benefits user applications. However, the much-maligned GIL means that you may not be able to take advantage of multiple cores with CPython unless you design your program to do so via the various multiprocessing modules/libraries.
SO discussion on the same topic: Does python support multiprocessor/multicore programming?
A related collation of solutions on python wiki: http://wiki.python.org/moin/ParallelProcessing
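A minimal multiprocessing sketch of that idea; parse_file and the input paths are placeholders standing in for your parsing step:

from multiprocessing import Pool

def parse_file(path):
    # placeholder: parse the text and return a dict of results
    with open(path) as f:
        return {'lines': sum(1 for _ in f)}

if __name__ == '__main__':
    paths = ['data1.txt', 'data2.txt', 'data3.txt']  # placeholder inputs
    pool = Pool()  # defaults to one worker process per CPU core
    results = pool.map(parse_file, paths)
    pool.close()
    pool.join()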
Split your processing into multiple threads. Your particular i7 should be able to support up to 8 threads in parallel.
Consider repeating the comparison on regular HDDs: that SSD could well account for a substantial part of the performance difference, depending on caches and the nature of the data.
