Compute Engine n1-standard only uses 50% of CPU - python

I'm running a heavy PyTorch task (a deep learning model) on this VM (n1-standard, 2 vCPUs, 7.5 GB) and the statistics show that CPU usage sits at 50%. On my PC (i7-8700) the CPU utilization is about 90-100% when I run the same script.
I don't understand whether there is some limit for n1-standard machines (I have read in the documentation that only the f1-micro is limited to 20% of a CPU and the g1-small to 50%).
Maybe if I raise the maximum CPU usage, my script will run faster.
Is there any setting I should change?

In this case the task utilizes only one of the two vCPUs you have available, which is why you see only 50% of the CPU being used.
If you allow PyTorch to use all the CPUs of your VM by setting the number of threads, you will see the usage go up to 100%.
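As a minimal sketch, assuming a recent PyTorch build (torch.set_num_threads controls the intra-op thread pool):
import os
import torch

# Let PyTorch use every core the VM exposes (2 on an n1-standard-2).
torch.set_num_threads(os.cpu_count())
print(f"PyTorch intra-op threads: {torch.get_num_threads()}")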

Related

Erratic GPU Utilization while training GAN in Tensorflow 2.5

I'm using TensorFlow 2.5 to train a StarGAN network for generating images (128x128 JPEG). I am using tf.keras.preprocessing.image_dataset_from_directory to load the images from the subfolders.
Additionally, I am using arguments to maximize loading performance as suggested in various posts and threads, such as loadedDataset.cache().repeat().prefetch().
I'm also using num_parallel_calls=tf.data.AUTOTUNE for the mapping functions that post-process the images after loading.
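For reference, a sketch of the kind of pipeline described above (the directory path, batch size and preprocessing function are placeholders, not the actual code from the question):
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

# Load 128x128 images from class subfolders (path is a placeholder).
ds = tf.keras.preprocessing.image_dataset_from_directory(
    "path/to/images", image_size=(128, 128), batch_size=32)

def preprocess(images, labels):
    # Placeholder post-processing: scale pixel values to [-1, 1].
    return (tf.cast(images, tf.float32) / 127.5) - 1.0, labels

ds = (ds
      .map(preprocess, num_parallel_calls=AUTOTUNE)
      .cache()
      .repeat()
      .prefetch(AUTOTUNE))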
While training the network on the GPU, the GPU utilization I am getting is shown in the picture attached below.
My questions regarding this are:
Is the GPU utilization normal, or is it not supposed to be so erratic when training GANs?
Is there any way to make this performance more consistent?
Is there any way to improve the training performance to fully utilize the GPU?
Note that I've also logged my disk I/O and there is no bottleneck reading/writing from the disk (NVMe SSD).
The system has 32 GB RAM and an RTX 3070 with 8 GB VRAM. I have also tried running it on Colab, but the performance was similarly erratic.
It is fairly normal for utilization to be erratic like this for any kind of parallelized software, including training GANs. Of course, it would be better if you could fully utilize your GPU, but writing software that does this is challenging and becomes virtually impossible for complex applications like GANs.
Let me try to demonstrate with a trivial example. Say you have two threads, threadA and threadB. threadA is running the following Python code:
x = some_time_consuming_task()   # long computation on threadA
y = get_y_from_threadB()         # wait for threadB's result
print(x + y)
Here threadA is performing lots of calculations to get the value for x, retrieving the value for y, and printing out the sum x+y. Imagine threadB is also doing some kind of time-consuming calculation to generate the value for y. Unless threadA is ready to retrieve y at the exact moment threadB finishes calculating it, you won't have 100% utilization of both threads for the entire duration of the program. And this is just two threads; when you have hundreds of threads working together with multiple chained data dependencies, you can see how it becomes exponentially more difficult to eliminate all the time threads spend waiting on other threads to deliver input to the next step of computation.
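To make that concrete, here is a minimal, runnable sketch using only the standard library (the sleep durations are arbitrary stand-ins for real work):
import threading
import time

result = {}

def thread_b_work():
    # Simulate threadB computing y.
    time.sleep(2.0)
    result["y"] = 20

def thread_a_work():
    # Simulate threadA computing x, then waiting for y.
    time.sleep(1.0)
    x = 10
    b.join()  # threadA sits idle here whenever threadB is slower
    print(x + result["y"])

b = threading.Thread(target=thread_b_work)
a = threading.Thread(target=thread_a_work)
b.start()
a.start()
a.join()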
Trying to make your "performance more consistent" is pointless. Whether your GPU utilization goes up and down (like in the graph you shared) or stays exactly at the average for the entire execution would not change the overall execution time, which is probably the metric that actually matters here. Utilization is mostly useful for identifying where you can optimize your code.
Fully utilize? Probably not. As explained in my answer to question one, it's going to be virtually impossible to orchestrate your GAN to completely remove bottlenecks. I would encourage you to try and improve execution time, rather than utilization, when optimizing your GAN. There's no magic setting that you're missing that will completely unlock all of your GPU's potential.
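If you do want to see where the time goes, one low-effort option is the TensorFlow profiler (a sketch; the log directory is a placeholder, and the captured trace is viewed in TensorBoard's Profile tab):
import tensorflow as tf

# Trace a short slice of the training loop, then inspect it in TensorBoard.
tf.profiler.experimental.start("logs/profile")
# ... run a few training steps here ...
tf.profiler.experimental.stop()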

Understanding Memory Usage by PyTorch DataLoader Workers

When running a PyTorch training program with num_workers=32 for the DataLoader, htop shows 33 python processes, each with 32 GB of VIRT and 15 GB of RES.
Does this mean that the PyTorch training is using 33 processes x 15 GB = 495 GB of memory? htop shows that only about 50 GB of RAM and 20 GB of swap are being used on the entire machine, which has 128 GB of RAM. So how do we explain the discrepancy?
Is there a more accurate way of calculating the total amount of RAM being used by the main PyTorch program and all its child DataLoader worker processes?
Thank you
Does this mean that the PyTorch training is using 33 processes X 15 GB = 495 GB of memory?
Not necessarily. You have a main process that spawns several subprocesses (the workers), and the CPU has several cores. Each worker usually loads one batch, so the next batch can already be loaded and ready to go by the time the main process asks for it; that is where the speed-up comes from. Also, much of the memory reported for each worker is shared with the parent process (on Linux the workers are forked, so the dataset and the Python runtime are shared copy-on-write yet counted again in every process's VIRT/RES), which is why multiplying 33 x 15 GB overstates the real usage.
I guess you should use far fewer num_workers.
It would be interesting to know your batch size too, which you can adapt for the training process as well.
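For illustration, a sketch of the relevant DataLoader knobs (the dataset is a dummy stand-in, and the batch size and worker count are placeholder values to tune for your machine):
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset; substitute your real Dataset here.
dataset = TensorDataset(torch.randn(1000, 3, 64, 64),
                        torch.zeros(1000, dtype=torch.long))

# Fewer workers keep fewer loader processes (and their memory) alive at once.
loader = DataLoader(dataset, batch_size=64, num_workers=8, pin_memory=True)

if __name__ == "__main__":
    for images, labels in loader:
        pass  # training step would go here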
Is there a more accurate way of calculating the total amount of RAM being used by the main PyTorch program and all its child DataLoader worker processes?
I was googling but could not find a concrete formula. I think it comes down to a rough estimate based on how many cores your CPU has, how much memory you have, and your batch size.
The right num_workers depends on what kind of machine you are using, what kind of dataset you are working with, and how much on-the-fly pre-processing your data requires.
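If you want to measure rather than estimate, one option (a sketch; psutil is a third-party package, and the PSS field is Linux-only) is to sum memory over the main training process and its DataLoader children, using PSS so that shared pages are not double-counted:
import os
import psutil

main = psutil.Process(os.getpid())
procs = [main] + main.children(recursive=True)

# memory_full_info().pss splits shared pages proportionally between processes,
# so summing it gives a fairer total than summing RSS.
total_pss = sum(p.memory_full_info().pss for p in procs)
print(f"Total PSS across main + workers: {total_pss / 2**30:.2f} GiB")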
HTH
There is a Python module called tracemalloc which can be used to trace the memory blocks allocated by Python: https://docs.python.org/3/library/tracemalloc.html
It can show:
Tracebacks of where objects were allocated
Statistics on allocated memory per filename
The difference between two snapshots
import tracemalloc

tracemalloc.start()

do_something_that_consumes_ram_and_releases_some()  # placeholder for your own code

# Show how much RAM the code above allocated and the peak usage (in bytes).
current, peak = tracemalloc.get_traced_memory()
print(f"current: {current}, peak: {peak}")

tracemalloc.stop()
https://discuss.pytorch.org/t/measuring-peak-memory-usage-tracemalloc-for-pytorch/34067

Keras/TF CPU creating too many threads

Even after setting tf.config.threading.set_inter_op_parallelism_threads(1) and tf.config.threading.set_intra_op_parallelism_threads(1), Keras with TensorFlow on CPU (fitting a simple CNN model) on a Linux machine is creating too many threads. Whatever I try, it seems to create 94 threads while going through the fitting epochs. I have tried playing with the tf.compat.v1.ConfigProto settings but nothing helps. How do I limit the number of threads?
This is why TensorFlow creates so many threads.
Using the two mentioned types of parallelism (inter-op and intra-op), you have only limited control over the number of threads generated by TensorFlow. The minimum number of threads you can get by setting these two variables is N, where N is the number of cores on your CPU (I don't know whether you use a GPU).
intra_op_parallelism_threads = 1
inter_op_parallelism_threads = 1
Even setting the environment variables OMP_NUM_THREADS and MKL_NUM_THREADS does not help to reduce the number of threads any further.
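For completeness, a sketch of the settings discussed (the environment variables must be set before TensorFlow is imported; this reduces, but does not eliminate, the extra threads):
import os

# Must be set before importing TensorFlow.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import tensorflow as tf

tf.config.threading.set_inter_op_parallelism_threads(1)
tf.config.threading.set_intra_op_parallelism_threads(1)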
The following discussions suggest that, without changing the source code of TensorFlow, it is not possible to reduce the number of threads below N.
How can I confine TensorFlow C API to use one and only one thread in total
How to disable Tensorflow's multi-threading?
How to stop TensorFlow from multi-threading
https://github.com/tensorflow/tensorflow/issues/42510
https://github.com/tensorflow/tensorflow/issues/33627

Best way to import data in google-colaboratory for fast computing and training?

I am running a simple deep learning model on Google Colab, but it's running slower than my MacBook Air with no GPU.
I read this question and found out it's a problem with importing the dataset over the internet, but I am unable to figure out how to speed up this process.
My model can be found here. Any idea how I can make the epochs faster?
My local machine takes 0.5-0.6 seconds per epoch and Google Colab takes 3-4 seconds.
Is a GPU always faster than a CPU? No. Why? Because the speed-up from a GPU depends on a few factors:
How much of your code runs in parallel, i.e. how much of your code creates threads that run in parallel; this is automatically taken care of by Keras and should not be a problem in your scenario.
The time spent sending data between the CPU and GPU. This is where people often stumble: it is assumed that a GPU will always outperform a CPU, but if the data being passed is too small, performing the computation directly takes less time than splitting the data/processes across threads, executing them on the GPU, and then recombining them on the CPU.
The second scenario looks probable in your case, since you have used a batch_size of 5:
classifier = KerasClassifier(build_fn=build_classifier, epochs=100, batch_size=5)
If your dataset is big enough, increasing the batch_size will improve the GPU's performance relative to the CPU.
Other than that, you have used a fairly simple model, and as #igrinis pointed out, the data is loaded only once from Drive to memory, so in theory loading time should not be the problem.
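To illustrate the batch-size point, a hedged sketch (KerasClassifier from the old keras.wrappers.scikit_learn API, as used in the question; build_classifier below is a trivial stand-in for the asker's model-building function, and 64 is an arbitrary larger batch size):
from tensorflow import keras
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

def build_classifier():
    # Stand-in model; any compiled Keras model returned by build_fn works.
    model = keras.Sequential([
        keras.layers.Dense(16, activation="relu", input_shape=(10,)),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Larger batches amortize the CPU-GPU transfer cost per sample.
classifier = KerasClassifier(build_fn=build_classifier, epochs=100, batch_size=64)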

Keras Multi-GPU: One CPU core goes to 100% and system Hangs

I have been using one GTX 1080 Ti in my system for training, and it was working fine. I recently added another 1080 Ti and tried to use both via the Keras multi_gpu model. Initially the system hung, so I replaced the CPU thermal cooler, and the CPU temperature now stays below 50C. However, during the training process one of the CPU cores goes to 100% usage and the system stops responding. I can move the cursor but can't click. The situation is shown in this image.
Now, I have tried using only one GPU for training, but the issue is still there if I increase max_queue_size above 5 while keeping workers=1 in model.fit_generator(....)
System configuration: Core i7-4790 @ 3.6 GHz, 32 GB RAM
