Understanding Memory Usage by PyTorch DataLoader Workers - python

When running a PyTorch training program with num_workers=32 for the DataLoader, htop shows 33 python processes, each with 32 GB of VIRT and 15 GB of RES.
Does this mean that the PyTorch training is using 33 processes x 15 GB = 495 GB of memory? htop shows only about 50 GB of RAM and 20 GB of swap being used on the entire machine, which has 128 GB of RAM. So how do we explain the discrepancy?
Is there a more accurate way of calculating the total amount of RAM being used by the main PyTorch program and all its child DataLoader worker processes?
Thank you

Does this mean that the PyTorch training is using 33 processes X 15 GB = 495 GB of memory?
Not necessarily. You have a main process with several worker subprocesses, and the CPU has several cores. One worker usually loads one batch, so the next batch can already be loaded and ready to go by the time the main process asks for another batch. This is the secret behind the speed-up.
I guess you should use far fewer num_workers.
It would also be interesting to know your batch size, which you can tune for the training process as well.
Is there a more accurate way of calculating the total amount of RAM being used by the main PyTorch program and all its child DataLoader worker processes?
I googled but could not find a concrete formula. I think it comes down to a rough estimate based on how many cores your CPU has, how much memory you have, and your batch size.
The choice of num_workers depends on what kind of machine you are using, what kind of dataset you are working with, and how much on-the-fly pre-processing your data requires.
HTH
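
As a rough sketch of how you might pick num_workers empirically (the toy TensorDataset, batch size and worker counts below are illustrative placeholders, not values from the question), you can time one loading-only pass over a DataLoader for a few candidate values:

import time
import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    # Hypothetical stand-in dataset; replace with your own Dataset.
    data = torch.randn(10_000, 3, 64, 64)
    labels = torch.randint(0, 10, (10_000,))
    dataset = TensorDataset(data, labels)
    for num_workers in (0, 2, 4, 8):
        loader = DataLoader(dataset, batch_size=64, num_workers=num_workers)
        start = time.perf_counter()
        for _ in loader:  # one full pass, data loading only
            pass
        print(f"num_workers={num_workers}: {time.perf_counter() - start:.2f} s")

if __name__ == "__main__":
    main()

The first value at which the per-pass time stops improving is usually a reasonable choice; going higher only adds worker processes (and memory).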

There is a Python module called tracemalloc which can be used to trace memory blocks allocated by Python: https://docs.python.org/3/library/tracemalloc.html It can show:
Tracebacks
Statistics on memory per filename
Compute the diff between snapshots
import tracemalloc

tracemalloc.start()
do_something_that_consumes_ram_and_releases_some()  # placeholder for your own code
# Show how much RAM the above code currently holds and its peak usage (reported in bytes)
current, peak = tracemalloc.get_traced_memory()
print(f"current: {current / 1e6:.2f} MB, peak: {peak / 1e6:.2f} MB")
tracemalloc.stop()
https://discuss.pytorch.org/t/measuring-peak-memory-usage-tracemalloc-for-pytorch/34067
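Note that tracemalloc only sees Python-level allocations inside the current process. For the original question (total RAM of the main PyTorch process plus all its DataLoader workers), a common alternative is to sum per-process memory with psutil; the sketch below is illustrative and not from the linked thread. USS (unique set size) counts only the pages unique to each process, so memory shared between forked workers is not double-counted the way summed RSS would be.

import os
import psutil

def total_memory_gb():
    # Sum memory of this process and all of its children (e.g. DataLoader workers).
    parent = psutil.Process(os.getpid())
    procs = [parent] + parent.children(recursive=True)
    uss = sum(p.memory_full_info().uss for p in procs)   # shared pages counted once
    rss = sum(p.memory_info().rss for p in procs)        # shared pages counted per process
    return uss / 1e9, rss / 1e9

uss_gb, rss_gb = total_memory_gb()
print(f"USS total: {uss_gb:.1f} GB, summed RSS: {rss_gb:.1f} GB")

Call this from inside the training script (for example once per epoch). The USS total is usually much closer to what htop reports for the whole machine than 33 x RES, because the workers are forked copies that share most of their pages with the parent.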

Related

Compute Engine n1-standard only use 50% of CPU

I'm running a heavy PyTorch task on this VM (n1-standard, 2 vCPU, 7.5 GB) and the statistics show that the CPU usage is at 50%. On my PC (i7-8700) the CPU utilization is about 90-100% when I run this script (a deep learning model).
I don't understand whether there is some limit for the n1-standard machine (I have read in the documentation that only the f1 is capped at 20% of CPU usage and the g1 at 50%).
Maybe if I increase the maximum CPU usage, my script will run faster.
Is there any setting I should change?
In this case, the task utilizes only one of the two vCPUs you have available, so that's why you see only 50% of the CPU being used.
If you allow PyTorch to use all the CPUs of your VM by setting the number of threads, then you will see the usage go up to 100%.
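A minimal sketch of that suggestion; using os.cpu_count() as the thread count is an assumption here, not something stated in the answer:

import os
import torch

print("default intra-op threads:", torch.get_num_threads())
# Let PyTorch use every vCPU the VM exposes for CPU-bound tensor operations.
torch.set_num_threads(os.cpu_count() or 1)
print("now using:", torch.get_num_threads())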

How to calculate CPU memory usage of fit_generator in Keras?

I'm trying to train a network using Keras and I've got a problem with fit_generator and excessive use of memory.
I currently have 128 GB of RAM and my dataset occupies 20 GB (compressed). I load the compressed dataset into RAM and then use a Sequence generator to uncompress batches of data to feed the network. Each of my samples is 100x100x100 pixels stored as float32, and I'm using a batch_size of 64, a queue_size of 5 and 27 workers with multiprocessing=True. In theory, I should have a total of 100*100*100 * 4 * 64 * 5 * 27 ≈ 35 GB. However, when I run my script, it gets killed by the queuing system because of excessive memory usage:
slurmstepd: error: Job XXXX exceeded memory limit (1192359452 > 131072000), being killed
I've even tried to use a max_queue_size as small as 2, and the process still exceeds the memory limit. To make things even harder to understand, sometimes, completely by chance, the process runs properly (even with a max_queue_size of 30!).
I verify before running my script that the memory is actually free, using free -m and everything looks fine. I also tried to profile my script with memory-profiler, although the results are quite strange:
It looks like my script is producing 54 (?!) different children and is using 1200 GB (!) of RAM. This clearly doesn't make any sense...
Am I calculating the memory usage of fit_generator wrong? From what I understand from the documentation, it looks like the data should be shared across workers, so the largest part of memory should be used by the queued batches. Is there anything that I'm missing?
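For reference, the asker's back-of-the-envelope estimate can be reproduced directly; this snippet only restates the arithmetic from the question:

# Rough upper bound on memory held by queued batches, as estimated in the question.
bytes_per_sample = 100 * 100 * 100 * 4   # 100x100x100 voxels stored as float32
batch_size = 64
max_queue_size = 5
workers = 27
total_bytes = bytes_per_sample * batch_size * max_queue_size * workers
print(f"{total_bytes / 1e9:.1f} GB")     # about 34.6 GB

The slurm limit of 131072000 appears to be in KB (it equals the machine's 128 GB), so the reported 1192359452 would be roughly 1.2 TB, consistent with the 1200 GB the profiler shows and far beyond this queue estimate.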

DataLoader num_workers vs torch.set_num_threads

Is there a difference between the parallelization that takes place between these two options? I'm assuming num_workers is solely concerned with parallelizing the data loading. But is torch.set_num_threads for parallelizing training in general? I'm trying to understand the difference between these options. Thanks!
The num_workers argument of the DataLoader specifies how many parallel workers to use to load the data and run all the transformations. If you are loading large images or have expensive transformations, you can be in a situation where the GPU is fast to process your data but your DataLoader is too slow to continuously feed it. In that case, setting a higher number of workers helps. I typically increase this number until my epoch step is fast enough. Also, a side tip: if you are using Docker, you usually want to set the shared memory size (shm) to 1x to 2x the number of workers in GB for a large dataset like ImageNet.
torch.set_num_threads specifies how many threads to use for parallelizing CPU-bound tensor operations. If you are using the GPU for most of your tensor operations, then this setting doesn't matter too much. However, if you have tensors that you keep on the CPU and you are doing a lot of operations on them, then you might benefit from setting this. The PyTorch docs, unfortunately, don't specify which operations will benefit from it, so watch your CPU utilization and adjust this number until you can max it out.
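One way the two settings interact in practice: each DataLoader worker is a separate process that inherits the intra-op thread setting, so heavy CPU transforms across many workers can oversubscribe the cores. The sketch below shows a common pattern (an illustration, not something from the answer above) of pinning each worker to one thread via worker_init_fn; the toy TensorDataset and the specific thread counts are placeholders.

import torch
from torch.utils.data import DataLoader, TensorDataset

def limit_worker_threads(worker_id):
    # Each worker is its own process, so cap it at one intra-op thread
    # to keep num_workers processes from oversubscribing the CPU cores.
    torch.set_num_threads(1)

# Toy dataset just to make the sketch runnable; use your own Dataset here.
dataset = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))
loader = DataLoader(dataset, batch_size=64, num_workers=8,
                    worker_init_fn=limit_worker_threads)

# The main process keeps several threads for CPU-bound tensor operations.
torch.set_num_threads(4)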

Best way to import data in google-colaboratory for fast computing and training?

I am running a simple deep learning model on Google Colab, but it's running slower than on my MacBook Air with no GPU.
I read this question and found out it's a problem caused by importing the dataset over the internet, but I am unable to figure out how to speed this process up.
My model can be found here. Any idea of how I can make the epoch faster?
My local machine takes 0.5-0.6 seconds per epoch and Google Colab takes 3-4 seconds.
Is a GPU always faster than a CPU? No. Why? Because the speed-up from a GPU depends on a few factors:
How much of your code runs in parallel, i.e. how much of your code creates threads that run in parallel; this is automatically taken care of by Keras and should not be a problem in your scenario.
The time spent sending data between the CPU and GPU. This is where people often falter: it is assumed that the GPU will always outperform the CPU, but if the data being passed is too small, the computation itself (the number of computation steps required) takes less time than splitting the data/processes into threads, executing them on the GPU and then recombining them back on the CPU.
The second scenario looks probable in your case since you have used a batch_size of 5.
classifier = KerasClassifier(build_fn=build_classifier, epochs=100, batch_size=5). If your dataset is big enough, increasing the batch_size will increase the advantage of the GPU over the CPU.
Other than that, you have used a fairly simple model, and as #igrinis pointed out, the data is loaded only once from Drive into memory, so in theory the problem should not be loading time.
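To illustrate only the batch-size suggestion (the real build_classifier belongs to the asker and is not shown, so a toy model stands in for it here; the wrapper import assumes a Keras version that still ships keras.wrappers.scikit_learn):

from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

def build_classifier():
    # Toy stand-in for the asker's build_classifier; their actual model is not shown.
    model = Sequential([Dense(16, activation="relu", input_shape=(10,)),
                        Dense(1, activation="sigmoid")])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Same wrapper call as in the question, but with a larger batch so each
# CPU-to-GPU transfer carries more work relative to its overhead.
classifier = KerasClassifier(build_fn=build_classifier, epochs=100, batch_size=64)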

Memory Consumption Random Forest scikit learn trained in parallel

I'm using Random Forests from scikit-learn in production and I'm trying to minimise and understand their memory footprint. Therefore, I ran memory profiling on my prediction script. My random forest is loaded from a file which is approximately 40 MB, with 30 subtrees, each about 1 MB in size. It was trained in parallel on 12 cores. When executing my profiling script, it gave the following output (in the WrapperClass the classifier is loaded into memory):
Line # Mem usage Increment Line Contents
================================================
41 579.2 MiB 532.7 MiB classifier = WrapperClass('test_tree.rf')
However, when I look at the same process in htop, it shows me that the memory usage is about 4 GB. How do these two numbers fit together? Which one can I trust?
Thanks a lot for your answers.
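
One way to cross-check memory_profiler's per-line increment against what htop reports is to look at the process's own RSS before and after the load. The sketch below builds and loads a small toy forest so that it is self-contained; in the asker's case the load line would be WrapperClass('test_tree.rf') instead:

import os
import joblib
import psutil
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Build and save a small forest just so the sketch is runnable on its own.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
joblib.dump(RandomForestClassifier(n_estimators=30, n_jobs=12).fit(X, y), "toy_forest.joblib")

proc = psutil.Process(os.getpid())
rss_before = proc.memory_info().rss
clf = joblib.load("toy_forest.joblib")   # stands in for WrapperClass('test_tree.rf')
rss_after = proc.memory_info().rss

# memory_profiler's "Increment" roughly corresponds to this RSS delta for the
# load line, while htop's RES/VIRT columns cover the whole process: the
# interpreter, numpy/scipy, and any preallocated buffers or thread pools.
print(f"RSS delta for the load: {(rss_after - rss_before) / 1e6:.1f} MB; "
      f"total RSS now: {rss_after / 1e6:.1f} MB")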
