I'm using Random Forests from scikit-learn in production and I'm trying to minimise and understand their memory footprint, so I ran memory profiling on my prediction script. My random forest is loaded from a file that is approximately 40 MB in size, containing 30 subtrees of about 1 MB each. It was trained in parallel on 12 cores. The classifier is loaded into memory inside WrapperClass. Executing my profiling script gave the following output:
Line # Mem usage Increment Line Contents
================================================
41 579.2 MiB 532.7 MiB classifier = WrapperClass('test_tree.rf')
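For reference, line-by-line numbers like the above come from memory_profiler's @profile decorator; a minimal sketch of what the profiled script looks like (WrapperClass is from my own code, the my_wrapper module name is just a placeholder):
from memory_profiler import profile
from my_wrapper import WrapperClass  # placeholder import; WrapperClass lives in my own code

@profile
def load_classifier():
    # the "Increment" column above reports how much memory this line added
    classifier = WrapperClass('test_tree.rf')
    return classifier

if __name__ == '__main__':
    load_classifier()
Running the script directly then prints the per-line report shown above.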
However, when I look at the same process in htop, it shows a memory usage of about 4 GB. How do these two numbers fit together? Which one can I trust?
Thanks a lot for your answers.
So, I would like to train a LightGBM model on a remote, large Ray cluster with a large dataset. Before that, I would like to write the code such that I can also run the training in a memory-constrained setting, e.g. my local laptop, where the dataset does not fit in memory. That will require some way of lazily loading the data.
The way I imagine it, it should be possible with Ray to load batches of random samples of the large dataset from disk (multiple .pq files) and feed them to the LightGBM training function. The memory would thereby act as a fast buffer that holds randomly loaded batches, which are fed to the training function and then dropped from memory again. Multiple workers take care of the training plus the I/O for loading new samples from disk into memory. The maximum amount of memory could be capped so as not to exceed my local resources, so that my PC doesn't crash. Is this possible?
I have not yet understood whether LightGBM needs the full dataset at once or whether it can be fed batches iteratively, as with neural networks, for instance. So far, I have tried using the lightgbm_ray library for this:
from lightgbm_ray import RayDMatrix, RayParams, train, RayFileType

# some stuff before
...

# make dataset
data_train = RayDMatrix(
    data=filenames,
    label=TARGET,
    feature_names=features,
    filetype=RayFileType.PARQUET,
    num_actors=2,
    lazy=True,
)

# feed to training function
evals_result = {}
bst = train(
    params_model,
    data_train,
    evals_result=evals_result,
    valid_sets=[data_train],
    valid_names=["train"],
    verbose_eval=False,
    ray_params=RayParams(num_actors=2, cpus_per_actor=2),
)
I thought the lazy=True keyword might take care of it; however, when I execute this, I see the memory being maxed out and then my app crashes.
Thanks for any advice!
LightGBM requires loading the entire dataset for training, so in this case, you can test on your laptop with a subset of the data (i.e. only pass a subset of the parquet filenames in).
The lazy=True flag delays the data loading so that it is split across the actors, rather than everything being loaded into memory first and then split and sent to the actors. However, this still loads the entire dataset into memory, since all actors are on the same (local) node.
Additionally, when you do move to running on the remote cluster, these tips might be helpful to optimize memory usage: https://docs.ray.io/en/latest/train/gbdt.html?highlight=xgboost%20memro#how-to-optimize-xgboost-memory-usage.
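For the local test, the simplest version of that suggestion is to slice the filename list before building the RayDMatrix. A sketch, reusing the names from the question (how many files you can pass depends on your local RAM):
# pass only a handful of parquet files instead of all of them (adjust to what fits in RAM)
subset_filenames = filenames[:4]

data_train = RayDMatrix(
    data=subset_filenames,
    label=TARGET,
    feature_names=features,
    filetype=RayFileType.PARQUET,
    num_actors=2,
    lazy=True,
)
random.sample(filenames, 4) works as well if you want a random subset of files instead of the first few.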
When running a PyTorch training program with num_workers=32 for the DataLoader, htop shows 33 Python processes, each with 32 GB of VIRT and 15 GB of RES.
Does this mean that the PyTorch training is using 33 processes X 15 GB = 495 GB of memory? htop shows only about 50 GB of RAM and 20 GB of swap being used on the entire machine, which has 128 GB of RAM. So how do we explain the discrepancy?
Is there a more accurate way of calculating the total amount of RAM being used by the main PyTorch program and all its child DataLoader worker processes?
Thank you
Does this mean that the PyTorch training is using 33 processes X 15 GB = 495 GB of memory?
Not necessarily. RES includes memory that is shared between processes: the DataLoader workers are forked from the main process, so large read-only data such as the dataset object and the loaded libraries show up in every worker's RES even though they exist in physical RAM only once. Simply summing RES therefore heavily over-counts.
Beyond that: you have a main process with several worker subprocesses, and the CPU has several cores. One worker usually loads one batch, so the next batch can already be loaded and ready to go by the time the main process is ready for another batch. This is the secret behind the speed-up.
I guess you should use far fewer num_workers.
It would also be interesting to know your batch size, which you can adapt for the training process as well.
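To make that advice concrete, here is a sketch of a more conservative DataLoader configuration (the dataset and the numbers are placeholders, not something from the question); memory for prefetched batches grows roughly with num_workers * prefetch_factor * batch size:
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,                # your Dataset instance (placeholder)
    batch_size=64,          # placeholder; smaller batches mean smaller prefetch buffers
    num_workers=8,          # far fewer than 32
    prefetch_factor=2,      # batches each worker keeps pre-loaded (2 is the default)
    pin_memory=True,
)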
Is there a more accurate way of calculating the total amount of RAM being used by the main PyTorch program and all its child DataLoader worker processes?
I googled but could not find a concrete formula. I think it is a rough estimate based on how many cores your CPU has, how much memory is available, and the batch size.
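That said, one number that is less misleading than summing RES is the USS (memory unique to each process) of the main process plus all its children, which you can get with psutil. A sketch, assuming psutil is installed (memory_full_info is relatively expensive, and USS/PSS are only available on some platforms, e.g. Linux):
import psutil

def training_memory_gib(main_pid):
    """Sum USS (pages unique to each process) over the training process and its workers."""
    main = psutil.Process(main_pid)
    total = 0
    for proc in [main] + main.children(recursive=True):
        try:
            total += proc.memory_full_info().uss
        except psutil.NoSuchProcess:
            pass  # a worker may have exited between listing and querying
    return total / 1024 ** 3

print(f"{training_memory_gib(12345):.1f} GiB")  # 12345 = PID of the main training process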
The choice of num_workers depends on what kind of machine you are using, what kind of dataset you are working with, and how much on-the-fly pre-processing your data requires.
HTH
There is a Python module called tracemalloc which is used to trace the memory blocks allocated by Python: https://docs.python.org/3/library/tracemalloc.html
Among other things, it gives you:
Tracebacks of where a block was allocated
Statistics on memory usage per filename
The diff between two snapshots
import tracemalloc

tracemalloc.start()

do_something_that_consumes_ram_and_releases_some()

# show how much RAM the above code allocated and what the peak usage was
current, peak = tracemalloc.get_traced_memory()
print(f"current={current / 2**20:.2f} MiB, peak={peak / 2**20:.2f} MiB")

tracemalloc.stop()
https://discuss.pytorch.org/t/measuring-peak-memory-usage-tracemalloc-for-pytorch/34067
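The per-filename statistics and the snapshot diffs mentioned above use the snapshot API; a minimal, self-contained sketch (the do_work function is just a placeholder workload):
import tracemalloc

def do_work():
    # placeholder workload: allocate some memory and drop most of it again
    data = [bytes(10_000) for _ in range(1_000)]
    return len(data)

tracemalloc.start()
snapshot_before = tracemalloc.take_snapshot()

do_work()

snapshot_after = tracemalloc.take_snapshot()

# top allocation sites, grouped by filename
for stat in snapshot_after.statistics("filename")[:5]:
    print(stat)

# what changed between the two snapshots, line by line
for stat in snapshot_after.compare_to(snapshot_before, "lineno")[:5]:
    print(stat)

tracemalloc.stop()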
I'm running a heavy PyTorch task on this VM (n1-standard, 2 vCPUs, 7.5 GB) and the statistics show that the CPU usage is at 50%. On my PC (i7-8700) the CPU utilization is about 90-100% when I run this script (a deep learning model).
I don't understand whether there is some limit for the n1-standard machine (I have read in the documentation that only the f1 machines are capped at 20% CPU usage and the g1 machines at 50%).
Maybe if I increase the maximum CPU usage, my script will run faster.
Is there any setting I should change?
In this case, the task utilizes only one of the two vCPUs that you have available, so that's why you see only 50% of the CPU being used.
If you allow PyTorch to use all the CPUs of your VM by setting the number of threads, you will see that the usage goes up to 100%.
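In PyTorch that is a one-liner; a sketch (the value 2 simply matches the two vCPUs of the n1-standard machine from the question):
import torch

# let intra-op parallelism use both vCPUs of the VM
torch.set_num_threads(2)
print(torch.get_num_threads())  # verify the setting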
I'm trying to train a network using Keras and I've got a problem with fit_generator and excessive memory usage.
I currently have 128 GB of RAM and my dataset occupies 20 GB (compressed). I load the compressed dataset into RAM and then use a Sequence generator to uncompress batches of data to feed the network. Each of my samples is 100x100x100 pixels stored as float32; I'm using a batch_size of 64, a queue size of 5 and 27 workers with multiprocessing=True. In theory, the queued batches should take at most about 100*100*100 * 4 * 64 * 5 * 27 =~ 35 GB. However, when I run my script, it gets killed by the queuing system because of excessive memory usage:
slurmstepd: error: Job XXXX exceeded memory limit (1192359452 > 131072000), being killed
I've even tried using a max_queue_size as small as 2, and the process still exceeds the memory limit. To make things even more difficult to understand, sometimes, completely by chance, the process runs through properly (even with a max_queue_size of 30!).
I verify before running my script that the memory is actually free, using free -m, and everything looks fine. I also tried to profile my script with memory-profiler, but the results are quite strange: it looks like my script is spawning 54 (??) different children and using 1200 GB (!) of RAM. This clearly doesn't make any sense...
Am I calculating the memory usage of fit_generator wrongly? From what I understand from the documentation, the data should be shared across workers, so the largest part of the memory should be used by the queued batches. Is there anything I'm missing?
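For reference, the setup described above corresponds roughly to the following sketch; the zlib decompression, the CompressedVolumes class, and the model/compressed_samples/labels variables are all placeholders, only batch_size=64, max_queue_size=5, workers=27 and use_multiprocessing=True come from the question:
import zlib
import numpy as np
import keras

class CompressedVolumes(keras.utils.Sequence):
    """Yields batches of 100x100x100 float32 volumes, decompressed on the fly."""

    def __init__(self, compressed_samples, labels, batch_size=64):
        self.compressed_samples = compressed_samples  # compressed dataset, kept in RAM
        self.labels = labels
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.compressed_samples) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        x = np.stack([
            np.frombuffer(zlib.decompress(s), dtype=np.float32).reshape(100, 100, 100)
            for s in self.compressed_samples[sl]
        ])
        y = np.asarray(self.labels[sl])
        return x, y

model.fit_generator(
    CompressedVolumes(compressed_samples, labels, batch_size=64),
    max_queue_size=5,
    workers=27,
    use_multiprocessing=True,
)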
I have been testing a word2vec model. For some reason this word2vec model doesn't use the GPU much. My performance is roughly 1 epoch every 30 seconds with a dataset of ~2000 samples.
This doesn't seem normal. There are researchers who have gigabytes of training data, and I doubt they are waiting months for the training to finish.
My GPU is a GTX 970. The memory usage is around 10% (note that I have a few other programs open too).
The problem might be the batches themselves, although I am not sure.
Basically, I run a method at the start of training that builds the batch list, and then while training I iterate over the entries in that list.
This is roughly how I do it.
Is my approach wrong? (I would guess that it's not suitable for huge datasets.)
batch_method(batch_size=x)  # builds self.batch_list; I tested with sizes from 2 to 512, all seem to train fine

for epoch in range(self.epochs_num):
    for batch in self.batch_list:
        for input, target in batch:
            ...