Why doesn't O3 use less GPU memory than O1? - python

I'm training EfficientDet-D7 (head_only=True) on a single 2080 Ti, and I'm using NVIDIA/apex amp.
When I use opt_level=O1, the memory usage is definitely reduced compared to not using apex at all.
But when I use opt_level=O2 or O3, more memory is consumed.
I am experimenting on identical 2080 Ti cards, one GPU per run, in two containers created from the same Docker image. The training code for the O3 run is a copy of the O1 code with only the opt_level changed, and all the training args are the same (including the batch size and D7).
Why does this happen?
Additionally, can you recommend a book about this topic (e.g. deep learning memory, GPUs, etc.)?
Thanks!

You're optimizing for speed. Some speed optimizations will reduce memory usage; others will increase it.
An example of when speed optimization reduces memory usage is when unnecessary variables and function calls are removed.
An example of the opposite is loop unrolling.
There's no reason to expect optimization to either reduce or increase memory usage. That's not the goal when optimizing for speed. Any increase or decrease is just a byproduct.
If you really want to find out why it happens in your particular case, you can study the documentation for your compiler and inspect the assembly code.
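If you want to check this concretely in the asker's setting rather than at the compiler level, one option is to measure the peak GPU memory of a single training step under each opt_level. The sketch below is my own assumption of the setup, not the asker's script: model, optimizer and one batch (images, targets) are presumed to already exist, and the forward call returning a loss is hypothetical.

# Minimal sketch (assumed setup): report peak GPU memory for one training step
# under a given apex opt_level. Call it once per container with "O1" and "O3".
import torch
from apex import amp

def peak_memory_gib(model, optimizer, images, targets, opt_level="O1"):
    model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)
    torch.cuda.reset_peak_memory_stats()   # reset_max_memory_allocated() on older PyTorch
    loss = model(images, targets)          # hypothetical: the forward pass returns the loss
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
    return torch.cuda.max_memory_allocated() / 1024 ** 3   # peak, in GiB

Comparing the O1 and O3 numbers with everything else fixed at least turns the question into a measurement instead of a guess.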

Related

Understand order of magnitude performance gap between python and C++ for CPU heavy application

**Summary:** I observe a ~1000× performance gap between Python code and C++ code doing the same job, despite the use of parallelization, vectorization, just-in-time compilation and machine-code conversion using Numba, in the context of scientific calculation. The CPUs are not used at full capacity, and I don't understand why.
Hello everybody,
I just started in a laboratory doing simulations of various materials, including simulation of the growth of biological-like tissues. To do that we create a 3D version of said tissue (a collection of vertices stored in a NumPy array) and we apply different functions to it to mimic physics/biology.
We have a C++ code doing just that, which takes approximately 10 seconds to run. Someone converted said code to Python, but this version takes about 2.5 hours to run. We tried every trick in the book we knew to accelerate the code: we used Numba to accelerate NumPy where appropriate, parallelized the code as much as we could, and vectorized what could be vectorized, but the gap remains. In fact, an earlier version of the code took days to run.
When the code executes, multiple cores are properly used, as monitored with the built-in system monitor. However, they are not used at full capacity, and in fact deactivating cores manually does not seem to hurt performance much. At first I thought it could be due to the GIL, but releasing it had no effect on performance either. Somehow it makes me think of a bottleneck in memory transfer between the CPU and the RAM, but I cannot understand why the C++ version would not have the same problem. I also have the feeling that there is a performance cost to calling functions: one of my earlier tasks was to refactor the code, decomposing complicated functions into smaller elements, and since then I have seen a small performance degradation compared to the earlier version.
I am really wondering where my bottleneck is and how it could be tested/improved. Any ideas would be very welcome.
I am aware my question is a complicated one, so let me know if you need additional information; I would be happy to provide it.
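Not part of the original post, but one way to test where the bottleneck is before guessing at the GIL or memory bandwidth is to profile a full run with cProfile. A minimal sketch, where simulate_growth() is a hypothetical stand-in for the real simulation entry point:

# Profile one full simulation run and list the 20 most expensive call sites.
import cProfile
import pstats

def simulate_growth():
    ...  # hypothetical: run one full tissue-growth simulation here

profiler = cProfile.Profile()
profiler.enable()
simulate_growth()
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)

If the top entries are NumPy ufuncs, the code is likely memory- or algorithm-bound; if they are small Python-level helpers called millions of times, the suspicion about function-call overhead is probably right.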

Understanding memory growth of a Python process (VmRSS vs gc.get_objects)

Without going into algorithmic details, let's just say that my code sequentially processes a list of inputs:
inputs = [2,5,6,7,8,10,12,13,14,15,16,17,18,19,20,21]
for i in inputs:
    process_input(i)
For simplicity, let's consider process_input to be a state-less black box.
I know that this site is full of questions about finding memory leaks in Python code, but this is not what this question is about. Instead, I'm trying to understand the memory consumption of my code over time and whether it might suffer from leaking memory.
In particular, I'm trying to understand a discrepancy between two distinct indicators of memory usage:
The number of allocated objects (reported by gc.get_objects) and
the actually used amount of physical memory (read from VmRSS on a Linux system).
To study these two indicators, I expanded the original code from above as follows:
import time, gc

def get_current_memory_usage():
    with open('/proc/self/status') as f:
        memusage = f.read().split('VmRSS:')[1].split('\n')[0][:-3]
    return int(memusage.strip()) / (1024 ** 2)

inputs = [2,5,6,7,8,10,12,13,14,15,16,17,18,19,20,21]

gc.collect()
last_object_count = len(gc.get_objects())

for i in inputs:
    print(f'\nProcessing input {i}...')
    process_input(i)
    gc.collect()
    time.sleep(1)

    memory_usage = get_current_memory_usage()
    object_count = len(gc.get_objects())
    print(f'Memory usage: {memory_usage:.2f} GiB')
    print(f'Object count: {object_count - last_object_count:+}')
    last_object_count = object_count
Note that process_input is state-less, i.e. the order of the inputs does not matter. Thus, we would expect both indicators to be about the same before and after each run of process_input, right? Indeed, this is what I observe for the number of allocated objects. However, the memory consumption grows steadily.
Now my core question: Do these observations indicate a memory leak? To my understanding, memory leaking in Python would be indicated by a growth of allocated objects, which we do not observe here. On the other hand, why does the memory consumption grow steadily?
For further investigation, I also ran a second test. For this test, I repeatedly invoked process_input(i) with a fixed input i (five times each) and recorded the memory consumption between the iterations:
For i=12, the memory consumption remained constant at 10.91 GiB.
For i=14, the memory consumption remained constant at 7.00 GiB.
I think these observations make the presence of a memory leak even more unlikely, right? But then, what could explain why the memory consumption does not fall between the iterations, given that process_input is state-less?
The system has 32 GiB RAM in total and is running Ubuntu 20.04. Python version is 3.6.10. The process_input function uses several third-party libraries.
In general RSS is not a particularly good indicator because it is "resident" set size and even a rather piggish process, in terms of committed memory, can have a modest RSS as memory can be swapped out. You can look at /proc/self/smaps and add up the size of the writable regions to get a much better benchmark.
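As a rough illustration of that suggestion (my own sketch, not the answerer's code), summing the Size fields of the writable mappings in /proc/self/smaps could look like this:

# Sum the Size: fields of all writable mappings in /proc/self/smaps (Linux only).
def writable_size_kib():
    total_kib = 0
    writable = False
    with open('/proc/self/smaps') as f:
        for line in f:
            fields = line.split()
            if '-' in fields[0]:             # mapping header, e.g. "55d3...-55d4... rw-p ..."
                writable = 'w' in fields[1]
            elif fields[0] == 'Size:' and writable:
                total_kib += int(fields[1])  # value is reported in kB
    return total_kib

print(f'writable mappings: {writable_size_kib() / (1024 ** 2):.2f} GiB')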
On the other hand, if there is actually growth, and you want to understand why, you need to look at the actual dynamically allocated memory. What I'd suggest for this is using https://github.com/vmware/chap
To do this, just make that 1 second sleep a bit longer, put a print just before the call to sleep, and use gcore from another session to gather a live core during a few of those sleeps.
So let's say you have cores gathered from when the input was 14 and from when it was 21. Look at each of the cores using chap, for example with the following commands:
count used
That will give you a good view of allocations that have been requested but not released. If the numbers are much larger for the later core, you probably have some kind of growth issue. If those numbers do differ by quite a lot, use
summarize used
If you have growth, it is possible that there is a leak (as opposed to some container simply expanding). To check this, you can try commands like
count leaked
show leaked
From there you should probably look at the documentation, depending on what you find.
OTOH if used allocations are not the issue, maybe try the following, to see memory for allocations that have been released but are part of larger regions of memory that cannot be given back to the operating system because parts of those regions are still in use:
count free
summarize free
If neither "used" allocations or "free" allocations are the issue, you might try:
summarize writable
That is a very high level view of all writable memory. For example, you can see things like stack usage...

Tensorflow - Profiling using timeline - Understand what is limiting the system

I am trying to understand why each training iteration takes approx. 1.5 sec.
I used the tracing method described here. I am working on a Titan X Pascal GPU. My results look very strange: it seems that every operation is relatively fast and the system is idle most of the time between operations. How can I tell from this what is limiting the system?
It does seem, however, that when I drastically reduce the batch size the gaps close, as can be seen here.
Unfortunately the code is very complicated, and I can't post a small version of it that has the same problem.
Is there a way to understand from the profiler what is taking up the time in the gaps between operations?
Thanks!
EDIT:
On CPU only I do not see this behavior:
I am running a
Here are a few guesses, but it's hard to say without a self-contained reproduction that I can run and debug.
Is it possible you are running out of GPU memory? One signal of this is if you see log messages of the form Allocator ... ran out of memory during training. If you run out of GPU memory, then the allocator backs off and waits in the hope more becomes available. This might explain the large inter-operator gaps that go away if you reduce the batch size.
As Yaroslav suggests in a comment above, what happens if you run the model on CPU only? What does the timeline look like?
Is this a distributed training job or a single-machine job? If it's a distributed job, does a single-machine version show the same behavior?
Are you calling session.run() or eval() many times, or just once per training step? Every run() or eval() call will drain the GPU pipeline, so for efficiency you usually need to express your computation as one big graph with only a single run() call. (I doubt this is your problem, but I mention it for completeness.)
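To make that last point concrete, here is a toy TF1-style sketch (not the asker's model; the tiny placeholder graph and random batch are made up for illustration):

# Fetch the train op and the loss in a single session.run() call per step,
# rather than separate loss.eval() and train_op.run() calls.
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 10])
y = tf.placeholder(tf.float32, shape=[None, 1])
pred = tf.layers.dense(x, 1)
loss = tf.reduce_mean(tf.square(pred - y))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

x_batch = np.random.rand(32, 10).astype(np.float32)
y_batch = np.random.rand(32, 1).astype(np.float32)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    _, loss_value = sess.run([train_op, loss],
                             feed_dict={x: x_batch, y: y_batch})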

Memory exhausted on big matrix operation using dask

Currently I'm implementing this paper for my undergraduate thesis in Python, but I only use the Mahalanobis metric learning part (in case you're curious).
In short, I face a problem when I need to learn a 67K*67K matrix of integers, computed simply as numpy.dot(A.T, A), where A is a random vector of shape (1, 67K). When I do that it simply throws a MemoryError, since my PC only has 8 GB of RAM and a rough calculation puts the memory needed at 16 GB just to initialize it. Then I searched for an alternative and found dask.
So I moved on to dask with dask.array.dot(A.T, A), and that works. But then I need to apply a whitening transformation to that matrix, which in dask I can achieve by computing the SVD. But every time I run that SVD, the IPython kernel dies (I assume due to lack of memory).
This is what I do so far, from initialization until the kernel dies:
import dask.array as da

fv_length = 512*2*66
W = da.random.randint(10,20,(fv_length),(1000,1000))
W = da.reshape(W,(1,fv_length))
W_T = W.T
Wt = da.dot(W_T,W); del W,W_T
Wt = da.reshape(Wt,(fv_length*fv_length/2,2))
U,S,Vt = da.linalg.svd(Wt); del Wt
I never get U, S, and Vt back.
Is my memory simply not enough to do this sort of thing, even when I'm using dask?
Or is this actually not a spec problem, but bad memory management on my part?
Or something else?
At this point I'm desperate enough to try a bigger-spec PC, so I am planning to rent a bare-metal server with 32 GB of RAM. Even if I do so, is that enough?
Generally speaking dask.array does not guarantee out-of-core operation on all computations. A square matrix-matrix multiply (or any L3 BLAS operation) is more-or-less impossible to do efficiently in small memory.
You can ask Dask to use an on-disk cache for intermediate values. See the FAQ under the question My computation fills memory, how do I spill to disk?. However this will be limited by disk-writing speeds, which are generally fairly slow.
A large memory machine and NumPy is probably the simplest way to resolve this problem. Alternatively you could try to find a different formulation of your problem.
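As a back-of-the-envelope check of the "is 32 GB enough?" question (my own arithmetic, not part of the answer), the dense product alone is already close to that budget before the SVD even starts:

# Rough memory estimate for the dense fv_length x fv_length product W.T @ W.
fv_length = 512 * 2 * 66          # 67584, as in the question
n_elements = fv_length ** 2       # roughly 4.57e9 entries

for dtype_bytes in (4, 8):        # e.g. int32/float32 vs. int64/float64
    gib = n_elements * dtype_bytes / 1024 ** 3
    print(f"{dtype_bytes}-byte dtype: {gib:.1f} GiB for the product alone")

This prints roughly 17 GiB for a 4-byte dtype and 34 GiB for an 8-byte dtype, and the SVD then needs additional arrays of comparable size, so even 32 GB of RAM is tight for a fully dense approach.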

Super Linear Speedup - Python - Cluster - Multiple Processes

I parallelized a program which uses fairly large matrices. The program simulates the Ising model from statistical mechanics. On my laptop everything works fine; even the visualization shows the behaviour I expect. Now I wanted to see how it scales using many CPUs, so I used a cluster computer I have at hand. Well, I get super-linear speedup. At first I thought it was not a big deal, since it's possible that when I use multiple processes the per-process problem size gets smaller and thus might fit into the cache, so no time-consuming copying between cache and RAM slows it down. However, I even get super-linear speedup for one CPU, which I wouldn't expect: if the whole system (matrix) doesn't fit into the cache for the sequential version, then it also shouldn't fit into it for the parallel version with only one CPU, right?
I've done a check on my laptop. Averaged over 5 runs, the parallel version using one CPU is a tiny bit slower than the sequential version. I guess this is okay since there are some statements in the parallel version which I don't have in the sequential one.
Any ideas what this could be all about? Is the super linear speedup reasonable?
Note: I'm programming in Python using NumPy, and for the parallel version I use processes and shmarray.
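Not part of the original question, but one way to probe the cache hypothesis directly is to time the same vectorized update on lattices of increasing size and watch the cost per site; a jump once the lattice no longer fits in cache would support it. A minimal sketch with a made-up nearest-neighbour sweep (not the asker's Ising code):

# Time a vectorized nearest-neighbour sum on lattices of growing size and report
# the cost per site; a step up in per-site cost suggests the array left the cache.
import timeit
import numpy as np

for n in (64, 256, 1024, 4096):      # lattice side lengths (made up)
    lattice = np.random.choice([-1, 1], size=(n, n)).astype(np.int8)

    def sweep(lat=lattice):
        return (np.roll(lat, 1, axis=0) + np.roll(lat, -1, axis=0)
                + np.roll(lat, 1, axis=1) + np.roll(lat, -1, axis=1))

    t = timeit.timeit(sweep, number=20) / 20
    print(f"{n:5d} x {n:<5d}: {t / lattice.size * 1e9:7.2f} ns per site")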
