Tensorflow - Profile Custom Op - python

I am interested in a way to measure the detailed performance of a custom Tensorflow Op when run on a GPU.
So far I have tried the approach of this post using a Timeline, as well as the internal TensorFlow profiler (tf.profiler.Profiler). Both deliver very similar results, which are fine if I want to investigate a network, but for profiling a single Op the output is too coarse and doesn't include intra-op calculations (at least I couldn't find a way to get them). My next attempt was the CUDA profiler nvprof (or nvvp for that matter), which is closer to what I want and displays individual CUDA kernel launches and memory allocations. But now the CPU calculations are not included. I tried running nvprof --cpu-profiling on, but then the profiler never finishes (see here).
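For reference, this is roughly how I record the Timeline trace (a simplified sketch; `my_op_output` is just a stand-in for my Op's output tensor):

```python
import tensorflow as tf
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # `my_op_output` stands in for the output tensor of the custom Op under test
    sess.run(my_op_output, options=run_options, run_metadata=run_metadata)

    # Write a Chrome trace that can be inspected in chrome://tracing
    tl = timeline.Timeline(run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(tl.generate_chrome_trace_format())
```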
My scenario is the following: I wrote a custom Op that is very similar to a 2D convolution and should not take much more time to compute. In a network, my custom Op's performance is about three times worse than tf.nn.conv2d's. Using tf.profiler.Profiler I get the following:
Profile:
node name | requested bytes | total execution time | accelerator execution time | cpu execution time
CustomConv2DBackpropInput 72.09MB (100.00%, 7.04%), 194.36ms (100.00%, 38.05%), 49.82ms (100.00%, 17.61%), 144.54ms (100.00%, 63.44%)
CustomConv2D 65.54MB (92.96%, 6.40%), 95.41ms (61.95%, 18.68%), 45.16ms (82.39%, 15.96%), 50.25ms (36.56%, 22.06%)
CustomConv2DBackpropFilter 134.48MB (86.55%, 13.14%), 72.39ms (43.27%, 14.17%), 41.22ms (66.44%, 14.56%), 31.17ms (14.50%, 13.68%)
Conv2DBackpropFilter 294.68MB (73.41%, 28.79%), 63.39ms (29.10%, 12.41%), 62.80ms (51.87%, 22.19%), 594us (0.82%, 0.26%)
Conv2DBackpropInput 230.97MB (44.62%, 22.57%), 48.77ms (16.69%, 9.55%), 48.16ms (29.68%, 17.02%), 610us (0.56%, 0.27%)
Conv2D 225.74MB (22.06%, 22.06%), 36.50ms (7.15%, 7.15%), 35.84ms (12.66%, 12.66%), 664us (0.29%, 0.29%)
So it seems to me that my custom Ops take about as much time on the GPU, but more than an order of magnitude longer on the CPU. For a GPU Op that is not acceptable, and I'd like to find out where my Ops spend this time on the CPU. What additionally startles me is that my Ops seem to allocate only one third of the GPU memory that the original Conv Ops do.
Is there a way to get a detailed profile of my custom Op (which includes CPU and GPU usage) that can explain to me what I did wrong and help me fix my mistakes?
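For completeness, here is roughly how I collected the profile shown above (a simplified sketch; `train_op` stands in for the op I actually run):

```python
import tensorflow as tf

profiler = tf.profiler.Profiler(tf.get_default_graph())
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(10):
        run_metadata = tf.RunMetadata()
        # `train_op` is a placeholder for my actual training/forward op
        sess.run(train_op, options=run_options, run_metadata=run_metadata)
        profiler.add_step(step, run_metadata)

# Aggregate requested bytes and execution time per op, as in the table above
opts = tf.profiler.ProfileOptionBuilder.time_and_memory()
profiler.profile_operations(options=opts)
```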

Related

Why doesn't O3 use less GPU memory than O1?

I'm training EfficientDet-D7 (head_only=True) on a single 2080 Ti, and I'm using NVIDIA Apex AMP.
When I use opt_level=O1, memory usage is definitely reduced compared to not using Apex at all.
But when I use opt_level=O2 or O3, more memory is consumed.
I am experimenting with the same 2080 Ti model, one separate GPU per run, by creating two containers from the same Docker image. The training code for the second run is a copy of the O1 code with only the opt_level changed to O3, and all the training args are the same (the batch size and D7 as well).
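Roughly, the only difference between the two runs is the opt_level string passed to amp.initialize (a simplified sketch; `build_efficientdet_d7` is just a placeholder for how I actually construct the model):

```python
import torch
from apex import amp

# Hypothetical model constructor standing in for my real EfficientDet-D7 setup
model = build_efficientdet_d7(head_only=True).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Run A: mixed precision with opt_level "O1"
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

# Run B is an exact copy of the same code with only this line changed:
# model, optimizer = amp.initialize(model, optimizer, opt_level="O3")
```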
Why does this happen?
Additionally, can you recommend a book about this (e.g. deep learning memory usage, GPUs, etc.)?
Thanks!
You're optimizing for speed. Some speed optimizations will reduce memory usage. Others will increase it.
An example of when speed optimization reduces memory usage is when unnecessary variables and function calls are removed.
An example of the opposite is loop unrolling.
There's no reason to expect optimization to either reduce or increase memory usage. That's not the goal when optimizing for speed. Any increase or decrease is just a byproduct.
If you really want to find out why it happens in your particular case, you can study the documentation for your compiler and inspect the assembly code.

Understand order of magnitude performance gap between python and C++ for CPU heavy application

**Summary:** I observe a ~1000x performance gap between Python code and C++ code doing the same job, despite the use of parallelization, vectorization, just-in-time compilation and machine-code conversion using Numba, in the context of scientific calculation. The CPUs won't be used at full capacity, and I don't understand why.
Hello everybody,
I just started in a laboratory doing simulations of various materials, including simulation of the growth of biological-like tissues. To do that we create a 3D version of said tissue (a collection of vertices stored in a NumPy array) and we apply different functions to it to mimic the physics/biology.
We have a C++ code doing just that, which takes approximately 10 seconds to run. Someone converted that code to Python, but this version takes about 2.5 hours to run. We tried every trick in the book we knew to accelerate the code: we used Numba to accelerate NumPy where appropriate, parallelized the code as much as we could, and tried to vectorize what could be vectorized, but the gap remains. In fact, an earlier version of the code took days to run.
When the code executes, multiple cores are properly used, as monitored using the built-in system monitor. However, they are not used at full capacity, and in fact deactivating cores manually does not seem to hurt performance much. At first I thought it could be due to the GIL, but releasing it had no effect on performance either. It makes me suspect a bottleneck in memory transfer between the CPU and RAM, but I cannot understand why the C++ version would not have the same problem. I also have the feeling that there is a cost to calling functions: one of my earlier tasks was to refactor the code, decomposing complicated functions into smaller ones, and since then I have had a small performance degradation compared to the earlier version.
I am really wondering where my bottleneck is and how it could be tested/improved. Any idea would be very welcome.
I am aware my question is a complicated one, so let me know if you need additional information; I would be happy to provide it.
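For reference, the kind of Numba pattern we use looks roughly like this (a simplified, hypothetical kernel; `relax_vertices` is not our real function, just the shape of it, with the whole hot loop kept inside one @njit function to avoid per-vertex Python-level calls):

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True, fastmath=True, cache=True)
def relax_vertices(vertices, forces, dt):
    # vertices, forces: (N, 3) float64 arrays; update positions in place
    n = vertices.shape[0]
    for i in prange(n):
        for k in range(3):
            vertices[i, k] += dt * forces[i, k]
    return vertices

if __name__ == "__main__":
    verts = np.random.rand(100_000, 3)
    frcs = np.random.rand(100_000, 3)
    relax_vertices(verts, frcs, 1e-3)  # first call includes JIT compilation
```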

Tensorflow - Profiling using timeline - Understand what is limiting the system

I am trying to understand why each train iteration takes approximately 1.5 seconds.
I used the tracing method described here. I am working on a Titan X Pascal GPU. My results look very strange: it seems that every operation is relatively fast and the system is idle most of the time between operations. How can I understand from this what is limiting the system?
It does seem, however, that when I drastically reduce the batch size the gaps close, as can be seen here.
Unfortunately the code is very complicated, and I can't post a small version of it that has the same problem.
Is there a way to understand from the profiler what is happening in the gaps between operations?
Thanks!
EDIT:
On CPU only I do not see this behavior:
I am running a
Here are a few guesses, but it's hard to say without a self-contained reproduction that I can run and debug.
Is it possible you are running out of GPU memory? One signal of this is if you see log messages of the form Allocator ... ran out of memory during training. If you run out of GPU memory, then the allocator backs off and waits in the hope more becomes available. This might explain the large inter-operator gaps that go away if you reduce the batch size.
As Yaroslav suggests in a comment above, what happens if you run the model on CPU only? What does the timeline look like?
Is this a distributed training job or a single-machine job? If it's a distributed job, does a single-machine version show the same behavior?
Are you calling session.run() or eval() many times, or just once per training step? Every run() or eval() call will drain the GPU pipeline, so for efficiency you usually need to express your computation as one big graph with only a single run() call. (I doubt this is your problem, but I mention it for completeness.)
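Regarding the CPU-only suggestion above: one way to do that without touching the graph is to hide the GPUs from the session (a rough sketch; `train_op` is a placeholder for your existing training op, and the Timeline tracing stays exactly as before):

```python
import tensorflow as tf

config = tf.ConfigProto(device_count={'GPU': 0})  # make no GPUs visible to this session
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)  # profile this step and compare the timeline to the GPU run
```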

theano CPU running out of memory: what is wrong?

I ran a simple network with Theano on a server and got an out-of-memory error, but I am not sure of the reason. I am asking because it seems unlikely that I am simply using too much memory.
Here are the reasons:
First, according to this post, the problems caused by the lack of virtual-memory support only arise when running on a GPU, but I am running on the CPU, so that should be fine.
Second, I built a network where the first layer is a 100k-by-10 matrix and the second layer is 10 by 1, so the model itself is only about 1M numbers. So far I have only tried 1000 data points at a time, so even if the machine loads all the data and initializes all the layers at once, there should be at most about 110M floating-point numbers. I use float32 on a 64-bit machine. According to this post, each number takes 60 bytes at most, so the whole initialization should take roughly 6-7 GB of memory. Even if various other resources take up memory as well, I don't understand why it cannot run on a server with 128 GB of RAM.
Can someone suggest what I should look into?
Just in case someone asks for code, here it is.
What size are your minibatches? You need to remember that the activations take space in memory too.
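As a rough illustration of accounting for activations alongside the weights, here is a back-of-the-envelope sketch, assuming float32 storage at 4 bytes per value, the layer sizes from the question, and one extra copy of the parameters for gradients (all of these are assumptions, not measurements):

```python
def estimate_bytes(n_in=100_000, n_hidden=10, n_out=1, batch=1000, dtype_bytes=4):
    weights = n_in * n_hidden + n_hidden * n_out       # model parameters
    activations = batch * (n_in + n_hidden + n_out)    # input batch + layer outputs
    gradients = weights                                # roughly one extra copy for gradients
    return dtype_bytes * (weights + activations + gradients)

print(estimate_bytes() / 1e6, "MB")  # roughly 408 MB with these assumptions
```

If the actual process uses far more than an estimate like this, the extra memory is usually coming from something other than the model itself (copies of the dataset, intermediate graph buffers, etc.).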

Super Linear Speedup - Python - Cluster - Multiple Processes

I parallelized a program which uses fairly large matrices. The program simulates the Ising model from statistical mechanics. On my laptop everything works fine; even the visualization shows the behaviour I expect. Now I wanted to see how it scales with many CPUs, so I used a cluster computer I have at hand. Well, I get super-linear speedup. At first I thought it was not a big deal, since it's possible that when I use multiple processes the per-process problem size gets smaller and thus might fit into the cache, so no time-consuming copying between cache and RAM slows it down. However, I even get super-linear speedup for one CPU, which I wouldn't expect. If the whole system (matrix) doesn't fit into the cache for the sequential version, then it also shouldn't fit when using the parallel version with only one CPU, right?
I've done a check on my laptop. Averaged over 5 runs, the parallel version using one CPU is a tiny bit slower than the sequential version. I guess this is okay since there are some statements in the parallel version which I don't have in the sequential one.
Any ideas what this could be all about? Is the super linear speedup reasonable?
Note: I'm programming in Python using NumPy, and for the parallel version, multiple processes and shmarray.
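For reference, this is roughly how I time the two versions (a simplified sketch; `run_sequential` and `run_parallel` are hypothetical stand-ins for my actual entry points):

```python
import time
import numpy as np

def average_runtime(fn, repeats=5, **kwargs):
    """Run fn(**kwargs) several times and return mean and std of the wall time."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(**kwargs)
        times.append(time.perf_counter() - start)
    return np.mean(times), np.std(times)

# Hypothetical entry points for the two versions being compared:
# mean_seq, std_seq = average_runtime(run_sequential, lattice_size=1024)
# mean_par, std_par = average_runtime(run_parallel, lattice_size=1024, n_procs=1)
```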
