I am trying to understand why each training iteration takes approximately 1.5 seconds.
I used the tracing method described here. I am working on a TitanX Pascal GPU. My results look very strange: it seems that every operation is relatively fast, yet the system is idle most of the time between operations. How can I tell from this what is limiting the system?
It does seem, however, that when I drastically reduce the batch size the gaps close, as can be seen here.
Unfortunately the code is very complicated, and I can't post a small version of it that reproduces the same problem.
Is there a way to tell from the profiler what is filling the gaps between operations?
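For reference, this is roughly how I capture the trace (TF 1.x timeline API; the tiny graph below is a hypothetical stand-in for my actual model, which I can't post):

```python
import numpy as np
import tensorflow as tf
from tensorflow.python.client import timeline

# Hypothetical stand-in graph; my real model is much larger.
x = tf.placeholder(tf.float32, [None, 64])
w = tf.Variable(tf.random_normal([64, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op,
             feed_dict={x: np.random.rand(128, 64)},
             options=run_options,
             run_metadata=run_metadata)

    # Write a Chrome trace that can be inspected in chrome://tracing.
    tl = timeline.Timeline(run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(tl.generate_chrome_trace_format())
```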
Thanks!
EDIT:
On CPU only I do not see this behavior:
Here are a few guesses, but it's hard to say without a self-contained reproduction that I can run and debug.
Is it possible you are running out of GPU memory? One signal of this is if you see log messages of the form Allocator ... ran out of memory during training. If you run out of GPU memory, then the allocator backs off and waits in the hope more becomes available. This might explain the large inter-operator gaps that go away if you reduce the batch size.
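One way to probe this hypothesis, if you construct the session yourself, is to deliberately cap the allocator and see whether the gaps get worse (TF 1.x session config; the fraction value is just an example):

```python
import tensorflow as tf

# Deliberately cap TensorFlow's GPU allocation; if the inter-op gaps get
# noticeably worse under a tighter cap, memory pressure is a likely culprit.
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.5  # try e.g. 0.9, 0.5, 0.3
sess = tf.Session(config=config)
```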
As Yaroslav suggests in a comment above, what happens if you run the model on CPU only? What does the timeline look like?
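A simple way to force a CPU-only run without touching the model code is to hide the GPU from TensorFlow before it initializes, using the standard CUDA environment variable:

```python
import os

# Must be set before TensorFlow initializes CUDA; with no visible devices,
# all ops fall back to /cpu:0 and you can capture a CPU-only timeline.
os.environ['CUDA_VISIBLE_DEVICES'] = ''

import tensorflow as tf
# ... build and run the model exactly as before ...
```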
Is this a distributed training job or a single-machine job? If it's a distributed job, does a single-machine version show the same behavior?
Are you calling session.run() or eval() many times, or just once per training step? Every run() or eval() call will drain the GPU pipeline, so for efficiency you usually need to express your computation as one big graph with only a single run() call. (I doubt this is your problem but I mention it for completeness.)
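To illustrate that last point, the pattern to aim for is a single run() call per step that fetches everything you need, rather than several small calls (toy tensors here, just to show the shape of the call):

```python
import tensorflow as tf

a = tf.constant(1.0)
b = tf.constant(2.0)
c = a + b
d = a * b

with tf.Session() as sess:
    # Instead of sess.run(c) followed by sess.run(d), which drains the
    # pipeline twice, fetch both values in one call:
    c_val, d_val = sess.run([c, d])
    print(c_val, d_val)
```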
Related
**Summary:** I observe a ~1000× performance gap between Python code and C++ code doing the same job, despite using parallelization, vectorization, and just-in-time compilation to machine code with Numba, in the context of scientific computation. The CPU won't run at full capacity, and I don't understand why.
Hello everybody,
I just started in a laboratory doing simulations of various materials, including simulations of the growth of biological-like tissues. To do that we create a 3D version of the tissue (a collection of vertices stored in a numpy array) and apply different functions to it to mimic the physics/biology.
We have C++ code doing just that, which takes approximately 10 seconds to run. Someone converted that code to Python, but this version takes about 2.5 hours to run. We tried every trick in the book we knew to accelerate the code: we used Numba to accelerate numpy where appropriate, parallelized the code as much as we could, and tried to vectorize what could be vectorized, but the gap remains. In fact, an earlier version of the code took days to complete.
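For reference, this is roughly the pattern we use; the kernel below is a simplified, hypothetical stand-in for our actual physics, just to show how we combine Numba's JIT, parallel loops, and GIL release:

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True, nogil=True, fastmath=True, cache=True)
def relax_vertices(vertices, neighbors, alpha):
    # Move each vertex a fraction alpha towards the mean of its neighbours.
    # vertices: (N, 3) float64, neighbors: (N, K) int64 index array.
    out = np.empty_like(vertices)
    n, k = neighbors.shape
    for i in prange(n):
        mean = np.zeros(3)
        for j in range(k):
            mean += vertices[neighbors[i, j]]
        mean /= k
        out[i] = vertices[i] + alpha * (mean - vertices[i])
    return out

# Example call with random data, roughly the scale of our tissue meshes.
verts = np.random.rand(100000, 3)
neigh = np.random.randint(0, 100000, size=(100000, 6))
new_verts = relax_vertices(verts, neigh, 0.1)
```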
When the code executes, multiple cores are used, as monitored with the built-in system monitor. However, they are not used at full capacity, and in fact deactivating cores manually does not seem to hurt performance much. At first I thought it could be due to the GIL, but releasing it had no effect on performance either. This makes me suspect a bottleneck in memory transfer between the CPU and the RAM, but I cannot understand why the C++ version would not have the same problem. I also have the feeling that there is a performance cost to calling functions: one of my earlier tasks was to refactor the code, decomposing complicated functions into smaller ones, and since then I have seen a small performance degradation compared to the earlier version.
I must say I am really wondering where my bottleneck is and how it could be tested/improved. Any ideas would be very welcome.
I am aware my question is a complicated one, so let me know if you need additional information; I would be happy to provide it.
I am interested in a way to measure the detailed performance of a custom Tensorflow Op when run on a GPU.
So far I have tried the approach of this post using a Timeline, as well as the internal TensorFlow profiler (tf.profiler.Profiler). Both deliver very similar results, which are fine if I want to investigate a network, but for profiling a single Op the output is too coarse and doesn't include intra-op calculations (at least I couldn't find a way to get them). My next try was the CUDA profiler nvprof (or nvvp for that matter), which is more in the right direction and displays individual calls to CUDA kernels and memory allocations. But now the CPU calculations are not included. I tried running nvprof --cpu-profiling on, but then the profiler never finishes (see here).
My scenario is the following: I wrote a custom Op that is very similar to a 2D convolution and should not take much more time to compute. In a network, my custom Op's performance is about 3 times worse than tf.nn.conv2d's. Using the tf.profiler.Profiler I get the following:
Profile:
node name | requested bytes | total execution time | accelerator execution time | cpu execution time
CustomConv2DBackpropInput 72.09MB (100.00%, 7.04%), 194.36ms (100.00%, 38.05%), 49.82ms (100.00%, 17.61%), 144.54ms (100.00%, 63.44%)
CustomConv2D 65.54MB (92.96%, 6.40%), 95.41ms (61.95%, 18.68%), 45.16ms (82.39%, 15.96%), 50.25ms (36.56%, 22.06%)
CustomConv2DBackpropFilter 134.48MB (86.55%, 13.14%), 72.39ms (43.27%, 14.17%), 41.22ms (66.44%, 14.56%), 31.17ms (14.50%, 13.68%)
Conv2DBackpropFilter 294.68MB (73.41%, 28.79%), 63.39ms (29.10%, 12.41%), 62.80ms (51.87%, 22.19%), 594us (0.82%, 0.26%)
Conv2DBackpropInput 230.97MB (44.62%, 22.57%), 48.77ms (16.69%, 9.55%), 48.16ms (29.68%, 17.02%), 610us (0.56%, 0.27%)
Conv2D 225.74MB (22.06%, 22.06%), 36.50ms (7.15%, 7.15%), 35.84ms (12.66%, 12.66%), 664us (0.29%, 0.29%)
So it seems that my custom Ops take about as much time on the GPU, but more than an order of magnitude longer on the CPU. For a GPU Op that is not acceptable, and I'd like to find out where my Ops spend this time on the CPU. What additionally puzzles me is that my Ops seem to allocate only about a third of the GPU memory that the original Conv Ops do.
Is there a way to get a detailed profile of my custom Op (which includes CPU and GPU usage) that can explain to me what I did wrong and help me fix my mistakes?
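For completeness, this is roughly how I collect the profile shown above (TF 1.x tf.profiler API; the single conv layer below is just a hypothetical stand-in for my network and custom Op):

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in graph: one conv layer in place of my real network.
x = tf.placeholder(tf.float32, [None, 32, 32, 8])
w = tf.Variable(tf.random_normal([3, 3, 8, 16]))
loss = tf.reduce_mean(tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME'))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    profiler = tf.profiler.Profiler(sess.graph)
    run_meta = tf.RunMetadata()
    sess.run(train_op,
             feed_dict={x: np.random.rand(4, 32, 32, 8)},
             options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
             run_metadata=run_meta)
    profiler.add_step(0, run_meta)

    # Per-op requested bytes and execution times, like the table above.
    opts = (tf.profiler.ProfileOptionBuilder(
                tf.profiler.ProfileOptionBuilder.time_and_memory())
            .order_by('micros')
            .build())
    profiler.profile_operations(options=opts)
```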
I am trying to use Apache Beam/Google Cloud Dataflow to speed up an existing Python application. The bottleneck of the application occurs after randomly permuting an input matrix N (default 125, but could be more) times, when the system runs a clustering algorithm on each matrix. The runs are fully independent of one another. I've captured the top of the pipeline below:
This processes the default 125 permutations. As you can see, only the RunClustering step takes an appreciable amount of time (there are 11 more steps not shown below that total to 11 more seconds). I ran the pipeline earlier today for just 1 permutation, and the Run Clustering step takes 3 seconds (close enough to 1/125th the time shown above).
I'd like the RunClustering step to finish in 3-4 seconds no matter what the input N is. My understanding is that Dataflow is the correct tool for speeding up embarrassingly parallel computation on Google Cloud Platform, so I've spent a couple of weeks learning it and porting my code. Is my understanding correct? I've also tried throwing more machines at the problem (instead of autoscaling, which, for whatever reason, only scales up to 2-3 machines*) and specifying more powerful machine types, but neither helps.
*Is this because of a long startup time for VMs? Is there a way to use quickly provisioned VMs, if that's the case? Another question I have is how to cut down on the pipeline startup time; it's a deal breaker if users can't get results back quickly, and the fact that the total Dataflow job time is 13–14 minutes (compared to the already excessive 6–7 minutes for the pipeline itself) is unacceptable.
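For reference, this is roughly how I pin the worker count instead of relying on autoscaling (project and bucket names are placeholders, and the pipeline body is elided):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project/bucket; the real pipeline steps (permutation,
# clustering, and the later stages) are omitted here.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',
    '--temp_location=gs://my-bucket/tmp',
    '--autoscaling_algorithm=NONE',
    '--num_workers=8',
])

with beam.Pipeline(options=options) as p:
    pass  # permutation + RunClustering steps go here
```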
Your pipeline is suffering from excessive fusion, and ends up doing almost everything on one worker. This is also why autoscaling doesn't scale higher: it detects that it is unable to parallelize your job's code, so it prefers not to waste extra workers. This is also why manually throwing more workers at the problem doesn't help.
In general, fusion is a very important optimization, but excessive fusion is also a common problem. Ideally, Dataflow would be able to mitigate it automatically (as it automatically mitigates imbalanced sharding), but that is even harder to do, though some ideas for it are in the works.
Meanwhile, you'll need to change your code to insert a "reshuffle" (a group by key / ungroup will do - fusion never happens across a group by key operation). See Preventing fusion; the question Best way to prevent fusion in Google Dataflow? contains some example code.
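A minimal sketch of the group-by-key / ungroup trick applied to your permutations PCollection (the surrounding names are hypothetical; newer Beam SDKs also ship a built-in beam.Reshuffle() transform that does the same thing):

```python
import random
import apache_beam as beam

def break_fusion(pcoll):
    # Route every element through a GroupByKey under a random key, then
    # flatten the groups again. Fusion never crosses the GroupByKey, so
    # the downstream clustering step can be spread across workers.
    return (pcoll
            | 'PairWithRandomKey' >> beam.Map(lambda v: (random.randint(0, 999), v))
            | 'GroupByRandomKey' >> beam.GroupByKey()
            | 'Ungroup' >> beam.FlatMap(lambda kv: kv[1]))

# Usage sketch (hypothetical step names):
#   permutations = break_fusion(permutations)
#   results = permutations | 'RunClustering' >> beam.Map(run_clustering)
```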
I have an application that has 3 main functionalities which are running sequentially at the moment:
1) Loading data into memory and performing preprocessing on it.
2) Performing some computations on the data on the GPU with Theano.
3) Monitoring the state of the GPU computations and printing them to the screen.
These three functionalities are embarrassingly parallelizable using multi-threading, but in Python I currently perform all three sequentially, partly because in the past I had some bad luck with Python multi-threading and GIL issues.
In this case I don't necessarily need to utilize the full capabilities of the multiple CPUs at hand. All I want to do is load and preprocess the data while the computations on the GPU are being performed, and monitor the state of those computations at the same time. Currently the most time-consuming computations happen in 2), so I'm essentially bounded by the operations in 2). Now my questions are:
* Can Python parallelize these 3 operations without creating new bottlenecks, e.g. due to GIL issues?
* Should I use multiprocessing instead of multithreading?
In a nutshell, how should I parallelize these three operations in Python, if I should at all?
It has been some time since I last wrote multi-threaded CPU code (especially in Python), so any guidance will be appreciated.
Edit: Typos.
The GIL is a bit of a nuisance sometimes...
A lot of it is going to revolve around how you can use the GPU. Does the API you're using allow you to set it running, then go off and do something else, occasionally polling to see if the GPU has finished? Or maybe it can raise an event, call a callback or something like that?
I'm sensing from your question that the answer is no... in which case I suspect your only choice (given that you're using Python) is multiprocessing. If the answer is yes, then you can start off the GPU, get on with some preprocessing and plotting in the meantime, and then check to see whether the GPU has finished.
I don't know much about Python or how it does multiprocessing, but I suspect that it involves serialising and copying data sent between processes. If the quantity of data you're processing is large (I suggest getting worried at the hundreds-of-megabytes mark, though that's just a hunch), then you may wish to consider how much time is lost serialising and copying that data. If you don't like the answer from that analysis, then you're probably out of luck so far as using Python is concerned.
You say that the most time-consuming part is the GPU processing? Presumably the other two parts are reasonably lengthy, otherwise there would be little point in trying to parallelise them. For example, if the GPU accounted for 95% of the runtime, then saving 5% by parallelising the rest would hardly seem worth it.
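To make the multiprocessing suggestion concrete, here is a rough sketch of the producer/consumer shape I have in mind; the loader and the "GPU step" are stand-ins (your preprocessing and compiled Theano function would go in their place):

```python
import multiprocessing as mp

def loader(out_q, n_batches):
    # 1) Load and preprocess data in a separate process, so heavy NumPy
    #    work (and the GIL) never blocks the process driving the GPU.
    for i in range(n_batches):
        batch = [float(i)] * 1000      # stand-in for a real preprocessed batch
        out_q.put(batch)               # note: this pickles and copies the batch
    out_q.put(None)                    # sentinel: no more data

def main():
    out_q = mp.Queue(maxsize=4)        # small buffer keeps memory use bounded
    p = mp.Process(target=loader, args=(out_q, 100))
    p.start()

    while True:
        batch = out_q.get()
        if batch is None:
            break
        # 2) The GPU computation (e.g. a compiled Theano function) goes here.
        result = sum(batch)            # stand-in for the GPU step
        # 3) Lightweight monitoring/printing can stay in the main process.
        print('processed batch, result=%.1f' % result)

    p.join()

if __name__ == '__main__':
    main()
```

The bounded queue is also a cheap way to check the serialisation concern above: if the loader keeps the queue full while the GPU step runs, the copying cost is being hidden behind the compute.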
I parallelized a program which uses fairly large matrices. The program simulates the Ising model from statistical mechanics. On my laptop everything works fine; even the visualization shows the behaviour I expect. Now I wanted to see how it scales on many CPUs, so I used a cluster computer I have at hand. Well, I get superlinear speedup. At first I thought that was no big deal, since with multiple processes the per-process problem size gets smaller and thus might fit into the cache, so no time-consuming copying from cache to RAM and back slows it down. However, I even get superlinear speedup for one CPU, which I wouldn't expect: if the whole system (matrix) doesn't fit into the cache for the sequential version, then it also shouldn't fit using the parallel version with only one CPU, right?
I've done a check on my laptop. Averaged over 5 runs, the parallel version using one CPU is a tiny bit slower than the sequential version. I guess this is okay since there are some statements in the parallel version which I don't have in the sequential one.
Any ideas what this could be all about? Is the superlinear speedup reasonable?
Note: I'm programming in Python using numpy; for the parallel version I use processes and shmarray.
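In case it helps, the laptop check mentioned above looks roughly like this; the sweep function is a simplified stand-in, not my actual Ising update:

```python
import time
import numpy as np

def ising_sweep(spins):
    # Stand-in workload: sum of the four nearest neighbours, roughly the
    # memory-access pattern of my real update.
    return (np.roll(spins, 1, axis=0) + np.roll(spins, -1, axis=0)
            + np.roll(spins, 1, axis=1) + np.roll(spins, -1, axis=1))

def average_runtime(fn, *args, repeats=5):
    # Average wall-clock time over several runs, as in the check above.
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - start)
    return sum(times) / repeats

spins = np.random.choice([-1, 1], size=(4096, 4096)).astype(np.int8)
print('average seconds per sweep: %.4f' % average_runtime(ising_sweep, spins))
```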