CPU and GPU operations parallelization - python

I have an application that has 3 main functionalities which are running sequentially at the moment:
1) Load data into memory and perform preprocessing on it.
2) Perform some computations on the data on the GPU with Theano.
3) Monitor the state of the GPU computations and print it to the screen.
These 3 functionalities are embarrassingly parallelizable with multi-threading, but in Python I currently perform all three sequentially, partly because in the past I had some bad luck with Python multi-threading and GIL issues.
In this case, I don't necessarily need to utilize the full capabilities of the multiple CPUs at hand. All I want to do is load and preprocess the data while the GPU computations are running, and monitor the state of those computations at the same time. Currently the most time-consuming work happens in 2), so I'm essentially bounded by the operations in 2). Now my questions are:
* Can Python parallelize these 3 operations without creating new bottlenecks, e.g. due to GIL issues?
* Should I use multiprocessing instead of multithreading?
In a nutshell, how should I parallelize these three operations in Python, if at all?
It has been some time since I last wrote multi-threaded CPU code (especially in Python), so any guidance would be appreciated.
Edit: Typos.

The GIL is a bit of a nuisance sometimes...
A lot of it is going to revolve around how you can use the GPU. Does the API you're using allow you to set it running, then go off and do something else, occasionally polling to see if the GPU has finished? Or maybe it can raise an event, call a callback, or something like that?
I'm sensing from your question that the answer is no... in which case I suspect your only choice (given that you're using Python) is multiprocessing. If the answer is yes, then you can start off the GPU, get on with some preprocessing and plotting in the meantime, and then check whether the GPU has finished.
I don't know much about Python or how it does multiprocessing, but I suspect that it involves serialisation and copying of the data sent between processes. If the quantity of data you're processing is large (I'd start getting worried around the hundreds-of-megabytes mark, though that's just a hunch), then you may wish to consider how much time is lost in serialising and copying that data. If you don't like the answers to that analysis, then you're probably out of luck as far as using Python is concerned.
You say that the most time-consuming part is the GPU processing? Presumably the other two parts are reasonably lengthy, otherwise there would be little point trying to parallelise them. For example, if the GPU took 95% of the runtime, saving 5% by parallelising the rest would hardly seem worth it.
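If the answer is yes (or to the extent the GPU computation happens inside native code that releases the GIL while it runs), a plain threading pipeline may already be enough. A minimal sketch of that idea, where the loading/GPU/monitoring functions are stand-ins for the questioner's steps 1)-3), not real Theano code:

```python
import queue
import threading
import time

# --- placeholders: replace with the real loading / Theano code ---
def load_and_preprocess_one(path):
    time.sleep(0.1)                  # stand-in for disk I/O + preprocessing (step 1)
    return path

def gpu_step(batch):
    time.sleep(0.5)                  # stand-in for the Theano/GPU computation (step 2)
    return {"loss": 0.0, "batch": batch}
# ------------------------------------------------------------------

def producer(paths, q):
    for path in paths:
        q.put(load_and_preprocess_one(path))   # blocks if the queue is full
    q.put(None)                                # sentinel: no more data

def run(paths, max_prefetch=4):
    q = queue.Queue(maxsize=max_prefetch)
    t = threading.Thread(target=producer, args=(paths, q), daemon=True)
    t.start()
    while True:
        batch = q.get()
        if batch is None:
            break
        stats = gpu_step(batch)                # while this runs in native code, the
        print("monitor:", stats)               # producer keeps loading (steps 2 and 3)

if __name__ == "__main__":
    run([f"chunk_{i}.npy" for i in range(8)])
```

If the GPU step turns out to hold the GIL, the same queue-based structure carries over to multiprocessing, at the cost of serialising each batch between processes.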

Related

Understand order of magnitude performance gap between python and C++ for CPU heavy application

**Summary:** I observe a ~1000× performance gap between a Python code and a C++ code doing the same job, despite the use of parallelization, vectorization, just-in-time compilation and machine-code conversion using Numba, in the context of a scientific calculation. The CPU is not fully used, and I don't understand why.
Hello everybody,
I just started in a laboratory doing simulation of various materials, including simulation of the growth of biological-like tissues. To do that we create a 3D version of said tissue (a collection of vertices stored in a numpy array) and apply different functions to it to mimic physics/biology.
We have a C++ code doing just that, which takes approximately 10 seconds to run. Someone converted said code to Python, but this version takes about 2.5 hours to finish. We tried every trick in the book we knew to accelerate the code: we used Numba to accelerate NumPy where appropriate, parallelized the code as much as we could, and tried to vectorize what could be vectorized, but still the gap remains. In fact, an earlier version of the code took days to complete.
When the code executes, multiple cores are properly used, as monitored with the built-in system monitor. However, they are not fully loaded, and in fact deactivating cores manually does not seem to hurt performance much. At first I thought it could be due to the GIL, but releasing it had no effect on performance either. This makes me suspect a bottleneck in memory transfer between the CPU and the RAM, but I cannot understand why the C++ version would not have the same problem. I also have the feeling that there is a cost to calling functions: one of my earlier tasks was to refactor the code, decomposing complicated functions into smaller ones, and since then I see a small performance degradation compared to the earlier version.
I must say I am really wondering where my bottleneck is and how it could be tested/improved. Any idea would be very welcome.
I am aware my question is kind of a complicated one, so let me know if you need additional information, which I would be happy to provide.
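To locate the bottleneck, a function-level profile is usually a cheaper first step than guessing about memory bandwidth. A minimal sketch using the standard library's cProfile; `simulate` here is a hypothetical stand-in for the real simulation driver:

```python
import cProfile
import pstats

import numpy as np

def simulate(steps=100):
    """Hypothetical driver standing in for the real tissue-growth simulation."""
    vertices = np.random.rand(10_000, 3)
    for _ in range(steps):
        vertices += 0.01 * np.roll(vertices, 1, axis=0)   # stand-in for the physics update
    return vertices

profiler = cProfile.Profile()
profiler.enable()
simulate()
profiler.disable()

# Sort by cumulative time to see which functions dominate the runtime.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```

If the time is spread thinly across many small functions rather than concentrated in a few NumPy calls, that by itself points to per-call interpreter overhead as a plausible culprit, which would also explain why the refactor into smaller functions cost a little performance.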

How many processes should I create on a multi-threaded CPU in a computationally intensive scenario?

I have a 32-core, 64-thread CPU for executing a scientific computation task. How many processes should I create?
Note that my program is computationally intensive and involves lots of matrix computations based on NumPy. Currently I use the default Python process pool to execute this task, which creates 64 processes. Will it perform better or worse than 32 processes?
I'm not really sure that Python is suited for multi-threaded, computationally intensive scenarios, due to the Global Interpreter Lock (GIL). Basically, you should use multi-threading in Python only for IO-bound tasks. I'm not sure whether this applies to NumPy, since the heavy part, if I recall correctly, is written in C++.
If you're looking for alternatives you could use the Apache Spark framework to distribute the work across multiple machines. I think that even if you run your code in local mode (i.e. on your machine) with 8/16 workers you could get some performance boost.
EDIT: I'm sorry, I just read on the GIL page that I linked that it doesn't apply to NumPy. I still think that this is not really the best tool you can use, since effective multi-threaded programming is quite hard to get right, and there are some other nuances that you can read about in the link.
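For reference, the local-mode idea mentioned above looks roughly like this (assumes pyspark is installed; the matrix workload is a stand-in for the real computation, not the asker's code):

```python
import numpy as np
from pyspark import SparkContext

def work(seed):
    """Stand-in for one chunk of the NumPy-heavy computation."""
    rng = np.random.default_rng(seed)
    a = rng.random((500, 500))
    return float(np.trace(a @ a.T))

if __name__ == "__main__":
    # "local[16]" runs 16 worker threads on this machine; no cluster needed.
    sc = SparkContext("local[16]", "matrix-demo")
    results = sc.parallelize(range(128), numSlices=16).map(work).collect()
    print(sum(results))
    sc.stop()
```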
It's impossible to give you a definite answer, as it will depend on your exact problem and code, but potentially also on your hardware.
Basically, the workflow for multiprocessing is to split the work into X parts, distribute them to the processes, let each process work, and then merge the results.
Now you need to know whether you can effectively split the work into 64 parts while keeping each part roughly the same amount of work (if one part takes 90% of the time and you can't split it further, it's useless to have more than 2 processes, as you will always be waiting for that one).
If you can do that, and splitting and merging the work/results doesn't take too long (remember that this is additional work, so it costs extra time), then using more processes can pay off.
It is also possible that you can speed up your code by using fewer processes, if you spend too much time splitting/merging the work/results (sometimes the speed-up obtained by using more processes is negative).
Also remember that on some architectures the memory cache is shared among cores, which can badly affect the performance of multiprocessing.
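A rough way to settle the 32-vs-64 question empirically is to time the same workload under different pool sizes. A minimal sketch, where `work` is a stand-in for one chunk of the real matrix computation:

```python
import time
from multiprocessing import Pool

import numpy as np

def work(seed):
    """Stand-in for one chunk of the NumPy-heavy computation."""
    rng = np.random.default_rng(seed)
    a = rng.random((500, 500))
    return float(np.linalg.eigvals(a @ a.T).real.sum())

def benchmark(n_procs, n_tasks=128):
    start = time.perf_counter()
    with Pool(processes=n_procs) as pool:
        pool.map(work, range(n_tasks))
    return time.perf_counter() - start

if __name__ == "__main__":
    # Compare physical cores vs. hardware threads (32 vs. 64 on the machine in question).
    for n in (32, 64):
        print(f"{n} processes: {benchmark(n):.2f} s")
```

One caveat: if NumPy is linked against a multithreaded BLAS, each process may spawn its own threads, so 64 processes can oversubscribe the machine; pinning BLAS to one thread per process (e.g. setting OMP_NUM_THREADS=1) makes the comparison cleaner.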

Python and C++ performance comparison

In a lecture I've encountered the following problem:
Given a simple program which computes the sum of a column in a large data set, the performance of a Python and a C++ implementation are compared. The main bottleneck should be reading the data; the computation itself is rather simple. On first execution, the Python version is about 2 times slower than C++, which makes sense.
Then on the second execution, the C++ program speeds up from 4 seconds to 1 second, because apparently the "first execution is I/O bound, the second is CPU bound". This still makes sense, since the file contents were probably cached, skipping the slow read from disk.
However, the Python implementation did not speed up at all on the second run, despite the warm cache. I know Python is slow, but is it that slow? Does this mean that executing this simple computation in Python is slower than reading about 0.7 GB from disk?
If this is always the case, I'm wondering why the biggest deep learning frameworks I know (PyTorch, TensorFlow) have Python APIs. For real-time object detection, for example, it would have to be slower to feed the input to the network (read frames from a video, maybe preprocess) and interpret the output than to perform the forward propagation itself on a GPU.
Have I misunderstood something? Thank you.
That's not so easy to answer without implementation details, but in general Python is known to be much less cache friendly, because you mostly don't have the option to low-level optimize cache behaviour in Python. However, this isn't always true: you can sometimes improve cache friendliness in Python directly, or use C++ code for the critical sections. But keep in mind that you can simply optimize your code better in C++. So if you have really critical code paths, where you want to squeeze out every percent of speed and efficiency, you should use C++. That's the reason many programs use both: C++ for the raw-performance parts and Python for a nice interface and program structure.
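As a rough illustration of why a warm-cache Python run can still be slower than the disk read: a per-element Python loop pays interpreter overhead on every iteration, while a vectorized sum runs in compiled C. A small self-contained comparison, with synthetic data standing in for the lecture's data set (timings will vary by machine):

```python
import time

import numpy as np

# Synthetic column standing in for the large data set from the lecture.
column = np.random.rand(10_000_000)

start = time.perf_counter()
total = 0.0
for x in column:              # pure-Python loop: interpreter overhead per element
    total += x
print(f"python loop: {time.perf_counter() - start:.2f} s")

start = time.perf_counter()
total = column.sum()          # vectorized: the loop runs in compiled C code
print(f"numpy sum:   {time.perf_counter() - start:.2f} s")
```

This is also part of the answer to the deep-learning question: the Python API mostly builds and launches graphs/tensors, while the per-element work happens in C++/CUDA, so the interpreter overhead is paid per operation, not per element.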

Tensorflow - Profiling using timeline - Understand what is limiting the system

I am trying to understand why each training iteration takes approximately 1.5 seconds.
I used the tracing method described here. I am working on a Titan X Pascal GPU. My results look very strange: it seems that every operation is relatively fast, and the system is idle most of the time between operations. How can I understand from this what is limiting the system?
It does seem, however, that when I drastically reduce the batch size the gaps close, as can be seen here.
Unfortunately the code is very complicated and I can't post a small version of it that has the same problem.
Is there a way to understand from the profiler what is taking up the time in the gaps between operations?
Thanks!
EDIT:
On CPU only I do not see this behavior:
I am running a
Here are a few guesses, but it's hard to say without a self-contained reproduction that I can run and debug.
Is it possible you are running out of GPU memory? One signal of this is if you see log messages of the form Allocator ... ran out of memory during training. If you run out of GPU memory, then the allocator backs off and waits in the hope more becomes available. This might explain the large inter-operator gaps that go away if you reduce the batch size.
As Yaroslav suggests in a comment above, what happens if you run the model on CPU only? What does the timeline look like?
Is this a distributed training job or a single-machine job? If it's a distributed job, does a single-machine version show the same behavior?
Are you calling session.run() or eval() many times, or just once per training step? Every run() or eval() call will drain the GPU pipeline, so for efficiency you usually need to express your computation as one big graph with only a single run() call. (I doubt this is your problem, but I mention it for completeness.)
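For reference, the tracing method mentioned in the question boils down to something like the following (TensorFlow 1.x API; the graph and feed here are placeholders, not the asker's model). The resulting JSON opens in chrome://tracing, and comparing a CPU-only trace against the GPU one is a quick way to see whether the gaps come from the GPU side:

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# Placeholder graph: replace with the real training step and feed.
x = tf.placeholder(tf.float32, shape=[None, 128])
train_op = tf.reduce_sum(tf.matmul(x, tf.transpose(x)))

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op,
             feed_dict={x: [[0.0] * 128]},
             options=run_options,
             run_metadata=run_metadata)

    # Write a Chrome-trace JSON; open it at chrome://tracing to inspect the gaps.
    trace = timeline.Timeline(run_metadata.step_stats)
    with open("timeline.json", "w") as f:
        f.write(trace.generate_chrome_trace_format())
```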

producer-consumer with very large items in Python

I'm working in the following scenario (with Python). I have a neural net optimised with a stochastic algorithm that needs a constant stream of training data. Each datum is large (let's say 50 megabytes) and I have a lot of data. The entire data set is too large to fit in the RAM. I can see three possible ways of working with this data.
Load each datum sequentially from the hard drive and then execute the neural-net code. This is slow for obvious reasons.
Use multithreading (the threading library). Then I can have multiple threads loading the data from the hard drive and putting it in a queue from which the main process reads. This works fairly well, though it seems to me that the GIL (https://wiki.python.org/moin/GlobalInterpreterLock) is slowing me down: the computation of the gradient for the neural net is clearly not really executed in parallel with reading from the hard drive.
Use multiprocessing (the multiprocessing library). This is analogous to 2., but the queue implementation I then have to use is based on Linux pipes, which have a very small buffer, so the main process spends most of its time reading data from the queue in tiny chunks.
None of the approaches above is acceptable to me. Can you think of a more efficient way of implementing such a producer-consumer problem in Python?
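One possible workaround, sketched below under the assumption that the items are NumPy arrays: keep multiprocessing for the loading, but don't push the 50 MB payload through the pipe at all. The producer writes each preprocessed item to a spool directory (ideally on a fast disk or /dev/shm) and passes only the file path through the queue; the consumer memory-maps it. `load_and_preprocess` and `train_step` are placeholders for the real code.

```python
import multiprocessing as mp
import os
import tempfile

import numpy as np

def load_and_preprocess(name):
    """Placeholder: load and preprocess one ~50 MB training item."""
    return np.random.rand(6_250_000)           # ~50 MB of float64

def producer(names, spool_dir, q):
    for name in names:
        datum = load_and_preprocess(name)
        path = os.path.join(spool_dir, f"{name}.npy")
        np.save(path, datum)
        q.put(path)                             # only a tiny string goes through the pipe
    q.put(None)                                 # sentinel: no more data

def train_step(datum):
    return float(datum.sum())                   # stand-in for the gradient update

def consumer(q):
    while True:
        path = q.get()
        if path is None:
            break
        datum = np.load(path, mmap_mode="r")    # map the file instead of copying it
        train_step(datum)
        os.remove(path)                         # free the spool space once consumed

if __name__ == "__main__":
    spool_dir = tempfile.mkdtemp(prefix="spool_")
    q = mp.Queue(maxsize=8)                     # bounds how far the producer runs ahead
    p = mp.Process(target=producer,
                   args=([f"item_{i}" for i in range(16)], spool_dir, q))
    p.start()
    consumer(q)
    p.join()
```

On Python 3.8+ the multiprocessing.shared_memory module offers a similar pattern without touching the disk: the producer copies each array into a shared-memory block and sends only its name, shape and dtype through the queue.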
