I parallelized a program that uses fairly large matrices. The program simulates the Ising model from statistical mechanics. On my laptop everything works fine - even the visualization shows the behaviour I expect. Then I wanted to see how it scales across many CPUs, so I ran it on a cluster computer I have at hand. Well, I get super-linear speedup. At first I thought that was no big deal, since with multiple processes each process's share of the problem gets smaller and thus might fit into the cache, so no time-consuming copying between cache and RAM would slow it down. However, I even get super-linear speedup for one CPU, which I wouldn't expect. If the whole system (matrix) doesn't fit into the cache for the sequential version, then it also shouldn't fit when I use the parallel version with only one CPU, right?
I've done a check on my laptop. Averaged over 5 runs, the parallel version using one CPU is a tiny bit slower than the sequential version. I guess this is okay since there are some statements in the parallel version which I don't have in the sequential one.
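A wall-clock harness along these lines is enough to reproduce the comparison (a minimal sketch; run_sequential and run_parallel are hypothetical stand-ins for the two entry points):

    import time

    def average_runtime(fn, repeats=5):
        """Run fn several times and return the mean wall-clock time in seconds."""
        times = []
        for _ in range(repeats):
            start = time.perf_counter()
            fn()
            times.append(time.perf_counter() - start)
        return sum(times) / len(times)

    # run_sequential and run_parallel are placeholders for the actual entry points.
    # print(average_runtime(run_sequential), average_runtime(run_parallel))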
Any ideas what this could be all about? Is the super linear speedup reasonable?
Note: I'm programming in Python using numpy, and for the parallel version I use processes and shmarray.
I have a rather large rectangular (>1G rows, 1K columns) Fortran-style NumPy matrix, which I want to transpose to C-style.
So far my approach has been relatively trivial, using the following Rust snippet with mmapped slices of the source and destination matrices, where both original_matrix and target_matrix are mmapped PyArray2 instances and Rayon handles the parallelization.
Since the target_matrix has to be modified by multiple threads, I wrap it in an UnsafeCell.
    // UnsafeCell alone is not Sync; wrap it so Rayon workers can share it (each (i, j) is written by exactly one thread).
    struct SyncCell<T>(std::cell::UnsafeCell<T>);
    unsafe impl<T> Sync for SyncCell<T> {}

    let shared_target_matrix = SyncCell(std::cell::UnsafeCell::new(target_matrix));
    original_matrix.as_ref().par_chunks(number_of_nodes).enumerate().for_each(|(j, feature)| {
        feature.iter().copied().enumerate().for_each(|(i, feature_value)| unsafe {
            // Element i of chunk j is written to target position [i, j].
            *(*shared_target_matrix.0.get()).uget_mut([i, j]) = feature_value;
        });
    });
This approach takes ~3 hours on an HDD to transpose a matrix with shape (~1G, 100), roughly 120 GB. Transposing a (~1G, 1000), ~1200 GB matrix does not scale linearly to 30 hours, as one might naively expect, but explodes to several weeks. As it stands, I have managed to transpose roughly 100 features in 2 days, and it keeps slowing down.
There are several aspects, such as the file system in use, HDD fragmentation, and how mmap handles page loading, which my solution currently ignores.
Are there known, more holistic solutions that take into account these issues?
Note on sequential and parallel approaches
While intuitively this sort of operation should be limited only by IO and therefore not benefit from any parallelization, we have observed experimentally that the parallel approach is indeed around three times faster (on a machine with 12 cores and 24 threads) than a sequential approach when transposing a matrix with shape (1G, 100). We are not sure why this is the case.
Note on using two HDDs
We also experimented with using two devices, one holding the Fortran-style matrix and a second one to which we write the target matrix. Both HDDs were connected directly to the motherboard through SATA cables. We expected at least a doubling of the performance, but it remained unchanged.
While intuitively this sort of operation should be limited only by IO and therefore not benefit from any parallelization, we have observed experimentally that the parallel approach is indeed around three times faster
This may be due to poor IO queue utilization. With an entirely sequential workload and no prefetching, you'll be alternating the device between working and idle. If you keep multiple operations in flight, it'll be working all the time.
Check with iostat -x <interval>
But parallelism is a suboptimal way to get the best utilization out of an HDD, because it will likely cause more head seeks than necessary.
We also experimented with using two devices, one holding the Fortran-style matrix and a second one to which we write the target matrix. Both HDDs were connected directly to the motherboard through SATA cables. We expected at least a doubling of the performance, but it remained unchanged.
This may be due to the operating system's write cache which means it can batch writes very efficiently and you're mostly bottlenecked on reads. Again, check with iostat.
There are several aspects, such as the file system in use, HDD fragmentation, and how mmap handles page loading, which my solution currently ignores.
Are there known, more holistic solutions that take into account these issues?
Yes, if the underlying filesystem supports it you can use FIEMAP to get the physical layout of the data on disk and then optimize your read order to follow the physical layout rather than the logical layout. You can use the filefrag CLI tool to inspect the fragmentation data manually, but there are Rust bindings for that ioctl, so you can use it programmatically too.
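For example, here is a rough sketch (in Python for brevity) of the manual route, shelling out to filefrag instead of calling the FIEMAP ioctl directly; the parsing assumes the usual filefrag -v extent table and the file name is only a placeholder, so adjust as needed:

    import re
    import subprocess

    def physical_extents(path):
        """Return (physical_start, logical_start, length) extent triples, sorted
        by physical position, parsed from `filefrag -v` output."""
        out = subprocess.run(["filefrag", "-v", path],
                             capture_output=True, text=True, check=True).stdout
        extents = []
        for line in out.splitlines():
            # Extent lines look like: "   0:   0..  255:   34816..  35071:  256: ..."
            m = re.match(r"\s*\d+:\s*(\d+)\.\.\s*(\d+):\s*(\d+)\.\.\s*(\d+):\s*(\d+):", line)
            if m:
                logical_start, _, physical_start, _, length = map(int, m.groups())
                extents.append((physical_start, logical_start, length))
        return sorted(extents)

    # Reading the file extent by extent in this order follows the on-disk layout.
    for physical, logical, length in physical_extents("fortran_matrix.bin"):
        print(f"physical block {physical}: logical block {logical}, {length} blocks")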
Additionally, you can use madvise(MADV_WILLNEED) to tell the kernel to prefetch data in the background for the next few loop iterations. For HDDs this should ideally be done in batches worth a few megabytes at a time, and the next batch should be issued when you're half-way through the current one.
Issuing them in batches minimizes syscall overhead, and starting the next one half-way through ensures there's enough time left to actually complete the IO before you reach the end of the current batch.
And since you'll be issuing prefetches manually in physical instead of logical order, you can also disable the default readahead heuristics (which would only get in the way) via madvise(MADV_RANDOM).
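A sketch of that prefetch pattern in Python (Linux, Python 3.8+ for mmap.madvise; the file name, batch size and process() are placeholders, and in the real job the offsets would follow the physical extent order rather than the logical one):

    import mmap

    BATCH = 8 * 1024 * 1024                      # a few megabytes per prefetch batch

    with open("fortran_matrix.bin", "rb") as f:  # placeholder file name
        mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
        mm.madvise(mmap.MADV_RANDOM)             # disable the default readahead
        size = mm.size()

        mm.madvise(mmap.MADV_WILLNEED, 0, min(BATCH, size))   # prime the first batch
        for start in range(0, size, BATCH):
            end = min(start + BATCH, size)
            half = start + (end - start) // 2

            process(mm[start:half])              # placeholder for the actual work
            if end < size:
                # Half-way through the current batch: request the next one so the
                # IO completes before we get there.
                mm.madvise(mmap.MADV_WILLNEED, end, min(BATCH, size - end))
            process(mm[half:end])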
If you have enough free disk space you could also try a simpler approach: defragmenting the file before operating on it. But even then you should still use madvise to ensure that there are always IO requests in flight.
**Summary:** I observe a ~1000x performance gap between Python code and C++ code doing the same job, despite using parallelization, vectorization, just-in-time compilation and machine-code generation with Numba, in the context of scientific calculation. The CPU won't be fully used, and I don't understand why.
Hello everybody,
I just started in a laboratory doing simulations of various materials, including simulations of the growth of biological-like tissues. To do that we create a 3D version of said tissue (a collection of vertices stored in a numpy array) and we apply different functions to it to mimic physics/biology.
We have a C++ code doing just that, which takes approximately 10 seconds to run. Someone converted that code to Python, but this version takes about 2.5 hours to run. We tried every trick in the book we knew to accelerate the code: we used Numba to accelerate numpy where appropriate, parallelized the code as much as we could, and tried to vectorize what could be vectorized, but still the gap remains. In fact, an earlier version of the code took days to run.
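To give an idea, the Numba-accelerated kernels look roughly like this (a heavily simplified placeholder, not the real physics; all names are illustrative):

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True, fastmath=True, cache=True)
    def update_vertices(vertices, forces, dt):
        # Per-vertex update over the 3D tissue, parallelized across vertices.
        n = vertices.shape[0]
        for i in prange(n):
            for k in range(3):
                vertices[i, k] += dt * forces[i, k]

    vertices = np.random.rand(1_000_000, 3)
    forces = np.random.rand(1_000_000, 3)
    update_vertices(vertices, forces, 1e-3)   # first call includes JIT compilation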
When the code executes, multiple cores are properly used, as monitored with the built-in system monitor. However, they are not used fully, and in fact deactivating cores manually does not seem to hurt performance much. At first I thought it could be due to the GIL, but releasing it had no effect on performance either. Somehow it makes me think of a bottleneck in memory transfer between the CPU and the RAM, but I cannot understand why the C++ version would not have the same problem. I also have the feeling that there is a performance cost for calling functions: one of my earlier tasks was to refactor the code, decomposing complicated functions into smaller ones, and since then I see a small performance degradation compared to the earlier version.
I must say I am really wondering where my bottleneck is and how it could be tested/improved. Any idea would be very welcome.
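For concreteness, the kind of check I can run easily is a plain function-level profile (a minimal sketch; simulate_tissue is a placeholder for the real entry point):

    import cProfile
    import pstats

    # simulate_tissue is a placeholder for the real entry point of the simulation.
    cProfile.run("simulate_tissue()", "profile.out")

    stats = pstats.Stats("profile.out")
    # Sort by cumulative time to see which high-level calls dominate, then by
    # total time to see where the work itself is spent.
    stats.sort_stats("cumulative").print_stats(20)
    stats.sort_stats("tottime").print_stats(20)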
I am aware my question is kind of a complicated one, so let me know if you would need additional information, I would be happy to provide.
Overview
I am working on rewriting and optimizing a set of MATLAB scripts for a computationally intensive scientific computing application as a set of Python scripts with the kernels written in C (run via ctypes). The Python wrapping is necessary for ease of end-user application and configuration. My current hardware is a 12-core Ryzen 7 3700X with 64 GB of RAM, but the code is also intended to be suitable for running on much larger, lower-clocked clusters.
Input/Output
The section of code this question concerns is highly parallelizable. The input is going to be something like 100-200 sets (serially ordered in working memory) of a few million uniformly organized floats (you can imagine them as 100-200 fairly high-resolution B/W images, all with the same proportions and similar structure). Each such "set" can be processed independently and uninterrupted for the bulk of the process. There are many computationally (and possibly memory) intensive calculations performed on these - some suitable for implementation using BLAS, but also more complex upsampling, interpolation, filtering, back-projection and projection, and so forth. In the MATLAB implementation I am using as a basis, this is implemented through a parfor loop calling a few subroutines and using some MEX functions written in C. The output of each iteration is going to be, again, a few million floats. If I recall correctly (running a realistic trial of the scripts is very messy at the moment - another thing I'm tasked with fixing - so I can't easily check), the computations are intensive enough that each iteration of the loop can be expected to take a few minutes.
The Conundrum
My most likely course of action will be to turn this entire section of the code into a big "kernel" written in C. I have subfunctions of it written in mixed C/Python already, and those already carry far too much Python overhead compared to the time the actual computations need - so I want to replace all of that, and the remainder of this easily parallelized code, with C. Thus, I have two methods I can use to parallelize the code:
I have Python create subprocesses, each of which triggers serial C code separately with its section of the data.
I have Python start a single C process to which I hand all the data, having the C process use OpenMP to spawn threads to parallelize the work.
I'm familiar with both Python and C multiprocessing, but I have never tried to run parallel C code through Python. My question is then: which of these is preferable from a performance standpoint, and are there any aspects I should be considering which I haven't considered here?
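To make the first option concrete, this is roughly the shape I have in mind (libkernel.so and its process_set function are hypothetical placeholders for the C kernel; each worker process calls the serial C code on its own set):

    import ctypes
    import numpy as np
    from multiprocessing import Pool

    # Hypothetical C kernel: void process_set(const float *in, float *out, size_t n);
    _lib = ctypes.CDLL("./libkernel.so")
    _lib.process_set.argtypes = [ctypes.POINTER(ctypes.c_float),
                                 ctypes.POINTER(ctypes.c_float),
                                 ctypes.c_size_t]
    _lib.process_set.restype = None

    def worker(data_set):
        """Run the serial C kernel on one independent set of floats."""
        data_set = np.ascontiguousarray(data_set, dtype=np.float32)
        out = np.empty_like(data_set)
        _lib.process_set(data_set.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
                         out.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
                         data_set.size)
        return out

    if __name__ == "__main__":
        sets = [np.random.rand(2048, 2048).astype(np.float32) for _ in range(8)]
        with Pool(processes=8) as pool:   # one Python worker per independent set
            results = pool.map(worker, sets)

The main cost to weigh against the OpenMP route is that Pool.map pickles each input set and each result when passing them between processes.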
Thank you in advance!
I am trying to understand why each training iteration takes approx. 1.5 sec.
I used the tracing method described here. I am working on a Titan X (Pascal) GPU. My results look very strange: it seems that every operation is relatively fast and the system is idle most of the time between operations. How can I understand from this what is limiting the system?
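For reference, that kind of trace is produced roughly like this with the TF 1.x timeline API (sess, train_op and feed are placeholders):

    import tensorflow as tf
    from tensorflow.python.client import timeline

    # sess, train_op and feed are placeholders for the actual session, training
    # op and feed_dict; this records a full trace of one training step.
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()

    sess.run(train_op, feed_dict=feed,
             options=run_options, run_metadata=run_metadata)

    # Write a Chrome trace that can be inspected at chrome://tracing.
    tl = timeline.Timeline(run_metadata.step_stats)
    with open("timeline.json", "w") as f:
        f.write(tl.generate_chrome_trace_format())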
It does seem, however, that when I drastically reduce the batch size the gaps close, as can be seen here.
Unfortunately the code is very complicated, and I can't post a small version of it that has the same problem.
Is there a way to understand from the profiler what is taking up the space in the gaps between operations?
Thanks!
EDIT:
On CPU only I do not see this behavior:
I am running a
Here are a few guesses, but it's hard to say without a self-contained reproduction that I can run and debug.
Is it possible you are running out of GPU memory? One signal of this is if you see log messages of the form Allocator ... ran out of memory during training. If you run out of GPU memory, then the allocator backs off and waits in the hope more becomes available. This might explain the large inter-operator gaps that go away if you reduce the batch size.
As Yaroslav suggests in a comment above, what happens if you run the model on CPU only? What does the timeline look like?
Is this a distributed training job or a single-machine job? If it's a distributed job, does a single-machine version show the same behavior?
Are you calling session.run() or eval() many times, or just once per training step? Every run() or eval() call will drain the GPU pipeline, so for efficiency you usually need to express your computation as one big graph with a single run() call per step. (I doubt this is your problem, but I mention it for completeness.)
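For example, a sketch of the single-call pattern (train_op, loss, accuracy, sess and feed are hypothetical placeholders):

    # Fetch everything needed for logging in the same run() call as the training
    # op, instead of issuing separate run()/eval() calls that each drain the
    # GPU pipeline.
    _, loss_val, acc_val = sess.run([train_op, loss, accuracy], feed_dict=feed)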
I have an application with 3 main functionalities, which currently run sequentially:
1) Loading data into memory and performing preprocessing on it.
2) Performing some computations on the data on the GPU with Theano.
3) Monitoring the state of the GPU computations and printing them to the screen.
These 3 functionalities are embarrassingly parallelizable using multi-threading, but in Python I perform all three of them sequentially, partly because in the past I had some bad luck with Python multi-threading and GIL issues.
In this case I don't necessarily need to utilize the full capabilities of the multiple CPUs at hand. All I want to do is load and preprocess the data while the computations on the GPU are running, and monitor the state of those computations at the same time. Currently the most time-consuming computations are performed in 2), so I'm essentially time-bound by the operations in 2). Now my questions are:
* Can Python parallelize these 3 operations without creating new bottlenecks, e.g. due to GIL issues?
* Should I use multiprocessing instead of multithreading?
In a nutshell, how should I parallelize these three operations, if I should at all, in Python?
It has been some time since I last wrote multi-threaded code for the CPU (especially in Python), so any guidance will be appreciated.
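To make it concrete, the kind of overlap I have in mind looks roughly like this (a minimal sketch with placeholders for the three stages; whether it introduces new GIL bottlenecks is exactly what I'm unsure about):

    import multiprocessing as mp
    import threading
    import time

    def loader(out_q):
        """Stage 1 (separate process): load and preprocess batches, hand them over."""
        for batch_id in range(10):            # placeholder for the real data loading
            out_q.put(f"preprocessed-batch-{batch_id}")
        out_q.put(None)                       # sentinel: no more data

    def monitor(state, stop_event):
        """Stage 3 (thread): periodically report the state of the GPU computation."""
        while not stop_event.is_set():
            print("batches processed:", state["done"])
            time.sleep(1.0)

    if __name__ == "__main__":
        q = mp.Queue(maxsize=4)               # bounded, so the loader can't run far ahead
        mp.Process(target=loader, args=(q,), daemon=True).start()

        state = {"done": 0}
        stop = threading.Event()
        threading.Thread(target=monitor, args=(state, stop), daemon=True).start()

        while True:                           # Stage 2 (main process): the GPU work
            batch = q.get()
            if batch is None:
                break
            # The Theano computation on `batch` would go here; while it runs, the
            # loader process and the monitor thread continue independently.
            time.sleep(0.5)                   # placeholder for the GPU computation
            state["done"] += 1
        stop.set()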
Edit: Typos.
The GIL is a bit of a nuisance sometimes...
A lot of it is going to revolve around how you can use the GPU. Does the API you're using allow you to set it running, then go off and do something else, occasionally polling to see if the GPU has finished? Or maybe it can raise an event, call a callback, or something like that?
I'm sensing from your question that the answer is no... in which case I suspect your only choice (given that you're using Python) is multiprocessing. If the answer is yes, then you can start off the GPU, get on with some preprocessing and plotting in the meantime, and then check to see if the GPU has finished.
I don't know much about Python or how it does multiprocessing, but I suspect that it involves serialisation and copying of data being sent between processes. If the quantity of data you're processing is large (I suggest getting worried at the hundreds-of-megabytes mark, though that's just a hunch), then you may wish to consider how much time is lost in serialising and copying that data. If you don't like the answer to that analysis, then you're probably out of luck as far as using Python is concerned.
You say that the most time-consuming part is the GPU processing? Presumably the other two parts are reasonably lengthy, otherwise there would be little point in trying to parallelise them. For example, if the GPU were 95% of the runtime, then saving 5% by parallelising the rest would hardly seem worth it.