Combination of parallel processing and dask arrays to process multiple image stacks

Combination of parallel processing and dask arrays to process multiple image stacks - python

I have a directory containing n h5 file each of which has m image stacks to filter. For each image, I will run the filtering (gaussian and laplacian) using dask parallel arrays in order to speed up the processing (Ref to Dask). I will use the dask arrays through the apply_parallel() function in scikit-image.
I will run the processing on a small server with 20 cpus.
I would like to get an advice to which parallel strategy will make more sense to use:
1) Sequential processing of the h5 files and all the cpus for dask processing
2) Parallel processing of the h5 files with x cores and use the remaining 20-x to dask processing.
3) Distribute the resources and parallel processing the h5 files, the images in each h5 files and the remaining resources for dask.
thanks for the help!

It is always best to parallelize in the simplest way possible. If you have several files and just want to run the same computation on each of them then this is almost certainly the simplest approach. If this saturates your computational resources then you can stop here without diving into more sophisticated methods.
If this is indeed your situation then you can parallelize done with dask, make, concurrent.futures or any of a variety of other libraries.
If there are other concerns, like trying to parallelize the operation itself or making sure you don't run out of memory then you are forced into more sophisticated systems like dask, but this may not be the case.

Use make for parallelization.
With make -j20 you can tell make to run 20 processes in parallel.
By using multiple processes, you avoid the cost of the "global interpreter lock". For independent tasks, it is more efficient to use multiple independent processes (benchmark if you have doubt). Make is great for processing whole folders where you need to apply the same command to each file - it is traditionally used for compiling source code, but it can be used to run arbitrary commands.

Related

What is the logic behind "compute()" in Dask dataframes?

I'm struggling to understand when and when not to use compute() in Dask dataframes. I usually write my code by adding/removing compute() until the code works, but that's extremely error-prone. How should I use compute() in Dask? Does it differ in Dast Distributed?

The core idea of delayed computations is to delay the actual calculation until the final target is known. This allows:
increased speed of coding (e.g. as a data scientist, I don't need to wait for every transformation step to complete before designing the workflow),
distribution of work across multiple workers,
overcoming resource constraints of my client, e.g. if I am using a laptop with limited memory, I can run heavy computations on dask workers that are in the cloud or another machine with more resources,
better efficiency if the final target requires only some tasks to be done (e.g. if the final calculation requires only a subset of the dataframe, then dask will load only the relevant columns/partitions).
Some of the alternatives to calling .compute are:
.visualize(): this helps visualize the task graph. The DAG can become hairy when there are lots of tasks, so this is useful to run on smaller subsets of the data (e.g. only loading two/three partitions of the dataframe)
using client.submit: this launches computations right away providing you with a future, an object that refers to results of a task being computed. This gives the advantages of scaling work across multiple workers, but it can be a bit more resource intensive (since dask doesn't know the full workflow, it might run computations that are not needed to achieve the final target).
With regards to distributed, I don't think there is a difference except for where the result will be: dask.compute will put the result in local machine, while client.compute will keep the result on a remote worker.

Each partition in a Dask DataFrame is a Pandas DataFrame.
compute() combines all the partitions (Pandas DataFrames) into a single Pandas DataFrame.
Dask is fast because it can perform computations on partitions in parallel. Pandas can be slower because it only works on one partition.
You should avoid calling compute() whenever possible. It better to have the data spread out in multiple partitions, so it can be processed in parallel.
In rare cases you can compute to Pandas (e.g. when doing a large to small join or after a huge filtering operation), but it's best to learn how to use the Dask API to run computations in parallel.

How many processes should I create for the multi-threads CPU in the computational intensive scenario?

I have a 32 cores and 64 threads CPU for executing a scientific computation task. How many processes should I create?
To be noted that my program is computationally intensive involved lots of matrix computations based on Numpy. Now, I use the Python default process pool to execute this task. It will create 64 processes. Will it perform better or worse than 32 processes?

I'm not really sure that Python is suited for multi-threading computational intensive scenarios, due to the Global Interpreter Lock (GIL). Basically, you should use multi-threading in Python only for IO-bound tasks. I'm not sure if Numpy applies since the heavy part if I recall correctly is written in C++.
If you're looking for alternatives you could use the Apache Spark framework to distribute the work across multiple machines. I think that even if you run your code in local mode (i.e. on your machine) with 8/16 workers you could get some performance boost.
EDIT: I'm sorry, I just read on the GIL page that I linked that it doesn't apply for Numpy. I still think that this is not really the best tool you can use, since effective multi-threading programming is quite hard to get right and there are some other nuances that you can read in the link.

It's impossible to give you an answer as it will depend on your exact problem and code but potentially also of your hardware.
Basically the process for multi-processing is to split the work in X parts then distribute it to each process, let each process work and then merge each result.
Now you need to know if you can effectively split the work in 64 parts while keeping each part around the same time of work (if one process take 90% of the time and you can't split it it's useless to have more than 2 processes as you will always wait for the first one).
If you can do it and it's not taking too long to split and merge the work/results (remember that it's a supplementary work to do so it will take extra time) then it can be interesting to use more process.
It is also possible that you can speed-up your code by using less process if you pass too much time on splitting/merging the work/results (sometime the speed-up obtained by using more process can be negative).
Also you have to remember that in some architecture the memory cache can be shared among cores so it can badly affect the performances of multiprocessing.

producer-consumer with very large items in Python

I'm working in the following scenario (with Python). I have a neural net optimised with a stochastic algorithm that needs a constant stream of training data. Each datum is large (let's say 50 megabytes) and I have a lot of data. The entire data set is too large to fit in the RAM. I can see three possible ways of working with this data.
Load each datum sequentially from hard drive and then execute the code of the neural net. This is slow for obvious reasons.
Use multithreading (threading library). Then I can have multiple threads loading the data from the hard drive and putting them in the queue from which the main process is reading. This is working fairly well though it seems to me that GIL (https://wiki.python.org/moin/GlobalInterpreterLock) is slowing me down. The computation of the gradient for the neural net is clearly not really executed in parallel with reading from the hard drive.
Use multiprocessing (multiprocessing library). This is analogous to 2. though the implementation of the queue I need to use then is using Linux pipes which are limited to be very small, so the main process spends most of the time reading data from the queue in tiny parts.
All of these ways above are not acceptable for me. Can you think of a more efficient way of implementing such producer-consumer problem in Python?

CPU and GPU operations parallelization

I have an application that has 3 main functionalities which are running sequentially at the moment:
1) Loading data to memory and perform preprocesssing on it.
2) Perform some computations on the data using GPU with theano.
3) Monitor the state of the computations on GPU and print them to the screen.
These 3 functionalities are embarrassingly parallelizable by using multi-threading. But in python I perform all these three functionalities sequentially. Partly because in the past I had some bad luck with Python multi-threading and GIL issues.
Here in this case, I don't necessarily need to utilize the full-capabilities of multiple-cpu's at hand. All I want to do is, to load the data and preprocess them while the computations at the GPU are performed and monitor the state of the computations at the same time. Currently most time-consuming computations are performed at 2), so I'm kind of time-bounded with operations at 2). Now my questions are:
*Can python parallelize these 3 operations without creating new bottlenecks, e.g.: due to GIL issues.
*Should I use multiprocessing instead of multithreading?
In a nutshell how should parallelize these three operations if I should in Python.
It is been some time since last time I wrote multi-threaded code for CPU(especially for python), any guidance will be appreciated.
Edit: Typos.

The GIL is a bit of a nuisance sometimes...
A lot of it is going to revolve around how you can use the GPU. Does the API your using allow you to set it running then go off and do something else, occasionally polling to see if the GPU has finished? Or maybe it can raise an event, call a callback or something like that?
I'm sensing from your question that the answer is no... In which case I suspect your only choice (given that you're using Python) is multi processing. If the answer is yes then you can start off the GPU then get on with some preprocessing and plotting in the meantime and then check to see if the GPU has finished.
I don't know much about Python or how it does multiprocessing, but I suspect that it involves serialisation and copying of data being sent between processes. If the quantity of data you're processing is large (I suggest getting worried at the 100's of megabytes mark. Though that's just a hunch) then you may wish to consider how much time is lost in serialising and copy that data. If you don't like the answers to that analysis then your probably out of luck so far as using Python is concerned.
You say that the most time consuming part is the GPU processing? Presumably the other two parts are reasonably lengthy otherwise there would be little point trying to parallelise them. For example if the GPU was 95% of the runtime then saving 5% by parallelising the rest hardly seems worth it.

Minimising reading from and writing to disk in Python for a memory-heavy operation

Background
I am working on a fairly computationally intensive project for a computational linguistics project, but the problem I have is quite general and hence I expect that a solution would be interesting to others as well.
Requirements
The key aspect of this particular program I must write is that it must:
Read through a large corpus (between 5G and 30G, and potentially larger stuff down the line)
Process the data on each line.
From this processed data, construct a large number of vectors (dimensionality of some of these vectors is > 4,000,000). Typically it is building hundreds of thousands of such vectors.
These vectors must all be saved to disk in some format or other.
Steps 1 and 2 are not hard to do efficiently: just use generators and have a data-analysis pipeline. The big problem is operation 3 (and by connection 4)
Parenthesis: Technical Details
In case the actual procedure for building vectors affects the solution:
For each line in the corpus, one or more vectors must have its basis weights updated.
If you think of them in terms of python lists, each line, when processed, updates one or more lists (creating them if needed) by incrementing the values of these lists at one or more indices by a value (which may differ based on the index).
Vectors do not depend on each other, nor does it matter which order the corpus lines are read in.
Attempted Solutions
There are three extrema when it comes to how to do this:
I could build all the vectors in memory. Then write them to disk.
I could build all the vectors directly on the disk, using shelf of pickle or some such library.
I could build the vectors in memory one at a time and writing it to disk, passing through the corpus once per vector.
All these options are fairly intractable. 1 just uses up all the system memory, and it panics and slows to a crawl. 2 is way too slow as IO operations aren't fast. 3 is possibly even slower than 2 for the same reasons.
Goals
A good solution would involve:
Building as much as possible in memory.
Once memory is full, dump everything to disk.
If bits are needed from disk again, recover them back into memory to add stuff to those vectors.
Go back to 1 until all vectors are built.
The problem is that I'm not really sure how to go about this. It seems somewhat unpythonic to worry about system attributes such as RAM, but I don't see how this sort of problem can be optimally solved without taking this into account. As a result, I don't really know how to get started on this sort of thing.
Question
Does anyone know how to go about solving this sort of problem? I python simply not the right language for this sort of thing? Or is there a simple solution to maximise how much is done from memory (within reason) while minimising how many times data must be read from the disk, or written to it?
Many thanks for your attention. I look forward to seeing what the bright minds of stackoverflow can throw my way.
Additional Details
The sort of machine this problem is run on usually has 20+ cores and ~70G of RAM. The problem can be parallelised (à la MapReduce) in that separate vectors for one entity can be built from segments of the corpus and then added to obtain the vector that would have been built from the whole corpus.
Part of the question involves determining a limit on how much can be built in memory before disk-writes need to occur. Does python offer any mechanism to determine how much RAM is available?

take a look at pytables. One of the advantages is you can work with very large amounts of data, stored on disk, as if it were in memory.
edit: Because the I/O performance will be a bottleneck (if not THE bottleneck), you will want to consider SSD technology: high I/O per second and virtually no seeking times. The size of your project is perfect for todays affordable SSD 'drives'.

A couple libraries come to mind which you might want to evaluate:
joblib - Makes parallel computation easy, and provides transparent disk-caching of output and lazy re-evaluation.
mrjob - Makes it easy to write Hadoop streaming jobs on Amazon Elastic MapReduce or your own Hadoop cluster.

Two ideas:
Use numpy arrays to represent vectors. They are much more memory-efficient, at the cost that they will force elements of the vector to be of the same type (all ints or all doubles...).
Do multiple passes, each with a different set of vectors. That is, choose first 1M vectors and do only the calculations involving them (you said they are independent, so I assume this is viable). Then another pass over all the data with second 1M vectors.
It seems you're on the edge of what you can do with your hardware. It would help if you could describe what hardware (mostly, RAM) is available to you for this task. If there are 100k vectors, each of them with 1M ints, this gives ~370GB. If multiple passes method is viable and you've got a machine with 16GB RAM, then it is about ~25 passes -- should be easy to parallelize if you've got a cluster.

Think about using an existing in-memory DB solution like Redis. The problem of switching to disk once RAM is gone and tricks to tweak this process should already be in place. Python client as well.
Moreover this solution could scale vertically without much effort.

You didn't mention either way, but if you're not, you should use NumPy arrays for your lists rather than native Python lists, which should help speed things up and reduce memory usage, as well as making whatever math you're doing faster and easier.
If you're at all familiar with C/C++, you might also look into Cython, which lets you write some or all of your code in C, which is much faster than Python, and integrates well with NumPy arrays. You might want to profile your code to find out which spots are taking the most time, and write those sections in C.
It's hard to say what the best approach will be, but of course any speedups you can make in critical parts of will help. Also keep in mind that once RAM is exhausted, your program will start running in virtual memory on disk, which will probably cause far more disk I/O activity than the program itself, so if you're concerned about disk I/O, your best bet is probably to make sure that the batch of data you're working on in memory doesn't get much greater than available RAM.

Use a database. That problem seems large enough that language choice (Python, Perl, Java, etc) won't make a difference. If each dimension of the vector is a column in the table, adding some indexes is probably a good idea. In any case this is a lot of data and won't process terribly quickly.

I'd suggest to do it this way:
1) Construct the easy pipeline you mentioned
2) Construct your vectors in memory and "flush" them into a DB. ( Redis and MongoDB are good candidates)
3) Determine how much memory this procedure consumes and parallelize accordingly ( or even better use a map/reduce approach, or a distributed task queue like celery)
Plus all the tips mentioned before (numPy etc..)

Hard to say exactly because there are a few details missing, eg. is this a dedicated box? Does the process run on several machines? Does the avail memory change?
In general I recommend not reimplementing the job of the operating system.
Note this next paragraph doesn't seem to apply since the whole file is read each time:
I'd test implementation three, giving it a healthy disk cache and see what happens. With plenty of cache performance might not be as bad as you'd expect.
You'll also want to cache expensive calculations that will be needed soon. In short, when an expensive operation is calculated that can be used again, you store it in a dictionary (or perhaps disk, memcached, etc), and then look there first before calculating again. The Django docs have a good introduction.

From another comment I infer that your corpus fits into the memory, and you have some cores to throw at the problem, so I would try this:
Find a method to have your corpus in memory. This might be a sort of ram disk with file system, or a database. No idea, which one is best for you.
Have a smallish shell script monitor ram usage, and spawn every second another process of the following, as long as there is x memory left (or, if you want to make things a bit more complex, y I/O bandwith to disk):
iterate through the corpus and build and write some vectors
in the end you can collect and combine all vectors, if needed (this would be the reduce part)

Split the corpus evenly in size between parallel jobs (one per core) - process in parallel, ignoring any incomplete line (or if you cannot tell if it is incomplete, ignore the first and last line of that each job processes).
That's the map part.
Use one job to merge the 20+ sets of vectors from each of the earlier jobs - That's the reduce step.
You stand to loose information from 2*N lines where N is the number of parallel processes, but you gain by not adding complicated logic to try and capture these lines for processing.

Many of the methods discussed by others on this page are very helpful, and I recommend that anyone else needing to solve this sort of problem look at them.
One of the crucial aspects of this problem is deciding when to stop building vectors (or whatever you're building) in memory and dump stuff to disk. This requires a (pythonesque) way of determining how much memory one has left.
It turns out that the psutil python module does just the trick.
For example say I want to have a while-loop that adds stuff to a Queue for other processes to deal with until my RAM is 80% full. The follow pseudocode will do the trick:
while (someCondition):
if psutil.phymem_usage().percent > 80.0:
dumpQueue(myQueue,somefile)
else:
addSomeStufftoQueue(myQueue,stuff)
This way you can have one process tracking memory usage and deciding that it's time to write to disk and free up some system memory (deciding which vectors to cache is a separate problem).
PS. Props to to Sean for suggesting this module.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.