MPI Newbie - some questions on how 'mpirun' works and process management - python

First of all, I'm not a programmer by profession, but I have to write code for my project (I have some proficiency in C++ and Python, though). I've come here often when I got stuck and usually found good solutions, but now I have some fundamental questions about MPI programming that I need answered before I can really proceed.
Here is my description of problems,
I would like to write code for an algorithm for a scientific calculation. The code can be divided into two parts:
A.) Matrix-vector multiplication and inversion of a matrix. This part is relatively straightforward, and I even have my own working MPI code for it.
B.) Calling an external MPI-ready program for a more complicated calculation (this part should also be simple, as it's just calling a UNIX command line).
The problem I'm having is how to join these two parts together. My algorithm goes like this:
for k in specified range
    divide the state vector of size 6NMx1 into M blocks, and let each of the M nodes handle one block
    manipulate the state vector according to A.) in parallel
    after A.) is done, run B.) using M nodes in parallel   /* THIS IS WHERE I GOT STUCK */
    update the state vector
end for
To run B.), I have to use mpirun to call a UNIX command,
mpirun -np #PPN my_app > some_output
The questions I have are:
How does 'mpirun' actually work? Does it spawn new processes when it is called?
Let's say I'm using M cluster compute nodes, each with 16 processors per node. If I use only one process on a node to call the above UNIX command, will it generate 16 more processes? If so, I will end up with 256M processes running in total, am I correct?
My main goal is to have each compute node handle one block of the state vector (the blocks are independent, each of size 6Nx1) and use the numbers from its block as input for B.). I'm working on a cluster, so when I submit my job I have to define the number of nodes beforehand, and I strictly want each node to also run B.) in parallel after A.) is done. Are there any suggestions on how to do this using MPI? Some people told me to write the code for A.) and B.) separately and use a Python script to control them at the top layer, so it would look like this:
Python script:
for k in specified range
    mpirun A.)   --> This is straightforward for me
    mpirun B.)
end for
Pseudocode for B.)
/* THIS PROGRAM SHOULD HAVE 16M PROCESSES */
if rank % 16 == 0
    mpirun -np 16 my_app > output
end if
/* I WANT M CALLS TO THIS PROGRAM IN PARALLEL */
MPI_COMM.BARRIER
Do you think this scheme will use 16M processes in parallel for B.)? If there are better ways to implement B.), or even better, to wrap it in the same code as A.), please suggest them!
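To make it concrete, here is roughly what I imagine for B.), with one mpi4py driver process per node launching the external program through subprocess (completely untested; my_app and the block file names are just placeholders):

from mpi4py import MPI
import subprocess

comm = MPI.COMM_WORLD
rank = comm.Get_rank()      # launch this driver with M processes, one per node

# each driver process calls the external MPI program on its own node's 16 cores,
# feeding it the block of the state vector that this node is responsible for
subprocess.run("mpirun -np 16 my_app < block_%d.in > block_%d.out" % (rank, rank),
               shell=True, check=True)

comm.Barrier()              # wait until every node has finished its call to B.)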
Finally, this is my prototype code, so I don't really care about efficiency. I just need it to work; I will worry about optimization later on.
If my descriptions are confusing, please ask me and I will come back and clarify. Thank you for your time, and I really appreciate your help! :)

mpirun is just a launcher: it starts the number of processes you request on the command line, and it will usually detect what kind of machine you are running on and adapt to it.
The second question is harder to answer, because on a cluster with multiple nodes you usually go through a dedicated job scheduler. For example, with Slurm you submit your program through an sbatch script that looks like:
# number of tasks (processes) per node
#SBATCH --ntasks-per-node=2
# number of nodes
#SBATCH -N 4
srun ./a.out
That means your program will run on 4 nodes with 2 processes on each node.
I'm not entirely sure about the rest, because it's a bit confusing to me, but maybe you should reconsider the problem with a different approach. You don't need MPI if you are working within a single node; you can use OpenMP instead.
MPI is needed when you are working with distributed (non-shared) memory, which is not the case within a node.
I hope it can help you in your work.

Related

Python use slurm for multiprocessing

I want to run a simple task using multiprocessing (I think this is the same as using parfor in MATLAB, correct?)
For example:
from multiprocessing import Pool
import matplotlib.pyplot as plt

def func_sq(i):
    plt.plot(x[i, :])      # x is a ready-to-use large ndarray; I just want
    plt.savefig(....)      # to plot each column on a separate figure
    plt.close()

pool = Pool()
pool.map(func_sq, [1, 2, 3, 4, 5, 6, 7, 8])
But I am very confused about how to use Slurm to submit my job. I have been searching for answers but could not find a good one.
Currently, without multiprocessing, I am using a Slurm job submission file like this (named test1.sh):
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -p batch
#SBATCH --exclusive
module load anaconda3
source activate py36
srun python test1.py
Then, I type in sbatch test1.sh in my prompt window.
So if I would like to use multiprocessing, how should I modify my .sh file? I have tried it myself, but just changing -n to 16 and using Pool(16) makes my job repeat 16 times.
Or is there a way to maximize performance if multiprocessing is not suitable? (I have heard about multithreading but don't know exactly how it works.)
And how do I use my memory effectively so that the job won't crash? (My x matrix is very large.)
Thanks!
For the GPU, is it possible to do the same thing? My current submission script without multiprocessing is:
#!/bin/bash
#SBATCH -n 1
#SBATCH -p gpu
#SBATCH --gres=gpu:1
Thanks a lot!
The "-n" flag is setting the number of tasks your sbatch submission is going to execute, which is why your script is running multiple times. What you want to change is the "-c" argument, which is how many CPUs each task is assigned. Your script spawns additional processes, but they will be children of the parent process and share the CPUs assigned to it. Just add "#SBATCH -c 16" to your script.
As for memory, there is a default amount of memory your job will be given per CPU, so increasing the number of CPUs will also increase the amount of memory available. If you're not getting enough, add something like "#SBATCH --mem=20000M" to request a specific amount.
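On the Python side, one way to keep the Pool size in step with that allocation is to read the environment variable Slurm sets for -c (a minimal sketch, reusing the func_sq from the question):

import os
from multiprocessing import Pool

# SLURM_CPUS_PER_TASK is set when the job is submitted with #SBATCH -c; fall back to 1 if it's absent
ncpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))

with Pool(processes=ncpus) as pool:
    pool.map(func_sq, [1, 2, 3, 4, 5, 6, 7, 8])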
I don't mean to be unwelcoming here, but this question seems to indicate that you don't actually understand the tools you're using here. Python Multiprocessing allows a single Python program to launch child processes to help it perform work in parallel. This is particularly helpful because multithreading (which is commonly how you'd accomplish this in other programming languages) doesn't gain you parallel code execution in Python, due to Python's Global Interpreter Lock.
Slurm (which I don't use, but from some quick research) seems to be a fairly high-level utility that allows individuals to submit work to some sort of cluster of computers (or a supercomputer... usually similar concepts). It has no visibility, per se, into how the program it launches runs; that is, it has no relationship to the fact that your Python program proceeds to launch 16 (or however many) helper processes. Its job is to schedule your Python program to run as a black box, then sit back and make sure it finishes successfully.
You seem to have some vague data processing problem. You describe it as a large matrix, but you don't give nearly enough information for me to actually understand what you're trying to accomplish. Regardless, if you don't actually understand what you're doing and how the tools you're using work, you're just flailing until you maybe eventually get lucky enough for this to work. Stop guessing, figure out what these tools do, look around and read documentation, then figure out what you're trying to accomplish and how you could go about splitting up the work in a reasonable fashion.
Here's my best guess, but I really have very little information to work from so it may not be helpful at all:
Your Python script has no concept that it's being run multiple times by Slurm (the -n 16 you refer to, I guess). It makes sense, then, that the job gets repeated 16 times, because Slurm runs the entire script 16 times, and each time your Python script does the entire task from start to finish. If you want Slurm and your Python program to interact, so that the Python program expects to get run multiple times in parallel, I can't help you there; you'll just need to read more about Slurm.
Your data must be able to be read incrementally, or partially, if you have any hope of breaking this job into pieces. That is, if you can only read the entire matrix all at once, or not at all, you're stuck with solutions that begin by reading the entire matrix into memory, which you indicate is not really an option. Assuming you can, and that you want to perform some work on each row independently, then you're fortunate enough for your task to be what's officially known as "embarrassingly parallel". This is a very good thing, if true.
Assuming your problem is embarrassingly parallel (since it looks like you're just trying to load each row of your data matrix, plot it somehow, then save off that plot as an image to disk), and you can load your data incrementally, then continue reading up on Python's multiprocessing module, and Pool().map is probably the right direction to be headed in. Create some Python generator that produces rows of your data matrix, then pass that generator and func_sq to pool.map, and sit back and wait for the job to finish.
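As a rough sketch of that direction (the function and file names here are mine, not yours, and the memory-mapped load assumes the matrix is stored as a .npy file):

import multiprocessing as mp
import numpy as np
import matplotlib
matplotlib.use("Agg")                 # no display on a compute node
import matplotlib.pyplot as plt

def plot_row(args):
    i, row = args
    plt.figure()
    plt.plot(row)
    plt.savefig("row_%06d.png" % i)
    plt.close()

def rows(path):
    # memory-map the matrix so the whole thing is never held in RAM at once
    x = np.load(path, mmap_mode="r")
    for i in range(x.shape[0]):
        yield i, np.asarray(x[i, :])  # copy out just this row and hand it to a worker

if __name__ == "__main__":
    with mp.Pool() as pool:
        pool.map(plot_row, rows("x.npy"))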
If you really need to do this work across multiple machines, rather than hacking your own Slurm + Multiprocessing stack, I'd suggest you start using actual data processing tools, such as PySpark.
This doesn't sound like a trivial problem, and even if it were, you don't give sufficient details for me to provide a robust answer. There's no "just fix this one line" answer to what you've asked, but I hope this helps give you an idea of what your tools are doing and how to proceed from here.

Is there an easy way to have "checkpoints" in an extended python script?

To preface my question let me give a bit of context: I'm currently working on a data pipeline that has a number of different steps. Each step can go wrong and many take some time (not a huge amount, but on the order of minutes).
For this reason the pipeline is currently heavily supervised by humans. An analyst goes through each step, running the Python code in a Jupyter notebook; upon hitting a problem, they make minor code adjustments inline and repeat that section.
In the long run, the goal here is to have zero human intervention. However, in the shorter term we'd like to make this process more seamless. The simplest way to do that seems to be to split each section into its own script, and have a parent script that runs each bit and verifies its output. However, we also need the ability to rerun a step with an identical setup if it fails.
For example:
run a --> ✅
run b --> ✅ (b relies on some data produced by a)
run c --> ❌ (c relies on data produced by a and b)
// make some changes to c
run c --> ✅ (c should run in an identical state to its original run)
The most obvious way to do this would be to write the output from each script to a file, and load those files in the next script. This would work, but seems a bit inelegant. A database seems another valid option, but a lot of the data doesn't fit cleanly into a db format.
Does anyone have any suggestions for some ways to achieve what I'm looking for? If anything is unclear I'm also more than happy to clarify any points!
You could create an object that basically maintains the state after each step and use pickle to serialize that object to a file.
Then it's up to your python script to unpickle that file and then decide which step it needs to start from based on the state.
https://wiki.python.org/moin/UsingPickle
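A minimal sketch of that idea (the state file name and the run_a/run_b/run_c step functions are placeholders for your own steps):

import os
import pickle

STATE_FILE = "pipeline_state.pkl"

class PipelineState:
    def __init__(self):
        self.completed = []           # names of steps that have finished
        self.outputs = {}             # data each step produced for later steps

def load_state():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE, "rb") as f:
            return pickle.load(f)
    return PipelineState()

def save_state(state):
    with open(STATE_FILE, "wb") as f:
        pickle.dump(state, f)

state = load_state()
for name, step in [("a", run_a), ("b", run_b), ("c", run_c)]:
    if name in state.completed:
        continue                      # this step already succeeded on a previous run
    state.outputs[name] = step(state.outputs)
    state.completed.append(name)
    save_state(state)                 # checkpoint after every successful step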
Queues and pipelines work very nicely with each other (architecturally speaking). One of the nicer bits is the fact that slower stages can get more workers than faster stages, allowing the pipeline to be optimized based on the workload.

What is the correct workflow for parallelization: running on a cluster or multiproccesses?

I want to call a function similar to parallelize.map(function, args) that returns a list of results while the user stays blind to how the work is actually run. One of the functions I want to parallelize calls out (via subprocess) to another UNIX program that benefits from multiple cores.
I first tried ipython-cluster-helper. This works well with my setup, but I ran into problems installing it on several other machines. I also have to ask for names of clusters during setup. I haven't seen other programs start jobs on clusters for you, so I don't know if that is accepted practice.
joblib seems to be the standard for parallelization, but it can only use the cores of a single machine at a time. This works as well, but is significantly slower because it is not using the cluster.
Also, the server I run this code on complains if a program runs too long, to make sure people use the cluster. If I used joblib, would I have to write another script just to run this program on our cluster?
For now, I added special parameters in setup.py to record cluster names and install ipython-cluster-helper if necessary. When map is called, it first checks whether ipython-cluster-helper and the cluster names are available; if so, it uses them, otherwise it falls back to joblib.
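Roughly, the wrapper I have now looks like this (cluster here is a stand-in for whatever handle ipython-cluster-helper gives back; the joblib branch is the real fallback):

from joblib import Parallel, delayed

def parallel_map(func, args, cluster=None):
    """Map func over args on the cluster when one is configured,
    otherwise fall back to local cores via joblib."""
    if cluster is not None:
        # placeholder: call whatever map-like method your cluster backend exposes
        return cluster.map(func, args)
    return Parallel(n_jobs=-1)(delayed(func)(a) for a in args)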
What are other ways of achieving this? I'm looking for a standard way to do this that will work on most machines, with or without a cluster, so I can release the code and make it easy to use.
Thanks.

python on HPC cluster computer

I asked a question very close to this, but it wasn't answered and since then I hope I have learned to better ask the question.
I was curious as to how to run many serial jobs on a Cray XE6 machine. You usually qsub things with a ccmrun (for a serial job) or an aprun (instead of mpirun or mpiexec). I first wanted to use the Pool() function, but since the machine is not SMP-based hardware it would be limited to 32 processors. Even an mpi4py application of something like a pool wouldn't work, because I am not giving the main program all of the processors: I would be running that script 64 times if I were to say aprun -n 64 mpipool.py, whereas it does work if I do something like aprun -n 1 -d 32 pool.py.
I've had a look at the https://wiki.python.org/moin/ParallelProcessing website and was wondering if anyone had experience running multiple serial jobs on a cluster machine with any of them. I did write an mpi4py code that basically had rank 0 doing all of the job selection and then handing jobs out to the other processors. It didn't want to play nicely on the machine, since I needed to use subprocess to launch the large amount of C code. So, one last caveat is that it would have to play nicely with subprocess.
I would like to have it look at the number of nodes chosen, and then basically do something along the lines of:
ccmrun jobscheduler.py &
ccmrun jobrunner.py 63 & # given that I started the job with 64 processors. I may have to do a bash loop here, but that's no problem.
Once started, I would want them to be able to communicate with one another, but without MPI I'm not sure of an efficient way of doing this. Maybe writing pickle dumps to files, locking them, and deleting them when a jobrunner picks one up. There might be a really simple way of doing this, but I'm very new to it. If anyone could get me started on the right path, I would greatly appreciate it.
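Something along these lines is what I have in mind, completely untested (the queue directory and file naming are made up):

import glob
import os
import pickle

QUEUE_DIR = "job_queue"

def submit(job, jobid):
    # jobscheduler.py drops one pickled job description per file
    with open(os.path.join(QUEUE_DIR, "%06d.pkl" % jobid), "wb") as f:
        pickle.dump(job, f)

def claim_one(worker_id):
    # each jobrunner.py tries to grab the oldest unclaimed job
    for path in sorted(glob.glob(os.path.join(QUEUE_DIR, "*.pkl"))):
        claimed = path + ".claimed.%d" % worker_id
        try:
            os.rename(path, claimed)      # atomic on POSIX, so only one runner wins
        except OSError:
            continue                      # another runner got there first
        with open(claimed, "rb") as f:
            return pickle.load(f)
    return None                           # the queue is empty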
Thanks!
I don't know anything about Cray machines but I'll take a stab at this anyway. I noticed you mentioned qsub which makes me think that the system is using PBS or Torque. Both seem to support Job Arrays which may be along the lines of what you are looking for.
Job Arrays would make the queue system responsible for job management. Each subjob would be assigned an array id out of a range you specify and would be assigned whatever resources you requested with -l. In Torque, '#PBS -l nodes=1' and '#PBS -t 1-64' would create 64 subjobs with indexes from 1 to 64 each being assigned a single node. Man pages and Google will be a good resource and from what I've seen Torque and PBS differ on syntax. If that doesn't work, you can look at using pbsdsh inside of a single, larger job.
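Inside each subjob, your Python script can then pick up its own index from the environment, something like this (run_job is a placeholder for whatever maps an index to one of your serial jobs):

import os

# Torque exports the array index as PBS_ARRAYID (PBS Pro calls it PBS_ARRAY_INDEX)
task_id = int(os.environ.get("PBS_ARRAYID", "0"))
run_job(task_id)     # dispatch to whichever serial job this index maps to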
Also, I want to mention that advice from strangers on the Internet will only take you so far. Your local admin may have limits or scheduling policies in place that may limit your options. You may also be able to get some advice from the admin about other, proven ways to solve your problem.

Python - Working around memory leaks

I have a Python program that runs a series of experiments, with no data intended to be stored from one test to another. My code contains a memory leak which I am completely unable to find (I've looked at the other threads on memory leaks). Due to time constraints, I have had to give up on finding the leak, but if I were able to isolate each experiment, the program would probably run long enough to produce the results I need.
Would running each test in a separate thread help?
Are there any other methods of isolating the effects of a leak?
Detail on the specific situation
My code has two parts: an experiment runner and the actual experiment code.
Although no globals are shared between the code for running all the experiments and the code used by each experiment, some classes/functions are necessarily shared.
The experiment runner isn't just a simple for loop that can be easily put into a shell script. It first decides which tests need to be run given the configuration parameters, then runs the tests, and then outputs the data in a particular way.
I tried manually calling the garbage collector in case the issue was simply that garbage collection wasn't being run, but this did not work
Update
Gnibbler's answer has actually allowed me to find out that my ClosenessCalculation objects, which store all of the data used during each calculation, are not being freed. I then used that to manually delete some references, which seems to have fixed the memory issues.
You can use something like this to help track down memory leaks
>>> from collections import defaultdict
>>> from gc import get_objects
>>> before = defaultdict(int)
>>> after = defaultdict(int)
>>> for i in get_objects():
...     before[type(i)] += 1
...
now suppose the tests leaks some memory
>>> leaked_things = [[x] for x in range(10)]
>>> for i in get_objects():
...     after[type(i)] += 1
...
>>> print [(k, after[k] - before[k]) for k in after if after[k] - before[k]]
[(<type 'list'>, 11)]
11 because we have leaked one list containing 10 more lists
Threads would not help. If you must give up on finding the leak, then the only solution to contain its effect is running a new process once in a while (e.g., when a test has left overall memory consumption too high for your liking -- you can determine VM size easily by reading /proc/self/status in Linux, and other similar approaches on other OS's).
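For the memory check, a minimal sketch on Linux is enough:

def vm_size_kb():
    """Return this process's virtual memory size in kB by parsing /proc/self/status (Linux only)."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmSize:"):
                return int(line.split()[1])   # the value is reported in kB
    return 0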
Make sure the overall script takes an optional parameter to tell it what test number (or other test identification) to start from, so that when one instance of the script decides it's taking up too much memory, it can tell its successor where to restart from.
Or, more solidly, make sure that as each test is completed its identification is appended to some file with a well-known name. When the program starts it begins by reading that file and thus knows what tests have already been run. This architecture is more solid because it also covers the case where the program crashes during a test; of course, to fully automate recovery from such crashes, you'll want a separate watchdog program and process to be in charge of starting a fresh instance of the test program when it determines the previous one has crashed (it could use subprocess for the purpose -- it also needs a way to tell when the sequence is finished, e.g. a normal exit from the test program could mean the sequence is done, while any crash or exit with a status != 0 would signal the need to start a new fresh instance).
If these architectures appeal but you need further help implementing them, just comment to this answer and I'll be happy to supply example code -- I don't want to do it "preemptively" in case there are as-yet-unexpressed issues that make the architectures unsuitable for you. (It might also help to know what platforms you need to run on).
I had the same problem with a third-party C library which was leaking. The cleanest work-around that I could think of was to fork and wait. The advantage is that you don't even have to create a separate process after each run; you can define the size of your batch.
Here's a general solution (if you ever find the leak, the only change you need to make is to change run() to call run_single_process() instead of run_forked() and you'll be done):
import os, sys

batchSize = 20

class Runner(object):
    def __init__(self, dataFeedGenerator, dataProcessor):
        self._dataFeed = dataFeedGenerator
        self._caller = dataProcessor

    def run(self):
        self.run_forked()

    def run_forked(self):
        dataFeed = self._dataFeed
        dataSubFeed = []
        for i, dataMorsel in enumerate(dataFeed, 1):
            dataSubFeed.append(dataMorsel)
            if i % batchSize == 0:              # a full batch is ready
                self._dataFeed = dataSubFeed
                self.fork()
                dataSubFeed = []
                if self._child_pid == 0:        # child: process this batch
                    self.run_single_process()
                self.endBatch()
        if dataSubFeed:                         # handle the final, partial batch
            self._dataFeed = dataSubFeed
            self.fork()
            if self._child_pid == 0:
                self.run_single_process()
            self.endBatch()

    def run_single_process(self):
        for dataMorsel in self._dataFeed:
            self._caller(dataMorsel)

    def fork(self):
        self._child_pid = os.fork()

    def endBatch(self):
        if self._child_pid != 0:
            os.waitpid(self._child_pid, 0)      # parent waits for the child's batch to finish
        else:
            sys.exit()                          # exit from the child when its batch is done
This isolates the memory leak to the child processes, and each child never leaks more than batchSize iterations' worth of memory before it exits.
I would simply refactor the experiments into individual functions (if they aren't like that already), then accept an experiment number from the command line and call just that one experiment function.
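On the Python side that might be as simple as (experiment_one and experiment_two are placeholders for your own functions):

import sys

EXPERIMENTS = {1: experiment_one, 2: experiment_two}   # map numbers to your experiment functions

if __name__ == "__main__":
    expnum = int(sys.argv[1])
    EXPERIMENTS[expnum]()        # run just this one experiment, then let the process exit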
Then just bodge up a shell script as follows:
#!/bin/bash
for expnum in 1 2 3 4 5 6 7 8 9 10 11 ; do
python yourProgram ${expnum} otherParams
done
That way, you can leave most of your code as-is and this will clear out any memory leaks you think you have in between each experiment.
Of course, the best solution is always to find and fix the root cause of a problem but, as you've already stated, that's not an option for you.
Although it's hard to imagine a memory leak in Python, I'll take your word on that one - you may want to at least consider the possibility that you're mistaken there, however. Consider raising that in a separate question, something that we can work on at low priority (as opposed to this quick-fix version).
Update: Making this community wiki since the question has changed somewhat from the original. I'd delete the answer but for the fact I still think it's useful - you could do the same with your experiment runner as I proposed with the bash script; you just need to ensure that the experiments are separate processes so that memory leaks don't persist between experiments (if the memory leaks are in the runner, you're going to have to do root cause analysis and fix the bug properly).
