python on HPC cluster computer - python

I asked a question very close to this, but it wasn't answered and since then I hope I have learned to better ask the question.
I was curious as to how run many jobs serially on a Cray XE6 machine. You usually qsub things with a ccmrun (for a serial job) or an aprun (instead of mpirun or mpiexec). I first wanted to use the Pool() function, but due to it not being SMP based hardware it would be limited to 32 processors. Even an mpi4py application of something like a pool wouldn't work, because I am not giving the main program all of the processors. I would be running that script 64 times if I were to say aprun -n 64 mpipool.py, whereas it does work if I do something like aprun -n 1 -d 32 pool.py.
I've had a look at the https://wiki.python.org/moin/ParallelProcessing website and was wondering if anyone had any experience running multiple serial jobs on a cluster computing machine with any of them. I did write an mpi4py code that basically had rank 0 doing all of the job selection, and then giving them out to the the other processors. It didn't want to play nice on the machine since I needed to use subprocess to launch the giant amount of C code. So, one last caveat is that it would have to play nice with subprocess.
I would like to have it look at the amount of nodes chosen, and then basically do something along the lines of:
ccmrun jobscheduler.py &
ccmrun jobrunner.py 63 & # given that I started the job with 64 processors. I may have to do a bash loop here, but that's no problem.
Once started I would want them to be able to communicate between one another, but without MPI I'm not sure of an efficient way of doing this. If anyone could get me started on the right path I would greatly appreciate it. Maybe doing pickle dumps and locking them and deleting them when a jobrunner picks it up. There might be a really simple way of doing this, but I'm very new to this.
Thanks!

I don't know anything about Cray machines but I'll take a stab at this anyway. I noticed you mentioned qsub which makes me think that the system is using PBS or Torque. Both seem to support Job Arrays which may be along the lines of what you are looking for.
Job Arrays would make the queue system responsible for job management. Each subjob would be assigned an array id out of a range you specify and would be assigned whatever resources you requested with -l. In Torque, '#PBS -l nodes=1' and '#PBS -t 1-64' would create 64 subjobs with indexes from 1 to 64 each being assigned a single node. Man pages and Google will be a good resource and from what I've seen Torque and PBS differ on syntax. If that doesn't work, you can look at using pbsdsh inside of a single, larger job.
Also, I want to mention that advice from strangers on the Internet will only take you so far. Your local admin may have limits or scheduling policies in place that you may limit your options. You may also be able to get some advice from the admin about other, proven ways that you can solve your problem.

Related

Python use slurm for multiprocessing

I want to run a simple task using multiprocessing (I think this one is the same as using parfor in matlab correct?)
For example:
from multiprocessing import Pool
def func_sq(i):
fig=plt.plot(x[i,:]) #x is a ready-to-use large ndarray, just want
fig.save(....) #to plot each column on a separate figure
pool = Pool()
pool.map(func_sq,[1,2,3,4,5,6,7,8])
But I am very confused of how to use slurm to submit my job. I have been searching for answers but could not find a good one.
Currently, while not using multiprocessing, I am using slurm job sumit file like this:(named test1.sh)
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -p batch
#SBATCH --exclusive
module load anaconda3
source activate py36
srun python test1.py
Then, I type in sbatch test1.sh in my prompt window.
So if I would like to use the multiprocessing, how should I modify my sh file? I have tried by myself but it seems just changing my -n to 16 and Pool(16) makes my job repeat 16 times.
Or is there a way to maximize my performance if multiprocessing is not suitable (I have heard about multithreating but don't know how it exactly works)
And how do I effectively use my memory so that it won't crush? (My x matrix is very large)
Thanks!
For the GPU, is that possible to do the same thing? My current submission script without multiprocessing is:
#!/bin/bash
#SBATCH -n 1
#SBATCH -p gpu
#SBATCH --gres=gpu:1
Thanks a lot!
The "-n" flag is setting the number of tasks your sbatch submission is going to execute, which is why your script is running multiple times. What you want to change is the "-c" argument which is how many CPUs each task is assigned. Your script spawns additional processes but they will be children of the parent process and share the CPUs assigned to it. Just add "#SBATCH -c 16" to your script. As for memory, there is a default amount of memory your job will be given per CPU, so increasing the number of CPUs will also increase the amount of memory available. If you're not getting enough, add something like "#SBATCH --mem=20000M" to request a specific amount.
I don't mean to be unwelcoming here, but this question seems to indicate that you don't actually understand the tools you're using here. Python Multiprocessing allows a single Python program to launch child processes to help it perform work in parallel. This is particularly helpful because multithreading (which is commonly how you'd accomplish this in other programming languages) doesn't gain you parallel code execution in Python, due to Python's Global Interpreter Lock.
Slurm (which I don't use, but from some quick research) seems to be a fairly high-level utility that allows individuals to submit work to some sort of cluster of computers (or a supercomputer... usually similar concepts). It has no visibility, per se, into how the program it launches runs; that is, it has no relationship to the fact that your Python program proceeds to launch 16 (or however many) helper processes. Its job is to schedule your Python program to run as a black box, then sit back and make sure it finishes successfully.
You seem to have some vague data processing problem. You describe it as a large matrix, but you don't give nearly enough information for me to actually understand what you're trying to accomplish. Regardless, if you don't actually understand what you're doing and how the tools you're using work, you're just flailing until you maybe eventually get lucky enough for this to work. Stop guessing, figure out what these tools do, look around and read documentation, then figure out what you're trying to accomplish and how you could go about splitting up the work in a reasonable fashion.
Here's my best guess, but I really have very little information to work from so it may not be helpful at all:
Your Python script has no concept that it's being run multiple times by Slurm (the -n 16 you refer to, I guess). It makes sense, then, that the job gets repeated 16 times, because Slurm runs the entire script 16 times, and each time your Python script does the entire task from start to finish. If you want Slurm and your Python program to interact, so that the Python program expects to get run multiple times in parallel, I have no idea how to help you there, you'll just need to read more into Slurm.
Your data must be able to be read incrementally, or partially, if you have any hope of breaking this job into pieces. That is, if you can only read the entire matrix all at once, or not at all, you're stuck with solutions that begin by reading the entire matrix into memory, which you indicate is not really an option. Assuming you can, and that you want to perform some work on each row independently, then you're fortunate enough for your task to be what's officially known as "embarrassingly parallel". This is a very good thing, if true.
Assuming your problem is embarrassingly parallel (since it looks like you're just trying to load each row of your data matrix, plot it somehow, then save off that plot as an image to disk), and you can load your data incrementally, then continue reading up on Python's multiprocessing module, and Pool().map is probably the right direction to be headed in. Create some Python generator that produces rows of your data matrix, then pass that generator and func_sq to pool.map, and sit back and wait for the job to finish.
If you really need to do this work across multiple machines, rather than hacking your own Slurm + Multiprocessing stack, I'd suggest you start using actual data processing tools, such as PySpark.
This doesn't sound like a trivial problem, and even if it were, you don't give sufficient details for me to provide a robust answer. There's no "just fix this one line" answer to what you've asked, but I hope this helps give you an idea of what your tools are doing and how to proceed from here.

What is the correct workflow for parallelization: running on a cluster or multiproccesses?

I want to call a function similar to parallelize.map(function, args) that returns a list of results and the user is blind to the actual process. One of the functions I want to parallelize calls subprocess to another unix program that benefits from multiple cores.
I first tried ipython-cluster-helper. This works well with my setup, but I ran into problems installing it on several other machines. I also have to ask for names of clusters during setup. I haven't seen other programs start jobs on clusters for you, so I don't know if that is accepted practice.
joblib seems to be the standard for parallelization, but it can only use one cluster or computer at a time. This works as well, but is significantly slower because it is not using the cluster.
Also, the server I am running this code on complains if a program has run too long to ensure that people use the cluster. Do I write another script to run this program only on our cluster -- if I used joblib?
For now, I added special parameters in setup.py to add cluster names and install ipython-cluster-helper if necessary. And when map is called, it first checks if ipython-cluster-helper and the cluster names are available, use them, else use joblib.
What are others ways of achieving this? I'm looking for a standard way to do this that will work on most machines with or with out a cluster, so I can release the code and make it easy to use.
Thanks.

Utility to manage multiple python scripts

I saw this post on Medium, and wondered how one might go about managing multiple python scripts.
How I Hacked Amazon's Wifi Button
This describes a system where you need to run one or more scripts continuously to catch and react to events in your network.
My question: Let's say I had multiple python scripts that I wanted to do run while I work on other things. What approaches are available to manage these scripts? I have to imagine there is a better way than having a large number of terminal windows running each script individually.
I am coming back to python, and have no formal training in computer programming, so any guidance you can provide will be greatly appreciated.
Let's say I had multiple python scripts that I wanted to do run. What
approaches are available to manage these scripts? I have to imagine
there is a better way than having a large number of terminal windows
running each script individually.
If you have several .py files in a directory that you want to run, without having a specific order, you can do:
import glob
pyFiles = glob.glob('path/*.py')
for pyFile in pyFiles:
execfile(pyFile)
Your system already runs a large number of background processes, with output to the system log or occasionally to a service-specific log file.
A common arrangement for quick and dirty deployments -- where you don't necessarily want to invest in making the scripts robust and well-behaved enough to run as proper services -- is to start the script inside screen or tmux. You can detach when you don't need to be looking at it, and can reattach at any time -- even from a remote login -- to view the output, or to troubleshoot.
Take a look at luigi (I've not used it).
https://github.com/spotify/luigi
These days (five years after the question was asked) a lot of people use docker compose. But that's a little heavy weight depending on what you want to do.
I just saw today the script server of bugy. Maybe it might be a solution for you or somebody else.
(I am just trying to find a tampermonkey script structure for python..)

MPI Newbie - some questions on how 'mpirun' works and process management

First of all, I'm not a programmer by profession but I have to program a code for my project (I have some proficiency in C++ and python though). I came in here often when I got stuck, and most of the time got nice solutions from here, but now I have essential questions on MPI programming, else I couldn't really proceed until I know the concepts of it.
Here is my description of problems,
I would like to create a code for an algorithm for a scientific calculation. The code can be divided into 2 parts.
A.) Matrix-vector multiplication and inversion of a matrix. This part is relatively straightforward and I even have my own working MPI code for this part
B.) Calling an external MPI-ready program for a more complicated calculation (This part should also be simple, as it's simply calling a UNIX command line).
The problem I'm having is how to join these 2 parts together? My algorithm goes like this,
for k in specified range
dividing a state vector of size 6NMx1 into M blocks, let each of M nodes handle these.
Manipulate a state vector of size 6NMx1 according to A.) in parallel
After A.) is done, run B.) using M nodes in parallel /* THIS IS WHERE I GOT STUCK */
Update state vector
end for
To run B.), I have to use mpirun to call a UNIX command,
mpirun -np #PPN my_app > some_output
The questions I have are,
How does 'mpirun' actually work? Does it spawn new processes upon calling?
Let's say if I'm using M cluster compute nodes, and each one has 16 processors per node, if I use only 1 process of the node to call the above UNIX command, will it generate 16 more processes? If so, I will end up with 256M processes running in total, am I correct?
My main goal is to use each compute node handles a block in the system vector (blocks are independent, with size 6Nx1), and will use numbers from each block as an input for B.) I'm working with clusters, so when I submit my job, I have to define number of nodes beforehand, and I strictly want each node to also run B.) in parallel after A.) is done. Are there any suggestions on how to do this using MPI? Some people told me to write codes separately for A.) and B.), and use a python script to control them at the top layer, so it should look like..
Python script:
for k in specified range
mpirun A.) --> This is straightforward for me
mpirun B.)
end for
Pseudocode for B.)
/* THIS PROGRAM SHOULD HAVE 16M PROCESSES */
if rank % 16 == 0
mpirun -np 16 my_app > output
end if
/* I WANT M CALLS TO THIS PROGRAM IN PARALLEL */
MPI_COMM.BARRIER
Do you think this scheme will use 16M process in parallel for B.)? If there are better ways to implement B.) than this, or even better, to wrap it in the same code as A.), please suggest me!
3.) This is my prototype code, so I don't really care about efficiency. I just need it to work, and I will care about optimization later on.
If my descriptions are confusing, please ask me and I will come back and clarify. Thank you for your time, and I really appreciate your helps! :)
Mpirun is just a command that runs the number of jobs that you required in the option command line, it will probably detect what kind of machine you have and work with.
It's complicated to answer to the second question because if you are using a cluster with multiple nodes you should probably work with a dedicated protocol. For example, with slurm you run your program through a sbatch protocol which is like:
// number of proc on one node
#SBATCH -n 2
// number of node
#SBATCH -N 4
run ./a.out
That means you will run your program on 4 nodes with 2 procs on each nodes.
I don't really know about the following cause it's a lil bit confuse for me but maybe you should reconsider your problem with something else. You don't need MPI if you are working within a node but you should use openMP.
MPI is needed if you are working with a non-shared memory, within a node that's not the case.
I hope it can help you in your work.

How can I profile a multithreaded program?

I have a program that is performing waaaay under par, and I would like to profile it. However, it is multithreaded, so I can't seem to find a good way to profile this thing. Any advice?
I've tried yappi, but it segfaults on OS X :(
EDIT: This is in python, sorry for putting it under profiling...
Are you multithreading or multiprocessing? If you are just multithreading, then that is the problem. Python currently has problems with multithreading on a multiprocessor system because of the Global Interpreter Lock (GIL). They are working on fixing it for Python 3.2 - at least so that your program will run as fast on a single core as on multiple cores.
If you aren't convinced take a look at the shootout results for the thread-ring program. Running with a single core is faster than running with quad cores.
Now, if you use multiprocessing instead, profiling can be difficult as well because then you have to run CProfiler from each separate process. There are some questions that point you in the right direction though.
Depending on how far you've come in your troubleshooting, there are some tools that might point you in the right direction.
"top" is a helpful start to show you if your problem is burning CPU time or simply waiting for stuff.
"dtruss -c" can show you where you spend time and what system calls takes most of your time.
Both these can give you a hint without knowing anything about python.
If you just want to use yappi, it isn't too much work to set up a virtualbox and install some sort of Linux on your machine. I find myself doing that from time to time when I want to try something.
There might of course be things I don't know about that makes it impossible or not worth the effort. Also, profiling on another OS running virtualized might not give the exact same results, but it might still be helpful.

Categories

Resources