Python use slurm for multiprocessing

I want to run a simple task using multiprocessing (I think this is the same as using parfor in MATLAB, correct?)
For example:
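from multiprocessing import Pool
import matplotlib.pyplot as plt

def func_sq(i):
    fig, ax = plt.subplots()
    ax.plot(x[i, :])   # x is a ready-to-use large ndarray, just want
    fig.savefig(....)  # to plot each column on a separate figure
    plt.close(fig)     # release the figure once it is saved

pool = Pool()
pool.map(func_sq, [1, 2, 3, 4, 5, 6, 7, 8])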
But I am very confused about how to use slurm to submit my job. I have been searching for answers but could not find a good one.
Currently, without multiprocessing, I am using a slurm job submit file like this (named test1.sh):
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -p batch
#SBATCH --exclusive
module load anaconda3
source activate py36
srun python test1.py
Then, I type in sbatch test1.sh in my prompt window.
So if I would like to use multiprocessing, how should I modify my sh file? I have tried by myself, but it seems that just changing my -n to 16 and using Pool(16) makes my job repeat 16 times.
Or is there a way to maximize my performance if multiprocessing is not suitable? (I have heard about multithreading but don't know exactly how it works.)
And how do I use my memory effectively so that the job won't crash? (My x matrix is very large.)
Thanks!
For the GPU, is it possible to do the same thing? My current submission script without multiprocessing is:
#!/bin/bash
#SBATCH -n 1
#SBATCH -p gpu
#SBATCH --gres=gpu:1
Thanks a lot!

The "-n" flag sets the number of tasks your sbatch submission is going to execute, and srun launches one copy of your script per task, which is why your script is running multiple times. What you want to change is the "-c" argument (--cpus-per-task), which is how many CPUs each task is assigned. Your script spawns additional processes, but they will be children of the parent process and share the CPUs assigned to it. Keep "-n 1" and just add "#SBATCH -c 16" to your script. As for memory, clusters are usually configured with a default amount of memory per CPU, so increasing the number of CPUs will typically also increase the amount of memory available. If you're not getting enough, add something like "#SBATCH --mem=20000M" to request a specific amount.
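As a minimal sketch of how the Python side can follow the "-c" setting: when a job is submitted with "-c", Slurm exports the per-task CPU count in the SLURM_CPUS_PER_TASK environment variable, so the pool size does not have to be hard-coded. The helper name worker_count and the toy function square are made up for illustration; they are not part of the original script.

```python
import os
from multiprocessing import Pool


def worker_count(default=1):
    """Return the CPU count Slurm granted this task, or a fallback.

    SLURM_CPUS_PER_TASK is only set when the job was submitted with -c,
    so fall back to `default` when running outside Slurm.
    """
    value = os.environ.get("SLURM_CPUS_PER_TASK")
    return int(value) if value else default


def square(i):
    # Stand-in for the real per-item work (e.g. plotting one row).
    return i * i


if __name__ == "__main__":
    # One worker per CPU assigned to this task by "#SBATCH -c N".
    with Pool(processes=worker_count()) as pool:
        print(pool.map(square, range(8)))
```

Sizing the pool this way keeps the batch script and the Python code in agreement: changing "-c" is enough, and the script still runs with a single worker on a laptop.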

I don't mean to be unwelcoming here, but this question seems to indicate that you don't actually understand the tools you're using. Python multiprocessing allows a single Python program to launch child processes to help it perform work in parallel. This is particularly helpful because multithreading (which is commonly how you'd accomplish this in other programming languages) doesn't gain you parallel execution of CPU-bound Python code, due to Python's Global Interpreter Lock.
Slurm (which I don't use, so this is based on some quick research) seems to be a fairly high-level utility that allows individuals to submit work to some sort of cluster of computers (or a supercomputer... usually similar concepts). It has no visibility, per se, into how the program it launches runs; that is, it has no relationship to the fact that your Python program proceeds to launch 16 (or however many) helper processes. Its job is to schedule your Python program to run as a black box, then sit back and make sure it finishes successfully.
You seem to have some vague data processing problem. You describe it as a large matrix, but you don't give nearly enough information for me to actually understand what you're trying to accomplish. Regardless, if you don't actually understand what you're doing and how the tools you're using work, you're just flailing until you maybe eventually get lucky enough for this to work. Stop guessing, figure out what these tools do, look around and read documentation, then figure out what you're trying to accomplish and how you could go about splitting up the work in a reasonable fashion.
Here's my best guess, but I really have very little information to work from so it may not be helpful at all:
Your Python script has no concept that it's being run multiple times by Slurm (the -n 16 you refer to, I guess). It makes sense, then, that the job gets repeated 16 times, because Slurm runs the entire script 16 times, and each time your Python script does the entire task from start to finish. If you want Slurm and your Python program to interact, so that the Python program expects to get run multiple times in parallel, I have no idea how to help you there, you'll just need to read more into Slurm.
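For what it's worth, one way a script can cooperate with being launched several times is to read the environment variables that srun sets for each task: SLURM_PROCID (the task's rank, starting at 0) and SLURM_NTASKS (the total number of tasks). The sketch below is only an illustration under that assumption; the function name my_share is hypothetical, and it splits the work round-robin so each copy handles a different slice instead of repeating everything.

```python
import os


def my_share(items, rank=None, ntasks=None):
    """Return the subset of `items` this Slurm task should process.

    rank / ntasks default to the values srun exports; outside Slurm the
    script behaves as a single task that owns every item.
    """
    if rank is None:
        rank = int(os.environ.get("SLURM_PROCID", 0))
    if ntasks is None:
        ntasks = int(os.environ.get("SLURM_NTASKS", 1))
    # Round-robin split: task 0 gets items 0, n, 2n, ...; task 1 gets 1, n+1, ...
    return items[rank::ntasks]


if __name__ == "__main__":
    rows = list(range(8))
    print("this task handles rows", my_share(rows))
```

With "-n 4", the four copies would process rows [0, 4], [1, 5], [2, 6] and [3, 7] respectively.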
Your data must be able to be read incrementally, or partially, if you have any hope of breaking this job into pieces. That is, if you can only read the entire matrix all at once, or not at all, you're stuck with solutions that begin by reading the entire matrix into memory, which you indicate is not really an option. Assuming you can, and that you want to perform some work on each row independently, then you're fortunate enough for your task to be what's officially known as "embarrassingly parallel". This is a very good thing, if true.
Assuming your problem is embarrassingly parallel (since it looks like you're just trying to load each row of your data matrix, plot it somehow, then save off that plot as an image to disk), and you can load your data incrementally, then continue reading up on Python's multiprocessing module, and Pool().map is probably the right direction to be headed in. Create some Python generator that produces rows of your data matrix, then pass that generator and func_sq to pool.map, and sit back and wait for the job to finish.
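A rough sketch of that generator idea, under some assumptions: the data is taken to live in a whitespace-separated text file (a hypothetical format, since the question does not say how x is stored), and process_row just sums the row as a stand-in for the plotting. One caveat worth knowing: in CPython, pool.map turns a generator into a full list before dispatching work, so the sketch uses pool.imap, which pulls rows from the generator as it goes.

```python
import os
import tempfile
from multiprocessing import Pool


def iter_rows(path):
    """Yield one row at a time so the whole matrix is never loaded at once."""
    with open(path) as handle:
        for line in handle:
            yield [float(value) for value in line.split()]


def process_row(row):
    # Placeholder for the real work (plotting the row and saving the figure).
    return sum(row)


if __name__ == "__main__":
    # Build a tiny demo file; a real job would point at the existing data.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("1 2 3\n4 5 6\n7 8 9\n")
    try:
        with Pool(processes=2) as pool:
            # imap returns results in input order.
            for result in pool.imap(process_row, iter_rows(tmp.name)):
                print(result)
    finally:
        os.remove(tmp.name)
```

Each row is pickled and sent to a worker, so for very wide rows it may be cheaper to send only the row index and let the worker read its own row.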
If you really need to do this work across multiple machines, rather than hacking your own Slurm + Multiprocessing stack, I'd suggest you start using actual data processing tools, such as PySpark.
This doesn't sound like a trivial problem, and even if it were, you don't give sufficient details for me to provide a robust answer. There's no "just fix this one line" answer to what you've asked, but I hope this helps give you an idea of what your tools are doing and how to proceed from here.

Related

What is the correct workflow for parallelization: running on a cluster or multiprocessing?

I want to call a function similar to parallelize.map(function, args) that returns a list of results and the user is blind to the actual process. One of the functions I want to parallelize calls subprocess to another unix program that benefits from multiple cores.
I first tried ipython-cluster-helper. This works well with my setup, but I ran into problems installing it on several other machines. I also have to ask for names of clusters during setup. I haven't seen other programs start jobs on clusters for you, so I don't know if that is accepted practice.
joblib seems to be the standard for parallelization, but it can only use one cluster or computer at a time. This works as well, but is significantly slower because it is not using the cluster.
Also, the server I am running this code on complains if a program has run too long to ensure that people use the cluster. Do I write another script to run this program only on our cluster -- if I used joblib?
For now, I added special parameters in setup.py to add cluster names and install ipython-cluster-helper if necessary. When map is called, it first checks whether ipython-cluster-helper and the cluster names are available; if so, it uses them, otherwise it falls back to joblib.
What are other ways of achieving this? I'm looking for a standard way to do this that will work on most machines with or without a cluster, so I can release the code and make it easy to use.
Thanks.

python on HPC cluster computer

I asked a question very close to this, but it wasn't answered and since then I hope I have learned to better ask the question.
I was curious as to how to run many jobs serially on a Cray XE6 machine. You usually qsub things with a ccmrun (for a serial job) or an aprun (instead of mpirun or mpiexec). I first wanted to use the Pool() function, but because the hardware is not SMP-based, it would be limited to 32 processors. Even an mpi4py application of something like a pool wouldn't work, because I am not giving the main program all of the processors. I would be running that script 64 times if I were to say aprun -n 64 mpipool.py, whereas it does work if I do something like aprun -n 1 -d 32 pool.py.
I've had a look at the https://wiki.python.org/moin/ParallelProcessing website and was wondering if anyone had any experience running multiple serial jobs on a cluster computing machine with any of them. I did write an mpi4py code that basically had rank 0 doing all of the job selection and then handing the jobs out to the other processors. It didn't want to play nice on the machine, since I needed to use subprocess to launch the giant amount of C code. So, one last caveat is that it would have to play nice with subprocess.
I would like to have it look at the amount of nodes chosen, and then basically do something along the lines of:
ccmrun jobscheduler.py &
ccmrun jobrunner.py 63 & # given that I started the job with 64 processors. I may have to do a bash loop here, but that's no problem.
Once started I would want them to be able to communicate between one another, but without MPI I'm not sure of an efficient way of doing this. If anyone could get me started on the right path I would greatly appreciate it. Maybe doing pickle dumps and locking them and deleting them when a jobrunner picks it up. There might be a really simple way of doing this, but I'm very new to this.
Thanks!

I don't know anything about Cray machines but I'll take a stab at this anyway. I noticed you mentioned qsub which makes me think that the system is using PBS or Torque. Both seem to support Job Arrays which may be along the lines of what you are looking for.
Job Arrays would make the queue system responsible for job management. Each subjob would be assigned an array id out of a range you specify and would be assigned whatever resources you requested with -l. In Torque, '#PBS -l nodes=1' and '#PBS -t 1-64' would create 64 subjobs with indexes from 1 to 64 each being assigned a single node. Man pages and Google will be a good resource and from what I've seen Torque and PBS differ on syntax. If that doesn't work, you can look at using pbsdsh inside of a single, larger job.
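To make the job-array idea a bit more concrete, here is a small sketch of how each subjob's Python script could find out which index it was given. It assumes the scheduler exports the index in an environment variable: PBS_ARRAYID under Torque and PBS_ARRAY_INDEX under PBS Pro (check your site's documentation, since this differs between schedulers). The function names and the job list are made up for illustration.

```python
import os


def array_index(default=1):
    """Return this subjob's array index, or `default` outside a job array."""
    # Torque uses PBS_ARRAYID; PBS Pro uses PBS_ARRAY_INDEX.
    for name in ("PBS_ARRAYID", "PBS_ARRAY_INDEX"):
        value = os.environ.get(name)
        if value is not None:
            return int(value)
    return default


def pick_job(jobs, index):
    """Map a 1-based array index (as in '-t 1-64') onto a list of jobs."""
    return jobs[index - 1]


if __name__ == "__main__":
    jobs = ["input_%02d.dat" % n for n in range(1, 65)]  # hypothetical inputs
    print("running", pick_job(jobs, array_index()))
```

Because every subjob is independent, no communication between them is needed; the queue system does the job selection that rank 0 was doing in the mpi4py version.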
Also, I want to mention that advice from strangers on the Internet will only take you so far. Your local admin may have limits or scheduling policies in place that limit your options. You may also be able to get some advice from the admin about other, proven ways to solve your problem.

MPI Newbie - some questions on how 'mpirun' works and process management

First of all, I'm not a programmer by profession, but I have to write code for my project (I have some proficiency in C++ and Python, though). I have come here often when I got stuck and usually found good solutions, but now I have some fundamental questions about MPI programming, and I can't really proceed until I understand the concepts.
Here is my description of problems,
I would like to create a code for an algorithm for a scientific calculation. The code can be divided into 2 parts.
A.) Matrix-vector multiplication and inversion of a matrix. This part is relatively straightforward and I even have my own working MPI code for this part
B.) Calling an external MPI-ready program for a more complicated calculation (This part should also be simple, as it's simply calling a UNIX command line).
The problem I'm having is how to join these 2 parts together? My algorithm goes like this,
for k in specified range
    dividing a state vector of size 6NMx1 into M blocks, let each of M nodes handle these.
    Manipulate a state vector of size 6NMx1 according to A.) in parallel
    After A.) is done, run B.) using M nodes in parallel /* THIS IS WHERE I GOT STUCK */
    Update state vector
end for
To run B.), I have to use mpirun to call a UNIX command,
mpirun -np #PPN my_app > some_output
The questions I have are,
1.) How does 'mpirun' actually work? Does it spawn new processes upon calling?
2.) Let's say I'm using M cluster compute nodes, each with 16 processors. If I use only 1 process on each node to call the above UNIX command, will it generate 16 more processes? If so, I will end up with 256M processes running in total, am I correct?
My main goal is to have each compute node handle one block of the system vector (blocks are independent, with size 6Nx1) and use the numbers from each block as input for B.). I'm working with clusters, so when I submit my job, I have to define the number of nodes beforehand, and I strictly want each node to also run B.) in parallel after A.) is done. Are there any suggestions on how to do this using MPI? Some people told me to write the code for A.) and B.) separately, and use a Python script to control them at the top layer, so it should look like:
Python script:
for k in specified range
    mpirun A.) --> This is straightforward for me
    mpirun B.)
end for
Pseudocode for B.)
/* THIS PROGRAM SHOULD HAVE 16M PROCESSES */
if rank % 16 == 0
    mpirun -np 16 my_app > output
end if
/* I WANT M CALLS TO THIS PROGRAM IN PARALLEL */
MPI_COMM.BARRIER
Do you think this scheme will use 16M processes in parallel for B.)? If there are better ways to implement B.) than this, or even better, to wrap it in the same code as A.), please let me know!
3.) This is my prototype code, so I don't really care about efficiency. I just need it to work, and I will care about optimization later on.
If my descriptions are confusing, please ask and I will come back and clarify. Thank you for your time, and I really appreciate your help! :)

Mpirun is just a command that launches the number of processes you request with its command-line options; it will usually detect what kind of machine you have and work with it.
The second question is harder to answer, because if you are using a cluster with multiple nodes you should probably go through its job scheduler. For example, with slurm you run your program through an sbatch script, which looks like:
# number of tasks (processes) on each node
#SBATCH --ntasks-per-node=2
# number of nodes
#SBATCH -N 4
srun ./a.out
That means you will run your program on 4 nodes with 2 processes on each node.
I can't really comment on the rest, because it is a little confusing to me, but maybe you should reconsider how you approach the problem. You don't need MPI if you are working within a single node; you can use OpenMP instead.
MPI is needed when the processes do not share memory, which is not the case within a single node.
I hope this helps you in your work.
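The top-layer Python driver suggested in the question could be sketched roughly as follows. This is only an illustration under stated assumptions: the program names ./part_a and my_app, the process counts, and the helper names are placeholders, and mpirun must be on the PATH inside the job. The relevant point is that subprocess.run blocks until the launched mpirun has finished, so B.) only starts after A.) is done.

```python
import subprocess


def mpirun_command(nprocs, program, *args):
    """Build the argument list for one mpirun invocation."""
    return ["mpirun", "-np", str(nprocs), program, *args]


def run_iterations(steps, commands, runner=subprocess.run):
    """Run each command in order, `steps` times.

    `runner` defaults to subprocess.run, which waits for the command to
    exit; check=True makes a failing step raise instead of being ignored.
    """
    for _ in range(steps):
        for command in commands:
            runner(command, check=True)


if __name__ == "__main__":
    # Hypothetical programs and sizes; printed rather than executed here.
    part_a = mpirun_command(64, "./part_a")
    part_b = mpirun_command(16, "my_app")
    print(part_a)
    print(part_b)
    # run_iterations(10, [part_a, part_b])  # uncomment inside a real job
```

This keeps all process management in mpirun and the scheduler, rather than calling mpirun from inside an already running MPI program, which many MPI implementations do not support well.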

Optimal way of matlab to python communication

So I am working on a Matlab application that has to do some communication with a Python script. The script that is called is a simple client software. As a side note, if it would be possible to have a Matlab client and a Python server communicating this would solve this issue completely but I haven't found a way to work that out.
Anyhow, after searching the web I have found two ways to call Python scripts: either with the system() command or by editing the perl.m file to call Python scripts instead. Both ways are too slow, though (timing them with tic/toc gives > 20 ms, and the call must complete in under 6 ms), as this call will be in a loop that is very time sensitive.
As a solution, I figured I could instead save a file at a certain location and have my Python script continuously check for this file and, when it finds it, execute the command I want. After timing each of these steps and summing them up, I found this to be dramatically faster (almost 100x, so certainly fast enough), and I can't really believe that, or rather I can't understand why calling Python scripts is so slow (not that I have more than a superficial knowledge of the subject). I also found this solution to be really messy and ugly, so I just wanted to check: first, is it a good idea, and second, is there a better one?
Finally, I realize that the Python time.time() and Matlab tic, toc might not be precise enough to measure time correctly on that scale so also a reason why I ask.

Spinning up new instances of the Python interpreter takes a while. If you spin up the interpreter once, and reuse it, this cost is paid only once, rather than for every run.
This is normal (expected) behaviour, since startup includes large numbers of allocations and imports. For example, on my machine, the startup time is:
$ time python -c 'import sys'
real 0m0.034s
user 0m0.022s
sys 0m0.011s
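A minimal sketch of the "start the interpreter once and reuse it" idea: a long-running Python process that reads one command per line and answers immediately, so the startup cost above is paid a single time. The function names, the "quit" command, and the line-based protocol are assumptions made for illustration; how the Matlab side connects to it (pipe, socket, or otherwise) is left open.

```python
import sys


def handle(command):
    """Hypothetical request handler; replace with the real client logic."""
    return command.upper()


def serve(stream_in=None, stream_out=None):
    """Answer one command per input line until 'quit' or end of input."""
    stream_in = sys.stdin if stream_in is None else stream_in
    stream_out = sys.stdout if stream_out is None else stream_out
    for line in stream_in:
        command = line.strip()
        if command == "quit":
            break
        stream_out.write(handle(command) + "\n")
        stream_out.flush()  # reply right away instead of buffering
```

Calling serve() at the bottom of a script turns it into such a loop over standard input; each request then costs only the time of handle() rather than a fresh interpreter start.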

Speed up feedparser

I'm using feedparser to print the top 5 Google news titles. I get all the information from the URL the same way as always.
import feedparser as fp
x = 'https://news.google.com/news/feeds?pz=1&cf=all&ned=us&hl=en&topic=t&output=rss'
feed = fp.parse(x)
My problem is that I'm running this script when I start a shell, so that ~2 second lag gets quite annoying. Is this time delay primarily from communications through the network, or is it from parsing the file?
If it's from parsing the file, is there a way to only take what I need (since that is very minimal in this case)?
If it's from the former possibility, is there any way to speed this process up?

I suppose that a few delays are adding up:
The Python interpreter needs a while to start and import the module
Network communication takes a bit
Parsing probably takes only a little time, but it still adds to the total
I think there is no straightforward way of speeding things up, especially not the first point. My suggestion is that you have your feeds downloaded on a regular basis (you could set up a cron job or write a Python daemon) and stored somewhere on your disk (e.g. in a plain text file), so you just need to display them at your terminal's startup (echo would probably be the easiest and fastest).
I have personally had good experiences with feedparser. I use it to download ~100 feeds every half hour with a Python daemon.
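The caching suggestion could look something like this sketch. It leaves out the download itself and only shows the two halves that touch the disk: write_cache would be called by the cron job or daemon after fetching the titles, and read_cache at shell startup. The function names and the JSON layout are assumptions for illustration, not part of feedparser.

```python
import json
import os
import time


def write_cache(titles, path):
    """Store the latest titles; meant to be called by the background job."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump({"saved": time.time(), "titles": list(titles)}, handle)
    # Replace the old file in one step so a reader never sees a partial file.
    os.replace(tmp_path, path)


def read_cache(path, limit=5):
    """Return up to `limit` cached titles, or an empty list if unavailable."""
    try:
        with open(path) as handle:
            return json.load(handle)["titles"][:limit]
    except (OSError, ValueError, KeyError):
        return []
```

At shell startup only read_cache runs, so the network and parsing delays no longer sit in the way of the prompt, although the interpreter startup time from the first point remains unless the titles are simply echoed from a plain text file.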

Parsing in real time is not the best approach if you want faster results.
You can try doing it asynchronously with Celery or a similar tool. I like Celery because it offers many features, such as cron-like periodic tasks, asynchronous tasks, and more.
