numba or dask: parallelizing calls to a bash script from Python

I am a beginner in the field of Python parallel computation. I want to parallelize one part of my code, which consists of for-loops. By default, my code runs a time loop covering 3 years. On each day, my code calls a bash script run_offline.sh and runs it 8 times, each time giving the bash script different input data indexed by the loop id. Here is the main part of my Python code demo.py:
import os
import numpy as np
from dateutil.rrule import rrule, HOURLY, DAILY
import datetime
import subprocess
...
start_date = datetime.datetime(2017, 1, 1, 10, 0, 0)
end_date = datetime.datetime(2019, 12, 31, 10, 0, 0)
...
loop_id = 0
for date_assim in rrule(freq=HOURLY,
                        dtstart=start_date,
                        interval=time_delta,
                        until=end_date):
    RESDIR = './results/'
    TYP = 'experiment_1'
    END_ID = 8
    YYYYMMDDHH = date_assim.strftime('%Y%m%d%H')
    p1 = subprocess.Popen(['./run_offline.sh', str(END_ID), str(loop_id),
                           str(YYYYMMDDHH), RESDIR, TYP])
    p1.wait()
    #%%
    # p1 creates 9 files, RESULTS_0.nc to RESULTS_8.nc;
    # for details, see the bash script attached below.
    # The code that follows computes on RESULTS_${CID}.nc, CID from 0 to 8.
    # In total 9 files are generated by p1 and used later.
    loop_id += 1
And ./run_offline.sh runs an atmospheric model, offline.exe, 9 times, as follows:
#!/bin/bash
# Usage: ./run_offline.sh END_ID loop_id YYYYMMDDHH RESDIR TYP
END_ID=${1:-1}
loop_id=${2:-1}
YYYYMMDDHH=${3:-1}
RESDIR=${4:-1}
TYP=${5:-1}
END_ID=`echo $((END_ID))`
loop_id=`echo $((loop_id))`
CID=0
ln -sf PREP_0.nc PREP.nc  # one of the required input files; it must be named PREP.nc
while [ $CID -le $END_ID ]; do
    cp -f ./OPTIONS.nam_${CID} ./OPTIONS.nam  # another input file required by offline.exe
    # Each ./OPTIONS.nam_${CID} carries a different perturbation index:
    # ./OPTIONS.nam_1 tells offline.exe to perturb the first variable in the
    # atmospheric model, ./OPTIONS.nam_2 the second variable, and so on.
    ./offline.exe
    cp RESULTS1.nc RESULTS_${CID}.OUT.nc                   # for the next part of demo.py
    mv RESULTS2.nc $RESDIR/$TYP/RESULTS2_${YYYYMMDDHH}.nc  # store this file in my results dir
    CID=$((CID+1))
done
Now I found that the loop over offline.exe is super time-consuming: each call to run_offline.sh takes around 10-20 s (running ./offline.exe 9 times costs 10-20 s). In total that is 15 s × 365 × 3 ≈ 4.5 hours on average if I want to run my scripts for 3 years... So can I parallelize the loop over offline.exe, say, assign the runs for different CID to different cores/subprocesses on the server? Note, though, that the two input files OPTIONS.nam and PREP.nc are forced to have exactly those names every time we run offline.exe, which means we cannot use OPTIONS.nam_x for loop x. So can I use dask or numba to help with this parallelization? Thanks!

If I understand your problem correctly, you run a bash script ~1000 times, each run of which calls a black-box executable 8-9 times, and this executable is the main bottleneck.
So can I parallize the loop of offline.exe?
This is hard to say, since the executable is a black box. You need to check the input/output/temporary data required by the program. For example, if the program stores temporary files somewhere on the storage device, then calling it in parallel will result in a race condition. Besides, you can only call it in parallel if the computational parts are fully independent. A dataflow analysis is very useful for knowing whether you can parallelize an application (especially when it is composed of multiple programs).
Additionally, you need to check whether the program is already parallel. Running multiple parallel programs simultaneously generally results in a much slower execution due to the large number of threads to schedule, poor cache usage, bad synchronization patterns, etc.
In your case, I think the best option would be to parallelize the program run in the loop (i.e. offline.exe). Otherwise, if the program is sequential and its runs are independent (see above), you can launch multiple processes using & in bash and wait for them at the end of the loop. Alternatively, you can use GNU parallel.
But one should note that two input files OPTIONS.nam and PREP.nc are forced to name as same names when we run offline.exe each time
This can be solved by calling the N programs from N distinct working directories in parallel. This is actually safer if the program creates temporary files in its working directory. You need to move/copy the input files before the parallel execution, and certainly the output files after it.
If the files OPTIONS.nam and/or PREP.nc are modified by the program, then it means the computation is completely sequential and cannot be parallelized (I assume the computation of each day is dependent of the previous one as this is a very common pattern in scientific simulations).
So can I use dask or numba to help this parallelization?
No. Dask and Numba are not meant to be used in this context. They are designed to operate on NumPy arrays in Python code. The part you want to parallelize is in bash, and the parallelized program is apparently not even written in Python.
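That said, the per-CID runs can still be driven from plain Python with the standard library. Below is a minimal sketch of the working-directory idea, assuming offline.exe is sequential and the 9 runs are independent; the run_one helper, the directory names, and the copying of PREP_0.nc are my own illustration, not part of the original code:
import os
import shutil
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_one(cid):
    workdir = f'run_{cid}'
    os.makedirs(workdir, exist_ok=True)
    # each run gets private copies of the fixed-name input files
    shutil.copy(f'OPTIONS.nam_{cid}', os.path.join(workdir, 'OPTIONS.nam'))
    shutil.copy('PREP_0.nc', os.path.join(workdir, 'PREP.nc'))
    subprocess.run([os.path.abspath('offline.exe')], cwd=workdir, check=True)
    # collect the outputs, as run_offline.sh does
    shutil.copy(os.path.join(workdir, 'RESULTS1.nc'), f'RESULTS_{cid}.OUT.nc')
    # RESULTS2.nc would be moved to the results dir here, as in run_offline.sh

# threads are enough here: each task just waits on an external process
with ThreadPoolExecutor(max_workers=9) as pool:
    list(pool.map(run_one, range(9)))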

If you cannot make offline.exe use names other than OPTIONS.nam and RESULTS1.nc, you will need to make sure that the parallel instances do not overwrite each other.
One way to do this is to make a dir for each run:
#!/bin/bash
# Usage: ./run_offline.sh END_ID loop_id YYYYMMDDHH RESDIR TYP
END_ID=${1:-1}
loop_id=${2:-1}
YYYYMMDDHH=${3:-1}
RESDIR=${4:-1}
TYP=${5:-1}
END_ID=`echo $((END_ID))`
loop_id=`echo $((loop_id))`
doit() {
    mkdir -p $1
    cd $1
    ln -sf ../PREP_$1.nc PREP.nc
    cp -f ../OPTIONS.nam_$1 ./OPTIONS.nam  # input file required by offline.exe
    ../offline.exe
    cp RESULTS1.nc ../RESULTS_$1.OUT.nc  # for the next part of the python code in demo.py
    # note: RESDIR should be an absolute path here, since we are one level down
    mv RESULTS2.nc $RESDIR/$TYP/RESULTS2_${YYYYMMDDHH}.nc  # store this file in my results dir
}
export -f doit
export RESDIR YYYYMMDDHH TYP
seq 0 $END_ID | parallel doit
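The Python driver demo.py does not need to change, since the script keeps the same arguments. By default, GNU parallel runs one job per CPU core; if offline.exe is itself multithreaded, it may be worth capping the number of simultaneous jobs explicitly, e.g. seq 0 $END_ID | parallel -j 4 doit.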

Related

String concatenation much faster in Python than Go

I'm looking at using Go to write a small program that's mostly handling text. I'm pretty sure, based on what I've heard about Go and Python, that Go will be substantially faster. I don't actually have a specific need for insane speeds, but I'd like to get to know Go.
The "Go is going to be faster" idea was supported by a trivial test:
# test.py
print("Hello world")
$ time python test.py
Hello world
real 0m0.029s
user 0m0.019s
sys 0m0.010s
// test.go
package main

import "fmt"

func main() {
    fmt.Println("hello world")
}
$ time ./test
hello world
real 0m0.001s
user 0m0.001s
sys 0m0.000s
Looks good in terms of raw startup speed (which is entirely expected). Highly non-scientific justification:
$ strace python test.py 2>&1 | wc -l
1223
$ strace ./test 2>&1 | wc -l
174
However, my next contrived test was how fast is Go when faffing with strings, and I was expecting to be similarly blown away by Go's raw speed. So, this was surprising:
# test2.py
s = ""
for i in range(1000000):
    s += "a"
$ time python test2.py
real 0m0.179s
user 0m0.145s
sys 0m0.013s
// test2.go
package main

func main() {
    s := ""
    for i := 0; i < 1000000; i++ {
        s += "a"
    }
}
$ time ./test2
real 0m56.840s
user 1m50.836s
sys 0m17.653s
So Go is hundreds of times slower than Python.
Now, I know this is probably due to Schlemiel the Painter's algorithm, which explains why the Go implementation is quadratic in i (making i 10 times bigger leads to a 100× slowdown).
However, the Python implementation seems much faster: 10 times more loops slows it down only by a factor of two. The same effect persists if you concatenate str(i), so I doubt there's some kind of magical JIT optimization to s = 100000 * 'a' going on. And it's not much slower if I print(s) at the end, so the variable isn't being optimised out.
Naivety of the concatenation methods aside (there are surely more idiomatic ways in each language), is there something here that I have misunderstood, or is it simply easier in Go than in Python to run into cases where you have to deal with C/C++-style algorithmic issues when handling strings (in which case a straight Go port might not be as uh-may-zing as I might hope without having to, ya'know, think about things and do my homework)?
Or have I run into a case where Python happens to work well, but falls apart under more complex use?
Versions used: Python 3.8.2, Go 1.14.2
TL;DR summary: basically you're testing the two implementations' allocators / garbage collectors, and the scale is heavily weighted toward the Python side (by chance, as it were, but this is something the Python folks optimized at some point).
To expand my comments into a real answer:
Both Go and Python have counted strings, i.e., strings are implemented as a two-element header thingy containing a length (byte count or, for Python 3 strings, Unicode character count) and a data pointer.
Both Go and Python are garbage-collected (GCed) languages. That is, in both languages, you can allocate memory without having to worry about freeing it yourself: the system takes care of that automatically.
But the underlying implementations differ in one particular, important way: the version of Python you are using has a reference-counting GC. The Go system you are using does not.
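To make the reference-counting idea concrete, here is a small CPython illustration (my addition, not part of the original answer):
import sys

s = "".join(["fresh", "string"])  # built at runtime, so not interned/shared
print(sys.getrefcount(s))  # 2: the name s, plus the temporary argument reference
t = s
print(sys.getrefcount(s))  # 3: t now refers to the same object
del t
print(sys.getrefcount(s))  # back to 2: s += "a" could now resize s in place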
With a reference count, the inner bits of the Python string handler can do this. I'll express it as Go (or at least pseudo-Go) although the actual Python implementation is in C and I have not made all the details line up properly:
// add (append) new string t to existing string s
func add_to_string(s, t string_header) string_header {
    need = s.len + t.len
    if s.refcount == 1 { // can modify string in-place
        data = s.data
        if cap(data) >= need {
            copy_into(data + s.len, t.data, t.len)
            return s
        }
    }
    // s is shared or s.cap < need
    new_s := make_new_string(roundup(need))
    // important: new_s has extra space for the next call to add_to_string
    copy_into(new_s.data, s.data, s.len)
    copy_into(new_s.data + s.len, t.data, t.len)
    s.refcount--
    if s.refcount == 0 {
        gc_release_string(s)
    }
    return new_s
}
By over-allocating—rounding up the need value so that cap(new_s) is large—we get about log2(n) calls to the allocator, where n is the number of times you do s += "a". With n being 1000000 (one million), that's about 20 times that we actually have to invoke the make_new_string function and release (for gc purposes because the collector uses refcounts as a first pass) the old string s.
[Edit: your source archaeology led to commit 2c9c7a5f33d, which suggests less than doubling but still a multiplicative increase. To other readers, see comment.]
The current Go implementation allocates strings without a separate capacity header field (see reflect.StringHeader and note the big caveat that says "don't depend on this, it might be different in future implementations"). Between the lack of a refcount—we can't tell in the runtime routine that adds two strings, that the target has only one reference—and the inability to observe the equivalent of cap(s) (or cap(s.data)), the Go runtime has to create a new string every time. That's one million memory allocations.
To show that the Python code really does use the refcount, take your original Python:
s = ""
for i in range(1000000):
    s += "a"
and add a second variable t like this:
s = ""
t = s
for i in range(1000000):
    s += "a"
    t = s
The difference in execution time is impressive:
$ time python test2.py
0.68 real 0.65 user 0.03 sys
$ time python test3.py
34.60 real 34.08 user 0.51 sys
The modified Python program still beats Go (1.13.5) on this same system:
$ time ./test2
67.32 real 103.27 user 13.60 sys
and I have not poked any further into the details, but I suspect the Go GC is running more aggressively than the Python one. The Go GC is very different internally, requiring write barriers and occasional "stop the world" behavior (of all goroutines that are not doing the GC work). The refcounting nature of the Python GC allows it to never stop: even with a refcount of 2, the refcount on t drops to 1 and then next assignment to t drops it to zero, releasing the memory block for re-use in the next trip through the main loop. So it's probably picking up the same memory block over and over again.
(If my memory is correct, Python's "over-allocate strings and check the refcount to allow expand-in-place" trick was not in all versions of Python. It may have first been added around Python 2.4 or so. This memory is extremely vague and a quick Google search did not turn up any evidence one way or the other. [Edit: Python 2.7.4, apparently.])
Well. You should never, ever use string concatenation in this way :-)
In Go, try strings.Builder:
package main

import (
    "strings"
)

func main() {
    var b1 strings.Builder
    for i := 0; i < 1000000; i++ {
        b1.WriteString("a")
    }
}
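And for completeness on the Python side (my addition): the idiomatic way to avoid quadratic behavior there, without relying on the refcount trick, is to collect the pieces in a list and join once:
pieces = []
for i in range(1000000):
    pieces.append("a")
s = "".join(pieces)  # single allocation of the final string
# (or simply: s = "a" * 1000000)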

How to link interactive problems (w.r.t. CodeJam)?

I'm not sure if it's allowed to ask for help (if not, I don't mind not getting an answer until the competition period is over).
I was solving the interactive problem (Dat Bae) on CodeJam. Locally, I can run the judge (testing_tool.py) and my program (<name>.py) separately and copy-paste the I/O manually. However, I assume I need to find a way to make this happen automatically.
Edit: To make it clear, I want every output of one program to be fed as input to the other, and vice versa.
Some details:
- I've used sys.stdout.write / sys.stdin.readline instead of print / input throughout my program.
- I tried running interactive_runner.py, but I can't figure out how to use it.
- I tried running it on their server, my program in the first tab and the judge file in the second; it always throws a TLE error.
- I can't find any tutorial covering this either. Any help will be appreciated! :/
The usage is documented in comments inside the scripts:
interactive_runner.py
# Run this as:
# python interactive_runner.py <cmd_line_judge> -- <cmd_line_solution>
#
# For example:
# python interactive_runner.py python judge.py 0 -- ./my_binary
#
# This will run the first test set of a python judge called "judge.py" that
# receives the test set number (starting from 0) via command line parameter
# with a solution compiled into a binary called "my_binary".
testing_tool.py
# Usage: `testing_tool.py test_number`, where the argument test_number
# is 0 for Test Set 1 or 1 for Test Set 2.
So use them like this:
python interactive_runner.py python testing_tool.py 0 -- ./dat_bae.py
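Note that ./dat_bae.py only works if the script has a shebang line (e.g. #!/usr/bin/env python3) and executable permissions (chmod +x dat_bae.py); otherwise, pass the interpreter explicitly after the -- separator:
python interactive_runner.py python testing_tool.py 0 -- python dat_bae.py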

improving performance of behave tests

We run behave BDD tests in our pipeline, inside a Docker container as part of the Jenkins pipeline. Currently it takes ~10 minutes to run all the tests. We are adding a lot of tests, and in a few months it might go up to 30 minutes. It outputs a lot of information, and I believe that if I reduce the amount of information it outputs, the tests will run faster. Is there a way to control the amount of information behave outputs? I want to print the information only if something fails.
I took a look at behave-parallel. It looks like it is Python 2.7 only; we are on Python 3.
I was looking at various options behave provides.
behave -verbose=false folderName (I assumed that it will not output all the steps)
behave --logging-level=ERROR TQXYQ (I assumed it will print only if there is an error)
behave --logging-filter="Test Step" TQXYQ (I assumed it will print only the tests that has "Test Step" in it)
None of the above worked.
The current output looks like this
Scenario Outline: IsError is populated correctly based on Test Id -- #1.7 # TestName/Test.feature:187
Given the test file folder is set to /TestName/steps/ # common/common_steps.py:22 0.000s
And Service is running # common/common_steps.py:10 0.000s
Given request used is current.json # common/common_steps.py:26 0.000s
And request is modified to set X to q of type str # common/common_steps.py:111 0.000s
And request is modified to set Y to null of type str # common/common_steps.py:111 0.000s
And request is modified to set Z to USD of type str # common/common_steps.py:111 0.000s
When make a modification request # common/common_steps.py:37 0.203s
Then it returns 200 status code # common/common_steps.py:47 0.000s
And transformed result has IsError with 0 of type int # common/common_steps.py:92 0.000s
And transformed result has ErrorMessages contain [] # common/common_steps.py:52 0.000s
I want to print all of this only if there is an error. If everything passes, I don't want to display this information.
I think the default log level INFO will not impact the performance of your tests.
I am also using a Docker container to run the regression suite, and it takes about 2 hours to run 2300 test scenarios. It took nearly a day before; here is what I did:
1. Run the whole test suite in parallel.
This is the most important change and reduces the execution time the most. We spent a lot of effort making the regression suite parallelizable:
- Make tests atomic, autonomous and independent, so that you can run all of them in parallel effectively.
- Create a parallel runner to run tests in multiple processes. I am using the multiprocessing and subprocess libraries to do this. I would not recommend behave-parallel because it is no longer actively supported. You can refer to this link: http://blog.crevise.com/2018/02/executing-parallel-tests-using-behave.html?m=1
- Use Docker Swarm to add more nodes to the Selenium Grid. You can scale up by adding more nodes, and the maximum number of nodes depends on the number of CPUs. Best practice is number of nodes = number of CPUs. I have 4 PCs, each with 4 cores, so I can scale up to 1 hub and 15 nodes.
2. Optimize synchronization in your framework.
- Remove time.sleep().
- Remove implicit waits; use explicit waits instead (a sketch follows below).
Hope it helps.
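For the explicit-wait point above, a minimal Selenium sketch in Python (my illustration; the URL and locator are hypothetical):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com/app")  # hypothetical page

# instead of time.sleep(10), block only until the element is actually ready
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "results-table"))  # hypothetical locator
)
driver.quit()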
Well, I have solved this in a traditional way, but I'm not sure how effective it could be. I just started on this yesterday and am now trying to build reports out of it. Approach as below; suggestions welcome.
This also solves parallel execution down at the example level.
parallel_behave.py
Run command (mimics all the params of behave command)
py parallel_behave.py -t -d -f ......
import json
import subprocess

initial_command = 'behave -d -t <tags>'
# the above dry-run command returns the eligible cases; may not be the right
# approach, but works well for me
r = subprocess.Popen(initial_command.split(' '), stdout=subprocess.PIPE, bufsize=0)

finalsclist = []
_tmpstr = ''
for out in r.stdout:
    out = out.decode('utf-8')
    if out.startswith('['):
        _tmpstr += out
    if out.startswith('{'):
        _tmpstr += out
    if out.startswith(']'):
        _tmpstr += out
        break

scenarionamedt = json.loads(_tmpstr)
for sc in scenarionamedt:
    for s in sc['elements']:
        finalsclist.append(s['name'])
Now finalsclist contains the scenario names:
import time
from multiprocessing import Pool

ts = int(time.time())  # timestamp suffix for the report file name

def foo(scenarioname):
    # note: you may want a per-scenario file name here to avoid overwrites
    cmd = "behave -n '{}' -o ./report/output{}.json".format(scenarioname, ts)
    subprocess.call(cmd, shell=True)

pool = Pool(4)  # derive based on the power of the processor
pool.map(foo, finalsclist)
This launches that many individual behave processes and generates the JSON output under the report folder.
*** There was a reference from https://github.com/hugeinc/behave-parallel, but that works at the feature level. I just extended it to scenarios and examples. ***

share variable (data from file) among multiple python scripts without loading duplicates

I would like to load a big matrix contained in matrix_file.mtx. This load must happen only once. Once the variable matrix is loaded into memory, I would like many Python scripts to share it without duplicates, in order to have a memory-efficient multi-script program in bash (or Python itself). I can imagine some pseudocode like this:
# Loading and sharing script:
import share
matrix = open("matrix_file.mtx","r")
share.send_to_shared_ram(matrix, as_variable('matrix'))
# Shared matrix variable processing script_1
import share
pointer_to_matrix = share.share_variable_from_ram('matrix')
type(pointer_to_matrix)
# output: <type 'numpy.ndarray'>
# Shared matrix variable processing script_2
import share
pointer_to_matrix = share.share_variable_from_ram('matrix')
type(pointer_to_matrix)
# output: <type 'numpy.ndarray'>
...
The idea is that pointer_to_matrix points to matrix in RAM, which is loaded only once even though n scripts use it (not n times). The scripts are separately called from a bash script (or, if possible, from a Python main):
$ python Load_and_share.py
$ python script_1.py -args string &
$ python script_2.py -args string &
$ ...
$ python script_n.py -args string &
I'd also be interested in solutions via the hard disk, i.e. matrix could be stored on disk while the share object accesses it as required, as long as the object (a kind of pointer) in RAM can be treated as the whole matrix.
Thank you for your help.
Between the mmap module and numpy.frombuffer, this is fairly easy:
import mmap
import numpy as np

with open("matrix_file.mtx", "rb") as matfile:
    mm = mmap.mmap(matfile.fileno(), 0, access=mmap.ACCESS_READ)
    # Optionally, on UNIX-like systems in Py3.3+, add:
    #     os.posix_fadvise(matfile.fileno(), 0, len(mm), os.POSIX_FADV_WILLNEED)
    # to trigger background read-in of the file to the system cache,
    # minimizing page faults when you use it
matrix = np.frombuffer(mm, np.uint8)
Each process would perform this work separately and get a read-only view of the same memory. You'd change the dtype to something other than uint8 as needed. Switching to ACCESS_WRITE would allow modifications to shared data, though it would require synchronization and possibly explicit calls to mm.flush to ensure the data was reflected in other processes.
A more complex solution that follows your initial design more closely might be to use multiprocessing.SyncManager to create a connectable shared "server" for data, allowing a single common store of data to be registered with the manager and returned to as many users as desired. Creating an Array (based on ctypes types) with the correct type on the manager, then register-ing a function that returns the same shared Array to all callers, would work too (each caller would then convert the returned Array via numpy.frombuffer as before). It's much more involved (it would be easier to have a single Python process initialize an Array, then launch Processes that would share it automatically thanks to fork semantics), but it's the closest to the concept you describe.
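As a newer alternative to the SyncManager design (my addition, assuming Python 3.8+): multiprocessing.shared_memory lets independently launched scripts attach to one named block, which maps directly onto the loader/worker split in the question:
import numpy as np
from multiprocessing import shared_memory

# --- Load_and_share.py: run once ---
data = np.fromfile("matrix_file.mtx", dtype=np.uint8)
shm = shared_memory.SharedMemory(name="matrix", create=True, size=data.nbytes)
np.ndarray(data.shape, dtype=np.uint8, buffer=shm.buf)[:] = data  # copy in once

# --- script_1.py, script_2.py, ...: attach by name, no extra copy ---
shm_view = shared_memory.SharedMemory(name="matrix")
matrix = np.ndarray((shm_view.size,), dtype=np.uint8, buffer=shm_view.buf)
# (on some platforms shm_view.size may be rounded up to a page multiple)
# when all scripts are done, one of them should call shm.unlink()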

"embarrassingly parallel" programming using python and PBS on a cluster

I have a function (neural network model) which produces figures. I wish to test several parameters, methods and different inputs (meaning hundreds of runs of the function) from python using PBS on a standard cluster with Torque.
Note: I tried parallelpython, ipython and such and was never completely satisfied, since I want something simpler. The cluster has a given configuration that I cannot change, and such a solution integrating python + qsub will certainly benefit the community.
To simplify things, I have a simple function such as:
import myModule

def model(input, a=1., N=100):
    do_lots_number_crunching(input, a, N)
    pylab.savefig('figure_' + input.name + '_' + str(a) + '_' + str(N) + '.png')
where input is an object representing the input, input.name is a string, and do_lots_number_crunching may last hours.
My question is: is there a correct way to transform something like a scan of parameters such as
for a in pylab.linspace(0., 1., 100):
    model(input, a)
into "something" that would launch a PBS script for every call to the model function?
#PBS -l ncpus=1
#PBS -l mem=i1000mb
#PBS -l cput=24:00:00
#PBS -V
cd /data/work/
python experiment_model.py
I was thinking of a function that would include the PBS template and call it from the python script, but could not yet figure it out (decorator?).
pbs_python[1] could work for this. If experiment_model.py takes 'a' as an argument, you could do:
import pbs, os
import pylab

server_name = pbs.pbs_default()
c = pbs.pbs_connect(server_name)

attropl = pbs.new_attropl(4)
attropl[0].name = pbs.ATTR_l
attropl[0].resource = 'ncpus'
attropl[0].value = '1'

attropl[1].name = pbs.ATTR_l
attropl[1].resource = 'mem'
attropl[1].value = 'i1000mb'

attropl[2].name = pbs.ATTR_l
attropl[2].resource = 'cput'
attropl[2].value = '24:00:00'

attropl[3].name = pbs.ATTR_V

script = '''
cd /data/work/
python experiment_model.py %f
'''

jobs = []
for a in pylab.linspace(0., 1., 100):
    script_name = 'experiment_model.job' + str(a)
    with open(script_name, 'w') as scriptf:
        scriptf.write(script % a)
    job_id = pbs.pbs_submit(c, attropl, script_name, 'NULL', 'NULL')
    jobs.append(job_id)
    os.remove(script_name)

print jobs
[1]: https://oss.trac.surfsara.nl/pbs_python/wiki/TorqueUsage (pbs_python)
You can do this easily using jug (which I developed for a similar setup).
You'd write in a file (e.g., model.py):
from jug import TaskGenerator

@TaskGenerator
def model(param1, param2):
    res = complex_computation(param1, param2)
    pyplot.coolgraph(res)

for param1 in np.linspace(0., 1., 100):
    for param2 in xrange(2000):
        model(param1, param2)
And that's it!
Now you can launch "jug jobs" on your queue: jug execute model.py, and this will parallelise automatically. What happens is that each job will, in a loop, do something like:
while not all_done():
    for t in tasks_that_i_can_run():
        if t.lock_for_me(): t.run()
(It's actually more complicated than that, but you get the point.)
It uses the filesystem for locking (if you're on an NFS system) or a redis server if you prefer. It can also handle dependencies between tasks.
This is not exactly what you asked for, but I believe it's a cleaner architecture to separate this from the job queueing system.
It looks like I'm a little late to the party, but I had the same question of how to map embarrassingly parallel problems onto a cluster in Python a few years ago, and I wrote my own solution. I recently uploaded it to GitHub here: https://github.com/plediii/pbs_util
To write your program with pbs_util, I would first create a pbs_util.ini in the working directory containing
[PBSUTIL]
numnodes=1
numprocs=1
mem=i1000mb
walltime=24:00:00
Then a python script like this
import pbs_util.pbs_map as ppm
import pylab
import myModule

class ModelWorker(ppm.Worker):

    def __init__(self, input, N):
        self.input = input
        self.N = N

    def __call__(self, a):
        myModule.do_lots_number_crunching(self.input, a, self.N)
        pylab.savefig('figure_' + self.input.name + '_' + str(a) + '_' + str(self.N) + '.png')

# You need "main" protection like this since pbs_map will import this file on the compute nodes
if __name__ == "__main__":
    input, N = something, picklable
    # Use list to force the iterator
    list(ppm.pbs_map(ModelWorker, pylab.linspace(0., 1., 100),
                     startup_args=(input, N),
                     num_clients=100))
And that would do it.
I just started working with clusters and EP applications. My goal (I'm with the library) is to learn enough to help other researchers on campus access HPC with EP applications, especially researchers outside of STEM. I'm still very new, but I thought it might help this question to point out the use of GNU Parallel in a PBS script to launch basic Python scripts with varying arguments. In the .pbs file, there are two lines to point out:
module load gnu-parallel # this is required on my environment
parallel -j 4 --env PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
--workdir $NODE_LOCAL_DIR --transfer --return 'output.{#}' --clean \
`pwd`/simple.py '{#}' '{}' ::: $INPUT_DIR/input.*
# `-j 4` is the number of processors to use per node, will be cluster-specific
# {#} will substitute the process number into the string
# `pwd`/simple.py `{#}` `{}` this is the command that will be run multiple times
# ::: $INPUT_DIR/input.* all of the files in $INPUT_DIR/ that start with 'input.'
# will be substituted into the python call as the second (3rd) argument where the
# `{}` resides. These can be simple text files that you use in your 'simple.py'
# script to pass the parameter sets, filenames, etc.
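For concreteness, here is a hypothetical skeleton of the simple.py referenced above (my sketch, not from the original post): it takes the GNU parallel job number and one input file path, and writes output.<jobnumber> so that --return 'output.{#}' can collect it.
# hypothetical simple.py: argv[1] comes from {#}, argv[2] from {}
import sys

def main():
    job_number = sys.argv[1]       # filled in from {#}
    input_path = sys.argv[2]       # filled in from {} (one input.* file)
    with open(input_path) as f:
        params = f.read().split()  # e.g. a parameter set, one token per value
    # ... do the actual work here ...
    with open('output.{}'.format(job_number), 'w') as out:
        out.write('processed {} with {} params\n'.format(input_path, len(params)))

if __name__ == '__main__':
    main()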
As a newbie to EP supercomputing, even though I don't yet understand all the other options of parallel, this command let me launch Python scripts in parallel with different parameters. This works well if you can generate a slew of parameter files ahead of time to parallelize your problem, for example when running simulations across a parameter space, or when processing many files with the same code.
