Dask with HTCondor scheduler - python

Background
I have an image analysis pipeline with parallelised steps. The pipeline is written in Python and the parallelisation is controlled by dask.distributed. The minimum processing set-up is 1 scheduler + 3 workers with 15 processes each. In the first, short step of the analysis I use 1 process per worker but all the RAM of the node; in all other analysis steps all nodes and processes are used.
Issue
The admin will install HTCondor as a scheduler for the cluster.
Thought
In order to get my code running on the new setup I was planning to use the approach shown in the dask manual for SGE, because the cluster has a shared network file system.
# job1
# Start a dask-scheduler somewhere and write connection information to file
qsub -b y /path/to/dask-scheduler --scheduler-file /path/to/scheduler.json
# Job2
# Start 100 dask-worker processes in an array job pointing to the same file
qsub -b y -t 1-100 /path/to/dask-worker --scheduler-file /path/to/scheduler.json
# Job3
# Start a process with the python code where the client is started this way
from dask.distributed import Client
client = Client(scheduler_file='/path/to/scheduler.json')
Question and advice
If I understood correctly, with this approach I will start the scheduler, the workers and the analysis as independent jobs (different HTCondor submit files). How can I make sure that the order of execution will be correct? Is there a way to keep the processing approach I have been using, or will it be more efficient to translate the code to work better with HTCondor?
Thanks for the help!

HTCondor support has been merged (https://github.com/dask/dask-jobqueue/pull/245) and should now be available in Dask JobQueue as HTCondorCluster, e.g. HTCondorCluster(cores=1, memory='100MB', disk='100MB').
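A minimal sketch of what that could look like, assuming a dask-jobqueue version with HTCondor support is installed (the resource values are the small placeholders from the pull request, and the job count is just an example):

from dask.distributed import Client
from dask_jobqueue import HTCondorCluster

# resources are per job; these small values are placeholders, not a recommendation
cluster = HTCondorCluster(cores=1, memory='100MB', disk='100MB')
cluster.scale(jobs=3)      # submit 3 HTCondor jobs, i.e. 3 workers here
client = Client(cluster)   # the analysis code then uses this client as before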

Related

Job, Worker, and Task in dask_jobqueue

I am using a SLURM cluster with Dask and don't quite understand the configuration part. The documentation talks of jobs and workers and even has a section on the difference:
In dask-distributed, a Worker is a Python object and node in a dask Cluster that serves two purposes, 1) serve data, and 2) perform computations. Jobs are resources submitted to, and managed by, the job queueing system (e.g. PBS, SGE, etc.). In dask-jobqueue, a single Job may include one or more Workers.
The problem is I still don't get it. I use the word task to refer to a single function one submits using a client, i.e. with a client.submit(task, *params) call.
My understanding of how Dask works is that there are n_workers set up and that each task is submitted to a pool of said workers. Any worker works on one task at a given time potentially using multiple threads and processes.
However my understanding does not leave any room for the term job and is thus certainly wrong. Moreover most configurations of the cluster (cores, memory, processes) are done on a per job basis according to the docs.
So my question is what is a job? Can anyone explain in simpler terms its relation to a task and a worker? And how the cores, memory, processes, and n_workers configurations interact? (I have read the docs, just don't understand and could use another explanation)
Your understanding of tasks and workers is correct. A job is a concept specific to SLURM (and other HPC clusters where users submit jobs). A job consists of the instructions on what to execute and what resources are needed, so the typical workflow of a SLURM user is to write a script and then submit it for execution using salloc or sbatch.
One can submit a job with instructions to launch multiple dask-workers (there might be advantages to this due to latency, permissions, resource availability, etc., but this would need to be determined from the particular cluster configuration).
From dask's perspective what matters is the number of workers, but from dask-jobqueue's perspective the number of jobs also matters. For example, if the number of workers per job is 2, then to get 10 workers in total dask-jobqueue will submit 5 jobs to the SLURM scheduler.
This example, adapted from the docs, will result in 10 dask-workers, each with 24 cores:
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue='regular',
    project="myproj",
    cores=24,
    processes=1,
    memory="500 GB",
)
cluster.scale(jobs=10)  # ask for 10 jobs
If we specify multiple processes, then the total number of workers will be jobs * processes (assuming sufficient cores), so the following will give 100 workers with 2 cores and 50 GB of memory each (note the memory in the config is the total per job):
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue='regular',
    project="myproj",
    cores=20,
    processes=10,
    memory="500 GB",
)
cluster.scale(jobs=10)  # ask for 10 jobs

Dask: Jobs on multiple nodes with one worker, run on one node only

I am trying to process some files using a python function and would like to parallelize the task on a PBS cluster using dask. On the cluster I can only launch one job but have access to 10 nodes with 24 cores each.
So my dask PBSCluster looks like:
import dask
from dask_jobqueue import PBSCluster

cluster = PBSCluster(cores=240,
                     memory="1GB",
                     project='X',
                     queue='normal',
                     local_directory='$TMPDIR',
                     walltime='12:00:00',
                     resource_spec='select=10:ncpus=24:mem=1GB',
                     )
cluster.scale(1) # one worker
from dask.distributed import Client
client = Client(cluster)
client
Afterwards the cluster in Dask shows 1 worker with 240 cores (not sure if that makes sense).
When I run
result = compute(*foo, scheduler='distributed')
and access the allocated nodes, only one of them is actually running the computation. I am not sure if I am using the right PBS configuration.
cluster = PBSCluster(cores=240,
                     memory="1GB",
The values you give to the Dask Jobqueue constructors are the values for a single job for a single node. So here you are asking for a node with 240 cores, which probably doesn't make sense today.
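For reference, if the cluster did allow several jobs, a per-node specification would look roughly like this sketch (the memory value and the job count here are placeholders, not taken from the question):

from dask_jobqueue import PBSCluster

# sketch: resources are per job, i.e. per node
cluster = PBSCluster(cores=24,        # cores of ONE node
                     memory='120GB',  # memory of ONE node (placeholder value)
                     queue='normal',
                     project='X',
                     walltime='12:00:00')
cluster.scale(jobs=10)                # 10 jobs -> 10 nodes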
If you can only launch one job then dask-jobqueue's model probably won't work for you. I recommend looking at dask-mpi as an alternative.
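A rough sketch of the dask-mpi approach, where the scheduler, the workers and the client script all run inside a single MPI job (the script name, the mpirun line and the rank count are just examples and would go into the single PBS submit script):

# launch with something like:  mpirun -np 241 python analysis.py
from dask_mpi import initialize
from dask.distributed import Client

initialize()       # rank 0 becomes the scheduler, rank 1 runs this script,
                   # all remaining ranks become workers
client = Client()  # connects to the scheduler started by initialize()

# ... submit work through the client as usual ...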

How to make spark run all tasks in a job concurrently?

I have a system where a REST API (Flask) uses spark-submit to send a job to an up-and-running pyspark instance.
For various reasons, I need spark to run all tasks at the same time (i.e. I need to set the number of executors = the number of tasks during runtime).
For example, if I have twenty tasks and only 4 cores, I want each core to execute 5 tasks (executors) without having to restart spark.
I know I can set the number of executors when starting spark, but I don't want to do that since spark is executing other jobs.
Is this possible to achieve through a work around?
Use Spark scheduler pools. Here is an example of running multiple queries using scheduler pools (it is at the very end of the article; copying it here for convenience). The same logic works for DStreams too:
https://docs.databricks.com/spark/latest/structured-streaming/production.html
// Run streaming query1 in scheduler pool1
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
df.writeStream.queryName("query1").format("parquet").start(path1)
// Run streaming query2 in scheduler pool2
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
df.writeStream.queryName("query2").format("orc").start(path2)
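Since the job here is driven from PySpark, a rough Python equivalent is sketched below. It assumes fair scheduling is enabled (spark.scheduler.mode=FAIR) and the pool names are arbitrary; setLocalProperty is thread-local, so each thread can target its own pool and the actions run concurrently:

import threading
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("pool-sketch")
         .config("spark.scheduler.mode", "FAIR")   # pools only matter with the fair scheduler
         .getOrCreate())

def run_in_pool(pool_name, data):
    # the property is thread-local, so this thread's action goes to its own pool
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool_name)
    spark.sparkContext.parallelize(data).count()   # any action would do

t1 = threading.Thread(target=run_in_pool, args=("pool1", range(100000)))
t2 = threading.Thread(target=run_in_pool, args=("pool2", range(100000)))
t1.start(); t2.start()
t1.join(); t2.join()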

pyspark on cluster, make sure all nodes are used

Deployment info: "pyspark --master yarn-client --num-executors 16 --driver-memory 16g --executor-memory 2g "
I am turning a 100,000-line text file (stored in HDFS) into an RDD object with corpus = sc.textFile("my_file_name"). When I execute corpus.count() I do get 100000. I realize that all these steps are performed on the master node.
Now, my question is when I perform some action like new_corpus=corpus.map(some_function), will the job be automatically distributed by pyspark among all available slaves (16 in my case)? Or do I have to specify something?
Notes:
I don't think that anything gets distributed actually (or at least not on the 16 nodes) because when I do new_corpus.count(), what prints out is [Stage some_number:> (0+2)/2], not [Stage some_number:> (0+16)/16]
I don't think that doing corpus = sc.textFile("my_file_name",16) is the solution for me because the function I want to apply works at the line level and therefore should be applied 100,000 times (the goal of parallelization is to speed up this process, like having each slave taking 100000/16 lines). It should not be applied 16 times on 16 subsets of the original text file.
Your observations are not really correct. Stages are not "executors". In Spark we have jobs, tasks and then stages. The job is kicked off by the master driver, and then tasks are assigned to different worker nodes, where a stage is a collection of tasks which have the same shuffling dependencies. In your case shuffling happens only once.
To check if there really are 16 executors, you have to look into the resource manager UI. Since you are using YARN, the ResourceManager UI is usually at port 8088; the Spark application UI is at port 4040.
Also, if you use rdd.map(), it will parallelize according to the number of partitions you defined (for example via sc.textFile("my_file_name", numPartitions)), not according to the number of executors you set.
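A quick way to check and control this from the question's own code (the file name and partition counts are just the ones from the question):

# ask Spark for at least 16 partitions when reading, then check what it actually created
corpus = sc.textFile("my_file_name", 16)
print(corpus.getNumPartitions())

# an existing RDD can also be repartitioned before mapping over it
new_corpus = corpus.repartition(16).map(some_function)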
Here is an overview again:
https://spark.apache.org/docs/1.6.0/cluster-overview.html
First off, I saw yarn-client and a chill ran down my spine.
Is there a reason why you want the node where you submit your job to be running the driver? Why not let Yarn do its thing?
But about your question:
I realize that all these steps are performed on the master node.
No, they are not. You might be misled by the fact that you are running your driver on the node you are connected to (see my spine-chill ;) ).
You tell yarn to start up 16 executors for you, and Yarn will do so.
It will try to take your rack and data locality into account to the best of its ability while doing so. These will be run in parallel.
Yarn is a resource manager, it manages the resources so you don't have to. All you have to specify with Spark is the number of executors you want and the memory yarn has to assign to the executors and driver.
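For example, a cluster-mode submission with the numbers from the question might look roughly like this (my_script.py is a placeholder for the driver program):

spark-submit --master yarn --deploy-mode cluster \
    --num-executors 16 --executor-memory 2g --driver-memory 16g \
    my_script.py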
Update: I have added an image (not reproduced here) to clarify how spark-submit works in cluster mode.

AWS EC2: How to process queue with 100 parallel ec2 instances?

(Ubuntu 12.04.) I envision some sort of queue on which to put thousands of tasks, and have the EC2 instances plow through it in parallel (100 EC2 instances), where each instance handles one task from the queue.
Also, each EC2 instance should use the image I provide, which will have the binaries and software installed on it.
Essentially what I am trying to do is run 100 processing jobs (a python function using packages that depend on binaries installed on that image) in parallel on Amazon's EC2 for an hour or less, shut them all off, and repeat the process whenever needed.
Is this doable? I am using Python Boto to do this.
This is doable. You should look into using SQS. Jobs are placed on a queue and the worker instances pop jobs off the queue and perform the appropriate work. As a job is completed, the worker deletes the job from the queue so no job is run more than once.
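A minimal sketch of the worker loop each instance would run, written against the newer boto3 SDK (the question mentions the older boto library, but the pattern is the same; the queue URL and process_task are placeholders):

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-task-queue"  # placeholder

while True:
    resp = sqs.receive_message(QueueUrl=queue_url,
                               MaxNumberOfMessages=1,
                               WaitTimeSeconds=20)          # long polling
    for msg in resp.get("Messages", []):
        process_task(msg["Body"])                           # placeholder: your python function
        # delete only after the work succeeded, so a crashed worker's task gets retried
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=msg["ReceiptHandle"])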
You can configure your instances using user-data at boot time or you can bake AMIs with all of your software pre-installed. I recommend Packer for baking AMIs as it works really well and is very scriptable so your AMIs can be rebuilt consistently as things need to be changed.
For turning on and off lots of instances, look into using AutoScaling. Simply set the group's desired capacity to the number of worker instances you want running and it will take care of the rest.
This sounds like it might be easier to do with EMR.
You mentioned in the comments that you are doing computer vision. You can make your job Hadoop-friendly by preparing a file where each line is a base64 encoding of an image file.
You can prepare a simple bootstrap script to make sure each node of the cluster has your software installed. Hadoop Streaming will allow you to use your image processing code as-is for the job (instead of rewriting it in Java).
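A Hadoop Streaming mapper for the one-image-per-line layout described above can be as small as this sketch (process_image stands in for the existing computer-vision code):

#!/usr/bin/env python
import sys
import base64

def process_image(image_bytes):
    # placeholder: run the real image analysis here and return a string result
    return str(len(image_bytes))

# Hadoop Streaming feeds one base64-encoded image per line on stdin
for line in sys.stdin:
    image_bytes = base64.b64decode(line.strip())
    print(process_image(image_bytes))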
When your job is over, the cluster instances will be shut down. You can also specify that your output be streamed directly to an S3 bucket; it's all baked in. EMR is also cheap: 100 m1.medium EC2 instances running for an hour will only cost you around 2 dollars according to the most recent pricing: http://aws.amazon.com/elasticmapreduce/pricing/
