Job, Worker, and Task in dask_jobqueue - python

I am using a SLURM cluster with Dask and don't quite understand the configuration part. The documentation talks of jobs and workers and even has a section on the difference:
In dask-distributed, a Worker is a Python object and node in a dask Cluster that serves two purposes, 1) serve data, and 2) perform computations. Jobs are resources submitted to, and managed by, the job queueing system (e.g. PBS, SGE, etc.). In dask-jobqueue, a single Job may include one or more Workers.
Problem is I still don't get it. I use the word task to refer to a single function one submits using a client, i.e. with a client.submit(task, *params) call.
My understanding of how Dask works is that there are n_workers set up and that each task is submitted to a pool of said workers. Any worker works on one task at a given time potentially using multiple threads and processes.
However, my understanding does not leave any room for the term job and is thus certainly wrong. Moreover, most configuration of the cluster (cores, memory, processes) is done on a per-job basis according to the docs.
So my question is: what is a job? Can anyone explain in simpler terms its relation to a task and a worker? And how do the cores, memory, processes, and n_workers configurations interact? (I have read the docs, just don't understand and could use another explanation.)

Your understanding of tasks and workers is correct. A job is a concept specific to SLURM (and other HPC clusters where users submit jobs). A job consists of instructions for what to execute and what resources are needed, so the typical workflow of a SLURM user is to write a script and then submit it for execution using salloc or sbatch.
One can submit a job with instructions to launch multiple dask-workers (there might be advantages to this due to latency, permissions, resource availability, etc., but this would need to be determined from the particular cluster configuration).
From Dask's perspective what matters is the number of workers, but from dask-jobqueue's perspective the number of jobs also matters. For example, if the number of workers per job is 2, then to get 10 workers in total dask-jobqueue will submit 5 jobs to the SLURM scheduler.
This example, adapted from the docs, will result in 10 dask-workers, each with 24 cores:
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue='regular',
    project="myproj",
    cores=24,
    processes=1,
    memory="500 GB"
)
cluster.scale(jobs=10)  # ask for 10 jobs
If we specify multiple processes, then the total number of workers will be jobs * processes (assuming sufficient cores), so the following will give 100 workers with 2 cores and 50 GB each (note that the memory in the config is the total per job):
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue='regular',
    project="myproj",
    cores=20,
    processes=10,
    memory="500 GB"
)
cluster.scale(jobs=10)  # ask for 10 jobs
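To make the job/worker mapping concrete, dask-jobqueue can print the batch script it will submit for each job; a quick sketch continuing from the cluster above (the exact worker command line depends on your dask-jobqueue version):

print(cluster.job_script())  # shows the sbatch directives and the dask worker
                             # command that starts the 10 worker processes per job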

Related

How to effectively run tasks in parallel in PySpark

I am working on writing a framework that basically does a data sanity check. I have a set of inputs like
{
    "check_1": [
        sql_query_1,
        sql_query_2
    ],
    "check_2": [
        sql_query_1,
        sql_query_2
    ],
    "check_3": [
        sql_query_1,
        sql_query_2
    ],
    ...
    "check_100": [
        sql_query_1,
        sql_query_2
    ]
}
As you can see, there are 100 checks, and each check comprises at most 2 SQL queries. The idea is that we get the data from the SQL queries and do some diff for the data quality check.
Currently, I am running check_1, then check_2, and so on, which is very slow. I tried to use the joblib library to parallelize the task but got an erroneous result. I have also come to learn that it is not a good idea to use multithreading in PySpark.
How can I achieve parallelism here? My idea is to:
run as many checks as I can in parallel
also run the SQL queries in parallel for a particular check, if possible (I tried with joblib but got an erroneous result, more here)
NOTE: The fair scheduler is on in Spark.
Run 100 separate jobs each with their own context/session
Just run each of the 100 checks as a separate Spark job and the fair scheduler should take care of sharing all available resources (memory/CPUs, by default memory) among jobs.
By default, each queue's fair sharing of resources is based on memory (https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Introduction):
See also https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Allocation_file_format
Fair scheduling is a method of assigning resources to applications such that all apps get, on average, an equal share of resources over time. Hadoop NextGen is capable of scheduling multiple resource types. By default, the Fair Scheduler bases scheduling fairness decisions only on memory. It can be configured to schedule with both memory and CPU, using the notion of Dominant Resource Fairness
schedulingPolicy: to set the scheduling policy of any queue. The allowed values are “fifo”/“fair”/“drf” or any class that extends org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.SchedulingPolicy. Defaults to “fair”. If “fifo”, apps with earlier submit times are given preference for containers, but apps submitted later may run concurrently if there is leftover space on the cluster after satisfying the earlier app’s requests.
Submit jobs with separate threads within one context/session
On the other hand, it should be possible to submit multiple jobs within a single application as long as each is submitted from its own thread (e.g. using Python's threading module or a thread pool, as in the sketch below).
From Scheduling within an application
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
See also How to run multiple jobs in one Sparkcontext from separate threads in PySpark?
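As a rough sketch of this second approach (the session setup, pool name, and query strings below are placeholders for your own), the per-check queries can be submitted from a thread pool within one SparkSession, so the fair scheduler can interleave the resulting jobs:

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sanity-checks").getOrCreate()

checks = {
    "check_1": ["SELECT ... FROM table_a", "SELECT ... FROM table_b"],
    # ... up to "check_100"
}

def run_check(name, queries):
    # Local properties are per thread, so assign the fair-scheduler pool here.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "checks")
    # Each collect() triggers a Spark job; jobs submitted from different
    # threads can run concurrently inside the single SparkContext.
    results = [spark.sql(q).collect() for q in queries]
    return name, results

with ThreadPoolExecutor(max_workers=8) as pool:
    for name, results in pool.map(lambda kv: run_check(*kv), checks.items()):
        print(name, "done")   # diff/compare the two result sets here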

Neptune slow to insert data

I'm trying to load data with a Python script that creates 26,000 vertices and their related edges.
Using gremlin-python, the script looks like:
g.V().has('dog', 'name', 'pluto').fold().\
    coalesce(__.unfold(), __.addV('dog').property('name', 'pluto')).store('dog').\
    V().has('person', 'name', 'sam').fold().\
    coalesce(__.unfold(), __.addV('person').property('name', 'sam')).store('person').\
    select('person').unfold().\
    coalesce(__.outE('has_dog').where(__.inV().where(eq('dog')).by(T.id).by(__.unfold().id())),
             __.addE('has_dog').to(__.select('person').unfold())).toList()
In the same transaction I append up to 50 vertices and edges.
If I execute this script on my PC (an i7 with 16 GB RAM) it takes 4-5 minutes, but using a Neptune instance with 8 vCPUs and 32 GB RAM, after 20 minutes the script is only at 10% of execution.
How is it possible that Neptune is so much slower?
Thanks.
Each Neptune instance that you connect to has a pool of worker threads, and that pool will be two times the number of vCPUs on the instance. If you send the queries in a single-threaded fashion, you are only taking advantage of one worker thread. You can substantially increase the throughput rate by dividing the work across multiple tasks in your application. I often use a threading library for this, but even basic Python threads will likely help, as these are I/O-bound tasks and so the threads will yield. I have added millions of vertices and edges using Python in this way. Without doing something like this you are not taking full advantage of the available resources on the instance.
If you have the work already divided up into batches of 50, you can spread those batches across multiple threads. Matching the number of client threads/tasks to two times the number of vCPUs on the Neptune instance is a good place to start.
Ideally the threads will touch different parts of the graph, to avoid concurrently trying to modify the same vertices and edges from concurrent threads.
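A minimal sketch of that pattern (the endpoint and batch contents below are placeholders, and the upsert/coalesce logic from the question is omitted for brevity): pre-divided batches of roughly 50 additions are spread across a thread pool sized at about twice the instance's vCPU count:

from concurrent.futures import ThreadPoolExecutor
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

def load_batch(batch):
    # One connection per batch keeps the example simple; one connection per
    # thread would also work and saves some setup cost.
    conn = DriverRemoteConnection('wss://your-neptune-endpoint:8182/gremlin', 'g')
    g = traversal().withRemote(conn)
    try:
        for dog_name in batch:      # ~50 items per batch
            # the vertex/edge upsert logic from the question goes here;
            # a single addV is shown just to keep the sketch short
            g.addV('dog').property('name', dog_name).iterate()
    finally:
        conn.close()

batches = [...]  # the work, pre-divided into batches of ~50 elements
with ThreadPoolExecutor(max_workers=16) as pool:   # ~2x the Neptune vCPU count
    list(pool.map(load_batch, batches))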

Dask task to worker appointment

I run a computation graph with long functions (several hours each) and big results (several hundred megabytes). This type of load can be atypical for Dask.
I try to run this graph on 4 workers and see the task-to-worker assignment depicted below:
In the first row, the "green" task depends only on a "blue" one, not the "violet" one. Why is the green task not moved to another worker?
Is it possible to give the scheduler hints to always move a task to a free worker? What information do you need, and can I obtain, that would help debug this further?
Such an assignment is non-optimal, and the graph computation takes more time than necessary.
A little more information:
The computation graph is composed using dask.delayed
The computation is invoked using the following code:
to_compute = [result_of_dask_delayed_1, ... , result_of_dask_delayed_n]
running = client.persist(to_compute)
results = as_completed(
    futures_of(running),
    with_results=True,
    raise_errors=not bool(cmdline_options.partial_fails),
)
Firstly, we should state that scheduling tasks to workers is complicated - please read this description. Roughly, when a batch of tasks comes in, they will be distributed to workers, accounting for dependencies.
Dask has no idea, before running, how long any given task will take or how big the data result it generates might be. Once a task is done, this information is available for the completed task (but not for the ones still waiting to run), and Dask then needs to decide whether to steal an allotted task for an idle worker and copy across the data it needs to run. Please read the link to see how this is done.
Finally, you can indeed have more fine-grained control over where things are run; see, e.g., this section.
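For example (the worker address below is hypothetical), the workers= keyword accepted by submit/persist/compute lets you pin work to particular workers, with allow_other_workers as an escape hatch:

from dask.distributed import Client

client = Client(cluster)   # assumes an existing cluster/scheduler

# Pin a long task to a specific worker; allow_other_workers=True lets the
# scheduler fall back to any worker if that one is busy or unavailable.
future = client.submit(
    long_function, some_input,            # placeholders for your own work
    workers=["tcp://10.0.0.5:34567"],     # hypothetical worker address
    allow_other_workers=True,
)

# The same keyword is accepted by client.persist / client.compute:
running = client.persist(to_compute, workers=["tcp://10.0.0.5:34567"])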

Dask: Jobs on multiple nodes with one worker, run on one node only

I am trying to process some files using a Python function and would like to parallelize the task on a PBS cluster using Dask. On the cluster I can only launch one job but have access to 10 nodes with 24 cores each.
So my dask PBSCluster looks like:
import dask
from dask_jobqueue import PBSCluster

cluster = PBSCluster(cores=240,
                     memory="1GB",
                     project='X',
                     queue='normal',
                     local_directory='$TMPDIR',
                     walltime='12:00:00',
                     resource_spec='select=10:ncpus=24:mem=1GB',
                     )
cluster.scale(1)  # one worker

from dask.distributed import Client
client = Client(cluster)
client
After this, the cluster in Dask shows 1 worker with 240 cores (not sure if that makes sense).
When I run
result = compute(*foo, scheduler='distributed')
and access the allocated nodes, only one of them is actually running the computation. I am not sure if I am using the right PBS configuration.
cluster = PBSCluster(cores=240,
                     memory="1GB",
The values you give to the Dask Jobqueue constructors are the values for a single job for a single node. So here you are asking for a node with 240 cores, which probably doesn't make sense today.
If you can only launch one job then dask-jobqueue's model probably won't work for you. I recommend looking at dask-mpi as an alternative.
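A rough sketch of the dask-mpi route (assuming dask-mpi and mpi4py are installed on the cluster): everything runs inside the single PBS job, launched with mpiexec, and dask-mpi splits the MPI ranks into a scheduler, a client process, and workers:

# Launched inside the single PBS job, e.g.:  mpiexec -n 240 python this_script.py
from dask_mpi import initialize
from dask.distributed import Client

initialize()        # rank 0 becomes the scheduler, rank 1 keeps running this
                    # script as the client, and the remaining ranks become workers
client = Client()   # connects to the scheduler started by initialize()
# ... then call compute(*foo) as in the question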

Dask with HTCondor scheduler

Background
I have an image analysis pipeline with parallelised steps. The pipeline is in Python and the parallelisation is controlled by dask.distributed. The minimum processing setup has 1 scheduler + 3 workers with 15 processes each. In the first, short step of the analysis I use 1 process per worker but all the RAM of the node; then in all other analysis steps all nodes and processes are used.
Issue
The admin will install HTCondor as a scheduler for the cluster.
Thought
In order to have my code running on the new setup I was planning to use the approach shown in the dask manual for SGE, because the cluster has a shared network file system.
# Job 1: start a dask-scheduler somewhere and write connection information to a file
qsub -b y /path/to/dask-scheduler --scheduler-file /path/to/scheduler.json

# Job 2: start 100 dask-worker processes in an array job pointing to the same file
qsub -b y -t 1-100 /path/to/dask-worker --scheduler-file /path/to/scheduler.json

# Job 3: start a process with the Python code where the client is started this way
client = Client(scheduler_file='/path/to/scheduler.json')
Question and advice
If I understood correctly, with this approach I will start the scheduler, workers, and analysis as independent jobs (different HTCondor submit files). How can I make sure that the order of execution will be correct? Is there a way I can use the same processing approach I have been using before, or will it be more efficient to translate the code to work better with HTCondor?
Thanks for the help!
HTCondor support has been merged into dask-jobqueue (https://github.com/dask/dask-jobqueue/pull/245) and should now be available as HTCondorCluster(cores=1, memory='100MB', disk='100MB').
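A minimal sketch of using it, with resource values chosen to roughly mirror the 3-worker/15-process setup described above (adjust them to your cluster):

from dask_jobqueue import HTCondorCluster
from dask.distributed import Client

# dask-jobqueue submits the HTCondor jobs and wires the workers to the
# scheduler for you, so there is no scheduler-file ordering to manage.
cluster = HTCondorCluster(cores=15, memory="60GB", disk="10GB")
cluster.scale(jobs=3)      # three jobs, matching the three workers used before
client = Client(cluster)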
