I have a system where a REST API (Flask) uses spark-submit to send a job to an already-running PySpark application.
For various reasons, I need Spark to run all tasks at the same time (i.e. I need to set the number of executors equal to the number of tasks at runtime).
For example, if I have twenty tasks and only 4 cores, I want each core to execute 5 tasks (executors) without having to restart Spark.
I know I can set the number of executors when starting Spark, but I don't want to do that since Spark is executing other jobs.
Is this possible to achieve through a workaround?
Use Spark scheduler pools. Here is an example of running multiple streaming queries in separate scheduler pools, copied here for convenience from the end of the article below; the same logic works for DStreams too, and a PySpark sketch follows the snippet:
https://docs.databricks.com/spark/latest/structured-streaming/production.html
// Run streaming query1 in scheduler pool1
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
df.writeStream.queryName("query1").format("parquet").start(path1)
// Run streaming query2 in scheduler pool2
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
df.writeStream.queryName("query2").format("orc").start(path2)
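In PySpark, a rough equivalent for batch jobs triggered from Flask could look like the sketch below. This is only an illustration: it assumes fair scheduling was enabled when the application started (e.g. spark-submit --conf spark.scheduler.mode=FAIR), that spark, df, path1 and path2 already exist in your job code, and that your Spark version propagates per-thread local properties from Python threads correctly:
import threading

def write_in_pool(pool, fmt, path):
    # each concurrent job is tagged with its own scheduler pool
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool)
    df.write.format(fmt).save(path)

t1 = threading.Thread(target=write_in_pool, args=("pool1", "parquet", path1))
t2 = threading.Thread(target=write_in_pool, args=("pool2", "orc", path2))
t1.start(); t2.start()
t1.join(); t2.join()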
Related
I'm new to Airflow.
I'm considering setting up multiple Airflow schedulers (CeleryExecutor), but I'm curious about how multiple schedulers operate.
How do multiple schedulers schedule the serialized DAGs in the metadata database?
Are there any rules for them? Which scheduler gets which DAG, and by what rules?
Is there any load balancing across multiple schedulers?
Answers to these questions would be very helpful. Thanks.
Airflow does not provide a magic mechanism to synchronize the different schedulers, and there is no load balancing; instead, scheduling is done in batches, which allows all schedulers to work together on scheduling runs and task instances.
The Airflow scheduler runs in an infinite loop. In each scheduling loop, a scheduler creates DAG runs (in the queued state) for up to max_dagruns_to_create_per_loop DAGs, checks up to max_dagruns_per_loop_to_schedule DAG runs to see whether they can be scheduled (queued -> scheduled), starting with the runs with the smallest execution dates, and tries to schedule up to max_tis_per_query task instances (queued -> scheduled).
All of these selected objects (DAGs, runs, and task instances) are locked in the database by the scheduler that picked them, so they are not visible to the others; the other schedulers do the same thing with different objects.
With a small number of DAGs, DAG runs, or task instances, using large values for these three options may result in all the scheduling being done by a single scheduler.
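For reference, these options live in the [scheduler] section of airflow.cfg; the key names below are from Airflow 2.x and the values are only illustrative, so check the configuration reference for your version:
[scheduler]
max_dagruns_to_create_per_loop = 10
max_dagruns_per_loop_to_schedule = 20
max_tis_per_query = 512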
I am using a SLURM cluster with Dask and don't quite understand the configuration part. The documentation talks of jobs and workers and even has a section on the difference:
In dask-distributed, a Worker is a Python object and node in a dask Cluster that serves two purposes, 1) serve data, and 2) perform computations. Jobs are resources submitted to, and managed by, the job queueing system (e.g. PBS, SGE, etc.). In dask-jobqueue, a single Job may include one or more Workers.
The problem is I still don't get it. I use the word task to refer to a single function one submits using a client, i.e. with a client.submit(task, *params) call.
My understanding of how Dask works is that there are n_workers set up and that each task is submitted to a pool of said workers. Any worker works on one task at a given time, potentially using multiple threads and processes.
However, my understanding does not leave any room for the term job and is thus certainly wrong. Moreover, most configuration of the cluster (cores, memory, processes) is done on a per-job basis according to the docs.
So my question is: what is a job? Can anyone explain in simpler terms its relation to a task and a worker? And how do the cores, memory, processes, and n_workers configurations interact? (I have read the docs, I just don't understand them and could use another explanation.)
Your understanding of tasks and workers is correct. A job is a concept specific to SLURM (and other HPC clusters where users submit jobs). A job consists of the instructions for what to execute and what resources are needed, so the typical workflow of a SLURM user is to write a script and then submit it for execution using salloc or sbatch.
One can submit a job with instructions to launch multiple dask-workers (there might be advantages to this due to latency, permissions, resource availability, etc., but this would need to be determined from the particular cluster configuration).
From the Dask perspective what matters is the number of workers, but from the dask-jobqueue perspective the number of jobs also matters. For example, if the number of workers per job is 2, then to get 10 workers in total dask-jobqueue will submit 5 jobs to the SLURM scheduler.
This example, adapted from the docs, will result in 10 dask-workers, each with 24 cores:
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue='regular',
    project="myproj",
    cores=24,
    processes=1,
    memory="500 GB"
)
cluster.scale(jobs=10) # ask for 10 jobs
If we specify multiple processes, then the total number of workers will be jobs * processes (assuming sufficient cores), so the following will give 100 workers with 2 cores each and 50 GB per worker (note that the memory in the config is the total per job):
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue='regular',
    project="myproj",
    cores=20,
    processes=10,
    memory="500 GB"
)
cluster.scale(jobs=10) # ask for 10 jobs
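Whichever way the workers are grouped into jobs, the client-side code stays the same. A minimal sketch, reusing the task/*params placeholders from the question:
from dask.distributed import Client

client = Client(cluster)               # connect to the SLURMCluster defined above
future = client.submit(task, *params)  # a "task" in the question's sense, run on some worker
result = future.result()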
I am working on writing a framework that basically does a data sanity check. I have a set of inputs like
{
    "check_1": [
        sql_query_1,
        sql_query_2
    ],
    "check_2": [
        sql_query_1,
        sql_query_2
    ],
    "check_3": [
        sql_query_1,
        sql_query_2
    ],
    ...
    "check_100": [
        sql_query_1,
        sql_query_2
    ]
}
As you can see, there are 100 checks, and each check comprises at most 2 SQL queries. The idea is that we get the data from the SQL queries and diff the results as a data quality check.
Currently, I am running check_1, then check_2, and so on, which is very slow. I tried to use the joblib library to parallelize the task but got erroneous results. I have also come to know that it is not a good idea to use multithreading in PySpark.
How can I achieve parallelism here? My idea is to
run as many checks as I can in parallel, and
also run the SQL queries in parallel for a particular check if possible (I tried with joblib, but got erroneous results, more here).
NOTE: The fair scheduler is on in Spark.
Run 100 separate jobs each with their own context/session
Just run each of the 100 checks as a separate Spark job, and the fair scheduler should take care of sharing all available resources (memory/CPUs; by default, memory) among the jobs; a rough sketch follows the quotes below.
By default, fair sharing of resources within each queue is based on memory (https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Introduction):
Fair scheduling is a method of assigning resources to applications such that all apps get, on average, an equal share of resources over time. Hadoop NextGen is capable of scheduling multiple resource types. By default, the Fair Scheduler bases scheduling fairness decisions only on memory. It can be configured to schedule with both memory and CPU, using the notion of Dominant Resource Fairness.
See also https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Allocation_file_format:
schedulingPolicy: to set the scheduling policy of any queue. The allowed values are “fifo”/“fair”/“drf” or any class that extends org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.SchedulingPolicy. Defaults to “fair”. If “fifo”, apps with earlier submit times are given preference for containers, but apps submitted later may run concurrently if there is leftover space on the cluster after satisfying the earlier app’s requests.
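A minimal sketch of this option, assuming each check has been packaged as its own driver script (run_check.py is a hypothetical name) so that every check becomes a separate Spark application, and reusing the checks dict from the question:
import subprocess

# one spark-submit, i.e. one Spark application, per check;
# the fair scheduler then shares cluster resources among them.
# In practice you may want to throttle how many run at once.
procs = [
    subprocess.Popen(["spark-submit", "run_check.py", name])
    for name in checks  # "check_1" ... "check_100"
]
for p in procs:
    p.wait()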
Submit jobs with separate threads within one context/session
On the other hand, it should be possible to submit multiple jobs within a single application as long as each is submitted from its own thread, e.g. using Python's threading or concurrent.futures modules (a sketch follows at the end of this answer).
From Scheduling within an application
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
See also How to run multiple jobs in one Sparkcontext from separate threads in PySpark?
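For the checks above, a minimal sketch of this second option, assuming an existing SparkSession named spark, the checks dict from the question, and a Spark version that handles per-thread local properties from Python threads; the actual diff logic is left out:
from concurrent.futures import ThreadPoolExecutor

def run_check(name, queries):
    # each thread tags its Spark jobs with its own fair-scheduler pool
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", name)
    dfs = [spark.sql(q) for q in queries]
    # ... compare the resulting data frames here ...
    return name, dfs

with ThreadPoolExecutor(max_workers=8) as executor:
    results = dict(executor.map(lambda item: run_check(*item), checks.items()))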
I have around 70 Hive queries which I am executing in PySpark in sequence. I am looking at ways to improve the runtime by running the Hive queries in parallel. I am planning to do this by creating Python threads and running sqlContext.sql in the threads. Would this create threads in the driver and improve performance?
I am assuming you do not have any dependencies between these Hive queries, so they can run in parallel. You can accomplish this with threading, but I am not sure of the benefit in a single-user application, because the total amount of resources is fixed for your cluster; i.e., the total time to finish all the queries will be the same, as the Spark scheduler will round-robin across these individual jobs when you multi-thread them.
https://spark.apache.org/docs/latest/job-scheduling.html explains this
1) Spark by default uses a FIFO scheduler (which you are observing).
2) With threading you can use the "fair" scheduler instead.
3) Ensure the method that is being threaded sets this:
sc.setLocalProperty("spark.scheduler.pool", <pool_id>)
4) The pool id needs to be different for each thread.
Example use case of threading from a code perspective:
import threading
from pyspark import SparkConf, SparkContext

# set the Spark context to use the fair scheduler mode
conf = SparkConf().setMaster(...).setAppName(...)
conf.set("spark.scheduler.mode", "FAIR")
sc = SparkContext(conf=conf)

# collected data frames, keyed by pool id (Thread.join() does not return a value)
results = {}

# runs a query in the given scheduler pool and stores the resulting data frame
def runQuery(sc, pool_id, query):
    sc.setLocalProperty("spark.scheduler.pool", pool_id)
    # ..... <your code that produces df> .....
    results[pool_id] = df

t1 = threading.Thread(target=runQuery, args=(sc, "1", <query1>))
t2 = threading.Thread(target=runQuery, args=(sc, "2", <query2>))

# start the threads...
t1.start()
t2.start()

# wait for the threads to complete and pick up the data frames...
t1.join()
t2.join()
df1, df2 = results["1"], results["2"]
As the Spark documentation indicates, you will not observe an improvement in overall throughput; this is suited for multi-user sharing of resources. Hope this helps.
Background
I have an image analysis pipeline with parallelised steps. The pipeline is in Python and the parallelisation is controlled by dask.distributed. The minimum processing setup has 1 scheduler + 3 workers with 15 processes each. In the first, short step of the analysis I use 1 process per worker but all the RAM of the node; in all other analysis steps all nodes and processes are used.
Issue
The admin will install HTCondor as a scheduler for the cluster.
Thought
In order to have my code running on the new setup, I was planning to use the approach shown in the Dask manual for SGE, because the cluster has a shared network file system.
# Job 1
# Start a dask-scheduler somewhere and write connection information to file
qsub -b y /path/to/dask-scheduler --scheduler-file /path/to/scheduler.json
# Job 2
# Start 100 dask-worker processes in an array job pointing to the same file
qsub -b y -t 1-100 /path/to/dask-worker --scheduler-file /path/to/scheduler.json
# Job 3
# Start a process with the Python code where the client is started this way
from dask.distributed import Client
client = Client(scheduler_file='/path/to/scheduler.json')
Question and advice
If I understood correctly, with this approach I will start the scheduler, the workers, and the analysis as independent jobs (different HTCondor submit files). How can I make sure that the order of execution will be correct? Is there a way I can use the same processing approach I have been using before, or will it be more efficient to translate the code to work better with HTCondor?
Thanks for the help!
HTCondor support has been merged into dask-jobqueue (https://github.com/dask/dask-jobqueue/pull/245) and should now be available in Dask JobQueue, e.g. HTCondorCluster(cores=1, memory='100MB', disk='100MB').
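A minimal sketch of what that might look like; the resource values are just the placeholders from the line above, and the scale/Client calls follow the standard dask-jobqueue pattern:
from dask_jobqueue import HTCondorCluster
from dask.distributed import Client

cluster = HTCondorCluster(cores=1, memory='100MB', disk='100MB')
cluster.scale(jobs=3)     # ask HTCondor for 3 jobs, each starting worker process(es)
client = Client(cluster)  # the analysis code then submits work to these workers as before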