I am writing a framework that performs data sanity checks. I have a set of inputs like:
{
"check_1": [
sql_query_1,
sql_query_2
],
"check_2": [
sql_query_1,
sql_query_2
],
"check_3": [
sql_query_1,
sql_query_2
]
.
.
.
"check_100": [
sql_query_1,
sql_query_2
]
}
As you can see, there are 100 checks, and each check consists of at most 2 SQL queries. The idea is to fetch the data from the SQL queries and diff the results as a data quality check.
Currently, I am running check_1, then check_2, and so on, which is very slow. I tried the joblib library to parallelize the task but got erroneous results. I have also read that it is not a good idea to use multithreading in PySpark.
How can I achieve parallelism here? My idea is to:
run as many checks as I can in parallel
also run the SQL queries in parallel within a particular check, if possible (I tried with joblib but got erroneous results, more here)
NOTE: The fair scheduler is enabled in Spark.
Run 100 separate jobs, each with its own context/session
Just run each of the 100 checks as a separate Spark job and the fair scheduler should take care of sharing all available resources (memory/CPUs, by default memory) among jobs.
By default, each queue bases fair sharing of resources on memory (https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Introduction):
See also https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Allocation_file_format
Fair scheduling is a method of assigning resources to applications such that all apps get, on average, an equal share of resources over time. Hadoop NextGen is capable of scheduling multiple resource types. By default, the Fair Scheduler bases scheduling fairness decisions only on memory. It can be configured to schedule with both memory and CPU, using the notion of Dominant Resource Fairness
schedulingPolicy: to set the scheduling policy of any queue. The allowed values are “fifo”/“fair”/“drf” or any class that extends org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.SchedulingPolicy. Defaults to “fair”. If “fifo”, apps with earlier submit times are given preference for containers, but apps submitted later may run concurrently if there is leftover space on the cluster after satisfying the earlier app’s requests.
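A minimal sketch of this approach, assuming each check lives in its own driver script (here called run_check.py) and is launched as an independent application via spark-submit; the script name, queue name and degree of parallelism are placeholders, not something prescribed above:

import subprocess
from concurrent.futures import ThreadPoolExecutor

checks = [f"check_{i}" for i in range(1, 101)]

def submit_check(check_name):
    # Each call starts a separate Spark application; the fair scheduler
    # shares cluster resources among the concurrently running applications.
    return subprocess.run(
        ["spark-submit", "--queue", "checks", "run_check.py", check_name],
        capture_output=True, text=True,
    )

# Limit how many applications are in flight at any one time.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(submit_check, checks))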
Submit jobs with separate threads within one context/session
On the other hand, it should be possible to submit multiple jobs within a single application as long as each is submitted from its own thread. I assume one would use Python threads (e.g. the threading module or a ThreadPoolExecutor) for this, since the jobs must share the same SparkContext.
From Scheduling within an application
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
See also How to run multiple jobs in one Sparkcontext from separate threads in PySpark?
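For illustration, a minimal sketch of the thread-based approach with a single shared SparkSession, tagging each check with its own fair-scheduler pool; the pool names, query strings and the "diff" logic are placeholders:

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sanity-checks")
         .config("spark.scheduler.mode", "FAIR")
         .getOrCreate())

def run_check(name, queries):
    # Each thread submits its own Spark jobs; assign them to a scheduler pool.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", name)
    dfs = [spark.sql(q) for q in queries]
    # Placeholder "diff": just count the rows returned by each query.
    return name, [df.count() for df in dfs]

checks = {"check_1": ["SELECT 1", "SELECT 1"], "check_2": ["SELECT 1"]}  # placeholder input

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(run_check, name, qs) for name, qs in checks.items()]
    results = [f.result() for f in futures]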
Related
I'm new to Airflow.
I'm considering running multiple Airflow schedulers (CeleryExecutor).
But I'm curious about how multiple schedulers operate:
How do multiple schedulers schedule the serialized DAGs in the metadata database?
Are there any rules for them? Which scheduler gets which DAG, and under which rules?
Is there any load balancing between multiple schedulers?
If you can answer these questions, it'll be very helpful. Thanks.
Airflow does not provide a magic solution to synchronize the different schedulers, and there is no load balancing, but it does batch scheduling so that all schedulers can work together to schedule runs and task instances.
The Airflow scheduler runs in an infinite loop. In each scheduling loop, the scheduler creates dag runs for max_dagruns_to_create_per_loop dags (just creating dag runs in the queued state), checks max_dagruns_per_loop_to_schedule dag runs to see if they can be scheduled (queued -> scheduled), starting with the runs with the smallest execution dates, and tries to schedule max_tis_per_query task instances (queued -> scheduled).
All of these selected objects (dags, runs and task instances) are locked in the DB by the scheduler and are not visible to the others, so the other schedulers do the same thing with other objects.
With a small number of dags, dag runs or task instances, using big values for these 3 configurations may lead to all scheduling being done by a single scheduler.
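For reference, these three options live in the [scheduler] section of airflow.cfg; the values below are only illustrative, not recommendations:

[scheduler]
max_dagruns_to_create_per_loop = 10
max_dagruns_per_loop_to_schedule = 20
max_tis_per_query = 512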
I am using a SLURM cluster with Dask and don't quite understand the configuration part. The documentation talks of jobs and workers and even has a section on the difference:
In dask-distributed, a Worker is a Python object and node in a dask Cluster that serves two purposes, 1) serve data, and 2) perform computations. Jobs are resources submitted to, and managed by, the job queueing system (e.g. PBS, SGE, etc.). In dask-jobqueue, a single Job may include one or more Workers.
The problem is I still don't get it. I use the word task to refer to a single function one submits using a client, i.e. with a client.submit(task, *params) call.
My understanding of how Dask works is that there are n_workers set up and that each task is submitted to a pool of said workers. Any worker works on one task at a given time, potentially using multiple threads and processes.
However, my understanding does not leave any room for the term job and is thus certainly wrong. Moreover, most configuration of the cluster (cores, memory, processes) is done on a per-job basis according to the docs.
So my question is: what is a job? Can anyone explain in simpler terms its relation to a task and a worker? And how do the cores, memory, processes, and n_workers configurations interact? (I have read the docs, I just don't understand them and could use another explanation.)
Your understanding of tasks and workers is correct. A job is a concept specific to SLURM (and other HPC clusters where users submit jobs). A job consists of the instructions for what to execute and what resources are needed, so the typical workflow of a SLURM user is to write a script and then submit it for execution using salloc or sbatch.
One can submit a job with instructions to launch multiple dask-workers (there might be advantages to this due to latency, permissions, resource availability, etc., but this would need to be determined from the particular cluster configuration).
From the dask perspective, what matters is the number of workers, but from the dask-jobqueue perspective, the number of jobs also matters. For example, if the number of workers per job is 2, then to get 10 workers in total dask-jobqueue will submit 5 jobs to the SLURM scheduler.
This example, adapted from the docs, will result in 10 dask-workers, each with 24 cores:
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue='regular',
    project="myproj",
    cores=24,
    processes=1,
    memory="500 GB"
)
cluster.scale(jobs=10)  # ask for 10 jobs
If we specify multiple processes, then the total number of workers will be jobs * processes (assuming sufficient cores), so the following will give 100 workers with 2 cores and 50 GB each (note that the memory in the config is the total per job):
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue='regular',
    project="myproj",
    cores=20,
    processes=10,
    memory="500 GB"
)
cluster.scale(jobs=10)  # ask for 10 jobs
This is a more general question about how to run "embarrassingly parallel" problems with Python "schedulers" in a science environment.
I have a code that is a Python/Cython/C hybrid (for this example I'm using github.com/tardis-sn/tardis, but I have more such problems with other codes) that is internally OpenMP-parallelized. It provides a single function that takes a parameter dictionary and evaluates to an object within a few hundred seconds running on ~8 cores (result = fun(paramset, calibdata), where paramset is a dict, result is an object (basically a collection of pandas and numpy arrays) and calibdata is a pre-loaded pandas dataframe/object). It logs using the standard Python logging module.
I would like a python framework that can easily evaluate ~10-100k parameter sets using fun on a SLURM/TORQUE/... cluster environment.
Ideally, this framework would automatically spawn workers (given availability with a few cores each) and distribute the parameter sets between the workers (different parameter sets take different amount of time). It would be nice to see the state (in_queue, running, finished, failed) for each of the parameter-sets as well as logs (if it failed or finished or is running).
It would be nice if it kept track of what is finished and what still needs to be done, so that I can restart if my scheduler task fails. It would also be nice if this integrated seamlessly into Jupyter notebooks and ran locally for testing.
I have tried dask, but it does not seem to queue the tasks; it runs them all at once with client.map(fun, [list of parameter sets]). Maybe there are better tools, or maybe this is a very niche problem. It's also unclear to me what the difference between dask, joblib and ipyparallel is (having quickly tried all three of them at various stages).
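A rough sketch of the dask pattern I'm describing (the cluster parameters are placeholders for my setup; fun, paramsets and calibdata are as defined above):

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(queue="regular", cores=8, processes=1, memory="16 GB")
cluster.scale(jobs=4)  # ask SLURM for 4 jobs, i.e. 4 workers here
client = Client(cluster)

futures = client.map(fun, paramsets, calibdata=calibdata)  # one future per parameter set
results = client.gather(futures)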
Happy to give additional info if things are not clear.
UPDATE: dask seems to provide some of the functionality I require, but dealing with an OpenMP-parallelized code on top of dask is not straightforward; see issue https://github.com/dask/dask-jobqueue/issues/181
My scenario is as follows: I have a large machine learning model that is computed by a bunch of workers. In essence, the workers compute their own part of the model and then exchange results in order to maintain a globally consistent model state.
So every Celery task computes its own part of the job.
But this means the tasks aren't stateless, and here is my trouble: if I call some_task.delay(123, 456), in reality I'm NOT sending just two integers!
I'm sending the whole state of the task, which is pickled somewhere in Celery. This state is typically about 200 MB :-((
I know that it's possible to select a decent serializer in Celery, but my question is how to avoid pickling whatever other data happens to be attached to the task.
How do I pickle only the task's arguments?
Here is a citation from celery/app/task.py:
def __reduce__(self):
    # - tasks are pickled into the name of the task only, and the reciever
    # - simply grabs it from the local registry.
    # - in later versions the module of the task is also included,
    # - and the receiving side tries to import that module so that
    # - it will work even if the task has not been registered.
    mod = type(self).__module__
    mod = mod if mod and mod in sys.modules else None
    return (_unpickle_task_v2, (self.name, mod), None)
I simply don't want this to happen.
Is there a simple way around it, or am I just forced to build my own Celery (which is ugly to imagine)?
Don't use the celery results backend for this. Use a separate data store.
While you could just use Task.ignore_result, this would mean that you lose the ability to track the task's status, etc.
The best solution would be to use one storage engine (e.g. Redis) for your results backend.
You should set up a separate storage engine (a separate instance of Redis, or maybe something like MongoDB, depending on your needs) to store the actual data.
In this way you can still see the status of your tasks but the large data sets do not affect the operation of celery.
Switching to the JSON serializer may reduce the serialization overhead, depending on the format of the data you generate. However, it can't solve the underlying problem of putting too much data through the results backend.
The results backend can handle relatively small amounts of data - once you go over a certain limit you start to prevent the proper operation of its primary tasks - the communication of task status.
I would suggest updating your tasks so that they return a lightweight data structure containing useful metadata (e.g. to facilitate coordination between tasks), and storing the "real" data in a dedicated storage solution.
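A minimal sketch of that pattern, assuming Redis for both the results backend and a separate payload store; the URLs and the train_part function are placeholders, not from the question:

import pickle
import uuid

import redis
from celery import Celery

app = Celery("ml", broker="redis://localhost:6379/0", backend="redis://localhost:6379/0")
# A separate Redis database (or instance) dedicated to the large payloads.
data_store = redis.Redis.from_url("redis://localhost:6379/1")

@app.task
def compute_partial_model(part_id):
    result = train_part(part_id)  # hypothetical heavy computation producing ~200 MB of arrays
    key = f"model-part:{part_id}:{uuid.uuid4()}"
    data_store.set(key, pickle.dumps(result))
    # Only lightweight metadata goes through the Celery results backend.
    return {"part_id": part_id, "key": key}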
You have to set ignore_result on your task, as it says in the docs:
Task.ignore_result
Don’t store task state. Note that this means you can’t use AsyncResult to check if the task is ready, or get its return value.
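For illustration, a minimal sketch of setting it (the broker URL and task body are placeholders):

from celery import Celery

app = Celery("ml", broker="redis://localhost:6379/0")

@app.task(ignore_result=True)
def compute_partial_model(part_id):
    # do the heavy work here; nothing is written to the results backend
    ...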
This is a little off-topic, but still.
Here is what I understand to be happening: you have several processes that do heavy calculations in parallel, with inter-process communication. So, instead of Celery, which is unsatisfying in your case, you could:
use zmq for inter-process communication (to send only the necessary data),
use supervisor for managing and running the processes (numprocs in particular will help with running multiple identical workers).
While this will not require writing your own Celery, some code will have to be written.
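A minimal sketch of the zmq idea, as two separate processes using a PUSH/PULL socket pair; the port and the payload shape are placeholders:

# producer.py - sends each worker only the data it actually needs
import zmq

ctx = zmq.Context()
sender = ctx.socket(zmq.PUSH)
sender.bind("tcp://127.0.0.1:5557")
sender.send_pyobj({"part_id": 3, "params": [123, 456]})

# worker.py - run as a separate process, e.g. under supervisor with numprocs > 1
import zmq

ctx = zmq.Context()
receiver = ctx.socket(zmq.PULL)
receiver.connect("tcp://127.0.0.1:5557")
work = receiver.recv_pyobj()
# ... do the heavy computation on `work` and send the result onward ...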
I need to manage a large workflow of ETL tasks whose execution depends on time, data availability, or an external event. Some jobs may fail during execution of the workflow, and the system should be able to restart a failed workflow branch without waiting for the whole workflow to finish.
Are there any frameworks in python that can handle this?
I see several core functions:
DAG building
Execution of nodes (run shell commands with wait, logging, etc.)
Ability to rebuild sub-graph in parent DAG during execution
Ability to manually execute nodes or sub-graphs while the parent graph is running
Suspend graph execution while waiting for external event
List job queue and job details
Something like Oozie, but more general-purpose and in Python.
1) You can give dagobah a try, as described on its GitHub page: Dagobah is a simple dependency-based job scheduler written in Python. Dagobah allows you to schedule periodic jobs using Cron syntax. Each job then kicks off a series of tasks (subprocesses) in an order defined by a dependency graph you can easily draw with click-and-drag in the web interface. This is the most lightweight scheduler project compared with the following three.
2) In terms of ETL tasks, luigi, which is open sourced by Spotify, focuses more on Hadoop jobs, as described: Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, etc. It also comes with Hadoop support built in.
Both modules are mainly written in Python, and web interfaces are included for convenient management.
As far as I know, 'luigi' doesn't provide a scheduler module for jobs, which I think is necessary for ETL tasks. But 'luigi' makes it easier to write map-reduce code in Python, and thousands of tasks at Spotify depend on it every day.
3) Like luigi, Pinterest open sourced a workflow manager named Pinball. Pinball's architecture follows a master-worker (or master-client, to avoid naming confusion with a special type of client that they introduce) paradigm, where the stateful central master acts as a source of truth about the current system state for stateless clients. It integrates hadoop/hive/spark jobs smoothly.
4) Airflow, yet another DAG job scheduling project, open sourced by Airbnb, is quite like Luigi and Pinball. The backend is built on Flask, Celery and so on. Judging from the example job code, Airflow is both powerful and easy to use in my experience.
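For illustration, a minimal Airflow DAG sketch (assuming Airflow 2.x; the dag_id and task commands are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")
    extract >> transform >> load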
Last but not least, Luigi, Airflow and Pinball may be more widely used. And there is a great comparison among these three: http://bytepawn.com/luigi-airflow-pinball.html
There are a ton of these; everyone seems to write their own. There is a good list at https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems, which includes systems that originate in both industry and academia.
Have you looked at Ruffus?
I have no experience with it but it appears to do some of the items on your list. It also looks quite hackable so you might be able to implement your other requirements yourself.