I'm new to Airflow.
I'm considering running multiple Airflow schedulers (CeleryExecutor), but I'm curious about how multiple schedulers operate.
How do multiple schedulers handle the serialized DAGs in the metadata database?
Are there any rules for them? Which scheduler picks up which DAG, and under what rules?
Is there any load balancing between multiple schedulers?
Answers to these questions would be very helpful. Thanks.
Airflow does not provide a magic solution to synchronize the different schedulers, and there is no load balancing, but scheduling is done in batches, which lets all schedulers work together to schedule runs and task instances.
The Airflow scheduler runs in an infinite loop. In each scheduling loop it creates DAG runs for up to max_dagruns_to_create_per_loop DAGs (the runs are simply created in the queued state), checks up to max_dagruns_per_loop_to_schedule DAG runs to see whether they can be scheduled (queued -> scheduled), starting with the runs that have the smallest execution dates, and tries to schedule up to max_tis_per_query task instances (queued -> scheduled).
All of the selected objects (DAGs, runs and TIs) are locked in the database by the scheduler and are not visible to the others, so the other schedulers do the same thing with other objects.
If you have a small number of DAGs, DAG runs or task instances, using big values for these three configurations may lead to all the scheduling being done by a single scheduler.
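For reference, these three settings live in the [scheduler] section of airflow.cfg; the values below are purely illustrative, not recommendations:

[scheduler]
# how many DAGs get new DAG runs created per scheduler loop
max_dagruns_to_create_per_loop = 10
# how many queued DAG runs are examined for scheduling per loop
max_dagruns_per_loop_to_schedule = 20
# how many task instances are scheduled per query
max_tis_per_query = 512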
I am working on writing a framework that basically does a data sanity check. I have a set of inputs like
{
"check_1": [
sql_query_1,
sql_query_2
],
"check_2": [
sql_query_1,
sql_query_2
],
"check_3": [
sql_query_1,
sql_query_2
]
.
.
.
"check_100": [
sql_query_1,
sql_query_2
]
}
As you can see, there are 100 checks, and each check consists of at most 2 SQL queries. The idea is that we get the data from the SQL queries and diff it for the data quality check.
Currently I run check_1, then check_2, and so on, which is very slow. I tried the joblib library to parallelize the work but got erroneous results, and I have also come to learn that multithreading is not a good idea in PySpark.
How can I achieve parallelism here? My idea is to:
run as many checks as I can in parallel, and
also run the SQL queries in parallel within a particular check, if possible (I tried with joblib but got an erroneous result, more here).
NOTE: The fair scheduler is enabled in Spark.
Run 100 separate jobs each with their own context/session
Just run each of the 100 checks as a separate Spark job and the fair scheduler should take care of sharing all available resources (memory/CPUs, by default memory) among jobs.
By default, fair sharing of resources within each queue is based on memory (https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Introduction):
See also https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Allocation_file_format
Fair scheduling is a method of assigning resources to applications such that all apps get, on average, an equal share of resources over time. Hadoop NextGen is capable of scheduling multiple resource types. By default, the Fair Scheduler bases scheduling fairness decisions only on memory. It can be configured to schedule with both memory and CPU, using the notion of Dominant Resource Fairness
schedulingPolicy: to set the scheduling policy of any queue. The allowed values are “fifo”/“fair”/“drf” or any class that extends org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.SchedulingPolicy. Defaults to “fair”. If “fifo”, apps with earlier submit times are given preference for containers, but apps submitted later may run concurrently if there is leftover space on the cluster after satisfying the earlier app’s requests.
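As a rough sketch of this first option (one spark-submit, and hence one session, per check), you could launch the drivers concurrently and let the fair scheduler share the cluster between the applications. Here run_check.py is a hypothetical driver script that takes the check name as an argument, and in practice you would probably throttle how many are submitted at once:

import subprocess

# launch one Spark application per check; run_check.py is a hypothetical driver script
checks = ["check_{}".format(i) for i in range(1, 101)]
procs = [subprocess.Popen(["spark-submit", "run_check.py", name]) for name in checks]
for p in procs:
    p.wait()  # wait for all applications to finish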
Submit jobs with separate threads within one context/session
On the other hand, it should be possible to submit multiple jobs within a single application as long as each is submitted from its own thread. I assume one would use Python threads for this.
From Scheduling within an application
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
See also How to run multiple jobs in one Sparkcontext from separate threads in PySpark?
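For illustration, here is a minimal sketch of this second approach, assuming a single SparkSession with spark.scheduler.mode=FAIR and one thread per check; the queries and the compare_results placeholder stand in for your real diff logic:

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("data-sanity-checks")
         .config("spark.scheduler.mode", "FAIR")  # fair scheduling within this application
         .getOrCreate())

checks = {
    "check_1": ["SELECT ...", "SELECT ..."],  # placeholders for your real queries
    "check_2": ["SELECT ...", "SELECT ..."],
}

def compare_results(df_a, df_b):
    # placeholder for your real data-quality diff
    return df_a.subtract(df_b).count() == 0

def run_check(name, queries):
    # each thread tags its jobs with a pool so the fair scheduler can balance them
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", name)
    dfs = [spark.sql(q) for q in queries]
    return name, compare_results(*dfs)

with ThreadPoolExecutor(max_workers=8) as executor:
    futures = [executor.submit(run_check, n, q) for n, q in checks.items()]
    for f in futures:
        print(f.result())

Since each check has at most two queries, most of the gain likely comes from parallelizing across checks; also note that, depending on your PySpark version, it is worth checking how local properties propagate across Python threads.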
I currently have a list of tasks that I run through the command
luigi.build(tasks, workers=N, local_scheduler=True, detailed_summary=True)
I would like to programmatically get the status from the local scheduler (I cannot use the central scheduler for my application). How can I get the list of running, pending, and completed tasks?
At first I checked for the creation of the output files of some known tasks, but that is no longer practical since the complexity has increased and I also have dynamic tasks (yielded by other tasks at runtime).
I noticed that I could list dependencies through:
import luigi.tools.deps
luigi.tools.deps.find_deps(my_main_task, luigi.tools.deps.upstream().family)
but that does not help any more than looking at each task's output().
I have also noticed that Worker objects have a handy _running_tasks attribute, so I could get hold of the worker and list it, but I wonder what happens if I have more than one worker with pending tasks alongside running ones.
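For what it's worth, with detailed_summary=True the value returned by luigi.build is a run-summary object, but as far as I can tell it is only available once the build has finished, so it gives a post-run breakdown rather than the live running/pending status I'm after:

result = luigi.build(tasks, workers=N, local_scheduler=True, detailed_summary=True)
print(result.status)              # overall status code of the run
print(result.summary_text)        # per-status breakdown of the tasks
print(result.scheduling_succeeded)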
What happens if the same DAG is triggered concurrently (or such that the run times overlap)?
I'm asking because I recently manually triggered a DAG that was still running when its actual scheduled run time passed, at which point, from the perspective of the webserver UI, it began running again from the beginning (and I could no longer track the previous instance). Is this just a case of that "run instance" overloading the dag_id, or is the job literally restarting (i.e. the previous processes are killed)?
As I understand it, it depends on how the run was triggered and whether the DAG has a schedule. If the run is based on the schedule defined in the DAG (say a task that runs daily), it is incomplete / still working and you click rerun, then this instance of the task will be rerun, i.e. the one for today. Likewise if the frequency were any other unit of time.
If you want to rerun other instances, you need to delete them from the previous jobs, as described by lars-haughseth in a different question: airflow-re-run-dag-from-beginning-with-new-schedule
If you trigger a DAG run, it will get the trigger's execution timestamp and the run will be displayed separately from the scheduled runs, as described in the documentation here: external-triggers documentation
Note that DAG Runs can also be created manually through the CLI while running an airflow trigger_dag command, where you can define a specific run_id. The DAG Runs created externally to the scheduler get associated to the trigger’s timestamp, and will be displayed in the UI alongside scheduled DAG runs.
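For example, a manual trigger with an explicit run_id looks something like this (the DAG id and run_id are just placeholders):

airflow trigger_dag my_dag_id -r manual_run_2019_05_01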
In your instance it sounds like the latter. Hope that helps.
We would like to use Apache Airflow mostly to schedule Scrapy Python spiders, plus some other scripts.
We will have thousands of spiders, and their scheduling can vary from day to day, so we want to be able to create the Airflow DAGs and schedule all of them once a day, automatically from a database. The only examples I have seen for Airflow use Python scripts to write the DAG files.
What is the best way to create the DAG files and the scheduling automatically?
EDIT:
I managed to find a solution that should work, using YAML files:
https://codeascraft.com/2018/11/14/boundary-layer%E2%80%89-declarative-airflow-workflows/
Airflow can be used with thousands of dynamic tasks, but it shouldn't be. Airflow DAGs are supposed to be fairly constant. You can still use Airflow, for example, to process the whole batch of scraped data and use this information in your ETL process later.
A large number of dynamic tasks can lead to cluttered DAG runs, which results in a lot of garbage information both in the GUI and in the log files.
But if you really want to use only Airflow, you can read this article (about dynamic DAG generation) and this article (about dynamic task generation inside a DAG).
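As a minimal sketch of dynamic DAG generation, you could build one DAG object per record pulled from your database and register it in globals(); fetch_spider_configs() below is a placeholder for your own database query, and the operator and schedule details are just assumptions:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

def fetch_spider_configs():
    # placeholder: in practice, query your database for spiders and their schedules
    return [
        {"name": "spider_a", "schedule": "0 3 * * *"},
        {"name": "spider_b", "schedule": "30 4 * * *"},
    ]

for cfg in fetch_spider_configs():
    dag_id = "scrape_{}".format(cfg["name"])
    dag = DAG(
        dag_id,
        schedule_interval=cfg["schedule"],
        start_date=datetime(2019, 1, 1),
        catchup=False,
    )
    BashOperator(
        task_id="run_spider",
        bash_command="scrapy crawl {}".format(cfg["name"]),
        dag=dag,
    )
    globals()[dag_id] = dag  # Airflow picks up any DAG object found at module level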
I'm working on a project whose main feature will be periodically running one type of async task for each user. Every user will be able to configure the task (running daily, weekly, etc. at a specified time), and the task will use some data stored by the user. Now I'm wondering which approach is better: allow users to create their own PeriodicTask (through some restricted endpoint, of course), or create a single PeriodicTask (for example running every 5 minutes) that iterates over all users and determines whether the task should be queued for the current user? I think I will use AMQP as the broker.
The periodic task scheduler in Celery is not designed to handle thousands of scheduled tasks, so from a performance perspective a much better solution is to have one task that runs at the smallest interval (e.g. if you allow users to schedule daily, weekly or monthly, running the task daily is enough).
Such an approach is also more stable: every time the schedule changes, all of the schedule records are reloaded.
Plus it is more secure, because you do not expose or use any internal mechanisms for task execution.
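A minimal sketch of this single-dispatcher approach, assuming a daily beat entry and hypothetical iter_user_ids() / is_due_today() helpers that encapsulate your per-user schedule data:

# tasks.py
from celery import Celery
from celery.schedules import crontab

app = Celery("tasks", broker="amqp://localhost")  # AMQP broker, as planned

app.conf.beat_schedule = {
    "dispatch-user-tasks": {
        "task": "tasks.dispatch_user_tasks",
        "schedule": crontab(minute=0, hour=0),  # once a day, the smallest interval offered
    },
}

@app.task
def process_user(user_id):
    # the actual per-user work, using the data stored by that user
    ...

@app.task
def dispatch_user_tasks():
    # iter_user_ids() and is_due_today() are hypothetical helpers backed by your user/schedule data
    for user_id in iter_user_ids():
        if is_due_today(user_id):
            process_user.delay(user_id)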