How to create thousands of tasks each day, automatically - Python

We would like to use Apache Airflow mainly to schedule Scrapy Python spiders, plus some other scripts.
We will have thousands of spiders, and their scheduling can vary from day to day, so we want to create the Airflow DAGs and schedule all of them once a day, automatically, from a database. The only examples I have seen for Airflow use Python scripts to write the DAG files.
What is the best way to create the DAG files and schedule them automatically?
EDIT:
I managed to find a solution that should work, using YAML files:
https://codeascraft.com/2018/11/14/boundary-layer%E2%80%89-declarative-airflow-workflows/

Airflow can be used for thousands of dynamic tasks, but it shouldn't be. Airflow DAGs are supposed to be fairly static. You can still use Airflow, for example, to process the whole batch of scraped data and use that information in your ETL process later.
A large number of dynamic tasks leads to cluttered DAG runs and a lot of garbage information, both in the GUI and in the log files.
But if you really want to use only Airflow, you can read this article (about dynamic DAG generation) and this article (about dynamic task generation inside a DAG).
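For the database-driven use case in the question, the usual dynamic-generation pattern is a single file in the dags/ folder that loops over definitions and registers one DAG per entry. A minimal sketch, assuming Airflow 2.x and a hypothetical load_spider_definitions() helper standing in for the database query:

```python
# A sketch of dynamic DAG generation: one DAG per spider definition.
# load_spider_definitions() and the spider names are placeholders; in a real
# setup the list would come from your database on each scheduler parse.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator


def load_spider_definitions():
    return [
        {"name": "spider_a", "schedule": "@daily"},
        {"name": "spider_b", "schedule": "0 6 * * *"},
    ]


for spec in load_spider_definitions():
    dag_id = f"scrape_{spec['name']}"
    with DAG(
        dag_id,
        start_date=datetime(2023, 1, 1),
        schedule_interval=spec["schedule"],
        catchup=False,
    ) as dag:
        BashOperator(
            task_id="run_spider",
            bash_command=f"scrapy crawl {spec['name']}",
        )
    # Register the DAG in the module namespace so Airflow discovers it.
    globals()[dag_id] = dag
```

Keep in mind that the lookup runs every time the scheduler parses the file, so it should be fast or cached.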

Related

How to operate between multiple airflow schedulers

I'm new to Airflow.
I'm considering running multiple Airflow schedulers (with the CeleryExecutor), but I'm curious how multiple schedulers operate together.
How do multiple schedulers schedule the serialized DAGs in the metadata database?
Are there any rules for this? Which scheduler picks up which DAG?
Is there any load balancing across the schedulers?
Answers to these questions would be very helpful. Thanks.
Airflow does not provide a magic solution to synchronize the different schedulers, and there is no load balancing, but it does batch scheduling so that all schedulers can work together to schedule runs and task instances.
The Airflow scheduler runs in an infinite loop. In each scheduling loop, the scheduler creates DAG runs for up to max_dagruns_to_create_per_loop DAGs (just creating DAG runs in the queued state), checks max_dagruns_per_loop_to_schedule DAG runs to see whether they can be scheduled (queued -> scheduled), starting with the runs with the smallest execution dates, and tries to schedule up to max_tis_per_query task instances (queued -> scheduled).
All the selected objects (DAGs, runs and TIs) are locked in the DB by the scheduler and are not visible to the other schedulers, so each scheduler works on a different set of objects.
With a small number of DAGs, DAG runs or task instances, using large values for these three configuration options may lead to all scheduling being done by one of the schedulers.
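For reference, these three options live in the scheduler section of the Airflow configuration. A minimal sketch (assuming Airflow 2.x; section names and defaults can vary between versions) to print the values in effect for a deployment:

```python
# Print the scheduler batching options discussed above.
# Assumes Airflow 2.x; check the config reference for your exact version.
from airflow.configuration import conf

for key in (
    "max_dagruns_to_create_per_loop",
    "max_dagruns_per_loop_to_schedule",
    "max_tis_per_query",
):
    print(key, "=", conf.getint("scheduler", key))
```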

Is Apache Airflow or Luigi a good tool for this use case?

I'm working at an org that has an embarrassingly large number of manual tasks that touch multiple databases (internal and external), multiple file types/formats, and a huge number of datasets. The "general" workflow is to take data from one place, extract/create metadata for it, convert the file format into something more standardised, and publish the dataset.
I'm trying to improve automation here and I've been looking at Luigi and Apache Airflow to try to standardise some of the common blocks that get used, but I'm not sure whether these are the appropriate tools. Before I sink too much time into figuring them out, I thought I'd ask here.
A dummy example:
Check a REST API endpoint to see if a dataset has changed (www.some_server.org/api/datasets/My_dataset/last_update_time)
If it has changed, download the zip file (My_dataset.zip)
Unzip the file (My_dataset.zip >> my_file1.csv, my_file2.csv ... my_fileN.csv)
Do something with each CSV: filter, delete, pivot, whatever
Combine the CSVs and transform them into "My_filtered_dataset.json"
For each step, create/append to a "my_dataset_metadata.csv" file to record things like the processing date, inputs, authors, pipeline version, etc.
Upload the JSON and metadata files somewhere else
My end goal would be to quickly swap out blocks, e.g. replace the "csv_to_json" function with a "csv_to_xlsx" function for different processing tasks, and to have things like alerting on failure, job visualisation, worker management, etc.
Some problems I'm seeing are that Luigi isn't so good at handling dynamic filenames and would struggle to create N branches when I don't know the number of files coming out of the zip file. It's also very basic and doesn't seem to have much community support.
Also, from the Airflow docs: "This is a subtle but very important point: in general, if two operators need to share information, like a filename or small amount of data, you should consider combining them into a single operator." (although there does seem to be some support for this ability with XComs). In my dummy case I would probably need to share, at least, the filenames and the metadata between each step. Combining all steps into a single operator would kind of defeat the point of Airflow...
Am I misunderstanding things here? Are these tools good for this kind of application? Is this task too simple/complex and should it just be stuck into a single script?
With Airflow you can achieve all your goals:
there are sensor operators to wait for a condition: check an API, check whether a file exists, run a query on a database and check the result, ...
create a DAG to define the dependencies between your tasks, and decide which tasks can run in parallel and which should run sequentially
a lot of existing operators developed by the community: SSH operators, operators to interact with cloud provider services, ...
a built-in mechanism to send emails on run failure and retry
it's based on Python scripts, so you can create a method to generate DAGs dynamically (a DAG factory); if you have several DAGs that share part of the same logic, you can create them all from one parameterized method
a built-in messaging system (XCom), to send small data between tasks
a secure way to store your credentials and secrets (Airflow variables and connections)
a modern UI to manage your dag runs and read the logs of your tasks, with Access Control.
you can develop your own plugins and add them to Airflow (e.g. a UI plugin using FlaskAppBuilder)
you can process each file in a separate task in parallel, on a cluster of nodes (Celery or K8S), using the new feature Dynamic Task Mapping (Airflow >= 2.3)
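A minimal sketch of that last point, using the TaskFlow API with Dynamic Task Mapping (Airflow 2.3+); the file names and the csv_to_json body are placeholders:

```python
# One task lists the CSVs extracted from the zip; a mapped task then processes
# each file as its own task instance, in parallel on the executor's workers.
from datetime import datetime
from airflow.decorators import dag, task


@dag(start_date=datetime(2023, 1, 1), schedule_interval=None, catchup=False)
def process_dataset():

    @task
    def list_csv_files():
        # In a real DAG this would unzip My_dataset.zip and return the paths.
        return ["my_file1.csv", "my_file2.csv", "my_file3.csv"]

    @task
    def csv_to_json(path):
        # Placeholder transformation; return the path of the produced JSON.
        return path.replace(".csv", ".json")

    # expand() creates one mapped task instance per CSV at run time.
    csv_to_json.expand(path=list_csv_files())


process_dataset()
```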
To pass files between tasks, which is a basic need for everyone, you can use an external storage service (Google GCS, AWS S3, ...) to store the output of each task, use XCom to pass the file path, then read the file in the second task. You can also use a custom XCom backend, backed by S3 for example, instead of the Airflow metastore DB; in that case all the variables and files passed by XCom are stored automatically on S3, and there is no longer a limit on message size.
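A minimal sketch of that pattern: the first task writes its output to external storage and returns only the path, which Airflow pushes to XCom for the downstream task (the S3 location below is hypothetical):

```python
# Pass only a small path string via XCom; the real data lives in object storage.
from datetime import datetime
from airflow.decorators import dag, task


@dag(start_date=datetime(2023, 1, 1), schedule_interval=None, catchup=False)
def publish_dataset():

    @task
    def transform():
        output_path = "s3://my-bucket/staging/my_filtered_dataset.json"
        # ... write the transformed data to output_path here ...
        return output_path  # return values are pushed to XCom automatically

    @task
    def publish(path):
        # The downstream task gets the path from XCom and reads the data
        # from external storage, not from the Airflow metastore.
        print(f"publishing {path}")

    publish(transform())


publish_dataset()
```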

How to best partition Airflow jobs that work on data that cannot be partitioned by date

I have some DAGs already defined in Airflow that perform queries on third party APIs, pull some data partitioned by date (for example the trending items for yesterday) and write them to the DB. They can also be triggered manually with a bunch of parameters to download the same items without the date-based logic. So far so good, this is a standard scenario for Airflow.
I now want to reuse and adapt some of these DAGs to perform special queries: in Airflow terms this means receiving different job parameters. I can do it one by one manually, but clearly this is not the best approach. The main reason is that these third party APIs have daily quota thresholds that we don't want to cross, so we are not free to run everything every day and need to be considerate with the executions.
So let's say I want to download 100 entities, whose IDs I can fetch through a service call, and my quota is 10 per day. One solution would be to create a DAG that makes the call and saves the IDs into a database along with the date on which they should be processed, but then I'm doing the Airflow scheduler's job, which doesn't seem right. There are many things that could go wrong.
I could do the same trick but with something that looks like a queue: one manual DAG puts tasks in the queue and another, daily, DAG pulls from the queue. This one kind of works in my mind, but it seems like a lot of effort and I'm not sure what should keep track of the queue. Something like Celery seems like overkill, so I would probably have to use a database. Still, it feels like over-engineering and some kind of Airflow anti-pattern, but I don't have much experience with the tool, so feedback is welcome.
Are there other options? Is there some Airflow feature that would solve this easily?

Get certain celery tasks to only run during certain hours

Using Celery, with (if it matters) a RabbitMQ broker and Django as the result backend.
I have many thousands of tasks, such that it will take weeks to complete them all. However, some of them I only want to run between certain hours of the day, because they involve downloading files from a public server that has explicitly requested this behaviour. Each task takes several minutes to complete. How best to go about this?
You can use crontab schedules to set up what you need.
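For example, a minimal sketch with Celery beat and a crontab schedule, so the downloads are only dispatched during the allowed window (the app name, broker URL and task path are placeholders):

```python
# Celery beat dispatches the rate-limited downloads only in the early hours.
from celery import Celery
from celery.schedules import crontab

app = Celery("downloads", broker="amqp://localhost")

app.conf.beat_schedule = {
    "dispatch-downloads-overnight": {
        # A task that picks up the next batch of pending downloads and runs them.
        "task": "downloads.dispatch_pending_downloads",
        "schedule": crontab(minute=0, hour="1-5"),  # hourly from 01:00 to 05:00
    },
}
```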

DAG (directed acyclic graph) dynamic job scheduler

I need to manage a large workflow of ETL tasks, whose execution depends on time, data availability or an external event. Some jobs may fail during execution of the workflow, and the system should be able to restart a failed workflow branch without waiting for the whole workflow to finish.
Are there any frameworks in python that can handle this?
I see several core functions:
DAG building
Execution of nodes (run a shell command with wait, logging, etc.)
Ability to rebuild a sub-graph in the parent DAG during execution
Ability to manually execute nodes or a sub-graph while the parent graph is running
Suspend graph execution while waiting for an external event
List the job queue and job details
Something like Oozie, but more general purpose and in python.
1) You can give dagobah a try. As described on its GitHub page: Dagobah is a simple dependency-based job scheduler written in Python. Dagobah allows you to schedule periodic jobs using cron syntax. Each job then kicks off a series of tasks (subprocesses) in an order defined by a dependency graph you can easily draw with click-and-drag in the web interface. This is the most lightweight scheduler project compared with the three that follow.
2) For ETL tasks, Luigi, which was open sourced by Spotify, focuses more on Hadoop jobs. As described: Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
Both of these modules are written mainly in Python and include web interfaces for convenient management.
As far as I know, Luigi doesn't provide a scheduler module for job tasks, which I think is necessary for ETL tasks. But Luigi makes it easier to write map-reduce code in Python, and thousands of tasks at Spotify run on it every day.
3) Like Luigi, Pinterest open sourced its workflow manager, named Pinball. Pinball's architecture follows a master-worker (or master-client, to avoid naming confusion with a special type of client introduced below) paradigm, where the stateful central master acts as the source of truth about the current system state for stateless clients. It integrates Hadoop/Hive/Spark jobs smoothly.
4) Airflow, yet another DAG job scheduling project, open sourced by Airbnb, is quite similar to Luigi and Pinball. The backend is built on Flask, Celery and so on. Judging by the example job code, Airflow is both powerful and easy to use in my experience.
Last but not least, Luigi, Airflow and Pinball are probably the more widely used of these. There is a good comparison of the three: http://bytepawn.com/luigi-airflow-pinball.html
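To make the dependency-based style concrete, here is a minimal Luigi sketch (the file names and transformation are made up for illustration); Luigi resolves the execution order because Transform requires Extract:

```python
# Two-step Luigi pipeline: Extract writes a file, Transform consumes it.
import luigi


class Extract(luigi.Task):
    def output(self):
        return luigi.LocalTarget("raw.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("id,value\n1,42\n")


class Transform(luigi.Task):
    def requires(self):
        return Extract()

    def output(self):
        return luigi.LocalTarget("clean.csv")

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            dst.write(src.read().upper())


if __name__ == "__main__":
    luigi.build([Transform()], local_scheduler=True)
```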
There are a ton of these; everyone seems to write their own. There is a good list at https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems, which includes systems originating in both industry and academia.
Have you looked at Ruffus?
I have no experience with it but it appears to do some of the items on your list. It also looks quite hackable so you might be able to implement your other requirements yourself.
