I have built a pipeline of Tasks in Luigi. Because this pipeline is going to be used in different contexts, it might need to include additional tasks at the beginning or end of the pipeline, or even entirely different dependencies between the tasks.
That's when I thought: "Hey, why not declare the dependencies between the tasks in my config file?", so I added something like this to my config.py:
PIPELINE_DEPENDENCIES = {
    "TaskA": [],
    "TaskB": ["TaskA"],
    "TaskC": ["TaskA"],
    "TaskD": ["TaskB", "TaskC"]
}
I was annoyed by parameters stacking up throughout the tasks, so at some point I introduced a single parameter, task_config, that every Task has and in which all the information and data needed by run() is stored. So I put PIPELINE_DEPENDENCIES right in there.
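For concreteness, here is a sketch (assumed, not shown in the original post) of how that single task_config might be assembled from config.py and handed to the terminal task; the tasks module and the luigi.build() call are hypothetical:

import luigi

from config import PIPELINE_DEPENDENCIES
from tasks import TaskD  # hypothetical module holding the task classes

task_config = {
    "PIPELINE_DEPENDENCIES": PIPELINE_DEPENDENCIES,
    # ... plus whatever else the tasks' run() methods need ...
}

luigi.build([TaskD(task_config=task_config)], local_scheduler=True)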
Finally, I would have every Task I defined inherit from both luigi.Task and a custom mixin class that implements the dynamic requires(), which looks something like this:
class TaskRequirementsFromConfigMixin(object):
    task_config = luigi.DictParameter()

    def requires(self):
        required_tasks = self.task_config["PIPELINE_DEPENDENCIES"]
        requirements = [
            self._get_task_cls_from_str(required_task)(task_config=self.task_config)
            for required_task in required_tasks
        ]
        return requirements

    def _get_task_cls_from_str(self, cls_str):
        ...
Unfortunately, that doesn't work, as running the pipeline gives me the following:
===== Luigi Execution Summary =====
Scheduled 4 tasks of which:
* 4 were left pending, among these:
* 4 was not granted run permission by the scheduler:
- 1 TaskA(...)
- 1 TaskB(...)
- 1 TaskC(...)
- 1 TaskD(...)
Did not run any tasks
This progress looks :| because there were tasks that were not granted run permission by the scheduler
===== Luigi Execution Summary =====
and a lot of
DEBUG: Not all parameter values are hashable so instance isn't coming from the cache
Although I am not sure if that's relevant.
So:
1. What's my mistake? Is it fixable?
2. Is there another way to achieve this?
I realize this is an old question, but I recently learned how to enable dynamic dependencies. I was able to accomplish this by using a WrapperTask and yielding a dict comprehension (though you could do a list too if you want) with the parameters I wanted to pass to the other tasks in the requires method.
Something like this:
class WrapperTaskToPopulateParameters(luigi.WrapperTask):
    date = luigi.DateMinuteParameter(interval=30, default=datetime.datetime.today())

    def requires(self):
        base_params = ['string', 'string', 'string', 'string', 'string', 'string']
        modded_params = {base: 'mod' + base for base in base_params}
        yield list(
            SomeTask(param1=key_in_dict_we_created, param2=value_in_dict_we_created)
            for key_in_dict_we_created, value_in_dict_we_created in modded_params.items()
        )
I can post an example using a list comprehension too if there's interest.
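For reference, an equivalent requires() written with a list comprehension instead of the generator expression above (same class, placeholder values, and SomeTask as in the snippet):

class WrapperTaskToPopulateParameters(luigi.WrapperTask):
    date = luigi.DateMinuteParameter(interval=30, default=datetime.datetime.today())

    def requires(self):
        base_params = ['string', 'string', 'string', 'string', 'string', 'string']
        modded_params = {base: 'mod' + base for base in base_params}
        yield [
            SomeTask(param1=key, param2=value)
            for key, value in modded_params.items()
        ]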
I am very new to Airflow and I am trying to create a DAG based on the requirement below.
Task 1 - Run a BigQuery query to get a value which I need to push to the 2nd task in the DAG.
Task 2 - Use the value from the above query, run another query, and export the data into a Google Cloud Storage bucket.
I have read other answers related to this and I understand we cannot use xcom_pull or xcom_push in the BigQuery operator in Airflow. So what I am doing is using a PythonOperator, where I can use Jinja template variables by setting "provide_context=True".
Below is a snippet of my code, just task 1, where I want to do "task_instance.xcom_push" in order to see the value in Airflow under the task's XCom logs.
def get_bq_operator(dag, task_id, configuration, table_params=None, trigger_rule='all_success'):
    bq_operator = BigQueryInsertJobOperator(
        task_id=task_id,
        configuration=configuration,
        gcp_conn_id=gcp_connection_id,
        dag=dag,
        params=table_params,
        trigger_rule=trigger_rule,
        **task_instance.xcom_push(key='yr_wk', value=yr_wk),**
    )
    return bq_operator

def get_bq_wm_yr_wk():
    get_bq_operator(dag, app_name, bigquery_util.get_bq_job_configuration(
        bq_query,
        query_params=None))

get_wm_yr_wk = PythonOperator(task_id='get_wm_yr_wk',
                              python_callable=get_bq_wm_yr_wk,
                              provide_context=True,
                              on_failure_callback=failure_callback,
                              on_retry_callback=failure_callback,
                              dag=dag)
"bq_query" is where I pass the SQL file that contains my query; the query returns the value of yr_wk, which I need to use in my 2nd task.
The highlighted task_instance.xcom_push(key='yr_wk', value=yr_wk) in get_bq_operator is failing, and the error I am getting is below:
raise KeyError(f'Variable {key} does not exist')
KeyError: 'Variable ei_migration_hour does not exist'
If I comment out that line, the DAG runs fine. However, how do I validate the value of yr_wk? I want to push it so that I can view the value in the logs.
I do not fully understand your code :), but if you want to do something with the results of a BigQuery query, then a far better way to approach it is to use BigQueryHook in your Python callable.
Operators in Airflow are usually thin wrappers around Hooks that provide a "complete" task (for example, you can use one to run an update operation), but if you want to do something with the result and you are already doing it via a PythonOperator, it is far better to use Hooks directly, as you do not take on all the assumptions that operators make in their execute method.
In your case it should be something like the following (I am using the new TaskFlow syntax here, which is the preferred way to do this kind of operation; see https://airflow.apache.org/docs/apache-airflow/stable/tutorial_taskflow_api.html for the TaskFlow API tutorial. Especially in Airflow 2 it has become the de-facto default way of writing tasks):
@task(.....)
def my_task():
    hook = BigQueryHook(....)  # initialize it with the right parameters
    result = hook.run(sql='YOUR_QUERY', ...)  # add other necessary params
    processed_result = process_result(result)  # do something with the result
    return processed_result
This way you do not even have to run xcom_push (the TaskFlow API will do it for you automatically), and other tasks will be able to use the value by just doing:
@task
def next_task(input):
    pass
And then:
result = my_task()
next_task(result)
Then all the xcom push/pull will be handled for you automatically via TaskFlow.
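A minimal sketch of wiring the two TaskFlow tasks above into a DAG might look like this (dag_id, schedule, and start date are illustrative, not from the original answer):

from datetime import datetime

from airflow.decorators import dag


@dag(schedule_interval=None, start_date=datetime(2021, 1, 1), catchup=False)
def bq_hook_example():
    # my_task and next_task are the @task-decorated functions shown above
    next_task(my_task())


bq_hook_example_dag = bq_hook_example()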
Is there a way to design a Python class that implements a specific data pipeline pattern outside of a DAG, so that this class can be used for all data pipelines that need this pattern?
Example: to load data from Google Cloud Storage into BigQuery, the process can be to validate ingestion candidate files with data quality tests, then attempt to load the data into a raw table in BigQuery, then dispatch the file to an archive or a rejected folder depending on the loading result.
Doing it one time is easy; what if it needs to be done 1000 times? I am trying to figure out how to optimize engineering time.
SubDAGs could be considered, but they show limitations in terms of performance and are going to be deprecated anyway.
Task groups need to be part of a DAG to be implemented: https://github.com/apache/airflow/blob/1be3ef635fab635f741b775c52e0da7fe0871567/airflow/utils/task_group.py#L35.
One way to achieve the expected behavior might be to generate DAGs, task groups, and tasks from a single Python file that leverages dynamic DAG generation.
Nevertheless, code used in that particular file can't be reused elsewhere in the code base. That goes against DRYness, even though DRYness vs. understandability is always a tradeoff.
Based on this article, here is how to solve this question:
You can define a plugin in Airflow's ./plugins directory.
Let's create a sample task group in ./plugins/test_taskgroup.py:
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator
from airflow.utils.task_group import TaskGroup


def hello_world_py():
    print('Hello World')


def build_taskgroup(dag: DAG) -> TaskGroup:
    with TaskGroup(group_id="taskgroup") as taskgroup:
        dummy_task = DummyOperator(
            task_id="dummy_task",
            dag=dag
        )
        python_task = PythonOperator(
            task_id="python_task",
            python_callable=hello_world_py,
            dag=dag
        )
        dummy_task >> python_task
    return taskgroup
You can call it in a simple Python DAG like this:
from datetime import datetime

from airflow import DAG
from test_taskgroup import build_taskgroup

with DAG(
    dag_id="modularized_dag",
    schedule_interval="@once",
    start_date=datetime(2021, 1, 1),
) as dag:
    task_group = build_taskgroup(dag)
Here is the result
I'm interested in this question as well. Airflow 2.0 released the new Dynamic DAG feature, although I'm not sure it will totally answer your design; it may solve the problem of the single codebase. In my case, I have a function that creates a task group with the necessary parameters. Then I iterate to create each DAG, calling that function to create the task group(s) with different parameters. Here is an overview of my pseudo code:
def create_task_group(group_id, a, b, c):
    with TaskGroup(group_id=group_id) as my_task_group:
        # add some tasks
        pass


for x in LIST_OF_THINGS:
    dag_id = f"{x}_workflow"
    schedule_interval = SCHEDULE_INTERVAL[x]

    with DAG(
        dag_id,
        start_date=START_DATE,
        schedule_interval=schedule_interval,
    ) as globals()[dag_id]:
        task_group = create_task_group(x, ..., ..., ...)
LIST_OF_THINGS here represents a list of different configurations. Each DAG can have a different dag_id, schedule_interval, start_date, and so on. You can also define your task configuration in a config file, such as JSON or YAML, and parse it into a dictionary, as in the sketch below.
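One hedged way to do that (the file name and keys are illustrative, not from the original answer) is to parse a YAML file into a dictionary and derive LIST_OF_THINGS and SCHEDULE_INTERVAL from it:

import yaml

# Hypothetical file, e.g.:
# workflows:
#   thing_a: {schedule_interval: "@daily"}
#   thing_b: {schedule_interval: "@hourly"}
with open("/path/to/workflows.yaml") as f:
    config = yaml.safe_load(f)

LIST_OF_THINGS = list(config["workflows"])
SCHEDULE_INTERVAL = {
    name: settings["schedule_interval"]
    for name, settings in config["workflows"].items()
}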
I haven't tried it, but technically you may be able to move create_task_group() into some class and import it if you need to reuse the same functionality. Another good thing about task groups is that they can have task dependencies to other tasks or task groups, which is very convenient.
I have seen a concept of YAML configuration for Airflow DAGs using an extra package, but I'm not sure it's mature yet.
See more information about Dynamic DAG here: https://www.astronomer.io/guides/dynamically-generating-dags
You should just create your own Operator and then use it inside your DAGs.
Extend BaseOperator and use hooks to BigQuery or whatever you need.
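A minimal sketch of that suggestion (the class name, constructor arguments, and result handling are illustrative assumptions; the import paths assume the Google provider package is installed):

from airflow.models.baseoperator import BaseOperator
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook


class ValidateAndLoadToRawTable(BaseOperator):
    """Reusable operator wrapping one step of the GCS-to-BigQuery pattern."""

    def __init__(self, sql, gcp_conn_id="google_cloud_default", **kwargs):
        super().__init__(**kwargs)
        self.sql = sql
        self.gcp_conn_id = gcp_conn_id

    def execute(self, context):
        hook = BigQueryHook(gcp_conn_id=self.gcp_conn_id, use_legacy_sql=False)
        # put the validation / load / dispatch logic of the pattern here
        return hook.get_records(self.sql)

Because the logic lives in one operator class, every DAG that needs the pattern can simply instantiate it instead of repeating the steps.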
I want to implement a dynamic FTPSensor of sorts. Using the contributed FTP sensor, I managed to make it work this way:
ftp_sensor = FTPSensor(
    task_id="detect-file-on-ftp",
    path="./data/test.txt",
    ftp_conn_id="ftp_default",
    poke_interval=5,
    dag=dag,
)
and it works just fine. But I need to pass dynamic path and ftp_conn_id params, i.e. I generate a bunch of new connections in a previous task, and in the ftp_sensor task I want to check, for each of the newly generated connections, whether a file is present on the FTP server.
So I thought first to grab the connections' ids from XCom.
I send them from the previous task in XCom but it seems I cannot access XCom outside of tasks.
E.g. I was aiming at something like:
active_ftp_connections = context['ti'].xcom_pull(key='active_ftps')

for conn in active_ftp_connections:
    ftp_sensor = FTPSensor(
        task_id="detect-file-on-ftp",
        path=conn['path'],
        ftp_conn_id=conn['connection'],
        poke_interval=5,
        dag=dag,
    )
but this doesn't seem to be a possible solution.
Then I wasted a good amount of time trying to create a custom FTPSensor to which I could dynamically pass the data I need, but by now I have reached the conclusion that I need a hybrid between a sensor and an operator, because I need to keep the poke functionality, for instance, but also have the execute functionality.
I guess one option is to write a custom operator that implements poke from the sensor base class, but I am probably too tired to try to do it now.
Do you have an idea how to achieve what I am aiming at? I can't seem to find any materials on the topic on the internet - maybe it's just me.
Let me know if the question is not clear so I can provide more details.
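For what it's worth, a rough, untested sketch of the "implement poke from the sensor base class" idea mentioned above might look like this (the class name is illustrative and the import paths assume the Airflow 2 provider layout):

from ftplib import error_perm

from airflow.providers.ftp.hooks.ftp import FTPHook
from airflow.sensors.base import BaseSensorOperator


class MultiFTPSensor(BaseSensorOperator):
    """Poke until the expected file exists on every FTP connection found in XCom."""

    def poke(self, context):
        ftps = context["ti"].xcom_pull(key="active_ftps") or []
        if not ftps:
            return False
        for ftp in ftps:
            hook = FTPHook(ftp_conn_id=ftp["conn_id"])
            try:
                # Same trick the stock FTPSensor uses: asking for the file's
                # modification time raises error_perm while it does not exist.
                hook.get_mod_time("./" + ftp["folder"] + "/test.txt")
            except error_perm:
                return False
        return True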
Update
I have now arrived at this as a possibility:
def get_active_ftps(**context):
    active_ftp_connections = context['ti'].xcom_pull(key='active_ftps')
    return active_ftp_connections

for ftp in get_active_ftps():
    ftp_sensor = FTPSensor(
        task_id="detect-file-on-ftp",
        path="./" + ftp['folder'] + "/test.txt",
        ftp_conn_id=ftp['conn_id'],
        poke_interval=5,
        dag=dag,
    )
but it throws an error: Broken DAG: [/usr/local/airflow/dags/copy_file_from_ftp.py] 'ti'
I managed to do it like this:
active_ftp_folder = Variable.get('active_ftp_folder')
active_ftp_conn_id = Variable.get('active_ftp_conn_id')

ftp_sensor = FTPSensor(
    task_id="detect-file-on-ftp",
    path="./" + active_ftp_folder + "/test.txt",
    ftp_conn_id=active_ftp_conn_id,
    poke_interval=5,
    dag=dag,
)
I will just have the DAG run one FTP account at a time, since I realized that there shouldn't be cycles in a directed acyclic graph ... apparently.
I am trying to collect application-specific Prometheus metrics in Django for functions that are called by django-background-tasks.
In my application models.py file, I am first adding a custom metric with:
my_task_metric = Summary("my_task_metric", "My task metric")
Then, I am adding this to my function to capture the timestamp at which this function was last run successfully:
@background()
def my_function():
    # my function code here
    # collecting the metric
    my_task_metric.observe((datetime.now().replace(tzinfo=timezone.utc) - datetime(1970, 1, 1).replace(tzinfo=timezone.utc)).total_seconds())
When I bring up Django, the metric is created and accessible in /metrics. However, after this function is run, the value for sum is 0 as if the metric is not observed. Am I missing something?
Or is there a better way to monitor django-background-tasks with Prometheus? I have tried using the model of django-background-tasks but I found it a bit cumbersome.
I ended up creating a decorator leveraging the Prometheus Pushgateway feature
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def push_metric_to_prometheus(function):
    def wrapper(*args, **kwargs):
        result = function(*args, **kwargs)  # only push after a successful run
        registry = CollectorRegistry()
        Gauge(f'{function.__name__}_last_successful_run', f'Last time {function.__name__} successfully finished',
              registry=registry).set_to_current_time()
        push_to_gateway('bolero.club:9091', job='batchA', registry=registry)
        return result
    return wrapper
and then on my function (the order of the decorators is important)
@background()
@push_metric_to_prometheus
def my_function():
    # my function code here
The kind of workflow that I want to run looks like this:
workflow = (
    generator.s() |
    spread.s() |
    gather.s()
)
where spread is a task that replaces itself with a group.
from celery import Celery, group

celery_app = Celery()

@celery_app.task(bind=True)
def spread(self, numbers):
    return self.replace(group(
        (task_1.si(n) | task_2.s() | task_3.s()) for n in numbers
    ))
The whole workflow works fine and as expected.
My question is essentially only about the chains in the group created by spread. I don't care too much if some of them fail. I'm fine with an error somewhere in the chain leading to a shorter list of results being passed to gather. However, I'm not sure how to achieve that.
I can, of course, catch exceptions in each of task_1, task_2, and task_3 and pass on an empty dummy result. For convenience, I'd really like to be able to say: on an error anywhere in the chain, please log the traceback and remove the result from the group or pass on an empty dummy result.
I've searched the documentation and GitHub issues far and wide but could not find anything. I know that I can pass an on_error callback to the chain but I don't know how to pass on an empty result from there (if that's even possible).
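For reference, a minimal sketch of the catch-and-pass-a-dummy-result workaround mentioned above (do_work stands in for the real task body; gather would then have to ignore the None entries):

import logging

from celery import Celery

celery_app = Celery()
logger = logging.getLogger(__name__)


def do_work(result):
    ...  # placeholder for the real processing


@celery_app.task
def task_2(result):
    if result is None:  # an earlier task in this chain already failed
        return None
    try:
        return do_work(result)
    except Exception:
        logger.exception("task_2 failed; passing on a dummy result")
        return None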
Setup:
Python 3.6
celery 4.2.1
Redis broker and backend (though it's not a problem for me to switch if that would enable the behavior)