Several operators allow you to pull data, but I have never managed to use the results.
For example:
https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/bigquery_get_data.py
This operator can be called as follows:
get_data = BigQueryGetDataOperator(
    task_id='get_data_from_bq',
    dataset_id='test_dataset',
    table_id='Transaction_partitions',
    max_results='100',
    selected_fields='DATE',
    bigquery_conn_id='airflow-service-account'
)
Yet get_data is of type DAG, while line 116 of the operator says "return table_data".
To be clear, the operator works and retrieves the data; I just don't understand how to use the retrieved data or where it is located.
How do I get the data using "get_data" above?
The way you would use get_data is in a downstream task: the next task can be a PythonOperator, which you can then use to process the data.
get_data = BigQueryGetDataOperator(
    task_id='get_data_from_bq',
    dataset_id='test_dataset',
    table_id='Transaction_partitions',
    max_results='100',
    selected_fields='DATE',
    bigquery_conn_id='airflow-service-account'
)
def process_data_from_bq(**kwargs):
    ti = kwargs['ti']
    bq_data = ti.xcom_pull(task_ids='get_data_from_bq')
    # Now bq_data here would have your data as a Python list
    print(bq_data)

process_data = PythonOperator(
    task_id='process_data_from_bq',
    python_callable=process_data_from_bq,
    provide_context=True
)
get_data >> process_data
PS: I am the author of BigQueryGetDataOperator and an Airflow committer / PMC member.
The return value is saved in an XCom. You can access it from another operator, as shown in this example.
data = ti.xcom_pull(task_ids='get_data_from_bq')
I have a DAG that builds a list of CSV files and then loads them into DataFrames for import.
# CREATING CSV FILES
def csv_filess():
    print("Creating CSV files....")
    csv_files = []
    for file in os.listdir(dataset_dir):
        if file.endswith('.csv'):
            csv_files.append(file)
    print("***Step 3: CSV files created***")
    return csv_files
# CREATING DATAFRAME
def create_df(csv_files):
    print("Creating dataframe....")
    df = {}
    for file in csv_files:
        try:
            df[file] = pd.read_csv(data_path + file)
        except UnicodeDecodeError:
            df[file] = pd.read_csv(dataset_dir + file, encoding="ISO-8859-1")
    print("***Step 4: CSV files created in df!***")
    return df
t3 = PythonOperator(
    task_id='create_csv',
    python_callable=csv_filess,
    provide_context=True,
    dag=dag)

t4 = PythonOperator(
    task_id='create_df',
    python_callable=create_df,
    op_args=t3.output,
    provide_context=True,
    dag=dag)
But I get an error:
create_df() takes 1 positional argument but 4 were given
I think it's because I have to do something like this first:
csv_files = csv_filess()
But how do I define that in an Airflow task?
Returning a value from a PythonOperator automatically stores the output as an XCom with key "return_value". So you'll get an XCom from task create_csv with key return_value and value ["file1.csv", "file2.csv", ...]. You can inspect all XComs in Airflow under Admin -> XComs, or per task by clicking a task -> Instance Details -> XCom.
In your create_df task, you then pass the output of create_csv using t3.output, which is a reference to the previously created XCom. When op_args is given a list, Airflow automatically unpacks it, so your callable has to accept multiple positional arguments; a * does the trick:
def create_df(*csv_files):
    ...
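For completeness, a minimal sketch of the corrected callable and task, reusing the names from your snippet (assuming pandas is imported as pd, and using dataset_dir consistently for the path, which is an assumption on my part):

def create_df(*csv_files):
    # csv_files arrives unpacked from the XCom list returned by create_csv
    df = {}
    for file in csv_files:
        try:
            df[file] = pd.read_csv(dataset_dir + file)
        except UnicodeDecodeError:
            df[file] = pd.read_csv(dataset_dir + file, encoding="ISO-8859-1")
    return df

t4 = PythonOperator(
    task_id='create_df',
    python_callable=create_df,
    op_args=t3.output,
    dag=dag)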
Two notes:
You might be interested in exploring Airflow's TaskFlow API, which would reduce boilerplate code. Your code would then look like:
from airflow.decorators import task

with DAG(...) as dag:

    @task
    def csv_filess():
        ...

    @task
    def create_df(csv_files):
        ...

    create_df(csv_filess())
(Note that here create_df does not require unpacking.)
And lastly, note that values returned from PythonOperators are automatically stored as XComs (and, by default, end up in the Airflow metastore). That is fine if it is intended or if a custom XCom backend is configured, but I'm a bit wary when it comes to returning Pandas DataFrames, as these can potentially be very large.
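If the DataFrames can get large, a common workaround is to persist them to files and pass only the paths through XCom. A minimal sketch, assuming a local scratch directory and that pandas plus a parquet engine (e.g. pyarrow) are available:

def create_df(*csv_files):
    # Write each DataFrame to disk and return the paths;
    # only the small list of paths ends up in the XCom table.
    paths = []
    for file in csv_files:
        df = pd.read_csv(dataset_dir + file)
        out_path = "/tmp/" + file + ".parquet"  # assumption: local scratch dir
        df.to_parquet(out_path)
        paths.append(out_path)
    return paths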
I want to use XCom values as parameters of my operator.
First, an OracleReadOperator is executed, which reads a table from the DB and returns values.
This is the value in XCom:
[{'SOURCE_HOST': 'TEST_HOST'}]
Using this function, I want to get the value from XCom:
def print_xcom(**kwargs):
    ti = kwargs['ti']
    ti.xcom_pull(task_ids='task1')
Then I use the values as a parameter:
with DAG(
    schedule_interval='@daily',
    dagrun_timeout=timedelta(minutes=120),
    default_args=args,
    template_searchpath=tmpl_search_path,
    catchup=False,
    dag_id='test'
) as dag:

    test_l = OracleLoadOperator(
        task_id="task1",
        oracle_conn_id="orcl_conn_id",
        object_name='table'
    )
    test_l
def print_xcom(**kwargs):
    ti = kwargs['ti']
    ti.xcom_pull(task_ids='task1', value='TARGET_TABLE')
load_from_db = MsSqlToOracleTransfer(
    task_id='task2',
    mssql_conn_id="{task_instance.xcom_pull(task_ids='task1') }",
    oracle_conn_id='conn_def_orc',
    sql='test.sql',
    oracle_table="oracle_table"
)
tasks.append(load_from_db)
I don't know whether I need the print_xcom function at all, or whether I can get the value without it; if so, how?
I got this error:
airflow.exceptions.AirflowNotFoundException: The conn_id `{ task_instance.xcom_pull(task_ids='task1') }` isn't defined
To resolve the immediate NameError exception: Jinja expressions are strings, so the arg for oracle_table needs to be updated to:
oracle_table = "{{ task_instance.xcom_pull(task_ids='print_xcom', key='task1') }}"
EDIT
(Since the question and problem changed.)
Only fields declared in an operator's template_fields can use Jinja expressions. It looks like MsSqlToOracleTransfer is a custom operator; if you want to use a Jinja template for the mssql_conn_id arg, it needs to be declared as part of template_fields, otherwise the literal string is used as the arg value (which is what you're seeing). You also need the expression in the "{{ ... }}" format.
Here is some guidance on Jinja templating with custom operators if you find it helpful.
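For illustration, a rough sketch of what declaring the field could look like; the class body here is an assumption, not the actual MsSqlToOracleTransfer implementation:

from airflow.models import BaseOperator

class MsSqlToOracleTransfer(BaseOperator):
    # Only the fields listed here are rendered with Jinja before execute() runs
    template_fields = ('mssql_conn_id', 'sql', 'oracle_table')

    def __init__(self, mssql_conn_id, oracle_conn_id, sql, oracle_table, **kwargs):
        super().__init__(**kwargs)
        self.mssql_conn_id = mssql_conn_id
        self.oracle_conn_id = oracle_conn_id
        self.sql = sql
        self.oracle_table = oracle_table

    def execute(self, context):
        # ... transfer logic using the (now rendered) attributes ...
        pass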
However, it seems like there is more to this picture than what we have context for. What is task1? Are you simply trying to retrieve a connection ID? What is it exactly you are trying to accomplish accessing XComs in the DAG?
Airflow tasks implement the output attribute, which returns an instance of XComArg. For example:
def push_xcom(ti):
    return {"key": "value"}

def pull_xcom(input):
    print(f'XCom: {input}')

with DAG(...) as dag:
    start = PythonOperator(task_id='dp_start', python_callable=push_xcom)
    end = PythonOperator(task_id='dp_end', python_callable=pull_xcom,
                         op_kwargs={'input': start.output})
    start >> end
Maybe you could use test_l.output for load_from_db's mssql_conn_id, but I think that for *_conn_id parameters the value should be the ID of an Airflow connection.
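Applied to your DAG, a sketch of passing test_l's return value into a downstream PythonOperator via op_kwargs (rather than into a *_conn_id field); use_xcom_value is a hypothetical task added purely for illustration:

def use_xcom_value(source):
    # source would be the list returned by task1, e.g. [{'SOURCE_HOST': 'TEST_HOST'}]
    print(f'Value from task1: {source}')

use_value = PythonOperator(
    task_id='use_xcom_value',
    python_callable=use_xcom_value,
    op_kwargs={'source': test_l.output})

test_l >> use_value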
I currently have a DAG in Airflow with a PythonOperator and an associated Python callable, like so:
def push_xcom(**kwargs):
    ti = kwargs["ti"]
    ti.xcom_push(key=key, value=value)

xcom_opr = PythonOperator(
    task_id='xcom_opr',
    python_callable=push_xcom,
    dag=dag
)
The goal of this DAG is to update another DAG's XCom variables defined in Airflow. Is this not possible? I couldn't find the source code for xcom_push, but maybe there is something like a dag_id argument?
Looking at the source code for TaskInstance, it looks like you could copy what it does under the hood directly and specify your desired DAG id:
XCom.set(
    key=key,
    value=value,
    task_id=self.task_id,
    dag_id=self.dag_id,
    execution_date=execution_date or self.execution_date)
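A rough sketch of doing the same from inside a PythonOperator callable; the target dag_id, task_id and key are placeholders, the callable is assumed to receive the context (provide_context=True on older Airflow versions), and XCom.set's exact signature differs between Airflow versions:

from airflow.models import XCom

def push_to_other_dag(**kwargs):
    # Write an XCom row under another DAG's id for the current execution date
    XCom.set(
        key='my_key',
        value='my_value',
        task_id='target_task',
        dag_id='other_dag_id',
        execution_date=kwargs['execution_date'])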
However, the xcom_pull API directly supports pulling from another DAG's XCom, so perhaps you could have the DAG you want to modify pull from the other one instead?
def xcom_pull(
        self,
        task_ids: Optional[Union[str, Iterable[str]]] = None,
        dag_id: Optional[str] = None,
        key: str = XCOM_RETURN_KEY,
        include_prior_dates: bool = False) -> Any
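For example, a callable in the DAG you want to modify could pull the value the other DAG pushed. A sketch, with the pushing DAG/task ids and key as placeholders; note that by default the pull only matches XComs from the same execution date, and include_prior_dates=True relaxes that to earlier dates as well:

def pull_from_other_dag(**kwargs):
    ti = kwargs['ti']
    value = ti.xcom_pull(dag_id='pushing_dag_id',
                         task_ids='xcom_opr',
                         key='my_key',
                         include_prior_dates=True)
    print(value)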
I have a situation where I need to find a specific folder in S3 to pass onto a PythonOperator in an Airflow script. I am doing this using another PythonOperator that finds the correct directory. I can successfully either xcom.push() or Variable.set() and read it back within the PythonOperator. The problem is, I need to pass this variable onto a separate PythonOperator that uses code in a python library. Therefore, I need to Variable.get() or xcom.pull() this variable within the main part of the Airflow script. I have searched quite a bit and can't seem to figure out if this is possible or not. Below is some code for reference:
def check_for_done_file(**kwargs):
    ### This function does a bunch of stuff to find the correct S3 path to
    ### populate target_dir, this has been verified and works
    Variable.set("target_dir", done_file_list.pop())
    test = Variable.get("target_dir")
    print("TEST: ", test)

#### END OF METHOD, BEGIN MAIN
with my_dag:

    ### CALLING METHOD FROM MAIN, POPULATING VARIABLE
    check_for_done_file_task = PythonOperator(
        task_id='check_for_done_file',
        python_callable=check_for_done_file,
        dag=my_dag,
        op_kwargs={
            "source_bucket": "my_source_bucket",
            "source_path": "path/to/the/s3/folder/I/need"
        }
    )

    target_dir = Variable.get("target_dir")  # I NEED THIS VAR HERE.

    move_data_to_in_progress_task = PythonOperator(
        task_id='move-from-incoming-to-in-progress',
        python_callable=FileOps.move,  # <--- PYTHON LIBRARY THAT COPIES FILES FROM SRC TO DEST
        dag=my_dag,
        op_kwargs={
            "source_bucket": "source_bucket",
            "source_path": "path/to/my/s3/folder/" + target_dir,
            "destination_bucket": "destination_bucket",
            "destination_path": "path/to/my/s3/folder/" + target_dir,
            "recurse": True
        }
    )
So, is the only way to accomplish this to augment the library to look for the "target_dir" variable? I don't think Airflow main has a context, and therefore what I want to do may not be possible. Any Airflow experts, please weigh in to let me know what my options might be.
op_kwargs is a templated field. So you can use xcom_push:
def check_for_done_file(**kwargs):
    ...
    kwargs['ti'].xcom_push(key='target_dir', value=y)
and use jinja template in op_kwargs:
move_data_to_in_progress_task = PythonOperator(
    task_id='move-from-incoming-to-in-progress',
    python_callable=FileOps.move,  # <--- PYTHON LIBRARY THAT COPIES FILES FROM SRC TO DEST
    dag=my_dag,
    op_kwargs={
        "source_bucket": "source_bucket",
        "source_path": "path/to/my/s3/folder/{{ ti.xcom_pull(task_ids='check_for_done_file', key='target_dir') }}",
        "destination_bucket": "destination_bucket",
        "destination_path": "path/to/my/s3/folder/{{ ti.xcom_pull(task_ids='check_for_done_file', key='target_dir') }}",
        "recurse": True
    }
)
Also, add provide_context=True to your check_for_done_file_task task to pass the context dictionary to the callable.
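In other words, your first task would become something like:

check_for_done_file_task = PythonOperator(
    task_id='check_for_done_file',
    python_callable=check_for_done_file,
    provide_context=True,  # makes 'ti' available inside the callable
    dag=my_dag,
    op_kwargs={
        "source_bucket": "my_source_bucket",
        "source_path": "path/to/the/s3/folder/I/need"
    })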
I am using a main dag (main_dag) that contains a number of subdags and each of those subdags has a number of tasks. I pushed an xcom from subdagA taskA, but I am pulling that xcom within subdagB taskB. Since the dag_id argument in xcom_pull() defaults to self.dag_id I have been unable to pull the necessary xcom. I was wondering how one would do this and/or if there is a better way to set this scenario up so I don't have to deal with this.
Example of what I am currently doing in subdagB:
def subdagB(parent_dag, child_dag, start_date, schedule_interval):
    subdagB = DAG('%s.%s' % (parent_dag, child_dag), start_date=start_date, schedule_interval=schedule_interval)

    start = DummyOperator(
        task_id='taskA',
        dag=subdagB)

    tag_db_template = '''echo {{ task_instance.xcom_pull(dag_id='dag.main_dag.subdagA', task_ids='taskA') }};'''

    t1 = BashOperator(
        task_id='taskB',
        bash_command=tag_db_template,
        xcom_push=True,
        dag=subdagB)

    end = DummyOperator(
        task_id='taskC',
        dag=subdagB)

    t0.set_upstream(start)
    t1.set_upstream(t0)
    end.set_upstream(t1)

    return subdagB
Thank you in advance for any help!
You should be fine as long as you override the dag_id in
[Operator].xcom_pull(dag_id=dag_id, ...) or
[TaskInstance].xcom_pull(dag_id=dag_id, ...)
Just make sure that
dag_id = "{parent_dag_id}.{child_dag_id}"
If you can make your example more complete I can try running it locally, but I tested a (similar) example and cross-subdag xcoms work as expected.
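For reference, a minimal sketch of how the pull in subdagB could look once the dag_id is fully qualified, assuming the parent DAG's id is main_dag and that taskA in subdagA pushed the value:

# Inside the subdagB factory function
tag_db_template = (
    "echo {{ task_instance.xcom_pull("
    "dag_id='main_dag.subdagA', task_ids='taskA') }};")

t1 = BashOperator(
    task_id='taskB',
    bash_command=tag_db_template,
    xcom_push=True,
    dag=subdagB)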