I have a DAG that builds a list of CSV files and then reads them into DataFrames for import.
#CREATING CSV FILES
def csv_filess():
    print("Creating CSV files....")
    csv_files = []
    for file in os.listdir(dataset_dir):
        if file.endswith('.csv'):
            csv_files.append(file)
    print("***Step 3: CSV files created***")
    return csv_files
#CREATING DATAFRAME
def create_df(csv_files):
    print("Creating dataframe....")
    df = {}
    for file in csv_files:
        try:
            df[file] = pd.read_csv(data_path + file)
        except UnicodeDecodeError:
            df[file] = pd.read_csv(dataset_dir + file, encoding="ISO-8859-1")
    print("***Step 4: CSV files created in df!***")
    return df
t3 = PythonOperator(
    task_id='create_csv',
    python_callable=csv_filess,
    provide_context=True,
    dag=dag)

t4 = PythonOperator(
    task_id='create_df',
    python_callable=create_df,
    op_args=t3.output,
    provide_context=True,
    dag=dag)
But I get an error:
create_df() takes 1 positional argument but 4 were given
I think it's because I have to call it this way first?:
csv_files = csv_filess()
But how do I express that with Airflow tasks?
Returning a value from a PythonOperator automatically stores the output as an XCom with key "return_value". So you'll get an XCom from task create_csv with key return_value and value ["file1.csv", "file2.csv", ...]. You can inspect all XComs in Airflow under Admin -> XComs, or per task by clicking a task -> Instance Details -> XCom.
In your create_df task, you then pass the output of create_csv using t3.output, which is a reference to the previously created XCom. When op_args is given a list, Airflow automatically unpacks it, so you'll have to accept multiple arguments with a * to make this work:
def create_df(*csv_files):
    ...
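Putting it together, a minimal sketch of the rewired tasks (reusing dataset_dir, t3, and dag from your DAG file) could look like this:

import pandas as pd

def create_df(*csv_files):
    # csv_files arrives as a tuple, unpacked from the list returned by create_csv
    df = {}
    for file in csv_files:
        try:
            df[file] = pd.read_csv(dataset_dir + file)
        except UnicodeDecodeError:
            df[file] = pd.read_csv(dataset_dir + file, encoding="ISO-8859-1")
    return df

t4 = PythonOperator(
    task_id='create_df',
    python_callable=create_df,
    op_args=t3.output,  # XCom reference to create_csv's return value
    dag=dag)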
Two notes:
You might be interested in exploring Airflow's TaskFlow API, which would reduce boilerplate code. Your code would then look something like:
from airflow.decorators import task

with DAG(...) as dag:

    @task
    def csv_filess():
        ...

    @task
    def create_df(csv_files):
        ...

    create_df(csv_filess())
(Note that here create_df does not require unpacking.)
And lastly, note that values returned from PythonOperators are automatically stored as XComs (by default in the Airflow metastore). That's fine if it's intended or a custom XCom backend is configured, but I'm a bit wary when it comes to returning Pandas DataFrames, as these could potentially be very large.
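If the DataFrames could get large, a common pattern is to write them to storage inside the task and only pass small references (e.g. file paths) through XCom. A rough sketch, where /tmp/dataframes is purely an illustrative staging location and dataset_dir comes from your DAG file:

import os
import pandas as pd

def create_df(*csv_files):
    out_dir = "/tmp/dataframes"  # hypothetical staging location
    os.makedirs(out_dir, exist_ok=True)
    parquet_paths = []
    for file in csv_files:
        df = pd.read_csv(os.path.join(dataset_dir, file))
        out_path = os.path.join(out_dir, file.replace(".csv", ".parquet"))
        df.to_parquet(out_path)  # requires pyarrow or fastparquet
        parquet_paths.append(out_path)
    # only the small list of paths ends up in the XCom backend
    return parquet_paths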
Related
I am new to Airflow. I would like to read the Trigger DAG configuration passed by the user and store it as a variable, which can then be passed as a job argument to the actual code.
I would like to access all the parameters passed while triggering the DAG.
def get_execution_date(**kwargs):
    if ({{kwargs["dag_run"].conf["execution_date"]}}) is not None:
        execution_date = kwargs["dag_run"].conf["execution_date"]
        print(f" execution date given by user{execution_date}")
    else:
        execution_date = str(datetime.today().strftime("%Y-%m-%d"))
    return execution_date
You can't use Jinja templating as you did.
The {{kwargs["dag_run"].conf["execution_date"]}} will not be rendered.
You can access DAG information via:
dag_run = kwargs.get('dag_run')
task_instance = kwargs.get('task_instance')
execution_date = kwargs.get('execution_date')
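Putting that together, a corrected version of the callable from the question might look roughly like this (assuming the conf key is still execution_date and the context is passed into **kwargs):

from datetime import datetime

def get_execution_date(**kwargs):
    dag_run = kwargs.get("dag_run")
    conf = (dag_run.conf or {}) if dag_run else {}
    execution_date = conf.get("execution_date")
    if execution_date is not None:
        print(f"execution date given by user: {execution_date}")
    else:
        execution_date = datetime.today().strftime("%Y-%m-%d")
    return execution_date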
Passing a variable in TriggerDagRunOperator (Airflow v2.1.2):
trigger_dependent_dag = TriggerDagRunOperator(
    task_id="trigger_dependent_dag",
    trigger_dag_id="dependent-dag",
    conf={"test_run_id": "rx100"},
    wait_for_completion=False
)
Reading it in the dependent DAG via context['dag_run'].conf['{variable_key}']:
def dependent_function(**context):
    print("run_id=" + context['dag_run'].conf['test_run_id'])
    print('Dependent DAG has completed.')
    time.sleep(180)
I currently have a DAG in Airflow with a PythonOperator and an associated Python callable, like so:
def push_xcom(**kwargs):
    ti = kwargs["ti"]
    ti.xcom_push(key=key, value=value)

xcom_opr = PythonOperator(
    task_id='xcom_opr',
    python_callable=push_xcom,
    dag=dag
)
The goal of this DAG is to update other DAGs' XCom values defined in Airflow. Is this not possible? I couldn't find anything in the source code for xcom_push, but maybe there is something like a dag_id argument?
Looking at the source code for TaskInstance it looks like you could copy what it does under the hood directly, and specify your desired DAG id.
XCom.set(
    key=key,
    value=value,
    task_id=self.task_id,
    dag_id=self.dag_id,
    execution_date=execution_date or self.execution_date)
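A rough sketch of how that could look inside your callable (the target dag_id, task_id, key, and value are placeholders, and the exact XCom API may differ between Airflow versions; the execution_date you write should correspond to an actual run of the target DAG):

from airflow.models import XCom

def push_xcom_to_other_dag(**kwargs):
    ti = kwargs["ti"]
    XCom.set(
        key="my_key",                      # hypothetical key
        value="my_value",                  # hypothetical value
        task_id="target_task",             # task in the *other* DAG
        dag_id="target_dag",               # DAG whose XCom you want to set
        execution_date=ti.execution_date,  # should match a run of the target DAG
    )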
However, the xcom_pull API directly supports pulling from another DAG's xcom so perhaps you could have the DAG you want to modify pull from the other instead?
def xcom_pull(
        self,
        task_ids: Optional[Union[str, Iterable[str]]] = None,
        dag_id: Optional[str] = None,
        key: str = XCOM_RETURN_KEY,
        include_prior_dates: bool = False) -> Any
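So, in the DAG you want to modify, a cross-DAG pull could look roughly like this (the DAG/task ids and key are placeholders):

def read_other_dags_xcom(**kwargs):
    ti = kwargs["ti"]
    value = ti.xcom_pull(
        dag_id="other_dag",        # DAG that pushed the XCom
        task_ids="xcom_opr",       # task that pushed it
        key="my_key",              # key used in xcom_push
        include_prior_dates=True,  # look beyond the current execution_date
    )
    print(value)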
I have a situation where I need to find a specific folder in S3 to pass onto a PythonOperator in an Airflow script. I am doing this using another PythonOperator that finds the correct directory. I can successfully either xcom.push() or Variable.set() and read it back within the PythonOperator. The problem is, I need to pass this variable onto a separate PythonOperator that uses code in a python library. Therefore, I need to Variable.get() or xcom.pull() this variable within the main part of the Airflow script. I have searched quite a bit and can't seem to figure out if this is possible or not. Below is some code for reference:
def check_for_done_file(**kwargs):
    ### This function does a bunch of stuff to find the correct S3 path to
    ### populate target_dir, this has been verified and works
    Variable.set("target_dir", done_file_list.pop())
    test = Variable.get("target_dir")
    print("TEST: ", test)
#### END OF METHOD, BEGIN MAIN
with my_dag:
    ### CALLING METHOD FROM MAIN, POPULATING VARIABLE
    check_for_done_file_task = PythonOperator(
        task_id='check_for_done_file',
        python_callable=check_for_done_file,
        dag=my_dag,
        op_kwargs={
            "source_bucket": "my_source_bucket",
            "source_path": "path/to/the/s3/folder/I/need"
        }
    )

    target_dir = Variable.get("target_dir")  # I NEED THIS VAR HERE.

    move_data_to_in_progress_task = PythonOperator(
        task_id='move-from-incoming-to-in-progress',
        python_callable=FileOps.move,  # <--- PYTHON LIBRARY THAT COPIES FILES FROM SRC TO DEST
        dag=my_dag,
        op_kwargs={
            "source_bucket": "source_bucket",
            "source_path": "path/to/my/s3/folder/" + target_dir,
            "destination_bucket": "destination_bucket",
            "destination_path": "path/to/my/s3/folder/" + target_dir,
            "recurse": True
        }
    )
So, is the only way to accomplish this to augment the library to look for the "target_dir" variable? I don't think Airflow main has a context, and therefore what I want to do may not be possible. Any Airflow experts, please weigh in to let me know what my options might be.
op_kwargs is a templated field. So you can use xcom_push:
def check_for_done_file(**kwargs):
    ...
    kwargs['ti'].xcom_push(key='return_value', value=y)  # or simply: return y
and use a Jinja template in op_kwargs:
move_data_to_in_progress_task = PythonOperator(
    task_id='move-from-incoming-to-in-progress',
    python_callable=FileOps.move,  # <--- PYTHON LIBRARY THAT COPIES FILES FROM SRC TO DEST
    dag=my_dag,
    op_kwargs={
        "source_bucket": "source_bucket",
        "source_path": "path/to/my/s3/folder/{{ ti.xcom_pull(task_ids='check_for_done_file') }}",
        "destination_bucket": "destination_bucket",
        "destination_path": "path/to/my/s3/folder/{{ ti.xcom_pull(task_ids='check_for_done_file') }}",
        "recurse": True
    }
)
Also, add provide_context=True to your check_for_done_file_task so the context dictionary is passed to the callable.
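For completeness, the push side from the question would then look something like this (keeping the original op_kwargs):

check_for_done_file_task = PythonOperator(
    task_id='check_for_done_file',
    python_callable=check_for_done_file,
    provide_context=True,  # passes ti and the rest of the context into **kwargs
    dag=my_dag,
    op_kwargs={
        "source_bucket": "my_source_bucket",
        "source_path": "path/to/the/s3/folder/I/need"
    }
)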
Several operators allow pulling data, but I never managed to use the results.
For example:
https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/bigquery_get_data.py
This operator can be called as follows:
get_data = BigQueryGetDataOperator(
    task_id='get_data_from_bq',
    dataset_id='test_dataset',
    table_id='Transaction_partitions',
    max_results='100',
    selected_fields='DATE',
    bigquery_conn_id='airflow-service-account'
)
Yet, get_data is of type DAG but line 116 says "return table_data".
To be clear, the operator works and retrieves the data; I just don't understand how to use the retrieved data, or where it is located.
How do I get the data using "get_data" above?
The way to use get_data is in the next task, which can be a PythonOperator that you then use to process the data:
get_data = BigQueryGetDataOperator(
    task_id='get_data_from_bq',
    dataset_id='test_dataset',
    table_id='Transaction_partitions',
    max_results='100',
    selected_fields='DATE',
    bigquery_conn_id='airflow-service-account'
)

def process_data_from_bq(**kwargs):
    ti = kwargs['ti']
    bq_data = ti.xcom_pull(task_ids='get_data_from_bq')
    # Now bq_data here would have your data in a Python list
    print(bq_data)

process_data = PythonOperator(
    task_id='process_data_from_bq',
    python_callable=process_data_from_bq,
    provide_context=True
)

get_data >> process_data
PS: I am the author of BigQueryGetDataOperator and an Airflow committer / PMC member.
The return value is saved in an XCom. You can access it from another operator, as shown in this example:
data = ti.xcom_pull(task_ids='get_data_from_bq')
I am using a main dag (main_dag) that contains a number of subdags and each of those subdags has a number of tasks. I pushed an xcom from subdagA taskA, but I am pulling that xcom within subdagB taskB. Since the dag_id argument in xcom_pull() defaults to self.dag_id I have been unable to pull the necessary xcom. I was wondering how one would do this and/or if there is a better way to set this scenario up so I don't have to deal with this.
example of what I am currently doing in subdagB:
def subdagB(parent_dag, child_dag, start_date, schedule_interval):
    subdagB = DAG('%s.%s' % (parent_dag, child_dag), start_date=start_date, schedule_interval=schedule_interval)

    start = DummyOperator(
        task_id='taskA',
        dag=subdagB)

    tag_db_template = '''echo {{ task_instance.xcom_pull(dag_id='dag.main_dag.subdagA', task_ids='taskA') }};'''

    t1 = BashOperator(
        task_id='taskB',
        bash_command=tag_db_template,
        xcom_push=True,
        dag=subdagB)

    end = DummyOperator(
        task_id='taskC',
        dag=subdagB)

    t0.set_upstream(start)
    t1.set_upstream(t0)
    end.set_upstream(t1)

    return subdagB
Thank you in advance for any help!
You should be fine as long as you override the dag_id in
[Operator].xcom_pull(dag_id=dag_id, ...) or
[TaskInstance].xcom_pull(dag_id=dag_id, ...)
Just make sure that
dag_id = "{parent_dag_id}.{child_dag_id}"
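Applied to your example, assuming the parent DAG is main_dag and the XCom was pushed by taskA in subdagA, the template would become something like:

# pull from taskA in the sibling subdag, using the fully qualified subdag id
tag_db_template = '''echo {{ task_instance.xcom_pull(dag_id='main_dag.subdagA', task_ids='taskA') }};'''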
If you can make your example more complete I can try running it locally, but I tested a (similar) example and cross-subdag xcoms work as expected.