I am new to Airflow. I followed the tutorial on the official page (https://airflow.readthedocs.io/en/stable/tutorial.html) and added a subdag to the tutorial DAG.
When I zoom into the subdag in the web UI and click on Code, the code of the main DAG is shown. Likewise, when I click on Details of the subdag, the filename of the main DAG is displayed, as in the screenshot.
Screenshot of wrong filepath:
My file structure:
dags/
├── subdags
│   ├── hellosubdag.py
│   ├── __init__.py
├── tutorial.py
My main DAG code:
from datetime import timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.subdag_operator import SubDagOperator
from airflow.utils.dates import days_ago
from subdags.hellosubdag import sub_dag
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago(2),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}
parentdag = DAG(
    dag_id='tutorial',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule_interval=timedelta(days=1),
)
subdag_execute = SubDagOperator(
    task_id='subdag-exe',
    subdag=sub_dag('tutorial', 'subdag-exe', default_args['start_date'], timedelta(days=1)),
    dag=parentdag,
)
And the subdag simply prints a string.
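For context, here is a minimal sketch of what subdags/hellosubdag.py could look like. Only the sub_dag signature and the parent.child dag_id convention are taken from the call above; the operator and the echo command are assumptions of mine.
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

def sub_dag(parent_dag_name, child_dag_name, start_date, schedule_interval):
    # A subdag's dag_id must be '<parent_dag_id>.<SubDagOperator task_id>'.
    dag = DAG(
        dag_id='%s.%s' % (parent_dag_name, child_dag_name),
        start_date=start_date,
        schedule_interval=schedule_interval,
    )
    BashOperator(
        task_id='print_hello',
        bash_command='echo "hello from the subdag"',
        dag=dag,
    )
    return dag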
My company was using Airflow 1.10.3 before updating to 1.10.9, and I've been told that it used to work before the update.
I can't find any changelog or documentation regarding this issue. Was this feature removed at some point, or am I doing something wrong?
I am learning Airflow and, as a practice exercise, I'm trying to create a table in Redshift through an Airflow DAG on MWAA. I created the connection to Redshift in the UI (specifying host, port, etc.) and ran the following DAG, but it fails at the "sql_query" task. Any idea how I can solve this problem, or what could be causing it?
Script:
import os
from datetime import timedelta
from airflow import DAG
from airflow.models import Variable
from airflow.models.baseoperator import chain
from airflow.operators.dummy import DummyOperator
from airflow.providers.amazon.aws.operators.redshift import RedshiftSQLOperator
from airflow.utils.dates import days_ago
DEFAULT_ARGS = {
    "owner": "username",
    "depends_on_past": False,
    "retries": 0,
    "email_on_failure": False,
    "email_on_retry": False,
    "redshift_conn_id": "redshift_default",
}
with DAG(
    dag_id="new_table_dag",
    description="",
    default_args=DEFAULT_ARGS,
    dagrun_timeout=timedelta(minutes=15),
    start_date=days_ago(1),
    schedule_interval=None,
    tags=[""],
) as dag:
    begin = DummyOperator(task_id="begin")
    end = DummyOperator(task_id="end")
    sql_query = RedshiftSQLOperator(
        task_id="sql_query",
        sql="CREATE TABLE schema_name.table_a AS (SELECT * FROM table_b)")
    chain(begin, sql_query, end)
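For reference, the operator finds the connection created in the UI by its Connection ID. A minimal sketch of the same task with the ID passed explicitly on the operator instead of through default_args (same redshift_default ID as above; the task is assumed to sit inside the with DAG(...) as dag: block):
    sql_query = RedshiftSQLOperator(
        task_id="sql_query",
        redshift_conn_id="redshift_default",  # must match the Connection ID configured in the Airflow UI
        sql="CREATE TABLE schema_name.table_a AS (SELECT * FROM table_b)",
    )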
I've installed Airflow on Docker and I'm trying to create my first DAG, but when I use the statement from airflow import DAG and try to execute it, I get an import error. The file isn't named airflow.py, to avoid import problems. I also can't import from airflow.operators.python_operator import PythonOperator; it says that airflow.operators.python_operator could not be resolved.
Here's the code that I've used to create my first DAG:
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import timedelta

default_args = {
    'owner': 'eike',
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(2),
    'email': ['eike@gmail.com.br'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=3),
}
dag = DAG(
    'anonimização',
    default_args=default_args,
    description='Anonymization of the propesq database',
    schedule_interval=None,
    catchup=False,
)
Code of the DAG in VS Code
Airflow home page with the DAG import error
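As a side note (not part of the original post): the "could not be resolved" message most likely comes from the editor's local interpreter, which does not have Airflow installed, since Airflow lives inside the Docker container. Both of the import forms below resolve where Airflow 2.x is actually installed:
# Deprecated 1.10-style path, still provided as a shim in Airflow 2.x:
from airflow.operators.python_operator import PythonOperator
# Preferred Airflow 2.x path:
from airflow.operators.python import PythonOperator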
My use case is quite simple:
When a file is dropped in the FTP server directory, an SFTPSensor task should pick up files with the specified extension (txt) and process their content.
path="/test_dir/sample.txt" works in this case.
My requirement is to pick up dynamically named files that have only the specified extension (text files).
With path="/test_dir/*.txt", the file poking is not working.
#Sample Code
from airflow.models import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.sftp.sensors.sftp import SFTPSensor
from airflow.providers.ssh.hooks.ssh import SSHHook
from datetime import datetime
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2022, 4, 16)
}
with DAG(
    'sftp_sensor_test',
    schedule_interval=None,
    default_args=default_args
) as dag:
    waiting_for_file = SFTPSensor(
        task_id="check_for_file",
        sftp_conn_id="sftp_default",
        path="/test_dir/*.txt",  # NOTE: poking for files with the .txt extension
        mode="reschedule",
        poke_interval=30
    )
    waiting_for_file
To achieve what you want, I think you should use the file_pattern argument as follows:
waiting_for_file = SFTPSensor(
    task_id="check_for_file",
    sftp_conn_id="sftp_default",
    path="test_dir",
    file_pattern="*.txt",
    mode="reschedule",
    poke_interval=30
)
However, there is currently a bug for this feature → https://github.com/apache/airflow/issues/28121
Until this gets solved, you can easily create a locally fixed version of the sensor in your project by following the explanations in the issue.
Here is the file with the current fix: https://github.com/RishuGuru/airflow/blob/ac0457a51b885459bc5ae527878a50feb5dcadfa/airflow/providers/sftp/sensors/sftp.py
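While waiting for that fix, here is a minimal sketch of a local workaround sensor (my own illustration, not the code from the linked file; it assumes SFTPHook.list_directory is available in your provider version and matches filenames with fnmatch):
import fnmatch
from airflow.providers.sftp.hooks.sftp import SFTPHook
from airflow.sensors.base import BaseSensorOperator

class PatchedSFTPSensor(BaseSensorOperator):
    """Pokes a directory and succeeds once at least one file matches the pattern."""

    def __init__(self, *, path, file_pattern, sftp_conn_id="sftp_default", **kwargs):
        super().__init__(**kwargs)
        self.path = path
        self.file_pattern = file_pattern
        self.sftp_conn_id = sftp_conn_id

    def poke(self, context):
        hook = SFTPHook(ssh_conn_id=self.sftp_conn_id)
        files = hook.list_directory(self.path)
        matches = fnmatch.filter(files, self.file_pattern)
        self.log.info("Files in %s matching %s: %s", self.path, self.file_pattern, matches)
        return bool(matches)
It is used like the original sensor, e.g. PatchedSFTPSensor(task_id="check_for_file", path="test_dir", file_pattern="*.txt", mode="reschedule", poke_interval=30).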
I'm using the Google Oozie-to-Airflow converter to convert some Oozie workflows that are running on AWS EMR. I managed to get a first version, but when I try to upload the DAG, Airflow throws an error:
Broken DAG: No module named 'o2a'
I have tried to install the PyPI package o2a, both using the command
gcloud composer environments update composer-name --update-pypi-packages-from-file requirements.txt --location location
and from the Google Cloud console. Both failed.
requirements.txt
o2a==1.0.1
Here is the code:
from airflow import models
from airflow.operators.subdag_operator import SubDagOperator
from airflow.utils import dates
from o2a.o2a_libs import functions
from airflow.models import Variable
import subdag_validation
import subdag_generate_reports
CONFIG = {}
JOB_PROPS = {}
dag_config = Variable.get("coordinator", deserialize_json=True)
cdrPeriod = dag_config["cdrPeriod"]
TASK_MAP = {"validation": ["validation"], "generate_reports": ["generate_reports"] }
TEMPLATE_ENV = {**CONFIG, **JOB_PROPS, "functions": functions, "task_map": TASK_MAP}
with models.DAG(
    "workflow_coordinator",
    schedule_interval=None,  # Change to suit your needs
    start_date=dates.days_ago(0),  # Change to suit your needs
    user_defined_macros=TEMPLATE_ENV,
) as dag:
    validation = SubDagOperator(
        task_id="validation",
        trigger_rule="one_success",
        subdag=subdag_validation.sub_dag(dag.dag_id, "validation", dag.start_date, dag.schedule_interval),
    )
    generate_reports = SubDagOperator(
        task_id="generate_reports",
        trigger_rule="one_success",
        subdag=subdag_generate_reports.sub_dag(dag.dag_id, "generate_reports", dag.start_date, dag.schedule_interval,
            {
                "cdrPeriod": "{{cdrPeriod}}"
            }),
    )
    validation.set_downstream(generate_reports)
There is a section in the o2a docs that covers how to deploy o2a:
https://github.com/GoogleCloudPlatform/oozie-to-airflow#the-o2a-libraries
Which then started to fail because of another dependency: lark-parser.
Installing it using the PyPI package manager for Composer did the trick.
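For reference, a sketch of the requirements that correspond to this fix (adding lark-parser explicitly is my assumption here; pin it to whatever version Composer resolves):
o2a==1.0.1
lark-parser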
I want to call a script from a custom Python project through Airflow.
My directory structure is:
/home/user/
├── airflow/
│   ├── dags
│   ├── .venv_airflow (virtual environment for airflow)
│   ├── my_dag.py
├── my_project
│   ├── .venv (virtual environment for my_project)
│   ├── folderA
│   │   ├── __init__.py
│   │   ├── folderB
│   │   │   ├── call_me.py (has a line "from my_project.folderA.folderB import import_me")
│   │   │   ├── import_me.py
My dag file looks like:
from airflow import DAG
import datetime as dt
from airflow.operators.bash_operator import BashOperator
default_args = {
    'owner': 'arpita',
    'start_date': dt.datetime(2019, 11, 20),
    'retries': 1,
    'retry_delay': dt.timedelta(minutes=5),
    'depends_on_past': False,
    'email': ['example@abc.com'],
    'email_on_failure': True,
    'email_on_retry': True,
}
with DAG('sample',
         default_args=default_args,
         schedule_interval='30 * * * *',
         ) as dag:
    enter_project = BashOperator(task_id='enter_project',
                                 bash_command='cd /home/user/my_project',
                                 retries=2)
    setup_environment = BashOperator(task_id='setup_environment',
                                     bash_command='source /home/user/my_project/.venv/bin/activate',
                                     retries=2)
    call_script = BashOperator(task_id='call_script',
                               bash_command='python -m my_project.folderA.folderB.call_me',
                               retries=2)
    enter_project >> setup_environment >> call_script
But I am getting this error:
[2019-11-22 11:56:49,311] {bash_operator.py:115} INFO - Running command: python -m my_project.folderA.folderB.call_me
[2019-11-22 11:56:49,315] {bash_operator.py:124} INFO - Output:
[2019-11-22 11:56:49,349] {bash_operator.py:128} INFO - /home/user/airflow/.venv/bin/python: Error while finding spec for 'my_project.folderA.folderB.call_me' (ImportError: No module named 'my_project')
The project and the script work fine outside Airflow. Inside Airflow, it imports other packages like pandas and tensorflow, but not my custom packages. I tried inserting the path with sys.path.insert, but that did not work. Thank you for reading :)
Your bash commands run in three separate BashOperators; they should run in one. Each task executes in its own shell, so the cd and the virtualenv activation from the first two tasks do not carry over to the task that runs the script.
call_script = BashOperator(
    task_id='call_script',
    bash_command='cd /home/user/my_project;'
                 'source /home/user/my_project/.venv/bin/activate;'
                 'python -m my_project.folderA.folderB.call_me',
    retries=2)
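A slightly shorter variant (same paths as above, my own suggestion rather than part of the original answer): since activating a virtualenv mainly prepends its bin directory to PATH, you can call that interpreter directly:
call_script = BashOperator(
    task_id='call_script',
    # Run the project's own interpreter so its virtualenv packages are used.
    bash_command='cd /home/user/my_project && '
                 '/home/user/my_project/.venv/bin/python -m my_project.folderA.folderB.call_me',
    retries=2)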