Airflow + Docker: Path behaviour (+Repo) - python

I have difficulty understanding how paths in Airflow work. I created this repository so that it is easy to see what I mean: https://github.com/remo2479/airflow_example/blob/master/dags/testdag.py
I created the repository from scratch following the manual on the Airflow page; I only deactivated the example DAGs.
As you can see in the only DAG (dags/testdag.py), the DAG contains two tasks and one variable declaration that uses an opened file.
Both tasks use the dummy SQL script in the repository (dags/testdag/testscript.sql). For task 1 I used testdag/testscript.sql as the path, and for task 2 I used dags/testdag/testscript.sql. With a connection set up, task 1 would work and task 2 wouldn't, because the template cannot be found. This is how I would expect both tasks to behave, since the DAG is in the dags folder and we should not put that folder in the path.
But when I try to open testscript.sql and read its contents, I have to put "dags" in the path (dags/testdag/testscript.sql). Why does the path behave differently between the MsSqlOperator and the open function?
For convenience I put the whole script in this post:
from airflow import DAG
from airflow.providers.microsoft.mssql.operators.mssql import MsSqlOperator
from datetime import datetime

with DAG(
    dag_id="testdag",
    schedule_interval="30 6 * * *",
    start_date=datetime(2022, 1, 1),
    catchup=False) as dag:

    # Error because of missing connection - this is how it should be
    first_task = MsSqlOperator(
        task_id="first_task",
        sql="testdag/testscript.sql")

    # Error because of template not found
    second_task = MsSqlOperator(
        task_id="second_task",
        sql="dags/testdag/testscript.sql")

    # When trying to open the file the path has to contain "dags" in the path - why?
    with open("dags/testdag/testscript.sql", "r") as file:
        f = file.read()
        file.close()

    first_task
    second_task

MsSqlOperator has sql as a templated field. This means that the Jinja engine will run on the string passed via the sql parameter. Moreover, it has .sql as a templated extension, which means the operator knows to open a .sql file, read its content and pass it through the Jinja engine before submitting it to the MsSQL database for execution. The behavior you are seeing is part of Airflow's power: you don't need to write code to read the query from the file, Airflow does that for you. Airflow only asks you to provide the query string and the connection; the rest is on the operator to handle.
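You can see this on the operator class itself. A quick check (the exact tuples may vary by provider version, so treat the expected values as an assumption):

from airflow.providers.microsoft.mssql.operators.mssql import MsSqlOperator

# Fields rendered by Jinja, and file extensions that are read and rendered as templates.
print(MsSqlOperator.template_fields)  # expected: ('sql',)
print(MsSqlOperator.template_ext)     # expected: ('.sql',)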
The:
second_task = MsSqlOperator(
    task_id="second_task",
    sql="dags/testdag/testscript.sql")
is throwing a "template not found" error since Airflow looks for template files in paths relative to your DAG file, and this path is not relative to your DAG. If you want this path to be available, use template_searchpath:
with DAG(
    ...,
    template_searchpath=["dags/testdag/"],
) as dag:
Then your operator can just have sql="testscript.sql".
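Putting the two pieces together, a minimal sketch of the fixed DAG (reusing the names from the question) could look like this:

from datetime import datetime

from airflow import DAG
from airflow.providers.microsoft.mssql.operators.mssql import MsSqlOperator

with DAG(
    dag_id="testdag",
    schedule_interval="30 6 * * *",
    start_date=datetime(2022, 1, 1),
    catchup=False,
    template_searchpath=["dags/testdag/"],
) as dag:
    first_task = MsSqlOperator(
        task_id="first_task",
        # resolved via template_searchpath, so no directory prefix is needed
        sql="testscript.sql",
    )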
As for the:
with open("dags/testdag/testscript.sql", "r") as file:
    f = file.read()
    file.close()
This practically does nothing useful. The file is opened and read by the scheduler, because this is top-level code. Not only that: these lines are executed every 30 seconds (the default of min_file_process_interval), since Airflow periodically re-parses your .py file looking for DAG updates. The relative path in open() is resolved against the working directory of the parsing process rather than against the DAG's folder, which should also answer your question about why dags/ is needed there.
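If you really do need the file content during a run, a minimal sketch (assuming Airflow 2's TaskFlow @task decorator) is to move the read into a task so it happens at execution time instead of at every parse:

from airflow.decorators import task

@task
def read_script() -> str:
    # Runs in the worker at execution time, not in the scheduler at parse time.
    # The relative path is still resolved against the process's working directory,
    # so "dags/" stays in the path unless you build an absolute path instead.
    with open("dags/testdag/testscript.sql", "r") as file:
        return file.read()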

Using template_searchpath will work as @Elad has mentioned, but this is DAG-specific.
To find files in Airflow without using template_searchpath, remember that everything Airflow runs starts in the $AIRFLOW_HOME directory (~/airflow by default, or wherever you're executing the services from). So either start from there with all your paths, or reference them in relation to the code file you're currently in (i.e. current_dir from my previous answer).
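A minimal sketch of the second option (the variable names are just illustrative): build the path relative to the DAG file itself, so it works no matter where the services were started from:

import os

# Directory containing this DAG file, independent of the scheduler's working directory.
current_dir = os.path.dirname(os.path.abspath(__file__))

# Hypothetical path to the script that sits next to the DAG file.
sql_path = os.path.join(current_dir, "testdag", "testscript.sql")

with open(sql_path, "r") as file:
    query = file.read()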
Setting Airflow up for the first time can be fiddly.

Related

Set Airflow Variable dynamically

Hi community, I need some help.
I have a GCS bucket called "staging". This bucket contains folders and subfolders (see picture).
There may be several "date folders" (e.g. 20221128). Each date folder has 3 subfolders; I'm interested in the "main_folder". The main_folder has 2 "project folders". Each project folder has several subfolders, and each of these last subfolders contains a .txt file.
The main objective is:
Obtain a list of all the paths to the .txt files (e.g. gs://staging/20221128/main_folder/project_1/subfold_1/file.txt, ...)
Export the list to an Airflow Variable
Use the "list" Variable to run some DAGs dynamically.
The folders in the staging bucket may vary every day, so I don't have static paths.
I'm using Apache Beam with the Python SDK on Cloud Dataflow, and Airflow with Cloud Composer.
Is there a way to obtain the list of paths (like os.listdir() in Python) with Beam and schedule this workflow daily? (I need to overwrite the list Variable every day with new paths.)
For example I can achieve step n.1 (locally) with the following Python script:
import os

def collect_paths(startpath="C:/Users/PycharmProjects/staging/"):
    list_paths = []
    for path, dirs, files in os.walk(startpath):
        for f in files:
            file_path = path + "/" + f
            list_paths.append(file_path)
    return list_paths
Thank you all in advance.
Edit n.1.
I've retrieved the file paths thanks to the google.cloud.storage API in my collect_paths script. Now I want to access XCom and get the list of paths. This is my task instance:
collect_paths_job = PythonOperator(
    task_id='collect_paths',
    python_callable=collect_paths,
    op_kwargs={'bucket_name': 'staging'},
    do_xcom_push=True
)
I want to iterate over the list in order to run (in the same DAG) N concurrent tasks, each processing a single file. I tried with:
files = collect_paths_job.xcom_pull(task_ids='collect_paths', include_prior_dates=False)
for f in files:
    job = get_job_operator(f)
    chain(job)
But got the following error:
TypeError: xcom_pull() missing 1 required positional argument: 'context'
I would like to correct your usage of the term Variable. Airflow attaches a special meaning to this object. What you want is for the file info to be accessible as parameters in a task.
Use XCom
Assume your DAG has a Python task called list_files_from_gcs.
This task runs exactly the collect_paths function that you have written. Since the function returns a list, Airflow automatically pushes the result to XCom, so you can access this information in subsequent tasks anywhere in your DAG.
Your subsequent task can again be a Python task in the same DAG, in which case you can access XCom very easily:
@task
def next_task(xyz, abc, **context):
    ti = context['ti']
    files_list = ti.xcom_pull(task_ids='list_files_from_gcs')
    ...
    ...
If you are looking to trigger an entirely different DAG, then you can use the TriggerDagRunOperator and pass this list as the DAG run config like this:
TriggerDagRunOperator(
    conf={
        "gcs_files_list": "{{ task_instance.xcom_pull(task_ids='list_files_from_gcs') }}"
    },
    ....
    ....
)
Then your triggered DAG can just parse the DAG run config to move ahead.
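A minimal sketch of how the triggered DAG could pick that value up (the task name is illustrative, and this assumes Airflow 2's TaskFlow API; a Jinja-rendered conf value arrives as a string, so it usually needs to be parsed back into a list):

import ast

from airflow.decorators import task

@task
def process_gcs_files(**context):
    # dag_run.conf carries the rendered string pushed by TriggerDagRunOperator.
    raw = context["dag_run"].conf.get("gcs_files_list", "[]")
    files_list = ast.literal_eval(raw)
    for path in files_list:
        print(path)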

Creating backup folder through Task Scheduler and a Python script gives Date Modified as the date of the task

Trying to run a script with Windows Task Scheduler to copy a folder to a different location on a schedule.
The script works, for ref:
import os
import shutil
import datetime
src = r'J:\SteamLibrary\steamapps\common\ARK\ShooterGame\Saved'
dest = r'J:\SteamLibrary\steamapps\common\ARK\ShooterGame\Saved\---BACKUP---'
x = datetime.datetime.now()
timestamp = x.strftime('%Y-%m-%d %H.%M.%S')
newdest = dest+'\\'+timestamp
ignore = shutil.ignore_patterns('---BACKUP---','*tmp')
shutil.copytree(src, newdest, ignore=ignore)
Not really looking for comments regarding this, as it does what I need, unless there's something I need to add to make it work correctly.
Now, when I run the task through the console or by manually clicking "run" in the Task Scheduler, the folder generated has the timestamp of when the script runs.
However, when the task runs automatically based on the conditions of the task, i.e. "run at 14:00", the folder it generates has a timestamp of a few days ago, presumably from when I first created the task itself, and every subsequent run of the task generates a folder with the identical date modified. See attached: https://i.imgur.com/sqGF6UY.png
Any ideas on how to make the date modified of the folder created be the actual date created?

Airflow not recognising zip file DAG built with pytest fixture

We are using Google Composer (a managed Airflow service) with airflow v1.10 and Python 3.6.8.
To deploy our DAGs, we are using the Packaged DAGs (https://airflow.apache.org/concepts.html?highlight=zip#packaged-dags) method.
All is well when the zip file is created from the cmd line like
zip -r dag_under_test.zip test_dag.py
but when I try to do this from a pytest fixture, so I can load the DagBag and test the integrity of my DAG, Airflow doesn't recognise this zip file at all. Here is the code for my pytest fixture:
import os

import pytest
from airflow.models import DagBag

@pytest.fixture
def setup(config):
    os.system("zip -r dag_under_test.zip test_zip.py")

def test_import_dags(setup):
    dagbag = DagBag(include_examples=False)
    noOfDags = len(dagbag.dags)
    dagbag.process_file("dag_under_test.zip")
    assert len(dagbag.dags) == noOfDags + 1, 'DAG import failures. Errors: {}'.format(dagbag.import_errors)
I copied this zip file to the DAGs folder, but Airflow isn't recognising it at all, and there are no error messages.
But the zip file built with the same command from the command line is loaded by Airflow! It seems like I am missing something obvious here and can't figure it out.
In this case, it looks like there is a mismatch between the working directory of os.system and where the DagBag loader is looking. If you inspect the code of airflow/dagbag.py, the path accepted by process_file is passed to os.path.isfile:
def process_file(self, filepath, only_if_updated=True, safe_mode=True):
    if filepath is None or not os.path.isfile(filepath):
        ...
That means within your test, you can probably do some testing to make sure all of these match:
# Make sure this works
os.path.isfile(filepath)
# Make sure these are equal
os.system('pwd')
os.getcwd()
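As a minimal sketch, one way to take the working directory out of the equation is to resolve the zip to an absolute path before handing it to the DagBag (file names as in the question):

import os

# Resolve against the current working directory once, then reuse the same
# absolute path for both the existence check and the DagBag call.
zip_path = os.path.abspath("dag_under_test.zip")
assert os.path.isfile(zip_path), "zip not found at {}".format(zip_path)
dagbag.process_file(zip_path)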
So it turned out that where I create the zip file from is important. In this case I am creating the zip file from the test folder and archiving files from the src folder. Although the final zip file looks perfect to the naked eye, Airflow rejects it.
I tried adding '-j' to the zip command (to junk the directory names) and my test started working.
zip -r -j dag_under_test_metrics.zip ../src/metricsDAG.py
I had another, bigger problem: testing the same scenario when there is a full folder structure in my DAG project, i.e. a DAG file at the top level which references a lot of Python modules within the project. I couldn't get this working with the above trick, but came up with a workaround. I created a small shell script which does the zip part, like this:
SCRIPT_PATH=${0%/*/*}
cd $SCRIPT_PATH
zip -r -q test/dag_under_test.zip DagRunner.py
zip -r -q test/dag_under_test.zip tasks dag common resources
This shell script changes the current dir to the project home and archives from there. I am invoking this script from the pytest fixture like this:
@pytest.fixture
def setup():
    os.system('rm {}'.format(DAG_UNDER_TEST))
    os.system('sh {}'.format(PACKAGE_SCRIPT))
    yield
    print("-------- clean up -----------")
    os.system('rm {}'.format(DAG_UNDER_TEST))
This works perfectly with my integration test:
def test_conversionDAG(setup):
    configuration.load_test_config()
    dagbag = DagBag(include_examples=False)
    noOfDags = len(dagbag.dags)
    dagbag.process_file(DAG_UNDER_TEST)
    assert len(dagbag.dags) == noOfDags + 1, 'DAG import failures. Errors: {}'.format(dagbag.import_errors)
    assert dagbag.get_dag("name of the dag")

Is performing a dynamic loop (from Variables) to create airflow tasks a good approach?

I have a simple Airflow DAG with, let's say, 2 tasks. Something like this:
newDirToLoad = Variable.get("path")
filesInDir = next(os.walk(newDirToLoad))[2]

for f in filesInDir:
    task1 = BashOperator(
        task_id="load_for_" + str(f),
        params={"fileToProcess": newDirToLoad + "/" + f}
        # ...
    )
    task2 = BashOperator(
        # ...
    )
    task1 >> task2
Here the path Variable is initially set to some dummy directory so that my DAG doesn't fail at DAG creation.
Once a new directory with files is created under the "data/to/load/" dir, a script I have written elsewhere runs airflow variables -set path data/to/load/$newDir followed by airflow trigger_dag myDag. This works pretty well, and the number of tasks I see in the Airflow GUI is the same as the number of files present in $newDir. But I think it's kind of a hack to allow dynamic task creation using the Variable feature. Is there a better approach? Is it bad practice to initially set the path Variable to some dummy directory so that the DAG is created successfully?
It's actually possible and very convenient.
I'd recommend you have a look at this blog post from the Astronomer team: Dynamically Generating DAGs in Airflow
TL;DR you could write something like:
customers = Variable.get("customers_list")

for customer in customers:
    dag = DAG(dag_id=f"dag_for_{customer}")

    run_first = DummyOperator(
        task_id='run_this_first',
        dag=dag
    )

    # for iteration purposes
    previous = run_first

    for step in Variable.get("customers_steps"):
        run_this = DummyOperator(
            task_id=f'run_{step}',
            dag=dag
        )
        previous >> run_this
        previous = run_this

    # From that point, the DAG is complete
    # Now, you need to put it in the module globals
    globals()[f"dag_for_{customer}"] = dag
This works because Airflow looks for DAG objects in the modules it scans. In that case, you would just have several DAGs in one module, which is perfectly fine.
Some pros and cons:
pros:
The separation of concerns it offers can be very useful from a business standpoint. In the example of separating customers, you could eventually just edit the Variable content and Airflow takes care of generating / removing DAGs.
Same thing for tasks.
The example uses Variables because you asked for it, but it can actually take whatever list you want, whether it is a hardcoded one or the result of a query.
cons:
Keep in mind that you're generating DAGs at runtime and your module doesn't contain any static DAG. If you wish to import a DAG for some reason, you'd have to use some reflection when importing from another module, but that seems to be a very rare case.
If your list is large, rebuilding the set of DAGs can take a long time. You would probably need to cut your DAGs into smaller ones and partition the list they are generated from.
If your DAGs change too much over time, you'll have a hard time keeping relevant logs, and it might be more difficult for the scheduler to track task triggering.
Enjoy!

Export environment variables at runtime with airflow

I am currently converting workflows that were previously implemented in bash scripts to Airflow DAGs. In the bash scripts, I was just exporting the variables at run time with:
export HADOOP_CONF_DIR="/etc/hadoop/conf"
Now I'd like to do the same in Airflow, but haven't found a solution for this yet. The one workaround I found was setting the variables with os.environ[VAR_NAME]='some_text' outside of any method or operator, but that means they get exported the moment the script gets loaded, not at run time.
Now when I try to call os.environ[VAR_NAME] = 'some_text' in a function that gets called by a PythonOperator, it does not work. My code looks like this
def set_env():
    os.environ['HADOOP_CONF_DIR'] = "/etc/hadoop/conf"
    os.environ['PATH'] = "somePath:" + os.environ['PATH']
    os.environ['SPARK_HOME'] = "pathToSparkHome"
    os.environ['PYTHONPATH'] = "somePythonPath"
    os.environ['PYSPARK_PYTHON'] = os.popen('which python').read().strip()
    os.environ['PYSPARK_DRIVER_PYTHON'] = os.popen('which python').read().strip()

set_env_operator = PythonOperator(
    task_id='set_env_vars_NOT_WORKING',
    python_callable=set_env,
    dag=dag)
Now when my SparkSubmitOperator gets executed, I get the exception:
Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
My use case where this is relevant: I use the SparkSubmitOperator to submit jobs to YARN, so either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment. Setting them in my .bashrc or any other config is sadly not possible for me, which is why I need to set them at runtime.
Preferably I'd like to set them in an operator before executing the SparkSubmitOperator, but if there were a way to pass them as arguments to the SparkSubmitOperator, that would at least be something.
From what I can see in the SparkSubmitOperator, you can pass environment variables to spark-submit as a dictionary:
:param env_vars: Environment variables for spark-submit. It
supports yarn and k8s mode too.
:type env_vars: dict
Have you tried this?
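For example, a minimal sketch (the import path is an assumption that depends on your Airflow version; in older releases the operator lives under airflow.contrib.operators.spark_submit_operator, and the application path here is hypothetical):

from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

submit_job = SparkSubmitOperator(
    task_id="submit_spark_job",
    application="/path/to/your_app.py",   # hypothetical application path
    conn_id="spark_default",
    env_vars={
        # made available to the spark-submit process at run time
        "HADOOP_CONF_DIR": "/etc/hadoop/conf",
        "YARN_CONF_DIR": "/etc/hadoop/conf",
    },
    dag=dag,
)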
