Airflow not recognising zip file DAG built with pytest fixture - python

We are using Google Composer (a managed Airflow service) with Airflow v1.10 and Python 3.6.8.
To deploy our DAGs, we use the Packaged DAG (https://airflow.apache.org/concepts.html?highlight=zip#packaged-dags) method.
All is well when the zip file is created from the command line, like
zip -r dag_under_test.zip test_dag.py
but when I build it from a pytest fixture, so that I can load it into a DagBag and test the integrity of my DAG, Airflow doesn't recognise the zip file at all. Here is the code of my pytest fixture:
@pytest.fixture
def setup(config):
    os.system("zip -r dag_under_test.zip test_zip.py")

def test_import_dags(setup):
    dagbag = DagBag(include_examples=False)
    noOfDags = len(dagbag.dags)
    dagbag.process_file("dag_under_test.zip")
    assert len(dagbag.dags) == noOfDags + 1, 'DAG import failures. Errors: {}'.format(dagbag.import_errors)
I copied this zip file to the DAGs folder, but Airflow isn't recognising it at all and there are no error messages.
The zip file built with the same command from the command line is loaded by Airflow without problems. It seems like I am missing something obvious here, but I can't figure out what.

In this case, it looks like there is a mismatch between the working directory of os.system and where the DagBag loader is looking. If you inspect the code of airflow/dagbag.py, the path accepted by process_file is passed to os.path.isfile:
def process_file(self, filepath, only_if_updated=True, safe_mode=True):
    if filepath is None or not os.path.isfile(filepath):
        ...
That means that within your test you can add a few checks to make sure everything lines up:
# Make sure this returns True for the path you pass to process_file
os.path.isfile(filepath)

# Make sure these report the same directory
os.system('pwd')
os.getcwd()
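One way to rule out a working-directory mismatch entirely is to resolve the archive to an absolute path before handing it to the DagBag; a minimal sketch along those lines (the file name is the one from the question):
import os
from airflow.models import DagBag

zip_path = os.path.abspath("dag_under_test.zip")
assert os.path.isfile(zip_path), "zip not found at {}".format(zip_path)

dagbag = DagBag(include_examples=False)
dagbag.process_file(zip_path)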

So it turned out that where I create the zip file from matters. In this case I was creating the zip file from the test folder while archiving files that live in the src folder. Although the final zip file looks perfect to the naked eye, Airflow rejects it.
Adding '-j' to the zip command (to junk the directory names) made my test start working:
zip -r -j dag_under_test_metrics.zip ../src/metricsDAG.py
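Another option, instead of shelling out to zip, is to build the archive from the fixture with Python's zipfile module, where arcname controls the path stored inside the archive; a rough sketch using the same file as above:
import zipfile

# Store ../src/metricsDAG.py at the root of the archive, regardless of
# the directory the tests are executed from.
with zipfile.ZipFile("dag_under_test_metrics.zip", "w") as zf:
    zf.write("../src/metricsDAG.py", arcname="metricsDAG.py")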
I had another, bigger problem: testing the same scenario when there is a full folder structure in my DAG project, i.e. a DAG file at the top level that references a lot of Python modules within the project. I couldn't get that working with the trick above, but came up with a workaround. I created a small shell script which does the zip part, like this:
SCRIPT_PATH=${0%/*/*}
cd $SCRIPT_PATH
zip -r -q test/dag_under_test.zip DagRunner.py
zip -r -q test/dag_under_test.zip tasks dag common resources
This shell script changes the current directory to the project home and archives from there. I invoke it from the pytest fixture like this:
@pytest.fixture
def setup():
    os.system('rm {}'.format(DAG_UNDER_TEST))
    os.system('sh {}'.format(PACKAGE_SCRIPT))
    yield
    print("-------- clean up -----------")
    os.system('rm {}'.format(DAG_UNDER_TEST))
This works perfectly with my integration test:
def test_conversionDAG(setup):
    configuration.load_test_config()
    dagbag = DagBag(include_examples=False)
    noOfDags = len(dagbag.dags)
    dagbag.process_file(DAG_UNDER_TEST)
    assert len(dagbag.dags) == noOfDags + 1, 'DAG import failures. Errors: {}'.format(dagbag.import_errors)
    assert dagbag.get_dag("name of the dag")

Related

How do I create an automated test for my python script?

I am fairly new to programming and currently working on a python script. It is supposed to gather all the files and directories that are given as paths inside the program and copy them to a new location that the user can choose as an input.
import shutil
import os
from pathlib import Path
import argparse

# Each entry: [name of the destination sub-directory, path of the file/directory to copy]
src = [
    ["<destination name>", "<path of file or directory to copy>"],
]

x = input("Please choose a destination path\n>>>")
if not os.path.exists(x):
    os.makedirs(x)
    print("Directory was created")
else:
    print("Existing directory was chosen")
dest = Path(x.strip())

for pfad in src:
    if os.path.isdir(pfad[1]):
        shutil.copytree(pfad[1], dest / pfad[0])
    elif os.path.isfile(pfad[1]):
        pfad1 = Path(dest / pfad[0])
        if not os.path.exists(pfad1):
            os.makedirs(pfad1)
        shutil.copy(pfad[1], dest / pfad[0])
    else:
        print("An error occurred")
        print(pfad)

print("All files and directories have been copied!")
input()
The script itself is working just fine. The problem is that I want to write a test that automatically tests the code each time I push it to my GitLab repository. I have been browsing the web for quite some time now, but wasn't able to find a good explanation of how to approach creating a test for a script like this.
I would be extremely thankful for any kind of feedback or hints to helpful resources.
First, you should write a test that you can run from the command line.
I suggest you use the argparse module to pass the source and destination directories, so that you can run the script as script.py source_dir dest_dir without human interaction.
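As a rough illustration of that refactor (the function and argument names below are placeholders, not code from the question), the interactive input could be replaced with something like:
import argparse
import shutil
from pathlib import Path

def copy_tree(source: Path, destination: Path) -> None:
    # Copy a single file or a whole directory into the destination directory.
    destination.mkdir(parents=True, exist_ok=True)
    if source.is_dir():
        shutil.copytree(source, destination / source.name)
    else:
        shutil.copy(source, destination / source.name)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Copy files/directories to a destination")
    parser.add_argument("source", type=Path)
    parser.add_argument("destination", type=Path)
    args = parser.parse_args()
    copy_tree(args.source, args.destination)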
Then, once you have a test you can run, you need to add a .gitlab-ci.yml to the root of the project so that you can use the GitLab CI.
If you have never used the GitLab CI, start here: https://docs.gitlab.com/ee/ci/quick_start/
After that, you'll be able to add a job to your .gitlab-ci.yml so that a runner with Python installed will run the test. If you don't understand the bold terms in the previous sentence, you need to understand GitLab CI first.
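For the test itself, a minimal pytest sketch (this assumes the copy logic has been extracted into a function such as the copy_tree shown above; the module name copy_script is hypothetical):
from pathlib import Path

# Hypothetical import; adjust to wherever the copy function actually lives.
from copy_script import copy_tree

def test_copies_a_single_file(tmp_path):
    source = tmp_path / "input.txt"
    source.write_text("hello")
    destination = tmp_path / "out"

    copy_tree(source, destination)

    assert (destination / "input.txt").read_text() == "hello"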

Makefile - get arguments from .ini/.yml file

Is it possible to have a Makefile grab arguments from either a config.ini or config.yml file?
Let's consider this example: we have a Python main.py file which is written as a CLI. Now, we do not want users to be filling in arguments to a Python CLI in the terminal, so we have an example config.ini file with the arguments:
PYTHON FILE:
import typer

def say_name(name: str):
    print('running the code')
    print(f'Hello there {name}')

if __name__ == "__main__":
    typer.run(say_name)
config.ini FILE:
[argument]
name = person
Makefile FILE:
run_code:
	python main.py ${config.ini.argument.name}
Is it possible to have a project infrastructure like this?
I am aware that Spacy projects do exactly this. However, I would like to do something similar even outside an NLP project, without the need to use Spacy.
You need to find, or write, a tool which will read in your .ini file and generate a set of makefile variables from it. I don't know where you would find such a thing but it's probably not hard to write one using a python module that parses .ini files.
Suppose you have a script ini2make that will do this, so that if you run:
ini2make config.ini
it will write to stdout makefile variable assignment lines like this:
config.ini.argument.name = person
config.ini.argument.email = person@somewhere.org
etc. Then you can integrate this into your makefile very easily (here I'm assuming you're using GNU make) through use of GNU make's automatic include file regeneration:
include config.ini.mk
config.ini.mk: config.ini
	ini2make $< > $@
Done. Now, whenever config.ini.mk doesn't exist or config.ini has been changed since config.ini.mk was last generated, make will run the ini2make script to update it and then re-execute itself automatically to read the new values.
Then you can use variables that were generated, like $(config.ini.argument.name) etc.
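The answer leaves ini2make as an exercise; a minimal sketch of such a script using Python's configparser (the name ini2make and the output format follow the example above, everything else is an assumption):
#!/usr/bin/env python
import configparser
import sys

# Read the .ini file named on the command line and print one makefile
# assignment per option, e.g. config.ini.argument.name = person
config = configparser.ConfigParser()
config.read(sys.argv[1])

for section in config.sections():
    for option, value in config.items(section):
        print("{}.{}.{} = {}".format(sys.argv[1], section, option, value))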

Airflow + Docker: Path behaviour (+Repo)

I have difficulties understanding how paths in Airflow work. I created this repository so that it is easy to see what I mean: https://github.com/remo2479/airflow_example/blob/master/dags/testdag.py
I created this repository from scratch according to the manual on the Airflow page and just deactivated the example DAGs.
As you can see in the only DAG (dags/testdag.py), the DAG contains two tasks and one variable declaration using an opened file.
Both tasks use the dummy SQL script in the repository (dags/testdag/testscript.sql). Once I used testdag/testscript.sql as the path (task 1) and once dags/testdag/testscript.sql (task 2). With a connection set up, task 1 would work and task 2 wouldn't, because the template cannot be found. This is how I would expect both tasks to behave, since the DAG is in the dags folder and we should not put that folder in the path.
But when I try to open testscript.sql and read its contents, I have to put "dags" in the path (dags/testdag/testscript.sql). Why does the path behave differently with the MsSqlOperator than with the open function?
For convenience I put the whole script in this post:
from airflow import DAG
from airflow.providers.microsoft.mssql.operators.mssql import MsSqlOperator
from datetime import datetime

with DAG(
    dag_id="testdag",
    schedule_interval="30 6 * * *",
    start_date=datetime(2022, 1, 1),
    catchup=False) as dag:

    # Error because of missing connection - this is how it should be
    first_task = MsSqlOperator(
        task_id="first_task",
        sql="testdag/testscript.sql")

    # Error because of template not found
    second_task = MsSqlOperator(
        task_id="second_task",
        sql="dags/testdag/testscript.sql")

    # When trying to open the file the path has to contain "dags" - why?
    with open("dags/testdag/testscript.sql", "r") as file:
        f = file.read()
        file.close()

    first_task
    second_task
MsSqlOperator has sql as a templated field. This means that the Jinja engine will run on the string passed via the sql parameter. Moreover, it has .sql as a templated extension, which means the operator knows to open a .sql file, read its content and pass it through the Jinja engine before submitting it to the MsSQL database for execution. The behavior you are seeing is part of Airflow's power: you don't need to write code to read the query from a file, Airflow does that for you. Airflow just asks you to provide the query string and the connection; the rest is on the operator to handle.
The:
second_task = MsSqlOperator(
    task_id="second_task",
    sql="dags/testdag/testscript.sql")
is throwing a template-not-found error since Airflow looks for template extensions in paths relative to your DAG, and this path is not relative to your DAG. If you want this path to be searched as well, use template_searchpath:
with DAG(
...,
template_searchpath=["dags/testdag/"],
) as dag:
Then your operator can just have sql="testscript.sql".
As for the:
with open("dags/testdag/testscript.sql", "r") as file:
    f = file.read()
    file.close()
this does practically nothing useful. The file is opened and read by the scheduler, because this is top-level code. Not only that: these lines will be executed roughly every 30 seconds (the default of min_file_process_interval), as Airflow periodically re-parses your .py file looking for DAG updates. This should also answer your question about why dags/ is needed: the open() call resolves the relative path against the working directory of the scheduler process, not against the location of the DAG file.
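If the file content is genuinely needed, one common pattern (a sketch, not something from the original answer) is to defer the read into a task callable placed inside the with DAG(...) block, so it only runs at task execution time rather than on every scheduler parse:
from airflow.operators.python import PythonOperator

def read_script():
    # Runs only when the task executes, not during DAG parsing.
    with open("dags/testdag/testscript.sql") as file:
        return file.read()

read_task = PythonOperator(
    task_id="read_script",
    python_callable=read_script,
)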
Using template_searchpath will work as @Elad has mentioned, but note that this is DAG-specific.
To find files in Airflow without using template_searchpath, remember that everything Airflow runs starts in the $AIRFLOW_HOME directory (i.e. airflow by default, or wherever you're executing the services from). So either start there with all your imports, or reference them in relation to the code file you're currently in (i.e. current_dir from my previous answer).
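A small sketch of that second option, building the path from the location of the DAG file itself (the name current_dir mirrors the answer; the rest is illustrative):
import os

# Directory containing this DAG file, regardless of where the
# scheduler or worker process was started from.
current_dir = os.path.dirname(os.path.abspath(__file__))

with open(os.path.join(current_dir, "testdag", "testscript.sql")) as file:
    script = file.read()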
Setting Airflow up for the first time can be fiddly.

Running all Python scripts with the same name across many directories

I have a file structure that looks something like this:
Master:
    First
        train.py
        other1.py
    Second
        train.py
        other2.py
    Third
        train.py
        other3.py
I want to be able to have one Python script that lives in the Master directory that will do the following when executed:
Loop through all the subdirectories (and their subdirectories if they exist)
Run every Python script named train.py in each of them, in whatever order necessary
I know how to execute a given python script from another file (given its name), but I want to create a script that will execute whatever train.py scripts it encounters. Because the train.py scripts are subject to being moved around and being duplicated/deleted, I want to create an adaptable script that will run all those that it finds.
How can I do this?
You can use os.walk to recursively collect all train.py scripts and then run them in parallel using ProcessPoolExecutor and the subprocess module.
import functools
import os
import subprocess
from concurrent.futures import ProcessPoolExecutor

def list_python_scripts(root):
    """Finds all 'train.py' scripts in the given directory recursively."""
    scripts = []
    for root, _, filenames in os.walk(root):
        scripts.extend([
            os.path.join(root, filename) for filename in filenames
            if filename == 'train.py'
        ])
    return scripts

def main():
    # Make sure to change the argument here to the directory you want to scan.
    scripts = list_python_scripts('master')
    with ProcessPoolExecutor(max_workers=len(scripts)) as pool:
        # Run each script in parallel and collect the CompletedProcess results.
        # capture_output is needed so that result.stdout is populated below.
        results = pool.map(
            functools.partial(subprocess.run, capture_output=True, text=True),
            [['python', script] for script in scripts])
    for result in results:
        print(result.returncode, result.stdout)

if __name__ == '__main__':
    main()
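If each train.py relies on sibling modules (other1.py, other2.py, ...), it may also be worth launching every script with its own directory as the working directory; a hedged variation on the idea above:
import os
import subprocess

def run_in_own_dir(script):
    # Launch the script with cwd set to its folder so that sibling imports
    # and relative paths inside train.py resolve as expected.
    return subprocess.run(['python', os.path.basename(script)],
                          cwd=os.path.dirname(script))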
Which OS are you using?
If Ubuntu/CentOS, try this combination:
import os

# Put this in Master: find lists every train.py in Master and its
# subdirectories, and os.popen captures the matching paths.
train_scripts = os.popen("find . -type f -name train.py").read().split()

# Next, execute them.
for script in train_scripts:
    os.system("python " + script)
If you are using Windows, you could try running them from a PowerShell script. You can run two Python scripts at once with just this:
python Test1.py
python Folder/Test1.py
Then add a loop and/or a function that goes searching for the files. Because it's Windows PowerShell, you have a lot of power when it comes to the filesystem and controlling Windows in general.

how do you run pytest either from a notebook or command line on databricks?

I have created some classes, each of which takes a dataframe as a parameter. I have imported pytest and created some fixtures and simple assert methods.
I can call pytest.main(["."]) from a notebook and it will execute pytest from the rootdir (databricks/driver).
I have tried passing the notebook path, but it says it is not found.
Ideally, I'd want to execute this from the command line.
How do I configure the rootdir?
There seems to be a disconnect between the Spark OS and the user workspace area which I'm finding hard to bridge.
As a caveat, I don't want to use unittest, since pytest can be used in the CI pipeline by outputting JUnit XML which Azure DevOps can report on.
I've explained the reason why you can't run pytest on Databricks notebooks (unless you export them, and upload them to dbfs as regular .py files, which is not what you want) in the link at the bottom of this post.
However, I have been able to run doctests in Databricks, using the doctest.run_docstring_examples method like so:
import doctest

def f(x):
    """
    >>> f(1)
    45
    """
    return x + 1

doctest.run_docstring_examples(f, globals())
This will print out:
**********************************************************************
File "/local_disk0/tmp/1580942556933-0/PythonShell.py", line 5, in NoName
Failed example:
f(1)
Expected:
45
Got:
2
If you also want to raise an exception, take a further look at: https://menziess.github.io/howto/test/code-in-databricks-notebooks/
Taken from Databricks' own repo: https://github.com/databricks/notebook-best-practices/blob/main/notebooks/run_unit_tests.py
# Databricks notebook source
# MAGIC %md Test runner for `pytest`
# COMMAND ----------
!cp ../requirements.txt ~/.
%pip install -r ~/requirements.txt
# COMMAND ----------
# pytest.main runs our tests directly in the notebook environment, providing
# fidelity for Spark and other configuration variables.
#
# A limitation of this approach is that changes to the test will be
# cached by Python's import caching mechanism.
#
# To iterate on tests during development, we restart the Python process
# and thus clear the import cache to pick up changes.
dbutils.library.restartPython()
import pytest
import os
import sys
# Run all tests in the repository root.
notebook_path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()
repo_root = os.path.dirname(os.path.dirname(notebook_path))
os.chdir(f'/Workspace/{repo_root}')
%pwd
# Skip writing pyc files on a readonly filesystem.
sys.dont_write_bytecode = True
retcode = pytest.main([".", "-p", "no:cacheprovider"])
# Fail the cell execution if we have any test failures.
assert retcode == 0, 'The pytest invocation failed. See the log above for details.'
