External files in Airflow DAG - python

I'm trying to access external files in an Airflow task to read some SQL, and I'm getting a "file not found" error. Has anyone come across this?
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
dag = DAG(
    'my_dat',
    start_date=datetime(2017, 1, 1),
    catchup=False,
    schedule_interval=timedelta(days=1)
)
def run_query():
    # read the query
    query = open('sql/queryfile.sql')
    # run the query
    execute(query)
tas = PythonOperator(
    task_id='run_query', dag=dag, python_callable=run_query)
The log states the following:
IOError: [Errno 2] No such file or directory: 'sql/queryfile.sql'
I understand that I could simply copy and paste the query into the same file, but that's really not a neat solution. There are multiple queries and the text is really big; embedding it in the Python code would compromise readability.

Here is an example that uses a Variable to make this easy.
First add a Variable in the Airflow UI under Admin -> Variables, e.g. {key: 'sql_path', value: 'your_sql_script_folder'}.
Then add the following code to your DAG so that it reads the Variable and passes it as template_searchpath.
DAG code:
import airflow
from airflow.models import Variable
tmpl_search_path = Variable.get("sql_path")
dag = airflow.DAG(
    'tutorial',
    schedule_interval="@daily",
    template_searchpath=tmpl_search_path,  # this
    default_args=default_args
)
Now you can refer to a SQL script by its name, or by its path relative to the folder stored in the Variable.
You can learn more in this

All relative paths are taken in reference to the AIRFLOW_HOME environment variable. Try one of the following:
Give an absolute path.
Place the file relative to AIRFLOW_HOME.
Log the working directory (PWD) in the Python callable and then decide which path to give (best option).
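The last option can be sketched roughly as follows. The helper names log_working_directory and resolve are made up for illustration; the idea is only to call something like this at the top of the Python callable and read the result in the task log.

```python
import logging
import os


def log_working_directory() -> str:
    """Log and return the directory that relative paths are resolved against."""
    cwd = os.getcwd()
    logging.getLogger(__name__).info("Working directory: %s", cwd)
    return cwd


def resolve(relative_path: str) -> str:
    """Return the absolute path a relative path maps to right now."""
    return os.path.abspath(relative_path)
```

Calling resolve('sql/queryfile.sql') inside the task shows exactly where Python is looking for the file, which usually makes the mismatch obvious.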

Assuming that the sql directory is relative to the current Python file, you can figure out the absolute path to the sql file like this:
import os
CUR_DIR = os.path.abspath(os.path.dirname(__file__))
def run_query():
    # read the query
    query = open(f"{CUR_DIR}/sql/queryfile.sql")
    # run the query
    execute(query)
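A variant of the same idea using pathlib, as a minimal sketch. The helper read_sql is hypothetical, and the temporary directory only stands in for the folder that holds the DAG file so the example can run on its own.

```python
import tempfile
from pathlib import Path


def read_sql(base_dir, relative_path):
    """Return the text of a SQL file located under base_dir."""
    sql_path = Path(base_dir) / relative_path
    # read_text opens the file, reads it, and closes it again.
    return sql_path.read_text(encoding="utf-8")


# Demo: a throwaway directory plays the role of the DAG file's folder.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "sql").mkdir()
    (Path(demo_dir) / "sql" / "queryfile.sql").write_text("SELECT 1;", encoding="utf-8")
    demo_query = read_sql(demo_dir, "sql/queryfile.sql")
```

In a real DAG file, base_dir would typically be Path(__file__).resolve().parent, and the returned string (rather than an open file object) is what gets passed on to whatever runs the query.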

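You can get the DAG directory from the Airflow configuration, as shown below.
import os
from airflow.configuration import conf
conf.get('core', 'DAGS_FOLDER')
# open a file located in the DAGs folder
open(os.path.join(conf.get('core', 'DAGS_FOLDER'), 'something.json'), 'r')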
ref: https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#dags-folder

Related

How to use imported function from a module that creates and returns a DAG for UI to see?

I've created a module named dag_template_module.py that returns a DAG using specified arguments. I want to use this definition for multiple DAGs that do the same thing but from different sources (and thus with different parameters). A simplified version of dag_template_module.py:
from airflow.decorators import dag, task
from airflow.operators.bash import BashOperator
def dag_template(
    dag_id: str,
    echo_message_1: str,
    echo_message_2: str
):
    @dag(
        dag_id=dag_id,
        schedule_interval="0 6 2 * *"
    )
    def dag_example():
        echo_1 = BashOperator(
            task_id='echo_1',
            bash_command=f'echo {echo_message_1}'
        )
        echo_2 = BashOperator(
            task_id='echo_2',
            bash_command=f'echo {echo_message_2}'
        )
        echo_1 >> echo_2
    dag = dag_example()
    return dag
Now I've created a hello_world_dag.py that imports the dag_template() function from dag_template_module.py and uses it to create a DAG:
from dag_template import dag_template
hello_world_dag = dag_template(
    dag_id='hello_world_dag',
    echo_message_1='Hello',
    echo_message_2='World'
)
I expected this DAG to be discovered by the Airflow UI, but that's not the case.
I've also tried using globals() in hello_world_dag.py according to the documentation, but that doesn't work for me either:
from dag_template import dag_template
hello_world_dag = 'hello_word_dag'
globals()[hello_world_dag] = dag_template(dag_id='hello_world_dag',
    echo_message_1='Hello',
    echo_message_2='World'
)
A couple of things:
The DAG you are attempting to create is missing the start_date parameter.
There is a nuance to how Airflow determines which Python files might contain a DAG definition: it looks for the strings "dag" and "airflow" in the file contents. hello_world_dag.py is missing these keywords, so the DagFileProcessor won't attempt to parse the file and therefore never calls the dag_template() function.
With these small tweaks applied, and running on Airflow 2.5.0:
dag_template_module.py
from pendulum import datetime
from airflow.decorators import dag
from airflow.operators.bash import BashOperator
def dag_template(dag_id: str, echo_message_1: str, echo_message_2: str):
    @dag(dag_id, start_date=datetime(2023, 1, 22), schedule=None)
    def dag_example():
        echo_1 = BashOperator(task_id="echo_1", bash_command=f"echo {echo_message_1}")
        echo_2 = BashOperator(task_id="echo_2", bash_command=f"echo {echo_message_2}")
        echo_1 >> echo_2
    return dag_example()
hello_world_dag.py
# airflow dag <- Make sure these words appear _somewhere_ in the file.
from dag_template_module import dag_template
dag_template(dag_id="dag_example", echo_message_1="Hello", echo_message_2="World")
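The keyword check described in this answer can be approximated with the sketch below. It is a simplified stand-in written for illustration, not Airflow's actual implementation, and the function name is only borrowed to describe the idea.

```python
def might_contain_dag(file_content: str) -> bool:
    """Simplified heuristic: both keywords must appear somewhere in the file."""
    lowered = file_content.lower()
    return "dag" in lowered and "airflow" in lowered


# The original hello_world_dag.py mentions "dag" but never "airflow".
original_file = (
    "from dag_template import dag_template\n"
    "hello_world_dag = dag_template(dag_id='hello_world_dag')\n"
)
# Adding a comment that contains both words is enough to pass the check.
fixed_file = "# airflow dag\n" + original_file

original_is_candidate = might_contain_dag(original_file)
fixed_is_candidate = might_contain_dag(fixed_file)
```

Under this heuristic the original file is skipped before any Python in it runs, which is why the factory function was never called.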

Poke the Specified Extension file in the server directory using the Airflow SFTPSensor

My use case is quite simple:
When a file is dropped in the FTP server directory, the SFTPSensor task should pick up the file with the specified .txt extension and process its content.
path="/test_dir/sample.txt" works.
My requirement is to read dynamic filenames that have only the specified extension (text files).
With path="/test_dir/*.txt", file poking does not work.
# Sample code
from airflow.models import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.sftp.sensors.sftp import SFTPSensor
from airflow.providers.ssh.hooks.ssh import SSHHook
from datetime import datetime
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2022, 4, 16)
}
with DAG(
    'sftp_sensor_test',
    schedule_interval=None,
    default_args=default_args
) as dag:
    waiting_for_file = SFTPSensor(
        task_id="check_for_file",
        sftp_conn_id="sftp_default",
        path="/test_dir/*.txt",  # NOTE: Poking for the txt extension files
        mode="reschedule",
        poke_interval=30
    )
    waiting_for_file
To achieve what you want, I think you should use the file_pattern argument as follows:
waiting_for_file = SFTPSensor(
    task_id="check_for_file",
    sftp_conn_id="sftp_default",
    path="test_dir",
    file_pattern="*.txt",
    mode="reschedule",
    poke_interval=30
)
However, there is currently a bug for this feature → https://github.com/apache/airflow/issues/28121
Until this is fixed, you can easily create a local, fixed version of the sensor in your project by following the explanations in the issue.
Here is the file with the current fix: https://github.com/RishuGuru/airflow/blob/ac0457a51b885459bc5ae527878a50feb5dcadfa/airflow/providers/sftp/sensors/sftp.py
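For reference, here is a small sketch of what a glob-style pattern such as "*.txt" selects from a directory listing. It uses Python's standard fnmatch module purely to illustrate the matching semantics; it makes no claim about how the sensor itself is implemented, and the filenames are invented.

```python
from fnmatch import fnmatch


def matching_files(filenames, file_pattern):
    """Return the names that match a shell-style pattern, in listing order."""
    return [name for name in filenames if fnmatch(name, file_pattern)]


# Hypothetical directory listing.
listing = ["sample.txt", "report.csv", "notes.txt", "archive.txt.bak"]
txt_files = matching_files(listing, "*.txt")
```

Note that the pattern applies to file names only, which is why the directory goes in path and the wildcard goes in file_pattern.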

create utils.py in AWS lambda

I had a def hello() function in my home/file.py file. I created a home/common/utils.py file and moved the function there.
Now I want to import it in file.py.
I tried importing it with both from utils import hello and from common.utils import hello, and the import in my file doesn't raise an error. However, when I run it on AWS Lambda, I get this error:
Runtime.ImportModuleError: Unable to import module 'file': No module named 'utils'
How can I fix this without having to use EC2 or something similar?
data "archive_file" "file_zip" {
  type             = "zip"
  source_file      = "${path.module}/src/file.py"
  output_file_mode = "0666"
  output_path      = "${path.module}/bin/file.zip"
}
The deployment package that you're uploading only contains your main Python script (file.py). Specifically, it does not include any dependencies such as common/utils.py. That's why the import fails when the code runs in Lambda.
Modify the creation of your deployment package (file.zip) so that it includes all needed dependencies.
For example:
data "archive_file" "file_zip" {
  type             = "zip"
  output_file_mode = "0666"
  output_path      = "${path.module}/bin/file.zip"
  source {
    content  = file("${path.module}/src/file.py")
    filename = "file.py"
  }
  source {
    content  = file("${path.module}/src/common/utils.py")
    filename = "common/utils.py"
  }
}
If all of your files happen to be in a single folder then you can use source_dir instead of indicating the individual files.
Note: I don't use Terraform so the file(...) with embedded interpolation may not be 100% correct, but you get the idea.
First, follow the official documentation: https://docs.aws.amazon.com/lambda/latest/dg/python-package.html (see the section titled "Deployment package with dependencies").
At the end of that section you will find this command:
zip -g my-deployment-package.zip lambda_function.py
Run the same command for your utils file:
zip -g my-deployment-package.zip common/
zip -g my-deployment-package.zip common/utils.py
Make sure that lambda_function uses the proper import statement, like:
from common.utils import util_function_name
Now you can upload this zip and test it yourself. It should run.
Hope this helps.
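To check what actually ends up in the package before uploading, something like the following sketch may help. It builds a zip with Python's zipfile module from a throwaway source tree that mirrors the layout in the question; build_package is a hypothetical helper, not part of any AWS tooling.

```python
import tempfile
import zipfile
from pathlib import Path


def build_package(src_dir, zip_path):
    """Zip every .py file under src_dir, keeping paths relative to src_dir."""
    src = Path(src_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(src.rglob("*.py")):
            archive.write(path, arcname=path.relative_to(src).as_posix())
    # Re-open the archive and report the entries it contains.
    with zipfile.ZipFile(zip_path) as archive:
        return archive.namelist()


# Demo: recreate the src/file.py + src/common/utils.py layout in a temp dir.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    (src / "common").mkdir(parents=True)
    (src / "file.py").write_text("from common.utils import hello\n")
    (src / "common" / "utils.py").write_text("def hello():\n    return 'hi'\n")
    package_contents = build_package(src, Path(tmp) / "file.zip")
```

If common/utils.py is missing from the listing, the import will fail in Lambda regardless of how it is written in file.py.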

How to load environment variables in a config.ini file?

I have a config.ini file which contains some properties but I want to read the environment variables inside the config file.
[section1]
prop1:(from environment variable) or value1
Is this possible or do I have to write a method to take care of that?
I know I'm late to the party, but someone might find this handy.
What you are asking for is value interpolation combined with os.environ.
Pyramid?
Doing this is unusual; it generally comes up when one has a Pyramid server (which reads a config file out of the box) while wanting to imitate a Django one (which is always set up to read os.environ).
If you are using Pyramid, then pyramid.paster.get_app has the argument options: if you pass os.environ as the dictionary, you can use %(variable)s in the ini file. Note that this is not specific to pyramid.paster.get_app, as the next section shows (but my guess is that get_app is your case).
app.py
from pyramid.paster import get_app
from waitress import serve
import os
app = get_app('production.ini', 'main', options=os.environ)
serve(app, host='0.0.0.0', port=8000, threads=50)
production.ini:
[app:main]
sqlalchemy.url = %(SQL_URL)s
...
Configparser?
The above uses configparser's basic interpolation.
Say I have a file called config.ini with the lines
[System]
conda_location = %(CONDA_PYTHON_EXE)s
The following will work...
import configparser, os
cfg = configparser.ConfigParser()
cfg.read('config.ini')
print(cfg.get('System', 'conda_location', vars=os.environ))
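The same lookup can be tried without touching the real environment, as a minimal sketch: the ini text is parsed from a string and a plain dict stands in for os.environ. The helper read_with_env and the sample path are made up for the demo.

```python
import configparser
import os


def read_with_env(ini_text, section, option, env=None):
    """Read one option, letting %(NAME)s placeholders resolve from env."""
    env = os.environ if env is None else env
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    # vars takes priority over values defined in the section itself.
    return parser.get(section, option, vars=env)


INI_TEXT = "[System]\nconda_location = %(CONDA_PYTHON_EXE)s\n"

# A plain dict plays the role of os.environ here.
fake_env = {"CONDA_PYTHON_EXE": "/opt/conda/bin/python"}
conda_location = read_with_env(INI_TEXT, "System", "conda_location", env=fake_env)
```

If the referenced variable is not present in the mapping, configparser raises an interpolation error, so a missing environment variable fails loudly rather than silently yielding an empty value.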
Another option, I think, is to use a .env file: in config.py, add from dotenv import load_dotenv and call load_dotenv() on the next line. It will load the variables from the .env file into the environment.

Export all Airflow variables

I have a problem with downloading all Airflow variables from code.
There is an option to export them from the UI, but I haven't found any way to do it programmatically.
I found only the Variable.get('variable_name') method, which returns a single Airflow variable.
There seems to be no way of getting the list of all Airflow variables.
Searching the source code didn't help either.
Do you know an easy way?
Thank you in advance.
You can use the Airflow CLI to export the variables to a file and then read that file from your Python code.
airflow variables --export FILEPATH
(In Airflow 2.x the equivalent command is airflow variables export FILEPATH.)
Programmatically, you can use the BashOperator to achieve this.
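Reading the exported file back could look roughly like this. The sketch assumes the export is a JSON object mapping variable names to values; load_exported_variables is a hypothetical helper, and the sample file is written by the demo itself rather than by the CLI.

```python
import json
import tempfile
from pathlib import Path


def load_exported_variables(path):
    """Load an exported variables file into a plain dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Demo: write a stand-in export file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    export_file = Path(tmp) / "variables.json"
    export_file.write_text(
        json.dumps({"sql_path": "/opt/sql", "retries": "3"}), encoding="utf-8"
    )
    variables = load_exported_variables(export_file)
```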
I like the answer above about using the Airflow CLI, but it is also possible to extract all variables purely from Python (so there is no need for odd tricks to get them).
Use this code snippet:
from airflow.utils.db import create_session
from airflow.models import Variable
# a db.Session object is used to run queries against
# the create_session() method will create (yield) a session
with create_session() as session:
    # By calling .query() with Variable, we are asking the airflow db
    # session to return all variables (select * from variables).
    # The result of this is an iterable item similar to a dict but with a
    # slightly different signature (object.key, object.val).
    airflow_vars = {var.key: var.val for var in session.query(Variable)}
The above method queries the Airflow SQL database and returns all variables.
A simple dictionary comprehension lets you remap the returned values to a 'normal' dictionary.
session.query will raise a sqlalchemy.exc.OperationalError if it is unable to connect to a running Airflow database.
If you (for whatever reason) wish to mock create_session as part of a unit test, this snippet can be used:
from unittest import TestCase
from unittest.mock import patch, MagicMock
import contextlib
import json
mock_data = {
    "foo": {
        "bar": "baz"
    }
}
airflow_vars = ...  # reference to an output (dict) of aforementioned method
class TestAirflowVariables(TestCase):
    @contextlib.contextmanager
    def create_session(self):
        """Helper that mocks airflow.settings.Session().query() result signature
        This is achieved by yielding a mocked airflow.settings.Session() object
        """
        session = MagicMock()
        session.query.return_value = [
            # for the purpose of this test mock_data is converted to json where
            # dicts are encountered.
            # You will have to modify the above method to parse data from airflow
            # correctly (it will send json objects, not dicts)
            MagicMock(key=k, val=json.dumps(v) if isinstance(v, dict) else v)
            for k, v in mock_data.items()
        ]
        yield session
    @patch("airflow.utils.db")
    def test_data_is_correctly_parsed(self, db):
        db.create_session = self.create_session
        self.assertDictEqual(airflow_vars, mock_data)
Note: you will have to change the patch to however you are importing the create_session method in the file you are referencing. I only got it to work by importing up until airflow.utils.db and calling db.create_session in the aforementioned method.
Hope this helps!
Good luck :)
Taking into account all the suggestions above, here is a code snippet that can be used to export all Airflow variables and store them in GCS:
import datetime
import pendulum
import os
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
local_tz = pendulum.timezone("Europe/Paris")
default_dag_args = {
    'start_date': datetime.datetime(2020, 6, 18, tzinfo=local_tz),
    'email_on_failure': False,
    'email_on_retry': False
}
with DAG(dag_id='your_dag_id',
         schedule_interval='00 3 * * *',
         default_args=default_dag_args,
         catchup=False,
         user_defined_macros={
             'env': os.environ
         }) as dag:
    start = DummyOperator(
        task_id='start',
    )
    export_task = BashOperator(
        task_id='export_var_task',
        bash_command='airflow variables --export variables.json; gsutil cp variables.json your_cloud_storage_path',
    )
    start >> export_task
I had issues using the BashOperator for this use case, so I captured the output of the bash command in a variable and used it inside my program.
import subprocess
output = (subprocess.check_output("airflow variables", shell=True)).decode('utf-8').split('pid=')[1].split()[1:-1]
print(output)
