I'm having trouble retrieving all Airflow variables from code.
It is possible to export them from the UI, but I haven't found any way to do it programmatically.
I have only found the Variable.get('variable_name') method, which returns a single Airflow variable.
There doesn't seem to be any way to get the list of all Airflow variables.
Searching the source code didn't help either.
Do you know of an easy way?
Thank you in advance.
You can use Airflow CLI to export variables to a file and then read it from your Python code.
airflow variables --export FILEPATH
Programmatically, you can use the BashOperator to achieve this.
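For example, a minimal sketch (the export path is just an illustration; the CLI syntax matches the Airflow 1.x command above, and the dag object is assumed to exist):
import json

from airflow.operators.bash_operator import BashOperator

# Dump all variables to a JSON file from within a DAG.
export_vars = BashOperator(
    task_id='export_airflow_variables',
    bash_command='airflow variables --export /tmp/airflow_variables.json',
    dag=dag,
)

# After the task has run, read the exported file back into a dict.
with open('/tmp/airflow_variables.json') as f:
    all_vars = json.load(f)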
I like the answer above about using the Airflow CLI, but it is also possible to extract all variables from pure Python as well, with no need for workarounds.
Use this code snippet:
from airflow.utils.db import create_session
from airflow.models import Variable
# a db.Session object is used to run queries against
# the create_session() method will create (yield) a session
with create_session() as session:
    # By calling .query() with Variable, we are asking the airflow db
    # session to return all variables (select * from variables).
    # The result of this is an iterable item similar to a dict but with a
    # slightly different signature (object.key, object.val).
    airflow_vars = {var.key: var.val for var in session.query(Variable)}
The above method queries the Airflow metadata database and returns all variables.
Using a simple dictionary comprehension remaps the return values to a 'normal' dictionary.
The session.query call will raise a sqlalchemy.exc.OperationalError if it cannot connect to a running Airflow database.
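If you want the code to degrade gracefully when no Airflow database is reachable (for example when running outside an Airflow environment), a small sketch:
from sqlalchemy.exc import OperationalError

from airflow.models import Variable
from airflow.utils.db import create_session

try:
    with create_session() as session:
        airflow_vars = {var.key: var.val for var in session.query(Variable)}
except OperationalError:
    # No reachable Airflow metadata database; fall back to an empty dict.
    airflow_vars = {}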
If you (for whatever reason) wish to mock create_session as part of a unittest, this snippet can be used:
from unittest import TestCase
from unittest.mock import patch, MagicMock
import contextlib
import json
mock_data = {
    "foo": {
        "bar": "baz"
    }
}

airflow_vars = ...  # reference to an output (dict) of the aforementioned method


class TestAirflowVariables(TestCase):

    @contextlib.contextmanager
    def create_session(self):
        """Helper that mocks the airflow.settings.Session().query() result signature.

        This is achieved by yielding a mocked airflow.settings.Session() object.
        """
        session = MagicMock()
        session.query.return_value = [
            # For the purpose of this test, mock_data is converted to json where
            # dicts are encountered.
            # You will have to modify the above method to parse data from airflow
            # correctly (it will send json objects, not dicts).
            MagicMock(key=k, val=json.dumps(v) if isinstance(v, dict) else v)
            for k, v in mock_data.items()
        ]
        yield session

    @patch("airflow.utils.db")
    def test_data_is_correctly_parsed(self, db):
        db.create_session = self.create_session
        self.assertDictEqual(airflow_vars, mock_data)
Note: you will have to change the patch target to match how you import the create_session method in the file you are testing. I only got it to work by importing up to airflow.utils.db and calling db.create_session in the aforementioned method.
Hope this helps!
Good luck :)
Taking into account all the suggestions above, here is a code snippet that can be used to export all Airflow variables and store them in GCS:
import datetime
import pendulum
import os
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
local_tz = pendulum.timezone("Europe/Paris")
default_dag_args = {
    'start_date': datetime.datetime(2020, 6, 18, tzinfo=local_tz),
    'email_on_failure': False,
    'email_on_retry': False
}

with DAG(dag_id='your_dag_id',
         schedule_interval='00 3 * * *',
         default_args=default_dag_args,
         catchup=False,
         user_defined_macros={
             'env': os.environ
         }) as dag:

    start = DummyOperator(
        task_id='start',
    )

    export_task = BashOperator(
        task_id='export_var_task',
        bash_command='airflow variables --export variables.json; gsutil cp variables.json your_cloud_storage_path',
    )

    start >> export_task
I had issues using the BashOperator for this use case, so I captured the output of the bash command in a variable and used it inside my program.
import subprocess

# Run "airflow variables", split the raw CLI output on 'pid=' and keep the
# whitespace-separated tokens that follow (dropping the first and last),
# which in my environment leaves just the variable keys.
output = (subprocess.check_output("airflow variables", shell=True)).decode('utf-8').split('pid=')[1].split()[1:-1]
print(output)
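If that string surgery feels brittle, an alternative sketch is to reuse the export command from the earlier answers and load the resulting JSON (still assuming the Airflow 1.x CLI):
import json
import os
import subprocess
import tempfile

# Export all variables to a temporary JSON file and load it into a dict.
fd, path = tempfile.mkstemp(suffix=".json")
os.close(fd)
try:
    subprocess.check_call(["airflow", "variables", "--export", path])
    with open(path) as f:
        all_vars = json.load(f)
finally:
    os.remove(path)

print(all_vars)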
I've created a module named dag_template_module.py that returns a DAG built from the specified arguments. I want to use this definition for multiple DAGs that do the same thing but read from different sources (and thus take different parameters). A simplified version of dag_template_module.py:
from airflow.decorators import dag, task
from airflow.operators.bash import BashOperator
def dag_template(
    dag_id: str,
    echo_message_1: str,
    echo_message_2: str
):
    @dag(
        dag_id=dag_id,
        schedule_interval="0 6 2 * *"
    )
    def dag_example():
        echo_1 = BashOperator(
            task_id='echo_1',
            bash_command=f'echo {echo_message_1}'
        )
        echo_2 = BashOperator(
            task_id='echo_2',
            bash_command=f'echo {echo_message_2}'
        )
        echo_1 >> echo_2

    dag = dag_example()
    return dag
Now I've created a hello_world_dag.py that imports the dag_template() function from dag_template_module.py and uses it to create a DAG:
from dag_template_module import dag_template

hello_world_dag = dag_template(
    dag_id='hello_world_dag',
    echo_message_1='Hello',
    echo_message_2='World'
)
I expected this DAG to be discovered by the Airflow UI, but that's not the case.
I've also tried using globals() in hello_world_dag.py according to the documentation, but that doesn't work for me either:
from dag_template_module import dag_template

hello_world_dag = 'hello_word_dag'
globals()[hello_world_dag] = dag_template(
    dag_id='hello_world_dag',
    echo_message_1='Hello',
    echo_message_2='World'
)
A couple of things:
The DAG you are attempting to create is missing the start_date param.
There is a nuance to how Airflow determines which Python files might contain a DAG definition: it looks for the strings "dag" and "airflow" in the file contents. hello_world_dag.py is missing these keywords, so the DagFileProcessor won't attempt to parse the file and therefore never calls the dag_template() function.
With these small tweaks, running on Airflow 2.5.0:
dag_template_module.py
from pendulum import datetime
from airflow.decorators import dag
from airflow.operators.bash import BashOperator
def dag_template(dag_id: str, echo_message_1: str, echo_message_2: str):
    @dag(dag_id, start_date=datetime(2023, 1, 22), schedule=None)
    def dag_example():
        echo_1 = BashOperator(task_id="echo_1", bash_command=f"echo {echo_message_1}")
        echo_2 = BashOperator(task_id="echo_2", bash_command=f"echo {echo_message_2}")
        echo_1 >> echo_2

    return dag_example()
hello_world_dag.py
# airflow dag <- Make sure these words appear _somewhere_ in the file.
from dag_template_module import dag_template
dag_template(dag_id="dag_example", echo_message_1="Hello", echo_message_2="World")
I have the following my_func.py with a create_config function.
my_func.py
from fabric.state import env


def create_config(node_name):
    config = {
        "log_level": "INFO",
        "addr1": "127.0.0.1",
    }
    config["addr2"] = env.host
    return config
I tried the following approach to mock env.host variable where env is an import from fabric.state.
test.py
import unittest
import my_func
import mock
class MyTestCase(unittest.TestCase):
    def setUp(self):
        self.master_config = {
            "log_level": "INFO",
            "addr2": "0.0.0.0",
            "addr1": "127.0.0.1",
        }

    @mock.patch('env.host')
    def test_create_consul_config(self, mock_host):
        mock_host.return_value = "0.0.0.0"
        result = my_func.create_config('master')
        self.assertDictEqual(self.master_config, result)


if __name__ == '__main__':
    unittest.main()
I am getting an import error for 'env'. What is the best way to mock a variable used inside a function with python mock?
ImportError: No module named env
To mock the variable env.host, first check the type of env:
In [6]: from fabric.state import env
In [7]: type(env)
Out[7]: fabric.utils._AttributeDict
env.host is an instance variable of a class (env is an _AttributeDict object), so the mock is a little different: patch env itself and assign the host attribute directly, rather than using return_value.
@mock.patch('my_func.env')
def test_create_consul_config(self, mock_env):
    mock_env.host = "0.0.0.0"
    result = my_func.create_config('master')
    self.assertDictEqual(self.master_config, result)
From the unittest.mock documentation on patch (note target is the first argument of patch):
target should be a string in the form 'package.module.ClassName'. The
target is imported and the specified object replaced with the new
object, so the target must be importable from the environment you are
calling patch() from. The target is imported when the decorated
function is executed, not at decoration time.
So you need to include the full path to the function you are patching. Note also in where to patch that the target should be the path to where the function/object is used, not where it is defined.
So changing your patch call to:
@mock.patch("my_func.env.host")
should fix the ImportError.
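For completeness, a sketch of the adjusted test. Since env.host is a plain attribute rather than a callable, it is simplest to patch it with a replacement value (with new=... the patched object is not passed to the test method; if env has no host attribute before the test runs, add create=True):
import unittest
from unittest import mock

import my_func


class MyTestCase(unittest.TestCase):
    def setUp(self):
        self.master_config = {
            "log_level": "INFO",
            "addr1": "127.0.0.1",
            "addr2": "0.0.0.0",
        }

    # Patch the attribute where it is used (my_func), not where it is defined.
    @mock.patch("my_func.env.host", new="0.0.0.0")
    def test_create_config(self):
        result = my_func.create_config("master")
        self.assertDictEqual(self.master_config, result)


if __name__ == "__main__":
    unittest.main()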
Let's assume that there are the following minimalistic Python classes inside one module, e.g. Module:
module/
    __init__.py
    db.py
    document.py
db.py
import yaml


class DB(object):
    config = {}

    @classmethod
    def load_config(cls, config_path):
        cls.config = yaml.load(open(config_path, 'r').read())
and document.py
from .db import DB


class Document(object):
    db = None

    def __init__(self):
        self.db = DB()
End-user is going to use such Module as follows:
from Module import DB, Document
DB.load_config('/path/to/config.yml')
Document.do_some_stuff()
doc1 = Document()
doc2 = Document.find(...)
doc2.update_something(...)
doc2.save()
It is expected that the Document class, and every instance of it, will internally have access to the DB class with the config specified by the user. However, since Document performs an internal import of the DB class (from .db import DB), it receives a 'fresh' DB class with the default config.
I did a lot of searching; most questions and answers are about module-wide configs, not configs specified by the end user.
How can I achieve such functionality? I guess there is some architectural problem here, but what is the simplest way to solve it?
Perhaps this isn't the most appropriate answer, but a few months back I wrote a module called aconf for this exact purpose. It's a memory-based global configuration module for Python written in 8 lines. The idea is you can do the following:
You create a Config object to force the user to input the configuration your program requires (in this case it's inside config.py):
""" 'Config' class to hold our desired configuration parameters.
Note:
This is technically not needed. We do this so that the user knows what he/she should pass
as a config for the specific project. Note how we also take in a function object - this is
to demonstrate that one can have absolutely any type in the global config and is not subjected
to any limitations.
"""
from aconf import make_config


class Config:
    def __init__(self, arg, func):
        make_config(arg=arg, func=func)
You consume your configuration throughout your module (in this case, inside functionality.py):
""" Use of the global configuration through the `conf` function. """
from aconf import conf
class Example:
    def __init__(self):
        func = conf().func
        arg = conf().arg
        self.arg = func(arg)
And then use it (in this case inside main.py):
from project.config import Config
from project.functionality import Example
# Random function to demonstrate we can pass _anything_ to 'make_config' inside 'Config'.
def uppercase(words):
return words.upper()
# We create our custom configuration without saving it.
Config(arg="hello world", func=uppercase)
# We initialize our Example object without passing the 'Config' object to it.
example = Example()
print(example.arg)
# >>> "HELLO WORLD"
The entire aconf module is the following:
__version__ = "1.0.1"
import namedtupled
def make_config(**kwargs):
    globals()["aconf"] = kwargs


conf = lambda: namedtupled.map(globals()["aconf"])
config = lambda: globals()["aconf"]
... in essence, you just save your configuration to globals() during runtime.
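To illustrate just that idea (this is a sketch of the mechanism, not the aconf API), a dependency-free version could look like:
# Store the config in module-level state at runtime and read it back anywhere.
_config = {}


def make_config(**kwargs):
    _config.update(kwargs)


def conf(key):
    return _config[key]


make_config(arg="hello world")
print(conf("arg"))  # hello world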
It's so stupid it makes me wonder if you should even be allowed to do this. I wrote aconf for fun, but have never personally used it in a big project. The reality is, you might run into the problem of making your code weird for other developers.
I'm trying to access external files in an Airflow task to read some SQL, and I'm getting "file not found". Has anyone come across this?
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
dag = DAG(
    'my_dag',
    start_date=datetime(2017, 1, 1),
    catchup=False,
    schedule_interval=timedelta(days=1)
)


def run_query():
    # read the query
    query = open('sql/queryfile.sql')
    # run the query
    execute(query)


task = PythonOperator(
    task_id='run_query', dag=dag, python_callable=run_query)
The log states the following:
IOError: [Errno 2] No such file or directory: 'sql/queryfile.sql'
I understand that I could simply copy and paste the query into the same file, but that's really not a neat solution. There are multiple queries and the text is really big; embedding it in the Python code would compromise readability.
Here is an example that uses a Variable to make this easy.
First, add a Variable in the Airflow UI -> Admin -> Variables, e.g. {key: 'sql_path', value: 'your_sql_script_folder'}
Then add the following code in your DAG to use the Variable from Airflow.
DAG code:
import airflow
from airflow.models import Variable

tmpl_search_path = Variable.get("sql_path")

dag = airflow.DAG(
    'tutorial',
    schedule_interval="@daily",
    template_searchpath=tmpl_search_path,  # this
    default_args=default_args
)
Now you can use the SQL script name, or a path relative to the folder stored in the Variable.
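If you are reading the file yourself in a PythonOperator (as in the question) rather than through a Jinja-templated field, the same Variable can be used to build an absolute path. A sketch (execute() is the question's placeholder):
import os

from airflow.models import Variable


def run_query():
    # Build an absolute path from the 'sql_path' Variable defined in the UI.
    sql_dir = Variable.get("sql_path")
    with open(os.path.join(sql_dir, "queryfile.sql")) as f:
        query = f.read()
    # run the query (same placeholder as in the question)
    execute(query)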
You can learn more in this
All relative paths are taken in reference to the AIRFLOW_HOME environment variable. Try:
Giving an absolute path
Placing the file relative to AIRFLOW_HOME
Logging the PWD in the python callable and then deciding what path to give (best option; see the sketch below)
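A short sketch of that last suggestion (the dags/sql layout is only an assumption; execute() is the question's placeholder):
import os


def run_query():
    # Log where the task is actually running from before deciding on a path.
    print("cwd:", os.getcwd())
    print("AIRFLOW_HOME:", os.environ.get("AIRFLOW_HOME"))

    # Example: build the path from AIRFLOW_HOME instead of relying on the cwd.
    sql_file = os.path.join(os.environ.get("AIRFLOW_HOME", ""), "dags", "sql", "queryfile.sql")
    with open(sql_file) as f:
        execute(f.read())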
Assuming that the sql directory is relative to the current Python file, you can figure out the absolute path to the sql file like this:
import os

CUR_DIR = os.path.abspath(os.path.dirname(__file__))


def run_query():
    # read the query
    query = open(f"{CUR_DIR}/sql/queryfile.sql")
    # run the query
    execute(query)
You can get the DAG directory as below.
import os
from airflow.configuration import conf

# open a file that lives inside the configured DAGs folder
open(os.path.join(conf.get('core', 'DAGS_FOLDER'), 'something.json'), 'r')
ref: https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#dags-folder
I'm using EasyGui to allow a user to select multiple options. Each option is a function which they can run if they select it. I'm trying to use dictionaries as suggested in other threads, but I'm having trouble implementing it ('module' object is not callable error). Is there something I'm missing?
from easygui import *
import emdtest1
import emdtest2
import emdtest3
EMDTestsDict = {"emdtest1": emdtest1,
                "emdtest2": emdtest2,
                "emdtest3": emdtest3}


def main():
    test_list = UserSelect()
    for i in range(len(test_list)):
        if test_list[i] in EMDTestsDict.keys():
            EMDTestsDict[test_list[i]]()


def UserSelect():
    message = "Which EMD tests would you like to run?"
    title = "EMD Test Selector"
    tests = ["emdtest1",
             "emdtest2",
             "emdtest3"]
    selected_master = multchoicebox(message, title, tests)
    return selected_master


if __name__ == '__main__':
    main()
You're putting modules into the dict, when you want to put functions in it. What you're doing is the equivalent of saying
import os
os()
Which, of course, makes no sense. If emdtest1, emdtest2, and emdtest3 are .py files with functions in them, you want:
from emdtest1 import function_name
Where function_name is the name of your function.
You need to import the functions rather than the modules. For example, if you have a file called emdtest1 with a function emdtest1 defined in it, you'd use:
from emdtest1 import emdtest1
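Putting either answer into the original snippet, the dictionary then maps names to imported functions rather than modules (the function names below are assumptions about what each emdtest module actually defines):
from easygui import multchoicebox

# Assumed function names; adjust to whatever each module really exposes.
from emdtest1 import emdtest1
from emdtest2 import emdtest2
from emdtest3 import emdtest3

EMDTestsDict = {"emdtest1": emdtest1,
                "emdtest2": emdtest2,
                "emdtest3": emdtest3}


def main():
    selected = multchoicebox("Which EMD tests would you like to run?",
                             "EMD Test Selector", list(EMDTestsDict)) or []
    for name in selected:
        if name in EMDTestsDict:
            EMDTestsDict[name]()  # calls a function, not a module


if __name__ == '__main__':
    main()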