Export environment variables at runtime with airflow - python

I am currently converting workflows that were previously implemented as bash scripts into Airflow DAGs. In the bash scripts, I simply exported the variables at run time with
export HADOOP_CONF_DIR="/etc/hadoop/conf"
Now I'd like to do the same in Airflow, but haven't found a solution for this yet. The one workaround I found was setting the variables with os.environ[VAR_NAME] = 'some_text' outside of any method or operator, but that means they get exported the moment the script is loaded, not at run time.
However, when I try to call os.environ[VAR_NAME] = 'some_text' in a function that gets called by a PythonOperator, it does not work. My code looks like this:
import os

def set_env():
    os.environ['HADOOP_CONF_DIR'] = "/etc/hadoop/conf"
    os.environ['PATH'] = "somePath:" + os.environ['PATH']
    os.environ['SPARK_HOME'] = "pathToSparkHome"
    os.environ['PYTHONPATH'] = "somePythonPath"
    os.environ['PYSPARK_PYTHON'] = os.popen('which python').read().strip()
    os.environ['PYSPARK_DRIVER_PYTHON'] = os.popen('which python').read().strip()

set_env_operator = PythonOperator(
    task_id='set_env_vars_NOT_WORKING',
    python_callable=set_env,
    dag=dag)
Now when my SparkSubmitOperator gets executed, I get the exception:
Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
The use case where this matters is the SparkSubmitOperator, which I use to submit jobs to YARN, so either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment. Setting them in my .bashrc or any other configuration file is sadly not possible for me, which is why I need to set them at runtime.
Preferably I'd like to set them in an operator before the SparkSubmitOperator executes, but the possibility of passing them as arguments to the SparkSubmitOperator would at least be something.

From what I can see in the SparkSubmitOperator source, you can pass environment variables to spark-submit as a dictionary:
:param env_vars: Environment variables for spark-submit. It
supports yarn and k8s mode too.
:type env_vars: dict
Have you tried this?
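For example, a minimal sketch of what that could look like (the connection id, application path, and values below are placeholders, not taken from the question; depending on your Airflow version the import may instead come from airflow.contrib.operators.spark_submit_operator):

from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

spark_task = SparkSubmitOperator(
    task_id='spark_submit_with_env',
    application='/path/to/app.py',   # placeholder application
    conn_id='spark_default',         # placeholder connection
    env_vars={
        'HADOOP_CONF_DIR': '/etc/hadoop/conf',
        'YARN_CONF_DIR': '/etc/hadoop/conf',
    },
    dag=dag)

With env_vars set this way, the variables reach the spark-submit invocation itself, so a separate task to export them should not be needed.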

Related

Airflow + Docker: Path behaviour (+Repo)

I am having difficulty understanding how paths in Airflow work. I created this repository so that it is easy to see what I mean: https://github.com/remo2479/airflow_example/blob/master/dags/testdag.py
I set up this repository from scratch following the manual on the Airflow page; I only deactivated the example DAGs.
As you can see in the only DAG (dags/testdag.py), it contains two tasks and one variable declaration that reads from an opened file.
Both tasks use the dummy SQL script in the repository (dags/testdag/testscript.sql). In task 1 I used testdag/testscript.sql as the path, and in task 2 I used dags/testdag/testscript.sql. With a connection set up, task 1 would work while task 2 wouldn't, because the template cannot be found. This is how I would expect both tasks to behave, since the DAG lives in the dags folder and that folder should not be part of the path.
But when I try to open testscript.sql and read its contents, I have to include "dags" in the path (dags/testdag/testscript.sql). Why does the path behave differently with the MsSqlOperator than with the open function?
For convenience I put the whole script in this post:
from airflow import DAG
from airflow.providers.microsoft.mssql.operators.mssql import MsSqlOperator
from datetime import datetime

with DAG(
        dag_id="testdag",
        schedule_interval="30 6 * * *",
        start_date=datetime(2022, 1, 1),
        catchup=False) as dag:

    # Error because of missing connection - this is how it should be
    first_task = MsSqlOperator(
        task_id="first_task",
        sql="testdag/testscript.sql")

    # Error because of template not found
    second_task = MsSqlOperator(
        task_id="second_task",
        sql="dags/testdag/testscript.sql")

    # When trying to open the file, the path has to contain "dags" - why?
    with open("dags/testdag/testscript.sql", "r") as file:
        f = file.read()
        file.close()

    first_task
    second_task
MsSqlOperator has sql as a templated field. This means that the Jinja engine will run on the string passed via the sql parameter. Moreover, it has .sql as a templated extension, which means the operator knows to open a .sql file, read its content, and pass it through the Jinja engine before submitting it to the MsSQL database for execution. The behavior you are seeing is part of Airflow's power: you don't need to write code to read the query from the file, Airflow does that for you. Airflow asks you only to provide the query string and the connection; the rest is handled by the operator.
The:
second_task = MsSqlOperator(
    task_id="second_task",
    sql="dags/testdag/testscript.sql")
is throwing a template-not-found error because Airflow looks for templated files in paths relative to your DAG, and this path is not relative to your DAG. If you want this path to be available, use template_searchpath:
with DAG(
    ...,
    template_searchpath=["dags/testdag/"],
) as dag:
Then your operator can just have sql="testscript.sql".
As for the:
with open("dags/testdag/testscript.sql", "r") as file:
    f = file.read()
    file.close()
This practically does nothing useful here. The file will be opened and read by the scheduler, because this is top-level code. Not only that: these lines will be executed every 30 seconds (the default of min_file_process_interval), since Airflow periodically rescans your .py file looking for DAG updates. This should also answer your question about why dags/ is needed there: plain Python calls like open() resolve relative paths against the process's working directory (typically $AIRFLOW_HOME), not against the DAG file's location.
Using template_searchpath will work, as #Elad has mentioned, but note that it is DAG-specific.
To find files in Airflow without using template_searchpath, remember that everything Airflow runs starts in the $AIRFLOW_HOME directory (~/airflow by default, or wherever you're executing the services from). So either start from there with all your imports, or reference files relative to the code file you're currently in (i.e. current_dir from my previous answer).
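A minimal sketch of the second approach, building the path from the DAG file's own location (the variable and file names here are just illustrative):

import os

# Directory that contains this DAG file, regardless of where Airflow was started from
current_dir = os.path.dirname(os.path.abspath(__file__))

# Resolve the script relative to the DAG file instead of relative to $AIRFLOW_HOME
script_path = os.path.join(current_dir, "testdag", "testscript.sql")

with open(script_path, "r") as file:
    f = file.read()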
Setting Airflow up for the first time can be fiddly.

Airflow SSHExecuteOperator() with env=... not setting remote environment

I am modifying the environment of the calling process and appending to its PATH, along with setting some new environment variables. However, when I print os.environ in the child process, these changes are not reflected. Any idea what may be happening?
My call to the script on the instance:
ssh_hook = SSHHook(conn_id=ssh_conn_id)
temp_env = os.environ.copy()
temp_env["PATH"] = "/somepath:" + temp_env["PATH"]

run = SSHExecuteOperator(
    bash_command="python main.py",
    env=temp_env,
    ssh_hook=ssh_hook,
    task_id="run",
    dag=dag)
Explanation: Implementation Analysis
If you look at the source of Airflow's SSHHook class, you'll see that it doesn't incorporate the env argument into the command being run remotely at all. The SSHExecuteOperator implementation passes env= through to the Popen() call on the hook, but that only passes it to the local subprocess.Popen() implementation, not to the remote operation.
Thus, in short: Airflow does not support passing environment variables over SSH. If it were to have such support, it would need to either incorporate them into the command being executed remotely, or add the SendEnv option to the ssh command being run locally (which even then would only work if the remote sshd were configured with AcceptEnv, whitelisting the specific environment variable names to be received).
Workaround: Passing Environment Variables On The Command Line
from pipes import quote  # in Python 3, make this "from shlex import quote"

def with_prefix_from_env(env_dict, command=None):
    result = 'set -a; '
    for (k, v) in env_dict.items():
        result += '%s=%s; ' % (quote(k), quote(v))
    if command:
        result += command
    return result

SSHExecuteOperator(bash_command=with_prefix_from_env(temp_env, "python main.py"),
                   ssh_hook=ssh_hook, task_id="run", dag=dag)
Workaround: Remote Sourcing
If your environment variables are sensitive and you don't want them to be logged with the command, you can transfer them out-of-band and source the remote file containing them.
from pipes import quote

def with_env_from_remote_file(filename, command):
    return "set -a; . %s; %s" % (quote(filename), command)

SSHExecuteOperator(bash_command=with_env_from_remote_file(envfile, "python main.py"),
                   ssh_hook=ssh_hook, task_id="run", dag=dag)
Note that set -a directs the shell to export all defined variables, so the file being sourced need only define variables with key=val declarations; they'll be exported automatically. If you generate this file from your Python script, be sure to quote both keys and values with pipes.quote() to ensure that it only performs assignments and does not run other commands. The . keyword is the POSIX-compliant equivalent of the bash source command.
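As a rough sketch of generating such a file from Python under those constraints (the file path and variable values are placeholders):

from pipes import quote  # "from shlex import quote" on Python 3

def write_env_file(env_dict, filename):
    # Write key=val lines; quoting both sides ensures the sourced file
    # only performs assignments and cannot run arbitrary commands.
    with open(filename, "w") as f:
        for k, v in env_dict.items():
            f.write("%s=%s\n" % (quote(k), quote(v)))

write_env_file({"PATH": "/somepath:/usr/bin", "HADOOP_CONF_DIR": "/etc/hadoop/conf"},
               "/tmp/remote_env.sh")  # transfer this file to the remote host out-of-band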

How to call python script in Spark?

I have a metrics.py which calculates a graph.
I can call it from the terminal command line (python ./metrics.py -i [input] [output]).
I want to write a function in Spark that calls the metrics.py script on the provided file path and collects the values that metrics.py prints out.
How can I do that?
In order to run metrics.py, you essentially have to ship it to all the executor nodes that run your Spark job.
To do this, you can either pass it via the SparkContext constructor -
sc = SparkContext(conf=conf, pyFiles=['path_to_metrics.py'])
or pass it later using the SparkContext's addPyFile method -
sc.addPyFile('path_to_metrics.py')
In either case, don't forget to import metrics.py afterwards and then simply call the function that produces the output you need.
import metrics
metrics.relevant_function()
Also make sure that all the Python libraries imported inside metrics.py are installed on all the executor nodes. Otherwise, take care of them using the --py-files and --jars options when spark-submitting your job.
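Putting the pieces together, a sketch along these lines should work (the input paths are placeholders, and the signature of metrics.relevant_function is assumed for illustration):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("metrics_example")   # illustrative app name
sc = SparkContext(conf=conf)
sc.addPyFile('path_to_metrics.py')                 # ship metrics.py to every executor

def compute(path):
    # Import inside the function so the module is resolved on the executor
    import metrics
    return metrics.relevant_function(path)         # assumed to accept an input path

# Apply the function to a set of input paths distributed across the cluster
results = sc.parallelize(['/data/input1', '/data/input2']).map(compute).collect()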

From Python execute shell command and incorporate environment changes (without subprocess)?

I'm exploring using IPython as a shell replacement for a workflow that requires good logging and reproducibility of actions.
I have a few non-Python binary programs and bash shell commands to run during my common workflow that manipulate the environment variables affecting subsequent work, i.e. when run from bash, the environment changes.
How can I incorporate these cases into the Python/IPython interactive shell and modify the environment going forward in the session?
Let's focus on the most critical case.
From bash, I would do:
> sysmanager initialize foo
where sysmanager is a function:
> type sysmanager
sysmanager is a function
sysmanager ()
{
eval `/usr/bin/sysmanagercmd bash $*`
}
I don't control the binary sysmanagercmd and it generally makes non-trivial manipulations of the environment variables. Use of the eval built-in means these manipulations affect the shell process going forward -- that's critical to the design.
How can I call this command from Python/IPython with the same effect? Does Python have something equivalent to bash's eval built-in for non-Python commands?
Having not come across any built-in capability to do this, I wrote the following function, which accomplishes the broad intent. Environment variable modifications and changes of working directory are reflected in the Python shell after the function returns. Modifications of shell aliases or functions are not retained, but that could also be handled by enhancing this function.
#!/usr/bin/env python3
"""
Some functionality useful when working with IPython as a shell replacement.
"""
import subprocess
import tempfile
import os

def ShellEval(command_str):
    """
    Evaluate the supplied command string in the system shell.
    Operates like the shell eval command:
      - Environment variable changes are pulled into the Python environment
      - Changes in working directory remain in effect
    """
    temp_stdout = tempfile.SpooledTemporaryFile()
    temp_stderr = tempfile.SpooledTemporaryFile()
    # In broader use, this string insertion into the shell command should be given more security consideration.
    subprocess.call("""trap 'printf "\\0`pwd`\\0" 1>&2; env -0 1>&2' exit; %s""" % (command_str,),
                    stdout=temp_stdout, stderr=temp_stderr, shell=True)
    temp_stdout.seek(0)
    temp_stderr.seek(0)
    all_err_output = temp_stderr.read()
    allByteStrings = all_err_output.split(b'\x00')
    command_error_output = allByteStrings[0]
    # Some risk in assuming index 1: what if the command sent a null char to its output?
    new_working_dir_str = allByteStrings[1].decode('utf-8')
    variables_to_ignore = ['SHLVL', 'COLUMNS', 'LINES', 'OPENSSL_NO_DEFAULT_ZLIB', '_']
    newdict = dict([tuple(bs.decode('utf-8').split('=', 1)) for bs in allByteStrings[2:-1]])
    for (varname, varvalue) in newdict.items():
        if varname not in variables_to_ignore:
            if varname not in os.environ:
                # print("New Variable: %s=%s" % (varname, varvalue))
                os.environ[varname] = varvalue
            elif os.environ[varname] != varvalue:
                # print("Updated Variable: %s=%s" % (varname, varvalue))
                os.environ[varname] = varvalue
    deletedVars = []
    for oldvarname in os.environ.keys():
        if oldvarname not in newdict.keys():
            deletedVars.append(oldvarname)
    for oldvarname in deletedVars:
        # print("Deleted environment Variable: %s" % (oldvarname,))
        del os.environ[oldvarname]
    if os.getcwd() != os.path.normpath(new_working_dir_str):
        # print("Working directory changed to %s" % (os.path.normpath(new_working_dir_str),))
        os.chdir(new_working_dir_str)
    # Display output of the user's command_str. Standard output and error streams are not interleaved.
    print(temp_stdout.read().decode('utf-8'))
    print(command_error_output.decode('utf-8'))
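A quick usage sketch from an IPython session (the sysmanager call mirrors the question; the export and cd lines are purely illustrative):

import os

# A command whose environment/directory changes should persist in this Python session
ShellEval('eval `/usr/bin/sysmanagercmd bash initialize foo`')

# Simpler illustration: both the variable and the working-directory change stick
ShellEval('export MY_FLAG=1; cd /tmp')
print(os.environ.get('MY_FLAG'))   # -> '1'
print(os.getcwd())                 # -> '/tmp' (or its resolved path)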

How to set environment variables permanently on a POSIX (Unix/Linux) machine using a Python script

I am trying to set an environment variable permanently, but it only works temporarily.
If I run the program below, the variable's value is printed; but after closing the terminal, opening a new one, and checking with printenv LD_LIBRARY_PATH, nothing is printed.
#!/usr/bin/python
import os
import subprocess

def setenv_var():
    env_var = "LD_LIBRARY_PATH"
    env_path = "/usr/local/lib"
    os.environ[env_var] = env_path
    process = subprocess.Popen('printenv ' + env_var, stdout=subprocess.PIPE, shell=True)
    result = process.communicate()[0]
    return result

if __name__ == '__main__':
    print setenv_var()
Please help me.
Here is what I use to set environment variables:
def setenv_var(env_file, set_this_env=True):
    env_var = "LD_LIBRARY_PATH"
    env_path = "/usr/local/lib"

    # make shells opened later pick it up by appending to the `source`-d file
    with open(env_file, 'a') as f:
        f.write(os.linesep + ("%s=%s" % (env_var, env_path)))

    if set_this_env:
        # also set it in the current environment
        os.environ[env_var] = env_path
Now you only have to choose where to set it; that is the first argument of the function. I recommend a profile-specific file such as ~/.profile, or ~/.bashrc if you're using bash, which is pretty common.
You can also set it globally by using a file like /etc/environment, but you'll need the appropriate permissions when you run this script (sudo python script.py).
Remember that environments are inherited from the parent process, and you can't have a child set up a parent process's environment.
When you set an environment variable, it only affects the currently running process (and, by extension, any children forked after the variable is set). If you are attempting to set an environment variable in your shell and you want it to always be set for your interactive shells, you need to set it in the startup scripts (e.g. .login, .bashrc, .profile) for your shell. Commands that you run are (initially) children of the shell from which you run them, so although they inherit the shell's environment and can change their own environment, they cannot change the environment of your shell.
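A small sketch illustrating that this inheritance is one-way (the variable name is arbitrary):

import os
import subprocess

os.environ["DEMO_VAR"] = "from_parent"

# The child inherits DEMO_VAR, but anything it exports vanishes when it exits.
subprocess.call('echo "child sees: $DEMO_VAR"; export DEMO_VAR=from_child', shell=True)

print("parent still sees:", os.environ["DEMO_VAR"])   # -> from_parent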
Whether you do an export from bash or set os.environ from Python, the change only lasts for the session or the process's lifetime. If you want to make it permanent, you will have to append it to the respective shell's profile file.
For example, if you are on bash, you could do:
import os

with open(os.path.expanduser("~/.bashrc"), "a") as outfile:  # 'a' stands for "append"; open() does not expand "~" on its own
    outfile.write("\nexport LD_LIBRARY_PATH=/usr/local/lib\n")
Check this out for some insight into which file to add it to, depending on the target shell: https://unix.stackexchange.com/questions/117467/how-to-permanently-set-environmental-variables
