I'm using Google's Oozie-to-Airflow converter to convert some Oozie workflows that are running on AWS EMR. I managed to get a first version, but when I try to upload the DAG, Airflow throws an error:
Broken DAG: No module named 'o2a'
I have tried to deploy the PyPI package o2a, both using the command
gcloud composer environments update composer-name --update-pypi-packages-from-file requirements.txt --location location
and from the Google Cloud console. Both failed.
requirements.txt
o2a==1.0.1
Here is the code
from airflow import models
from airflow.operators.subdag_operator import SubDagOperator
from airflow.utils import dates
from o2a.o2a_libs import functions
from airflow.models import Variable
import subdag_validation
import subdag_generate_reports
CONFIG = {}
JOB_PROPS = {
}
dag_config = Variable.get("coordinator", deserialize_json=True)
cdrPeriod = dag_config["cdrPeriod"]
TASK_MAP = {"validation": ["validation"], "generate_reports": ["generate_reports"] }
TEMPLATE_ENV = {**CONFIG, **JOB_PROPS, "functions": functions, "task_map": TASK_MAP}
with models.DAG(
    "workflow_coordinator",
    schedule_interval=None,  # Change to suit your needs
    start_date=dates.days_ago(0),  # Change to suit your needs
    user_defined_macros=TEMPLATE_ENV,
) as dag:
    validation = SubDagOperator(
        task_id="validation",
        trigger_rule="one_success",
        subdag=subdag_validation.sub_dag(dag.dag_id, "validation", dag.start_date, dag.schedule_interval),
    )
    generate_reports = SubDagOperator(
        task_id="generate_reports",
        trigger_rule="one_success",
        subdag=subdag_generate_reports.sub_dag(
            dag.dag_id,
            "generate_reports",
            dag.start_date,
            dag.schedule_interval,
            {"cdrPeriod": "{{cdrPeriod}}"},
        ),
    )
    validation.set_downstream(generate_reports)
There is a section in the o2a docs that covers how to deploy the o2a libraries:
https://github.com/GoogleCloudPlatform/oozie-to-airflow#the-o2a-libraries
This initially failed because of another dependency: lark-parser.
Installing it with the PyPI package manager for Composer did the trick.
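For reference, this is roughly what the working requirements file looked like; listing lark-parser explicitly alongside o2a (unpinned here, which is an assumption) is one way to make the missing dependency explicit:
o2a==1.0.1
lark-parser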
I've created a module named dag_template_module.py that returns a DAG built from specified arguments. I want to use this definition for multiple DAGs doing the same thing but from different sources (and thus with different parameters). A simplified version of dag_template_module.py:
from airflow.decorators import dag, task
from airflow.operators.bash import BashOperator
def dag_template(
    dag_id: str,
    echo_message_1: str,
    echo_message_2: str
):
    @dag(
        dag_id=dag_id,
        schedule_interval="0 6 2 * *"
    )
    def dag_example():
        echo_1 = BashOperator(
            task_id='echo_1',
            bash_command=f'echo {echo_message_1}'
        )
        echo_2 = BashOperator(
            task_id='echo_2',
            bash_command=f'echo {echo_message_2}'
        )
        echo_1 >> echo_2
    dag = dag_example()
    return dag
Now I've created hello_world_dag.py, which imports the dag_template() function from dag_template_module.py and uses it to create a DAG:
from dag_template_module import dag_template

hello_world_dag = dag_template(
    dag_id='hello_world_dag',
    echo_message_1='Hello',
    echo_message_2='World'
)
I expected this DAG to be discovered by the Airflow UI, but that's not the case.
I've also tried using globals() in hello_world_dag.py, as per the documentation, but that doesn't work for me either:
from dag_template_module import dag_template

hello_world_dag = 'hello_world_dag'
globals()[hello_world_dag] = dag_template(
    dag_id='hello_world_dag',
    echo_message_1='Hello',
    echo_message_2='World'
)
A couple of things:
The DAG you are attempting to create is missing the start_date param.
There is a nuance to how Airflow determines which Python files might contain a DAG definition: it looks for the strings "dag" and "airflow" in the file contents. hello_world_dag.py is missing these keywords, so the DagFileProcessor won't attempt to parse the file and, therefore, never calls the dag_template() function.
After adding these small tweaks and running with Airflow 2.5.0, the DAG is discovered:
dag_template_module.py
from pendulum import datetime
from airflow.decorators import dag
from airflow.operators.bash import BashOperator
def dag_template(dag_id: str, echo_message_1: str, echo_message_2: str):
    @dag(dag_id, start_date=datetime(2023, 1, 22), schedule=None)
    def dag_example():
        echo_1 = BashOperator(task_id="echo_1", bash_command=f"echo {echo_message_1}")
        echo_2 = BashOperator(task_id="echo_2", bash_command=f"echo {echo_message_2}")
        echo_1 >> echo_2
    return dag_example()
hello_world_dag.py
# airflow dag  <- Make sure these words appear _somewhere_ in the file.
from dag_template_module import dag_template
dag_template(dag_id="dag_example", echo_message_1="Hello", echo_message_2="World")
I'm trying to use the boto3 client's create_job() to create a Glue job; this is the script:
import boto3

client = boto3.client('glue')

job = client.create_job(
    Name=xxx,
    Role=xxx,
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://my_bucket_name/my_project_name/src/glue.py',
        'PythonVersion': '3'
    },
    DefaultArguments={
        '--job-language': 'python',
        '--extra-py-files': 's3://my_bucket_name/my_project_name/src/test.zip',
        '--conf': 'spark.yarn.executor.memoryOverhead=7g --conf spark.jars.packages=xxx',
    },
    ExecutionProperty={
        'MaxConcurrentRuns': 1
    },
    GlueVersion='1.0'
)
The structure inside test.zip is an __init__.py file + a glue.py file (a duplicate of the one specified in ScriptLocation) + example.py.
Inside glue.py I have import example, and the job fails with the error "ErrorMessage":"ModuleNotFoundError: No module named 'example'".
I tried from test import example, but that doesn't work either. I'm confused and stuck here: how does Glue read and import modules? Do I need to set up something? Many thanks.
The _init_.py is incorrect. It should be __init__.py (double underscore) as explained in the AWS docs.
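If it helps, here is a quick way to double-check what the uploaded archive actually contains (a minimal sketch, assuming test.zip is available locally; the expected listing reflects the layout described in the question):
import zipfile

# Inspect the archive referenced by --extra-py-files and list its entries.
names = zipfile.ZipFile("test.zip").namelist()
print(names)
# Expecting something like: ['__init__.py', 'glue.py', 'example.py']
# If '_init_.py' (single underscores) appears instead, rename that file.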
I'm using the Airflow PythonOperator to execute a Python Beam job using the Dataflow runner.
The Dataflow job returns the error "ModuleNotFoundError: No module named 'airflow'"
In the Dataflow UI, the SDK version used when the job is called via the PythonOperator is 2.15.0. If the job is executed from Cloud Shell, the SDK version used is 2.23.0. The job works when initiated from the shell.
The Environment details for Composer are:
Image version = composer-1.10.3-airflow-1.10.3
Python version= 3
A previous post suggested using the PythonVirtualenvOperator. I tried this with the settings:
requirements=['apache-beam[gcp]'],
python_version=3
Composer returns the error "'install', 'apache-beam[gcp]']' returned non-zero exit status 2."
Any advice would be greatly appreciated.
This is the DAG that calls the Dataflow job. I have not shown all the functions that are used in the DAG, but I have kept the imports in:
import logging
import pprint
import json
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator
from airflow.models import DAG
import google.cloud.logging
from datetime import timedelta
from airflow.utils.dates import days_ago
from deps import utils
from google.cloud import storage
from airflow.exceptions import AirflowException
from deps import logger_montr
from deps import dataflow_clean_csv
dag = DAG(dag_id='clean_data_file',
          default_args=args,
          description='Runs Dataflow to clean csv files',
          schedule_interval=None)

def get_values_from_previous_dag(**context):
    var_dict = {}
    for key, val in context['dag_run'].conf.items():
        context['ti'].xcom_push(key, val)
        var_dict[key] = val

populate_ti_xcom = PythonOperator(
    task_id='get_values_from_previous_dag',
    python_callable=get_values_from_previous_dag,
    provide_context=True,
    dag=dag,
)

dataflow_clean_csv = PythonOperator(
    task_id="dataflow_clean_csv",
    python_callable=dataflow_clean_csv.clean_csv_dataflow,
    op_kwargs={
        'project':
        'zone':
        'region':
        'stagingLocation':
        'inputDirectory':
        'filename':
        'outputDirectory':
    },
    provide_context=True,
    dag=dag,
)

populate_ti_xcom >> dataflow_clean_csv
I use the ti.xcom_pull(task_ids = 'get_values_from_previous_dag') method to assign the op_kwargs.
This is the Dataflow job that is being called:
import apache_beam as beam
import csv
import logging
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io import WriteToText
def parse_file(element):
    for line in csv.reader([element], quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL):
        line = [s.replace('\"', '') for s in line]
        clean_line = '","'.join(line)
        final_line = '"' + clean_line + '"'
        return final_line

def clean_csv_dataflow(**kwargs):
    argv = [
        # Dataflow pipeline options
        "--region={}".format(kwargs["region"]),
        "--project={}".format(kwargs["project"]),
        "--temp_location={}".format(kwargs["stagingLocation"]),
        # Setting Dataflow pipeline options
        '--save_main_session',
        '--max_num_workers=8',
        '--autoscaling_algorithm=THROUGHPUT_BASED',
        # Mandatory constants
        '--job_name=cleancsvdataflow',
        '--runner=DataflowRunner'
    ]
    options = PipelineOptions(flags=argv)

    inputDirectory = kwargs["inputDirectory"]
    filename = kwargs["filename"]
    outputDirectory = kwargs["outputDirectory"]

    outputfile_temp = filename
    outputfile_temp = outputfile_temp.split(".")
    outputfile = "_CLEANED.".join(outputfile_temp)

    in_path_and_filename = "{}{}".format(inputDirectory, filename)
    out_path_and_filename = "{}{}".format(outputDirectory, outputfile)

    pipeline = beam.Pipeline(options=options)

    clean_csv = (pipeline
                 | "Read input file" >> beam.io.ReadFromText(in_path_and_filename)
                 | "Parse file" >> beam.Map(parse_file)
                 | "writecsv" >> beam.io.WriteToText(out_path_and_filename, num_shards=1)
                 )
    pipeline.run()
This answer was provided by @BSpinoza in the comment section:
What I did was move all imports from the global namespace and place them into the function definitions. Then, from the calling DAG, I used the BashOperator. It worked.
Also, one of the recommended ways is to use the DataFlowPythonOperator.
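For illustration, a minimal sketch of that workaround, assuming the Beam script is refactored so it can be launched standalone from the command line; the script path and all flag values below are hypothetical placeholders, not values from the question:
from airflow.operators.bash_operator import BashOperator

# Launch the Beam/Dataflow script in a separate process so the code shipped to
# the Dataflow workers never imports airflow. Placeholder path and arguments.
run_dataflow_clean_csv = BashOperator(
    task_id='run_dataflow_clean_csv',
    bash_command=(
        'python /home/airflow/gcs/dags/deps/dataflow_clean_csv.py '
        '--project=my-project --region=my-region '
        '--stagingLocation=gs://my-bucket/staging '
        '--inputDirectory=gs://my-bucket/input/ --filename=my_file.csv '
        '--outputDirectory=gs://my-bucket/output/'
    ),
    dag=dag,
)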
I am trying to run Python code inline in an AWS Lambda function.
I am not zipping any files, just pasting the code below into the Lambda function.
And I am getting this error:
errorMessage": "Unable to import module 'UpdateHost_Python'
import psycopg2

def lambda_handler(event, context):
    conn_string = "dbname='myfirstdb' port='5432' user='db28' password='######' host='#####.ck0zbnniqteb.us-east-2.rds.amazonaws.com'"
    conn = psycopg2.connect(conn_string)
    cursor = conn.cursor()
    cursor.execute("select * from unnmesh")
    conn.commit()
    cursor.close()
    print("working")
For non-standard Python libraries (like psycopg2), you will need to create a Deployment Package.
This involves creating a Zip file with the libraries, then uploading the Zip file to Lambda.
See: AWS Lambda Deployment Package in Python - AWS Lambda
For a worked-through example, see also: Tutorial: Using AWS Lambda with Amazon S3 - AWS Lambda (I know you're not using Amazon S3, but the tutorial gives an example of building a package with dependencies.)
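As a rough sketch of what building such a package can look like from Python (assuming the dependency was first installed locally with pip install psycopg2-binary -t ./package, and that the handler file is named UpdateHost_Python.py; both names are assumptions based on the error message above):
import os
import zipfile

# Bundle the locally installed libraries plus the handler into one archive,
# keeping library files at the root of the zip as Lambda expects.
with zipfile.ZipFile("UpdateHost_Python.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for root, _, files in os.walk("package"):
        for name in files:
            path = os.path.join(root, name)
            zf.write(path, os.path.relpath(path, "package"))
    zf.write("UpdateHost_Python.py")

# Upload UpdateHost_Python.zip through the Lambda console or the AWS CLI.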
I'm trying to access external files in an Airflow task to read some SQL, and I'm getting "file not found". Has anyone come across this?
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
dag = DAG(
    'my_dag',
    start_date=datetime(2017, 1, 1),
    catchup=False,
    schedule_interval=timedelta(days=1)
)

def run_query():
    # read the query
    query = open('sql/queryfile.sql')
    # run the query
    execute(query)

task = PythonOperator(
    task_id='run_query', dag=dag, python_callable=run_query)
The log states the following:
IOError: [Errno 2] No such file or directory: 'sql/queryfile.sql'
I understand that I could simply copy and paste the query inside the same file, but that's really not a neat solution. There are multiple queries and the text is really big; embedding it with the Python code would compromise readability.
Here is an example using Variable to make it easy.
First add a Variable in the Airflow UI -> Admin -> Variables, e.g. {key: 'sql_path', value: 'your_sql_script_folder'}
Then add the following code in your DAG to use the Variable from Airflow.
DAG code:
import airflow
from airflow.models import Variable
tmpl_search_path = Variable.get("sql_path")
dag = airflow.DAG(
    'tutorial',
    schedule_interval="@daily",
    template_searchpath=tmpl_search_path,  # this
    default_args=default_args
)
Now you can reference the SQL script by file name, or by a path relative to the folder stored in the Variable.
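A brief sketch of how the question's PythonOperator callable could pick that Variable up directly (the file name comes from the question; execute() is the asker's own helper):
import os
from airflow.models import Variable

def run_query():
    # Resolve the script folder from the Variable, then read the query.
    sql_dir = Variable.get("sql_path")
    with open(os.path.join(sql_dir, "queryfile.sql")) as f:
        query = f.read()
    # run the query
    execute(query)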
All relative paths are resolved relative to the AIRFLOW_HOME environment variable. Try:
Giving an absolute path
Placing the file relative to AIRFLOW_HOME
Logging the PWD in the Python callable and then deciding what path to give (best option); see the sketch below
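For the last option, a tiny sketch of logging the working directory from inside the callable before deciding on a path (nothing here is specific to the question beyond the run_query name):
import logging
import os

def run_query():
    # Log where the task actually runs from before building any file paths.
    logging.info("Current working directory: %s", os.getcwd())
    logging.info("AIRFLOW_HOME: %s", os.environ.get("AIRFLOW_HOME"))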
Assuming that the sql directory is relative to the current Python file, you can figure out the absolute path to the sql file like this:
import os

CUR_DIR = os.path.abspath(os.path.dirname(__file__))

def run_query():
    # read the query
    query = open(f"{CUR_DIR}/sql/queryfile.sql")
    # run the query
    execute(query)
You can get the DAG directory as below:
import os
from airflow.configuration import conf

conf.get('core', 'DAGS_FOLDER')

# open a file from that directory
open(os.path.join(conf.get('core', 'DAGS_FOLDER'), 'something.json'), 'r')
ref: https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#dags-folder