I'm using the Airflow PythonOperator to execute a Python Beam job with the Dataflow runner.
The Dataflow job returns the error "ModuleNotFoundError: No module named 'airflow'".
In the Dataflow UI, the SDK version used when the job is launched via the PythonOperator is 2.15.0. If the
job is executed from Cloud Shell, the SDK version used is 2.23.0. The job works when initiated from
the shell.
The Environment details for Composer are:
Image version = composer-1.10.3-airflow-1.10.3
Python version= 3
A previous post suggested using the PythonVirtualenvOperator. I tried this with the settings:
requirements=['apache-beam[gcp]'],
python_version=3
Composer returns the error "'install', 'apache-beam[gcp]']' returned non-zero exit status 2."
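Roughly, the operator was configured along these lines (simplified sketch; the callable and task names are placeholders, and only the requirements and python_version settings are the ones I actually used):
from airflow.operators.python_operator import PythonVirtualenvOperator

def run_beam_job():
    # Placeholder callable: the real one builds and runs the Beam pipeline
    pass

dataflow_in_venv = PythonVirtualenvOperator(
    task_id='dataflow_in_venv',          # placeholder task id
    python_callable=run_beam_job,
    requirements=['apache-beam[gcp]'],   # setting I actually used
    python_version=3,                    # setting I actually used
    dag=dag,
)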
Any advice would be greatly appreciated.
This is the DAG that calls the Dataflow job. I have not shown all the functions that are used in the DAG, but I kept the imports in:
import logging
import pprint
import json
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator
from airflow.models import DAG
import google.cloud.logging
from datetime import timedelta
from airflow.utils.dates import days_ago
from deps import utils
from google.cloud import storage
from airflow.exceptions import AirflowException
from deps import logger_montr
from deps import dataflow_clean_csv
dag = DAG(dag_id='clean_data_file',
          default_args=args,
          description='Runs Dataflow to clean csv files',
          schedule_interval=None)

def get_values_from_previous_dag(**context):
    var_dict = {}
    for key, val in context['dag_run'].conf.items():
        context['ti'].xcom_push(key, val)
        var_dict[key] = val

populate_ti_xcom = PythonOperator(
    task_id='get_values_from_previous_dag',
    python_callable=get_values_from_previous_dag,
    provide_context=True,
    dag=dag,
)

dataflow_clean_csv = PythonOperator(
    task_id="dataflow_clean_csv",
    python_callable=dataflow_clean_csv.clean_csv_dataflow,
    op_kwargs={
        'project':
        'zone':
        'region':
        'stagingLocation':
        'inputDirectory':
        'filename':
        'outputDirectory':
    },
    provide_context=True,
    dag=dag,
)
populate_ti_xcom >> dataflow_clean_csv
I use ti.xcom_pull(task_ids='get_values_from_previous_dag') to assign the op_kwargs.
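For example, one value is pulled back like this (illustrative snippet only, assuming the upstream task pushed a 'project' key from the dag_run conf as above):
project = ti.xcom_pull(task_ids='get_values_from_previous_dag', key='project')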
This is the Dataflow job that is being called:
import apache_beam as beam
import csv
import logging
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io import WriteToText
def parse_file(element):
    for line in csv.reader([element], quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL):
        line = [s.replace('\"', '') for s in line]
        clean_line = '","'.join(line)
        final_line = '"' + clean_line + '"'
        return final_line
def clean_csv_dataflow(**kwargs):
    argv = [
        # Dataflow pipeline options
        "--region={}".format(kwargs["region"]),
        "--project={}".format(kwargs["project"]),
        "--temp_location={}".format(kwargs["stagingLocation"]),
        # Setting Dataflow pipeline options
        '--save_main_session',
        '--max_num_workers=8',
        '--autoscaling_algorithm=THROUGHPUT_BASED',
        # Mandatory constants
        '--job_name=cleancsvdataflow',
        '--runner=DataflowRunner'
    ]
    options = PipelineOptions(flags=argv)

    inputDirectory = kwargs["inputDirectory"]
    filename = kwargs["filename"]
    outputDirectory = kwargs["outputDirectory"]

    outputfile_temp = filename.split(".")
    outputfile = "_CLEANED.".join(outputfile_temp)

    in_path_and_filename = "{}{}".format(inputDirectory, filename)
    out_path_and_filename = "{}{}".format(outputDirectory, outputfile)

    pipeline = beam.Pipeline(options=options)
    clean_csv = (
        pipeline
        | "Read input file" >> beam.io.ReadFromText(in_path_and_filename)
        | "Parse file" >> beam.Map(parse_file)
        | "writecsv" >> beam.io.WriteToText(out_path_and_filename, num_shards=1)
    )
    pipeline.run()
This answer was provided by @BSpinoza in the comment section:
What I did was move all imports from the global namespace and place them into the function definitions. Then, from the calling DAG, I used the BashOperator. It worked.
Also, one of the recommended ways is to use DataFlowPythonOperator.
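As a rough sketch of that change (assuming the same clean_csv_dataflow job shown above), the imports move from module level into the function so that --save_main_session does not end up pickling Airflow modules from the Composer worker's namespace:
def clean_csv_dataflow(**kwargs):
    # Sketch of the workaround described above: keep all imports local to the function
    # (the helper parse_file would need the same treatment).
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # ... build argv, PipelineOptions and the pipeline exactly as before ...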
I've created a module named dag_template_module.py that returns a DAG using specified arguments. I want to use this definition for multiple DAGs that do the same thing but from different sources (hence the parameters). A simplified version of dag_template_module.py:
from airflow.decorators import dag, task
from airflow.operators.bash import BashOperator

def dag_template(
    dag_id: str,
    echo_message_1: str,
    echo_message_2: str
):
    @dag(
        dag_id=dag_id,
        schedule_interval="0 6 2 * *"
    )
    def dag_example():
        echo_1 = BashOperator(
            task_id='echo_1',
            bash_command=f'echo {echo_message_1}'
        )
        echo_2 = BashOperator(
            task_id='echo_2',
            bash_command=f'echo {echo_message_2}'
        )
        echo_1 >> echo_2

    dag = dag_example()
    return dag
Now I've created a hello_world_dag.py that imports the dag_template() function from dag_template_module.py and uses it to create a DAG:
from dag_template import dag_template
hello_world_dag = dag_template(
    dag_id='hello_world_dag',
    echo_message_1='Hello',
    echo_message_2='World'
)
I expected this DAG to be discovered by the Airflow UI, but that's not the case.
I've also tried using globals() in hello_world_dag.py, as described in the documentation, but that doesn't work for me either:
from dag_template import dag_template

hello_world_dag = 'hello_word_dag'
globals()[hello_world_dag] = dag_template(
    dag_id='hello_world_dag',
    echo_message_1='Hello',
    echo_message_2='World'
)
A couple of things:
The DAG you are attempting to create is missing the start_date param.
There is a nuance to how Airflow determines which Python files might contain a DAG definition: it looks for the strings "dag" and "airflow" in the file contents. hello_world_dag.py is missing these keywords, so the DagFileProcessor won't attempt to parse this file and, therefore, never calls the dag_template() function.
With these small tweaks applied, running on Airflow 2.5.0:
dag_template_module.py
from pendulum import datetime
from airflow.decorators import dag
from airflow.operators.bash import BashOperator

def dag_template(dag_id: str, echo_message_1: str, echo_message_2: str):
    @dag(dag_id, start_date=datetime(2023, 1, 22), schedule=None)
    def dag_example():
        echo_1 = BashOperator(task_id="echo_1", bash_command=f"echo {echo_message_1}")
        echo_2 = BashOperator(task_id="echo_2", bash_command=f"echo {echo_message_2}")
        echo_1 >> echo_2

    return dag_example()
hello_world_dag.py
# airflow dag  <- Make sure these words appear _somewhere_ in the file.
from dag_template_module import dag_template
dag_template(dag_id="dag_example", echo_message_1="Hello", echo_message_2="World")
I have been on Airflow 1.10.14 for a long time, and now I'm trying to upgrade to Airflow 2.4.3 (latest?). I have built this DAG in the new format in hopes of assimilating the language and understanding how the new format works. Below is my DAG:
from airflow.decorators import dag, task
from airflow.models import Variable
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.microsoft.mssql.operators.mssql import MsSqlOperator
from airflow.operators.bash import BashOperator
from datetime import datetime
import glob
path = '~/airflow/staging/gcs/offrs2/'
clear_Staging_Folders = """
rm -rf {}OFFRS2/LEADS*.*
""".format(Variable.get("temp_directory"))
@dag(
    schedule_interval='@daily',
    start_date=datetime(2022, 11, 1),
    catchup=False,
    tags=['offrs2', 'LEADS']
)
def taskflow():
    CLEAR_STAGING = BashOperator(
        task_id='Clear_Folders',
        bash_command=clear_Staging_Folders,
        dag=dag,
    )

    BQ_Output = BigQueryInsertJobOperator(
        task_id='BQ_Output',
        configuration={
            "query": {
                "query": '~/airflow/sql/Leads/Leads_Export.sql',
                "useLegacySql": False
            }
        }
    )

    Prep_MSSQL = MsSqlOperator(
        task_id='Prep_DB3_Table',
        mssql_conn_id='db.offrs.com',
        sql='truncate table offrs_staging..LEADS;'
    )

    @task
    def Load_Staging_Table():
        for files in glob.glob(path + 'LEADS*.csv'):
            print(files)

    CLEAR_STAGING >> BQ_Output >> Load_Staging_Table()

dag = taskflow()
When I push this up, I'm getting the error below:
Broken DAG: [/home/airflow/airflow/dags/BQ_OFFRS2_Leads.py] Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/baseoperator.py", line 376, in apply_defaults
task_group = TaskGroupContext.get_current_task_group(dag)
File "/home/airflow/.local/lib/python3.10/site-packages/airflow/utils/task_group.py", line 490, in get_current_task_group
return dag.task_group
AttributeError: 'function' object has no attribute 'task_group'
As I look at my code, I don't have a specified task_group.
Where am I going wrong here?
Thank you!
You forgot to remove the dag=dag argument in CLEAR_STAGING. Inside taskflow(), the name dag still refers to the imported @dag decorator (a plain function), not a DAG object, which is why Airflow complains that a 'function' object has no attribute 'task_group'. When you use the decorator, remove dag=dag.
CLEAR_STAGING = BashOperator(
    task_id='Clear_Folders',
    bash_command=clear_Staging_Folders,
    # dag=dag  <== Remove this
)
I am trying to use pysparkling.ml.H2OMOJOModel to predict on a Spark dataframe using a MOJO model trained with h2o==3.32.0.2 in AWS Glue Jobs; however, I got the error: TypeError: 'JavaPackage' object is not callable.
I opened a ticket with AWS support and they confirmed that the Glue environment is OK and that the problem is probably with sparkling-water (pysparkling). It seems that some dependency library is missing, but I have no idea which one.
The simple code below works perfectly if I run it on my local computer (I only need to change the MOJO path for GBM_grid__1_AutoML_20220323_233606_model_53.zip).
Has anyone ever run sparkling-water in Glue jobs successfully?
Job Details:
-Glue version 2.0
--additional-python-modules, h2o-pysparkling-2.4==3.36.0.2-1
-Worker type: G.1X
-Number of workers: 2
-Using script "createFromMojo.py"
createFromMojo.py:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import pandas as pd
from pysparkling.ml import H2OMOJOSettings
from pysparkling.ml import H2OMOJOModel
# from pysparkling.ml import *
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
#Job setup
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
caminho_modelo_mojo='s3://prod-lakehouse-stream/modeling/approaches/GBM_grid__1_AutoML_20220323_233606_model_53.zip'
print(caminho_modelo_mojo)
print(dir())
settings = H2OMOJOSettings(convertUnknownCategoricalLevelsToNa = True, convertInvalidNumbersToNa = True)
model = H2OMOJOModel.createFromMojo(caminho_modelo_mojo, settings)
data = {'days_since_last_application': [3, 2, 1, 0], 'job_area': ['a', 'b', 'c', 'd']}
base_escorada = model.transform(spark.createDataFrame(pd.DataFrame.from_dict(data)))
print(base_escorada.printSchema())
print(base_escorada.show())
job.commit()
I was able to run it successfully by following these steps:
Downloaded sparkling water distribution zip: http://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.1/3.36.1.1-1-3.1/index.html
Dependent JARs path: s3://bucket_name/sparkling-water-assembly-scoring_2.12-3.36.1.1-1-3.1-all.jar
--additional-python-modules, h2o-pysparkling-3.1==3.36.1.1-1-3.1
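For anyone scripting the job instead of using the console, the same settings map onto Glue job parameters roughly like this (sketch only; the job name, role, bucket and script location are placeholders, GlueVersion 3.0 is assumed because the Spark 3.1 artifacts are used, and --extra-jars is the parameter behind the console's "Dependent JARs path"):
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="score-with-mojo",            # placeholder
    Role="MyGlueServiceRole",          # placeholder
    GlueVersion="3.0",                 # assumed: matches the Spark 3.1 sparkling-water build
    Command={"Name": "glueetl", "ScriptLocation": "s3://bucket_name/createFromMojo.py", "PythonVersion": "3"},
    DefaultArguments={
        "--additional-python-modules": "h2o-pysparkling-3.1==3.36.1.1-1-3.1",
        # Equivalent of the console's "Dependent JARs path"
        "--extra-jars": "s3://bucket_name/sparkling-water-assembly-scoring_2.12-3.36.1.1-1-3.1-all.jar",
    },
    WorkerType="G.1X",
    NumberOfWorkers=2,
)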
I'm attempting to write a pipeline [in Apache Beam (Python)] to BigQuery using a Jupyter Notebook (local), but it seems to be failing because the following import is unsuccessful.
[File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\apache_beam\io\gcp\bigquery_tools.py]
try:
    from apache_beam.io.gcp.internal.clients.bigquery import DatasetReference
    from apache_beam.io.gcp.internal.clients.bigquery import TableReference
except ImportError:
    DatasetReference = None
    TableReference = None
Actual Error thrown:
--> 240 if isinstance(table, TableReference):
241 return TableReference(
242 projectId=table.projectId,
243 datasetId=table.datasetId,
244 tableId=table.tableId)
245 elif callable(table):
TypeError: isinstance() arg 2 must be a type, a tuple of types, or a union
I wasn't able to find the 'TableReference' module under apache_beam.io.gcp.internal.clients.bigquery
Screenshot of the windows explorer folder location
These are the libraries I have installed:
!pip install google-cloud
!pip install google-cloud-pubsub
!pip install google-cloud-bigquery
!pip install apache-beam[gcp]
I read in an article that I need to install apache-beam[gcp] instead of apache-beam (which I was doing earlier).
Note: the code runs perfectly fine in Google Colab.
Here is the complete code:
from google.cloud.bigquery import table
import argparse
import json
import os
import time
import logging
import pandas as pd
import apache_beam as beam
from google.cloud import bigquery, pubsub_v1
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions, GoogleCloudOptions, SetupOptions
import apache_beam.transforms.window as window
PROJECT_ID = "commanding-bee-322221"
BIGQUERY_DATASET = "Sample_Dataset"
BIGQUERY_MESSAGE_TABLE = "Message_Table"
SUBSCRIPTION_ID = "My-First-Test-Topic-Sub"
BQ_CLIENT = bigquery.Client(project=PROJECT_ID)
BQ_DATASET = BQ_CLIENT.dataset(BIGQUERY_DATASET)
BQ_MSG_TABLE = BQ_DATASET.table(BIGQUERY_MESSAGE_TABLE)
logging.basicConfig(level=logging.INFO)
logging.getLogger().setLevel(logging.INFO)
pipeline_options = {
    'project': PROJECT_ID,
    # 'runner': 'DataflowRunner',
    'region': 'us-east1',
    'staging_location': 'gs://my-test-bucket/tmp/',
    'temp_location': 'gs://my-test-bucket/tmp/',
    # 'template_location': 'gs://my-test-bucket/tmp/My_Template',
    'save_main_session': False,
    'streaming': True
}
pipeline_options = PipelineOptions.from_dictionary(pipeline_options)
p = beam.Pipeline(options=pipeline_options)
table_schema='MessageType:STRING,Location:STRING,Building:STRING,DateTime:DATETIME,ID:STRING,TID:STRING'
table= f'{PROJECT_ID}:{BIGQUERY_DATASET}.{BIGQUERY_MESSAGE_TABLE}'
class create_formatted_message(beam.DoFn):
    def process(self, record):
        message_str = record.decode("utf-8")
        json_message = eval(message_str)
        print(json_message)
        # Construct the record
        rows_to_insert = [{u"MessageType": u"{}".format(json_message["MessageType"]),
                           u"Location": u"{}".format(json_message["Location"]),
                           u"Building": u"{}".format(json_message["Building"]),
                           u"DateTime": u"{}".format(json_message["DateTime"]),
                           u"ID": u"{}".format(json_message["ID"]),
                           u"TID": u"{}".format(json_message["TID"])}]
        return rows_to_insert
print(table)
processPubSubMessage = (
    p
    | "ReadFromPubSub" >> beam.io.gcp.pubsub.ReadFromPubSub(
        subscription=f'projects/{PROJECT_ID}/subscriptions/{SUBSCRIPTION_ID}',
        timestamp_attribute=None)
    | "Create Formatted Message" >> beam.ParDo(create_formatted_message())
    | "Write TO BQ" >> beam.io.WriteToBigQuery(
        table,
        schema=table_schema,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        custom_gcs_temp_location='gs://my-test-bucket/tmp/'
    )
)
result = p.run()
# result.wait_until_finish()
time.sleep(20)
result.cancel()
I'm using Google's oozie-to-airflow converter to convert some Oozie workflows that run on AWS EMR. I managed to get a first version, but when I try to upload the DAG, Airflow throws an error:
Broken DAG: No module named 'o2a'
I have tried to deploy the PyPI package o2a, both using the command
gcloud composer environments update composer-name --update-pypi-packages-from-file requirements.txt --location location
and from the Google Cloud console. Both failed.
requirements.txt
o2a==1.0.1
Here is the code
from airflow import models
from airflow.operators.subdag_operator import SubDagOperator
from airflow.utils import dates
from o2a.o2a_libs import functions
from airflow.models import Variable
import subdag_validation
import subdag_generate_reports
CONFIG = {}
JOB_PROPS = {}

dag_config = Variable.get("coordinator", deserialize_json=True)
cdrPeriod = dag_config["cdrPeriod"]

TASK_MAP = {"validation": ["validation"], "generate_reports": ["generate_reports"]}

TEMPLATE_ENV = {**CONFIG, **JOB_PROPS, "functions": functions, "task_map": TASK_MAP}

with models.DAG(
    "workflow_coordinator",
    schedule_interval=None,  # Change to suit your needs
    start_date=dates.days_ago(0),  # Change to suit your needs
    user_defined_macros=TEMPLATE_ENV,
) as dag:
    validation = SubDagOperator(
        task_id="validation",
        trigger_rule="one_success",
        subdag=subdag_validation.sub_dag(dag.dag_id, "validation", dag.start_date, dag.schedule_interval),
    )

    generate_reports = SubDagOperator(
        task_id="generate_reports",
        trigger_rule="one_success",
        subdag=subdag_generate_reports.sub_dag(
            dag.dag_id,
            "generate_reports",
            dag.start_date,
            dag.schedule_interval,
            {"cdrPeriod": "{{cdrPeriod}}"},
        ),
    )

    validation.set_downstream(generate_reports)
There is a section in the o2a docs that covers how to deploy o2a:
https://github.com/GoogleCloudPlatform/oozie-to-airflow#the-o2a-libraries
This initially failed because of another dependency: lark-parser. Installing it using Composer's PyPI package manager did the trick.
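In other words, a requirements.txt along these lines should cover both packages when deployed with the gcloud command from the question (the lark-parser version is left unpinned here as an assumption):
o2a==1.0.1
lark-parser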