Dynamic FTPSensor in Apache Airflow - python

I want to implement a dynamic FTPSensor of a kind. Using the contributed FTP sensor I managed to make it work in this way:
ftp_sensor = FTPSensor(
task_id="detect-file-on-ftp",
path="./data/test.txt",
ftp_conn_id="ftp_default",
poke_interval=5,
dag=dag,
)
and it works just fine. But I need to pass dynamic path and ftp_conn_id params. I.e. I generate a bunch of new connections in a previous task and in the ftp_sensor task I want to check for each of the new connections that I previously generated if there's a file present on the FTP.
So I thought first to grab the connections' ids from XCom.
I send them from the previous task in XCom but it seems I cannot access XCom outside of tasks.
E.g. I was aiming at something like:
active_ftp_connections = context['ti'].xcom_pull(key='active_ftps')
for conn in active_ftp_connections:
ftp_sensor = FTPSensor(
task_id="detect-file-on-ftp",
path=conn['path'],
ftp_conn_id=conn['connection'],
poke_interval=5,
dag=dag,
)
but this doesn't seem to be a possible solution.
Then I just wasted a good amount of time trying to create my custom FTPSensor to which to pass dynamically the data I need but right now I reached to the conclusion that I need a hybrid between a sensor and operator, because I need to keep the poke functionality for instance but also have the execute functionality.
I guess one option is to write a custom operator that implements poke from the sensor base class but am probably too tired to try to do it now.
Do you have an idea how to achieve what I am aiming at? I can't seem to find any materials on the topic on the internet - maybe it's just me.
Let me know if the question is not clear so I can provide more details.
Update
I now reached to this as possibility
def get_active_ftps(**context):
active_ftp_connestions = context['ti'].xcom_pull(key='active_ftps')
return active_ftp_connestions
for ftp in get_active_ftps():
ftp_sensor = FTPSensor(
task_id="detect-file-on-ftp",
path="./"+ ftp['folder'] +"/test.txt",
ftp_conn_id=ftp['conn_id'],
poke_interval=5,
dag=dag,
)
but it throws an error: Broken DAG: [/usr/local/airflow/dags/copy_file_from_ftp.py] 'ti'

I managed to do it like this:
active_ftp_folder = Variable.get('active_ftp_folder')
active_ftp_conn_id = Variable.get('active_ftp_conn_id')
ftp_sensor = FTPSensor(
task_id="detect-file-on-ftp",
path="./"+ active_ftp_folder +"/test.txt",
ftp_conn_id=active_ftp_conn_id,
poke_interval=5,
dag=dag,
)
And will just have the dag run one ftp account at a time since I realized that there shouldn't be cycles in a direct acyclic graphs ... apparently.

Related

Struggling with how to iterate data

I am learning Python3 and I have a fairly simple task to complete but I am struggling how to glue it all together. I need to query an API and return the full list of applications which I can do and I store this and need to use it again to gather more data for each application from a different API call.
applistfull = requests.get(url,authmethod)
if applistfull.ok:
data = applistfull.json()
for app in data["_embedded"]["applications"]:
print(app["profile"]["name"],app["guid"])
summaryguid = app["guid"]
else:
print(applistfull.status_code)
I next have I think 'summaryguid' and I need to again query a different API and return a value that could exist many times for each application; in this case the compiler used to build the code.
I can statically call a GUID in the URL and return the correct information but I haven't yet figured out how to get it to do the below for all of the above and build a master list:
summary = requests.get(f"url{summaryguid}moreurl",authmethod)
if summary.ok:
fulldata = summary.json()
for appsummary in fulldata["static-analysis"]["modules"]["module"]:
print(appsummary["compiler"])
I would prefer to not yet have someone just type out the right answer but just drop a few hints and let me continue to work through it logically so I learn how to deal with what I assume is a common issue in the future. My thought right now is I need to move my second if up as part of my initial block and continue the logic in that space but I am stuck with that.
You are on the right track! Here is the hint: the second API request can be nested inside the loop that iterates through the list of applications in the first API call. By doing so, you can get the information you require by making the second API call for each application.
import requests
applistfull = requests.get("url", authmethod)
if applistfull.ok:
data = applistfull.json()
for app in data["_embedded"]["applications"]:
print(app["profile"]["name"],app["guid"])
summaryguid = app["guid"]
summary = requests.get(f"url/{summaryguid}/moreurl", authmethod)
fulldata = summary.json()
for appsummary in fulldata["static-analysis"]["modules"]["module"]:
print(app["profile"]["name"],appsummary["compiler"])
else:
print(applistfull.status_code)

Airflow XCOM pull and push for a BigQueryInsertJobOperator and BigQueryOperator

I am very new to airflow and I am trying to create a DAG based on the below requirement.
Task 1 - Run a Bigquery query to get a value which I need to push to 2nd task in the dag
Task 2 - Use the value from the above query and run another query and export the data into google cloud bucket.
I have read other answers related to this and I understand we cannot use xcom_pull or xcom_push in bigqueryoperator in airflow. So what I am doing is using a python operator where I can use jinja template variables by using "provide_context=True".
Below is the snipped of my code. Just the task 1 where I want to do "task_instance.xcom_push" in order to see the value in airflow under logs xcom.
def get_bq_operator(dag, task_id, configuration, table_params=None, trigger_rule='all_success'):
bq_operator = BigQueryInsertJobOperator(
task_id=task_id,
configuration=configuration,
gcp_conn_id=gcp_connection_id,
dag=dag,
params=table_params,
trigger_rule=trigger_rule,
**task_instance.xcom_push(key='yr_wk', value=yr_wk),**
)
return bq_operator
def get_bq_wm_yr_wk():
get_bq_operator(dag,app_name,bigquery_util.get_bq_job_configuration(
bq_query,
query_params=None))
get_wm_yr_wk = PythonOperator(task_id='get_wm_yr_wk',
python_callable=get_bq_wm_yr_wk,
provide_context=True,
on_failure_callback=failure_callback,
on_retry_callback=failure_callback,
dag=dag)
"bq_query" is the one I am passing the sql file which has my query and the query returns the value of yr_wk which I need to use in my 2nd task.
The highlighted task_instance.xcom_push(key='yr_wk', value=yr_wk), in get_bq_operator is failing and the errror i am getting is as below
raise KeyError(f'Variable {key} does not exist')
KeyError: 'Variable ei_migration_hour does not exist'
If I comment the line above , the DAG runs fine. However, how do I validate the value of yr_wk?? I want to push it so that I can view the value in logs.
I do not fully understand your code :), but if you want to do something with results of BigQuery query, then by far better way to approach it is to use BigQueryHook in your python callable.
Operators in Airflow are usually thin wrappers around Hooks that really provide a "complete" taks (for example you can use it run an update operation) but if you want to do something with the result of it and you already do it via Python Operator, it is far better to use Hooks directly as you do not make all the assumptions that operators have in execute method.
In your case it should be something like (and I am using here the new TaskFlow syntax which is preferred to do this kind of operations. See https://airflow.apache.org/docs/apache-airflow/stable/tutorial_taskflow_api.html for the tutorial on Task Flow API. Aspecially in Airflow 2 it became the de-facto default way of writing tasks.
#task(.....)
def my_task():
hook = BigQueryHook(....) # initialize it with the right parameters
result = hook.run(sql='YOUR_QUERY', ...) # add other necessary params
processed_result = process_result(result) # do something with the result
return processed_result
This way you do not evey have to run xcom_push (task_flow API will do it for you automatically and other tasks will be able to use by just doing :
#task
next_task(input):
pass
And then:
result = my_task()
next_task(result)
Then all the xcom push/pull will be handled for you automatically via TaskFlow.

Airflow: is there a way to group operators together outside of a dag?

Is there a way to design a python class that implements a specific data pipeline pattern outside of a dag in order to use this class for all data-pipelines that needs this pattern ?
Example: in order to load data from Google Cloud Storage to Big Query, the process can be to validate ingestion candidate files with data quality tests. Then attempt to load data in a raw table in Big Query then dispatching the file in archive or in a rejected folder depending on loading result.
Doing it one time is easy, what if it needs to be done 1000 times ? i am trying to figure out how to optimize engineering time.
SubDag could be considered but it shows limitations in terms of performances and is going to be deprecated anyway.
Task groups needs to be part of a dag to be implemented https://github.com/apache/airflow/blob/1be3ef635fab635f741b775c52e0da7fe0871567/airflow/utils/task_group.py#L35.
One way to achieve the expected behavior might be to generate dags, task groups and tasks from a single python file that leverage dynamic DAGing
Nevertheless, code that is used in this particular file can't be reused somewhere in the code base. It is against DRYness even though DRYness vs understandability is always a tradeoff.
Based on this article
Here is how to solve this question:
You can define a plugin in airflow ./plugins
Lets create a sample taskgroup in ./plugins/test_taskgroup.py
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator
from airflow.utils.task_group import TaskGroup
def hello_world_py():
print('Hello World')
def build_taskgroup(dag: DAG) -> TaskGroup:
with TaskGroup(group_id="taskgroup") as taskgroup:
dummy_task = DummyOperator(
task_id="dummy_task",
dag=dag
)
python_task = PythonOperator(
task_id="python_task",
python_callable=hello_world_py,
dag=dag
)
dummy_task >> python_task
return taskgroup
You can call it in a simple python DAG like this:
from airflow.utils import task_group
from test_plugin import build_taskgroup
from airflow import DAG
with DAG(
dag_id="modularized_dag",
schedule_interval="#once",
start_date=datetime(2021, 1, 1),
) as dag:
task_group = build_taskgroup(dag)
Here is the result
I'm interested in this question as well. Airflow 2.0 has released the new feature of Dynamic DAG. Although I'm not sure if it will totally answer your design. It may solve the problem of the single codebase. In my case, I have a function to create a task group with the necessary parameters. Then I iterate to create each DAG with the function to create the task group(s) with different parameters. Here is the overview of my pseudo code:
def create_task_group(group_id, a, b, c):
with TaskGroup(group_id=group_id) as my_task_group:
# add some tasks
pass
for x in LIST_OF_THINGS:
dag_id = f"{x}_workflow"
schedule_interval = SCHEDULE_INTERVAL[x]
with DAG(
dag_id,
start_date=START_DATE,
schedule_interval=schedule_interval,
) as globals()[dag_id]:
task_group = create_task_group(x, ..., ..., ...)
The LIST_OF_THINGS here represents a list of different configuration. Each DAG can have different dag_id, schedule_interval, start_date, and so on. You can define your task configuration in some config file, such as JSON or YAML, and parse it as a dictionary as well.
I haven't tried, but technically you maybe you can move the create_task_group() into some class and import it too if you will need to reuse the same functionality. Another good thing about task groups is that they can add task dependencies to other tasks or task groups which is very convenient.
I saw a concept of YAML configuration for Airflow DAG using an extra package, but I'm not sure if it's mature yet.
See more information about Dynamic DAG here: https://www.astronomer.io/guides/dynamically-generating-dags
You should just create your own Operator and then use it inside your DAGs.
Extend BaseOperator and use hooks to BigQuery or whatever you need.

Can't Schedule Query in BigQuery Via Python SDK

I'll preface this by saying I'm fairly new to BigQuery. I'm running into an issue when trying to schedule a query using the Python SDK. I used the example on the documentation page and modified it a bit but I'm running into errors.
Note that my query does use scripting to set some variables, and it's using a MERGE statement to update one of my tables. I'm not sure if that makes a huge difference.
def create_scheduled_query(dataset_id, project, name, schedule, service_account, query):
parent = transfer_client.common_project_path(project)
transfer_config = bigquery_datatransfer.TransferConfig(
destination_dataset_id=dataset_id,
display_name=name,
data_source_id="scheduled_query",
params={
"query": query
},
schedule=schedule,
)
transfer_config = transfer_client.create_transfer_config(
bigquery_datatransfer.CreateTransferConfigRequest(
parent=parent,
transfer_config=transfer_config,
service_account_name=service_account,
)
)
print("Created scheduled query '{}'".format(transfer_config.name))
I was able to successfully create a query with the function above. However the query errors out with the following message:
Error code 9 : Dataset specified in the query ('') is not consistent with Destination dataset '{my_dataset_name}'.
I've tried changing passing in "" as the dataset_id parameter, but I get the following error from the Python SDK:
google.api_core.exceptions.InvalidArgument: 400 Cannot create a transfer with parent projects/{my_project_name} without location info when destination dataset is not specified.
Interestingly enough I was able to successfully create this scheduled query in the GUI; the same query executed without issue.
I saw that the GUI showed the scheduled query's "Resource name" referenced a transferConfig, so I used the following command to see what that transferConfig looked like, to see if I could apply the same parameters using my Python script:
bq show --format=prettyjson --transfer_config {my_transfer_config}
Which gave me the following output:
{
"dataSourceId": "scheduled_query",
"datasetRegion": "us",
"destinationDatasetId": "",
"displayName": "test_scheduled_query",
"emailPreferences": {},
"name": "{REDACTED_TRANSFER_CONFIG_ID}",
"nextRunTime": "2021-06-18T00:35:00Z",
"params": {
"query": ....
So it looks like the GUI was able to use "" for destinationDataSetId but for whatever reason the Python SDK won't let me use that value.
Any help would be appreciated, since I prefer to avoid the GUI whenever possible.
UPDATE:
This does appear to be related to the scripting I used in my query. I removed the scripts from the query and it's working. I'm going to leave this open because I feel like this should be possible using the SDK since the query with scripting works in the console without issue.
This same thing also threw me through a loop but I managed to figure out what was wrong. The problem is with the
parent = transfer_client.common_project_path(project)
line that is given in the example query. By default, this returns something of the form projects/{project_id}. However, the CreateTransferConfigRequest documentation says of the parent parameter:
The BigQuery project id where the transfer configuration should be created. Must be in the format projects/{project_id}/locations/{location_id} or projects/{project_id}. If specified location and location of the destination bigquery dataset do not match - the request will fail.
Sure enough, if you use the projects/{project_id}/locations/{location_id} format instead, it resolves the error and allows you to pass a null destination_dataset_id.
I had the exact same issue. the fix for the issue is as below.
The below method returns Projects/{projectid}
parent = transfer_client.common_project_path(project_id)
instead use the below method , which returns projects/{project}/locations/{location}
parent = transfer_client.common_location_path(project_id , "EU")
I had tried with the above change , i am able to schedule a script in BQ.

Error when changing instance type in a python for loop

I have a Python 2 script which uses boto3 library.
Basically, I have a list of instance ids and I need to iterate over it changing the type of each instance from c4.xlarge to t2.micro.
In order to accomplish that task, I'm calling the modify_instance_attribute method.
I don't know why, but my script hangs with the following error message:
EBS-optimized instances are not supported for your requested configuration.
Here is my general scenario:
Say I have a piece of code like this one below:
def change_instance_type(instance_id):
client = boto3.client('ec2')
response = client.modify_instance_attribute(
InstanceId=instance_id,
InstanceType={
'Value': 't2.micro'
}
)
So, If I execute it like this:
change_instance_type('id-929102')
everything works with no problem at all.
However, strange enough, if I execute it in a for loop like the following
instances_list = ['id-929102']
for instance_id in instances_list:
change_instance_type(instance_id)
I get the error message above (i.e., EBS-optimized instances are not supported for your requested configuration) and my script dies.
Any idea why this happens?
When I look at EBS optimized instances I don't see that T2 micros are supported:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html
I think you would need to add EbsOptimized=false as well.

Categories

Resources