The kind of workflow that I want to run looks like this:
workflow = (
    generator.s() |
    spread.s() |
    gather.s()
)
where spread is a task that replaces itself with a group.
from celery import Celery, group

celery_app = Celery()

@celery_app.task(bind=True)
def spread(self, numbers):
    return self.replace(group(
        (task_1.si(n) | task_2.s() | task_3.s()) for n in numbers
    ))
The whole workflow runs fine and behaves as expected.
My question is essentially only about the chains in the group created by spread. I don't care too much if some of them fail. I'm fine with an error somewhere in a chain leading to a shorter list of results being passed to gather. However, I'm not sure how to achieve that.
I can, of course, catch exceptions in each of task_1, task_2, and task_3 and pass on an empty dummy result. For convenience, though, I'd really like to be able to say: on an error anywhere in the chain, log the traceback and either remove the result from the group or pass on an empty dummy result.
I've searched the documentation and GitHub issues far and wide but could not find anything. I know that I can pass an on_error callback to the chain but I don't know how to pass on an empty result from there (if that's even possible).
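For reference, the per-task workaround I describe above would look roughly like this; EMPTY is just a dummy sentinel I made up, do_work_1/do_work_2 are placeholders for the real computations, and gather has to filter the sentinel out:

import logging

EMPTY = "__failed__"  # JSON-serializable dummy marker meaning "this chain failed somewhere"

@celery_app.task
def task_1(n):
    try:
        return do_work_1(n)  # placeholder for the real computation
    except Exception:
        logging.exception("task_1 failed for %r", n)
        return EMPTY

@celery_app.task
def task_2(result):
    if result == EMPTY:  # short-circuit the rest of the chain
        return EMPTY
    return do_work_2(result)  # placeholder

@celery_app.task
def gather(results):
    # drop the dummy results produced by failed chains
    return [r for r in results if r != EMPTY]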
Setup:
Python 3.6
celery 4.2.1
Redis broker and backend (though it's not a problem for me to switch if that would enable the behavior)
Related
I am very new to Airflow and I am trying to create a DAG based on the requirement below.
Task 1 - Run a BigQuery query to get a value which I need to push to the 2nd task in the DAG.
Task 2 - Use the value from the above query to run another query and export the data into a Google Cloud Storage bucket.
I have read other answers related to this and I understand we cannot use xcom_pull or xcom_push in a BigQueryOperator in Airflow. So what I am doing is using a PythonOperator, in which I can use Jinja template variables by setting "provide_context=True".
Below is a snippet of my code, just task 1, where I want to do "task_instance.xcom_push" in order to see the value under XCom in the Airflow logs.
def get_bq_operator(dag, task_id, configuration, table_params=None, trigger_rule='all_success'):
    bq_operator = BigQueryInsertJobOperator(
        task_id=task_id,
        configuration=configuration,
        gcp_conn_id=gcp_connection_id,
        dag=dag,
        params=table_params,
        trigger_rule=trigger_rule,
        **task_instance.xcom_push(key='yr_wk', value=yr_wk),**
    )
    return bq_operator

def get_bq_wm_yr_wk():
    get_bq_operator(dag, app_name, bigquery_util.get_bq_job_configuration(
        bq_query,
        query_params=None))

get_wm_yr_wk = PythonOperator(task_id='get_wm_yr_wk',
                              python_callable=get_bq_wm_yr_wk,
                              provide_context=True,
                              on_failure_callback=failure_callback,
                              on_retry_callback=failure_callback,
                              dag=dag)
"bq_query" is the one I am passing the sql file which has my query and the query returns the value of yr_wk which I need to use in my 2nd task.
The highlighted task_instance.xcom_push(key='yr_wk', value=yr_wk), in get_bq_operator is failing and the errror i am getting is as below
raise KeyError(f'Variable {key} does not exist')
KeyError: 'Variable ei_migration_hour does not exist'
If I comment out that line, the DAG runs fine. However, how do I validate the value of yr_wk? I want to push it so that I can view the value in the logs.
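For reference, the closest I have come up with so far is pushing from inside the python_callable itself, roughly like this, where get_yr_wk_from_bq is just a made-up placeholder for however the value gets fetched:

def get_bq_wm_yr_wk(**context):
    yr_wk = get_yr_wk_from_bq(bq_query)  # hypothetical helper that runs the query
    # push the value so it shows up under XCom for this task instance
    context['ti'].xcom_push(key='yr_wk', value=yr_wk)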
I do not fully understand your code :), but if you want to do something with the results of a BigQuery query, then a far better way to approach it is to use BigQueryHook in your Python callable.
Operators in Airflow are usually thin wrappers around Hooks that provide a "complete" task (for example, you can use one to run an update operation), but if you want to do something with the result and you are already doing it via a PythonOperator, it is far better to use the Hook directly, since you avoid all the assumptions that operators make in their execute method.
In your case it should be something like the following (I am using the new TaskFlow syntax here, which is the preferred way to do this kind of operation; see https://airflow.apache.org/docs/apache-airflow/stable/tutorial_taskflow_api.html for the TaskFlow API tutorial. Especially in Airflow 2 it became the de-facto default way of writing tasks):
@task(.....)
def my_task():
    hook = BigQueryHook(....)  # initialize it with the right parameters
    result = hook.run(sql='YOUR_QUERY', ...)  # add other necessary params
    processed_result = process_result(result)  # do something with the result
    return processed_result
This way you do not even have to run xcom_push (the TaskFlow API will do it for you automatically), and other tasks will be able to use the result by just doing:
@task
def next_task(input):
    pass
And then:
result = my_task()
next_task(result)
Then all the xcom push/pull will be handled for you automatically via TaskFlow.
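To make the wiring concrete, a minimal sketch of a whole DAG could look like this (the DAG id, schedule and dates are placeholders I made up, and the hook call inside my_task is elided):

from datetime import datetime

from airflow.decorators import dag, task
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook

@dag(dag_id="bq_taskflow_sketch", start_date=datetime(2022, 1, 1), schedule_interval=None, catchup=False)
def bq_taskflow_sketch():

    @task
    def my_task():
        hook = BigQueryHook(gcp_conn_id="google_cloud_default")
        # run the query via the hook and post-process the result here;
        # whatever you return is pushed to XCom automatically
        ...

    @task
    def next_task(input):
        pass

    next_task(my_task())

example_dag = bq_taskflow_sketch()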
I want to implement a dynamic FTPSensor of a kind. Using the contributed FTP sensor I managed to make it work in this way:
ftp_sensor = FTPSensor(
    task_id="detect-file-on-ftp",
    path="./data/test.txt",
    ftp_conn_id="ftp_default",
    poke_interval=5,
    dag=dag,
)
and it works just fine. But I need to pass dynamic path and ftp_conn_id params. I.e. I generate a bunch of new connections in a previous task, and in the ftp_sensor task I want to check, for each of the new connections I generated, whether there's a file present on the FTP server.
So I thought first to grab the connections' ids from XCom.
I send them from the previous task in XCom but it seems I cannot access XCom outside of tasks.
E.g. I was aiming at something like:
active_ftp_connections = context['ti'].xcom_pull(key='active_ftps')

for conn in active_ftp_connections:
    ftp_sensor = FTPSensor(
        task_id="detect-file-on-ftp",
        path=conn['path'],
        ftp_conn_id=conn['connection'],
        poke_interval=5,
        dag=dag,
    )
but this doesn't seem to be a possible solution.
Then I wasted a good amount of time trying to create my own custom FTPSensor to which I could pass the data I need dynamically, but by now I have reached the conclusion that I need a hybrid between a sensor and an operator, because I need to keep the poke functionality, for instance, but also have the execute functionality.
I guess one option is to write a custom operator that implements poke from the sensor base class, but I am probably too tired to try to do it now.
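To make it less abstract, this is roughly the kind of hybrid I had in mind; it is an untested sketch (imports assume Airflow 2, and the XCom key and dict layout match my previous task):

from airflow.providers.ftp.hooks.ftp import FTPHook
from airflow.sensors.base import BaseSensorOperator

class DynamicFTPSensor(BaseSensorOperator):
    """Untested sketch: poke every FTP connection an upstream task pushed to XCom."""

    def poke(self, context):
        active_ftps = context["ti"].xcom_pull(key="active_ftps") or []
        for ftp in active_ftps:
            hook = FTPHook(ftp_conn_id=ftp["conn_id"])
            # check whether the expected file is present in this account's folder
            if "test.txt" not in hook.list_directory(ftp["folder"]):
                return False
        return bool(active_ftps)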
Do you have an idea how to achieve what I am aiming at? I can't seem to find any materials on the topic on the internet - maybe it's just me.
Let me know if the question is not clear so I can provide more details.
Update
I have now arrived at this as a possibility:
def get_active_ftps(**context):
    active_ftp_connestions = context['ti'].xcom_pull(key='active_ftps')
    return active_ftp_connestions

for ftp in get_active_ftps():
    ftp_sensor = FTPSensor(
        task_id="detect-file-on-ftp",
        path="./" + ftp['folder'] + "/test.txt",
        ftp_conn_id=ftp['conn_id'],
        poke_interval=5,
        dag=dag,
    )
but it throws an error: Broken DAG: [/usr/local/airflow/dags/copy_file_from_ftp.py] 'ti'
I managed to do it like this:
active_ftp_folder = Variable.get('active_ftp_folder')
active_ftp_conn_id = Variable.get('active_ftp_conn_id')

ftp_sensor = FTPSensor(
    task_id="detect-file-on-ftp",
    path="./" + active_ftp_folder + "/test.txt",
    ftp_conn_id=active_ftp_conn_id,
    poke_interval=5,
    dag=dag,
)
And I will just have the DAG handle one FTP account at a time, since I realized that there shouldn't be cycles in a directed acyclic graph ... apparently.
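For completeness, the upstream task just stores those values with Variable.set before this sensor runs; roughly like this (get_next_ftp_account is a made-up placeholder for my own selection logic):

from airflow.models import Variable

def pick_next_ftp(**context):
    ftp = get_next_ftp_account()  # placeholder: choose the next account to process
    Variable.set("active_ftp_folder", ftp["folder"])
    Variable.set("active_ftp_conn_id", ftp["conn_id"])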
I had the same problem as here (see link below); briefly: I was unable to create an .exe of a Python script that uses APScheduler:
Pyinstaller 3.3.1 & 3.4.0-dev build with apscheduler
So I did as suggested:
from apscheduler.triggers import interval

scheduler.add_job(Run, 'interval', interval.IntervalTrigger(minutes = time_int),
                  args = (input_file, output_dir, time_int),
                  id = theID, replace_existing=True)
And indeed importing interval.IntervalTrigger and passing it as an argument to add_job solved this particular error.
However, now I am encountering:
TypeError: add_job() got multiple values for argument 'args'
I tested it and I can ascertain it is occurring because of the way the trigger is passed now. I also tried defining trigger = interval.IntervalTrigger(minutes = time_int) separately and then just passing trigger, and the same thing happens.
If I ignore the error with try/except, I see that it does not add the job to the SQL database at all (I am using SQLAlchemy as a jobstore). Initially I thought it was because I am adding several jobs in a for loop, but it happens with a single job add as well.
Does anyone know of some other workaround for the initial problem, or have any idea why this error might occur? I can't find anything online either :(
Things always work better in the morning.
For anyone else who encounters this: you don't need both 'interval' and interval.IntervalTrigger() as arguments; that is where the error comes from. The code should be:
scheduler.add_job(Run, interval.IntervalTrigger(minutes = time_int),
                  args = (input_file, output_dir, time_int),
                  id = theID, replace_existing=True)
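For a fuller picture, a minimal self-contained version with the SQLAlchemy jobstore looks roughly like this (the jobstore URL, the Run body and the argument values are placeholders):

from apscheduler.jobstores.sqlalchemy import SQLAlchemyJobStore
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.triggers import interval

def Run(input_file, output_dir, time_int):
    pass  # placeholder for the real job

scheduler = BackgroundScheduler(jobstores={'default': SQLAlchemyJobStore(url='sqlite:///jobs.sqlite')})
scheduler.add_job(Run, interval.IntervalTrigger(minutes=5),
                  args=('input.csv', 'out/', 5),
                  id='my_job', replace_existing=True)
scheduler.start()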
This question already has an answer here: How to run independent transformations in parallel using PySpark?
The use case is the following:
I have a large dataframe, with a 'user_id' column in it (every user_id can appear in many rows). I have a list of users my_users which I need to analyse.
groupBy, filter and aggregate could be a good idea, but the available aggregation functions included in PySpark did not fit my needs. In this PySpark version, user-defined aggregation functions are still not fully supported, so I decided to leave that for now.
Instead, I simply iterate over the my_users list, filter for each user in the dataframe, and analyse. In order to speed this procedure up, I decided to use a Python multiprocessing Pool, with one job for each user in my_users.
The function that does the analysis (and is passed to the pool) takes two arguments: the user_id and a path to the main dataframe (Parquet format), on which I perform all the computations. In the function, I load the dataframe and work on it (a DataFrame can't be passed as an argument itself).
I get all sorts of weird errors, on some of the processes (different in each run), that look like:
PythonUtils does not exist in the JVM (when reading the 'parquet' dataframe)
KeyError: 'c' not found (also, when reading the 'parquet' dataframe. What is 'c' anyway??)
When I run it without any multiprocessing, everything runs smoothly, but slowly.
Any ideas where these errors are coming from?
I'll put some code sample just to make things clearer:
PYSPARK_SUBMIT_ARGS = '--driver-memory 4g --conf spark.driver.maxResultSize=3g --master local[*] pyspark-shell'  # if it's relevant
# ....

def users_worker(df_path, user_id):
    df = spark.read.parquet(df_path)  # The problem is here!
    ## the analysis of user_id in df is here

def users_worker_wrapper(args):
    users_worker(*args)

def analyse():
    # ...
    users_worker_args = [(df_path, user_id) for user_id in my_users]
    users_pool = Pool(processes=len(my_users))
    users_pool.map(users_worker_wrapper, users_worker_args)
    users_pool.close()
    users_pool.join()
Indeed, as @user6910411 commented, when I changed the Pool to a ThreadPool (multiprocessing.pool.ThreadPool), everything worked as expected and these errors were gone.
The root causes of the errors themselves are also clear now; if you want me to share them, please comment below.
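For reference, the change itself is minimal; something like:

from multiprocessing.pool import ThreadPool

def analyse():
    # ...
    users_worker_args = [(df_path, user_id) for user_id in my_users]
    # threads share the driver process (and its SparkSession / Py4J gateway),
    # unlike forked worker processes
    users_pool = ThreadPool(processes=len(my_users))
    users_pool.map(users_worker_wrapper, users_worker_args)
    users_pool.close()
    users_pool.join()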
I have built a pipeline of Tasks in Luigi. Because this pipeline is going to be used in different contexts, it might need to include more tasks at the beginning or the end of the pipeline, or even entirely different dependencies between the tasks.
That's when I thought: "Hey, why not declare the dependencies between the tasks in my config file?", so I added something like this to my config.py:
PIPELINE_DEPENDENCIES = {
    "TaskA": [],
    "TaskB": ["TaskA"],
    "TaskC": ["TaskA"],
    "TaskD": ["TaskB", "TaskC"]
}
I was annoyed by those parameters stacking up throughout the tasks, so at some point I introduced just one parameter, task_config, that every Task has and in which all the information or data necessary for run() is stored. So I put PIPELINE_DEPENDENCIES right in there.
Finally, I have every Task I define inherit from both luigi.Task and a custom Mixin class that implements the dynamic requires(), which looks something like this:
class TaskRequirementsFromConfigMixin(object):
    task_config = luigi.DictParameter()

    def requires(self):
        required_tasks = self.task_config["PIPELINE_DEPENDENCIES"]
        requirements = [
            self._get_task_cls_from_str(required_task)(task_config=self.task_config)
            for required_task in required_tasks
        ]
        return requirements

    def _get_task_cls_from_str(self, cls_str):
        ...
Unfortunately, that doesn't work, as running the pipeline gives me the following:
===== Luigi Execution Summary =====
Scheduled 4 tasks of which:
* 4 were left pending, among these:
* 4 was not granted run permission by the scheduler:
- 1 TaskA(...)
- 1 TaskB(...)
- 1 TaskC(...)
- 1 TaskD(...)
Did not run any tasks
This progress looks :| because there were tasks that were not granted run permission by the scheduler
===== Luigi Execution Summary =====
and a lot of
DEBUG: Not all parameter values are hashable so instance isn't coming from the cache
Although I am not sure if that's relevant.
So:
1. What's my mistake? Is it fixable?
2. Is there another way to achieve this?
I realize this is an old question, but I recently learned how to enable dynamic dependencies. I was able to accomplish this by using a WrapperTask and, in its requires method, yielding tasks built from a dict comprehension (though you could use a list too if you want) holding the parameters I wanted to pass to the other tasks.
Something like this:
class WrapperTaskToPopulateParameters(luigi.WrapperTask):
    date = luigi.DateMinuteParameter(interval=30, default=datetime.datetime.today())

    def requires(self):
        base_params = ['string', 'string', 'string', 'string', 'string', 'string']
        # map each base parameter to a modified version of it
        modded_params = {base: 'mod' + base for base in base_params}
        yield list(SomeTask(param1=key_in_dict_we_created, param2=value_in_dict_we_created)
                   for key_in_dict_we_created, value_in_dict_we_created in modded_params.items())
I can post an example using a list comprehension too if there's interest.
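And for anyone who wants to try it out quickly, the wrapper can be run locally with something like:

import luigi

if __name__ == '__main__':
    luigi.build([WrapperTaskToPopulateParameters()], local_scheduler=True)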