I have the following simple DAG:
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG('test_parallel',
          description='Simple tutorial DAG',
          schedule_interval=None,
          start_date=datetime(2017, 3, 20),
          catchup=False)
def first_echo(arg):
    print('\n\n')
    print('FIRST ECHO! %s' % arg)

def second_echo(arg):
    print('\n\n')
    print('SECOND ECHO! %s' % arg)

def final_echo():
    print('\n\n')
    print('FINAL ECHO: ')
final_echo = PythonOperator(task_id='final_echo', dag=dag, provide_context=False, python_callable=final_echo)
for i in range(5):
    first_echo_op = PythonOperator(task_id='first_echo_%s' % i, python_callable=first_echo, op_args=[i], dag=dag)
    second_echo_op = PythonOperator(task_id='second_echo_%s' % i, python_callable=second_echo, op_args=[i], dag=dag)

    first_echo_op.set_downstream(second_echo_op)
    second_echo_op.set_downstream(final_echo)
The idea is that I have a series of five independent tasks, each of which leads to a following task, and they all get aggregated into a final task.
The issue is that none of my second_echo tasks will start until all of the first_echo tasks finish. Since the first_echo tasks are all independent and each second_echo task only depends on its corresponding first_echo task, I would have thought they would run as soon as resources were available to do so...
I can provide a Gantt chart if needed.
The question is: how do I make independent pathways in a DAG run as soon as they can, rather than waiting for all of the first tasks to finish, assuming I have enough resources?
Your DAG is working as expected in my environment.
If I add a random delay to the first tasks, to force some to finish earlier, like so:
import random
import time

def first_echo(arg):
    time.sleep(random.randint(0, 30))
    print('\n\n')
    print('FIRST ECHO! %s' % arg)
I get the expected parallel execution pattern (using LocalExecutor):
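Also make sure the executor actually allows concurrency: the SequentialExecutor runs one task instance at a time, whereas the LocalExecutor used here runs tasks in parallel up to the parallelism (and, in Airflow 1.x, dag_concurrency) limits configured in airflow.cfg.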
I have a task function like this:
def task(s):
    # doing something
    return res
The original program is:
res = []
for i in data:
    res.append(task(i))
# using pickle to save res every 30s
I need to process a lot of data and I don't care about the output order of the results. Due to the long running time, I need to save the current progress regularly. Now I'm changing it to multiprocessing:
from multiprocessing import Pool

pool = Pool(4)
status = []
res = []
for i in data:
    status.append(pool.apply_async(task, (i,)))
for i in status:
    res.append(i.get())
# using pickle to save res every 30s
Suppose I have processes p0, p1, p2, p3 in the Pool and 10 tasks, task(0) ... task(9), and p0 takes a very long time to finish task(0).
Will the main process be blocked at the first res.append(i.get())?
If p1 has finished task(1) while p0 is still working on task(0), will p1 go on to task(4) or a later one?
If the answer to the first question is yes, how can I get the other results in advance and pick up the result of task(0) last?
I updated my code, but the main process got blocked somewhere while the other processes were still working on tasks. What's wrong? Here is the core of the code:
with concurrent.futures.ProcessPoolExecutor(4) as ex:
    for i in self.inBuffer:
        futuresList.append(ex.submit(warpper, i))
    for i in concurrent.futures.as_completed(futuresList):
        (word, r) = i.result()
        self.resDict[word] = r
        self.logger.info("{} --> {}".format(word, r))
        cur = datetime.now()
        if (cur - self.timeStmp).total_seconds() > 30:
            self.outputPickle()
            self.timeStmp = datetime.now()
The length of self.inBuffer is about 100000. self.logger.info writes the info to a log file. For some special inputs i, the warpper function prints auxiliary information with print. self.resDict is a dict that stores the results. self.outputPickle() writes a .pkl file using pickle.dump.
At first, the code ran normally: both the log file and the output printed by warpper kept updating. But at some point I found that the log file had not been updated for a long time (several hours, while a single warpper call should take no more than 120 s), even though warpper was still printing information (until I killed the process it printed about 100 messages without any update of the log file). Also, the timestamp of the output .pkl file didn't change. Here is the implementation of outputPickle():
def outputPickle(self):
    if os.path.exists(os.path.join(self.wordDir, self.outFile)):
        if os.path.exists(os.path.join(self.wordDir, "{}_backup".format(self.outFile))):
            os.remove(os.path.join(self.wordDir, "{}_backup".format(self.outFile)))
        shutil.copy(os.path.join(self.wordDir, self.outFile), os.path.join(self.wordDir, "{}_backup".format(self.outFile)))
    with open(os.path.join(self.wordDir, self.outFile), 'wb') as f:
        pickle.dump(self.resDict, f)
Then I added three print calls inside the as_completed loop:
print("getting res of something")
(word, r) = i.result()
print("finishing i.result")
self.resDict[word] = r
print("finished getting res of {}".format(word))
Here is the log:
getting res of something
finishing i.result
finished getting res of CNICnanotubesmolten
getting res of something
finishing i.result
finished getting res of CNN0
getting res of something
message by warpper
message by warpper
message by warpper
message by warpper
message by warpper
The message "message by warpper" is printed at most once per call of warpper.
Yes, the main process will block at the first res.append(i.get()) until task(0) has finished.
Yes, as the tasks are submitted asynchronously. Also, p1 (or another worker) will pick up further tasks if the size of the input iterable is larger than the maximum number of processes/workers.
"... how to get other results in advance"
One of the convenient options is to rely on concurrent.futures.as_completed which will return the results as they are completed:
import time
import concurrent.futures

def func(x):
    time.sleep(3)
    return x ** 2

if __name__ == '__main__':
    data = range(1, 5)
    results = []

    with concurrent.futures.ProcessPoolExecutor(4) as ex:
        futures = [ex.submit(func, i) for i in data]
        # processing the earlier results: as they are completed
        for fut in concurrent.futures.as_completed(futures):
            res = fut.result()
            results.append(res)
            print(res)
Sample output:
4
1
9
16
Another option is to use the callback parameter of the apply_async(func[, args[, kwds[, callback[, error_callback]]]]) call; the callback accepts a single argument, the returned result of the function. In that callback you can process the result in a minimal way (considering that it's tied to only a single argument/result from a concrete function). The general scheme looks as follows:
from multiprocessing import Pool

def res_callback(v):
    # ... processing result
    with open('test.txt', 'a') as f:  # just an example
        f.write(str(v))
    print(v, flush=True)

if __name__ == '__main__':
    data = range(1, 5)
    results = []

    with Pool(4) as pool:
        tasks = [pool.apply_async(func, (i,), callback=res_callback) for i in data]
        # wait for the submitted tasks to finish
But that scheme would still require you to somehow await (get()) the results of the submitted tasks.
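If the order of the results doesn't matter at all, another option (not covered above) is Pool.imap_unordered, which yields results as workers finish them; a minimal sketch reusing the func from the previous example:
import time
from multiprocessing import Pool

def func(x):
    time.sleep(3)
    return x ** 2

if __name__ == '__main__':
    data = range(1, 5)
    results = []
    with Pool(4) as pool:
        # results arrive in completion order, not submission order
        for res in pool.imap_unordered(func, data):
            results.append(res)
            print(res)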
I am reading an integer variable from Airflow Variables, incrementing the value by one each time the DAG runs, and setting the variable again.
But with the code below, the variable in the UI changes every time the page is refreshed or so.
I don't know what is causing this behavior.
counter = Variable.get('counter')

s = BashOperator(
    task_id='echo_start_variable',
    bash_command='echo ' + counter,
    dag=dag,
)

Variable.set("counter", int(counter) + 1)

sql_query = "SELECT * FROM UNNEST(SEQUENCE({start}, {end}))"
sql_query = sql_query.replace('{start}', start).replace('{end}', end)

submit_query = PythonOperator(
    task_id='submit_athena_query',
    python_callable=run_athena_query,
    op_kwargs={'query': sql_query, 'db': 'db',
               's3_output': 's3://s3-path/rohan/date=' + current_date + '/'},
    dag=dag)

e = BashOperator(
    task_id='echo_end_variable',
    bash_command='echo ' + counter,
    dag=dag,
)

s >> submit_query >> e
Airflow parses that DAG file every 30 seconds (the default of the min_file_process_interval setting). This means that any top-level code you have runs every 30 seconds, so Variable.set("counter", int(counter) + 1) will cause the Variable counter to be increased by 1 every 30 seconds.
It's bad practice to interact with Variables in top-level code (regardless of the increasing-value issue): it opens a connection to the metastore database every 30 seconds, which may cause serious problems and overwhelm the database.
To get the value of Variable you can use Jinja:
e = BashOperator(
    task_id='echo_end_variable',
    bash_command='echo {{ var.value.counter }}',
    dag=dag,
)
This is a safe way to use variables as the value is being retrieved only when the operator is executed.
If you want to increase the value of the variable by 1, then do it with a PythonOperator:
def increase():
    counter = Variable.get('counter')
    Variable.set("counter", int(counter) + 1)

increase_op = PythonOperator(
    task_id='increase_task',
    python_callable=increase,
    dag=dag)
The python callable will be executed only when the operator runs.
I have an Airflow job like the one below:
import time
job_id = int(time.time())
airflow_job1 = PythonOperator(op_kwargs={"job_id" : job_id}, ...)
airflow_job2 = BashOperator(op_kwargs={"job_id" : job_id}, ...)
airflow_job1 >> airflow_job2
I know that every time the script is parsed I get a new job_id, which is used in each Airflow task. But what happens if I run the workflow from the middle, e.g. airflow_job1 failed, I fix the problem and rerun from airflow_job1 in the UI: is a new job_id generated for the rerun, or does Airflow reuse the last job_id?
Actually, after checking with a simple case:
# global parameter
job_id = int(time.time())

def airflow_job1(job_id, **context):
    print("in airflow_job1, current timestamp: %s" % job_id)

def airflow_job2(job_id, **context):
    print("in airflow_job2, current timestamp: %s" % job_id)

airflow_job1 = PythonOperator(
    task_id='airflow_job1',
    provide_context=True,
    python_callable=airflow_job1,
    op_kwargs={'job_id': job_id},
    dag=globals()[dag_name]
)

airflow_job2 = PythonOperator(
    task_id='airflow_job2',
    provide_context=True,
    python_callable=airflow_job2,
    op_kwargs={'job_id': job_id},
    dag=globals()[dag_name]
)

airflow_job1 >> airflow_job2
I found that the job_id in airflow_job1 and airflow_job2 are different even within the same run.
So the conclusion is that we shouldn't pass a global parameter this way; using xcom_push / xcom_pull to share the value between tasks is the way to solve it, as sketched below.
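A minimal sketch of that idea (purely illustrative: the task ids and the XCom key are made up, and it assumes the Airflow 1.x style PythonOperator with provide_context=True, as in the code above):
import time
from airflow.operators.python_operator import PythonOperator

def generate_job_id(**context):
    job_id = int(time.time())
    # push the value so that downstream tasks of the same DAG run read the same job_id
    context['ti'].xcom_push(key='job_id', value=job_id)

def use_job_id(**context):
    job_id = context['ti'].xcom_pull(task_ids='generate_job_id', key='job_id')
    print("in use_job_id, job_id: %s" % job_id)

generate_op = PythonOperator(
    task_id='generate_job_id',
    provide_context=True,
    python_callable=generate_job_id,
    dag=dag)  # assumes 'dag' is defined as in the snippets above

use_op = PythonOperator(
    task_id='use_job_id',
    provide_context=True,
    python_callable=use_job_id,
    dag=dag)

generate_op >> use_op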
I am looping over a series of status tests running in Apache Airflow. Under certain conditions I want to publish a message. Something like this is what I have now:
for count, test in enumerate(test_list):
    test = StatusTest(test['_id'], test['status'])

    check_status_task = PythonOperator(
        task_id='run_status_checker_' + str(count),
        python_callable=run_status_checker,
        op_kwargs={'status_test': test},
        provide_context=True,
        xcom_push=True,
        retries=0,
        dag=dag)

    pub_results_task = PythonOperator(
        task_id='pub_results_' + str(count),
        python_callable=pub_result,
        #op_kwargs={'task_id':'run_status_checker_'+str(count)},
        provide_context=True,
        trigger_rule='all_done',
        dag=dag
    )

    check_status_task >> pub_results_task
Code that calls status checker, gets responses, etc:
def run_status_test(ti, **kwargs):
    status_test_conn = MongoHook(conn_id='test_selector_mongo')
    status_test = kwargs.pop('status_test', None)
    is_up = check_test_status(status_test)
    status, response = is_up
    if status:
        if status_test.already_failed(test_conn):
            status_test.status = 'true'
            status_test.update_status(status_test_conn)
            message = {"Test {0} passed".format(status_test.uid)}
            ti.xcom_push(XCOM_CHECK_STATUS_KEY, message)
    else:
        if (test.already_down(test_conn) and 'false' in test.status):
            test.update_sty_down(status_test_conn, response=response)
        else:
            status_test.status = 'false'
            status_test.update_status(status_test_conn)
            message = {status_test.uid}
            ti.xcom_push(XCOM_CHECK_STATUS_KEY, message)
    status_test_conn.close_conn()
Code that would do message publishing:
def pub_result(dag, ti, **context):
    message = ti.xcom_pull(
        task_ids=context['task_id'],
        key=XCOM_CHECK_STATUS_KEY
    )
    message_con = pika.BlockingConnection(pika.URLParameters(os.getenv('BROKERURL')))
    channel = message_con.channel()
    channel.queue_declare(queue='status_test', durable=True, auto_delete=False, exclusive=False)
    channel.basic_publish(exchange='', routing_key='outage', body=json.dumps(message), properties=pika.BasicProperties(delivery_mode=2))
    message_con.close()
How do I tell Airflow to only do the publishing part of the workflow if certain conditions are met such as:
If there is a message then publish it (or run the publish task).
If not don't do anything.
I was thinking I could just check the value in the XCOM and publish if there is something or do nothing if it is empty. However, I wanted to see if there was a proper way to do it in Airflow.
You could use the BranchPythonOperator to create a task that determines whether to run the pub_result task or some other dummy task. Here is an example from the Airflow GitHub repo.
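A rough sketch of how that could look for the loop in the question (the branch task id, the dummy task id and the branching callable are assumptions; XCOM_CHECK_STATUS_KEY and the other task ids come from the question):
from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.dummy_operator import DummyOperator

def choose_branch(**context):
    # publish only if the status checker pushed a message to XCom
    message = context['ti'].xcom_pull(task_ids='run_status_checker_0', key=XCOM_CHECK_STATUS_KEY)
    return 'pub_results_0' if message else 'skip_publish_0'

branch_task = BranchPythonOperator(
    task_id='branch_on_message_0',
    python_callable=choose_branch,
    provide_context=True,
    dag=dag)

skip_task = DummyOperator(task_id='skip_publish_0', dag=dag)

check_status_task >> branch_task >> [pub_results_task, skip_task]
Note that the trigger_rule='all_done' on pub_results_task may need revisiting, since depending on the Airflow version it can make the task run even when the branch chose to skip it.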
I'm trying to submit around 150 million jobs to celery using the following code:
from celery import chain
from .task_receiver import do_work, handle_results, get_url

urls = '/home/ubuntu/celery_main/urls'

if __name__ == '__main__':
    fh = open(urls, 'r')
    alldat = fh.readlines()
    fh.close()
    for line in alldat:
        try:
            result = chain(get_url.s(line[:-1]), do_work.s(line[:-1])).apply_async()
        except:
            print("failed to submit job")
        print('task submitted ' + str(line[:-1]))
Would it be faster to split the file into chunks and run multiple instances of this code? Or what can I do? I'm using memcached as the backend, rabbitmq as the broker.
import multiprocessing
from celery import chain
from .task_receiver import do_work, handle_results, get_url

urls = '/home/ubuntu/celery_main/urls'
num_workers = 200

def worker(urls, id):
    """worker function"""
    for url in urls:
        print("%s - %s" % (id, url))
        result = chain(get_url.s(url), do_work.s(url)).apply_async()
    return

if __name__ == '__main__':
    fh = open(urls, 'r')
    alldat = fh.readlines()
    fh.close()
    jobs = []
    stack = []
    id = 0
    for i in alldat:
        if (len(stack) < len(alldat) / num_workers):
            stack.append(i[:-1])
            continue
        else:
            id = id + 1
            p = multiprocessing.Process(target=worker, args=(stack, id,))
            jobs.append(p)
            p.start()
            stack = []
    for j in jobs:
        j.join()
If I understand your problem correctly:
you have a list of 150M urls
you want to run get_url() then do_work() on each of the urls
so you have two issues:
going over the 150M urls
queuing the tasks
Regarding the main for loop in your code: yes, you could make it faster by using multithreading, especially on a multi-core CPU. Your master thread could read the file and pass chunks of it to sub-threads that create the celery tasks (see the sketch after the links below).
Check the guide and the documentation:
https://realpython.com/intro-to-python-threading/
https://docs.python.org/3/library/threading.html
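A minimal sketch of that idea, assuming the get_url and do_work tasks from the question, a thread pool for the submitting side, and a made-up chunk size:
import concurrent.futures
from celery import chain
from .task_receiver import do_work, get_url

URLS_FILE = '/home/ubuntu/celery_main/urls'

def submit_chunk(lines):
    # each thread submits its own slice of urls to the broker
    for line in lines:
        url = line.strip()
        chain(get_url.s(url), do_work.s(url)).apply_async()
    return len(lines)

if __name__ == '__main__':
    with open(URLS_FILE) as fh:
        alldat = fh.readlines()
    chunk_size = 10000  # made up; tune to your broker and machine
    chunks = [alldat[i:i + chunk_size] for i in range(0, len(alldat), chunk_size)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as ex:
        for submitted in ex.map(submit_chunk, chunks):
            print('submitted %d tasks' % submitted)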
Now let's imagine you have 1 worker receiving these tasks. The code will generate 150M new tasks that are pushed to the queue. Each chain will be a chain of get_url() and do_work(); the next chain will run only when do_work() finishes.
If get_url() takes a short time and do_work() takes a long time, it will be a series of quick task, slow task, and the total time is:
t_total_per_worker = (t_get_url_average + t_do_work_average) * 150M
If you have n workers:
t_total = t_total_per_worker / n = (t_get_url_average + t_do_work_average) * 150M / n
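As a purely illustrative example with made-up numbers: if t_get_url_average = 0.1 s, t_do_work_average = 2 s and n = 50 workers, then t_total ≈ 2.1 s * 150M / 50 ≈ 6.3 million seconds, i.e. roughly 73 days.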
Now, if get_url() is time-critical while do_work() is not, then, if you can, you should run all 150M get_url() calls first and, when that is done, run all 150M do_work() calls, but that may require changes to your process design (a rough sketch follows).
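A rough sketch of that two-phase approach using celery's group primitive (purely illustrative: it assumes urls is an in-memory list and that do_work accepts the fetched page plus the url; with 150M items you would batch this rather than hold everything in one group):
from celery import group
from .task_receiver import do_work, get_url

# phase 1: run the time-critical get_url() calls first
fetch_job = group(get_url.s(url) for url in urls).apply_async()
pages = fetch_job.get()  # blocks until every get_url has finished

# phase 2: only then run do_work() on the fetched results
process_job = group(do_work.s(page, url) for url, page in zip(urls, pages)).apply_async()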
That is what I would do. Maybe others have better ideas!?