I'm trying to run multiple tasks in parallel on a single EC2 instance. I've set up a PostgreSQL backend, switched to the LocalExecutor, and updated the settings in airflow.cfg. However, my tasks still execute sequentially rather than in parallel. Is it theoretically possible to run multiple tasks in parallel with the PythonOperator? Ideally I would expect to see alternating 'Hi's and 'Hello's when I run this DAG, but instead I get 10 'Hello's and then 10 'Hi's. If I can get this toy example working, it will really help with a workflow I'm trying to design.
"test-airflow",
default_args=default_args,
description="airflow pipeline features",
start_date=days_ago(2),
tags=["example"],
schedule_interval="#daily",
catchup=False,
) as dag:
# [END instantiate_dag]
# t1, t2 and t3 are examples of tasks created by instantiating operators
# [START basic_task]
def test():
for i in range(0,10):
print('Hello')
time.sleep(1)
return 3
def test1():
for i in range(0,10):
print('Hi')
time.sleep(1)
return 3
t = PythonOperator(task_id='test', python_callable=test)
t1 = PythonOperator(task_id='test1', python_callable=test1)
[t, t1]
With the LocalExecutor you should be able to run tasks in parallel, given this configuration in your airflow.cfg file:
[core]
executor = LocalExecutor
... more config here ...
# This makes LocalExecutor spawn a process for each task
# No queue needed!
parallelism = 0
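Depending on your Airflow version, per-DAG limits can also cap how many task instances run at once. A sketch of the same file with that knob set explicitly (the key is dag_concurrency in older releases and max_active_tasks_per_dag from Airflow 2.2 on):
[core]
executor = LocalExecutor
# 0 means unlimited processes across the instance
parallelism = 0
# per-DAG cap; named dag_concurrency in older releases
max_active_tasks_per_dag = 16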
I ended up needing to restart the EC2 instance after I installed the PostgreSQL backend and changed the config file. Airflow wasn't picking up the changes until I restarted it, but that fixed the issue.
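If a full reboot is undesirable, restarting just the Airflow processes after a config change is usually enough; a sketch, assuming the Airflow 2 CLI and no service manager (1.10 uses airflow initdb instead of airflow db init):
airflow db init      # initialize the metadata DB after switching to the Postgres backend
airflow scheduler    # restart so the new executor setting is picked up
airflow webserver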
I have a scheduler_project.py script file.
code:
from apscheduler.schedulers.background import BackgroundScheduler

scheduler = BackgroundScheduler()

def func_1():
    ...  # updating file-1

def func_2():
    ...  # updating file-2

scheduler.add_job(func_1, 'cron', hour='10-23', minute='0-58', second=20)
scheduler.add_job(func_2, 'cron', hour='1')
scheduler.start()
When I run it (on a Windows machine):
E:\> python scheduler_project.py
E:\> # there is no error
In the log (I have added debug-level logging to the above code), it says the job is added and will start in (some x seconds), but it never starts.
In Task Manager, the command-prompt process displays for a second and then disappears.
And my files are also not getting updated, which follows.
What's happening? What is the right way to execute this scheduler script?
BackgroundScheduler was designed to run alongside other code, so after starting the scheduler you are expected to run something else. If you don't have any other work for the main thread, then you have to use some loop to keep the script alive. The documentation links to examples on GitHub, and one of them uses:
while True:
    time.sleep(2)
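Putting it together, a minimal sketch of scheduler_project.py with such a keep-alive loop (the job bodies are placeholders) might be:
import time
from apscheduler.schedulers.background import BackgroundScheduler

scheduler = BackgroundScheduler()

def func_1():
    ...  # updating file-1

scheduler.add_job(func_1, 'cron', hour='10-23', minute='0-58', second=20)
scheduler.start()

try:
    # keep the main thread alive so the background scheduler can fire jobs
    while True:
        time.sleep(2)
except (KeyboardInterrupt, SystemExit):
    scheduler.shutdown()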
I'm using Python threading in a REST endpoint so that the endpoint can launch a thread, and then immediately return a 200 OK to the client while the thread runs. (The client then polls server state to track the progress of the thread).
The code runs in 7 seconds on my local dev system, but takes 6 minutes on an AWS EC2 m5.large.
Here's what the code looks like:
import threading
[.....]
# USES THREADING
# https://stackoverflow.com/a/1239108/364966
thr = threading.Thread(target=score, args=(myArgs1, myArgs2), kwargs={})
thr.start()      # starts score() in the background
thr.is_alive()   # returns whether the thread is currently running
data = {'now creating test scores'}
return Response(data, status=status.HTTP_200_OK)
I turned off threading to test if that was the cause of the slowdown, like this:
# USES THREADING
# https://stackoverflow.com/a/1239108/364966
# thr = threading.Thread(target=score, args=(myArgs1, myArgs2), kwargs={})
# thr.start()      # starts score() in the background
# thr.is_alive()   # returns whether the thread is currently running
# FOR DEBUGGING - SKIP THREADING TO SEE IF THAT'S WHAT'S SLOWING THINGS DOWN ON EC2
score(myArgs1, myArgs2)
data = {'now creating test scores'}
return Response(data, status=status.HTTP_200_OK)
...and it ran in 5 seconds on EC2, so the slowdown is evidently related to how the thread is handled on EC2 rather than to score() itself.
Is there something I need to configure on EC2 to better support Python threads?
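For reference, a minimal timing harness that separates raw thread overhead from the web framework might look like this (noop_score is a hypothetical stand-in for the real score()):
import threading
import time

def noop_score():
    time.sleep(1)  # hypothetical stand-in for the real score() work

start = time.perf_counter()
thr = threading.Thread(target=noop_score)
thr.start()
thr.join()
# expect roughly 1s; a much larger number points at the environment,
# while ~1s points back at score() or the framework
print(f"thread round-trip: {time.perf_counter() - start:.2f}s")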
An AWS-certified consultant advised me that EC2 is known to be slow at executing Python threads, and recommended using AWS Lambda functions instead.
I have two kinds of jobs: ones I want to run in serial, and ones I want to run concurrently in parallel. However, I want the parallel jobs themselves to be scheduled in serial (if you're still following). That is:
Do A.
Wait for A, do B.
Wait for B, do 2+ versions of C all concurrently.
My thought is to have two Redis queues: a serial_queue that has just one worker on it, and a parallel_queue which has multiple workers on it.
serial_queue.schedule(
    scheduled_time=datetime.utcnow(),
    func=job_a,
    ...)

serial_queue.schedule(
    scheduled_time=datetime.utcnow(),
    func=job_b,
    ...)

def parallel_c():
    for task in range(args.n_tasks):
        parallel_queue.schedule(
            scheduled_time=datetime.utcnow(),
            func=job_c,
            ...)

serial_queue.schedule(
    scheduled_time=datetime.utcnow(),
    func=parallel_c,
    ...)
But this setup currently gives the error:
AttributeError: module '__main__' has no attribute 'schedule_fetch_tweets'
How can I package this function properly for python-rq?
The solution requires a bit of gymnastics, in that you have to import the current script as if it were an external module.
So, for instance, the contents of schedule_twitter_jobs.py would be:
from datetime import datetime

from redis import Redis
from rq_scheduler import Scheduler

import schedule_twitter_jobs
# we are importing the very module we are executing

def schedule_fetch_tweets(args, queue_name):
    '''This is the child process to schedule.'''
    # this scheduler is created based on a queue_name that will be passed in
    concurrent_queue = Scheduler(queue_name=queue_name + '_concurrent', connection=Redis())
    for task in range(args.n_tasks):
        concurrent_queue.schedule(
            scheduled_time=datetime.utcnow(),
            func=app.controller.fetch_twitter_tweets,
            args=[args.statuses_backfill, fill_start_time])

serial_queue = Scheduler(queue_name='myqueue', connection=Redis())

# This is the first schedule. Note the fully-qualified reference to the
# function we need to schedule, made possible by the self-import above.
serial_queue.schedule(
    scheduled_time=datetime.utcnow(),
    func=schedule_twitter_jobs.schedule_fetch_tweets,
    args=(args, queue_name))  # pass the args through to the child schedule
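For the jobs above to actually execute, a worker must be listening on each queue, and the rqscheduler process must be running to move scheduled jobs onto the queues. A minimal sketch using rq's Python API, with the queue names assumed from the code above:
from redis import Redis
from rq import Queue, Worker

# start one of these per queue (e.g. 'myqueue', 'myqueue_concurrent'),
# each in its own process or shell
redis_conn = Redis()
worker = Worker([Queue('myqueue', connection=redis_conn)], connection=redis_conn)
worker.work()  # blocks, processing jobs as they arrive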
My objective is to schedule an Azure Batch task to run every 5 minutes from the moment it is added, and I use the Python SDK to create/manage my Azure resources. I tried creating a job schedule, and it automatically created a new job under the specified pool.
job_spec = batch.models.JobSpecification(
    pool_info=batch.models.PoolInformation(pool_id=pool_id)
)
schedule = batch.models.Schedule(
    start_window=datetime.timedelta(hours=1),
    recurrence_interval=datetime.timedelta(minutes=5)
)
setup = batch.models.JobScheduleAddParameter(
    id='python_test_schedule',
    schedule=schedule,
    job_specification=job_spec
)
batch_client.job_schedule.add(setup)
I then added a task to this new job. But the task seems to run only once, as soon as it is added (like a normal task). Is there something more I need to do to make the task run recurrently? There doesn't seem to be much documentation or many examples of JobSchedule either.
Thank you! Any help is appreciated.
You are correct in that a JobSchedule will create a new job at the specified time interval. Additionally, you cannot have a task "re-run" every 5 minutes once it has completed. You could do either:
Have one task that runs a loop, performing the same action every 5 minutes.
Use a Job Manager to add a new task (that does the same thing) every 5 minutes.
I would probably recommend the second option, as it gives a little more flexibility to monitor the progress of the tasks and the job, and to take action accordingly.
An example client which creates the job might look a bit like this:
job_manager = models.JobManagerTask(
    id='job_manager',
    command_line="/bin/bash -c 'python ./job_manager.py'",
    environment_settings=[
        models.EnvironmentSetting(name='AZ_BATCH_KEY', value=AZ_BATCH_KEY)],
    resource_files=[
        # parameter names vary by SDK version (older releases use blob_source/file_path)
        models.ResourceFile(http_url="https://url/to/job_manager.py",
                            file_path="job_manager.py")],
    authentication_token_settings=models.AuthenticationTokenSettings(
        access=[models.AccessScope.job]),
    kill_job_on_completion=True,  # marks the job as complete once the Job Manager has finished
    run_exclusive=False)  # whether the Job Manager needs a dedicated VM - depends on the other tasks running on the VM

new_job = models.JobAddParameter(
    id='my_job',
    job_manager_task=job_manager,
    pool_info=models.PoolInformation(pool_id='my_pool'))

batch_client.job.add(new_job)
Now we need a script to run as the Job Manager on the compute node. In this case I will use Python, so you will need to add a StartTask to your pool (or a JobPreparationTask to the job) to install the azure-batch Python package.
Additionally the Job Manager Task will need to be able to authenticate against the Batch API. There are two methods of doing this depending on the scope of activities that the Job Manager will perform. If you only need to add tasks, then you can use the authentication_token_settings attribute, which will add an AAD token environment variable to the Job Manager task with permissions to ONLY access the current job. If you need permission to do other things, like alter the pool, or start new jobs, you can pass an account key via environment variable. Both options are shown above.
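For the token-based route, a sketch of how the Job Manager could build its client from that injected token (assuming msrest's BasicTokenAuthentication; AZ_BATCH_AUTHENTICATION_TOKEN and AZ_BATCH_ACCOUNT_URL are environment variables Batch provides on the node):
import os

from azure.batch import BatchServiceClient
from msrest.authentication import BasicTokenAuthentication

# token injected by Batch because authentication_token_settings was set
token = {'access_token': os.environ['AZ_BATCH_AUTHENTICATION_TOKEN']}
creds = BasicTokenAuthentication(token)
batch_client = BatchServiceClient(
    creds, base_url=os.environ['AZ_BATCH_ACCOUNT_URL'])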
The script you run on the Job Manager task could look something like this:
import os
import time

from azure.batch import BatchServiceClient
from azure.batch import models
from azure.batch.batch_auth import SharedKeyCredentials

# Batch account credentials
AZ_BATCH_ACCOUNT = os.environ['AZ_BATCH_ACCOUNT_NAME']
AZ_BATCH_KEY = os.environ['AZ_BATCH_KEY']
AZ_BATCH_ENDPOINT = os.environ['AZ_BATCH_ENDPOINT']
AZ_JOB = os.environ['AZ_BATCH_JOB_ID']  # the job this Job Manager is running in

# If you're using authentication_token_settings for authentication, you can
# instead use the AAD token in the environment variable AZ_BATCH_AUTHENTICATION_TOKEN.

def main():
    # Batch client
    creds = SharedKeyCredentials(AZ_BATCH_ACCOUNT, AZ_BATCH_KEY)
    batch_client = BatchServiceClient(creds, base_url=AZ_BATCH_ENDPOINT)

    # Set up the conditions under which your Job Manager will continue to add
    # tasks here. It could be a timeout or a maximum number of tasks, or you
    # could monitor tasks to act on task status.
    condition = True
    task_id = 0
    task_params = {
        "command_line": "/bin/bash -c 'echo hello world'",
        # Any other task parameters go here.
    }

    while condition:
        # Task IDs must be strings
        new_task = models.TaskAddParameter(id=str(task_id), **task_params)
        batch_client.task.add(AZ_JOB, new_task)
        task_id += 1

        # Perform any additional logic here - for example:
        #  - Check the status of the tasks, e.g. stdout, exit code etc
        #  - Process any output files for the tasks
        #  - Delete any completed tasks
        #  - Error handling for tasks that have failed
        time.sleep(300)  # Wait for 5 minutes (300 seconds)

    # The Job Manager task has completed - it will now exit and the job will
    # be marked as complete.

if __name__ == '__main__':
    main()
job_spec = batchmodels.JobSpecification(
    pool_info=pool_info,
    job_manager_task=batchmodels.JobManagerTask(
        id="JobManagerTask",
        # specify the command that needs to run recurrently
        command_line="/bin/bash -c \"python3 task.py\""
    ))
Add the task that you want to run recurrently as a JobManagerTask inside the JobSpecification, as shown above. This JobManagerTask will then run in every job that the schedule creates, i.e. recurrently.
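For completeness, this JobSpecification would then be attached to a schedule just as in the question (a sketch reusing the earlier names):
import datetime

schedule = batchmodels.Schedule(
    recurrence_interval=datetime.timedelta(minutes=5))

setup = batchmodels.JobScheduleAddParameter(
    id='python_test_schedule',
    schedule=schedule,
    job_specification=job_spec)

batch_client.job_schedule.add(setup)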
I have a complicated scenario I need to tackle.
I'm using Celery to run tasks in parallel; my tasks involve HTTP requests, and I'm planning to use Celery together with eventlet for that purpose.
Let me explain my scenario:
I have 2 tasks that can run in parallel and a third task that needs to work on the output of those 2 tasks, so I'm using a Celery group to run the 2 tasks and a Celery chain to pass their output to the third task when they finish.
Now it gets complicated: the third task needs to spawn multiple tasks that I would like to run in parallel, and I would like to collect all their outputs together and process them in another task.
So I created a group for the multiple tasks, together with a chain to process all the information.
I guess I'm missing basic information about Celery's concurrency primitives; I had a single Celery task that worked well, but I needed to make it faster.
This is a simplified sample of the code:
@app.task
def task2():
    return "aaaa"

@app.task
def task3():
    return "bbbb"

@app.task
def task4():
    work = group(...) | task5.s(...)
    work()

@app.task
def task1():
    tasks = [task2.s(a, b), task3.s(c, d)]
    work = group(tasks) | task4.s()
    return work()
This is how I start this operation:
task = task1.apply_async(kwargs=kwargs, queue='queue1')
I save task.id and poll the server every 30 seconds to see if results are available, by doing:
results = task1.AsyncResult(task_id)
if results.ready():
    res = results.get()
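For reference, the "run tasks in parallel, then feed all their results to one task" pattern described above is what Celery's chord primitive expresses directly, and it avoids calling a group synchronously from inside a task. A minimal sketch, reusing the task names from the question:
from celery import chord

# task2 and task3 run in parallel; task4 then receives a list of
# both results as its first argument
workflow = chord([task2.s(a, b), task3.s(c, d)])(task4.s())

# workflow.id can be saved and polled just like task.id above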