I have two kinds of jobs: ones that I want to run in serial and ones that I want to run concurrently in parallel. However, I want the parallel jobs to get scheduled in serial (if you're still following). That is:
Do A.
Wait for A, do B.
Wait for B, do 2+ versions of C all concurrently.
My thought is to have two Redis queues: a serial_queue that has just one worker on it, and a parallel_queue which has multiple workers on it.
serial_queue.schedule(
    scheduled_time=datetime.utcnow(),
    func=job_a,
    ...)

serial_queue.schedule(
    scheduled_time=datetime.utcnow(),
    func=job_b,
    ...)

def parallel_c():
    for task in range(args.n_tasks):
        queue_concurrent.schedule(
            scheduled_time=datetime.utcnow(),
            func=job_c,
            ...)

serial_queue.schedule(
    scheduled_time=datetime.utcnow(),
    func=parallel_c,
    ...)
But this setup currently gives the error:
AttributeError: module '__main__' has no attribute 'schedule_fetch_tweets'. How can I package this function properly for python-rq?
The solution requires a bit of gymnastics, in that you have to import the current script as if it were an external module.
So, for instance, the contents of schedule_twitter_jobs.py would be:
from datetime import datetime

from redis import Redis
from rq_scheduler import Scheduler

import schedule_twitter_jobs
# we are importing the very module we are executing

import app.controller


def schedule_fetch_tweets(args, queue_name):
    '''This is the child job to schedule.'''
    # this scheduler is created based on the queue_name that is passed in
    concurrent_queue = Scheduler(queue_name=queue_name + '_concurrent', connection=Redis())
    for task in range(args.n_tasks):
        concurrent_queue.schedule(
            scheduled_time=datetime.utcnow(),
            func=app.controller.fetch_twitter_tweets,
            args=[args.statuses_backfill, fill_start_time])


serial_queue = Scheduler(queue_name='myqueue', connection=Redis())

# This is the first schedule.
serial_queue.schedule(
    scheduled_time=datetime.utcnow(),
    # now we have a fully-qualified reference to the function we need to schedule
    func=schedule_twitter_jobs.schedule_fetch_tweets,
    # pass the args through to the child schedule
    args=(args, queue_name))
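As a side note, a minimal sketch of how the two queues might then be serviced, assuming default Redis settings; each worker runs in its own process (one for the serial queue, several for the concurrent queue), alongside an rqscheduler process that moves the scheduled jobs onto the queues:

from redis import Redis
from rq import Worker

# a single worker process draining the serial queue
Worker(['myqueue'], connection=Redis()).work()

# run several copies of this in separate processes for the concurrent jobs:
# Worker(['myqueue_concurrent'], connection=Redis()).work()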
Recently I have started using ProcessPoolExecutor in Python to accelerate my processing.
So instead of doing:
list_of_res = []
for n in range(a_number):
    res = calculate_something(list_of_sources[n])
    list_of_res.append(res)
joint_results = pd.concat(list_of_res)
I do
with ProcessPoolExecutor(max_workers=8) as executor:
    joint_results = pd.concat(executor.map(calculate_something, list_of_sources))
It works great.
However, I've noticed that inside the calculate_something function I call the same function about 8 times, one after another, so I might as well apply a map to those calls instead of a loop.
My question is, can I apply multiprocessing to a function that is already being called in multiprocess?
Yes, you can have a worker process spawn another pool of workers, but it is not optimal.
Each time you launch a new process, it takes a few hundred milliseconds to a few seconds for that process to initialize and start executing work (depending on the OS, disk and code).
Launching a worker from a worker just wastes the overhead of spawning the first child to begin with; you are better off extracting the loop inside calculate_something and submitting its iterations directly to your initial executor.
A better approach is to launch your initial calculate_something using a ThreadPoolExecutor and have one shared ProcessPoolExecutor that all your thread workers push work into. This way you can limit the number of newly created processes and avoid creating and destroying far more workers than you actually need, and launching a thread pool takes only a few microseconds.
Here is an example of how to nest a thread pool and a process pool.
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def process_worker(n):
    print(n)
    return n

def thread_worker(list_of_n, process_pool: ProcessPoolExecutor):
    work_done = list(process_pool.map(process_worker, list_of_n))
    return work_done

if __name__ == "__main__":
    list_of_lists_of_n = [[1, 2, 3], [4, 5, 6]]
    with ProcessPoolExecutor() as process_pool, ThreadPoolExecutor() as threadpool:
        tasks = []
        work_done = []
        for item in list_of_lists_of_n:
            tasks.append(threadpool.submit(thread_worker, item, process_pool))
        for item in tasks:
            work_done.append(item.result())
    print(work_done)
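Applied to the pandas example from the question, the same layout might look roughly like the sketch below; calculate_part and the contents of list_of_sources are hypothetical stand-ins, and calculate_something is assumed to be adapted to accept the shared process pool:

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import pandas as pd

def calculate_part(item):
    # hypothetical stand-in for one of the ~8 inner calls
    return pd.DataFrame({"value": [item]})

def calculate_something(source, process_pool):
    # fan the inner calls out to the shared process pool instead of a serial loop
    parts = list(process_pool.map(calculate_part, source))
    return pd.concat(parts)

if __name__ == "__main__":
    list_of_sources = [[1, 2, 3], [4, 5, 6]]
    with ProcessPoolExecutor(max_workers=8) as process_pool, \
            ThreadPoolExecutor() as thread_pool:
        frames = thread_pool.map(
            lambda source: calculate_something(source, process_pool),
            list_of_sources)
        joint_results = pd.concat(frames)
    print(joint_results)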
I am using Celery to asynchronously perform a group of operations. There are a lot of these operations and each may take a long time, so rather than send the results back in the return value of the Celery worker function, I'd like to send them back one at a time as custom state updates. That way the caller can implement a progress bar with a change state callback, and the return value of the worker function can be of constant size rather than linear in the number of operations.
Here is a simple example in which I use the Celery worker function add_pairs_of_numbers to add a list of pairs of numbers, sending back a custom status update for every added pair.
#!/usr/bin/env python
"""
Run worker with:
    celery -A tasks worker --loglevel=info
"""
from celery import Celery

app = Celery("tasks", broker="pyamqp://guest@localhost//", backend="rpc://")

@app.task(bind=True)
def add_pairs_of_numbers(self, pairs):
    for x, y in pairs:
        self.update_state(state="SUM", meta={"x": x, "y": y, "x+y": x + y})
    return len(pairs)

def handle_message(message):
    if message["status"] == "SUM":
        x = message["result"]["x"]
        y = message["result"]["y"]
        print(f"Message: {x} + {y} = {x+y}")

def non_looping(*pairs):
    task = add_pairs_of_numbers.delay(pairs)
    result = task.get(on_message=handle_message)
    print(result)

def looping(*pairs):
    task = add_pairs_of_numbers.delay(pairs)
    print(task)
    while True:
        pass

if __name__ == "__main__":
    import sys
    if sys.argv[1:] and sys.argv[1] == "looping":
        looping((3, 4), (2, 7), (5, 5))
    else:
        non_looping((3, 4), (2, 7), (5, 5))
If you run just ./tasks.py it executes the non_looping function. This does the standard Celery thing: it makes a delayed call to the worker function and then uses get to wait for the result. The handle_message callback prints each message, and the number of pairs added is returned as the result. This is what I want.
$ ./tasks.py
Message: 3 + 4 = 7
Message: 2 + 7 = 9
Message: 5 + 5 = 10
3
Though the non-looping scenario is sufficient for this simple example, the real world task I'm trying to accomplish is processing a batch of files instead of adding pairs of numbers. Furthermore the client is a Flask REST API and therefore cannot contain any blocking get calls. In the script above I simulate this constraint with the looping function. This function starts the asynchronous Celery task, but does not wait for a response. (The infinite while loop that follows simulates the web server continuing to run and handle other requests.)
If you run the script with the argument "looping" it runs this code path. Here it immediately prints the Celery task ID then drops into the infinite loop.
$ ./tasks.py looping
a39c54d3-2946-4f4e-a465-4cc3adc6cbe5
The Celery worker logs show that the add operations are performed, but the caller doesn't define a callback function, so it never gets the results.
(I realize that this particular example is embarrassingly parallel, so I could use chunks to divide this up into multiple tasks. However, in my non-simplified real-world case I have tasks that cannot be parallelized.)
What I want is to be able to specify a callback in the looping scenario. Something like this.
def looping(*pairs):
    task = add_pairs_of_numbers.delay(pairs, callback=handle_message)  # There is no such callback.
    print(task)
    while True:
        pass
In the Celery documentation and all the examples I can find online (for example this), there is no way to define a callback function as part of the delay call or its apply_async equivalent. You can only specify one as part of a blocking get call. That makes me think this is an intentional design decision.
In my REST API scenario I can work around this by having the Celery worker process send a "status update" back to the Flask server in the form of an HTTP post, but this seems weird because I'm starting to replicate messaging logic in HTTP that already exists in Celery.
Is there any way to write my looping scenario so that the caller receives callbacks without making a blocking call, or is that explicitly forbidden in Celery?
It's a pattern that is not supported by Celery, although you can (somewhat) work around it by posting custom state updates from your task, as described here.
Use update_state() to update a task's state:
@app.task(bind=True)
def upload_files(self, filenames):
    for i, file in enumerate(filenames):
        if not self.request.called_directly:
            self.update_state(state='PROGRESS',
                              meta={'current': i, 'total': len(filenames)})
The reason that Celery does not support such a pattern is that task producers (callers) are strongly decoupled from task consumers (workers): the only communication channels between the two are the broker (producers to consumers) and the result backend (consumers to producers). The closest you can currently get is to poll the task state, or to write a custom result backend that lets you push events, for example via AMQP RPC or Redis subscriptions.
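For reference, a minimal polling sketch of the first option (polling the task state), assuming the tasks.py module from the question and that the Flask side has kept the task id returned by delay():

from celery.result import AsyncResult

from tasks import app  # the Celery app defined in tasks.py above

def poll_progress(task_id):
    # Non-blocking check the Flask API could perform per request.
    result = AsyncResult(task_id, app=app)
    if result.state == "SUM":        # the custom state set via update_state()
        return result.info           # the meta dict: {"x": ..., "y": ..., "x+y": ...}
    return {"status": result.state}  # e.g. PENDING, STARTED, SUCCESS, FAILURE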
My objective is to schedule an Azure Batch Task to run every 5 minutes from the moment it has been added, and I use the Python SDK to create/manage my Azure resources. I tried creating a Job-Schedule and it automatically created a new Job under the specified Pool.
job_spec = batch.models.JobSpecification(
    pool_info=batch.models.PoolInformation(pool_id=pool_id)
)

schedule = batch.models.Schedule(
    start_window=datetime.timedelta(hours=1),
    recurrence_interval=datetime.timedelta(minutes=5)
)

setup = batch.models.JobScheduleAddParameter(
    'python_test_schedule',
    schedule,
    job_spec
)

batch_client.job_schedule.add(setup)
I then added a task to this new Job. But the task seems to run only once, as soon as it is added (like a normal task). Is there something more I need to do to make the task run recurrently? There doesn't seem to be much documentation or many examples of JobSchedule either.
Thank you! Any help is appreciated.
You are correct that a JobSchedule will create a new job at the specified time interval. Additionally, you cannot have a task "re-run" every 5 minutes once it has completed. You could do either:
Have one task that runs a loop, performing the same action every 5 minutes.
Use a Job Manager to add a new task (that does the same thing) every 5 minutes.
I would probably recommend the 2nd option, as it has a little more flexibility to monitor the progress of the tasks and job and take actions accordingly.
An example client which creates the job might look a bit like this:
job_manager = models.JobManagerTask(
    id='job_manager',
    command_line="/bin/bash -c 'python ./job_manager.py'",
    environment_settings=[
        models.EnvironmentSetting('AZ_BATCH_KEY', AZ_BATCH_KEY)],
    resource_files=[
        models.ResourceFile(blob_sas="https://url/to/job_manager.py", file_name="job_manager.py")],
    authentication_token_settings=models.AuthenticationTokenSettings(
        access=[models.AccessScope.job]),
    kill_job_on_completion=True,  # This will mark the job as complete once the Job Manager has finished.
    run_exclusive=False)  # Whether the Job Manager needs a dedicated VM - this depends on the other tasks running on the VM.

new_job = models.JobAddParameter(
    id='my_job',
    job_manager_task=job_manager,
    pool_info=models.PoolInformation(pool_id='my_pool'))

batch_client.job.add(new_job)
Now we need a script to run as the Job Manager on the compute node. In this case I will use Python, so you will need to add a StartTask to your pool (or a JobPrepTask to the job) to install the azure-batch Python package.
Additionally the Job Manager Task will need to be able to authenticate against the Batch API. There are two methods of doing this depending on the scope of activities that the Job Manager will perform. If you only need to add tasks, then you can use the authentication_token_settings attribute, which will add an AAD token environment variable to the Job Manager task with permissions to ONLY access the current job. If you need permission to do other things, like alter the pool, or start new jobs, you can pass an account key via environment variable. Both options are shown above.
The script you run on the Job Manager task could look something like this:
import os
import time

from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
from azure.batch import models

# Batch account credentials
AZ_BATCH_ACCOUNT = os.environ['AZ_BATCH_ACCOUNT_NAME']
AZ_BATCH_KEY = os.environ['AZ_BATCH_KEY']
AZ_BATCH_ENDPOINT = os.environ['AZ_BATCH_ENDPOINT']
AZ_JOB = os.environ['AZ_BATCH_JOB_ID']  # the ID of the job this Job Manager is running in

# If you're using authentication_token_settings for authentication,
# you can use the AAD token in the environment variable AZ_BATCH_AUTHENTICATION_TOKEN instead.

def main():
    # Batch client
    creds = SharedKeyCredentials(AZ_BATCH_ACCOUNT, AZ_BATCH_KEY)
    batch_client = BatchServiceClient(creds, base_url=AZ_BATCH_ENDPOINT)

    # You can set up the conditions under which your Job Manager will continue to add tasks here.
    # It could be a timeout, a maximum number of tasks, or you could monitor tasks to act on task status.
    condition = True
    task_id = 0
    task_params = {
        "command_line": "/bin/bash -c 'echo hello world'",
        # Any other task parameters go here.
    }

    while condition:
        new_task = models.TaskAddParameter(id=str(task_id), **task_params)
        batch_client.task.add(AZ_JOB, new_task)
        task_id += 1

        # Perform any additional logic here - for example:
        #  - Check the status of the tasks, e.g. stdout, exit code, etc.
        #  - Process any output files for the tasks
        #  - Delete any completed tasks
        #  - Error handling for tasks that have failed

        time.sleep(300)  # Wait for 5 minutes (300 seconds)

    # The Job Manager task has completed - it will now exit and the job will be marked as complete.

if __name__ == '__main__':
    main()
job_spec = batchmodels.JobSpecification(
    pool_info=pool_info,
    job_manager_task=batchmodels.JobManagerTask(
        id="JobManagerTask",
        # specify the command that needs to run recurrently
        command_line="/bin/bash -c \"python3 task.py\""
    ))
Add the task that you want to run recurrently as a JobManagerTask inside the JobSpecification, as shown above. This JobManagerTask will then run every time the schedule creates a new job.
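For completeness, a minimal sketch of attaching this JobSpecification to the 5-minute schedule from the question; batchmodels, pool_info and batch_client are assumed to be set up as in the question, and the keyword names follow the azure-batch models (worth verifying against your SDK version):

import datetime

# Each recurrence creates a new job, whose JobManagerTask runs task.py again.
schedule = batchmodels.Schedule(
    recurrence_interval=datetime.timedelta(minutes=5))

job_schedule = batchmodels.JobScheduleAddParameter(
    id='recurring_task_schedule',
    schedule=schedule,
    job_specification=job_spec)

batch_client.job_schedule.add(job_schedule)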
I am currently creating a class which is supposed to execute some methods in a multi-threaded way, using the multiprocessing module. I execute the real computation using a Pool of n workers. Now I wanted to assign each of the currently n active workers an index between 0 and n for some other calculation. To do this, I wanted to use a shared Queue to assign an index in a way, that at every time no two workers have the same id. To share the same Queue inside the class between the different threads, I wanted to store it inside a Manager.Namespace(). But doing this, I got some problems with the Queue. Therefore, I created a minimal version of my problem and ended up with something like this:
from multiprocess import Process, Queue, Manager, Pool, cpu_count

class A(object):
    def __init__(self):
        manager = Manager()
        self.ns = manager.Namespace()
        self.ns.q = manager.Queue()

    def foo(self):
        for i in range(10):
            print(i)
            self.ns.q.put(i)
            print(self.ns.q.get())
            print(self.ns.q.qsize())

a = A()
a.foo()
In this code, execution stops before the second print statement; therefore I think no data is actually written to the Queue. When I remove the Namespace-related parts, the code works flawlessly. Is this the intended behaviour of the multiprocessing objects and am I doing something wrong? Or is this some kind of bug?
Yes, you should not use a Namespace here. When you put a Queue object into manager.Namespace(), each process gets a new Queue instance, and the writers/readers of those newly created queue objects have no connection with the parent process, so no message will be received by the worker processes. Share the Queue directly instead (see the sketch below).
By the way, you mention "thread" many times, but in the context of the multiprocess module a worker is a process, not a thread.
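A minimal sketch of that fix, mirroring the question's example but keeping the manager.Queue() proxy as a plain attribute instead of stashing it in a Namespace (shown here with the standard-library multiprocessing module):

from multiprocessing import Manager

class A(object):
    def __init__(self):
        self.manager = Manager()
        self.q = self.manager.Queue()  # share this proxy object directly

    def foo(self):
        for i in range(10):
            print(i)
            self.q.put(i)
            print(self.q.get())
            print(self.q.qsize())

if __name__ == '__main__':
    a = A()
    a.foo()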
Sorry... it seems that I'm asking a popular question, but I cannot find anything helpful for my case on Stack Overflow :P
So my code does the following things:
Step 1. The parent process writes Task objects into a multiprocessing.JoinableQueue.
Step 2. The child processes (more than one) read (get) the Task objects from the JoinableQueue and execute the tasks.
My module structure is:
A.py
    Class Task(object)
    Class WorkerPool(object)
    Class Worker(multiprocessing.Process)
        def run()  # here step 2 is executed
    Class TestGroup()
        def loadTest()  # here step 1 above is executed, i.e. appending the Task objects
What I understand is that when mp.JoinableQueue is used, the objects appended should be picklable; I got the meaning of "picklable" from https://docs.python.org/2/library/pickle.html#what-can-be-pickled-and-unpickled
My questions are:
1. Is the Task object picklable in my case? I got the error below when the code appends Task objects to the JoinableQueue:
File "/usr/lib/python2.6/multiprocessing/queues.py", line 242, in _feed
2014-06-23 03:18:43 INFO TestGroup: G1 End load test: object1
2014-06-23 03:18:43 INFO TestGroup: G1 End load test: object2
2014-06-23 03:18:43 INFO TestGroup: G1 End load test: object3
send(obj)
PicklingError: Can't pickle : attribute lookup pysphere.resources.VimService_services_types.DynamicData_Holder failed
2. What's the general usage of mp.JoinableQueue? In my case, I need to use join() and task_done().
3. When I choose to use Queue.Queue instead of mp.JoinableQueue, the pickling error is just gone. However, checking the logs, I found that all the child processes keep working on the first object in the Queue. What's the possible reason for this?
The multiprocessing module in Python starts multiple processes to run your tasks. Since processes do not share memory, they need to be able to communicate using serialized data. multiprocessing uses the pickle module to do the serialization, thus the requirement that the objects you are passing to the tasks be picklable.
1) Your Task object seems to contain an instance from pysphere.resources.VimService_services_types. This is probably a reference to a system resource, such as an open file or connection. It cannot be serialized and passed from one process to another, and therefore it causes the pickling error.
What you can do with mp.JoinableQueue is pass the arguments the task needs, and have the task construct the service itself so that it is local to that process.
For example:
queue = mp.JoinableQueue()
# not queue.put(task), since the new process will create the task
queue.put(task_args)
def f(task_args):
task = Task(task_args)
...
# you can't return the task, unless you've closed all non-serializable parts
return task.result
process = Process(target=f, args=(queue,))
...
2) Queue.Queue is meant for threading within a single process: the threads share memory, and the queue uses synchronization primitives to provide atomic operations. However, when you start a new process with multiprocessing, the child receives a copy of the parent's memory, so each child ends up with its own independent copy of the queue. That is why every child appears to keep working on the first object: each one is reading from its own copied queue rather than a shared one.
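On the general usage of mp.JoinableQueue with join() and task_done(), here is a minimal producer/consumer sketch, assuming the items put on the queue are plain, picklable argument tuples:

import multiprocessing as mp

def worker(queue):
    while True:
        task_args = queue.get()
        if task_args is None:              # sentinel: no more work
            queue.task_done()
            break
        print('processing', task_args)     # build the Task / service locally here
        queue.task_done()

if __name__ == '__main__':
    queue = mp.JoinableQueue()
    workers = [mp.Process(target=worker, args=(queue,)) for _ in range(2)]
    for w in workers:
        w.start()
    for args in [('a',), ('b',), ('c',)]:
        queue.put(args)
    for _ in workers:
        queue.put(None)                    # one sentinel per worker
    queue.join()                           # blocks until every put() has a matching task_done()
    for w in workers:
        w.join()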