Python: Celery inspect

I found some related questions, but nothing that describes exactly my problem. I can inspect my queue with code like this:
from celery.task.control import inspect
# inspect(['my_queue']), with a list instead of a str, should work!
i = inspect(['my_queue'])
print(i.active())      # get a list of active tasks
print(i.registered())  # get a list of registered tasks
print(i.scheduled())   # get a list of tasks waiting to run
print(i.reserved())    # tasks that have been received but are waiting to be executed
But somehow, on every second execution the method returns an empty task list. Sometimes I also get a connection reset error. Any ideas why this happens? Is there some kind of interval in which workers fill their active task lists, or something like that?

I assume the code you wrote above is not how the actual application looks (it can't work without a Celery object). The only explanation is that you have some connectivity issues; otherwise it should work every time you run it, unless there genuinely are no tasks to report. In other words, the cluster is idle.
Keep in mind that inspect broadcasts a message to all the workers and waits for their replies. If some of them time out for whatever reason, you will not see that worker in the output. If it happens that only that worker was busy, you may end up with an empty list of tasks.
Try calling something like celery -A yourproject.celeryapp status to see whether your workers are responsive, and if everything is OK, run your script. It should work.
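As a reference point, here is a minimal sketch of inspecting through a real Celery app object; the project name, broker URL and timeout below are illustrative assumptions, not part of the original question:
from celery import Celery

app = Celery('yourproject', broker='amqp://guest@localhost//')

# inspect() broadcasts to all workers and collects replies; a longer timeout
# keeps slow workers from being silently dropped from the result.
insp = app.control.inspect(timeout=5)

for worker, tasks in (insp.active() or {}).items():
    print(worker, 'has', len(tasks), 'active task(s)')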

Related

How to trigger the same build on many workers from one scheduler?

What I want is deceptively simple: the user presses a single button in a ForceScheduler and multiple build requests go out to a dozen workers simultaneously. Right now users run the same scheduler over and over with a different worker each time, which is repetitive.
I've tried setting multiple = True on WorkerChoiceParameter in the most straightforward manner, but this didn't help: only the first worker gets the build request. I've also seen that inside the Worker class there is a field hinting that this can be done: endpoints = [WorkerEndpoint, WorkersEndpoint]. Looking at WorkersEndpoint (note the "s"), I noticed that it tries to get a list of workers, so in theory this should be possible. But I cannot figure out how to tell Buildbot to use this second endpoint, or maybe I have misunderstood what this code does.

Skip logging Celery results

I have a small webapp written in Python using Flask. Some of my endpoints require long execution times (~60 s or more). The solution is to return task ids instantly while starting a Celery task in the background.
Everything works fine as it is. I have redirected Celery's logging to a file and that works great. The result a task returns is a huge data structure that will later be processed and potentially returned to the end user. However, I have a small issue with the logging of the results. When Celery finishes a task it also logs the task's result, which in my case is the aforementioned huge data structure. This makes the logfile harder to read and unnecessarily big.
Is it possible to log only that the task finished, its state, and the time it took?
Something like this:
[2017-02-06 15:12:01,286: INFO/PoolWorker-6] Task <task_name> succeeded in 60s
Not like this:
[2017-02-06 15:12:01,286: INFO/PoolWorker-6] Task <task_name> succeeded in 60s <very long string, potentially thousands of rows>
You can raise the Celery worker's log level above INFO with:
celery ... --loglevel ERROR
See the documentation for more.
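Note that raising the log level hides the success line entirely. A hedged alternative sketch (not from the original answer): truncate the logged result instead, assuming the success message is emitted by the celery.app.trace logger (adjust the name if your Celery version differs). Run this in a module that is imported inside the worker process, e.g. where your tasks are defined.
import logging

class TruncateResultFilter(logging.Filter):
    MAX_LEN = 120  # arbitrary cap on the logged message length, tune to taste

    def filter(self, record):
        message = record.getMessage()
        if len(message) > self.MAX_LEN:
            record.msg = message[:self.MAX_LEN] + '... [truncated]'
            record.args = ()
        return True

logging.getLogger('celery.app.trace').addFilter(TruncateResultFilter())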

Celery's inspect unstable behaviour

I have a Celery project with a RabbitMQ backend that relies heavily on inspecting scheduled tasks. I found that the following code returns nothing most of the time (even though there are scheduled tasks):
i = app.control.inspect()
scheduled = i.scheduled()
if scheduled:
    # do something
This code also runs from inside one of the tasks, but I don't think that matters; I get the same result from an interactive Python session (with some exceptions, see below).
At the same time, the celery -A <proj> inspect scheduled command never fails. I also noticed that, when called from an interactive Python session for the first time, the call never fails either. Most of the subsequent i.scheduled() calls return nothing.
Does i.scheduled() guarantee a result only when called for the first time?
If so, why, and how can I then inspect scheduled tasks from within a task? Run a dedicated worker and restart it after every task? That seems like overkill for such a trivial job.
Please explain how to use this feature the right way.
This is caused by some odd behaviour inside the Celery app. To call the Inspect object's methods repeatedly, you have to create a new Celery app instance each time.
Here is a small snippet that may help:
from celery import Celery

def inspect(method):
    app = Celery('app', broker='amqp://')
    return getattr(app.control.inspect(), method)()

print(inspect('scheduled'))
print(inspect('active'))

How can I run a task after other (already started) tasks are finished

I'm new to Celery and I'm trying to understand if it can solve my problem.
I need to start a number of tasks (An) and then run another task (B) after these are done. The problem is that the An tasks are added sequentially, and I don't want to wait for the last one to be added before starting the first one. Can I configure task B to execute after the An tasks are done?
Now to the real scenario:
Task An - process a file uploaded by the user (added after each file is uploaded)
Task B - do something with the results of processing all uploaded files
Alternative solutions are welcome as well
You can certainly do this; Celery's canvas supports many options, including the behaviour you require, running a task after a group of tasks. It is called a "chord", e.g.:
from celery import chord
from tasks import task_upload1, task_upload2, task_upload3, final_execution

# The header tasks run in parallel; final_execution runs once all of them have finished.
result = chord([task_upload1.s(), task_upload2.s(), task_upload3.s()])(final_execution.s())
get_required_result = result.get()
You can refer to the Celery canvas documentation for more details.
With RabbitMQ you can get the exact behaviour using message acknowledgements and the aggregator pattern.
You start a worker that consumes A messages and does some work (processing a file uploaded by the user, in your case), but doesn't send an ack when finished. Instead it takes the next message from the queue, and if it's an A task again, it does the same thing. At some point it will receive task B; it can then process all the previous A results at once and send an ack for all of them.
Unfortunately, this scenario can't be done with Celery, because you have to specify all the A tasks and the final B task (chains, chords, callbacks, etc.) at creation time.
Alternatively, you can save the task id of each successful A task in a separate queue (not a Celery queue) and process those messages when executing task B. Celery can fit this algorithm.
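A minimal sketch of that alternative, assuming a Redis list for the ids and an illustrative process_file task (none of these names come from the original answer):
from celery import Celery
from celery.result import AsyncResult
import redis

app = Celery('proj', broker='amqp://', backend='redis://')
store = redis.Redis()

@app.task
def process_file(path):
    ...  # task A: process one uploaded file and return its result

def enqueue_upload(path):
    # Called as each file arrives: start an A task and remember its id.
    result = process_file.delay(path)
    store.rpush('upload_task_ids', result.id)

def run_task_b():
    # Called once all uploads are in: wait for every A, then do B's work.
    ids = [raw.decode() for raw in store.lrange('upload_task_ids', 0, -1)]
    results = [AsyncResult(task_id, app=app).get() for task_id in ids]
    ...  # task B: do something with the combined results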

Python / rq - monitoring worker status

If this is an idiotic question, I apologize and will go hide my head in shame, but:
I'm using rq to queue jobs in Python. I want it to work like this:
Job A starts. Job A grabs data via web API and stores it.
Job A runs.
Job A completes.
Upon completion of A, job B starts. Job B checks each record stored by job A and adds some additional response data.
Upon completion of job B, user gets a happy e-mail saying their report's ready.
My code so far:
from redis import Redis
from rq import Queue, Worker, use_connection
import getlinksmod

redis_conn = Redis()
use_connection(redis_conn)
q = Queue('normal', connection=redis_conn)  # this is terrible, I know - fixing later
w = Worker(q)
job = q.enqueue(getlinksmod.lsGet, theURL, total, domainid)
w.work()
I assumed my best solution was to have 2 workers, one for job A and one for B. The job B worker could monitor job A and, when job A was done, get started on job B.
What I can't figure out, to save my life, is how to get one worker to monitor the status of another. I can grab the job ID from job A with job.id. I can grab the worker name with w.name. But I haven't the foggiest idea how to pass any of that information to the other worker.
Or, is there a much simpler way to do this that I'm totally missing?
Update January 2015: this pull request is now merged, and the parameter has been renamed to depends_on, i.e.:
second_job = q.enqueue(email_customer, depends_on=first_job)
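For completeness, a minimal runnable version using the renamed parameter (assuming generate_report and email_customer live in a module the worker can import; mytasks here is a hypothetical name):
from redis import Redis
from rq import Queue
from mytasks import generate_report, email_customer  # hypothetical module

q = Queue(connection=Redis())

first_job = q.enqueue(generate_report)
# Created immediately, but only placed on the queue once first_job finishes.
second_job = q.enqueue(email_customer, depends_on=first_job)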
The original post left intact for people running older versions and such:
I have submitted a pull request (https://github.com/nvie/rq/pull/207) to handle job dependencies in RQ. When this pull request gets merged in, you'll be able to do:
def generate_report():
    pass

def email_customer():
    pass

first_job = q.enqueue(generate_report)
second_job = q.enqueue(email_customer, after=first_job)
# In the second enqueue call, the job is created,
# but only moved into the queue after first_job finishes
For now, I suggest writing a wrapper function to sequentially run your jobs. For example:
def generate_report():
    pass

def email_customer():
    pass

def generate_report_and_email():
    generate_report()
    email_customer()  # You can also enqueue this function, if you really want to

# Somewhere else
q.enqueue(generate_report_and_email)
From this page of the rq docs, it looks like each job object has a result attribute, accessible as job.result, which you can check. If the job hasn't finished it will be None, but if you make sure your job returns some value (even just "Done"), you can have your other worker check the first job's result and begin working only once job.result has a value, meaning the first worker has completed.
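A rough polling sketch of that idea; the one-second sleep and the mytasks module are assumptions for illustration:
import time
from redis import Redis
from rq import Queue
from rq.job import Job
from mytasks import generate_report, email_customer  # hypothetical module

q = Queue(connection=Redis())
first_job = q.enqueue(generate_report)

# Re-fetch the job each time so we see the latest state from Redis.
while Job.fetch(first_job.id, connection=q.connection).result is None:
    time.sleep(1)  # still running (note: a failed job also has result None)

q.enqueue(email_customer)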
You are probably too deep into your project to switch, but if not, take a look at Twisted: http://twistedmatrix.com/trac/ I am using it right now for a project that hits APIs, scrapes web content, etc. It runs multiple jobs in parallel and can also run certain jobs in order, so job B doesn't execute until job A is done.
This is the best tutorial for learning Twisted if you want to give it a try: http://krondo.com/?page_id=1327
Combine the things that job A and job B do into one function, and then use e.g. multiprocessing.Pool (its map_async method) to farm that out over different processes.
I'm not familiar with rq, but multiprocessing is part of the standard library. By default it uses as many processes as your CPU has cores, which in my experience is usually enough to saturate the machine.
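A minimal sketch of that suggestion, with job_a and job_b as stand-ins for the real work:
from multiprocessing import Pool

def job_a(item):
    ...  # grab data via the web API for one item and store it

def job_b(results):
    ...  # post-process everything the job_a calls produced

if __name__ == '__main__':
    items = ['url1', 'url2', 'url3']  # illustrative inputs
    with Pool() as pool:
        async_result = pool.map_async(job_a, items)
        job_b(async_result.get())  # .get() blocks until every job_a call has finished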
