If this is an idiotic question, I apologize and will go hide my head in shame, but:
I'm using rq to queue jobs in Python. I want it to work like this:
Job A starts. Job A grabs data via web API and stores it.
Job A runs.
Job A completes.
Upon completion of A, job B starts. Job B checks each record stored by job A and adds some additional response data.
Upon completion of job B, user gets a happy e-mail saying their report's ready.
My code so far:
from redis import Redis
from rq import Queue, Worker, use_connection

redis_conn = Redis()
use_connection(redis_conn)
q = Queue('normal', connection=redis_conn)  # this is terrible, I know - fixing later
w = Worker(q)
job = q.enqueue(getlinksmod.lsGet, theURL, total, domainid)
w.work()
I assumed my best solution was to have 2 workers, one for job A and one for B. The job B worker could monitor job A and, when job A was done, get started on job B.
What I can't figure out to save my life is how I get one worker to monitor the status of another. I can grab the job ID from job A with job.id. I can grab the worker name with w.name. But I haven't the foggiest idea how to pass any of that information to the other worker.
Or, is there a much simpler way to do this that I'm totally missing?
Update January 2015: this pull request is now merged, and the parameter has been renamed to depends_on, i.e.:
second_job = q.enqueue(email_customer, depends_on=first_job)
The original post left intact for people running older versions and such:
I have submitted a pull request (https://github.com/nvie/rq/pull/207) to handle job dependencies in RQ. When this pull request gets merged in, you'll be able to do:
def generate_report():
    pass

def email_customer():
    pass

first_job = q.enqueue(generate_report)
second_job = q.enqueue(email_customer, after=first_job)
# In the second enqueue call, the job is created,
# but only moved into the queue after first_job finishes
For now, I suggest writing a wrapper function to sequentially run your jobs. For example:
def generate_report():
    pass

def email_customer():
    pass

def generate_report_and_email():
    generate_report()
    email_customer()  # You can also enqueue this function, if you really want to
# Somewhere else
q.enqueue(generate_report_and_email)
From this page on the rq docs, it looks like each job object has a result attribute, accessible as job.result, which you can check. If the job hasn't finished, it'll be None, but if you make sure your job returns some value (even just "Done"), you can have your other worker check the result of the first job and begin working only once job.result has a value, meaning the first job has completed.
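A rough sketch of that polling approach, assuming job B is handed job A's id (the helper name and the polling interval are made up for illustration):

import time
from redis import Redis
from rq.job import Job

def wait_for_job_a(job_a_id, poll_interval=5):
    # Fetch job A by the id you grabbed with job.id, then poll its result.
    redis_conn = Redis()
    job_a = Job.fetch(job_a_id, connection=redis_conn)
    while job_a.result is None:      # None means job A hasn't finished yet
        time.sleep(poll_interval)
        job_a = Job.fetch(job_a_id, connection=redis_conn)  # re-read the job from Redis
    return job_a.result              # so make sure job A returns something, e.g. "Done"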
You are probably too deep into your project to switch, but if not, take a look at Twisted. http://twistedmatrix.com/trac/ I am using it right now for a project that hits APIs, scrapes web content, etc. It runs multiple jobs in parallel, as well as organizing certain jobs in order, so Job B doesn't execute until Job A is done.
This is the best tutorial for learning Twisted if you want to attempt it: http://krondo.com/?page_id=1327
Combine the things that job A and job B do into one function, and then use e.g. multiprocessing.Pool (its map_async method) to farm that out over different processes.
I'm not familiar with rq, but multiprocessing is a part of the standard library. By default it uses as many processes as your CPU has cores, which in my experience is usually enough to saturate the machine.
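A minimal sketch of that idea (all of the function names and URLs here are stand-ins, not from the question):

from multiprocessing import Pool

def fetch_and_store(url):
    # stand-in for job A: grab data via the web API and store it
    return {'url': url, 'data': 'fetched'}

def add_response_data(record):
    # stand-in for job B: add the additional response data to the record
    record['extra'] = 'enriched'
    return record

def do_job_a_and_b(url):
    # job A and job B combined for a single item
    return add_response_data(fetch_and_store(url))

if __name__ == '__main__':
    urls = ['http://example.com/1', 'http://example.com/2']
    pool = Pool()                               # one process per CPU core by default
    async_result = pool.map_async(do_job_a_and_b, urls)
    results = async_result.get()                # blocks until every item is done
    pool.close()
    pool.join()
    print(results)                              # e.g. send the happy e-mail here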
Related
I'm working with a Django application hosted on Heroku with the redistogo add-on (nano pack). I'm using rq to execute tasks in the background - the tasks are initiated by online users. I have a constraint on increasing the number of connections; limited resources, I'm afraid.
I currently have a single worker running over 'n' queues. Each queue uses a connection instance from the connection pool to handle 'n' different types of task. For instance, let's say 4 users initiate the same type of task: I would like my main worker to create child processes dynamically to handle them. Is there a way to achieve the required multiprocessing and concurrency?
I tried the multiprocessing module, initially without introducing Lock(), but that exposes and overwrites the data the user passed to the initiating function with the previous request's data. After applying locks, the second user is prevented from initiating a request and gets a server error (500).
github link #1: Looks like the team is working on the PR; not yet released though!
github link #2: This post helps to explain creating more workers at runtime.
However, this solution also overwrites the data: the new request is again processed with the previous request's data.
Let me know if you need to see some code. I'll try to post a minimal reproducible snippet.
Any thoughts/suggestions/guidelines?
Did you get a chance to try AutoWorker?
Spawn RQ Workers automatically.
from autoworker import AutoWorker
aw = AutoWorker(queue='high', max_procs=6)
aw.work()
It makes use of multiprocessing with StrictRedis from the redis module, plus the following imports from rq:
from rq.contrib.legacy import cleanup_ghosts
from rq.queue import Queue
from rq.worker import Worker, WorkerStatus
After looking under the hood, I realised that the Worker class is already implementing multiprocessing.
The work function internally calls execute_job(job, queue), which in turn, as quoted in the module,
Spawns a work horse to perform the actual work and passes it a job.
The worker will wait for the work horse and make sure it executes within the given timeout bounds,
or will end the work horse with SIGALRM.
The execute_job() function makes a call to fork_work_horse(job, queue), which spawns a work horse to perform the actual work and passes it a job, as per the following logic:
def fork_work_horse(self, job, queue):
    child_pid = os.fork()
    os.environ['RQ_WORKER_ID'] = self.name
    os.environ['RQ_JOB_ID'] = job.id
    if child_pid == 0:
        self.main_work_horse(job, queue)
    else:
        self._horse_pid = child_pid
        self.procline('Forked {0} at {1}'.format(child_pid, time.time()))
The main_work_horse makes an internal call to perform_job(job, queue) which makes a few other calls to actually perform the job.
All the steps of The Worker Lifecycle mentioned on rq's official documentation page are taken care of within these calls.
It's not the multiprocessing I was expecting, but I guess they have their way of doing things. However, my original post is still not answered by this, and I'm still not sure about concurrency.
The documentation there still needs to be worked on, since it hardly covers the true essence of this library!
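That said, a common way to approximate the concurrency I was after is simply to start several workers, each in its own process; a rough sketch (the queue name and process count are just examples, not from the rq docs or AutoWorker):

from multiprocessing import Process
from redis import Redis
from rq import Queue, Worker

def run_worker(queue_name='normal'):
    conn = Redis()                     # each worker process gets its own connection
    worker = Worker([Queue(queue_name, connection=conn)], connection=conn)
    worker.work()                      # blocks; each process handles one job at a time

if __name__ == '__main__':
    procs = [Process(target=run_worker) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()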
I've started a new Python 3 project whose goal is to download tweets and analyze them. As I'll be downloading tweets on different subjects, I want a pool of workers that download Twitter statuses matching the given keywords and store them in a database. I call these workers fetchers.
The other kind of worker is the analyzer, whose job is to analyze tweet contents and extract information from them, also storing the result in a database. As I'll be analyzing a lot of tweets, it would be a good idea to have a pool of these workers too.
I've been thinking in using RabbitMQ and Celery for this but I have some questions:
General question: Is really a good approach to solve this problem?
I need at least one fetcher worker per download task, and this could be running for a whole year (actually it's a 15-minute cycle that repeats for a year). Is it appropriate to define an "infinite" task?
I've been trying Celery and used delay to launch some example tasks. The thing is, I don't want to call the ready() method constantly to check whether a task is completed. Is it possible to define a callback? I'm not talking about a Celery task callback, just a function defined by myself. I've been searching for this and haven't found anything.
I want to have a single RabbitMQ + Celery server with workers in different networks. Is it possible to define remote workers?
Yeah, it looks like a good approach to me.
There is no such thing as an infinite task. You might reschedule a task to run once in a while. Celery has periodic tasks, so you can schedule a task to run at particular times. You don't necessarily need Celery for this; you could also use a cron job.
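For instance, a rough sketch of a periodic fetch task using Celery's beat scheduler (the app, task and keyword names are made up; older Celery versions use the CELERYBEAT_SCHEDULE setting instead of beat_schedule):

# fetcher.py
from celery import Celery

app = Celery('fetcher', broker='amqp://guest@localhost//')

@app.task(name='fetcher.fetch_tweets')
def fetch_tweets(keyword):
    pass  # download statuses matching the keyword and store them in the database

# Instead of one "infinite" task, run the fetcher every 15 minutes.
app.conf.beat_schedule = {
    'fetch-python-tweets-every-15-min': {
        'task': 'fetcher.fetch_tweets',
        'schedule': 15 * 60,          # seconds
        'args': ('python',),
    },
}

# Start a beat-enabled worker, e.g.:  celery -A fetcher worker -B --loglevel=info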
You can call a function once a task is successfully completed.
from celery.signals import task_success

@task_success.connect
def call_me_when_my_task_is_done(sender=None, result=None, **kwargs):
    if sender.name == 'task_i_am_waiting_to_complete':
        pass  # do whatever should happen once that task has finished
Yes, you can have remote workers on different networks.
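The workers only need to be able to reach the same broker; a minimal sketch (the host, credentials and project name are illustrative):

# proj.py, shared by every machine: the broker URL points at the central
# RabbitMQ server, so workers can run on any host that can reach it.
from celery import Celery

app = Celery('proj', broker='amqp://user:password@central-rabbitmq-host:5672//')

# On each remote machine, start a worker with:  celery -A proj worker --loglevel=info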
I have a Python Django project with two RabbitMQ workers, using the pika lib, which receive jobs to perform actions on a certain Django object specified in the request.
The thing is, I don't want the workers, A and B, to perform their actions on the same Django object, x, at the same time, as that might cause problems. It doesn't matter which worker goes first, but if A is working on x and B receives a job to work on x, I want this job to wait until A is done.
So the problem boils down to being able to know what the other worker is working on, and being able to pause a job until a certain time. Note that in my actual project I have more than 2 workers to which this must be applied; I chose two in my example to make it easier to dissect.
Thanks for the help,
Mattias
You will have to use some locking mechanism, perhaps based on a database.
1) When a worker is working on a Django object, it marks the work in the database. A MySQL example:

worker_id | object_id | task_type
----------+-----------+----------
       22 |        44 |         3   -- entry inserted to mark the work

2) When another worker picks up a Django object, it checks whether the object is already marked as in 1); if it is, it proceeds to pick the next item.
3) When a worker has finished working on an object, the lock row is deleted from the database or marked as FINISHED.
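A rough Django-flavoured sketch of that scheme, using a unique constraint as the lock (the model and field names are made up, not from the question):

# models.py (hypothetical lock table; the unique constraint enforces the lock)
from django.db import models

class WorkLock(models.Model):
    worker_id = models.IntegerField()
    object_id = models.IntegerField(unique=True)
    task_type = models.IntegerField()

# in the worker code
from django.db import IntegrityError

def try_to_lock(worker_id, object_id, task_type):
    # step 1: insert a row to mark the work; if the object is already marked, back off
    try:
        WorkLock.objects.create(worker_id=worker_id, object_id=object_id,
                                task_type=task_type)
        return True
    except IntegrityError:
        return False   # step 2: already locked, pick the next object instead

def release_lock(object_id):
    # step 3: delete the lock row (or mark it FINISHED) when the work is done
    WorkLock.objects.filter(object_id=object_id).delete()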
I'm new to Celery and I'm trying to understand if it can solve my problem.
I need to start a number of tasks (A1…An) and then run another task (B) after these are done. The problem is that the A tasks are added sequentially, and I don't want to wait for the last one to be added before starting the first one. Can I configure task B to execute after all the A tasks are done?
Now to the real scenario:
Task An - process a file uploaded by the user (added after each file is uploaded)
Task B - do something with the results of processing all the uploaded files
Alternative solutions are welcome as well
For sure you can do this; Celery canvas supports many options, including the behaviour you require of running a task after a group of tasks. It is called a "chord", e.g.:
from celery import chord
from tasks import task_upload1, task_upload2, task_upload3, final_execution
result = chord([task_upload1.s(), task_upload2.s(), task_upload3.s()])(final_execution.s())
get_required_result = result.get()
you can refer to this link for more details
With RabbitMQ you can get the exact behaviour you want using message acknowledgements and the aggregator pattern.
You start a worker that consumes messages (A) and does some work (processes a file uploaded by the user, in your case) but doesn't send an ack when it finishes. Instead it takes the next message from the queue, and if it's an A task again, it does the same thing. At some point it will receive task B and can then process all the previous A results at once and send acks for all of them.
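A rough sketch of that pattern with pika 1.x (the queue name, message format and helper functions are all made up):

import pika

def process_file(body):
    print('processing', body)    # stand-in for the per-file work (the A tasks)

def do_something_with_results():
    print('handling the results of all processed files')    # stand-in for task B

def on_message(ch, method, properties, body):
    if body.startswith(b'A:'):   # an A message: do the work but hold the ack back
        process_file(body)
    else:                        # the B message: aggregate, then ack everything at once
        do_something_with_results()
        # multiple=True acknowledges every unacked message up to this delivery tag
        ch.basic_ack(delivery_tag=method.delivery_tag, multiple=True)

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.basic_consume(queue='uploads', on_message_callback=on_message, auto_ack=False)
channel.start_consuming()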
Unfortunately, this scenario can't be done with Celery, because you have to specify all the A tasks and the final B task (chains, chords, callbacks, etc.) at creation time.
Alternatively, you can save the Task.id of each successful A task in a separate queue (not a Celery queue) and process those messages when executing task B. Celery can fit this algorithm.
Hi, I'm struggling with a problem. I created a number of crons and I want to run them one after another in a specific order. Let's say I have crons A, B, C and D, and I want to run cron B after completion of cron A, after that run cron D, and after that cron C. I searched for a way to accomplish this but could not find any. Can anyone help?
If you're using crons, then I'm guessing you've defined endpoints that the cron runner will call...
Use the cron runner to start task A, and let it add a task in the task queue to run B after it finishes. Repeat for B and C.
You can probably use the same endpoints that you used for the cron jobs.
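If this is on App Engine's cron + task queue (which the mention of a cron runner and endpoints suggests, and which the pull-queue answer below also assumes), a rough sketch of that chaining might look like this; the handler paths and helper functions are purely illustrative:

import webapp2
from google.appengine.api import taskqueue

def run_cron_a_work():
    pass  # stand-in for whatever cron A currently does

def run_cron_b_work():
    pass  # stand-in for whatever cron B currently does

class CronAHandler(webapp2.RequestHandler):
    def get(self):
        run_cron_a_work()
        taskqueue.add(url='/tasks/b')   # chain: enqueue B only after A has finished

class TaskBHandler(webapp2.RequestHandler):
    def post(self):                     # push-queue tasks arrive as POST requests
        run_cron_b_work()
        taskqueue.add(url='/tasks/d')   # then D, then C, chained the same way

app = webapp2.WSGIApplication([
    ('/cron/a', CronAHandler),          # cron.yaml only needs to point at this one
    ('/tasks/b', TaskBHandler),
])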
Though I agree with the suggestions in the comments, I think I have a better solution to your problem (hopefully :)).
Although it's not strictly necessary, you can use a pull queue in your application to simplify the design of your problem. The pattern I am suggesting is this:
1) A servlet (let's call it the controller) centrally handles the execution of the various tasks and is exposed at a URL.
2) The controller initiates a job by hitting that job's URL (assuming a pull queue again).
3) After the job completes, it hits the controller's URL to report its completion.
4) The controller in turn deletes the finished job from the queue and adds the next logical job to the queue.
And this is repeated.
In this case your job code is unchanged even if the logic of the sequence changes or new jobs are added; you might only need to change the controller.
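A minimal sketch of this controller idea, using Flask and requests in place of a servlet purely for illustration (all routes, URLs and the job order are made up):

import requests
from flask import Flask

app = Flask(__name__)

JOB_URLS = ['http://example.com/jobs/a',       # the required order: A, B, D, C
            'http://example.com/jobs/b',
            'http://example.com/jobs/d',
            'http://example.com/jobs/c']
state = {'index': 0}

@app.route('/controller/start')
def start():
    requests.post(JOB_URLS[0])                 # kick off the first job
    return 'started'

@app.route('/controller/job-done', methods=['POST'])
def job_done():
    # a job reports its completion here; start the next one in the sequence
    state['index'] += 1
    if state['index'] < len(JOB_URLS):
        requests.post(JOB_URLS[state['index']])
    return 'ok'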