I'm writing a data processing pipeline using Celery because it speeds things up considerably.
Consider the following pseudo-code:
from celery.result import ResultSet
from some_celery_app import processing_task  # a task decorated with @app.task

def crunch_data():
    results = ResultSet([])
    for document in mongo.find():  # around 100K - 1M documents
        job = processing_task.delay(document)
        results.add(job)
    return results.get()

collected_data = crunch_data()
# Do some stuff with this collected data
I successfully spawn four workers with concurrency enabled, and when I run this script the data is processed correctly and I can do whatever I want with it. I'm using RabbitMQ as the message broker and rpc as the result backend.
What I see when I open the RabbitMQ management UI is this: first, all the documents are processed; then, and only then, are the results retrieved by the collective results.get() call.
My question: Is there a way to do the processing and subsequent retrieval simultaneously? In my case, as all documents are atomic entities that do not rely on each other, there seems to be no need to wait for the job to be processed completely.
You could try the callback parameter of ResultSet.get(callback=cbResult) and then process each result in the callback as it arrives:

def cbResult(task_id, value):
    print(value)

results.get(callback=cbResult)
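Put together with the original pseudo-code, a minimal sketch might look like the following (handle_result is a hypothetical handler; when a callback is supplied, process each value inside the callback rather than relying on the return value of get()):

from celery.result import ResultSet
from some_celery_app import processing_task

def handle_result(task_id, value):
    # hypothetical handler: invoked as each task's result arrives,
    # so retrieval and post-processing overlap with the remaining work
    print(task_id, value)

def crunch_data():
    results = ResultSet([])
    for document in mongo.find():
        results.add(processing_task.delay(document))
    # get() still blocks until every task has finished, but the callback
    # fires once per result instead of waiting for the whole batch
    results.get(callback=handle_result)

crunch_data()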
I'm currently leveraging Celery for periodic tasks. I am new to Celery. I have two workers running two different queues: one for slow background jobs and one for jobs users queue up in the application.
I am monitoring my tasks on Datadog because it's an easy way to confirm my workers are running appropriately.
What I want to do is record, after each task completes, which queue the task was completed on.
from celery.signals import after_task_publish

@after_task_publish.connect
def on_task_publish(sender=None, headers=None, body=None, **kwargs):
    statsd.increment("celery.on_task_publish.start.increment")
    task = celery.tasks.get(sender)
    queue_name = task.queue
    statsd.increment("celery.on_task_publish.increment", tags=[f"{queue_name}:{task}"])
The function above is something I implemented after researching the Celery docs and some StackOverflow posts, but it's not working as intended: I get the first statsd increment, but the remaining code does not execute.
I am wondering if there is a simpler way to inspect, inside or after each task completes, which queue processed the task.
Since your question asks whether there is a way to inspect inside/after each task completes, I'm assuming you haven't tried Celery's result backend. You could check out this feature, which is provided by Celery itself: the Celery result backend (task result backend).
It is very useful for storing the results of your Celery tasks.
Read through this => https://docs.celeryproject.org/en/stable/userguide/configuration.html#task-result-backend-settings
Once you get an idea of how to set up this result backend, search for the result_extended key (on the same page) to be able to add queue names to your task results.
A number of options are available; you can set these results up to go to any of these: a SQL DB / a NoSQL DB / S3 / Azure / Elasticsearch / etc.
I have made use of this result-backend feature with Elasticsearch, and that is how my task results are stored.
It is just a matter of adding a few configuration entries to the settings.py file as per your requirements. It worked really well for my application. I also have a weekly cron that clears only the successful task results, since we don't need those results anymore, so I see only the failed results (like the one in the image).
These were the main keys for my requirement: task_track_started and task_acks_late, along with result_backend (a configuration sketch follows below).
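A minimal sketch of what such a configuration could look like, assuming an Elasticsearch result backend; the host and index names below are placeholders, not values from the original setup:

# Celery configuration (e.g. in settings.py) -- hedged example, adjust to your setup
result_backend = "elasticsearch://example-es-host:9200/celery_results/task"  # placeholder URL
result_extended = True      # store extra metadata such as queue and worker name
task_track_started = True   # record the STARTED state as well
task_acks_late = True       # acknowledge messages only after the task finishes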
I have a Django application that uses large in-memory data structures (due to performance constraints). This wouldn't be a problem, but I'm using Heroku, where if the Python web process takes more than 30 s to start, it is stopped because it's considered a timeout error. Because of the aforementioned problem, I've used a daemon process (a worker in Heroku) to handle the construction of the data structures and Redis to handle the message passing between processes.
When the worker finishes (approx. 1 minute), it stores the data structures (50 MB or so) in Redis.
And now comes the crux of the matter... Django follows the request/response paradigm and is synchronous. This implies a Django view should exist to handle the callback from the worker announcing it's done. Even if I use something fancier like Redis pub/sub, I'm still forced to evaluate the queue populated by a publisher in a view.
How can I circumvent the necessity of using a Django view? Isn't there an async way of doing this?
Below is the solution where I use a pub/sub inside a view. This seems bad, but I can't think of another way.
views.py
...
# data_handler can enqueue tasks on the default queue
data_handler = DataHandler()
strict_redis = redis.from_url(settings.DEFAULT_QUEUE)
pub_sub = strict_redis.pubsub()

# this puts the job of constructing the large data structures
# on the default queue so a worker can pick it up. Being async,
# it returns with an empty set of data structures.
data_structures = data_handler.start()

pub_sub.subscribe(settings.FINISHED_DATA_STRUCTURES_CHANNEL)

@require_http_methods(['POST'])
def store_and_fetch(request):
    user_data = json.loads(request.body.decode('utf8'))
    message = pub_sub.get_message()
    if message:
        command = message['data'] if 'data' in message else ''
        if command == settings.FINISHED_DATA_STRUCTURES_INIT.encode('utf-8'):
            # this takes the data from redis and updates data_structures
            data_handler.update(data_structures)
    return HttpResponse(compute_response(user_data, data_structures))
Update: After working with this for several months, I can now say it's definitely better (and wiser) NOT to fiddle with Django's request/response cycle. There are tools like Django RQ Scheduler or Celery that handle async tasks just fine. If you want to update the main web process after some repeatable job completes, it's simpler to use something like the Python requests package and send a POST to the web process from the worker that ran the scheduled job. This way we don't circumvent Django's mechanisms and, more importantly, it's simpler overall (a rough sketch follows below).
Regarding the Heroku constraints I mentioned at the beginning of the post: at the time I wrote this question I was quite a newbie with Heroku and didn't know much about the release phase. In the release phase we can set up all the complex logic we need for the main process. At the end of the release phase we simply notify the web process, in the manner described above, and use some distributed memory buffer (even Redis will work just fine).
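As a rough illustration of that worker-to-web notification, here is a minimal sketch; the URL, endpoint, and header are hypothetical placeholders, not part of the original setup:

# worker side -- hedged sketch: notify the web process once the job is done
import requests

def notify_web_process(result_key):
    # hypothetical internal endpoint exposed by the Django app
    requests.post(
        "https://example-app.herokuapp.com/internal/data-ready/",  # placeholder URL
        json={"result_key": result_key},
        headers={"X-Internal-Token": "change-me"},  # placeholder shared secret
        timeout=10,
    )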
I've been working off of Google Cloud Platform's Python API library. I've had much success with these API samples out-of-the-box, but I'd like to streamline it a bit further by combining the three queries I need to run (and subsequent tables that will be created) into a single file. Although the documentation mentions being able to run multiple jobs asynchronously, I've been having trouble figuring out the best way to accomplish that.
Thanks in advance!
The idea of running multiple jobs asynchronously is to create/prepare as many jobs as you need and kick them off using the jobs.insert API (importantly, you should either collect all the respective job IDs or set your own - they just need to be unique). Those API calls return immediately, so you can kick them all off "very quickly" in one loop.
Meanwhile, you need to check the status of those jobs repeatedly (in a loop), and as soon as a job is done you can kick off the processing of its result as needed.
You can check the details in Running asynchronous queries; a rough sketch of this pattern follows.
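A rough sketch of that pattern with the google-cloud-bigquery Python client; the statements and job-ID prefix are placeholders standing in for the three real queries:

import time
import uuid

from google.cloud import bigquery

client = bigquery.Client()

# placeholder statements standing in for the three real queries
statements = {
    "table_a": "SELECT 1 AS x",
    "table_b": "SELECT 2 AS x",
    "table_c": "SELECT 3 AS x",
}

# kick off all jobs; client.query() returns immediately with a QueryJob
jobs = {
    name: client.query(sql, job_id=f"demo-{name}-{uuid.uuid4()}")
    for name, sql in statements.items()
}

# poll until every job reports completion, then handle each result
pending = dict(jobs)
while pending:
    for name, job in list(pending.items()):
        if job.done():           # checks job state without waiting for completion
            rows = job.result()  # the job is already finished, so this returns quickly
            print(name, list(rows))
            del pending[name]
    time.sleep(1)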
BigQuery jobs are always async by default; that being said, requesting the result of the operation isn't. As of Q4 2021, the Python API does not support a proper async way to collect results. Each call to job.result() blocks the thread, making it impossible to use with a single-threaded event loop like asyncio. Thus, the best way to collect multiple job results is by using multithreading:
from typing import Dict
from concurrent.futures import ThreadPoolExecutor

from google.cloud import bigquery

client: bigquery.Client = bigquery.Client()

def run(name, statement):
    return name, client.query(statement).result()  # blocks the thread

def run_all(statements: Dict[str, str]):
    with ThreadPoolExecutor() as executor:
        jobs = []
        for name, statement in statements.items():
            jobs.append(executor.submit(run, name, statement))
        result = dict([job.result() for job in jobs])
    return result
P.S.: Some credits are due to @Fredrik Håård for this answer :)
I'm writing a producer/consumer to suit my needs at work.
Generally there's a producer thread which fetches some log data from a remote server and puts it in a queue, and one or more consumer threads which read data from the queue and do some work. After that, both the data and the result need to be saved (e.g. in an sqlite3 db) for later analysis.
To make sure that each piece of log is processed only once, every time before consuming the data I have to query the database to see if it has already been processed. I wonder if there is a better way to accomplish this. If there is more than one consumer thread, database locking seems to be a problem.
Code relevant:
import Queue
import threading
import time

import requests

out_queue = Queue.Queue()

class ProducerThread(threading.Thread):
    def __init__(self, out_queue):
        threading.Thread.__init__(self)
        self.out_queue = out_queue

    def run(self):
        while True:
            # Read remote log and put chunk in out_queue
            resp = requests.get("http://example.com")
            # place chunk into out queue and sleep for some time.
            self.out_queue.put(resp)
            time.sleep(10)

class ConsumerThread(threading.Thread):
    def __init__(self, out_queue):
        threading.Thread.__init__(self)
        self.out_queue = out_queue

    def run(self):
        while True:
            # consume the data.
            chunk = self.out_queue.get()
            # check whether chunk has been consumed before. query the database.
            flag = query_database(chunk)
            if not flag:
                do_something_with(chunk)
                # signals to queue job is done
                self.out_queue.task_done()
                # persist the data and other info; insert into the database.
                data_persist()
            else:
                print("data has been consumed before.")

def main():
    # just one producer thread.
    t = ProducerThread(out_queue)
    t.setDaemon(True)
    t.start()

    for i in range(3):
        ct = ConsumerThread(out_queue)
        ct.setDaemon(True)
        ct.start()

    # wait on the queue until everything has been processed
    out_queue.join()

main()
If the logs read from the remote server are not duplicated/repeated, then there is no need to check whether the logs have been processed multiple times, as the Queue class implements all the required locking semantics and thus Queue.get() ensures a specific item can only be got by one ConsumerThread.
If the logs could be duplicated (I guess not), then you should do the checking in ProducerThread (before adding the logs to the queue) rather than in ConsumerThread. This way, you don't need to consider locking; a small sketch of that producer-side check follows.
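A minimal sketch of producer-side de-duplication, assuming a chunk exposes some stable identifier (chunk_id_of is a hypothetical helper):

# hedged sketch: de-duplicate inside the producer, before enqueueing
seen_ids = set()  # only the single producer thread touches this, so no locking is needed

def enqueue_if_new(chunk, out_queue):
    chunk_id = chunk_id_of(chunk)  # hypothetical: derive a stable id/hash of the chunk
    if chunk_id in seen_ids:
        return  # duplicate log, skip it
    seen_ids.add(chunk_id)
    out_queue.put(chunk)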
Update based on @dofine's confirmation of my understanding of the requirements in the comments below:
For points #2 and #3, you may need a lightweight persistent queue such as FifoDiskQueue from queuelib. To be honest, I haven't used this lib before, but I think it should work for you; please check it out (a small usage sketch follows after this list).
For point #1, I guess you can achieve it by using any (non-memory) database, in combination with another FifoDiskQueue:
- the second queue serves the purpose of immediately re-queueing a log if it fails to be processed by one consumer thread; please see my first comment below for the idea
- there is a single table in the db; the producer thread only ever adds new records to it and never updates any, while the consumer threads only update the records they have picked from the queue
- with the above logic, you should never need to lock the table
- on application startup (prior to starting the consumers), you may have the producer query the db for logs whose processing was "lost" due to the application's unexpected termination
This update was typed on mobile SO, so it is kind of inconvenient to extend it. If needed, I will update again when I get a chance.
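For reference, a minimal sketch of queuelib's FifoDiskQueue, assuming log chunks can be serialized to bytes; the file path and handle() are placeholders:

from queuelib import FifoDiskQueue

# persistent FIFO queue backed by a file on disk, so items survive process restarts
q = FifoDiskQueue("pending_chunks.fifo")  # placeholder path

q.push(b"serialized log chunk")  # push() and pop() operate on bytes
chunk = q.pop()                  # returns None when the queue is empty
if chunk is not None:
    handle(chunk)                # hypothetical processing function
q.close()                        # flushes the queue state back to disk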
If this is an idiotic question, I apologize and will go hide my head in shame, but:
I'm using rq to queue jobs in Python. I want it to work like this:
Job A starts. Job A grabs data via web API and stores it.
Job A runs.
Job A completes.
Upon completion of A, job B starts. Job B checks each record stored by job A and adds some additional response data.
Upon completion of job B, user gets a happy e-mail saying their report's ready.
My code so far:
from redis import Redis
from rq import Queue, Worker, use_connection

import getlinksmod

redis_conn = Redis()
use_connection(redis_conn)

q = Queue('normal', connection=redis_conn)  # this is terrible, I know - fixing later
w = Worker(q)
job = q.enqueue(getlinksmod.lsGet, theURL, total, domainid)
w.work()
I assumed my best solution was to have 2 workers, one for job A and one for B. The job B worker could monitor job A and, when job A was done, get started on job B.
What I can't figure out to save my life is how to get one worker to monitor the status of another. I can grab the job ID from job A with job.id. I can grab the worker name with w.name. But I haven't the foggiest idea how to pass any of that information to the other worker.
Or, is there a much simpler way to do this that I'm totally missing?
Update January 2015: this pull request is now merged, and the parameter has been renamed to depends_on, i.e.:
second_job = q.enqueue(email_customer, depends_on=first_job)
The original post left intact for people running older versions and such:
I have submitted a pull request (https://github.com/nvie/rq/pull/207) to handle job dependencies in RQ. When this pull request gets merged in, you'll be able to do:
def generate_report():
    pass

def email_customer():
    pass

first_job = q.enqueue(generate_report)
second_job = q.enqueue(email_customer, after=first_job)
# In the second enqueue call, the job is created,
# but only moved into the queue after first_job finishes
For now, I suggest writing a wrapper function to sequentially run your jobs. For example:
def generate_report():
    pass

def email_customer():
    pass

def generate_report_and_email():
    generate_report()
    email_customer()  # You can also enqueue this function, if you really want to

# Somewhere else
q.enqueue(generate_report_and_email)
From this page of the rq docs, it looks like each job object has a result attribute, accessible as job.result, which you can check. If the job hasn't finished, it will be None, but if you ensure that your job returns some value (even just "Done"), then you can have your other worker check the result of the first job and begin working only when job.result has a value, meaning the first worker has completed. A rough polling sketch follows.
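A rough sketch of that polling approach, assuming the job ID of job A is known; fetch_and_enrich is a hypothetical stand-in for job B's work:

import time

from redis import Redis
from rq.job import Job

redis_conn = Redis()

def wait_then_run_job_b(job_a_id):
    job_a = Job.fetch(job_a_id, connection=redis_conn)
    # poll until job A has returned a value (e.g. "Done")
    while job_a.result is None:
        time.sleep(5)
        job_a.refresh()  # reload the job's state from Redis
    fetch_and_enrich()   # hypothetical: job B's work starts here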
You are probably too deep into your project to switch, but if not, take a look at Twisted. http://twistedmatrix.com/trac/ I am using it right now for a project that hits APIs, scrapes web content, etc. It runs multiple jobs in parallel, as well as organizing certain jobs in order, so Job B doesn't execute until Job A is done.
This is the best tutorial for learning Twisted if you want to attempt. http://krondo.com/?page_id=1327
Combine the things that job A and job B do into one function, and then use e.g. multiprocessing.Pool (its map_async method) to farm that out over different processes; a small sketch follows.
I'm not familiar with rq, but multiprocessing is part of the standard library. By default it uses as many processes as your CPU has cores, which in my experience is usually enough to saturate the machine.
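A small sketch of that idea; do_job_a and do_job_b are hypothetical stand-ins for the two steps described in the question:

from multiprocessing import Pool

def do_job_a(record_id):
    # hypothetical stand-in: grab data via the web API and store it
    return {"id": record_id, "data": "fetched"}

def do_job_b(record):
    # hypothetical stand-in: add the extra response data
    record["extra"] = "enriched"
    return record

def do_both(record_id):
    return do_job_b(do_job_a(record_id))

if __name__ == "__main__":
    record_ids = range(100)  # placeholder work items
    with Pool() as pool:     # defaults to one process per CPU core
        async_result = pool.map_async(do_both, record_ids)
        print(async_result.get())  # blocks until every item is processed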