AppEngine Task Queue API Calls increase on TaskAlreadyExistsError - python

I'm using deferred to put tasks in the default queue in a AppEngine app similar to this approach.
Im naming the task with a timestamp that changes every 5 second. During that time a lot of calls are made to the queue with the same name resulting in a TaskAlreadyExistsError which is fine. The problem is when I check the quotas the "Task Queue API Calls" are increasing for every call made, not only those who actually are put in the queue.
I mean if you look at the quota: Task Queue API Calls: 34,017 of 100,000 and compare to the actual queue calls: /_ah/queue/deferred - 2.49K
Here is the code that handles the queue:
try:
deferred.defer(function_call, params, _name=task_name, _countdown=int(interval/2))
except (taskqueue.TaskAlreadyExistsError, taskqueue.TombstonedTaskError):
pass
I suppose that is the way it works. Is there a good way to solve the problem with the quota? Can I use memcache to store the task_name and check if the task has been added besides the above try/catch? Or is there a way to check if the task already exists without using Task Queue Api Calls?

Thanks for pointing out that this is the case, because I didn't realise, but the same problem must be affecting me.
As far as I can see yeah, throwing something into memcache comprised of the taskname should work fine, and then if you want to reduce those hits on memcache you can store the flag locally also within the instance.

The "good way" to solve the quota problem is eliminating destined-to-fail calls to Task Queue API.
_name you are using changes every 5 seconds which might not be a bottleneck if you increase the execution rate of your Task Queue. But you also add Tasks using _countdown.

Related

Concurrency within redis queue

I'm working with a django application hosted on heroku with redistogo addon:nano pack. I'm using rq, to execute tasks in the background - the tasks are initiated by online users. I've a constraint on increasing number of connections, limited resources I'm afraid.
I'm currently having a single worker running over 'n' number of queues. Each queue uses an instance of connection from the connection pool to handle 'n' different types of task. For instance, lets say if 4 users initiate same type of task, I would like to have my main worker create child processes dynamically, to handle it. Is there a way to achieve required multiprocessing and concurrency?
I tried with multiprocessing module, initially without introducing Lock(); but that exposes and overwrites user passed data to the initiating function, with the previous request data. After applying locks, it restricts second user to initiate the requests by returning a server error - 500
github link #1: Looks like the team is working on the PR; not yet released though!
github link #2: This post helps to explain creating more workers at runtime.
This solution however also overrides the data. The new request is again processed with the previous requests data.
Let me know if you need to see some code. I'll try to post a minimal reproducible snippet.
Any thoughts/suggestions/guidelines?
Did you get a chance to try AutoWorker?
Spawn RQ Workers automatically.
from autoworker import AutoWorker
aw = AutoWorker(queue='high', max_procs=6)
aw.work()
It makes use of multiprocessing with StrictRedis from redis module and following imports from rq
from rq.contrib.legacy import cleanup_ghosts
from rq.queue import Queue
from rq.worker import Worker, WorkerStatus
After looking under the hood, I realised Worker class is already implementing multiprocessing.
The work function internally calls execute_job(job, queue) which in turn as quoted in the module
Spawns a work horse to perform the actual work and passes it a job.
The worker will wait for the work horse and make sure it executes within the given timeout bounds,
or will end the work horse with SIGALRM.
The execute_job() funtion makes a call to fork_work_horse(job, queue) implicitly which spawns a work horse to perform the actual work and passes it a job as per the following logic:
def fork_work_horse(self, job, queue):
child_pid = os.fork()
os.environ['RQ_WORKER_ID'] = self.name
os.environ['RQ_JOB_ID'] = job.id
if child_pid == 0:
self.main_work_horse(job, queue)
else:
self._horse_pid = child_pid
self.procline('Forked {0} at {1}'.format(child_pid, time.time()))
The main_work_horse makes an internal call to perform_job(job, queue) which makes a few other calls to actually perform the job.
All the steps about The Worker Lifecycle mentioned over rq's official documentation page are taken care within these calls.
It's not the multiprocessing I was expecting, but I guess they have a way of doing things. However my original post is still not answered with this, also I'm still not sure about concurrency..
The documentation there still needs to be worked upon, since it hardly covers the true essence of this library!

Celery time limit setting by queue

I have a project using Celery and originally implementing one unique queue, which could cause some trouble.
So I want to implement several queues (which is done and works), but I would like to set different soft time limitd per queue. Actually the only things I found is time_limit as global setting for Celery, or setting it every time I decorate a task. First is a too generic solution, the second is not enough.
Thanks
During you queue definition you can set a time to live x-message-ttl on it.
Queue('test_queue', Exchange('default'), routing_key='test_queue', queue_arguments={'x-message-ttl': 86400000})

Task queue for deferred tasks in GAE with python

I'm sorry if this question has in fact been asked before. I've searched around quite a bit and found pieces of information here and there but nothing that completely helps me.
I am building an app on Google App engine in python, that lets a user upload a file, which is then being processed by a piece of python code, and then resulting processed file gets sent back to the user in an email.
At first I used a deferred task for this, which worked great. Over time I've come to realize that since the processing can take more than then 10 mins I have before I hit the DeadlineExceededError, I need to be more clever.
I therefore started to look into task queues, wanting to make a queue that processes the file in chunks, and then piece everything together at the end.
My present code for making the single deferred task look like this:
_=deferred.defer(transform_function,filename,from,to,email)
so that the transform_function code gets the values of filename, from, to and email and sets off to do the processing.
Could someone please enlighten me as to how I turn this into a linear chain of tasks that get acted on one after the other? I have read all documentation on Google app engine that I can think about, but they are unfortunately not written in enough detail in terms of actual pieces of code.
I see references to things like:
taskqueue.add(url='/worker', params={'key': key})
but since I don't have a url for my task, but rather a transform_function() implemented elsewhere, I don't see how this applies to me…
Many thanks!
You can just keep calling deferred to run your task when you get to the end of each phase.
Other queues just allow you to control the scheduling and rate, but work the same.
I track the elapsed time in the task, and when I get near the end of the processing window the code stops what it is doing, and calls defer for the next task in the chain or continues where it left off, depending if its a discrete set up steps or a continues chunk of work. This was all written back when tasks could only run for 60 seconds.
However the problem you will face (it doesn't matter if it's a normal task queue or deferred) is that each stage could fail for some reason, and then be re-run so each phase must be idempotent.
For long running chained tasks, I construct an entity in the datastore that holds the description of the work to be done and tracks the processing state for the job and then you can just keep rerunning the same task until completion. On completion it marks the job as complete.
To avoid the 10 minutes timeout you can direct the request to a backend or a B type module
using the "_target" param.
BTW, any reason you need to process the chunks sequentially? If all you need is some notification upon completion of all chunks (so you can "piece everything together at the end")
you can implement it in various ways (e.g. each deferred task for a chunk can decrease a shared datastore counter [read state, decrease and update all in the same transaction] that was initialized with the number of chunks. If the datastore update was successful and counter has reached zero you can proceed with combining all the pieces together.) An alternative for using deferred that would simplify the suggested workflow can be pipelines (https://code.google.com/p/appengine-pipeline/wiki/GettingStarted).

Celery Task Grouping/Aggregation

I'm planning to use Celery to handle sending push notifications and emails triggered by events from my primary server.
These tasks require opening a connection to an external server (GCM, APS, email server, etc). They can be processed one at a time, or handled in bulk with a single connection for much better performance.
Often there will be several instances of these tasks triggered separately in a short period of time. For example, in the space of a minute, there might be several dozen push notifications that need to go out to different users with different messages.
What's the best way of handling this in Celery? It seems like the naïve way is to simply have a different task for each message, but that requires opening a connection for each instance.
I was hoping there would be some sort of task aggregator allowing me to process e.g. 'all outstanding push notification tasks'.
Does such a thing exist? Is there a better way to go about it, for example like appending to an active task group?
Am I missing something?
Robert
I recently discovered and have implemented the celery.contrib.batches module in my project. In my opinion it is a nicer solution than Tommaso's answer, because you don't need an extra layer of storage.
Here is an example straight from the docs:
A click counter that flushes the buffer every 100 messages, or every
10 seconds. Does not do anything with the data, but can easily be
modified to store it in a database.
# Flush after 100 messages, or 10 seconds.
#app.task(base=Batches, flush_every=100, flush_interval=10)
def count_click(requests):
from collections import Counter
count = Counter(request.kwargs['url'] for request in requests)
for url, count in count.items():
print('>>> Clicks: {0} -> {1}'.format(url, count))
Be wary though, it works fine for my usage, but it mentions that is an "Experimental task class" in the documentation. This might deter some from using a feature with such a volatile description :)
An easy way to accomplish this is to write all the actions a task should take on a persistent storage (eg. database) and let a periodic job do the actual process in one batch (with a single connection).
Note: make sure you have some locking in place to prevent the queue from being processes twice!
There is a nice example on how to do something similar at kombu level (http://ask.github.com/celery/tutorials/clickcounter.html)
Personally I like the way sentry does something like this to batch increments at db level (sentry.buffers module)

Periodical tasks for each entity

I often have models that are a local copy of some remote resource, which needs to be periodically kept in sync.
Task(
url="/keep_in_sync",
params={'entity_id':entity_id},
name="sync-%s" % entity_id,
countdown=3600
).add()
Inside keep_in_sync any changes are saved to the model and a new task is scheduled to happen again later.
Now, while superficially this seems like a nice solution, in practice you might become worried if all the necessary tasks have really been added or not. Maybe you have entities representing the level of food pellets inside your hamster cages so that an automated email can be sent to your housekeeper to feed them. But then a few weeks later when you come back from your holiday, you find several of your hamsters starving.
It then starts seeming like a good idea to make a script that goes through each entity and makes sure that the proper task really is in the queue for it. But neither Task nor Queue classes have any method for checking if a task exists or not.
Can you save the hamsters and come up with a nicer way to make sure that a method really for sure is being periodically called for each entity?
Update
It seems that if you want to be really sure that tasks are scheduled, you need to keep track of your own tasks as Nick Johnson suggests. Not ready to let go of the convenient task queue, so for the time being will just tolerate the uncertainty of being unable to check if tasks are really scheduled or not.
Instead of enqueueing a task per entity, handle multiple entities in a single task. This can be triggered by a daily cron job, for instance, which fans out to multiple tasks. As well as ensuring you execute your code for each entity, you can also take advantage of asynchronous URLFetch to synchronize with the external resource more efficiently, and batch puts and gets from the datastore to make the updates more efficient.
You'll get an exception (TaskAlreadyExistsError) if there already such task in queue (same url and same params). So, don't worry, just all of them into queue, and remember to catch exceptions.
You can find full list of exceptions here: http://code.google.com/intl/en/appengine/docs/python/taskqueue/exceptions.html

Categories

Resources