I often have models that are a local copy of some remote resource, which needs to be periodically kept in sync.
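from google.appengine.api.taskqueue import Task

# Named task, so at most one "sync" task per entity can be pending.
Task(
    url="/keep_in_sync",
    params={'entity_id': entity_id},
    name="sync-%s" % entity_id,
    countdown=3600,  # run again in one hour
).add()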
Inside keep_in_sync any changes are saved to the model and a new task is scheduled to happen again later.
Now, while superficially this seems like a nice solution, in practice you might start to worry about whether all the necessary tasks have really been added. Maybe you have entities representing the level of food pellets inside your hamster cages so that an automated email can be sent to your housekeeper to feed them. But then a few weeks later, when you come back from your holiday, you find several of your hamsters starving.
It then starts seeming like a good idea to make a script that goes through each entity and makes sure that the proper task really is in the queue for it. But neither Task nor Queue classes have any method for checking if a task exists or not.
Can you save the hamsters and come up with a nicer way to make sure that a method really is being called periodically for each entity?
Update
It seems that if you want to be really sure that tasks are scheduled, you need to keep track of your own tasks, as Nick Johnson suggests. I'm not ready to let go of the convenient task queue, so for the time being I will just tolerate the uncertainty of being unable to check whether tasks are really scheduled.
Instead of enqueueing a task per entity, handle multiple entities in a single task. This can be triggered by a daily cron job, for instance, which fans out to multiple tasks. As well as ensuring you execute your code for each entity, you can also take advantage of asynchronous URLFetch to synchronize with the external resource more efficiently, and batch puts and gets from the datastore to make the updates more efficient.
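The fan-out described above could look roughly like the sketch below. It is not App Engine code: enqueue is a hypothetical stand-in for whatever adds a task (for example a wrapper around taskqueue.add), and the batch size of 50 is an arbitrary assumption.

```python
def chunked(ids, size):
    """Yield successive lists of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fan_out(entity_ids, enqueue, batch_size=50):
    """Enqueue one task per batch of entities; return the number of batches.

    `enqueue` is a hypothetical callable that schedules one worker task
    for the given list of entity ids.
    """
    count = 0
    for batch in chunked(list(entity_ids), batch_size):
        enqueue(batch)
        count += 1
    return count
```

The cron handler would call fan_out with all entity ids, and each worker task would then fetch, sync and put its batch in one go.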
You'll get an exception (TaskAlreadyExistsError) if there is already a task with the same name in the queue. So don't worry: just add all of them to the queue, and remember to catch the exception.
You can find the full list of exceptions here: http://code.google.com/intl/en/appengine/docs/python/taskqueue/exceptions.html
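A minimal sketch of that "add everything and catch the exception" pattern follows. To keep it runnable outside App Engine, TaskAlreadyExistsError and FakeQueue here are local stand-ins (assumptions, not the real taskqueue classes); on App Engine you would catch taskqueue.TaskAlreadyExistsError around the real add() call instead.

```python
class TaskAlreadyExistsError(Exception):
    """Local stand-in for taskqueue.TaskAlreadyExistsError."""


class FakeQueue(object):
    """In-memory stand-in for a queue that rejects duplicate task names."""

    def __init__(self):
        self.names = set()

    def add(self, name):
        if name in self.names:
            raise TaskAlreadyExistsError(name)
        self.names.add(name)


def ensure_scheduled(queue, entity_ids):
    """Try to add a named task per entity; return the ids newly scheduled."""
    added = []
    for entity_id in entity_ids:
        try:
            queue.add("sync-%s" % entity_id)
        except TaskAlreadyExistsError:
            continue  # already queued, nothing to do
        added.append(entity_id)
    return added
```

Running ensure_scheduled over every entity from a periodic script only fills in the tasks that are missing.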
When I add a task to a queue, I give it a unique name, like longprocess-{id}-{timestamp}. ID is the id in the database of the entry to work on, and the timestamp ensures I don't have colliding names in the queue.
The issue is that the user can stop/resume the longprocess if he wants, so in the stop request I'd like to list all tasks whose names start with longprocess-1 (for {id} = 1) and stop all of them (expected: 1 entry).
I can target a task with :
q = taskqueue.Queue('longprocess')
q.delete_tasks(taskqueue.Task(name='longprocess-{0}'.format(longprocess.id,)))
But of course, this doesn't work because the name is incorrect (it is missing its -{timestamp} part).
Is there something like a q.search('longprocess-1-*') that I could loop over and delete?
Thank you for your help.
No, there is nothing like q.search('longprocess-1-*'), and there could not be, due to the nature of queues. It's not that it's technically impossible; it's just not reasonable in principle, because otherwise a queue would just be a DB table.
The advantage (and limitation) of queues is that they use FIFO (first in, first out). Not strictly, and sometimes with extensions like a "delay" parameter for a task. But in any case the task scheduler/dispatcher/coordinator does not need to care about deleting tasks from the middle of the queue and concentrates on working with a limited number of tasks at the head of the queue. From this specialization we gain the speed, cost effectiveness and reliability of the queue concept.
It's your job to handle how you cancel a task. You have at least 2 options:
Store the task name somewhere and use it to delete the task from the queue.
Store the intention (request) to cancel a task somewhere. When the task hits the worker, you check the flag and, if needed, just ignore the task.
You can use a combination of these 2 methods for the edge case where a task has been dispatched to a worker but has not completed yet. But in most cases it is not worth the effort.
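Both options can be sketched in a few lines. The dict and set below are in-memory stand-ins for whatever storage you use (datastore, SQL, memcache), and every function name here is hypothetical.

```python
task_names = {}           # process_id -> full task name (option 1)
cancel_requested = set()  # process ids the user asked to stop (option 2)


def remember_task(process_id, task_name):
    """Option 1: store the full name when the task is enqueued."""
    task_names[process_id] = task_name


def name_to_delete(process_id):
    """Option 1: look up (and forget) the name to pass to delete_tasks."""
    return task_names.pop(process_id, None)


def request_cancel(process_id):
    """Option 2: record the intention to cancel."""
    cancel_requested.add(process_id)


def worker(process_id, do_work):
    """Option 2: the worker checks the flag before doing anything."""
    if process_id in cancel_requested:
        return "skipped"
    do_work(process_id)
    return "done"
```

With option 1, the stop request would pass the stored name to q.delete_tasks(taskqueue.Task(name=...)) instead of guessing the timestamp.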
By the way, lots of message queuing systems do not have "task deletion" at all. As a Russian saying goes, "A word is not a bird: once it's gone, you cannot put it back".
I'm sorry if this question has in fact been asked before. I've searched around quite a bit and found pieces of information here and there but nothing that completely helps me.
I am building an app on Google App engine in python, that lets a user upload a file, which is then being processed by a piece of python code, and then resulting processed file gets sent back to the user in an email.
At first I used a deferred task for this, which worked great. Over time I've come to realize that since the processing can take more than the 10 minutes I have before I hit the DeadlineExceededError, I need to be more clever.
I therefore started to look into task queues, wanting to make a queue that processes the file in chunks, and then piece everything together at the end.
My present code for making the single deferred task looks like this:
_ = deferred.defer(transform_function, filename, from_, to, email)
so that the transform_function code gets the values of filename, from_, to and email and sets off to do the processing. (from is a reserved word in Python, so the argument is written as from_ here.)
Could someone please enlighten me as to how I turn this into a linear chain of tasks that get acted on one after the other? I have read all the documentation on Google App Engine that I can think of, but it is unfortunately not written in enough detail in terms of actual pieces of code.
I see references to things like:
taskqueue.add(url='/worker', params={'key': key})
but since I don't have a url for my task, but rather a transform_function() implemented elsewhere, I don't see how this applies to me…
Many thanks!
You can just keep calling deferred to run your task when you get to the end of each phase.
Other queues just allow you to control the scheduling and rate, but work the same.
I track the elapsed time in the task, and when I get near the end of the processing window the code stops what it is doing and calls defer, either for the next task in the chain or to continue where it left off, depending on whether it's a discrete set of steps or a continuous chunk of work. This was all written back when tasks could only run for 60 seconds.
However, the problem you will face (it doesn't matter if it's a normal task queue or deferred) is that each stage could fail for some reason and then be re-run, so each phase must be idempotent.
For long-running chained tasks, I construct an entity in the datastore that holds the description of the work to be done and tracks the processing state for the job. Then you can just keep rerunning the same task until completion. On completion it marks the job as complete.
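A rough sketch of that pattern: a job record with a cursor, and a task body that works until its time budget is used up and then reports whether it needs to be run again. The job dict stands in for the datastore entity, and run_slice, process_item and budget_seconds are hypothetical names.

```python
import time


def run_slice(job, process_item, budget_seconds, clock=time.time):
    """Process items from job['cursor'] until done or out of time.

    Returns True when the job is complete, False when the caller should
    re-enqueue (e.g. call deferred.defer again) to continue later.
    """
    started = clock()
    items = job["items"]
    while job["cursor"] < len(items):
        if clock() - started >= budget_seconds:
            return False  # near the deadline: stop and let the next task resume
        process_item(items[job["cursor"]])
        job["cursor"] += 1  # in real code, persist the cursor with the job entity
    job["done"] = True
    return True
```

If a task dies after process_item ran but before the cursor was saved, that item is processed again on the retry, which is why each step has to be idempotent.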
To avoid the 10-minute timeout you can direct the request to a backend or a B-type module using the "_target" param.
BTW, is there any reason you need to process the chunks sequentially? If all you need is some notification upon completion of all chunks (so you can "piece everything together at the end"), you can implement it in various ways. For example, each deferred task for a chunk can decrement a shared datastore counter (read the state, decrement and update, all in the same transaction) that was initialized with the number of chunks. If the datastore update was successful and the counter has reached zero, you can proceed with combining all the pieces together.
An alternative to using deferred that would simplify the suggested workflow is pipelines (https://code.google.com/p/appengine-pipeline/wiki/GettingStarted).
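The counter idea can be illustrated like this. A threading.Lock stands in for the datastore transaction, and ChunkCounter is a made-up name, so treat it as a sketch of the logic rather than datastore code.

```python
import threading


class ChunkCounter(object):
    """Counts down finished chunks; the lock stands in for a transaction."""

    def __init__(self, total_chunks):
        self._remaining = total_chunks
        self._lock = threading.Lock()

    def chunk_done(self):
        """Decrement atomically; return True only for the call that hits zero."""
        with self._lock:
            self._remaining -= 1
            return self._remaining == 0
```

Each chunk task calls chunk_done() as its last step, and the one that gets True kicks off the "piece everything together" task. Since a failed task may be re-run, a real version would also need to make sure a retried chunk does not decrement twice.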
I have found this solution for adding periodic task schedules dynamically with django-celery.
My use case is mailings, which are added individually for users of the website. Each mailing has a PeriodicTask associated with it, so there may potentially be a huge quantity of PeriodicTask records in the DB.
I'm interested: is this a valid (legal, proper, right) solution in that case, or is it better to have only one or a few PeriodicTasks which would check when the mailings were last sent and send them if necessary?
According to its creator, Ask Solem, in this thread:
There is no known limit to the number of periodic tasks, and the celerybeat scheduler should perform well even with a large number of schedule entries.
That Google group thread and this one are the most clarifying about the concern you have.
That said, I'd like to give you some advice: even though the celerybeat scheduler is able to handle huge numbers of periodic tasks, that comes at a cost: more database entries, more tasks to monitor, more RAM, maybe more complexity when debugging because you are creating dynamic tasks, and more hits to the database because for each mailing you will have to check its sent datetime and then decide whether to send that email.
On the other hand, if you can have one periodic task that does one query to retrieve just the mailing instances that have to be sent and then fires one subtask per email you have to send, then it will look simpler in your code, both when you have to debug it and when you have to monitor it. Just my two cents.
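As an illustration of the "one query for what is due" part, here is a sketch in plain Python. The last_sent and interval fields are assumptions about what a mailing record might hold; with Django you would express the same condition as a queryset filter.

```python
from datetime import datetime, timedelta


def due_mailings(mailings, now):
    """Return the mailings that were never sent or whose interval has elapsed."""
    return [
        m for m in mailings
        if m["last_sent"] is None or now - m["last_sent"] >= m["interval"]
    ]
```

The single periodic task would call this and fire one subtask per returned mailing.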
Hope it helps.
Could you not have a single periodic task which runs every day, week or whatever, and inside that calculate in the first part all the users which require mailings at that time? Once you know all of these, you could kick-off a sub-task in celery for each of these so that these are all executed asynchronously and will allow the main task to complete very quickly, e.g.
from celery import task  # Celery 3.x-style decorator; newer releases use app.task or shared_task

@task
def send_periodic_emails():
    users_who_need_mail = get_users_who_need_mail()
    for user in users_who_need_mail:
        send_user_email.delay(user.id)

@task
def send_user_email(user_id):
    # Do email sending here
    pass
I appreciate this doesn't answer the question as it's formed, but it should allow you to avoid finding out whether this limit exists or adding scheduled tasks programmatically!
A lot depends on the nature of your work. If you can group your users into classes for mailing purposes then it would seem natural to schedule mailing of the groups rather than mailing the individual users. If everyone is on a different schedule then by all means schedule each one individually. It's certainly legal and there's no compelling reason to avoid it if it's the natural solution to your problems.
You may want to run some tests to get an idea of the load you will generate, but your approach doesn't seem unreasonable.
I'm planning to use Celery to handle sending push notifications and emails triggered by events from my primary server.
These tasks require opening a connection to an external server (GCM, APS, email server, etc). They can be processed one at a time, or handled in bulk with a single connection for much better performance.
Often there will be several instances of these tasks triggered separately in a short period of time. For example, in the space of a minute, there might be several dozen push notifications that need to go out to different users with different messages.
What's the best way of handling this in Celery? It seems like the naïve way is to simply have a different task for each message, but that requires opening a connection for each instance.
I was hoping there would be some sort of task aggregator allowing me to process e.g. 'all outstanding push notification tasks'.
Does such a thing exist? Is there a better way to go about it, for example like appending to an active task group?
Am I missing something?
Robert
I recently discovered and have implemented the celery.contrib.batches module in my project. In my opinion it is a nicer solution than Tommaso's answer, because you don't need an extra layer of storage.
Here is an example straight from the docs:
A click counter that flushes the buffer every 100 messages, or every
10 seconds. Does not do anything with the data, but can easily be
modified to store it in a database.
from collections import Counter
from celery.contrib.batches import Batches

# Flush after 100 messages, or 10 seconds.
@app.task(base=Batches, flush_every=100, flush_interval=10)
def count_click(requests):
    # Each request carries the kwargs of one buffered call.
    counts = Counter(request.kwargs['url'] for request in requests)
    for url, count in counts.items():
        print('>>> Clicks: {0} -> {1}'.format(url, count))
Be wary though: it works fine for my usage, but the documentation describes it as an "Experimental task class". This might deter some from using a feature with such a volatile description :)
An easy way to accomplish this is to write all the actions a task should take to persistent storage (e.g. a database) and let a periodic job do the actual processing in one batch (with a single connection).
Note: make sure you have some locking in place to prevent the queue from being processed twice!
There is a nice example on how to do something similar at kombu level (http://ask.github.com/celery/tutorials/clickcounter.html)
Personally I like the way sentry does something like this to batch increments at db level (sentry.buffers module)
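To make the "buffer now, send in one batch later" idea concrete, here is a small in-process sketch. A real setup would keep the pending items in persistent storage as described above; NotificationBuffer and send_batch are hypothetical names, and the lock only protects against concurrent threads in one process.

```python
import threading


class NotificationBuffer(object):
    """Collects messages and hands them to `send_batch` in one call."""

    def __init__(self, send_batch):
        self._pending = []
        self._lock = threading.Lock()
        self._send_batch = send_batch  # e.g. opens one connection, sends all

    def add(self, message):
        with self._lock:
            self._pending.append(message)

    def flush(self):
        """Swap out the pending list under the lock, then send it once."""
        with self._lock:
            batch, self._pending = self._pending, []
        if batch:
            self._send_batch(batch)
        return len(batch)
```

The event-triggered tasks would only call add(), and a periodic task would call flush(), so the external connection is opened once per batch.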
After talking with a friend of mine from Google, I'd like to implement some kind of Job/Worker model for updating my dataset.
This dataset mirrors a 3rd party service's data, so, to do the update, I need to make several remote calls to their API. I think a lot of time will be spent waiting for responses from this 3rd party service. I'd like to speed things up, and make better use of my compute hours, by parallelizing these requests and keeping many of them open at once, as they wait for their individual responses.
Before I explain my specific dataset and get into the problem, I'd like to clarify what answers I'm looking for:
Is this a flow that would be well suited to parallelizing with MapReduce?
If yes, would this be cost effective to run on Amazon's mapreduce module, which bills by the hour, and rounds hours up when the job is complete? (I'm not sure exactly what counts as a "Job", so I don't know exactly how I'll be billed)
If no, is there another system/pattern I should use? And is there a library that will help me do this in Python (on AWS, using EC2 + EBS)?
Are there any problems you see with the way I've designed this job flow?
Ok, now onto the details:
The dataset consists of users who have favorite items and who follow other users. The aim is to be able to update each user's queue -- the list of items the user will see when they load the page, based on the favorite items of the users she follows. But, before I can crunch the data and update a user's queue, I need to make sure I have the most up-to-date data, which is where the API calls come in.
There are two calls I can make:
Get Followed Users -- Which returns all the users being followed by the requested user, and
Get Favorite Items -- Which returns all the favorite items of the requested user.
After I call get followed users for the user being updated, I need to update the favorite items for each user being followed. Only when all of the favorites are returned for all the users being followed can I start processing the queue for that original user. This flow looks like:
Jobs in this flow include:
Start Updating Queue for user -- kicks off the process by fetching the users followed by the user being updated, storing them, and then creating Get Favorites jobs for each user.
Get Favorites for user -- Requests, and stores, a list of favorites for the specified user, from the 3rd party service.
Calculate New Queue for user -- Processes a new queue, now that all the data has been fetched, and then stores the results in a cache which is used by the application layer.
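The three jobs above boil down to "fetch the followed users, fetch all their favourites in parallel, then calculate". A minimal sketch of the parallel part using a thread pool is shown below; get_followed and get_favourites are hypothetical callables wrapping the two API calls, and max_workers=8 is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_favourites_for_followed(user, get_followed, get_favourites,
                                  max_workers=8):
    """Return {followed_user: favourites}, fetching favourites concurrently."""
    followed = get_followed(user)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in the same order as `followed`.
        favourites = list(pool.map(get_favourites, followed))
    return dict(zip(followed, favourites))
```

Because the work is mostly waiting on HTTP responses, threads are enough to keep many requests open at once; the queue calculation only starts after the with block has collected every result.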
So, again, my questions are:
Is this a flow that would be well suited to parallelizing with MapReduce? I don't know if it would let me start the process for UserX, fetch all the related data, and come back to processing UserX's queue only after that's all done.
If yes, would this be cost effective to run on Amazon's mapreduce module, which bills by the hour, and rounds hours up when the job is complete? Is there a limit on how many "threads" I can have waiting on open API requests if I use their module?
If no, is there another system/pattern I should use? And is there a library that will help me do this in Python (on AWS, using EC2 + EBS)?
Are there any problems you see with the way I've designed this job flow?
Thanks for reading, I'm looking forward to some discussion with you all.
Edit, in response to JimR:
Thanks for a solid reply. In my reading since I wrote the original question, I've leaned away from using MapReduce. I haven't decided for sure yet how I want to build this, but I'm beginning to feel MapReduce is better for distributing / parallelizing computing load, whereas I'm really just looking to parallelize HTTP requests.
What would have been my "reduce" task, the part that takes all the fetched data and crunches it into results, isn't that computationally intensive. I'm pretty sure it's going to wind up being one big SQL query that executes for a second or two per user.
So, what I'm leaning towards is:
A non-MapReduce Job/Worker model, written in Python. A google friend of mine turned me onto learning Python for this, since it's low overhead and scales well.
Using Amazon EC2 as a compute layer. I think this means I also need an EBS slice to store my database.
Possibly using Amazon's Simple Queue Service (SQS). It sounds like this third Amazon widget is designed to keep track of job queues, move results from one task into the inputs of another and gracefully handle failed tasks. It's very cheap. May be worth implementing instead of a custom job-queue system.
The work you describe is probably a good fit for either a queue, or a combination of a queue and job server. It certainly could work as a set of MapReduce steps as well.
For a job server, I recommend looking at Gearman. The documentation isn't awesome, but the presentations do a great job documenting it, and the Python module is fairly self-explanatory too.
Basically, you create functions in the job server, and these functions get called by clients via an API. The functions can be called either synchronously or asynchronously. In your example, you probably want to asynchronously add the "Start update" job. That will do whatever preparatory tasks, and then asynchronously call the "Get followed users" job. That job will fetch the users, and then call the "Update followed users" job. That will submit all the "Get Favourites for UserA" and friend jobs together in one go, and synchronously wait for the result of all of them. When they have all returned, it will call the "Calculate new queue" job.
This job-server-only approach will initially be a bit less robust, since ensuring that you handle errors and any down servers and persistence properly is going to be fun.
For a queue, SQS is an obvious choice. It is rock-solid, and very quick to access from EC2, and cheap. And way easier to set up and maintain than other queues when you're just getting started.
Basically, you will put a message onto the queue, much like you would submit a job to the job server above, except you probably won't do anything synchronously. Instead of making the "Get Favourites For UserA" and so forth calls synchronously, you will make them asynchronously, and then have a message that says to check whether all of them are finished. You'll need some sort of persistence (a SQL database you're familiar with, or Amazon's SimpleDB if you want to go fully AWS) to track whether the work is done - you can't check on the progress of a job in SQS (although you can in other queues). The message that checks whether they are all finished will do the check - if they're not all finished, don't do anything, and then the message will be retried in a few minutes (based on the visibility_timeout). Otherwise, you can put the next message on the queue.
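The "check whether all of them are finished" message handler might look like this sketch. The progress dict stands in for the persistence layer (SQL or SimpleDB), enqueue stands in for sending the next SQS message, and all names are hypothetical.

```python
def handle_check_message(progress, user_id, enqueue):
    """Return True if the check message can be deleted from the queue.

    progress maps user_id -> {followed_user: fetched?}. While anything is
    still pending we return False and leave the message alone, so it
    becomes visible again after the visibility timeout and is retried.
    """
    pending = [f for f, done in progress[user_id].items() if not done]
    if pending:
        return False
    enqueue(("calculate_queue", user_id))
    return True
```

Only when the handler returns True would the consumer delete the check message.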
This queue-only approach should be robust, assuming you don't consume queue messages by mistake without doing the work. Making a mistake like that is hard to do with SQS - you really have to try. Don't use auto-consuming queues or protocols - if you error out, you might not be able to ensure that you put a replacement message back on the queue.
A combination of queue and job server may be useful in this case. You can get away with not having a persistence store to check job progress - the job server will allow you to track job progress. Your "get favourites for users" message could place all the "get favourites for UserA/B/C" jobs into the job server. Then, put a "check all favourites fetching done" message on the queue with a list of tasks that need to be complete (and enough information to restart any jobs that mysteriously disappear).
For bonus points:
Doing this as a MapReduce should be fairly easy.
Your first job's input will be a list of all your users. The map will take each user, get the followed users, and output lines for each user and their followed user:
"UserX" "UserA"
"UserX" "UserB"
"UserX" "UserC"
An identity reduce step will leave this unchanged. This will form the second job's input. The map for the second job will get the favourites for each line (you may want to use memcached to prevent fetching favourites for UserX/UserA combo and UserY/UserA via the API), and output a line for each favourite:
"UserX" "UserA" "Favourite1"
"UserX" "UserA" "Favourite2"
"UserX" "UserA" "Favourite3"
"UserX" "UserB" "Favourite4"
The reduce step for this job will convert this to:
"UserX" [("UserA", "Favourite1"), ("UserA", "Favourite2"), ("UserA", "Favourite3"), ("UserB", "Favourite4")]
At this point, you might have another MapReduce job to update your database for each user with these values, or you might be able to use some of the Hadoop-related tools like Pig, Hive, and HBase to manage your database for you.
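For anyone who wants to see the two map steps and the final reduce in one place, here is a plain-Python sketch with no Hadoop involved. get_followed and get_favourites are hypothetical callables wrapping the API calls.

```python
from collections import defaultdict


def map_followed(user, get_followed):
    """Job 1 map: emit (user, followed_user) pairs."""
    for followed in get_followed(user):
        yield (user, followed)


def map_favourites(pair, get_favourites):
    """Job 2 map: emit (user, (followed_user, favourite)) records."""
    user, followed = pair
    for favourite in get_favourites(followed):
        yield (user, (followed, favourite))


def reduce_by_user(records):
    """Job 2 reduce: group the (followed_user, favourite) tuples per user."""
    grouped = defaultdict(list)
    for user, value in records:
        grouped[user].append(value)
    return dict(grouped)
```

Fed with the example data above, reduce_by_user returns the "UserX" list of ("UserA", "Favourite1") style tuples shown in the last output line.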
I'd recommend using Cloudera's Distribution for Hadoop's ec2 management commands to create and tear down your Hadoop cluster on EC2 (their AMIs have Python set up on them), and use something like Dumbo (on PyPI) to create your MapReduce jobs, since it allows you to test your MapReduce jobs on your local/dev machine without access to Hadoop.
Good luck!
Seems that we're going with Node.js and the Seq flow control library. It was very easy to move from my map/flowchart of the process to a stub of the code, and now it's just a matter of filling out the code to hook into the right APIs.
Thanks for the answers, they were a lot of help finding the solution I was looking for.
I am working on a similar problem that I need to solve. I was also looking at MapReduce and using the Elastic MapReduce service from Amazon.
I'm pretty convinced MapReduce will work for this problem. The implementation is where I'm getting hung up, because I'm not sure my reducer even needs to do anything.
I'll answer your questions as I understand your (and my) problem, and hopefully it helps.
Yes, I think it'll be well suited. You could look at leveraging the Elastic MapReduce service's multiple steps option. You could use 1 step to fetch the people a user is following, and another step to compile a list of tracks for each of those followers, and the reducer for that 2nd step would probably be the one to build the cache.
Depends on how big your data-set is and how often you'll be running it. It's hard to say whether it'll be cost effective without knowing how big the data-set is (or is going to get). Initially, it'll probably be quite cost-effective, as you won't have to manage your own Hadoop cluster, nor pay for EC2 instances (assuming that's what you use) to be up all the time. Once you reach the point where you're actually crunching this data for long periods of time, it will probably make less and less sense to use Amazon's MapReduce service, because you'll constantly have nodes online.
A job is basically your MapReduce task. It can consist of multiple steps (each MapReduce task is a step). Once your data has been processed and all steps have been completed, your job is done. So you're effectively paying for CPU time for each node in the Hadoop cluster: T*n, where T is the time (in hours) it takes to process your data, and n is the number of nodes you tell Amazon to spin up.
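As a quick worked example of T*n, assuming (as the question describes) that each node is billed by the hour and partial hours are rounded up, the billed node-hours could be estimated like this. billed_node_hours is a made-up helper, and actual billing rules may differ.

```python
import math


def billed_node_hours(runtime_hours, nodes):
    """T*n with T rounded up to whole hours (assumed billing rule)."""
    return math.ceil(runtime_hours) * nodes
```

So a job that runs for 2.25 hours on 4 nodes would be counted as 3 * 4 = 12 node-hours under that assumption.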
I hope this helps, good luck. I'd like to hear how you end up implementing your Mappers and Reducers, as I'm solving a very similar problem and I'm not sure my approach is really the best.