My scenario is as follows: I have a large machine learning model that is computed by a bunch of workers. In essence, each worker computes its own part of the model and then exchanges results with the others in order to maintain a globally consistent model state.
So every Celery task computes its own part of the job.
But this means the tasks aren't stateless, and here is my trouble: if I call some_task.delay(123, 456), I'm NOT actually sending just two integers!
I'm sending the whole state of the task, which is pickled somewhere inside Celery. That state is typically about 200 MB :-((
I know it's possible to select a decent serializer in Celery, but my question is how to avoid pickling whatever data happens to be attached to the task.
How can I pickle only the arguments of the task?
Here is a quote from celery/app/task.py:
def __reduce__(self):
    # - tasks are pickled into the name of the task only, and the reciever
    # - simply grabs it from the local registry.
    # - in later versions the module of the task is also included,
    # - and the receiving side tries to import that module so that
    # - it will work even if the task has not been registered.
    mod = type(self).__module__
    mod = mod if mod and mod in sys.modules else None
    return (_unpickle_task_v2, (self.name, mod), None)
I simply don't want this to happen.
Is there a simple way around it, or am I just forced to build my own Celery (which is ugly to imagine)?
Don't use the celery results backend for this. Use a separate data store.
While you could just use Task.ignore_result, this would mean that you lose the ability to track the task's status etc.
The best solution would be to use one storage engine (e.g. Redis) for your results backend.
You should set up a separate storage engine (a separate instance of Redis, or maybe something like MongoDB, depending on your needs) to store the actual data.
In this way you can still see the status of your tasks, but the large data sets do not affect the operation of Celery.
Switching to the JSON serializer may reduce the serialization overhead, depending on the format of the data you generate. However, it can't solve the underlying problem of putting too much data through the results backend.
The results backend can handle relatively small amounts of data - once you go over a certain limit you start to prevent the proper operation of its primary tasks - the communication of task status.
I would suggest updating your tasks so that they return a lightweight data structure containing useful metadata (to e.g. facilitate co-ordination between tasks), and storing the "real" data in a dedicated storage solution.
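As a rough sketch of that split (the Redis URLs, the key scheme and compute_partial_model() are placeholders, not from your code):

import pickle

import redis
from celery import Celery

app = Celery('ml', broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/1')

# a *separate* Redis database dedicated to the large payloads
data_store = redis.Redis(host='localhost', port=6379, db=2)

@app.task
def compute_part(part_id):
    partial_model = compute_partial_model(part_id)  # hypothetical ~200 MB result
    key = 'model-part:%s' % part_id
    data_store.set(key, pickle.dumps(partial_model))
    # only this lightweight metadata goes through the Celery results backend
    return {'part_id': part_id, 'key': key}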
You have to set ignore_result on your task, as it says in the docs:
Task.ignore_result
Don’t store task state. Note that this means you can’t use AsyncResult to check if the task is ready, or get its return value.
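A minimal sketch of that (the app setup and task body are placeholders):

from celery import Celery

app = Celery('ml', broker='redis://localhost:6379/0')

@app.task(ignore_result=True)
def compute_part(part_id):
    ...  # do the work; nothing is written to the results backend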
This is a little off-topic, but still.
As I understand it, this is what's happening: you have several processes that do heavy calculations in parallel, with inter-process communication. So instead of Celery, which doesn't satisfy you in this case, you could:
use zmq for inter-process communication (to send only the necessary data),
use supervisor for managing and running the processes (numprocs in particular will help with running multiple identical workers).
While this does not require writing your own Celery, some code will have to be written; a minimal sketch follows.
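For example, a rough PUSH/PULL sketch with pyzmq (the port, message format and worker logic are placeholders):

import pickle
import zmq

def worker(pull_addr='tcp://127.0.0.1:5557'):
    ctx = zmq.Context()
    sock = ctx.socket(zmq.PULL)
    sock.connect(pull_addr)
    while True:
        part_id, payload = pickle.loads(sock.recv())
        # ... do the heavy calculation on `payload` only ...

def dispatcher(bind_addr='tcp://127.0.0.1:5557'):
    ctx = zmq.Context()
    sock = ctx.socket(zmq.PUSH)
    sock.bind(bind_addr)
    # send only the data the worker actually needs
    sock.send(pickle.dumps((123, {'only': 'the necessary data'})))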
Background
I have some DAGs that pull data from a third-party API.
The accounts we need to pull can change over time. Depending on the process, determining which accounts to pull may mean querying a database or making an HTTP request.
Before Airflow, we would just get the account list at the start of the Python script, then iterate through it and pull each account to a file, or whatever it was we needed to do.
But now, using Airflow, it makes sense to define tasks at the account level and let Airflow handle retries, date ranges, parallel execution, etc.
Thus my DAG might look something like this:
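(Just an illustrative sketch; the dag_id, operator choice and get_account_list() are stand-ins.)

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def pull_account(account_id, **context):
    ...  # hit the third-party API for this account

with DAG(dag_id='pull_accounts', start_date=datetime(2023, 1, 1),
         schedule_interval='@daily', catchup=False) as dag:
    for account_id in get_account_list():  # runs at every DAG parse
        PythonOperator(
            task_id=f'pull_{account_id}',
            python_callable=pull_account,
            op_kwargs={'account_id': account_id},
        )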
Problem
Since each account is a task, the account list needs to be accessed with every DAG parse. But since DAG files are parsed frequently, you don't necessarily want to query the database or wait for a REST call with every parse, from every machine, all day long. This could be resource-intensive, and it could cost money.
Question
Is there a good way to cache this type of config information in a local file, ideally with a specified time-to-live?
Thoughts
I have thought about a couple different approaches:
Write to a CSV or pickle file and use mtime to expire (see the sketch after this list).
The concern with this is that I might get collisions if two processes try to expire the file at the same time. I don't know how likely this is or what the consequences would be, but probably nothing terrible.
Create a common SQLite DB for all such processes. It should be auto-created the first time a variable is accessed. Each config variable gets a row in a table; use a last_modified_datetime column to tell when to expire.
Requires more elaborate code and dependencies.
Use Airflow Variables.
The nice thing about this is that it uses the existing DB, so there would be no cost per query and reasonable network lag, but it still requires a network round trip.
It has the benefit of being identical across all nodes in a multi-node setup.
Determining when to expire would probably be problematic, so I would probably create a config-manager DAG to update the config variables periodically.
But then this would add complexity to the deployment and development process -- the variables need to be populated in order to define the DAGs properly, and all developers would need to manage this locally too, as opposed to a more create-on-read caching approach.
Subdags?
I've never used them, but I have a suspicion they could be used here. The community seems to discourage their use anyway...
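A rough sketch of option 1 (the path, TTL and fetch_accounts_from_api() are placeholders; writing to a temp file and renaming keeps the refresh close to atomic, which addresses the collision concern above):

import json
import os
import tempfile
import time

CACHE_PATH = '/tmp/account_list.json'
TTL_SECONDS = 15 * 60

def get_account_list():
    try:
        if time.time() - os.path.getmtime(CACHE_PATH) < TTL_SECONDS:
            with open(CACHE_PATH) as f:
                return json.load(f)
    except OSError:
        pass  # cache missing or unreadable; fall through and refresh
    accounts = fetch_accounts_from_api()  # hypothetical expensive call
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(CACHE_PATH))
    with os.fdopen(fd, 'w') as f:
        json.dump(accounts, f)
    os.replace(tmp, CACHE_PATH)  # atomic rename on POSIX
    return accounts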
Have you dealt with this problem? Did you arrive at a good solution? None of these seems very good.
Airflow's default DAG parsing interval is pretty forgiving: 5 minutes. But even that is quite a lot for most people, so it's quite reasonable to increase it if your deployment isn't too close to the due times for the new DAGs.
In general, I'd say it's not that bad to make a REST request at every DAG parse heartbeat. Also, nowadays the scheduling process is decoupled from the parsing process, so that won't affect how fast your tasks are scheduled. Airflow caches the DAG definition for you.
If you think you still have reasons to put your own cache on top of that, my suggestion is to cache at the definitions server, not on the Airflow side. For example, using cache headers on the REST endpoint and handling cache invalidation yourself when you need it. But that could be some premature optimization, so my advice is to start without it and implement it only if you measure convincing evidence that you need it.
EDIT: regarding Webserver and Worker
It's true that the webserver will trigger DAG parses as well; I'm not sure how frequently, but probably following the gunicorn worker refresh interval (30 seconds by default). Workers will also do it by default at the start of every task, but that can be avoided if you activate DAG pickling. I'm not sure that's a good idea, though; I've heard it's something destined to be deprecated.
One other thing you can try is to cache that in the Airflow process itself, memoizing the function that makes the expensive request. Python has built-in support for that in functools (lru_cache), and together with pickling it might be enough and very much easier than the other options.
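A minimal sketch of that idea (fetch_accounts_from_api() is a placeholder; the ttl_bucket argument is a common trick to make lru_cache expire roughly every 15 minutes, not something lru_cache does by itself):

import time
from functools import lru_cache

@lru_cache(maxsize=1)
def get_account_list(ttl_bucket=None):
    return fetch_accounts_from_api()  # hypothetical expensive call

# reuse the cached value within a 15-minute bucket, refresh in the next one
accounts = get_account_list(ttl_bucket=int(time.time() // 900))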
I have the same exact scenario.
I have an API call for multiple accounts. Initially I created a Python script to iterate over the list.
When I started using Airflow, I thought about doing what you are planning to do. I tried two of the alternatives you listed. After some experimentation I decided to handle retry logic within Python, with simple try-except blocks when HTTP calls fail. The reasons are:
One script to maintain
Fewer Airflow objects
Restartability is easier with one script in place.
(Restarting a failed job in Airflow is not a breeze (no pun intended).)
In the end it's up to you; that was my experience.
I have a Django application that uses large in-memory data structures (due to performance constraints). This wouldn't be a problem, but I'm using Heroku, where if the Python web process takes more than 30 s to start, it is stopped because it's considered a boot timeout. Because of this, I use a daemon process (a worker in Heroku) to handle the construction of the data structures and Redis to handle the message passing between processes.
When the worker finishes (approx. 1 minute), it stores the data structures (50 MB or so) in Redis.
And now comes the crux of the matter: Django follows the request/response paradigm and is synchronous. This implies a Django view should exist to handle the callback from the worker announcing it's done. Even if I use something fancier like Redis pub/sub, I'm still forced to evaluate the queue populated by the publisher inside a view.
How can I circumvent the necessity of using a Django view? Isn't there an async way of doing this?
Below is the solution where I use a pub/sub inside a view. This seems bad, but I can't think of another way.
views.py
...
# data_handler can enqueue tasks on the default queue
data_handler = DataHandler()
strict_redis = redis.from_url(settings.DEFAULT_QUEUE)
pub_sub = strict_redis.pubsub()

# this puts the job of constructing the large data structures
# on the default queue so a worker can pick it up. Being async,
# it returns with an empty set of data structures.
data_structures = data_handler.start()
pub_sub.subscribe(settings.FINISHED_DATA_STRUCTURES_CHANNEL)

@require_http_methods(['POST'])
def store_and_fetch(request):
    user_data = json.loads(request.body.decode('utf8'))
    message = pub_sub.get_message()
    if message:
        command = message['data'] if 'data' in message else ''
        if command == settings.FINISHED_DATA_STRUCTURES_INIT.encode('utf-8'):
            # this takes the data from redis and updates data_structures
            data_handler.update(data_structures)
    return HttpResponse(compute_response(user_data, data_structures))
Update: After working with this for multiple months, I can now say it's definitely better (and wiser) NOT to fiddle with Django's request/response cycle. There are things like Django RQ Scheduler, or Celery, that can do async tasks just fine. If you want to update the main web process after some repeatable job completes, it's simpler to use something like the Python requests package, sending a POST to the web process from the worker that did the scheduled job. This way we don't circumvent Django's mechanisms, and more importantly, it's simpler to do overall.
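A rough sketch of that notification (the URL, header and payload are placeholders):

import requests

def notify_web_process(job_id):
    requests.post(
        'https://myapp.example.com/internal/job-finished/',  # hypothetical endpoint
        json={'job_id': job_id},
        headers={'X-Internal-Token': 'shared-secret'},  # simple auth between processes
        timeout=5,
    )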
Regarding the Heroku constraints I mentioned at the beginning of the post: at the time I wrote this question I was quite a newbie with Heroku and didn't know much about the release phase. In the release phase we can set up all the complex logic we need for the main process. Then, at the end of the release phase, we simply notify the web process in the manner described above and use some distributed memory buffer (even Redis will work just fine).
I'm working on a project to parallelize some heavy simulation jobs. Each run takes about two minutes, takes 100% of the available CPU power, and generates over 100 MB of data. In order to execute the next step of the simulation, those results need to be combined into one huge result.
Note that this will be run on performant systems (currently testing on a machine with 16 GB of RAM and 12 cores, but we will probably upgrade to bigger hardware).
I can use a Celery job group to easily dispatch about 10 of these jobs, and then chain that into the concatenation step and the next simulation (essentially a Celery chord). However, I need to be able to run at least 20 on this machine, and eventually 40 on a beefier machine. It seems that Redis doesn't allow objects on the result backend large enough for me to do anything more than 13, and I can't find any way to change this behavior.
I am currently doing the following, and it works fine:
test_a_group = celery.group(test_a.s(x) for x in ['foo', 'bar'])
test_a_result = test_a_group.apply_async(add_to_parent=False)
return test_b(test_a_result.get())
What I would rather do:
return chord(test_a_group)(test_b.s())
The second one works for small datasets, but not large ones. It gives me a non-verbose 'Celery ChordError 104: connection refused' with large data.
Test B returns very small data, essentially a pass/fail, and I am only passing the group result into B, so it should work, except that I think the entire group result is being appended to the result of B, in the form of its parent, making it too big. I can't figure out how to prevent this from happening.
The first one works great, and I would be okay with it, except that it complains, saying:
[2015-01-04 11:46:58,841: WARNING/Worker-6] /home/cmaclachlan/uriel-venv/local/lib/python2.7/site-packages/celery/result.py:45:
RuntimeWarning: Never call result.get() within a task!
See http://docs.celeryq.org/en/latest/userguide/tasks.html#task-synchronous-subtasks
In Celery 3.2 this will result in an exception being
raised instead of just being a warning.
warnings.warn(RuntimeWarning(E_WOULDBLOCK))
What the link essentially suggests is what I want to do, but can't.
I think I read somewhere that Redis has a limit of 512 MB on the size of a single value pushed to it.
Any advice on this hairiest of problems?
Celery isn't really designed to address this problem directly. Generally speaking, you want to keep the inputs/outputs of tasks small.
Every input or output has to be serialized (pickle by default) and transmitted through the broker, such as RabbitMQ or Redis. Since the broker needs to queue the messages when there are no clients available to handle them, you end up potentially paying the hit of writing/reading the data to disk anyway (at least for RabbitMQ).
Typically, people store large data outside of celery and just access it within the tasks by URI, ID, or something else unique. Common solutions are to use a shared network file system (as already mentioned), a database, memcached, or cloud storage like S3.
You definitely should not call .get() within a task because it can lead to deadlock.
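A rough sketch of that pattern for the poster's chord (the shared directory, simulate() and merge() are placeholders; only file paths travel through the result backend):

import os
import pickle
from celery import Celery, chord

app = Celery('sims', broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/1')
SHARED_DIR = '/mnt/shared/results'  # NFS mount, S3 prefix, etc.

@app.task
def run_simulation(run_id):
    result = simulate(run_id)  # hypothetical ~100 MB result
    path = os.path.join(SHARED_DIR, '%s.pkl' % run_id)
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return path  # tiny payload through the backend

@app.task
def combine(paths):
    results = []
    for p in paths:
        with open(p, 'rb') as f:
            results.append(pickle.load(f))
    return merge(results)  # hypothetical combine step; keep its output small too

# the header fans out the runs, the body receives only the list of paths
workflow = chord(run_simulation.s(i) for i in range(20))(combine.s())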
I have a bunch of Feed objects in my database, and I'm trying to get each Feed to be updated every hour. My issue here is that I need to make sure there aren't any duplicate updates -- it needs to happen no more than once an hour, but I also don't want feeds waiting two hours for an update. (It's okay if it happens every hour +/- a few minutes, but twice in a few minutes is bad.)
I'm using Django and Celery with Amazon SQS as a broker. I have the feed update code set up as a Celery task, but I'm failing to find a way to prevent duplicates while remaining compatible with Celery running on multiple nodes.
My current solution is to add a last_update_scheduled attribute to the Feed model and run the following task every 5 minutes (pseudo-code):
from datetime import datetime, timedelta
from django.db.models import Q

now = datetime.now()
threshold = now - timedelta(seconds=3600)
for f in Feed.objects.filter(Q(last_update_scheduled__lt=threshold) |
                             Q(last_update_scheduled=None)):
    updateFeed.delay(f)
    f.last_update_scheduled = now
    f.save()
This is susceptible to a number of synchronization issues. For example, if my task queues get backed up, this task could run twice at the same time, causing duplicate updates. I've seen some solutions for this (like Celery's recipe and an adaptation on Stack Overflow), but the memcached solution isn't reliable, e.g. duplicates could happen when restarting memcached or if it happens to run out of memory and purge old data. Not to mention I'd hate to have to add memcached to my production configuration just for a simple lock.
In a perfect world, I'd like to be able to say:
@modelTask(Feed, run_every=3600)
def updateFeed(feed):
    # do something expensive
But so far my imagination fails me on how to implement that decorator.
To be clear, the Celery recipe is not using memcached per se, but rather Django's caching middleware. There are a number of other caching methods that would suit your needs without the downside of memcached. See the Django caching documentation for details.
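A rough sketch of a cache-backed lock in that spirit (the key name, TTL and function name are placeholders; cache.add is atomic on memcached and Redis backends, less so on others):

from django.core.cache import cache

LOCK_TTL = 3600  # seconds; matches the "no more than once an hour" rule

def schedule_feed_update(feed_id):
    lock_key = 'feed-update-%s' % feed_id
    # cache.add only sets the key if it does not already exist, so whichever
    # process gets here first wins; the key expires on its own after LOCK_TTL
    if cache.add(lock_key, 'scheduled', LOCK_TTL):
        updateFeed.delay(feed_id)  # the task from the question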
I have an "analytics dashboard" screen, visible to my Django web application's users, that takes a really long time to calculate. It's one of those screens that goes through every single transaction in the database for a user and gives them metrics on it.
I would love for this to be a realtime operation, but calculation times can be 20-30 seconds for an active user (no paging allowed, it's giving averages on transactions.)
The solution that comes to mind is to calculate this in the backend via a manage.py batch command and then just display cached values to the user. Is there a Django design pattern to help facilitate these types of models/displays?
What you're looking for is a combination of offline processing and caching. By offline, I mean that the computation logic happens outside the request-response cycle. By caching, I mean that the result of your expensive calculation is sufficiently valid for X time, during which you do not need to recalculate it for display. This is a very common pattern.
Offline Processing
There are two widely-used approaches to work which needs to happen outside the request-response cycle:
Cron jobs (often made easier via a custom management command)
Celery
In relative terms, cron is simpler to set up, and Celery is more powerful/flexible. That being said, Celery enjoys fantastic documentation and a comprehensive test suite. I've used it in production on almost every project, and while it does involve some requirements, it's not really a bear to set up.
Cron
Cron jobs are the time-honored method. If all you need is to run some logic and store some result in the database, a cron job has zero dependencies. The only fiddly bit with cron jobs is getting your code to run in the context of your Django project -- that is, your code must correctly load your settings.py in order to know about your database and apps. For the uninitiated, this can lead to some aggravation in divining the proper PYTHONPATH and such.
If you're going the cron route, a good approach is to write a custom management command. You'll have an easy time testing your command from the terminal (and writing tests), and you won't need to do any special hoopla at the top of your management command to setup a proper django environment. In production, you simply run path/to/manage.py yourcommand. I'm not sure if this approach works without the assistance of virtualenv, which you really ought to be using regardless.
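A minimal command sketch (the command name, app and query are placeholders; save it as yourapp/management/commands/refresh_dashboard.py and cron can run path/to/manage.py refresh_dashboard):

from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = 'Recalculate dashboard metrics for all users'

    def handle(self, *args, **options):
        # settings.py is already loaded for you, so the ORM just works
        from yourapp.models import Transaction  # hypothetical model
        self.stdout.write('Processed %d transactions' % Transaction.objects.count())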
Another aspect to consider with cron jobs: if your logic takes a variable amount of time to run, cron is ignorant of the matter. A cute way to kill your server is to run a two-hour cron job like this every hour. You can roll your own locking mechanism to prevent this; just be aware that what starts out as a short cron job might not stay that way when your data grows, or when your RDBMS misbehaves, etc.
In your case, it sounds like cron is less applicable because you'd need to calculate the graphs for every user every so often, without regard to who is actually using the system. This is where Celery can help.
Celery
…is the bee's knees. Usually people are scared off by the "default" requirement of an AMQP broker. It's not terribly onerous setting up RabbitMQ, but it does require stepping outside of the comfortable world of Python a bit. For many tasks, I just use redis as my task store for Celery. The settings are straightforward:
CELERY_RESULT_BACKEND = "redis"
REDIS_HOST = "localhost"
REDIS_PORT = 6379
REDIS_DB = 0
REDIS_CONNECT_RETRY = True
Voilà, no need for an AMQP broker.
Celery provides a wealth of advantages over simple cron jobs. Like cron, you can schedule periodic tasks, but you can also fire off tasks in response to other stimuli without holding up the request/response cycle.
If you don't want to compute the chart for every active user every so often, you will need to generate it on-demand. I'm assuming that querying for the latest available averages is cheap, computing new averages is expensive, and you're generating the actual charts client-side using something like flot. Here's an example flow:
User requests a page which contains an averages chart.
Check cache -- is there a stored, nonexpired queryset containing averages for this user?
If yes, use that.
If not, fire off a celery task to recalculate it, requery and cache the result. Since querying existing data is cheap, run the query if you want to show stale data to the user in the meantime.
If the chart is stale, optionally provide some indication that it is stale, or do some Ajax fanciness to ping Django every so often and ask whether the refreshed chart is ready.
You could combine this with a periodic task to recalculate the chart every hour for users that have an active session, to prevent really stale charts from being displayed. This isn't the only way to skin the cat, but it provides you with all the control you need to ensure freshness while throttling the CPU load of the calculation task. Best of all, the periodic task and the "on demand" task share the same logic: you define the task once and call it from both places for added DRYness.
Caching
The Django cache framework provides you with all the hooks you need to cache whatever you want for as long as you want. Most production sites rely on memcached as their cache backend; I've lately started using Redis with the django-redis-cache backend instead, but I'm not sure I'd trust it for a major production site yet.
Here's some code showing off usage of the low-level caching API to accomplish the workflow laid out above:
import pickle

from django.core.cache import cache
from django.shortcuts import render

from celery.task import task

@task
def calculate_stuff(user_id):
    # ... do your work to update the averages ...
    # now pull the latest series (TransactionAverage is your model of precomputed averages)
    averages = TransactionAverage.objects.filter(user=user_id, ...)
    # cache the pickled result for ten minutes
    cache.set("averages_%s" % user_id, pickle.dumps(averages), 60 * 10)

def myview(request, user_id):
    ctx = {}
    cached = cache.get("averages_%s" % user_id, None)
    if cached:
        averages = pickle.loads(cached)  # use the cached queryset
    else:
        # fetch the latest available data for now, same as in the task
        averages = TransactionAverage.objects.filter(user=user_id, ...)
        # fire off the celery task to update the information in the background
        calculate_stuff.delay(user_id)  # doesn't happen in-process.
        ctx['stale_chart'] = True  # display a warning, if you like

    ctx['averages'] = averages
    # ... do your other work ...
    return render(request, 'my_template.html', ctx)
Edit: worth noting that pickling a queryset loads the entire queryset into memory. If you're pulling up a lot of data with your averages queryset this could be suboptimal. Testing with real-world data would be wise in any case.
The simplest and, IMO, correct solution for such scenarios is to pre-calculate everything as things are updated, so that when the user views the dashboard you calculate nothing and just display the already-calculated values.
There are various ways to do that, but the general concept is to trigger a calculation function in the background whenever something the calculation depends on changes.
For triggering such a calculation in the background I usually use Celery. Suppose a user adds an item foo in the view view_foo; we call a Celery task update_foo_count, which runs in the background and updates the foo count. Alternatively, you can have a Celery timer that updates the count, say, every 10 minutes by checking whether a re-calculation needs to be done; the recalculate flag can be set in the various places where the user updates data.
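A minimal sketch of that flow (view_foo and update_foo_count come from the text above; the Foo and DashboardStats models are placeholders):

from celery import shared_task
from django.http import HttpResponse

@shared_task
def update_foo_count(user_id):
    count = Foo.objects.filter(owner_id=user_id).count()  # hypothetical model
    DashboardStats.objects.update_or_create(
        user_id=user_id, defaults={'foo_count': count})

def view_foo(request):
    Foo.objects.create(owner=request.user, name=request.POST['name'])
    update_foo_count.delay(request.user.id)  # the recalculation runs off the request cycle
    return HttpResponse('ok')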
You need to have a look at Django’s cache framework.
If the data that is slow to compute can be denormalised and stored when data is added, rather than when it is viewed, then you may be interested in django-denorm.