I am trying to learn about Celery and was wondering if Celery and Pyro are trying to achieve the same thing?
Could somebody please tell me if there is something which Celery can do which Pyro can not, or vice versa?
As the official websites show, Celery and Pyro are intended to do different jobs, but the confusion is pretty natural.
Both packages aim to help you with distributed computing, but they take different approaches: Celery is intended to be a distributed task queue, meaning that if you have a bunch of (largely uncorrelated) tasks, you can distribute them over a computer grid or over the network.
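For illustration, a minimal Celery sketch (the broker URL is made up; it assumes a running RabbitMQ instance):

    from celery import Celery

    # Workers anywhere on the network connect to the same broker.
    app = Celery("tasks", broker="amqp://localhost")

    @app.task
    def add(x, y):
        return x + y

    # A producer enqueues work without caring which worker runs it:
    #   add.delay(2, 3)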
Pyro, on the other hand, aims to establish communication between objects over the network: if you have a big task that you can't divide into small uncorrelated tasks, but rather a bunch of objects that are independent yet regularly need information from one another, Pyro enables the communication between them, so you can perform the task by distributing the objects over a computer grid or over the network.
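By contrast, a minimal Pyro4 sketch (names are illustrative), where an object stays alive in one process and other machines call its methods remotely:

    import Pyro4

    @Pyro4.expose
    class Worker:
        def compute(self, x):
            return x * 2

    daemon = Pyro4.Daemon()           # network server for this process
    uri = daemon.register(Worker())   # publish the object, get its URI
    print("Worker available at", uri)
    daemon.requestLoop()              # serve remote calls

    # From another machine, given the printed URI:
    #   worker = Pyro4.Proxy(uri)
    #   worker.compute(21)            # executed remotely, returns 42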
You posted this with the Django tag, so it is worth noting that the requests made to a web application can be seen as a bunch of uncorrelated tasks (a big one as concurrency increases), so Celery might be what you are looking for.
The answer above explains the differences between Pyro and Celery.
But in light of all the other changes that have happened over the years with respect to Python, and the availability of Python ZeroMQ libraries and function pickling, it might be worth taking a look at leveraging ZeroMQ and PiCloud's function pickling. This creates a whole new way to build distributed stacks.
See the sample code linked on the jeffknupp.com blog.
Yes, of course you can stick with Celery to develop distributed task workers, and with Pyro you can develop remote-procedure-call applications. With Celery and Pyro you are doing all of this in the Python world, whereas ZeroMQ has implementations in a dozen different languages and implements the common networking patterns like PUB-SUB, REQ-REP, pipelines, etc. This opens up language-agnostic possibilities.
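As a taste of that lower abstraction level, here is a minimal pyzmq REQ-REP sketch (the port number is made up):

    import zmq

    ctx = zmq.Context()
    rep = ctx.socket(zmq.REP)
    rep.bind("tcp://*:5555")

    msg = rep.recv()        # block until a request arrives
    rep.send(b"pong")       # reply to the requester

    # The client (any language with a ZeroMQ binding) would do:
    #   req = ctx.socket(zmq.REQ)
    #   req.connect("tcp://localhost:5555")
    #   req.send(b"ping"); print(req.recv())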
Related
The technology I would like to use in this example is Celery for queueing and Python for the component implementation.
Imagine a simple project that consists of 2 components. One is a web app that connects to an API and gathers data. Component 2 is a processor that can then process the data. When the web app has gotten a piece of data from the API, it is supposed to send a task into a task queue, including the just-crawled data, which is then consumed by the processor to process the data.
Whether or not this is a sensible way to go about a task like this is debatable and not the point of my question.
My question is this: the tasks that process things are defined within the processor, since they state which processing function shall be executed, and the definition of that function is obviously within the processor. Given that the web app doesn't have access to the task definitions, how does it communicate the task to the processor?
Do you have to hold a copy of the source code of the processor within the web app?
Do you make the processor a dependency of the web app?
What is the best practice approach to handle such a scenario?
What you are describing is probably one of the most common use cases for Celery. Just look at how many people are asking Django/Flask + Celery questions here on StackOverflow... If you are a Django user, there is an entire section in the Celery documentation describing how to do exactly what you want. Things should be similar with other frameworks.
Do you have to hold a copy of the source code of the processor within the web app?
As far as I know you do not have to (I do not use any web framework), but it could be that you need to for some deeper integration with Celery. If your web application knows the Celery task name and its parameters, it can schedule the task to run without actually having access to the Python code. This is accomplished using send_task(task_name, ...).
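A minimal sketch of that approach (broker URL and task name are made up):

    from celery import Celery

    app = Celery(broker="amqp://localhost")

    # The web app only needs the task's registered name, not its code:
    result = app.send_task("processor.tasks.process_data",
                           args=[{"id": 42, "payload": "..."}])
    print(result.id)  # poll this AsyncResult later if a result backend is configured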
Do you make the processor a dependency of the web app?
As I wrote above, there are several ways to use Celery. If you want tighter integration, then yes. If you just want to run tasks and get results using send_task(), then your web application only needs to depend on Celery.
What is the best practice approach to handle such a scenario?
Follow the Django guide. I also advise you to run Celery independently and run some tasks, just so you learn the basic principles of how it distributes the work, etc.
I am trying to do some tasks in Django that consume a lot of time. For that, I will be running background tasks.
After some R&D, I have found two solutions:
Celery with RabbitMQ.
Django Background tasks.
Both options seem to fulfill the criteria, but setting up Celery will require some work. As far as the second option is concerned, setup is fairly simple, and in a fairly short amount of time I can get on with writing background tasks. My questions, if I adopt the second option, are these:
How well does Django-Background-Tasks perform? (Scalability-wise, in a production environment.)
Can I poll the tasks (after some time) in the DB to check a task's status?
What is the architecture of Django-Background-Tasks? I couldn't find any clear explanation of it. (Or have I missed some resource?)
Coming back to the first point: how well does Django-Background-Tasks perform in production? (I'm asking about prior experience of using it in prod.)
Setting up Celery takes work (although less when using Redis). It's also a serious tool with almost a decade of investment and widespread industry adoption.
As for performance, the scaling behaviors of task systems backed by queues versus those backed by RDBMSs are well understood, but they may not be relevant to you, as "scalability" is a very subjective term. This thread provides some good framing on the subject and the questions to ask.
Comparing stars on GitHub (bg tasks' 3XX vs Celery's 13XXX), you should realize Django-Background-Tasks has a smaller user base, and you're probably going to need to get into the internals to understand the architecture and precise mechanics. That shouldn't stop you; just be prepared to DIY when answers aren't forthcoming.
How well does Django-Background-Tasks perform? - This will depend upon how and what you implement. One thing to note: Django-Background-Tasks is database-backed, whereas Celery can have Redis/RabbitMQ as its broker, so most probably we'll see a considerable performance difference here.
Can I poll the tasks (after some time) in the DB to check the task's status? - It's possible in Celery, and maybe you can find a solution by inspecting django-background-tasks' internal code. One more thing: you can abort a Celery task, which may not be possible in Django-Background-Tasks.
Architecture of Django-Background-Tasks? - It's a simple Django-based project. You can have a look at the code; it seems to be pretty straightforward.
Again coming to the first point, how well does Django-Background-Tasks perform in production? - I haven't used it in production. But since Django-Background-Tasks is database-based and Celery can be configured to use Redis/RabbitMQ, I think Celery has a plus point here.
To me this comparison seems like comparing a pistol with a high-end automatic machine gun. Both do the same job, but one is simple and straightforward while the other is a little more complicated, with lots of options and scope.
Choose based on your use case.
I have decided to use Django-Background-Tasks. Let me clarify my motivations.
The tasks that will be processed by Django-Background-Tasks don't need to be processed quickly. As the name states, they are background tasks; I accept delays.
The architecture of Django-Background-Tasks is very simple. When your code calls a method that is to be processed in the background, a task record is inserted into the Django-Background-Tasks tables in your database. The method you called is not actually executed; it is proxied. You then trigger another process to execute the jobs, and your method is executed in that process.
The process that executes the jobs can be started by a cron entry on your server, as sketched below.
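A minimal sketch of that flow, assuming the django-background-tasks package (task name and schedule are illustrative):

    from background_task import background

    @background(schedule=60)   # run at least 60 seconds from now
    def notify_user(user_id):
        # This body runs later, inside the process_tasks worker.
        print("notifying", user_id)

    notify_user(123)  # proxied: inserts a task row, does not execute here

    # The cron entry (or a long-running process) then executes pending jobs:
    #   python manage.py process_tasks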
Since this setup is so easy and works for me, I decided to use Django-Background-Tasks. But if I needed something more responsive and fast, I would use Celery, since it uses an in-memory broker and has an always-active process handling the jobs, which isn't the case with Django-Background-Tasks.
I know that there are similar questions to this, such as:
https://stackoverflow.com/questions/8232194/pros-and-cons-of-celery-vs-disco-vs-hadoop-vs-other-distributed-computing-packag
Differentiate celery, kombu, PyAMQP and RabbitMQ/ironMQ
but I'm asking this because I'm looking for a more particular distinction backed by a couple of use-case examples, please.
So, I'm a python user who wants to make programs that either/both:
Are too large to
Take too long to
do on a single machine, and process them on multiple machines. I am familiar with the (single-machine) multiprocessing package in python, and I write mapreduce style code right now. I know that my function, for example, is easily parallelizable.
In asking my usual smart CS advice-givers, I have phrased my question as:
"I want to take a task, split it into a bunch of subtasks that are executed simultaneously on a bunch of machines, then those results to be aggregated and dealt with according to some other function, which may be a reduce, or may be instructions to serially add to a database, for example."
According to this break-down of my use-case, I think I could equally well use Hadoop or a set of Celery workers + RabbitMQ broker. However, when I ask the sage advice-givers, they respond to me as if I'm totally crazy to look at Hadoop and Celery as comparable solutions. I've read quite a bit about Hadoop, and also about Celery---I think I have a pretty good grasp on what both do---what I do not seem to understand is:
Why are they considered so separate, so different?
Given that they seem to be received as totally different technologies---in what ways? What are the use cases that distinguish one from the other or are better for one than another?
What problems could be solved with both, and what areas would it be particularly foolish to use one or the other for?
Are there possibly better, simpler ways to achieve multiprocessing-like Pool.map()-functionality to multiple machines? Let's imagine my problem is not constrained by storage, but by CPU and RAM required for calculation, so there isn't an issue in having too little space to hold the results returned from the workers. (ie, I'm doing something like simulation where I need to generate a lot of things on the smaller machines seeded by a value from a database, but these are reduced before they return to the source machine/database.)
I understand Hadoop is the big-data standard, but Celery also looks well supported; I appreciate that it isn't Java (the streaming API Python has to use for Hadoop looked uncomfortable to me), so I'd be inclined to use the Celery option.
They are the same in that both can solve the problem that you describe (map-reduce). They are different in that Hadoop is built entirely to solve only that use case, while Celery/RabbitMQ is built to facilitate task execution on different nodes using message passing. Celery also supports different use cases.
Hadoop solves the map-reduce problem with a large, special filesystem from which the mapper takes its data, sends it to a bunch of map nodes, and reduces the result back to that filesystem. The advantage is that it is really fast at doing this. The downsides are that it operates only on text-based data input, that Python is not really supported, and that you can't handle (even slightly) different use cases.
Celery is a message-based task executor. In it you define tasks and group them together into a workflow (which can be a map-reduce workflow). Its advantages are that it is Python-based and that you can stitch tasks together into a custom workflow, as shown below. Its disadvantages are its reliance on a single broker/result backend and its setup time.
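For instance, a map-reduce style workflow can be expressed with Celery's chord primitive (a sketch; task bodies and broker URL are illustrative):

    from celery import Celery, chord

    app = Celery("mr", broker="amqp://localhost", backend="rpc://")

    @app.task
    def count_the(chunk):
        return chunk.count("the")   # the "map" step

    @app.task
    def total(counts):
        return sum(counts)          # the "reduce" step

    # Fan count_the out over chunks in parallel, then reduce with total:
    #   chord(count_the.s(c) for c in chunks)(total.s())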
So if you have a couple of GBs' worth of logfiles, don't care to write in Java, and have some servers to spare that are used exclusively to run Hadoop, use that. If you want flexibility in running workflowed tasks, use Celery. Or.....
Yes! There is a new project from one of the companies that helped create the messaging protocol AMQP that is used by RabbitMQ (and others). It is called ZeroMQ, and it takes distributed messaging/execution to the next level by, strangely, going down a level in abstraction compared to Celery. It defines sockets that you can link together in various ways to create messaging links between nodes. Anything you want to do with these messages is up to you to write. Although this might sound like "what good is a thin wrapper around a socket?", it is actually at the right level of abstraction. Right now at our company we are factoring out all our Celery messaging and rebuilding it with ZeroMQ. We found that Celery is just too opinionated about how tasks should be executed and that the setup/config in general is a pain. Also, the broker in the middle that has to handle all traffic was becoming too much of a bottleneck.
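To make that concrete, a minimal pyzmq PUSH-PULL pipeline sketch (host and port are made up), with no broker in the middle:

    import zmq

    ctx = zmq.Context()
    push = ctx.socket(zmq.PUSH)
    push.bind("tcp://*:5557")

    for i in range(100):
        push.send_json({"job": i})   # round-robined across connected workers

    # Each worker, possibly on another machine:
    #   pull = ctx.socket(zmq.PULL)
    #   pull.connect("tcp://ventilator-host:5557")
    #   job = pull.recv_json()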
Summary:
Count the occurrences of "the" in a book with as little programming as possible and lots of setup/config time: Hadoop
Create atomic tasks and be able to have them work together with not too much programming and a lot of setup/config time: Celery
Have complete control over what to do with your messages and how to program them with almost no setup/config time: ZeroMQ
Have pain with no setup/config time: Sockets
I've been maintaining a program that uses multiprocessing.Manager due to some requirements; however, we have been getting a steady stream of errors such as timeouts, invalid references, and other similar errors.
Now I'm curious if there is a more developed alternative to multiprocessing.Manager that has better overall reliability and less state tracking on the client side.
I've tried Googling the subject, but due to the odd combination of keywords I only get bogus results.
Our usual use case is similar to this:
    # CustomManager is our multiprocessing.managers.BaseManager subclass
    # with a registered session() proxy; manager_address and manager_authkey
    # come from configuration.
    def connect():
        manager = CustomManager(address=manager_address, authkey=manager_authkey)
        manager.connect()
        session = manager.session()
        return session

    connect().some_function()
Judging by the question and your comments, if you want something more solid to manage processes, there are better alternatives to using the multiprocessing module. Below are two options you might want to explore:
Gearman
This is a description of the Gearman project.
Gearman provides a generic application framework to farm out work to other machines or processes that are better suited to do the work.
Instagram has workers written in python and uses Gearman to run these jobs in the background. You can read about it in the Task Queue section of this What Powers Instagram post.
Celery: Distributed Task Queue
Celery is an asynchronous task queue based on distributed message passing, focused on real-time operation. It is really popular in the Django community.
Both solutions are very scalable and used extensively, so you will find a lot of documentation and tutorials on how to use them. They are more involved, though, so there will be more of an initial learning curve. But I think it might be worth the time investment if you are hitting the limits of multiprocessing.
I'm writing a Python application that needs both concurrency and asynchronicity. I've had a few recommendations each for Twisted and Celery, but I'm having trouble determining which is the better choice for this application (I have no experience with either).
The application (which is not a web app) primarily centers around making SOAP calls out to various third party APIs. To process a given piece of data, I'll need to call several APIs sequentially. And I'd like to be able to have a pool of "workers" for each of these APIs so I can make more than 1 call at a time to each API. Nothing about this should be very cpu-intensive.
More specifically, an external process will add a new "Message" to this application's database. I will need a job that watches for new messages, and then pushes them through the Process. The process will contain 4-5 steps that need to happen in order, but can happen completely asynchronously. Each step will take the message and act upon it in some way, typically adding details to the message. Each subsequent step will require the output from the step that precedes it. For most of these Steps, the work involved centers around calling out to a third-party API typically with a SOAP client, parsing the response, and updating the message. A few cases will involve the creation of a binary file (harder to pickle, if that's a factor). Ultimately, once the last step has completed, I'll need to update a flag in the database to indicate the entire process is done for this message.
Also, since each step will involve waiting for a network response, I'd like to increase overall throughput by making multiple simultaneous requests at each step.
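As an aside, ordered steps like these map naturally onto Celery's chain primitive; a sketch with hypothetical task names follows, where each task receives the previous task's return value:

    from celery import Celery, chain

    app = Celery("pipeline", broker="amqp://localhost")

    @app.task
    def fetch_details(message): ...

    @app.task
    def enrich(message): ...

    @app.task
    def mark_done(message): ...

    # Steps run strictly in order, each possibly on a different worker:
    #   chain(fetch_details.s(msg), enrich.s(), mark_done.s()).delay()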
Is either Celery or Twisted a more generally appropriate framework here? If they'll both solve the problem adequately, are there pros/cons to using one vs the other? Is there something else I should consider instead?
Is either Celery or Twisted a more generally appropriate framework here?
Depends on what you mean by "generally appropriate".
If they'll both solve the problem adequately, are there pros/cons to using one vs the other?
Not an exhaustive list.
Celery Pros:
Ready-made distributed task queue, with rate limiting, retries, and remote workers
Rapid development
Comparatively shallow learning curve
Celery Cons:
Heavyweight: multiple processes, external dependencies
Have to run a message passing service
Application "processes" will need to fit Celery's design
Twisted Pros:
Lightweight: single process and not dependent on a message passing service
Rapid development (for those familiar with it)
Flexible
Probably faster, no "internal" message passing required.
Twisted Cons:
Steep learning curve
Not necessarily as easy to add processing capacity later.
I'm familiar with both, and from what you've said, if it were me I'd pick Twisted.
I'd say you'll get it done quicker using Celery, but you'd learn more while doing it by using Twisted. If you have the time and inclination to follow the steep learning curve, I'd recommend you do this in Twisted.
Celery allows you to use the asynchronous behavior of various async libraries like gevent and eventlet, so you can have the best of both worlds.
Example using eventlet
https://github.com/celery/celery/tree/master/examples/eventlet
Example using gevent
https://github.com/celery/celery/tree/master/examples/gevent
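As a sketch, an I/O-bound task suited to those pools might look like this (broker URL is made up; assumes the requests package), with the worker started using the -P pool option:

    from celery import Celery
    import requests

    app = Celery("tasks", broker="amqp://localhost")

    @app.task
    def fetch(url):
        # Network-bound work: hundreds of these can run concurrently
        # on a single worker when using the eventlet/gevent pool.
        return requests.get(url).status_code

    # Start the worker with an async pool, e.g.:
    #   celery -A tasks worker -P eventlet -c 1000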