Clarification of use-cases for Hadoop versus RabbitMQ+Celery - python

I know that there are similar questions to this, such as:
https://stackoverflow.com/questions/8232194/pros-and-cons-of-celery-vs-disco-vs-hadoop-vs-other-distributed-computing-packag
Differentiate celery, kombu, PyAMQP and RabbitMQ/ironMQ
but I'm asking this because I'm looking for a more particular distinction backed by a couple of use-case examples, please.
So, I'm a Python user who wants to write programs that either (or both):
are too large to do on a single machine, or
take too long to do on a single machine,
and process them on multiple machines instead. I am familiar with the (single-machine) multiprocessing package in Python, and I write map-reduce-style code right now. I know that my function, for example, is easily parallelizable.
In asking my usual smart CS advice-givers, I have phrased my question as:
"I want to take a task, split it into a bunch of subtasks that are executed simultaneously on a bunch of machines, then those results to be aggregated and dealt with according to some other function, which may be a reduce, or may be instructions to serially add to a database, for example."
According to this break-down of my use-case, I think I could equally well use Hadoop or a set of Celery workers + RabbitMQ broker. However, when I ask the sage advice-givers, they respond to me as if I'm totally crazy to look at Hadoop and Celery as comparable solutions. I've read quite a bit about Hadoop, and also about Celery---I think I have a pretty good grasp on what both do---what I do not seem to understand is:
Why are they considered so separate, so different?
Given that they seem to be received as totally different technologies---in what ways? What are the use cases that distinguish one from the other or are better for one than another?
What problems could be solved with both, and what areas would it be particularly foolish to use one or the other for?
Are there better, simpler ways to get multiprocessing-style Pool.map() functionality across multiple machines? Let's imagine my problem is constrained not by storage but by the CPU and RAM required for calculation, so there isn't an issue of having too little space to hold the results returned from the workers. (i.e., I'm doing something like a simulation where I need to generate a lot of things on the smaller machines, seeded by a value from a database, but these are reduced before they return to the source machine/database.)
I understand Hadoop is the big-data standard, but Celery also looks well supported; I appreciate that it isn't Java (the streaming API Python has to use for Hadoop looked uncomfortable to me), so I'd be inclined to use the Celery option.
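For concreteness, the single-machine pattern I already use and want to scale out looks roughly like this (simulate and aggregate are stand-ins for my real functions):

    from multiprocessing import Pool

    def simulate(seed):
        # stand-in for one simulation run seeded by a value from the database
        return seed * seed

    def aggregate(partials):
        # stand-in for the reduce step (could equally be serial database writes)
        return sum(partials)

    if __name__ == "__main__":
        seeds = range(1000)                       # seeds pulled from the database
        with Pool() as pool:
            partials = pool.map(simulate, seeds)  # fans out over local cores only
        print(aggregate(partials))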

They are the same in that both can solve the problem that you describe (map-reduce). They are different in that Hadoop is built entirely to solve only that use case, while Celery/RabbitMQ is built to facilitate task execution on different nodes using message passing; Celery also supports many other use cases.
Hadoop solves the map-reduce problem by having a large, special-purpose filesystem from which the mapper takes its data, sending it to a bunch of map nodes and reducing it back to that filesystem. The advantage is that it is really fast at doing this. The downsides are that it only operates on text-based data input, that Python is not really well supported, and that you can't easily do (even slightly) different use cases.
Celery is a message-based task executor. In it you define tasks and group them together in a workflow (which can be a map-reduce workflow). Its advantages are that it is Python-based and that you can stitch tasks together in a custom workflow. Its disadvantages are its reliance on a single broker/result backend and its setup time.
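For example, a map-reduce style workflow in Celery can be written as a chord: a group of tasks plus a callback that runs once they all finish. This is only a sketch; it assumes a RabbitMQ broker on localhost, and the task names are made up:

    from celery import Celery, chord

    app = Celery("tasks", broker="amqp://localhost", backend="rpc://")

    @app.task
    def process_chunk(chunk):
        # "map" step, runs on whichever worker picks it up
        return sum(chunk)

    @app.task
    def combine(partials):
        # "reduce" step, called once with the list of all map results
        return sum(partials)

    # fan the chunks out to the workers, then reduce the partial results
    result = chord(process_chunk.s(c) for c in [[1, 2], [3, 4], [5, 6]])(combine.s())
    print(result.get())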
So if you have a couple of GBs' worth of logfiles, don't mind writing Java, and have some servers to spare that are used exclusively to run Hadoop, use that. If you want flexibility in running workflowed tasks, use Celery. Or.....
Yes! There is a newer project from one of the companies that helped create AMQP, the messaging protocol used by RabbitMQ (and others). It is called ZeroMQ, and it takes distributed messaging/execution to the next level by, strangely, going down a level in abstraction compared to Celery. It defines sockets that you can link together in various ways to create messaging links between nodes; anything you want to do with those messages is up to you to write. Although this might sound like "what good is a thin wrapper around a socket?", it is actually at the right level of abstraction. Right now at our company we are factoring out all our Celery messaging and rebuilding it with ZeroMQ. We found that Celery is just too opinionated about how tasks should be executed and that the setup/config in general is a pain. Also, that broker in the middle that has to handle all traffic was becoming too much of a bottleneck.
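To give a feel for the level of abstraction, here is a minimal pyzmq sketch of a PUSH/PULL pipeline, a ventilator fanning work out to however many workers connect (hostname, port and payloads are made up):

    import zmq

    def ventilator():
        # run on the machine that hands out work
        ctx = zmq.Context()
        sender = ctx.socket(zmq.PUSH)
        sender.bind("tcp://*:5557")
        for i in range(100):
            sender.send_json({"task_id": i})

    def worker():
        # run on each worker machine; ZeroMQ load-balances the pushes for you
        ctx = zmq.Context()
        receiver = ctx.socket(zmq.PULL)
        receiver.connect("tcp://ventilator-host:5557")  # hypothetical hostname
        while True:
            task = receiver.recv_json()
            # ...do the work, push the result on to a sink socket...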
Summary:
Count the occurrences of "the" in a book with as little programming as possible and lots of setup/config time: Hadoop
Create atomic tasks and be able to have them work together with not too much programming and a lot of setup/config time: Celery
Have complete control over what to do with your messages and how to program them with almost no setup/config time: ZeroMQ
Have pain with no setup/config time: Sockets

Call Python tasks from Golang

I have been building a big data application for stock market analysis, about 5 TB of records per day. I use Golang for data transformation/calculation and for saving into Cassandra/MySQL. Python has very good libraries for data analysis (Pandas, Spark, etc.), but there is no easy way to do multicore processing there, and the analysis takes a lot of time.
So, I want to call Python data analysis tasks concurrently from Golang. One way is to execute a command-line task directly, but I think there should be a more scalable solution. Maybe there is a library for communication between Golang and Python. I thought maybe I should create multiple Python Flask servers and hand tasks to them. Speed is important, but I can sacrifice some of it for a concise solution. Any ideas?
Splitting your app into multiple servers, as you've suggested, carries some trade-offs.
On the plus side, splitting it up gives you more flexibility in terms of load balancing. In other words, if your Flask servers are overburdened, you can always spin up a few more and scale horizontally behind a load balancer. Of course this assumes that whatever you're doing on those Flask servers can be done in parallel (it depends on your actual business logic).
It also offers high availability: you eliminate one potential single point of failure.
However, this 'microservice' approach does incur some overhead:
more code to write, since now you're writing 2 kinds of servers
some network overhead, since now you're communicating over the network as opposed to function calls.
more machines to spin up (although you could run everything in containers, and they could all be on the same machine if you don't need the extra processing power)
You could consider using Google's Protocol Buffers (protobuf) to serialize/deserialize the messages. It's language-agnostic and saves some of the network overhead. It's not as easy as sending JSON, but if efficiency is paramount it might be worth the trouble. Plus it's supported in both Python and Go.
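If you do go the Flask route, each analysis server can be little more than a single endpoint the Go side posts work to; a rough sketch (the route name and the pandas-based "analysis" are just placeholders):

    from flask import Flask, jsonify, request
    import pandas as pd

    app = Flask(__name__)

    @app.route("/analyze", methods=["POST"])
    def analyze():
        records = request.get_json()          # the Go side posts a batch of records
        df = pd.DataFrame(records)
        summary = df.describe().to_dict()     # placeholder for the real analysis
        return jsonify(summary)

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)

Put a load balancer in front of several of these and the Go side only ever needs one URL.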

celery vs pyro : is Pyro an alternative to Celery?

I am trying to learn about Celery and was wondering: are Celery and Pyro trying to achieve the same thing?
Could somebody please tell me if there is something which Celery can do which Pyro can not, or vice versa?
As I read the official websites, Celery and Pyro are intended to do different jobs, but the confusion is pretty natural.
The objective of both packages is to help you with distributed computing, but with different approaches: Celery is intended to be a distributed task scheduler, meaning that if you have a bunch of (largely uncorrelated) tasks, you can distribute them over a computer grid or over the network.
Pyro, on the other hand, aims to establish communication between objects over the network. If you have one pretty big task that you can't divide into small uncorrelated tasks, but that is made up of a bunch of objects which are independent yet regularly need information from one another, then Pyro enables the communication between them, so you can perform the task by distributing the objects over a computer grid or over the network.
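A minimal sketch of that object style using the Pyro4 API (the Simulator class and its method are made up):

    import Pyro4

    @Pyro4.expose
    class Simulator:
        def step(self, state):
            # hypothetical method that other nodes call remotely
            return {"value": state["value"] + 1}

    # server side: publish one object and serve remote calls to it
    daemon = Pyro4.Daemon(host="0.0.0.0")
    uri = daemon.register(Simulator())
    print("object available at", uri)
    daemon.requestLoop()

    # client side (another machine), given the printed uri:
    #   proxy = Pyro4.Proxy(uri)
    #   proxy.step({"value": 1})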
You posted this with the Django tag, so it's relevant to say that the requests made to a web application can be seen as a bunch (a big one, as concurrency increases) of uncorrelated tasks, so Celery might be what you are looking for.
The answer above explains the differences between Pyro and Celery.
But in light of all the other changes that have happened over the years with respect to Python and the availability of Python ZeroMQ libraries and function pickling, it might be worth taking a look at leveraging ZeroMQ together with PiCloud's function pickling. This creates a whole new way to build distributed stacks.
See the sample code linked on the jeffknupp.com blog.
Yes, of course you can stick with Celery to develop distributed task workers, and with Pyro you can develop remote-procedure-call applications. With Celery and Pyro you are doing all of this in the Python world, whereas ZeroMQ has implementations in a dozen different languages and implements the common networking patterns like PUB-SUB, REQ-REP, pipelines, etc. This opens up language-agnostic possibilities.

Writing a Python data analysis server for a Java interface

I want to write data analysis plugins for a Java interface. This interface is potentially run on different computers. The interface will send commands and the Python program can return large data. The interface is distributed by a Java Webstart system. Both access the main data from a MySQL server.
What are the different ways to implement the communication, and what are their advantages? Of course, I've done some research on the internet. While there are many suggestions, I still don't know what the differences are or how to decide on one (I have no prior knowledge of them).
I've found a suggestion to use sockets, which seems fine. Is it simple to write a server that dedicates a Python analysis process for each connection (temporary data might be kept after one communication request for that particular client)?
I was thinking to learn how to use sockets and pass YAML strings.
Maybe my main question is: What is the relation to and advantage of systems like RabbitMQ, ZeroMQ, CORBA, SOAP, XMLRPC?
There were also suggestions to use pipes or shared memory, but that wouldn't fit my requirements, would it?
Do any of the methods have advantages for debugging, or other peculiarities?
I hope someone can help me understand the technology and help me decide on a solution, as it is hard to judge from technical descriptions.
(I do not consider solutions like Jython, JEPP, ...)
Offering an opinion on the requirements you described: it sounds like you are dealing with potentially large data/queries that may take a lot of time to fetch and serialize, in which case you definitely want to go with something that can handle concurrent connections without stacking up threads. With that in mind, in the Python domain I can't recommend any networking library other than Twisted.
http://twistedmatrix.com/documents/current/core/examples/
Whether you decide to use vanilla HTTP or your own protocol, Twisted is pretty much the one-stop shop for concurrent networking. Sure, the name gets thrown around a lot, and the documentation is Atlantean, but if you take the time to learn it there is very little in the networking domain you cannot accomplish. You can extend the base protocols and factories to make one server that handles your data in a reactor-based event loop and responds to deferred requests when ready.
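A minimal sketch of that shape, a protocol/factory pair that handles every connection inside the reactor loop (the line-based command handling is just an illustration):

    from twisted.internet import reactor
    from twisted.internet.protocol import Factory
    from twisted.protocols.basic import LineReceiver

    class AnalysisProtocol(LineReceiver):
        def lineReceived(self, line):
            # hypothetical: parse a command from the Java side and answer it
            command = line.decode("utf-8").strip()
            self.sendLine(("received: " + command).encode("utf-8"))

    class AnalysisFactory(Factory):
        def buildProtocol(self, addr):
            return AnalysisProtocol()

    reactor.listenTCP(8007, AnalysisFactory())
    reactor.run()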
The serialization format really depends on the nature of the data. Will there be any binary in what is output as a response? Complex types? If so, that rules out JSON, though JSON is becoming the most common serialization format. YAML sometimes seems to enjoy a position of privilege in the Python community; I haven't used it extensively, as most of the serialization work I've done was for data to be rendered in a frontend with JavaScript.
Message queues are really the most important tool in the toolbox when you need to defer background tasks without hanging the response. They are commonly employed in web apps where the HTTP request should not hang until whatever complex processing needs to take place completes, so the UI can render early and count on an implicit "promise" that the processing will take place. They have two important traits: they rely on eventual consistency, in that the process can finish long after the response in the protocol is sent, and they have fail-safe and try-again directives should a task fail. They are where you turn in the "do this really hard task as soon as you can, and I trust you to get it done" problem domain.
If we are not talking about potentially HUGE response bodies, and relatively simple data types within the serialized output, there is nothing wrong with rolling a simple HTTP deferred server in Twisted.

Twisted or Celery? Which is right for my application with lots of SOAP calls?

I'm writing a Python application that needs both concurrency and asynchronicity. I've had a few recommendations each for Twisted and Celery, but I'm having trouble determining which is the better choice for this application (I have no experience with either).
The application (which is not a web app) primarily centers around making SOAP calls out to various third-party APIs. To process a given piece of data, I'll need to call several APIs sequentially. And I'd like to have a pool of "workers" for each of these APIs so I can make more than one call at a time to each API. Nothing about this should be very CPU-intensive.
More specifically, an external process will add a new "Message" to this application's database. I will need a job that watches for new messages, and then pushes them through the Process. The process will contain 4-5 steps that need to happen in order, but can happen completely asynchronously. Each step will take the message and act upon it in some way, typically adding details to the message. Each subsequent step will require the output from the step that precedes it. For most of these Steps, the work involved centers around calling out to a third-party API typically with a SOAP client, parsing the response, and updating the message. A few cases will involve the creation of a binary file (harder to pickle, if that's a factor). Ultimately, once the last step has completed, I'll need to update a flag in the database to indicate the entire process is done for this message.
Also, since each step will involve waiting for a network response, I'd like to increase overall throughput by making multiple simultaneous requests at each step.
Is either Celery or Twisted a more generally appropriate framework here? If they'll both solve the problem adequately, are there pros/cons to using one vs the other? Is there something else I should consider instead?
Is either Celery or Twisted a more generally appropriate framework here?
Depends on what you mean by "generally appropriate".
If they'll both solve the problem adequately, are there pros/cons to using one vs the other?
Not an exhaustive list.
Celery Pros:
Ready-made distributed task queue, with rate limiting, retries, remote workers
Rapid development
Comparatively shallow learning curve
Celery Cons:
Heavyweight: multiple processes, external dependencies
Have to run a message passing service
Application "processes" will need to fit Celery's design
Twisted Pros:
Lightweight: single process and not dependent on a message passing service
Rapid development (for those familiar with it)
Flexible
Probably faster, no "internal" message passing required.
Twisted Cons:
Steep learning curve
Not necessarily as easy to add processing capacity later.
I'm familiar with both, and from what you've said, if it were me I'd pick Twisted.
I'd say you'll get it done quicker using Celery, but you'd learn more while doing it by using Twisted. If you have the time and inclination to follow the steep learning curve, I'd recommend you do this in Twisted.
Celery also lets you use the asynchronous behaviour of async libraries like gevent and eventlet, so you can have the best of both worlds.
Example using eventlet
https://github.com/celery/celery/tree/master/examples/eventlet
Example using gevent
https://github.com/celery/celery/tree/master/examples/gevent
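To make that concrete for the pipeline described in the question, the ordered steps can be expressed as a chain of Celery tasks, and because the work is network-bound you can start the worker with an eventlet pool (e.g. celery -A tasks worker -P eventlet -c 100) to keep many SOAP calls in flight per process. This is only a sketch; the step names and broker URL are made up:

    from celery import Celery, chain

    app = Celery("tasks", broker="amqp://localhost")

    @app.task
    def call_first_api(message_id):
        # hypothetical step 1: SOAP call, parse the response, update the message
        return message_id

    @app.task
    def call_second_api(message_id):
        # hypothetical step 2: uses step 1's output, another SOAP call
        return message_id

    @app.task
    def mark_done(message_id):
        # final step: flip the "processed" flag in the database
        return message_id

    # each task's return value is handed to the next task in order
    chain(call_first_api.s(42), call_second_api.s(), mark_done.s()).apply_async()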

Python parallel processing libraries

Python seems to have many different packages available to assist one in parallel processing on an SMP based system or across a cluster. I'm interested in building a client server system in which a server maintains a queue of jobs and clients (local or remote) connect and run jobs until the queue is empty. Of the packages listed above, which is recommended and why?
Edit: In particular, I have written a simulator which takes in a few inputs and processes things for a while. I need to collect enough samples from the simulation to estimate a mean within a user-specified confidence interval. To speed things up, I want to be able to run simulations on many different systems, each of which reports back to the server at some interval with the samples it has collected. The server then calculates the confidence interval and determines whether the client processes need to continue. After enough samples have been gathered, the server terminates all client simulations, reconfigures the simulation based on past results, and repeats the process.
With this need for intercommunication between the client and server processes, I question whether batch scheduling is a viable solution. Sorry, I should have been clearer to begin with.
Have a go with ParallelPython. Seems easy to use, and should provide the jobs and queues interface that you want.
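A rough sketch of what that looks like with the pp module (the simulate function is a placeholder; remote nodes would run ppserver and be listed in ppservers):

    import pp

    def simulate(seed):
        return seed * seed          # placeholder for one simulation run

    # autodetects local CPUs; add ("host:port", ...) entries to use remote ppserver nodes
    job_server = pp.Server(ppservers=())

    jobs = [job_server.submit(simulate, (s,)) for s in range(100)]
    results = [job() for job in jobs]   # calling a job blocks until its result is ready
    print(sum(results))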
There are also now two different Python wrappers around the map/reduce framework Hadoop:
http://code.google.com/p/happy/
http://wiki.github.com/klbostee/dumbo
Map/Reduce is a nice development pattern with lots of recipes for solving common patterns of problems.
If you don't already have a cluster, Hadoop itself is nice because it has full job scheduling, automatic distribution of data across the cluster (i.e. HDFS), etc.
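Both wrappers let you write the map and reduce steps as plain Python functions; the classic word count in dumbo's style is roughly (a sketch based on dumbo's documented interface):

    def mapper(key, value):
        # value is one line of input text
        for word in value.split():
            yield word, 1

    def reducer(key, values):
        # values are all the counts emitted for one word
        yield key, sum(values)

    if __name__ == "__main__":
        import dumbo
        dumbo.run(mapper, reducer)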
Given that you tagged your question "scientific-computing", and mention a cluster, some kind of MPI wrapper seems the obvious choice, if the goal is to develop parallel applications as one might guess from the title. Then again, the text in your question suggests you want to develop a batch scheduler. So I don't really know which question you're asking.
The simplest way to do this would probably be just to output the intermediate samples to separate files (or a database) as they finish, and have a process occasionally poll those output files to see if they're sufficient or if more jobs need to be submitted.
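A bare-bones version of that polling loop could look like this (the file layout and the stopping rule are made up):

    import glob
    import time

    TARGET_SAMPLES = 10000          # hypothetical stopping criterion

    def samples_so_far():
        # each finished job appends its samples to a file under results/
        return sum(len(open(path).readlines()) for path in glob.glob("results/*.txt"))

    while samples_so_far() < TARGET_SAMPLES:
        time.sleep(60)              # poll once a minute
        # ...optionally submit more jobs here if progress is too slow...

    print("enough samples collected; recompute the confidence interval and reconfigure")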
