Alternative to multiprocessing.manager in Python

Alternative to multiprocessing.manager in Python - python

I've been managing a program that uses multiprocessing.manager due to some requirements, however we have been getting a steady amount of errors such as timeouts, invalid references and other similar errors.
Now I'm curious if there is a more developed alternative to multiprocessing.manager that has better overall reliability and less state tracking on the client side.
I've tried Google on the subject but due to the odd combination of keywords I only receive bogus results.
Our usual use case is similar to this:
def connect():
manager = CustomManager(address=manager_address, authkey=manager_authkey)
manager.connect()
session = manager.session()
return session
connect().some_function()

Judging by the question and your comments, If you want something more solid to manage processes there are better alternatives to using the multiprocessing module. Below are two options you might want to explore:
Gearman
This is a description of the Gearman project.
Gearman provides a generic application framework to farm out work to
other machines or processes that are better suited to do the work
Instagram has workers written in python and uses Gearman to run these jobs in the background. You can read about it in the Task Queue section of this What Powers Instagram post.
Celery: Distributed Task Queue
Celery is an asynchronous task queue based on distributed message passing, it is focused on real-time operation. It is really popular in the Django community.
Both solutions are very scalable and used extensively so you will find a lot of documentation and tutorials on how to use them. They are more involved though so there will be more of an initial learning curve. But I think it might be worth the time investment if you are hitting the limit of multiprocessing.

Related

How do I run background job in Flask without threading or task-queue

I am building REST API with Flask-restplus. One of my endpoints takes a file uploaded from client and run some analysis. The job uses up to 30 seconds. I don't want the job to block the main process. So the endpoint will return a response with 200 or 201 right away, the job can still be running. Results will be saved to database which will be retrieved later.
It seems I have two options for long-running jobs.
Threading
Task-queue
Threading is relatively simpler. But problem is, there is a limit of thread numbers for Flask app. In a standalone Python app, I could use a queue for the threads. But this is REST api, each request call is independent. I don't know if there is a way to maintain a global queue for that. So if the requests exceed the thread limit, it won't be able to take more requests.
Task-queue with Celery and Redis is probably better option. But this is just a proof of concept thing, and time line is kind of tight. Setting up Celery, Redis with Flask is not easy, I am having lots of trouble on my dev machine which is a Windows. It will be deployed on AWS which is kind of complex.
I wonder if there is a third option for this case?

I would HIGHLY recommend using Celery as you have already mentioned in your post. It is built exactly for this use case. Their docs are really informative and there are no shortage of examples online that can get you up and running quickly.
Additionally, I would say THIS would be an excellent first resource for you to start with.

Celery is a fantastic solution to this problem I have used quite successfully in the past to manage millions of jobs per day.
The only real downside is the initial learning curve and complexity of debugging when things go sour (it can happen, especially with millions of jobs).

Fastest, simplest way to handle long-running upstream requests for Django

I'm using Django with Uwsgi. We have 8 processes running, and I have no real indication that our code is particularly thread safe, as it was never designed with threads in mind.
Recently, we added the ability to get live rates from vendors of a service through their various APIs and display them at once for the user. The problem is these requests are old web services technologies, and due to their response times, the time needed before the all rates from vendors are acquired (or it gives up), can be up to 10 seconds.
This presents a problem. We have a pretty decent amount of traffic on our site, and the customers need to look at these rates pretty often. With only 8 processes, it's quite easy to see how the server can get tied up waiting on these upstream requests. Especially when other optimizations need to be made to make the site baseline faster anyway (we're working on that).
We made a separate library (which should be mostly threadsafe, and if not, should be converted to it easily enough) for the rates requesting, and we can separate out its configuration. So I was thinking of making a separate service with its own threads, perhaps in Twisted, and having the browser contact that service for JSON instead of having it run in the main Django server.
Is this solution a good one? Can you think of a better or simpler way to do it? Should I use something other than Twisted, and if so, why?

If you want to use your code in-process with Django, you can simply call out to your Twisted by using Crochet, which can automatically manage the creation, running, and shutdown of the reactor within whatever WSGI implementation you choose (presuming that it behaves like a regular Python process, at least).
Obviously it might be less complex to just run within the Twisted WSGI container :-).
It might also be worth looking at TReq to issue your service client requests; your new "thread safe" library will still have the disadvantage of tying up an entire thread for each blocking client, which is a non-trivial amount of memory and additional concurrency overhead, whereas with Twisted you will only need to worry about a couple of objects.

Clarification of use-cases for Hadoop versus RabbitMQ+Celery

I know that there are similar questions to this, such as:
https://stackoverflow.com/questions/8232194/pros-and-cons-of-celery-vs-disco-vs-hadoop-vs-other-distributed-computing-packag
Differentiate celery, kombu, PyAMQP and RabbitMQ/ironMQ
but I'm asking this because I'm looking for a more particular distinction backed by a couple of use-case examples, please.
So, I'm a python user who wants to make programs that either/both:
Are too large to
Take too long to
do on a single machine, and process them on multiple machines. I am familiar with the (single-machine) multiprocessing package in python, and I write mapreduce style code right now. I know that my function, for example, is easily parallelizable.
In asking my usual smart CS advice-givers, I have phrased my question as:
"I want to take a task, split it into a bunch of subtasks that are executed simultaneously on a bunch of machines, then those results to be aggregated and dealt with according to some other function, which may be a reduce, or may be instructions to serially add to a database, for example."
According to this break-down of my use-case, I think I could equally well use Hadoop or a set of Celery workers + RabbitMQ broker. However, when I ask the sage advice-givers, they respond to me as if I'm totally crazy to look at Hadoop and Celery as comparable solutions. I've read quite a bit about Hadoop, and also about Celery---I think I have a pretty good grasp on what both do---what I do not seem to understand is:
Why are they considered so separate, so different?
Given that they seem to be received as totally different technologies---in what ways? What are the use cases that distinguish one from the other or are better for one than another?
What problems could be solved with both, and what areas would it be particularly foolish to use one or the other for?
Are there possibly better, simpler ways to achieve multiprocessing-like Pool.map()-functionality to multiple machines? Let's imagine my problem is not constrained by storage, but by CPU and RAM required for calculation, so there isn't an issue in having too little space to hold the results returned from the workers. (ie, I'm doing something like simulation where I need to generate a lot of things on the smaller machines seeded by a value from a database, but these are reduced before they return to the source machine/database.)
I understand Hadoop is the big data standard, but Celery also looks well supported; I appreciate that it isn't java (the streaming API python has to use for hadoop looked uncomfortable to me), so I'd be inclined to use the Celery option.

They are the same in that both can solve the problem that you describe (map-reduce). They are different in that Hadoop is entirely build to solve only that usecase and Celey/RabbitMQ is build to facilitate Task execution on different nodes using message passing. Celery also supports different usecases.
Hadoop is solving the map-reduce problem by having a large and special filesystem from which the mapper takes its data, sends it to a bunch of map nodes and reduces it to that filesystem. This has the advantage that it is really fast in doing this. The downsides are that it only operates on text based data input, Python is not really supported and that if you can't do (slightly) different usecases.
Celery is a message based task executor. In it you define tasks and group them together in a workflow (which can be a map-reduce workflow). Its advantages are that it is python based, that you can stitch tasks together in a custom workflow. Disadvantages are its reliance on single broker/result backend and its setup time.
So if you have a couple of Gb's worth of logfiles and don't care to write in Java and have some servers to spare that are exclusively used to run Hadoop, use that. If you want flexibility in running workflowed tasks use Celery. Or.....
Yes! There is a new project from one of the companies that helped create the messaging protocol AMQP that is used by RabbitMQ (and others). It is called ZeroMQ and it takes distributed messaging/execution to the next level by strangely going down a level in abstraction compared to Celery. It defines sockets that you can link together in various ways to create messaging links between nodes. Anything you want to do with these messages is up to you to write. Although this might sounds like "what good is a thin wrapper around a socket" it is actually at the right level of abstraction. Right now at our company we are factoring out all our celery messaging and rebuilding it with ZeroMQ. We found that Celery is just too opinionated about how tasks should be executed and that the setup/config in general is a pain. Also that broker in the middle that has to handle all traffic was becoming to much of a bottleneck.
Resume:
Count the occurrences of "the" in a book with as less programming as possible and lots of setup/config time: Hadoop
Create atomic Tasks and be able to have them work together with not to much programming and a lot of setup/config time: Celery
Have complete control over what to do with your messages and how to program them with almost no setup/config time: ZeroMQ
Have pain with no setup/config time: Sockets

celery vs pyro : is Pyro an alternative to Celery?

I am trying to learn about Celery and was wondering if Celery and Pyro are trying to achieve the same thing ?
Could somebody please tell me if there is something which Celery can do which Pyro can not, or vice versa?

As I see in the official websites, Celery and Pyro, are intent to do different jobs but the confusion is pretty natural.
The objective in both of the packages is help you with distributed computing but with different approaches: Celery is intent to be a distributed task scheduler, it means, if you have a bunch of tasks (very uncorrelated) you can distribute them over a computer grid or over the network.
While, Pyro aims to establish a communication gateway between object over the network, it means, if you have a pretty big task, that you can't divide in little uncorrelated tasks, but with a bunch of objects, that are independent but usually need information about the others, then Pyro enables the communication between them, so you can perform the task distributing the objects in a computer grid or over the network.
You post this with the Django tag, so it will be relevant for you to say, that the requests that are performed to a web application can be seen as a bunch (a big one as the concurrency increases) of uncorrelated tasks, so Celery might be what you are looking for.

The answer above explains the differences between Pyro and Celery.
But in light of all the other changes that have happened over the years wrt to Python and the availability of Python ZeroMQ libraries and function picking, it might be worth taking a look at leveraging ZeroMQ and PiCloud's function pickling. This creates a whole new way to build distributed stacks.
See link sample code on jeffknupp.com blog
Yes, of course you can stick to Celery to develop distributed workers of tasks. And with Pyro, you can develop remote-procedure call applications. With Celery and Pyro, you are doing all of this in the Python world whereas with ZeroMQ they have implementations in a dozen different languages and it implements the common patterns for networking like PUB-SUB,REQ-RES,PIPES, etc. This opens up the possibility of creating language agnostic possibilities.

Twisted or Celery? Which is right for my application with lots of SOAP calls?

I'm writing a Python application that needs both concurrency and asynchronicity. I've had a few recommendations each for Twisted and Celery, but I'm having trouble determining which is the better choice for this application (I have no experience with either).
The application (which is not a web app) primarily centers around making SOAP calls out to various third party APIs. To process a given piece of data, I'll need to call several APIs sequentially. And I'd like to be able to have a pool of "workers" for each of these APIs so I can make more than 1 call at a time to each API. Nothing about this should be very cpu-intensive.
More specifically, an external process will add a new "Message" to this application's database. I will need a job that watches for new messages, and then pushes them through the Process. The process will contain 4-5 steps that need to happen in order, but can happen completely asynchronously. Each step will take the message and act upon it in some way, typically adding details to the message. Each subsequent step will require the output from the step that precedes it. For most of these Steps, the work involved centers around calling out to a third-party API typically with a SOAP client, parsing the response, and updating the message. A few cases will involve the creation of a binary file (harder to pickle, if that's a factor). Ultimately, once the last step has completed, I'll need to update a flag in the database to indicate the entire process is done for this message.
Also, since each step will involve waiting for a network response, I'd like to increase overall throughput by making multiple simultaneous requests at each step.
Is either Celery or Twisted a more generally appropriate framework here? If they'll both solve the problem adequately, are there pros/cons to using one vs the other? Is there something else I should consider instead?

Is either Celery or Twisted a more generally appropriate framework here?
Depends on what you mean by "generally appropriate".
If they'll both solve the problem adequately, are there pros/cons to using one vs the other?
Not an exhaustive list.
Celery Pros:
Ready-made distributed task queue, with rate-limiting, re-tries, remote workers
Rapid development
Comparatively shallow learning curve
Celery Cons:
Heavyweight: multiple processes, external dependencies
Have to run a message passing service
Application "processes" will need to fit Celery's design
Twisted Pros:
Lightweight: single process and not dependent on a message passing service
Rapid development (for those familiar with it)
Flexible
Probably faster, no "internal" message passing required.
Twisted Cons:
Steep learning curve
Not necessarily as easy to add processing capacity later.
I'm familiar with both, and from what you've said, if it were me I'd pick Twisted.
I'd say you'll get it done quicker using Celery, but you'd learn more while doing it by using Twisted. If you have the time and inclination to follow the steep learning curve, I'd recommend you do this in Twisted.

Celery allows you to use asynchronous behavior of various async library like gevent and eventlet. So you can have best of both world.
Example using eventlet
https://github.com/celery/celery/tree/master/examples/eventlet
Example using gevent
https://github.com/celery/celery/tree/master/examples/gevent

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.