I'm starting to venture into distributed code and am having trouble figuring out which solution fits my needs based on all the stuff out there. Basically I have a python list of data that I need to process with a single function. This function has a few nested for loops but doesn't take too long(about a min) for each item on the list. My problem is the list is very large(3000+ items). I'm looking at multiprocessing but I think I want to experiment with multi-server processing it(because ideally, if the data gets larger I want to be able to have the choice of adding more servers during the job to make it run quicker).
I basically looking for something that I can distribute this data list through(and not super needed but it would be nice if I could distribute my code base through this also)
So my question is, what package can I use to achieve this? My database is hbase so I already have hadoop running(never used hadoop though, just using it for the database). I looked at celery and twisted as well but I'm confused on which will fit my needs.
Any suggestions?
I would highly recommend celery. You can define a task that operates on a single item of your list:
from celery.task import task
#task
def process(i):
# do something with i
i += 1
# return a result
return i
You can easily parallelize a list like this:
results = []
todo = [1,2,3,4,5]
for arg in todo:
res = process.apply_async(args=(arg))
results.append(res)
all_results = [res.get() for res in results]
This is easily scalable by just adding more celery workers.
check out rabbitMQ. Python bindings are available through pika. start with a simple work_queue and run few rpc calls.
It may look troublesome to experiment distributed computing in python with an external engine like rabbitMQ (there's a small learning curve for installing and configuring the rabbit) but you may find it even more useful later.
... and celery can work hand-in-hand with rabbitMQ, checkout robert pogorzelski's tutorial and Simple distributed tasks with Celery and RabbitMQ
Related
Edit for clarify my question:
I want to attach a python service on uwsgi using this feature (I can't understand the examples) and I also want to be able to communicate results between them. Below I present some context and also present my first thought on the communication matter, expecting maybe some advice or another approach to take.
I have an already developed python application that uses multiprocessing.Pool to run on demand tasks. The main reason for using the pool of workers is that I need to share several objects between them.
On top of that, I want to have a flask application that triggers tasks from its endpoints.
I've read several questions here on SO looking for possible drawbacks of using flask with python's multiprocessing module. I'm still a bit confused but this answer summarizes well both the downsides of starting a multiprocessing.Pool directly from flask and what my options are.
This answer shows an uWSGI feature to manage daemon/services. I want to follow this approach so I can use my already developed python application as a service of the flask app.
One of my main problems is that I look at the examples and do not know what I need to do next. In other words, how would I start the python app from there?
Another problem is about the communication between the flask app and the daemon process/service. My first thought is to use flask-socketIO to communicate, but then, if my server stops I need to deal with the connection... Is this a good way to communicate between server and service? What are other possible solutions?
Note:
I'm well aware of Celery, and I pretend to use it in a near future. In fact, I have an already developed node.js app, on which users perform actions that should trigger specific tasks from the (also) already developed python application. The thing is, I need a production-ready version as soon as possible, and instead of modifying the python application, that uses multiprocessing, I thought it would be faster to create a simple flask server to communicate with node.js through HTTP. This way I would only need to implement a flask app that instantiates the python app.
Edit:
Why do I need to share objects?
Simply because the creation of the objects in questions takes too long. Actually, the creation takes an acceptable amount of time if done once, but, since I'm expecting (maybe) hundreds to thousands of requests simultaneously having to load every object again would be something I want to avoid.
One of the objects is a scikit classifier model, persisted on a pickle file, which takes 3 seconds to load. Each user can create several "job spots" each one will take over 2k documents to be classified, each document will be uploaded on an unknown point in time, so I need to have this model loaded in memory (loading it again for every task is not acceptable).
This is one example of a single task.
Edit 2:
I've asked some questions related to this project before:
Bidirectional python-node communication
Python multiprocessing within node.js - Prints on sub process not working
Adding a shared object to a manager.Namespace
As stated, but to clarify: I think the best solution would be to use Celery, but in order to quickly have a production ready solution, I trying to use this uWSGI attach daemon solution
I can see the temptation to hang on to multiprocessing.Pool. I'm using it in production as part of a pipeline. But Celery (which I'm also using in production) is much better suited to what you're trying to do, which is distribute work across cores to a resource that's expensive to set up. Have N cores? Start N celery workers, which of which can load (or maybe lazy-load) the expensive model as a global. A request comes in to the app, launch a task (e.g., task = predict.delay(args), wait for it to complete (e.g., result = task.get()) and return a response. You're trading a little bit of time learning celery for saving having to write a bunch of coordination code.
i need help with Python Queue for more (pareallel) users.
I make webapp, based on Flask, for processing relative lot of data. This processing works fine in parallel threads, but i need percentage information about processing (progressbar).
Percentage values i created on Python Queue and works great, but of course, when two (or more) users start processing, queues are mixed.
Is possible use maybe parallel queues or something like separate queues?
I tried use:
queues = {}
queues['user.id'] = Queue()
but i think, that is not best practice for this situation.
Thanks for any answers.
My scenario is as follows: I have a large machine learning model, which is computed by a bunch of workers. In essence workers compute their own part of a model and then exchange with results in order to maintain globally consistent state of model.
So, every celery task computes it's own part of job.
But this means, that tasks aren't stateless, and here is my trouble : if I say some_task.delay( 123, 456 ), in reality I'm NOT sending two integers here!
I'm sending whole state of task, which is pickled somewhere in Celery. This state is typically about 200 MB :-((
I know, that it's possible to select a decent serializer in Celery, but my question is how NOT to pickle just ANY data, which could be in task.
How to pickle arguments of task only?
Here is a citation from celery/app/task.py:
def __reduce__(self):
# - tasks are pickled into the name of the task only, and the reciever
# - simply grabs it from the local registry.
# - in later versions the module of the task is also included,
# - and the receiving side tries to import that module so that
# - it will work even if the task has not been registered.
mod = type(self).__module__
mod = mod if mod and mod in sys.modules else None
return (_unpickle_task_v2, (self.name, mod), None)
I simply don't want this to happen.
Is there a simple way around it, or I'm just forced to build my own Celery ( which is ugly to imagine)?
Don't use the celery results backend for this. Use a separate data store.
While you could just use Task.ignore_result this would mean that you loose the ability to track the tasks status etc.
The best solution would be to use one storage engine (e.g. Redis) for your results backend.
You should set up a separate storage engine (a separate instance of Redis, or maybe something like MongoDB, depending on your needs) to store the actual data.
In this way you can still see the status of your tasks but the large data sets do not affect the operation of celery.
Switching to the JSON serializer may reduce the serialization overhead, depending on the format of the data you generate . However it can't solve the underlying problem of putting too much data through the results backend.
The results backend can handle relatively small amounts of data - once you go over a certain limit you start to prevent the proper operation of its primary tasks - the communication of task status.
I would suggest updating your tasks so that they return a lightweight data structure containing useful metadata (to e.g. facilitate co-ordination between tasks), and storing the "real" data in a dedicated storage solution.
You have to define the ignore result from your task as it says in the docs:
Task.ignore_result
Don’t store task state. Note that this means you can’t use AsyncResult to check if the task is ready, or get its return value.
That'd be a little offtop, but still.
What as I understood is happening here. You have several processes, which do heavy calculations in parallel with inter-process communications. So, instead of unsatisfying in your case celery you could:
use zmq for inter-process communications (to send only necessary data),
use supervisor for managing and running processes (numprocs in particular will help with running multiple same workers).
While it will not require to write your own celery, some code will require to be written.
I have two dictionaries of data and I created a function that acts as a rules engine to analyze entries in each dictionaries and does things based on specific metrics I set(if it helps, each entry in the dictionary is a node in a graph and if rules match I create edges between them).
Here's the code I use(its a for loop that passes on parts of the dictionary to a rules function. I refactored my code to a tutorial I read):
jobs = []
def loadGraph(dayCurrent, day2Previous):
for dayCurrentCount in graph[dayCurrent]:
dayCurrentValue = graph[dayCurrent][dayCurrentCount]
for day1Count in graph[day2Previous]:
day1Value = graph[day2Previous][day1Count]
#rules(day1Count, day1Value, dayCurrentCount, dayCurrentValue, dayCurrent, day2Previous)
p = multiprocessing.Process(target=rules, args=(day1Count, day1Value, dayCurrentCount, dayCurrentValue, dayCurrent, day2Previous))
jobs.append(p)
p.start()
print ' in rules engine for day', dayCurrentCount, ' and we are about ', ((len(graph[dayCurrent])-dayCurrentCount)/float(len(graph[dayCurrent])))
The data I'm studying could be rather large(could, because its randomly generated). I think for each day there's about 50,000 entries. Because most of the time is spend on this stage, I was wondering if I could use the 8 cores I have available to help process this faster.
Because each dictionary entry is being compared to a dictionary entry from the day before, I thought the proceses could be split up by that but my above code is slower than using it normally. I think this is because its creating a new process for every entry its doing.
Is there a way to speed this up and use all my cpus? My problem is, I don't want to pass the entire dictionary because then one core will get suck processing it, I would rather have a the process split to each cpu or in a way that I maximum all free cpus for this.
I'm totally new to multiprocessing so I'm sure there's something easy I'm missing. Any advice/suggestions or reading material would be great!
What I've done in the past is to create a "worker class" that processes data entries. Then I'll spin up X number of threads that each run a copy of the worker class. Each item in the dataset gets pushed into a queue that the worker threads are watching. When there are no more items in the queue, the threads spin down.
Using this method, I was able to process 10,000+ data items using 5 threads in about 3 seconds. When the app was only single-threaded, this would take significantly longer.
Check out: http://docs.python.org/library/queue.html
I would recommend looking into MapReduce implementations in Python. Here's one: http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=mapreduce+python. Also, take a look at a python package called Celery: http://celeryproject.org/. With celery you can distribute your computation not only among cores on a single machine, but also to a server farm (cluster). You do pay for that flexibility with more involved setup/maintenance.
I have a problem with using Twisted for simple concurrency in python. The problem is - I don't know how to do it and all online resources are about Twisted networking abilities. So I am turning to SO-gurus for some guidance.
Python 2.5 is used.
Simplified version of my problem runs as follows:
A bunch of scientific data
A function that munches on the data and creates output
??? < here enters concurrency, it takes chunks of data from 1 and feeds it to 2
Output from 3 is joined and stored
My guess is that Twisted reactor can do the number three job. But how?
Thanks a lot for any help and suggestions.
upd1:
Simple example code. No idea how reactor deals with processes, so I have given it imaginary functions:
datum = 'abcdefg'
def dataServer(data):
for char in data:
yield chara
def dataWorker(chara):
return ord(chara)
r = reactor()
NUMBER_OF_PROCESSES_AV = 4
serv = dataserver(datum)
id = 0
result = array(len(datum))
while r.working():
if NUMBER_OF_PROCESSES_AV > 0:
r.addTask(dataWorker(serv.next(), id)
NUMBER_OF_PROCESSES_AV -= 1
id += 1
for pr, id in r.finishedProcesses():
result[id] = pr
As Jean-Paul said, Twisted is great for coordinating multiple processes. However, unless you need to use Twisted, and simply need a distributed processing pool, there are possibly better suited tools out there.
One I can think of which hasn't been mentioned is celery. Celery is a distributed task queue - you set up a queue of tasks running a DB, Redis or RabbitMQ (you can choose from a number of free software options), and write a number of compute tasks. These can be arbitrary scientific computing type tasks. Tasks can spawn subtasks (implementing your "joining" step you mention above). You then start as many workers as you need and compute away.
I'm a heavy user of Twisted and Celery, so in any case, both options are good.
To actually compute things concurrently, you'll probably need to employ multiple Python processes. A single Python process can interleave calculations, but it won't execute them in parallel (with a few exceptions).
Twisted is a good way to coordinate these multiple processes and collect their results. One library oriented towards solving this task is Ampoule. You can find more information about Ampoule on its Launchpad page: https://launchpad.net/ampoule.
Do you need Twisted at all?
From your description of the problem I'd say that multiprocessing would fit the bill. Create a number of Process objects that are given a reference to a single Queue instance. Get them to start their work and put their results on the Queue. Just use blocking get()s to read the results.
It seems to me that you are misunderstanding the fundamentals of how Twisted operates. I recommend you give the Twisted Intro a shot by Dave Peticolas. It has been a great help to me, and I've been using Twisted for years!
HINT: Everything in Twisted relies on the reactor!
(source: krondo.com)