I am doing some research this summer and working on parallelizing pre-existing code. The main focus right now is a way to load balance the code so that it will run more efficiently on the cluster. The current task is to make a proof of concept that creates several processes, each with its own stack of work; when a process finishes its own stack, it queries the two closest processes to see if they have any more work available in their stacks.
I am having difficulties conceptualizing this in Python, but was hoping someone could point me in the right direction or to some sort of similar example in either mpi4py or ParallelPython. Also, if anyone knows of a better or easier module, that would be great to know.
Thanks.
Here's a simple way to do this.
Create a single common shared queue of work to do; a producer application fills this queue.
Create a consumer application which gets one item from the queue and does the work.
This is the single-producer-multiple-consumer design. It works well and can swamp your machine with parallel processes.
To use the built-in queue class (http://docs.python.org/library/queue.html), you need to wrap the queue in some kind of multi-processing API. Personally, I like to create a small HTTP-based web server that handles the queue: each consumer application does a GET to fetch the next piece of work.
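For example, here is a minimal sketch of such an HTTP front end using only the standard library; the port, URL path, and job names are made up for illustration.

# queue_server.py -- hypothetical sketch: serve one work item per GET request
import queue
from http.server import BaseHTTPRequestHandler, HTTPServer

work = queue.Queue()
for i in range(100):
    work.put("job-%d" % i)          # the producer fills the queue with work

class WorkHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            item = work.get_nowait()
        except queue.Empty:
            self.send_error(404, "no more work")   # consumers stop on 404
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(item.encode())

HTTPServer(("", 8000), WorkHandler).serve_forever()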
You can use tools like RabbitMQ to create a very nice shared queue.
http://nathanborror.com/posts/2009/may/20/working-django-and-rabbitmq/
You might be able to use http://hjb.python-hosting.com/ to make use of JMS queues.
You'll need a small application to create and fill the queue with work.
Create as many copies of the application as you like. For example:
for i in 1 2 3 4 5 6 7 8 9 10
do
    python myapp.py &
done
This will run 10 concurrent copies of your application. All 10 are trying to get work from a single queue. They will use all available CPU resources and the OS will nicely schedule them for you.
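A hypothetical myapp.py to go with the HTTP-queue idea above might look roughly like this; the URL is made up, adjust it to wherever your queue server runs.

# myapp.py -- hypothetical worker: fetch work items until the queue is empty
import urllib.error
import urllib.request

while True:
    try:
        with urllib.request.urlopen("http://localhost:8000/work") as resp:
            item = resp.read().decode()
    except urllib.error.HTTPError:
        break                        # 404 -> no more work, exit
    # ... do the real work on `item` here ...
    print("processed", item)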
Peer-to-peer, node-to-node synchronization means you have O(n*(n-1)/2) communication paths among all nodes.
The "two-adjacent nodes" means you still have 2*n communication paths and work has to "somehow" trickle among the nodes. If the nodes are all initially seeded with work, then someone did a lot of planning to balance the workload. If you're going to do that much planning, why ask the nodes to synchronize at all?
If the queues are not carefully balanced to begin with, then every even node could be slow. Every odd node could be fast. The odd nodes finish first, check for work from two even nodes, and those nodes are (a) not done and (b) don't have more work to do, either. What now? Half the nodes are working, half are idle. All due to poor planning in the initial distribution of work.
Master-slave means you have n communication paths. Further, the balancing is automatic since all idle nodes have equal access to work. There's no such thing as a biased initial distribution that leads to poor overall performance.
Use the multiprocessing module.
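For instance, a minimal sketch with a process pool; the work function here is just a placeholder.

from multiprocessing import Pool

def process_item(item):
    return item * item              # placeholder for the real work

if __name__ == "__main__":
    with Pool(processes=4) as pool:             # roughly one worker per core
        results = pool.map(process_item, range(100))
    print(results[:5])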
Say if I have the following deployment on SLURM:
cluster = SLURMCluster(processes=1, cores=25, walltime="1-00:00:00")
cluster.scale(20)
client = Client(cluster)
So I will have 20 nodes each with 25 cores.
Is there a way to tell the Slurm scheduler to start all nodes at the same time, instead of starting each one individually as it becomes available?
A specific example: when nodes are started individually, those that started the earliest might wait for several hours, say 2, until all 20 nodes are ready. Not only is this a waste of resources, it also makes my effective job time less than 24 hours (e.g. 22 hours).
This is something one can do easily with dask_mpi, where a single batch job is allocated. I am wondering if it's possible to do this with dask_jobqueue specifically.
dask-jobqueue itself doesn't propose such a functionality.
It is designed to submit independent jobs. So to achieve this you would have to look at the possibilities of the job queuing system, Slurm in your case, and see if this is possible without dask-jobqueue. Then you should try to add the correct options to dask-jobqueue if you can, through the job_extra_directives kwarg for example.
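To illustrate the mechanics only: extra #SBATCH directives can be passed through like this. The --exclusive flag and the memory value are placeholders, not a known way to force all jobs to start together.

from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    processes=1,
    cores=25,
    memory="100GB",                          # placeholder value
    walltime="1-00:00:00",
    job_extra_directives=["--exclusive"],    # older dask-jobqueue versions use job_extra=
)
cluster.scale(20)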
I'm not aware of such a functionality within Slurm, but there are so many knobs it is hard to tell. I know this is not possible with PBS.
A good option to achieve what you want is, as you said, using dask-mpi.
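A rough sketch of the dask-mpi approach: a single batch job, launched with srun or mpirun across the whole allocation, hosts the scheduler, the client script, and the workers, so everything starts together. The process count below is just an example.

# run e.g. as:  srun -n 22 python run_with_dask_mpi.py
from dask_mpi import initialize
from distributed import Client

initialize()        # rank 0 becomes the scheduler, rank 1 runs this client
                    # script, the remaining ranks become workers
client = Client()   # connects to the scheduler created by initialize()

# ... submit your Dask work through `client` as usual ...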
A final thought: you could also start your computation with the first two nodes, not waiting for the others to be ready. This should be doable in most cases.
The Python threading module documentation says something like this:
In CPython, due to the Global Interpreter Lock, only one thread can
execute Python code at once (even though certain performance-oriented
libraries might overcome this limitation). If you want your
application to make better use of the computational resources of
multi-core machines, you are advised to use multiprocessing. However,
threading is still an appropriate model if you want to run multiple
I/O-bound tasks simultaneously.
Can someone explain whether I can use threading module in my situation or not?
I'm going to detect the frameworks used by websites.
So here is how my app works
My MySQL database contains around 10 million domains ( id, domain, frameworks )
Fetch 1000 rows from the database
Scrape domain one by one using requests module
Detect the frameworks
Update the database row with the results.
Since I have 10 million domains, it's going to take a very long time. So I would like to speed up the process by using threads.
But I'm not sure whether my app is I/O-bound or not. Can someone explain?
Thank you.
I guess the most time-expensive activity will be fetching all the URLs.
So the answer to your question is: yes, your app is very likely to be I/O-bound.
You plan to scrape domains one by one; this would lead to a really long processing time. You should definitely do that concurrently. One solution is described in my answer to a similar question related to scraping web sites.
Anyway, the number of your URLs seems really large, so you might benefit from splitting the work across multiple workers - for this purpose you might use e.g. the Celery framework. However, as your task is really I/O-bound, you will gain extra speed only if your workers run on multiple computers, ideally with independent connectivity. I did a similar task on DigitalOcean machines and it worked very well.
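To give an idea of the threaded, I/O-bound part within a single machine, here is a minimal sketch. detect_frameworks is a stand-in for your own detection logic, and the short domain list stands in for the 1000 rows fetched from MySQL.

from concurrent.futures import ThreadPoolExecutor
import requests

def detect_frameworks(html):
    return ["wordpress"] if "wp-content" in html else []   # stand-in logic

def check_domain(domain):
    try:
        resp = requests.get("http://" + domain, timeout=10)
        return domain, detect_frameworks(resp.text)
    except requests.RequestException:
        return domain, None          # unreachable or timed out

domains = ["example.com", "example.org"]     # really: rows from your database

with ThreadPoolExecutor(max_workers=50) as pool:
    for domain, frameworks in pool.map(check_domain, domains):
        print(domain, frameworks)    # here you would UPDATE the database row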
I know that there are similar questions to this, such as:
https://stackoverflow.com/questions/8232194/pros-and-cons-of-celery-vs-disco-vs-hadoop-vs-other-distributed-computing-packag
Differentiate celery, kombu, PyAMQP and RabbitMQ/ironMQ
but I'm asking this because I'm looking for a more particular distinction backed by a couple of use-case examples, please.
So, I'm a python user who wants to make programs that either/both:
Are too large to
Take too long to
do on a single machine, and process them on multiple machines. I am familiar with the (single-machine) multiprocessing package in Python, and I write map-reduce-style code right now. I know that my function, for example, is easily parallelizable.
In asking my usual smart CS advice-givers, I have phrased my question as:
"I want to take a task, split it into a bunch of subtasks that are executed simultaneously on a bunch of machines, then those results to be aggregated and dealt with according to some other function, which may be a reduce, or may be instructions to serially add to a database, for example."
According to this break-down of my use-case, I think I could equally well use Hadoop or a set of Celery workers + RabbitMQ broker. However, when I ask the sage advice-givers, they respond to me as if I'm totally crazy to look at Hadoop and Celery as comparable solutions. I've read quite a bit about Hadoop, and also about Celery---I think I have a pretty good grasp on what both do---what I do not seem to understand is:
Why are they considered so separate, so different?
Given that they seem to be received as totally different technologies---in what ways? What are the use cases that distinguish one from the other or are better for one than another?
What problems could be solved with both, and what areas would it be particularly foolish to use one or the other for?
Are there possibly better, simpler ways to achieve multiprocessing-like Pool.map()-functionality to multiple machines? Let's imagine my problem is not constrained by storage, but by CPU and RAM required for calculation, so there isn't an issue in having too little space to hold the results returned from the workers. (ie, I'm doing something like simulation where I need to generate a lot of things on the smaller machines seeded by a value from a database, but these are reduced before they return to the source machine/database.)
I understand Hadoop is the big-data standard, but Celery also looks well supported; I appreciate that it isn't Java (the streaming API Python has to use for Hadoop looked uncomfortable to me), so I'd be inclined to use the Celery option.
They are the same in that both can solve the problem that you describe (map-reduce). They are different in that Hadoop is entirely built to solve only that use case, while Celery/RabbitMQ is built to facilitate task execution on different nodes using message passing. Celery also supports different use cases.
Hadoop solves the map-reduce problem by having a large, special filesystem from which the mapper takes its data, sends it to a bunch of map nodes, and reduces it back to that filesystem. This has the advantage that it is really fast at doing this. The downsides are that it only operates on text-based data input, that Python is not really supported, and that you can't do (slightly) different use cases.
Celery is a message-based task executor. In it you define tasks and group them together in a workflow (which can be a map-reduce workflow). Its advantages are that it is Python-based and that you can stitch tasks together in a custom workflow. Disadvantages are its reliance on a single broker/result backend and its setup time.
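To make that concrete, a minimal Celery sketch might look like this; the broker URL and task body are placeholders, and you would start a worker with `celery -A tasks worker`.

# tasks.py
from celery import Celery

app = Celery("tasks", broker="amqp://guest@localhost//", backend="rpc://")

@app.task
def process_chunk(chunk):
    return sum(chunk)                # placeholder for the real work

# from any client process:
#   result = process_chunk.delay([1, 2, 3])
#   print(result.get())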
So if you have a couple of Gb's worth of logfiles and don't care to write in Java and have some servers to spare that are exclusively used to run Hadoop, use that. If you want flexibility in running workflowed tasks use Celery. Or.....
Yes! There is a new project from one of the companies that helped create the messaging protocol AMQP that is used by RabbitMQ (and others). It is called ZeroMQ and it takes distributed messaging/execution to the next level by strangely going down a level in abstraction compared to Celery. It defines sockets that you can link together in various ways to create messaging links between nodes. Anything you want to do with these messages is up to you to write. Although this might sound like "what good is a thin wrapper around a socket", it is actually at the right level of abstraction. Right now at our company we are factoring out all our Celery messaging and rebuilding it with ZeroMQ. We found that Celery is just too opinionated about how tasks should be executed and that the setup/config in general is a pain. Also, the broker in the middle that has to handle all traffic was becoming too much of a bottleneck.
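As a flavor of that level of abstraction, a bare PUSH/PULL pipeline with pyzmq looks roughly like this; the port is arbitrary, and everything above it (retries, results, routing) is yours to write.

# ventilator.py -- hands out work items
import zmq

ctx = zmq.Context()
sender = ctx.socket(zmq.PUSH)
sender.bind("tcp://*:5557")
for i in range(100):
    sender.send_json({"task_id": i})

# worker.py -- run as many copies as you like, on any node that can reach the ventilator
import zmq

ctx = zmq.Context()
receiver = ctx.socket(zmq.PULL)
receiver.connect("tcp://localhost:5557")
while True:
    task = receiver.recv_json()
    # do whatever you want with task["task_id"] here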
To summarize:
Count the occurrences of "the" in a book with as little programming as possible and lots of setup/config time: Hadoop
Create atomic tasks and be able to have them work together with not too much programming and a lot of setup/config time: Celery
Have complete control over what to do with your messages and how to program them with almost no setup/config time: ZeroMQ
Have pain with no setup/config time: Sockets
I have a problem with using Twisted for simple concurrency in python. The problem is - I don't know how to do it and all online resources are about Twisted networking abilities. So I am turning to SO-gurus for some guidance.
Python 2.5 is used.
A simplified version of my problem runs as follows:
1. A bunch of scientific data
2. A function that munches on the data and creates output
3. ??? <- here enters concurrency; it takes chunks of data from 1 and feeds them to 2
4. Output from 3 is joined and stored
My guess is that the Twisted reactor can do the number 3 job. But how?
Thanks a lot for any help and suggestions.
upd1:
Simple example code. I have no idea how the reactor deals with processes, so I have given it imaginary functions:
datum = 'abcdefg'

def dataServer(data):
    for char in data:
        yield char

def dataWorker(char):
    return ord(char)

r = reactor()                      # imaginary reactor object
NUMBER_OF_PROCESSES_AV = 4
serv = dataServer(datum)
id = 0
result = array(len(datum))         # imaginary result container
while r.working():
    if NUMBER_OF_PROCESSES_AV > 0:
        # imaginary: schedule dataWorker on the next chunk, tagged with id
        r.addTask(dataWorker, serv.next(), id)
        NUMBER_OF_PROCESSES_AV -= 1
        id += 1
    for pr, id in r.finishedProcesses():
        NUMBER_OF_PROCESSES_AV += 1    # a worker slot is free again
        result[id] = pr
As Jean-Paul said, Twisted is great for coordinating multiple processes. However, unless you need to use Twisted, and simply need a distributed processing pool, there are possibly better suited tools out there.
One I can think of which hasn't been mentioned is Celery. Celery is a distributed task queue - you set up a queue of tasks backed by a DB, Redis or RabbitMQ (you can choose from a number of free software options), and write a number of compute tasks. These can be arbitrary scientific-computing-type tasks. Tasks can spawn subtasks (implementing the "joining" step you mention above). You then start as many workers as you need and compute away.
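As an illustration of the spawn-subtasks-then-join pattern, a Celery chord runs a group of tasks in parallel and feeds the list of results to a callback. The broker URL and task bodies below are placeholders.

from celery import Celery, chord

app = Celery("sim", broker="redis://localhost", backend="redis://localhost")

@app.task
def simulate(seed):
    return seed * seed                   # placeholder compute task

@app.task
def combine(results):
    return sum(results) / len(results)   # the "joining" step

job = chord(simulate.s(seed) for seed in range(100))(combine.s())
print(job.get())                         # blocks until all subtasks finish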
I'm a heavy user of Twisted and Celery, so in any case, both options are good.
To actually compute things concurrently, you'll probably need to employ multiple Python processes. A single Python process can interleave calculations, but it won't execute them in parallel (with a few exceptions).
Twisted is a good way to coordinate these multiple processes and collect their results. One library oriented towards solving this task is Ampoule. You can find more information about Ampoule on its Launchpad page: https://launchpad.net/ampoule.
Do you need Twisted at all?
From your description of the problem I'd say that multiprocessing would fit the bill. Create a number of Process objects that are given a reference to a single Queue instance. Get them to start their work and put their results on the Queue. Just use blocking get()s to read the results.
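Here is a minimal sketch of that Process + Queue approach, reusing the datum/ord example from the question; the None sentinel for shutting workers down is one common convention.

from multiprocessing import Process, Queue

def worker(in_q, out_q):
    for char in iter(in_q.get, None):    # None is the "stop" sentinel
        out_q.put((char, ord(char)))     # munch on the data

if __name__ == "__main__":
    datum = "abcdefg"
    in_q, out_q = Queue(), Queue()
    procs = [Process(target=worker, args=(in_q, out_q)) for _ in range(4)]
    for p in procs:
        p.start()
    for char in datum:
        in_q.put(char)
    for _ in procs:
        in_q.put(None)                   # one sentinel per worker
    result = dict(out_q.get() for _ in datum)   # blocking get()s
    for p in procs:
        p.join()
    print(result)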
It seems to me that you are misunderstanding the fundamentals of how Twisted operates. I recommend you give the Twisted Intro by Dave Peticolas a shot. It has been a great help to me, and I've been using Twisted for years!
HINT: Everything in Twisted relies on the reactor!
[diagram from the Twisted Intro] (source: krondo.com)
Python seems to have many different packages available to assist one in parallel processing on an SMP-based system or across a cluster. I'm interested in building a client-server system in which a server maintains a queue of jobs and clients (local or remote) connect and run jobs until the queue is empty. Of the packages available, which is recommended and why?
Edit: In particular, I have written a simulator which takes in a few inputs and processes things for a while. I need to collect enough samples from the simulation to estimate a mean within a user-specified confidence interval. To speed things up, I want to be able to run simulations on many different systems, each of which reports back to the server at some interval with the samples it has collected. The server then calculates the confidence interval and determines whether the client process needs to continue. After enough samples have been gathered, the server terminates all client simulations, reconfigures the simulation based on past results, and repeats the process.
With this need for intercommunication between the client and server processes, I question whether batch-scheduling is a viable solution. Sorry I should have been more clear to begin with.
Have a go with ParallelPython. Seems easy to use, and should provide the jobs and queues interface that you want.
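If I remember the pp API correctly, a job-submission sketch looks roughly like this; the simulate function is a placeholder, and you would add ppservers=(...) to spread jobs over remote nodes.

import pp

def simulate(seed):
    return seed * seed                   # placeholder for one simulation run

job_server = pp.Server()                 # autodetects local CPUs by default
jobs = [job_server.submit(simulate, (s,)) for s in range(100)]
samples = [job() for job in jobs]        # calling a job object waits for its result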
There are also now two different Python wrappers around the map/reduce framework Hadoop:
http://code.google.com/p/happy/
http://wiki.github.com/klbostee/dumbo
Map/Reduce is a nice development pattern with lots of recipes for solving common patterns of problems.
If you don't already have a cluster, Hadoop itself is nice because it has full job scheduling, automatic distribution of data across the cluster (i.e. HDFS), etc.
Given that you tagged your question "scientific-computing", and mention a cluster, some kind of MPI wrapper seems the obvious choice, if the goal is to develop parallel applications as one might guess from the title. Then again, the text in your question suggests you want to develop a batch scheduler. So I don't really know which question you're asking.
The simplest way to do this would probably be just to output the intermediate samples to separate files (or a database) as they finish, and have a process occasionally poll these output files to see if they're sufficient or if more jobs need to be submitted.
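A minimal sketch of such a polling coordinator, assuming workers append their samples as JSON files into a results/ directory; the directory name and confidence threshold are made up.

import glob
import json
import statistics
import time

TARGET_HALF_WIDTH = 0.05                 # desired confidence-interval half width

while True:
    samples = []
    for path in glob.glob("results/*.json"):
        with open(path) as f:
            samples.extend(json.load(f))
    if len(samples) > 1:
        half_width = 1.96 * statistics.stdev(samples) / len(samples) ** 0.5
        if half_width < TARGET_HALF_WIDTH:
            break                        # enough samples: stop or reconfigure clients
    time.sleep(30)                       # otherwise poll again / submit more jobs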