I have a Python program that runs several independent, time-consuming processes. The code is essentially an automation script that calls several batch files via popen.
The program currently takes several hours, so I'd like to split it up across multiple machines. How can I split the tasks to run in parallel with Python over an intranet?
There are many Python parallelisation frameworks out there. Just two of the options:
The parallel computing facilities of IPython
The parallelisation framework jug
For the remote execution you could use execnet. Do you have to distribute the data too?
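A rough sketch of what the execnet route could look like, assuming the worker machines are reachable over SSH and have Python installed. The host names and batch file names are made up for illustration, and `assign_commands` / `run_on_hosts` are hypothetical helpers, not part of execnet:

```python
# Code executed on each remote machine: receive a list of shell commands,
# run them one after another, and send back (command, exit code) pairs.
REMOTE_SOURCE = """
import subprocess
commands = channel.receive()
for command in commands:
    channel.send((command, subprocess.call(command, shell=True)))
"""


def assign_commands(commands, hosts):
    """Deal the commands out to the hosts round-robin."""
    plan = {host: [] for host in hosts}
    for index, command in enumerate(commands):
        plan[hosts[index % len(hosts)]].append(command)
    return plan


def run_on_hosts(plan):
    """Run each host's commands remotely; the hosts work concurrently."""
    import execnet  # third-party: pip install execnet

    started = []
    for host, commands in plan.items():
        gateway = execnet.makegateway("ssh=" + host)
        channel = gateway.remote_exec(REMOTE_SOURCE)
        channel.send(commands)  # every host starts working immediately
        started.append((host, gateway, channel, len(commands)))

    results = []
    for host, gateway, channel, count in started:
        for _ in range(count):
            command, exit_code = channel.receive()
            results.append((host, command, exit_code))
        gateway.exit()
    return results
```

Something like `run_on_hosts(assign_commands(["step1.bat", "step2.bat"], ["pc1", "pc2"]))` would then return one `(host, command, exit_code)` tuple per batch file. This only ships the commands, so any input data still has to be present on each machine.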
I might suggest STAF. It's advertised as a software testing framework, yet it allows for distribution of activities across multiple PCs (and multiple platforms). You can run scripts, copy data, and easily communicate between your multiple sessions. Best of all, it's fairly easy to integrate with already existing scripts.
Related
I need to run some numpy computation on 5000 files in parallel using python. I have the sequential single machine version implemented already. What would be the easiest way to run the code in parallel (say using an ec2 cluster)? Should I write my own task scheduler and job distribution code?
You can have a look at the pscheduler Python module. It lets you queue up your jobs and works through them in order; the number of concurrent processes depends on the available CPU cores. It can also scale up and submit your jobs to remote machines, but that requires all of the remote machines to share the files over NFS.
I'll be happy to help you further.
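For comparison, the same queue-and-worker idea can be sketched on a single machine with only the standard library (this is not pscheduler's API). `process_file` is a placeholder for the real per-file numpy computation, and `run_all` is a hypothetical helper name:

```python
import os
from concurrent.futures import ProcessPoolExecutor, as_completed


def process_file(path):
    """Placeholder for the real per-file computation: returns the file size."""
    return os.path.getsize(path)


def run_all(paths, max_workers=None, executor_class=ProcessPoolExecutor):
    """Process every path in a worker pool and return {path: result}.

    With max_workers=None, ProcessPoolExecutor picks a worker count based
    on the number of CPUs available.
    """
    results = {}
    with executor_class(max_workers=max_workers) as pool:
        futures = {pool.submit(process_file, path): path for path in paths}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

When this is run as a script, the call to `run_all` should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Spreading the work over several EC2 instances would still need something on top of this, for example giving each instance its own slice of the file list.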
There are lots of different modules for threading/parallelizing python. Dispy and pp/ParallelPython seem especially popular. It looks like these are all designed for a single interface (e.g. desktop) which has many cores/processors. Is there a module which works on massively parallel architectures which are run by queue systems (specifically: SLURM)?
The most widely used parallel framework on large compute clusters for scientific/technical applications is MPI. The Python package is mpi4py, a standalone package (not part of SciPy) that works well with NumPy arrays.
MPI offers an API for creating parallel software that communicates over the network using messages: remote process creation, data scatter/gather, reductions, etc. MPI implementations can take advantage of fast, low-latency networks if present. The common implementations also integrate with the major cluster managers, including Slurm.
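A minimal sketch of the scatter/reduce pattern with mpi4py, assuming mpi4py and an MPI implementation are installed. `split_evenly` and `main` are hypothetical names, and the per-item "work" is a stand-in (string lengths):

```python
def split_evenly(items, parts):
    """Split items into `parts` interleaved chunks of near-equal size."""
    return [items[i::parts] for i in range(parts)]


def main(items):
    """Scatter items across MPI ranks, work locally, reduce on rank 0."""
    from mpi4py import MPI  # third-party: pip install mpi4py

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Only the root rank prepares the chunks; scatter hands one to each rank.
    chunks = split_evenly(items, size) if rank == 0 else None
    my_chunk = comm.scatter(chunks, root=0)

    # Stand-in for the real computation on this rank's share of the data.
    partial = sum(len(item) for item in my_chunk)

    # Combine the partial results; only rank 0 receives the total.
    total = comm.reduce(partial, op=MPI.SUM, root=0)
    if rank == 0:
        print("total:", total)
    return total
```

A script that calls `main(...)` would typically be launched with something like `mpiexec -n 4 python script.py`, or from inside a Slurm job with `srun`, depending on how MPI is set up on the cluster.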
Via the ParallelPython main page:
"PP is a python module which provides mechanism for parallel execution of python code on SMP (systems with multiple processors or cores) and clusters (computers connected via network)."
I'm looking at using inotify to watch about 200,000 directories for new files. When a file is created, the watching script will process it and then remove it. Because this is part of a more complex system with many processes, I want to benchmark it and collect system performance statistics (CPU, memory, disk, etc.) while the tests run.
I'm planning on running the inotify script as a daemon and having a second script generating test files in several of the directories (randomly selected before the test).
I'm after suggestions for the best way to benchmark the performance of something like this, especially the impact it has on the Linux server it's running on.
I would try and remove as many other processes as possible in order to get a repeatable benchmark. For example, I would set up a separate, dedicated server with an NFS mount to the directories. This server would only run inotify and the Python script. For simple server measurements, I would use top or ps to monitor CPU and memory.
The real test is how quickly your script "drains" the directories, which depends entirely on your process. You could profile the script and see where it's spending the time.
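For the profiling part, a small sketch using the standard library's cProfile. `drain_directory` here is only a stand-in for whatever the real script does per batch of files, and `profile_call` is a hypothetical helper:

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, text report).

    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def drain_directory(paths):
    """Stand-in for the real per-file processing."""
    return sum(len(path) for path in paths)
```

Calling `profile_call(drain_directory, list_of_paths)` and printing the report shows which functions dominate the run time. It says nothing about the load on the rest of the server, so it complements top/ps rather than replacing them.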
We are developing a distributed application in Python. Right now, we are about to re-organize some of our system components and deploy them on separate servers, so I'm looking to understand more about deployment for an application such as this. We will have several back-end code servers, several database servers (of different types) and possibly several front-end servers.
My question is this: what are good deployment patterns for distributed applications (in Python or in general)? How can I manage pushing code to several servers (whose IPs should be parameterized in the deployment system), pushing static files to several front ends, starting and stopping processes on the servers, etc.? Ideally we want an easy-to-use solution, but mostly something that, once set up, will get out of our way and let us deploy as painlessly as possible.
To clarify: we are aware that there is no one standard solution for this particular application, but this question is rather more geared towards a guide of best practices for different types / parts of deployment than a single, unified solution.
Thanks so much! Any suggestions regarding this or other deployment / architecture pointers will be very appreciated.
It all depends on your application.
You can:
use Puppet to deploy servers,
use Fabric to remotely connect to the servers and execute specific tasks,
use pip for distributing Python modules (even non-public ones) and install dependencies,
use other tools for specific tasks (such as boto to work with the Amazon Web Services APIs, e.g. to start a new instance).
It is not always that simple and you will most likely need something customized. Just take a look at your system: it is not so "standard", so do not expect it to be handled in a "standard" way.
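As an illustration of the Fabric part, here is a sketch that keeps the server addresses in a parameterized inventory and runs a list of commands per role. It assumes Fabric 2 or later and SSH access to the hosts. The role names, package name, paths, and commands are all invented for the example, and `build_plan` / `deploy` are hypothetical helpers:

```python
# Hypothetical commands per server role; adjust to the real system.
DEPLOY_STEPS = {
    "backend": [
        "pip install --upgrade myapp",
        "sudo systemctl restart myapp",
    ],
    "frontend": [
        "rsync -a /srv/releases/static/ /var/www/static/",
    ],
}


def build_plan(inventory):
    """Turn {role: [hosts]} into {host: [commands]}."""
    plan = {}
    for role, hosts in inventory.items():
        for host in hosts:
            plan.setdefault(host, []).extend(DEPLOY_STEPS[role])
    return plan


def deploy(plan):
    """Run each host's commands over SSH, one host at a time."""
    from fabric import Connection  # third-party: pip install fabric

    for host, commands in plan.items():
        with Connection(host) as conn:
            for command in commands:
                conn.run(command)
```

With an inventory such as `{"backend": ["10.0.0.1", "10.0.0.2"], "frontend": ["10.0.0.3"]}`, `deploy(build_plan(inventory))` would run the matching steps on each server. Keeping the inventory as plain data is what makes the IPs easy to parameterize.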
Is there a testing framework (preferably in Python) that executes tests, monitors their progress (failed/passed/timeout), and controls VMware? Thanks
I am trying to automate functional testing in VMware using AutoIt scripts. The VMs are controlled by a small Python script on the host machine (it deploys test files into the VMs, executes them, and collects the result data). But it now looks like a lot of work to make this script manage and execute a whole series of test cases.
Thanks a lot!
Cheers,
Zhe
There are lots of continuous integration tools that may do what you want.
One implemented in Python that may fit your need is Buildbot - it can manage running builds and tests across multiple machines and consolidating the results.