I need to run some numpy computation on 5000 files in parallel using Python. I have the sequential single-machine version implemented already. What would be the easiest way to run the code in parallel (say, using an EC2 cluster)? Should I write my own task scheduler and job distribution code?
You can have a look at the pscheduler Python module. It will let you queue up your jobs and run them, with the number of concurrent processes depending on the available CPU cores. It can also scale up and submit your jobs to remote machines, but that requires all of your remote machines to share an NFS mount.
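Before adding cluster machinery, it may be worth checking how far one machine gets you. Below is a minimal sketch using the standard library's concurrent.futures; the data directory, file pattern, and the process_file body are hypothetical stand-ins for your existing per-file computation, not something pscheduler provides:

from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
import numpy as np

def process_file(path):
    # Hypothetical stand-in for your existing sequential computation.
    data = np.load(path)
    return path.name, float(data.sum())

if __name__ == "__main__":
    files = sorted(Path("data").glob("*.npy"))  # assumed file layout
    # One worker process per CPU core; each file is an independent task.
    with ProcessPoolExecutor() as pool:
        for name, result in pool.map(process_file, files):
            print(name, result)

If a single machine's cores turn out to be enough, this avoids distributing data across nodes entirely.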
I'll be happy to help you further.
I'm trying to get a feel for how MLRun executes my Python code. What different runtimes are supported and why would I use one vs the other?
MLRun has several different ways to run a piece of code. At this time, the following runtimes are supported:
Batch runtimes
local - execute a Python or shell program in your local environment (e.g. Jupyter, IDE)
job - run the code in a Kubernetes Pod
dask - run the code as a Dask Distributed job (over Kubernetes)
mpijob - run distributed jobs and Horovod over the MPI job operator, used mainly for deep learning jobs
spark - run the job as a Spark job (using Spark Kubernetes Operator)
remote-spark - run the job on a remote Spark service/cluster (e.g. Iguazio Spark service)
Real-time runtimes
nuclio - real-time serverless functions over Nuclio
serving - higher level real-time Graph (DAG) over one or more Nuclio functions
If you are interested in learning more about each runtime, see the documentation.
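As a rough, hedged sketch of how the runtime is chosen (the function name, file name, and handler below are hypothetical), the kind argument selects among the runtimes listed above when wrapping code as an MLRun function:

import mlrun

# Wrap existing code as an MLRun function; `kind` selects the runtime
# ("local", "job", "dask", "mpijob", "spark", "nuclio", ...).
fn = mlrun.code_to_function(
    name="my-func",          # hypothetical function name
    filename="handler.py",   # hypothetical file containing handler()
    kind="job",              # run in a Kubernetes Pod
    image="mlrun/mlrun",
)
run = fn.run(handler="handler")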
I have a dedicated application server that does analytics.
I'm running on a 2-CPU, 8 GB RAM machine.
I have two instances of the same application running, like below.
python do_analytics.py &
python do_analytics.py &
However, my CPU usage is below 20%. Can I run more processes to make full use of my CPU? Will it speed things up, or will my individual processes run slower now since I only have 2 CPUs?
Thanks.
The fact that your CPU usage is below 20% means that your CPU can take more load. So yes, you can run more processes.
Will it speed things up, or will my individual processes run slower now since I only have 2 CPUs?
It depends on what your application is doing. If the analytics logic mostly uses processing power and memory, you need not worry. But if more processes mean more disk access or contention for a shared resource, then running more processes may reduce the overall performance.
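If the work really is CPU-bound, one way to saturate both cores from a single entry point, instead of launching the script twice by hand, is the standard library's multiprocessing module. A minimal sketch, where run_analytics is a hypothetical stand-in for whatever do_analytics.py does:

import multiprocessing as mp

def run_analytics(job_id):
    # Hypothetical stand-in for the work done by do_analytics.py.
    total = sum(i * i for i in range(10_000_000))
    return job_id, total

if __name__ == "__main__":
    # One worker per core; extra jobs queue up rather than oversubscribing.
    with mp.Pool(processes=mp.cpu_count()) as pool:
        for job_id, total in pool.map(run_analytics, range(8)):
            print(job_id, total)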
There are lots of different modules for threading/parallelizing Python. Dispy and pp/ParallelPython seem especially popular. It looks like these are all designed for a single machine (e.g. a desktop) with many cores/processors. Is there a module which works on massively parallel architectures run by queue systems (specifically: SLURM)?
The most used parallel framework on large compute clusters for scientific/technical applications is MPI. The name of the Python package is mpi4py.
MPI offers a high-level API for creating parallel software using messages for communicating over the network; remote process creation, data scatter/gather, reductions, etc. All implementations are able to take advantage of fast and low-latency networks if present. It is fully integrated with all cluster managers, including Slurm.
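A minimal sketch of the scatter/gather pattern with mpi4py follows; the work items are made up for illustration. Under Slurm you would launch it with something like srun -n 4 python script.py:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Rank 0 prepares one chunk of (made-up) work per rank.
chunks = [list(range(i, 100, size)) for i in range(size)] if rank == 0 else None

# Scatter the chunks, compute a local partial result, gather at rank 0.
my_chunk = comm.scatter(chunks, root=0)
partial = sum(x * x for x in my_chunk)
results = comm.gather(partial, root=0)

if rank == 0:
    print("total:", sum(results))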
Via the ParallelPython main page:
"PP is a python module which provides mechanism for parallel execution of python code on SMP (systems with multiple processors or cores) and clusters (computers connected via network)."
I'm looking at using inotify to watch about 200,000 directories for new files. On creation, the watching script will process the file, and then it will be removed. Because it is part of a more complex system with many processes, I want to benchmark this and get system performance statistics on CPU, memory, disk, etc. while the tests run.
I'm planning on running the inotify script as a daemon and having a second script generating test files in several of the directories (randomly selected before the test).
I'm after suggestions for the best way to benchmark the performance of something like this, especially the impact it has on the Linux server it's running on.
I would try and remove as many other processes as possible in order to get a repeatable benchmark. For example, I would set up a separate, dedicated server with an NFS mount to the directories. This server would only run inotify and the Python script. For simple server measurements, I would use top or ps to monitor CPU and memory.
The real test is how quickly your script "drains" the directories, which depends entirely on your process. You could profile the script and see where it's spending the time.
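If you take the profiling route, the standard library's cProfile is enough to see where the drain script spends its time. A minimal sketch, where watch_and_process is a hypothetical stand-in for your inotify loop:

import cProfile
import pstats

def watch_and_process():
    # Hypothetical stand-in for the inotify watch/process/remove loop.
    pass

# Profile one run and dump the raw statistics to a file.
cProfile.run("watch_and_process()", "drain.prof")

# Show the 10 functions with the most cumulative time.
stats = pstats.Stats("drain.prof")
stats.sort_stats("cumulative").print_stats(10)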
I have a Python program that performs several independent and time-consuming processes. The Python code is essentially an automator that calls into several batch files via popen.
The program currently takes several hours, so I'd like to split it up across multiple machines. How can I split tasks to process in parallel with Python over an intranet?
There are many Python parallelisation frameworks out there. Just two of the options:
The parallel computing facilities of IPython
The parallelisation framework jug
For the remote execution you could use execnet. Do you have to distribute the data too?
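As a rough sketch of execnet's remote-execution model (the SSH host and the command below are made-up placeholders), a gateway ships a code string to a remote Python interpreter and a channel carries results back:

import execnet

# Open an SSH gateway to a remote machine (hypothetical host name).
gw = execnet.makegateway("ssh=user@worker1")

# remote_exec runs this code string in a remote Python interpreter;
# `channel` is provided there by execnet for sending results back.
channel = gw.remote_exec("""
    import subprocess
    # Stand-in for invoking one of the batch files via popen.
    out = subprocess.run(["echo", "done"], capture_output=True, text=True)
    channel.send(out.stdout.strip())
""")

print(channel.receive())  # -> "done"
gw.exit()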
I might suggest STAF. It's advertised as a software testing framework, yet it allows for distribution of activities across multiple PCs (and multiple platforms). You can run scripts, copy data, and easily communicate between your multiple sessions. Best of all, it's fairly easy to integrate with already existing scripts.