python code to run across cores

python code to run across cores - python

I am writing a simple python code for a very task which some of the best minds i guess are working on? Anyways.
I have a really powerful desktop 8cores (16 virtual cores).
I want to write a program whcih can find the distinct words in the whole corpus of words.
Or think of other task like things like word frequency counts.
Though map-reduce stuff works great for distributed framework.
Is there a way to make use of all the cores of your processor? that is multicore execution of code.
Or maybe this.
if i have to do the following:
def hello_word():
print "hello world!"
and instead of python hello_world.py
i want to run this hello_world.py using all the cores of my processor.
what changes will i make.?
Thanks

You first need to identify how your process can be broken down into parallel pieces. Running your example in your question on multiple cores makes absolutely no sense because there is just a single task to accomplish and no way it can be broken down into simpler, parallel steps.
After you have figured out how to break your task into parallel pieces, take a look at the multiprocessing module as Michael mentioned in comments. Working through some of the examples on that page is a good way to start.

Related

Python use slurm for multiprocessing

I want to run a simple task using multiprocessing (I think this one is the same as using parfor in matlab correct?)
For example:
from multiprocessing import Pool
def func_sq(i):
fig=plt.plot(x[i,:]) #x is a ready-to-use large ndarray, just want
fig.save(....) #to plot each column on a separate figure
pool = Pool()
pool.map(func_sq,[1,2,3,4,5,6,7,8])
But I am very confused of how to use slurm to submit my job. I have been searching for answers but could not find a good one.
Currently, while not using multiprocessing, I am using slurm job sumit file like this:(named test1.sh)
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -p batch
#SBATCH --exclusive
module load anaconda3
source activate py36
srun python test1.py
Then, I type in sbatch test1.sh in my prompt window.
So if I would like to use the multiprocessing, how should I modify my sh file? I have tried by myself but it seems just changing my -n to 16 and Pool(16) makes my job repeat 16 times.
Or is there a way to maximize my performance if multiprocessing is not suitable (I have heard about multithreating but don't know how it exactly works)
And how do I effectively use my memory so that it won't crush? (My x matrix is very large)
Thanks!
For the GPU, is that possible to do the same thing? My current submission script without multiprocessing is:
#!/bin/bash
#SBATCH -n 1
#SBATCH -p gpu
#SBATCH --gres=gpu:1
Thanks a lot!

The "-n" flag is setting the number of tasks your sbatch submission is going to execute, which is why your script is running multiple times. What you want to change is the "-c" argument which is how many CPUs each task is assigned. Your script spawns additional processes but they will be children of the parent process and share the CPUs assigned to it. Just add "#SBATCH -c 16" to your script. As for memory, there is a default amount of memory your job will be given per CPU, so increasing the number of CPUs will also increase the amount of memory available. If you're not getting enough, add something like "#SBATCH --mem=20000M" to request a specific amount.

I don't mean to be unwelcoming here, but this question seems to indicate that you don't actually understand the tools you're using here. Python Multiprocessing allows a single Python program to launch child processes to help it perform work in parallel. This is particularly helpful because multithreading (which is commonly how you'd accomplish this in other programming languages) doesn't gain you parallel code execution in Python, due to Python's Global Interpreter Lock.
Slurm (which I don't use, but from some quick research) seems to be a fairly high-level utility that allows individuals to submit work to some sort of cluster of computers (or a supercomputer... usually similar concepts). It has no visibility, per se, into how the program it launches runs; that is, it has no relationship to the fact that your Python program proceeds to launch 16 (or however many) helper processes. Its job is to schedule your Python program to run as a black box, then sit back and make sure it finishes successfully.
You seem to have some vague data processing problem. You describe it as a large matrix, but you don't give nearly enough information for me to actually understand what you're trying to accomplish. Regardless, if you don't actually understand what you're doing and how the tools you're using work, you're just flailing until you maybe eventually get lucky enough for this to work. Stop guessing, figure out what these tools do, look around and read documentation, then figure out what you're trying to accomplish and how you could go about splitting up the work in a reasonable fashion.
Here's my best guess, but I really have very little information to work from so it may not be helpful at all:
Your Python script has no concept that it's being run multiple times by Slurm (the -n 16 you refer to, I guess). It makes sense, then, that the job gets repeated 16 times, because Slurm runs the entire script 16 times, and each time your Python script does the entire task from start to finish. If you want Slurm and your Python program to interact, so that the Python program expects to get run multiple times in parallel, I have no idea how to help you there, you'll just need to read more into Slurm.
Your data must be able to be read incrementally, or partially, if you have any hope of breaking this job into pieces. That is, if you can only read the entire matrix all at once, or not at all, you're stuck with solutions that begin by reading the entire matrix into memory, which you indicate is not really an option. Assuming you can, and that you want to perform some work on each row independently, then you're fortunate enough for your task to be what's officially known as "embarrassingly parallel". This is a very good thing, if true.
Assuming your problem is embarrassingly parallel (since it looks like you're just trying to load each row of your data matrix, plot it somehow, then save off that plot as an image to disk), and you can load your data incrementally, then continue reading up on Python's multiprocessing module, and Pool().map is probably the right direction to be headed in. Create some Python generator that produces rows of your data matrix, then pass that generator and func_sq to pool.map, and sit back and wait for the job to finish.
If you really need to do this work across multiple machines, rather than hacking your own Slurm + Multiprocessing stack, I'd suggest you start using actual data processing tools, such as PySpark.
This doesn't sound like a trivial problem, and even if it were, you don't give sufficient details for me to provide a robust answer. There's no "just fix this one line" answer to what you've asked, but I hope this helps give you an idea of what your tools are doing and how to proceed from here.

I need to execute test cases in robot file in parallel

As there are huge number of test cases,it takes lot of time to complete the entire execution.I am not using selenium. Is there any way that can help to achieve parallel execution in robot framework or python .Any example would be of great help.

Pabot is likely what you're looking for. Though you should know that this will not magically make your tests thread-safe. In other words Pabot can only help you with the execution part, but your test cases will need to be designed with parallelization in mind. For example, test cases that make changes to a database or edit a global file may not be parallelization-friendly and will need to be redesigned with parallelization in mind.
PabotLib can help you design thread-safe test cases when needed.

See pabot here : https://github.com/mkorpela/pabot
First install pabot:
pip3 install -U robotframework-pabot
Then you can run rest under case directory in parallel as simple as:
pabot --testlevelsplit case

What is the correct workflow for parallelization: running on a cluster or multiproccesses?

I want to call a function similar to parallelize.map(function, args) that returns a list of results and the user is blind to the actual process. One of the functions I want to parallelize calls subprocess to another unix program that benefits from multiple cores.
I first tried ipython-cluster-helper. This works well with my setup, but I ran into problems installing it on several other machines. I also have to ask for names of clusters during setup. I haven't seen other programs start jobs on clusters for you, so I don't know if that is accepted practice.
joblib seems to be the standard for parallelization, but it can only use one cluster or computer at a time. This works as well, but is significantly slower because it is not using the cluster.
Also, the server I am running this code on complains if a program has run too long to ensure that people use the cluster. Do I write another script to run this program only on our cluster -- if I used joblib?
For now, I added special parameters in setup.py to add cluster names and install ipython-cluster-helper if necessary. And when map is called, it first checks if ipython-cluster-helper and the cluster names are available, use them, else use joblib.
What are others ways of achieving this? I'm looking for a standard way to do this that will work on most machines with or with out a cluster, so I can release the code and make it easy to use.
Thanks.

Multithreading / Multiprocessing in Python

I have created a simple substring search program that recursively looks through a folder and scans a large number of files. The program uses the Boyer-Moore-Horspool algorithm and is very efficient at parsing large amounts of data.
Link to program: http://pastebin.com/KqEMMMCT
What I'm trying to do now is make it even more efficient. If you look at the code, you'll notice that there are three different directories being searched. I would like to be able to create a process/thread that searches each directory concurrently, it would greatly speed up my program.
What is the best way to implement this? I have done some preliminary research, but my implementations have been unsuccessful. They seem to die after 25 minutes or so of processing (right now the single process version takes nearly 24 hours to run; it's a lot of data, and there are 648 unique keywords.)
I have done various experiments using the multiprocessing API and condensing all the various files into 3 files (one for each directory) and then mapping the files to memory via mmap(), but a: I'm not sure if this is the appropriate route to go, and b: my program kept dying at random points, and debugging was an absolute nightmare.
Yes, I have done extensive googleing, but I'm getting pretty confused between pools/threads/subprocesses/multithreading/multiprocessing.
I'm not asking for you to write my program, just help me understand the thought process needed to go about implementing a solution. Thank you!
FYI: I plan to open-source the code once I get the program running. I think it's a fairly useful script, and there are limited examples of real world implementations of multiprocessing available online.

What to do depends on what's slowing down the process.
If you're reading on a single disk, and disk I/O is slowing you down, multiple threads/process will probably just slow you down as the read head will now be jumping all over the place as different threads get control, and you'll be spending more time seeking than reading.
If you're reading on a single disk, and processing is slowing you down, then you might get a speedup from using multiprocessing to analyze the data, but you should still read from a single thread to avoid seek time delays (which are usually very long, multiple milliseconds).
If you're reading from multiple disks, and disk I/O is slowing you down, then either multiple threads or processes will probably give you a speed improvement. Threads are easier, and since most of your delay time is away from the processor, the GIL won't be in your way.
If you're reading from multiple disks,, and processing is slowing you down, then you'll need to go with multiprocessing.

Multiprocessing is easier to understand/use than multithreading(IMO). For my reasons, I suggest reading this section of TAOUP. Basically, everything a thread does, a process does, only the programmer has to do everything that the OS would handle. Sharing resources (memory/files/CPU cycles)? Learn locking/mutexes/semaphores and so on for threads. The OS does this for you if you use processes.
I would suggest building 4+ processes. 1 to pull data from the hard drive, and the other three to query it for their next piece. Perhaps a fifth process to stick it all together.
This naturally fits into generators. See the genfind example, along with the gengrep example that uses it.
Also on the same site, check out the coroutines section.

Iterative MapReduce

I've written a simple k-means clustering code for Hadoop (two separate programs - mapper and reducer). The code is working over a small dataset of 2d points on my local box. It's written in Python and I plan to use Streaming API.
I would like suggestions on how best to run this program on Hadoop.
After each run of mapper and reducer, new centres are generated. These centres are input for the next iteration.
From what I can see, each mapreduce iteration will have to be a separate mapreduce job. And it looks like I'll have to write another script (python/bash) to extract the new centres from HDFS after each reduce phase, and feed it back to mapper.
Any other easier, less messier way? If the cluster happens to use a fair scheduler, It will be very long before this computation completes?

You needn't write another job. You can put the same job in a loop ( a while loop) and just keep changing the parameters of the job, so that when the mapper and reducer complete their processing, the control starts with creating a new configuration, and then you just automatically have an input file that is the output of the previous phase.

The Java interface of Hadoop has the concept of chaining several jobs:
http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
However, since you're using Hadoop Streaming you don't have any support for chaining jobs and managing workflows.
You should checkout Oozie which should do the job for you:
http://yahoo.github.com/oozie/

Here are a few ways to do it: github.com/bwhite/hadoop_vision/tree/master/kmeans
Also check this out (has oozie support): http://bwhite.github.com/hadoopy/

Feels funny to be answering my own question. I used PIG 0.9 (not released yet, but available in the trunk). In this, there is support for modularity and flow control by way of allowing PIG Statements to be embedded inside scripting languages like Python.
So, I wrote a main python script that had a loop, and inside that called my PIG Scripts. The PIG scripts inturn made calls to the UDFs. So, had to write three different programs. But it worked out fine.
You can check the example here - http://www.mail-archive.com/user#pig.apache.org/msg00672.html
For the record, my UDFs were also written in Python, using this new feature that allows writing UDFs in scripting languages.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.