I've written simple k-means clustering code for Hadoop (two separate programs: a mapper and a reducer). The code works over a small dataset of 2D points on my local box. It's written in Python, and I plan to use the Streaming API.
I would like suggestions on how best to run this program on Hadoop.
After each run of mapper and reducer, new centres are generated. These centres are input for the next iteration.
From what I can see, each MapReduce iteration will have to be a separate MapReduce job. And it looks like I'll have to write another script (Python/Bash) to extract the new centres from HDFS after each reduce phase and feed them back to the mapper.
Is there an easier, less messy way? Also, if the cluster happens to use a fair scheduler, will it take very long before this computation completes?
You needn't write another job. You can put the same job in a loop (a while loop) and just keep changing its parameters, so that when the mapper and reducer complete their processing, control returns to your driver, which creates a new configuration whose input is simply the output of the previous iteration.
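For illustration, a minimal driver-loop sketch with Hadoop Streaming might look like this (the streaming-jar location, HDFS paths, and file names below are assumptions, and the convergence test is omitted):

import subprocess

STREAMING_JAR = "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar"  # assumed location
MAX_ITERATIONS = 20
centres_file = "centres.txt"  # local file shipped to every task via -file

for i in range(MAX_ITERATIONS):
    out_dir = "kmeans/out_%d" % i
    subprocess.check_call([
        "hadoop", "jar", STREAMING_JAR,
        "-input", "kmeans/points",
        "-output", out_dir,
        "-mapper", "mapper.py",
        "-reducer", "reducer.py",
        "-file", "mapper.py",
        "-file", "reducer.py",
        "-file", centres_file,  # the mapper reads the current centres from this file
    ])
    # Pull the new centres out of HDFS so the next iteration can ship them.
    with open(centres_file, "w") as f:
        subprocess.check_call(["hadoop", "fs", "-cat", out_dir + "/part-*"], stdout=f)
    # A convergence check (old centres vs. new centres) would normally go here.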
The Java interface of Hadoop has the concept of chaining several jobs:
http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
However, since you're using Hadoop Streaming, you don't have any built-in support for chaining jobs and managing workflows.
You should check out Oozie, which should do the job for you:
http://yahoo.github.com/oozie/
Here are a few ways to do it: github.com/bwhite/hadoop_vision/tree/master/kmeans
Also check this out (it has Oozie support): http://bwhite.github.com/hadoopy/
Feels funny to be answering my own question. I used Pig 0.9 (not released yet, but available in the trunk). It has support for modularity and flow control by allowing Pig statements to be embedded inside scripting languages like Python.
So I wrote a main Python script that had a loop, and inside that it called my Pig scripts. The Pig scripts in turn made calls to the UDFs. So I had to write three different programs, but it worked out fine.
You can check the example here - http://www.mail-archive.com/user#pig.apache.org/msg00672.html
For the record, my UDFs were also written in Python, using this new feature that allows writing UDFs in scripting languages.
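For illustration, a stripped-down sketch of such an embedded loop, using Pig 0.9's scripting interface, could look like the following (the Pig Latin body, HDFS paths, and iteration count are only placeholders; the real script would assign points to centres via a Python UDF and recompute them):

#!/usr/bin/python
# Run with: pig driver.py  (Pig executes this file with Jython)
from org.apache.pig.scripting import Pig

P = Pig.compile("""
    points = LOAD '$points' AS (x:double, y:double);
    -- the real script would call a Python UDF here to assign points and recompute centres
    STORE points INTO '$out';
""")

MAX_ITER = 20
for i in range(MAX_ITER):
    out = 'kmeans/centres_%d' % i
    stats = P.bind({'points': 'kmeans/points', 'out': out}).runSingle()
    if not stats.isSuccessful():
        raise RuntimeError('Pig job failed at iteration %d' % i)
    # Read the new centres from 'out' and stop once they no longer move (omitted).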
Related
I started a new job in which they have zero optimization. Essentially, I want to use code I have pretty much finished that cleans Excel files and outputs only the values that meet certain parameters. My boss said I'd only be allowed to do this if it were developed as a web application, so I'm wondering whether that's possible and, if so, how I would go about it. If it can't be done using Python, what would be another easy-to-learn option where the code would be similar to what I wrote for the parameters in Python?
I have a pipeline that looks roughly like this:
_ = (
    p
    | SomeSourceProducingListOfFiles()
    | beam.Map(some_expensive_fn)
    | beam.FlatMap(some_inexpensive_agg)
)
SomeSourceProducingListOfFiles in my case is reading from a single CSV/TSV and doesn't currently support splitting.
some_expensive_fn is an expensive operation that may take a minute to run.
some_inexpensive_agg is perhaps not that important for the question, but it shows that some results are brought together for aggregation purposes.
In the case where SomeSourceProducingListOfFiles produces, say, 100 items, the load doesn't seem to get split across multiple workers.
I understand that in general Apache Beam tries to keep things on one worker to reduce serialisation overhead (and there is some hard-coded limit of 1000 items). How can I convince Apache Beam to split the load across multiple workers, even for a very small number of items? If, say, I have three items and three workers, I would like each worker to execute one item.
Note: I disabled auto scaling and am using a fixed number of workers.
https://cloud.google.com/dataflow/service/dataflow-service-desc#preventing-fusion discusses ways to prevent fusion. Beam Java 2.2+ has a built-in transform to do this, Reshuffle.viaRandomKey(); Beam Python doesn't have it yet, so you'll need to code something similar manually using one of the methods described at that link.
Can you try using beam.Reshuffle? It seems like this isn't well documented but I hear from some good sources that this is what you should use.
https://beam.apache.org/documentation/transforms/python/other/reshuffle/
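As a rough sketch of where it would go in the pipeline from the question (beam.Reshuffle() is assumed to be available in your SDK version; the source and functions are the question's placeholder names):

import apache_beam as beam

with beam.Pipeline() as p:
    _ = (
        p
        | SomeSourceProducingListOfFiles()
        | beam.Reshuffle()                  # breaks fusion so the expensive step can fan out
        | beam.Map(some_expensive_fn)
        | beam.FlatMap(some_inexpensive_agg)
    )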
I want to call a function similar to parallelize.map(function, args) that returns a list of results, where the user is blind to the actual process. One of the functions I want to parallelize calls out via subprocess to another Unix program that benefits from multiple cores.
I first tried ipython-cluster-helper. This works well with my setup, but I ran into problems installing it on several other machines. I also have to ask for names of clusters during setup. I haven't seen other programs start jobs on clusters for you, so I don't know if that is accepted practice.
joblib seems to be the standard for parallelization, but it can only use one cluster or computer at a time. This works as well, but is significantly slower because it is not using the cluster.
Also, the server I run this code on complains if a program runs too long, to ensure that people use the cluster. If I used joblib, would I have to write another script just to run this program on our cluster?
For now, I added special parameters in setup.py to record cluster names and install ipython-cluster-helper if necessary. When map is called, it first checks whether ipython-cluster-helper and the cluster names are available; if so, it uses them, otherwise it falls back to joblib.
What are other ways of achieving this? I'm looking for a standard way to do this that will work on most machines, with or without a cluster, so I can release the code and make it easy to use.
Thanks.
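A simplified sketch of that fallback (assuming ipython-cluster-helper's cluster_view interface and joblib; the scheduler and queue names are site-specific placeholders, and the exact checks would depend on your setup):

from joblib import Parallel, delayed

def parallel_map(function, args, scheduler=None, queue=None, num_jobs=4):
    try:
        from cluster_helper.cluster import cluster_view
    except ImportError:
        cluster_view = None

    if cluster_view is not None and scheduler and queue:
        # Run on the cluster: each element of args becomes one engine task.
        with cluster_view(scheduler=scheduler, queue=queue, num_jobs=num_jobs) as view:
            return view.map(function, args)

    # Fall back to local cores via joblib.
    return Parallel(n_jobs=num_jobs)(delayed(function)(a) for a in args)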
I am writing some simple Python code for a task that, I guess, some of the best minds are working on. Anyway:
I have a really powerful desktop with 8 cores (16 virtual cores).
I want to write a program which can find the distinct words in a whole corpus.
Or think of other tasks, like word frequency counts.
MapReduce works great for a distributed framework, but is there a way to make use of all the cores of your processor? That is, multicore execution of code.
Or maybe this: if I have to do the following:
def hello_world():
    print("hello world!")
and instead of just running python hello_world.py,
I want to run hello_world.py using all the cores of my processor,
what changes would I make?
Thanks
You first need to identify how your process can be broken down into parallel pieces. Running your example in your question on multiple cores makes absolutely no sense because there is just a single task to accomplish and no way it can be broken down into simpler, parallel steps.
After you have figured out how to break your task into parallel pieces, take a look at the multiprocessing module as Michael mentioned in comments. Working through some of the examples on that page is a good way to start.
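As a concrete starting point, a minimal multicore word-count sketch using multiprocessing.Pool might look like this (the corpus path and per-file chunking are assumptions):

import glob
from collections import Counter
from multiprocessing import Pool

def count_words(path):
    # One unit of work: count word frequencies in a single file.
    with open(path) as f:
        return Counter(f.read().split())

if __name__ == "__main__":
    files = glob.glob("corpus/*.txt")         # assumed corpus layout: one file per chunk
    with Pool() as pool:                      # defaults to one worker per core
        partial_counts = pool.map(count_words, files)
    totals = sum(partial_counts, Counter())   # merge the per-file counts
    print("distinct words:", len(totals))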
The use case is as follows: I have a script that runs a series of non-Python executables to reduce (pulsar) data. Right now I use subprocess.Popen(..., shell=True) and then subprocess's communicate function to capture the standard out and standard error from the non-Python executables, and I log the captured output using the Python logging module.
The problem is that just one of the possible 8 cores gets used most of the time.
I want to spawn multiple processes, each doing a part of the data set in parallel, and I want to keep track of progress. It is a script/program to analyze data from a low-frequency radio telescope (LOFAR). The easier it is to install/manage and test, the better.
I was about to build code to manage all this, but I'm sure it must already exist in some easy library form.
The subprocess module can start multiple processes for you just fine, and keep track of them. The problem, though, is reading the output from each process without blocking any other processes. Depending on the platform there are several ways of doing this: using the select module to see which process has data to be read, setting the output pipes non-blocking using the fcntl module, or using threads to read each process's data (which subprocess.Popen.communicate itself uses on Windows, because it doesn't have the other two options). In each case the devil is in the details, though.
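For example, a minimal sketch of the thread-per-process variant (the executable name and chunk list are placeholders for your actual reduction tools and data):

import logging
import subprocess
import threading

logging.basicConfig(level=logging.INFO)

def run_and_log(cmd):
    # One worker thread: run an external executable and log its output line by line.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    for line in proc.stdout:
        logging.info("%s: %s", cmd[0], line.rstrip())
    proc.wait()
    logging.info("%s finished with exit code %d", cmd[0], proc.returncode)

chunks = ["chunk0.dat", "chunk1.dat", "chunk2.dat"]             # placeholder data set parts
threads = [threading.Thread(target=run_and_log,
                            args=(["reduce_pulsar_data", c],))  # placeholder executable
           for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()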
Something that handles all this for you is Twisted, which can spawn as many processes as you want, and can call your callbacks with the data they produce (as well as other situations.)
Maybe Celery will serve your needs.
If I understand correctly what you are doing, I might suggest a slightly different approach. Try establishing a single unit of work as a function and then layer on the parallel processing after that. For example:
1. Wrap the current functionality (calling subprocess and capturing output) into a single function. Have the function create a result object that can be returned; alternatively, the function could write out to files as you see fit.
2. Create an iterable (list, etc.) that contains an input for each chunk of data for step 1.
3. Create a multiprocessing Pool and then capitalize on its map() functionality to execute your function from step 1 for each of the items in step 2. See the Python multiprocessing docs for details.
You could also use a worker/Queue model. The key, I think, is to encapsulate the current subprocess/output capture stuff into a function that does the work for a single chunk of data (whatever that is). Layering on the parallel processing piece is then quite straightforward using any of several techniques, only a couple of which were described here.
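A minimal sketch of those three steps (the external command and chunk names are placeholders, not your actual tools):

import multiprocessing
import subprocess

def process_chunk(chunk_path):
    # Step 1: one unit of work -- run the external executable on a single chunk
    # and return its captured output as the result object.
    result = subprocess.run(["reduce_tool", chunk_path],      # placeholder executable
                            capture_output=True, text=True)
    return chunk_path, result.returncode, result.stdout

if __name__ == "__main__":
    chunks = ["chunk0.dat", "chunk1.dat", "chunk2.dat"]       # step 2: one input per chunk
    with multiprocessing.Pool() as pool:                      # step 3: fan out across cores
        for path, code, out in pool.map(process_chunk, chunks):
            print(path, "exit code", code)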