I'm finding Hadoop on Windows somewhat frustrating: I want to know if there are any serious alternatives to Hadoop for Win32 users. The features I most value are:
Ease of initial setup & deployment on a smallish network (I'd be astonished if we ever got more than 20 worker-PCs assigned to this project)
Ease of management - the ideal framework should have web/GUI based administration system so that I do not have to write one myself.
Something popular & stable. Bonuses depend on us getting this project delivered in time.
BACKGROUND:
The company I work for wants to build a new grid system to run some financial calculations.
The first framework I have been evaluating is Hadoop. This seemed to do exactly what was intended except that it's very UNIX oriented. I was able to get all of the tutorials up & running on an Ubuntu VirtualBox. Unfortunately nothing seems to run easily on Win32.
Yes... Win32: Our company has a policy that everything has to run on Windows. None of the server admins (or anybody outside of select few developers) know anything about Linux. I'd probably get in trouble if they found my virtual Ubuntu environment! The sad fact is that our grid needs to be hosted on Win32 (since all the test PCs run Windows XP 32bit), with an option to upgrade to Win64 at sometime in the future.
To complicate matters - 95% of what we want to run are Python scripts with C++ Windows 32bit DLL add ons. Our calculation library is overwhelmingly written in Python. Our calculation libraries will not run on anything other than Windows... I do not really have a choice
For python there is:
disco
bigtempo
celery - not really a map-reduce framework, but it's a good start if you want something very customized
And you can find a bunch of hadoop clients/integrations on pypi
You could try MPI. It is a standard for message-passing concurrent applications. We are running it on our Linux cluster but it is cross-platform. The most popular implementation is mpich2, written in C. There are python bindings for MPI through the mpi4py library.
IPython has some parallel computing features that are simple and work on windows. It may be enough for your needs. Here's a good place to start:
http://showmedo.com/videotutorials/video?name=7200100&fromSeriesID=720
I've compiled a list of available MapReduce/Hadoop offerings in the cloud (hosted services, PaaS-level), this might be of help as well.
Many distributed computing frameworks can be used for many-task computing. If you don't need the MapReduce paradigm, but rather the ability to distribute the tasks of a job across separate computers, communication and resource management, then you could take a look at other platforms in this area like Condor, or even Boinc; both run on Windows.
You could also run Hadoop on Linux virtual machines.
Related
I have the last six months been working on a Python GUI application that I will use at work. Specifically my GUI will run on a couple of super computer clusters that I use for work.
However, I am mostly developing the software at my personal computer, and here I do not have direct access to the commands that my GUI will call, since the GUI will use subprocess to call commands that only are available on the computing cluster.
So, in order to efficiently develop the program, I often have to copy the directory containing all files related to the GUI, to the cluster. Then I test my current version there, locate all my bugs, fix them by editing the files on the cluster, and finally copy back all files to my computer, overwriting the old version.
This just seems like a bad way of doing it, but I have to be able to test my software in the environment it is made for in order to find my bugs.
Surely this is a common problem in software development... What do actual programmers do (as opposed to hobby programmers such as myself)?
Edit:
Examples of commands that are only available on the computing cluster, that I make heavy use of, are squeue, sacct, and scontrol (SLURM related commands).
Edit2:
I could mention that I tested using ssh connections with Python, but it slowed down the commands significantly, having to establish the ssh connection for each command I wanted. Unless I could set of a lasting ssh session, as in logging in when opening my program, I don't think the ssh-ing will work.
Explore the concepts that make Vagrant a popular choice for developers
Vagrant is a tool for building and managing virtual machine
environments in a single workflow. With an easy-to-use workflow and
focus on automation, Vagrant lowers development environment setup
time, increases production parity, and makes the "works on my machine"
excuse a relic of the past.
Your use case is covered by a couple of vagrant boxes that create a slurm cluster for development purposes. A good starting point might be
Example slurm cluster on your laptop (multiple VMs via vagrant)
If you understand and can setup your development environment with tools like Vagrant, you might explore next which options modern code editors or integrated development environments (IDE) offer for remote development. Remote development covers some other use cases, that might fit into your developer toolbox as well.
A "good enough", free and open source code editor for Python development is Visual Studio Code. According to the docs it has powerful features for remote development.
Visual Studio Code Remote Development allows you to use a container, remote machine, or the Windows Subsystem for Linux (WSL) as a full-featured development environment.
Read the docs
VS Code Remote Development
As part of an effort to make the scikit-image examples gallery interactive, I would like to build a web service that receives a Python code snippet, executes it, and provides me with the generated output image.
For safety, the Python instances launched should be sandboxed and resource controlled, so I was thinking of using LXC containers.
Is this a good way to approach the problem? If so, what is the recommended way of launching one Python VM per request?
Stefan, perhaps "Docker" could be of use? I get the impression that you could constrain the VM that the application is run in -- an example web service:
http://docs.docker.io/en/latest/examples/python_web_app/
You could try running the application on Digital Ocean, like so:
https://www.digitalocean.com/community/articles/how-to-install-and-use-docker-getting-started
[disclaimer: I'm an engineer at Continuum working on Wakari]
Wakari Enterprise (http://enterprise.wakari.io) is aiming to do exactly this, and we're hoping to back-port the functionality into Wakari Cloud (http://wakari.io) so "published" IPython Notebooks can have some knobs on them for variable input control, then they can be "invoked" in a sandboxed state, and then the output given back to the user.
However for things that exist now, you should look at Sage Notebook. A few years ago several people worked hard on a Sage Notebook Cell Server that could do exactly what you were asking for: execute small code snippets. I haven't followed it since then, but it seems it is still alive and well from a quick search:
http://sagecell.sagemath.org/?q=ejwwif
http://sagecell.sagemath.org
http://www.sagemath.org/eval.html
For the last URL, check out Graphics->Mandelbrot and you can see that Sage already has some great capabilities for UI widgets that are tied to the "cell execution".
I think docker is the way to go for this. The instances are very light weight, and docker is designed to spawn 100s of instances at a time (Spin up time is fractions of a second vs traditional VMs couple of seconds). Configured correctly I believe it also gives you a complete sandboxed environment. Then it matters not about trying to sandbox python :-D
I'm not sure if you really have to go as far as setting up LXC containers:
There is seccomp-nurse, a Python sandbox that leverages the seccomp feature of the Linux kernel.
Another option would be to use PyPy, which has explicit support for sandboxing out of the box.
In any case, do not use pysandbox, it is broken by design and has severe security risks.
I am working on a project where I am quite comfortable running linux, virtualenv, pip, manage.py runserver, git and so on for back-end development. I work with a front-end developer who needs to collaborate remotely, currently via a Dropbox synced copy of the codebase (also in a git branch) on Windows. A development server on my side lets the developer see their changes semi-live.
Although this has served us fairly well so far, has anyone come across a similar working arrangement with a better setup for collaboration?
I'm mindful that the source control learning curve and environmental management overhead is potentially significant and somewhat unnecessary for front-end work (as long as I commit from time to time). I'm considering a VM based setup such as BitNami's DjangoStack so that the front-end dev has their own server setup, but I thought I'd ask about other experiences.
I would recommend vagrant not only for quick development setups (which it excels at), but also for sharing VM configurations as you can publish your own vagrant file which your designer uses.
It relies on VirtualBox Sun Oracle's open source hypervisor and is available for free on all major platforms.
I have been in a very similar situation before Rog, where the backend was a Ruby on Rails setup running on *nix, and the frontend guy needed windows. We initially set up a Windows-Apache-MySql+git+RoR (using Cygwin and other tools) but eventually installing our app libraries and gems became a pain on the windows setup (anytime we would introduce a new gem (or app in django terms) the setup would break on windows). In the end we finally made the front-end guy work on *nix setup.
andLinux is extremely useful in these situations, it lets your run a seamless install of linux withing a windows 2000 setup, so the front end guy can still use windows tool. It is not like a dual boot, but here both the OS are running at the same time. Have a look into it.
I want to remove as much complexity as I can from administering Python in on Amazon EC2 following some truly awful experiences with hosting providers who claim support for Python. I am looking for some guidance on which AMI to choose so that I have a stable and easily managed environment which already included Python and ideally an Apache web server and a database.
I am agnostic to Python version, web server, DB and OS as I am still early enough in my development cycle that I can influence those choices. Cost is not a consideration (within bounds) so Windows will work fine if it means easy administration.
Anyone have any practical experience or recommendations they can share?
Try the Ubuntu EC2 images. Python 2.7 is installed by default. The rest you just apt-get install and optionally create an image when the baseline is the way you want it (or just maintain a script that installs all the pieces and run after you create the base Ubuntu instance).
If you can get by with using the Amazon provided ones, I'd recommend it. I tend to use ami-84db39ed.
Honestly though, if you plan on leaving this running all the time, you would probably save a bit of money by just going with a VPS. Amazon tends to be cheaper if you are turning the service on and off over time.
I am starting to work on a hobby project with a Python codebase and I would like to set up some form of continuous integration (i.e. running a battery of test-cases each time a check-in is made and sending nag e-mails to responsible persons when the tests fail) similar to CruiseControl or TeamCity.
I realize I could do this with hooks in most VCSes, but that requires that the tests run on the same machine as the version control server, which isn't as elegant as I would like. Does anyone have any suggestions for a small, user-friendly, open-source continuous integration system suitable for a Python codebase?
We run Buildbot - Trac at work. I haven't used it too much since my codebase isn't part of the release cycle yet. But we run the tests on different environments (OSX/Linux/Win) and it sends emails — and it's written in Python.
One possibility is Hudson. It's written in Java, but there's integration with Python projects:
Hudson embraces Python
I've never tried it myself, however.
(Update, Sept. 2011: After a trademark dispute Hudson has been renamed to Jenkins.)
Second the Buildbot - Trac integration. You can find more information about the integration on the Buildbot website. At my previous job, we wrote and used the plugin they mention (tracbb).
What the plugin does is rewriting all of the Buildbot urls so you can use Buildbot from within Trac. (http://example.com/tracbb).
The really nice thing about Buildbot is that the configuration is written in Python. You can integrate your own Python code directly to the configuration. It's also very easy to write your own BuildSteps to execute specific tasks.
We used BuildSteps to get the source from SVN, pull the dependencies, publish test results to WebDAV, etcetera.
I wrote an X10 interface so we could send signals with build results. When the build failed, we switched on a red lava lamp. When the build succeeded, a green lava lamp switched on. Good times :-)
We use both Buildbot and Hudson for Jython development. Both are useful, but have different strengths and weaknesses.
Buildbot's configuration is pure Python and quite simple once you get the hang of it (look at the epydoc-generated API docs for the most current info). Buildbot makes it easier to define non-testing tasks and distribute the testers. However, it really has no concept of individual tests, just textual, HTML, and summary output, so if you want to have multi-level browsable test output and so forth you'll have to build it yourself, or just use Hudson.
Hudson has terrific support for drilling down from overall results into test suites and individual tests; it also is great for comparing test output between builds, but the distributed (master/slave) stuff is comparatively more complicated because you need a Java environment on the slaves too; also, Hudson is less tolerant of flaky network links between the master and slaves.
So, to get the benefits of both tools, we run a single instance of Hudson, which catches the common test failures, then we do multi-platform regression with Buildbot.
Here are our instances:
Jython Hudson
Jython buildbot
We are using Bitten wich is integrated with trac. And it's python based.
TeamCity has some Python integration.
But TeamCity is:
not open-source
is not small, but rather feature rich
is free for small-mid teams.
I have very good experiences with Travis-CI for smaller code bases.
The main advantages are:
setup is done in less than half a screen of config file
you can do your own installation or just use the free hosted version
semi-automatic setup for github repositories
no account needed on website; login via github
Some limitations:
Python is not supported as a first class language (as of time of writing; but you can use pip and apt-get to install python dependencies; see this tutorial)
code has to be hosted on github (at least when using the official version)