I am not as huge a computer person as many others on here; I majored in math, with MATLAB as my main computing experience. I recently got involved with Apache Spark through the excellent edX course offered by Berkeley.
The method they used for setting up Spark was provided in a great step-by-step guide. It involved downloading Oracle VM VirtualBox with a 32-bit Ubuntu VM, then using Vagrant (again, I'm not hugely computer-y, so I'm not 100% sure how this works or what it is) to connect it to an IPython notebook. This let me access Spark over the internet and code in Python with PySpark, which is exactly what I want to do.
Everything was going very well until the second lab exercise, when it became apparent that my Windows laptop has insufficient memory (just 3 GB, on a four-year-old machine): it continually froze and crashed when trying to work with large datasets.
Apparently it is not possible to run a VM inside a VM, so I have spent most of today looking for alternative ways of setting up Spark, to no avail; the guides are all aimed at someone with more computer knowledge than I have.
My (likely naive) idea is now to rent an external machine that I can interface with from my Windows laptop exactly as before, but so that the virtual machine runs outside my laptop's memory, i.e. in the cloud (on Ubuntu, Windows, or whatever). Essentially I want to move the Oracle VirtualBox VM to an outside source to relieve my computer's memory burden, and keep using the IPython notebook as before.
How can I set up a virtual machine to use for the computational side of Spark in an IPython notebook?
Or is there an alternate method that would be simple to follow?
Don't run VMs. Instead:
Download the latest Spark version. (1.4.1 at the moment.)
Extract the archive.
Run bin/pyspark.cmd.
It's not an IPython Notebook, but you can run Python code against a local Spark instance.
If you want a beefier instance, do the same on a beefy remote machine. For example, an EC2 m4.2xlarge costs $0.50 per hour and has 8 cores and 30 GB of RAM.
Related
I have a Lenovo laptop, but there is no GPU installed. So when I run a machine learning program written in Python, it runs on my local CPU. I know that Colab provides a free GPU. To use it, I need to take the contents of all the Python files from my ML program and put them in a Colab notebook, which is not very convenient. Is there any way to run my ML program from my computer using the Colab GPU directly, without using the Colab notebook itself?
EDIT
Be aware that I don't want to work from a Jupyter notebook. I would like to work in Visual Studio Code and run the code on the Colab GPU directly instead of my CPU.
It is possible. Check out this article:
https://amitness.com/vscode-on-colab/
and
https://github.com/abhi1thakur/colabcode
Not that I know of. Colab's GPU and notebook run on Google's computers. Your local Jupyter notebook runs on your computer alone and can't communicate with Google's machines. This is not a physical limitation or anything; it's just that no one has integrated them before.
What you can do, though, to make the transfers quick, is create a git repo for all of your files, push them to GitHub, then pull them down in Colab notebooks. It's relatively quick, syncs well, and serves as a backup.
Be aware that I don't want to work from a Jupyter notebook. I would like to work in Visual Studio Code and run the code on the Colab GPU directly instead of my CPU
Nope, not possible.
Update reason:
Colab itself is a Jupyter notebook; you can't detach its machine resources, link them to your PC, and use other software with them.
If this were possible, people would already be abusing it to mine crypto, run heavy-load programs, etc.
Colab is a free product by Google to introduce you to their cloud compute services. This means Colab has its own limitations:
"Colab resources are not guaranteed and not unlimited, and the usage limits sometimes fluctuate. " -Colab FAQ
If you are a fan of Colab, you might want to try the Pro version for just $10/month.
Did you check out colab-ssh? You SSH into colab from VS Code and can leverage the GPU the same as you would on colab.
For the last six months I have been working on a Python GUI application that I will use at work. Specifically, my GUI will run on a couple of supercomputer clusters that I use for work.
However, I mostly develop the software on my personal computer, where I do not have direct access to the commands that my GUI will call, since the GUI uses subprocess to call commands that are only available on the computing cluster.
So, in order to develop the program efficiently, I often have to copy the directory containing all files related to the GUI to the cluster. Then I test my current version there, locate all my bugs, fix them by editing the files on the cluster, and finally copy all the files back to my computer, overwriting the old version.
This just seems like a bad way of doing it, but I have to be able to test my software in the environment it is made for in order to find my bugs.
Surely this is a common problem in software development... What do actual programmers do (as opposed to hobby programmers such as myself)?
Edit:
Examples of commands that are only available on the computing cluster, that I make heavy use of, are squeue, sacct, and scontrol (SLURM related commands).
Edit2:
I could mention that I tested using SSH connections from Python, but they slowed the commands down significantly, since a new SSH connection had to be established for each command. Unless I can set up a lasting SSH session, as in logging in once when opening my program, I don't think the SSH approach will work.
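One way around that per-command handshake cost is OpenSSH connection multiplexing: a single master connection stays open, and later `ssh` invocations reuse it, so each command no longer pays for a fresh login. A sketch, assuming an OpenSSH client is available; the host name `cluster` is a placeholder for your login node:

```python
# Reuse one SSH connection for many commands via OpenSSH's ControlMaster
# multiplexing. The host name "cluster" is a placeholder.
import subprocess

CONTROL_OPTS = [
    "-o", "ControlMaster=auto",              # start a master connection if none exists
    "-o", "ControlPath=~/.ssh/cm-%r@%h:%p",  # socket file identifying the session
    "-o", "ControlPersist=10m",              # keep the master alive 10 min after last use
]

def ssh_cmd(host, remote_command):
    """Build an ssh invocation that reuses an existing master connection."""
    return ["ssh", *CONTROL_OPTS, host, remote_command]

# After the first call pays the handshake cost, later calls are fast, e.g.:
# subprocess.run(ssh_cmd("cluster", "squeue -u $USER"), capture_output=True)
print(ssh_cmd("cluster", "sinfo"))
```

The same options also work in `~/.ssh/config`, which keeps the Python side down to a plain `subprocess.run(["ssh", "cluster", ...])`.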
Explore the concepts that make Vagrant a popular choice for developers
Vagrant is a tool for building and managing virtual machine environments in a single workflow. With an easy-to-use workflow and focus on automation, Vagrant lowers development environment setup time, increases production parity, and makes the "works on my machine" excuse a relic of the past.
Your use case is covered by a couple of Vagrant boxes that create a SLURM cluster for development purposes. A good starting point might be:
Example slurm cluster on your laptop (multiple VMs via vagrant)
If you understand and can set up your development environment with tools like Vagrant, you might next explore which options modern code editors or integrated development environments (IDEs) offer for remote development. Remote development covers some other use cases that might fit into your developer toolbox as well.
A "good enough", free and open-source code editor for Python development is Visual Studio Code. According to the docs, it has powerful features for remote development.
Visual Studio Code Remote Development allows you to use a container, remote machine, or the Windows Subsystem for Linux (WSL) as a full-featured development environment.
Read the docs
VS Code Remote Development
I use Google Colab to test data structures like chained hash maps, probing hash maps, AVL trees, red-black trees, and splay trees (written in Python), and I store a very large dataset (key-value pairs) in these data structures to test the running time of some operations. Its scale is like a small Wikipedia, so running these Python scripts uses a lot of memory (RAM). Google Colab offers approximately 12 GB of RAM, but that is not enough for me: these Python scripts use about 20-30 GB of RAM, so when I run a Python program in Google Colab, it often raises an exception that "your program ran over the 12 GB upper bound" and restarts. On the other hand, I have some Python scripts that run recursive algorithms, which, as everyone knows, use the CPU (as well as RAM) very heavily; when I run these algorithms with 20000+ levels of recursion, Google Colab often fails and restarts. I know that Google Colab uses two cores of an Intel Xeon CPU, but how do I get more CPU cores from Google?
You cannot upgrade the GPU or CPU, but you can increase the RAM from 12 GB to 25 GB just by crashing the session with any non-terminating loop:
l = []
while 1:
    l.append('nothing')
There is no way to request more CPU/RAM from Google Colaboratory at this point, sorry.
Google Colab Pro recently launched for $9.99 a month (Feb. 2020). Users in the US can get higher resource limits and more frequent access to better resources.
Q&A from the signup page is below:
What kinds of GPUs are available in Colab Pro?
With Colab Pro you get priority access to our fastest GPUs. For example, you may get access to T4 and P100 GPUs at times when non-subscribers get K80s. You also get priority access to TPUs. There are still usage limits in Colab Pro, though, and the types of GPUs and TPUs available in Colab Pro may vary over time.
In the free version of Colab there is very limited access to faster GPUs, and usage limits are much lower than they are in Colab Pro.
How long can notebooks run in Colab Pro?
With Colab Pro your notebooks can stay connected for up to 24 hours, and idle timeouts are relatively lenient. Durations are not guaranteed, though, and idle timeouts may sometimes vary.
In the free version of Colab notebooks can run for at most 12 hours, and idle timeouts are much stricter than in Colab Pro.
How much memory is available in Colab Pro?
With Colab Pro you get priority access to high-memory VMs. These VMs generally have double the memory of standard Colab VMs, and twice as many CPUs. You will be able to access a notebook setting to enable high-memory VMs once you are subscribed. Additionally, you may sometimes be automatically assigned a high-memory VM when Colab detects that you are likely to need it. Resources are not guaranteed, though, and there are usage limits for high memory VMs.
In the free version of Colab the high-memory preference is not available, and users are rarely automatically assigned high memory VMs.
For a paid, high-capability solution, you may want to try Google Cloud Datalab instead.
I'm finding Hadoop on Windows somewhat frustrating: I want to know if there are any serious alternatives to Hadoop for Win32 users. The features I most value are:
Ease of initial setup & deployment on a smallish network (I'd be astonished if we ever got more than 20 worker-PCs assigned to this project)
Ease of management - the ideal framework should have web/GUI based administration system so that I do not have to write one myself.
Something popular & stable. Bonuses depend on us getting this project delivered in time.
BACKGROUND:
The company I work for wants to build a new grid system to run some financial calculations.
The first framework I have been evaluating is Hadoop. This seemed to do exactly what was intended except that it's very UNIX oriented. I was able to get all of the tutorials up & running on an Ubuntu VirtualBox. Unfortunately nothing seems to run easily on Win32.
Yes... Win32: Our company has a policy that everything has to run on Windows. None of the server admins (or anybody outside of select few developers) know anything about Linux. I'd probably get in trouble if they found my virtual Ubuntu environment! The sad fact is that our grid needs to be hosted on Win32 (since all the test PCs run Windows XP 32bit), with an option to upgrade to Win64 at sometime in the future.
To complicate matters - 95% of what we want to run are Python scripts with C++ Windows 32-bit DLL add-ons. Our calculation library is overwhelmingly written in Python, and it will not run on anything other than Windows... I do not really have a choice.
For python there is:
disco
bigtempo
celery - not really a map-reduce framework, but it's a good start if you want something very customized
And you can find a bunch of hadoop clients/integrations on pypi
You could try MPI, a standard for message-passing concurrent applications. We run it on our Linux cluster, but it is cross-platform. The most popular implementation is MPICH2, written in C. There are Python bindings for MPI through the mpi4py library.
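mpi4py exposes MPI's point-to-point send/recv style to Python. As a dependency-free sketch of that message-passing pattern (using the standard library's multiprocessing module instead of a real MPI runtime), something like this shows the shape of a scatter/gather round trip:

```python
# MPI-style point-to-point messaging, sketched with the standard library's
# multiprocessing module so it runs without an MPI installation. With mpi4py
# the analogous calls are comm.send(obj, dest=...) and comm.recv(source=...).
from multiprocessing import Process, Pipe

def worker(conn, rank):
    # Each "rank" receives one task, computes, and sends the result back.
    task = conn.recv()
    conn.send((rank, task ** 2))

def run_cluster(n_workers=4):
    results = {}
    conns, procs = [], []
    for rank in range(n_workers):
        parent, child = Pipe()
        p = Process(target=worker, args=(child, rank))
        p.start()
        parent.send(rank + 1)          # hand each worker its task
        conns.append(parent)
        procs.append(p)
    for parent, p in zip(conns, procs):
        rank, value = parent.recv()    # gather results, MPI-recv style
        results[rank] = value
        p.join()
    return results

print(run_cluster())  # {0: 1, 1: 4, 2: 9, 3: 16}
```

With real MPI you would launch the same script on every node via mpiexec and let ranks address each other directly, but the send/receive structure is the same.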
IPython has some parallel computing features that are simple and work on windows. It may be enough for your needs. Here's a good place to start:
http://showmedo.com/videotutorials/video?name=7200100&fromSeriesID=720
I've compiled a list of available MapReduce/Hadoop offerings in the cloud (hosted services, PaaS-level), this might be of help as well.
Many distributed computing frameworks can be used for many-task computing. If you don't need the MapReduce paradigm, but rather the ability to distribute the tasks of a job across separate computers, with communication and resource management, then you could take a look at other platforms in this area like HTCondor, or even BOINC; both run on Windows.
You could also run Hadoop on Linux virtual machines.
I want to remove as much complexity as I can from administering Python on Amazon EC2, following some truly awful experiences with hosting providers who claim support for Python. I am looking for guidance on which AMI to choose so that I have a stable, easily managed environment which already includes Python and ideally an Apache web server and a database.
I am agnostic to Python version, web server, DB and OS as I am still early enough in my development cycle that I can influence those choices. Cost is not a consideration (within bounds) so Windows will work fine if it means easy administration.
Anyone have any practical experience or recommendations they can share?
Try the Ubuntu EC2 images. Python 2.7 is installed by default. The rest you just apt-get install, and optionally create an image once the baseline is the way you want it (or just maintain a script that installs all the pieces and run it after you create the base Ubuntu instance).
If you can get by with using the Amazon provided ones, I'd recommend it. I tend to use ami-84db39ed.
Honestly though, if you plan on leaving this running all the time, you would probably save a bit of money by just going with a VPS. Amazon tends to be cheaper if you are turning the service on and off over time.