Running Jupyter Notebook in GCP on a Schedule - python

What is the best way to migrate a Jupyter notebook into Google Cloud Platform?
Requirements
I don't want to make a lot of changes to the notebook to get it to run
I want it to be schedulable, preferably through the UI
I want it to be able to run an .ipynb file, not a .py file
In AWS, SageMaker seems like the no-brainer solution for this. I want the GCP tool that comes closest to this specific task without a lot of extras.
I've tried the following:
Cloud Functions: it seems best suited to running Python scripts, not notebooks, and requires a main.py file by default
Dataproc: it seems you can add a notebook to a running instance, but it cannot be scheduled
Dataflow: it seemed like overkill, not the best tool for the job, and better suited to Apache Beam-based pipelines
I feel like this question should be easier. I found this article on the subject:
How to Deploy and Schedule Jupyter Notebook on Google Cloud Platform
He doesn't actually do what the title says: he moves a lot of GCP code into a main.py that creates an instance, and then has the instance execute the notebook.
Feel free to correct my perspective on any of this

I use Vertex AI Workbench to run notebooks on GCP. It provides two variants:
Managed Notebooks
User-managed Notebooks
User-managed notebooks create compute instances in the background, come with pre-built packages such as JupyterLab, Python, etc., and allow customisation. I mainly use them for developing Dataflow pipelines.
As for the other requirement, scheduling: Managed Notebooks support this feature; refer to this documentation (I have yet to try Managed Notebooks):
Use the executor to run a notebook file as a one-time execution or on a schedule. Choose the specific environment and hardware that you want your execution to run on. Your notebook's code will run on Vertex AI custom training, which can make it easier to do distributed training, optimize hyperparameters, or schedule continuous training jobs. See Run notebook files with the executor.
You can use parameters in your execution to make specific changes to each run. For example, you might specify a different dataset to use, change the learning rate on your model, or change the version of the model.
You can also set a notebook to run on a recurring schedule. Even while your instance is shut down, Vertex AI Workbench will run your notebook file and save the results for you to look at and share with others.

Related

Multiple users use same jupyter kernel

Question: Is it possible to have multiple users connect to the same Jupyter kernel?
Context: I am trying to provide Jupyter notebook access to a large number of users. All users are using Python.
Right now, every notebook spawns a new kernel pod in the Kubernetes cluster, which is inefficient. I am looking for a way to connect a few users to a single kernel pod in Kubernetes, so that we consume fewer compute resources.
I am new to Jupyter notebooks, so my terminology might have errors. Also, I came across KernelProvisioner and was wondering if that's of any help.
I am looking to see:
Whether it's even possible in Jupyter?
Which new K8s objects to add to achieve this, for example custom controllers, services, deployments, etc.
Any inputs will be appreciated.
Thank you!
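For reference, outside Kubernetes several clients can already attach to one running kernel through its connection file, which is essentially the behaviour I am after; a rough sketch with jupyter_client (the connection-file path is hypothetical):

```python
# Attach an extra client to an already-running kernel via its connection file.
# Every client that loads the same file talks to the same kernel process, so
# they share interpreter state. The connection-file path is hypothetical.
from jupyter_client import BlockingKernelClient

kc = BlockingKernelClient()
kc.load_connection_file("/run/user/1000/jupyter/kernel-12345.json")
kc.start_channels()

kc.execute("x = 41")         # state set through one client...
kc.execute("print(x + 1)")   # ...is visible to any other client on this kernel
```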

cloud 9 and sagemaker - hyper parameter optimisation

I have done quite a few Google searches but have not found a clear answer to the following use case. Basically, I would rather use Cloud9 (most of the time) as my IDE rather than Jupyter. What I am not sure about is how I could execute long-running jobs like (Bayesian) hyperparameter optimisation from there. Can I use SageMaker capabilities? Should I use Docker and deploy to ECR (looking for the cheapest-ish option)? Any pointers w.r.t. this particular issue would be very much appreciated. Thanks.
You could use whatever IDE you choose (including your laptop).
A SageMaker tuning job (example) is asynchronous, so you can safely close your IDE after launching it. You can monitor the job in the AWS web console, or with a DescribeHyperParameterTuningJob API call.
You can launch TensorFlow, PyTorch, XGBoost, Scikit-learn, and other popular ML frameworks, using one of the built-in framework containers, avoiding the extra work of bringing your own container.
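For the monitoring part, a minimal boto3 sketch (the tuning job name is a placeholder):

```python
# Poll the status of a SageMaker hyperparameter tuning job with boto3.
# "my-tuning-job" is a placeholder for your tuning job's name.
import boto3

sm = boto3.client("sagemaker")
resp = sm.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName="my-tuning-job"
)
print(resp["HyperParameterTuningJobStatus"])                   # e.g. InProgress, Completed
print(resp.get("BestTrainingJob", {}).get("TrainingJobName"))  # best trial so far, if any
```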

How can I run Google Cloud's "AI Notebooks" on a schedule automatically?

Notebooks in Google Cloud Platform have been great for Python development in the cloud, but the last missing piece is just running existing notebooks on a schedule. There are a million different tools (Airflow, Papermill, Google Cloud Jobs, Google Cloud Scheduler, Google Cloud cron jobs), and as someone not very familiar with the cloud, it's really easy to get lost. Any suggestions? Thanks guys!
This post on Medium, "How to Deploy and Schedule Jupyter Notebook on Google Cloud Platform", describes how to run Jupyter notebook jobs on a Compute Engine instance and schedule them using GCP's Cloud Scheduler > Cloud Pub/Sub > Cloud Functions.
If you want to use Cloud Composer, you might find this answer to a related question, "ETL in Airflow aided by Jupyter Notebooks and Papermill," useful.
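For the Cloud Functions piece of that chain, a rough sketch of what the function could look like, assuming a first-generation Pub/Sub-triggered function with the notebook bundled in the function's source and a hypothetical results bucket (all names and paths are placeholders):

```python
# main.py for a Pub/Sub-triggered Cloud Function (1st gen) - a rough sketch.
# The notebook is assumed to ship with the function's source, and the executed
# copy is uploaded to a hypothetical bucket; adjust names, paths, and timeouts.
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor
from google.cloud import storage

def run_notebook(event, context):
    """Entry point, triggered by a Cloud Scheduler -> Pub/Sub message."""
    nb = nbformat.read("notebook.ipynb", as_version=4)
    ep = ExecutePreprocessor(timeout=1800, kernel_name="python3")
    ep.preprocess(nb, {"metadata": {"path": "."}})   # executes all cells in place

    out_path = "/tmp/output.ipynb"
    nbformat.write(nb, out_path)

    # Upload the executed notebook so the results can be inspected later.
    bucket = storage.Client().bucket("my-notebook-results")   # placeholder bucket
    bucket.blob("output.ipynb").upload_from_filename(out_path)
```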
Note that this scheduling option only works for managed notebooks, unfortunately not for user-managed notebooks.
The two main options seem to be either manually configuring Jupyter notebooks to run on a schedule, or letting Cloud Composer do the heavy lifting.
Regarding the manual route, there are several ways to get a notebook to run on a schedule: a plugin that schedules notebook files for recurring execution, a recurring Python script (converted from the notebook) scheduled on GCP, a Cloud Scheduler cron job, or Cloud Functions with Pub/Sub and Cloud Scheduler; see this Stack Overflow thread on “How to run a Python notebook daily automatically”.
Using Cloud Composer offers a less manual approach and is more scalable if need be; refer to this Stack Overflow thread for more information.
To execute a specific notebook you can use Papermill; here is a very extensive article on scheduled execution using Papermill (a short sketch follows at the end of this answer). Check out this Google Cloud blog for an example and more information.
There is a “Jupyter Notebook Manifesto: Best practices that can improve the life of any developer using Jupyter notebooks” blog post by Google Cloud that explains the approach in depth and can be found here.
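As a concrete sketch of the Papermill route (file names and parameter values are placeholders, not taken from the linked article):

```python
# Execute a notebook with Papermill, injecting parameters for this run.
# File names and parameter values are placeholders.
import papermill as pm

pm.execute_notebook(
    "input.ipynb",    # notebook to execute (with a cell tagged "parameters")
    "output.ipynb",   # executed copy, with all cell outputs saved
    parameters={"dataset": "gs://my-bucket/data.csv", "learning_rate": 0.01},
)
```
The same call can then be dropped into whichever scheduler you pick: cron on a VM, the Cloud Scheduler > Pub/Sub > Cloud Functions chain, or a Cloud Composer task.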
If you create a managed notebook on GCP, you can now schedule a notebook execution within the notebook environment itself.
Create a Jupyter environment with Managed Notebooks.
Within the managed notebook, choose Execute to open the scheduler settings.
See also: Schedule managed notebook quickstart

multiple simultaneous connections on same jupyter notebook at the same time

I created a Jupyter notebook for the purpose of running a survey on a fairly large group of people; it consists of one script that each person has to run and fill in. To make it convenient for them, I hosted a public Jupyter notebook server and mailed every person the link to participate.
The problem is that when one person is running the script, everyone else has to wait until that person closes the notebook before they can run it. I want a system that spawns a separate kernel for every incoming connection, so multiple people can take the survey at the same time.
Does anyone have any ideas?
Jupyter Notebook wasn't made for simultaneous collaboration on the same file. One solution I've seen that addresses exactly this problem is Google Colab, which is a fork of Jupyter built on Google's collaborative Docs platform, and allows exactly what you're talking about.
It looks like for JupyterLab, they're hoping to integrate simultaneous editing as a core feature (they were originally going for a Google Drive backend, but Google seems to have pulled support, and now they're considering more P2P solutions like IPFS), but that work has hit a few bumps and won't be released with version 1.0.

Interactive Ipython Notebooks on Heroku

I am currently trying to make Python tutorials and host them as IPython notebooks on a Heroku site. The problem is that IPython notebooks are static when uploaded. I am trying to make it so that users can use the notebook interactively (for example, print outputs). I also don't want the output from their notebooks to be saved permanently on the Heroku site.
From what I understand, you have two issues to deal with:
interactive notebooks
"read only" notebooks (do not save the modifications)
For issue 1, you need to run a Jupyter (the new IPython name for notebooks) server. Only showing the notebook is not enough, because you need a server to "understand" and execute the modifications. See: http://jupyter-notebook.readthedocs.io/en/latest/public_server.html
I am not familiar with Heroku; after googling for 2 seconds I found this: https://github.com/pl31/heroku-jupyter, which was able to deploy a working Jupyter server on a demo Heroku machine.
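If you go that route, the server side boils down to a few lines of jupyter_notebook_config.py; a minimal sketch, assuming Heroku supplies the port through the PORT environment variable (the password hash is a placeholder):

```python
# jupyter_notebook_config.py - minimal public-server settings (sketch only).
# The `c` object is provided by Jupyter's config loader. Generate a real
# password hash with: from notebook.auth import passwd; passwd()
import os

c.NotebookApp.ip = "0.0.0.0"                             # listen on all interfaces
c.NotebookApp.port = int(os.environ.get("PORT", 8888))   # Heroku injects PORT
c.NotebookApp.open_browser = False
c.NotebookApp.password = "sha1:...placeholder-hash..."   # never a plain-text password
```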
In my opinion, issue 2 is more difficult to solve.
When the learners change the notebook, the modifications are applied to the notebook file (.ipynb), so they are persistent... This is not what you want.
You could try some tricks using file permissions to prevent the kernel from saving the file, but I think it would only crash the kernel...
Moreover, it raises several user-interaction problems: for instance, what if I lose my internet connection? Will I lose my work? Why? Is this what I really want as a learner?
For this, the best solution is to give each user access to the notebook / a workspace where they can save their progress, but that is more work than just deploying a Jupyter server. As an example, see databricks.com (the first (only) one that comes to mind, not necessarily the best).
(As a remark, it seems that multi-user mode is already implemented: https://jupyterhub.readthedocs.io/en/latest/)
I would like to add a last remark about the security of the server. Letting strangers access a server with an embedded shell sounds like a bad idea if you are not prepared for the consequences. I would suggest you look into how to put each user's Jupyter session in a "jail" / container, or anything similar that works on Heroku.
