I need to share Jupyter notebooks with my colleagues. My company is pretty strict about data sharing, so it needs to stay on our private cloud; we use AWS. SageMaker is OK, but we also want to share the same environment I set up for my notebook, and we don't have the budget for Domino Data Labs. Any methods or tools, even low-cost ones, that you'd recommend? Really appreciate the help.
Background on what I tried:
I don't even know where to start...
Our DevOps folks don't have the bandwidth to do this for a while, but it's crushing us because we have a deliverable due soon.
Related
Question: Is it possible to have multiple users connect to the same Jupyter kernel?
Context: I am trying to provide Jupyter notebook access to a large number of users. All users are using Python.
Right now, every notebook spawns a new kernel pod in the Kubernetes cluster, which is inefficient. I am looking for a way to connect a few users to a single kernel pod in Kubernetes, so that we consume relatively less compute.
I am new to Jupyter notebooks, so my terminology might have errors. Also, I came across KernelProvisioner and was wondering if that's of any help?
I am looking to see:
Is this even possible in Jupyter?
Which new K8s objects would I need to add to achieve this (for example, custom controllers, services, deployments, etc.)?
Any inputs will be appreciated.
Thank you!
I would like to run a small, one-day coding workshop with some kids and teenagers.
For this, I am looking for a publicly hosted JupyterLab system that we can use for free to write some small Python scripts. One requirement is that we can upload a .csv file to this system.
I stumbled across https://try.jupyter.org/ which provides free JupyterLab instances to use.
My question: Does anybody have experience with how long scripts stay "uploaded" there? Is the lab regularly reset, or does it store the scripts somewhere locally (in the browser cache, etc.)? We would shut down the PCs to go for lunch, and the files should be accessible in the lab again once we restart the PCs.
I know we can download and store the notebooks offline (which we will do, just to make sure) and re-upload them if we need to, but it would still be nice to know about the persistence of the "Try" service.
I am using a locked-down system where I cannot install any applications, including Anaconda or any other Python distribution.
Does anybody know if it is possible to access local files from an online Jupyter solution? I know it would probably be slow, as the files would have to be moved back and forth.
Thanks
Yes, you can use your local files from an online Jupyter solution by moving them back and forth, as you say. (The remote server cannot connect to your local system itself beyond the browser sandbox, so concerns like the ones Chris mentions aren't an issue.)
I can demonstrate this easily:
Go here and click on the launch binder badge you see.
A temporary session backed by MyBinder.org will spin up. Depending on where you are in the world, you may land on a machine run by the Jupyter folks via Google, or on one run by another member of the federation, backed by folks who believe this is a valuable service to offer to empower Jupyter users.
After the session comes up, you'll be in the JupyterLab interface. You'll see a file/directory navigation pane on the left side. You can drag a file from your local computer and drop it in that pane, and you should see it show up in the remote directory.
You should be able to open and edit it, depending on what it is or what you convert it to. You can even run it.
Of course, you can make a new notebook in the remote session and save it. After saving it, download it back to your local machine by right-clicking its icon in the file navigation pane and selecting 'Download'.
If you prefer to work in the classic Jupyter Notebook interface, you can go to 'Help' and select 'Launch Classic Notebook' from the menu. The classic Jupyter dashboard will come up. You will need to upload things there using the upload button, since drag and drop only works in JupyterLab. You can download back to your local computer from the dashboard, or, when you have a notebook open, use the File menu to download it back to your local machine.
Make sure you save anything useful back to your machine, as the sessions are temporary and will time out after 10 minutes of inactivity. They'll also disconnect after a few hours even if you are actively using them. There's a safety net built in that works if a session does disconnect, but you have to be aware of it ahead of time, and it is best to test it a few times in advance, when you don't need it. See 'Getting your notebook after your Binder has stopped'.
As this is going to a remote machine, there are obviously security concerns. Part of this is addressed by the temporary nature of the sessions: nothing is stored remotely once the active session goes away (hence the paragraph above, because once it is gone, it is gone). However, don't upload anything you wouldn't want someone else to see, and don't share keys and the like with this system. In fact, it is now possible to do real-time co-authoring/co-editing of Jupyter notebooks via the MyBinder system, although some minor glitches are still being worked out.
You can install a lot of packages right in the session using %pip install or %conda install in cells in the notebook. However, sometimes you want them already installed so the session is ready with the necessary software. (Plus, some software won't work unless it is installed while the image backing the session's container is being built.) That is where it becomes handy that you can customize the session that comes up via configuration files in public repositories. You can see a list of places where you can host those files by going to MyBinder.org and opening the dropdown menu at the top left of the form there, under 'GitHub repository name or URL'. Here's an example: you can look in requirements.txt and see that I install quite a few packages from the data science stack.
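To give a sense of what such a configuration file looks like, here is a minimal requirements.txt you could put at the root of a public repository so Binder builds the image with those packages already installed. The package list is just an illustration, not the contents of the example repository above:

```
# requirements.txt -- packages Binder installs into the image at build time
numpy
pandas
matplotlib
```

Anything you forgot can still be added on the fly in a running session with %pip install, as mentioned above.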
Of course, there are other related online offerings for Jupyter (or you can install it on remote servers), and many use authentication. As some of those cost money and you are unsure about your locked-down system, the MyBinder.org system may help you test the limits of what you can do on your machine.
On my company PC, I do not have full permissions to install Python packages (usually this has to be requested for approval from IT, which is very painful and takes a very long time).
I am thinking of asking my manager to invest in Anaconda Enterprise so that the security aspect of open-source Python use will no longer be an issue. However, my boss is also looking to move to the cloud, and I was wondering whether Anaconda Enterprise can be used interchangeably on-premise (offline from the cloud, i.e., no use of cloud storage or cloud compute resources) and, when needed for big data processing, switched to a 'cloud mode' by connecting to any of AWS, GCP, or Azure to rent GPU instances? Any advice welcome.
Yes, that can be a good approach for your company. I used it in many projects on GCP and IBM Cloud over Debian 7, 8, and 9, and it worked well. Depending on your needs, you can also create a package channel with the Enterprise version and manage the permissions on your packages. It also has a deployment tool where you can manage and audit the different deployments for projects and APIs, track them, and assign them to owners.
You can switch your server nodes to different servers, or add and remove them, depending on your environment. It can be difficult at the beginning, but it works pretty well once implemented. (A rough sketch of the offline-channel workflow follows the links below.)
Below are some links where you can see more information about what I'm talking about:
using-anaconda-enterprise
conda-offline-install-update
server-nodes
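For orientation, here is a minimal sketch of the offline/local-channel workflow those links describe. The directory path and package names are hypothetical placeholders, and the exact steps depend on how your Anaconda Enterprise repository is set up:

```
# Hypothetical sketch of serving packages from a local, offline channel.
# 1. On a machine with internet access, collect the package files (.conda / .tar.bz2)
#    into a channel directory, split by platform:
#      /srv/my-channel/linux-64/   /srv/my-channel/noarch/
# 2. Index the channel (the `conda index` command ships with conda-build):
conda index /srv/my-channel
# 3. On the locked-down machines, install from that channel only:
conda install --offline -c file:///srv/my-channel numpy pandas
```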
Depending on your preferences, it may not be necessary to use Anaconda Enterprise on GCP. If your boss is looking to move to the cloud, then GCP has some great options for analyzing big data. Using the AI Platform, you can deploy a new instance and choose R, Python, CUDA, TensorFlow, etc. Once the instance is deployed, you can start your data preprocessing: install whatever libraries you desire (NumPy, SciPy, pandas, Matplotlib, etc.) and start your data manipulation.
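As a rough illustration of that preprocessing step (not tied to any particular GCP product), a first cell in such a notebook might look like the sketch below; the file name and column names are made up for the example:

```python
# Hypothetical preprocessing sketch; "sales.csv" and its columns are invented for illustration.
import pandas as pd

df = pd.read_csv("sales.csv")                      # load the raw data
df = df.dropna(subset=["amount"])                  # drop rows missing the value we care about
df["date"] = pd.to_datetime(df["date"])            # parse dates so we can group by month
monthly = df.groupby(df["date"].dt.to_period("M"))["amount"].sum()
print(monthly.head())
```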
If you're using something like Jupyter notebooks, you can prepare that kind of work offline before moving to the GCP platform to run the model training.
Oh, also GCP has many labs to test out their Data Science platform.
https://www.qwiklabs.com/quests/43
GCP has many free promos these days below is a link to one.
GCP - Build your cloud skills for free with Google Cloud
Step by step usage for AI Platform
After having worked with it for a while, I would like to understand how Colab really works and whether it is safe to work with confidential data in it.
A bit of context. I understand the differences between Python, IPython, and Jupyter Notebook described here, and I would summarize them as follows. Python is a programming language and can be installed like any other application (e.g., with sudo apt-get). IPython is an interactive command-line terminal for Python and can be installed with pip, the standard package manager for Python, which lets you install and manage additional packages written in Python that are not part of the Python standard library. Jupyter Notebook adds a web interface on top of that and can use several kernels or backends, IPython being one of them.
What about Colab? It is my understanding that when using Colab, I get a VM from Google with Python pre-installed, as well as many other libraries (a.k.a. packages) like pandas or matplotlib. These packages are all installed in the base Python installation.
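For example, a quick check from a Colab cell shows what the runtime ships with; these are ordinary Python and pip commands, nothing Colab-specific:

```python
# Ordinary Python/pip checks, run in a notebook cell.
import sys
import pandas as pd

print(sys.version)       # the interpreter the VM provides
print(pd.__version__)    # pandas is already there in the base environment
!pip list                # everything else that comes pre-installed
```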
Colab VMs come with some ephemeral storage. This is equivalent to instance storage in AWS, so it will be lost when the VM runtime is interrupted, i.e., when our VM is stopped (or would you rather say... terminated?) by Google. I believe that if I were to upload my confidential data there, it would not be in my private subnet...
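For reference, this is the kind of upload I mean; files.upload() comes from the google.colab helper package and places the file on that ephemeral disk:

```python
# Upload a file from the browser onto the VM's ephemeral disk
# (it disappears when the runtime is recycled).
from google.colab import files

uploaded = files.upload()                # opens a file picker in the browser
for name, data in uploaded.items():
    print(name, len(data), "bytes")      # the content now lives only on this VM
```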
Mounting our Drive is hence the equivalent of using an EBS volume in AWS. An EBS volume is a network-attached drive, so the data in it persists even if the VM runtime is interrupted. An EBS volume can, however, be attached to only one EC2 instance... yet I can mount my Drive to several Colab sessions. It is not exactly clear to me what these sessions are...
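For concreteness, the mount itself is done with the google.colab helper; after the authentication flow, Drive appears as an ordinary directory (the file name below is a made-up example):

```python
# Mount Google Drive into the Colab VM's filesystem (triggers an auth prompt).
from google.colab import drive

drive.mount('/content/drive')                            # standard mount point in Colab
print(open('/content/drive/MyDrive/notes.txt').read())   # hypothetical file, just for illustration
```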
Some users would like to create virtual environments in Colab, and it looks like mounting the Drive is a way to work around that limitation.
When mounting our Drive to Colab, we need to authenticate because we are giving the IP of the Colab VM access to our private subnet. Hence, if we had some confidential data, by using Colab the data would not be leaving our private company subnet...?
IIUC, the last paragraph asks the question: "Can I use IP-based authentication to restrict access to data in Colab?"
The answer is no: network address filtering cannot provide meaningful access restrictions in Colab.
Colab is a service rather than a machine. Colab backends do not have fixed IP addresses or a fixed IP address range. By analogy, there's no list of IP addresses for restricting access to a particular set of Google Drive users since, of course, Google Drive users don't have a fixed IP address. Colab users and backends are similar.
Instead of attempting to restrict access to IPs, you'll want to restrict access to particular Google accounts, perhaps using typical Drive file ACLs.