Is it safe to work with confidential data in Colab? - python

After having worked with it for a while, I would like to understand how Colab really works and whether it is safe to work with confidential data in it.
A bit of context. I understand the differences between Python, IPython and Jupyter Notebook described here, and I would summarize them by saying: Python is a programming language and can be installed like any other application (e.g. with sudo apt-get). IPython is an interactive command-line terminal for Python and can be installed with pip, the standard package manager for Python, which lets you install and manage additional packages written in Python that are not part of the Python standard library. Jupyter Notebook adds a web interface and can use several kernels or backends, IPython being one of them.
What about Colab? It is my understanding that when using Colab, I get a VM from Google with Python pre-installed, as well as many other libraries (a.k.a. packages) like pandas or matplotlib. These packages are all installed in the base Python installation.
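A quick way to check this from a notebook cell (a minimal sketch, assuming a Python 3.8+ runtime, which current Colab VMs provide; the package names are just examples):
import sys
from importlib.metadata import version

print(sys.version)                # the interpreter the Colab VM provides
for pkg in ("pandas", "matplotlib", "numpy"):
    print(pkg, version(pkg))      # versions shipped in the base installation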
Colab VMs come with some ephemeral storage, equivalent to instance storage in AWS, so it will be lost when the VM runtime is interrupted, i.e. when our VM is stopped (or would you rather say... terminated?) by Google. I believe that if I were to upload my confidential data there, it would not be in my private subnet...
Mounting our Drive is hence equivalent to using an EBS volume in AWS. An EBS volume is a network-attached drive, so the data in it will persist even if the VM runtime is interrupted. EBS volumes can, however, be attached to only one EC2 instance... but I can mount my Drive to several Colab sessions. It is not exactly clear to me what these sessions are...
Some users would like to create virtual environments in Colab, and it looks like mounting the Drive is a way to get around it.
When mounting our Drive to Colab, we need to authenticate because we are giving the IP of the Colab VM access to our private subnet. Hence, if we had some confidential data, by using Colab the data would not leave our private company subnet...?

IIUC, the last paragraph asks the question: "Can I use IP-based authentication to restrict access to data in Colab?"
The answer is no: network address filtering cannot provide meaningful access restrictions in Colab.
Colab is a service rather than a machine. Colab backends do not have fixed IP addresses or a fixed IP address range. By analogy, there's no list of IP addresses for restricting access to a particular set of Google Drive users since, of course, Google Drive users don't have a fixed IP address. Colab users and backends are similar.
Instead of attempting to restrict access to IPs, you'll want to restrict access to particular Google accounts, perhaps using typical Drive file ACLs.
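For context, mounting Drive in Colab goes through an interactive OAuth prompt tied to a Google account, not any IP check; a minimal sketch (the file name is just an example):
# Only works inside a Colab notebook; triggers an OAuth prompt for a Google account.
from google.colab import drive

drive.mount('/content/drive')

# Files written under the mount persist in Drive and are governed by the
# Drive sharing settings (ACLs) of the account that authorized the mount.
with open('/content/drive/MyDrive/example.txt', 'w') as f:
    f.write('persisted via Drive, access controlled per Google account\n')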

Related

Connection killed Azure container

I am trying to access an Azure container to download some blobs with my Python code.
My code works perfectly on Windows, but when I execute it on my Debian VM I get this error message:
<azure.storage.blob._container_client.ContainerClient object at 0x7f0c51cafd10>
Killed
admin_bbt#vm-bbt-cegidToAZ:/lsbCodePythonCegidToAZ/fuzeo_bbt_vmLinux_csvToAZ$
The blob I am trying to access is not mine, but I do have the SAS key.
My code fails after this line:
container = ContainerClient.from_container_url(sas_url)
What I have tried to do:
move my VM to another location
open port 445 on my VM
install cifs-utils
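For reference, this is roughly what I am trying to do (a minimal sketch; the SAS URL and local paths are placeholders):
from azure.storage.blob import ContainerClient

sas_url = "https://<account>.blob.core.windows.net/<container>?<sas-token>"  # placeholder
container = ContainerClient.from_container_url(sas_url)

for blob in container.list_blobs():
    # Stream each blob to disk rather than loading it fully into memory.
    with open(blob.name, "wb") as f:
        container.download_blob(blob.name).readinto(f)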
Usually this issue comes up when the VM has not been enabled for managed identities for Azure resources. These MS Docs helped me enable it successfully (MSDocs1, MSDocs2).
We also need to check the network access rules, which are configured as below:
Go to the storage account you want to secure.
Select the settings menu called Networking.
To deny access by default, choose to allow access from Selected networks. To allow traffic from all networks, choose to allow access from All networks.
Select Save to apply your changes.
Along with these setting changes, we need to ensure users can access Blob Storage, and we might need to add VNet integration.
Check these MS Docs for an understanding of Azure Storage firewall rules.
We can also use MSI (a managed identity) to authenticate from the VM.
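For example, with a managed identity enabled on the VM and a role such as Storage Blob Data Reader granted on the storage account, the azure-identity package can authenticate without a SAS token (a minimal sketch; the account URL and container name are placeholders):
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

credential = DefaultAzureCredential()  # picks up the VM's managed identity
service = BlobServiceClient(
    account_url="https://<account>.blob.core.windows.net",  # placeholder
    credential=credential,
)
container = service.get_container_client("<container>")     # placeholder
for blob in container.list_blobs():
    print(blob.name)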

Anaconda enterprise connection to cloud vs. 'offline'

On my company PC, I do not have full permissions to install Python packages (usually this has to be requested for approval from IT, which is very painful and takes a very long time).
I am thinking of asking my manager to invest in Anaconda Enterprise so that the security aspect of open-source Python use will no longer be an issue. Another consideration: my boss is looking to move to the cloud, and I was wondering whether Anaconda Enterprise can be used interchangeably on-premise (offline from the cloud, i.e. no use of cloud storage or cloud compute resources) and, when needed for big data processing, switched to 'cloud mode' by connecting to any of AWS, GCP or Azure to rent GPU instances? Any advice welcome.
Yes, that can be a good approach for your company. I have used it in many projects on GCP and IBM Cloud over Debian 7, 8 and 9, and it is a good approach. Depending on your needs, you can also create a package channel with the Enterprise version and manage the permissions over your packages. It also has a deployment tool where you can manage and audit deployments for different projects and APIs, track them, and assign them to owners.
You can switch your server nodes to different servers, or add and remove them, as you work. Depending on your environment it can be difficult at the beginning, but it is pretty good once implemented.
Below are some links where you can see more information about what I'm talking about:
using-anaconda-enterprise
conda-offline-install-update
server-nodes
Depending on your preferences, it may not be necessary to use Anaconda Enterprise on GCP. If your boss is looking to move to the cloud, then GCP has some great options for analyzing big data. Using the AI Platform you can deploy a new instance and choose R, Python, CUDA, TensorFlow, etc. Once the instance is deployed you can start your data preprocessing: install whatever libraries you desire (NumPy, SciPy, pandas, Matplotlib, etc.) and start your data manipulation.
If you are using something like Jupyter Notebooks, you can work offline to prepare your code before moving to the GCP platform to run the model training.
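As a rough illustration, once such an instance is running you can confirm the libraries and GPU from a notebook cell (a minimal sketch, assuming a TensorFlow image with a GPU attached; otherwise the GPU list is empty):
import numpy as np
import pandas as pd
import tensorflow as tf

print(tf.__version__)
print(tf.config.list_physical_devices('GPU'))  # should list the rented GPU(s)

# A trivial preprocessing step on made-up data, just to show the libraries work.
df = pd.DataFrame({'x': np.arange(5), 'y': np.arange(5) ** 2})
print(df.describe())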
Oh, also GCP has many labs to test out their Data Science platform.
https://www.qwiklabs.com/quests/43
GCP has many free promos these days; below is a link to one.
GCP - Build your cloud skills for free with Google Cloud
Step by step usage for AI Platform

Can I create, store and run Python projects/scripts from a cloud storage network drive?

I've almost run out of space on my C: drive and I'm currently working for myself remotely. I want to purchase cloud storage that will act as a mounted drive, so that it can do the following:
Store all of my Python projects along with any other files
Run my Python scripts in VS Code (or any IDE) straight from the drive
Create virtual environments for my Python projects that will be stored on the drive
Set up APIs, from Python scripts stored on this drive, to other programs (e.g. GA or Heroku) so I can push and pull data as required
I just purchased OneDrive thinking I'd be able to do this, but according to the answer in this SO post it's not a good idea. This article describes the exact behaviour that I'm after, and pCloud looks like a good option given its security, but I can't find much about its compatibility with Python.
Google Cloud, AWS and Azure are all out of my price range and look too complex for what I'm after. My cloud computing knowledge is fairly limited but I was wondering if anyone has any experience of running scripts in Python from the cloud (from pulling data from a warehouse to hosting an application in the public domain) that isn't using one of the big cloud computing companies?

How do I programmatically upload files to JupyterLab from my local storage?

I was wondering if JupyterLab has an API that allows me to programmatically upload files from my local storage to the JupyterLab portal. Currently, I am able to manually select "Upload" through the UI, but I want to automate this.
I have searched their documentation but no luck. Any help would be appreciated. Also, I am using a chromebook (if that matters).
Thanks!!
Firstly, you can use the Python packages requests and urllib to upload files:
https://stackoverflow.com/a/41915132/11845699
This method is effectively the same as clicking the upload button, but the upload speed is not very satisfying, so I don't recommend it if you are uploading lots of files or large files.
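Concretely, that linked approach boils down to a PUT against the Jupyter contents REST API; a rough sketch (the server URL, token and file name are placeholders for your own setup):
import base64
import json
import requests

base_url = "http://localhost:8888"   # placeholder
token = "<your-jupyter-token>"       # placeholder
local_file = "data.csv"              # placeholder

with open(local_file, "rb") as f:
    payload = {
        "name": local_file,
        "type": "file",
        "format": "base64",
        "content": base64.b64encode(f.read()).decode("ascii"),
    }

# PUT the file to the contents API at the target path.
resp = requests.put(
    f"{base_url}/api/contents/{local_file}",
    headers={"Authorization": f"token {token}"},
    data=json.dumps(payload),
)
resp.raise_for_status()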
I don't know whether your JupyterLab server is managed by your administrator or by yourself. In my case, I'm the administrator of the server in my lab, so I set up an NFS disk and mounted it to a folder in the JupyterLab working directory. The users can access this NFS disk via our local network or the internet. An NFS disk can handle lots of large files, which is much more efficient than the Jupyter upload button. I learned this from a talk by a TA at Berkeley: https://bids.berkeley.edu/resources/videos/teaching-ipythonjupyter-notebooks-and-jupyterhub
I highly recommend this if you can contact the person who has access to the file system of your Jupyter server. If you don't use Linux, then WebDAV is an alternative to NFS. Actually, anything that can give you access to a folder on a remote server is an option, such as Nextcloud or Pydio.
(If you can't ask the administrator to deploy such service, then just use the python packages)

In my Compute Engine VM I have to reinstall Python modules every time I log in

The title says it all. I have a Google Cloud Compute Engine VM, and every time I log in I have to reinstall the packages in order to run my script.
It doesn't matter whether I have the modules installed in the virtual environment
or in the VM.
The weird thing is that the .json credentials that I am supposed to export every time are fine; I don't have to export them every time I log in.
Any ideas why this keeps happening?
The behavior you are describing matches the limitations of Cloud Shell. I'm not saying you are connecting to the GCP Cloud Shell instead of your VM instance; however, some users have confused their VM's terminal with Cloud Shell in the past.
Is this only happening for Python, or for other packages as well? Please check that you have enough space available in your filesystems and that you are not installing the packages in a volatile filesystem partition.
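As a quick diagnostic, something like this run right after installing a package shows which interpreter and site-packages directory are actually being used and how much space is left on the filesystem backing them (a minimal sketch):
import shutil
import site
import sys

print("interpreter:   ", sys.executable)
print("site-packages: ", site.getsitepackages())

# Free space on the filesystem that holds the installation prefix.
total, used, free = shutil.disk_usage(sys.prefix)
print(f"free space under {sys.prefix}: {free / 1e9:.1f} GB")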
