How to upload data permanently on google-colaboratory? - python

When I upload data using the following code, the data vanishes once I get disconnected.
from google.colab import files
uploaded = files.upload()
for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))
Please suggest ways to upload my data so that it remains intact even after days of disconnection.

I keep my data stored permanently in a .zip file in Google Drive, and upload it to the Google Colab VM using the following code.
Paste it into a cell and change the file_id. You can find the file_id in the URL of the file in Google Drive. (Right-click on the file -> Get shareable link -> take the part of the URL after open?id=.)
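For example, a shareable link of the form https://drive.google.com/open?id=1BuM11fJJ1qdZH3VbQ-GwPlK5lAvXiNDv yields the file_id 1BuM11fJJ1qdZH3VbQ-GwPlK5lAvXiNDv (the same placeholder ID used in the cell below).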
#@title uploader
file_id = "1BuM11fJJ1qdZH3VbQ-GwPlK5lAvXiNDv" #@param {type:"string"}
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# 1. Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
# PyDrive reference:
# https://googledrive.github.io/PyDrive/docs/build/html/index.html
# Build a Drive v3 API service, reusing the credentials from above.
from googleapiclient.discovery import build
drive_service = build('drive', 'v3')
# Replace the assignment below with your file ID
# to download a different file.
#
# A file ID looks like: 1gLBqEWEBQDYbKCDigHnUXNTkzl-OslSO
import io
from googleapiclient.http import MediaIoBaseDownload
request = drive_service.files().get_media(fileId=file_id)
downloaded = io.BytesIO()
downloader = MediaIoBaseDownload(downloaded, request)
done = False
while done is False:
    # _ is a placeholder for a progress object that we ignore.
    # (Our file is small, so we skip reporting progress.)
    _, done = downloader.next_chunk()
# Alternatively, download via PyDrive (this fetches the file again,
# independently of the BytesIO download above) and save it locally.
drive_file = drive.CreateFile({'id': file_id})  # Wrap the Drive file by its ID.
print(drive_file['title'])
drive_file.GetContentFile(drive_file['title'])  # Save the Drive file as a local file.
!unzip {drive_file['title']}

Keeping data in Google Drive is good (@skaem).
If your data contains code, I suggest simply git-cloning your source repository from GitHub (or any other code-hosting service) at the beginning of your Colab notebook.
This way you can develop offline and run your experiments in the cloud whenever you need, with up-to-date code.
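A minimal sketch of this approach (the repository URL is a placeholder for your own):
# Clone your repository into the fresh Colab VM (placeholder URL).
!git clone https://github.com/your-user/your-repo.git
%cd your-repo
Re-running this cell after each disconnect gives you an up-to-date working copy.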

Related

PyDrive - Erase contents of a file

Consider the following code that uses the PyDrive module:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
gauth = GoogleAuth()
gauth.LocalWebserverAuth()
drive = GoogleDrive(gauth)
file = drive.CreateFile({'title': 'test.txt'})
file.Upload()
file.SetContentString('hello')
file.Upload()
file.SetContentString('')
file.Upload() # This throws an exception.
Creating the file and changing its contents works fine until I try to erase the contents by setting the content string to an empty one. Doing so throws this exception:
pydrive.files.ApiRequestError
<HttpError 400 when requesting
https://www.googleapis.com/upload/drive/v2/files/{LONG_ID}?alt=json&uploadType=resumable
returned "Bad Request">
When I look at my Drive, I see the test.txt file successfully created with the text hello in it. However, I expected it to be empty.
If I change the empty string to any other text, the file is changed twice without errors. That doesn't clear the contents, though, so it's not what I want.
When I looked up the error on the internet, I found this issue on the PyDrive GitHub that may be related, though it has remained unsolved for almost a year.
If you want to reproduce the error, you have to create your own project that uses the Google Drive API, following this tutorial from the PyDrive docs.
How can one erase the contents of a file through PyDrive?
Issue and workaround:
When resumable=True is used, it seems that zero-byte data cannot be uploaded, so in this case the empty content has to be uploaded without using resumable=True. But looking at the PyDrive source, resumable=True is used by default (ref). As a workaround, I propose using the requests module; the access token is retrieved from PyDrive's gauth.
With this modification, your script becomes as follows.
Modified script:
import io
import requests
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
gauth = GoogleAuth()
gauth.LocalWebserverAuth()
drive = GoogleDrive(gauth)
file = drive.CreateFile({'title': 'test.txt'})
file.Upload()
file.SetContentString('hello')
file.Upload()
# file.SetContentString('')
# file.Upload()  # This throws an exception.
# I added the script below.
res = requests.patch(
    "https://www.googleapis.com/upload/drive/v3/files/" + file['id'] + "?uploadType=multipart",
    headers={"Authorization": "Bearer " + gauth.credentials.token_response['access_token']},
    files={
        'data': ('metadata', '{}', 'application/json'),
        'file': io.BytesIO()
    }
)
print(res.text)
References:
PyDrive
Files: update

Get a shareable link of a file in our google drive using Colab notebook

Could anyone please tell me how to automatically get a shareable link for a file in my Google Drive using a Colab notebook?
Thank you.
You can use xattr to get the file_id:
from subprocess import getoutput
from IPython.display import HTML
from google.colab import drive
drive.mount('/content/drive') # access drive
# need to install xattr
!apt-get install xattr > /dev/null
# get the id
fid = getoutput("xattr -p 'user.drive.id' '/content/drive/My Drive/Colab Notebooks/R.ipynb' ")
# make a link and display it
HTML(f"<a href=https://colab.research.google.com/drive/{fid} target=_blank>notebook</a>")
Here I access my notebook file at /Colab Notebooks/R.ipynb and make a link to open it in Colab.
In my case the suggested solution didn't work, so I replaced the Colab URL with "https://drive.google.com/file/d/".
Below is what I used:
def get_shareable_link(file_path):
    fid = getoutput("xattr -p 'user.drive.id' " + "'" + file_path + "'")
    print(fid)
    # make a link and display it
    return HTML(f"<a href=https://drive.google.com/file/d/{fid} target=_blank>file URL</a>")

get_shareable_link("/content/drive/MyDrive/../img_01.jpg")
If you look into the documentation, you can see a section that explains how to list files from Drive.
Using that, and reading the documentation of the library used, I've created this script:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

files = drive.ListFile().GetList()
for file in files:
    keys = file.keys()
    if 'webContentLink' in keys:
        link = file['webContentLink']
    elif 'webViewLink' in keys:
        link = file['webViewLink']
    else:
        link = 'No Link Available. Check your sharing settings.'
    # PyDrive wraps the Drive v2 API, where the filename key is 'title' (not 'name').
    if 'title' in keys:
        name = file['title']
    else:
        name = file['id']
    print('name: {} link: {}'.format(name, link))
This currently lists all files and prints a link for each.
You can then edit the function to find a specific file instead.
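For example, a minimal sketch that looks up a single file by name (test.txt is a hypothetical filename), using PyDrive's Drive v2 query syntax:
# Search for one file by its title instead of listing everything.
query = "title = 'test.txt' and trashed = false"
for f in drive.ListFile({'q': query}).GetList():
    print('name: {} link: {}'.format(f['title'], f.get('alternateLink', 'No Link Available')))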
Hope this helps!
getoutput wasn't defined for me in the other answers, but this worked instead after running the drive.mount command (test.zip needs to point to the right folder and file in your Drive!):
!apt-get install -qq xattr
filename = "/content/drive/My\ Drive/test.zip"
# Retrieving the file ID for a file in `"/content/drive/My Drive/"`:
id = !xattr -p 'user.drive.id' {filename}
print(id)

Export dataframe as csv file from google colab to google drive

I want to upload a dataframe as a csv from Colab to Google Drive. I have tried a lot, but with no luck. I can upload a simple text file, but I fail to upload a csv.
I tried the following code:
import pandas as pd
from google.colab import auth
from oauth2client.client import GoogleCredentials
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

df = pd.DataFrame({1: [1, 2, 3]})
df.to_csv('abc', sep='\t')

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
uploaded = drive.CreateFile({'title': 'sample.csv', 'mimeType': 'csv'})
uploaded.SetContentFile('abc')
uploaded.Upload()
It may be easier to use mounting instead of pydrive.
from google.colab import drive
drive.mount('drive')
After authentication, you can copy your csv file.
df.to_csv('data.csv')
!cp data.csv "drive/My Drive/"
Without using the !cp command:
from google.colab import drive

# Mount Google Drive in the Colab notebook.
drive.mount('/drive')

# Make sure the folder name exists in Google Drive before writing.
df.to_csv('/drive/My Drive/folder_name/name_csv_file.csv')
# Import Drive API and authenticate.
from google.colab import drive

# Mount your Drive to the Colab VM.
drive.mount('/gdrive')

# Write the DataFrame to a CSV file.
with open('/gdrive/My Drive/foo.csv', 'w') as f:
    df.to_csv(f)
The other answers almost worked but I needed a small tweak:
from google.colab import drive
drive.mount('drive')
df.to_csv('/content/drive/My Drive/filename.csv', encoding='utf-8', index=False)
The /content/ part proved necessary.
If you want to save the file locally, you can use this:
df.to_csv('sample.csv')
from google.colab import files
files.download("sample.csv")
If you don't want to work with Pandas (this uses a Spark DataFrame), do this:
df_mini.coalesce(1)\
.write\
.format('com.databricks.spark.csv')\
.options(header='true', delimiter='\t')\
.save('gdrive/My Drive/base_mini.csv')

Load a saved Doc2Vec model in Colab

I have trained and saved a model with gensim's Doc2Vec in Colab as follows:
model = gensim.models.Doc2Vec(vector_size=size_of_vector, window=10, min_count=5, workers=16,alpha=0.025, min_alpha=0.025, epochs=40)
model.build_vocab(allXs)
model.train(allXs, epochs=model.epochs, total_examples=model.corpus_count)
The model is saved in a folder that is not accessible from my Drive, but whose contents I can list as follows:
from os import listdir
from os.path import isfile, getsize
from operator import itemgetter
files = [(f, getsize(f)) for f in listdir('.') if isfile(f)]
files.sort(key=itemgetter(1), reverse=True)
for f, size in files:
    print('{} {}'.format(size, f))
print('({} files {} total size)'.format(len(files), sum(f[1] for f in files)))
The output is:
79434928 Model_after_train.docvecs.vectors_docs.npy
9155086 Model_after_train
1024 .rnd
(3 files 88591038 total size)
To move the two files into the same shared directory as the notebook:
folder_id = FolderID
for f, size in files:
    if 'our_first_lda' in f:
        file = drive.CreateFile({'parents': [{u'id': folder_id}]})
        file.SetContentFile(f)
        file.Upload()
The problems I am now facing are two:
1) gensim creates two files when saving the model. Which one should I load?
2) When I try to load one file or the other with:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# 1. Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
from googleapiclient.discovery import build
drive_service = build('drive', 'v3')
file_id = FileID
import io
from googleapiclient.http import MediaIoBaseDownload
request = drive_service.files().get_media(fileId=file_id)
downloaded = io.BytesIO()
downloader = MediaIoBaseDownload(downloaded, request)
done = False
while done is False:
    _, done = downloader.next_chunk()
model = doc2vec.Doc2Vec.load(downloaded.read())
I am not able to load the model; I get the error:
TypeError: file() argument 1 must be encoded string without null bytes, not str
Any suggestions?
I've never used gensim, but from a look at the docs, here's what I think is going on:
You're getting two files because you passed separately=True to save, which is saving large numpy arrays in the output as separate files. You'll want to copy both files around.
Based on the load docs, you want to pass a filename, not the contents of the file. So when fetching the file from Drive, save it to a local file first, and pass mmap='r' to load.
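A minimal sketch of that fix, assuming downloaded is the BytesIO object from the snippet above and the model was saved under the name Model_after_train (the companion .npy file must be fetched into the same directory the same way):
# Write the downloaded bytes to a local file instead of passing them to load().
with open('Model_after_train', 'wb') as out:
    out.write(downloaded.getvalue())
# Load from the file path; mmap='r' memory-maps the large arrays.
model = gensim.models.doc2vec.Doc2Vec.load('Model_after_train', mmap='r')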
If that doesn't get you up and running, it'd be helpful to see a complete example (eg with fake data).

python upload file to google drive folder that is shared with me

I am using Python 2.7 and I am trying to upload a file (*.txt) into a folder that is shared with me.
So far I have been able to upload it to my own Drive, but I don't know how to set the destination folder. I am given the URL of the location where I must place this file.
Thank you.
This is my code so far:
def Upload(file_name, file_path, upload_url):
    client = gdata.docs.client.DocsClient(source=upload_url)
    client.api_version = "3"
    client.ssl = True
    client.ClientLogin(username, passwd, client.source)
    newResource = gdata.docs.data.Resource(file_path, file_name)
    media = gdata.data.MediaSource()
    media.SetFileHandle(file_path, 'mime/type')
    newDocument = client.CreateResource(
        newResource,
        create_uri=gdata.docs.client.RESOURCE_UPLOAD_URI,
        media=media
    )
The API you are using (gdata) is deprecated. Use google-api-python-client instead.
Follow this official Python quickstart guide to upload a file to a folder. Additionally, send the parents parameter in the request body like this: body['parents'] = [{'id': parent_id}]
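A minimal sketch of that approach, assuming drive_service was built as in the quickstart and folder_id holds the target folder's ID; note that in the v3 API, parents is a plain list of IDs (the dict form above is v2 style):
from googleapiclient.http import MediaFileUpload

# Upload a local file into the shared folder (hypothetical filename).
file_metadata = {'name': 'report.txt', 'parents': [folder_id]}
media = MediaFileUpload('report.txt', mimetype='text/plain')
created = drive_service.files().create(
    body=file_metadata, media_body=media, fields='id').execute()
print('Uploaded file with ID: {}'.format(created.get('id')))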
Or, you can use PyDrive, a Python wrapper library that simplifies a lot of the work of dealing with the Google Drive API. The whole code is as simple as this:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

gauth = GoogleAuth()
gauth.LocalWebserverAuth()  # Authenticate via a local browser window.
drive = GoogleDrive(gauth)

# 'parents' takes a list of {'id': ...} dicts in PyDrive (Drive v2 metadata).
f = drive.CreateFile({'parents': [{'id': parent_id}]})
f.SetContentFile('cat.png')  # Read the local file
f.Upload()                   # Upload it
