I have trained and saved a Doc2Vec model with gensim in Colab as follows:
model = gensim.models.Doc2Vec(vector_size=size_of_vector, window=10, min_count=5, workers=16, alpha=0.025, min_alpha=0.025, epochs=40)
model.build_vocab(allXs)
model.train(allXs, epochs=model.epochs, total_examples=model.corpus_count)
The model is saved in a folder that is not accessible from my Drive, but whose contents I can list with:
from os import listdir
from os.path import isfile, getsize
from operator import itemgetter
files = [(f, getsize(f)) for f in listdir('.') if isfile(f)]
files.sort(key=itemgetter(1), reverse=True)
for f, size in files:
    print('{} {}'.format(size, f))
print('({} files {} total size)'.format(len(files), sum(f[1] for f in files)))
The output is:
79434928 Model_after_train.docvecs.vectors_docs.npy
9155086 Model_after_train
1024 .rnd
(3 files 88591038 total size)
To move the two files into the same shared directory as the notebook, I use:
folder_id = FolderID
for f, size in files:
    if 'our_first_lda' in f:
        file = drive.CreateFile({'parents': [{u'id': folder_id}]})
        file.SetContentFile(f)
        file.Upload()
I am now facing two problems:
1) gensim creates two files when saving the model. Which one should I load?
2) when I try to load one file or the other with:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# 1. Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
from googleapiclient.discovery import build
drive_service = build('drive', 'v3')
file_id = FileID
import io
from googleapiclient.http import MediaIoBaseDownload
request = drive_service.files().get_media(fileId=file_id)
downloaded = io.BytesIO()
downloader = MediaIoBaseDownload(downloaded, request)
done = False
while done is False:
    _, done = downloader.next_chunk()
model = doc2vec.Doc2Vec.load(downloaded.read())
I am not able to load the model; I get the error:
TypeError: file() argument 1 must be encoded string without null bytes, not str
Any suggestions?
I've never used gensim, but from a look at the docs, here's what I think is going on:
You're getting two files because gensim's save() stores large numpy arrays in the output as separate .npy files (this is the default behaviour for arrays above sep_limit, and also what the separately argument controls). You'll want to copy both files around.
Based on the load docs, you want to pass a filename, not the contents of the file. So when fetching the file from Drive, save it to a local file, and pass that filename (optionally with mmap='r') to load.
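For instance, a minimal sketch (untested; it assumes the two files keep the exact names from the listing above, with the .npy companion sitting in the same directory):
from gensim.models import doc2vec

# Write the downloaded bytes to disk under the name gensim saved the model with.
with open('Model_after_train', 'wb') as fh:
    fh.write(downloaded.getvalue())
# ...repeat the Drive download for 'Model_after_train.docvecs.vectors_docs.npy'
# so it ends up next to the main file, then load by filename:
model = doc2vec.Doc2Vec.load('Model_after_train', mmap='r')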
If that doesn't get you up and running, it'd be helpful to see a complete example (e.g. with fake data).
Related
I want to add a time string to the end of filenames which I am uploading to Google Drive using pydrive. Basically I tried the code below, but I have no idea how to adapt the new file variable:
import time
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from time import sleep
gauth = GoogleAuth()
drive = GoogleDrive(gauth)
timestring = time.strftime("%Y%m%d-%H%M")
upload_file_list = ['Secrets.kdbx']
for upload_file in upload_file_list:
    gfile = drive.CreateFile({'parents': [{'id': 'folder_id'}]})
    # Read file and set it as the content of this instance.
    upload_file = drive.CreateFile({'title': 'Secrets.kdbx' + timestring, 'mimeType': 'application/x-kdbx'})  # here I try to set the new filename
    gfile.SetContentFile(upload_file)
    gfile.Upload()  # Upload the file.
I am getting: TypeError: expected str, bytes or os.PathLike object, not GoogleDriveFile
Actually I found the mistake. I changed the name of the file inside CreateFile() and used string slicing to keep the file extension. Of course, this solution cannot be applied directly to files with different names.
upload_file_list = ['Secrets.kdbx']
for upload_file in upload_file_list:
    gfile = drive.CreateFile({'title': upload_file[:7] + timestring + "." + upload_file[-4:],  # file name is changed here
                              'parents': [{'id': '1Ln6ptJ4bTlYoGdxczIgmD-7xcRlJa_7m'}]})
    # Read file and set it as the content of this instance.
    gfile.SetContentFile(upload_file)
    gfile.Upload()  # Upload the file. Output something like: Secrets20211212-1627.kdbx - that's what I wanted :)
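A more general variant of the same idea, as a sketch that uses os.path.splitext so the slicing no longer depends on this particular filename ('folder_id' is a placeholder):
import os
import time

timestring = time.strftime("%Y%m%d-%H%M")
for upload_file in upload_file_list:
    base, ext = os.path.splitext(upload_file)  # e.g. ('Secrets', '.kdbx')
    gfile = drive.CreateFile({'title': base + timestring + ext,
                              'parents': [{'id': 'folder_id'}]})
    gfile.SetContentFile(upload_file)  # read the local file as the content
    gfile.Upload()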
I have stored a csv file in Google Drive and tried to load it into a torchtext data.TabularDataset. The error message is "FileNotFoundError: [Errno 2] No such file or directory: 'https://.....'"
Is it impossible to load a csv file from Google Drive directly into a torchtext TabularDataset?
Here is the code. I have also made a public Colab notebook with the data publicly available.
import torch
from torchtext import data, datasets
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
TEXT = data.Field(tokenize = 'spacy', batch_first = True, lower=False)
LABEL = data.LabelField(sequential=False, dtype = torch.float)
train = data.TabularDataset(path='https://drive.google.com/open?id=1eWMjusU3H34m0uml5SdJvYX6gQuB8zta',
                            format='csv',
                            fields=[('Insult', LABEL), (None, None), ('Comment', TEXT)],
                            skip_header=False)
Let's assume you can afford to download this CSV file. I would suggest you use a function built into torchtext: download_from_url.
import os
import torch
from torchtext import data, datasets
from torchtext.utils import download_from_url
# download the file
CSV_FILENAME = 'data.csv'
CSV_GDRIVE_URL = 'https://drive.google.com/uc?export=download&id=1eWMjusU3H34m0uml5SdJvYX6gQuB8zta'
download_from_url(CSV_GDRIVE_URL, CSV_FILENAME)
TEXT = data.Field(tokenize = 'spacy', batch_first = True, lower=False) #from torchtext import data
LABEL = data.LabelField(sequential=False, dtype = torch.float)
# if you're on Colab, you'll need this /content
train = data.TabularDataset(path=os.path.join('/content', CSV_FILENAME),
                            format='csv',
                            fields=[('Insult', LABEL), (None, None), ('Comment', TEXT)],
                            skip_header=False)
Notice that the Google Drive link should not be the one with open?id; change it to uc?export=download&id=.
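If you only have the open?id share link, the rewrite is mechanical; a small sketch, assuming the link has exactly the format shown above:
share_url = 'https://drive.google.com/open?id=1eWMjusU3H34m0uml5SdJvYX6gQuB8zta'
file_id = share_url.split('open?id=')[1]
CSV_GDRIVE_URL = 'https://drive.google.com/uc?export=download&id=' + file_id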
When I upload data using the following code, the data vanishes once I get disconnected.
from google.colab import files
uploaded = files.upload()
for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))
Please suggest ways to upload my data so that it remains intact even after days of disconnection.
I keep my data stored permanently in a .zip file in Google Drive, and upload it to the Google Colab VM using the following code.
Paste it into a cell, and change the file_id. You can find the file_id from the URL of the file in google drive. (Right click on file -> Get shareable link -> find the part of the URL after open?id=)
#@title uploader
file_id = "1BuM11fJJ1qdZH3VbQ-GwPlK5lAvXiNDv" #@param {type:"string"}
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# 1. Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
# PyDrive reference:
# https://googledrive.github.io/PyDrive/docs/build/html/index.html
from googleapiclient.discovery import build
drive_service = build('drive', 'v3')
# Replace the assignment below with your file ID
# to download a different file.
#
# A file ID looks like: 1gLBqEWEBQDYbKCDigHnUXNTkzl-OslSO
import io
from googleapiclient.http import MediaIoBaseDownload
request = drive_service.files().get_media(fileId=file_id)
downloaded = io.BytesIO()
downloader = MediaIoBaseDownload(downloaded, request)
done = False
while done is False:
    # _ is a placeholder for a progress object that we ignore.
    # (Our file is small, so we skip reporting progress.)
    _, done = downloader.next_chunk()
fileId = drive.CreateFile({'id': file_id})  # file_id is a Drive file ID, e.g. 1iytA1n2z4go3uVCwE_vIKouTKyIDjEq
print(fileId['title'])
fileId.GetContentFile(fileId['title']) # Save Drive file as a local file
!unzip {fileId['title']}
Keeping data in Google Drive is good (see #skaem's answer above).
If your data contains code, I suggest you simply git clone your source repository from GitHub (or any other code-versioning service) at the beginning of your Colab notebook.
This way, you can develop offline and run your experiments in the cloud whenever you need, with up-to-date code.
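For example, a minimal sketch (the repository URL is a placeholder for your own):
# Clone once per fresh Colab VM; on later runs, just pull the latest changes.
!git clone https://github.com/your-user/your-repo.git
%cd your-repo
!git pull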
I am trying to retrieve file metadata from the Google Drive API v3 in Python. I did it with API v2, but failed in v3.
I tried to get metadata by this line:
data = DRIVE.files().get(fileId=file['id']).execute()
but all I got was a dict of 'id', 'kind', 'name', and 'mimeType'. How can I get 'md5Checksum', 'fileSize', and so on?
I read the documentation.
According to it, I am supposed to get all the metadata with the get() method, but all I got was a small part of it.
Here is my code:
from __future__ import print_function
import os
from apiclient.discovery import build
from httplib2 import Http
from oauth2client import file, client, tools
try:
    import argparse
    flags = argparse.ArgumentParser(parents=[tools.argparser]).parse_args()
except ImportError:
    flags = None

SCOPES = ('https://www.googleapis.com/auth/drive.metadata '
          'https://www.googleapis.com/auth/drive')
store = file.Storage('storage.json')
creds = store.get()
if not creds or creds.invalid:
    flow = client.flow_from_clientsecrets('storage.json', scope=SCOPES)
    creds = tools.run_flow(flow, store)
DRIVE = build('drive','v3', http=creds.authorize(Http()))
files = DRIVE.files().list().execute().get('files',[])
for file in files:
    print('\n', file['name'], file['id'])
    data = DRIVE.files().get(fileId=file['id']).execute()
    print('\n', data)
print('Done')
I tried this answer, "Google Drive API v3 Migration - List":
Files returned by service.files().list() do not contain information now, i.e. every field is null. If you want list on v3 to behave like in v2, call it like this:
service.files().list().setFields("nextPageToken, files");
but I get a Traceback:
DRIVE.files().list().setFields("nextPageToken, files")
AttributeError: 'HttpRequest' object has no attribute 'setFields'
Suppose you want to get the MD5 hash of a file given its fileId; you can do it like this:
DRIVE = build('drive','v3', http=creds.authorize(Http()))
file_service = DRIVE.files()
remote_file_hash = file_service.get(fileId=fileId, fields="md5Checksum").execute()['md5Checksum']
To list some files on the Drive:
results = file_service.list(pageSize=10, fields="files(id, name)").execute()
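If the listing spans more than one page, v3 also leaves nextPageToken handling to you; a sketch under the same setup (file_service as above):
page_token = None
while True:
    response = file_service.list(pageSize=100,
                                 fields="nextPageToken, files(id, name, md5Checksum)",
                                 pageToken=page_token).execute()
    for f in response.get('files', []):
        print(f['name'], f.get('md5Checksum'))
    page_token = response.get('nextPageToken')
    if page_token is None:  # no more pages
        break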
I have built a small application, gDrive-auto-sync, containing more examples of API usage.
It's well documented, so you can have a look at it if you want.
Here is the main file containing all the code. It might look like a lot, but more than half of the lines are just comments.
If you want to retrieve all the fields for a file resource, simply set fields='*'.
In your example above, you would run:
data = DRIVE.files().get(fileId=file['id'], fields='*').execute()
This should return all the available metadata for the file, as listed in:
https://developers.google.com/drive/v3/reference/files
There is a library, PyDrive, that provides easy interaction with Google Drive:
https://googledrive.github.io/PyDrive/docs/build/html/filelist.html
Their example:
from pydrive.drive import GoogleDrive
drive = GoogleDrive(gauth) # Create GoogleDrive instance with authenticated GoogleAuth instance
# Auto-iterate through all files in the root folder.
file_list = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()
for file1 in file_list:
    print('title: %s, id: %s' % (file1['title'], file1['id']))
All you need is file1['your key']
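If you already know a file's ID, something along these lines should also work (a sketch; FetchMetadata pulls the requested fields, and md5Checksum/fileSize are the v2-style field names PyDrive exposes):
file1 = drive.CreateFile({'id': file_id})  # file_id assumed known
file1.FetchMetadata(fields='md5Checksum,fileSize')
print(file1['md5Checksum'], file1['fileSize'])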
I am using Python 2.7 and I am trying to upload a file (*.txt) into a folder that is shared with me.
So far I have been able to upload it to my own Drive, but how do I set which folder it goes into? I am given the URL of the folder where I must place this file.
Thank you
This is my code so far:
def Upload(file_name, file_path, upload_url):
    client = gdata.docs.client.DocsClient(source=upload_url)
    client.api_version = "3"
    client.ssl = True
    client.ClientLogin(username, passwd, client.source)
    newResource = gdata.docs.data.Resource(file_path, file_name)
    media = gdata.data.MediaSource()
    media.SetFileHandle(file_path, 'mime/type')
    newDocument = client.CreateResource(
        newResource,
        create_uri=gdata.docs.client.RESOURCE_UPLOAD_URI,
        media=media
    )
The API you are using is deprecated; use google-api-python-client instead.
Follow the official Python quickstart guide to upload a file to a folder. Additionally, send the parents parameter in the request body, like this: body['parents'] = [{'id': parent_id}]
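Put together, a sketch with the v3 client (drive_service is assumed to be an already-built v3 service and folder_id the target folder's ID; note that in v3 parents is a plain list of IDs, unlike the v2-style body above):
from googleapiclient.http import MediaFileUpload

file_metadata = {'name': 'my_file.txt', 'parents': [folder_id]}
media = MediaFileUpload('my_file.txt', mimetype='text/plain')
uploaded = drive_service.files().create(body=file_metadata,
                                        media_body=media,
                                        fields='id').execute()
print('Uploaded file with ID:', uploaded['id'])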
Or, you can use PyDrive, a Python wrapper library which simplifies a lot of the work of dealing with the Google Drive API. The whole code is as simple as this:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

gauth = GoogleAuth()
drive = GoogleDrive(gauth)

f = drive.CreateFile({'parents': [{'id': parent_id}]})
f.SetContentFile('cat.png')  # Read the local file
f.Upload()  # Upload it