Colab Python: Downloading/Uploading Google Drive files without mounting

I'm using Python in Colab to access specific files by their IDs, analyze them, and record info about them. I need to save the recorded info in a team folder on Drive. I'm reluctant to mount because the team folder is deep in the company's Drive folder hierarchy. I don't know how to access it from the main drive; I have it starred in my drive for easy access. I also just feel weird mounting my entire drive with all the company info in it. I really wish you could mount a single folder (I know you can change the path after the mount, but that also feels weird).
I have found a ton of ways to download a file given its file ID, but I can't find any way to upload or save to that file ID. I know Pandas can read a file ID into a data frame, which is an option, but can you save the new info back to the file ID with Pandas? There also seems to be an easy way to download by file ID with the Google API, but again, no easy way to upload to a file ID or folder ID and overwrite the file.
These files are going to get really big as time goes on (tens of thousands of lines), so the solution needs to handle that, either by uploading only the new info or by coping with long downloads.
Edit: I also just tried gspread, but I'm not able to share files with emails outside our company domain, so gspread is out. ):

You can use PyDrive to read and write a file by its FILE_ID:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
fid = 'Your File ID'
# read it
f = drive.CreateFile({'id': fid}) # just open an existing file
f.FetchMetadata(fetch_all=True)
text = f.GetContentString() # or f.GetContentFile('im.png') to save a local file
# or write it
f.SetContentString('Sample upload file content') # or SetContentFile('im.png')
f.Upload()
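The Drive API can't append to a file in place, so for growing CSVs one option (a minimal sketch, assuming the drive client and fid from above; new_rows is an invented example of this run's results) is to pull the current contents, append locally with pandas, and re-upload under the same ID:
import io
import pandas as pd

f = drive.CreateFile({'id': fid})  # same file ID as above
existing = pd.read_csv(io.StringIO(f.GetContentString()))

# hypothetical example: rows recorded during this run
new_rows = pd.DataFrame({'file_id': ['abc123'], 'line_count': [42]})
merged = pd.concat([existing, new_rows], ignore_index=True)

f.SetContentString(merged.to_csv(index=False))
f.Upload()  # overwrites the same file ID, so the link stays stable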

Related

How do I read a csv file from google drive folder using python, without downloading it

I need to read a csv file that is present in a google drive folder using Python. I don't want to download it or change it into a different format. Is there any way I could get the csv file's url to read the file?
I can't see any csv file url in google drive.
If it's a CSV file I would recommend converting it to a Google Sheets file; then you can just update it via the Google Sheets API.
You can't write directly to a file on Google Drive without downloading it first. I have done something like what you are looking for in C#: How to upload to Google Drive API from memory with C#. Once you download the file into a memory stream you will have access to it, so you could change the text and then upload it again.
Unfortunately, what you want to do isn't exactly possible. Let me know if you need help converting it to Google Sheets or translating my C# tutorial into Python. It should be doable in Python as well.
update
Got bored and I was curious: you can upload a memory stream to Google Drive as well.
import io
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseUpload

# create drive api client (`creds` is an authorized credentials object,
# e.g. from the Drive API Python quickstart)
service = build('drive', 'v3', credentials=creds)

file_metadata = {'name': 'Upload.txt'}
fh = io.StringIO("some initial text data")
media = MediaIoBaseUpload(fh, mimetype='text/plain')
file = service.files().create(body=file_metadata,
                              media_body=media,
                              fields='id').execute()
print(f'File ID: {file.get("id")}')

How to read file from Google Drive after it was updated?

I need to place a .csv file somewhere and then update it on a daily basis for code to read it. The catch is that the person who will be updating the file won't be using code, so it should be as easy as uploading from the web for them. I've read tens of questions here about how to read or download a file from Google Drive or Google Storage, but all of them assume updating with code or downloading using an API. I want a slightly simpler solution. For example,
I'm using the code below to read the .csv file from google drive (this file is just an example; the actual one will be different).
This is the file that will be updated each day; however, each time I update the file (remove it and upload a new one to google drive) the link changes.
Is there a way to update the file every time without changing the code?
For example, can the code fetch the file from a particular folder on Google Drive?
The main thing is that I need to do it without using the API and Google OAuth.
If that's not possible, where could the file be uploaded for this purpose? I need to be able to upload the file every day without any code, so the code reads the updated data. Is there storage like this?
import pandas as pd
import requests
from io import StringIO
url='https://drive.google.com/file/d/1976F_8WzIxj9wJXjNyN_uD8Lrl_XtpIf/view?usp=sharing'
file_id = url.split('/')[-2]
dwn_url='https://drive.google.com/uc?export=download&id=' + file_id
url2 = requests.get(dwn_url).text
csv_raw = StringIO(url2)
df = pd.read_csv(csv_raw)
print(df.head())
create vs update
The first thing you need to be sure of is that the first time you run your code you use the files.create method.
When you update the file, however, you should be using files.update; this does not create a new file each time, so your id will remain the same.
google api python client
IMO you should consider using the Python client library; it will make things a little easier for you:
updated_file = service.files().update(
    fileId=file_id,
    body=file,
    newRevision=new_revision,
    media_body=media_body).execute()
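For the current v3 client (the fragment above uses older v2-style parameters such as newRevision), a more complete sketch might look like the following; data.csv, FILE_ID, and creds are placeholders/assumptions, not values from the question:
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

# `creds` comes from whatever OAuth flow you already use.
service = build('drive', 'v3', credentials=creds)

# Overwrite the existing file's content; its ID (and any shared link) stays stable.
media = MediaFileUpload('data.csv', mimetype='text/csv')  # placeholder local file
updated_file = service.files().update(
    fileId='FILE_ID',  # placeholder: the ID of the file to overwrite
    media_body=media,
    fields='id').execute()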
Google sheets api
You are editing a csv file. By using the google drive api you are downloading the file and uploading it over and over.
Have you considered converting it to a google sheet and using the google sheets api to edit the file programmatically? It may save you some processing.
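A minimal sketch of that approach, assuming the CSV has already been converted to a sheet; SPREADSHEET_ID and the sample values are placeholders:
from googleapiclient.discovery import build

sheets = build('sheets', 'v4', credentials=creds)  # `creds` as before

# Update a range in place instead of re-uploading the whole file.
sheets.spreadsheets().values().update(
    spreadsheetId='SPREADSHEET_ID',  # placeholder: ID of the converted sheet
    range='Sheet1!A1',
    valueInputOption='RAW',
    body={'values': [['date', 'value'], ['2023-01-01', '42']]}).execute()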

Python: find and download missing files from google drive (using a sharable link)

Given a sharable link to a google drive folder (folder ID), I would like to compare the dir list under this folder with the dir list under a given path, and download the missing files.
I've read about PyDrive but couldn't find an elegant way to access the drive folder without authentication.
For instance:
files_under_gdrive = ["File1", "File2", "File3"]
files_under_given_path = ["File1", "some_other_file"]
# Download missing files found only in Google Drive
...
files_under_given_path = ["File1", "some_other_file", "File2", "File3"]
Any hint/idea would be highly appreciated. Thanks :)
You could easily start by gathering the files from the local directory and storing their names in a list, for example.
Afterwards, in order to retrieve the files from the Google Drive Shared drive, you can make use of Files.list request:
GET https://www.googleapis.com/drive/v3/files
With the includeItemsFromAllDrives set to true and the driveId field set to the corresponding id of the shared drive. Depending on your exact needs and requirements, you may add other fields to the request as well.
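A minimal sketch of that request with the Python client, assuming `service` is an authorized Drive v3 client and SHARED_DRIVE_ID is a placeholder:
resp = service.files().list(
    corpora='drive',
    driveId='SHARED_DRIVE_ID',  # placeholder: the shared drive's id
    includeItemsFromAllDrives=True,
    supportsAllDrives=True,  # required when addressing shared drives
    fields='files(id, name)').execute()
files_under_gdrive = [f['name'] for f in resp.get('files', [])]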
After retrieving the files from the shared Google Drive, you can simply compare the two lists and based on the results, download the needed files. For downloading the files, you may want to check this snippet from the Drive API documentation:
import io
from googleapiclient.http import MediaIoBaseDownload

file_id = 'ID_OF_THE_FILE_TO_DOWNLOAD'
request = drive_service.files().get_media(fileId=file_id)
fh = io.BytesIO()
downloader = MediaIoBaseDownload(fh, request)
done = False
while done is False:
    status, done = downloader.next_chunk()
    print("Download %d%%." % int(status.progress() * 100))
However, I recommend you start by completing the Google Drive Quick Start with Python from here.
Reference
Drive API v3 Files.list
Drive API v3 download files
Drive API v3 Python Quick Start

In Google CoLab Notebook, how to read data from a Public Google Drive AND my personal drive *without* authenticating twice?

I have a Google CoLab notebook used by third parties. The user of the notebook needs it to read CSVs both from their personal mounted GDrive and from a 3rd-party publicly shared GDrive.
As far as I can tell, reading from these 2 different sources each requires the user to complete an authentication verification-code workflow, copying/pasting a code each time. The UX would be much improved if they only had to do a single authentication verification, rather than 2.
Put another way: if I've already authenticated and verified who I am to mount my drive, then why do I need to do it again to read data from a publicly shared Google Drive?
I figured there would be some way to use the authentication from the first method in the second (see details below), or to somehow request permissions to both in a single step, but I am not having any luck figuring it out.
Background
There has been a lot written about how to read data into Google Colab notebooks: Import data into Google Colaboratory &
Towards Data Science - 3 ways to load CSV files into colab and Google CoLab's official helper notebook are some good references.
To quickly recap, you have a few options, depending on where the data is coming from. If you are working with your own data, then an easy solution is to put your data in Google Drive, and then mount your drive.
from google.colab import drive as mountGoogleDrive
mountGoogleDrive.mount('/content/mountedDrive')
And you can read files as if they were in your local filesystem at /content/mountedDrive/.
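For instance, reading a CSV from the mounted path (the exact path below is hypothetical):
import pandas as pd
df = pd.read_csv('/content/mountedDrive/My Drive/data.csv')  # hypothetical path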
Sometimes mounting your drive is not sufficient. For example, let's say you want to read data from a publicly shared Google Drive owned by a third party. In this case, you can't mount your drive, because the shared data is not in your Drive. You could copy all of the data out of the third party's drive and into your drive, but it would be preferable to read directly from the public Drive, especially if this is a shared notebook that many people use.
In this case, you can use PyDrive (see same references).
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
You have to look up the drive id for your dataset, and then you can read it, e.g., like this:
import pandas as pd
downloaded = drive.CreateFile({'id':id})
downloaded.GetContentFile('Filename.csv')
df = pd.read_csv('Filename.csv')
In both of these workflows, you must authenticate your Google Account by following a special link, copying a code, and pasting the code back into the notebook.
Here is my problem:
I want to do both of these things in the same notebook: (1) read from a mounted google drive and (2) read from a publicly shared GDrive.
The user of my notebook is a third party. If the notebook runs both sets of code, then the user is forced to complete the authentication verification twice. It's a bad UX, confusing, and seems like it should be unnecessary.
Things I have tried:
Regarding this code:
auth.authenticate_user() # We already authenticated when we mounted our GDrive
gauth = GoogleAuth()
I thought there might be a way to pass the gauth object into the .mount() function so that if credentials already existed, you would not need to re-request authentication with a new verification code. But I have not been able to find documentation on google.colab.drive.mount(), and guessing at parameters to pass is not working out.
Alternatively we could go the other way around; however, I am not sure whether it is possible to save/extract authentication permissions from .mount().
Next I tried running the following code, removing the explicit authenticate_user() call after the mounting had already happened, like this:
from google.colab import drive as mountGoogleDrive
mountGoogleDrive.mount('/content/mountedDrive')
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
# auth.authenticate_user() # Commented out, hoping we already authenticated during mounting
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
The first 2 lines run as expected, including the authentication link and verification code.
However once we get to the line gauth.credentials = GoogleCredentials.get_application_default() my 3rd party user gets the following error:
   1260         # If no credentials, fail.
-> 1261         raise ApplicationDefaultCredentialsError(ADC_HELP_MSG)
   1262
   1263     @staticmethod
ApplicationDefaultCredentialsError: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
I'm not 100% sure what these different lines accomplish, so I tried removing the error line as well:
from google.colab import drive as mountGoogleDrive
mountGoogleDrive.mount('/content/mountedDrive')
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
# auth.authenticate_user() # Commented out, hoping we already authenticated during mounting
gauth = GoogleAuth()
# gauth.credentials = GoogleCredentials.get_application_default() # Commented out, hoping we don't need this line if we are already mounted?
drive = GoogleDrive(gauth)
This now runs without error; however, when I then try to read a file from the public drive I get the following error:
InvalidConfigError: Invalid client secrets file ('Error opening file', 'client_secrets.json', 'No such file or directory', 2)
At this point I noticed something that is probably important:
When I run the drive-mounting code, the authentication requests access to Google Drive File Stream.
When I run the PyDrive authentication, the authentication requests access on behalf of Google Cloud SDK.
So these are different permissions.
So, the question is... is there any way to streamline this and package all of these permissions into a single-verification-code authentication workflow? If I want to read from both my mounted Drive AND a publicly-shared GDrive, must the notebook user really authenticate twice?
Thanks for any pointers to documentation or examples.
There is no way to do this. The OAuth scopes are different: one is for the Google Drive file system; the other is for the Google Cloud SDK.

How to delete permanently from mounted Drive folder?

I wrote a script to upload my models and training examples to Google Drive after every iteration, in case of crashes or anything else that stops the notebook from running. It looks something like this:
import shutil
from os import path

drive_path = 'drive/My Drive/Colab Notebooks/models/'
if path.exists(drive_path):
    shutil.rmtree(drive_path)
shutil.copytree('models', drive_path)
Whenever I check my Google Drive, a few GBs are taken up by dozens of deleted models folders in the Trash, which I have to delete manually.
The only function in google.colab.drive seems to be mount, and that's it.
According to this tutorial, shutil.rmtree() removes a directory permanently, but apparently that doesn't apply to Drive.
It is possible to perform this action inside Google Colab by using the pydrive module. I suggest that you first move your unwanted files and folders to Trash (by ordinarily removing them in your code), and then, any time you think it's necessary (e.g. when you want to free up some space for saving the weights of a new DL project), empty your trash by running the following lines.
In order to permanently empty your Google Drive's Trash, run the following lines in your Google Colab notebook:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
my_drive = GoogleDrive(gauth)
After entering the authentication code and creating a valid instance of the GoogleDrive class, write:
for a_file in my_drive.ListFile({'q': "trashed = true"}).GetList():
    # print the name of the file being deleted.
    print(f'the file "{a_file["title"]}" is about to get deleted permanently.')
    # delete the file permanently.
    a_file.Delete()
If you don't want to use my suggestion and want to permanently delete a specific folder in your Drive, you may have to make more complex queries and deal with fileId, parentId, and the fact that a file or folder in your Drive may have multiple parent folders when making queries to the Google Drive API (see the sketch after the links below).
For more information:
You can find examples of more complex (yet typical) queries, here.
You can find an example of Checking if a file is in a specific folder, here.
The statement that files and folders in Google Drive can each have multiple parent folders can be better and more deeply understood by reading this post.
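As a rough sketch of such a folder-scoped query (FOLDER_ID is a placeholder; my_drive is the client created above):
# Permanently delete trashed files that live under a specific folder.
query = "'FOLDER_ID' in parents and trashed = true"  # placeholder folder ID
for a_file in my_drive.ListFile({'q': query}).GetList():
    print(f'permanently deleting "{a_file["title"]}"')
    a_file.Delete()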
Files move to the bin upon deletion, so this neat trick reduces the file size to 0 before deleting (cannot be undone!):
import os

delete_filepath = 'drive/My Drive/Colab Notebooks/somefolder/examplefile.png'
open(delete_filepath, 'w').close()  # overwrite with a blank file first - ref: https://stackoverflow.com/a/4914288/3553367
os.remove(delete_filepath)  # removing the now-empty file still moves it to the bin, but it takes up no space
You just have to move the files into the trash and connect to your Drive; from there, delete the notebooks permanently.
