I am trying to download a large number of files from Google Drive and then combine them into fewer, smaller files. However, my code is either downloading duplicate files or reading the BytesIO object incorrectly. I have pasted the code below; here is a quick explanation of the file structure.
I have ~135 folders, each containing 52 files. My goal is to loop through each folder, download its 52 files, and convert them into one file that is more compressed (getting rid of unnecessary/duplicate data).
Code
import io
import os
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload
from httplib2 import Http
from oauth2client import file, client, tools
# SCOPES and convert_all_netcdf are defined elsewhere in my project.
def main(temporary_workspace, workspace):
    store = file.Storage('tokenRead.json')
    big_list_of_file_ids = []
    creds = store.get()
    if not creds or creds.invalid:
        flow = client.flow_from_clientsecrets('credentials.json', SCOPES)
        creds = tools.run_flow(flow, store)
    service = build('drive', 'v3', http=creds.authorize(Http()))
    # Call the Drive v3 API
    results = service.files().list(
        q="'MAIN_FOLDER_WITH_SUBFOLDERS_ID' in parents",
        pageSize=1000, fields="nextPageToken, files(id, name)").execute()
    items = results.get('files', [])
    list_of_folders_and_ids = []
    if not items:
        raise RuntimeError('No files found.')
    else:
        for item in items:
            list_of_folders_and_ids.append((item['name'], item['id']))
    list_of_folders_and_ids.sort(key=lambda x: x[0])
    for folder_id in list_of_folders_and_ids:
        start_date = folder_id[0][:-3]
        id = folder_id[1]
        print('Folder: ', start_date, ', ID: ', id)
        query_string = "'{}' in parents".format(id)
        results = service.files().list(
            q=query_string, fields="nextPageToken, files(id, name)"
        ).execute()
        items = results.get('files', [])
        list_of_files_and_ids = []
        if not items:
            raise RuntimeError('No files found.')
        else:
            for item in items:
                list_of_files_and_ids.append((item['name'], item['id']))
        for file_id in list_of_files_and_ids:
            # Downloading the files
            if file_id[1] not in big_list_of_file_ids:
                big_list_of_file_ids.append(file_id[1])
            else:
                print('Duplicate file ID!')
                exit()
            print('\tFile: ', file_id[0], ', ID: ', file_id[1])
            request = service.files().get_media(fileId=file_id[1])
            fh = io.BytesIO()
            downloader = MediaIoBaseDownload(fh, request)
            done = False
            while done is False:
                status, done = downloader.next_chunk()
                print("Download: {}".format(int(status.progress() * 100)))
            fh.seek(0)
            temporary_location = os.path.join(temporary_workspace, file_id[0])
            with open(temporary_location, 'wb') as out:
                out.write(fh.read())
            fh.close()
        convert_all_netcdf(temporary_workspace, start_date, workspace, r'Qout_south_america_continental',
                           num_of_rivids=62317)
        os.system('rm -rf %s/*' % temporary_workspace)
As you can see, I first get the IDs of all of the folders. Then I loop through each folder, get its 52 files, save them to a temporary folder, convert them into one file (which I save in another directory), delete the 52 temporary files, and move on to the next folder in Google Drive. The problem is that when I compare the files produced by the convert_all_netcdf method, they are all the same. I feel as though I am doing something wrong with the BytesIO object; do I need to do something more to clear it? It may also be that I am accidentally reading from the same folder every time in the Google Drive API calls. Any help is appreciated.
I realize that this was probably not a great question, and I mainly asked it because I thought I was doing something wrong with the BytesIO object, but I found the answer. I was reading all of the downloaded files with a library called Xarray, and I was forgetting to close the datasets. This caused me to only read the first one on subsequent loops, giving me duplicates. Thanks to anyone who tried!
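For anyone hitting the same symptom, a minimal sketch of the fix, assuming the files are read with xarray.open_dataset inside convert_all_netcdf (the file name here is hypothetical):
import xarray as xr
# xarray keeps opened files in a cache; if a dataset is never closed, a
# later open of a file with the same name can hit the stale cached handle
# and return the old data. A context manager guarantees the close:
with xr.open_dataset('Qout_file.nc') as ds:  # hypothetical file name
    data = ds.load()  # read everything into memory before the file closes
# ...or, equivalently, close explicitly:
ds = xr.open_dataset('Qout_file.nc')
data = ds.load()
ds.close()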
Related
I'm trying to download the full content of a spreadsheet using Google Drive. Currently, my code exports and then writes to a file the content of only the first tab of the given spreadsheet. How can I make it download the full content of the file?
This is the function that I'm currently using:
def download_file(real_file_id, service):
    try:
        file_id = real_file_id
        request = service.files().export_media(fileId=file_id,
                                               mimeType='text/csv')
        file = io.BytesIO()
        downloader = MediaIoBaseDownload(file, request)
        done = False
        while done is False:
            status, done = downloader.next_chunk()
            print(F'Download {int(status.progress() * 100)}.')
    except HttpError as error:
        print(F'An error occurred: {error}')
        file = None
    file_object = open('test.csv', 'a')
    file_object.write(file.getvalue().decode("utf-8"))
    file_object.close()
    return file.getvalue()
I call the function at a later stage in my code, passing the already initialised Google Drive service and the file ID:
download_file(real_file_id='XXXXXXXXXXXXXXXXXXXXX', service=service)
I believe your goal is as follows.
You want to download all sheets in a Google Spreadsheet as CSV data.
You want to achieve this using googleapis for python.
In this case, how about the following sample script? In order to retrieve the sheet names of each sheet in the Google Spreadsheet, the Sheets API is used. Using the Sheets API, the sheet IDs of all sheets are retrieved, and using these sheet IDs, all sheets are downloaded as CSV data.
Sample script:
From your showing script, I guessed that service might be service = build("drive", "v3", credentials=creds). If my understanding is correct, in order to retrieve the access token, please use creds.
spreadsheetId = "###" # Please set the Spreadsheet ID.
sheets = build("sheets", "v4", credentials=creds)
sheetObj = sheets.spreadsheets().get(spreadsheetId=spreadsheetId, fields="sheets(properties(sheetId,title))").execute()
accessToken = creds.token
for s in sheetObj.get("sheets", []):
p = s["properties"]
sheetName = p["title"]
print("Download: " + sheetName)
url = "https://docs.google.com/spreadsheets/export?id=" + spreadsheetId + "&exportFormat=csv&gid=" + str(p["sheetId"])
res = requests.get(url, headers={"Authorization": "Bearer " + accessToken})
with open(sheetName + ".csv", mode="wb") as f:
f.write(res.content)
In this case, please add import requests.
When this script is run, all sheets in a Google Spreadsheet are downloaded as CSV data. The filename of each CSV file uses the tab name in Google Spreadsheet.
In this case, please add a scope of "https://www.googleapis.com/auth/spreadsheets.readonly" as follows. And, please reauthorize the scopes. Please be careful about this.
SCOPES = [
    "https://www.googleapis.com/auth/drive.readonly",  # Please use this for your actual situation.
    "https://www.googleapis.com/auth/spreadsheets.readonly",
]
Reference:
Method: spreadsheets.get
Tanaike's answer is easier and more straightforward, but I already spent some time on this so I might as well post it as an alternative.
The problem you originally encountered is that CSV files do not support multiple tabs/sheets, so Drive's files.export will only export the first sheet, and it doesn't have a way to select specific sheets.
Another way you can approach this is to use the Sheets API copyTo() method to create temp files for each sheet and export those as single CSV files.
# need a service for sheets and one for drive
sheetservice = build('sheets', 'v4', credentials=creds)
driveservice = build('drive', 'v3', credentials=creds)
spreadsheet = sheetservice.spreadsheets()
result = spreadsheet.get(spreadsheetId=YOUR_SPREADSHEET).execute()
sheets = result.get('sheets', [])  # the list of sheets within your spreadsheet
# standard metadata to create the blank spreadsheet files
file_metadata = {
    "name": "temp",
    "mimeType": "application/vnd.google-apps.spreadsheet"
}
for sheet in sheets:
    # create a blank spreadsheet and get its ID
    tempfile = driveservice.files().create(body=file_metadata).execute()
    tempid = tempfile.get('id')
    # copy the sheet to the new file
    sheetservice.spreadsheets().sheets().copyTo(spreadsheetId=YOUR_SPREADSHEET, sheetId=sheet['properties']['sheetId'], body={"destinationSpreadsheetId": tempid}).execute()
    # need to delete the first sheet since the copy gets added as second
    sheetservice.spreadsheets().batchUpdate(spreadsheetId=tempid, body={"requests": {"deleteSheet": {"sheetId": 0}}}).execute()
    download_file(tempid, driveservice)  # runs your original method to download the file
    driveservice.files().delete(fileId=tempid).execute()  # to clean up the temp file
You'll also need the https://www.googleapis.com/auth/spreadsheets and https://www.googleapis.com/auth/drive scopes. This involves more API calls so I just recommend Tanaike's method, but I hope it gives you an idea of ways that you can play with the API to suit your needs.
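For reference, a sketch of what that scope list might look like, mirroring the SCOPES format from the earlier answer (adjust to your own situation):
SCOPES = [
    "https://www.googleapis.com/auth/spreadsheets",
    "https://www.googleapis.com/auth/drive",
]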
I'm using the Google Drive Python API (v3), and I'm having trouble updating a file on Drive.
Here is my code for updating a file. The code runs without any errors, and even returns the correct file ID at the end, but it does not update the file on Drive.
basename = "myfile.txt"
filename = "/path/to/myfile.txt"
file_metadata = {'name': basename}
media = MediaFileUpload(filename, mimetype='text/plain', resumable=True)
file = drive_service.files().update(fileId=file_id, body=file_metadata, media_body=media).execute()
print(file)
The print(file) statement produces the following output:
{'id': <fileid>}
What's curious is that I'm able to create a file without any issues. Here is my code for creating a file, which creates a file successfully.
basename = "myfile.txt"
filename = "/path/to/myfile.txt"
file_metadata = {'name': basename, 'parents': [drive_folder_id]}
media = MediaFileUpload(filename, mimetype='text/plain')
file = drive_service.files().create(body=file_metadata, media_body=media, fields='id').execute()
Why am I able to create a file, but not update a file?
How can I improve my code for updating a file so that it successfully updates the file on Drive?
Summary:
My code for updating a file runs smoothly, but it doesn't do the thing it's intended to do, that is, update the contents of the file. How can I update the contents of a file on Drive using the Google Drive Python API v3?
It seems that the body parameter should point to a file resource object, not a plain dictionary as in your code. See this example from the documentation (note that it is an older v2-era sample: in v3 the metadata field is name rather than title, and there is no newRevision parameter):
try:
    # First retrieve the file from the API.
    file = service.files().get(fileId=file_id).execute()
    # File's new metadata.
    file['title'] = new_title
    file['description'] = new_description
    file['mimeType'] = new_mime_type
    # File's new content.
    media_body = MediaFileUpload(
        new_filename, mimetype=new_mime_type, resumable=True)
    # Send the request to the API.
    updated_file = service.files().update(
        fileId=file_id,
        body=file,
        newRevision=new_revision,
        media_body=media_body).execute()
    return updated_file
except errors.HttpError as error:
    print('An error occurred: %s' % error)
    return None
I think the problem may have been my web browser's caching feature: I was looking at files that I had updated and didn't see any changes, and caching may have caused that.
I switched browsers and tested, and that seems to have been the case.
The update method works for me now.
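In case it helps anyone else arriving here, a minimal sketch of the same update in v3 style (file_id and the path are placeholders; v3 uses name instead of v2's title and has no newRevision parameter):
from googleapiclient.http import MediaFileUpload
# Build the new metadata and content, then send a single update call.
file_metadata = {'name': 'myfile.txt'}  # v3 metadata field is 'name'
media = MediaFileUpload('/path/to/myfile.txt', mimetype='text/plain', resumable=True)
updated = drive_service.files().update(
    fileId=file_id,  # ID of the existing file to overwrite
    body=file_metadata,
    media_body=media,
    fields='id, name, modifiedTime').execute()
print(updated)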
I have a Google Drive repository to which I used to upload lots of files. This time I would like to download something from this same repository.
The following code works to download a file with file_id:
DRIVE = discovery.build('drive', 'v3', http=creds.authorize(Http()))
file_id = '23242342345YdJqjvKLVbenO22FeKcL'
request = team_drive.DRIVE.files().get_media(fileId=file_id)
fh = io.BytesIO()
downloader = MediaIoBaseDownload(fh, request)
done = False
while done is False:
    status, done = downloader.next_chunk()
    print("Download %d%%." % int(status.progress() * 100))
fh.seek(0)
with open('test.csv', 'wb') as f:
    shutil.copyfileobj(fh, f, length=131072)
I would like to do the same but download a file from a folder this time. I tried the following code to display files in a given folder with folder_id. But it does not work.
folder_id = '13223232323237jWuf3__hKAG18jVo'
results = team_drive.DRIVE.files().list(q="mimeType='application/vnd.google-apps.spreadsheet' and parents in '"+folder_id+"'",fields="nextPageToken, files(id, name)",pageSize=400).execute()
Should the code work? I got an empty list. Any contribution would be appreciated.
I believe your goal and situation as follows.
You want to download the most recently modified Google Spreadsheet from a specific folder in your shared drive, in XLSX format.
You want to achieve this using googleapis for python.
You have already been able to download the file using Drive API.
For this, I would like to propose the following sample script. The flow of this script is as follows.
Retrieve the latest Google Spreadsheet from the specific folder in the shared Drive.
For this, I use results = DRIVE.files().list(pageSize=1, fields="files(modifiedTime,name,id)", orderBy="modifiedTime desc", q="'" + folder_id + "' in parents and mimeType = 'application/vnd.google-apps.spreadsheet'", supportsAllDrives=True, includeItemsFromAllDrives=True).execute()
By this, the Google Spreadsheet with the latest modified time can be retrieved.
Retrieve the file ID of the latest Google Spreadsheet.
In this case, results.get('files', [])[0]['id'] is the file ID.
Download the Google Spreadsheet as the XLSX format.
In this case, DRIVE.files().export_media(fileId=file_id, mimeType='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet') is used.
When the above flow is used, the sample script is as follows.
Sample script:
folder_id = "###" # Please set the folder ID.
DRIVE = discovery.build('drive', 'v3', http=creds.authorize(Http()))
results = DRIVE.files().list(pageSize=1, fields="files(modifiedTime,name,id)", orderBy="modifiedTime desc", q="'" + folder_id + "' in parents and mimeType = 'application/vnd.google-apps.spreadsheet'", supportsAllDrives=True, includeItemsFromAllDrives=True).execute()
items = results.get('files', [])
if items:
file_id = items[0]['id']
file_name = items[0]['name']
request = DRIVE.files().export_media(fileId=file_id, mimeType='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet')
fh = io.FileIO(file_name + '.xlsx', mode='wb')
downloader = MediaIoBaseDownload(fh, request)
done = False
while done is False:
status, done = downloader.next_chunk()
print('Download %d%%.' % int(status.progress() * 100))
Note:
From your script, I couldn't tell exactly how DRIVE and team_drive.DRIVE relate. In this case, from DRIVE = discovery.build('drive', 'v3', http=creds.authorize(Http())), I used DRIVE. If this cannot be used, please modify it.
Reference:
Files: list in Drive API v3
I use this function to get the URLs of files in a Drive folder:
from google.colab import auth
from oauth2client.client import GoogleCredentials
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
myDrive = GoogleDrive(gauth)
def getGDriveFileLinks(drive, folder_id, mime_type=None):
    """
    Returns a list of dicts of pairs of file names and shareable links
    drive: a GoogleDrive object with credentials https://pythonhosted.org/PyDrive/pydrive.html?highlight=googledrive#pydrive.drive.GoogleDrive
    folder_id: the ID of the folder containing the files (grab it from the folder's URL)
    mime_type (optional): the identifier of the filetype https://developers.google.com/drive/api/v3/mime-types,
        https://www.iana.org/assignments/media-types/media-types.xhtml
    """
    file_list = []
    mime_type_query = "mimeType='{}' and ".format(mime_type) if mime_type is not None else ''
    files = drive.ListFile({'q': mime_type_query + "'{}' in parents".format(folder_id)}).GetList()
    for file in files:
        keys = file.keys()
        if 'alternateLink' in keys:
            link = file['alternateLink']
        elif 'webContentLink' in keys:
            link = file['webContentLink']
        elif 'webViewLink' in keys:
            link = file['webViewLink']
        else:
            try:
                file.InsertPermission({
                    'type': 'anyone',
                    'value': 'anyone',
                    'role': 'reader'})
                link = file['alternateLink']
            except (HttpError, ApiRequestError):
                link = 'Insufficient permissions for this file'
        if 'title' in keys:
            name = file['title']
        else:
            name = file['id']
        file_list.append({'name': name, 'link': link})
    return file_list
print(getGDriveFileLinks(myDrive, 'folder_id'))
Then, the URL can be used to retrieve the file using pydrive.
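As a quick follow-up, a minimal sketch of downloading one of the listed files with pydrive (FILE_ID and the output filename are placeholders):
# fetch the file's metadata by ID, then save its content locally
f = myDrive.CreateFile({'id': 'FILE_ID'})  # FILE_ID from getGDriveFileLinks
f.GetContentFile('downloaded_file.csv')    # writes the file's content to disk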
If anyone uses Ruby and needs help, this method returns an IO:
drive.export_file(sheet_id, "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")
ref: https://googleapis.dev/ruby/google-api-client/latest/Google/Apis/DriveV3/DriveService.html#export_file-instance_method
I am using the Google Drive API to search inside a folder named 'Users' and then store the filenames that reside in that folder in a list. In Django, every time I refresh a page that triggers this Google API function, the same filenames get appended to the list again.
Ex. 1st execution
files_in_users = ['1','2','3','4','5']
same page refreshed again
files_in_users = ['1','2','3','4','5','1','2','3','4','5']
And this goes on every time I refresh the page. I have even tried restarting the server, but there is always some redundant data in the list. I have also used the list.clear() function at the end of execution. Why does this happen? Is there some cache that I need to delete after each execution?
Code:
#GET ID OF USER FOLDER FROM DRIVE
drive_users_id = get_user_file_id(service)
# GET THE FILES IN USER FOLDER FROM DRIVE
flow_of_users_file = service.files().list(q=" '{0}' in parents ".format(drive_users_id), spaces='drive').execute()
for i in flow_of_users_file['files']:
    files_in_user_folder.append(i['name'])
print('files in drive are :', files_in_user_folder)
After printing flow_of_users_file, which is of type dictionary, I could see there is redundant data in it; some files are getting appended every time I execute that line.
VIEWS.PY
def datapage(request):
    #all_data = Userdata.objects.all()
    files_in_user_folder = []
    #DRIVE API SETTINGS
    creds = None
    if os.path.exists('token.pickle'):
        with open('token.pickle', 'rb') as token:
            creds = pickle.load(token)
    # If there are no (valid) credentials available, let the user log in.
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                'credentials.json', SCOPES)
            creds = flow.run_local_server(port=0)
        # Save the credentials for the next run
        with open('token.pickle', 'wb') as token:
            pickle.dump(creds, token)
    service = build('drive', 'v3', credentials=creds)
    #GET ID OF USER FOLDER FROM DRIVE
    drive_users_id = get_user_file_id(service)
    # GET THE FILES IN USER FOLDER FROM DRIVE
    flow_of_users_file = service.files().list(q=" '{0}' in parents ".format(drive_users_id), spaces='drive').execute()
    print(type(flow_of_users_file))
    for i in flow_of_users_file['files']:
        files_in_user_folder.append(i['name'])
    # FILES IN LOCAL FOLDER CALLED 'user_output_files'
    files_in_local_folder = os.listdir(settings.BASE_DIR + '/users_output_files/')
    print('files in drive are :', files_in_user_folder)
    print('files in local folder are :', files_in_local_folder)
    #CHECK IF ANY FILE UPDATES EXIST
    z = list(set(files_in_user_folder) - set(files_in_local_folder))
    #Checking if the local files have been generated
    if len(files_in_local_folder) == 0:
        print("No local files exist creating everything")
        for folder in flow_of_users_file['files']:
            name_of_folder = folder['name']
            if not os.path.exists(settings.BASE_DIR + '/users_output_files/' + str(name_of_folder)):
                print('No file does not exist')
                os.mkdir(settings.BASE_DIR + '/users_output_files/' + str(name_of_folder))
            folder_mime_type = folder['mimeType']
            if folder_mime_type == 'application/vnd.google-apps.folder':
                flow_of_file = service.files().list(q=" '{0}' in parents ".format(folder['id']), spaces='drive').execute()
                for file in flow_of_file['files']:
                    print('The contents of folder {0} are {1}'.format(name_of_folder, file['name']))
                    response = service.files().get_media(fileId=file['id']).execute()
                    print(type(response))
                    data = json.loads(response)
                    converter_xls(name_of_folder, file['name'], data)
It looks to me as if you aren't clearing files_in_user_folder, and then you're appending things to it.
You should try
# GET THE FILES IN USER FOLDER FROM DRIVE
flow_of_users_file = service.files().list(q=" '{0}' in parents ".format(drive_users_id), spaces='drive').execute()
files_in_user_folder.clear()
for i in flow_of_users_file['files']:
    files_in_user_folder.append(i['name'])
print('files in drive are :', files_in_user_folder)
The reason this redundancy was occurring was similar data in my Drive trash. So, while scanning for folders, it was appending both the trashed data and my Drive data.
This is likely because you have deleted the files in your Drive, and now they are in the trash. That means they still have the parent folder attribute pointing to the Users folder, BUT the attribute named trashed is now set to true.
So if you want to avoid duplicate filenames, take care to pass trashed = false in your query string, as shown in the sketch below.
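A minimal sketch of the adjusted query, reusing the drive_users_id variable from the question:
# exclude items in the trash so deleted copies are not listed again
flow_of_users_file = service.files().list(
    q="'{0}' in parents and trashed = false".format(drive_users_id),
    spaces='drive').execute()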
Further reading:
Files list method
files.list querying
File resource
When I retrieve CSV files from Google Drive via the API, I get files with no contents.
The code below consists of 3 parts (1: authenticate, 2: search for files, 3: download files).
I suspect there is something wrong in step 3 (download files), specifically around while done is False, because I have no problem accessing Google Drive and downloading files; it's just that they are all empty files.
It would be great if someone could show me how to fix it.
The code below is mostly borrowed from the Google website. Thank you for your time in advance!
Step 1: Authentication
from apiclient import discovery
from httplib2 import Http
import oauth2client
from oauth2client import file, client, tools
obj = lambda: None  # this trick allows for an empty object to hang attributes on
auth = {"auth_host_name": 'localhost', 'noauth_local_webserver': 'store_true', 'auth_host_port': [8080, 8090], 'logging_level': 'ERROR'}
for k, v in auth.items():
    setattr(obj, k, v)
scopes = 'https://www.googleapis.com/auth/drive'
store = file.Storage('token_google_drive2.json')
creds = store.get()
# The following will take a user to an authentication link if no token file is found.
if not creds or creds.invalid:
    flow = client.flow_from_clientsecrets('client_id.json', scopes)
    creds = tools.run_flow(flow, store, obj)
Step 2: Search for files and create a dictionary of files to download
from googleapiclient.discovery import build
page_token = None
drive_service = build('drive', 'v3', credentials=creds)
# initialise the lists once, outside the pagination loop, so results from
# every page are kept (initialising them inside the loop discards prior pages)
name_list = []
id_list = []
while True:
    response = drive_service.files().list(q="mimeType='text/csv' and name contains 'RR' and name contains '20191001'", spaces='drive', fields='nextPageToken, files(id, name)', pageToken=page_token).execute()
    for file in response.get('files', []):
        name = file.get('name')
        id_ = file.get('id')
        # name and id are strings, so create lists first before creating a dictionary
        name_list.append(name)
        id_list.append(id_)
    # also you need to remove ":" in name_list or you cannot download files - nowhere to be found in the folder!
    name_list = [word.replace(':', '') for word in name_list]
    page_token = response.get('nextPageToken', None)
    if page_token is None:
        break
#### Create dictionary using name_list and id_list
zipobj = zip(name_list, id_list)
temp_dic = dict(zipobj)
Step 3: Download Files (the troublesome part)
import io
from googleapiclient.http import MediaIoBaseDownload
for i in range(len(temp_dic.values())):
    file_id = list(temp_dic.values())[i]
    v = list(temp_dic.keys())[i]
    request = drive_service.files().get_media(fileId=file_id)
    fh = io.FileIO(v, mode='w')
    downloader = MediaIoBaseDownload(fh, request)
    done = False
    while done is False:
        status, done = downloader.next_chunk()
        status_complete = int(status.progress() * 100)
        print(f'Download of {len(temp_dic.values())} files, {int(status.progress() * 100)}%')
Actually, I figured it out myself. Below is an edit.
All I needed to do was delete the done = False / while done is False: loop and add fh.close() to close the file handle.
The complete revised part 3 is as follows:
from googleapiclient.http import MediaIoBaseDownload
for i in range(len(temp_dic.values())):
    file_id = list(temp_dic.values())[i]
    v = list(temp_dic.keys())[i]
    request = drive_service.files().get_media(fileId=file_id)
    # replace the filename and extension in the first field below
    fh = io.FileIO(v, mode='wb')  # on Windows, writing binary must be specified with wb
    downloader = MediaIoBaseDownload(fh, request)
    status, done = downloader.next_chunk()
    status_complete = int(status.progress() * 100)
    print(f'{list(temp_dic.keys())[i]} is {int(status.progress() * 100)}% downloaded')
    fh.close()
print(f'{len(list(temp_dic.keys()))} files')
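One caveat with the revision above: a single next_chunk() call only fetches the first chunk, so files larger than the default chunk size could still come out truncated. A sketch that keeps the loop but still guarantees the handle is closed, using the same temp_dic as above:
for v, file_id in temp_dic.items():
    request = drive_service.files().get_media(fileId=file_id)
    # FileIO used as a context manager is flushed and closed automatically
    with io.FileIO(v, mode='wb') as fh:
        downloader = MediaIoBaseDownload(fh, request)
        done = False
        while not done:
            status, done = downloader.next_chunk()
            print(f'{v} is {int(status.progress() * 100)}% downloaded')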