Downloading a large number of small files from Google Drive - python

I have a folder in Google Drive that is shared publicly and contains thousands of files, each 5-8 MB in size.
I tried the following to download one file at a time:
Get a list of file IDs/names from the folder using getfilelistpy package as follows:
from getfilelistpy import getfilelist

resource = {
    "api_key": API_KEY,
    "id": FOLDER_ID,
    "fields": "files(name,id)",
}
res = getfilelist.GetFileList(resource)
print(res)
Once the file list is obtained, I loop through each file and use wget to download it:
for i in range(len(id_list)):
    command1 = "wget --no-check-certificate -r 'https://docs.google.com/uc?export=download&id={}' -O '{}'".format(id_list[i], file_name_list[i])
    os.system(command1)
After 65-66 files, the downloaded files are 0 KB. Does Google Drive put a limit on the number of files or the total size one can download? How can we overcome that?
Any help would be appreciated. Thank you.

In your script, an API key is used, which implies the files in the folder are publicly shared; your reply confirmed this. So in this answer, I would like to propose downloading the files using the API key.
The modified script is as follows.
Modified script:
import requests
from getfilelistpy import getfilelist

API_KEY = '###'    # Please set your API key.
FOLDER_ID = '###'  # Please set the folder ID.

resource = {
    "api_key": API_KEY,
    "id": FOLDER_ID,
    "fields": 'nextPageToken, files(id,name,webContentLink,mimeType)',
}
res = getfilelist.GetFileList(resource)
for files in res['fileList']:
    for file in files['files']:
        if 'google' not in file['mimeType']:
            filename = file['name']
            print('%s is downloading.' % filename)
            r = requests.get(file['webContentLink'], stream=True)
            if r.status_code == 200:
                with open(filename, 'wb') as f:
                    f.write(r.content)
Reference:
getfilelistpy


Download a spreadsheet from a folder using Google Drive api

I have a Google Drive repository where I upload lots of files. This time I would like to download something from the same repository.
The following code works to download a file with file_id:
DRIVE = discovery.build('drive', 'v3', http=creds.authorize(Http()))

file_id = '23242342345YdJqjvKLVbenO22FeKcL'
request = team_drive.DRIVE.files().get_media(fileId=file_id)
fh = io.BytesIO()
downloader = MediaIoBaseDownload(fh, request)
done = False
while done is False:
    status, done = downloader.next_chunk()
    print("Download %d%%." % int(status.progress() * 100))
fh.seek(0)
with open('test.csv', 'wb') as f:
    shutil.copyfileobj(fh, f, length=131072)
I would like to do the same, but this time download a file from a folder. I tried the following code to list the files in a given folder with folder_id, but it does not work.
folder_id = '13223232323237jWuf3__hKAG18jVo'
results = team_drive.DRIVE.files().list(q="mimeType='application/vnd.google-apps.spreadsheet' and parents in '"+folder_id+"'",fields="nextPageToken, files(id, name)",pageSize=400).execute()
Should the code work? I got an empty list. Any contribution would be appreciated.
I believe your goal and situation are as follows.
You want to download the Google Spreadsheet with the latest modified time from the specific folder in your shared drive, in XLSX format.
You want to achieve this using googleapis for python.
You have already been able to download the file using Drive API.
For this, I would like to propose the following sample script. The flow of this script is as follows.
Retrieve the latest Google Spreadsheet from the specific folder in the shared Drive.
For this, I use results = DRIVE.files().list(pageSize=1, fields="files(modifiedTime,name,id)", orderBy="modifiedTime desc", q="'" + folder_id + "' in parents and mimeType = 'application/vnd.google-apps.spreadsheet'", supportsAllDrives=True, includeItemsFromAllDrives=True).execute()
By this, the Google Spreadsheet with the latest modified time can be retrieved.
Retrieve the file ID of the latest Google Spreadsheet.
In this case, results.get('files', [])[0]['id'] is the file ID.
Download the Google Spreadsheet as the XLSX format.
In this case, DRIVE.files().export_media(fileId=file_id, mimeType='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet') is used.
When the above flow is used, the sample script is as follows.
Sample script:
folder_id = "###"  # Please set the folder ID.

DRIVE = discovery.build('drive', 'v3', http=creds.authorize(Http()))
results = DRIVE.files().list(
    pageSize=1,
    fields="files(modifiedTime,name,id)",
    orderBy="modifiedTime desc",
    q="'" + folder_id + "' in parents and mimeType = 'application/vnd.google-apps.spreadsheet'",
    supportsAllDrives=True,
    includeItemsFromAllDrives=True
).execute()
items = results.get('files', [])
if items:
    file_id = items[0]['id']
    file_name = items[0]['name']
    request = DRIVE.files().export_media(fileId=file_id, mimeType='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet')
    fh = io.FileIO(file_name + '.xlsx', mode='wb')
    downloader = MediaIoBaseDownload(fh, request)
    done = False
    while done is False:
        status, done = downloader.next_chunk()
        print('Download %d%%.' % int(status.progress() * 100))
Note:
From your script, I couldn't tell exactly what DRIVE and team_drive.DRIVE refer to. Here, following DRIVE = discovery.build('drive', 'v3', http=creds.authorize(Http())), I used DRIVE. If this cannot be used, please modify it.
Reference:
Files: list in Drive API v3
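A likely reason the question's query returned an empty list is the q syntax: the Drive API expects "'<folderId>' in parents", not "parents in '<folderId>'". A tiny hypothetical helper (not part of googleapis) makes the correct term order explicit:

```python
def drive_list_query(folder_id, mime_type=None):
    """Build a Drive API v3 'q' string; note the "'<id>' in parents" order."""
    q = "'{}' in parents".format(folder_id)
    if mime_type is not None:
        q += " and mimeType = '{}'".format(mime_type)
    return q

# Used as, e.g.:
# DRIVE.files().list(q=drive_list_query(folder_id,
#     'application/vnd.google-apps.spreadsheet'), ...).execute()
```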
I use this function to get the URLs of files in a Drive folder:
from google.colab import auth
from oauth2client.client import GoogleCredentials
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from pydrive.files import ApiRequestError
from googleapiclient.errors import HttpError

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
myDrive = GoogleDrive(gauth)

def getGDriveFileLinks(drive, folder_id, mime_type=None):
    """
    Returns a list of dicts of pairs of file names and shareable links
    drive: a GoogleDrive object with credentials
        https://pythonhosted.org/PyDrive/pydrive.html?highlight=googledrive#pydrive.drive.GoogleDrive
    folder_id: the ID of the folder containing the files (grab it from the folder's URL)
    mime_type (optional): the identifier of the file type
        https://developers.google.com/drive/api/v3/mime-types,
        https://www.iana.org/assignments/media-types/media-types.xhtml
    """
    file_list = []
    mime_type_query = "mimeType='{}' and ".format(mime_type) if mime_type is not None else ''
    files = drive.ListFile({'q': mime_type_query + "'{}' in parents".format(folder_id)}).GetList()
    for file in files:
        keys = file.keys()
        if 'alternateLink' in keys:
            link = file['alternateLink']
        elif 'webContentLink' in keys:
            link = file['webContentLink']
        elif 'webViewLink' in keys:
            link = file['webViewLink']
        else:
            try:
                file.InsertPermission({
                    'type': 'anyone',
                    'value': 'anyone',
                    'role': 'reader'})
                link = file['alternateLink']
            except (HttpError, ApiRequestError):
                link = 'Insufficient permissions for this file'
        if 'title' in keys:
            name = file['title']
        else:
            name = file['id']
        file_list.append({'name': name, 'link': link})
    return file_list

print(getGDriveFileLinks(myDrive, 'folder_id'))
Then, the URL can be used to retrieve the file using pydrive.
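To go from one of those links back to a downloadable file, you need the file ID embedded in the URL. A small sketch (the `drive_id_from_link` helper is hypothetical and only covers the common Drive URL shapes, not every variant):

```python
import re

def drive_id_from_link(url):
    """Extract the Drive file ID from '/d/<id>' or '?id=<id>' style URLs."""
    m = re.search(r'(?:/d/|[?&]id=)([\w-]+)', url)
    return m.group(1) if m else None

# With pydrive, the extracted ID can then be used as:
# f = drive.CreateFile({'id': drive_id_from_link(link)})
# f.GetContentFile(local_name)
```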
If anyone uses Ruby and needs help, this method returns an IO:
drive.export_file(sheet_id, "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")
ref: https://googleapis.dev/ruby/google-api-client/latest/Google/Apis/DriveV3/DriveService.html#export_file-instance_method

Uploading a folder (containing .csv files) to the DB using Firebase in Python

I want to save a folder containing .csv files to the database using Firebase, but I am failing to do that. Please help me.
import os
from firebase import firebase

firebase = firebase.FirebaseApplication("https://face-crime.firebaseio.com/")
path = r'C:\Users\Jshei\Desktop\Criminal Face\Crime Record'
files = []
# r=root, d=directories, f=files
for r, d, f in os.walk(path):
    for file in f:
        if '.csv' in file:
            files.append(os.path.join(r, file))
for f in files:
    print(f)
import pyrebase  # the Pyrebase SDK for Firebase

config = {
    "apiKey": "your API key",
    "authDomain": "your domain",
    "databaseURL": "your db url",
    "projectId": "your id",
    "storageBucket": "bucket name",
    # Don't forget to quote these strings; Firebase won't add the quotes for you.
    "messagingSenderId": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "appId": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "measurementId": "xxxxxxxxxxxxxx"
}

firebase = pyrebase.initialize_app(config)  # app built from the config
storage = firebase.storage()  # storage bucket handle

# "Crime Record csv files" is a folder in the Firebase project; the .csv file is
# where the crime info is stored. Note the raw-string prefix 'r' on the local path.
storage.child("Crime Record csv files/Crime Record_2020-08-28_21-37-01.csv").put(
    r"/home/folio/Desktop/xxxxxxxxxx.csv")

# Same pattern for images under "Crime Record Images".
storage.child("Crime Record Images/xxxxxx.jpg").put(
    r"/home/folio/Desktop/xxxxxxxxxxxx.jpg")
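To upload the whole folder rather than one file, the os.walk list from the question can be combined with storage.child(...).put(...) from the answer. A minimal sketch, where `storage_key` is a hypothetical helper and the bucket layout is an assumption:

```python
import os

def storage_key(local_path, prefix="Crime Record csv files"):
    """Map a local file path to a bucket key under one folder (hypothetical layout)."""
    return "{}/{}".format(prefix, os.path.basename(local_path))

# Upload every CSV collected by the os.walk loop from the question (sketch):
# for f in files:
#     storage.child(storage_key(f)).put(f)
```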

How can I download files from team drives?

I'd like to periodically download and upload a file from a shared team drive on Google Drive. I can upload to the folder, but not download.
This is what I've tried:
team_drive_id = 'YYY'
file_to_download = 'ZZZ'
parent_folder_id = 'XXX'

f = drive.CreateFile({
    'id': file_to_download,
    'parents': [{
        'kind': 'drive#fileLink',
        'teamDriveId': team_drive_id,
        'id': parent_folder_id
    }]
})
f = drive.CreateFile({'id': file_to_download})
f.GetContentFile('test.csv', mimetype='text/csv')
But this is what I get:
ApiRequestError: <HttpError 404 when requesting https://www.googleapis.com/drive/v2/files/file_to_download?alt=json returned "File not found: file_to_download">
Any suggestions?
Following the documentation that can be seen here.
First you create the file:
f = drive.CreateFile({'id': file_to_download})
Then you set the content of the file:
f.SetContentString("whatever comes inside this file; it is a csv, so you know what to expect here")
And to finish the upload you need to call:
f.Upload()
After that, the file is properly created there; you can read it back using the GetContentString method.

how to get the size of a file from firebase storage with python and pyrebase

I'm trying to get the size of a file in Firebase Storage from Python. I list the items with the following code:
storage = firebase.storage()
# storage.child("chat11.exe").put("PdaNetA5105.exe")
files = storage.list_files()
for file in files:
    print(file.name)
    print(storage.child(file.name).get_url(None))
    url = storage.child(file.name).get_url(None)
    if file.name == "chat1.PNG":
        sour = open(bytes(url))
but it does not work. What I'm trying to do is show the progress of uploading and downloading files from Firebase.
Any ideas?
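For reference, Pyrebase's list_files() appears to wrap google-cloud-storage Blob objects, which carry a size attribute; I haven't verified this against every Pyrebase version, so treat it as an assumption. The formatter below is a hypothetical helper for displaying that size:

```python
def human_size(num_bytes):
    """Format a byte count for progress display, e.g. 5.0 MB."""
    for unit in ('B', 'KB', 'MB', 'GB'):
        if num_bytes < 1024 or unit == 'GB':
            return '%.1f %s' % (num_bytes, unit)
        num_bytes /= 1024.0

# If list_files() yields google-cloud-storage Blob objects (assumption, see above):
# for blob in storage.list_files():
#     print(blob.name, human_size(blob.size or 0))
```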

Download files from a Box location using API

How to download files from a Box location programmatically?
I have a shared Box location URL (not the exact path of the Box location).
I want to download all the files under the location.
I checked the SDK below to connect to Box but was unable to find methods to download files from a shared link:
https://github.com/box/box-python-sdk
from boxsdk import Client
from boxsdk import OAuth2

oauth = OAuth2(
    client_id='XXX',
    client_secret='XXX',
    store_tokens='XXX',
)
data = client.make_request(
    'GET',
    '<Shared BOX URL>',
)
Please help.
Get metadata of shared Box link:
shared_folder = client.get_shared_item("https://app.box.com/s/0123456789abcdef0123456789abcdef")
Loop through each item inside the folder and download each file using boxsdk.object.file.File.content or boxsdk.object.file.File.download_to:
for item in shared_folder.get_items(limit=1000):
    if item.type == 'file':
        # Get the file contents into memory
        file_contents = client.file(file_id=item.id).content()
        # Or download to a local file (download_to expects a writable stream)
        with open(item.name, 'wb') as f:
            client.file(file_id=item.id).download_to(f)
You can use the method that gives you the direct URL:
download_url = client.file(file_id='SOME_FILE_ID').get_shared_link_download_url()
And then you can use urllib to download it to your local computer (in Python 3, urlretrieve lives in urllib.request):
import urllib.request
urllib.request.urlretrieve(download_url, your_local_file_name)
Does that solve your problem?
Pre-requisite:
oauth = OAuth2(
    client_id='strBoxClientID',
    client_secret='strBoxClientSecret',
    access_token=access_token,
)
client = Client(oauth)
Initial attempt (failed, it produces an empty file):
with open(Box_File.name, 'wb') as open_file:
    client.file(Box_File.id).download_to(open_file)
open_file.close()
Final solution:
output_file = open('strFilePath' + str(Box_File.name), 'wb')
Box_File.download_to(output_file)
