Downloading all tabs of a spreadsheet Google Drive API - python

I'm trying to download the full content of a spreadsheet using the Google Drive API. Currently, my code exports the spreadsheet and writes it to a file, but only the content of the first tab is included. How can I make it download the full content of the file?
This is the function that I'm currently using:
def download_file(real_file_id, service):
    try:
        file_id = real_file_id
        request = service.files().export_media(fileId=file_id,
                                               mimeType='text/csv')
        file = io.BytesIO()
        downloader = MediaIoBaseDownload(file, request)
        done = False
        while done is False:
            status, done = downloader.next_chunk()
            print(F'Download {int(status.progress() * 100)}.')
    except HttpError as error:
        print(F'An error occurred: {error}')
        file = None

    file_object = open('test.csv', 'a')
    file_object.write(file.getvalue().decode("utf-8"))
    file_object.close()
    return file.getvalue()
I call the function at a later stage in my code, passing the already initialised Google Drive service and the file ID:
download_file(real_file_id='XXXXXXXXXXXXXXXXXXXXX', service=service)

I believe your goal is as follows.
You want to download all sheets in a Google Spreadsheet as CSV data.
You want to achieve this using googleapis for python.
In this case, how about the following sample script? In order to retrieve the sheet names of each sheet in the Google Spreadsheet, the Sheets API is used. Using the Sheets API, the sheet IDs of all sheets are retrieved. Using these sheet IDs, all sheets are downloaded as CSV data.
Sample script:
From the script you showed, I guessed that service might be service = build("drive", "v3", credentials=creds). If my understanding is correct, please use creds to retrieve the access token.
spreadsheetId = "###"  # Please set the Spreadsheet ID.

sheets = build("sheets", "v4", credentials=creds)
sheetObj = sheets.spreadsheets().get(spreadsheetId=spreadsheetId, fields="sheets(properties(sheetId,title))").execute()
accessToken = creds.token
for s in sheetObj.get("sheets", []):
    p = s["properties"]
    sheetName = p["title"]
    print("Download: " + sheetName)
    url = "https://docs.google.com/spreadsheets/export?id=" + spreadsheetId + "&exportFormat=csv&gid=" + str(p["sheetId"])
    res = requests.get(url, headers={"Authorization": "Bearer " + accessToken})
    with open(sheetName + ".csv", mode="wb") as f:
        f.write(res.content)
In this case, please add import requests.
When this script is run, all sheets in the Google Spreadsheet are downloaded as CSV data. The filename of each CSV file uses the tab name in the Google Spreadsheet.
In this case, please add the scope "https://www.googleapis.com/auth/spreadsheets.readonly" as follows, and reauthorize the scopes (the sketch after the scope list shows one way to do this). Please be careful about this.
SCOPES = [
    "https://www.googleapis.com/auth/drive.readonly",  # Please use this for your actual situation.
    "https://www.googleapis.com/auth/spreadsheets.readonly",
]
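If you are using the typical installed-app flow, reauthorization might look like the following sketch (the filenames credentials.json and token.json are assumptions; adapt them to your setup). Deleting the cached token forces the consent screen to be shown again so the new Sheets scope is granted.
import os
from google_auth_oauthlib.flow import InstalledAppFlow

SCOPES = [
    "https://www.googleapis.com/auth/drive.readonly",
    "https://www.googleapis.com/auth/spreadsheets.readonly",
]

# token.json caches credentials from a previous run; remove it so the
# OAuth consent screen is shown again and the new scope can be granted.
if os.path.exists("token.json"):
    os.remove("token.json")

flow = InstalledAppFlow.from_client_secrets_file("credentials.json", SCOPES)
creds = flow.run_local_server(port=0)
with open("token.json", "w") as f:
    f.write(creds.to_json())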
Reference:
Method: spreadsheets.get

Tanaike's answer is easier and more straightforward, but I already spent some time on this so I might as well post it as an alternative.
The problem you originally encountered is that CSV files do not support multiple tabs/sheets, so Drive's files.export will only export the first sheet, and it doesn't have a way to select specific sheets.
Another way you can approach this is to use the Sheets API copyTo() method to create temp files for each sheet and export those as single CSV files.
# need a service for sheets and one for drive
sheetservice = build('sheets', 'v4', credentials=creds)
driveservice = build('drive', 'v3', credentials=creds)

spreadsheet = sheetservice.spreadsheets()
result = spreadsheet.get(spreadsheetId=YOUR_SPREADSHEET).execute()
sheets = result.get('sheets', [])  # the list of sheets within your spreadsheet

# standard metadata to create the blank spreadsheet files
file_metadata = {
    "name": "temp",
    "mimeType": "application/vnd.google-apps.spreadsheet"
}

for sheet in sheets:
    # create a blank spreadsheet and get its ID
    tempfile = driveservice.files().create(body=file_metadata).execute()
    tempid = tempfile.get('id')

    # copy the sheet to the new file
    sheetservice.spreadsheets().sheets().copyTo(
        spreadsheetId=YOUR_SPREADSHEET,
        sheetId=sheet['properties']['sheetId'],
        body={"destinationSpreadsheetId": tempid}).execute()

    # need to delete the first sheet since the copy gets added as second
    sheetservice.spreadsheets().batchUpdate(
        spreadsheetId=tempid,
        body={"requests": {"deleteSheet": {"sheetId": 0}}}).execute()

    download_file(tempid, driveservice)  # runs your original method to download the file
    driveservice.files().delete(fileId=tempid).execute()  # to clean up the temp file
You'll also need the https://www.googleapis.com/auth/spreadsheets and https://www.googleapis.com/auth/drive scopes. This involves more API calls so I just recommend Tanaike's method, but I hope it gives you an idea of ways that you can play with the API to suit your needs.
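For reference, the scope list for this approach would look something like the following (a sketch; merge it into however you build creds):
SCOPES = [
    "https://www.googleapis.com/auth/spreadsheets",
    "https://www.googleapis.com/auth/drive",
]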

Related

How can I fetch Google Sheet name from Google sheet ID using python and google sheets API?

I have code which exports Google Sheets into CSV and stores the files in my local storage. I want to store each CSV file with the same name as the Google Sheet. Can someone help me out with this?
Right now, this code saves the CSV file as sheet1.csv; how can I make it have the original sheet name?
import pandas as pd
from Google import Create_Service

def sheets_to_csv(GOOGLE_SHEET_ID):
    CLIENT_SECRET_FILE = 'secret.json'
    API_SERVICE_NAME = 'sheets'
    API_VERSION = 'v4'
    SCOPES = ['https://www.googleapis.com/auth/spreadsheets']
    service = Create_Service(CLIENT_SECRET_FILE, API_SERVICE_NAME, API_VERSION, SCOPES)

    try:
        gsheets = service.spreadsheets().get(spreadsheetId=GOOGLE_SHEET_ID).execute()
        sheets = gsheets['sheets']
        for sheet in sheets:
            dataset = service.spreadsheets().values().get(
                spreadsheetId=GOOGLE_SHEET_ID,
                range=sheet['properties']['title'],
                majorDimension='ROWS'
            ).execute()
            df = pd.DataFrame(dataset['values'])
            df.columns = df.iloc[0]
            df.drop(df.index[0], inplace=True)
            df.to_csv(sheet['properties']['title'] + '.csv', index=False)
            print()
    except Exception as e:
        print(e)

sheets_to_csv('1LqynjF33-mrO9M5INf4wJvJuY57Hhy4vjv_FjtuM')
From your following reply,
each sheet is being exported as a CSV file. But I would like to fetch the name of the overall sheet and name the CSV files as Name-sheet1.csv, Name-sheet2.csv and so on.
I understood your goal as follows.
You want to retrieve the Spreadsheet title and use it in the filename of each CSV file, like Name-sheet1.csv, Name-sheet2.csv.
In your script, how about the following modification?
From:
df.to_csv(sheet['properties']['title'] + '.csv', index=False)
To:
df.to_csv(gsheets['properties']['title'] + '-' + sheet['properties']['title'] + '.csv', index=False)
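As a minimal sketch, the whole modified loop would look like this (assuming gsheets was retrieved with spreadsheets().get() as in your script, so gsheets['properties']['title'] holds the Spreadsheet title):
spreadsheet_title = gsheets['properties']['title']  # the overall Spreadsheet name
for sheet in gsheets['sheets']:
    sheet_title = sheet['properties']['title']
    dataset = service.spreadsheets().values().get(
        spreadsheetId=GOOGLE_SHEET_ID,
        range=sheet_title,
        majorDimension='ROWS'
    ).execute()
    df = pd.DataFrame(dataset['values'])
    df.columns = df.iloc[0]
    df.drop(df.index[0], inplace=True)
    # produces e.g. "Name-Sheet1.csv", "Name-Sheet2.csv", ...
    df.to_csv(spreadsheet_title + '-' + sheet_title + '.csv', index=False)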
Reference:
SpreadsheetProperties

Google spreadsheet to Pandas dataframe via Pydrive without download

How do I read the content of a Google spreadsheet into a Pandas dataframe without downloading the file?
I think gspread or df2gspread may be good shots, but I've been working with pydrive so far and got close to the solution.
With PyDrive I managed to get the export link of my spreadsheet, as either a .csv or .xlsx file. After the authentication process, this looks like:
gauth = GoogleAuth()
gauth.LocalWebserverAuth()
drive = GoogleDrive(gauth)

# choose whether to export csv or xlsx
data_type = 'csv'

# get list of files in folder as dictionaries
file_list = drive.ListFile({'q': "'my-folder-ID' in parents and trashed=false"}).GetList()

export_key = 'exportLinks'
excel_key = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
csv_key = 'text/csv'

if data_type == 'excel':
    urls = [file[export_key][excel_key] for file in file_list]
elif data_type == 'csv':
    urls = [file[export_key][csv_key] for file in file_list]
The type of url I get for xlsx is
https://docs.google.com/spreadsheets/export?id=my-id&exportFormat=xlsx
and similarly for csv
https://docs.google.com/spreadsheets/export?id=my-id&exportFormat=csv
Now, if I click on these links (or visit them with webbrowser.open(url)), I download the file, that I can then normally read into a Pandas dataframe with pandas.read_excel() or pandas.read_csv(), as described here.
How can I skip the download, and directly read the file into a dataframe from these links?
I tried several solutions:
The obvious pd.read_csv(url) gives
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 2
Interestingly these numbers (1, 6, 2) do not depend on the number of rows and columns in my spreadsheet, hinting that the script is trying to read not what it is intended to.
The analogue pd.read_excel(url) gives
ValueError: Excel file format cannot be determined, you must specify an engine manually.
and specifying e.g. engine = 'openpyxl' gives
zipfile.BadZipFile: File is not a zip file
The BytesIO solution looked promising, but
r = requests.get(url)
data = r.content
df = pd.read_csv(BytesIO(data))
still gives
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 2
If I print(data) I get hundreds of lines of html code
b'\n<!DOCTYPE html>\n<html lang="de">\n <head>\n <meta charset="utf-8">\n <meta content="width=300, initial-scale=1" name="viewport">\n
...
...
</script>\n </body>\n</html>\n'
In your situation, how about the following modification? In this case, the access token is retrieved from gauth, the Spreadsheet is exported as XLSX data, and the XLSX data is read into the dataframe.
Modified script:
gauth = GoogleAuth()
gauth.LocalWebserverAuth()

spreadsheetId = "###"  # Please set the Spreadsheet ID.
url = f"https://docs.google.com/spreadsheets/export?id={spreadsheetId}&exportFormat=xlsx"
res = requests.get(url, headers={"Authorization": "Bearer " + gauth.attr['credentials'].access_token})
values = pd.read_excel(BytesIO(res.content))
print(values)
In this script, please add import requests and from io import BytesIO.
In this case, the 1st tab of XLSX data is used.
When you want to use the other tab, please modify values = pd.read_excel(BytesIO(res.content)) as follows.
sheet = "Sheet2"
values = pd.read_excel(BytesIO(res.content), sheet_name=sheet)
I want to contribute an additional option to @Tanaike's excellent answer. Indeed it is quite difficult to get an Excel file (an .xlsx on Drive, not a Google Sheet) into a python environment without publishing the content to the web. Whereas the previous answer uses PyDrive and GoogleAuth(), I usually use a different method of authentication in Colab/Jupyter notebooks, adapted from the googleapis documentation. In my environment, using BytesIO(response.content) is unnecessary.
import pandas as pd
from google.colab import auth
from google.auth.transport.requests import AuthorizedSession
from google.auth import default

auth.authenticate_user()
creds, _ = default()

id = 'aaaaaaaaaaaaaaaaaaaaaaaaaaa'
sheet = 'Sheet12345'
url = f'https://docs.google.com/spreadsheets/export?id={id}&exportFormat=xlsx'

authed_session = AuthorizedSession(creds)
response = authed_session.get(url)
values = pd.read_excel(response.content, sheet_name=sheet)

Sharepoint excel data take into Pandas data frame without downloading

I need to take my SharePoint Excel file into a pandas data frame because I need to do analysis on that Excel file using python. To access SharePoint I use the below code and it works: from it I can access my Excel file, which is located in SharePoint. Now I want to take my Excel file into a pandas data frame. How can I modify the below code?
from office365.sharepoint.client_context import ClientContext

SP_SITE_URL = 'https://asdfgh.sharepoint.com/sites/ABC/'
SP_DOC_LIBRARY = 'Publications'
USERNAME = 'asd@fgh.onmicrosoft.com'
PASSWORD = '******'

# 1. Create a ClientContext object and use the user's credentials for authentication
ctx = ClientContext(SP_SITE_URL).with_user_credentials(USERNAME, PASSWORD)

# 2. Read file entities from the SharePoint document library
files = ctx.web.lists.get_by_title(SP_DOC_LIBRARY).root_folder.files
ctx.load(files)
ctx.execute_query()

# 3. loop through file entities
for file in files:
    # 4. Access the file object properties
    print(file.properties['Name'], file.properties['UniqueId'])
    # 5. Access list item object through the file object
    item = file.listItemAllFields
    ctx.load(item)
    ctx.execute_query()
    print('Access metadata - Category: {0}, Status: {1}'.format(item.properties['Category'], item.properties['Status']))

# The Output:
# File Handling in SharePoint Document Library Using Python.docx 77819f08-5fbe-450f-9f9b-d3ae2862cbb5
# Access metadata - Category: Python, Status: Submitted
For this to work, the file needs to be present in the memory of the system.
Find the path of the file - it should be in the metadata of the file objects you are already retrieving.
With the below library:
from office365.sharepoint.files.file import File
You could use the below code to store the file in memory and read it into a pandas data frame.
import io
import pandas as pd

response = File.open_binary(ctx, url)

bytes_file_obj = io.BytesIO()
bytes_file_obj.write(response.content)
bytes_file_obj.seek(0)  # set file object to start

df = pd.read_excel(bytes_file_obj, sheet_name='<Sheetname>')
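Here url is the server-relative URL of the target file. If you are iterating the document library as in your question, something like the following should work (a sketch, assuming the standard ServerRelativeUrl property is populated on the file entities):
url = file.properties['ServerRelativeUrl']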

Python how to generate a signed url for a Google cloud storage file after composing and renaming the file?

I am building files for download by exporting them from BigQuery into Google Cloud storage. Some of these files are relatively large, and so they have to be composed together after they are sharded by the BigQuery export.
The user inputs a file name for the file that will be generated on the front-end.
On the backend, I am generating a random temporary name so that I can then compose the files together.
# Job Configuration for the job to move the file to BigQuery
job_config = bigquery.QueryJobConfig()
job_config.destination = table_ref
query_job = client.query(sql, job_config=job_config)
query_job.result()

# Generate a temporary name for the file
fname = "".join(random.choices(string.ascii_letters, k=10))

# Generate Destination URI
destinationURI = info["bucket_uri"].format(filename=f"{fname}*.csv")
extract_job = client.extract_table(table_ref, destination_uris=destinationURI, location="US")
extract_job.result()

# Delete the temporary table, if it doesn't exist ignore it
client.delete_table(f"{info['did']}.{info['tmpid']}", not_found_ok=True)
After the data export has completed, I unshard the files by composing the blobs together.
client = storage.Client(project=info["pid"])
bucket = client.bucket(info['bucket_name'])
all_blobs = list(bucket.list_blobs(prefix=fname))
blob_initial = all_blobs.pop(0)

prev_ind = 0
for i in range(31, len(all_blobs), 31):
    # Compose files in chunks of 32 blobs (GCS limit)
    blob_initial.compose([blob_initial, *all_blobs[prev_ind:i]])
    # PREVENT GCS RATE-LIMIT FOR FILES EXCEEDING ~100GB
    time.sleep(1.0)
    prev_ind = i
else:
    # Compose all remaining files when fewer than 32 are left
    blob_initial.compose([blob_initial, *all_blobs[prev_ind:]])

for b in all_blobs:
    # Delete the sharded files
    b.delete()
After all the files have been composed into one file, I rename the blob to the user provided filename. Then I generate a signed URL which gets posted to firebase for the front-end to provide the file for download.
# Rename the file to the user provided filename
bucket.rename_blob(blob_initial, data["filename"])

# Generate signed url to post to firebase
download_url = blob_initial.generate_signed_url(datetime.now() + timedelta(days=10000))
The issue I am encountering occurs because of the random filename used when the files are sharded. I chose a random filename instead of the user-provided filename because multiple users may submit requests with the same (default value) filename, and those edge cases would cause issues with the file sharding.
When I try to download the file, I get the following return:
<Error>
<Code>NoSuchKey</Code>
<Message>The specified key does not exist.</Message>
<Details>No such object: project-id.appspot.com/YMHprgIqMe000000000000.csv</Details>
</Error>
Although I renamed the file, it seems that the download URL is still using the old file name.
Is there a way to inform GCS that the filename has changed when I generate the signed URL?
It appears as though all that was needed was to reload the blob! After rename_blob, the original blob object still references the old name, so a fresh handle for the new name has to be fetched before signing:
bucket.rename_blob(blob_initial, data["filename"])
blob = bucket.get_blob(data["filename"])

# Generate signed url to post to firebase
download_url = blob.generate_signed_url(datetime.now() + timedelta(days=10000))

Read formula in the Google Sheets cells using Python

I am trying to download a Google Sheets document as a Microsoft Excel document using Python. I have been able to accomplish this task using the Python module googleapiclient.
However, the Sheets document may contain some formulas which are not compatible with Microsoft Excel (https://www.dataeverywhere.com/article/27-incompatible-formulas-between-excel-and-google-sheets/).
When I use the application I created on any Google Sheets document that uses any of these formulas, I get a bogus Microsoft Excel document as output.
I would like to read the cell values in the Google Sheets document before downloading it as a Microsoft Excel document, just to prevent any such errors from happening.
The code I have written thus far is attached below:
import sys
import os
from googleapiclient import discovery
from httplib2 import Http
from oauth2client import file, client, tools

SCOPES = "https://www.googleapis.com/auth/drive.readonly"
store = file.Storage("./credentials/credentials.json")
creds = store.get()
if not creds or creds.invalid:
    flow = client.flow_from_clientsecrets("credentials/client_secret.json",
                                          SCOPES)
    creds = tools.run_flow(flow, store)
DRIVE = discovery.build("drive", "v3", http=creds.authorize(Http()))

print("Usage: tmp.py <name of the spreadsheet>")
FILENAME = sys.argv[1]
SRC_MIMETYPE = "application/vnd.google-apps.spreadsheet"
DST_MIMETYPE = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"

files = DRIVE.files().list(
    q='name="%s" and mimeType="%s"' % (FILENAME, SRC_MIMETYPE),
    orderBy="modifiedTime desc,name").execute().get("files", [])

if files:
    fn = '%s.xlsx' % os.path.splitext(files[0]["name"].replace(" ", "_"))[0]
    print('Exporting "%s" as "%s"... ' % (files[0]['name'], fn), end="")
    data = DRIVE.files().export(fileId=files[0]['id'], mimeType=DST_MIMETYPE).execute()
    if data:
        with open(fn, "wb") as f:
            f.write(data)
        print("Done")
    else:
        print("ERROR: Could not download file")
else:
    print("ERROR: File not found")
If you want to use python to export something from Google Docs, then the simplest way is to let Google's own servers do the job for you.
I was doing a little webscraping on Google Sheets, and I made this little program which will do the job for you. You just have to insert the id of the document you want to download.
I put in a temporary id, so anyone can try it out.
import requests

ext = 'xlsx'  # csv, ods, html, tsv and pdf can be used as well
key = '1yEoHh7WL1UNld-cxJh0ZsRmNwf-69uINim2dKrgzsLg'
url = f'https://docs.google.com/spreadsheets/d/{key}/export?format={ext}'
res = requests.get(url)
with open(f'file.{ext}', 'wb') as f:
    f.write(res.content)
That way the conversion will most certainly always be correct, because this is the same as clicking the export button inside the browser version of Google Sheets. Note that this unauthenticated request only works if the spreadsheet is shared publicly.
If you are planning to work with the data inside python, then I recommend using the csv format instead of xlsx and creating the necessary formulas inside python, as sketched below.
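As a minimal sketch of that workflow (using the same temporary, publicly shared document id as above):
import io
import requests
import pandas as pd

key = '1yEoHh7WL1UNld-cxJh0ZsRmNwf-69uINim2dKrgzsLg'
url = f'https://docs.google.com/spreadsheets/d/{key}/export?format=csv'
res = requests.get(url)

# parse the CSV export straight into a dataframe, no file on disk needed
df = pd.read_csv(io.StringIO(res.text))
print(df.head())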
I think the gspread library might be what you are looking for. https://gspread.readthedocs.io/en/latest/
Here's a code sample:
import tenacity
import gspread
from oauth2client.service_account import ServiceAccountCredentials

@tenacity.retry(wait=tenacity.wait_exponential())  # If you exceed the Google API quota, this waits and retries your request
def loadGoogleSheet(spreadsheet_name):
    # use creds to create a client to interact with the Google Drive API
    print("Connecting to Google API...")
    scope = [
        'https://spreadsheets.google.com/feeds',
        'https://www.googleapis.com/auth/drive'
    ]
    creds = ServiceAccountCredentials.from_json_keyfile_name('client_secret.json', scope)
    client = gspread.authorize(creds)
    spreadsheet = client.open(spreadsheet_name)
    return spreadsheet

def readGoogleSheet(spreadsheet):
    sheet = spreadsheet.sheet1  # Might need to loop through sheets or whatever
    val = sheet.cell(1, 1).value  # This just gets the value of the first cell; the docs linked above cover the rest
    return val

test_spreadsheet = loadGoogleSheet('Copy of TLO Summary - Template DO NOT EDIT')
test_output = readGoogleSheet(test_spreadsheet)
print(test_output)
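If what you actually need are the formulas themselves rather than the computed values (for example, to detect Excel-incompatible formulas before exporting), gspread can return those as well; a minimal sketch, reusing the spreadsheet object from above:
def readGoogleSheetFormula(spreadsheet):
    sheet = spreadsheet.sheet1
    # value_render_option='FORMULA' returns the raw formula string
    # (e.g. '=SUM(A2:A10)') instead of the computed value
    return sheet.cell(1, 1, value_render_option='FORMULA').value

print(readGoogleSheetFormula(test_spreadsheet))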
