Google spreadsheet to Pandas dataframe via Pydrive without download - python

How do I read the content of a Google spreadsheet into a Pandas dataframe without downloading the file?
I think gspread or df2gspread may be good shots, but I've been working with pydrive so far and got close to the solution.
With Pydrive I managed to get the export link of my spreadsheet, either as .csv or .xlsx file. After the authentication process, this looks like
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
gauth = GoogleAuth()
gauth.LocalWebserverAuth()
drive = GoogleDrive(gauth)
# choose whether to export 'csv' or 'excel' (xlsx)
data_type = 'csv'
# get list of files in folder as dictionaries
file_list = drive.ListFile({'q': "'my-folder-ID' in parents and trashed=false"}).GetList()
export_key = 'exportLinks'
excel_key = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
csv_key = 'text/csv'
if data_type == 'excel':
    urls = [file[export_key][excel_key] for file in file_list]
elif data_type == 'csv':
    urls = [file[export_key][csv_key] for file in file_list]
The type of url I get for xlsx is
https://docs.google.com/spreadsheets/export?id=my-id&exportFormat=xlsx
and similarly for csv
https://docs.google.com/spreadsheets/export?id=my-id&exportFormat=csv
Now, if I click on these links (or visit them with webbrowser.open(url)), the file is downloaded, and I can then read it into a Pandas dataframe with pandas.read_excel() or pandas.read_csv(), as described here.
How can I skip the download, and directly read the file into a dataframe from these links?
I tried several solutions:
The obvious pd.read_csv(url) gives
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 2
Interestingly, these numbers (1, 6, 2) do not depend on the number of rows and columns in my spreadsheet, which suggests that pandas is parsing something other than the spreadsheet data.
The analogue pd.read_excel(url) gives
ValueError: Excel file format cannot be determined, you must specify an engine manually.
and specifying e.g. engine = 'openpyxl' gives
zipfile.BadZipFile: File is not a zip file
BytesIO solution looked promising, but
r = requests.get(url)
data = r.content
df = pd.read_csv(BytesIO(data))
still gives
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 2
If I print(data) I get hundreds of lines of html code
b'\n<!DOCTYPE html>\n<html lang="de">\n <head>\n <meta charset="utf-8">\n <meta content="width=300, initial-scale=1" name="viewport">\n
...
...
</script>\n </body>\n</html>\n'

In your situation, how about the following modification? In this case, by retrieving the access token from gauth, the Spreadsheet is exported as XLSX data, and the XLSX data is put into the dataframe.
Modified script:
gauth = GoogleAuth()
gauth.LocalWebserverAuth()
url = "https://docs.google.com/spreadsheets/export?id={spreadsheetId}&exportFormat=xlsx"
res = requests.get(url, headers={"Authorization": "Bearer " + gauth.attr['credentials'].access_token})
values = pd.read_excel(BytesIO(res.content))
print(values)
In this script, please add import requests, import pandas as pd, and from io import BytesIO.
In this case, the 1st tab of XLSX data is used.
When you want to use the other tab, please modify values = pd.read_excel(BytesIO(res.content)) as follows.
sheet = "Sheet2"
values = pd.read_excel(BytesIO(res.content), sheet_name=sheet)
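The unauthenticated attempts in the question return an HTML page instead of the spreadsheet, which is why the parsers fail; the Authorization header is what changes that. As a minimal sketch (the helper names build_export_url, bearer_header, and looks_like_html are illustrative and not part of PyDrive or requests), the URL and header can be built, and the payload sanity-checked, before handing it to pandas:

```python
from urllib.parse import urlencode

EXPORT_ENDPOINT = "https://docs.google.com/spreadsheets/export"


def build_export_url(spreadsheet_id, export_format="xlsx"):
    """Return the export URL for a spreadsheet in the given format."""
    params = {"id": spreadsheet_id, "exportFormat": export_format}
    return EXPORT_ENDPOINT + "?" + urlencode(params)


def bearer_header(access_token):
    """Build the Authorization header used in the modified script."""
    return {"Authorization": "Bearer " + access_token}


def looks_like_html(payload):
    """Heuristic: True if the bytes look like an HTML page, not CSV/XLSX data."""
    head = payload.lstrip()[:15].lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")
```

With these, requests.get(build_export_url(spreadsheet_id), headers=bearer_header(token)) corresponds to the request in the modified script, and looks_like_html(res.content) would flag an HTML response before pandas raises a parser error.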

I want to contribute an additional option to @Tanaike's excellent answer. It is indeed quite difficult to get an Excel file (an .xlsx stored on Drive, not a Google Sheet) into a Python environment without publishing the content to the web. Whereas the previous answer uses pydrive and GoogleAuth(), I usually use a different method of authentication in Colab/Jupyter notebooks, adapted from the googleapis documentation. In my environment, using BytesIO(response.content) is unnecessary.
import pandas as pd
from oauth2client.client import GoogleCredentials
from google.colab import auth
auth.authenticate_user()
from google.auth.transport.requests import AuthorizedSession
from google.auth import default
creds, _ = default()
id = 'aaaaaaaaaaaaaaaaaaaaaaaaaaa'
sheet = 'Sheet12345'
url = f'https://docs.google.com/spreadsheets/export?id={id}&exportFormat=xlsx'
authed_session = AuthorizedSession(creds)
response = authed_session.get(url)
values = pd.read_excel(response.content, sheet_name=sheet)

Related

Downloading all tabs of a spreadsheet Google Drive API

I'm trying to download the full content of a spreadsheet using google Drive. Currently, my code is exporting and then writing to a file the content from the first tab from the given spreadsheet only. How can I make it download the full content of the file?
This is the function that I'm currently using:
def download_file(real_file_id, service):
    try:
        file_id = real_file_id
        request = service.files().export_media(fileId=file_id,
                                               mimeType='text/csv')
        file = io.BytesIO()
        downloader = MediaIoBaseDownload(file, request)
        done = False
        while done is False:
            status, done = downloader.next_chunk()
            print(F'Download {int(status.progress() * 100)}.')
    except HttpError as error:
        print(F'An error occurred: {error}')
        file = None
    file_object = open('test.csv', 'a')
    file_object.write(file.getvalue().decode("utf-8"))
    file_object.close()
    return file.getvalue()
I call the function at a later stage in my code by passing the already initialised google drive service and the file id
download_file(real_file_id='XXXXXXXXXXXXXXXXXXXXX', service=service)
I believe your goal is as follows.
You want to download all sheets in a Google Spreadsheet as CSV data.
You want to achieve this using googleapis for python.
In this case, how about the following sample script? In this case, in order to retrieve the sheet names of each sheet in Google Spreadsheet, Sheets API is used. Using Sheets API, the sheet IDs of all sheets are retrieved. Using these sheet Ids, all sheets are downloaded as CSV data.
Sample script:
From the script you showed, I guessed that service might be service = build("drive", "v3", credentials=creds). If my understanding is correct, please use creds to retrieve the access token.
spreadsheetId = "###" # Please set the Spreadsheet ID.
sheets = build("sheets", "v4", credentials=creds)
sheetObj = sheets.spreadsheets().get(spreadsheetId=spreadsheetId, fields="sheets(properties(sheetId,title))").execute()
accessToken = creds.token
for s in sheetObj.get("sheets", []):
    p = s["properties"]
    sheetName = p["title"]
    print("Download: " + sheetName)
    url = "https://docs.google.com/spreadsheets/export?id=" + spreadsheetId + "&exportFormat=csv&gid=" + str(p["sheetId"])
    res = requests.get(url, headers={"Authorization": "Bearer " + accessToken})
    with open(sheetName + ".csv", mode="wb") as f:
        f.write(res.content)
In this case, please add import requests.
When this script is run, all sheets in a Google Spreadsheet are downloaded as CSV data. The filename of each CSV file uses the tab name in Google Spreadsheet.
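Because the tab name is used verbatim as the filename, a title containing a path separator such as / could make open() fail or write to an unexpected location. A minimal sketch of a guard follows; the helper safe_csv_filename and its replacement rule are assumptions, not part of the script above:

```python
import re


def safe_csv_filename(sheet_title):
    """Map a tab title to a filename, replacing characters that are unsafe in paths."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", sheet_title).strip()
    # Fall back to a generic name if nothing usable is left.
    return (cleaned or "sheet") + ".csv"
```

In the loop above, open(safe_csv_filename(sheetName), mode="wb") could then replace open(sheetName + ".csv", mode="wb").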
In this case, please add a scope of "https://www.googleapis.com/auth/spreadsheets.readonly" as follows. And, please reauthorize the scopes. Please be careful about this.
SCOPES = [
"https://www.googleapis.com/auth/drive.readonly", # Please use this for your actual situation.
"https://www.googleapis.com/auth/spreadsheets.readonly",
]
Reference:
Method: spreadsheets.get
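To make the shape of the spreadsheets.get response concrete, here is a small sketch that turns such a response into per-tab export URLs without touching the network. The function name sheet_export_urls and the sample values are made up for illustration; the URL pattern is the one used in the sample script.

```python
def sheet_export_urls(spreadsheet_id, sheet_obj):
    """Map each tab title to its CSV export URL, using the sheetId as gid."""
    urls = {}
    for s in sheet_obj.get("sheets", []):
        p = s["properties"]
        urls[p["title"]] = (
            "https://docs.google.com/spreadsheets/export?id=" + spreadsheet_id
            + "&exportFormat=csv&gid=" + str(p["sheetId"])
        )
    return urls


# Hand-written sample in the shape returned for
# fields="sheets(properties(sheetId,title))"; IDs and titles are made up.
sample = {
    "sheets": [
        {"properties": {"sheetId": 0, "title": "Sheet1"}},
        {"properties": {"sheetId": 123456, "title": "Summary"}},
    ]
}
urls = sheet_export_urls("SPREADSHEET_ID", sample)
```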
Tanaike's answer is easier and more straightforward, but I already spent some time on this so I might as well post it as an alternative.
The problem you originally encountered is that CSV files do not support multiple tabs/sheets, so Drive's files.export will only export the first sheet, and it doesn't have a way to select specific sheets.
Another way you can approach this is to use the Sheets API copyTo() method to create temp files for each sheet and export those as single CSV files.
# need a service for sheets and one for drive
sheetservice = build('sheets', 'v4', credentials=creds)
driveservice = build('drive', 'v3', credentials=creds)
spreadsheet = sheetservice.spreadsheets()
result = spreadsheet.get(spreadsheetId=YOUR_SPREADSHEET).execute()
sheets = result.get('sheets', []) # the list of sheets within your spreadsheet
# standard metadata to create the blank spreadsheet files
file_metadata = {
    "name": "temp",
    "mimeType": "application/vnd.google-apps.spreadsheet"
}
for sheet in sheets:
    # create a blank spreadsheet and get its ID
    tempfile = driveservice.files().create(body=file_metadata).execute()
    tempid = tempfile.get('id')
    # copy the sheet to the new file
    sheetservice.spreadsheets().sheets().copyTo(spreadsheetId=YOUR_SPREADSHEET, sheetId=sheet['properties']['sheetId'], body={"destinationSpreadsheetId": tempid}).execute()
    # need to delete the first sheet since the copy gets added as second ("requests" is a list of request objects)
    sheetservice.spreadsheets().batchUpdate(spreadsheetId=tempid, body={"requests": [{"deleteSheet": {"sheetId": 0}}]}).execute()
    download_file(tempid, driveservice) # runs your original method to download the file
    driveservice.files().delete(fileId=tempid).execute() # to clean up the temp file
You'll also need the https://www.googleapis.com/auth/spreadsheets and https://www.googleapis.com/auth/drive scopes. This involves more API calls so I just recommend Tanaike's method, but I hope it gives you an idea of ways that you can play with the API to suit your needs.

Read formula in the Google Sheets cells using Python

I am trying to download a Google Sheets document as a Microsoft Excel document using Python. I have been able to accomplish this task using the Python module googleapiclient.
However, the Sheets document may contain some formulas which are not compatible with Microsoft Excel (https://www.dataeverywhere.com/article/27-incompatible-formulas-between-excel-and-google-sheets/).
When I use the application I created on any Google Sheets document that used any of these formulas anywhere, I get a bogus Microsoft Excel document as output.
I would like to read the cell values in the Google Sheets document before downloading it as a Microsoft Excel document, just to prevent any such errors from happening.
The code I have written thus far is attached below:
import sys
import os
from googleapiclient import discovery
from httplib2 import Http
from oauth2client import file, client, tools
SCOPES = "https://www.googleapis.com/auth/drive.readonly"
store = file.Storage("./credentials/credentials.json")
creds = store.get()
if not creds or creds.invalid:
    flow = client.flow_from_clientsecrets("credentials/client_secret.json",
                                          SCOPES)
    creds = tools.run_flow(flow, store)
DRIVE = discovery.build("drive", "v3", http = creds.authorize(Http()))
print("Usage: tmp.py <name of the spreadsheet>")
FILENAME = sys.argv[1]
SRC_MIMETYPE = "application/vnd.google-apps.spreadsheet"
DST_MIMETYPE = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
files = DRIVE.files().list(
    q = 'name="%s" and mimeType="%s"' % (FILENAME, SRC_MIMETYPE),
    orderBy = "modifiedTime desc,name").execute().get("files", [])
if files:
    fn = '%s.xlsx' % os.path.splitext(files[0]["name"].replace(" ", "_"))[0]
    print('Exporting "%s" as "%s"... ' % (files[0]['name'], fn), end = "")
    data = DRIVE.files().export(fileId=files[0]['id'], mimeType=DST_MIMETYPE).execute()
    if data:
        with open(fn, "wb") as f:
            f.write(data)
        print("Done")
    else:
        print("ERROR: Could not download file")
else:
    print("ERROR: File not found")
If you want to use Python to export something from Google Docs, then the simplest way is to let Google's own servers do the job for you.
I was doing a little webscraping on google sheets, and I made this little program which will do the job for you. You just have to insert the id of the document you want to download.
I put in a temporary id, so anyone can try it out.
import requests
ext = 'xlsx' #csv, ods, html, tsv and pdf can be used as well
key = '1yEoHh7WL1UNld-cxJh0ZsRmNwf-69uINim2dKrgzsLg'
url = f'https://docs.google.com/spreadsheets/d/{key}/export?format={ext}'
res = requests.get(url)
with open(f'file.{ext}', 'wb') as f:
    f.write(res.content)
That way the conversion should match what you get in the browser, because this is the same as clicking the export button inside the browser version of Google Sheets.
If you are planning to work with the data inside python, then I recommend using csv format instead of xlsx, and then create the necessary formulas inside python.
I think the gspread library might be what you are looking for. https://gspread.readthedocs.io/en/latest/
Here's a code sample:
import tenacity
import gspread
from oauth2client.service_account import ServiceAccountCredentials
@tenacity.retry(wait=tenacity.wait_exponential()) # If you exceed the Google API quota, this waits to retry your request
def loadGoogleSheet(spreadsheet_name):
    # use creds to create a client to interact with the Google Drive API
    print("Connecting to Google API...")
    scope = [
        'https://spreadsheets.google.com/feeds',
        'https://www.googleapis.com/auth/drive'
    ]
    creds = ServiceAccountCredentials.from_json_keyfile_name('client_secret.json', scope)
    client = gspread.authorize(creds)
    spreadsheet = client.open(spreadsheet_name)
    return spreadsheet
def readGoogleSheet(spreadsheet):
    sheet = spreadsheet.sheet1 # Might need to loop through sheets or whatever
    val = sheet.cell(1, 1).value # This just gets the value of the first cell. The docs I linked to above are pretty helpful on all the other stuff you can do
    return val
test_spreadsheet = loadGoogleSheet('Copy of TLO Summary - Template DO NOT EDIT')
test_output = readGoogleSheet(test_spreadsheet)
print(test_output)
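None of the snippets above inspects the formulas themselves. The Sheets API method spreadsheets.values.get accepts valueRenderOption="FORMULA", which returns the formula text of each cell rather than its computed value. Assuming the cells have been fetched that way into a list of rows, a minimal sketch of the pre-export check could look like this (the function name and the set of blocked functions are illustrative assumptions, not an official compatibility list):

```python
import re

# Assumed, non-exhaustive examples of functions that exist in Google Sheets
# but not in Excel; extend this set for your own documents.
SHEETS_ONLY_FUNCTIONS = {"GOOGLEFINANCE", "IMPORTRANGE", "QUERY", "ARRAYFORMULA"}


def find_incompatible_formulas(rows, blocked=SHEETS_ONLY_FUNCTIONS):
    """Return (row, col, formula) for cells whose formula calls a blocked function.

    rows is a list of lists of cell contents; row and col are 1-based.
    """
    hits = []
    for r, row in enumerate(rows, start=1):
        for c, cell in enumerate(row, start=1):
            if not (isinstance(cell, str) and cell.startswith("=")):
                continue  # plain value, not a formula
            # Collect every identifier that is immediately followed by "(".
            called = {name.upper() for name in re.findall(r"([A-Za-z_.]+)\s*\(", cell)}
            if called & set(blocked):
                hits.append((r, c, cell))
    return hits


# Made-up sample rows in the shape of a values.get "values" array.
rows = [
    ["Ticker", "Price"],
    ["GOOG", '=GOOGLEFINANCE("GOOG")'],
    ["Total", "=SUM(B2:B2)"],
]
hits = find_incompatible_formulas(rows)
```

If hits is non-empty, the export could be skipped or the offending cells reported before the .xlsx file is written.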

Downloaded Share Point Excel Not Opening with Open

I am re-framing an existing question for simplicity. I have the following code to download Excel files from a company Share Point site.
import requests
import pandas as pd
def download_file(url):
    filename = url.split('/')[-1]
    r = requests.get(url)
    with open(filename, 'wb') as output_file:
        output_file.write(r.content)
df = pd.read_excel(r'O:\Procurement Planning\QA\VSAF_test_macro.xlsm')
df['Name'] = 'share_point_file_path_documentName' #i'm appending the sp file path to the document name
file = df['Name'] #I only need the file path column, I don't need the rest of the dataframe
# for loop for download
for url in file:
    download_file(url)
The downloads happen and I don't get any errors in Python, however when I try to open them I get an error from Excel saying Excel cannot open the file because the file format or extension is not valid. If I print the link in Jupyter Notebooks it does open correctly, the issue appears to be with the download.
Check r.status_code. This must be 200 or you have the wrong url or no permission.
Open the downloaded file in a text editor. It might be a HTML file (Office Online)
If the URL contains a web=1 query parameter, remove it or replace it by web=0.
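The three checks above can be partly automated. The following is a rough sketch using only the standard library; the helper names sniff_download and drop_web_param are made up for illustration. It relies on the fact that .xlsx/.xlsm files are zip archives and legacy .xls files are OLE containers, so their first bytes are recognizable.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

ZIP_MAGIC = b"PK\x03\x04"                        # .xlsx/.xlsm are zip archives
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy .xls container


def sniff_download(content):
    """Classify downloaded bytes as 'zip-excel', 'legacy-excel', 'html', or 'unknown'."""
    if content.startswith(ZIP_MAGIC):
        return "zip-excel"
    if content.startswith(OLE_MAGIC):
        return "legacy-excel"
    if content.lstrip()[:1] == b"<":
        return "html"
    return "unknown"


def drop_web_param(url):
    """Remove a web=... query parameter, keeping the other parameters."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "web"]
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Calling sniff_download(r.content) right after requests.get(url) would show whether the server returned a workbook or an HTML page before anything is written to disk.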

How to import data into google colab from google drive?

I have some data files uploaded on my google drive.
I want to import those files into google colab.
The REST API method and PyDrive method show how to create a new file and upload it on drive and colab. Using that, I am unable to figure out how to read the data files already present on my drive in my python code.
I am a total newbie to this. Can someone help me out?
(Update April 15 2018: The gspread is frequently being updated, so to ensure stable workflow I specify the version)
For spreadsheet file, the basic idea is using packages gspread and pandas to read spreadsheets in Drive and convert them to pandas dataframe format.
In the Colab notebook:
#install packages
!pip install gspread==2.1.1
!pip install gspread-dataframe==2.1.0
!pip install pandas==0.22.0
#import packages and authorize connection to Google account:
import pandas as pd
import gspread
from gspread_dataframe import get_as_dataframe, set_with_dataframe
from google.colab import auth
auth.authenticate_user() # verify your account to read files which you have access to. Make sure you have permission to read the file!
from oauth2client.client import GoogleCredentials
gc = gspread.authorize(GoogleCredentials.get_application_default())
Then I know 3 ways to read Google spreadsheets.
By file name:
spreadsheet = gc.open("goal.csv") # Open file using its name. Use this if the file is already anywhere in your drive
sheet = spreadsheet.get_worksheet(0) # 0 means the first sheet in the file
df2 = pd.DataFrame(sheet.get_all_records())
df2.head()
By url:
spreadsheet = gc.open_by_url('https://docs.google.com/spreadsheets/d/1LCCzsUTqBEq5pemRNA9EGy62aaeIgye4XxwReYg1Pe4/edit#gid=509368585') # use this when you have the complete url (the gid in the url identifies a worksheet tab)
sheet = spreadsheet.get_worksheet(0) # 0 means the first sheet in the file
df2 = pd.DataFrame(sheet.get_all_records())
df2.head()
By file key/ID:
spreadsheet = gc.open_by_key('1vpukIbGZfK1IhCLFalBI3JT3aobySanJysv0k5A4oMg') # use this when you have the key (the string in the url following spreadsheet/d/)
sheet = spreadsheet.get_worksheet(0) # 0 means the first sheet in the file
df2 = pd.DataFrame(sheet.get_all_records())
df2.head()
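The URL and key variants are closely related, since the key is just the path segment after spreadsheets/d/. As a small sketch (parse_sheet_url is a hypothetical helper, not a gspread function), the key and the optional gid can be pulled out of a full URL like this:

```python
import re


def parse_sheet_url(url):
    """Return (key, gid) from a Google Sheets URL; gid is None if absent."""
    key_match = re.search(r"/spreadsheets/d/([a-zA-Z0-9_-]+)", url)
    if key_match is None:
        raise ValueError("not a Google Sheets URL: " + url)
    gid_match = re.search(r"[#&?]gid=(\d+)", url)
    gid = int(gid_match.group(1)) if gid_match else None
    return key_match.group(1), gid


key, gid = parse_sheet_url(
    "https://docs.google.com/spreadsheets/d/"
    "1LCCzsUTqBEq5pemRNA9EGy62aaeIgye4XxwReYg1Pe4/edit#gid=509368585"
)
```

The resulting key can be passed to gc.open_by_key(key) as in the third example.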
I shared the code above in a Colab notebook:
https://drive.google.com/file/d/1cvur-jpIpoEN3vAO8Fd_yVAT5Qgbr4GV/view?usp=sharing
Source: https://github.com/burnash/gspread
1) Set your data to be publicly available; then, for public spreadsheets:
from io import StringIO # in Python 2 this was: from StringIO import StringIO
import pandas as pd
import requests
r = requests.get('https://docs.google.com/spreadsheet/ccc?key=0Ak1ecr7i0wotdGJmTURJRnZLYlV3M2daNTRubTdwTXc&output=csv')
data = r.text # decoded text, since StringIO expects str in Python 3
df = pd.read_csv(StringIO(data), index_col=0, parse_dates=['Quradate'])
df.head()
More here: Getting Google Spreadsheet CSV into A Pandas Dataframe
If the data is private, the approach is much the same, but you will have to do some auth gymnastics...
From Google Colab snippets
from google.colab import auth
auth.authenticate_user()
import gspread
from oauth2client.client import GoogleCredentials
gc = gspread.authorize(GoogleCredentials.get_application_default())
worksheet = gc.open('Your spreadsheet name').sheet1
# get_all_values gives a list of rows.
rows = worksheet.get_all_values()
print(rows)
# Convert to a DataFrame and render.
import pandas as pd
pd.DataFrame.from_records(rows)
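Since get_all_values returns every row, including the header, as a list of strings, passing the rows straight to pandas leaves the column names in the first data row. A minimal sketch of separating the two first (split_header is a made-up helper name and the sample rows are invented):

```python
def split_header(rows):
    """Split a list of rows into (header, data_rows); empty input gives ([], [])."""
    if not rows:
        return [], []
    return rows[0], rows[1:]


# Made-up sample in the list-of-rows shape described above.
rows = [["name", "score"], ["ada", "90"], ["bob", "85"]]
header, data = split_header(rows)
```

The pieces can then be combined with pd.DataFrame(data, columns=header); the cell values remain strings until converted.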

How to download a Excel file from behind a paywall into a pandas dataframe?

I have this website that requires log in to access data.
import pandas as pd
import requests
r = requests.get(my_url, cookies=my_cookies) # my_cookies are imported from a selenium session.
df = pd.io.excel.read_excel(r.content, sheetname=0)
Response:
IOError: [Errno 2] No such file or directory: 'Ticker\tAction\tName\tShares\tPrice\...
Apparently, the str is processed as a filename. Is there a way to process it as a file? Alternatively can we pass cookies to pd.get_html?
EDIT: After further processing we can now see that this is actually a csv file. The content of the downloaded file is:
In [201]: r.content
Out [201]: 'Ticker\tAction\tName\tShares\tPrice\tCommission\tAmount\tTarget Weight\nBRSS\tSELL\tGlobal Brass and Copper Holdings Inc\t400.0\t17.85\t-1.00\t7,140\t0.00\nCOHU\tSELL\tCohu Inc\t700.0\t12.79\t-1.00\t8,953\t0.00\nUNTD\tBUY\tUnited Online Inc\t560.0\t15.15\t-1.00\t-8,484\t0.00\nFLXS\tBUY\tFlexsteel Industries Inc\t210.0\t40.31\t-1.00\t-8,465\t0.00\nUPRO\tCOVER\tProShares UltraPro S&P500\t17.0\t71.02\t-0.00\t-1,207\t0.00\n'
Notice that it is tab delimited. Still, trying:
# csv version 1
df = pd.read_csv(r.content)
# Returns error, file does not exist. Apparently read_csv() is also trying to read it as a file.
# csv version 2
fh = io.BytesIO(r.content)
df = pd.read_csv(fh) # ValueError: No columns to parse from file.
# csv version 3
s = StringIO(r.content)
df = pd.read_csv(s)
# No error, but the resulting df is not parsed properly; \t's show up in the text of the dataframe.
Simply wrap the file contents in a BytesIO:
with io.BytesIO(r.content) as fh:
    df = pd.io.excel.read_excel(fh, sheetname=0) # in pandas 0.21 and later the parameter is named sheet_name
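The EDIT in the question shows that the payload is tab-delimited text rather than an Excel workbook, so a delimiter-aware parser is the better fit for that data. Here is a minimal sketch with the standard library, written for Python 3; the function name and the UTF-8 encoding are assumptions:

```python
import csv
import io


def parse_tab_delimited(content, encoding="utf-8"):
    """Parse tab-delimited bytes into a list of dicts keyed by the header row."""
    text = content.decode(encoding)
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


# First two lines of the payload shown in the question.
sample = (
    b"Ticker\tAction\tName\tShares\tPrice\tCommission\tAmount\tTarget Weight\n"
    b"BRSS\tSELL\tGlobal Brass and Copper Holdings Inc\t400.0\t17.85\t-1.00\t7,140\t0.00\n"
)
records = parse_tab_delimited(sample)
```

With pandas, the corresponding call would be pd.read_csv(io.BytesIO(r.content), sep="\t", thousands=","), where thousands="," handles values such as 7,140.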
This functionality was included in an update from 2014. According to the documentation it is as simple as providing the url:
The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/workbook.xlsx
Based on the code you've provided, it looks like you are using pandas 0.13.x? If you can upgrade to a newer version (code below is tested with 0.16.x) you can get this to work without the additional utilization of the requests library. This was added in 0.14.1
data2 = pd.read_excel(data_url)
As an example of a full script (with the example XLS document taken from the original bug report stating the read_excel didn't accept a URL):
import pandas as pd
data_url = "http://www.eia.gov/dnav/pet/xls/PET_PRI_ALLMG_A_EPM0_PTC_DPGAL_M.xls"
data = pd.read_excel(data_url, "Data 1", skiprows=2)
