Google Colab: Read gsheet file from Google Drive - python

I am trying to read a gsheet file in Google Drive using Google Colab. I tried using drive.mount to get the file, but I don't know how to get a dataframe with pandas from there. Here is what I tried to do:
from google.colab import auth
auth.authenticate_user()
import gspread
from oauth2client.client import GoogleCredentials
import os
import pandas as pd
from google.colab import drive
# setup
gc = gspread.authorize(GoogleCredentials.get_application_default())
drive.mount('/content/drive',force_remount=True)
# read data and put it in a dataframe
gsheets = gc.open_by_url('/content/drive/MyDrive/test/myGoogleSheet.gsheet')
As you can tell, I am quite lost with the libraries. I want to use the drive library to access the drive, get the content with gspread, and read it with pandas.
Can anyone help me find a solution, please?

I have found a solution to my problem by looking further into the gspread library. I was able to load the gsheet file by id or by url, which I did not know was possible. Then I managed to get the content of a sheet and read it as a pandas dataframe. Here is the code:
from google.colab import auth
auth.authenticate_user()
import gspread
import pandas as pd
from oauth2client.client import GoogleCredentials
# setup
gc = gspread.authorize(GoogleCredentials.get_application_default())
# read data and put it in a dataframe
# spreadsheet = gc.open_by_url('https://docs.google.com/spreadsheets/d/google_sheet_id/edit#gid=0')
spreadsheet = gc.open_by_key('google_sheet_id')
wks = spreadsheet.worksheet('sheet_name')
data = wks.get_all_values()
headers = data.pop(0)
df = pd.DataFrame(data, columns=headers)
print(df)
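As a side note, gspread can also build the headers for you: get_all_records() returns a list of dicts keyed by the first row, so the manual header handling above becomes unnecessary. A minimal sketch, assuming the same authorized client and wks worksheet as above:
# get_all_records() uses the first row as keys, so no manual header handling is needed.
records = wks.get_all_records()
df = pd.DataFrame(records)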

Related

Export data from a Google Sheet to PDF in Python

I am getting all the data present in a Google Sheet using the code below.
I want to write all of this data to a PDF file and download it.
import gspread
import sys
print(sys.path)
import os
#sys.path.append('/usr/lib/python3/dist-packages')
from oauth2client.service_account import ServiceAccountCredentials

scope = [
    'https://www.googleapis.com/auth/spreadsheets',
    'https://www.googleapis.com/auth/drive'
]
path = os.path.abspath('cred.json')
credentials = ServiceAccountCredentials.from_json_keyfile_name('cred.json', scope)
client = gspread.authorize(credentials)
sheet = client.open('xyz').sheet1
data = sheet.get_all_records()
print(data)
I believe your goal is as follows:
You want to export the Google Spreadsheet xyz as a PDF file using gspread with Python and a service account.
Modification points:
Unfortunately, it seems that at the current stage a Spreadsheet cannot be directly exported as a PDF file using gspread. So in this case, the requests library and the endpoint for exporting a Spreadsheet to PDF are used.
When these points are reflected in your script, it becomes as follows.
Modified script:
import gspread
import sys
print(sys.path)
import os
import requests  # needed for the export request below
#sys.path.append('/usr/lib/python3/dist-packages')
from oauth2client.service_account import ServiceAccountCredentials

scope = [
    'https://www.googleapis.com/auth/spreadsheets',
    'https://www.googleapis.com/auth/drive'
]
path = os.path.abspath('cred.json')
creds = ServiceAccountCredentials.from_json_keyfile_name('cred.json', scope)
client = gspread.authorize(creds)

# I added the script below.
spreadsheet_name = 'xyz'
spreadsheet = client.open(spreadsheet_name)
url = 'https://docs.google.com/spreadsheets/export?format=pdf&id=' + spreadsheet.id
headers = {'Authorization': 'Bearer ' + creds.create_delegated("").get_access_token().access_token}
res = requests.get(url, headers=headers)
with open(spreadsheet_name + ".pdf", 'wb') as f:
    f.write(res.content)
Note:
In this modified script, it is supposed that you have already been able to get values from the Google Spreadsheet using the Sheets API. Please be careful about this.
If an error related to the Drive API occurs, please enable the Drive API in the API console.
If an error related to the service account occurs, please modify create_delegated("") to create_delegated("email of the service account").
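If you only need a single tab rather than the whole spreadsheet, the same export endpoint also accepts a gid parameter. A minimal sketch, assuming the spreadsheet, headers, and requests objects from the modified script above ('Sheet1' is a placeholder worksheet name):
# Export a single worksheet by passing its gid (worksheet.id in gspread).
worksheet = spreadsheet.worksheet('Sheet1')
url = 'https://docs.google.com/spreadsheets/export?format=pdf&id=' + spreadsheet.id + '&gid=' + str(worksheet.id)
res = requests.get(url, headers=headers)
with open('Sheet1.pdf', 'wb') as f:
    f.write(res.content)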

Export dataframe as csv file from google colab to google drive

I want to upload a dataframe as a csv from Colab to Google Drive. I tried a lot but no luck. I can upload a simple text file but failed to upload a csv.
I tried the following code:
import pandas as pd
from google.colab import auth
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from oauth2client.client import GoogleCredentials

df = pd.DataFrame({1: [1, 2, 3]})
df.to_csv('abc', sep='\t')
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
uploaded = drive.CreateFile({'title': 'sample.csv', 'mimeType': 'csv'})
uploaded.SetContentFile('abc')
uploaded.Upload()
It may be easier to use mounting instead of pydrive.
from google.colab import drive
drive.mount('drive')
After authentication, you can copy your csv file.
df.to_csv('data.csv')
!cp data.csv "drive/My Drive/"
Without using the !cp command:
from google.colab import drive
# Mount Google Drive to the Colab notebook.
drive.mount('/drive')
# Make sure the folder name exists in Google Drive before uploading.
df.to_csv('/drive/My Drive/folder_name/name_csv_file.csv')
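If you'd rather not create the folder by hand, you can create it from code once the drive is mounted; a small sketch, assuming the '/drive' mount point above (folder_name is a placeholder):
import os
# Create the target folder on the mounted drive if it does not already exist.
os.makedirs('/drive/My Drive/folder_name', exist_ok=True)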
# Import Drive API and authenticate.
from google.colab import drive
# Mount your Drive to the Colab VM.
drive.mount('/gdrive')
# Write the DataFrame to CSV file.
with open('/gdrive/My Drive/foo.csv', 'w') as f:
    df.to_csv(f)
The other answers almost worked but I needed a small tweak:
from google.colab import drive
drive.mount('drive')
df.to_csv('/content/drive/My Drive/filename.csv', encoding='utf-8', index=False)
The /content/ bit proved necessary.
If you want to save locally then you can use this:
df.to_csv('sample.csv')
from google.colab import files
files.download("sample.csv")
If you don't want to work with pandas, you can write a Spark dataframe directly:
df_mini.coalesce(1)\
    .write\
    .format('com.databricks.spark.csv')\
    .options(header='true', delimiter='\t')\
    .save('gdrive/My Drive/base_mini.csv')

python gspread - How to get a spreadsheet URL path after I create it?

I'm trying to create a new spreadsheet using the gspread python package, then get its URL path (inside Google Drive) and send it to other people so they can open it as well.
I tried to find an answer here and here, with no luck.
I created a brand new Spreadsheet:
import gspread
import pandas as pd
from gspread_dataframe import get_as_dataframe, set_with_dataframe

gc = gspread_connect()  # my own helper that returns an authorized gspread client
spreadsheet = gc.create('TESTING SHEET')
Then I shared it with my account:
spreadsheet.share('my_user#my_company.com', perm_type='user', role='writer')
Then I wrote some random data into it:
worksheet = gc.open('TESTING SHEET').sheet1
df = pd.DataFrame.from_records([{'a': i, 'b': i * 2} for i in range(100)])
set_with_dataframe(worksheet, df)
Now when I go to my Google Drive I can find this sheet by searching for its name ("TESTING SHEET"), but I didn't figure out how to get its URL path in my Python code so I could pass it right away to other people.
Thanks!
You can generate the URL by using Spreadsheet.id. Here's an example that uses the spreadsheet variable from your code:
spreadsheet_url = "https://docs.google.com/spreadsheets/d/%s" % spreadsheet.id
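Depending on your gspread version, the Spreadsheet object may also expose the URL directly, so you don't have to build the string yourself; a sketch, assuming a version that provides the url property:
# Newer gspread versions expose the spreadsheet URL as a property.
print(spreadsheet.url)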

How to import data into google colab from google drive?

I have some data files uploaded on my google drive.
I want to import those files into google colab.
The REST API and PyDrive methods show how to create a new file and upload it to Drive from Colab. Using those, I am unable to figure out how to read the data files already present on my Drive in my Python code.
I am a total newbie to this. Can someone help me out?
(Update April 15, 2018: gspread is updated frequently, so to ensure a stable workflow I pin the package versions.)
For spreadsheet files, the basic idea is to use the gspread and pandas packages to read spreadsheets in Drive and convert them to pandas dataframe format.
In the Colab notebook:
#install packages
!pip install gspread==2.1.1
!pip install gspread-dataframe==2.1.0
!pip install pandas==0.22.0
#import packages and authorize connection to Google account:
import pandas as pd
import gspread
from gspread_dataframe import get_as_dataframe, set_with_dataframe
from google.colab import auth
auth.authenticate_user() # verify your account to read files which you have access to. Make sure you have permission to read the file!
from oauth2client.client import GoogleCredentials
gc = gspread.authorize(GoogleCredentials.get_application_default())
Then there are 3 ways I know of to read Google spreadsheets.
By file name:
spreadsheet = gc.open("goal.csv") # Open file using its name. Use this if the file is already anywhere in your drive
sheet = spreadsheet.get_worksheet(0) # 0 means the first sheet in the file
df2 = pd.DataFrame(sheet.get_all_records())
df2.head()
By url:
spreadsheet = gc.open_by_url('https://docs.google.com/spreadsheets/d/1LCCzsUTqBEq5pemRNA9EGy62aaeIgye4XxwReYg1Pe4/edit#gid=509368585') # use this when you have the complete url (edit#gid=... points at a specific sheet tab)
sheet = spreadsheet.get_worksheet(0) # 0 means the first sheet in the file
df2 = pd.DataFrame(sheet.get_all_records())
df2.head()
By file key/ID:
spreadsheet = gc.open_by_key('1vpukIbGZfK1IhCLFalBI3JT3aobySanJysv0k5A4oMg') # use this when you have the key (the string in the url following spreadsheet/d/)
sheet = spreadsheet.get_worksheet(0) # 0 means the first sheet in the file
df2 = pd.DataFrame(sheet.get_all_records())
df2.head()
I shared the code above in a Colab notebook:
https://drive.google.com/file/d/1cvur-jpIpoEN3vAO8Fd_yVAT5Qgbr4GV/view?usp=sharing
Source: https://github.com/burnash/gspread
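As a side note, since gspread-dataframe is installed and imported above anyway, there is also a one-line conversion; a sketch, assuming the sheet variable from any of the three snippets (by default it reads the full grid, so you may see trailing NaN rows or columns):
# gspread-dataframe converts a worksheet straight to a pandas dataframe.
df2 = get_as_dataframe(sheet)
df2.head()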
1) Set your data to be publicly available. Then, for public spreadsheets:
from io import StringIO  # in Python 2 this was: from StringIO import StringIO
import requests
import pandas as pd

r = requests.get('https://docs.google.com/spreadsheet/ccc?key=0Ak1ecr7i0wotdGJmTURJRnZLYlV3M2daNTRubTdwTXc&output=csv')
data = r.text  # r.content returns bytes; r.text gives a decoded string
df = pd.read_csv(StringIO(data), index_col=0, parse_dates=['Quradate'])
df.head()
More here: Getting Google Spreadsheet CSV into A Pandas Dataframe
If the data is private it's much the same, but you will have to do some auth gymnastics...
From Google Colab snippets
from google.colab import auth
auth.authenticate_user()
import gspread
from oauth2client.client import GoogleCredentials
gc = gspread.authorize(GoogleCredentials.get_application_default())
worksheet = gc.open('Your spreadsheet name').sheet1
# get_all_values gives a list of rows.
rows = worksheet.get_all_values()
print(rows)
# Convert to a DataFrame and render.
import pandas as pd
pd.DataFrame.from_records(rows)
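If the first row of the sheet holds column names, a small variation keeps them as the dataframe headers; a sketch, assuming the rows list from the snippet above:
# Use the first row as column headers instead of the default integer columns.
df = pd.DataFrame.from_records(rows[1:], columns=rows[0])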

How best to convert from azure blob csv format to pandas dataframe while running notebook in azure ml

I have a number of large csv (tab-delimited) files stored as Azure blobs, and I want to create pandas dataframes from them. I can do this locally as follows:
from azure.storage.blob import BlobService
import pandas as pd
import os.path

STORAGEACCOUNTNAME = 'account_name'
STORAGEACCOUNTKEY = 'key'
LOCALFILENAME = 'path/to.csv'
CONTAINERNAME = 'container_name'
BLOBNAME = 'bloby_data/000000_0'

blob_service = BlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)

# Only fetch a local copy if we haven't already got it
if not os.path.isfile(LOCALFILENAME):
    blob_service.get_blob_to_path(CONTAINERNAME, BLOBNAME, LOCALFILENAME)

df_customer = pd.read_csv(LOCALFILENAME, sep='\t')
However, when running the notebook on azure ML notebooks, I can't 'save a local copy' and then read from csv, and so I'd like to do the conversion directly (something like pd.read_azure_blob(blob_csv) or just pd.read_csv(blob_csv) would be ideal).
I can get to the desired end result (pandas dataframe for blob csv data), if I first create an azure ML workspace, and then read the datasets into that, and finally using https://github.com/Azure/Azure-MachineLearning-ClientLibrary-Python to access the dataset as a pandas dataframe, but I'd prefer to just read straight from the blob storage location.
The accepted answer will not work with the latest Azure Storage SDK; MS has rewritten the SDK completely. It's kind of annoying if you are using the old version and have to update. The code below should work with the new version.
from azure.storage.blob import ContainerClient
from io import StringIO
import pandas as pd

conn_str = ""
container = ""
blob_name = ""

container_client = ContainerClient.from_connection_string(
    conn_str=conn_str,
    container_name=container
)

# Download blob as a StorageStreamDownloader object (stored in memory)
downloaded_blob = container_client.download_blob(blob_name)
df = pd.read_csv(StringIO(downloaded_blob.content_as_text()))
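Since the files in the question are tab-delimited, note that you can pass the separator through, and work with raw bytes instead of text; a sketch, assuming the container_client and blob_name from above:
from io import BytesIO
# readall() returns the blob contents as bytes; pass sep='\t' for tab-delimited data.
downloaded_blob = container_client.download_blob(blob_name)
df = pd.read_csv(BytesIO(downloaded_blob.readall()), sep='\t')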
I think you want to use get_blob_to_bytes or get_blob_to_text; these should output a string which you can use to create a dataframe:
from io import StringIO
blobstring = blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME)
df = pd.read_csv(StringIO(blobstring))
Thanks for the answer; I think some correction is needed. You need to get the content from the blob object, and in get_blob_to_text there's no need for the local file name.
from io import StringIO
blobstring = blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME).content
df = pd.read_csv(StringIO(blobstring))
Simple answer:
Working as of 12 June 2022.
Below are the steps to read a CSV file from Azure Blob into a Jupyter notebook dataframe (python).
STEP 1: First, generate a SAS token and URL for the target CSV (blob) file in Azure Storage by right-clicking the blob file.
STEP 2: Copy the Blob SAS URL that appears below the button used for generating the SAS token and URL.
STEP 3: Use the line of code below in your Jupyter notebook to import the desired CSV. Replace the url value with the Blob SAS URL copied in the step above.
import pandas as pd
url ='Your Blob SAS URL'
df = pd.read_csv(url)
df.head()
Use ADLFS (pip install adlfs), which is an fsspec-compatible API for Azure lakes (gen1 and gen2):
storage_options = {
    'tenant_id': tenant_id,
    'account_name': account_name,
    'client_id': client_id,
    'client_secret': client_secret
}
url = 'az://some/path.csv'
pd.read_csv(url, storage_options=storage_options)
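If you authenticate with a storage account key rather than a service principal, adlfs accepts that too; a sketch, assuming your adlfs version supports the account_key option (account_name and account_key are placeholders):
# Account-key auth instead of a service principal.
storage_options = {'account_name': account_name, 'account_key': account_key}
df = pd.read_csv('az://container/path.csv', storage_options=storage_options)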
