How to import data into Google Colab from Google Drive? - python

I have some data files uploaded on my google drive.
I want to import those files into google colab.
The REST API method and the PyDrive method show how to create a new file and upload it to Drive from Colab. Using those, I cannot figure out how to read data files already present on my Drive in my Python code.
I am a total newbie to this. Can someone help me out?

(Update April 15 2018: gspread is updated frequently, so to ensure a stable workflow I pin the package versions.)
For spreadsheet files, the basic idea is to use the gspread and pandas packages to read spreadsheets in Drive and convert them to pandas DataFrame format.
In the Colab notebook:
#install packages
!pip install gspread==2.1.1
!pip install gspread-dataframe==2.1.0
!pip install pandas==0.22.0
#import packages and authorize connection to Google account:
import pandas as pd
import gspread
from gspread_dataframe import get_as_dataframe, set_with_dataframe
from google.colab import auth
auth.authenticate_user() # verify your account to read files which you have access to. Make sure you have permission to read the file!
from oauth2client.client import GoogleCredentials
gc = gspread.authorize(GoogleCredentials.get_application_default())
From there, I know 3 ways to read Google spreadsheets.
By file name:
spreadsheet = gc.open("goal.csv") # Open file using its name. Use this if the file is already anywhere in your drive
sheet = spreadsheet.get_worksheet(0) # 0 means the first sheet in the file
df2 = pd.DataFrame(sheet.get_all_records())
df2.head()
By url:
spreadsheet = gc.open_by_url('https://docs.google.com/spreadsheets/d/1LCCzsUTqBEq5pemRNA9EGy62aaeIgye4XxwReYg1Pe4/edit#gid=509368585') # use this when you have the complete url (the gid fragment identifies the worksheet tab)
sheet = spreadsheet.get_worksheet(0) # 0 means the first sheet in the file
df2 = pd.DataFrame(sheet.get_all_records())
df2.head()
By file key/ID:
spreadsheet = gc.open_by_key('1vpukIbGZfK1IhCLFalBI3JT3aobySanJysv0k5A4oMg') # use this when you have the key (the string in the url following spreadsheets/d/)
sheet = spreadsheet.get_worksheet(0) # 0 means the first sheet in the file
df2 = pd.DataFrame(sheet.get_all_records())
df2.head()
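Since set_with_dataframe from gspread-dataframe is imported above, the reverse direction works too. A minimal sketch, reusing the spreadsheet and df2 from the examples above:
# write df2 back into the first worksheet, headers and values starting at A1
sheet = gc.open("goal.csv").get_worksheet(0)
set_with_dataframe(sheet, df2)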
I shared the code above in a Colab notebook:
https://drive.google.com/file/d/1cvur-jpIpoEN3vAO8Fd_yVAT5Qgbr4GV/view?usp=sharing
Source: https://github.com/burnash/gspread

1) Set your data to be publicly available. Then, for public spreadsheets:
from io import StringIO  # in Python 2 this lived in the StringIO module
import pandas as pd
import requests

r = requests.get('https://docs.google.com/spreadsheet/ccc?key=0Ak1ecr7i0wotdGJmTURJRnZLYlV3M2daNTRubTdwTXc&output=csv')
data = r.text  # r.content is raw bytes; r.text decodes it to a string for StringIO
df = pd.read_csv(StringIO(data), index_col=0, parse_dates=['Quradate'])
df.head()
More here: Getting Google Spreadsheet CSV into A Pandas Dataframe
For private data it's much the same, but you will have to do some auth gymnastics first, as in the snippet below.

From Google Colab snippets
from google.colab import auth
auth.authenticate_user()
import gspread
from oauth2client.client import GoogleCredentials
gc = gspread.authorize(GoogleCredentials.get_application_default())
worksheet = gc.open('Your spreadsheet name').sheet1
# get_all_values gives a list of rows.
rows = worksheet.get_all_values()
print(rows)
# Convert to a DataFrame and render.
import pandas as pd
pd.DataFrame.from_records(rows)
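Note that get_all_values returns the header row as the first element of the list. A small tweak, assuming your sheet's first row holds column names:
# use the first row as column names instead of treating it as data
df = pd.DataFrame(rows[1:], columns=rows[0])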

Related

Google spreadsheet to Pandas dataframe via Pydrive without download

How do I read the content of a Google spreadsheet into a Pandas dataframe without downloading the file?
I think gspread or df2gspread may be good bets, but I've been working with PyDrive so far and got close to the solution.
With PyDrive I managed to get the export link of my spreadsheet, either as a .csv or an .xlsx file. After the authentication process, this looks like:
gauth = GoogleAuth()
gauth.LocalWebserverAuth()
drive = GoogleDrive(gauth)
# choose whether to export csv or xlsx
data_type = 'csv'
# get list of files in folder as dictionaries
file_list = drive.ListFile({'q': "'my-folder-ID' in parents and trashed=false"}).GetList()
export_key = 'exportLinks'
excel_key = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
csv_key = 'text/csv'
if data_type == 'excel':
    urls = [file[export_key][excel_key] for file in file_list]
elif data_type == 'csv':
    urls = [file[export_key][csv_key] for file in file_list]
The type of url I get for xlsx is
https://docs.google.com/spreadsheets/export?id=my-id&exportFormat=xlsx
and similarly for csv
https://docs.google.com/spreadsheets/export?id=my-id&exportFormat=csv
Now, if I click on these links (or visit them with webbrowser.open(url)), I download the file, which I can then read into a pandas dataframe with pandas.read_excel() or pandas.read_csv(), as described here.
How can I skip the download, and directly read the file into a dataframe from these links?
I tried several solutions:
The obvious pd.read_csv(url) gives
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 2
Interestingly, these numbers (1, 6, 2) do not depend on the number of rows and columns in my spreadsheet, hinting that the parser is not reading what it is meant to.
The analogue pd.read_excel(url) gives
ValueError: Excel file format cannot be determined, you must specify an engine manually.
and specifying e.g. engine = 'openpyxl' gives
zipfile.BadZipFile: File is not a zip file
A BytesIO solution looked promising, but
r = requests.get(url)
data = r.content
df = pd.read_csv(BytesIO(data))
still gives
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 2
If I print(data), I get hundreds of lines of HTML:
b'\n<!DOCTYPE html>\n<html lang="de">\n <head>\n <meta charset="utf-8">\n <meta content="width=300, initial-scale=1" name="viewport">\n
...
...
</script>\n </body>\n</html>\n'
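(A quick way to confirm what is going on, assuming the same url and requests usage as above: check the response's Content-Type header.)
r = requests.get(url)
# 'text/html...' here instead of 'text/csv' means the request was redirected to an
# HTML page (typically a sign-in page) because it carried no credentials
print(r.headers.get('Content-Type'))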
In your situation, how about the following modification? In this case, by retrieving the access token from gauth, the Spreadsheet is exported as XLSX data, and the XLSX data is put into the dataframe.
Modified script:
import requests
from io import BytesIO
import pandas as pd

gauth = GoogleAuth()
gauth.LocalWebserverAuth()
spreadsheetId = '###'  # please set your spreadsheet ID here
url = f"https://docs.google.com/spreadsheets/export?id={spreadsheetId}&exportFormat=xlsx"
res = requests.get(url, headers={"Authorization": "Bearer " + gauth.attr['credentials'].access_token})
values = pd.read_excel(BytesIO(res.content))
print(values)
In this case, the 1st tab of XLSX data is used.
When you want to use the other tab, please modify values = pd.read_excel(BytesIO(res.content)) as follows.
sheet = "Sheet2"
values = pd.read_excel(BytesIO(res.content), sheet_name=sheet)
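A hedged variant for the CSV export link from the question, reusing the same gauth and spreadsheetId as above: switch the export format and read with read_csv.
# the CSV export contains a single tab (the first one, unless a gid is appended)
url = f"https://docs.google.com/spreadsheets/export?id={spreadsheetId}&exportFormat=csv"
res = requests.get(url, headers={"Authorization": "Bearer " + gauth.attr['credentials'].access_token})
values = pd.read_csv(BytesIO(res.content))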
I want to contribute an additional option to @Tanaike's excellent answer. It is indeed quite difficult to get an Excel file (an .xlsx on Drive, not a Google Sheet) into a Python environment without publishing the content to the web. Whereas the previous answer uses PyDrive and GoogleAuth(), I usually use a different method of authentication in Colab/Jupyter notebooks, adapted from the googleapis documentation. In my environment, BytesIO(response.content) is unnecessary.
import pandas as pd
from oauth2client.client import GoogleCredentials
from google.colab import auth
auth.authenticate_user()
from google.auth.transport.requests import AuthorizedSession
from google.auth import default
creds, _ = default()
id = 'aaaaaaaaaaaaaaaaaaaaaaaaaaa'
sheet = 'Sheet12345'
url = f'https://docs.google.com/spreadsheets/export?id={id}&exportFormat=xlsx'
authed_session = AuthorizedSession(creds)
response = authed_session.get(url)
values = pd.read_excel(response.content, sheet_name=sheet)

Google Colab: Read gsheet file from Google Drive

I am trying to read a gsheet file in Google Drive using Google Colab. I tried using drive.mount to get the file, but I don't know how to get a dataframe out of it with pandas. Here is what I tried:
from google.colab import auth
auth.authenticate_user()
import gspread
from oauth2client.client import GoogleCredentials
import os
import pandas as pd
from google.colab import drive
# setup
gc = gspread.authorize(GoogleCredentials.get_application_default())
drive.mount('/content/drive',force_remount=True)
# read data and put it in a dataframe
gsheets = gc.open_by_url('/content/drive/MyDrive/test/myGoogleSheet.gsheet')
As you can tell, I am quite lost with the libraries. I want to access the Drive with the drive library, get the content with gspread, and read it with pandas.
Can anyone help me find a solution, please?
I have found a solution to my problem by looking further into the gspread library. I was able to load the gsheet file by ID or by URL, which I did not know was possible. Then I managed to get the content of a sheet and read it as a pandas dataframe. Here is the code:
from google.colab import auth
auth.authenticate_user()
import gspread
import pandas as pd
from oauth2client.client import GoogleCredentials
# setup
gc = gspread.authorize(GoogleCredentials.get_application_default())
# read data and put it in a dataframe
# spreadsheet = gc.open_by_url('https://docs.google.com/spreadsheets/d/google_sheet_id/edit#gid=0')
spreadsheet = gc.open_by_key('google_sheet_id')
wks = spreadsheet.worksheet('sheet_name')
data = wks.get_all_values()
headers = data.pop(0)
df = pd.DataFrame(data, columns=headers)
print(df)
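As a side note, gspread's get_all_records (used earlier on this page) collapses the header handling into a single step, assuming the first row holds column names:
# get_all_records treats the first row as the header and returns a list of dicts
df = pd.DataFrame(wks.get_all_records())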

Export dataframe as csv file from google colab to google drive

I want to upload a dataframe as a csv file from Colab to Google Drive. I have tried a lot, but no luck. I can upload a simple text file, but I failed to upload a csv.
I tried the following code:
import pandas as pd
from google.colab import auth
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from oauth2client.client import GoogleCredentials

df = pd.DataFrame({1: [1, 2, 3]})
df.to_csv('abc', sep='\t')
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
uploaded = drive.CreateFile({'title': 'sample.csv', 'mimeType':'csv'})
uploaded.SetContentFile('abc')
uploaded.Upload()
It may be easier to use mounting instead of pydrive.
from google.colab import drive
drive.mount('drive')
After authentication, you can copy your csv file.
df.to_csv('data.csv')
!cp data.csv "drive/My Drive/"
Without using the !cp command:
from google.colab import drive
# mount Google Drive in the Colab notebook
drive.mount('/drive')
# make sure the folder name already exists in Google Drive before uploading
df.to_csv('/drive/My Drive/folder_name/name_csv_file.csv')
# Import Drive API and authenticate.
from google.colab import drive
# Mount your Drive to the Colab VM.
drive.mount('/gdrive')
# Write the DataFrame to CSV file.
with open('/gdrive/My Drive/foo.csv', 'w') as f:
df.to_csv(f)
The other answers almost worked but I needed a small tweak:
from google.colab import drive
drive.mount('drive')
df.to_csv('/content/drive/My Drive/filename.csv', encoding='utf-8', index=False)
The /content/ bit proved necessary.
If you want to save the file locally, you can use this:
df.to_csv('sample.csv')
from google.colab import files
files.download("sample.csv")
If you don't want to work with pandas (the snippet below assumes a Spark dataframe), do this:
df_mini.coalesce(1)\
.write\
.format('com.databricks.spark.csv')\
.options(header='true', delimiter='\t')\
.save('gdrive/My Drive/base_mini.csv')

python gspread - How do I get a spreadsheet's URL after I create it?

I'm trying to create a new spreadsheet using the gspread Python package, then get its URL path (inside Google Drive) and send it to other people so they can open it too.
I tried to find an answer here and here, with no luck.
I created a brand new Spreadsheet:
import gspread
from gspread_dataframe import get_as_dataframe, set_with_dataframe
gc = gspread_connect()
spreadsheet = gc.create('TESTING SHEET')
Then I shared it with my account:
spreadsheet.share('my_user#my_company.com', perm_type='user', role='writer')
Then I wrote some random stuff into it:
worksheet = gc.open('TESTING SHEET').sheet1
df = pd.DataFrame.from_records([{'a': i, 'b': i * 2} for i in range(100)])
set_with_dataframe(worksheet, df)
Now when I go to my Google Drive I can find this sheet by looking for its name ("TESTING SHEET").
But I haven't figured out how to get its URL path in my Python code, so I can pass it right away to other people.
Thanks!
You can generate the URL by using Spreadsheet.id. Here's an example that uses the spreadsheet variable from your code:
spreadsheet_url = "https://docs.google.com/spreadsheets/d/%s" % spreadsheet.id
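If I remember correctly, newer gspread releases also expose this as a property, so the string building may be unnecessary (worth verifying against your installed gspread version):
# Spreadsheet.url is available on recent gspread versions
spreadsheet_url = spreadsheet.url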

How best to convert from Azure blob csv format to pandas dataframe while running a notebook in Azure ML

I have a number of large csv files (tab-delimited) stored as Azure blobs, and I want to create pandas dataframes from them. I can do this locally as follows:
from azure.storage.blob import BlobService
import pandas as pd
import os.path
STORAGEACCOUNTNAME= 'account_name'
STORAGEACCOUNTKEY= "key"
LOCALFILENAME= 'path/to.csv'
CONTAINERNAME= 'container_name'
BLOBNAME= 'bloby_data/000000_0'
blob_service = BlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
# only download a local copy if we haven't already got one
if not os.path.isfile(LOCALFILENAME):
    blob_service.get_blob_to_path(CONTAINERNAME, BLOBNAME, LOCALFILENAME)
df_customer = pd.read_csv(LOCALFILENAME, sep='\t')
However, when running the notebook on Azure ML Notebooks, I can't 'save a local copy' and then read from the csv, so I'd like to do the conversion directly (something like pd.read_azure_blob(blob_csv) or just pd.read_csv(blob_csv) would be ideal).
I can get to the desired end result (pandas dataframe for blob csv data), if I first create an azure ML workspace, and then read the datasets into that, and finally using https://github.com/Azure/Azure-MachineLearning-ClientLibrary-Python to access the dataset as a pandas dataframe, but I'd prefer to just read straight from the blob storage location.
The accepted answer will not work with the latest Azure Storage SDK, which MS has rewritten completely. It's kind of annoying if you are on the old version and update it. The code below should work with the new version.
from azure.storage.blob import ContainerClient
from io import StringIO
import pandas as pd
conn_str = ""
container = ""
blob_name = ""
container_client = ContainerClient.from_connection_string(
    conn_str=conn_str,
    container_name=container
)
# Download blob as StorageStreamDownloader object (stored in memory)
downloaded_blob = container_client.download_blob(blob_name)
df = pd.read_csv(StringIO(downloaded_blob.content_as_text()))
I think you want to use get_blob_to_bytes, or get_blob_to_text; these should output a string which you can use to create a dataframe as
from io import StringIO
blobstring = blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME)
df = pd.read_csv(StringIO(blobstring))
Thanks for the answer; I think a small correction is needed. You need to get the content from the returned blob object, and get_blob_to_text doesn't need a local file name:
from io import StringIO
blobstring = blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME).content
df = pd.read_csv(StringIO(blobstring))
Simple answer, working as of 12th June 2022.
Below are the steps to read a CSV file from Azure Blob Storage into a Jupyter notebook dataframe (Python).
STEP 1: Generate a SAS token & URL for the target CSV (blob) file on Azure Storage by right-clicking the blob/storage CSV file.
STEP 2: Copy the Blob SAS URL that appears below the button used for generating the SAS token and URL.
STEP 3: Use the line of code below in your Jupyter notebook to import the desired CSV, replacing the url value with your Blob SAS URL copied in the step above.
import pandas as pd
url ='Your Blob SAS URL'
df = pd.read_csv(url)
df.head()
Use ADLFS (pip install adlfs), which is an fsspec-compatible API for Azure lakes (gen1 and gen2):
storage_options = {
    'tenant_id': tenant_id,
    'account_name': account_name,
    'client_id': client_id,
    'client_secret': client_secret
}
url = 'az://some/path.csv'
pd.read_csv(url, storage_options=storage_options)
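As a hedged aside (the parameter names follow adlfs's documented options; verify against your installed version): if you authenticate with a plain storage account key rather than a service principal, the same pattern should work with different storage_options. Note that with the az:// protocol the first path segment is the container name.
# sketch: account-key auth instead of a service principal; names are placeholders
storage_options = {'account_name': account_name, 'account_key': account_key}
df = pd.read_csv('az://container/some/path.csv', storage_options=storage_options)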
