Exporting scraped content to google sheets - python

I want to scrape a website for some information; it would be 3 to 4 columns of data. The difficult part is that I want to export all the data into Google Sheets and make the crawler run at specific intervals. I'll be using Scrapy for this. Any suggestions on how I can do this (by making a custom pipeline or any other way), as I don't have much experience writing custom pipelines?

You can use the Google API and the Python pygsheets module.
Refer to the pygsheets documentation for more details.
The sample code below might help you.
import pygsheets
import pandas as pd

# Authorize with a service account
gc = pygsheets.authorize(service_file='/Users/desktop/creds.json')

# Create an empty dataframe
df = pd.DataFrame()

# Create a column
df['name'] = ['John', 'Steve', 'Sarah']

# Open the Google spreadsheet (where 'PY to Gsheet Test' is the name of my sheet)
sh = gc.open('PY to Gsheet Test')

# Select the first worksheet
wks = sh[0]

# Update the first worksheet with df, starting at cell A1
wks.set_dataframe(df, (1, 1))
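To wire this into Scrapy, a minimal item pipeline sketch along the same lines is shown below. The service-account path, spreadsheet name, and item fields are placeholders, not confirmed details of your project. The pipeline buffers scraped items and appends them to the first worksheet when the spider finishes; running the crawler at intervals can then be handled outside Scrapy (e.g. a cron job that starts the crawl).
# settings.py (hypothetical project): enable the pipeline
# ITEM_PIPELINES = {"myproject.pipelines.GoogleSheetsPipeline": 300}

import pygsheets


class GoogleSheetsPipeline:
    """Buffers scraped items and appends them to a Google Sheet when the spider closes."""

    def open_spider(self, spider):
        # Placeholder credentials file and spreadsheet name; replace with your own
        self.gc = pygsheets.authorize(service_file='/path/to/creds.json')
        self.wks = self.gc.open('PY to Gsheet Test')[0]
        self.rows = []

    def process_item(self, item, spider):
        # Assumes each item is a dict-like object with your 3-4 fields
        self.rows.append(list(dict(item).values()))
        return item

    def close_spider(self, spider):
        # Append all collected rows below the existing data in the sheet
        self.wks.append_table(self.rows, overwrite=False)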

Related

Using Google Sheets through Python in the simplest way

So basically I am making a Discord bot that takes trades for in-game items in a game I play and stores the order in a Google Sheet. What would be the easiest way to do this through Python? I know how to do all the bot stuff, but when it comes to accessing a Google Sheet, searching through it, and collecting certain rows of information, I can't find much that helps. Which module would make this as easy as possible? It needs to be able to search the sheet for specific values in one column, find the first empty cell in a column, and collect all the information from a row. If anyone knows a good module for doing this it would be greatly appreciated.
Note: I have set up OAuth and all that kind of stuff for the Sheets API. I saw that there are a bunch of modules that make accessing the sheet easier, so I was wondering which one makes the coding easiest, as I am not super experienced.
Use the Google Sheets API to get the data and then pandas to read it in as a dataframe. Once you have the dataframe, pandas can handle the rest in various ways: "needs to be able to search through the sheet for specific values in one column, find the first empty cell in a column as well as collect all the information from a row". A few pandas examples follow the snippet below.
from googleapiclient.discovery import build
from google.oauth2.credentials import Credentials
import pandas as pd
SAMPLE_SPREADSHEET_ID = 'sheet ID' # your sheet ID
SAMPLE_NAME = 'sheet name' # your sheet name
RANGE = '!A1:D2' # your row/col sheet range
TOKEN_PATH = 'token.json' # path to your token file
SCOPES = ['https://www.googleapis.com/auth/spreadsheets.readonly']
SAMPLE_RANGE_NAME = SAMPLE_NAME + RANGE
creds = Credentials.from_authorized_user_file(TOKEN_PATH, SCOPES)
service = build('sheets', 'v4', credentials=creds)
sheet = service.spreadsheets()
result = sheet.values().get(spreadsheetId=SAMPLE_SPREADSHEET_ID,
                            range=SAMPLE_RANGE_NAME).execute()
values = result.get('values', [])
df = pd.DataFrame(data=values[1:], columns=values[0])
Adapted from https://developers.google.com/sheets/api/quickstart/python
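As a rough sketch of those pandas operations on the resulting df (the column names and values below are made up for illustration):
# Find rows where a given column matches a specific value
matches = df[df['Item'] == 'Dragon Sword']           # 'Item' is a hypothetical column name

# Find the index of the first empty cell in a column
# (empty cells coming back from the Sheets API are '' or missing)
col = df['Buyer'].replace('', pd.NA)                  # 'Buyer' is a hypothetical column name
first_empty = col[col.isna()].index.min()

# Collect all the information from a single row by position
row_values = df.iloc[3].tolist()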

Want active Excel sheet names in Python?

I have loaded an Excel file into Python (Google Colab), but I was wondering if there is a way of extracting the name of the Excel (.xlsm) file.
import pandas as pd
import io
from google.colab import files

uploaded = files.upload()

fname = '202009 Testing - September - Diamond Plod Day & Night MKY021.xlsm'
df = pd.read_excel(io.BytesIO(uploaded[fname]), sheet_name='1 D', header=8, usecols='BE,BH', nrows=4)
df1 = pd.read_excel(io.BytesIO(uploaded[fname]), sheet_name='1 D', header=3)

df = df.assign(PlodDate='D5')
df['PlodDate'] = df1.iloc[0, 3]
df = df.assign(PlodShift='D6')
df['PlodShift'] = df1.iloc[1, 3]

df = df.rename({'Qty.2': 'Loads', 'Total (L)': 'Litres'}, axis=1)
df = df.reindex(columns=['PlodDate', 'PlodShift', 'Loads', 'Litres', 'DataSource'])
df = df.assign(DataSource='Name of the Source File')
df
Instead of DataSource='Name of the Source File', I want the name of the loaded Excel file.
The output should be:
DataSource='202009 Testing - September - Diamond Plod Day & Night MKY021'
As I have a file for every month, I just want code that takes the name of the loaded Excel file when I run it.
I tried this code, but it was not working in Google Colab.
import os
os.listdir('.')
I have not used Google Colab, but I used to have a similar problem of extracting the sheet names from an Excel file. The solution turned out to be very simple:
import pandas as pd

excel_file = pd.ExcelFile("excel_file_name.xlsx")
sheet_names = excel_file.sheet_names
So, basically the idea is that you want to open the whole Excel file instead of a specific sheet of it. This can be done with pd.ExcelFile( ... ). Once you have your Excel file "open", you can get the names via some_excel_file.sheet_names. This is especially useful when you want to loop over all sheets in an Excel file. For example, the code can be something like this:
excel_file = pd.ExcelFile("excel_file_name.xlsx")
sheet_names = excel_file.sheet_names
for sheet_name in sheet_names:
    # do some operations here for this sheet, e.g. read it into a dataframe
    sheet_df = pd.read_excel(excel_file, sheet_name=sheet_name)
This is not a complete answer, as I am not sure about Google Colab, but I hope it gives you an idea of what you can do with the sheet names.
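Since the desired output is actually the uploaded file's name rather than a sheet name, here is a minimal sketch for Colab. It only relies on the fact that files.upload() returns a dict keyed by filename, and reuses the uploaded and df variables from the question's code; the DataSource assignment mirrors that code.
import os

# 'uploaded' is the dict returned by files.upload() in the question's code;
# it is keyed by the original filename of each uploaded file.
file_name = next(iter(uploaded))              # e.g. '202009 Testing - ... MKY021.xlsm'
source_name = os.path.splitext(file_name)[0]  # strip the .xlsm extension

df = df.assign(DataSource=source_name)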

How can I change the text format to number format of a table pulled from Python to a Google spreadsheet?

Here is my problem: I have a dataframe in Python that I put into a Google Spreadsheet, and the numeric columns come through as text.
I want the numeric columns in a numeric format. I've been trying this, but nothing happens:
from gspread_formatting import *
fmt = cellFormat(numberFormat=numberFormat(type='NUMBER', pattern='####.#'))
format_cell_range(worksheet, 'Q2:R2', fmt)
If I put a random text format inside fmt (like a background color), it works... so I don't know why the change to numberFormat isn't working...
In the Python dataframe, these columns are in a numeric format.
Thanks!
EDIT
This is the code that puts the dataframe from Python into a Google Sheet:
import gspread
from oauth2client.service_account import ServiceAccountCredentials
from df2gspread import df2gspread as d2g

def spreadsheet_first_paste(file_name, sheet_id, sheet_name, df):
    scope = ['https://spreadsheets.google.com/feeds',
             'https://www.googleapis.com/auth/drive']
    creds = ServiceAccountCredentials.from_json_keyfile_name('...path/client_secret.json', scope)
    gc = gspread.service_account(filename='...path/client_secret.json')
    sh = gc.open(file_name)
    spreadsheet_key = sheet_id
    wks_name = sheet_name
    d2g.upload(df, spreadsheet_key, wks_name, credentials=creds, row_names=True)
EDIT2
SOLVED:
Problem with data format while Importing pandas DF from python into google sheets using df2gsheets
That post had the solution: I was using the df2gspread library, so I went into df2gspread.py and changed the line "wks.update_cells(cell_list)" to "wks.update_cells(cell_list, value_input_option='USER_ENTERED')".
This solution really worked for me as well, thanks for posting it.
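For reference, here is a rough sketch of the same idea without patching the library: write the dataframe with gspread directly and pass value_input_option='USER_ENTERED' so Sheets parses the numbers. The client, spreadsheet name, and worksheet name below are placeholders, not details from the original post.
# Assumes an authorized gspread client `gc` and a pandas dataframe `df`
sh = gc.open('my spreadsheet name')      # placeholder spreadsheet name
wks = sh.worksheet('my sheet name')      # placeholder worksheet name

# Header row plus data rows
values = [df.columns.tolist()] + df.values.tolist()

# USER_ENTERED makes Sheets interpret numeric strings as numbers rather than text
wks.update(values=values, range_name='A1', value_input_option='USER_ENTERED')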

How to download a Google Docs excel sheet with gspread and access data locally (A1 notation)?

I need to download an excel sheet from Google Docs via gspread, and then I'll need to read the values of different cells in 'A1' notation many times. Thus, I can't just get the spreadsheet and then call val = worksheet.acell('B1').value, because the script will freeze from too many API calls. My solution for now:
def download_hd_sheet():
    worksheet = gc.values().get(spreadsheetId=excel_id, range='variables', valueRenderOption='FORMULA').execute()['values']
    df = pd.DataFrame(worksheet)
    writer = pd.ExcelWriter("Variables.xlsx", engine='xlsxwriter')
    df.to_excel(writer, sheet_name='Sheet1', index=False, header=False)
    workbook = writer.book
    worksheet = writer.sheets['Sheet1']
    writer.save()
    book = openpyxl.load_workbook('Variables.xlsx', data_only=False)
    global hd_sheet
    hd_sheet = book.active
So far what I'm doing is:
I download the values from the worksheet.
I transform the result (a list of lists) into a pandas dataframe.
Then I write the df to an .xlsx file.
I read the .xlsx file back into a global variable.
It seems to me that I am doing a lot just to achieve something that could be done in a couple of lines. Please let me know what would be more efficient than the above.
I believe your goal is as follows.
You want to download the Google Spreadsheet as the XLSX data.
You want to use the downloaded XLSX data without saving as the file.
You have already been able to get and put values for Google Spreadsheet using gspread.
You want to achieve this using python.
In order to achieve your goal, I would like to propose the following flow.
Download the Google Spreadsheet as the XLSX data using the method of Files: export in Drive API.
Open the XLSX data using the downloaded binary data with openpyxl.load_workbook().
Sample script:
In this sample script, given your situation, the access token is retrieved from the gspread authorization.
spreadsheetId = "###" # Please set the Spreadsheet ID.
client = gspread.authorize(credentials)
access_token = client.auth.token
url = "https://www.googleapis.com/drive/v3/files/" + spreadsheetId + "/export?mimeType=application%2Fvnd.openxmlformats-officedocument.spreadsheetml.sheet"
res = requests.get(url, headers={"Authorization": "Bearer " + access_token})
book = openpyxl.load_workbook(filename=BytesIO(res.content), data_only=False)
hd_sheet = book.active
With the above script, the XLSX data is downloaded directly from the Google Spreadsheet and loaded with openpyxl.load_workbook, without saving it to a file.
In this case, the following libraries in addition to gspread are used.
import openpyxl
import requests
from io import BytesIO
Note:
In this case, please include the scope of https://www.googleapis.com/auth/drive or https://www.googleapis.com/auth/drive.readonly. When you modify the scopes, please reauthorize so that the new scopes are reflected in the access token. Please be careful about this.
References:
Files: export
Using openpyxl to read file from memory
I think that this thread might be useful for your situation.
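Once the workbook is loaded this way, cells can be read locally in A1 notation without further API calls, for example:
val = hd_sheet['B1'].value    # a single cell in A1 notation
block = hd_sheet['A1:D10']    # a range returns a tuple of rows of cells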

Multiple Spreadsheets with Gspread

I'm really new to Python, and I'm working with gspread and Google Sheets. I have several spreadsheets I would like to pull data from. They all have the same name with an appended numerical value (e.g., SpreadSheet(1), SpreadSheet(2), SpreadSheet(3), etc.).
I would like to go through each spreadsheet, pull the data, and generate a single dataframe from it. I can do this quite easily with a single spreadsheet, but I'm having trouble doing it with several.
I can create a list of the spreadsheet titles with the code below, but I'm not sure if that's the right direction.
titles_list = []
for spreadsheet in client.openall():
    titles_list.append(spreadsheet.title)
Using a mix of your starting code and @Tanaike's answer, here is a snippet of code that does what you expect.
import gspread
import pandas as pd

# Create an authorized client
client = gspread.authorize(credentials)

# Create a list to hold the values
values = []

# Get all spreadsheets
for spreadsheet in client.openall():
    # Get the spreadsheet's worksheets
    worksheets = spreadsheet.worksheets()
    for ws in worksheets:
        # Append the values of the worksheet to values
        values.extend(ws.get_all_values())

# Create a df from values
df = pd.DataFrame(values)
print(df)
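Since your files share the naming pattern SpreadSheet(1), SpreadSheet(2), ..., you could also filter inside that loop on the title. A small variation of the loop above (the prefix check is an assumption based on your naming, not something gspread requires):
for spreadsheet in client.openall():
    # Only pull from spreadsheets whose title starts with the expected prefix
    if not spreadsheet.title.startswith('SpreadSheet('):
        continue
    for ws in spreadsheet.worksheets():
        values.extend(ws.get_all_values())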
Hope I was clear.
I believe your goal is as follows.
You want to merge the values retrieved from all sheets in a Google Spreadsheet.
You want to convert the retrieved values to the dataframe.
Each sheet has 4 columns, 100 rows and no header rows.
You want to achieve this using gspread with python.
You have already been able to get and put values for Google Spreadsheet using Sheets API.
For this, how about this answer?
Flow:
Retrieve all sheets in the Google Spreadsheet using worksheets().
Retrieve all values from all sheets using get_all_values() and merge the values.
Convert the retrieved values to the dataframe.
Sample script:
spreadsheetId = "###"  # Please set the Spreadsheet ID.

client = gspread.authorize(credentials)
spreadsheet = client.open_by_key(spreadsheetId)
worksheets = spreadsheet.worksheets()
values = []
for ws in worksheets:
    values.extend(ws.get_all_values())
df = pd.DataFrame(values)
print(df)
References:
worksheets()
get_all_values()
