Reduce API calls from the gspread library - Python

I'm currently trying to get data with the gspread API from a Drive folder containing about 50 spreadsheet files, each with about 10 sheets (about 500 sheets in total).
I want to get 3 specific columns from every sheet of every file and append them to a dataframe.
I got this code working:
scope = ["https://spreadsheets.google.com/feeds",'https://www.googleapis.com/auth/spreadsheets',"https://www.googleapis.com/auth/drive.file","https://www.googleapis.com/auth/drive"]
creds = ServiceAccountCredentials.from_json_keyfile_name("gspread/service_account.json",scope)
client = gs.authorize(creds)
file_list = client.list_spreadsheet_files()
file_list = list(map(itemgetter('name'),file_list))
for files in file_list:
file = client.open(files)
worksheet_list = file.worksheets()
print(files)
for sheet in worksheet_list[1:]:
print(sheet)
set_with_dataframe(sheet, df)
df = get_as_dataframe(sheet, parse_dates=True, usecols=[5, 7, 9], skiprows=1, header=None)
df.drop(df.index[10:100], axis=0, inplace=True)
print('Ajout de : ',df)
df_final = df_final.append(df)
df_final.fillna('', inplace=True)
df_final.reset_index(drop=True, inplace=True)
print(df_final)
The thing is that I always get a 429 error (too many API calls) and I can't figure out how to reduce the number of calls. Even when using get_all_values() or get_all_records(), the code still has to loop through all the sheets.
Even a time.sleep(30) for each file doesn't help.
If I'm not wrong, I make about 22 calls for each file of 10 sheets: 2 per file (file = client.open(files) and worksheet_list = file.worksheets()) and 2 per sheet (set_with_dataframe() and get_as_dataframe()).
I could use a time.sleep() for each worksheet instead, but it would take very long (500+ sheets).
I could also change the pause time depending on the number of sheets (some files have more).
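To put numbers on it, a rough back-of-the-envelope count based on the figures above (2 calls per file plus 2 per sheet), just to show why pausing per sheet gets so slow:

calls_per_file = 2 + 10 * 2          # open() + worksheets(), then 2 calls per sheet
total_calls = 50 * calls_per_file
print(calls_per_file, total_calls)   # 22 calls per file, roughly 1100 calls in total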
My questions are:
1) Is there a function that could get all the data from all the sheets instead of having to loop through every sheet?
2) If not, is there a solution to reduce the calls without having to use pauses?
If yes, that would be a solution and would drastically reduce the calls.
Thanks in advance and nice evening,
Alex

To answer your questions:
There will soon be a method in gspread that returns all the data from all the sheets. See this issue here.
As of today that solution is not available, so in order to pull all the data from a spreadsheet you can build your own request and let gspread handle it for you. Have a look at the library source code, in the Worksheet class, at the methods that return values (like get_range, get or range). You can easily build your own request and request the full spreadsheet values.
On the side, if you wish to reduce API calls you can do so by opening your spreadsheet files using their IDs instead of listing all the files and then looping over them. You will save the initial API calls to the Drive API that list the files.
The method list_spreadsheet_files is costly because the response is paginated, so it makes as many calls as there are pages (I don't know the size of a page). The method open is costly too, because it makes a new call to list_spreadsheet_files and then loops over the result to open the right file using its ID.
I would definitely start on that part of your code to reduce API calls.
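As a rough illustration of both points, here is a minimal sketch, assuming you know the spreadsheet IDs in advance and that your gspread version is recent enough to provide Spreadsheet.values_batch_get(); the placeholder IDs and the per-sheet range handling are assumptions, not from the question:

import gspread as gs
from oauth2client.service_account import ServiceAccountCredentials

scope = ['https://www.googleapis.com/auth/spreadsheets',
         'https://www.googleapis.com/auth/drive']
creds = ServiceAccountCredentials.from_json_keyfile_name("gspread/service_account.json", scope)
client = gs.authorize(creds)

# Hypothetical, hard-coded spreadsheet IDs: no Drive API listing needed.
spreadsheet_ids = ['<spreadsheet-id-1>', '<spreadsheet-id-2>']

for key in spreadsheet_ids:
    spreadsheet = client.open_by_key(key)   # opens by ID, no list_spreadsheet_files call
    worksheets = spreadsheet.worksheets()   # one call for the sheet metadata
    # One batched call for the values of every sheet except the first,
    # instead of two calls per sheet.
    ranges = [ws.title for ws in worksheets[1:]]
    response = spreadsheet.values_batch_get(ranges)
    for value_range in response['valueRanges']:
        print(value_range['range'], len(value_range.get('values', [])))

From there you can build the dataframes yourself from each value range instead of calling get_as_dataframe() once per sheet.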

Related

openpyxl blocking excel file after first read

I am trying to overwrite a value in a given cell using openpyxl. I have two sheets. One is called Raw; it is populated by API calls. The second is Data, which is fed off the Raw sheet. The two sheets have exactly identical shapes (cols/rows). I am doing a comparison of the two to see if there is a bay assignment in Raw. If there is, grab it into the Data sheet. If both Raw and Data have the value in that column missing, then run a complex algo (irrelevant for this question) to assign a bay number based on logic.
I am having problems with rewriting Excel using openpyxl.
Here's example of my code.
import pandas as pd
from openpyxl import load_workbook

data_df = pd.read_excel('Algo Build v23test.xlsx', sheet_name='MondayData')
raw_df = pd.read_excel('Algo Build v23test.xlsx', sheet_name='MondayRaw')
# grab rows where there is no bay assignment in a specific column
no_bay_res = data_df[data_df['Bay assignment'].isnull()].reset_index()

book = load_workbook("Algo Build v23test.xlsx")
sheet = book["MondayData"]

for index, reservation in no_bay_res.iterrows():
    idx = int(reservation['index'])
    if pd.isna(raw_df.iloc[idx, 13]):
        continue
    else:
        value = raw_df.iat[idx, 13]
        data_df.iloc[idx, 13] = value
        sheet.cell(idx + 2, 14).value = int(value)

book.save("Algo Build v23test.xlsx")
book.close()
print(value)  # 302
Now the problem is that book.close() does not seem to be working; the book is still accessible in Python. It overwrites the Excel file just fine. However, if I try to run these two lines again
data_df = pd.read_excel('Algo Build v23test.xlsx', sheet_name='MondayData')
raw_df = pd.read_excel('Algo Build v23test.xlsx', sheet_name='MondayRaw')
I get datasets full of NULL values, except for the value that was replaced (see the attached image).
However, if I open that Excel file manually from the folder, save it (Ctrl+S), and run the code again, it works properly. Weirdest problem.
I need to loop the code above for Monday through Sunday, so I need it to be able to read the data again without manually resaving the file.
For some reason, pandas reads all the formulas as NaN after the file has been touched by openpyxl in the script, until the file has been opened, saved and closed again. Here's the code that does that from within the script. However, it is rather slow.

import pandas as pd
import xlwings as xl

def df_from_excel(path, sheet_name):
    app = xl.App(visible=False)
    book = app.books.open(path)
    book.save()
    app.kill()
    return pd.read_excel(path, sheet_name)
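With that helper in place, the Monday-to-Sunday loop from the question could re-read the workbook between passes without a manual re-save. A small usage sketch, assuming every day has <Day>Data and <Day>Raw sheets like Monday does (that naming is an assumption):

for day in ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
            'Friday', 'Saturday', 'Sunday']:
    data_df = df_from_excel('Algo Build v23test.xlsx', sheet_name=day + 'Data')
    raw_df = df_from_excel('Algo Build v23test.xlsx', sheet_name=day + 'Raw')
    # ... run the bay-assignment logic for this day ...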
I had the same problem; the only workaround I found was to terminate excel.exe manually from the Task Manager. After that everything went fine.

Reading csv files with glob to pass data to a database very slow

I have many CSV files and I am trying to pass all the data they contain into a database. For this reason, I found that I could use the glob library to iterate over all the CSV files in my folder. Following is the code I used:
import requests as req
import pandas as pd
import glob
import json

endpoint = "testEndpoint"
path = "test/*.csv"
# headers for the POST request are assumed to be defined elsewhere

for fname in glob.glob(path):
    print(fname)
    df = pd.read_csv(fname)
    for index, row in df.iterrows():
        # print(row['ID'], row['timestamp'], row['date'], row['time'],
        #       row['vltA'], row['curA'], row['pwrA'], row['rpwrA'], row['frq'])
        print(row['timestamp'])
        testjson = {"data":
                        {"installationid": row['ID'],
                         "active": row['pwrA'],
                         "reactive": row['rpwrA'],
                         "current": row['curA'],
                         "voltage": row['vltA'],
                         "frq": row['frq'],
                         }, "timestamp": row['timestamp']}
        payload = {"payload": [testjson]}
        json_data = json.dumps(payload)
        response = req.post(
            endpoint, data=json_data, headers=headers)
This code seems to work fine in the beginning. However, after some time it starts to become really slow (I noticed this because I print the timestamp as I upload the data) and eventually stops completely. What is the reason for this? Is something I am doing here really inefficient?
I can see 3 possible problems here:
1) Memory. read_csv is fast, but it loads the content of a full file into memory. If the files are really large, you could exhaust real memory and start using swap, which has terrible performance.
2) iterrows: you seem to build a dataframe - a data structure optimized for column-wise access - only to then access it by rows. This is already a bad idea, and iterrows is known to have terrible performance because it builds a Series for each row.
3) One POST request per row. An HTTP request has its own overhead, and furthermore this means you add rows to the database one at a time. If this is the only interface to your database you may have no other choice, but you should check whether it is possible to prepare a batch of rows and load it as a whole. It often provides a gain of more than an order of magnitude (see the sketch below).
Without more info I can hardly say more, but IMHO the biggest gain is to be found in the database feeding, so in point 3. If nothing can be done on that point, or if further performance gains are required, I would try to replace pandas with the csv module, which is row oriented and has a limited footprint because it only processes one line at a time, whatever the file size.
Finally, and if it makes sense for your use case, I would try to use one thread to read the csv files and feed a queue, plus a pool of threads to send requests to the database. That should help amortize the HTTP overhead. But beware: depending on the endpoint implementation, it may not improve much if the database access really is the limiting factor.
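To make point 3 and the csv-module suggestion concrete, here is a minimal sketch. It assumes the endpoint accepts more than one item in the "payload" list; the batch size and the headers dict are placeholders, not from the question:

import csv
import glob
import json
import requests as req

endpoint = "testEndpoint"
path = "test/*.csv"
headers = {"Content-Type": "application/json"}  # placeholder headers
BATCH_SIZE = 500  # hypothetical batch size; depends on what the endpoint accepts

def post_batch(batch):
    # One request for many rows instead of one request per row.
    req.post(endpoint, data=json.dumps({"payload": batch}), headers=headers)

for fname in glob.glob(path):
    with open(fname, newline='') as f:
        reader = csv.DictReader(f)  # row oriented, one line in memory at a time
        batch = []
        for row in reader:
            batch.append({
                "data": {
                    "installationid": row['ID'],
                    "active": row['pwrA'],
                    "reactive": row['rpwrA'],
                    "current": row['curA'],
                    "voltage": row['vltA'],
                    "frq": row['frq'],
                },
                "timestamp": row['timestamp'],
            })
            if len(batch) >= BATCH_SIZE:
                post_batch(batch)
                batch = []
        if batch:  # flush the remainder of the file
            post_batch(batch)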

Hitting AWS Lambda Memory limit in Python

I am looking for some advice on this project. My thought was to use Python and a Lambda to aggregate the data and respond to the website. The main parameters are date ranges and can be dynamic.
Project Requirements:
Read data from monthly return files stored in JSON (each file contains roughly 3000 securities and is 1.6 MB in size)
Aggregate the data into various buckets displaying counts and returns for each bucket (for our purposes here lets say the buckets are Sectors and Market Cap ranges which can vary)
Display aggregated data on a website
Issue I face
I have successfully implemented this in an AWS Lambda; however, when testing requests for 20 years of data (and yes, I get them), I begin to hit the memory limits of AWS Lambda.
Process I used:
All files are stored in S3, so I use the boto3 library to obtain the files, reading them into memory. This is still small and not of any real significance.
I use json.loads to convert the files into a pandas dataframe. I was loading all of the files into one large dataframe - this is where it runs out of memory.
I then pass the dataframe to custom aggregations using groupby to get my results. This part is not as fast as I would like, but it does the job of getting what I need.
The end result dataframe is then converted back into JSON and is less than 500 MB.
This entire process, when it works locally outside the Lambda, takes about 40 seconds.
I have tried running this with threads, processing single frames at once, but the performance degrades to about 1 minute 30 seconds.
While I would rather not scrap everything and start over, I am willing to do so if there is a more efficient way to handle this. The old process did everything inside node.js without a Lambda and took almost 3 minutes to generate.
Code currently used
I had to clean this up a little to pull out some items, but here is the code used.
Read the data from S3 into JSON; this results in a collection of JSON strings.
while not q.empty():
    fkey = q.get()
    try:
        obj = self.s3.Object(bucket_name=bucket, key=fkey[1])
        json_data = obj.get()['Body'].read().decode('utf-8')
        results[fkey[1]] = json_data
    except Exception as e:
        results[fkey[1]] = str(e)
    q.task_done()
Loop through the JSON files to build a dataframe to work with:
for k, v in s3Data.items():
    lstdf.append(buildDataframefromJson(k, v))

def buildDataframefromJson(key, json_data):
    tmpdf = pd.DataFrame(columns=['ticker','totalReturn','isExcluded','marketCapStartUsd',
                                  'category','marketCapBand','peGreaterThanMarket','Month','epsUsd'])
    # Read the json into a dataframe
    tmpdf = pd.read_json(json_data,
                         dtype={
                             'ticker': str,
                             'totalReturn': np.float32,
                             'isExcluded': np.bool,
                             'marketCapStartUsd': np.float32,
                             'category': str,
                             'marketCapBand': str,
                             'peGreaterThanMarket': np.bool,
                             'epsUsd': np.float32
                         })[['ticker', 'totalReturn', 'isExcluded', 'marketCapStartUsd', 'category',
                             'marketCapBand', 'peGreaterThanMarket', 'epsUsd']]
    dtTmp = datetime.strptime(key.split('/')[3], "%m-%Y")
    dtTmp = datetime.strptime(str(dtTmp.year) + '-' + str(dtTmp.month), '%Y-%m')
    tmpdf.insert(0, 'Month', dtTmp, allow_duplicates=True)
    return tmpdf
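A memory-bounded variation of the loop above would aggregate each monthly frame as soon as it is built and keep only the small partial results, instead of concatenating every file into one large dataframe. This is only a sketch: it assumes the buckets are the category and marketCapBand columns from the code above and that the aggregations are counts and sums, since the real aggregation logic is not shown.

import pandas as pd

partials = []
for k, v in s3Data.items():
    tmpdf = buildDataframefromJson(k, v)
    # Aggregate this month immediately instead of keeping the full frame around.
    partials.append(
        tmpdf.groupby(['Month', 'category', 'marketCapBand'])
             .agg(count=('ticker', 'size'), totalReturn=('totalReturn', 'sum'))
             .reset_index()
    )

# The partial results are tiny compared to the raw rows, so combining them
# into the final buckets needs far less memory than one giant dataframe.
result = (pd.concat(partials)
            .groupby(['Month', 'category', 'marketCapBand'])
            .sum()
            .reset_index())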

input file is not getting read from pd.read_csv

I'm trying to read a file stored in Google Storage from Apache Beam using pandas, but I'm getting an error.
def Panda_a(self):
    import pandas as pd
    data = 'gs://tegclorox/Input/merge1.csv'
    df1 = pd.read_csv(data, names=['first_name', 'last_name', 'age',
                                   'preTestScore', 'postTestScore'])
    return df1

ip2 = p | 'Split WeeklyDueto' >> beam.Map(Panda_a)
ip7 = ip2 | 'print' >> beam.io.WriteToText('gs://tegclorox/Output/merge1234')
When I execute the above code, the error says the path does not exist. Any idea why?
A bunch of things are wrong with this code.
1) Trying to get Pandas to read a file from Google Cloud Storage: Pandas does not support the Google Cloud Storage filesystem (as #Andrew pointed out, the documentation says the supported schemes are http, ftp, s3 and file). However, you can use the Beam FileSystems.open() API to get a file object, and give that object to Pandas instead of the file path (see the sketch after this list).
2) p | ... >> beam.Map(...) - beam.Map(f) transforms every element of the input PCollection using the given function f; it can't be applied to the pipeline itself. It seems that in your case you want to simply run the Pandas code without any input. You can simulate that by supplying a bogus input, e.g. beam.Create(['ignored']).
3) beam.Map(f) requires f to return a single value (or rather: if it returns a list, it will interpret that list as a single value), but your code gives it a function that returns a Pandas dataframe. I strongly doubt that you want to create a PCollection containing a single element where this element is the entire dataframe - more likely, you want one element for every row of the dataframe. For that, you need to use beam.FlatMap, and you need df.iterrows() or something like it.
4) In general, I am not sure why you read the CSV file using Pandas at all. You can read it using Beam's ReadFromText with skip_header_lines=1 and then parse each line yourself - if you have a large amount of data, this will be a lot more efficient (and if you have only a small amount of data and do not anticipate it growing large enough to exceed the capabilities of a single machine - say, if it will never be above a few GB - then Beam is the wrong tool).
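A minimal sketch combining points 1-3, assuming the same bucket, columns and pipeline object p as in the question; the step names and the read_csv_rows helper are my own illustrations, not a fixed Beam API:

import apache_beam as beam
import pandas as pd
from apache_beam.io.filesystems import FileSystems

def read_csv_rows(_):
    # Open the GCS object through Beam's filesystem layer and hand the
    # resulting file object to pandas instead of the gs:// path.
    f = FileSystems.open('gs://tegclorox/Input/merge1.csv')
    df = pd.read_csv(f, names=['first_name', 'last_name', 'age',
                               'preTestScore', 'postTestScore'])
    # FlatMap expects an iterable: emit one dict per row.
    for _, row in df.iterrows():
        yield row.to_dict()

ip2 = (p
       | 'Seed' >> beam.Create(['ignored'])
       | 'ReadWithPandas' >> beam.FlatMap(read_csv_rows))
ip7 = ip2 | 'print' >> beam.io.WriteToText('gs://tegclorox/Output/merge1234')

As noted in point 4, for anything beyond a small file, ReadFromText with skip_header_lines=1 would avoid pandas entirely.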

Saving scraped documents in two sheets in an excel file

I've created a scraper which is supposed to parse some documents from a webpage and save them to an Excel file, creating two sheets. However, when I run it, I can see that it only saves the documents of the last link, in a single sheet, whereas there should be two sheets with the documents from the two links. I even printed the results to see what is happening in the background, but I found nothing wrong there. I think the first sheet is overwritten and the second one is never created. How do I get around this so that the data is saved in two sheets in an Excel file? Thanks in advance for taking a look at it.
Here is my code:
import requests
from lxml import html
from pyexcel_ods3 import save_data

name_list = ['Altronix', 'APC']

def docs_parser(link, name):
    res = requests.get(link)
    root = html.fromstring(res.text)
    vault = {}
    for post in root.cssselect(".SubBrandList a"):
        if post.text == name:
            refining_docs(post.attrib['href'], vault)

def refining_docs(new_link, vault):
    res = requests.get(new_link).text
    root = html.fromstring(res)
    sheet = root.cssselect("#BrandContent h2")[0].text
    for elem in root.cssselect(".ProductDetails"):
        name_url = elem.cssselect("a[class]")[0].attrib['href']
        vault.setdefault(sheet, []).append([str(name_url)])
    save_data("docs.ods", vault)

if __name__ == '__main__':
    for name in name_list:
        docs_parser("http://store.immediasys.com/brands/", name)
But when I write code the same way for another site, it meets the expectation, creating different sheets and saving the documents in them. Here is the link:
https://www.dropbox.com/s/bgyh1xxhew8hcvm/Pyexcel_so.txt?dl=0
Question: I think the first sheet is overwritten and the second one is never created. How do I get around this so that the data is saved in two sheets in an Excel file?
You overwrite the workbook file on every link that is appended.
You should never call save_data(...) within a loop, only once at the end of your script.
Comparing your two scripts, there is no difference; both behave the same, overwriting the workbook file again and again. Maybe the file I/O gets overloaded, as you overwrite the workbook file more than 160 times within a short time.
The first script should create 13 sheets:
data sheet:powerpivot-etc links:20
data sheet:flappy-owl-videos links:1
data sheet:reporting-services-videos links:20
data sheet:csharp links:14
data sheet:excel-videos links:9
data sheet:excel-vba-videos links:20
data sheet:sql-server-videos links:9
data sheet:report-builder-2016-videos links:4
data sheet:ssrs-2016-videos links:5
data sheet:sql-videos links:20
data sheet:integration-services links:19
data sheet:excel-vba-user-form links:20
data sheet:archived-videos links:16
The second script should create 2 sheets:
vault sheet:Altronix links:16
vault sheet:APC links:16
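A minimal sketch of the suggested fix for the Altronix/APC script: share one vault across the loop and write the workbook a single time at the end. The changed signature of docs_parser is my adjustment for illustration, not part of the original code:

def docs_parser(link, name, vault):
    res = requests.get(link)
    root = html.fromstring(res.text)
    for post in root.cssselect(".SubBrandList a"):
        if post.text == name:
            refining_docs(post.attrib['href'], vault)

# refining_docs() stays as it is, except that its save_data() call is removed.

if __name__ == '__main__':
    vault = {}
    for name in name_list:
        docs_parser("http://store.immediasys.com/brands/", name, vault)
    save_data("docs.ods", vault)   # written once, with one sheet per brand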
