FastAPI Endpoint Pagination with Generator - python

I converted my path operation into a generator, as shown below, so that a call to the endpoint would return a generator and I could loop over pages of data instead of getting all the data at once. What I observe, however, is that I still get all the data at once. Instead of a list of dictionaries, I now get a two-dimensional list: multiple lists, each as long as my page size, whose elements are the dictionaries. Why did the generator behave this way, and what should I use to implement pagination? Does storing the data in a global variable and returning a chunk of it on each call to the endpoint make sense?
@overlaps_router.get("/query")
def query_overlaps(query: str,
                   testdrive_id_list: List[str] = Query(None),
                   predecessor: bool = False,
                   db: Session = Depends(get_db)):
    try:
        data_to_paginate = []
        if not data_to_paginate:
            logger.info("No data to paginate. Data will be calculated!")
            results = crud.query_overlaps(query, db, testdrive_id_list)
            logger.info(f"First row of results: {results[0]}")
            if predecessor:
                result_with_predecessor = calculate_predecessor(results)
                data_to_paginate = result_with_predecessor
            else:
                data_to_paginate = process_result(results)
        logger.info(f"Length of data to paginate: {len(data_to_paginate)}")
        limit = environment.env_variables["PAGINATION_OFFSET"]
        page = []
        for datum in data_to_paginate:
            if len(page) < limit:
                page.append(datum)
            else:
                yield page
                page = []
        yield page
    except Exception as e:
        logger.exception(f"Exception: {str(e)}")
The endpoint is called from somewhere else:
# This one returns three lists in another list because
# data size is 2125 records and PAGINATION_OFFSET env
# variable is 1000, so it still returns all data at once
# in three different lists.
result_text = requests.get(request_string, params=params).text
print(json.loads(result_text))
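When a path operation returns a generator, FastAPI's JSON serialization consumes it into a list before the response is built, which is why the pages come back together as one two-dimensional list. A more usual approach is to let the client ask for one page at a time through query parameters. Below is a minimal sketch of that idea; overlaps_router, crud.query_overlaps, process_result and get_db are taken from the question, while the page parameter and PAGE_SIZE constant are assumptions for illustration (the predecessor branch is omitted to keep it short):
from typing import List, Optional
from fastapi import Depends, Query
from sqlalchemy.orm import Session

PAGE_SIZE = 1000  # assumed; could also come from the PAGINATION_OFFSET env variable

@overlaps_router.get("/query")
def query_overlaps(query: str,
                   page: int = 0,
                   testdrive_id_list: Optional[List[str]] = Query(None),
                   db: Session = Depends(get_db)):
    # Recompute (or load from a cache) the full result, then return one page only.
    results = crud.query_overlaps(query, db, testdrive_id_list)
    data = process_result(results)
    start = page * PAGE_SIZE
    return data[start:start + PAGE_SIZE]
The client then requests /query?page=0, /query?page=1, and so on until it receives an empty list.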

Related

How can I iterate the entries of an API?

I have this API: https://api.publicapis.org/entries
I want to iterate over its entries key. I tried the following:
r = requests.get('https://api.publicapis.org/entries')
entries = [] #initializing the vector entries
for i in entries: #iterating it
    return(i.text) #trying to print the entries
Then I received the following error:
TypeError: The view function did not return a valid response. The function either returned None or ended without a return statement.
How can I solve this problem?
For that particular API endpoint, you should be fine with
resp = requests.get('https://api.publicapis.org/entries')
resp.raise_for_status() # raise exception on HTTP errors
entries = resp.json()["entries"] # parse JSON, dig out the entries list
# ... do something with entries.
You can use json.loads to parse the response text. Here is the full code:
import requests
import json

r = requests.get('https://api.publicapis.org/entries')
entries = json.loads(r.text)['entries']  # parse the JSON and pull out the entries list
for i in entries:  # iterate over the entries
    API = i['API']
    Description = i['Description']

Faster alternative to slicing and list comprehension to get an attribute of the last N items in a list

I am looking to speed up my code. I'll recap it in the following:
I start by reading certain CSV files from a folder and store the data in a dict:
Everything is part of methods inside classes.
# csv_dir: Folder with all the CSV files
# list: List of which CSV files from the above folder to read
data = {}
for s in list:
    data[s] = pd.io.parsers.read_csv(
        os.path.join(csv_dir, '%s.csv' % s),
        header=0, index_col=0, parse_dates=True, date_parser=dateparse,
        names=names
    )
Then I reindex and create a generator:
# comb_index: Combined index made previously in the code
reindexMethod = None
for s in list:
    data[s] = data[s].reindex(index=comb_index, method=reindexMethod)
    data[s] = data[s].iterrows()
Then I loop over all entries, where I do the following in each iteration:
- Get a new observation using yield
- Store the new observation in a list together with all previously yielded observations
- Call a function that:
  - Gets the last N entries from the above list using get_latest_data (defined below together with its helper method)
  - Calculates stuff
Code definitions:
# self (part of the same class as the rest of the code)
# x: which file to retrieve the latest N datapoints from
# N: Number of datapoints to retrieve
def get_latest_data(x, val_type, N=1):
    try:
        data_list = get_latest_entries(x, N)
    except KeyError:
        print("Not available.")
        raise
    else:
        return np.array([getattr(b[1], val_type) for b in data_list])
with the helper method:
# latest_data: Dict with all previously looped through data for each CSV file
def get_latest_entries(x, N=1):
    try:
        bars_list = latest_data[x]
    except KeyError:
        print("Not available.")
        raise
    else:
        return bars_list[-N:]
My profiler says that the list comprehension is the problem, but I don't know a better alternative.
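One direction that can avoid the getattr list comprehension entirely is to store each attribute's values in its own buffer as the rows arrive, so the last N values come out with a single slice. This is a sketch of an alternative, not taken from the question: the per-column buffers, the update_latest helper and the column names are assumptions.
import collections
import numpy as np

MAX_KEEP = 10000  # assumed upper bound on how far back we ever look

# One deque per CSV file and per column, e.g. latest_values['SPY']['close']
latest_values = collections.defaultdict(
    lambda: collections.defaultdict(lambda: collections.deque(maxlen=MAX_KEEP)))

def update_latest(x, bar, columns=("open", "high", "low", "close")):
    # Record one new row (the Series part of an iterrows() tuple) column by column.
    for col in columns:
        latest_values[x][col].append(getattr(bar, col))

def get_latest_data(x, val_type, N=1):
    # Return the last N values of one column as a NumPy array, with no per-row getattr.
    buf = latest_values[x][val_type]
    return np.array(list(buf)[-N:])
update_latest would be called once per yielded bar (where the code currently appends to latest_data), and this get_latest_data then replaces the slicing-plus-comprehension version above.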

Easiest way to implement multithreading in this function [Python]

So I have data known as id_list coming into the function in this format: [(u'SGP-3630', 1202), (u'MTSCR-534', 1244)]. The format is two values paired together, and there could be one pair or a hundred pairs.
This is the function:
def ListParser(id_list):
    list_length = len(id_list)
    count = 0
    table = ""
    while count < list_length:
        jira = id_list[count][0]
        stash = id_list[count][1]
        count = count + 1
        table = table + RetrieveFromAPI(stash, jira)
    table = TableFormatter(table)
    table = TableColouriser(table)
    return table
What this function does is go through the list, extract the pairs, and put them through a function called RetrieveFromAPI(), which fetches information from a URL.
Does anyone have an idea how to implement multithreading here? I've had a shot at splitting the pairs up into their own lists and getting the pool to iterate through each list, but it hasn't quite worked.
def ListParser(id_list):
    pool = ThreadPool(4)
    list_length = len(id_list)
    count = 0
    table = ""
    jira_list = list()
    stash_list = list()
    while count < list_length:
        jira_list = jira_list.extend(id_list[count][0])
        print jira_list
        stash_list = stash_list.extend(id_list[count][1])
        print stash_list
        count = count + 1
    table = table + pool.map(RetrieveFromAPI, stash_list, jira_list)
    table = TableFormatter(table)
    table = TableColouriser(table)
    return table
The error I'm getting for this attempt is TypeError: 'int' object is not iterable
EDIT 2: Okay so I've managed to get the first list with tuples split up into two different lists, but I'm unsure how to get multithreading working with it.
jira, stash = map(list, zip(*id_list))
You're working too hard! From help(multiprocessing.pool.ThreadPool):
map(self, func, iterable, chunksize=None)
    Apply `func` to each element in `iterable`, collecting the results
    in a list that is returned.
The second argument is an iterable of the arguments you want to pass to the worker threads. You have a list of lists and you want the first two items from the inner list for each call. id_list is already iterable, so we're close. A small function (in this case implemented as a lambda) bridges the gap.
I worked up a full mock solution just to make sure it works, so here it goes. As an aside, you can benefit from a fairly large pool size since they spend much of their time waiting on I/O.
from multiprocessing.pool import ThreadPool

def RetrieveFromAPI(stash, jira):
    # boring mock of api
    return '{}-{}.'.format(stash, jira)

def TableFormatter(table):
    # mock
    return table

def TableColouriser(table):
    # mock
    return table

def ListParser(id_list):
    if id_list:
        pool = ThreadPool(min(12, len(id_list)))
        table = ''.join(pool.map(lambda item: RetrieveFromAPI(item[1], item[0]),
                                 id_list, chunksize=1))
        pool.close()
        pool.join()
    else:
        table = ''
    table = TableFormatter(table)
    table = TableColouriser(table)
    return table

id_list = [[0, 1, 'foo'], [2, 3, 'bar'], [4, 5, 'baz']]
print(ListParser(id_list))
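For newer Python 3 code, a roughly equivalent sketch (not part of the original answer) using the standard-library concurrent.futures executor looks like this, reusing the same mock RetrieveFromAPI, TableFormatter and TableColouriser as above:
from concurrent.futures import ThreadPoolExecutor

def ListParser(id_list):
    if not id_list:
        table = ''
    else:
        # One worker per pair, capped at 12, since the work is I/O bound.
        with ThreadPoolExecutor(max_workers=min(12, len(id_list))) as pool:
            parts = pool.map(lambda item: RetrieveFromAPI(item[1], item[0]), id_list)
            table = ''.join(parts)
    return TableColouriser(TableFormatter(table))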

Returning the entire dataset using Google App Engine indexed search

Is there any way to fetch the entire dataset in an App Engine search index? The search below takes an integer limit through QueryOptions, and that limit always needs to be present.
I'm unable to determine if there is some special flag that can bypass this limit and return the entire result set. If the query is made without QueryOptions, the result set is somehow limited to 20.
_INDEX = search.Index(name=constants.SEARCH_INDEX)
_INDEX.search(query=search.Query(
    query,
    options=search.QueryOptions(
        limit=limit,
        sort_options=search.SortOptions(...))))
Any ideas?
You could customise the delete-all example, if indeed you want every document in the index rather than every result in a query: https://cloud.google.com/appengine/docs/python/search/#Python_Deleting_documents_from_an_index
from google.appengine.api import search

def delete_all_in_index(index_name):
    """Delete all the docs in the given index."""
    doc_index = search.Index(name=index_name)
    # looping because get_range by default returns up to 100 documents at a time
    while True:
        # Get a list of documents populating only the doc_id field and extract the ids.
        document_ids = [document.doc_id
                        for document in doc_index.get_range(ids_only=True)]
        if not document_ids:
            break
        # Delete the documents for the given ids from the Index.
        doc_index.delete(document_ids)
So you might end up with something like:
while True:
    document_ids = [document.doc_id
                    for document in doc_index.get_range(ids_only=True)]
    if not document_ids:
        break
    # Then do something with each document
    for id in document_ids:
        document = doc_index.get(id)
You'd probably want to get the document itself in the list comprehension rather than getting the ID and then getting the document from that ID, but you get the idea.
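A sketch of that idea (not part of the original answer, and assuming get_range's documented start_id and include_start_object arguments) that walks the whole index and yields the documents themselves:
from google.appengine.api import search

def iter_all_documents(index_name):
    # Yield every document in the index, up to 100 per get_range call.
    doc_index = search.Index(name=index_name)
    start_id = None
    while True:
        if start_id is None:
            documents = list(doc_index.get_range())
        else:
            # Resume just after the last document we have already seen.
            documents = list(doc_index.get_range(start_id=start_id,
                                                 include_start_object=False))
        if not documents:
            break
        for document in documents:
            yield document
        start_id = documents[-1].doc_id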
Firstly, a peek into the constructor of QueryOptions answers your question about why it returns 20 results:
def __init__(self, limit=20, number_found_accuracy=None, cursor=None,
             offset=None, sort_options=None, returned_fields=None,
             ids_only=False, snippeted_fields=None,
             returned_expressions=None):
I think the reason the API does this is to avoid unnecessarily fetching results. You should use an offset if you need to fetch more results upon user action, instead of always fetching all of the results. See this.
from google.appengine.api import search
...
# get the first set of results
page_size = 10
results = index.search(search.Query(query_string='some stuff',
                                    options=search.QueryOptions(limit=page_size)))
# calculate pages
pages = results.found_count / page_size
# user chooses page and hence an offset into results
next_page = ith * page_size
# get the search results for that page
results = index.search(search.Query(query_string='some stuff',
                                    options=search.QueryOptions(limit=page_size, offset=next_page)))

How to use ResultSet in PyES

I'm using PyES to use ElasticSearch in Python.
Typically, I build my queries in the following format:
# Create connection to server.
conn = ES('127.0.0.1:9200')
# Create a filter to select documents with 'stuff' in the title.
myFilter = TermFilter("title", "stuff")
# Create query.
q = FilteredQuery(MatchAllQuery(), myFilter).search()
# Execute the query.
results = conn.search(query=q, indices=['my-index'])
print type(results)
# > <class 'pyes.es.ResultSet'>
And this works perfectly. My problem begins when the query returns a large list of documents.
Converting the results to a list of dictionaries is computationally demanding, so I'm trying to return the query results already as a dictionary. I came across this documentation:
http://pyes.readthedocs.org/en/latest/faq.html#id3
http://pyes.readthedocs.org/en/latest/references/pyes.es.html#pyes.es.ResultSet
https://github.com/aparo/pyes/blob/master/pyes/es.py (line 1304)
But I can't figure out what exactly I'm supposed to do.
Based on the previous links, I've tried this:
from pyes import *
from pyes.query import *
from pyes.es import ResultSet
from pyes.connection import connect
# Create connection to server.
c = connect(servers=['127.0.0.1:9200'])
# Create a filter to select documents with 'stuff' in the title.
myFilter = TermFilter("title", "stuff")
# Create query / Search object.
q = FilteredQuery(MatchAllQuery(), myFilter).search()
# (How to) create the model ?
mymodel = lambda x, y: y
# Execute the query.
# class pyes.es.ResultSet(connection, search, indices=None, doc_types=None,
# query_params=None, auto_fix_keys=False, auto_clean_highlight=False, model=None)
resSet = ResultSet(connection=c, search=q, indices=['my-index'], model=mymodel)
# > resSet = ResultSet(connection=c, search=q, indices=['my-index'], model=mymodel)
# > TypeError: __init__() got an unexpected keyword argument 'search'
Was anyone able to get a dict from the ResultSet?
Any good suggestion for efficiently converting the ResultSet to a (list of) dictionaries would be appreciated too.
I tried many ways to cast the ResultSet directly into a dict, but got nothing to work. The best way I use these days is to append the ResultSet items to another list or dict; the ResultSet exposes every single item in itself as a dict.
Here is how I use it:
# create a response dictionary
response = {"status_code": 200, "message": "Successful", "content": []}
# set the result set as the content of the response
response["content"] = [result for result in resultset]
# return a JSON object
return json.dumps(response)
It's not that complicated: just iterate over the result set, for example with a for loop:
for item in results:
    print item
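If you need plain dicts rather than ResultSet items, then given the observation above that each item already behaves like a dict, a comprehension along these lines should do (a small sketch, assuming the items support dict()):
# Build a plain list of dictionaries from the ResultSet.
docs = [dict(item) for item in results]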
