Google Forms API Record Limit? - python

When I use the service.forms().responses().list() snippet I am getting 5,000 rows in my df. When I look at the front end sheet I see more than 5,000 rows. Is there a limit to the number of rows that can be pulled?

You are hitting the default page size, as per this documentation.
When you don't specify a page size, the API returns up to 5,000 results and provides a page token if there are more. Pass that page token on the next call to get the next batch, again up to 5,000, with a new page token if you hit the page size again.
Note that you can set the page size to a larger number, but I don't recommend it; I'm only mentioning it for completeness. You will still run into the same issue once you hit the larger page size, and there is a hard limit on Google's side (which is not specified/documented).
You can create a simple recursive function that handles this for you.
def get_all(service, formId, responses=None, pageToken=None) -> list:
    if responses is None:
        responses = []
    # call the API
    res = service.forms().responses().list(formId=formId, pageToken=pageToken).execute()
    # append this page's responses (the key is absent when there are none)
    responses.extend(res.get("responses", []))
    # check for nextPageToken
    if "nextPageToken" in res:
        # recursively call the API again for the next page
        return get_all(service, formId, responses=responses, pageToken=res["nextPageToken"])
    return responses
Please note that I didn't test this one; Google's Python client libraries are not exactly consistent, so if the call fails, check whether formId and pageToken need to be passed to .responses() instead of .list().
Let me know if this is the case so I can update the answer accordingly...

Related

Getting all review requests from Review Board Python Web API

I would like to get the information about all review requests from my server. This is the code I used to try to achieve that:
from rbtools.api.client import RBClient
client = RBClient('http://my-server.net/')
root = client.get_root()
reviews = root.get_review_requests()
The variable reviews contains just 25 review requests (I expected much, much more). What's even stranger, I tried something a bit different:
count = root.get_review_requests(counts_only=True)
Now count.count is equal to 17164. How can I extract the rest of my reviews? I tried to check the official documentation but I haven't found anything connected to my problem.
According to the documentation (https://www.reviewboard.org/docs/manual/dev/webapi/2.0/resources/review-request-list/#webapi2.0-review-request-list-resource), counts_only is just a Boolean flag that indicates the following:
If specified, a single count field is returned with the number of results, instead of the results themselves.
But what you could do is also provide it with status, so:
count = root.get_review_requests(counts_only=True, status='all')
should return you all the requests.
Keep in mind that I didn't test this part of the code locally. I referred to their repo test example -> https://github.com/reviewboard/rbtools/blob/master/rbtools/utils/tests/test_review_request.py#L643 and the documentation link posted above.
You have to use pagination (unfortunately I can't provide exact code without being able to reproduce your setup):
The maximum number of results to return in this list. By default, this is 25. There is a hard limit of 200; if you need more than 200 results, you will need to make more than one request, using the “next” pagination link.
It looks like a pagination helper class is also available.
If you want to get 200 results you may set max_results:
requests = root.get_review_requests(max_results=200)
Anyway, HERE is a good example of how to iterate over the results.
Also, I don't recommend fetching all 17164 results in one request even if it were possible, because the total response would be huge (say each result is around 10KB, the total would be more than 171MB).
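Putting the pieces together, a rough, untested sketch of the paging loop might look like the following. It assumes the object returned by get_review_requests() behaves like rbtools' ListResource, i.e. iterating it yields the items of the current page and it exposes get_next(), which raises StopIteration when there is no further page:
from rbtools.api.client import RBClient

client = RBClient('http://my-server.net/')
root = client.get_root()

all_requests = []
# 200 is the documented hard limit per request
page = root.get_review_requests(status='all', max_results=200)
while True:
    all_requests.extend(page)      # items on the current page
    try:
        page = page.get_next()     # follows the "next" pagination link
    except StopIteration:
        break                      # no next link: last page reached

print(len(all_requests))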

Returning all Data from a CKAN API Request? (Python)

This is my first time using a CKAN Data API. I am trying to download public road accident data from a government website, but it only shows the first 100 rows. The CKAN documentation says that the default limit of rows a request returns is 100. I am pretty sure you can append a CKAN expression to the end of the URL to give you the maximum number of rows, but I am not sure how to write it. Please see the Python code below for what I have so far. Is it possible? Thanks
Is there any way I can write code similar to the pseudo CKAN request below?
url='https://data.gov.au/data/api/3/action/datastore_search?resource_id=d54f7465-74b8-4fff-8653-37e724d0ebbb&limit=MAX_ROWS'
CKAN Documentation reference: http://docs.ckan.org/en/latest/maintaining/datastore.html
There are several interesting fields in the documentation for ckanext.datastore.logic.action.datastore_search(), but the ones that pop out are limit and offset.
limit seems to have an absolute maximum of 32000 so depending on the amount of data you might still hit this limit.
offset seems to be the way to go. You keep calling the API with the offset increased by a set amount until you have all the data; a rough sketch of that idea follows, and the full code further down ends up using a more convenient mechanism.
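For illustration only (untested, with the resource id and limit taken from the question), an offset-based loop could look like this:
import requests

BASE_URL = "https://data.gov.au/data/api/3/action/datastore_search"
RESOURCE_ID = "d54f7465-74b8-4fff-8653-37e724d0ebbb"
LIMIT = 10000

records = []
offset = 0
while True:
    resp = requests.get(BASE_URL, params={
        "resource_id": RESOURCE_ID,
        "limit": LIMIT,
        "offset": offset,
    })
    batch = resp.json()["result"]["records"]
    records.extend(batch)
    if len(batch) < LIMIT:
        break  # fewer records than the limit means the end was reached
    offset += LIMIT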
But actually calling the API revealed something interesting: it generates a next URL which you can call, and that URL automagically has the offset updated based on the limit used (while keeping the limit set on the initial call).
You can call this URL to get the next batch of results.
Some testing showed that it will keep going past the end though, so you need to check whether the number of returned records is lower than the limit you used.
import requests

BASE_URL = "https://data.gov.au/data"
INITIAL_URL = "/api/3/action/datastore_search?resource_id=d54f7465-74b8-4fff-8653-37e724d0ebbb"
LIMIT = 10000

def get_all() -> list:
    result = []
    resp = requests.get(f"{BASE_URL}{INITIAL_URL}&limit={LIMIT}")
    js = resp.json()["result"]
    result.extend(js["records"])
    while "_links" in js and "next" in js["_links"]:
        resp = requests.get(BASE_URL + js["_links"]["next"])
        js = resp.json()["result"]
        result.extend(js["records"])
        print(js["_links"]["next"])  # just so you know it's actually doing stuff
        if len(js["records"]) < LIMIT:
            # if it returned fewer records than the limit, the end has been reached
            break
    return result

print(len(get_all()))
Note, when exploring an API, it helps to check what exactly is returned. I used the simple code below to check what was returned, which made exploring the API a lot easier. Also, reading the docs helps, like the one I linked above.
from pprint import pprint
pprint(requests.get(BASE_URL+INITIAL_URL+"&limit=1").json()["result"])

Oracle cloud REST API WaasPolicy and auditEvents pagination

I am trying to figure out the exact query string to successfully get the next page of results for both the waasPolicy logs and auditEvents logs. I have successfully made a query to both endpoints and returned data but the documentation does not provide any examples of how to do pagination.
my example endpoint url string:
https://audit.us-phoenix-1.oraclecloud.com/20190901/auditEvents?compartmentId={}&startTime=2021-02-01T00:22:00Z&endTime=2021-02-10T00:22:00Z
I have of course omitted my compartmentId. When I perform a GET request against this url, it successfully returns data. In order to paginate, the documentation states:
"Make a new GET request against the same URL, modified by setting the page query parameter to the value from the opc-next-page header. Repeat this process until you get a response without an opc-next-page header. The absence of this header indicates that you have reached the last page of the list."
My question is: what exactly is this meant to look like? An example would be very helpful. The response header opc-next-page for the auditEvents pagination contains a very long string of characters. Am I meant to append this to the URL in the GET request? Would it simply be something like this, replacing $(opc-next-page) with that long string from the header?
https://audit.us-phoenix-1.oraclecloud.com/20190901/auditEvents?compartmentId={}&startTime=2021-02-01T00:22:00Z&endTime=2021-02-10T00:22:00Z&page=$(opc-next-page)
And the query for waasPolicy:
https://waas.us-phoenix-1.oraclecloud.com/20181116/waasPolicies/{}/wafLogs
returns an opc-next-page header in the form of a page number. Would it simply require appending something like &page=2? (I tried this, to no avail.)
Again, I am not able to find any examples in the documentation.
https://docs.oracle.com/en-us/iaas/api/#/en/waas/20181116/WaasPolicy/GetWaasPolicy
https://docs.oracle.com/en-us/iaas/Content/API/Concepts/usingapi.htm#nine
Thank you in advance for your help
Found the answer. You need to specify &page=$(opc-next-page) AND a &limit=X parameter (where X is any integer, e.g. 500). Without the limit param, the &page= param returns a 500 error, which is slightly misleading. I will leave this up for anyone else stumbling upon this issue.
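For anyone who wants to see the loop spelled out, here is a rough, untested sketch for the auditEvents endpoint. COMPARTMENT_ID and auth are placeholders you supply: auth is assumed to be an already configured request signer (for example oci.signer.Signer from the OCI Python SDK, which can be used as a requests auth object) and COMPARTMENT_ID is your compartment OCID:
import requests

# COMPARTMENT_ID and auth are placeholders; supply your own OCID and signer
BASE_URL = "https://audit.us-phoenix-1.oraclecloud.com/20190901/auditEvents"
params = {
    "compartmentId": COMPARTMENT_ID,
    "startTime": "2021-02-01T00:22:00Z",
    "endTime": "2021-02-10T00:22:00Z",
    "limit": 500,  # without an explicit limit, the page param gave a 500 error
}

events = []
while True:
    resp = requests.get(BASE_URL, params=params, auth=auth)
    resp.raise_for_status()
    events.extend(resp.json())  # the endpoint returns a JSON list of audit events
    next_page = resp.headers.get("opc-next-page")
    if not next_page:
        break  # no opc-next-page header means the last page was reached
    params["page"] = next_page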

How can I efficiently obtain every entry from a website with a return limit?

I am interested in parsing all of the entries from the federal job site: https://www.usajobs.gov/ for data analysis.
I have read through the API and in this section: https://developer.usajobs.gov/Guides/Rate-Limiting, it says the following:
Maximum of 5,000 job records per query* (I am actually getting 10,000 job records in my output)
Maximum of 500 job records returned per request
Here is the rest of the API reference: https://developer.usajobs.gov/API-Reference
So here is my question:
How can I go to the next 10,000 until all records are found?
What I am doing:
response = requests.get('https://data.usajobs.gov/api/Search?Page=20&ResultsPerPage=500', headers=headers)
That gives me 500 results per page as JSON, and I dump every page into one .json file, incrementing the Page parameter in a loop up to page 20, which ends up being all 10,000. I'm not sure what to do to get the next 10,000 until all entries are found.
Another idea is that I can do a query for each state but the downside is that I will lose everything outside of the U.S.
If someone can point me in the right direction for a better, simpler, and more efficient way to get all the entries than my proposed ideas, I would appreciate that too.
The server likely gives some error when it can't find more pages. Try something like
"...?Page=25000&..."
just to see what it gives, then use a while loop with a manually managed incrementer instead of a for loop. The stopping condition for the while loop is to check if the server returns the error page.
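A rough, untested sketch of that while loop follows. It assumes the response JSON nests the results under SearchResult / SearchResultItems as described in the API reference, and that headers are the same auth headers used in the question:
import requests

def get_all_jobs(headers) -> list:
    all_jobs = []
    page = 1
    while True:
        resp = requests.get(
            f"https://data.usajobs.gov/api/Search?Page={page}&ResultsPerPage=500",
            headers=headers,
        )
        if resp.status_code != 200:
            break  # the server errored: assume we ran past the last page
        items = resp.json()["SearchResult"]["SearchResultItems"]
        if not items:
            break  # an empty page also means there is nothing left
        all_jobs.extend(items)
        page += 1
    return all_jobs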

google-cloud-storage python list_blobs performance

I have a very simple python function:
from google.cloud import storage

def list_blobs(bucket, project):
    storage_client = storage.Client(project=project)
    bucket = storage_client.get_bucket(bucket)
    blobs = bucket.list_blobs(prefix='basepath/', max_results=999999,
                              fields='items(name,md5Hash),nextPageToken')
    r = [(b.name, b.md5_hash) for b in blobs]
    return r
The blobs list contains 14599 items, and this code takes 7 seconds to run.
When profiling, most of the time is spent waiting on the server (there are 16 calls to page_iterator._next_page).
So, how can I improve here? The iteration code is deep in the library, and the token for each page comes from the previous page, so I see no way to fetch the 16 pages in parallel and cut down those 7 seconds.
I am on python 3.6.8,
google-api-core==1.7.0
google-auth==1.6.2
google-cloud-core==0.29.1
google-cloud-storage==1.14.0
google-resumable-media==0.3.2
googleapis-common-protos==1.5.6
protobuf==3.6.1
Your max_results=999999 is larger than 14599 (the number of objects), forcing all results into a single page. From Bucket.list_blobs():
Parameters:
max_results (int) – (Optional) The maximum number of blobs in each page of results from this request. Non-positive values are ignored.
Defaults to a sensible value set by the API.
My guess is that the code spends a lot of time blocked waiting for the server to provide the info needed to iterate through the results.
So the 1st thing I'd try would be to actually iterate through multiple pages, using a max_results smaller than the number of blobs. Maybe 1000 or 2000 and see the impact on overall duration?
Maybe even trying to use the multiple pages explicitly, using blobs.pages, as suggested in the deprecated page_token property doc (emphasis mine):
page_token (str) – (Optional) If present, return the next batch of blobs, using the value, which must correspond to the nextPageToken value returned in the previous response. Deprecated: use the pages property of the returned iterator instead of manually passing the token.
But I'm not quite sure how to force the multiple pages to be simultaneously pulled. Maybe something like this?
[(b.name, b.md5_hash) for page in blobs.pages for b in page]
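Putting both suggestions together, a quick way to test might be something like the sketch below (not verified against these exact library versions; it treats max_results as the page size, per the doc quoted above, and times the full listing so you can compare against the original 7 seconds; 'my-bucket' and 'my-project' are placeholder names):
import time
from google.cloud import storage

def time_listing(bucket_name, project, page_size):
    client = storage.Client(project=project)
    bucket = client.get_bucket(bucket_name)
    start = time.perf_counter()
    blobs = bucket.list_blobs(prefix='basepath/', max_results=page_size,
                              fields='items(name,md5Hash),nextPageToken')
    # iterate page by page, as the deprecation note above suggests
    result = [(b.name, b.md5_hash) for page in blobs.pages for b in page]
    print(f"page_size={page_size}: {len(result)} blobs in "
          f"{time.perf_counter() - start:.2f}s")
    return result

for size in (1000, 2000, 5000):
    time_listing('my-bucket', 'my-project', size)  # placeholder bucket/project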
