I would like to get information about all the reviews on my server. Here's the code I used to try to achieve that:
from rbtools.api.client import RBClient
client = RBClient('http://my-server.net/')
root = client.get_root()
reviews = root.get_review_requests()
The variable reviews contains just 25 review requests (I expected much, much more). What's even stranger, I tried something a bit different:
count = root.get_review_requests(counts_only=True)
Now count.count is equal to 17164. How can I extract the rest of my reviews? I checked the official documentation but haven't found anything related to my problem.
According to the documentation (https://www.reviewboard.org/docs/manual/dev/webapi/2.0/resources/review-request-list/#webapi2.0-review-request-list-resource), counts_only is just a Boolean flag that means the following:
If specified, a single count field is returned with the number of results, instead of the results themselves.
But what you could do is also pass it a status, so:
count = root.get_review_requests(counts_only=True, status='all')
should return the count of all requests, regardless of their status.
Keep in mind that I didn't test this part of the code locally. I referred to their repo test example -> https://github.com/reviewboard/rbtools/blob/master/rbtools/utils/tests/test_review_request.py#L643 and the documentation link posted above.
You have to use pagination (unfortunately I can't provide exact code without being able to reproduce your setup):
The maximum number of results to return in this list. By default, this is 25. There is a hard limit of 200; if you need more than 200 results, you will need to make more than one request, using the “next” pagination link.
It looks like a pagination helper class is also available.
If you want to get up to 200 results per request, you can set max_results:
requests = root.get_review_requests(max_results=200)
Anyway, HERE is a good example of how to iterate over results.
Also, I don't recommend fetching all 17164 results in a single request even if it were possible, because the total response would be huge (say each result is 10KB; the total would then be more than 171MB).
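Putting the pieces above together, here is a rough, untested sketch of walking through every page. It assumes the list resource is iterable and that get_next() follows the "next" pagination link and raises StopIteration when there are no more pages (that is how I understand the rbtools API, but verify against your version); status='all' is optional and includes submitted and discarded requests.

from rbtools.api.client import RBClient

client = RBClient('http://my-server.net/')
root = client.get_root()

# First page; 200 is the documented per-request hard limit.
requests = root.get_review_requests(max_results=200, status='all')
all_requests = []
while True:
    for review_request in requests:
        all_requests.append(review_request)
    try:
        # follow the "next" pagination link
        requests = requests.get_next()
    except StopIteration:
        break

print(len(all_requests))

With ~17000 review requests this will take a while and accumulate a lot of data in memory, so consider processing each page as you go instead of collecting everything first.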
Related
This is my first time using a CKAN Data API. I am trying to download public road accident data from a government website, but it only shows the first 100 rows. The CKAN documentation says that the default limit on the number of rows it requests is 100. I am pretty sure you can append a CKAN expression to the end of the URL to get the maximum number of rows, but I am not sure how to write it. Please see below for what I have so far. Is it possible? Thanks
Is there any way I can write code similar to the pseudo CKAN request below?
url='https://data.gov.au/data/api/3/action/datastore_search?resource_id=d54f7465-74b8-4fff-8653-37e724d0ebbb&limit=MAX_ROWS'
CKAN Documentation reference: http://docs.ckan.org/en/latest/maintaining/datastore.html
There are several interesting fields in the documentation for ckanext.datastore.logic.action.datastore_search(), but the ones that pop out are limit and offset.
limit seems to have an absolute maximum of 32000 so depending on the amount of data you might still hit this limit.
offset seems to be the way to go. You keep calling the API with the offset increasing by a set amount until you have all the data. See the code below.
But actually calling the API revealed something interesting: it generates a next URL which you can call, and it automagically updates the offset based on the limit used (while keeping the limit set in the initial call).
You can call this URL to get the next batch of results.
Some testing showed that it will go past the maximum though, so you need to check whether the number of returned records is lower than the limit you used.
import requests
BASE_URL = "https://data.gov.au/data"
INITIAL_URL = "/api/3/action/datastore_search?resource_id=d54f7465-74b8-4fff-8653-37e724d0ebbb"
LIMIT = 10000
def get_all() -> list:
    result = []
    resp = requests.get(f"{BASE_URL}{INITIAL_URL}&limit={LIMIT}")
    js = resp.json()["result"]
    result.extend(js["records"])
    while "_links" in js and "next" in js["_links"]:
        resp = requests.get(BASE_URL + js["_links"]["next"])
        js = resp.json()["result"]
        result.extend(js["records"])
        print(js["_links"]["next"])  # just so you know it's actually doing stuff
        if len(js["records"]) < LIMIT:
            # if it returned fewer records than the limit, the end has been reached
            break
    return result

print(len(get_all()))
Note, when exploring an API, it helps to check what exactly is returned. I used the simple code below to check what was returned, which made exploring the API a lot easier. Also, reading the docs helps, like the one I linked above.
from pprint import pprint
pprint(requests.get(BASE_URL+INITIAL_URL+"&limit=1").json()["result"])
When I use the service.forms().responses().list() snippet I am getting 5,000 rows in my df. When I look at the front end sheet I see more than 5,000 rows. Is there a limit to the number of rows that can be pulled?
You are hitting the default page size, as per this documentation.
When you don't specify a page size, it will return up to 5000 results, and provide you a page token if there are more results. Use this page token to call the API again, and you get the rest, up to 5000 again, with a new page token if you hit the page size again.
Note that you could set the page size to a larger number, but don't do this; it's not the method you should use, and I'm only including it for completeness' sake. You would still run into the same issue once you hit the larger page size, and there's a hard limit on Google's side (which is not specified/documented).
You can create a simple recursive function that handles this for you.
def get_all(service, formId, responses=None, pageToken=None) -> list:
    if responses is None:
        responses = []
    # call the API
    res = service.forms().responses().list(formId=formId, pageToken=pageToken).execute()
    # append this page's responses (the key is absent when there are none)
    responses.extend(res.get("responses", []))
    # check for nextPageToken
    if "nextPageToken" in res:
        # recursively call the API again for the next page
        return get_all(service, formId, responses=responses, pageToken=res["nextPageToken"])
    else:
        return responses
Please note that I didn't test this one. The formId and pageToken arguments should go on .list() rather than on .responses(), which is what the snippet above does, but the implementation of Google's Python libraries is not exactly consistent, so you may need to shuffle them around.
Let me know if this is the case so I can update the answer accordingly...
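For completeness, here is a minimal, untested sketch of how this helper might be called. The token file, the scope, and "YOUR_FORM_ID" are placeholders/assumptions; substitute however you already obtain credentials.

from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

# Assumption: a stored OAuth token file; replace with your own auth flow.
creds = Credentials.from_authorized_user_file(
    "token.json",
    scopes=["https://www.googleapis.com/auth/forms.responses.readonly"],
)
service = build("forms", "v1", credentials=creds)

# "YOUR_FORM_ID" is a placeholder for your actual form ID.
all_responses = get_all(service, "YOUR_FORM_ID")
print(len(all_responses))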
I am trying to figure out the exact query string to successfully get the next page of results for both the waasPolicy logs and auditEvents logs. I have successfully made a query to both endpoints and returned data but the documentation does not provide any examples of how to do pagination.
My example endpoint URL:
https://audit.us-phoenix-1.oraclecloud.com/20190901/auditEvents?compartmentId={}&startTime=2021-02-01T00:22:00Z&endTime=2021-02-10T00:22:00Z
I have of course omitted my compartmentId. When I perform a GET request against this URL, it successfully returns data. In order to paginate, the documentation states:
"Make a new GET request against the same URL, modified by setting the page query parameter to the value from the opc-next-page header. Repeat this process until you get a response without an opc-next-page header. The absence of this header indicates that you have reached the last page of the list."
My question is: what exactly is this meant to look like? An example would be very helpful. The response header opc-next-page for the auditEvents pagination contains a very long string of characters. Am I meant to append this to the URL in the GET request? Would it simply be something like this, of course replacing $(opc-next-page) with that long string from the header?
https://audit.us-phoenix-1.oraclecloud.com/20190901/auditEvents?compartmentId={}&startTime=2021-02-01T00:22:00Z&endTime=2021-02-10T00:22:00Z&page=$(opc-next-page)
And the query for waasPolicy:
https://waas.us-phoenix-1.oraclecloud.com/20181116/waasPolicies/{}/wafLogs
returns an opc-next-page header in the form of a page number. Would it simply require appending something like &page=2? (I tried this to no avail.)
Again, I am not able to find any examples in the documentation.
https://docs.oracle.com/en-us/iaas/api/#/en/waas/20181116/WaasPolicy/GetWaasPolicy
https://docs.oracle.com/en-us/iaas/Content/API/Concepts/usingapi.htm#nine
Thank you in advance for your help
Found the answer. You need to specify &page=$(opc-next-page) AND a &limit=X parameter (where X is any integer, e.g. 500). Without the limit param, the &page= param returns a 500 error, which is slightly misleading. I will leave this up for anyone else stumbling upon this issue.
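For anyone who wants a concrete loop, here is a rough, untested sketch of that pattern with requests. The auth setup is an assumption (it uses the oci SDK's request signer built from the default ~/.oci/config profile; substitute whatever you already use for your successful GET requests), and the response is assumed to be a JSON array of events. The compartmentId placeholder is left as in the question.

import requests
from oci.config import from_file
from oci.signer import Signer

# Assumed auth: OCI request signer from the default config profile.
config = from_file()
signer = Signer(
    tenancy=config["tenancy"],
    user=config["user"],
    fingerprint=config["fingerprint"],
    private_key_file_location=config["key_file"],
)

# compartmentId placeholder left as in the question; limit is required
# for &page= to work, as noted above.
base_url = (
    "https://audit.us-phoenix-1.oraclecloud.com/20190901/auditEvents"
    "?compartmentId={}&startTime=2021-02-01T00:22:00Z&endTime=2021-02-10T00:22:00Z"
    "&limit=500"
)

events = []
page = None
while True:
    url = base_url if page is None else f"{base_url}&page={page}"
    resp = requests.get(url, auth=signer)
    events.extend(resp.json())  # assumed: body is a JSON array of events
    page = resp.headers.get("opc-next-page")
    if page is None:
        # no opc-next-page header: the last page has been reached
        break

print(len(events))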
I have to scrape tweets from Twitter for a specific user (#salvinimi), from January 2018. The issue is that there are a lot of tweets in this range of time, and so I am not able to scrape all the ones I need!
I tried multiple solutions:
1)
pip install twitterscraper
from twitterscraper import query_tweets_from_user as qtfu
tweets = qtfu(user='matteosalvinimi')
With this method, I get only a few tweets (500~600, more or less) instead of all of them... Do you know why?
2)
!pip install twitter_scraper
from twitter_scraper import get_tweets
tweets = []
for i in get_tweets('matteosalvinimi', pages=100):
    tweets.append(i)
With this method I get an error -> "ParserError: Document is empty"...
If I set "pages=40", I get the tweets without errors, but not all of them. Do you know why?
Three things for the first issue you encounter:
First of all, every API has its limits, and one like Twitter's can be expected to monitor usage and eventually stop a user from retrieving data if they ask for more than the limits allow. Trying to get around the API's limitations might not be the best idea and could result in being banned from the site, among other things (I'm guessing here, as I don't know Twitter's policy on the matter). That said, the documentation of the library you're using states:
With Twitter's Search API you can only sent 180 Requests every 15 minutes. With a maximum number of 100 tweets per Request this means you can mine for 4 x 180 x 100 = 72.000 tweets per hour.
By using TwitterScraper you are not limited by this number but by your internet speed/bandwith and the number of instances of TwitterScraper you are willing to start.
Then, the function you're using, query_tweets_from_user(), has a limit argument which you can set to an integer. One thing you can try is changing that argument and seeing whether you get what you want.
Finally, if the above does not work, you could split your time range into two, three or more subsets if needed, collect the data separately, and merge it afterwards (a rough sketch follows below).
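To illustrate the last two suggestions, here is a rough, untested sketch. The limit argument is the one mentioned above; the query_tweets function and its begindate/enddate parameters are an assumption based on my recollection of the library's README, so double-check the names against your installed version.

import datetime as dt
from twitterscraper import query_tweets, query_tweets_from_user as qtfu

# Suggestion 2: raise the limit explicitly.
tweets = qtfu(user='matteosalvinimi', limit=5000)

# Suggestion 3 (assumed parameter names): split January 2018 into weekly
# windows, collect the tweets separately, and merge them afterwards.
all_tweets = []
start = dt.date(2018, 1, 1)
month_end = dt.date(2018, 2, 1)
while start < month_end:
    end = min(start + dt.timedelta(days=7), month_end)
    all_tweets.extend(query_tweets('from:matteosalvinimi',
                                   begindate=start, enddate=end))
    start = end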
The second issue you mention might be due to many different things, so I'll just take a broad guess here. For me, either pages=100 is too high and for one reason or another the program or the API is unable to retrieve the data, or you're asking for a hundred pages when in reality there are fewer than a hundred to look at, which results in the program trying to parse an empty document.
Does anyone have experience with the Dota 2 API library in Python called 'dota2api'? I wish to pull a list of 200 recent games filtered by various criteria. I'm using the get_match_history() request (see link). Here's my code:
import dota2api
key = '<key>'
api = dota2api.Initialise(key)
match_list = api.get_match_history(matches_requested=200)
I haven't specified any filters yet, since I can't even get the matches_requested argument to work. When I run this code, I get exactly 100 matches. In fact, no matter how I specify the matches_requested argument, I always get 100 matches.
Does anyone know if I'm specifying the argument wrong, or if there's some other reason it isn't working as intended?
Thanks in advance.
For such a rarely used library it is hard to get an answer here.
I found this issue on the library's GitHub:
You can't get more than 500 matches through get_match_history, it's limited by valve api. One approach you can do is alternate hero_id, like, requesting with account_id, hero_id and start_at_match_id (none if first request), values assigned, this way you can get at least 500 matches of each hero from that account_id.
Probably that has since changed and the parameter is now ignored by the API completely. Try creating a new issue on their GitHub.
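If those parameters do still work, a rough, untested sketch of the approach from the issue could look like this. The start_at_match_id parameter and the shape of the returned dict ('matches' / 'match_id') are assumptions based on that issue and the underlying Steam Web API, not something I have verified.

import dota2api

api = dota2api.Initialise('<key>')

def get_matches(account_id, hero_id, wanted=500):
    # Page backwards through match history using start_at_match_id,
    # as suggested in the GitHub issue quoted above (assumption).
    matches = []
    start_at = None
    while len(matches) < wanted:
        kwargs = {'account_id': account_id, 'hero_id': hero_id,
                  'matches_requested': 100}
        if start_at is not None:
            kwargs['start_at_match_id'] = start_at
        history = api.get_match_history(**kwargs)
        batch = history.get('matches', [])
        if not batch:
            break
        matches.extend(batch)
        # the next request starts just below the oldest match seen so far
        start_at = batch[-1]['match_id'] - 1
    return matches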