How to collect all results from a web API in Python?

I am collecting data from a web API using a Python script. The web API returns at most 50 results per request ("size": 50), but I need to collect all of the results. Please let me know how I can do it. My initial code is below. Thank you in advance.
import requests

def getData():
    headers = {
        'Content-type': 'application/json',
    }
    data = '{"size":50,"sites.recruitment_status":"ACTIVE", "sites.org_state_or_province":"VA"}'
    response = requests.post('https://clinicaltrialsapi.cancer.gov/v1/clinical-trials', headers=headers, data=data)
    print(response.json())

To add to the answer already given, you can get the total number of results from the initial JSON response and then use a loop to step through the batches:
import requests

url = "https://clinicaltrialsapi.cancer.gov/v1/clinical-trials"
r = requests.get(url).json()
num_results = int(r['total'])   # total number of matching trials

results_per_request = 50
total = 0
while total < num_results:
    total += results_per_request   # each iteration corresponds to one batch of 50
    print(total)

Everything is in the doc:
https://clinicaltrialsapi.cancer.gov/#!/Clinical45trials/searchTrialsByGet
GET clinical-trials
Filters all clinical trials based upon supplied filter params. Filter
params may be any of the fields in the schema as well as any of the
following params...
size: limit the amount of results a supplied amount (default is 10,
max is 50)
from: start the results from a supplied starting point (default is 0)
...
So you just have to specify a "from" value and increment it by 50 each time, as in the sketch below.
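For illustration, here is a minimal sketch that combines both answers, paging with "from" and "size" until "total" is reached (the filter values are the ones from the question; the 'trials' key used to read each batch is an assumption, so check the actual response payload):

import requests

url = 'https://clinicaltrialsapi.cancer.gov/v1/clinical-trials'
headers = {'Content-type': 'application/json'}
size = 50

all_trials = []
start = 0
while True:
    payload = {"size": size,
               "from": start,
               "sites.recruitment_status": "ACTIVE",
               "sites.org_state_or_province": "VA"}
    r = requests.post(url, headers=headers, json=payload).json()
    batch = r.get('trials', [])          # assumed key for the list of results
    all_trials.extend(batch)
    if not batch or start + size >= int(r['total']):
        break
    start += size

print(len(all_trials))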

Related

Googlemaps refuses to give more than 20 restaurants

I am trying to get more than 20 restaurants from googlemaps.places_nearby(), so I am using the page_token mentioned in a question here on Stack Overflow.
Here's the code:
import time
import googlemaps

with open('apikey.txt') as f:
    api_key = f.readline()   # the with-block closes the file automatically

gmaps = googlemaps.Client(api_key)
locations = [(30.0416162646183, 31.187637297709912), (30.038828662868447, 31.21133524457125), (29.848956883337507, 31.334386571579085), (30.047845819479956, 31.262317130706496), (30.05312112490655, 31.24665544474578), (30.044482967408886, 31.23572953125353), (30.02023034028819, 31.176992671570066), (30.055592085960892, 31.18411299557052), (30.0512387364253, 31.20328697618034), (30.027741592767295, 31.174344307489818), (30.043337503059586, 31.17587613443309), (30.049286828183856, 31.181250916540794), (30.043423144171197, 31.187248209629644), (30.040934096091647, 31.183299998037857), (30.038296379882215, 31.189823130232988), (29.960107152991863, 31.250999388927262), (29.83911392727571, 31.30468827173587), (29.842752004034566, 31.332961535887694)]
search_string = "كشري"
distance = 10000
kosharies = []
for location in locations:
    response = gmaps.places_nearby(location=location, keyword=search_string, name='كشري', radius=distance)
    kosharies.extend(response.get('results'))
    next_page_token = response.get('next_page_token')
    while next_page_token:
        time.sleep(2)
        another_response = gmaps.places_nearby(location=location, keyword=search_string, name='كشري',
                                               radius=distance, page_token=next_page_token)
        kosharies.extend(another_response.get('results'))
        next_page_token = another_response.get('next_page_token')
I provided 18 different locations. Each request gives back at most 20 results, and I checked the 18 locations manually, so I know that each location has more than 20 restaurants in this category. I tried a while loop over the page_tokens, but no luck: the dataset returned has shape (142, 18) (rows, columns).
I'd appreciate your help so much.
This seems to work just fine; please check your requests' parameters carefully.
First request:
location=30.0416162646183,31.187637297709912&radius=10000&keyword=كشري
https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=30.0416162646183,31.187637297709912&radius=10000&keyword=%D9%83%D8%B4%D8%B1%D9%8A
Returns 20 results and a next_page_token. Second request uses only that token, with pagetoken (not page_token):
https://maps.googleapis.com/maps/api/place/nearbysearch/json?pagetoken=AfLeUgPos3GU6Ew0BWV52EyHv9ay7Q7H7N-44c0pSTfb0JN039qifhKotQiUPlF9O3P1jdeSJarnR72GHMmUW1jkBS0ErfYe_jpi9cUCx--XO1n1DUp_eF0XZ_Ue4KJ7_l5h6FjaE_1f2Z0G9q3lwGLg-0Ch5n4p7KaTKMq8CET8QX-lXa2_ssemCiFGXrOj6vn4wDXKNRqYAOGquNiaq9_3RWUs_k7Epe_uCicW94hC-PX90nisZxuW-zy3SBAmuRJpL4pV3CA9CWH0ygBigHWy88Sle6b1S-4k2GWK72n-eAMEEmmAOziqt1ETCy-li92pqjP4BgDv7jKYCD2uKgL3jRWdGyglroNkP02HFX49qHNbNrl0MfhKn_lTvw6zSjwF001nDOnYq5mgE8KCTe3b7cDxGgZVYWFFKwyNuswXiUPTy9D4lXhRJX7oRF6DH7YF-lH7faLpeh2eZBMP73AtFJlPX7B0c4_riCgeCK2C0Pvz_lBivx4VtkOehmuYfOVwBpi54rJW4-iZnrIjW0NrRD7HibL76MWyr_njLIf5eLx9Tl2PYiwTOj3Vd6Rjafry7b15M69Jhku1C22AVhy0R4HfYd5LHFn38N_ILL8PhDaMk3S2TKkzrohYyomrbvlffrBKUDij9Bbggvoy2s3iHrg6N-Em3SNrTPcWS65chZUALp_kve04rfU4wjhKow
This also returns 20 results and a final next_page_token. Third request:
https://maps.googleapis.com/maps/api/place/nearbysearch/json?pagetoken=AfLeUgPos3GU6Ew0BWV52EyHv9ay7Q7H7N-44c0pSTfb0JN039qifhKotQiUPlF9O3P1jdeSJarnR72GHMmUW1jkBS0ErfYe_jpi9cUCx--XO1n1DUp_eF0XZ_Ue4KJ7_l5h6FjaE_1f2Z0G9q3lwGLg-0Ch5n4p7KaTKMq8CET8QX-lXa2_ssemCiFGXrOj6vn4wDXKNRqYAOGquNiaq9_3RWUs_k7Epe_uCicW94hC-PX90nisZxuW-zy3SBAmuRJpL4pV3CA9CWH0ygBigHWy88Sle6b1S-4k2GWK72n-eAMEEmmAOziqt1ETCy-li92pqjP4BgDv7jKYCD2uKgL3jRWdGyglroNkP02HFX49qHNbNrl0MfhKn_lTvw6zSjwF001nDOnYq5mgE8KCTe3b7cDxGgZVYWFFKwyNuswXiUPTy9D4lXhRJX7oRF6DH7YF-lH7faLpeh2eZBMP73AtFJlPX7B0c4_riCgeCK2C0Pvz_lBivx4VtkOehmuYfOVwBpi54rJW4-iZnrIjW0NrRD7HibL76MWyr_njLIf5eLx9Tl2PYiwTOj3Vd6Rjafry7b15M69Jhku1C22AVhy0R4HfYd5LHFn38N_ILL8PhDaMk3S2TKkzrohYyomrbvlffrBKUDij9Bbggvoy2s3iHrg6N-Em3SNrTPcWS65chZUALp_kve04rfU4wjhKow
This also returns 20 results, but no next_page_token.
Bear in mind that there is no guarantee on the amount of results for any given request. The API can return up to 60 results (in 3 pages), but it can also return any number of them from 0 to 60. This can happen more easily with Place Nearby Search because, as the name indicates, it is meant to return only results that are near by; even though it will accept a large radius value, it usually doesn't return results that are far away.
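For reference, a minimal sketch of that flow against the raw HTTP endpoint (error handling omitted; YOUR_API_KEY is a placeholder, and the short sleep is there because a next_page_token takes a moment to become valid):

import time
import requests

url = 'https://maps.googleapis.com/maps/api/place/nearbysearch/json'
params = {'location': '30.0416162646183,31.187637297709912',
          'radius': 10000,
          'keyword': 'كشري',
          'key': 'YOUR_API_KEY'}

places = []
while True:
    data = requests.get(url, params=params).json()
    places.extend(data.get('results', []))
    token = data.get('next_page_token')
    if not token:
        break
    time.sleep(2)  # the token is not valid immediately
    params = {'pagetoken': token, 'key': 'YOUR_API_KEY'}  # note: pagetoken, not page_token

print(len(places))  # at most ~60 results per location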

Retrieving all API results in Python when the API has a maximum offset

I am attempting to retrieve data from an API that has a maximum offset of 200000, and the records I am attempting to pull exceed that maximum. Below is a sample of the code I am using, but when I reach the offset limit of 200000 it breaks (the API doesn't return any helpful response about how many pages/requests I need, which is why I keep requesting until there are no more results). I need to find a way to loop through and pull all the data. Thanks.
# session, url, the_headers and date_filter are assumed to be defined elsewhere
def pull_api_data():
    offset_limit = 0
    teltel_data = []
    # Loop through the results and add if present
    while True:
        print("Skip", offset_limit, "rows before beginning to return results")
        querystring = {"offset": "{}".format(offset_limit),
                       "filter": "starttime>={}".format(date_filter),
                       "limit": "5000"}
        response = session.get(url=url, headers=the_headers, params=querystring)
        data = response.json()['data']
        # Do we have more data from teltel?
        if len(data) == 0:
            break
        # If yes, then add the data to the main list, teltel_data
        teltel_data.extend(data)
        # Increase offset_limit to skip the already added data
        offset_limit = offset_limit + 5000
    # transform the raw data by converting it to a dataframe and do necessary cleaning

pull_api_data()
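One workaround worth sketching (not from the original thread): split the overall date range into smaller windows so that no single window needs an offset beyond 200000, and run the same offset loop per window. This assumes the API also accepts an upper-bound filter such as starttime<, which is hypothetical and should be checked against its filter syntax:

from datetime import datetime, timedelta

def pull_window(window_start, window_end):
    # Same offset loop as above, limited to one window; the combined
    # "starttime>=... AND starttime<..." filter syntax is hypothetical.
    offset_limit, window_data = 0, []
    while True:
        querystring = {"offset": str(offset_limit),
                       "filter": "starttime>={} AND starttime<{}".format(window_start, window_end),
                       "limit": "5000"}
        data = session.get(url=url, headers=the_headers, params=querystring).json()['data']
        if not data:
            break
        window_data.extend(data)
        offset_limit += 5000
    return window_data

# Walk forward one week at a time so each window stays well under the offset cap.
teltel_data = []
start = datetime(2022, 1, 1)          # placeholder start date
while start < datetime.today():
    end = start + timedelta(days=7)
    # the date format is a guess; match whatever format date_filter uses
    teltel_data.extend(pull_window(start.strftime('%Y-%m-%d'), end.strftime('%Y-%m-%d')))
    start = end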

Paginated XML parsing from API

I have a snippet of code that calls a request to an API and retrieves XML data. The API limits requests to 50 results per page. I can loop through the pages, but each page comes back as its own document with its own root element (<interactions>):
<interactions>
    <interaction>
        <media-type>Phone</media-type>
        <channel-obj-id>12125000870</channel-obj-id>
    </interaction>
</interactions>
<interactions>
    <interaction>
        <media-type>Phone</media-type>
        <channel-obj-id>523452345</channel-obj-id>
    </interaction>
</interactions>
I have looped through the record count to get all the data, but it does not come back as one continuous XML document.
The API docs do not provide any relevant information, nor is there a count or total-number-of-pages parameter.
import requests
import xml.etree.ElementTree as et
from datetime import datetime, timedelta

dt = datetime.today() - timedelta(days=1)
dt = dt.strftime('%Y%m%d')

n = 0
rootLength = 50   # assumed to be updated elsewhere from the previous page's element count
while rootLength == 50:
    n += 50
    response = requests.get(
        'https://APIURL/api/stats/allinteractions?d=20190101,' + str(dt) + '&n=' + str(n),
        auth=('user', 'pass')
    )
    xtree = et.ElementTree(et.fromstring(response.content))
Is there a way to retrieve a single continuous XML page?
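One common approach (a sketch, assuming each page has already been fetched into a list of raw XML strings or bytes called pages) is to parse every page and append its <interaction> elements to a single combined root:

import xml.etree.ElementTree as et

combined_root = et.Element('interactions')

for page_content in pages:                 # pages: one raw XML document per request (assumed)
    page_root = et.fromstring(page_content)
    for interaction in page_root.findall('interaction'):
        combined_root.append(interaction)

# The combined tree now holds every <interaction> under one root and can be
# written out or processed as a single document.
combined_tree = et.ElementTree(combined_root)
combined_tree.write('all_interactions.xml', encoding='utf-8', xml_declaration=True)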

Optimizing speed for bulk SSL API requests

I have written a script that makes around 530 API calls, which I intend to run every 5 minutes; from these calls I store data to process in bulk later (prediction, etc.).
The API has a limit of 500 requests per second. However, when running my code I am seeing about 2 seconds per call (due to SSL, I believe).
How can I speed this up so that I can run the 500-odd requests within 5 minutes? The current time required renders the data I am collecting useless :(
Code:
import csv
import datetime
from multiprocessing import Pool

# client, coordinates, HEADER and database are assumed to be defined elsewhere
def getsurge(lat, long):
    response = client.get_price_estimates(
        start_latitude=lat,
        start_longitude=long,
        end_latitude=-34.063676,
        end_longitude=150.815075
    )
    result = response.json.get('prices')
    return result

def writetocsv(database):
    database_writer = csv.writer(database)
    database_writer.writerow(HEADER)
    pool = Pool()   # created but never used in this snippet
    # Open Estimate Database
    while True:
        for data in coordinates:
            line = data.split(',')
            long = line[3]
            lat = line[4][:-2]
            estimate = getsurge(lat, long)
            timechecked = datetime.datetime.now()
            for d in estimate:
                if d['display_name'] == 'TAXI':
                    database_writer.writerow([timechecked, [line[0], line[1]], d['surge_multiplier']])
                    database.flush()
                    print(timechecked, [line[0], line[1]], d['surge_multiplier'])
Is the API under your control? If so, create an endpoint which can give you all the data you need in one go.
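If the API is not under your control, the per-call overhead (including the SSL handshake) can usually be amortized by reusing one client and issuing the calls concurrently. Here is a minimal sketch with a thread pool, reusing getsurge and coordinates from the question (the worker count is an arbitrary choice that stays well under the 500 requests/second limit):

from concurrent.futures import ThreadPoolExecutor

def surge_for_row(data):
    # Parse one coordinates row and fetch its estimate (same parsing as above).
    line = data.split(',')
    long = line[3]
    lat = line[4][:-2]
    return line, getsurge(lat, long)

# Run the ~530 calls concurrently instead of sequentially; at ~2 s per call,
# 20 workers brings one full pass to roughly a minute.
with ThreadPoolExecutor(max_workers=20) as executor:
    for line, estimate in executor.map(surge_for_row, coordinates):
        for d in estimate:
            if d['display_name'] == 'TAXI':
                print(line[0], line[1], d['surge_multiplier'])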

Tweepy (Twitter API) Not Returning all Search Results

I'm using the search feature with Tweepy for Twitter, and for some reason the search results are limited to 15. Here is my code:
results = api.search(q="Football", rpp=1000)
for result in results:
    print "%s" % (clNormalizeString(result.text))
print len(results)
Only 15 results are returned. Does it have something to do with different pages of results?
The question is more about the Twitter API than about tweepy itself.
According to the documentation, the count parameter defines:
The number of tweets to return per page, up to a maximum of 100.
Defaults to 15. This was formerly the "rpp" parameter in the old
Search API.
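For example, switching the original call from rpp to count (a sketch against the old search API the question uses) returns up to 100 results per page:

results = api.search(q="Football", count=100)   # count replaces the old rpp parameter
print len(results)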
FYI, you can use tweepy.Cursor to get paginated results, like this:
import tweepy

auth = tweepy.OAuthHandler(..., ...)
auth.set_access_token(..., ...)

api = tweepy.API(auth)

for tweet in tweepy.Cursor(api.search,
                           q="google",
                           count=100,
                           result_type="recent",
                           include_entities=True,
                           lang="en").items():
    print tweet.created_at, tweet.text
See also: https://github.com/tweepy/tweepy/issues/197.
Hope that helps.
Here's a minimal working example (once you replace the fake keys with real ones).
import tweepy
from math import ceil

def get_authorization():
    info = {"consumer_key": "A7055154EEFAKE31BD4E4F3B01F679",
            "consumer_secret": "C8578274816FAEBEB3B5054447B6046F34B41F52",
            "access_token": "15225728-3TtzidHIj6HCLBsaKX7fNpuEUGWHHmQJGeF",
            "access_secret": "61E3D5BD2E1341FFD235DF58B9E2FC2C22BADAD0"}
    auth = tweepy.OAuthHandler(info['consumer_key'], info['consumer_secret'])
    auth.set_access_token(info['access_token'], info['access_secret'])
    return auth

def get_tweets(query, n):
    _max_queries = 100  # arbitrarily chosen value
    api = tweepy.API(get_authorization())
    tweets = tweet_batch = api.search(q=query, count=n)
    ct = 1
    while len(tweets) < n and ct < _max_queries:
        print(len(tweets))
        tweet_batch = api.search(q=query,
                                 count=n - len(tweets),
                                 max_id=tweet_batch.max_id)
        tweets.extend(tweet_batch)
        ct += 1
    return tweets
Note: I did try using a for loop, but the Twitter API sometimes returns fewer than 100 results (despite being asked for 100, with 100 available). I'm not sure why, but that's the reason I didn't include a check to break the loop if tweet_batch is empty; you may want to add such a check yourself, as there is a query rate limit.
Another note: you can avoid hitting the rate limit by passing wait_on_rate_limit=True, like so:
api = tweepy.API(get_authorization(), wait_on_rate_limit=True)
