Pagination loop with offset 100 - python

I am writing code that fetches records from a paginated API which returns at most 100 records per request, so I have to step through the results in multiples of 100. Currently my code reads the total record count and then loops with offsets 100, 101, 102, and so on. I want it to step in 100s (100, 200, 300) and stop as soon as the offset exceeds the total number of records. My partial code below increments by 1 instead of 100 and does not stop when it should. Could anyone please help me with this?
import json
import pandas as pd
import requests
from pandas.io.json import json_normalize
# Token for authorization
API_ACCESS_KEY = 'Token'
Accept = 'application/xml'
# Query details that are passed in the URL
since = '2018-01-01'
until = '2018-02-01'
limit = '100'
offset = '0'
total = 'true'
def get():
    url_address = "https://mywebsite/web?offset=" + str('0')
    headers = {
        'Authorization': 'token={0}'.format(API_ACCESS_KEY),
        'Accept': Accept,
    }
    querystring = {"since": since, "until": until, "limit": limit, "total": total}
    # find out the total number of records
    r = requests.get(url=url_address, headers=headers, params=querystring).json()
    total_record = int(r['total'])
    print("Total record: " + str(total_record))
    # results will be appended to this list
    all_items = []
    # loop through all offsets and collect each JSON response
    for offset in range(0, total_record):
        url = "https://mywebsite/web?offset=" + str(offset)
        response = requests.get(url=url, headers=headers, params=querystring).json()
        all_items.append(response)
        offset = offset + 100
        print(offset)
    # prettify JSON
    data = json.dumps(all_items, sort_keys=True, indent=4)
    return data
print(get())
Currently when I print the offset I see
Total Records: 345
100,
101,
102,
Expected:
Total Records: 345
100,
200,
300
Stop the loop!

One way to do it is to change
for offset in range(0, total_record):
    url = "https://mywebsite/web?offset=" + str(offset)
    response = requests.get(url=url, headers=headers, params=querystring).json()
    all_items.append(response)
    offset = offset + 100
    print(offset)
to
for offset in range(0, total_record, 100):
    url = "https://mywebsite/web?offset=" + str(offset)
    response = requests.get(url=url, headers=headers, params=querystring).json()
    all_items.append(response)
    print(offset)
The third argument to range is the step, so offset takes the values 0, 100, 200, 300 and the loop ends before it reaches total_record. Reassigning offset inside the loop body has no lasting effect, because the for statement overwrites it with the next value from range on every iteration.
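The stepped range can be checked without any network access. This is a minimal sketch: the helper name page_offsets and its page_size parameter are made up for illustration, and 345 is the total from the question.

```python
def page_offsets(total_records, page_size=100):
    """Return the offsets needed to cover total_records, one per request."""
    # range() stops before total_records, so no request goes past the end.
    return list(range(0, total_records, page_size))


offsets = page_offsets(345)
print(offsets)  # [0, 100, 200, 300]
```

Assuming the API's offsets start at 0, as the initial request in the question suggests, the first page is fetched at offset 0; starting at 100 would skip the first 100 records.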

# loop through all offsets and return JSON object
for offset in range(0, total_record, 100):
    url = "https://mywebsite/web?offset=" + str(offset)
    response = requests.get(url=url, headers=headers, params=querystring).json()
    all_items.append(response)
    print(offset)

Related

Reset offset variable in API when using a for loop

I am working with a real estate API to pull rental listings, and I'd like to loop through a list of zip codes. The API returns at most 500 rows per request, so I page through the results by increasing the offset in steps of 500. The code below works fine until the while loop reaches the second zip code. After the first zip code has run successfully, I need the offset variable to reset to 500 and begin counting up again until the while loop breaks for the second zip code in the list.
import requests
# This just formats your token for the requests library.
headers = {
    "X-RapidAPI-Key": "your-key-here",
    "X-RapidAPI-Host": "realty-mole-property-api.p.rapidapi.com"
}
# Initial Limit and Offset values.
limit = 500
offset = 0
zipCode = [77449, 77008]
# This will be an array of all the listing records.
texas_listings = []
# We loop until we get no results.
for i in zipCode:
    while True:
        print("----")
        url = f"https://realty-mole-property-api.p.rapidapi.com/rentalListings?offset={offset}&limit={limit}&zipCode={i}"
        print("Requesting", url)
        response = requests.get(url, headers=headers)
        data = response.json()
        print(data)
        # Did we find any listings?
        if len(data) == 0:
            # If not, exit the loop
            break
        # If we did find listings, add them to the data
        # and then move onto the next offset.
        texas_listings.extend(data)
        offset = offset + 500
Here is a snippet of the final printed output. As you can see, zipcode 77008 gets successfully passed to the zipCode variable after the 77449 zipcode returns an empty list and breaks the loop at offset 5500. However, you can also see that the 77008 offset starts at 5500 and it appears there aren't that many listings in that zipcode. How do I reset offset variable to 500 and begin counting again?
You can reset the offset variable once the while loop for one zip code has finished, before the for loop moves on to the next zip code. Reset it to 0 rather than 500, otherwise the first 500 listings of every zip code after the first would be skipped.
for i in zipCode:
    while True:
        print("----")
        url = f"https://realty-mole-property-api.p.rapidapi.com/rentalListings?offset={offset}&limit={limit}&zipCode={i}"
        print("Requesting", url)
        response = requests.get(url, headers=headers)
        data = response.json()
        print(data)
        if len(data) == 0:
            break
        texas_listings.extend(data)
        offset = offset + 500
    offset = 0 # reset offset for the next zip code
Update: I put the offset and limit inside the for loop and it works the way I expect.
# We loop until we get no results.
for i in zipCode:
    limit = 500
    offset = 0
    while True:
        print("----")
        url = f"https://realty-mole-property-api.p.rapidapi.com/rentalListings?offset={offset}&limit={limit}&zipCode={i}"
        print("Requesting", url)
        response = requests.get(url, headers=headers)
        data = response.json()
        print(data)
        # Did we find any listings?
        if len(data) == 0:
            # If not, exit the loop
            break
        # If we did find listings, add them to the data
        # and then move onto the next offset.
        texas_listings.extend(data)
        offset = offset + 500
# texas is assumed to be an existing DataFrame from an earlier pull.
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
texas = pd.concat([pd.DataFrame(texas_listings), texas], ignore_index=True)
texas['date_pulled'] = pd.Timestamp.now().normalize()
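The same idea, a fresh offset for every zip code and a loop that runs until a page comes back empty, can be sketched without the real API. Everything named here is hypothetical: fetch_all_pages, fake_fetch_page and FAKE_DATA are stand-ins for illustration, and the fake listings are just integers.

```python
def fetch_all_pages(fetch_page, keys, limit=500):
    """Collect every row for every key, restarting the offset for each key."""
    rows = []
    for key in keys:
        offset = 0  # fresh offset for each key
        while True:
            page = fetch_page(key, offset, limit)
            if not page:  # an empty page means this key is exhausted
                break
            rows.extend(page)
            offset += limit
    return rows


# Stand-in for the real API: 1200 fake listings for one zip code, 300 for the other.
FAKE_DATA = {77449: list(range(1200)), 77008: list(range(300))}


def fake_fetch_page(zip_code, offset, limit):
    """Return one slice of the fake listings, like a paginated endpoint would."""
    return FAKE_DATA[zip_code][offset:offset + limit]


listings = fetch_all_pages(fake_fetch_page, [77449, 77008])
print(len(listings))  # 1500
```

Because offset is assigned inside the for loop, it cannot leak from one zip code into the next.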

Batching API Requests

I have a list of 1,000 airports I am sending to an API to get flight data for each airport. The API cannot handle the entire list at once even if I delay the calls. I need to place the list of airports into batches of 100 for the API calls to work properly. My code below iterates over the list of airports and sends them one by one to the API. I want to break up the API calls (airport list) and call them in batches of 100 because it's causing errors in the data format when I use the entire 1,000. When I test the API with only 100 airports, all data is returned properly. I'm unsure where to place the batch code in my API call loop.
# Sample dataset for this post
airport = [['HLZN'], ['HLLQ'], ['HLLB'], ['HLGT'], ['HLMS'], ['HLLS'], ['HLTQ'], ['HLLT'], ['HLLM']]
payload = {'max_pages': 500, 'type': 'Airline'}
seconds = 1
count = 1
# Create an empty list to hold responses
json_responses = []
# Iterate through list
for airports in airport:
    response = requests.get(apiUrl + f"airports/{airports[0]}/flights", params=payload,
                            headers=auth_header)
    if response.status_code == 200:
        print(count, airports)
        count += 1
        for i in trange(100):
            time.sleep(0.01)
    else:
        pass
    results = response.json()
    json_responses.append(response.json())
    sleep(seconds)
I'm not sure where to place batching code inside the API call loop. I'm new to batching API calls and loops in general so any help will be appreciated.
total_count = len(airport)
# Iterate through list
for airports in airport:
    response = requests.get(apiUrl + f"airports/{airports[0]}/flights", params=payload,
                            headers=auth_header)
    chunks = (total_count - 1) // 100 + 1
    for i in range(chunks):
        batch = airport[i*100:(i+1)*100] # Tried batch code here
        if response.status_code == 200:
            print(count, airports)
            count += 1
            for i in trange(100):
                time.sleep(0.01)
        else:
            pass
    results = response.json()
    json_responses.append(response.json())
    sleep(seconds)
I believe this is what you're trying to do:
# Sample dataset for this post
airports = [['HLZN'], ['HLLQ'], ['HLLB'], ['HLGT'], ['HLMS'], ['HLLS'], ['HLTQ'], ['HLLT'], ['HLLM']]
payload = {'max_pages': 500, 'type': 'Airline'}
seconds = 1
# Create an empty list to hold responses
json_responses = []
# Counter variable
counter = 0
# Chunk size
chunk_size = 100
# Iterate through list
for airport in airports:
    # Use the loop variable (airport), not the whole list (airports)
    response = requests.get(apiUrl + f"airports/{airport[0]}/flights", params=payload,
                            headers=auth_header)
    results = response.json()
    json_responses.append(results)
    # Increment the counter; after every chunk_size requests, sleep for the defined number of seconds
    counter += 1
    if counter % chunk_size == 0:
        sleep(seconds)
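If the goal is to work through the list in explicit groups of 100, rather than only pausing after every 100 calls, the list can be sliced into batches first. This is a small sketch under that assumption; the helper name chunked and the placeholder airport codes are invented for illustration.

```python
def chunked(items, size):
    """Yield consecutive slices of items, each with at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


airports = [f"AP{n:04d}" for n in range(1000)]  # placeholder codes, not real airports
batches = list(chunked(airports, 100))
print(len(batches), len(batches[0]))  # 10 100
```

Each batch can then be processed in its own inner loop, with a pause between batches if the API needs one.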

Call API for each element in list

I have a list with over 1000 IDs and I want to call an API with different endpoints for every element of the list.
Example:
customerlist = [803818, 803808, 803803,803738,803730]
I tried the following:
import json
import requests
import pandas as pd
API_BASEURL = "https://exampleurl.com/"
API_TOKEN = "abc"
HEADERS = {'content-type' : 'application/json',
           'Authorization': API_TOKEN }
def get_data(endpoint):
    for i in customerlist:
        api_endpoint = endpoint
        params = {'customerid' : i}
        response = requests.get(f"{API_BASEURL}/{api_endpoint}",
                                params = params,
                                headers = HEADERS)
        if response.status_code == 200:
            res = json.loads(response.text)
        else:
            raise Exception(f'API error with status code {response.status_code}')
        res = pd.DataFrame([res])
        return res
get_data(endpointexample)
This works, but it only returns the values for the first element of the list (803818). I want the function to return the values for every ID from customerlist for the endpoint I defined in the function argument.
I found this - possibly related - question, but I couldn't figure my problem out.
There is probably an easy solution for this which I am not seeing, as I am just starting with Python. Thanks.
The moment a function hits a return statement, it finishes immediately. Since your return statement is inside the loop, the function exits during the first iteration and the remaining iterations never run.
To fix this, create a list outside the loop, append to it on every iteration, and then return a DataFrame built from that list after the loop:
def get_data(endpoint):
    responses = []
    for i in customerlist:
        api_endpoint = endpoint
        params = {'customerid' : i}
        response = requests.get(f"{API_BASEURL}/{api_endpoint}",
                                params = params,
                                headers = HEADERS)
        if response.status_code == 200:
            res = json.loads(response.text)
        else:
            raise Exception(f'API error with status code {response.status_code}')
        responses.append(res)
    return pd.DataFrame(responses)
A much cleaner solution would be to have the function handle a single ID and use a list comprehension to call it for every ID:
def get_data(endpoint, i):
    api_endpoint = endpoint
    params = {'customerid' : i}
    response = requests.get(f"{API_BASEURL}/{api_endpoint}",
                            params = params,
                            headers = HEADERS)
    if response.status_code == 200:
        res = json.loads(response.text)
    else:
        raise Exception(f'API error with status code {response.status_code}')
    return res
responses = pd.DataFrame([get_data(endpoint, i) for i in customerlist])
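The effect of where the return statement sits can be seen in a tiny example with no API involved. The function names first_only and all_of_them are invented for this illustration.

```python
def first_only(ids):
    for i in ids:
        return i * 2  # returns during the first iteration; the loop never continues


def all_of_them(ids):
    results = []
    for i in ids:
        results.append(i * 2)
    return results  # returns once, after the loop has finished


print(first_only([1, 2, 3]))   # 2
print(all_of_them([1, 2, 3]))  # [2, 4, 6]
```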

Optimize script to flatten json output from API

I have a script that extracts data from an API, where the final output of requests.get(url=url, auth=(user, password)).json() is "all_results". The output is ~25K rows, but it contains nested fields.
The API is for portfolio data, and the children field is a dictionary holding ticker level information (so can be really large).
The script below flattens "all_results" and specifies only the columns I need:
final_df = pd.DataFrame()
for record in all_results:
    df = pd.DataFrame(record.get('children', {}))
    df['contactId'] = record.get('contactId')
    df['origin'] = record.get('origin')
    df['description'] = record.get('description')
    final_df = final_df.append(df)
It works perfectly with smaller samples, however when trying to run it over the whole data set- it takes HOURS. Can anyone propose something more efficient than my current script? Need it to run way faster than currently.
Thank you in advance!
-- Full script--
import pandas as pd
import requests
user = ''
password = ""
# Starting values
start = 0
rows = 1500
base_url = 'https://....?start={0}&rows={1}'
print("Connecting to API..")
url = base_url.format(start, rows)
req = requests.get(url=url, auth=(user, password))
print("Extracting data..")
out = req.json()
total_records = out['other']['numFound']
print("Total records found: " + str(total_records))
results = out['resultList']
all_results = results
print("First " + str(rows) + " rows were extracted")
# Results will be an empty list if no more results are found
while results:
    start += rows
    # Rebuild url based on current start
    url = base_url.format(start, rows)
    req = requests.get(url=url, auth=(user, password))
    out = req.json()
    results = out['resultList']
    all_results += results
    print("Next " + str(rows) + " rows were extracted")
# all_results now contains the responses of every request.
print("Total records returned from API: " + str(len(all_results))) # should equal number of records in response
final_df = pd.DataFrame()
for record in all_results:
    df = pd.DataFrame(record.get('children', {}))
    df['contactId'] = record.get('contactId')
    df['origin'] = record.get('origin')
    df['description'] = record.get('description')
    final_df = final_df.append(df)
final_df = final_df.reset_index()
del final_df['index']
final_df['ticker'] = final_df['identifier'].str.split('#').str.get(0) # extract ticker (anything before #)
final_df = final_df.drop_duplicates(keep='first') # drop_duplicates returns a new frame, so assign it back
print('DataFrame from API created successfully\n')
print(final_df.head(n=50))
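One likely reason the loop gets slower as it goes is that final_df.append copies the whole accumulated frame on every iteration. A possible alternative, sketched below, is to flatten everything into plain dictionaries first and build the DataFrame once at the end. This assumes that children is a list of dictionaries, which the post does not confirm; the function name flatten_children and the sample records are invented for illustration.

```python
def flatten_children(all_results):
    """Return one flat dict per child, tagged with its parent's fields."""
    rows = []
    for record in all_results:
        parent = {
            "contactId": record.get("contactId"),
            "origin": record.get("origin"),
            "description": record.get("description"),
        }
        # A missing or empty 'children' field simply contributes no rows.
        for child in record.get("children") or []:
            rows.append({**child, **parent})
    return rows


# Made-up sample in the assumed shape.
sample = [
    {"contactId": 1, "origin": "a", "description": "x",
     "children": [{"identifier": "AAA#1"}, {"identifier": "BBB#2"}]},
    {"contactId": 2, "origin": "b", "description": "y",
     "children": [{"identifier": "CCC#3"}]},
    {"contactId": 3, "origin": "c", "description": "z"},
]
rows = flatten_children(sample)
print(len(rows))  # 3
```

Passing rows to pd.DataFrame(rows) then creates the frame in a single step. If children has a different shape, another option is to keep the per-record DataFrames in a list and call pd.concat on that list once after the loop.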

Making ten api calls through a loop

An API I want to use limits each request to 10 items, and I want to download 100 items. I am trying to write a function that makes 10 API calls, using the API's offset parameter to get a different page each time. I figured a loop would be the proper way to do this.
This is the code I have, but it doesn't work and I don't understand why:
import pandas as pd
import requests
api_key = 'THIS_IS_MY_KEY'
api_url = 'http://apiurl.com/doc?limit=10' # fake url
headers = {'Authorization': 'Bearer ' + api_key}
for x in range(0, 10):
    number = 0
    url = api_url + '&offset=' + str(number + 10)
    r = requests.get(url, headers=headers)
    x = pd.DataFrame(r.json())
    x = x['data'].apply(pd.Series)
    return x
You are using x both as the loop counter and as the data frame, which is not good practice, although the code may still run because the for loop assigns the next value from range at the start of each iteration. A better approach is to use the step parameter of range, as shown below. It is also not clear what you expect to get back: the last offset you fetched, or the data frame. Here I assume you want the data from all ten requests, so each page is collected in a list and combined at the end; return is only valid inside a function, so it is dropped. I have left the x['data'].apply(pd.Series) step as it was, although I suspect it is not right for your data.
import pandas as pd
import requests
api_key = 'THIS_IS_MY_KEY'
api_url = 'http://apiurl.com/doc?limit=10' # fake url
headers = {'Authorization': 'Bearer ' + api_key}
frames = []
for offset in range(0, 100, 10): # offsets 0, 10, 20, ..., 90
    url = api_url + '&offset=' + str(offset)
    r = requests.get(url, headers=headers)
    df = pd.DataFrame(r.json())
    frames.append(df['data'].apply(pd.Series))
result = pd.concat(frames, ignore_index=True) # combine all ten pages
What result do you see? Try:
url = api_url + '&offset=' + str(x * 10)
The variable number never changes, since it is set back to 0 at the start of every iteration.
I think you mean this:
import pandas as pd
import requests
api_key = 'THIS_IS_MY_KEY'
api_url = 'http://apiurl.com/doc?limit=10' # fake url
headers = {'Authorization': 'Bearer ' + api_key}
number = 0
frames = []
for _ in range(0, 10):
    url = api_url + '&offset=' + str(number) # the first request uses offset 0
    r = requests.get(url, headers=headers)
    x = pd.DataFrame(r.json())
    frames.append(x['data'].apply(pd.Series))
    number += 10
result = pd.concat(frames, ignore_index=True) # combine all ten pages
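Since the question mentions writing a function, here is a minimal sketch of one that makes the ten calls and returns all 100 items together. The names fetch_items and fake_fetch_page are hypothetical; the fake page function stands in for the real request so the example runs offline.

```python
def fetch_items(fetch_page, total=100, page_size=10):
    """Call fetch_page once per offset and return all items in one list."""
    items = []
    for offset in range(0, total, page_size):  # 0, 10, ..., 90
        items.extend(fetch_page(offset, page_size))
    return items  # placed after the loop, so every page is included


def fake_fetch_page(offset, limit):
    """Stand-in for the real API call; returns limit fake items starting at offset."""
    return list(range(offset, offset + limit))


items = fetch_items(fake_fetch_page)
print(len(items))  # 100
```

In real use, fetch_page would wrap the requests.get call and return the list of items from the JSON response.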
