I have a list of 1,000 airports that I am sending to an API to get flight data for each one. The API cannot handle the entire list at once, even if I delay the calls: with all 1,000 airports the returned data comes back in a broken format, but when I test with only 100 airports, all data is returned properly. So I need to split the list into batches of 100 for the API calls to work. My code below iterates over the list and sends the airports one by one; I'm unsure where to place the batch code in my API call loop.
# Sample dataset for this post
airport = [['HLZN'], ['HLLQ'], ['HLLB'], ['HLGT'], ['HLMS'], ['HLLS'], ['HLTQ'], ['HLLT'], ['HLLM']]
payload = {'max_pages': 500, 'type': 'Airline'}
seconds = 1
count = 1

# Create an empty list to hold responses
json_responses = []

# Iterate through the list
for airports in airport:
    response = requests.get(apiUrl + f"airports/{airports[0]}/flights", params=payload,
                            headers=auth_header)
    if response.status_code == 200:
        print(count, airports)
        count += 1
        for i in trange(100):
            time.sleep(0.01)
    else:
        pass
    results = response.json()
    json_responses.append(response.json())
    sleep(seconds)
I'm not sure where to place the batching code inside the API call loop; below is my attempt. I'm new to batching API calls and loops in general, so any help will be appreciated.
total_count = len(airport)

# Iterate through the list
for airports in airport:
    response = requests.get(apiUrl + f"airports/{airports[0]}/flights", params=payload,
                            headers=auth_header)
    chunks = (total_count - 1) // 100 + 1
    for i in range(chunks):
        batch = airport[i*100:(i+1)*100]  # Tried batch code here
    if response.status_code == 200:
        print(count, airports)
        count += 1
        for i in trange(100):
            time.sleep(0.01)
    else:
        pass
    results = response.json()
    json_responses.append(response.json())
    sleep(seconds)
I believe this is what you're trying to do:
# Sample dataset for this post
airports = [['HLZN'], ['HLLQ'], ['HLLB'], ['HLGT'], ['HLMS'], ['HLLS'], ['HLTQ'], ['HLLT'], ['HLLM']]
payload = {'max_pages': 500, 'type': 'Airline'}
seconds = 1

# Create an empty list to hold responses
json_responses = []

# Counter variable
counter = 0

# Chunk size
chunk_size = 100

# Iterate through the list (note: request each airport, not airports[0])
for airport in airports:
    response = requests.get(apiUrl + f"airports/{airport[0]}/flights", params=payload,
                            headers=auth_header)
    results = response.json()
    json_responses.append(results)

    # Increment the counter; every time it hits a multiple of the
    # chunk size, sleep for a defined number of seconds
    counter += 1
    if counter % chunk_size == 0:
        sleep(seconds)
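If you want to batch the list itself rather than just count requests, another option is to slice the list into chunks of 100 up front and pause between chunks. A minimal sketch, reusing the same (hypothetical) apiUrl and auth_header from the question:

# Walk the list in steps of chunk_size; each slice is one batch.
chunk_size = 100
for start in range(0, len(airports), chunk_size):
    batch = airports[start:start + chunk_size]
    for airport in batch:
        response = requests.get(apiUrl + f"airports/{airport[0]}/flights",
                                params=payload, headers=auth_header)
        if response.status_code == 200:
            json_responses.append(response.json())
    sleep(seconds)  # pause between batches so the API can keep up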
Related
I am working with a real estate API pulling rental listings. I'd like to loop through a list of zipcodes to pull the data. The API requires an offset of 500 rows of data or less. The code below works fine until the while loop hits the second zipcode. The issue is that after the first zipcode has run successfully, I need the offset variable to reset to 500 and begin counting up again until the while loop breaks for the second zipcode in the list.
# This just formats your token for the requests library.
headers = {
    "X-RapidAPI-Key": "your-key-here",
    "X-RapidAPI-Host": "realty-mole-property-api.p.rapidapi.com"
}

# Initial Limit and Offset values.
limit = 500
offset = 0
zipCode = [77449, 77008]

# This will be an array of all the listing records.
texas_listings = []

# We loop until we get no results.
for i in zipCode:
    while True:
        print("----")
        url = f"https://realty-mole-property-api.p.rapidapi.com/rentalListings?offset={offset}&limit={limit}&zipCode={i}"
        print("Requesting", url)
        response = requests.get(url, headers=headers)
        data = response.json()
        print(data)
        # Did we find any listings?
        if len(data) == 0:
            # If not, exit the loop
            break
        # If we did find listings, add them to the data
        # and then move onto the next offset.
        texas_listings.extend(data)
        offset = offset + 500
Here is a snippet of the final printed output. As you can see, zipcode 77008 gets successfully passed to the zipCode variable after the 77449 zipcode returns an empty list and breaks the loop at offset 5500. However, you can also see that the 77008 offset starts at 5500 and it appears there aren't that many listings in that zipcode. How do I reset offset variable to 500 and begin counting again?
You can reset the offset variable back to 500 before starting the next iteration of the loop over the next zip code.
for i in zipCode:
    while True:
        print("----")
        url = f"https://realty-mole-property-api.p.rapidapi.com/rentalListings?offset={offset}&limit={limit}&zipCode={i}"
        print("Requesting", url)
        response = requests.get(url, headers=headers)
        data = response.json()
        print(data)
        if len(data) == 0:
            break
        texas_listings.extend(data)
        offset = offset + 500
    offset = 500  # reset offset to 500 for the next zip code
Update: I put the offset and limit within the for loop and it works the way I expect.
# We loop until we get no results.
for i in zipCode:
    limit = 500
    offset = 0
    while True:
        print("----")
        url = f"https://realty-mole-property-api.p.rapidapi.com/rentalListings?offset={offset}&limit={limit}&zipCode={i}"
        print("Requesting", url)
        response = requests.get(url, headers=headers)
        data = response.json()
        print(data)
        # Did we find any listings?
        if len(data) == 0:
            # If not, exit the loop
            break
        # If we did find listings, add them to the data
        # and then move onto the next offset.
        texas_listings.extend(data)
        offset = offset + 500

texas = pd.DataFrame(texas_listings).append(texas, ignore_index=True)
texas['date_pulled'] = pd.Timestamp.now().normalize()
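One caveat on the last two lines: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on newer versions the same concatenation would be written with pd.concat:

# Equivalent concatenation without the deprecated DataFrame.append
texas = pd.concat([pd.DataFrame(texas_listings), texas], ignore_index=True)
texas['date_pulled'] = pd.Timestamp.now().normalize()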
Basically, I loop over a date range, performing a scan per day, like:
table_hook = dynamodb_resource.Table('table1')
date_filter = Key('date_column').between('2021-01-01T00:00:00+00:00', '2021-01-01T23:59:59+00:00')

response = table_hook.scan(FilterExpression=date_filter)
incoming_data = response['Items']

if (response['Count']) == 0:
    return

_counter = 1
while 'LastEvaluatedKey' in response:
    response = table_hook.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
    if (
        parser.parse(response['Items'][0]['date_column']).replace(tzinfo=None) < parser.parse('2021-01-01T00:00:00+00:00').replace(tzinfo=None)
        or
        parser.parse(response['Items'][0]['date_column']).replace(tzinfo=None) > parser.parse('2021-06-07T23:59:59+00:00').replace(tzinfo=None)
    ):
        break
    incoming_data.extend(response['Items'])
    _counter += 1
    print("|-> Getting page %s" % _counter)
At the end of the Day 1 to Day 2 loop, it retrieves X rows. But if I perform the same scan the same way (paginating), with the same range (Day 1 to Day 2), without doing a loop, it retrieves Y rows. And to top it off, when I perform table.describe_table(TableName='table1'), the row_count field comes back with Z rows. I literally don't understand what is going on!
With the help of the folks above, I found my error: I wasn't passing the filter again when paginating. The fixed code is:
table_hook = dynamodb_resource.Table('table1')
date_filter = Key('date_column').between('2021-01-01T00:00:00+00:00', '2021-01-01T23:59:59+00:00')

response = table_hook.scan(FilterExpression=date_filter)
incoming_data = response['Items']

_counter = 1
while 'LastEvaluatedKey' in response:
    response = table_hook.scan(FilterExpression=date_filter,
                               ExclusiveStartKey=response['LastEvaluatedKey'])
    incoming_data.extend(response['Items'])
    _counter += 1
    print("|-> Getting page %s" % _counter)
I'm currently trying to learn web scraping and decided to scrape some discord data. Code follows:
import requests
import json

def retrieve_messages(channelid):
    num = 0
    headers = {
        'authorization': 'here we enter the authorization code'
    }
    r = requests.get(
        f'https://discord.com/api/v9/channels/{channelid}/messages?limit=100', headers=headers
    )
    jsonn = json.loads(r.text)
    for value in jsonn:
        print(value['content'], '\n')
        num = num + 1
    print('number of messages we collected is', num)

retrieve_messages('server id goes here')
The problem: when I tried changing the limit in messages?limit=100, it apparently only accepts numbers between 0 and 100, meaning the maximum number of messages I can get is 100. I tried changing this number to 900, for example, to scrape more messages, but then I get the error TypeError: string indices must be integers.
Any ideas on how I could get, possibly, all the messages in a channel?
Thank you very much for reading!
APIs that return a bunch of records are almost always limited to some number of items.
Otherwise, if a large quantity of items is requested, the API may fail due to being out of memory.
For that purpose, most APIs implement pagination using limit, before and after parameters where:
limit: tells you how many messages to fetch
before: get messages before this message ID
after: get messages after this message ID
Discord API is no exception as the documentation tells us.
Here's how you do it:
First, you will need to query the data multiple times.
For that, you can use a while loop.
Make sure to add a condition that will prevent the loop from running indefinitely - here, I added a check for whether there are any messages left.
while True:
    # ... requests code
    jsonn = json.loads(r.text)
    if len(jsonn) == 0:
        break
    for value in jsonn:
        print(value['content'], '\n')
        num = num + 1
Define a variable that holds the id of the last message you fetched, and update it as you print each message:
def retrieve_messages(channelid):
    last_message_id = None
    while True:
        # ...
        for value in jsonn:
            print(value['content'], '\n')
            last_message_id = value['id']
            num = num + 1
Now on the first run the last_message_id is None, and on subsequent requests it has the last message you printed.
Use that to build your query:
while True:
    query_parameters = f'limit={limit}'
    if last_message_id is not None:
        query_parameters += f'&before={last_message_id}'

    r = requests.get(
        f'https://discord.com/api/v9/channels/{channelid}/messages?{query_parameters}', headers=headers
    )
    # ...
Note: Discord servers give you the latest messages first, so you have to use the before parameter.
Here's a fully working example of your code:
import requests
import json

def retrieve_messages(channelid):
    num = 0
    limit = 10

    headers = {
        'authorization': 'auth header here'
    }

    last_message_id = None

    while True:
        query_parameters = f'limit={limit}'
        if last_message_id is not None:
            query_parameters += f'&before={last_message_id}'

        r = requests.get(
            f'https://discord.com/api/v9/channels/{channelid}/messages?{query_parameters}', headers=headers
        )
        jsonn = json.loads(r.text)
        if len(jsonn) == 0:
            break

        for value in jsonn:
            print(value['content'], '\n')
            last_message_id = value['id']
            num = num + 1

    print('number of messages we collected is', num)

retrieve_messages('server id here')
To answer this question, we must look at the discord API. Googling "discord api get messages" gets us the developer reference for the discord API. The particular endpoint you are using is documented here:
https://discord.com/developers/docs/resources/channel#get-channel-messages
The limit is documented here, along with the around, before, and after parameters. Using one of these parameters (most likely after) we can paginate the results.
In pseudocode, it would look something like this:
offset = 0
limit = 100
all_messages = []

while True:
    r = requests.get(
        f'https://discord.com/api/v9/channels/{channelid}/messages?limit={limit}&after={offset}', headers=headers
    )
    messages = r.json()  # extract the messages from the response
    all_messages.extend(messages)
    if len(messages) < limit:
        break  # We have reached the end of all the messages, exit the loop
    else:
        offset += limit
By the way, you will probably want to print(r.text) right after the response comes in so you can see what the response looks like. It will save a lot of confusion.
Here is my solution. Feedback is welcome as I'm newish to Python. Kindly provide me w/ credit/good-luck if using this. Thank you =)
import requests

CHANNELID = 'REPLACE_ME'
HEADERS = {'authorization': 'REPLACE_ME'}
LIMIT = 100

all_messages = []
r = requests.get(f'https://discord.com/api/v9/channels/{CHANNELID}/messages?limit={LIMIT}', headers=HEADERS)
all_messages.extend(r.json())
print(f'len(r.json()) is {len(r.json())}', '\n')

while len(r.json()) == LIMIT:
    last_message_id = r.json()[-1].get('id')
    r = requests.get(f'https://discord.com/api/v9/channels/{CHANNELID}/messages?limit={LIMIT}&before={last_message_id}', headers=HEADERS)
    all_messages.extend(r.json())
    print(f'len(r.json()) is {len(r.json())} and last_message_id is {last_message_id} and len(all_messages) is {len(all_messages)}')
I am working on code that fetches records from an API with pagination that allows a maximum of 100 records per request, so I have to loop in multiples of 100. Currently my code compares against the total records and loops from offset 100, then 101, 102, 103, etc. I want it to loop in hundreds (100, 200, 300) and stop as soon as the offset is greater than the total records. I am not sure how to do this; I have partial code that increments by 1 instead of 100 and won't stop when needed. Could anyone please help me with this issue?
import requests
import json
import pandas as pd
from pandas.io.json import json_normalize

# Token for Authorization
API_ACCESS_KEY = 'Token'
Accept = 'application/xml'

# Query details that are passed in the URL
since = '2018-01-01'
until = '2018-02-01'
limit = '100'
offset = '0'
total = 'true'

def get():
    url_address = "https://mywebsite/web?offset="+str('0')
    headers = {
        'Authorization': 'token={0}'.format(API_ACCESS_KEY),
        'Accept': Accept,
    }
    querystring = {"since": since, "until": until, "limit": limit, "total": total}

    # Find out the total number of records
    r = requests.get(url=url_address, headers=headers, params=querystring).json()
    total_record = int(r['total'])
    print("Total record: " + str(total_record))

    # Results will be appended to this list
    all_items = []

    # Loop through all offsets and collect each JSON response
    for offset in range(0, total_record):
        url = "https://mywebsite/web?offset="+str(offset)
        response = requests.get(url=url, headers=headers, params=querystring).json()
        all_items.append(response)
        offset = offset + 100
        print(offset)

    # Prettify JSON
    data = json.dumps(all_items, sort_keys=True, indent=4)
    return data

print(get())
Currently when I print the offset I see
Total Records: 345
100,
101,
102,
Expected:
Total Records: 345
100,
200,
300
Stop the loop!
One way you could do it is to change
for offset in range(0, total_record):
    url = "https://mywebsite/web?offset="+str(offset)
    response = requests.get(url=url, headers=headers, params=querystring).json()
    all_items.append(response)
    offset = offset + 100
    print(offset)
to
for offset in range(0, total_record, 100):
    url = "https://mywebsite/web?offset="+str(offset)
    response = requests.get(url=url, headers=headers, params=querystring).json()
    all_items.append(response)
    print(offset)
since you cannot effectively change offset inside the loop: the reassignment is discarded, because range supplies a fresh value on each iteration.
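A quick way to see why:

# The reassignment is thrown away; range drives the next value.
for offset in range(0, 3):
    offset = offset + 100
    print(offset)   # prints 100, 101, 102 - not 100, 200, 300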
# Loop through all offsets and return the JSON objects
for offset in range(0, total_record, 100):
    url = "https://mywebsite/web?offset="+str(offset)
    response = requests.get(url=url, headers=headers, params=querystring).json()
    all_items.append(response)
    print(offset)
I'm trying to loop an API call that returns a JSON string, since each call is limited to 200 rows. With the code below, the loop doesn't seem to end even when I left it running for an hour or so. The maximum I'm looking to pull is about ~200k rows from the API.
bookmark = ''
urlbase = 'https://..../?'
alldata = []

while True:
    url = urlbase  # first request has no bookmark yet
    if len(bookmark) > 0:
        url = urlbase + 'bookmark=' + bookmark
    response = requests.get(url, auth=('username', 'password'))
    data = response.json()
    alldata.extend(data['rows'])
    bookmark = data['bookmark']
    if len(data['rows']) < 200:
        break
Also, I'm looking to filter the loop to only output rows where the JSON value pet.type is "Puppies" or "Kittens." I haven't been able to figure out the syntax.
Any ideas?
Thanks
The break condition for your loop is incorrect. Notice it's checking len(data["rows"]), where data only includes rows from the most recent request. Instead, you should be looking at the total number of rows you've collected so far: len(alldata).
bookmark = ''
urlbase = 'https://..../?'
alldata = []

while True:
    url = urlbase  # first request has no bookmark yet
    if len(bookmark) > 0:
        url = urlbase + 'bookmark=' + bookmark
    response = requests.get(url, auth=('username', 'password'))
    data = response.json()
    alldata.extend(data['rows'])
    bookmark = data['bookmark']
    # Check `alldata` instead of `data["rows"]`,
    # and set the limit to 200k instead of 200.
    if len(alldata) >= 200000:
        break
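As for the pet.type filter mentioned in the question: the exact row shape isn't shown, but assuming each row is a dict with a nested pet object, one sketch would be:

# Hypothetical row shape: {'pet': {'type': 'Puppies', ...}, ...}
wanted = {'Puppies', 'Kittens'}
filtered = [row for row in alldata
            if row.get('pet', {}).get('type') in wanted]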