Reset offset variable in API when using a for loop - python

I am working with a real estate API, pulling rental listings. I'd like to loop through a list of zipcodes to pull the data. The API returns at most 500 rows per request, so I have to page through results with an offset. The code below works fine until the while loop reaches the second zipcode. The issue is that after the first zipcode has run successfully, I need the offset variable to reset and start counting up from the beginning again until the while loop breaks for the second zipcode in the list.
import requests

# This just formats your token for the requests library.
headers = {
    "X-RapidAPI-Key": "your-key-here",
    "X-RapidAPI-Host": "realty-mole-property-api.p.rapidapi.com"
}

# Initial limit and offset values.
limit = 500
offset = 0
zipCode = [77449, 77008]

# This will be an array of all the listing records.
texas_listings = []

# We loop until we get no results.
for i in zipCode:
    while True:
        print("----")
        url = f"https://realty-mole-property-api.p.rapidapi.com/rentalListings?offset={offset}&limit={limit}&zipCode={i}"
        print("Requesting", url)
        response = requests.get(url, headers=headers)
        data = response.json()
        print(data)
        # Did we find any listings?
        if len(data) == 0:
            # If not, exit the loop.
            break
        # If we did find listings, add them to the data
        # and then move on to the next offset.
        texas_listings.extend(data)
        offset = offset + 500
Here is a snippet of the final printed output. As you can see, zipcode 77008 gets passed to the zipCode variable only after zipcode 77449 returns an empty list and breaks the loop at offset 5500. However, you can also see that the offset for 77008 starts at 5500, and there aren't that many listings in that zipcode. How do I reset the offset variable and begin counting again for each zipcode?

You can reset the offset variable back to its starting value after the while loop finishes, before moving on to the next zip code:
for i in zipCode:
    while True:
        print("----")
        url = f"https://realty-mole-property-api.p.rapidapi.com/rentalListings?offset={offset}&limit={limit}&zipCode={i}"
        print("Requesting", url)
        response = requests.get(url, headers=headers)
        data = response.json()
        print(data)
        if len(data) == 0:
            break
        texas_listings.extend(data)
        offset = offset + 500
    offset = 0  # reset the offset for the next zip code

Update: putting the offset and limit within the for loop makes it work the way I expect.
# We loop until we get no results.
for i in zipCode:
    limit = 500
    offset = 0
    while True:
        print("----")
        url = f"https://realty-mole-property-api.p.rapidapi.com/rentalListings?offset={offset}&limit={limit}&zipCode={i}"
        print("Requesting", url)
        response = requests.get(url, headers=headers)
        data = response.json()
        print(data)
        # Did we find any listings?
        if len(data) == 0:
            # If not, exit the loop.
            break
        # If we did find listings, add them to the data
        # and then move on to the next offset.
        texas_listings.extend(data)
        offset = offset + 500

# `texas` is an existing DataFrame from a previous pull; prepend the new rows to it.
texas = pd.DataFrame(texas_listings).append(texas, ignore_index=True)
texas['date_pulled'] = pd.Timestamp.now().normalize()
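Side note: DataFrame.append was removed in pandas 2.0, so on a current pandas the last two lines need pd.concat instead. A minimal equivalent, assuming texas is an existing DataFrame from a previous pull as above:

new_rows = pd.DataFrame(texas_listings)
texas = pd.concat([new_rows, texas], ignore_index=True)  # same result as the .append() call
texas['date_pulled'] = pd.Timestamp.now().normalize()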

Related

Batching API Requests

I have a list of 1,000 airports I am sending to an API to get flight data for each airport. The API cannot handle the entire list at once, even if I delay the calls. I need to place the list of airports into batches of 100 for the API calls to work properly. My code below iterates over the list of airports and sends them one by one to the API. I want to break the API calls (the airport list) into batches of 100, because using the entire 1,000 causes errors in the data format. When I test the API with only 100 airports, all data is returned properly. I'm unsure where to place the batching code in my API call loop.
import time
from time import sleep

import requests
from tqdm import trange

# apiUrl and auth_header are defined elsewhere.

# Sample dataset for this post
airport = [['HLZN'], ['HLLQ'], ['HLLB'], ['HLGT'], ['HLMS'], ['HLLS'], ['HLTQ'], ['HLLT'], ['HLLM']]
payload = {'max_pages': 500, 'type': 'Airline'}
seconds = 1
count = 1

# Create an empty list to hold responses
json_responses = []

# Iterate through the list
for airports in airport:
    response = requests.get(apiUrl + f"airports/{airports[0]}/flights",
                            params=payload,
                            headers=auth_header)
    if response.status_code == 200:
        print(count, airports)
        count += 1
        for i in trange(100):
            time.sleep(0.01)
    else:
        pass
    results = response.json()
    json_responses.append(response.json())
    sleep(seconds)
I'm not sure where to place the batching code inside the API call loop. I'm new to batching API calls and loops in general, so any help will be appreciated.
total_count = len(airport)

# Iterate through the list
for airports in airport:
    response = requests.get(apiUrl + f"airports/{airports[0]}/flights",
                            params=payload,
                            headers=auth_header)
    chunks = (total_count - 1) // 100 + 1
    for i in range(chunks):
        batch = airport[i*100:(i+1)*100]  # Tried batch code here
    if response.status_code == 200:
        print(count, airports)
        count += 1
        for i in trange(100):
            time.sleep(0.01)
    else:
        pass
    results = response.json()
    json_responses.append(response.json())
    sleep(seconds)
I believe this is what you're trying to do:
# Sample dataset for this post
airports = [['HLZN'], ['HLLQ'], ['HLLB'], ['HLGT'], ['HLMS'], ['HLLS'], ['HLTQ'], ['HLLT'], ['HLLM']]
payload = {'max_pages': 500, 'type': 'Airline'}
seconds = 1

# Create an empty list to hold responses
json_responses = []

# Counter variable
counter = 0
# Chunk size
chunk_size = 100

# Iterate through the list
for airport in airports:
    response = requests.get(apiUrl + f"airports/{airport[0]}/flights",
                            params=payload,
                            headers=auth_header)
    results = response.json()
    json_responses.append(response.json())

    # Increment the counter; whenever it reaches a multiple of the chunk size,
    # sleep for the defined number of seconds.
    counter += 1
    if counter % chunk_size == 0:
        sleep(seconds)
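If the goal is true batches (sublists of 100 airports) rather than a pause every 100 requests, the list can be sliced into chunks first. A minimal sketch along those lines, assuming the same apiUrl, payload, auth_header and seconds as in the snippets above:

from time import sleep

import requests

chunk_size = 100
json_responses = []

for start in range(0, len(airports), chunk_size):
    batch = airports[start:start + chunk_size]  # one batch of up to 100 airports
    for airport in batch:
        response = requests.get(apiUrl + f"airports/{airport[0]}/flights",
                                params=payload,
                                headers=auth_header)
        if response.status_code == 200:
            json_responses.append(response.json())
    sleep(seconds)  # pause between batches rather than between individual requests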

AWS DynamoDB BOTO3 Confusing Scan

Basically, I loop over a date range, performing one scan per day with a date filter, like:
table_hook = dynamodb_resource.Table('table1')
date_filter = Key('date_column').between('2021-01-01T00:00:00+00:00', '2021-01-01T23:59:59+00:00')

response = table_hook.scan(FilterExpression=date_filter)
incoming_data = response['Items']

if (response['Count']) == 0:
    return

_counter = 1
while 'LastEvaluatedKey' in response:
    response = table_hook.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
    if (
        parser.parse(response['Items'][0]['date_column']).replace(tzinfo=None) < parser.parse('2021-01-01T00:00:00+00:00').replace(tzinfo=None)
        or
        parser.parse(response['Items'][0]['date_column']).replace(tzinfo=None) > parser.parse('2021-06-07T23:59:59+00:00').replace(tzinfo=None)
    ):
        break
    incoming_data.extend(response['Items'])
    _counter += 1
    print("|-> Getting page %s" % _counter)
At the end of the Day1-to-Day2 loop it retrieves X rows.
But if I perform the same scan in the same way (paginating), with the same range (Day1 to Day2), without the loop, it retrieves Y rows.
And to make it stranger, when I call table.describe_table(TableName='table1'), the row_count field reports Z rows. I literally don't understand what is going on!
Based on the help above, I found my error: I was not passing the filter again when paginating. The fixed code is:
table_hook = dynamodb_resource.Table('table1')
date_filter = Key('date_column').between('2021-01-01T00:00:00+00:00', '2021-01-01T23:59:59+00:00')

response = table_hook.scan(FilterExpression=date_filter)
incoming_data = response['Items']

_counter = 1
while 'LastEvaluatedKey' in response:
    response = table_hook.scan(FilterExpression=date_filter,
                               ExclusiveStartKey=response['LastEvaluatedKey'])
    incoming_data.extend(response['Items'])
    _counter += 1
    print("|-> Getting page %s" % _counter)

Pagination loop with offset 100

I am working on code that fetches records from an API with pagination that allows a maximum of 100 records per request, so I have to loop in multiples of 100. Currently my code reads the total record count and then loops from offset 100 to 101, 102, 103, and so on. I want it to loop in steps of 100 (100, 200, 300) and stop as soon as the offset is greater than the total number of records. I am not sure how to do this; I have partial code which increments by 1 instead of 100 and won't stop when needed. Could anyone please help me with this issue?
import json

import pandas as pd
import requests
from pandas.io.json import json_normalize

# Token for Authorization
API_ACCESS_KEY = 'Token'
Accept = 'application/xml'

# Query details that are passed in the URL
since = '2018-01-01'
until = '2018-02-01'
limit = '100'
offset = '0'
total = 'true'

def get():
    url_address = "https://mywebsite/web?offset=" + str('0')
    headers = {
        'Authorization': 'token={0}'.format(API_ACCESS_KEY),
        'Accept': Accept,
    }
    querystring = {"since": since, "until": until, "limit": limit, "total": total}

    # Find out the total number of records
    r = requests.get(url=url_address, headers=headers, params=querystring).json()
    total_record = int(r['total'])
    print("Total record: " + str(total_record))

    # Results will be appended to this list
    all_items = []

    # Loop through all offsets and return a JSON object
    for offset in range(0, total_record):
        url = "https://mywebsite/web?offset=" + str(offset)
        response = requests.get(url=url, headers=headers, params=querystring).json()
        all_items.append(response)
        offset = offset + 100
        print(offset)

    # Prettify JSON
    data = json.dumps(all_items, sort_keys=True, indent=4)
    return data

print(get())
Currently when I print the offset I see
Total Records: 345
100,
101,
102,
Expected:
Total Records: 345
100,
200,
300
Stop the loop!
One way you could do it is to change
for offset in range(0, total_record):
    url = "https://mywebsite/web?offset=" + str(offset)
    response = requests.get(url=url, headers=headers, params=querystring).json()
    all_items.append(response)
    offset = offset + 100
    print(offset)
to
for offset in range(0, total_record, 100):
    url = "https://mywebsite/web?offset=" + str(offset)
    response = requests.get(url=url, headers=headers, params=querystring).json()
    all_items.append(response)
    print(offset)
since reassigning offset inside the loop body has no effect; range() simply sets it to the next value on each iteration.
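For reference, the same stop condition can also be written as an explicit while loop, where incrementing the offset by hand does work; this sketch reuses the headers and querystring defined in the function above:

offset = 0
all_items = []
while offset < total_record:
    url = "https://mywebsite/web?offset=" + str(offset)
    response = requests.get(url=url, headers=headers, params=querystring).json()
    all_items.append(response)
    print(offset)
    offset += 100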

Index out of range when sending requests in a loop

I encounter an index out of range error when I try to get the number of contributors of a GitHub project in a loop. After some iterations (which are working perfectly) it just throws that exception. I have no clue why ...
import requests
from lxml import html

for x in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
    contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
    print(contributors_number)  # prints the correct number until the exception
Here's the exception.
----> 4 contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
IndexError: list index out of range
It seems likely that you're getting a 429 Too Many Requests, since you're firing off requests one after the other.
You might want to modify your code as such:
import time

for index in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
    contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
    print(contributors_number)
    time.sleep(3)  # Wait a bit before firing off another request
Better yet would be:
import time

for index in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    if r.status_code in [200]:  # Check if the request was successful
        xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
        contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
        print(contributors_number)
    else:
        print("Failed fetching page, status code: " + str(r.status_code))
    time.sleep(3)  # Wait a bit before firing off another request
Now this works perfectly for me while using the API. Probably the cleanest way of doing it.
import requests
import json

url = 'https://api.github.com/repos/valentinxxx/nginxconfig.io/commits?&per_page=100'
response = requests.get(url)
commits = json.loads(response.text)
commits_total = len(commits)
page_number = 1

while len(commits) == 100:
    page_number += 1
    url = 'https://api.github.com/repos/valentinxxx/nginxconfig.io/commits?&per_page=100' + '&page=' + str(page_number)
    response = requests.get(url)
    commits = json.loads(response.text)
    commits_total += len(commits)
GitHub is blocking your repeated requests. Do not scrape sites in quick succession; many website operators actively block clients that make too many requests. As a result, the content that is returned no longer matches your XPath query.
You should be using the REST API that GitHub provides to retrieve project stats like the number of contributors, and you should implement some kind of rate limiting. There is no need to retrieve the same number 100 times, contributor counts do not change that rapidly.
API responses include information on how many requests you can make in a time window, and you can use conditional requests to only incur rate limit costs when the data actually has changed:
import requests
import time
from urllib.parse import parse_qsl, urlparse

owner, repo = 'tipsy', 'profile-summary-for-github'
github_username = '....'
# token = '....'  # optional GitHub basic auth token

stats = 'https://api.github.com/repos/{}/{}/contributors'

with requests.session() as sess:
    # GitHub requests that you use your username or app name in the header
    sess.headers['User-Agent'] += ' - {}'.format(github_username)

    # Consider logging in! You'll get more quota
    # sess.auth = (github_username, token)

    # start with the first page, move to the last when available, include anonymous
    last_page = stats.format(owner, repo) + '?per_page=100&page=1&anon=true'

    while True:
        r = sess.get(last_page)
        if r.status_code == requests.codes.not_found:
            print("No such repo")
            break
        if r.status_code == requests.codes.no_content:
            print("No contributors, repository is empty")
            break
        if r.status_code == requests.codes.accepted:
            print("Stats not yet ready, retrying")
        elif r.status_code == requests.codes.not_modified:
            print("Stats not changed")
        elif r.ok:
            # success! Check for a last page, get that instead of the current one
            # to get an accurate count
            link_last = r.links.get('last', {}).get('url')
            if link_last and r.url != link_last:
                last_page = link_last
            else:
                # this is the last page, report the count
                params = dict(parse_qsl(urlparse(r.url).query))
                page_num = int(params.get('page', '1'))
                per_page = int(params.get('per_page', '100'))
                contributor_count = len(r.json()) + (per_page * (page_num - 1))
                print("Contributor count:", contributor_count)

        # only get a fresh response next time
        sess.headers['If-None-Match'] = r.headers['ETag']

        # pace ourselves following the rate limit
        window_remaining = int(r.headers['X-RateLimit-Reset']) - time.time()
        rate_remaining = int(r.headers['X-RateLimit-Remaining'])
        # sleep long enough to honour the rate limit, or at least 100 milliseconds
        time.sleep(max(window_remaining / rate_remaining, 0.1))
The above uses a requests session object to handle repeated headers and ensure that you get to reuse connections where possible.
A good library such as github3.py (incidentally written by a requests core contributor) will take care of most of those details for you.
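For example, a rough sketch with github3.py; the exact call names here are assumptions to check against the documentation for your installed version:

import github3

# Anonymous access works but has a low rate limit; pass credentials for more quota.
repo = github3.repository('tipsy', 'profile-summary-for-github')
contributor_count = sum(1 for _ in repo.contributors())
print("Contributor count:", contributor_count)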
If you do want to persist in scraping the site directly, you do take a risk that the site operators block you altogether. Try to take some responsibility by not hammering the site continually.
That means that at the very least, you should honour the Retry-After header that GitHub gives you on 429:
if not r.ok:
    print("Received a response other than 200 OK:", r.status_code, r.reason)
    retry_after = r.headers.get('Retry-After')
    if retry_after is not None:
        print("Response included a Retry-After:", retry_after)
        time.sleep(int(retry_after))
else:
    # parse the OK response
    ...

Python 3.6 API while loop to json script not ending

I'm trying to loop over an API call that returns JSON, since each call is limited to 200 rows. When I tried the code below, the loop doesn't seem to end, even after I left it running for an hour or so. I'm looking to pull a maximum of about ~200k rows from the API.
import requests

bookmark = ''
urlbase = 'https://..../?'
alldata = []

while True:
    if len(bookmark) > 0:
        url = urlbase + 'bookmark=' + bookmark
    else:
        url = urlbase
    response = requests.get(url, auth=('username', 'password'))
    data = response.json()
    alldata.extend(data['rows'])
    bookmark = data['bookmark']
    if len(data['rows']) < 200:
        break
Also, I'm looking to filter the loop to only output if json value 'pet.type' is "Puppies" or "Kittens." Haven't been able to figure out the syntax.
Any ideas?
Thanks
The break condition for your loop is incorrect. Notice it's checking len(data["rows"]), where data only includes the rows from the most recent request.
Instead, you should be looking at the total number of rows you've collected so far: len(alldata).
bookmark = ''
urlbase = 'https://..../?'
alldata = []

while True:
    if len(bookmark) > 0:
        url = urlbase + 'bookmark=' + bookmark
    else:
        url = urlbase
    response = requests.get(url, auth=('username', 'password'))
    data = response.json()
    alldata.extend(data['rows'])
    bookmark = data['bookmark']

    # Check `alldata` instead of `data["rows"]`,
    # and set the limit to 200k instead of 200.
    if len(alldata) >= 200000:
        break
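The question also asks how to keep only rows whose pet.type is "Puppies" or "Kittens". A minimal sketch, assuming each row is a dict with a nested "pet" object that has a "type" key (adjust the keys to match the real payload):

wanted_types = {"Puppies", "Kittens"}
filtered = [row for row in alldata
            if row.get("pet", {}).get("type") in wanted_types]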
