I am running a function that produces a list like this:
rft_id_list = []
for i in payload_df_rft:
    payload_rft = json.dumps(i)
    url = 'https://domain/api/link/rft'
    print(url)
    response = requests.request('POST', url, headers=headers, data=payload_rft)
    rft_script_output = response.json()
    # print(rft_script_output)
    rft_id = rft_script_output['id']
    # print(rft_id)
    rft_id_list.append(rft_id)
print(rft_id_list)
print('~~~ Script Finished ~~~')
The script above gives me the values below:
['1234abc', '22345bcde', '33456cdef']
The next sub-function has a URL, and I want to iterate through the rft_id_list values above, appending each one to the URL while doing a PUT.
url = 'https://domain/api/link/' + rft_id_list
What's the best way I can do this?
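One straightforward approach is a for loop that builds one URL per id. This is a minimal sketch; the headers and the PUT body (payload_put here) are assumptions, since the question doesn't show what the PUT should send:
for rft_id in rft_id_list:
    put_url = 'https://domain/api/link/' + rft_id  # one URL per id
    response = requests.put(put_url, headers=headers, data=payload_put)  # payload_put is assumed
    print(put_url, response.status_code)
Concatenating the whole list onto the URL, as in the line above, fails because str + list is not defined; iterating gives you one string id at a time.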
I need to make a request to an API that responds with a maximum of 200 results. If the total amount of data is more than 200, the API response also includes a parameter lastKey that I need to pass in a new request. When all the data has been returned, the lastKey param is no longer present.
My question is: how do I do this in a simple, clean way? This is how I make the first request, and I can see whether the lastKey param is there or not:
url = 'https://example.com'
moreData = False
with requests.Session() as api:
    data = requests.get(url)
    try:
        data.raise_for_status()
    except HTTPError as e:
        return Response(status=status.HTTP_500_INTERNAL_SERVER_ERROR)
    result = data.json()
    if 'lastKey' in result:
        url = 'https://example.com&lastKey=' + result['lastKey']
        moreData = True
How could I do this whole thing, for example, inside a while loop?
Just get the first result outside the while loop, then keep calling your API while 'lastKey' is in the result:
from requests.exceptions import HTTPError  # Response and status come from Django REST framework in your view

url = 'https://example.com'
with requests.Session() as api:
    data = api.get(url)  # use the session object, not the requests module
    try:
        data.raise_for_status()
    except HTTPError as e:
        return Response(status=status.HTTP_500_INTERNAL_SERVER_ERROR)
    result = data.json()
    while 'lastKey' in result:
        url = 'https://example.com&lastKey=' + result['lastKey']
        data = api.get(url)  # reuse the same session for every page
        try:
            data.raise_for_status()
        except HTTPError as e:
            return Response(status=status.HTTP_500_INTERNAL_SERVER_ERROR)
        result = data.json()
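If you want to avoid repeating the request-and-check block twice, a generator is one tidy option. This is just a sketch under the same assumptions as above (an example endpoint that accepts a lastKey query parameter):
import requests

def fetch_pages(base_url):
    """Yield each page of results, following lastKey until it disappears."""
    params = {}
    with requests.Session() as api:
        while True:
            data = api.get(base_url, params=params)
            data.raise_for_status()
            result = data.json()
            yield result
            if 'lastKey' not in result:
                break
            params['lastKey'] = result['lastKey']
Letting requests build the query string via params also fixes the missing '?' in the hand-built 'https://example.com&lastKey=' URL.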
I encounter an index out of range error when I try to get the number of contributors of a GitHub project in a loop. After some iterations (which work perfectly) it just throws that exception, and I have no clue why...
for x in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
    contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
    print(contributors_number)  # prints the correct number until the exception
Here's the exception.
----> 4 contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
IndexError: list index out of range
It seems likely that you're getting a 429 Too Many Requests, since you're firing requests one after the other.
You might want to modify your code like so:
import time

for index in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
    contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
    print(contributors_number)
    time.sleep(3)  # Wait a bit before firing off another request
Better yet would be:
import time

for index in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    if r.status_code == 200:  # Check if the request was successful
        xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
        contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
        print(contributors_number)
    else:
        print("Failed fetching page, status code: " + str(r.status_code))
    time.sleep(3)  # Wait a bit before firing off another request
Now this works perfectly for me when using the API instead. It's probably the cleanest way of doing it.
import requests
import json

url = 'https://api.github.com/repos/valentinxxx/nginxconfig.io/commits?&per_page=100'
response = requests.get(url)
commits = json.loads(response.text)
commits_total = len(commits)
page_number = 1
while len(commits) == 100:
    page_number += 1
    url = 'https://api.github.com/repos/valentinxxx/nginxconfig.io/commits?&per_page=100' + '&page=' + str(page_number)
    response = requests.get(url)
    commits = json.loads(response.text)
    commits_total += len(commits)
GitHub is blocking your repeated requests. Do not scrape sites in quick succession; many website operators actively block clients that send too many requests. As a result, the content that is returned no longer matches your XPath query.
You should be using the REST API that GitHub provides to retrieve project stats such as the number of contributors, and you should implement some kind of rate limiting. There is no need to retrieve the same number 100 times; contributor counts do not change that rapidly.
API responses include information on how many requests you can make in a given time window, and you can use conditional requests to only incur rate-limit costs when the data has actually changed:
import requests
import time
from urllib.parse import parse_qsl, urlparse

owner, repo = 'tipsy', 'profile-summary-for-github'
github_username = '....'
# token = '....'  # optional GitHub basic auth token
stats = 'https://api.github.com/repos/{}/{}/contributors'

with requests.Session() as sess:
    # GitHub requests that you use your username or app name in the User-Agent header
    sess.headers['User-Agent'] += ' - {}'.format(github_username)
    # Consider logging in! You'll get more quota
    # sess.auth = (github_username, token)

    # start with the first page, move to the last when available, include anonymous
    last_page = stats.format(owner, repo) + '?per_page=100&page=1&anon=true'
    while True:
        r = sess.get(last_page)
        if r.status_code == requests.codes.not_found:
            print("No such repo")
            break
        if r.status_code == requests.codes.no_content:
            print("No contributors, repository is empty")
            break
        if r.status_code == requests.codes.accepted:
            print("Stats not yet ready, retrying")
        elif r.status_code == requests.codes.not_modified:
            print("Stats not changed")
        elif r.ok:
            # success! Check for a last page, get that instead of the current
            # one to get an accurate count
            link_last = r.links.get('last', {}).get('url')
            if link_last and r.url != link_last:
                last_page = link_last
            else:
                # this is the last page, report the count
                params = dict(parse_qsl(urlparse(r.url).query))
                page_num = int(params.get('page', '1'))
                per_page = int(params.get('per_page', '100'))
                contributor_count = len(r.json()) + (per_page * (page_num - 1))
                print("Contributor count:", contributor_count)
        # only get us a fresh response next time
        sess.headers['If-None-Match'] = r.headers['ETag']
        # pace ourselves following the rate limit
        window_remaining = int(r.headers['X-RateLimit-Reset']) - time.time()
        rate_remaining = int(r.headers['X-RateLimit-Remaining'])
        # sleep long enough to honour the rate limit, or at least 100 milliseconds
        time.sleep(max(window_remaining / rate_remaining, 0.1))
The above uses a requests session object to handle repeated headers and ensure that you get to reuse connections where possible.
A good library such as github3.py (incidentally written by a requests core contributor) will take care of most of those details for you.
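For instance, a rough sketch of the same lookup with github3.py (from memory, so treat the exact calls as assumptions rather than gospel):
import github3

# anonymous lookup; log in for more quota
repo = github3.repository('tipsy', 'profile-summary-for-github')
contributor_count = sum(1 for _ in repo.contributors(anon=True))
print("Contributor count:", contributor_count)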
If you do want to persist in scraping the site directly, you take the risk that the site operators block you altogether. Try to take some responsibility by not hammering the site continually.
That means that, at the very least, you should honour the Retry-After header that GitHub gives you on a 429:
if not r.ok:
    print("Received a response other than 200 OK:", r.status_code, r.reason)
    retry_after = r.headers.get('Retry-After')
    if retry_after is not None:
        print("Response included a Retry-After:", retry_after)
        time.sleep(int(retry_after))
else:
    pass  # parse the OK response
I know that pandas has the read_json function to effectively get data from an API into a DataFrame. But is there any way to read through all the pages of the API and load them into the same DataFrame?
import requests
import pandas as pd
import config

api_key = config.api_key
url = ('http://api.themoviedb.org/3/discover/movie?release_date.gte=2017-12-01'
       '&release_date.lte=2017-12-31&api_key=' + api_key)
payload = "{}"
response = requests.request("GET", url, data=payload)
print(response.text.encode("utf-8"))
I tried the requests approach above, but it only gives me the first page of the API. I wanted to see whether I could do it with the DataFrame method as below; I can't work out how to write a loop that goes over all the pages and collects everything into one DataFrame for further analysis.
df = pd.read_json('http://api.themoviedb.org/3/discover/movie?'
                  'release_date.gte=2017-12-01&release_date.lte=2017-12-31'
                  '&api_key=''&page=%s' % page)
You can read each page into a dataframe and concatenate them:
page = 1  # TMDB page numbers start at 1
pages = []
while True:
    try:
        next_page = pd.read_json('http://api.themoviedb.org/3/discover/movie?'
                                 'release_date.gte=2017-12-01&release_date.lte=2017-12-31'
                                 '&api_key=''&page=%s' % page)
        # didn't get any content, stop
        if len(next_page) == 0:
            break
        else:
            # move on to the next page
            pages.append(next_page)
            page += 1
    except Exception:
        # if we got an error from the API call, maybe the URL for that page
        # doesn't exist; then stop
        break
df = pd.concat(pages, axis=0)
See the documentation for pd.concat for the details. Hope it helps :)
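As an alternative to stopping on an empty page or an exception, the TMDB discover response itself reports the page count (a total_pages field alongside a results list, if I recall the API correctly; treat the field names as assumptions), so you can loop over an explicit range. YOUR_KEY is a placeholder:
import requests
import pandas as pd

base = ('http://api.themoviedb.org/3/discover/movie?'
        'release_date.gte=2017-12-01&release_date.lte=2017-12-31&api_key=YOUR_KEY')
first = requests.get(base, params={'page': 1}).json()
frames = [pd.DataFrame(first['results'])]
for p in range(2, first['total_pages'] + 1):
    # fetch each remaining page and keep only its results list
    frames.append(pd.DataFrame(requests.get(base, params={'page': p}).json()['results']))
df = pd.concat(frames, ignore_index=True)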
I'm trying to loop over an API call that returns a JSON string, since each call is limited to 200 rows. When I tried the code below, the loop doesn't seem to end, even when I left it running for an hour or so. The maximum I'm looking to pull is about ~200k rows from the API.
bookmark = ''
urlbase = 'https://..../?'
alldata = []
while True:
    url = urlbase
    if len(bookmark) > 0:
        url = urlbase + 'bookmark=' + bookmark
    response = requests.get(url, auth=('username', 'password'))
    data = response.json()
    alldata.extend(data['rows'])
    bookmark = data['bookmark']
    if len(data['rows']) < 200:
        break
Also, I'm looking to filter the loop so it only outputs rows whose JSON value pet.type is "Puppies" or "Kittens". I haven't been able to figure out the syntax.
Any ideas?
Thanks
The break condition for your loop is incorrect. Notice that it's checking len(data["rows"]), where data only includes rows from the most recent request.
Instead, you should be looking at the total number of rows you've collected so far: len(alldata).
bookmark = ''
urlbase = 'https://..../?'
alldata = []
while True:
    url = urlbase
    if len(bookmark) > 0:
        url = urlbase + 'bookmark=' + bookmark
    response = requests.get(url, auth=('username', 'password'))
    data = response.json()
    alldata.extend(data['rows'])
    bookmark = data['bookmark']
    # Check `alldata` instead of `data["rows"]`,
    # and set the limit to 200k instead of 200.
    if len(alldata) >= 200000:
        break
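On the filtering question: assuming each row is a dict with a nested pet object carrying a type field (the exact structure is an assumption, since the question doesn't show a sample row), a list comprehension after the loop would do it:
wanted = {'Puppies', 'Kittens'}
# keep only rows whose pet.type is one of the wanted values
filtered = [row for row in alldata if row.get('pet', {}).get('type') in wanted]
You could equally apply the same test inside the loop, before extending alldata, to avoid holding unwanted rows in memory.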