Index out of range when sending requests in a loop - python

I encounter an index out of range error when I try to get the number of contributors of a GitHub project in a loop. After some iterations (which are working perfectly) it just throws that exception. I have no clue why ...
for x in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
    contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
    print(contributors_number) # prints the correct number until the exception
Here's the exception.
----> 4 contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
IndexError: list index out of range

It seems likely that you're getting a 429 (Too Many Requests) response, since you're firing off requests one after the other. The error page doesn't contain the element your XPath looks for, so xpath() returns an empty list and [0] raises the IndexError.
You might want to modify your code as such:
import time
import requests
from lxml import html
for index in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
    contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
    print(contributors_number)
    time.sleep(3) # Wait a bit before firing off another request
Better yet would be:
import time
import requests
from lxml import html
for index in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    if r.status_code == 200: # Check if the request was successful
        xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
        contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
        print(contributors_number)
    else:
        print("Failed fetching page, status code: " + str(r.status_code))
    time.sleep(3) # Wait a bit before firing off another request

Now this works perfectly for me while using the API. Probably the cleanest way of doing it.
import requests
import json
url = 'https://api.github.com/repos/valentinxxx/nginxconfig.io/commits?per_page=100'
response = requests.get(url)
commits = json.loads(response.text)
commits_total = len(commits)
page_number = 1
while len(commits) == 100:
    page_number += 1
    url = 'https://api.github.com/repos/valentinxxx/nginxconfig.io/commits?per_page=100' + '&page=' + str(page_number)
    response = requests.get(url)
    commits = json.loads(response.text)
    commits_total += len(commits)

GitHub is blocking your repeated requests. Do not scrape sites in quick succession; many website operators actively block clients that send too many requests. As a result, the content that is returned no longer matches your XPath query.
You should be using the REST API that GitHub provides to retrieve project stats like the number of contributors, and you should implement some kind of rate limiting. There is no need to retrieve the same number 100 times, because contributor counts do not change that rapidly.
API responses include information on how many requests you can make in a time window, and you can use conditional requests to only incur rate limit costs when the data actually has changed:
import requests
import time
from urllib.parse import parse_qsl, urlparse
owner, repo = 'tipsy', 'profile-summary-for-github'
github_username = '....'
# token = '....' # optional Github basic auth token
stats = 'https://api.github.com/repos/{}/{}/contributors'
with requests.session() as sess:
    # GitHub requests you use your username or appname in the header
    sess.headers['User-Agent'] += ' - {}'.format(github_username)
    # Consider logging in! You'll get more quota
    # sess.auth = (github_username, token)
    # start with the first, move to the last when available, include anonymous
    last_page = stats.format(owner, repo) + '?per_page=100&page=1&anon=true'
    while True:
        r = sess.get(last_page)
        if r.status_code == requests.codes.not_found:
            print("No such repo")
            break
        if r.status_code == requests.codes.no_content:
            print("No contributors, repository is empty")
            break
        if r.status_code == requests.codes.accepted:
            print("Stats not yet ready, retrying")
        elif r.status_code == requests.codes.not_modified:
            print("Stats not changed")
        elif r.ok:
            # success! Check for a last page, get that instead of current
            # to get accurate count
            link_last = r.links.get('last', {}).get('url')
            if link_last and r.url != link_last:
                last_page = link_last
            else:
                # this is the last page, report on count
                params = dict(parse_qsl(urlparse(r.url).query))
                page_num = int(params.get('page', '1'))
                per_page = int(params.get('per_page', '100'))
                contributor_count = len(r.json()) + (per_page * (page_num - 1))
                print("Contributor count:", contributor_count)
                # only get us a fresh response next time
                sess.headers['If-None-Match'] = r.headers['ETag']
        # pace ourselves following the rate limit
        window_remaining = int(r.headers['X-RateLimit-Reset']) - time.time()
        rate_remaining = int(r.headers['X-RateLimit-Remaining'])
        # sleep long enough to honour the rate limit or at least 100 milliseconds;
        # max(..., 1) avoids dividing by zero once the quota is used up
        time.sleep(max(window_remaining / max(rate_remaining, 1), 0.1))
The above uses a requests session object to handle repeated headers and ensure that you get to reuse connections where possible.
A good library such as github3.py (incidentally written by a requests core contributor) will take care of most of those details for you.
If you do want to persist in scraping the site directly, you run the risk that the site operators block you altogether. Try to take some responsibility by not hammering the site continually.
That means that at the very least, you should honour the Retry-After header that GitHub gives you on 429:
if not r.ok:
    print("Received a response other than 200 OK:", r.status_code, r.reason)
    retry_after = r.headers.get('Retry-After')
    if retry_after is not None:
        print("Response included a Retry-After:", retry_after)
        time.sleep(int(retry_after))
else:
    # parse OK response
    pass
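In general HTTP, a Retry-After header can carry either a number of seconds or an HTTP date, so int(retry_after) alone can fail on the date form. A minimal sketch of a helper that handles both using only the standard library; the name retry_after_seconds and the 60-second fallback are illustrative assumptions, not part of requests or of GitHub's API:

```python
import time
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, default=60.0, now=None):
    """Convert a Retry-After header value into a non-negative delay in seconds."""
    if value is None:
        return default  # header missing: fall back to an assumed default
    value = value.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default  # unparseable value: fall back to the default
    now = time.time() if now is None else now
    return max(when.timestamp() - now, 0.0)
```

With requests this could be used as time.sleep(retry_after_seconds(r.headers.get('Retry-After'))).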

Related

Loop through a list and call API in python

I have a list of author names (more than 1k) and my Google API quota limit is 20k. I want to pass each author name to the API to get book information. When I tested my code I got a "429 Client Error: Too Many Requests for url..." error. How can I slow down my requests without stopping the application? (I'm using Python in Google Colab.)
author_List = ["J. K. Rowling", "mark twain","Emily Dickinson"]
connGoogleAPI(author_List)
def connGoogleAPI(booksData):
    key= "**************************"
    books_list = []
    col= ['Title', 'Authors', 'published Date', 'Description','ISBN']
    books_list.append(col)
    res = ""
    err = None
    with requests.Session() as session:
        #err= ""
        for Authors in booksData:
            params = {"q": Authors,"key": key,"maxResults": 1}
            delays = 65 # approximately 1 minute total delay time for any given author
            while True:
                try:
                    #do something
                except Exception as e:
                    if err.status_code == 429:
                        #print("******")
                        if delays <= 0:
                            raise(e) # we've spent too long delaying
                        time.sleep(1)
                        delays -= 1
                    else:
                        print("-----=")
                        raise(e) # some other status code
            books_list.append(lookup(res,Authors))
    return books_list
You can import time and then add:
time.sleep(1)
at the end of your for loop, to pause for a second between each iteration.
You could slow down your for loop like this:
First, you need to import time:
import time
delay = 2
for author in author_List:  # iterating a list yields its items, not indices
    newList.append(searchData(author))
    time.sleep(delay)
delay sets how many seconds the loop pauses after each iteration (here, 2 seconds).
You probably don't want unconditional delays which will slow down your processing unnecessarily. On the other hand, if you start getting HTTP 429 you can't be sure when or even if the server is going to allow you to continue. So you need a strategy that only introduces delays when/if required but also doesn't get into an infinite loop. Consider this:
import requests
import time
listofauthors = ['Mark Twain', 'Dan Brown', 'William Shakespeare']
with requests.Session() as session:
    for author in listofauthors:
        params = {'q': author}
        delays = 60 # approximately 1 minute total delay time for any given author
        while True:
            try:
                r = session.get('https://www.googleapis.com/books/v1/volumes', params=params)
                r.raise_for_status()
                print(r.json())
                break # all good :-)
            except requests.HTTPError as e:
                # e.response is the response that raise_for_status() rejected
                if e.response.status_code == 429:
                    if delays <= 0:
                        raise # we've spent too long delaying
                    time.sleep(1)
                    delays -= 1
                else:
                    raise # some other status code
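A common variation on the fixed one-second wait is exponential backoff, where each wait is twice as long as the previous one. The following is only a sketch of that idea under stated assumptions: TooManyRequests and call_with_backoff are hypothetical names, and the wrapped function is expected to raise TooManyRequests itself when it sees a 429.

```python
import time


class TooManyRequests(Exception):
    """Hypothetical stand-in for an HTTP 429 error raised by the wrapped call."""


def call_with_backoff(func, max_total_delay=60.0, first_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds, doubling the wait after each TooManyRequests.

    Re-raises the last error once the next wait would push the total time
    slept past max_total_delay.
    """
    delay = first_delay
    waited = 0.0
    while True:
        try:
            return func()
        except TooManyRequests:
            if waited + delay > max_total_delay:
                raise  # we've spent too long delaying
            sleep(delay)
            waited += delay
            delay *= 2
```

The sleep parameter exists so the waiting can be replaced in tests; in normal use it defaults to time.sleep.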

Handling final page in Python paginated API request

I'm requesting Microsoft's Graph API, where I'm using the following function to request multiple pages. I'm trying to request all pages, merge the json files and finally write them to a pandas dataframe.
v = "v1.0"
r = "/users?$filter=userType eq 'Member'&$select=displayName,givenName,jobTitle,mail,department&$top=200"
def query(v, r):
    all_records = []
    url = uri.format(v=v, r=r)
    while True:
        if not url:
            break
        result = requests.get(url, headers=headers)
        if result.status_code == 200:
            json_data = json.loads(result.text)
            all_records = all_records + json_data["value"]
            url = json_data["@odata.nextLink"]
    return all_records
The while-loop goes through all the pages, but when I run the function I'm getting an error:
KeyError: '@odata.nextLink'
I assume this is because the loop reaches the final page, and thus the '@odata.nextLink' cannot be found. But how can I handle this?
You are doing
url = json_data["@odata.nextLink"]
which suggests json_data is a dict, so you should be able to use the .get method, which returns a default value (None by default) when the key is not found. Please try the following and report whether it works as expected:
url = json_data.get("@odata.nextLink")
if url is None:
    print("nextLink not found")
else:
    print("nextLink found")
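Put together with the loop from the question, the .get() call also gives a natural stop condition, because None is falsy. A minimal sketch, assuming the Graph API key name "@odata.nextLink"; collect_all_pages and fetch_json are hypothetical names, and the canned pages stand in for real responses:

```python
def collect_all_pages(first_url, fetch_json):
    """Follow '@odata.nextLink' until it is absent and return all 'value' items.

    fetch_json is a hypothetical callable that takes a URL and returns the
    decoded JSON body as a dict (for example a thin wrapper around requests.get).
    """
    all_records = []
    url = first_url
    while url:
        json_data = fetch_json(url)
        all_records.extend(json_data.get("value", []))
        # .get() returns None on the final page, which ends the loop
        url = json_data.get("@odata.nextLink")
    return all_records


# Usage with canned pages instead of real Graph API responses
pages = {
    "page1": {"value": [1, 2], "@odata.nextLink": "page2"},
    "page2": {"value": [3]},
}
print(collect_all_pages("page1", pages.__getitem__))  # prints [1, 2, 3]
```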

data scraping on discord using python

I'm currently trying to learn web scraping and decided to scrape some discord data. Code follows:
import requests
import json
def retrieve_messages(channelid):
    num=0
    headers = {
        'authorization': 'here we enter the authorization code'
    }
    r = requests.get(
        f'https://discord.com/api/v9/channels/{channelid}/messages?limit=100',headers=headers
    )
    jsonn = json.loads(r.text)
    for value in jsonn:
        print(value['content'], '\n')
        num=num+1
    print('number of messages we collected is',num)
retrieve_messages('server id goes here')
The problem: when I tried changing the limit here messages?limit=100 apparently it only accepts numbers between 0 and 100, meaning that the maximum number of messages I can get is 100. I tried changing this number to 900, for example, to scrape more messages. But then I get the error TypeError: string indices must be integers.
Any ideas on how I could get, possibly, all the messages in a channel?
Thank you very much for reading!
APIs that return a bunch of records are almost always limited to some number of items.
Otherwise, if a large quantity of items is requested, the API may fail due to being out of memory.
For that purpose, most APIs implement pagination using limit, before and after parameters where:
limit: tells you how many messages to fetch
before: get messages before this message ID
after: get messages after this message ID
The Discord API is no exception, as its documentation shows.
Here's how you do it:
First, you will need to query the data multiple times.
For that, you can use a while loop.
Make sure to add an if condition that prevents the loop from running indefinitely. I added a check for whether there are any messages left.
while True:
    # ... requests code
    jsonn = json.loads(r.text)
    if len(jsonn) == 0:
        break
    for value in jsonn:
        print(value['content'], '\n')
        num=num+1
Define a variable that holds the ID of the last message you fetched, and update it as you print each message:
def retrieve_messages(channelid):
    last_message_id = None
    while True:
        # ...
        for value in jsonn:
            print(value['content'], '\n')
            last_message_id = value['id']
            num=num+1
Now on the first run last_message_id is None, and on subsequent requests it holds the ID of the last message you printed.
Use that to build your query:
while True:
    query_parameters = f'limit={limit}'
    if last_message_id is not None:
        query_parameters += f'&before={last_message_id}'
    r = requests.get(
        f'https://discord.com/api/v9/channels/{channelid}/messages?{query_parameters}',headers=headers
    )
    # ...
Note: discord servers give you the latest message first, so you have to use the before parameter
Here's a fully working example of your code
import requests
import json
def retrieve_messages(channelid):
    num = 0
    limit = 10
    headers = {
        'authorization': 'auth header here'
    }
    last_message_id = None
    while True:
        query_parameters = f'limit={limit}'
        if last_message_id is not None:
            query_parameters += f'&before={last_message_id}'
        r = requests.get(
            f'https://discord.com/api/v9/channels/{channelid}/messages?{query_parameters}',headers=headers
        )
        jsonn = json.loads(r.text)
        if len(jsonn) == 0:
            break
        for value in jsonn:
            print(value['content'], '\n')
            last_message_id = value['id']
            num=num+1
    print('number of messages we collected is',num)
retrieve_messages('server id here')
To answer this question, we must look at the discord API. Googling "discord api get messages" gets us the developer reference for the discord API. The particular endpoint you are using is documented here:
https://discord.com/developers/docs/resources/channel#get-channel-messages
The limit is documented here, along with the around, before, and after parameters. Each of these takes a message ID rather than a numeric offset. Since the endpoint returns the newest messages first, before is the natural choice for paging backwards through the channel.
In pseudocode, it would look something like this:
last_message_id = None
limit = 100
all_messages = []
while True:
    url = f'https://discord.com/api/v9/channels/{channelid}/messages?limit={limit}'
    if last_message_id is not None:
        url += f'&before={last_message_id}'
    r = requests.get(url, headers=headers)
    messages = r.json()
    all_messages.extend(messages)
    if len(messages) < limit:
        break # We have reached the end of all the messages, exit the loop
    last_message_id = messages[-1]['id'] # oldest message in this batch
By the way, you will probably want to print(r.text) right after the response comes in so you can see what the response looks like. It will save a lot of confusion.
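Printing the response also explains the TypeError from the question. If the API rejects limit=900, it presumably answers with a JSON object describing the error rather than a list of messages, and iterating over a dict yields its keys, which are strings. A small sketch with made-up response bodies (the exact error payload Discord returns is an assumption here, and message_contents is a hypothetical helper):

```python
import json

ok_body = '[{"id": "2", "content": "hello"}, {"id": "1", "content": "hi"}]'
error_body = '{"message": "Invalid Form Body", "code": 50035}'  # assumed shape


def message_contents(body):
    """Return the 'content' of each message, or raise if the body is not a list."""
    data = json.loads(body)
    if not isinstance(data, list):
        raise ValueError(f"unexpected response: {data!r}")
    return [message["content"] for message in data]


print(message_contents(ok_body))

# Without the isinstance check, looping over the error dict yields its keys,
# and indexing a string with 'content' raises the TypeError from the question.
try:
    for value in json.loads(error_body):
        value["content"]
except TypeError as exc:
    print("TypeError:", exc)
```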
Here is my solution. Feedback is welcome as I'm newish to Python. Kindly provide me w/ credit/good-luck if using this. Thank you =)
import requests
CHANNELID = 'REPLACE_ME'
HEADERS = {'authorization': 'REPLACE_ME'}
LIMIT=100
all_messages = []
r = requests.get(f'https://discord.com/api/v9/channels/{CHANNELID}/messages?limit={LIMIT}',headers=HEADERS)
all_messages.extend(r.json())
print(f'len(r.json()) is {len(r.json())}','\n')
while len(r.json()) == LIMIT:
    last_message_id = r.json()[-1].get('id')
    r = requests.get(f'https://discord.com/api/v9/channels/{CHANNELID}/messages?limit={LIMIT}&before={last_message_id}',headers=HEADERS)
    all_messages.extend(r.json())
    print(f'len(r.json()) is {len(r.json())} and last_message_id is {last_message_id} and len(all_messages) is {len(all_messages)}')

Can't Stop ThreadPoolExecutor

I'm scraping hundreds of urls, each with a leaderboard of data I want, and the only difference between each url string is a 'platform','region', and lastly, the page number. There are only a few platforms and regions, but the page numbers change each day and I don't know how many there are. So that's the first function, I'm just creating lists of urls to be requested in parallel.
If I use page=1, then the result will contain 'table_rows > 0' in the last function. But around page=500, the requested url still pings back but very slowly and then it will show an error message, no leaderboard found, the last function will show 'table_rows == 0', etc. The problem is I need to get through the very last page and I want to do this quickly, hence the threadpoolexecutor - but I can't cancel all the threads or processes or whatever once PAGE_LIMIT is tripped. I threw the executor.shutdown(cancel_futures=True) just to kind of show what I'm looking for. If nobody can help me I'll miserably remove the parallelization and I'll scrape slowly, sadly, one url at a time...
Thanks
from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup
import pandas
import requests
PLATFORM = ['xbl', 'psn', 'atvi', 'battlenet']
REGION = ['us', 'ca']
PAGE_LIMIT = True
def leaderboardLister():
    global REGION
    global PLATFORM
    list_url = []
    for region in REGION:
        for platform in PLATFORM:
            for i in range(1,750):
                list_url.append('https://cod.tracker.gg/warzone/leaderboards/battle-royale/' + platform + '/KdRatio?country=' + region + '&page=' + str(i))
    leaderboardExecutor(list_url,30)
def leaderboardExecutor(urls,threads):
    global PAGE_LIMIT
    global INTERNET
    if len(urls) > 0:
        with ThreadPoolExecutor(max_workers=threads) as executor:
            while True:
                if PAGE_LIMIT == False:
                    executor.shutdown(cancel_futures=True)
                while INTERNET == False:
                    try:
                        print('bad internet')
                        requests.get("http://google.com")
                        INTERNET = True
                    except:
                        time.sleep(3)
                        print('waited')
                executor.map(scrapeLeaderboardPage, urls)
def scrapeLeaderboardPage(url):
    global PAGE_LIMIT
    checkInternet()
    try:
        page = requests.get(url)
        soup = BeautifulSoup(page.content,features = 'lxml')
        table_rows = soup.find_all('tr')
        if len(table_rows) == 0:
            PAGE_LIMIT = False
            print(url)
        else:
            pass
            print('success')
    except:
        INTERNET = False
leaderboardLister()
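One possible way to get the early stop described above is to submit each URL as its own future, watch results with as_completed, and cancel whatever has not started once an empty page shows up. This is only a sketch under assumptions: scrape_until_empty and fetch_rows are hypothetical names, and fetch_rows stands in for the requests/BeautifulSoup code. Note that Future.cancel() only affects calls that have not started yet; a request that is already running cannot be interrupted, so the sketch also uses a threading.Event to let late starters return immediately.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_until_empty(urls, fetch_rows, max_workers=8):
    """Fetch urls in parallel; stop scheduling work once a page has no rows.

    fetch_rows is a hypothetical callable: url -> list of rows.
    Returns a dict mapping url -> rows for the non-empty pages that completed.
    """
    stop = threading.Event()
    results = {}

    def worker(url):
        if stop.is_set():  # skip work that starts after the stop signal
            return url, None
        return url, fetch_rows(url)

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(worker, url) for url in urls]
        for future in as_completed(futures):
            if future.cancelled():
                continue  # calling result() on a cancelled future would raise
            url, rows = future.result()
            if rows is None:
                continue  # worker was skipped because of the stop signal
            if len(rows) == 0:
                stop.set()
                for pending in futures:
                    pending.cancel()  # no effect on running or finished futures
            else:
                results[url] = rows
    return results
```

With several workers, a few pages past the first empty one may still be fetched before the stop signal is seen; their empty results are simply discarded.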

Python 3.6 API while loop to json script not ending

I'm trying to create a loop via API call to a json string since each call is limited to 200 rows. When I tried the below code, the loop doesn't seem to end even when I left the code running for an hour or so. Max rows I'm looking to pull is about ~200k rows from the API.
bookmark=''
urlbase = 'https://..../?'
alldata = []
while True:
    if len(bookmark)>0:
        url = urlbase + 'bookmark=' + bookmark
    requests.get(url, auth=('username', 'password'))
    data = response.json()
    alldata.extend(data['rows'])
    bookmark = data['bookmark']
    if len(data['rows'])<200:
        break
Also, I'm looking to filter the loop to only output if json value 'pet.type' is "Puppies" or "Kittens." Haven't been able to figure out the syntax.
Any ideas?
Thanks
The break condition for your loop is incorrect. Notice it's checking len(data["rows"]), where data only includes rows from the most recent request.
Instead, you should be looking at the total number of rows you've collected so far: len(alldata).
import requests
bookmark = ''
urlbase = 'https://..../?'
alldata = []
while True:
    url = urlbase  # the first request has no bookmark yet
    if len(bookmark) > 0:
        url = urlbase + 'bookmark=' + bookmark
    # keep the response; the original code discarded the return value of requests.get
    response = requests.get(url, auth=('username', 'password'))
    data = response.json()
    alldata.extend(data['rows'])
    bookmark = data['bookmark']
    # Check `alldata` instead of `data["rows"]`,
    # and set the limit to 200k instead of 200.
    # Also stop when a page comes back empty, in case there are fewer than 200k rows.
    if len(alldata) >= 200000 or not data['rows']:
        break
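For the second part of the question, filtering on 'pet.type', the right syntax depends on how the rows are shaped, which the question does not show. A minimal sketch, assuming each row is a dict with a nested "pet" dict that has a "type" key; filter_rows, WANTED_TYPES and the sample rows are all hypothetical:

```python
WANTED_TYPES = {"Puppies", "Kittens"}


def filter_rows(rows, wanted=WANTED_TYPES):
    """Keep rows whose row['pet']['type'] is one of the wanted values."""
    return [row for row in rows if row.get("pet", {}).get("type") in wanted]


sample_rows = [
    {"id": 1, "pet": {"type": "Puppies"}},
    {"id": 2, "pet": {"type": "Hamsters"}},
    {"id": 3, "pet": {"type": "Kittens"}},
    {"id": 4},  # rows without a 'pet' entry are skipped
]
print([row["id"] for row in filter_rows(sample_rows)])  # prints [1, 3]
```

Inside the loop this would replace alldata.extend(data['rows']) with alldata.extend(filter_rows(data['rows'])). If the key is literally the flat string 'pet.type', row.get("pet.type") in wanted would be the equivalent test.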
