I have a list of author names (more than 1,000), and my Google API quota limit is 20k. I want to pass each author name to the API to get book information. When I tested my code I got a "429 Client Error: Too Many Requests for url..." error. How can I slow down the requests without stopping the application? (I'm using Python in Google Colab.)
author_List = ["J. K. Rowling", "mark twain","Emily Dickinson"]
connGoogleAPI(author_List)
def connGoogleAPI(booksData):
    key= "**************************"
    books_list = []
    col= ['Title', 'Authors', 'published Date', 'Description','ISBN']
    books_list.append(col)
    res = ""
    err = None
    with requests.Session() as session:
        #err= ""
        for Authors in booksData:
            params = {"q": Authors,"key": key,"maxResults": 1}
            delays = 65 # approximately 1 minute total delay time for any given author
            while True:
                try:
                    #do something
                except Exception as e:
                    if err.status_code == 429:
                        #print("******")
                        if delays <= 0:
                            raise(e) # we've spent too long delaying
                        time.sleep(1)
                        delays -= 1
                    else:
                        print("-----=")
                        raise(e) # some other status code
            books_list.append(lookup(res,Authors))
    return books_list
You can import time and then add:
time.sleep(1)
at the end of your for loop, to pause for a second between each iteration.
You could slow down your for loop like this:
import time

delay = 2  # seconds to wait between requests
newList = []
for author in author_List:
    newList.append(searchData(author))
    time.sleep(delay)
The delay variable sets how many seconds the loop pauses after each iteration (here, 2 seconds).
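A fixed sleep adds the full delay even when the request itself already took a while. As a minimal sketch (the Throttle class and its parameter names are made up for illustration, not part of any library), you could sleep only for whatever is left of the interval:

```python
import time


class Throttle:
    """Ensure at least `interval` seconds pass between consecutive wait() calls."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous wait() call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                # Only sleep for the part of the interval that has not elapsed yet.
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling throttle.wait() at the top of the loop body, before each request, would then space the requests at least `interval` seconds apart without adding delay to requests that were already slow.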
You probably don't want unconditional delays which will slow down your processing unnecessarily. On the other hand, if you start getting HTTP 429 you can't be sure when or even if the server is going to allow you to continue. So you need a strategy that only introduces delays when/if required but also doesn't get into an infinite loop. Consider this:
import requests
import time

listofauthors = ['Mark Twain', 'Dan Brown', 'William Shakespeare']

with requests.Session() as session:
    for author in listofauthors:
        params = {'q': author}
        delays = 60 # approximately 1 minute total delay time for any given author
        while True:
            try:
                r = session.get('https://www.googleapis.com/books/v1/volumes', params=params)
                r.raise_for_status()
                print(r.json())
                break # all good :-)
            except requests.HTTPError as e:
                # raise_for_status() attaches the failed response to the exception
                if e.response.status_code == 429:
                    if delays <= 0:
                        raise # we've spent too long delaying
                    time.sleep(1)
                    delays -= 1
                else:
                    raise # some other status code
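A variation on the same idea, as a rough sketch rather than a drop-in replacement: instead of sleeping one second at a time, double the wait after each 429 and use the Retry-After header when the server sends one. The names backoff_delay and get_with_backoff are made up here, and fetch is assumed to be any zero-argument callable that returns (status_code, headers, body), for example a small wrapper around session.get.

```python
import time


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based), capped at `cap`."""
    if retry_after is not None:
        try:
            # Prefer the server's hint when it is a plain number of seconds.
            return min(float(retry_after), cap)
        except ValueError:
            pass  # not a number (e.g. an HTTP date): fall back to exponential backoff
    return min(base * (2 ** attempt), cap)


def get_with_backoff(fetch, max_attempts=6, sleep=time.sleep):
    """Call fetch() until it returns a status other than 429 or attempts run out."""
    for attempt in range(max_attempts):
        status, headers, body = fetch()
        if status != 429:
            return status, body
        sleep(backoff_delay(attempt, headers.get("Retry-After")))
    raise RuntimeError("still rate limited after %d attempts" % max_attempts)
```

With the default settings the waits would be 1, 2, 4, 8, 16 and 32 seconds, so a single author is retried for about a minute before the error is raised.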
I have a python script that is failing when my request returns a blank JSON response. The code loops through the function repeatedly and is successful 99% of the time, but fails every once in a while when a blank JSON response is received.
i = 0
while i < 1000:
    r = 0
    while r < 4:
        time.sleep(5)
        response = c.get_quote(symbol)
        jsonData = response.json()
        for key in jsonData:
            jsonkey = key
        if jsonkey == symbol:
            print (i)
            print("you are good", response)
            print(jsonData)
            print ()
            break
        else:
            print("you have a problem, jsonkey=")
            print()
            print (jsonData)
            print()
            r =+ 1
    current_price = response.json()[symbol]["lastPrice"]
    i += 1
The 'While r < 4:' loop was added in an attempt to add error handling. If I can figure out what to trap on, I would retry the response = c.get_quote(symbol) but the blank JSON response is slipping past the if jsonkey == symbol logic.
The error message received is "current_price = response.json()[symbol]["lastPrice"]
KeyError: 'NVCR'"
and the output from print (jsonData) is: {}
as opposed to a healthy response which contains the symbol as a key with additional data to follow. The request is returning a response [200] so unfortunately it isn't that simple...
Instead of validating the key with jsonkey == symbol, use a try-except block to catch the blank response errors and handle them.
For instance:
i = 0
while i < 1000:
    time.sleep(5)
    response = c.get_quote(symbol)
    jsonData = response.json()
    try:
        # A blank response ({}) has no entry for the symbol, so this raises KeyError
        current_price = jsonData[symbol]["lastPrice"]
    except KeyError:
        print("you have a problem \n")
        print(jsonData, "\n")
    else:
        print(i)
        print("you are good", response)
        print(jsonData, "\n")
    i += 1
@DeepSpace is also likely correct in the comments. My guess is that the server that you're pulling JSON data from (nsetools?) is throttling your requests, so it might be worth looking deeper in their docs to see if you can find a limit, and then use time.sleep() to stay under it.
Edit: If you are using nsetools, their api seems to be built by reverse-engineering the api that the nse website is built on and performing json api calls to urls such as this one (these can be found in this source code file). Because of this, it's not documented what the rate limit is, as this data is scraped directly from NSE and subject to their rate limit. Using this data is against NSE's terms of use (unless they have express written consent from the government of India which for all I know nsetools has, but I assume you do not.)
OK, so thanks to @DeepSpace and @CarlHR I think I have a solution, but it still seems like too much code for what I am trying to accomplish. This works:
i = 0
while i < 1000:
    r = 1
    while r < 5:
        time.sleep(1)
        response = c.get_quote(symbol)
        jsonData = response.json()
        try:
            current_price = jsonData[symbol]["lastPrice"]
            print("Looks Good, moving on")
            break
        except KeyError:
            print("There was a problem with the JSON response, trying again. retry number:", r)
            print(jsonData)
            print()
            r += 1
    i += 1
    print("Moving on to the next iteration")
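If the amount of code is the concern, the retry logic could be pulled into a small helper. This is only a sketch: fetch_last_price is a made-up name, and get_json is assumed to be a callable that takes the symbol and returns the decoded JSON, for example lambda s: c.get_quote(s).json().

```python
import time


def fetch_last_price(get_json, symbol, retries=4, delay=1.0, sleep=time.sleep):
    """Return data[symbol]["lastPrice"], retrying when the response is blank.

    Returns None if every attempt came back without the expected keys.
    """
    for attempt in range(1, retries + 1):
        data = get_json(symbol)
        try:
            return data[symbol]["lastPrice"]
        except (KeyError, TypeError):
            # {} raises KeyError; a non-dict payload raises TypeError
            print("Blank or malformed response, retry number:", attempt)
            sleep(delay)
    return None
```

The outer loop then shrinks to a single call per iteration, plus a check for None when all retries failed.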
I'm currently trying to learn web scraping and decided to scrape some discord data. Code follows:
import requests
import json
def retrieve_messages(channelid):
    num=0
    headers = {
        'authorization': 'here we enter the authorization code'
    }
    r = requests.get(
        f'https://discord.com/api/v9/channels/{channelid}/messages?limit=100',headers=headers
    )
    jsonn = json.loads(r.text)
    for value in jsonn:
        print(value['content'], '\n')
        num=num+1
    print('number of messages we collected is',num)
retrieve_messages('server id goes here')
The problem: when I tried changing the limit here messages?limit=100 apparently it only accepts numbers between 0 and 100, meaning that the maximum number of messages I can get is 100. I tried changing this number to 900, for example, to scrape more messages. But then I get the error TypeError: string indices must be integers.
Any ideas on how I could get, possibly, all the messages in a channel?
Thank you very much for reading!
APIs that return a bunch of records are almost always limited to some number of items.
Otherwise, if a large quantity of items is requested, the API may fail due to being out of memory.
For that purpose, most APIs implement pagination using limit, before and after parameters where:
limit: tells you how many messages to fetch
before: get messages before this message ID
after: get messages after this message ID
Discord API is no exception as the documentation tells us.
Here's how you do it:
First, you will need to query the data multiple times.
For that, you can use a while loop.
Make sure to add an if condition that prevents the loop from running indefinitely; I added a check for whether there are any messages left.
while True:
    # ... requests code
    jsonn = json.loads(r.text)
    if len(jsonn) == 0:
        break
    for value in jsonn:
        print(value['content'], '\n')
        num=num+1
Define a variable that holds the ID of the last message you fetched, and update it as you print each message:
def retrieve_messages(channelid):
    last_message_id = None
    while True:
        # ...
        for value in jsonn:
            print(value['content'], '\n')
            last_message_id = value['id']
            num=num+1
Now on the first run the last_message_id is None, and on subsequent requests it has the last message you printed.
Use that to build your query
while True:
    query_parameters = f'limit={limit}'
    if last_message_id is not None:
        query_parameters += f'&before={last_message_id}'
    r = requests.get(
        f'https://discord.com/api/v9/channels/{channelid}/messages?{query_parameters}',headers=headers
    )
    # ...
Note: discord servers give you the latest message first, so you have to use the before parameter
Here's a fully working example of your code
import requests
import json
def retrieve_messages(channelid):
    num = 0
    limit = 10
    headers = {
        'authorization': 'auth header here'
    }
    last_message_id = None
    while True:
        query_parameters = f'limit={limit}'
        if last_message_id is not None:
            query_parameters += f'&before={last_message_id}'
        r = requests.get(
            f'https://discord.com/api/v9/channels/{channelid}/messages?{query_parameters}',headers=headers
        )
        jsonn = json.loads(r.text)
        if len(jsonn) == 0:
            break
        for value in jsonn:
            print(value['content'], '\n')
            last_message_id = value['id']
            num=num+1
    print('number of messages we collected is',num)
retrieve_messages('server id here')
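The same cursor logic can be separated from the HTTP call, which makes it easier to test and reuse. A minimal sketch, assuming a hypothetical fetch_page(limit, before) callable that performs one request and returns the decoded JSON (neither name comes from the Discord API). It also checks that the response is a list, since the TypeError in the question suggests that a rejected request returns a JSON object rather than a list of messages.

```python
def iter_messages(fetch_page, limit=100):
    """Yield messages newest-first by following the `before` cursor.

    fetch_page(limit, before) is assumed to return the decoded JSON of one
    request; `before` is None on the first call.
    """
    before = None
    while True:
        page = fetch_page(limit, before)
        if not isinstance(page, list):
            # An error payload is not a list of messages, so stop with a clear error.
            raise ValueError("unexpected response: %r" % (page,))
        yield from page
        if len(page) < limit:
            return  # a short (or empty) page means there is nothing older left
        before = page[-1]["id"]  # the oldest message seen so far
```

Here fetch_page could wrap the requests.get call from the example above, passing limit and before as query parameters.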
To answer this question, we must look at the discord API. Googling "discord api get messages" gets us the developer reference for the discord API. The particular endpoint you are using is documented here:
https://discord.com/developers/docs/resources/channel#get-channel-messages
The limit is documented here, along with the around, before, and after parameters. These parameters take a message ID rather than a numeric offset, so we can paginate by passing the ID of the last message we received (using before, since the newest messages are returned first).
In outline, it would look something like this (channelid and headers are the same as in your code):
limit = 100
last_message_id = None
all_messages = []
while True:
    params = {'limit': limit}
    if last_message_id is not None:
        params['before'] = last_message_id
    r = requests.get(
        f'https://discord.com/api/v9/channels/{channelid}/messages',
        params=params, headers=headers
    )
    messages = r.json()
    all_messages.extend(messages)
    if len(messages) < limit:
        break # We have reached the end of all the messages, exit the loop
    last_message_id = messages[-1]['id']
By the way, you will probably want to print(r.text) right after the response comes in so you can see what the response looks like. It will save a lot of confusion.
Here is my solution. Feedback is welcome as I'm newish to Python. Kindly provide me w/ credit/good-luck if using this. Thank you =)
import requests
CHANNELID = 'REPLACE_ME'
HEADERS = {'authorization': 'REPLACE_ME'}
LIMIT=100
all_messages = []
r = requests.get(f'https://discord.com/api/v9/channels/{CHANNELID}/messages?limit={LIMIT}',headers=HEADERS)
all_messages.extend(r.json())
print(f'len(r.json()) is {len(r.json())}','\n')
while len(r.json()) == LIMIT:
    last_message_id = r.json()[-1].get('id')
    r = requests.get(f'https://discord.com/api/v9/channels/{CHANNELID}/messages?limit={LIMIT}&before={last_message_id}',headers=HEADERS)
    all_messages.extend(r.json())
    print(f'len(r.json()) is {len(r.json())} and last_message_id is {last_message_id} and len(all_messages) is {len(all_messages)}')
I have a link that I want to test for robustness, for lack of a better word. I have code that pings the URL multiple times, sequentially:
# Testing for robustness
for i in range(100000):
    city = 'New York'
    city = '%20'.join(city.split(' '))
    res = requests.get(f'http://example.com/twofishes?query={city}')
    data = res.json()
    geo = data['interpretations'][0]['feature']['geometry']['center']
    print('pinging xtime: %s ' % str(i))
    print(geo['lat'], geo['lng'])
I want to take this code, but ping the link say, 10 or 12 times at once. I don't mind the sequential pinging, but that's not as efficient as pinging multiple times at once. I feel like this is a quick modification, where the for loop comes out and a PULL function goes in?
Here is an example program which should work for this task. Given that I do not want to be blacklisted, I have not actually tested the code to see if it works. Regardless, it should at least be in the ballpark of what you're looking for. If you actually want all of the threads to start executing at the same time, I would look into adding events. Hope this helps.
Code
import threading
import requests
import requests.exceptions as exceptions
def stress_test(s):
    for i in range(100000):
        try:
            city = 'New York'
            city = '%20'.join(city.split(' '))
            res = s.get(f'http://example.com/twofishes?query={city}')
            data = res.json()
            geo = data['interpretations'][0]['feature']['geometry']['center']
            print('pinging xtime: %s ' % str(i))
            print(geo['lat'], geo['lng'])
        except (exceptions.ConnectionError, exceptions.HTTPError, exceptions.Timeout):
            pass

if __name__ == '__main__':
    for i in range(1, 12):
        s = requests.session()
        t = threading.Thread(target=stress_test, args=(s,))
        t.start()
    for th in threading.enumerate():
        if th != threading.current_thread():
            th.join()
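If a pool is what the question had in mind, the standard library's concurrent.futures offers one. A minimal sketch, with run_concurrently as a made-up helper name and task standing in for whatever function performs a single request:

```python
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(task, inputs, max_workers=10):
    """Run task(x) for every x in inputs on a pool of threads.

    At most max_workers calls are in flight at once; results are returned
    in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, inputs))
```

For the stress test, task could be a function that takes the iteration number, issues one GET and returns the coordinates, and inputs could be range(100000). Note that an exception raised inside task is re-raised when the results are collected, so task should catch the errors it wants to ignore.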
I encounter an index out of range error when I try to get the number of contributors of a GitHub project in a loop. After some iterations (which are working perfectly) it just throws that exception. I have no clue why ...
for x in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
    contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
    print(contributors_number) # prints the correct number until the exception
Here's the exception.
----> 4 contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
IndexError: list index out of range
It seems likely that you're getting a 429 - Too Many Requests, since you're firing off requests one after the other.
You might want to modify your code as such:
import time
for index in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
    contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
    print(contributors_number)
    time.sleep(3) # Wait a bit before firing off another request
Better yet would be:
import time
for index in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    if r.status_code in [200]: # Check if the request was successful
        xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
        contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
        print(contributors_number)
    else:
        print("Failed fetching page, status code: " + str(r.status_code))
    time.sleep(3) # Wait a bit before firing off another request
Now this works perfectly for me while using the API. Probably the cleanest way of doing it.
import requests
import json
url = 'https://api.github.com/repos/valentinxxx/nginxconfig.io/commits?&per_page=100'
response = requests.get(url)
commits = json.loads(response.text)
commits_total = len(commits)
page_number = 1
while(len(commits) == 100):
    page_number += 1
    url = 'https://api.github.com/repos/valentinxxx/nginxconfig.io/commits?&per_page=100'+'&page='+str(page_number)
    response = requests.get(url)
    commits = json.loads(response.text)
    commits_total += len(commits)
GitHub is blocking your repeated requests. Do not scrape sites in quick succession; many website operators actively block clients that send too many requests. As a result, the content that is returned no longer matches your XPath query.
You should be using the REST API that GitHub provides to retrieve project stats like the number of contributors, and you should implement some kind of rate limiting. There is no need to retrieve the same number 100 times; contributor counts do not change that rapidly.
API responses include information on how many requests you can make in a time window, and you can use conditional requests to only incur rate limit costs when the data actually has changed:
import requests
import time
from urllib.parse import parse_qsl, urlparse
owner, repo = 'tipsy', 'profile-summary-for-github'
github_username = '....'
# token = '....' # optional Github basic auth token
stats = 'https://api.github.com/repos/{}/{}/contributors'
with requests.session() as sess:
    # GitHub requests you use your username or appname in the header
    sess.headers['User-Agent'] += ' - {}'.format(github_username)
    # Consider logging in! You'll get more quota
    # sess.auth = (github_username, token)

    # start with the first, move to the last when available, include anonymous
    last_page = stats.format(owner, repo) + '?per_page=100&page=1&anon=true'

    while True:
        r = sess.get(last_page)
        if r.status_code == requests.codes.not_found:
            print("No such repo")
            break
        if r.status_code == requests.codes.no_content:
            print("No contributors, repository is empty")
            break
        if r.status_code == requests.codes.accepted:
            print("Stats not yet ready, retrying")
        elif r.status_code == requests.codes.not_modified:
            print("Stats not changed")
        elif r.ok:
            # success! Check for a last page, get that instead of current
            # to get accurate count
            link_last = r.links.get('last', {}).get('url')
            if link_last and r.url != link_last:
                last_page = link_last
            else:
                # this is the last page, report on count
                params = dict(parse_qsl(urlparse(r.url).query))
                page_num = int(params.get('page', '1'))
                per_page = int(params.get('per_page', '100'))
                contributor_count = len(r.json()) + (per_page * (page_num - 1))
                print("Contributor count:", contributor_count)
                # only get us a fresh response next time
                sess.headers['If-None-Match'] = r.headers['ETag']

        # pace ourselves following the rate limit
        window_remaining = int(r.headers['X-RateLimit-Reset']) - time.time()
        rate_remaining = int(r.headers['X-RateLimit-Remaining'])
        # sleep long enough to honour the rate limit or at least 100 milliseconds
        time.sleep(max(window_remaining / rate_remaining, 0.1))
The above uses a requests session object to handle repeated headers and ensure that you get to reuse connections where possible.
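One edge case in the pacing step: the sleep is computed by dividing by X-RateLimit-Remaining, which would fail if that header ever reads 0. A possible guard, sketched as a standalone function (pacing_delay and its parameter names are made up here, not part of requests or the GitHub API):

```python
import time


def pacing_delay(reset_epoch, remaining, now=None, minimum=0.1):
    """Seconds to sleep so the remaining quota is spread over the rest of the window.

    reset_epoch: value of X-RateLimit-Reset (seconds since the epoch)
    remaining:   value of X-RateLimit-Remaining
    """
    now = time.time() if now is None else now
    window = max(reset_epoch - now, 0.0)
    if remaining <= 0:
        # Quota exhausted: wait until the window resets instead of dividing by zero.
        return max(window, minimum)
    return max(window / remaining, minimum)
```

In the loop this would be called with int(r.headers['X-RateLimit-Reset']) and int(r.headers['X-RateLimit-Remaining']), and the result passed to time.sleep().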
A good library such as github3.py (incidentally written by a requests core contributor) will take care of most of those details for you.
If you do want to persist on scraping the site directly, you do take a risk that the site operators block you altogether. Try to take some responsibility by not hammering the site continually.
That means that at the very least, you should honour the Retry-After header that GitHub gives you on 429:
if not r.ok:
    print("Received a response other than 200 OK:", r.status_code, r.reason)
    retry_after = r.headers.get('Retry-After')
    if retry_after is not None:
        print("Response included a Retry-After:", retry_after)
        time.sleep(int(retry_after))
else:
    pass  # parse OK response
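The HTTP specification also allows Retry-After to carry an HTTP date instead of a number of seconds, in which case int(retry_after) would raise. I have not checked which form GitHub sends, so the following is only a defensive sketch; retry_after_seconds is a made-up helper name.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Convert a Retry-After header value to seconds, or None if it can't be parsed."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)  # delay-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)  # treat naive dates as UTC
    now = now or datetime.now(timezone.utc)
    return max((when - now).total_seconds(), 0.0)
```

The result can be passed to time.sleep() whenever it is not None.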
here is my code.
import requests,time
proxies = {'http':'36.33.1.177:21219'}
url='http://218.94.78.61:8080/newPub/service/json/call?serviceName=sysBasicManage&methodName=queryOutputOtherPollutionList&paramsJson=%7B%22ticket%22:%22451a9846-058b-4944-86c6-fccafdb7d8d0%22,%22parameter%22:%7B%22monitorSiteType%22:%2202%22,%22enterpriseCode%22:%22320100000151%22,%22monitoringType%22:%222%22%7D%7D'
i = 0
a = requests.adapters.HTTPAdapter(max_retries=10)
s = requests.Session()
s.mount(url, a)
for x in xrange(1,1000):
    time.sleep(1)
    print x
    try:
        r= s.get(url,proxies=proxies)
        print r
    except Exception as ee:
        i = i + 1
        print ee
        print 'i=%s' % i
The proxy is a little unstable, so I set max_retries, but I still get an exception sometimes. Is there a way to wait a few seconds before each retry?
With max_retries set to a plain integer, requests retries immediately, with no wait between attempts. One option is to use an external library like backoff.
backoff provides a decorator and you wrap it around your function. Sample code:
import backoff
import requests

@backoff.on_exception(backoff.constant,
                      requests.exceptions.RequestException,
                      max_tries=10, interval=10)
def get_url(url):
    return requests.get(url)
The above code waits 10 seconds before the next retry whenever a requests.exceptions.RequestException is raised, and it tries up to 10 times, as specified by max_tries.
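Another option that may be worth trying: HTTPAdapter also accepts a urllib3 Retry object for max_retries, and Retry has a backoff_factor that inserts a growing pause between attempts. A minimal sketch, where make_retrying_session is a made-up helper and the retry count, backoff factor and status codes are assumptions to tune for your proxy:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_retrying_session(total=10, backoff_factor=1):
    """Build a Session whose adapters pause between retries."""
    retry = Retry(
        total=total,                                  # overall retry budget per request
        backoff_factor=backoff_factor,                # pause grows exponentially with each retry
        status_forcelist=[429, 500, 502, 503, 504],   # also retry on these response codes
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


session = make_retrying_session()
```

Requests made with session.get(url, proxies=proxies) then go through the retrying adapter. If the retry budget is used up, an exception is still raised, so the try/except around the call remains useful.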