Does requests make the retry method wait some seconds before executing? - Python

Here is my code:
import requests, time

proxies = {'http': '36.33.1.177:21219'}
url = 'http://218.94.78.61:8080/newPub/service/json/call?serviceName=sysBasicManage&methodName=queryOutputOtherPollutionList&paramsJson=%7B%22ticket%22:%22451a9846-058b-4944-86c6-fccafdb7d8d0%22,%22parameter%22:%7B%22monitorSiteType%22:%2202%22,%22enterpriseCode%22:%22320100000151%22,%22monitoringType%22:%222%22%7D%7D'
i = 0
a = requests.adapters.HTTPAdapter(max_retries=10)
s = requests.Session()
s.mount(url, a)
for x in xrange(1, 1000):
    time.sleep(1)
    print x
    try:
        r = s.get(url, proxies=proxies)
        print r
    except Exception as ee:
        i = i + 1
        print ee
        print 'i=%s' % i
The proxy is a little unstable, so I set up max_retries, but it still raises an exception sometimes. Is there a way to make every retry wait some seconds before executing?

With the requests library alone it's not possible. However, you can use an external library like backoff.
backoff provides a decorator that you wrap around your function. Sample code:
import backoff
import requests

@backoff.on_exception(backoff.constant,
                      requests.exceptions.RequestException,
                      max_tries=10, interval=10)
def get_url(url):
    return requests.get(url)
The above code waits 10 seconds before the next retry whenever a requests.exceptions.RequestException is raised, and it tries up to 10 times, as specified by max_tries.
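For instance, the question's loop could then call the decorated function instead of s.get. This is only a minimal sketch: it assumes the url variable from the question and the get_url defined above (the proxies argument would need to be added to the requests.get call inside get_url if the proxy is still required):
import time

for x in range(1, 1000):
    time.sleep(1)
    try:
        r = get_url(url)   # retries up to 10 times, waiting 10 seconds between attempts
        print(r.status_code)
    except requests.exceptions.RequestException as ee:
        # raised only after all 10 attempts have failed
        print(ee)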

Related

Loop through a list and call an API in Python

I have a list of author names (more than 1k) and my Google API quota limit is 20k. I want to pass the author names to the API to get book information. When I tested my code I got a "429 Client Error: Too Many Requests for url..." error. How can I slow down my requests without stopping the application? (I'm using Python in Google Colab.)
author_List = ["J. K. Rowling", "mark twain", "Emily Dickinson"]
connGoogleAPI(author_List)

def connGoogleAPI(booksData):
    key = "**************************"
    books_list = []
    col = ['Title', 'Authors', 'published Date', 'Description', 'ISBN']
    books_list.append(col)
    res = ""
    err = None
    with requests.Session() as session:
        #err= ""
        for Authors in booksData:
            params = {"q": Authors, "key": key, "maxResults": 1}
            delays = 65  # approximately 1 minute total delay time for any given author
            while True:
                try:
                    #do something
                except Exception as e:
                    if err.status_code == 429:
                        #print("******")
                        if delays <= 0:
                            raise(e)  # we've spent too long delaying
                        time.sleep(1)
                        delays -= 1
                    else:
                        print("-----=")
                        raise(e)  # some other status code
            books_list.append(lookup(res, Authors))
    return books_list
You can import time and then add:
time.sleep(1)
at the end of your for loop to pause for a second between each iteration.
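A minimal sketch of where that sleep would go, assuming the Google Books volumes endpoint that the answer further down also uses; the key is a placeholder, and real code would hand the response to the question's lookup() helper instead of just printing the status:
import time
import requests

author_list = ["J. K. Rowling", "Mark Twain", "Emily Dickinson"]
key = "YOUR_API_KEY"   # placeholder

with requests.Session() as session:
    for author in author_list:
        params = {"q": author, "key": key, "maxResults": 1}
        r = session.get("https://www.googleapis.com/books/v1/volumes", params=params)
        print(r.status_code)   # real code would pass r to the question's lookup() helper
        time.sleep(1)          # pause for a second between authors to spread out the requests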
You could slow down your for loop like this. First, you need to import time:
import time

delay = 2
for author in author_List:
    # newList and searchData stand in for your own result list and lookup call
    newList.append(searchData(author))
    time.sleep(delay)
delay is the number of seconds each iteration of the loop is delayed (here it would be 2 seconds).
You probably don't want unconditional delays, which will slow down your processing unnecessarily. On the other hand, if you start getting HTTP 429 you can't be sure when, or even if, the server will allow you to continue. So you need a strategy that only introduces delays when/if required but also doesn't get into an infinite loop. Consider this:
import requests
import time

listofauthors = ['Mark Twain', 'Dan Brown', 'William Shakespeare']
with requests.Session() as session:
    for author in listofauthors:
        params = {'q': author}
        delays = 60  # approximately 1 minute total delay time for any given author
        while True:
            try:
                r = session.get('https://www.googleapis.com/books/v1/volumes', params=params)
                r.raise_for_status()
                print(r.json())
                break  # all good :-)
            except Exception as e:
                if r.status_code == 429:
                    if delays <= 0:
                        raise(e)  # we've spent too long delaying
                    time.sleep(1)
                    delays -= 1
                else:
                    raise(e)  # some other status code

Accelerate 2 loops with regex to find email addresses on websites

I need help finding email addresses on websites. After some research I found a solution, but it is very slow; I have a lot of data (more than 90,000 URLs) and my code never finishes.
Do you know any tips to optimize/accelerate my code?
This is my list of URLs:
http://etsgaidonsarl.site-solocal.com/
http://fr-fr.facebook.com/people/
http://ipm-mondia.com/
http://lfgenieclimatique.fr/
http://vpcinstallation.site-solocal.com
http://www.cavifroid.fr/
http://www.clim-monnier.com/
http://www.climacool.net/
I use 2 loops. The first finds all the pages of a website, because the email address is not always on the first page.
In the second loop, I crawl each page to find the email address. The code:
EMAIL_REGEX = r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-zA-Z](?:[a-z0-9-]*[a-zA-Z])?\.)+[a-zA-Z](?:[a-z0-9-]*[a-zA-Z])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-zA-Z]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"""
I think my regex is too long; could that be a problem?
from requests_html import HTMLSession
import re
import pandas as pd

session = HTMLSession()
mailing = []
for index, i in enumerate(link):  # link is the list of the URLs
    try:
        r = session.get(i)
        site = r.html.absolute_links
        linkslist = list(r.html.absolute_links)
    except:
        linkslist = list(i)
    for j in linkslist:
        try:
            r1 = session.get(j)
            for re_match in re.finditer(EMAIL_REGEX, r1.html.raw_html.decode()):
                mail = re_match.group()
                liste = [index, mail, j]
                mailing.append(liste)
        except:
            pass
print(mailing)
df = pd.DataFrame(mailing, columns=['index1', 'mail', 'lien'])
Thanks for your help.
I think multi-threading should do the job. I don't know exactly what your regex does, but assuming it works and is useful, the multi-threaded version should look like the following. I tested the code and it works.
from threading import Thread, Lock
from requests_html import HTMLSession
import re

lock = Lock()
link = ["http://etsgaidonsarl.site-solocal.com/",
        "http://fr-fr.facebook.com/people/",
        "http://ipm-mondia.com/",
        "http://lfgenieclimatique.fr/",
        "http://vpcinstallation.site-solocal.com",
        "http://www.cavifroid.fr/",
        "http://www.clim-monnier.com/",
        "http://www.climacool.net/"]
linklist = []
mailing = []
main_threads = []
minor_threads = []
EMAIL_REGEX = r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-zA-Z](?:[a-z0-9-]*[a-zA-Z])?\.)+[a-zA-Z](?:[a-z0-9-]*[a-zA-Z])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-zA-Z]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"""

def links_scraper(single_url):
    # collect every absolute link on the page; on failure, fall back to the page itself
    try:
        session = HTMLSession()
        r = session.get(single_url)
        site = r.html.absolute_links
        the_list = list(r.html.absolute_links)
        linklist.extend(list(zip([single_url for _ in range(len(the_list))], the_list)))
    except Exception as e:
        # print("Exception:", e)
        linklist.append((single_url, single_url))

def mail_scrapper(main_url, single_link):
    # scan one sub-page for email addresses
    try:
        session = HTMLSession()
        r1 = session.get(single_link)
        for re_match in re.finditer(EMAIL_REGEX, r1.html.raw_html.decode()):
            mail = re_match.group()
            liste = [link.index(main_url), mail, single_link]
            mailing.append(liste)
    except Exception as e:
        # print(f"Exception: {e}")
        pass

def main():
    for l in link:
        t = Thread(target=links_scraper, args=(l,))
        t.start()
        main_threads.append(t)
    while len(main_threads) > 0:
        try:
            with lock:
                current_link = linklist.pop(0)
            minor_thread = Thread(target=mail_scrapper, args=(current_link[0], current_link[1]))
            minor_threads.append(minor_thread)
            minor_thread.start()
        except IndexError:
            pass
        for t in main_threads[:]:  # iterate over a copy so we can safely remove finished threads
            if not t.is_alive():
                main_threads.pop(main_threads.index(t))
    for t in minor_threads:
        t.join()

main()
print("Mailing:", mailing)

Index out of range when sending requests in a loop

I encounter an index out of range error when I try to get the number of contributors of a GitHub project in a loop. After some iterations (which work perfectly) it just throws that exception. I have no clue why...
import requests
from lxml import html

for x in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
    contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
    print(contributors_number)  # prints the correct number until the exception
Here's the exception.
----> 4 contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
IndexError: list index out of range
It seems likely that you're getting a 429 - Too Many Requests since you're firing requests one after the other.
You might want to modify your code as such:
import time

for index in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
    contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
    print(contributors_number)
    time.sleep(3)  # Wait a bit before firing off another request
Better yet would be:
import time

for index in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    if r.status_code in [200]:  # Check if the request was successful
        xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
        contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
        print(contributors_number)
    else:
        print("Failed fetching page, status code: " + str(r.status_code))
    time.sleep(3)  # Wait a bit before firing off another request
Now this works perfectly for me while using the API. Probably the cleanest way of doing it.
import requests
import json

url = 'https://api.github.com/repos/valentinxxx/nginxconfig.io/commits?&per_page=100'
response = requests.get(url)
commits = json.loads(response.text)
commits_total = len(commits)
page_number = 1
while len(commits) == 100:
    page_number += 1
    url = 'https://api.github.com/repos/valentinxxx/nginxconfig.io/commits?&per_page=100' + '&page=' + str(page_number)
    response = requests.get(url)
    commits = json.loads(response.text)
    commits_total += len(commits)
GitHub is blocking your repeated requests. Do not scrape sites in quick succession; many website operators actively block too many requests. As a result, the content that is returned no longer matches your XPath query.
You should be using the REST API that GitHub provides to retrieve project stats like the number of contributors, and you should implement some kind of rate limiting. There is no need to retrieve the same number 100 times, contributor counts do not change that rapidly.
API responses include information on how many requests you can make in a time window, and you can use conditional requests to only incur rate limit costs when the data actually has changed:
import requests
import time
from urllib.parse import parse_qsl, urlparse

owner, repo = 'tipsy', 'profile-summary-for-github'
github_username = '....'
# token = '....'  # optional Github basic auth token

stats = 'https://api.github.com/repos/{}/{}/contributors'
with requests.session() as sess:
    # GitHub requests you use your username or appname in the header
    sess.headers['User-Agent'] += ' - {}'.format(github_username)
    # Consider logging in! You'll get more quota
    # sess.auth = (github_username, token)

    # start with the first, move to the last when available, include anonymous
    last_page = stats.format(owner, repo) + '?per_page=100&page=1&anon=true'
    while True:
        r = sess.get(last_page)
        if r.status_code == requests.codes.not_found:
            print("No such repo")
            break
        if r.status_code == requests.codes.no_content:
            print("No contributors, repository is empty")
            break

        if r.status_code == requests.codes.accepted:
            print("Stats not yet ready, retrying")
        elif r.status_code == requests.codes.not_modified:
            print("Stats not changed")
        elif r.ok:
            # success! Check for a last page, get that instead of current
            # to get accurate count
            link_last = r.links.get('last', {}).get('url')
            if link_last and r.url != link_last:
                last_page = link_last
            else:
                # this is the last page, report on count
                params = dict(parse_qsl(urlparse(r.url).query))
                page_num = int(params.get('page', '1'))
                per_page = int(params.get('per_page', '100'))
                contributor_count = len(r.json()) + (per_page * (page_num - 1))
                print("Contributor count:", contributor_count)

        # only get us a fresh response next time
        sess.headers['If-None-Match'] = r.headers['ETag']

        # pace ourselves following the rate limit
        window_remaining = int(r.headers['X-RateLimit-Reset']) - time.time()
        rate_remaining = int(r.headers['X-RateLimit-Remaining'])
        # sleep long enough to honour the rate limit or at least 100 milliseconds
        time.sleep(max(window_remaining / rate_remaining, 0.1))
The above uses a requests session object to handle repeated headers and ensure that you get to reuse connections where possible.
A good library such as github3.py (incidentally written by a requests core contributor) will take care of most of those details for you.
If you do want to persist in scraping the site directly, you take the risk that the site operators block you altogether. Try to take some responsibility by not hammering the site continually.
That means that at the very least, you should honour the Retry-After header that GitHub gives you on 429:
if not r.ok:
    print("Received a response other than 200 OK:", r.status_code, r.reason)
    retry_after = r.headers.get('Retry-After')
    if retry_after is not None:
        print("Response included a Retry-After:", retry_after)
        time.sleep(int(retry_after))
else:
    # parse OK response
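A minimal sketch of how that check might sit inside a polling loop, reusing the contributors endpoint from the code above (real code would also cap the number of attempts rather than looping forever):
import time
import requests

url = 'https://api.github.com/repos/tipsy/profile-summary-for-github/contributors'
with requests.Session() as sess:
    while True:
        r = sess.get(url, params={'per_page': 100, 'anon': 'true'})
        if not r.ok:
            print("Received a response other than 200 OK:", r.status_code, r.reason)
            retry_after = r.headers.get('Retry-After')
            # honour the server's hint when present, otherwise back off a fixed amount
            time.sleep(int(retry_after) if retry_after is not None else 10)
            continue
        print("Contributors on this page:", len(r.json()))
        break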

Change proxy after x many loops

I have a list of, let's say, 100 URLs.
I want to change IP after every 10 URLs.
Let's say I have my own proxies that I'd like to rotate through after each 10 URLs.
How would I use those proxies in my requests?
list = [100URLS items]
proxies = ['ip:port', 'ip:port']
for urls in list:
    try:
        # request 10 URLS here, then it might throw an error
    except:
        # After it throws an error, I want to pick the next proxy from the list
        # and retry the same request with the new proxy using requests.
#!/usr/bin/python
import requests

class Proxer:
    proxy = ''
    list = ['http://proxy1', 'http://proxy2', 'http://pox']
    proxy_count = 0
    page_count = 0

    def proxy_changer(self):
        # move on to the next proxy in the list
        try:
            self.proxy = self.list[self.proxy_count]
            self.proxy_count = self.proxy_count + 1
            return self.proxy
        except IndexError:
            print("you are out of proxies")

    def open_site(self, url):
        self.page_count = self.page_count + 1
        if self.page_count % 10 == 1:  # pages 1, 11, 21, ...: rotate to a new proxy
            self.proxy_changer()
        return requests.get(url, proxies={'http': self.proxy})

Proxer().open_site('http://google.com')
Here is the full code. It changes the proxy after every 10 pages opened with open_site('http://google.com'). Once you run out of proxies, the exception is caught and a message is printed.
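For the question's scenario the class would then be driven by a plain loop over the URL list, something like this minimal sketch (the repeated google.com entries are just a stand-in for the real 100 URLs):
urls = ['http://google.com'] * 100   # stand-in for the real list of 100 URLs
client = Proxer()
for url in urls:
    client.open_site(url)            # the proxy is rotated automatically every 10 pages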

Iterate the list in Python

I have a loop inside a loop and I'm using try/except. Once an error occurs the try/except works fine, but the loop continues to the next value. What I need is for the loop to restart from the value where it broke instead of continuing to the next one. How can I do that with my code? (In other languages, e.g. C++, it would be i--.)
r = urllib2.urlopen(url)
encoding = r.info().getparam('charset')
html = r.read()
c = td.find('a')['href']
urls = []
urls.append(c)
#collecting urls from first page then from those url collecting further info in below loop
for abc in urls:
    try:
        r = urllib2.urlopen(abc)
        encoding = r.info().getparam('charset')
        html = r.read()
    except Exception as e:
        last_error = e
        time.sleep(retry_timeout)  # here is the problem: once I get an error, the loop switches to the next value
I need a more pythonic way to do this.
Waiting for a reply. Thank you.
Unfortunately, there is no simple way to go back with an iterator in Python:
http://docs.python.org/2/library/stdtypes.html
You may be interested in this Stack Overflow thread:
Making a python iterator go backwards?
For your particular case, I would use a simple while loop:
url = []
i = 0
while i < len(url):  # url is the list containing all the urls; it keeps growing as it is updated every day
    data = url[i]
    try:
        # getting data from there
        i += 1
    except:
        # handle the error; i is not incremented, so the loop restarts from the same position
        pass
The problem with the way you want to handle this is that you risk ending up in an infinite loop. For example, if a link is broken, r = urllib2.urlopen(abc) will always raise an exception and you will stay at the same position forever. You should consider doing something like this:
r = urllib2.urlopen(url)
encoding = r.info().getparam('charset')
html = r.read()
c = td.find('a')['href']
urls = []
urls.append(c)
#collecting urls from first page then from those url collecting further info in below loop

NUM_TRY = 3
for abc in urls:
    for _ in range(NUM_TRY):
        try:
            r = urllib2.urlopen(abc)
            encoding = r.info().getparam('charset')
            html = r.read()
            break  # if we arrive at this line, no error occurred, so we don't need to retry again;
                   # this is why we break the inner loop
        except Exception as e:
            last_error = e
            time.sleep(retry_timeout)  # wait before retrying the same url
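The same "try N times, then give up" idea can also be written with Python's for/else, where the else branch runs only when the inner loop was never broken out of. A minimal, self-contained sketch (the placeholder urls and retry_timeout stand in for the question's own values):
import time
import urllib2

urls = ['http://example.com/a', 'http://example.com/b']   # placeholder list
retry_timeout = 5
NUM_TRY = 3

for abc in urls:
    for _ in range(NUM_TRY):
        try:
            r = urllib2.urlopen(abc)
            html = r.read()
            break                      # success: stop retrying this url
        except Exception as e:
            last_error = e
            time.sleep(retry_timeout)  # wait before retrying the same url
    else:
        # runs only if the inner loop never hit break, i.e. every attempt failed
        print('giving up on %s: %s' % (abc, last_error))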
