I have a problem with a ConnectionError (max retries exceeded) that I'm trying to handle with try/except. I am using BeautifulSoup and requests.
After my script has been running for about 10 hours, it stops doing anything!
I am using try/except together with time.sleep(1) to catch this error:
for i, link in enumerate(links):
    sleep(1)
    try:
        req = requests.get(link, verify=False)
    except requests.exceptions.ConnectionError:
        logger.debug("Connection error! " + link)
        continue
    if req.status_code != 404:
        do_something()
    else:
        logger.debug("Website not exist! " + link)
        continue
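For reference, a common way to make a long-running loop like this more resilient is to set a request timeout and let urllib3 handle transient retries at the transport level. This is only a sketch under assumptions; it reuses the links and logger names from the question above, and the retry counts, back-off, and timeout values are illustrative, not taken from the question.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Illustrative values; tune them for your own workload.
retries = Retry(total=3, backoff_factor=1, status_forcelist=[502, 503, 504])
session = requests.Session()
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))

for link in links:
    try:
        # Without a timeout, a dead connection can hang the script indefinitely.
        req = session.get(link, verify=False, timeout=30)
    except requests.exceptions.RequestException:
        logger.debug("Request failed! " + link)
        continue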
I have a list of URLs that link to PDFs I am trying to download. Some of the URLs are no good, but my list is 41,000 entries long, so I'd like to catch the exceptions from requests.get in order to skip over the bad URLs and continue downloading the next ones on the list.
I've tried to use an except clause like the one below, and I've tried it in a few other forms and locations as well, but I cannot seem to get it to work.
try:
    r = requests.get(url, allow_redirects=True)
    r.raise_for_status()
    with open(('file' + str(u) + '.pdf'), "wb") as code:
        code.write(r.content)
    print("pdf")
except requests.exceptions.HTTPError as err:
    print(err)
    sys.exit(1)
I get this sort of readout when the error occurs:
requests.exceptions.SSLError: HTTPSConnectionPool(host=
as well as
(Caused by SSLError(CertificateError("hostname
Try this :)
# urls is the list of URLs; enumerate provides an index u for the file names
for u, url in enumerate(urls):
    try:
        r = requests.get(url, allow_redirects=True)
        r.raise_for_status()
        with open('file' + str(u) + '.pdf', "wb") as code:
            code.write(r.content)
        print("pdf")
    except requests.exceptions.HTTPError as err:
        print('[http_error]: {}'.format(err))
    except requests.exceptions.SSLError as bad_url:
        print('[bad_url]: {}'.format(bad_url))
    except Exception as e:
        print('[error]: {}'.format(e))
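Since the list has around 41,000 URLs, reusing a single session keeps connections pooled and is usually noticeably faster than calling requests.get once per URL. A minimal sketch along the same lines, reusing the urls list from above; the failed list and the timeout value are illustrative additions, not part of the original answer.
import requests

session = requests.Session()
failed = []  # collect bad URLs instead of only printing them

for u, url in enumerate(urls):
    try:
        r = session.get(url, allow_redirects=True, timeout=30)
        r.raise_for_status()
        with open('file' + str(u) + '.pdf', 'wb') as f:
            f.write(r.content)
    except requests.exceptions.RequestException as err:
        # HTTPError, SSLError, ConnectionError and Timeout all inherit from RequestException.
        failed.append((url, str(err)))

print('{} downloads failed'.format(len(failed)))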
I am using Python Requests + the cfscrape module to bypass Cloudflare-enabled websites, but sometimes it does not validate the URL properly and returns a 403 status.
I am also using a Tor proxy to find the blocked URLs.
import sys
import requests
import cfscrape

# Create the session and set the proxies.
proxies = {'http': 'socks5://127.0.0.1:9050',
           'https': 'socks5://127.0.0.1:9050'}

# Start session
#s = requests.Session()
s = cfscrape.create_scraper()  # https://github.com/Anorov/cloudflare-scrape/issues/103

# Proxy connection
s.proxies = proxies

# Bypass Cloudflare-enabled website - https://support.cloudflare.com/hc/en-us/articles/203306930-Does-Cloudflare-block-Tor-
scraper = cfscrape.create_scraper(sess=s, delay=10)

try:
    # User input
    LINK = input('Enter a URL: ')
    response = scraper.get(LINK)
except requests.ConnectionError as e:
    print("OOPS!! Connection Error - Maybe the URL is not valid or Cloudflare can't be bypassed")
except requests.Timeout as e:
    print("OOPS!! Timeout Error")
except requests.RequestException as e:
    print("OOPS!! General Error (Enter a valid URL) - Add http/https in front of the URL")
except (KeyboardInterrupt, SystemExit):
    print("Ok ok, quitting")
    sys.exit(1)
else:
    if response.history:
        print("URL was redirected")
        for resp in response.history:
            print(resp.status_code, resp.url)
        print("Final destination:")
        print(response.status_code, response.url)
    else:
        print(response.status_code, response.url + " - Current live and active URL")
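One detail that may be relevant here: with a socks5:// proxy, DNS resolution happens locally rather than through Tor. requests (via PySocks, i.e. requests[socks]) accepts a socks5h:// scheme that resolves hostnames through the proxy instead. Whether this helps with the 403 depends on whether Cloudflare is blocking the exit node itself; the sketch below only shows the proxy change and a status check, with a placeholder URL.
import cfscrape

# 'socks5h' (note the h) resolves DNS through the Tor proxy instead of locally.
proxies = {'http': 'socks5h://127.0.0.1:9050',
           'https': 'socks5h://127.0.0.1:9050'}

s = cfscrape.create_scraper()
s.proxies = proxies
scraper = cfscrape.create_scraper(sess=s, delay=10)

response = scraper.get('https://example.com/')  # placeholder URL
if response.status_code == 403:
    # Cloudflare sometimes blocks Tor exit nodes outright; a new Tor circuit may help.
    print("403 - the Tor exit node itself may be blocked by Cloudflare")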
I am trying to get the following URL: ow dot ly/LApK30cbLKj. It works when I visit it in a browser, but I am getting an HTTP 404 error:
my_url = 'ow' + '.ly/LApK30cbLKj'  # SO won't accept an ow.ly url
headers = {'User-Agent': user_agent}
request = urllib2.Request(my_url, "", headers)
response = None
try:
    response = urllib2.urlopen(request)
except urllib2.HTTPError, e:
    print '+++HTTPError = ' + str(e.code)
Is there something I can do to get this URL with an HTTP 200 status, as I do when I visit it in a browser?
Your example works for me, except you need to add http://
my_url = 'http://ow' + '.ly/LApK30cbLKj'
You need to specify the URL's protocol. When you visit the URL in a browser, the browser defaults to HTTP. However, urllib2 doesn't do that for you; you need to add http:// at the beginning of the URL, otherwise this error is raised:
ValueError: unknown url type: ow.ly/LApK30cbLKj
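For illustration, a small guard that prepends a scheme only when one is missing could look like the following sketch; normalize_url is a made-up helper name, and the example sticks to Python 2 to match the urllib2 code above.
from urlparse import urlparse  # Python 2; in Python 3 this lives in urllib.parse

def normalize_url(url):
    # Prepend http:// only if the URL has no scheme such as http:// or https://.
    if not urlparse(url).scheme:
        return 'http://' + url
    return url

my_url = normalize_url('ow' + '.ly/LApK30cbLKj')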
As @enjoi mentioned, I used requests instead:
import requests

result = None
try:
    result = requests.get(agen_cont.source_url)
except requests.exceptions.Timeout as e:
    print '+++ timeout exception: '
    print e
except requests.exceptions.TooManyRedirects as e:
    print '+++ too many redirects exception: '
    print e
except requests.exceptions.RequestException as e:
    print '+++ request exception: '
    print e
except Exception:
    import traceback
    print '+++ generic exception: ' + traceback.format_exc()

if result:
    final_url = result.url
    print final_url
    response = result.content
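As a follow-up, requests also exposes the redirect chain, which shows how the shortener resolved to the final URL that result.url reports. A minimal sketch in the same Python 2 style, purely illustrative:
if result is not None:
    for hop in result.history:  # each intermediate redirect response, in order
        print hop.status_code, hop.url
    print 'final:', result.status_code, result.url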
Question: I have 3 URLs - testurl1, testurl2 and testurl3. I'd like to try testurl1 first; if I get a 404 error, try testurl2; if that also gets a 404 error, try testurl3. How do I achieve this? So far I've tried the code below, but it works for only two URLs. How do I add support for the third URL?
from urllib2 import Request, urlopen
from urllib2 import URLError, HTTPError

def checkfiles():
    req = Request('http://testurl1')
    try:
        response = urlopen(req)
        url1 = 'http://testurl1'
    except (HTTPError, URLError):
        url1 = 'http://testurl2'

    print url1
    finalURL = 'wget ' + url1 + '/testfile.tgz'
    print finalURL

checkfiles()
Another job for a plain old for loop:
for url in testurl1, testurl2, testurl3:
    req = Request(url)
    try:
        response = urlopen(req)
    except HTTPError as err:
        if err.code == 404:
            continue
        raise
    else:
        # do what you want with the successful response here (or outside the loop)
        break
else:
    # They ALL errored out with HTTPError code 404. Handle this?
    raise err
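To tie this back to the original checkfiles() function, one possible wiring is sketched below; find_working_url is a made-up helper name, and the URLs are placeholders exactly as in the question.
from urllib2 import Request, urlopen, HTTPError

def find_working_url(urls):
    for url in urls:
        try:
            urlopen(Request(url))
        except HTTPError as err:
            if err.code == 404:
                continue  # try the next candidate
            raise
        else:
            return url  # first URL that answered without a 404
    return None  # every candidate returned 404

url1 = find_working_url(['http://testurl1', 'http://testurl2', 'http://testurl3'])
if url1 is not None:
    print 'wget ' + url1 + '/testfile.tgz'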
Hmmm maybe something like this?
from urllib2 import Request, urlopen
from urllib2 import URLError, HTTPError

def checkfiles():
    url1 = 'http://testurl1'
    try:
        response = urlopen(Request(url1))
    except (HTTPError, URLError):
        url1 = 'http://testurl2'
        try:
            response = urlopen(Request(url1))
        except (HTTPError, URLError):
            url1 = 'http://testurl3'

    print url1
    finalURL = 'wget ' + url1 + '/testfile.tgz'
    print finalURL

checkfiles()
I need to access a URL, and if it gives me an HTTPError I need to wait five minutes and try again (this works for this particular website). It looks like the code doesn't recognize the except clause and still gives me an HTTPError instantly (without waiting the 5 minutes).
import urllib2, datetime, re, os, requests
from time import sleep
import time
from dateutil.relativedelta import relativedelta
from requests.exceptions import HTTPError, ConnectionError
from bs4 import BeautifulSoup

try:
    resp = requests.get(url)
except HTTPError:
    while True:
        print "Wait."
        time.sleep(305)
        resp = requests.get(url)
except ConnectionError:
    while True:
        print "Wait."
        time.sleep(305)
        resp = requests.get(url)
You put resp = requests.get(url) into a try/except block, but then you call the same thing again inside the except clause. If that second call throws an error, there is nothing left to catch it, so the error propagates.
while True:
    try:
        resp = requests.get(url)
    except HTTPError:
        print "Wait."
        time.sleep(305)
        continue  # skip the code after this block and retry
    except ConnectionError:
        print "Wait."
        time.sleep(305)
        continue  # skip the code after this block and retry
    else:
        break
Basically, until your URL responds correctly, it will run the same request again and again.
Inside your except blocks, you have this:
resp = requests.get(url)
This isn't protected by a try block, so when it fails the error propagates uncaught. You have to rearrange your code a little:
while True:
    try:
        resp = requests.get(url)
    except HTTPError:
        print "Wait."
        time.sleep(305)
    except ConnectionError:
        print "Wait."
        time.sleep(305)
    else:
        break
It's now an infinite loop. When the connection fails, the loop just continues. When it succeeds, the loop exits.
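One possible refinement, offered only as a sketch and not part of either answer: bound the number of retries so the loop cannot spin forever if the site stays down. The max_attempts limit is a made-up illustration, and raise_for_status() is added because a plain requests.get only raises HTTPError when you call it explicitly.
import time
import requests
from requests.exceptions import HTTPError, ConnectionError

max_attempts = 5  # illustrative limit
resp = None
for attempt in range(max_attempts):
    try:
        resp = requests.get(url)
        resp.raise_for_status()  # turn 4xx/5xx responses into HTTPError as well
        break
    except (HTTPError, ConnectionError):
        print "Wait."
        time.sleep(305)
else:
    print "Giving up after %d attempts." % max_attempts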