I'm trying to use requests to download a list of URLs and catch the exception if a URL is bad. Here's my test code:
import requests
from requests.exceptions import ConnectionError
#goodurl
url = "http://www.google.com"
#badurl with good host
#url = "http://www.google.com/thereisnothing.jpg"
#url with bad host
#url = "http://somethingpotato.com"
print url
try:
    r = requests.get(url, allow_redirects=True)
    print "the url is good"
except ConnectionError,e:
    print e
    print "the url is bad"
The problem is that if I pass in url = "http://www.google.com", everything works as expected, since it is a good URL.
http://www.google.com
the url is good
But if I pass in url = "http://www.google.com/thereisnothing.jpg"
I still get:
http://www.google.com/thereisnothing.jpg
the url is good
So it's almost like it's not even looking at anything after the "/".
Just to see if the error checking is working at all, I passed in a bad hostname: #url = "http://somethingpotato.com"
Which kicked back the error message I expected:
http://somethingpotato.com
HTTPConnectionPool(host='somethingpotato.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1b6cd15b90>: Failed to establish a new connection: [Errno -2] Name or service not known',))
the url is bad
What am I missing to make requests catch a bad URL, not just a bad hostname?
Thanks
requests does not raise an exception for a 404 response. Instead you need to filter those out by checking whether the status is 'ok' (HTTP response 200):
import requests
from requests.exceptions import ConnectionError
#goodurl
url = "http://www.google.com/nothing"
#badurl with good host
#url = "http://www.google.com/thereisnothing.jpg"
#url with bad host
#url = "http://somethingpotato.com"
print url
try:
    r = requests.get(url, allow_redirects=True)
    if r.status_code == requests.codes.ok:
        print "the url is good"
    else:
        print "the url is bad"
except ConnectionError,e:
    print e
    print "the url is bad"
EDIT:
import requests
from requests.exceptions import ConnectionError
def printFailedUrl(url, response):
    if isinstance(response, ConnectionError):
        print "The url " + url + " failed to connect with the exception " + str(response)
    else:
        print "The url " + url + " produced the failed response code " + str(response.status_code)

def testUrl(url):
    try:
        r = requests.get(url, allow_redirects=True)
        if r.status_code == requests.codes.ok:
            print "the url is good"
        else:
            printFailedUrl(url, r)
    except ConnectionError,e:
        printFailedUrl(url, e)

def main():
    testUrl("http://www.google.com") # 'Good' url
    testUrl("http://www.google.com/doesnotexist.jpg") # 'Bad' url with 404 response
    testUrl("http://sdjgb") # 'Bad' url with an inaccessible host

main()
In this case one function can handle either an exception or a request response passed into it. This way you can respond differently when the URL returns some non-'good' (non-200) response versus an unusable URL that raises an exception. Hope this has the information you need.
What you want is to check r.status_code. Getting r.status_code on "http://www.google.com/thereisnothing.jpg" will give you 404. You can add a condition so that only URLs returning a 200 code are treated as "good".
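If you prefer an exception-based flow over checking r.status_code by hand, requests can also raise for bad status codes via response.raise_for_status(); a minimal sketch of that approach (reusing the test URL from the question):

import requests
from requests.exceptions import ConnectionError, HTTPError

url = "http://www.google.com/thereisnothing.jpg"
try:
    r = requests.get(url, allow_redirects=True)
    r.raise_for_status()  # raises HTTPError for any 4xx/5xx response
    print("the url is good")
except HTTPError as e:
    print(e)
    print("the url is bad")  # e.g. 404 Not Found
except ConnectionError as e:
    print(e)
    print("the url is bad")  # bad hostname, refused connection, etc.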
Related
I have a .txt file that contains a list of URLs. The structure of the URLs varies - some may begin with https, some with http, others with just www and others with just the domain name (stackoverflow.com). So an example of the .txt file content is:-
www.google.com
microsoft.com
https://www.yahoo.com
http://www.bing.com
What I want to do is parse through the list and check if the URLs are live. In order to do that, the structure of the URL must be correct, otherwise the request will fail. Here's my code so far:
import requests
with open('urls.txt', 'r') as f:
    urls = f.readlines()

for url in urls:
    url = url.replace('\n', '')

    if not url.startswith('http'): #This is to handle just domain names and those that begin with 'www'
        url = 'http://' + url

    if url.startswith('http:'):
        print("trying url {}".format(url))
        response = requests.get(url, timeout=10)
        status_code = response.status_code
        if status_code == 200:
            continue
        else:
            print("URL {} has a response code of {}".format(url, status_code))
            print("encountered error. Now trying with https")
            url = url.replace('http://', 'https://')
            print("Now replacing http with https and trying again")
            response = requests.get(url, timeout=10)
            status_code = response.status_code
            print("URL {} has a response code of {}".format(url, status_code))
    else:
        response = requests.get(url, timeout=10)
        status_code = response.status_code
        print("URL {} has a response code of {}".format(url, status_code))
I feel like I've overcomplicated this somewhat, and there must be an easier way of trying variants (i.e. the bare domain name, the domain with 'www' at the beginning, with 'http://' at the beginning, and with 'https://' at the beginning) until a site is identified as being live or not (i.e. all variants have been exhausted).
Any suggestions on my code or a better way to approach this? In essence, I want to handle the formatting of the URL to ensure that I then attempt to check the status of the URL.
Thanks in advance
This is a little too long for a comment, but yes, it can be simplified, starting with (and replacing) the startswith part:
if '//' not in url:
    url = 'http://' + url
response = requests.get(url, timeout=10)
etc.
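For completeness, a sketch of how the whole loop could look with that simplification plus an https fallback; check_url is a hypothetical helper name, and it assumes the same urls.txt layout as in the question:

import requests

def check_url(url, timeout=10):
    # Hypothetical helper: returns (final_url, status_code), or (url, None) if every attempt raised.
    if '//' not in url:
        url = 'http://' + url
    candidates = [url]
    if url.startswith('http://'):
        candidates.append('https://' + url[len('http://'):])
    last = (url, None)
    for candidate in candidates:
        try:
            status = requests.get(candidate, timeout=timeout).status_code
        except requests.RequestException:
            continue
        last = (candidate, status)
        if status == 200:
            break
    return last

with open('urls.txt') as f:
    for line in f:
        url = line.strip()
        if url:
            print("{} -> {}".format(url, check_url(url)))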
I am using Python Requests + the cfscrape module to bypass a Cloudflare-enabled website, but sometimes it does not validate the URL properly and returns a 403 status header.
Also, I am using a Tor proxy to find the blocked URLs.
import sys
import requests
import cfscrape
# Create the session and set the proxies.
proxies = {'http': 'socks5://127.0.0.1:9050',
           'https': 'socks5://127.0.0.1:9050'}

# Start Session
#s = requests.Session()
s = cfscrape.create_scraper() # https://github.com/Anorov/cloudflare-scrape/issues/103

# Proxy Connection
s.proxies = proxies

# Bypass Cloudflare Enabled website - https://support.cloudflare.com/hc/en-us/articles/203306930-Does-Cloudflare-block-Tor-
scraper = cfscrape.create_scraper(sess=s, delay=10)

try:
    # user input
    LINK = input('Enter a URL: ')
    response = scraper.get(LINK)
except requests.ConnectionError as e:
    print("OOPS!! Connection Error - Maybe the URL is Not Valid or Can't Bypass them")
except requests.Timeout as e:
    print("OOPS!! Timeout Error")
except requests.RequestException as e:
    print("OOPS!! General Error (Enter a Valid URL) - Add HTTP/HTTPS in front of the URL")
except (KeyboardInterrupt, SystemExit):
    print("Ok ok, quitting")
    sys.exit(1)
else:
    if response.history:
        print("URL was redirected")
        for resp in response.history:
            print(resp.status_code, resp.url)
        print("Final destination:")
        print(response.status_code, response.url)
    else:
        print(response.status_code, response.url + " - Current Live and Active URL")
I can't connect to the page. Here is my code and the error which I get:
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
import urllib
someurl = "https://www.genecards.org/cgi-bin/carddisp.pl?gene=MET"
req = Request(someurl)
try:
    response = urllib.request.urlopen(req)
except HTTPError as e:
    print('The server couldn\'t fulfill the request.')
    print('Error code: ', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
    print("Everything is fine")
Error code: 403
Some websites require a browser-like "User-Agent" header, others require specific cookies. In this case, I found out by trial and error that both are required. What you need to do is:
Send an initial request with a browser-like user-agent. This will fail with 403, but you will also obtain a valid cookie in the response.
Send a second request with the same user-agent and the cookie that you got before.
In code:
import urllib.request
from urllib.error import URLError
# This handler will store and send cookies for us.
handler = urllib.request.HTTPCookieProcessor()
opener = urllib.request.build_opener(handler)
# Browser-like user agent to make the website happy.
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://www.genecards.org/cgi-bin/carddisp.pl?gene=MET'
request = urllib.request.Request(url, headers=headers)
for i in range(2):
    try:
        response = opener.open(request)
    except URLError as exc:
        print(exc)
print(response)
# Output:
# HTTP Error 403: Forbidden (expected, first request always fails)
# <http.client.HTTPResponse object at 0x...> (correct 200 response)
Or, if you prefer, using requests:
import requests
session = requests.Session()
jar = requests.cookies.RequestsCookieJar()
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://www.genecards.org/cgi-bin/carddisp.pl?gene=MET'
for i in range(2):
    response = session.get(url, cookies=jar, headers=headers)
    print(response)
# Output:
# <Response [403]>
# <Response [200]>
You can use http.client. First, you need to open a connection with the server, and then make a GET request, like this:
import http.client

conn = http.client.HTTPConnection("genecards.org:80")
try:
    conn.request("GET", "/cgi-bin/carddisp.pl?gene=MET")
    response = conn.getresponse().read().decode("UTF-8")
except http.client.HTTPException as e:
    print('The server couldn\'t fulfill the request.')
    print('Error: ', e)
except OSError as e:
    print('We failed to reach a server.')
    print('Reason: ', e)
else:
    print("Everything is fine")
I am trying to get the following URL: ow dot ly/LApK30cbLKj. It works when I visit it in a browser, but I am getting an HTTP 404 error from my code:
my_url = 'ow' + '.ly/LApK30cbLKj' # SO won't accept an ow.ly url
headers = {'User-Agent' : user_agent }
request = urllib2.Request(my_url,"", headers)
response = None
try:
    response = urllib2.urlopen(request)
except urllib2.HTTPError, e:
    print '+++HTTPError = ' + str(e.code)
Is there something I can do to get this url with a http 200 status as I do when I visit in a browser?
Your example works for me, except you need to add http://
my_url = 'http://ow' + '.ly/LApK30cbLKj'
You need to define the URL's protocol. The thing is that when you visit the URL in a browser, the default protocol will be HTTP. However, urllib2 doesn't do that for you; you need to add http:// at the beginning of the URL, otherwise this error will be raised:
ValueError: unknown url type: ow.ly/LApK30cbLKj
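A minimal sketch of guarding against a missing scheme before calling urlopen (Python 2 / urllib2, as in the question):

import urllib2
import urlparse

my_url = 'ow' + '.ly/LApK30cbLKj'  # SO won't accept an ow.ly url
if not urlparse.urlparse(my_url).scheme:
    my_url = 'http://' + my_url
response = urllib2.urlopen(urllib2.Request(my_url))
print(response.getcode())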
As #enjoi mentioned, I used requests:
import requests
result = None
try:
    result = requests.get(agen_cont.source_url)
except requests.exceptions.Timeout as e:
    print '+++timeout exception: '
    print e
except requests.exceptions.TooManyRedirects as e:
    print '+++ too many redirects exception: '
    print e
except requests.exceptions.RequestException as e:
    print '+++ request exception: '
    print e
except Exception:
    import traceback
    print '+++generic exception: ' + traceback.format_exc()

if result:
    final_url = result.url
    print final_url
    response = result.content
I wanted to check if a certain website exists; this is what I'm doing:
user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent':user_agent }
link = "http://www.abc.com"
req = urllib2.Request(link, headers = headers)
page = urllib2.urlopen(req).read()  # ERROR 402 generated here!
If the page doesn't exist (error 402, or whatever other error), what can I do in the page = ... line to make sure that the page I'm reading does exist?
You can use a HEAD request instead of GET. It will only download the headers, not the content. Then you can check the response status from the headers.
For python 2.7.x, you can use httplib:
import httplib
c = httplib.HTTPConnection('www.example.com')
c.request("HEAD", '')
if c.getresponse().status == 200:
    print('web site exists')
or urllib2:
import urllib2
try:
    urllib2.urlopen('http://www.example.com/some_page')
except urllib2.HTTPError, e:
    print(e.code)
except urllib2.URLError, e:
    print(e.args)
or, for 2.7 and 3.x, you can install requests:
import requests
response = requests.get('http://www.example.com')
if response.status_code == 200:
    print('Web site exists')
else:
    print('Web site does not exist')
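Since the answer opens by recommending a HEAD request, the same existence check can be done with requests without fetching the body; a small sketch:

import requests

response = requests.head('http://www.example.com', allow_redirects=True)
if response.status_code == 200:
    print('Web site exists')
else:
    print('Web site does not exist')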
It's better to check that the status code is < 400, as was done here. Here is what the status codes mean (taken from Wikipedia):
1xx - informational
2xx - success
3xx - redirection
4xx - client error
5xx - server error
If you want to check whether a page exists and don't want to download the whole page, you should use a HEAD request:
import httplib2
h = httplib2.Http()
resp = h.request("http://www.google.com", 'HEAD')
assert int(resp[0]['status']) < 400
taken from this answer.
If you want to download the whole page, just make a normal request and check the status code. Example using requests:
import requests
response = requests.get('http://google.com')
assert response.status_code < 400
See also similar topics:
Python script to see if a web page exists without downloading the whole page?
Checking whether a link is dead or not using Python without downloading the webpage
How do you send a HEAD HTTP request in Python 2?
Making HTTP HEAD request with urllib2 from Python 2
from urllib2 import Request, urlopen, HTTPError, URLError
user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent':user_agent }
link = "http://www.abc.com/"
req = Request(link, headers = headers)
try:
    page_open = urlopen(req)
except HTTPError, e:
    print e.code
except URLError, e:
    print e.reason
else:
    print 'ok'
To answer the comment of unutbu:
Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.
Source
There is an excellent answer provided by @Adem Öztaş for use with httplib and urllib2. For requests, if the question is strictly about resource existence, then the answer can be improved upon in the case of large resources.
The previous answer for requests suggested something like the following:
import requests

def uri_exists_get(uri: str) -> bool:
    try:
        response = requests.get(uri)
        try:
            response.raise_for_status()
            return True
        except requests.exceptions.HTTPError:
            return False
    except requests.exceptions.ConnectionError:
        return False
requests.get attempts to pull the entire resource at once, so for large media files, the above snippet would attempt to pull the entire media into memory. To solve this, we can stream the response.
def uri_exists_stream(uri: str) -> bool:
    try:
        with requests.get(uri, stream=True) as response:
            try:
                response.raise_for_status()
                return True
            except requests.exceptions.HTTPError:
                return False
    except requests.exceptions.ConnectionError:
        return False
I ran the above snippets with timers attached against two web resources:
1) http://bbb3d.renderfarming.net/download.html, a very light html page
2) http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4, a decently sized video file
Timing results below:
uri_exists_get("http://bbb3d.renderfarming.net/download.html")
# Completed in: 0:00:00.611239
uri_exists_stream("http://bbb3d.renderfarming.net/download.html")
# Completed in: 0:00:00.000007
uri_exists_get("http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4")
# Completed in: 0:01:12.813224
uri_exists_stream("http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4")
# Completed in: 0:00:00.000007
As a last note: this function also works in the case that the resource host doesn't exist. For example "http://abcdefghblahblah.com/test.mp4" will return False.
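The timings above were presumably collected with a small harness along these lines (a sketch, not the author's actual measurement code; it assumes the two functions defined above):

from datetime import datetime

def timed(fn, uri):
    # Call fn(uri) once and report the elapsed wall-clock time.
    start = datetime.now()
    fn(uri)
    print('{}("{}")'.format(fn.__name__, uri))
    print("# Completed in: {}".format(datetime.now() - start))

timed(uri_exists_get, "http://bbb3d.renderfarming.net/download.html")
timed(uri_exists_stream, "http://bbb3d.renderfarming.net/download.html")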
I see many answers that use requests.get, but I suggest this solution using only requests.head, which is faster and also better for the web server since it doesn't need to send the body back as well.
import requests
def check_url_exists(url: str):
    """
    Checks if a url exists
    :param url: url to check
    :return: True if the url exists, False otherwise.
    """
    return requests.head(url, allow_redirects=True).status_code == 200
The meta-information contained in the HTTP headers in response to a HEAD request should be identical to the information sent in response to a GET request.
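Usage might look like the following; the second URL is just an illustrative path that presumably does not exist:

print(check_url_exists("http://www.example.com"))                     # True if the host answers 200
print(check_url_exists("http://www.example.com/thereisnothing.jpg"))  # False for a 404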
code:
a="http://www.example.com"
try:
print urllib.urlopen(a)
except:
print a+" site does not exist"
You can simply use the stream parameter so you don't download the full file. In the latest Python 3 you won't get urllib2, so it's best to use the proven requests method. This simple function will solve your problem:
import requests

def uri_exists(url):
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        return True
    else:
        return False
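A drop-in variant of the same idea that also releases the connection when done (a sketch, not the original answer's code):

import requests

def uri_exists(url):
    # stream=True defers the body download; the with-block closes the response afterwards.
    with requests.get(url, stream=True) as r:
        return r.status_code == 200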
import urllib.request
from urllib.error import HTTPError, URLError

def isok(mypath):
    try:
        thepage = urllib.request.urlopen(mypath)
    except HTTPError as e:
        return 0
    except URLError as e:
        return 0
    else:
        return 1
Try this one:
import urllib2
website='https://www.allyourmusic.com'
try:
    response = urllib2.urlopen(website)
    if response.code == 200:
        print("site exists!")
    else:
        print("site doesn't exist!")
except urllib2.HTTPError, e:
    print(e.code)
except urllib2.URLError, e:
    print(e.args)